### Lesson 6. The lookalike problem (Positive-Unlabeled Learning)

### References:

1. https://arxiv.org/pdf/1811.04820.pdf
2. https://habr.com/ru/company/JetBrains-education/blog/512032/
3. https://en.wikipedia.org/wiki/Bootstrap_aggregating
4. https://www.cs.uic.edu/~liub/publications/EMNLP-2010-no-negative.pdf
5. https://towardsdatascience.com/lookalikes-finding-needles-in-a-haystack-683bae8fdfff

### Homework

1. Take any binary classification dataset (one of the benchmark datasets from https://archive.ics.uci.edu/ml/datasets.php will do)
2. Do feature engineering
3. Train any classifier (whichever you like)
4. Split the dataset into two sets: P (positives) and U (unlabeled). Take only a part of the positive (class 1) examples into P, not all of them
5. Apply random negative sampling to build a classifier under the new conditions
6. Compare the quality with the solution from step 3 (build a report: a table of metrics)
7. Experiment with the share of P in step 4 (how the model quality changes as P shrinks or grows)
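
The P/U split and the random negative sampling step can be sketched on synthetic data as follows. This is only one common reading of random negative sampling (draw as many points from U as there are in P and treat them as negatives); the synthetic dataset, the `p_frac` value and all variable names are assumptions made for illustration, and the notebook's own implementation may differ in details.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for a fully labeled binary dataset (assumption)
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Keep only a fraction of the true positives as the labeled set P;
# everything else (hidden positives and all negatives) becomes U
p_frac = 0.3  # hypothetical share of positives that stay labeled
pos_idx = np.flatnonzero(y == 1)
p_idx = rng.choice(pos_idx, size=int(p_frac * len(pos_idx)), replace=False)
u_idx = np.setdiff1d(np.arange(len(y)), p_idx)

# Random negative sampling: draw |P| points from U and treat them as negatives
neg_idx = rng.choice(u_idx, size=len(p_idx), replace=False)

X_pu = np.vstack([X[p_idx], X[neg_idx]])
y_pu = np.concatenate([np.ones(len(p_idx), dtype=int),
                       np.zeros(len(neg_idx), dtype=int)])

clf = LogisticRegression(max_iter=1000).fit(X_pu, y_pu)

# Score the unlabeled points that were not used for training
rest_idx = np.setdiff1d(u_idx, neg_idx)
scores = clf.predict_proba(X[rest_idx])[:, 1]
```

Because some of the sampled "negatives" are in fact hidden positives, the training labels are noisy by construction; the smaller P is, the more hidden positives remain in U.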

#1. Take any binary classification dataset (one of the benchmark datasets from https://archive.ics.uci.edu/ml/datasets.php will do)

Dataset: https://archive.ics.uci.edu/dataset/373/drug+consumption+quantified. Drug consumption (see the description below)

In [None]:
from google.colab import files # drug+consumption+quantified.zip
files.upload()

!unzip drug+consumption+quantified.zip
!rm drug+consumption+quantified.zip

Saving drug+consumption+quantified.zip to drug+consumption+quantified.zip
Archive:  drug+consumption+quantified.zip
 extracting: drug_consumption.data   


In [None]:
import pandas as pd
import numpy as np

columns = ['Age', 'Gender', 'Education', 'Country', 'Ethnicity', 'Nscore',
           'Escore', 'Oscore', 'Ascore', 'Cscore', 'Impulsive', 'SS',
           'Alcohol', 'Amphet']
data = pd.read_csv('drug_consumption.data', header=None)
df = data.loc[:, 1:len(columns)].copy()  # drop the ID column (column 0); copy to avoid modifying a slice
df.columns = columns

# drop Alcohol and keep only Amphet as the target (otherwise the task turns into anomaly detection)
df.drop(columns='Alcohol', inplace=True)
df

Unnamed: 0,Age,Gender,Education,Country,Ethnicity,Nscore,Escore,Oscore,Ascore,Cscore,Impulsive,SS,Amphet
0,0.49788,0.48246,-0.05921,0.96082,0.12600,0.31287,-0.57545,-0.58331,-0.91699,-0.00665,-0.21712,-1.18084,CL2
1,-0.07854,-0.48246,1.98437,0.96082,-0.31685,-0.67825,1.93886,1.43533,0.76096,-0.14277,-0.71126,-0.21575,CL2
2,0.49788,-0.48246,-0.05921,0.96082,-0.31685,-0.46725,0.80523,-0.84732,-1.62090,-1.01450,-1.37983,0.40148,CL0
3,-0.95197,0.48246,1.16365,0.96082,-0.31685,-0.14882,-0.80615,-0.01928,0.59042,0.58489,-1.37983,-1.18084,CL0
4,0.49788,0.48246,1.98437,0.96082,-0.31685,0.73545,-1.63340,-0.45174,-0.30172,1.30612,-0.21712,-0.21575,CL1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1880,-0.95197,0.48246,-0.61113,-0.57009,-0.31685,-1.19430,1.74091,1.88511,0.76096,-1.13788,0.88113,1.92173,CL0
1881,-0.95197,-0.48246,-0.61113,-0.57009,-0.31685,-0.24649,1.74091,0.58331,0.76096,-1.51840,0.88113,0.76540,CL0
1882,-0.07854,0.48246,0.45468,-0.57009,-0.31685,1.13281,-1.37639,-1.27553,-1.77200,-1.38502,0.52975,-0.52593,CL6
1883,-0.95197,0.48246,-0.61113,-0.57009,-0.31685,0.91093,-1.92173,0.29338,-1.62090,-2.57309,1.29221,1.22470,CL0


In [None]:
# make sure there are no missing values

df.isnull().any().any()

False

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1885 entries, 0 to 1884
Data columns (total 13 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Age        1885 non-null   float64
 1   Gender     1885 non-null   float64
 2   Education  1885 non-null   float64
 3   Country    1885 non-null   float64
 4   Ethnicity  1885 non-null   float64
 5   Nscore     1885 non-null   float64
 6   Escore     1885 non-null   float64
 7   Oscore     1885 non-null   float64
 8   Ascore     1885 non-null   float64
 9   Cscore     1885 non-null   float64
 10  Impulsive  1885 non-null   float64
 11  SS         1885 non-null   float64
 12  Amphet     1885 non-null   object 
dtypes: float64(12), object(1)
memory usage: 191.6+ KB


In [None]:
df.describe()

Unnamed: 0,Age,Gender,Education,Country,Ethnicity,Nscore,Escore,Oscore,Ascore,Cscore,Impulsive,SS
count,1885.0,1885.0,1885.0,1885.0,1885.0,1885.0,1885.0,1885.0,1885.0,1885.0,1885.0,1885.0
mean,0.03461,-0.000256,-0.003806,0.355542,-0.309577,4.7e-05,-0.000163,-0.000534,-0.000245,-0.000386,0.007216,-0.003292
std,0.87836,0.482588,0.950078,0.700335,0.166226,0.998106,0.997448,0.996229,0.99744,0.997523,0.954435,0.963701
min,-0.95197,-0.48246,-2.43591,-0.57009,-1.10702,-3.46436,-3.27393,-3.27393,-3.46436,-3.46436,-2.55524,-2.07848
25%,-0.95197,-0.48246,-0.61113,-0.57009,-0.31685,-0.67825,-0.69509,-0.71727,-0.60633,-0.65253,-0.71126,-0.52593
50%,-0.07854,-0.48246,-0.05921,0.96082,-0.31685,0.04257,0.00332,-0.01928,-0.01729,-0.00665,-0.21712,0.07987
75%,0.49788,0.48246,0.45468,0.96082,-0.31685,0.62967,0.63779,0.7233,0.76096,0.58489,0.52975,0.7654
max,2.59171,0.48246,1.98437,0.96082,1.90725,3.27393,3.27393,2.90161,3.46436,3.46436,2.90161,1.92173


Dataset description

1. ID is number of record in original database. Cannot be related to participant. It can be used for reference only.

2. Age (Real) is age of participant and has one of the values:


     Value    Meaning Cases Fraction
     -0.95197 18-24   643   34.11%
     -0.07854 25-34   481   25.52%
      0.49788 35-44   356   18.89%
      1.09449 45-54   294   15.60%
      1.82213 55-64    93    4.93%
      2.59171 65+      18    0.95%

     Descriptive statistics
     Min      Max     Mean    Std.dev.
     -0.95197 2.59171 0.03461 0.87813


3. Gender (Real) is gender of participant:


     Value    Meaning Cases Fraction
      0.48246 Female  942   49.97%
     -0.48246 Male    943   50.03%

     Descriptive statistics
     Min      Max     Mean     Std.dev.
     -0.48246 0.48246 -0.00026 0.48246


4. Education (Real) is level of education of participant and has one of the values:


     Value    Meaning                                             Cases Fraction
     -2.43591 Left school before 16 years                           28    1.49%
     -1.73790 Left school at 16 years                               99    5.25%
     -1.43719 Left school at 17 years                               30    1.59%
     -1.22751 Left school at 18 years                              100    5.31%
     -0.61113 Some college or university, no certificate or degree 506   26.84%
     -0.05921 Professional certificate/ diploma                    270   14.32%
      0.45468 University degree                                    480   25.46%
      1.16365 Masters degree                                       283   15.01%
      1.98437 Doctorate degree                                      89    4.72%
    
     Descriptive statistics
     Min      Max     Mean     Std.dev.
     -2.43591 1.98437 -0.00379 0.95004


5. Country (Real) is country of current residence of participant and has one of the values:


     Value    Meaning             Cases Fraction
     -0.09765 Australia             54   2.86%
      0.24923 Canada                87   4.62%
     -0.46841 New Zealand            5   0.27%
     -0.28519 Other                118   6.26%
      0.21128 Republic of Ireland   20   1.06%
      0.96082 UK                  1044  55.38%
     -0.57009 USA                  557  29.55%

     Descriptive statistics
     Min      Max     Mean    Std.dev.
     -0.57009 0.96082 0.35554 0.70015


6. Ethnicity (Real) is ethnicity of participant and has one of the values:

     
     Value    Meaning           Cases Fraction
     -0.50212 Asian               26   1.38%
     -1.10702 Black               33   1.75%
      1.90725 Mixed-Black/Asian    3   0.16%
      0.12600 Mixed-White/Asian   20   1.06%
     -0.22166 Mixed-White/Black   20   1.06%
      0.11440 Other               63   3.34%
     -0.31685 White             1720  91.25%

     Descriptive statistics
     Min      Max     Mean     Std.dev.
     -1.10702 1.90725 -0.30958 0.16618


7. Nscore (Real) is NEO-FFI-R Neuroticism. Possible values are presented in table below:

     
     Nscore Cases Value         Nscore Cases Value         Nscore Cases Value
     12      1    -3.46436      29     60    -0.67825      46     67    1.02119
     13      1    -3.15735      30     61    -0.58016      47     27    1.13281
     14      7    -2.75696      31     87    -0.46725      48     49    1.23461
     15      4    -2.52197      32     78    -0.34799      49     40    1.37297
     16      3    -2.42317      33     68    -0.24649      50     24    1.49158
     17      4    -2.34360      34     76    -0.14882      51     27    1.60383
     18     10    -2.21844      35     69    -0.05188      52     17    1.72012
     19     16    -2.05048      36     73     0.04257      53     20    1.83990
     20     24    -1.86962      37     67     0.13606      54     15    1.98437
     21     31    -1.69163      38     63     0.22393      55     11    2.12700
     22     26    -1.55078      39     66     0.31287      56     10    2.28554
     23     29    -1.43907      40     80     0.41667      57      6    2.46262
     24     35    -1.32828      41     61     0.52135      58      3    2.61139
     25     56    -1.19430      42     77     0.62967      59      5    2.82196
     26     57    -1.05308      43     49     0.73545      60      2    3.27393
     27     65    -0.92104      44     51     0.82562
     28     70    -0.79151      45     37     0.91093

     Descriptive statistics
     Min      Max     Mean    Std.dev.
     -3.46436 3.27393 0.00004 0.99808


8. Escore (Real) is NEO-FFI-R Extraversion. Possible values are presented in table below:

     
     Escore Cases Value         Escore Cases Value         Escore Cases Value
     16      2    -3.27393      31      55   -1.23177      45     91    0.80523
     18      1    -3.00537      32      52   -1.09207      46     69    0.96248
     19      6    -2.72827      33      77   -0.94779      47     64    1.11406
     20      3    -2.53830      34      68   -0.80615      48     62    1.28610
     21      3    -2.44904      35      58   -0.69509      49     37    1.45421
     22      8    -2.32338      36      89   -0.57545      50     25    1.58487
     23      5    -2.21069      37      90   -0.43999      51     34    1.74091
     24      9    -2.11437      38     106   -0.30033      52     21    1.93886
     25      4    -2.03972      39     107   -0.15487      53     15    2.12700
     26     21    -1.92173      40     130    0.00332      54     10    2.32338
     27     23    -1.76250      41     116    0.16767      55      9    2.57309
     28     23    -1.63340      42     109    0.32197      56      2    2.85950
     29     32    -1.50796      43     105    0.47617      58      1    3.00537
     30     38    -1.37639      44     103    0.63779      59      2    3.27393

     Descriptive statistics
     Min      Max     Mean     Std.dev.
     -3.27393 3.27393 -0.00016 0.99745


9. Oscore (Real) is NEO-FFI-R Openness to experience. Possible values are presented in table below:

     
     Oscore Cases Value         Oscore Cases Value         Oscore Cases Value
     24      2    -3.27393      38      64   -1.11902      50     83    0.58331
     26      4    -2.85950      39      60   -0.97631      51     87    0.72330
     28      4    -2.63199      40      68   -0.84732      52     87    0.88309
     29     11    -2.39883      41      76   -0.71727      53     81    1.06238
     30      9    -2.21069      42      87   -0.58331      54     57    1.24033
     31      9    -2.09015      43      86   -0.45174      55     63    1.43533
     32     13    -1.97495      44     101   -0.31776      56     38    1.65653
     33     23    -1.82919      45     103   -0.17779      57     34    1.88511
     34     25    -1.68062      46     134   -0.01928      58     19    2.15324
     35     26    -1.55521      47     107    0.14143      59     13    2.44904
     36     39    -1.42424      48     116    0.29338      60      7    2.90161
     37     51    -1.27553      49      98    0.44585
     
     Descriptive statistics
     Min      Max     Mean     Std.dev.
     -3.27393 2.90161 -0.00053 0.99623


10. Ascore (Real) is NEO-FFI-R Agreeableness. Possible values are presented in table below:

     
     Ascore Cases Value         Ascore Cases Value         Ascore Cases Value
     12      1    -3.46436      34      42   -1.34289      48     104   0.76096
     16      1    -3.15735      35      45   -1.21213      49      85   0.94156
     18      1    -3.00537      36      62   -1.07533      50      68   1.11406
     23      1    -2.90161      37      83   -0.91699      51      58   1.2861
     24      2    -2.78793      38      82   -0.76096      52      39   1.45039
     25      1    -2.70172      39     102   -0.60633      53      36   1.61108
     26      7    -2.53830      40      98   -0.45321      54      36   1.81866
     27      7    -2.35413      41     114   -0.30172      55      16   2.03972
     28      8    -2.21844      42     101   -0.15487      56      14   2.23427
     29     13    -2.07848      43     105   -0.01729      57       8   2.46262
     30     18    -1.92595      44     118    0.13136      58       7   2.75696
     31     24    -1.77200      45     112    0.28783      59       1   3.15735
     32     30    -1.62090      46     100    0.43852      60       1   3.46436
     33     34    -1.47955      47     100    0.59042                  
    
     Descriptive statistics
     Min      Max     Mean     Std.dev.
     -3.46436 3.46436 -0.00024 0.99744


11. Cscore (Real) is NEO-FFI-R Conscientiousness. Possible values are presented in table below:

     
     Cscore Cases Value         Cscore Cases Value         Cscore Cases Value
     17      1    -3.46436      32       39  -1.25773      46     113   0.58489
     19      1    -3.15735      33       49  -1.13788      47      95   0.7583
     20      3    -2.90161      34       55  -1.01450      48      95   0.93949
     21      2    -2.72827      35       55  -0.89891      49      76   1.13407
     22      5    -2.57309      36       69  -0.78155      50      47   1.30612
     23      5    -2.42317      37       81  -0.65253      51      43   1.46191
     24      6    -2.30408      38       77  -0.52745      52      34   1.63088
     25      9    -2.18109      39       87  -0.40581      53      28   1.81175
     26     13    -2.04506      40       97  -0.27607      54      27   2.04506
     27     13    -1.92173      41       99  -0.14277      55      13   2.33337
     28     25    -1.78169      42      105  -0.00665      56       8   2.63199
     29     24    -1.64101      43       90   0.12331      57       3   3.00537
     30     29    -1.51840      44      111   0.25953      59       1   3.46436
     31     41    -1.38502      45      111   0.41594                  
   
     Descriptive statistics
     Min      Max     Mean     Std.dev.
     -3.46436 3.46436 -0.00039 0.99752


12. Impulsive (Real) is impulsiveness measured by BIS-11. Possible values are presented in table below:

     
     Impulsiveness Cases Fraction
     -2.55524       20    1.06%
     -1.37983      276   14.64%
     -0.71126      307   16.29%
     -0.21712      355   18.83%
      0.19268      257   13.63%
      0.52975      216   11.46%
      0.88113      195   10.34%
      1.29221      148    7.85%
      1.86203      104    5.52%
      2.90161        7    0.37%
  
     Descriptive statistics
     Min      Max     Mean    Std.dev.
     -2.55524 2.90161 0.00721 0.95446


13. SS (Real) is sensation seeking measured by ImpSS. Possible values are presented in table below:

     
     SS       Cases Fraction
     -2.07848  71    3.77%
     -1.54858  87    4.62%
     -1.18084 132    7.00%
     -0.84637 169    8.97%
     -0.52593 211   11.19%
     -0.21575 223   11.83%
      0.07987 219   11.62%
      0.40148 249   13.21%
      0.76540 211   11.19%
      1.22470 210   11.14%
      1.92173 103    5.46%
    
     Descriptive statistics
     Min      Max     Mean     Std.dev.
     -2.07848 1.92173 -0.00329 0.96370

Amphet is class of amphetamines consumption. It is output attribute with following distribution of classes:


     Value Class                  Amphet          
                                  Cases Fraction
     CL0   Never Used              976   51.78%
     CL1   Used over a Decade Ago  230   12.20%
     CL2   Used in Last Decade     243   12.89%
     CL3   Used in Last Year       198   10.50%
     CL4   Used in Last Month      75    3.98%
     CL5   Used in Last Week       61    3.24%
     CL6   Used in Last Day       102    5.41%
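
The notebook later collapses these seven classes into a binary target (CL0 and CL1 become 0, CL2 to CL6 become 1). As a quick arithmetic check, the resulting class balance can be derived directly from the table above; the variable names are illustrative only.

```python
# Case counts copied from the Amphet table above
cases = {'CL0': 976, 'CL1': 230, 'CL2': 243, 'CL3': 198,
         'CL4': 75, 'CL5': 61, 'CL6': 102}

negatives = cases['CL0'] + cases['CL1']            # never used / used over a decade ago
positives = sum(n for cl, n in cases.items()
                if cl not in ('CL0', 'CL1'))       # used within the last decade
total = negatives + positives
pos_share = positives / total

print(negatives, positives, total, round(pos_share, 4))  # 1206 679 1885 0.3602
```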

#2. Do feature engineering

Re-encoding the quantified features back into mixed types (category labels and raw integer scores), for the sake of a clean experiment

In [None]:
import re

# target (Amphet)
# Reduce the task to binary classification: CL0 and CL1 are recoded as 0
# (does not use amphetamines), CL2-CL6 as 1 (uses)
df['Amphet'] = (~df['Amphet'].isin(['CL0', 'CL1'])).astype(int)

# dictionaries for re-encoding the remaining features, except Impulsive and SS

age_dict = {
    -0.95197: '18-24',
    -0.07854: '25-34',
    0.49788: '35-44',
    1.09449: '45-54',
    1.82213: '55-64',
    2.59171: '65+'
}
gender_dict = {
    0.48246: 'Female',
    -0.48246: 'Male'
}
ed_dict = {
    -2.43591: 'Left school before 16 years',
    -1.73790: 'Left school at 16 years',
    -1.43719: 'Left school at 17 years',
    -1.22751: 'Left school at 18 years',
    -0.61113: 'Some college or university, no certificate or degree',
    -0.05921: 'Professional certificate/ diploma',
    0.45468: 'University degree',
    1.16365: 'Masters degree',
    1.98437: 'Doctorate degree'
}
country_dict = {
    -0.09765: 'Australia',
    0.24923: 'Canada',
    -0.46841: 'New Zealand',
    -0.28519: 'Other',
    0.21128: 'Republic of Ireland',
    0.96082: 'UK',
    -0.57009: 'USA'
}
ethnicity_dict = {
    -0.50212: 'Asian',
     -1.10702: 'Black',
      1.90725: 'Mixed-Black/Asian',
      0.12600: 'Mixed-White/Asian',
     -0.22166: 'Mixed-White/Black',
      0.11440: 'Other',
     -0.31685: 'White'
}
n_str = re.sub('\\n', ' ', """
12      1    -3.46436      29     60    -0.67825      46     67    1.02119
13      1    -3.15735      30     61    -0.58016      47     27    1.13281
14      7    -2.75696      31     87    -0.46725      48     49    1.23461
15      4    -2.52197      32     78    -0.34799      49     40    1.37297
16      3    -2.42317      33     68    -0.24649      50     24    1.49158
17      4    -2.34360      34     76    -0.14882      51     27    1.60383
18     10    -2.21844      35     69    -0.05188      52     17    1.72012
19     16    -2.05048      36     73     0.04257      53     20    1.83990
20     24    -1.86962      37     67     0.13606      54     15    1.98437
21     31    -1.69163      38     63     0.22393      55     11    2.12700
22     26    -1.55078      39     66     0.31287      56     10    2.28554
23     29    -1.43907      40     80     0.41667      57      6    2.46262
24     35    -1.32828      41     61     0.52135      58      3    2.61139
25     56    -1.19430      42     77     0.62967      59      5    2.82196
26     57    -1.05308      43     49     0.73545      60      2    3.27393
27     65    -0.92104      44     51     0.82562
28     70    -0.79151      45     37     0.91093
""").split()
n_keys = [float(n_str[i]) for i in list(range(2, len(n_str), 3))]
n_values = [int(n_str[i]) for i in list(range(0, len(n_str), 3))]
n_dict = dict(zip(n_keys, n_values))

e_str = re.sub('\\n', ' ', """
16      2    -3.27393      31      55   -1.23177      45     91    0.80523
18      1    -3.00537      32      52   -1.09207      46     69    0.96248
19      6    -2.72827      33      77   -0.94779      47     64    1.11406
20      3    -2.53830      34      68   -0.80615      48     62    1.28610
21      3    -2.44904      35      58   -0.69509      49     37    1.45421
22      8    -2.32338      36      89   -0.57545      50     25    1.58487
23      5    -2.21069      37      90   -0.43999      51     34    1.74091
24      9    -2.11437      38     106   -0.30033      52     21    1.93886
25      4    -2.03972      39     107   -0.15487      53     15    2.12700
26     21    -1.92173      40     130    0.00332      54     10    2.32338
27     23    -1.76250      41     116    0.16767      55      9    2.57309
28     23    -1.63340      42     109    0.32197      56      2    2.85950
29     32    -1.50796      43     105    0.47617      58      1    3.00537
30     38    -1.37639      44     103    0.63779      59      2    3.27393
""").split()
e_keys = [float(e_str[i]) for i in list(range(2, len(e_str), 3))]
e_values = [int(e_str[i]) for i in list(range(0, len(e_str), 3))]
e_dict = dict(zip(e_keys, e_values))

o_str = re.sub('\\n', ' ', """
24      2    -3.27393      38      64   -1.11902      50     83    0.58331
26      4    -2.85950      39      60   -0.97631      51     87    0.72330
28      4    -2.63199      40      68   -0.84732      52     87    0.88309
29     11    -2.39883      41      76   -0.71727      53     81    1.06238
30      9    -2.21069      42      87   -0.58331      54     57    1.24033
31      9    -2.09015      43      86   -0.45174      55     63    1.43533
32     13    -1.97495      44     101   -0.31776      56     38    1.65653
33     23    -1.82919      45     103   -0.17779      57     34    1.88511
34     25    -1.68062      46     134   -0.01928      58     19    2.15324
35     26    -1.55521      47     107    0.14143      59     13    2.44904
36     39    -1.42424      48     116    0.29338      60      7    2.90161
37     51    -1.27553      49      98    0.44585
""").split()
o_keys = [float(o_str[i]) for i in list(range(2, len(o_str), 3))]
o_values = [int(o_str[i]) for i in list(range(0, len(o_str), 3))]
o_dict = dict(zip(o_keys, o_values))

a_str = re.sub('\\n', ' ', """
12      1    -3.46436      34      42   -1.34289      48     104   0.76096
16      1    -3.15735      35      45   -1.21213      49      85   0.94156
18      1    -3.00537      36      62   -1.07533      50      68   1.11406
23      1    -2.90161      37      83   -0.91699      51      58   1.2861
24      2    -2.78793      38      82   -0.76096      52      39   1.45039
25      1    -2.70172      39     102   -0.60633      53      36   1.61108
26      7    -2.53830      40      98   -0.45321      54      36   1.81866
27      7    -2.35413      41     114   -0.30172      55      16   2.03972
28      8    -2.21844      42     101   -0.15487      56      14   2.23427
29     13    -2.07848      43     105   -0.01729      57       8   2.46262
30     18    -1.92595      44     118    0.13136      58       7   2.75696
31     24    -1.77200      45     112    0.28783      59       1   3.15735
32     30    -1.62090      46     100    0.43852      60       1   3.46436
33     34    -1.47955      47     100    0.59042
""").split()
a_keys = [float(a_str[i]) for i in list(range(2, len(a_str), 3))]
a_values = [int(a_str[i]) for i in list(range(0, len(a_str), 3))]
a_dict = dict(zip(a_keys, a_values))

c_str = re.sub('\\n', ' ', """
17      1    -3.46436      32       39  -1.25773      46     113   0.58489
19      1    -3.15735      33       49  -1.13788      47      95   0.7583
20      3    -2.90161      34       55  -1.01450      48      95   0.93949
21      2    -2.72827      35       55  -0.89891      49      76   1.13407
22      5    -2.57309      36       69  -0.78155      50      47   1.30612
23      5    -2.42317      37       81  -0.65253      51      43   1.46191
24      6    -2.30408      38       77  -0.52745      52      34   1.63088
25      9    -2.18109      39       87  -0.40581      53      28   1.81175
26     13    -2.04506      40       97  -0.27607      54      27   2.04506
27     13    -1.92173      41       99  -0.14277      55      13   2.33337
28     25    -1.78169      42      105  -0.00665      56       8   2.63199
29     24    -1.64101      43       90   0.12331      57       3   3.00537
30     29    -1.51840      44      111   0.25953      59       1   3.46436
31     41    -1.38502      45      111   0.41594
""").split()
c_keys = [float(c_str[i]) for i in list(range(2, len(c_str), 3))]
c_values = [int(c_str[i]) for i in list(range(0, len(c_str), 3))]
c_dict = dict(zip(c_keys, c_values))

columns_to_reencode = list(df.columns[:-3])
dicts = age_dict, gender_dict, ed_dict, country_dict, ethnicity_dict, n_dict, e_dict, o_dict, a_dict, c_dict

df = df.replace(dict(zip(columns_to_reencode, dicts)))
for col in columns_to_reencode[-5:]:
  df[col] = df[col].astype(int)
df

Unnamed: 0,Age,Gender,Education,Country,Ethnicity,Nscore,Escore,Oscore,Ascore,Cscore,Impulsive,SS,Amphet
0,35-44,Female,Professional certificate/ diploma,UK,Mixed-White/Asian,39,36,42,37,42,-0.21712,-1.18084,1
1,25-34,Male,Doctorate degree,UK,White,29,52,55,48,41,-0.71126,-0.21575,1
2,35-44,Male,Professional certificate/ diploma,UK,White,31,45,40,32,34,-1.37983,0.40148,0
3,18-24,Female,Masters degree,UK,White,34,34,46,47,46,-1.37983,-1.18084,0
4,35-44,Female,Doctorate degree,UK,White,43,28,43,41,50,-0.21712,-0.21575,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1880,18-24,Female,"Some college or university, no certificate or ...",USA,White,25,51,57,48,33,0.88113,1.92173,0
1881,18-24,Male,"Some college or university, no certificate or ...",USA,White,33,51,50,48,30,0.88113,0.76540,0
1882,25-34,Female,University degree,USA,White,47,30,37,31,31,0.52975,-0.52593,1
1883,18-24,Female,"Some college or university, no certificate or ...",USA,White,45,26,48,32,22,1.29221,1.22470,0
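
The re-encoding above rests on two small techniques: stride slicing over the whitespace-split score tables (every third token is a raw score, a case count, or a quantified value), and `DataFrame.replace` with a nested dictionary of the form `{column: {old: new}}`. A minimal sketch on toy data, with hypothetical names `table`, `tokens`, `toy` and `mapping`:

```python
import pandas as pd

# Two rows of a score table laid out as (score, cases, value) triples
table = """
12 1 -3.46436 29 60 -0.67825
13 1 -3.15735 30 61 -0.58016
"""
tokens = table.split()
# tokens[2::3] are the quantified values, tokens[0::3] the raw integer scores
score_by_value = dict(zip(map(float, tokens[2::3]), map(int, tokens[0::3])))

# Nested dict: the outer key selects the column, the inner dict maps old -> new
toy = pd.DataFrame({'Gender': [0.48246, -0.48246],
                    'Age': [-0.95197, 0.49788]})
mapping = {'Gender': {0.48246: 'Female', -0.48246: 'Male'},
           'Age': {-0.95197: '18-24', 0.49788: '35-44'}}
decoded = toy.replace(mapping)
```

Matching floats by exact equality works here only because the values come straight from the same file with the same printed precision; for computed floats a tolerance-based lookup would be safer.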


In [None]:
df['Amphet'].value_counts()

0    1206
1     679
Name: Amphet, dtype: int64

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1885 entries, 0 to 1884
Data columns (total 13 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Age        1885 non-null   object 
 1   Gender     1885 non-null   object 
 2   Education  1885 non-null   object 
 3   Country    1885 non-null   object 
 4   Ethnicity  1885 non-null   object 
 5   Nscore     1885 non-null   int64  
 6   Escore     1885 non-null   int64  
 7   Oscore     1885 non-null   int64  
 8   Ascore     1885 non-null   int64  
 9   Cscore     1885 non-null   int64  
 10  Impulsive  1885 non-null   float64
 11  SS         1885 non-null   float64
 12  Amphet     1885 non-null   int64  
dtypes: float64(2), int64(6), object(5)
memory usage: 191.6+ KB


In [None]:
df.describe()

Unnamed: 0,Nscore,Escore,Oscore,Ascore,Cscore,Impulsive,SS,Amphet
count,1885.0,1885.0,1885.0,1885.0,1885.0,1885.0,1885.0,1885.0
mean,35.921485,39.576127,45.762334,42.866313,41.437135,0.007216,-0.003292,0.360212
std,9.135869,6.771769,6.579641,6.438106,6.966625,0.954435,0.963701,0.480189
min,12.0,16.0,24.0,12.0,17.0,-2.55524,-2.07848,0.0
25%,29.0,35.0,41.0,39.0,37.0,-0.71126,-0.52593,0.0
50%,36.0,40.0,46.0,43.0,42.0,-0.21712,0.07987,0.0
75%,42.0,44.0,51.0,48.0,46.0,0.52975,0.7654,1.0
max,60.0,59.0,60.0,60.0,59.0,2.90161,1.92173,1.0


Importing dependencies

In [None]:
from sklearn.pipeline import Pipeline, make_pipeline, FeatureUnion
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
#from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier as rfc, GradientBoostingClassifier as gbc
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV

from sklearn.metrics import precision_recall_curve, roc_auc_score, f1_score, precision_score, recall_score, confusion_matrix

import itertools, pickle

import matplotlib.pyplot as plt

%matplotlib inline

In [None]:
X, y = df.drop(columns='Amphet'), df.Amphet

# set aside a holdout sample to guard against leakage
X_base, X_holdout, y_base, y_holdout = train_test_split(X, y,
                                                        test_size=0.1,
                                                        shuffle=True,
                                                        stratify=y,
                                                        random_state=0)

# build the base training and test (validation) samples
X_train, X_test, y_train, y_test = train_test_split(X_base, y_base,
                                                    test_size=0.2,
                                                    shuffle=True,
                                                    stratify=y_base,
                                                    random_state=0)
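
The sizes produced by this two-stage stratified split can be checked on labels alone. The sketch below rebuilds a label vector with the class counts reported by `value_counts()` (1206 zeros, 679 ones) and repeats the same two calls on row indices; the names `idx_base`, `idx_train` and so on are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Label vector with the same class counts as the binarized Amphet target
y = np.array([0] * 1206 + [1] * 679)
idx = np.arange(len(y))

# Stage 1: 10% holdout; stage 2: 20% of the remainder as the test sample
idx_base, idx_holdout = train_test_split(idx, test_size=0.1, shuffle=True,
                                         stratify=y, random_state=0)
idx_train, idx_test = train_test_split(idx_base, test_size=0.2, shuffle=True,
                                       stratify=y[idx_base], random_state=0)

print(len(idx_train), len(idx_test), len(idx_holdout))  # 1356 340 189
```

With a float `test_size`, scikit-learn rounds the test part up, which is why the holdout has 189 rows rather than 188.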

In [None]:
cat_columns = columns[:5]
cont_columns = ['Impulsive', 'SS']

Feature preprocessing is standard and happens inside the pipeline (one-hot encoding for categorical features, standardization for continuous ones)

In [None]:
class FeatureSelector(BaseEstimator, TransformerMixin):
    def __init__(self, column):
        self.column = column

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return X[self.column]

class NumberSelector(BaseEstimator, TransformerMixin):
    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[[self.key]]

class OHEEncoder(BaseEstimator, TransformerMixin):
    def __init__(self, key):
        self.key = key
        self.columns = []

    def fit(self, X, y=None):
        self.columns = [col for col in pd.get_dummies(X, prefix=self.key).columns]
        return self

    def transform(self, X):
        X = pd.get_dummies(X, prefix=self.key)
        test_columns = [col for col in X.columns]
        for col_ in self.columns:
          if col_ not in test_columns:
            X[col_] = 0
        return X[self.columns]


final_transformers = list()

for cat_col in cat_columns:
    cat_transformer = Pipeline([
        ('selector', FeatureSelector(column=cat_col)),
        ('ohe', OHEEncoder(key=cat_col))
        ])
    final_transformers.append((cat_col, cat_transformer))

for cont_col in cont_columns:
    cont_transformer = Pipeline([
        ('selector', NumberSelector(key=cont_col)),
        ('standard', StandardScaler())
        ])
    final_transformers.append((cont_col, cont_transformer))

feats = FeatureUnion(final_transformers)
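
For comparison, the same preprocessing could be expressed with scikit-learn's built-in `ColumnTransformer` and `OneHotEncoder` instead of the custom selector and encoder classes. This is an alternative sketch and not the notebook's method; the toy frames and the name `feats_alt` are assumptions. `handle_unknown='ignore'` plays the role of the custom `OHEEncoder` logic: categories unseen during `fit` are encoded as all zeros.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

cat_cols = ['Age', 'Gender']       # toy subset of the categorical columns
cont_cols = ['Impulsive', 'SS']

feats_alt = ColumnTransformer(
    [('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols),
     ('num', StandardScaler(), cont_cols)],
    sparse_threshold=0,            # always return a dense array
)

train = pd.DataFrame({'Age': ['18-24', '25-34', '18-24'],
                      'Gender': ['Male', 'Female', 'Male'],
                      'Impulsive': [-0.7, 0.2, 0.9],
                      'SS': [0.4, -1.2, 0.1]})
test = pd.DataFrame({'Age': ['65+'],          # category not seen during fit
                     'Gender': ['Female'],
                     'Impulsive': [0.0],
                     'SS': [0.0]})

Xt_train = feats_alt.fit_transform(train)
Xt_test = feats_alt.transform(test)   # the unseen '65+' yields zeros in both Age columns
```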

Evaluation functions

In [None]:
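# threshold-tuning variant of the evaluation; not used in this solution
def evaluate_f1(y_true, y_prob):
  precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
  # the last (precision, recall) point has no threshold, so it is dropped;
  # a 0/0 division is mapped to F1 = 0 instead of NaN
  pr_sum = precision[:-1] + recall[:-1]
  f1 = np.divide(2 * precision[:-1] * recall[:-1], pr_sum,
                 out=np.zeros_like(pr_sum), where=pr_sum > 0)
  ix = np.argmax(f1)
  # precision_recall_curve treats scores >= threshold as positive
  y_pred = y_prob >= thresholds[ix]
  print('TH:', thresholds[ix], '\nF1:', f1[ix], '\nPR:', precision[ix], '\nRC:', recall[ix])
  # ROC AUC is computed on the scores, not on thresholded labels
  print('RA:', roc_auc_score(y_true, y_prob))
  cnf_mx = confusion_matrix(y_true, y_pred)
  TN, TP = cnf_mx[0,0], cnf_mx[1,1]
  FN, FP = cnf_mx[1,0], cnf_mx[0,1]
  print(f'\tTN={TN}\tFP={FP}, \n\tFN={FN}\tTP={TP}\n')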


# standard evaluation function, decision threshold = 0.5; returns 8 metrics
def evaluate_05(y_true, y_pred, y_prob):
  f1 = f1_score(y_true, y_pred)
  pr = precision_score(y_true, y_pred)
  rc = recall_score(y_true, y_pred)
  ra = roc_auc_score(y_true, y_prob)
  print(f'F1: {f1}\nPREC: {pr}\nREC: {rc}\nRA: {ra}')

  cnf_mx = confusion_matrix(y_true, y_pred)
  TN, TP = cnf_mx[0,0], cnf_mx[1,1]
  FN, FP = cnf_mx[1,0], cnf_mx[0,1]
  print(f'\tTN={TN}\tFP={FP}, \n\tFN={FN}\tTP={TP}\n')

  return (f1, pr, rc, ra, TN, TP, FN, FP)
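
Both functions rely on scikit-learn's confusion matrix layout: rows are true classes, columns are predicted classes, so for binary labels `[0, 0]` is TN, `[0, 1]` is FP, `[1, 0]` is FN and `[1, 1]` is TP. A toy check with made-up labels:

```python
from sklearn.metrics import confusion_matrix, f1_score

# Made-up labels: 3 true negatives-class rows, 4 true positive-class rows
y_true = [0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 1, 0, 1, 1, 0, 1]

cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()          # row-major order: TN, FP, FN, TP
f1 = f1_score(y_true, y_pred)

print(tn, fp, fn, tp, f1)  # 2 1 1 3 0.75
```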

#3. Train any classifier (whichever you like)

Plain logistic regression is used

In [None]:
model = Pipeline([
    ('features', feats),
    ('classifier', LogisticRegression(random_state=0))
    ])

model.fit(X_train, y_train)

# prediction of the baseline model on the test (validation) sample
print('test')
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]
evaluate_05(y_test, y_pred, y_prob)

# предсказание "нулевой" модели на контрольной (отложенной) выборке
print('holdout')
y_pred = model.predict(X_holdout)
y_prob = model.predict_proba(X_holdout)[:, 1]

# словарь для сбора метрик
m_dict = {
    'P_frac': [],
    'f1': [],
    'precision': [],
    'recall': [],
    'roc_auc': [],
    'TN': [],
    'TP': [],
    'FN': [],
    'FP': []
}
m_dict['P_frac'].append(0)

for k, v in zip(list(m_dict.keys())[1:], evaluate_05(y_holdout, y_pred, y_prob)):
  m_dict[k].append(v)

test
F1: 0.6206896551724138
PREC: 0.6545454545454545
REC: 0.5901639344262295
RA: 0.8165701609264551
	TN=180	FP=38, 
	FN=50	TP=72

holdout
F1: 0.5891472868217055
PREC: 0.6229508196721312
REC: 0.5588235294117647
RA: 0.7921122994652406
	TN=98	FP=23, 
	FN=30	TP=38



#4. Split the dataset into two sets: P (positives) and U (unlabeled), taking only a part of the positive (class 1) examples rather than all of them
#5. Apply random negative sampling to build a classifier under the new conditions

Because the dataset is small, only the metrics on the holdout sample are compared and counted (see the table below); the metrics on the residual sample left after sampling are printed purely for illustration

In [None]:
dff = df.copy()

# the holdout subsets are extracted from the dataset
# by the indices of the initial split
dff_X_holdout, dff_y_holdout = (dff.iloc[y_holdout.index, :-2],
                                dff.iloc[y_holdout.index, -2])
# they are not used in this solution, but can serve for a closer look
# at the results (once extended with the predicted labels and
# probabilities: y_pu_pred_holdout and y_pu_prob_holdout)

# dataset without the holdout sample
dff = dff.drop(index=y_holdout.index)

# iterate over the sampling fractions
for frac in np.linspace(.05, .95, 10):
  print(f'FRAC = {frac}')

  pos_ind = np.array(dff.loc[dff.Amphet==1].index)
  np.random.shuffle(pos_ind)
  pos_ind_sample = pos_ind[:int(np.ceil(len(pos_ind) * frac))]

  dff['PU_class'] = -1
  dff.loc[pos_ind_sample, 'PU_class'] = 1
  print(f'{len(pos_ind_sample)}/{len(pos_ind)} positives employed as PU_class 1')

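  pos_sample = dff[dff.PU_class==1]
  # the first len(pos_sample) unlabeled rows serve as negatives;
  # the remaining unlabeled rows form the residual test sample
  neg_sample = dff[dff.PU_class==-1][:len(pos_sample)]
  sample_test = dff[dff.PU_class==-1][len(pos_sample):]
  sample_train = pd.concat([neg_sample, pos_sample]).sample(frac=1)

  # features: all columns but the last two; target: the second-to-last column
  # (the last column is PU_class)
  X_pu_train, y_pu_train = sample_train.iloc[:, :-2], sample_train.iloc[:, -2]
  X_pu_test, y_pu_test = sample_test.iloc[:, :-2], sample_test.iloc[:, -2]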

  # fit the model on the combined training sample built by sampling
  model.fit(X_pu_train, y_pu_train)

  # predictions for the test sample (what is left after sampling)
  print('test')
  y_pu_pred = model.predict(X_pu_test)
  y_pu_prob = model.predict_proba(X_pu_test)[:, 1]
  # print the results for each fraction
  evaluate_05(y_pu_test, y_pu_pred, y_pu_prob)
  print()

  # predictions for the control (holdout) sample
  print('holdout')
  y_pu_pred_holdout = model.predict(X_holdout)
  y_pu_prob_holdout = model.predict_proba(X_holdout)[:, 1]

  # collect the metrics
  m_dict['P_frac'].append(frac)
  for k, v in zip(list(m_dict.keys())[1:],
                  evaluate_05(y_holdout,
                              y_pu_pred_holdout,
                              y_pu_prob_holdout)):
    m_dict[k].append(v)

FRAC = 0.05
31/611 positives employed as PU_class 1
test
F1: 0.6119679210363973
PREC: 0.47418738049713194
REC: 0.8626086956521739
RA: 0.7669072545880034
	TN=509	FP=550, 
	FN=79	TP=496


holdout
F1: 0.6137566137566137
PREC: 0.4793388429752066
REC: 0.8529411764705882
RA: 0.7572314049586777
	TN=58	FP=63, 
	FN=10	TP=58

FRAC = 0.15
92/611 positives employed as PU_class 1
test
F1: 0.6134228187919464
PREC: 0.467280163599182
REC: 0.892578125
RA: 0.7572636718749999
	TN=479	FP=521, 
	FN=55	TP=457


holdout
F1: 0.631578947368421
PREC: 0.4918032786885246
REC: 0.8823529411764706
RA: 0.7715726786582401
	TN=59	FP=62, 
	FN=8	TP=60

FRAC = 0.25
153/611 positives employed as PU_class 1
test
F1: 0.610305958132045
PREC: 0.47853535353535354
REC: 0.8422222222222222
RA: 0.7616666666666667
	TN=527	FP=413, 
	FN=71	TP=379


holdout
F1: 0.6480446927374302
PREC: 0.5225225225225225
REC: 0.8529411764705882
RA: 0.7853062712688381
	TN=68	FP=53, 
	FN=10	TP=58

FRAC = 0.35
214/611 positives employed as PU_class 1
test
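
As a point of comparison, random negative sampling can be sketched on synthetic data as follows. This is one possible variant under stated assumptions, not the solution used in this notebook: the data comes from `make_classification`, and the names `toy`, `pool`, `pu_label` and `pseudo_negatives` are hypothetical. Here the pseudo-negatives are drawn at random from U with `DataFrame.sample`, and the classifier is fit on the PU labels only, while the true labels are used just for evaluation on the holdout.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# synthetic stand-in for a real dataset (hypothetical data)
X, y = make_classification(n_samples=2000, n_features=10, n_informative=5,
                           n_clusters_per_class=1, weights=[0.7, 0.3],
                           random_state=0)
toy = pd.DataFrame(X, columns=[f'x{i}' for i in range(X.shape[1])])
toy['target'] = y
feature_cols = [c for c in toy.columns if c != 'target']

# hold out 20% of the rows; their true labels are used for evaluation only
holdout = toy.sample(frac=0.2, random_state=0)
pool = toy.drop(index=holdout.index)

# P: a random 25% of the true positives; everything else is unlabeled (U)
pos_idx = pool.index[pool['target'] == 1].to_numpy()
n_labeled = int(np.ceil(len(pos_idx) * 0.25))
labeled_idx = rng.choice(pos_idx, size=n_labeled, replace=False)
pool = pool.assign(pu_label=0)
pool.loc[labeled_idx, 'pu_label'] = 1

# random negative sampling: draw |P| rows from U at random, treat them as negatives
positives = pool[pool['pu_label'] == 1]
pseudo_negatives = pool[pool['pu_label'] == 0].sample(n=len(positives),
                                                      random_state=0)
train = pd.concat([positives, pseudo_negatives]).sample(frac=1, random_state=0)

# the classifier sees only the PU labels, never the true target
clf = LogisticRegression(max_iter=1000)
clf.fit(train[feature_cols], train['pu_label'])

# ranking quality against the true labels of the holdout
holdout_prob = clf.predict_proba(holdout[feature_cols])[:, 1]
holdout_auc = roc_auc_score(holdout['target'], holdout_prob)
print(f'holdout ROC AUC: {holdout_auc:.3f}')
```

Because some pseudo-negatives are in fact positives, the predicted probabilities are biased downward; the ranking of objects (and hence ROC AUC) is usually affected less than threshold-based metrics.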

#6. Compare the quality with the solution from step 4 (build a report: a table of metrics)
#7. Experiment with the share of P in step 5 (how model quality changes as P shrinks or grows)

In [None]:
pd.DataFrame(m_dict).sort_values(by='f1')

    P_frac        f1  precision    recall   roc_auc  TN  TP  FN  FP
0     0.00  0.589147   0.622951  0.558824  0.792112  98  38  30  23
1     0.05  0.613757   0.479339  0.852941  0.757231  58  58  10  63
2     0.15  0.631579   0.491803  0.882353  0.771573  59  60   8  62
10    0.95  0.645963   0.559140  0.764706  0.790775  80  52  16  41
7     0.65  0.647399   0.533333  0.823529  0.782632  72  56  12  49
4     0.35  0.647727   0.527778  0.838235  0.786157  70  57  11  51
3     0.25  0.648045   0.522523  0.852941  0.785306  68  58  10  53
8     0.75  0.662722   0.554455  0.823529  0.791018  76  56  12  45
5     0.45  0.666667   0.541284  0.867647  0.790897  71  59   9  50
6     0.55  0.670455   0.546296  0.867647  0.782146  72  59   9  49


#Conclusion:
On small datasets, the shrinking of the data after sampling has to be taken into account: the per-fraction output above shows that, on the residual test sample, prediction quality drops as the share of labeled positives grows, and the model gravitates toward maximizing recall (the Type I error, i.e. the number of false positives produced by the classifier, rises sharply). At the same time, prediction quality on the holdout sample, which takes no part in training, rises as the positively labeled sample grows: F1 improves by about 8 percentage points (at P = 0.85) over the model trained without the PU approach (0.6707 versus 0.5891).

We can therefore conclude that the approach is worth applying to problems of this kind, with one caveat: in every run recall rises sharply (by roughly 20 to 30 pp) at the expense of precision (which falls by roughly 6 to 15 pp).
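
The percentage-point figures can be rechecked with a few lines of arithmetic. The sketch below copies the holdout values from the rows listed in the report table; the P = 0.85 run appears in the conclusion only through its F1 of 0.6707, so it enters only the F1 comparison. The helper `pp` and the variable names are illustrative.

```python
# holdout metrics of the baseline model (P_frac = 0 row of the report table)
baseline = {'f1': 0.589147, 'precision': 0.622951, 'recall': 0.558824}

# P_frac -> (f1, precision, recall), copied from the report table
pu_rows = {
    0.05: (0.613757, 0.479339, 0.852941),
    0.15: (0.631579, 0.491803, 0.882353),
    0.25: (0.648045, 0.522523, 0.852941),
    0.35: (0.647727, 0.527778, 0.838235),
    0.45: (0.666667, 0.541284, 0.867647),
    0.55: (0.670455, 0.546296, 0.867647),
    0.65: (0.647399, 0.533333, 0.823529),
    0.75: (0.662722, 0.554455, 0.823529),
    0.95: (0.645963, 0.559140, 0.764706),
}

def pp(value, reference):
    """Difference between two ratios, in percentage points (1 decimal)."""
    return round((value - reference) * 100, 1)

recall_gain = [pp(r, baseline['recall']) for _, _, r in pu_rows.values()]
precision_change = [pp(p, baseline['precision']) for _, p, _ in pu_rows.values()]
f1_gain_best = pp(0.6707, 0.5891)  # F1 values quoted in the conclusion

print('recall change, pp:   ', min(recall_gain), 'to', max(recall_gain))
print('precision change, pp:', min(precision_change), 'to', max(precision_change))
print('best F1 gain, pp:    ', f1_gain_best)
```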