
1. Take any dataset for binary classification (one can be downloaded from https://archive.ics.uci.edu/ml/datasets.php).
2. Do feature engineering.
3. Train any classifier (whichever you like).
4. Then split your dataset into two sets: P (positives) and U (unlabeled). Take only a part of the positive (class 1) examples, not all of them.
5. Apply random negative sampling to build a classifier under the new conditions.
6. Compare the quality with the solution from step 4 (build a report: a table of metrics).
7. Experiment with the share of P in step 5 (how the model quality changes as the size of P decreases or increases).

A friend of mine, Dmitry Shelgunov, set himself the following task: cook pilaf 100 times and count only the hundredth one as a success, thereby learning how to cook it.

In honor of Shel, I chose a binary classification dataset about RICE, the main ingredient of this wonderful dish: https://archive.ics.uci.edu/ml/datasets/Rice+%28Cammeo+and+Osmancik%29

Description:

A total of 3810 rice grain's images were taken for the two species (Cammeo and Osmancik), processed and feature inferences were made. 7 morphological features were obtained for each grain of rice.


In [1]:
import pandas as pd
import numpy as np
import warnings 
warnings.filterwarnings('ignore')

In [2]:
df = pd.read_csv('Rice_Osmancik_Cammeo_Dataset.csv')

In [3]:
df

Unnamed: 0,AREA,PERIMETER,MAJORAXIS,MINORAXIS,ECCENTRICITY,CONVEX_AREA,EXTENT,CLASS
0,15231,525.578979,229.749878,85.093788,0.928882,15617,0.572896,Cammeo
1,14656,494.311005,206.020065,91.730972,0.895405,15072,0.615436,Cammeo
2,14634,501.122009,214.106781,87.768288,0.912118,14954,0.693259,Cammeo
3,13176,458.342987,193.337387,87.448395,0.891861,13368,0.640669,Cammeo
4,14688,507.166992,211.743378,89.312454,0.906691,15262,0.646024,Cammeo
...,...,...,...,...,...,...,...,...
3805,11441,415.858002,170.486771,85.756592,0.864280,11628,0.681012,Osmancik
3806,11625,421.390015,167.714798,89.462570,0.845850,11904,0.694279,Osmancik
3807,12437,442.498993,183.572922,86.801979,0.881144,12645,0.626739,Osmancik
3808,9882,392.296997,161.193985,78.210480,0.874406,10097,0.659064,Osmancik


Attribute Information:
1. Area: Returns the number of pixels within the boundaries of the rice grain.
2. Perimeter: Calculates the circumference by calculating the distance between pixels around the boundaries of the rice grain.
3. Major Axis Length: The longest line that can be drawn on the rice grain, i.e. the major axis distance.
4. Minor Axis Length: The shortest line that can be drawn on the rice grain, i.e. the minor axis distance.
5. Eccentricity: It measures how round the ellipse, which has the same moments as the rice grain, is.
6. Convex Area: Returns the pixel count of the smallest convex shell of the region formed by the rice grain.
7. Extent: Returns the ratio of the region formed by the rice grain to the bounding box pixels
8. Class: Cammeo and Osmancik.

The eighth attribute is the target: Cammeo (1) or Osmancik (0).

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3810 entries, 0 to 3809
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   AREA          3810 non-null   int64  
 1   PERIMETER     3810 non-null   float64
 2   MAJORAXIS     3810 non-null   float64
 3   MINORAXIS     3810 non-null   float64
 4   ECCENTRICITY  3810 non-null   float64
 5   CONVEX_AREA   3810 non-null   int64  
 6   EXTENT        3810 non-null   float64
 7   CLASS         3810 non-null   object 
dtypes: float64(5), int64(2), object(1)
memory usage: 238.2+ KB


In [5]:
df.describe()

Unnamed: 0,AREA,PERIMETER,MAJORAXIS,MINORAXIS,ECCENTRICITY,CONVEX_AREA,EXTENT
count,3810.0,3810.0,3810.0,3810.0,3810.0,3810.0,3810.0
mean,12667.727559,454.23918,188.776222,86.31375,0.886871,12952.49685,0.661934
std,1732.367706,35.597081,17.448679,5.729817,0.020818,1776.972042,0.077239
min,7551.0,359.100006,145.264465,59.532406,0.777233,7723.0,0.497413
25%,11370.5,426.144752,174.353855,82.731695,0.872402,11626.25,0.598862
50%,12421.5,448.852493,185.810059,86.434647,0.88905,12706.5,0.645361
75%,13950.0,483.683746,203.550438,90.143677,0.902588,14284.0,0.726562
max,18913.0,548.445984,239.010498,107.54245,0.948007,19099.0,0.86105


In [6]:
# check the class balance:

df['CLASS'].value_counts()

Osmancik    2180
Cammeo      1630
Name: CLASS, dtype: int64

In [7]:
# binary-encode the target variable:

df['CLASS'] = df['CLASS'].map({'Cammeo': 1, 'Osmancik': 0})

In [8]:
# check:

df.head(3)

Unnamed: 0,AREA,PERIMETER,MAJORAXIS,MINORAXIS,ECCENTRICITY,CONVEX_AREA,EXTENT,CLASS
0,15231,525.578979,229.749878,85.093788,0.928882,15617,0.572896,1
1,14656,494.311005,206.020065,91.730972,0.895405,15072,0.615436,1
2,14634,501.122009,214.106781,87.768288,0.912118,14954,0.693259,1


In [9]:
# split the data into train and test parts and train a model (I chose CatBoost):

from sklearn.model_selection import train_test_split

X_data = df.drop('CLASS', axis=1)
y_data = df['CLASS']

X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.2, random_state=42)

In [20]:
#from catboost import CatBoostClassifier
!pip install catboost

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting catboost
  Downloading catboost-1.1.1-cp38-none-manylinux1_x86_64.whl (76.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 76.6/76.6 MB 10.8 MB/s eta 0:00:00
Installing collected packages: catboost
Successfully installed catboost-1.1.1


In [21]:
from catboost import CatBoostClassifier

model = CatBoostClassifier(random_state=42, verbose=100)

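model.fit(X_train, y_train)
# NOTE: predict() returns hard 0/1 labels, not probabilities, so the
# threshold search and ROC AUC below are computed on binary predictions
y_predict = model.predict(X_test)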

Learning rate set to 0.016581
0:	learn: 0.6655457	total: 50.3ms	remaining: 50.3s
100:	learn: 0.1889111	total: 357ms	remaining: 3.18s
200:	learn: 0.1659009	total: 683ms	remaining: 2.72s
300:	learn: 0.1548290	total: 1s	remaining: 2.32s
400:	learn: 0.1452210	total: 1.29s	remaining: 1.93s
500:	learn: 0.1364113	total: 1.6s	remaining: 1.6s
600:	learn: 0.1288964	total: 1.92s	remaining: 1.28s
700:	learn: 0.1212032	total: 2.24s	remaining: 957ms
800:	learn: 0.1139823	total: 2.53s	remaining: 629ms
900:	learn: 0.1074276	total: 2.86s	remaining: 315ms
999:	learn: 0.1010663	total: 3.16s	remaining: 0us


In [22]:
from sklearn.metrics import f1_score, roc_auc_score, precision_recall_curve

In [23]:
metrics_df = pd.DataFrame(columns=['model', 'thresh', 'F-Score', 'Precision', 'Recall', 'ROC AUC'])
metrics_df

Unnamed: 0,model,thresh,F-Score,Precision,Recall,ROC AUC


In [24]:
precision, recall, thresholds = precision_recall_curve(y_test, y_predict)

fscore = (2 * precision * recall) / (precision + recall)
# locate the index of the largest f score
ix = np.argmax(fscore)
print(f'Best Threshold={thresholds[ix]}, F-Score={fscore[ix]:.3f}, Precision={precision[ix]:.3f}, Recall={recall[ix]:.3f}')

Best Threshold=1, F-Score=0.921, Precision=0.928, Recall=0.914


In [25]:
roc_auc = roc_auc_score(y_test, y_predict)
roc_auc

0.9268030513176144

In [26]:
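# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
metrics_df = pd.concat([metrics_df, pd.DataFrame([{
    'model': 'supervised',
    'thresh': thresholds[ix],
    'F-Score': fscore[ix],
    'Precision': precision[ix],
    'Recall': recall[ix],
    'ROC AUC': roc_auc
}])], ignore_index=True)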

metrics_df

Unnamed: 0,model,thresh,F-Score,Precision,Recall,ROC AUC
0,supervised,1,0.920863,0.927536,0.914286,0.926803


#### Now for PU learning (25%)

In [27]:
# pretend that the negatives and a part of the positives are unknown:

mod_data = X_train.copy()
mod_data['label'] = y_train
mod_data = mod_data.reset_index(drop=True)

# mod_data = data.copy()
# get the indices of the positives samples
pos_ind = np.where(mod_data.iloc[:, -1].values == 1)[0]

# shuffle them
np.random.shuffle(pos_ind)
# leave just 25% of the positives marked
perc = 0.25
pos_sample_len = int(np.ceil(perc * len(pos_ind)))

print(f'Using {pos_sample_len}/{len(pos_ind)} as positives and unlabeling the rest')
pos_sample = pos_ind[:pos_sample_len]

Using 320/1280 as positives and unlabeling the rest


In [28]:
# create a column for the new target variable with two classes: P (1) and U (-1):

mod_data['class_test'] = -1
mod_data.loc[pos_sample, 'class_test'] = 1
print('target variable:\n', mod_data.iloc[:,-1].value_counts())

target variable:
 -1    2728
 1     320
Name: class_test, dtype: int64


* 320 positive examples (1)
* 2728 unlabeled examples (-1)
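
The split above can be wrapped in a small seeded helper, which also makes it easy to see how many true positives end up hidden inside U. This is a sketch, not the notebook's code: the function name `make_pu_labels` and the `seed` parameter are hypothetical. The class counts (1280 positives out of 3048 training rows, hence 1768 negatives) are derived from the outputs printed above.

```python
import numpy as np

def make_pu_labels(y_true, labeled_frac, seed=0):
    """Return PU labels: 1 for a random share of the positives, -1 for everything else."""
    y_true = np.asarray(y_true)
    rng = np.random.default_rng(seed)          # seeded, so the split is reproducible
    pos_idx = np.flatnonzero(y_true == 1)
    n_labeled = int(np.ceil(labeled_frac * len(pos_idx)))
    labeled = rng.choice(pos_idx, size=n_labeled, replace=False)
    pu = np.full(len(y_true), -1)
    pu[labeled] = 1
    return pu

# Class counts of the training split: 1280 positives, 3048 - 1280 = 1768 negatives
y_true = np.array([1] * 1280 + [0] * 1768)
pu = make_pu_labels(y_true, labeled_frac=0.25)

n_labeled = int((pu == 1).sum())
n_unlabeled = int((pu == -1).sum())
hidden_pos = int(((pu == -1) & (y_true == 1)).sum())
contamination = hidden_pos / n_unlabeled       # share of U that is actually positive

print(n_labeled, n_unlabeled, hidden_pos, round(contamination, 3))
# → 320 2728 960 0.352
```

So at 25% roughly a third of the "unlabeled" rows are in fact positives. By the same arithmetic the share is 1152/2920 ≈ 0.39 when 10% of the positives are labeled and 640/2408 ≈ 0.27 when 50% are labeled.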

In [29]:
mod_data.head(10)

Unnamed: 0,AREA,PERIMETER,MAJORAXIS,MINORAXIS,ECCENTRICITY,CONVEX_AREA,EXTENT,label,class_test
0,12529,437.838989,174.86145,92.189262,0.849733,12840,0.766019,0,-1
1,11051,424.976013,180.871903,78.26738,0.901527,11240,0.568058,0,-1
2,12975,463.851013,196.423966,85.064117,0.901363,13358,0.609126,1,-1
3,10398,405.678986,162.227158,82.393456,0.861422,10658,0.644717,0,-1
4,14541,492.785004,204.257141,92.471016,0.891653,14893,0.758292,1,-1
5,10870,409.490997,169.20607,82.095322,0.874415,11030,0.670243,0,-1
6,13913,493.606995,212.985474,83.991348,0.918959,14218,0.572128,1,1
7,14720,494.862,207.092712,91.498894,0.897101,15071,0.704003,1,-1
8,11136,427.109985,175.653076,81.918777,0.884591,11474,0.574376,0,-1
9,14115,483.779999,206.841873,88.068649,0.904828,14312,0.747656,1,-1


#### random negative sampling

In [30]:
# keep in mind that mod_data still contains the true target ('label'), which is used only for quality checks
# the second-to-last column is the true class, the last one ('class_test') is the PU labelling fed to the model:

mod_data = mod_data.sample(frac=1)


data_N = mod_data[mod_data['class_test'] == -1]
data_P = mod_data[mod_data['class_test'] == 1]

neg_sample = data_N[:data_P.shape[0]]
sample_test = data_N[data_P.shape[0]:]
pos_sample = data_P.copy()

print(neg_sample.shape, pos_sample.shape)
sample_train = pd.concat([neg_sample, pos_sample]).sample(frac=1)

(320, 9) (320, 9)
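
The sampling step above can be sketched as a reusable function. This is an illustration on a toy frame, not the notebook's code: the function name `random_negative_sampling`, the `target` column and the `seed` parameter are hypothetical, and the frame is assumed to have a unique index.

```python
import numpy as np
import pandas as pd

def random_negative_sampling(df, pu_col, seed=0):
    """Build a balanced training frame: all labeled positives plus an equally
    sized random draw from the unlabeled rows, which are treated as negatives."""
    positives = df[df[pu_col] == 1]
    unlabeled = df[df[pu_col] == -1]
    # random draw from U, as many rows as there are labeled positives
    pseudo_neg = unlabeled.sample(n=len(positives), random_state=seed)
    # the unlabeled rows that were not drawn (requires a unique index)
    rest = unlabeled.drop(pseudo_neg.index)
    train = pd.concat([positives, pseudo_neg]).sample(frac=1, random_state=seed)
    # 1 for labeled positives, 0 for the sampled pseudo-negatives
    train = train.assign(target=(train[pu_col] == 1).astype(int))
    return train, rest

# Toy data: 3 labeled positives and 7 unlabeled rows
toy = pd.DataFrame({
    "feature": np.arange(10, dtype=float),
    "pu": [1, 1, 1, -1, -1, -1, -1, -1, -1, -1],
})
train, rest = random_negative_sampling(toy, "pu")
print(train["target"].value_counts().to_dict(), len(rest))
```

The pseudo-negatives are only assumed to be negative: any hidden positives drawn from U enter training with the wrong label, which is the price of this method.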


In [31]:
sample_train

Unnamed: 0,AREA,PERIMETER,MAJORAXIS,MINORAXIS,ECCENTRICITY,CONVEX_AREA,EXTENT,label,class_test
31,14478,469.161011,190.847717,97.343185,0.860141,14686,0.765424,1,1
597,11322,415.601013,164.598053,88.870033,0.841715,11619,0.754046,0,-1
1040,12220,465.686005,200.174164,79.041847,0.918739,12588,0.574383,1,-1
1542,14713,493.212006,206.916122,91.648468,0.896558,15162,0.594033,1,1
2183,16436,529.862976,230.931046,91.065186,0.918965,16648,0.634105,1,1
...,...,...,...,...,...,...,...,...,...
2693,12311,446.175995,182.649033,86.829353,0.879776,12608,0.690892,0,-1
2276,14974,498.625000,210.923950,90.903999,0.902362,15212,0.640901,1,-1
267,11197,429.450012,178.216919,81.277374,0.889950,11461,0.568838,0,-1
208,13404,476.205994,203.311569,84.621994,0.909265,13751,0.634659,1,-1


In [32]:
model = CatBoostClassifier(random_state=42, verbose=100)
sample_train.loc[sample_train['class_test'] == -1, 'class_test'] = 0

model.fit(sample_train.drop(columns=['class_test', 'label']), 
          sample_train['class_test'])

y_predict = model.predict(X_test)

Learning rate set to 0.008515
0:	learn: 0.6894471	total: 9.74ms	remaining: 9.73s
100:	learn: 0.5060993	total: 260ms	remaining: 2.31s
200:	learn: 0.4588510	total: 490ms	remaining: 1.95s
300:	learn: 0.4332235	total: 751ms	remaining: 1.75s
400:	learn: 0.4108060	total: 1s	remaining: 1.5s
500:	learn: 0.3911238	total: 1.23s	remaining: 1.23s
600:	learn: 0.3734552	total: 1.48s	remaining: 985ms
700:	learn: 0.3562672	total: 1.72s	remaining: 734ms
800:	learn: 0.3374994	total: 1.94s	remaining: 483ms
900:	learn: 0.3187872	total: 2.2s	remaining: 242ms
999:	learn: 0.2996679	total: 2.44s	remaining: 0us


In [33]:
precision, recall, thresholds = precision_recall_curve(y_test, y_predict)

fscore = (2 * precision * recall) / (precision + recall)
# locate the index of the largest f score
ix = np.argmax(fscore)
print(f'Best Threshold={thresholds[ix]}, F-Score={fscore[ix]:.3f}, Precision={precision[ix]:.3f}, Recall={recall[ix]:.3f}')

Best Threshold=1, F-Score=0.884, Precision=0.885, Recall=0.883


In [34]:
roc_auc = roc_auc_score(y_test, y_predict)
roc_auc

0.8928848821081832

In [35]:
metrics_df = pd.concat([metrics_df, pd.DataFrame([{
    'model': 'pu-learning (25%)',
    'thresh': thresholds[ix],
    'F-Score': fscore[ix],
    'Precision': precision[ix],
    'Recall': recall[ix],
    'ROC AUC': roc_auc
}])], ignore_index=True)

metrics_df

Unnamed: 0,model,thresh,F-Score,Precision,Recall,ROC AUC
0,supervised,1,0.920863,0.927536,0.914286,0.926803
1,pu-learning (25%),1,0.88412,0.885387,0.882857,0.892885


#### 10%

In [36]:
# pretend that the negatives and a part of the positives are unknown:

mod_data = X_train.copy()
mod_data['label'] = y_train
mod_data = mod_data.reset_index(drop=True)

# mod_data = data.copy()
# get the indices of the positives samples
pos_ind = np.where(mod_data.iloc[:, -1].values == 1)[0]

# shuffle them
np.random.shuffle(pos_ind)
# leave just 10% of the positives marked
perc = 0.1
pos_sample_len = int(np.ceil(perc * len(pos_ind)))

print(f'Using {pos_sample_len}/{len(pos_ind)} as positives and unlabeling the rest')
pos_sample = pos_ind[:pos_sample_len]

# create a column for the new target variable with two classes: P (1) and U (-1):

mod_data['class_test'] = -1
mod_data.loc[pos_sample, 'class_test'] = 1
print('target variable:\n', mod_data.iloc[:,-1].value_counts())

# keep in mind that mod_data still contains the true target ('label'), which is used only for quality checks
# the second-to-last column is the true class, the last one ('class_test') is the PU labelling fed to the model:

mod_data = mod_data.sample(frac=1)


data_N = mod_data[mod_data['class_test'] == -1]
data_P = mod_data[mod_data['class_test'] == 1]

neg_sample = data_N[:data_P.shape[0]]
sample_test = data_N[data_P.shape[0]:]
pos_sample = data_P.copy()

print(neg_sample.shape, pos_sample.shape)
sample_train = pd.concat([neg_sample, pos_sample]).sample(frac=1)

model = CatBoostClassifier(random_state=42, verbose=100)
sample_train.loc[sample_train['class_test'] == -1, 'class_test'] = 0

model.fit(sample_train.drop(columns=['class_test', 'label']), 
          sample_train['class_test'])

y_predict = model.predict(X_test)

precision, recall, thresholds = precision_recall_curve(y_test, y_predict)

fscore = (2 * precision * recall) / (precision + recall)
# locate the index of the largest f score
ix = np.argmax(fscore)
print(f'Best Threshold={thresholds[ix]}, F-Score={fscore[ix]:.3f}, Precision={precision[ix]:.3f}, Recall={recall[ix]:.3f}')

roc_auc = roc_auc_score(y_test, y_predict)
roc_auc

metrics_df = pd.concat([metrics_df, pd.DataFrame([{
    'model': 'pu-learning (10%)',
    'thresh': thresholds[ix],
    'F-Score': fscore[ix],
    'Precision': precision[ix],
    'Recall': recall[ix],
    'ROC AUC': roc_auc
}])], ignore_index=True)

metrics_df

Using 128/1280 as positives and unlabeling the rest
target variable:
 -1    2920
 1     128
Name: class_test, dtype: int64
(128, 9) (128, 9)
Learning rate set to 0.005758
0:	learn: 0.6894307	total: 3.16ms	remaining: 3.16s
100:	learn: 0.4765074	total: 233ms	remaining: 2.07s
200:	learn: 0.3939835	total: 458ms	remaining: 1.82s
300:	learn: 0.3450304	total: 700ms	remaining: 1.63s
400:	learn: 0.3102285	total: 945ms	remaining: 1.41s
500:	learn: 0.2818372	total: 1.16s	remaining: 1.16s
600:	learn: 0.2582819	total: 1.42s	remaining: 943ms
700:	learn: 0.2391676	total: 1.66s	remaining: 707ms
800:	learn: 0.2212261	total: 1.88s	remaining: 467ms
900:	learn: 0.2046790	total: 2.1s	remaining: 230ms
999:	learn: 0.1898112	total: 2.33s	remaining: 0us
Best Threshold=1, F-Score=0.808, Precision=0.874, Recall=0.751


Unnamed: 0,model,thresh,F-Score,Precision,Recall,ROC AUC
0,supervised,1,0.920863,0.927536,0.914286,0.926803
1,pu-learning (25%),1,0.88412,0.885387,0.882857,0.892885
2,pu-learning (10%),1,0.807988,0.873754,0.751429,0.829598


#### 50%

In [37]:
# pretend that the negatives and a part of the positives are unknown:

mod_data = X_train.copy()
mod_data['label'] = y_train
mod_data = mod_data.reset_index(drop=True)

# mod_data = data.copy()
# get the indices of the positives samples
pos_ind = np.where(mod_data.iloc[:, -1].values == 1)[0]

# shuffle them
np.random.shuffle(pos_ind)
# leave just 50% of the positives marked
perc = 0.5
pos_sample_len = int(np.ceil(perc * len(pos_ind)))

print(f'Using {pos_sample_len}/{len(pos_ind)} as positives and unlabeling the rest')
pos_sample = pos_ind[:pos_sample_len]

# create a column for the new target variable with two classes: P (1) and U (-1):

mod_data['class_test'] = -1
mod_data.loc[pos_sample, 'class_test'] = 1
print('target variable:\n', mod_data.iloc[:,-1].value_counts())

# keep in mind that mod_data still contains the true target ('label'), which is used only for quality checks
# the second-to-last column is the true class, the last one ('class_test') is the PU labelling fed to the model:

mod_data = mod_data.sample(frac=1)


data_N = mod_data[mod_data['class_test'] == -1]
data_P = mod_data[mod_data['class_test'] == 1]

neg_sample = data_N[:data_P.shape[0]]
sample_test = data_N[data_P.shape[0]:]
pos_sample = data_P.copy()

print(neg_sample.shape, pos_sample.shape)
sample_train = pd.concat([neg_sample, pos_sample]).sample(frac=1)

model = CatBoostClassifier(random_state=42, verbose=100)
sample_train.loc[sample_train['class_test'] == -1, 'class_test'] = 0

model.fit(sample_train.drop(columns=['class_test', 'label']), 
          sample_train['class_test'])

y_predict = model.predict(X_test)

precision, recall, thresholds = precision_recall_curve(y_test, y_predict)

fscore = (2 * precision * recall) / (precision + recall)
# locate the index of the largest f score
ix = np.argmax(fscore)
print(f'Best Threshold={thresholds[ix]}, F-Score={fscore[ix]:.3f}, Precision={precision[ix]:.3f}, Recall={recall[ix]:.3f}')

roc_auc = roc_auc_score(y_test, y_predict)
roc_auc

metrics_df = pd.concat([metrics_df, pd.DataFrame([{
    'model': 'pu-learning (50%)',
    'thresh': thresholds[ix],
    'F-Score': fscore[ix],
    'Precision': precision[ix],
    'Recall': recall[ix],
    'ROC AUC': roc_auc
}])], ignore_index=True)

metrics_df

Using 640/1280 as positives and unlabeling the rest
target variable:
 -1    2408
 1     640
Name: class_test, dtype: int64
(640, 9) (640, 9)
Learning rate set to 0.011448
0:	learn: 0.6845987	total: 2.96ms	remaining: 2.96s
100:	learn: 0.4280135	total: 288ms	remaining: 2.56s
200:	learn: 0.3940982	total: 542ms	remaining: 2.15s
300:	learn: 0.3741805	total: 790ms	remaining: 1.83s
400:	learn: 0.3580037	total: 1.07s	remaining: 1.59s
500:	learn: 0.3429578	total: 1.35s	remaining: 1.35s
600:	learn: 0.3278717	total: 1.63s	remaining: 1.08s
700:	learn: 0.3110040	total: 1.9s	remaining: 809ms
800:	learn: 0.2952122	total: 2.16s	remaining: 538ms
900:	learn: 0.2801143	total: 2.43s	remaining: 267ms
999:	learn: 0.2659526	total: 2.69s	remaining: 0us
Best Threshold=1, F-Score=0.908, Precision=0.875, Recall=0.943


Unnamed: 0,model,thresh,F-Score,Precision,Recall,ROC AUC
0,supervised,1,0.920863,0.927536,0.914286,0.926803
1,pu-learning (25%),1,0.88412,0.885387,0.882857,0.892885
2,pu-learning (10%),1,0.807988,0.873754,0.751429,0.829598
3,pu-learning (50%),1,0.90784,0.875332,0.942857,0.91439


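#### Brief conclusions:

A P size of 25% was taken as the baseline and gave fairly high metrics: F-Score 0.884 and ROC AUC 0.893, against 0.921 and 0.927 for the fully supervised model.

When the size of P is reduced to 10%, the metrics drop accordingly: F-Score 0.808, Recall 0.751, ROC AUC 0.830.

When the size of P is increased to 50%, F-Score, Recall and ROC AUC rise accordingly (0.908, 0.943 and 0.914), while Precision stays at about 0.88 in all three PU runs.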
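
The three near-identical cells above could be folded into one function and a loop over the labeled share. Below is a sketch under stated assumptions: it uses synthetic data and scikit-learn's `GradientBoostingClassifier` as stand-ins for the rice dataset and CatBoost, the function name `pu_experiment` is hypothetical, and metrics are computed on probability scores with a fixed 0.5 cut-off. Its output is therefore not comparable to the table above.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

def pu_experiment(X_tr, y_tr, X_te, y_te, labeled_frac, seed=0):
    """Label a share of the positives, sample pseudo-negatives from the rest, fit and score."""
    rng = np.random.default_rng(seed)
    pos_idx = np.flatnonzero(y_tr == 1)
    n_lab = int(np.ceil(labeled_frac * len(pos_idx)))
    labeled = rng.choice(pos_idx, size=n_lab, replace=False)         # set P
    unlabeled = np.setdiff1d(np.arange(len(y_tr)), labeled)          # set U
    pseudo_neg = rng.choice(unlabeled, size=n_lab, replace=False)    # random negative sampling
    idx = np.concatenate([labeled, pseudo_neg])
    target = np.concatenate([np.ones(n_lab, dtype=int), np.zeros(n_lab, dtype=int)])
    clf = GradientBoostingClassifier(random_state=seed).fit(X_tr[idx], target)
    scores = clf.predict_proba(X_te)[:, 1]
    return {
        "labeled_frac": labeled_frac,
        "n_labeled": n_lab,
        "F-Score": f1_score(y_te, (scores >= 0.5).astype(int)),
        "ROC AUC": roc_auc_score(y_te, scores),   # evaluated against the true test labels
    }

# Synthetic stand-in data (assumption: 7 numeric features, binary target)
X, y = make_classification(n_samples=1500, n_features=7, n_informative=5,
                           n_redundant=2, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

report = pd.DataFrame([pu_experiment(X_tr, y_tr, X_te, y_te, f) for f in (0.1, 0.25, 0.5)])
print(report)
```

Since each run depends on which positives happen to be labeled, repeating the loop over several seeds and averaging would give a steadier picture than a single run per share.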

Positive-Unlabeled (PU) learning means learning from positive and unlabeled data.

In essence, PU learning is an analogue of binary classification for cases where labeled data is available for only one of the classes, but an unlabeled mixture of data from both classes is also available.

In general, we do not even know how much of the mixture belongs to the positive class and how much to the negative one. From such datasets we want to build a binary classifier of the same kind as we would build with labeled data for both classes.
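
One common way to write this setting down is to treat the unlabeled set as a mixture of the two class-conditional distributions. This is a standard formulation rather than something stated in this notebook. The symbols $p_U$ and $\pi$ are introduced here for illustration, with $\pi$ the unknown share of positives in U:

```latex
p_U(x) = \pi \, p(x \mid y = 1) + (1 - \pi) \, p(x \mid y = 0),
\qquad 0 \le \pi \le 1 \ \text{unknown}
```

Random negative sampling, as used above, in effect approximates $p(x \mid y = 0)$ by $p_U(x)$. The approximation is exact only for $\pi = 0$ and degrades as more positives are hidden in U, which is consistent with the drop in quality seen when fewer positives are labeled.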