### Lesson 6. Homework

1. Take any dataset for binary classification (you can download one of the sample datasets from https://archive.ics.uci.edu/ml/datasets.php)
3. Do feature engineering
4. Train any classifier (whichever you like)
5. Then split your dataset into two sets: P (positives) and U (unlabeled). Take only a part of the positive (class 1) examples, not all of them
6. Apply random negative sampling to build a classifier under the new conditions
7. Compare the quality with the solution from step 4 (build a report: a table of metrics)
8. Experiment with the share of P in step 5 (how the model quality changes as the size of P decreases or increases)

Let's look at an example using a dataset from the UCI repository

For the test I take the bank marketing campaign data: https://archive.ics.uci.edu/ml/datasets/Bank+Marketing#

### Input variables:

**Bank client data:**
1. age (numeric)
2. job : type of job (categorical: "admin.","unknown","unemployed","management","housemaid","entrepreneur","student",
                                   "blue-collar","self-employed","retired","technician","services") 
3. marital : marital status (categorical: "married","divorced","single"; note: "divorced" means divorced or widowed)
4. education (categorical: "unknown","secondary","primary","tertiary")
5. default: has credit in default? (binary: "yes","no")
6. balance: average yearly balance, in euros (numeric) 
7. housing: has housing loan? (binary: "yes","no")
8. loan: has personal loan? (binary: "yes","no")

**Related with the last contact of the current campaign:**
9. contact: contact communication type (categorical: "unknown","telephone","cellular") 
10. day: last contact day of the month (numeric)
11. month: last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")
12. duration: last contact duration, in seconds (numeric)

**Other attributes:**
13. campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
14. pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)
15. previous: number of contacts performed before this campaign and for this client (numeric)
16. poutcome: outcome of the previous marketing campaign (categorical: "unknown","other","failure","success")

**Output variable (desired target):**
17. y - has the client subscribed a term deposit? (binary: "yes","no")

In [1]:
import pandas as pd
import numpy as np

from sklearn.pipeline import Pipeline, make_pipeline, FeatureUnion
from sklearn.base import BaseEstimator, TransformerMixin

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
import catboost as catb

from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import recall_score, precision_score, roc_auc_score, accuracy_score, f1_score

In [2]:
data = pd.read_csv("bank-full.csv", sep=";")
data.head(3)

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no


In [3]:
print(data.shape)

(45211, 17)


Let's look at the class ratio

In [4]:
data.iloc[:, -1].value_counts(normalize=True)

no     0.883015
yes    0.116985
Name: y, dtype: float64

In [5]:
# Encode the target as integers so that scikit-learn metrics see a numeric binary label
data['y'] = data['y'].map({'no': 0, 'yes': 1}).astype(int)

**Feature engineering**

In [6]:
data['balance_to_age'] = data['balance'] / data['age']

In [7]:
data['balance_to_duration'] = data['balance'] / data['duration']
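If `duration` is 0 for some row, the ratio above becomes inf (or NaN when `balance` is 0 as well). Below is a minimal sketch of one possible guard, on a made-up toy frame; `toy`, `raw_ratio` and `safe_ratio` are illustrative names and are not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the bank data (values are made up for illustration)
toy = pd.DataFrame({"balance": [2143, 29, 0, 500], "duration": [261, 0, 0, 100]})

# Plain division: x / 0 gives inf and 0 / 0 gives NaN
raw_ratio = toy["balance"] / toy["duration"]

# Guarded version: treat a zero duration as missing, so the ratio is NaN there
safe_ratio = toy["balance"] / toy["duration"].replace(0, np.nan)

print(raw_ratio.tolist())
print(safe_ratio.tolist())
```

Whether NaN is the right placeholder depends on the model; this is only one option.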

In [8]:
mean_balance_at_education_level = data.groupby(['education'], as_index=False).agg({'balance':'mean'})\
                       .rename(columns={'balance':'mean_balance_at_education_level'})
data = data.merge(mean_balance_at_education_level, on='education', how='left')

mean_balance_at_education_level

Unnamed: 0,education,mean_balance_at_education_level
0,primary,1250.949934
1,secondary,1154.880786
2,tertiary,1758.416435
3,unknown,1526.754443
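The same per-group mean can probably be attached without a separate merge. A small sketch on made-up data, using `groupby(...).transform`; the toy frame and its values are assumptions for illustration only.

```python
import pandas as pd

# Toy data: two education levels with known means (150 and 400)
toy = pd.DataFrame({
    "education": ["primary", "tertiary", "primary", "tertiary"],
    "balance": [100, 300, 200, 500],
})

# transform('mean') returns one value per row, aligned with the original index
toy["mean_balance_at_education_level"] = (
    toy.groupby("education")["balance"].transform("mean")
)
print(toy)
```

Unlike the merge, this keeps the original row order and index untouched.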


In [9]:
data = data[[c for c in data if c not in ['y']] + ['y']]
data.head(3)

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,balance_to_age,balance_to_duration,mean_balance_at_education_level,y
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,36.948276,8.210728,1758.416435,0
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,0.659091,0.192053,1154.880786,0
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,0.060606,0.026316,1154.880786,0


**Train-test split**

Split the data into training and test parts and train a model (gradient boosting in this example)

In [10]:
x_data = data.drop(columns=['y'])
y_data = data['y']

x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.2, random_state=7)
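Since the positive class is only about 12% of the data, it may be worth keeping the class ratio equal in both parts of the split. A hedged sketch on synthetic data follows; the arrays `X` and `y` are stand-ins, and passing `stratify` is a suggestion rather than what the notebook does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))          # synthetic features
y = np.array([1] * 120 + [0] * 880)     # about 12% positives, similar to the bank data

# stratify=y keeps the share of positives the same in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=7, stratify=y
)
print(y_tr.mean(), y_te.mean())
```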

In [11]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 45211 entries, 0 to 45210
Data columns (total 20 columns):
age                                45211 non-null int64
job                                45211 non-null object
marital                            45211 non-null object
education                          45211 non-null object
default                            45211 non-null object
balance                            45211 non-null int64
housing                            45211 non-null object
loan                               45211 non-null object
contact                            45211 non-null object
day                                45211 non-null int64
month                              45211 non-null object
duration                           45211 non-null int64
campaign                           45211 non-null int64
pdays                              45211 non-null int64
previous                           45211 non-null int64
poutcome                           45211 no

In [12]:
# 'day' is deliberately not included here; alternatively, day + month could be concatenated
categorical_columns = ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'poutcome']

In [13]:
%%time
# Class weights (roughly 1 to 8, matching the class ratio) could be used for balancing;
# this call does not set them, so CatBoost's defaults apply
model = catb.CatBoostClassifier(iterations=20, thread_count=2, silent=True, random_state=21)
model.fit(x_train, y_train, cat_features=categorical_columns)

y_predict = model.predict(x_test)

Wall time: 991 ms


Check the quality

In [14]:
def evaluate_results(y_test, y_predict, print_flag=True):
    """Print the metrics (print_flag=True) or return them as [f1, roc, recall, precision]."""
    f1 = f1_score(y_test, y_predict)
    # y_predict holds hard 0/1 labels, so this is the ROC AUC of a single threshold,
    # not of the predicted probabilities
    roc = roc_auc_score(y_test, y_predict)
    rec = recall_score(y_test, y_predict, average='binary')
    prc = precision_score(y_test, y_predict, average='binary')

    if print_flag:
        print('Classification results:')
        print("f1: %.2f%%" % (f1 * 100.0))
        print("roc: %.2f%%" % (roc * 100.0))
        print("recall: %.2f%%" % (rec * 100.0))
        print("precision: %.2f%%" % (prc * 100.0))
    else:
        return [f1, roc, rec, prc]
    
metrics = []
evaluate_results(y_test, y_predict)
metrics.append(evaluate_results(y_test, y_predict, print_flag=False))

Classification results:
f1: 55.59%
roc: 72.58%
recall: 48.43%
precision: 65.25%
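The `roc` value above is computed from hard 0/1 predictions. In that case ROC AUC reduces to balanced accuracy, and the score based on predicted probabilities is usually the more informative one. A minimal sketch on synthetic data, with `LogisticRegression` swapped in for CatBoost purely to keep the example self-contained:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 12% positives), not the bank dataset
X, y = make_classification(n_samples=2000, weights=[0.88, 0.12], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=7, stratify=y)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
hard = clf.predict(X_te)                 # 0/1 labels
proba = clf.predict_proba(X_te)[:, 1]    # probability of class 1

auc_hard = roc_auc_score(y_te, hard)     # equals balanced accuracy
auc_proba = roc_auc_score(y_te, proba)   # threshold-independent ROC AUC
bal_acc = balanced_accuracy_score(y_te, hard)
print(auc_hard, bal_acc, auc_proba)
```

For the model in this notebook, the analogous input would be `model.predict_proba(x_test)[:, 1]`.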


### Now it is PU learning's turn (random negative sampling)

Suppose the negatives and some of the positives are unknown to us

In [15]:
mod_data = data.copy()
#get the indices of the positives samples
pos_ind = np.where(mod_data.iloc[:,-1].values == 1)[0]
#shuffle them
np.random.shuffle(pos_ind)
# leave just 25% of the positives marked
pos_sample_len = int(np.ceil(0.25 * len(pos_ind)))
print(f'Using {pos_sample_len}/{len(pos_ind)} as positives and unlabeling the rest')
pos_sample = pos_ind[:pos_sample_len]

Using 1323/5289 as positives and unlabeling the rest


Create a column for the new target variable with two classes: P (1) and U (-1)

In [16]:
mod_data['class_test'] = -1
mod_data.loc[pos_sample,'class_test'] = 1
print('target variable:\n', mod_data.iloc[:,-1].value_counts())

target variable:
 -1    43888
 1     1323
Name: class_test, dtype: int64


* We now have just 1323 positive samples labeled as 1 in the 'class_test' col while the rest is unlabeled as -1
* Recall that 'y' still holds the actual label
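The labeling step can be wrapped in a small seeded helper so that the same P/U split is reproduced on every run. This is a sketch under assumptions: `make_pu_labels` and its parameters are hypothetical names, and it uses `np.random.default_rng` instead of the unseeded `np.random.shuffle` from the cell above.

```python
import numpy as np

def make_pu_labels(y_true, labeled_frac, seed=0):
    """Return an array with 1 for labeled positives and -1 for unlabeled rows."""
    y_true = np.asarray(y_true)
    rng = np.random.default_rng(seed)
    pos_idx = np.flatnonzero(y_true == 1)
    # Same rounding rule as in the notebook: ceil(fraction * number of positives)
    n_labeled = int(np.ceil(labeled_frac * len(pos_idx)))
    labeled = rng.choice(pos_idx, size=n_labeled, replace=False)
    pu = np.full(len(y_true), -1)
    pu[labeled] = 1
    return pu

y_toy = np.array([1] * 20 + [0] * 80)   # toy labels: 20 positives, 80 negatives
pu_toy = make_pu_labels(y_toy, labeled_frac=0.25, seed=42)
print((pu_toy == 1).sum(), (pu_toy == -1).sum())
```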

In [17]:
mod_data.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,...,duration,campaign,pdays,previous,poutcome,balance_to_age,balance_to_duration,mean_balance_at_education_level,y,class_test
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,...,261,1,-1,0,unknown,36.948276,8.210728,1758.416435,0,-1
1,44,technician,single,secondary,no,29,yes,no,unknown,5,...,151,1,-1,0,unknown,0.659091,0.192053,1154.880786,0,-1
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,...,76,1,-1,0,unknown,0.060606,0.026316,1154.880786,0,-1
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,...,92,1,-1,0,unknown,32.042553,16.369565,1526.754443,0,-1
4,33,unknown,single,unknown,no,1,no,no,unknown,5,...,198,1,-1,0,unknown,0.030303,0.005051,1526.754443,0,-1


Remember that this data frame (mod_data) still includes the former target variable 'y', which we keep only to compare the results.

Column -2 ('y') is the original class label for positive and negative data; column -1 ('class_test') is the new class for positive and unlabeled data.

In [18]:
x_data = mod_data.iloc[:,:-2].values # just the X 
y_labeled = mod_data.iloc[:,-1].values # new class (just the P & U)
y_positive = mod_data.iloc[:,-2].values # original class

### 1. random negative sampling

In [19]:
mod_data = mod_data.sample(frac=1)  # Randomly shuffle the whole dataset
# neg_sample: a random sample of unlabeled rows, the same length as pos_sample
neg_sample = mod_data[mod_data['class_test']==-1][:len(mod_data[mod_data['class_test']==1])]
sample_test = mod_data[mod_data['class_test']==-1][len(mod_data[mod_data['class_test']==1]):]  # The remaining U
pos_sample = mod_data[mod_data['class_test']==1]
print(neg_sample.shape, pos_sample.shape)  # The two samples are the same size
sample_train = pd.concat([neg_sample, pos_sample]).sample(frac=1)

(1323, 21) (1323, 21)


In [20]:
%%time
model = catb.CatBoostClassifier(iterations=20, thread_count=2, silent=True, random_state=21)
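# Note: iloc[:, -2] is the true label 'y', not the PU label 'class_test' (iloc[:, -1])
model.fit(sample_train.iloc[:,:-2], sample_train.iloc[:,-2], cat_features=categorical_columns)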

y_predict = model.predict(sample_test.iloc[:,:-2])
evaluate_results(sample_test.iloc[:,-2], y_predict)
metrics.append(evaluate_results(sample_test.iloc[:,-2], y_predict, print_flag=False))

Classification results:
f1: 46.26%
roc: 84.25%
recall: 87.45%
precision: 31.45%
Wall time: 366 ms
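The cell above fits on column -2, which is the true label `y`. In a strict PU setting the true labels of the sampled "negatives" would not be available, so the training target would be built from the P/U flag alone. A hedged sketch of that variant on synthetic data; `LogisticRegression` and all variable names here are stand-ins, not the notebook's code.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

X, y_true = make_classification(n_samples=4000, weights=[0.88, 0.12], random_state=0)
rng = np.random.default_rng(0)

# P: a quarter of the true positives; U: everything else (hidden positives + negatives)
pos_idx = np.flatnonzero(y_true == 1)
p_idx = rng.choice(pos_idx, size=len(pos_idx) // 4, replace=False)
u_idx = np.setdiff1d(np.arange(len(y_true)), p_idx)

# Random negative sampling: draw |P| rows from U and treat them as negatives
neg_idx = rng.choice(u_idx, size=len(p_idx), replace=False)
rest_idx = np.setdiff1d(u_idx, neg_idx)

train_idx = np.concatenate([p_idx, neg_idx])
pu_target = np.concatenate([np.ones(len(p_idx), dtype=int),
                            np.zeros(len(neg_idx), dtype=int)])

clf = LogisticRegression(max_iter=1000).fit(X[train_idx], pu_target)

# The true labels are used only to score the remaining unlabeled rows
pred = clf.predict(X[rest_idx])
print("recall on hidden positives:", recall_score(y_true[rest_idx], pred))
```

With this setup some of the sampled "negatives" are in fact hidden positives, which is the label noise that random negative sampling accepts by design.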


In [21]:
metrics_comparison_df = pd.DataFrame(np.array(metrics),
                   columns=['fscore', 'roc_auc', 'recall', 'precision'],
                   index=['CatBoostClassifier', 'CatBoostClassifier + random negative sampling'])
metrics_comparison_df

Unnamed: 0,fscore,roc_auc,recall,precision
CatBoostClassifier,0.555932,0.725806,0.484252,0.65252
CatBoostClassifier + random negative sampling,0.462638,0.842527,0.87448,0.314515


So here is what happened: by roc_auc, the overall quality of the model went up. Random negative sampling greatly improved Recall at the expense of Precision. We can tune this trade-off by choosing the decision threshold. But for our task a high **Recall** is exactly what we need: it means the model is good at identifying class 1 or examples that "look like" it. On the other hand, the drop in Precision says that we got more FP in exchange. And given that we are solving a PU problem and the data contain unlabeled class 1 examples, these FP may well be the "look-alikes" we are searching for. Keep in mind that the two rows were scored on different samples (x_test for the baseline, the remaining unlabeled rows for the second model), so the comparison is approximate.
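Threshold selection can be sketched roughly as follows. The example uses synthetic data and `LogisticRegression` as stand-ins, and picks the threshold that maximizes F1; the choice of F1 as the criterion is an assumption, and any other precision/recall trade-off could be used instead.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_recall_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, weights=[0.88, 0.12], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=7, stratify=y)

proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_te, proba)
# precision and recall have one more element than thresholds; drop the final point
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
best = int(np.argmax(f1))
best_threshold = thresholds[best]

# Apply the chosen threshold instead of the default 0.5
pred = (proba >= best_threshold).astype(int)
print(f"threshold={best_threshold:.3f} "
      f"precision={precision[best]:.3f} recall={recall[best]:.3f} "
      f"f1={f1_score(y_te, pred):.3f}")
```

In practice the threshold would be chosen on a separate validation part rather than on the data used for the final report.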

**Let's try different shares for the random sample**

**15%**

In [22]:
mod_data = data.copy()
#get the indices of the positives samples
pos_ind = np.where(mod_data.iloc[:,-1].values == 1)[0]
#shuffle them
np.random.shuffle(pos_ind)
# leave just 15% of the positives marked
pos_sample_len = int(np.ceil(0.15 * len(pos_ind)))
print(f'Using {pos_sample_len}/{len(pos_ind)} as positives and unlabeling the rest')
pos_sample = pos_ind[:pos_sample_len]

Using 794/5289 as positives and unlabeling the rest


In [23]:
mod_data['class_test'] = -1
mod_data.loc[pos_sample,'class_test'] = 1
print('target variable:\n', mod_data.iloc[:,-1].value_counts())

target variable:
 -1    44417
 1      794
Name: class_test, dtype: int64


In [24]:
x_data = mod_data.iloc[:,:-2].values # just the X 
y_labeled = mod_data.iloc[:,-1].values # new class (just the P & U)
y_positive = mod_data.iloc[:,-2].values # original class

In [25]:
mod_data = mod_data.sample(frac=1)  # Randomly shuffle the whole dataset
# neg_sample: a random sample of unlabeled rows, the same length as pos_sample
neg_sample = mod_data[mod_data['class_test']==-1][:len(mod_data[mod_data['class_test']==1])]
sample_test = mod_data[mod_data['class_test']==-1][len(mod_data[mod_data['class_test']==1]):]  # The remaining U
pos_sample = mod_data[mod_data['class_test']==1]
print(neg_sample.shape, pos_sample.shape)  # The two samples are the same size
sample_train = pd.concat([neg_sample, pos_sample]).sample(frac=1)

(794, 21) (794, 21)


In [26]:
%%time
model = catb.CatBoostClassifier(iterations=20, thread_count=2, silent=True, random_state=21)
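# Note: iloc[:, -2] is the true label 'y', not the PU label 'class_test' (iloc[:, -1])
model.fit(sample_train.iloc[:,:-2], sample_train.iloc[:,-2], cat_features=categorical_columns)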

y_predict = model.predict(sample_test.iloc[:,:-2])
evaluate_results(sample_test.iloc[:,-2], y_predict)
metrics.append(evaluate_results(sample_test.iloc[:,-2], y_predict, print_flag=False))

Classification results:
f1: 47.52%
roc: 83.21%
recall: 86.36%
precision: 32.78%
Wall time: 351 ms


**35%**

In [27]:
mod_data = data.copy()
#get the indices of the positives samples
pos_ind = np.where(mod_data.iloc[:,-1].values == 1)[0]
#shuffle them
np.random.shuffle(pos_ind)
# leave just 35% of the positives marked
pos_sample_len = int(np.ceil(0.35 * len(pos_ind)))
print(f'Using {pos_sample_len}/{len(pos_ind)} as positives and unlabeling the rest')
pos_sample = pos_ind[:pos_sample_len]

Using 1852/5289 as positives and unlabeling the rest


In [28]:
mod_data['class_test'] = -1
mod_data.loc[pos_sample,'class_test'] = 1
print('target variable:\n', mod_data.iloc[:,-1].value_counts())

target variable:
 -1    43359
 1     1852
Name: class_test, dtype: int64


In [29]:
x_data = mod_data.iloc[:,:-2].values # just the X 
y_labeled = mod_data.iloc[:,-1].values # new class (just the P & U)
y_positive = mod_data.iloc[:,-2].values # original class

In [30]:
mod_data = mod_data.sample(frac=1)  # Randomly shuffle the whole dataset
# neg_sample: a random sample of unlabeled rows, the same length as pos_sample
neg_sample = mod_data[mod_data['class_test']==-1][:len(mod_data[mod_data['class_test']==1])]
sample_test = mod_data[mod_data['class_test']==-1][len(mod_data[mod_data['class_test']==1]):]  # The remaining U
pos_sample = mod_data[mod_data['class_test']==1]
print(neg_sample.shape, pos_sample.shape)  # The two samples are the same size
sample_train = pd.concat([neg_sample, pos_sample]).sample(frac=1)

(1852, 21) (1852, 21)


In [31]:
%%time
model = catb.CatBoostClassifier(iterations=20, thread_count=2, silent=True, random_state=21)
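# Note: iloc[:, -2] is the true label 'y', not the PU label 'class_test' (iloc[:, -1])
model.fit(sample_train.iloc[:,:-2], sample_train.iloc[:,-2], cat_features=categorical_columns)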

y_predict = model.predict(sample_test.iloc[:,:-2])
evaluate_results(sample_test.iloc[:,-2], y_predict)
metrics.append(evaluate_results(sample_test.iloc[:,-2], y_predict, print_flag=False))

Classification results:
f1: 43.65%
roc: 84.82%
recall: 88.23%
precision: 29.00%
Wall time: 370 ms


In [32]:
metrics_comparison_df = pd.DataFrame(np.array(metrics),
                   columns=['fscore', 'roc_auc', 'recall', 'precision'],
                   index=['CatBoostClassifier', 'CatBoostClassifier + random negative sampling 25%', 
                          'CatBoostClassifier + random negative sampling 15%', 
                          'CatBoostClassifier + random negative sampling 35%'])
metrics_comparison_df

Unnamed: 0,fscore,roc_auc,recall,precision
CatBoostClassifier,0.555932,0.725806,0.484252,0.65252
CatBoostClassifier + random negative sampling 25%,0.462638,0.842527,0.87448,0.314515
CatBoostClassifier + random negative sampling 15%,0.47521,0.832149,0.863554,0.327798
CatBoostClassifier + random negative sampling 35%,0.436522,0.848214,0.882335,0.289997


The sample size affects quality only slightly. In this case the model with the 35% random sample came out best by roc_auc and recall, while the 15% sample gave the highest fscore and precision.
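Because every run draws a different random sample, small differences between the shares may simply be noise. One way to check this is to repeat each share over several seeds and compare mean and spread. The sketch below does so on synthetic data; `run_pu_experiment`, the `LogisticRegression` model and the five seeds are assumptions for illustration, not the notebook's setup.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score

def run_pu_experiment(X, y_true, labeled_frac, seed=0):
    """One random-negative-sampling run; returns metrics on the remaining U rows."""
    rng = np.random.default_rng(seed)
    pos_idx = np.flatnonzero(y_true == 1)
    n_p = int(np.ceil(labeled_frac * len(pos_idx)))
    p_idx = rng.choice(pos_idx, size=n_p, replace=False)
    u_idx = np.setdiff1d(np.arange(len(y_true)), p_idx)
    neg_idx = rng.choice(u_idx, size=n_p, replace=False)   # sampled "negatives"
    rest_idx = np.setdiff1d(u_idx, neg_idx)                # held-out unlabeled rows

    train_idx = np.concatenate([p_idx, neg_idx])
    target = np.concatenate([np.ones(n_p, dtype=int), np.zeros(n_p, dtype=int)])
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], target)
    pred = clf.predict(X[rest_idx])
    truth = y_true[rest_idx]
    return {
        "labeled_frac": labeled_frac,
        "seed": seed,
        "f1": f1_score(truth, pred),
        "recall": recall_score(truth, pred),
        "precision": precision_score(truth, pred, zero_division=0),
    }

X, y_true = make_classification(n_samples=4000, weights=[0.88, 0.12], random_state=0)
runs = pd.DataFrame([
    run_pu_experiment(X, y_true, frac, seed=s)
    for frac in (0.15, 0.25, 0.35)
    for s in range(5)
])
# Mean and standard deviation over seeds for each share of labeled positives
summary = runs.groupby("labeled_frac")[["f1", "recall", "precision"]].agg(["mean", "std"])
print(summary)
```

If the standard deviation across seeds is comparable to the gap between shares, the ranking of the shares should not be read too literally.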

<b>Bonus question:</b>

Which method do you think is preferable in practice: random negative sampling or the 2-step approach?

Your answer here: I think the 2-step approach, because it is more stable thanks to the absence of randomness. For example, with random negative sampling you can get "unlucky" with the sample.