### Homework

1. Take any dataset for binary classification (you can download one of the sample datasets from https://archive.ics.uci.edu/ml/datasets.php).
3. Do feature engineering.
4. Train any classifier of your choice.
5. Split your dataset into two sets: P (positives) and U (unlabeled). Take only a part of the positive examples (class 1) into P, not all of them.
6. Apply random negative sampling to build a classifier under these new conditions.
7. Compare the quality with the solution from step 4 (build a report as a table of metrics).
8. Experiment with the share of P in step 5 (how the model quality changes as the size of P decreases or increases).

#### Dataset description
Credit scoring data.

Home Ownership - home ownership status

Annual Income - annual income

Years in current job - number of years in the current job

Tax Liens - tax liens

Number of Open Accounts - number of open accounts

Years of Credit History - number of years of credit history

Maximum Open Credit - largest open credit

Number of Credit Problems - number of credit problems

Months since last delinquent - number of months since the last overdue payment

Bankruptcies - bankruptcies

Purpose - loan purpose

Term - loan term

Current Loan Amount - current loan amount

Current Credit Balance - current credit balance

Monthly Debt - monthly debt

Credit Default - whether the borrower defaulted on the loan (0 - repaid on time, 1 - overdue)

In [2]:
pip install xgboost

Collecting xgboost
  Downloading xgboost-1.6.1-py3-none-win_amd64.whl (125.4 MB)
Installing collected packages: xgboost
Successfully installed xgboost-1.6.1
Note: you may need to restart the kernel to use updated packages.


In [3]:
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import xgboost as xgb
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.metrics import recall_score, precision_score, roc_auc_score, accuracy_score, f1_score, precision_recall_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, make_pipeline, FeatureUnion

%matplotlib inline
warnings.filterwarnings('ignore')

In [5]:
df = pd.read_csv('train_data.csv')
df.head(5)

| | Home Ownership | Annual Income | Years in current job | Tax Liens | Number of Open Accounts | Years of Credit History | Maximum Open Credit | Number of Credit Problems | Months since last delinquent | Bankruptcies | Purpose | Term | Current Loan Amount | Current Credit Balance | Monthly Debt | Credit Score | Credit Default |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Own Home | 482087.0 | NaN | 0.0 | 11.0 | 26.3 | 685960.0 | 1.0 | NaN | 1.0 | debt consolidation | Short Term | 99999999.0 | 47386.0 | 7914.0 | 749.0 | 0 |
| 1 | Own Home | 1025487.0 | 10+ years | 0.0 | 15.0 | 15.3 | 1181730.0 | 0.0 | NaN | 0.0 | debt consolidation | Long Term | 264968.0 | 394972.0 | 18373.0 | 737.0 | 1 |
| 2 | Home Mortgage | 751412.0 | 8 years | 0.0 | 11.0 | 35.0 | 1182434.0 | 0.0 | NaN | 0.0 | debt consolidation | Short Term | 99999999.0 | 308389.0 | 13651.0 | 742.0 | 0 |
| 3 | Own Home | 805068.0 | 6 years | 0.0 | 8.0 | 22.5 | 147400.0 | 1.0 | NaN | 1.0 | debt consolidation | Short Term | 121396.0 | 95855.0 | 11338.0 | 694.0 | 0 |
| 4 | Rent | 776264.0 | 8 years | 0.0 | 13.0 | 13.6 | 385836.0 | 1.0 | NaN | 0.0 | debt consolidation | Short Term | 125840.0 | 93309.0 | 7180.0 | 719.0 | 0 |


Clean up what prevents us from feeding the data into the model: drop the `Months since last delinquent` column and all remaining rows with missing values.

In [6]:
df.drop('Months since last delinquent', axis = 1, inplace=True) 
df.dropna(inplace=True)
df.reset_index(drop=True, inplace=True)
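
Before dropping a column, it can help to check how much of it is actually missing. A minimal sketch on a toy frame (the `toy` data and its values are made up for illustration; on the real data the same expression would be applied to `df`):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the credit data (values are illustrative only)
toy = pd.DataFrame({
    'Annual Income': [482087.0, np.nan, 751412.0, 805068.0],
    'Months since last delinquent': [np.nan, np.nan, np.nan, 12.0],
    'Credit Default': [0, 1, 0, 0],
})

# Share of missing values per column, largest first
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)

# Drop the sparsest column, then any rows that still contain NaN
cleaned = (
    toy.drop(columns='Months since last delinquent')
       .dropna()
       .reset_index(drop=True)
)
print(cleaned.shape)
```

A column that is mostly empty is a reasonable candidate for removal, whereas dropping rows is cheaper when only a few values are missing.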

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5635 entries, 0 to 5634
Data columns (total 16 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Home Ownership             5635 non-null   object 
 1   Annual Income              5635 non-null   float64
 2   Years in current job       5635 non-null   object 
 3   Tax Liens                  5635 non-null   float64
 4   Number of Open Accounts    5635 non-null   float64
 5   Years of Credit History    5635 non-null   float64
 6   Maximum Open Credit        5635 non-null   float64
 7   Number of Credit Problems  5635 non-null   float64
 8   Bankruptcies               5635 non-null   float64
 9   Purpose                    5635 non-null   object 
 10  Term                       5635 non-null   object 
 11  Current Loan Amount        5635 non-null   float64
 12  Current Credit Balance     5635 non-null   float64
 13  Monthly Debt               5635 non-null   float

In [8]:
df['Credit Default'].value_counts()

0    4156
1    1479
Name: Credit Default, dtype: int64
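
These class counts give a feel for how contaminated the unlabeled set becomes in the PU setting from the homework. A rough worked calculation, assuming the full-dataset counts above (the experiment itself uses only the training split, so the exact numbers will differ); the helper `unlabeled_contamination` is hypothetical and introduced only for this illustration:

```python
def unlabeled_contamination(n_pos: int, n_neg: int, labeled_share: float) -> float:
    """Fraction of the unlabeled set U that is actually positive
    when only `labeled_share` of the positives are kept as labeled P."""
    hidden_pos = n_pos * (1.0 - labeled_share)
    return hidden_pos / (hidden_pos + n_neg)


# Class counts taken from value_counts() above
n_neg, n_pos = 4156, 1479

for share in (0.1, 0.5, 0.9):
    frac = unlabeled_contamination(n_pos, n_neg, share)
    print(f"labeled share {share:.1f}: {frac:.1%} of U are hidden positives")
```

The smaller the labeled share, the more true positives end up among the rows sampled as "negatives", which is one reason to expect quality to drop as P shrinks.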

Split the data into train and test sets.

In [9]:
X_train, X_test, y_train, y_test = train_test_split(df.drop('Credit Default', axis = 1), df['Credit Default'], test_size=0.2, random_state=7)

Write the preprocessing pipeline.

In [10]:
class FeatureSelector(BaseEstimator, TransformerMixin):
    def __init__(self, column):
        self.column = column

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return X[self.column]
    

class NumberSelector(BaseEstimator, TransformerMixin):

    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[[self.key]]
    
    
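class OHEEncoder(BaseEstimator, TransformerMixin):
    """One-hot encode a single column, keeping the set of columns seen in fit."""

    def __init__(self, key):
        self.key = key
        self.columns = []

    def fit(self, X, y=None):
        # Remember the dummy columns produced on the training data
        self.columns = list(pd.get_dummies(X, prefix=self.key, drop_first=True).columns)
        return self

    def transform(self, X):
        # Encode without drop_first: dropping here would remove whichever category
        # sorts first in this batch, which may differ from the one dropped in fit
        X = pd.get_dummies(X, prefix=self.key)
        # Align to the training columns: missing categories become 0, unseen ones are dropped
        return X.reindex(columns=self.columns, fill_value=0)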
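
The reason for storing the training columns is that a later batch may lack some categories or contain new ones. A minimal sketch of that alignment on toy data, using `DataFrame.reindex` as one possible way to do it (the `train` and `test` series are made up for illustration):

```python
import pandas as pd

train = pd.Series(['Rent', 'Own Home', 'Home Mortgage', 'Rent'], name='Home Ownership')
test = pd.Series(['Rent', 'Have Mortgage'], name='Home Ownership')

# Columns learned on the training data (first category dropped as the reference level)
train_cols = pd.get_dummies(train, prefix='Home Ownership', drop_first=True).columns

# Encode the new batch without dropping anything, then align to the training columns:
# categories missing from the batch become all-zero columns, unseen ones are discarded
encoded = (
    pd.get_dummies(test, prefix='Home Ownership')
      .reindex(columns=train_cols, fill_value=0)
)

print(list(encoded.columns))
print(encoded.astype(int).values.tolist())
```

Here `Have Mortgage` never appeared in training, so it is discarded, while `Own Home` is absent from the batch and becomes a column of zeros.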

Identify the numeric and categorical columns.

In [11]:
continuous_columns = X_train.select_dtypes(include='number').columns.to_list()
categorical_columns = X_train.select_dtypes(exclude='number').columns.to_list()

Build a transformer for each feature and combine them.

In [12]:
final_transformers = list()

for cat_col in categorical_columns:
    cat_transformer = Pipeline([
                ('selector', FeatureSelector(column=cat_col)),
                ('ohe', OHEEncoder(key=cat_col))
            ])
    
    final_transformers.append((cat_col, cat_transformer))
    
for cont_col in continuous_columns:
    cont_transformer = Pipeline([
                ('selector', NumberSelector(key=cont_col)),
                
            ])
    
    final_transformers.append((cont_col, cont_transformer))

In [13]:
feats = FeatureUnion(final_transformers)

feature_processing = Pipeline([('feats', feats)])

Use XGBoost as the classifier.

In [14]:
model = xgb.XGBClassifier(random_state=7)

Assemble the final pipeline.

In [15]:
pipeline = Pipeline([
    ('features', feats),
    ('classifier', model)
])

Train the baseline model.

In [16]:
pipeline.fit(X_train, y_train)

Pipeline(steps=[('features',
                 FeatureUnion(transformer_list=[('Home Ownership',
                                                 Pipeline(steps=[('selector',
                                                                  FeatureSelector(column='Home '
                                                                                         'Ownership')),
                                                                 ('ohe',
                                                                  OHEEncoder(key='Home '
                                                                                 'Ownership'))])),
                                                ('Years in current job',
                                                 Pipeline(steps=[('selector',
                                                                  FeatureSelector(column='Years '
                                                                                         'in '
                    

In [17]:
y_predict = pipeline.predict(X_test)

Collect the results in a table.

In [24]:
results = {'model' : [], 'f1' : [], 'recall' : [], 'precision' : [] }

In [25]:
results['model'].append('commonXGB')
results['f1'].append(f1_score(y_test, y_predict))
results['recall'].append(recall_score(y_test, y_predict, average='binary'))
results['precision'].append(precision_score(y_test, y_predict, average='binary'))
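
As a quick sanity check on how these metrics relate, F1 is the harmonic mean of precision and recall. A small sketch in plain Python (the helper name and the confusion counts are made up for illustration):

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute precision, recall and F1 for the positive class from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Illustrative counts: 8 true positives, 2 false positives, 4 false negatives
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(p, r, f1)

# The same formula applied to a precision/recall pair
baseline_f1 = 2 * 0.445977 * 0.638158 / (0.445977 + 0.638158)
print(round(baseline_f1, 6))
```

The second calculation uses the precision and recall reported for `commonXGB` in the results table and reproduces its F1 of about 0.525.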

In [26]:
samples = np.linspace(0.1, 1, 10)

In [27]:
for i in samples:
    # Attach the true labels to the training features
    mod_data = X_train.copy()
    mod_data['label'] = y_train
    mod_data = mod_data.reset_index(drop=True)

    # Positions of all true positives, in random order
    pos_ind = np.where(mod_data['label'].values == 1)[0]
    np.random.shuffle(pos_ind)

    # Keep only a share i of the positives as labeled (P); the rest become unlabeled (U)
    pos_sample_len = int(np.ceil(i * len(pos_ind)))
    pos_sample = pos_ind[:pos_sample_len]
    mod_data['class_test'] = -1
    mod_data.loc[pos_sample, 'class_test'] = 1

    # Shuffle the rows and split them into U (-1) and P (1)
    mod_data = mod_data.sample(frac=1)
    data_N = mod_data[mod_data['class_test'] == -1]
    data_P = mod_data[mod_data['class_test'] == 1]

    # Random negative sampling: take as many rows from U as there are in P
    # and treat them as negatives; the rest of U is left out of training
    neg_sample = data_N[:data_P.shape[0]]
    pos_sample = data_P.copy()

    sample_train = pd.concat([neg_sample, pos_sample]).sample(frac=1)
    sample_train.loc[sample_train['class_test'] == -1, 'class_test'] = 0
    X_sample_train = sample_train.drop(columns=['class_test', 'label'])
    y_sample_train = sample_train['class_test']

    # Refit the same pipeline on the PU sample and evaluate on the untouched test set
    pipeline.fit(X_sample_train, y_sample_train)
    y_predict = pipeline.predict(X_test)

    results['model'].append(f'commonXGB+RNS_{i:.1f}sample')
    results['f1'].append(f1_score(y_test, y_predict))
    results['recall'].append(recall_score(y_test, y_predict, average='binary'))
    results['precision'].append(precision_score(y_test, y_predict, average='binary'))
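
The loop above relies on the global NumPy random state, so the resulting table changes from run to run. One possible refactoring, sketched here under assumptions: a hypothetical helper `make_pu_training_set` that takes an explicit seed and returns the balanced training sample. The function name, its parameters and the toy data are not part of the original notebook.

```python
import numpy as np
import pandas as pd


def make_pu_training_set(X, y, labeled_share, seed=0):
    """Keep `labeled_share` of the positives as labeled (P), treat the rest as
    unlabeled (U), and draw an equally sized random sample from U as negatives."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)

    # Shuffle the positions of the true positives and keep the requested share
    pos_idx = rng.permutation(np.flatnonzero(y == 1))
    n_labeled = int(np.ceil(labeled_share * len(pos_idx)))
    labeled = pos_idx[:n_labeled]

    # Everything that is not a labeled positive is unlabeled, hidden positives included
    unlabeled = np.setdiff1d(np.arange(len(y)), labeled)
    n_neg = min(n_labeled, len(unlabeled))
    sampled_neg = rng.choice(unlabeled, size=n_neg, replace=False)

    # Shuffle the combined sample; target is 1 for labeled positives, 0 otherwise
    rows = rng.permutation(np.concatenate([labeled, sampled_neg]))
    pu_target = np.isin(rows, labeled).astype(int)
    return X.iloc[rows].reset_index(drop=True), pd.Series(pu_target, name='class_test')


# Toy data: 10 rows, 4 of them positive; the feature equals the row position
X_toy = pd.DataFrame({'feature': np.arange(10.0)})
y_toy = pd.Series([1, 0, 0, 1, 0, 1, 0, 0, 1, 0])

X_pu, y_pu = make_pu_training_set(X_toy, y_toy, labeled_share=0.5, seed=7)
print(X_pu.shape, y_pu.tolist())
```

Inside the loop, a call such as `pipeline.fit(*make_pu_training_set(X_train, y_train, i, seed=7))` could then replace the manual slicing, and averaging the metrics over several seeds would make the comparison between shares less noisy.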

In [28]:
pd.DataFrame(results)

| | model | f1 | recall | precision |
|---|---|---|---|---|
| 0 | commonXGB | 0.525034 | 0.638158 | 0.445977 |
| 1 | commonXGB+RNS_0.1sample | 0.461949 | 0.569079 | 0.388764 |
| 2 | commonXGB+RNS_0.2sample | 0.461738 | 0.585526 | 0.381156 |
| 3 | commonXGB+RNS_0.3sample | 0.485788 | 0.618421 | 0.4 |
| 4 | commonXGB+RNS_0.4sample | 0.485876 | 0.565789 | 0.425743 |
| 5 | commonXGB+RNS_0.5sample | 0.502717 | 0.608553 | 0.428241 |
| 6 | commonXGB+RNS_0.6sample | 0.484456 | 0.615132 | 0.399573 |
| 7 | commonXGB+RNS_0.7sample | 0.507227 | 0.634868 | 0.422319 |
| 8 | commonXGB+RNS_0.8sample | 0.488211 | 0.578947 | 0.422062 |
| 9 | commonXGB+RNS_0.9sample | 0.527591 | 0.644737 | 0.446469 |


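These are the results. With random negative sampling, F1 ranges from about 0.46 to 0.53 against 0.525 for the baseline. Quality tends to improve as the share of labeled positives grows, though not monotonically, and at a share of 0.9 it is on par with the baseline.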