# Course project for the "Machine Learning in Business" course

## Igor Leonidovich Sokovnin

### Step 1. Training the pipeline

1. Load the data from https://www.kaggle.com/c/654pds2courseproject/data
2. Assemble the pipeline
3. Train a CatBoostClassifier (or a logistic regression) and save the trained pipeline to disk

**Task**

Using the available data on bank clients, build a model on the training dataset that predicts default on the current loan. Then produce predictions for the examples in the test dataset.

**Target variable**

Credit Default - whether the client failed to meet the credit obligations

**Quality metric**

F1-score (sklearn.metrics.f1_score)
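
As a reminder of what this metric measures, here is a minimal sketch that computes F1 for the positive class directly from its definition (the harmonic mean of precision and recall). The helper `f1_from_labels` and the toy labels are illustrative assumptions, not part of the project; the notebook itself relies on `sklearn.metrics.f1_score`.

```python
def f1_from_labels(y_true, y_pred, positive=1):
    """F1 for the positive class, computed from true/false positives and false negatives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision or recall is zero, so F1 is zero
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels (hypothetical): 3 true positives, 1 false positive, 1 false negative
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

score = f1_from_labels(y_true, y_pred)
print(round(score, 3))  # 0.75
```

Because F1 ignores true negatives, it is a common choice when the positive class (here, default) is the minority.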

Based on the course project from the "Python Libraries for Data Science: Continuation II" course<br>
https://www.kaggle.com/igorsokovnin/ilssokovnin-solution

## Importing libraries and scripts


In [1]:
# !pip install catboost

In [2]:
# !pip install dill

In [3]:
import pandas as pd
import dill
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.base import BaseEstimator, TransformerMixin
# 4. Models
import catboost as catb
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
from sklearn.metrics import roc_auc_score, roc_curve  # importing `scorer` as well raised an error
from sklearn.metrics import f1_score
#working with text
from sklearn.feature_extraction.text import TfidfVectorizer
#normalizing data
from sklearn.preprocessing import StandardScaler
#pipeline
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.metrics import precision_score,recall_score
#imputer
from sklearn.impute import SimpleImputer

import sklearn.datasets

## Paths to directories and files

In [4]:
# TRAIN_DATASET_PATH = './data/course_project/course_project_train.csv'
# TRAIN_DATASET_PATH = './data/course_project_train.csv'
TRAIN_DATASET_PATH = './course_project_train.csv'

# Step 1. Loading the data

**Dataset description**
1. Home Ownership - home ownership status
2. Annual Income - annual income
3. Years in current job - number of years at the current job
4. Tax Liens - tax liens
5. Number of Open Accounts - number of open accounts
6. Years of Credit History - length of the credit history in years
7. Maximum Open Credit - largest open credit
8. Number of Credit Problems - number of credit problems
9. Months since last delinquent - number of months since the last overdue payment
10. Bankruptcies - bankruptcies
11. Purpose - purpose of the loan
12. Term - loan term
13. Current Loan Amount - current loan amount
14. Current Credit Balance - current credit balance
15. Monthly Debt - monthly debt
16. Credit Score - credit score
17. Credit Default - whether the credit obligations were not met (0 - repaid on time, 1 - overdue)

In [5]:
# Training data
df_train = pd.read_csv(TRAIN_DATASET_PATH)
df_train.head(3)

Unnamed: 0,Home Ownership,Annual Income,Years in current job,Tax Liens,Number of Open Accounts,Years of Credit History,Maximum Open Credit,Number of Credit Problems,Months since last delinquent,Bankruptcies,Purpose,Term,Current Loan Amount,Current Credit Balance,Monthly Debt,Credit Score,Credit Default
0,Own Home,482087.0,,0.0,11.0,26.3,685960.0,1.0,,1.0,debt consolidation,Short Term,99999999.0,47386.0,7914.0,749.0,0
1,Own Home,1025487.0,10+ years,0.0,15.0,15.3,1181730.0,0.0,,0.0,debt consolidation,Long Term,264968.0,394972.0,18373.0,737.0,1
2,Home Mortgage,751412.0,8 years,0.0,11.0,35.0,1182434.0,0.0,,0.0,debt consolidation,Short Term,99999999.0,308389.0,13651.0,742.0,0
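
The preview already shows missing values (for example in `Years in current job` and `Months since last delinquent`). A quick way to see how many values each column lacks is sketched below. The frame `toy` is a small hypothetical stand-in for `df_train`; only the column names are taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the training frame
toy = pd.DataFrame({
    "Annual Income": [482087.0, np.nan, 751412.0, np.nan],
    "Months since last delinquent": [np.nan, np.nan, np.nan, 12.0],
    "Credit Score": [749.0, 737.0, 742.0, np.nan],
})

missing_count = toy.isna().sum()    # number of missing values per column
missing_share = toy.isna().mean()   # fraction of missing values per column

print(pd.DataFrame({"count": missing_count, "share": missing_share}))
```

Columns with a large share of missing values are candidates for dropping, while the rest can be imputed.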


## Type casting

In [6]:
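# Treat these count-like columns as categories by casting them to strings
# (transform() below expects missing values to appear as the string 'nan')
for colname in ['Tax Liens', 'Number of Credit Problems', 'Bankruptcies']:
    df_train[colname] = df_train[colname].astype(str)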

## Correlation with the base features

### Data preparation class

In [7]:
class DataPipeLine:
    """Preparation of the raw data"""
    
    def __init__(self):
        """Class parameters:
           constants for outlier handling"""
        
        self.medians = None
        self.modes = None
       
        self.AnnualIncome_min = 165000
        self.AnnualIncome_max = 4000000
        
        self.YearsofCreditHistory_max = 40
        
        self.MaximumOpenCredit_min = 50000
        self.MaximumOpenCredit_max = 4000000
       
        self.MonthsSinceLastDelinquent_max = 83
        self.CurrentLoanAmount_max = 1000000
        self.CurrentCreditBalance_max = 1300000
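        self.MonthlyDebt_max = 55000
       
        self.MonthlyDebt_min = 585
        # Note: this assignment overrides the 55000 set above,
        # so Monthly Debt is effectively capped at 7510
        self.MonthlyDebt_max = 7510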
        
        
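    def fit(self, df):
        """Store the statistics needed by transform()"""
        
        # Medians used to fill missing values (taken from the frame passed in, not a global)
        self.medians = df[['Annual Income', 'Credit Score']].median()
        # Median loan amount over the rows below the outlier threshold
        # (the attribute is named `modes`, but it holds a median)
        loans = df.loc[df['Current Loan Amount'] < self.CurrentLoanAmount_max, ['Current Loan Amount']]
        self.modes = loans[['Current Loan Amount']].median()
        return self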
    
    
    def transform(self, df):
        """Transform the data (modifies df in place and returns it)"""
        
        # 1. Missing values
        #df_train = df_train.fillna(median)
        
        df[['Annual Income', 'Credit Score']] = df[['Annual Income', 'Credit Score']].fillna(self.medians)

        # Months since last delinquent
        # 3581 of 7500 values are missing, so the column is dropped
        if 'Months since last delinquent' in df.columns:
            # df = df.drop(['Months since last delinquent'], axis=1)
            df.drop('Months since last delinquent', axis=1, inplace=True)
        
        # Years in current job
        # 'неизвестно' means 'unknown'; the label is kept as-is because features() matches on it
        cat_colname = 'Years in current job'
        df[cat_colname] = df[cat_colname].replace(to_replace = np.nan, value = 'неизвестно')

        # 2. Outliers
        
        # Annual Income
        df.loc[df['Annual Income'] < self.AnnualIncome_min, 'Annual Income'] = self.AnnualIncome_min
        df.loc[df['Annual Income'] >= self.AnnualIncome_max, 'Annual Income'] = self.AnnualIncome_max
        
        # Years of Credit History
        df.loc[df['Years of Credit History'] >= self.YearsofCreditHistory_max, 'Years of Credit History'] = self.YearsofCreditHistory_max
        
        # Maximum Open Credit
        df.loc[df['Maximum Open Credit'] < self.MaximumOpenCredit_min, 'Maximum Open Credit'] = self.MaximumOpenCredit_min
        df.loc[df['Maximum Open Credit'] >= self.MaximumOpenCredit_max, 'Maximum Open Credit'] = self.MaximumOpenCredit_max
      
        # Current Loan Amount: values at or above the threshold are replaced
        # with the median stored in fit()
        df.loc[df['Current Loan Amount'] >= self.CurrentLoanAmount_max, 'Current Loan Amount'] = self.modes['Current Loan Amount']
        
        # Current Credit Balance
        df.loc[df['Current Credit Balance'] >= self.CurrentCreditBalance_max, 'Current Credit Balance'] = self.CurrentCreditBalance_max

        # Monthly Debt: clip to [MonthlyDebt_min, MonthlyDebt_max]
        df.loc[df['Monthly Debt'] < self.MonthlyDebt_min, 'Monthly Debt'] = self.MonthlyDebt_min
        df.loc[df['Monthly Debt'] >= self.MonthlyDebt_max, 'Monthly Debt'] = self.MonthlyDebt_max
        
        # 3. Categories
        colname = 'Bankruptcies'
        df[colname] = df[colname].replace(to_replace = 'nan', value = '0.0')
    
        return df

    
    def features(self, df):
        """4. Feature engineering
              Generating new features"""
        
        # 1. Home Ownership
        cat_colname = 'Home_Ownership_int'

        df[cat_colname] = df['Home Ownership']
        df.loc[df[cat_colname] == 'Have Mortgage', cat_colname] = 0
        df.loc[df[cat_colname] == 'Own Home', cat_colname] = 1
        df.loc[df[cat_colname] == 'Rent', cat_colname] = 2
        df.loc[df[cat_colname] == 'Home Mortgage', cat_colname] = 3
        
        # 3. 'Years in current job' (ordinal data)
        cat_colname = 'Years_in_current_job_int'

        df[cat_colname] = df['Years in current job']
        df.loc[df[cat_colname] == '< 1 year', cat_colname] = 0
        df.loc[df[cat_colname] == '1 year', cat_colname] = 1
        df.loc[df[cat_colname] == '2 years', cat_colname] = 2
        df.loc[df[cat_colname] == '3 years', cat_colname] = 3
        df.loc[df[cat_colname] == '4 years', cat_colname] = 4
        df.loc[df[cat_colname] == '5 years', cat_colname] = 5
        df.loc[df[cat_colname] == '6 years', cat_colname] = 6
        df.loc[df[cat_colname] == '7 years', cat_colname] = 7
        df.loc[df[cat_colname] == '8 years', cat_colname] = 8
        df.loc[df[cat_colname] == '9 years', cat_colname] = 9
        df.loc[df[cat_colname] == '10+ years', cat_colname] = 10
        df.loc[df[cat_colname] == 'неизвестно', cat_colname] = 11

        # 11. Purpose - purpose of the loan (ordinal data)
        cat_colname = 'Purpose_int'

        df[cat_colname] = df['Purpose']
        df.loc[df[cat_colname] == 'renewable energy', cat_colname] = 0
        df.loc[df[cat_colname] == 'vacation', cat_colname] = 1
        df.loc[df[cat_colname] == 'educational expenses', cat_colname] = 2
        df.loc[df[cat_colname] == 'moving', cat_colname] = 3
        df.loc[df[cat_colname] == 'wedding', cat_colname] = 4
        df.loc[df[cat_colname] == 'small business', cat_colname] = 5
        df.loc[df[cat_colname] == 'buy house', cat_colname] = 6
        df.loc[df[cat_colname] == 'take a trip', cat_colname] = 7
        df.loc[df[cat_colname] == 'major purchase', cat_colname] = 8
        df.loc[df[cat_colname] == 'medical bills', cat_colname] = 9
        df.loc[df[cat_colname] == 'buy a car', cat_colname] = 10
        df.loc[df[cat_colname] == 'business loan', cat_colname] = 11
        df.loc[df[cat_colname] == 'home improvements', cat_colname] = 12
        df.loc[df[cat_colname] == 'other', cat_colname] = 13
        df.loc[df[cat_colname] == 'debt consolidation', cat_colname] = 14

        # 12. Term - loan term (nominal data)
        cat_colname = 'Term_int'

        df[cat_colname] = df['Term']
        df.loc[df[cat_colname] == 'Long Term', cat_colname] = 0
        df.loc[df[cat_colname] == 'Short Term', cat_colname] = 1

       
        numbers = ['0.0', '1.0', '2.0', '3.0', '4.0', '5.0', '6.0', '7.0', '8.0', '9.0']
        numbers_int = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
        
        # Adding features: integer copies of the count-like string columns
        colnames_new = ['Tax_Liens_int', 'Number_of_Credit_Problems_int', 'Bankruptcies_int']
        colnames = ['Tax Liens', 'Number of Credit Problems', 'Bankruptcies']
        
        for i in range(len(colnames_new)):
            df[colnames_new[i]] = df[colnames[i]]
            for j in range(len(numbers)): 
                df.loc[df[colnames_new[i]] == numbers[j], colnames_new[i]] = numbers_int[j]
       
        # Cast the encoded categories to int8 (on the frame passed in, not a global)
        for colname in ['Home_Ownership_int', 'Years_in_current_job_int', 'Purpose_int', 'Term_int']:
            df[colname] = df[colname].astype('int8')
        for colname in colnames_new:
            df[colname] = df[colname].astype('int8')
       
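        # 16. Credit Score: split into two features.
        # CreditScore_small keeps scores in [600, 2000], CreditScore_large keeps scores in [3000, 9000];
        # values outside the respective range are set to 0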
        df['CreditScore_small'] = df['Credit Score']
        df['CreditScore_large'] = df['Credit Score']

        df.loc[df['Credit Score'] > 2000, 'CreditScore_small'] = 0.0
        df.loc[df['Credit Score'] < 600, 'CreditScore_small'] = 0.0

        df.loc[df['Credit Score'] < 3000, 'CreditScore_large'] =  0.0
        df.loc[df['Credit Score'] > 9000, 'CreditScore_large'] = 0.0
        
        return df 
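
The capping and encoding steps in the class are written as chains of `df.loc` assignments. As an illustration only, the same ideas can be expressed more compactly with `Series.clip` and `Series.map`. The frame `toy` and the dictionary `term_codes` are hypothetical; the bounds mirror `AnnualIncome_min` and `AnnualIncome_max` from the class. This is a sketch of an alternative, not the code the notebook runs.

```python
import pandas as pd

toy = pd.DataFrame({
    "Annual Income": [100_000.0, 750_000.0, 9_000_000.0],
    "Term": ["Long Term", "Short Term", "Short Term"],
})

# Cap values at a lower and an upper bound in one call
toy["Annual Income"] = toy["Annual Income"].clip(lower=165_000, upper=4_000_000)

# Encode categories through a dictionary lookup.
# Unlike the df.loc chain, labels missing from the dictionary become NaN,
# so the dictionary has to cover every category before casting to int8.
term_codes = {"Long Term": 0, "Short Term": 1}
toy["Term_int"] = toy["Term"].map(term_codes).astype("int8")

print(toy)
```

One small difference to keep in mind: `clip` leaves a value equal to the bound unchanged, which gives the same result as the `>=` comparisons used in the class.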

## Initializing the class

In [8]:
data_pl = DataPipeLine()

data_pl.fit(df_train)

# training data
df = data_pl.transform(df_train)
df = data_pl.features(df)
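
The point of separating `fit` from `transform` is that statistics are learned once and then reused on any other frame. A minimal sketch of that idea for median imputation follows; `train` and `new_clients` are hypothetical toy frames, and only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({
    "Annual Income": [400_000.0, np.nan, 800_000.0, 600_000.0],
    "Credit Score": [700.0, 720.0, np.nan, 740.0],
})
new_clients = pd.DataFrame({
    "Annual Income": [np.nan, 500_000.0],
    "Credit Score": [np.nan, 650.0],
})

# "fit": learn the medians from the training frame only (NaN values are skipped)
medians = train.median()

# "transform": fill gaps in another frame with the stored medians, column by column
filled = new_clients.fillna(medians)

print(filled)
```

The values already present in `new_clients` are left untouched; only the gaps receive the training medians.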

## Scaling the data

In [9]:
NUM_FEATURE_NAMES = [
                 'Annual Income',
                 'Number of Open Accounts',
                 'Years of Credit History',
                 'Maximum Open Credit',
                 'Current Loan Amount',
                 'Current Credit Balance',
                 'Monthly Debt',
                 #'Credit Score',
                 #'Credit Default',
                 'CreditScore_small',
                 'CreditScore_large']

In [10]:
scaler = StandardScaler()

df_norm = df.copy()
df_norm[NUM_FEATURE_NAMES] = scaler.fit_transform(df_norm[NUM_FEATURE_NAMES])
df = df_norm.copy()
df.head(3)

Unnamed: 0,Home Ownership,Annual Income,Years in current job,Tax Liens,Number of Open Accounts,Years of Credit History,Maximum Open Credit,Number of Credit Problems,Bankruptcies,Purpose,...,Credit Default,Home_Ownership_int,Years_in_current_job_int,Purpose_int,Term_int,Tax_Liens_int,Number_of_Credit_Problems_int,Bankruptcies_int,CreditScore_small,CreditScore_large
0,Own Home,-1.276531,неизвестно,0.0,-0.026674,1.168421,0.079322,1.0,1.0,debt consolidation,...,0,1,11,14,1,0,1,1,0.393842,-0.237124
1,Own Home,-0.435285,10+ years,0.0,0.788223,-0.432336,0.922738,0.0,0.0,debt consolidation,...,1,1,10,14,0,0,0,0,0.321908,-0.237124
2,Home Mortgage,-0.859585,8 years,0.0,-0.026674,2.434474,0.923936,0.0,0.0,debt consolidation,...,0,3,8,14,1,0,0,0,0.351881,-0.237124
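
For reference, `StandardScaler` subtracts the column mean and divides by the column standard deviation (the population version, `ddof=0`). The sketch below checks this on a hypothetical one-column array; it is an illustration of the transformation, not part of the project code.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[1.0], [2.0], [3.0], [4.0]])

scaled = StandardScaler().fit_transform(x)

# The same result by hand; np.std uses ddof=0 by default
manual = (x - x.mean(axis=0)) / x.std(axis=0)

print(np.allclose(scaled, manual))  # True
```

After scaling, each numeric column has mean 0 and standard deviation 1, which is why the values in the preview above are small and centered around zero.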


# 5. Feature selection

In [11]:
CAT_FEATURE_NAMES = [
                 'Home_Ownership_int',
                 'Years_in_current_job_int',
                 'Purpose_int',
                 'Term_int',
                 'Tax_Liens_int',
                 'Number_of_Credit_Problems_int']

In [12]:
NUM_FEATURE_NAMES = [
                 'Annual Income',
                 'Number of Open Accounts',
                 'Years of Credit History',
                 'Maximum Open Credit',
                 'Current Loan Amount',
                 'Current Credit Balance',
                 'Monthly Debt',
                 'CreditScore_small',
                 'CreditScore_large']


In [13]:
SELECTED_FEATURE_NAMES = NUM_FEATURE_NAMES + CAT_FEATURE_NAMES

In [14]:
TARGET_NAME = 'Credit Default'

In [15]:
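# Note: the NUM_FEATURE_NAMES columns were already standardized in cell In [10];
# standardizing them again leaves the values practically unchanged
scaler = StandardScaler()

df_norm = df.copy()
df_norm[NUM_FEATURE_NAMES] = scaler.fit_transform(df_norm[NUM_FEATURE_NAMES])
df = df_norm.copy()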

# 4. Train/test split

Split the data into train and test sets and save the test set to disk (it is not touched again in this notebook).

In [16]:
X = df[SELECTED_FEATURE_NAMES]
y = df[TARGET_NAME]

X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    shuffle=True,
                                                    test_size=0.3,
                                                    random_state=21,
                                                    stratify=y)
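
The `stratify=y` argument keeps the share of each class the same in both parts, which matters for an imbalanced target. A small sketch on hypothetical data (70 zeros and 30 ones) with the same split parameters:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 70% class 0, 30% class 1
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 70 + [1] * 30)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy,
                                          shuffle=True,
                                          test_size=0.3,
                                          random_state=21,
                                          stratify=y_toy)

train_share = y_tr.mean()  # share of class 1 in the train part
test_share = y_te.mean()   # share of class 1 in the test part

print(len(y_tr), len(y_te), train_share, test_share)
```

Without stratification the class shares in the two parts would only match approximately, and F1 on a small test set could shift noticeably because of that alone.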

In [17]:
#save test
X_test.to_csv("X_test.csv", index=None)
y_test.to_csv("y_test.csv", index=None)

#save train
X_train.to_csv("X_train.csv", index=None)
y_train.to_csv("y_train.csv", index=None)

In [18]:
class ColumnSelector(BaseEstimator, TransformerMixin):
    """
    Transformer to select a single column from the data frame to perform additional transformations on
    """
    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[[self.key]]  # double brackets keep a 2-D one-column DataFrame; X[self.key] would be a 1-D Series
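
The double brackets in `transform` matter because `FeatureUnion` stacks the outputs of its transformers side by side, which requires each output to be two-dimensional. The difference is easy to see on a hypothetical toy frame:

```python
import pandas as pd

X_demo = pd.DataFrame({"Term_int": [0, 1, 1], "Purpose_int": [14, 14, 5]})

as_frame = X_demo[["Term_int"]]   # one-column DataFrame, shape (3, 1)
as_series = X_demo["Term_int"]    # Series, shape (3,)

print(as_frame.shape, as_series.shape)  # (3, 1) (3,)
```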

In [19]:
target = TARGET_NAME

Let's assemble the part responsible for feature engineering.

['Annual Income',
 'Number of Open Accounts',
 'Years of Credit History',
 'Maximum Open Credit',
 'Current Loan Amount',
 'Current Credit Balance',
 'Monthly Debt',
 'CreditScore_small',
 'CreditScore_large',
 
 'Home_Ownership_int',
 
 'Years_in_current_job_int',
 'Purpose_int',
 'Term_int',
 'Tax_Liens_int',
 'Number_of_Credit_Problems_int']

In [20]:
# combine v. 0.0.0.1
annual_income = Pipeline([
                ('selector', ColumnSelector(key='Annual Income'))
                ])

number_of_open_accounts = Pipeline([
                ('selector', ColumnSelector(key='Number of Open Accounts'))
                ])
years_of_credit_history = Pipeline([
                ('selector', ColumnSelector(key='Years of Credit History'))
                ])
maximum_open_credit = Pipeline([
                ('selector', ColumnSelector(key='Maximum Open Credit'))
                ])
current_loan_amount = Pipeline([
                ('selector', ColumnSelector(key='Current Loan Amount'))
                ])
current_credit_balance = Pipeline([
                ('selector', ColumnSelector(key='Current Credit Balance'))
                ])
monthly_debt = Pipeline([
                ('selector', ColumnSelector(key='Monthly Debt'))
                ])
creditScore_small = Pipeline([
                ('selector', ColumnSelector(key='CreditScore_small'))
                ])
creditScore_large = Pipeline([
                ('selector', ColumnSelector(key='CreditScore_large'))
                ])
home_ownership_int = Pipeline([
                ('selector', ColumnSelector(key='Home_Ownership_int'))
                ])
years_in_current_job_int = Pipeline([
                ('selector', ColumnSelector(key='Years_in_current_job_int'))
                ])
purpose_int = Pipeline([
                ('selector', ColumnSelector(key='Purpose_int'))
                ])
term_int = Pipeline([
                ('selector', ColumnSelector(key='Term_int'))
                ])

tax_liens_int = Pipeline([
                ('selector', ColumnSelector(key='Tax_Liens_int'))
                ])

number_of_credit_problems_int = Pipeline([
                ('selector', ColumnSelector(key='Number_of_Credit_Problems_int'))
                ])

feats = FeatureUnion([
                      ('Annual Income', annual_income),
                      ('Number of Open Accounts', number_of_open_accounts),
                      ('Years of Credit History', years_of_credit_history),
                      ('Maximum Open Credit', maximum_open_credit),                     
                      ('Current Loan Amount', current_loan_amount),
                      ('Current Credit Balance', current_credit_balance),                      
                      ('Monthly Debt', monthly_debt),
                      ('CreditScore_small', creditScore_small),
                      ('CreditScore_large', creditScore_large),
                      
                      ('Home_Ownership_int', home_ownership_int),
                      ('Years_in_current_job_int', years_in_current_job_int),
                      ('Purpose_int', purpose_int),
                      ('Term_int', term_int),
                      ('Tax_Liens_int', tax_liens_int),
                      ('Number_of_Credit_Problems_int', number_of_credit_problems_int)
                     ])
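
Since every branch above only selects a single column, the same union could in principle be built in a loop over the feature names. The sketch below is one possible shorter form, under the assumption that no per-column preprocessing is needed: it passes `ColumnSelector` to `FeatureUnion` directly instead of wrapping each one in a `Pipeline`. The names `feature_names`, `feats_compact`, and the frame `toy` are hypothetical.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion


class ColumnSelector(BaseEstimator, TransformerMixin):
    """Select a single column and return it as a one-column DataFrame."""
    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[[self.key]]


# Short hypothetical list; the notebook would use SELECTED_FEATURE_NAMES here
feature_names = ["Annual Income", "Monthly Debt", "Term_int"]

feats_compact = FeatureUnion(
    [(name, ColumnSelector(key=name)) for name in feature_names]
)

toy = pd.DataFrame({
    "Annual Income": [0.1, -0.4, 0.9],
    "Monthly Debt": [1.2, 0.0, -0.7],
    "Term_int": [1, 0, 1],
    "Unused column": [5, 6, 7],
})

stacked = feats_compact.fit_transform(toy)
print(stacked.shape)  # (3, 3)
```

Keeping a separate `Pipeline` per column, as in the cell above, pays off once individual columns need their own steps (for example, an imputer or a scaler).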

In [21]:
# # combine v. 0.0.0.0 - working version

# purpose_int = Pipeline([
#                 ('selector', ColumnSelector(key='Purpose_int'))
#                 ])
# term_int = Pipeline([
#                 ('selector', ColumnSelector(key='Term_int'))
#                 ])

# tax_liens_int = Pipeline([
#                 ('selector', ColumnSelector(key='Tax_Liens_int'))
#                 ])

# number_of_credit_problems_int = Pipeline([
#                 ('selector', ColumnSelector(key='Number_of_Credit_Problems_int'))
#                 ])

# feats = FeatureUnion([
#                       ('Purpose_int', purpose_int),
#                       ('Term_int', term_int),
#                       ('Tax_Liens_int', tax_liens_int),
#                       ('Number_of_Credit_Problems_int', number_of_credit_problems_int)
#                      ])

# 5. Building the model

In [22]:
disbalance = y_train.value_counts()[0] / y_train.value_counts()[1]
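
`disbalance` is the ratio of the majority class count to the minority class count; in the notebook it is passed to CatBoost through `class_weights`. The sketch below reproduces the computation on a hypothetical target and shows the analogous `class_weight` parameter of `LogisticRegression`. Note that the logistic regression trained in this notebook does not use class weights, so the weighted model here is an assumption for illustration only.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical target: 7 non-defaults and 3 defaults
y_toy = pd.Series([0] * 7 + [1] * 3, name="Credit Default")

counts = y_toy.value_counts()
ratio = counts[0] / counts[1]   # how many times class 0 outnumbers class 1

# Give the minority class proportionally more weight
class_weight = {0: 1.0, 1: ratio}
weighted_model = LogisticRegression(class_weight=class_weight)

print(round(ratio, 3))  # 2.333
```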

In [23]:
final_model = LogisticRegression()

In [24]:
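# Parameters for the CatBoostClassifier variant (commented out below);
# they are not used by LogisticRegression
frozen_params = {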
     'class_weights':[1, disbalance], 
     'silent':True,
     'random_state':21,
     # 'cat_features': CAT_FEATURE_NAMES,
     'eval_metric':'F1',
     'early_stopping_rounds':50
}

In [25]:
# final_model = catb.CatBoostClassifier(**frozen_params,
#                                       iterations=200,
#                                       max_depth=7,
#                                       #reg_lambda=0.5
#                                      )
# # final_model.fit(X_train, y_train, plot=True, eval_set=(X_test, y_test))

In [26]:
%%time

pipeline = Pipeline([
    ('features',feats),
    ('classifier',  final_model),
])

pipeline.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


CPU times: user 476 ms, sys: 112 ms, total: 588 ms
Wall time: 1.17 s


Pipeline(memory=None,
         steps=[('features',
                 FeatureUnion(n_jobs=None,
                              transformer_list=[('Annual Income',
                                                 Pipeline(memory=None,
                                                          steps=[('selector',
                                                                  ColumnSelector(key='Annual '
                                                                                     'Income'))],
                                                          verbose=False)),
                                                ('Number of Open Accounts',
                                                 Pipeline(memory=None,
                                                          steps=[('selector',
                                                                  ColumnSelector(key='Number '
                                                                                     'of '
            

Let's look at our pipeline.

In [27]:
pipeline.steps

[('features',
  FeatureUnion(n_jobs=None,
               transformer_list=[('Annual Income',
                                  Pipeline(memory=None,
                                           steps=[('selector',
                                                   ColumnSelector(key='Annual '
                                                                      'Income'))],
                                           verbose=False)),
                                 ('Number of Open Accounts',
                                  Pipeline(memory=None,
                                           steps=[('selector',
                                                   ColumnSelector(key='Number of '
                                                                      'Open '
                                                                      'Accounts'))],
                                           verbose=False)),
                                 ('Years of Credit History',
                  

Save the model (pipeline).

In [28]:
with open("logreg_pipeline_v8.dill", "wb") as f:
    dill.dump(pipeline, f)
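
A saved pipeline is only useful if it gives the same predictions after loading. The sketch below illustrates such a round trip. To stay free of extra dependencies it uses the standard `pickle` module and an in-memory buffer, whereas the notebook uses `dill` and a file; `dill.dump` and `dill.load` follow the same call pattern. The toy data and the names `toy_pipeline` and `restored` are hypothetical.

```python
import io
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: the label depends on the sign of the first feature
rng = np.random.default_rng(21)
X_toy = rng.normal(size=(40, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)

toy_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression()),
])
toy_pipeline.fit(X_toy, y_toy)

# Serialize to a buffer and read the pipeline back
buffer = io.BytesIO()
pickle.dump(toy_pipeline, buffer)
buffer.seek(0)
restored = pickle.load(buffer)

same_predictions = np.array_equal(toy_pipeline.predict(X_toy),
                                  restored.predict(X_toy))
print(same_predictions)  # True
```

When the pipeline contains classes defined inside the notebook (such as `ColumnSelector`), `dill` is the more convenient choice because it can serialize such definitions along with the object.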

In [29]:
# with open("catboost_pipeline.dill", "wb") as f:
#     dill.dump(pipeline, f)