## <center>Course Project<a class="anchor" id="course_project"></a></center>

### Data Overview<a class="anchor" id="course_project_review"></a>

**Dataset description**

* **Home Ownership** - home ownership status
* **Annual Income** - annual income
* **Years in current job** - number of years at the current job
* **Tax Liens** - tax liens
* **Number of Open Accounts** - number of open accounts
* **Years of Credit History** - length of credit history in years
* **Maximum Open Credit** - maximum open credit
* **Number of Credit Problems** - number of credit problems
* **Months since last delinquent** - number of months since the last payment delinquency
* **Bankruptcies** - bankruptcies
* **Purpose** - loan purpose
* **Term** - loan term
* **Current Loan Amount** - current loan amount
* **Current Credit Balance** - current credit balance
* **Monthly Debt** - monthly debt
* **Credit Score** - credit score (used as a feature in the code below)
* **Credit Default** - whether the credit obligations were defaulted on (0 - repaid on time, 1 - delinquency)

**Libraries used**

In [None]:
!pip install catboost

In [None]:
import seaborn as sns

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import pickle

from scipy.stats.mstats import winsorize

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, f1_score, precision_score, recall_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

import xgboost as xgb
import lightgbm as lgbm
import catboost as catb

from catboost import Pool, cv

**Helper functions**

In [None]:
def get_classification_report(y_train_true, y_train_pred, y_test_true, y_test_pred):
    print('TRAIN\n\n' + classification_report(y_train_true, y_train_pred))
    print('TEST\n\n' + classification_report(y_test_true, y_test_pred))
    print('CONFUSION MATRIX\n')
    print(pd.crosstab(y_test_true, y_test_pred))
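
To make the `CONFUSION MATRIX` output easier to read, here is a minimal sketch on made-up labels (the values are invented for illustration and do not come from the dataset). Rows are the true classes and columns are the predicted classes.

```python
import pandas as pd

# Toy labels, invented for illustration only
y_true = pd.Series([0, 0, 1, 1, 1], name='true')
y_pred = pd.Series([0, 1, 1, 1, 0], name='pred')

# Rows: true class, columns: predicted class
cm = pd.crosstab(y_true, y_pred)
print(cm)

tp = cm.loc[1, 1]  # defaults that were correctly flagged
fn = cm.loc[1, 0]  # defaults that were missed
fp = cm.loc[0, 1]  # non-defaults flagged as defaults

print('recall for class 1:', tp / (tp + fn))
print('precision for class 1:', tp / (tp + fp))
```

The same cells feed the precision and recall columns that `classification_report` prints for class 1.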

In [None]:
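def balance_df_by_target(df, target_name):
    """Oversample the minority class until the classes are roughly balanced."""
    target_counts = df[target_name].value_counts()

    # idxmax/idxmin return the class labels; argmax/argmin return positions
    major_class_name = target_counts.idxmax()
    minor_class_name = target_counts.idxmin()

    disbalance_coeff = int(target_counts[major_class_name] / target_counts[minor_class_name]) - 1

    for i in range(disbalance_coeff):
        sample = df[df[target_name] == minor_class_name].sample(target_counts[minor_class_name])
        # DataFrame.append was removed in pandas 2.0, so concatenate instead
        df = pd.concat([df, sample], ignore_index=True)

    # shuffle the rows so the added samples are not grouped at the end
    return df.sample(frac=1)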
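
As a rough illustration of what this kind of oversampling does, the sketch below applies a simplified variant to a toy frame. The helper name `oversample_minority`, the `random_state` argument, and the toy data are assumptions made for this example; unlike the function above, it always samples from the original minority rows.

```python
import pandas as pd

def oversample_minority(df, target_name, random_state=0):
    """Hypothetical helper: repeat minority-class rows until classes are roughly equal."""
    counts = df[target_name].value_counts()
    major, minor = counts.idxmax(), counts.idxmin()
    n_copies = int(counts[major] / counts[minor]) - 1

    minority_rows = df[df[target_name] == minor]
    extra = [minority_rows.sample(counts[minor], random_state=random_state + i)
             for i in range(n_copies)]

    out = pd.concat([df] + extra, ignore_index=True)
    # Shuffle so the duplicated rows are not grouped at the end
    return out.sample(frac=1, random_state=random_state)

# Toy frame: six rows of class 0 and two rows of class 1
toy = pd.DataFrame({'x': range(8), 'Credit Default': [0, 0, 0, 0, 0, 0, 1, 1]})
balanced = oversample_minority(toy, 'Credit Default')
print(balanced['Credit Default'].value_counts())
```

Because the added rows are exact duplicates, oversampling is usually applied to the training split only, as the notebook does later.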

**Paths to directories and files**

In [None]:
TRAIN_DATASET_PATH = 'course_project_train.csv'
MODEL_FILE_PATH = 'model.pkl'

**Loading the data**

In [None]:
df_train = pd.read_csv(TRAIN_DATASET_PATH)
df_train.head()

In [None]:
df_train.shape

In [None]:
df_train.columns

In [None]:
base_features = ['Home Ownership', 'Annual Income', 'Years in current job', 'Tax Liens',
       'Number of Open Accounts', 'Years of Credit History',
       'Maximum Open Credit', 'Number of Credit Problems',
       'Months since last delinquent', 'Bankruptcies', 'Purpose', 'Term',
       'Current Loan Amount', 'Current Credit Balance', 'Monthly Debt',
       'Credit Score']
X = df_train[base_features]
X.shape

In [None]:
target_name = 'Credit Default'
y = df_train[target_name]
y.shape

**Preparing a class for handling missing values and outliers**

In [None]:
class Model:
    def __init__(self):
        self.medians = None
        self.scaler = None
        
    def fit(self, X):
        # the fill values learned on the training set are reused on the test data;
        # despite the name, `medians` holds each column's mode (most frequent value)
        if self.medians is None:
            self.medians = {}
            for colname in X.columns:
                self.medians[colname] = X[colname].mode()[0]
        print(self.medians)
        
    def scale(self, X):
        
        X_norm = X.copy()
        num_f= ['Annual Income', 'Tax Liens',
           'Number of Open Accounts', 'Years of Credit History',
           'Maximum Open Credit', 'Number of Credit Problems', 'Bankruptcies',
            'Current Loan Amount', 'Current Credit Balance',
           'Monthly Debt', 'Credit Score']
        
        if self.scaler is None: 
            scaler = StandardScaler()

            X_norm[num_f] = scaler.fit_transform(X_norm[num_f])
            
            self.scaler = scaler
        else:
            X_norm[num_f] = self.scaler.transform(X_norm[num_f])
            
        return X_norm    
    
    def nan_correction(self, X):
        
        X = X.copy()
        
        # drop the unneeded feature
        
        if 'Months since last delinquent' in X.columns:
            X = X.drop(columns=['Months since last delinquent'])
        
        # fill missing values with the values stored in self.medians and flag them in `<column>_nan`
        if self.medians is not None:
            for colname in X.columns:
                X[f'{colname}_nan'] = X[colname].isna() * 1
                X.loc[X[f'{colname}_nan']==1, colname] = self.medians[colname]
                if X[f'{colname}_nan'].sum() == 0:
                    X.drop(f'{colname}_nan', axis=1, inplace=True)
        
        return X
    
    def outliners_fix(self, X):
        
        X = X.copy()
        
        X['Maximum Open Credit'] = winsorize(X['Maximum Open Credit'], limits = 0.15)
        
        return X
    
    def new_features(self, X):
        
        X = X.copy()
        
        # X.select_dtypes(include='object').columns[1:]
        
        for cat_colname in ['Term']:
            X = pd.concat([X, pd.get_dummies(X[cat_colname], prefix=cat_colname)], axis=1)
            
        return X
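
The `new_features` method one-hot encodes `Term` with `pd.get_dummies`. A minimal sketch on a toy column shows the resulting column names; the toy frame is an assumption, and the two category values are taken from the `Term_*` names listed further down.

```python
import pandas as pd

# Toy column; the values mirror the `Term_Long Term` / `Term_Short Term` names used later
toy = pd.DataFrame({'Term': ['Short Term', 'Long Term', 'Short Term']})

# One indicator column per category, prefixed with the source column name
dummies = pd.get_dummies(toy['Term'], prefix='Term')
encoded = pd.concat([toy, dummies], axis=1)

print(encoded.columns.tolist())
```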

In [None]:
m = Model()

In [None]:
m.fit(X)

In [None]:
res = m.nan_correction(X)
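
The idea behind `nan_correction` can be sketched on toy data: fill values are learned from one frame, then reused on another, and every filled cell is flagged in an indicator column. The frames and the helper name `impute` are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Toy frames, invented for illustration only
train = pd.DataFrame({'Credit Score': [700.0, 700.0, 720.0, np.nan]})
test = pd.DataFrame({'Credit Score': [np.nan, 690.0]})

# Fill values are learned on the training frame only (mode, as in Model.fit)
fill_values = {col: train[col].mode()[0] for col in train.columns}

def impute(df, fill_values):
    """Hypothetical helper: fill NaNs and add a `<column>_nan` indicator where needed."""
    df = df.copy()
    for col, value in fill_values.items():
        missing = df[col].isna()
        if missing.any():
            df[f'{col}_nan'] = missing.astype(int)
            df[col] = df[col].fillna(value)
    return df

test_filled = impute(test, fill_values)
print(test_filled)
```

Keeping the indicator column lets a model distinguish real values from imputed ones.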

In [None]:
res = m.outliners_fix(res)
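
`outliners_fix` relies on winsorization, which replaces the most extreme values with the nearest value that is kept instead of dropping rows. Here is a small sketch on invented numbers, with the limits written as an explicit `(lower, upper)` tuple:

```python
import numpy as np
from scipy.stats.mstats import winsorize

# Toy data: the values 1..19 plus one extreme outlier
values = np.array(list(range(1, 20)) + [1000], dtype=float)

# Clip 15% of the observations on each tail to the nearest retained value
clipped = np.asarray(winsorize(values, limits=(0.15, 0.15)))

print('before:', values.min(), values.max())
print('after: ', clipped.min(), clipped.max())
```

The number of rows stays the same, which keeps the feature aligned with the target.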

In [None]:
res = m.scale(res)
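
The `scale` method fits a `StandardScaler` on the first call and reuses it on later calls. A minimal sketch of that fit-once pattern, assuming toy arrays in place of the real features:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-feature arrays, invented for illustration only
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[2.0], [4.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # mean and std are learned here
test_scaled = scaler.transform(test)        # the same statistics are reused

print(scaler.mean_, scaler.scale_)
print(test_scaled.ravel())
```

Reusing the fitted scaler keeps new data on the same scale as the data the model was trained on.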

In [None]:
res.columns

In [None]:
number_features = ['Annual Income', 'Annual Income_nan', 'Tax Liens',
       'Number of Open Accounts', 'Years of Credit History',
       'Maximum Open Credit', 'Number of Credit Problems', 'Bankruptcies',
        'Current Loan Amount', 'Current Credit Balance',
       'Monthly Debt', 'Credit Score',
       'Years in current job_nan', 'Bankruptcies_nan', 'Credit Score_nan']


In [None]:
additional_features = ['Term_Long Term', 'Term_Short Term']

In [None]:
cat_features = []
for cat_colname in X.select_dtypes(include='object').columns:
    cat_features.append(cat_colname)
    
cat_features

In [None]:
for colname in cat_features:
    res[colname] = pd.Categorical(res[colname])
    
res[cat_features].dtypes

**Comparing different models**

In [None]:
X_train, X_test, y_train, y_test = train_test_split(res, y, shuffle=True, test_size=0.30)

Balancing

In [None]:
df_for_balancing = pd.concat([X_train, y_train], axis=1)

In [None]:
print(df_for_balancing[target_name].value_counts())

In [None]:
df_for_balancing = pd.concat([X_train, y_train], axis=1)
df_balanced = balance_df_by_target(df_for_balancing, target_name)
    
print(df_balanced[target_name].value_counts())

X_train = df_balanced.drop(columns=target_name)
y_train = df_balanced[target_name]

In [None]:
train = pd.concat([X_train, y_train], axis=1)

In [None]:
train.to_csv("./current_train.csv", index=False, encoding='utf-8')

In [None]:
model_lr = LogisticRegression()
model_lr.fit(X_train[number_features], y_train)

y_train_pred = model_lr.predict(X_train[number_features])
y_test_pred = model_lr.predict(X_test[number_features])

print(y_train_pred.shape, y_test_pred.shape, y_test.shape, y_train.shape)

get_classification_report(y_train, y_train_pred, y_test, y_test_pred)

In [None]:
model_knn = KNeighborsClassifier()
model_knn.fit(X_train[number_features], y_train)

y_train_pred = model_knn.predict(X_train[number_features])
y_test_pred = model_knn.predict(X_test[number_features])

get_classification_report(y_train, y_train_pred, y_test, y_test_pred)

In [None]:
model_xgb = xgb.XGBClassifier(random_state=21)
model_xgb.fit(X_train[number_features], y_train)

y_train_pred = model_xgb.predict(X_train[number_features])
y_test_pred = model_xgb.predict(X_test[number_features])

get_classification_report(y_train, y_train_pred, y_test, y_test_pred)

In [None]:
model_lgbm = lgbm.LGBMClassifier(random_state=21)
model_lgbm.fit(X_train[number_features], y_train)

y_train_pred = model_lgbm.predict(X_train[number_features])
y_test_pred = model_lgbm.predict(X_test[number_features])

get_classification_report(y_train, y_train_pred, y_test, y_test_pred)

In [None]:
model_catb = catb.CatBoostClassifier(silent=True, random_state=21, cat_features=cat_features)
model_catb.fit(X_train[number_features + cat_features], y_train)

y_train_pred = model_catb.predict(X_train[number_features + cat_features])
y_test_pred = model_catb.predict(X_test[number_features + cat_features])

get_classification_report(y_train, y_train_pred, y_test, y_test_pred)

In [None]:
c_dataset = Pool(data=pd.concat([X_train, X_test], ignore_index=True),
                  label=pd.concat([y_train, y_test], ignore_index=True),
                  cat_features=cat_features)


params = {"iterations": 220,
          "max_depth": 3,
          "eval_metric": "F1",
          "l2_leaf_reg": 10.0,
          "loss_function": "Logloss",
          "colsample_bylevel": 0.5,
          "verbose": False}
scores = cv(c_dataset,
            params,
            fold_count=3,
            plot=True)  # plot expects a boolean, not the string "True"