# <center>Course Project<a class="anchor" id="course_project"></a></center>

**Problem statement**<a class="anchor" id="course_project_task"></a>

**Task**

Using the available data on bank clients, build a model on the training dataset that predicts default on the current loan, then generate predictions for the examples in the test dataset.

**Data files**

course_project_train.csv - training dataset<br>
course_project_test.csv - test dataset


**Target variable**

Credit Default - whether the client failed to meet the credit obligations

**Quality metric**

F1-score (sklearn.metrics.f1_score)

**Solution requirements**

*Target metric*
* F1 > 0.5
* The metric is evaluated on the predictions for the main class (1 - overdue loan payment)
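
As a minimal sketch of how this requirement can be checked, the snippet below computes F1 for class 1 only with `sklearn.metrics.f1_score`. The label vectors are made-up toy data, not project results.

```python
from sklearn.metrics import f1_score

# Toy labels (assumption for illustration): 1 = default, the class the metric is computed for
y_true = [0, 0, 0, 1, 1, 1, 0, 1]
y_pred = [0, 0, 1, 1, 1, 0, 0, 1]

# pos_label=1 makes the binary F1 refer to the default class
score = f1_score(y_true, y_pred, pos_label=1)
meets_target = score > 0.5
print(round(score, 4), meets_target)
```

With `average='binary'` (the default) and `pos_label=1`, the score ignores how well class 0 is predicted, which is why it can differ a lot from accuracy on imbalanced data.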

*The solution must contain*
1. A Jupyter Notebook with the code of your solution, named {FullName}\_solution.ipynb, for example SShirkin\_solution.ipynb
2. A CSV file with predictions of the target variable for the test dataset, named {FullName}\_predictions.csv, for example SShirkin\_predictions.csv

*Recommendations for the code file (ipynb)*
1. The file should contain headings and comments (markdown)
2. Repeated operations are better wrapped in functions
3. Do not print large numbers of table rows (5-10 are enough)
4. Where possible, add plots that describe the data (about 3-5)
5. Include only the best model, i.e. do not include every solution variant in the code
6. The project script must run from start to finish (from loading the data to exporting the predictions)
7. The whole project must be in a single script (ipynb file).
8. Python libraries and machine learning models covered in this course may be used.

**Deadlines**

The project must be submitted within 5 days after the end of the last webinar.
Grades for work submitted before the deadline will be presented as a ranking ordered by the specified quality metric.
Projects submitted after the deadline or resubmitted are not included in the ranking, but their result will still be available.

**Approximate outline of the course project stages**<a class="anchor" id="course_project_steps"></a>

**Building the classification model**
1. Overview of the training dataset
2. Outlier handling
3. Missing value handling
4. Data analysis
5. Feature selection
6. Class balancing
7. Model selection, obtaining a baseline
8. Choosing the best model, tuning hyperparameters
9. Quality check, dealing with overfitting
10. Interpretation of the results

**Predicting on the test dataset**
1. Apply the same processing and feature construction steps to the test dataset
2. Predict the target variable using the model built on the training dataset
3. Predictions must cover every example in the test dataset (all rows)
4. Preserve the original order of the examples in the test dataset
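
A minimal sketch of the last two points, assuming the predictions come as one value per test row: reusing the test index keeps every row and the original order. The demo frame, the `preds` list and the output column name `Credit Default` are illustrative assumptions.

```python
import io
import pandas as pd

# Hypothetical stand-ins: a tiny test frame and predictions from some fitted model
df_test_demo = pd.DataFrame({"Annual Income": [500000.0, 900000.0, 1200000.0]})
preds = [0, 1, 0]  # one prediction per test row, in the original row order

# Reuse the test index so every row is covered and the order is unchanged
submission = pd.DataFrame({"Credit Default": preds}, index=df_test_demo.index)

# Writing to an in-memory buffer here; a real run would pass a file path instead
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
```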

# Data overview<a class="anchor" id="course_project_review"></a>

**Dataset description**

* **Home Ownership** - home ownership status
* **Annual Income** - annual income
* **Years in current job** - number of years at the current job
* **Tax Liens** - tax liens
* **Number of Open Accounts** - number of open accounts
* **Years of Credit History** - number of years of credit history
* **Maximum Open Credit** - largest open credit
* **Number of Credit Problems** - number of credit problems
* **Months since last delinquent** - number of months since the last overdue payment
* **Bankruptcies** - bankruptcies
* **Purpose** - purpose of the loan
* **Term** - loan term
* **Current Loan Amount** - current loan amount
* **Current Credit Balance** - current credit balance
* **Monthly Debt** - monthly debt
* **Credit Default** - whether the credit obligations were not met (0 - repaid on time, 1 - overdue)

# Libraries. Functions

In [119]:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import missingno
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline
import tqdm
import re

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, ShuffleSplit, cross_val_score, learning_curve
from sklearn.model_selection import StratifiedKFold, GridSearchCV, RandomizedSearchCV
from sklearn.metrics import classification_report, f1_score, precision_score, recall_score

from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
import xgboost as xgb, lightgbm as lgbm, catboost as catb

# plot_confusion_matrix, plot_precision_recall_curve and plot_roc_curve were removed
# in scikit-learn 1.2, so this import requires an older version
from sklearn.metrics import (roc_auc_score, roc_curve, auc, confusion_matrix,
                             accuracy_score, plot_confusion_matrix,
                             plot_precision_recall_curve, precision_recall_curve,
                             plot_roc_curve)

In [2]:
# !pip install xgboost
# !pip install lightgbm
# !pip install catboost

**Paths to directories and files**

In [189]:
TRAIN_DATASET_PATH = './data/course_project_train.csv'
TEST_DATASET_PATH = './data/course_project_test.csv'

**Loading the data**

In [190]:
df_train = pd.read_csv(TRAIN_DATASET_PATH)
df_test = pd.read_csv(TEST_DATASET_PATH)

In [8]:
# split the features into numeric and categorical
FEATURES = df_train.columns.to_list()

TARGET_NAME = 'Credit Default'

NUM_FEATURE_NAMES = ['Annual Income', 'Tax Liens', 'Number of Open Accounts', 'Years of Credit History',
            'Maximum Open Credit', 'Number of Credit Problems', 'Months since last delinquent',
            'Bankruptcies', 'Current Loan Amount', 'Current Credit Balance', 'Monthly Debt', 'Credit Score']

# categorical features = all columns minus the numeric features and the target
CAT_FEATURE_NAMES = [item for item in FEATURES if item not in NUM_FEATURE_NAMES]

CAT_FEATURE_NAMES.remove(TARGET_NAME)

SELECTED_FEATURE_NAMES = NUM_FEATURE_NAMES + CAT_FEATURE_NAMES #+ NEW_FEATURE_NAMES
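
The numeric list above is written out by hand. As a hedged alternative sketch, the same split can be derived from column dtypes with `select_dtypes`; this assumes that every categorical column is stored as a non-numeric dtype, which holds for the `object` columns shown by `info()` in this dataset. The demo frame is made up.

```python
import pandas as pd

# Tiny illustrative frame with the same kinds of columns as the training data
df_demo = pd.DataFrame({
    "Home Ownership": ["Rent", "Own Home"],
    "Annual Income": [482087.0, 1025487.0],
    "Term": ["Short Term", "Long Term"],
    "Credit Default": [0, 1],
})
target = "Credit Default"

# Drop the target first so it does not end up in either feature list
features = df_demo.drop(columns=[target])
num_features = features.select_dtypes(include="number").columns.tolist()
cat_features = features.select_dtypes(exclude="number").columns.tolist()
print(num_features, cat_features)
```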

**Functions**

In [81]:
def get_classification_report(y_train_true, y_train_pred, y_test_true, y_test_pred):
    print('TRAIN\n\n' + classification_report(y_train_true, y_train_pred))
    print('TEST\n\n' + classification_report(y_test_true, y_test_pred))
    print('CONFUSION MATRIX\n')
    print(pd.crosstab(y_test_true, y_test_pred))

In [82]:
def evaluate_preds(model, X_train, X_test, y_train, y_test):
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    get_classification_report(y_train, y_train_pred, y_test, y_test_pred)

In [83]:
def balance_df_by_target(df, target_name, method='over'):

    assert method in ['over', 'under'], 'Invalid sampling method'

    target_counts = df[target_name].value_counts()

    # idxmax/idxmin return the class labels; argmax/argmin would return positions
    major_class_name = target_counts.idxmax()
    minor_class_name = target_counts.idxmin()

    disbalance_coeff = int(target_counts[major_class_name] / target_counts[minor_class_name]) - 1
    if method == 'over':
        # append disbalance_coeff shuffled copies of the minority class
        # (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
        for i in range(disbalance_coeff):
            sample = df[df[target_name] == minor_class_name].sample(target_counts[minor_class_name])
            df = pd.concat([df, sample], ignore_index=True)
    elif method == 'under':
        # keep the minority class and draw the same number of majority rows (with replacement)
        df_ = df.copy()
        df = df_[df_[target_name] == minor_class_name]
        tmp = df_[df_[target_name] == major_class_name]
        df = pd.concat([df, tmp.iloc[
            np.random.randint(0, tmp.shape[0], target_counts[minor_class_name])
        ]], ignore_index=True)

    return df.sample(frac=1)
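
For comparison, here is a compact sketch of random oversampling that makes the classes exactly equal. It is a variation, not the function above: it resamples minority rows with replacement instead of appending whole copies. The name `oversample_minority` and the toy frame are assumptions for illustration.

```python
import pandas as pd

def oversample_minority(df, target_name, random_state=0):
    """Resample minority rows with replacement until both classes have equal size."""
    counts = df[target_name].value_counts()
    major, minor = counts.idxmax(), counts.idxmin()
    n_extra = int(counts[major] - counts[minor])
    extra = df[df[target_name] == minor].sample(n_extra, replace=True, random_state=random_state)
    # Shuffle so the appended rows are not grouped at the end
    return pd.concat([df, extra], ignore_index=True).sample(frac=1, random_state=random_state)

# Toy data: 6 rows of class 0 and 2 rows of class 1
df_demo = pd.DataFrame({"Monthly Debt": range(8), "Credit Default": [0, 0, 0, 0, 0, 0, 1, 1]})
balanced = oversample_minority(df_demo, "Credit Default")
print(balanced["Credit Default"].value_counts().to_dict())
```

Whichever variant is used, balancing is normally applied to the training part only, so that the validation rows keep the original class proportions.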

In [205]:
def LogRegression(df):

    # Fill missing values with the mode
    #df.fillna(df.mode().iloc[0], inplace=True)
    #df = df.fillna(df.median(axis=0), axis=0)

    # Convert categorical features to numeric (one-hot encoding)
    df = pd.get_dummies(df)

    # standardize the numeric features
    # NOTE: the scaler is fit on the full dataset, before the train/test split
    scaler = StandardScaler()
    df_norm = df.copy()
    df_norm[NUM_FEATURE_NAMES] = scaler.fit_transform(df_norm[NUM_FEATURE_NAMES])
    df = df_norm.copy()

    # build the train/test samples
    X = df.drop(columns=[TARGET_NAME])
    y = df[TARGET_NAME]

    X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                        shuffle=True,
                                                        test_size=0.3,
                                                        random_state=21,
                                                        stratify=y)
    # train the model
    model = LogisticRegression()
    model.fit(X_train, y_train)

    # predict
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    # compute the metrics
    # NOTE: these are accuracy values (accuracy_score), although the returned string is labelled 'f1 class 1'
    #evaluate_preds(model, X_train, X_test, y_train, y_test)
    accuracy_train = accuracy_score(y_train, y_train_pred)
    accuracy_test = accuracy_score(y_test, y_test_pred)

    return f'f1 class 1: TRAIN = {round(accuracy_train, 4)}, TEST = {round(accuracy_test, 4)}'
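
`LogRegression` calls `fit_transform` on the whole frame before `train_test_split`, so the scaling statistics also include the hold-out rows. A minimal sketch of the usual alternative, fitting the scaler on the training part only, is shown below on synthetic data. The array shapes and values are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for two numeric features and an imbalanced binary target
rng = np.random.default_rng(21)
X = rng.normal(loc=1000.0, scale=250.0, size=(200, 2))
y = np.array([0] * 150 + [1] * 50)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=21, stratify=y
)

# Fit on the training rows only, then apply the same transform to both parts
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```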

In [206]:
def knn(df):

    # Fill missing values with the mode
    #df.fillna(df.mode().iloc[0], inplace=True)
    #df = df.fillna(df.median(axis=0), axis=0)

    # Convert categorical features to numeric (one-hot encoding)
    df = pd.get_dummies(df)

    # standardize the data (disabled)
#     scaler = StandardScaler()
#     df_norm = df.copy()
#     df_norm[NUM_FEATURE_NAMES] = scaler.fit_transform(df_norm[NUM_FEATURE_NAMES])
#     df = df_norm.copy()

    # build the train/test samples
    X = df.drop(columns=[TARGET_NAME])
    y = df[TARGET_NAME]

    X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                        shuffle=True,
                                                        test_size=0.3,
                                                        random_state=21,
                                                        stratify=y)
    # train the model
    model = KNeighborsClassifier()
    model.fit(X_train, y_train)

    # predict
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    # compute the metrics
    # NOTE: these are accuracy values (accuracy_score), although the returned string is labelled 'f1 class 1'
    #evaluate_preds(model, X_train, X_test, y_train, y_test)
    accuracy_train = accuracy_score(y_train, y_train_pred)
    accuracy_test = accuracy_score(y_test, y_test_pred)

    return f'f1 class 1: TRAIN = {round(accuracy_train, 4)}, TEST = {round(accuracy_test, 4)}'

In [207]:
def decision_tree(df, max_depth):

    # class imbalance ratio (count of class 1 / count of class 0), used as the weight of class 0
    class_weight = df[TARGET_NAME].value_counts()[1] / df[TARGET_NAME].value_counts()[0]

    # Fill missing values with the mode
    #df.fillna(df.mode().iloc[0], inplace=True)
    #df = df.fillna(df.median(axis=0), axis=0)

    # Convert categorical features to numeric (one-hot encoding)
    df = pd.get_dummies(df)

    # standardize the data (disabled)
#     scaler = StandardScaler()
#     df_norm = df.copy()
#     df_norm[NUM_FEATURE_NAMES] = scaler.fit_transform(df_norm[NUM_FEATURE_NAMES])
#     df = df_norm.copy()

    # build the train/test samples
    X = df.drop(columns=[TARGET_NAME])
    y = df[TARGET_NAME]

    X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                        shuffle=True,
                                                        test_size=0.3,
                                                        random_state=21,
                                                        stratify=y)
    # train the model
    model = DecisionTreeClassifier(random_state=21,
                                    class_weight={0:class_weight, 1:1},
                                    max_depth=max_depth
                                    )
    model.fit(X_train, y_train)

    # predict
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    # compute the metrics
    # NOTE: these are accuracy values (accuracy_score), although the returned string is labelled 'f1 class 1'
    #evaluate_preds(model, X_train, X_test, y_train, y_test)
    accuracy_train = accuracy_score(y_train, y_train_pred)
    accuracy_test = accuracy_score(y_test, y_test_pred)

    return f'f1 class 1: TRAIN = {round(accuracy_train, 4)}, TEST = {round(accuracy_test, 4)}'

In [208]:
def xgboost(df, n_estimators):

    # class imbalance ratio (disabled)
    #class_weight = df[TARGET_NAME].value_counts()[1] / df[TARGET_NAME].value_counts()[0]

    # Fill missing values with the mode
    #df.fillna(df.mode().iloc[0], inplace=True)
    #df = df.fillna(df.median(axis=0), axis=0)

    # Convert categorical features to numeric (one-hot encoding)
    df = pd.get_dummies(df)

    # one-hot encoding produces the column 'Years in current job_< 1 year';
    # the '<' sign in the column name has to be replaced
    #df_clean = df
    regex = re.compile(r"\[|\]|<", re.IGNORECASE)
    df.columns = [regex.sub("_", col) if any(x in str(col) for x in set(('[', ']', '<'))) \
                       else col for col in df.columns.values]

    # standardize the data (disabled)
#     scaler = StandardScaler()
#     df_norm = df.copy()
#     df_norm[NUM_FEATURE_NAMES] = scaler.fit_transform(df_norm[NUM_FEATURE_NAMES])
#     df = df_norm.copy()

    # build the train/test samples
    X = df.drop(columns=[TARGET_NAME])
    y = df[TARGET_NAME]

    X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                        shuffle=True,
                                                        test_size=0.3,
                                                        random_state=21,
                                                        stratify=y)
    # train the model
    model = xgb.XGBClassifier(random_state=21,
                               n_estimators=n_estimators,
                              #n_estimators=100
                             )
    model.fit(X_train, y_train)

    # predict
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    # compute the metrics
    # NOTE: these are accuracy values (accuracy_score), although the returned string is labelled 'f1 class 1'
    #evaluate_preds(model, X_train, X_test, y_train, y_test)
    accuracy_train = accuracy_score(y_train, y_train_pred)
    accuracy_test = accuracy_score(y_test, y_test_pred)

    return f'f1 class 1: TRAIN = {round(accuracy_train, 4)}, TEST = {round(accuracy_test, 4)}'
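
The column renaming step inside `xgboost` can be looked at in isolation. The sketch below one-hot encodes a toy frame and replaces the characters `[`, `]` and `<` in the resulting column names with `_`, using a character class instead of the alternation above. The demo frame is an assumption for illustration.

```python
import re
import pandas as pd

# Toy frame with a category value that contains '<'
df_demo = pd.DataFrame({
    "Years in current job": ["< 1 year", "10+ years"],
    "Tax Liens": [0.0, 1.0],
})
encoded = pd.get_dummies(df_demo)

# Replace '[', ']' and '<' in every column name with '_'
pattern = re.compile(r"[\[\]<]")
encoded.columns = [pattern.sub("_", str(col)) for col in encoded.columns]
print(list(encoded.columns))
```

Applying `sub` to every name is enough here, because it leaves names without those characters unchanged.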

In [209]:
def light_gbm(df, n_estimators=100, num_leaves=10):

    # class imbalance ratio (count of class 1 / count of class 0), used as the weight of class 0
    class_weight = df[TARGET_NAME].value_counts()[1] / df[TARGET_NAME].value_counts()[0]

    # Fill missing values with the mode
    #df.fillna(df.mode().iloc[0], inplace=True)
    #df = df.fillna(df.median(axis=0), axis=0)

    # Convert categorical features to numeric (one-hot encoding)
    df = pd.get_dummies(df)

    # one-hot encoding produces the column 'Years in current job_< 1 year';
    # the '<' sign in the column name has to be replaced
    #df_clean = df
    regex = re.compile(r"\[|\]|<", re.IGNORECASE)
    df.columns = [regex.sub("_", col) if any(x in str(col) for x in set(('[', ']', '<'))) \
                       else col for col in df.columns.values]

    # standardize the data (disabled)
#     scaler = StandardScaler()
#     df_norm = df.copy()
#     df_norm[NUM_FEATURE_NAMES] = scaler.fit_transform(df_norm[NUM_FEATURE_NAMES])
#     df = df_norm.copy()

    # build the train/test samples
    X = df.drop(columns=[TARGET_NAME])
    y = df[TARGET_NAME]

    X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                        shuffle=True,
                                                        test_size=0.3,
                                                        random_state=21,
                                                        stratify=y)
    # train the model
    model = lgbm.LGBMClassifier(random_state=21, 
                                 class_weight={0:class_weight, 1:1},
                                  n_estimators=n_estimators,
                                num_leaves=num_leaves
                               )
    model.fit(X_train, y_train)

    # predict
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    # compute the metrics
    # NOTE: these are accuracy values (accuracy_score), although the returned string is labelled 'f1 class 1'
    #evaluate_preds(model, X_train, X_test, y_train, y_test)
    accuracy_train = accuracy_score(y_train, y_train_pred)
    accuracy_test = accuracy_score(y_test, y_test_pred)

    return f'f1 class 1: TRAIN = {round(accuracy_train, 4)}, TEST = {round(accuracy_test, 4)}'

**LightGBM**

**For Better Accuracy**
- Use large `max_bin` (may be slower)
- Use small `learning_rate` with large `num_iterations`
- Use large `num_leaves` (may cause over-fitting)
- Use bigger training data
- Try `dart`

**Deal with Over-fitting**
- Use small `max_bin`
- Use small `num_leaves`
- Use `min_data_in_leaf` and `min_sum_hessian_in_leaf`
- Use bagging by set `bagging_fraction` and `bagging_freq`
- Use feature sub-sampling by set `feature_fraction`
- Use bigger training data
- Try `lambda_l1`, `lambda_l2` and `min_gain_to_split` for regularization
- Try `max_depth` to avoid growing deep tree
- Try `extra_trees`
- Try increasing `path_smooth`
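
As a rough sketch of how the over-fitting tips map onto `LGBMClassifier` keyword arguments, the dictionary below uses the scikit-learn style aliases of the native parameter names. All values are illustrative and untuned assumptions, not settings taken from this project.

```python
# Illustrative, untuned values; keys are LGBMClassifier keyword names
overfit_control_params = {
    "num_leaves": 8,          # small `num_leaves`
    "max_depth": 4,           # limit tree depth
    "min_child_samples": 50,  # alias of `min_data_in_leaf`
    "subsample": 0.8,         # alias of `bagging_fraction`
    "subsample_freq": 1,      # alias of `bagging_freq`; bagging is off when this is 0
    "colsample_bytree": 0.8,  # alias of `feature_fraction`
    "reg_alpha": 0.1,         # alias of `lambda_l1`
    "reg_lambda": 1.0,        # alias of `lambda_l2`
    "min_split_gain": 0.01,   # alias of `min_gain_to_split`
}

# Possible usage (requires lightgbm):
# model = lgbm.LGBMClassifier(random_state=21, **overfit_control_params)
```

Keeping `num_leaves` below `2 ** max_depth` is a common rule of thumb, since a tree of that depth cannot have more leaves anyway.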

In [210]:
def cat_boost(df, n_estimators=100, num_leaves=10):

    # class imbalance ratio (count of class 1 / count of class 0), used as the weight of class 0
    class_weight = df[TARGET_NAME].value_counts()[1] / df[TARGET_NAME].value_counts()[0]

    # Fill missing values with the mode
    #df.fillna(df.mode().iloc[0], inplace=True)
    #df = df.fillna(df.median(axis=0), axis=0)

    # Convert categorical features to numeric (disabled: CatBoost receives cat_features directly)
    #df = pd.get_dummies(df)

    # one-hot encoding produces the column 'Years in current job_< 1 year';
    # the '<' sign in the column name has to be replaced (disabled)
    #df_clean = df
#     regex = re.compile(r"\[|\]|<", re.IGNORECASE)
#     df.columns = [regex.sub("_", col) if any(x in str(col) for x in set(('[', ']', '<'))) \
#                        else col for col in df.columns.values]

    # standardize the data (disabled)
#     scaler = StandardScaler()
#     df_norm = df.copy()
#     df_norm[NUM_FEATURE_NAMES] = scaler.fit_transform(df_norm[NUM_FEATURE_NAMES])
#     df = df_norm.copy()

    # build the train/test samples
    X = df.drop(columns=[TARGET_NAME])
    y = df[TARGET_NAME]

    X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                        shuffle=True,
                                                        test_size=0.3,
                                                        random_state=21,
                                                        stratify=y)
    # train the model
    model = catb.CatBoostClassifier(silent=True, random_state=21, 
                                    cat_features=CAT_FEATURE_NAMES,
                                    one_hot_max_size=len(X.columns),
                                    class_weights=[class_weight, 1],
                                    eval_metric='F1',
                                    custom_metric=['Precision', 'Recall'],
                                   )
    model.fit(X_train, y_train)

    # predict
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    # compute the metrics
    # NOTE: these are accuracy values (accuracy_score), although the returned string is labelled 'f1 class 1'
    #evaluate_preds(model, X_train, X_test, y_train, y_test)
    accuracy_train = accuracy_score(y_train, y_train_pred)
    accuracy_test = accuracy_score(y_test, y_test_pred)

    return f'f1 class 1: TRAIN = {round(accuracy_train, 4)}, TEST = {round(accuracy_test, 4)}'
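
The model functions above compute their scores with `accuracy_score`, while the course metric is F1 for class 1. A small helper that returns both numbers shows how far apart they can be on imbalanced labels. The helper name `score_summary` and the toy labels are assumptions for illustration.

```python
from sklearn.metrics import accuracy_score, f1_score

def score_summary(y_true, y_pred):
    """Return accuracy together with F1 for the positive class (label 1)."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        # zero_division=0 returns 0.0 instead of warning when class 1 is never predicted
        "f1_class_1": f1_score(y_true, y_pred, pos_label=1, zero_division=0),
    }

# Toy case: 70% of the labels are class 0 and the model always predicts 0
y_true = [0] * 7 + [1] * 3
y_pred = [0] * 10
summary = score_summary(y_true, y_pred)
print(summary)
```

In this toy case a constant prediction reaches 0.7 accuracy and an F1 of 0 for class 1, so accuracy alone says little about the target metric.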

# EDA: dataset overview

In [264]:
df_train.head().style.applymap(lambda x: 'color: red' if pd.isnull(x) else '')

Unnamed: 0,Home Ownership,Annual Income,Years in current job,Tax Liens,Number of Open Accounts,Years of Credit History,Maximum Open Credit,Number of Credit Problems,Months since last delinquent,Bankruptcies,Purpose,Term,Current Loan Amount,Current Credit Balance,Monthly Debt,Credit Score,Credit Default
0,Own Home,482087.0,,0.0,11.0,26.3,685960.0,1.0,,1.0,debt consolidation,Short Term,99999999.0,47386.0,7914.0,749.0,0
1,Own Home,1025487.0,10+ years,0.0,15.0,15.3,1181730.0,0.0,,0.0,debt consolidation,Long Term,264968.0,394972.0,18373.0,737.0,1
2,Home Mortgage,751412.0,8 years,0.0,11.0,35.0,1182434.0,0.0,,0.0,debt consolidation,Short Term,99999999.0,308389.0,13651.0,742.0,0
3,Own Home,805068.0,6 years,0.0,8.0,22.5,147400.0,1.0,,1.0,debt consolidation,Short Term,121396.0,95855.0,11338.0,694.0,0
4,Rent,776264.0,8 years,0.0,13.0,13.6,385836.0,1.0,,0.0,debt consolidation,Short Term,125840.0,93309.0,7180.0,719.0,0


In [10]:
df_train.shape

(7500, 17)

In [11]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7500 entries, 0 to 7499
Data columns (total 17 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Home Ownership                7500 non-null   object 
 1   Annual Income                 5943 non-null   float64
 2   Years in current job          7129 non-null   object 
 3   Tax Liens                     7500 non-null   float64
 4   Number of Open Accounts       7500 non-null   float64
 5   Years of Credit History       7500 non-null   float64
 6   Maximum Open Credit           7500 non-null   float64
 7   Number of Credit Problems     7500 non-null   float64
 8   Months since last delinquent  3419 non-null   float64
 9   Bankruptcies                  7486 non-null   float64
 10  Purpose                       7500 non-null   object 
 11  Term                          7500 non-null   object 
 12  Current Loan Amount           7500 non-null   float64
 13  Cur

In [12]:
df_train.describe()

Unnamed: 0,Annual Income,Tax Liens,Number of Open Accounts,Years of Credit History,Maximum Open Credit,Number of Credit Problems,Months since last delinquent,Bankruptcies,Current Loan Amount,Current Credit Balance,Monthly Debt,Credit Score,Credit Default
count,5943.0,7500.0,7500.0,7500.0,7500.0,7500.0,3419.0,7486.0,7500.0,7500.0,7500.0,5943.0,7500.0
mean,1366392.0,0.030133,11.130933,18.317467,945153.7,0.17,34.6926,0.117152,11873180.0,289833.2,18314.454133,1151.087498,0.281733
std,845339.2,0.271604,4.908924,7.041946,16026220.0,0.498598,21.688806,0.347192,31926120.0,317871.4,11926.764673,1604.451418,0.449874
min,164597.0,0.0,2.0,4.0,0.0,0.0,0.0,0.0,11242.0,0.0,0.0,585.0,0.0
25%,844341.0,0.0,8.0,13.5,279229.5,0.0,16.0,0.0,180169.0,114256.5,10067.5,711.0,0.0
50%,1168386.0,0.0,10.0,17.0,478159.0,0.0,32.0,0.0,309573.0,209323.0,16076.5,731.0,0.0
75%,1640137.0,0.0,14.0,21.8,793501.5,0.0,50.0,0.0,519882.0,360406.2,23818.0,743.0,1.0
max,10149340.0,7.0,43.0,57.7,1304726000.0,7.0,118.0,4.0,100000000.0,6506797.0,136679.0,7510.0,1.0


In [13]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2500 entries, 0 to 2499
Data columns (total 16 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Home Ownership                2500 non-null   object 
 1   Annual Income                 1987 non-null   float64
 2   Years in current job          2414 non-null   object 
 3   Tax Liens                     2500 non-null   float64
 4   Number of Open Accounts       2500 non-null   float64
 5   Years of Credit History       2500 non-null   float64
 6   Maximum Open Credit           2500 non-null   float64
 7   Number of Credit Problems     2500 non-null   float64
 8   Months since last delinquent  1142 non-null   float64
 9   Bankruptcies                  2497 non-null   float64
 10  Purpose                       2500 non-null   object 
 11  Term                          2500 non-null   object 
 12  Current Loan Amount           2500 non-null   float64
 13  Cur

In [266]:
df_test.head().style.applymap(lambda x: 'color: red' if pd.isnull(x) else '')

Unnamed: 0,Home Ownership,Annual Income,Years in current job,Tax Liens,Number of Open Accounts,Years of Credit History,Maximum Open Credit,Number of Credit Problems,Months since last delinquent,Bankruptcies,Purpose,Term,Current Loan Amount,Current Credit Balance,Monthly Debt,Credit Score
0,Rent,,4 years,0.0,9.0,12.5,220968.0,0.0,70.0,0.0,debt consolidation,Short Term,162470.0,105906.0,6813.0,
1,Rent,231838.0,1 year,0.0,6.0,32.7,55946.0,0.0,8.0,0.0,educational expenses,Short Term,78298.0,46037.0,2318.0,699.0
2,Home Mortgage,1152540.0,3 years,0.0,10.0,13.7,204600.0,0.0,,0.0,debt consolidation,Short Term,200178.0,146490.0,18729.0,7260.0
3,Home Mortgage,1220313.0,10+ years,0.0,16.0,17.0,456302.0,0.0,70.0,0.0,debt consolidation,Short Term,217382.0,213199.0,27559.0,739.0
4,Home Mortgage,2340952.0,6 years,0.0,11.0,23.6,1207272.0,0.0,,0.0,debt consolidation,Long Term,777634.0,425391.0,42605.0,706.0


In [15]:
df_test.shape

(2500, 16)
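
The `info()` outputs above show that several columns are incomplete. For example, in the training set `Annual Income` has 5943 non-null values out of 7500 and `Months since last delinquent` has 3419. A minimal sketch for listing the share of missing values per column is given below on a made-up frame; on the real data the same expression would be applied to `df_train` or `df_test`.

```python
import numpy as np
import pandas as pd

# Toy frame (assumption for illustration) with gaps in two columns
df_demo = pd.DataFrame({
    "Annual Income": [482087.0, np.nan, 751412.0, np.nan],
    "Months since last delinquent": [np.nan, np.nan, np.nan, 70.0],
    "Term": ["Short Term", "Long Term", "Short Term", "Short Term"],
})

# Fraction of missing values per column, largest first, complete columns omitted
missing_share = df_demo.isna().mean().sort_values(ascending=False)
missing_share = missing_share[missing_share > 0]
print(missing_share)
```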

Number of unique values

In [16]:
df_train.nunique()

Home Ownership                     4
Annual Income                   5478
Years in current job              11
Tax Liens                          8
Number of Open Accounts           39
Years of Credit History          408
Maximum Open Credit             6963
Number of Credit Problems          8
Months since last delinquent      89
Bankruptcies                       5
Purpose                           15
Term                               2
Current Loan Amount             5386
Current Credit Balance          6592
Monthly Debt                    6716
Credit Score                     268
Credit Default                     2
dtype: int64

In [17]:
df_test.nunique()

Home Ownership                     4
Annual Income                   1929
Years in current job              11
Tax Liens                          8
Number of Open Accounts           35
Years of Credit History          345
Maximum Open Credit             2435
Number of Credit Problems          8
Months since last delinquent      83
Bankruptcies                       6
Purpose                           14
Term                               2
Current Loan Amount             2026
Current Credit Balance          2385
Monthly Debt                    2416
Credit Score                     211
dtype: int64

## Overview of categorical variables

In [18]:
df_train['Home Ownership'].unique()

array(['Own Home', 'Home Mortgage', 'Rent', 'Have Mortgage'], dtype=object)

In [20]:
df_train['Years in current job'].unique()

array([nan, '10+ years', '8 years', '6 years', '7 years', '5 years',
       '1 year', '< 1 year', '4 years', '3 years', '2 years', '9 years'],
      dtype=object)

In [22]:
df_train['Term'].unique()

array(['Short Term', 'Long Term'], dtype=object)

In [202]:
df_train['Purpose'].unique()

array(['debt consolidation', 'other', 'home improvements', 'take a trip',
       'buy a car', 'small business', 'business loan', 'wedding',
       'educational expenses', 'buy house', 'medical bills', 'moving',
       'major purchase', 'vacation', 'renewable energy'], dtype=object)

In [205]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7500 entries, 0 to 7499
Data columns (total 17 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Home Ownership                7500 non-null   object 
 1   Annual Income                 7500 non-null   float64
 2   Years in current job          7500 non-null   object 
 3   Tax Liens                     7500 non-null   float64
 4   Number of Open Accounts       7500 non-null   float64
 5   Years of Credit History       7500 non-null   float64
 6   Maximum Open Credit           7500 non-null   float64
 7   Number of Credit Problems     7500 non-null   float64
 8   Months since last delinquent  7500 non-null   float64
 9   Bankruptcies                  7500 non-null   float64
 10  Purpose                       7500 non-null   object 
 11  Term                          7500 non-null   object 
 12  Current Loan Amount           7500 non-null   float64
 13  Cur

In [206]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2500 entries, 0 to 2499
Data columns (total 16 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Home Ownership                2500 non-null   object 
 1   Annual Income                 2500 non-null   float64
 2   Years in current job          2500 non-null   object 
 3   Tax Liens                     2500 non-null   float64
 4   Number of Open Accounts       2500 non-null   float64
 5   Years of Credit History       2500 non-null   float64
 6   Maximum Open Credit           2500 non-null   float64
 7   Number of Credit Problems     2500 non-null   float64
 8   Months since last delinquent  2500 non-null   float64
 9   Bankruptcies                  2500 non-null   float64
 10  Purpose                       2500 non-null   object 
 11  Term                          2500 non-null   object 
 12  Current Loan Amount           2500 non-null   float64
 13  Cur

# Baseline

Reload the data and fill the missing values with the mode

In [251]:
df_train = pd.read_csv(TRAIN_DATASET_PATH)
df_test = pd.read_csv(TEST_DATASET_PATH)

df_train.fillna(df_train.mode().iloc[0], inplace=True)
#df = df.fillna(df.median(axis=0), axis=0)
df_train.head(10)

Unnamed: 0,Home Ownership,Annual Income,Years in current job,Tax Liens,Number of Open Accounts,Years of Credit History,Maximum Open Credit,Number of Credit Problems,Months since last delinquent,Bankruptcies,Purpose,Term,Current Loan Amount,Current Credit Balance,Monthly Debt,Credit Score,Credit Default
0,Own Home,482087.0,10+ years,0.0,11.0,26.3,685960.0,1.0,14.0,1.0,debt consolidation,Short Term,99999999.0,47386.0,7914.0,749.0,0
1,Own Home,1025487.0,10+ years,0.0,15.0,15.3,1181730.0,0.0,14.0,0.0,debt consolidation,Long Term,264968.0,394972.0,18373.0,737.0,1
2,Home Mortgage,751412.0,8 years,0.0,11.0,35.0,1182434.0,0.0,14.0,0.0,debt consolidation,Short Term,99999999.0,308389.0,13651.0,742.0,0
3,Own Home,805068.0,6 years,0.0,8.0,22.5,147400.0,1.0,14.0,1.0,debt consolidation,Short Term,121396.0,95855.0,11338.0,694.0,0
4,Rent,776264.0,8 years,0.0,13.0,13.6,385836.0,1.0,14.0,0.0,debt consolidation,Short Term,125840.0,93309.0,7180.0,719.0,0
5,Rent,969475.0,7 years,0.0,12.0,14.6,366784.0,0.0,14.0,0.0,other,Long Term,337304.0,165680.0,18692.0,740.0,1
6,Home Mortgage,1511108.0,10+ years,0.0,9.0,20.3,388124.0,0.0,73.0,0.0,home improvements,Short Term,99999999.0,51623.0,2317.0,745.0,0
7,Rent,1040060.0,10+ years,0.0,13.0,12.0,330374.0,0.0,18.0,0.0,other,Short Term,250888.0,89015.0,19761.0,705.0,1
8,Home Mortgage,969475.0,5 years,0.0,17.0,15.7,0.0,1.0,14.0,1.0,home improvements,Short Term,129734.0,19.0,17.0,740.0,0
9,Home Mortgage,969475.0,1 year,0.0,10.0,24.6,511302.0,0.0,6.0,0.0,debt consolidation,Long Term,572880.0,205333.0,17613.0,740.0,1


In [213]:
LogRegression(df_train)

'f1 class 1: TRAIN = 0.7817, TEST = 0.768'

In [214]:
knn(df_train)

'f1 class 1: TRAIN = 0.7813, TEST = 0.6747'

In [215]:
decision_tree(df_train, max_depth=2)

'f1 class 1: TRAIN = 0.7726, TEST = 0.7693'

In [216]:
xgboost(df_train, n_estimators=10)



'f1 class 1: TRAIN = 0.8038, TEST = 0.7698'

In [217]:
light_gbm(df_train)

'f1 class 1: TRAIN = 0.8072, TEST = 0.7018'

In [218]:
# tuning
light_gbm(df_train, num_leaves=2)

'f1 class 1: TRAIN = 0.7259, TEST = 0.7076'

In [219]:
cat_boost(df_train)

'f1 class 1: TRAIN = 0.9025, TEST = 0.7156'

# Data processing

## Handling missing values

### Drop rows with missing values

In [229]:
df_train = pd.read_csv(TRAIN_DATASET_PATH)
df_test = pd.read_csv(TEST_DATASET_PATH)
print('Original dataset shape = ', df_train.shape)

df_train.dropna(axis='index', inplace=True) 
print('Dataset shape without NaN = ', df_train.shape)

Original dataset shape =  (7500, 17)
Dataset shape without NaN =  (2585, 17)


In [230]:
LogRegression(df_train)

'f1 class 1: TRAIN = 0.8115, TEST = 0.7977'

In [231]:
knn(df_train)

'f1 class 1: TRAIN = 0.7828, TEST = 0.6997'

In [232]:
decision_tree(df_train, max_depth=2)

'f1 class 1: TRAIN = 0.8043, TEST = 0.8015'

In [233]:
xgboost(df_train, n_estimators=10)



'f1 class 1: TRAIN = 0.8563, TEST = 0.8041'

In [235]:
light_gbm(df_train)

'f1 class 1: TRAIN = 0.9038, TEST = 0.7358'

In [234]:
# tuning
light_gbm(df_train, num_leaves=2)

'f1 class 1: TRAIN = 0.7529, TEST = 0.7307'

In [236]:
cat_boost(df_train)

'f1 class 1: TRAIN = 0.9663, TEST = 0.7668'

Model | Baseline | Dropped NaN
---|---|---
LogReg | TRAIN = 0.7817, TEST = 0.768 | TRAIN = 0.8115, TEST = 0.7977
KNN | TRAIN = 0.7813, TEST = 0.6747 | TRAIN = 0.7828, TEST = 0.6997
Decision Tree | TRAIN = 0.7726, TEST = 0.7693 | TRAIN = 0.8043, TEST = 0.8015
XGB | TRAIN = 0.8038, TEST = 0.7698 | TRAIN = 0.8563, TEST = 0.8041
LGBM | TRAIN = 0.8072, TEST = 0.7018 | TRAIN = 0.9038, TEST = 0.7358
CatB | TRAIN = 0.9025, TEST = 0.7156 | TRAIN = 0.9663, TEST = 0.7668

Conclusion: dropping rows with missing values improved the score compared with mode imputation (the baseline): test F1 rose by about 0.025 to 0.05 for every model. Decision Tree and XGB performed best.

Idea: use the `thresh` parameter of `dropna` (`thresh`: int, require that many non-NA values) to keep rows that have only a few gaps.
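A minimal sketch of the `thresh` idea on a toy frame. The column values below are made up for illustration and are not taken from the dataset. `DataFrame.dropna(thresh=k)` keeps only rows with at least `k` non-NA values, so rows with a single gap survive while mostly empty rows are dropped.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values), 4 columns per row
toy = pd.DataFrame({
    "Annual Income": [482087.0, np.nan, 751412.0, np.nan],
    "Credit Score": [749.0, 737.0, np.nan, np.nan],
    "Bankruptcies": [1.0, 0.0, 0.0, np.nan],
    "Term": ["Short Term", "Long Term", "Short Term", None],
})

# Strict variant: drop every row that has at least one missing value
strict = toy.dropna(axis="index")

# Softer variant: keep rows with at least 3 non-NA values out of 4
relaxed = toy.dropna(axis="index", thresh=3)

print(strict.shape, relaxed.shape)
```

On the real training set the threshold would need tuning, since a looser `thresh` keeps more rows but leaves NaN values that still have to be imputed or handled by the model.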

### Impute with different methods (medians, means, business logic, model-based...)

Fill the missing values: mode for categorical features, median for numeric ones.

In [250]:
# Mode for categorical features, median for numeric ones
df_train = pd.read_csv(TRAIN_DATASET_PATH)
df_test = pd.read_csv(TEST_DATASET_PATH)

# Compute the fill values on the matching column subsets only, so that
# median() is never applied to non-numeric columns
df_train[CAT_FEATURE_NAMES] = df_train[CAT_FEATURE_NAMES].fillna(df_train[CAT_FEATURE_NAMES].mode().iloc[0])
df_train[NUM_FEATURE_NAMES] = df_train[NUM_FEATURE_NAMES].fillna(df_train[NUM_FEATURE_NAMES].median())
df_train.head(10)

Unnamed: 0,Home Ownership,Annual Income,Years in current job,Tax Liens,Number of Open Accounts,Years of Credit History,Maximum Open Credit,Number of Credit Problems,Months since last delinquent,Bankruptcies,Purpose,Term,Current Loan Amount,Current Credit Balance,Monthly Debt,Credit Score,Credit Default
0,Own Home,482087.0,10+ years,0.0,11.0,26.3,685960.0,1.0,32.0,1.0,debt consolidation,Short Term,99999999.0,47386.0,7914.0,749.0,0
1,Own Home,1025487.0,10+ years,0.0,15.0,15.3,1181730.0,0.0,32.0,0.0,debt consolidation,Long Term,264968.0,394972.0,18373.0,737.0,1
2,Home Mortgage,751412.0,8 years,0.0,11.0,35.0,1182434.0,0.0,32.0,0.0,debt consolidation,Short Term,99999999.0,308389.0,13651.0,742.0,0
3,Own Home,805068.0,6 years,0.0,8.0,22.5,147400.0,1.0,32.0,1.0,debt consolidation,Short Term,121396.0,95855.0,11338.0,694.0,0
4,Rent,776264.0,8 years,0.0,13.0,13.6,385836.0,1.0,32.0,0.0,debt consolidation,Short Term,125840.0,93309.0,7180.0,719.0,0
5,Rent,1168386.0,7 years,0.0,12.0,14.6,366784.0,0.0,32.0,0.0,other,Long Term,337304.0,165680.0,18692.0,731.0,1
6,Home Mortgage,1511108.0,10+ years,0.0,9.0,20.3,388124.0,0.0,73.0,0.0,home improvements,Short Term,99999999.0,51623.0,2317.0,745.0,0
7,Rent,1040060.0,10+ years,0.0,13.0,12.0,330374.0,0.0,18.0,0.0,other,Short Term,250888.0,89015.0,19761.0,705.0,1
8,Home Mortgage,1168386.0,5 years,0.0,17.0,15.7,0.0,1.0,32.0,1.0,home improvements,Short Term,129734.0,19.0,17.0,731.0,0
9,Home Mortgage,1168386.0,1 year,0.0,10.0,24.6,511302.0,0.0,6.0,0.0,debt consolidation,Long Term,572880.0,205333.0,17613.0,731.0,1


In [252]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7500 entries, 0 to 7499
Data columns (total 17 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Home Ownership                7500 non-null   object 
 1   Annual Income                 7500 non-null   float64
 2   Years in current job          7500 non-null   object 
 3   Tax Liens                     7500 non-null   float64
 4   Number of Open Accounts       7500 non-null   float64
 5   Years of Credit History       7500 non-null   float64
 6   Maximum Open Credit           7500 non-null   float64
 7   Number of Credit Problems     7500 non-null   float64
 8   Months since last delinquent  7500 non-null   float64
 9   Bankruptcies                  7500 non-null   float64
 10  Purpose                       7500 non-null   object 
 11  Term                          7500 non-null   object 
 12  Current Loan Amount           7500 non-null   float64
 13  Cur

In [253]:
LogRegression(df_train)

'f1 class 1: TRAIN = 0.7817, TEST = 0.768'

In [254]:
knn(df_train)

'f1 class 1: TRAIN = 0.7813, TEST = 0.6747'

In [255]:
decision_tree(df_train, max_depth=2)

'f1 class 1: TRAIN = 0.7726, TEST = 0.7693'

In [256]:
xgboost(df_train, n_estimators=10)



'f1 class 1: TRAIN = 0.8038, TEST = 0.7698'

In [257]:
light_gbm(df_train)

'f1 class 1: TRAIN = 0.8072, TEST = 0.7018'

In [258]:
# tuning
light_gbm(df_train, num_leaves=2)

'f1 class 1: TRAIN = 0.7259, TEST = 0.7076'

In [259]:
cat_boost(df_train)

'f1 class 1: TRAIN = 0.9025, TEST = 0.7156'

Model | Baseline | Dropped NaN | Mode + Median
---|---|---|---
LogReg | TRAIN = 0.7817, TEST = 0.768 | TRAIN = 0.8115, TEST = 0.7977 | TRAIN = 0.7817, TEST = 0.768
KNN | TRAIN = 0.7813, TEST = 0.6747 | TRAIN = 0.7828, TEST = 0.6997 | TRAIN = 0.7813, TEST = 0.6747
Decision Tree | TRAIN = 0.7726, TEST = 0.7693 | TRAIN = 0.8043, TEST = 0.8015 | TRAIN = 0.7726, TEST = 0.7693
XGB | TRAIN = 0.8038, TEST = 0.7698 | TRAIN = 0.8563, TEST = 0.8041 | TRAIN = 0.8038, TEST = 0.7698
LGBM | TRAIN = 0.8072, TEST = 0.7018 | TRAIN = 0.9038, TEST = 0.7358 | TRAIN = 0.8072, TEST = 0.7018
CatB | TRAIN = 0.9025, TEST = 0.7156 | TRAIN = 0.9663, TEST = 0.7668 | TRAIN = 0.9025, TEST = 0.7156

Conclusion: filling missing values with the mode and the median did not improve the score compared with dropping rows with missing values; the results are identical to the baseline.
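The imputation cell above fills only `df_train`. If this approach were used for the final prediction, the test set would need the same fill values, computed on the training data. A minimal sketch on toy data follows; `fit_fill_values` is a hypothetical helper and the values are made up for illustration.

```python
import numpy as np
import pandas as pd

def fit_fill_values(df, cat_cols, num_cols):
    """Compute per-column fill values: mode for categorical, median for numeric."""
    fill = df[cat_cols].mode().iloc[0].to_dict()
    fill.update(df[num_cols].median().to_dict())
    return fill

# Toy train and test frames (hypothetical values)
train = pd.DataFrame({
    "Term": ["Short Term", "Short Term", None, "Long Term"],
    "Credit Score": [700.0, np.nan, 740.0, 720.0],
})
test = pd.DataFrame({
    "Term": [None, "Long Term"],
    "Credit Score": [np.nan, 690.0],
})

# Fit on train only, then apply the same values to both frames
fill_values = fit_fill_values(train, cat_cols=["Term"], num_cols=["Credit Score"])
train_filled = train.fillna(fill_values)
test_filled = test.fillna(fill_values)

print(fill_values)
```

Reusing the training statistics keeps the test rows from influencing the preprocessing and guarantees that both frames are filled consistently.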

### Leave missing values as is (for algorithms that handle them natively)

In [263]:
df_train = pd.read_csv(TRAIN_DATASET_PATH)
df_test = pd.read_csv(TEST_DATASET_PATH)

df_train.head(5).style.applymap(lambda x: 'color: red' if pd.isnull(x) else '')

Unnamed: 0,Home Ownership,Annual Income,Years in current job,Tax Liens,Number of Open Accounts,Years of Credit History,Maximum Open Credit,Number of Credit Problems,Months since last delinquent,Bankruptcies,Purpose,Term,Current Loan Amount,Current Credit Balance,Monthly Debt,Credit Score,Credit Default
0,Own Home,482087.0,,0.0,11.0,26.3,685960.0,1.0,,1.0,debt consolidation,Short Term,99999999.0,47386.0,7914.0,749.0,0
1,Own Home,1025487.0,10+ years,0.0,15.0,15.3,1181730.0,0.0,,0.0,debt consolidation,Long Term,264968.0,394972.0,18373.0,737.0,1
2,Home Mortgage,751412.0,8 years,0.0,11.0,35.0,1182434.0,0.0,,0.0,debt consolidation,Short Term,99999999.0,308389.0,13651.0,742.0,0
3,Own Home,805068.0,6 years,0.0,8.0,22.5,147400.0,1.0,,1.0,debt consolidation,Short Term,121396.0,95855.0,11338.0,694.0,0
4,Rent,776264.0,8 years,0.0,13.0,13.6,385836.0,1.0,,0.0,debt consolidation,Short Term,125840.0,93309.0,7180.0,719.0,0


In [272]:
xgboost(df_train, n_estimators=10)



'f1 class 1: TRAIN = 0.8015, TEST = 0.7676'

In [273]:
light_gbm(df_train)

'f1 class 1: TRAIN = 0.8063, TEST = 0.7084'

In [274]:
# tuning
light_gbm(df_train, num_leaves=2)

'f1 class 1: TRAIN = 0.667, TEST = 0.6516'

Model | Baseline | Dropped NaN | With NaN
---|---|---|---
LogReg | TRAIN = 0.7817, TEST = 0.768 | TRAIN = 0.8115, TEST = 0.7977 | 
KNN | TRAIN = 0.7813, TEST = 0.6747 | TRAIN = 0.7828, TEST = 0.6997 | 
Decision Tree | TRAIN = 0.7726, TEST = 0.7693 | TRAIN = 0.8043, TEST = 0.8015 | 
XGB | TRAIN = 0.8038, TEST = 0.7698 | TRAIN = 0.8563, TEST = 0.8041 | TRAIN = 0.8015, TEST = 0.7676
LGBM | TRAIN = 0.8072, TEST = 0.7018 | TRAIN = 0.9038, TEST = 0.7358 | TRAIN = 0.8063, TEST = 0.7084
CatB | TRAIN = 0.9025, TEST = 0.7156 | TRAIN = 0.9663, TEST = 0.7668 | 

Conclusion: leaving NaN values as is did not improve the score compared with dropping rows with missing values.

## Handling outliers

### Drop rows with outliers

### Replace outliers with different methods (medians, means, business logic, model-based...)

# Final model

In [283]:
df_train = pd.read_csv(TRAIN_DATASET_PATH)
df_test = pd.read_csv(TEST_DATASET_PATH)
print('Original dataset shape = ', df_train.shape)

df_train.dropna(axis='index', inplace=True)
print('Dataset shape without NaN = ', df_train.shape)

Original dataset shape =  (7500, 17)
Dataset shape without NaN =  (2585, 17)


In [286]:
X

Unnamed: 0,Annual Income,Tax Liens,Number of Open Accounts,Years of Credit History,Maximum Open Credit,Number of Credit Problems,Months since last delinquent,Bankruptcies,Current Loan Amount,Current Credit Balance,...,Purpose_major purchase,Purpose_medical bills,Purpose_moving,Purpose_other,Purpose_small business,Purpose_take a trip,Purpose_vacation,Purpose_wedding,Term_Long Term,Term_Short Term
6,1511108.0,0.0,9.0,20.3,388124.0,0.0,73.0,0.0,99999999.0,51623.0,...,0,0,0,0,0,0,0,0,0,1
7,1040060.0,0.0,13.0,12.0,330374.0,0.0,18.0,0.0,250888.0,89015.0,...,0,0,0,1,0,0,0,0,0,1
18,1401744.0,0.0,9.0,29.0,387222.0,0.0,40.0,0.0,553586.0,201989.0,...,0,0,0,0,0,0,0,0,1,0
20,1651993.0,0.0,11.0,26.5,663894.0,0.0,44.0,0.0,585090.0,252852.0,...,0,0,0,0,0,0,0,0,1,0
21,1047394.0,0.0,7.0,34.4,401104.0,0.0,45.0,0.0,324852.0,183597.0,...,0,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7489,1394942.0,0.0,15.0,27.2,1441396.0,0.0,35.0,0.0,753764.0,496698.0,...,0,0,0,0,0,0,0,0,0,1
7490,1368000.0,0.0,20.0,26.7,897842.0,0.0,69.0,0.0,683650.0,517199.0,...,0,0,0,0,0,0,0,0,0,1
7491,2833185.0,0.0,18.0,21.3,280170.0,0.0,6.0,0.0,437404.0,108889.0,...,0,0,0,0,0,0,0,0,0,1
7493,1257610.0,0.0,14.0,16.5,821480.0,0.0,58.0,0.0,448052.0,167428.0,...,0,0,0,0,0,0,0,0,1,0


In [287]:
y

6       0
7       1
18      1
20      0
21      1
       ..
7489    1
7490    1
7491    0
7493    1
7496    1
Name: Credit Default, Length: 2585, dtype: int64

In [313]:
# Convert categorical features to numeric ones (one-hot encoding)
df = pd.get_dummies(df)

# One-hot encoding produces the column 'Years in current job_< 1 year';
# XGBoost does not accept '[', ']' or '<' in feature names, so replace them
regex = re.compile(r"\[|\]|<")
df.columns = [regex.sub("_", str(col)) for col in df.columns]

# Build the feature matrix and the target
X = df.drop(columns=[TARGET_NAME])
y = df[TARGET_NAME]

X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    shuffle=True,
                                                    test_size=0.3,
                                                    random_state=21,
                                                    stratify=y)
# Train the model
model = xgb.XGBClassifier(random_state=21,
                          reg_lambda=10000,
                         )
model.fit(X_train, y_train)

# Predict
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

# Compute the metric. Note: accuracy_score returns accuracy, not F1
accuracy_train = accuracy_score(y_train, y_train_pred)
accuracy_test = accuracy_score(y_test, y_test_pred)

print(f'accuracy: TRAIN = {round(accuracy_train, 4)}, TEST = {round(accuracy_test, 4)}')

accuracy: TRAIN = 0.789, TEST = 0.7769
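The cell above scores the model with `accuracy_score`, while the project metric is F1 for class 1, so the local numbers are not directly comparable with the leaderboard. A minimal sketch of how the two metrics can diverge on imbalanced labels is shown below. The labels are made up, and `accuracy` and `f1_class_1` are hypothetical helpers written out by hand; in the notebook itself, `sklearn.metrics.f1_score(y_true, y_pred)` with its default positive label of 1 should give the same F1 value.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_class_1(y_true, y_pred):
    """F1 for the positive class: harmonic mean of precision and recall."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up labels: 8 non-defaults and 2 defaults
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# The model raises one false alarm and finds one of the two defaults
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

print(round(accuracy(y_true, y_pred), 4), round(f1_class_1(y_true, y_pred), 4))
```

Here accuracy is 0.8 while F1 for class 1 is only 0.5, because the majority class dominates accuracy but does not enter the F1 computation for the positive class.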


In [314]:
# Convert categorical features to numeric ones (one-hot encoding)
df_train = pd.get_dummies(df_train)
df_test = pd.get_dummies(df_test)

# One-hot encoding produces the column 'Years in current job_< 1 year';
# XGBoost does not accept '[', ']' or '<' in feature names, so replace them
regex = re.compile(r"\[|\]|<")
df_train.columns = [regex.sub("_", str(col)) for col in df_train.columns]
df_test.columns = [regex.sub("_", str(col)) for col in df_test.columns]

# Build the feature matrix and the target
X = df_train.drop(columns=[TARGET_NAME])
y = df_train[TARGET_NAME]

# Train the final model on the full training set (no hold-out split)
model = xgb.XGBClassifier(random_state=21,
                          reg_lambda=10000,
                         )
model.fit(X, y)

# Predict
y_train_pred = model.predict(X)
y_pred = model.predict(df_test)

# Compute the metric. Note: accuracy_score returns accuracy, not F1
accuracy_train = accuracy_score(y, y_train_pred)

print(f'accuracy: TRAIN = {round(accuracy_train, 4)}')

accuracy: TRAIN = 0.8066
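The cell above one-hot encodes `df_train` and `df_test` separately, so the two frames may end up with different column sets if a category occurs in only one of them. One way to guard against that, sketched on toy data with made-up values, is to reindex the encoded test frame to the training columns.

```python
import pandas as pd

# Toy frames (hypothetical values): 'wedding' appears only in train,
# 'vacation' only in test
train_raw = pd.DataFrame({
    "Purpose": ["other", "wedding", "other"],
    "Monthly Debt": [7914.0, 18373.0, 13651.0],
})
test_raw = pd.DataFrame({
    "Purpose": ["other", "vacation"],
    "Monthly Debt": [2317.0, 19761.0],
})

X_toy = pd.get_dummies(train_raw)
X_test_toy = pd.get_dummies(test_raw)

# Match the training layout: same columns in the same order,
# missing dummies filled with 0, unseen dummies dropped
X_test_aligned = X_test_toy.reindex(columns=X_toy.columns, fill_value=0)

print(list(X_test_aligned.columns))
```

In the notebook this would correspond to reindexing the encoded `df_test` to `X.columns` before calling `model.predict`.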


In [290]:
y_pred

array([0, 1, 1, ..., 0, 0, 1], dtype=int64)

In [315]:
submission = pd.DataFrame()
submission['Credit Default'] = y_pred
submission.to_csv(f'data/AKalinichenko_xgb_drop_NaN_lambda10000_predictions.csv', 
                  index_label='Id', 
                  header=['Credit Default'], 
                  index=True, 
                  encoding='utf-8')
print('Done!')

Done!


Model | Processing | Train | Test | Private | Public
---|---|---|---|---|---
1 | mode? | | | 0.50455 | 0.48577
XGB | Drop NaN n_estimators=5 | 0.8563 | 0.8041 | 0.48820 | 0.42458
XGB | Drop NaN + lambda=10000 | 0.789 | 0.7769 | 0.51692 | 0.40601