<div style="font-size:18pt; padding-top:20px; text-align:center">SEMINAR. <b>Classification with Imbalanced Data</b></div><hr>
<div style="text-align:right;">S. Yu. Papulin <span style="font-style: italic;font-weight: bold;">(papulin.study@yandex.ru)</span></div>

### Contents
- [Loading the source data](#Loading-the-source-data)
- [Model training and quality evaluation](#Model-training-and-quality-evaluation)
    - [Computing the baseline](#Computing-the-baseline)
    - [Logistic regression](#Logistic-regression)
    - [Changing the prediction threshold](#Changing-the-prediction-threshold)
    - [Logistic regression with class weights](#Logistic-regression-with-class-weights)
- [Choosing class weights in logistic regression](#Choosing-class-weights-in-logistic-regression)
- [Assignment](#Assignment)
- [References](#References)

<p><b>Importing libraries</b></p>

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import (
    accuracy_score, 
    precision_score, 
    recall_score, 
    f1_score,
    roc_curve, 
    roc_auc_score
)
from sklearn.model_selection import (
    train_test_split, 
    StratifiedKFold, 
    StratifiedShuffleSplit, 
    GridSearchCV, 
    cross_val_score
)

%matplotlib inline

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
# import warnings
# warnings.filterwarnings("ignore", category=DeprecationWarning)

In [None]:
import sys
sys.path.insert(0, "../lib/")
from plot_confusion_matrix import plot_confusion_matrix

## Loading the source data

The dataset contains 10,000 observations with the following columns:

- `default`

    *Yes* means the customer failed to repay the debt; *No* means the debt was repaid.

- `student`

    Whether the customer is a student (Yes/No).

- `balance`

    Average credit card balance remaining after the monthly payment.

- `income`

    The customer's income.

In [None]:
FILE_PATH = "../data/Default.csv"

df = pd.read_csv(FILE_PATH)
df.head()

Converting the string values to numeric (0/1) categorical features:

In [None]:
# Alternative: convert while reading by passing this argument to pd.read_csv:
# converters={"default": lambda x: int(x == "Yes"), "student": lambda x: int(x == "Yes")}

# Other options:
# df["default"] = np.where(df["default"] == "Yes", 1, 0)
# df["default"] = (df["default"] == "Yes").astype("int")

# Map "Yes" to 1 and everything else to 0
df["default"] = df["default"].apply(lambda x: int(x == "Yes"))
df["student"] = df["student"].apply(lambda x: int(x == "Yes"))
df.head(5)

Feature columns and the target column:

In [None]:
TARGET_COLUMN = "default"
# Keep the original column order (a set difference would not guarantee it)
FEATURE_COLUMNS = [column for column in df.columns if column != TARGET_COLUMN]
FEATURE_COLUMNS

Number of observations in each target class:

In [None]:
df["default"].value_counts()

## Model training and quality evaluation

Splitting the data into training and test subsets:

In [None]:
RANDOM_STATE = 123

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    df[FEATURE_COLUMNS], 
    df[TARGET_COLUMN],
    test_size=0.3, 
    random_state=RANDOM_STATE
)

print("Training set:\n{}".format(y_train.value_counts()))
print("\nTest set:\n{}".format(y_test.value_counts()))
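
With a rare positive class, a purely random split can leave the training and test subsets with slightly different class proportions. The following is a minimal sketch on synthetic labels (the `*_demo` names and the 3% positive rate are illustrative assumptions, not values from the dataset) of how the `stratify` argument of `train_test_split` keeps the proportions aligned:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: roughly 3% positives (an assumption for illustration only)
rng = np.random.RandomState(0)
y_demo = (rng.rand(10000) < 0.03).astype(int)
X_demo = rng.normal(size=(10000, 2))

# stratify=y_demo asks for the same class proportions in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

print("Positive share, full data:", y_demo.mean())
print("Positive share, train:    ", y_tr.mean())
print("Positive share, test:     ", y_te.mean())
```

Whether to stratify the split in this notebook is a design choice; the cell above uses a plain random split.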

### Computing the baseline

In [None]:
# Baseline: always predict the majority class (0, no default)
y_train_pred = np.zeros(y_train.shape[0])
y_test_pred = np.zeros(y_test.shape[0])

# Precision and F1 are undefined when no positives are predicted;
# zero_division=0 (scikit-learn >= 0.22) reports them as 0 without a warning.
print("Training set")
print("Accuracy \t= {}".format(accuracy_score(y_train, y_train_pred)))
print("Precision \t= {}".format(precision_score(y_train, y_train_pred, zero_division=0)))
print("Recall \t\t= {}".format(recall_score(y_train, y_train_pred)))
print("F1 \t\t= {}".format(f1_score(y_train, y_train_pred, zero_division=0)))

print("\nTest set")
print("Accuracy \t= {}".format(accuracy_score(y_test, y_test_pred)))
print("Precision \t= {}".format(precision_score(y_test, y_test_pred, zero_division=0)))
print("Recall \t\t= {}".format(recall_score(y_test, y_test_pred)))
print("F1 \t\t= {}".format(f1_score(y_test, y_test_pred, zero_division=0)))
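
To see why accuracy alone is misleading for such a baseline, the metrics can be worked out by hand from confusion-matrix counts. The counts below are hypothetical (9,700 negatives and 300 positives), chosen only to illustrate the arithmetic:

```python
# Hypothetical confusion-matrix counts for an "always predict 0" baseline
tn, fp, fn, tp = 9700, 0, 300, 0

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)
# Precision is 0/0 when nothing is predicted positive; report 0 by convention
precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
f1 = (2 * precision * recall / (precision + recall)
      if (precision + recall) > 0 else 0.0)

print("Accuracy  =", accuracy)
print("Precision =", precision)
print("Recall    =", recall)
print("F1        =", f1)
```

Accuracy comes out high simply because negatives dominate, while recall and F1 are zero: the baseline never finds a single default.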

Confusion matrix:

In [None]:
# Train
plot_confusion_matrix(y_train, 
                      y_train_pred, 
                      np.array(["Non-default", "Default"]))
plt.show()

In [None]:
# Test
plot_confusion_matrix(y_test, 
                      y_test_pred, 
                      np.array(["Non-default", "Default"]))
plt.show()

### Logistic regression

In [None]:
logr_model = LogisticRegression(penalty="l2", fit_intercept=True, max_iter=100, C=1e5,
                                solver="lbfgs", random_state=RANDOM_STATE)
logr_model.fit(X_train, y_train)

y_train_pred = logr_model.predict(X_train)
y_test_pred = logr_model.predict(X_test)

print("Training set")
print("Accuracy \t= {}".format(accuracy_score(y_train, y_train_pred)))
print("Precision \t= {}".format(precision_score(y_train, y_train_pred)))
print("Recall \t\t= {}".format(recall_score(y_train, y_train_pred)))
print("F1 \t\t= {}".format(f1_score(y_train, y_train_pred)))

print("\nTest set")
print("Accuracy \t= {}".format(accuracy_score(y_test, y_test_pred)))
print("Precision \t= {}".format(precision_score(y_test, y_test_pred)))
print("Recall \t\t= {}".format(recall_score(y_test, y_test_pred)))
print("F1 \t\t= {}".format(f1_score(y_test, y_test_pred)))

Confusion matrix:

In [None]:
# Train
plot_confusion_matrix(y_train, 
                      logr_model.predict(X_train), 
                      np.array(["Non-default", "Default"]))
plt.show()

In [None]:
# Test
plot_confusion_matrix(y_test, 
                      logr_model.predict(X_test), 
                      np.array(["Non-default", "Default"]))
plt.show()

ROC:

In [None]:
fpr_train, tpr_train, thresholds_train = roc_curve(y_train, logr_model.predict_proba(X_train)[:,1])
fpr_test, tpr_test, thresholds_test = roc_curve(y_test, logr_model.predict_proba(X_test)[:,1])
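
Each point of a ROC curve corresponds to one decision threshold: everything scored at or above the threshold is predicted positive, and the resulting true and false positive rates are recorded. A minimal sketch of that sweep follows; the helper `roc_points` and the toy data are illustrative, and it is not how `roc_curve` is implemented internally:

```python
import numpy as np

def roc_points(y_true, scores):
    """Return (fpr, tpr) arrays, one point per distinct threshold, highest first."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    n_pos = (y_true == 1).sum()
    n_neg = (y_true == 0).sum()
    # Start with a threshold above every score: nothing is predicted positive
    fpr, tpr = [0.0], [0.0]
    for threshold in np.sort(np.unique(scores))[::-1]:
        predicted_positive = scores >= threshold
        tpr.append((predicted_positive & (y_true == 1)).sum() / n_pos)
        fpr.append((predicted_positive & (y_true == 0)).sum() / n_neg)
    return np.array(fpr), np.array(tpr)

# Toy labels and scores
fpr_demo, tpr_demo = roc_points([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(fpr_demo)
print(tpr_demo)
```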

In [None]:
# thresholds_train

In [None]:
plt.figure(1, figsize=[12, 4])

plt.subplot(1,2,1)
plt.plot([0,1], [0,1], "--", color="grey")
plt.title("ROC for train")
plt.axvline(0, linestyle="-", c="black", lw=1)
plt.axvline(1, linestyle="--", c="black", lw=1)
plt.axhline(1, linestyle="--", c="black", lw=1)
plt.plot(fpr_train, tpr_train, "-", c="seagreen", lw=4)
plt.grid(True)
plt.ylabel("True Positive Rate")
plt.xlabel("False Positive Rate")
plt.xlim(-0.1, 1.1)
plt.ylim(0, 1.1)

plt.subplot(1,2,2)
plt.plot([0,1], [0,1], "--", color="grey")
plt.title("ROC for test")
plt.axvline(0, linestyle="-", c="black", lw=1)
plt.axvline(1, linestyle="--", c="black", lw=1)
plt.axhline(1, linestyle="--", c="black", lw=1)
plt.plot(fpr_test, tpr_test, "-", c="seagreen", lw=4)
plt.grid(True)
plt.ylabel("True Positive Rate")
plt.xlabel("False Positive Rate")
plt.xlim(-0.1, 1.1)
plt.ylim(0, 1.1)

plt.show()

ROC AUC:

In [None]:
roc_auc_train = roc_auc_score(y_train, logr_model.predict_proba(X_train)[:,1])
roc_auc_test = roc_auc_score(y_test, logr_model.predict_proba(X_test)[:,1])

print("ROC AUC on the training set: {}".format(roc_auc_train))
print("ROC AUC on the test set: {}".format(roc_auc_test))
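
ROC AUC can also be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. A small sketch that computes it directly from this definition (the helper name `pairwise_auc` and the toy data are illustrative):

```python
def pairwise_auc(y_true, scores):
    """Share of (positive, negative) pairs in which the positive scores higher."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(positives) * len(negatives))

auc_demo = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc_demo)  # 0.75: three of the four pairs are ordered correctly
```

The double loop is quadratic in the sample size, so it is suitable only for illustration on small inputs.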

### Changing the prediction threshold

In [None]:
# Mean of the thresholds at which the training TPR (recall) lies between 0.8 and 0.9
THRESHOLD = thresholds_train[(tpr_train > 0.8) & (tpr_train < 0.9)].mean()
THRESHOLD
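
The cell above averages all thresholds that fall within a recall band. An alternative, sketched here under the assumption that a single target recall is wanted, is to take the largest threshold that still captures the required share of positives. The helper `threshold_for_recall` and the toy data are hypothetical:

```python
import numpy as np

def threshold_for_recall(y_true, scores, target_recall):
    """Largest threshold t for which predicting `scores >= t` reaches target_recall."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    # Scores of the positive examples, highest first
    positive_scores = np.sort(scores[y_true == 1])[::-1]
    # Number of positives that must be captured to reach the target
    needed = max(int(np.ceil(target_recall * len(positive_scores))), 1)
    return positive_scores[needed - 1]

y_demo = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1])
p_demo = np.array([0.05, 0.2, 0.3, 0.4, 0.5, 0.6, 0.1, 0.7, 0.9])

t_demo = threshold_for_recall(y_demo, p_demo, target_recall=0.8)
recall_demo = ((p_demo >= t_demo) & (y_demo == 1)).sum() / (y_demo == 1).sum()
print(t_demo, recall_demo)
```

As with the band-averaged threshold, the value should be chosen on training or validation data, not on the test subset.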

In [None]:
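def predict_with_threshold(model, threshold, X):
    """Predict class 1 when the estimated probability of class 1 is at least `threshold`."""
    if hasattr(model, "predict_proba") and callable(model.predict_proba):
        return np.where(model.predict_proba(X)[:,1] >= threshold, 1, 0)
    raise TypeError("This model isn't supported: it has no predict_proba method.")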

In [None]:
y_train_pred = predict_with_threshold(logr_model, THRESHOLD, X_train)
y_test_pred = predict_with_threshold(logr_model, THRESHOLD, X_test)

print("Training set")
print("Accuracy \t= {}".format(accuracy_score(y_train, y_train_pred)))
print("Precision \t= {}".format(precision_score(y_train, y_train_pred)))
print("Recall \t\t= {}".format(recall_score(y_train, y_train_pred)))
print("F1 \t\t= {}".format(f1_score(y_train, y_train_pred)))

print("\nTest set")
print("Accuracy \t= {}".format(accuracy_score(y_test, y_test_pred)))
print("Precision \t= {}".format(precision_score(y_test, y_test_pred)))
print("Recall \t\t= {}".format(recall_score(y_test, y_test_pred)))
print("F1 \t\t= {}".format(f1_score(y_test, y_test_pred)))

Confusion matrix:

In [None]:
# Train
plot_confusion_matrix(y_train, 
                      y_train_pred, 
                      np.array(["Non-default", "Default"]))
plt.show()

In [None]:
# Test
plot_confusion_matrix(y_test, 
                      y_test_pred, 
                      np.array(["Non-default", "Default"]))
plt.show()

### Logistic regression with class weights

In [None]:
logr_model = LogisticRegression(penalty="l2", fit_intercept=True, max_iter=100, C=1e5, 
                                class_weight="balanced",
                                solver="lbfgs", random_state=RANDOM_STATE)
logr_model.fit(X_train, y_train)

y_train_pred = logr_model.predict(X_train)
y_test_pred = logr_model.predict(X_test)

print("Training set")
print("Accuracy \t= {}".format(accuracy_score(y_train, y_train_pred)))
print("Precision \t= {}".format(precision_score(y_train, y_train_pred)))
print("Recall \t\t= {}".format(recall_score(y_train, y_train_pred)))
print("F1 \t\t= {}".format(f1_score(y_train, y_train_pred)))

print("\nTest set")
print("Accuracy \t= {}".format(accuracy_score(y_test, y_test_pred)))
print("Precision \t= {}".format(precision_score(y_test, y_test_pred)))
print("Recall \t\t= {}".format(recall_score(y_test, y_test_pred)))
print("F1 \t\t= {}".format(f1_score(y_test, y_test_pred)))
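
According to the scikit-learn documentation, `class_weight="balanced"` sets the weights inversely proportional to class frequencies, as `n_samples / (n_classes * np.bincount(y))`. A short sketch of that formula on hypothetical class counts (9,700 negatives and 300 positives, chosen for illustration):

```python
import numpy as np

# Hypothetical labels: 9,700 of class 0 and 300 of class 1
y_demo = np.array([0] * 9700 + [1] * 300)

n_samples = len(y_demo)
n_classes = len(np.unique(y_demo))
# One weight per class; the rarer the class, the larger its weight
balanced_weights = n_samples / (n_classes * np.bincount(y_demo))

print(dict(enumerate(balanced_weights)))
print("Weight ratio (class 1 / class 0):", balanced_weights[1] / balanced_weights[0])
```

With these weights, each class contributes the same total weight to the loss, regardless of how many observations it has.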

Confusion matrix:

In [None]:
# Train
plot_confusion_matrix(y_train, 
                      logr_model.predict(X_train), 
                      np.array(["Non-default", "Default"]))
plt.show()

In [None]:
# Test
plot_confusion_matrix(y_test, 
                      logr_model.predict(X_test), 
                      np.array(["Non-default", "Default"]))
plt.show()

ROC:

In [None]:
fpr_train, tpr_train, thresholds_train = roc_curve(y_train, logr_model.predict_proba(X_train)[:,1])
fpr_test, tpr_test, thresholds_test = roc_curve(y_test, logr_model.predict_proba(X_test)[:,1])

In [None]:
plt.figure(1, figsize=[12, 4])

plt.subplot(1,2,1)
plt.plot([0,1], [0,1], "--", color="grey")
plt.title("ROC for train")
plt.axvline(0, linestyle="-", c="black", lw=1)
plt.axvline(1, linestyle="--", c="black", lw=1)
plt.axhline(1, linestyle="--", c="black", lw=1)
plt.plot(fpr_train, tpr_train, "-", c="seagreen", lw=4)
plt.grid(True)
plt.ylabel("True Positive Rate")
plt.xlabel("False Positive Rate")
plt.xlim(-0.1, 1.1)
plt.ylim(0, 1.1)

plt.subplot(1,2,2)
plt.plot([0,1], [0,1], "--", color="grey")
plt.title("ROC for test")
plt.axvline(0, linestyle="-", c="black", lw=1)
plt.axvline(1, linestyle="--", c="black", lw=1)
plt.axhline(1, linestyle="--", c="black", lw=1)
plt.plot(fpr_test, tpr_test, "-", c="seagreen", lw=4)
plt.grid(True)
plt.ylabel("True Positive Rate")
plt.xlabel("False Positive Rate")
plt.xlim(-0.1, 1.1)
plt.ylim(0, 1.1)

plt.show()

ROC AUC:

In [None]:
roc_auc_train = roc_auc_score(y_train, logr_model.predict_proba(X_train)[:,1])
roc_auc_test = roc_auc_score(y_test, logr_model.predict_proba(X_test)[:,1])

print("ROC AUC on the training set: {}".format(roc_auc_train))
print("ROC AUC on the test set: {}".format(roc_auc_test))

Adjusting the class weights manually:

In [None]:
logr_model = LogisticRegression(penalty="l2", fit_intercept=True, max_iter=100, C=1e5, 
                                class_weight={0: 0.1, 1: 0.9},
                                solver="lbfgs", random_state=RANDOM_STATE)
logr_model.fit(X_train, y_train)

y_train_pred = logr_model.predict(X_train)
y_test_pred = logr_model.predict(X_test)

print("Training set")
print("Accuracy \t= {}".format(accuracy_score(y_train, y_train_pred)))
print("Precision \t= {}".format(precision_score(y_train, y_train_pred)))
print("Recall \t\t= {}".format(recall_score(y_train, y_train_pred)))
print("F1 \t\t= {}".format(f1_score(y_train, y_train_pred)))

print("\nTest set")
print("Accuracy \t= {}".format(accuracy_score(y_test, y_test_pred)))
print("Precision \t= {}".format(precision_score(y_test, y_test_pred)))
print("Recall \t\t= {}".format(recall_score(y_test, y_test_pred)))
print("F1 \t\t= {}".format(f1_score(y_test, y_test_pred)))
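
Class weights act by scaling each observation's contribution to the log loss according to its true class, so errors on the heavily weighted class cost more. The sketch below illustrates the idea with a hand-written weighted loss. The helper names and the toy probabilities are assumptions for illustration, and regularization is ignored:

```python
import numpy as np

def weighted_losses(y_true, p_pred, class_weight):
    """Per-sample log loss multiplied by the weight of the sample's true class."""
    y_true = np.asarray(y_true)
    p_pred = np.asarray(p_pred, dtype=float)
    weights = np.where(y_true == 1, class_weight[1], class_weight[0])
    log_loss = -(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))
    return weights * log_loss

def positive_share(y_true, p_pred, class_weight):
    """Fraction of the total weighted loss that comes from class 1 samples."""
    losses = weighted_losses(y_true, p_pred, class_weight)
    return losses[np.asarray(y_true) == 1].sum() / losses.sum()

# Hypothetical labels and predicted probabilities of class 1
y_demo = np.array([0, 0, 0, 1])
p_demo = np.array([0.1, 0.2, 0.1, 0.3])

share_equal = positive_share(y_demo, p_demo, {0: 0.5, 1: 0.5})
share_skewed = positive_share(y_demo, p_demo, {0: 0.1, 1: 0.9})
print("Loss share of class 1, equal weights: ", share_equal)
print("Loss share of class 1, weights 0.1/0.9:", share_skewed)
```

Shifting weight toward class 1 increases its share of the loss, which pushes the fitted model toward higher recall at the cost of precision.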

In [None]:
plot_confusion_matrix(y_train, 
                      logr_model.predict(X_train), 
                      np.array(["Non-default", "Default"]))
plt.show()

In [None]:
plot_confusion_matrix(y_test, 
                      logr_model.predict(X_test), 
                      np.array(["Non-default", "Default"]))
plt.show()

## Choosing class weights in logistic regression

Base model:

In [None]:
logr_model = LogisticRegression(penalty="l2", 
                                fit_intercept=True, 
                                max_iter=100, 
                                C=1e5,
                                solver="lbfgs", 
                                random_state=RANDOM_STATE)

**GridSearchCV**

Defining the parameter grid:

In [None]:
parameters = {
    "class_weight": (
        {0: 0.5, 1: 0.5}, 
        {0: 0.1, 1: 0.9}, 
        {0: 0.01, 1: 0.99}, 
        {0: 0.001, 1: 0.999}, 
        {0: 0.0001, 1: 0.9999},
        {0: 0.00001, 1: 0.99999}
    )
}
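
The same kind of grid can be generated rather than typed out. This is a small sketch, assuming the weights should sum to one as in the cell above; the names `minority_weights` and `grid_demo` are illustrative:

```python
# Weights for class 1; the class 0 weight is the complement
minority_weights = [0.5, 0.9, 0.99, 0.999, 0.9999, 0.99999]

grid_demo = {
    "class_weight": [
        # round() removes floating-point noise such as 0.09999999999999998
        {0: round(1 - w, 5), 1: w} for w in minority_weights
    ]
}

for weights in grid_demo["class_weight"]:
    print(weights, "ratio class 1 / class 0 =", weights[1] / weights[0])
```

Printing the ratios makes it visible that the grid spans several orders of magnitude in relative class importance.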

Cross-validation scheme for parameter selection:

In [None]:
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=RANDOM_STATE)
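
Stratification matters here because, with few positives, an ordinary K-fold split could produce validation folds with very few or no defaults. A minimal sketch on hypothetical labels (970 negatives and 30 positives) shows how `StratifiedKFold` spreads the positives across folds:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical labels; the features play no role in how the folds are formed
y_demo = np.array([0] * 970 + [1] * 30)
X_demo = np.zeros((len(y_demo), 1))

skf_demo = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
positives_per_fold = [
    int(y_demo[val_idx].sum()) for _, val_idx in skf_demo.split(X_demo, y_demo)
]
print(positives_per_fold)
```

Every validation fold receives its proportional share of the minority class, so the per-fold scores are computed on comparable data.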

In [None]:
# skf = StratifiedShuffleSplit(n_splits=10, random_state=RANDOM_STATE)

Training:

In [None]:
clf = GridSearchCV(estimator=logr_model, 
                   param_grid=parameters, 
                   cv=skf, 
                   scoring="balanced_accuracy", 
                   refit=True, 
                   return_train_score=True)
clf.fit(X_train, y_train)
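
The `balanced_accuracy` score used for the search is the average of the recall obtained on each class, so the majority class cannot dominate it. A short worked sketch for the binary case, with hypothetical confusion-matrix counts:

```python
def balanced_accuracy(tn, fp, fn, tp):
    """Mean of the recall on the positive class and the recall on the negative class."""
    recall_positive = tp / (tp + fn)  # sensitivity
    recall_negative = tn / (tn + fp)  # specificity
    return (recall_positive + recall_negative) / 2

# "Always predict 0" baseline: plain accuracy is 0.97, balanced accuracy is 0.5
baseline_score = balanced_accuracy(tn=9700, fp=0, fn=300, tp=0)

# A hypothetical model that finds 90% of positives at the cost of false alarms
model_score = balanced_accuracy(tn=8400, fp=1300, fn=30, tp=270)

print(baseline_score)
print(model_score)
```

This is why the metric is a reasonable selection criterion for class weights: a model that ignores the minority class cannot score above 0.5.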

Metric values on the validation folds for each parameter setting:

In [None]:
clf.cv_results_["mean_test_score"]

Plotting `balanced_accuracy` on the training and validation sets:

In [None]:
# Ratio of the class 0 weight to the class 1 weight for each grid point
weight_ratios = [weights[0] / weights[1] for weights in parameters["class_weight"]]

plt.figure(figsize=[6, 4])

plt.subplot(1,1,1)
plt.title("balanced_accuracy")
plt.plot(weight_ratios, clf.cv_results_["mean_test_score"], "o-", label="Val")
plt.plot(weight_ratios, clf.cv_results_["mean_train_score"], "o-", label="Train")
plt.xlabel("class 0 weight / class 1 weight")
plt.ylabel("balanced_accuracy")
plt.xscale("log")
plt.legend()
plt.grid(True)

plt.show()

Best parameters:

In [None]:
clf.best_params_

Confusion matrices on the training and test subsets:

In [None]:
plot_confusion_matrix(y_train, 
                      clf.predict(X_train), 
                      np.array(["Non-default", "Default"]))
plt.show()

In [None]:
plot_confusion_matrix(y_test, 
                      clf.predict(X_test), 
                      np.array(["Non-default", "Default"]))
plt.show()

## Assignment

## References

- [An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani](http://faculty.marshall.usc.edu/gareth-james/ISL/)
- [3.1. Cross-validation: evaluating estimator performance](https://scikit-learn.org/stable/modules/cross_validation.html)
- [3.3. Metrics and scoring: quantifying the quality of predictions](https://scikit-learn.org/stable/modules/model_evaluation.html)