**Task**

Using the available data on bank customers, build a model on the training dataset that predicts default on the current loan, then generate predictions for the examples in the test dataset.

**Data files**

course_project_train.csv - training dataset<br>
course_project_test.csv - test dataset

**Target variable**

Credit Default - whether the borrower defaulted on the loan

**Quality metric**

F1-score (sklearn.metrics.f1_score)

**Solution requirements**

*Target metric*
* F1 > 0.5
* The metric is evaluated on predictions for the main class (1 - overdue loan)
* F1 > 0.5, precision > 0.5, recall > 0.5 on X_test
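
A minimal sketch of how these thresholds could be checked with scikit-learn. The `y_true` and `y_pred` lists are toy values invented for illustration, not project results, and the `meets_requirements` name is hypothetical.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy labels, invented for illustration (1 = overdue loan, 0 = repaid on time)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# pos_label=1 computes each metric for the positive class only
metrics = {
    'f1': f1_score(y_true, y_pred, pos_label=1),
    'precision': precision_score(y_true, y_pred, pos_label=1),
    'recall': recall_score(y_true, y_pred, pos_label=1),
}

# All three metrics must exceed 0.5 to satisfy the requirements
meets_requirements = all(value > 0.5 for value in metrics.values())
```

In the real notebook the same three calls would take the held-out labels and the model predictions in place of the toy lists.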

*The solution must include*
1. A Jupyter Notebook with your solution code, named according to the pattern {full name}
2. A CSV file with target variable predictions for the test dataset, named according to the pattern {full name}

In [1]:
import numpy as np
import pandas as pd
import random
import pickle

from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split, ShuffleSplit, cross_val_score, learning_curve
from sklearn.model_selection import KFold, GridSearchCV, RandomizedSearchCV
from sklearn.metrics import classification_report, f1_score, precision_score, recall_score

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
import xgboost as xgb, lightgbm as lgbm, catboost as catb

import matplotlib
import matplotlib.image as img
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline


### Importing libraries and scripts

In [2]:
import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')
pd.set_option('display.float_format', lambda x: '%.2f' % x)
pd.set_option('display.max_rows', 50)

In [3]:
%config InlineBackend.figure_format = 'png'

**Paths to directories and files**

In [4]:
# input
TRAIN_DATASET_PATH = 'course_project_train.csv'
TEST_DATASET_PATH = 'course_project_test.csv'

### Loading the data

**Dataset description**

* **Home Ownership** - home ownership status
* **Annual Income** - annual income
* **Years in current job** - number of years at the current job
* **Tax Liens** - tax liens
* **Number of Open Accounts** - number of open accounts
* **Years of Credit History** - number of years of credit history
* **Maximum Open Credit** - maximum open credit
* **Number of Credit Problems** - number of credit problems
* **Months since last delinquent** - number of months since the last delinquency
* **Bankruptcies** - bankruptcies
* **Purpose** - loan purpose
* **Term** - loan term (0 - short term, 1 - long term)
* **Current Loan Amount** - current loan amount
* **Current Credit Balance** - current credit balance
* **Monthly Debt** - monthly debt
* **Credit Score** - credit score
* **Credit Default** - whether the borrower defaulted on the loan (0 - repaid on time, 1 - overdue)

In [5]:
train_df = pd.read_csv(TRAIN_DATASET_PATH)
train_df.head()

In [None]:
test_df = pd.read_csv(TEST_DATASET_PATH)
test_df.head()

In [None]:
train_df.shape, test_df.shape

### General information about the data

In [None]:
train_df.info()

In [None]:
train_df.iloc[0]

### Type conversion

In [None]:
train_df['ID'] = train_df.index.tolist()
train_df.set_index('ID', inplace=True)

In [None]:
test_df['ID'] = test_df.index.tolist()
test_df.set_index('ID', inplace=True)

### Distribution of the target variable across other features

**Target variable**

In [None]:
target_name = 'Credit Default'

In [None]:
train_df['Credit Default'].value_counts()
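
The raw counts are easier to interpret as class shares, since class imbalance directly affects F1, precision and recall. A minimal sketch on a toy target column (values invented for illustration; `class_share` and `imbalance_ratio` are hypothetical names):

```python
import pandas as pd

# Toy target column, invented for illustration
target = pd.Series([0, 0, 0, 1, 0, 1, 0, 0], name='Credit Default')

# Share of each class instead of raw counts
class_share = target.value_counts(normalize=True)

# How many class-0 rows there are for every class-1 row
imbalance_ratio = (target == 0).sum() / (target == 1).sum()
```

On the real data the same two expressions would be applied to `train_df[target_name]`.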

In [None]:
base_feature_names = train_df.columns.drop(target_name).tolist()

In [None]:
base_feature_names

In [None]:
plt.figure(figsize=(8, 5))

sns.countplot(x=target_name, data=train_df)

plt.title('Target variable distribution')
plt.show()

**Correlation with base features**

In [None]:
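# Correlate numeric columns only: recent pandas versions raise an error on object columns in corr()
numeric_df = train_df[base_feature_names + [target_name]].select_dtypes(include='number')
corr_with_target = numeric_df.corr().iloc[:-1, -1].sort_values(ascending=False)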

plt.figure(figsize=(10, 8))

sns.barplot(x=corr_with_target.values, y=corr_with_target.index)

plt.title('Correlation with target variable')
plt.show()

In [None]:
plt.figure(figsize = (30,20))

sns.set(font_scale=1.4)
sns.heatmap(train_df[base_feature_names].select_dtypes(include='number').corr().round(3),
            annot=True, linewidths=.5, cmap='GnBu')

plt.title('Correlation matrix')
plt.show()

### Data overview

In [None]:
train_df.nunique()

**Overview of quantitative features**

In [None]:
train_df.describe()

In [None]:
feature_num_names = train_df.drop('Credit Default', axis=1).select_dtypes(include=['float64']).columns.tolist()
feature_num_names

In [None]:
train_df[feature_num_names].hist(figsize=(14,14), bins=20, grid=True);

**Overview of categorical variables**

In [None]:
train_df.describe(include='object')

In [None]:
for cat_colname in train_df.select_dtypes(include='object').columns:
    print(str(cat_colname) + '\n\n' + str(train_df[cat_colname].value_counts()) + '\n' + '*' * 100 + '\n')

**Purpose**

In [None]:
train_df.loc[train_df['Purpose'] == 'renewable energy', 'Purpose'] = train_df['Purpose'].mode()[0]
test_df.loc[test_df['Purpose'] == 'renewable energy', 'Purpose'] = test_df['Purpose'].mode()[0]

In [None]:
train_df.loc[train_df['Purpose'] == 'vacation', 'Purpose'] = train_df['Purpose'].mode()[0]
test_df.loc[test_df['Purpose'] == 'vacation', 'Purpose'] = test_df['Purpose'].mode()[0]

In [None]:
train_df.loc[train_df['Purpose'] == 'educational expenses', 'Purpose'] = train_df['Purpose'].mode()[0]
test_df.loc[test_df['Purpose'] == 'educational expenses', 'Purpose'] = test_df['Purpose'].mode()[0]

In [None]:
train_df.loc[train_df['Purpose'] == 'moving', 'Purpose'] = train_df['Purpose'].mode()[0]
test_df.loc[test_df['Purpose'] == 'moving', 'Purpose'] = test_df['Purpose'].mode()[0]
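
The four cells above repeat one pattern: a rare `Purpose` value is replaced with the most frequent one. A minimal sketch of a single helper that does this for a list of values. The `merge_rare_categories` function and the toy frames are hypothetical, and the sketch assumes the replacement value is taken from the training frame for both datasets, which differs from the cells above where each frame uses its own mode.

```python
import pandas as pd

def merge_rare_categories(train, test, column, rare_values):
    """Replace rare_values in `column` of both frames with the train mode."""
    # Mode of the training column, computed without the rare values themselves
    fill_value = train.loc[~train[column].isin(rare_values), column].mode()[0]
    for df in (train, test):
        # Each frame is masked by its own column, never by the other frame
        df.loc[df[column].isin(rare_values), column] = fill_value
    return fill_value

# Toy frames, invented for illustration
toy_train = pd.DataFrame({'Purpose': ['debt consolidation', 'debt consolidation',
                                      'vacation', 'other']})
toy_test = pd.DataFrame({'Purpose': ['moving', 'other']})

fill = merge_rare_categories(toy_train, toy_test, 'Purpose', ['vacation', 'moving'])
```

Building the mask from the frame being modified avoids the index misalignment that occurs when a boolean mask from one frame is applied to another.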

**Home Ownership**

In [None]:
train_df.loc[train_df['Home Ownership'] == 'Have Mortgage', 'Home Ownership'] = train_df['Home Ownership'].mode()[0]
test_df.loc[test_df['Home Ownership'] == 'Have Mortgage', 'Home Ownership'] = test_df['Home Ownership'].mode()[0]

**Term**

In [None]:
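# Keys must match the raw labels exactly; a label missing from the dict becomes NaN
train_df['Term'] = train_df['Term'].map({'Short Term': 0, 'Long Term': 1})
test_df['Term'] = test_df['Term'].map({'Short Term': 0, 'Long Term': 1})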

In [None]:
train_df['Term'] = train_df['Term'].astype(str)
test_df['Term'] = test_df['Term'].astype(str)
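
`Series.map` silently turns any label that is missing from the dictionary into NaN, so it can help to normalise whitespace first and count the unmapped values. A minimal sketch on a toy series (values invented for illustration; `term_codes` and `unmapped` are hypothetical names):

```python
import pandas as pd

# Toy labels, one of them with a stray trailing space
toy_term = pd.Series(['Short Term', 'Long Term', 'Long Term ', 'Short Term'])

# Strip surrounding whitespace before mapping labels to codes
term_codes = toy_term.str.strip().map({'Short Term': 0, 'Long Term': 1})

# Number of labels that did not match any dictionary key
unmapped = int(term_codes.isna().sum())
```

A non-zero `unmapped` would indicate a label variant that the dictionary does not cover.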

In [None]:
train_df.info()

### Handling missing values

In [None]:
train_df.describe()

In [None]:
train_df.isna().sum()

**Annual Income**

In [None]:
train_df['Annual Income'].value_counts()

In [None]:
train_df['Annual Income'].fillna(train_df['Annual Income'].median(), inplace=True)
test_df['Annual Income'].fillna(test_df['Annual Income'].median(), inplace=True)

**Years in current job**

In [None]:
train_df['Years in current job'].value_counts()

In [None]:
train_df['Years in current job'].fillna(train_df['Years in current job'].mode()[0], inplace=True)
test_df['Years in current job'].fillna(test_df['Years in current job'].mode()[0], inplace=True)

**Months since last delinquent**

In [None]:
train_df['Months since last delinquent'].value_counts()

In [None]:
train_df['Months since last delinquent'].fillna(train_df['Months since last delinquent'].median(), inplace=True)
test_df['Months since last delinquent'].fillna(test_df['Months since last delinquent'].median(), inplace=True)

**Bankruptcies**

In [None]:
train_df['Bankruptcies'].value_counts()

In [None]:
train_df['Bankruptcies'].fillna(train_df['Bankruptcies'].mode()[0], inplace=True)
test_df['Bankruptcies'].fillna(test_df['Bankruptcies'].mode()[0], inplace=True)

**Credit Score**

In [None]:
train_df['Credit Score'].value_counts()

In [None]:
train_df['Credit Score'].fillna(train_df['Credit Score'].median(), inplace=True)
test_df['Credit Score'].fillna(test_df['Credit Score'].median(), inplace=True)

In [None]:
train_df.isna().sum()
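
The imputation cells above can also be written with one dictionary of fill values. A minimal sketch, assuming the statistics are learned from the training frame only and then reused for the test frame (an assumption that differs from the cells above, where the test frame uses its own median and mode). The toy frames and the `fill_values` name are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy frames with gaps, invented for illustration
toy_train = pd.DataFrame({
    'Annual Income': [100.0, np.nan, 300.0, 200.0],
    'Years in current job': ['10+ years', None, '10+ years', '2 years'],
})
toy_test = pd.DataFrame({
    'Annual Income': [np.nan, 50.0],
    'Years in current job': [None, '3 years'],
})

# Median for the numeric column, mode for the categorical one, both from train
fill_values = {
    'Annual Income': toy_train['Annual Income'].median(),
    'Years in current job': toy_train['Years in current job'].mode()[0],
}

# DataFrame.fillna accepts a dict that maps column names to fill values
toy_train = toy_train.fillna(fill_values)
toy_test = toy_test.fillna(fill_values)
```

Reusing training statistics keeps the test data out of preprocessing decisions and guarantees that both frames are filled with the same values.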

### Handling outliers

In [None]:
train_df[feature_num_names].describe()

In [None]:
train_df[feature_num_names].nunique()

**Tax Liens**

In [None]:
train_df['Tax Liens'].value_counts()

In [None]:
plt.scatter(train_df['Tax Liens'], train_df['Credit Default']);

In [None]:
train_df.loc[(train_df['Tax Liens'] > 3), 'Tax Liens'] = train_df['Tax Liens'].mode()[0]
test_df.loc[(test_df['Tax Liens'] > 3), 'Tax Liens'] = test_df['Tax Liens'].mode()[0]

In [None]:
train_df['Tax Liens'].value_counts()

**Number of Credit Problems**

In [None]:
train_df['Number of Credit Problems'].value_counts()

In [None]:
train_df.loc[(train_df['Number of Credit Problems'] > 3), 'Number of Credit Problems'] = \
train_df['Number of Credit Problems'].mode()[0]

test_df.loc[(test_df['Number of Credit Problems'] > 3), 'Number of Credit Problems'] = \
test_df['Number of Credit Problems'].mode()[0]

**Bankruptcies**

In [None]:
train_df['Bankruptcies'].value_counts()

In [None]:
train_df.loc[(train_df['Bankruptcies'] > 3), 'Bankruptcies'] = train_df['Bankruptcies'].mode()[0]
test_df.loc[(test_df['Bankruptcies'] > 3), 'Bankruptcies'] = test_df['Bankruptcies'].mode()[0]
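
The three outlier cells share one rule: values above 3 are replaced with the most frequent value. A minimal sketch of that rule as a helper. The `replace_above` function and the toy frames are hypothetical, and the sketch assumes the replacement value comes from the training frame.

```python
import pandas as pd

def replace_above(df, column, threshold, value):
    """Set entries of `column` greater than `threshold` to `value`, in place."""
    # The mask is built from the same frame that is being modified
    df.loc[df[column] > threshold, column] = value

# Toy frames, invented for illustration
toy_train = pd.DataFrame({'Tax Liens': [0.0, 0.0, 1.0, 7.0, 0.0]})
toy_test = pd.DataFrame({'Tax Liens': [5.0, 2.0]})

# Most frequent value in the training column
train_mode = toy_train['Tax Liens'].mode()[0]

for frame in (toy_train, toy_test):
    replace_above(frame, 'Tax Liens', threshold=3, value=train_mode)
```

The same call could be repeated for `Number of Credit Problems` and `Bankruptcies` with their own modes.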

### Building new features

**Converting categorical features to binary ones**

In [None]:
for cat_colname in train_df.select_dtypes(include='object').columns[1:]:
    train_df = pd.concat([train_df, pd.get_dummies(train_df[cat_colname], prefix=cat_colname)], axis=1)
    
for cat_colname in test_df.select_dtypes(include='object').columns[1:]:
    test_df = pd.concat([test_df, pd.get_dummies(test_df[cat_colname], prefix=cat_colname)], axis=1)    
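
Encoding the two frames separately can produce different dummy columns when a category appears in only one of them. A minimal sketch of aligning the test columns to the training ones with `reindex` (toy frames invented for illustration; this alignment step is an assumption and is not part of the cells above):

```python
import pandas as pd

# Toy frames: 'buy a car' occurs only in train, 'wedding' only in test
toy_train = pd.DataFrame({'Purpose': ['other', 'buy a car', 'other']})
toy_test = pd.DataFrame({'Purpose': ['other', 'wedding']})

train_dummies = pd.get_dummies(toy_train['Purpose'], prefix='Purpose')
test_dummies = pd.get_dummies(toy_test['Purpose'], prefix='Purpose')

# Keep exactly the training columns: categories unseen in train are dropped,
# categories missing from test are added as all-zero columns
test_dummies = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
```

After this step a model fitted on the training columns can be applied to the test frame without a column mismatch.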

**Month Income**

In [None]:
train_df['Month Income'] = train_df['Annual Income'] / 12
test_df['Month Income'] = test_df['Annual Income'] / 12

**Available funds**

In [None]:
train_df['Available funds'] = train_df['Month Income'] - train_df['Monthly Debt']
test_df['Available funds'] = test_df['Month Income'] - test_df['Monthly Debt']

### Data selection

**Base and new features**

In [None]:
base_feature_names

In [None]:
new_feature_names = train_df.columns.drop([target_name] + base_feature_names).tolist()
new_feature_names

**Correlation of the new features with the target variable**

In [None]:
corr_with_target = train_df[new_feature_names + [target_name]].corr().iloc[:-1, -1].sort_values(ascending=False)

plt.figure(figsize=(10, 8))

sns.barplot(x=corr_with_target.values, y=corr_with_target.index)

plt.title('Correlation with target variable')
plt.show()

### Feature selection

In [None]:
num_feature_names = ['Annual Income', 'Number of Open Accounts', 'Years of Credit History', 'Maximum Open Credit',
                     'Number of Credit Problems', 'Months since last delinquent', 'Bankruptcies',
                     'Current Loan Amount', 'Current Credit Balance', 'Monthly Debt', 'Credit Score']

cat_feature_names = ['Home Ownership', 'Years in current job','Purpose', 'Term']

selected_feature_names = num_feature_names + new_feature_names