## Problem statement

**Task**

Using the available data on bank customers, build a model on the training dataset that predicts default on the current loan, then generate predictions for the examples in the test dataset.

**Data files**

course_project_train.csv - training dataset<br>
course_project_test.csv - test dataset

**Target variable**

Credit Default - whether the borrower failed to meet the loan obligations

**Quality metric**

F1-score (sklearn.metrics.f1_score)

**Solution requirements**

*Target metric*
* F1 > 0.5
* The metric is computed for the positive class (1 - overdue loan)
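To make the requirement concrete, here is a minimal sketch of how F1 for the positive class is formed from precision and recall. The helper `f1_positive` and the toy labels are illustrative assumptions, not part of the original notebook; in practice `sklearn.metrics.f1_score` computes the same quantity for binary labels with its default `pos_label=1`.

```python
def f1_positive(y_true, y_pred, positive=1):
    """F1 for a single class: the harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels (hypothetical): 1 = overdue, 0 = repaid on time.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0]
score = f1_positive(y_true, y_pred)
print(round(score, 4))
```

Because only class 1 counts, a model that always predicts 0 scores 0.0 here even though its accuracy on this dataset would look acceptable.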

**Dataset description**

* **Home Ownership** - home ownership status
* **Annual Income** - annual income
* **Years in current job** - number of years at the current job
* **Tax Liens** - tax liens
* **Number of Open Accounts** - number of open accounts
* **Years of Credit History** - number of years of credit history
* **Maximum Open Credit** - largest open credit
* **Number of Credit Problems** - number of credit problems
* **Months since last delinquent** - number of months since the last overdue payment
* **Bankruptcies** - bankruptcies
* **Purpose** - loan purpose
* **Term** - loan term
* **Current Loan Amount** - current loan amount
* **Current Credit Balance** - current credit balance
* **Monthly Debt** - monthly debt
* **Credit Score** - credit score
* **Credit Default** - whether the borrower failed to meet the loan obligations (0 - repaid on time, 1 - overdue)

**Importing libraries and scripts**

In [1]:
import numpy as np
import pandas as pd

**Paths to directories and files**

In [2]:
TRAIN_DATASET_PATH = './course_project_train.csv'
TEST_DATASET_PATH = './course_project_test.csv'
PREP_DATASET_PATH = './SSolovev_predictions.csv'

### Loading the data

In [3]:
df_train = pd.read_csv(TRAIN_DATASET_PATH)
df_train.head()

| | Home Ownership | Annual Income | Years in current job | Tax Liens | Number of Open Accounts | Years of Credit History | Maximum Open Credit | Number of Credit Problems | Months since last delinquent | Bankruptcies | Purpose | Term | Current Loan Amount | Current Credit Balance | Monthly Debt | Credit Score | Credit Default |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Own Home | 482087.0 | NaN | 0.0 | 11.0 | 26.3 | 685960.0 | 1.0 | NaN | 1.0 | debt consolidation | Short Term | 99999999.0 | 47386.0 | 7914.0 | 749.0 | 0 |
| 1 | Own Home | 1025487.0 | 10+ years | 0.0 | 15.0 | 15.3 | 1181730.0 | 0.0 | NaN | 0.0 | debt consolidation | Long Term | 264968.0 | 394972.0 | 18373.0 | 737.0 | 1 |
| 2 | Home Mortgage | 751412.0 | 8 years | 0.0 | 11.0 | 35.0 | 1182434.0 | 0.0 | NaN | 0.0 | debt consolidation | Short Term | 99999999.0 | 308389.0 | 13651.0 | 742.0 | 0 |
| 3 | Own Home | 805068.0 | 6 years | 0.0 | 8.0 | 22.5 | 147400.0 | 1.0 | NaN | 1.0 | debt consolidation | Short Term | 121396.0 | 95855.0 | 11338.0 | 694.0 | 0 |
| 4 | Rent | 776264.0 | 8 years | 0.0 | 13.0 | 13.6 | 385836.0 | 1.0 | NaN | 0.0 | debt consolidation | Short Term | 125840.0 | 93309.0 | 7180.0 | 719.0 | 0 |


In [4]:
df_train.shape

(7500, 17)

In [5]:
# Look at the general information.
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7500 entries, 0 to 7499
Data columns (total 17 columns):
Home Ownership                  7500 non-null object
Annual Income                   5943 non-null float64
Years in current job            7129 non-null object
Tax Liens                       7500 non-null float64
Number of Open Accounts         7500 non-null float64
Years of Credit History         7500 non-null float64
Maximum Open Credit             7500 non-null float64
Number of Credit Problems       7500 non-null float64
Months since last delinquent    3419 non-null float64
Bankruptcies                    7486 non-null float64
Purpose                         7500 non-null object
Term                            7500 non-null object
Current Loan Amount             7500 non-null float64
Current Credit Balance          7500 non-null float64
Monthly Debt                    7500 non-null float64
Credit Score                    5943 non-null float64
Credit Default                  7

### Handling missing values

In [6]:
# There are missing values. Let's examine them.

In [7]:
# Annual Income
print('Minimum: ', df_train['Annual Income'].min())
print('Maximum: ', df_train['Annual Income'].max())
print('Median: ', df_train['Annual Income'].median())
print('Mean: ', df_train['Annual Income'].mean())
print('Mode: ', df_train['Annual Income'].mode()[0])

Minimum:  164597.0
Maximum:  10149344.0
Median:  1168386.0
Mean:  1366391.7201749957
Mode:  969475.0


In [8]:
# Replace missing values with the median
df_train.loc[df_train['Annual Income'].isnull(), 'Annual Income'] = df_train['Annual Income'].median()

In [9]:
# Change the data type
df_train['Annual Income'] = df_train['Annual Income'].astype(int)
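The same imputation can be written more compactly with `fillna`. The sketch below uses a small toy frame (hypothetical values, not the course data) and stores the median in a variable, on the assumption that the value computed on the training data will later be reused for the test dataset.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the 'Annual Income' column.
toy = pd.DataFrame({'Annual Income': [482087.0, np.nan, 751412.0, 805068.0, np.nan]})

# Compute the median once (NaN values are skipped by default) ...
median_income = toy['Annual Income'].median()

# ... then fill the gaps and cast to int in one chained step.
toy['Annual Income'] = toy['Annual Income'].fillna(median_income).astype(int)
print(toy['Annual Income'].tolist())
```

Casting to `int` only works after the gaps are filled, because an integer NumPy column cannot hold `NaN`.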

In [10]:
# Months since last delinquent
print('Minimum: ', df_train['Months since last delinquent'].min())
print('Maximum: ', df_train['Months since last delinquent'].max())
print('Median: ', df_train['Months since last delinquent'].median())
print('Mean: ', df_train['Months since last delinquent'].mean())
print('Mode: ', df_train['Months since last delinquent'].mode()[0])

Minimum:  0.0
Maximum:  118.0
Median:  32.0
Mean:  34.69260017548991
Mode:  14.0


In [11]:
# Replace missing values with the mode (4081 of the 7500 values are missing)
df_train.loc[df_train['Months since last delinquent'].isnull(), 'Months since last delinquent'] = df_train['Months since last delinquent'].mode()[0]

In [12]:
# Change the data type
df_train['Months since last delinquent'] = df_train['Months since last delinquent'].astype(int)

In [13]:
# Credit Score
print('Minimum: ', df_train['Credit Score'].min())
print('Maximum: ', df_train['Credit Score'].max())
print('Median: ', df_train['Credit Score'].median())
print('Mean: ', df_train['Credit Score'].mean())
print('Mode: ', df_train['Credit Score'].mode()[0])

Minimum:  585.0
Maximum:  7510.0
Median:  731.0
Mean:  1151.0874978966851
Mode:  740.0


In [14]:
# Replace missing values with the mode
df_train.loc[df_train['Credit Score'].isnull(), 'Credit Score'] = df_train['Credit Score'].mode()[0]

In [15]:
# Change the data type
df_train['Credit Score'] = df_train['Credit Score'].astype(int)
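The Credit Score summary above shows a maximum of 7510 against a median of 731, so some values are probably outliers. A minimal sketch for flagging them follows; the cutoff of 850 and the divide-by-ten repair are assumptions (the source does not state the scoring scale), and the values are toy data.

```python
import pandas as pd

# Toy scores; two of them are an order of magnitude too large.
scores = pd.Series([749, 737, 7510, 694, 6990, 719], name='Credit Score')

CUTOFF = 850  # hypothetical upper bound of a plausible score

# Boolean mask of values above the assumed cutoff.
suspicious = scores > CUTOFF
n_suspicious = int(suspicious.sum())

# One possible repair (an assumption, not the author's method):
# treat suspicious values as scores multiplied by ten.
repaired = scores.where(~suspicious, scores // 10)
print(n_suspicious, repaired.tolist())
```

Whether to rescale, clip, or drop such rows depends on how the values arose, so it is worth inspecting the flagged rows before choosing.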

In [16]:
df_train['Bankruptcies'].value_counts()

0.0    6660
1.0     786
2.0      31
3.0       7
4.0       2
Name: Bankruptcies, dtype: int64

In [17]:
# Replace missing values with the mode
df_train.loc[df_train['Bankruptcies'].isnull(), 'Bankruptcies'] = df_train['Bankruptcies'].mode()[0]

In [18]:
# Change the data type
df_train['Bankruptcies'] = df_train['Bankruptcies'].astype(int)

In [19]:
df_train['Years in current job'].value_counts()

10+ years    2332
2 years       705
3 years       620
< 1 year      563
5 years       516
1 year        504
4 years       469
6 years       426
7 years       396
8 years       339
9 years       259
Name: Years in current job, dtype: int64

In [20]:
# Replace missing values with the most frequent value, '10+ years'
df_train.loc[df_train['Years in current job'].isnull(), 'Years in current job'] = '10+ years'

In [21]:
df_train['Years in current job'].value_counts()

10+ years    2703
2 years       705
3 years       620
< 1 year      563
5 years       516
1 year        504
4 years       469
6 years       426
7 years       396
8 years       339
9 years       259
Name: Years in current job, dtype: int64

**Overview of categorical features**

In [24]:
for cat_colname in df_train.select_dtypes(include='object').columns:
    print(str(cat_colname) + '\n\n' + str(df_train[cat_colname].value_counts()) + '\n' + '*' * 100 + '\n')

Home Ownership

Home Mortgage    3637
Rent             3204
Own Home          647
Have Mortgage      12
Name: Home Ownership, dtype: int64
****************************************************************************************************

Years in current job

10+ years    2703
2 years       705
3 years       620
< 1 year      563
5 years       516
1 year        504
4 years       469
6 years       426
7 years       396
8 years       339
9 years       259
Name: Years in current job, dtype: int64
****************************************************************************************************

Purpose

debt consolidation      5944
other                    665
home improvements        412
business loan            129
buy a car                 96
medical bills             71
major purchase            40
take a trip               37
buy house                 34
small business            26
wedding                   15
moving                    11
educational expenses      10
vacation  

### Type conversion

In [26]:
# Convert float to int (astype(int) truncates fractional parts, e.g. in 'Years of Credit History')
for colname in ['Number of Open Accounts', 'Number of Credit Problems', 'Years of Credit History',
                'Tax Liens', 'Maximum Open Credit', 'Current Loan Amount', 'Current Credit Balance',
                'Monthly Debt']:
    df_train[colname] = df_train[colname].astype(int)
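Since `astype(int)` drops the fractional part rather than rounding, the cast changes columns such as 'Years of Credit History'. The sketch below compares truncation with rounding on the five values shown in the `head()` output; rounding is offered only as an alternative to consider, not as what the notebook does.

```python
import pandas as pd

# 'Years of Credit History' values from the first five rows of the training data.
years = pd.Series([26.3, 15.3, 35.0, 22.5, 13.6])

# What the notebook does: the fractional part is discarded.
truncated = years.astype(int)

# Alternative: round first, then cast (halves round to the nearest even integer).
rounded = years.round().astype(int)

print(truncated.tolist())
print(rounded.tolist())
```

If the fractional years carry signal for the model, keeping the column as float is also a valid choice.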

In [27]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7500 entries, 0 to 7499
Data columns (total 17 columns):
Home Ownership                  7500 non-null object
Annual Income                   7500 non-null int64
Years in current job            7500 non-null object
Tax Liens                       7500 non-null int64
Number of Open Accounts         7500 non-null int64
Years of Credit History         7500 non-null int64
Maximum Open Credit             7500 non-null int64
Number of Credit Problems       7500 non-null int64
Months since last delinquent    7500 non-null int64
Bankruptcies                    7500 non-null int64
Purpose                         7500 non-null object
Term                            7500 non-null object
Current Loan Amount             7500 non-null int64
Current Credit Balance          7500 non-null int64
Monthly Debt                    7500 non-null int64
Credit Score                    7500 non-null int64
Credit Default                  7500 non-null int64
dtype

In [29]:
# Replace 'Long Term' with 1 and 'Short Term' with 0, stored as int
df_train['Term'] = (df_train['Term'] == 'Long Term').astype(int)
df_train['Term'].value_counts()

0    5556
1    1944
Name: Term, dtype: int64

### Building new features

**Id**

In [30]:
df_train['ID'] = df_train.index.tolist()

**Dummies**

In [34]:
# Append dummy columns for the object columns; columns[1:] skips the first one ('Home Ownership')
for cat_colname in df_train.select_dtypes(include='object').columns[1:]:
    df_train = pd.concat([df_train, pd.get_dummies(df_train[cat_colname], prefix=cat_colname)], axis=1)
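For comparison, `pd.get_dummies` can encode several columns in one call through its `columns` parameter. The sketch below runs on a toy frame with hypothetical values; unlike the loop above, this form drops the original categorical columns and here it also encodes 'Home Ownership'.

```python
import pandas as pd

# Toy frame with two categorical columns and one numeric column.
toy = pd.DataFrame({
    'Home Ownership': ['Rent', 'Own Home', 'Rent'],
    'Purpose': ['other', 'wedding', 'other'],
    'Term': [0, 1, 0],
})

# Columns to encode are listed explicitly so nothing is skipped by accident.
cat_cols = ['Home Ownership', 'Purpose']

# Each listed column is replaced by one indicator column per category,
# named '<column>_<category>'.
encoded = pd.get_dummies(toy, columns=cat_cols)
print(sorted(encoded.columns))
```

Applying the same call to the test dataset can produce a different set of columns if some category is absent there, so the two frames may need to be aligned afterwards.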

**Overview of numerical features**

In [35]:
df_train.describe()

| | Annual Income | Tax Liens | Number of Open Accounts | Years of Credit History | Maximum Open Credit | Number of Credit Problems | Months since last delinquent | Bankruptcies | Term | Current Loan Amount | ... | Purpose_home improvements | Purpose_major purchase | Purpose_medical bills | Purpose_moving | Purpose_other | Purpose_renewable energy | Purpose_small business | Purpose_take a trip | Purpose_vacation | Purpose_wedding |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 7500.0 | 7500.0 | 7500.0 | 7500.0 | 7500.0 | 7500.0 | 7500.0 | 7500.0 | 7500.0 | 7500.0 | ... | 7500.0 | 7500.0 | 7500.0 | 7500.0 | 7500.0 | 7500.0 | 7500.0 | 7500.0 | 7500.0 | 7500.0 |
| mean | 1325286.0 | 0.030133 | 11.130933 | 17.889333 | 945153.7 | 0.17 | 23.433067 | 0.116933 | 0.2592 | 11873180.0 | ... | 0.054933 | 0.005333 | 0.009467 | 0.001467 | 0.088667 | 0.000267 | 0.003467 | 0.004933 | 0.001067 | 0.002 |
| std | 756755.1 | 0.271604 | 4.908924 | 7.050672 | 16026220.0 | 0.498598 | 17.906245 | 0.346904 | 0.438225 | 31926120.0 | ... | 0.227865 | 0.07284 | 0.096842 | 0.038272 | 0.284281 | 0.016329 | 0.05878 | 0.070069 | 0.032645 | 0.04468 |
| min | 164597.0 | 0.0 | 2.0 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 11242.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 931133.0 | 0.0 | 8.0 | 13.0 | 279229.5 | 0.0 | 14.0 | 0.0 | 0.0 | 180169.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 1168386.0 | 0.0 | 10.0 | 17.0 | 478159.0 | 0.0 | 14.0 | 0.0 | 0.0 | 309573.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 1499974.0 | 0.0 | 14.0 | 21.0 | 793501.5 | 0.0 | 29.0 | 0.0 | 1.0 | 519882.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| max | 10149340.0 | 7.0 | 43.0 | 57.0 | 1304726000.0 | 7.0 | 118.0 | 4.0 | 1.0 | 100000000.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


In [36]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7500 entries, 0 to 7499
Data columns (total 44 columns):
Home Ownership                    7500 non-null object
Annual Income                     7500 non-null int64
Years in current job              7500 non-null object
Tax Liens                         7500 non-null int64
Number of Open Accounts           7500 non-null int64
Years of Credit History           7500 non-null int64
Maximum Open Credit               7500 non-null int64
Number of Credit Problems         7500 non-null int64
Months since last delinquent      7500 non-null int64
Bankruptcies                      7500 non-null int64
Purpose                           7500 non-null object
Term                              7500 non-null int64
Current Loan Amount               7500 non-null int64
Current Credit Balance            7500 non-null int64
Monthly Debt                      7500 non-null int64
Credit Score                      7500 non-null int64
Credit Default            

**Overview of numerical features**

In [136]:
df_train.describe()

| | Annual Income | Tax Liens | Number of Open Accounts | Years of Credit History | Maximum Open Credit | Number of Credit Problems | Months since last delinquent | Bankruptcies | Current Loan Amount | Current Credit Balance | Monthly Debt | Credit Score | Credit Default |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 7500.0 | 7500.0 | 7500.0 | 7500.0 | 7500.0 | 7500.0 | 7500.0 | 7500.0 | 7500.0 | 7500.0 | 7500.0 | 7500.0 | 7500.0 |
| mean | 1325286.0 | 0.030133 | 11.130933 | 18.317467 | 945153.7 | 0.17 | 23.433067 | 0.116933 | 11873180.0 | 289833.2 | 18314.454133 | 1065.745733 | 0.281733 |
| std | 756755.1 | 0.271604 | 4.908924 | 7.041946 | 16026220.0 | 0.498598 | 17.906245 | 0.346904 | 31926120.0 | 317871.4 | 11926.764673 | 1437.907935 | 0.449874 |
| min | 164597.0 | 0.0 | 2.0 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 11242.0 | 0.0 | 0.0 | 585.0 | 0.0 |
| 25% | 931133.0 | 0.0 | 8.0 | 13.5 | 279229.5 | 0.0 | 14.0 | 0.0 | 180169.0 | 114256.5 | 10067.5 | 718.0 | 0.0 |
| 50% | 1168386.0 | 0.0 | 10.0 | 17.0 | 478159.0 | 0.0 | 14.0 | 0.0 | 309573.0 | 209323.0 | 16076.5 | 738.0 | 0.0 |
| 75% | 1499974.0 | 0.0 | 14.0 | 21.8 | 793501.5 | 0.0 | 29.0 | 0.0 | 519882.0 | 360406.2 | 23818.0 | 740.0 | 1.0 |
| max | 10149340.0 | 7.0 | 43.0 | 57.7 | 1304726000.0 | 7.0 | 118.0 | 4.0 | 100000000.0 | 6506797.0 | 136679.0 | 7510.0 | 1.0 |


**Overview of the target variable**

In [37]:
df_train['Credit Default'].value_counts()

0    5387
1    2113
Name: Credit Default, dtype: int64
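The counts above can be turned into class shares to quantify the imbalance. This is a small sketch in plain Python using the two counts from the output; on the DataFrame itself, `df_train['Credit Default'].value_counts(normalize=True)` returns the shares directly.

```python
# Class counts copied from the value_counts() output above.
counts = {0: 5387, 1: 2113}

total = sum(counts.values())

# Share of each class in the training data.
shares = {label: n / total for label, n in counts.items()}

# How many on-time loans there are per overdue loan.
imbalance_ratio = counts[0] / counts[1]

print(total, round(shares[1], 4), round(imbalance_ratio, 2))
```

With roughly 28% of examples in the positive class, accuracy alone would be a misleading measure, which is consistent with the task's choice of F1 for class 1.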

### Handling outliers

### Data analysis

## Building the classification model

### Feature selection

### Class balancing

### Choosing candidate models and obtaining a baseline

### Selecting the best model and tuning hyperparameters

### Quality check and overfitting control

### Interpreting the results

## Predicting on the test dataset

### Loading the data

### Type conversion

### Handling outliers

### Handling missing values

### Data analysis

### Feature selection

### Class balancing

### Predicting the target variable with the model built on the training dataset