### Project 5 [Credit Approval]

You are given a dataset that has already been split into ```train_data.csv``` and ```test_data.csv```.

**Goal:** build models for **credit approval** (target: ```"Credit Default"```).

Please include detailed explanations of the following steps:

1. Data cleaning, preprocessing, and Exploratory Data Analysis.

2. Training and validating the models.

3. Comparing the models using classification metrics: ```F-score```, ```Precision```, ```Recall```.
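As a reminder of what the three metrics in step 3 measure, here is a minimal sketch in plain Python. With TP, FP and FN counted for the positive class, Precision = TP / (TP + FP), Recall = TP / (TP + FN), and the F-score (here F1) is their harmonic mean. The function name ```precision_recall_f1``` and the toy labels are assumptions made for illustration; they do not come from the assignment.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for a single positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy labels, made up for illustration (1 = default, 0 = no default).
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0, 1, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In practice the same numbers can be obtained from ```sklearn.metrics``` (```precision_score```, ```recall_score```, ```f1_score```); the hand-written version only makes the definitions explicit.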

**Note:** you are **encouraged** to look up other machine learning algorithms online as well (beyond the course material), but you must study and understand them. You may not delete any rows from ```test_data.csv```.
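Because no test rows may be deleted, missing values in the test set have to be filled in rather than dropped. The sketch below shows one common option: take the median (numeric columns) or the most frequent value (categorical columns) from the training set and apply it to both sets. The helper ```fill_from_train``` and the toy frames are hypothetical, and median/mode imputation is an assumption here, not necessarily the strategy this notebook ends up using.

```python
import numpy as np
import pandas as pd


def fill_from_train(train, test, numeric_cols, categorical_cols):
    """Fill missing values in both frames using statistics from train only."""
    train, test = train.copy(), test.copy()
    for col in numeric_cols:
        median = train[col].median()  # NaN values are skipped by default
        train[col] = train[col].fillna(median)
        test[col] = test[col].fillna(median)
    for col in categorical_cols:
        mode = train[col].mode().iloc[0]  # most frequent non-missing value
        train[col] = train[col].fillna(mode)
        test[col] = test[col].fillna(mode)
    return train, test


# Toy frames standing in for train_data / test_data (values are made up).
toy_train = pd.DataFrame({
    "Credit Score": [718.0, np.nan, 719.0, 713.0],
    "Years in current job": ["1 year", "10+ years", None, "10+ years"],
})
toy_test = pd.DataFrame({
    "Credit Score": [np.nan, 708.0],
    "Years in current job": [None, "6 years"],
})

filled_train, filled_test = fill_from_train(
    toy_train, toy_test, ["Credit Score"], ["Years in current job"]
)
print(filled_test)
```

Computing the fill values on the training set alone keeps information from the test set out of preprocessing, and the number of test rows stays unchanged.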

1. Data cleaning, preprocessing, and Exploratory Data Analysis

First, we import the libraries needed for data processing and plotting.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt


Load the datasets.

In [2]:
train_data = pd.read_csv('./train_data.csv')
test_data = pd.read_csv('./test_data.csv')

In [3]:
train_data.head()

Unnamed: 0,Home Ownership,Annual Income,Years in current job,Tax Liens,Number of Open Accounts,Years of Credit History,Maximum Open Credit,Number of Credit Problems,Months since last delinquent,Bankruptcies,Purpose,Term,Current Loan Amount,Current Credit Balance,Monthly Debt,Credit Score,Credit Default
0,Home Mortgage,,10+ years,0.0,10.0,21.8,267762.0,0.0,,0.0,debt consolidation,Short Term,193358.0,140372.0,19404.0,,0
1,Rent,767904.0,1 year,0.0,9.0,15.4,275528.0,0.0,73.0,0.0,debt consolidation,Short Term,222288.0,168226.0,18302.0,718.0,0
2,Own Home,,9 years,0.0,14.0,27.1,1635590.0,1.0,,0.0,debt consolidation,Short Term,433268.0,1017032.0,15295.0,,0
3,Home Mortgage,1267395.0,3 years,0.0,11.0,11.8,137676.0,1.0,61.0,0.0,home improvements,Short Term,99999999.0,34124.0,25559.0,719.0,0
4,Rent,1813493.0,7 years,0.0,19.0,14.0,501556.0,0.0,6.0,0.0,debt consolidation,Short Term,265232.0,114779.0,23877.0,713.0,0


In [4]:
test_data.head()

Unnamed: 0,Home Ownership,Annual Income,Years in current job,Tax Liens,Number of Open Accounts,Years of Credit History,Maximum Open Credit,Number of Credit Problems,Months since last delinquent,Bankruptcies,Purpose,Term,Current Loan Amount,Current Credit Balance,Monthly Debt,Credit Score,Credit Default
0,Home Mortgage,1886510.0,10+ years,0.0,13.0,17.1,552398.0,0.0,,0.0,debt consolidation,Short Term,595782.0,155059.0,26097.0,708.0,0
1,Rent,869877.0,1 year,0.0,16.0,14.2,657690.0,0.0,,0.0,debt consolidation,Long Term,501380.0,259008.0,19645.0,638.0,1
2,Home Mortgage,,6 years,0.0,13.0,16.5,638704.0,0.0,,0.0,debt consolidation,Short Term,238150.0,424745.0,28795.0,,0
3,Home Mortgage,1125142.0,8 years,0.0,17.0,16.5,570548.0,0.0,,0.0,debt consolidation,Long Term,393096.0,284810.0,22597.0,676.0,0
4,Home Mortgage,1060998.0,10+ years,0.0,7.0,18.9,379764.0,0.0,27.0,0.0,debt consolidation,Short Term,268048.0,188252.0,5438.0,739.0,0


Check the datasets for missing values.

In [5]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6750 entries, 0 to 6749
Data columns (total 17 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Home Ownership                6750 non-null   object 
 1   Annual Income                 5350 non-null   float64
 2   Years in current job          6418 non-null   object 
 3   Tax Liens                     6750 non-null   float64
 4   Number of Open Accounts       6750 non-null   float64
 5   Years of Credit History       6750 non-null   float64
 6   Maximum Open Credit           6750 non-null   float64
 7   Number of Credit Problems     6750 non-null   float64
 8   Months since last delinquent  3075 non-null   float64
 9   Bankruptcies                  6738 non-null   float64
 10  Purpose                       6750 non-null   object 
 11  Term                          6750 non-null   object 
 12  Current Loan Amount           6750 non-null   float64
 13  Cur

In [6]:
train_data.isnull().sum()

Home Ownership                     0
Annual Income                   1400
Years in current job             332
Tax Liens                          0
Number of Open Accounts            0
Years of Credit History            0
Maximum Open Credit                0
Number of Credit Problems          0
Months since last delinquent    3675
Bankruptcies                      12
Purpose                            0
Term                               0
Current Loan Amount                0
Current Credit Balance             0
Monthly Debt                       0
Credit Score                    1400
Credit Default                     0
dtype: int64

In [7]:
test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 750 entries, 0 to 749
Data columns (total 17 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Home Ownership                750 non-null    object 
 1   Annual Income                 593 non-null    float64
 2   Years in current job          711 non-null    object 
 3   Tax Liens                     750 non-null    float64
 4   Number of Open Accounts       750 non-null    float64
 5   Years of Credit History       750 non-null    float64
 6   Maximum Open Credit           750 non-null    float64
 7   Number of Credit Problems     750 non-null    float64
 8   Months since last delinquent  344 non-null    float64
 9   Bankruptcies                  748 non-null    float64
 10  Purpose                       750 non-null    object 
 11  Term                          750 non-null    object 
 12  Current Loan Amount           750 non-null    float64
 13  Curre

In [8]:
test_data.isnull().sum()

Home Ownership                    0
Annual Income                   157
Years in current job             39
Tax Liens                         0
Number of Open Accounts           0
Years of Credit History           0
Maximum Open Credit               0
Number of Credit Problems         0
Months since last delinquent    406
Bankruptcies                      2
Purpose                           0
Term                              0
Current Loan Amount               0
Current Credit Balance            0
Monthly Debt                      0
Credit Score                    157
Credit Default                    0
dtype: int64

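As the output shows, both the training and the test data contain missing values, and in the same five columns: ```Annual Income```, ```Years in current job```, ```Months since last delinquent```, ```Bankruptcies```, and ```Credit Score```. Let us go through each feature that has missing entries.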
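Raw counts are easier to compare across the two sets when expressed as a share of rows. A minimal sketch, assuming a hypothetical helper ```missing_summary``` and a small made-up frame in place of ```train_data```:

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Return count and share of missing values for columns that have any."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "share": counts / len(df),
    })
    # Keep only affected columns, most incomplete first.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy frame standing in for train_data (values are made up).
toy = pd.DataFrame({
    "Annual Income": [767904.0, np.nan, 1267395.0, np.nan],
    "Credit Score": [718.0, np.nan, 719.0, 713.0],
    "Term": ["Short Term", "Long Term", "Short Term", "Short Term"],
})

summary = missing_summary(toy)
print(summary)
```

Calling ```missing_summary(train_data)``` and ```missing_summary(test_data)``` would give the same view for the real data.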

In [20]:
# Print all column names on one line, separated by commas
for col in train_data.columns:
    print(col, end=", ")

Home Ownership, Annual Income, Years in current job, Tax Liens, Number of Open Accounts, Years of Credit History, Maximum Open Credit, Number of Credit Problems, Months since last delinquent, Bankruptcies, Purpose, Term, Current Loan Amount, Current Credit Balance, Monthly Debt, Credit Score, Credit Default, 