https://www.kaggle.com/competitions/home-credit-default-risk/data

## Home Credit Default Risk
Can you predict how capable each applicant is of repaying a loan?

In [180]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

In [181]:
df_train = pd.read_csv('C:/Agapov/Нетология/!Diplom/HomeCredit/application_train.csv')
df_test = pd.read_csv('C:/Agapov/Нетология/!Diplom/HomeCredit/application_test.csv')

In [182]:
df_train.shape, df_test.shape

((307511, 122), (48744, 121))

In [183]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Columns: 122 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR
dtypes: float64(65), int64(41), object(16)
memory usage: 286.2+ MB


1. Check for missing values
2. One-hot encode (OHE) the categorical columns

In [184]:
df_train.isna().sum()

SK_ID_CURR                        0
TARGET                            0
NAME_CONTRACT_TYPE                0
CODE_GENDER                       0
FLAG_OWN_CAR                      0
                              ...  
AMT_REQ_CREDIT_BUREAU_DAY     41519
AMT_REQ_CREDIT_BUREAU_WEEK    41519
AMT_REQ_CREDIT_BUREAU_MON     41519
AMT_REQ_CREDIT_BUREAU_QRT     41519
AMT_REQ_CREDIT_BUREAU_YEAR    41519
Length: 122, dtype: int64

In [185]:
# count the columns where more than 1/3 of the values are missing; these columns will be dropped
(df_train.isna().sum().sort_values(ascending=False) > (len(df_train)/3)).sum()

49

In [186]:
del_columns = df_train.isna().sum().sort_values(ascending=False) > (len(df_train)/3)
del_columns

COMMONAREA_MEDI              True
COMMONAREA_AVG               True
COMMONAREA_MODE              True
NONLIVINGAPARTMENTS_MODE     True
NONLIVINGAPARTMENTS_AVG      True
                            ...  
NAME_HOUSING_TYPE           False
NAME_FAMILY_STATUS          False
NAME_EDUCATION_TYPE         False
NAME_INCOME_TYPE            False
SK_ID_CURR                  False
Length: 122, dtype: bool

In [187]:
del_columns_list = list(del_columns[del_columns].index)  # the boolean Series works directly as a mask
df_train.drop(columns = del_columns_list, inplace=True)
df_test.drop(columns = del_columns_list, inplace=True)

In [188]:
df_train.shape, df_test.shape

((307511, 73), (48744, 72))
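
The same filter can be expressed through the per-column missing fraction, with the column list derived from the training frame and applied to both frames. This is a minimal sketch on a toy frame; `drop_sparse_columns` and the toy data are hypothetical names introduced for illustration and are not part of the notebook.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(train, test, max_missing=1 / 3):
    """Drop columns whose missing fraction in `train` exceeds `max_missing`."""
    # isna().mean() gives the fraction of missing values per column
    missing_share = train.isna().mean()
    to_drop = list(missing_share[missing_share > max_missing].index)
    # errors="ignore" lets the same list be applied to a frame lacking some columns
    return (
        train.drop(columns=to_drop),
        test.drop(columns=to_drop, errors="ignore"),
        to_drop,
    )


# Toy data (made up): "b" is 3/4 missing, "c" is 1/4 missing
toy_train = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, 4.0],
    "c": [1.0, np.nan, 3.0, 4.0],
})
toy_test = pd.DataFrame({"a": [5.0], "b": [6.0], "c": [np.nan]})

toy_train_small, toy_test_small, dropped = drop_sparse_columns(toy_train, toy_test)
print(dropped)
```

Deciding on the training frame alone keeps the two frames in step even when the test set has a different missing-value pattern.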

One-hot encode the 16 categorical columns

In [189]:
df_train = pd.get_dummies(df_train)
df_test = pd.get_dummies(df_test)

In [190]:
# the number of columns grew by 112
df_train.shape, df_test.shape

((307511, 185), (48744, 181))

Note that the column counts of train and test now differ by more than 1, so some categories occur only in the training set and are absent from the test set. We can drop them.

In [191]:
difference = set(list(df_train.columns)).difference(set(list(df_test.columns)))
list(difference)

['NAME_FAMILY_STATUS_Unknown',
 'TARGET',
 'NAME_INCOME_TYPE_Maternity leave',
 'CODE_GENDER_XNA']

Drop everything except TARGET

In [192]:
df_train.drop(['CODE_GENDER_XNA', 'NAME_FAMILY_STATUS_Unknown', 'NAME_INCOME_TYPE_Maternity leave'], axis=1, inplace=True)

In [193]:
if set(list(df_train.columns)).difference(set(list(df_test.columns))) == {'TARGET'}: 
    print('ok')

ok
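
An alternative to listing the extra dummy columns by hand is `DataFrame.align` with an inner join on the column axis, which keeps only the columns both frames share. The sketch below uses made-up toy frames; `TARGET` is set aside first so that it is not dropped together with the train-only categories.

```python
import pandas as pd

# Toy frames (invented values); "XNA" appears only in the training data
toy_train = pd.DataFrame({
    "TARGET": [0, 1, 0],
    "CODE_GENDER": ["M", "F", "XNA"],
    "AMT": [1.0, 2.0, 3.0],
})
toy_test = pd.DataFrame({"CODE_GENDER": ["F", "M"], "AMT": [4.0, 5.0]})

# Remove the label before encoding so alignment cannot discard it
target = toy_train.pop("TARGET")

train_ohe = pd.get_dummies(toy_train)
test_ohe = pd.get_dummies(toy_test)

# Keep only columns present in both frames; axis=1 aligns columns, rows are untouched
train_ohe, test_ohe = train_ohe.align(test_ohe, join="inner", axis=1)

train_ohe["TARGET"] = target
print(sorted(train_ohe.columns))
```

A side effect worth noting is that both frames end up with the same column order, which matters when a fitted scaler or model is later applied to the test frame.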


In [194]:
df_train.isna().sum().sort_values(ascending=False)

EXT_SOURCE_3                       60965
AMT_REQ_CREDIT_BUREAU_QRT          41519
AMT_REQ_CREDIT_BUREAU_HOUR         41519
AMT_REQ_CREDIT_BUREAU_YEAR         41519
AMT_REQ_CREDIT_BUREAU_DAY          41519
                                   ...  
NAME_TYPE_SUITE_Other_A                0
NAME_TYPE_SUITE_Other_B                0
NAME_TYPE_SUITE_Spouse, partner        0
NAME_TYPE_SUITE_Unaccompanied          0
ORGANIZATION_TYPE_XNA                  0
Length: 182, dtype: int64

In [195]:
df_train['EXT_SOURCE_3'].median()

0.5352762504724826

In [196]:
df_train[df_train['EXT_SOURCE_3'].isna()]['EXT_SOURCE_3']

1        NaN
3        NaN
4        NaN
9        NaN
14       NaN
          ..
307484   NaN
307501   NaN
307504   NaN
307506   NaN
307507   NaN
Name: EXT_SOURCE_3, Length: 60965, dtype: float64

In [197]:
for col in df_train.columns:
    median_ = df_train[col].median()
    df_train.loc[df_train[col].isna(), col] = median_

In [198]:
def imput_median(data):
    """
    Fill missing values with the column medians
    """
    for col in data.columns:
        median_ = data[col].median()
        data.loc[data[col].isna(), col] = median_
    return data

In [199]:
df_train = imput_median(df_train)
df_test = imput_median(df_test)

df_train.isna().sum().sum(), df_test.isna().sum().sum()

(0, 0)
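
The `imput_median` helper fills each frame with its own medians, so the test set is imputed with statistics that were not available at training time. One common variant, shown here as a sketch rather than as the notebook's method, learns the medians on the training frame only and reuses them for the test frame through scikit-learn's `SimpleImputer`. The toy frames are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames (made-up values)
toy_train = pd.DataFrame({
    "x": [1.0, 3.0, np.nan, 5.0],
    "y": [10.0, np.nan, 30.0, 20.0],
})
toy_test = pd.DataFrame({"x": [np.nan, 100.0], "y": [np.nan, np.nan]})

imputer = SimpleImputer(strategy="median")
imputer.fit(toy_train)  # medians come from the training data only

# transform() returns a NumPy array, so the column names are restored by hand
train_filled = pd.DataFrame(imputer.transform(toy_train), columns=toy_train.columns)
test_filled = pd.DataFrame(imputer.transform(toy_test), columns=toy_test.columns)

print(test_filled)
```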

Normalize the data

In [200]:
from sklearn.preprocessing import MinMaxScaler

In [201]:
X = df_train.drop(['TARGET'], axis=1)
y = df_train['TARGET']

scaler = MinMaxScaler()

scaler.fit(X)
X = scaler.transform(X)
X_test = scaler.transform(df_test)

Build the baseline

In [202]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=41)

In [205]:
model = LogisticRegression(max_iter=10000).fit(X_train, y_train)

In [206]:
pred_val = model.predict_proba(X_val)[:, 1]

In [207]:
pred_val

array([0.44082754, 0.09746514, 0.04030753, ..., 0.01480285, 0.02011254,
       0.07924822])

In [208]:
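from sklearn.metrics import mean_squared_error
# RMSE; np.sqrt works across scikit-learn versions (the `squared` argument was removed in newer releases)
np.sqrt(mean_squared_error(y_val, pred_val))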

0.26405539940209455
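
RMSE on predicted probabilities is a usable sanity check, but the competition leaderboard is scored by area under the ROC curve, so a validation AUC is more directly comparable with the Kaggle score. Below is a minimal sketch on synthetic data; `make_classification` stands in for the real features, and the number it produces says nothing about the actual dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the real data (assumption for illustration)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba_va = clf.predict_proba(X_va)[:, 1]

# AUC depends on the ranking of the probabilities, not on a hard 0/1 threshold
auc = roc_auc_score(y_va, proba_va)
print(f"validation ROC AUC: {auc:.3f}")
```

In the notebook itself the equivalent call would be `roc_auc_score(y_val, pred_val)`.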

Create the submission dataset for Kaggle

In [209]:
pred_test = model.predict_proba(X_test)[:, 1]

In [217]:
submit = df_test[['SK_ID_CURR']].copy()  # explicit copy avoids SettingWithCopyWarning
submit['TARGET'] = pred_test

submit.head()


   SK_ID_CURR    TARGET
0      100001  0.064048
1      100005  0.247073
2      100013  0.038816
3      100028  0.039916
4      100038  0.098107


In [220]:
submit.to_csv('C:/Agapov/Нетология/!Diplom/baseline_LogisticRegression.csv', index = False)
# Score: 0.73157
# Private score: 0.72608

#### Top score on Kaggle
- Public leaderboard score: 0.81724
- Private leaderboard score: 0.80570

#### Further ideas for improving the model:
1. Fill missing values more carefully than with a plain median
2. Add polynomial and log-transformed features
3. Try 2-3 more models
4. Tune hyperparameters
5. Ensemble several algorithms
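
Two of these ideas, log-transformed features and hyperparameter tuning, can be combined in a single scikit-learn pipeline. The sketch below is one possible arrangement, not a result from this notebook: the data is synthetic, the grid over `C` is an arbitrary choice, and `np.exp` is applied only to make the toy features non-negative and skewed.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler

# Synthetic stand-in for the real features (assumption for illustration)
X_demo, y_demo = make_classification(n_samples=1000, n_features=8, random_state=0)
X_demo = np.exp(X_demo)  # non-negative, skewed values so that log1p is well defined

pipe = Pipeline([
    ("log", FunctionTransformer(np.log1p)),   # log-transform every feature
    ("scale", MinMaxScaler()),                # same scaler as the baseline
    ("clf", LogisticRegression(max_iter=1000)),
])

# Cross-validated search over the regularization strength, scored by ROC AUC
search = GridSearchCV(
    pipe,
    param_grid={"clf__C": [0.01, 0.1, 1.0, 10.0]},
    scoring="roc_auc",
    cv=3,
)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

Keeping the transforms inside the pipeline means they are refit on each cross-validation fold, so the validation folds do not influence the scaling.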