# Home Credit Default Risk

The objective of this project is to use historical loan application data to predict whether or not an applicant will be able to repay a loan. This is a standard supervised classification project.

## Classification: 
The label is a binary variable: 0 (the applicant will repay the loan on time) or 1 (the applicant will have difficulty repaying the loan).

## Data
The data is provided by Home Credit, a service dedicated to providing lines of credit (loans) to the unbanked population. There are seven data sources.

##### Application_train (307,511 obs, 121 features, 1 binary target variable) / Application_test (48,744 obs, 121 features):
The main training and testing data with information about each loan application at Home Credit. Every loan has its own row and is identified by the feature SK_ID_CURR. The training application data comes with the TARGET indicating 0: the loan was repaid or 1: the loan was not repaid.
    
##### Bureau: 
Data concerning client's previous credits from other financial institutions. Each previous credit has its own row in bureau, but one loan in the application data can have multiple previous credits.
    
##### Bureau_balance: 
Monthly data about the previous credits in bureau. Each row is one month of a previous credit, and a single previous credit can have multiple rows, one for each month of the credit length.
    
##### Previous_application: 
Previous applications for loans at Home Credit of clients who have loans in the application data. Each current loan in the application data can have multiple previous loans. Each previous application has one row and is identified by the feature SK_ID_PREV.
    
##### POS_CASH_BALANCE: 
Monthly data about previous point of sale or cash loans clients have had with Home Credit. Each row is one month of a previous point of sale or cash loan, and a single previous loan can have many rows.
    
##### Credit_card_balance: 
Monthly data about previous credit cards clients have had with Home Credit. Each row is one month of a credit card balance, and a single credit card can have many rows.
    
##### Installments_payments:
Payment history for previous loans at Home Credit. There is one row for every made payment and one row for every missed payment.
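
Because the supplementary tables hold many rows per client, they have to be collapsed to one row per `SK_ID_CURR` before they can be joined to the application data. The following is a minimal sketch on invented toy data, not the notebook's own pipeline; the column names `SK_ID_BUREAU` and `AMT_CREDIT_SUM` and the aggregate names are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-ins for the real tables; all values are invented.
application = pd.DataFrame({"SK_ID_CURR": [100002, 100003, 100004]})
bureau = pd.DataFrame({
    "SK_ID_CURR": [100002, 100002, 100003],
    "SK_ID_BUREAU": [1, 2, 3],                  # assumed id of one previous credit
    "AMT_CREDIT_SUM": [1000.0, 2500.0, 400.0],  # assumed column name
})

# Collapse the many-rows-per-client table to one row per SK_ID_CURR.
bureau_agg = (
    bureau.groupby("SK_ID_CURR")
    .agg(
        bureau_credit_count=("SK_ID_BUREAU", "count"),
        bureau_credit_sum_mean=("AMT_CREDIT_SUM", "mean"),
    )
    .reset_index()
)

# A left join keeps every application, including clients with no bureau history.
merged = application.merge(bureau_agg, on="SK_ID_CURR", how="left")
print(merged)
```

Clients without any previous credit (here `100004`) end up with missing values in the aggregated columns, which then need the same missing-value treatment as the rest of the features.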


In [60]:
#Loading Packages used in the project
import pandas as pd
import numpy as np
import sklearn as sk
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder as LE

In [87]:
df_train = pd.read_csv('C:\\kaamkidrive\\Python\\Kaggle\\HomeCreditDefaultRisk\\application_train.csv')
df_test = pd.read_csv('C:\\kaamkidrive\\Python\\Kaggle\\HomeCreditDefaultRisk\\application_test.csv')

In [14]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Columns: 122 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR
dtypes: float64(65), int64(41), object(16)
memory usage: 286.2+ MB


In [70]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48744 entries, 0 to 48743
Columns: 121 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR
dtypes: float64(65), int64(40), object(16)
memory usage: 45.0+ MB


In [21]:
df_train.dtypes

SK_ID_CURR                      int64
TARGET                          int64
NAME_CONTRACT_TYPE             object
CODE_GENDER                    object
FLAG_OWN_CAR                   object
FLAG_OWN_REALTY                object
CNT_CHILDREN                    int64
AMT_INCOME_TOTAL              float64
AMT_CREDIT                    float64
AMT_ANNUITY                   float64
AMT_GOODS_PRICE               float64
NAME_TYPE_SUITE                object
NAME_INCOME_TYPE               object
NAME_EDUCATION_TYPE            object
NAME_FAMILY_STATUS             object
NAME_HOUSING_TYPE              object
REGION_POPULATION_RELATIVE    float64
DAYS_BIRTH                      int64
DAYS_EMPLOYED                   int64
DAYS_REGISTRATION             float64
DAYS_ID_PUBLISH                 int64
OWN_CAR_AGE                   float64
FLAG_MOBIL                      int64
FLAG_EMP_PHONE                  int64
FLAG_WORK_PHONE                 int64
FLAG_CONT_MOBILE                int64
FLAG_PHONE  

In [15]:
df_train.head()

| | SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


# Exploratory Data Analysis (EDA)

In [16]:
df_train.TARGET.value_counts()

0    282686
1     24825
Name: TARGET, dtype: int64

In [29]:
print("Percent defaulters in Training Dataset:","{0:.2f}%".format(100*df_train.TARGET.sum()/df_train.shape[0]))

Percent defaulters in Training Dataset: 8.07%
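
The class counts above show a strongly imbalanced target. As a small sketch using only the two counts printed by `value_counts()`, the class shares can be computed directly; the name `majority_baseline_accuracy` is introduced here purely for illustration.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({0: 282686, 1: 24825}, name="TARGET")

# Share of each class in the training data.
share = counts / counts.sum()
print(share.round(4))

# Accuracy of a trivial model that always predicts the majority class (0).
majority_baseline_accuracy = share.max()
print("Always-predict-0 accuracy: {0:.2f}%".format(100 * majority_baseline_accuracy))
```

Since always predicting 0 is already right for roughly 92% of the rows, plain accuracy is a weak yardstick here; a ranking metric such as ROC AUC is a common alternative for this kind of imbalance.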


## Missing Values:
We could drop the columns with a high share of missing values, or impute the missing values with an appropriate mean, median, or zero. Techniques such as XGBoost handle missing values natively, so we will keep the columns with missing values for the time being.
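
For reference, the three imputation options mentioned above can be sketched with plain pandas. This is an illustration on invented toy values, not a step the notebook actually runs; only the column name `OWN_CAR_AGE` comes from the data.

```python
import numpy as np
import pandas as pd

# Toy column with missing entries; the values are invented.
toy = pd.DataFrame({"OWN_CAR_AGE": [2.0, np.nan, 10.0, np.nan, 3.0]})

# Three simple strategies: fill with the median, the mean, or zero.
median_filled = toy["OWN_CAR_AGE"].fillna(toy["OWN_CAR_AGE"].median())
mean_filled = toy["OWN_CAR_AGE"].fillna(toy["OWN_CAR_AGE"].mean())
zero_filled = toy["OWN_CAR_AGE"].fillna(0)

print(pd.DataFrame({
    "median": median_filled,
    "mean": mean_filled,
    "zero": zero_filled,
}))
```

If imputation is used on the real data, the fill statistic would normally be computed on the training set and then reused for the test set, so that both are transformed identically.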

There are 67 columns with missing values, and 57 of them have more than 13% of their values missing.

In [48]:
percent_missing = (100*df_train.isnull().sum()/df_train.shape[0]).sort_values(ascending = False)
percent_missing[percent_missing > 0]

COMMONAREA_MEDI                 69.872297
COMMONAREA_AVG                  69.872297
COMMONAREA_MODE                 69.872297
NONLIVINGAPARTMENTS_MODE        69.432963
NONLIVINGAPARTMENTS_MEDI        69.432963
NONLIVINGAPARTMENTS_AVG         69.432963
FONDKAPREMONT_MODE              68.386172
LIVINGAPARTMENTS_MEDI           68.354953
LIVINGAPARTMENTS_MODE           68.354953
LIVINGAPARTMENTS_AVG            68.354953
FLOORSMIN_MEDI                  67.848630
FLOORSMIN_MODE                  67.848630
FLOORSMIN_AVG                   67.848630
YEARS_BUILD_MEDI                66.497784
YEARS_BUILD_AVG                 66.497784
YEARS_BUILD_MODE                66.497784
OWN_CAR_AGE                     65.990810
LANDAREA_MODE                   59.376738
LANDAREA_AVG                    59.376738
LANDAREA_MEDI                   59.376738
BASEMENTAREA_MEDI               58.515956
BASEMENTAREA_AVG                58.515956
BASEMENTAREA_MODE               58.515956
EXT_SOURCE_1                    56
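
Summary counts such as "columns with any missing values" and "columns above a given missing percentage" can be derived from the same kind of series. A minimal sketch on a toy frame follows; the frame, the 50% threshold, and the names `n_with_missing` and `n_above_threshold` are assumptions for illustration. On the real data, `toy` would be `df_train` and the threshold would be 13.

```python
import numpy as np
import pandas as pd

# Toy frame with known missingness; values are invented.
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, np.nan],  # 75% missing
    "b": [1.0, 2.0, np.nan, 4.0],        # 25% missing
    "c": [1.0, 2.0, 3.0, 4.0],           # complete
})

# isnull().mean() is the fraction of missing rows per column.
percent_missing = (100 * toy.isnull().mean()).sort_values(ascending=False)

threshold = 50  # hypothetical cut-off, in percent
n_with_missing = int((percent_missing > 0).sum())
n_above_threshold = int((percent_missing > threshold).sum())

print("Columns with missing values:", n_with_missing)
print("Columns above threshold:", n_above_threshold)
```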

## Handling Categorical Variables - Encoding Labels

###### Binary Variables : Label Encoding
###### Multi Category Variables: One-hot Encoding
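
To make the difference between the two strategies concrete, here is a small sketch on a toy frame. The column names are taken from the dataset, but the rows are invented and the snippet is illustrative rather than part of the notebook's pipeline.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame: one binary column and one multi-category column.
toy = pd.DataFrame({
    "FLAG_OWN_CAR": ["N", "Y", "N"],
    "NAME_HOUSING_TYPE": ["House / apartment", "With parents", "Rented apartment"],
})

# Label encoding: a two-category column becomes a single 0/1 column.
toy["FLAG_OWN_CAR"] = LabelEncoder().fit_transform(toy["FLAG_OWN_CAR"])

# One-hot encoding: each remaining category becomes its own indicator column.
encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())
```

Label encoding a column with more than two categories would impose an arbitrary numeric order on them, which is why one-hot encoding is used for those columns instead.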

In [58]:
df_train.dtypes.value_counts()

float64    65
int64      41
object     16
dtype: int64

In [59]:
df_train.select_dtypes('object').nunique().sort_values()

NAME_CONTRACT_TYPE             2
FLAG_OWN_CAR                   2
FLAG_OWN_REALTY                2
EMERGENCYSTATE_MODE            2
CODE_GENDER                    3
HOUSETYPE_MODE                 3
FONDKAPREMONT_MODE             4
NAME_EDUCATION_TYPE            5
NAME_FAMILY_STATUS             6
NAME_HOUSING_TYPE              6
NAME_TYPE_SUITE                7
WEEKDAY_APPR_PROCESS_START     7
WALLSMATERIAL_MODE             7
NAME_INCOME_TYPE               8
OCCUPATION_TYPE               18
ORGANIZATION_TYPE             58
dtype: int64

###### Label Encoder

In [89]:
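# Treat a missing emergency-state flag as "No" so the column has exactly two categories.
# Assign the result back instead of using inplace=True on an attribute-accessed column.
df_train['EMERGENCYSTATE_MODE'] = df_train['EMERGENCYSTATE_MODE'].fillna("No")
df_test['EMERGENCYSTATE_MODE'] = df_test['EMERGENCYSTATE_MODE'].fillna("No")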

In [90]:
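le = LE()

# Label-encode only the object columns that have exactly two categories
for col in df_train.columns:
    if df_train[col].dtype == 'object' and df_train[col].nunique() == 2:
        print("Label Encoded for Column - ", col)
        # Fit on the training data, then apply the same mapping to both sets
        le.fit(df_train[col])
        df_train[col] = le.transform(df_train[col])
        df_test[col] = le.transform(df_test[col])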

Label Encoded for Column -  NAME_CONTRACT_TYPE
Label Encoded for Column -  FLAG_OWN_CAR
Label Encoded for Column -  FLAG_OWN_REALTY
Label Encoded for Column -  EMERGENCYSTATE_MODE


In [91]:
# one-hot encoding of categorical variables
df_train = pd.get_dummies(df_train)
df_test = pd.get_dummies(df_test)

In [94]:
# Align the training and testing data, since one-hot encoding created columns in the training data that are not present in the test data
print('Training Features shape: ', df_train.shape)
print('Testing Features shape: ', df_test.shape)

train_labels = df_train['TARGET']

# Align the training and testing data, keep only columns present in both dataframes
df_train, df_test = df_train.align(df_test, join = 'inner', axis = 1)

# Add the target back in
df_train['TARGET'] = train_labels

print('Training Features shape after alignment: ', df_train.shape)
print('Testing Features shape after alignment: ', df_test.shape)

Training Features shape:  (307511, 242)
Testing Features shape:  (48744, 238)
Training Features shape after alignment:  (307511, 239)
Testing Features shape after alignment:  (48744, 238)
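
The effect of `align` with `join = 'inner'` is easier to see on a toy pair of frames. The frames below are invented for illustration; only the pattern of saving `TARGET`, aligning, and adding it back mirrors the cell above.

```python
import pandas as pd

# Toy frames: the training frame has a dummy column (cat_B) and TARGET
# that the test frame lacks.
train = pd.DataFrame({"x": [1, 2], "cat_A": [1, 0], "cat_B": [0, 1], "TARGET": [0, 1]})
test = pd.DataFrame({"x": [3], "cat_A": [1]})

# Save the labels, because TARGET is not in the test frame and would be dropped.
labels = train["TARGET"]

# Keep only the columns present in both frames (axis=1 aligns on columns).
train_aligned, test_aligned = train.align(test, join="inner", axis=1)

# Add the target back to the training frame.
train_aligned["TARGET"] = labels

print(train_aligned.shape)  # (2, 3)
print(test_aligned.shape)   # (1, 2)
```

This is why the aligned training data ends up with exactly one more column than the test data: the shared features plus `TARGET`.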


In [97]:
df_train.dtypes.value_counts()

uint8      129
float64     65
int64       41
int32        4
dtype: int64

## Data Outliers/Inconsistencies

In [98]:
df_train.describe()


| | SK_ID_CURR | NAME_CONTRACT_TYPE | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | ... | HOUSETYPE_MODE_specific housing | HOUSETYPE_MODE_terraced house | WALLSMATERIAL_MODE_Block | WALLSMATERIAL_MODE_Mixed | WALLSMATERIAL_MODE_Monolithic | WALLSMATERIAL_MODE_Others | WALLSMATERIAL_MODE_Panel | WALLSMATERIAL_MODE_Stone, brick | WALLSMATERIAL_MODE_Wooden | TARGET |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 307511.0 | 307511.0 | 307511.0 | 307511.0 | 307511.0 | 307511.0 | 307511.0 | 307499.0 | 307233.0 | 307511.0 | ... | 307511.0 | 307511.0 | 307511.0 | 307511.0 | 307511.0 | 307511.0 | 307511.0 | 307511.0 | 307511.0 | 307511.0 |
| mean | 278180.518577 | 0.095213 | 0.340108 | 0.693673 | 0.417052 | 168797.9 | 599026.0 | 27108.573909 | 538396.2 | 0.020868 | ... | 0.004875 | 0.003941 | 0.03009 | 0.007466 | 0.005785 | 0.005284 | 0.214757 | 0.210773 | 0.017437 | 0.080729 |
| std | 102790.175348 | 0.293509 | 0.473746 | 0.460968 | 0.722121 | 237123.1 | 402490.8 | 14493.737315 | 369446.5 | 0.013831 | ... | 0.069648 | 0.062656 | 0.170835 | 0.086085 | 0.07584 | 0.072501 | 0.410654 | 0.407858 | 0.130892 | 0.272419 |
| min | 100002.0 | 0.0 | 0.0 | 0.0 | 0.0 | 25650.0 | 45000.0 | 1615.5 | 40500.0 | 0.00029 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 189145.5 | 0.0 | 0.0 | 0.0 | 0.0 | 112500.0 | 270000.0 | 16524.0 | 238500.0 | 0.010006 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 278202.0 | 0.0 | 0.0 | 1.0 | 0.0 | 147150.0 | 513531.0 | 24903.0 | 450000.0 | 0.01885 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 367142.5 | 0.0 | 1.0 | 1.0 | 1.0 | 202500.0 | 808650.0 | 34596.0 | 679500.0 | 0.028663 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| max | 456255.0 | 1.0 | 1.0 | 1.0 | 19.0 | 117000000.0 | 4050000.0 | 258025.5 | 4050000.0 | 0.072508 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
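
The summary statistics hint at extreme values: the maximum of `AMT_INCOME_TOTAL` (117,000,000) is far above its 75th percentile (202,500). One common way to flag such values is the 1.5 × IQR rule, sketched below on a handful of values taken from the summary statistics and the first rows of the data. The rule and the names `upper_fence` and `flagged` are assumptions for illustration, not a method the notebook applies.

```python
import pandas as pd

# A few income values taken from the summary statistics and head() output.
income = pd.Series(
    [112500.0, 147150.0, 202500.0, 135000.0, 117000000.0],
    name="AMT_INCOME_TOTAL",
)

# Interquartile range and the conventional upper fence at Q3 + 1.5 * IQR.
q1, q3 = income.quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)

# Values above the fence are candidates for closer inspection, not automatic removal.
flagged = income[income > upper_fence]
print(flagged)
```

Whether a flagged value is a data-entry error or a genuine high earner has to be judged per feature, so a rule like this is best used to shortlist rows for review.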
