<a href="https://colab.research.google.com/github/ROARMarketingConcepts/Data-Analysis-Projects/blob/master/Coding_the_Training_Set.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Home Credit Default Risk Analysis: Coding the Training Set

Many people struggle to get loans due to insufficient or non-existent credit histories. And, unfortunately, this population is often taken advantage of by untrustworthy lenders.

**Home Credit Group**

Home Credit strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. In order to make sure this underserved population has a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities.

While Home Credit is currently using various statistical and machine learning methods to make these predictions, they're challenging Kagglers to help them unlock the full potential of their data. Doing so will ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower their clients to be successful.

### Analysis performed by:

Ken Wood

Marketing Data Scientist

ken@roarmarketingconcepts.com




---



### Mount the Google Drive where the dataset files are located...

In [0]:
from google.colab import drive
drive.mount('/gdrive')

Drive already mounted at /gdrive; to attempt to forcibly remount, call drive.mount("/gdrive", force_remount=True).


### Import the packages needed to perform the analysis...

In [0]:
import pandas as pd
import numpy as np
import sklearn
import scipy

import matplotlib.pyplot as plt
from matplotlib import interactive
plt.rc("font", size=14)
from pylab import scatter, show, legend, xlabel, ylabel

import seaborn as sns
sns.set(style="whitegrid", color_codes=True)

# Ignore useless warnings (see SciPy issue #5998)
import warnings
warnings.filterwarnings(action="ignore", message="^internal gelsd")

In [0]:
class color:
   PURPLE = '\033[95m'
   CYAN = '\033[96m'
   DARKCYAN = '\033[36m'
   BLUE = '\033[94m'
   GREEN = '\033[92m'
   YELLOW = '\033[93m'
   RED = '\033[91m'
   BOLD = '\033[1m'
   UNDERLINE = '\033[4m'
   END = '\033[0m'

### Let's list the datasets available to us...

In [0]:
import os

print('### Home Credit Default Risk Analysis ###\n')
for idx, file in enumerate(os.listdir('/gdrive/My Drive/Colab Notebooks/Home Credit Default Risk/files')):
    print(idx, '-', file)

### Home Credit Default Risk Analysis ###

0 - HomeCredit_columns_description.csv
1 - sample_submission.csv
2 - application_test.csv
3 - bureau_balance.csv
4 - bureau.csv
5 - credit_card_balance.csv
6 - installments_payments.csv
7 - POS_CASH_balance.csv
8 - previous_application.csv
9 - home_credit.png
10 - HomeCredit_columns_description.gsheet
11 - application_train_files


### Load the 'application_train' dataset...

In [0]:
application_train = pd.read_csv('/gdrive/My Drive/Colab Notebooks/Home Credit Default Risk/files/application_train_files/application_train.csv')
# bureau_balance = pd.read_csv('/gdrive/My Drive/Colab Notebooks/Home Credit Default Risk/files/bureau_balance.csv')
# bureau = pd.read_csv('/gdrive/My Drive/Colab Notebooks/Home Credit Default Risk/files/bureau.csv')
# credit_card_balance = pd.read_csv('/gdrive/My Drive/Colab Notebooks/Home Credit Default Risk/files/credit_card_balance.csv')
# installments_payments = pd.read_csv('/gdrive/My Drive/Colab Notebooks/Home Credit Default Risk/files/installments_payments.csv')
# POS_CASH_balance = pd.read_csv('/gdrive/My Drive/Colab Notebooks/Home Credit Default Risk/files/POS_CASH_balance.csv')
# previous_application = pd.read_csv('/gdrive/My Drive/Colab Notebooks/Home Credit Default Risk/files/previous_application.csv')

### Let's get some preliminary information about 'application_train'...

In [0]:
application_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Columns: 122 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR
dtypes: float64(65), int64(41), object(16)
memory usage: 286.2+ MB


In [0]:
application_train.head()

|  | SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


### Separate 'application_train' into the numerical and categorical variables...

In [0]:
application_train_index_target = application_train[['SK_ID_CURR','TARGET']]
application_train_cat = application_train.select_dtypes(include=['object'])
application_train_num =  application_train.select_dtypes(include=['int64','float64'])
application_train_num = application_train_num.drop(['SK_ID_CURR','TARGET'],axis=1)
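
A quick sanity check after a split like this is to confirm that every column ended up in exactly one of the three pieces. The sketch below uses a tiny stand-in frame (the `toy` names and values are hypothetical, chosen only to mirror the column roles above) rather than the real dataset:

```python
import pandas as pd

# Toy stand-in for application_train; values are illustrative only.
toy = pd.DataFrame({
    "SK_ID_CURR": [100002, 100003, 100004],
    "TARGET": [1, 0, 0],
    "CODE_GENDER": pd.Series(["M", "F", "M"], dtype="object"),
    "AMT_CREDIT": [406597.5, 1293502.5, 135000.0],
})

# Same three-way split as in the notebook: id/target, categorical, numerical.
toy_id_target = toy[["SK_ID_CURR", "TARGET"]]
toy_cat = toy.select_dtypes(include=["object"])
toy_num = toy.select_dtypes(include=["number"]).drop(columns=["SK_ID_CURR", "TARGET"])

# Every original column should be recovered by the union of the pieces.
recovered = set(toy_id_target.columns) | set(toy_cat.columns) | set(toy_num.columns)
all_columns_covered = recovered == set(toy.columns)
print(all_columns_covered)
```

Selecting with `include=["number"]` is a small variation on listing `'int64'` and `'float64'` explicitly; it also picks up other numeric widths such as `int32`, should the data ever be downcast.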

### Let’s convert the categorical variables from text to numbers. First, we fill the missing values using the pandas 'ffill' method, which propagates the last valid observation forward into each gap. Then we use scikit-learn's **OrdinalEncoder** class, which encodes categorical features as an integer array.

In [0]:
# Reassign the result rather than filling in place: an in-place fill on this
# selection triggers pandas' SettingWithCopyWarning.
application_train_cat = application_train_cat.ffill()
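
Forward fill copies the most recent non-missing value in each column down into the gaps, so the result depends on row order, and a gap in the very first row has nothing to copy from. A minimal sketch on a toy column (the values are illustrative, not taken from the dataset) shows both behaviors:

```python
import numpy as np
import pandas as pd

# Toy categorical column with gaps, including one in the first row.
toy_cat = pd.DataFrame({
    "OCCUPATION_TYPE": pd.Series(
        [np.nan, "Core staff", np.nan, "Drivers", np.nan], dtype="object"
    ),
})

# Propagate the last valid value forward into each gap.
filled = toy_cat.ffill()

# The leading NaN survives because no earlier value exists to propagate.
remaining_missing = int(filled["OCCUPATION_TYPE"].isna().sum())
print(filled["OCCUPATION_TYPE"].tolist())
print(remaining_missing)
```

If a check like `isna().sum()` reports leftovers on the real data, one possible follow-up is a back fill or a constant placeholder category; which is appropriate depends on the column.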


In [0]:
from sklearn.preprocessing import OrdinalEncoder

In [0]:
ordinal_encoder = OrdinalEncoder()
application_train_cat_encoded = ordinal_encoder.fit_transform(application_train_cat)
application_train_cat_encoded[:10]

array([[ 0.,  1.,  0.,  1.,  6.,  7.,  4.,  3.,  1.,  8.,  6.,  5.,  2.,
         0.,  5.,  0.],
       [ 0.,  0.,  0.,  0.,  1.,  4.,  1.,  1.,  1.,  3.,  1., 39.,  2.,
         0.,  0.,  0.],
       [ 1.,  1.,  1.,  1.,  6.,  7.,  4.,  3.,  1.,  8.,  1., 11.,  2.,
         0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  6.,  7.,  4.,  0.,  1.,  8.,  6.,  5.,  2.,
         0.,  0.,  0.],
       [ 0.,  1.,  0.,  1.,  6.,  7.,  4.,  3.,  1.,  3.,  4., 37.,  2.,
         0.,  0.,  0.],
       [ 0.,  1.,  0.,  1.,  5.,  4.,  4.,  1.,  1.,  8.,  6., 33.,  2.,
         0.,  0.,  0.],
       [ 0.,  0.,  1.,  1.,  6.,  1.,  1.,  1.,  1.,  0.,  3.,  5.,  2.,
         0.,  0.,  0.],
       [ 0.,  1.,  1.,  1.,  6.,  4.,  1.,  1.,  1., 10.,  1., 33.,  2.,
         0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.,  3.,  4.,  1.,  1., 10.,  6., 57.,  2.,
         0.,  0.,  0.],
       [ 1.,  1.,  0.,  1.,  6.,  7.,  4.,  3.,  1.,  8.,  4.,  9.,  2.,
         0.,  0.,  0.]])

In [0]:
ordinal_encoder.categories_

[array(['Cash loans', 'Revolving loans'], dtype=object),
 array(['F', 'M', 'XNA'], dtype=object),
 array(['N', 'Y'], dtype=object),
 array(['N', 'Y'], dtype=object),
 array(['Children', 'Family', 'Group of people', 'Other_A', 'Other_B',
        'Spouse, partner', 'Unaccompanied'], dtype=object),
 array(['Businessman', 'Commercial associate', 'Maternity leave',
        'Pensioner', 'State servant', 'Student', 'Unemployed', 'Working'],
       dtype=object),
 array(['Academic degree', 'Higher education', 'Incomplete higher',
        'Lower secondary', 'Secondary / secondary special'], dtype=object),
 array(['Civil marriage', 'Married', 'Separated', 'Single / not married',
        'Unknown', 'Widow'], dtype=object),
 array(['Co-op apartment', 'House / apartment', 'Municipal apartment',
        'Office apartment', 'Rented apartment', 'With parents'],
       dtype=object),
 array(['Accountants', 'Cleaning staff', 'Cooking staff', 'Core staff',
        'Drivers', 'HR staff', 'High skill tech 
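
The integer codes follow the sorted order of the labels in each column, which is why 'Cash loans' maps to 0 and 'Revolving loans' to 1 above. The codes therefore reflect alphabetical position, not any meaningful ranking. A small sketch on a toy array (a hypothetical three-row sample) shows the mapping and how `inverse_transform` recovers the original text:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Toy sample with two categorical columns: contract type and gender.
toy = np.array([["Cash loans", "M"],
                ["Revolving loans", "F"],
                ["Cash loans", "F"]], dtype=object)

encoder = OrdinalEncoder()
codes = encoder.fit_transform(toy)

# categories_ holds the sorted labels per column; a label's position is its code.
print(encoder.categories_)
print(codes)

# inverse_transform maps the codes back to the original labels.
decoded = encoder.inverse_transform(codes)
```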

### Put the coded categorical variables back into a Pandas DataFrame.

In [0]:
application_train_cat = pd.DataFrame(application_train_cat_encoded,columns = application_train_cat.columns)
application_train_cat.head()

|  | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | OCCUPATION_TYPE | WEEKDAY_APPR_PROCESS_START | ORGANIZATION_TYPE | FONDKAPREMONT_MODE | HOUSETYPE_MODE | WALLSMATERIAL_MODE | EMERGENCYSTATE_MODE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 1.0 | 0.0 | 1.0 | 6.0 | 7.0 | 4.0 | 3.0 | 1.0 | 8.0 | 6.0 | 5.0 | 2.0 | 0.0 | 5.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 4.0 | 1.0 | 1.0 | 1.0 | 3.0 | 1.0 | 39.0 | 2.0 | 0.0 | 0.0 | 0.0 |
| 2 | 1.0 | 1.0 | 1.0 | 1.0 | 6.0 | 7.0 | 4.0 | 3.0 | 1.0 | 8.0 | 1.0 | 11.0 | 2.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 1.0 | 6.0 | 7.0 | 4.0 | 0.0 | 1.0 | 8.0 | 6.0 | 5.0 | 2.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 1.0 | 0.0 | 1.0 | 6.0 | 7.0 | 4.0 | 3.0 | 1.0 | 3.0 | 4.0 | 37.0 | 2.0 | 0.0 | 0.0 | 0.0 |


### Now we need to deal with missing numerical values in 'application_train'. Scikit-Learn provides a handy class to take care of missing values: **SimpleImputer**. First, we create a SimpleImputer instance, specifying that we want to replace each attribute’s missing values with the median of that attribute. Then we apply scikit-learn's StandardScaler class to standardize all of the numerical variables. Both steps are chained in a Pipeline.

In [0]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy="median")),
        ('std_scaler', StandardScaler())
        ])

In [0]:
application_train_num_encoded = num_pipeline.fit_transform(application_train_num)
application_train_num_encoded

array([[-0.57753784,  0.14212925, -0.47809496, ..., -0.26994654,
        -0.30861959, -0.44092567],
       [-0.57753784,  0.42679193,  1.7254498 , ..., -0.26994654,
        -0.30861959, -1.00733095],
       [-0.57753784, -0.4271961 , -1.15288792, ..., -0.26994654,
        -0.30861959, -1.00733095],
       ...,
       [-0.57753784, -0.06662338,  0.19537871, ...,  0.89717516,
        -0.30861959, -0.44092567],
       [-0.57753784,  0.00928667, -0.56875681, ..., -0.26994654,
        -0.30861959, -1.00733095],
       [-0.57753784, -0.04764587,  0.18875991, ...,  2.06429685,
        -0.30861959, -0.44092567]])
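
To see what the two pipeline steps do, it can help to run the same pipeline on a small array where the arithmetic is easy to follow. The sketch below uses a hypothetical 4x2 array (`toy_num`); the medians learned by the imputer are exposed through its `statistics_` attribute, and after scaling each column should have mean 0 and standard deviation 1:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy numerical data with one missing value per column.
toy_num = np.array([[1.0, 10.0],
                    [2.0, np.nan],
                    [3.0, 30.0],
                    [np.nan, 50.0]])

toy_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])
scaled = toy_pipeline.fit_transform(toy_num)

# Medians computed from the non-missing values of each column.
medians = toy_pipeline.named_steps["imputer"].statistics_

# After standardization, each column has zero mean and unit variance.
col_means = scaled.mean(axis=0)
col_stds = scaled.std(axis=0)
print(medians, col_means, col_stds)
```

Because the fitted pipeline stores the medians and scaling parameters, calling `transform` (not `fit_transform`) on held-out data such as the test set reuses the statistics learned from the training set.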

In [0]:
application_train_num = pd.DataFrame(application_train_num_encoded,columns = application_train_num.columns)
application_train_num = pd.concat([application_train_index_target,application_train_num],axis=1)
application_train_num.head()

|  | SK_ID_CURR | TARGET | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | -0.577538 | 0.142129 | -0.478095 | -0.166143 | -0.507236 | -0.149452 | 1.50688 | -0.456215 | ... | -0.090534 | -0.024402 | -0.022529 | -0.018305 | -0.070987 | -0.058766 | -0.155837 | -0.269947 | -0.30862 | -0.440926 |
| 1 | 100003 | 0 | -0.577538 | 0.426792 | 1.72545 | 0.592683 | 1.600873 | -1.25275 | -0.166821 | -0.460115 | ... | -0.090534 | -0.024402 | -0.022529 | -0.018305 | -0.070987 | -0.058766 | -0.155837 | -0.269947 | -0.30862 | -1.007331 |
| 2 | 100004 | 0 | -0.577538 | -0.427196 | -1.152888 | -1.404669 | -1.092145 | -0.783451 | -0.689509 | -0.453299 | ... | -0.090534 | -0.024402 | -0.022529 | -0.018305 | -0.070987 | -0.058766 | -0.155837 | -0.269947 | -0.30862 | -1.007331 |
| 3 | 100006 | 0 | -0.577538 | -0.142533 | -0.71143 | 0.177874 | -0.653463 | -0.928991 | -0.680114 | -0.473217 | ... | -0.090534 | -0.024402 | -0.022529 | -0.018305 | -0.070987 | -0.058766 | -0.155837 | -0.269947 | -0.30862 | -0.440926 |
| 4 | 100007 | 0 | -0.577538 | -0.199466 | -0.213734 | -0.361749 | -0.068554 | 0.56357 | -0.892535 | -0.47321 | ... | -0.090534 | -0.024402 | -0.022529 | -0.018305 | -0.070987 | -0.058766 | -0.155837 | -0.269947 | -0.30862 | -1.007331 |


In [0]:
application_train_coded = pd.concat([application_train_num,application_train_cat],axis=1)
# application_train_coded.to_csv(r'/gdrive/My Drive/Colab Notebooks/Home Credit Default Risk/files/application_train_files/application_train_coded.csv')
application_train_coded.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Columns: 122 entries, SK_ID_CURR to EMERGENCYSTATE_MODE
dtypes: float64(120), int64(2)
memory usage: 286.2 MB
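
The column-wise `pd.concat` calls work here because every piece carries the same default RangeIndex: `pd.concat(..., axis=1)` aligns rows by index label, not by position. A minimal sketch with hypothetical toy frames shows what would happen if one piece had a different index:

```python
import pandas as pd

left = pd.DataFrame({"SK_ID_CURR": [100002, 100003, 100004]})
right_same_index = pd.DataFrame({"CODE_GENDER": [1.0, 0.0, 1.0]})
# Same values, but index labels shifted by one (e.g. after filtering rows).
right_shifted_index = pd.DataFrame({"CODE_GENDER": [1.0, 0.0, 1.0]}, index=[1, 2, 3])

# Matching indexes: rows line up one-to-one.
aligned = pd.concat([left, right_same_index], axis=1)

# Mismatched indexes: pandas takes the union of labels and fills gaps with NaN.
misaligned = pd.concat([left, right_shifted_index], axis=1)

print(aligned.shape, int(aligned.isna().sum().sum()))
print(misaligned.shape, int(misaligned.isna().sum().sum()))
```

Comparing the row count of the combined frame with the original (307511 here) is a cheap way to catch this kind of misalignment.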
