# Developing a Borrower Scoring Algorithm

Last updated: September 25th, 2022

## Introduction

In this project, I use a dataset provided by a consumer finance company to develop a machine learning algorithm that predicts whether a borrower will have payment difficulties.

## 1. Data Loading and Filtering

First we load the necessary packages and the dataset, then we move on to cleaning and analysis.

### 1.1 Loading our packages

We import the packages needed to run this project: matplotlib, numpy, pandas and seaborn.
Since I am running the project on Windows, I also use sklearnex (the Intel extension for scikit-learn) to speed up sklearn.

In [17]:
#Importing packages
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
#Setting large figure size for Seaborn
sns.set(rc={'figure.figsize':(11.7,8.27),"font.size":20,"axes.titlesize":20,"axes.labelsize":18})

#Importing Intel extension for sklearn to improve speed
from sklearnex import patch_sklearn
patch_sklearn()

Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)


### 1.2 Loading the dataset

We will now load the dataset.

In [18]:
app_test = pd.read_csv("Data/application_test.csv", sep=",")
app = pd.read_csv("Data/application_train.csv", sep=",")

app.head()

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
1,100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
2,100004,0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3,100006,0,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,...,0,0,0,0,,,,,,
4,100007,0,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0


### 1.3 Feature Filtering

We begin by removing features in which more than 50% of the values are missing:

In [19]:
#Increasing the maximum number of columns shown by info()
pd.options.display.max_info_columns = 130

#Defining a function that drops, in place, the columns whose share of non-null values is below `percent`
def drop_na_columns(df: pd.DataFrame, percent: float) -> None:
    n = len(df)
    cutoff = n*percent/100
    #Iterating over a copy of the column list, since columns are dropped inside the loop
    for c in list(df.columns):
        #count() returns the number of non-null values in the column
        if df[c].count() < cutoff:
            df.drop(columns=[c], inplace=True)

#Dropping columns that are less than 50% complete
drop_na_columns(app, 50)

len(app.columns)

app.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Data columns (total 81 columns):
 #   Column                        Non-Null Count   Dtype  
---  ------                        --------------   -----  
 0   SK_ID_CURR                    307511 non-null  int64  
 1   TARGET                        307511 non-null  int64  
 2   NAME_CONTRACT_TYPE            307511 non-null  object 
 3   CODE_GENDER                   307511 non-null  object 
 4   FLAG_OWN_CAR                  307511 non-null  object 
 5   FLAG_OWN_REALTY               307511 non-null  object 
 6   CNT_CHILDREN                  307511 non-null  int64  
 7   AMT_INCOME_TOTAL              307511 non-null  float64
 8   AMT_CREDIT                    307511 non-null  float64
 9   AMT_ANNUITY                   307499 non-null  float64
 10  AMT_GOODS_PRICE               307233 non-null  float64
 11  NAME_TYPE_SUITE               306219 non-null  object 
 12  NAME_INCOME_TYPE              307511 non-nul
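
The same filter can also be written without an explicit loop. The sketch below is an alternative to `drop_na_columns`, not a replacement for it: the helper name `drop_sparse_columns` and the toy DataFrame are made up for illustration. It keeps the columns whose share of non-null values reaches the threshold and returns a new DataFrame instead of modifying its input.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df: pd.DataFrame, min_complete_pct: float) -> pd.DataFrame:
    """Return a copy of df without the columns that are less than min_complete_pct % complete."""
    # notna().mean() gives, for each column, the fraction of non-null values
    complete_share = df.notna().mean()
    keep = complete_share[complete_share >= min_complete_pct / 100].index
    return df[keep].copy()

# Toy data: "b" is only 25% complete, "a" and "c" are at least 50% complete
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, 1.0],
    "c": [1.0, np.nan, 3.0, np.nan],
})
filtered = drop_sparse_columns(toy, 50)
print(list(filtered.columns))
```

Returning a new DataFrame makes the step easier to re-run in a notebook, because the original data stays untouched.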

In [20]:
#Computing the share of each TARGET class
app["TARGET"].value_counts(normalize=True)

#The two classes are heavily imbalanced: about 92% of rows have TARGET = 0 and 8% have TARGET = 1

0    0.919271
1    0.080729
Name: TARGET, dtype: float64
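
The notebook does not say at this point how the imbalance will be handled. One common option is to weight each class inversely to its frequency. The sketch below shows the arithmetic on a toy sample of 10,000 rows that roughly reproduces the shares above. The formula is the "balanced" heuristic that scikit-learn documents for `class_weight="balanced"`; whether weighting is the right choice for this project is an assumption, not something the source states.

```python
import numpy as np

# Toy labels: about 92% zeros and 8% ones, close to the shares printed above
y = np.array([0] * 9193 + [1] * 807)

counts = np.bincount(y)                      # number of rows per class
n_samples, n_classes = len(y), len(counts)

# "Balanced" heuristic: weight_c = n_samples / (n_classes * count_c)
class_weights = n_samples / (n_classes * counts)

print(dict(enumerate(class_weights.round(3))))
print("imbalance ratio:", round(counts[0] / counts[1], 1))
```

With these weights, both classes contribute the same total weight to a loss function, so the minority class is no longer drowned out by the majority class.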

## 2. Data Preparation

We will now clean our dataset.

### 2.1 Cleaning categorical variables

We will begin the cleaning process by cleaning categorical variables.

In [21]:
#Looking at unique values of categorical variables
def investigate_categories(df: pd.DataFrame):
    for c in df.columns:
        if df[c].dtype == 'object':
            print("Column",c)
            print("Unique values: {}".format(df[c].unique()))
            print("")
            print("-----------------------------------")
            
investigate_categories(app)

Column NAME_CONTRACT_TYPE
Unique values: ['Cash loans' 'Revolving loans']

-----------------------------------
Column CODE_GENDER
Unique values: ['M' 'F' 'XNA']

-----------------------------------
Column FLAG_OWN_CAR
Unique values: ['N' 'Y']

-----------------------------------
Column FLAG_OWN_REALTY
Unique values: ['Y' 'N']

-----------------------------------
Column NAME_TYPE_SUITE
Unique values: ['Unaccompanied' 'Family' 'Spouse, partner' 'Children' 'Other_A' nan
 'Other_B' 'Group of people']

-----------------------------------
Column NAME_INCOME_TYPE
Unique values: ['Working' 'State servant' 'Commercial associate' 'Pensioner' 'Unemployed'
 'Student' 'Businessman' 'Maternity leave']

-----------------------------------
Column NAME_EDUCATION_TYPE
Unique values: ['Secondary / secondary special' 'Higher education' 'Incomplete higher'
 'Lower secondary' 'Academic degree']

-----------------------------------
Column NAME_FAMILY_STATUS
Unique values: ['Single / not married' 'Married' 'C

In [22]:
#Investigating "XNA" values in GENDER
app[app["CODE_GENDER"] == 'XNA']
#Only 4 rows

#Let's look at the test data
app_test[app_test["CODE_GENDER"] == 'XNA']
#0 row

#We will delete the rows with 'XNA' gender from our dataset
app = app[app["CODE_GENDER"] != 'XNA']

In [23]:
#Investigating "XNA" values in ORGANIZATION_TYPE
app[app["ORGANIZATION_TYPE"] == 'XNA']
#55374 rows

app[app["ORGANIZATION_TYPE"] == 'XNA']["TARGET"].value_counts(normalize=True)
#The share of TARGET = 1 (5.4%) deviates noticeably from the overall share (8.1%), so it is worth keeping these values

#They will be encoded during the feature engineering part of the project

0    0.946004
1    0.053996
Name: TARGET, dtype: float64
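
The comparison made here (target rate among flagged rows versus the remaining rows) comes up again for each column with missing or placeholder values. A small helper can make it explicit. This is a minimal sketch: the function name `target_rate_by_flag` and the toy data are hypothetical.

```python
import pandas as pd

def target_rate_by_flag(df: pd.DataFrame, flag: pd.Series, target: str = "TARGET") -> pd.DataFrame:
    """Compare the share of target == 1 between rows where flag is True and all other rows."""
    labels = flag.map({True: "flagged", False: "other"}).rename("group")
    grouped = df[target].groupby(labels)
    return pd.DataFrame({"n_rows": grouped.size(), "target_rate": grouped.mean()})

toy = pd.DataFrame({
    "ORGANIZATION_TYPE": ["XNA", "XNA", "XNA", "XNA", "School", "School", "Bank", "Bank"],
    "TARGET": [0, 0, 0, 1, 1, 0, 1, 0],
})
summary = target_rate_by_flag(toy, toy["ORGANIZATION_TYPE"] == "XNA")
print(summary)
```

The same call works for missing values, for example with `df["OCCUPATION_TYPE"].isna()` as the flag.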

In [24]:
#Looking at "nan" values in EMERGENCYSTATE_MODE
print(len(app[app["EMERGENCYSTATE_MODE"].isna()]))

app[app["EMERGENCYSTATE_MODE"].isna()]["TARGET"].value_counts(normalize=True)
#These rows represent about half of our dataset. Since their target distribution deviates slightly from the overall one,
#we keep them and encode the missing values as a separate 'UKN' category

app.loc[app["EMERGENCYSTATE_MODE"].isna(),"EMERGENCYSTATE_MODE"] = 'UKN'

145754


In [25]:
#Looking at "nan" values in OCCUPATION TYPE
print(len(app[app["OCCUPATION_TYPE"].isna()]))

app[app["OCCUPATION_TYPE"].isna()]["TARGET"].value_counts(normalize=True)
#These rows represent about a third of our dataset. Since their target distribution deviates from the overall one,
#we keep them and encode the missing values as a separate 'UKN' category

app.loc[app["OCCUPATION_TYPE"].isna(),"OCCUPATION_TYPE"] = 'UKN'

96389


In [26]:
#Looking at "nan" values in NAME_TYPE_SUITE
print(len(app[app["NAME_TYPE_SUITE"].isna()]))
#Only 1292 NA values

#We will delete these rows
app = app[app["NAME_TYPE_SUITE"].notna()]

1292


In [27]:
#WEEKDAY_APPR_PROCESS_START is stored as a day name (string)

import time
#Let's convert it into a weekday number (Monday = 0, ..., Sunday = 6)
app["WEEKDAY_APPR_PROCESS_START"] = app["WEEKDAY_APPR_PROCESS_START"].apply(lambda x: time.strptime(x, '%A').tm_wday)
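
Parsing day names with `%A` depends on the locale that is active when the notebook runs. An explicit mapping avoids that dependency. The sketch below assumes the column holds English day names, as the `strptime` call implies; the names `WEEKDAY_TO_NUMBER` and `weekday_to_number` are illustrative.

```python
import pandas as pd

# Explicit, locale-independent mapping (Monday = 0 ... Sunday = 6, as with tm_wday)
WEEKDAY_TO_NUMBER = {
    "MONDAY": 0, "TUESDAY": 1, "WEDNESDAY": 2, "THURSDAY": 3,
    "FRIDAY": 4, "SATURDAY": 5, "SUNDAY": 6,
}

def weekday_to_number(names: pd.Series) -> pd.Series:
    """Map English day names (any letter case) to 0..6; unknown names become NaN."""
    return names.str.upper().map(WEEKDAY_TO_NUMBER)

days = pd.Series(["MONDAY", "Wednesday", "sunday"])
print(weekday_to_number(days).tolist())
```

Unknown names turn into NaN instead of raising an error, so a quick `isna().sum()` on the result reveals unexpected values.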

In [28]:
#Verifying that we've dealt with all missing values of categorical variables
for c in app.columns:
    if app[c].dtype == 'object':
        print(app[c].isna().sum())

0
0
0
0
0
0
0
0
0
0
0
0
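
A column of anonymous zeros is hard to audit. The sketch below produces the same check with the column names attached. It is an illustrative alternative: the helper name `missing_by_categorical_column` and the toy data are not part of the notebook.

```python
import numpy as np
import pandas as pd

def missing_by_categorical_column(df: pd.DataFrame) -> pd.Series:
    """Number of missing values in each object-typed column, indexed by column name."""
    return df.select_dtypes(include="object").isna().sum()

toy = pd.DataFrame({
    "CODE_GENDER": pd.Series(["M", "F", None], dtype="object"),
    "OCCUPATION_TYPE": pd.Series(["Laborers", "UKN", "Core staff"], dtype="object"),
    "AMT_CREDIT": [1.0, np.nan, 3.0],
})
report = missing_by_categorical_column(toy)
print(report)
```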


We have finished cleaning the categorical variables; we now turn to the numeric variables.

### 2.2 Cleaning numeric variables 

In [29]:
#Looking for outliers 

#Increasing the number of maximum columns shown
pd.options.display.max_columns = 100
app.describe()

Unnamed: 0,SK_ID_CURR,TARGET,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,REGION_POPULATION_RELATIVE,DAYS_BIRTH,DAYS_EMPLOYED,DAYS_REGISTRATION,DAYS_ID_PUBLISH,FLAG_MOBIL,FLAG_EMP_PHONE,FLAG_WORK_PHONE,FLAG_CONT_MOBILE,FLAG_PHONE,FLAG_EMAIL,CNT_FAM_MEMBERS,REGION_RATING_CLIENT,REGION_RATING_CLIENT_W_CITY,WEEKDAY_APPR_PROCESS_START,HOUR_APPR_PROCESS_START,REG_REGION_NOT_LIVE_REGION,REG_REGION_NOT_WORK_REGION,LIVE_REGION_NOT_WORK_REGION,REG_CITY_NOT_LIVE_CITY,REG_CITY_NOT_WORK_CITY,LIVE_CITY_NOT_WORK_CITY,EXT_SOURCE_2,EXT_SOURCE_3,YEARS_BEGINEXPLUATATION_AVG,FLOORSMAX_AVG,YEARS_BEGINEXPLUATATION_MODE,FLOORSMAX_MODE,YEARS_BEGINEXPLUATATION_MEDI,FLOORSMAX_MEDI,TOTALAREA_MODE,OBS_30_CNT_SOCIAL_CIRCLE,DEF_30_CNT_SOCIAL_CIRCLE,OBS_60_CNT_SOCIAL_CIRCLE,DEF_60_CNT_SOCIAL_CIRCLE,DAYS_LAST_PHONE_CHANGE,FLAG_DOCUMENT_2,FLAG_DOCUMENT_3,FLAG_DOCUMENT_4,FLAG_DOCUMENT_5,FLAG_DOCUMENT_6,FLAG_DOCUMENT_7,FLAG_DOCUMENT_8,FLAG_DOCUMENT_9,FLAG_DOCUMENT_10,FLAG_DOCUMENT_11,FLAG_DOCUMENT_12,FLAG_DOCUMENT_13,FLAG_DOCUMENT_14,FLAG_DOCUMENT_15,FLAG_DOCUMENT_16,FLAG_DOCUMENT_17,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
count,306215.0,306215.0,306215.0,306215.0,306215.0,306203.0,306215.0,306215.0,306215.0,306215.0,306215.0,306215.0,306215.0,306215.0,306215.0,306215.0,306215.0,306215.0,306215.0,306215.0,306215.0,306215.0,306215.0,306215.0,306215.0,306215.0,306215.0,306215.0,306215.0,305556.0,245464.0,156794.0,153795.0,156794.0,153795.0,156794.0,153795.0,158364.0,305194.0,305194.0,305194.0,305194.0,306214.0,306215.0,306215.0,306215.0,306215.0,306215.0,306215.0,306215.0,306215.0,306215.0,306215.0,306215.0,306215.0,306215.0,306215.0,306215.0,306215.0,306215.0,306215.0,306215.0,306215.0,264805.0,264805.0,264805.0,264805.0,264805.0,264805.0
mean,278164.519246,0.080842,0.417004,168783.0,598799.7,27122.21047,537947.9,0.020865,-16040.633855,63858.968166,-4987.987728,-2994.331035,0.999997,0.819767,0.19905,0.998126,0.280764,0.056797,2.152778,2.052617,2.031638,2.527104,12.061999,0.015163,0.050749,0.040619,0.078164,0.230492,0.179599,0.5143519,0.510923,0.977728,0.226261,0.977056,0.222292,0.977746,0.225877,0.102525,1.421532,0.143374,1.404605,0.100005,-964.425634,4.2e-05,0.71056,8.2e-05,0.014715,0.087857,0.00014,0.081342,0.003854,2e-05,0.00384,7e-06,0.003406,0.002805,0.00113,0.009405,0.000261,0.007818,0.000571,0.000493,0.00033,0.00639,0.006982,0.034448,0.267616,0.265697,1.903903
std,102786.814894,0.272593,0.722104,237517.9,401960.6,14490.897429,368918.6,0.01383,4362.856052,141313.558266,3522.557759,1509.518082,0.001807,0.384382,0.399286,0.043255,0.449373,0.231454,0.910584,0.509103,0.502794,1.79145,3.266155,0.122199,0.219484,0.197405,0.26843,0.421148,0.383854,0.1910904,0.194836,0.059251,0.144579,0.064624,0.143649,0.059927,0.145009,0.107424,2.400847,0.446637,2.37973,0.362213,826.707866,0.006516,0.453503,0.009035,0.12041,0.283087,0.011849,0.273359,0.061957,0.004426,0.061852,0.002556,0.058262,0.05289,0.033595,0.096523,0.016161,0.088073,0.023899,0.022201,0.018158,0.083791,0.110479,0.204792,0.915624,0.794827,1.869584
min,100002.0,0.0,0.0,25650.0,45000.0,1615.5,40500.0,0.00029,-25229.0,-17912.0,-24672.0,-7197.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8.173617e-08,0.000527,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-4292.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,189130.5,0.0,0.0,112500.0,270000.0,16551.0,238500.0,0.010006,-19685.0,-2761.0,-7481.0,-4299.0,1.0,1.0,0.0,1.0,0.0,0.0,2.0,2.0,2.0,1.0,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.3923271,0.37065,0.9767,0.1667,0.9767,0.1667,0.9767,0.1667,0.0412,0.0,0.0,0.0,0.0,-1571.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,278184.0,0.0,0.0,147600.0,513531.0,24930.0,450000.0,0.01885,-15756.0,-1214.0,-4507.0,-3255.0,1.0,1.0,0.0,1.0,0.0,0.0,2.0,2.0,2.0,2.0,12.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5659453,0.535276,0.9816,0.1667,0.9816,0.1667,0.9816,0.1667,0.0688,0.0,0.0,0.0,0.0,-759.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
75%,367126.5,0.0,1.0,202500.0,808650.0,34596.0,679500.0,0.028663,-12418.0,-289.0,-2013.0,-1720.0,1.0,1.0,0.0,1.0,1.0,0.0,3.0,2.0,2.0,4.0,14.0,0.0,0.0,0.0,0.0,0.0,0.0,0.6636183,0.669057,0.9866,0.3333,0.9866,0.3333,0.9866,0.3333,0.1275,2.0,0.0,2.0,0.0,-276.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0
max,456255.0,1.0,19.0,117000000.0,4050000.0,258025.5,4050000.0,0.072508,-7489.0,365243.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,20.0,3.0,3.0,6.0,23.0,1.0,1.0,1.0,1.0,1.0,1.0,0.8549997,0.89601,1.0,1.0,1.0,1.0,1.0,1.0,1.0,348.0,34.0,344.0,24.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,4.0,9.0,8.0,27.0,261.0,25.0


In [30]:
#DAYS_BIRTH, DAYS_REGISTRATION and DAYS_ID_PUBLISH only have negative (or zero) values; we convert them to positive durations
app["DAYS_REGISTRATION"] = abs(app["DAYS_REGISTRATION"])
app["DAYS_ID_PUBLISH"] = abs(app["DAYS_ID_PUBLISH"])
app["DAYS_BIRTH"] = abs(app["DAYS_BIRTH"])

#DAYS_EMPLOYED has aberrant values (365243 days, about 1000 years); we replace them with NaN
app.loc[app["DAYS_EMPLOYED"] > 100000, "DAYS_EMPLOYED"] = np.nan
app["DAYS_EMPLOYED"] = abs(app["DAYS_EMPLOYED"])

print(app["DAYS_BIRTH"].min()/365, app["DAYS_BIRTH"].max()/365)
#No outliers: ages range from about 20 to 69 years

#Grouping ages into decades (1 = under 30, ..., 5 = 60 to 69; 0 is a fallback for ages of 70 and above)
def label_age(days_birth):
    age_years = days_birth / 365
    if age_years < 30: return 1
    elif age_years < 40: return 2
    elif age_years < 50: return 3
    elif age_years < 60: return 4
    elif age_years < 70: return 5
    else: return 0
    
app["AGE_LABEL"] = app["DAYS_BIRTH"].apply(label_age)

app = app[app['AMT_INCOME_TOTAL'] < 20000000] # remove an outlier (117 million)

# Calculated features
app['DAYS_EMPLOYED_PCT'] = app['DAYS_EMPLOYED'] / app['DAYS_BIRTH']
app['INCOME_CREDIT_PCT'] = app['AMT_INCOME_TOTAL'] / app['AMT_CREDIT']
app['INCOME_PER_PERSON'] = app['AMT_INCOME_TOTAL'] / app['CNT_FAM_MEMBERS']
app['ANNUITY_INCOME_PCT'] = app['AMT_ANNUITY'] / app['AMT_INCOME_TOTAL']
app['PAYMENT_RATE'] = app['AMT_ANNUITY'] / app['AMT_CREDIT']

20.517808219178082 69.12054794520547
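
The age labels can also be computed without a Python-level `apply`. The sketch below uses `pd.cut` with the same bin edges as `label_age`; the function name `label_age_vectorized` is hypothetical. One difference to be aware of: ages of 70 and above become NaN here rather than 0, which does not matter for this dataset since the maximum age is about 69.

```python
import pandas as pd

def label_age_vectorized(days_birth: pd.Series) -> pd.Series:
    """Bin ages (given as positive day counts) into decades: 1 = under 30, ..., 5 = 60 to 69."""
    age_years = days_birth / 365
    # right=False makes each bin closed on the left: [0, 30), [30, 40), ...
    return pd.cut(age_years, bins=[0, 30, 40, 50, 60, 70], labels=[1, 2, 3, 4, 5], right=False)

days = pd.Series([7489, 30 * 365, 16765, 25229])
print(label_age_vectorized(days).tolist())
```

The result is a categorical Series; `.astype(int)` converts it to plain integers when no value is missing.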


In [31]:
app.describe()

Unnamed: 0,SK_ID_CURR,TARGET,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,REGION_POPULATION_RELATIVE,DAYS_BIRTH,DAYS_EMPLOYED,DAYS_REGISTRATION,DAYS_ID_PUBLISH,FLAG_MOBIL,FLAG_EMP_PHONE,FLAG_WORK_PHONE,FLAG_CONT_MOBILE,FLAG_PHONE,FLAG_EMAIL,CNT_FAM_MEMBERS,REGION_RATING_CLIENT,REGION_RATING_CLIENT_W_CITY,WEEKDAY_APPR_PROCESS_START,HOUR_APPR_PROCESS_START,REG_REGION_NOT_LIVE_REGION,REG_REGION_NOT_WORK_REGION,LIVE_REGION_NOT_WORK_REGION,REG_CITY_NOT_LIVE_CITY,REG_CITY_NOT_WORK_CITY,LIVE_CITY_NOT_WORK_CITY,EXT_SOURCE_2,EXT_SOURCE_3,YEARS_BEGINEXPLUATATION_AVG,FLOORSMAX_AVG,YEARS_BEGINEXPLUATATION_MODE,FLOORSMAX_MODE,YEARS_BEGINEXPLUATATION_MEDI,FLOORSMAX_MEDI,TOTALAREA_MODE,OBS_30_CNT_SOCIAL_CIRCLE,DEF_30_CNT_SOCIAL_CIRCLE,OBS_60_CNT_SOCIAL_CIRCLE,DEF_60_CNT_SOCIAL_CIRCLE,DAYS_LAST_PHONE_CHANGE,FLAG_DOCUMENT_2,FLAG_DOCUMENT_3,FLAG_DOCUMENT_4,FLAG_DOCUMENT_5,FLAG_DOCUMENT_6,FLAG_DOCUMENT_7,FLAG_DOCUMENT_8,FLAG_DOCUMENT_9,FLAG_DOCUMENT_10,FLAG_DOCUMENT_11,FLAG_DOCUMENT_12,FLAG_DOCUMENT_13,FLAG_DOCUMENT_14,FLAG_DOCUMENT_15,FLAG_DOCUMENT_16,FLAG_DOCUMENT_17,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR,AGE_LABEL,DAYS_EMPLOYED_PCT,INCOME_CREDIT_PCT,INCOME_PER_PERSON,ANNUITY_INCOME_PCT,PAYMENT_RATE
count,306214.0,306214.0,306214.0,306214.0,306214.0,306202.0,306214.0,306214.0,306214.0,251036.0,306214.0,306214.0,306214.0,306214.0,306214.0,306214.0,306214.0,306214.0,306214.0,306214.0,306214.0,306214.0,306214.0,306214.0,306214.0,306214.0,306214.0,306214.0,306214.0,305555.0,245463.0,156793.0,153794.0,156793.0,153794.0,156793.0,153794.0,158363.0,305193.0,305193.0,305193.0,305193.0,306213.0,306214.0,306214.0,306214.0,306214.0,306214.0,306214.0,306214.0,306214.0,306214.0,306214.0,306214.0,306214.0,306214.0,306214.0,306214.0,306214.0,306214.0,306214.0,306214.0,306214.0,264804.0,264804.0,264804.0,264804.0,264804.0,264804.0,306214.0,251036.0,306214.0,306214.0,306202.0,306202.0
mean,278165.052199,0.080839,0.417002,168401.4,598799.8,27122.2135,537948.2,0.020865,16040.645042,2385.328778,4987.981934,2994.328917,0.999997,0.819767,0.19905,0.998125,0.280764,0.056797,2.152776,2.052617,2.031638,2.527108,12.061993,0.015163,0.050749,0.040619,0.078164,0.230492,0.1796,0.5143533,0.510925,0.977728,0.226262,0.977056,0.222292,0.977746,0.225878,0.102525,1.421537,0.143375,1.40461,0.100006,-964.428783,4.2e-05,0.710559,8.2e-05,0.014715,0.087857,0.00014,0.081342,0.003854,2e-05,0.00384,7e-06,0.003406,0.002805,0.00113,0.009405,0.000261,0.007818,0.000571,0.000493,0.00033,0.00639,0.006983,0.034448,0.267617,0.265698,1.903906,2.893617,0.156905,0.399176,92965.47,0.181055,0.05374
std,102786.559635,0.272588,0.722104,108809.4,401961.2,14490.920994,368919.2,0.01383,4362.858784,2338.980862,3522.562052,1509.520091,0.001807,0.384382,0.399287,0.043255,0.449373,0.231454,0.910585,0.509104,0.502795,1.791451,3.266158,0.1222,0.219485,0.197405,0.26843,0.421149,0.383854,0.1910893,0.194835,0.059251,0.144579,0.064624,0.143649,0.059928,0.145009,0.107425,2.40085,0.446638,2.379733,0.362213,826.707378,0.006516,0.453504,0.009035,0.120411,0.283087,0.011849,0.27336,0.061957,0.004426,0.061852,0.002556,0.058263,0.05289,0.033595,0.096523,0.016161,0.088074,0.023899,0.022201,0.018158,0.083792,0.110479,0.204792,0.915626,0.794829,1.869587,1.234927,0.133577,0.343636,73174.46,0.094634,0.02249
min,100002.0,0.0,0.0,25650.0,45000.0,1615.5,40500.0,0.00029,7489.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8.173617e-08,0.000527,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-4292.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.011801,2812.5,0.003333,0.022073
25%,189131.25,0.0,0.0,112500.0,270000.0,16551.0,238500.0,0.010006,12418.0,768.0,2013.0,1720.0,1.0,1.0,0.0,1.0,0.0,0.0,2.0,2.0,2.0,1.0,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.3923297,0.37065,0.9767,0.1667,0.9767,0.1667,0.9767,0.1667,0.0412,0.0,0.0,0.0,0.0,-1571.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.056138,0.193803,47250.0,0.11485,0.036931
50%,278185.0,0.0,0.0,147600.0,513531.0,24930.0,450000.0,0.01885,15756.0,1649.0,4507.0,3255.0,1.0,1.0,0.0,1.0,0.0,0.0,2.0,2.0,2.0,2.0,12.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5659453,0.535276,0.9816,0.1667,0.9816,0.1667,0.9816,0.1667,0.0688,0.0,0.0,0.0,0.0,-759.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,3.0,0.118763,0.306272,75000.0,0.162956,0.05
75%,367126.75,0.0,1.0,202500.0,808650.0,34596.0,679500.0,0.028663,19685.0,3176.0,7481.0,4299.0,1.0,1.0,0.0,1.0,1.0,0.0,3.0,2.0,2.0,4.0,14.0,0.0,0.0,0.0,0.0,0.0,0.0,0.6636195,0.669057,0.9866,0.3333,0.9866,0.3333,0.9866,0.3333,0.1275,2.0,0.0,2.0,0.0,-276.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,4.0,0.219206,0.495376,112500.0,0.2292,0.064079
max,456255.0,1.0,19.0,18000090.0,4050000.0,258025.5,4050000.0,0.072508,25229.0,17912.0,24672.0,7197.0,1.0,1.0,1.0,1.0,1.0,1.0,20.0,3.0,3.0,6.0,23.0,1.0,1.0,1.0,1.0,1.0,1.0,0.8549997,0.89601,1.0,1.0,1.0,1.0,1.0,1.0,1.0,348.0,34.0,344.0,24.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,4.0,9.0,8.0,27.0,261.0,25.0,5.0,0.728811,26.6668,6750000.0,1.875965,0.12443


In [32]:
#Turning SK_ID_CURR into an ID field :
app.set_index('SK_ID_CURR', inplace=True)

app.head()

SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,NAME_TYPE_SUITE,NAME_INCOME_TYPE,NAME_EDUCATION_TYPE,NAME_FAMILY_STATUS,NAME_HOUSING_TYPE,REGION_POPULATION_RELATIVE,DAYS_BIRTH,DAYS_EMPLOYED,DAYS_REGISTRATION,DAYS_ID_PUBLISH,FLAG_MOBIL,FLAG_EMP_PHONE,FLAG_WORK_PHONE,FLAG_CONT_MOBILE,FLAG_PHONE,FLAG_EMAIL,OCCUPATION_TYPE,CNT_FAM_MEMBERS,REGION_RATING_CLIENT,REGION_RATING_CLIENT_W_CITY,WEEKDAY_APPR_PROCESS_START,HOUR_APPR_PROCESS_START,REG_REGION_NOT_LIVE_REGION,REG_REGION_NOT_WORK_REGION,LIVE_REGION_NOT_WORK_REGION,REG_CITY_NOT_LIVE_CITY,REG_CITY_NOT_WORK_CITY,LIVE_CITY_NOT_WORK_CITY,ORGANIZATION_TYPE,EXT_SOURCE_2,EXT_SOURCE_3,YEARS_BEGINEXPLUATATION_AVG,FLOORSMAX_AVG,YEARS_BEGINEXPLUATATION_MODE,FLOORSMAX_MODE,YEARS_BEGINEXPLUATATION_MEDI,FLOORSMAX_MEDI,TOTALAREA_MODE,EMERGENCYSTATE_MODE,OBS_30_CNT_SOCIAL_CIRCLE,DEF_30_CNT_SOCIAL_CIRCLE,OBS_60_CNT_SOCIAL_CIRCLE,DEF_60_CNT_SOCIAL_CIRCLE,DAYS_LAST_PHONE_CHANGE,FLAG_DOCUMENT_2,FLAG_DOCUMENT_3,FLAG_DOCUMENT_4,FLAG_DOCUMENT_5,FLAG_DOCUMENT_6,FLAG_DOCUMENT_7,FLAG_DOCUMENT_8,FLAG_DOCUMENT_9,FLAG_DOCUMENT_10,FLAG_DOCUMENT_11,FLAG_DOCUMENT_12,FLAG_DOCUMENT_13,FLAG_DOCUMENT_14,FLAG_DOCUMENT_15,FLAG_DOCUMENT_16,FLAG_DOCUMENT_17,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR,AGE_LABEL,DAYS_EMPLOYED_PCT,INCOME_CREDIT_PCT,INCOME_PER_PERSON,ANNUITY_INCOME_PCT,PAYMENT_RATE
100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,351000.0,Unaccompanied,Working,Secondary / secondary special,Single / not married,House / apartment,0.018801,9461,637.0,3648.0,2120,1,1,0,1,1,0,Laborers,1.0,2,2,2,10,0,0,0,0,0,0,Business Entity Type 3,0.262949,0.139376,0.9722,0.0833,0.9722,0.0833,0.9722,0.0833,0.0149,No,2.0,2.0,2.0,2.0,-1134.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0,1,0.067329,0.498036,202500.0,0.121978,0.060749
100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,1129500.0,Family,State servant,Higher education,Married,House / apartment,0.003541,16765,1188.0,1186.0,291,1,1,0,1,1,0,Core staff,2.0,1,1,0,11,0,0,0,0,0,0,School,0.622246,,0.9851,0.2917,0.9851,0.2917,0.9851,0.2917,0.0714,No,1.0,0.0,1.0,0.0,-828.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,3,0.070862,0.208736,135000.0,0.132217,0.027598
100004,0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,135000.0,Unaccompanied,Working,Secondary / secondary special,Single / not married,House / apartment,0.010032,19046,225.0,4260.0,2531,1,1,1,1,1,0,Laborers,1.0,2,2,0,9,0,0,0,0,0,0,Government,0.555912,0.729567,,,,,,,,UKN,0.0,0.0,0.0,0.0,-815.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,4,0.011814,0.5,67500.0,0.1,0.05
100006,0,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,297000.0,Unaccompanied,Working,Secondary / secondary special,Civil marriage,House / apartment,0.008019,19005,3039.0,9833.0,2437,1,1,0,1,0,0,Laborers,2.0,2,2,2,17,0,0,0,0,0,0,Business Entity Type 3,0.650442,,,,,,,,,UKN,2.0,0.0,2.0,0.0,-617.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,,,,,4,0.159905,0.431748,67500.0,0.2199,0.094941
100007,0,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,513000.0,Unaccompanied,Working,Secondary / secondary special,Single / not married,House / apartment,0.028663,19932,3038.0,4311.0,3458,1,1,0,1,0,0,Core staff,1.0,2,2,3,11,0,0,0,0,1,1,Religion,0.322738,,,,,,,,,UKN,0.0,0.0,0.0,0.0,-1106.0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,4,0.152418,0.236842,121500.0,0.179963,0.042623


Analysis of the describe() output shows that there is **no clear outlier** in the rest of the numeric data. We can now start handling missing values.

In [33]:
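len(app.columns[app.isnull().any()])
#21 columns with NA values

#Dropping rows in which more than `pct`% of the values are missing
def drop_na_rows(df: pd.DataFrame, pct: float) -> pd.DataFrame:
    n = len(df.columns)
    cutoff = n*pct/100
    #Keeping only the rows whose number of missing values does not exceed the cutoff
    return df[df.isna().sum(axis=1) <= cutoff]

app = drop_na_rows(app, 50)
#No row is removed: only 21 columns contain NA values, which is below the cutoff of 43 (50% of 86 columns)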


We've now finished cleaning incorrect values. 
Before starting to perform data imputation, we need to perform a **train/validation/test split**. This will **prevent us from introducing data leakage during the cleaning process**. 

### 2.3 Performing the train / validation / test split

We will divide our dataset as follows:

- 80% train set
- 10% validation set
- 10% test set

We will be able to revisit these values during the hyperparameter tuning part of the project.

In [34]:
from sklearn.model_selection import train_test_split

y = app["TARGET"]
ID = app.index
X = app.drop(columns=["TARGET"])

#Splitting train and test sets; the index (SK_ID_CURR) is split alongside X and y to keep track of the original IDs
X_train, X_test, y_train, y_test, indices_train, indices_test = train_test_split(
    X, y, ID, test_size=0.1, shuffle=True, random_state=8)

#Assigning the correct indices (the SK_IDs) to y_test
y_test.index = indices_test

#Applying the same function to separate train and validation set
X_train, X_val, y_train, y_val, indices_train, indices_val = train_test_split(
    X_train, y_train, indices_train, test_size = 0.1/0.9, shuffle=True, random_state=8)

#Assigning the SK IDs to y_train and y_val
y_train.index = indices_train
y_val.index = indices_val

In [35]:
print(len(X_train), len(X_test), len(X_val))
#Our test and validation sets have the same length, each about 10% of the overall length of X

244970 30622 30622
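
Two details of this split are worth spelling out. The second call uses `test_size = 0.1/0.9` because the validation set must be 10% of the full data, which is one ninth of the 90% that remains after the test set is removed. And since only about 8% of the rows have TARGET = 1, a purely random split can leave slightly different class shares in each subset. The sketch below shows a stratified variant on toy data. It is an optional refinement, not what the notebook does, and all names in it (`X_toy`, `y_toy`, `n_holdout`) are illustrative. It passes absolute row counts instead of fractions so that the subset sizes are easy to check.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with the same 8% positive share as TARGET
rng = np.random.default_rng(8)
X_toy = pd.DataFrame({"feature": rng.normal(size=1000)})
y_toy = pd.Series([1] * 80 + [0] * 920, name="TARGET")

n_holdout = len(X_toy) // 10  # 10% of the rows for each of test and validation

# stratify=... keeps the class proportions (almost) identical in every subset
X_rest, X_te, y_rest, y_te = train_test_split(
    X_toy, y_toy, test_size=n_holdout, stratify=y_toy, random_state=8)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_rest, y_rest, test_size=n_holdout, stratify=y_rest, random_state=8)

print(len(X_tr), len(X_va), len(X_te))
print(round(y_tr.mean(), 3), round(y_va.mean(), 3), round(y_te.mean(), 3))
```

`train_test_split` keeps the pandas index of its inputs, so the rows of `X_te` and `y_te` stay aligned on the original IDs.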


Now that we have performed the split, we can carry on to perform data imputation.

These operations will also have to be performed on the validation and test sets, so we will create a function that can be applied to all 3 sets.

### 2.4 Data Imputation

First we will investigate which columns still have missing values.
At this point, all missing values in categorical variables should already have been replaced.

In [36]:
#For ease of use, we copy X_train into df so that the code is easier to reuse afterwards
df = X_train.copy()

# def check_col_nas_type(df: pd.DataFrame):
#     type_cols = []
#     #Verifying the type of columns with missing values
#     for c in df.columns[df.isna().any()].tolist():
#         if ~np.isin(df[c].dtype, type_cols):
#             type_cols.append(df[c].dtype)
#     return(type_cols)

# check_col_nas_type(df)
# #This verifies that we only need to perform data imputation on numeric features

In [37]:
# #Loading visualization functions present in the functions.py file
# from functions import *

# #Visualizing distribution of all numeric variables
# histPlotAll(df)

# #Apart from HOUR_APPR_PROCESS_START, all numeric variables seem to be not normally distributed
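
The imputation function in the next cell chooses between mean and median imputation with a Kolmogorov-Smirnov (KS) test on the standardized column. The sketch below isolates that check on synthetic data; the helper name `ks_normality_pvalue` and both samples are made up for illustration. Two caveats apply. Estimating the mean and standard deviation from the same sample makes the plain KS p-value only approximate. With hundreds of thousands of rows, even small departures from normality tend to give p < 0.05, so the median branch is likely to be chosen for most columns.

```python
import numpy as np
from scipy import stats

def ks_normality_pvalue(values: np.ndarray) -> float:
    """Standardize the non-missing values and return the p-value of a KS test against N(0, 1)."""
    clean = values[~np.isnan(values)]
    standardized = (clean - clean.mean()) / clean.std()
    return stats.kstest(standardized, "norm").pvalue

rng = np.random.default_rng(0)
roughly_normal = rng.normal(loc=12, scale=3, size=2000)
skewed = rng.exponential(scale=2.0, size=2000)

p_normal = ks_normality_pvalue(roughly_normal)
p_skewed = ks_normality_pvalue(skewed)
print(f"normal sample: p = {p_normal:.3f}, skewed sample: p = {p_skewed:.3g}")
```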

In [38]:
from scipy import stats

#Defining a data imputation function; we will use NAME_CONTRACT_TYPE as the category column

#This imputation strategy can be revisited during the hyperparameter tuning phase

def numeric_data_imputation(df: pd.DataFrame, category_column: str):
    
    #Creating a copy of our dataset
    df_imput = df.copy()
    #Creating a list of columns with missing values
    missing_cols = df.columns[df.isna().any()].tolist()
    max_unique_values = 3
    
    #Iterating over columns with missing data
    for c in missing_cols:
        
        #Verifying that we are in a numeric column
        if np.issubdtype(df[c].dtype,np.number):
            
            #If there are at most max_unique_values unique values (NaN counts as one of them), we use mode imputation
            if len(df[c].unique()) <= max_unique_values:
                            
                #We will create a subset from our categorical variable and perform mode imputation
                for t in df[category_column].unique():
                    #Creating subset
                    subset = df.loc[df[category_column] == t]
                    
                    #Calculating mode of subset
                    mode = subset[c].mode().values[0]
                    
                    #Applying imputation
                    df.loc[(df[c].isna()) & (df[category_column] == t), c] = mode
                    
            #Otherwise, we use the p-value of a Kolmogorov-Smirnov test to check for normality
            else:

                #Standardizing the column (zero mean, unit variance)
                norm = c + '_norm'
                df_imput[norm] = (df_imput[c] - np.mean(df_imput[c].dropna())) / np.std(df_imput[c].dropna())

                #Calculating the p-value of the KS test against the standard normal distribution
                pval = stats.kstest(df_imput[norm].dropna(), 'norm').pvalue

                if pval >= 0.05:
                    #The p-value is at least 0.05: we cannot reject the null hypothesis, so we treat the variable
                    #as approximately normally distributed
                    #We will use mean imputation on that variable
                    for t in df[category_column].unique():
                        #Creating subset
                        subset = df.loc[df[category_column] == t]

                        #Calculating the mean of the current column within that subset
                        mean = subset[c].mean()

                        #Applying imputation
                        df.loc[(df[c].isna()) & (df[category_column] == t), c] = mean
                else:
                    #The p-value is below 0.05: we reject the null hypothesis and conclude that the variable
                    #is not normally distributed
                    #We will use median imputation on that variable
                    for t in df[category_column].unique():
                        #Creating subset
                        subset = df.loc[df[category_column] == t]

                        #Calculating the median of the current column within that subset
                        med = subset[c].median()

                        #Applying imputation
                        df.loc[(df[c].isna()) & (df[category_column] == t), c] = med
    return None

#Applying the function to our 3 sets (X_train has been renamed to df)
numeric_data_imputation(df, 'NAME_CONTRACT_TYPE')
numeric_data_imputation(X_test, 'NAME_CONTRACT_TYPE')
numeric_data_imputation(X_val, 'NAME_CONTRACT_TYPE')

#Checking for nulls in our 3 sets
# for data in [df,X_test,X_val]:
#     print(np.count_nonzero(data.isnull()))
    
#We have no more NA values in all 3 sets
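
The group-wise imputation performed by the loops above can also be written with `groupby(...).transform(...)`. The following is a minimal sketch on a toy frame; the helper `impute_by_group` and the columns `contract` and `amount` are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

def impute_by_group(frame, column, group_column, statistic="median"):
    """Fill NaNs in `column` with the per-group mean or median (hypothetical helper)."""
    # transform() returns a Series aligned with `frame`,
    # holding the statistic of the group each row belongs to
    fill_values = frame.groupby(group_column)[column].transform(statistic)
    return frame[column].fillna(fill_values)

toy = pd.DataFrame({
    "contract": ["Cash", "Cash", "Cash", "Revolving", "Revolving"],
    "amount": [100.0, 300.0, np.nan, 10.0, np.nan],
})
toy["amount"] = impute_by_group(toy, "amount", "contract", statistic="median")
print(toy)
```

Passing `statistic="mean"` instead would reproduce the branch used for variables treated as normally distributed.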

Now that we have 3 complete datasets, we can perform **feature engineering**.

## 3. Feature Engineering

We will begin by encoding cyclical features.

### 3.1 Encoding Cyclical Features

We have 2 columns with time features that are cyclical in nature but coded with numbers.

- WEEKDAY_APPR_PROCESS_START
- HOUR_APPR_PROCESS_START

To increase the performance of our algorithm, we will apply a cyclical (sine/cosine) encoding to better represent their cyclical nature:

In [39]:
def encode_cyclical_vars(df: pd.DataFrame, cyclical_vars: list):
    for c in cyclical_vars:
        #Calculating the number of unique values, used as the cycle length
        #(this assumes every value of the cycle appears in the dataframe)
        n = len(df[c].unique())
        #Defining variable names
        cos_var = c + '_cos'
        sin_var = c + '_sin'
        #Calculating cyclical encoder variables
        df[sin_var] = np.sin(df[c] * (2*np.pi/n))
        df[cos_var] = np.cos(df[c] * (2*np.pi/n))
        #Dropping the base columns
        df.drop(columns=[c], inplace=True)

cyclical_vars = ["WEEKDAY_APPR_PROCESS_START", "HOUR_APPR_PROCESS_START"]
encode_cyclical_vars(df, cyclical_vars)
encode_cyclical_vars(X_test, cyclical_vars)
encode_cyclical_vars(X_val, cyclical_vars)

df.head()

SK_ID_CURR,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,NAME_TYPE_SUITE,NAME_INCOME_TYPE,NAME_EDUCATION_TYPE,NAME_FAMILY_STATUS,NAME_HOUSING_TYPE,REGION_POPULATION_RELATIVE,DAYS_BIRTH,DAYS_EMPLOYED,DAYS_REGISTRATION,DAYS_ID_PUBLISH,FLAG_MOBIL,FLAG_EMP_PHONE,FLAG_WORK_PHONE,FLAG_CONT_MOBILE,FLAG_PHONE,FLAG_EMAIL,OCCUPATION_TYPE,CNT_FAM_MEMBERS,REGION_RATING_CLIENT,REGION_RATING_CLIENT_W_CITY,REG_REGION_NOT_LIVE_REGION,REG_REGION_NOT_WORK_REGION,LIVE_REGION_NOT_WORK_REGION,REG_CITY_NOT_LIVE_CITY,REG_CITY_NOT_WORK_CITY,LIVE_CITY_NOT_WORK_CITY,ORGANIZATION_TYPE,EXT_SOURCE_2,EXT_SOURCE_3,YEARS_BEGINEXPLUATATION_AVG,FLOORSMAX_AVG,YEARS_BEGINEXPLUATATION_MODE,FLOORSMAX_MODE,YEARS_BEGINEXPLUATATION_MEDI,FLOORSMAX_MEDI,TOTALAREA_MODE,EMERGENCYSTATE_MODE,OBS_30_CNT_SOCIAL_CIRCLE,DEF_30_CNT_SOCIAL_CIRCLE,OBS_60_CNT_SOCIAL_CIRCLE,DEF_60_CNT_SOCIAL_CIRCLE,DAYS_LAST_PHONE_CHANGE,FLAG_DOCUMENT_2,FLAG_DOCUMENT_3,FLAG_DOCUMENT_4,FLAG_DOCUMENT_5,FLAG_DOCUMENT_6,FLAG_DOCUMENT_7,FLAG_DOCUMENT_8,FLAG_DOCUMENT_9,FLAG_DOCUMENT_10,FLAG_DOCUMENT_11,FLAG_DOCUMENT_12,FLAG_DOCUMENT_13,FLAG_DOCUMENT_14,FLAG_DOCUMENT_15,FLAG_DOCUMENT_16,FLAG_DOCUMENT_17,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR,AGE_LABEL,DAYS_EMPLOYED_PCT,INCOME_CREDIT_PCT,INCOME_PER_PERSON,ANNUITY_INCOME_PCT,PAYMENT_RATE,WEEKDAY_APPR_PROCESS_START_sin,WEEKDAY_APPR_PROCESS_START_cos,HOUR_APPR_PROCESS_START_sin,HOUR_APPR_PROCESS_START_cos
369727,Cash loans,F,N,Y,0,292500.0,970380.0,25726.5,810000.0,Unaccompanied,Commercial associate,Higher education,Civil marriage,House / apartment,0.02461,20291,951.0,1544.0,1054,1,1,1,1,1,0,Laborers,2.0,2,2,0,0,0,0,0,0,Transport: type 4,0.65381,0.103449,0.9925,0.1667,0.9926,0.1667,0.9925,0.1667,0.0715,No,0.0,0.0,0.0,0.0,-1026.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0.0,0.0,0.0,0.0,0.0,2.0,4,0.046868,0.301428,146250.0,0.087954,0.026512,-0.974928,-0.222521,0.258819,-0.965926
271823,Cash loans,F,N,Y,0,202500.0,904500.0,36000.0,904500.0,Unaccompanied,Working,Secondary / secondary special,Separated,House / apartment,0.025164,16735,3442.0,6006.0,267,1,1,1,1,0,0,Laborers,1.0,2,2,0,0,0,0,0,0,Industry: type 1,0.159688,0.522697,0.0,0.0417,0.0005,0.0417,0.0,0.0417,0.0166,No,1.0,1.0,1.0,1.0,-1752.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0,3,0.205677,0.223881,202500.0,0.177778,0.039801,-0.433884,-0.900969,0.258819,-0.965926
136775,Cash loans,F,N,Y,0,270000.0,1035000.0,30393.0,1035000.0,Unaccompanied,Commercial associate,Incomplete higher,Married,House / apartment,0.02461,16141,188.0,283.0,4636,1,1,0,1,0,0,UKN,2.0,2,2,0,0,0,0,0,0,Agriculture,0.70462,0.58674,0.9816,0.1667,0.9816,0.1667,0.9816,0.1667,0.0685,UKN,0.0,0.0,0.0,0.0,-1036.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,1.0,1.0,3,0.011647,0.26087,135000.0,0.112567,0.029365,-0.433884,-0.900969,-0.8660254,-0.5
329676,Cash loans,M,N,N,0,112500.0,144000.0,9450.0,144000.0,Unaccompanied,Commercial associate,Secondary / secondary special,Single / not married,Rented apartment,0.01452,15108,1096.0,6018.0,4210,1,1,0,1,0,0,Core staff,1.0,2,2,0,0,0,1,1,0,Security,0.344191,0.526295,0.9846,0.0554,0.9816,0.0417,0.9816,0.0417,0.0126,No,0.0,0.0,0.0,0.0,-737.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,2.0,1.0,3,0.072544,0.78125,112500.0,0.084,0.065625,0.0,1.0,0.7071068,-0.707107
192999,Cash loans,F,Y,Y,0,198000.0,704844.0,34038.0,630000.0,Unaccompanied,Commercial associate,Secondary / secondary special,Married,House / apartment,0.011703,19811,1714.0,140.0,3339,1,1,0,1,0,0,Laborers,2.0,2,2,0,0,0,0,1,1,Other,0.617299,0.735221,0.9816,0.1667,0.9816,0.1667,0.9816,0.1667,0.0685,UKN,4.0,0.0,4.0,0.0,0.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,4.0,4,0.086518,0.280913,99000.0,0.171909,0.048292,0.974928,-0.222521,1.224647e-16,-1.0
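
To illustrate why the sine/cosine pair represents a cycle better than the raw number, here is a minimal sketch that compares distances between encoded hours. The helper `cyclical_encode` is hypothetical; unlike the notebook function, it takes the cycle length as an explicit `period` argument (an assumption made here for clarity).

```python
import numpy as np

def cyclical_encode(values, period):
    """Map values on a cycle of length `period` to (sin, cos) coordinates."""
    angle = 2 * np.pi * np.asarray(values, dtype=float) / period
    return np.sin(angle), np.cos(angle)

hours = np.array([0, 1, 12, 23])
sin_h, cos_h = cyclical_encode(hours, period=24)
points = np.column_stack([sin_h, cos_h])

# Euclidean distances between encoded hours
dist_1_0 = np.linalg.norm(points[1] - points[0])    # 01:00 vs 00:00
dist_12_0 = np.linalg.norm(points[2] - points[0])   # 12:00 vs 00:00
dist_23_0 = np.linalg.norm(points[3] - points[0])   # 23:00 vs 00:00
print(dist_1_0, dist_23_0, dist_12_0)
```

On the raw scale, 23 and 0 are 23 units apart; after encoding, 23:00 is as close to midnight as 01:00 is, while noon is the farthest point on the circle.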


### 3.2 Encoding categorical variables

Since our algorithms are only able to use numeric variables, we will need to **encode categorical variables**.

For variables with a small number of categories, we will perform **One-Hot Encoding**.

If there are 10 or more categories, we will perform **Weight of Evidence (WoE) encoding** instead to avoid a sharp increase in the dimensionality of our dataset.

In [40]:
from category_encoders import WOEEncoder
from category_encoders.one_hot import OneHotEncoder


def encode_cat_vars(df: pd.DataFrame, X_train: pd.DataFrame, y_train, max_categ: int):
    woe_cols = []
    ohe_cols = []
    for c in X_train.columns:
        
        #Keeping only categorical columns
        if not np.issubdtype(X_train[c].dtype,np.number):
            
            #If the column has at least max_categ categories, marking it for WOE encoding
            if len(X_train[c].unique()) >= max_categ:
                woe_cols.append(c)
            
            else: 
                #Otherwise, marking it for one-hot encoding
                ohe_cols.append(c)
    
    #Defining the WOE encoder and fitting it to the TRAIN dataset
    woe_encoder = WOEEncoder(cols = woe_cols, return_df=True).fit(X_train, y_train)
    X_train_encoded = woe_encoder.transform(X_train)
    #Applying the fitted WOE encoder to the selected dataframe
    df = woe_encoder.transform(df)
    
    #Fitting the one-hot encoder on the TRAIN dataset and applying it to the selected dataframe
    ohe_encoder = OneHotEncoder(cols=ohe_cols, return_df= True).fit(X_train_encoded)
    df = ohe_encoder.transform(df)
    
    
    del X_train_encoded
    return df

#Just a reminder that once again df = X_train
#We apply this function to our 3 sets; df is encoded last because the encoders are fitted on its unencoded version
X_test = encode_cat_vars(X_test, df, y_train, 10)
X_val = encode_cat_vars(X_val, df, y_train, 10)
df = encode_cat_vars(df, df, y_train, 10)
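
As an illustration of what Weight of Evidence encoding computes, here is a minimal sketch of the unsmoothed textbook definition: for each category, the log of its share of positives divided by its share of negatives. The helper `weight_of_evidence` and the toy data are hypothetical, and the values will not match `WOEEncoder` exactly, since library encoders typically add smoothing to avoid divisions by zero.

```python
import numpy as np
import pandas as pd

def weight_of_evidence(categories, target):
    """Unsmoothed WoE per category: ln(share of positives / share of negatives)."""
    frame = pd.DataFrame({"cat": categories, "y": target})
    counts = frame.groupby("cat")["y"].agg(pos="sum", total="count")
    counts["neg"] = counts["total"] - counts["pos"]
    # Share of all positives (resp. negatives) that fall in each category
    pos_share = counts["pos"] / counts["pos"].sum()
    neg_share = counts["neg"] / counts["neg"].sum()
    return np.log(pos_share / neg_share)

cats = ["A", "A", "A", "A", "B", "B", "B", "B"]
y = [1, 1, 1, 0, 1, 0, 0, 0]
woe = weight_of_evidence(cats, y)
print(woe)
```

Each category is replaced by a single number, so a column with many categories still yields one encoded column rather than one column per category.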

In [41]:
def check_dtypes(df: pd.DataFrame):
    type_cols = []
    for c in df.columns:
        if not np.isin(df[c].dtype, type_cols):
            type_cols.append(df[c].dtype)
    print(type_cols)

check_dtypes(df)
check_dtypes(X_test)
check_dtypes(X_val)

[dtype('int64'), dtype('float64')]
[dtype('int64'), dtype('float64')]
[dtype('int64'), dtype('float64')]


In [42]:
print(df.shape, X_test.shape, X_val.shape)

(244970, 119) (30622, 119) (30622, 119)


We have verified that all of our 3 sets are composed only of numeric features and that they have the same number of columns.

We will now use **additional features from other dataframes** to increase the performance of our models.

### 3.3 Using previous application data

In [43]:
prev_app = pd.read_csv("Data/previous_application.csv", sep=",")

prev_app.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1670214 entries, 0 to 1670213
Data columns (total 37 columns):
 #   Column                       Non-Null Count    Dtype  
---  ------                       --------------    -----  
 0   SK_ID_PREV                   1670214 non-null  int64  
 1   SK_ID_CURR                   1670214 non-null  int64  
 2   NAME_CONTRACT_TYPE           1670214 non-null  object 
 3   AMT_ANNUITY                  1297979 non-null  float64
 4   AMT_APPLICATION              1670214 non-null  float64
 5   AMT_CREDIT                   1670213 non-null  float64
 6   AMT_DOWN_PAYMENT             774370 non-null   float64
 7   AMT_GOODS_PRICE              1284699 non-null  float64
 8   WEEKDAY_APPR_PROCESS_START   1670214 non-null  object 
 9   HOUR_APPR_PROCESS_START      1670214 non-null  int64  
 10  FLAG_LAST_APPL_PER_CONTRACT  1670214 non-null  object 
 11  NFLAG_LAST_APPL_IN_DAY       1670214 non-null  int64  
 12  RATE_DOWN_PAYMENT            774370 non-nu

In [9]:
# prev_app.describe()

In [10]:
#We are interested in DAYS_LAST_DUE (the number of days the borrower has to pay previous applications)
#But there are illogical values (365243 days, which is about 1000 years)
#First we'll replace all these values by nan
prev_app.loc[prev_app.DAYS_LAST_DUE > 300000, "DAYS_LAST_DUE"] = np.nan
prev_app.loc[prev_app.DAYS_FIRST_DUE > 300000, "DAYS_FIRST_DUE"] = np.nan
prev_app.loc[prev_app.DAYS_LAST_DUE_1ST_VERSION > 300000, "DAYS_LAST_DUE_1ST_VERSION"] = np.nan
prev_app.loc[prev_app.DAYS_FIRST_DRAWING > 300000, "DAYS_FIRST_DRAWING"] = np.nan
prev_app.loc[prev_app.DAYS_TERMINATION > 300000, "DAYS_TERMINATION"] = np.nan

#Defining current amount due, we have to add a negative sign because DAYS_LAST_DUE is negative
prev_app["curr_amount_due"] = -prev_app["AMT_ANNUITY"]*prev_app["DAYS_LAST_DUE"]/365

prev_app["curr_annuity"] = 0
prev_app.loc[prev_app["DAYS_LAST_DUE"] < 0, "curr_annuity"] = prev_app["AMT_ANNUITY"]

# Calculated variables
prev_app['APPLICATION_CREDIT_DIF'] = prev_app['AMT_APPLICATION'] - prev_app['AMT_CREDIT']
prev_app['CREDIT_TO_ANNUITY'] = prev_app['AMT_CREDIT'] / prev_app['AMT_ANNUITY']
prev_app['DOWN_PAYMENT_TO_CREDIT'] = prev_app['AMT_DOWN_PAYMENT'] / prev_app['AMT_CREDIT']

prev_app.head()

In [11]:
#Verifying unique values of contract status
prev_app.NAME_CONTRACT_STATUS.unique()
#4 categories, Approved, Refused, Canceled and Unused offer

prev_app["AMT_GRANTED"] = 0
prev_app.loc[prev_app["NAME_CONTRACT_STATUS"] == "Approved", "AMT_GRANTED"] = prev_app["AMT_CREDIT"]

prev_app.head()

In [12]:
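#NOTE: this dictionary is not used by the aggregation below, and several of its keys
#(e.g. APP_CREDIT_PERC, PREV_GOODS_DIFF, SIMPLE_INTERESTS) are not columns created in the cells above
aggregations = {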
        'AMT_ANNUITY': ['std', 'mean', 'sum'],
        'AMT_APPLICATION': ['std', 'mean', 'sum'],
        'AMT_CREDIT': ['std', 'mean', 'sum'],
        'APP_CREDIT_PERC': ['std', 'mean'],
        'AMT_DOWN_PAYMENT': ['std', 'mean', 'sum'],
        'AMT_GOODS_PRICE': ['std', 'mean', 'sum'],
        'HOUR_APPR_PROCESS_START': ['std', 'mean'],
        'RATE_DOWN_PAYMENT': ['min', 'max', 'mean'],
        'DAYS_DECISION': ['std', 'mean', 'sum'],
        'CNT_PAYMENT': ['mean', 'sum','std'],
        'SK_ID_PREV': ['nunique'],
        'DAYS_TERMINATION': ['mean', 'sum', 'std'],
        'CREDIT_TO_ANNUITY_RATIO': ['mean', 'std'],
        'APPLICATION_CREDIT_DIFF': ['mean', 'std'],
        'DOWN_PAYMENT_TO_CREDIT': ['sum', 'mean', 'std'],
        'PREV_GOODS_DIFF': ['sum', 'mean', 'std'],
        'PREV_GOODS_APPL_RATIO': ['mean', 'std'],
        'DAYS_LAST_DUE_DIFF': ['sum', 'mean', 'std'],
        'SIMPLE_INTERESTS': ['sum', 'mean', 'std']
    }

#We will aggregate by SK_ID_CURR and retrieve important information about previous applications :
prev_app_numbers = prev_app.groupby("SK_ID_CURR").agg(curr_amount_due = ('curr_amount_due', 'sum'), n_prev_app = ('SK_ID_PREV', 'count'),
                                  curr_annuity_due = ('curr_annuity', 'sum'), amount_requested = ('AMT_APPLICATION', 'sum'),
                                  amount_granted = ('AMT_GRANTED', 'sum'))

prev_app_numbers.head()

In [13]:
# #Creating a dataframe with the number of each different name contract status by SK_ID_CURR
# prev_app_status = pd.crosstab(prev_app['SK_ID_CURR'], prev_app['NAME_CONTRACT_STATUS'])

# prev_app_status.columns = ["n_prev_app_approved","n_prev_app_canceled","n_prev_app_refused","n_prev_app_unused"]
# prev_app_status.head()

In [14]:
# prev_app_df = pd.merge(prev_app_numbers, prev_app_status, how='inner', left_index=True, right_index=True)

# prev_app_df.head()

In [15]:
# #Saving prev_app_df to prevent RAM usage and reduce rerun time
# prev_app_df.to_csv("Data/prev_app_df.csv")

In [16]:
#Loading prev_app_df (SK_ID_CURR was saved as the first CSV column, so we restore it as the index)
prev_app_df = pd.read_csv("Data/prev_app_df.csv", index_col=0)

#Joining this new data and filling NAs with 0 (since it means there was no previous application)
df = pd.merge(df, prev_app_df, how='left', left_index=True, right_index=True).fillna(0)
X_test = pd.merge(X_test, prev_app_df, how='left', left_index=True, right_index=True).fillna(0)
X_val = pd.merge(X_val, prev_app_df, how='left', left_index=True, right_index=True).fillna(0)

X_val.head()
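
When a feature table is saved with `to_csv` and reloaded, whether its identifier comes back as the index depends on the `index_col` argument of `read_csv`, which matters for an index-on-index merge. Here is a minimal sketch using toy data and an in-memory buffer instead of a file; all values are hypothetical.

```python
import io
import pandas as pd

# Toy per-borrower features indexed by SK_ID_CURR
features = pd.DataFrame({"n_prev_app": [3, 1]},
                        index=pd.Index([100002, 100006], name="SK_ID_CURR"))

buffer = io.StringIO()
features.to_csv(buffer)  # the index is written as the first CSV column

# Without index_col, SK_ID_CURR comes back as an ordinary column
# and the index is a fresh 0..n-1 range
buffer.seek(0)
as_column = pd.read_csv(buffer)

# With index_col=0, the identifier is restored as the index
buffer.seek(0)
as_index = pd.read_csv(buffer, index_col=0)

main = pd.DataFrame({"AMT_CREDIT": [406597.5, 312682.5]},
                    index=pd.Index([100002, 100006], name="SK_ID_CURR"))
merged = pd.merge(main, as_index, how="left",
                  left_index=True, right_index=True).fillna(0)
print(merged)
```

Merging `main` with `as_column` on the index would instead try to match borrower identifiers against the row numbers 0 and 1, and `fillna(0)` would silently hide the failed matches.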


### 3.4 Using Credit Bureau information

We also have Credit Bureau (CB) information for each borrower, which we can use to increase the accuracy of our model:

In [None]:
# bureau = pd.read_csv("Data/bureau.csv", sep=",")

# bureau.info()

In [None]:
# bureau.describe()

In [None]:
# print(bureau.CREDIT_ACTIVE.unique())
# print(bureau.CREDIT_CURRENCY.unique())

# len(bureau[bureau.CREDIT_CURRENCY.isna()])
# #Credit active is interesting because of the bad debt field
# #Currency is also interesting because it could be an indicator to fraudulent transactions

In [None]:
# #We will count the number of CB credits with each of these attributes :
# bureau_categ1 = pd.crosstab(bureau['SK_ID_CURR'], bureau['CREDIT_ACTIVE'])
# bureau_categ2 = pd.crosstab(bureau['SK_ID_CURR'], bureau['CREDIT_CURRENCY'])

# bureau_categ = pd.merge(bureau_categ1, bureau_categ2, how="outer", left_index=True, right_index=True)

# bureau_categ.columns = ['n_CB_active', 'n_CB_bad_debt', 'n_CB_closed', 'n_CB_sold',
#                         'n_CB_curr1','n_CB_curr2','n_CB_curr3','n_CB_curr4']

# bureau_categ.head()

In [None]:
# #We will now aggregate over SK_ID_CURR to calculate relevant numeric features

# bureau_num = bureau.groupby("SK_ID_CURR").agg(
#     total_days_CB_overdue = ('CREDIT_DAY_OVERDUE', 'sum'), max_CB_overdue = ('AMT_CREDIT_MAX_OVERDUE', 'max'),
#     avg_prolonged = ('CNT_CREDIT_PROLONG' , 'mean'), CB_credit = ('AMT_CREDIT_SUM','sum'),
#     CB_debt = ('AMT_CREDIT_SUM_DEBT','sum'), CB_credit_limit = ('AMT_CREDIT_SUM_LIMIT', 'sum'),
#     CB_overdue = ('AMT_CREDIT_SUM_OVERDUE','sum'), CB_total_annuity = ('AMT_ANNUITY', 'sum')
# )

# bureau_num.info()
# #We only have null values for max_CB_overdue. We will replace them by 0 because they probably mean there was never
# #an overdue amount

# bureau_num = bureau_num.fillna(0)
# bureau_num.head()

In [None]:
# #We now load the bureau_balance csv file
# bureau_balance = pd.read_csv("Data/bureau_balance.csv", sep=',')

# bureau_balance.STATUS.value_counts(normalize=True)

In [None]:
# #We create a crosstab to count the occurrences of each status type for each SK_ID_BUREAU
# bureau_balance_stats = pd.crosstab(bureau_balance['SK_ID_BUREAU'], bureau_balance['STATUS'])

# bureau_balance_stats.head()

In [None]:
# #Renaming the columns for better clarity
# bureau_balance_stats.columns=["cb_dpd_0","cb_dpd_1","cb_dpd_2","cb_dpd_3","cb_dpd_4","cb_dpd_5","cb_bal_closed","cb_bal_ukn"]
# bureau_balance_stats.info()

In [None]:
# #Joining with the main CB dataframe to retrieve SK_ID_CURR info
# bureau_num_bal = pd.merge(bureau_balance_stats, bureau[["SK_ID_BUREAU","SK_ID_CURR"]], how='inner', left_index=True, right_on='SK_ID_BUREAU')

# #Aggregating by SK_ID_CURR
# bureau_num_bal = bureau_num_bal.groupby("SK_ID_CURR").sum()

# #Dropping the now useless SK_ID_BUREAU column
# bureau_num_bal.drop(columns={"SK_ID_BUREAU"}, inplace=True)

# bureau_num_bal.head()

In [None]:
# bureau_num_bal.info()
# #We only have 134k different SK_ID, which is about 40% of our dataset. 
# #We will fill nulls with 0 because it means that the other SK_ID were not referenced at the Credit Bureau

In [None]:
# #Filling nulls with 0 as mentioned previously
# bureau_num_full = pd.merge(bureau_num, bureau_num_bal, how='outer', left_index=True, right_index=True).fillna(0)

# bureau_num_full.info()
# bureau_num_full.head()

In [None]:
# #Merging the 2 dataframes with bureau information
# bureau_df = pd.merge(bureau_categ, bureau_num_full, how='outer', left_index=True, right_index=True)

# print(bureau_df.isna().sum().sum())
# #No null values
# bureau_df.head()

In [None]:
#Saving bureau_df to reduce RAM usage
# bureau_df.to_csv("Data/bureau_df.csv")

In [None]:
#SK_ID_CURR was saved as the first CSV column, so we restore it as the index
bureau_df = pd.read_csv("Data/bureau_df.csv", index_col=0)

#Just as before, we join our CB information with our 3 dataframes and replace na values by 0
df = pd.merge(df, bureau_df, how='left', left_index=True, right_index=True).fillna(0)
X_test = pd.merge(X_test, bureau_df, how='left', left_index=True, right_index=True).fillna(0)
X_val = pd.merge(X_val, bureau_df, how='left', left_index=True, right_index=True).fillna(0)

X_val.head()

### 3.5 Using Cash balance information

In [None]:
# cash = pd.read_csv("Data/POS_CASH_balance.csv", sep=',')

# cash.head()

In [None]:
# #We will extract the days past due (DPD) information, which seems the most relevant
# #We will use the SK_DPD_DEF field because it ignores low-amount debts
# cash_df = cash[["SK_ID_CURR","SK_DPD_DEF"]].groupby("SK_ID_CURR").sum()

# cash_df.head()

In [None]:
# #Saving cash_df to csv to save RAM usage
# cash_df.to_csv("Data/cash_df.csv")

In [None]:
#Loading saved file
cash_df = pd.read_csv("Data/cash_df.csv", index_col=0)

#We join our cash information with our 3 dataframes and replace na values by 0
df = pd.merge(df, cash_df, how='left', left_index=True, right_index=True).fillna(0)
X_test = pd.merge(X_test, cash_df, how='left', left_index=True, right_index=True).fillna(0)
X_val = pd.merge(X_val, cash_df, how='left', left_index=True, right_index=True).fillna(0)

X_val.head()

### 3.6 Using CC Balance information

In [None]:
# cc = pd.read_csv("Data/credit_card_balance.csv",sep=",")

# cc.head()

In [None]:
# #Investigating possible months balance values
# cc.MONTHS_BALANCE.value_counts()

# cc_df = cc.groupby("SK_ID_CURR").agg(
#     total_cc_balance = ('AMT_BALANCE','sum'), cc_credit_lim = ('AMT_CREDIT_LIMIT_ACTUAL', 'sum'),
#     total_cc_drawings = ('AMT_DRAWINGS_CURRENT','sum'), monthly_cc_payment = ('AMT_PAYMENT_TOTAL_CURRENT','sum'),
#     total_cc_receivable = ('AMT_TOTAL_RECEIVABLE', 'sum'), n_cc_drawings = ('CNT_DRAWINGS_CURRENT','sum'),
#     cc_dpd = ('SK_DPD_DEF', 'sum'))
# cc_df.info()
# cc_df.head()

In [None]:
# #Saving cc_df to prevent high RAM usage
# cc_df.to_csv("Data/cc_df.csv")

In [None]:
#Loading saved value
cc_df = pd.read_csv("Data/cc_df.csv", index_col=0)

#We join our CC information with our 3 dataframes and replace na values by 0
df = pd.merge(df, cc_df, how='left', left_index=True, right_index=True).fillna(0)
X_test = pd.merge(X_test, cc_df, how='left', left_index=True, right_index=True).fillna(0)
X_val = pd.merge(X_val, cc_df, how='left', left_index=True, right_index=True).fillna(0)

X_val.head()

### 3.7 Using installment payments information

In [None]:
# install = pd.read_csv("Data/installments_payments.csv",sep=",")

# install.head()

In [None]:
# #Converting the DAYS columns into positive values
# install["DAYS_INSTALMENT"] = install["DAYS_INSTALMENT"].apply(lambda x: abs(x))
# install["DAYS_ENTRY_PAYMENT"] = install["DAYS_ENTRY_PAYMENT"].apply(lambda x: abs(x))

# install["install_delay"] = install["DAYS_ENTRY_PAYMENT"] - install["DAYS_INSTALMENT"]

# install.describe()

In [None]:
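# #We have both large negative and large positive delay values. Since both DAYS columns are now absolute values,
# #a positive install_delay means the payment was made before the due date and a negative one means it was late
# #We will now calculate the difference in percentage between AMT_INSTALMENT and AMT_PAYMENT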
# install["install_deficit_pct"] = (install["AMT_INSTALMENT"] - install["AMT_PAYMENT"])*100/install["AMT_INSTALMENT"]

# install.head()

In [None]:
# #We will now aggregate over SK_ID_CURR and take average values for our delay and deficit fields
# install_df = install[["SK_ID_CURR","install_delay","install_deficit_pct"]].groupby("SK_ID_CURR").mean()

# install_df.info()
# install_df.head()


In [None]:
#Saving install_df to a csv to prevent repetitive rerun of the program
#install_df.to_csv("Data/install_df.csv")

In [None]:
#We join our installment information with our 3 dataframes and replace na values by 0

install_df = pd.read_csv("Data/install_df.csv", index_col=0)
df = pd.merge(df, install_df, how='left', left_index=True, right_index=True).fillna(0)
X_test = pd.merge(X_test, install_df, how='left', left_index=True, right_index=True).fillna(0)
X_val = pd.merge(X_val, install_df, how='left', left_index=True, right_index=True).fillna(0)

X_val.head()

We've finished Feature Engineering! We will now summarize the preprocessing steps accomplished so far, so that we can quickly apply them to a new dataset.

## 4. Summarizing the Pre Processing steps

The **preprocessing function** assumes that there are **no missing values in the new application data** it receives.

We will also create a **data cleaning function that will only work if a significant number of applications** are fed into the system at the same time (to be able to perform data imputation efficiently).

In [None]:
#time is needed to convert weekday names into numbers in preprocess_application
import time

clean_cols = app.columns

def clean_application(application_request: pd.DataFrame):
    
    #Feature filtering (copying so that the assignments below do not act on a view of the input)
    df = application_request[clean_cols].copy()
    
    #Cleaning categorical variables
    df.loc[df["CODE_GENDER"] == 'XNA', "CODE_GENDER"] = df["CODE_GENDER"].mode().values[0]
    df.loc[df["EMERGENCYSTATE_MODE"].isna(),"EMERGENCYSTATE_MODE"] = 'UKN'
    df.loc[df["OCCUPATION_TYPE"].isna(),"OCCUPATION_TYPE"] = 'UKN'
    df.loc[df["NAME_TYPE_SUITE"].isna(), "NAME_TYPE_SUITE"] = df["NAME_TYPE_SUITE"].mode().values[0]
    
    #Data imputation (ONLY USE IF SIGNIFICANT AMOUNT OF APPLICATIONS GIVEN)
    numeric_data_imputation(df, 'NAME_CONTRACT_TYPE')
    
    return df
    

def preprocess_application(application_request: pd.DataFrame, X_train: pd.DataFrame, y_train):
    
    #Feature filtering (copying so that the assignments below do not act on a view of the input)
    df = application_request[clean_cols].copy()
    
    #Converting string weekday into number (Monday = 0)
    df["WEEKDAY_APPR_PROCESS_START"] = df["WEEKDAY_APPR_PROCESS_START"].apply(lambda x: time.strptime(x, '%A').tm_wday)
    
    #DAYS_BIRTH, DAYS_REGISTRATION and DAYS_ID_PUBLISH only have negative values
    df["DAYS_REGISTRATION"] = abs(df["DAYS_REGISTRATION"])
    df["DAYS_ID_PUBLISH"] = abs(df["DAYS_ID_PUBLISH"])
    df["DAYS_BIRTH"] = abs(df["DAYS_BIRTH"])

    #DAYS_EMPLOYED has aberrant values (365243 days, about 1000 years), which we replace by 0
    df.loc[df["DAYS_EMPLOYED"] > 100000, "DAYS_EMPLOYED"] = 0
    df["DAYS_EMPLOYED"] = abs(df["DAYS_EMPLOYED"])
    
    #Setting the correct index
    df.set_index('SK_ID_CURR', inplace=True)
    
    #Encoding cyclical variables
    encode_cyclical_vars(df, cyclical_vars)
    #Encoding categorical variables with encoders fitted on the train set
    df = encode_cat_vars(df, X_train, y_train, 10)
    
    #Adding features from other files
    df = pd.merge(df, prev_app_df, how='left', left_index=True, right_index=True).fillna(0)
    df = pd.merge(df, bureau_df, how='left', left_index=True, right_index=True).fillna(0)
    df = pd.merge(df, cash_df, how='left', left_index=True, right_index=True).fillna(0)
    df = pd.merge(df, cc_df, how='left', left_index=True, right_index=True).fillna(0)
    df = pd.merge(df, install_df, how='left', left_index=True, right_index=True).fillna(0)
    
    return df

## 5. Resampling our training dataset

As we've seen at the beginning of this part, our dataset is highly imbalanced: 92% of rows have TARGET = 0 and only 8% have TARGET = 1.

To reduce this imbalance, we will perform SMOTE oversampling on our minority class.

Of course, **oversampling will only be performed on our train set**.

In [None]:
#Importing imblearn to be able to apply SMOTE oversampling
from imblearn.over_sampling import SMOTE

#Instantiating the SMOTE algorithm with default parameters (apart from the random seed)
sm = SMOTE(random_state=12)

#Generating our resampled dataset
X_train_res, y_train_res = sm.fit_resample(df, y_train)

print(X_train_res.shape)
print(y_train_res.value_counts())
#We have successfully removed the imbalance from our dataset and equalized the number of observations for each class
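
To give an intuition of how SMOTE creates new minority observations, here is a simplified sketch of its interpolation idea: a synthetic point is drawn on the segment between a minority sample and one of its nearest minority neighbours. This is an illustration under simplifying assumptions, not the imblearn implementation; the helper `smote_like_sample` and the toy points are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(12)

def smote_like_sample(minority, k=2):
    """Create one synthetic point between a random minority sample and one of its k nearest neighbours."""
    i = rng.integers(len(minority))
    # Distances from the chosen sample to every minority sample
    distances = np.linalg.norm(minority - minority[i], axis=1)
    # Position 0 of the sorted order is the sample itself, so we skip it
    neighbours = np.argsort(distances)[1:k + 1]
    j = rng.choice(neighbours)
    lam = rng.random()  # interpolation factor in [0, 1)
    return minority[i] + lam * (minority[j] - minority[i])

minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = np.array([smote_like_sample(minority) for _ in range(5)])
print(synthetic)
```

Because every synthetic point is a convex combination of two real minority samples, it always stays within the range of the observed minority data.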

In [None]:
# #Renaming the resampled variables for ease of use
# X_train_initial = df.copy()
# y_train_initial = y_train

# X_train = X_train_res.copy()
# y_train = y_train_res

In [None]:
# #Deleting some variables to clear memory
# import sys
# def sizeof_fmt(num, suffix='B'):
#     for unit in ['','Ki','Mi','Gi','Ti','Pi','Ei','Zi']:
#         if abs(num) < 1024.0:
#             return "%3.1f %s%s" % (num, unit, suffix)
#         num /= 1024.0
#     return "%.1f %s%s" % (num, 'Yi', suffix)

# for name, size in sorted(((name, sys.getsizeof(value)) for name, value in locals().items()),
#                          key= lambda x: -x[1])[:20]:
#     print("{:>30}: {:>8}".format(name, sizeof_fmt(size)))

# del df, prev_app, bureau_balance, cash, install, cc, bureau, X

Now that we have resampled our dataset, we want to perform **feature selection** to reduce the number of features and prevent overfitting.

## 6. Feature Selection

### 6.1 Removing low variance features

In [None]:
X_train = X_train_res
y_train = y_train_res

In [None]:
# # Perform feature selection using a variance threshold
# from sklearn.feature_selection import VarianceThreshold

# #We use 0.02 as our variance threshold, but this is a hyperparameter that we will be able to optimize later
# sel = VarianceThreshold(threshold=(0.02))
# sel.fit(X_train)

# #Using our selector to remove columns from our 3 sets
# X_train = sel.transform(X_train)
# X_test = sel.transform(X_test)
# X_val = sel.transform(X_val)

In [None]:
# #Creating a list of the retained columns to preserve their names
# #get_support() returns a boolean mask (is the column kept or not)
# boolean_cols = sel.get_support()
# initial_cols = df.columns
# encoded_cols = [col for col, kept in zip(initial_cols, boolean_cols) if kept]

# #The selector has transformed our dataframes into np array, let's fix that
# X_train = pd.DataFrame(X_train, columns=encoded_cols)
# X_test = pd.DataFrame(X_test, columns=encoded_cols)
# X_val = pd.DataFrame(X_val, columns=encoded_cols)

In [None]:
X_train.shape
#We have deleted 49 columns out of the initial 156
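
To make the variance threshold concrete, here is a minimal NumPy sketch of the same idea: keep only the columns whose variance exceeds the threshold. The helper `variance_mask` and the toy matrix are hypothetical.

```python
import numpy as np

def variance_mask(X, threshold):
    """Boolean mask of the columns whose (population) variance exceeds `threshold`."""
    return np.var(X, axis=0) > threshold

X = np.array([
    [0.0, 1.0, 10.0],
    [0.0, 1.0, 20.0],
    [0.0, 1.1, 30.0],
    [0.0, 0.9, 40.0],
])
mask = variance_mask(X, threshold=0.02)
X_reduced = X[:, mask]
print(mask, X_reduced.shape)
```

Variance depends on the scale of each feature, so with an absolute threshold a feature expressed in small units (the second column here) can be dropped even though it is not constant.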

### 6.2 Removing highly correlated features

In [None]:
# # Function to list features that are correlated
# # Adds only one column of each correlated pair (the one that appears later in the dataframe)
# def correlatedFeatures(dataset, threshold):
#     correlated_columns = set()
#     correlations = dataset.corr()
#     for i in range(len(correlations)):
#         for j in range(i):
#             if abs(correlations.iloc[i,j]) > threshold:
#                 correlated_columns.add(correlations.columns[i])
#     return correlated_columns

# cf = correlatedFeatures(X_train, 0.85)
# cf

In [None]:
# #Removing our highly correlated features
# X_train = X_train.drop(cf, axis=1)
# X_test = X_test.drop(cf, axis=1)
# X_val = X_val.drop(cf, axis=1)

# print(X_train.shape, X_test.shape, X_val.shape)

### 6.3 Selecting best features

We will now use the SelectKBest algorithm to select the k best features.

In [None]:
# from sklearn.feature_selection import SelectKBest
# from sklearn.feature_selection import f_regression

# kbest = SelectKBest(score_func=f_regression, k=15)
# kbest.fit(X_train, y_train)

# print("Selected features:", list(X_train.columns[kbest.get_support()]))

In [None]:
# X_train_sel = kbest.transform(X_train)
# X_val_sel = kbest.transform(X_val)
# X_test_sel = kbest.transform(X_test)

In [None]:
# #Looking at the best number of features for Logistic Regression
# from sklearn.linear_model import LogisticRegression
# from sklearn.pipeline import Pipeline
# from sklearn.preprocessing import StandardScaler
# from sklearn.metrics import roc_auc_score
# from sklearn.metrics import precision_score

# scores = []
# for k in range(1,99):
#     #We need to scale the dataset before applying Logistic Regression because sklearn log_r includes L2 regularization
#     kbest = SelectKBest(score_func=f_regression, k=k)
#     pipe_lr = Pipeline([('scaler', StandardScaler()), ('kbest', kbest), ('log_r', LogisticRegression(max_iter = 1000))])
    
#     pipe_lr.fit(X_train, y_train)

#     train_predictions = pipe_lr.predict(X_train)
#     val_predictions = pipe_lr.predict(X_val)
#     roc_auc_train = roc_auc_score(y_train, train_predictions)
#     roc_auc_val = roc_auc_score(y_val, val_predictions)
#     mean_roc = (roc_auc_train + roc_auc_val)/2
#     preci_val = precision_score(y_val, val_predictions)
    
#     scores.append({'k': k, 'roc_train': roc_auc_train, 'roc_val': roc_auc_val,
#                    'mean_roc': mean_roc, 'precision': preci_val})

# scores = pd.DataFrame(scores)

# scores

In [None]:
# sns.lineplot(data=scores, x='k', y='roc_train', color='green')
# sns.lineplot(data=scores, x='k', y='roc_val', color='blue')
# ax = sns.lineplot(data=scores, x='k', y='mean_roc', color='red')
# ax.set_xlim(left=0, right=20)
# plt.show()

In [None]:
# sns.lineplot(data=scores, x='k', y='precision', color='red')
# plt.show()

Since we are **mostly interested in precision** (we do not want to avoid bad borrowers at any cost, since we still need to make money by accepting as many good borrowers as possible), we should not remove features.
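
As a reminder of what precision measures compared with recall, here is a minimal sketch that computes both from hypothetical labels, where 1 stands for a borrower with payment difficulties. The helper `precision_recall` is written for illustration only.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical labels: 3 bad borrowers out of 10 applications
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 1, 0, 0, 0, 0, 0]
precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)
```

Here 2 of the 4 flagged applications are truly bad (precision 0.5) and 2 of the 3 bad borrowers are caught (recall about 0.67); low precision means many good borrowers are turned away.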



## 7. Model training

In this phase, I will train and compare 2 different models :

- A **Logistic regression model**
- A **Support Vector Machine / Classifier**

We will include the scaling of our data in a pipeline within each model.

### 7.1 Selecting a Performance Metric

Our task is to detect as many "bad borrowers" as possible while limiting false positives (good borrowers flagged as bad), so that we do not lose too many clients.
It is hard to select the right metric without knowing 

### 7.2 Logistic Regression Model

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, f1_score
from sklearn.metrics import roc_auc_score

#We need to scale the dataset before applying Logistic Regression because sklearn log_r includes L2 regularization
pipe_lr = Pipeline([('scaler', StandardScaler()), 
                    ('log_r', LogisticRegression(max_iter = 3000))])

pipe_lr.fit(X_train, y_train)

train_predictions = pipe_lr.predict(X_train)
test_predictions = pipe_lr.predict(X_test)
#ROC AUC is computed from the predicted probability of the positive class, not from hard 0/1 labels
train_scores = pipe_lr.predict_proba(X_train)[:, 1]
test_scores = pipe_lr.predict_proba(X_test)[:, 1]

print("Logistic Regression results")
print("TRAIN:")
print(classification_report(y_train, train_predictions))
print("ROC AUC train : {:.2f}".format(roc_auc_score(y_train, train_scores)))
print("----------------------")
print("TEST:")
print(classification_report(y_test, test_predictions))
print("ROC AUC test : {:.2f}".format(roc_auc_score(y_test, test_scores)))
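
ROC AUC measures how well a model ranks borrowers, so it is normally computed from continuous scores rather than from 0/1 predictions. The sketch below illustrates the difference on a synthetic, imbalanced dataset built with `make_classification`. The data, the `demo_` names and the 90/10 class split are assumptions for illustration only, not the project's data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real dataset: roughly 10% positives (assumption)
X_demo, y_demo = make_classification(n_samples=2000, n_features=10,
                                     weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          stratify=y_demo, random_state=0)

demo_pipe = Pipeline([('scaler', StandardScaler()),
                      ('log_r', LogisticRegression(max_iter=1000))])
demo_pipe.fit(X_tr, y_tr)

# Hard labels collapse the ROC curve to a single threshold (0.5 by default)
auc_from_labels = roc_auc_score(y_te, demo_pipe.predict(X_te))
# Probabilities of the positive class keep the full ranking information
auc_from_proba = roc_auc_score(y_te, demo_pipe.predict_proba(X_te)[:, 1])

print("ROC AUC from hard labels   : {:.2f}".format(auc_from_labels))
print("ROC AUC from probabilities : {:.2f}".format(auc_from_proba))
```

On imbalanced data the two values can differ noticeably, because a model that rarely predicts the positive class at the default threshold may still rank borrowers well.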

In [None]:
print(X_test.shape, y_test.shape)

### 6.2 Support Vector Classification (Linear)

In [None]:
from sklearn.svm import LinearSVC

pipe_lsvc = Pipeline([('scaler', StandardScaler()), ('svc_l', LinearSVC())])

pipe_lsvc.fit(X_train, y_train)

train_predictions = pipe_lsvc.predict(X_train)
test_predictions = pipe_lsvc.predict(X_test)

print("Linear SVC results")
print("TRAIN:")
print(classification_report(y_train, train_predictions))
print("----------------------")
print("TEST:")
print(classification_report(y_test, test_predictions))

### 6.3 KNeighbors Classifier

In [None]:
from sklearn.neighbors import KNeighborsClassifier

pipe_knc = Pipeline([('scaler', StandardScaler()), ('knc', KNeighborsClassifier())])

pipe_knc.fit(X_train, y_train)

train_predictions = pipe_knc.predict(X_train)
test_predictions = pipe_knc.predict(X_test)

print("KNeighbors Classifier results")
print("TRAIN:")
print(classification_report(y_train, train_predictions))
print("----------------------")
print("TEST:")
print(classification_report(y_test, test_predictions))

### 6.4 Support Vector Classification

In [None]:
from sklearn.svm import SVC

#pipe_svc = Pipeline([('scaler', StandardScaler()), ('svc', SVC(verbose=2))])

#pipe_svc.fit(X_train, y_train)

#train_predictions = pipe_svc.predict(X_train)
#test_predictions = pipe_svc.predict(X_test)

print("SVC results")
print("TRAIN:")
#print(classification_report(y_train, train_predictions))
print("----------------------")
print("TEST:")
#print(classification_report(y_test, test_predictions))

### 6.5 Ensemble Gradient Boosting Classifier

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

pipe_gbc = Pipeline([('scaler', StandardScaler()), ('EGBC', GradientBoostingClassifier(verbose=3))])

pipe_gbc.fit(X_train, y_train)

train_predictions = pipe_gbc.predict(X_train)
test_predictions = pipe_gbc.predict(X_test)

print("Gradient Boosting Classifier results")
print("TRAIN:")
print(classification_report(y_train, train_predictions))
print("----------------------")
print("TEST:")
print(classification_report(y_test, test_predictions))
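
The cells in this section repeat the same fit / predict / report steps for each model. One possible way to compare the models side by side is a small helper that loops over a dictionary of estimators. This is only a sketch: `compare_models` and the `demo_` names are hypothetical, and the data is a synthetic stand-in generated with `make_classification`, not the project's `X_train` / `X_test`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC


def compare_models(models, X_tr, y_tr, X_te, y_te):
    """Fit each estimator behind a StandardScaler and collect test-set scores."""
    results = {}
    for name, estimator in models.items():
        pipe = Pipeline([('scaler', StandardScaler()), ('model', estimator)])
        pipe.fit(X_tr, y_tr)
        predictions = pipe.predict(X_te)
        results[name] = {
            # zero_division=0 avoids a warning when a model predicts no positives
            'precision': precision_score(y_te, predictions, zero_division=0),
            'recall': recall_score(y_te, predictions, zero_division=0),
        }
    return results


# Synthetic, imbalanced stand-in for the real train/test split (assumption)
X_demo, y_demo = make_classification(n_samples=1000, n_features=10,
                                     weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          stratify=y_demo, random_state=0)

demo_models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Linear SVC': LinearSVC(max_iter=5000),
    'KNeighbors': KNeighborsClassifier(),
}

demo_results = compare_models(demo_models, X_tr, y_tr, X_te, y_te)
for name, scores in demo_results.items():
    print("{} : precision {:.2f}, recall {:.2f}".format(
        name, scores['precision'], scores['recall']))
```

Keeping the scaler inside the pipeline means it is fitted on the training data only, so every model is evaluated under the same conditions.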