# Developping a Borrower Scoring Algorithm

Last updated : September 25th, 2022

## Introduction

During this project, I will use a dataset provided by a consumer finance companies to develop a machine learning algorithm that will predict if the borrower will have payment difficulties or not.

## 1. Data Loading and Filtering

First we will load the necessary packages and dataset and then we will carry on with the Cleaning and Analysis.

### 1.1 Loading our packages

We will import the necessary packages to run this project: matplotlib, numpy, pandas, seaborn.
Since I am running the project on Windows, I will also use sklearnex to increase the speed of sklearn.

In [40]:
#Importing packages
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
#Setting large figure size for Seaborn
sns.set(rc={'figure.figsize':(11.7,8.27),"font.size":20,"axes.titlesize":20,"axes.labelsize":18})

#Importing Intel extension for sklearn to improve speed
from sklearnex import patch_sklearn
patch_sklearn()

Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)


### 1.2 Loading the dataset

We will now load the dataset

In [51]:
app_test = pd.read_csv("Data/application_test.csv", sep=",")
app = pd.read_csv("Data/application_train.csv", sep=",")

app.head()

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
1,100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
2,100004,0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3,100006,0,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,...,0,0,0,0,,,,,,
4,100007,0,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0


### 1.3 Filtering our variables

We will begin by removing variables that have more than 50% na values :

In [52]:
#Increasing maximum number of info rows 
pd.options.display.max_info_columns = 130

#First we will define a function that drops columns that are null in more than x% of our database
def drop_na_columns(df: pd.DataFrame, percent: float):
    n = len(df)
    cutoff = n*percent/100
    for c in df.columns:
        if len(df[c].dropna()) < cutoff:
            df.drop(columns={c}, inplace=True)

#Dropping columns with less than 50% complete fields
drop_na_columns(app, 50)

len(app.columns)

app.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Data columns (total 81 columns):
 #   Column                        Non-Null Count   Dtype  
---  ------                        --------------   -----  
 0   SK_ID_CURR                    307511 non-null  int64  
 1   TARGET                        307511 non-null  int64  
 2   NAME_CONTRACT_TYPE            307511 non-null  object 
 3   CODE_GENDER                   307511 non-null  object 
 4   FLAG_OWN_CAR                  307511 non-null  object 
 5   FLAG_OWN_REALTY               307511 non-null  object 
 6   CNT_CHILDREN                  307511 non-null  int64  
 7   AMT_INCOME_TOTAL              307511 non-null  float64
 8   AMT_CREDIT                    307511 non-null  float64
 9   AMT_ANNUITY                   307499 non-null  float64
 10  AMT_GOODS_PRICE               307233 non-null  float64
 11  NAME_TYPE_SUITE               306219 non-null  object 
 12  NAME_INCOME_TYPE              307511 non-nul

In [54]:
#Counting the number of target vs not target variables:
app["TARGET"].value_counts(normalize=True)

#We have a significant difference in the number of data for both cases

0    0.919271
1    0.080729
Name: TARGET, dtype: float64

## 2. Data Cleaning

We will now clean our dataset.

### 2.1 Cleaning categorical variables

We will begin the cleaning process by cleaning categorical variables.

In [53]:
#Looking at unique valeus of categorical variables
def investigate_categories(df: pd.DataFrame):
    for c in df.columns:
        if df[c].dtype == 'object':
            print("Column",c)
            print("Unique values: {}".format(df[c].unique()))
            print("")
            print("-----------------------------------")
            
investigate_categories(app)

Column NAME_CONTRACT_TYPE
Unique values: ['Cash loans' 'Revolving loans']

-----------------------------------
Column CODE_GENDER
Unique values: ['M' 'F' 'XNA']

-----------------------------------
Column FLAG_OWN_CAR
Unique values: ['N' 'Y']

-----------------------------------
Column FLAG_OWN_REALTY
Unique values: ['Y' 'N']

-----------------------------------
Column NAME_TYPE_SUITE
Unique values: ['Unaccompanied' 'Family' 'Spouse, partner' 'Children' 'Other_A' nan
 'Other_B' 'Group of people']

-----------------------------------
Column NAME_INCOME_TYPE
Unique values: ['Working' 'State servant' 'Commercial associate' 'Pensioner' 'Unemployed'
 'Student' 'Businessman' 'Maternity leave']

-----------------------------------
Column NAME_EDUCATION_TYPE
Unique values: ['Secondary / secondary special' 'Higher education' 'Incomplete higher'
 'Lower secondary' 'Academic degree']

-----------------------------------
Column NAME_FAMILY_STATUS
Unique values: ['Single / not married' 'Married' 'C

In [44]:
import time
#We can see that WEEKDAY_APPR_PROCESS_START is coded as a string

#Let's convert it into time of day
app["WEEKDAY_APPR_PROCESS_START"] = app["WEEKDAY_APPR_PROCESS_START"].apply(lambda x: time.strptime(x, '%A').tm_wday)

#Now we encode it into angular distance to preseve the day intervals
app["WEEKDAY_START_sin"] = np.sin(app["WEEKDAY_APPR_PROCESS_START"] * (2 * np.pi/7))
app["WEEKDAY_START_cos"] = np.cos(app["WEEKDAY_APPR_PROCESS_START"] * (2 * np.pi/7))

#We then remove the weekday column
app.drop(columns={"WEEKDAY_APPR_PROCESS_START"}, inplace=True)

In [47]:
#Investigating "XNA" values in GENDER
app[app["CODE_GENDER"] == 'XNA']
#Only 4 rows

#Let's look at the test data
app_test[app_test["CODE_GENDER"] == 'XNA']
#0 row

#We will delete the rows with NA values from our dataset
app = app[app["CODE_GENDER"] != 'XNA']

In [57]:
#Investigating "XNA" values in ORGANIZATION_TYPE
app[app["ORGANIZATION_TYPE"] == 'XNA']
#55374 rows

app[app["ORGANIZATION_TYPE"] == 'XNA']["TARGET"].value_counts(normalize=True)
#Significant deviation from the normal percentages, so it is interesting to keep these values

#They will be encoded during the feature engineering part of the project

0    0.946004
1    0.053996
Name: TARGET, dtype: float64

In [60]:
#Looking at "nan" values in EMERGENCYSTATE_MODE
print(len(app[app["EMERGENCYSTATE_MODE"].isna()]))

app[app["EMERGENCYSTATE_MODE"].isna()]["TARGET"].value_counts(normalize=True)
#Here it represents about half our dataset, we will create a "NA" variable as well since there is a small deviation from what
#We would have expected

app.loc[app["EMERGENCYSTATE_MODE"].isna(),"EMERGENCYSTATE_MODE"] = 'UKN'

145755


In [69]:
#Looking at "nan" values in OCCUPATION TYPE
print(len(app[app["OCCUPATION_TYPE"].isna()]))

app[app["OCCUPATION_TYPE"].isna()]["TARGET"].value_counts(normalize=True)
#Here it represents about a third of our dataset, we will create a "NA" variable as well since there is a deviation from what
#we would have expected

app.loc[app["OCCUPATION_TYPE"].isna(),"OCCUPATION_TYPE"] = 'UKN'

96006


In [70]:
#Looking at "nan" values in NAME_TYPE_SUITE
print(len(app[app["NAME_TYPE_SUITE"].isna()]))
#Only 1292 NA values

#We will delete these rows
app = app[app["NAME_TYPE_SUITE"].notna()]

0


In [71]:
#Verifying that we've dealt with all missing values of categorical variables
for c in app.columns:
    if app[c].dtype == 'object':
        print(app[c].isna().sum().sum())

0
0
0
0
0
0
0
0
0
0
0
0
0


We have finished cleaning up categorical variables, now we will look at numeric variables 

### 2.2 Cleaning numeric variables 

In [120]:
#Looking for outliers 

#Increasing the number of maximum columns shown
pd.options.display.max_columns = 100
app.describe()

Unnamed: 0,TARGET,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,REGION_POPULATION_RELATIVE,DAYS_BIRTH,DAYS_EMPLOYED,DAYS_REGISTRATION,DAYS_ID_PUBLISH,FLAG_MOBIL,FLAG_EMP_PHONE,FLAG_WORK_PHONE,FLAG_CONT_MOBILE,FLAG_PHONE,FLAG_EMAIL,CNT_FAM_MEMBERS,REGION_RATING_CLIENT,REGION_RATING_CLIENT_W_CITY,HOUR_APPR_PROCESS_START,REG_REGION_NOT_LIVE_REGION,REG_REGION_NOT_WORK_REGION,LIVE_REGION_NOT_WORK_REGION,REG_CITY_NOT_LIVE_CITY,REG_CITY_NOT_WORK_CITY,LIVE_CITY_NOT_WORK_CITY,EXT_SOURCE_2,EXT_SOURCE_3,YEARS_BEGINEXPLUATATION_AVG,FLOORSMAX_AVG,YEARS_BEGINEXPLUATATION_MODE,FLOORSMAX_MODE,YEARS_BEGINEXPLUATATION_MEDI,FLOORSMAX_MEDI,TOTALAREA_MODE,OBS_30_CNT_SOCIAL_CIRCLE,DEF_30_CNT_SOCIAL_CIRCLE,OBS_60_CNT_SOCIAL_CIRCLE,DEF_60_CNT_SOCIAL_CIRCLE,DAYS_LAST_PHONE_CHANGE,FLAG_DOCUMENT_2,FLAG_DOCUMENT_3,FLAG_DOCUMENT_4,FLAG_DOCUMENT_5,FLAG_DOCUMENT_6,FLAG_DOCUMENT_7,FLAG_DOCUMENT_8,FLAG_DOCUMENT_9,FLAG_DOCUMENT_10,FLAG_DOCUMENT_11,FLAG_DOCUMENT_12,FLAG_DOCUMENT_13,FLAG_DOCUMENT_14,FLAG_DOCUMENT_15,FLAG_DOCUMENT_16,FLAG_DOCUMENT_17,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
count,306219.0,306219.0,306219.0,306219.0,306207.0,306219.0,306219.0,306219.0,306219.0,306219.0,306219.0,306219.0,306219.0,306219.0,306219.0,306219.0,306219.0,306219.0,306219.0,306219.0,306219.0,306219.0,306219.0,306219.0,306219.0,306219.0,306219.0,305560.0,245468.0,156797.0,153798.0,156797.0,153798.0,156797.0,153798.0,158367.0,305198.0,305198.0,305198.0,305198.0,306218.0,306219.0,306219.0,306219.0,306219.0,306219.0,306219.0,306219.0,306219.0,306219.0,306219.0,306219.0,306219.0,306219.0,306219.0,306219.0,306219.0,306219.0,306219.0,306219.0,306219.0,264809.0,264809.0,264809.0,264809.0,264809.0,264809.0
mean,0.080841,0.417009,168783.2,598797.1,27122.117024,537946.4,0.020865,16040.601468,63858.080573,-4987.976608,-2994.33174,0.999997,0.81977,0.19906,0.998126,0.280773,0.056796,2.152786,2.052619,2.031641,12.062024,0.015162,0.050748,0.040618,0.078163,0.230489,0.179597,0.5143528,0.51092,0.977728,0.226259,0.977056,0.22229,0.977746,0.225875,0.102523,1.421569,0.143389,1.404642,0.10002,-964.423848,4.2e-05,0.71055,8.2e-05,0.014715,0.087855,0.00014,0.08134,0.003853,2e-05,0.00384,7e-06,0.003406,0.002805,0.00113,0.009405,0.000261,0.007818,0.000571,0.000493,0.00033,0.00639,0.006982,0.034447,0.267623,0.265697,1.903927
std,0.272591,0.722107,237516.4,401958.8,14490.83622,368917.8,0.01383,4362.862329,141312.849258,3522.561074,1509.513916,0.001807,0.38438,0.399294,0.043255,0.449378,0.231453,0.910586,0.509102,0.502794,3.266152,0.122199,0.219483,0.197404,0.268428,0.421146,0.383852,0.1910903,0.194837,0.05925,0.144579,0.064623,0.143649,0.059927,0.145009,0.107424,2.400906,0.446692,2.37979,0.362283,826.704854,0.006515,0.453508,0.009035,0.12041,0.283085,0.011849,0.273358,0.061957,0.004426,0.061852,0.002556,0.058262,0.05289,0.033595,0.096523,0.016161,0.088073,0.023899,0.022201,0.018158,0.083791,0.110478,0.20479,0.915633,0.794823,1.869594
min,0.0,0.0,25650.0,45000.0,1615.5,40500.0,0.00029,7489.0,-17912.0,-24672.0,-7197.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8.173617e-08,0.000527,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-4292.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,112500.0,270000.0,16551.0,238500.0,0.010006,12418.0,-2761.0,-7481.0,-4299.0,1.0,1.0,0.0,1.0,0.0,0.0,2.0,2.0,2.0,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.3923271,0.37065,0.9767,0.1667,0.9767,0.1667,0.9767,0.1667,0.0412,0.0,0.0,0.0,0.0,-1571.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,147600.0,513531.0,24930.0,450000.0,0.01885,15756.0,-1214.0,-4507.0,-3255.0,1.0,1.0,0.0,1.0,0.0,0.0,2.0,2.0,2.0,12.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5659453,0.535276,0.9816,0.1667,0.9816,0.1667,0.9816,0.1667,0.0688,0.0,0.0,0.0,0.0,-759.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
75%,0.0,1.0,202500.0,808650.0,34596.0,679500.0,0.028663,19685.0,-289.0,-2013.0,-1720.0,1.0,1.0,0.0,1.0,1.0,0.0,3.0,2.0,2.0,14.0,0.0,0.0,0.0,0.0,0.0,0.0,0.6636183,0.669057,0.9866,0.3333,0.9866,0.3333,0.9866,0.3333,0.1275,2.0,0.0,2.0,0.0,-276.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0
max,1.0,19.0,117000000.0,4050000.0,258025.5,4050000.0,0.072508,25229.0,365243.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,20.0,3.0,3.0,23.0,1.0,1.0,1.0,1.0,1.0,1.0,0.8549997,0.89601,1.0,1.0,1.0,1.0,1.0,1.0,1.0,348.0,34.0,344.0,24.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,4.0,9.0,8.0,27.0,261.0,25.0


In [73]:
#DAYS_BIRTH, DAYS_REGISTRATION and DAYS_ID_PUBLISH only have negative values
app["DAYS_REGISTRATION"] = abs(app["DAYS_REGISTRATION"])
app["DAYS_ID_PUBLISH"] = abs(app["DAYS_ID_PUBLISH"])
app["DAYS_BIRTH"] = abs(app["DAYS_BIRTH"])

print(app["DAYS_BIRTH"].min()/365, app["DAYS_BIRTH"].max()/365)
#No outlier data

20.517808219178082 69.12054794520547


In [74]:
#Turning SK_ID_CURR into an ID field :
app.set_index('SK_ID_CURR', inplace=True)

app.head()

Unnamed: 0_level_0,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
SK_ID_CURR,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,351000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,1129500.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
100004,0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,135000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
100006,0,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,297000.0,...,0,0,0,0,,,,,,
100007,0,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,513000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0


Analysis of the describe() output shows that there is **no clear outlier** in the rest of the numeric data. We can now start handling missing values.

In [112]:
len(app.columns[app.isnull().any()])
#21 columns with NA values

#Dropping rows with more than 30% na values
def drop_na_rows(df: pd.DataFrame, pct: float):
    n = len(df.columns)
    cutoff = n*pct/100 
    df = df[df.isna().sum(axis=1) > cutoff]

drop_na_rows(app, 50)
#No row was removed


306219
306219


NameError: name 'df' is not defined

In [None]:
#Start with reviewing outliers == > hour_application_process_start