<h1><center>Home Credit Risk Prediction</center></h1>
<center>November 2024</center>
<center>Celine Ng</center>

# Table of Contents

1. Project Introduction   
    1. Notebook Preparation
    1. Dataset
1. Initial Data Cleaning
    1. Duplicate rows
    1. Datatypes
    1. Missing values
    1. Values
1. EDA
    1. Correlation
    1. Statistical Inference
    1. Distribution
1. Data Preprocessing
1. Feature Selection
    1. All features included
    1. Mutual Information
    1. PCA
1. Models
    1. Baseline model
    1. Basic model
    1. Hyperparameter Tuning
    1. Test Data
    1. Final Model
    1. Deployment
    1. Model Interpretation
1. Improvements

# 1. Project Introduction

**Crucial problems for retail banks** <br>
1. Minimize loan defaults by evaluating credit risk accurately.
2. Maximize profits by better identifying customers who are NOT 
currently granted loans but are potentially reliable.


**Project Objective**<br>
The second problem cannot be solved with the provided data, so this 
project focuses on the following:<br>
1. Improve the accuracy of credit risk evaluation for retail banks. In 
practice, this means classifying the target variable.
2. Evaluate feature importance to explain decisions.
3. Provide actionable insights to improve credit scoring.

**Initial Plan**<br>
1. Data cleaning (counting missing values) for each table
2. Select important tables
3. Aggregate table information - one row per person instead of one row per loan
4. Join the aggregated columns to the main table
5. EDA and statistical inference
6. Feature engineering - create new features using domain knowledge
7. Feature selection
8. Modeling 
9. Hyperparameter tuning
10. Evaluation with cross-validation
11. Deployment
12. Model interpretation
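
Steps 3 and 4 of the plan can be sketched on toy data. The column names `SK_ID_CURR`, `AMT_CREDIT_SUM` and `CREDIT_ACTIVE` come from bureau.csv; the toy values, the choice of aggregations and the output column names (`BUREAU_*`) are illustrative assumptions, not the aggregations used later in the project.

```python
import pandas as pd

# Toy stand-ins for the main table and bureau.csv (values are made up).
main = pd.DataFrame({"SK_ID_CURR": [1, 2, 3], "TARGET": [0, 1, 0]})
bureau_toy = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2],
    "AMT_CREDIT_SUM": [1000.0, 500.0, 200.0],
    "CREDIT_ACTIVE": ["Active", "Closed", "Active"],
})

# Step 3: collapse many credits per client into one row per SK_ID_CURR.
bureau_agg = (
    bureau_toy
    .assign(IS_ACTIVE=bureau_toy["CREDIT_ACTIVE"].eq("Active"))
    .groupby("SK_ID_CURR")
    .agg(
        BUREAU_CREDIT_COUNT=("AMT_CREDIT_SUM", "size"),
        BUREAU_CREDIT_SUM_MEAN=("AMT_CREDIT_SUM", "mean"),
        BUREAU_ACTIVE_COUNT=("IS_ACTIVE", "sum"),
    )
    .reset_index()
)

# Step 4: a left join keeps every application, including clients with no
# bureau history (their aggregated columns become NaN).
merged = main.merge(bureau_agg, on="SK_ID_CURR", how="left", validate="one_to_one")
print(merged)
```

The `validate="one_to_one"` argument makes the merge raise an error if the aggregation accidentally left more than one row per `SK_ID_CURR`.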

## 1.1. Notebook Preparation

In [1]:
%%capture
%pip install -r requirements.txt

In [2]:
from IPython.display import HTML
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## 1.2. Dataset

Objective: Brief overview of our datasets, including the features and target 
variable.

The data consists of 10 separate CSV files. It comes from a Kaggle
 competition that is now closed, 
[Home Credit Default Risk](https://www.kaggle.com/competitions/home-credit-default-risk/overview).

<ol>
<li>application_{train|test}.csv

This is the main table, broken into two files for Train (with TARGET) and Test (without TARGET).
Static data for all applications. One row represents one loan in our data 
sample.</li>
<li>bureau.csv

All client's previous credits provided by other financial institutions that were reported to Credit Bureau (for clients who have a loan in our sample).
For every loan in our sample, there are as many rows as number of credits 
the client had in Credit Bureau before the application date.</li>
<li>bureau_balance.csv

Monthly balances of previous credits in Credit Bureau.
This table has one row for each month of history of every previous credit 
reported to Credit Bureau – i.e. the table has (#loans in sample * # of 
relative previous credits * # of months where we have some history 
observable for the previous credits) rows.</li>
<li>POS_CASH_balance.csv

Monthly balance snapshots of previous POS (point of sales) and cash loans that the applicant had with Home Credit.
This table has one row for each month of history of every previous credit in
 Home Credit (consumer credit and cash loans) related to loans in our sample
  – i.e. the table has (#loans in sample * # of relative previous credits 
  *# of months in which we have some history observable for the previous 
  credits) rows.</li>
<li>credit_card_balance.csv

Monthly balance snapshots of previous credit cards that the applicant has with Home Credit.
This table has one row for each month of history of every previous credit in
 Home Credit (consumer credit and cash loans) related to loans in our sample
  – i.e. the table has (#loans in sample * # of relative previous credit 
  cards *# of months where we have some history observable for the previous
   credit card) rows.</li>
<li>previous_application.csv

All previous applications for Home Credit loans of clients who have loans in our sample.
There is one row for each previous application related to loans in our data 
sample.</li>
<li>installments_payments.csv

Repayment history for the previously disbursed credits in Home Credit related to the loans in our sample.
There is a) one row for every payment that was made plus b) one row for each missed payment.
One row is equivalent to one payment of one installment OR one installment 
corresponding to one payment of one previous Home Credit credit related to 
loans in our sample.</li>

**Adjacent csv files:**
<li>HomeCredit_columns_description.csv

This file contains descriptions for the columns in the various data files.</li>

<li>sample_submission.csv

Sample submission file for the Kaggle competition.</li>
</ol>
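
The table descriptions above imply a hierarchy of keys: bureau_balance.csv is keyed by `SK_ID_BUREAU` only, so its monthly rows have to be rolled up to one row per previous credit, linked to `SK_ID_CURR` through bureau.csv, and then rolled up again per client. A minimal sketch on toy data follows; the feature names `MONTHS_OF_HISTORY` and `BB_MONTHS_*` are hypothetical.

```python
import pandas as pd

# Toy stand-ins for bureau.csv and bureau_balance.csv (values are made up).
bureau_toy = pd.DataFrame({
    "SK_ID_CURR": [10, 10, 20],
    "SK_ID_BUREAU": [100, 101, 200],
})
balance_toy = pd.DataFrame({
    "SK_ID_BUREAU": [100, 100, 101, 200, 200, 200],
    "MONTHS_BALANCE": [0, -1, 0, 0, -1, -2],
})

# Level 1: number of monthly records per previous credit.
per_credit = (
    balance_toy.groupby("SK_ID_BUREAU")
    .size()
    .rename("MONTHS_OF_HISTORY")
    .reset_index()
)

# Attach the client key, which only bureau.csv carries.
per_credit = bureau_toy.merge(per_credit, on="SK_ID_BUREAU", how="left")

# Level 2: one row per client.
per_client = (
    per_credit.groupby("SK_ID_CURR")["MONTHS_OF_HISTORY"]
    .agg(["sum", "max"])
    .add_prefix("BB_MONTHS_")
    .reset_index()
)
print(per_client)
```

The same two-level pattern would apply to the Home Credit tables keyed by `SK_ID_PREV`.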

**Home Credit columns description csv**

In [3]:
description = pd.read_csv('data/HomeCredit_columns_description.csv', encoding='latin1')
display(description.head())
description_shape = description.shape
print(f"Number of rows on description data: {description_shape[0]}\nNumber of "
      f"columns on description data: {description_shape[1]}")

Unnamed: 0.1,Unnamed: 0,Table,Row,Description,Special
0,1,application_{train|test}.csv,SK_ID_CURR,ID of loan in our sample,
1,2,application_{train|test}.csv,TARGET,Target variable (1 - client with payment diffi...,
2,5,application_{train|test}.csv,NAME_CONTRACT_TYPE,Identification if loan is cash or revolving,
3,6,application_{train|test}.csv,CODE_GENDER,Gender of the client,
4,7,application_{train|test}.csv,FLAG_OWN_CAR,Flag if the client owns a car,


Number of rows on description data: 219
Number of columns on description data: 5
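
The load, preview and shape-print pattern is repeated for each table in this section. One possible way to avoid retyping the label each time is a small helper that takes the table name as an argument. The function name `shape_summary` is hypothetical and not part of the notebook.

```python
import pandas as pd


def shape_summary(df: pd.DataFrame, name: str) -> str:
    """Return the two-line row/column summary used throughout this section."""
    rows, cols = df.shape
    return (f"Number of rows on {name} data: {rows}\n"
            f"Number of columns on {name} data: {cols}")


# Toy frame standing in for any of the CSV tables.
toy = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
summary = shape_summary(toy, "toy")
print(summary)
```

In the notebook this would be called as, for example, `print(shape_summary(description, "description"))`.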


In [4]:
description.iloc[1, 3]

'Target variable (1 - client with payment difficulties: he/she had late payment more than X days on at least one of the first Y installments of the loan in our sample, 0 - all other cases)'
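
Since `TARGET` is binary (1 for payment difficulties, 0 for all other cases), a natural first check is the share of each class. The sketch below uses made-up values, so the printed shares say nothing about the real data; on the real table the same call would be `application['TARGET'].value_counts(normalize=True)`.

```python
import pandas as pd

# Toy target column (made-up values, not the real class balance).
toy_target = pd.Series([0, 0, 0, 1, 0, 0, 1, 0], name="TARGET")

# Share of each class, ordered by label.
class_share = toy_target.value_counts(normalize=True).sort_index()
print(class_share)
```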

In [5]:
description.loc[description['Row'] == 'SK_ID_CURR', 'Description']

0                               ID of loan in our sample
122    ID of loan in our sample - one loan in our sam...
143                             ID of loan in our sample
151                             ID of loan in our sample
174                             ID of loan in our sample
212                             ID of loan in our sample
Name: Description, dtype: object
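
The descriptions above are cut off with `...` because pandas truncates long strings when displaying them. A minimal sketch of showing the full text by lifting the `display.max_colwidth` option temporarily is given below; the toy description text is a placeholder.

```python
import pandas as pd

# Placeholder long description (the trailing x's only make it exceed the limit).
long_text = "ID of loan in our sample - " + "x" * 80
toy_desc = pd.DataFrame({"Row": ["SK_ID_CURR"], "Description": [long_text]})

# Default display settings shorten long strings.
truncated = repr(toy_desc["Description"])

# Inside the context manager the column width limit is removed.
with pd.option_context("display.max_colwidth", None):
    full = repr(toy_desc["Description"])

print(truncated)
print(full)
```

Using `pd.option_context` rather than `pd.set_option` restores the default once the block ends, so later previews stay compact.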

**Application train csv**

In [6]:
application = pd.read_csv('data/application_train.csv')
display(application.head())
application_shape = application.shape
print(f"Number of rows on train data: {application_shape[0]}\nNumber of "
      f"columns on train data: {application_shape[1]}")

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
1,100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
2,100004,0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3,100006,0,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,...,0,0,0,0,,,,,,
4,100007,0,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0


Number of rows on train data: 307511
Number of columns on train data: 122
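
The preview already shows NaN values in the `AMT_REQ_CREDIT_BUREAU_*` columns. The missing-value counting planned as the first cleaning step can be sketched as the share of missing values per column. The toy frame below reuses real column names, but its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with real column names and made-up values.
toy = pd.DataFrame({
    "SK_ID_CURR": [100002, 100003, 100004, 100006],
    "AMT_REQ_CREDIT_BUREAU_YEAR": [1.0, 0.0, 0.0, np.nan],
    "OWN_CAR_AGE": [np.nan, np.nan, 26.0, np.nan],
})

# Fraction of missing values per column, most incomplete first.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```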


In [7]:
application.columns.to_list()

['SK_ID_CURR',
 'TARGET',
 'NAME_CONTRACT_TYPE',
 'CODE_GENDER',
 'FLAG_OWN_CAR',
 'FLAG_OWN_REALTY',
 'CNT_CHILDREN',
 'AMT_INCOME_TOTAL',
 'AMT_CREDIT',
 'AMT_ANNUITY',
 'AMT_GOODS_PRICE',
 'NAME_TYPE_SUITE',
 'NAME_INCOME_TYPE',
 'NAME_EDUCATION_TYPE',
 'NAME_FAMILY_STATUS',
 'NAME_HOUSING_TYPE',
 'REGION_POPULATION_RELATIVE',
 'DAYS_BIRTH',
 'DAYS_EMPLOYED',
 'DAYS_REGISTRATION',
 'DAYS_ID_PUBLISH',
 'OWN_CAR_AGE',
 'FLAG_MOBIL',
 'FLAG_EMP_PHONE',
 'FLAG_WORK_PHONE',
 'FLAG_CONT_MOBILE',
 'FLAG_PHONE',
 'FLAG_EMAIL',
 'OCCUPATION_TYPE',
 'CNT_FAM_MEMBERS',
 'REGION_RATING_CLIENT',
 'REGION_RATING_CLIENT_W_CITY',
 'WEEKDAY_APPR_PROCESS_START',
 'HOUR_APPR_PROCESS_START',
 'REG_REGION_NOT_LIVE_REGION',
 'REG_REGION_NOT_WORK_REGION',
 'LIVE_REGION_NOT_WORK_REGION',
 'REG_CITY_NOT_LIVE_CITY',
 'REG_CITY_NOT_WORK_CITY',
 'LIVE_CITY_NOT_WORK_CITY',
 'ORGANIZATION_TYPE',
 'EXT_SOURCE_1',
 'EXT_SOURCE_2',
 'EXT_SOURCE_3',
 'APARTMENTS_AVG',
 'BASEMENTAREA_AVG',
 'YEARS_BEGINEXPLUATATION_A

**Bureau csv**

In [8]:
bureau = pd.read_csv('data/bureau.csv')
display(bureau.head())
bureau_shape = bureau.shape
print(f"Number of rows on bureau data: {bureau_shape[0]}\nNumber of "
      f"columns on bureau data: {bureau_shape[1]}")

Unnamed: 0,SK_ID_CURR,SK_ID_BUREAU,CREDIT_ACTIVE,CREDIT_CURRENCY,DAYS_CREDIT,CREDIT_DAY_OVERDUE,DAYS_CREDIT_ENDDATE,DAYS_ENDDATE_FACT,AMT_CREDIT_MAX_OVERDUE,CNT_CREDIT_PROLONG,AMT_CREDIT_SUM,AMT_CREDIT_SUM_DEBT,AMT_CREDIT_SUM_LIMIT,AMT_CREDIT_SUM_OVERDUE,CREDIT_TYPE,DAYS_CREDIT_UPDATE,AMT_ANNUITY
0,215354,5714462,Closed,currency 1,-497,0,-153.0,-153.0,,0,91323.0,0.0,,0.0,Consumer credit,-131,
1,215354,5714463,Active,currency 1,-208,0,1075.0,,,0,225000.0,171342.0,,0.0,Credit card,-20,
2,215354,5714464,Active,currency 1,-203,0,528.0,,,0,464323.5,,,0.0,Consumer credit,-16,
3,215354,5714465,Active,currency 1,-203,0,,,,0,90000.0,,,0.0,Credit card,-16,
4,215354,5714466,Active,currency 1,-629,0,1197.0,,77674.5,0,2700000.0,,,0.0,Consumer credit,-21,


Number of rows on bureau data: 1716428
Number of columns on bureau data: 17


**Bureau Balance csv**

In [9]:
bureau_balance = pd.read_csv('data/bureau_balance.csv')
display(bureau_balance.head())
bureau_balance_shape = bureau_balance.shape
print(f"Number of rows on bureau balance data: {bureau_balance_shape[0]}\nNumber of "
      f"columns on bureau balance data: {bureau_balance_shape[1]}")

Unnamed: 0,SK_ID_BUREAU,MONTHS_BALANCE,STATUS
0,5715448,0,C
1,5715448,-1,C
2,5715448,-2,C
3,5715448,-3,C
4,5715448,-4,C


Number of rows on bureau balance data: 27299925
Number of columns on bureau balance data: 3


In [10]:
description.loc[description['Row'] == 'MONTHS_BALANCE']

Unnamed: 0.1,Unnamed: 0,Table,Row,Description,Special
140,143,bureau_balance.csv,MONTHS_BALANCE,Month of balance relative to application date ...,time only relative to the application
144,147,POS_CASH_balance.csv,MONTHS_BALANCE,Month of balance relative to application date ...,time only relative to the application
152,155,credit_card_balance.csv,MONTHS_BALANCE,Month of balance relative to application date ...,time only relative to the application


Month of balance relative to application date (-1 means the information to the freshest monthly snapshot, 0 means the information at application - often it will be the same as -1 as many banks are not updating the information to Credit Bureau regularly )
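
Because `MONTHS_BALANCE` counts months backwards from the application date (0 at application, more negative further back), recent history can be selected with a simple comparison. In the sketch below the 12-month window and the toy values are assumptions chosen for illustration.

```python
import pandas as pd

# Toy monthly balance records (values are made up).
balance_toy = pd.DataFrame({
    "SK_ID_BUREAU": [1, 1, 1, 2, 2],
    "MONTHS_BALANCE": [0, -5, -30, -2, -13],
})

WINDOW_MONTHS = 12  # illustrative window, not a value taken from the project

# Keep only snapshots from the last WINDOW_MONTHS months before application.
recent = balance_toy[balance_toy["MONTHS_BALANCE"] >= -WINDOW_MONTHS]
recent_counts = recent.groupby("SK_ID_BUREAU").size()
print(recent_counts)
```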

In [11]:
description.loc[description['Row'] == 'STATUS']

Unnamed: 0.1,Unnamed: 0,Table,Row,Description,Special
141,144,bureau_balance.csv,STATUS,Status of Credit Bureau loan during the month ...,


Status of Credit Bureau loan during the month (active, closed, DPD0-30, [C means closed, X means status unknown, 0 means no DPD, 1 means maximal did during month between 1-30, 2 means DPD 31-60, 5 means DPD 120+ or sold or written off ] )
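
`STATUS` mixes numeric days-past-due (DPD) levels with the letter codes `C` and `X`, so it needs some encoding before it can be aggregated. One possible encoding, given here as an assumption rather than the approach used later, keeps the digits as an ordered level and turns the letters into separate flags. The derived column names are hypothetical.

```python
import pandas as pd

# The codes named in the column description.
status = pd.Series(["C", "X", "0", "1", "2", "5"], name="STATUS")

# Digits become an ordered DPD level; C and X are not numeric and become NaN.
dpd_level = pd.to_numeric(status, errors="coerce")

features = pd.DataFrame({
    "STATUS": status,
    "DPD_LEVEL": dpd_level,
    "IS_CLOSED": status.eq("C"),
    "IS_UNKNOWN": status.eq("X"),
    "IS_OVERDUE": dpd_level.ge(1),  # NaN compares as False
})
print(features)
```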

**Previous Application csv**

In [12]:
prev_application = pd.read_csv('data/previous_application.csv')
display(prev_application.head())
prev_application_shape = prev_application.shape
print(f"Number of rows on previous application data: {prev_application_shape[0]}\nNumber of "
      f"columns on previous application data: {prev_application_shape[1]}")

Unnamed: 0,SK_ID_PREV,SK_ID_CURR,NAME_CONTRACT_TYPE,AMT_ANNUITY,AMT_APPLICATION,AMT_CREDIT,AMT_DOWN_PAYMENT,AMT_GOODS_PRICE,WEEKDAY_APPR_PROCESS_START,HOUR_APPR_PROCESS_START,...,NAME_SELLER_INDUSTRY,CNT_PAYMENT,NAME_YIELD_GROUP,PRODUCT_COMBINATION,DAYS_FIRST_DRAWING,DAYS_FIRST_DUE,DAYS_LAST_DUE_1ST_VERSION,DAYS_LAST_DUE,DAYS_TERMINATION,NFLAG_INSURED_ON_APPROVAL
0,2030495,271877,Consumer loans,1730.43,17145.0,17145.0,0.0,17145.0,SATURDAY,15,...,Connectivity,12.0,middle,POS mobile with interest,365243.0,-42.0,300.0,-42.0,-37.0,0.0
1,2802425,108129,Cash loans,25188.615,607500.0,679671.0,,607500.0,THURSDAY,11,...,XNA,36.0,low_action,Cash X-Sell: low,365243.0,-134.0,916.0,365243.0,365243.0,1.0
2,2523466,122040,Cash loans,15060.735,112500.0,136444.5,,112500.0,TUESDAY,11,...,XNA,12.0,high,Cash X-Sell: high,365243.0,-271.0,59.0,365243.0,365243.0,1.0
3,2819243,176158,Cash loans,47041.335,450000.0,470790.0,,450000.0,MONDAY,7,...,XNA,12.0,middle,Cash X-Sell: middle,365243.0,-482.0,-152.0,-182.0,-177.0,1.0
4,1784265,202054,Cash loans,31924.395,337500.0,404055.0,,337500.0,THURSDAY,9,...,XNA,24.0,high,Cash Street: high,,,,,,


Number of rows on previous application data: 1670214
Number of columns on previous application data: 37
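
In the preview, the value 365243.0 appears repeatedly in the `DAYS_*` columns, while the other entries are small offsets relative to the application date. 365243 days is roughly 1,000 years, so the sketch below assumes it is a placeholder rather than a real offset and converts it to NaN. That assumption should be checked against the column descriptions before it is applied to the real table.

```python
import numpy as np
import pandas as pd

SENTINEL = 365243  # assumed placeholder value, see the note above

# Toy rows modelled on the preview (only a few columns).
prev_toy = pd.DataFrame({
    "DAYS_FIRST_DRAWING": [365243.0, 365243.0, np.nan],
    "DAYS_LAST_DUE": [-42.0, 365243.0, -182.0],
    "CNT_PAYMENT": [12.0, 36.0, 12.0],
})

# Only touch the DAYS_* columns; other columns are left unchanged.
days_cols = [c for c in prev_toy.columns if c.startswith("DAYS_")]
cleaned = prev_toy.copy()
cleaned[days_cols] = cleaned[days_cols].replace(SENTINEL, np.nan)
print(cleaned)
```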


**POS CASH balance csv**

In [13]:
pos_cash = pd.read_csv('data/POS_CASH_balance.csv')
display(pos_cash.head())
pos_cash_shape = pos_cash.shape
print(f"Number of rows on POS CASH balance data: {pos_cash_shape[0]}\nNumber of "
      f"columns on POS CASH balance data: {pos_cash_shape[1]}")

Unnamed: 0,SK_ID_PREV,SK_ID_CURR,MONTHS_BALANCE,CNT_INSTALMENT,CNT_INSTALMENT_FUTURE,NAME_CONTRACT_STATUS,SK_DPD,SK_DPD_DEF
0,1803195,182943,-31,48.0,45.0,Active,0,0
1,1715348,367990,-33,36.0,35.0,Active,0,0
2,1784872,397406,-32,12.0,9.0,Active,0,0
3,1903291,269225,-35,48.0,42.0,Active,0,0
4,2341044,334279,-35,36.0,35.0,Active,0,0


Number of rows on POS CASH balance data: 10001358
Number of columns on POS CASH balance data: 8


**Installments Payments csv**

In [14]:
installments_payments = pd.read_csv('data/installments_payments.csv')
display(installments_payments.head())
installments_payments_shape = installments_payments.shape
print(f"Number of rows on installments payments data: {installments_payments_shape[0]}\nNumber of "
      f"columns on installments payments data: {installments_payments_shape[1]}")

# Improvements

1. Understand and identify the subpopulation of Home Credit clients who have 
no difficulty with repayment but were not given loans by other institutions. 
This requires collecting loan application data from other banks.