<h1><center>Home Credit Risk Prediction</center></h1>
<center>November 2024</center>
<center>Celine Ng</center>

# Table of Contents

1. Project Introduction   
    1. Notebook Preparation
    1. Data loading
    1. Train/Test Separation
1. Initial Data Cleaning
    1. Duplicate rows
    1. Datatypes
    1. Missing values
    1. Values
1. EDA
    1. Correlation
    1. Statistical Inference
    1. Distribution
1. Data Preprocessing
1. Feature Selection
    1. All features included
    1. Mutual Information
    1. PCA
1. Models
    1. Baseline model
    1. Basic model
    1. Hyperparameter Tuning
    1. Test Data
    1. Final Model
    1. Deployment
    1. Model Interpretation
1. Improvements

# 1. Project Introduction

**Crucial problems for retail banks** <br>
1. Minimize loan defaults by evaluating credit risk accurately.
2. Maximize profits by better identifying customers who are NOT 
currently granted loans but are potentially reliable.


**Project Objective**<br>
The second problem cannot be solved with the provided data, so this 
project focuses on the following:<br>
1. Improve the accuracy of credit risk evaluation for retail banks. In 
practice, this means classifying the target variable.
2. Evaluate feature importance to explain decisions.
3. Provide actionable insights to improve credit scoring.

**Initial Plan**<br>
1. Data cleaning (missing values counting) of each table
2. Select important tables
3. Aggregate table information - 1 row per person instead of loan
4. Join columns to main table
5. EDA and statistical inference
6. Feature engineering - New feature creations with domain knowledge
7. Feature selection
8. Modeling 
9. Hyperparameter tuning
10. Evaluation with cross validation
11. Deployment
12. Model interpretation

## 1.1. Notebook Preparation

In [29]:
%%capture
%pip install -r requirements.txt

In [30]:
from IPython.display import HTML
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import os
from utils.eda import *
from utils.model import *
from utils.stats import *

## 1.2. Data Loading

Objective: Brief overview of our datasets, including the features and target 
variable

The data comes with 10 separate CSV files. It is originally based on a Kaggle
 competition that is now closed, 
[Home Credit Default Risk](https://www.kaggle.com/competitions/home-credit-default-risk/overview).

<ol>
<li>application_{train|test}.csv

This is the main table, broken into two files for Train (with TARGET) and Test (without TARGET).
Static data for all applications. One row represents one loan in our data 
sample.</li>
<li>bureau.csv

All client's previous credits provided by other financial institutions that were reported to Credit Bureau (for clients who have a loan in our sample).
For every loan in our sample, there are as many rows as number of credits 
the client had in Credit Bureau before the application date.</li>
<li>bureau_balance.csv

Monthly balances of previous credits in Credit Bureau.
This table has one row for each month of history of every previous credit 
reported to Credit Bureau – i.e the table has (#loans in sample * # of 
relative previous credits * # of months where we have some history 
observable for the previous credits) rows.</li>
<li>POS_CASH_balance.csv

Monthly balance snapshots of previous POS (point of sales) and cash loans that the applicant had with Home Credit.
This table has one row for each month of history of every previous credit in
 Home Credit (consumer credit and cash loans) related to loans in our sample
  – i.e. the table has (#loans in sample * # of relative previous credits 
  *# of months in which we have some history observable for the previous 
  credits) rows.</li>
<li>credit_card_balance.csv

Monthly balance snapshots of previous credit cards that the applicant has with Home Credit.
This table has one row for each month of history of every previous credit in
 Home Credit (consumer credit and cash loans) related to loans in our sample
  – i.e. the table has (#loans in sample * # of relative previous credit 
  cards *# of months where we have some history observable for the previous
   credit card) rows.</li>
<li>previous_application.csv

All previous applications for Home Credit loans of clients who have loans in our sample.
There is one row for each previous application related to loans in our data 
sample.</li>
<li>installments_payments.csv

Repayment history for the previously disbursed credits in Home Credit related to the loans in our sample.
There is a) one row for every payment that was made plus b) one row each for missed payment.
One row is equivalent to one payment of one installment OR one installment 
corresponding to one payment of one previous Home Credit credit related to 
loans in our sample.</li>

**Adjacent csv files:**
<li>HomeCredit_columns_description.csv

This file contains descriptions for the columns in the various data files.</li>

<li>sample_submission.csv

Sample submission file for the Kaggle competition.</li>
</ol>

*The data tables (every CSV file except the two adjacent files above) are 
converted into pkl files for more efficient memory usage. To do so, run 
'convert_csv_to_pkl.py' before executing the cells below.*
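The conversion script itself is not shown in this notebook. A minimal sketch of what such a script could look like is given below; the function name, the `skip` tuple, and the directory arguments are assumptions for illustration, not the contents of the actual 'convert_csv_to_pkl.py'.

```python
from pathlib import Path

import pandas as pd

# Hypothetical list of auxiliary files that stay as CSV.
SKIP_FILES = ("HomeCredit_columns_description.csv", "sample_submission.csv")


def convert_csv_to_pkl(csv_dir, pkl_dir, skip=SKIP_FILES):
    """Write one .pkl file in pkl_dir for every data CSV found in csv_dir."""
    csv_dir, pkl_dir = Path(csv_dir), Path(pkl_dir)
    pkl_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for csv_path in sorted(csv_dir.glob("*.csv")):
        if csv_path.name in skip:
            continue
        df = pd.read_csv(csv_path)
        pkl_path = pkl_dir / f"{csv_path.stem}.pkl"
        # Pickle keeps the parsed dtypes, so later loads skip CSV parsing.
        df.to_pickle(pkl_path)
        written.append(pkl_path)
    return written
```

Under these assumptions, calling `convert_csv_to_pkl('data_csv', 'data_pkl')` would produce the files that the `pd.read_pickle` calls in this notebook expect.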

**Home Credit columns description**

In [31]:
description = pd.read_csv('data_csv/HomeCredit_columns_description.csv', encoding='latin1')
display(description.head())
description_shape = description.shape
print(f"Number of rows in description: {description_shape[0]}\nNumber of "
      f"columns in description: {description_shape[1]}")

Unnamed: 0.1,Unnamed: 0,Table,Row,Description,Special
0,1,application_{train|test}.csv,SK_ID_CURR,ID of loan in our sample,
1,2,application_{train|test}.csv,TARGET,Target variable (1 - client with payment diffi...,
2,5,application_{train|test}.csv,NAME_CONTRACT_TYPE,Identification if loan is cash or revolving,
3,6,application_{train|test}.csv,CODE_GENDER,Gender of the client,
4,7,application_{train|test}.csv,FLAG_OWN_CAR,Flag if the client owns a car,


Number of rows in description: 219
Number of columns in description: 5


In [32]:
description.iloc[1, 3]

'Target variable (1 - client with payment difficulties: he/she had late payment more than X days on at least one of the first Y installments of the loan in our sample, 0 - all other cases)'

In [33]:
description.loc[description['Row'] == 'SK_ID_CURR', 'Description']

0                               ID of loan in our sample
122    ID of loan in our sample - one loan in our sam...
143                             ID of loan in our sample
151                             ID of loan in our sample
174                             ID of loan in our sample
212                             ID of loan in our sample
Name: Description, dtype: object

**Application train**

In [34]:
application_train = pd.read_pickle('data_pkl/application_train.pkl')
display(application_train.head())
application_train_shape = application_train.shape
print(f"Number of rows in application_train: {application_train_shape[0]}\nNumber of "
      f"columns in application_train: {application_train_shape[1]}")

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
1,100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
2,100004,0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3,100006,0,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,...,0,0,0,0,,,,,,
4,100007,0,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0


Number of rows in application_train: 307511
Number of columns in application_train: 122


**Application Test**

In [36]:
application_test = pd.read_pickle('data_pkl/application_test.pkl')
display(application_test.head())
application_test_shape = application_test.shape
print(f"Number of rows in application_test: {application_test_shape[0]}\nNumber of "
      f"columns in application_test: {application_test_shape[1]}")

Unnamed: 0,SK_ID_CURR,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100001,Cash loans,F,N,Y,0,135000.0,568800.0,20560.5,450000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
1,100005,Cash loans,M,N,Y,0,99000.0,222768.0,17370.0,180000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,3.0
2,100013,Cash loans,M,Y,Y,0,202500.0,663264.0,69777.0,630000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,1.0,4.0
3,100028,Cash loans,F,N,Y,2,315000.0,1575000.0,49018.5,1575000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,3.0
4,100038,Cash loans,M,Y,N,1,180000.0,625500.0,32067.0,625500.0,...,0,0,0,0,,,,,,


Number of rows in application_test: 48744
Number of columns in application_test: 121


For both application train and test, SK_ID_CURR is the key that identifies 
each row in the table.

**Bureau**

In [37]:
bureau = pd.read_pickle('data_pkl/bureau.pkl')
display(bureau.head())
bureau_shape = bureau.shape
print(f"Number of rows in bureau: {bureau_shape[0]}\nNumber of "
      f"columns in bureau: {bureau_shape[1]}")

Unnamed: 0,SK_ID_CURR,SK_ID_BUREAU,CREDIT_ACTIVE,CREDIT_CURRENCY,DAYS_CREDIT,CREDIT_DAY_OVERDUE,DAYS_CREDIT_ENDDATE,DAYS_ENDDATE_FACT,AMT_CREDIT_MAX_OVERDUE,CNT_CREDIT_PROLONG,AMT_CREDIT_SUM,AMT_CREDIT_SUM_DEBT,AMT_CREDIT_SUM_LIMIT,AMT_CREDIT_SUM_OVERDUE,CREDIT_TYPE,DAYS_CREDIT_UPDATE,AMT_ANNUITY
0,215354,5714462,Closed,currency 1,-497,0,-153.0,-153.0,,0,91323.0,0.0,,0.0,Consumer credit,-131,
1,215354,5714463,Active,currency 1,-208,0,1075.0,,,0,225000.0,171342.0,,0.0,Credit card,-20,
2,215354,5714464,Active,currency 1,-203,0,528.0,,,0,464323.5,,,0.0,Consumer credit,-16,
3,215354,5714465,Active,currency 1,-203,0,,,,0,90000.0,,,0.0,Credit card,-16,
4,215354,5714466,Active,currency 1,-629,0,1197.0,,77674.5,0,2700000.0,,,0.0,Consumer credit,-21,


Number of rows in bureau: 1716428
Number of columns in bureau: 17


**Bureau Balance**

In [38]:
bureau_balance = pd.read_pickle('data_pkl/bureau_balance.pkl')
display(bureau_balance.head())
bureau_balance_shape = bureau_balance.shape
print(f"Number of rows in bureau_balance: {bureau_balance_shape[0]}\nNumber of "
      f"columns in bureau_balance: {bureau_balance_shape[1]}")

Unnamed: 0,SK_ID_BUREAU,MONTHS_BALANCE,STATUS
0,5715448,0,C
1,5715448,-1,C
2,5715448,-2,C
3,5715448,-3,C
4,5715448,-4,C


Number of rows in bureau_balance: 27299925
Number of columns in bureau_balance: 3


**Previous Application**

In [41]:
previous_application = pd.read_pickle('data_pkl/previous_application.pkl')
display(previous_application.head())
previous_application_shape = previous_application.shape
print(f"Number of rows in previous_application: {previous_application_shape[0]}\nNumber of "
      f"columns in previous_application: {previous_application_shape[1]}")

Unnamed: 0,SK_ID_PREV,SK_ID_CURR,NAME_CONTRACT_TYPE,AMT_ANNUITY,AMT_APPLICATION,AMT_CREDIT,AMT_DOWN_PAYMENT,AMT_GOODS_PRICE,WEEKDAY_APPR_PROCESS_START,HOUR_APPR_PROCESS_START,...,NAME_SELLER_INDUSTRY,CNT_PAYMENT,NAME_YIELD_GROUP,PRODUCT_COMBINATION,DAYS_FIRST_DRAWING,DAYS_FIRST_DUE,DAYS_LAST_DUE_1ST_VERSION,DAYS_LAST_DUE,DAYS_TERMINATION,NFLAG_INSURED_ON_APPROVAL
0,2030495,271877,Consumer loans,1730.43,17145.0,17145.0,0.0,17145.0,SATURDAY,15,...,Connectivity,12.0,middle,POS mobile with interest,365243.0,-42.0,300.0,-42.0,-37.0,0.0
1,2802425,108129,Cash loans,25188.615,607500.0,679671.0,,607500.0,THURSDAY,11,...,XNA,36.0,low_action,Cash X-Sell: low,365243.0,-134.0,916.0,365243.0,365243.0,1.0
2,2523466,122040,Cash loans,15060.735,112500.0,136444.5,,112500.0,TUESDAY,11,...,XNA,12.0,high,Cash X-Sell: high,365243.0,-271.0,59.0,365243.0,365243.0,1.0
3,2819243,176158,Cash loans,47041.335,450000.0,470790.0,,450000.0,MONDAY,7,...,XNA,12.0,middle,Cash X-Sell: middle,365243.0,-482.0,-152.0,-182.0,-177.0,1.0
4,1784265,202054,Cash loans,31924.395,337500.0,404055.0,,337500.0,THURSDAY,9,...,XNA,24.0,high,Cash Street: high,,,,,,


Number of rows in previous_application: 1670214
Number of columns in previous_application: 37


**POS CASH balance**

In [42]:
POS_CASH_balance = pd.read_pickle('data_pkl/POS_CASH_balance.pkl')
display(POS_CASH_balance.head())
POS_CASH_balance_shape = POS_CASH_balance.shape
print(f"Number of rows in POS_CASH_balance: {POS_CASH_balance_shape[0]}\nNumber of "
      f"columns in POS_CASH_balance: {POS_CASH_balance_shape[1]}")

Unnamed: 0,SK_ID_PREV,SK_ID_CURR,MONTHS_BALANCE,CNT_INSTALMENT,CNT_INSTALMENT_FUTURE,NAME_CONTRACT_STATUS,SK_DPD,SK_DPD_DEF
0,1803195,182943,-31,48.0,45.0,Active,0,0
1,1715348,367990,-33,36.0,35.0,Active,0,0
2,1784872,397406,-32,12.0,9.0,Active,0,0
3,1903291,269225,-35,48.0,42.0,Active,0,0
4,2341044,334279,-35,36.0,35.0,Active,0,0


Number of rows in POS_CASH_balance: 10001358
Number of columns in POS_CASH_balance: 8


**Installments Payments**

In [43]:
installments_payments = pd.read_pickle('data_pkl/installments_payments.pkl')
display(installments_payments.head())
installments_payments_shape = installments_payments.shape
print(f"Number of rows in installments_payments: {installments_payments_shape[0]}\nNumber of "
      f"columns in installments_payments: {installments_payments_shape[1]}")

Unnamed: 0,SK_ID_PREV,SK_ID_CURR,NUM_INSTALMENT_VERSION,NUM_INSTALMENT_NUMBER,DAYS_INSTALMENT,DAYS_ENTRY_PAYMENT,AMT_INSTALMENT,AMT_PAYMENT
0,1054186,161674,1.0,6,-1180.0,-1187.0,6948.36,6948.36
1,1330831,151639,0.0,34,-2156.0,-2156.0,1716.525,1716.525
2,2085231,193053,2.0,1,-63.0,-63.0,25425.0,25425.0
3,2452527,199697,1.0,3,-2418.0,-2426.0,24350.13,24350.13
4,2714724,167756,1.0,2,-1383.0,-1366.0,2165.04,2160.585


Number of rows in installments_payments: 13605401
Number of columns in installments_payments: 8


**Credit Card Balance**

In [44]:
credit_card_balance = pd.read_pickle('data_pkl/credit_card_balance.pkl')
display(credit_card_balance.head())
credit_card_balance_shape = credit_card_balance.shape
print(f"Number of rows in credit_card_balance: {credit_card_balance_shape[0]}\nNumber of "
      f"columns in credit_card_balance: {credit_card_balance_shape[1]}")

Unnamed: 0,SK_ID_PREV,SK_ID_CURR,MONTHS_BALANCE,AMT_BALANCE,AMT_CREDIT_LIMIT_ACTUAL,AMT_DRAWINGS_ATM_CURRENT,AMT_DRAWINGS_CURRENT,AMT_DRAWINGS_OTHER_CURRENT,AMT_DRAWINGS_POS_CURRENT,AMT_INST_MIN_REGULARITY,...,AMT_RECIVABLE,AMT_TOTAL_RECEIVABLE,CNT_DRAWINGS_ATM_CURRENT,CNT_DRAWINGS_CURRENT,CNT_DRAWINGS_OTHER_CURRENT,CNT_DRAWINGS_POS_CURRENT,CNT_INSTALMENT_MATURE_CUM,NAME_CONTRACT_STATUS,SK_DPD,SK_DPD_DEF
0,2562384,378907,-6,56.97,135000,0.0,877.5,0.0,877.5,1700.325,...,0.0,0.0,0.0,1,0.0,1.0,35.0,Active,0,0
1,2582071,363914,-1,63975.555,45000,2250.0,2250.0,0.0,0.0,2250.0,...,64875.555,64875.555,1.0,1,0.0,0.0,69.0,Active,0,0
2,1740877,371185,-7,31815.225,450000,0.0,0.0,0.0,0.0,2250.0,...,31460.085,31460.085,0.0,0,0.0,0.0,30.0,Active,0,0
3,1389973,337855,-4,236572.11,225000,2250.0,2250.0,0.0,0.0,11795.76,...,233048.97,233048.97,1.0,1,0.0,0.0,10.0,Active,0,0
4,1891521,126868,-1,453919.455,450000,0.0,11547.0,0.0,11547.0,22924.89,...,453919.455,453919.455,0.0,1,0.0,1.0,101.0,Active,0,0


Number of rows in credit_card_balance: 3840312
Number of columns in credit_card_balance: 23


## 1.3. Data Cleaning
Objective: Before further analysis, the tables need to be split into 
train/test. Application_train and application_test are already split, but a 
quick data cleaning is essential to understand how the rest can be split.

**Duplicates**

In [None]:
application_test.duplicated().any()

In [None]:
application_train.duplicated().any()

In [None]:
sk_id_curr_train = application_train['SK_ID_CURR']
sk_id_curr_test = application_test['SK_ID_CURR']

duplicate_ids = set(sk_id_curr_train).intersection(set(sk_id_curr_test))
duplicate_ids

No duplicate rows were found, and no SK_ID_CURR is shared between the two 
tables. SK_ID_CURR from the application tables can therefore be used to 
distinguish loans for training from loans for test.
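The row-level `duplicated()` checks above do not by themselves show that the key column is unique. As a complementary sketch, the helper below checks key uniqueness and train/test overlap together; `check_split_keys` and the toy frames are hypothetical and only stand in for the real application tables.

```python
import pandas as pd


def check_split_keys(train, test, key="SK_ID_CURR"):
    """Report whether `key` is unique in each table and shared between them."""
    overlap = set(train[key]) & set(test[key])
    return {
        "train_unique": bool(train[key].is_unique),
        "test_unique": bool(test[key].is_unique),
        "n_overlap": len(overlap),
    }


# Toy frames standing in for application_train / application_test.
toy_train = pd.DataFrame({"SK_ID_CURR": [100002, 100003, 100004]})
toy_test = pd.DataFrame({"SK_ID_CURR": [100001, 100005]})
key_report = check_split_keys(toy_train, toy_test)
print(key_report)
```

On the real data the same call would be `check_split_keys(application_train, application_test)`.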

**Check for Missing Values**<br>
Certain tables may be abandoned if they contain too many missing values.

In [None]:
dataframes = {
      'application_train': application_train, 
      'application_test': application_test, 
      'bureau': bureau,
      'bureau_balance': bureau_balance,
      'credit_card_balance': credit_card_balance, 
      'installments_payments': installments_payments,
      'POS_CASH_balance': POS_CASH_balance, 
      'previous_application': previous_application
}

for df_name, df in dataframes.items():
      print(df_name)
      # Show the columns with the most missing values first.
      display(missing_values(df).sort_values(by='Missing Values',
                                             ascending=False))

Some features have a large share of missing values. However, no single 
table has so many missing values that it needs to be discarded as a whole.<br>
For modeling, both tree-based models (random forest, XGBoost, LightGBM) and 
non-tree-based models (logistic regression, naive Bayes) will be tested. 
Non-tree-based models require complete data. For tree-based models, 
imputation and feature removal will still be considered for features with 
more than 40% missing data.
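The `missing_values` helper is imported from `utils.eda` and its source is not shown here. The sketch below is one possible shape for such a helper, assuming it returns a 'Missing Values' column as used in the sort above; the function name `missing_summary`, the '% of Total' and 'Above Threshold' columns, and the toy data are illustrative assumptions.

```python
import numpy as np
import pandas as pd


def missing_summary(df, threshold=40.0):
    """Count and percentage of missing values per column, worst first."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "Missing Values": counts,
        "% of Total": 100 * counts / len(df),
    })
    # Flag columns whose missing share exceeds the 40% rule of thumb.
    summary["Above Threshold"] = summary["% of Total"] > threshold
    return summary.sort_values("Missing Values", ascending=False)


# Toy data with invented missingness, for illustration only.
toy = pd.DataFrame({
    "AMT_ANNUITY": [24700.5, np.nan, 6750.0, 29686.5, 21865.5],
    "AMT_REQ_CREDIT_BUREAU_YEAR": [np.nan, np.nan, 0.0, np.nan, np.nan],
    "TARGET": [1, 0, 0, 0, 0],
})
toy_summary = missing_summary(toy)
print(toy_summary)
```

Columns flagged in 'Above Threshold' are the candidates for imputation or removal discussed above.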

## 1.4. Dataframes and Keys Information

In [81]:
dataframes = {
    'application_train': application_train,
    'application_test': application_test,
    'bureau': bureau,
    'bureau_balance': bureau_balance,
    'credit_card_balance': credit_card_balance,
    'installments_payments': installments_payments,
    'POS_CASH_balance': POS_CASH_balance,
    'previous_application': previous_application
}

keys_to_check = ['SK_ID_CURR', 'SK_ID_PREV', 'SK_ID_BUREAU']
results = []

for table_name, df in dataframes.items():
    row = {'Table': table_name, 'Total_Rows': len(df)}
    
    for key in keys_to_check:
        row[key] = df[key].nunique() if key in df.columns else None
    results.append(row)

key_counts_df = pd.DataFrame(results)
key_counts_df

Unnamed: 0,Table,Total_Rows,SK_ID_CURR,SK_ID_PREV,SK_ID_BUREAU
0,application_train,307511,307511.0,,
1,application_test,48744,48744.0,,
2,bureau,1716428,305811.0,,1716428.0
3,bureau_balance,27299925,,,817395.0
4,credit_card_balance,3840312,103558.0,104307.0,
5,installments_payments,13605401,339587.0,997752.0,
6,POS_CASH_balance,10001358,337252.0,936325.0,
7,previous_application,1670214,338857.0,1670214.0,


Total SK_ID_CURR (loan IDs): 356255<br>
Total SK_ID_BUREAU (bureau loan IDs): 1716428, from 305811 loan IDs. 
About 14.16% of SK_ID_CURR do not have bureau info.<br>
Total SK_ID_PREV (previous loan IDs): 1670214, from 338857 loan IDs. 
About 4.88% of SK_ID_CURR do not have previous loan info.<br><br>
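The percentages above follow directly from the key counts in the table. The short calculation below reproduces them, assuming (as the summary does) that every SK_ID_CURR found in *bureau* and *previous_application* also appears in one of the two application tables.

```python
# Counts taken from the key table above.
n_train, n_test = 307511, 48744
total_ids = n_train + n_test  # all SK_ID_CURR across train and test

ids_with_bureau = 305811     # distinct SK_ID_CURR in bureau
ids_with_previous = 338857   # distinct SK_ID_CURR in previous_application


def pct_missing(n_present, n_total):
    """Percentage of IDs that have no rows in a secondary table."""
    return 100 * (1 - n_present / n_total)


print(f"Total loan IDs: {total_ids}")
print(f"Without bureau info: {pct_missing(ids_with_bureau, total_ids):.2f}%")
print(f"Without previous loans: {pct_missing(ids_with_previous, total_ids):.2f}%")
```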
    
**Tables with unique keys:**<br>
    1. application_train: SK_ID_CURR<br>
    2. application_test: SK_ID_CURR <br>
    3. bureau: SK_ID_BUREAU. <i>Each SK_ID_CURR can correspond to 
    several SK_ID_BUREAU, and not all SK_ID_CURR are present in this 
    dataframe.</i><br>
    4. previous_application: SK_ID_PREV. <i>Each SK_ID_CURR can correspond to 
    several SK_ID_PREV, and not all SK_ID_CURR are present in this 
    dataframe.</i><br><br>
    
**Tables without unique keys:**<br>
    1. bureau_balance: SK_ID_BUREAU corresponds to several rows. Not all 
    SK_ID_BUREAU are present in this dataframe.<br>
    2. credit_card_balance: SK_ID_PREV corresponds to several rows. Not all 
    SK_ID_PREV are present in this dataframe.<br>
    3. installments_payments: SK_ID_PREV corresponds to several rows. Not 
    all SK_ID_PREV are present in this dataframe.<br>
    4. POS_CASH_balance: SK_ID_PREV corresponds to several rows. Not all 
    SK_ID_PREV are present in this dataframe.<br>

## 1.5. Main Data Preparation
Objective: <br>
1. Split the data into train/test to avoid data leakage. 
2. Aggregate all tables' information into the main tables.

*bureau_balance* does not have SK_ID_CURR, so it cannot be split directly 
into train/test by that key. However, its rows are identified by 
SK_ID_BUREAU, so it can easily be joined with *bureau* and then split. The 
same method can also be used for *credit_card_balance*, 
*installments_payments*, and *POS_CASH_balance*.<br><br>
To join tables without unique keys to *bureau* and *previous_application*, 
they need to be grouped by SK_ID_BUREAU and SK_ID_PREV, respectively. Some 
aggregation methods are mean, sum, std, max, and in our case prediction.
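The group-then-join idea can be sketched on toy data as follows. The toy frames, the chosen aggregations, and the output column names (`BB_MONTHS_*`, `BUREAU_*`) are assumptions for illustration; they are not the aggregations finally used in this project.

```python
import pandas as pd

# Toy stand-ins for bureau_balance and bureau.
toy_bureau_balance = pd.DataFrame({
    "SK_ID_BUREAU": [1, 1, 1, 2, 2],
    "MONTHS_BALANCE": [0, -1, -2, 0, -1],
})
toy_bureau = pd.DataFrame({
    "SK_ID_CURR": [100, 100, 200],
    "SK_ID_BUREAU": [1, 2, 3],
    "AMT_CREDIT_SUM": [1000.0, 500.0, 250.0],
})

# Step 1: collapse bureau_balance to one row per SK_ID_BUREAU.
bb_agg = (
    toy_bureau_balance.groupby("SK_ID_BUREAU")["MONTHS_BALANCE"]
    .agg(["count", "min"])
    .add_prefix("BB_MONTHS_")
    .reset_index()
)

# Step 2: a left join keeps bureau credits that have no balance history.
bureau_joined = toy_bureau.merge(bb_agg, on="SK_ID_BUREAU", how="left")

# Step 3: collapse to one row per SK_ID_CURR, ready to join to the main table.
bureau_per_loan = (
    bureau_joined.groupby("SK_ID_CURR")
    .agg(
        BUREAU_CREDIT_COUNT=("SK_ID_BUREAU", "count"),
        BUREAU_AMT_CREDIT_SUM_MEAN=("AMT_CREDIT_SUM", "mean"),
        BUREAU_BB_MONTHS_COUNT_SUM=("BB_MONTHS_count", "sum"),
    )
    .reset_index()
)
print(bureau_per_loan)
```

The same three steps would apply to the SK_ID_PREV tables, with *previous_application* in place of *bureau*.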

**Split All Data into Train Test files**

In [48]:
dataframes = {
      'application_train': application_train, 
      'application_test': application_test, 
      'bureau': bureau,
      'bureau_balance': bureau_balance,
      'credit_card_balance': credit_card_balance, 
      'installments_payments': installments_payments,
      'POS_CASH_balance': POS_CASH_balance, 
      'previous_application': previous_application
}

sk_id_curr_train = application_train['SK_ID_CURR']
sk_id_curr_test = application_test['SK_ID_CURR']

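# Split each secondary table by membership of SK_ID_CURR in the train/test
# application IDs. The application tables are already split, and
# bureau_balance has no SK_ID_CURR (join it to bureau on SK_ID_BUREAU first).
train_tables, test_tables = {}, {}
for dfname, df in dataframes.items():
      if dfname.startswith('application') or 'SK_ID_CURR' not in df.columns:
            continue
      train_tables[dfname] = df[df['SK_ID_CURR'].isin(sk_id_curr_train)]
      test_tables[dfname] = df[df['SK_ID_CURR'].isin(sk_id_curr_test)]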

**Working notes and open questions**<br>
1. Split now, aggregate later.
2. How to aggregate? Look into the features and into SK_ID_CURR and 
SK_ID_PREV.
3. Are all tables perfectly divided into train and test SK_ID_CURR?
4. Should the tables be divided by SK_ID_CURR or SK_ID_PREV?

**How to split the tables?**<br>
1. The application tables do not need splitting.
2. Split *previous_application* and *bureau* by SK_ID_CURR.
3. Aggregate *bureau_balance* to *bureau* first (mean, min, max, sum, std).
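Whether a table divides cleanly between the train and test SK_ID_CURR can be checked before committing to a split strategy. The helper below is a hypothetical sketch that counts how many distinct IDs of a secondary table fall in train, test, both, or neither; the toy IDs are for illustration only.

```python
import pandas as pd


def split_coverage(child, train_ids, test_ids, key="SK_ID_CURR"):
    """Count distinct `key` values of `child` by where they appear."""
    ids = set(child[key])
    train_ids, test_ids = set(train_ids), set(test_ids)
    return {
        "train": len(ids & train_ids),
        "test": len(ids & test_ids),
        "both": len(ids & train_ids & test_ids),
        "neither": len(ids - train_ids - test_ids),
    }


# Toy secondary table: one train ID (repeated), one test ID, one unknown ID.
toy_child = pd.DataFrame({"SK_ID_CURR": [100002, 100002, 100001, 999999]})
coverage = split_coverage(toy_child, [100002, 100003], [100001, 100005])
print(coverage)
```

A table splits cleanly when both 'both' and 'neither' are zero.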

In [54]:
application_test['SK_ID_CURR'].nunique()

48744

In [55]:
application_test.shape

(48744, 121)

# Improvements

1. Understand and identify the subpopulation of Home Credit clients who have 
no difficulty with repayment but were not given loans by other institutions. 
This requires collecting loan application data from other banks.