# Home Credit Prediction: Data Cleaning of Installments Payments Table


For an up-to-date version and a full view of the Plotly plots, see:

Data Cleaning - Installment Payments:  https://drive.google.com/file/d/17QxdLEpcFDgRFi9W28VSJgVFDU6cPLi9/view?usp=sharing

List of all notebooks and resources for this project: https://drive.google.com/file/d/1Z8vPNZAcivWOxeh3UKFfeARbQCMkQ_NR/view?usp=sharing

## Import Modules

In [None]:
%%capture
#! pip install -q pingouin
! pip install -q scikit-optimize

In [None]:
import numpy as np
import pandas as pd

import sys
import os
import warnings
from importlib import reload

from dask import dataframe as dd
#import matplotlib.pyplot as plt
#import seaborn as sns
#import plotly.express as px

from google.colab import drive
drive.mount("/content/gdrive")

warnings.filterwarnings('ignore')

pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns', None)
#pd.reset_option('display.max_rows')

Dask dataframe query planning is disabled because dask-expr is not installed.

You can install it with `pip install dask[dataframe]` or `conda install dask`.
This will raise in a future version.



Mounted at /content/gdrive


In [None]:
home_folder = '/content/gdrive/MyDrive/Colab Notebooks/Portfolio/ML_HomeCredit_DefaultRiskEvaluation/'

### Functions

The Python file with the helper functions is available at
https://drive.google.com/file/d/17IchsTGy2QI9sq0LTIvGvxAk2mrWs4Xz/view?usp=sharing

In [None]:
%load_ext autoreload
%autoreload 2

sys.path.append(home_folder)
import driskfunc as dfunc

# 1. Load and Update Data

Data source: https://storage.googleapis.com/341-home-credit-default/home-credit-default-risk.zip

Description: https://storage.googleapis.com/341-home-credit-default/Home%20Credit%20Default%20Risk.pdf

In [None]:
HCdescr = pd.read_csv(home_folder+'data/HomeCredit_columns_description.csv', encoding='latin1') #, dtype=dtype)


In [None]:
HCdescr.loc[HCdescr.Table == 'installments_payments.csv']

Unnamed: 0.1,Unnamed: 0,Table,Row,Description,Special
211,214,installments_payments.csv,SK_ID_PREV,"ID of previous credit in Home credit related to loan in our sample. (One loan in our sample can have 0,1,2 or more previous loans in Home Credit)",hashed
212,215,installments_payments.csv,SK_ID_CURR,ID of loan in our sample,hashed
213,216,installments_payments.csv,NUM_INSTALMENT_VERSION,Version of installment calendar (0 is for credit card) of previous credit. Change of installment version from month to month signifies that some parameter of payment calendar has changed,
214,217,installments_payments.csv,NUM_INSTALMENT_NUMBER,On which installment we observe payment,
215,218,installments_payments.csv,DAYS_INSTALMENT,When the installment of previous credit was supposed to be paid (relative to application date of current loan),time only relative to the application
216,219,installments_payments.csv,DAYS_ENTRY_PAYMENT,When was the installments of previous credit paid actually (relative to application date of current loan),time only relative to the application
217,220,installments_payments.csv,AMT_INSTALMENT,What was the prescribed installment amount of previous credit on this installment,
218,221,installments_payments.csv,AMT_PAYMENT,What the client actually paid on previous credit on this installment,


In [None]:
csv_ip = home_folder+'data/installments_payments.csv'
HCapp_ip = dd.read_csv(csv_ip)

df = HCapp_ip
df_name = 'HCapp installments payments'

In [None]:
df.npartitions

11

In [None]:
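# First pass: latest installment first within each previous credit.
df = df.sort_values(by=['SK_ID_PREV', 'NUM_INSTALMENT_NUMBER'], ascending=False)
# Second pass: order by client and previous credit. The within-credit order from the
# first pass is kept only if this second sort happens to preserve it.
df = df.sort_values(by=['SK_ID_CURR','SK_ID_PREV'], ascending=True)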

In [None]:
df.head(50)

Unnamed: 0,SK_ID_PREV,SK_ID_CURR,NUM_INSTALMENT_VERSION,NUM_INSTALMENT_NUMBER,DAYS_INSTALMENT,DAYS_ENTRY_PAYMENT,AMT_INSTALMENT,AMT_PAYMENT
150414,1369693,100001,2.0,4,-1619.0,-1628.0,17397.9,17397.9
985102,1369693,100001,1.0,3,-1649.0,-1660.0,3951.0,3951.0
95112,1369693,100001,1.0,2,-1679.0,-1715.0,3951.0,3951.0
241821,1369693,100001,1.0,1,-1709.0,-1715.0,3951.0,3951.0
961763,1851984,100001,1.0,4,-2856.0,-2856.0,3980.925,3980.925
63666,1851984,100001,1.0,3,-2886.0,-2875.0,3982.05,3982.05
524212,1851984,100001,1.0,2,-2916.0,-2916.0,3982.05,3982.05
442432,1038818,100002,2.0,19,-25.0,-49.0,53093.745,53093.745
153711,1038818,100002,1.0,18,-55.0,-67.0,9251.775,9251.775
916044,1038818,100002,1.0,17,-85.0,-99.0,9251.775,9251.775


In [None]:
size_df = [df.shape[0].compute(),  df.shape[1]]

print('The dataset', df_name, 'has', size_df[0], 'rows and', size_df[1], 'features.')

The dataset HCapp installments payments has 13605401 rows and 8 features.


# 2. Data Cleaning

* Handling missing values.
* Removing duplicate samples and features.
* Removing unnecessary columns/rows.
* Treating (here, rather, checking) the outliers.

## Check Missing Values and Duplicates

Overview of NaN counts and data types:

In [None]:
dfunc.count_dtypes(df, name = df_name)


The dataset HCapp installments payments has:
5 features of type float64.
3 features of type int64.


In [None]:
%%time
%reload_ext autoreload

nan_overview_df = dfunc.nan_type_overview_dd(df, size_df[0])
nan_overview_df.round(1).style.background_gradient(cmap="Blues")

CPU times: user 43.9 s, sys: 3.64 s, total: 47.6 s
Wall time: 32.3 s


Unnamed: 0,type,NaN[abs],NaN[%]
SK_ID_PREV,int64,0,0.0
SK_ID_CURR,int64,0,0.0
NUM_INSTALMENT_VERSION,float64,0,0.0
NUM_INSTALMENT_NUMBER,int64,0,0.0
DAYS_INSTALMENT,float64,0,0.0
DAYS_ENTRY_PAYMENT,float64,2905,0.0
AMT_INSTALMENT,float64,0,0.0
AMT_PAYMENT,float64,2905,0.0
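
The helper `dfunc.nan_type_overview_dd` is not shown in this notebook. As a rough idea of what such an overview computes, here is a minimal pandas sketch; the function name `nan_overview` and the toy rows are assumptions made for illustration, not the author's implementation. It also counts the rows where both payment columns are missing, since the identical NaN counts above (2905 each) suggest, but do not prove, that the gaps sit in the same rows.

```python
import numpy as np
import pandas as pd


def nan_overview(frame: pd.DataFrame) -> pd.DataFrame:
    """Return dtype, absolute NaN count and NaN percentage per column."""
    n_rows = len(frame)
    nan_abs = frame.isna().sum()
    return pd.DataFrame({
        "type": frame.dtypes.astype(str),
        "NaN[abs]": nan_abs,
        "NaN[%]": 100.0 * nan_abs / n_rows,
    })


# Toy rows modelled on the table; the NaN pair in the second row is invented.
toy = pd.DataFrame({
    "SK_ID_PREV": [1369693, 1369693, 1851984],
    "DAYS_ENTRY_PAYMENT": [-1628.0, np.nan, -2856.0],
    "AMT_PAYMENT": [17397.9, np.nan, 3980.925],
})

overview = nan_overview(toy)
print(overview)

# Number of rows in which both payment columns are missing at the same time.
both_missing = int((toy["DAYS_ENTRY_PAYMENT"].isna() & toy["AMT_PAYMENT"].isna()).sum())
print("rows with both payment fields missing:", both_missing)
```
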


### Duplicates Check

In [None]:
%reload_ext autoreload

df_dup = dfunc.get_dup_dd(df, name=df_name, size=size_df[0])

Total number of duplicates in " HCapp installments payments " : 0 ( 0.0 %).
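
The helper `dfunc.get_dup_dd` is likewise not shown. The sketch below illustrates the kind of check it reports, using plain pandas on toy data; the third row is an invented duplicate and the variable names are hypothetical.

```python
import pandas as pd

# Toy rows modelled on contract 1851984; the last row is an invented exact duplicate.
toy = pd.DataFrame({
    "SK_ID_PREV": [1851984, 1851984, 1851984],
    "NUM_INSTALMENT_NUMBER": [4, 3, 3],
    "AMT_PAYMENT": [3980.925, 3982.05, 3982.05],
})

# A row counts as a duplicate if all of its values repeat an earlier row.
dup_mask = toy.duplicated(keep="first")
n_dup = int(dup_mask.sum())
share = 100.0 * n_dup / len(toy)
print(f"Total number of duplicates: {n_dup} ({share:.1f} %)")
```

On a Dask dataframe, comparing the row count before and after `drop_duplicates()` is one way to obtain the same number.
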


## Other Checks and Modifications

#### Overview

In [None]:
df.describe().compute().T.round(1)

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
SK_ID_PREV,13605401.0,1903365.0,536202.9,1000001.0,1438816.0,1903080.0,2372168.0,2843499.0
SK_ID_CURR,13605401.0,278444.9,102718.3,100001.0,180579.0,276847.0,362551.0,456255.0
NUM_INSTALMENT_VERSION,13605401.0,0.9,1.0,0.0,0.0,1.0,1.0,178.0
NUM_INSTALMENT_NUMBER,13605401.0,18.9,26.7,1.0,4.0,8.0,20.0,277.0
DAYS_INSTALMENT,13605401.0,-1042.3,800.9,-2922.0,-1631.0,-807.0,-355.0,-1.0
DAYS_ENTRY_PAYMENT,13602496.0,-1051.1,800.6,-4921.0,-1640.0,-816.0,-364.0,-1.0
AMT_INSTALMENT,13605401.0,17050.9,50570.3,0.0,4356.5,8981.8,16897.3,3771487.8
AMT_PAYMENT,13602496.0,17238.2,54735.8,0.0,3539.1,8226.6,16272.4,3771487.8


#### NUM_INSTALMENT_VERSION

In [None]:
inst_vers_counts = df.NUM_INSTALMENT_VERSION.value_counts().compute()

In [None]:
inst_vers_counts[:10]

NUM_INSTALMENT_VERSION,count
1.0,8485004
0.0,4082498
2.0,620283
3.0,237063
4.0,55274
5.0,48404
6.0,17092
7.0,16771
9.0,8359
8.0,7814


In [None]:
len(inst_vers_counts), inst_vers_counts.keys().max()

(65, 178.0)

Findings from the inspection in the subsection below: NUM_INSTALMENT_VERSION is incremented by 1 whenever AMT_INSTALMENT changes. The maximum of NUM_INSTALMENT_VERSION per SK_ID_PREV therefore reflects the number of changes in the payment history of a contract (one contract = one SK_ID_PREV).

However, for NUM_INSTALMENT_VERSION = 0.0 (credit cards, according to the column description), AMT_INSTALMENT changes often and without a clear pattern (possibly determined already at contract start?).
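
One way to probe the relation between version and installment amount is to count the distinct amounts per contract and version. The sketch below does this in pandas on the four rows of contract 1369693 shown in the `head()` output earlier; it is an illustrative check on a single contract, not a verification over the full table.

```python
import pandas as pd

# Rows of contract 1369693 as they appear in the head() output.
toy = pd.DataFrame({
    "SK_ID_PREV": [1369693] * 4,
    "NUM_INSTALMENT_VERSION": [2.0, 1.0, 1.0, 1.0],
    "NUM_INSTALMENT_NUMBER": [4, 3, 2, 1],
    "AMT_INSTALMENT": [17397.9, 3951.0, 3951.0, 3951.0],
})

# If the version tracks the amount, each (contract, version) pair has one amount.
amounts_per_version = (
    toy.groupby(["SK_ID_PREV", "NUM_INSTALMENT_VERSION"])["AMT_INSTALMENT"]
       .nunique()
)

# Number of distinct calendar versions per contract.
versions_per_contract = toy.groupby("SK_ID_PREV")["NUM_INSTALMENT_VERSION"].nunique()

print(amounts_per_version)
print(versions_per_contract)
```

Pairs with more than one distinct amount would be the cases that contradict the rule, which is what the text reports for version 0.0.
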

###### Installment Version Inspection/Experiments

In [None]:
df.loc[df.NUM_INSTALMENT_VERSION == 0.0].compute().head(100)

Unnamed: 0,SK_ID_PREV,SK_ID_CURR,NUM_INSTALMENT_VERSION,NUM_INSTALMENT_NUMBER,DAYS_INSTALMENT,DAYS_ENTRY_PAYMENT,AMT_INSTALMENT,AMT_PAYMENT
831320,1843384,100011,0.0,76,-37.0,-37.0,563.355,563.355
1198074,1843384,100011,0.0,75,-68.0,-68.0,563.355,563.355
731667,1843384,100011,0.0,74,-99.0,-99.0,563.355,563.355
1106537,1843384,100011,0.0,73,-129.0,-129.0,563.355,563.355
475838,1843384,100011,0.0,72,-160.0,-160.0,563.355,563.355
752807,1843384,100011,0.0,71,-190.0,-190.0,563.355,563.355
783722,1843384,100011,0.0,70,-221.0,-221.0,563.355,563.355
1180063,1843384,100011,0.0,69,-249.0,-249.0,563.355,563.355
108703,1843384,100011,0.0,68,-280.0,-280.0,563.355,563.355
147419,1843384,100011,0.0,67,-311.0,-311.0,563.355,563.355


In [None]:
df.loc[df.SK_ID_PREV == 2038692].compute()

Unnamed: 0,SK_ID_PREV,SK_ID_CURR,NUM_INSTALMENT_VERSION,NUM_INSTALMENT_NUMBER,DAYS_INSTALMENT,DAYS_ENTRY_PAYMENT,AMT_INSTALMENT,AMT_PAYMENT
817877,2038692,100013,0.0,113,-14.0,-14.0,274.32,274.32
35785,2038692,100013,0.0,112,-45.0,-45.0,274.32,274.32
1097485,2038692,100013,0.0,111,-75.0,-75.0,274.32,274.32
1116495,2038692,100013,0.0,110,-106.0,-106.0,274.32,274.32
784819,2038692,100013,0.0,109,-134.0,-134.0,274.32,274.32
484122,2038692,100013,0.0,108,-165.0,-165.0,274.32,274.32
87373,2038692,100013,0.0,107,-196.0,-196.0,274.32,274.32
715913,2038692,100013,0.0,106,-226.0,-226.0,274.32,274.32
1192829,2038692,100013,0.0,105,-257.0,-257.0,274.32,274.32
718388,2038692,100013,0.0,104,-287.0,-287.0,274.32,274.32


#### Difference between planned and actual payment per installment

In [None]:
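# Payment delay in days: positive = paid after the due date, negative = paid early.
df['DAYS_INSTALMENT_DIFF'] = df['DAYS_ENTRY_PAYMENT'] - df['DAYS_INSTALMENT']
# Binary flag: 1 if the payment was late, 0 otherwise (NaN stays NaN).
df_days_diff_pos = df['DAYS_INSTALMENT_DIFF']
df_days_diff_pos = df_days_diff_pos.mask((df_days_diff_pos>0.), 1)
df_days_diff_pos = df_days_diff_pos.mask((df_days_diff_pos<=0.), 0)

df['DAYS_INSTALMENT_DIFF_pos'] = df_days_diff_pos

# Amount difference: positive = paid more than prescribed, negative = paid less.
df['AMT_INSTALMENT_DIFF'] = df['AMT_PAYMENT'] - df['AMT_INSTALMENT']

# Binary flag: 1 if more than the prescribed amount was paid, 0 otherwise (NaN stays NaN).
df_amt_diff_pos = df['AMT_INSTALMENT_DIFF']
df_amt_diff_pos = df_amt_diff_pos.mask((df_amt_diff_pos>0.), 1)
df_amt_diff_pos = df_amt_diff_pos.mask((df_amt_diff_pos<=0.), 0)
df['AMT_INSTALMENT_DIFF_pos'] = df_amt_diff_pos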

In [None]:
df.loc[df.SK_ID_PREV == 1851984].compute()

Unnamed: 0,SK_ID_PREV,SK_ID_CURR,NUM_INSTALMENT_VERSION,NUM_INSTALMENT_NUMBER,DAYS_INSTALMENT,DAYS_ENTRY_PAYMENT,AMT_INSTALMENT,AMT_PAYMENT,DAYS_INSTALMENT_DIFF,DAYS_INSTALMENT_DIFF_pos,AMT_INSTALMENT_DIFF,AMT_INSTALMENT_DIFF_pos
961763,1851984,100001,1.0,4,-2856.0,-2856.0,3980.925,3980.925,0.0,0.0,0.0,0.0
63666,1851984,100001,1.0,3,-2886.0,-2875.0,3982.05,3982.05,11.0,1.0,0.0,0.0
524212,1851984,100001,1.0,2,-2916.0,-2916.0,3982.05,3982.05,0.0,0.0,0.0,0.0
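
The two-step `mask` construction above can also be written as a single comparison. The pandas sketch below uses the three rows of contract 1851984 plus one invented row with a missing payment date, and checks that both variants give the same flag; it is an alternative formulation for illustration, not the code used in this notebook.

```python
import numpy as np
import pandas as pd

# First three rows: contract 1851984 from the output above.
# Fourth row: invented, to show how a missing payment date is handled.
toy = pd.DataFrame({
    "DAYS_INSTALMENT": [-2856.0, -2886.0, -2916.0, -100.0],
    "DAYS_ENTRY_PAYMENT": [-2856.0, -2875.0, -2916.0, np.nan],
})

diff = toy["DAYS_ENTRY_PAYMENT"] - toy["DAYS_INSTALMENT"]

# Variant used in the notebook: two successive masks.
masked = diff.mask(diff > 0, 1).mask(diff <= 0, 0)

# Single comparison; where() restores NaN for rows without a payment date.
late_flag = (diff > 0).astype(float).where(diff.notna())

print(pd.DataFrame({"diff": diff, "masked": masked, "late_flag": late_flag}))
```

Keeping NaN instead of forcing it to 0 means that a missing payment is not silently counted as "paid on time".
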


## Aggregate by SK_ID_PREV

In [None]:
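# SK_ID_CURR is assumed constant per SK_ID_PREV, so 'mean' only carries the ID through (as float64).
install_agg = {'SK_ID_CURR': ['mean'],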
               'NUM_INSTALMENT_NUMBER': ['max', 'min'],
               'NUM_INSTALMENT_VERSION': ['max', 'min'],
               'DAYS_INSTALMENT_DIFF': ['mean'],
               'AMT_INSTALMENT_DIFF': ['mean'],
               'DAYS_INSTALMENT_DIFF_pos': ['sum'],
               'AMT_INSTALMENT_DIFF_pos': ['sum'],
               }

df_install_agg = df.groupby("SK_ID_PREV").agg(install_agg)
df_install_agg.columns = df_install_agg.columns.map('_'.join).str.strip('_')

df_final = df_install_agg

df_final = df_final.rename(columns={'SK_ID_CURR_mean': 'SK_ID_CURR'})
df_final = df_final.reset_index()
df_final = df_final.sort_values(by='SK_ID_CURR')

df_final.round(2).head(5)

Unnamed: 0,SK_ID_PREV,SK_ID_CURR,NUM_INSTALMENT_NUMBER_max,NUM_INSTALMENT_NUMBER_min,NUM_INSTALMENT_VERSION_max,NUM_INSTALMENT_VERSION_min,DAYS_INSTALMENT_DIFF_mean,AMT_INSTALMENT_DIFF_mean,DAYS_INSTALMENT_DIFF_pos_sum,AMT_INSTALMENT_DIFF_pos_sum
0,1369693,100001.0,4,1,2.0,1.0,-15.5,0.0,0.0,0.0
1,1851984,100001.0,4,2,1.0,1.0,3.67,0.0,1.0,0.0
2,1038818,100002.0,19,1,2.0,1.0,-20.42,0.0,0.0,0.0
3,1810518,100003.0,7,1,2.0,1.0,-4.43,0.0,0.0,0.0
4,2396755,100003.0,12,1,1.0,1.0,-6.75,0.0,0.0,0.0


In [None]:
size_df_final = [df_final.shape[0].compute(),  df_final.shape[1]]

print('The condensed dataset has', size_df_final[0], 'rows and', size_df_final[1], 'features.')
print('Initial size was', size_df[0], 'rows and', size_df[1], 'features.')

The condensed dataset has 997752 rows and 10 features.
Initial size was 13605401 rows and 8 features.


In [None]:
nan_overview_df = dfunc.nan_type_overview_dd(df_final, size_df_final[0])
nan_overview_df.round(1).style.background_gradient(cmap="Blues")

Unnamed: 0,type,NaN[abs],NaN[%]
SK_ID_PREV,int64,0,0.0
SK_ID_CURR,float64,0,0.0
NUM_INSTALMENT_NUMBER_max,int64,0,0.0
NUM_INSTALMENT_NUMBER_min,int64,0,0.0
NUM_INSTALMENT_VERSION_max,float64,0,0.0
NUM_INSTALMENT_VERSION_min,float64,0,0.0
DAYS_INSTALMENT_DIFF_mean,float64,78,0.0
AMT_INSTALMENT_DIFF_mean,float64,78,0.0
DAYS_INSTALMENT_DIFF_pos_sum,float64,0,0.0
AMT_INSTALMENT_DIFF_pos_sum,float64,0,0.0
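
The overview shows 78 NaN in the two `_mean` columns but none in the `_pos_sum` columns. A plausible explanation, stated here as an assumption that the notebook does not verify, is that these are contracts for which every payment record is missing: the mean of an all-NaN group is NaN, whereas the sum skips NaN and returns 0. The pandas sketch below reproduces this behaviour on invented data with hypothetical contract IDs.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "SK_ID_PREV": [1, 1, 2, 2],  # hypothetical contract IDs
    "DAYS_INSTALMENT_DIFF": [-9.0, -11.0, np.nan, np.nan],
    "DAYS_INSTALMENT_DIFF_pos": [0.0, 0.0, np.nan, np.nan],
})

agg = toy.groupby("SK_ID_PREV").agg(
    DAYS_INSTALMENT_DIFF_mean=("DAYS_INSTALMENT_DIFF", "mean"),
    DAYS_INSTALMENT_DIFF_pos_sum=("DAYS_INSTALMENT_DIFF_pos", "sum"),
)
print(agg)
```

If this reading holds, a `_pos_sum` of 0 mixes two cases (never late, and no payment recorded at all), which may be worth separating before modelling.
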


This condensed dataset can now be merged with the 'previous applications' dataset.
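
A minimal sketch of such a merge, assuming the previous applications table is keyed by `SK_ID_PREV`: the frame `prev` below is a hypothetical stand-in with invented `AMT_CREDIT` values and one invented contract ID (9999999), and `inst` holds the first three aggregated values from the output above.

```python
import pandas as pd

# Aggregated installment features (values taken from the output above).
inst = pd.DataFrame({
    "SK_ID_PREV": [1369693, 1851984, 1038818],
    "SK_ID_CURR": [100001, 100001, 100002],
    "DAYS_INSTALMENT_DIFF_mean": [-15.5, 3.67, -20.42],
})

# Hypothetical stand-in for the previous applications table (values invented).
prev = pd.DataFrame({
    "SK_ID_PREV": [1369693, 1038818, 9999999],
    "SK_ID_CURR": [100001, 100002, 100009],
    "AMT_CREDIT": [23787.0, 179055.0, 50000.0],
})

# Left join keeps every previous application; contracts without installment
# history get NaN features. SK_ID_CURR is dropped on one side to avoid
# duplicated _x/_y columns, and validate guards against repeated keys.
merged = prev.merge(
    inst.drop(columns="SK_ID_CURR"),
    on="SK_ID_PREV",
    how="left",
    validate="one_to_one",
)
print(merged)
```
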

# Export

In [None]:
%%capture
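# Create the output folder from Python; '! mkdir' would not evaluate the Python expression.
os.makedirs(home_folder + 'cleaned/', exist_ok=True)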
df_final.to_csv(home_folder+'cleaned/HC_installment_payments_cleaned.csv',
                 index=False, single_file = True)