# Home Credit Prediction: Data Cleaning of Bureau Balance Table


For an up-to-date version and a full view of the Plotly plots, see:

Data Cleaning - Bureau Balance: https://drive.google.com/file/d/17CWrXSq0UD59yT0LgF_VXaqc-nkVcdFM/view?usp=sharing

List of all notebooks and resources for this project: https://drive.google.com/file/d/1Z8vPNZAcivWOxeh3UKFfeARbQCMkQ_NR/view?usp=sharing

## Import Modules

In [3]:
%%capture
#! pip install -q pingouin
#! pip install -q scikit-optimize
! pip install -q scikit-optimize

In [4]:
import numpy as np
import pandas as pd

import sys
import os
import warnings
from importlib import reload

from dask import dataframe as dd
#import matplotlib.pyplot as plt
#import seaborn as sns
#import plotly.express as px

from google.colab import drive
drive.mount("/content/gdrive")

warnings.filterwarnings('ignore')

pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns', None)
#pd.reset_option('display.max_rows')

Mounted at /content/gdrive


In [5]:
home_folder = '/content/gdrive/MyDrive/Colab Notebooks/Portfolio/ML_HomeCredit_DefaultRiskEvaluation/'

### Functions

The Python file with the helper functions is available at
https://drive.google.com/file/d/17IchsTGy2QI9sq0LTIvGvxAk2mrWs4Xz/view?usp=sharing

In [6]:
%load_ext autoreload
%autoreload 2

sys.path.append(home_folder)
import driskfunc as dfunc

# 1. Load and Update Data

data source: https://storage.googleapis.com/341-home-credit-default/home-credit-default-risk.zip

description: https://storage.googleapis.com/341-home-credit-default/Home%20Credit%20Default%20Risk.pdf

In [7]:
HCdescr = pd.read_csv(home_folder+'data/HomeCredit_columns_description.csv', encoding='latin1') #, dtype=dtype)


In [8]:
HCdescr.loc[HCdescr.Table == 'bureau_balance.csv']

Unnamed: 0.1,Unnamed: 0,Table,Row,Description,Special
139,142,bureau_balance.csv,SK_BUREAU_ID,Recoded ID of Credit Bureau credit (unique coding for each application) - use this to join to CREDIT_BUREAU table,hashed
140,143,bureau_balance.csv,MONTHS_BALANCE,Month of balance relative to application date (-1 means the freshest balance date),time only relative to the application
141,144,bureau_balance.csv,STATUS,"Status of Credit Bureau loan during the month (active, closed, DPD0-30, [C means closed, X means status unknown, 0 means no DPD, 1 means maximal did during month between 1-30, 2 means DPD 31-60, 5 means DPD 120+ or sold or written off ] )",
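The STATUS codes described above can be kept in a small lookup table for later reference. The following is a minimal sketch, not part of the original notebook: `STATUS_MEANING` and `describe_status` are hypothetical names, "did" in the description is read as "DPD" (days past due), and codes 3 and 4 are not described in the source, so their labels are placeholders only.

```python
# Meanings taken from the column description above.
# Codes 3 and 4 are NOT described there; their labels are placeholders (assumption).
STATUS_MEANING = {
    'C': 'closed',
    'X': 'status unknown',
    '0': 'no DPD',
    '1': 'DPD 1-30',
    '2': 'DPD 31-60',
    '3': 'not described in the source (presumably an intermediate DPD bucket)',
    '4': 'not described in the source (presumably an intermediate DPD bucket)',
    '5': 'DPD 120+ or sold or written off',
}


def describe_status(code):
    """Return a readable meaning for a STATUS code, with a fallback for unknown codes."""
    return STATUS_MEANING.get(str(code), 'unrecognized code')


print(describe_status('C'))
print(describe_status(2))
```

Storing the keys as strings matches the `string` dtype that the STATUS column has in this dataset.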


In [9]:
csv_bureau_bal = home_folder+'data/bureau_balance.csv'
HCapp_bureau_bal = dd.read_csv(csv_bureau_bal)

df = HCapp_bureau_bal
df_name = 'HCapp bureau balance'

In [10]:
npart = df.npartitions
npart

5

In [11]:
df.head(5)

Unnamed: 0,SK_ID_BUREAU,MONTHS_BALANCE,STATUS
0,5715448,0,C
1,5715448,-1,C
2,5715448,-2,C
3,5715448,-3,C
4,5715448,-4,C


In [12]:
size_df = [df.shape[0].compute(),  df.shape[1]]
print('The dataset has', size_df[0], 'rows and', size_df[1], 'features.')

The dataset has 27299925 rows and 3 features.


# 2. Data Cleaning

* Handle missing values.
* Remove duplicate samples and features.
* Remove unnecessary columns/rows.
* Treat (here: only check) outliers.

## Check Missing Values and Duplicates

Overview of NaN counts and data types per column:

In [13]:
dfunc.count_dtypes(df, name = 'df')


The dataset df has:
2 features of type int64.
1 features of type string.


In [14]:
%%time
%reload_ext autoreload

nan_overview_df = dfunc.nan_type_overview_dd(df, size_df[0])
nan_overview_df.round(1).style.background_gradient(cmap="Blues")

CPU times: user 11.4 s, sys: 2.49 s, total: 13.9 s
Wall time: 10.3 s


Unnamed: 0,type,NaN[abs],NaN[%]
SK_ID_BUREAU,int64,0,0.0
MONTHS_BALANCE,int64,0,0.0
STATUS,string,0,0.0
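The helper `dfunc.nan_type_overview_dd` is defined in the external `driskfunc` file and is not shown here. As a rough illustration of what such an overview computes, the sketch below builds the same three columns with plain pandas on a toy frame. The function name `nan_type_overview` and the toy data are assumptions, and the real helper works on a Dask dataframe and may differ in detail.

```python
import pandas as pd


def nan_type_overview(frame):
    """Per-column dtype, absolute NaN count, and NaN percentage (hypothetical helper)."""
    n_rows = len(frame)
    nan_abs = frame.isna().sum()
    return pd.DataFrame({
        'type': frame.dtypes.astype(str),
        'NaN[abs]': nan_abs,
        'NaN[%]': nan_abs / n_rows * 100,
    })


# Toy data with one missing STATUS value (made up for illustration)
toy = pd.DataFrame({
    'SK_ID_BUREAU': [1, 1, 2, 2],
    'MONTHS_BALANCE': [0, -1, 0, -1],
    'STATUS': ['C', None, 'X', '0'],
})

overview = nan_type_overview(toy)
print(overview.round(1))
```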


### Duplicates Check

In [15]:
%reload_ext autoreload

df_dup = dfunc.get_dup_dd(df, name='df', size=size_df[0])

Total number of duplicates in " df " : 0 ( 0.0 %).
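`dfunc.get_dup_dd` is likewise defined outside this notebook. A minimal pandas sketch of an equivalent check is given below, assuming a duplicate means a row that repeats an earlier row in all columns. The name `count_duplicates` and the toy data are hypothetical.

```python
import pandas as pd


def count_duplicates(frame):
    """Return the number and percentage of fully duplicated rows (hypothetical helper)."""
    n_dup = int(frame.duplicated().sum())  # later occurrences of identical rows
    pct = n_dup / len(frame) * 100 if len(frame) else 0.0
    return n_dup, pct


# Toy data: the third row repeats the second one
toy = pd.DataFrame({
    'SK_ID_BUREAU': [1, 1, 1, 2],
    'MONTHS_BALANCE': [0, -1, -1, 0],
    'STATUS': ['C', 'C', 'C', 'X'],
})

n_dup, pct = count_duplicates(toy)
print('Total number of duplicates:', n_dup, '(', round(pct, 1), '%).')
```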


## Modifications

### Statistical Overview



In [16]:
df.describe().compute().T.round(1)

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
SK_ID_BUREAU,27299925.0,6036297.3,492348.9,5001709.0,5877609.0,6159779.0,6654301.0,6842888.0
MONTHS_BALANCE,27299925.0,-30.7,23.9,-96.0,-44.0,-24.0,-11.0,0.0


In [17]:
df_obj = df.describe(exclude=np.number).compute().T
df_obj['freq[%]'] = (df_obj['freq'].astype('float')/df_obj['count'].astype('float')*100) # alternative to pd.to_numeric()
df_obj.round(1)

Unnamed: 0,unique,count,top,freq,freq[%]
STATUS,8,27299925,C,13646993,50.0


## Aggregate payment performance in one row per SK_ID_BUREAU

Goal: reduce the payment history (many rows per credit) to one row per credit, containing:
* the most recent payment status (the row with the maximum MONTHS_BALANCE)
* the number of monthly records (count of MONTHS_BALANCE)
* the share of months spent in each status category (mean of each STATUS dummy column)
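The aggregation idea can be tried on a small pandas frame before running it on the full Dask dataframe. The sketch below uses made-up toy data and plain pandas instead of Dask. It one-hot encodes STATUS, aggregates per credit, and keeps only the most recent monthly row.

```python
import pandas as pd

# Toy payment history for two credits (values are made up)
toy = pd.DataFrame({
    'SK_ID_BUREAU':   [1, 1, 1, 1, 2, 2],
    'MONTHS_BALANCE': [0, -1, -2, -3, -5, -6],
    'STATUS':         ['C', 'C', '0', '1', 'X', '0'],
})

# One indicator column per status value
dummies = pd.get_dummies(toy['STATUS'], prefix='STATUS', dtype=int)
wide = pd.concat([toy, dummies], axis=1)

# Latest month, number of records, and share of each status per credit
agg_spec = {'MONTHS_BALANCE': ['max', 'count']}
agg_spec.update({col: ['mean'] for col in dummies.columns})
per_credit = wide.groupby('SK_ID_BUREAU').agg(agg_spec)
per_credit.columns = ['_'.join(col) for col in per_credit.columns]
per_credit = per_credit.reset_index()

# Attach the aggregates to each monthly row and keep the most recent one
merged = toy.merge(per_credit, on='SK_ID_BUREAU')
latest = merged.loc[merged['MONTHS_BALANCE'] == merged['MONTHS_BALANCE_max']]
latest = latest.drop(columns='MONTHS_BALANCE_max').reset_index(drop=True)
print(latest)
```

Because the mean of a 0/1 indicator equals the fraction of rows where it is 1, each `STATUS_*_mean` column is the share of months the credit spent in that status.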



In [18]:
df_stat = df.categorize(columns = ['STATUS']).STATUS
df_status_cat = dd.get_dummies(df_stat, prefix = 'STATUS', dtype=int)
df_with_dummies = dd.concat([df, df_status_cat], axis=1)
#df_with_dummies.head(20)

In [19]:
balance_agg = {'MONTHS_BALANCE': ['max', 'count'],
               'STATUS_0': ['mean'],
               'STATUS_1': ['mean'],
               'STATUS_2': ['mean'],
               'STATUS_3': ['mean'],
               'STATUS_4': ['mean'],
               'STATUS_5': ['mean'],
               'STATUS_C': ['mean'],
               'STATUS_X': ['mean']
               }

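# One row per credit: latest month, number of monthly records, share of each status
df_bal_max = df_with_dummies.groupby("SK_ID_BUREAU").agg(balance_agg)
# Flatten the two-level column index, e.g. ('STATUS_0', 'mean') -> 'STATUS_0_mean'
df_bal_max.columns = df_bal_max.columns.map('_'.join).str.strip('_')
df_bal_max = df_bal_max.reset_index()
# Attach the aggregates to every monthly row, then keep only the most recent month per credit
df2 = df_with_dummies[['SK_ID_BUREAU', 'MONTHS_BALANCE', 'STATUS']].merge(df_bal_max, on='SK_ID_BUREAU')
df_final = df2.loc[df2['MONTHS_BALANCE'] == df2['MONTHS_BALANCE_max']]
df_final = df_final.drop(columns='MONTHS_BALANCE_max')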
print('n partitions:', df_final.npartitions)
df_final.round(2).head()


n partitions: 5


Unnamed: 0,SK_ID_BUREAU,MONTHS_BALANCE,STATUS,MONTHS_BALANCE_count,STATUS_0_mean,STATUS_1_mean,STATUS_2_mean,STATUS_3_mean,STATUS_4_mean,STATUS_5_mean,STATUS_C_mean,STATUS_X_mean
0,5715448,0,C,27,0.3,0.0,0.0,0.0,0.0,0.0,0.33,0.37
27,5715449,0,C,12,0.42,0.0,0.0,0.0,0.0,0.0,0.5,0.08
39,5715451,-5,C,26,0.65,0.0,0.0,0.0,0.0,0.0,0.19,0.15
65,5715452,0,C,33,0.24,0.0,0.0,0.0,0.0,0.0,0.45,0.3
98,5715453,0,C,38,0.21,0.0,0.0,0.0,0.0,0.0,0.53,0.26


In [20]:
size_df_final = [df_final.shape[0].compute(),  df_final.shape[1]]

print('The condensed dataset has', size_df_final[0], 'rows and', size_df_final[1], 'features.')
print('Initial size was', size_df[0], 'rows and', size_df[1], 'features.')

The condensed dataset has 817395 rows and 12 features.
Initial size was 27299925 rows and 3 features.


In [21]:
nan_overview_df = dfunc.nan_type_overview_dd(df_final, size_df_final[0])
nan_overview_df.round(1).style.background_gradient(cmap="Blues")

Unnamed: 0,type,NaN[abs],NaN[%]
SK_ID_BUREAU,int64,0,0.0
MONTHS_BALANCE,int64,0,0.0
STATUS,string,0,0.0
MONTHS_BALANCE_count,int64,0,0.0
STATUS_0_mean,float64,0,0.0
STATUS_1_mean,float64,0,0.0
STATUS_2_mean,float64,0,0.0
STATUS_3_mean,float64,0,0.0
STATUS_4_mean,float64,0,0.0
STATUS_5_mean,float64,0,0.0


This modified data set can now be merged with the 'bureau' dataset.
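A possible shape of that merge is sketched below with toy data in plain pandas. It assumes the 'bureau' table carries one row per credit with an `SK_ID_BUREAU` key; the frames `bureau_toy` and `balance_toy` and their values are made up. A left join keeps credits that have no balance history, and `indicator=True` makes those rows easy to spot.

```python
import pandas as pd

# Toy stand-in for the 'bureau' table: one row per credit (assumed layout)
bureau_toy = pd.DataFrame({
    'SK_ID_CURR': [100, 100, 101],
    'SK_ID_BUREAU': [1, 2, 3],
})

# Toy stand-in for the condensed balance table produced above
balance_toy = pd.DataFrame({
    'SK_ID_BUREAU': [1, 2],
    'MONTHS_BALANCE_count': [27, 12],
    'STATUS_C_mean': [0.33, 0.5],
})

# Left join: every bureau credit is kept, balance features are NaN where missing
merged = bureau_toy.merge(balance_toy, on='SK_ID_BUREAU', how='left', indicator=True)
print(merged)
```

Credits marked `left_only` in the `_merge` column have no balance records, so their balance features would need imputation or a missing-value flag in a later step.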

# Export

In [22]:
%%capture
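# Create the output folder from Python; a '! mkdir' shell call would not expand the Python variable
os.makedirs(home_folder+'cleaned/', exist_ok=True)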
df_final.to_csv(home_folder+'cleaned/HC_bureau_balance_cleaned.csv',
                 index=False, single_file = True)