# Kaggle Competition - Home Credit Default Risk
## Exploratory Analysis

### Dataset Description

1. application_{train|test}.csv
 - This is the main table, broken into two files for Train (with TARGET) and Test (without TARGET).
 - Static data for all applications. One row represents one loan in our data sample.

2. bureau.csv
 - All client's previous credits provided by other financial institutions that were reported to Credit Bureau (for clients who have a loan in our sample).
 - For every loan in our sample, there are as many rows as number of credits the client had in Credit Bureau before the application date.

3. bureau_balance.csv
 - Monthly balances of previous credits in Credit Bureau.
 - This table has one row for each month of history of every previous credit reported to Credit Bureau – i.e. the table has (#loans in sample * # of relative previous credits * # of months where we have some history observable for the previous credits) rows.

4. POS_CASH_balance.csv
 - Monthly balance snapshots of previous POS (point of sales) and cash loans that the applicant had with Home Credit.
 - This table has one row for each month of history of every previous credit in Home Credit (consumer credit and cash loans) related to loans in our sample – i.e. the table has (#loans in sample * # of relative previous credits * # of months in which we have some history observable for the previous credits) rows.

5. credit_card_balance.csv
 - Monthly balance snapshots of previous credit cards that the applicant has with Home Credit.
 - This table has one row for each month of history of every previous credit in Home Credit (consumer credit and cash loans) related to loans in our sample – i.e. the table has (#loans in sample * # of relative previous credit cards * # of months where we have some history observable for the previous credit card) rows.

6. previous_application.csv
 - All previous applications for Home Credit loans of clients who have loans in our sample.
 - There is one row for each previous application related to loans in our data sample.

7. installments_payments.csv
 - Repayment history for the previously disbursed credits in Home Credit related to the loans in our sample.
 - There is a) one row for every payment that was made plus b) one row for each missed payment.
 - One row is equivalent to one payment of one installment OR one installment corresponding to one payment of one previous Home Credit credit related to loans in our sample.

8. HomeCredit_columns_description.csv
 - This file contains descriptions for the columns in the various data files.
 
##### Structure of the dataset
![Architecture of the data model](images/data_arc.png)

<div class="alert alert-warning">
<h5>Some terminology you need to know...</h5>
<p><b><i>Credit bureau: </i></b>A credit bureau is a collection agency that gathers account information from various creditors and provides that information to a consumer reporting agency in the United States, a credit reference agency in the United Kingdom, a credit reporting body in Australia, a credit information company (CIC) in India, Special Accessing Entity in the Philippines, and also to private lenders. It is not the same as a credit rating agency. - <a href="https://en.wikipedia.org/wiki/Credit_bureau">Source</a></p>
<p><b><i>Home Credit: </i></b>Our client</p>

</div>

Let's have a look at all the datasets...

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import time



In [2]:
pd.options.display.max_columns = 250

## 1. Train dataset, a.k.a. the application dataset. It is the static data for all applications

In [3]:
train_dataset = pd.read_csv("data/application_train.csv")

# Note: a bare %time times an empty statement, not the read_csv call above
%time
print("Train dataset Info ------")
train_dataset.info(verbose=True)

CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 7.15 µs
Train dataset Info ------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Data columns (total 122 columns):
SK_ID_CURR                      int64
TARGET                          int64
NAME_CONTRACT_TYPE              object
CODE_GENDER                     object
FLAG_OWN_CAR                    object
FLAG_OWN_REALTY                 object
CNT_CHILDREN                    int64
AMT_INCOME_TOTAL                float64
AMT_CREDIT                      float64
AMT_ANNUITY                     float64
AMT_GOODS_PRICE                 float64
NAME_TYPE_SUITE                 object
NAME_INCOME_TYPE                object
NAME_EDUCATION_TYPE             object
NAME_FAMILY_STATUS              object
NAME_HOUSING_TYPE               object
REGION_POPULATION_RELATIVE      float64
DAYS_BIRTH                      int64
DAYS_EMPLOYED                   int64
DAYS_REGISTRATION               float64
DAYS_ID_PUBLISH  

In [4]:
print("Train dataset head ------")
train_dataset.head(n=10)

Train dataset head ------


Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,NAME_TYPE_SUITE,NAME_INCOME_TYPE,NAME_EDUCATION_TYPE,NAME_FAMILY_STATUS,NAME_HOUSING_TYPE,REGION_POPULATION_RELATIVE,DAYS_BIRTH,DAYS_EMPLOYED,DAYS_REGISTRATION,DAYS_ID_PUBLISH,OWN_CAR_AGE,FLAG_MOBIL,FLAG_EMP_PHONE,FLAG_WORK_PHONE,FLAG_CONT_MOBILE,FLAG_PHONE,FLAG_EMAIL,OCCUPATION_TYPE,CNT_FAM_MEMBERS,REGION_RATING_CLIENT,REGION_RATING_CLIENT_W_CITY,WEEKDAY_APPR_PROCESS_START,HOUR_APPR_PROCESS_START,REG_REGION_NOT_LIVE_REGION,REG_REGION_NOT_WORK_REGION,LIVE_REGION_NOT_WORK_REGION,REG_CITY_NOT_LIVE_CITY,REG_CITY_NOT_WORK_CITY,LIVE_CITY_NOT_WORK_CITY,ORGANIZATION_TYPE,EXT_SOURCE_1,EXT_SOURCE_2,EXT_SOURCE_3,APARTMENTS_AVG,BASEMENTAREA_AVG,YEARS_BEGINEXPLUATATION_AVG,YEARS_BUILD_AVG,COMMONAREA_AVG,ELEVATORS_AVG,ENTRANCES_AVG,FLOORSMAX_AVG,FLOORSMIN_AVG,LANDAREA_AVG,LIVINGAPARTMENTS_AVG,LIVINGAREA_AVG,NONLIVINGAPARTMENTS_AVG,NONLIVINGAREA_AVG,APARTMENTS_MODE,BASEMENTAREA_MODE,YEARS_BEGINEXPLUATATION_MODE,YEARS_BUILD_MODE,COMMONAREA_MODE,ELEVATORS_MODE,ENTRANCES_MODE,FLOORSMAX_MODE,FLOORSMIN_MODE,LANDAREA_MODE,LIVINGAPARTMENTS_MODE,LIVINGAREA_MODE,NONLIVINGAPARTMENTS_MODE,NONLIVINGAREA_MODE,APARTMENTS_MEDI,BASEMENTAREA_MEDI,YEARS_BEGINEXPLUATATION_MEDI,YEARS_BUILD_MEDI,COMMONAREA_MEDI,ELEVATORS_MEDI,ENTRANCES_MEDI,FLOORSMAX_MEDI,FLOORSMIN_MEDI,LANDAREA_MEDI,LIVINGAPARTMENTS_MEDI,LIVINGAREA_MEDI,NONLIVINGAPARTMENTS_MEDI,NONLIVINGAREA_MEDI,FONDKAPREMONT_MODE,HOUSETYPE_MODE,TOTALAREA_MODE,WALLSMATERIAL_MODE,EMERGENCYSTATE_MODE,OBS_30_CNT_SOCIAL_CIRCLE,DEF_30_CNT_SOCIAL_CIRCLE,OBS_60_CNT_SOCIAL_CIRCLE,DEF_60_CNT_SOCIAL_CIRCLE,DAYS_LAST_PHONE_CHANGE,FLAG_DOCUMENT_2,FLAG_DOCUMENT_3,FLAG_DOCUMENT_4,FLAG_DOCUMENT_5,FLAG_DOCUMENT_6,FLAG_DOCUMENT_7,FLAG_DOCUMENT_8,FLAG_DOCUMENT_9,FLAG_DOCUMENT_10,FLAG_DOCUMENT_11,FLAG_DOCUMENT_12,FLAG_DOCUMENT_13,FLAG_DOCUMENT_14,FLAG_DOCUMENT_15,FLAG_DOCUMENT_16,FLAG_DOCUMENT_17,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,351000.0,Unaccompanied,Working,Secondary / secondary special,Single / not married,House / apartment,0.018801,-9461,-637,-3648.0,-2120,,1,1,0,1,1,0,Laborers,1.0,2,2,WEDNESDAY,10,0,0,0,0,0,0,Business Entity Type 3,0.083037,0.262949,0.139376,0.0247,0.0369,0.9722,0.6192,0.0143,0.0,0.069,0.0833,0.125,0.0369,0.0202,0.019,0.0,0.0,0.0252,0.0383,0.9722,0.6341,0.0144,0.0,0.069,0.0833,0.125,0.0377,0.022,0.0198,0.0,0.0,0.025,0.0369,0.9722,0.6243,0.0144,0.0,0.069,0.0833,0.125,0.0375,0.0205,0.0193,0.0,0.0,reg oper account,block of flats,0.0149,"Stone, brick",No,2.0,2.0,2.0,2.0,-1134.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
1,100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,1129500.0,Family,State servant,Higher education,Married,House / apartment,0.003541,-16765,-1188,-1186.0,-291,,1,1,0,1,1,0,Core staff,2.0,1,1,MONDAY,11,0,0,0,0,0,0,School,0.311267,0.622246,,0.0959,0.0529,0.9851,0.796,0.0605,0.08,0.0345,0.2917,0.3333,0.013,0.0773,0.0549,0.0039,0.0098,0.0924,0.0538,0.9851,0.804,0.0497,0.0806,0.0345,0.2917,0.3333,0.0128,0.079,0.0554,0.0,0.0,0.0968,0.0529,0.9851,0.7987,0.0608,0.08,0.0345,0.2917,0.3333,0.0132,0.0787,0.0558,0.0039,0.01,reg oper account,block of flats,0.0714,Block,No,1.0,0.0,1.0,0.0,-828.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
2,100004,0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,135000.0,Unaccompanied,Working,Secondary / secondary special,Single / not married,House / apartment,0.010032,-19046,-225,-4260.0,-2531,26.0,1,1,1,1,1,0,Laborers,1.0,2,2,MONDAY,9,0,0,0,0,0,0,Government,,0.555912,0.729567,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,0.0,0.0,-815.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3,100006,0,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,297000.0,Unaccompanied,Working,Secondary / secondary special,Civil marriage,House / apartment,0.008019,-19005,-3039,-9833.0,-2437,,1,1,0,1,0,0,Laborers,2.0,2,2,WEDNESDAY,17,0,0,0,0,0,0,Business Entity Type 3,,0.650442,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2.0,0.0,2.0,0.0,-617.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,,,,
4,100007,0,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,513000.0,Unaccompanied,Working,Secondary / secondary special,Single / not married,House / apartment,0.028663,-19932,-3038,-4311.0,-3458,,1,1,0,1,0,0,Core staff,1.0,2,2,THURSDAY,11,0,0,0,0,1,1,Religion,,0.322738,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,0.0,0.0,-1106.0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
5,100008,0,Cash loans,M,N,Y,0,99000.0,490495.5,27517.5,454500.0,"Spouse, partner",State servant,Secondary / secondary special,Married,House / apartment,0.035792,-16941,-1588,-4970.0,-477,,1,1,1,1,1,0,Laborers,2.0,2,2,WEDNESDAY,16,0,0,0,0,0,0,Other,,0.354225,0.621226,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,0.0,0.0,-2536.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,1.0,1.0
6,100009,0,Cash loans,F,Y,Y,1,171000.0,1560726.0,41301.0,1395000.0,Unaccompanied,Commercial associate,Higher education,Married,House / apartment,0.035792,-13778,-3130,-1213.0,-619,17.0,1,1,0,1,1,0,Accountants,3.0,2,2,SUNDAY,16,0,0,0,0,0,0,Business Entity Type 3,0.774761,0.724,0.49206,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0,0.0,1.0,0.0,-1562.0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0.0,0.0,0.0,1.0,1.0,2.0
7,100010,0,Cash loans,M,Y,Y,0,360000.0,1530000.0,42075.0,1530000.0,Unaccompanied,State servant,Higher education,Married,House / apartment,0.003122,-18850,-449,-4597.0,-2379,8.0,1,1,1,1,0,0,Managers,2.0,3,3,MONDAY,16,0,0,0,0,1,1,Other,,0.714279,0.540654,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2.0,0.0,2.0,0.0,-1070.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
8,100011,0,Cash loans,F,N,Y,0,112500.0,1019610.0,33826.5,913500.0,Children,Pensioner,Secondary / secondary special,Married,House / apartment,0.018634,-20099,365243,-7427.0,-3514,,1,0,0,1,0,0,,2.0,2,2,WEDNESDAY,14,0,0,0,0,0,0,XNA,0.587334,0.205747,0.751724,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0,0.0,1.0,0.0,0.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
9,100012,0,Revolving loans,M,N,Y,0,135000.0,405000.0,20250.0,405000.0,Unaccompanied,Working,Secondary / secondary special,Single / not married,House / apartment,0.019689,-14469,-2019,-14437.0,-3992,,1,1,0,1,0,0,Laborers,1.0,2,2,THURSDAY,8,0,0,0,0,0,0,Electricity,,0.746644,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2.0,0.0,2.0,0.0,-1673.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,,,,


In [5]:
print("Train dataset descriptive statistics ------")
train_dataset.describe()

Train dataset descriptive statistics ------


Unnamed: 0,SK_ID_CURR,TARGET,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,REGION_POPULATION_RELATIVE,DAYS_BIRTH,DAYS_EMPLOYED,DAYS_REGISTRATION,DAYS_ID_PUBLISH,OWN_CAR_AGE,FLAG_MOBIL,FLAG_EMP_PHONE,FLAG_WORK_PHONE,FLAG_CONT_MOBILE,FLAG_PHONE,FLAG_EMAIL,CNT_FAM_MEMBERS,REGION_RATING_CLIENT,REGION_RATING_CLIENT_W_CITY,HOUR_APPR_PROCESS_START,REG_REGION_NOT_LIVE_REGION,REG_REGION_NOT_WORK_REGION,LIVE_REGION_NOT_WORK_REGION,REG_CITY_NOT_LIVE_CITY,REG_CITY_NOT_WORK_CITY,LIVE_CITY_NOT_WORK_CITY,EXT_SOURCE_1,EXT_SOURCE_2,EXT_SOURCE_3,APARTMENTS_AVG,BASEMENTAREA_AVG,YEARS_BEGINEXPLUATATION_AVG,YEARS_BUILD_AVG,COMMONAREA_AVG,ELEVATORS_AVG,ENTRANCES_AVG,FLOORSMAX_AVG,FLOORSMIN_AVG,LANDAREA_AVG,LIVINGAPARTMENTS_AVG,LIVINGAREA_AVG,NONLIVINGAPARTMENTS_AVG,NONLIVINGAREA_AVG,APARTMENTS_MODE,BASEMENTAREA_MODE,YEARS_BEGINEXPLUATATION_MODE,YEARS_BUILD_MODE,COMMONAREA_MODE,ELEVATORS_MODE,ENTRANCES_MODE,FLOORSMAX_MODE,FLOORSMIN_MODE,LANDAREA_MODE,LIVINGAPARTMENTS_MODE,LIVINGAREA_MODE,NONLIVINGAPARTMENTS_MODE,NONLIVINGAREA_MODE,APARTMENTS_MEDI,BASEMENTAREA_MEDI,YEARS_BEGINEXPLUATATION_MEDI,YEARS_BUILD_MEDI,COMMONAREA_MEDI,ELEVATORS_MEDI,ENTRANCES_MEDI,FLOORSMAX_MEDI,FLOORSMIN_MEDI,LANDAREA_MEDI,LIVINGAPARTMENTS_MEDI,LIVINGAREA_MEDI,NONLIVINGAPARTMENTS_MEDI,NONLIVINGAREA_MEDI,TOTALAREA_MODE,OBS_30_CNT_SOCIAL_CIRCLE,DEF_30_CNT_SOCIAL_CIRCLE,OBS_60_CNT_SOCIAL_CIRCLE,DEF_60_CNT_SOCIAL_CIRCLE,DAYS_LAST_PHONE_CHANGE,FLAG_DOCUMENT_2,FLAG_DOCUMENT_3,FLAG_DOCUMENT_4,FLAG_DOCUMENT_5,FLAG_DOCUMENT_6,FLAG_DOCUMENT_7,FLAG_DOCUMENT_8,FLAG_DOCUMENT_9,FLAG_DOCUMENT_10,FLAG_DOCUMENT_11,FLAG_DOCUMENT_12,FLAG_DOCUMENT_13,FLAG_DOCUMENT_14,FLAG_DOCUMENT_15,FLAG_DOCUMENT_16,FLAG_DOCUMENT_17,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
count,307511.0,307511.0,307511.0,307511.0,307511.0,307499.0,307233.0,307511.0,307511.0,307511.0,307511.0,307511.0,104582.0,307511.0,307511.0,307511.0,307511.0,307511.0,307511.0,307509.0,307511.0,307511.0,307511.0,307511.0,307511.0,307511.0,307511.0,307511.0,307511.0,134133.0,306851.0,246546.0,151450.0,127568.0,157504.0,103023.0,92646.0,143620.0,152683.0,154491.0,98869.0,124921.0,97312.0,153161.0,93997.0,137829.0,151450.0,127568.0,157504.0,103023.0,92646.0,143620.0,152683.0,154491.0,98869.0,124921.0,97312.0,153161.0,93997.0,137829.0,151450.0,127568.0,157504.0,103023.0,92646.0,143620.0,152683.0,154491.0,98869.0,124921.0,97312.0,153161.0,93997.0,137829.0,159080.0,306490.0,306490.0,306490.0,306490.0,307510.0,307511.0,307511.0,307511.0,307511.0,307511.0,307511.0,307511.0,307511.0,307511.0,307511.0,307511.0,307511.0,307511.0,307511.0,307511.0,307511.0,307511.0,307511.0,307511.0,307511.0,265992.0,265992.0,265992.0,265992.0,265992.0,265992.0
mean,278180.518577,0.080729,0.417052,168797.9,599026.0,27108.573909,538396.2,0.020868,-16036.995067,63815.045904,-4986.120328,-2994.202373,12.061091,0.999997,0.819889,0.199368,0.998133,0.281066,0.05672,2.152665,2.052463,2.031521,12.063419,0.015144,0.050769,0.040659,0.078173,0.230454,0.179555,0.50213,0.5143927,0.510853,0.11744,0.088442,0.977735,0.752471,0.044621,0.078942,0.149725,0.226282,0.231894,0.066333,0.100775,0.107399,0.008809,0.028358,0.114231,0.087543,0.977065,0.759637,0.042553,0.07449,0.145193,0.222315,0.228058,0.064958,0.105645,0.105975,0.008076,0.027022,0.11785,0.087955,0.977752,0.755746,0.044595,0.078078,0.149213,0.225897,0.231625,0.067169,0.101954,0.108607,0.008651,0.028236,0.102547,1.422245,0.143421,1.405292,0.100049,-962.858788,4.2e-05,0.710023,8.1e-05,0.015115,0.088055,0.000192,0.081376,0.003896,2.3e-05,0.003912,7e-06,0.003525,0.002936,0.00121,0.009928,0.000267,0.00813,0.000595,0.000507,0.000335,0.006402,0.007,0.034362,0.267395,0.265474,1.899974
std,102790.175348,0.272419,0.722121,237123.1,402490.8,14493.737315,369446.5,0.013831,4363.988632,141275.766519,3522.886321,1509.450419,11.944812,0.001803,0.38428,0.399526,0.043164,0.449521,0.231307,0.910682,0.509034,0.502737,3.265832,0.122126,0.219526,0.197499,0.268444,0.421124,0.383817,0.211062,0.1910602,0.194844,0.10824,0.082438,0.059223,0.11328,0.076036,0.134576,0.100049,0.144641,0.16138,0.081184,0.092576,0.110565,0.047732,0.069523,0.107936,0.084307,0.064575,0.110111,0.074445,0.132256,0.100977,0.143709,0.16116,0.08175,0.09788,0.111845,0.046276,0.070254,0.109076,0.082179,0.059897,0.112066,0.076144,0.134467,0.100368,0.145067,0.161934,0.082167,0.093642,0.11226,0.047415,0.070166,0.107462,2.400989,0.446698,2.379803,0.362291,826.808487,0.006502,0.453752,0.009016,0.12201,0.283376,0.01385,0.273412,0.062295,0.004771,0.062424,0.00255,0.059268,0.05411,0.03476,0.099144,0.016327,0.089798,0.024387,0.022518,0.018299,0.083849,0.110757,0.204685,0.916002,0.794056,1.869295
min,100002.0,0.0,0.0,25650.0,45000.0,1615.5,40500.0,0.00029,-25229.0,-17912.0,-24672.0,-7197.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.014568,8.173617e-08,0.000527,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-4292.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,189145.5,0.0,0.0,112500.0,270000.0,16524.0,238500.0,0.010006,-19682.0,-2760.0,-7479.5,-4299.0,5.0,1.0,1.0,0.0,1.0,0.0,0.0,2.0,2.0,2.0,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.334007,0.3924574,0.37065,0.0577,0.0442,0.9767,0.6872,0.0078,0.0,0.069,0.1667,0.0833,0.0187,0.0504,0.0453,0.0,0.0,0.0525,0.0407,0.9767,0.6994,0.0072,0.0,0.069,0.1667,0.0833,0.0166,0.0542,0.0427,0.0,0.0,0.0583,0.0437,0.9767,0.6914,0.0079,0.0,0.069,0.1667,0.0833,0.0187,0.0513,0.0457,0.0,0.0,0.0412,0.0,0.0,0.0,0.0,-1570.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,278202.0,0.0,0.0,147150.0,513531.0,24903.0,450000.0,0.01885,-15750.0,-1213.0,-4504.0,-3254.0,9.0,1.0,1.0,0.0,1.0,0.0,0.0,2.0,2.0,2.0,12.0,0.0,0.0,0.0,0.0,0.0,0.0,0.505998,0.5659614,0.535276,0.0876,0.0763,0.9816,0.7552,0.0211,0.0,0.1379,0.1667,0.2083,0.0481,0.0756,0.0745,0.0,0.0036,0.084,0.0746,0.9816,0.7648,0.019,0.0,0.1379,0.1667,0.2083,0.0458,0.0771,0.0731,0.0,0.0011,0.0864,0.0758,0.9816,0.7585,0.0208,0.0,0.1379,0.1667,0.2083,0.0487,0.0761,0.0749,0.0,0.0031,0.0688,0.0,0.0,0.0,0.0,-757.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
75%,367142.5,0.0,1.0,202500.0,808650.0,34596.0,679500.0,0.028663,-12413.0,-289.0,-2010.0,-1720.0,15.0,1.0,1.0,0.0,1.0,1.0,0.0,3.0,2.0,2.0,14.0,0.0,0.0,0.0,0.0,0.0,0.0,0.675053,0.6636171,0.669057,0.1485,0.1122,0.9866,0.8232,0.0515,0.12,0.2069,0.3333,0.375,0.0856,0.121,0.1299,0.0039,0.0277,0.1439,0.1124,0.9866,0.8236,0.049,0.1208,0.2069,0.3333,0.375,0.0841,0.1313,0.1252,0.0039,0.0231,0.1489,0.1116,0.9866,0.8256,0.0513,0.12,0.2069,0.3333,0.375,0.0868,0.1231,0.1303,0.0039,0.0266,0.1276,2.0,0.0,2.0,0.0,-274.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0
max,456255.0,1.0,19.0,117000000.0,4050000.0,258025.5,4050000.0,0.072508,-7489.0,365243.0,0.0,0.0,91.0,1.0,1.0,1.0,1.0,1.0,1.0,20.0,3.0,3.0,23.0,1.0,1.0,1.0,1.0,1.0,1.0,0.962693,0.8549997,0.89601,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,348.0,34.0,344.0,24.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,4.0,9.0,8.0,27.0,261.0,25.0


In [8]:
# Explore the distribution of the target column
train_dataset['TARGET'].value_counts()

0    282686
1     24825
Name: TARGET, dtype: int64

This is an imbalanced classification problem: far more loans were repaid on time (TARGET=0) than not (TARGET=1). Only about 8% of the applications in the training set belong to the positive class.
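To put a number on the imbalance, the sketch below recomputes the class shares from the counts printed above. The counts are copied by hand into a small Series so the snippet runs without the CSV file; on the real data, `train_dataset['TARGET'].value_counts(normalize=True)` should give the same shares directly. The variable names are illustrative only.

```python
import pandas as pd

# Class counts copied from the value_counts() output above
target_counts = pd.Series({0: 282686, 1: 24825}, name="TARGET")

# Share of each class in the training set
target_share = target_counts / target_counts.sum()

# How many on-time loans there are for each loan with payment difficulties
imbalance_ratio = target_counts.loc[0] / target_counts.loc[1]

print(target_share.round(4))
print(round(imbalance_ratio, 1))
```

A ratio of this size suggests that plain accuracy would be a misleading metric, since always predicting TARGET=0 would already score above 90%.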

### Exploring missing values of the dataset

In [11]:
# Function to calculate missing values by column
# https://www.kaggle.com/willkoehrsen/start-here-a-gentle-introduction
def missing_values_table(df):
        # Total missing values
        mis_val = df.isnull().sum()
        
        # Percentage of missing values
        mis_val_percent = 100 * df.isnull().sum() / len(df)
        
        # Make a table with the results
        mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
        
        # Rename the columns
        mis_val_table_ren_columns = mis_val_table.rename(
        columns = {0 : 'Missing Values', 1 : '% of Total Values'})
        
        # Sort the table by percentage of missing descending
        mis_val_table_ren_columns = mis_val_table_ren_columns[
            mis_val_table_ren_columns.iloc[:,1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)
        
        # Print some summary information
        print ("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"      
            "There are " + str(mis_val_table_ren_columns.shape[0]) +
              " columns that have missing values.")
        
        # Return the dataframe with missing information
        return mis_val_table_ren_columns

In [12]:
missing_values_table(train_dataset)

Your selected dataframe has 122 columns.
There are 67 columns that have missing values.


Unnamed: 0,Missing Values,% of Total Values
COMMONAREA_MEDI,214865,69.9
COMMONAREA_AVG,214865,69.9
COMMONAREA_MODE,214865,69.9
NONLIVINGAPARTMENTS_MEDI,213514,69.4
NONLIVINGAPARTMENTS_MODE,213514,69.4
NONLIVINGAPARTMENTS_AVG,213514,69.4
FONDKAPREMONT_MODE,210295,68.4
LIVINGAPARTMENTS_MODE,210199,68.4
LIVINGAPARTMENTS_MEDI,210199,68.4
LIVINGAPARTMENTS_AVG,210199,68.4


### Exploring column types

In [13]:
train_dataset.dtypes.value_counts()

float64    65
int64      41
object     16
dtype: int64

In [17]:
# Number of unique values of each object column
train_dataset.select_dtypes('object').apply(pd.Series.nunique, axis=0)

NAME_CONTRACT_TYPE             2
CODE_GENDER                    3
FLAG_OWN_CAR                   2
FLAG_OWN_REALTY                2
NAME_TYPE_SUITE                7
NAME_INCOME_TYPE               8
NAME_EDUCATION_TYPE            5
NAME_FAMILY_STATUS             6
NAME_HOUSING_TYPE              6
OCCUPATION_TYPE               18
WEEKDAY_APPR_PROCESS_START     7
ORGANIZATION_TYPE             58
FONDKAPREMONT_MODE             4
HOUSETYPE_MODE                 3
WALLSMATERIAL_MODE             7
EMERGENCYSTATE_MODE            2
dtype: int64

<div class="alert alert-warning">
<h5>One-hot encoding or Label encoding?...</h5>
<p>
    It depends on the machine learning algorithm you’re using. For tree-based models (random forest, XGBoost, etc.), it’s OK to encode categories as ordinal values (0, 1, 2, 3, etc.). For an algorithm that learns a weight for each variable (logistic regression, neural networks), it’s not OK, because the arbitrary integer order would be treated as a meaningful magnitude.
</p>

</div>

<div>
    <p>Here is a nice way to do label encoding in pandas <a href="https://stackoverflow.com/questions/32011359/convert-categorical-data-in-pandas-dataframe">(Link)</a></p>    
</div>

So, we will use Label Encoding for any categorical variables with only 2 categories and One-Hot Encoding for any categorical variables with more than 2 categories. This process may need to change as we get further into the project, but for now, we will see where this gets us. (We will also not use any dimensionality reduction in this notebook but will explore in future iterations).

In [18]:
# Operation - Label encoding for object columns with only 2 unique categories, one-hot encoding otherwise.
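The cell above is only a placeholder, so here is a minimal sketch of one way the rule could be implemented with pandas alone. The function name `encode_categoricals` and the toy frame are hypothetical; `pd.factorize` stands in for a label encoder and `pd.get_dummies` for one-hot encoding. Note that `factorize` assigns -1 to missing values, which is an assumption about how NaN should be handled rather than a decision made in this notebook.

```python
import pandas as pd


def encode_categoricals(df):
    """Label-encode non-numeric columns with <= 2 categories, one-hot encode the rest."""
    df = df.copy()
    cat_cols = [c for c in df.columns if not pd.api.types.is_numeric_dtype(df[c])]
    one_hot_cols = []
    for col in cat_cols:
        if df[col].nunique() <= 2:
            # factorize maps categories to 0, 1, ... in order of appearance; NaN becomes -1
            codes, _ = pd.factorize(df[col])
            df[col] = codes
        else:
            one_hot_cols.append(col)
    if one_hot_cols:
        # One indicator column per category, named <column>_<category>
        df = pd.get_dummies(df, columns=one_hot_cols)
    return df


# Toy frame using a few column names from the application table
toy = pd.DataFrame({
    "FLAG_OWN_CAR": ["N", "Y", "N"],
    "NAME_CONTRACT_TYPE": ["Cash loans", "Revolving loans", "Cash loans"],
    "NAME_FAMILY_STATUS": ["Married", "Widow", "Civil marriage"],
    "AMT_CREDIT": [1.0, 2.0, 3.0],
})
encoded = encode_categoricals(toy)
print(encoded.columns.tolist())
```

When the same encoding has to be applied to the test set, the category-to-code mapping should be learned on the training data and reused, otherwise the two frames can end up with different columns.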

## 2. bureau.csv
All client's previous credits provided by other financial institutions that were reported to Credit Bureau

In [None]:
t1 = time.time()
bureau_dataset = pd.read_csv("data/bureau.csv")
t2 = time.time() - t1

In [None]:
print("Time used - %s" % t2)
print("Bureau dataset Info ------")
bureau_dataset.info(verbose=True)

In [None]:
print("Bureau dataset descriptive statistics ------")
bureau_dataset.describe()

In [None]:
print("Bureau dataset head ------")
bureau_dataset.sort_values("SK_ID_CURR").head(n=10)

What does the data look like once the training dataset is joined with the Credit Bureau records?

Take SK_ID_CURR = 100002 and SK_ID_CURR = 100006 as examples

In [None]:
# Historic records in Credit Bureau for sample 100002
bureau_dataset[bureau_dataset["SK_ID_CURR"] == 100002].sort_values("DAYS_CREDIT", ascending=False)

In [None]:
# Historic records in Credit Bureau for sample 100006
bureau_dataset[bureau_dataset["SK_ID_CURR"] == 100006].sort_values("DAYS_CREDIT", ascending=False)

Note: Assume bureau.csv contains the complete credit history of each client in the training dataset. Under that assumption, left joining the two datasets on SK_ID_CURR tells us how many times each client has borrowed money from other institutions and how many of those loans they have not repaid - *TBD*
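A minimal sketch of that left join, on toy frames rather than the real files. It assumes bureau.csv carries a `CREDIT_DAY_OVERDUE` column that can serve as a proxy for "not repaid"; that column is not shown in this notebook, and the output column names `BUREAU_CREDIT_COUNT` and `BUREAU_OVERDUE_COUNT` are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for application_train.csv and bureau.csv
train = pd.DataFrame({"SK_ID_CURR": [100002, 100003, 100004], "TARGET": [1, 0, 0]})
bureau = pd.DataFrame({
    "SK_ID_CURR": [100002, 100002, 100003],
    "SK_ID_BUREAU": [1, 2, 3],
    "CREDIT_DAY_OVERDUE": [0, 30, 0],  # assumed column: days overdue at application time
})

# One row per client: number of bureau credits and number currently overdue
bureau_agg = (
    bureau.assign(IS_OVERDUE=bureau["CREDIT_DAY_OVERDUE"] > 0)
    .groupby("SK_ID_CURR")
    .agg(
        BUREAU_CREDIT_COUNT=("SK_ID_BUREAU", "count"),
        BUREAU_OVERDUE_COUNT=("IS_OVERDUE", "sum"),
    )
    .reset_index()
)

# Left join keeps every application; clients without bureau history get 0
joined = train.merge(bureau_agg, on="SK_ID_CURR", how="left")
count_cols = ["BUREAU_CREDIT_COUNT", "BUREAU_OVERDUE_COUNT"]
joined[count_cols] = joined[count_cols].fillna(0).astype(int)
print(joined)
```

Aggregating before the merge keeps the result at one row per application, which matters because bureau.csv has many rows per client.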

## 3. bureau_balance.csv
 - Monthly balances of previous credits in Credit Bureau.
 - This table has one row for each month of history of every previous credit reported to Credit Bureau – i.e. the table has (#loans in sample * # of relative previous credits * # of months where we have some history observable for the previous credits) rows.

In [None]:
t1 = time.time()
bureau_balance_dataset = pd.read_csv("data/bureau_balance.csv")
t2 = time.time() - t1

In [None]:
print("Time used - %s" % t2)
print("CB Balance dataset Info ------")
bureau_balance_dataset.info(verbose=True)

In [None]:
print("Bureau Balance dataset descriptive statistics ------")
bureau_balance_dataset.describe()

<div class="alert alert-warning">
<h4>Status Code</h4>
<p>Status of the Credit Bureau loan during the month
 - active, closed, DPD0-30, … [C means closed, X means status unknown, 0 means no DPD (days past due), 1 means maximal DPD during the month between 1-30, 2 means DPD 31-60, … 5 means DPD 120+ or sold or written off]</p>
</div>
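As a rough illustration of how the status codes could be turned into a feature, the sketch below converts the numeric codes to integers and takes the worst DPD bucket per bureau credit. The toy rows are invented, and treating `C` and `X` as missing is an assumption made here for simplicity, not something the dataset description prescribes.

```python
import pandas as pd

# Toy stand-in for bureau_balance.csv
balance = pd.DataFrame({
    "SK_ID_BUREAU": [1, 1, 1, 2, 2],
    "STATUS": ["C", "0", "2", "X", "0"],
})

# Digits are DPD buckets; "C" and "X" are not numeric and become NaN
balance["DPD_BUCKET"] = pd.to_numeric(balance["STATUS"], errors="coerce")

# Worst (highest) DPD bucket observed for each bureau credit; NaN is skipped
worst_bucket = balance.groupby("SK_ID_BUREAU")["DPD_BUCKET"].max()
print(worst_bucket)
```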

In [None]:
print("Bureau Balance dataset head ------")
bureau_balance_dataset.head(n=5)

In [None]:
# Sample 100002
bureau_balance_dataset[bureau_balance_dataset["SK_ID_BUREAU"] == 6158909]

## 4. POS_CASH_balance.csv
 - Monthly balance snapshots of previous POS (point of sales) and cash loans that the applicant had with Home Credit.
 - This table has one row for each month of history of every previous credit in Home Credit (consumer credit and cash loans) related to loans in our sample – i.e. the table has (#loans in sample * # of relative previous credits * # of months in which we have some history observable for the previous credits) rows.

<div class="alert alert-warning">
<h4> What is POS?</h4>

<h5> Point-of-sale (POS) loans </h5>

<p>The POS loan is the important product in our business model because for most of our customers it is the point of entry into Home Credit. Having acquired a customer through POS loan for durable goods, we start the long-term relationship and do repeat business with them.</p>

<p>Customers use POS loans to buy durable goods, such as fridges and washing machines, and pay in instalments. We sell POS loans primarily to first-time borrowers. The loan applications are processed through Home Credit employees based in the shop, or through shop assistants, while the underwriting takes place centrally. Home Credit has a unique client database and technological capability so it is able to expertly analyse data and the level of risk and deliver a response quickly and efficiently to the customer.</p>

<p>In certain markets, we also offer a specialised type of POS loan for buying motorbikes. These loans may be collateralised by the item purchased.</p>

<p>In mature markets, an initial POS loan is the first step in the process of defining a customer’s credit capacity and potential progress to credit cards, increased limits, and cash loans. In Asian emerging consumer finance markets, such as India, Indonesia and Philippines, POS loans drive the growth of our business.</p>

<h5>The POS loan process</h5>
<ol>
<li>A customer chooses an item in a shop, then sits down with a representative employed by Home Credit or by the shop, and together they complete a credit request.</li>
<li>The application is submitted to Home Credit for processing.</li>
<li>The credit request is processed in the scoring system. Home Credit sends a notification of approval or refusal of credit to the representative.</li>
<li>If the request is approved, the customer signs a contract with Home Credit and buys the item, paying a first installment to the shop.</li>
<li>Home Credit makes the payment for the item to the shop once the shop has submitted all the customer’s contract documents.</li>
<li>The customer pays monthly instalments to Home Credit.</li>
</ol>

<img src="images/hcg-product-flow.png">
</div>

In [None]:
t1 = time.time()
poc_cash_balance_dataset = pd.read_csv("data/POS_CASH_balance.csv")
t2 = time.time() - t1

In [None]:
print("Time used - %s" % t2)

# POS cash balance dataset Info
poc_cash_balance_dataset.info(verbose=True)

In [None]:
# POS cash balance descriptive statistics
poc_cash_balance_dataset.describe()

In [None]:
# POS cash balance head rows
poc_cash_balance_dataset.head(n=5)

In [None]:
# Likewise, all the previous POS cash balance records for sample 100002
poc_cash_balance_dataset[poc_cash_balance_dataset['SK_ID_CURR'] == 100002].sort_values("MONTHS_BALANCE", ascending=False)

## 5. credit_card_balance.csv
 - Monthly balance snapshots of previous credit cards that the applicant has with Home Credit.
 - This table has one row for each month of history of every previous credit in Home Credit (consumer credit and cash loans) related to loans in our sample – i.e. the table has (#loans in sample * # of relative previous credit cards * # of months where we have some history observable for the previous credit card) rows.

In [None]:
t1 = time.time()
credit_card_balance_dataset = pd.read_csv("data/credit_card_balance.csv")
t2 = time.time() - t1

In [None]:
print("Time used - %s" % t2)

# Credit card balance dataset Info
credit_card_balance_dataset.info(verbose=True)

In [None]:
# Credit card balance descriptive statistics
credit_card_balance_dataset.describe()

In [None]:
# Credit card balance head rows
credit_card_balance_dataset.sort_values("SK_ID_CURR").head(n=5)

In [None]:
# Likewise, all the credit card balance records for sample 100002
credit_card_balance_dataset[credit_card_balance_dataset['SK_ID_CURR'] == 100002].sort_values("MONTHS_BALANCE", ascending=False)

In [None]:
# Likewise, all the credit card balance records for sample 100006
credit_card_balance_dataset[credit_card_balance_dataset['SK_ID_CURR'] == 100006].sort_values("MONTHS_BALANCE", ascending=False)

## 6. previous_application.csv
 - All previous applications for Home Credit loans of clients who have loans in our sample.
 - There is one row for each previous application related to loans in our data sample.

In [None]:
t1 = time.time()
previous_application_dataset = pd.read_csv("data/previous_application.csv")
t2 = time.time() - t1

In [None]:
print("Time used - %s" % t2)

# Previous application dataset Info
previous_application_dataset.info(verbose=True)

In [None]:
# Previous application dataset descriptive statistics
previous_application_dataset.describe()

In [None]:
# Previous application dataset head
previous_application_dataset.sort_values("SK_ID_CURR").head(n=5)

In [None]:
previous_application_dataset[previous_application_dataset["SK_ID_CURR"] == 100002]

In [None]:
previous_application_dataset[previous_application_dataset["SK_ID_CURR"] == 100006].sort_values("DAYS_DECISION", ascending=False)

## 7. installments_payments.csv
 - Repayment history for the previously disbursed credits in Home Credit related to the loans in our sample.
 - There is a) one row for every payment that was made plus b) one row for each missed payment.
 - One row is equivalent to one payment of one installment OR one installment corresponding to one payment of one previous Home Credit credit related to loans in our sample.

In [None]:
t1 = time.time()
installments_payments_dataset = pd.read_csv("data/installments_payments.csv")
t2 = time.time() - t1

In [None]:
print("Time used - %s" % t2)

# Installments payments dataset Info
installments_payments_dataset.info(verbose=True)

In [None]:
# Installments payments dataset descriptive statistics
installments_payments_dataset.describe()

In [None]:
# Installments payments dataset head
installments_payments_dataset.sort_values("SK_ID_CURR").head(n=5)

In [None]:
installments_payments_dataset[installments_payments_dataset["SK_ID_CURR"] == 100002]

In [None]:
installments_payments_dataset[installments_payments_dataset["SK_ID_CURR"] == 100006]

# Case Investigation

### Questions
 - How can we use the datasets other than the training dataset? How do the tables interact with each other?
 - Can we assume that, for each application, we can look up the client's records from the Credit Bureau and from previous applications?
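
One way to picture how the tables relate: application rows are keyed by SK_ID_CURR, previous applications carry both SK_ID_CURR and SK_ID_PREV, and installment rows hang off SK_ID_PREV. The sketch below is only an illustration on toy frames; every value, and the names `applications`, `previous`, `installments`, `TOTAL_PAID` and `PREV_COUNT`, are made up and do not come from the real files.

```python
import pandas as pd

# Toy stand-ins for the real tables; all values are invented for illustration.
applications = pd.DataFrame({"SK_ID_CURR": [1, 2, 3], "TARGET": [1, 0, 0]})
previous = pd.DataFrame({"SK_ID_PREV": [10, 11, 12], "SK_ID_CURR": [1, 1, 2]})
installments = pd.DataFrame({
    "SK_ID_PREV": [10, 10, 11, 12],
    "AMT_PAYMENT": [100.0, 100.0, 50.0, 75.0],
})

# Step 1: roll installments up to one row per previous credit (SK_ID_PREV).
per_prev = installments.groupby("SK_ID_PREV", as_index=False).agg(
    TOTAL_PAID=("AMT_PAYMENT", "sum"))

# Step 2: attach the owning application and roll up again to SK_ID_CURR.
per_curr = (previous.merge(per_prev, on="SK_ID_PREV", how="left")
            .groupby("SK_ID_CURR", as_index=False)
            .agg(PREV_COUNT=("SK_ID_PREV", "count"),
                 TOTAL_PAID=("TOTAL_PAID", "sum")))

# Step 3: left-join onto the applications so clients without history are kept.
toy_combined = applications.merge(per_curr, on="SK_ID_CURR", how="left")
print(toy_combined)
```

The two-step roll-up (SK_ID_PREV first, then SK_ID_CURR) keeps one row per application at the end, which is what a model on the main table needs. Client 3 has no history and simply gets NaN.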

In [None]:
# For example, for a sample application 100002

train_dataset[train_dataset.index == 100002]

In [None]:
# How about a case of target 0?
train_dataset[train_dataset["TARGET"] == 0]

In [None]:
# Records from the previous applications: one record is found
previous_application_dataset[previous_application_dataset["SK_ID_CURR"] == 100002]

In [None]:
# Then, we want to know the records of previous instalments
installments_payments_dataset[installments_payments_dataset['SK_ID_PREV'] == 1038818].sort_values("NUM_INSTALMENT_NUMBER")

Two key columns here relate to late payment -->
 - DAYS_INSTALMENT = when the installment of the previous credit was supposed to be paid
 - DAYS_ENTRY_PAYMENT = when the installment of the previous credit was actually paid

==> Difference between these two (DAYS_ENTRY_PAYMENT - DAYS_INSTALMENT) = number of days the payment was late

Another two key columns relate to the amount paid -->
 - AMT_INSTALMENT = the prescribed installment amount of the previous credit for this installment
 - AMT_PAYMENT = what the client actually paid on the previous credit for this installment
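
To make the sign conventions concrete, here is a small sketch on invented rows. The numbers and the helper column names (`DAYS_LATE`, `AMT_GAP`, `IS_LATE`, `IS_UNDERPAID`) are assumptions for illustration only. The DAYS_* values are assumed to be counted relative to the application date, which is why they are negative.

```python
import pandas as pd

# Invented installment rows; DAYS_* are assumed to be relative to the
# application date, so they are negative.
toy_installments = pd.DataFrame({
    "DAYS_INSTALMENT": [-100, -70, -40],
    "DAYS_ENTRY_PAYMENT": [-105, -70, -32],
    "AMT_INSTALMENT": [500.0, 500.0, 500.0],
    "AMT_PAYMENT": [500.0, 450.0, 500.0],
})

days_late = toy_installments["DAYS_ENTRY_PAYMENT"] - toy_installments["DAYS_INSTALMENT"]
amount_gap = toy_installments["AMT_PAYMENT"] - toy_installments["AMT_INSTALMENT"]

toy_installments = toy_installments.assign(
    DAYS_LATE=days_late,          # > 0: paid after the due date, < 0: paid early
    AMT_GAP=amount_gap,           # < 0: paid less than prescribed
    IS_LATE=days_late > 0,
    IS_UNDERPAID=amount_gap < 0,
)
print(toy_installments)
```

The first row is paid 5 days early in full, the second on time but 50 short, and the third in full but 8 days late.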

In [None]:
# Calculate the difference between DAYS_ENTRY_PAYMENT and DAYS_INSTALMENT,
# as well as the difference between AMT_PAYMENT and AMT_INSTALMENT

# DIFF_DAYS_INSTALMENT = DAYS_ENTRY_PAYMENT - DAYS_INSTALMENT => diff > 0 means a late payment, diff <= 0 means on time or early
installments_payments_dataset = installments_payments_dataset.assign(
    DIFF_DAYS_INSTALMENT=lambda x: x.DAYS_ENTRY_PAYMENT - x.DAYS_INSTALMENT)

# DIFF_AMT_INSTALMENT = AMT_PAYMENT - AMT_INSTALMENT => diff > 0 means overpayment, diff < 0 means underpayment
installments_payments_dataset = installments_payments_dataset.assign(
    DIFF_AMT_INSTALMENT=lambda x: x.AMT_PAYMENT - x.AMT_INSTALMENT)

In [None]:
# Rerun the records of previous installments, now with the two DIFF columns
installments_payments_dataset[installments_payments_dataset['SK_ID_PREV'] == 1038818].sort_values("NUM_INSTALMENT_NUMBER")

In [None]:
# Aggregate the installment payments per previous credit (SK_ID_PREV) into a new table for future use
# (select the columns with a list; tuple-style selection on a groupby is not supported by recent pandas)
installments_payments_dataset_agg = installments_payments_dataset.groupby(by=['SK_ID_PREV'], as_index=False)[['DIFF_DAYS_INSTALMENT', 'DIFF_AMT_INSTALMENT']].mean()

In [None]:
installments_payments_dataset_agg.head(5)

Also, we can see the credit card balance of previous loans in Home Credit

In [None]:
# Credit Card Balance of previous loans in Home Credit of SK_ID_PREV=1038818
credit_card_balance_dataset[credit_card_balance_dataset['SK_ID_PREV'] == 1038818]

As we can see, the result is empty, which means this previous credit (SK_ID_PREV=1038818) has no credit card balance records with Home Credit.
By deduction, does that mean the previous loan went through POS or cash loans instead?

In [None]:
# POS and cash balance of previous loans in Home Credit of SK_ID_PREV=1038818
poc_cash_balance_dataset[poc_cash_balance_dataset['SK_ID_PREV'] == 1038818].sort_values("MONTHS_BALANCE", ascending=False)

The key here is to see whether there are any days past due (SK_DPD, SK_DPD_DEF)
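
A mean alone can hide a single bad month, so it may help to look at the worst month and at how many months were past due as well. The following is a sketch on invented monthly snapshots; the summary names (`MEAN_DPD`, `MAX_DPD`, `MONTHS_WITH_DPD`, `EVER_PAST_DUE`) are hypothetical, not columns of the dataset.

```python
import pandas as pd

# Invented monthly snapshots for two previous credits.
toy_pos = pd.DataFrame({
    "SK_ID_PREV": [10, 10, 10, 11, 11],
    "MONTHS_BALANCE": [-3, -2, -1, -2, -1],
    "SK_DPD": [0, 12, 0, 0, 0],
})

# One row per previous credit with several views of the days-past-due history.
toy_pos_agg = toy_pos.groupby("SK_ID_PREV").agg(
    MEAN_DPD=("SK_DPD", "mean"),
    MAX_DPD=("SK_DPD", "max"),
    MONTHS_WITH_DPD=("SK_DPD", lambda s: int((s > 0).sum())),
)
toy_pos_agg["EVER_PAST_DUE"] = toy_pos_agg["MAX_DPD"] > 0
print(toy_pos_agg)
```

Credit 10 has a mean of only 4 days, but `MAX_DPD` shows the 12-day delay in one month.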

In [None]:
# Aggregate the POS/cash balance per previous credit (SK_ID_PREV) into a new table for future use
poc_cash_balance_dataset_agg = poc_cash_balance_dataset.groupby(by=['SK_ID_PREV'])[['SK_DPD', 'SK_DPD_DEF']].mean()

In [None]:
poc_cash_balance_dataset_agg.head(5)

Back to the credit card dataset: let's find a case that has credit card loan records

In [None]:
bureau_dataset.SK_ID_CURR = bureau_dataset.SK_ID_CURR.astype('int')

In [None]:
# Explore the case in the Credit Bureau (CB) data
bureau_dataset[bureau_dataset["SK_ID_CURR"] == 100002].sort_values("DAYS_CREDIT", ascending=False)

In [None]:
bureau_dataset_new = bureau_dataset.assign(DEBT_TO_CREDIT=bureau_dataset.AMT_CREDIT_SUM_DEBT/bureau_dataset.AMT_CREDIT_SUM)

In [None]:
bureau_dataset_new_agg = bureau_dataset_new.groupby("SK_ID_CURR").agg(
                    {'CREDIT_DAY_OVERDUE': 'mean', 
                     'SK_ID_CURR': 'count', 
                     'AMT_CREDIT_MAX_OVERDUE': 'mean', 
                     'CNT_CREDIT_PROLONG': 'mean', 
                     'AMT_CREDIT_SUM': 'mean', 
                     'DEBT_TO_CREDIT': 'mean'}).rename(index=str, 
                      columns={'CREDIT_DAY_OVERDUE': 'AVG_CREDIT_DAY_OVERDUE', 
                     'SK_ID_CURR': 'CB_RECORD_COUNT', 
                     'AMT_CREDIT_MAX_OVERDUE': 'AVG_AMT_CREDIT_MAX_OVERDUE', 
                     'CNT_CREDIT_PROLONG': 'AVG_CNT_CREDIT_PROLONG', 
                     'AMT_CREDIT_SUM': 'AVG_AMT_CREDIT_SUM', 
                     'DEBT_TO_CREDIT': 'AVG_DEBT_TO_CREDIT'
                    })

In [None]:
bureau_dataset_new_agg['AVG_AMT_CREDIT_MAX_OVERDUE'] = bureau_dataset_new_agg['AVG_AMT_CREDIT_MAX_OVERDUE'].fillna(value=0)

In [None]:
bureau_dataset_new_agg.index = bureau_dataset_new_agg.index.astype('int')

In [None]:
bureau_dataset_new_agg.head(5)

In [None]:
# Join bureau_agg with train_dataset
combined_dataset = train_dataset.join(bureau_dataset_new_agg, how='left')

In [None]:
combined_dataset.head(5)
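
After a left join it is worth checking how many applications actually found a match. The sketch below uses invented frames indexed by SK_ID_CURR (as `train_dataset` appears to be in this notebook); `HAS_CB_RECORD` and `coverage` are hypothetical names introduced only for this example.

```python
import pandas as pd

# Invented frames, both indexed by SK_ID_CURR.
toy_train = pd.DataFrame(
    {"TARGET": [1, 0, 0]},
    index=pd.Index([100002, 100003, 100004], name="SK_ID_CURR"))
toy_bureau_agg = pd.DataFrame(
    {"CB_RECORD_COUNT": [8, 4]},
    index=pd.Index([100002, 100003], name="SK_ID_CURR"))

toy_joined = toy_train.join(toy_bureau_agg, how="left")

# Rows with no bureau history come back as NaN; a flag makes that explicit.
toy_joined["HAS_CB_RECORD"] = toy_joined["CB_RECORD_COUNT"].notna()
coverage = toy_joined["HAS_CB_RECORD"].mean()
print(toy_joined)
print("Share of applications with bureau records:", coverage)
```

Keeping the flag as its own column means "no bureau history" stays distinguishable from "history with zero overdue" if the NaN values are filled later.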

# Data Visualisation

## 2.1 Visualising the training dataset

In [None]:
import seaborn as sns

%matplotlib inline

sns.set(style="ticks")

In [None]:
# Distribution of labels
train_dataset['TARGET'].value_counts()

In [None]:
def calc_percentage(x):
    return (100*x.count()/train_dataset.shape[0])

In [None]:
# Share of each TARGET value within each category (aggfunc=len counts rows; np.sum would add up the SK_ID_CURR index values)
pd.crosstab(train_dataset.TARGET, train_dataset.NAME_CONTRACT_TYPE, values=train_dataset.index, aggfunc=len, normalize='columns')

In [None]:
pd.crosstab(train_dataset.TARGET, train_dataset.CODE_GENDER, values=train_dataset.index, aggfunc=len, normalize='columns')

In [None]:
pd.crosstab(train_dataset.TARGET, train_dataset.FLAG_OWN_CAR, values=train_dataset.index, aggfunc=len, normalize='columns')

In [None]:
pd.crosstab(train_dataset.TARGET, train_dataset.FLAG_OWN_REALTY, values=train_dataset.index, aggfunc=len, normalize='columns')

In [None]:
pd.crosstab(train_dataset.TARGET, train_dataset.CNT_CHILDREN, values=train_dataset.index, aggfunc=len, normalize='columns')
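
With `normalize='columns'` every column of the crosstab sums to 1, so the TARGET = 1 row can be read as the default rate within each category. Here is a minimal sketch on an invented eight-row frame, cross-checked against a plain group mean.

```python
import pandas as pd

# Invented sample: four applications per gender.
toy = pd.DataFrame({
    "TARGET": [0, 0, 0, 1, 0, 1, 1, 0],
    "CODE_GENDER": ["F", "F", "F", "F", "M", "M", "M", "M"],
})

# Each column sums to 1, so the TARGET == 1 row is the default rate per category.
rates = pd.crosstab(toy["TARGET"], toy["CODE_GENDER"], normalize="columns")
print(rates)

# Cross-check: TARGET is 0/1, so its mean per group is the same rate.
check = toy.groupby("CODE_GENDER")["TARGET"].mean()
print(check)
```

When only counts are needed, `values` and `aggfunc` can be left out, because `pd.crosstab` counts rows by default.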

## Data Cleansing and Transformation

In [None]:
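# DAYS_EMPLOYED should be negative as it is relative to the current application;
# 365243 is a placeholder value, so replace it with NaN (value=None can trigger a forward fill in some pandas versions)
train_dataset.DAYS_EMPLOYED = train_dataset.DAYS_EMPLOYED.replace(to_replace=365243, value=np.nan)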
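
Blanking out the placeholder loses the information that it was there, which may itself be predictive. The sketch below, on an invented series, records a flag before replacing; `is_placeholder` and `cleaned` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Invented values; 365243 is the placeholder seen in DAYS_EMPLOYED.
toy_days_employed = pd.Series([-637, -1188, 365243, -225, 365243],
                              name="DAYS_EMPLOYED")

# Keep the information that the placeholder was present before blanking it out.
is_placeholder = toy_days_employed == 365243
cleaned = toy_days_employed.replace(365243, np.nan)

print(is_placeholder.sum(), "placeholder values")
print(cleaned)
```

After the replacement all remaining values are negative, as expected for a day count relative to the application.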

In [None]:
# Log transform AMT_INCOME_TOTAL
LOG_AMT_INCOME_TOTAL = train_dataset.AMT_INCOME_TOTAL.apply(np.log1p)
train_dataset_transformed = train_dataset.assign(LOG_AMT_INCOME_TOTAL=LOG_AMT_INCOME_TOTAL)

In [None]:
LOG_AMT_CREDIT = train_dataset.AMT_CREDIT.apply(np.log1p)
train_dataset_transformed = train_dataset_transformed.assign(LOG_AMT_CREDIT=LOG_AMT_CREDIT)

In [None]:
LOG_AMT_ANNUITY = train_dataset.AMT_ANNUITY.apply(np.log1p)
train_dataset_transformed = train_dataset_transformed.assign(LOG_AMT_ANNUITY=LOG_AMT_ANNUITY)

In [None]:
LOG_AMT_GOODS_PRICE = train_dataset.AMT_GOODS_PRICE.apply(np.log1p)
train_dataset_transformed = train_dataset_transformed.assign(LOG_AMT_GOODS_PRICE=LOG_AMT_GOODS_PRICE)

In [None]:
LOG_DAYS_EMPLOYED = train_dataset.DAYS_EMPLOYED.apply(lambda x: np.log1p(np.abs(x)))
train_dataset_transformed = train_dataset_transformed.assign(LOG_DAYS_EMPLOYED=LOG_DAYS_EMPLOYED)

In [None]:
LOG_DAYS_REGISTRATION = train_dataset.DAYS_REGISTRATION.apply(lambda x: np.log1p(np.abs(x)))
train_dataset_transformed = train_dataset_transformed.assign(LOG_DAYS_REGISTRATION=LOG_DAYS_REGISTRATION)

In [None]:
LOG_DAYS_ID_PUBLISH = train_dataset.DAYS_ID_PUBLISH.apply(lambda x: np.log1p(np.abs(x)))
train_dataset_transformed = train_dataset_transformed.assign(LOG_DAYS_ID_PUBLISH=LOG_DAYS_ID_PUBLISH)
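
The reason for these log transforms is that amounts and day counts are strongly right-skewed. As a small illustration on invented amounts (each double the previous one), `np.log1p` brings the skewness close to zero, and `np.expm1` recovers the original values.

```python
import numpy as np
import pandas as pd

# Invented, right-skewed amounts: each value is double the previous one.
toy_income = pd.Series([25_000, 50_000, 100_000, 200_000,
                        400_000, 800_000, 1_600_000], dtype=float)

log_income = np.log1p(toy_income)   # log(1 + x), also defined at x = 0
recovered = np.expm1(log_income)    # inverse transform

print("skew before:", toy_income.skew())
print("skew after: ", log_income.skew())
```

`log1p` is used rather than a plain log so that zero values do not produce negative infinity.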

In [None]:
# Violin plot of LOG_AMT_INCOME_TOTAL by TARGET
sns.violinplot(x="TARGET", y="LOG_AMT_INCOME_TOTAL", data=train_dataset_transformed, palette="PRGn")
sns.despine(offset=10, trim=True)

train_dataset_transformed.groupby("TARGET")['LOG_AMT_INCOME_TOTAL'].describe()

In [None]:
# Violin plot of LOG_AMT_CREDIT by TARGET
sns.violinplot(x="TARGET", y="LOG_AMT_CREDIT", data=train_dataset_transformed, palette="PRGn")
sns.despine(offset=10, trim=True)

train_dataset_transformed.groupby("TARGET").LOG_AMT_CREDIT.describe()

In [None]:
# Violin plot of LOG_AMT_ANNUITY by TARGET
sns.violinplot(x="TARGET", y="LOG_AMT_ANNUITY", data=train_dataset_transformed, palette="PRGn")
sns.despine(offset=10, trim=True)

train_dataset_transformed.groupby("TARGET").LOG_AMT_ANNUITY.describe()

In [None]:
# Violin plot of LOG_AMT_GOODS_PRICE by TARGET
sns.violinplot(x="TARGET", y="LOG_AMT_GOODS_PRICE", data=train_dataset_transformed, palette="PRGn")
sns.despine(offset=10, trim=True)

train_dataset_transformed.groupby("TARGET").LOG_AMT_GOODS_PRICE.describe()

In [None]:
pd.crosstab(train_dataset_transformed.TARGET, train_dataset_transformed.NAME_TYPE_SUITE, values=train_dataset_transformed.index, aggfunc=len, normalize='columns')

In [None]:
pd.crosstab(train_dataset_transformed.TARGET, train_dataset_transformed.NAME_INCOME_TYPE, values=train_dataset_transformed.index, aggfunc=len, normalize='columns')

In [None]:
pd.crosstab(train_dataset_transformed.TARGET, train_dataset_transformed.NAME_EDUCATION_TYPE, values=train_dataset_transformed.index, aggfunc=len, normalize='columns')

In [None]:
pd.crosstab(train_dataset_transformed.TARGET, train_dataset_transformed.NAME_FAMILY_STATUS, values=train_dataset_transformed.index, aggfunc=len, normalize='columns')

In [None]:
pd.crosstab(train_dataset_transformed.TARGET, train_dataset_transformed.NAME_HOUSING_TYPE, values=train_dataset_transformed.index, aggfunc=len, normalize='columns')

In [None]:
# Violin plot of REGION_POPULATION_RELATIVE by TARGET
sns.violinplot(x="TARGET", y="REGION_POPULATION_RELATIVE", data=train_dataset_transformed, palette="PRGn")
sns.despine(offset=10, trim=True)

train_dataset_transformed.groupby("TARGET").REGION_POPULATION_RELATIVE.describe()

In [None]:
# Violin plot of LOG_DAYS_EMPLOYED by TARGET
sns.violinplot(x="TARGET", y="LOG_DAYS_EMPLOYED", data=train_dataset_transformed, palette="PRGn")
sns.despine(offset=10, trim=True)

train_dataset_transformed.groupby("TARGET").LOG_DAYS_EMPLOYED.describe()

In [None]:
# Violin plot of LOG_DAYS_REGISTRATION by TARGET
sns.violinplot(x="TARGET", y="LOG_DAYS_REGISTRATION", data=train_dataset_transformed, palette="PRGn")
sns.despine(offset=10, trim=True)

train_dataset_transformed.groupby("TARGET").LOG_DAYS_REGISTRATION.describe()

In [None]:
# Violin plot of LOG_DAYS_ID_PUBLISH by TARGET
sns.violinplot(x="TARGET", y="LOG_DAYS_ID_PUBLISH", data=train_dataset_transformed, palette="PRGn")
sns.despine(offset=10, trim=True)

train_dataset_transformed.groupby("TARGET").LOG_DAYS_ID_PUBLISH.describe()

In [None]:
# Violin plot of OWN_CAR_AGE by TARGET
sns.violinplot(x="TARGET", y="OWN_CAR_AGE", data=train_dataset_transformed, palette="PRGn")
sns.despine(offset=10, trim=True)

train_dataset_transformed.groupby("TARGET").OWN_CAR_AGE.describe()

In [None]:
pd.crosstab(train_dataset.TARGET, train_dataset.FLAG_MOBIL, values=train_dataset.index, aggfunc=len, normalize='columns')

In [None]:
pd.crosstab(train_dataset.TARGET, train_dataset.FLAG_EMP_PHONE, values=train_dataset.index, aggfunc=len, normalize='columns')

In [None]:
pd.crosstab(train_dataset.TARGET, train_dataset.FLAG_WORK_PHONE, values=train_dataset.index, aggfunc=len, normalize='columns')

In [None]:
pd.crosstab(train_dataset.TARGET, train_dataset.FLAG_CONT_MOBILE, values=train_dataset.index, aggfunc=len, normalize='columns')

In [None]:
pd.crosstab(train_dataset.TARGET, train_dataset.FLAG_PHONE, values=train_dataset.index, aggfunc=len, normalize='columns')

In [None]:
pd.crosstab(train_dataset.TARGET, train_dataset.FLAG_EMAIL, values=train_dataset.index, aggfunc=len, normalize='columns')

In [None]:
pd.crosstab(train_dataset.TARGET, train_dataset.OCCUPATION_TYPE, values=train_dataset.index, aggfunc=len, normalize='columns')

In [None]:
pd.crosstab(train_dataset.TARGET, train_dataset.CNT_FAM_MEMBERS, values=train_dataset.index, aggfunc=len, normalize='columns')

In [None]:
pd.crosstab(train_dataset.TARGET, train_dataset.REGION_RATING_CLIENT, values=train_dataset.index, aggfunc=len, normalize='columns')

In [None]:
pd.crosstab(train_dataset.TARGET, train_dataset.REGION_RATING_CLIENT_W_CITY, values=train_dataset.index, aggfunc=len, normalize='columns')

In [None]:
pd.crosstab(train_dataset.TARGET, train_dataset.WEEKDAY_APPR_PROCESS_START, values=train_dataset.index, aggfunc=len, normalize='columns')

In [None]:
pd.crosstab(train_dataset.TARGET, train_dataset.HOUR_APPR_PROCESS_START, values=train_dataset.index, aggfunc=len, normalize='columns')

In [None]:
pd.crosstab(train_dataset.TARGET, train_dataset.REG_REGION_NOT_LIVE_REGION, values=train_dataset.index, aggfunc=len, normalize='columns')

In [None]:
pd.crosstab(train_dataset.TARGET, train_dataset.REG_REGION_NOT_WORK_REGION, values=train_dataset.index, aggfunc=len, normalize='columns')

In [None]:
pd.crosstab(train_dataset.TARGET, train_dataset.LIVE_REGION_NOT_WORK_REGION, values=train_dataset.index, aggfunc=len, normalize='columns')

In [None]:
pd.crosstab(train_dataset.TARGET, train_dataset.ORGANIZATION_TYPE, values=train_dataset.index, aggfunc=len, normalize='columns')

In [None]:
pd.crosstab(train_dataset.TARGET, train_dataset.ORGANIZATION_TYPE, values=train_dataset.index, aggfunc=len)

In [None]:
# Violin plot of EXT_SOURCE_1 by TARGET
sns.violinplot(x="TARGET", y="EXT_SOURCE_1", data=train_dataset_transformed, palette="PRGn")
sns.despine(offset=10, trim=True)

train_dataset_transformed.groupby("TARGET").EXT_SOURCE_1.describe()

In [None]:
# Violin plot of EXT_SOURCE_2 by TARGET
sns.violinplot(x="TARGET", y="EXT_SOURCE_2", data=train_dataset_transformed, palette="PRGn")
sns.despine(offset=10, trim=True)

train_dataset_transformed.groupby("TARGET").EXT_SOURCE_2.describe()

In [None]:
# Violin plot of EXT_SOURCE_3 by TARGET
sns.violinplot(x="TARGET", y="EXT_SOURCE_3", data=train_dataset_transformed, palette="PRGn")
sns.despine(offset=10, trim=True)

train_dataset_transformed.groupby("TARGET").EXT_SOURCE_3.describe()

In [None]:
# Violin plot of APARTMENTS_AVG by TARGET
sns.violinplot(x="TARGET", y="APARTMENTS_AVG", data=train_dataset_transformed, palette="PRGn")
sns.despine(offset=10, trim=True)

train_dataset_transformed.groupby("TARGET").APARTMENTS_AVG.describe()

In [None]:
# Violin plot of YEARS_BEGINEXPLUATATION_AVG by TARGET
sns.violinplot(x="TARGET", y="YEARS_BEGINEXPLUATATION_AVG", data=train_dataset_transformed, palette="PRGn")
sns.despine(offset=10, trim=True)

train_dataset_transformed.groupby("TARGET").YEARS_BEGINEXPLUATATION_AVG.describe()

In [None]:
# Violin plot of YEARS_BUILD_AVG by TARGET
sns.violinplot(x="TARGET", y="YEARS_BUILD_AVG", data=train_dataset_transformed, palette="PRGn")
sns.despine(offset=10, trim=True)

train_dataset_transformed.groupby("TARGET").YEARS_BUILD_AVG.describe()

In [None]:
# Violin plot of COMMONAREA_AVG by TARGET
sns.violinplot(x="TARGET", y="COMMONAREA_AVG", data=train_dataset_transformed, palette="PRGn")
sns.despine(offset=10, trim=True)

train_dataset_transformed.groupby("TARGET").COMMONAREA_AVG.describe()

In [None]:
# Violin plot of ELEVATORS_AVG by TARGET
sns.violinplot(x="TARGET", y="ELEVATORS_AVG", data=train_dataset_transformed, palette="PRGn")
sns.despine(offset=10, trim=True)

train_dataset_transformed.groupby("TARGET").ELEVATORS_AVG.describe()

In [None]:
# Violin plot of ENTRANCES_AVG by TARGET
sns.violinplot(x="TARGET", y="ENTRANCES_AVG", data=train_dataset_transformed, palette="PRGn")
sns.despine(offset=10, trim=True)

train_dataset_transformed.groupby("TARGET").ENTRANCES_AVG.describe()

In [None]:
# Violin plot of LANDAREA_AVG by TARGET
sns.violinplot(x="TARGET", y="LANDAREA_AVG", data=train_dataset_transformed, palette="PRGn")
sns.despine(offset=10, trim=True)

train_dataset_transformed.groupby("TARGET").LANDAREA_AVG.describe()

In [None]:
# Violin plot of LIVINGAPARTMENTS_AVG by TARGET
sns.violinplot(x="TARGET", y="LIVINGAPARTMENTS_AVG", data=train_dataset_transformed, palette="PRGn")
sns.despine(offset=10, trim=True)

train_dataset_transformed.groupby("TARGET").LIVINGAPARTMENTS_AVG.describe()

In [None]:
# Violin plot of LIVINGAREA_AVG by TARGET
sns.violinplot(x="TARGET", y="LIVINGAREA_AVG", data=train_dataset_transformed, palette="PRGn")
sns.despine(offset=10, trim=True)

train_dataset_transformed.groupby("TARGET").LIVINGAREA_AVG.describe()

In [None]:
# Violin plot of NONLIVINGAPARTMENTS_AVG by TARGET
sns.violinplot(x="TARGET", y="NONLIVINGAPARTMENTS_AVG", data=train_dataset_transformed, palette="PRGn")
sns.despine(offset=10, trim=True)

train_dataset_transformed.groupby("TARGET").NONLIVINGAPARTMENTS_AVG.describe()

In [None]:
# Violin plot of NONLIVINGAREA_AVG by TARGET
sns.violinplot(x="TARGET", y="NONLIVINGAREA_AVG", data=train_dataset_transformed, palette="PRGn")
sns.despine(offset=10, trim=True)

train_dataset_transformed.groupby("TARGET").NONLIVINGAREA_AVG.describe()

In [None]:
# Violin plot of OBS_30_CNT_SOCIAL_CIRCLE by TARGET
sns.violinplot(x="TARGET", y="OBS_30_CNT_SOCIAL_CIRCLE", data=train_dataset_transformed, palette="PRGn")
sns.despine(offset=10, trim=True)

train_dataset_transformed.groupby("TARGET").OBS_30_CNT_SOCIAL_CIRCLE.describe()

In [None]:
# Violin plot of DEF_30_CNT_SOCIAL_CIRCLE by TARGET
sns.violinplot(x="TARGET", y="DEF_30_CNT_SOCIAL_CIRCLE", data=train_dataset_transformed, palette="PRGn")
sns.despine(offset=10, trim=True)

train_dataset_transformed.groupby("TARGET").DEF_30_CNT_SOCIAL_CIRCLE.describe()

In [None]:
# Violin plot of OBS_60_CNT_SOCIAL_CIRCLE by TARGET
sns.violinplot(x="TARGET", y="OBS_60_CNT_SOCIAL_CIRCLE", data=train_dataset_transformed, palette="PRGn")
sns.despine(offset=10, trim=True)

train_dataset_transformed.groupby("TARGET").OBS_60_CNT_SOCIAL_CIRCLE.describe()

In [None]:
# Violin plot of DEF_60_CNT_SOCIAL_CIRCLE by TARGET
sns.violinplot(x="TARGET", y="DEF_60_CNT_SOCIAL_CIRCLE", data=train_dataset_transformed, palette="PRGn")
sns.despine(offset=10, trim=True)

train_dataset_transformed.groupby("TARGET").DEF_60_CNT_SOCIAL_CIRCLE.describe()

In [None]:
# Violin plot of DAYS_LAST_PHONE_CHANGE by TARGET
sns.violinplot(x="TARGET", y="DAYS_LAST_PHONE_CHANGE", data=train_dataset_transformed, palette="PRGn")
sns.despine(offset=10, trim=True)

train_dataset_transformed.groupby("TARGET").DAYS_LAST_PHONE_CHANGE.describe()

In [None]:
pd.crosstab(train_dataset.TARGET, train_dataset.FLAG_DOCUMENT_2, values=train_dataset.index, aggfunc=len, normalize='columns')

In [None]:
pd.crosstab(train_dataset.TARGET, train_dataset.FLAG_DOCUMENT_3, values=train_dataset.index, aggfunc=len, normalize='columns')

In [None]:
pd.crosstab(train_dataset.TARGET, train_dataset.FLAG_DOCUMENT_4, values=train_dataset.index, aggfunc=len)

In [None]:
pd.crosstab(train_dataset.TARGET, train_dataset.FLAG_DOCUMENT_5, values=train_dataset.index, aggfunc=len, normalize='columns')

In [None]:
pd.crosstab(train_dataset.TARGET, train_dataset.FLAG_DOCUMENT_6, values=train_dataset.index, aggfunc=len, normalize='columns')

In [None]:
pd.crosstab(train_dataset.TARGET, train_dataset.FLAG_DOCUMENT_7, values=train_dataset.index, aggfunc=len, normalize='columns')

In [None]:
pd.crosstab(train_dataset.TARGET, train_dataset.FLAG_DOCUMENT_8, values=train_dataset.index, aggfunc=len, normalize='columns')

In [None]:
pd.crosstab(train_dataset.TARGET, train_dataset.FLAG_DOCUMENT_9, values=train_dataset.index, aggfunc=len)

In [None]:
pd.crosstab(train_dataset.TARGET, train_dataset.FLAG_DOCUMENT_10, values=train_dataset.index, aggfunc=len)

In [None]:
pd.crosstab(train_dataset.TARGET, train_dataset.FLAG_DOCUMENT_11, values=train_dataset.index, aggfunc=len)

In [None]:
pd.crosstab(train_dataset.TARGET, train_dataset.FLAG_DOCUMENT_12, values=train_dataset.index, aggfunc=len)

In [None]:
pd.crosstab(train_dataset.TARGET, train_dataset.FLAG_DOCUMENT_13, values=train_dataset.index, aggfunc=len)

In [None]:
pd.crosstab(train_dataset.TARGET, train_dataset.FLAG_DOCUMENT_14, values=train_dataset.index, aggfunc=len, normalize='columns')

In [None]:
pd.crosstab(train_dataset.TARGET, train_dataset.FLAG_DOCUMENT_15, values=train_dataset.index, aggfunc=len)

In [None]:
pd.crosstab(train_dataset.TARGET, train_dataset.FLAG_DOCUMENT_16, values=train_dataset.index, aggfunc=len, normalize='columns')

In [None]:
pd.crosstab(train_dataset.TARGET, train_dataset.FLAG_DOCUMENT_17, values=train_dataset.index, aggfunc=len)

In [None]:
pd.crosstab(train_dataset.TARGET, train_dataset.FLAG_DOCUMENT_18, values=train_dataset.index, aggfunc=len, normalize='columns')

In [None]:
pd.crosstab(train_dataset.TARGET, train_dataset.FLAG_DOCUMENT_19, values=train_dataset.index, aggfunc=len, normalize='columns')

In [None]:
pd.crosstab(train_dataset.TARGET, train_dataset.FLAG_DOCUMENT_20, values=train_dataset.index, aggfunc=len, normalize='columns')

In [None]:
pd.crosstab(train_dataset.TARGET, train_dataset.FLAG_DOCUMENT_21, values=train_dataset.index, aggfunc=len)

In [None]:
# Violin plot of AMT_REQ_CREDIT_BUREAU_HOUR by TARGET
sns.violinplot(x="TARGET", y="AMT_REQ_CREDIT_BUREAU_HOUR", data=train_dataset_transformed, palette="PRGn")
sns.despine(offset=10, trim=True)

train_dataset_transformed.groupby("TARGET").AMT_REQ_CREDIT_BUREAU_HOUR.describe()

In [None]:
# Violin plot of AMT_REQ_CREDIT_BUREAU_DAY by TARGET
sns.violinplot(x="TARGET", y="AMT_REQ_CREDIT_BUREAU_DAY", data=train_dataset_transformed, palette="PRGn")
sns.despine(offset=10, trim=True)

train_dataset_transformed.groupby("TARGET").AMT_REQ_CREDIT_BUREAU_DAY.describe()

In [None]:
# Violin plot of AMT_REQ_CREDIT_BUREAU_WEEK by TARGET
sns.violinplot(x="TARGET", y="AMT_REQ_CREDIT_BUREAU_WEEK", data=train_dataset_transformed, palette="PRGn")
sns.despine(offset=10, trim=True)

train_dataset_transformed.groupby("TARGET").AMT_REQ_CREDIT_BUREAU_WEEK.describe()

In [None]:
# Violin plot of AMT_REQ_CREDIT_BUREAU_MON by TARGET
sns.violinplot(x="TARGET", y="AMT_REQ_CREDIT_BUREAU_MON", data=train_dataset_transformed, palette="PRGn")
sns.despine(offset=10, trim=True)

train_dataset_transformed.groupby("TARGET").AMT_REQ_CREDIT_BUREAU_MON.describe()

In [None]:
# Violin plot of AMT_REQ_CREDIT_BUREAU_QRT by TARGET
sns.violinplot(x="TARGET", y="AMT_REQ_CREDIT_BUREAU_QRT", data=train_dataset_transformed, palette="PRGn")
sns.despine(offset=10, trim=True)

train_dataset_transformed.groupby("TARGET").AMT_REQ_CREDIT_BUREAU_QRT.describe()

In [None]:
# Violin plot of AMT_REQ_CREDIT_BUREAU_YEAR by TARGET
sns.violinplot(x="TARGET", y="AMT_REQ_CREDIT_BUREAU_YEAR", data=train_dataset_transformed, palette="PRGn")
sns.despine(offset=10, trim=True)

train_dataset_transformed.groupby("TARGET").AMT_REQ_CREDIT_BUREAU_YEAR.describe()
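
The cells above repeat the same pattern for each column, so the per-TARGET summaries could also be produced in a loop. The sketch below does this on an invented stand-in for `train_dataset_transformed`; `summarise_by_target` is a hypothetical helper, and plotting is left out to keep the example free of plotting dependencies.

```python
import numpy as np
import pandas as pd

# Invented stand-in for train_dataset_transformed with two numeric features.
toy_train = pd.DataFrame({
    "TARGET": [0, 0, 0, 1, 1, 0],
    "EXT_SOURCE_2": [0.62, 0.55, 0.71, 0.21, 0.35, np.nan],
    "OWN_CAR_AGE": [3.0, np.nan, 12.0, 20.0, 15.0, 7.0],
})

def summarise_by_target(df, columns, target="TARGET"):
    """Stack one describe() table per column, keyed by the column name."""
    tables = {col: df.groupby(target)[col].describe() for col in columns}
    return pd.concat(tables, names=["feature", target])

summary = summarise_by_target(toy_train, ["EXT_SOURCE_2", "OWN_CAR_AGE"])
print(summary[["count", "mean", "50%"]])
```

Having all features in one table makes it easier to compare the two TARGET groups side by side than scrolling through individual cells.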

### Visualisation of CB dataset

In [None]:
# Transform combined dataset: DEBT_TO_CREDIT is infinite when AMT_CREDIT_SUM is 0, so treat +/-inf as missing
# (value=np.nan instead of None, which can trigger a forward fill in some pandas versions; np.Inf and np.NaN were removed in NumPy 2.0)
combined_dataset.AVG_DEBT_TO_CREDIT = combined_dataset.AVG_DEBT_TO_CREDIT.replace(to_replace=[np.inf, -np.inf], value=np.nan)

In [None]:
# Log transform AVG_CREDIT_DAY_OVERDUE
LOG_AVG_CREDIT_DAY_OVERDUE = combined_dataset.AVG_CREDIT_DAY_OVERDUE.apply(np.log1p)
combined_dataset = combined_dataset.assign(LOG_AVG_CREDIT_DAY_OVERDUE=LOG_AVG_CREDIT_DAY_OVERDUE)

In [None]:
# Violin plot of AVG_CREDIT_DAY_OVERDUE by TARGET
sns.violinplot(x="TARGET", y="AVG_CREDIT_DAY_OVERDUE", data=combined_dataset, palette="PRGn")
sns.despine(offset=10, trim=True)

combined_dataset.groupby("TARGET").AVG_CREDIT_DAY_OVERDUE.describe()

In [None]:
combined_dataset.TARGET[combined_dataset.AVG_CREDIT_DAY_OVERDUE == 0].value_counts(normalize=True)

In [None]:
combined_dataset.TARGET[combined_dataset.AVG_CREDIT_DAY_OVERDUE > 0].value_counts(normalize=True)


In [None]:
# Violin plot of CB_RECORD_COUNT by TARGET
sns.violinplot(x="TARGET", y="CB_RECORD_COUNT", data=combined_dataset, palette="PRGn")
sns.despine(offset=10, trim=True)

combined_dataset.groupby("TARGET").CB_RECORD_COUNT.describe()

In [None]:
# Violin plot of AVG_AMT_CREDIT_MAX_OVERDUE by TARGET
sns.violinplot(x="TARGET", y="AVG_AMT_CREDIT_MAX_OVERDUE", data=combined_dataset, palette="PRGn")
sns.despine(offset=10, trim=True)

combined_dataset.groupby("TARGET").AVG_AMT_CREDIT_MAX_OVERDUE.describe()

In [None]:
combined_dataset.TARGET[combined_dataset.AVG_AMT_CREDIT_MAX_OVERDUE == 0].value_counts(normalize=True)

In [None]:
combined_dataset.TARGET[combined_dataset.AVG_AMT_CREDIT_MAX_OVERDUE > 0].value_counts(normalize=True)

In [None]:
# Violin plot of AVG_CNT_CREDIT_PROLONG by TARGET
sns.violinplot(x="TARGET", y="AVG_CNT_CREDIT_PROLONG", data=combined_dataset, palette="PRGn")
sns.despine(offset=10, trim=True)

combined_dataset.groupby("TARGET").AVG_CNT_CREDIT_PROLONG.describe()

In [None]:
# Violin plot of AVG_AMT_CREDIT_SUM by TARGET
sns.violinplot(x="TARGET", y="AVG_AMT_CREDIT_SUM", data=combined_dataset, palette="PRGn")
sns.despine(offset=10, trim=True)

combined_dataset.groupby("TARGET").AVG_AMT_CREDIT_SUM.describe()

In [None]:
# Violin plot of AVG_DEBT_TO_CREDIT by TARGET
sns.violinplot(x="TARGET", y="AVG_DEBT_TO_CREDIT", data=combined_dataset, palette="PRGn")
sns.despine(offset=10, trim=True)

combined_dataset.groupby("TARGET").AVG_DEBT_TO_CREDIT.describe()

In [None]:
combined_dataset.AVG_DEBT_TO_CREDIT.describe()