

# 1. Introduction

Home Credit is a financial institution dedicated to extending loans to individuals with insufficient or non-existent credit histories, aiming to broaden financial inclusion across diverse markets. To do so, it uses alternative data sources, such as telco and transactional records, to assess clients' repayment capabilities. This helps ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower clients to be successful.

The goal of the Home Credit Default project is to develop a robust predictive model that forecasts the probability of the target variable, that is, the probability that an applicant will default on loan payments. Such a model can help Home Credit make informed loan-approval decisions and reduce credit default risk.

# 2. Setup, Import and Read data

In [4]:
# import statements

import os
import numpy as np
import seaborn as sns
import pandas as pd
import matplotlib.patches as mp
import matplotlib.pyplot as plt
import glob
import chardet

In [6]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [7]:
import os
os.chdir('/content/drive/MyDrive/home-credit-default-risk') # changing the default directory

# 3. Exploration of Dataset

In [21]:
df_train = pd.read_csv("application_train.csv")
df_bureau = pd.read_csv("bureau.csv")
df_train.head(20)

|  | SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0.0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 0.0 | 270000.0 | 1293502.5 | 35698.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 100004 | 0 | Revolving loans | M | Y | Y | 0.0 | 67500.0 | 135000.0 | 6750.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 100006 | 0 | Cash loans | F | N | Y | 0.0 | 135000.0 | 312682.5 | 29686.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 100007 | 0 | Cash loans | M | N | Y | 0.0 | 121500.0 | 513000.0 | 21865.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | 100008 | 0 | Cash loans | M | N | Y | 0.0 | 99000.0 | 490495.5 | 27517.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| 6 | 100009 | 0 | Cash loans | F | Y | Y | 1.0 | 171000.0 | 1560726.0 | 41301.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 2.0 |
| 7 | 100010 | 0 | Cash loans | M | Y | Y | 0.0 | 360000.0 | 1530000.0 | 42075.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 8 | 100011 | 0 | Cash loans | F | N | Y | 0.0 | 112500.0 | 1019610.0 | 33826.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 9 | 100012 | 0 | Revolving loans | M | N | Y | 0.0 | 135000.0 | 405000.0 | 20250.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN |
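
A quick sanity check after loading can confirm the table's size, whether `SK_ID_CURR` identifies each application uniquely, and whether any row is missing every non-key value. The sketch below is illustrative only: `summarize_frame` is a hypothetical helper, and the small frame with made-up values stands in for `df_train`.

```python
import pandas as pd

def summarize_frame(df: pd.DataFrame, key: str) -> dict:
    """Return basic size and key-uniqueness facts for a DataFrame."""
    return {
        "rows": len(df),
        "columns": df.shape[1],
        # Number of rows whose key repeats an earlier row's key.
        "duplicate_keys": int(df[key].duplicated().sum()),
        # Number of rows where every column except the key is missing.
        "all_nan_rows": int(df.drop(columns=[key]).isna().all(axis=1).sum()),
    }

# Toy stand-in for df_train (values are made up for illustration).
demo = pd.DataFrame({
    "SK_ID_CURR": [100002, 100003, 100004, 100004],
    "TARGET": [1, 0, 0, None],
    "AMT_CREDIT": [406597.5, 1293502.5, 135000.0, None],
})
summary = summarize_frame(demo, key="SK_ID_CURR")
print(summary)
```

On the real data, the same call would be `summarize_frame(df_train, key="SK_ID_CURR")`.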


## 3.1 Descriptive Statistics of each column

In [22]:
df_train.describe()

|  | SK_ID_CURR | TARGET | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 75700.0 | 75700.0 | 75699.0 | 75699.0 | 75699.0 | 75694.0 | 75636.0 | 75699.0 | 75699.0 | 75699.0 | ... | 75699.0 | 75699.0 | 75699.0 | 75699.0 | 65586.0 | 65586.0 | 65586.0 | 65586.0 | 65586.0 | 65586.0 |
| mean | 143900.649432 | 0.080291 | 0.41706 | 169513.7 | 598643.4 | 27077.7623 | 538043.3 | 0.020861 | -16038.243715 | 63427.870976 | ... | 0.008243 | 0.000674 | 0.000436 | 0.00033 | 0.006953 | 0.007471 | 0.033361 | 0.268823 | 0.265041 | 1.88647 |
| std | 25307.603982 | 0.271744 | 0.722493 | 435752.3 | 401407.0 | 14467.199609 | 368798.1 | 0.013794 | 4367.607419 | 140962.797073 | ... | 0.090418 | 0.025948 | 0.020875 | 0.01817 | 0.087734 | 0.108237 | 0.200898 | 0.92407 | 0.612589 | 1.873602 |
| min | 100002.0 | 0.0 | 0.0 | 25650.0 | 45000.0 | 1980.0 | 45000.0 | 0.000533 | -25201.0 | -17531.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 122067.75 | 0.0 | 0.0 | 112500.0 | 270000.0 | 16474.5 | 238500.0 | 0.010006 | -19678.0 | -2776.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 143843.0 | 0.0 | 0.0 | 147600.0 | 513000.0 | 24907.5 | 450000.0 | 0.01885 | -15770.0 | -1217.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 75% | 165795.25 | 0.0 | 1.0 | 202500.0 | 808650.0 | 34587.0 | 679500.0 | 0.028663 | -12390.0 | -289.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| max | 187779.0 | 1.0 | 11.0 | 117000000.0 | 4050000.0 | 258025.5 | 4050000.0 | 0.072508 | -7676.0 | 365243.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 3.0 | 6.0 | 6.0 | 24.0 | 8.0 | 25.0 |
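
In the statistics above, `DAYS_EMPLOYED` has a maximum of 365243 (roughly 1,000 years) and a positive mean, while its quartiles and all `DAYS_BIRTH` values are negative. This suggests that 365243 is a placeholder rather than a real duration. Here is a minimal sketch of one way to handle it, assuming the value should be treated as missing; the `DAYS_EMPLOYED_ANOM` flag column and the helper name are hypothetical, and the toy frame stands in for `df_train`.

```python
import numpy as np
import pandas as pd

# Maximum seen in the describe() output; assumed here to be a placeholder value.
SENTINEL = 365243

def flag_employment_sentinel(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with the sentinel flagged in a new column and replaced by NaN."""
    out = df.copy()
    out["DAYS_EMPLOYED_ANOM"] = out["DAYS_EMPLOYED"] == SENTINEL
    out["DAYS_EMPLOYED"] = out["DAYS_EMPLOYED"].replace(SENTINEL, np.nan)
    return out

# Toy stand-in for df_train.
demo = pd.DataFrame({"DAYS_EMPLOYED": [-2776, 365243, -289, 365243]})
cleaned = flag_employment_sentinel(demo)
print(cleaned)
```

Keeping the flag column preserves the information that the value was a placeholder, which may itself be predictive; whether to keep it is a modeling choice.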


## 3.2 Summary of df_train

In [23]:
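# info is a method: without parentheses the cell only echoes the bound method
# (the DataFrame repr captured below) instead of the dtype and non-null summary.
df_train.info()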

<bound method DataFrame.info of        SK_ID_CURR  TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR  \
0          100002       1         Cash loans           M            N   
1          100003       0         Cash loans           F            N   
2          100004       0    Revolving loans           M            Y   
3          100006       0         Cash loans           F            N   
4          100007       0         Cash loans           M            N   
...           ...     ...                ...         ...          ...   
75695      187775       0         Cash loans           F            N   
75696      187776       0         Cash loans           F            N   
75697      187777       1         Cash loans           M            N   
75698      187778       0         Cash loans           F            N   
75699      187779       0                NaN         NaN          NaN   

      FLAG_OWN_REALTY  CNT_CHILDREN  AMT_INCOME_TOTAL  AMT_CREDIT  \
0                   Y 
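
The last row in the output above contains NaN entries, so it is worth quantifying missing values per column. Here is a minimal sketch, assuming a hypothetical helper named `missing_summary` and a toy frame with made-up values in place of `df_train`.

```python
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, worst first."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    # Keep only columns that actually have missing values.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Toy stand-in for df_train.
demo = pd.DataFrame({
    "SK_ID_CURR": [100002, 100003, 100004, 100006],
    "NAME_CONTRACT_TYPE": ["Cash loans", "Cash loans", None, "Cash loans"],
    "AMT_REQ_CREDIT_BUREAU_YEAR": [1.0, None, None, None],
})
report = missing_summary(demo)
print(report)
```

On the real data, `missing_summary(df_train).head(20)` would list the columns with the most gaps.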

## 3.4 Count of TARGET Variable