### Problem statement :
Many people struggle to get loans due to insufficient or non-existent credit histories. And, unfortunately, this population is often taken advantage of by untrustworthy lenders. Home Credit Group Home Credit strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. In order to make sure this underserved population has a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities. While Home Credit is currently using various statistical and machine learning methods to make these predictions, they're challenging Kagglers to help them unlock the full potential of their data. Doing so will ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower their clients to be successful.

<img src="HomeCredit.JPEG" width="600" height="400">

## PreProcessing Steps
<br>
<ul>
    <li><b>Base dataset creation</b></li>
    <li><b>Cont,cat variables</b></li>
    <li><b>Null value treatement</b></li>
    <li><b>Outlier treatement</b></li>
    <li><b>Label enocders</b></li>
    <li><b>Dummy variables</b></li>
    <li><b>Normalization</b></li>
    <li><b>Standardization</b></li>
</ul>

#### Step1: Base dataset creation ( Reading the source files from either database or csv..etc)

In [19]:
import pandas as pd

In [20]:
train_data=pd.read_csv("C:\\Sridhar\\AI_ML\\Python\\Pandas\\file1\\application_train.csv")

In [21]:
train_data.shape

(307511, 122)

In [22]:
train_data.head(2)

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
1,100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0


In [23]:
train_data.tail(2)

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
307509,456254,1,Cash loans,F,N,Y,0,171000.0,370107.0,20205.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
307510,456255,0,Cash loans,F,N,N,0,157500.0,675000.0,49117.5,...,0,0,0,0,0.0,0.0,0.0,2.0,0.0,1.0


In [24]:
### EDA ( Exploratory Data Analysis)

In [25]:
train_data=train_data.sample(1000)

In [26]:
train_data.shape

(1000, 122)

In [27]:
train_data.head()

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
237277,374830,0,Cash loans,F,Y,N,0,225000.0,675000.0,25146.0,...,0,0,0,0,0.0,0.0,0.0,2.0,5.0,1.0
214504,348562,0,Cash loans,F,N,Y,0,180000.0,675000.0,32602.5,...,0,0,0,0,,,,,,
282480,427212,0,Revolving loans,M,Y,Y,1,180000.0,135000.0,6750.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
244579,383079,0,Cash loans,M,N,Y,0,112500.0,521280.0,41062.5,...,0,0,0,0,,,,,,
157089,282083,0,Cash loans,F,N,Y,0,225000.0,260640.0,26838.0,...,0,0,0,0,0.0,0.0,0.0,0.0,1.0,2.0


##### Null value tratement :

In [28]:
null_value_table=train_data.isna().sum()

In [29]:
null_value_table=(train_data.isna().sum()/train_data.shape[0])*100

In [30]:
null_value_table

SK_ID_CURR                     0.0
TARGET                         0.0
NAME_CONTRACT_TYPE             0.0
CODE_GENDER                    0.0
FLAG_OWN_CAR                   0.0
FLAG_OWN_REALTY                0.0
CNT_CHILDREN                   0.0
AMT_INCOME_TOTAL               0.0
AMT_CREDIT                     0.0
AMT_ANNUITY                    0.0
AMT_GOODS_PRICE                0.0
NAME_TYPE_SUITE                0.7
NAME_INCOME_TYPE               0.0
NAME_EDUCATION_TYPE            0.0
NAME_FAMILY_STATUS             0.0
NAME_HOUSING_TYPE              0.0
REGION_POPULATION_RELATIVE     0.0
DAYS_BIRTH                     0.0
DAYS_EMPLOYED                  0.0
DAYS_REGISTRATION              0.0
DAYS_ID_PUBLISH                0.0
OWN_CAR_AGE                   67.9
FLAG_MOBIL                     0.0
FLAG_EMP_PHONE                 0.0
FLAG_WORK_PHONE                0.0
FLAG_CONT_MOBILE               0.0
FLAG_PHONE                     0.0
FLAG_EMAIL                     0.0
OCCUPATION_TYPE     

In [31]:
import numpy as np
x=np.array([1,2,3,4])

In [32]:
x[x>2]

array([3, 4])

In [33]:
dropped_columns=null_value_table[null_value_table>int(input("enter null value percentage to drop"))].index

enter null value percentage to drop30


In [34]:
retained_columns=null_value_table[null_value_table<int(input("enter null value percentage to retain"))].index

enter null value percentage to retain30


In [37]:
len(dropped_columns), len(retained_columns), len(train_data.columns)

(50, 72, 122)

In [38]:
### Note : in case of any data deletion from columns or rows , it requires explicit sign off from business partners

In [39]:
train_data_backup=train_data

In [40]:
train_data.drop(dropped_columns,axis=1,inplace=True)

#### Variables creation

Cont,cat variables.                                                                                                  

divide the data into numerical_cont,numerical_discrete,cat_class,cat_text

In [21]:
train_data.drop('SK_ID_CURR',axis=1,inplace=True)

In [41]:
numerical_columns=train_data.mean().index

In [43]:
catogory_columns=[]
for i in train_data.columns:
    if i not in numerical_columns:
        catogory_columns.append(i)

In [44]:
len(numerical_columns), len(catogory_columns), len(train_data.columns)

(61, 11, 72)

In [45]:
for i in train_data[numerical_columns].columns:
    print(i,train_data[i].nunique())

SK_ID_CURR 1000
TARGET 2
CNT_CHILDREN 7
AMT_INCOME_TOTAL 94
AMT_CREDIT 440
AMT_ANNUITY 767
AMT_GOODS_PRICE 182
REGION_POPULATION_RELATIVE 77
DAYS_BIRTH 963
DAYS_EMPLOYED 754
DAYS_REGISTRATION 947
DAYS_ID_PUBLISH 896
FLAG_MOBIL 1
FLAG_EMP_PHONE 2
FLAG_WORK_PHONE 2
FLAG_CONT_MOBILE 1
FLAG_PHONE 2
FLAG_EMAIL 2
CNT_FAM_MEMBERS 8
REGION_RATING_CLIENT 3
REGION_RATING_CLIENT_W_CITY 3
HOUR_APPR_PROCESS_START 20
REG_REGION_NOT_LIVE_REGION 2
REG_REGION_NOT_WORK_REGION 2
LIVE_REGION_NOT_WORK_REGION 2
REG_CITY_NOT_LIVE_CITY 2
REG_CITY_NOT_WORK_CITY 2
LIVE_CITY_NOT_WORK_CITY 2
EXT_SOURCE_2 979
EXT_SOURCE_3 369
OBS_30_CNT_SOCIAL_CIRCLE 16
DEF_30_CNT_SOCIAL_CIRCLE 6
OBS_60_CNT_SOCIAL_CIRCLE 15
DEF_60_CNT_SOCIAL_CIRCLE 5
DAYS_LAST_PHONE_CHANGE 717
FLAG_DOCUMENT_2 1
FLAG_DOCUMENT_3 2
FLAG_DOCUMENT_4 1
FLAG_DOCUMENT_5 2
FLAG_DOCUMENT_6 2
FLAG_DOCUMENT_7 1
FLAG_DOCUMENT_8 2
FLAG_DOCUMENT_9 2
FLAG_DOCUMENT_10 1
FLAG_DOCUMENT_11 2
FLAG_DOCUMENT_12 1
FLAG_DOCUMENT_13 2
FLAG_DOCUMENT_14 2
FLAG_DOCUMENT_15 2


In [46]:
numerical_contineous=[]
numerical_discrete=[]
for i in train_data[numerical_columns]:
    if train_data[i].nunique()>=20:
        numerical_contineous.append(i)
    else:
        numerical_discrete.append(i)    

In [47]:
len(numerical_contineous), len(numerical_discrete), len(numerical_columns)

(14, 47, 61)

In [48]:
catagory_text=[]
catagory_discrete=[]
for i in train_data[catogory_columns]:
    if train_data[i].nunique()>47:
        catagory_text.append(i)
    else:
        catagory_discrete.append(i)

In [49]:
len(catagory_text), len(catagory_discrete), len(catogory_columns)

(0, 11, 11)

In [50]:
for i in train_data[catogory_columns].columns:
    print(i,train_data[i].nunique())

NAME_CONTRACT_TYPE 2
CODE_GENDER 2
FLAG_OWN_CAR 2
FLAG_OWN_REALTY 2
NAME_TYPE_SUITE 7
NAME_INCOME_TYPE 5
NAME_EDUCATION_TYPE 4
NAME_FAMILY_STATUS 5
NAME_HOUSING_TYPE 6
WEEKDAY_APPR_PROCESS_START 7
ORGANIZATION_TYPE 45


In [51]:
len(train_data.columns), len(numerical_contineous), len(numerical_discrete), len(catagory_text), len(catagory_discrete)

(72, 14, 47, 0, 11)

In [52]:
train_data[i].mode()

0    Business Entity Type 3
dtype: object

In [53]:
for i in numerical_contineous:
    train_data[i].fillna(train_data[i].median(),inplace=True)

for i in numerical_discrete:
    train_data[i].fillna(train_data[i].mode().values[0],inplace=True)
    
for i in catagory_discrete:
    train_data[i].fillna(train_data[i].mode().values[0],inplace=True)

train_data.drop(catagory_text,axis=1,inplace=True)
    

In [34]:
train_data.isna().sum().head()

TARGET                0
NAME_CONTRACT_TYPE    0
CODE_GENDER           0
FLAG_OWN_CAR          0
FLAG_OWN_REALTY       0
dtype: int64

In [35]:
### to check null value treatement applied on all columns
train_data.isna().sum().values.sum()

0

### outliers treatement

In [54]:
for k in numerical_contineous:
    data=train_data[k].values
    q1=np.quantile(data,0.25)
    q3=np.quantile(data,0.75)
    iqr=q3-q1
    utv=q3+1.5*(iqr)
    ltv=q1-1.5*(iqr)
    outliers_data=[]
    for i in data:
        if i<ltv or i>utv:
            outliers_data.append(np.median(data))
        else:
            outliers_data.append(i)
    train_data[i]=outliers_data

In [55]:
train_data['TARGET'].value_counts()

0    922
1     78
Name: TARGET, dtype: int64

In [None]:
#### Box plot ( Graphs)

In [40]:
train_data[catagory_discrete].shape

(1000, 11)

In [41]:
for i in train_data[catagory_discrete].columns:
    print(i,train_data[i].nunique())

NAME_CONTRACT_TYPE 2
CODE_GENDER 2
FLAG_OWN_CAR 2
FLAG_OWN_REALTY 2
NAME_TYPE_SUITE 6
NAME_INCOME_TYPE 4
NAME_EDUCATION_TYPE 5
NAME_FAMILY_STATUS 5
NAME_HOUSING_TYPE 6
OCCUPATION_TYPE 18
WEEKDAY_APPR_PROCESS_START 7


### dummy variables , label encoders :

In [56]:
## We need to apply dummy 
dummy_table = pd.get_dummies(train_data[catagory_discrete])

In [57]:
dummy_table.shape

(1000, 87)

In [59]:
dummy_table.head()

Unnamed: 0,NAME_CONTRACT_TYPE_Cash loans,NAME_CONTRACT_TYPE_Revolving loans,CODE_GENDER_F,CODE_GENDER_M,FLAG_OWN_CAR_N,FLAG_OWN_CAR_Y,FLAG_OWN_REALTY_N,FLAG_OWN_REALTY_Y,NAME_TYPE_SUITE_Children,NAME_TYPE_SUITE_Family,...,ORGANIZATION_TYPE_Self-employed,ORGANIZATION_TYPE_Services,ORGANIZATION_TYPE_Trade: type 2,ORGANIZATION_TYPE_Trade: type 3,ORGANIZATION_TYPE_Trade: type 7,ORGANIZATION_TYPE_Transport: type 2,ORGANIZATION_TYPE_Transport: type 3,ORGANIZATION_TYPE_Transport: type 4,ORGANIZATION_TYPE_University,ORGANIZATION_TYPE_XNA
237277,1,0,1,0,0,1,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
214504,1,0,1,0,1,0,0,1,0,0,...,1,0,0,0,0,0,0,0,0,0
282480,0,1,0,1,0,1,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
244579,1,0,0,1,1,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
157089,1,0,1,0,1,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0


In [63]:
train_data.drop(catagory_discrete,axis=1,inplace=True)

In [64]:
train_data.shape,len(catagory_discrete)

((1000, 162), 11)

In [65]:
## Joining dummy table to the train_data table
for i in dummy_table.columns:
    train_data[i] = dummy_table[i]

In [66]:
train_data.shape

(1000, 162)

In [67]:
train_data.shape[0]

1000

In [68]:
## Add an random column called cat in the train_data table
train_data["cat"] = np.random.randint(0,100,train_data.shape[0])

In [69]:
train_data.head()

Unnamed: 0,SK_ID_CURR,TARGET,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,REGION_POPULATION_RELATIVE,DAYS_BIRTH,DAYS_EMPLOYED,...,ORGANIZATION_TYPE_Services,ORGANIZATION_TYPE_Trade: type 2,ORGANIZATION_TYPE_Trade: type 3,ORGANIZATION_TYPE_Trade: type 7,ORGANIZATION_TYPE_Transport: type 2,ORGANIZATION_TYPE_Transport: type 3,ORGANIZATION_TYPE_Transport: type 4,ORGANIZATION_TYPE_University,ORGANIZATION_TYPE_XNA,cat
237277,374830,0,0,225000.0,675000.0,25146.0,675000.0,0.010966,-14594,-2673,...,0,0,0,0,0,0,0,0,0,48
214504,348562,0,0,180000.0,675000.0,32602.5,675000.0,0.00712,-20681,-1701,...,0,0,0,0,0,0,0,0,0,77
282480,427212,0,1,180000.0,135000.0,6750.0,135000.0,0.010643,-12512,-142,...,0,0,0,0,0,0,0,0,0,45
244579,383079,0,0,112500.0,521280.0,41062.5,450000.0,0.006629,-11261,-1448,...,0,0,0,0,0,0,0,0,0,34
157089,282083,0,0,225000.0,260640.0,26838.0,225000.0,0.018801,-24745,-14038,...,0,0,0,0,0,0,0,0,0,45


In [70]:
train_data.dtypes.head()

SK_ID_CURR            int64
TARGET                int64
CNT_CHILDREN          int64
AMT_INCOME_TOTAL    float64
AMT_CREDIT          float64
dtype: object

### Normalization

In [54]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler, LabelEncoder

In [58]:
for i in train_data.columns:
    mn=MinMaxScaler()
    mn.fit(pd.DataFrame(train_data[i]))
    y=mn.transform(pd.DataFrame(train_data[i]))
    train_data[i]=y

In [61]:
train_data.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
TARGET,1000.0,0.093000,0.290578,0.0,0.000000,0.000000,0.000000,1.0
CNT_CHILDREN,1000.0,0.087400,0.144576,0.0,0.000000,0.000000,0.200000,1.0
AMT_INCOME_TOTAL,1000.0,0.046371,0.046308,0.0,0.026676,0.036770,0.055516,1.0
AMT_CREDIT,1000.0,0.203227,0.145620,0.0,0.084890,0.169779,0.288115,1.0
AMT_ANNUITY,1000.0,0.191222,0.110512,0.0,0.112906,0.173414,0.244741,1.0
AMT_GOODS_PRICE,1000.0,0.217771,0.160443,0.0,0.085714,0.183673,0.287755,1.0
REGION_POPULATION_RELATIVE,1000.0,0.278659,0.196915,0.0,0.122922,0.246715,0.384476,1.0
DAYS_BIRTH,1000.0,0.513604,0.249901,0.0,0.316167,0.520386,0.720391,1.0
DAYS_EMPLOYED,1000.0,0.199033,0.370401,0.0,0.026868,0.030955,0.033402,1.0
DAYS_REGISTRATION,1000.0,0.730647,0.188977,0.0,0.584006,0.751923,0.898309,1.0


In [62]:
for i in train_data.columns:
    print(i,train_data[i].min(),train_data[i].max())

TARGET 0.0 1.0
CNT_CHILDREN 0.0 1.0
AMT_INCOME_TOTAL 0.0 1.0000000000000002
AMT_CREDIT 0.0 0.9999999999999999
AMT_ANNUITY 0.0 0.9999999999999999
AMT_GOODS_PRICE 0.0 1.0
REGION_POPULATION_RELATIVE 0.0 1.0000000000000002
DAYS_BIRTH 0.0 1.0
DAYS_EMPLOYED 0.0 1.0
DAYS_REGISTRATION 0.0 1.0
DAYS_ID_PUBLISH 0.0 1.0
FLAG_MOBIL 0.0 0.0
FLAG_EMP_PHONE 0.0 1.0
FLAG_WORK_PHONE 0.0 1.0
FLAG_CONT_MOBILE 0.0 1.0
FLAG_PHONE 0.0 1.0
FLAG_EMAIL 0.0 1.0
CNT_FAM_MEMBERS 0.0 0.9999999999999999
REGION_RATING_CLIENT 0.0 1.0
REGION_RATING_CLIENT_W_CITY 0.0 1.0
HOUR_APPR_PROCESS_START 0.0 1.0
REG_REGION_NOT_LIVE_REGION 0.0 1.0
REG_REGION_NOT_WORK_REGION 0.0 1.0
LIVE_REGION_NOT_WORK_REGION 0.0 1.0
REG_CITY_NOT_LIVE_CITY 0.0 1.0
REG_CITY_NOT_WORK_CITY 0.0 1.0
LIVE_CITY_NOT_WORK_CITY 0.0 1.0
EXT_SOURCE_2 0.0 1.0
EXT_SOURCE_3 0.0 1.0
OBS_30_CNT_SOCIAL_CIRCLE 0.0 1.0
DEF_30_CNT_SOCIAL_CIRCLE 0.0 1.0
OBS_60_CNT_SOCIAL_CIRCLE 0.0 1.0
DEF_60_CNT_SOCIAL_CIRCLE 0.0 1.0
DAYS_LAST_PHONE_CHANGE 0.0 1.0
FLAG_DOCUMENT_2 0.0 

### Standardization

In [63]:
for i in train_data.columns:
    sd=StandardScaler()
    sd.fit(pd.DataFrame(train_data[i]))
    z=sd.transform(pd.DataFrame(train_data[i]))
    train_data[i]=z

In [64]:
train_data.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
TARGET,1000.0,8.718581e-16,1.0005,-0.320212,-0.320212,-0.320212,-0.320212,3.122929
CNT_CHILDREN,1000.0,1.343370e-16,1.0005,-0.604830,-0.604830,-0.604830,0.779220,6.315421
AMT_INCOME_TOTAL,1000.0,-2.908784e-16,1.0005,-1.001864,-0.425516,-0.207438,0.197564,20.603419
AMT_CREDIT,1000.0,-8.126833e-17,1.0005,-1.396298,-0.813052,-0.229806,0.583239,5.474342
AMT_ANNUITY,1000.0,2.491340e-16,1.0005,-1.731194,-0.709016,-0.161217,0.484525,7.322139
AMT_GOODS_PRICE,1000.0,8.393286e-17,1.0005,-1.357995,-0.823492,-0.212631,0.436409,4.877880
REGION_POPULATION_RELATIVE,1000.0,4.840572e-17,1.0005,-1.415830,-0.791279,-0.162305,0.537641,3.665034
DAYS_BIRTH,1000.0,1.691286e-16,1.0005,-2.056259,-0.790457,0.027153,0.827894,1.947332
DAYS_EMPLOYED,1000.0,-1.173506e-16,1.0005,-0.537615,-0.465040,-0.454001,-0.447391,2.163515
DAYS_REGISTRATION,1000.0,2.049472e-16,1.0005,-3.868270,-0.776364,0.112638,0.887654,1.426036


### Label encoders

In [None]:
for i in train_data.columns:
    mn=LabelEncoder()
    mn.fit(pd.DataFrame(train_data['NAME_CONTRACT_TYPE']))
    y=mn.transform(pd.DataFrame(train_data['NAME_CONTRACT_TYPE']))
    # train_data