## Project on the Home Credit Model

Objective: Many people struggle to get loans due to insufficient or non-existent credit histories. And, unfortunately, this population is often taken advantage of by untrustworthy lenders.

Home Credit Group

Home Credit strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. In order to make sure this underserved population has a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities.

While Home Credit is currently using various statistical and machine learning methods to make these predictions, they're challenging Kagglers to help them unlock the full potential of their data. Doing so will ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower their clients to be successful.

<img src="home-credit.jpg" style="height:320px">

In [1]:
import pandas as pd
import numpy as np

In [2]:
df=pd.read_csv("G:\\A6\\home-credit-default-risk\\application_train.csv")

In [3]:
# List of variable names
df.columns

Index(['SK_ID_CURR', 'TARGET', 'NAME_CONTRACT_TYPE', 'CODE_GENDER',
       'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL',
       'AMT_CREDIT', 'AMT_ANNUITY',
       ...
       'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20',
       'FLAG_DOCUMENT_21', 'AMT_REQ_CREDIT_BUREAU_HOUR',
       'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK',
       'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT',
       'AMT_REQ_CREDIT_BUREAU_YEAR'],
      dtype='object', length=122)

In [4]:
df.head()

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
1,100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
2,100004,0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3,100006,0,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,...,0,0,0,0,,,,,,
4,100007,0,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0


In [5]:
df.tail()

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
307506,456251,0,Cash loans,M,N,N,0,157500.0,254700.0,27558.0,...,0,0,0,0,,,,,,
307507,456252,0,Cash loans,F,N,Y,0,72000.0,269550.0,12001.5,...,0,0,0,0,,,,,,
307508,456253,0,Cash loans,F,N,Y,0,153000.0,677664.0,29979.0,...,0,0,0,0,1.0,0.0,0.0,1.0,0.0,1.0
307509,456254,1,Cash loans,F,N,Y,0,171000.0,370107.0,20205.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
307510,456255,0,Cash loans,F,N,N,0,157500.0,675000.0,49117.5,...,0,0,0,0,0.0,0.0,0.0,2.0,0.0,1.0


In [6]:
# Dimension of the dataset
df.shape

(307511, 122)

## Pre-processing 
### Step 1 : Memory_management

In [7]:
def memory_management(train_identity):
    """Iterate through all the columns of a dataframe and downcast each one
    to the smallest dtype that holds its values, to reduce memory usage.
    The dataframe is modified in place and also returned."""

    df=train_identity
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    for col in df.columns:
        col_type = df[col].dtype

        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')
    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    print("*******************************************************************************************")
    return df

In [8]:
# size reduced for df and named as df1
df1=memory_management(df)

Memory usage of dataframe is 286.23 MB
Memory usage after optimization is: 59.54 MB
Decreased by 79.2%
*******************************************************************************************
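To make the effect of the downcasting concrete, here is a minimal sketch on a small synthetic frame (the column names and values are made up for illustration). It uses `pd.to_numeric` with `downcast`, a built-in pandas alternative to the hand-written range checks above. That route never goes below float32, which sidesteps the precision loss that float16 can introduce on large amounts.

```python
import numpy as np
import pandas as pd

# Synthetic data; names and values are illustrative only
toy = pd.DataFrame({
    "flag": np.zeros(1000, dtype=np.int64),
    "amount": np.arange(1000) * 0.5,
    "kind": ["cash", "revolving"] * 500,
})

before = toy.memory_usage(deep=True).sum()

small = toy.copy()
# Smallest integer / float dtype that still holds every value
small["flag"] = pd.to_numeric(small["flag"], downcast="integer")
small["amount"] = pd.to_numeric(small["amount"], downcast="float")
# Repeated strings are stored once when the column is categorical
small["kind"] = small["kind"].astype("category")

after = small.memory_usage(deep=True).sum()
print(small.dtypes)
print("bytes before:", before, "after:", after)
```

Because the sketch works on `toy.copy()`, the original frame keeps its dtypes and the two can be compared side by side.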


### Step 2: null_values treatment

In [9]:
def null_values(base_dataset, threshold=30):
    """Drop columns whose null percentage exceeds `threshold`, then impute the rest."""
    print(base_dataset.isna().sum())
    # Percentage of null values per column
    null_value_table=(base_dataset.isna().sum()/base_dataset.shape[0])*100
    # Drop every column whose null percentage is above the threshold (e.g. 30%)
    drop_columns=null_value_table[null_value_table>threshold].index
    base_dataset.drop(drop_columns,axis=1,inplace=True)
    # Numerical columns are the ones describe() reports; the rest are categorical
    cont=base_dataset.describe().columns
    cat=[i for i in base_dataset.columns if i not in cont]
    # Fill categorical columns with the most frequent value, numerical ones with the median
    for i in cat:
        base_dataset[i]=base_dataset[i].fillna(base_dataset[i].value_counts().index[0])
    for i in cont:
        base_dataset[i]=base_dataset[i].fillna(base_dataset[i].median())
    print(base_dataset.isna().sum())
    return base_dataset,cat,cont

In [10]:
df2,cat,cont=null_values(df1)

SK_ID_CURR                         0
TARGET                             0
NAME_CONTRACT_TYPE                 0
CODE_GENDER                        0
FLAG_OWN_CAR                       0
FLAG_OWN_REALTY                    0
CNT_CHILDREN                       0
AMT_INCOME_TOTAL                   0
AMT_CREDIT                         0
AMT_ANNUITY                       12
AMT_GOODS_PRICE                  278
NAME_TYPE_SUITE                 1292
NAME_INCOME_TYPE                   0
NAME_EDUCATION_TYPE                0
NAME_FAMILY_STATUS                 0
NAME_HOUSING_TYPE                  0
REGION_POPULATION_RELATIVE         0
DAYS_BIRTH                         0
DAYS_EMPLOYED                      0
DAYS_REGISTRATION                  0
DAYS_ID_PUBLISH                    0
OWN_CAR_AGE                   202929
FLAG_MOBIL                         0
FLAG_EMP_PHONE                     0
FLAG_WORK_PHONE                    0
FLAG_CONT_MOBILE                   0
FLAG_PHONE                         0
F

The code above handles missing values with a threshold of 30% for both categorical and numerical variables. Any variable with more than 30% missing values is dropped. The remaining variables are kept for the analysis: missing categorical values are filled with the most frequent category and missing numerical values with the median.
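The same rule can be followed on a tiny example. This is a minimal sketch on made-up data (the column names and values are hypothetical), showing which column crosses the 30% cut-off and how the remaining gaps are filled.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration
toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 4.0, np.nan],  # 80% null
    "income": [100.0, np.nan, 300.0, 400.0, 500.0],           # 20% null
    "gender": ["F", "M", None, "F", "F"],                     # 20% null
})

threshold = 30  # percent, the same cut-off as above

# Share of missing values per column, in percent
null_pct = toy.isna().mean() * 100

# Keep only the columns at or below the threshold
kept = toy.loc[:, null_pct <= threshold].copy()

for col in kept.columns:
    if pd.api.types.is_numeric_dtype(kept[col]):
        kept[col] = kept[col].fillna(kept[col].median())        # numerical: median
    else:
        kept[col] = kept[col].fillna(kept[col].mode().iloc[0])  # categorical: mode

print(null_pct)
print(kept)
```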

In [11]:
# List of categorical (string) variables
cat

['NAME_CONTRACT_TYPE',
 'CODE_GENDER',
 'FLAG_OWN_CAR',
 'FLAG_OWN_REALTY',
 'NAME_TYPE_SUITE',
 'NAME_INCOME_TYPE',
 'NAME_EDUCATION_TYPE',
 'NAME_FAMILY_STATUS',
 'NAME_HOUSING_TYPE',
 'WEEKDAY_APPR_PROCESS_START',
 'ORGANIZATION_TYPE']

In [12]:
# List of numerical variables
cont

Index(['SK_ID_CURR', 'TARGET', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL',
       'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE',
       'REGION_POPULATION_RELATIVE', 'DAYS_BIRTH', 'DAYS_EMPLOYED',
       'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH', 'FLAG_MOBIL', 'FLAG_EMP_PHONE',
       'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL',
       'CNT_FAM_MEMBERS', 'REGION_RATING_CLIENT',
       'REGION_RATING_CLIENT_W_CITY', 'HOUR_APPR_PROCESS_START',
       'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION',
       'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY',
       'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY', 'EXT_SOURCE_2',
       'EXT_SOURCE_3', 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE',
       'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE',
       'DAYS_LAST_PHONE_CHANGE', 'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3',
       'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6',
       'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8', 'FLAG_DOCUMEN

### Using sample data

In [13]:
df2.shape

(307511, 72)

In [14]:
df3=df2.sample(10000)

In [15]:
df3.shape

(10000, 72)

In [25]:
x=list(df3.columns)
x.remove('TARGET')

In [27]:
outliers_Columns=x

In [28]:
df3=df3[outliers_Columns]

### Step 3: Outlier Treatment

In [16]:
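def outliers_transform(base_dataset):
    """Replace IQR outliers with the column median, in place.

    Only the columns ranked 2nd to 10th by variance are treated.
    """
    for i in base_dataset.var(numeric_only=True).sort_values(ascending=False).index[1:10]:
        x=np.array(base_dataset[i])
        qr1=np.quantile(x,0.25)
        qr3=np.quantile(x,0.75)
        iqr=qr3-qr1
        utv=qr3+(1.5*(iqr))   # upper threshold value
        ltv=qr1-(1.5*(iqr))   # lower threshold value
        med=np.median(x)
        # keep in-range values, replace everything outside the thresholds with the median
        base_dataset[i]=np.where((x<ltv)|(x>utv),med,x)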

In [29]:
outliers_transform(df3)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app
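
The IQR rule used above is easier to follow on a handful of numbers. This is a minimal sketch on made-up values (the column name `amount` is hypothetical). It also works on an explicit `.copy()`, which is the usual way to avoid the "set on a copy of a slice" warning printed above when a frame has been derived from another frame.

```python
import pandas as pd

# Made-up values with one obvious outlier
toy = pd.DataFrame({"amount": [10.0, 12.0, 11.0, 13.0, 12.0, 500.0]})

# Work on an explicit copy so the assignment below targets a real frame
clean = toy.copy()

q1, q3 = clean["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
median = clean["amount"].median()

# Values outside [lower, upper] are replaced by the median
is_outlier = (clean["amount"] < lower) | (clean["amount"] > upper)
clean.loc[is_outlier, "amount"] = median

print(lower, upper, median)
print(clean)
```

Replacing outliers with the median keeps the row count unchanged, at the cost of flattening genuinely extreme but valid values.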


### Step 4: LabelEncoder

###### Now we convert the string values of every categorical variable in the dataset to numerical values

In [30]:
from sklearn.preprocessing import LabelEncoder
def label_encoders(data,cat):
    le=LabelEncoder()
    for i in cat:
        le.fit(data[i])
        x=le.transform(data[i])
        data[i]=x
    return data

In [31]:
df4=label_encoders(df3,cat)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  import sys
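
Because `label_encoders` refits a single `LabelEncoder` for each column, only the mapping of the last column is still available afterwards. Here is a minimal sketch, on made-up data, that stores one mapping per column so the integer codes can be decoded later. It uses pandas categorical codes rather than `LabelEncoder`. For string columns without missing values, both assign integers in sorted order of the labels. The `mappings` dictionary is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

# Made-up data for illustration
toy = pd.DataFrame({
    "contract": ["Cash loans", "Revolving loans", "Cash loans"],
    "gender": ["M", "F", "F"],
})

mappings = {}           # column name -> list of original labels, in code order
encoded = toy.copy()
for col in ["contract", "gender"]:
    as_cat = toy[col].astype("category")
    mappings[col] = list(as_cat.cat.categories)
    encoded[col] = as_cat.cat.codes

print(encoded)
print(mappings)

# Decoding: look each integer code up in the stored label list
decoded_gender = [mappings["gender"][code] for code in encoded["gender"]]
print(decoded_gender)
```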


## Create a base model using all the variables.

In [35]:
df4['TARGET']=df2['TARGET']

In [37]:
df4['TARGET'].value_counts()

0    9208
1     792
Name: TARGET, dtype: int64

### SMOTE: converting imbalanced data to balanced data

In [39]:
y=df4['TARGET']
x=df4.drop('TARGET',axis=1)

In the code above, y is the dependent variable (TARGET) and x holds the independent variables used in the analysis.

In [40]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test=train_test_split(x,y,random_state=120,test_size=0.2)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(8000, 71) (2000, 71) (8000,) (2000,)


In [42]:
from imblearn.over_sampling import SMOTE

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)

print("Number transactions X_train dataset: ", X_train.shape)
print("Number transactions y_train dataset: ", y_train.shape)
print("Number transactions X_test dataset: ", X_test.shape)
print("Number transactions y_test dataset: ", y_test.shape)



print("Before OverSampling, counts of label '1': {}".format(sum(y_train==1)))
print("Before OverSampling, counts of label '0': {} \n".format(sum(y_train==0)))

sm = SMOTE(random_state=2)
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)

print('After OverSampling, the shape of train_X: {}'.format(X_train_res.shape))
print('After OverSampling, the shape of train_y: {} \n'.format(y_train_res.shape))

print("After OverSampling, counts of label '1': {}".format(sum(y_train_res==1)))
print("After OverSampling, counts of label '0': {}".format(sum(y_train_res==0)))

Number transactions X_train dataset:  (7000, 71)
Number transactions y_train dataset:  (7000,)
Number transactions X_test dataset:  (3000, 71)
Number transactions y_test dataset:  (3000,)
Before OverSampling, counts of label '1': 557
Before OverSampling, counts of label '0': 6443 

After OverSampling, the shape of train_X: (12886, 71)
After OverSampling, the shape of train_y: (12886,) 

After OverSampling, counts of label '1': 6443
After OverSampling, counts of label '0': 6443


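The code above splits the data into a training set (70%, 7000 rows) and a test set (30%, 3000 rows), then applies SMOTE to the training set only. Before oversampling, the training set holds 557 rows of class 1 against 6443 rows of class 0; afterwards both classes have 6443 rows. The test set is left untouched.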
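
To show where the extra class 1 rows come from, here is a simplified sketch of the interpolation idea behind SMOTE, written with NumPy only. It is not the imbalanced-learn implementation: the function name `smote_sketch`, its parameters, and the toy points are all assumptions made for illustration.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=2, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)

    # Pairwise Euclidean distances between minority rows
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)              # a row is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_min.shape[1]))
    for j in range(n_new):
        i = rng.integers(len(X_min))            # random minority row
        nb = neighbours[i, rng.integers(k)]     # one of its k nearest neighbours
        gap = rng.random()                      # position on the segment, in [0, 1)
        synthetic[j] = X_min[i] + gap * (X_min[nb] - X_min[i])
    return synthetic

# Four made-up minority points on the corners of a unit square
X_minority = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0], [2.0, 2.0]])
new_rows = smote_sketch(X_minority, n_new=5)
print(new_rows)
```

Every synthetic row lies on a segment between two existing minority rows, so no value falls outside the range already seen in that class.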

### Model Building: Baseline Models, Benchmarking Models

In [48]:
#### Baseline Models

In [51]:
from sklearn.tree import DecisionTreeClassifier
ln=DecisionTreeClassifier(criterion='entropy',max_depth=10)
ln.fit(X_train_res,y_train_res)
ln.predict(X_test)
from sklearn.metrics import confusion_matrix,accuracy_score
print(confusion_matrix(y_test,ln.predict(X_test)))
print(accuracy_score(y_test.values,ln.predict(X_test)))

[[2588  177]
 [ 204   31]]
0.873
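
Accuracy alone hides how the model treats the minority class. As a sketch, the figures below are derived directly from the confusion matrix printed above, assuming scikit-learn's layout (rows are true classes, columns are predicted classes). The variable names are chosen here for illustration.

```python
import numpy as np

# Confusion matrix printed above: rows = true class, columns = predicted class
cm = np.array([[2588, 177],
               [204, 31]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)                 # share of predicted defaults that are real
recall = tp / (tp + fn)                    # share of real defaults that are found
majority_baseline = (tn + fp) / cm.sum()   # accuracy of always predicting class 0

print(f"accuracy          {accuracy:.3f}")
print(f"precision         {precision:.3f}")
print(f"recall            {recall:.3f}")
print(f"majority baseline {majority_baseline:.3f}")
```

On these numbers the tree finds 31 of the 235 class 1 rows in the test set, and always predicting class 0 would already reach an accuracy of about 0.92, so accuracy is a weak yardstick for this problem.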


In [61]:
### overfitted models

In [64]:
from sklearn.tree import DecisionTreeClassifier
ln=DecisionTreeClassifier(criterion='entropy',max_depth=10)
ln.fit(X_train_res,y_train_res)
ln.predict(X_train_res)
from sklearn.metrics import confusion_matrix,accuracy_score
print(confusion_matrix(y_train_res,ln.predict(X_train_res)))
print(accuracy_score(y_train_res,ln.predict(X_train_res)))

[[6229  214]
 [ 934 5509]]
0.9109110662734751


In [65]:
### benchmarking models

In [73]:
accuracy_score1=[]
for i in range(1,10):
    from sklearn.ensemble import RandomForestClassifier
    ln=RandomForestClassifier(n_estimators=i)
    ln.fit(X_train_res,y_train_res)
    ln.predict(X_test)
    #from sklearn.metrics import confusion_matrix,accuracy_score
    #print(confusion_matrix(y_test,ln.predict(X_test)))
    #print(accuracy_score(y_test.values,ln.predict(X_test)))
    accuracy_score1.append(accuracy_score(y_test.values,ln.predict(X_test)))

In [74]:
accuracy_score1=np.array(accuracy_score1)

In [75]:
accuracy_score1.argmax()

5
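
`argmax` returns a zero-based position, while the loop above tries `n_estimators` values from 1 to 9. A position of 5 therefore points at `n_estimators=6`. The sketch below makes the mapping explicit. The scores are hypothetical values invented for illustration, not the notebook's results.

```python
import numpy as np

n_estimators_grid = list(range(1, 10))   # the values tried in the loop above

# Hypothetical scores, made up for illustration only
scores = np.array([0.80, 0.82, 0.85, 0.86, 0.87, 0.90, 0.88, 0.89, 0.88])

best_pos = int(scores.argmax())          # zero-based position in the list
best_n = n_estimators_grid[best_pos]     # the n_estimators value at that position
print(best_pos, best_n)
```

Since the forests in the loop are built without a fixed `random_state`, the best position can also change from run to run.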

In [78]:
from sklearn.model_selection import GridSearchCV
rf_clf = RandomForestClassifier(random_state=42)
params_grid = {"max_depth": [3, None],
               "min_samples_split": [2, 3, 10],
               "min_samples_leaf": [1, 3, 10],
               "bootstrap": [True, False],
               "criterion": ['gini', 'entropy']}
grid_search = GridSearchCV(rf_clf, params_grid)

In [79]:
grid_search.fit(X_train_res, y_train_res)
grid_search.best_score_
grid_search.best_estimator_.get_params()

{'bootstrap': False,
 'class_weight': None,
 'criterion': 'entropy',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 10,
 'n_jobs': None,
 'oob_score': False,
 'random_state': 42,
 'verbose': 0,
 'warm_start': False}
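
The grid above looks small, but the options multiply. A quick sketch that counts the candidate settings using only the standard library. Each candidate is fitted once per cross-validation fold, so the number of model fits is this count times the number of folds.

```python
from itertools import product

# Same grid as in the cell above
params_grid = {"max_depth": [3, None],
               "min_samples_split": [2, 3, 10],
               "min_samples_leaf": [1, 3, 10],
               "bootstrap": [True, False],
               "criterion": ['gini', 'entropy']}

# One tuple per combination of parameter values
combos = list(product(*params_grid.values()))
print(len(combos))

# First candidate as a readable dict
first = dict(zip(params_grid.keys(), combos[0]))
print(first)
```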

In [80]:
# Best parameters reported by the grid search above, kept as a dict for reference
best_params = {'bootstrap': False, 'class_weight': None, 'criterion': 'entropy', 'max_depth': None,
 'max_features': 'auto', 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 10,
 'n_jobs': None,
 'oob_score': False,
 'random_state': 42,
 'verbose': 0,
 'warm_start': False}

In [81]:
### Benchmarking Models

In [82]:
from sklearn.ensemble import RandomForestClassifier
ln=RandomForestClassifier(n_estimators=5)
ln.fit(X_train_res,y_train_res)
ln.predict(X_test)
from sklearn.metrics import confusion_matrix,accuracy_score
print(confusion_matrix(y_test,ln.predict(X_test)))
print(accuracy_score(y_test.values,ln.predict(X_test)))


[[2667   98]
 [ 208   27]]
0.898


In [87]:
len(X_train.columns)

71

In [93]:
import pandas as pd
pd.set_option('display.float_format',lambda x:'%.5f' % x)
columns_important=pd.DataFrame(ln.feature_importances_,X_train.columns)

In [95]:
columns_important.reset_index(inplace=True)

In [97]:
columns_important.columns=['columns name','score']

In [101]:
columns_important.sort_values('score').head()

Unnamed: 0,columns name,score
47,FLAG_DOCUMENT_4,0.0
45,FLAG_DOCUMENT_2,0.0
60,FLAG_DOCUMENT_17,0.0
50,FLAG_DOCUMENT_7,0.0
62,FLAG_DOCUMENT_19,0.0


In [102]:
columns_important.sort_values('score').tail()

Unnamed: 0,columns name,score
12,NAME_EDUCATION_TYPE,0.0462
38,EXT_SOURCE_2,0.05287
39,EXT_SOURCE_3,0.05999
46,FLAG_DOCUMENT_3,0.06307
35,REG_CITY_NOT_WORK_CITY,0.07994
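
The importance table above is built over several cells. A more compact sketch of the same ranking uses a pandas Series indexed by feature name. The feature names and scores below are placeholders invented for illustration, not values from the model.

```python
import pandas as pd

# Placeholder names and scores, for illustration only
features = ["f_a", "f_b", "f_c"]
importances = [0.2, 0.5, 0.3]

# Index by feature name, then sort from most to least important
ranking = pd.Series(importances, index=features, name="score").sort_values(ascending=False)
print(ranking)
```

With a fitted forest, `importances` would be its `feature_importances_` array and `features` the training column names.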


In [103]:
#### forward selection and backward elimination

In [113]:
columns1=list(X_train.columns)

In [115]:
columns1[0:3]

['SK_ID_CURR', 'NAME_CONTRACT_TYPE', 'CODE_GENDER']

In [116]:
X_train_res=pd.DataFrame(X_train_res)
X_test=pd.DataFrame(X_test)

In [117]:
columns1=X_train_res.columns

In [123]:
columns2=X_test.columns

In [124]:
for i in range(1,len(columns1)+1):
    X_train_forward=X_train_res[columns1[0:i]]
    X_test_forward=X_test[columns2[0:i]]
    ln=RandomForestClassifier(n_estimators=5)
    ln.fit(X_train_forward,y_train_res)
    ln.predict(X_test_forward)
    from sklearn.metrics import confusion_matrix,accuracy_score
    print(confusion_matrix(y_test,ln.predict(X_test_forward)))
    print(accuracy_score(y_test,ln.predict(X_test_forward)))


[[1434 1331]
 [ 116  119]]
0.5176666666666667
[[1535 1230]
 [ 130  105]]
0.5466666666666666
[[1911  854]
 [ 155   80]]
0.6636666666666666
[[2148  617]
 [ 173   62]]
0.7366666666666667
[[2293  472]
 [ 189   46]]
0.7796666666666666
[[2438  327]
 [ 207   28]]
0.822
[[2589  176]
 [ 221   14]]
0.8676666666666667
[[2650  115]
 [ 226    9]]
0.8863333333333333
[[2656  109]
 [ 224   11]]
0.889
[[2647  118]
 [ 222   13]]
0.8866666666666667
[[2675   90]
 [ 223   12]]
0.8956666666666667
[[2673   92]
 [ 221   14]]
0.8956666666666667
[[2669   96]
 [ 223   12]]
0.8936666666666667
[[2670   95]
 [ 226    9]]
0.893
[[2692   73]
 [ 225   10]]
0.9006666666666666
[[2685   80]
 [ 229    6]]
0.897
[[2693   72]
 [ 225   10]]
0.901
[[2687   78]
 [ 224   11]]
0.8993333333333333
[[2685   80]
 [ 223   12]]
0.899
[[2687   78]
 [ 222   13]]
0.9
[[2656  109]
 [ 226    9]]
0.8883333333333333
[[2671   94]
 [ 231    4]]
0.8916666666666667
[[2682   83]
 [ 217   18]]
0.9
[[2674   91]
 [ 222   13]]
0.8956666666666667
[[26