# **A Data Science Project on Credit Default Risk Analysis**
##### Author: Mukesh Kumar Chaudhary
##### Email: cmukesh8688@gmail.com

### **Problem Statement**
Home Credit B.V. is an international non-bank financial institution founded in 1997 in the Czech Republic. The company operates in 14 countries and focuses on lending primarily to people with little or no credit history. As of 2016, the company had over 15 million active customers, two-thirds of them in Asia and 7.3 million in China.

Many people struggle to get loans due to insufficient or non-existent credit histories. And, unfortunately, this population is often taken advantage of by untrustworthy lenders.

Home Credit strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. In order to make sure this underserved population has a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities.


### **Data**


- application_{train|test}.csv

 - This is the main table, broken into two files for Train (with TARGET) and Test (without TARGET).
 - Static data for all applications. One row represents one loan in our data sample.
 
 
- bureau.csv
 - All of the client's previous credits provided by other financial institutions that were reported to the Credit Bureau (for clients who have a loan in our sample).
 - For every loan in our sample, there are as many rows as number of credits the client had in Credit Bureau before the application date.
 
 
- bureau_balance.csv
 - Monthly balances of previous credits in Credit Bureau.
 - This table has one row for each month of history of every previous credit reported to Credit Bureau – i.e. the table has (#loans in sample * # of relative previous credits * # of months where we have some history observable for the previous credits) rows.
 

- POS_CASH_balance.csv
 - Monthly balance snapshots of previous POS (point of sales) and cash loans that the applicant had with Home Credit.
 - This table has one row for each month of history of every previous credit in Home Credit (consumer credit and cash loans) related to loans in our sample – i.e. the table has (#loans in sample * # of relative previous credits * # of months in which we have some history observable for the previous credits) rows.
 
 
- credit_card_balance.csv
 - Monthly balance snapshots of previous credit cards that the applicant has with Home Credit.
 - This table has one row for each month of history of every previous credit in Home Credit (consumer credit and cash loans) related to loans in our sample – i.e. the table has (#loans in sample * # of relative previous credit cards * # of months where we have some history observable for the previous credit card) rows.
 
 
- previous_application.csv
 - All previous applications for Home Credit loans of clients who have loans in our sample.
 - There is one row for each previous application related to loans in our data sample.
 
 
- installments_payments.csv
 - Repayment history for the previously disbursed credits in Home Credit related to the loans in our sample.
   There is a) one row for every payment that was made plus b) one row for each missed payment.
   One row is equivalent to one payment of one installment OR one installment corresponding to one payment of one previous Home Credit credit related to loans in our sample.
   

- train_bureau.csv 
 - This dataframe is created manually by group-joining the application_train, bureau and bureau_balance dataframes with the aggregation functions count, sum, max, min, mean.
 
 
- previous_loan_final.csv
 - This dataframe is created manually by group-joining the previous_application, POS_CASH_balance, credit_card_balance and installments_payments dataframes with the aggregation functions count, sum, max, min, mean.
 
 
- home_credit_final.csv
 - This dataframe is created manually by joining train_bureau.csv and previous_loan_final.csv. 
 
 
- automative_features_app.csv
 - This is generated automatically with the ***featuretools*** library using the aggregation primitives sum, max, min, mode, mean, count.
 
 
![image](image/database_flowchart.png)
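
The one-to-many layout described above can be illustrated on toy data. The sketch below is a miniature under stated assumptions, not the real files: the three small frames and all their values are invented, and only the key names `SK_ID_CURR` and `SK_ID_BUREAU` come from the dataset. It rolls monthly balance rows up to one row per previous credit, then to one row per client, and finally left-joins the result onto the application table.

```python
import pandas as pd

# Toy stand-ins for application_train, bureau and bureau_balance (values invented).
app = pd.DataFrame({"SK_ID_CURR": [1, 2], "TARGET": [0, 1]})
bureau = pd.DataFrame({"SK_ID_CURR": [1, 1, 2], "SK_ID_BUREAU": [10, 11, 20]})
balance = pd.DataFrame({
    "SK_ID_BUREAU": [10, 10, 11, 20, 20, 20],
    "MONTHS_BALANCE": [-1, -2, -1, -1, -2, -3],
})

# Level 1: one row per previous credit (months of history per credit).
months_per_credit = (
    balance.groupby("SK_ID_BUREAU")["MONTHS_BALANCE"]
    .count()
    .rename("n_months")
    .reset_index()
)

# Level 2: attach the client id, then roll up to one row per client.
per_client = (
    bureau.merge(months_per_credit, on="SK_ID_BUREAU", how="left")
    .groupby("SK_ID_CURR")["n_months"]
    .agg(["count", "sum", "max", "min", "mean"])
    .add_prefix("bureau_n_months_")
    .reset_index()
)

# A left join keeps every application, even those without bureau history.
train_bureau = app.merge(per_client, on="SK_ID_CURR", how="left")
print(train_bureau)
```

The left join is the important design choice here: clients with no reported credits stay in the training table and simply receive missing values for the aggregated columns.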


In [None]:
# import all necessary libraries
# for data manipulation
import pandas as pd 
import numpy as np

# data visualization
import matplotlib.pyplot as plt
%matplotlib inline 
import seaborn as sns


# ignore warnings from pandas
import warnings
warnings.filterwarnings("ignore")


# for featuretools
import featuretools as ft

# import user libraries 
from text_format_class import TxtFormat 
import manage_missing_data as manage_df
import manage_dataframe as manage_agg_cat
import display_corr as manage_corr
import manage_model
import manage_pca
import plot_features
%load_ext autoreload
%autoreload 2



In [None]:
# import sklearn library for preprocessing ,modelling , Accuracy Analysis , Cross Validation , optimization 

from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegressionCV
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix,classification_report,accuracy_score,roc_auc_score
from  sklearn.model_selection import train_test_split

# XGBoosting 

import xgboost as xgb
from sklearn.metrics import mean_absolute_error

# Gradient Boosting algorithm
import os
import lightgbm as lgb
sns.set_style("darkgrid")

### Retrieving Data and Data dimension

In [None]:
# upload all Data 
pd.options.display.max_columns = None
df_app_train = pd.read_csv("Data/application_train.csv")
df_app_test = pd.read_csv("Data/application_test.csv")
df_bureau = pd.read_csv("Data/bureau.csv")
df_bureau_balance = pd.read_csv("Data/bureau_balance.csv")
df_previous = pd.read_csv("Data/previous_application.csv")
df_credit = pd.read_csv("Data/credit_card_balance.csv")
df_cash = pd.read_csv("Data/POS_CASH_balance.csv")
df_payment = pd.read_csv("Data/installments_payments.csv")

In [None]:
# Dimension of Data 
print("Dimension of Data")
print("------------------")
print("Application Train    : ", df_app_train.shape )
print("Application Test     :",df_app_test.shape)
print("Bureau               :" ,df_bureau.shape)
print("Bureau Balance       :",df_bureau_balance.shape)
print("Previous application : ",df_previous.shape)
print("Credit card balance  :",df_credit.shape)
print("POS_CASH_balance     :" ,df_cash.shape)
print("Instalments payment  :",df_payment.shape)


In [None]:
# application_train from Home credit
print(df_app_train.shape)
df_app_train.head()


In [None]:
# for test application
print(df_app_test.shape)
df_app_test.head()

In [None]:
# check columns names between application test and application train 
for col in df_app_train.columns:
    if col not in df_app_test.columns:
        print(col)
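
The same column comparison can be written without an explicit loop. This is a small sketch on two invented, empty frames; `Index.difference` returns the labels present in the first index but not in the second.

```python
import pandas as pd

# Invented column sets that mimic the train/test difference.
train = pd.DataFrame(columns=["SK_ID_CURR", "TARGET", "AMT_CREDIT"])
test = pd.DataFrame(columns=["SK_ID_CURR", "AMT_CREDIT"])

# Labels that exist only in the training frame.
only_in_train = train.columns.difference(test.columns).tolist()
print(only_in_train)
```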

In [None]:
#pd.options.display.max_columns = None
print(df_app_train.columns.values)

In [None]:
# credit history from another bureau
print(df_bureau.shape)
df_bureau.head()

In [None]:
# features of bureau dataset
df_bureau.columns.values

In [None]:
# bureau balance 
print(df_bureau_balance.shape)
df_bureau_balance.head()

In [None]:
# features of bureau balance dataset
df_bureau_balance.columns.values

In [None]:
# previous application
print(df_previous.shape)
df_previous.head()


In [None]:
# features of previous application 
df_previous.columns.values

In [None]:
#Credit card balance dataset
print(df_credit.shape)
df_credit.head()

In [None]:
# features of credit card balance dataset
df_credit.columns.values

In [None]:
# POS_CASH balance dataset
print(df_cash.shape)
df_cash.head()

In [None]:
# features of POS_CASH balance dataset
df_cash.columns.values

In [None]:
# instalments payments dataset 
print(df_payment.shape)
df_payment.head()

In [None]:
# features of instalments_payments 
df_payment.columns.values

### Checking missing values

In [None]:
# application train
manage_df.missing_data_display(df_app_train)

In [None]:
# drop columns with more than 40% missing values, then fill the remaining gaps
manage_df.delete_missing_values(df_app_train)
manage_df.handle_missing_value(df_app_train)
manage_df.missing_data_display(df_app_train)
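
The helpers in `manage_missing_data` are not shown in this notebook, so the following is only a guess at what such a cleanup could look like. The function names `drop_sparse_columns` and `fill_missing` are hypothetical, the 40% threshold is taken from the comment above, and filling with the median and the mode is an assumption rather than the module's confirmed behaviour.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df, threshold=0.4):
    """Return a copy of df without columns whose missing share exceeds threshold."""
    missing_share = df.isna().mean()
    keep = missing_share[missing_share <= threshold].index
    return df[keep].copy()

def fill_missing(df):
    """Fill numeric gaps with the column median and other gaps with the mode."""
    out = df.copy()
    for col in out.columns:
        if out[col].isna().any():
            if pd.api.types.is_numeric_dtype(out[col]):
                out[col] = out[col].fillna(out[col].median())
            else:
                out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

# Invented toy frame: one sparse column, one numeric gap, one categorical gap.
toy = pd.DataFrame({
    "mostly_empty": [np.nan, np.nan, np.nan, 1.0],
    "income": [100.0, np.nan, 300.0, 200.0],
    "gender": ["F", None, "F", "M"],
})
clean = fill_missing(drop_sparse_columns(toy))
print(clean)
```

Unlike the notebook's helpers, which appear to modify the frame in place, this sketch returns new frames so that the original data stays available for comparison.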

In [None]:
#application test 
manage_df.delete_missing_values(df_app_test)
manage_df.handle_missing_value(df_app_test)
manage_df.missing_data_display(df_app_test)

In [None]:
#Bureau dataset
manage_df.delete_missing_values(df_bureau)
manage_df.handle_missing_value(df_bureau)
manage_df.missing_data_display(df_bureau)

In [None]:
#bureau balance dataset 
manage_df.delete_missing_values(df_bureau_balance)
manage_df.handle_missing_value(df_bureau_balance)
manage_df.missing_data_display(df_bureau_balance)
print(df_bureau_balance.shape)

In [None]:
#previous application dataset
print(df_previous.shape)
manage_df.delete_missing_values(df_previous)
manage_df.handle_missing_value(df_previous)
print(df_previous.shape)
manage_df.missing_data_display(df_previous)


In [None]:
#POS_CASH dataset 
print(df_cash.shape)
manage_df.delete_missing_values(df_cash)
manage_df.handle_missing_value(df_cash)
print(df_cash.shape)
manage_df.missing_data_display(df_cash)

In [None]:
#credit balance dataset 
print(df_credit.shape)
manage_df.delete_missing_values(df_credit)
manage_df.handle_missing_value(df_credit)
print(df_credit.shape)
manage_df.missing_data_display(df_credit)

In [None]:
#payment balance dataset 
print(df_payment.shape)
manage_df.delete_missing_values(df_payment)
manage_df.handle_missing_value(df_payment)
print(df_payment.shape)
manage_df.missing_data_display(df_payment)

In [None]:
df_app_train.head()

## Data Exploration

#### Application train and test data 

In [None]:
# About Application train dataframe

print(df_app_train.shape)
df_app_train.head()

In [None]:
# features of df_application_train 
df_app_train.columns

In [None]:
# calculating the percentage of loans not repaid and plotting 

total_no_applicant = df_app_train['TARGET'].count()

total_repaid = df_app_train[df_app_train['TARGET']==0]['TARGET'].count()

total_not_repaid = df_app_train[df_app_train['TARGET']==1]['TARGET'].count()



start = "\033[1m"
end = "\033[0;0m"

print("Status of Target")
print("------------------")
print(f"total_no_applicant : {TxtFormat().BOLD} {total_no_applicant} {TxtFormat().END} ")
print(f"""total loan was repaid: {TxtFormat().BOLD} {total_repaid} {TxtFormat().END} and Percent :  {TxtFormat().BOLD} {round(total_repaid/total_no_applicant *100)} %  {TxtFormat().END} """)
print(f"total loan was not repaid: {TxtFormat().BOLD} {total_not_repaid} {TxtFormat().END} and Percent : {TxtFormat().BOLD} {round(total_not_repaid/total_no_applicant *100)} %  {TxtFormat().END} ")


# plot graph

fig , ax = plt.subplots(nrows=1,ncols=2,figsize=(16,5))
# the bar plot takes `color` (the pie plot below takes `colors`)
df_app_train['TARGET'].value_counts().plot(kind='bar' ,color = sns.color_palette(), ax =ax[0], fontsize = 14 , label = '0: Repaid \n 1: Notrepaid')
ax[0].set_title("Count of target variable",fontsize = 14)
ax[0].set_ylabel("Counts", fontsize = 14)
ax[0].set_xlabel("Target",fontsize = 14)

df_app_train['TARGET'].value_counts().plot.pie(autopct = "%1.0f%%", colors = sns.color_palette(), labels = ['repaid','notrepaid'],fontsize =18 ,ax =ax[1])
ax[1].set_title("Distribution of target variable",fontsize = 14)

fig.savefig(" Target Level.png")

In [None]:
# status of NAME_CONTRACT_TYPE with target
plot_features.display_targetfeature(df_app_train,'TARGET','NAME_CONTRACT_TYPE','SK_ID_CURR',horizantal= False)


In [None]:
# Gender with respect to target
plot_features.display_targetfeature(df_app_train,'TARGET','CODE_GENDER','SK_ID_CURR',horizantal= False)

In [None]:
# status CNT_CHILDREN
plot_features.display_targetfeature(df_app_train,'TARGET','CNT_CHILDREN','SK_ID_CURR',horizantal= False)
plt.legend(loc='best')


In [None]:
# NAME_TYPE_SUITE

plot_features.display_targetfeature(df_app_train,'TARGET','NAME_TYPE_SUITE','SK_ID_CURR',horizantal=False)

In [None]:
# NAME_INCOME_TYPE  feature plot with target 

plot_features.display_targetfeature(df_app_train,'TARGET','NAME_INCOME_TYPE','SK_ID_CURR',horizantal=False)


The "Working" income type appears to account for the highest percentage in both target classes (0 and 1).

In [None]:
# family status
plot_features.display_targetfeature(df_app_train,'TARGET','NAME_FAMILY_STATUS','SK_ID_CURR')

In [None]:
# status of NAME_HOUSING_TYPE
plot_features.display_targetfeature(df_app_train,'TARGET','NAME_HOUSING_TYPE','SK_ID_CURR')


In [None]:
# status of OCCUPATION_TYPE

plot_features.display_targetfeature(df_app_train,'TARGET','OCCUPATION_TYPE','SK_ID_CURR',horizantal=True)

In [None]:
df_app_train.head()

In [None]:
# distribution of Amount of Credit
plot_features.plot_distribution_feature(df_app_train,'AMT_CREDIT','blue')

In [None]:
plot_features.plot_distribution_feature(df_app_train,'AMT_INCOME_TOTAL','blue')

In [None]:
#check outlier 

#df_app_train['AMT_INCOME_TOTAL'].boxplot()

from scipy import stats

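# z-scores are only defined for numeric columns, so restrict the check to them
num_cols = df_app_train.select_dtypes(include='number')
df_app_train[(np.abs(stats.zscore(num_cols)) < 3).all(axis=1)]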

#np.abs(stats.zscore(df_app_train['AMT_INCOME_TOTAL']) < 3).all()

In [None]:
# AMT_ANNUITY
plot_features.plot_distribution_feature(df_app_train,'AMT_ANNUITY','blue')

In [None]:
#   AMT_GOODS_PRICE

plot_features.plot_distribution_feature(df_app_train,'AMT_GOODS_PRICE','blue')

In [None]:
plot_features.plot_distribution_feature(df_app_train,'DAYS_BIRTH','blue')

The negative values mean that the date of birth lies in the past relative to the application date. Applicant ages range from approximately 20 to 68 years.
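
To read `DAYS_BIRTH` as an age, the sign can be flipped and the day count divided by the length of a year. The values below are invented to span the range seen in the plot, and using 365.25 days per year is an approximation.

```python
import pandas as pd

# Invented day counts relative to the application date (negative = in the past).
days_birth = pd.Series([-7300, -16425, -24820], name="DAYS_BIRTH")

# Flip the sign and convert days to years.
age_years = (-days_birth / 365.25).round(1)
print(age_years.tolist())
```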

In [None]:
# Day employed distribution 
plot_features.plot_distribution_feature(df_app_train,'DAYS_EMPLOYED','blue')

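Negative values give the number of days the applicant had been employed before the application date. The plot also shows a large group of values that would correspond to more than 100 years of employment, which is implausible and most likely a placeholder (possibly for applicants without employment) rather than a real duration.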
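
One way to handle such implausible durations is to flag them and treat them as missing before converting to years. In this sketch the sentinel `365243` is an assumption that should be checked against the actual column (for example with `value_counts()`), and the sample values are invented.

```python
import numpy as np
import pandas as pd

PLACEHOLDER = 365243  # assumed sentinel value; verify it against the real data

# Invented sample: two real durations and two placeholder entries.
days_employed = pd.Series([-1200, 365243, -300, 365243], name="DAYS_EMPLOYED")

# Keep a flag, since "placeholder or not" may itself be informative.
is_placeholder = days_employed == PLACEHOLDER
cleaned = days_employed.where(~is_placeholder, np.nan)
years_employed = -cleaned / 365.25
print(years_employed.round(2).tolist())
```

Keeping the boolean flag as a separate feature preserves the information that the value was a placeholder, while the cleaned column no longer distorts the distribution.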

In [None]:
# Days of registration distribution
plot_features.plot_distribution_feature(df_app_train,'DAYS_REGISTRATION','blue')

In [None]:
# kde plot EXT_SOURCE_3

manage_corr.Kde_target('EXT_SOURCE_3',df_app_train)

In [None]:
var = ['AMT_ANNUITY','AMT_GOODS_PRICE','DAYS_EMPLOYED', 'DAYS_REGISTRATION','DAYS_BIRTH','DAYS_ID_PUBLISH']

for v in var:
    a = str(v).split()
    print(a)
    
a = 'Mukesh'

print(a.split())    

In [None]:
#compare with target = 1 and target = 0 with 
# var = ['AMT_ANNUITY','AMT_GOODS_PRICE','DAYS_EMPLOYED', 'DAYS_REGISTRATION','DAYS_BIRTH','DAYS_ID_PUBLISH']


var = ['AMT_ANNUITY','AMT_GOODS_PRICE','DAYS_EMPLOYED', 'DAYS_REGISTRATION','DAYS_BIRTH','DAYS_ID_PUBLISH']
plot_features.plot_distribution_comp(df_app_train,var,3)

In [None]:
# check top 10 features correlation with target

manage_corr.target_corrs(df_app_train)

### EDA when join application train and previous loan

In [None]:
# join df_app_train and df_previous 

df_train_previous_eda = df_app_train.merge(df_previous,on='SK_ID_CURR',how ='left')
print(df_train_previous_eda.shape)
df_train_previous_eda.head()
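
When both frames share a column name other than the join key, pandas appends `_x` (left frame) and `_y` (right frame) by default, which is where names such as `AMT_CREDIT_y` in the following cells come from. The sketch below uses invented values; the suffixes `_curr` and `_prev` are only a suggestion for more readable names.

```python
import pandas as pd

# Invented toy frames that share the column AMT_CREDIT.
app = pd.DataFrame({"SK_ID_CURR": [1, 2], "AMT_CREDIT": [500.0, 900.0]})
prev = pd.DataFrame({"SK_ID_CURR": [1, 1], "AMT_CREDIT": [100.0, 250.0]})

# Default suffixes: _x for the left frame, _y for the right frame.
default = app.merge(prev, on="SK_ID_CURR", how="left")

# Explicit suffixes make the origin of each column obvious.
named = app.merge(prev, on="SK_ID_CURR", how="left", suffixes=("_curr", "_prev"))
print(default.columns.tolist())
print(named.columns.tolist())
```

Note that the left join repeats the application row once per previous application, so the merged frame has more rows than `app`.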

In [None]:
# features of joining application and previous application loan
df_train_previous_eda.columns

In [None]:
# status of 'NAME_CONTRACT_STATUS' of previous loan

plot_features.display_targetfeature(df_train_previous_eda,'TARGET','NAME_CONTRACT_STATUS','SK_ID_CURR')

In [None]:
# 'AMT_CREDIT_y' of previous loan

plot_features.plot_distribution_onefeature(df_train_previous_eda,'AMT_CREDIT_y',color = 'blue')

In [None]:
# kde plot AMT_CREDIT_y

manage_corr.Kde_target('AMT_CREDIT_y',df_train_previous_eda)

In [None]:
# compare of previous loan 'AMT_ANNUITY_y', 'AMT_APPLICATION' ,'AMT_CREDIT_y', 'AMT_GOODS_PRICE_y'

features = ['AMT_ANNUITY_y', 'AMT_APPLICATION','AMT_CREDIT_y', 'AMT_GOODS_PRICE_y']

plot_features.plot_distribution_comp(df_train_previous_eda,features,n_row=2)

### Bureau data Exploration 

In [None]:
# dimension and info
df_bureau.head()

In [None]:
# Credit amount distribution 

plot_features.plot_distribution_onefeature(df_bureau,'AMT_CREDIT_SUM','blue')

### Part 1
previou_loan_final.csv is created by aggregating and joining previous_application, POS_CASH_balance, installments_payments and credit_card_balance. The resulting dataframe describes each client's previous loans together with their cash and credit transactions.

In [None]:
# previous loan application information which has credit amount , status , type of loan 
df_previous.head()

In [None]:
df_previous['NAME_CONTRACT_TYPE'].value_counts().plot(kind='barh')

In [None]:
df_previous[df_previous['NAME_CONTRACT_TYPE']=='Revolving loans']

In [None]:
# checking duplicates column-wise to analyse the one-to-many relationship
print(df_previous['SK_ID_PREV'].duplicated().sum())
print(df_previous['SK_ID_CURR'].duplicated().sum())
df_previous.head()

In [None]:
# checking  each row as duplicated 
df_previous.duplicated().sum()
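
The duplicate counts above are a way of reading off the relationship between the keys. A more direct check, sketched here on invented ids, is to test whether the child key is unique and to count how many previous loans each client has.

```python
import pandas as pd

# Invented ids: client 1 has two previous loans, client 2 has one.
prev = pd.DataFrame({"SK_ID_PREV": [100, 101, 102], "SK_ID_CURR": [1, 1, 2]})

# SK_ID_PREV should identify a row; SK_ID_CURR may repeat (one-to-many).
is_unique_prev = prev["SK_ID_PREV"].is_unique
loans_per_client = prev.groupby("SK_ID_CURR")["SK_ID_PREV"].nunique()
print(is_unique_prev)
print(loans_per_client)
```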

In [None]:
print(df_cash['SK_ID_PREV'].duplicated().sum())
print(df_cash['SK_ID_CURR'].duplicated().sum())
df_cash.head()

In [None]:
print(df_credit['SK_ID_PREV'].duplicated().sum())
print(df_credit['SK_ID_CURR'].duplicated().sum())
df_credit.head()

In [None]:
print(df_payment['SK_ID_PREV'].duplicated().sum())
print(df_payment['SK_ID_CURR'].duplicated().sum())
df_payment.head()

In [None]:
print(df_previous[df_previous['SK_ID_PREV']==2802425])
df_cash[df_cash['SK_ID_PREV']==2802425]


In [None]:
#df_credit[df_credit['SK_ID_PREV']==2802425]  # cash loan id
#df_credit[df_credit['SK_ID_PREV']==2030495]  # consumer loan id 
#df_credit[df_credit['SK_ID_PREV']==1285768].sort_values(by='MONTHS_BALANCE',ascending = False)  # revolving loan id 
df_credit[df_credit['SK_ID_PREV']==1629736].sort_values(by='MONTHS_BALANCE',ascending = False)

In [None]:
df_payment[df_payment['SK_ID_PREV']==1629736].sort_values(by='NUM_INSTALMENT_NUMBER',ascending = True)

In [None]:
import sys


def return_size(df):
    # return the size of the dataframe in gigabytes
    return round(sys.getsizeof(df)/1e9,2)

In [None]:
df_previous.info()

In [None]:
df_previous.columns

In [None]:
df_previous_num = manage_agg_cat.agg_numeric(df_previous,group_var='SK_ID_CURR',df_name='previous')
print(df_previous_num.shape)
df_previous_num.head()

In [None]:
df_previous_cat = manage_agg_cat.count_categorical(df_previous,group_var='SK_ID_CURR',df_name='previous')
print(df_previous_cat.shape)
df_previous_cat.head()

In [None]:
manage_df.missing_data_display(df_previous_cat)

In [None]:
df_previous_final = df_previous_num.merge(df_previous_cat,on='SK_ID_CURR',how = 'inner')
print(df_previous_final.shape)
df_previous_final.head()

In [None]:
df_previous_final.head()

In [None]:
manage_df.missing_data_display(df_previous_final)

In [None]:
# numeric data grouping by SK_ID_CURR on POS_CASH_balance
df_cash_num = manage_agg_cat.agg_numeric(df_cash,group_var='SK_ID_CURR',df_name='pos_cash')
print(df_cash_num.shape)
df_cash_num.head()

In [None]:
# categorical data grouping by SK_ID_CURR on POS_CASH_balance
df_cash_cat = manage_agg_cat.count_categorical(df_cash,group_var='SK_ID_CURR',df_name='pos_cash')
print(df_cash_cat.shape)
df_cash_cat.head()

In [None]:
# merging numeric data and categorical data 

df_cash_final = df_cash_num.merge(df_cash_cat,on='SK_ID_CURR',how = 'inner')
print(df_cash_final.shape)
df_cash_final.head()

#### credit_card_balance dataframe

Credit_card_balance dataframe has credit transactions 

In [None]:
# numeric data grouping by SK_ID_CURR on credit_card_balance
df_credit_num = manage_agg_cat.agg_numeric(df_credit,group_var='SK_ID_CURR',df_name='credit')
print(df_credit_num.shape)
df_credit_num.head()

In [None]:
# categorical  data grouping by SK_ID_CURR on credit_card_balance
df_credit_cat = manage_agg_cat.count_categorical(df_credit,group_var='SK_ID_CURR',df_name='credit')
print(df_credit_cat.shape)
df_credit_cat.head()

In [None]:
# merging numeric data and categorical data of Credit_card_balance dataframe 

df_credit_final = df_credit_num.merge(df_credit_cat,on='SK_ID_CURR',how = 'inner')
print(df_credit_final.shape)
df_credit_final.head()

#### Instalments Payment DataFrame
This dataframe holds the payment and missed-payment history of previous loans.

In [None]:
# numeric data grouping by SK_ID_CURR on installments payment dataframe 
df_payment_num = manage_agg_cat.agg_numeric(df_payment,group_var='SK_ID_CURR',df_name='payment')
print(df_payment_num.shape)
df_payment_num.head()

#### Joining all dataframes: df_previous_final, df_cash_final, df_credit_final, df_payment_num

In [None]:
# Joining all dataframe df_previous_final, df_cash_final,df_credit_final,df_payment_num

df_previous_loan = df_previous_final.merge(df_cash_final,on='SK_ID_CURR',how ='left')



df_previous_loan = df_previous_loan.merge(df_credit_final,on='SK_ID_CURR',how ='left')


df_previous_loan = df_previous_loan.merge(df_payment_num,on='SK_ID_CURR',how ='left')


print(df_previous_loan.shape)

df_previous_loan.head()


In [None]:
# checking null values 

manage_df.missing_data_display(df_previous_loan)

In [None]:
# Save file in csv 

df_previous_loan.to_csv("previou_loan_final.csv",index = False)

In [None]:
df_test = pd.read_csv("previou_loan_final.csv")
df_test

In [None]:
# delete from memory

del df_test,df_previous_cat,df_previous_final,df_previous_num,df_cash_cat,df_cash_final,df_cash_num
del df_credit_cat,df_credit_final,df_credit_num,df_payment_num
# df_app_train, df_bureau and df_bureau_balance are kept because Part 2 and the dummy model still use them
del df_app_test,df_cash,df_credit,df_payment

## Part 2 Bureau Dataframe

Here, train_bureau.csv is generated by aggregating and joining the application, bureau and bureau_balance dataframes. Bureau contains the client's previous loans from other institutions. This part models and analyses the client's status in terms of those previous loans.



In [None]:
# about bureau 
print(df_bureau.shape)
df_bureau.head()

In [None]:
# group by client id (SK_ID_CURR) and count previous loan no

previous_loan_counts= df_bureau.groupby('SK_ID_CURR',as_index=False)['SK_ID_BUREAU'].count().rename(columns={'SK_ID_BUREAU':'previous_loan_counts'})
previous_loan_counts




In [None]:
# join with application training dataframe 

df_train= df_app_train.merge(previous_loan_counts,on='SK_ID_CURR',how='left')
print(df_train.shape)
manage_df.missing_data_display(df_train)

In [None]:
# fill null values (clients without previous loans) with 0 

df_train.fillna(0,inplace=True)
manage_df.missing_data_display(df_train)

In [None]:
manage_corr.Kde_target('EXT_SOURCE_3',df_train)

In [None]:
# aggregating Numeric Columns

df_bureau_agg = df_bureau.drop(columns=['SK_ID_BUREAU']).groupby('SK_ID_CURR',as_index=False).agg(['count','mean','max','min','sum']).reset_index()
df_bureau_agg

In [None]:
# list of columns name 
columns = ['SK_ID_CURR']
for var in df_bureau_agg.columns.levels[0]:
    if var !='SK_ID_CURR':
        
        # iterate 
        for stat in df_bureau_agg.columns.levels[1][:-1]:
            columns.append('bureau_%s_%s'%(var,stat))

In [None]:
# assign columns name in groupby function
df_bureau_agg.columns=columns
df_bureau_agg.head()
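
Building the flat names from `columns.levels` relies on the level order matching the actual column order. A sketch that avoids this assumption iterates over the column labels themselves, which are `(variable, statistic)` tuples in the order the columns appear. The toy frame and its values are invented; only the column names mirror the bureau table.

```python
import pandas as pd

# Invented toy data with two numeric bureau columns.
bureau = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2],
    "DAYS_CREDIT": [-100, -300, -50],
    "AMT_CREDIT_SUM": [1000.0, 3000.0, 500.0],
})

agg = bureau.groupby("SK_ID_CURR").agg(["count", "mean", "max", "min", "sum"])

# Each label is a (variable, statistic) tuple; join them in their real order.
agg.columns = [f"bureau_{var}_{stat}" for var, stat in agg.columns]
agg = agg.reset_index()
print(agg.columns.tolist())
```

Because the names are derived from the tuples directly, a change in the list of statistics or in pandas' internal level ordering cannot silently mislabel a column.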

In [None]:
# checking missing values 
manage_df.missing_data_display(df_bureau_agg)

In [None]:
# merge with training data 

df_train = df_train.merge(df_bureau_agg,on='SK_ID_CURR',how='left')
df_train.head()

In [None]:
#shape
print(df_train.shape)
df_train.head()

In [None]:
# Checking correlations aggregated values with target 

# list of new correlation

new_corr= []

# iteration with columns

for col in columns:
    corr = df_train['TARGET'].corr(df_train[col])
    
    new_corr.append((col,corr))

In [None]:
# sort the correlation with absolute value

new_corr = sorted(new_corr,key = lambda x:abs(x[1]),reverse=True)
new_corr[:15]

In [None]:
manage_corr.Kde_target('bureau_DAYS_CREDIT_mean',df_train)

In [None]:
def agg_numeric(df,group_var,df_name):
    
    # remove primary id variables 
    for col in df:
        if col != group_var and 'SK_ID' in col:
            df = df.drop(columns=col)
            
    group_ids = df[group_var]
    df_num = df.select_dtypes(exclude = 'object')
    df_num[group_var] =group_ids
    
    
    
    # group by specific variable and cal statistic
    
    df_agg = df_num.groupby(group_var).agg(['count','mean','max','min','sum']).reset_index()
    
    
    # all columns name 
    
    columns = [group_var]
    
    # iteration for adding all columns name 
    
    
    for var in df_agg.columns.levels[0]:
        # skip group name
        if var != group_var:
            
            # iteration again
            for stat in df_agg.columns.levels[1][:-1]:
                columns.append('%s_%s_%s' %(df_name,var,stat))
                
    df_agg.columns = columns
    return df_agg
    

In [None]:
# Aggregation dataframe from Bureau dataframe  
df_bureau_num = manage_agg_cat.agg_numeric(df_bureau,group_var='SK_ID_CURR',df_name='bureau')
print(df_bureau_num.shape)
df_bureau_num.head()

In [None]:
# counting SK_ID_CURR on aggregation dataframe of bureau dataframe 
df_bureau_num['SK_ID_CURR'].duplicated().sum()

In [None]:
# categorical dataframe from bureau dataframe 
df_bureau_cat = manage_agg_cat.count_categorical(df_bureau,group_var='SK_ID_CURR',df_name='bureau')
print(df_bureau_cat.shape)
df_bureau_cat.head()
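
The source of `manage_agg_cat.count_categorical` is not included here, so the following is a guess at the general idea rather than the module's code. The hypothetical `count_categorical_sketch` one-hot encodes the non-numeric columns and then computes, per group, the count (`sum`) and the share (`mean`) of each category. The toy data is invented.

```python
import pandas as pd

def count_categorical_sketch(df, group_var, df_name):
    """One-hot encode non-numeric columns, then count and average them per group."""
    dummies = pd.get_dummies(df.select_dtypes(exclude="number")).astype(float)
    dummies[group_var] = df[group_var]
    agg = dummies.groupby(group_var).agg(["sum", "mean"])
    # Flatten the (category, statistic) labels into single names.
    agg.columns = [f"{df_name}_{var}_{stat}" for var, stat in agg.columns]
    return agg.reset_index()

# Invented toy data: client 1 has one active and one closed credit.
bureau = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2],
    "CREDIT_ACTIVE": ["Active", "Closed", "Closed"],
})
cat = count_categorical_sketch(bureau, "SK_ID_CURR", "bureau")
print(cat)
```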

In [None]:
# counting SK_ID_CURR on aggregation dataframe of bureau dataframe 
df_bureau_cat['SK_ID_CURR'].duplicated().sum()

In [None]:
# merging bureau_num and bureau_cat

df_bureau_final = df_bureau_num.merge(df_bureau_cat,on='SK_ID_CURR',how = 'inner')
print(df_bureau_final.shape)
df_bureau_final.head()

In [None]:
# checking missing values in the new dataframe df_bureau_final
manage_df.missing_data_display(df_bureau_final)

#### Bureau Balance 

In [None]:
# display of bureau balance overview
# one row represents one month transaction 
# it has too many SK_ID_BUREAU id repeating

print(df_bureau_balance.shape)
df_bureau_balance.head()

In [None]:
# counting number of SK_ID_BUREAU, one row represent one month transaction with SK_ID_BUREAU
df_bureau_balance['SK_ID_BUREAU'].duplicated().sum()

In [None]:
# for categorical data on  bureau balance dataframe

df_bureau_balance_cat = manage_agg_cat.count_categorical(df_bureau_balance,group_var='SK_ID_BUREAU',df_name='bureau_balance')
print(df_bureau_balance_cat.shape)
df_bureau_balance_cat.head()

In [None]:
# counting SK_ID_BUREAU
df_bureau_balance_cat['SK_ID_BUREAU'].duplicated().sum()

In [None]:
# For numerical data on bureau balance 

df_bureau_balance_num = manage_agg_cat.agg_numeric(df_bureau_balance,group_var='SK_ID_BUREAU',df_name='bureau_balance')
print(df_bureau_balance_num.shape)

df_bureau_balance_num

In [None]:
# counting SK_ID_BUREAU in numerical aggregation dataframe of bureau balance 

df_bureau_balance_num['SK_ID_BUREAU'].duplicated().sum()

In [None]:
df_bureau[['SK_ID_BUREAU','SK_ID_CURR']]

In [None]:
# merge bureau balance agg and cat dataframes
df_bureau_balance_final = df_bureau_balance_num.merge(df_bureau_balance_cat,on='SK_ID_BUREAU', how='inner')

print(df_bureau_balance_final.shape)

#merge with df_bureau dataframe where  unique SK_ID_BUREAU represents loan id 
# numbers of unique SK_ID_BUREAU id  represent no of loans
df_bureau_by_loan = df_bureau[['SK_ID_BUREAU','SK_ID_CURR']].merge(df_bureau_balance_final,on='SK_ID_BUREAU' , how ='inner')

print(df_bureau_by_loan.shape)
df_bureau_by_loan.head()



In [None]:
# group the new dataframe again with aggregation statistics 

df_bureau_by_loan_final = manage_agg_cat.agg_numeric(df_bureau_by_loan,group_var='SK_ID_CURR',df_name='FinalBurBal')
print(df_bureau_by_loan_final.shape)
df_bureau_by_loan_final.head()

In [None]:
# now merging final bureau balance dataframe and final bureau dataframe

df_bureauAndbureaubalance = df_bureau_final.merge(df_bureau_by_loan_final,on='SK_ID_CURR')
print(df_bureauAndbureaubalance.shape)

# save df_bureauAndbureaubalance dataframe into csv file 
# which came from df_bureau and df_bureau_balance after aggregation statistics 
#  on both numeric and categorical data type

df_bureauAndbureaubalance.to_csv('BureauAndBureaubalance.csv')

df_bureauAndbureaubalance.head()

In [None]:
# check missing values on final dataframe which came from bureau and bureau balance

manage_df.missing_data_display(df_bureauAndbureaubalance).head(45)

#### application train data 

In [None]:
# aggregation data numeric data
df_train_num = manage_agg_cat.agg_numeric(df_app_train.drop(columns = ['TARGET']),group_var='SK_ID_CURR',df_name='train')
print(df_train_num.shape)
df_train_num.head()

In [None]:
# joining target 

df_train_num = df_train_num.merge(df_app_train[['SK_ID_CURR','TARGET']],on='SK_ID_CURR',how ='left')
print(df_train_num.shape)
df_train_num.head()

In [None]:
# categorical data for application train

df_train_cat = manage_agg_cat.count_categorical(df_app_train,group_var='SK_ID_CURR',df_name='train')
print(df_train_cat.shape)
df_train_cat.head()

In [None]:
# merging numeric agg and categorical train dataframe 

df_train_final = df_train_num.merge(df_train_cat,on='SK_ID_CURR')
print(df_train_final.shape)
df_train_final.head()

In [None]:
# check missing values in the final train dataframe 

manage_df.missing_data_display(df_train_final)


In [None]:
# merging train final dataframe and bureau final dataframe

df_train_bureau = df_train_final.merge(df_bureauAndbureaubalance,on='SK_ID_CURR',how ='left')
print(df_train_bureau.shape)
df_train_bureau.to_csv("train_bureau.csv")
df_train_bureau.head()

In [None]:
# checking missing values in the joined train and bureau dataframe

manage_df.missing_data_display(df_train_bureau).head(45)

In [None]:
# delete null values that were generated while aggregating and joining the dataframes
manage_df.delete_missing_values(df_train_bureau)
manage_df.missing_data_display(df_train_bureau)

In [None]:
# check status of final train dataframe 

print(df_train_bureau.shape)
df_train_bureau.head()

In [None]:
# checking correlation between target and generated features

target_corr = manage_corr.target_corrs(df_train_bureau)
target_corr[:20]    

In [None]:
# plot kde 

manage_corr.Kde_target('train_CNT_CHILDREN_mean',df_train_bureau)

In [None]:
# check collinear features 
plt.figure(figsize = (12,6))
sns.heatmap(df_train_bureau.corr())
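
With hundreds of aggregated features a heatmap is hard to read, so collinear pairs can also be listed programmatically. The sketch below uses invented data and an assumed threshold of 0.9: it keeps only the upper triangle of the absolute correlation matrix and marks every column that is highly correlated with an earlier one.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "a_scaled": 2 * a + 1,        # perfectly collinear with "a"
    "b": rng.normal(size=200),    # independent noise
})

corr = toy.corr().abs()

# Keep only the upper triangle so each pair is inspected once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# A column is a drop candidate if it correlates strongly with an earlier column.
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
print(to_drop)
```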

##  Test dummy model on Application data only

Here, a dummy (baseline) model is built from the application dataframe only, to get general information about the client. It is also used to analyse feature importance and other behaviour.

In [None]:
# select categorical columns from the application dataframe
print(len(df_app_train.select_dtypes("object").columns))
X_cat = df_app_train.select_dtypes("object")
X_cat.head()

In [None]:
# check missing values 

manage_df.missing_data_display(X_cat)

In [None]:
# get dummy 
X_cat_dummy = pd.DataFrame(pd.get_dummies(X_cat,drop_first=True),index=X_cat.index)
print(X_cat_dummy.shape)
X_cat_dummy.head()

In [None]:
# Selecting numerical data for further steps
X_num = df_app_train.select_dtypes(exclude="object")
print(X_num.shape,df_app_train.shape)
X_num.head()

In [None]:
# checking 
manage_df.missing_data_display(X_num)

In [None]:
# simple overview of statistic info.
X_num.describe()

In [None]:
# fill na by mean values 
# assign the result instead of filling a slice of df_app_train in place
X_num = X_num.fillna(X_num.mean())

In [None]:
# check size of numerical dataframe and categorical dataframe 
print(X_num.shape,X_cat_dummy.shape)

In [None]:
# merge numerical dataframe and categorical dataframe after get_dummies of categorical data 
X_final = X_num.merge(X_cat_dummy,left_index=True,right_index=True)
print(X_final.shape)
X_final.head()

In [None]:
# take a stratified 30% subset (same class ratio as the full data) for the model 

X_subset = X_final.drop(columns=['TARGET'])
y_subset = X_final['TARGET']
X_train_subset , X_test_subset,y_train_subset,y_test_subset=train_test_split(X_subset,y_subset,test_size=0.3,stratify=y_subset,random_state=42)
print(X_test_subset.shape,y_test_subset.shape)
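
Passing `stratify` keeps the class ratio of the subset close to that of the full data, which matters when one class is rare. The sketch below demonstrates this on an invented label vector with 8% positives.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Invented labels: 92 repaid (0) and 8 not repaid (1).
y = np.array([0] * 92 + [1] * 8)
X = np.arange(100).reshape(-1, 1)

X_rest, X_sub, y_rest, y_sub = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# The 30% subset keeps roughly the same share of positives as the full data.
print(len(y_sub), y_sub.mean(), y.mean())
```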

In [None]:
# merge 30% subset data of features dataframe and target features 
X_final=X_test_subset.join(y_test_subset)
X_final.head()

In [None]:
X_final['TARGET'].value_counts().plot(kind='bar')

In [None]:
# Prepare target and features

X = X_final.drop(columns=['TARGET','SK_ID_CURR'])
y= X_final['TARGET']

In [None]:
# split train dataframe and test data frame 
from  sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size =.25,random_state = 23)
print(X_train.shape,X_test.shape)
print(y_train.shape,y_test.shape)

In [None]:
# Reduce the dimension of the dataframe with PCA
# Scaling for PCA
ss = StandardScaler()

df_ss = pd.DataFrame(ss.fit_transform(X_train),columns=X_train.columns)

print(df_ss.shape)
df_ss.head()

In [None]:
# finding no of components by PCA algorithm
from sklearn.decomposition import PCA
pca = PCA()
pca.fit(df_ss)
cumsum = np.cumsum(pca.explained_variance_ratio_)*100
d = list(range(1, len(cumsum) + 1))  # component counts start at 1
plt.figure(figsize=(6,4))
plt.plot(d, cumsum, color='red', label = 'Explained Variance')
plt.title('Explained Variance vs Number of Components')
plt.ylabel('Explained Variance')
plt.xlabel('Number of Components')
plt.axhline(y = 90, color='k', linestyle='--', label = '90% of Explained Variance')
plt.xlim(0,173)
plt.legend(loc='best');

print('Number of components below 90% cumulative explained variance:',(cumsum < 90).sum())
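
To get the smallest number of components that actually reaches 90% of the variance, the count can be read from the cumulative curve or delegated to scikit-learn. The following is a minimal sketch on synthetic data; `X_demo` and the other names are made up for illustration and are not part of the notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled training features (hypothetical data).
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 5))
X_demo = latent @ rng.normal(size=(5, 20)) + 0.1 * rng.normal(size=(500, 20))
X_demo = StandardScaler().fit_transform(X_demo)

# Option 1: first position where the cumulative variance reaches 90%,
# converted from a 0-based index to a component count.
cumsum_demo = np.cumsum(PCA().fit(X_demo).explained_variance_ratio_) * 100
n_manual = int(np.argmax(cumsum_demo >= 90)) + 1

# Option 2: pass the variance fraction and let PCA pick the count.
pca_auto = PCA(n_components=0.90, svd_solver="full").fit(X_demo)

print(n_manual, pca_auto.n_components_)
```

Both options should agree, and the count is one more than the number of components whose cumulative variance is still below the threshold.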

In [None]:
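# balance the target by oversampling the minority class with SMOTE (imblearn)
from imblearn.over_sampling import SMOTE
sm=SMOTE(random_state=42)
# fit_sample was renamed to fit_resample in recent imblearn releases
X_train,y_train = sm.fit_resample(X_train,y_train)
# NOTE: oversampling the test split as well makes the reported metrics optimistic
X_test,y_test = sm.fit_resample(X_test,y_test)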
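
Resampling is usually applied to the training split only, so that the test split keeps the original class ratio and the metrics stay representative. A minimal sketch of that idea on synthetic data, using random oversampling from scikit-learn as a plain stand-in for SMOTE (all arrays and names here are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic imbalanced data: roughly 8% positives (hypothetical).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 4))
y_demo = (rng.random(1000) < 0.08).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=23)

# Oversample the minority class in the training split only.
majority = X_tr[y_tr == 0]
minority = X_tr[y_tr == 1]
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)

X_tr_bal = np.vstack([majority, minority_up])
y_tr_bal = np.concatenate([np.zeros(len(majority), dtype=int),
                           np.ones(len(minority_up), dtype=int)])

# Training classes are now balanced; the test split is untouched.
print(np.bincount(y_tr_bal), np.bincount(y_te))
```

The same pattern applies to SMOTE: call the resampler on the training split and leave the test split as it is.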



In [None]:
# checking balance
y_train.value_counts().plot(kind='bar')



### Modeling 

In [None]:
pipe = Pipeline([('ss',StandardScaler()),
                 ('pca',PCA(n_components=126)),
                 ('tree_clf',DecisionTreeClassifier(criterion='gini',max_depth=5))])

pipe.fit(X_train,y_train)

In [None]:
# Metrics
pred = pipe.predict(X_test)
print("Confusion Matrix \n")
print(confusion_matrix(y_test,pred))
print("Classification Reports \n")
print(classification_report(y_test,pred))
print("Roc_auc_score \n")
print(roc_auc_score(y_test,pred))
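
ROC AUC is defined over ranking scores, so passing hard 0/1 predictions evaluates a single threshold only. A minimal sketch of the difference, on synthetic data with hypothetical names, assuming the classifier exposes `predict_proba`:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced problem (about 10% positives).
X_demo, y_demo = make_classification(n_samples=2000, n_features=20,
                                     weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0)

clf = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_tr, y_tr)

# AUC from hard labels: only one operating point of the ROC curve.
auc_labels = roc_auc_score(y_te, clf.predict(X_te))
# AUC from positive-class probabilities: the full ROC curve.
auc_scores = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

print(f"AUC from labels: {auc_labels:.3f}, from probabilities: {auc_scores:.3f}")
```

For a fitted pipeline the equivalent call would be `pipe.predict_proba(X_test)[:, 1]`.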

In [None]:
manage_model.plot_feature(pipe[2],X_train)   
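
Because the classifier sits behind a PCA step, its `feature_importances_` refer to principal components rather than to the original columns. One rough way to project them back is to weight the absolute PCA loadings by the component importances. The sketch below is only a heuristic on synthetic data, with hypothetical names, and is not what `manage_model.plot_feature` does.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)

pipe_demo = Pipeline([("ss", StandardScaler()),
                      ("pca", PCA(n_components=4)),
                      ("tree_clf", DecisionTreeClassifier(max_depth=5, random_state=0))])
pipe_demo.fit(X_demo, y_demo)

# One importance per principal component, shape (4,).
component_importance = pipe_demo["tree_clf"].feature_importances_
# Absolute loadings of each original feature on each component, shape (4, 8).
loadings = np.abs(pipe_demo["pca"].components_)

# Heuristic score per original feature, shape (8,).
feature_score = component_importance @ loadings
print(np.argsort(feature_score)[::-1])  # feature indices, highest score first
```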

#### RandomForest 

In [None]:
pipe_rf = Pipeline([('ss',StandardScaler()),
                 ('pca',PCA(n_components=126)),
                 ('rf',RandomForestClassifier(n_estimators=100,max_depth=5))])

pipe_rf.fit(X_train,y_train)


In [None]:
# Metrics of the Random Forest algorithm
pred = pipe_rf.predict(X_test)
print("Confusion Matrix \n ")
print(confusion_matrix(y_test,pred))
print("Classification Report\n")
print(classification_report(y_test,pred))
print("Roc_auc_score \n")
print(roc_auc_score(y_test,pred))

In [None]:
# plot feature importances

manage_model.plot_feature(pipe_rf[2],X_train)

In [None]:
# create pipeline
pipe = Pipeline([('sc',StandardScaler()),
                 ('pca',PCA(n_components=126)),
                 ('rf',RandomForestClassifier(random_state=123))])


# create the grid parameter
n_estimators = [100, 300]
max_depth = [5, 8]
min_samples_split = [2, 5]
min_samples_leaf = [ 5, 10]

grid = [{'rf__n_estimators':n_estimators,
          'rf__max_depth':max_depth,
          'rf__min_samples_split':min_samples_split,
          'rf__min_samples_leaf':min_samples_leaf}]

In [None]:
gridsearch  = GridSearchCV(estimator=pipe,param_grid=grid,scoring='accuracy',cv=3)
gridsearch.fit(X_train,y_train)

In [None]:
gridsearch.best_params_

In [None]:
gridsearch.scorer_

In [None]:
# join the train and test application data for Featuretools

df_app_test['TARGET'] = np.nan
df_app_test['SET'] = 'test'
df_app_train['SET']='train'

# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
app = pd.concat([df_app_train,df_app_test],ignore_index=True)


In [None]:
print(app.shape)
app.head()
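
A marker column such as `SET` makes it possible to split the combined frame back into its two parts after shared feature engineering. A minimal sketch with tiny made-up frames (the `_demo` names are hypothetical):

```python
import numpy as np
import pandas as pd

train_demo = pd.DataFrame({"SK_ID_CURR": [1, 2], "TARGET": [0, 1]})
test_demo = pd.DataFrame({"SK_ID_CURR": [3]})

# Mark the origin of every row before stacking the frames.
test_demo["TARGET"] = np.nan
train_demo["SET"] = "train"
test_demo["SET"] = "test"

app_demo = pd.concat([train_demo, test_demo], ignore_index=True)

# Recover the two parts; the test part has no real target.
train_back = app_demo[app_demo["SET"] == "train"].drop(columns="SET")
test_back = app_demo[app_demo["SET"] == "test"].drop(columns=["SET", "TARGET"])

print(app_demo.shape, train_back.shape, test_back.shape)
```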

## Final model on the prepared data merged from all dataframes

Here, train_bureau.csv and previous_loan_final.csv, which were generated in part 1 and part 2, are merged into a single dataframe saved as home_credit_final.csv. It is used for the final model and for predicting clients' repayment abilities. The final dataframe holds all the information about each client, including previous loans from other institutions (reported to the Credit Bureau) and from Home Credit itself.

In [None]:
# load the data built in part 1 by joining application_train, bureau and bureau_balance
df_app_bureau = pd.read_csv("Data/train_bureau.csv")
print(df_app_bureau.shape)
df_app_bureau = df_app_bureau.iloc[:,1:]
df_app_bureau.head()

In [None]:
# Getting data from previous loan

df_previous_loan = pd.read_csv("Data/previou_loan_final.csv")
print(df_previous_loan.shape)
df_previous_loan.head()

In [None]:
# merge df_app_bureau and df_previous_loan into the final dataframe

df_home_final = df_app_bureau.merge(df_previous_loan,on='SK_ID_CURR',how='left')
print(df_home_final.shape)
df_home_final.head()
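
A left merge can silently duplicate rows when the right table has repeated keys, and it leaves NaN for clients with no match. pandas offers the `validate` and `indicator` arguments to check both. A minimal sketch with made-up frames (the column `previous_count` and the `_demo` names are hypothetical):

```python
import pandas as pd

left_demo = pd.DataFrame({"SK_ID_CURR": [1, 2, 3],
                          "AMT_CREDIT": [100.0, 200.0, 300.0]})
right_demo = pd.DataFrame({"SK_ID_CURR": [1, 3],
                           "previous_count": [4, 2]})

# validate raises MergeError if either side has duplicate keys;
# indicator adds a `_merge` column describing where each row came from.
merged_demo = left_demo.merge(right_demo, on="SK_ID_CURR", how="left",
                              validate="one_to_one", indicator=True)

# Clients without a previous loan appear as 'left_only' with NaN features.
print(merged_demo["_merge"].value_counts())
```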

In [None]:
# Save the final merged data (application/bureau and previous loans)

df_home_final.to_csv("Data/home_credit_final.csv",index=False)


In [None]:
# delete dataframe from memory

del df_app_bureau

## Retrieving data from the saved csv file for further steps

In [None]:
# load the final merged data

df_home_final = pd.read_csv("Data/home_credit_final.csv")
print(df_home_final.shape)
df_home_final.head()

In [None]:
# check null value 

manage_df.missing_data_display(df_home_final)

In [None]:
# copy from original 
df_home_final_update = df_home_final.copy()

In [None]:
# delete columns with more than 40% missing values, then check the remaining null values

manage_df.delete_missing_values(df_home_final_update)
manage_df.missing_data_display(df_home_final_update)

In [None]:
# check null values
manage_df.missing_data_display(df_home_final_update).head(30)

In [None]:
# drop rows with missing values because they are only 6.2% of the data

df_home_final_update.dropna(inplace = True)
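
The share of rows that `dropna` would remove can be measured before dropping anything. A minimal sketch on a tiny made-up frame (names are hypothetical):

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({"a": [1.0, np.nan, 3.0, 4.0],
                        "b": [1.0, 2.0, 3.0, np.nan],
                        "c": [1, 2, 3, 4]})

# Fraction of rows with at least one missing value.
share_missing_rows = df_demo.isna().any(axis=1).mean()

# Reassigning keeps the original frame available for comparison.
df_clean = df_demo.dropna()

print(f"{share_missing_rows:.1%} of rows dropped, {len(df_clean)} rows left")
```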

In [None]:
# check null values
manage_df.missing_data_display(df_home_final_update)

In [None]:
# check dimension of final dataframe 

print(df_home_final_update.shape)
df_home_final_update.head()

In [None]:
# check correlation between the target and the features
manage_corr.target_corrs(df_home_final_update)[:10]

In [None]:
manage_corr.Kde_target('previous_NAME_CONTRACT_STATUS_Refused_count_norm',df_home_final_update)

In [None]:
# check the balance of the target feature
df_home_final_update['TARGET'].value_counts().plot(kind='bar')

In [None]:
# take a sample from the population

df_home_sample = df_home_final_update.sample(n=60000,random_state=42)

# check the balance of the target feature in the sample
df_home_sample['TARGET'].value_counts().plot(kind='bar')

In [None]:
# correlation between features, excluding the target
plt.figure(figsize=(15,6))
sns.heatmap(df_home_final_update.drop('TARGET',axis = 1).corr())

In [None]:
# create dataframe for PCA 
df_pca= df_home_final_update.drop(columns=['SK_ID_CURR','TARGET'])


# standard scaling for PCA
ss =StandardScaler()
df_ss = pd.DataFrame(ss.fit_transform(df_pca),columns=df_pca.columns)
print(df_ss.shape)
df_ss.head()


In [None]:
# finding no of components by PCA algorithm
from sklearn.decomposition import PCA
pca = PCA()
pca.fit(df_ss)
cumsum = np.cumsum(pca.explained_variance_ratio_)*100
d = list(range(1, len(cumsum) + 1))  # component counts start at 1
plt.figure(figsize=(6,4))
plt.plot(d, cumsum, color='red', label = 'Explained Variance')
plt.title('Explained Variance vs Number of Components')
plt.ylabel('Explained Variance')
plt.xlabel('Number of Components')
plt.axhline(y = 90, color='k', linestyle='--', label = '90% of Explained Variance')
plt.xlim(0,300)
plt.legend(loc='best')
print("Number of components below 90% cumulative explained variance:", (cumsum < 90).sum())

In [None]:
# display the number of components below the 90% threshold

print("Number of components below 90% cumulative explained variance:", (cumsum < 90).sum())

In [None]:
print(df_home_final_update.shape)
df_home_final_update.head()

In [None]:
# Data preparing for model 
#df_train_bureau =df_train_bureau.iloc[:,1:]
#df_train_bureau.head()

# taking sample from population 

df_home_sample = df_home_final_update.sample(n=60000,random_state=42)


#df_home_sample = df_home_final_update.copy()

# making target and features 
X = df_home_sample.drop(columns=['SK_ID_CURR','TARGET'])
y = df_home_sample['TARGET']

# split 

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state =123)
print(X_train.shape,y_train.shape)
print(X_test.shape,y_test.shape)
y_train.value_counts().plot(kind='bar')

In [None]:
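# balance the target with SMOTE
from imblearn.over_sampling import SMOTE
sm=SMOTE(random_state=42)
# fit_sample was renamed to fit_resample in recent imblearn releases
X_train,y_train = sm.fit_resample(X_train,y_train)
# NOTE: oversampling the test split as well makes the reported metrics optimistic
X_test,y_test = sm.fit_resample(X_test,y_test)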

# check balance in target 

fig ,ax = plt.subplots(1,2,figsize=(12,6))
y_train.value_counts().plot(kind='bar',ax=ax[0],label='0:repaid 1:notrepaid')
y_test.value_counts().plot(kind='bar',ax=ax[1],label='0:repaid 1:notrepaid')
ax[0].set_title("train target")
ax[1].set_title("test target")
plt.legend()

#### Random Forest 

In [None]:
# random forest 

# making pipe for RandomForestClassifier

pipe_rf = Pipeline([('ss',StandardScaler()),
                    ('pca',PCA(n_components=242)),
                 ('rf',RandomForestClassifier(random_state =123))])
                 
pipe_rf.fit(X_train,y_train)   

# accuracy
print("Train score : ", pipe_rf.score(X_train,y_train))
print("test score :",pipe_rf.score(X_test,y_test))

In [None]:
# confusion matrix for RandomForest

pred = pipe_rf.predict(X_test)
print("Confusion Matrix ")
print("------------------ \n ")
print(confusion_matrix(y_test,pred))
print("\n Classification Report  ")
print("---------------------------- \n")
print(classification_report(y_test,pred))
print(" Roc_auc_score :")
print("------------------ \n")
print(roc_auc_score(y_test,pred))

In [None]:
# importance feature 
manage_model.plot_feature(pipe_rf[2],X_train)

#### LightGBM (Light Gradient Boosting Machine) Algorithm

In [None]:
#lgbm = lgb.LGBMClassifier()

pipe_lgbm = Pipeline([('ss',StandardScaler()),
                    ('pca',PCA(n_components=242)),
                 ('lgbm',lgb.LGBMClassifier())])
                 
pipe_lgbm.fit(X_train,y_train)   

# accuracy
print("Train score : ", pipe_lgbm.score(X_train,y_train))
print("test score :",pipe_lgbm.score(X_test,y_test))

In [None]:
# confusion matrix for LightGBM

pred = pipe_lgbm.predict(X_test)
print("Confusion Matrix ")
print("------------------ \n ")
print(confusion_matrix(y_test,pred))
print("\n Classification Report  ")
print("---------------------------- \n")
print(classification_report(y_test,pred))
print(" Roc_auc_score :")
print("------------------ \n")
print(roc_auc_score(y_test,pred))

In [None]:
# importance feature 
manage_model.plot_feature(pipe_lgbm[2],X_train)

In [None]:
df_home_final_update['train_CNT_CHILDREN_mean'].value_counts()

In [None]:
# check the KDE plot of the most important feature

manage_corr.Kde_target('train_AMT_CREDIT_count',df_home_final_update)

In [None]:
# check the KDE plot of another important feature

manage_corr.Kde_target('train_CNT_CHILDREN_mean',df_home_final_update)

#### Naive Bayes GaussianNB

In [None]:
# Naive Bayes GaussianNB

pipe_naive = Pipeline([('ss',StandardScaler()),
                       ('pca',PCA(n_components=242)),
                       ('ga',GaussianNB())])
                 
pipe_naive.fit(X_train,y_train) 

print("Train score : ", pipe_naive.score(X_train,y_train))
print("test score :",pipe_naive.score(X_test,y_test))

# confusion matrix for Naive Bayes GaussianNB

pred = pipe_naive.predict(X_test)
print("Confusion Matrix ")
print("------------------ \n ")
print(confusion_matrix(y_test,pred))
print("\n Classification Report  ")
print("---------------------------- \n")
print(classification_report(y_test,pred))
print(" Roc_auc_score :")
print("------------------ \n")
print(roc_auc_score(y_test,pred))

#### Ada Boosting Algorithm

In [None]:
# Adaboost

pipe_ada = Pipeline([('ss',StandardScaler()),
                     ('pca',PCA(n_components=242)),
                     ('ada',AdaBoostClassifier(random_state=123))])
                 
pipe_ada.fit(X_train,y_train) 

print("Train score : ", pipe_ada.score(X_train,y_train))
print("test score :",pipe_ada.score(X_test,y_test))

# confusion matrix for the AdaBoost algorithm

pred = pipe_ada.predict(X_test)
print("Confusion Matrix ")
print("------------------ \n ")
print(confusion_matrix(y_test,pred))
print("\n Classification Report  ")
print("---------------------------- \n")
print(classification_report(y_test,pred))
print(" Roc_auc_score :")
print("------------------ \n")
print(roc_auc_score(y_test,pred))

In [None]:
# XGBOOST 

pipe_xgb = Pipeline([('ss',StandardScaler()),
                     ('pca',PCA(n_components=242)),
                     ('xgb',xgb.XGBClassifier(random_state=123))])
                 
pipe_xgb.fit(X_train,y_train) 

print("Train score : ", pipe_xgb.score(X_train,y_train))
print("test score :",pipe_xgb.score(X_test,y_test))

# confusion matrix for XGBoost

pred = pipe_xgb.predict(X_test)
print("Confusion Matrix ")
print("------------------ \n ")
print(confusion_matrix(y_test,pred))
print("\n Classification Report  ")
print("---------------------------- \n")
print(classification_report(y_test,pred))
print(" Roc_auc_score :")
print("------------------ \n")
print(roc_auc_score(y_test,pred))

In [None]:
# Confusion matrix of XGBoost

plt.figure(figsize = (10,6))  # figure size
ax = plt.subplot()
sns.set(font_scale=1.4)
sns.heatmap(confusion_matrix(y_test,pred),annot=True,fmt='g',ax =ax , xticklabels= ['Repaid', 'Not_repaid'] ,
            yticklabels=['Repaid','Not_repaid'],cmap ='YlGnBu' )

# subplot
ax.set_xlabel("Predicted labels" , fontsize = 20 );ax.set_ylabel("True labels" , fontsize = 20 )
ax.set_title("Confusion Matrix", fontsize = 20)
#ax.xaxis.set_ticklabels(['Repaid', 'Not_repaid'], fontsize = 14); 
#ax.yaxis.set_ticklabels(['Repaid','Not_repaid'], fontsize = 14);

# save before show(): the figure can be empty once it has been displayed
plt.savefig("Confusion_matrix_xgb.png")
plt.show()



In [None]:
# X_train data 
X_train.head()

In [None]:
# plot a tree of the XGBoost model

plt.rcParams['figure.figsize']=[200,150]  # set the size before the figure is created
xgb.plot_tree(pipe_xgb[2])
plt.savefig("xgb_tree.png")  # save before show()
plt.show()

In [None]:
# build a pipeline for each algorithm

# GaussianNB
pipe_naive = Pipeline([('ss',StandardScaler()),
                       ('pca',PCA(n_components=242)),
                       ('ga',GaussianNB())])

#AdaBoostClassifier
#pipe_ada = Pipeline([('ss',StandardScaler()),
#                 ('ada',AdaBoostClassifier(random_state=123))])

#RandomForest 
pipe_rf = Pipeline([('ss',StandardScaler()),
                    ('pca',PCA(n_components=242)),
                    ('rf',RandomForestClassifier(random_state =123))])

# Lightgbm

pipe_lgbm = Pipeline([('ss',StandardScaler()),
                      ('pca',PCA(n_components=242)),
                      ('lgbm',lgb.LGBMClassifier())])


# logistic Regression model 
pipe_logistic  = Pipeline([('ss',StandardScaler()),
                           ('pca',PCA(n_components=242)),
                           ('lg',LogisticRegressionCV())])


pipelists = [pipe_logistic,pipe_naive,pipe_rf,pipe_lgbm]
pipeline_names = ['Logistic Regression','Naive Bayes','RandomForest','LightGBM']



# for loop to fit each algorithm
for pipe in pipelists:
    print(pipe)
    pipe.fit(X_train,y_train) 
    
#Compare Accuracies
for index,val in enumerate(pipelists):
    print("%s pipeline test accuracy : %.3f" %(pipeline_names[index],val.score(X_test,y_test)))

#### Tuning and optimization of the RandomForest and LightGBM algorithms


In [None]:
# RandomForest 
pipe_rf = Pipeline([('sc',StandardScaler()),
                 ('pca',PCA(n_components=242)),
                 ('rf',RandomForestClassifier(random_state=123))])


# create the grid parameter
n_estimators = [100, 300,400]
max_depth = [5, 8]
min_samples_split = [2, 5,8]
min_samples_leaf = [ 5, 10,15]

grid = [{'rf__n_estimators':n_estimators,
          'rf__max_depth':max_depth,
          'rf__min_samples_split':min_samples_split,
          'rf__min_samples_leaf':min_samples_leaf}]


gridsearch  = GridSearchCV(estimator=pipe_rf,param_grid=grid,scoring='accuracy',cv=3)
gridsearch.fit(X_train,y_train)


print("Best Parameter " )
print("-----------------\n")
print(gridsearch.best_params_)

print("\n")
print("Best Score ")
print("-----------\n")
print(gridsearch.best_score_)

In [None]:
# display the best parameters and score

print("Best Parameter " )
print("-----------------\n")
print(gridsearch.best_params_)

print("\n")
print("Best Score ")
print("-----------\n")
print(gridsearch.best_score_)




In [None]:
# predict with the best grid search parameters

print("Best mean cross-validation score from grid search ")
print("-----------\n")
print(gridsearch.best_score_)


print("\n")
best = gridsearch.best_estimator_
y_pred_grid = best.predict(X_test)
print("Test score with gridsearch's best parameters: ")
print("--------------------------------\n")
print(accuracy_score(y_test,y_pred_grid))

In [None]:
# light GBM optimization


pipe_lgbm = Pipeline([('ss',StandardScaler()),
                    ('pca',PCA(n_components=242)),
                 ('lgbm',lgb.LGBMClassifier())])


# create the grid parameter
#n_estimators = [100, 300,400]
#max_depth = [5, 8]
#min_samples_split = [2, 5,8]
#min_samples_leaf = [ 5, 10,15]

grid_para = {'lgbm__boosting_type':['gbdt','goss','dart'],
              'lgbm__num_leaves':list(range(20,150)),
              'lgbm__learning_rate':list(np.logspace(np.log10(0.005),np.log10(0.5),base =10,num=1000)),
               'lgbm__subsample_for_bin': list(range(20000, 300000, 20000)),
               'lgbm__min_child_samples': list(range(20, 500, 5)),
               'lgbm__reg_alpha': list(np.linspace(0, 1)),
                'lgbm__reg_lambda': list(np.linspace(0, 1)),
                'lgbm__colsample_bytree': list(np.linspace(0.6, 1, 10)),
                'lgbm__subsample': list(np.linspace(0.5, 1, 100)),
                'lgbm__is_unbalance': [True, False]}


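# NOTE: this grid has roughly 2.6e15 combinations, so an exhaustive GridSearchCV
# over it will not finish in practice; the random search further below samples from it instead
gridsearch  = GridSearchCV(estimator=pipe_lgbm,param_grid=grid_para,scoring='accuracy',cv=3)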
gridsearch.fit(X_train,y_train)


print("Best Parameter " )
print("-----------------\n")
print(gridsearch.best_params_)

print("\n")
print("Best Score ")
print("-----------\n")
print(gridsearch.best_score_)

In [None]:
# light GBM optimization



grid_para = {'lgbm__boosting_type':['gbdt','goss','dart'],
              'lgbm__num_leaves':list(range(20,150)),
              'lgbm__learning_rate':list(np.logspace(np.log10(0.005),np.log10(0.5),base =10,num=1000)),
               'lgbm__subsample_for_bin': list(range(20000, 300000, 20000)),
               'lgbm__min_child_samples': list(range(20, 500, 5)),
               'lgbm__reg_alpha': list(np.linspace(0, 1)),
                'lgbm__reg_lambda': list(np.linspace(0, 1)),
                'lgbm__colsample_bytree': list(np.linspace(0.6, 1, 10)),
                'lgbm__subsample': list(np.linspace(0.5, 1, 100)),
                'lgbm__is_unbalance': [True, False]}



N_FOLDS = 5
MAX_EVALS = 5

train_set = lgb.Dataset(data = X_train,label=y_train)
test_set =lgb.Dataset(data= X_test,label = y_test)

In [None]:
# Get default hyperparameters
model = lgb.LGBMClassifier()
default_params = model.get_params()

# Remove the number of estimators because we set this to 10000 in the cv call
del default_params['n_estimators']

# Cross validation with early stopping
cv_results = lgb.cv(default_params, train_set, num_boost_round = 10000, early_stopping_rounds = 100, 
                    metrics = 'auc', nfold = N_FOLDS, seed = 42)

In [None]:
cv_results

In [None]:
def objective(hyperparameters, iteration):
    """Objective function for grid and random search. Returns
       the cross validation score from a set of hyperparameters."""
    
    # Number of estimators will be found using early stopping
    if 'n_estimators' in hyperparameters.keys():
        del hyperparameters['n_estimators']
    
     # Perform n_folds cross validation
    cv_results = lgb.cv(hyperparameters, train_set, num_boost_round = 10000, nfold = N_FOLDS, 
                        early_stopping_rounds = 100, metrics = 'auc', seed = 42)
    
    # results to return
    score = cv_results['auc-mean'][-1]
    estimators = len(cv_results['auc-mean'])
    hyperparameters['n_estimators'] = estimators 
    
    return [score, hyperparameters, iteration]

In [None]:
import random

def random_search(param_grid, max_evals = MAX_EVALS):
    """Random search for hyperparameter optimization"""
    
    # Dataframe for results
    results = pd.DataFrame(columns = ['score', 'params', 'iteration'],
                                  index = list(range(max_evals)))
    
    # Keep searching until reach max evaluations
    for i in range(max_evals):
        
        # Choose random hyperparameters; strip the pipeline prefix ('lgbm__')
        # so the keys are plain LightGBM parameter names
        hyperparameters = {k.split('__')[-1]: random.choice(v) for k, v in param_grid.items()}
        # row subsampling is not combined with goss, so force subsample to 1.0
        if hyperparameters['boosting_type'] == 'goss':
            hyperparameters['subsample'] = 1.0

        # Evaluate randomly selected hyperparameters
        eval_results = objective(hyperparameters, i)
        
        results.loc[i, :] = eval_results
    
    # Sort with best score on top
    results.sort_values('score', ascending = False, inplace = True)
    results.reset_index(inplace = True)
    return results 
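
scikit-learn ships a ready-made version of this idea in `RandomizedSearchCV`, which samples a fixed number of candidates from the grid instead of trying them all. A minimal sketch on synthetic data; a small RandomForest is used here only to keep the example light, and the grid values and `_demo` names are made up for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)

# Candidate values to sample from (hypothetical grid).
param_distributions = {
    "n_estimators": [20, 50],
    "max_depth": [3, 5, 8],
    "min_samples_leaf": [1, 5, 10],
}

# n_iter caps the number of sampled candidates, much like MAX_EVALS.
search = RandomizedSearchCV(RandomForestClassifier(random_state=123),
                            param_distributions, n_iter=5,
                            scoring="roc_auc", cv=3, random_state=42)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

With a pipeline as the estimator, the keys would carry the step prefix, for example `lgbm__learning_rate`.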

In [None]:
random_results = rand

In [None]:
#lgbm = lgb.LGBMClassifier()

"""

default parameter

    boosting_type='gbdt',
    num_leaves=31,
    max_depth=-1,
    learning_rate=0.1,
    n_estimators=100,
    subsample_for_bin=200000,
    objective=None,
    class_weight=None,
    min_split_gain=0.0,
    min_child_weight=0.001,
    min_child_samples=20,
    subsample=1.0,
    subsample_freq=0,
    colsample_bytree=1.0,
    reg_alpha=0.0,
    reg_lambda=0.0,
    random_state=None,
    n_jobs=-1,
    silent=True,
"""


pipe_lgbm = Pipeline([('ss',StandardScaler()),
                    ('pca',PCA(n_components=242)),
                 ('lgbm',lgb.LGBMClassifier())])

grid_para = {'lgbm__boosting_type':['gbdt','goss','dart'],
              'lgbm__num_leaves':list(range(20,150)),
              'lgbm__learning_rate':list(np.logspace(np.log10(0.005),np.log10(0.5),base =10,num=1000)),
               'lgbm__subsample_for_bin': list(range(20000, 300000, 20000)),
               'lgbm__min_child_samples': list(range(20, 500, 5)),
               'lgbm__reg_alpha': list(np.linspace(0, 1)),
                'lgbm__reg_lambda': list(np.linspace(0, 1)),
                'lgbm__colsample_bytree': list(np.linspace(0.6, 1, 10)),
                'lgbm__subsample': list(np.linspace(0.5, 1, 100)),
                'lgbm__is_unbalance': [True, False]}


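# NOTE: an exhaustive search over this grid (roughly 2.6e15 combinations) will not finish in practice
gs = GridSearchCV(estimator=pipe_lgbm,param_grid=grid_para,scoring="accuracy",cv = 5 )                 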
gs.fit(X_train,y_train) 

In [None]:
model = lgb.LGBMClassifier()

In [None]:
default_params = model.get_params()
default_params

In [None]:
# plot the learning rate values from the LightGBM hyperparameter dict
plt.figure(figsize = (14,8))
# histplot replaces the deprecated distplot (seaborn 0.11 and later)
sns.histplot(grid_para['lgbm__learning_rate'],bins=20,kde=False)
plt.xlabel("Learning Rate " ,fontsize =14)
plt.ylabel("Count" ,fontsize =14)
plt.title("Learning Rate Distribution " ,fontsize =14)

In [None]:
com =1 
for x in grid_para.values():
    com *= len(x)
print("There are {} combinations".format(com))
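
The combination count translates directly into a number of model fits, which is why exhaustive search is only practical for small grids. A back-of-the-envelope sketch with a deliberately tiny grid; the 30 seconds per fit is an assumed figure for illustration, not a measurement.

```python
import math

# A small slice of the search space (hypothetical subset).
grid_demo = {
    "boosting_type": ["gbdt", "goss", "dart"],
    "num_leaves": list(range(20, 150)),
    "is_unbalance": [True, False],
}

# Number of combinations is the product of the list lengths: 3 * 130 * 2.
n_combinations = math.prod(len(v) for v in grid_demo.values())

cv_folds = 3
seconds_per_fit = 30  # assumed cost of a single fit
hours = n_combinations * cv_folds * seconds_per_fit / 3600

print(n_combinations, "combinations,", hours, "hours of fitting")
```

Even this three-parameter slice needs many hours under that assumption; every additional parameter list multiplies the total again.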

In [None]:
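# use the objective function to find the best hyperparameters

import itertools


def grid_search(para_grid , max_evals = MAX_EVALS):
    """Grid search limited to the first max_evals combinations"""
    
    # dataframe to store the results
    results = pd.DataFrame(columns =['score','params','iteration'],index = list(range(max_evals)))
    
    keys, values = zip(*para_grid.items())
    # strip the pipeline prefix ('lgbm__') so the keys are plain LightGBM parameter names
    keys = [k.split('__')[-1] for k in keys]
    
    # iterate over the combinations in order
    for i, v in enumerate(itertools.product(*values)):
        # stop once the evaluation budget is used up
        if i >= max_evals:
            break
        
        hyperpara = dict(zip(keys,v))
        if hyperpara['boosting_type'] == 'goss' :
            hyperpara['subsample'] =1.0
        
        # evaluate the parameters and store the result
        results.loc[i, :] = objective(hyperpara, i)
    
    # sort with the best score on top
    results.sort_values('score', ascending = False, inplace = True)
    results.reset_index(inplace = True)
    return results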
    

In [None]:
key , values = zip(*grid_para.items())
for  v in itertools.product(*values):
    print(v)  # show only the first combination; the full product is far too large to loop over
    break

In [None]:
values