## Problem Statement
Have you ever wondered how lenders use factors such as credit score, annual income, approved loan amount, tenure, and debt-to-income ratio to set your interest rate?

The process, known as ‘risk-based pricing’, uses an algorithm that weighs several characteristics of a loan applicant. Identifying the significant factors helps in building a prediction model that estimates loan interest rates from a client’s information. For consumers and borrowers, knowing these factors helps them improve their creditworthiness and puts them in a better position to negotiate a lower interest rate. For lending companies, such a model provides an immediate interest rate estimate based on the client’s information. Here, your goal is to use a training dataset to predict the loan rate category (1 / 2 / 3) that will be assigned to each loan in the test set.

You can use any combination of the features in the dataset to make your loan rate category predictions. Some features will be easier to use than others.

![](https://datahack-prod.s3.ap-south-1.amazonaws.com/__sized__/contest_cover/cover_1_OIHLvzm-thumbnail-1200x1200.png)

In [3]:
## import necessary libraries.

import numpy as np ## NumPy (arrays and conversion of data frames to arrays).
import pandas as pd ## pandas (loading data, creating data frames, etc.).
import os ## For reading and changing the working directory.
from sklearn.model_selection import train_test_split ## For splitting data into train and validation.
from sklearn.preprocessing import LabelEncoder ## For label encoding(converting categorical values to label).
from xgboost import XGBClassifier ## XG boost model.
from sklearn.model_selection import GridSearchCV ## For Grid search(cross validation).
from sklearn.metrics import accuracy_score ## For getting accuracy value.
from sklearn.metrics import confusion_matrix ## For getting confusion matrix.
from sklearn.metrics import classification_report ## For classifier metrics(accuracy,TPR,TNR).
from sklearn.naive_bayes import GaussianNB ## Naive Bayes Model.
from sklearn.neighbors import KNeighborsClassifier ## KNN Model.
from sklearn.ensemble import RandomForestClassifier ## Random Forest  Model.
from sklearn.ensemble import BaggingClassifier ## Bagging Model.
from sklearn.ensemble import AdaBoostClassifier ## AdaBoost Model.
from sklearn.ensemble import GradientBoostingClassifier ## GradientBoost Model.
from sklearn.svm import SVC ## SVC Model.

In [2]:
## Get current working directory.
os.getcwd()

'D:\\Python\\Pratice\\ML for Banking'

In [543]:
## Set working directory.
os.chdir(r"D:\DataScience\Pratice\Machine Learning for Banking") ## Raw string so the backslashes are not read as escape sequences.
os.getcwd()

'D:\\DataScience\\Pratice\\Machine Learning for Banking'

In [544]:
## Load data sets.
train = pd.read_csv('train.csv',header='infer',sep=',')
test = pd.read_csv('test.csv',header='infer',sep=',')

In [545]:
## Display dimensions of train and test.
print(train.shape)
print(test.shape)

(164309, 14)
(109541, 13)


In [546]:
## Check first record of train data.
train.head(1)

Unnamed: 0,Loan_ID,Loan_Amount_Requested,Length_Employed,Home_Owner,Annual_Income,Income_Verified,Purpose_Of_Loan,Debt_To_Income,Inquiries_Last_6Mo,Months_Since_Deliquency,Number_Open_Accounts,Total_Accounts,Gender,Interest_Rate
0,10000001,7000,< 1 year,Rent,68000.0,not verified,car,18.37,0,,9,14,Female,1


In [548]:
## Check last record of train data.
train.tail(1)

Unnamed: 0,Loan_ID,Loan_Amount_Requested,Length_Employed,Home_Owner,Annual_Income,Income_Verified,Purpose_Of_Loan,Debt_To_Income,Inquiries_Last_6Mo,Months_Since_Deliquency,Number_Open_Accounts,Total_Accounts,Gender,Interest_Rate
164308,10164309,9250,10+ years,Rent,,VERIFIED - income,credit_card,19.44,1,,5,9,Female,2


In [547]:
## Check first record of test data.
test.head(1)

Unnamed: 0,Loan_ID,Loan_Amount_Requested,Length_Employed,Home_Owner,Annual_Income,Income_Verified,Purpose_Of_Loan,Debt_To_Income,Inquiries_Last_6Mo,Months_Since_Deliquency,Number_Open_Accounts,Total_Accounts,Gender
0,10164310,27500,10+ years,Mortgage,129000.0,VERIFIED - income,debt_consolidation,12.87,0,68.0,10,37,Male


In [549]:
## Check last record of test data.
test.tail(1)

Unnamed: 0,Loan_ID,Loan_Amount_Requested,Length_Employed,Home_Owner,Annual_Income,Income_Verified,Purpose_Of_Loan,Debt_To_Income,Inquiries_Last_6Mo,Months_Since_Deliquency,Number_Open_Accounts,Total_Accounts,Gender
109540,10273850,15000,2 years,Mortgage,137000.0,not verified,medical,8.66,1,60.0,8,17,Male


In [550]:
## Check summary statistics of train data.
train.describe(include='all')

Unnamed: 0,Loan_ID,Loan_Amount_Requested,Length_Employed,Home_Owner,Annual_Income,Income_Verified,Purpose_Of_Loan,Debt_To_Income,Inquiries_Last_6Mo,Months_Since_Deliquency,Number_Open_Accounts,Total_Accounts,Gender,Interest_Rate
count,164309.0,164309.0,156938,138960,139207.0,164309,164309,164309.0,164309.0,75930.0,164309.0,164309.0,164309,164309.0
unique,,1290.0,11,5,,3,14,,,,,,2,
top,,10000.0,10+ years,Mortgage,,VERIFIED - income,debt_consolidation,,,,,,Male,
freq,,11622.0,52915,70345,,59421,97101,,,,,,117176,
mean,10082160.0,,,,73331.16,,,17.207189,0.781698,34.229356,11.193818,25.067665,,2.158951
std,47432.07,,,,60377.5,,,7.845083,1.034747,21.76118,4.991813,11.583067,,0.738364
min,10000000.0,,,,4000.0,,,0.0,0.0,0.0,0.0,2.0,,1.0
25%,10041080.0,,,,45000.0,,,11.37,0.0,16.0,8.0,17.0,,2.0
50%,10082160.0,,,,63000.0,,,16.84,0.0,31.0,10.0,23.0,,2.0
75%,10123230.0,,,,88697.5,,,22.78,1.0,50.0,14.0,32.0,,3.0


In [551]:
## Check summary statistics of test data.
test.describe(include='all')

Unnamed: 0,Loan_ID,Loan_Amount_Requested,Length_Employed,Home_Owner,Annual_Income,Income_Verified,Purpose_Of_Loan,Debt_To_Income,Inquiries_Last_6Mo,Months_Since_Deliquency,Number_Open_Accounts,Total_Accounts,Gender
count,109541.0,109541.0,104605,92830,92643.0,109541,109541,109541.0,109541.0,50682.0,109541.0,109541.0,109541
unique,,1246.0,11,5,,3,14,,,,,,2
top,,10000.0,10+ years,Mortgage,,VERIFIED - income,debt_consolidation,,,,,,Male
freq,,7820.0,35413,46925,,39655,64302,,,,,,77817
mean,10219080.0,,,,73485.41,,,17.228969,0.78881,33.914684,11.174337,25.06844,
std,31621.91,,,,55638.45,,,7.84731,1.039903,21.732856,4.946314,11.599639,
min,10164310.0,,,,3000.0,,,0.0,0.0,0.0,0.0,2.0,
25%,10191700.0,,,,45000.0,,,11.35,0.0,15.0,8.0,17.0,
50%,10219080.0,,,,63000.0,,,16.86,0.0,31.0,10.0,24.0,
75%,10246460.0,,,,89000.0,,,22.78,1.0,49.0,14.0,32.0,


In [552]:
## Check train data column names.
train.columns

Index(['Loan_ID', 'Loan_Amount_Requested', 'Length_Employed', 'Home_Owner',
       'Annual_Income', 'Income_Verified', 'Purpose_Of_Loan', 'Debt_To_Income',
       'Inquiries_Last_6Mo', 'Months_Since_Deliquency', 'Number_Open_Accounts',
       'Total_Accounts', 'Gender', 'Interest_Rate'],
      dtype='object')

In [553]:
## Check test data column names.
test.columns

Index(['Loan_ID', 'Loan_Amount_Requested', 'Length_Employed', 'Home_Owner',
       'Annual_Income', 'Income_Verified', 'Purpose_Of_Loan', 'Debt_To_Income',
       'Inquiries_Last_6Mo', 'Months_Since_Deliquency', 'Number_Open_Accounts',
       'Total_Accounts', 'Gender'],
      dtype='object')

In [554]:
## Get index range for train data.
train.index

RangeIndex(start=0, stop=164309, step=1)

In [555]:
## Get index range for test data.
test.index

RangeIndex(start=0, stop=109541, step=1)

In [556]:
## Check data types for train data columns.
train.dtypes

Loan_ID                      int64
Loan_Amount_Requested       object
Length_Employed             object
Home_Owner                  object
Annual_Income              float64
Income_Verified             object
Purpose_Of_Loan             object
Debt_To_Income             float64
Inquiries_Last_6Mo           int64
Months_Since_Deliquency    float64
Number_Open_Accounts         int64
Total_Accounts               int64
Gender                      object
Interest_Rate                int64
dtype: object

In [557]:
## Check data types for test data columns.
test.dtypes

Loan_ID                      int64
Loan_Amount_Requested       object
Length_Employed             object
Home_Owner                  object
Annual_Income              float64
Income_Verified             object
Purpose_Of_Loan             object
Debt_To_Income             float64
Inquiries_Last_6Mo           int64
Months_Since_Deliquency    float64
Number_Open_Accounts         int64
Total_Accounts               int64
Gender                      object
dtype: object

In [558]:
## Check null values for train data.
train.isna().sum()

Loan_ID                        0
Loan_Amount_Requested          0
Length_Employed             7371
Home_Owner                 25349
Annual_Income              25102
Income_Verified                0
Purpose_Of_Loan                0
Debt_To_Income                 0
Inquiries_Last_6Mo             0
Months_Since_Deliquency    88379
Number_Open_Accounts           0
Total_Accounts                 0
Gender                         0
Interest_Rate                  0
dtype: int64

In [559]:
## Check null values for test data.
test.isna().sum()

Loan_ID                        0
Loan_Amount_Requested          0
Length_Employed             4936
Home_Owner                 16711
Annual_Income              16898
Income_Verified                0
Purpose_Of_Loan                0
Debt_To_Income                 0
Inquiries_Last_6Mo             0
Months_Since_Deliquency    58859
Number_Open_Accounts           0
Total_Accounts                 0
Gender                         0
dtype: int64

In [560]:
## Return the data type, distinct levels, null count, and number of unique values of each column in the given data frame.

def observations(df):
    return(pd.DataFrame({'dtypes' : df.dtypes,
                         'levels' : [df[x].unique() for x in df.columns],
                         'null_values' : df.isna().sum(),
                         'Unique Values': df.nunique()
                        }))

In [561]:
## Get data types, levels, null values, and unique value counts for each column of train data.
observations(train)

Unnamed: 0,dtypes,levels,null_values,Unique Values
Loan_ID,int64,"[10000001, 10000002, 10000003, 10000004, 10000...",0,164309
Loan_Amount_Requested,object,"[7,000, 30,000, 24,725, 16,000, 17,000, 4,500,...",0,1290
Length_Employed,object,"[< 1 year, 4 years, 7 years, 8 years, 2 years,...",7371,11
Home_Owner,object,"[Rent, Mortgage, nan, Own, Other, None]",25349,5
Annual_Income,float64,"[68000.0, nan, 75566.4, 56160.0, 96000.0, 3000...",25102,12305
Income_Verified,object,"[not verified, VERIFIED - income, VERIFIED - i...",0,3
Purpose_Of_Loan,object,"[car, debt_consolidation, credit_card, home_im...",0,14
Debt_To_Income,float64,"[18.37, 14.93, 15.88, 14.34, 22.17, 10.88, 5.6...",0,3953
Inquiries_Last_6Mo,int64,"[0, 3, 1, 2, 4, 5, 6, 7, 8]",0,9
Months_Since_Deliquency,float64,"[nan, 17.0, 16.0, 68.0, 13.0, 6.0, 64.0, 10.0,...",88379,122


In [562]:
## Get data types, levels, null values, and unique value counts for each column of test data.
observations(test)

Unnamed: 0,dtypes,levels,null_values,Unique Values
Loan_ID,int64,"[10164310, 10164311, 10164312, 10164313, 10164...",0,109541
Loan_Amount_Requested,object,"[27,500, 26,000, 6,075, 12,000, 35,000, 8,000,...",0,1246
Length_Employed,object,"[10+ years, < 1 year, 6 years, 8 years, 1 year...",4936,11
Home_Owner,object,"[Mortgage, nan, Rent, Own, Other, None]",16711,5
Annual_Income,float64,"[129000.0, 110000.0, 75000.0, 73000.0, 156000....",16898,9028
Income_Verified,object,"[VERIFIED - income, not verified, VERIFIED - i...",0,3
Purpose_Of_Loan,object,"[debt_consolidation, credit_card, home_improve...",0,14
Debt_To_Income,float64,"[12.87, 11.37, 6.83, 7.76, 9.62, 0.0, 22.89, 2...",0,3895
Inquiries_Last_6Mo,int64,"[0, 2, 1, 3, 6, 4, 5, 7, 8]",0,9
Months_Since_Deliquency,float64,"[68.0, nan, 26.0, 18.0, 22.0, 65.0, 47.0, 45.0...",58859,115


In [563]:
## Replace comma(,) with empty value for Loan_Amount_Requested column of train data.
train['Loan_Amount_Requested'] = train['Loan_Amount_Requested'].str.replace(',','')

In [564]:
## Replace comma(,) with empty value for Loan_Amount_Requested column of test data.
test['Loan_Amount_Requested'] = test['Loan_Amount_Requested'].str.replace(',','')

In [565]:
## Convert Loan_Amount_Requested column data type from string to float(train data).
train['Loan_Amount_Requested'] = train['Loan_Amount_Requested'].astype('float')

In [566]:
## Convert Loan_Amount_Requested column data type from string to float(test data).
test['Loan_Amount_Requested'] = test['Loan_Amount_Requested'].astype('float')
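
The four cells above repeat the same two steps for train and test. A minimal sketch of a single helper that does both steps at once is shown below; the name `clean_amount` and the toy data frame `demo` are hypothetical and not part of the original notebook.

```python
import pandas as pd

def clean_amount(series: pd.Series) -> pd.Series:
    """Strip thousands separators from a string column and convert it to float."""
    # regex=False treats ',' as a literal character, not a pattern.
    return series.str.replace(',', '', regex=False).astype('float')

# Toy data using values that appear in the 'levels' output above.
demo = pd.DataFrame({'Loan_Amount_Requested': ['7,000', '30,000', '24,725']})
demo['Loan_Amount_Requested'] = clean_amount(demo['Loan_Amount_Requested'])
print(demo['Loan_Amount_Requested'].tolist())
```

The same helper could then be applied to both `train` and `test`. Another option worth trying is the `thousands=','` argument of `pd.read_csv`, which parses such values while loading.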

In [567]:
## Check object data type columns.
train.select_dtypes('object').columns

Index(['Length_Employed', 'Home_Owner', 'Income_Verified', 'Purpose_Of_Loan',
       'Gender'],
      dtype='object')

In [568]:
## Convert columns data types from object to category for the given data frame.
def dataTypeConversion(df):
    for col in df.select_dtypes('object').columns:
        df[col] = df[col].astype('category')

In [569]:
## Convert columns data types from object to category for train data.
dataTypeConversion(train)

In [570]:
## Convert columns data types from object to category for test data.
dataTypeConversion(test)

In [571]:
## Check train data columns data types after conversion.
train.dtypes

Loan_ID                       int64
Loan_Amount_Requested       float64
Length_Employed            category
Home_Owner                 category
Annual_Income               float64
Income_Verified            category
Purpose_Of_Loan            category
Debt_To_Income              float64
Inquiries_Last_6Mo            int64
Months_Since_Deliquency     float64
Number_Open_Accounts          int64
Total_Accounts                int64
Gender                     category
Interest_Rate                 int64
dtype: object

In [572]:
## Check test data columns data types after conversion.
test.dtypes

Loan_ID                       int64
Loan_Amount_Requested       float64
Length_Employed            category
Home_Owner                 category
Annual_Income               float64
Income_Verified            category
Purpose_Of_Loan            category
Debt_To_Income              float64
Inquiries_Last_6Mo            int64
Months_Since_Deliquency     float64
Number_Open_Accounts          int64
Total_Accounts                int64
Gender                     category
dtype: object

In [573]:
## Set index for train and test.
train.set_index('Loan_ID',inplace=True)
test.set_index('Loan_ID',inplace=True)

In [574]:
## Check first record of train data after setting index value.
train.head(1)

Loan_ID,Loan_Amount_Requested,Length_Employed,Home_Owner,Annual_Income,Income_Verified,Purpose_Of_Loan,Debt_To_Income,Inquiries_Last_6Mo,Months_Since_Deliquency,Number_Open_Accounts,Total_Accounts,Gender,Interest_Rate
10000001,7000.0,< 1 year,Rent,68000.0,not verified,car,18.37,0,,9,14,Female,1


In [575]:
## Check first record of test data after setting index value.
test.head(1)

Loan_ID,Loan_Amount_Requested,Length_Employed,Home_Owner,Annual_Income,Income_Verified,Purpose_Of_Loan,Debt_To_Income,Inquiries_Last_6Mo,Months_Since_Deliquency,Number_Open_Accounts,Total_Accounts,Gender
10164310,27500.0,10+ years,Mortgage,129000.0,VERIFIED - income,debt_consolidation,12.87,0,68.0,10,37,Male


In [576]:
## Split data into train and validation sets (80:20 ratio).
X_train,X_test,y_train,y_test = train_test_split(train.drop('Interest_Rate',axis=1),train['Interest_Rate'],test_size=0.2,random_state=1234)
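
The split above is purely random. Since the target has three classes, one variation is a stratified split, which keeps the class proportions the same in both parts. The sketch below uses a hypothetical toy frame (`toy`) rather than the real data; only the `stratify` argument differs from the call above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows with classes 1/2/3 in a 20/40/40 mix.
toy = pd.DataFrame({'feature': range(100),
                    'Interest_Rate': [1] * 20 + [2] * 40 + [3] * 40})

X_tr, X_val, y_tr, y_val = train_test_split(
    toy.drop('Interest_Rate', axis=1),
    toy['Interest_Rate'],
    test_size=0.2,
    random_state=1234,
    stratify=toy['Interest_Rate'])  # keep class proportions in both parts

print(y_tr.value_counts().sort_index())
print(y_val.value_counts().sort_index())
```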

In [577]:
## Check dimensions of train and validation data.
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(131447, 12)
(131447,)
(32862, 12)
(32862,)


In [578]:
## Get unique values for Length_Employed column. 
train['Length_Employed'].unique()

[< 1 year, 4 years, 7 years, 8 years, 2 years, ..., NaN, 6 years, 9 years, 3 years, 5 years]
Length: 12
Categories (11, object): [< 1 year, 4 years, 7 years, 8 years, ..., 6 years, 9 years, 3 years, 5 years]

In [579]:
## Create a dictionary to map number of years experience.
experience_mapping = {
    '< 1 year' : '0',
    '1 year' : '1',
    '2 years' : '2',
    '3 years' : '3',
    '4 years' : '4',
    '5 years' : '5',
    '6 years' : '6',
    '7 years' : '7',
    '8 years' : '8',
    '9 years' : '9',
    '10+ years' : '10',
    'NaN': 'Unknown'
}

In [580]:
## Return the mapped experience value from the dictionary.
def experience_foo(exp):
    return experience_mapping[exp]
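
Note that the `'NaN'` key in the dictionary is a string, so it does not match real missing values (float `NaN`); the notebook fills those later with `add_categories` and `fillna`. As a sketch of an alternative, assuming a plain object column rather than a categorical one, `Series.map` with a dictionary followed by `fillna` handles both steps together. The shortened dictionary and the toy series `s` are for illustration only.

```python
import numpy as np
import pandas as pd

# Shortened version of the mapping above, for illustration only.
experience_mapping = {'< 1 year': '0', '1 year': '1', '10+ years': '10'}

s = pd.Series(['< 1 year', '10+ years', np.nan, '1 year'])

# Values missing from the dictionary (including NaN) map to NaN,
# which is then replaced by the 'Unknown' level.
mapped = s.map(experience_mapping).fillna('Unknown')
print(mapped.tolist())
```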

In [581]:
## Map experience values with dictionary values for Length_Employed column(train data).
X_train['Length_Employed'] = X_train['Length_Employed'].apply(experience_foo)

In [582]:
## Check first record of train data after mapping dictionary values.
X_train.head(1)

Loan_ID,Loan_Amount_Requested,Length_Employed,Home_Owner,Annual_Income,Income_Verified,Purpose_Of_Loan,Debt_To_Income,Inquiries_Last_6Mo,Months_Since_Deliquency,Number_Open_Accounts,Total_Accounts,Gender
10003909,15000.0,10,Mortgage,65000.0,not verified,debt_consolidation,20.33,2,,10,34,Male


In [583]:
## Map experience values with dictionary values for Length_Employed column(validation data).
X_test['Length_Employed'] = X_test['Length_Employed'].apply(experience_foo)

In [584]:
## Check first record of validation data after mapping dictionary values.
X_test.head(1)

Loan_ID,Loan_Amount_Requested,Length_Employed,Home_Owner,Annual_Income,Income_Verified,Purpose_Of_Loan,Debt_To_Income,Inquiries_Last_6Mo,Months_Since_Deliquency,Number_Open_Accounts,Total_Accounts,Gender
10086865,35000.0,7,,82000.0,VERIFIED - income,debt_consolidation,19.13,1,38.0,10,19,Male


In [585]:
## Map experience values with dictionary values for Length_Employed column(test data).
test['Length_Employed'] = test['Length_Employed'].apply(experience_foo)

In [586]:
## Check first record of test data after mapping dictionary values.
test.head(1)

Loan_ID,Loan_Amount_Requested,Length_Employed,Home_Owner,Annual_Income,Income_Verified,Purpose_Of_Loan,Debt_To_Income,Inquiries_Last_6Mo,Months_Since_Deliquency,Number_Open_Accounts,Total_Accounts,Gender
10164310,27500.0,10,Mortgage,129000.0,VERIFIED - income,debt_consolidation,12.87,0,68.0,10,37,Male


In [587]:
### find missing values % and display them in descending order.
missing_value = (train.isna().sum()/len(train)).round(4)*100
missing_value.sort_values(ascending=False)

Months_Since_Deliquency    53.79
Home_Owner                 15.43
Annual_Income              15.28
Length_Employed             4.49
Interest_Rate               0.00
Gender                      0.00
Total_Accounts              0.00
Number_Open_Accounts        0.00
Inquiries_Last_6Mo          0.00
Debt_To_Income              0.00
Purpose_Of_Loan             0.00
Income_Verified             0.00
Loan_Amount_Requested       0.00
dtype: float64

In [588]:
## For categorical columns, fillna requires a value that already exists as a category, so add an 'Unknown' level first.
X_train['Length_Employed'] = X_train['Length_Employed'].cat.add_categories('Unknown')

In [589]:
## Fill NA values with Unknown for Length_Employed column of train data.
X_train['Length_Employed'].fillna('Unknown',inplace=True)

In [590]:
## For categorical columns, fillna requires a value that already exists as a category, so add an 'Unknown' level first.
X_test['Length_Employed'] = X_test['Length_Employed'].cat.add_categories('Unknown')

In [591]:
## Fill NA values with Unknown for Length_Employed column of validation data.
X_test['Length_Employed'].fillna('Unknown',inplace=True)

In [592]:
## For categorical columns, fillna requires a value that already exists as a category, so add an 'Unknown' level first.
test['Length_Employed'] = test['Length_Employed'].cat.add_categories('Unknown')

In [593]:
## Fill NA values with Unknown for Length_Employed column of test data.
test['Length_Employed'].fillna('Unknown',inplace=True)

In [594]:
## Fill NA values with None for Home_Owner column of train data.
X_train['Home_Owner'].fillna('None',inplace=True)

In [595]:
## Fill NA values with None for Home_Owner column of validation data.
X_test['Home_Owner'].fillna('None',inplace=True)

In [596]:
## Fill NA values with None for Home_Owner column of test data.
test['Home_Owner'].fillna('None',inplace=True)

In [597]:
## Fill NA values with mean of Annual_Income column for Annual_Income column of train data.
X_train['Annual_Income'].fillna(X_train['Annual_Income'].mean(),inplace=True)

In [598]:
## Fill NA values with mean of Annual_Income column for Annual_Income column of validation data.
X_test['Annual_Income'].fillna(X_test['Annual_Income'].mean(),inplace=True)

In [599]:
## Fill NA values with mean of Annual_Income column for Annual_Income column of test data.
test['Annual_Income'].fillna(test['Annual_Income'].mean(),inplace=True)

In [600]:
## Fill NA values with mean of Months_Since_Deliquency column for Months_Since_Deliquency column of train data.
X_train['Months_Since_Deliquency'].fillna(X_train['Months_Since_Deliquency'].mean(),inplace=True)

In [601]:
## Fill NA values with mean of Months_Since_Deliquency column for Months_Since_Deliquency column of validation data.
X_test['Months_Since_Deliquency'].fillna(X_test['Months_Since_Deliquency'].mean(),inplace=True)

In [602]:
## Fill NA values with mean of Months_Since_Deliquency column for Months_Since_Deliquency column of test data.
test['Months_Since_Deliquency'].fillna(test['Months_Since_Deliquency'].mean(),inplace=True)
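
The cells above fill each data set with its own mean. A common alternative is to compute the statistic on the training part only and reuse it for validation and test, so that all three sets are imputed with the same value. The sketch below shows the idea on hypothetical toy frames (`train_part`, `valid_part`); it is not what the notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames standing in for X_train and X_test.
train_part = pd.DataFrame({'Annual_Income': [40000.0, np.nan, 60000.0]})
valid_part = pd.DataFrame({'Annual_Income': [np.nan, 90000.0]})

# Learn the fill value from the training part only.
train_mean = train_part['Annual_Income'].mean()

# Apply the same value to every data set.
train_part['Annual_Income'] = train_part['Annual_Income'].fillna(train_mean)
valid_part['Annual_Income'] = valid_part['Annual_Income'].fillna(train_mean)

print(train_mean)
print(valid_part['Annual_Income'].tolist())
```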

In [603]:
## Check null/NA values for train data.
X_train.isna().sum()

Loan_Amount_Requested      0
Length_Employed            0
Home_Owner                 0
Annual_Income              0
Income_Verified            0
Purpose_Of_Loan            0
Debt_To_Income             0
Inquiries_Last_6Mo         0
Months_Since_Deliquency    0
Number_Open_Accounts       0
Total_Accounts             0
Gender                     0
dtype: int64

In [604]:
## Check null/NA values for validation data.
X_test.isna().sum()

Loan_Amount_Requested      0
Length_Employed            0
Home_Owner                 0
Annual_Income              0
Income_Verified            0
Purpose_Of_Loan            0
Debt_To_Income             0
Inquiries_Last_6Mo         0
Months_Since_Deliquency    0
Number_Open_Accounts       0
Total_Accounts             0
Gender                     0
dtype: int64

In [605]:
## Check null/NA values for test data.
test.isna().sum()

Loan_Amount_Requested      0
Length_Employed            0
Home_Owner                 0
Annual_Income              0
Income_Verified            0
Purpose_Of_Loan            0
Debt_To_Income             0
Inquiries_Last_6Mo         0
Months_Since_Deliquency    0
Number_Open_Accounts       0
Total_Accounts             0
Gender                     0
dtype: int64

In [606]:
## Check data types for train data.
X_train.dtypes

Loan_Amount_Requested       float64
Length_Employed            category
Home_Owner                 category
Annual_Income               float64
Income_Verified            category
Purpose_Of_Loan            category
Debt_To_Income              float64
Inquiries_Last_6Mo            int64
Months_Since_Deliquency     float64
Number_Open_Accounts          int64
Total_Accounts                int64
Gender                     category
dtype: object

In [607]:
## Get unique values for Length_Employed column.
X_train['Length_Employed'].unique()

[10, 2, 3, 1, 5, ..., 4, 8, Unknown, 9, 6]
Length: 12
Categories (12, object): [10, 2, 3, 1, ..., 8, Unknown, 9, 6]

In [608]:
## Instantiate Label Encoder.
le_experience = LabelEncoder()
le_interest_rate = LabelEncoder()

In [609]:
## Do label encoding for Length_Employed of train data.
X_train['Length_Employed'] = le_experience.fit_transform(X_train['Length_Employed'])

In [610]:
## Do label encoding for Length_Employed of validation data.
X_test['Length_Employed'] = le_experience.transform(X_test['Length_Employed'])

In [611]:
## Do label encoding for Length_Employed of test data.
test['Length_Employed'] = le_experience.transform(test['Length_Employed'])

In [612]:
## Do label encoding for target column of train data.
y_train = le_interest_rate.fit_transform(y_train)

In [613]:
## Do label encoding for target column of validation data.
y_test = le_interest_rate.transform(y_test)

In [None]:
## Store train data category columns data into X_train_dummy_cat.
X_train_dummy_cat = X_train[['Home_Owner','Income_Verified','Purpose_Of_Loan','Gender']]

In [615]:
## Get Dummies for train data category columns.
X_train_dummy_cat = pd.get_dummies(columns = ['Home_Owner','Income_Verified','Purpose_Of_Loan','Gender'], data = X_train_dummy_cat, drop_first= True)

In [616]:
## Store validation data category columns data into X_test_dummy_cat.
X_test_dummy_cat = X_test[['Home_Owner','Income_Verified','Purpose_Of_Loan','Gender']]

In [617]:
## Get Dummies for validation data category columns.
X_test_dummy_cat = pd.get_dummies(columns = ['Home_Owner','Income_Verified','Purpose_Of_Loan','Gender'], data = X_test_dummy_cat, drop_first= True)

In [618]:
## Check first record of validation data after doing dummies.
X_test_dummy_cat.head(1)

Loan_ID,Home_Owner_None,Home_Owner_Other,Home_Owner_Own,Home_Owner_Rent,Income_Verified_VERIFIED - income source,Income_Verified_not verified,Purpose_Of_Loan_credit_card,Purpose_Of_Loan_debt_consolidation,Purpose_Of_Loan_educational,Purpose_Of_Loan_home_improvement,Purpose_Of_Loan_house,Purpose_Of_Loan_major_purchase,Purpose_Of_Loan_medical,Purpose_Of_Loan_moving,Purpose_Of_Loan_other,Purpose_Of_Loan_renewable_energy,Purpose_Of_Loan_small_business,Purpose_Of_Loan_vacation,Purpose_Of_Loan_wedding,Gender_Male
10086865,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1


In [619]:
## Store test data category columns data into test_dummy_cat.
test_dummy_cat = test[['Home_Owner','Income_Verified','Purpose_Of_Loan','Gender']]

In [620]:
## Get Dummies for test data category columns.
test_dummy_cat = pd.get_dummies(columns = ['Home_Owner','Income_Verified','Purpose_Of_Loan','Gender'], data = test_dummy_cat, drop_first= True)
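
Calling `get_dummies` separately on each data set only yields matching columns if every level is known in every set. That appears to hold here, since the shape check further down reports 20 dummy columns for all three sets. As a sketch of a safeguard for the general case, the dummy columns can be aligned to the training columns with `reindex`. The toy frames `train_cat` and `test_cat` are hypothetical.

```python
import pandas as pd

# Hypothetical toy frames: the level 'Own' never occurs in the test part.
train_cat = pd.DataFrame({'Home_Owner': ['Rent', 'Own', 'Mortgage']})
test_cat = pd.DataFrame({'Home_Owner': ['Rent', 'Mortgage']})

train_dummies = pd.get_dummies(train_cat, columns=['Home_Owner'], drop_first=True)
test_dummies = pd.get_dummies(test_cat, columns=['Home_Owner'], drop_first=True)

# Force the test columns to match the training columns;
# levels unseen in test become all-zero columns.
test_dummies = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(list(train_dummies.columns))
print(list(test_dummies.columns))
```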

In [621]:
## Drop repeating columns in train data.
X_train.drop(['Home_Owner','Income_Verified','Purpose_Of_Loan','Gender'],axis=1,inplace=True)

In [622]:
## Drop repeating columns in validation data.
X_test.drop(['Home_Owner','Income_Verified','Purpose_Of_Loan','Gender'],axis=1,inplace=True)

In [623]:
## Drop repeating columns in test data.
test.drop(['Home_Owner','Income_Verified','Purpose_Of_Loan','Gender'],axis=1,inplace=True)

In [624]:
## Check dimensions of dummies of train,validation,test data.
print(X_train_dummy_cat.shape)
print(X_test_dummy_cat.shape)
print(test_dummy_cat.shape)

(131447, 20)
(32862, 20)
(109541, 20)


In [625]:
## Check dimensions of train,validation,test data.
print(X_train.shape)
print(X_test.shape)
print(test.shape)

(131447, 8)
(32862, 8)
(109541, 8)


In [626]:
## Concat dummies with remaining train columns.
train_data = pd.concat([X_train, X_train_dummy_cat], axis=1,sort=False)

In [627]:
## Concat dummies with remaining validation columns.
validation_data = pd.concat([X_test, X_test_dummy_cat], axis=1,sort=False)

In [628]:
## Concat dummies with remaining test columns.
test_data = pd.concat([test, test_dummy_cat], axis=1,sort=False)

In [629]:
## Create a data frame for target variable.
temp = pd.DataFrame(y_train)

In [630]:
## Check null values for target variable.
temp.isna().sum()

0    0
dtype: int64

In [631]:
## Check column names for train data.
train_data.columns

Index(['Loan_Amount_Requested', 'Length_Employed', 'Annual_Income',
       'Debt_To_Income', 'Inquiries_Last_6Mo', 'Months_Since_Deliquency',
       'Number_Open_Accounts', 'Total_Accounts', 'Home_Owner_None',
       'Home_Owner_Other', 'Home_Owner_Own', 'Home_Owner_Rent',
       'Income_Verified_VERIFIED - income source',
       'Income_Verified_not verified', 'Purpose_Of_Loan_credit_card',
       'Purpose_Of_Loan_debt_consolidation', 'Purpose_Of_Loan_educational',
       'Purpose_Of_Loan_home_improvement', 'Purpose_Of_Loan_house',
       'Purpose_Of_Loan_major_purchase', 'Purpose_Of_Loan_medical',
       'Purpose_Of_Loan_moving', 'Purpose_Of_Loan_other',
       'Purpose_Of_Loan_renewable_energy', 'Purpose_Of_Loan_small_business',
       'Purpose_Of_Loan_vacation', 'Purpose_Of_Loan_wedding', 'Gender_Male'],
      dtype='object')

In [632]:
## Copy test data into temp.
temp = test_data.copy()

In [633]:
#xgb = XGBClassifier() ## Instantiate XGBClassifier model

#optimization_dict = {'max_depth': [2,3,4,5,6,7], ## trying with different max_depth,n_estimators to find best model
#                     'n_estimators': [50,60,70,80,90,100,150,200]} 

## Build best model with Grid Search params
#model = GridSearchCV(xgb, ## XGB model
#                     optimization_dict, ## dictionary with different max_depth, n_estimators values
#                     scoring='accuracy', ## metric used to rank the candidates
#                     verbose=1, ## print progress messages
#                     n_jobs=-1) ## Number of jobs to run in parallel; -1 means use all processors

#%time model.fit(train_data, y_train) ## Fit a model
#print(model.best_score_) ## Display best score value
#print(model.best_params_) ## Display best parameters
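
The grid search above is commented out because it takes long on the full data. The following is a small, runnable sketch of the same pattern. It makes two assumptions that differ from the notebook: it uses synthetic data from `make_classification`, and it swaps in scikit-learn's `GradientBoostingClassifier` as a stand-in for `XGBClassifier` so the example has no extra dependency. The grid values are reduced for speed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic 3-class data (an assumption, not the loan data).
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=1234)

# Reduced grid over the same two parameters as above.
param_grid = {'max_depth': [2, 3], 'n_estimators': [20, 40]}

search = GridSearchCV(GradientBoostingClassifier(random_state=1234),
                      param_grid,
                      scoring='accuracy',  # metric used to rank candidates
                      cv=3)                # 3-fold cross validation
search.fit(X, y)

print(search.best_params_)  # best parameter combination found
print(search.best_score_)   # mean cross-validated accuracy of that combination
```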

In [634]:
## Build a model with the best parameters found by the grid search CV (above code).
model = XGBClassifier(max_depth=6,           ## Depth of the tree.
                      n_estimators=200,      ## number of trees.
                      learning_rate = 0.001, ## learning rate.
                      booster ='gbtree',     ## tree type.
                      random_state=1234)     ## seed value.
## Fit a model.
%time model.fit(train_data, y_train)

Wall time: 5min 13s


XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.001, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=None, n_estimators=200, n_jobs=1,
              nthread=None, objective='multi:softprob', random_state=1234,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)

In [635]:
## Get the predictions on train data.
train_pred = model.predict(train_data)

In [636]:
## Display accuracy value for train data.
print("Train Accuracy :",accuracy_score(y_train,train_pred))

Train Accuracy : 0.5115065387570655


In [637]:
## Get the predictions on validation data.
validation_pred = model.predict(validation_data)

In [638]:
## Display accuracy value for validation data.
print("Validation Accuracy :",accuracy_score(y_test,validation_pred))

Validation Accuracy : 0.5079727344653399


In [639]:
## Get the confusion matrix for train data.
confusion_matrix_train = confusion_matrix(y_train, train_pred)
print(confusion_matrix_train)

[[ 3137 19081  4864]
 [ 2060 39071 15417]
 [  566 22223 25028]]


In [640]:
## Get the confusion matrix for validation data.
confusion_matrix_test = confusion_matrix(y_test, validation_pred)
print(confusion_matrix_test)

[[ 739 4732 1253]
 [ 523 9607 3902]
 [ 134 5625 6347]]
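
Overall accuracy hides how unevenly the three classes are handled. The sketch below recomputes accuracy, per-class recall and per-class precision from the validation confusion matrix printed above, using the scikit-learn convention that rows are true classes and columns are predicted classes. The class indices are the label-encoded values 0, 1 and 2.

```python
# Validation confusion matrix printed above
# (rows = true class, columns = predicted class).
cm = [
    [739, 4732, 1253],
    [523, 9607, 3902],
    [134, 5625, 6347],
]
n_classes = len(cm)

total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(n_classes))
accuracy = correct / total

# Recall: share of each true class that was predicted correctly.
recall = [cm[i][i] / sum(cm[i]) for i in range(n_classes)]

# Precision: share of each predicted class that was actually correct.
precision = [
    cm[i][i] / sum(row[i] for row in cm) for i in range(n_classes)
]

print(f"accuracy: {accuracy:.4f}")
for i in range(n_classes):
    print(f"class {i}: recall={recall[i]:.3f} precision={precision[i]:.3f}")
```

The recall of encoded class 0 comes out at roughly 11%, far below the other two classes, so most loans of that class are assigned to class 1 instead.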


In [641]:
## Get the predictions on test data.
y_pred = model.predict(temp)

In [642]:
## Display predictions.
y_pred

array([1, 0, 2, ..., 1, 2, 1], dtype=int64)

In [643]:
## Copy temp values to temp1.
temp1 = temp.copy()

In [644]:
## Check first 5 records of temp1.
temp1.head()

Loan_ID,Loan_Amount_Requested,Length_Employed,Annual_Income,Debt_To_Income,Inquiries_Last_6Mo,Months_Since_Deliquency,Number_Open_Accounts,Total_Accounts,Home_Owner_None,Home_Owner_Other,...,Purpose_Of_Loan_house,Purpose_Of_Loan_major_purchase,Purpose_Of_Loan_medical,Purpose_Of_Loan_moving,Purpose_Of_Loan_other,Purpose_Of_Loan_renewable_energy,Purpose_Of_Loan_small_business,Purpose_Of_Loan_vacation,Purpose_Of_Loan_wedding,Gender_Male
10164310,27500.0,2,129000.0,12.87,0,68.0,10,37,0,0,...,0,0,0,0,0,0,0,0,0,1
10164311,26000.0,2,110000.0,11.37,0,33.914684,6,23,1,0,...,0,0,0,0,0,0,0,0,0,1
10164312,6075.0,0,75000.0,6.83,2,33.914684,5,20,0,0,...,0,0,0,0,0,0,0,0,0,1
10164313,12000.0,2,73000.0,7.76,0,33.914684,6,8,0,0,...,0,0,0,0,0,0,0,0,0,1
10164314,35000.0,0,156000.0,9.62,0,26.0,9,21,0,0,...,0,0,0,0,0,0,0,0,0,1


In [645]:
## Do inverse transform on predictions to get their original values.
temp1['Interest_Rate'] = le_interest_rate.inverse_transform(y_pred)
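
The model predicts encoded classes (0, 1, 2), while the submission needs the original categories (1, 2, 3). The sketch below mimics in plain Python what a fitted `LabelEncoder` does: it sorts the distinct labels and replaces each label by its position in that sorted list. It assumes `le_interest_rate` was fitted on the `Interest_Rate` column earlier in the notebook; the sample values are hypothetical.

```python
# Hypothetical Interest_Rate values, for illustration only.
categories = [3, 1, 2, 2, 1]

# A LabelEncoder stores the sorted distinct labels as its classes.
classes = sorted(set(categories))
to_code = {label: code for code, label in enumerate(classes)}

# Equivalent of transform(): label -> integer code.
encoded = [to_code[label] for label in categories]

# Equivalent of inverse_transform(): integer code -> original label.
decoded = [classes[code] for code in encoded]

print("classes:", classes)
print("encoded:", encoded)
print("decoded:", decoded)
```

Under this mapping, encoded class 0 corresponds to rate category 1, class 1 to category 2 and class 2 to category 3.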

In [646]:
## Reset index value.
temp1.reset_index(inplace=True)

In [647]:
## Copy Loan_ID, Interest_Rate column data from temp1 to to_submit_1.
to_submit_1 = temp1[['Loan_ID', 'Interest_Rate']]

In [648]:
## Check dimensions of to_submit_1.
to_submit_1.shape

(109541, 2)

In [649]:
## Check dimensions of test data.
test_data.shape

(109541, 28)

In [650]:
## Check value counts for Interest_Rate column of to_submit_1.
to_submit_1.Interest_Rate.value_counts()

2    71330
3    36342
1     1869
Name: Interest_Rate, dtype: int64

In [651]:
## Store to_submit_1 into csv file with name XGBoost. 
to_submit_1.to_csv('XGBoost.csv',index = False)

In [4]:
## Build different classifier models.

In [653]:
## Instantiate KNN model.
## model = KNeighborsClassifier(algorithm = 'brute', n_neighbors = 3,metric = "euclidean")

In [654]:
## Instantiate Naive Bayes Model.
## model = GaussianNB()

In [655]:
## Instantiate Random forest Model.
## model = RandomForestClassifier(n_estimators=200,max_depth=6,n_jobs=-1,class_weight = 'balanced') #class_weight = 'balanced'

In [676]:
## Instantiate Bagging classifier Model.
## model  = BaggingClassifier(n_estimators=500)

In [692]:
## Instantiate Adaboost Model.
## model = AdaBoostClassifier(n_estimators=200,learning_rate=.001)

In [708]:
## Instantiate Gradient boosting classifier Model.
## model = GradientBoostingClassifier(n_estimators=200,learning_rate=0.01)

In [659]:
## Instantiate SVC Model.
## model = SVC(C=10,kernel='rbf')

In [660]:
## Random forest gave the best result compared to the other classifier models.
from sklearn.ensemble import RandomForestClassifier ## Random forest model.
model = RandomForestClassifier(n_estimators=2000,         ## The number of trees in the forest.
                               max_depth=7,               ## The maximum depth of each tree.
                               n_jobs=-1,                 ## The number of jobs to run in parallel. -1 means using all processors.
                               class_weight = 'balanced', ## Weight classes inversely proportional to their frequencies.
                               criterion='entropy')       ## The function to measure the quality of a split.
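
As a rough illustration of what `class_weight = 'balanced'` does, the sketch below applies the formula documented by scikit-learn, `n_samples / (n_classes * class_count)`, to the training split. The class counts are not printed directly in this notebook; they are taken here as the row sums of the train confusion matrix shown earlier, which is an assumption about how the split was composed.

```python
# Train confusion matrix printed earlier (rows = true class).
cm_train = [
    [3137, 19081, 4864],
    [2060, 39071, 15417],
    [566, 22223, 25028],
]

# Row sums give the number of training samples per true class.
class_counts = [sum(row) for row in cm_train]
n_samples = sum(class_counts)
n_classes = len(class_counts)

# 'balanced' weighting: rarer classes receive larger weights.
weights = [n_samples / (n_classes * count) for count in class_counts]

for code, (count, weight) in enumerate(zip(class_counts, weights)):
    print(f"class {code}: count={count} weight={weight:.4f}")
```

The smallest class (encoded 0) receives a weight of about 1.62, roughly twice that of the largest class, which is meant to counter the tendency to under-predict it.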

In [709]:
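## Fit a model. The output below shows GradientBoostingClassifier (the model from cell In [708]), so the results that follow belong to that model.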
%time model.fit(train_data, y_train)

Wall time: 4min 38s


GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
                           learning_rate=0.01, loss='deviance', max_depth=3,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=200,
                           n_iter_no_change=None, presort='deprecated',
                           random_state=None, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)

In [710]:
## Get prediction on train and validation data.
predict_train = model.predict(train_data)
predict_validation = model.predict(validation_data)

In [711]:
## Display accuracy value for train data.
print("Train Accuracy :",accuracy_score(y_train,predict_train))

Train Accuracy : 0.5058921086065106


In [712]:
## Display  accuracy value for validation data.
print("Validation Accuracy :",accuracy_score(y_test,predict_validation))

Validation Accuracy : 0.5056296025804881


In [713]:
## Get the confusion matrix for train data.
confusion_matrix_train = confusion_matrix(y_train, predict_train)
print(confusion_matrix_train)

[[  432 22319  4331]
 [  160 41941 14447]
 [   50 23642 24125]]


In [714]:
## Get the confusion matrix for validation data.
confusion_matrix_validation = confusion_matrix(y_test, predict_validation)
print(confusion_matrix_validation)

[[  100  5508  1116]
 [   50 10371  3611]
 [    7  5954  6145]]


In [715]:
## Get the predictions on test data.
y_pred = model.predict(temp)

In [716]:
## Copy temp data into temp1.
temp1 = temp.copy()

In [717]:
## Do inverse transform on predictions to get their original values.
temp1['Interest_Rate'] = le_interest_rate.inverse_transform(y_pred)

In [718]:
## Reset the index value.
temp1.reset_index(inplace=True)

In [719]:
## Copy Loan_ID,Interest_Rate columns data from temp1 to to_submit_1.
to_submit_1 = temp1[['Loan_ID', 'Interest_Rate']]

In [720]:
## Check dimensions of to_submit_1.
to_submit_1.shape

(109541, 2)

In [721]:
## Check dimensions of test data.
test_data.shape

(109541, 28)

In [722]:
## Check value counts for Interest_Rate column of to_submit_1.
to_submit_1.Interest_Rate.value_counts()

2    73869
3    35628
1       44
Name: Interest_Rate, dtype: int64
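
The two `value_counts()` outputs in this notebook can be put side by side as shares of the test set. The sketch below does that and compares them with the class shares of the training split, which are taken from the row sums of the train confusion matrix under the assumption that encoded classes 0, 1, 2 map to categories 1, 2, 3.

```python
# Predicted category counts on the test set, as printed in this notebook.
first_model_counts = {2: 71330, 3: 36342, 1: 1869}   # XGBoost run
second_model_counts = {2: 73869, 3: 35628, 1: 44}    # run from In [709] on

# Assumption: true training counts, from the train confusion matrix rows.
train_counts = {1: 27082, 2: 56548, 3: 47817}


def shares(counts):
    """Return each category's fraction of the total, ordered by category."""
    total = sum(counts.values())
    return {cat: n / total for cat, n in sorted(counts.items())}


train_share = shares(train_counts)
first_share = shares(first_model_counts)
second_share = shares(second_model_counts)

for cat in (1, 2, 3):
    print(
        f"category {cat}: train={train_share[cat]:.3%} "
        f"first={first_share[cat]:.3%} second={second_share[cat]:.3%}"
    )
```

Category 1 makes up about a fifth of the training labels but under 2% of the first model's predictions and well under 0.1% of the second model's, so both submissions almost never assign the lowest rate category.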

In [723]:
## Store to_submit_1 into csv file with name RandomForest.
to_submit_1.to_csv('RandomForest.csv',index = False)