# Artivatic Data Labs Pvt. Ltd.

## Problem Statement


#### The Bank Indessa has not done well in the last 3 quarters. Its NPAs (Non-Performing Assets) have reached an all-time high. It is starting to lose the confidence of its investors. As a result, its stock has fallen by 20% in the previous quarter alone.

#### After careful analysis, it was found that the majority of the NPAs came from loan defaulters. With the messy data collected over the years, the bank has decided to use machine learning to identify these defaulters and devise a plan to reduce them.

#### This bank uses a pool of investors to sanction its loans. For example, if a customer applies for a loan of 20000, the investors perform due diligence on the application along with the bank. Keep this in mind while interpreting the data.

#### In this challenge, you will help this bank by predicting the probability that a member will default.

### Dataset

Download the dataset from the following link:

https://drive.google.com/file/d/1jIUQO0POfYslbO9ru_Z3Cb5nPaEnGbv-/view?usp=sharing

#### Importing the packages

In [5]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import cross_val_score


from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.metrics import auc
from sklearn.metrics import roc_auc_score,confusion_matrix,classification_report
from sklearn.model_selection import learning_curve,StratifiedKFold
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

In [6]:
# Importing the train and test datasets

dfTrain_set = pd.read_csv(r'D:\ML_Artivatic_dataset\train_indessa.csv')
dfTest_set = pd.read_csv(r'D:\ML_Artivatic_dataset\test_indessa.csv')

In [7]:
dfTrain_set.head(10)

Unnamed: 0,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,batch_enrolled,int_rate,grade,sub_grade,emp_title,...,collections_12_mths_ex_med,mths_since_last_major_derog,application_type,verification_status_joint,last_week_pay,acc_now_delinq,tot_coll_amt,tot_cur_bal,total_rev_hi_lim,loan_status
0,58189336,14350,14350,14350.0,36 months,,19.19,E,E3,clerk,...,0.0,74.0,INDIVIDUAL,,26th week,0.0,0.0,28699.0,30800.0,0
1,70011223,4800,4800,4800.0,36 months,BAT1586599,10.99,B,B4,Human Resources Specialist,...,0.0,,INDIVIDUAL,,9th week,0.0,0.0,9974.0,32900.0,0
2,70255675,10000,10000,10000.0,36 months,BAT1586599,7.26,A,A4,Driver,...,0.0,,INDIVIDUAL,,9th week,0.0,65.0,38295.0,34900.0,0
3,1893936,15000,15000,15000.0,36 months,BAT4808022,19.72,D,D5,Us office of Personnel Management,...,0.0,,INDIVIDUAL,,135th week,0.0,0.0,55564.0,24700.0,0
4,7652106,16000,16000,16000.0,36 months,BAT2833642,10.64,B,B2,LAUSD-HOLLYWOOD HIGH SCHOOL,...,0.0,,INDIVIDUAL,,96th week,0.0,0.0,47159.0,47033.0,0
5,10247268,15000,15000,14950.0,36 months,BAT2575549,8.9,A,A5,Design Consultant,...,0.0,,INDIVIDUAL,,113th week,0.0,0.0,350619.0,29500.0,0
6,8089625,5000,5000,4975.0,36 months,,7.9,A,A4,TOYOTA OF NORTH HOLLYWOOD,...,0.0,,INDIVIDUAL,,117th week,0.0,1023.0,13272.0,55500.0,1
7,23043116,6000,6000,6000.0,36 months,,9.17,B,B1,Banker,...,0.0,54.0,INDIVIDUAL,,78th week,0.0,0.0,272579.0,11800.0,0
8,45900933,6000,6000,6000.0,36 months,BAT4136152,13.99,C,C4,LVN,...,0.0,,INDIVIDUAL,,44th week,0.0,0.0,281521.0,62100.0,0
9,41272507,34550,34550,34550.0,60 months,BAT4694572,17.14,D,D4,Registered Nurse,...,0.0,,INDIVIDUAL,,52th week,0.0,0.0,76034.0,33200.0,0


In [8]:
dfTest_set.head(10)

Unnamed: 0,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,batch_enrolled,int_rate,grade,sub_grade,emp_title,...,collection_recovery_fee,collections_12_mths_ex_med,mths_since_last_major_derog,application_type,verification_status_joint,last_week_pay,acc_now_delinq,tot_coll_amt,tot_cur_bal,total_rev_hi_lim
0,11937648,14000,14000,14000.0,60 months,BAT4711174,16.24,C,C5,Data Analyst,...,0.0,0.0,,INDIVIDUAL,,104th week,0.0,0.0,85230.0,45700.0
1,38983318,16000,16000,16000.0,60 months,BAT4318899,9.49,B,B2,Senior Database Administrator,...,0.0,0.0,,INDIVIDUAL,,57th week,0.0,0.0,444991.0,21400.0
2,27999917,11050,11050,11050.0,60 months,BAT446479,15.61,D,D1,Customer service representative,...,0.0,0.0,26.0,INDIVIDUAL,,70th week,0.0,0.0,105737.0,16300.0
3,61514932,35000,35000,34700.0,60 months,BAT4664105,12.69,C,C2,ACCT OFFICER,...,0.0,0.0,,INDIVIDUAL,,22th week,0.0,0.0,287022.0,72400.0
4,59622821,6500,6500,6500.0,36 months,,6.89,A,A3,Paralegal,...,0.0,0.0,,INDIVIDUAL,,22th week,0.0,0.0,234278.0,26700.0
5,28822038,13475,13475,13475.0,60 months,,18.99,E,E1,Human Resource,...,0.0,0.0,,INDIVIDUAL,,70th week,0.0,131.0,29383.0,42700.0
6,10718089,5000,5000,5000.0,36 months,,7.62,A,A3,Software Engineer,...,0.0,0.0,,INDIVIDUAL,,74th week,0.0,0.0,38403.0,17000.0
7,58114582,10000,10000,10000.0,60 months,BAT5662637,22.99,F,F2,Deli Manager,...,0.0,0.0,,INDIVIDUAL,,26th week,0.0,246.0,7119.0,18200.0
8,35023176,30000,30000,30000.0,36 months,BAT6248271,9.17,B,B1,Registered Nurse,...,0.0,0.0,21.0,INDIVIDUAL,,26th week,0.0,0.0,85611.0,51100.0
9,1268247,7000,7000,7000.0,60 months,BAT4467682,15.96,C,C5,HSBC,...,0.0,0.0,,INDIVIDUAL,,213th week,0.0,,,


In [9]:
dfTrain_set.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 532428 entries, 0 to 532427
Data columns (total 45 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   member_id                    532428 non-null  int64  
 1   loan_amnt                    532428 non-null  int64  
 2   funded_amnt                  532428 non-null  int64  
 3   funded_amnt_inv              532428 non-null  float64
 4   term                         532428 non-null  object 
 5   batch_enrolled               447279 non-null  object 
 6   int_rate                     532428 non-null  float64
 7   grade                        532428 non-null  object 
 8   sub_grade                    532428 non-null  object 
 9   emp_title                    501595 non-null  object 
 10  emp_length                   505537 non-null  object 
 11  home_ownership               532428 non-null  object 
 12  annual_inc                   532425 non-null  float64
 13 

In [10]:
dfTest_set.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354951 entries, 0 to 354950
Data columns (total 44 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   member_id                    354951 non-null  int64  
 1   loan_amnt                    354951 non-null  int64  
 2   funded_amnt                  354951 non-null  int64  
 3   funded_amnt_inv              354951 non-null  float64
 4   term                         354951 non-null  object 
 5   batch_enrolled               309352 non-null  object 
 6   int_rate                     354951 non-null  float64
 7   grade                        354951 non-null  object 
 8   sub_grade                    354951 non-null  object 
 9   emp_title                    334322 non-null  object 
 10  emp_length                   337017 non-null  object 
 11  home_ownership               354951 non-null  object 
 12  annual_inc                   354950 non-null  float64
 13 

In [14]:
dfTrain_set.columns

Index(['member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'term',
       'batch_enrolled', 'int_rate', 'grade', 'sub_grade', 'emp_title',
       'emp_length', 'home_ownership', 'annual_inc', 'verification_status',
       'pymnt_plan', 'desc', 'purpose', 'title', 'zip_code', 'addr_state',
       'dti', 'delinq_2yrs', 'inq_last_6mths', 'mths_since_last_delinq',
       'mths_since_last_record', 'open_acc', 'pub_rec', 'revol_bal',
       'revol_util', 'total_acc', 'initial_list_status', 'total_rec_int',
       'total_rec_late_fee', 'recoveries', 'collection_recovery_fee',
       'collections_12_mths_ex_med', 'mths_since_last_major_derog',
       'application_type', 'verification_status_joint', 'last_week_pay',
       'acc_now_delinq', 'tot_coll_amt', 'tot_cur_bal', 'total_rev_hi_lim',
       'loan_status'],
      dtype='object')

In [11]:
dfTrain_set.describe(include = 'all')

Unnamed: 0,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,batch_enrolled,int_rate,grade,sub_grade,emp_title,...,collections_12_mths_ex_med,mths_since_last_major_derog,application_type,verification_status_joint,last_week_pay,acc_now_delinq,tot_coll_amt,tot_cur_bal,total_rev_hi_lim,loan_status
count,532428.0,532428.0,532428.0,532428.0,532428,447279.0,532428.0,532428,532428,501595,...,532333.0,132980.0,532428,305,532428,532412.0,490424.0,490424.0,490424.0,532428.0
unique,,,,,2,104.0,,7,35,190124,...,,,2,3,98,,,,,
top,,,,,36 months,,,B,B3,Teacher,...,,,INDIVIDUAL,Not Verified,13th week,,,,,
freq,,,,,372793,106079.0,,152713,33844,8280,...,,,532123,170,30333,,,,,
mean,35005470.0,14757.595722,14744.271291,14704.926696,,,13.242969,,,,...,0.014299,44.121462,,,,0.005015,213.562222,139554.1,32080.57,0.236327
std,24121480.0,8434.42008,8429.139277,8441.290381,,,4.379611,,,,...,0.133005,22.19841,,,,0.079117,1958.571538,153914.9,38053.04,0.424826
min,70473.0,500.0,500.0,0.0,,,5.32,,,,...,0.0,0.0,,,,0.0,0.0,0.0,0.0,0.0
25%,10866880.0,8000.0,8000.0,8000.0,,,9.99,,,,...,0.0,27.0,,,,0.0,0.0,29839.75,14000.0,0.0
50%,37095900.0,13000.0,13000.0,13000.0,,,12.99,,,,...,0.0,44.0,,,,0.0,0.0,80669.5,23700.0,0.0
75%,58489200.0,20000.0,20000.0,20000.0,,,16.2,,,,...,0.0,61.0,,,,0.0,0.0,208479.2,39800.0,0.0


In [12]:
dfTest_set.describe(include = 'all')

Unnamed: 0,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,batch_enrolled,int_rate,grade,sub_grade,emp_title,...,collection_recovery_fee,collections_12_mths_ex_med,mths_since_last_major_derog,application_type,verification_status_joint,last_week_pay,acc_now_delinq,tot_coll_amt,tot_cur_bal,total_rev_hi_lim
count,354951.0,354951.0,354951.0,354951.0,354951,309352.0,354951.0,354951,354951,334322,...,354951.0,354901.0,88723.0,354951,206,354951,354938.0,326679.0,326679.0,326679.0
unique,,,,,2,104.0,,7,35,135102,...,,,,2,3,93,,,,
top,,,,,36 months,,,B,B3,Teacher,...,,,,INDIVIDUAL,Not Verified,13th week,,,,
freq,,,,,248332,128008.0,,101822,22479,5527,...,,,,354745,113,19988,,,,
mean,34996350.0,14751.76792,14738.287116,14698.770903,,,13.252396,,,,...,4.913062,0.0145,44.079923,,,,0.004956,243.9283,139314.2,32050.68
std,24101200.0,8437.019324,8431.045701,8443.341658,,,4.38525,,,,...,63.128236,0.13595,22.152081,,,,0.075333,16130.22,153502.2,36649.69
min,70626.0,500.0,500.0,0.0,,,5.32,,,,...,0.0,0.0,0.0,,,,0.0,0.0,0.0,0.0
25%,10889410.0,8000.0,8000.0,8000.0,,,9.99,,,,...,0.0,0.0,27.0,,,,0.0,0.0,29873.5,13900.0
50%,37086500.0,13000.0,13000.0,13000.0,,,12.99,,,,...,0.0,0.0,44.0,,,,0.0,0.0,80369.0,23700.0
75%,58448920.0,20000.0,20000.0,20000.0,,,16.2,,,,...,0.0,0.0,61.0,,,,0.0,0.0,207800.5,39700.0


In [15]:
import pandas_profiling
from pandas_profiling import ProfileReport

In [16]:
profile = pandas_profiling.ProfileReport(dfTrain_set)
print(profile)
profile.to_file(output_file="D://Arun//profiling_before_preprocessing.html")








In [17]:
# Feature columns shared by the train and test sets
keep_cols = ['member_id', 'loan_amnt', 'funded_amnt', 'addr_state', 'funded_amnt_inv', 'sub_grade', 'term', 'emp_length', 'int_rate', 'annual_inc', 'dti', 'delinq_2yrs', 'inq_last_6mths', 'mths_since_last_delinq', 'mths_since_last_record', 'open_acc', 'pub_rec', 'revol_bal', 'revol_util', 'total_acc', 'total_rec_int', 'total_rec_late_fee', 'recoveries', 'collection_recovery_fee', 'collections_12_mths_ex_med', 'mths_since_last_major_derog', 'last_week_pay', 'acc_now_delinq', 'tot_coll_amt', 'tot_cur_bal', 'total_rev_hi_lim']
# The train set additionally keeps the target column loan_status
dfTrain_set = dfTrain_set[keep_cols + ['loan_status']].copy()
# The test set is selected from dfTest_set itself and has no loan_status column
dfTest_set = dfTest_set[keep_cols].copy()

In [18]:
## Data transformation


# Strip the ' months' suffix so that term becomes a numeric number of months
for df in (dfTrain_set, dfTest_set):
    df['term'] = pd.to_numeric(df['term'].replace(to_replace=' months', value='', regex=True), errors='coerce')

In [19]:
# Convert emp_length strings such as '10+ years', '< 1 year' and '3 years' to numbers
for df in (dfTrain_set, dfTest_set):
    emp = df['emp_length'].replace('n/a', '0')
    emp = emp.replace(to_replace=r'\+ years', value='', regex=True)
    emp = emp.replace(to_replace=' years', value='', regex=True)
    emp = emp.replace(to_replace='< 1 year', value='0', regex=True)
    emp = emp.replace(to_replace=' year', value='', regex=True)
    df['emp_length'] = pd.to_numeric(emp, errors='coerce')
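
The chain of replacements above can be checked on a handful of values. The following is a minimal sketch on a toy Series (the sample strings are assumptions modelled on the patterns handled above, not rows from the dataset). It maps the two special cases to "0" and then extracts the leading digits with a single regular expression.

```python
import pandas as pd

# Toy values mimicking the emp_length strings handled in the cell above (assumed samples)
emp_length = pd.Series(["10+ years", "< 1 year", "1 year", "3 years", "n/a", None])

# Map the special cases to "0" first, then pull out the first run of digits
cleaned = emp_length.replace({"n/a": "0", "< 1 year": "0"})
years = pd.to_numeric(cleaned.str.extract(r"(\d+)", expand=False), errors="coerce")

print(years.tolist())
```

Missing entries stay missing after this step, so they are still available for the median imputation that follows.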

In [22]:
# Convert last_week_pay strings such as '26th week' to the week number; 'NA' becomes missing
for df in (dfTrain_set, dfTest_set):
    weeks = df['last_week_pay'].replace(to_replace='th week', value='', regex=True)
    weeks = weeks.replace(to_replace='NA', value='', regex=True)
    df['last_week_pay'] = pd.to_numeric(weeks, errors='coerce')

In [25]:
# Encode sub_grade by replacing the grade letter with a digit (A -> 0 ... G -> 6),
# so that for example 'A4' becomes 4 and 'B4' becomes 14
grade_digits = {'A': '0', 'B': '1', 'C': '2', 'D': '3', 'E': '4', 'F': '5', 'G': '6'}
for df in (dfTrain_set, dfTest_set):
    df['sub_grade'] = pd.to_numeric(df['sub_grade'].replace(grade_digits, regex=True), errors='coerce')
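
To make the sub_grade encoding concrete, here is a small worked example in plain Python. `encode_like_notebook` mirrors the letter-to-digit replacement used above. `encode_dense` is a hypothetical alternative (not used in this notebook) that assigns consecutive ranks, assuming 5 levels per grade, which matches the 7 grades and 35 sub-grades reported by `describe`.

```python
GRADES = "ABCDEFG"

def encode_like_notebook(sub_grade: str) -> int:
    """Replace the grade letter by its index digit and read the result as an integer."""
    return int(str(GRADES.index(sub_grade[0])) + sub_grade[1:])

def encode_dense(sub_grade: str) -> int:
    """Hypothetical alternative: consecutive ranks 0..34, assuming 5 levels per grade."""
    return GRADES.index(sub_grade[0]) * 5 + int(sub_grade[1:]) - 1

for sg in ["A1", "A5", "B1", "G5"]:
    print(sg, encode_like_notebook(sg), encode_dense(sg))
```

The notebook's encoding leaves a gap between grades (A5 is 5, B1 is 11), while the dense variant steps by one. Tree-based models only depend on the ordering, so either choice should behave the same with XGBoost.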

In [26]:
'''
Missing values Suggestions
'''



cols = ['term', 'loan_amnt', 'funded_amnt', 'last_week_pay', 'int_rate', 'sub_grade', 'annual_inc', 'dti', 'mths_since_last_delinq', 'mths_since_last_record', 'open_acc', 'revol_bal', 'revol_util', 'total_acc', 'total_rec_int', 'mths_since_last_major_derog', 'tot_coll_amt', 'tot_cur_bal', 'total_rev_hi_lim', 'emp_length']
for col in cols:
    print('Suggestions with Median: %s' % (col))
    dfTrain_set[col].fillna(dfTrain_set[col].median(), inplace=True)
    dfTest_set[col].fillna(dfTest_set[col].median(), inplace=True)

cols = ['acc_now_delinq', 'total_rec_late_fee', 'recoveries', 'collection_recovery_fee', 'collections_12_mths_ex_med']
for col in cols:
    print('Suggestions with Zero: %s' % (col))
    dfTrain_set[col].fillna(0, inplace=True)
    dfTest_set[col].fillna(0, inplace=True)

print('Missing value Suggestions done.')

Suggestions with Median: term
Suggestions with Median: loan_amnt
Suggestions with Median: funded_amnt
Suggestions with Median: last_week_pay
Suggestions with Median: int_rate
Suggestions with Median: sub_grade
Suggestions with Median: annual_inc
Suggestions with Median: dti
Suggestions with Median: mths_since_last_delinq
Suggestions with Median: mths_since_last_record
Suggestions with Median: open_acc
Suggestions with Median: revol_bal
Suggestions with Median: revol_util
Suggestions with Median: total_acc
Suggestions with Median: total_rec_int
Suggestions with Median: mths_since_last_major_derog
Suggestions with Median: tot_coll_amt
Suggestions with Median: tot_cur_bal
Suggestions with Median: total_rev_hi_lim
Suggestions with Median: emp_length
Suggestions with Zero: acc_now_delinq
Suggestions with Zero: total_rec_late_fee
Suggestions with Zero: recoveries
Suggestions with Zero: collection_recovery_fee
Suggestions with Zero: collections_12_mths_ex_med
Missing value Suggestions done.
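
The cell above fills each set with its own median. A common alternative, sketched here on toy frames, is to learn the medians on the train set only and reuse them for the test set, so that both sets are filled with the same values. The column names come from the notebook; the numbers are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frames (values are illustrative assumptions)
train = pd.DataFrame({"annual_inc": [40000.0, np.nan, 60000.0, 80000.0],
                      "recoveries": [0.0, 12.5, np.nan, 0.0]})
test = pd.DataFrame({"annual_inc": [np.nan, 1000000.0],
                     "recoveries": [np.nan, 3.0]})

median_cols = ["annual_inc"]   # filled with the train median
zero_cols = ["recoveries"]     # filled with zero

# Statistics are computed on the train set only and reused for the test set
train_medians = train[median_cols].median()
zero_fill = {col: 0 for col in zero_cols}

train = train.fillna(train_medians).fillna(zero_fill)
test = test.fillna(train_medians).fillna(zero_fill)

print(train)
print(test)
```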


In [27]:
""" Columns home_ownership and purpose columns attributes were earlier included and were considered categorical 
but were removed because their feature importance was low.
This section is kept for completeness of the preprocessing steps. To dummify/categorize variables,
keep them in the column selection above and update the list of attributes cat_attr.

"""
cat_attr = ['home_ownership', 'purpose']
for cat in cat_attr:
    # Skip attributes that were dropped during column selection
    if cat not in dfTrain_set.columns:
        continue
    print('Categorizing: %s...' % (cat))
    # get_dummies returns a whole DataFrame, so replace the frame rather than a single column
    dfTrain_set = pd.get_dummies(dfTrain_set, columns=[cat])
    dfTest_set = pd.get_dummies(dfTest_set, columns=[cat])
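
`pd.get_dummies` returns a whole DataFrame with one indicator column per category, so its result replaces the frame rather than a single column. A minimal sketch on toy data follows. The category values are assumptions for illustration. The `reindex` step is one way to make sure the test set ends up with exactly the same columns as the train set when a category is missing from one of them.

```python
import pandas as pd

# Toy frames; the category values are illustrative assumptions
train = pd.DataFrame({"home_ownership": ["RENT", "OWN", "MORTGAGE"],
                      "loan_amnt": [5000, 12000, 8000]})
test = pd.DataFrame({"home_ownership": ["RENT", "RENT"],
                     "loan_amnt": [7000, 3000]})

# One indicator column per category, all other columns are kept
train_enc = pd.get_dummies(train, columns=["home_ownership"])
test_enc = pd.get_dummies(test, columns=["home_ownership"])

# Align the test columns to the train columns; categories unseen in test become all-zero columns
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(train_enc.columns))
print(test_enc)
```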

In [53]:
profile = pandas_profiling.ProfileReport(dfTrain_set)
print(profile)
profile.to_file(output_file="D://Arun//profiling_after_preprocessing.html")








In [29]:

'''
Feature Engineering
'''





# Separating the member_id column of test dataframe to help create a csv after predictions
test_member_id = pd.DataFrame(dfTest_set['member_id'])


# Creating target variable pandas series from train dataframe, this will be used by cross validation to calculate
# the accuracy of the model
train_target = pd.DataFrame(dfTrain_set['loan_status'])

In [30]:
# Work on copies of the train and test dataframes so that different feature sets can be
# tried while tuning the performance of the classifier
selected_cols = ['member_id', 'emp_length', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'sub_grade', 'int_rate', 'annual_inc', 'dti', 'mths_since_last_delinq', 'mths_since_last_record', 'open_acc', 'revol_bal', 'revol_util', 'total_acc', 'total_rec_int', 'total_rec_late_fee', 'mths_since_last_major_derog', 'last_week_pay', 'tot_cur_bal', 'total_rev_hi_lim', 'tot_coll_amt', 'recoveries', 'collection_recovery_fee', 'term', 'acc_now_delinq', 'collections_12_mths_ex_med']
finalTrain = dfTrain_set[selected_cols].copy()
finalTest = dfTest_set[selected_cols].copy()

# How large the borrower's earnings are relative to the loan: ratio of annual income to the
# amount funded by investors
finalTrain['loan_to_income'] = finalTrain['annual_inc']/finalTrain['funded_amnt_inv']
finalTest['loan_to_income'] = finalTest['annual_inc']/finalTest['funded_amnt_inv']
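
The `describe` output earlier shows a minimum of 0 for `funded_amnt_inv`, so the ratio above can become infinite for some rows. A minimal sketch of one way to guard against this, on made-up values, is to treat a zero denominator as missing before dividing. This is an optional variation, not what the notebook does.

```python
import numpy as np
import pandas as pd

# Toy rows (illustrative values); the second loan has no investor funding
df = pd.DataFrame({"annual_inc": [60000.0, 45000.0],
                   "funded_amnt_inv": [15000.0, 0.0]})

# A zero denominator is replaced by NaN, so the ratio is NaN instead of inf
df["loan_to_income"] = df["annual_inc"] / df["funded_amnt_inv"].replace(0, np.nan)

print(df)
```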

In [31]:
# All these attributes indicate that the repayment did not go smoothly. The amounts are expressed as
# ratios to the funded amount (for example, recoveries relative to the loan amount), so this column
# gives a magnitude of how far the repayment has gone off course.
finalTrain['bad_state'] = finalTrain['acc_now_delinq'] + (finalTrain['total_rec_late_fee']/finalTrain['funded_amnt_inv']) + (finalTrain['recoveries']/finalTrain['funded_amnt_inv']) + (finalTrain['collection_recovery_fee']/finalTrain['funded_amnt_inv']) + (finalTrain['collections_12_mths_ex_med']/finalTrain['funded_amnt_inv'])
finalTest['bad_state'] = finalTest['acc_now_delinq'] + (finalTest['total_rec_late_fee']/finalTest['funded_amnt_inv']) + (finalTest['recoveries']/finalTest['funded_amnt_inv']) + (finalTest['collection_recovery_fee']/finalTest['funded_amnt_inv']) + (finalTest['collections_12_mths_ex_med']/finalTest['funded_amnt_inv'])

In [32]:
# For this model, only a boolean flag is kept to mark whether things had gone bad; in this case
# there was no visible benefit from keeping the magnitudes computed above
finalTrain.loc[finalTrain['bad_state'] > 0, 'bad_state'] = 1
finalTest.loc[finalTest['bad_state'] > 0, 'bad_state'] = 1
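
Since only the 0/1 flag is kept, it can also be derived without any division. The sketch below assumes all five indicators are non-negative, in which case "the sum of the ratios is positive" is the same as "at least one indicator is positive". It also avoids missing values for rows where `funded_amnt_inv` is 0. The toy values are made up.

```python
import pandas as pd

# Toy rows (illustrative values); the second loan has funded_amnt_inv == 0
loans = pd.DataFrame({
    "funded_amnt_inv": [10000.0, 0.0, 5000.0],
    "acc_now_delinq": [0.0, 0.0, 1.0],
    "total_rec_late_fee": [15.0, 0.0, 0.0],
    "recoveries": [0.0, 0.0, 250.0],
    "collection_recovery_fee": [0.0, 0.0, 0.0],
    "collections_12_mths_ex_med": [0.0, 0.0, 0.0],
})

trouble_cols = ["acc_now_delinq", "total_rec_late_fee", "recoveries",
                "collection_recovery_fee", "collections_12_mths_ex_med"]

# 1 if any trouble indicator is positive, else 0
loans["bad_state"] = (loans[trouble_cols] > 0).any(axis=1).astype(int)

print(loans["bad_state"].tolist())
```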

In [33]:

# Total number of available/unused 'credit lines'
finalTrain['avl_lines'] = finalTrain['total_acc'] - finalTrain['open_acc']
finalTest['avl_lines'] = finalTest['total_acc'] - finalTest['open_acc']

In [34]:

# Interest paid so far
finalTrain['int_paid'] = finalTrain['total_rec_int'] + finalTrain['total_rec_late_fee']
finalTest['int_paid'] = finalTest['total_rec_int'] + finalTest['total_rec_late_fee']

In [35]:
# Calculating EMIs paid (in terms of percent)
finalTrain['emi_paid_progress_perc'] = ((finalTrain['last_week_pay']/(finalTrain['term']/12*52+1))*100)
finalTest['emi_paid_progress_perc'] = ((finalTest['last_week_pay']/(finalTest['term']/12*52+1))*100)
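
As a worked example of the formula above: a 36-month term corresponds to 36/12*52 + 1 = 157 weeks, so a loan whose last payment was in week 26 is about 16.6% of the way through. The helper name below is hypothetical and only restates the arithmetic.

```python
def emi_paid_progress_perc(last_week_pay: float, term_months: float) -> float:
    """Share of the loan term (in weeks) covered by the last payment, in percent."""
    total_weeks = term_months / 12 * 52 + 1
    return last_week_pay / total_weeks * 100

print(round(emi_paid_progress_perc(26, 36), 2))   # week 26 of a 36-month loan
print(round(emi_paid_progress_perc(52, 60), 2))   # week 52 of a 60-month loan
```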

In [36]:
# Calculating total repayments received so far, in terms of EMI or recoveries after charge off
finalTrain['total_repayment_progress'] = ((finalTrain['last_week_pay']/(finalTrain['term']/12*52+1))*100) + ((finalTrain['recoveries']/finalTrain['funded_amnt_inv']) * 100)
finalTest['total_repayment_progress'] = ((finalTest['last_week_pay']/(finalTest['term']/12*52+1))*100) + ((finalTest['recoveries']/finalTest['funded_amnt_inv']) * 100)

In [37]:

'''
Split data set into train-test-cv
Train model & predict
'''
# Split train and cross validation sets
X_train, X_test, y_train, y_test = train_test_split(np.array(finalTrain), np.array(train_target), test_size=0.30)
eval_set=[(X_test, y_test)]
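
The split above is random and changes on every run. A possible variation, sketched here on synthetic data, passes `stratify` so that the default rate is preserved in both parts, and `random_state` so that the split is reproducible. The 24% positive rate mirrors the mean of `loan_status` reported by `describe`; the features and the seed values are arbitrary assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the training data (assumption: roughly 24% defaulters)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = (rng.random(1000) < 0.24).astype(int)

# stratify keeps the class ratio equal in both parts; random_state fixes the shuffle
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)

print(X_tr.shape, X_val.shape)
print(round(y_tr.mean(), 3), round(y_val.mean(), 3))
```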

In [40]:
import xgboost



In [41]:

print('Initializing xgboost.sklearn.XGBClassifier and starting training...')

st = datetime.now()

clf = xgboost.sklearn.XGBClassifier(
    objective="binary:logistic", 
    learning_rate=0.05, 
    seed=9616, 
    max_depth=20, 
    gamma=10, 
    n_estimators=500)

clf.fit(X_train, y_train, early_stopping_rounds=20, eval_metric="auc", eval_set=eval_set, verbose=True)

print(datetime.now()-st)

Initializing xgboost.sklearn.XGBClassifier and starting training...
[0]	validation_0-auc:0.96831
Will train until validation_0-auc hasn't improved in 20 rounds.
[1]	validation_0-auc:0.96989
[2]	validation_0-auc:0.97080
[3]	validation_0-auc:0.97101
[4]	validation_0-auc:0.97116
[5]	validation_0-auc:0.97118
[6]	validation_0-auc:0.97169
[7]	validation_0-auc:0.97198
[8]	validation_0-auc:0.97210
[9]	validation_0-auc:0.97234
[10]	validation_0-auc:0.97235
[11]	validation_0-auc:0.97243
[12]	validation_0-auc:0.97257
[13]	validation_0-auc:0.97272
[14]	validation_0-auc:0.97284
[15]	validation_0-auc:0.97297
[16]	validation_0-auc:0.97302
[17]	validation_0-auc:0.97309
[18]	validation_0-auc:0.97319
[19]	validation_0-auc:0.97330
[20]	validation_0-auc:0.97333
[21]	validation_0-auc:0.97338
[22]	validation_0-auc:0.97346
[23]	validation_0-auc:0.97350
[24]	validation_0-auc:0.97356
[25]	validation_0-auc:0.97359
[26]	validation_0-auc:0.97365
[27]	validation_0-auc:0.97372
[28]	validation_0-auc:0.97380
[29]	val

In [49]:
y_pred = clf.predict(X_test)
submission_file_name = "D://Arun//Submission_"

In [50]:
accuracy = accuracy_score(np.array(y_test).flatten(), y_pred)
print("Accuracy: %.10f%%" % (accuracy * 100.0))
submission_file_name = submission_file_name + ("_Accuracy_%.6f" % (accuracy * 100)) + '_'


Accuracy: 94.1500917178%


In [51]:
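# Note: roc_auc_score receives hard class labels (y_pred) here, not probabilities, so this value
# is not directly comparable to the validation AUC logged during training
accuracy_per_roc_auc = roc_auc_score(np.array(y_test).flatten(), y_pred)
print("ROC-AUC: %.10f%%" % (accuracy_per_roc_auc * 100))
submission_file_name = submission_file_name + ("_ROC-AUC_%.6f" % (accuracy_per_roc_auc * 100))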

ROC-AUC: 92.5323867764%
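
ROC-AUC measures how well the scores rank defaulters above non-defaulters, so it is usually computed from predicted probabilities (for a fitted classifier, the second column of `predict_proba`). Thresholding the scores into 0/1 labels first discards most of the ranking and can change the result. The following sketch shows the difference on eight made-up predictions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up ground truth and predicted default probabilities
y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])
proba = np.array([0.05, 0.20, 0.45, 0.55, 0.90, 0.60, 0.40, 0.10])

# Hard labels at the usual 0.5 threshold, as returned by predict()
labels = (proba >= 0.5).astype(int)

auc_from_proba = roc_auc_score(y_true, proba)
auc_from_labels = roc_auc_score(y_true, labels)

print("AUC from probabilities: %.4f" % auc_from_proba)
print("AUC from hard labels:   %.4f" % auc_from_labels)
```

In this toy case, 12 of the 15 defaulter/non-defaulter pairs are ranked correctly by the probabilities, giving 0.8, while the thresholded labels create ties and give a lower value.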


In [52]:
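# Predict class probabilities for the test set; column 1 holds the probability of default
final_pred = pd.DataFrame(clf.predict_proba(np.array(finalTest)))
# Put member_id next to the predicted default probability and write the submission file
dfSub = pd.concat([test_member_id, final_pred.loc[:, 1:2]], axis=1)
dfSub.rename(columns={1:'loan_status'}, inplace=True)
dfSub.to_csv((('%s.csv') % (submission_file_name)), index=False)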
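
The layout of the submission can be illustrated without a trained model. In this sketch the member ids and the probability matrix are made-up stand-ins for `test_member_id` and the output of `clf.predict_proba`. Only the second column (probability of class 1, i.e. default) is kept.

```python
import numpy as np
import pandas as pd

# Stand-ins for test_member_id and clf.predict_proba(...) (illustrative values)
member_ids = pd.Series([101, 102, 103], name="member_id")
proba = np.array([[0.90, 0.10],
                  [0.30, 0.70],
                  [0.55, 0.45]])

# Column 1 of predict_proba is the probability of the positive class (default)
submission = pd.DataFrame({"member_id": member_ids.to_numpy(),
                           "loan_status": proba[:, 1]})

# Rendered as CSV text here; passing a file path to to_csv would write it to disk instead
csv_text = submission.to_csv(index=False)
print(csv_text)
```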
