# Applying Classification Modeling
The goal of this week's assessment is to find the model which best predicts whether or not a person will default on their bank loan. In doing so, we want to utilize all of the different tools we have learned over the course: data cleaning, EDA, feature engineering/transformation, feature selection, hyperparameter tuning, and model evaluation. 


#### Data Set Information:

This research examines customers' default payments in Taiwan and compares the predictive accuracy of the probability of default among six data mining methods. From a risk management perspective, an accurate estimate of the probability of default is more valuable than a binary classification of clients as credible or not credible. Because the real probability of default is unknown, the study introduced the Sorting Smoothing Method to estimate it. With the real probability of default as the response variable (Y) and the predicted probability of default as the independent variable (X), the simple linear regression (Y = A + BX) shows that the forecasting model produced by the artificial neural network has the highest coefficient of determination; its regression intercept (A) is close to zero and its regression coefficient (B) is close to one. The study therefore concludes that, among the six data mining techniques, the artificial neural network is the only one that can accurately estimate the real probability of default.

- NT is the abbreviation for New Taiwan.


#### Attribute Information:

This research employed a binary variable, default payment (Yes = 1, No = 0), as the response variable. This study reviewed the literature and used the following 23 variables as explanatory variables: 
- X1: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit. 
- X2: Gender (1 = male; 2 = female). 
- X3: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others). 
- X4: Marital status (1 = married; 2 = single; 3 = others). 
- X5: Age (year). 
- X6 - X11: History of past payment. We tracked the past monthly payment records (from April to September, 2005) as follows: 
    - X6 = the repayment status in September, 2005; 
    - X7 = the repayment status in August, 2005; . . .;
    - etc...
    - X11 = the repayment status in April, 2005. 
    - The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above. 
- X12-X17: Amount of bill statement (NT dollar). 
    - X12 = amount of bill statement in September, 2005;
    - X13 = amount of bill statement in August, 2005; . . .;
    - etc...
    - X17 = amount of bill statement in April, 2005.
- X18-X23: Amount of previous payment (NT dollar). 
    - X18 = amount paid in September, 2005; 
    - X19 = amount paid in August, 2005; . . .;
    - etc...
    - X23 = amount paid in April, 2005. 
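
The raw file labels the explanatory variables X1 to X23, while the header row found later in the data uses descriptive names such as `LIMIT_BAL` and `PAY_0`. As a convenience, the correspondence can be written out as a dictionary. This is only a small sketch, and the name `COLUMN_NAMES` is an illustrative choice rather than part of the assignment.

```python
# Descriptive names in the same order as X1..X23 (taken from the dataset's own header row).
descriptive = (
    ["LIMIT_BAL", "SEX", "EDUCATION", "MARRIAGE", "AGE"]
    + ["PAY_0", "PAY_2", "PAY_3", "PAY_4", "PAY_5", "PAY_6"]
    + [f"BILL_AMT{i}" for i in range(1, 7)]
    + [f"PAY_AMT{i}" for i in range(1, 7)]
)

# COLUMN_NAMES is a hypothetical helper name: {'X1': 'LIMIT_BAL', ..., 'X23': 'PAY_AMT6'}
COLUMN_NAMES = {f"X{i}": name for i, name in enumerate(descriptive, start=1)}

print(COLUMN_NAMES["X6"], COLUMN_NAMES["X12"], COLUMN_NAMES["X23"])
```

Note that the repayment-status columns skip from `PAY_0` to `PAY_2`; that naming comes from the source data.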




You will fit three different models (KNN, Logistic Regression, and Decision Tree Classifier) to predict credit card defaults and use grid search to find the best hyperparameters for each. Then you will compare the performance of the three models on a test set to find the best one.


## Process/Expectations

- You will be working in pairs for this assessment

### Please have ONE notebook and be prepared to explain how you worked in your pair.

1. Clean up your data set so that you can perform an EDA. 
    - This includes handling null values, categorical variables, removing unimportant columns, and removing outliers.
2. Perform EDA to identify opportunities to create new features.
    - [Great Example of EDA for classification](https://www.kaggle.com/stephaniestallworth/titanic-eda-classification-end-to-end) 
    - [Using Pairplots with Classification](https://towardsdatascience.com/visualizing-data-with-pair-plots-in-python-f228cf529166)
3. Engineer new features. 
    - Create polynomial and/or interaction features. 
    - Additionally, you must create **at least 2 new features** that are not interactions or polynomial transformations. 
        - *For example, you can create a new dummy variable based on the value of a continuous variable (billamount6 > 2000) or take the average of some past amounts.*
4. Perform some feature selection. 
    
5. You must fit **three** models to your data and tune **at least 1 hyperparameter** per model. 
6. Using the F-1 Score, evaluate how well your models perform and identify your best model.
7. Using information from your EDA process and your model(s) output, provide insight as to which borrowers are more likely to default.
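
Since step 6 scores models by F1, it may help to see how the metric is built from confusion-matrix counts. The counts below are made up purely for illustration; they are not results from this dataset.

```python
# Hypothetical confusion-matrix counts (illustrative only, not from this dataset).
tp, fp, fn = 40, 20, 60

precision = tp / (tp + fp)   # share of predicted defaults that were real defaults
recall = tp / (tp + fn)      # share of real defaults that were caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Because F1 ignores true negatives, it is less flattered by the large "no default" class than plain accuracy is.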


In [96]:
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt

from sklearn.feature_selection import RFE, RFECV
from sklearn.linear_model import Lasso, LinearRegression, LogisticRegression, Ridge
from sklearn.metrics import auc, confusion_matrix, f1_score, roc_curve
from sklearn.model_selection import (GridSearchCV, RandomizedSearchCV,
                                     cross_val_score, train_test_split)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures
from sklearn.tree import DecisionTreeClassifier

# On matplotlib >= 3.6 this style is named 'seaborn-v0_8'
plt.style.use('seaborn')

## 1. Data Cleaning

In [97]:
df = pd.read_csv('training_data.csv' , index_col=0)

In [98]:
df.Y.value_counts()

0                             17471
1                              5028
default payment next month        1
Name: Y, dtype: int64
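
The single `default payment next month` value suggests that a row of descriptive column names was read in as data, which would also explain why the columns load as strings. One way to handle this without sorting is to locate that row by its contents, promote it to the column names, and drop it. The toy frame below is a stand-in for the real file, so treat this as a sketch under that assumption.

```python
import pandas as pd

# Toy frame imitating the issue: a row of descriptive names sits among the data rows.
raw = pd.DataFrame(
    {"X1": ["20000", "LIMIT_BAL", "90000"],
     "Y": ["1", "default payment next month", "0"]},
    index=["1", "ID", "2"],
)

header_mask = raw["X1"] == "LIMIT_BAL"               # locate the stray header row
new_names = raw.loc[header_mask].iloc[0].to_dict()   # maps 'X1' -> 'LIMIT_BAL', etc.

# Drop the stray row, rename the columns, and cast the remaining strings to integers.
clean = raw.loc[~header_mask].rename(columns=new_names).astype(int)
print(clean)
```

Unlike a sort-based approach, this keeps the remaining rows in their original order.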

In [99]:
# Split data to be used in the models
# Create matrix of features
X = df.drop('Y', axis = 1) # every column except the target 'Y'


# Create target variable
y = df['Y'] # y is the column we're trying to predict

In [100]:
X.head()

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,...,X14,X15,X16,X17,X18,X19,X20,X21,X22,X23
28835,220000,2,1,2,36,0,0,0,0,0,...,217900,221193,181859,184605,10000,8018,10121,6006,10987,143779
25329,200000,2,3,2,29,-1,-1,-1,-1,-1,...,326,326,326,326,326,326,326,326,326,326
18894,180000,2,1,2,27,-2,-2,-2,-2,-2,...,0,0,0,0,0,0,0,0,0,0
690,80000,1,2,2,32,0,0,0,0,0,...,47593,43882,42256,42527,1853,1700,1522,1548,1488,1500
6239,10000,1,2,2,27,0,0,0,0,0,...,4878,5444,2639,2697,2000,1100,600,300,300,1000


In [101]:
df['X3'].value_counts()

2            10516
1             7919
3             3713
5              208
4               90
6               42
0               11
EDUCATION        1
Name: X3, dtype: int64

## 2. EDA

In [102]:
df.head()

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,...,X15,X16,X17,X18,X19,X20,X21,X22,X23,Y
28835,220000,2,1,2,36,0,0,0,0,0,...,221193,181859,184605,10000,8018,10121,6006,10987,143779,1
25329,200000,2,3,2,29,-1,-1,-1,-1,-1,...,326,326,326,326,326,326,326,326,326,0
18894,180000,2,1,2,27,-2,-2,-2,-2,-2,...,0,0,0,0,0,0,0,0,0,0
690,80000,1,2,2,32,0,0,0,0,0,...,43882,42256,42527,1853,1700,1522,1548,1488,1500,0
6239,10000,1,2,2,27,0,0,0,0,0,...,5444,2639,2697,2000,1100,600,300,300,1000,1


In [103]:
df.dtypes

X1     object
X2     object
X3     object
X4     object
X5     object
X6     object
X7     object
X8     object
X9     object
X10    object
X11    object
X12    object
X13    object
X14    object
X15    object
X16    object
X17    object
X18    object
X19    object
X20    object
X21    object
X22    object
X23    object
Y      object
dtype: object

In [104]:
X['X1']

28835    220000
25329    200000
18894    180000
690       80000
6239      10000
          ...  
16247     40000
2693     350000
8076     100000
20213     20000
7624      20000
Name: X1, Length: 22500, dtype: object

In [105]:
df.loc[df['X1'] == 'LIMIT_BAL']

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,...,X15,X16,X17,X18,X19,X20,X21,X22,X23,Y
ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default payment next month


In [106]:
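# The real header row ('LIMIT_BAL', ...) was read in as data. Sorting X1 as strings in
# descending order moves it to the top, where it becomes the column names and is dropped.
df.sort_values(by='X1', ascending = False, inplace = True)
df = df.rename(columns=df.iloc[0]).drop(df.index[0])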

In [107]:
df.columns

Index(['LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0', 'PAY_2',
       'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2',
       'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1',
       'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6',
       'default payment next month'],
      dtype='object')

In [13]:
y = df['default payment next month']

In [14]:
y

23626    0
9489     1
27783    0
1940     0
18657    0
        ..
26650    0
5844     0
8549     0
26260    0
10116    1
Name: default payment next month, Length: 22499, dtype: object

In [15]:
x = df.drop(columns='default payment next month')

In [16]:
x

Unnamed: 0,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,...,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6
23626,90000,2,3,1,36,0,0,-1,0,0,...,18318,18982,19371,19608,1580,19789,1295,1000,849,1000
9489,90000,1,3,2,25,0,0,0,0,-2,...,7884,0,0,0,1172,1061,0,0,0,193
27783,90000,2,1,2,25,0,0,0,0,0,...,8848,10135,11731,8138,1500,1500,1500,2000,1500,1000
1940,90000,1,1,1,30,0,0,0,0,0,...,46496,40244,39903,8629,12000,5000,3000,10000,1000,1838
18657,90000,2,2,2,26,1,2,2,2,2,...,90311,91431,92840,91205,4500,4100,3500,3800,0,3500
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26650,10000,1,2,2,24,1,2,2,-2,-2,...,0,0,0,0,1000,0,0,0,0,0
5844,10000,1,2,2,37,-1,-1,-1,0,-1,...,780,390,780,390,1475,780,0,31250,0,0
8549,10000,1,2,2,28,-1,2,-1,-1,2,...,370,1766,1496,8356,0,370,1496,0,7000,0
26260,10000,2,3,1,33,1,-2,-2,-2,-2,...,0,0,0,0,0,0,0,0,0,0


In [17]:
x.columns

Index(['LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0', 'PAY_2',
       'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2',
       'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1',
       'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6'],
      dtype='object')

## 3. Feature Engineering

In [18]:
x['SEX'].value_counts()

2    13572
1     8927
Name: SEX, dtype: int64

In [19]:
# Preview only: the result is not assigned, so x keeps its original SEX column
pd.get_dummies(x,columns=['SEX'],drop_first=True)

Unnamed: 0,LIMIT_BAL,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,PAY_6,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,SEX_2
23626,90000,3,1,36,0,0,-1,0,0,0,...,18982,19371,19608,1580,19789,1295,1000,849,1000,1
9489,90000,3,2,25,0,0,0,0,-2,-2,...,0,0,0,1172,1061,0,0,0,193,0
27783,90000,1,2,25,0,0,0,0,0,0,...,10135,11731,8138,1500,1500,1500,2000,1500,1000,1
1940,90000,1,1,30,0,0,0,0,0,0,...,40244,39903,8629,12000,5000,3000,10000,1000,1838,0
18657,90000,2,2,26,1,2,2,2,2,2,...,91431,92840,91205,4500,4100,3500,3800,0,3500,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26650,10000,2,2,24,1,2,2,-2,-2,-2,...,0,0,0,1000,0,0,0,0,0,0
5844,10000,2,2,37,-1,-1,-1,0,-1,0,...,390,780,390,1475,780,0,31250,0,0,0
8549,10000,2,2,28,-1,2,-1,-1,2,2,...,1766,1496,8356,0,370,1496,0,7000,0,0
26260,10000,3,1,33,1,-2,-2,-2,-2,-2,...,0,0,0,0,0,0,0,0,0,1


In [20]:
x['EDUCATION'].value_counts()

2    10516
1     7919
3     3713
5      208
4       90
6       42
0       11
Name: EDUCATION, dtype: int64

In [21]:
x['MARRIAGE'].value_counts()

2    12026
1    10195
3      234
0       44
Name: MARRIAGE, dtype: int64

In [22]:
x['PAY_0'].value_counts()

0     11057
-1     4272
1      2750
-2     2048
2      2032
3       239
4        51
5        20
8        15
6         9
7         6
Name: PAY_0, dtype: int64

In [23]:
x['PAY_2'].value_counts()

0     11804
-1     4526
2      2967
-2     2813
3       251
4        70
1        24
5        19
7        16
6         8
8         1
Name: PAY_2, dtype: int64

In [24]:
x['PAY_3'].value_counts()

0     11823
-1     4464
-2     3024
2      2891
3       177
4        58
7        22
6        19
5        15
1         4
8         2
Name: PAY_3, dtype: int64

In [25]:
x['PAY_4'].value_counts()

0     12330
-1     4281
-2     3227
2      2390
3       138
4        49
7        47
5        28
6         5
1         2
8         2
Name: PAY_4, dtype: int64

In [26]:
x['PAY_5'].value_counts()

0     12706
-1     4124
-2     3401
2      2014
3       128
4        59
7        47
5        16
6         3
8         1
Name: PAY_5, dtype: int64

In [27]:
x['PAY_6'].value_counts()

0     12233
-1     4284
-2     3663
2      2078
3       140
7        38
4        38
6        14
5         9
8         2
Name: PAY_6, dtype: int64

In [28]:
# Preview only: the cast is not assigned back, so x itself is unchanged
x.astype(object).astype(int)

Unnamed: 0,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,...,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6
23626,90000,2,3,1,36,0,0,-1,0,0,...,18318,18982,19371,19608,1580,19789,1295,1000,849,1000
9489,90000,1,3,2,25,0,0,0,0,-2,...,7884,0,0,0,1172,1061,0,0,0,193
27783,90000,2,1,2,25,0,0,0,0,0,...,8848,10135,11731,8138,1500,1500,1500,2000,1500,1000
1940,90000,1,1,1,30,0,0,0,0,0,...,46496,40244,39903,8629,12000,5000,3000,10000,1000,1838
18657,90000,2,2,2,26,1,2,2,2,2,...,90311,91431,92840,91205,4500,4100,3500,3800,0,3500
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26650,10000,1,2,2,24,1,2,2,-2,-2,...,0,0,0,0,1000,0,0,0,0,0
5844,10000,1,2,2,37,-1,-1,-1,0,-1,...,780,390,780,390,1475,780,0,31250,0,0
8549,10000,1,2,2,28,-1,2,-1,-1,2,...,370,1766,1496,8356,0,370,1496,0,7000,0
26260,10000,2,3,1,33,1,-2,-2,-2,-2,...,0,0,0,0,0,0,0,0,0,0


In [29]:
# Cast the six payment and six bill amount columns from strings to integers
for i in range(1, 7):
    x[f'PAY_AMT{i}'] = x[f'PAY_AMT{i}'].astype(int)
    x[f'BILL_AMT{i}'] = x[f'BILL_AMT{i}'].astype(int)

In [30]:
# Did they pay their bill amount? (0 = balance remains, 1 = paid off)
# NOTE: every flag subtracts PAY_AMT1; pairing each BILL_AMT{i} with its own
# payment column may have been the intent.
for i in range(1, 7):
    x[f'paid_off_{i}'] = (x[f'BILL_AMT{i}'] - x['PAY_AMT1']).apply(lambda z: 0 if z > 0 else 1)

In [31]:
#0 in paid_off_1 means they did not pay off bill, 1 means they did pay off bill
x['paid_off_1']

23626    0
9489     0
27783    0
1940     0
18657    0
        ..
26650    0
5844     1
8549     0
26260    1
10116    0
Name: paid_off_1, Length: 22499, dtype: int64

In [32]:
x['paid_off_2']

23626    0
9489     0
27783    0
1940     0
18657    0
        ..
26650    0
5844     1
8549     0
26260    1
10116    0
Name: paid_off_2, Length: 22499, dtype: int64

In [33]:
x['paid_off_3']

23626    0
9489     0
27783    0
1940     0
18657    0
        ..
26650    1
5844     1
8549     0
26260    1
10116    0
Name: paid_off_3, Length: 22499, dtype: int64

In [34]:
x['paid_off_4']

23626    0
9489     1
27783    0
1940     0
18657    0
        ..
26650    1
5844     1
8549     0
26260    1
10116    0
Name: paid_off_4, Length: 22499, dtype: int64

In [35]:
x['paid_off_5']

23626    0
9489     1
27783    0
1940     0
18657    0
        ..
26650    1
5844     1
8549     0
26260    1
10116    0
Name: paid_off_5, Length: 22499, dtype: int64

In [36]:
x['paid_off_6']

23626    0
9489     1
27783    0
1940     1
18657    0
        ..
26650    1
5844     1
8549     0
26260    1
10116    0
Name: paid_off_6, Length: 22499, dtype: int64

In [37]:
x.columns

Index(['LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0', 'PAY_2',
       'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2',
       'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1',
       'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6',
       'paid_off_1', 'paid_off_2', 'paid_off_3', 'paid_off_4', 'paid_off_5',
       'paid_off_6'],
      dtype='object')

In [38]:
# NOTE: astype() returns a new object and does not modify the DataFrame in place.
# The conversion only persists when the result is assigned back, as done below with MARRIAGE.

In [39]:
x['MARRIAGE'] = x['MARRIAGE'].astype(object).astype(int)

In [40]:
x['MARRIAGE'] = x['MARRIAGE'].astype(object).astype(int).apply(lambda z:3 if z==0 else z)

In [41]:
x['MARRIAGE'].value_counts()

2    12026
1    10195
3      278
Name: MARRIAGE, dtype: int64

In [42]:
# Preview only: the result is not assigned, so x keeps its original MARRIAGE column
pd.get_dummies(x,columns=['MARRIAGE'],drop_first=True)

Unnamed: 0,LIMIT_BAL,SEX,EDUCATION,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,PAY_6,...,PAY_AMT5,PAY_AMT6,paid_off_1,paid_off_2,paid_off_3,paid_off_4,paid_off_5,paid_off_6,MARRIAGE_2,MARRIAGE_3
23626,90000,2,3,36,0,0,-1,0,0,0,...,849,1000,0,0,0,0,0,0,0,0
9489,90000,1,3,25,0,0,0,0,-2,-2,...,0,193,0,0,0,1,1,1,1,0
27783,90000,2,1,25,0,0,0,0,0,0,...,1500,1000,0,0,0,0,0,0,1,0
1940,90000,1,1,30,0,0,0,0,0,0,...,1000,1838,0,0,0,0,0,1,0,0
18657,90000,2,2,26,1,2,2,2,2,2,...,0,3500,0,0,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26650,10000,1,2,24,1,2,2,-2,-2,-2,...,0,0,0,0,1,1,1,1,1,0
5844,10000,1,2,37,-1,-1,-1,0,-1,0,...,0,0,1,1,1,1,1,1,1,0
8549,10000,1,2,28,-1,2,-1,-1,2,2,...,7000,0,0,0,0,0,0,0,1,0
26260,10000,2,3,33,1,-2,-2,-2,-2,-2,...,0,0,1,1,1,1,1,1,0,0


In [43]:
x['EDUCATION'] = x['EDUCATION'].astype(object).astype(int)

In [44]:
x['EDUCATION'] = x['EDUCATION'].apply(lambda z:z if z>=1 and z<=4 else 4)

In [45]:
x['EDUCATION'].value_counts()

2    10516
1     7919
3     3713
4      351
Name: EDUCATION, dtype: int64

In [46]:
# Preview only: the result is not assigned, so x keeps its original EDUCATION column
pd.get_dummies(x,columns=['EDUCATION'],drop_first=True)

Unnamed: 0,LIMIT_BAL,SEX,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,PAY_6,...,PAY_AMT6,paid_off_1,paid_off_2,paid_off_3,paid_off_4,paid_off_5,paid_off_6,EDUCATION_2,EDUCATION_3,EDUCATION_4
23626,90000,2,1,36,0,0,-1,0,0,0,...,1000,0,0,0,0,0,0,0,1,0
9489,90000,1,2,25,0,0,0,0,-2,-2,...,193,0,0,0,1,1,1,0,1,0
27783,90000,2,2,25,0,0,0,0,0,0,...,1000,0,0,0,0,0,0,0,0,0
1940,90000,1,1,30,0,0,0,0,0,0,...,1838,0,0,0,0,0,1,0,0,0
18657,90000,2,2,26,1,2,2,2,2,2,...,3500,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26650,10000,1,2,24,1,2,2,-2,-2,-2,...,0,0,0,1,1,1,1,1,0,0
5844,10000,1,2,37,-1,-1,-1,0,-1,0,...,0,1,1,1,1,1,1,1,0,0
8549,10000,1,2,28,-1,2,-1,-1,2,2,...,0,0,0,0,0,0,0,1,0,0
26260,10000,2,1,33,1,-2,-2,-2,-2,-2,...,0,1,1,1,1,1,1,0,1,0


In [47]:
x.columns

Index(['LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0', 'PAY_2',
       'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2',
       'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1',
       'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6',
       'paid_off_1', 'paid_off_2', 'paid_off_3', 'paid_off_4', 'paid_off_5',
       'paid_off_6'],
      dtype='object')

In [48]:
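poly = PolynomialFeatures(degree=2, include_bias=False)
poly_data = poly.fit_transform(x)
# scikit-learn >= 1.0 uses get_feature_names_out; older releases call it get_feature_names
poly_columns = poly.get_feature_names_out(x.columns)
# Keep the original row labels so df_poly stays aligned with x and y
df_poly = pd.DataFrame(poly_data, columns=poly_columns, index=x.index)
df_poly.columns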

Index(['LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0', 'PAY_2',
       'PAY_3', 'PAY_4', 'PAY_5',
       ...
       'paid_off_3^2', 'paid_off_3 paid_off_4', 'paid_off_3 paid_off_5',
       'paid_off_3 paid_off_6', 'paid_off_4^2', 'paid_off_4 paid_off_5',
       'paid_off_4 paid_off_6', 'paid_off_5^2', 'paid_off_5 paid_off_6',
       'paid_off_6^2'],
      dtype='object', length=464)
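
The length of 464 follows from the number of input columns. Assuming `x` holds 29 columns at this point (the 23 originals plus the six `paid_off_*` flags), a degree-2 expansion without the bias term keeps the 29 originals, adds 29 squares, and adds one product for each pair of distinct columns:

```python
from math import comb

n_features = 29  # assumed: 23 original columns + 6 paid_off_* flags
linear = n_features
squares = n_features
pairwise = comb(n_features, 2)          # products of two distinct columns
total = linear + squares + pairwise

# Equivalent closed form for degree 2 without the bias column
assert total == comb(n_features + 2, 2) - 1
print(total)  # 464
```

The count grows quadratically with the number of inputs, which is one reason feature selection usually follows a polynomial expansion.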

In [49]:
features = ['LIMIT_BAL', 'SEX']

## 4. Feature Selection

## 5. Model Fitting and Hyperparameter Tuning
KNN, Logistic Regression, Decision Tree

In [50]:
X_train, X_test, y_train, y_test = train_test_split(x.astype(object).astype(int), y.astype(object).astype(int), random_state=0)

In [51]:
dtc = DecisionTreeClassifier(random_state=0)

In [52]:
dtc.fit(X_train, y_train)

DecisionTreeClassifier(random_state=0)

In [53]:
y_pred = dtc.predict(X_test)

In [54]:
y_preds = pd.Series(y_pred)

In [55]:
y_preds

0       0
1       1
2       0
3       1
4       0
       ..
5620    0
5621    0
5622    0
5623    0
5624    1
Length: 5625, dtype: int64

In [56]:
f1_score(y_test, y_preds)

0.4070525105404369

In [57]:
neigh = KNeighborsClassifier(n_neighbors=3)

In [58]:
neigh.fit(X_train, y_train)

KNeighborsClassifier(n_neighbors=3)

In [59]:
y_pred = neigh.predict(X_test)

In [60]:
f1_score(y_test, y_pred)

0.2678311499272198

In [61]:
log = LogisticRegression(random_state=0,max_iter=1000)

In [62]:
log.fit(X_train, y_train)

LogisticRegression(max_iter=1000, random_state=0)

In [63]:
y_pred = log.predict(X_test)

In [64]:
f1_score(y_test, y_pred)

0.0
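
An F1 of 0.0 means the model produced no true positives on the test set. One plausible cause, offered here as an assumption rather than a diagnosis, is that logistic regression is being fit on unscaled features with very different ranges and on imbalanced classes. The sketch below uses synthetic data from `make_classification` (not the credit data) to show how scaling and `class_weight='balanced'` can be wired into a pipeline:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the real data (roughly 80% / 20%).
X_demo, y_demo = make_classification(n_samples=2000, n_features=10,
                                     weights=[0.8, 0.2], random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0)

# Scale inside the pipeline so the scaler is fit on training data only.
scaled_log = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight='balanced', max_iter=1000, random_state=0),
)
scaled_log.fit(Xd_train, yd_train)
demo_f1 = f1_score(yd_test, scaled_log.predict(Xd_test))
print(round(demo_f1, 3))
```

Whether this helps on the credit data would need to be checked by refitting with `X_train` and `y_train`.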

## 6. Model Evaluation

In [75]:
rfe = RFECV(DecisionTreeClassifier(random_state=0),cv=5)

In [78]:
X_rfe_train = rfe.fit_transform(X_train, y_train)
X_rfe_test = rfe.transform(X_test)
dectree = DecisionTreeClassifier(random_state=0).fit(X_rfe_train, y_train)

In [82]:
y_pred=dectree.predict(X_rfe_train)

In [84]:
# Scored on the training data, so this overstates out-of-sample performance
f1_score(y_train, y_pred)

0.9986663110162709

In [65]:

params_grid = {
    'criterion' : ['gini','entropy'],
    'max_depth':[None, 5,3],
    'min_samples_split':[2,10,20]
}
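# No scoring argument is given, so candidates are ranked by accuracy (the classifier's default score)
gridsearch_model = GridSearchCV(estimator=dtc,param_grid = params_grid,verbose=1)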

In [66]:
gridsearch_model.fit(X_train,y_train)

Fitting 5 folds for each of 18 candidates, totalling 90 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  90 out of  90 | elapsed:   10.7s finished


GridSearchCV(estimator=DecisionTreeClassifier(random_state=0),
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': [None, 5, 3],
                         'min_samples_split': [2, 10, 20]},
             verbose=1)

In [67]:
gridsearch_model.best_params_

{'criterion': 'gini', 'max_depth': 3, 'min_samples_split': 2}

In [68]:
best_model = gridsearch_model.best_estimator_

In [69]:
y_pred = best_model.predict(X_test)

In [70]:
f1_score(y_test, y_pred)

0.4878048780487806
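
The decision tree search ranks candidates by `GridSearchCV`'s default score, which for classifiers is accuracy. Because the assignment evaluates on F1, one option is to pass `scoring='f1'`, and the same pattern extends to the other two models. The sketch below tunes KNN inside a scaling pipeline on synthetic data; the grid values are arbitrary illustrations, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the credit data).
X_demo, y_demo = make_classification(n_samples=600, n_features=8,
                                     weights=[0.8, 0.2], random_state=0)

knn_pipe = Pipeline([
    ('scale', StandardScaler()),       # KNN is distance based, so scale first
    ('knn', KNeighborsClassifier()),
])

# Step-name prefixes ('knn__') route each parameter to the right pipeline step.
knn_grid = {'knn__n_neighbors': [3, 7, 11], 'knn__weights': ['uniform', 'distance']}

knn_search = GridSearchCV(knn_pipe, knn_grid, scoring='f1', cv=5)
knn_search.fit(X_demo, y_demo)
print(knn_search.best_params_, round(knn_search.best_score_, 3))
```

Putting the scaler inside the pipeline means each cross-validation fold is scaled using only its own training portion.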

## 7. Final Model

In [71]:
for fi, feature in zip(best_model.feature_importances_, X_train.columns):
    print(fi, feature)

0.0 LIMIT_BAL
0.0 SEX
0.006086491236001525 EDUCATION
0.0 MARRIAGE
0.0 AGE
0.7717899353035195 PAY_0
0.139657966883848 PAY_2
0.013791441559922553 PAY_3
0.0 PAY_4
0.010229882966250672 PAY_5
0.0 PAY_6
0.0 BILL_AMT1
0.0 BILL_AMT2
0.0 BILL_AMT3
0.0 BILL_AMT4
0.0 BILL_AMT5
0.0 BILL_AMT6
0.006125500043036013 PAY_AMT1
0.05231878200742191 PAY_AMT2
0.0 PAY_AMT3
0.0 PAY_AMT4
0.0 PAY_AMT5
0.0 PAY_AMT6
0.0 paid_off_1
0.0 paid_off_2
0.0 paid_off_3
0.0 paid_off_4
0.0 paid_off_5
0.0 paid_off_6


In [85]:
best_model.score(X_test,y_test)

0.8208
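
Accuracy can look strong on imbalanced data even when F1 is modest. Using the class counts reported by `value_counts()` for the full training file (17,471 non-defaults and 5,028 defaults), a model that always predicts "no default" would already reach roughly 78% accuracy while scoring an F1 of zero. The test split's class balance will differ slightly, so treat this as a rough baseline:

```python
# Class counts from the training file's value_counts() output.
n_no_default, n_default = 17471, 5028

# Accuracy of always predicting the majority class (no default).
baseline_accuracy = n_no_default / (n_no_default + n_default)

# With no positive predictions there are no true positives, so F1 is 0 by convention.
baseline_f1 = 0.0

print(round(baseline_accuracy, 4))  # 0.7765
```

Against that baseline, an accuracy of 0.8208 is a modest improvement, which is why F1 is the more informative score here.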

In [108]:
holdout = pd.read_csv('holdout_data.csv' , index_col=0)

In [109]:
holdout

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,...,X14,X15,X16,X17,X18,X19,X20,X21,X22,X23
5501,180000,2,2,1,44,0,0,0,0,0,...,170788,174764,162667,166953,10000,8000,7000,6000,7000,10000
28857,130000,2,2,1,48,-2,-2,-2,-2,-2,...,1487,1279,749,440,1240,1487,1279,749,440,849
11272,60000,2,1,1,43,-1,3,2,0,0,...,495,330,165,340,0,330,0,0,340,0
8206,240000,1,1,1,42,0,0,0,0,0,...,91027,51508,51127,0,20000,2213,1030,1023,6790,10893
6362,100000,2,2,1,28,2,0,0,0,0,...,70844,63924,57326,59654,3500,3003,1910,2400,3300,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14600,90000,2,2,1,34,-2,-2,-2,-2,-2,...,11855,665,0,665,1924,11855,10655,0,665,0
12687,180000,2,2,2,28,0,0,0,0,0,...,109741,112907,115924,118832,6500,5000,5000,5000,5000,5000
7374,360000,1,2,1,37,1,-2,-2,-2,-2,...,0,0,0,0,0,0,0,0,0,0
27661,50000,2,2,2,23,-1,0,0,2,0,...,12595,11449,9914,9875,1502,2651,500,500,500,500


In [110]:
holdout.columns=['LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0', 'PAY_2',
       'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2',
       'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1',
       'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']

In [112]:
holdout.columns

Index(['LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0', 'PAY_2',
       'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2',
       'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1',
       'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6'],
      dtype='object')

In [113]:
# Preview only: the result is not assigned, so holdout keeps its original SEX column
pd.get_dummies(holdout,columns=['SEX'],drop_first=True)

Unnamed: 0,LIMIT_BAL,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,PAY_6,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,SEX_2
5501,180000,2,1,44,0,0,0,0,0,0,...,174764,162667,166953,10000,8000,7000,6000,7000,10000,1
28857,130000,2,1,48,-2,-2,-2,-2,-2,-2,...,1279,749,440,1240,1487,1279,749,440,849,1
11272,60000,1,1,43,-1,3,2,0,0,-1,...,330,165,340,0,330,0,0,340,0,1
8206,240000,1,1,42,0,0,0,0,0,0,...,51508,51127,0,20000,2213,1030,1023,6790,10893,0
6362,100000,2,1,28,2,0,0,0,0,2,...,63924,57326,59654,3500,3003,1910,2400,3300,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14600,90000,2,1,34,-2,-2,-2,-2,-2,-2,...,665,0,665,1924,11855,10655,0,665,0,1
12687,180000,2,2,28,0,0,0,0,0,0,...,112907,115924,118832,6500,5000,5000,5000,5000,5000,1
7374,360000,2,1,37,1,-2,-2,-2,-2,-2,...,0,0,0,0,0,0,0,0,0,0
27661,50000,2,2,23,-1,0,0,2,0,0,...,11449,9914,9875,1502,2651,500,500,500,500,1


In [114]:
holdout = holdout.astype(object).astype(int)

In [115]:
# Did they pay their bill amount? (0 = balance remains, 1 = paid off)
# NOTE: every flag subtracts PAY_AMT1, matching how the training features were built.
for i in range(1, 7):
    holdout[f'paid_off_{i}'] = (holdout[f'BILL_AMT{i}'] - holdout['PAY_AMT1']).apply(lambda z: 0 if z > 0 else 1)

In [116]:
holdout['MARRIAGE'] = holdout['MARRIAGE'].astype(object).astype(int).apply(lambda z:3 if z==0 else z)

In [117]:
# Preview only: the result is not assigned, so holdout keeps its original MARRIAGE column
pd.get_dummies(holdout,columns=['MARRIAGE'],drop_first=True)

Unnamed: 0,LIMIT_BAL,SEX,EDUCATION,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,PAY_6,...,PAY_AMT5,PAY_AMT6,paid_off_1,paid_off_2,paid_off_3,paid_off_4,paid_off_5,paid_off_6,MARRIAGE_2,MARRIAGE_3
5501,180000,2,2,44,0,0,0,0,0,0,...,7000,10000,0,0,0,0,0,0,0,0
28857,130000,2,2,48,-2,-2,-2,-2,-2,-2,...,440,849,1,1,0,0,1,1,0,0
11272,60000,2,1,43,-1,3,2,0,0,-1,...,340,0,0,0,0,0,0,0,0,0
8206,240000,1,1,42,0,0,0,0,0,0,...,6790,10893,0,0,0,0,0,1,0,0
6362,100000,2,2,28,2,0,0,0,0,2,...,3300,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14600,90000,2,2,34,-2,-2,-2,-2,-2,-2,...,665,0,1,1,0,1,1,1,0,0
12687,180000,2,2,28,0,0,0,0,0,0,...,5000,5000,0,0,0,0,0,0,1,0
7374,360000,1,2,37,1,-2,-2,-2,-2,-2,...,0,0,1,1,1,1,1,1,0,0
27661,50000,2,2,23,-1,0,0,2,0,0,...,500,500,0,0,0,0,0,0,1,0


In [118]:
holdout['EDUCATION'] = holdout['EDUCATION'].apply(lambda z:z if z>=1 and z<=4 else 4)

In [119]:
# Preview only: the result is not assigned, so holdout keeps its original EDUCATION column
pd.get_dummies(holdout,columns=['EDUCATION'],drop_first=True)

Unnamed: 0,LIMIT_BAL,SEX,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,PAY_6,...,PAY_AMT6,paid_off_1,paid_off_2,paid_off_3,paid_off_4,paid_off_5,paid_off_6,EDUCATION_2,EDUCATION_3,EDUCATION_4
5501,180000,2,1,44,0,0,0,0,0,0,...,10000,0,0,0,0,0,0,1,0,0
28857,130000,2,1,48,-2,-2,-2,-2,-2,-2,...,849,1,1,0,0,1,1,1,0,0
11272,60000,2,1,43,-1,3,2,0,0,-1,...,0,0,0,0,0,0,0,0,0,0
8206,240000,1,1,42,0,0,0,0,0,0,...,10893,0,0,0,0,0,1,0,0,0
6362,100000,2,1,28,2,0,0,0,0,2,...,0,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14600,90000,2,1,34,-2,-2,-2,-2,-2,-2,...,0,1,1,0,1,1,1,1,0,0
12687,180000,2,2,28,0,0,0,0,0,0,...,5000,0,0,0,0,0,0,1,0,0
7374,360000,1,1,37,1,-2,-2,-2,-2,-2,...,0,1,1,1,1,1,1,1,0,0
27661,50000,2,2,23,-1,0,0,2,0,0,...,500,0,0,0,0,0,0,1,0,0


In [120]:
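poly = PolynomialFeatures(degree=2, include_bias=False)
poly_data = poly.fit_transform(holdout)
# scikit-learn >= 1.0 uses get_feature_names_out; older releases call it get_feature_names
poly_columns = poly.get_feature_names_out(holdout.columns)
# Keep the original row labels so df_poly stays aligned with holdout
df_poly = pd.DataFrame(poly_data, columns=poly_columns, index=holdout.index)
df_poly.columns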

Index(['LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0', 'PAY_2',
       'PAY_3', 'PAY_4', 'PAY_5',
       ...
       'paid_off_3^2', 'paid_off_3 paid_off_4', 'paid_off_3 paid_off_5',
       'paid_off_3 paid_off_6', 'paid_off_4^2', 'paid_off_4 paid_off_5',
       'paid_off_4 paid_off_6', 'paid_off_5^2', 'paid_off_5 paid_off_6',
       'paid_off_6^2'],
      dtype='object', length=464)

In [121]:
features = ['LIMIT_BAL', 'SEX']

In [122]:
final_pred = best_model.predict(holdout)

In [123]:
final_answer = pd.DataFrame(final_pred)

In [128]:
final_answer.to_csv('default_preds_jacob_ash.csv')
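
The holdout preparation repeats the training steps by hand, which makes it easy for the two to drift apart. One possible alternative is to put the shared steps in a single function and call it on both frames. The function name `prepare_features` and the toy input are hypothetical; the logic mirrors the cells in this notebook, including the use of `PAY_AMT1` in every `paid_off_*` flag.

```python
import pandas as pd

def prepare_features(frame: pd.DataFrame) -> pd.DataFrame:
    """Apply the notebook's cleaning and paid_off_* steps to a copy of `frame`."""
    out = frame.astype(int).copy()
    # Fold the undocumented MARRIAGE code 0 into 3 ('others').
    out["MARRIAGE"] = out["MARRIAGE"].where(out["MARRIAGE"] != 0, 3)
    # Fold EDUCATION codes outside 1..4 into 4 ('others').
    out["EDUCATION"] = out["EDUCATION"].where(out["EDUCATION"].between(1, 4), 4)
    # 1 when the bill minus PAY_AMT1 is zero or negative, else 0.
    for i in range(1, 7):
        out[f"paid_off_{i}"] = (out[f"BILL_AMT{i}"] - out["PAY_AMT1"] <= 0).astype(int)
    return out

# Toy input with string values, as the raw CSV columns load as strings.
toy = pd.DataFrame({
    "MARRIAGE": ["0", "2"],
    "EDUCATION": ["6", "1"],
    "PAY_AMT1": ["500", "0"],
    **{f"BILL_AMT{i}": ["400", "1000"] for i in range(1, 7)},
})
prepared = prepare_features(toy)
print(prepared[["MARRIAGE", "EDUCATION", "paid_off_1"]])
```

Calling the same function on the training features and on `holdout` would guarantee that both receive identical columns in the same order.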