# Applying Classification Modeling
The goal of this week's assessment is to find the model which best predicts whether or not a person will default on their bank loan. In doing so, we want to utilize all of the different tools we have learned over the course: data cleaning, EDA, feature engineering/transformation, feature selection, hyperparameter tuning, and model evaluation. 


#### Data Set Information:

This research examined customer default payments in Taiwan and compared the predictive accuracy of the probability of default among six data mining methods. From a risk management perspective, the predictive accuracy of the estimated probability of default is more valuable than the binary classification result (credible or not credible clients). Because the real probability of default is unknown, the study introduced the novel Sorting Smoothing Method to estimate it. With the real probability of default as the response variable (Y) and the predicted probability of default as the independent variable (X), the simple linear regression (Y = A + BX) shows that the forecasting model produced by the artificial neural network has the highest coefficient of determination; its regression intercept (A) is close to zero and its regression coefficient (B) is close to one. Therefore, among the six data mining techniques, the artificial neural network is the only one that can accurately estimate the real probability of default. 

- NT is the abbreviation for New Taiwan. 


#### Attribute Information:

This research employed a binary variable, default payment (Yes = 1, No = 0), as the response variable. This study reviewed the literature and used the following 23 variables as explanatory variables: 
- X1: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit. 
- X2: Gender (1 = male; 2 = female). 
- X3: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others). 
- X4: Marital status (1 = married; 2 = single; 3 = others). 
- X5: Age (year). 
- X6 - X11: History of past payment. We tracked the past monthly payment records (from April to September, 2005) as follows: 
    - X6 = the repayment status in September, 2005; 
    - X7 = the repayment status in August, 2005; . . .;
    - etc...
    - X11 = the repayment status in April, 2005. 
    - The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above. 
- X12-X17: Amount of bill statement (NT dollar). 
    - X12 = amount of bill statement in September, 2005;
    - X13 = amount of bill statement in August, 2005; . . .; 
    - etc...
    - X17 = amount of bill statement in April, 2005. 
- X18-X23: Amount of previous payment (NT dollar). 
    - X18 = amount paid in September, 2005; 
    - X19 = amount paid in August, 2005; . . .;
    - etc...
    - X23 = amount paid in April, 2005. 




You will fit three different models (KNN, Logistic Regression, and Decision Tree Classifier) to predict credit card defaults and use gridsearch to find the best hyperparameters for those models. Then you will compare the performance of those three models on a test set to find the best one.  


## Process/Expectations

- You will be working in pairs for this assessment

### Please have ONE notebook and be prepared to explain how you worked in your pair.

1. Clean up your data set so that you can perform an EDA. 
    - This includes handling null values, categorical variables, removing unimportant columns, and removing outliers.
2. Perform EDA to identify opportunities to create new features.
    - [Great Example of EDA for classification](https://www.kaggle.com/stephaniestallworth/titanic-eda-classification-end-to-end) 
    - [Using Pairplots with Classification](https://towardsdatascience.com/visualizing-data-with-pair-plots-in-python-f228cf529166)
3. Engineer new features. 
    - Create polynomial and/or interaction features. 
    - Additionally, you must also create **at least 2 new features** that are not interactions or polynomial transformations. 
        - *For example, you can create a new dummy variable based on the value of a continuous variable (billamount6 > 2000) or take the average of some past amounts.*
4. Perform some feature selection. 
    
5. You must fit **three** models to your data and tune **at least 1 hyperparameter** per model. 
6. Using the F-1 Score, evaluate how well your models perform and identify your best model.
7. Using information from your EDA process and your model(s) output, provide insight as to which borrowers are more likely to default.
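
As a rough illustration of step 3, the two kinds of non-polynomial features mentioned in the list (a threshold dummy and an average of past amounts) could be built along these lines. The three-row frame and the names `high_bill_april` and `avg_payment` are hypothetical; the 2000 cutoff is taken from the example in the list, and `X17` and `X18`-`X23` follow the attribute key.

```python
import pandas as pd

# Hypothetical three-row sample that reuses column names from the attribute key.
sample = pd.DataFrame({
    "X17": [1500.0, 2500.0, 0.0],   # bill statement, April 2005
    "X18": [100.0, 2000.0, 0.0],    # amount paid, September 2005
    "X19": [100.0, 1000.0, 0.0],
    "X20": [100.0, 0.0, 0.0],
    "X21": [100.0, 0.0, 0.0],
    "X22": [100.0, 0.0, 0.0],
    "X23": [100.0, 0.0, 0.0],       # amount paid, April 2005
})
pay_cols = ["X18", "X19", "X20", "X21", "X22", "X23"]

# Dummy variable: 1 when the April bill is above the (assumed) 2000 cutoff.
sample["high_bill_april"] = (sample["X17"] > 2000).astype(int)

# Average of the six past payment amounts.
sample["avg_payment"] = sample[pay_cols].mean(axis=1)

print(sample[["high_bill_april", "avg_payment"]])
```

Neither feature is an interaction or a polynomial term, so both would count toward the "at least 2 new features" requirement.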


In [1]:
# import libraries

import pandas as pd
import numpy as np
pd.set_option('display.max_columns', 50)

## 1. Data Cleaning

In [2]:
df = pd.read_csv('training_data.csv' , index_col=0)

In [3]:
df.Y.value_counts()

0                             17471
1                              5028
default payment next month        1
Name: Y, dtype: int64

In [4]:
df.head()

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,X11,X12,X13,X14,X15,X16,X17,X18,X19,X20,X21,X22,X23,Y
28835,220000,2,1,2,36,0,0,0,0,0,0,222598,222168,217900,221193,181859,184605,10000,8018,10121,6006,10987,143779,1
25329,200000,2,3,2,29,-1,-1,-1,-1,-1,-1,326,326,326,326,326,326,326,326,326,326,326,326,0
18894,180000,2,1,2,27,-2,-2,-2,-2,-2,-2,0,0,0,0,0,0,0,0,0,0,0,0,0
690,80000,1,2,2,32,0,0,0,0,0,0,51372,51872,47593,43882,42256,42527,1853,1700,1522,1548,1488,1500,0
6239,10000,1,2,2,27,0,0,0,0,0,0,8257,7995,4878,5444,2639,2697,2000,1100,600,300,300,1000,1


In [5]:
df['X6']

28835     0
25329    -1
18894    -2
690       0
6239      0
         ..
16247     0
2693     -1
8076      1
20213    -1
7624      1
Name: X6, Length: 22500, dtype: object

In [6]:
df.dtypes

X1     object
X2     object
X3     object
X4     object
X5     object
X6     object
X7     object
X8     object
X9     object
X10    object
X11    object
X12    object
X13    object
X14    object
X15    object
X16    object
X17    object
X18    object
X19    object
X20    object
X21    object
X22    object
X23    object
Y      object
dtype: object

In [7]:
df['X3'].value_counts()

2            10516
1             7919
3             3713
5              208
4               90
6               42
0               11
EDUCATION        1
Name: X3, dtype: int64

## 2. EDA

In [8]:
df.drop('ID', axis = 0, inplace = True)

In [9]:
df = df.astype(float)

In [10]:
df.dtypes

X1     float64
X2     float64
X3     float64
X4     float64
X5     float64
X6     float64
X7     float64
X8     float64
X9     float64
X10    float64
X11    float64
X12    float64
X13    float64
X14    float64
X15    float64
X16    float64
X17    float64
X18    float64
X19    float64
X20    float64
X21    float64
X22    float64
X23    float64
Y      float64
dtype: object

In [11]:
## For our Marital Status column, we have 3 groups in our key but 4 in the dataframe
## We need to figure out where 0 belongs

## 1 is married, 2 is single, 3 is others

df.groupby('X4')['X5'].mean()

X4
0.0    37.590909
1.0    40.027857
2.0    31.412606
3.0    42.893162
Name: X5, dtype: float64

In [12]:
## 0 appears to belong to 1 aka the married group but I should investigate further

In [13]:
df.groupby('X4')['X1'].mean()

## We have 4 groups in our key for Education Level but 7 in our dataframe

X4
0.0    140681.818182
1.0    181372.437469
2.0    156242.115417
3.0    103888.888889
Name: X1, dtype: float64

In [14]:
df.groupby('X3')['X5'].mean()

##6 and 0 look like they belong to 3 aka high school education
## 5 might belong to 2 aka college education but I should investigate further

X3
0.0    40.545455
1.0    34.231847
2.0    34.658806
3.0    40.157285
4.0    34.666667
5.0    35.937500
6.0    43.904762
Name: X5, dtype: float64

In [15]:
df.groupby('X3')['X1'].mean()

## 1 graduate, 2 college, 3 high school, 4 others

X3
0.0    220000.000000
1.0    213389.316833
2.0    146419.360974
3.0    125641.712901
4.0    230000.000000
5.0    161951.923077
6.0    135000.000000
Name: X1, dtype: float64

In [16]:
df['X3']

28835    1.0
25329    3.0
18894    1.0
690      2.0
6239     2.0
        ... 
16247    2.0
2693     1.0
8076     3.0
20213    3.0
7624     1.0
Name: X3, Length: 22499, dtype: float64

In [17]:
df[(df['X3'] == 5) | (df['X3'] == 6) | (df['X3'] == 0)]['X4'].value_counts()

1.0    139
2.0    117
3.0      5
Name: X4, dtype: int64

In [18]:
df[df['X4'] == 0]['X5'].mean()

37.59090909090909

In [19]:
conditions = [
    df['X3'].eq(0),
    df['X3'].eq(1),
    df['X3'].eq(2),
    df['X3'].eq(3),
    df['X3'].eq(4),
    df['X3'].eq(5),
    df['X3'].eq(6),
]

choices = [
    1,
    0,
    0,
    0,
    0,
    1,
    1,   
]

df['X3_undefined'] = np.select(conditions,choices)

In [20]:
conditions = [
    df['X3'].eq(0),
    df['X3'].eq(5),
    df['X3'].eq(6),
]

choices = [
    4,
    4,
    4
]

df['X3'] = np.select(conditions,choices, df['X3'])

In [21]:
df['X3'].value_counts()

2.0    10516
1.0     7919
3.0     3713
4.0      351
Name: X3, dtype: int64

In [22]:
df['X4_undefined'] = np.where(df['X4'] == 0, 1,0)

In [23]:
df['X4'] = np.where(df['X4'] == 0, 3, df['X4'])

In [24]:
df['X4'].value_counts()

2.0    12026
1.0    10195
3.0      278
Name: X4, dtype: int64

In [25]:
df['X3_undefined'].value_counts()

0    22238
1      261
Name: X3_undefined, dtype: int64

In [26]:
df['X4_undefined'].value_counts()

0    22455
1       44
Name: X4_undefined, dtype: int64

In [27]:
df['X4'].value_counts()

2.0    12026
1.0    10195
3.0      278
Name: X4, dtype: int64

In [28]:
df['X3'].value_counts()

2.0    10516
1.0     7919
3.0     3713
4.0      351
Name: X3, dtype: int64
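
The recoding in cells 19 to 23 keeps a flag for the undocumented codes and then folds them into the "others" group. A compact sketch of the same idea, assuming a small hypothetical `education` series rather than the real `df['X3']`, might look like this:

```python
import pandas as pd

# Hypothetical sample of education codes, including undocumented 0, 5 and 6.
education = pd.Series([1.0, 2.0, 0.0, 5.0, 3.0, 6.0, 4.0], name="X3")
documented = [1, 2, 3, 4]  # codes listed in the attribute key

# 1 where the original code is not in the key, 0 otherwise.
undefined_flag = (~education.isin(documented)).astype(int)

# Keep documented codes and send everything else to 4 ("others").
recoded = education.where(education.isin(documented), 4)

print(pd.DataFrame({"X3": recoded, "X3_undefined": undefined_flag}))
```

Building the flag before overwriting the column matters: once the codes are merged into 4, the information about which rows were originally undefined is gone.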

In [29]:
df

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,X11,X12,X13,X14,X15,X16,X17,X18,X19,X20,X21,X22,X23,Y,X3_undefined,X4_undefined
28835,220000.0,2.0,1.0,2.0,36.0,0.0,0.0,0.0,0.0,0.0,0.0,222598.0,222168.0,217900.0,221193.0,181859.0,184605.0,10000.0,8018.0,10121.0,6006.0,10987.0,143779.0,1.0,0,0
25329,200000.0,2.0,3.0,2.0,29.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,326.0,326.0,326.0,326.0,326.0,326.0,326.0,326.0,326.0,326.0,326.0,326.0,0.0,0,0
18894,180000.0,2.0,1.0,2.0,27.0,-2.0,-2.0,-2.0,-2.0,-2.0,-2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0
690,80000.0,1.0,2.0,2.0,32.0,0.0,0.0,0.0,0.0,0.0,0.0,51372.0,51872.0,47593.0,43882.0,42256.0,42527.0,1853.0,1700.0,1522.0,1548.0,1488.0,1500.0,0.0,0,0
6239,10000.0,1.0,2.0,2.0,27.0,0.0,0.0,0.0,0.0,0.0,0.0,8257.0,7995.0,4878.0,5444.0,2639.0,2697.0,2000.0,1100.0,600.0,300.0,300.0,1000.0,1.0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16247,40000.0,2.0,2.0,1.0,38.0,0.0,0.0,3.0,2.0,2.0,2.0,35183.0,39197.0,39477.0,39924.0,39004.0,41462.0,4600.0,1200.0,1400.0,0.0,3069.0,0.0,1.0,0,0
2693,350000.0,1.0,1.0,1.0,42.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,3800.0,3138.0,4150.0,3750.0,1362.0,8210.0,3138.0,4160.0,3750.0,2272.0,8210.0,9731.0,0.0,0,0
8076,100000.0,2.0,3.0,2.0,46.0,1.0,-1.0,2.0,2.0,-1.0,0.0,0.0,203.0,203.0,0.0,7856.0,16544.0,203.0,0.0,0.0,7856.0,10000.0,865.0,0.0,0,0
20213,20000.0,2.0,3.0,1.0,50.0,-1.0,-1.0,-1.0,-1.0,-2.0,-2.0,5141.0,3455.0,6906.0,0.0,0.0,0.0,3754.0,6906.0,290.0,0.0,0.0,0.0,1.0,0,0


## 3. Feature Engineering

### Remaining balances after each monthly payment

In [30]:
df['bal_september'] = df['X12'] - df['X18'] #amount left to pay September
df['bal_august'] = df['X13'] - df['X19'] #amount left to pay August
df['bal_july'] = df['X14'] - df['X20'] #amount left to pay July
df['bal_june'] = df['X15'] - df['X21'] #amount left to pay June
df['bal_may'] = df['X16'] - df['X22'] #amount left to pay May
df['bal_april'] = df['X17'] - df['X23'] #amount left to pay April

### Monthly credit utilization

In [31]:
df['util_september'] = df['bal_september'] / df['X1'] #September credit utilization
df['util_august'] = df['bal_august'] / df['X1'] #August credit utilization
df['util_july'] = df['bal_july'] / df['X1'] #July credit utilization
df['util_june'] = df['bal_june'] / df['X1'] #June credit utilization
df['util_may'] = df['bal_may'] / df['X1'] #May credit utilization
df['util_april'] = df['bal_april'] / df['X1'] #April credit utilization

### Utilization Trend

In [32]:
# # test = [0.12,0.23,0.34,0.45,0.6]
# test = [0.6, 0.5, 0.54, 0.2, 0.3]

In [33]:
# for index, i in enumerate(test):
#     if index == 0:
#         trend = 0
#     else:
#         trend += i - test[index - 1]
# trend

In [34]:
def util_trend(df,cols):
    for index, i in enumerate(cols):
        if index == 0:
            trend = 0
        else:
            trend += df[i] - df[cols[index - 1]]
    return trend

In [35]:
prefix = ['util']
months = ['september', 'august', 'july', 'june', 'may', 'april']
i = [prefix[0] + "_" + x for x in months]
df['util_trend'] = util_trend(df, i) #the overall trend of the credit utilization

In [36]:
i

['util_september',
 'util_august',
 'util_july',
 'util_june',
 'util_may',
 'util_april']
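
One property of the loop in `util_trend` is worth spelling out: summing successive differences telescopes, so the result equals the last column minus the first. The sketch below checks this on the utilization values shown for the first row of the frame; `cumulative_change` is a hypothetical stand-in for the notebook's function that works on a plain list.

```python
def cumulative_change(values):
    """Sum successive differences, mirroring the loop in util_trend."""
    trend = 0.0
    for index in range(1, len(values)):
        trend += values[index] - values[index - 1]
    return trend

# Utilization for the first row, ordered September -> April as in the notebook.
utils = [0.966355, 0.973409, 0.94445, 0.978123, 0.776691, 0.185573]

loop_result = cumulative_change(utils)
shortcut = utils[-1] - utils[0]  # April minus September

print(round(loop_result, 6), round(shortcut, 6))
```

Because the columns run from September back to April, a negative `util_trend` means utilization was higher in September than in April, i.e. it grew over the six months.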

### Average Utilization

In [37]:
df['avg_util'] = (df[i[0]] + df[i[1]] + df[i[2]] + df[i[3]] + df[i[4]] + df[i[5]]) / 6 #average credit utilization

### Payment History

In [38]:
df['payment_hist'] = df.apply(lambda x:[x['X6'], x['X7'], x['X8'], x['X9'], x['X10'], x['X11']], axis = 1) 

It looks like -2 also appears as a value, but the attribute key does not say what it means. Maybe those customers prepaid? Also, a customer cannot be two months behind if they were up to date the month before, so the statuses are not always consistent.

According to Kaggle this is a possible interpretation:

-2 = Balance paid in full and no transactions this period (we may refer to this credit card account as having been 'inactive' this period)

-1 = Balance paid in full, but account has a positive balance at end of period due to recent transactions for which payment has not yet come due

0 = Customer paid the minimum due amount, but not the entire balance. I.e., the customer paid enough for their account to remain in good standing, but did revolve a balance

### Account Status

In [39]:
df['account_status'] = df['payment_hist'].apply(lambda lst:1 if max(set(lst), key=lst.count) > 0 else 0)

In [40]:
# 1 = bad standing: the most frequent monthly status is a payment delay (> 0); 0 = good standing: the most frequent status is on time
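
To make the `account_status` rule concrete, here is a small sketch of the same expression applied to a few hypothetical payment histories. `most_common_status` and `account_status` are illustrative helper names, not functions from the notebook.

```python
def most_common_status(history):
    """Return the most frequent value in a list (same expression as the lambda)."""
    return max(set(history), key=history.count)

def account_status(history):
    """1 if the most frequent status is a delay (> 0), else 0."""
    return 1 if most_common_status(history) > 0 else 0

# Hypothetical six-month histories (September ... April).
mostly_on_time = [0.0, 0.0, 0.0, 0.0, 3.0, 4.0]
mostly_late = [2.0, 2.0, 2.0, 3.0, 0.0, 0.0]

print(account_status(mostly_on_time), account_status(mostly_late))
```

Note that the rule looks at the single most frequent status, so an account with two late months out of six is still labeled 0. When two statuses are equally frequent, the result depends on set iteration order, which is a limitation of this approach.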

### Ever owed more than their credit limit

In [41]:
df['excess_debt'] = df.loc[:, 'X12':'X17'].gt(df['X1'], axis=0).any(axis=1).astype(int) # 1 if any monthly bill exceeded the credit limit

In [42]:
df['excess_debt'].value_counts()

0    19600
1     2899
Name: excess_debt, dtype: int64

### Continually owed more than their credit limit

In [43]:
df['consistent_excess_debt'] = (df.loc[:, 'X12':'X17'].gt(df['X1'], axis=0).sum(axis=1) > 1).astype(int) # 1 if the bill exceeded the limit in at least two months

In [44]:
df['consistent_excess_debt'].value_counts()

0    20931
1     1568
Name: consistent_excess_debt, dtype: int64

In [45]:
df.head()

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,X11,X12,X13,X14,X15,X16,X17,X18,X19,X20,X21,X22,X23,Y,X3_undefined,X4_undefined,bal_september,bal_august,bal_july,bal_june,bal_may,bal_april,util_september,util_august,util_july,util_june,util_may,util_april,util_trend,avg_util,payment_hist,account_status,excess_debt,consistent_excess_debt
28835,220000.0,2.0,1.0,2.0,36.0,0.0,0.0,0.0,0.0,0.0,0.0,222598.0,222168.0,217900.0,221193.0,181859.0,184605.0,10000.0,8018.0,10121.0,6006.0,10987.0,143779.0,1.0,0,0,212598.0,214150.0,207779.0,215187.0,170872.0,40826.0,0.966355,0.973409,0.94445,0.978123,0.776691,0.185573,-0.780782,0.8041,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",0,1,1
25329,200000.0,2.0,3.0,2.0,29.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,326.0,326.0,326.0,326.0,326.0,326.0,326.0,326.0,326.0,326.0,326.0,326.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,"[-1.0, -1.0, -1.0, -1.0, -1.0, -1.0]",0,0,0
18894,180000.0,2.0,1.0,2.0,27.0,-2.0,-2.0,-2.0,-2.0,-2.0,-2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,"[-2.0, -2.0, -2.0, -2.0, -2.0, -2.0]",0,0,0
690,80000.0,1.0,2.0,2.0,32.0,0.0,0.0,0.0,0.0,0.0,0.0,51372.0,51872.0,47593.0,43882.0,42256.0,42527.0,1853.0,1700.0,1522.0,1548.0,1488.0,1500.0,0.0,0,0,49519.0,50172.0,46071.0,42334.0,40768.0,41027.0,0.618988,0.62715,0.575887,0.529175,0.5096,0.512837,-0.10615,0.562273,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",0,0,0
6239,10000.0,1.0,2.0,2.0,27.0,0.0,0.0,0.0,0.0,0.0,0.0,8257.0,7995.0,4878.0,5444.0,2639.0,2697.0,2000.0,1100.0,600.0,300.0,300.0,1000.0,1.0,0,0,6257.0,6895.0,4278.0,5144.0,2339.0,1697.0,0.6257,0.6895,0.4278,0.5144,0.2339,0.1697,-0.456,0.4435,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",0,0,0


## 4. Feature Selection

## 5. Model Fitting and Hyperparameter Tuning
KNN, Logistic Regression, Decision Tree

## KNN

In [46]:
# Split data to be used in the models
# Create matrix of features
X = df.drop(['Y', 'payment_hist'], axis = 1) # every column except the target 'Y' and the list-valued 'payment_hist'


# Create target variable
y = df['Y'] # y is the column we're trying to predict

In [47]:
y.dtypes

dtype('float64')

In [48]:
X.dtypes

X1                        float64
X2                        float64
X3                        float64
X4                        float64
X5                        float64
X6                        float64
X7                        float64
X8                        float64
X9                        float64
X10                       float64
X11                       float64
X12                       float64
X13                       float64
X14                       float64
X15                       float64
X16                       float64
X17                       float64
X18                       float64
X19                       float64
X20                       float64
X21                       float64
X22                       float64
X23                       float64
X3_undefined                int64
X4_undefined                int64
bal_september             float64
bal_august                float64
bal_july                  float64
bal_june                  float64
bal_may       

In [49]:
from sklearn.model_selection import train_test_split
X_trainknn, X_testknn, y_trainknn, y_testknn = train_test_split(X, y, random_state=1)

In [50]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()  
scaler.fit(X_trainknn)

X_trainknn = scaler.transform(X_trainknn)  
X_testknn = scaler.transform(X_testknn)  

In [51]:
from sklearn.neighbors import KNeighborsClassifier

In [52]:
knn = KNeighborsClassifier(n_neighbors=1)

In [53]:
knn.fit(X_trainknn, y_trainknn)

KNeighborsClassifier(n_neighbors=1)

In [54]:
y_pred_class = knn.predict(X_testknn)

In [55]:
from sklearn import metrics
print('Accuracy:' + str(metrics.accuracy_score(y_testknn, y_pred_class)))

Accuracy:0.7292444444444445


In [56]:
metrics.precision_score(y_testknn, y_pred_class)

0.3959349593495935

In [57]:
metrics.recall_score(y_testknn, y_pred_class)

0.38437253354380424

In [58]:
1-df.Y.mean() # share of non-defaulters, i.e. the accuracy of always predicting "no default"

0.7765234010400462
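
The value above is the majority-class baseline. As a sketch of why accuracy alone is a weak yardstick here, the class counts from `y.value_counts()` can be plugged into the standard accuracy and F1 formulas for a hypothetical model that always predicts "no default":

```python
# Class counts reported by the notebook for the full data set.
counts = {0: 17471, 1: 5028}
total = sum(counts.values())

# Always predicting class 0 is right for every non-defaulter.
baseline_accuracy = counts[0] / total

def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0 when there are no true positives."""
    denominator = 2 * tp + fp + fn
    return 0.0 if tp == 0 or denominator == 0 else 2 * tp / denominator

# The always-negative model finds no defaulters: TP = 0, FP = 0, FN = all defaulters.
baseline_f1 = f1_from_counts(tp=0, fp=0, fn=counts[1])

print(round(baseline_accuracy, 4), baseline_f1)
```

The 1-nearest-neighbor accuracy of about 0.729 sits below this baseline (with the caveat that it was measured on the test split only), which is one reason the assessment asks for the F1 score instead.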

In [59]:
print('F-Score:' + str(metrics.f1_score(y_testknn, y_pred_class)))

F-Score:0.39006808169803764


In [146]:
from sklearn.model_selection import train_test_split
X_trainknn20, X_testknn20, y_trainknn20, y_testknn20 = train_test_split(X, y, random_state=1)

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()  
scaler.fit(X_trainknn20)

X_trainknn20 = scaler.transform(X_trainknn20)  
X_testknn20 = scaler.transform(X_testknn20)  

from sklearn.neighbors import KNeighborsClassifier

knn20 = KNeighborsClassifier(n_neighbors=20)

knn20.fit(X_trainknn20, y_trainknn20)

y_pred_class20 = knn20.predict(X_testknn20)

from sklearn import metrics
print('Accuracy:' + str(metrics.accuracy_score(y_testknn20, y_pred_class20)))

1-df.Y.mean()

print('F-Score:' + str(metrics.f1_score(y_testknn20, y_pred_class20)))

Accuracy:0.8030222222222222
F-Score:0.3725934314835787


In [143]:
metrics.precision_score(y_testknn20, y_pred_class20)

0.6593186372745491

In [144]:
metrics.recall_score(y_testknn20, y_pred_class20)

0.2596685082872928

In [147]:
import pickle
with open('knn20.pkl', 'wb') as mod:
    pickle.dump(knn20, mod)
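
The two KNN fits above compare k = 1 and k = 20 by hand. A possible way to tune `n_neighbors` with a grid search is sketched below. It runs on synthetic data from `make_classification` (not the credit-card data), and the candidate values for k are arbitrary assumptions. Putting the scaler inside a pipeline means it is refit on each training fold, so the validation folds do not influence the scaling.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data; NOT the notebook's data set.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.78, 0.22], random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=1
)

# Scaling + KNN in one estimator so the grid search scales each fold separately.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
param_grid = {"kneighborsclassifier__n_neighbors": [1, 5, 11, 21]}

gs_knn = GridSearchCV(pipe, param_grid, scoring="f1", cv=5)
gs_knn.fit(X_tr, y_tr)

best_k = gs_knn.best_params_["kneighborsclassifier__n_neighbors"]
test_f1 = f1_score(y_te, gs_knn.predict(X_te))
print(best_k, round(test_f1, 3))
```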

## Decision Tree

In [63]:
y.value_counts()

0.0    17471
1.0     5028
Name: Y, dtype: int64

In [64]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import make_scorer, accuracy_score

In [65]:
X_traindt, X_testdt, y_traindt, y_testdt = train_test_split(X, y, random_state=1)

In [66]:
from collections import Counter
from imblearn.under_sampling import TomekLinks

In [67]:
X_traintl, X_testtl, y_traintl, y_testtl = train_test_split(X, y, random_state=1 )

In [68]:
y_traintl.value_counts()

0.0    13113
1.0     3761
Name: Y, dtype: int64

In [69]:
tl = TomekLinks()
X_res, y_res = tl.fit_resample(X_traintl, y_traintl)
print('Resampled dataset shape %s' % Counter(y_res))


Resampled dataset shape Counter({0.0: 12075, 1.0: 3761})
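
In the output above only the majority class shrinks (13113 to 12075) while the minority count stays at 3761. A Tomek link is a pair of points from different classes that are each other's nearest neighbor, and the resampler removes the majority-class member of each pair. The pure-Python sketch below illustrates that idea on a hypothetical one-dimensional feature. It is a toy version for intuition, not imblearn's implementation.

```python
def tomek_links(points, labels):
    """Return index pairs that are mutual nearest neighbours with different labels."""
    def nearest(i):
        others = [j for j in range(len(points)) if j != i]
        return min(others, key=lambda j: abs(points[i] - points[j]))

    links = []
    for i in range(len(points)):
        j = nearest(i)
        # i < j avoids reporting the same pair twice.
        if i < j and nearest(j) == i and labels[i] != labels[j]:
            links.append((i, j))
    return links

points = [0.0, 0.4, 1.0, 1.2, 3.0]  # hypothetical 1-D feature values
labels = [0, 0, 0, 1, 1]            # class 0 is the majority

links = tomek_links(points, labels)
majority = max(set(labels), key=labels.count)

# Drop only the majority-class member of each link.
to_drop = {i for pair in links for i in pair if labels[i] == majority}
kept = [i for i in range(len(points)) if i not in to_drop]

print(links, kept)
```

Here the points at 1.0 (class 0) and 1.2 (class 1) form the only link, so the class-0 point is removed and the class boundary becomes slightly cleaner.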


In [70]:
tl = TomekLinks()
X_res, y_res = tl.fit_resample(X_traintl, y_traintl)  # fit_sample is the older, deprecated name for this method


In [71]:
y_res.value_counts()

0.0    12075
1.0     3761
Name: Y, dtype: int64

In [72]:
from sklearn.metrics import accuracy_score, f1_score, recall_score

tomek_clf = DecisionTreeClassifier(max_depth=3, min_samples_leaf=16, class_weight = 'balanced',  random_state = 42)
tomek_clf = tomek_clf.fit(X_res, y_res)
tomek_pred_train=tomek_clf.predict(X_traintl)
tomek_pred_test = tomek_clf.predict(X_testtl)
print('Training F1 score: ', f1_score(y_traintl, tomek_pred_train))
print('Test F1 score: ', f1_score(y_testtl, tomek_pred_test))

Training F1 score:  0.518679807471923
Test F1 score:  0.5185694635488308


In [None]:
##import pickle
##mod = open('gs_forest2.pkl', 'wb')
##pickle.dump(gs_forest2.best_estimator_, mod)
##mod.close()

In [73]:
from sklearn.metrics import accuracy_score, f1_score, recall_score

tomek_clf = DecisionTreeClassifier(max_depth=8, min_samples_leaf=31, class_weight = 'balanced', max_leaf_nodes = 23,  random_state = 42)
tomek_clf = tomek_clf.fit(X_res, y_res)
tomek_pred_train=tomek_clf.predict(X_traintl)
tomek_pred_test = tomek_clf.predict(X_testtl)
print('Training F1 score: ', f1_score(y_traintl, tomek_pred_train))
print('Test F1 score: ', f1_score(y_testtl, tomek_pred_test))

Training F1 score:  0.5277139718366124
Test F1 score:  0.508274231678487


In [136]:
from sklearn.model_selection import GridSearchCV
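# Note: max_leaf_nodes must be None or an integer >= 2, so the candidates with 0 and 1 cannot be fit
param_dict={'max_depth': range(1,10,1),'criterion': ['gini','entropy'], 'min_samples_leaf' : range(10,40,1), 'max_leaf_nodes': range(0,30,1), 'class_weight': ['balanced']}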

# scoring='f1' is the only metric here, so the string passed to refit should
# simply act like refit=True: the estimator with the best cross-validated
# F1 score is refit on all of X_res, y_res.
# That estimator is made available at ``gs.best_estimator_`` along with
# attributes like ``gs.best_score_``, ``gs.best_params_`` and
# ``gs.best_index_``
gs = GridSearchCV(DecisionTreeClassifier(random_state=42),
                  param_grid= param_dict,
                  scoring='f1', cv=5, refit='AUC', return_train_score=True, verbose = 1, n_jobs=-1)
gs.fit(X_res, y_res)

Fitting 5 folds for each of 16200 candidates, totalling 81000 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:    1.9s
[Parallel(n_jobs=-1)]: Done 440 tasks      | elapsed:    6.5s
[Parallel(n_jobs=-1)]: Done 738 tasks      | elapsed:    9.2s
[Parallel(n_jobs=-1)]: Done 1276 tasks      | elapsed:   15.1s
[Parallel(n_jobs=-1)]: Done 2176 tasks      | elapsed:   29.0s
[Parallel(n_jobs=-1)]: Done 3276 tasks      | elapsed:   44.8s
[Parallel(n_jobs=-1)]: Done 4576 tasks      | elapsed:  1.0min
[Parallel(n_jobs=-1)]: Done 7484 tasks      | elapsed:  1.7min
[Parallel(n_jobs=-1)]: Done 9636 tasks      | elapsed:  2.2min
[Parallel(n_jobs=-1)]: Done 10712 tasks      | elapsed:  2.6min
[Parallel(n_jobs=-1)]: Done 11762 tasks      | elapsed:  3.0min
[Parallel(n_jobs=-1)]: Done 12912 tasks      | elapsed:  3.5min
[Parallel(n_jobs=-1)]: Done 14594 tasks      | elapsed:  4.0min
[Parallel(n_jobs=-1)]: Done 15944 tasks      | elapsed:  4.5min
[Parallel(n_jobs=-1)]: Done 17394 tasks

GridSearchCV(cv=5, estimator=DecisionTreeClassifier(random_state=42), n_jobs=-1,
             param_grid={'class_weight': ['balanced'],
                         'criterion': ['gini', 'entropy'],
                         'max_depth': range(1, 10),
                         'max_leaf_nodes': range(0, 30),
                         'min_samples_leaf': range(10, 40)},
             refit='AUC', return_train_score=True, scoring='f1', verbose=1)
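
The "16200 candidates, totalling 81000 fits" line follows directly from the grid: an exhaustive search tries every combination of values and fits each one once per fold. The sketch below redoes that count with the same ranges as `param_dict`. The extra tally of combinations using `max_leaf_nodes` below 2 is an added observation, since scikit-learn requires that parameter to be `None` or at least 2.

```python
from math import prod

# Same value ranges as the notebook's param_dict.
param_dict = {
    "max_depth": range(1, 10),
    "criterion": ["gini", "entropy"],
    "min_samples_leaf": range(10, 40),
    "max_leaf_nodes": range(0, 30),
    "class_weight": ["balanced"],
}
cv_folds = 5

# Number of combinations = product of the number of values per parameter.
n_candidates = prod(len(values) for values in param_dict.values())
n_fits = n_candidates * cv_folds

# Combinations that use an invalid max_leaf_nodes value (0 or 1).
n_invalid_leaf_values = sum(1 for v in param_dict["max_leaf_nodes"] if v < 2)
n_invalid = n_candidates // len(param_dict["max_leaf_nodes"]) * n_invalid_leaf_values

print(n_candidates, n_fits, n_invalid)
```

Starting the `max_leaf_nodes` range at 2 would drop those 1080 candidates, and a coarser step on `min_samples_leaf` would shrink the search much further.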

In [137]:
gs.best_score_

0.5405737243280145

In [138]:
gs.best_params_

{'class_weight': 'balanced',
 'criterion': 'entropy',
 'max_depth': 4,
 'max_leaf_nodes': 15,
 'min_samples_leaf': 32}

In [140]:
y_predsgs=gs.best_estimator_.predict(X_testtl)
f1_gs=metrics.f1_score(y_testtl,y_predsgs) # separate name so the imported f1_score function is not overwritten
f1_gs

0.5246387550944794

In [141]:
import pickle
with open('gs.pkl', 'wb') as mod:
    pickle.dump(gs.best_estimator_, mod)

In [74]:
clf = DecisionTreeClassifier(class_weight='balanced')
clf = clf.fit(X_traindt,y_traindt)
y_pred_traindt = clf.predict(X_traindt)
y_pred_testdt = clf.predict(X_testdt)

In [75]:
print("Training F1 Score:",metrics.f1_score(y_traindt, y_pred_traindt))
print("Testing F1 Score:",metrics.f1_score(y_testdt, y_pred_testdt))

Training F1 Score: 0.9990702616549343
Testing F1 Score: 0.37594583831142964


In [77]:
from sklearn.ensemble import RandomForestClassifier

X_trainrf, X_testrf, y_trainrf, y_testrf = train_test_split(X, y, random_state=1)

In [133]:
rf = RandomForestClassifier(class_weight='balanced')

param_dictrf={'n_estimators': [100,200], 'max_depth': range(5,10,1), 'criterion': ['gini','entropy'],'min_samples_leaf' : range(20,40,1), 'max_leaf_nodes': range(15,30,1)}


In [134]:
from sklearn.model_selection import GridSearchCV

gs_forest=GridSearchCV(rf,param_dictrf,scoring='f1',cv=3,verbose=1,n_jobs=-1)

In [135]:
gs_forest.fit(X_trainrf,y_trainrf)

Fitting 3 folds for each of 6000 candidates, totalling 18000 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:   11.7s


KeyboardInterrupt: 

In [86]:
gs_forest.best_estimator_

RandomForestClassifier(class_weight='balanced', max_depth=5, max_leaf_nodes=17,
                       min_samples_leaf=21)

In [87]:
gs_forest.best_score_

0.5413298455632769

In [89]:
y_predsrf=gs_forest.best_estimator_.predict(X_testrf)
f1_rf=metrics.f1_score(y_testrf,y_predsrf) # separate name so the imported f1_score function is not overwritten

In [90]:
f1_rf

0.5398072117101035

In [93]:
rf2 = RandomForestClassifier(class_weight='balanced')

param_dictrf2={'n_estimators': range(100,500,100), 'criterion': ['gini','entropy'], 'max_depth': range(1,10,1), 'min_samples_leaf' : range(15,30,1), 'max_leaf_nodes': range(10,20,1)}


In [94]:
from sklearn.model_selection import GridSearchCV

gs_forest2=GridSearchCV(rf2,param_dictrf2,scoring='f1',cv=3,verbose=1,n_jobs=-2)

In [95]:
gs_forest2.fit(X_trainrf,y_trainrf)

Fitting 3 folds for each of 10800 candidates, totalling 32400 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 11 concurrent workers.
[Parallel(n_jobs=-2)]: Done  28 tasks      | elapsed:    6.6s
[Parallel(n_jobs=-2)]: Done 178 tasks      | elapsed:   32.4s
[Parallel(n_jobs=-2)]: Done 428 tasks      | elapsed:  1.3min
[Parallel(n_jobs=-2)]: Done 778 tasks      | elapsed:  2.5min
[Parallel(n_jobs=-2)]: Done 1228 tasks      | elapsed:  3.9min
[Parallel(n_jobs=-2)]: Done 1778 tasks      | elapsed:  5.6min
[Parallel(n_jobs=-2)]: Done 2428 tasks      | elapsed:  8.5min
[Parallel(n_jobs=-2)]: Done 3178 tasks      | elapsed: 12.1min
[Parallel(n_jobs=-2)]: Done 4028 tasks      | elapsed: 16.6min
[Parallel(n_jobs=-2)]: Done 4978 tasks      | elapsed: 22.0min
[Parallel(n_jobs=-2)]: Done 6028 tasks      | elapsed: 28.8min
[Parallel(n_jobs=-2)]: Done 7178 tasks      | elapsed: 37.1min
[Parallel(n_jobs=-2)]: Done 8428 tasks      | elapsed: 47.1min
[Parallel(n_jobs=-2)]: Done 9778 tasks      | elapsed: 58.1min
[Parallel(n_jobs=-2)]: Done 11228 tasks      

GridSearchCV(cv=3, estimator=RandomForestClassifier(class_weight='balanced'),
             n_jobs=-2,
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': range(1, 10),
                         'max_leaf_nodes': range(10, 20),
                         'min_samples_leaf': range(15, 30),
                         'n_estimators': range(100, 500, 100)},
             scoring='f1', verbose=1)

In [96]:
gs_forest2.best_estimator_

RandomForestClassifier(class_weight='balanced', max_depth=4, max_leaf_nodes=11,
                       min_samples_leaf=29, n_estimators=200)

In [97]:
gs_forest2.best_score_

0.5430203894407452

In [100]:
y_predsrf2=gs_forest2.best_estimator_.predict(X_testrf)
f1_rf2=metrics.f1_score(y_testrf,y_predsrf2) # separate name so the imported f1_score function is not overwritten

In [101]:
f1_rf2

0.5418535127055306

In [103]:
import pickle
with open('gs_forest2.pkl', 'wb') as mod:
    pickle.dump(gs_forest2.best_estimator_, mod)

In [122]:
rf3 = RandomForestClassifier(class_weight='balanced')

param_dictrf3={'n_estimators': range(100,300,100), 'criterion': ['gini','entropy'], 'max_depth': range(1,5,1), 'min_samples_leaf' : range(20,25,1), 'max_leaf_nodes': range(2,6,1)}


In [123]:
gs_forest3=GridSearchCV(rf3,param_dictrf3,scoring='f1',cv=3,verbose=1,n_jobs=-2)

In [124]:
gs_forest3.fit(X_res,y_res)

Fitting 3 folds for each of 320 candidates, totalling 960 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 11 concurrent workers.
[Parallel(n_jobs=-2)]: Done  28 tasks      | elapsed:    4.8s
[Parallel(n_jobs=-2)]: Done 178 tasks      | elapsed:   23.9s
[Parallel(n_jobs=-2)]: Done 428 tasks      | elapsed:  1.2min
[Parallel(n_jobs=-2)]: Done 778 tasks      | elapsed:  2.2min
[Parallel(n_jobs=-2)]: Done 960 out of 960 | elapsed:  2.9min finished


GridSearchCV(cv=3, estimator=RandomForestClassifier(class_weight='balanced'),
             n_jobs=-2,
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': range(1, 5),
                         'max_leaf_nodes': range(2, 6),
                         'min_samples_leaf': range(20, 25),
                         'n_estimators': range(100, 300, 100)},
             scoring='f1', verbose=1)

In [125]:
gs_forest3.best_estimator_

RandomForestClassifier(class_weight='balanced', criterion='entropy',
                       max_depth=3, max_leaf_nodes=5, min_samples_leaf=24,
                       n_estimators=200)

In [126]:
gs_forest3.best_score_

0.5580883118005331

In [127]:
y_predsrf3=gs_forest3.best_estimator_.predict(X_testrf)
f1_rf3=metrics.f1_score(y_testrf,y_predsrf3) # separate name so the imported f1_score function is not overwritten

In [128]:
f1_rf3

0.5395176252319109

## 6. Model Evaluation

## 7. Final Model