# Applying Classification Modeling
The goal of this week's assessment is to find the model which best predicts whether or not a person will default on their bank loan. In doing so, we want to utilize all of the different tools we have learned over the course: data cleaning, EDA, feature engineering/transformation, feature selection, hyperparameter tuning, and model evaluation. 


#### Data Set Information:

This research aimed at the case of customers default payments in Taiwan and compares the predictive accuracy of probability of default among six data mining methods. From the perspective of risk management, the result of predictive accuracy of the estimated probability of default will be more valuable than the binary result of classification - credible or not credible clients. Because the real probability of default is unknown, this study presented the novel Sorting Smoothing Method to estimate the real probability of default. With the real probability of default as the response variable (Y), and the predictive probability of default as the independent variable (X), the simple linear regression result (Y = A + BX) shows that the forecasting model produced by artificial neural network has the highest coefficient of determination; its regression intercept (A) is close to zero, and regression coefficient (B) to one. Therefore, among the six data mining techniques, artificial neural network is the only one that can accurately estimate the real probability of default. 

- NT is the abbreviation for New Taiwain. 


#### Attribute Information:

This research employed a binary variable, default payment (Yes = 1, No = 0), as the response variable. This study reviewed the literature and used the following 23 variables as explanatory variables: 
- X1: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit. 
- X2: Gender (1 = male; 2 = female). 
- X3: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others). 
- X4: Marital status (1 = married; 2 = single; 3 = others). 
- X5: Age (year). 
- X6 - X11: History of past payment. We tracked the past monthly payment records (from April to September, 2005) as follows: 
    - X6 = the repayment status in September, 2005; 
    - X7 = the repayment status in August, 2005; . . .;
    - etc...
    - X11 = the repayment status in April, 2005. 
    - The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above. 
- X12-X17: Amount of bill statement (NT dollar). 
    - X12 = amount of bill statement in September, 2005;
    - etc...
    - X13 = amount of bill statement in August, 2005; . . .; 
    - X17 = amount of bill statement in April, 2005. 
- X18-X23: Amount of previous payment (NT dollar). 
    - X18 = amount paid in September, 2005; 
    - X19 = amount paid in August, 2005; . . .;
    - etc...
    - X23 = amount paid in April, 2005. 




You will fit three different models (KNN, Logistic Regression, and Decision Tree Classifier) to predict credit card defaults and use gridsearch to find the best hyperparameters for those models. Then you will compare the performance of those three models on a test set to find the best one.  


## Process/Expectations

- You will be working in pairs for this assessment

### Please have ONE notebook and be prepared to explain how you worked in your pair.

1. Clean up your data set so that you can perform an EDA. 
    - This includes handling null values, categorical variables, removing unimportant columns, and removing outliers.
2. Perform EDA to identify opportunities to create new features.
    - [Great Example of EDA for classification](https://www.kaggle.com/stephaniestallworth/titanic-eda-classification-end-to-end) 
    - [Using Pairplots with Classification](https://towardsdatascience.com/visualizing-data-with-pair-plots-in-python-f228cf529166)
3. Engineer new features. 
    - Create polynomial and/or interaction features. 
    - Additionaly, you must also create **at least 2 new features** that are not interactions or polynomial transformations. 
        - *For example, you can create a new dummy variable that based on the value of a continuous variable (billamount6 >2000) or take the average of some past amounts.*
4. Perform some feature selection. 
    
5. You must fit **three** models to your data and tune **at least 1 hyperparameter** per model. 
6. Using the F-1 Score, evaluate how well your models perform and identify your best model.
7. Using information from your EDA process and your model(s) output provide insight as to which borrowers are more likely to deafult


In [1]:
# import libraries

import pandas as pd
import numpy as np
pd.set_option('display.max_columns', 50)

## 1. Data Cleaning

In [2]:
df = pd.read_csv('training_data.csv' , index_col=0)

In [3]:
df.Y.value_counts()

0                             17471
1                              5028
default payment next month        1
Name: Y, dtype: int64

In [4]:
df.head()

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,X11,X12,X13,X14,X15,X16,X17,X18,X19,X20,X21,X22,X23,Y
28835,220000,2,1,2,36,0,0,0,0,0,0,222598,222168,217900,221193,181859,184605,10000,8018,10121,6006,10987,143779,1
25329,200000,2,3,2,29,-1,-1,-1,-1,-1,-1,326,326,326,326,326,326,326,326,326,326,326,326,0
18894,180000,2,1,2,27,-2,-2,-2,-2,-2,-2,0,0,0,0,0,0,0,0,0,0,0,0,0
690,80000,1,2,2,32,0,0,0,0,0,0,51372,51872,47593,43882,42256,42527,1853,1700,1522,1548,1488,1500,0
6239,10000,1,2,2,27,0,0,0,0,0,0,8257,7995,4878,5444,2639,2697,2000,1100,600,300,300,1000,1


In [5]:
df['X6']

28835     0
25329    -1
18894    -2
690       0
6239      0
         ..
16247     0
2693     -1
8076      1
20213    -1
7624      1
Name: X6, Length: 22500, dtype: object

In [6]:
df.dtypes

X1     object
X2     object
X3     object
X4     object
X5     object
X6     object
X7     object
X8     object
X9     object
X10    object
X11    object
X12    object
X13    object
X14    object
X15    object
X16    object
X17    object
X18    object
X19    object
X20    object
X21    object
X22    object
X23    object
Y      object
dtype: object

In [7]:
df['X3'].value_counts()

2            10516
1             7919
3             3713
5              208
4               90
6               42
0               11
EDUCATION        1
Name: X3, dtype: int64

## 2. EDA

In [8]:
df.drop('ID', axis = 0, inplace = True)

In [9]:
df = df.apply(lambda x: x.astype(float))

In [10]:
# Split data to be used in the models
# Create matrix of features
X = df.drop('Y', axis = 1) # grabs everything else but 'Survived'

In [11]:
# Create target variable
y = df['Y'] # y is the column we're trying to predict

In [12]:
## For our Marital Status column, we have 3 groups in our key but 4 in the dataframe
## We need to figure out where 0 belongs

## ! is married, 2 is single, 3 is others

df.groupby('X4')['X5'].mean()

X4
0.0    37.590909
1.0    40.027857
2.0    31.412606
3.0    42.893162
Name: X5, dtype: float64

In [13]:
## 0 appears to belong to 1 aka the married group but I should investigate further

In [14]:
df.groupby('X4')['X1'].mean()

## We have 4 groups in our key for Education Level but 7 in our dataframe

X4
0.0    140681.818182
1.0    181372.437469
2.0    156242.115417
3.0    103888.888889
Name: X1, dtype: float64

In [15]:
df.groupby('X3')['X5'].mean()

##6 and 0 look like they belong to 3 aka high school education
## 5 might belong to 2 aka college education but I should investigate further

X3
0.0    40.545455
1.0    34.231847
2.0    34.658806
3.0    40.157285
4.0    34.666667
5.0    35.937500
6.0    43.904762
Name: X5, dtype: float64

In [16]:
df.groupby('X3')['X1'].mean()

## 1 graduate, 2 college, 3 high school, 4 others

X3
0.0    220000.000000
1.0    213389.316833
2.0    146419.360974
3.0    125641.712901
4.0    230000.000000
5.0    161951.923077
6.0    135000.000000
Name: X1, dtype: float64

In [17]:
df['X3']

28835    1.0
25329    3.0
18894    1.0
690      2.0
6239     2.0
        ... 
16247    2.0
2693     1.0
8076     3.0
20213    3.0
7624     1.0
Name: X3, Length: 22499, dtype: float64

In [18]:
df[(df['X3'] == 5) | (df['X3'] == 6) | (df['X3'] == 0)]['X4'].value_counts()

1.0    139
2.0    117
3.0      5
Name: X4, dtype: int64

In [19]:
df[df['X4'] == 0]['X5'].mean()

37.59090909090909

In [20]:
conditions = [
    df['X3'].eq(0),
    df['X3'].eq(1),
    df['X3'].eq(2),
    df['X3'].eq(3),
    df['X3'].eq(4),
    df['X3'].eq(5),
    df['X3'].eq(6),
]

choices = [
    0,
    0,
    0,
    0,
    0,
    1,
    1,   
]

df['X3_undefined'] = np.select(conditions,choices)

In [21]:
df['X3_undefined'].value_counts()

0    22249
1      250
Name: X3_undefined, dtype: int64

In [22]:
conditions = [
    df['X3'].eq(5),
    df['X3'].eq(6)
]

choices = [
    0,
    0  
]

df['X3'] = np.select(conditions,choices, default = df['X3'])

In [23]:
df['X3'].value_counts()

2.0    10516
1.0     7919
3.0     3713
0.0      261
4.0       90
Name: X3, dtype: int64

In [24]:
df['X4'].value_counts()

2.0    12026
1.0    10195
3.0      234
0.0       44
Name: X4, dtype: int64

In [25]:
df['X4_undefined'] = np.where(df['X4'] == 0, 1,0)

In [26]:
df['X4_undefined'].value_counts()

0    22455
1       44
Name: X4_undefined, dtype: int64

In [27]:
conditions = [
    df['X4'].eq(0)
]

choices = [
    3
]

df['X4'] = np.select(conditions,choices, default = df['X4'])

In [28]:
df['X4'].value_counts()

2.0    12026
1.0    10195
3.0      278
Name: X4, dtype: int64

## 3. Feature Engineering

### Remaining balances after each monthly payment

In [29]:
df['bal_september'] = df['X12'] - df['X18'] #amount left to pay September
df['bal_august'] = df['X13'] - df['X19'] #amount left to pay August
df['bal_july'] = df['X14'] - df['X20'] #amount left to pay July
df['bal_june'] = df['X15'] - df['X21'] #amount left to pay June
df['bal_may'] = df['X16'] - df['X22'] #amount left to pay July
df['bal_april'] = df['X17'] - df['X23'] #amount left to pay April

### Monthly credit utilization

In [30]:
df['util_september'] = df['bal_september'] / df['X1'] #September credit utilization
df['util_august'] = df['bal_august'] / df['X1'] #August credit utilization
df['util_july'] = df['bal_july'] / df['X1'] #July credit utilization
df['util_june'] = df['bal_june'] / df['X1'] #June credit utilization
df['util_may'] = df['bal_may'] / df['X1'] #July credit utilization
df['util_april'] = df['bal_april'] / df['X1'] #April credit utilization

### Utilization Trend

In [31]:
# # test = [0.12,0.23,0.34,0.45,0.6]
# test = [0.6, 0.5, 0.54, 0.2, 0.3]

In [32]:
# for index, i in enumerate(test):
#     if index == 0:
#         trend = 0
#     else:
#         trend += i - test[index - 1]
# trend

In [33]:
def util_trend(df,cols):
    for index, i in enumerate(cols):
        if index == 0:
            trend = 0
        else:
            trend += df[i] - df[cols[index - 1]]
    return trend

In [34]:
prefix = ['util']
months = ['september', 'august', 'july', 'june', 'may', 'april']
i = [prefix[0] + "_" + x for x in months]
df['util_trend'] = util_trend(df, i) #the overall trend of the credit utlization

In [35]:
i

['util_september',
 'util_august',
 'util_july',
 'util_june',
 'util_may',
 'util_april']

### Average Utilization

In [36]:
df['avg_util'] = (df[i[0]] + df[i[1]] + df[i[2]] + df[i[3]] + df[i[4]] + df[i[5]]) / 6 #average credit utilization

### Payment History

In [37]:
df['payment_hist'] = df.apply(lambda x:[x['X6'], x['X7'], x['X8'], x['X9'], x['X10'], x['X11']], axis = 1) 

looks like -2 is a value but i don't know what that means. Maybe they prepaid? Also, you can't be two months behind when you were up to date the month before.

According to Kaggle this is a possible interpretation:

-2 = Balance paid in full and no transactions this period (we may refer to this credit card account as having been 'inactive' this period)

-1 = Balance paid in full, but account has a positive balance at end of period due to recent transactions for which payment has not yet come due

0 = Customer paid the minimum due amount, but not the entire balance. I.e., the customer paid enough for their account to remain in good standing, but did revolve a balance

### Account Status

In [38]:
df['account_status'] = df['payment_hist'].apply(lambda lst:1 if max(set(lst), key=lst.count) > 0 else 0)

In [39]:
# 1 = bad standing, payments are not up to date for a majority of the time 0 = good standing the majority of time

### Ever owed more than thier credit limit

In [40]:
df['excess_debt'] = 1 * df.apply(lambda x: x['X12':'X17'] > x['X1'], axis = 1).any(axis = 1)

In [41]:
df['excess_debt'].value_counts()

0    19600
1     2899
Name: excess_debt, dtype: int64

### Continually owed more than thier credit limit

In [42]:
df['consistent_excess_debt'] = 1 * (df.apply(lambda x: x['X12':'X17'] > x['X1'], axis = 1).sum(axis = 1) > 1)

In [43]:
df['consistent_excess_debt'].value_counts()

0    20931
1     1568
Name: consistent_excess_debt, dtype: int64

In [44]:
df.head()

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,X11,X12,X13,X14,X15,X16,X17,X18,X19,X20,X21,X22,X23,Y,X3_undefined,X4_undefined,bal_september,bal_august,bal_july,bal_june,bal_may,bal_april,util_september,util_august,util_july,util_june,util_may,util_april,util_trend,avg_util,payment_hist,account_status,excess_debt,consistent_excess_debt
28835,220000.0,2.0,1.0,2.0,36.0,0.0,0.0,0.0,0.0,0.0,0.0,222598.0,222168.0,217900.0,221193.0,181859.0,184605.0,10000.0,8018.0,10121.0,6006.0,10987.0,143779.0,1.0,0,0,212598.0,214150.0,207779.0,215187.0,170872.0,40826.0,0.966355,0.973409,0.94445,0.978123,0.776691,0.185573,-0.780782,0.8041,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",0,1,1
25329,200000.0,2.0,3.0,2.0,29.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,326.0,326.0,326.0,326.0,326.0,326.0,326.0,326.0,326.0,326.0,326.0,326.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,"[-1.0, -1.0, -1.0, -1.0, -1.0, -1.0]",0,0,0
18894,180000.0,2.0,1.0,2.0,27.0,-2.0,-2.0,-2.0,-2.0,-2.0,-2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,"[-2.0, -2.0, -2.0, -2.0, -2.0, -2.0]",0,0,0
690,80000.0,1.0,2.0,2.0,32.0,0.0,0.0,0.0,0.0,0.0,0.0,51372.0,51872.0,47593.0,43882.0,42256.0,42527.0,1853.0,1700.0,1522.0,1548.0,1488.0,1500.0,0.0,0,0,49519.0,50172.0,46071.0,42334.0,40768.0,41027.0,0.618988,0.62715,0.575887,0.529175,0.5096,0.512837,-0.10615,0.562273,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",0,0,0
6239,10000.0,1.0,2.0,2.0,27.0,0.0,0.0,0.0,0.0,0.0,0.0,8257.0,7995.0,4878.0,5444.0,2639.0,2697.0,2000.0,1100.0,600.0,300.0,300.0,1000.0,1.0,0,0,6257.0,6895.0,4278.0,5144.0,2339.0,1697.0,0.6257,0.6895,0.4278,0.5144,0.2339,0.1697,-0.456,0.4435,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",0,0,0


## 4. Feature Selection

In [45]:
#kbest

In [46]:
#vif

## 5. Model Fitting and Hyperparameter Tuning
KNN, Logistic Regression, Decision Tree

In [47]:
# Split data to be used in the models
# Create matrix of features
X = df.drop(['Y', 'payment_hist'], axis = 1) # grabs everything else but 'Survived' is this suspoesed to be survived?


# Create target variable
y = df['Y'] # y is the column we're trying to predict

In [48]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)

In [104]:
#correct class imbalance
from imblearn.over_sampling import SMOTE

sm_base = SMOTE(random_state=27)
X_train_base, y_train_base = sm_base.fit_sample(X_train, y_train)

In [50]:
y_train_base.value_counts()

1.0    12204
0.0    12204
Name: Y, dtype: int64

In [51]:
non_binary_cols = list(((X.max() > 1) | (X.min() < 0)).index)

In [52]:
#only scale non binary columns
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

base_ct = ColumnTransformer([
        ('base_transform', StandardScaler(), non_binary_cols)
    ], remainder='passthrough')

base_ct.fit(X_train)
X_train_base_scaled = pd.DataFrame(data=base_ct.transform(X_train), columns = X_train.columns)
X_test_base_scaled = pd.DataFrame(data=base_ct.transform(X_test), columns = X_train.columns)

In [53]:
# run logit with non polys
from sklearn.linear_model import LogisticRegressionCV
logreg = LogisticRegressionCV(cv = 10, penalty='l1', solver = 'saga', max_iter=10000, scoring='f1', n_jobs = -1, verbose = 1)
logreg.fit(X_train_base_scaled, y_train_base)
y_preds_base = logreg.predict(X_test_base_scaled)

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 8 concurrent workers.


convergence after 1 epochs took 0 seconds
convergence after 3 epochs took 0 secondsconvergence after 3 epochs took 0 seconds

convergence after 3 epochs took 0 seconds
convergence after 3 epochs took 0 seconds
convergence after 3 epochs took 0 seconds
convergence after 3 epochs took 0 seconds
convergence after 4 epochs took 0 seconds
convergence after 21 epochs took 0 seconds
convergence after 22 epochs took 0 seconds
convergence after 22 epochs took 0 seconds
convergence after 22 epochs took 0 seconds
convergence after 21 epochs took 0 seconds
convergence after 22 epochs took 0 secondsconvergence after 21 epochs took 0 seconds

convergence after 22 epochs took 0 seconds
convergence after 25 epochs took 1 seconds
convergence after 28 epochs took 1 seconds
convergence after 31 epochs took 1 seconds
convergence after 30 epochs took 1 seconds
convergence after 31 epochs took 1 seconds
convergence after 29 epochs took 1 seconds
convergence after 32 epochs took 1 seconds
convergence after 3

[Parallel(n_jobs=-1)]: Done   6 out of  10 | elapsed:   33.9s remaining:   22.6s


convergence after 1474 epochs took 30 seconds
convergence after 678 epochs took 11 seconds
convergence after 139 epochs took 2 seconds
convergence after 51 epochs took 1 seconds
convergence after 9 epochs took 0 seconds
convergence after 2 epochs took 0 seconds
convergence after 2 epochs took 0 seconds
convergence after 138 epochs took 2 seconds
convergence after 46 epochs took 1 seconds
convergence after 9 epochs took 0 seconds
convergence after 2 epochs took 0 seconds
convergence after 2 epochs took 0 seconds
convergence after 791 epochs took 13 seconds
convergence after 151 epochs took 2 seconds
convergence after 46 epochs took 0 seconds
convergence after 9 epochs took 0 seconds
convergence after 2 epochs took 0 seconds
convergence after 2 epochs took 0 seconds
convergence after 1952 epochs took 37 seconds
convergence after 152 epochs took 2 seconds
convergence after 49 epochs took 0 seconds
convergence after 10 epochs took 1 seconds
convergence after 2 epochs took 0 seconds
converg

[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:   44.8s finished


In [54]:
from sklearn.metrics import accuracy_score, precision_score, f1_score, recall_score
acc_score_base = accuracy_score(y_test, y_preds_base)
precision_base = precision_score(y_test, y_preds_base)
recall_base = recall_score(y_test, y_preds_base)
f1_base = f1_score(y_test, y_preds_base)

print(f'accuracy: {acc_score_base}')
print(f'precision: {precision_base}')
print(f'recall: {recall_base}')
print(f'f1: {f1_base}')

accuracy: 0.6982222222222222
precision: 0.3874898456539399
recall: 0.6432906271072151
f1: 0.48365019011406846


In [55]:
list(zip(X.columns, logreg.coef_[0]))

[('X1', -0.16765286704579035),
 ('X2', -0.04803066944960413),
 ('X3', -0.06704701987310553),
 ('X4', -0.08333860958276146),
 ('X5', 0.06284497630691871),
 ('X6', 0.6591200681542689),
 ('X7', 0.20811075886638378),
 ('X8', 0.026827199529193435),
 ('X9', 0.0),
 ('X10', 0.05418909326154573),
 ('X11', -0.07379616085060292),
 ('X12', -0.023640119782739978),
 ('X13', 0.0),
 ('X14', 0.0),
 ('X15', 0.0),
 ('X16', 0.0),
 ('X17', -0.032458617978538266),
 ('X18', -0.3154243416304572),
 ('X19', -0.2262765040571774),
 ('X20', -0.028664083016322353),
 ('X21', -0.06385284888335216),
 ('X22', -0.06209078135561669),
 ('X23', -0.03775450433576465),
 ('X3_undefined', -0.18409690888460653),
 ('X4_undefined', -0.06857126331034123),
 ('bal_september', 0.0),
 ('bal_august', 0.0),
 ('bal_july', 0.08329538039642931),
 ('bal_june', 0.0),
 ('bal_may', 0.0),
 ('bal_april', 0.0),
 ('util_september', -0.32638801012580715),
 ('util_august', 0.181671319985354),
 ('util_july', 0.0),
 ('util_june', 0.05452170099455451),

### Logit with Poly2

In [56]:
from sklearn.preprocessing import PolynomialFeatures
poly2 = PolynomialFeatures(degree=2, include_bias=False, interaction_only = False)
poly2_data = poly2.fit_transform(X)
poly2_columns = poly2.get_feature_names(X.columns)
X_poly2 = pd.DataFrame(poly2_data, columns=poly2_columns)

In [57]:
X_train_poly2, X_test_poly2, y_train_poly2, y_test_poly2 = train_test_split(X_poly2, y, test_size = 0.3)

In [139]:
sm_poly2 = SMOTE(random_state=28)
X_train_poly2, y_train_poly2 = sm_poly2.fit_sample(X_train_poly2, y_train_poly2)

In [59]:
y_train_poly2.value_counts()

1.0    12222
0.0    12222
Name: Y, dtype: int64

In [60]:
non_binary_cols_poly2 = list(((X_poly2.max() > 1) | (X_poly2.min() < 0)).index)

In [61]:
poly2_ct = ColumnTransformer([
        ('poly2_transform', StandardScaler(), non_binary_cols_poly2)
    ], remainder='passthrough')

poly2_ct.fit(X_train_poly2)
X_train_poly2_scaled = pd.DataFrame(data=poly2_ct.transform(X_train_poly2), columns = X_train_poly2.columns)
X_test_poly2_scaled = pd.DataFrame(data=poly2_ct.transform(X_test_poly2), columns = X_train_poly2.columns)

In [63]:
X_train_poly2_scaled.shape[1]

945

In [133]:
from sklearn.feature_selection import SelectKBest, f_classif, SelectFpr
selector = SelectKBest(f_classif, k=190)

selector.fit(X_train_poly2_scaled, y_train_poly2)

  f = msb / msw


SelectKBest(k=190)

In [134]:
selected_columns = X_train_poly2_scaled.columns[selector.get_support()]
removed_columns = X_train_poly2_scaled.columns[~selector.get_support()]

In [113]:
selected_columns

Index(['X1', 'X6', 'X7', 'X8', 'X9', 'X10', 'X11', 'util_april', 'avg_util',
       'account_status',
       ...
       'bal_may account_status', 'bal_april account_status',
       'util_september account_status', 'util_august account_status',
       'util_july account_status', 'util_june account_status',
       'util_may account_status', 'util_april account_status',
       'avg_util account_status', 'account_status^2'],
      dtype='object', length=200)

In [135]:
logreg_poly2 = LogisticRegressionCV(cv = 10, penalty='l1', tol = 0.0007, solver = 'saga', max_iter=1000, scoring='f1', n_jobs = -1, verbose =2)
logreg_poly2.fit(X_train_poly2_scaled[selected_columns], y_train_poly2)
y_preds_poly2 = logreg_poly2.predict(X_test_poly2_scaled[selected_columns])

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 8 concurrent workers.


convergence after 1 epochs took 1 seconds
convergence after 1 epochs took 1 seconds
convergence after 1 epochs took 1 seconds
convergence after 1 epochs took 1 seconds
convergence after 3 epochs took 1 seconds
convergence after 3 epochs took 1 seconds
convergence after 3 epochs took 1 seconds
convergence after 3 epochs took 1 seconds
convergence after 249 epochs took 17 seconds
convergence after 254 epochs took 18 seconds
convergence after 252 epochs took 18 seconds
convergence after 254 epochs took 18 seconds
convergence after 263 epochs took 18 seconds
convergence after 258 epochs took 18 seconds
convergence after 259 epochs took 18 seconds
convergence after 259 epochs took 18 seconds
convergence after 201 epochs took 15 seconds
convergence after 216 epochs took 15 seconds
convergence after 222 epochs took 16 seconds
convergence after 219 epochs took 16 seconds
convergence after 234 epochs took 16 seconds
convergence after 252 epochs took 18 seconds
convergence after 250 epochs took 

[Parallel(n_jobs=-1)]: Done   7 out of  10 | elapsed:  1.5min remaining:   37.7s


convergence after 13 epochs took 1 seconds
convergence after 2 epochs took 0 seconds
convergence after 2 epochs took 0 seconds
convergence after 2 epochs took 0 seconds
convergence after 254 epochs took 13 seconds
convergence after 258 epochs took 13 seconds
convergence after 211 epochs took 9 seconds
convergence after 255 epochs took 10 seconds
convergence after 335 epochs took 13 seconds
convergence after 271 epochs took 12 seconds
convergence after 157 epochs took 7 seconds
convergence after 169 epochs took 8 seconds
convergence after 67 epochs took 3 seconds
convergence after 9 epochs took 1 seconds
convergence after 2 epochs took 0 seconds
convergence after 2 epochs took 0 seconds
convergence after 2 epochs took 0 seconds
convergence after 67 epochs took 3 seconds
convergence after 9 epochs took 0 seconds
convergence after 2 epochs took 0 seconds
convergence after 2 epochs took 0 seconds
convergence after 2 epochs took 0 seconds


[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:  2.1min finished


convergence after 31 epochs took 2 seconds


In [307]:
from sklearn.linear_model import LogisticRegression
logreg_poly2_1 = LogisticRegression(C = 0.3, penalty='l1', tol = 0.0007, solver = 'saga', max_iter=1000, n_jobs = -1, verbose =2)
logreg_poly2_1.fit(X_train_poly2_scaled[selected_columns], y_train_poly2)
y_preds_poly2_1 = logreg_poly2_1.predict(X_test_poly2_scaled[selected_columns])

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 8 concurrent workers.


convergence after 386 epochs took 17 seconds


[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:   17.7s finished


In [136]:
acc_score_poly2 = accuracy_score(y_test_poly2, y_preds_poly2)
precision_poly2 = precision_score(y_test_poly2, y_preds_poly2)
recall_poly2 = recall_score(y_test_poly2, y_preds_poly2)
f1_poly2 = f1_score(y_test_poly2, y_preds_poly2)

print(f'accuracy: {acc_score_poly2}')
print(f'precision: {precision_poly2}')
print(f'recall: {recall_poly2}')
print(f'f1: {f1_poly2}')

accuracy: 0.7752592592592593
precision: 0.49523809523809526
recall: 0.5542971352431713
f1: 0.5231059415278214


In [308]:
acc_score_poly2 = accuracy_score(y_test_poly2, y_preds_poly2_1)
precision_poly2 = precision_score(y_test_poly2, y_preds_poly2_1)
recall_poly2 = recall_score(y_test_poly2, y_preds_poly2_1)
f1_poly2 = f1_score(y_test_poly2, y_preds_poly2_1)

print(f'accuracy: {acc_score_poly2}')
print(f'precision: {precision_poly2}')
print(f'recall: {recall_poly2}')
print(f'f1: {f1_poly2}')

accuracy: 0.7758518518518519
precision: 0.4964114832535885
recall: 0.5529646902065289
f1: 0.5231641979199496


### Random Forest

In [140]:
X_train_randfor, X_test_randfor, y_train_randfor, y_test_randfor = train_test_split(X, y, test_size = 0.3)

sm_randfor = SMOTE(random_state=29)
X_train_randfor, y_train_randfor = sm_randfor.fit_sample(X_train_randfor, y_train_randfor)

In [254]:
from sklearn.ensemble import RandomForestClassifier

rand_for = RandomForestClassifier(max_depth=3, min_samples_leaf=25, max_features =10, random_state=30, oob_score=True)

In [255]:
rand_for.fit(X_train_randfor, y_train_randfor)

RandomForestClassifier(max_depth=3, max_features=10, min_samples_leaf=25,
                       oob_score=True, random_state=30)

In [256]:
y_preds_randfor = rand_for.predict(X_test_randfor)

In [200]:
from sklearn.model_selection import GridSearchCV

In [165]:
# creating our parameters to test
param_dict={'max_features': range(10,80,10)}
#            'min_samples_leaf' : range(10,20,1),
#            'max_features': range(10,80,10)}
#create the instance of GridSearchCV using the F1 metric for our scoring. 
grid_tree=GridSearchCV(rand_for, param_grid = param_dict, cv=8, scoring='f1', verbose = True, n_jobs = -1)

In [166]:
#fit the Gridsearch to our data
grid_tree.fit(X_train_randfor, y_train_randfor)

Fitting 8 folds for each of 7 candidates, totalling 56 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:  2.6min
[Parallel(n_jobs=-1)]: Done  56 out of  56 | elapsed:  2.6min finished


GridSearchCV(cv=8, estimator=RandomForestClassifier(), n_jobs=-1,
             param_grid={'max_features': range(10, 80, 10)}, scoring='f1',
             verbose=True)

In [167]:
grid_tree.best_params_

{'max_features': 10}

In [164]:
grid_tree.best_params_

{'max_depth': 6}

In [157]:
grid_tree.best_params_

{'max_depth': 26, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 10}

In [155]:
y_preds_randfor = grid_tree.best_estimator_.predict(X_test_randfor)

In [257]:
acc_score_randfor = accuracy_score(y_test_randfor, y_preds_randfor)
precision_randfor = precision_score(y_test_randfor, y_preds_randfor)
recall_randfor = recall_score(y_test_randfor, y_preds_randfor)
f1_randfor = f1_score(y_test_randfor, y_preds_randfor)

print(f'accuracy: {acc_score_randfor}')
print(f'precision: {precision_randfor}')
print(f'recall: {recall_randfor}')
print(f'f1: {f1_randfor}')

accuracy: 0.798074074074074
precision: 0.5349500713266762
recall: 0.5133470225872689
f1: 0.5239259517988124


## 6. Model Evaluation

In [259]:
from sklearn.ensemble import VotingClassifier
voting = VotingClassifier(estimators=[('lr_base', logreg), ('lr_poly2', logreg_poly2), ('rand_for', rand_for)], voting = 'hard')

In [264]:
voting.fit(X_train_poly2_scaled, y_train_poly2)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   6 out of  10 | elapsed: 46.1min remaining: 30.7min
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed: 68.6min finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   7 out of  10 | elapsed:  8.2min remaining:  3.5min
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed: 12.2min finished


convergence after 32 epochs took 6 seconds


VotingClassifier(estimators=[('lr_base',
                              LogisticRegressionCV(cv=10, max_iter=10000,
                                                   n_jobs=-1, penalty='l1',
                                                   scoring='f1', solver='saga',
                                                   verbose=1)),
                             ('lr_poly2',
                              LogisticRegressionCV(cv=10, max_iter=1000,
                                                   n_jobs=-1, penalty='l1',
                                                   scoring='f1', solver='saga',
                                                   tol=0.0007, verbose=2)),
                             ('rand_for',
                              RandomForestClassifier(max_depth=3,
                                                     max_features=10,
                                                     min_samples_leaf=25,
                                                     oob_score=Tru

In [265]:
y_preds_voting = voting.predict(X_test_poly2_scaled)

In [267]:
acc_score_voting = accuracy_score(y_test_poly2, y_preds_voting)
precision_voting = precision_score(y_test_poly2, y_preds_voting)
recall_voting = recall_score(y_test_poly2, y_preds_voting)
f1_voting = f1_score(y_test_poly2, y_preds_voting)

print(f'accuracy: {acc_score_voting}')
print(f'precision: {precision_voting}')
print(f'recall: {recall_voting}')
print(f'f1: {f1_voting}')

accuracy: 0.7714074074074074
precision: 0.48760330578512395
recall: 0.5502998001332445
f1: 0.5170579029733959


## 7. Final Model

In [313]:
log_preds = logreg.predict(X_test)
# log_poly2_1_preds = logreg_poly2_1.predict(X_test)
randfor_preds = rand_for.predict(X_test)

In [329]:
final_preds = 1 * (((log_preds + 2*randfor_preds) / 2) >= 0.5)

In [330]:
acc_score_final = accuracy_score(y_test, final_preds)
precision_final = precision_score(y_test, final_preds)
recall_final = recall_score(y_test, final_preds)
f1_final = f1_score(y_test, final_preds)

print(f'accuracy: {acc_score_final}')
print(f'precision: {precision_final}')
print(f'recall: {recall_final}')
print(f'f1: {f1_final}')

accuracy: 0.8026666666666666
precision: 0.5538132573057734
recall: 0.523937963587323
f1: 0.5384615384615384
