## FINAL PROJECT

### TELCO CHURN PREDICTION
#### SECTION 3 - HYPERPARAMETER TUNING

Hyperparameter tuning takes a long time, and rerunning it inside the main notebook slows down every iteration. I have therefore moved it into this separate notebook, which exists solely for experimenting with hyperparameters. This keeps the tuning from disrupting the main machine learning workflow while still letting us find the best possible models.

To start, we import the required libraries and load our data.

### IMPORTING LIBRARIES

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, classification_report, recall_score
from sklearn.model_selection import RandomizedSearchCV

import warnings
warnings.filterwarnings('ignore')

### LOAD DATA

In [2]:
df = pd.read_csv('df_ready.csv')

In [3]:
df.head()

Unnamed: 0,Tenure,MonthlyCharges,TotalCharges,Churn,Gender_Female,Gender_Male,SeniorCitizen_No,SeniorCitizen_Yes,Partner_No,Partner_Yes,...,StreamingMovies_Yes,Contract_Month-to-month,Contract_One year,Contract_Two year,PaperlessBilling_No,PaperlessBilling_Yes,PaymentMethod_Bank transfer (automatic),PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check
0,0.013889,0.115423,0.003437,0,1,0,1,0,0,1,...,0,1,0,0,0,1,0,0,1,0
1,0.472222,0.385075,0.217564,0,0,1,1,0,1,0,...,0,0,1,0,1,0,0,0,0,1
2,0.027778,0.354229,0.012453,1,0,1,1,0,1,0,...,0,1,0,0,0,1,0,0,0,1
3,0.625,0.239303,0.211951,0,0,1,1,0,1,0,...,0,0,1,0,1,0,1,0,0,0
4,0.027778,0.521891,0.017462,1,1,0,1,0,1,0,...,0,1,0,0,0,1,0,0,1,0


From this point forward, we focus on finding the best hyperparameters for each model. The steps are simple:
- First, we define the search space for each model's hyperparameters and run the search on our models.
- Next, we compare each default model (default parameters) with its tuned counterparts.
- The comparison covers accuracy and, more importantly, recall.
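
The steps above can be sketched end to end on synthetic data. This is a minimal illustration rather than the notebook's actual pipeline: the data comes from `make_classification`, and the search space, `n_iter`, and `random_state` values are assumptions chosen to keep the run short.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, imbalanced stand-in for the churn data (assumption, not df_ready.csv).
X, y = make_classification(n_samples=600, n_features=10,
                           weights=[0.75, 0.25], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.8, random_state=0)

# Step 1: define the search space and run the search, scoring by recall.
param_space = {'n_neighbors': list(range(1, 30)), 'p': [1, 2]}
search = RandomizedSearchCV(KNeighborsClassifier(), param_space, n_iter=10,
                            cv=5, scoring='recall', random_state=0).fit(X_tr, y_tr)

# Step 2: keep a default model and the tuned model side by side.
candidates = {
    'default': KNeighborsClassifier().fit(X_tr, y_tr),
    'tuned': search.best_estimator_,
}

# Step 3: compare accuracy and recall on the held-out test set.
scores = {}
for name, model in candidates.items():
    pred = model.predict(X_te)
    scores[name] = {'accuracy': accuracy_score(y_te, pred),
                    'recall': recall_score(y_te, pred, pos_label=1)}
    print(name, scores[name])
```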

### SET THE HYPERPARAMETER SEARCH SPACE

##### LOGISTIC REGRESSION

In [4]:
# Logistic Regression

penalty = ['l1', 'l2', 'elasticnet', 'none']
solver = ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']
max_iter = [10, 100, 1000]
LRG_param = {'penalty' : penalty, 'solver': solver, 'max_iter' : max_iter}
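
Not every `penalty` and `solver` pair in this grid is accepted by `LogisticRegression`. The sketch below counts the supported pairs using the compatibility table from the scikit-learn documentation; the `supported` dictionary is written out by hand for illustration, and `'none'` is the spelling used by the scikit-learn version in this notebook. As far as I know, a candidate that fails to fit is scored as NaN by the search by default, and with warnings silenced the failure is easy to miss. Note also that `'elasticnet'` with `'saga'` additionally needs `l1_ratio`, which this grid does not set.

```python
from itertools import product

penalty = ['l1', 'l2', 'elasticnet', 'none']
solver = ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']

# Penalties each solver accepts (hand-written from the scikit-learn docs).
supported = {
    'newton-cg': {'l2', 'none'},
    'lbfgs': {'l2', 'none'},
    'liblinear': {'l1', 'l2'},
    'sag': {'l2', 'none'},
    'saga': {'l1', 'l2', 'elasticnet', 'none'},
}

all_pairs = list(product(solver, penalty))
valid_pairs = [(s, p) for s, p in all_pairs if p in supported[s]]

print('pairs in the grid:', len(all_pairs))
print('pairs the solvers support:', len(valid_pairs))
```

Restricting the search to the supported pairs would keep the random draws from being spent on candidates that cannot be fitted.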

##### RANDOM FOREST CLASSIFIER

In [5]:
#Random Forest Classifier

# None (the object, not the string 'None') means no depth limit
max_depth = [10, 20, 40, None]
min_samples_leaf = [2, 4, 8]
min_samples_split = [2, 10, 100]
n_estimators = [10, 100, 500]

RFC_param = {'max_depth' : max_depth, 'min_samples_leaf': min_samples_leaf, 'min_samples_split' : min_samples_split, 'n_estimators' : n_estimators}

##### K-NEAREST NEIGHBORS

In [6]:
#K-Nearest Neighbors

leaf_size = list(range(1, 50))
n_neighbors = list(range(1, 30))
p=[1,2]

KNN_param = {'leaf_size' : leaf_size, 'n_neighbors' : n_neighbors, 'p' : p}
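
To get a feel for how much of each search space a randomized search visits, the sketch below counts the combinations in the three grids defined above. The value `n_iter = 10` is the `RandomizedSearchCV` default, which applies here because the search functions later in the notebook do not override it.

```python
from math import prod

# Number of candidate values per hyperparameter, taken from the grids above.
search_spaces = {
    'LRG_param': {'penalty': 4, 'solver': 5, 'max_iter': 3},
    'RFC_param': {'max_depth': 4, 'min_samples_leaf': 3,
                  'min_samples_split': 3, 'n_estimators': 3},
    'KNN_param': {'leaf_size': len(range(1, 50)),
                  'n_neighbors': len(range(1, 30)), 'p': 2},
}

n_iter = 10  # RandomizedSearchCV default number of sampled candidates

totals = {}
for name, sizes in search_spaces.items():
    totals[name] = prod(sizes.values())
    share = min(n_iter, totals[name]) / totals[name]
    print(f'{name}: {totals[name]} combinations, {share:.1%} sampled per search')
```

Because a single search sees only a small share of the larger grids, raising `n_iter` is one way to make the result less dependent on the random draw.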

### SPLIT DATA

In [7]:
x = df.drop(columns = ['Churn'])
y = df['Churn'].values

In [8]:
#Split train data 80%, test data 20%

x_train, x_test, y_train, y_test =  train_test_split(x, y, train_size = 0.8, shuffle = False)

In [9]:
#Split train data 90%, test data 10%

x1_train, x1_test, y1_train, y1_test =  train_test_split(x, y, train_size = 0.9, shuffle = False)
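
With `shuffle = False`, the test set is simply the tail of the file, so the 80% and 90% splits are evaluated on different rows. Whether that matters depends on how `df_ready.csv` is ordered, which this notebook does not show. The sketch below uses toy arrays (an assumption, not the churn data) to contrast an ordered split with a stratified, shuffled one.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, with all 20 positive labels at the end of the file.
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([0] * 80 + [1] * 20)

# Ordered split, as in the cells above: the test set is the last 20 rows.
_, X_te_ordered, _, y_te_ordered = train_test_split(
    X_demo, y_demo, train_size=0.8, shuffle=False)

# Stratified alternative: shuffles rows and preserves the class ratio.
_, X_te_strat, _, y_te_strat = train_test_split(
    X_demo, y_demo, train_size=0.8, stratify=y_demo, random_state=0)

print('positive share, ordered test set:   ', y_te_ordered.mean())
print('positive share, stratified test set:', y_te_strat.mean())
```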

In [10]:
x_train

Unnamed: 0,Tenure,MonthlyCharges,TotalCharges,Gender_Female,Gender_Male,SeniorCitizen_No,SeniorCitizen_Yes,Partner_No,Partner_Yes,Dependents_No,...,StreamingMovies_Yes,Contract_Month-to-month,Contract_One year,Contract_Two year,PaperlessBilling_No,PaperlessBilling_Yes,PaymentMethod_Bank transfer (automatic),PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check
0,0.013889,0.115423,0.003437,1,0,1,0,0,1,1,...,0,1,0,0,0,1,0,0,1,0
1,0.472222,0.385075,0.217564,0,1,1,0,1,0,1,...,0,0,1,0,1,0,0,0,0,1
2,0.027778,0.354229,0.012453,0,1,1,0,1,0,1,...,0,1,0,0,0,1,0,0,0,1
3,0.625000,0.239303,0.211951,0,1,1,0,1,0,1,...,0,0,1,0,1,0,1,0,0,0
4,0.027778,0.521891,0.017462,1,0,1,0,1,0,1,...,0,1,0,0,0,1,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5629,0.013889,0.017910,0.002309,0,1,0,1,1,0,1,...,0,1,0,0,1,0,0,0,0,1
5630,0.541667,0.847761,0.459936,1,0,1,0,1,0,1,...,1,1,0,0,0,1,0,0,1,0
5631,0.041667,0.067164,0.009010,0,1,1,0,0,1,1,...,0,1,0,0,1,0,0,0,0,1
5632,0.805556,0.020398,0.130285,0,1,1,0,1,0,1,...,0,0,0,1,1,0,0,1,0,0


In [11]:
x1_train

Unnamed: 0,Tenure,MonthlyCharges,TotalCharges,Gender_Female,Gender_Male,SeniorCitizen_No,SeniorCitizen_Yes,Partner_No,Partner_Yes,Dependents_No,...,StreamingMovies_Yes,Contract_Month-to-month,Contract_One year,Contract_Two year,PaperlessBilling_No,PaperlessBilling_Yes,PaymentMethod_Bank transfer (automatic),PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check
0,0.013889,0.115423,0.003437,1,0,1,0,0,1,1,...,0,1,0,0,0,1,0,0,1,0
1,0.472222,0.385075,0.217564,0,1,1,0,1,0,1,...,0,0,1,0,1,0,0,0,0,1
2,0.027778,0.354229,0.012453,0,1,1,0,1,0,1,...,0,1,0,0,0,1,0,0,0,1
3,0.625000,0.239303,0.211951,0,1,1,0,1,0,1,...,0,0,1,0,1,0,1,0,0,0
4,0.027778,0.521891,0.017462,1,0,1,0,1,0,1,...,0,1,0,0,0,1,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6333,0.833333,0.873134,0.741687,1,0,0,1,1,0,1,...,1,0,1,0,0,1,0,0,1,0
6334,0.930556,0.853234,0.810502,1,0,1,0,1,0,0,...,1,0,0,1,0,1,0,0,1,0
6335,0.194444,0.511443,0.106093,1,0,1,0,1,0,1,...,1,1,0,0,0,1,0,0,1,0
6336,0.791667,0.557711,0.462688,0,1,1,0,0,1,0,...,0,1,0,0,0,1,0,0,1,0


#### DEPLOYING MODELS - DEFAULT

In [12]:
# 80% Train

LRG = LogisticRegression().fit(x_train, y_train)
RFC = RandomForestClassifier().fit(x_train, y_train)
KNN = KNeighborsClassifier().fit(x_train, y_train)

In [13]:
# 90% Train

LRG1 = LogisticRegression().fit(x1_train, y1_train)
RFC1 = RandomForestClassifier().fit(x1_train, y1_train)
KNN1 = KNeighborsClassifier().fit(x1_train, y1_train)

#### SET THE HYPERPARAMETER

We tune the hyperparameters of each model with randomized search cross-validation (`RandomizedSearchCV`), scoring every candidate by recall.

##### LOGISTIC REGRESSION

In [14]:
def CVLRG (est, xtr, ytr):
    result = RandomizedSearchCV(estimator = est, param_distributions = LRG_param, cv=5, scoring = 'recall').fit(xtr, ytr)
    return result

In [15]:
# parameter for 80% Train

for i in range(1,4):
    cv_lrg = CVLRG(LRG, x_train, y_train)
    print('Hyper Model', i, cv_lrg.best_params_)

Hyper Model 1 {'solver': 'saga', 'penalty': 'none', 'max_iter': 1000}
Hyper Model 2 {'solver': 'saga', 'penalty': 'none', 'max_iter': 1000}
Hyper Model 3 {'solver': 'saga', 'penalty': 'none', 'max_iter': 10}
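
Each call to `CVLRG` draws a fresh random sample of candidates, which is why the three runs above can disagree. A minimal sketch of making a search repeatable by fixing `random_state` follows; the synthetic data, the reduced search space, and the helper name `run_search` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the training data (assumption).
X, y = make_classification(n_samples=300, n_features=8, random_state=1)

# Reduced space: every solver listed here supports the default l2 penalty.
space = {'solver': ['lbfgs', 'newton-cg', 'liblinear'], 'max_iter': [100, 1000]}

def run_search(seed):
    """Run one randomized search whose candidate sample is fixed by `seed`."""
    search = RandomizedSearchCV(LogisticRegression(), space, n_iter=4, cv=3,
                                scoring='recall', random_state=seed)
    return search.fit(X, y)

first = run_search(42)
second = run_search(42)

# The same seed samples the same candidates, so both runs agree.
print(first.best_params_ == second.best_params_)
print(first.best_params_, round(first.best_score_, 3))
```

`best_score_` is the mean cross-validated recall of the winning candidate, which gives a quick check before fitting and scoring the model on the test set.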


In [16]:
# parameter for 90% Train

for i in range(1,4):
    cv_lrg1 = CVLRG(LRG1, x1_train, y1_train)
    print('Hyper Model', i, cv_lrg1.best_params_)

Hyper Model 1 {'solver': 'saga', 'penalty': 'none', 'max_iter': 1000}
Hyper Model 2 {'solver': 'newton-cg', 'penalty': 'none', 'max_iter': 10}
Hyper Model 3 {'solver': 'saga', 'penalty': 'none', 'max_iter': 1000}


In [17]:
# applying hyperparameter models for 80% Train

LRG_hyper1 = LogisticRegression(solver = 'lbfgs', penalty = 'none', max_iter = 10).fit(x_train, y_train)
LRG_hyper2 = LogisticRegression(solver = 'newton-cg', penalty = 'none', max_iter = 100).fit(x_train, y_train)
LRG_hyper3 = LogisticRegression(solver = 'saga', penalty = 'l1', max_iter = 1000).fit(x_train, y_train)

In [18]:
# applying hyperparameter models for 90% Train

LRG1_hyper1 = LogisticRegression(solver = 'lbfgs', penalty = 'none', max_iter = 100).fit(x1_train, y1_train)
LRG1_hyper2 = LogisticRegression(solver = 'sag', penalty = 'none', max_iter = 10).fit(x1_train, y1_train)
LRG1_hyper3 = LogisticRegression(solver = 'lbfgs', penalty = 'none', max_iter = 10).fit(x1_train, y1_train)

In [19]:
# y_predict for 80% Train

yp_LRG = LRG.predict(x_test)
yp_LRG_hyper1 = LRG_hyper1.predict(x_test)
yp_LRG_hyper2 = LRG_hyper2.predict(x_test)
yp_LRG_hyper3 = LRG_hyper3.predict(x_test)

# y_predict for 90% Train

yp_LRG1 = LRG1.predict(x1_test)
yp_LRG1_hyper1 = LRG1_hyper1.predict(x1_test)
yp_LRG1_hyper2 = LRG1_hyper2.predict(x1_test)
yp_LRG1_hyper3 = LRG1_hyper3.predict(x1_test)

In [20]:
# setting the default vs hyperparameter model score

# 80% Train

# Accuracy
LRG_acc = LRG.score(x_test, y_test)
LRG_hyper1_acc = LRG_hyper1.score(x_test, y_test)
LRG_hyper2_acc = LRG_hyper2.score(x_test, y_test)
LRG_hyper3_acc = LRG_hyper3.score(x_test, y_test)

# Recall
LRG_rec = recall_score(y_test, yp_LRG, pos_label = 1)
LRG_rec_hyper1 = recall_score(y_test, yp_LRG_hyper1, pos_label = 1)
LRG_rec_hyper2 = recall_score(y_test, yp_LRG_hyper2, pos_label = 1)
LRG_rec_hyper3 = recall_score(y_test, yp_LRG_hyper3, pos_label = 1)

# 90% Train

# Accuracy
LRG1_acc = LRG1.score(x1_test, y1_test)
LRG1_hyper1_acc = LRG1_hyper1.score(x1_test, y1_test)
LRG1_hyper2_acc = LRG1_hyper2.score(x1_test, y1_test)
LRG1_hyper3_acc = LRG1_hyper3.score(x1_test, y1_test)

# Recall
LRG1_rec = recall_score(y1_test, yp_LRG1, pos_label = 1)
LRG1_rec_hyper1 = recall_score(y1_test, yp_LRG1_hyper1, pos_label = 1)
LRG1_rec_hyper2 = recall_score(y1_test, yp_LRG1_hyper2, pos_label = 1)
LRG1_rec_hyper3 = recall_score(y1_test, yp_LRG1_hyper3, pos_label = 1)

In [21]:
model_LRG80_score = pd.DataFrame({'Logistic Regression' : ['Default', 'Hyper Test 1', 'Hyper Test 2', 'Hyper Test 3'], 
                                  'Accuracy Score': [LRG_acc, LRG_hyper1_acc, LRG_hyper2_acc, LRG_hyper3_acc],
                                  'Recall Score' : [LRG_rec, LRG_rec_hyper1, LRG_rec_hyper2, LRG_rec_hyper3]})

In [22]:
model_LRG90_score = pd.DataFrame({'Logistic Regression' : ['Default', 'Hyper Test 1', 'Hyper Test 2', 'Hyper Test 3'], 
                                  'Accuracy Score': [LRG1_acc, LRG1_hyper1_acc, LRG1_hyper2_acc, LRG1_hyper3_acc], 
                                  'Recall Score' : [LRG1_rec, LRG1_rec_hyper1, LRG1_rec_hyper2, LRG1_rec_hyper3]})

In [23]:
pd.concat([model_LRG80_score, model_LRG90_score], keys = ['LRG 80 Score', 'LRG 90 Score'])

Unnamed: 0,Unnamed: 1,Logistic Regression,Accuracy Score,Recall Score
LRG 80 Score,0,Default,0.801278,0.530184
LRG 80 Score,1,Hyper Test 1,0.801987,0.543307
LRG 80 Score,2,Hyper Test 2,0.802697,0.538058
LRG 80 Score,3,Hyper Test 3,0.803407,0.535433
LRG 90 Score,0,Default,0.797163,0.517949
LRG 90 Score,1,Hyper Test 1,0.8,0.528205
LRG 90 Score,2,Hyper Test 2,0.792908,0.369231
LRG 90 Score,3,Hyper Test 3,0.8,0.528205


On the 80% train split, every tuned model outperforms the default on both accuracy and recall. Test no. 3 gives the best accuracy, but our focus is recall, so we pick test no. 1. On the 90% train split, tests no. 1 and no. 3 score identically and beat the default on both accuracy and recall, so either one is fine. Test no. 2, by contrast, has a much lower recall than the default (0.369 vs. 0.518).
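
Recall is the share of actual churners that the model catches, TP / (TP + FN), which is why it can move independently of accuracy. The toy labels below are made up purely to show the arithmetic.

```python
# Toy labels (made up for illustration): 1 = churn, 0 = no churn.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # churners correctly flagged
fn = sum(t == 1 and p == 0 for t, p in pairs)  # churners that were missed
correct = sum(t == p for t, p in pairs)        # all correct predictions

accuracy = correct / len(pairs)
recall = tp / (tp + fn)

print('accuracy:', accuracy)
print('recall:', recall)
```

Here the model is right on most rows yet finds only half of the churners, which is the situation the recall column in the tables is meant to expose.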

Now on to the next model.

##### RANDOM FOREST CLASSIFIER

In [24]:
def CVRFC (est, xtr, ytr):
    result = RandomizedSearchCV(estimator = est, param_distributions = RFC_param, cv=5, scoring = 'recall').fit(xtr, ytr)
    return result

In [25]:
# 80% Train

for i in range(1,4):
    cv_rfc = CVRFC(RFC, x_train, y_train)
    print('Hyper Model', i, cv_rfc.best_params_)

Hyper Model 1 {'n_estimators': 100, 'min_samples_split': 2, 'min_samples_leaf': 4, 'max_depth': 10}
Hyper Model 2 {'n_estimators': 10, 'min_samples_split': 2, 'min_samples_leaf': 4, 'max_depth': 20}
Hyper Model 3 {'n_estimators': 100, 'min_samples_split': 10, 'min_samples_leaf': 8, 'max_depth': 20}


In [26]:
# 90% Train

for i in range(1,4):
    cv_rfc1 = CVRFC(RFC1, x1_train, y1_train)
    print('Hyper Model', i, cv_rfc1.best_params_)

Hyper Model 1 {'n_estimators': 500, 'min_samples_split': 2, 'min_samples_leaf': 8, 'max_depth': 40}
Hyper Model 2 {'n_estimators': 500, 'min_samples_split': 10, 'min_samples_leaf': 4, 'max_depth': 10}
Hyper Model 3 {'n_estimators': 10, 'min_samples_split': 10, 'min_samples_leaf': 2, 'max_depth': 40}


In [27]:
# applying hyperparameter models for 80% Train

RFC_hyper1 = RandomForestClassifier(n_estimators = 500, min_samples_split = 2, min_samples_leaf = 8, max_depth = 10).fit(x_train, y_train)
RFC_hyper2 = RandomForestClassifier(n_estimators = 100, min_samples_split = 2, min_samples_leaf = 4, max_depth = 10).fit(x_train, y_train)
RFC_hyper3 = RandomForestClassifier(n_estimators = 500, min_samples_split = 10, min_samples_leaf = 4, max_depth = 10).fit(x_train, y_train)

In [28]:
# applying hyperparameter models for 90% Train

RFC1_hyper1 = RandomForestClassifier(n_estimators = 100, min_samples_split = 2, min_samples_leaf = 2, max_depth = 10).fit(x1_train, y1_train)
RFC1_hyper2 = RandomForestClassifier(n_estimators = 500, min_samples_split = 2, min_samples_leaf = 4, max_depth = 40).fit(x1_train, y1_train)
RFC1_hyper3 = RandomForestClassifier(n_estimators = 100, min_samples_split = 10, min_samples_leaf = 8, max_depth = 40).fit(x1_train, y1_train)
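
The parameters in the two cells above are typed in by hand, and they do not match the printed output of In [25] and In [26] exactly, presumably because the searches were rerun. One way to avoid transcription drift is to take the model straight from the search object. The sketch below assumes synthetic data and a reduced search space.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the training data (assumption).
X, y = make_classification(n_samples=300, n_features=8, random_state=2)

# Reduced space for a quick run; None means no depth limit.
space = {'max_depth': [10, 20, None],
         'min_samples_leaf': [2, 4, 8],
         'n_estimators': [10, 50]}

search = RandomizedSearchCV(RandomForestClassifier(random_state=0), space,
                            n_iter=4, cv=3, scoring='recall',
                            random_state=0).fit(X, y)

# Option 1: reuse the model the search already refit on all training rows.
tuned = search.best_estimator_

# Option 2: build a fresh model from the winning parameters.
rebuilt = RandomForestClassifier(random_state=0, **search.best_params_).fit(X, y)

print(search.best_params_)
```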

In [29]:
# y_predict for 80% Train

yp_RFC = RFC.predict(x_test)
yp_RFC_hyper1 = RFC_hyper1.predict(x_test)
yp_RFC_hyper2 = RFC_hyper2.predict(x_test)
yp_RFC_hyper3 = RFC_hyper3.predict(x_test)

# y_predict for 90% Train

yp_RFC1 = RFC1.predict(x1_test)
yp_RFC1_hyper1 = RFC1_hyper1.predict(x1_test)
yp_RFC1_hyper2 = RFC1_hyper2.predict(x1_test)
yp_RFC1_hyper3 = RFC1_hyper3.predict(x1_test)

In [30]:
# setting the default vs hyperparameter model score

# 80% Train

# Accuracy
RFC_acc = RFC.score(x_test, y_test)
RFC_hyper1_acc = RFC_hyper1.score(x_test, y_test)
RFC_hyper2_acc = RFC_hyper2.score(x_test, y_test)
RFC_hyper3_acc = RFC_hyper3.score(x_test, y_test)

# Recall
RFC_rec = recall_score(y_test, yp_RFC, pos_label = 1)
RFC_rec_hyper1 = recall_score(y_test, yp_RFC_hyper1, pos_label = 1)
RFC_rec_hyper2 = recall_score(y_test, yp_RFC_hyper2, pos_label = 1)
RFC_rec_hyper3 = recall_score(y_test, yp_RFC_hyper3, pos_label = 1)

# 90% Train

# Accuracy
RFC1_acc = RFC1.score(x1_test, y1_test)
RFC1_hyper1_acc = RFC1_hyper1.score(x1_test, y1_test)
RFC1_hyper2_acc = RFC1_hyper2.score(x1_test, y1_test)
RFC1_hyper3_acc = RFC1_hyper3.score(x1_test, y1_test)

# Recall
RFC1_rec = recall_score(y1_test, yp_RFC1, pos_label = 1)
RFC1_rec_hyper1 = recall_score(y1_test, yp_RFC1_hyper1, pos_label = 1)
RFC1_rec_hyper2 = recall_score(y1_test, yp_RFC1_hyper2, pos_label = 1)
RFC1_rec_hyper3 = recall_score(y1_test, yp_RFC1_hyper3, pos_label = 1)

In [31]:
model_RFC80_score = pd.DataFrame({'Random Forest Classifier' : ['Default', 'Hyper Test 1', 'Hyper Test 2', 'Hyper Test 3'], 
                                  'Accuracy Score': [RFC_acc, RFC_hyper1_acc, RFC_hyper2_acc, RFC_hyper3_acc],
                                  'Recall Score' : [RFC_rec, RFC_rec_hyper1, RFC_rec_hyper2, RFC_rec_hyper3]})

In [32]:
model_RFC90_score = pd.DataFrame({'Random Forest Classifier' : ['Default', 'Hyper Test 1', 'Hyper Test 2', 'Hyper Test 3'], 
                                  'Accuracy Score': [RFC1_acc, RFC1_hyper1_acc, RFC1_hyper2_acc, RFC1_hyper3_acc], 
                                  'Recall Score' : [RFC1_rec, RFC1_rec_hyper1, RFC1_rec_hyper2, RFC1_rec_hyper3]})

In [33]:
pd.concat([model_RFC80_score, model_RFC90_score], keys = ['RFC 80 Score', 'RFC 90 Score'])

Unnamed: 0,Unnamed: 1,Random Forest Classifier,Accuracy Score,Recall Score
RFC 80 Score,0,Default,0.791341,0.493438
RFC 80 Score,1,Hyper Test 1,0.797019,0.48294
RFC 80 Score,2,Hyper Test 2,0.799858,0.496063
RFC 80 Score,3,Hyper Test 3,0.801278,0.493438
RFC 90 Score,0,Default,0.8,0.482051
RFC 90 Score,1,Hyper Test 1,0.807092,0.497436
RFC 90 Score,2,Hyper Test 2,0.801418,0.487179
RFC 90 Score,3,Hyper Test 3,0.804255,0.482051


For the Random Forest models on the 80% train split, every recall is below 50%, so I don't think any of them can compete against the other models. That said, test no. 3 gives the best accuracy while test no. 2 has the highest recall, so we choose test no. 2. On the 90% train split, test no. 1 has the best accuracy and recall, so we choose it.
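
The prediction, scoring, and table cells are repeated almost verbatim for each model family. A small helper can build the comparison table in one call; `score_table` is a hypothetical name, and the data and model settings below are assumptions for illustration. Passing the model name as an argument also keeps the column label correct when cells are copied between sections.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

def score_table(models, X_test, y_test, model_name):
    """Return one row of accuracy and recall per fitted model."""
    rows = []
    for label, model in models.items():
        pred = model.predict(X_test)
        rows.append({model_name: label,
                     'Accuracy Score': accuracy_score(y_test, pred),
                     'Recall Score': recall_score(y_test, pred, pos_label=1)})
    return pd.DataFrame(rows)

# Synthetic stand-in for the churn data (assumption).
X, y = make_classification(n_samples=400, n_features=8, random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.8, shuffle=False)

models = {
    'Default': RandomForestClassifier(random_state=0).fit(X_tr, y_tr),
    'Hyper Test 1': RandomForestClassifier(n_estimators=50, max_depth=10,
                                           min_samples_leaf=4,
                                           random_state=0).fit(X_tr, y_tr),
}

table = score_table(models, X_te, y_te, 'Random Forest Classifier')
print(table)
```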

##### K-NEAREST NEIGHBORS

In [34]:
def CVKNN (est, xtr, ytr):
    result = RandomizedSearchCV(estimator = est, param_distributions = KNN_param, cv=5, scoring = 'recall').fit(xtr, ytr)
    return result

In [35]:
# 80% Train

for i in range(1,4):
    cv_knn = CVKNN(KNN, x_train, y_train)
    print('Hyper Model', i, cv_knn.best_params_)

Hyper Model 1 {'p': 2, 'n_neighbors': 29, 'leaf_size': 3}
Hyper Model 2 {'p': 1, 'n_neighbors': 27, 'leaf_size': 15}
Hyper Model 3 {'p': 1, 'n_neighbors': 29, 'leaf_size': 45}


In [36]:
# 90% Train

for i in range(1,4):
    cv_knn1 = CVKNN(KNN1, x1_train, y1_train)
    print('Hyper Model', i, cv_knn1.best_params_)

Hyper Model 1 {'p': 1, 'n_neighbors': 27, 'leaf_size': 20}
Hyper Model 2 {'p': 1, 'n_neighbors': 29, 'leaf_size': 19}
Hyper Model 3 {'p': 1, 'n_neighbors': 29, 'leaf_size': 34}


In [37]:
# applying hyperparameter models for 80% Train

KNN_hyper1 = KNeighborsClassifier(p = 2, n_neighbors = 15, leaf_size = 39).fit(x_train, y_train)
KNN_hyper2 = KNeighborsClassifier(p = 1, n_neighbors = 25, leaf_size = 39).fit(x_train, y_train)
KNN_hyper3 = KNeighborsClassifier(p = 2, n_neighbors = 25, leaf_size = 18).fit(x_train, y_train)

In [38]:
# applying hyperparameter models for 90% Train

KNN1_hyper1 = KNeighborsClassifier(p = 2, n_neighbors = 15, leaf_size = 39).fit(x1_train, y1_train)
KNN1_hyper2 = KNeighborsClassifier(p = 1, n_neighbors = 25, leaf_size = 39).fit(x1_train, y1_train)
KNN1_hyper3 = KNeighborsClassifier(p = 1, n_neighbors = 25, leaf_size = 18).fit(x1_train, y1_train)

In [39]:
# y_predict for 80% Train

yp_KNN = KNN.predict(x_test)
yp_KNN_hyper1 = KNN_hyper1.predict(x_test)
yp_KNN_hyper2 = KNN_hyper2.predict(x_test)
yp_KNN_hyper3 = KNN_hyper3.predict(x_test)

# y_predict for 90% Train

yp_KNN1 = KNN1.predict(x1_test)
yp_KNN1_hyper1 = KNN1_hyper1.predict(x1_test)
yp_KNN1_hyper2 = KNN1_hyper2.predict(x1_test)
yp_KNN1_hyper3 = KNN1_hyper3.predict(x1_test)

In [40]:
# setting the default vs hyperparameter model score

# 80% Train

# Accuracy
KNN_acc = KNN.score(x_test, y_test)
KNN_hyper1_acc = KNN_hyper1.score(x_test, y_test)
KNN_hyper2_acc = KNN_hyper2.score(x_test, y_test)
KNN_hyper3_acc = KNN_hyper3.score(x_test, y_test)

# Recall
KNN_rec = recall_score(y_test, yp_KNN, pos_label = 1)
KNN_rec_hyper1 = recall_score(y_test, yp_KNN_hyper1, pos_label = 1)
KNN_rec_hyper2 = recall_score(y_test, yp_KNN_hyper2, pos_label = 1)
KNN_rec_hyper3 = recall_score(y_test, yp_KNN_hyper3, pos_label = 1)

# 90% Train

# Accuracy
KNN1_acc = KNN1.score(x1_test, y1_test)
KNN1_hyper1_acc = KNN1_hyper1.score(x1_test, y1_test)
KNN1_hyper2_acc = KNN1_hyper2.score(x1_test, y1_test)
KNN1_hyper3_acc = KNN1_hyper3.score(x1_test, y1_test)

# Recall
KNN1_rec = recall_score(y1_test, yp_KNN1, pos_label = 1)
KNN1_rec_hyper1 = recall_score(y1_test, yp_KNN1_hyper1, pos_label = 1)
KNN1_rec_hyper2 = recall_score(y1_test, yp_KNN1_hyper2, pos_label = 1)
KNN1_rec_hyper3 = recall_score(y1_test, yp_KNN1_hyper3, pos_label = 1)

In [41]:
model_KNN80_score = pd.DataFrame({'K-Nearest Neighbors' : ['Default', 'Hyper Test 1', 'Hyper Test 2', 'Hyper Test 3'], 
                                  'Accuracy Score': [KNN_acc, KNN_hyper1_acc, KNN_hyper2_acc, KNN_hyper3_acc],
                                  'Recall Score' : [KNN_rec, KNN_rec_hyper1, KNN_rec_hyper2, KNN_rec_hyper3]})

In [42]:
model_KNN90_score = pd.DataFrame({'K-Nearest Neighbors' : ['Default', 'Hyper Test 1', 'Hyper Test 2', 'Hyper Test 3'], 
                                  'Accuracy Score': [KNN1_acc, KNN1_hyper1_acc, KNN1_hyper2_acc, KNN1_hyper3_acc], 
                                  'Recall Score' : [KNN1_rec, KNN1_rec_hyper1, KNN1_rec_hyper2, KNN1_rec_hyper3]})

In [43]:
pd.concat([model_KNN80_score, model_KNN90_score], keys = ['KNN 80 Score', 'KNN 90 Score'])

Unnamed: 0,Unnamed: 1,K-Nearest Neighbors,Accuracy Score,Recall Score
KNN 80 Score,0,Default,0.770759,0.543307
KNN 80 Score,1,Hyper Test 1,0.782115,0.569554
KNN 80 Score,2,Hyper Test 2,0.788502,0.587927
KNN 80 Score,3,Hyper Test 3,0.787793,0.590551
KNN 90 Score,0,Default,0.777305,0.553846
KNN 90 Score,1,Hyper Test 1,0.780142,0.538462
KNN 90 Score,2,Hyper Test 2,0.797163,0.594872
KNN 90 Score,3,Hyper Test 3,0.797163,0.594872


On the 80% train split, test no. 2 gives the best accuracy, while test no. 3 has slightly better recall. On the 90% train split, tests no. 2 and no. 3 have identical accuracy and recall. Therefore, we can use the hyperparameters from tests no. 2 and no. 3 for our model.
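
The same selection can be made programmatically. The sketch below copies the scores from the KNN table above into a DataFrame and picks the row with the highest recall for each split; the `Split` and `Model` column names are assumptions for this example. On ties, `idxmax` keeps the first row it encounters.

```python
import pandas as pd

# Scores copied from the KNN comparison table above.
knn_scores = pd.DataFrame({
    'Split': ['80%'] * 4 + ['90%'] * 4,
    'Model': ['Default', 'Hyper Test 1', 'Hyper Test 2', 'Hyper Test 3'] * 2,
    'Accuracy Score': [0.770759, 0.782115, 0.788502, 0.787793,
                       0.777305, 0.780142, 0.797163, 0.797163],
    'Recall Score': [0.543307, 0.569554, 0.587927, 0.590551,
                     0.553846, 0.538462, 0.594872, 0.594872],
})

# Row label of the highest recall within each split (first row wins a tie).
best_rows = knn_scores.groupby('Split')['Recall Score'].idxmax()
best = knn_scores.loc[best_rows]

print(best)
```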