## HYPERPARAMETER TUNING

#### SECTION 3 - HYPERPARAMETER TUNING

Hyperparameter tuning takes a long time, and repeating it inside the main workflow only makes things worse. I have therefore moved the tuning into this separate notebook, which exists solely for experimenting with hyperparameters. This keeps the tuning from disturbing the main machine learning flow while still letting us look for the best possible models.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, classification_report, precision_score, recall_score, f1_score, accuracy_score
from sklearn.model_selection import RandomizedSearchCV

import warnings
warnings.filterwarnings('ignore')

In [2]:
df = pd.read_csv('df_ready.csv')

In [3]:
df.head()

Unnamed: 0,Tenure,MonthlyCharges,TotalCharges,Churn,Gender_Female,Gender_Male,SeniorCitizen_No,SeniorCitizen_Yes,Partner_No,Partner_Yes,...,StreamingMovies_Yes,Contract_Month-to-month,Contract_One year,Contract_Two year,PaperlessBilling_No,PaperlessBilling_Yes,PaymentMethod_Bank transfer (automatic),PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check
0,0.013889,0.115423,0.003437,0,1,0,1,0,0,1,...,0,1,0,0,0,1,0,0,1,0
1,0.472222,0.385075,0.217564,0,0,1,1,0,1,0,...,0,0,1,0,1,0,0,0,0,1
2,0.027778,0.354229,0.012453,1,0,1,1,0,1,0,...,0,1,0,0,0,1,0,0,0,1
3,0.625,0.239303,0.211951,0,0,1,1,0,1,0,...,0,0,1,0,1,0,1,0,0,0
4,0.027778,0.521891,0.017462,1,1,0,1,0,1,0,...,0,1,0,0,0,1,0,0,1,0


### SET THE PARAMETER GRIDS FOR HYPERPARAMETER TUNING

##### LOGISTIC REGRESSION

In [4]:
# Logistic Regression

penalty = ['l1', 'l2', 'elasticnet', 'none']
solver = ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']
max_iter = [10, 100, 1000]
LRG_param = {'penalty' : penalty, 'solver': solver, 'max_iter' : max_iter}
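Not every solver in this grid accepts every penalty, so some sampled combinations cannot be fitted; with scikit-learn's default `error_score`, those candidates are scored as NaN and never selected. The sketch below, which is an illustration rather than part of the original notebook, lists the supported pairs as documented for `LogisticRegression` and builds a hypothetical alternative grid, `LRG_param_valid`, that contains only those pairs. Note that `'elasticnet'` additionally needs an `l1_ratio`, and that recent scikit-learn releases expect `None` instead of the string `'none'`.

```python
from itertools import product

penalty = ['l1', 'l2', 'elasticnet', 'none']
solver = ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']

# Penalties each solver accepts, following the LogisticRegression documentation
supported = {
    'newton-cg': {'l2', 'none'},
    'lbfgs': {'l2', 'none'},
    'liblinear': {'l1', 'l2'},
    'sag': {'l2', 'none'},
    'saga': {'l1', 'l2', 'elasticnet', 'none'},
}

pairs = list(product(solver, penalty))
valid = [(s, p) for s, p in pairs if p in supported[s]]
invalid = [(s, p) for s, p in pairs if p not in supported[s]]
print(len(valid), len(invalid))  # 12 8

# Hypothetical alternative grid: one dict per solver, so only supported pairs are sampled.
# RandomizedSearchCV accepts a list of dicts for param_distributions.
LRG_param_valid = [
    {'solver': [s], 'penalty': sorted(supported[s]), 'max_iter': [10, 100, 1000]}
    for s in solver
]
```

Passing a grid shaped like `LRG_param_valid` would spend every sampled candidate on a combination that can be fitted, at the cost of a slightly longer grid definition.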

##### RANDOM FOREST CLASSIFIER

In [5]:
# Random Forest Classifier

# None (the Python object, not the string 'None') means the trees grow without a depth limit
max_depth = [10, 20, 40, None]
min_samples_leaf = [2, 4, 8]
min_samples_split = [2, 10, 100]
n_estimators = [10, 100, 500]

RFC_param = {'max_depth' : max_depth, 'min_samples_leaf': min_samples_leaf, 'min_samples_split' : min_samples_split, 'n_estimators' : n_estimators}

##### K-NEAREST NEIGHBORS

In [6]:
# K-Nearest Neighbors

leaf_size = list(range(1, 50))
n_neighbors = list(range(1, 30))
p=[1,2]

KNN_param = {'leaf_size' : leaf_size, 'n_neighbors' : n_neighbors, 'p' : p}
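To get a feel for how much of each grid a randomized search actually explores, the sketch below counts the combinations in the three grids and compares them with `RandomizedSearchCV`'s default of 10 sampled candidates (`n_iter=10`). The grids are retyped here so the snippet runs on its own; the `grids` and `sizes` names are only illustrative.

```python
from math import prod

# The three grids defined above, retyped so this snippet is self-contained
grids = {
    'LRG_param': {
        'penalty': ['l1', 'l2', 'elasticnet', 'none'],
        'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
        'max_iter': [10, 100, 1000],
    },
    'RFC_param': {
        'max_depth': [10, 20, 40, None],
        'min_samples_leaf': [2, 4, 8],
        'min_samples_split': [2, 10, 100],
        'n_estimators': [10, 100, 500],
    },
    'KNN_param': {
        'leaf_size': list(range(1, 50)),
        'n_neighbors': list(range(1, 30)),
        'p': [1, 2],
    },
}

n_iter = 10  # RandomizedSearchCV default number of sampled candidates

# Number of distinct combinations per grid
sizes = {name: prod(len(values) for values in grid.values()) for name, grid in grids.items()}

for name, total in sizes.items():
    share = min(n_iter, total) / total
    print(f'{name}: {total} combinations, {share:.1%} sampled per search')
```

Because each search sees only a small share of the grid, and a different random share each time unless `random_state` is fixed, repeated searches can easily return different "best" parameters.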

### SPLIT DATA

In [7]:
x = df.drop(columns = ['Churn'])
y = df['Churn'].values

In [8]:
#Split train data 80%, test data 20%

x_train, x_test, y_train, y_test =  train_test_split(x, y, train_size = 0.8, shuffle = False)

In [9]:
#Split train data 90%, test data 10%

x1_train, x1_test, y1_train, y1_test =  train_test_split(x, y, train_size = 0.9, shuffle = False)
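With `shuffle = False`, the test set is simply the last rows of the file, so the split depends on the file's row order. As a possible alternative (not what this notebook does), the sketch below uses a shuffled, stratified split with a fixed seed on synthetic data; `X_demo`, `y_demo` and the 27% positive rate are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 3 features, roughly 27% positive labels
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = (rng.random(1000) < 0.27).astype(int)

# stratify keeps the class ratio nearly identical in both parts;
# random_state makes the shuffled split reproducible
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.8, stratify=y_demo, random_state=42
)

print(len(X_tr), len(X_te))
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```

Whether stratification matters here depends on how the rows of `df_ready.csv` are ordered, which the notebook does not state.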

In [10]:
x_train

Unnamed: 0,Tenure,MonthlyCharges,TotalCharges,Gender_Female,Gender_Male,SeniorCitizen_No,SeniorCitizen_Yes,Partner_No,Partner_Yes,Dependents_No,...,StreamingMovies_Yes,Contract_Month-to-month,Contract_One year,Contract_Two year,PaperlessBilling_No,PaperlessBilling_Yes,PaymentMethod_Bank transfer (automatic),PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check
0,0.013889,0.115423,0.003437,1,0,1,0,0,1,1,...,0,1,0,0,0,1,0,0,1,0
1,0.472222,0.385075,0.217564,0,1,1,0,1,0,1,...,0,0,1,0,1,0,0,0,0,1
2,0.027778,0.354229,0.012453,0,1,1,0,1,0,1,...,0,1,0,0,0,1,0,0,0,1
3,0.625000,0.239303,0.211951,0,1,1,0,1,0,1,...,0,0,1,0,1,0,1,0,0,0
4,0.027778,0.521891,0.017462,1,0,1,0,1,0,1,...,0,1,0,0,0,1,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5629,0.013889,0.017910,0.002309,0,1,0,1,1,0,1,...,0,1,0,0,1,0,0,0,0,1
5630,0.541667,0.847761,0.459936,1,0,1,0,1,0,1,...,1,1,0,0,0,1,0,0,1,0
5631,0.041667,0.067164,0.009010,0,1,1,0,0,1,1,...,0,1,0,0,1,0,0,0,0,1
5632,0.805556,0.020398,0.130285,0,1,1,0,1,0,1,...,0,0,0,1,1,0,0,1,0,0


In [11]:
x1_train

Unnamed: 0,Tenure,MonthlyCharges,TotalCharges,Gender_Female,Gender_Male,SeniorCitizen_No,SeniorCitizen_Yes,Partner_No,Partner_Yes,Dependents_No,...,StreamingMovies_Yes,Contract_Month-to-month,Contract_One year,Contract_Two year,PaperlessBilling_No,PaperlessBilling_Yes,PaymentMethod_Bank transfer (automatic),PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check
0,0.013889,0.115423,0.003437,1,0,1,0,0,1,1,...,0,1,0,0,0,1,0,0,1,0
1,0.472222,0.385075,0.217564,0,1,1,0,1,0,1,...,0,0,1,0,1,0,0,0,0,1
2,0.027778,0.354229,0.012453,0,1,1,0,1,0,1,...,0,1,0,0,0,1,0,0,0,1
3,0.625000,0.239303,0.211951,0,1,1,0,1,0,1,...,0,0,1,0,1,0,1,0,0,0
4,0.027778,0.521891,0.017462,1,0,1,0,1,0,1,...,0,1,0,0,0,1,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6333,0.833333,0.873134,0.741687,1,0,0,1,1,0,1,...,1,0,1,0,0,1,0,0,1,0
6334,0.930556,0.853234,0.810502,1,0,1,0,1,0,0,...,1,0,0,1,0,1,0,0,1,0
6335,0.194444,0.511443,0.106093,1,0,1,0,1,0,1,...,1,1,0,0,0,1,0,0,1,0
6336,0.791667,0.557711,0.462688,0,1,1,0,0,1,0,...,0,1,0,0,0,1,0,0,1,0


#### FITTING MODELS - DEFAULT HYPERPARAMETERS

In [12]:
# 80% Train

LRG = LogisticRegression().fit(x_train, y_train)
RFC = RandomForestClassifier().fit(x_train, y_train)
KNN = KNeighborsClassifier().fit(x_train, y_train)

In [13]:
# 90% Train

LRG1 = LogisticRegression().fit(x1_train, y1_train)
RFC1 = RandomForestClassifier().fit(x1_train, y1_train)
KNN1 = KNeighborsClassifier().fit(x1_train, y1_train)

#### SET THE HYPERPARAMETER

I will tune the hyperparameters of each model using randomized search with cross-validation (scikit-learn's `RandomizedSearchCV`, RSCV for short).
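As a minimal sketch of how such a search behaves, the example below runs `RandomizedSearchCV` twice on synthetic data with the same `random_state` and shows that the result is then reproducible. The data, the small grid over `C` and `class_weight`, and the `run_search` helper are assumptions made for illustration; they are not the grids used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic, mildly imbalanced binary problem (illustrative only)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.7, 0.3], random_state=0
)

# Hypothetical grid with 8 combinations
param_distributions = {
    'C': [0.01, 0.1, 1.0, 10.0],
    'class_weight': [None, 'balanced'],
}

def run_search(seed):
    """Sample 5 of the 8 combinations and rank them by 5-fold cross-validated recall."""
    search = RandomizedSearchCV(
        estimator=LogisticRegression(max_iter=1000),
        param_distributions=param_distributions,
        n_iter=5,
        cv=5,
        scoring='recall',
        random_state=seed,
    )
    return search.fit(X_demo, y_demo)

first = run_search(seed=0)
second = run_search(seed=0)

print(first.best_params_, round(first.best_score_, 3))
print(first.best_params_ == second.best_params_)
```

Without a fixed `random_state`, each call samples a different subset of candidates, which is one reason repeated searches can disagree on the best parameters.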

##### LOGISTIC REGRESSION

In [14]:
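def CVLRG (est, xtr, ytr):
    # Randomized search over LRG_param: 10 sampled candidates (default n_iter),
    # 5-fold cross-validation, candidates ranked by recall
    result = RandomizedSearchCV(estimator = est, param_distributions = LRG_param, cv=5, scoring = 'recall').fit(xtr, ytr)
    return result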

In [15]:
# parameter for 80% Train

for i in range(1,4):
    cv_lrg = CVLRG(LRG, x_train, y_train)
    print('Hyper Model', i, cv_lrg.best_params_)

Hyper Model 1 {'solver': 'lbfgs', 'penalty': 'none', 'max_iter': 10}
Hyper Model 2 {'solver': 'newton-cg', 'penalty': 'none', 'max_iter': 100}
Hyper Model 3 {'solver': 'saga', 'penalty': 'l1', 'max_iter': 1000}


In [16]:
# parameter for 90% Train

for i in range(1,4):
    cv_lrg1 = CVLRG(LRG1, x1_train, y1_train)
    print('Hyper Model', i, cv_lrg1.best_params_)

Hyper Model 1 {'solver': 'lbfgs', 'penalty': 'none', 'max_iter': 100}
Hyper Model 2 {'solver': 'sag', 'penalty': 'none', 'max_iter': 10}
Hyper Model 3 {'solver': 'lbfgs', 'penalty': 'none', 'max_iter': 10}


In [44]:
# applying hyperparameter models for 80% Train

LRG_hyper1 = LogisticRegression(solver = 'lbfgs', penalty = 'none', max_iter = 10).fit(x_train, y_train)
LRG_hyper2 = LogisticRegression(solver = 'newton-cg', penalty = 'none', max_iter = 100).fit(x_train, y_train)
LRG_hyper3 = LogisticRegression(solver = 'saga', penalty = 'l1', max_iter = 1000).fit(x_train, y_train)

In [45]:
# applying hyperparameter models for 90% Train

LRG1_hyper1 = LogisticRegression(solver = 'lbfgs', penalty = 'none', max_iter = 100).fit(x1_train, y1_train)
LRG1_hyper2 = LogisticRegression(solver = 'sag', penalty = 'none', max_iter = 10).fit(x1_train, y1_train)
LRG1_hyper3 = LogisticRegression(solver = 'lbfgs', penalty = 'none', max_iter = 10).fit(x1_train, y1_train)

In [46]:
# scoring the default vs hyperparameter models (.score returns mean accuracy on the test set)

# 80% Train

LRG_score = LRG.score(x_test, y_test)
LRG_hyper1_score = LRG_hyper1.score(x_test, y_test)
LRG_hyper2_score = LRG_hyper2.score(x_test, y_test)
LRG_hyper3_score = LRG_hyper3.score(x_test, y_test)

# 90% Train

LRG1_score = LRG1.score(x1_test, y1_test)
LRG1_hyper1_score = LRG1_hyper1.score(x1_test, y1_test)
LRG1_hyper2_score = LRG1_hyper2.score(x1_test, y1_test)
LRG1_hyper3_score = LRG1_hyper3.score(x1_test, y1_test)

In [47]:
model_LRG80_score = pd.DataFrame({'Logistic Regression' : ['Default', 'Hyper Test 1', 'Hyper Test 2', 'Hyper Test 3'], 
                                  'Model Score': [LRG_score, LRG_hyper1_score, LRG_hyper2_score, LRG_hyper3_score]})

In [48]:
model_LRG90_score = pd.DataFrame({'Logistic Regression' : ['Default', 'Hyper Test 1', 'Hyper Test 2', 'Hyper Test 3'], 
                                  'Model Score': [LRG1_score, LRG1_hyper1_score, LRG1_hyper2_score, LRG1_hyper3_score]})

In [49]:
pd.concat([model_LRG80_score, model_LRG90_score], keys = ['LRG 80 Score', 'LRG 90 Score'])

Unnamed: 0,Unnamed: 1,Logistic Regression,Model Score
LRG 80 Score,0,Default,0.801278
LRG 80 Score,1,Hyper Test 1,0.801987
LRG 80 Score,2,Hyper Test 2,0.802697
LRG 80 Score,3,Hyper Test 3,0.803407
LRG 90 Score,0,Default,0.797163
LRG 90 Score,1,Hyper Test 1,0.8
LRG 90 Score,2,Hyper Test 2,0.795745
LRG 90 Score,3,Hyper Test 3,0.8


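On the 80% training split, all three tuned configurations beat the default on test accuracy (0.801278), and test no. 3 scores highest (0.803407), so we'll pick no. 3. On the 90% training split, tests no. 1 and no. 3 tie at 0.8, above the default (0.797163), while test no. 2 (0.795745) falls slightly below it, so picking either no. 1 or no. 3 is fine. Note that the Model Score column is test accuracy (what `.score` returns for a classifier), whereas the search itself ranked candidates by recall, and that the margins between configurations are small.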
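Because the search ranks candidates by recall while the table reports accuracy, the two can disagree about which model is "better". The worked example below uses made-up confusion-matrix counts for two hypothetical models on 1000 customers, 250 of whom churn, to show how that can happen.

```python
def accuracy(tp, fp, fn, tn):
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)

def recall(tp, fn):
    """Share of actual positives (churners) that the model catches."""
    return tp / (tp + fn)

# Hypothetical counts, not results from this notebook
model_a = {'tp': 125, 'fp': 50, 'fn': 125, 'tn': 700}
model_b = {'tp': 175, 'fp': 150, 'fn': 75, 'tn': 600}

for name, m in [('A', model_a), ('B', model_b)]:
    acc = accuracy(**m)
    rec = recall(m['tp'], m['fn'])
    print(f'model {name}: accuracy={acc:.3f}, recall={rec:.3f}')
# model A: accuracy=0.825, recall=0.500
# model B: accuracy=0.775, recall=0.700
```

Model A wins on accuracy while model B catches more churners, so the metric used for the final comparison should match the metric used during the search.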

Now on to the next models.

##### RANDOM FOREST CLASSIFIER

In [23]:
def CVRFC (est, xtr, ytr):
    # Randomized search over RFC_param: 5-fold cross-validation, candidates ranked by recall
    result = RandomizedSearchCV(estimator = est, param_distributions = RFC_param, cv=5, scoring = 'recall').fit(xtr, ytr)
    return result

In [24]:
# 80% Train

for i in range(1,4):
    cv_rfc = CVRFC(RFC, x_train, y_train)
    print('Hyper Model', i, cv_rfc.best_params_)

Hyper Model 1 {'n_estimators': 500, 'min_samples_split': 2, 'min_samples_leaf': 8, 'max_depth': 10}
Hyper Model 2 {'n_estimators': 100, 'min_samples_split': 2, 'min_samples_leaf': 4, 'max_depth': 10}
Hyper Model 3 {'n_estimators': 500, 'min_samples_split': 10, 'min_samples_leaf': 4, 'max_depth': 10}


In [25]:
# 90% Train

for i in range(1,4):
    cv_rfc1 = CVRFC(RFC1, x1_train, y1_train)
    print('Hyper Model', i, cv_rfc1.best_params_)

Hyper Model 1 {'n_estimators': 100, 'min_samples_split': 2, 'min_samples_leaf': 2, 'max_depth': 10}
Hyper Model 2 {'n_estimators': 500, 'min_samples_split': 2, 'min_samples_leaf': 4, 'max_depth': 40}
Hyper Model 3 {'n_estimators': 100, 'min_samples_split': 10, 'min_samples_leaf': 8, 'max_depth': 40}


In [50]:
# applying hyperparameter models for 80% Train

RFC_hyper1 = RandomForestClassifier(n_estimators = 500, min_samples_split = 2, min_samples_leaf = 8, max_depth = 10).fit(x_train, y_train)
RFC_hyper2 = RandomForestClassifier(n_estimators = 100, min_samples_split = 2, min_samples_leaf = 4, max_depth = 10).fit(x_train, y_train)
RFC_hyper3 = RandomForestClassifier(n_estimators = 500, min_samples_split = 10, min_samples_leaf = 4, max_depth = 10).fit(x_train, y_train)

In [51]:
# applying hyperparameter models for 90% Train

RFC1_hyper1 = RandomForestClassifier(n_estimators = 100, min_samples_split = 2, min_samples_leaf = 2, max_depth = 10).fit(x1_train, y1_train)
RFC1_hyper2 = RandomForestClassifier(n_estimators = 500, min_samples_split = 2, min_samples_leaf = 4, max_depth = 40).fit(x1_train, y1_train)
RFC1_hyper3 = RandomForestClassifier(n_estimators = 100, min_samples_split = 10, min_samples_leaf = 8, max_depth = 40).fit(x1_train, y1_train)

In [52]:
# setting the default vs hyperparameter model score

# 80% Train

RFC_score = RFC.score(x_test, y_test)
RFC_hyper1_score = RFC_hyper1.score(x_test, y_test)
RFC_hyper2_score = RFC_hyper2.score(x_test, y_test)
RFC_hyper3_score = RFC_hyper3.score(x_test, y_test)

# 90% Train

RFC1_score = RFC1.score(x1_test, y1_test)
RFC1_hyper1_score = RFC1_hyper1.score(x1_test, y1_test)
RFC1_hyper2_score = RFC1_hyper2.score(x1_test, y1_test)
RFC1_hyper3_score = RFC1_hyper3.score(x1_test, y1_test)

In [53]:
model_RFC80_score = pd.DataFrame({'Random Forest Classifier' : ['Default', 'Hyper Test 1', 'Hyper Test 2', 'Hyper Test 3'], 
                                  'Model Score': [RFC_score, RFC_hyper1_score, RFC_hyper2_score, RFC_hyper3_score]})

In [54]:
model_RFC90_score = pd.DataFrame({'Random Forest Classifier' : ['Default', 'Hyper Test 1', 'Hyper Test 2', 'Hyper Test 3'], 
                                  'Model Score': [RFC1_score, RFC1_hyper1_score, RFC1_hyper2_score, RFC1_hyper3_score]})

In [55]:
pd.concat([model_RFC80_score, model_RFC90_score], keys = ['RFC 80 Score', 'RFC 90 Score'])

Unnamed: 0,Unnamed: 1,Random Forest Classifier,Model Score
RFC 80 Score,0,Default,0.798439
RFC 80 Score,1,Hyper Test 1,0.800568
RFC 80 Score,2,Hyper Test 2,0.7956
RFC 80 Score,3,Hyper Test 3,0.799858
RFC 90 Score,0,Default,0.792908
RFC 90 Score,1,Hyper Test 1,0.804255
RFC 90 Score,2,Hyper Test 2,0.801418
RFC 90 Score,3,Hyper Test 3,0.797163


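For the Random Forest models on the 80% training split, test no. 1 gives the best test accuracy (0.800568 versus 0.798439 for the default), so we'll choose it; test no. 3 (0.799858) is only slightly above the default and test no. 2 (0.7956) is below it. On the 90% training split, all three tuned configurations beat the default (0.792908), and test no. 1 is again the best (0.804255).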

##### K-NEAREST NEIGHBORS

In [32]:
def CVKNN (est, xtr, ytr):
    # Randomized search over KNN_param: 5-fold cross-validation, candidates ranked by recall
    result = RandomizedSearchCV(estimator = est, param_distributions = KNN_param, cv=5, scoring = 'recall').fit(xtr, ytr)
    return result

In [33]:
# 80% Train

for i in range(1,4):
    cv_knn = CVKNN(KNN, x_train, y_train)
    print('Hyper Model', i, cv_knn.best_params_)

Hyper Model 1 {'p': 2, 'n_neighbors': 15, 'leaf_size': 39}
Hyper Model 2 {'p': 1, 'n_neighbors': 25, 'leaf_size': 39}
Hyper Model 3 {'p': 2, 'n_neighbors': 25, 'leaf_size': 18}


In [34]:
# 90% Train

for i in range(1,4):
    cv_knn1 = CVKNN(KNN1, x1_train, y1_train)
    print('Hyper Model', i, cv_knn1.best_params_)

Hyper Model 1 {'p': 2, 'n_neighbors': 29, 'leaf_size': 45}
Hyper Model 2 {'p': 1, 'n_neighbors': 27, 'leaf_size': 48}
Hyper Model 3 {'p': 1, 'n_neighbors': 21, 'leaf_size': 24}


In [56]:
# applying hyperparameter models for 80% Train

KNN_hyper1 = KNeighborsClassifier(p = 2, n_neighbors = 15, leaf_size = 39).fit(x_train, y_train)
KNN_hyper2 = KNeighborsClassifier(p = 1, n_neighbors = 25, leaf_size = 39).fit(x_train, y_train)
KNN_hyper3 = KNeighborsClassifier(p = 2, n_neighbors = 25, leaf_size = 18).fit(x_train, y_train)

In [57]:
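# applying hyperparameter models for 90% Train
# NOTE: the values below do not match the 90% search output printed above
# ({'p': 2, 'n_neighbors': 29, 'leaf_size': 45}, {'p': 1, 'n_neighbors': 27, 'leaf_size': 48},
#  {'p': 1, 'n_neighbors': 21, 'leaf_size': 24}); the scores reported below come from the values as written here.

KNN1_hyper1 = KNeighborsClassifier(p = 2, n_neighbors = 15, leaf_size = 39).fit(x1_train, y1_train)
KNN1_hyper2 = KNeighborsClassifier(p = 1, n_neighbors = 25, leaf_size = 39).fit(x1_train, y1_train)
KNN1_hyper3 = KNeighborsClassifier(p = 1, n_neighbors = 25, leaf_size = 18).fit(x1_train, y1_train)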

In [58]:
# setting the default vs hyperparameter model score

# 80% Train

KNN_score = KNN.score(x_test, y_test)
KNN_hyper1_score = KNN_hyper1.score(x_test, y_test)
KNN_hyper2_score = KNN_hyper2.score(x_test, y_test)
KNN_hyper3_score = KNN_hyper3.score(x_test, y_test)

# 90% Train

KNN1_score = KNN1.score(x1_test, y1_test)
KNN1_hyper1_score = KNN1_hyper1.score(x1_test, y1_test)
KNN1_hyper2_score = KNN1_hyper2.score(x1_test, y1_test)
KNN1_hyper3_score = KNN1_hyper3.score(x1_test, y1_test)

In [59]:
model_KNN80_score = pd.DataFrame({'K-Nearest Neighbors' : ['Default', 'Hyper Test 1', 'Hyper Test 2', 'Hyper Test 3'], 
                                  'Model Score': [KNN_score, KNN_hyper1_score, KNN_hyper2_score, KNN_hyper3_score]})

In [60]:
model_KNN90_score = pd.DataFrame({'K-Nearest Neighbors' : ['Default', 'Hyper Test 1', 'Hyper Test 2', 'Hyper Test 3'], 
                                  'Model Score': [KNN1_score, KNN1_hyper1_score, KNN1_hyper2_score, KNN1_hyper3_score]})

In [61]:
pd.concat([model_KNN80_score, model_KNN90_score], keys = ['KNN 80 Score', 'KNN 90 Score'])

Unnamed: 0,Unnamed: 1,K-Nearest Neighbors,Model Score
KNN 80 Score,0,Default,0.770759
KNN 80 Score,1,Hyper Test 1,0.782115
KNN 80 Score,2,Hyper Test 2,0.788502
KNN 80 Score,3,Hyper Test 3,0.787793
KNN 90 Score,0,Default,0.777305
KNN 90 Score,1,Hyper Test 1,0.780142
KNN 90 Score,2,Hyper Test 2,0.797163
KNN 90 Score,3,Hyper Test 3,0.797163


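On the 80% training split, KNN performs best with hyperparameter test no. 2 (0.788502 versus 0.770759 for the default). On the 90% training split, tests no. 2 and no. 3 tie at 0.797163, both above the default (0.777305). The tie is expected: as fitted, these two models differ only in `leaf_size`, which affects the speed of the neighbor search rather than the predictions. We can therefore use either of these hyperparameter sets for our model.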
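Retyping the printed parameters by hand makes it easy to carry over the wrong values. One way to avoid that, sketched below on synthetic data, is to reuse the fitted search object: with the default `refit=True`, `best_estimator_` is already refitted on the full training set with the winning parameters. The data, the reduced grid and the variable names are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (illustrative only)
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.8, stratify=y_demo, random_state=1
)

search = RandomizedSearchCV(
    estimator=KNeighborsClassifier(),
    param_distributions={'n_neighbors': list(range(1, 30)), 'p': [1, 2]},
    n_iter=10,
    cv=5,
    scoring='recall',
    random_state=1,
).fit(X_tr, y_tr)

# The winning model, already refitted on all of X_tr (refit=True is the default)
best_knn = search.best_estimator_
y_pred = best_knn.predict(X_te)

print(search.best_params_)
print('test accuracy:', round(accuracy_score(y_te, y_pred), 3))
print('test recall:', round(recall_score(y_te, y_pred), 3))
```

Reporting both accuracy and recall on the test set also makes it visible whether the metric optimized by the search actually improved.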