# Hyperparameter Optimization: Cross-Validation with Manual Search

1. A very small project intended for learning about K-fold cross-validation and hyperparameter tuning with manual search.

2. This project uses the breast cancer dataset.

3. This is a classification dataset, and the classification algorithm used is logistic regression.

4. Acknowledgement and reference:

I am grateful for the resources created by Soledad Galli:

       -- https://www.udemy.com/course/hyperparameter-optimization-for-machine-learning/
       -- https://github.com/solegalli/hyperparameter-optimization

# Importing libraries

In [36]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score,f1_score
from sklearn.model_selection import KFold,cross_validate
from sklearn.datasets import load_breast_cancer

In [10]:
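breast_cancer_X, breast_cancer_y = load_breast_cancer(return_X_y=True)
X = pd.DataFrame(breast_cancer_X)
# scikit-learn encodes malignant as 0 and benign as 1;
# flip the labels so that 1 = malignant (the positive class) and 0 = benign
y = pd.Series(breast_cancer_y).map({0:1, 1:0})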

X.head()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |


In [12]:
y.head()

0    1
1    1
2    1
3    1
4    1
dtype: int64

# Check the proportion of each class label

This step is important for understanding whether the dataset is balanced.

If the data is imbalanced, we need to handle the imbalance.

Step 1: Count the samples in each class with value_counts().

Step 2: Get the total number of samples with len().

Step 3: Divide the counts by the total to get each class's share.

In [13]:
y.value_counts()

0    357
1    212
dtype: int64

In [14]:
len(y)

569

In [17]:
print('Fraction of benign tumors:', (y.value_counts()/len(y))[0])
print('Fraction of malignant tumors:', (y.value_counts()/len(y))[1])

Fraction of benign tumors: 0.6274165202108963
Fraction of malignant tumors: 0.37258347978910367
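The three steps above can also be collapsed into a single call. The following is a minimal sketch, assuming only the class counts reported above (357 benign, 212 malignant); the `labels` series is rebuilt from those counts for illustration rather than loaded from the dataset.

```python
import pandas as pd

# Rebuild a label series from the class counts reported above:
# 357 benign (0) and 212 malignant (1).
labels = pd.Series([0] * 357 + [1] * 212)

# normalize=True returns each class's share of the total instead of raw counts.
proportions = labels.value_counts(normalize=True)

print(f"benign:    {proportions[0]:.1%}")
print(f"malignant: {proportions[1]:.1%}")
```

With roughly 63% of samples in one class and 37% in the other, the dataset is only mildly imbalanced.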


# Split the data into train and test dataset

In [19]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=42)

This dataset contains only numeric features, so no encoding is needed.

# Training the model

In [22]:
logit = LogisticRegression(penalty='l2',C=1.0,solver='liblinear',random_state=42,verbose= 1)

In [26]:
#Here the model is just created but not trained.

In [40]:
# Create the cross-validation scheme: 5 folds, shuffled before splitting
kf = KFold(n_splits=5,shuffle=True,random_state=42)

# Estimate the generalization error

clf = cross_validate(
    logit,
    X_train,
    y_train,
    scoring=['f1','accuracy'], # any supported scikit-learn scoring metric can be used here
    cv=kf,
    return_train_score=True,
)

[LibLinear][LibLinear][LibLinear][LibLinear][LibLinear]
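To make the cross-validation scheme more concrete, here is a small sketch of what `KFold` does with the row indices. The ten-row array `X_toy` is a hypothetical stand-in for `X_train`, used only so that the folds are easy to read.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten toy samples stand in for X_train (hypothetical data, for illustration only).
X_toy = np.arange(20).reshape(10, 2)

kf = KFold(n_splits=5, shuffle=True, random_state=42)

folds = []
for fold, (train_idx, val_idx) in enumerate(kf.split(X_toy)):
    # Each fold holds out a different fifth of the rows for validation
    # and trains on the remaining four fifths.
    folds.append((train_idx, val_idx))
    print(f"fold {fold}: train={train_idx.tolist()} validation={val_idx.tolist()}")
```

Every row lands in the validation set exactly once, which is why `cross_validate` returns one score per fold.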

In [44]:
df_metric = pd.DataFrame(clf)
df_metric

| | fit_time | score_time | test_f1 | train_f1 | test_accuracy | train_accuracy |
|---|---|---|---|---|---|---|
| 0 | 0.024873 | 0.003003 | 0.918033 | 0.944206 | 0.9375 | 0.959119 |
| 1 | 0.021293 | 0.00202 | 0.877193 | 0.945607 | 0.9125 | 0.959119 |
| 2 | 0.014493 | 0.003798 | 0.962963 | 0.92887 | 0.975 | 0.946541 |
| 3 | 0.011117 | 0.001013 | 0.88 | 0.95082 | 0.924051 | 0.962382 |
| 4 | 0.016731 | 0.002058 | 1.0 | 0.919643 | 1.0 | 0.943574 |


The table gives a separate train and test score for each fold,
but two things are still missing:

1. A single summary value of the metric.
        - Mean of the metric across folds
        - Standard deviation of the metric across folds
        
2. The hyperparameter values that give the best result.
        - Search techniques
            - Manual search
            - Grid search
            - Random search

In [46]:
np.mean(df_metric['test_accuracy'])

0.9498101265822786

In [47]:
np.std(df_metric['test_accuracy'])

0.03274351658794615
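The mean and standard deviation can be computed for several metrics at once. The sketch below copies the per-fold test scores from the table above into a small DataFrame (the name `scores` is introduced here for illustration) and summarizes each column.

```python
import pandas as pd

# Per-fold test scores copied from the cross-validation table above.
scores = pd.DataFrame({
    "test_f1": [0.918033, 0.877193, 0.962963, 0.88, 1.0],
    "test_accuracy": [0.9375, 0.9125, 0.975, 0.924051, 1.0],
})

# np.std defaults to the population formula (ddof=0), while pandas' .std()
# defaults to ddof=1, so ddof=0 is passed here to match the value printed above.
summary = pd.DataFrame({
    "mean": scores.mean(),
    "std": scores.std(ddof=0),
})
print(summary)
```

The choice of `ddof` matters with only five folds: the sample standard deviation (`ddof=1`) would come out noticeably larger than the value shown above.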

# Manual Search

In manual search we change the hyperparameter values by hand
and check whether the metric improves.

Each change means re-running cross-validation, recomputing the table, and tracking the improvement.

Previously C=1. Now let's change C to 0.01 and see how the performance metric responds. In LogisticRegression, C is the inverse of the regularization strength, so a smaller C means stronger regularization.

In [50]:
logit = LogisticRegression(penalty='l2',C=0.01,random_state=42,solver='liblinear',verbose=1)

kf = KFold(n_splits=5,shuffle=True,random_state=42)

clf = cross_validate(logit,X_train,y_train,scoring=['accuracy','f1'],cv=kf,return_train_score=True)

metric = pd.DataFrame(clf)

[LibLinear][LibLinear][LibLinear][LibLinear][LibLinear]

In [52]:
np.mean(metric['test_accuracy'])

0.9121518987341772

In [53]:
np.std(metric['test_accuracy'])

0.020693250670147212
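Repeating the cell by hand for every value quickly becomes tedious, so the same procedure can be sketched as a loop. This is one possible way to organize it, not the notebook's original code: the candidate list is hand-picked, and the value 0.1 is an extra point added for illustration. The `penalty` argument is left at its default, which is L2.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate, train_test_split

X, y = load_breast_cancer(return_X_y=True)
y = 1 - y  # flip labels so that 1 = malignant, as in the notebook

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Candidate values picked by hand; 0.1 is an extra value added for illustration.
candidates = [0.01, 0.1, 1.0]

results = {}
for C in candidates:
    model = LogisticRegression(C=C, solver="liblinear", random_state=42)
    cv = cross_validate(model, X_train, y_train, scoring=["accuracy", "f1"], cv=kf)
    acc = cv["test_accuracy"]
    # Store the mean and (population) standard deviation across the five folds.
    results[C] = (np.mean(acc), np.std(acc))
    print(f"C={C}: accuracy = {results[C][0]:.4f} +/- {results[C][1]:.4f}")
```

Because the same `KFold` object and `random_state` are reused, every candidate is scored on identical folds, which keeps the comparison fair.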

# Conclusion

After manually changing C from 1 to 0.01, the mean test accuracy across folds drops from about 0.950 to about 0.912, so C=1 is the better of the two values tried.
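The notebook stops at cross-validation and never touches the held-out test set. As a possible final step, assuming C=1.0 is kept, the model could be refit on the full training set and scored once on `X_test` using the `accuracy_score` and `f1_score` functions imported at the top. The name `final_model` is introduced here for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
y = 1 - y  # flip labels so that 1 = malignant, as in the notebook

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Refit on the whole training set with the value of C retained from the search.
final_model = LogisticRegression(C=1.0, solver="liblinear", random_state=42)
final_model.fit(X_train, y_train)

# Score once on the held-out test set, which played no part in choosing C.
y_pred = final_model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
test_f1 = f1_score(y_test, y_pred)
print(f"test accuracy: {test_accuracy:.4f}, test F1: {test_f1:.4f}")
```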
