<a href="https://colab.research.google.com/github/Aftabgazali/Learning-Best-Practices-for-Model-Evaluation-and-Hyperparameter-Tuning.ipynb/blob/main/Learning_Best_Practices_for_Model_Evaluation_and_Hyperparameter_Tuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Importing the Breast Cancer Dataset

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer

In [None]:
breast_data = load_breast_cancer()
df = pd.DataFrame(data = breast_data.data, columns = breast_data.feature_names)
df['target'] = breast_data.target
df.head()

In [None]:
print(f"Class labels are {np.unique(df['target'])}")

In [None]:
X = df.iloc[:,:-1].values
y = df.iloc[:,-1].values

# Split the dataset into Training & Testing

In [None]:
from sklearn.model_selection import train_test_split

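# stratify=y keeps the class proportions equal in both splits;
# random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=1)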

# Data Pipelining

* The `make_pipeline` function takes an arbitrary number of scikit-learn transformers (objects that support the `fit` and `transform` methods) as input, followed by a scikit-learn estimator that implements the `fit` and `predict` methods. In the code example below, we provide two scikit-learn transformers, `StandardScaler` and `PCA`, and a `LogisticRegression` estimator as inputs to `make_pipeline`, which constructs a scikit-learn `Pipeline` object from these objects. We can think of a scikit-learn `Pipeline` as a meta-estimator or wrapper around those individual transformers and estimators. If we call the `fit` method of `Pipeline`, the data is passed down a series of transformers via `fit` and `transform` calls on these intermediate steps until it arrives at the estimator object (the final element in a pipeline). The estimator is then fitted to the transformed training data.

* When we execute the `fit` method on `pipeline` in the code example below, `StandardScaler` first performs `fit` and `transform` calls on the training data. Second, the transformed training data is passed on to the next object in the pipeline, `PCA`. Similar to the previous step, `PCA` also executes `fit` and `transform` on the scaled input data and passes it to the final element of the pipeline, the estimator.

* Finally, the `LogisticRegression` estimator is fit to the training data after it has been transformed by `StandardScaler` and `PCA`. Again, we should note that there is no limit to the number of intermediate steps in a pipeline; however, if we want to use the pipeline for prediction tasks, the last pipeline element has to be an estimator.
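The steps described above can be sketched by hand and compared against the pipeline. This is only an illustrative sketch: the names `X_demo`, `X_tr`, `manual_pred`, and `pipe_pred`, as well as the fixed `random_state`, are assumptions made for this example and are not part of the notebook's own cells.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=1)

# Manual version: every transformer is fit on the training data only,
# and the fitted transformer is then reused on the test data.
scaler = StandardScaler()
pca = PCA(n_components=2)
clf = LogisticRegression()

X_tr_pca = pca.fit_transform(scaler.fit_transform(X_tr))
clf.fit(X_tr_pca, y_tr)
manual_pred = clf.predict(pca.transform(scaler.transform(X_te)))

# Pipeline version: the same chain of calls, wrapped in one object.
pipe = make_pipeline(StandardScaler(), PCA(n_components=2), LogisticRegression())
pipe.fit(X_tr, y_tr)
pipe_pred = pipe.predict(X_te)

# Both routes should produce the same predictions.
print(np.array_equal(manual_pred, pipe_pred))
```

Besides being shorter, the pipeline makes it harder to accidentally call `fit` on the test data, since `predict` only ever calls `transform` on the intermediate steps.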

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


pipeline = make_pipeline(StandardScaler(), PCA(n_components=2), LogisticRegression())
pipeline.fit(X_train,y_train)
y_pred = pipeline.predict(X_test)
test_acc = pipeline.score(X_test, y_test)
print(f"Testing Accuracy {test_acc*100:.2f}")

# Assessing the Model's Performance

*The recommended approach is k-fold cross-validation: we randomly split the training dataset into k folds without replacement.
Here, k – 1 folds, the so-called training folds, are used for model training, and one fold, the so-called
test fold, is used for performance evaluation. This procedure is repeated k times so that we obtain k
models and performance estimates.*

**`k=10` is generally a good default value. For very large datasets, a smaller value such as `k=5` can still give an accurate estimate at lower computational cost.**
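To make the procedure concrete, here is a minimal sketch of the k-fold loop written out by hand with `StratifiedKFold`. The names `X_cv`, `y_cv`, `model`, and `fold_scores`, the shuffling, and the `random_state` are assumptions for this illustration. For classifiers, `cross_val_score` with an integer `cv` also uses stratified folds.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_cv, y_cv = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
fold_scores = []
for fold, (train_idx, val_idx) in enumerate(kfold.split(X_cv, y_cv), start=1):
    # A fresh, unfitted copy of the model is trained on the k - 1 training folds
    fold_model = clone(model)
    fold_model.fit(X_cv[train_idx], y_cv[train_idx])
    # The held-out fold is used only for evaluation
    acc = fold_model.score(X_cv[val_idx], y_cv[val_idx])
    fold_scores.append(acc)
    print(f"Fold {fold:2d}: accuracy {acc:.3f}")

print(f"Mean accuracy {np.mean(fold_scores):.3f} +/- {np.std(fold_scores):.3f}")
```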

In [None]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(estimator=pipeline, X=X_train, y = y_train, cv=10)
print(f"CV accuracy scores | {scores*100}")
print(f"\nCV accuracy {np.mean(scores)*100:.2f}")

# Debugging algorithms with learning and validation curves

In [None]:
from sklearn.model_selection import learning_curve

pipeline = make_pipeline(StandardScaler(), LogisticRegression(penalty='l2', max_iter=1000))

train_sizes, train_scores, test_scores =\
                          learning_curve(estimator=pipeline,X =X_train, y=y_train,
                          train_sizes=np.linspace(0.1,1.0,10), cv=10)
# Take the mean & std of the accuracies
train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)
plt.plot(train_sizes, train_mean,color='blue', marker='o',markersize=5, label='Training accuracy')
plt.fill_between(train_sizes,train_mean + train_std,train_mean - train_std,alpha=0.15, color='blue')
plt.plot(train_sizes, test_mean,color='green', linestyle='--',marker='s', markersize=5,label='Validation accuracy')
plt.fill_between(train_sizes,test_mean + test_std,test_mean - test_std,alpha=0.15, color='green')
plt.grid()
plt.xlabel('Number of training examples')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')
# limit the y-axis to the range 0.8 to 1.03
plt.ylim([0.8, 1.03])
plt.show()

**Plotting the Validation Curve**

*Here we tune the `C` parameter of logistic regression. We provide a range of candidate values, `param_range = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]`, and tell `validation_curve` which parameter to vary via `param_name='logisticregression__C'`.*
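The `'logisticregression__C'` name follows from how `make_pipeline` names its steps: each step gets the lowercased class name, and a step's parameters are addressed as `<step>__<parameter>`. A small sketch for looking these names up (the variable names `demo_pipe` and `c_params` are assumptions for this example):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

demo_pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Step names generated by make_pipeline
print(list(demo_pipe.named_steps))

# All tunable parameter names that end in "__C"
c_params = [name for name in demo_pipe.get_params() if name.endswith("__C")]
print(c_params)

# The same name can be used to set the parameter directly
demo_pipe.set_params(logisticregression__C=0.1)
print(demo_pipe.named_steps["logisticregression"].C)
```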

In [None]:
from sklearn.model_selection import validation_curve
param_range = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
train_scores, test_scores = validation_curve(estimator=pipeline, X = X_train, y = y_train, param_name='logisticregression__C',
                                  param_range=param_range,cv=10)
# Take the mean & std of the accuracies
train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)
plt.plot(param_range, train_mean,color='blue', marker='o',markersize=5, label='Training accuracy')
plt.fill_between(param_range,train_mean + train_std,train_mean - train_std,alpha=0.15, color='blue')
plt.plot(param_range, test_mean,color='green', linestyle='--',marker='s', markersize=5,label='Validation accuracy')
plt.fill_between(param_range,test_mean + test_std,test_mean - test_std,alpha=0.15, color='green')
plt.grid()
plt.xscale('log')
plt.xlabel('Parameter C')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')
# limit the y-axis to the range 0.8 to 1.03
plt.ylim([0.8, 1.03])
plt.show()

*From the validation curve above, the best value of C appears to lie between 0.1 and 1.0 (10^0).*

# Tuning the Model using Hyperparameter Grid Search

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

pipe_svc = make_pipeline(StandardScaler(), SVC())
param_range = [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]

param_grid = [{'svc__C': param_range, 'svc__kernel': ['linear']},
              {'svc__C': param_range, 'svc__gamma': param_range, 'svc__kernel': ['rbf']}]

gs = GridSearchCV(estimator=pipe_svc,
                  param_grid=param_grid,
                  scoring='accuracy',
                  cv=10,
                  refit=True)

gs = gs.fit(X_train, y_train)

# Report the best cross-validation score and the best parameters
print(gs.best_score_)
print(gs.best_params_)

*Get the best model using `gs.best_estimator_`*
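With `refit=True`, the search object refits the best parameter combination on the whole training set, so `best_estimator_` is ready to use and the search object's own `predict` delegates to it. The sketch below checks this on a deliberately small grid. The grid values and the names `X_gs`, `small_grid`, `search`, and `same` are assumptions chosen to keep the example fast, not the grid used in this notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_gs, y_gs = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_gs, y_gs, test_size=0.2, stratify=y_gs, random_state=1)

# Small illustrative grid (assumption, chosen only for speed)
small_grid = {"svc__C": [0.1, 1.0, 10.0], "svc__kernel": ["linear", "rbf"]}
search = GridSearchCV(make_pipeline(StandardScaler(), SVC()),
                      param_grid=small_grid,
                      scoring="accuracy",
                      cv=5,
                      refit=True)
search.fit(X_tr, y_tr)

# best_estimator_ is already fitted; no extra call to fit is needed
same = np.array_equal(search.predict(X_te),
                      search.best_estimator_.predict(X_te))
print(search.best_params_)
print(same)
```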

In [None]:
best_model = gs.best_estimator_
# With refit=True, grid search automatically refits the best parameter
# combination on the whole training set, so there is no need to fit best_model again
# best_model.fit(X_train, y_train)
print(f"Testing Accuracy {best_model.score(X_test,y_test)*100:.2f}")

In [None]:
y_pred = best_model.predict(X_test)

# Performance Metrics

In [None]:
from sklearn.metrics import precision_score, recall_score, f1_score, matthews_corrcoef, roc_auc_score

pre_val = precision_score(y_true = y_test, y_pred = y_pred)

rec_val = recall_score(y_true=y_test, y_pred = y_pred)

f1_val = f1_score(y_true=y_test, y_pred = y_pred)

mcc_val = matthews_corrcoef(y_true = y_test, y_pred = y_pred)

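# ROC AUC is computed from continuous decision scores rather than hard 0/1 labels
roc_val = roc_auc_score(y_true = y_test, y_score = best_model.decision_function(X_test))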

print(f"The Precision score: {pre_val} | the recall score is {rec_val} | the f1 score is {f1_val} | the mcc score is {mcc_val} | Roc value is {roc_val}")

# Dealing with Class Imbalance

**Let's create an imbalanced dataset: we keep all samples of class 0 and only the first 40 samples of class 1.**

In [None]:
print(X.shape[0])

In [None]:
X_imb = np.vstack((X[y ==0],X[y == 1][:40]))
y_imb = np.hstack((y[y == 0], y[y == 1][:40]))

In [None]:
# 212 class 0 samples in the dataset
X_imb[y_imb == 0].shape[0]

In [None]:
X[y ==0],X[y == 1][:40]

**If we predict class 0 for every sample and compare against the imbalanced labels, we get about 84% accuracy. In other words, about 84% of the samples (212 of 252) are class 0 and the remaining 16% or so are class 1, so the dataset is imbalanced.**

In [None]:
y_pred = np.zeros(y_imb.shape[0])
print(np.mean(y_pred == y_imb)*100)

In [None]:
y_pred = np.ones(y_imb.shape[0])
np.mean(y_pred == y_imb)*100

***As a result, a model trained on such data tends to be biased toward the majority class. Because most examples fall under class 0, it can reach high accuracy without learning anything useful about class 1, so we can't rely on accuracy alone as the performance metric for validation.***

***The algorithm implicitly learns a model that optimizes the predictions based on the
most abundant class in the dataset to minimize the loss or maximize the reward during training***
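The majority-class baseline discussed above can also be reproduced with scikit-learn's `DummyClassifier`, which makes it easy to see why accuracy alone is misleading here. This is a sketch; the names `X_skew`, `y_skew`, and `baseline` are assumptions for this example, and the data is rebuilt the same way as `X_imb` and `y_imb`.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

X_all, y_all = load_breast_cancer(return_X_y=True)
# All class 0 samples plus the first 40 class 1 samples
X_skew = np.vstack((X_all[y_all == 0], X_all[y_all == 1][:40]))
y_skew = np.hstack((y_all[y_all == 0], y_all[y_all == 1][:40]))

# Always predicts the most frequent class seen during fit (class 0 here)
baseline = DummyClassifier(strategy="most_frequent").fit(X_skew, y_skew)
baseline_pred = baseline.predict(X_skew)

baseline_acc = accuracy_score(y_skew, baseline_pred)
# Recall on class 1: how many of the minority samples were found
baseline_recall = recall_score(y_skew, baseline_pred, pos_label=1)

print(f"Baseline accuracy: {baseline_acc * 100:.2f}%")
print(f"Baseline recall on class 1: {baseline_recall:.2f}")
```

A model is only worth keeping if it clearly beats this baseline, and the recall on class 1 shows that the baseline finds none of the minority samples despite its high accuracy.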

*For example, if our focus was to identify as many patients with cancer as possible, and assuming class 1 indicates a patient with cancer, **recall** would be our performance metric of choice: recall is TP / (TP + FN), so maximising it means missing as few positive cases as possible.*

*In spam filtering, we don't want to label an email as spam unless the system is fairly sure, so in that case we should look at **precision**, which is TP / (TP + FP).*

***Note:*** *A TP means that predicted and actual class/label was '1', a TN means that predicted and actual label was '0'. A FP means predicted class was '1' but actual was '0'. FN means predicted class '0' and actual class '1'*
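The four counts in the note above can be read off a confusion matrix, and precision and recall follow directly from them. A small sketch on made-up labels (the toy arrays `y_true_toy` and `y_pred_toy` are invented for this illustration):

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels, invented for illustration only
y_true_toy = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred_toy = [1, 0, 1, 0, 1, 0, 0, 1]

# For binary labels {0, 1}, ravel() yields the counts in this order
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")

# Precision: of everything predicted as 1, how much really was 1
precision_manual = tp / (tp + fp)
# Recall: of everything that really was 1, how much was predicted as 1
recall_manual = tp / (tp + fn)

print(precision_manual, precision_score(y_true_toy, y_pred_toy))
print(recall_manual, recall_score(y_true_toy, y_pred_toy))
```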

***One way to deal with imbalanced class proportions during model fitting is to assign a larger penalty
to wrong predictions on the minority class. Via scikit-learn, adjusting such a penalty is as convenient
as setting the class_weight parameter to class_weight='balanced', which is implemented for most
classifiers.***
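As a rough sketch of what `class_weight='balanced'` does, the weights it implies can be inspected with `compute_class_weight`, which uses `n_samples / (n_classes * count_per_class)`. The names `X_skew`, `y_skew`, `weights`, and `weighted_clf` are assumptions for this example, and the data is rebuilt the same way as `X_imb` and `y_imb`.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils.class_weight import compute_class_weight

X_all, y_all = load_breast_cancer(return_X_y=True)
X_skew = np.vstack((X_all[y_all == 0], X_all[y_all == 1][:40]))
y_skew = np.hstack((y_all[y_all == 0], y_all[y_all == 1][:40]))

# The minority class (1) receives the larger weight
weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]),
                               y=y_skew)
print(dict(zip([0, 1], weights)))

# The same weighting is applied inside the classifier's loss
weighted_clf = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000))
weighted_clf.fit(X_skew, y_skew)
print(weighted_clf.named_steps["logisticregression"].class_weight)
```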

**Other popular strategies for dealing with class imbalance include upsampling the minority class,
downsampling the majority class, and the generation of synthetic training examples**

In [None]:
from sklearn.utils import resample
print(f"Number of class 1 before resample {X_imb[y_imb == 1].shape[0]}")

X_upsampled, y_upsampled = resample(X_imb[y_imb == 1], y_imb[y_imb == 1], replace = True, n_samples = X_imb[y_imb == 0].shape[0])


# n_samples is set to the number of majority-class (class 0) samples so that
# the upsampled minority class matches the majority class in size
print(f"Number of class 1 after resample {X_upsampled[y_upsampled == 1].shape[0]}")

*After upsampling, we have to stack the upsampled minority samples with the majority-class samples: `y_upsampled[y_upsampled == 0]` is empty because class 0 is not present in the upsampled data.*

In [None]:
X_balanced = np.vstack((X[y == 0], X_upsampled))
y_balanced = np.hstack((y[y == 0], y_upsampled))

In [None]:
y_balanced[y_balanced == 0].shape[0]

In [None]:
y_pred = np.ones(y_balanced.shape[0])

np.mean(y_pred == y_balanced)*100
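The same `resample` helper can be used for the opposite strategy, downsampling the majority class. The sketch below draws, without replacement, as many class 0 samples as there are class 1 samples. The names `X_skew`, `y_skew`, `X_down`, and `y_bal_down`, and the `random_state`, are assumptions for this example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.utils import resample

X_all, y_all = load_breast_cancer(return_X_y=True)
X_skew = np.vstack((X_all[y_all == 0], X_all[y_all == 1][:40]))
y_skew = np.hstack((y_all[y_all == 0], y_all[y_all == 1][:40]))

n_minority = int(np.sum(y_skew == 1))

# replace=False: each majority sample is picked at most once
X_down, y_down = resample(X_skew[y_skew == 0],
                          y_skew[y_skew == 0],
                          replace=False,
                          n_samples=n_minority,
                          random_state=123)

# Stack the reduced majority class with the untouched minority class
X_bal_down = np.vstack((X_down, X_skew[y_skew == 1]))
y_bal_down = np.hstack((y_down, y_skew[y_skew == 1]))

print(np.bincount(y_bal_down))
```

Downsampling discards data, so on a dataset this small it trades class balance for fewer training examples; upsampling keeps every sample but repeats minority examples.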


# Further Reading

To learn about the Synthetic Minority Over-sampling Technique (SMOTE), it is highly recommended to check out imbalanced-learn, a Python library that is entirely focused on imbalanced datasets, including an implementation of SMOTE. You can learn more about imbalanced-learn at https://github.com/scikit-learn-contrib/imbalanced-learn.