# Optimizing the Training Process

In this notebook, we'll put into practice some of the techniques used to improve the training process.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.dummy import DummyClassifier
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import cross_validate

from scipy.stats import uniform as sp_randFloat
from scipy.stats import randint as sp_randInt

%matplotlib inline

## The Data

Once again, we'll be working with the [Breast Cancer dataset](https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(original)) from the UCI ML Repository.

As a reminder, this dataset consists of the following attributes:

1. Sample code number: id number
2. Clump Thickness: 1 - 10
3. Uniformity of Cell Size: 1 - 10
4. Uniformity of Cell Shape: 1 - 10
5. Marginal Adhesion: 1 - 10
6. Single Epithelial Cell Size: 1 - 10
7. Bare Nuclei: 1 - 10
8. Bland Chromatin: 1 - 10
9. Normal Nucleoli: 1 - 10
10. Mitoses: 1 - 10
11. Class: (2 for benign, 4 for malignant)

In [None]:
#import data
data = pd.read_csv('breast-cancer-wisconsin.data', header=None)

#set column names
data.columns = ['Sample Code Number','Clump Thickness','Uniformity of Cell Size',
                'Uniformity of Cell Shape','Marginal Adhesion','Single Epithelial Cell Size',
                'Bare Nuclei','Bland Chromatin','Normal Nucleoli','Mitoses','Class']
#view top 10 rows
data.head(10)

In [None]:
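data = data.drop(['Sample Code Number'], axis=1) #Drop 1st column
data = data[data['Bare Nuclei'] != '?'].copy() #Remove rows with missing data (marked with '?')
data['Bare Nuclei'] = data['Bare Nuclei'].astype(int) #Column was read as text because of the '?' entries
data['Class'] = np.where(data['Class'] == 2, 0, 1) #Change the class representation (0 = benign, 1 = malignant)
data['Class'].value_counts() #Class distribution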

## Splitting the data

We'll start by separating the features from the labels and creating training/testing sets using the `train_test_split` function.

In [None]:
#Split data into attributes and class
X = data.drop(['Class'],axis=1)
y = data['Class']

#perform training and test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

Before building a classification model, let's build a Dummy Classifier to determine the "baseline" performance. This answers the question — "What would be the success rate of the model, if one were simply guessing?" The dummy classifier we are using will simply predict the majority class.

In [None]:
#Dummy Classifier
clf = DummyClassifier(strategy= 'most_frequent').fit(X_train,y_train)
y_pred = clf.predict(X_test)

#Distribution of y test
print('y actual : \n' +  str(y_test.value_counts()))

#Distribution of y predicted
print('y predicted : \n' + str(pd.Series(y_pred).value_counts()))

In [None]:
print(classification_report(y_test, y_pred, target_names=['Benign', 'Malignant'], zero_division=0))

#Dummy Classifier Confusion matrix
cm = confusion_matrix(y_test,y_pred)
cmd = ConfusionMatrixDisplay(cm, display_labels=clf.classes_)
cmd.plot()

From the output, we can observe that there are 68 malignant and 103 benign cases in the test dataset. However, our classifier predicts all cases as benign (as it is the majority class). The accuracy of the model is 60%, but this is a case where accuracy may not be the best metric to evaluate the model, since the classes are imbalanced. Precision, recall, f1-score, and the confusion matrix give us a better idea of the true performance of this model.
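
To make the gap between accuracy and recall concrete, here is a minimal sketch in plain Python. It rebuilds the label counts quoted above (103 benign, 68 malignant) and scores a predictor that always answers "benign". The variable names are illustrative and not part of the notebook's own code.

```python
# Counts quoted in the text: 103 benign (label 0) and 68 malignant (label 1) test cases.
y_true = [0] * 103 + [1] * 68
# A majority-class baseline predicts "benign" for every case.
y_pred = [0] * len(y_true)

# Malignant cases that were caught (true positives) and missed (false negatives).
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)

accuracy = correct / len(y_true)
recall_malignant = tp / (tp + fn)

print(f"accuracy: {accuracy:.2f}")
print(f"recall (malignant): {recall_malignant:.2f}")
```

The accuracy looks tolerable only because benign cases dominate; the recall for the malignant class shows that not a single malignant case is detected.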

## Testing and tuning models

Now that we have the baseline performance, we can start exploring ways to build a better model.

As previously mentioned, the purpose of the train-test split is reserving a subset of the data to not be used during training. This allows us to evaluate the model's generalization performance on new unseen data.
However, when evaluating different settings ("hyperparameters") for estimators, there is still a risk of overfitting on the test set because the parameters can be tweaked until the estimator performs optimally. This way, knowledge about the test set can "leak" into the model and evaluation metrics no longer report on generalization performance. To solve this problem, yet another part of the dataset can be held out as a so-called "validation set": training proceeds on the training set, after which evaluation is done on the validation set, and when the experiment seems to be successful, final evaluation can be done on the test set.

We can thus use the `train_test_split` function again to further divide the training set.

In [None]:
X_train_t, X_val, y_train_t, y_val = train_test_split(X_train, y_train, train_size=0.9, random_state=42)

We can now use this train set to train our model, check its performance on the validation set, and tune hyperparameters, repeating this process until we're satisfied.

In [None]:
clf = DecisionTreeClassifier()
clf.fit(X_train_t, y_train_t)

In [None]:
# Test with Training Set
y_pred = clf.predict(X_train_t)

print(classification_report(y_train_t, y_pred, target_names=['Benign', 'Malignant']))

cm = confusion_matrix(y_train_t, y_pred)
cmd = ConfusionMatrixDisplay(cm, display_labels=clf.classes_)
cmd.plot()

In [None]:
# Test with validation set
y_pred = clf.predict(X_val)

print(classification_report(y_val, y_pred, target_names=['Benign', 'Malignant']))

cm = confusion_matrix(y_val, y_pred)
cmd = ConfusionMatrixDisplay(cm, display_labels=clf.classes_)
cmd.plot()

At this moment, if we are not satisfied with the results, we can go back and adjust some of the model's hyperparameters. For instance, we could decide to limit the DT's max depth in order to fight overfitting.

In [None]:
clf = DecisionTreeClassifier(max_depth=15)
clf.fit(X_train_t, y_train_t)

Retest the model...

In [None]:
# Test with Training Set
y_pred = clf.predict(X_train_t)

print(classification_report(y_train_t, y_pred, target_names=['Benign', 'Malignant']))

cm = confusion_matrix(y_train_t, y_pred)
cmd = ConfusionMatrixDisplay(cm, display_labels=clf.classes_)
cmd.plot()

In [None]:
# Test with validation set
y_pred = clf.predict(X_val)

print(classification_report(y_val, y_pred, target_names=['Benign', 'Malignant']))

cm = confusion_matrix(y_val,y_pred)
cmd = ConfusionMatrixDisplay(cm, display_labels=clf.classes_)
cmd.plot()

...and repeat until we are satisfied with the validation results.

When done, we can retrain the model using all of the original training data (train + validation) with the chosen hyperparameters.

In [None]:
clf = DecisionTreeClassifier(max_depth=15)
clf.fit(X_train, y_train)

Finally, we can check the optimized model's performance using the test set.

In [None]:
y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred, target_names=['Benign', 'Malignant']))

cm = confusion_matrix(y_test,y_pred)
cmd = ConfusionMatrixDisplay(cm, display_labels=clf.classes_)
cmd.plot()

By fitting the Decision Tree model with our manually adjusted parameters, we have a much 'better' model than the previously established baseline. The accuracy is 94% and, at the same time, the precision is 95%. Now, let's take a look at the confusion matrix for this model. Looking at the misclassified instances, we can observe that 8 malignant cases have been incorrectly classified as benign (false negatives), and just 2 benign cases have been classified as malignant (false positives).

At least, those were the results when I originally ran this experiment. These results might be different for you depending on the train-validation-test splits you get (which are pseudo-random) and on the randomness in the tree-building itself, since no `random_state` is set on the classifier. In the next section, we'll explore a technique to minimize potential bias introduced by a "lucky" or "unlucky" split.

## Cross-Validation

As previously discussed, validation sets ensure that testing sets remain "hidden" from the training process. However, by partitioning the available data into three sets, we drastically reduce the number of samples that can be used for training the model. Additionally, the results can depend on a particular random choice for the pair of (train, validation) sets.

A solution to this problem is a procedure called cross-validation (CV for short). A test set should still be held out for final evaluation, but the validation set is no longer needed when doing CV. In the basic approach, called k-fold CV, the training set is split into k smaller sets. Then, for each of these k "folds" a model is trained using $k-1$ of the folds as training data; and the resulting model is validated on the remaining part of the data.
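
The splitting step can be sketched in plain Python. This is an illustration of the idea only, not scikit-learn's implementation: the helper `k_fold_indices` is hypothetical, it uses contiguous folds without shuffling, and it ignores the class-aware (stratified) splitting that scikit-learn applies to classifiers when `cv` is an integer.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        # Everything outside the current fold is used for training.
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size

folds = list(k_fold_indices(n_samples=10, k=5))
for i, (train, val) in enumerate(folds):
    print(f"fold {i}: train={train} validation={val}")
```

Each sample lands in the validation part exactly once, so every sample contributes to both training and validation across the k rounds.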

In [None]:
clf = DecisionTreeClassifier(max_depth=15)
scores = cross_validate(clf, X_train, y_train, cv=5, scoring=['accuracy', 'precision', 'recall', 'f1'])

print(f"{scores['test_accuracy'].mean():0.2f} accuracy with a standard deviation of {scores['test_accuracy'].std():0.2f}")
print(f"{scores['test_precision'].mean():0.2f} precision with a standard deviation of {scores['test_precision'].std():0.2f}")
print(f"{scores['test_recall'].mean():0.2f} recall with a standard deviation of {scores['test_recall'].std():0.2f}")
print(f"{scores['test_f1'].mean():0.2f} f1 with a standard deviation of {scores['test_f1'].std():0.2f}")

CV scores will give us a better overview of a particular model's (model + hyperparameters) performance on our data. However, we would still need to adjust hyperparameters, train, score, and repeat until satisfied. In the next section, we'll discuss how to automate this process.

## Optimizing the model using GridSearch

Instead of manually trying various hyperparameter combinations, we can automate the process by programmatically exploring a predefined range of hyperparameters, systematically and exhaustively, to find the combination that yields the best model performance. This technique is known as GridSearch.

Before proceeding, let's take a closer look at our previous results.

In [None]:
# Decision Tree Classifier
clf = DecisionTreeClassifier(max_depth=15).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Model Evaluation metrics
print(classification_report(y_test, y_pred, target_names=['Benign', 'Malignant'], zero_division=1))

# Classifier Confusion matrix
cm = confusion_matrix(y_test,y_pred)
cmd = ConfusionMatrixDisplay(cm, display_labels=clf.classes_)
cmd.plot()

As previously noted, our DT misclassified 8 malignant tumors as benign, and 2 benign as malignant. For this particular problem, a false negative is more serious as a disease has been ignored, which can lead to the death of the patient. At the same time, a false positive would lead to an unnecessary treatment — incurring additional cost.
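
As a small worked example, the counts quoted in this notebook (103 benign and 68 malignant test cases, 8 false negatives, 2 false positives) are enough to recover the recall for the malignant class. These numbers come from the author's original run, so yours may differ; the variable names are illustrative.

```python
# Counts taken from the text: 103 benign and 68 malignant test cases,
# with 8 false negatives and 2 false positives.
fn, fp = 8, 2
tp = 68 - fn    # malignant cases correctly flagged
tn = 103 - fp   # benign cases correctly cleared

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)  # share of truly malignant cases that were caught

print(f"accuracy={accuracy:.2f} recall (malignant)={recall:.2f}")
```

Recall only moves when the number of false negatives changes, which is why it is the natural metric to target when missed malignant cases are the main concern.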

Let's try to minimize the false negatives by using Grid Search to find the optimal parameters. Grid search can be used to improve any specific evaluation metric. The metric we need to focus on to reduce false negatives is Recall, hence we'll set the `scoring` parameter to `recall`.

We'll also need to define which DT hyperparameters and possible values to explore. `GridSearchCV` will systematically perform k-fold cross-validation for each possible hyperparameter combination, tracking the average scores and identifying the best model found.

For this example, we'll search over different `criterion` (gini or entropy), `max_depth` and `max_features` combinations.
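
To get a feel for the cost of an exhaustive search, the sketch below enumerates every combination of a grid with two criteria, five depths, and ten feature settings (the same values used in the next cell). It uses only `itertools.product` from the standard library; the 5-fold figure assumes the default `cv` setting, since the notebook does not set `cv` explicitly.

```python
from itertools import product

# Candidate values mirroring the grid used in this notebook.
grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [10, 15, 20, 25, 30],
    "max_features": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, None],
}

# Every combination is one candidate model configuration.
names = list(grid)
combinations = [dict(zip(names, values)) for values in product(*grid.values())]

n_folds = 5  # assumed: the default number of folds when `cv` is not set
n_fits = len(combinations) * n_folds

print(f"{len(combinations)} combinations x {n_folds} folds = {n_fits} fits")
print(combinations[0])
```

The number of combinations is the product of the list lengths, so adding one more hyperparameter with a handful of values multiplies the total work.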



In [None]:
#Grid Search
clf = DecisionTreeClassifier()
grid_values = {'criterion':['gini', 'entropy'], 'max_depth':[10, 15, 20, 25, 30],
               'max_features':[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, None]}
grid_clf = GridSearchCV(clf, param_grid = grid_values, scoring = 'recall')
grid_clf.fit(X_train, y_train)

We can now check the best combination of hyperparameters (and the corresponding score) found.

In [None]:
# Obtain best parameters
best_parameters = grid_clf.best_params_
# Store parameters in a dataframe
pd.DataFrame.from_dict(best_parameters, orient='index', columns=['Assigned Value']).sort_index()

In [None]:
# Obtain best score
grid_clf.best_score_

In [None]:
#Predict values based on new parameters
y_pred = grid_clf.predict(X_test)

# New Model Evaluation metrics
print(classification_report(y_test, y_pred, target_names=['Benign', 'Malignant'], zero_division=1))

#Decision Tree (Grid Search) Confusion matrix
cm = confusion_matrix(y_test, y_pred)
cmd = ConfusionMatrixDisplay(cm)
cmd.plot()

From the confusion matrix above, we can see that the number of false negatives has decreased by 2. We've accomplished our goal; however, keep in mind that GridSearch is limited by our choice of hyperparameters and values. Poor choices will generally lead to poor results.

## Optimizing the model using RandomizedSearch

Random search is very similar to grid search. However, instead of testing every combination of hyperparameters, random search only tests a certain number of combinations that are selected at random.

At first glance, random search may seem unappealing. After all, if you can't test every hyperparameter combination, you are unlikely to find the best one. However, this approach does come with certain perks. Since random search tests fewer hyperparameter combinations, it requires less time and less computation to obtain results. We can therefore increase the number of hyperparameters and possible values without getting exponentially higher computation times. So, although random search may not necessarily find the best possible combination of hyperparameters, it can provide a model that comes close to the ideal model in terms of performance while searching in a bigger search space.
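
The sampling idea can be sketched with Python's standard `random` module. This only illustrates drawing a fixed number of candidates from ranges and lists; it is not how `RandomizedSearchCV` is implemented, and the helper `sample_candidate` and the seed are hypothetical choices made for this example.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

def sample_candidate(rng):
    """Draw one hypothetical hyperparameter combination at random."""
    return {
        "criterion": rng.choice(["gini", "entropy"]),
        "max_depth": rng.randint(1, 49),        # integers 1..49, both ends included
        "max_features": rng.uniform(0.0, 1.0),  # fraction of features per split
    }

# The budget is fixed up front, no matter how large the search space is.
n_iter = 5
candidates = [sample_candidate(rng) for _ in range(n_iter)]
for c in candidates:
    print(c)
```

The cost is set by `n_iter` rather than by the size of the space, which is what allows continuous ranges such as `max_features` to be searched at all.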

In [None]:
# RandomizedSearch
clf = DecisionTreeClassifier()
rand_values = {'criterion':['gini', 'entropy'], 'splitter':['best', 'random'],
               'max_depth':sp_randInt(1, 50), 'min_samples_split':sp_randFloat(),
               'max_features':sp_randFloat()}
rand_clf = RandomizedSearchCV(clf, param_distributions = rand_values, n_iter=500, scoring='recall')
rand_clf.fit(X_train, y_train)

In [None]:
# Obtain best parameters
best_parameters = rand_clf.best_params_
# Store parameters in a dataframe
pd.DataFrame.from_dict(best_parameters, orient='index', columns=['Assigned Value']).sort_index()

In [None]:
# Obtain best score
rand_clf.best_score_

In [None]:
#Predict values based on new parameters
y_pred = rand_clf.predict(X_test)

# New Model Evaluation metrics
print(classification_report(y_test, y_pred, target_names=['Benign', 'Malignant'], zero_division=1))

#Decision Tree (Randomized Search) Confusion matrix
cm = confusion_matrix(y_test,y_pred)
cmd = ConfusionMatrixDisplay(cm)
cmd.plot()

After RandomizedSearch, we've managed to achieve a result similar to the one obtained with GridSearch. Try increasing the number of iterations or adding extra hyperparameters to see if you can get a better result.

Finally, we can visualize the (surprisingly short) resulting Decision Tree.

In [None]:
# Pass the feature names as a plain list, which some scikit-learn versions require
plot_tree(rand_clf.best_estimator_, feature_names=list(rand_clf.feature_names_in_),
          class_names=['Benign', 'Malignant'], filled=True, rounded=True)
plt.show()

## Final Considerations

`GridSearchCV` and `RandomizedSearchCV` are fundamental techniques for fine-tuning our models. However, careful consideration should still be given to the selection of hyperparameters, values, and scoring used during the search. In this example we've chosen to focus on recall, but this won't always be the case. When choosing a scoring metric, carefully consider the nature of the task and the data's characteristics. Refer to scikit-learn's guides on [Cross-Validation](https://scikit-learn.org/stable/modules/cross_validation.html) and [Metrics and scoring](https://scikit-learn.org/stable/modules/model_evaluation.html) for additional information.