## Hyperparameter tuning & Ensemble methods

In [3]:
# Libraries to work with the data object
import pandas as pd 
import numpy as np

# libraries to visualize
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.tree import export_graphviz
from six import StringIO
from IPython.display import Image

import graphviz
import pydotplus

# sklearn packages for Decision Tree
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn import tree

# sklearn packages for KNN
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

#load our data
df = pd.read_csv("../data/heart_disease_health_indicators_BRFSS2015.csv", delimiter=",")

In [4]:
# Below, we make a list of features/independent variables 'X', and specify our target/dependent variable, y
# The model will guess/predict the 'y' feature (our target) based on the list of features, 'X'
# Running the cell will not produce any output. This is because we are defining X and y, which we will be using in the next section to train our model

X = df[['GenHlth', 'Age', 'DiffWalk', 'HighBP', 'Stroke', 'PhysHlth', 'HighChol', 'Diabetes', 'Income', 'Education', 'Smoker']].values

y = df['HeartDiseaseorAttack'].values

## 1. Splitting and Scaling

Before we go on with hyperparameter tuning, we split our data into a test and train sample. 

In [5]:
#Split data into test and train - 80/20
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

We also scale our data so that all features are on a comparable scale, which makes them easier for our models to process. We use sklearn's StandardScaler, which performs standardization (not normalization!). Unlike normalization, where the scaled values fall between 0 and 1, standardized values have no fixed minimum or maximum. Instead, each feature is rescaled so that it has a mean of 0 and a **standard** deviation of 1. Standardized values also make outliers easier to spot, because each value tells us how many standard deviations it lies from the mean.

Good to remember!
- **Standardization** - A scaling technique that works best when your data is roughly normally distributed. If a given data attribute is normal or close to normal, this is probably the scaling method to use.
- **Normalization** - A scaling technique that does not assume any specific distribution. If your data is not normally distributed, consider normalizing it prior to applying your machine learning algorithm.
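
To make the difference concrete, here is a minimal sketch that applies both scalings by hand to a small made-up array (the values in `x` are an assumption, chosen only for illustration). The standardization line uses the same formula as StandardScaler: subtract the mean and divide by the population standard deviation.

```python
import numpy as np

# Hypothetical toy feature: five values, one of them a large outlier.
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Standardization: subtract the mean, divide by the standard deviation.
x_std = (x - x.mean()) / x.std()

# Min-max normalization: the smallest value becomes 0, the largest becomes 1.
x_norm = (x - x.min()) / (x.max() - x.min())

print("standardized:", np.round(x_std, 2))
print("normalized:  ", np.round(x_norm, 2))
```

The standardized values have mean 0 and standard deviation 1 but no fixed bounds, while the normalized values always lie between 0 and 1.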

In [6]:
# create a standard scaler object and fit it to the training data
scaler = StandardScaler()
scaler.fit(X_train)

# transform the training and test data using the scaler
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)
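
Note that the scaler is fitted on the training data only and then reused for the test data. The sketch below illustrates what that means on randomly generated stand-in data (the `*_demo` names and the generated values are assumptions, not our dataset): the training set ends up with a mean of exactly 0 per feature, while the test set is only close to 0 because it is scaled with the training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical stand-ins for X_train / X_test: three features on very different scales.
X_train_demo = rng.normal(loc=[0, 50, 1000], scale=[1, 10, 200], size=(200, 3))
X_test_demo = rng.normal(loc=[0, 50, 1000], scale=[1, 10, 200], size=(50, 3))

# The mean and standard deviation are learned from the training data only.
scaler_demo = StandardScaler().fit(X_train_demo)
X_train_demo_std = scaler_demo.transform(X_train_demo)
X_test_demo_std = scaler_demo.transform(X_test_demo)

print("train means:", np.round(X_train_demo_std.mean(axis=0), 3))
print("test means: ", np.round(X_test_demo_std.mean(axis=0), 3))
```

Fitting on the training data alone keeps information about the test set out of the preprocessing step.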

***
***

## 2. Hyperparameter tuning

Now, we get to the hyperparameter tuning! We will try out two ways of finding optimal parameters: **GridSearch** (2.1) and **k-fold cross-validation for KNN** (2.2).

But before we move on, let's back up for a second. So far, we have looked at two classification models:
- Decision trees
- KNN

As touched upon in section 3 of this notebook, when we train these models, we can choose to specify different parameters, for instance:
- depth (for decision trees)
- k = the number of neighbours (for KNN models)

As we learned last week, the classification performance of our model is affected by the value of these parameters (if you are unsure how, take a look at the KNN visualization in the notebook from week 3).

Thus, we can improve a model's accuracy score through hyperparameter tuning. Hyperparameter tuning means finding good values for parameters such as k (in KNN models) and depth (in decision tree models). We do this because training our models with better parameters helps us increase their accuracy.

So, to sum up, **we will be doing hyperparameter tuning when we try to find the optimal k (for KNN classification models) or the optimal depth (for decision tree classification models)**.

Hopefully, hyperparameter tuning will allow us to reach a better accuracy than that of our previous decision tree models, clf and clf_2 (see the notebook from week 3 on classification).

### 2.1 GridSearch: finding the optimal max_depth (and more) for a decision tree

One way to find the optimal parameters for a decision tree is using GridSearchCV. 

When performing a GridSearch, we start by specifying the parameters we want to optimize. In the example we describe in this notebook, we are trying to find the optimal value for three decision tree parameters:
- max_depth
- min_samples_split
- min_samples_leaf

Before we move on, we'll explain what each of these parameters means in turn. We will be using decision tree terminology, so if you have trouble understanding what is meant by nodes, branches, depth etc., take a look at this illustration first. The model pictured in the image is a tree of depth 2:

<img src='decision_tree.png'>

**max_depth recap:**<br>
We already touched on this parameter in the lecture for week 3. To recap, the max_depth parameter in decision trees controls how deep the tree can grow during the training process. Setting a low max_depth can help prevent overfitting, but it might also miss some patterns in the data. On the other hand, setting a higher max_depth can capture more complex patterns, but it might also lead to overfitting. So, it's a trade-off between complexity and performance.

**min_samples_split explained:**<br>
The min_samples_split hyperparameter determines the minimum number of samples required to split an internal node in a decision tree.
<br>
Imagine you have a decision tree. When building this tree, at each step, it decides whether to split a node into smaller nodes. min_samples_split is like a rule for the tree. It says, "Don't split a node unless you have at least this many samples in it."
<br>
So, if min_samples_split is set to 5, it means that the decision tree won't split a node unless it has at least 5 samples in it. If there are fewer than 5 samples, the node won't split, and it becomes a leaf node in the tree.
<br>
Adjusting min_samples_split can affect how the decision tree is built. A smaller value might lead to a more complex tree with more splits, while a larger value might lead to a simpler tree with fewer splits.
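
As a small sanity check of this rule, the sketch below trains trees with different values of min_samples_split on synthetic data (the data from `make_classification` and the `*_demo` names are assumptions for illustration, not our heart disease data). For each tree, it looks up the smallest node that was actually split.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used here only for illustration.
X_demo, y_demo = make_classification(n_samples=500, n_features=8, flip_y=0.1, random_state=0)

smallest_split_node = {}
for mss in (2, 20, 100):
    tree_demo = DecisionTreeClassifier(min_samples_split=mss, random_state=0)
    tree_demo.fit(X_demo, y_demo)

    # Internal nodes are the ones that were split (leaves have children_left == -1).
    was_split = tree_demo.tree_.children_left != -1
    smallest_split_node[mss] = int(tree_demo.tree_.n_node_samples[was_split].min())

    print(f"min_samples_split={mss:>3}: {tree_demo.tree_.node_count} nodes, "
          f"smallest node that was split held {smallest_split_node[mss]} samples")
```

No split node should hold fewer samples than the chosen min_samples_split, and the node count typically shrinks as the value grows.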

**min_samples_leaf explained:**<br>
The min_samples_leaf hyperparameter specifies the minimum number of samples required to be at a leaf node in a decision tree.
<br>
Imagine you have a decision tree. When building this tree, at each step, it decides whether to keep splitting nodes or stop and declare them as leaf nodes. min_samples_leaf is like a rule for the tree. It says, "Only split a node if each resulting child node would still have at least this many samples."
<br>
So, if min_samples_leaf is set to 5, it means that the decision tree won't continue splitting a node if any resulting leaf node would have fewer than 5 samples. It helps prevent the tree from creating leaf nodes with very few samples, which could lead to overfitting.
Adjusting min_samples_leaf can affect the depth and complexity of the decision tree. A smaller value might lead to a deeper tree with more splits, while a larger value might lead to a shallower tree with fewer splits.
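
The same kind of check works for min_samples_leaf. The sketch below again uses synthetic data (an assumption for illustration) and records the size of the smallest leaf in each fitted tree.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used here only for illustration.
X_demo, y_demo = make_classification(n_samples=500, n_features=8, flip_y=0.1, random_state=0)

smallest_leaf = {}
for msl in (1, 5, 25):
    tree_demo = DecisionTreeClassifier(min_samples_leaf=msl, random_state=0)
    tree_demo.fit(X_demo, y_demo)

    # Leaf nodes have no children, which sklearn marks with children_left == -1.
    is_leaf = tree_demo.tree_.children_left == -1
    smallest_leaf[msl] = int(tree_demo.tree_.n_node_samples[is_leaf].min())

    print(f"min_samples_leaf={msl:>2}: depth {tree_demo.get_depth()}, "
          f"smallest leaf holds {smallest_leaf[msl]} samples")
```

Every leaf should hold at least min_samples_leaf samples, which is what keeps the tree from carving out tiny leaves.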



After specifying the parameters we want to optimize, we also define the range within which we want to search. For instance, we can define the range of max_depth to be from 2 to 51. Doing this, we ask GridSearch to find the optimal depth within this range.

You can read more on GridSearch in the documentation: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
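
A grid search over the three decision tree parameters described above could look roughly like the sketch below. It runs on synthetic data, and the candidate values in `param_grid_tree` are assumptions picked to keep the example fast, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used here only for illustration.
X_demo, y_demo = make_classification(n_samples=600, n_features=8, flip_y=0.1, random_state=0)

# Hypothetical search ranges: every combination of these values is tried.
param_grid_tree = {
    "max_depth": [2, 4, 8, 16],
    "min_samples_split": [2, 10, 50],
    "min_samples_leaf": [1, 5, 20],
}

tree_search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid_tree,
    scoring="accuracy",
    cv=5,  # 5-fold cross-validation for each combination
)
tree_search.fit(X_demo, y_demo)

print("best parameters:", tree_search.best_params_)
print("best mean CV accuracy:", round(tree_search.best_score_, 3))
```

With 4 x 3 x 3 = 36 combinations and 5 folds, this trains 180 trees, which is why larger grids on larger datasets can take a long time.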

In [32]:
#import LinearSVC to speed up gridsearch
from sklearn.svm import LinearSVC

#import GridSearchCV
from sklearn.model_selection import GridSearchCV

In [33]:
# Defining various parameters that we test our gridsearch with. 
# The gridsearch will choose the best param that we use later.
C = [0.001, 0.01, 0.1, 1, 10, 100, 1000]
param_grid = {'C': C}

In [34]:
# Create the GridSearchCV object
# Cross-validation
# grid_search = GridSearchCV(dt, param_grid, scoring='accuracy', cv=5)

grid_search = GridSearchCV(LinearSVC(dual=False), param_grid, scoring='accuracy', cv=5)


In [35]:
grid_search.fit(X_train, y_train)

In [36]:
#here we ask GridSearch to give us the best value of C among the values in our grid
grid_search.best_params_

{'C': 0.1}

It is important to note that '.best_params_' doesn't show the overall best parameters, but rather the best parameters within **the range we passed in** to our search.
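
One practical consequence: if the best value sits at the very edge of the range, a better value may lie outside it. The helper below is a hypothetical sketch (the name `on_grid_edge` is ours, not part of sklearn) that flags such parameters.

```python
def on_grid_edge(best_params, param_grid):
    """Return the parameters whose best value is the smallest or largest value searched."""
    edges = {}
    for name, best in best_params.items():
        values = sorted(param_grid[name])
        if best == values[0] or best == values[-1]:
            edges[name] = best
    return edges


# The grid for C used above, together with the best value GridSearch reported.
param_grid_demo = {"C": [0.001, 0.01, 0.1, 1, 10, 100, 1000]}

print(on_grid_edge({"C": 0.1}, param_grid_demo))   # {} -> the best value is inside the range
print(on_grid_edge({"C": 1000}, param_grid_demo))  # {'C': 1000} -> consider widening the range
```

In our case C = 0.1 lies inside the range we searched, so there is no obvious reason to extend the grid.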

Anyway - having found the best parameter for our model, we can now make a new model, 'clf_3' (a LinearSVC rather than a decision tree), which (hopefully) is even more accurate than our previous models, 'clf' and 'clf_2'...

In [38]:
clf_3 = LinearSVC(C=0.1, dual=False)

In [39]:
#training the model on our data
clf_3.fit(X_train, y_train)

In [40]:
#predicting...
y_pred = clf_3.predict(X_test)

In [41]:
#calculating the accuracy of our new model...
print("Accuracy: ", metrics.accuracy_score(y_test, y_pred))

Accuracy:  0.9080928729107537


This method did not improve the accuracy by much: the difference was roughly +0.01 percentage points, which means the model made 7 fewer incorrect predictions.

Decision tree:
- clf ~ 88% accurate

Optimized Decision tree:
- clf_2 ~ 91% accurate

K-NN:
- clf_knn ~ 90% accurate

Hyperparameter tuning using LinearSVC:
- clf_3 ~ 91% accurate

In [42]:
#below, we print number of correct and incorrect predictions
mtr = confusion_matrix(y_test, y_pred)

print("Correct predictions:", (mtr[0,0] + mtr[1,1]))
print("Incorrect predictions:", (mtr[0,1] + mtr[1,0]))
print("Total predictions:", (mtr.sum()))

Correct predictions: 46073
Incorrect predictions: 4663
Total predictions: 50736
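
The accuracy score is simply the number of correct predictions divided by the total. The sketch below shows this on a tiny made-up example (the `y_true_demo` and `y_pred_demo` labels are assumptions for illustration) and then repeats the arithmetic with the counts printed above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical toy labels: 10 predictions, 2 of them wrong.
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred_demo = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

mtr_demo = confusion_matrix(y_true_demo, y_pred_demo)
correct = np.trace(mtr_demo)  # the diagonal holds the correct predictions
total = mtr_demo.sum()

print(mtr_demo)
print("accuracy from matrix:", correct / total)
print("accuracy_score:      ", accuracy_score(y_true_demo, y_pred_demo))

# The same arithmetic with the counts reported for clf_3.
print("clf_3:", round(46073 / 50736, 4))
```

Dividing 46073 by 50736 gives about 0.9081, which matches the accuracy printed for clf_3.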


## Remarks
Since we are using LinearSVC which is not a tree based algorithm, it is not possible to draw a tree for this model.

### 2.2 k-fold cross validation for KNN - finding the optimal k-number of neighbours


Now, we'll try to find the optimal k for our KNN model. As you might remember, we made a KNN classification model without specifying a number of neighbours. Thus, the model used the default number of neighbours, 5.

The model had an accuracy of 0.90 - so, the model was correct 90% of the time.

Let's see if we can improve that with some hyperparameter tuning!

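Below, we use k-fold cross-validation to find the optimal k in our KNN classifier. Note that these are two different k's: the k in k-fold is the number of folds (here 5), while the k in KNN is the number of neighbours.
More specifically, we iterate through different numbers of neighbours, apply 5-fold cross-validation for each value, and then select the value that gives the highest average accuracy across all folds.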

In [50]:
# set up k-NN classifier with cross-validation to find the optimal value of k
# we have chosen to search for k in the range from 1 to 14 (range(1, 15) excludes 15)
cv_scores = []

for k in range(1, 15):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train_std, y_train, cv=5, scoring='accuracy')
    cv_scores.append(scores.mean())

# Find the optimal value of k
optimal_k = np.argmax(cv_scores) + 1
print("The optimal number of neighbors is %d" % optimal_k)

The optimal number of neighbors is 14
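
The same search can also be written with GridSearchCV, which avoids the manual `argmax + 1` bookkeeping. The sketch below is one possible variant on synthetic data (the data and the `*_demo` names are assumptions). It also puts the scaler inside a pipeline, a design choice that differs from the cell above: each fold is then scaled using only its own training part.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used here only for illustration.
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)

# make_pipeline names each step after its class, in lowercase.
knn_pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
k_grid = {"kneighborsclassifier__n_neighbors": list(range(1, 15))}  # k = 1..14

knn_search = GridSearchCV(knn_pipe, k_grid, scoring="accuracy", cv=5)
knn_search.fit(X_demo, y_demo)

best_k = knn_search.best_params_["kneighborsclassifier__n_neighbors"]
print("best k:", best_k, "| mean CV accuracy:", round(knn_search.best_score_, 3))
```

The best k found on this synthetic data says nothing about our heart disease data; the point is only the structure of the search.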


In [52]:
# plot the cross-validated accuracy for k = 13 and k = 14 only
k_values = list(range(13, 15))
scoresByK = []

for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    score = cross_val_score(knn, X_train_std, y_train, cv=5)
    scoresByK.append(score.mean())

sns.lineplot(x = k_values, y = scoresByK, marker = 'o')
plt.xlabel("K Values")
plt.ylabel("Accuracy Score")


In [53]:
model_KNN = KNeighborsClassifier(n_neighbors=14)

In [54]:
# fit the classifier to the standardized training data, i.e. adapt the model (here KNN) to the data we have
model_KNN.fit(X_train_std, y_train)

# predict the class labels for the standardized test data
y_pred = model_KNN.predict(X_test_std)

In [58]:
# evaluate the performance of the classifier using the accuracy score
accuracy_test_KNN = round(model_KNN.score(X_test_std, y_test),4)
print('Accuracy:', accuracy_test_KNN)

Accuracy:  0.906


As you can see, tuning k improved the accuracy from about 0.90 to 0.906 (approximately 0.91). Note that k = 14 was the largest value we searched, so an even larger k might do slightly better.