# Notebook 4: Trees and Ensemble Methods
Notebook prepared by [Chloé-Agathe Azencott](http://cazencott.info), with contributions from [Giann Karlo](https://www.giannkarlo.info/).

In this notebook, we will discover decision trees and ensemble methods (random forests, gradient boosting).

In [None]:
# load numpy as np, matplotlib as plt
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

In [None]:
plt.rc('font', **{'size': 12}) # sets the global font size for plots (in pt)

In [None]:
import pandas as pd

## 1. Data Loading

The goal of this notebook is to use the visual description of a mushroom to predict whether it is edible or not.

The data is available in `data/mushrooms.csv`. It is a slightly modified version of the dataset at https://archive.ics.uci.edu/ml/datasets/Mushroom.

It contains a first line (header) describing the columns, then one line per mushroom. The values of the different variables are all represented by letters; here is their meaning:
1. cap-shape: bell=b,conical=c,convex=x,flat=f, knobbed=k,sunken=s
2. cap-surface: fibrous=f,grooves=g,scaly=y,smooth=s
3. cap-color: brown=n,buff=b,cinnamon=c,gray=g,green=r, pink=p,purple=u,red=e,white=w,yellow=y
4. bruises: bruises=t,no=f
5. odor: almond=a,anise=l,creosote=c,fishy=y,foul=f, musty=m,none=n,pungent=p,spicy=s
6. gill-attachment: attached=a,descending=d,free=f,notched=n
7. gill-spacing: close=c,crowded=w,distant=d
8. gill-size: broad=b,narrow=n
9. gill-color: black=k,brown=n,buff=b,chocolate=h,gray=g, green=r,orange=o,pink=p,purple=u,red=e, white=w,yellow=y
10. stalk-shape: enlarging=e,tapering=t
11. stalk-root: bulbous=b,club=c,cup=u,equal=e, rhizomorphs=z,rooted=r,missing=?
12. stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s
13. stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s
14. stalk-color-above-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o, pink=p,red=e,white=w,yellow=y
15. stalk-color-below-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o, pink=p,red=e,white=w,yellow=y
16. veil-type: partial=p,universal=u
17. veil-color: brown=n,orange=o,white=w,yellow=y
18. ring-number: none=n,one=o,two=t
19. ring-type: cobwebby=c,evanescent=e,flaring=f,large=l, none=n,pendant=p,sheathing=s,zone=z
20. spore-print-color: black=k,brown=n,buff=b,chocolate=h,green=r, orange=o,purple=u,white=w,yellow=y
21. population: abundant=a,clustered=c,numerous=n, scattered=s,several=v,solitary=y
22. habitat: grasses=g,leaves=l,meadows=m,paths=p, urban=u,waste=w,woods=d

The first column tells us the class of each mushroom, 'e' for edible and 'p' for poisonous.

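**Alternatively:** If you need to download the file (e.g., on Colab), run the following cell. If the file is already available locally, skip it and uncomment the `pd.read_csv('data/mushrooms.csv')` line in the cell after it.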

In [None]:
!wget https://raw.githubusercontent.com/CBIO-mines/fml-dassault-systems/main/data/mushrooms.csv

df = pd.read_csv("mushrooms.csv")


In [None]:
# df = pd.read_csv('data/mushrooms.csv')
df.shape

(8124, 23)

In [None]:
df.head()

Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,p,x,s,n,t,p,f,c,n,k,...,s,w,w,p,w,o,p,k,s,u
1,e,x,s,y,t,a,f,c,b,k,...,s,w,w,p,w,o,p,n,n,g
2,e,b,s,w,t,l,f,c,b,n,...,s,w,w,p,w,o,p,n,n,m
3,p,x,y,w,t,p,f,c,n,n,...,s,w,w,p,w,o,p,k,s,u
4,e,x,s,g,f,n,f,w,b,k,...,s,w,w,p,w,o,e,n,a,g


### Converting variables to numerical values

Our variables are currently *categorical.*

For example, for the "cap shape" variable, `b` corresponds to a bell cap, `c` to a conical cap, `f` to a flat cap, `k` to a knobbed cap, `s` to a sunken cap, and `x` to a convex cap.

To work with this data, we need to convert these categories into numerical values.

One possibility is to convert each letter into an integer between 0 and the number of categories minus 1, using [preprocessing.LabelEncoder](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html).

This encoding is not necessarily ideal: an algorithm that uses Euclidean distance will consider a convex cap (`x` converted to 5) to be closer to a sunken cap (`s` converted to 4) than to a conical cap (`c` converted to 1), which doesn't make much sense. This matters much less for algorithms based on decision trees: scikit-learn's trees split each variable on a threshold, so they only use the order of the codes, not the distances between them, and successive splits can isolate any single category. The conversion is needed mainly for implementation reasons.

[One-hot encoding](http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-categorical-features) is generally a better choice. Note, however, that it has the disadvantage of increasing the number of variables and creating correlated variables.
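
To make the difference between the two encodings concrete, here is a minimal sketch on a toy column containing the six cap-shape codes from the description above. The variable names (`cap_shape`, `label_codes`, `one_hot`) are illustrative only and are not used in the rest of the notebook.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Toy column: the six cap-shape codes listed in the data description
cap_shape = np.array(['b', 'c', 'f', 'k', 's', 'x'])

# Label encoding: one integer per category, assigned in alphabetical order
label_codes = LabelEncoder().fit_transform(cap_shape)
print(label_codes)  # [0 1 2 3 4 5]

# One-hot encoding: one binary column per category
one_hot = OneHotEncoder().fit_transform(cap_shape.reshape(-1, 1)).toarray()
print(one_hot.shape)  # (6, 6)

# Positions of convex (x), sunken (s) and conical (c) in the toy column
i_x, i_s, i_c = 5, 4, 1

# With label codes, convex looks closer to sunken than to conical
print(abs(label_codes[i_x] - label_codes[i_s]), abs(label_codes[i_x] - label_codes[i_c]))  # 1 4

# With one-hot vectors, all pairs of distinct categories are equidistant
d_xs = np.linalg.norm(one_hot[i_x] - one_hot[i_s])
d_xc = np.linalg.norm(one_hot[i_x] - one_hot[i_c])
print(d_xs, d_xc)
```

The price of the one-hot version is visible in the shape: a single variable becomes six columns.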

In [None]:
from sklearn import preprocessing

In [None]:
label_encoder = preprocessing.LabelEncoder()

for col in df.columns:
    df[col] = label_encoder.fit_transform(df[col])

We can observe our data again:

In [None]:
df.head()

Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,1,5,2,4,1,6,1,0,1,4,...,2,7,7,0,2,1,4,2,3,5
1,0,5,2,9,1,0,1,0,0,4,...,2,7,7,0,2,1,4,3,2,1
2,0,0,2,8,1,3,1,0,0,5,...,2,7,7,0,2,1,4,3,2,3
3,1,5,3,8,1,6,1,0,1,5,...,2,7,7,0,2,1,4,2,3,5
4,0,5,2,3,0,5,1,1,0,4,...,2,7,7,0,2,1,0,3,0,1


### Creating the X and y data matrices

In [None]:
X = np.array(df.drop(columns=['class']))
y = np.array(df['class'])
print(X.shape, y.shape)

(8124, 22) (8124,)


**Question:** How many samples (examples) does our dataset contain? How many variables?

## 2. Selection and evaluation framework

We can now split our data into a training set and a test set, and then fix a split of the training set into 10 folds for cross-validation.

You will need the functions [`train_test_split`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) and [`KFold`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html)

In [None]:
from sklearn import model_selection

### Training and test set

In [None]:
### START OF YOUR CODE
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.25, random_state=42)
### END OF YOUR CODE

### Cross-validation

In [None]:
n_folds = 10

### START OF YOUR CODE

# Create a KFold object that will allow cross-validation in n_folds folds
kf = ...

# Use kf to split the training set into n_folds folds.
# kf.split returns an iterator (consumed after a loop).
# To use the same folds multiple times, we convert this iterator into a list of indices:
kf_indices = ...

### END OF YOUR CODE
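
As a hint, here is one possible way to build reusable folds, sketched on a toy array rather than on `X_train`. The names `X_toy`, `kf_toy` and `kf_toy_indices` are hypothetical, and `shuffle=True` with `random_state=42` is an assumption, not a requirement of the exercise.

```python
import numpy as np
from sklearn import model_selection

# Toy stand-in for the training set: 20 samples, 3 variables
X_toy = np.arange(60).reshape(20, 3)

# shuffle and random_state are assumptions made for reproducibility
kf_toy = model_selection.KFold(n_splits=10, shuffle=True, random_state=42)

# kf_toy.split is a generator: converting it to a list keeps the folds
# available for several successive cross-validations
kf_toy_indices = list(kf_toy.split(X_toy))

print(len(kf_toy_indices))  # 10
train_idx, valid_idx = kf_toy_indices[0]
print(len(train_idx), len(valid_idx))  # 18 2
```

Each element of the list is a pair (training indices, validation indices), which is a format that the `cv` argument of `cross_val_score` and `GridSearchCV` accepts.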

## 3. Decision Tree

We will now use a decision tree to learn a classifier on our data.

Decision trees are implemented in the [DecisionTreeClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) class of scikit-learn's `tree` module.

In [None]:
from sklearn import tree

### Decision tree with default hyperparameters

Let's determine the F1 score using cross-validation for a decision tree with default hyperparameters in scikit-learn:

In [None]:
model_tree_default = tree.DecisionTreeClassifier()

f1_tree_default = model_selection.cross_val_score(model_tree_default, # predictor to evaluate
                                                  X_train, y_train, # training data
                                                  cv=kf_indices, # cross-validation to use
                                                  scoring='f1' # performance evaluation metric
                                                  )
print("F1 of a decision tree (default) in cross-validation: %.3f +/- %.3f" % (np.mean(f1_tree_default), np.std(f1_tree_default)))

**Question:** What do you think of this performance?

### Cross-validation of decision tree depth

By default (see the documentation), `max_depth=None`: the tree is grown without any depth limit, until its leaves are pure or too small to split. We will now treat the tree depth (`max_depth`) as a hyperparameter to optimize using a grid search, reusing and adapting the code used for kNN in Notebook 3.
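
The effect of limiting the depth can be seen on a small synthetic example. This is only a sketch: the data come from `make_classification` and the names `deep_tree` and `shallow_tree` are illustrative.

```python
from sklearn import tree
from sklearn.datasets import make_classification

# Synthetic stand-in for the mushroom data
X_toy, y_toy = make_classification(n_samples=300, n_features=10, random_state=0)

# No depth limit (scikit-learn default): the tree grows until its leaves are pure
deep_tree = tree.DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

# Depth capped at 2: at most 4 leaves
shallow_tree = tree.DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_toy, y_toy)

print("depths:", deep_tree.get_depth(), shallow_tree.get_depth())
print("training accuracy:", deep_tree.score(X_toy, y_toy), shallow_tree.score(X_toy, y_toy))
```

The unrestricted tree fits its training data perfectly, which says nothing about how it generalizes; this is why the depth is selected by cross-validation.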

Let's start by defining the grid:

In [None]:
d_values = np.arange(2, 31)

In [None]:
d_values

We can now use [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html):

In [None]:
# Instantiation of a GridSearchCV object
grid_tree = model_selection.GridSearchCV(tree.DecisionTreeClassifier(), # predictor to evaluate
                                         {'max_depth': d_values}, # dictionary of hyperparameter values
                                         cv=kf_indices, # cross-validation to use
                                         scoring='f1' # performance evaluation metric
                                         )

In [None]:
%%time

# Use this object on the training data
grid_tree.fit(X_train, y_train)

The optimal hyperparameter value is given by:

In [None]:
print(grid_tree.best_params_)

The following code displays the model's cross-validated performance as a function of the hyperparameter value:

In [None]:
mean_test_score = grid_tree.cv_results_['mean_test_score']
stde_test_score = grid_tree.cv_results_['std_test_score'] / np.sqrt(n_folds) # standard error

plt.plot(d_values, mean_test_score)
plt.plot(d_values, (mean_test_score + stde_test_score), '--', color='steelblue')
plt.plot(d_values, (mean_test_score - stde_test_score), '--', color='steelblue')
plt.fill_between(d_values, (mean_test_score + stde_test_score),
                 (mean_test_score - stde_test_score), alpha=0.2)

best_index = np.where(d_values == grid_tree.best_params_['max_depth'])[0][0]
plt.scatter(d_values[best_index], mean_test_score[best_index])


plt.xlabel('maximum depth')
plt.ylabel('F1')
plt.title("Performance (in cross-validation) along the grid")

**Question:** What do you think of this performance?

### Optimal decision tree

In [None]:
print("Best F1 in cross-validation: %.3f" % grid_tree.best_score_)

We can now retrieve the optimal decision tree:

In [None]:
model_tree_best = grid_tree.best_estimator_

## 4. Interpretation of the decision tree

### Visualization

The [plot_tree](https://scikit-learn.org/stable/modules/generated/sklearn.tree.plot_tree.html) function of scikit-learn's `tree` module allows us to visualize the optimal decision tree:

In [None]:
fig = plt.figure(figsize=(25, 20))
tree.plot_tree(model_tree_best, fontsize=12)
plt.show()

**Question:** Does the learned model seem interpretable to you?

### Variable Importance

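To interpret the decision tree, we can also look at the importance of each variable. In scikit-learn, `feature_importances_` measures the total reduction of the splitting criterion (Gini impurity by default) brought by the splits on that variable, normalized so that the importances sum to 1. The more a variable helps to separate the classes, the greater its importance.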
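
A toy example may help to read these values. In the sketch below (hypothetical names `X_toy`, `y_toy`, `toy_tree`), the label is a copy of the first variable and the second variable is pure noise.

```python
import numpy as np
from sklearn import tree

rng = np.random.RandomState(0)

# Variable 0 determines the label; variable 1 is unrelated noise
X_toy = np.column_stack([rng.randint(0, 2, size=200),
                         rng.randint(0, 5, size=200)])
y_toy = X_toy[:, 0]

toy_tree = tree.DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

# One importance per variable, in column order
print(toy_tree.feature_importances_)
print(toy_tree.feature_importances_.sum())
```

A single split on variable 0 separates the classes perfectly, so that variable carries all of the importance and the noise variable is never used.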

In [None]:
fig = plt.figure(figsize=(12, 6))

num_features = X_train.shape[1]

# Display decision tree importances
plt.scatter(range(num_features), model_tree_best.feature_importances_,
           label="Decision Tree")

# Legend
tmp = plt.legend(fontsize=14)

# X-axis
plt.xlabel('Variables', fontsize=14)
feature_names = list(df.columns[1:])
tmp = plt.xticks(range(num_features), feature_names,
                 rotation=90, fontsize=14)

# Y-axis
tmp = plt.ylabel('Importance', fontsize=14)

# Title
tmp = plt.title('Variable Importance', fontsize=16)

### Comparison to logistic regression

We can also compare these importances to the regression coefficients of a logistic regression:

In [None]:
from sklearn import linear_model

Train a regularized logistic regression model (ridge, or "L2", regularization) with a grid search on the hyperparameter `C`, using cross-validation. In scikit-learn, `C` is the inverse of the regularization strength: smaller values mean stronger regularization.

In [None]:
c_values = np.logspace(-3, 3, 50)

### START OF YOUR CODE

# Center and scale the data
from sklearn.preprocessing import StandardScaler
std_scaler = ...
X_train_scaled = ...
X_test_scaled = ...

# Instantiation of a GridSearchCV object
grid_logreg = ...

# Application to training data
grid_logreg.fit(X_train_scaled, y_train)

### END OF YOUR CODE
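
As a hint, the steps of this cell can be sketched as follows on synthetic data. All names prefixed with `toy_` or suffixed with `_toy` are hypothetical, and the reduced grid of 7 values of `C`, the 5 folds and `max_iter=1000` are assumptions chosen to keep the example fast.

```python
import numpy as np
from sklearn import linear_model, model_selection
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the mushroom data
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)
X_toy_train, X_toy_test, y_toy_train, y_toy_test = model_selection.train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42)

# Fit the scaler on the training set only, then apply it to both sets
toy_scaler = StandardScaler().fit(X_toy_train)
X_toy_train_scaled = toy_scaler.transform(X_toy_train)
X_toy_test_scaled = toy_scaler.transform(X_toy_test)

# Logistic regression uses an L2 penalty by default;
# C is the inverse of the regularization strength
toy_grid_logreg = model_selection.GridSearchCV(
    linear_model.LogisticRegression(max_iter=1000),
    {'C': np.logspace(-3, 3, 7)},
    cv=5,
    scoring='f1')
toy_grid_logreg.fit(X_toy_train_scaled, y_toy_train)

print(toy_grid_logreg.best_params_)
print("%.3f" % toy_grid_logreg.best_score_)
```

Scaling matters here because the coefficients are later compared across variables: without it, their magnitudes would also reflect the scale of each variable.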

In [None]:
print("Best F1 in cross-validation: %.3f" % grid_logreg.best_score_)

**Question:** Compare this performance to that of the decision tree.

In [None]:
fig = plt.figure(figsize=(12, 6))

num_features = X_train.shape[1]

# Scale importances between 0 and 1
tree_importances = model_tree_best.feature_importances_
tree_importances_min = np.min(tree_importances)
tree_importances_max = np.max(tree_importances)
tree_importances = (tree_importances-tree_importances_min)/(tree_importances_max-tree_importances_min)

# Display decision tree importances
plt.bar(range(num_features), tree_importances,
           label="Decision Tree", width=0.4)

# Scale absolute values of linear model coefficients between 0 and 1
logreg_coeffs = np.abs(grid_logreg.best_estimator_.coef_[0])
logreg_coeffs_min = np.min(logreg_coeffs)
logreg_coeffs_max = np.max(logreg_coeffs)
logreg_coeffs = (logreg_coeffs-logreg_coeffs_min)/(logreg_coeffs_max-logreg_coeffs_min)

# Display logistic regression importances
plt.bar((np.arange(num_features)+0.4), logreg_coeffs,
           label="Logistic Regression", width=0.4)


# Legend
tmp = plt.legend(fontsize=14)

# X-axis
plt.xlabel('Variables', fontsize=14)
feature_names = list(df.columns[1:])
tmp = plt.xticks(range(num_features), feature_names,
                 rotation=90, fontsize=14)

# Y-axis
tmp = plt.ylabel('Importance', fontsize=14)

# Title
tmp = plt.title('Variable Importance', fontsize=16)

**Question:** How do these importances compare?

## 5. Random Forest

Can we improve the decision tree's performance using an ensemble method? We will use a random forest here, implemented in the [RandomForestClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) class of scikit-learn's `ensemble` module.

In [None]:
from sklearn import ensemble

### Cross-validation of the number of trees and their maximum depth

We will now consider two hyperparameters, the maximum depth of each tree (`max_depth`), and the number of trees in the forest (`n_estimators`).

Let's start by defining the grid:

In [None]:
d_values = np.array([3, 4, 10])
n_values = np.array([10, 20, 50, 100, 200])

We can now use [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html):

In [None]:
### START OF YOUR CODE

# Instantiation of a GridSearchCV object
grid_rf = ...

# Use this object on the training data
grid_rf.fit(X_train, y_train)

### END OF YOUR CODE
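
As a hint, a grid search over two hyperparameters only differs from the previous one by its dictionary, which has two keys. The sketch below uses synthetic data, a deliberately tiny grid and 3 folds (all assumptions made for speed); `toy_grid_rf` is a hypothetical name.

```python
from sklearn import ensemble, model_selection
from sklearn.datasets import make_classification

# Synthetic stand-in for the mushroom data
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

# One key per hyperparameter; every combination is evaluated
toy_param_grid = {'max_depth': [3, 4], 'n_estimators': [10, 20]}

toy_grid_rf = model_selection.GridSearchCV(
    ensemble.RandomForestClassifier(random_state=0),
    toy_param_grid,
    cv=3,
    scoring='f1')
toy_grid_rf.fit(X_toy, y_toy)

print(toy_grid_rf.best_params_)
print(len(toy_grid_rf.cv_results_['mean_test_score']))  # 4 (2 depths x 2 forest sizes)
```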

The optimal hyperparameter values are given by:

In [None]:
print(grid_rf.best_params_)

And we can display the model's cross-validated performance as a function of each of the two hyperparameters:

In [None]:
# Reshape scores into a 2D array
mean_test_score_array = np.reshape(grid_rf.cv_results_['mean_test_score'], (len(d_values), len(n_values)))
std_test_score_array = np.reshape(grid_rf.cv_results_['std_test_score'], (len(d_values), len(n_values)))
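
The reshape above relies on the order in which the candidates were evaluated. To my knowledge, `GridSearchCV` follows the order of `ParameterGrid`, which sorts the hyperparameter names alphabetically and varies the last one fastest. The sketch below checks this on the grid values used here, assuming the dictionary keys are exactly `max_depth` and `n_estimators`; `d_toy`, `n_toy` and `candidates` are illustrative names.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

d_toy = [3, 4, 10]
n_toy = [10, 20, 50, 100, 200]

# Enumerate the candidates in ParameterGrid order
candidates = list(ParameterGrid({'max_depth': d_toy, 'n_estimators': n_toy}))
print(candidates[:2])

# Reshape exactly as the scores are reshaped in the cell above
depth_array = np.reshape([c['max_depth'] for c in candidates], (len(d_toy), len(n_toy)))
ntree_array = np.reshape([c['n_estimators'] for c in candidates], (len(d_toy), len(n_toy)))

print(depth_array[:, 0])  # one depth per row
print(ntree_array[0, :])  # one number of trees per column
```

Each row therefore corresponds to one value of `max_depth` and each column to one value of `n_estimators`. With other hyperparameter names, the alphabetical order could swap the two axes.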

In [None]:
for (idx, d) in enumerate(d_values):
    mean_test_score = mean_test_score_array[idx, :]
    stde_test_score = std_test_score_array[idx, :] / np.sqrt(n_folds) # standard error

    p = plt.plot(n_values, mean_test_score, label="Max depth = %d" % d)
    plt.plot(n_values, (mean_test_score + stde_test_score), '--', color=p[0].get_color())
    plt.plot(n_values, (mean_test_score - stde_test_score), '--', color=p[0].get_color())
    plt.fill_between(n_values, (mean_test_score + stde_test_score),
                     (mean_test_score - stde_test_score), alpha=0.2)

    # Display best hyperparameters
    if d == grid_rf.best_params_['max_depth']:
        best_ntree_index = np.where(n_values == grid_rf.best_params_['n_estimators'])[0][0]
        plt.scatter(n_values[best_ntree_index], mean_test_score[best_ntree_index],
                   marker='*', s=200, color='red')

plt.legend(loc=(1.1, 0))
plt.xlabel("Number of trees")
plt.ylabel('F1')
plt.title("Performance (in cross-validation) along the grid")
plt.xscale('log') # use a logarithmic scale on the x-axis

**Question:** How does the performance of random forests compare to previous performances?

### Optimal random forest

In [None]:
print("Best F1 in cross-validation: %.3f" % grid_rf.best_score_)

We can now retrieve the optimal random forest:

In [None]:
model_rf_best = grid_rf.best_estimator_

### Variable Importance

We can once again look at the importance of each variable, for the best random forest model:

In [None]:
fig = plt.figure(figsize=(12, 6))

num_features = X_train.shape[1]

# Scale importances between 0 and 1
tree_importances = model_tree_best.feature_importances_
tree_importances_min = np.min(tree_importances)
tree_importances_max = np.max(tree_importances)
tree_importances = (tree_importances-tree_importances_min)/(tree_importances_max-tree_importances_min)

# Display decision tree importances
plt.bar(range(num_features), tree_importances,
           label="Decision Tree", width=0.3)

# Scale absolute values of linear model coefficients between 0 and 1
logreg_coeffs = np.abs(grid_logreg.best_estimator_.coef_[0])
logreg_coeffs_min = np.min(logreg_coeffs)
logreg_coeffs_max = np.max(logreg_coeffs)
logreg_coeffs = (logreg_coeffs-logreg_coeffs_min)/(logreg_coeffs_max-logreg_coeffs_min)

# Display logistic regression importances
plt.bar((np.arange(num_features)+0.3), logreg_coeffs,
           label="Logistic Regression", width=0.3)

# Scale importances between 0 and 1
rf_importances = model_rf_best.feature_importances_
rf_importances_min = np.min(rf_importances)
rf_importances_max = np.max(rf_importances)
rf_importances = (rf_importances-rf_importances_min)/(rf_importances_max-rf_importances_min)

# Display forest importances
plt.bar((np.arange(num_features)+0.6),  rf_importances,
           label="Random Forest", width=0.3)


# Legend
tmp = plt.legend()

# X-axis
plt.xlabel('Variables')
feature_names = list(df.columns[1:])
tmp = plt.xticks(range(num_features), feature_names,
                 rotation=90, fontsize=14)

# Y-axis
tmp = plt.ylabel('Importance')

# Title
tmp = plt.title('Variable Importance')

**Question:** What are the most important variables now? How does this compare to previous models?

## 6. Gradient Boosting

Gradient boosting is implemented in scikit-learn in the [GradientBoostingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html?highlight=boosting#sklearn.ensemble.GradientBoostingClassifier) class of the `ensemble` module.

### Cross-validation and hyperparameter selection

As with random forests, we will optimize the number of estimators and the depth of the trees here.

In [None]:
n_values = np.array([10, 20, 50, 100, 200])
d_values = np.array([3, 4, 7])

We can now use [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html):

In [None]:
### START OF YOUR CODE

# Instantiation of a GridSearchCV object
grid_boost = ...

# Use this object on the training data
grid_boost.fit(X_train, y_train)

### END OF YOUR CODE
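
As a hint, the same pattern applies to gradient boosting. The sketch below is again run on synthetic data with a tiny grid and 3 folds (assumptions made for speed); `toy_grid_boost` is a hypothetical name.

```python
from sklearn import ensemble, model_selection
from sklearn.datasets import make_classification

# Synthetic stand-in for the mushroom data
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

# Same two hyperparameters as for the random forest
toy_grid_boost = model_selection.GridSearchCV(
    ensemble.GradientBoostingClassifier(random_state=0),
    {'max_depth': [3, 4], 'n_estimators': [10, 20]},
    cv=3,
    scoring='f1')
toy_grid_boost.fit(X_toy, y_toy)

print(toy_grid_boost.best_params_)
print("%.3f" % toy_grid_boost.best_score_)
```

Unlike in a random forest, the trees are built one after the other, each one correcting the errors of the current ensemble, so depth and number of trees interact differently.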

The optimal hyperparameter values are given by:

In [None]:
print(grid_boost.best_params_)

And we can display the model's cross-validated performance as a function of each of the two hyperparameters:

In [None]:
# Reshape scores into a 2D array
mean_test_score_array = np.reshape(grid_boost.cv_results_['mean_test_score'], (len(d_values), len(n_values)))
std_test_score_array = np.reshape(grid_boost.cv_results_['std_test_score'], (len(d_values), len(n_values)))

In [None]:
for (idx, d) in enumerate(d_values):
    mean_test_score = mean_test_score_array[idx, :]
    stde_test_score = std_test_score_array[idx, :] / np.sqrt(n_folds) # standard error

    p = plt.plot(n_values, mean_test_score, label="Max depth = %d" % d)
    plt.plot(n_values, (mean_test_score + stde_test_score), '--', color=p[0].get_color())
    plt.plot(n_values, (mean_test_score - stde_test_score), '--', color=p[0].get_color())
    plt.fill_between(n_values, (mean_test_score + stde_test_score),
                     (mean_test_score - stde_test_score), alpha=0.2)

    # Display best hyperparameters
    if d == grid_boost.best_params_['max_depth']:
        best_ntree_index = np.where(n_values == grid_boost.best_params_['n_estimators'])[0][0]
        plt.scatter(n_values[best_ntree_index], mean_test_score[best_ntree_index],
                   marker='*', s=200, color='red')

plt.legend(loc=(1.1, 0))
plt.xlabel("Number of trees")
plt.ylabel('F1')
plt.title("Performance (in cross-validation) along the grid")
plt.xscale('log') # use a logarithmic scale on the x-axis

**Question:** How does the performance of gradient boosting evolve based on hyperparameter values? How does it compare to previous performances?

### Optimal Boosting

In [None]:
print("Best F1 in cross-validation: %.3f" % grid_boost.best_score_)

We can now retrieve the optimal gradient boosting model:

In [None]:
model_boost_best = grid_boost.best_estimator_

### Variable Importance

We can once again look at the importance of each variable:

In [None]:
fig = plt.figure(figsize=(12, 6))

num_features = X_train.shape[1]

### Decision Tree
# Scale importances between 0 and 1
tree_importances = model_tree_best.feature_importances_
tree_importances_min = np.min(tree_importances)
tree_importances_max = np.max(tree_importances)
tree_importances = (tree_importances-tree_importances_min)/(tree_importances_max-tree_importances_min)

# Display decision tree importances
plt.bar(range(num_features), tree_importances,
           label="Decision Tree", width=0.2)

### Logistic Regression
# Scale absolute values of linear model coefficients between 0 and 1
logreg_coeffs = np.abs(grid_logreg.best_estimator_.coef_[0])
logreg_coeffs_min = np.min(logreg_coeffs)
logreg_coeffs_max = np.max(logreg_coeffs)
logreg_coeffs = (logreg_coeffs-logreg_coeffs_min)/(logreg_coeffs_max-logreg_coeffs_min)

# Display logistic regression importances
plt.bar((np.arange(num_features)+0.2), logreg_coeffs,
           label="Logistic Regression", width=0.2)

### Random Forest
# Scale importances between 0 and 1
rf_importances = model_rf_best.feature_importances_
rf_importances_min = np.min(rf_importances)
rf_importances_max = np.max(rf_importances)
rf_importances = (rf_importances-rf_importances_min)/(rf_importances_max-rf_importances_min)

# Display forest importances
plt.bar((np.arange(num_features)+0.4),  rf_importances,
           label="Random Forest", width=0.2)

### Boosting
# Scale importances between 0 and 1
boost_importances = model_boost_best.feature_importances_
boost_importances_min = np.min(boost_importances)
boost_importances_max = np.max(boost_importances)
boost_importances = (boost_importances-boost_importances_min)/(boost_importances_max-boost_importances_min)

# Display boosting importances
plt.bar((np.arange(num_features)+0.6),  boost_importances,
           label="Boosting", width=0.2)

# Legend
tmp = plt.legend()

# X-axis
plt.xlabel('Variables')
feature_names = list(df.columns[1:])
tmp = plt.xticks(range(num_features), feature_names,
                 rotation=90, fontsize=14)

# Y-axis
tmp = plt.ylabel('Importance')

# Title
tmp = plt.title('Variable Importance')

**Question:** What are the most important variables now? How does this compare to previous models?

## 7. Final Model

**Question:** Which of these models do you choose as the most performant for classifying mushrooms in the test set?

You will now evaluate the model you have chosen on the test set:

In [None]:
my_model = ... # TODO: insert the name of the model you have chosen here.

# Predict on the test set
y_pred = ...

In [None]:
from sklearn import metrics
print("F1 of the chosen model on the test set: %.3f" % metrics.f1_score(y_test, y_pred))

**Question:** What do you think of this performance? Is there a risk of overfitting?

### Confusion Matrix

To better interpret the results, we can also visualize the confusion matrix:

In [None]:
metrics.ConfusionMatrixDisplay.from_predictions(y_test, y_pred)

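**Question:** What do you think of this confusion matrix? Is it satisfactory? Remember that we are trying to predict whether a mushroom is edible: with the label encoding used above, class 0 is edible (`e`) and class 1 is poisonous (`p`), so the two kinds of error do not have the same consequences.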
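
To read the matrix, it helps to know where each kind of error is located. Here is a minimal sketch on made-up labels (`y_true_toy` and `y_pred_toy` are invented for illustration and are not results from this notebook), using the same encoding: 0 for edible, 1 for poisonous.

```python
import numpy as np
from sklearn import metrics

# Made-up labels: 0 = edible, 1 = poisonous
y_true_toy = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_pred_toy = np.array([0, 0, 1, 1, 1, 0, 1, 0])

# Rows are true classes, columns are predicted classes
cm = metrics.confusion_matrix(y_true_toy, y_pred_toy)
print(cm)

# For a binary problem, ravel() returns the four cells in this order
tn, fp, fn, tp = cm.ravel()
print("edible predicted poisonous (false positives):", fp)
print("poisonous predicted edible (false negatives):", fn)
```

In this encoding, the bottom-left cell counts poisonous mushrooms predicted as edible.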

### ROC Curve

We can also evaluate the model's performance **before thresholding**, i.e., by using the predicted numerical scores rather than binary labels, thanks to a [ROC Curve](https://scikit-learn.org/stable/modules/model_evaluation.html#roc-metrics).

The scores before thresholding of a scikit-learn classification model are accessible through the `predict_proba` method.

In [None]:
y_pred_scores = my_model.predict_proba(X_test)[:, 1]  # predicted probability of class 1
y_pred_scores

In [None]:
# Print both shapes (a notebook cell only displays its last expression)
print(y_pred_scores.shape, X_test.shape)

In [None]:
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred_scores)
max_fpr = 0.01
roc_auc = metrics.auc(fpr, tpr)
max_index_where_fpr_acceptable = np.where(fpr <= max_fpr)[0][-1]
max_tpr = tpr[max_index_where_fpr_acceptable]

fig = plt.figure(figsize=(7, 7))

plt.plot(fpr, tpr, lw=2)

# diagonal
plt.plot([0, 1], [0, 1], color='k')

plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])

# Add more ticks to the axes
plt.xticks(np.arange(0, 1.1, 0.1))
plt.yticks(np.arange(0, 1.1, 0.1))

plt.xlabel('false positive rate')
plt.ylabel('true positive rate')
plt.title("ROC curve of the final model")

# Add vertical line at max_fpr and horizontal line at max_tpr
plt.plot([max_fpr, max_fpr], [0, max_tpr], color='red', linestyle='--', label=f'FPR = {max_fpr:.2f}')
plt.plot([0, max_fpr], [max_tpr, max_tpr], color='red', linestyle='--')
plt.legend()

This curve can also be used to determine the true positive rate corresponding to a given false positive rate, and to determine the corresponding threshold:

In [None]:
print("The true positive rate corresponding to a false positive rate not exceeding %.f %% is %.f %%" % ((100*max_fpr), (100*max_tpr)))
print("It corresponds to a threshold of %.2f on the model's predictions." % thresholds[max_index_where_fpr_acceptable])
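
The selection of the threshold can be followed by hand on a tiny example. The labels and scores below are invented for illustration, and the maximum acceptable false positive rate is set to 0 to keep the arithmetic simple.

```python
import numpy as np
from sklearn import metrics

# Invented labels and scores (not results from this notebook)
y_true_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1])
scores_toy = np.array([0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9])

fpr_toy, tpr_toy, thresholds_toy = metrics.roc_curve(y_true_toy, scores_toy)

# Last point of the curve whose false positive rate does not exceed 0
idx = np.where(fpr_toy <= 0.0)[0][-1]
print(tpr_toy[idx], thresholds_toy[idx])  # 0.75 0.7

# Check by thresholding the scores directly
y_pred_toy = (scores_toy >= thresholds_toy[idx]).astype(int)
print(metrics.recall_score(y_true_toy, y_pred_toy))  # 0.75
```

Predicting class 1 whenever the score is at least 0.7 recovers three of the four positives without any false positive; the fourth positive (score 0.4) ranks below a negative (score 0.6) and cannot be recovered without accepting that error.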