*This Notebook was created by Antoine Palisson*

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

### Dataset

In [None]:
from sklearn.datasets import fetch_openml

data = fetch_openml('artificial-characters', version=1, as_frame=True, parser='pandas')
X = data['data']
y = data['target']

# Quick Exploration & Preprocessing

**<font color='blue'>1.a. How many classes does the label have ?<br>1.b. Is the dataset balanced ?**

In [None]:
# 10 classes --> Multi-class classification
# The dataset is imbalanced
y.value_counts(normalize=True)

3     0.138579
8     0.117244
1     0.117048
2     0.116657
5     0.098649
6     0.097867
9     0.097867
4     0.079076
7     0.078293
10    0.058720
Name: Class, dtype: float64

**<font color='blue'>2. Split the dataset into a training and a testing set.**

*Tips: Don't forget to do the splitting according to the type of the task (classification, regression) and the dataset label (balanced or not).*

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,
                                                    test_size=0.2,
                                                    stratify=y,
                                                    random_state=42)
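To see what `stratify=y` does, here is a minimal sketch on synthetic labels. The names `X_demo` and `y_demo` are hypothetical stand-ins, not the notebook's dataset. It compares class proportions in the full data, the training set and the testing set.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels (hypothetical stand-in for the real `y`)
rng = np.random.default_rng(0)
y_demo = pd.Series(rng.choice([1, 2, 3], size=1000, p=[0.6, 0.3, 0.1]))
X_demo = pd.DataFrame({"feature": rng.normal(size=1000)})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Class proportions should be nearly identical in every subset
proportions = pd.DataFrame({
    "full": y_demo.value_counts(normalize=True),
    "train": y_tr.value_counts(normalize=True),
    "test": y_te.value_counts(normalize=True),
}).sort_index()
max_gap = (proportions["train"] - proportions["test"]).abs().max()

print(proportions)
print(f"Largest train/test proportion gap: {max_gap:.4f}")
```

Without `stratify`, the rarest classes could end up over- or under-represented in the testing set purely by chance.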

The dataset only contains numerical features.<br>The purpose of this exercise is not to explore the data or to do specific preprocessing.

**<font color='blue'>3. How should you preprocess the dataset ?<br> Don't apply the preprocessing yet.**

It should be scaled.
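As an illustration only (the preprocessing itself is applied later through a pipeline), the sketch below shows what standard scaling does on synthetic data. The arrays `X_train_demo` and `X_test_demo` are hypothetical. The scaler is fitted on the training data only and then reused on the testing data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic features with a large mean and spread (hypothetical data)
rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=50, scale=10, size=(200, 3))
X_test_demo = rng.normal(loc=50, scale=10, size=(50, 3))

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_demo)  # learn mean/std on train only
X_test_scaled = scaler.transform(X_test_demo)        # reuse the training statistics

train_means = X_train_scaled.mean(axis=0)
train_stds = X_train_scaled.std(axis=0)
print(train_means.round(6), train_stds.round(6))
```

After scaling, each training feature has mean 0 and standard deviation 1. The testing features are only approximately standardized because they reuse the training statistics.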

# Model Selection

**<font color='blue'>1.a. Which metric from the sklearn library should you use for this dataset ?<br> Should you change some of its parameters ?**

*Tips: Is the dataset balanced ? How many classes does the label have ?*

The F1-score, because the dataset is imbalanced.<br>
The `average` parameter of the `f1_score` function should be set to `micro`, `macro` or `weighted` because this is a multi-class classification task (the default value, `binary`, only works with two classes).
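The three averaging modes can give different numbers on the same predictions. The sketch below uses small hand-made label arrays (hypothetical, not from the dataset) to compare them. It also checks a known property: for single-label multi-class problems, the micro-averaged F1-score equals the accuracy.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hand-made labels for three classes (illustrative only)
y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 0])

f1_micro = f1_score(y_true, y_pred, average="micro")        # pools all decisions
f1_macro = f1_score(y_true, y_pred, average="macro")        # unweighted mean over classes
f1_weighted = f1_score(y_true, y_pred, average="weighted")  # mean weighted by class support
acc = accuracy_score(y_true, y_pred)

print(f"micro={f1_micro:.3f} macro={f1_macro:.3f} weighted={f1_weighted:.3f} accuracy={acc:.3f}")
```

If rare classes matter as much as frequent ones, `macro` may be the more informative choice, since it gives every class the same weight.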

The **`make_scorer`** function is a utility function in the Sklearn library that allows you to create a custom scoring function that can be used in model selection and evaluation. It essentially transforms an arbitrary function into a scorer object that can be passed to the `cross_val_score` or `GridSearchCV` functions (you will use them in this notebook).

You can find it [here](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.make_scorer.html).

---

The `make_scorer` **parameters**:

* **`score_func`**: This parameter is a function that computes the score for a given set of predicted and true values. The function takes two arguments: the true labels and the predicted labels. The score function can be any function that returns a scalar value, such as accuracy, precision, recall, F1-score, etc.

* **`greater_is_better`**: This parameter is a boolean value that determines whether a higher score is better or worse for the model. If set to True, the scorer will be maximizing the score; if set to False, the scorer will be minimizing the score.

* **`needs_proba`**: This parameter is a boolean value that determines whether the scorer requires the model to output predicted probabilities instead of predicted labels. If set to True, the scorer expects the model to output probabilities, and the score_func function will be applied to the probabilities instead of the predicted labels.

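* **`needs_threshold`**: This parameter is a boolean value that determines whether `score_func` requires continuous decision values instead of predicted labels. If set to True, the scorer passes the output of the model's `decision_function` (or `predict_proba` if `decision_function` is not available) to `score_func`, as required by metrics such as `roc_auc_score`. It does not tune a decision threshold.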

* **`kwargs`**: These are additional keyword arguments that are forwarded to `score_func` each time it is called. They can be used to customize the behavior of the score function, such as setting the `average` parameter of `f1_score`.
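The effect of `greater_is_better=False` is easiest to see with a loss function. In the sketch below (synthetic data and hypothetical names), `zero_one_loss` is wrapped into a scorer. The scorer returns the negated loss, so that higher values are still better for model selection tools.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, zero_one_loss
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data (illustrative only)
X_demo, y_demo = make_classification(n_samples=300, n_features=6,
                                     n_informative=4, random_state=0)

# A loss must be minimized, so the scorer flips its sign
loss_scorer = make_scorer(zero_one_loss, greater_is_better=False)

scores = cross_val_score(LogisticRegression(max_iter=500),
                         X_demo, y_demo, cv=3, scoring=loss_scorer)
print(scores)  # negative (or zero) values: the negated misclassification rates
```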

---
**Code examples**:

Example 1 - *it is the same as passing `scoring='accuracy'`*
```
acc_scorer = make_scorer(accuracy_score)
scores = cross_val_score(..., scoring=acc_scorer)
```

Example 2 - *average is a parameter of the `f1_score` function*
```
f1_scorer = make_scorer(f1_score, average='micro')
scores = cross_val_score(..., scoring=f1_scorer)
```


**<font color='blue'>1.b. Use the `make_scorer` function from sklearn to create a metric with the correct parameters.**

In [None]:
from sklearn.metrics import make_scorer, f1_score

f1_scorer = make_scorer(f1_score, average='micro')

**Let's compare four models:**

*   Logistic Regression
*   Support Vector Classifier
*   k-Neighbors Classifier
*   Decision Tree Classifier

To compare the models, you will use a **cross-validation method**.

As a reminder:

> *The **`Pipeline`** class in sklearn is a tool for chaining multiple processing steps together into a single estimator. It can be used to automate the workflow of a machine learning project by **combining data preprocessing and modeling into a single object** that can be used for training and prediction. It can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline).*

> *The most important parameter of the Pipeline is the `steps`: it is a list of tuples, where each tuple contains the name of the step and the processing object. The steps are executed in the order they are listed.*

> ```
> pipeline = Pipeline(steps=[('preprocessing', StandardScaler()),       # Preprocessing
>                            ('model', LogisticRegression())])          # Model
> ```


**<font color='blue'>2.a. Use the `Pipeline` class to merge the preprocessing function and a Logistic Regression model.**

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

logreg = LogisticRegression(max_iter=200,
                            random_state=42)

pipe = Pipeline([('preprocessing', StandardScaler()),
                 ('model', logreg)])
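A pipeline like the one above exposes its steps by name. The sketch below uses synthetic data, and `demo_pipe` is a hypothetical name. It shows how to set a hyperparameter of the model step with the `step__parameter` syntax and how to retrieve the fitted steps afterwards.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustrative only)
X_demo, y_demo = make_classification(n_samples=200, n_features=6,
                                     n_informative=4, random_state=0)

demo_pipe = Pipeline([("preprocessing", StandardScaler()),
                      ("model", LogisticRegression(max_iter=200))])

# "<step name>__<parameter>" addresses a parameter of one step
demo_pipe.set_params(model__C=0.5)
demo_pipe.fit(X_demo, y_demo)

fitted_scaler = demo_pipe.named_steps["preprocessing"]
fitted_model = demo_pipe.named_steps["model"]
predictions = demo_pipe.predict(X_demo[:5])  # scaling is applied automatically

print(fitted_scaler.mean_.shape, fitted_model.C, predictions)
```

The same naming convention is used later when defining hyperparameter grids for a pipeline.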

**<font color='blue'>2.b. Do a cross-validation method to evaluate the model performance using the `make_scorer` function.<br>Show the mean and the standard deviation of the scores.**

In [None]:
from sklearn.model_selection import cross_val_score

score = cross_val_score(pipe, 
                        X_train, 
                        y_train, 
                        cv=3, 
                        scoring=f1_scorer)

print(f"Mean score : {np.mean(score):.4f} +/- {np.std(score):.4f}")

Mean score : 0.3559 +/- 0.0018


**<font color='blue'>3. Do the same for the Support Vector Classifier model.**

In [None]:
from sklearn.svm import SVC

svc = SVC(random_state=42)
pipe = Pipeline([('preprocessing', StandardScaler()),
                 ('model', svc)])

score = cross_val_score(pipe, 
                        X_train, 
                        y_train, 
                        cv=3, 
                        scoring=f1_scorer)

print(f"Mean score : {np.mean(score):.4f} +/- {np.std(score):.4f}")

Mean score : 0.6112 +/- 0.0068


**<font color='blue'>4. Do the same for the k-neighbors classifier model.**

In [None]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()
pipe = Pipeline([('preprocessing', StandardScaler()),
                 ('model', knn)])

score = cross_val_score(pipe, 
                        X_train, 
                        y_train, 
                        cv=3, 
                        scoring=f1_scorer)

print(f"Mean score : {np.mean(score):.4f} +/- {np.std(score):.4f}")

Mean score : 0.6286 +/- 0.0052


**<font color='blue'>5. Finally, do the same for the Decision Tree classifier model.<br>Which model is the best ?**

In [None]:
from sklearn.tree import DecisionTreeClassifier

dtc = DecisionTreeClassifier(random_state=42)
pipe = Pipeline([('preprocessing', StandardScaler()),
                 ('model', dtc)])

score = cross_val_score(pipe, 
                        X_train, 
                        y_train, 
                        cv=3, 
                        scoring=f1_scorer)

print(f"Mean score : {np.mean(score):.4f} +/- {np.std(score):.4f}")

Mean score : 0.8087 +/- 0.0070
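According to the scores recorded above, the Decision Tree classifier has the highest mean score (0.8087) of the four models. The four comparisons repeat the same code, so they could also be written as a single loop. The sketch below is one possible way to do it on synthetic data. All names ending in `_demo` are hypothetical, and the scores it prints are unrelated to the notebook's results.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic multi-class data (illustrative only)
X_demo, y_demo = make_classification(n_samples=600, n_features=7, n_informative=5,
                                     n_redundant=2, n_classes=4, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=200, random_state=42),
    "SVC": SVC(random_state=42),
    "k-Neighbors": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}
f1_scorer_demo = make_scorer(f1_score, average="micro")

summary = {}
for name, model in models.items():
    pipe_demo = Pipeline([("preprocessing", StandardScaler()), ("model", model)])
    scores = cross_val_score(pipe_demo, X_demo, y_demo, cv=3, scoring=f1_scorer_demo)
    summary[name] = (np.mean(scores), np.std(scores))
    print(f"{name:<20} : {np.mean(scores):.4f} +/- {np.std(scores):.4f}")

# Model with the highest mean cross-validation score
best_name = max(summary, key=lambda n: summary[n][0])
print(f"Best model: {best_name}")
```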


# Hyperparameter Tuning

## Grid Search

The **Decision Tree classifier** model has many parameters.<br> Let's try to tune the following ones:

*   **`criterion`**, which can take three values: `gini`, `entropy` or `log_loss`
*   **`splitter`**, which can take two values: `random` or `best`
*   **`max_depth`**, which can be any positive integer or None (i.e. unlimited)
*   **`min_samples_split`**, which can be any integer from 2 upwards
*   **`min_samples_leaf`**, which can be any positive integer
*   **`max_features`**, which can be any integer from 1 to the number of features

Explaining these hyperparameters is out of the scope of this exercise.<br>Thus, we will consider each of them as potentially very important for the task.



**<font color='blue'>1. Change the values of each of the hyperparameters independently and evaluate the model performance.<br>Find the hyperparameters that make the performances change a lot.**

In [None]:
# Criterion does not make the model vary a lot
for criterion in ['gini', 'entropy']:
    dtc = DecisionTreeClassifier(criterion=criterion,
                                random_state=42)
    pipe = Pipeline([('preprocessing', StandardScaler()),
                    ('model', dtc)])

    score = cross_val_score(pipe, 
                            X_train, 
                            y_train, 
                            cv=3, 
                            scoring=f1_scorer)

    print(f"{criterion} --> Mean score : {np.mean(score):.4f} +/- {np.std(score):.4f}")

gini --> Mean score : 0.8087 +/- 0.0070
entropy --> Mean score : 0.8118 +/- 0.0007


In [None]:
# Splitter does not make the model vary a lot
for splitter in ['best', 'random']:
    dtc = DecisionTreeClassifier(splitter=splitter,
                                 random_state=42)
    pipe = Pipeline([('preprocessing', StandardScaler()),
                    ('model', dtc)])

    score = cross_val_score(pipe, 
                            X_train, 
                            y_train, 
                            cv=3, 
                            scoring=f1_scorer)

    print(f"{splitter} --> Mean score : {np.mean(score):.4f} +/- {np.std(score):.4f}")

best --> Mean score : 0.8087 +/- 0.0070
random --> Mean score : 0.8052 +/- 0.0078


In [None]:
# max_depth seems very important
for max_depth in [2, None]:                                   # 2 and None are the most extreme values
    dtc = DecisionTreeClassifier(max_depth=max_depth,
                                 random_state=42)
    pipe = Pipeline([('preprocessing', StandardScaler()),
                    ('model', dtc)])

    score = cross_val_score(pipe, 
                            X_train, 
                            y_train, 
                            cv=3, 
                            scoring=f1_scorer)

    print(f"{max_depth} --> Mean score : {np.mean(score):.4f} +/- {np.std(score):.4f}")

2 --> Mean score : 0.2370 +/- 0.0035
None --> Mean score : 0.8087 +/- 0.0070


In [None]:
# min_samples_split seems very important
for min_samples_split in [2, 100]:                                   
    dtc = DecisionTreeClassifier(min_samples_split=min_samples_split,
                                 random_state=42)
    pipe = Pipeline([('preprocessing', StandardScaler()),
                    ('model', dtc)])

    score = cross_val_score(pipe, 
                            X_train, 
                            y_train, 
                            cv=3, 
                            scoring=f1_scorer)

    print(f"{min_samples_split} --> Mean score : {np.mean(score):.4f} +/- {np.std(score):.4f}")

2 --> Mean score : 0.8087 +/- 0.0070
100 --> Mean score : 0.5700 +/- 0.0085


In [None]:
# min_samples_leaf seems very important
for min_samples_leaf in [1, 100]:                                   
    dtc = DecisionTreeClassifier(min_samples_leaf=min_samples_leaf,
                                 random_state=42)
    pipe = Pipeline([('preprocessing', StandardScaler()),
                    ('model', dtc)])

    score = cross_val_score(pipe, 
                            X_train, 
                            y_train, 
                            cv=3, 
                            scoring=f1_scorer)

    print(f"{min_samples_leaf} --> Mean score : {np.mean(score):.4f} +/- {np.std(score):.4f}")

1 --> Mean score : 0.8087 +/- 0.0070
100 --> Mean score : 0.4856 +/- 0.0129


In [None]:
# max_features seems to have very little effect on the performances
for max_features in [1, 6]:                                   
    dtc = DecisionTreeClassifier(max_features=max_features,
                                 random_state=42)
    pipe = Pipeline([('preprocessing', StandardScaler()),
                    ('model', dtc)])

    score = cross_val_score(pipe, 
                            X_train, 
                            y_train, 
                            cv=3, 
                            scoring=f1_scorer)

    print(f"{max_features} --> Mean score : {np.mean(score):.4f} +/- {np.std(score):.4f}")

1 --> Mean score : 0.7847 +/- 0.0079
6 --> Mean score : 0.8093 +/- 0.0040


In [None]:
# Overall, max_depth, min_samples_leaf and min_samples_split are the most important parameters.
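The one-parameter-at-a-time checks above can also be done with sklearn's `validation_curve`, which evaluates a range of values for a single hyperparameter. The sketch below runs on synthetic data (hypothetical names) with a bare decision tree. With a pipeline, the parameter name would be `model__max_depth`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic multi-class data (illustrative only)
X_demo, y_demo = make_classification(n_samples=600, n_features=7, n_informative=5,
                                     n_redundant=2, n_classes=4, random_state=0)

depths = [1, 2, 4, 8, 16]
train_scores, valid_scores = validation_curve(
    DecisionTreeClassifier(random_state=42), X_demo, y_demo,
    param_name="max_depth", param_range=depths,
    cv=3, scoring="f1_micro",
)

# One row per tested value, one column per fold
for depth, tr, va in zip(depths, train_scores.mean(axis=1), valid_scores.mean(axis=1)):
    print(f"max_depth={depth:<2} train={tr:.3f} validation={va:.3f}")
```

A hyperparameter whose validation score changes strongly across the range is a good candidate for tuning.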

The **`GridSearchCV`** class in Sklearn is a tool for performing an exhaustive search over a specified parameter grid for an estimator. It searches over all possible combinations of the parameters to determine the best parameter values based on the chosen evaluation metric.

You can find it [here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV).

---

Most important **parameters**:

* **`estimator`**: This parameter takes an estimator object that is to be tuned using GridSearchCV. The estimator object should implement a fit method that takes the training data as input.

* **`param_grid`**: This parameter is a dictionary or a list of dictionaries that defines the hyperparameter search space. The keys of the dictionary are the hyperparameter names and the values are lists of candidate values to try. Every combination of the listed values is evaluated (sampling from distributions is only available in `RandomizedSearchCV`).

* **`scoring`**: This parameter specifies the metric to use for evaluating the performance of the model with different hyperparameters. It can take a string representing a built-in scoring metric, a callable object that implements a custom scoring metric with make_scorer, or a list/tuple of multiple scoring metrics.

* **`cv`**: This parameter specifies the cross-validation splitting strategy. It can take an integer representing the number of folds (for a classifier, sklearn then uses a stratified K-fold split), a cross-validation splitter object, or an iterable of train/validation index pairs. It specifies how the data is partitioned into training and validation sets for each hyperparameter combination.

---

Most important **attributes**:

* **`cv_results_`**: This attribute is a dictionary that contains all of the cross-validation results for each combination of hyperparameters tried during the GridSearchCV search. It includes information such as the mean and standard deviation of the test scores, training times, and hyperparameter values for each combination.

* **`best_params_`**: This attribute is a dictionary that contains the best hyperparameter values found during the GridSearchCV search. It includes the hyperparameter names as keys and their corresponding best values as values.

* **`best_score_`**: This attribute is a float value that represents the best cross-validation score obtained during the GridSearchCV search.

---

```
params = {...}

grid_search = GridSearchCV(estimator=model, 
                           param_grid=params)

grid_search.fit(X_train, y_train)

results = grid_search.cv_results_
```
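The template above can be turned into a small runnable sketch. The version below uses synthetic data and a deliberately tiny grid (all names prefixed with `demo_` are hypothetical) to show how `best_params_`, `best_score_` and `cv_results_` relate to each other.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic multi-class data (illustrative only)
X_demo, y_demo = make_classification(n_samples=400, n_features=6, n_informative=4,
                                     n_classes=3, random_state=0)

demo_pipe = Pipeline([("preprocessing", StandardScaler()),
                      ("model", DecisionTreeClassifier(random_state=42))])
demo_grid = {"model__max_depth": [2, 5, None],
             "model__min_samples_leaf": [1, 10]}

demo_search = GridSearchCV(estimator=demo_pipe, param_grid=demo_grid,
                           cv=3, scoring="f1_micro")
demo_search.fit(X_demo, y_demo)

n_candidates = len(demo_search.cv_results_["params"])           # 3 * 2 combinations
best_mean = demo_search.cv_results_["mean_test_score"].max()    # equals best_score_
demo_predictions = demo_search.best_estimator_.predict(X_demo[:5])

print(demo_search.best_params_, f"{demo_search.best_score_:.4f}")
```

With the default `refit=True`, `best_estimator_` is the pipeline refitted on the whole data passed to `fit`, using the best combination found.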



**<font color='blue'>2. Create a param_grid dictionary with a maximum of 5 different values per hyperparameter.<br> How many trials would you perform ?**

*Tips: If you are using a pipeline, then the model hyperparameters have been renamed to the name of the step in the Pipeline + two underscores + the name of the hyperparameter (see below).*

```
pipe = Pipeline([('preprocessing', ...),
                 ('model', ...)])
params = {"model__hyperparameter1" : [...],
          "model__hyperparameter2" : [...]}
```

In [None]:
params = {"model__criterion": ["gini", "entropy"], 
          "model__splitter": ["random", "best"],
          "model__max_depth": [None, 5, 10, 15, 20], 
          "model__min_samples_split": [2, 5, 10, 15, 20], 
          "model__min_samples_leaf": [1, 5, 10, 15, 20], 
          "model__max_features": [2, 3, 4, 5, 6]}

In [None]:
print(f"Number of trials : {2*2*5*5*5*5}")

Number of trials : 2500
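Instead of multiplying the list lengths by hand, the number of combinations can be counted with sklearn's `ParameterGrid`. The sketch below reuses the grid defined above. The name `grid_demo` is hypothetical, and the 3 folds match the `cv=3` used throughout this notebook.

```python
from sklearn.model_selection import ParameterGrid

grid_demo = {"model__criterion": ["gini", "entropy"],
             "model__splitter": ["random", "best"],
             "model__max_depth": [None, 5, 10, 15, 20],
             "model__min_samples_split": [2, 5, 10, 15, 20],
             "model__min_samples_leaf": [1, 5, 10, 15, 20],
             "model__max_features": [2, 3, 4, 5, 6]}

n_combinations = len(ParameterGrid(grid_demo))  # product of the list lengths
n_fits = n_combinations * 3                     # one fit per combination and per fold

print(f"Combinations : {n_combinations}")
print(f"Model fits   : {n_fits}")
```

Each combination is fitted once per cross-validation fold, so the number of model fits is a multiple of the number of trials.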


**<font color='blue'>2. Use a Grid Search strategy to find the best hyperparameters using the param_grid defined at the previous question.**

*Tips: You should pass the Pipeline to the estimator parameter.<br> Additionally, you should pass the custom scorer to the scoring parameter.*

In [None]:
import time

In [None]:
from sklearn.model_selection import GridSearchCV

t0 = time.time()   # start the timer after the import so that only the search is measured

dtc = DecisionTreeClassifier(random_state=42)
pipe = Pipeline([('preprocessing', StandardScaler()),
                 ('model', dtc)])

gscv = GridSearchCV(pipe,
                    cv=3, 
                    param_grid=params,
                    scoring=f1_scorer)

gscv.fit(X_train, y_train)
print(f"The grid search took {time.time() - t0:.0f}s to run")

The grid search took 329.89425015449524s to run


**<font color='blue'>3.a. Get the result of all the trials using the .cv_results_ attribute and transform it into a DataFrame.**


In [None]:
results = pd.DataFrame(gscv.cv_results_)
results.head()

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_model__criterion | param_model__max_depth | param_model__max_features | param_model__min_samples_leaf | param_model__min_samples_split | param_model__splitter | params | split0_test_score | split1_test_score | split2_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.01518 | 0.00088 | 0.019618 | 0.000687 | gini | None | 2 | 1 | 2 | random | `{'model__criterion': 'gini', 'model__max_depth...` | 0.78789 | 0.792661 | 0.784141 | 0.78823 | 0.003486 | 23 |
| 1 | 0.018331 | 0.000296 | 0.02019 | 0.001477 | gini | None | 2 | 1 | 2 | best | `{'model__criterion': 'gini', 'model__max_depth...` | 0.794128 | 0.795229 | 0.800661 | 0.796673 | 0.002855 | 16 |
| 2 | 0.016771 | 0.00407 | 0.021697 | 0.002493 | gini | None | 2 | 1 | 5 | random | `{'model__criterion': 'gini', 'model__max_depth...` | 0.637064 | 0.638899 | 0.635095 | 0.63702 | 0.001553 | 309 |
| 3 | 0.018711 | 0.001195 | 0.019289 | 0.000665 | gini | None | 2 | 1 | 5 | best | `{'model__criterion': 'gini', 'model__max_depth...` | 0.69578 | 0.689541 | 0.707048 | 0.697457 | 0.007245 | 86 |
| 4 | 0.016322 | 0.00301 | 0.019047 | 0.000233 | gini | None | 2 | 1 | 10 | random | `{'model__criterion': 'gini', 'model__max_depth...` | 0.617248 | 0.601468 | 0.610499 | 0.609738 | 0.006465 | 523 |


**<font color='blue'>3.b. Sort the trials by the `rank_test_score` column.<br>What is the best set of hyperparameters ? Is it better than the default model ?**

In [None]:
# The result is only slightly better than the default model (0.8094 vs 0.8087), but it took more than 5 minutes to compute.
# Imagine with a bigger parameter space.
results.loc[:,'params':].sort_values('rank_test_score').head()

| | params | split0_test_score | split1_test_score | split2_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|
| 200 | `{'model__criterion': 'gini', 'model__max_depth...` | 0.817982 | 0.809908 | 0.800294 | 0.809395 | 0.00723 | 1 |
| 201 | `{'model__criterion': 'gini', 'model__max_depth...` | 0.80367 | 0.812477 | 0.811674 | 0.809274 | 0.003976 | 2 |
| 1451 | `{'model__criterion': 'entropy', 'model__max_de...` | 0.812844 | 0.805872 | 0.808003 | 0.808906 | 0.002917 | 3 |
| 1400 | `{'model__criterion': 'entropy', 'model__max_de...` | 0.802569 | 0.805872 | 0.805066 | 0.804502 | 0.001406 | 4 |
| 101 | `{'model__criterion': 'gini', 'model__max_depth...` | 0.802569 | 0.801101 | 0.809104 | 0.804258 | 0.003479 | 5 |


In [None]:
results.loc[:,'params':].sort_values('rank_test_score')['params'].head().to_numpy()

array([{'model__criterion': 'gini', 'model__max_depth': None, 'model__max_features': 6, 'model__min_samples_leaf': 1, 'model__min_samples_split': 2, 'model__splitter': 'random'},
       {'model__criterion': 'gini', 'model__max_depth': None, 'model__max_features': 6, 'model__min_samples_leaf': 1, 'model__min_samples_split': 2, 'model__splitter': 'best'},
       {'model__criterion': 'entropy', 'model__max_depth': None, 'model__max_features': 6, 'model__min_samples_leaf': 1, 'model__min_samples_split': 2, 'model__splitter': 'best'},
       {'model__criterion': 'entropy', 'model__max_depth': None, 'model__max_features': 5, 'model__min_samples_leaf': 1, 'model__min_samples_split': 2, 'model__splitter': 'random'},
       {'model__criterion': 'gini', 'model__max_depth': None, 'model__max_features': 4, 'model__min_samples_leaf': 1, 'model__min_samples_split': 2, 'model__splitter': 'best'}],
      dtype=object)

## Random Search

**`RandomizedSearchCV`** is a class in Scikit-learn that randomly samples combinations of hyperparameter values from the search space and fits the model with each of them, repeating this process for a specified number of iterations and keeping the combination that performs best on a given metric.

You can find it [here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html#sklearn.model_selection.RandomizedSearchCV).

---

**`RandomizedSearchCV`** has very similar parameters to **`GridSearchCV`**.<br> It also adds new parameters such as:

* **`param_distributions`**: This parameter is a dictionary or a list of dictionaries, where each dictionary contains hyperparameter distributions to be sampled from. The hyperparameters to be tuned are specified as keys in each dictionary, and the corresponding value is a distribution over the hyperparameter space from which to sample. This parameter controls the search space from which the hyperparameters are randomly sampled.

* **`n_iter`**: This parameter specifies the number of iterations to perform during the randomized search. Each iteration samples a set of hyperparameters from the specified param_distributions and fits the model using those hyperparameters. The higher the value of n_iter, the more exhaustive the search for the optimal hyperparameters will be.

* **`random_state`**: This parameter controls the random number generator used for the randomized search. Setting a specific value for random_state ensures that the same set of hyperparameters is sampled on each run, making the results reproducible. If random_state is not set, the search will generate different hyperparameters each time it is run.

---

**`RandomizedSearchCV`** has the same attributes as **`GridSearchCV`**.

---

```
params = {...}

random_search = RandomizedSearchCV(estimator=model, 
                                   param_distributions=params,
                                   n_iter=...)

random_search.fit(X_train, y_train)

results = random_search.cv_results_
```
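Besides lists of values, `param_distributions` also accepts distribution objects that provide an `rvs` method, such as those in `scipy.stats`. The sketch below runs on synthetic data with hypothetical names, and the ranges are arbitrary. It mixes integer distributions with a plain list.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic multi-class data (illustrative only)
X_demo, y_demo = make_classification(n_samples=400, n_features=6, n_informative=4,
                                     n_classes=3, random_state=0)

demo_distributions = {
    "max_depth": randint(2, 20),          # integers from 2 to 19 (upper bound excluded)
    "min_samples_split": randint(2, 30),  # integers from 2 to 29
    "criterion": ["gini", "entropy"],     # lists are sampled uniformly
}

demo_search = RandomizedSearchCV(DecisionTreeClassifier(random_state=42),
                                 param_distributions=demo_distributions,
                                 n_iter=20, cv=3, scoring="f1_micro",
                                 random_state=42)
demo_search.fit(X_demo, y_demo)

sampled = demo_search.cv_results_["params"]
n_trials = len(sampled)  # exactly n_iter combinations are evaluated
print(n_trials, demo_search.best_params_)
```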

**<font color='blue'>1. Use the following `param_distributions` in the `RandomizedSearchCV` class to perform a first coarse search with `n_iter=500`.<br>How many trials would it take with a Grid Search strategy ?**

In [None]:
max_depth = list(range(5,80,5))
max_depth.append(None)

params = {"model__criterion": ["gini", "entropy"], 
          "model__splitter": ["random", "best"],
          "model__max_depth": max_depth, 
          "model__min_samples_split": range(2,30,2), 
          "model__min_samples_leaf": range(1,30,2), 
          "model__max_features": range(1,7)}

In [None]:
print(f"Number of trials : {len(max_depth)*len(range(1,30,2))*len(range(2,30,2))*2*2*len(range(1,7)):,}")

Number of trials : 80,640
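With 500 iterations, the random search only visits a small fraction of these 80,640 combinations. The back-of-the-envelope sketch below quantifies that fraction and estimates the chance that at least one trial lands among the best combinations. It assumes independent uniform sampling, which is an idealisation: when every parameter is given as a list, sklearn samples without replacement, which can only help. The helper `prob_hit_top_fraction` is hypothetical.

```python
n_grid = 80_640   # size of the full grid computed above
n_iter = 500      # number of random trials

coverage = n_iter / n_grid
print(f"Fraction of the grid explored: {coverage:.2%}")

def prob_hit_top_fraction(n_trials, top_fraction):
    """Probability that at least one of n_trials independent uniform draws
    falls in the best `top_fraction` of all combinations."""
    return 1 - (1 - top_fraction) ** n_trials

p_60 = prob_hit_top_fraction(60, 0.05)    # 60 trials, best 5% of combinations
p_500 = prob_hit_top_fraction(500, 0.01)  # 500 trials, best 1% of combinations
print(f"60 trials, top 5%  : {p_60:.2%}")
print(f"500 trials, top 1% : {p_500:.2%}")
```

Under these assumptions, a few hundred random trials are very likely to include a near-optimal combination, even though they cover under 1% of the grid.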


In [None]:
from sklearn.model_selection import RandomizedSearchCV

rscv = RandomizedSearchCV(pipe, 
                          params,
                          n_iter=500, 
                          scoring=f1_scorer, 
                          cv=3,
                          random_state=42)

rscv.fit(X_train, y_train)

**<font color='blue'>2.a. Get the result of all the trials using the .cv_results_ attribute and sort the trials by the `rank_test_score` column.**

In [None]:
results = pd.DataFrame(rscv.cv_results_).loc[:,'params':].sort_values('rank_test_score')
results.head()

| | params | split0_test_score | split1_test_score | split2_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|
| 339 | `{'model__splitter': 'best', 'model__min_sample...` | 0.779817 | 0.776514 | 0.781938 | 0.779423 | 0.002232 | 1 |
| 185 | `{'model__splitter': 'best', 'model__min_sample...` | 0.761101 | 0.777248 | 0.760279 | 0.766209 | 0.007813 | 2 |
| 233 | `{'model__splitter': 'best', 'model__min_sample...` | 0.760367 | 0.767706 | 0.76395 | 0.764008 | 0.002997 | 3 |
| 192 | `{'model__splitter': 'best', 'model__min_sample...` | 0.767706 | 0.74422 | 0.759178 | 0.757035 | 0.009707 | 4 |
| 71 | `{'model__splitter': 'best', 'model__min_sample...` | 0.734312 | 0.735413 | 0.738987 | 0.736237 | 0.001996 | 5 |


**<font color='blue'>2.b. Have a look at the top hyperparameter combinations and especially their values.<br> Can you find similar values in different top combinations ?**

In [None]:
# Splitter seems to be better with best
# min_samples_leaf seems to be better with 1
# min_samples_split seems to be better with very low value
# max_depth seems to be better with values close to 30
# max_features seems to be better with 4, 5 or 6
results.iloc[:5]['params'].to_numpy()

array([{'model__splitter': 'best', 'model__min_samples_split': 2, 'model__min_samples_leaf': 1, 'model__max_features': 1, 'model__max_depth': 30, 'model__criterion': 'entropy'},
       {'model__splitter': 'best', 'model__min_samples_split': 4, 'model__min_samples_leaf': 1, 'model__max_features': 5, 'model__max_depth': 25, 'model__criterion': 'entropy'},
       {'model__splitter': 'best', 'model__min_samples_split': 4, 'model__min_samples_leaf': 1, 'model__max_features': 6, 'model__max_depth': 25, 'model__criterion': 'gini'},
       {'model__splitter': 'best', 'model__min_samples_split': 4, 'model__min_samples_leaf': 1, 'model__max_features': 4, 'model__max_depth': 25, 'model__criterion': 'entropy'},
       {'model__splitter': 'best', 'model__min_samples_split': 6, 'model__min_samples_leaf': 1, 'model__max_features': 6, 'model__max_depth': 25, 'model__criterion': 'gini'}],
      dtype=object)
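Reading the dictionaries by eye works for five rows, but the same comparison can be tabulated. The sketch below copies the five combinations printed above into a list and counts how often each value appears. The name `top_params` is hypothetical. In the notebook, the list could be obtained with `results.iloc[:5]['params'].tolist()`.

```python
import pandas as pd

# The five best combinations, copied from the output above
top_params = [
    {'model__splitter': 'best', 'model__min_samples_split': 2, 'model__min_samples_leaf': 1,
     'model__max_features': 1, 'model__max_depth': 30, 'model__criterion': 'entropy'},
    {'model__splitter': 'best', 'model__min_samples_split': 4, 'model__min_samples_leaf': 1,
     'model__max_features': 5, 'model__max_depth': 25, 'model__criterion': 'entropy'},
    {'model__splitter': 'best', 'model__min_samples_split': 4, 'model__min_samples_leaf': 1,
     'model__max_features': 6, 'model__max_depth': 25, 'model__criterion': 'gini'},
    {'model__splitter': 'best', 'model__min_samples_split': 4, 'model__min_samples_leaf': 1,
     'model__max_features': 4, 'model__max_depth': 25, 'model__criterion': 'entropy'},
    {'model__splitter': 'best', 'model__min_samples_split': 6, 'model__min_samples_leaf': 1,
     'model__max_features': 6, 'model__max_depth': 25, 'model__criterion': 'gini'},
]

top_df = pd.DataFrame(top_params)
top_df.columns = top_df.columns.str.replace("model__", "", regex=False)

# How often each value appears among the top combinations
value_counts = {col: top_df[col].value_counts().to_dict() for col in top_df.columns}
for col, counts in value_counts.items():
    print(f"{col:<18} {counts}")
```

Values shared by all top combinations (here `splitter` and `min_samples_leaf`) are natural candidates to fix in a finer search.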

**<font color='blue'>3. Using the knowledge of the first 500 trials, narrow the search space around the best values of each parameter and run 250 trials.**

In [None]:
dtc = DecisionTreeClassifier(splitter="best",
                             random_state=42)

pipe = Pipeline([('preprocessing', StandardScaler()),
                 ('model', dtc)])

max_depth = list(range(20,35))
max_depth.append(None)

params = {"model__criterion": ["gini", "entropy"],
          "model__max_depth": max_depth, 
          "model__min_samples_split": range(2,8), 
          "model__min_samples_leaf": range(1,5), 
          "model__max_features": range(4,7)}

rscv = RandomizedSearchCV(pipe, 
                          params, 
                          n_iter=250,
                          scoring=f1_scorer, 
                          cv=3,
                          random_state=42)

rscv.fit(X_train, y_train)

**<font color='blue'>4. Get the result of all the trials and have a look at the top hyperparameter combinations.**

In [None]:
results = pd.DataFrame(rscv.cv_results_).loc[:,'params':].sort_values('rank_test_score')
results.head()

| | params | split0_test_score | split1_test_score | split2_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|
| 46 | `{'model__min_samples_split': 2, 'model__min_sa...` | 0.812844 | 0.805872 | 0.808003 | 0.808906 | 0.002917 | 1 |
| 206 | `{'model__min_samples_split': 2, 'model__min_sa...` | 0.813211 | 0.805872 | 0.805066 | 0.80805 | 0.003664 | 2 |
| 175 | `{'model__min_samples_split': 2, 'model__min_sa...` | 0.806239 | 0.807339 | 0.80837 | 0.807316 | 0.00087 | 3 |
| 202 | `{'model__min_samples_split': 2, 'model__min_sa...` | 0.803303 | 0.80367 | 0.806902 | 0.804625 | 0.001617 | 4 |
| 58 | `{'model__min_samples_split': 2, 'model__min_sa...` | 0.802569 | 0.801101 | 0.809104 | 0.804258 | 0.003479 | 5 |


In [None]:
results.iloc[:5]['params'].to_numpy()

array([{'model__min_samples_split': 2, 'model__min_samples_leaf': 1, 'model__max_features': 6, 'model__max_depth': 31, 'model__criterion': 'entropy'},
       {'model__min_samples_split': 2, 'model__min_samples_leaf': 1, 'model__max_features': 6, 'model__max_depth': 27, 'model__criterion': 'gini'},
       {'model__min_samples_split': 2, 'model__min_samples_leaf': 1, 'model__max_features': 5, 'model__max_depth': 27, 'model__criterion': 'gini'},
       {'model__min_samples_split': 2, 'model__min_samples_leaf': 1, 'model__max_features': 5, 'model__max_depth': 22, 'model__criterion': 'entropy'},
       {'model__min_samples_split': 2, 'model__min_samples_leaf': 1, 'model__max_features': 4, 'model__max_depth': None, 'model__criterion': 'gini'}],
      dtype=object)

**<font color='blue'>5.a. Get the best hyperparameter combination and re-train the model.**

In [None]:
rscv.best_params_

{'model__min_samples_split': 2,
 'model__min_samples_leaf': 1,
 'model__max_features': 6,
 'model__max_depth': 31,
 'model__criterion': 'entropy'}

In [None]:
dtc = DecisionTreeClassifier(criterion = "entropy",
                             splitter = "best",
                             min_samples_split = 2,
                             min_samples_leaf = 1,
                             max_features = 6,
                             max_depth = 31,
                             random_state = 42)
pipe = Pipeline([('preprocessing', StandardScaler()),
                 ('model', dtc)])

pipe.fit(X_train, y_train)
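Retyping the best values by hand works, but it is easy to make a copy mistake. As an alternative sketch, the dictionary returned by `best_params_` can be passed directly to `set_params`, since its keys already use the `model__` prefix. The dictionary below is copied from the output above, and `final_pipe` is a hypothetical name.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Copied from the `rscv.best_params_` output above
best_params = {'model__min_samples_split': 2,
               'model__min_samples_leaf': 1,
               'model__max_features': 6,
               'model__max_depth': 31,
               'model__criterion': 'entropy'}

final_pipe = Pipeline([('preprocessing', StandardScaler()),
                       ('model', DecisionTreeClassifier(splitter="best", random_state=42))])
final_pipe.set_params(**best_params)   # applies every "model__..." entry to the tree

final_tree = final_pipe.named_steps['model']
print(final_tree.get_params())
```

With the default `refit=True`, `rscv.best_estimator_` already holds this pipeline refitted on the whole training set, so it can also be used directly.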

**<font color='blue'>5.b. Predict the training, validation (using a cross-validation) and testing sets and get the scores.**

In [None]:
y_pred = pipe.predict(X_test)

print(f"Training score   : {f1_score(y_true=y_train, y_pred=pipe.predict(X_train), average='micro'):.2%}")
print(f"Validation score : {np.mean(cross_val_score(pipe, X_train, y_train, cv=3, scoring=f1_scorer)):.2%}")
print(f"Test score       : {f1_score(y_true=y_test, y_pred=y_pred, average='micro'):.2%}")

Training score   : 99.41%
Validation score : 80.89%
Test score       : 90.70%
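The gap between the training score and the validation score suggests that the fully grown tree overfits the training data. One way to monitor this gap within a single call is `cross_validate` with `return_train_score=True`. The sketch below runs on synthetic data (hypothetical names), so its numbers are unrelated to the scores above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic multi-class data (illustrative only)
X_demo, y_demo = make_classification(n_samples=600, n_features=7, n_informative=5,
                                     n_redundant=2, n_classes=4, random_state=0)

cv_out = cross_validate(DecisionTreeClassifier(random_state=42),
                        X_demo, y_demo, cv=3, scoring="f1_micro",
                        return_train_score=True)

train_mean = cv_out["train_score"].mean()   # score on the folds used for fitting
valid_mean = cv_out["test_score"].mean()    # score on the held-out folds
gap = train_mean - valid_mean

print(f"Training score   : {train_mean:.2%}")
print(f"Validation score : {valid_mean:.2%}")
print(f"Gap              : {gap:.2%}")
```

A large gap is a hint that constraints such as `max_depth` or `min_samples_leaf` deserve another look, even when they do not improve the validation score.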
