*This Notebook was created by Antoine Palisson*

In [13]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

### Dataset

In [3]:
from sklearn.datasets import fetch_openml

data = fetch_openml('artificial-characters', version=1, as_frame=True)
X = data['data']
y = data['target']

In [22]:
X.head()

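|   | V1 | V2 | V3 | V4 | V5 | V6 | V7 |
|---|----|----|----|----|----|----|----|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 20.0 | 20.0 | 46.1 |
| 1 | 1.0 | 19.0 | 0.0 | 19.0 | 8.0 | 8.0 | 46.1 |
| 2 | 2.0 | 0.0 | 20.0 | 19.0 | 8.0 | 22.47 | 46.1 |
| 3 | 3.0 | 0.0 | 20.0 | 8.0 | 42.0 | 23.41 | 46.1 |
| 4 | 4.0 | 19.0 | 8.0 | 8.0 | 42.0 | 35.74 | 46.1 |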


In [11]:
y.head()

0    1
1    1
2    1
3    1
4    1
Name: Class, dtype: category
Categories (10, object): ['1', '2', '3', '4', ..., '7', '8', '9', '10']

In [15]:
y.info()

<class 'pandas.core.series.Series'>
RangeIndex: 10218 entries, 0 to 10217
Series name: Class
Non-Null Count  Dtype   
--------------  -----   
10218 non-null  category
dtypes: category(1)
memory usage: 10.5 KB


# Quick Exploration & Preprocessing

**<font color='blue'>1.a. How many classes does the label have ?<br>1.b. Is the dataset balanced ?**

In [12]:
y.value_counts()
# 1.a. It has 10 classes.
# 1.b. It is imbalanced: class 3 has 1416 samples while class 10 has only 600.

3     1416
8     1198
1     1196
2     1192
5     1008
6     1000
9     1000
4      808
7      800
10     600
Name: Class, dtype: int64

**<font color='blue'>2. Split the dataset into a training and a testing set.**

*Tips: Don't forget to do the splitting according to the type of the task (classification, regression) and the dataset label (balanced or not).*

This is a classification task and the label is imbalanced, so the split should be stratified on the label. <br/><br/>
The dataset will be split into:<br/>
(1) 70% training set;<br/>
(2) 10% validation set;<br/>
(3) 20% testing set.

In [14]:
#Split training and rest
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, train_size=0.7, random_state=42)
#Split validation and rest
X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, train_size = 1/3, stratify=y_test, random_state=42)

In [24]:
print(y_train.value_counts(normalize=True))
print(y_val.value_counts(normalize=True))


3     0.138563
8     0.117170
1     0.117030
2     0.116611
5     0.098714
6     0.097875
9     0.097875
4     0.079139
7     0.078300
10    0.058725
Name: Class, dtype: float64
3     0.138943
1     0.117417
8     0.117417
2     0.116438
5     0.098826
6     0.097847
9     0.097847
4     0.078278
7     0.078278
10    0.058708
Name: Class, dtype: float64
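
The two-step split can be sanity-checked with a small sketch. It runs on synthetic data from `make_classification`, which is only a stand-in assumed here for self-containment; the names `X_demo`, `y_demo` and `sizes` are illustrative and not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption): 1000 samples, 3 imbalanced classes
X_demo, y_demo = make_classification(n_samples=1000, n_features=7, n_informative=5,
                                     n_redundant=0, n_classes=3,
                                     weights=[0.6, 0.3, 0.1], random_state=42)

# First split: 70% for training, 30% held out
X_tr, X_rest, y_tr, y_rest = train_test_split(
    X_demo, y_demo, train_size=0.7, stratify=y_demo, random_state=42)

# Second split: one third of the held-out 30% (10% overall) for validation,
# the remaining two thirds (20% overall) for testing
X_va, X_te, y_va, y_te = train_test_split(
    X_rest, y_rest, train_size=1/3, stratify=y_rest, random_state=42)

# Fraction of the full dataset that ends up in each part
sizes = {name: len(part) / len(y_demo)
         for name, part in [("train", y_tr), ("val", y_va), ("test", y_te)]}
print(sizes)
```

The second call uses `train_size=1/3` because it is applied to the held-out 30%, not to the full dataset.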


The dataset only contains numerical features.<br>The purpose of this exercise is not to explore the data or to do specific preprocessing.

**<font color='blue'>3. How should you preprocess the dataset ?<br> Don't apply the preprocessing yet.**

<b>Preprocessing steps:</b>
<li>Missing values processing;</li>
<li>Outliers processing;</li>
<li>Handle data errors;</li>
<li>Duplications processing;</li>
<li>For numerical data: data transformation (scaling or mathematical transformation);</li>
<li>For categorical data: data transformation (according to the type of categorical data); <br/>
Encoding</li>

# Model Selection

**<font color='blue'>1.a. Which metric from the sklearn library should you use for this dataset ?<br> Should you change some of its parameters ?**

*Tips: Is the dataset balanced ? How many classes does the label have ?*

The **`make_scorer`** function is a utility function in the Sklearn library that allows you to create a custom scoring function that can be used in model selection and evaluation. It essentially transforms an arbitrary function into a scorer object that can be passed to the `cross_val_score` or `GridSearchCV` functions (you will use them in this notebook).

You can find it [here](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.make_scorer.html).

---

The `make_scorer` **parameters**:

* **`score_func`**: This parameter is a function that computes the score for a given set of predicted and true values. The function takes two arguments: the true labels and the predicted labels. The score function can be any function that returns a scalar value, such as accuracy, precision, recall, F1-score, etc.

* **`greater_is_better`**: This parameter is a boolean value that tells the scorer whether score_func is a score (higher is better) or a loss (lower is better). If set to True (the default), the scorer returns the value of score_func unchanged; if set to False, the scorer returns its negative, so that model selection tools can always maximize the scorer output.

* **`needs_proba`**: This parameter is a boolean value that determines whether the scorer requires the model to output predicted probabilities instead of predicted labels. If set to True, the scorer expects the model to output probabilities, and the score_func function will be applied to the probabilities instead of the predicted labels.

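* **`needs_threshold`**: This parameter is a boolean value that determines whether the score function needs continuous decision values instead of predicted labels. If set to True, the scorer passes the output of the model's `decision_function` (or `predict_proba` when it is not available) to score_func, which metrics such as ROC AUC or average precision require. Note that recent scikit-learn releases deprecate `needs_proba` and `needs_threshold` in favor of a `response_method` parameter.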

* **`kwargs`**: This parameter is a dictionary of additional keyword arguments that can be passed to the score_func function. These arguments can be used to customize the behavior of the score function, such as changing the weight of different classes or adjusting the threshold for binary classification.
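
The effect of `greater_is_better=False` can be seen in a minimal sketch. The toy arrays and the use of `DummyClassifier` and `zero_one_loss` are assumptions chosen for illustration only; they are not part of the exercise.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import make_scorer, zero_one_loss

# Toy data (assumption): three samples of class 0 and one sample of class 1
X_toy = np.zeros((4, 1))
y_toy = np.array([0, 0, 0, 1])

# A classifier that always predicts the most frequent class (here: 0)
clf = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)

# zero_one_loss is a loss (lower is better), hence greater_is_better=False
loss_scorer = make_scorer(zero_one_loss, greater_is_better=False)

raw_loss = zero_one_loss(y_toy, clf.predict(X_toy))  # fraction of misclassified samples
scorer_value = loss_scorer(clf, X_toy, y_toy)        # same quantity with the sign flipped
print(raw_loss, scorer_value)
```

Because the scorer negates the loss, a smaller loss gives a larger scorer value, which is what `cross_val_score` and the search classes rank on.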

---
**Code examples**:

Example 1 - *it is the same as passing `scoring='accuracy'`*
```
acc_scorer = make_scorer(accuracy_score)
scores = cross_val_score(..., scoring=acc_scorer)
```

Example 2 - *average is a parameter of the `f1_score` function*
```
f1_scorer = make_scorer(f1_score, average='micro')
scores = cross_val_score(..., scoring=f1_scorer)
```
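
The two examples above leave the estimator and data elided. A runnable variant might look like the sketch below; the synthetic dataset, the `LogisticRegression` model and the `average="macro"` choice are assumptions made for illustration, not the expected answer to the question.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import cross_val_score

# Synthetic imbalanced 3-class data (assumption, stands in for the real dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=7, n_informative=5,
                                     n_redundant=0, n_classes=3,
                                     weights=[0.6, 0.3, 0.1], random_state=0)

model = LogisticRegression(max_iter=1000)

# Extra keyword arguments (here: average) are forwarded to f1_score
f1_macro_scorer = make_scorer(f1_score, average="macro")

# The custom scorer and the built-in string name should give the same fold scores
scores_custom = cross_val_score(model, X_demo, y_demo, cv=5, scoring=f1_macro_scorer)
scores_builtin = cross_val_score(model, X_demo, y_demo, cv=5, scoring="f1_macro")
print(scores_custom.mean(), scores_custom.std())
```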


**<font color='blue'>1.b. Use the `make_scorer` function from sklearn to create a metric with the correct parameters.**

**Let's compare four models:**

*   Logistic Regression
*   Support Vector Classifier
*   k-Neighbors Classifier
*   Decision Tree Classifier

To compare the models, you will use a **cross-validation method**.

As a reminder:

> *The **`Pipeline`** class in sklearn is a tool for chaining multiple processing steps together into a single estimator. It can be used to automate the workflow of a machine learning project by **combining data preprocessing and modeling into a single object** that can be used for training and prediction. It can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline).*

> *The most important parameter of the Pipeline is the `steps`: it is a list of tuples, where each tuple contains the name of the step and the processing object. The steps are executed in the order they are listed.*

> ```
> pipeline = Pipeline(steps=[('preprocessing', StandardScaler()),   # Preprocessing
>                            ('model', LogisticRegression())])      # Model
> ```
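
As a sketch of how such a pipeline behaves as a single estimator, the block below cross-validates it on synthetic data. The dataset, the `max_iter=1000` setting and the default accuracy metric are assumptions made so the example runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption, stands in for the real dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=7, n_informative=5,
                                     n_redundant=0, random_state=0)

pipeline = Pipeline(steps=[("preprocessing", StandardScaler()),
                           ("model", LogisticRegression(max_iter=1000))])

# The scaler is re-fitted on the training part of each fold,
# so no statistics leak from the held-out part of the fold
scores = cross_val_score(pipeline, X_demo, y_demo, cv=5)
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")

# After fitting, each step can be retrieved by the name given in `steps`
pipeline.fit(X_demo, y_demo)
scaler = pipeline.named_steps["preprocessing"]
```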


**<font color='blue'>2.a. Use the `Pipeline` class to merge the preprocessing function and a Logistic Regression model.**

**<font color='blue'>2.b. Do a cross-validation method to evaluate the model performance using the `make_scorer` function.<br>Show the mean and the standard deviation of the scores.**

**<font color='blue'>3. Do the same for the Support Vector Classifier model.**

**<font color='blue'>4. Do the same for the k-neighbors classifier model.**

**<font color='blue'>5. Finally, do the same for the Decision Tree classifier model.<br>Which model is the best ?**

# Hyperparameter Tuning

## Grid Search

The **Decision Tree classifier** model has many parameters.<br> Let's try to tune the following ones:

*   **`criterion`** which can take three values: gini, entropy or log_loss
*   **`splitter`** which can take two values: random or best
*   **`max_depth`** which can be any positive integer or None (i.e. unlimited)
*   **`min_samples_split`** which can be any integer from 2
*   **`min_samples_leaf`** which can be any integer from 1
*   **`max_features`** which can be any integer from 1 to the number of features

Explaining these hyperparameters is out of the scope of this exercise.<br>Thus, we will consider each of them as potentially very important for the task.



**<font color='blue'>1. Change the values of each of the hyperparameters independently and evaluate the model performance.<br>Find the hyperparameters that make the performances change a lot.**
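
One way to vary a single hyperparameter while keeping the others at their defaults is a simple loop, sketched below for `max_depth`. The synthetic data, the candidate values and the default accuracy metric are assumptions for illustration; the same pattern would apply to the other hyperparameters.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption, stands in for the real dataset)
X_demo, y_demo = make_classification(n_samples=400, n_features=7, n_informative=5,
                                     n_redundant=0, n_classes=3, random_state=0)

# Change only max_depth; every other hyperparameter keeps its default value
depth_scores = {}
for depth in [1, 3, 5, 10, None]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    scores = cross_val_score(tree, X_demo, y_demo, cv=5)
    depth_scores[depth] = (scores.mean(), scores.std())

for depth, (mean, std) in depth_scores.items():
    print(f"max_depth={depth}: {mean:.3f} +/- {std:.3f}")
```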

The **`GridSearchCV`** class in Sklearn is a tool for performing an exhaustive search over a specified parameter grid for an estimator. It searches over all possible combinations of the parameters to determine the best parameter values based on the chosen evaluation metric.

You can find it [here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV).

---

Most important **parameters**:

* **`estimator`**: This parameter takes an estimator object that is to be tuned using GridSearchCV. The estimator object should implement a fit method that takes the training data as input.

* **`param_grid`**: This parameter is a dictionary or a list of dictionaries that defines the hyperparameter search space. The keys of the dictionary are the hyperparameter names and the values are the corresponding search spaces. For a grid search, a search space is a list of the values to try for that hyperparameter; every combination of values is evaluated.

* **`scoring`**: This parameter specifies the metric to use for evaluating the performance of the model with different hyperparameters. It can take a string representing a built-in scoring metric, a callable object that implements a custom scoring metric with make_scorer, or a list/tuple of multiple scoring metrics.

* **`cv`**: This parameter specifies the cross-validation splitting strategy. It can take an integer value representing the number of folds in a KFold cross-validation, a cross-validation iterator, or a specific data splitting strategy. It specifies how the data is partitioned into training and validation sets for each hyperparameter combination.

---

Most important **attributes**:

* **`cv_results_`**: This attribute is a dictionary that contains all of the cross-validation results for each combination of hyperparameters tried during the GridSearchCV search. It includes information such as the mean and standard deviation of the test scores, training times, and hyperparameter values for each combination.

* **`best_params_`**: This attribute is a dictionary that contains the best hyperparameter values found during the GridSearchCV search. It includes the hyperparameter names as keys and their corresponding best values as values.

* **`best_score_`**: This attribute is a float value that represents the best cross-validation score obtained during the GridSearchCV search.

---

```
params = {...}

grid_search = GridSearchCV(estimator=model, 
                           param_grid=params)

grid_search.fit(X_train, y_train)

results = grid_search.cv_results_
```
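
The template above can be filled in as follows. This is a minimal sketch on synthetic data; the small grid, the `accuracy` metric and `cv=5` are assumptions chosen to keep the run short.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption, stands in for the real dataset)
X_demo, y_demo = make_classification(n_samples=400, n_features=7, n_informative=5,
                                     n_redundant=0, n_classes=3, random_state=0)

model = DecisionTreeClassifier(random_state=0)
params = {"max_depth": [3, 5, None],          # 3 values
          "criterion": ["gini", "entropy"]}   # 2 values -> 6 combinations

grid_search = GridSearchCV(estimator=model, param_grid=params,
                           scoring="accuracy", cv=5)
grid_search.fit(X_demo, y_demo)

# cv_results_ is a dict of arrays, so it converts directly to a DataFrame
results = pd.DataFrame(grid_search.cv_results_).sort_values("rank_test_score")
print(grid_search.best_params_, grid_search.best_score_)
print(results[["params", "mean_test_score", "std_test_score", "rank_test_score"]])
```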



**<font color='blue'>2. Create a param_grid dictionary with a maximum of 5 different values per hyperparameter.<br> How many trials would you perform ?**

*Tips: If you are using a pipeline, then the model hyperparameters have been renamed to the name of the step in the Pipeline + two underscores + the name of the hyperparameter (see below).*

```
pipe = Pipeline([('preprocessing', ...),
                 ('model', ...)])
params = {"model__hyperparameter1" : [...],
          "model__hyperparameter2" : [...]}
```
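
The naming rule and the size of a grid can be checked without fitting anything. The sketch below uses `ParameterGrid` to count combinations; the two hyperparameters, their values and the 5-fold setting are assumptions for illustration.

```python
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

pipe = Pipeline([("preprocessing", StandardScaler()),
                 ("model", DecisionTreeClassifier())])

params = {"model__max_depth": [3, 5, 10],
          "model__criterion": ["gini", "entropy"]}

# Every key of the grid must be a parameter name known to the pipeline
all_valid = all(name in pipe.get_params() for name in params)

# Number of combinations is the product of the list lengths: 3 * 2
n_combinations = len(ParameterGrid(params))

# With 5-fold cross-validation, each combination is fitted 5 times
n_fits = n_combinations * 5
print(all_valid, n_combinations, n_fits)
```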

**<font color='blue'>2. Use a Grid Search strategy to find the best hyperparameter using the param_grid defined at the previous question.**

*Tips: You should pass the Pipeline to the estimator parameter.<br> Additionally, you should pass the custom scorer to the scoring parameter.*

**<font color='blue'>3.a. Get the result of all the trials using the .cv_results_ attribute and transform it into a DataFrame.**


**<font color='blue'>3.b. Sort the trials by the `rank_test_score` column.<br>What is the best set of hyperparameters ? Is it better than the default model ?**

## Random Search

**`RandomizedSearchCV`** is a class in Scikit-learn that randomly samples a combination of hyperparameter values from the search space and fits the model with it, repeating this process for a specified number of iterations to find the combination that gives the best performance on a given metric.

You can find it [here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html#sklearn.model_selection.RandomizedSearchCV).

---

**`RandomizedSearchCV`** has very similar parameters to **`GridSearchCV`**.<br> It also adds new parameters such as:

* **`param_distributions`**: This parameter is a dictionary or a list of dictionaries, where each dictionary contains hyperparameter distributions to be sampled from. The hyperparameters to be tuned are specified as keys in each dictionary, and the corresponding value is a distribution over the hyperparameter space from which to sample. This parameter controls the search space from which the hyperparameters are randomly sampled.

* **`n_iter`**: This parameter specifies the number of iterations to perform during the randomized search. Each iteration samples a set of hyperparameters from the specified param_distributions and fits the model using those hyperparameters. The higher the value of n_iter, the more exhaustive the search for the optimal hyperparameters will be.

* **`random_state`**: This parameter controls the random number generator used for the randomized search. Setting a specific value for random_state ensures that the same set of hyperparameters is sampled on each run, making the results reproducible. If random_state is not set, the search will generate different hyperparameters each time it is run.

---

**`RandomizedSearchCV`** has the same attributes as **`GridSearchCV`**.

---

```
params = {...}

random_search = RandomizedSearchCV(estimator=model, 
                                   param_distributions=params,
                                   n_iter=...)

random_search.fit(X_train, y_train)

results = random_search.cv_results_
```
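
A runnable variant of this template is sketched below. It mixes `scipy.stats.randint` distributions with a plain list; the synthetic data, the ranges and `n_iter=10` are assumptions chosen to keep the run short.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption, stands in for the real dataset)
X_demo, y_demo = make_classification(n_samples=400, n_features=7, n_informative=5,
                                     n_redundant=0, n_classes=3, random_state=0)

model = DecisionTreeClassifier(random_state=0)
params = {"max_depth": randint(2, 20),         # integers from 2 to 19
          "min_samples_leaf": randint(1, 10),  # integers from 1 to 9
          "criterion": ["gini", "entropy"]}    # lists are sampled uniformly

random_search = RandomizedSearchCV(estimator=model,
                                   param_distributions=params,
                                   n_iter=10,        # number of sampled combinations
                                   cv=5,
                                   random_state=0)   # reproducible sampling
random_search.fit(X_demo, y_demo)

results = random_search.cv_results_
print(random_search.best_params_, random_search.best_score_)
```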

**<font color='blue'>1. Use the following param_distributions in the RandomizedSearchCV class to perform a first coarse search with `n_iter=500`.<br>How many trials would it take with a Grid Search strategy ?**

In [None]:
max_depth = list(range(5,80,5))
max_depth.append(None)

params = {"model__criterion": ["gini", "entropy"], 
          "model__splitter": ["random", "best"],
          "model__max_depth": max_depth, 
          "model__min_samples_split": range(2,30,2), 
          "model__min_samples_leaf": range(1,30,2), 
          "model__max_features": range(1,7)}
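
The size of the equivalent exhaustive grid is the product of the number of values per hyperparameter. The sketch below repeats the dictionary above so that it runs on its own; `n_grid` and `coverage` are illustrative names.

```python
import math

max_depth = list(range(5, 80, 5))
max_depth.append(None)

params = {"model__criterion": ["gini", "entropy"],
          "model__splitter": ["random", "best"],
          "model__max_depth": max_depth,
          "model__min_samples_split": range(2, 30, 2),
          "model__min_samples_leaf": range(1, 30, 2),
          "model__max_features": range(1, 7)}

# Number of distinct combinations: 2 * 2 * 16 * 14 * 15 * 6
n_grid = math.prod(len(values) for values in params.values())

# Share of the grid visited by a random search with n_iter=500
coverage = 500 / n_grid
print(n_grid, f"{coverage:.2%}")
```

Each combination is additionally fitted once per cross-validation fold, so the number of model fits is `n_grid` times the number of folds.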

**<font color='blue'>2.a. Get the result of all the trials using the .cv_results_ attribute and sort the trials by the `rank_test_score` column.**

**<font color='blue'>2.b. Have a look at the top hyperparameter combinations and especially their values.<br> Can you find similar values in different top combinations ?**

**<font color='blue'>3. Using the knowledge of the first 500 trials, reduce the search spaces to values close to the best ones for each parameter and run 250 trials.**
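
One possible way to narrow the spaces is to take the range spanned by the top trials and widen it by a small margin. The sketch below is only an illustration: the `top_trials` values are invented, and the helper `narrowed_range` with its `margin` and `floor` arguments is hypothetical, not part of scikit-learn.

```python
import pandas as pd

# Hypothetical top trials of a coarse search (values invented for illustration only)
top_trials = pd.DataFrame({
    "param_model__max_depth": [20, 25, 30, 25],
    "param_model__min_samples_split": [2, 4, 2, 6],
})

def narrowed_range(values, margin, floor):
    """Range covering the observed values, widened by `margin` and clipped at `floor`."""
    low = max(int(values.min()) - margin, floor)
    high = int(values.max()) + margin
    return range(low, high + 1)

refined_params = {
    # max_depth must be at least 1
    "model__max_depth": narrowed_range(
        top_trials["param_model__max_depth"], margin=5, floor=1),
    # min_samples_split must be at least 2
    "model__min_samples_split": narrowed_range(
        top_trials["param_model__min_samples_split"], margin=2, floor=2),
}
print({name: (space[0], space[-1]) for name, space in refined_params.items()})
```

The resulting dictionary could then be passed as `param_distributions` to a second, finer random search.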

**<font color='blue'>4. Get the result of all the trials and have a look at the top hyperparameter combinations.**

**<font color='blue'>5.a. Get the best hyperparameter combination and re-train the model.**

**<font color='blue'>5.b. Predict the training, validation (using a cross-validation) and testing sets and get the scores.**