This notebook measures the performance of a classical machine learning model that vectorizes texts with Tf-Idf and then applies a supervised classification model.

# Environment setup

In [1]:
# from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
import pandas as pd
import numpy as np
import pickle
import warnings
import mlflow
import mlflow.sklearn
from tqdm import tqdm
from utils import filter_dataset
from ml import create_ml_model
from sklearn.model_selection import cross_validate
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
import time


In [2]:
SEED = 314
# URI of the MLflow tracking server and path to the pickled list of column names
URI = "http://localhost:5000"
PATH_COLS = "../data/processed/train_columns.pkl"

In [3]:
# Remove FutureWarning alerts
warnings.filterwarnings("ignore", category=FutureWarning)

# Initialize tqdm for pandas
tqdm.pandas()

# Set a random seed
np.random.seed(SEED)
print("Random seed set to", SEED)

Random seed set to 314


# Loading the data and the train/test split

We load the training and test sets that were split during the exploratory analysis.

In [4]:
# Load the pickle file containing the columns
with open(PATH_COLS, "rb") as f:
    cols = pickle.load(f)

# Load the pickled train/test splits
X_train_full = pd.read_pickle("../data/processed/X_train.pickle")
X_test_full = pd.read_pickle("../data/processed/X_test.pickle")
y_train = pd.read_pickle("../data/processed/y_train.pickle")
y_test = pd.read_pickle("../data/processed/y_test.pickle")

To avoid rerunning the preprocessing steps, which would lengthen experiment times, we load the preprocessed data directly.<br>
If we want to deploy this kind of model to production, however, a data preprocessing step using the most suitable function will have to be added to the pipeline.<br>


The pipeline would then look like this:
- Data standardization function (via a column transformer)
- Tokenization with Spacy
- Vectorization with Tf-Idf
- Classification model

In [5]:
cols

['hour',
 'text',
 'tokenized_text',
 'tokenized_lemma_text',
 'tokenized_cleaned_text',
 'tokenized_cleaned_text_no_punct_and_digits',
 'tokenized_cleaned_lemma_text_no_punct_and_digits']

# Modeling

I will evaluate several classical machine learning models for classifying tweets by sentiment.<br>
Here is a summary of the methodology I will follow:

**Test environment**:<br>
I will use a standard CPU environment with Scikit-Learn.<br>

**Training and test sets**:<br>
I will use the training and test sets already prepared at the end of the exploratory analysis, so that the different preprocessing variants are compared on the same basis.<br>
Note that validation scores here come from 5-fold cross-validation on the training set (`cv=5` in the `cross_validate` calls), while the test set is kept for the final evaluation.<br>
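
As a minimal sketch of how the validation scores logged in the runs are obtained: when `cross_validate` receives a list of metrics, it returns one array per metric under a key prefixed with `test_`, with one value per fold. The toy corpus and labels below are made up for illustration and are not the project's data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline

# Toy corpus (hypothetical): 10 positive and 10 negative short texts
texts = [f"love this number {i}" for i in range(10)]
texts += [f"hate this number {i}" for i in range(10)]
labels = [1] * 10 + [0] * 10

toy_model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))

# Multi-metric scoring: the result holds "test_accuracy", "test_roc_auc", "test_f1"
toy_scores = cross_validate(
    toy_model, texts, labels, cv=5, scoring=["accuracy", "roc_auc", "f1"]
)

# Average each metric over the 5 folds, as done before logging to MLflow
toy_summary = {
    name: toy_scores[f"test_{name}"].mean() for name in ("accuracy", "roc_auc", "f1")
}
```

The `test_` prefix refers to the held-out fold of each split, not to the final test set.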

**Data preparation - tokenization**:<br>
I keep a single tokenization model for all the models so that results stay comparable.<br>
It has been registered as a model in MLFlow so that it can be reused easily.<br>
Here, however, I use the already-prepared data to avoid recomputing it each time.<br>
The raw data was tokenized with Spacy's `en_core_web_sm` model in two forms:
- **Raw data**: the data is used as is.
- **Lemmatized data**: the data is lemmatized. Since lemmatization is done with Spacy, this step is integrated into tokenization.<br>

**Data preparation - text standardization**:<br>
To speed up the experiments and avoid rerunning the preprocessing and tokenization (particularly costly for lemmas), we load the prepared data directly.<br>
If we want to deploy this kind of model to production, however, a data preprocessing step using the most suitable function will have to be added to the pipeline.<br>


The pipeline would then look like this:
- `Tokenization with Spacy (loading the registered model)`


Followed by the pipeline to register in MLFlow:
- `Data standardization function (via a column transformer)`
- `Vectorization with Tf-Idf`
- `Classification model`


**Architectures tested**:<br>
The goal is to move from a simple architecture to a more complex one and observe the impact on model performance.<br>


The file `ml.py` contains the function `create_ml_model`, which builds a pipeline combining the Tf-Idf with a classification model of our choice.<br>
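
The implementation of `create_ml_model` is not shown in this notebook. Judging only from how it is called (a column name, a vectorizer, a classifier), a hypothetical equivalent could look like the sketch below. The name `create_ml_model_sketch`, its body, and the toy data are assumptions, not the content of `ml.py`.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


def create_ml_model_sketch(col_name, vectorizer, classifier):
    """Hypothetical stand-in: vectorize one text column, then classify."""
    # A scalar column name makes the ColumnTransformer pass a 1-D column of
    # strings to the vectorizer, which is what TfidfVectorizer expects.
    features = ColumnTransformer([("tfidf", vectorizer, col_name)])
    return Pipeline([("features", features), ("classifier", classifier)])


# Toy usage on made-up data (plain strings, not the real tokenized columns)
X_toy = pd.DataFrame(
    {"tokenized_text": ["love it", "hate it", "love this", "hate this"]}
)
y_toy = [1, 0, 1, 0]

toy_pipeline = create_ml_model_sketch(
    "tokenized_text", TfidfVectorizer(), LogisticRegression(max_iter=1000)
)
toy_pipeline.fit(X_toy, y_toy)
toy_predictions = toy_pipeline.predict(X_toy)
```

Keeping the vectorizer and the classifier as arguments is what lets each run change only the Tf-Idf settings or the estimator.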



The file `utils.py` contains the generic functions used across all the notebooks.<br>
We only need to specify the parameters of the model under test and the MLflow settings to launch the tests.<br>


Here is a summary of the architectures I will test:
- `Tf-Idf + logistic regression`: Logistic regression is a simple, robust model that is fast to train. It is interesting to see how it behaves on this dataset.
- `Tf-Idf + Multinomial NB`: This probabilistic model is often used for text classification. It is interesting to see how it behaves on this dataset.
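
To illustrate the two candidates, here is a minimal sketch that plugs each classifier behind the same Tf-Idf vectorizer. The four texts and their labels are made up for the example. `MultinomialNB` requires non-negative features, a condition that Tf-Idf weights satisfy.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["love it", "hate it", "love this", "hate this"]  # made-up examples
labels = [1, 0, 1, 0]

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "multinomial_nb": MultinomialNB(),
}

probabilities = {}
for name, classifier in candidates.items():
    model = make_pipeline(TfidfVectorizer(), classifier)
    model.fit(texts, labels)
    # Class probabilities [P(class 0), P(class 1)] for an unseen text
    probabilities[name] = model.predict_proba(["love love"])[0]
```

Both models expose `predict_proba`, so they can be scored with the same metrics (accuracy, ROC AUC, F1).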

For the modeling work, I will use **a logistic regression model as the baseline** to evaluate performance across the prepared datasets.<br>
I will also test feature creation with the Tf-Idf under different hyperparameters.<br>
Finally, once we have found the most suitable dataset, we can test different classification models to see whether performance can be improved.
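
Since every run relies on Tf-Idf features, here is a small check of what scikit-learn computes with its default settings (`smooth_idf=True`, `norm="l2"`): the inverse document frequency is `ln((1 + n) / (1 + df)) + 1`, and each document vector is then scaled to unit length. The three documents are made up for the example.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["good movie", "good film", "bad film"]  # made-up examples
vectorizer = TfidfVectorizer()  # defaults: smooth_idf=True, norm="l2"
matrix = vectorizer.fit_transform(docs)

# Document frequency of each term, counted by hand
n_docs = len(docs)
doc_freq = {"bad": 1, "film": 2, "good": 2, "movie": 1}

# Smoothed idf as documented by scikit-learn
expected_idf = {
    term: np.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()
}

# Idf learned by the vectorizer, keyed by term
learned_idf = {
    term: vectorizer.idf_[index] for term, index in vectorizer.vocabulary_.items()
}

# Each row is L2-normalized, so its norm is 1
row_norms = np.linalg.norm(matrix.toarray(), axis=1)
```

Rare terms such as "movie" and "bad" receive a higher idf than terms shared by several documents.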


To record the model evaluations, I create an experiment that tracks the performance of the models based **on Tf-Idf vectorization**:

In [6]:
# Name of the MLflow experiment used to track the Tf-Idf models
experiment = "ml_models_experiments"

# Set the tracking URI
mlflow.set_tracking_uri(URI)
# Check that the server is reachable: get_tracking_uri() only returns the
# configured URI, so send a real request instead
try:
    mlflow.search_experiments(max_results=1)
except Exception:
    print(f"Cannot connect to the server: {URI}. Check the server status.")
    raise
# Select the experiment; set_experiment creates it if it does not exist yet
mlflow.set_experiment(experiment)

## Evaluating the datasets

We will go through the experiments one run at a time.<br>
We start with the Tf-Idf parameters to see their impact on model performance, testing each of the data preparations:
- `tokenized_text`: data tokenized with Spacy
- `tokenized_lemma_text`: data tokenized and lemmatized with Spacy
- `tokenized_cleaned_text`: data tokenized and cleaned with the light standardization function used during exploration
- `tokenized_cleaned_text_no_punct_and_digits`: data tokenized, cleaned, and stripped of punctuation and digits with the light standardization function used during exploration
- `tokenized_cleaned_lemma_text_no_punct_and_digits`: data tokenized, lemmatized, cleaned, and stripped of punctuation and digits with the light standardization function used during exploration. I keep only this preparation for the lemmas, since the heavier simplification should benefit a Tf-Idf more than keeping the noise from punctuation and digits.

In [7]:
cols

['hour',
 'text',
 'tokenized_text',
 'tokenized_lemma_text',
 'tokenized_cleaned_text',
 'tokenized_cleaned_text_no_punct_and_digits',
 'tokenized_cleaned_lemma_text_no_punct_and_digits']

### **Run 1**: default parameters (unigrams) to evaluate logistic regression performance on the different datasets.

In [8]:
# Iterate through each preprocessed dataset
for col_name in cols[2:]:
    # Filter the dataset to keep only the column of interest
    X_train, X_test = filter_dataset(X_train_full, X_test_full, cols=[col_name])
    # Start the MLflow run & autolog
    mlflow.sklearn.autolog()
    with mlflow.start_run() as active_run:
        # Set the model
        model = create_ml_model(
            col_name,
            TfidfVectorizer(
                ngram_range=(1, 1),
                min_df=5,
                strip_accents="unicode",
                stop_words=None,
                lowercase=True,
            ),
            LogisticRegression(max_iter=1000, random_state=SEED),
        )
        # Cross validate the model & log the validation scores
        val_scores = cross_validate(
            model,
            X_train,
            y_train,
            cv=5,
            scoring=["accuracy", "roc_auc", "f1"],
            n_jobs=-1,
        )

        # Fit the model on the training data
        model.fit(X_train, y_train)

        # Compute the inference time & log it
        start_time = time.time()
        y_pred = model.predict(X_test)
        inference_time = time.time() - start_time

        # Log the additional metrics & parameters
        mlflow.log_metrics(
            {
                "val_accuracy": val_scores["test_accuracy"].mean(),
                "val_roc_auc": val_scores["test_roc_auc"].mean(),
                "val_f1": val_scores["test_f1"].mean(),
                "inference_time": inference_time,
            }
        )
        mlflow.log_params(
            {
                "data_preparation": col_name,
                "test_size_ratio": 0.2,
                "val_splits": len(val_scores["test_accuracy"]),
            }
        )

        # Evaluate the data on the test set with the model logged in MLflow
        evaluation_data = pd.concat([X_test, y_test], axis=1).assign(predictions=y_pred)
        model_uri = f"runs:/{active_run.info.run_id}/model"
        mlflow.evaluate(
            model=model_uri,
            model_type="classifier",
            data=evaluation_data,
            targets="target",
            predictions="predictions",
            evaluators=None,
            evaluator_config={
                "log_model_explainability": False
            },  # Disable SHAP explanations
        )

2025-01-12 21:20:27.331299: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-01-12 21:20:28.107426: I tensorflow/c/logging.cc:34] Successfully opened dynamic library libdirectml.d6f03b303ac3c4f2eeb8ca631688c9757b361310.so
2025-01-12 21:20:28.107511: I tensorflow/c/logging.cc:34] Successfully opened dynamic library libdxcore.so
2025-01-12 21:20:28.112425: I tensorflow/c/logging.cc:34] Successfully opened dynamic library libd3d12.so
Dropped Escape call with ulEscapeCode : 0x03007703
Dropped Escape call with ulEscapeCode : 0x03007703
2025-01-12 21:20:28.277260: I tensorflow/c/logging.cc:34] DirectML device enumeration: found 1 compatible adapters.
2025/01/12 21:20:39 INFO mlflow.models.evaluation.default_evaluator: Computin

<Figure size 1050x700 with 0 Axes>

![image.png](attachment:image.png)

The results are fairly promising from the start, and the lemmatized data with the advanced preparation gives the best results.<br>

### **Run 2**: second run with stop-word filtering enabled (`stop_words="english"`). We expect that removing stop words is not necessarily beneficial with a Tf-Idf.

In [9]:
# Iterate through each preprocessed dataset
for col_name in cols[2:]:
    # Filter the dataset to keep only the column of interest
    X_train, X_test = filter_dataset(X_train_full, X_test_full, cols=[col_name])
    # Start the MLflow run & autolog
    mlflow.sklearn.autolog()
    with mlflow.start_run() as active_run:
        # Set the model
        model = create_ml_model(
            col_name,
            TfidfVectorizer(
                ngram_range=(1, 1),
                min_df=5,
                strip_accents="unicode",
                stop_words="english",
            ),
            LogisticRegression(max_iter=1000, random_state=SEED),
        )
        # Cross validate the model & log the validation scores
        val_scores = cross_validate(
            model,
            X_train,
            y_train,
            cv=5,
            scoring=["accuracy", "roc_auc", "f1"],
            n_jobs=-1,
        )

        # Fit the model on the training data
        model.fit(X_train, y_train)

        # Compute the inference time & log it
        start_time = time.time()
        y_pred = model.predict(X_test)
        inference_time = time.time() - start_time

        # Log the additional metrics & parameters
        mlflow.log_metrics(
            {
                "val_accuracy": val_scores["test_accuracy"].mean(),
                "val_roc_auc": val_scores["test_roc_auc"].mean(),
                "val_f1": val_scores["test_f1"].mean(),
                "inference_time": inference_time,
            }
        )
        mlflow.log_params(
            {
                "data_preparation": col_name,
                "test_size_ratio": 0.2,
                "val_splits": len(val_scores["test_accuracy"]),
            }
        )

        # Evaluate the data on the test set with the model logged in MLflow
        evaluation_data = pd.concat([X_test, y_test], axis=1).assign(predictions=y_pred)
        model_uri = f"runs:/{active_run.info.run_id}/model"
        mlflow.evaluate(
            model=model_uri,
            model_type="classifier",
            data=evaluation_data,
            targets="target",
            predictions="predictions",
            evaluators=None,
            evaluator_config={
                "log_model_explainability": False
            },  # Disable SHAP explanations
        )

2025/01/12 21:21:35 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
2025/01/12 21:21:35 INFO mlflow.models.evaluation.default_evaluator: The evaluation dataset is inferred as binary dataset, positive label is 1, negative label is 0.
2025/01/12 21:21:35 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...
2025/01/12 21:21:36 INFO mlflow.tracking._tracking_service.client: 🏃 View run bright-fowl-944 at: http://localhost:5000/#/experiments/13/runs/8f9a7c80dee94818a2db4e921eb0b5a0.
2025/01/12 21:21:36 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: http://localhost:5000/#/experiments/13.
2025/01/12 21:21:46 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
2025/01/12 21:21:46 INFO mlflow.models.evaluation.default_evaluator: The evaluation dataset is inferred as binary dataset, positive label is 1, negative label is 0.
2025/01/12 21:21:46 INFO mlflow.models.evaluation.default_eval

![image.png](attachment:image.png)

As expected, stop-word filtering benefited the less-cleaned texts the most. It did, however, penalize the lemmatized texts.
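
One possible explanation for the cases where stop-word filtering hurts (an assumption, not something measured in this notebook) is that scikit-learn's built-in English list contains negations such as "not". The sketch below, on two made-up sentences, shows that both collapse to the same vocabulary once the list is applied.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

docs = ["this movie is not good", "this movie is good"]  # made-up examples

# Vocabulary without and with the built-in English stop-word list
vocab_full = set(TfidfVectorizer(stop_words=None).fit(docs).vocabulary_)
vocab_filtered = set(TfidfVectorizer(stop_words="english").fit(docs).vocabulary_)

# Terms dropped by the stop-word filter
removed = vocab_full - vocab_filtered

# The negation itself is part of the stop-word list
negation_is_stop_word = "not" in ENGLISH_STOP_WORDS
```

With the filter on, both sentences are reduced to "movie" and "good", so the classifier can no longer tell them apart.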

### **Run 3**: removing stop words hurts model performance. This time, let's test ngram_range (1,2) to see whether it improves performance.

In [10]:
# Iterate through each preprocessed dataset
for col_name in cols[2:]:
    # Filter the dataset to keep only the column of interest
    X_train, X_test = filter_dataset(X_train_full, X_test_full, cols=[col_name])
    # Start the MLflow run & autolog
    mlflow.sklearn.autolog()
    with mlflow.start_run() as active_run:
        # Set the model
        model = create_ml_model(
            col_name,
            TfidfVectorizer(
                ngram_range=(1, 2),
                min_df=5,
                strip_accents="unicode",
                stop_words=None,
            ),
            LogisticRegression(max_iter=1000, random_state=SEED),
        )
        # Cross validate the model & log the validation scores
        val_scores = cross_validate(
            model,
            X_train,
            y_train,
            cv=5,
            scoring=["accuracy", "roc_auc", "f1"],
            n_jobs=-1,
        )

        # Fit the model on the training data
        model.fit(X_train, y_train)

        # Compute the inference time & log it
        start_time = time.time()
        y_pred = model.predict(X_test)
        inference_time = time.time() - start_time

        # Log the additional metrics & parameters
        mlflow.log_metrics(
            {
                "val_accuracy": val_scores["test_accuracy"].mean(),
                "val_roc_auc": val_scores["test_roc_auc"].mean(),
                "val_f1": val_scores["test_f1"].mean(),
                "inference_time": inference_time,
            }
        )
        mlflow.log_params(
            {
                "data_preparation": col_name,
                "test_size_ratio": 0.2,
                "val_splits": len(val_scores["test_accuracy"]),
            }
        )

        # Evaluate the data on the test set with the model logged in MLflow
        evaluation_data = pd.concat([X_test, y_test], axis=1).assign(predictions=y_pred)
        model_uri = f"runs:/{active_run.info.run_id}/model"
        mlflow.evaluate(
            model=model_uri,
            model_type="classifier",
            data=evaluation_data,
            targets="target",
            predictions="predictions",
            evaluators=None,
            evaluator_config={
                "log_model_explainability": False
            },  # Disable SHAP explanations
        )

2025/01/12 21:22:31 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
2025/01/12 21:22:32 INFO mlflow.models.evaluation.default_evaluator: The evaluation dataset is inferred as binary dataset, positive label is 1, negative label is 0.
2025/01/12 21:22:32 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...
2025/01/12 21:22:32 INFO mlflow.tracking._tracking_service.client: 🏃 View run monumental-conch-506 at: http://localhost:5000/#/experiments/13/runs/bb8d880d36c143fc80f9cddf2924e0e2.
2025/01/12 21:22:32 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: http://localhost:5000/#/experiments/13.
2025/01/12 21:22:44 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
2025/01/12 21:22:44 INFO mlflow.models.evaluation.default_evaluator: The evaluation dataset is inferred as binary dataset, positive label is 1, negative label is 0.
2025/01/12 21:22:44 INFO mlflow.models.evaluation.default

![image.png](attachment:image.png)

This slightly improves performance. Let's push the experiment further with an ngram_range of (1,3).
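
As a quick illustration of what `ngram_range` changes, the sketch below lists the features extracted from a single made-up sentence for each range tested in these runs. With bigrams, a feature such as "not good" becomes available to the classifier.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

doc = ["not good at all"]  # made-up example

# Features produced for each n-gram range (unigrams, up to bigrams, up to trigrams)
features_by_range = {
    ngram_range: sorted(
        TfidfVectorizer(ngram_range=ngram_range).fit(doc).vocabulary_
    )
    for ngram_range in [(1, 1), (1, 2), (1, 3)]
}

# Number of features per range: the vocabulary grows with the upper bound
sizes = {ngram_range: len(feats) for ngram_range, feats in features_by_range.items()}
```

The feature count grows quickly with the upper bound, which is one reason to combine wider ranges with a `min_df` threshold.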

### **Run 4**: let's test ngram_range (1,3) to see whether it improves performance.

In [11]:
# Iterate through each preprocessed dataset
for col_name in cols[2:]:
    # Filter the dataset to keep only the column of interest
    X_train, X_test = filter_dataset(X_train_full, X_test_full, cols=[col_name])
    # Start the MLflow run & autolog
    mlflow.sklearn.autolog()
    with mlflow.start_run() as active_run:
        # Set the model
        model = create_ml_model(
            col_name,
            TfidfVectorizer(
                ngram_range=(1, 3),
                min_df=5,
                strip_accents="unicode",
                stop_words=None,
            ),
            LogisticRegression(max_iter=1000, random_state=SEED),
        )
        # Cross validate the model & log the validation scores
        val_scores = cross_validate(
            model,
            X_train,
            y_train,
            cv=5,
            scoring=["accuracy", "roc_auc", "f1"],
            n_jobs=-1,
        )

        # Fit the model on the training data
        model.fit(X_train, y_train)

        # Compute the inference time & log it
        start_time = time.time()
        y_pred = model.predict(X_test)
        inference_time = time.time() - start_time

        # Log the additional metrics & parameters
        mlflow.log_metrics(
            {
                "val_accuracy": val_scores["test_accuracy"].mean(),
                "val_roc_auc": val_scores["test_roc_auc"].mean(),
                "val_f1": val_scores["test_f1"].mean(),
                "inference_time": inference_time,
            }
        )
        mlflow.log_params(
            {
                "data_preparation": col_name,
                "test_size_ratio": 0.2,
                "val_splits": len(val_scores["test_accuracy"]),
            }
        )

        # Evaluate the data on the test set with the model logged in MLflow
        evaluation_data = pd.concat([X_test, y_test], axis=1).assign(predictions=y_pred)
        model_uri = f"runs:/{active_run.info.run_id}/model"
        mlflow.evaluate(
            model=model_uri,
            model_type="classifier",
            data=evaluation_data,
            targets="target",
            predictions="predictions",
            evaluators=None,
            evaluator_config={
                "log_model_explainability": False
            },  # Disable SHAP explanations
        )

2025/01/12 21:23:33 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
2025/01/12 21:23:33 INFO mlflow.models.evaluation.default_evaluator: The evaluation dataset is inferred as binary dataset, positive label is 1, negative label is 0.
2025/01/12 21:23:33 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...
2025/01/12 21:23:34 INFO mlflow.tracking._tracking_service.client: 🏃 View run luxuriant-dove-981 at: http://localhost:5000/#/experiments/13/runs/afe94e479a9940c19cddd199f309d07a.
2025/01/12 21:23:34 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: http://localhost:5000/#/experiments/13.
2025/01/12 21:23:46 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
2025/01/12 21:23:46 INFO mlflow.models.evaluation.default_evaluator: The evaluation dataset is inferred as binary dataset, positive label is 1, negative label is 0.
2025/01/12 21:23:46 INFO mlflow.models.evaluation.default_e

![image.png](attachment:image.png)

This gives the best results so far. Let's now try to remove noise from the data and reduce dimensionality by raising the minimum document frequency.
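
The minimum document frequency is controlled by the `min_df` parameter of `TfidfVectorizer`: with an integer value, terms that appear in fewer than `min_df` documents are dropped from the vocabulary. A minimal sketch on three made-up documents:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["good movie", "good film", "bad film"]  # made-up examples

# min_df=1 keeps every term; min_df=2 keeps terms present in at least 2 documents
vocab_all = set(TfidfVectorizer(min_df=1).fit(docs).vocabulary_)
vocab_frequent = set(TfidfVectorizer(min_df=2).fit(docs).vocabulary_)

# Terms discarded because they occur in a single document
dropped = vocab_all - vocab_frequent
```

A higher threshold shrinks the feature space but can also discard rare terms that carry signal, so the value is worth tuning.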

### **Run 5**: let's test min_df set to 10 while keeping ngram_range at (1,3) to see whether it improves performance.

In [12]:
# Iterate through each preprocessed dataset
for col_name in cols[2:]:
    # Filter the dataset to keep only the column of interest
    X_train, X_test = filter_dataset(X_train_full, X_test_full, cols=[col_name])
    # Start the MLflow run & autolog
    mlflow.sklearn.autolog()
    with mlflow.start_run() as active_run:
        # Set the model
        model = create_ml_model(
            col_name,
            TfidfVectorizer(
                ngram_range=(1, 3),
                min_df=10,
                strip_accents="unicode",
                stop_words=None,
            ),
            LogisticRegression(max_iter=1000, random_state=SEED),
        )
        # Cross validate the model & log the validation scores
        val_scores = cross_validate(
            model,
            X_train,
            y_train,
            cv=5,
            scoring=["accuracy", "roc_auc", "f1"],
            n_jobs=-1,
        )

        # Fit the model on the training data
        model.fit(X_train, y_train)

        # Compute the inference time & log it
        start_time = time.time()
        y_pred = model.predict(X_test)
        inference_time = time.time() - start_time

        # Log the additional metrics & parameters
        mlflow.log_metrics(
            {
                "val_accuracy": val_scores["test_accuracy"].mean(),
                "val_roc_auc": val_scores["test_roc_auc"].mean(),
                "val_f1": val_scores["test_f1"].mean(),
                "inference_time": inference_time,
            }
        )
        mlflow.log_params(
            {
                "data_preparation": col_name,
                "test_size_ratio": 0.2,
                "val_splits": len(val_scores["test_accuracy"]),
            }
        )

        # Evaluate the data on the test set with the model logged in MLflow
        evaluation_data = pd.concat([X_test, y_test], axis=1).assign(predictions=y_pred)
        model_uri = f"runs:/{active_run.info.run_id}/model"
        mlflow.evaluate(
            model=model_uri,
            model_type="classifier",
            data=evaluation_data,
            targets="target",
            predictions="predictions",
            evaluators=None,
            evaluator_config={
                "log_model_explainability": False
            },  # Disable SHAP explanations
        )

2025/01/12 21:24:38 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
2025/01/12 21:24:39 INFO mlflow.models.evaluation.default_evaluator: The evaluation dataset is inferred as binary dataset, positive label is 1, negative label is 0.
2025/01/12 21:24:39 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...
2025/01/12 21:24:39 INFO mlflow.tracking._tracking_service.client: 🏃 View run unleashed-kite-627 at: http://localhost:5000/#/experiments/13/runs/3d712feee1ab4fe8b9d714dfa062ae02.
2025/01/12 21:24:39 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: http://localhost:5000/#/experiments/13.
2025/01/12 21:24:51 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
2025/01/12 21:24:52 INFO mlflow.models.evaluation.default_evaluator: The evaluation dataset is inferred as binary dataset, positive label is 1, negative label is 0.
2025/01/12 21:24:52 INFO mlflow.models.evaluation.default_e

![image.png](attachment:image.png)

No improvement in performance, and even worse results on the best preparation.<br>
Let's lower min_df to 3 to see whether it improves performance.

### **Run 6**: let's test min_df set to 3 while keeping ngram_range at (1,3) to see whether it improves performance.

In [13]:
# Iterate through each preprocessed dataset
for col_name in cols[2:]:
    # Filter the dataset to keep only the column of interest
    X_train, X_test = filter_dataset(X_train_full, X_test_full, cols=[col_name])
    # Start the MLflow run & autolog
    mlflow.sklearn.autolog()
    with mlflow.start_run() as active_run:
        # Set the model
        model = create_ml_model(
            col_name,
            TfidfVectorizer(
                ngram_range=(1, 3),
                min_df=3,
                strip_accents="unicode",
                stop_words=None,
            ),
            LogisticRegression(max_iter=1000, random_state=SEED),
        )
        # Cross validate the model & log the validation scores
        val_scores = cross_validate(
            model,
            X_train,
            y_train,
            cv=5,
            scoring=["accuracy", "roc_auc", "f1"],
            n_jobs=-1,
        )

        # Fit the model on the training data
        model.fit(X_train, y_train)

        # Compute the inference time & log it
        start_time = time.time()
        y_pred = model.predict(X_test)
        inference_time = time.time() - start_time

        # Log the additional metrics & parameters
        mlflow.log_metrics(
            {
                "val_accuracy": val_scores["test_accuracy"].mean(),
                "val_roc_auc": val_scores["test_roc_auc"].mean(),
                "val_f1": val_scores["test_f1"].mean(),
                "inference_time": inference_time,
            }
        )
        mlflow.log_params(
            {
                "data_preparation": col_name,
                "test_size_ratio": 0.2,
                "val_splits": len(val_scores["test_accuracy"]),
            }
        )

        # Evaluate the data on the test set with the model logged in MLflow
        evaluation_data = pd.concat([X_test, y_test], axis=1).assign(predictions=y_pred)
        model_uri = f"runs:/{active_run.info.run_id}/model"
        mlflow.evaluate(
            model=model_uri,
            model_type="classifier",
            data=evaluation_data,
            targets="target",
            predictions="predictions",
            evaluators=None,
            evaluator_config={
                "log_model_explainability": False
            },  # Disable SHAP explanations
        )

2025/01/12 21:25:44 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
2025/01/12 21:25:44 INFO mlflow.models.evaluation.default_evaluator: The evaluation dataset is inferred as binary dataset, positive label is 1, negative label is 0.
2025/01/12 21:25:44 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...
2025/01/12 21:25:45 INFO mlflow.tracking._tracking_service.client: 🏃 View run illustrious-bear-482 at: http://localhost:5000/#/experiments/13/runs/751eac4089a040a7aec94b9c13cc36c4.
2025/01/12 21:25:45 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: http://localhost:5000/#/experiments/13.
2025/01/12 21:25:57 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
2025/01/12 21:25:57 INFO mlflow.models.evaluation.default_evaluator: The evaluation dataset is inferred as binary dataset, positive label is 1, negative label is 0.
2025/01/12 21:25:58 INFO mlflow.models.evaluation.default

![image.png](attachment:image.png)

Slight improvement in performance. Let's look at how the model behaves in terms of accuracy on the training and test sets:

![image.png](attachment:image.png)

Overall, the bias-variance tradeoff is unbalanced, with a tendency to overfit the training data. Let's try adjusting the regularization parameters of the logistic regression model to see whether performance can be improved.
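
In scikit-learn's `LogisticRegression`, `C` is the inverse of the regularization strength, so smaller values regularize more. One way to watch the train/validation gap while varying `C` is to ask `cross_validate` for the training scores as well. The sketch below uses a synthetic dataset from `make_classification` as a stand-in for the Tf-Idf features; the data and the `C` values are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in: many features, few of them informative
X_demo, y_demo = make_classification(
    n_samples=300, n_features=200, n_informative=10, random_state=314
)

gaps = {}
for c_val in (0.01, 1, 100):
    scores = cross_validate(
        LogisticRegression(C=c_val, max_iter=1000),
        X_demo,
        y_demo,
        cv=5,
        scoring="accuracy",
        return_train_score=True,  # also report accuracy on the training folds
    )
    # Gap between mean training accuracy and mean validation accuracy
    gaps[c_val] = scores["train_score"].mean() - scores["test_score"].mean()
```

A large positive gap suggests overfitting; comparing the gaps across `C` values helps pick a regularization level.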

### **Run 7**: let's test different regularization values for the logistic regression to see whether this improves the results.

In [14]:
# Iterate through each selected dataset
for col_name in cols[2:]:
    # Filter the dataset to keep only the column of interest
    X_train, X_test = filter_dataset(X_train_full, X_test_full, cols=[col_name])
    # Test C values for the Logistic Regression
    for c_val in (0.01, 0.1, 10, 100):
        # Start the MLflow run & autolog
        mlflow.sklearn.autolog()
        with mlflow.start_run() as active_run:
            # Set the model
            model = create_ml_model(
                col_name,
                TfidfVectorizer(
                    ngram_range=(1, 3),
                    min_df=3,
                    strip_accents="unicode",
                    stop_words=None,
                ),
                LogisticRegression(max_iter=1000, C=c_val, random_state=SEED),
            )
            # Cross validate the model & log the validation scores
            val_scores = cross_validate(
                model,
                X_train,
                y_train,
                cv=5,
                scoring=["accuracy", "roc_auc", "f1"],
                n_jobs=-1,
            )

            # Fit the model on the training data
            model.fit(X_train, y_train)

            # Compute the inference time & log it
            start_time = time.time()
            y_pred = model.predict(X_test)
            inference_time = time.time() - start_time

            # Log the additional metrics & parameters
            mlflow.log_metrics(
                {
                    "val_accuracy": val_scores["test_accuracy"].mean(),
                    "val_roc_auc": val_scores["test_roc_auc"].mean(),
                    "val_f1": val_scores["test_f1"].mean(),
                    "inference_time": inference_time,
                }
            )
            mlflow.log_params(
                {
                    "data_preparation": col_name,
                    "test_size_ratio": 0.2,
                    "val_splits": len(val_scores["test_accuracy"]),
                }
            )

            # Evaluate the data on the test set with the model logged in MLflow
            evaluation_data = pd.concat([X_test, y_test], axis=1).assign(
                predictions=y_pred
            )
            model_uri = f"runs:/{active_run.info.run_id}/model"
            mlflow.evaluate(
                model=model_uri,
                model_type="classifier",
                data=evaluation_data,
                targets="target",
                predictions="predictions",
                evaluators=None,
                evaluator_config={
                    "log_model_explainability": False
                },  # Disable SHAP explanations
            )

2025/01/12 21:26:51 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
2025/01/12 21:26:52 INFO mlflow.models.evaluation.default_evaluator: The evaluation dataset is inferred as binary dataset, positive label is 1, negative label is 0.
2025/01/12 21:26:52 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...
2025/01/12 21:26:53 INFO mlflow.tracking._tracking_service.client: 🏃 View run legendary-slug-417 at: http://localhost:5000/#/experiments/13/runs/9398426a75dc4cbfbe93a60057a71039.
2025/01/12 21:26:53 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: http://localhost:5000/#/experiments/13.
2025/01/12 21:27:06 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
2025/01/12 21:27:06 INFO mlflow.models.evaluation.default_evaluator: The evaluation dataset is inferred as binary dataset, positive label is 1, negative label is 0.
2025/01/12 21:27:07 INFO mlflow.models.evaluation.default_e

![image.png](attachment:image.png)

The other values degrade the model's performance, so we keep the default value of C = 1.0.

### **Run 8**: Let's now test a MultinomialNB model with different values of alpha to see whether it improves performance.

In [15]:
# Iterate through each preprocessed dataset
for col_name in cols[2:]:
    # Filter the dataset to keep only the column of interest
    X_train, X_test = filter_dataset(X_train_full, X_test_full, cols=[col_name])
    for alpha_val in (0.1, 0.5, 1.0, 2.0, 5.0, 10.0):
        # Start the MLflow run & autolog
        mlflow.sklearn.autolog()
        with mlflow.start_run() as active_run:
            # Set the model
            model = create_ml_model(
                col_name,
                TfidfVectorizer(
                    ngram_range=(1, 3),
                    min_df=3,
                    strip_accents="unicode",
                    stop_words=None,
                ),
                MultinomialNB(alpha=alpha_val),
            )
            # Cross validate the model & log the validation scores
            val_scores = cross_validate(
                model,
                X_train,
                y_train,
                cv=5,
                scoring=["accuracy", "roc_auc", "f1"],
                n_jobs=-1,
            )

            # Fit the model on the training data
            model.fit(X_train, y_train)

            # Compute the inference time & log it
            start_time = time.time()
            y_pred = model.predict(X_test)
            inference_time = time.time() - start_time

            # Log the additional metrics & parameters
            mlflow.log_metrics(
                {
                    "val_accuracy": val_scores["test_accuracy"].mean(),
                    "val_roc_auc": val_scores["test_roc_auc"].mean(),
                    "val_f1": val_scores["test_f1"].mean(),
                    "inference_time": inference_time,
                }
            )
            mlflow.log_params(
                {
                    "data_preparation": col_name,
                    "test_size_ratio": 0.2,
                    "val_splits": len(val_scores["test_accuracy"]),
                }
            )

            # Evaluate the data on the test set with the model logged in MLflow
            evaluation_data = pd.concat([X_test, y_test], axis=1).assign(
                predictions=y_pred
            )
            model_uri = f"runs:/{active_run.info.run_id}/model"
            mlflow.evaluate(
                model=model_uri,
                model_type="classifier",
                data=evaluation_data,
                targets="target",
                predictions="predictions",
                evaluators=None,
                evaluator_config={
                    "log_model_explainability": False
                },  # Disable SHAP explanations
            )

2025/01/12 21:31:27 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
2025/01/12 21:31:27 INFO mlflow.models.evaluation.default_evaluator: The evaluation dataset is inferred as binary dataset, positive label is 1, negative label is 0.
2025/01/12 21:31:27 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...
2025/01/12 21:31:28 INFO mlflow.tracking._tracking_service.client: 🏃 View run adaptable-bear-501 at: http://localhost:5000/#/experiments/13/runs/46fb21f769b14413a872746b9b8908e1.
2025/01/12 21:31:28 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: http://localhost:5000/#/experiments/13.
2025/01/12 21:31:40 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
2025/01/12 21:31:40 INFO mlflow.models.evaluation.default_evaluator: The evaluation dataset is inferred as binary dataset, positive label is 1, negative label is 0.
2025/01/12 21:31:40 INFO mlflow.models.evaluation.default_e

![image.png](attachment:image.png)

The results are worse than with logistic regression. We keep the logistic regression model for the rest of the experiments.<br>
Once again, the lemmatized data with the advanced preparation gives the best results with the MultinomialNB model.<br>
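For reference, `alpha` in `MultinomialNB` is the additive (Laplace/Lidstone) smoothing parameter: larger values pull the per-class feature probabilities toward a uniform distribution. The sketch below checks this by hand on a toy count matrix. The data and the helper `smoothed_log_prob` are illustrative assumptions, not part of the notebook.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy count matrix (assumption: illustrative values, not the notebook's data)
X = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 3, 0]])
y = np.array([0, 0, 1, 1])


def smoothed_log_prob(alpha):
    """Recompute log P(feature | class) with additive smoothing by hand."""
    counts = np.vstack([X[y == c].sum(axis=0) for c in (0, 1)])
    smoothed = counts + alpha
    return np.log(smoothed / smoothed.sum(axis=1, keepdims=True))


spread = {}
for alpha_val in (0.1, 1.0, 10.0):
    clf = MultinomialNB(alpha=alpha_val).fit(X, y)
    # The fitted log-probabilities match the hand-computed smoothed estimate
    assert np.allclose(clf.feature_log_prob_, smoothed_log_prob(alpha_val))
    probs = np.exp(clf.feature_log_prob_)
    # Gap between the most and least likely feature for class 0
    spread[alpha_val] = probs[0].max() - probs[0].min()
    print(f"alpha={alpha_val}: spread for class 0 = {spread[alpha_val]:.3f}")
```

With Tf-Idf inputs the "counts" are fractional weights, but the smoothing term enters the estimate in the same way.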

## Conclusion on the logistic regression models

At the top of the ranking is the logistic regression with an ngram_range of (1, 3) and a min_df of 3. The other models did not manage to outperform it.<br>
The lemmatized datasets, with advanced text standardization, give the best results for this type of model.<br>

![image.png](attachment:image.png)