This notebook measures the performance of a classic machine learning approach: texts are vectorized with Tf-Idf, then fed to a supervised classification model.
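
To make the Tf-Idf weighting concrete, here is a minimal hand-written sketch of the formula scikit-learn applies with its default settings (smoothed idf, L2 normalization, raw term counts). The toy corpus and the function names `smoothed_idf` and `tfidf_vector` are assumptions for illustration only; the notebook itself relies on `TfidfVectorizer`.

```python
import math


def smoothed_idf(n_docs, doc_freq):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1.0


def tfidf_vector(doc_tokens, vocabulary, doc_freqs, n_docs):
    """Return the L2-normalized Tf-Idf weights of one document."""
    raw = [
        doc_tokens.count(term) * smoothed_idf(n_docs, doc_freqs[term])
        for term in vocabulary
    ]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm if norm else 0.0 for value in raw]


# Toy corpus (hypothetical): "movie" appears everywhere, the adjectives once each
corpus = [["good", "movie"], ["bad", "movie"], ["great", "movie"]]
vocabulary = sorted({term for doc in corpus for term in doc})
doc_freqs = {term: sum(term in doc for doc in corpus) for term in vocabulary}

weights = dict(
    zip(vocabulary, tfidf_vector(corpus[0], vocabulary, doc_freqs, len(corpus)))
)
```

In this toy example the rare word "good" ends up with a larger weight than the ubiquitous word "movie", which is the behavior the vectorization step is meant to provide.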

# Environment setup

In [21]:
# from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
import pandas as pd
import numpy as np
import pickle
import warnings
import mlflow
import mlflow.sklearn
from tqdm import tqdm
from utils import split_data, filter_dataset
from ml import create_ml_model
from sklearn.model_selection import cross_validate
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
import time


In [2]:
SEED = 314
# Define the URI of the MLflow server and the paths to the data files
URI = "http://localhost:5000"
PATH_PARQUET = "../data/processed/df_preprocessed.parquet"
PATH_COLS = "../data/processed/columns.pkl"

In [3]:
# Remove FutureWarning alerts
warnings.filterwarnings("ignore", category=FutureWarning)

# Initialize tqdm for pandas
tqdm.pandas()

# Set a random seed
np.random.seed(SEED)
print("Random seed set to", SEED)

Random seed set to 314


# Loading the data and splitting it into train and test sets

In [4]:
# Load the pickle file containing the columns
with open(PATH_COLS, "rb") as f:
    cols = pickle.load(f)

# Load the parquet file
df = pd.read_parquet(
    PATH_PARQUET,
    engine="pyarrow",
    use_nullable_dtypes=False,
)

# Define the parameters for the split
proportion = 0.015  # approximately 1.5% of 1.6 million rows: about 24,000 rows
sampling = True
test_split = 0.2

# Split the data with the same SEED fixed
X_train_full, X_test_full, y_train, y_test = split_data(
    df,
    test_split=test_split,
    sampling=sampling,
    proportion=proportion,
)
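
The `split_data` helper comes from the local `utils` module and is not shown in this notebook. With `proportion = 0.015` and `test_split = 0.2`, about 24,000 rows are kept, roughly 19,200 for training and 4,800 for testing. The sketch below illustrates one way such a sample-then-split step could work. It assumes a stratified split, which the real helper may or may not perform, and the name `sample_and_split` is hypothetical.

```python
import random
from collections import defaultdict


def sample_and_split(labels, proportion, test_split, seed):
    """Sample a fraction of each class, then split it into train/test indices."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Keep only the requested proportion of this class
        kept = indices[: round(len(indices) * proportion)]
        # Reserve the requested share of the kept rows for the test set
        n_test = round(len(kept) * test_split)
        test_idx.extend(kept[:n_test])
        train_idx.extend(kept[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels (hypothetical): 1,000 rows per class
toy_labels = [0] * 1000 + [1] * 1000
train_idx, test_idx = sample_and_split(
    toy_labels, proportion=0.1, test_split=0.2, seed=314
)
```

Fixing the seed makes the sampled subset reproducible, so every run in the experiment is evaluated on the same rows.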

# Modeling

For the modeling work, I will use **a logistic regression model as the baseline** to evaluate performance on each of the prepared datasets.<br>
I will also test how the features built with Tf-Idf behave under different hyperparameters.<br>
Finally, once we have found the most suitable dataset, we can try different classification models to see whether performance can be improved.
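
The `create_ml_model` helper is imported from the local `ml` module and its code is not part of this notebook. Judging only from how it is called (a text column name, a vectorizer, a classifier), it could plausibly look like the sketch below: the vectorizer is applied to the text column, any other column is passed through unchanged, and the classifier is trained on the result. This is an assumption, not the actual implementation, and the name `create_ml_model_sketch` is hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


def create_ml_model_sketch(text_col, vectorizer, classifier):
    """Vectorize one text column, pass other columns through, then classify."""
    features = ColumnTransformer(
        transformers=[("tfidf", vectorizer, text_col)],
        remainder="passthrough",  # keeps extra numeric columns, if any
    )
    return Pipeline([("features", features), ("classifier", classifier)])


# Toy data (hypothetical) with one text column and one numeric column
toy = pd.DataFrame(
    {
        "lemma": ["good day", "bad day", "good mood", "bad mood"],
        "hour_sin": [0.0, 0.5, 1.0, 0.5],
    }
)
toy_target = [1, 0, 1, 0]

toy_model = create_ml_model_sketch(
    "lemma", TfidfVectorizer(), LogisticRegression(max_iter=1000)
)
toy_model.fit(toy, toy_target)
toy_predictions = toy_model.predict(toy)
```

Wrapping the vectorizer inside the pipeline matters for cross-validation: the vocabulary is then refitted on each training fold, so no information leaks from the validation fold.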


To record the model evaluations, I create an experiment that tracks the performance of models based **on Tf-Idf vectorization**:

In [5]:
# Define the name of the experiment and the columns to track
experiment = "ml_tfidf_vectorizer"
cols_tracked = cols[3:]

# Set the tracking URI
mlflow.set_tracking_uri(URI)
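# Try to reach the server: listing experiments forces an actual request,
# whereas get_tracking_uri() only returns the configured string
try:
    mlflow.search_experiments(max_results=1)
except Exception:
    print(f"Cannot connect to the server: {URI}. Check the server status.")
    raise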
# Set, and create if necessary, the experiment
try:
    mlflow.create_experiment(experiment)
except Exception:
    pass
finally:
    mlflow.set_experiment(experiment)

## Evaluating the datasets

### **Run 1:** default parameters, to evaluate the performance of logistic regression on the different datasets.
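
Each run in this section relies on `cross_validate`, which names its per-fold scores `test_<metric>` even though the folds are carved out of the training data; this is why the runs log them under `val_*` names. The sketch below shows the shape of the returned dictionary on synthetic data. The dataset produced by `make_classification` is only an assumption for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic binary classification data (hypothetical stand-in for the tweets)
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=314)

scores = cross_validate(
    LogisticRegression(max_iter=1000),
    X_toy,
    y_toy,
    cv=5,
    scoring=["accuracy", "roc_auc", "f1"],
)

# Average each per-fold score and rename it as a validation metric
val_metrics = {
    name.replace("test_", "val_"): values.mean()
    for name, values in scores.items()
    if name.startswith("test_")
}
```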

In [6]:
# Iterate through each preprocessed dataset
for col_name in cols_tracked:
    # Filter the dataset to keep only the column of interest
    X_train, X_test = filter_dataset(X_train_full, X_test_full, cols=[col_name])
    # Start the MLflow run & autolog
    mlflow.sklearn.autolog()
    with mlflow.start_run() as active_run:
        # Set the model
        model = create_ml_model(
            col_name,
            TfidfVectorizer(
                ngram_range=(1, 1),
                min_df=5,
                strip_accents="unicode",
                stop_words=None,
            ),
            LogisticRegression(max_iter=1000, random_state=SEED),
        )
        # Cross validate the model & log the validation scores
        val_scores = cross_validate(
            model,
            X_train,
            y_train,
            cv=5,
            scoring=["accuracy", "roc_auc", "f1"],
            n_jobs=-1,
        )

        # Fit the model on the training data
        model.fit(X_train, y_train)

        # Compute the inference time & log it
        start_time = time.time()
        y_pred = model.predict(X_test)
        inference_time = time.time() - start_time

        # Log the additional metrics & parameters
        mlflow.log_metrics(
            {
                "val_accuracy": val_scores["test_accuracy"].mean(),
                "val_roc_auc": val_scores["test_roc_auc"].mean(),
                "val_f1": val_scores["test_f1"].mean(),
                "inference_time": inference_time,
            }
        )
        mlflow.log_params(
            {
                "data_preparation": col_name,
                "test_size_ratio": test_split,
                "val_splits": len(val_scores["test_accuracy"]),
            }
        )

        # Evaluate the model logged in MLflow on the test set
        evaluation_data = pd.concat([X_test, y_test], axis=1).assign(predictions=y_pred)
        model_uri = f"runs:/{active_run.info.run_id}/model"
        mlflow.evaluate(
            model=model_uri,
            model_type="classifier",
            data=evaluation_data,
            targets="target",
            predictions="predictions",
            evaluators=None,
            evaluator_config={
                "log_model_explainability": False
            },  # Disable SHAP explanations
        )


### **Run 2:** second run with stop-word filtering enabled (`stop_words="english"`). We expect that filtering stop words will not necessarily be beneficial with Tf-Idf.
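
One plausible reason for this expectation, not verified in this notebook, is that scikit-learn's built-in English stop-word list contains negation words such as "not", so filtering can erase information that a classifier might need. The short sketch below compares the tokens produced with and without filtering; the example sentence is made up.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

# Negation words that the built-in English list would remove
negations = [word for word in ("not", "no", "never") if word in ENGLISH_STOP_WORDS]

# Analyzers reproduce the preprocessing + tokenization of each configuration
analyzer_keep = TfidfVectorizer(stop_words=None).build_analyzer()
analyzer_drop = TfidfVectorizer(stop_words="english").build_analyzer()

sentence = "this is not good"  # hypothetical example
tokens_keep = analyzer_keep(sentence)
tokens_drop = analyzer_drop(sentence)
```

With filtering enabled, only "good" survives from this sentence, so the negation is lost before the classifier ever sees it.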

In [7]:
# Iterate through each preprocessed dataset
for col_name in cols_tracked:
    # Filter the dataset to keep only the column of interest
    X_train, X_test = filter_dataset(X_train_full, X_test_full, cols=[col_name])
    # Start the MLflow run & autolog
    mlflow.sklearn.autolog()
    with mlflow.start_run() as active_run:
        # Set the model
        model = create_ml_model(
            col_name,
            TfidfVectorizer(
                ngram_range=(1, 1),
                min_df=5,
                strip_accents="unicode",
                stop_words="english",
            ),
            LogisticRegression(max_iter=1000, random_state=SEED),
        )
        # Cross validate the model & log the validation scores
        val_scores = cross_validate(
            model,
            X_train,
            y_train,
            cv=5,
            scoring=["accuracy", "roc_auc", "f1"],
            n_jobs=-1,
        )

        # Fit the model on the training data
        model.fit(X_train, y_train)

        # Compute the inference time & log it
        start_time = time.time()
        y_pred = model.predict(X_test)
        inference_time = time.time() - start_time

        # Log the additional metrics & parameters
        mlflow.log_metrics(
            {
                "val_accuracy": val_scores["test_accuracy"].mean(),
                "val_roc_auc": val_scores["test_roc_auc"].mean(),
                "val_f1": val_scores["test_f1"].mean(),
                "inference_time": inference_time,
            }
        )
        mlflow.log_params(
            {
                "data_preparation": col_name,
                "test_size_ratio": test_split,
                "val_splits": len(val_scores["test_accuracy"]),
            }
        )

        # Evaluate the model logged in MLflow on the test set
        evaluation_data = pd.concat([X_test, y_test], axis=1).assign(predictions=y_pred)
        model_uri = f"runs:/{active_run.info.run_id}/model"
        mlflow.evaluate(
            model=model_uri,
            model_type="classifier",
            data=evaluation_data,
            targets="target",
            predictions="predictions",
            evaluators=None,
            evaluator_config={
                "log_model_explainability": False
            },  # Disable SHAP explanations
        )

2025/01/05 17:26:12 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
2025/01/05 17:26:12 INFO mlflow.models.evaluation.default_evaluator: The evaluation dataset is inferred as binary dataset, positive label is 1, negative label is 0.
2025/01/05 17:26:12 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...
2025/01/05 17:26:13 INFO mlflow.tracking._tracking_service.client: 🏃 View run auspicious-rat-41 at: http://localhost:5000/#/experiments/3/runs/412ac735040641788ebd246f6367c841.
2025/01/05 17:26:13 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: http://localhost:5000/#/experiments/3.
2025/01/05 17:26:22 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
2025/01/05 17:26:22 INFO mlflow.models.evaluation.default_evaluator: The evaluation dataset is inferred as binary dataset, positive label is 1, negative label is 0.
2025/01/05 17:26:22 INFO mlflow.models.evaluation.default_evaluator:

### **Run 3:** removing the stop words hurts model performance. This time, let's test `ngram_range=(1, 2)` to see whether it improves performance.
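
To illustrate what `ngram_range` changes, here is a small pure-Python sketch that lists the word n-grams extracted from one sentence. It reuses the default token pattern of `TfidfVectorizer` but is only a simplified reimplementation for illustration; the function name `word_ngrams` and the sentence are hypothetical.

```python
import re

# Same default token pattern as TfidfVectorizer: words of 2+ characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def word_ngrams(text, ngram_range):
    """List every word n-gram of `text` whose size lies in `ngram_range`."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    smallest, largest = ngram_range
    grams = []
    for size in range(smallest, largest + 1):
        for start in range(len(tokens) - size + 1):
            grams.append(" ".join(tokens[start : start + size]))
    return grams


sentence = "not good at all"  # hypothetical example
unigrams = word_ngrams(sentence, (1, 1))
up_to_bigrams = word_ngrams(sentence, (1, 2))
up_to_trigrams = word_ngrams(sentence, (1, 3))
```

With bigrams, "not good" becomes a feature of its own, which lets a linear model treat it differently from "good" alone. The price is a much larger vocabulary.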

In [8]:
# Iterate through each preprocessed dataset
for col_name in cols_tracked:
    # Filter the dataset to keep only the column of interest
    X_train, X_test = filter_dataset(X_train_full, X_test_full, cols=[col_name])
    # Start the MLflow run & autolog
    mlflow.sklearn.autolog()
    with mlflow.start_run() as active_run:
        # Set the model
        model = create_ml_model(
            col_name,
            TfidfVectorizer(
                ngram_range=(1, 2),
                min_df=5,
                strip_accents="unicode",
                stop_words=None,
            ),
            LogisticRegression(max_iter=1000, random_state=SEED),
        )
        # Cross validate the model & log the validation scores
        val_scores = cross_validate(
            model,
            X_train,
            y_train,
            cv=5,
            scoring=["accuracy", "roc_auc", "f1"],
            n_jobs=-1,
        )

        # Fit the model on the training data
        model.fit(X_train, y_train)

        # Compute the inference time & log it
        start_time = time.time()
        y_pred = model.predict(X_test)
        inference_time = time.time() - start_time

        # Log the additional metrics & parameters
        mlflow.log_metrics(
            {
                "val_accuracy": val_scores["test_accuracy"].mean(),
                "val_roc_auc": val_scores["test_roc_auc"].mean(),
                "val_f1": val_scores["test_f1"].mean(),
                "inference_time": inference_time,
            }
        )
        mlflow.log_params(
            {
                "data_preparation": col_name,
                "test_size_ratio": test_split,
                "val_splits": len(val_scores["test_accuracy"]),
            }
        )

        # Evaluate the model logged in MLflow on the test set
        evaluation_data = pd.concat([X_test, y_test], axis=1).assign(predictions=y_pred)
        model_uri = f"runs:/{active_run.info.run_id}/model"
        mlflow.evaluate(
            model=model_uri,
            model_type="classifier",
            data=evaluation_data,
            targets="target",
            predictions="predictions",
            evaluators=None,
            evaluator_config={
                "log_model_explainability": False
            },  # Disable SHAP explanations
        )

2025/01/05 17:31:26 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
2025/01/05 17:31:26 INFO mlflow.models.evaluation.default_evaluator: The evaluation dataset is inferred as binary dataset, positive label is 1, negative label is 0.
2025/01/05 17:31:26 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...
2025/01/05 17:31:27 INFO mlflow.tracking._tracking_service.client: 🏃 View run enchanting-dog-371 at: http://localhost:5000/#/experiments/3/runs/a9abb2f966e64a60b6cb030e307cb328.
2025/01/05 17:31:27 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: http://localhost:5000/#/experiments/3.
2025/01/05 17:31:38 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
2025/01/05 17:31:38 INFO mlflow.models.evaluation.default_evaluator: The evaluation dataset is inferred as binary dataset, positive label is 1, negative label is 0.
2025/01/05 17:31:38 INFO mlflow.models.evaluation.default_evaluator

### **Run 4:** let's test `ngram_range=(1, 3)` to see whether it improves performance.

In [9]:
# Iterate through each preprocessed dataset
for col_name in cols_tracked:
    # Filter the dataset to keep only the column of interest
    X_train, X_test = filter_dataset(X_train_full, X_test_full, cols=[col_name])
    # Start the MLflow run & autolog
    mlflow.sklearn.autolog()
    with mlflow.start_run() as active_run:
        # Set the model
        model = create_ml_model(
            col_name,
            TfidfVectorizer(
                ngram_range=(1, 3),
                min_df=5,
                strip_accents="unicode",
                stop_words=None,
            ),
            LogisticRegression(max_iter=1000, random_state=SEED),
        )
        # Cross validate the model & log the validation scores
        val_scores = cross_validate(
            model,
            X_train,
            y_train,
            cv=5,
            scoring=["accuracy", "roc_auc", "f1"],
            n_jobs=-1,
        )

        # Fit the model on the training data
        model.fit(X_train, y_train)

        # Compute the inference time & log it
        start_time = time.time()
        y_pred = model.predict(X_test)
        inference_time = time.time() - start_time

        # Log the additional metrics & parameters
        mlflow.log_metrics(
            {
                "val_accuracy": val_scores["test_accuracy"].mean(),
                "val_roc_auc": val_scores["test_roc_auc"].mean(),
                "val_f1": val_scores["test_f1"].mean(),
                "inference_time": inference_time,
            }
        )
        mlflow.log_params(
            {
                "data_preparation": col_name,
                "test_size_ratio": test_split,
                "val_splits": len(val_scores["test_accuracy"]),
            }
        )

        # Evaluate the model logged in MLflow on the test set
        evaluation_data = pd.concat([X_test, y_test], axis=1).assign(predictions=y_pred)
        model_uri = f"runs:/{active_run.info.run_id}/model"
        mlflow.evaluate(
            model=model_uri,
            model_type="classifier",
            data=evaluation_data,
            targets="target",
            predictions="predictions",
            evaluators=None,
            evaluator_config={
                "log_model_explainability": False
            },  # Disable SHAP explanations
        )

2025/01/05 17:32:29 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
2025/01/05 17:32:29 INFO mlflow.models.evaluation.default_evaluator: The evaluation dataset is inferred as binary dataset, positive label is 1, negative label is 0.
2025/01/05 17:32:29 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...
2025/01/05 17:32:30 INFO mlflow.tracking._tracking_service.client: 🏃 View run stately-mare-942 at: http://localhost:5000/#/experiments/3/runs/18486472bb0f4ed0a0afb0bf08a600a0.
2025/01/05 17:32:30 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: http://localhost:5000/#/experiments/3.
2025/01/05 17:32:41 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
2025/01/05 17:32:41 INFO mlflow.models.evaluation.default_evaluator: The evaluation dataset is inferred as binary dataset, positive label is 1, negative label is 0.
2025/01/05 17:32:41 INFO mlflow.models.evaluation.default_evaluator: 

### **Run 5:** let's test `min_df=10` while keeping `ngram_range=(1, 3)` to see whether it improves performance.
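
When `min_df` is an integer, a term is kept only if it appears in at least that many documents. The pure-Python sketch below mimics this filter on a toy corpus; it is a simplified illustration rather than scikit-learn's implementation, and the names `document_frequencies` and `filter_vocabulary` are hypothetical.

```python
from collections import Counter


def document_frequencies(tokenized_docs):
    """Count, for each term, the number of documents that contain it."""
    counts = Counter()
    for tokens in tokenized_docs:
        counts.update(set(tokens))  # a term counts once per document
    return counts


def filter_vocabulary(tokenized_docs, min_df):
    """Keep the terms present in at least `min_df` documents."""
    doc_freqs = document_frequencies(tokenized_docs)
    return sorted(term for term, freq in doc_freqs.items() if freq >= min_df)


# Toy corpus (hypothetical)
docs = [["good", "movie"], ["good", "film"], ["bad", "movie"], ["awful", "plot"]]
vocab_all = filter_vocabulary(docs, min_df=1)
vocab_frequent = filter_vocabulary(docs, min_df=2)
```

Raising `min_df` shrinks the vocabulary quickly, and with `ngram_range=(1, 3)` most of the pruned terms are likely to be rare bigrams and trigrams.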

In [10]:
# Iterate through each preprocessed dataset
for col_name in cols_tracked:
    # Filter the dataset to keep only the column of interest
    X_train, X_test = filter_dataset(X_train_full, X_test_full, cols=[col_name])
    # Start the MLflow run & autolog
    mlflow.sklearn.autolog()
    with mlflow.start_run() as active_run:
        # Set the model
        model = create_ml_model(
            col_name,
            TfidfVectorizer(
                ngram_range=(1, 3),
                min_df=10,
                strip_accents="unicode",
                stop_words=None,
            ),
            LogisticRegression(max_iter=1000, random_state=SEED),
        )
        # Cross validate the model & log the validation scores
        val_scores = cross_validate(
            model,
            X_train,
            y_train,
            cv=5,
            scoring=["accuracy", "roc_auc", "f1"],
            n_jobs=-1,
        )

        # Fit the model on the training data
        model.fit(X_train, y_train)

        # Compute the inference time & log it
        start_time = time.time()
        y_pred = model.predict(X_test)
        inference_time = time.time() - start_time

        # Log the additional metrics & parameters
        mlflow.log_metrics(
            {
                "val_accuracy": val_scores["test_accuracy"].mean(),
                "val_roc_auc": val_scores["test_roc_auc"].mean(),
                "val_f1": val_scores["test_f1"].mean(),
                "inference_time": inference_time,
            }
        )
        mlflow.log_params(
            {
                "data_preparation": col_name,
                "test_size_ratio": test_split,
                "val_splits": len(val_scores["test_accuracy"]),
            }
        )

        # Evaluate the model logged in MLflow on the test set
        evaluation_data = pd.concat([X_test, y_test], axis=1).assign(predictions=y_pred)
        model_uri = f"runs:/{active_run.info.run_id}/model"
        mlflow.evaluate(
            model=model_uri,
            model_type="classifier",
            data=evaluation_data,
            targets="target",
            predictions="predictions",
            evaluators=None,
            evaluator_config={
                "log_model_explainability": False
            },  # Disable SHAP explanations
        )

2025/01/05 17:35:13 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
2025/01/05 17:35:13 INFO mlflow.models.evaluation.default_evaluator: The evaluation dataset is inferred as binary dataset, positive label is 1, negative label is 0.
2025/01/05 17:35:13 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...
2025/01/05 17:35:14 INFO mlflow.tracking._tracking_service.client: 🏃 View run gaudy-gull-751 at: http://localhost:5000/#/experiments/3/runs/3f98bf73a51a4f6ab4e5333dfc4efebe.
2025/01/05 17:35:14 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: http://localhost:5000/#/experiments/3.
2025/01/05 17:35:22 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
2025/01/05 17:35:22 INFO mlflow.models.evaluation.default_evaluator: The evaluation dataset is inferred as binary dataset, positive label is 1, negative label is 0.
2025/01/05 17:35:22 INFO mlflow.models.evaluation.default_evaluator: Te

### **Run 6:** let's test `min_df=3` while keeping `ngram_range=(1, 3)` to see whether it improves performance.

In [11]:
# Iterate through each preprocessed dataset
for col_name in cols_tracked:
    # Filter the dataset to keep only the column of interest
    X_train, X_test = filter_dataset(X_train_full, X_test_full, cols=[col_name])
    # Start the MLflow run & autolog
    mlflow.sklearn.autolog()
    with mlflow.start_run() as active_run:
        # Set the model
        model = create_ml_model(
            col_name,
            TfidfVectorizer(
                ngram_range=(1, 3),
                min_df=3,
                strip_accents="unicode",
                stop_words=None,
            ),
            LogisticRegression(max_iter=1000, random_state=SEED),
        )
        # Cross validate the model & log the validation scores
        val_scores = cross_validate(
            model,
            X_train,
            y_train,
            cv=5,
            scoring=["accuracy", "roc_auc", "f1"],
            n_jobs=-1,
        )

        # Fit the model on the training data
        model.fit(X_train, y_train)

        # Compute the inference time & log it
        start_time = time.time()
        y_pred = model.predict(X_test)
        inference_time = time.time() - start_time

        # Log the additional metrics & parameters
        mlflow.log_metrics(
            {
                "val_accuracy": val_scores["test_accuracy"].mean(),
                "val_roc_auc": val_scores["test_roc_auc"].mean(),
                "val_f1": val_scores["test_f1"].mean(),
                "inference_time": inference_time,
            }
        )
        mlflow.log_params(
            {
                "data_preparation": col_name,
                "test_size_ratio": test_split,
                "val_splits": len(val_scores["test_accuracy"]),
            }
        )

        # Evaluate the model logged in MLflow on the test set
        evaluation_data = pd.concat([X_test, y_test], axis=1).assign(predictions=y_pred)
        model_uri = f"runs:/{active_run.info.run_id}/model"
        mlflow.evaluate(
            model=model_uri,
            model_type="classifier",
            data=evaluation_data,
            targets="target",
            predictions="predictions",
            evaluators=None,
            evaluator_config={
                "log_model_explainability": False
            },  # Disable SHAP explanations
        )

2025/01/05 17:37:16 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
2025/01/05 17:37:17 INFO mlflow.models.evaluation.default_evaluator: The evaluation dataset is inferred as binary dataset, positive label is 1, negative label is 0.
2025/01/05 17:37:17 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...
2025/01/05 17:37:17 INFO mlflow.tracking._tracking_service.client: 🏃 View run aged-mole-891 at: http://localhost:5000/#/experiments/3/runs/8645cc87316b4175a9b3df8ebc999819.
2025/01/05 17:37:17 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: http://localhost:5000/#/experiments/3.
2025/01/05 17:37:28 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
2025/01/05 17:37:28 INFO mlflow.models.evaluation.default_evaluator: The evaluation dataset is inferred as binary dataset, positive label is 1, negative label is 0.
2025/01/05 17:37:28 INFO mlflow.models.evaluation.default_evaluator: Tes

### **Run 7:** let's now add the hour as a feature to see whether it improves performance.
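
The column names `hour_sin` and `hour_cos` suggest a cyclic encoding of the hour, although the preprocessing that created them is not part of this notebook. A minimal sketch of such an encoding, assuming hours on a 24-hour clock and using the hypothetical names `encode_hour` and `hour_distance`:

```python
import math


def encode_hour(hour):
    """Map an hour (0-23) to a point on the unit circle."""
    angle = 2.0 * math.pi * (hour % 24) / 24.0
    return math.sin(angle), math.cos(angle)


def hour_distance(first, second):
    """Euclidean distance between two encoded hours."""
    return math.dist(encode_hour(first), encode_hour(second))


midnight = encode_hour(0)
late_to_midnight = hour_distance(23, 0)
midnight_to_one = hour_distance(0, 1)
```

The point of the two columns is that 23:00 and 00:00 end up as close to each other as 00:00 and 01:00, which a raw integer hour would not capture.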

In [14]:
# Iterate through each preprocessed dataset
for col_name in cols_tracked:
    # Filter the dataset to keep only the column of interest
    X_train, X_test = filter_dataset(
        X_train_full, X_test_full, cols=["hour_sin", "hour_cos", col_name]
    )
    # Start the MLflow run & autolog
    mlflow.sklearn.autolog()
    with mlflow.start_run() as active_run:
        # Set the model
        model = create_ml_model(
            col_name,
            TfidfVectorizer(
                ngram_range=(1, 3),
                min_df=3,
                strip_accents="unicode",
                stop_words=None,
            ),
            LogisticRegression(max_iter=1000, random_state=SEED),
        )
        # Cross validate the model & log the validation scores
        val_scores = cross_validate(
            model,
            X_train,
            y_train,
            cv=5,
            scoring=["accuracy", "roc_auc", "f1"],
            n_jobs=-1,
        )

        # Fit the model on the training data
        model.fit(X_train, y_train)

        # Compute the inference time & log it
        start_time = time.time()
        y_pred = model.predict(X_test)
        inference_time = time.time() - start_time

        # Log the additional metrics & parameters
        mlflow.log_metrics(
            {
                "val_accuracy": val_scores["test_accuracy"].mean(),
                "val_roc_auc": val_scores["test_roc_auc"].mean(),
                "val_f1": val_scores["test_f1"].mean(),
                "inference_time": inference_time,
            }
        )
        mlflow.log_params(
            {
                "data_preparation": col_name,
                "test_size_ratio": test_split,
                "val_splits": len(val_scores["test_accuracy"]),
            }
        )

        # Evaluate the model logged in MLflow on the test set
        evaluation_data = pd.concat([X_test, y_test], axis=1).assign(predictions=y_pred)
        model_uri = f"runs:/{active_run.info.run_id}/model"
        mlflow.evaluate(
            model=model_uri,
            model_type="classifier",
            data=evaluation_data,
            targets="target",
            predictions="predictions",
            evaluators=None,
            evaluator_config={
                "log_model_explainability": False
            },  # Disable SHAP explanations
        )

2025/01/05 17:41:10 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
2025/01/05 17:41:11 INFO mlflow.models.evaluation.default_evaluator: The evaluation dataset is inferred as binary dataset, positive label is 1, negative label is 0.
2025/01/05 17:41:11 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...
2025/01/05 17:41:12 INFO mlflow.tracking._tracking_service.client: 🏃 View run handsome-snake-932 at: http://localhost:5000/#/experiments/3/runs/8d88c35ef78d4f559125d16917fe85ac.
2025/01/05 17:41:12 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: http://localhost:5000/#/experiments/3.
2025/01/05 17:41:23 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
2025/01/05 17:41:23 INFO mlflow.models.evaluation.default_evaluator: The evaluation dataset is inferred as binary dataset, positive label is 1, negative label is 0.
2025/01/05 17:41:23 INFO mlflow.models.evaluation.default_evaluator

### **Run 8:** adding the hour does not improve performance. Let's test different regularization strengths, both stronger and weaker than the default `C=1`, to see whether this improves the bias-variance trade-off.
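
In `LogisticRegression`, `C` is the inverse of the regularization strength: a small `C` penalizes large coefficients more heavily. The sketch below shows this effect on synthetic data by comparing the norm of the learned coefficients; the dataset from `make_classification` is an assumption for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (hypothetical stand-in for the tweets)
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)

coef_norms = {}
for c_val in (0.01, 1.0, 100.0):
    clf = LogisticRegression(C=c_val, max_iter=1000, random_state=0)
    clf.fit(X_toy, y_toy)
    # L2 norm of the coefficient vector for this regularization strength
    coef_norms[c_val] = float(np.linalg.norm(clf.coef_))
```

A smaller norm means a simpler, more biased model; a larger norm means a more flexible model that is more exposed to variance, hence the trade-off explored in this run.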

In [16]:
# Iterate through each selected dataset
for col_name in ("lemma", "lemma_nomention"):
    # Filter the dataset to keep only the column of interest
    X_train, X_test = filter_dataset(X_train_full, X_test_full, cols=[col_name])
    # Test C values for the Logistic Regression
    for c_val in (0.01, 0.1, 10, 100):
        # Start the MLflow run & autolog
        mlflow.sklearn.autolog()
        with mlflow.start_run() as active_run:
            # Set the model
            model = create_ml_model(
                col_name,
                TfidfVectorizer(
                    ngram_range=(1, 3),
                    min_df=3,
                    strip_accents="unicode",
                    stop_words=None,
                ),
                LogisticRegression(max_iter=1000, C=c_val, random_state=SEED),
            )
            # Cross validate the model & log the validation scores
            val_scores = cross_validate(
                model,
                X_train,
                y_train,
                cv=5,
                scoring=["accuracy", "roc_auc", "f1"],
                n_jobs=-1,
            )

            # Fit the model on the training data
            model.fit(X_train, y_train)

            # Compute the inference time & log it
            start_time = time.time()
            y_pred = model.predict(X_test)
            inference_time = time.time() - start_time

            # Log the additional metrics & parameters
            mlflow.log_metrics(
                {
                    "val_accuracy": val_scores["test_accuracy"].mean(),
                    "val_roc_auc": val_scores["test_roc_auc"].mean(),
                    "val_f1": val_scores["test_f1"].mean(),
                    "inference_time": inference_time,
                }
            )
            mlflow.log_params(
                {
                    "data_preparation": col_name,
                    "test_size_ratio": test_split,
                    "val_splits": len(val_scores["test_accuracy"]),
                }
            )

            # Evaluate the model logged in MLflow on the test set
            evaluation_data = pd.concat([X_test, y_test], axis=1).assign(
                predictions=y_pred
            )
            model_uri = f"runs:/{active_run.info.run_id}/model"
            mlflow.evaluate(
                model=model_uri,
                model_type="classifier",
                data=evaluation_data,
                targets="target",
                predictions="predictions",
                evaluators=None,
                evaluator_config={
                    "log_model_explainability": False
                },  # Disable SHAP explanations
            )

2025/01/05 17:51:43 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
2025/01/05 17:51:43 INFO mlflow.models.evaluation.default_evaluator: The evaluation dataset is inferred as binary dataset, positive label is 1, negative label is 0.
2025/01/05 17:51:43 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...
2025/01/05 17:51:44 INFO mlflow.tracking._tracking_service.client: 🏃 View run unleashed-cub-674 at: http://localhost:5000/#/experiments/3/runs/ee370e5191294a5abe4db121d461c43c.
2025/01/05 17:51:44 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: http://localhost:5000/#/experiments/3.
2025/01/05 17:51:55 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
2025/01/05 17:51:55 INFO mlflow.models.evaluation.default_evaluator: The evaluation dataset is inferred as binary dataset, positive label is 1, negative label is 0.
2025/01/05 17:51:55 INFO mlflow.models.evaluation.default_evaluator:

### **Run 9:** let's now test a MultinomialNB model to see whether it improves performance.
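
The `alpha` hyperparameter explored in this run is the additive (Lidstone) smoothing term of `MultinomialNB`: each feature likelihood is estimated as `(count + alpha) / (total + alpha * n_features)`. The hand-written sketch below applies that formula to made-up counts for a single class; the function name `smoothed_likelihoods` is hypothetical.

```python
def smoothed_likelihoods(feature_counts, alpha):
    """Estimate P(term | class) with additive smoothing."""
    total = sum(feature_counts.values())
    n_features = len(feature_counts)
    return {
        term: (count + alpha) / (total + alpha * n_features)
        for term, count in feature_counts.items()
    }


# Made-up per-class counts; with Tf-Idf these would be summed weights instead
counts = {"good": 8, "bad": 0, "movie": 2}
weak_smoothing = smoothed_likelihoods(counts, alpha=0.1)
strong_smoothing = smoothed_likelihoods(counts, alpha=10.0)
```

A small `alpha` keeps the estimates close to the observed frequencies, while a large `alpha` pulls every term toward the same probability, which acts as a form of regularization.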

In [20]:
# Iterate through each selected dataset
for col_name in ("lemma", "lemma_nomention"):
    # Filter the dataset to keep only the column of interest
    X_train, X_test = filter_dataset(X_train_full, X_test_full, cols=[col_name])
    for alpha_val in (0.1, 0.5, 1.0, 2.0, 5.0, 10.0):
        # Start the MLflow run & autolog
        mlflow.sklearn.autolog()
        with mlflow.start_run() as active_run:
            # Set the model
            model = create_ml_model(
                col_name,
                TfidfVectorizer(
                    ngram_range=(1, 3),
                    min_df=3,
                    strip_accents="unicode",
                    stop_words=None,
                ),
                MultinomialNB(alpha=alpha_val),
            )
            # Cross validate the model & log the validation scores
            val_scores = cross_validate(
                model,
                X_train,
                y_train,
                cv=5,
                scoring=["accuracy", "roc_auc", "f1"],
                n_jobs=-1,
            )

            # Fit the model on the training data
            model.fit(X_train, y_train)

            # Compute the inference time & log it
            start_time = time.time()
            y_pred = model.predict(X_test)
            inference_time = time.time() - start_time

            # Log the additional metrics & parameters
            mlflow.log_metrics(
                {
                    "val_accuracy": val_scores["test_accuracy"].mean(),
                    "val_roc_auc": val_scores["test_roc_auc"].mean(),
                    "val_f1": val_scores["test_f1"].mean(),
                    "inference_time": inference_time,
                }
            )
            mlflow.log_params(
                {
                    "data_preparation": col_name,
                    "test_size_ratio": test_split,
                    "val_splits": len(val_scores["test_accuracy"]),
                }
            )

            # Evaluate the model logged in MLflow on the test set
            evaluation_data = pd.concat([X_test, y_test], axis=1).assign(
                predictions=y_pred
            )
            model_uri = f"runs:/{active_run.info.run_id}/model"
            mlflow.evaluate(
                model=model_uri,
                model_type="classifier",
                data=evaluation_data,
                targets="target",
                predictions="predictions",
                evaluators=None,
                evaluator_config={
                    "log_model_explainability": False
                },  # Disable SHAP explanations
            )

2025/01/05 17:57:53 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
2025/01/05 17:57:53 INFO mlflow.models.evaluation.default_evaluator: The evaluation dataset is inferred as binary dataset, positive label is 1, negative label is 0.
2025/01/05 17:57:53 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...
2025/01/05 17:57:54 INFO mlflow.tracking._tracking_service.client: 🏃 View run auspicious-auk-30 at: http://localhost:5000/#/experiments/3/runs/007f5aafa7f84c529b674216a3cb6838.
2025/01/05 17:57:54 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: http://localhost:5000/#/experiments/3.
2025/01/05 17:58:05 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
2025/01/05 17:58:05 INFO mlflow.models.evaluation.default_evaluator: The evaluation dataset is inferred as binary dataset, positive label is 1, negative label is 0.
2025/01/05 17:58:05 INFO mlflow.models.evaluation.default_evaluator:
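
The `alpha` values swept in Run 9 control additive (Lidstone) smoothing in `MultinomialNB`. As a rough illustration of what that parameter does, here is a minimal pure-Python sketch of the smoothed per-class likelihood, `(count + alpha) / (total + alpha * vocabulary_size)`. The function name and the toy tokens are hypothetical and not part of this notebook; scikit-learn applies the same idea to the Tf-Idf feature sums rather than to raw integer counts.

```python
from collections import Counter


def smoothed_likelihoods(tokens, vocabulary, alpha):
    """Return P(word | class) with additive (Lidstone) smoothing.

    tokens: all tokens observed in the documents of one class
            (assumed to belong to `vocabulary`).
    vocabulary: every word known to the model.
    alpha: smoothing strength, playing the role of MultinomialNB's `alpha`.
    """
    counts = Counter(tokens)
    denominator = len(tokens) + alpha * len(vocabulary)
    return {word: (counts[word] + alpha) / denominator for word in vocabulary}


# Hypothetical toy data, not taken from the notebook's dataset
vocabulary = ["good", "bad", "movie"]
positive_tokens = ["good", "good", "movie"]

for alpha in (0.1, 1.0, 10.0):
    probs = smoothed_likelihoods(positive_tokens, vocabulary, alpha)
    print(alpha, {word: round(p, 3) for word, p in probs.items()})
```

A small `alpha` keeps the estimates close to the observed frequencies, while a large `alpha` pulls every word toward a uniform probability, which is why the sweep covers both ends of the range.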

### **Run 10**: Let's now test a linear SVC model.

In [None]:
# Iterate through each preprocessed dataset
for col_name in ("lemma", "lemma_nomention"):
    # Filter the dataset to keep only the column of interest
    X_train, X_test = filter_dataset(X_train_full, X_test_full, cols=[col_name])
    for c_val in (0.05, 0.1, 0.5, 1.0, 10.0):
        # Start the MLflow run & autolog
        mlflow.sklearn.autolog()
        with mlflow.start_run() as active_run:
            # Set the model
            model = create_ml_model(
                col_name,
                TfidfVectorizer(
                    ngram_range=(1, 3),
                    min_df=3,
                    strip_accents="unicode",
                    stop_words=None,
                ),
                SVC(kernel="linear", C=c_val, random_state=SEED),
            )
            # Cross validate the model & log the validation scores
            val_scores = cross_validate(
                model,
                X_train,
                y_train,
                cv=5,
                scoring=["accuracy", "roc_auc", "f1"],
                n_jobs=-1,
            )

            # Fit the model on the training data
            model.fit(X_train, y_train)

            # Compute the inference time & log it
            start_time = time.time()
            y_pred = model.predict(X_test)
            inference_time = time.time() - start_time

            # Log the additional metrics & parameters
            mlflow.log_metrics(
                {
                    "val_accuracy": val_scores["test_accuracy"].mean(),
                    "val_roc_auc": val_scores["test_roc_auc"].mean(),
                    "val_f1": val_scores["test_f1"].mean(),
                    "inference_time": inference_time,
                }
            )
            mlflow.log_params(
                {
                    "data_preparation": col_name,
                    "test_size_ratio": test_split,
                    "val_splits": len(val_scores["test_accuracy"]),
                }
            )

            # Evaluate the model logged in MLflow on the test set
            evaluation_data = pd.concat([X_test, y_test], axis=1).assign(
                predictions=y_pred
            )
            model_uri = f"runs:/{active_run.info.run_id}/model"
            mlflow.evaluate(
                model=model_uri,
                model_type="classifier",
                data=evaluation_data,
                targets="target",
                predictions="predictions",
                evaluators=None,
                evaluator_config={
                    "log_model_explainability": False
                },  # Disable SHAP explanations
            )

2025/01/05 18:05:17 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
2025/01/05 18:05:29 INFO mlflow.models.evaluation.default_evaluator: The evaluation dataset is inferred as binary dataset, positive label is 1, negative label is 0.
2025/01/05 18:05:35 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...
2025/01/05 18:05:35 INFO mlflow.tracking._tracking_service.client: 🏃 View run bittersweet-swan-720 at: http://localhost:5000/#/experiments/3/runs/a4b18b182df849b99cbd4e065f401ac6.
2025/01/05 18:05:35 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: http://localhost:5000/#/experiments/3.
2025/01/05 18:08:19 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
2025/01/05 18:08:30 INFO mlflow.models.evaluation.default_evaluator: The evaluation dataset is inferred as binary dataset, positive label is 1, negative label is 0.
2025/01/05 18:08:36 INFO mlflow.models.evaluation.default_evaluat
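
The runs above measure `inference_time` from a single `model.predict` call timed with `time.time()`. A single wall-clock measurement can be noisy, so one possible refinement is sketched below: it uses `time.perf_counter()`, a monotonic high-resolution clock, and keeps the best of several repeats. The helper `timed_call` and the stand-in `fake_predict` are hypothetical names introduced only for this illustration.

```python
import time


def timed_call(func, *args, repeats=5):
    """Call func(*args) `repeats` times; return (last_result, best_seconds)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(*args)
        best = min(best, time.perf_counter() - start)
    return result, best


# Stand-in for model.predict, used here only so the sketch runs on its own
def fake_predict(rows):
    return [len(row) % 2 for row in rows]


y_pred, inference_time = timed_call(fake_predict, ["some text", "more text"])
print(y_pred, f"{inference_time:.6f}s")
```

In the notebook's loop, the equivalent call would presumably be `timed_call(model.predict, X_test)`, with the returned duration passed to `mlflow.log_metrics` as before. Repeating the prediction multiplies its cost, which matters for the slower SVC runs.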

## Conclusion

Logistic regression with `ngram_range=(1, 3)` and `min_df=3` tops the ranking. None of the other models managed to outperform it.<br>
The `lemma` and `lemma_nomention` datasets reach equivalent performance on both the validation and test sets.<br>
The untransformed `text` dataset also performs well with `ngram_range=(1, 3)` and `min_df=3`, which makes it worth considering if performance has to be balanced against preprocessing effort.<br>
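
The ranking mentioned above comes from comparing the metrics logged for each run. As a minimal sketch of that comparison, the function below sorts run records by one metric, best first. The record layout, the run names and the scores are made-up placeholders, not results from this notebook, and `val_roc_auc` is only assumed to be the ranking metric.

```python
def rank_runs(runs, metric, top=3):
    """Sort run records by a metric, best first, ignoring runs that lack it."""
    scored = [run for run in runs if metric in run["metrics"]]
    ordered = sorted(scored, key=lambda run: run["metrics"][metric], reverse=True)
    return ordered[:top]


# Placeholder records with made-up scores, only to exercise the function
runs = [
    {"name": "run_a", "metrics": {"val_roc_auc": 0.2}},
    {"name": "run_b", "metrics": {"val_roc_auc": 0.3}},
    {"name": "run_c", "metrics": {"val_f1": 0.1}},
]

for run in rank_runs(runs, "val_roc_auc"):
    print(run["name"], run["metrics"]["val_roc_auc"])
```

Against the tracking server itself, `mlflow.search_runs` with an `order_by=["metrics.val_roc_auc DESC"]` argument should return a comparable ordering as a pandas DataFrame, assuming the experiment is reachable at the configured URI.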