# <a id='toc1_'></a>[Project 7: Perform sentiment analysis with Deep Learning](#toc0_)
# <a id='toc2_'></a>[Simple custom model](#toc0_)

[OpenClassrooms link](https://openclassrooms.com/fr/paths/795/projects/1516/1578-mission)

---

**Table of contents**<a id='toc0_'></a>    
- [Project 7: Perform sentiment analysis with Deep Learning](#toc1_)    
- [Simple custom model](#toc2_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

---
---

## <a id='toc2_1_'></a>[Imports](#toc0_)

In [1]:
import pandas as pd
import numpy as np
import mlflow
import mlflow.sklearn  # Needed for autologging or specific sklearn logging
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    classification_report,
    confusion_matrix,
)
from sklearn.dummy import DummyClassifier
import joblib  # For saving the vectorizer
import os
import warnings

---
---

## <a id='toc2_2_'></a>[Loading the data](#toc0_)

In [2]:
TRAIN_DATA_PATH = "./train_data.csv"
VAL_DATA_PATH = "./validation_data.csv"
TEST_DATA_PATH = "./test_data.csv"

train_df = pd.read_csv(TRAIN_DATA_PATH)
val_df = pd.read_csv(VAL_DATA_PATH)
test_df = pd.read_csv(TEST_DATA_PATH)

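# Replace NaN values in 'cleaned_text' (e.g. tweets left empty by preprocessing).
# Assigning the result back avoids the chained in-place call, which pandas 3.0 will no longer apply.
train_df["cleaned_text"] = train_df["cleaned_text"].fillna("")
val_df["cleaned_text"] = val_df["cleaned_text"].fillna("")
test_df["cleaned_text"] = test_df["cleaned_text"].fillna("")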


X_train = train_df["cleaned_text"]
y_train = train_df["sentiment"]
X_val = val_df["cleaned_text"]
y_val = val_df["sentiment"]
X_test = test_df["cleaned_text"]
y_test = test_df["sentiment"]

print("Data loaded successfully:")
print(f"Train samples: {len(X_train)}")
print(f"Validation samples: {len(X_val)}")
print(f"Test samples: {len(X_test)}")

Data loaded successfully:
Train samples: 1113546
Validation samples: 238617
Test samples: 238618
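
The sample counts printed above imply a particular split ratio. The short check below recomputes each split's share of the total; the variable names are illustrative and only the three counts come from the output.

```python
# Sample counts copied from the output above.
split_sizes = {"train": 1113546, "validation": 238617, "test": 238618}

total = sum(split_sizes.values())
split_shares = {name: size / total for name, size in split_sizes.items()}

for name, share in split_shares.items():
    print(f"{name}: {share:.1%}")
```

This points to a roughly 70/15/15 split between training, validation, and test data.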



---
---

## MLflow setup

In [3]:
EXPERIMENT_NAME = "Tweet Sentiment Analysis - Simple Models"
mlflow.set_experiment(EXPERIMENT_NAME)

# Get the current experiment details (optional)
try:
    experiment = mlflow.get_experiment_by_name(EXPERIMENT_NAME)
    if experiment is None:
        experiment_id = mlflow.create_experiment(EXPERIMENT_NAME)
        print(f"Created new experiment with ID: {experiment_id}")
    else:
        experiment_id = experiment.experiment_id
        print(f"Using existing experiment '{EXPERIMENT_NAME}' with ID: {experiment_id}")
except Exception as e:
    print(f"Error setting up MLflow experiment: {e}")
    raise

VECTORIZER_FILENAME = "tfidf_vectorizer_simple.joblib"

Using existing experiment 'Tweet Sentiment Analysis - Simple Models' with ID: 897408299388468996


---
---

## Feature extraction, model training, and MLflow logging

In [4]:
run_name = "LogisticRegression_TFIDF"
print(f"\nStarting MLflow Run: {run_name}")

with mlflow.start_run(run_name=run_name) as run:
    run_id = run.info.run_id
    print(f"MLflow Run ID: {run_id}")

    # --- Parameters ---
    # TF-IDF Parameters
    tfidf_max_features = 10000
    tfidf_ngram_range = (1, 3)  # Use unigrams, bigrams, and trigrams

    # Logistic Regression Parameters
    lr_C = 0.5  # Inverse of regularization strength
    lr_solver = "saga"  # Scales well to large datasets such as this one
    lr_max_iter = 500
    lr_class_weight = "balanced"  # Useful for imbalanced datasets

    # Log parameters
    print("Logging parameters...")
    mlflow.log_param("vectorizer_type", "TF-IDF")
    mlflow.log_param("tfidf_max_features", tfidf_max_features)
    mlflow.log_param("tfidf_ngram_range", str(tfidf_ngram_range))  # Log tuple as string
    mlflow.log_param("model_type", "LogisticRegression")
    mlflow.log_param("lr_C", lr_C)
    mlflow.log_param("lr_solver", lr_solver)
    mlflow.log_param("lr_max_iter", lr_max_iter)
    mlflow.log_param("lr_class_weight", lr_class_weight)

    # --- Feature Extraction (TF-IDF) ---
    print("Fitting TF-IDF Vectorizer...")
    vectorizer = TfidfVectorizer(
        max_features=tfidf_max_features, ngram_range=tfidf_ngram_range
    )
    X_train_tfidf = vectorizer.fit_transform(X_train)
    print(f"TF-IDF - Training data transformed shape: {X_train_tfidf.shape}")

    # Transform validation and test sets
    X_val_tfidf = vectorizer.transform(X_val)
    X_test_tfidf = vectorizer.transform(X_test)
    print(f"TF-IDF - Validation data transformed shape: {X_val_tfidf.shape}")
    print(f"TF-IDF - Test data transformed shape: {X_test_tfidf.shape}")

    # Save the fitted vectorizer locally first
    joblib.dump(vectorizer, VECTORIZER_FILENAME)
    print(f"Vectorizer saved locally to {VECTORIZER_FILENAME}")

    # Log the vectorizer as an artifact
    mlflow.log_artifact(VECTORIZER_FILENAME, artifact_path="vectorizer")
    print("Vectorizer logged as MLflow artifact.")

    # Clean up local file after logging (optional)
    if os.path.exists(VECTORIZER_FILENAME):
        os.remove(VECTORIZER_FILENAME)

    # --- Model Training ---
    print("Training Logistic Regression model...")
    model = LogisticRegression(
        C=lr_C,
        solver=lr_solver,
        max_iter=lr_max_iter,
        class_weight=lr_class_weight,
        random_state=42,  # for reproducibility
    )
    model.fit(X_train_tfidf, y_train)
    print("Model training complete.")

    # --- Evaluation on Validation Set ---
    print("Evaluating on validation set...")
    y_val_pred = model.predict(X_val_tfidf)
    y_val_pred_proba = model.predict_proba(X_val_tfidf)[
        :, 1
    ]  # Probability of positive class

    val_accuracy = accuracy_score(y_val, y_val_pred)
    val_precision = precision_score(y_val, y_val_pred, zero_division=0)
    val_recall = recall_score(y_val, y_val_pred, zero_division=0)
    val_f1 = f1_score(y_val, y_val_pred, zero_division=0)

    # Log validation metrics
    print("Logging validation metrics...")
    mlflow.log_metric("val_accuracy", val_accuracy)
    mlflow.log_metric("val_precision", val_precision)
    mlflow.log_metric("val_recall", val_recall)
    mlflow.log_metric("val_f1", val_f1)

    print(f"Validation Accuracy: {val_accuracy:.4f}")
    print(f"Validation Precision: {val_precision:.4f}")
    print(f"Validation Recall: {val_recall:.4f}")
    print(f"Validation F1-Score: {val_f1:.4f}")

    # --- Log the Model ---
    print("Logging the trained model...")
    mlflow.sklearn.log_model(model, artifact_path="logistic-regression-model")
    print("Model logged successfully.")

    val_report = classification_report(y_val, y_val_pred, output_dict=True)
    mlflow.log_dict(val_report, "validation_classification_report.json")


print(f"\nMLflow Run {run_id} finished.")


Starting MLflow Run: LogisticRegression_TFIDF
MLflow Run ID: e07074f03a6d488cb53c11f435aa0f68
Logging parameters...
Fitting TF-IDF Vectorizer...
TF-IDF - Training data transformed shape: (1113546, 10000)
TF-IDF - Validation data transformed shape: (238617, 10000)
TF-IDF - Test data transformed shape: (238618, 10000)
Vectorizer saved locally to tfidf_vectorizer_simple.joblib
Vectorizer logged as MLflow artifact.
Training Logistic Regression model...
Model training complete.
Evaluating on validation set...
Logging validation metrics...
Validation Accuracy: 0.7750
Validation Precision: 0.7633
Validation Recall: 0.7971
Validation F1-Score: 0.7798
Logging the trained model...




Model logged successfully.

MLflow Run e07074f03a6d488cb53c11f435aa0f68 finished.
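
To make the `ngram_range=(1, 3)` setting concrete, here is a minimal pure-Python sketch of the word n-grams it produces for one tweet. The helper `word_ngrams` is hypothetical and splits on whitespace, which is a simplification: `TfidfVectorizer` uses its own token pattern (by default it also drops single-character tokens) and then keeps only the `max_features` most frequent terms across the corpus.

```python
def word_ngrams(text, ngram_range=(1, 3)):
    """Return space-joined word n-grams for every n in ngram_range (inclusive).

    Illustrative helper, not part of scikit-learn; whitespace tokenisation
    is an assumption made to keep the sketch short.
    """
    tokens = text.split()
    min_n, max_n = ngram_range
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens over the text.
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams


example_ngrams = word_ngrams("not very good day")
print(example_ngrams)
```

For this four-word example the sketch yields 4 unigrams, 3 bigrams, and 2 trigrams. Phrases such as "not very good" become features of their own, which is one reason n-grams can help with negation in sentiment analysis.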


---
---

## Model evaluation

---

### Evaluation on the test data

In [5]:
logged_model_uri = f"runs:/{run_id}/logistic-regression-model"


# Load the model logged in the previous run
try:
    loaded_model = mlflow.sklearn.load_model(logged_model_uri)
    print(f"Model loaded successfully from: {logged_model_uri}")

    # Make predictions on the test set (using the same TF-IDF transformation)
    y_test_pred = loaded_model.predict(X_test_tfidf)
    y_test_pred_proba = loaded_model.predict_proba(X_test_tfidf)[
        :, 1
    ]  # Probability of positive class

    # Calculate test metrics
    test_accuracy = accuracy_score(y_test, y_test_pred)
    test_precision = precision_score(y_test, y_test_pred, zero_division=0)
    test_recall = recall_score(y_test, y_test_pred, zero_division=0)
    test_f1 = f1_score(y_test, y_test_pred, zero_division=0)

    print("\nTest Set Performance:")
    print(f"Test Accuracy: {test_accuracy:.4f}")
    print(f"Test Precision: {test_precision:.4f}")
    print(f"Test Recall: {test_recall:.4f}")
    print(f"Test F1-Score: {test_f1:.4f}")

    print("\nClassification Report (Test Set):")
    print(classification_report(y_test, y_test_pred))

    print("\nConfusion Matrix (Test Set):")
    print(confusion_matrix(y_test, y_test_pred))

    # --- Optional: Log test metrics back to the same MLflow run ---
    # This is often done to have all metrics associated with a single run.
    # You need the client to log to an existing run *after* it has finished.
    client = mlflow.tracking.MlflowClient()
    client.log_metric(run_id, "test_accuracy", test_accuracy)
    client.log_metric(run_id, "test_precision", test_precision)
    client.log_metric(run_id, "test_recall", test_recall)
    client.log_metric(run_id, "test_f1", test_f1)
    print("\nTest metrics logged back to the MLflow run.")

    # Log test classification report as well
    test_report_dict = classification_report(y_test, y_test_pred, output_dict=True)
    client.log_dict(run_id, test_report_dict, "test_classification_report.json")


except Exception as e:
    print(f"Error loading model or evaluating on test set: {e}")
    print("Ensure the run ID is correct and the model was logged properly.")
    raise  # Re-raise so later cells do not run with missing test metrics

Model loaded successfully from: runs:/e07074f03a6d488cb53c11f435aa0f68/logistic-regression-model

Test Set Performance:
Test Accuracy: 0.7744
Test Precision: 0.7625
Test Recall: 0.7969
Test F1-Score: 0.7793

Classification Report (Test Set):
              precision    recall  f1-score   support

           0       0.79      0.75      0.77    119351
           1       0.76      0.80      0.78    119267

    accuracy                           0.77    238618
   macro avg       0.77      0.77      0.77    238618
weighted avg       0.77      0.77      0.77    238618


Confusion Matrix (Test Set):
[[89747 29604]
 [24222 95045]]

Test metrics logged back to the MLflow run.
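
The four test metrics can be recomputed by hand from the confusion matrix printed above. This serves as a sanity check on how the numbers relate. The sketch assumes scikit-learn's layout for a binary problem, where rows are true labels and columns are predicted labels, and treats class 1 as the positive class.

```python
# Confusion matrix from the output above: [[TN, FP], [FN, TP]].
tn, fp = 89747, 29604
fn, tp = 24222, 95045

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # Share of predicted positives that are correct
recall = tp / (tp + fn)     # Share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1-score:  {f1:.4f}")
```

The values match the reported test metrics (0.7744, 0.7625, 0.7969, 0.7793). The model produces more false positives (29604) than false negatives (24222), which is why recall is higher than precision.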


---

### Comparison with a naive model

In [6]:
dummy_clf = DummyClassifier(strategy="most_frequent")
dummy_clf.fit(X_train_tfidf, y_train)  # Needs to be fitted, though strategy is simple

y_test_pred_dummy = dummy_clf.predict(X_test_tfidf)

# Calculate metrics for the baseline
dummy_accuracy = accuracy_score(y_test, y_test_pred_dummy)
dummy_precision = precision_score(y_test, y_test_pred_dummy, zero_division=0)
dummy_recall = recall_score(y_test, y_test_pred_dummy, zero_division=0)
dummy_f1 = f1_score(y_test, y_test_pred_dummy, zero_division=0)

print("Naive Baseline Performance (Test Set):")
print(f"Baseline Accuracy: {dummy_accuracy:.4f}")
print(f"Baseline Precision: {dummy_precision:.4f}")
print(f"Baseline Recall: {dummy_recall:.4f}")
print(f"Baseline F1-Score: {dummy_f1:.4f}")

print("\nComparison:")
print(f"Model Test Accuracy: {test_accuracy:.4f} vs Baseline: {dummy_accuracy:.4f}")
print(f"Model Test F1-Score: {test_f1:.4f} vs Baseline: {dummy_f1:.4f}")

Naive Baseline Performance (Test Set):
Baseline Accuracy: 0.5002
Baseline Precision: 0.0000
Baseline Recall: 0.0000
Baseline F1-Score: 0.0000

Comparison:
Model Test Accuracy: 0.7744 vs Baseline: 0.5002
Model Test F1-Score: 0.7793 vs Baseline: 0.0000
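
The baseline's precision, recall, and F1 of 0 are consistent with a classifier that always predicts class 0, while its accuracy of 0.5002 matches the share of class 0 in the test set (119351 of 238618). The pure-Python sketch below reproduces that behaviour on toy labels. The helper `most_frequent_baseline` and the toy data are hypothetical and only mimic what the `most_frequent` strategy amounts to.

```python
from collections import Counter


def most_frequent_baseline(y_train, n_predictions):
    """Predict the majority training label for every sample (illustrative helper)."""
    majority_label = Counter(y_train).most_common(1)[0][0]
    return [majority_label] * n_predictions


# Toy labels, not the project's data: class 0 is the majority in training.
y_train_toy = [0, 0, 0, 1, 1]
y_test_toy = [0, 1, 1, 0]
y_pred_toy = most_frequent_baseline(y_train_toy, len(y_test_toy))

tp = sum(1 for t, p in zip(y_test_toy, y_pred_toy) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_test_toy, y_pred_toy) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_test_toy, y_pred_toy) if t == 1 and p == 0)

toy_accuracy = sum(t == p for t, p in zip(y_test_toy, y_pred_toy)) / len(y_test_toy)
# Same convention as zero_division=0: report 0 when the denominator is 0.
toy_precision = tp / (tp + fp) if (tp + fp) else 0.0
toy_recall = tp / (tp + fn) if (tp + fn) else 0.0

print(y_pred_toy, toy_accuracy, toy_precision, toy_recall)
```

Because the positive class is never predicted, there are no true positives, so precision and recall for class 1 collapse to 0 even though accuracy stays near the majority-class share.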


---

### Model registration

In [7]:
registered_model_info = mlflow.register_model(
    model_uri=logged_model_uri, name="MODEL_SIMPLE"
)
print("Model registered successfully:")
print(f"- Name: {registered_model_info.name}")
print(f"- Version: {registered_model_info.version}")
print(f"- Stage: {registered_model_info.current_stage}")

Model registered successfully:
- Name: MODEL_SIMPLE
- Version: 2
- Stage: None


Registered model 'MODEL_SIMPLE' already exists. Creating a new version of this model...
Created version '2' of model 'MODEL_SIMPLE'.


---
---

## MLflow dashboard

In [8]:
! mlflow server --host 127.0.0.1 --port 8080

^C


![Overview](./mlflow_screenshot/simple/Overview.png)

![Metrics](./mlflow_screenshot/simple/Metrics.png)

![Compare Runs](./mlflow_screenshot/simple/Compare_runs.png)