# Binary baseline: Trustpilot sentiment

This notebook builds **three baselines** for binary sentiment classification on a **cleaned** dataset (`CleanText`, `Rating`):

1) **Majority Class** (DummyClassifier)
2) **TF‑IDF + Multinomial Naive Bayes**
3) **TF‑IDF + Logistic Regression**

We compare `accuracy`, `precision`, `recall`, `f1`, and the confusion matrix.

---

In [3]:
# 🔧 Imports & versions
import pandas as pd
import numpy as np
from pathlib import Path
from collections import Counter

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, f1_score
from sklearn.dummy import DummyClassifier
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt
import joblib

import sys, sklearn
print("pandas:", pd.__version__)
print("numpy:", np.__version__)
print("sklearn:", sklearn.__version__)
print("joblib:", joblib.__version__)

pandas: 2.3.0
numpy: 2.3.0
sklearn: 1.7.1
joblib: 1.5.1


## 1) Loading the cleaned dataset
- The expected file contains **`CleanText`** (text) and **`Rating`** (1–5).
- We derive a **binary** target: `y = 1` if `Rating >= 4` (positive), otherwise `0` (negative/neutral).
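
As a quick illustration of the thresholding rule, here is a minimal sketch on a toy DataFrame. The five ratings are made up for the example and are not taken from the dataset.

```python
import pandas as pd

# Toy ratings covering the full 1-5 scale (illustrative values only)
toy = pd.DataFrame({"Rating": [1, 2, 3, 4, 5]})

# Same rule as in this notebook: 4 and 5 stars map to 1, everything else to 0
toy["target"] = (toy["Rating"] >= 4).astype(int)
print(toy["target"].tolist())
```

Note that a 3-star review lands in class 0 together with the 1- and 2-star reviews.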

In [4]:
# ⚙️ Parameters
RANDOM_STATE = 30
TEST_SIZE = 0.2
INPUT_PATH = "trustpilot_dataset_final_features.csv"  # adjust if needed

df = pd.read_csv(INPUT_PATH)
expected_cols = {"CleanText", "Rating"}
if not expected_cols.issubset(df.columns):
    raise ValueError(f"Missing columns. Expected {expected_cols}, found {list(df.columns)}")

df = df.dropna(subset=["CleanText", "Rating"]).copy()
df["Rating"] = pd.to_numeric(df["Rating"], errors="coerce")
df = df.dropna(subset=["Rating"]) 

# Binary target: 1 if rating >= 4, otherwise 0
df["target"] = (df["Rating"] >= 4).astype(int)
X = df["CleanText"].astype(str).values
y = df["target"].values

print("Dataset size:", df.shape)
print("Binary class distribution:", Counter(y))
df.head(3)

FileNotFoundError: [Errno 2] No such file or directory: 'trustpilot_dataset_final_features.csv'
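
The traceback above indicates that the CSV was not found in the working directory. One way to fail earlier with a clearer message is sketched below. `load_reviews` is a hypothetical helper, not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_reviews(path, required=("CleanText", "Rating")):
    """Hypothetical helper: check the path and the columns before returning."""
    path = Path(path)
    if not path.is_file():
        # Report the working directory, which is the usual culprit
        raise FileNotFoundError(
            f"{path} not found (working directory: {Path.cwd()})"
        )
    df = pd.read_csv(path)
    missing = set(required) - set(df.columns)
    if missing:
        raise ValueError(f"Missing columns: {sorted(missing)}")
    return df
```

It would be called as `df = load_reviews(INPUT_PATH)` in place of the bare `pd.read_csv` call.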

## 2) Train / test split (stratified)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=TEST_SIZE, random_state=RANDOM_STATE, stratify=y
)
print("Train:", X_train.shape, "Test:", X_test.shape)
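
To see what `stratify` does, the sketch below splits synthetic labels (80 positives, 20 negatives, chosen only for illustration) and compares the positive rate in each split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 80 positives, 20 negatives (illustrative only)
y_toy = np.array([1] * 80 + [0] * 20)
X_toy = np.arange(len(y_toy))

_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=30, stratify=y_toy
)

# With stratify, both splits keep the 80/20 class ratio
print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```

Without `stratify`, the two rates would generally differ from one random seed to the next, which makes small test sets harder to compare.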

## 3) Baseline 1: Majority Class
Always predict the majority class (a minimal reference point).

In [None]:
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_train.reshape(-1,1), y_train)  # trick: DummyClassifier ignores the features
y_pred_dummy = dummy.predict(X_test.reshape(-1,1))

print("Accuracy:", accuracy_score(y_test, y_pred_dummy))
print(classification_report(y_test, y_pred_dummy, digits=3))

cm = confusion_matrix(y_test, y_pred_dummy)
plt.figure(figsize=(4,4))
plt.imshow(cm, interpolation='nearest')
plt.title('Confusion Matrix — Majority Class')
plt.xlabel('Predicted')
plt.ylabel('True')
for (i,j), v in np.ndenumerate(cm):
    plt.text(j, i, str(v), ha='center', va='center')
plt.show()
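
A useful sanity check is that the accuracy of this baseline equals the share of the majority class in the test set. The sketch below shows this on toy labels (made up for the example).

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

y_train_toy = np.array([1, 1, 1, 1, 1, 1, 0, 0])  # class 1 is the majority
y_test_toy = np.array([1, 1, 1, 0])

# The features are never read, so a column of zeros is enough
clf = DummyClassifier(strategy="most_frequent")
clf.fit(np.zeros((len(y_train_toy), 1)), y_train_toy)
pred = clf.predict(np.zeros((len(y_test_toy), 1)))

acc = accuracy_score(y_test_toy, pred)
# Accuracy matches the proportion of class 1 in the test labels
print(acc, (y_test_toy == 1).mean())
```

Any real model should beat this number before its other metrics are worth reading.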

## 4) Baseline 2: TF‑IDF + Multinomial Naive Bayes
Conservative parameters (unigrams and bigrams, `min_df=2`, `max_df=0.95`).

In [None]:
pipe_nb = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1,2), min_df=2, max_df=0.95)),
    ("clf", MultinomialNB(alpha=0.5))
])
pipe_nb.fit(X_train, y_train)
y_pred_nb = pipe_nb.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred_nb))
print(classification_report(y_test, y_pred_nb, digits=3))

cm = confusion_matrix(y_test, y_pred_nb)
plt.figure(figsize=(4,4))
plt.imshow(cm, interpolation='nearest')
plt.title('Confusion Matrix — TF-IDF + NB')
plt.xlabel('Predicted')
plt.ylabel('True')
for (i,j), v in np.ndenumerate(cm):
    plt.text(j, i, str(v), ha='center', va='center')
plt.show()

## 5) Baseline 3: TF‑IDF + Logistic Regression
`class_weight='balanced'` helps if the classes are slightly imbalanced.
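
With `class_weight='balanced'`, scikit-learn weights each class by `n_samples / (n_classes * count)`. The sketch below checks this on toy labels (80% positive, chosen for illustration) using `compute_class_weight`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y_toy = np.array([1] * 8 + [0] * 2)  # 80% positive, illustrative only
classes = np.array([0, 1])

weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)

# Same formula written out by hand
manual = len(y_toy) / (len(classes) * np.bincount(y_toy))

print(dict(zip(classes.tolist(), weights.tolist())))
```

Here the minority class 0 gets weight 2.5 and the majority class 1 gets 0.625, so errors on the rarer class cost more during training.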

In [None]:
pipe_lr = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1,2), min_df=2, max_df=0.95)),
    ("clf", LogisticRegression(max_iter=1000, class_weight='balanced', random_state=RANDOM_STATE)),
])
pipe_lr.fit(X_train, y_train)
y_pred_lr = pipe_lr.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred_lr))
print(classification_report(y_test, y_pred_lr, digits=3))

cm = confusion_matrix(y_test, y_pred_lr)
plt.figure(figsize=(4,4))
plt.imshow(cm, interpolation='nearest')
plt.title('Confusion Matrix — TF-IDF + LogReg')
plt.xlabel('Predicted')
plt.ylabel('True')
for (i,j), v in np.ndenumerate(cm):
    plt.text(j, i, str(v), ha='center', va='center')
plt.show()

## 6) Score comparison table

In [None]:
def metrics_row(name, y_true, y_pred):
    return {
        "model": name,
        "accuracy": accuracy_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred)
    }

rows = [
    metrics_row("Majority", y_test, y_pred_dummy),
    metrics_row("TFIDF_NB", y_test, y_pred_nb),
    metrics_row("TFIDF_LogReg", y_test, y_pred_lr),
]
summary = pd.DataFrame(rows).sort_values("f1", ascending=False)
summary
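
By default `f1_score` reports the F1 of the positive class only, which can hide how a model treats class 0. A possible extension that adds precision, recall, and macro-averaged F1 is sketched below. `metrics_row_full` is a hypothetical name, and `zero_division=0` is a choice made here to silence warnings when a class is never predicted.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score


def metrics_row_full(name, y_true, y_pred):
    """Hypothetical extension of metrics_row with more metrics."""
    return {
        "model": name,
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
        "f1_macro": f1_score(y_true, y_pred, average="macro", zero_division=0),
    }


# Toy check: a classifier that always predicts 1 (illustrative labels)
row = metrics_row_full("always_1", [1, 1, 1, 0], [1, 1, 1, 1])
print(row)
```

On this toy case the positive-class F1 looks decent while the macro F1 is halved, because class 0 is never predicted.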

## 7) Saving the best pipeline (optional)
We save the full pipeline (vectorizer + model) as `.joblib` for direct reuse.

In [None]:
best_name = summary.iloc[0]["model"]
best_pipe = {"TFIDF_NB": pipe_nb, "TFIDF_LogReg": pipe_lr, "Majority": dummy}[best_name]

models_dir = Path("models")
models_dir.mkdir(parents=True, exist_ok=True)
out_path = models_dir / f"baseline_binary_{best_name.lower()}.joblib"
joblib.dump(best_pipe, out_path)
print(f"✅ Model saved: {out_path}")
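
The round trip can be checked with `joblib.load`. The sketch below uses a tiny stand-in pipeline trained on toy texts and a temporary directory, so it does not touch the `models` folder.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny stand-in pipeline (toy data, not the notebook's model)
texts = ["great ring", "love the app", "battery died", "support ignored me"]
labels = [1, 1, 0, 0]
toy_pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB()),
]).fit(texts, labels)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_pipeline.joblib"
    joblib.dump(toy_pipe, path)
    reloaded = joblib.load(path)

# The reloaded pipeline accepts raw strings and reproduces the predictions
same = bool((reloaded.predict(texts) == toy_pipe.predict(texts)).all())
print(same)
```

Two cautions apply to any `.joblib` file: load it with the same scikit-learn version that produced it, and only load files from a trusted source, since unpickling can execute arbitrary code.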

## (Bonus) Quick prediction on free text

In [None]:
samples = [
    "The ring is fantastic and the sleep analysis is spot on.",
    "Battery died after one day and support ignored my emails.",
    "Works as expected.",
]
preds = best_pipe.predict(samples if best_name != "Majority" else np.array(samples).reshape(-1,1))
list(zip(samples, preds))
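
Hard labels hide how confident the model is. As a sketch, `predict_proba` returns one probability per class. The pipeline below is a toy stand-in trained on invented texts, so the numbers it prints are not meaningful for Trustpilot data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy stand-in for the fitted pipeline (illustrative data only)
texts = ["great product", "great service", "love it great",
         "terrible product", "terrible service", "hate it terrible"]
labels = [1, 1, 1, 0, 0, 0]
toy_pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
]).fit(texts, labels)

toy_samples = ["great ring", "terrible battery"]
proba = toy_pipe.predict_proba(toy_samples)  # shape (n_samples, n_classes)

# Column order follows classes_, so look up the positive class explicitly
positive_col = list(toy_pipe.classes_).index(1)
for text, p in zip(toy_samples, proba[:, positive_col]):
    print(f"{p:.2f}  {text}")
```

Looking up the column through `classes_` avoids assuming that the positive class is always the second column.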