**Installation and imports**

In [1]:
from datasets import load_dataset
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix
import joblib




**Loading the Hugging Face dataset**

In [2]:
# Load the customer-support-tickets dataset from Hugging Face
dataset = load_dataset("Tobi-Bueck/customer-support-tickets")
dataset



DatasetDict({
    train: Dataset({
        features: ['subject', 'body', 'answer', 'type', 'queue', 'priority', 'language', 'version', 'tag_1', 'tag_2', 'tag_3', 'tag_4', 'tag_5', 'tag_6', 'tag_7', 'tag_8'],
        num_rows: 61765
    })
})

**Converting to a DataFrame and building the text column**

In [3]:
# Use the "train" split (the only split in this dataset)
df = dataset["train"].to_pandas()

# Keep only the columns we need:
# subject + body = input text, type = label
df = df[["subject", "body", "type"]].dropna()

# Create a 'text' column = subject + body
df["text"] = df["subject"].astype(str) + " " + df["body"].astype(str)

# Show a preview
df.head()

,subject,body,type,text
0,Wesentlicher Sicherheitsvorfall,"Sehr geehrtes Support-Team,\n\nich möchte eine...",Incident,Wesentlicher Sicherheitsvorfall Sehr geehrtes ...
1,Account Disruption,"Dear Customer Support Team,\n\nI am writing to...",Incident,"Account Disruption Dear Customer Support Team,..."
2,Query About Smart Home System Integration Feat...,"Dear Customer Support Team,\n\nI hope this mes...",Request,Query About Smart Home System Integration Feat...
3,Inquiry Regarding Invoice Details,"Dear Customer Support Team,\n\nI hope this mes...",Request,Inquiry Regarding Invoice Details Dear Custome...
4,Question About Marketing Agency Software Compa...,"Dear Support Team,\n\nI hope this message reac...",Problem,Question About Marketing Agency Software Compa...


**Label encoding and train/test split**

In [4]:
# Encode the labels (type) as integers
label2id = {label: idx for idx, label in enumerate(sorted(df["type"].unique()))}
id2label = {idx: label for label, idx in label2id.items()}

df["label"] = df["type"].map(label2id)

X = df["text"].values
y = df["label"].values

# Train/test split (here 80% / 20%)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

len(X_train), len(X_test)


(34628, 8658)
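
The sizes printed above also say something about the earlier `dropna()` call. As a rough check, assuming only the numbers shown in this notebook (61765 loaded rows, a 34628/8658 split) and the four type names that appear in its outputs, the following sketch rebuilds the label mapping and accounts for the rows:

```python
import math

# The four ticket types that appear in this notebook's outputs.
types = ["Incident", "Request", "Problem", "Change"]

# Same construction as the cell above: sorted labels -> consecutive ids.
label2id = {label: idx for idx, label in enumerate(sorted(set(types)))}
id2label = {idx: label for label, idx in label2id.items()}
print(label2id)

# Row accounting from the numbers printed in the notebook.
n_loaded = 61765               # rows in the "train" split
n_train, n_test = 34628, 8658  # sizes returned by train_test_split
n_kept = n_train + n_test      # rows that survived dropna()
n_dropped = n_loaded - n_kept
print(n_kept, n_dropped)

# An 80/20 split of the kept rows, rounding the test share up.
expected_test = math.ceil(0.2 * n_kept)
print(expected_test, n_kept - expected_test)
```

In other words, `dropna()` removed 18479 of the 61765 rows (those missing a subject, a body, or a type), so the model is trained and evaluated on the remaining 43286 tickets.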

**Defining the TF-IDF + logistic regression pipeline**

In [5]:
# Pipeline: TF-IDF (text features) -> logistic regression (classifier)
model = Pipeline([
    ("tfidf", TfidfVectorizer(
        max_features=20000,   # caps the vocabulary size; adjustable
        ngram_range=(1, 2),   # unigrams + bigrams
        stop_words="english"  # English list only, although the dataset also contains German tickets; use None to keep every word
    )),
    ("clf", LogisticRegression(
        max_iter=1000,
        multi_class="multinomial",  # deprecated in recent scikit-learn releases; the default solver already fits a multinomial model
        n_jobs=-1
    ))
])
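
To make `ngram_range=(1, 2)` concrete, here is a simplified pure-Python sketch of unigram and bigram extraction. The helper `word_ngrams` is hypothetical and only approximates the idea; `TfidfVectorizer` additionally removes stop words, caps the vocabulary at `max_features`, and weights each term by TF-IDF.

```python
import re

def word_ngrams(text, n_min=1, n_max=2):
    """Return all word n-grams of length n_min..n_max (hypothetical helper)."""
    # Lowercase and keep tokens of two or more word characters,
    # similar in spirit to the vectorizer's default tokenization.
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

features = word_ngrams("Problem with my invoice")
print(features)
```

Bigrams such as "my invoice" let the classifier pick up short phrases that single words would miss, at the cost of a larger vocabulary.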


**Training the model**

In [6]:
model.fit(X_train, y_train)



**Evaluation on the test set**

In [7]:
y_pred = model.predict(X_test)

print("Classification report:")
print(classification_report(y_test, y_pred, target_names=[id2label[i] for i in sorted(id2label.keys())]))

print("Confusion matrix:")
print(confusion_matrix(y_test, y_pred))


Classification report:
              precision    recall  f1-score   support

      Change       0.99      0.88      0.93       878
    Incident       0.77      0.90      0.83      3477
     Problem       0.72      0.50      0.59      1815
     Request       0.97      0.99      0.98      2488

    accuracy                           0.84      8658
   macro avg       0.86      0.82      0.83      8658
weighted avg       0.84      0.84      0.83      8658

Confusion matrix:
[[ 776   15    8   79]
 [   2 3121  349    5]
 [   1  897  914    3]
 [   7    5    1 2475]]
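
The report and the matrix carry the same information. As a check, this sketch recomputes per-class precision and recall from the matrix printed above, where rows are true classes and columns are predicted classes, both in the order Change, Incident, Problem, Request:

```python
labels = ["Change", "Incident", "Problem", "Request"]
cm = [
    [776, 15, 8, 79],
    [2, 3121, 349, 5],
    [1, 897, 914, 3],
    [7, 5, 1, 2475],
]

n = len(labels)
total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(n))

recall = {}
precision = {}
for i, name in enumerate(labels):
    row_sum = sum(cm[i])                       # tickets whose true class is `name`
    col_sum = sum(cm[r][i] for r in range(n))  # tickets predicted as `name`
    recall[name] = cm[i][i] / row_sum
    precision[name] = cm[i][i] / col_sum
    print(f"{name:9s} precision={precision[name]:.2f} recall={recall[name]:.2f}")

accuracy = correct / total
print(f"accuracy={accuracy:.2f}")
```

The weak spot is the Problem row: 897 of the 1815 Problem tickets are predicted as Incident, which accounts for both the 0.50 recall on Problem and the 0.77 precision on Incident.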


**Saving the trained model**

In [8]:
# Save the whole pipeline (TF-IDF + logistic regression) along with the label mappings
joblib.dump({
    "pipeline": model,
    "label2id": label2id,
    "id2label": id2label
}, "ticket_type_classifier.joblib")

"Model saved to ticket_type_classifier.joblib"


'Model saved to ticket_type_classifier.joblib'

**Reloading the model and predicting on a new example**

In [9]:
saved = joblib.load("ticket_type_classifier.joblib")
loaded_model = saved["pipeline"]
id2label = saved["id2label"]

# Example ticket
nouveau_subject = "Problem with my invoice"
nouveau_body = "I was charged twice this month, please fix this issue."
nouveau_text = nouveau_subject + " " + nouveau_body

pred_label_id = loaded_model.predict([nouveau_text])[0]
pred_label = id2label[pred_label_id]

print("Text:", nouveau_text)
print("Predicted type:", pred_label)


Text: Problem with my invoice I was charged twice this month, please fix this issue.
Predicted type: Problem
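
For a multinomial logistic regression, the predicted type is the class with the highest softmax probability. The sketch below shows that final step in plain Python; the scores are made up for illustration and are not taken from the trained model.

```python
import math

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["Change", "Incident", "Problem", "Request"]
scores = [-1.0, 0.8, 1.1, -0.5]  # hypothetical decision scores, one per class

probs = softmax(scores)
best = max(range(len(labels)), key=lambda i: probs[i])
for name, p in zip(labels, probs):
    print(f"{name:9s} {p:.3f}")
print("Predicted type:", labels[best])
```

With the real pipeline, `loaded_model.predict_proba([nouveau_text])` should return the corresponding probabilities, which can help flag tickets where Problem and Incident score almost equally.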
