# Natural Language Processing Lab: Sentiment Classification on Movie Reviews

## Objective
The goal of this lab is to build a sentiment classification system for movie reviews. You will use the IMDb dataset and apply a K-Nearest Neighbors (KNN) model to classify reviews as positive or negative.

## Part 1: Text Processing
1. **Preprocessing**: Apply preprocessing if needed.
2. **Vectorization**: Transform the text documents into numerical vectors using `TfidfVectorizer`.

## Part 3: Modeling
1. **Building the KNN Model**: Create a KNN model.
2. **Training the Model**: Train the model on the training set.
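
The two modeling steps can be sketched as follows. The 2-D points and the names `toy_X`, `toy_y`, and `toy_knn` are hypothetical stand-ins for the TF-IDF vectors and IMDb labels, chosen only to keep the example small.

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical 2-D points standing in for TF-IDF vectors
toy_X = [[0.0, 0.1], [0.1, 0.0], [0.2, 0.1], [1.0, 0.9], [0.9, 1.0], [0.8, 0.9]]
toy_y = [0, 0, 0, 1, 1, 1]  # 0 = negative, 1 = positive

toy_knn = KNeighborsClassifier(n_neighbors=3)
# Fitting a KNN model mostly amounts to storing the training points
toy_knn.fit(toy_X, toy_y)

# Each query point gets the majority label of its 3 nearest training points
toy_pred = toy_knn.predict([[0.05, 0.05], [0.95, 0.95]])
print(toy_pred)
```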

## Part 4: Evaluation
1. **Prediction and Classification**: Use the model to predict sentiments on the test set.
2. **Classification Report**: Generate a classification report to evaluate the model's performance.
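
A classification report can be produced along these lines. The labels and predictions below (`toy_true`, `toy_predicted`) are hypothetical; in the lab they would be the test labels and the model's predictions.

```python
from sklearn.metrics import classification_report

# Hypothetical labels and predictions (0 = negative, 1 = positive)
toy_true = [0, 0, 0, 1, 1, 1]
toy_predicted = [0, 0, 1, 1, 1, 0]

# output_dict=True returns the report as a nested dict instead of a formatted string
report = classification_report(
    toy_true,
    toy_predicted,
    target_names=["negative", "positive"],
    output_dict=True,
)
print(f"accuracy: {report['accuracy']:.4f}")
print(f"recall (positive): {report['positive']['recall']:.4f}")
```

Omitting `output_dict=True` gives the usual text table with precision, recall, and F1 per class, which is easier to read in a notebook.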

## Questions
1. How does reducing the number of features (`max_features`) affect the model's performance?
2. What impact does the choice of the number of neighbors in KNN have on the results?
3. Compare the performance of the KNN model with another classifier (for example, [Naive Bayes](https://scikit-learn.org/stable/modules/naive_bayes.html) or [SVM](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC)). Which one performs better, and why?
4. Does preprocessing improve classification?
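
One possible way to set up the comparisons behind questions 1 to 3 is sketched below. It is an illustration under stated assumptions, not the expected answer: the toy texts, the `candidates` dictionary, and the values `max_features=20` and `n_neighbors=3` are all hypothetical, and the scores on such a tiny corpus say nothing about IMDb.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data; in the lab, use the IMDb train and test splits instead
train_texts = [
    "great film", "wonderful acting", "loved this great story",
    "terrible film", "awful acting", "hated this boring story",
]
train_labels = [1, 1, 1, 0, 0, 0]
test_texts = ["great story", "boring film"]
test_labels = [1, 0]

candidates = {
    "knn": KNeighborsClassifier(n_neighbors=3),  # vary n_neighbors for question 2
    "naive_bayes": MultinomialNB(),
    "linear_svc": LinearSVC(),
}

scores = {}
for name, classifier in candidates.items():
    # max_features caps the vocabulary size; vary it for question 1
    pipeline = make_pipeline(TfidfVectorizer(max_features=20), classifier)
    pipeline.fit(train_texts, train_labels)
    scores[name] = pipeline.score(test_texts, test_labels)  # mean accuracy

print(scores)
```

Wrapping the vectorizer and the classifier in one pipeline keeps the vocabulary fitted on the training texts only, so every classifier is compared on the same footing.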

## Resources
- [IMDb Dataset](https://huggingface.co/datasets/imdb)
- [Scikit-learn Documentation](https://scikit-learn.org/stable/)

In [1]:
!pip install -q -U datasets scikit-learn spacy
!python -m spacy download en_core_web_sm





In [2]:
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

# Split the dataset twice to shrink it and reduce computation time
dataset = load_dataset("imdb", split="train")
dataset = dataset.train_test_split(stratify_by_column="label", test_size=0.3, seed=42)
dataset = dataset['train'].train_test_split(stratify_by_column="label", test_size=0.3, seed=42)

In [None]:
# Clean the data: strip HTML line breaks, drop non-letter characters, lowercase, lemmatize, and remove stopwords
import re
import spacy

nlp = spacy.load("en_core_web_sm")

def clean_text(text):
    text = re.sub(r"<br\s*/?>", " ", text)
    text = re.sub(r"[^a-zA-Z\s]", "", text)
    text = text.lower()
    text = " ".join([token.lemma_ for token in nlp(text) if not token.is_stop])
    return text

dataset = dataset.map(lambda x: {"text": clean_text(x["text"])})

print(f"size of train set is {len(dataset['train'])}")
print(f"size of test set is {len(dataset['test'])}")


In [4]:
# Vectorize the data
vectorizer = TfidfVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(dataset['train']['text'])
X_test = vectorizer.transform(dataset['test']['text'])

In [5]:
# Train the model
model = KNeighborsClassifier(n_neighbors=7)
model.fit(X_train, dataset['train']['label'])
print("Training complete")

Training complete


In [10]:
# Test an SVC model (default RBF kernel)
from sklearn.svm import SVC
model_svm = SVC()

model_svm.fit(X_train, dataset['train']['label'])

In [11]:
# Predictions
y_pred_knn = model.predict(X_test)
y_pred_svm = model_svm.predict(X_test)

# Print each model's accuracy on the test set
score_knn = model.score(X_test, dataset['test']['label'])
score_svm = model_svm.score(X_test, dataset['test']['label'])
print(f"Score Knn : {score_knn:.4f}\n")
print(f"Score SVM : {score_svm:.4f}\n")



Score Knn : 0.7691

Score SVM : 0.8832



In [7]:
# Search for the best KNN hyperparameters with GridSearchCV
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_neighbors': [2,3,5,7],
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan'],
}

grid = GridSearchCV(KNeighborsClassifier(), param_grid, verbose=1, cv=3, n_jobs=-1)
grid.fit(X_train, dataset['train']['label'])

print(f"Best parameters: {grid.best_params_}")
print(f"Best score: {grid.best_score_:.4f}")


Fitting 3 folds for each of 16 candidates, totalling 48 fits
Best parameters: {'metric': 'euclidean', 'n_neighbors': 7, 'weights': 'distance'}
Best score: 0.7666
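
Beyond the single best configuration, the per-candidate scores stored in `cv_results_` show how accuracy varies with `n_neighbors`. The sketch below assumes synthetic data from `make_classification` as a stand-in for the TF-IDF matrix; `X_demo`, `y_demo`, and `demo_grid` are hypothetical names, and the scores it prints are unrelated to the IMDb results above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical synthetic data standing in for the TF-IDF matrix and labels
X_demo, y_demo = make_classification(n_samples=120, n_features=10, random_state=0)

demo_grid = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [2, 3, 5, 7]}, cv=3)
demo_grid.fit(X_demo, y_demo)

# Pair each candidate value with its mean cross-validated accuracy
results = demo_grid.cv_results_
mean_scores = {
    params["n_neighbors"]: float(score)
    for params, score in zip(results["params"], results["mean_test_score"])
}
for k, score in sorted(mean_scores.items()):
    print(f"n_neighbors={k}: mean CV accuracy={score:.4f}")
```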


In [9]:
# Take 5 random samples from the test set and print the prediction vs. the true label
import random

test_set = dataset['test']
for _ in range(5):
    # randrange excludes the upper bound, so the index is always valid
    # (randint includes it and could raise an IndexError)
    i = random.randrange(len(test_set))
    val = test_set[i]['text']
    predicted = model.predict(vectorizer.transform([val]))[0]
    predicted_label = "positive" if predicted == 1 else "negative"
    true_label = "positive" if test_set[i]['label'] == 1 else "negative"
    print(f"Text: {val}...\nPredicted: {predicted_label}\nTrue: {true_label}\n")




Text: movie potential make bad lindsay crouse act ve see maybe crouse fan like performance movie bad   delivery robotic deliver line appear try sure line right simply read list head voice little inflection not believe bad acting give lead role movie know somebody biz   hate mean comment long performance stick   like story go continue watch script making good movie end disappointing maybe acting well like...
Predicted: negative
True: negative

Text: lorne michael prove absolutely business produce movie   d think dismal flick superstar night roxbury conehead d start notion maybe not know s come movie argue not know s come television try feature film skit wear welcome time snl make sense   personally like tim meadow think great right movie shame talented guy waste film feature unfunny unfunny situation cap dreadfully bad song dance scene laugh movie bad funny   oh thankful tired snl character film bad big screen...
Predicted: negative
True: negative

Text: movie great touch stone cold hea