# Natural Language Processing Lab: Sentiment Classification on Movie Reviews

## Objective
The goal of this lab is to build a sentiment classification system for movie reviews. You will use the IMDb dataset and apply a K-Nearest Neighbors (KNN) model to classify reviews as positive or negative.

## Part 1: Text Processing
1. **Preprocessing**: Apply preprocessing to the text if needed.
2. **Vectorization**: Transform the text documents into numerical vectors using `TfidfVectorizer`.

## Part 3: Modeling
1. **Building the KNN Model**: Create a KNN model.
2. **Training the Model**: Train the model on the training set.

## Part 4: Evaluation
1. **Prediction and Classification**: Use the model to predict sentiment on the test set.
2. **Classification Report**: Generate a classification report to evaluate the model's performance.

## Questions
1. How does reducing the number of features (`max_features`) affect the model's performance?
2. What impact does the choice of the number of neighbors in KNN have on the results?
3. Compare the performance of the KNN model with another classifier (for example, [Naive Bayes](https://scikit-learn.org/stable/modules/naive_bayes.html) or [SVM](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC)). Which one performs better, and why?
4. Does preprocessing improve classification?
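
Questions 1 and 2 can be explored with a small grid over `max_features` and `n_neighbors`. The following is a minimal sketch, not the notebook's own experiment: the toy corpus and the `evaluate` helper are hypothetical stand-ins so the block runs on its own, and the values tried are arbitrary. In the notebook, the IMDb training and test subsets would replace the toy lists.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy corpus standing in for the IMDb subsets (1 = pos, 0 = neg).
train_texts = [
    "a wonderful and moving film", "great acting and a great story",
    "I loved every minute", "brilliant, funny and touching",
    "a boring and awful film", "terrible acting and a dull story",
    "I hated every minute", "a painful waste of time",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]
test_texts = ["a great and touching story", "dull, boring, awful"]
test_labels = [1, 0]


def evaluate(max_features, n_neighbors):
    """Fit TF-IDF + KNN with the given settings and return test accuracy."""
    vectorizer = TfidfVectorizer(max_features=max_features)
    X_train = vectorizer.fit_transform(train_texts)
    X_test = vectorizer.transform(test_texts)
    model = KNeighborsClassifier(n_neighbors=n_neighbors)
    model.fit(X_train, train_labels)
    return accuracy_score(test_labels, model.predict(X_test))


# Try every combination; max_features=None keeps the full vocabulary.
results = {
    (mf, k): evaluate(mf, k)
    for mf in (5, 20, None)
    for k in (1, 3, 5)
}
for (mf, k), acc in results.items():
    print(f"max_features={mf}, n_neighbors={k}: accuracy={acc:.2f}")
```

On a corpus this small the accuracies are not meaningful; the point is only the shape of the experiment, with the vectorizer refit for each `max_features` value.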

## Resources
- [IMDb Dataset](https://huggingface.co/datasets/imdb)
- [Scikit-learn Documentation](https://scikit-learn.org/stable/)

In [1]:

# !python -m venv env
# !source env/bin/activate
!pip install datasets scikit-learn spacy plotly
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [2]:
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

# Subsample the dataset with two successive stratified splits to reduce computation time
dataset = load_dataset("imdb", split="train")
dataset = dataset.train_test_split(stratify_by_column="label", test_size=0.3, seed=42)
dataset = dataset['train'].train_test_split(stratify_by_column="label", test_size=0.3, seed=42)
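
The effect of the two successive splits on the subset sizes can be checked with simple arithmetic. This is a sketch under two assumptions: the IMDb `train` split contains 25,000 reviews, and the test share is rounded up to a whole number of rows. `split_sizes` is a hypothetical helper, not part of `datasets`.

```python
import math


def split_sizes(n_rows, test_percent):
    """Return (n_train, n_test) when test_percent % of the rows are held out."""
    n_test = math.ceil(n_rows * test_percent / 100)
    return n_rows - n_test, n_test


n_full = 25_000  # assumed size of the IMDb "train" split
n_kept, _ = split_sizes(n_full, 30)        # first split: only the 70% part is kept
n_train, n_test = split_sizes(n_kept, 30)  # second split, applied to that part
print(n_kept, n_train, n_test)
```

The resulting test size matches the total support of 5250 shown in the classification reports.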


In [3]:
# preprocessing data
import spacy
nlp = spacy.load("en_core_web_sm")




In [4]:
# tokenization
def tokenize(text):
    doc = nlp.tokenizer(text)
    return [token.text for token in doc]

In [5]:
# stop words
from spacy.lang.en.stop_words import STOP_WORDS
stop_words = list(STOP_WORDS)

# clean
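
The `tokenize` function and the `stop_words` list above are defined but never passed to the vectorizer. One way to wire preprocessing into `TfidfVectorizer` is sketched below. It rests on several assumptions: `simple_tokenize`, `toy_stop_words`, and `clean` are hypothetical stand-ins chosen so the block runs without the spaCy model, and stripping `<br />` tags is just one plausible cleaning step for IMDb reviews. In the notebook, `tokenize` and `stop_words` would take their place.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer


def simple_tokenize(text):
    """Stand-in tokenizer: lowercase the text and keep runs of letters/apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def clean(text):
    """Hypothetical cleaning step: replace HTML line breaks with a space."""
    return re.sub(r"<br\s*/?>", " ", text)


# Stand-in for the spaCy STOP_WORDS list used in the notebook.
toy_stop_words = ["the", "a", "was", "and", "this"]

vectorizer = TfidfVectorizer(
    preprocessor=clean,         # applied to each raw document first
    tokenizer=simple_tokenize,  # replaces the default regex tokenization
    token_pattern=None,         # unused when a custom tokenizer is given
    stop_words=toy_stop_words,  # removed after tokenization
)

docs = ["The plot was great.<br />Great cast!", "This film was a bore."]
X = vectorizer.fit_transform(docs)
vocab = sorted(vectorizer.vocabulary_)
print(vocab)
```

Comparing the classification report obtained with and without these arguments gives a direct answer to the preprocessing question.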


In [7]:
# vectorize the data
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(dataset['train']['text'])
X_test = vectorizer.transform(dataset['test']['text'])

# train the model
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, dataset['train']['label'])

# prediction
y_pred = model.predict(X_test)

# display the results
print(classification_report(dataset['test']['label'], y_pred, target_names=dataset['test'].features['label'].names))



              precision    recall  f1-score   support

         neg       0.77      0.73      0.75      2625
         pos       0.75      0.78      0.76      2625

    accuracy                           0.76      5250
   macro avg       0.76      0.76      0.76      5250
weighted avg       0.76      0.76      0.76      5250
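
The precision, recall, and F1 columns of the report follow from counts of true positives, false positives, and false negatives for each class. The sketch below recomputes them by hand on made-up labels. `binary_metrics` is a hypothetical helper written for illustration, not a scikit-learn function.

```python
def binary_metrics(y_true, y_pred, positive):
    """Compute (precision, recall, f1) for the class `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up labels: 0 = neg, 1 = pos.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 1, 1, 1, 1, 0]

precision, recall, f1 = binary_metrics(y_true, y_pred_toy, positive=1)
print(f"pos: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In the report, the `macro avg` row is the unweighted mean of the per-class values and the `weighted avg` row weights them by support. The two rows coincide here because both classes have the same support.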



In [9]:
# test on 5 random samples
import random
for _ in range(5):
    i = random.randint(0, len(dataset['test']) - 1)
    print(f"Text: {dataset['test']['text'][i]}")
    print(f"True label: {dataset['test']['label'][i]}")
    print(f"Predicted label: {dataset['test'].features['label'].names[y_pred[i]]}")
    print()

Text: I just got back from the film and I'm completely appalled. This movie is an absolute mockery to all of mankind. The theatre I was in maybe had 4 other people. This movie was recommended to me and I couldn't believe that this person liked it. I can't believe that any sane human would like it. There was no plot NO PLOT AT ALL. It was a joke. How can you make a movie about nothing. This movie only goes to show why Hollywood is in such a shambles. I can only just look at the spiral of the "Horror Movie" industry and giggle. What a travesty to all filmaking, this is true of all the new "teen horror flicks" Grudge,Boogeyman,Ring,Saw series. It is all such trash. Don't support this kind of hogwash!
True label: 0
Predicted label: neg

Text: I love Memoirs of a Geisha so I read the book twice; it is one of the best book I've read last year. I was looking forward to the movie and was afraid that reading the book would ruin the viewing pleasure of the movie. I wasn't expecting the movie to 

In [6]:
# vectorize the data
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(dataset['train']['text'])
X_test = vectorizer.transform(dataset['test']['text'])

# SVM
from sklearn.svm import SVC
model = SVC()
model.fit(X_train, dataset['train']['label'])
y_pred = model.predict(X_test)
print(classification_report(dataset['test']['label'], y_pred, target_names=dataset['test'].features['label'].names))


              precision    recall  f1-score   support

         neg       0.90      0.88      0.89      2625
         pos       0.88      0.90      0.89      2625

    accuracy                           0.89      5250
   macro avg       0.89      0.89      0.89      5250
weighted avg       0.89      0.89      0.89      5250
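
The cell above uses `SVC()`, whose default kernel is RBF, whereas the Questions section links to Naive Bayes and `LinearSVC`. The following sketch compares several classifiers on the same TF-IDF features. The toy corpus is a hypothetical stand-in so the block runs on its own, and its scores say nothing about IMDb; linear models are usually much faster than kernel SVMs on large sparse TF-IDF matrices, which is worth verifying on the real subsets.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

# Hypothetical toy corpus standing in for the IMDb subsets (1 = pos, 0 = neg).
train_texts = [
    "a wonderful and moving film", "great acting and a great story",
    "I loved every minute", "brilliant, funny and touching",
    "a boring and awful film", "terrible acting and a dull story",
    "I hated every minute", "a painful waste of time",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]
test_texts = ["a great and touching story", "dull, boring, awful"]
test_labels = [1, 0]

# Vectorize once so every classifier sees exactly the same features.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

classifiers = {
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "MultinomialNB": MultinomialNB(),
    "LinearSVC": LinearSVC(),
}

scores = {}
for name, clf in classifiers.items():
    clf.fit(X_train, train_labels)
    scores[name] = accuracy_score(test_labels, clf.predict(X_test))
    print(f"{name}: accuracy={scores[name]:.2f}")
```

In the notebook, `classification_report` could replace `accuracy_score` to obtain the same per-class table as the KNN and SVC cells.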

