# Natural Language Processing

## Assignment: K Nearest Neighbors Model for the IMDB Movie Review Dataset

For the final project, build a K Nearest Neighbors model to predict the sentiment (positive or negative) of movie reviews. The dataset is originally hosted here: http://ai.stanford.edu/~amaas/data/sentiment/

Use the notebooks from the class to implement the model, then train and test it with the corresponding datasets.

You can follow these steps:
1. Read the training and test data (Given)
2. Train a KNN classifier (Implement)
3. Make predictions on your test dataset (Implement)
4. Experimentation (Implement)

__You can use the KNN Classifier from here: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html__
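
As a rough orientation for steps 2 and 3, the sketch below fits a `KNeighborsClassifier` on a handful of made-up reviews and predicts the sentiment of two new ones. The texts, labels, and `toy_*` names are assumptions for illustration only and are not taken from the IMDB dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier

# Toy reviews (made up for illustration; not from the IMDB dataset)
toy_texts = [
    "great movie loved it",
    "wonderful acting great story",
    "terrible movie hated it",
    "awful acting boring story",
]
toy_labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

# Turn each review into a binary bag-of-words vector
toy_vectorizer = CountVectorizer(binary=True)
X_toy = toy_vectorizer.fit_transform(toy_texts)

# With n_neighbors=1, each prediction copies the label of the closest training review
toy_knn = KNeighborsClassifier(n_neighbors=1)
toy_knn.fit(X_toy, toy_labels)

# New reviews must be transformed with the already-fitted vectorizer
new_reviews = ["loved the great story", "boring and awful"]
toy_pred = toy_knn.predict(toy_vectorizer.transform(new_reviews))
print(toy_pred)  # [1 0]
```

The same fit, transform, and predict pattern applies to the real data, only with far more reviews and a larger vocabulary.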

## 1. Reading the dataset

We will use the __pandas__ library to read our dataset.

#### __Training data:__
Let's read our training data. Here, we have the text and label fields. The label is 1 for positive reviews and 0 for negative reviews.

In [None]:
import pandas as pd

train_df = pd.read_csv('https://raw.githubusercontent.com/aws-samples/aws-machine-learning-university-accelerated-nlp/master/data/final_project/imdb_train.csv', header=0)
train_df.head()

Unnamed: 0,text,label
0,This movie makes me want to throw up every tim...,0
1,Listening to the director's commentary confirm...,0
2,One of the best Tarzan films is also one of it...,1
3,Valentine is now one of my favorite slasher fi...,1
4,No mention if Ann Rivers Siddons adapted the m...,0


#### __Test data:__

In [None]:
import pandas as pd

test_df = pd.read_csv('https://raw.githubusercontent.com/aws-samples/aws-machine-learning-university-accelerated-nlp/master/data/final_project/imdb_test.csv', header=0)
test_df.head()

Unnamed: 0,text,label
0,What I hoped for (or even expected) was the we...,0
1,Garden State must rate amongst the most contri...,0
2,There is a lot wrong with this film. I will no...,1
3,"To qualify my use of ""realistic"" in the summar...",1
4,Dirty War is absolutely one of the best politi...,1


## 2. Train a KNN Classifier
Here, apply the pre-processing operations we covered in class, then split your dataset into training and validation sets. For your first submission, use the __K Nearest Neighbors Classifier__. It is available [here](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html).

In [None]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.metrics import classification_report

nltk.download('stopwords')
nltk.download('punkt')

stop = stopwords.words('english')

# Negation words to keep, i.e. to exclude from the stop-word list
excluding = ['against', 'not', 'don', "don't", 'ain', 'aren', "aren't", 'couldn', "couldn't",
             'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't",
             'haven', "haven't", 'isn', "isn't", 'mightn', "mightn't", 'mustn', "mustn't",
             'needn', "needn't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren',
             "weren't", 'won', "won't", 'wouldn', "wouldn't"]

stop_words = [word for word in stop if word not in excluding]
snow = SnowballStemmer('english')

def process_text(texts):
    final_text_list = []
    for sent in texts:
        # Treat missing values (anything that is not a string) as empty text
        if not isinstance(sent, str):
            sent = ""

        filtered_sentence = []

        sent = sent.lower()  # Convert to lowercase
        sent = sent.strip()  # Strip leading and trailing whitespace
        sent = re.sub(r'\s+', ' ', sent)  # Collapse runs of spaces and tabs into one space
        sent = re.compile('<.*?>').sub('', sent)  # Remove HTML tags

        for w in word_tokenize(sent):
            if (not w.isnumeric()) and (len(w) > 2) and (w not in stop_words):
                filtered_sentence.append(snow.stem(w))

        final_string = " ".join(filtered_sentence)
        final_text_list.append(final_string)

    return final_text_list

# Training data
train_df = pd.read_csv('https://raw.githubusercontent.com/aws-samples/aws-machine-learning-university-accelerated-nlp/master/data/final_project/imdb_train.csv', header=0)

# Test data
test_df = pd.read_csv('https://raw.githubusercontent.com/aws-samples/aws-machine-learning-university-accelerated-nlp/master/data/final_project/imdb_test.csv', header=0)

# Clean the text column of each DataFrame
train_df['text'] = process_text(train_df['text'])
test_df['text'] = process_text(test_df['text'])

# Vectorization
vectorizer = CountVectorizer(binary=True)

# Convert the texts into a document-term matrix
X_train = vectorizer.fit_transform(train_df['text'])
y_train = train_df['label']
X_test = vectorizer.transform(test_df['text'])
y_test = test_df['label']

X_train_split, X_val, y_train_split, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

# KNN with 5 neighbors
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_split, y_train_split)


y_val_pred = knn.predict(X_val)

val_accuracy = accuracy_score(y_val, y_val_pred)
val_f1 = f1_score(y_val, y_val_pred, average='weighted')
print(f'Accuracy: {val_accuracy:.4f}')
print(f'F1 Score: {val_f1:.4f}')
print(classification_report(y_val, y_val_pred))


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Accuracy: 0.6272
F1 Score: 0.6249
              precision    recall  f1-score   support

           0       0.64      0.55      0.59      2441
           1       0.62      0.70      0.66      2559

    accuracy                           0.63      5000
   macro avg       0.63      0.63      0.62      5000
weighted avg       0.63      0.63      0.62      5000



## 3. Make predictions on your test dataset

Once you have selected the best-performing model, use it to make predictions on the test dataset. Call __.fit()__ on the full training data with the best-performing K value, then call __.predict()__ on the test data to get your test predictions.
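
One possible way to organize this step is sketched below: compare a few K values on a validation split, then refit on the full training data with the winner and predict on the held-out test data. The synthetic data from `make_classification` is only a stand-in for the vectorized reviews, and all variable names here (`X_all`, `best_k`, `final_model`, and so on) are assumptions made for the sketch.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the vectorized reviews (an assumption for illustration)
X_all, y_all = make_classification(n_samples=300, n_features=20, random_state=42)

# Hold out a test set, then carve a validation set out of the training data
X_tr_full, X_te, y_tr_full, y_te = train_test_split(
    X_all, y_all, test_size=0.25, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_tr_full, y_tr_full, test_size=0.2, random_state=42)

# Score each candidate K on the validation split only
val_scores = {}
for k in [1, 3, 5, 7, 9]:
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    val_scores[k] = f1_score(y_va, model.predict(X_va), average='weighted')

best_k = max(val_scores, key=val_scores.get)

# Refit on all training data with the selected K, then predict on the test set
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_tr_full, y_tr_full)
test_pred = final_model.predict(X_te)
print(f'best K: {best_k}, test weighted F1: {f1_score(y_te, test_pred, average="weighted"):.4f}')
```

Keeping the test set out of the K selection means the final test score is not used to tune the model.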

In [None]:
# Train on all of the training data
knn_best = KNeighborsClassifier(n_neighbors=5)  # 5 neighbors
knn_best.fit(X_train, y_train)

# Predict on the test set
y_test_pred = knn_best.predict(X_test)

# Check performance
test_accuracy = accuracy_score(y_test, y_test_pred)
test_f1 = f1_score(y_test, y_test_pred, average='weighted')

print(f'Accuracy: {test_accuracy:.4f}')
print(f'F1 Score: {test_f1:.4f}')

# Classification report for each label
print(classification_report(y_test, y_test_pred))


Accuracy: 0.6284
F1 Score: 0.6284
              precision    recall  f1-score   support

           0       0.63      0.64      0.63     12500
           1       0.63      0.62      0.62     12500

    accuracy                           0.63     25000
   macro avg       0.63      0.63      0.63     25000
weighted avg       0.63      0.63      0.63     25000



## 4. Experimentation

For each of the following tasks, track both the **weighted F1-score** and **accuracy**:

1. **Change the binary parameter in CountVectorizer**: Test both `binary=True` and `binary=False`, and evaluate performance.
2. **Switch to TfidfVectorizer**: Replace the CountVectorizer with TfidfVectorizer and compare results.
3. **Adjust the max_features**: Experiment with different values of `max_features` for both TfidfVectorizer and CountVectorizer (`binary=True`).
4. **Optimize KNN**: Select the best-performing model from task 3 and vary the number of neighbors (`n_neighbors`) in the KNN classifier.
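
To keep track of both metrics across these tasks, one option is a small helper that fits a vectorizer and a KNN classifier together and returns the scores as a row of a results table. The sketch below is only an illustration: `evaluate_config`, the `demo_*` toy reviews, and the configuration names are hypothetical, and the pipeline is built with scikit-learn's `make_pipeline` so that the vectorizer is fitted on the training split only.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score, f1_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline


def evaluate_config(name, vectorizer, n_neighbors, fit_texts, fit_labels, val_texts, val_labels):
    """Fit vectorizer + KNN on the fit split and score on the validation split."""
    pipeline = make_pipeline(vectorizer, KNeighborsClassifier(n_neighbors=n_neighbors))
    pipeline.fit(fit_texts, fit_labels)
    val_pred = pipeline.predict(val_texts)
    return {
        'config': name,
        'n_neighbors': n_neighbors,
        'accuracy': accuracy_score(val_labels, val_pred),
        'weighted_f1': f1_score(val_labels, val_pred, average='weighted'),
    }


# Made-up toy reviews standing in for the cleaned IMDB text
demo_fit_texts = ["great film love", "love great act", "bad film hate", "hate bad plot"]
demo_fit_labels = [1, 1, 0, 0]
demo_val_texts = ["love film", "bad plot"]
demo_val_labels = [1, 0]

# One entry per configuration to compare
demo_configs = [
    ('count_binary', CountVectorizer(binary=True)),
    ('count_raw', CountVectorizer(binary=False)),
    ('tfidf', TfidfVectorizer()),
]

results_df = pd.DataFrame([
    evaluate_config(name, vec, 1, demo_fit_texts, demo_fit_labels, demo_val_texts, demo_val_labels)
    for name, vec in demo_configs
])
print(results_df.sort_values('weighted_f1', ascending=False))
```

On the real data, the same helper could be called with the training and validation texts, different `max_features` values, and several `n_neighbors` values, so that all experiments end up in one comparable table.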


In [None]:

vectorizer_binary_false = CountVectorizer(binary=False)

X_train_binary_false = vectorizer_binary_false.fit_transform(train_df['text'])
X_test_binary_false = vectorizer_binary_false.transform(test_df['text'])

X_train_split, X_val, y_train_split, y_val = train_test_split(X_train_binary_false, y_train, test_size=0.2, random_state=42)
knn.fit(X_train_split, y_train_split)
y_val_pred = knn.predict(X_val)

print(f'Accuracy -> binary=False: {accuracy_score(y_val, y_val_pred):.4f}')
print(f'F1 Score -> binary=False: {f1_score(y_val, y_val_pred, average="weighted"):.4f}')
print(classification_report(y_val, y_val_pred))




Accuracy -> binary=False: 0.6442
F1 Score -> binary=False: 0.6427
              precision    recall  f1-score   support

           0       0.65      0.58      0.61      2441
           1       0.64      0.71      0.67      2559

    accuracy                           0.64      5000
   macro avg       0.65      0.64      0.64      5000
weighted avg       0.65      0.64      0.64      5000



In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()
X_train_tfidf = tfidf_vectorizer.fit_transform(train_df['text'])
X_test_tfidf = tfidf_vectorizer.transform(test_df['text'])

X_train_split, X_val, y_train_split, y_val = train_test_split(X_train_tfidf, y_train, test_size=0.2, random_state=42)
knn.fit(X_train_split, y_train_split)
y_val_pred = knn.predict(X_val)

print(f'Accuracy -> TfidfVectorizer: {accuracy_score(y_val, y_val_pred):.4f}')
print(f'F1 Score -> TfidfVectorizer: {f1_score(y_val, y_val_pred, average="weighted"):.4f}')
print(classification_report(y_val, y_val_pred))



Accuracy -> TfidfVectorizer: 0.7960
F1 Score -> TfidfVectorizer: 0.7954
              precision    recall  f1-score   support

           0       0.82      0.75      0.78      2441
           1       0.78      0.84      0.81      2559

    accuracy                           0.80      5000
   macro avg       0.80      0.79      0.80      5000
weighted avg       0.80      0.80      0.80      5000



In [None]:
print("Using CountVectorizer:")
for max_features in [5000, 10000, 20000]:
    vectorizer = CountVectorizer(max_features=max_features, binary=True)
    X_train_count = vectorizer.fit_transform(train_df['text'])
    X_test_count = vectorizer.transform(test_df['text'])

    X_train_split, X_val, y_train_split, y_val = train_test_split(X_train_count, y_train, test_size=0.2, random_state=42)
    knn.fit(X_train_split, y_train_split)
    y_val_pred = knn.predict(X_val)

    print(f'Accuracy -> CountVectorizer -> binary=True (max_features={max_features}): {accuracy_score(y_val, y_val_pred):.4f}')
    print(f'F1 Score -> CountVectorizer -> binary=True (max_features={max_features}): {f1_score(y_val, y_val_pred, average="weighted"):.4f}')
    print(classification_report(y_val, y_val_pred))


print("\nUsing TfidfVectorizer:")
for max_features in [5000, 10000, 20000]:
    tfidf_vectorizer = TfidfVectorizer(max_features=max_features)
    X_train_tfidf = tfidf_vectorizer.fit_transform(train_df['text'])
    X_test_tfidf = tfidf_vectorizer.transform(test_df['text'])

    X_train_split, X_val, y_train_split, y_val = train_test_split(X_train_tfidf, y_train, test_size=0.2, random_state=42)
    knn.fit(X_train_split, y_train_split)
    y_val_pred = knn.predict(X_val)

    print(f'Accuracy -> TfidfVectorizer (max_features={max_features}): {accuracy_score(y_val, y_val_pred):.4f}')
    print(f'F1 Score -> TfidfVectorizer (max_features={max_features}): {f1_score(y_val, y_val_pred, average="weighted"):.4f}')
    print(classification_report(y_val, y_val_pred))


Using CountVectorizer:
Accuracy -> CountVectorizer -> binary=True (max_features=5000): 0.6538
F1 Score -> CountVectorizer -> binary=True (max_features=5000): 0.6506
              precision    recall  f1-score   support

           0       0.68      0.56      0.61      2441
           1       0.64      0.74      0.69      2559

    accuracy                           0.65      5000
   macro avg       0.66      0.65      0.65      5000
weighted avg       0.66      0.65      0.65      5000

Accuracy -> CountVectorizer -> binary=True (max_features=10000): 0.6352
F1 Score -> CountVectorizer -> binary=True (max_features=10000): 0.6347
              precision    recall  f1-score   support

           0       0.63      0.60      0.62      2441
           1       0.64      0.67      0.65      2559

    accuracy                           0.64      5000
   macro avg       0.64      0.63      0.63      5000
weighted avg       0.64      0.64      0.63      5000

Accuracy -> CountVectorizer -> binary

In [None]:
n_neighbors_list = [1, 3, 5, 7, 9]
max_features = 20000

tfidf_vectorizer = TfidfVectorizer(max_features=max_features)
X_train_tfidf = tfidf_vectorizer.fit_transform(train_df['text'])
X_test_tfidf = tfidf_vectorizer.transform(test_df['text'])

X_train_split, X_val, y_train_split, y_val = train_test_split(X_train_tfidf, y_train, test_size=0.2, random_state=42)

for n_neighbors in n_neighbors_list:
    knn = KNeighborsClassifier(n_neighbors=n_neighbors)
    knn.fit(X_train_split, y_train_split)
    y_val_pred = knn.predict(X_val)

    print(f'Accuracy -> TfidfVectorizer (max_features={max_features}, n_neighbors={n_neighbors}): {accuracy_score(y_val, y_val_pred):.4f}')
    print(f'F1 Score -> TfidfVectorizer (max_features={max_features}, n_neighbors={n_neighbors}): {f1_score(y_val, y_val_pred, average="weighted"):.4f}')
    print(classification_report(y_val, y_val_pred))



Accuracy -> TfidfVectorizer (max_features=20000, n_neighbors=1): 0.7630
F1 Score -> TfidfVectorizer (max_features=20000, n_neighbors=1): 0.7625
              precision    recall  f1-score   support

           0       0.78      0.72      0.75      2441
           1       0.75      0.80      0.78      2559

    accuracy                           0.76      5000
   macro avg       0.76      0.76      0.76      5000
weighted avg       0.76      0.76      0.76      5000

Accuracy -> TfidfVectorizer (max_features=20000, n_neighbors=3): 0.7836
F1 Score -> TfidfVectorizer (max_features=20000, n_neighbors=3): 0.7830
              precision    recall  f1-score   support

           0       0.80      0.74      0.77      2441
           1       0.77      0.83      0.80      2559

    accuracy                           0.78      5000
   macro avg       0.79      0.78      0.78      5000
weighted avg       0.79      0.78      0.78      5000

Accuracy -> TfidfVectorizer (max_features=20000, n_neighbo