# Natural Language Processing

## Assignment: K Nearest Neighbors Model for the IMDB Movie Review Dataset

For the final project, build a K Nearest Neighbors model to predict the sentiment (positive or negative) of movie reviews. The dataset is originally hosted here: http://ai.stanford.edu/~amaas/data/sentiment/

Using the class notebooks as a reference, implement the model, then train and test it on the corresponding datasets.

You can follow these steps:
1. Read training-test data (Given)
2. Train a KNN classifier (Implement)
3. Make predictions on your test dataset (Implement)
4. Experimentation (Implement)

__You can use the KNN Classifier from here: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html__

## 1. Reading the dataset

We will use the __pandas__ library to read our dataset.

#### __Training data:__
Let's read our training data. It has two fields, `text` and `label`. The label is 1 for positive reviews and 0 for negative reviews.

In [143]:
import pandas as pd

train_df = pd.read_csv('https://raw.githubusercontent.com/aws-samples/aws-machine-learning-university-accelerated-nlp/master/data/final_project/imdb_train.csv', header=0)
train_df.head()

Unnamed: 0,text,label
0,This movie makes me want to throw up every tim...,0
1,Listening to the director's commentary confirm...,0
2,One of the best Tarzan films is also one of it...,1
3,Valentine is now one of my favorite slasher fi...,1
4,No mention if Ann Rivers Siddons adapted the m...,0


In [144]:
print(train_df["label"].value_counts())
print(train_df.isna().sum())

label
0    12500
1    12500
Name: count, dtype: int64
text     0
label    0
dtype: int64


#### __Test data:__

In [145]:
import pandas as pd

test_df = pd.read_csv('https://raw.githubusercontent.com/aws-samples/aws-machine-learning-university-accelerated-nlp/master/data/final_project/imdb_test.csv', header=0)
test_df.head()

Unnamed: 0,text,label
0,What I hoped for (or even expected) was the we...,0
1,Garden State must rate amongst the most contri...,0
2,There is a lot wrong with this film. I will no...,1
3,"To qualify my use of ""realistic"" in the summar...",1
4,Dirty War is absolutely one of the best politi...,1


In [146]:
print(test_df["label"].value_counts())
print(test_df.isna().sum())

label
0    12500
1    12500
Name: count, dtype: int64
text     0
label    0
dtype: int64


## 2. Train a KNN Classifier
Here, apply the pre-processing operations covered in class, then split the dataset into training and validation sets. For your first submission, use the __K Nearest Neighbors Classifier__. It is available [here](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html).

In [147]:
# Implement this

import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import nltk

nltk.download('punkt')
nltk.download('stopwords')

# train_df = pd.read_csv('https://raw.githubusercontent.com/aws-samples/aws-machine-learning-university-accelerated-nlp/master/data/final_project/imdb_train.csv')

stop_words = set(stopwords.words('english'))

excluding = ['against', 'not', 'don', "don't", 'ain', 'aren', "aren't", 'couldn', "couldn't",
             'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't",
             'haven', "haven't", 'isn', "isn't", 'mightn', "mightn't", 'mustn', "mustn't",
             'needn', "needn't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren',
             "weren't", 'won', "won't", 'wouldn', "wouldn't"]

stop_words = stop_words.difference(excluding)

def clean_text(text):
    text = text.lower()
    text = re.sub(r'<.*?>', '', text)
    text = re.sub(r'[^\w\s]', '', text)
    words = word_tokenize(text)
    cleaned_words = [word for word in words if word not in stop_words]
    final_string = ' '.join(cleaned_words)
    return final_string

# train_df['cleaned_text'] = train_df['text'].head(10).apply(clean_text)
# print("Original Text :", train_df['text'].iloc[0])
# print("Clean Text :", train_df['cleaned_text'].iloc[0])



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
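
The two regular-expression steps in `clean_text` can be checked in isolation. The sketch below uses only the standard `re` module and a made-up review string (an assumption, not a row from the dataset). It also shows a side effect worth knowing about: because tags are replaced with an empty string, the words on either side of a `<br />` are fused once punctuation is removed. The `strip_markup_spaced` variant is a hypothetical alternative, not part of the notebook's pipeline.

```python
import re

def strip_markup(text):
    """Same two regex steps as clean_text: drop tags, then punctuation."""
    text = text.lower()
    text = re.sub(r'<.*?>', '', text)      # remove HTML tags such as <br />
    text = re.sub(r'[^\w\s]', '', text)    # remove punctuation
    return text

def strip_markup_spaced(text):
    """Hypothetical variant: replace tags and punctuation with a space."""
    text = text.lower()
    text = re.sub(r'<.*?>', ' ', text)
    text = re.sub(r'[^\w\s]', ' ', text)
    return ' '.join(text.split())          # collapse repeated whitespace

sample = "Great film.<br /><br />Don't miss it!"  # made-up example
print(strip_markup(sample))         # great filmdont miss it
print(strip_markup_spaced(sample))  # great film don t miss it
```

Neither choice is free: the spaced variant keeps `film` and `don` apart but splits contractions (`don't` becomes `don t`), so it is worth inspecting a few cleaned reviews before settling on one.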


In [148]:
from sklearn.model_selection import train_test_split

X=train_df[["text"]]
Y=train_df["label"]
X_train, X_val, y_train, y_val = train_test_split(X,
                                                  Y,
                                                  test_size=0.10,
                                                  shuffle=True,
                                                  random_state=324
                                                 )

In [149]:
print("Processing the reviewText fields")
train_text_list = X_train['text'].apply(clean_text)
val_text_list = X_val['text'].apply(clean_text)

# print("Original text:", X_train['text'].iloc[0])
# print("Cleaned text:", X_train['cleaned_text'].iloc[0])

Processing the reviewText fields


In [150]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier


import gensim
from gensim.models import Word2Vec
### PIPELINE ###
##########################
w2v = gensim.models.Word2Vec()
pipeline = Pipeline([
    # ('text_vect', CountVectorizer(binary=True,
    ( 'text_vect', TfidfVectorizer(use_idf=True,
                                  max_features=15000, ngram_range=(1, 2))),
    ('knn', KNeighborsClassifier(9))
                                ])
from sklearn import set_config
set_config(display='diagram')
pipeline

## 3. Make predictions on your test dataset

Once we have selected the best-performing model, we can use it to make predictions on the test dataset. Call __.fit()__ on your training data with the best-performing K value, then call __.predict()__ on your test data to get the test predictions.
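
The fit-then-predict flow might look like the following minimal sketch. It uses a handful of invented reviews in place of the IMDB data, and `n_neighbors=1` only because the toy set is tiny. In the notebook you would instead pass the cleaned training text, the best K found on the validation set, and `test_df['text'].apply(clean_text)`. All `toy_*` names are hypothetical.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Invented stand-ins for the cleaned training and test reviews.
toy_train_text = ["great wonderful film", "loved great acting",
                  "terrible boring film", "awful boring plot"]
toy_train_labels = [1, 1, 0, 0]
toy_test_text = ["wonderful acting", "boring awful"]
toy_test_labels = [1, 0]

toy_pipeline = Pipeline([
    ('text_vect', TfidfVectorizer()),
    ('knn', KNeighborsClassifier(n_neighbors=1)),  # K=1 only for this toy set
])

# Fit on the training split, then predict on the held-out test split.
toy_pipeline.fit(toy_train_text, toy_train_labels)
toy_test_predictions = toy_pipeline.predict(toy_test_text)
print(toy_test_predictions)
print("Accuracy (test):", accuracy_score(toy_test_labels, toy_test_predictions))
```

Keeping the vectorizer inside the pipeline means the test text is transformed with the vocabulary learned from the training data, so nothing from the test set leaks into fitting.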

In [151]:
# Use the lists of processed text fields
X_train = train_text_list
X_val = val_text_list

# Fit the Pipeline to training data
pipeline.fit(X_train, y_train.values)

In [152]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

val_predictions = pipeline.predict(X_val)
print(confusion_matrix(y_val.values, val_predictions))
print(classification_report(y_val.values, val_predictions))
print("Accuracy (validation):", accuracy_score(y_val.values, val_predictions))

[[ 941  339]
 [ 195 1025]]
              precision    recall  f1-score   support

           0       0.83      0.74      0.78      1280
           1       0.75      0.84      0.79      1220

    accuracy                           0.79      2500
   macro avg       0.79      0.79      0.79      2500
weighted avg       0.79      0.79      0.79      2500

Accuracy (validation): 0.7864


## 4. Experimentation

For each of the following tasks, track both the **weighted F1-score** and **accuracy**:

1. **Change the binary parameter in CountVectorizer**: Test both `binary=True` and `binary=False`, and evaluate performance.
2. **Switch to TfidfVectorizer**: Replace the CountVectorizer with TfidfVectorizer and compare results.
3. **Adjust the max_features**: Experiment with different values of `max_features` for both TfidfVectorizer and CountVectorizer (`binary=True`).
4. **Optimize KNN**: Select the best-performing model from task 3 and vary the number of neighbors (`n_neighbors`) in the KNN classifier.
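
Both tracked metrics can be derived from a confusion matrix, which gives a quick cross-check on the printed reports. The sketch below is plain Python and assumes scikit-learn's layout (rows are true labels, columns are predictions). The `weighted_f1_and_accuracy` helper is hypothetical, and the matrix is the validation result printed in Section 3. In practice, `f1_score(y_true, y_pred, average='weighted')` from `sklearn.metrics` computes the same quantity.

```python
def weighted_f1_and_accuracy(cm):
    """cm[i][j] = number of samples with true label i predicted as j."""
    n_classes = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(n_classes)) / total

    weighted_f1 = 0.0
    for i in range(n_classes):
        tp = cm[i][i]
        fn = sum(cm[i]) - tp                               # true i, predicted other
        fp = sum(cm[r][i] for r in range(n_classes)) - tp  # predicted i, true other
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        support = sum(cm[i])
        weighted_f1 += f1 * support / total                # weight by class support
    return weighted_f1, accuracy

cm = [[941, 339],
      [195, 1025]]
wf1, acc = weighted_f1_and_accuracy(cm)
print(f"weighted F1: {wf1:.4f}, accuracy: {acc:.4f}")
# weighted F1: 0.7860, accuracy: 0.7864
```

On a nearly balanced validation set like this one, the two numbers stay close, so a large gap between them would be a hint to look at the per-class rows.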


In [153]:
# Task 1: CountVectorizer with binary=True
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier


import gensim
from gensim.models import Word2Vec
### PIPELINE ###
##########################
w2v = gensim.models.Word2Vec()
pipeline = Pipeline([
    ('text_vect', CountVectorizer(binary=True,
    # ( 'text_vect', TfidfVectorizer(use_idf=True,
                                  max_features=15000, ngram_range=(1, 2))),
    ('knn', KNeighborsClassifier(9))
                                ])

from sklearn import set_config
set_config(display='diagram')
pipeline

X_train = train_text_list
X_val = val_text_list

# Fit the Pipeline to training data
pipeline.fit(X_train, y_train.values)

from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

val_predictions = pipeline.predict(X_val)
print(confusion_matrix(y_val.values, val_predictions))
print(classification_report(y_val.values, val_predictions))
print("Accuracy (validation):", accuracy_score(y_val.values, val_predictions))



[[921 359]
 [646 574]]
              precision    recall  f1-score   support

           0       0.59      0.72      0.65      1280
           1       0.62      0.47      0.53      1220

    accuracy                           0.60      2500
   macro avg       0.60      0.60      0.59      2500
weighted avg       0.60      0.60      0.59      2500

Accuracy (validation): 0.598


In [154]:
# Task 1: CountVectorizer with binary=False
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier


import gensim
from gensim.models import Word2Vec
### PIPELINE ###
##########################
w2v = gensim.models.Word2Vec()
pipeline = Pipeline([
    ('text_vect', CountVectorizer(binary=False,
    # ( 'text_vect', TfidfVectorizer(use_idf=True,
                                  max_features=15000, ngram_range=(1, 2))),
    ('knn', KNeighborsClassifier(9))
                                ])

from sklearn import set_config
set_config(display='diagram')
pipeline

X_train = train_text_list
X_val = val_text_list

pipeline.fit(X_train, y_train.values)

from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

val_predictions = pipeline.predict(X_val)
print(confusion_matrix(y_val.values, val_predictions))
print(classification_report(y_val.values, val_predictions))
print("Accuracy (validation):", accuracy_score(y_val.values, val_predictions))



[[936 344]
 [659 561]]
              precision    recall  f1-score   support

           0       0.59      0.73      0.65      1280
           1       0.62      0.46      0.53      1220

    accuracy                           0.60      2500
   macro avg       0.60      0.60      0.59      2500
weighted avg       0.60      0.60      0.59      2500

Accuracy (validation): 0.5988


In [155]:
# Task 2: TfidfVectorizer with use_idf=True (note: max_features=20000 in this run)

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier


import gensim
from gensim.models import Word2Vec
### PIPELINE ###
##########################
w2v = gensim.models.Word2Vec()
pipeline = Pipeline([
    # ('text_vect', CountVectorizer(binary=True,
    ( 'text_vect', TfidfVectorizer(use_idf=True,
                                  max_features=20000, ngram_range=(1, 2))),
    ('knn', KNeighborsClassifier(9))
                                ])

from sklearn import set_config
set_config(display='diagram')
pipeline

X_train = train_text_list
X_val = val_text_list

pipeline.fit(X_train, y_train.values)

from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

val_predictions = pipeline.predict(X_val)
print(confusion_matrix(y_val.values, val_predictions))
print(classification_report(y_val.values, val_predictions))
print("Accuracy (validation):", accuracy_score(y_val.values, val_predictions))


[[ 943  337]
 [ 185 1035]]
              precision    recall  f1-score   support

           0       0.84      0.74      0.78      1280
           1       0.75      0.85      0.80      1220

    accuracy                           0.79      2500
   macro avg       0.80      0.79      0.79      2500
weighted avg       0.80      0.79      0.79      2500

Accuracy (validation): 0.7912


In [156]:
# Task 2: TfidfVectorizer with use_idf=False

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier


import gensim
from gensim.models import Word2Vec
### PIPELINE ###
##########################
w2v = gensim.models.Word2Vec()
pipeline = Pipeline([
    # ('text_vect', CountVectorizer(binary=True,
    ( 'text_vect', TfidfVectorizer(use_idf=False,
                                  max_features=15000, ngram_range=(1, 2))),
    ('knn', KNeighborsClassifier(9))
                                ])

from sklearn import set_config
set_config(display='diagram')
pipeline

X_train = train_text_list
X_val = val_text_list

pipeline.fit(X_train, y_train.values)

from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

val_predictions = pipeline.predict(X_val)
print(confusion_matrix(y_val.values, val_predictions))
print(classification_report(y_val.values, val_predictions))
print("Accuracy (validation):", accuracy_score(y_val.values, val_predictions))


[[968 312]
 [332 888]]
              precision    recall  f1-score   support

           0       0.74      0.76      0.75      1280
           1       0.74      0.73      0.73      1220

    accuracy                           0.74      2500
   macro avg       0.74      0.74      0.74      2500
weighted avg       0.74      0.74      0.74      2500

Accuracy (validation): 0.7424


In [None]:
# Task 3 TfidfVectorizer

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

import gensim
import numpy as np

max_features_list = [100, 500, 1000, 2000, 5000, 10000, 20000]

results = {}

for max_features in max_features_list:
    print(f"Testing max_features: {max_features}")

    pipeline = Pipeline([
        ('text_vect', TfidfVectorizer(use_idf=True, max_features=max_features, ngram_range=(1, 2))),
        ('knn', KNeighborsClassifier(9))
    ])

    pipeline.fit(X_train, y_train.values)

    val_predictions = pipeline.predict(X_val)

    cm = confusion_matrix(y_val.values, val_predictions)
    report = classification_report(y_val.values, val_predictions, output_dict=True)
    accuracy = accuracy_score(y_val.values, val_predictions)

    results[max_features] = {
        'confusion_matrix': cm,
        'classification_report': report,
        'accuracy': accuracy
    }

    print(cm)
    print(classification_report(y_val.values, val_predictions))
    print("Accuracy (validation):", accuracy)

print("Results for all max_features tested:")
for max_features, metrics in results.items():
    print(f"max_features: {max_features}, Accuracy: {metrics['accuracy']:.4f}")


Testing max_features: 100
[[869 411]
 [409 811]]
              precision    recall  f1-score   support

           0       0.68      0.68      0.68      1280
           1       0.66      0.66      0.66      1220

    accuracy                           0.67      2500
   macro avg       0.67      0.67      0.67      2500
weighted avg       0.67      0.67      0.67      2500

Accuracy (validation): 0.672
Testing max_features: 500
[[890 390]
 [284 936]]
              precision    recall  f1-score   support

           0       0.76      0.70      0.73      1280
           1       0.71      0.77      0.74      1220

    accuracy                           0.73      2500
   macro avg       0.73      0.73      0.73      2500
weighted avg       0.73      0.73      0.73      2500

Accuracy (validation): 0.7304
Testing max_features: 1000
[[878 402]
 [251 969]]
              precision    recall  f1-score   support

           0       0.78      0.69      0.73      1280
           1       0.71      0

In [None]:
# Task 3: CountVectorizer (binary=True)

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

import gensim
import numpy as np

max_features_list = [100, 500, 1000, 2000, 5000, 10000, 20000]

results = {}

for max_features in max_features_list:
    print(f"Testing max_features: {max_features}")

    pipeline = Pipeline([
        ('text_vect', CountVectorizer(binary=True, max_features=max_features, ngram_range=(1, 2))),
        ('knn', KNeighborsClassifier(9))
    ])

    pipeline.fit(X_train, y_train.values)

    val_predictions = pipeline.predict(X_val)

    cm = confusion_matrix(y_val.values, val_predictions)
    report = classification_report(y_val.values, val_predictions, output_dict=True)
    accuracy = accuracy_score(y_val.values, val_predictions)

    results[max_features] = {
        'confusion_matrix': cm,
        'classification_report': report,
        'accuracy': accuracy
    }

    print(cm)
    print(classification_report(y_val.values, val_predictions))
    print("Accuracy (validation):", accuracy)

print("Results for all max_features tested:")
for max_features, metrics in results.items():
    print(f"max_features: {max_features}, Accuracy: {metrics['accuracy']:.4f}")


In [None]:
# Task 4

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn import set_config

import gensim

neighbors_list = [1, 3, 5, 7, 9, 11]

results = {}

for n_neighbors in neighbors_list:
    print(f"Testing n_neighbors: {n_neighbors}")

    pipeline = Pipeline([
        ('text_vect', TfidfVectorizer(use_idf=True, max_features=20000, ngram_range=(1, 2))),
        ('knn', KNeighborsClassifier(n_neighbors=n_neighbors))
    ])

    set_config(display='diagram')

    pipeline.fit(X_train, y_train.values)
    val_predictions = pipeline.predict(X_val)

    cm = confusion_matrix(y_val.values, val_predictions)
    report = classification_report(y_val.values, val_predictions, output_dict=True)
    accuracy = accuracy_score(y_val.values, val_predictions)

    results[n_neighbors] = {
        'confusion_matrix': cm,
        'classification_report': report,
        'accuracy': accuracy
    }

    print(cm)
    print(classification_report(y_val.values, val_predictions))
    print("Accuracy (validation):", accuracy)

print("Results for all n_neighbors tested:")
for n_neighbors, metrics in results.items():
    print(f"n_neighbors: {n_neighbors}, Accuracy: {metrics['accuracy']:.4f}")
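
The summary loops above print accuracy only, while the task asks for the weighted F1-score as well. Because the reports are stored with `output_dict=True`, the weighted F1 is available under `report['weighted avg']['f1-score']`. A minimal sketch, using a hand-built `results_demo` dictionary with the same shape as `results` (its two entries copy the rounded values printed for `max_features` 100 and 500 in Task 3; the name `results_demo` is hypothetical):

```python
# Same shape as the notebook's `results`, trimmed to the fields used here.
results_demo = {
    100: {'accuracy': 0.672,
          'classification_report': {'weighted avg': {'f1-score': 0.67}}},
    500: {'accuracy': 0.7304,
          'classification_report': {'weighted avg': {'f1-score': 0.73}}},
}

for setting, metrics in results_demo.items():
    weighted_f1 = metrics['classification_report']['weighted avg']['f1-score']
    print(f"setting: {setting}, weighted F1: {weighted_f1:.2f}, "
          f"accuracy: {metrics['accuracy']:.4f}")

# Pick the setting with the highest weighted F1, breaking ties on accuracy.
best_setting = max(
    results_demo,
    key=lambda s: (results_demo[s]['classification_report']['weighted avg']['f1-score'],
                   results_demo[s]['accuracy']),
)
print("Best setting:", best_setting)
```

The same loop works unchanged on the `results` dictionaries built in Tasks 3 and 4, since they store the report under the `'classification_report'` key.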
