# Natural Language Processing

## Assignment: K Nearest Neighbors Model for the IMDB Movie Review Dataset

For the final project, build a K Nearest Neighbors model to predict the sentiment (positive or negative) of movie reviews. The dataset is originally hosted here: http://ai.stanford.edu/~amaas/data/sentiment/

Using the notebooks from class, implement the model, then train and test it with the corresponding datasets.

You can follow these steps:
1. Read training-test data (Given)
2. Train a KNN classifier (Implement)
3. Make predictions on your test dataset (Implement)
4. Experimentation (Implement)

__You can use the KNN Classifier from here: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html__
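
As a quick orientation before working with text, here is a minimal sketch of how `KNeighborsClassifier` is fitted and queried. The toy points and the names `X_toy`, `y_toy`, and `toy_predictions` are illustrative assumptions, not part of the assignment data.

```python
from sklearn.neighbors import KNeighborsClassifier

# Toy 2-D points (hypothetical): class 0 sits near the origin, class 1 near (5, 5).
X_toy = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y_toy = [0, 0, 0, 1, 1, 1]

# Each prediction is a majority vote among the 3 closest training points.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_toy, y_toy)

toy_predictions = knn.predict([[0.5, 0.5], [5.5, 5.5]])
print(toy_predictions)  # first point is assigned class 0, second class 1
```

For the reviews, the same `fit`/`predict` calls apply once the text has been converted to numeric features.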

## 1. Reading the dataset

We will use the __pandas__ library to read our dataset.

#### __Training data:__
Let's read our training data. Here, we have the text and label fields. The label is 1 for positive reviews and 0 for negative reviews.

In [2]:
import pandas as pd

train_df = pd.read_csv('https://raw.githubusercontent.com/aws-samples/aws-machine-learning-university-accelerated-nlp/master/data/final_project/imdb_train.csv', header=0)
train_df.head()

Unnamed: 0,text,label
0,This movie makes me want to throw up every tim...,0
1,Listening to the director's commentary confirm...,0
2,One of the best Tarzan films is also one of it...,1
3,Valentine is now one of my favorite slasher fi...,1
4,No mention if Ann Rivers Siddons adapted the m...,0
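
The `Unnamed: 0` column in the preview appears to be a row index that was saved into the CSV. A minimal sketch, assuming that is the case, of reading it as the index and checking the label balance; the two-row `sample_csv` is a hypothetical stand-in for the real file.

```python
import io
import pandas as pd

# Hypothetical two-row sample that mimics the layout of imdb_train.csv.
sample_csv = io.StringIO(
    ",text,label\n"
    "0,This movie was dreadful,0\n"
    "1,One of the best films,1\n"
)

# index_col=0 treats the unnamed first column as the row index
# instead of exposing it as a feature column.
sample_df = pd.read_csv(sample_csv, header=0, index_col=0)

print(sample_df.columns.tolist())
print(sample_df["label"].value_counts().to_dict())  # reviews per class
```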


#### __Test data:__

In [1]:
import pandas as pd

test_df = pd.read_csv('https://raw.githubusercontent.com/aws-samples/aws-machine-learning-university-accelerated-nlp/master/data/final_project/imdb_test.csv', header=0)
test_df.head()

Unnamed: 0,text,label
0,What I hoped for (or even expected) was the we...,0
1,Garden State must rate amongst the most contri...,0
2,There is a lot wrong with this film. I will no...,1
3,"To qualify my use of ""realistic"" in the summar...",1
4,Dirty War is absolutely one of the best politi...,1


## 2. Train a KNN Classifier
Here, you will apply the pre-processing operations covered in class, then split your dataset into training and validation sets. For your first submission, use the __K Nearest Neighbors Classifier__, which is available [here](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html).

In [17]:
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn import set_config

# NOTE: the split below is drawn from test_df; the experiments in Section 4 use the same split.
X = test_df['text']
y = test_df['label']

# Splitting the data into training (90%) and validation (10%) sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.1, random_state=324)

# Binary bag-of-words features (limited to 100 terms) fed into a KNN classifier
pipeline = Pipeline([
    ('text_vect', CountVectorizer(binary=True, max_features=100)),
    ('knn', KNeighborsClassifier())
])

# Render the pipeline as a diagram in the notebook
set_config(display='diagram')
pipeline
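
The pipeline above passes raw review text straight to the vectorizer. A minimal sketch of one way to plug a cleaning step in through the `preprocessor` argument of `CountVectorizer`; the `clean_text` helper and its rules are assumptions for illustration, not necessarily the pre-processing covered in class.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer

def clean_text(text):
    """Hypothetical cleaner: lowercase, drop HTML tags, keep letters only."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)      # remove tags such as <br />
    text = re.sub(r"[^a-z\s]", " ", text)     # replace digits and punctuation
    return re.sub(r"\s+", " ", text).strip()  # collapse repeated whitespace

# The vectorizer calls clean_text on every document before tokenizing.
vectorizer = CountVectorizer(preprocessor=clean_text, binary=True)
vectorizer.fit(["Great film!<br /><br />Loved it.", "Terrible... 2/10"])

print(clean_text("Great film!<br /><br />Loved it."))
print(sorted(vectorizer.vocabulary_))
```

Keeping the cleaner inside the vectorizer means the same transformation is applied automatically at both `fit` and `predict` time.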




## 3. Make predictions on your test dataset

Once you have selected the best-performing model, use it to make predictions on the test dataset: call __.fit()__ on your training data with the best-performing K value, then call __.predict()__ on your test data to get the test predictions.

In [18]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

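# Fit on the full training set.
X_train = train_df['text']
y_train = train_df['label']
# NOTE: X_val/y_val reuse the same training data, so the metrics printed below
# are training-set scores rather than held-out validation or test scores.
X_val = train_df['text']
y_val = train_df['label']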

pipeline.fit(X_train, y_train.values)

val_predictions = pipeline.predict(X_val)
print(confusion_matrix(y_val.values, val_predictions))
print(classification_report(y_val.values, val_predictions))
print("Accuracy (validation):", accuracy_score(y_val.values, val_predictions))


[[ 8033  4467]
 [ 1220 11280]]
              precision    recall  f1-score   support

           0       0.87      0.64      0.74     12500
           1       0.72      0.90      0.80     12500

    accuracy                           0.77     25000
   macro avg       0.79      0.77      0.77     25000
weighted avg       0.79      0.77      0.77     25000

Accuracy (validation): 0.77252
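
To see how the report follows from the confusion matrix, the figures can be recomputed by hand. This sketch only copies the matrix printed above (rows are true labels, columns are predicted labels) and redoes the arithmetic; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows: true label 0/1, columns: predicted label 0/1).
cm = np.array([[8033, 4467],
               [1220, 11280]])

precision = cm.diagonal() / cm.sum(axis=0)  # correct / everything predicted as that class
recall = cm.diagonal() / cm.sum(axis=1)     # correct / everything truly in that class
accuracy = cm.trace() / cm.sum()            # all correct / all samples

print(np.round(precision, 2))  # per-class precision
print(np.round(recall, 2))     # per-class recall
print(accuracy)
```

The low recall for class 0 corresponds to the 4,467 negative reviews that were predicted as positive.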


## 4. Experimentation

For each of the following tasks, track both the **weighted F1-score** and **accuracy**:

1. **Change the binary parameter in CountVectorizer**: Test both `binary=True` and `binary=False`, and evaluate performance.
2. **Switch to TfidfVectorizer**: Replace the CountVectorizer with TfidfVectorizer and compare results.
3. **Adjust the max_features**: Experiment with different values of `max_features` for both TfidfVectorizer and CountVectorizer (`binary=True`).
4. **Optimize KNN**: Select the best-performing model from task 3 and vary the number of neighbors (`n_neighbors`) in the KNN classifier.
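
The four tasks can also be expressed as a single parameter search. The following is a sketch under stated assumptions, not the approach used in the cells below: the toy corpus, the small grid values, and `cv=2` are hypothetical choices that keep the example fast and self-contained.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus; swap in the real review text and labels.
toy_texts = [
    "great movie loved it", "wonderful acting great plot",
    "loved the wonderful cast", "great fun great film",
    "terrible movie hated it", "awful acting boring plot",
    "hated the boring cast", "awful dull terrible film",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

search_pipeline = Pipeline([
    ("text_vect", CountVectorizer()),
    ("knn", KNeighborsClassifier()),
])

# Each dict is one sub-grid; step parameters use the "<step>__<param>" syntax.
param_grid = [
    {   # CountVectorizer: binary flag, vocabulary size, number of neighbors
        "text_vect": [CountVectorizer()],
        "text_vect__binary": [True, False],
        "text_vect__max_features": [5, 10],
        "knn__n_neighbors": [1, 3],
    },
    {   # TfidfVectorizer: vocabulary size, number of neighbors
        "text_vect": [TfidfVectorizer()],
        "text_vect__max_features": [5, 10],
        "knn__n_neighbors": [1, 3],
    },
]

search = GridSearchCV(
    search_pipeline,
    param_grid,
    scoring={"accuracy": "accuracy", "f1_weighted": "f1_weighted"},
    refit="f1_weighted",  # metric used to pick the final model
    cv=2,
)
search.fit(toy_texts, toy_labels)

print(search.best_params_)
print(search.best_score_)
```

Both tracked metrics are available per configuration in `search.cv_results_` under `mean_test_accuracy` and `mean_test_f1_weighted`.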


In [23]:
# Task 1
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, f1_score

def evaluate_pipeline(binary_value=True):
    """Fit CountVectorizer + KNN and return (accuracy, weighted F1) on the validation split."""
    X = test_df['text']
    y = test_df['label']

    # Same 90/10 split and seed as in Section 2
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.1, random_state=324)

    pipeline = Pipeline([
        ('text_vect', CountVectorizer(binary=binary_value, max_features=100)),
        ('knn', KNeighborsClassifier())
    ])

    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_val)

    # Calculate accuracy and weighted F1-score
    accuracy = accuracy_score(y_val, y_pred)
    f1 = f1_score(y_val, y_pred, average='weighted')

    return accuracy, f1

# Evaluation of binary=False
accuracy_false, f1_false = evaluate_pipeline(binary_value=False)
print(f"Binary=False -> Accuracy: {accuracy_false}, Weighted F1: {f1_false}")

# Evaluation of binary=True
accuracy_true, f1_true = evaluate_pipeline(binary_value=True)
print(f"Binary=True -> Accuracy: {accuracy_true}, Weighted F1: {f1_true}")




Binary=False -> Accuracy: 0.592, Weighted F1: 0.5912117015652286
Binary=True -> Accuracy: 0.6004, Weighted F1: 0.5986419197379986
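
To make the effect of the `binary` flag concrete, here is a small sketch on two made-up documents (the `docs` list is an assumption for illustration). With `binary=False` a repeated word keeps its count; with `binary=True` every non-zero count is clipped to 1.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents: "good" occurs three times in the first one.
docs = ["good good good movie", "bad movie"]

counts = CountVectorizer(binary=False).fit_transform(docs).toarray()
flags = CountVectorizer(binary=True).fit_transform(docs).toarray()

# Columns are ordered alphabetically: bad, good, movie.
print(counts)  # raw term counts
print(flags)   # presence/absence only
```

Because KNN relies on distances between these vectors, a frequently repeated word can dominate the distance when raw counts are used.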


In [25]:
# Task 2
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, f1_score

X = test_df['text']
y = test_df['label']

# Splitting the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.1, random_state=324)

def evaluate_pipeline_tfidf(use_idf):
    pipeline = Pipeline([
        ('text_vect', TfidfVectorizer(use_idf=use_idf, max_features=100)),
        ('knn', KNeighborsClassifier())
    ])

    # Fitting the pipeline to the training data
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_val)

    # Calculating accuracy and weighted F1-score
    accuracy = accuracy_score(y_val, y_pred)
    f1 = f1_score(y_val, y_pred, average='weighted')

    return accuracy, f1

# Evaluating use_idf=True (Tfidf with idf)
accuracy_idf_true, f1_idf_true = evaluate_pipeline_tfidf(use_idf=True)
print(f"TfidfVectorizer (use_idf=True) -> Accuracy: {accuracy_idf_true}, Weighted F1: {f1_idf_true}")

# Evaluating use_idf=False (Tfidf without idf)
accuracy_idf_false, f1_idf_false = evaluate_pipeline_tfidf(use_idf=False)
print(f"TfidfVectorizer (use_idf=False) -> Accuracy: {accuracy_idf_false}, Weighted F1: {f1_idf_false}")




TfidfVectorizer (use_idf=True) -> Accuracy: 0.6344, Weighted F1: 0.6337703401472127
TfidfVectorizer (use_idf=False) -> Accuracy: 0.6228, Weighted F1: 0.6225074728218895
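
A small sketch of what `use_idf` changes, using two made-up documents (an assumption for illustration). With `use_idf=False` the vectorizer only L2-normalizes term frequencies; with `use_idf=True` a term that occurs in fewer documents receives a larger weight than one that occurs in all of them.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical documents: "movie" appears in both, "good" and "bad" in one each.
docs = ["good movie", "bad movie"]

tf_only = TfidfVectorizer(use_idf=False).fit_transform(docs).toarray()
tf_idf = TfidfVectorizer(use_idf=True).fit_transform(docs).toarray()

# Columns are ordered alphabetically: bad, good, movie.
print(np.round(tf_only, 3))  # "good" and "movie" weigh the same in row 0
print(np.round(tf_idf, 3))   # "good" outweighs the more common "movie" in row 0

row_norms = np.linalg.norm(tf_idf, axis=1)
print(row_norms)  # rows are L2-normalized by default
```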


In [27]:
# Task 3
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, f1_score

# Define X explicitly so this cell does not depend on a variable left over from Task 2
X = test_df['text']
y = test_df['label']

# Splitting the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.1, random_state=324)

def evaluate_pipeline(vectorizer, max_features):
    pipeline = Pipeline([
        ('text_vect', vectorizer(max_features=max_features)),
        ('knn', KNeighborsClassifier())
    ])

    # Fitting the pipeline to the training data
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_val)

    # Calculating accuracy and weighted F1-score
    accuracy = accuracy_score(y_val, y_pred)
    f1 = f1_score(y_val, y_pred, average='weighted')

    # Return the metrics
    return accuracy, f1

# Experimenting with different max_features values
max_features_values = [50, 100, 200, 500, 1000]

print("---- CountVectorizer (binary=True) Results ----")
for max_feat in max_features_values:
    accuracy, f1 = evaluate_pipeline(lambda max_features: CountVectorizer(binary=True, max_features=max_features), max_feat)
    print(f"Max Features: {max_feat} -> Accuracy: {accuracy}, Weighted F1: {f1}")

print("\n---- TfidfVectorizer Results ----")
for max_feat in max_features_values:
    accuracy, f1 = evaluate_pipeline(lambda max_features: TfidfVectorizer(use_idf=True, max_features=max_features), max_feat)
    print(f"Max Features: {max_feat} -> Accuracy: {accuracy}, Weighted F1: {f1}")




---- CountVectorizer (binary=True) Results ----
Max Features: 50 -> Accuracy: 0.5616, Weighted F1: 0.5614139436990266
Max Features: 100 -> Accuracy: 0.6004, Weighted F1: 0.5986419197379986
Max Features: 200 -> Accuracy: 0.6228, Weighted F1: 0.6198629487090926
Max Features: 500 -> Accuracy: 0.6416, Weighted F1: 0.6353650470310073
Max Features: 1000 -> Accuracy: 0.6532, Weighted F1: 0.648473352878017

---- TfidfVectorizer Results ----
Max Features: 50 -> Accuracy: 0.58, Weighted F1: 0.5798776827701684
Max Features: 100 -> Accuracy: 0.6344, Weighted F1: 0.6337703401472127
Max Features: 200 -> Accuracy: 0.6796, Weighted F1: 0.6795300197021901
Max Features: 500 -> Accuracy: 0.704, Weighted F1: 0.7020804940813176
Max Features: 1000 -> Accuracy: 0.6888, Weighted F1: 0.6865733867619369
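
For context on what `max_features` does: the vectorizer keeps only the terms with the highest total frequency across the corpus and discards the rest. A minimal sketch on a made-up corpus (the `docs` list and the name `limited` are assumptions for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus: "good" and "movie" each occur twice, "bad" and "plot" once.
docs = ["good movie", "bad movie", "good plot"]

full = CountVectorizer().fit(docs)
limited = CountVectorizer(max_features=2).fit(docs)

print(sorted(full.vocabulary_))     # all four terms
print(sorted(limited.vocabulary_))  # only the two most frequent terms
```

A larger vocabulary gives KNN more signal but also more dimensions, which may be one reason the TfidfVectorizer scores above peak at 500 features rather than 1000.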


In [29]:
# Task 4
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, f1_score

X = test_df['text']
y = test_df['label']

# Splitting the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.1, random_state=324)

def evaluate_knn_pipeline(vectorizer, max_features, n_neighbors):
    pipeline = Pipeline([
        ('text_vect', vectorizer(max_features=max_features)),
        ('knn', KNeighborsClassifier(n_neighbors=n_neighbors))
    ])

    # Fitting the pipeline to the training data
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_val)

    # Calculating accuracy and weighted F1-score
    accuracy = accuracy_score(y_val, y_pred)
    f1 = f1_score(y_val, y_pred, average='weighted')

    # Returning the metrics
    return accuracy, f1

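# Defining the vectorizer and max_features.
# NOTE: Task 3's highest scores came from TfidfVectorizer with max_features=500;
# this run keeps CountVectorizer with max_features=100, matching the output below.
best_vectorizer = CountVectorizer
best_max_features = 100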

# Experimenting with different values of n_neighbors for KNeighborsClassifier
n_neighbors_values = [1, 3, 5, 7, 9, 11]

print(f"---- Optimizing KNN for {best_vectorizer.__name__} with max_features={best_max_features} ----")
for n_neighbors in n_neighbors_values:
    accuracy, f1 = evaluate_knn_pipeline(lambda max_features: best_vectorizer(binary=True, max_features=max_features), best_max_features, n_neighbors)
    print(f"n_neighbors: {n_neighbors} -> Accuracy: {accuracy}, Weighted F1: {f1}")




---- Optimizing KNN for CountVectorizer with max_features=100 ----
n_neighbors: 1 -> Accuracy: 0.5904, Weighted F1: 0.5898272378166856
n_neighbors: 3 -> Accuracy: 0.592, Weighted F1: 0.5910584615384615
n_neighbors: 5 -> Accuracy: 0.6004, Weighted F1: 0.5986419197379986
n_neighbors: 7 -> Accuracy: 0.6212, Weighted F1: 0.619533431423308
n_neighbors: 9 -> Accuracy: 0.6292, Weighted F1: 0.6271000889569471
n_neighbors: 11 -> Accuracy: 0.636, Weighted F1: 0.633591708761764
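
The scores above come from a single 10% validation split, so small differences between neighboring values of `n_neighbors` may partly reflect which reviews landed in that split. One possible alternative, sketched here on synthetic data, is to average each setting over several cross-validation folds. The `make_classification` data is a hypothetical stand-in for the vectorized reviews, and `cv=5` is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the vectorized reviews (hypothetical data).
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=324)

cv_scores = {}
for k in [1, 3, 5, 7, 9, 11]:
    fold_scores = cross_val_score(
        KNeighborsClassifier(n_neighbors=k), X_demo, y_demo,
        cv=5, scoring="f1_weighted",
    )
    cv_scores[k] = fold_scores.mean()  # average weighted F1 over the 5 folds

best_k = max(cv_scores, key=cv_scores.get)
print(cv_scores)
print("best k:", best_k)
```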
