<a href="https://colab.research.google.com/github/Nathn/ReviewSentimentPredictor/blob/main/NLP_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Natural Language Processing

## Assignment: K Nearest Neighbors Model for the IMDB Movie Review Dataset

For the final project, build a K Nearest Neighbors model to predict the sentiment (positive or negative) of movie reviews. The dataset is originally hosted here: http://ai.stanford.edu/~amaas/data/sentiment/

Use the notebooks from class to implement the model, then train and test it with the corresponding datasets.

You can follow these steps:
1. Read training-test data (Given)
2. Train a KNN classifier (Implement)
3. Make predictions on your test dataset (Implement)
4. Experimentation (Implement)

__You can use the KNN Classifier from here: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html__

## 1. Reading the dataset

We will use the __pandas__ library to read our dataset.

#### __Training data:__
Let's read our training data. Here, we have the text and label fields. Label is 1 for positive reviews and 0 for negative reviews.

In [4]:
import pandas as pd

train_df = pd.read_csv('https://raw.githubusercontent.com/aws-samples/aws-machine-learning-university-accelerated-nlp/master/data/final_project/imdb_train.csv', header=0)
train_df.head()

Unnamed: 0,text,label
0,This movie makes me want to throw up every tim...,0
1,Listening to the director's commentary confirm...,0
2,One of the best Tarzan films is also one of it...,1
3,Valentine is now one of my favorite slasher fi...,1
4,No mention if Ann Rivers Siddons adapted the m...,0
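
Before pre-processing, it can help to confirm that the labels are roughly balanced and to count missing texts. The snippet below is a minimal sketch on a made-up `toy_df` (a hypothetical stand-in with the same `text` and `label` columns as `train_df`); the same two calls should work on the real DataFrame.

```python
import pandas as pd

# Hypothetical stand-in for train_df, with one missing review on purpose
toy_df = pd.DataFrame({
    "text": ["Loved it", None, "Dull and slow", "A fine film"],
    "label": [1, 0, 0, 1],
})

# Number of reviews per class (1 = positive, 0 = negative)
label_counts = toy_df["label"].value_counts()

# Number of rows whose text is missing
missing_texts = toy_df["text"].isna().sum()

print(label_counts)
print("Missing texts:", missing_texts)
```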


#### __Test data:__

In [5]:
import pandas as pd

test_df = pd.read_csv('https://raw.githubusercontent.com/aws-samples/aws-machine-learning-university-accelerated-nlp/master/data/final_project/imdb_test.csv', header=0)
test_df.head()

Unnamed: 0,text,label
0,What I hoped for (or even expected) was the we...,0
1,Garden State must rate amongst the most contri...,0
2,There is a lot wrong with this film. I will no...,1
3,"To qualify my use of ""realistic"" in the summar...",1
4,Dirty War is absolutely one of the best politi...,1


## 2. Train a KNN Classifier
Here, you will apply the pre-processing operations we covered in class. You can then split your dataset into training and validation sets. For your first submission, you will use the __K Nearest Neighbors Classifier__. It is available [here](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html).
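
One possible way to do the training/validation split mentioned above is scikit-learn's `train_test_split`. This is only a sketch on a made-up `toy_df`; the 80/20 ratio and the seed are assumptions, not requirements of the assignment.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for train_df (same 'text' and 'label' columns)
toy_df = pd.DataFrame({
    "text": ["great film", "awful plot", "loved it", "boring mess",
             "wonderful cast", "terrible acting", "really enjoyable", "not good",
             "a classic", "waste of time"],
    "label": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
})

# Hold out 20% of the rows for validation.
# stratify keeps the positive/negative ratio the same in both parts.
X_train, X_val, y_train, y_val = train_test_split(
    toy_df["text"],
    toy_df["label"],
    test_size=0.2,      # assumed split ratio
    stratify=toy_df["label"],
    random_state=42,    # assumed seed, only for reproducibility
)

print(len(X_train), len(X_val))
```

With a validation set held out this way, the test data can be kept aside until the final evaluation.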

In [6]:
# Import the libraries and download the required NLTK resources
import re

import nltk

nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')

from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize

# Let's get a list of stop words from the NLTK library
stop = stopwords.words('english')

# These words are important for our problem. We don't want to remove them.
excluding = ['against', 'not', 'don', "don't",'ain', 'aren', "aren't", 'couldn', "couldn't",
             'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 
             'haven', "haven't", 'isn', "isn't", 'mightn', "mightn't", 'mustn', "mustn't",
             'needn', "needn't",'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', 
             "weren't", 'won', "won't", 'wouldn', "wouldn't"]

# New stop word list
stop_words = [word for word in stop if word not in excluding]

snow = SnowballStemmer('english')

def process_text(texts): 
    final_text_list=[]
    for sent in texts:
        
        # Check if the sentence is a missing value
        if not isinstance(sent, str):
            sent = ""
            
        filtered_sentence=[]
        
        sent = sent.lower() # Lowercase 
        sent = sent.strip() # Remove leading/trailing whitespace
        sent = re.sub(r'\s+', ' ', sent) # Collapse repeated spaces and tabs (raw string avoids an invalid-escape warning)
        sent = re.sub(r'<.*?>', '', sent) # Remove HTML tags/markup
        
        for w in word_tokenize(sent):
            # We are applying some custom filtering here, feel free to try different things
            # Check if it is not numeric and its length>2 and not in stop words
            if(not w.isnumeric()) and (len(w)>2) and (w not in stop_words):  
                # Stem and add to filtered list
                filtered_sentence.append(snow.stem(w))
        final_string = " ".join(filtered_sentence) #final string of cleaned words
 
        final_text_list.append(final_string)
        
    return final_text_list

train_text_list = process_text(train_df["text"].tolist())

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

import gensim

w2v = gensim.models.Word2Vec()
# Bag-of-words features followed by a KNN classifier.
# To try TF-IDF features instead, swap in TfidfVectorizer(use_idf=True, max_features=10).
pipeline = Pipeline([
    ('text_vect', CountVectorizer(
        binary=True,
        max_features=10
    )),
    ('knn', KNeighborsClassifier())
])

from sklearn import set_config
set_config(display='diagram')

pipeline


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\tranc\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\tranc\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\tranc\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## 3. Make predictions on your test dataset

Once we have selected our best-performing model, we can use it to make predictions on the test dataset. Call __.fit()__ on your training data with the best-performing K value, then call __.predict()__ on your test data to get your test predictions.

In [7]:
pipeline.fit(train_text_list, train_df["label"])

from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

# Use the fitted pipeline to make predictions on the test dataset
val_predictions = pipeline.predict(process_text(test_df["text"].tolist()))

# Print the confusion matrix, the classification report and the accuracy
print(confusion_matrix(test_df["label"], val_predictions))
print(classification_report(test_df["label"], val_predictions))
print("Accuracy (validation):", accuracy_score(test_df["label"], val_predictions))


[[6785 5715]
 [6213 6287]]
              precision    recall  f1-score   support

           0       0.52      0.54      0.53     12500
           1       0.52      0.50      0.51     12500

    accuracy                           0.52     25000
   macro avg       0.52      0.52      0.52     25000
weighted avg       0.52      0.52      0.52     25000

Accuracy (validation): 0.52288


## 4. Experimentation

For each of the following tasks, track both the **weighted F1-score** and **accuracy**:

1. **Change the binary parameter in CountVectorizer**: Test both `binary=True` and `binary=False`, and evaluate performance.
2. **Switch to TfidfVectorizer**: Replace the CountVectorizer with TfidfVectorizer and compare results.
3. **Adjust the max_features**: Experiment with different values of `max_features` for both TfidfVectorizer and CountVectorizer (`binary=True`).
4. **Optimize KNN**: Select the best-performing model from task 3 and vary the number of neighbors (`n_neighbors`) in the KNN classifier.
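
Since every task reports the same two metrics, a small helper can cut down on repeated code. The sketch below uses a hypothetical `evaluate` function and made-up toy texts (neither is part of the assignment) and applies it to the `binary=True` / `binary=False` comparison from task 1. `n_neighbors=3` is chosen only because the toy data is tiny.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, f1_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline


def evaluate(pipeline, train_texts, train_labels, test_texts, test_labels):
    """Fit the pipeline, predict on the test texts, return both metrics."""
    pipeline.fit(train_texts, train_labels)
    predictions = pipeline.predict(test_texts)
    return {
        "accuracy": accuracy_score(test_labels, predictions),
        "f1_weighted": f1_score(test_labels, predictions, average="weighted"),
    }


# Made-up toy data standing in for the processed IMDB texts
train_texts = ["good good film", "great fun", "bad bad film",
               "awful bore", "good fun", "bad bore"]
train_labels = [1, 1, 0, 0, 1, 0]
test_texts = ["good film", "bad film"]
test_labels = [1, 0]

for binary in (True, False):
    pipe = Pipeline([
        ("text_vect", CountVectorizer(binary=binary)),
        ("knn", KNeighborsClassifier(n_neighbors=3)),  # small K for the toy data
    ])
    scores = evaluate(pipe, train_texts, train_labels, test_texts, test_labels)
    print(f"binary={binary}: {scores}")
```

On the real data, the same call would take `train_text_list`, `train_df["label"]`, the processed test texts and `test_df["label"]`.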


In [12]:
# Task 1

pipeline = Pipeline([
    ('text_vect', CountVectorizer(
        binary=True,
        max_features=10
    )),
    ('knn', KNeighborsClassifier())  
])

pipeline.fit(train_text_list, train_df["label"])

from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

# Use the fitted pipeline to make predictions on the test dataset
val_predictions = pipeline.predict(process_text(test_df["text"].tolist()))

# Print the confusion matrix, the classification report and the accuracy
print(confusion_matrix(test_df["label"], val_predictions))
print(classification_report(test_df["label"], val_predictions))
print("Accuracy (validation) with binary=True:", accuracy_score(test_df["label"], val_predictions))

pipeline = Pipeline([
    ('text_vect', CountVectorizer(
        binary=False,
        max_features=10
    )),
    ('knn', KNeighborsClassifier())  
])

pipeline.fit(train_text_list, train_df["label"])

val_predictions = pipeline.predict(process_text(test_df["text"].tolist()))

# Print the confusion matrix, the classification report and the accuracy
print(confusion_matrix(test_df["label"], val_predictions))
print(classification_report(test_df["label"], val_predictions))
print("Accuracy (validation) with binary=False:", accuracy_score(test_df["label"], val_predictions))

[[6785 5715]
 [6213 6287]]
              precision    recall  f1-score   support

           0       0.52      0.54      0.53     12500
           1       0.52      0.50      0.51     12500

    accuracy                           0.52     25000
   macro avg       0.52      0.52      0.52     25000
weighted avg       0.52      0.52      0.52     25000

Accuracy (validation) with binary=True: 0.52288
[[6358 6142]
 [5752 6748]]
              precision    recall  f1-score   support

           0       0.53      0.51      0.52     12500
           1       0.52      0.54      0.53     12500

    accuracy                           0.52     25000
   macro avg       0.52      0.52      0.52     25000
weighted avg       0.52      0.52      0.52     25000

Accuracy (validation) with binary=False: 0.52424


In [13]:
# Task 2

pipeline = Pipeline([
    ('text_vect', CountVectorizer(
        binary=False,
        max_features=10
    )),
    ('knn', KNeighborsClassifier())  
])

pipeline.fit(train_text_list, train_df["label"])

from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

# Use the fitted pipeline to make predictions on the test dataset
val_predictions = pipeline.predict(process_text(test_df["text"].tolist()))

# Print the confusion matrix, the classification report and the accuracy
print(confusion_matrix(test_df["label"], val_predictions))
print(classification_report(test_df["label"], val_predictions))
print("Accuracy (validation) with CountVectorizer:", accuracy_score(test_df["label"], val_predictions))

pipeline = Pipeline([
    ( 'text_vect', TfidfVectorizer(
        use_idf=True,
        binary=False,
        max_features=10
    )),
    ('knn', KNeighborsClassifier())  
])

pipeline.fit(train_text_list, train_df["label"])

# Use the fitted pipeline to make predictions on the test dataset
val_predictions = pipeline.predict(process_text(test_df["text"].tolist()))

# Print the confusion matrix, the classification report and the accuracy
print(confusion_matrix(test_df["label"], val_predictions))
print(classification_report(test_df["label"], val_predictions))
print("Accuracy (validation) with TfidfVectorizer:", accuracy_score(test_df["label"], val_predictions))

[[6358 6142]
 [5752 6748]]
              precision    recall  f1-score   support

           0       0.53      0.51      0.52     12500
           1       0.52      0.54      0.53     12500

    accuracy                           0.52     25000
   macro avg       0.52      0.52      0.52     25000
weighted avg       0.52      0.52      0.52     25000

Accuracy (validation) with CountVectorizer: 0.52424
[[6428 6072]
 [5686 6814]]
              precision    recall  f1-score   support

           0       0.53      0.51      0.52     12500
           1       0.53      0.55      0.54     12500

    accuracy                           0.53     25000
   macro avg       0.53      0.53      0.53     25000
weighted avg       0.53      0.53      0.53     25000

Accuracy (validation) with TfidfVectorizer: 0.52968


In [10]:
# Task 3

# Implement this
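
One way task 3 could be approached is a loop over both vectorizers and several `max_features` values, collecting the two metrics in a table. The sketch below runs on made-up toy texts; the candidate values `(2, 4, 8)`, the `vectorizers` dictionary and `n_neighbors=3` are illustrative assumptions, and the real data would call for much larger `max_features` values.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score, f1_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Made-up toy data standing in for the processed IMDB texts
train_texts = ["good fun film", "great good story", "bad dull film",
               "awful bad story", "good great fun", "dull awful bad"]
train_labels = [1, 1, 0, 0, 1, 0]
test_texts = ["good story", "bad film"]
test_labels = [1, 0]

# Factories that build each vectorizer for a given max_features value
vectorizers = {
    "CountVectorizer(binary=True)": lambda n: CountVectorizer(binary=True, max_features=n),
    "TfidfVectorizer": lambda n: TfidfVectorizer(use_idf=True, max_features=n),
}

rows = []
for name, make_vectorizer in vectorizers.items():
    for max_features in (2, 4, 8):  # assumed candidate values for the toy data
        pipe = Pipeline([
            ("text_vect", make_vectorizer(max_features)),
            ("knn", KNeighborsClassifier(n_neighbors=3)),  # small K for the toy data
        ])
        pipe.fit(train_texts, train_labels)
        predictions = pipe.predict(test_texts)
        rows.append({
            "vectorizer": name,
            "max_features": max_features,
            "accuracy": accuracy_score(test_labels, predictions),
            "f1_weighted": f1_score(test_labels, predictions, average="weighted"),
        })

# One row per configuration, best weighted F1 first
results = pd.DataFrame(rows).sort_values("f1_weighted", ascending=False)
print(results)
```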

In [11]:
# Task 4

# Implement this
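
For task 4, one option is to let `GridSearchCV` vary `n_neighbors` and score each value with weighted F1. This is a sketch under assumptions: the toy texts are made up, `TfidfVectorizer` is only a placeholder for whichever vectorizer performed best in task 3, and using 3-fold cross-validation on the training data (rather than scoring directly on the test set) is a choice made here, not something the assignment prescribes.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Made-up toy data standing in for the processed IMDB training texts
texts = ["good fun film", "great story", "loved the cast",
         "fine good acting", "great fun ride", "good great film",
         "bad dull film", "awful story", "hated the cast",
         "poor bad acting", "dull awful ride", "bad poor film"]
labels = [1] * 6 + [0] * 6

pipe = Pipeline([
    ("text_vect", TfidfVectorizer()),   # placeholder for the best vectorizer from task 3
    ("knn", KNeighborsClassifier()),
])

# The step name 'knn' plus '__' addresses the classifier's n_neighbors parameter
param_grid = {"knn__n_neighbors": [1, 3, 5]}  # assumed candidate values

search = GridSearchCV(pipe, param_grid, scoring="f1_weighted", cv=3)
search.fit(texts, labels)

print("Best K:", search.best_params_["knn__n_neighbors"])
print("Best cross-validated weighted F1:", round(search.best_score_, 3))
```

By default `GridSearchCV` refits the best configuration on all the data passed to `fit`, so `search.predict(...)` can then be used on the processed test texts.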