# Natural Language Processing

## Assignment: K Nearest Neighbors Model for the IMDB Movie Review Dataset

For the final project, build a K Nearest Neighbors model to predict the sentiment (positive or negative) of movie reviews. The dataset is originally hosted here: http://ai.stanford.edu/~amaas/data/sentiment/

Use the notebooks from the class to implement the model, then train and test it with the corresponding datasets.

You can follow these steps:
1. Read the training and test data (Given)
2. Train a KNN classifier (Implement)
3. Make predictions on your test dataset (Implement)

__You can use the KNN Classifier from here: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html__
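
To make the classifier's behavior concrete, here is a minimal pure-Python sketch of what a K Nearest Neighbors prediction does: measure the distance from a query point to every training point, keep the K closest, and take a majority vote. The function name `knn_predict` and the toy data are illustrative assumptions; for the assignment itself, use scikit-learn's `KNeighborsClassifier`, which defaults to K = 5 and Euclidean distance.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbors (sort by distance only)
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the neighbors' labels
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy 2-D data (hypothetical): label 1 clusters near (1, 1), label 0 near (0, 0)
train_points = [(0.0, 0.1), (0.1, 0.0), (0.2, 0.2), (0.9, 1.0), (1.0, 0.9), (1.1, 1.1)]
train_labels = [0, 0, 0, 1, 1, 1]

near_positive = knn_predict(train_points, train_labels, (0.95, 0.95), k=3)
near_negative = knn_predict(train_points, train_labels, (0.05, 0.05), k=3)
print(near_positive, near_negative)
```

In the notebook, the "points" are the vectorized reviews rather than 2-D coordinates, but the voting logic is the same.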

## 1. Reading the dataset

We will use the __pandas__ library to read our dataset. 

#### __Training data:__
Let's read our training data. It has two fields, text and label. Label is 1 for positive reviews and 0 for negative reviews.

In [2]:
import pandas as pd

train_df = pd.read_csv('https://raw.githubusercontent.com/aws-samples/aws-machine-learning-university-accelerated-nlp/master/data/final_project/imdb_train.csv', header=0)
train_df.head(5)

Unnamed: 0,text,label
0,This movie makes me want to throw up every tim...,0
1,Listening to the director's commentary confirm...,0
2,One of the best Tarzan films is also one of it...,1
3,Valentine is now one of my favorite slasher fi...,1
4,No mention if Ann Rivers Siddons adapted the m...,0


In [3]:
print('The shape of the dataset is:', train_df.shape)
print(train_df.isna().sum())

The shape of the dataset is: (25000, 2)
text     0
label    0
dtype: int64


#### __Test data:__

In [4]:
import pandas as pd

test_df = pd.read_csv('https://raw.githubusercontent.com/aws-samples/aws-machine-learning-university-accelerated-nlp/master/data/final_project/imdb_test.csv', header=0)
test_df.head(5)

Unnamed: 0,text,label
0,What I hoped for (or even expected) was the we...,0
1,Garden State must rate amongst the most contri...,0
2,There is a lot wrong with this film. I will no...,1
3,"To qualify my use of ""realistic"" in the summar...",1
4,Dirty War is absolutely one of the best politi...,1


In [5]:
print('The shape of the dataset is:', test_df.shape)
print(test_df.isna().sum())

The shape of the dataset is: (25000, 2)
text     0
label    0
dtype: int64


## 2. Train a KNN Classifier
Here, you will apply the pre-processing operations we covered in class. Then, you can split your dataset into training and validation sets. For your first submission, you will use the __K Nearest Neighbors Classifier__. It is available [here](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html).

In [6]:
# Implement this

Plan for this section:
1. Apply the pre-processing operations: lowercasing, stemming, and stop word removal.
2. Shuffle the dataset before splitting.
3. Split train_df into a training set (90%) and a validation set (10%).

The DataFrames involved are train_df and test_df.

Text processing: stop words removal, stemming and text cleaning processes.
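
As a rough illustration of why negation words are worth keeping out of the stop word list, the sketch below filters a sentence twice: once with a stop list that contains `not`, and once with `not` excluded from it. The tiny stop list and the helper `remove_stop_words` are hypothetical stand-ins for the NLTK list used in this notebook; stemming is left out so the example needs no downloads.

```python
def remove_stop_words(sentence, stop_words):
    """Lowercase, split on whitespace, and drop tokens found in stop_words."""
    return [w for w in sentence.lower().split() if w not in stop_words]

# Hypothetical miniature stop list (the real NLTK list is much longer)
tiny_stop_list = {"this", "is", "a", "the", "not"}
# Negations carry sentiment, so take them out of the stop list
negations = {"not"}
sentiment_stop_list = tiny_stop_list - negations

review = "This is not a good movie"
without_negation = remove_stop_words(review, tiny_stop_list)
with_negation = remove_stop_words(review, sentiment_stop_list)
print(without_negation)  # the negation is lost
print(with_negation)     # the negation is kept
```

With the unmodified stop list, a negative review and a positive one can collapse to the same tokens, which is the motivation for an exclusion list.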

In [7]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [8]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize

# Let's get a list of stop words from the NLTK library
stop = stopwords.words('english')

# These words are important for our problem. We don't want to remove them.
excluding = ['against', 'not', 'don', "don't",'ain', 'aren', "aren't", 'couldn', "couldn't",
             'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 
             'haven', "haven't", 'isn', "isn't", 'mightn', "mightn't", 'mustn', "mustn't",
             'needn', "needn't",'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', 
             "weren't", 'won', "won't", 'wouldn', "wouldn't"]

# New stop word list
stop_words = [word for word in stop if word not in excluding]

snow = SnowballStemmer('english')

# process_text function
def process_text(texts): 
    final_text_list=[]
    for sent in texts:
        
        # Check if the sentence is a missing value
        if not isinstance(sent, str):
            sent = ""
            
        filtered_sentence=[]
        
        sent = sent.lower() # Lowercase 
        sent = sent.strip() # Remove leading/trailing whitespace
        sent = re.sub(r'\s+', ' ', sent) # Collapse runs of spaces, tabs, and newlines
        sent = re.sub(r'<.*?>', '', sent) # Remove HTML tags/markup
        
        for w in word_tokenize(sent):
            # We are applying some custom filtering here, feel free to try different things
            # Check if it is not numeric and its length>2 and not in stop words
            if(not w.isnumeric()) and (len(w)>2) and (w not in stop_words):  
                # Stem and add to filtered list
                filtered_sentence.append(snow.stem(w))
        final_string = " ".join(filtered_sentence) #final string of cleaned words
 
        final_text_list.append(final_string)
        
    return final_text_list
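
The cleaning steps in `process_text` can be tried in isolation. The sketch below is a variant under stated assumptions, not the notebook's code: it replaces each HTML tag with a space before collapsing whitespace, so that words on either side of a `<br />` do not get glued together. The notebook's version substitutes an empty string, which is one likely reason tokens such as `terror.i` show up in the processed output. The function name `clean_text` and the sample review are made up for illustration.

```python
import re

def clean_text(text):
    """Lowercase, replace HTML tags with a space, and collapse runs of whitespace."""
    text = text.lower()
    text = re.sub(r"<.*?>", " ", text)   # tag -> space keeps neighboring words apart
    text = re.sub(r"\s+", " ", text)     # collapse spaces, tabs, and newlines
    return text.strip()

sample = "Pretty funny.<br /><br />I saw   this film\tin 2001."
cleaned = clean_text(sample)
print(cleaned)
```

Switching to this variant would change the tokens and therefore the accuracies reported in this notebook, so treat it as something to experiment with rather than a drop-in replacement.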

Split train_df into training (90%) and validation (10%) sets, shuffling the rows first.

In [9]:
from sklearn.model_selection import train_test_split
X=train_df[["text"]]
Y=train_df["label"]
X_train, X_val, y_train, y_val = train_test_split(X,
                                                  Y,
                                                  test_size=0.10,
                                                  shuffle=True,
                                                  random_state=324
                                                 )
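
For intuition, a shuffled 90/10 split amounts to permuting the rows with a fixed seed and cutting the permutation at the 90% mark. The sketch below shows that idea in plain Python; `shuffled_split` is a hypothetical helper and does not reproduce the exact permutation that `train_test_split` generates for `random_state=324`.

```python
import random

def shuffled_split(items, val_fraction=0.10, seed=324):
    """Shuffle a copy of items reproducibly, then split off the last val_fraction."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)   # fixed seed -> same split every run
    n_val = int(round(len(shuffled) * val_fraction))
    cut = len(shuffled) - n_val
    return shuffled[:cut], shuffled[cut:]

rows = list(range(20))                      # stand-in for 20 review indices
train_rows, val_rows = shuffled_split(rows)
print(len(train_rows), len(val_rows))       # 18 2
```

Fixing the seed matters because it keeps the validation set identical across runs, so accuracy differences reflect model changes rather than a different split.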

In [10]:
print("Processing the reviewText fields")
train_text_list = process_text(X_train["text"].tolist())
val_text_list = process_text(X_val["text"].tolist())

Processing the reviewText fields


In [11]:
print(train_text_list[0])
print(val_text_list[0])

extrem tens thriller set urban chao são paulo biggest ugliest third world nightmar brazilian urbania sake make easi anyon curious intrigu truli well made film grit mexican featur amor perro charact not far max cadi cape fear although not mean film psychopath two partner alexandr borg marco ricca construct compani pay hitman anisio miklo third partner major share holder said construct outfit murder blame citi thing begin look grim inde witti charismat walk nightmar anisio decid want around ever nervous partner crime not trespass import deconstruct strict social code make brazilian societi anisio turn poverti attitud want look almost entir handheld graini perform outstand throughout especi first time actor member classic brazilian pop band titã paulo miklo dazzl baffl viewer pretti funni social terror.i saw film brasília film fest novemb 2001. sinc done well sundanc berlin kleber mendonça filho
classic film noir homag chinatown roman polanski return theme given greatest hit 60s creepi ps

Data processing with a Pipeline

We build a simple pipeline that vectorizes the text field and fits a K Nearest Neighbors classifier. It uses a single field, the text column.

In [27]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

### PIPELINE ###
##########################
pipeline = Pipeline([
    # Alternative vectorizer:
    # ('text_vect', CountVectorizer(binary=False, max_features=30)),
    ('text_vect', TfidfVectorizer(use_idf=False,
                                  max_features=30)),
    ('knn', KNeighborsClassifier())  # default n_neighbors=5
])

# Visualize the pipeline
# This will come in handy especially when building more complex pipelines, stringing together multiple preprocessing steps
from sklearn import set_config
set_config(display='diagram')
pipeline
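
To see roughly what the vectorizer step feeds into KNN, the sketch below builds term-frequency vectors by hand: it keeps the `max_features` most frequent terms across the corpus, counts them per document, and scales each row to unit L2 length, which is what `TfidfVectorizer(use_idf=False)` does with its default `norm='l2'`. The tokenization here is a plain whitespace split, a simplification of scikit-learn's tokenizer, and the helper names `fit_vocabulary` and `vectorize` are assumptions for illustration.

```python
import math
from collections import Counter

def fit_vocabulary(docs, max_features):
    """Return the max_features most frequent terms in the corpus, sorted alphabetically."""
    totals = Counter()
    for doc in docs:
        totals.update(doc.split())
    return sorted(term for term, _ in totals.most_common(max_features))

def vectorize(docs, vocabulary):
    """Turn each document into an L2-normalized vector of term counts."""
    index = {term: i for i, term in enumerate(vocabulary)}
    rows = []
    for doc in docs:
        counts = [0.0] * len(vocabulary)
        for token in doc.split():
            if token in index:
                counts[index[token]] += 1.0
        norm = math.sqrt(sum(c * c for c in counts))
        # Documents with no vocabulary terms stay as all-zero vectors
        rows.append([c / norm for c in counts] if norm else counts)
    return rows

# Toy corpus (hypothetical), already lowercased and cleaned
docs = ["good film good act", "bad film bad plot", "good film"]
vocabulary = fit_vocabulary(docs, max_features=3)
rows = vectorize(docs, vocabulary)
print(vocabulary)
print(rows[0])
```

Because each review is described by only `max_features` numbers, a small value such as 30 discards most of the vocabulary before the distances are computed.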

## 3. Make predictions on your test dataset

Once we select our best performing model, we can use it to make predictions on the test dataset. Call __.fit()__ with your training data, using the best performing K value, and then call __.predict()__ with your test data to get your test predictions.
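
One way to pick the K value mentioned above is to score several candidates on the validation set and keep the best one. The sketch below shows that loop on a toy one-dimensional dataset with a small hand-written KNN; the helper names, the candidate list, and the data are all hypothetical. With the notebook's pipeline, the same loop would build one pipeline per candidate using `KNeighborsClassifier(n_neighbors=k)` and score it on the validation split.

```python
from collections import Counter

def knn_predict_1d(train_x, train_y, query, k):
    """Majority vote among the k training values closest to query."""
    nearest = sorted(zip(train_x, train_y), key=lambda pair: abs(pair[0] - query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def validation_accuracy(train_x, train_y, val_x, val_y, k):
    """Fraction of validation points whose predicted label matches the true label."""
    hits = sum(knn_predict_1d(train_x, train_y, x, k) == y for x, y in zip(val_x, val_y))
    return hits / len(val_y)

# Toy data (hypothetical): small values are label 0, large values are label 1,
# with one noisy training point (4.6 labeled 0)
train_x = [1.0, 2.0, 3.0, 4.6, 5.0, 6.0, 7.0]
train_y = [0, 0, 0, 0, 1, 1, 1]
val_x = [2.5, 4.7, 5.5, 6.5]
val_y = [0, 1, 1, 1]

# Score each candidate K on the validation set and keep the best one
scores = {k: validation_accuracy(train_x, train_y, val_x, val_y, k) for k in (1, 3, 5)}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

Only the validation set is used for the selection, so the test set remains an unbiased check of the final model.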

In [None]:
# Implement this

Train the classifier by calling __.fit()__ on the training data.

Evaluate the trained classifier by calling __.predict()__ on held-out data (validation or test) and comparing the predictions with the true labels.


Train the classifier

In [28]:
# We are using the lists of processed text fields
X_train = train_text_list
X_val = val_text_list

# Fit the Pipeline to training data
pipeline.fit(X_train, y_train.values)

Test the classifier using .predict()

In [30]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

# Use the fitted pipeline to make predictions on the validation dataset
val_predictions = pipeline.predict(X_val)
print(confusion_matrix(y_val.values, val_predictions))
print(classification_report(y_val.values, val_predictions))
print("Accuracy (validation):", accuracy_score(y_val.values, val_predictions))

[[804 476]
 [453 767]]
              precision    recall  f1-score   support

           0       0.64      0.63      0.63      1280
           1       0.62      0.63      0.62      1220

    accuracy                           0.63      2500
   macro avg       0.63      0.63      0.63      2500
weighted avg       0.63      0.63      0.63      2500

Accuracy (validation): 0.6284
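
The figures in the classification report can be recomputed directly from the confusion matrix above, where rows are true labels and columns are predicted labels (scikit-learn's convention). The sketch below does that arithmetic in plain Python; the helper name `metrics_from_confusion` is an assumption for illustration.

```python
def metrics_from_confusion(matrix):
    """Return accuracy plus per-class precision and recall for a 2x2 confusion matrix.

    matrix[i][j] is the number of samples with true label i predicted as label j.
    """
    (tn, fp), (fn, tp) = matrix
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "precision_0": tn / (tn + fn),   # of everything predicted 0, share truly 0
        "recall_0": tn / (tn + fp),      # of everything truly 0, share predicted 0
        "precision_1": tp / (tp + fp),   # of everything predicted 1, share truly 1
        "recall_1": tp / (tp + fn),      # of everything truly 1, share predicted 1
    }

# Validation confusion matrix printed above
val_matrix = [[804, 476], [453, 767]]
val_metrics = metrics_from_confusion(val_matrix)
for name, value in val_metrics.items():
    print(f"{name}: {value:.4f}")
```

Rounded to two decimals, these values match the report: 0.64 and 0.63 for class 0, 0.62 and 0.63 for class 1, with an accuracy of (804 + 767) / 2500 = 0.6284.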


Validation accuracy for the vectorizer settings tried:

| Vectorizer setting | max features | Accuracy |
|---|---|---|
| binary = True | 10 | 0.5252 |
| binary = False | 10 | 0.514 |
| binary = True | 20 | 0.5544 |
| binary = False | 20 | 0.5888 |
| binary = True | 30 | 0.6068 |
| binary = False | 30 | 0.6232 |
| TfidfVectorizer(use_idf=True) | 10 | 0.5292 |
| TfidfVectorizer(use_idf=False) | 10 | 0.5244 |
| TfidfVectorizer(use_idf=True) | 20 | 0.6036 |
| TfidfVectorizer(use_idf=False) | 20 | 0.5944 |
| TfidfVectorizer(use_idf=True) | 30 | 0.628 |
| TfidfVectorizer(use_idf=False) | 30 | 0.6284 |

The highest accuracy is 0.6284.

use:

when max features = 20, binary = False, accuracy = 0.5888

when max features = 20, when TfidfVectorizer(use_idf=True) , accuracy = 0.6036

Now use the pipeline with the test dataset 

In [21]:
x_1=process_text(test_df["text"].tolist())
y_1=test_df["label"]


In [31]:
print(x_1[1])

print(y_1[1])

garden state must rate amongst contriv pretenti film time plot simpl one involv young man return home mother death discov love realli plot n't import import zach braff writer director star abl hang plot necessari accoutr indi arti film therefor present endless cute quirki charact scene n't exist reason plot charact develop simpli give artist credibl film wes anderson braff hope unfortun somewhat astonish braff not fool mani imdb also critic realli ought known better.of cours braff gratuit use quirki alon not make garden state bad film realli make garden state stinker braff script simpli not write skill carri film dialogu characteris abysm braff often resort blunt devis symbol achiev n't achiev write exampl numb braff charact shown indiffer impend plane crash n't work plot take place dream later shown fight back against circumst scream bottomless abyss life bottomless abyss clever braff two scene must rank amongst ludicr contriv ever seen cinema screen.on plus side act passabl despit la

In [32]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

# Use the fitted pipeline to make predictions on the test dataset
test_predictions = pipeline.predict(x_1)
print(confusion_matrix(y_1.values, test_predictions))
print(classification_report(y_1.values, test_predictions))
print("Accuracy (test):", accuracy_score(y_1.values, test_predictions))

[[7834 4666]
 [4626 7874]]
              precision    recall  f1-score   support

           0       0.63      0.63      0.63     12500
           1       0.63      0.63      0.63     12500

    accuracy                           0.63     25000
   macro avg       0.63      0.63      0.63     25000
weighted avg       0.63      0.63      0.63     25000

Accuracy (test): 0.62832


With max features = 30 and TfidfVectorizer(use_idf=False), the test accuracy is 0.62832.