# Movie Review Sentiment Classification using Naive Bayes

This is my implementation of a Naive Bayes classifier for sentiment analysis on a movie review dataset,
i.e. given a movie review, the goal is to output its sentiment: positive or negative.

## Dataset

For training the model we will use the [Movie Review Dataset](https://www.cs.cornell.edu/people/pabo/movie-review-data/).
It was made available in 2004 by Bo Pang and Lillian Lee. The dataset includes around 2,000 movie reviews, each annotated as either `positive` or `negative`. It is about 3.8MB in size and can be
downloaded from this [link](https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/movie_reviews.zip).


Once you download and extract the dataset, you will find 2 folders: `pos` and `neg`.

<img src="images/dataset folders.png" width="200px" />

The folder name indicates the true sentiment of the files inside. Each folder contains 1,000 text files.
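
As a quick sanity check on the extracted archive, the files in each class folder can be counted. This is a minimal sketch: the helper name `count_reviews` is hypothetical, and it assumes the archive was extracted into a `movie_reviews` folder in the current working directory.

```python
import os

def count_reviews(root, classes=('pos', 'neg')):
    """Return the number of regular files in each class folder under root."""
    counts = {}
    for label in classes:
        folder = os.path.join(root, label)
        counts[label] = sum(
            1 for name in os.listdir(folder)
            if os.path.isfile(os.path.join(folder, name))
        )
    return counts

# Assumption: the dataset lives in ./movie_reviews
root = os.path.join(os.getcwd(), 'movie_reviews')
if os.path.isdir(root):
    print(count_reviews(root))
```

If the counts differ between the two folders, the extraction was probably incomplete.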

## Loading the Training Data

In [3]:
import numpy as np
import matplotlib.pyplot as plt
import os
import time

In [13]:
reviews_text = []   # raw review texts (converted to a NumPy array below)
reviews_label = []  # labels (pos or neg)

classes = ['pos', 'neg']
current_path = os.getcwd()

for i in classes:
    path = os.path.join(current_path, 'movie_reviews', str(i))
    reviews = os.listdir(path)
    
    for review in reviews:
        # os.path.join keeps the path portable; the with-block closes the file
        with open(os.path.join(path, review), 'r') as file:
            txt = file.read()
        reviews_text.append(txt)
        reviews_label.append(i)
        
reviews_text = np.array(reviews_text)
reviews_label = np.array(reviews_label)

In [14]:
# shape of data
reviews_text.shape

(2000,)

Each movie review consists of a number of words spread over multiple sentences.
Before we proceed further, we need to tokenize each review into a list of words. For that we will use the
`nltk.tokenize` [package](http://www.nltk.org/api/nltk.tokenize.html).

In [18]:
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Mostafa\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

In [36]:
X = []
y = reviews_label
for review in reviews_text:
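    words = word_tokenize(review)
    # keep purely alphabetic tokens only, lowercased
    words = np.array([word.lower() for word in words if word.isalpha()])
    X.append(words)
# reviews differ in length, so store them in a 1-D object array
X = np.array(X, dtype=object)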
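
To make the effect of the `isalpha()` filter and lowercasing concrete, here is a small sketch on a hand-written token list. The list only resembles what a tokenizer might produce and is not actual `word_tokenize` output, and the helper name `clean_tokens` is hypothetical.

```python
def clean_tokens(tokens):
    """Lowercase tokens and drop any that are not purely alphabetic."""
    return [token.lower() for token in tokens if token.isalpha()]

# Hand-written tokens (assumption: illustrative, not real tokenizer output)
tokens = ["The", "film", "was", "n't", "bad", ",", "but", "2", "hours", "is", "too", "long", "."]
print(clean_tokens(tokens))
# ['the', 'film', 'was', 'bad', 'but', 'hours', 'is', 'too', 'long']
```

Note that punctuation and numbers are removed, and so is any fragment containing an apostrophe, such as `n't`. If the tokenizer emits such fragments, negation cues are lost at this step.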

## Train/Test Split

At this point, it is a good idea to split the data into training data and testing data. The former is used to train the model, while the latter is used to evaluate its performance on unseen data before deploying it into a production environment.

In [37]:
def train_test_split(data, labels, test_size=0.2, random_state=0):
    np.random.seed(random_state)
    N = labels.shape[0]
    idx = np.random.permutation(N)
    train_size = int(np.ceil((1-test_size)*N))
    X_train = data[idx[:train_size]]
    y_train = labels[idx[:train_size]]
    X_test = data[idx[train_size:]]
    y_test = labels[idx[train_size:]]
    return X_train, X_test, y_train, y_test

In [38]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=113)
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")

X_train shape: (1600,)
X_test shape: (400,)
y_train shape: (1600,)
y_test shape: (400,)
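
Because the split is purely random, the two classes are not guaranteed to be equally represented in each part. A minimal sketch for checking this follows; the helper name `class_counts` is hypothetical and the toy labels stand in for `y_train` or `y_test`.

```python
from collections import Counter

def class_counts(labels):
    """Return a {label: count} dict for a sequence of labels."""
    return dict(Counter(str(label) for label in labels))

# Toy labels standing in for y_train / y_test
toy_labels = ['pos', 'neg', 'neg', 'pos', 'pos']
print(class_counts(toy_labels))  # {'pos': 3, 'neg': 2}
```

Calling `class_counts(y_train)` and `class_counts(y_test)` shows how far each part is from an even split.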


## Feature Extraction

In this setup, we will limit our features to the 2000 most common words in the training corpus. So first,
we need to determine the most common words and then convert each input into a 2000-dimensional vector
where the i-th component indicates whether the i-th word is present in the review.

In [63]:
# collect all words in the dataset
all_words = []
for rev in X_train:
    all_words.extend(rev.ravel())

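# count how often each word occurs
all_words = nltk.FreqDist(w for w in all_words)

# pick the top 2000 most frequent words
# (iterating a FreqDist directly does not guarantee frequency order, so use most_common)
word_features = [word for word, _ in all_words.most_common(2000)]
len(word_features)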

2000
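
The second step described above, converting each review into a presence vector, could be sketched as follows. The helper name `review_to_vector` is hypothetical, and the toy vocabulary stands in for the 2000-word `word_features` list.

```python
def review_to_vector(words, vocabulary):
    """Map a tokenized review to a binary presence vector over vocabulary."""
    present = set(words)  # set membership makes each lookup O(1)
    return [1 if word in present else 0 for word in vocabulary]

# Toy vocabulary standing in for `word_features`
toy_vocabulary = ['good', 'bad', 'plot', 'boring']
toy_review = ['the', 'plot', 'was', 'good', 'really', 'good']
print(review_to_vector(toy_review, toy_vocabulary))  # [1, 0, 1, 0]
```

Repeated words do not change the result, since each component only records presence, not count. The returned list can be wrapped in `np.array` if array operations are needed later.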

## Building the Model
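
One common way to fit Naive Bayes on binary presence features is the Bernoulli variant with Laplace smoothing. The sketch below assumes that variant. The function names `train_bernoulli_nb` and `predict_bernoulli_nb` and the smoothing parameter `alpha` are hypothetical and not taken from the original notebook.

```python
import math

def train_bernoulli_nb(vectors, labels, alpha=1.0):
    """Estimate log priors and per-word presence probabilities with Laplace smoothing."""
    classes = sorted(set(labels))
    n_features = len(vectors[0])
    model = {}
    for c in classes:
        rows = [v for v, y in zip(vectors, labels) if y == c]
        log_prior = math.log(len(rows) / len(vectors))
        # P(word i present | class c), smoothed so no probability is exactly 0 or 1
        probs = [(sum(r[i] for r in rows) + alpha) / (len(rows) + 2 * alpha)
                 for i in range(n_features)]
        model[c] = (log_prior, probs)
    return model

def predict_bernoulli_nb(model, vector):
    """Return the class with the highest log posterior for one presence vector."""
    best_class, best_score = None, -math.inf
    for c, (log_prior, probs) in model.items():
        score = log_prior
        for x, p in zip(vector, probs):
            # present words contribute log p, absent words contribute log(1 - p)
            score += math.log(p) if x else math.log(1.0 - p)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Toy data (assumption): feature 0 ~ "good", feature 1 ~ "bad"
toy_X = [[1, 0], [1, 0], [0, 1], [0, 1]]
toy_y = ['pos', 'pos', 'neg', 'neg']
toy_model = train_bernoulli_nb(toy_X, toy_y)
print(predict_bernoulli_nb(toy_model, [1, 0]))  # pos
```

Working in log space avoids numerical underflow when multiplying 2000 small probabilities, and the smoothing term keeps a word that never appears in one class from forcing that class's probability to zero.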

## Testing the Model
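
A simple way to evaluate a classifier on the held-out data is accuracy, the fraction of test reviews whose predicted label matches the true one. The sketch below is one possible helper; the name `accuracy` is hypothetical and the toy labels are made up for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Toy labels (assumption): three of four predictions are correct
print(accuracy(['pos', 'neg', 'pos', 'neg'], ['pos', 'neg', 'neg', 'neg']))  # 0.75
```

With predicted labels for `X_test` in hand, `accuracy(y_test, predictions)` gives the share of unseen reviews classified correctly.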