**The aim of this notebook is to implement a Naive Bayes classifier in NumPy and use it to predict the sentiment of a movie review. Data is obtained from: https://www.mth548.org/Projects/text_with_naive_bayes/text_with_naive_bayes.html**

Contents:  
1. Data pre-processing
2. Implementing Naive Bayes in Numpy
3. Predicting sentiment on test data
4. Improving the model

**Loading data**

In [1]:
import numpy as np
import pandas as pd

path = r'C:\Users\sdrin\Desktop'
movie_reviews = pd.read_csv(path + '\movie_reviews.csv')

In [2]:
movie_reviews.head(5)

Unnamed: 0,review,sentiment
0,"This film is absolutely awful, but nevertheles...",negative
1,Well since seeing part's 1 through 3 I can hon...,negative
2,I got to see this film at a preview and was da...,positive
3,This adaptation positively butchers a classic ...,negative
4,Råzone is an awful movie! It is so simple. It ...,negative


**Pre-processing**

In [3]:
# remove recurring html
movie_reviews['review_'] = movie_reviews.review.apply(lambda x: x.replace("<br /><br />"," "))

In [4]:
# remove punctuation
import string
def remove_punctuation(text):
    no_punc = "".join([t for t in text if t not in string.punctuation])
    return no_punc.lower()

movie_reviews.review_ = movie_reviews.review_.apply(lambda x: remove_punctuation(x))

In [5]:
# tokenize
import re

def tokenize(text):
    tokens = re.split('\W+',text)
    return tokens

movie_reviews.review_ = movie_reviews.review_.apply(lambda x: tokenize(x))

In [6]:
# remove stop words
import nltk
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
# add some of our own stop words based on knowledge of the data
add = ['movie','film','movies','films','cinema','review']
for a in add:
    stop_words.append(a)
    
stop_words = [remove_punctuation(s) for s in stop_words]

def remove_stopwords(text):
    no_stop = [w for w in text if w not in stop_words]
    return no_stop

movie_reviews.review_ = movie_reviews.review_.apply(lambda x: remove_stopwords(x))

In [7]:
# Get the frequency of all vocab
vocab = movie_reviews.review_[0]
for r in range(len(movie_reviews.review_)-1):
    vocab += movie_reviews.review_[r+1]

df = pd.DataFrame(vocab,columns=['vocab'])
df['count'] = df.groupby('vocab')['vocab'].transform('count')
df.sort_values('count',ascending = False, inplace=True)
df.drop_duplicates(inplace=True)
df.reset_index(drop=True, inplace=True)
df.head()

Unnamed: 0,vocab,count
0,one,25729
1,like,19682
2,good,14710
3,even,12511
4,would,12141


In [8]:
# use the most frequent 1000 words to train the algorithm on
vocab = df['vocab'][:1000].values

In [9]:
# obtain a vector representing if a word appears in a review
def vectorize(text,vocab):
    vector = []
    for word in vocab:
        if word in text:
            vector.append(1)
        else:
            vector.append(0)
    return vector

movie_reviews['vectorized'] = movie_reviews['review_'].copy()
movie_reviews['vectorized'] = movie_reviews['review_'].apply(lambda x: vectorize(x,vocab))
movie_reviews

Unnamed: 0,review,sentiment,review_,vectorized
0,"This film is absolutely awful, but nevertheles...",negative,"[absolutely, awful, nevertheless, hilarious, t...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
1,Well since seeing part's 1 through 3 I can hon...,negative,"[well, since, seeing, parts, 1, 3, honestly, s...","[1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, ..."
2,I got to see this film at a preview and was da...,positive,"[got, see, preview, dazzled, typical, romantic...","[1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, ..."
3,This adaptation positively butchers a classic ...,negative,"[adaptation, positively, butchers, classic, be...","[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, ..."
4,Råzone is an awful movie! It is so simple. It ...,negative,"[råzone, awful, simple, seems, tried, make, sh...","[0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, ..."
...,...,...,...,...
24995,With this movie being the only Dirty Harry mov...,positive,"[dirty, harry, clint, eastwood, stars, produce...","[0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, ..."
24996,Any screen adaptation of a John Grisham story ...,positive,"[screen, adaptation, john, grisham, story, des...","[0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, ..."
24997,This film captured my heart from the very begi...,positive,"[captured, heart, beginning, hearing, quincy, ...","[0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, ..."
24998,A deplorable social condition triggers off the...,positive,"[deplorable, social, condition, triggers, cata...","[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, ..."


**Implementing Naive Bayes Algorithm**

Bayes Classifiers work by finding the probability of the **class** given the **data**:  

**P(class|data) =** $\frac{P(data|class) P(class) }{P(data)}$

**P(data|class)** is Guassian distributed likelihood; ie given the class what is the likelihood of obtaining exactly this data under a Guassian distribution?  

**P(class)** is our prior belief; what we know about the distribution of each class eg 70% positive

**P(data)** will cancel out as we are only interested in which of P(class|data) is higher for each class.  

We call it "Naive" because of the naive assumption that the data are all independantly distributed (covariance = 0). 

For example in a single review P("funny"|"hilarious") > P("funny"|"boring") due to covariance. But with Naive bayes we assume the probabilites are equal. This is to simplify the algorithm, as calculating covariances between all possible words will significantly increase the order of complexity and computational time.

In our case, we have P(y|X) where X is a feature vector (x<sub>1</sub>, x<sub>2</sub>, x<sub>3</sub>, ... ,x<sub>n</sub>):

P(y|X) = $\frac{P(X|y) P(y) }{P(X)}$

For selecting the class with the highest probability:

**y = argmax<sub>y</sub>P(x<sub>1</sub>|y) P(x<sub>2</sub>|y) ... P(x<sub>n</sub>|y).P(y)**


Since the log function is monotonically increasing, we can take the log of everything to turn a product into a summation (easier to compute)

**y = argmax<sub>y</sub>log(P(x<sub>1</sub>|y)) + log(P(x<sub>2</sub>|y)) +...+ log(P(x<sub>n</sub>|y)) + log(P(y))** 

log(P(x<sub>i</sub>|y) can be obtained by calculating the pdf from the N(mean, var) distribution, taking the log and summing over y.

Using scipy.stats multivariate normal package allows us to use the multivariate normal distribution and plug in the vector X directly.

In [10]:
# implementing Naive Bayes
from scipy.stats import multivariate_normal as mvn

class NaiveBayes(object):
    
    def fit(self, X, y, smoothing=1e-2):
        self.gaussians = dict() # empty dictionary to hold P(X|y)
        self.priors = dict() # empty dictionary to hold P(y)
        labels = set(y)
        for c in labels:
            X_c = X[y==c] # X|y
            self.gaussians[c] = {
                'mean': X_c.mean(axis=0),
                'var': X_c.var(axis=0) + smoothing,
            }
            self.priors[c] = len(y[y==c]) / len(y) # P(y)
    
    def score(self, X, y):
        P = self.predict(X)
        return np.mean(P==y)
            
    def predict(self, X):
        N, D = X.shape
        K = len(self.gaussians)
        P = np.zeros((N, K))
        for c, g in iter(self.gaussians.items()):
            mean, var = g['mean'], g['var']
            P[:,c] = mvn.logpdf(X, mean=mean, cov=var) + np.log(self.priors[c]) # log(P(X|y)) + log(P(y))
        return np.argmax(P, axis=1)

In [11]:
X = np.array([a for a in movie_reviews.vectorized])
X

array([[1, 1, 1, ..., 1, 1, 1],
       [1, 0, 1, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 0, 0]])

In [12]:
X.shape

(25000, 1000)

In [13]:
sent_dict = {
    'negative': 0,
    'positive': 1,
}
movie_reviews['sentiment'] = movie_reviews['sentiment'].map(sent_dict)

y = np.array([i for i in movie_reviews.sentiment])
y

array([0, 0, 1, ..., 1, 1, 1])

In [14]:
y.shape

(25000,)

In [15]:
from sklearn.model_selection import train_test_split

idx = movie_reviews.index.values

X_train, X_test, y_train, y_test, idx_train, idx_test = train_test_split(X, y, idx, test_size = 0.2, random_state = 123)

model = NaiveBayes()
model.fit(X_train, y_train)
predictions_test = model.predict(X_test)

print("train score:", model.score(X_train, y_train))
print("test score:", model.score(X_test, y_test))

train score: 0.83365
test score: 0.825


82.5% of the reviews in the test set we managed to match the sentiment to the review. Very close to the train score which means we have very little over-fitting.