# Import libraries and dataset

We import the necessary libraries.

In [1]:
import numpy as np
import pandas as pd

We load the IMDB dataset from Hugging Face.

In [2]:
!pip install datasets
from datasets import load_dataset
dataset_train = load_dataset('imdb', split='train')
dataset_test = load_dataset('imdb', split='test')



Reusing dataset imdb (/home/niss/.cache/huggingface/datasets/imdb/plain_text/1.0.0/e3c66f1788a67a89c7058d97ff62b6c30531e05b549de56d3ab91891f0561f9a)
Reusing dataset imdb (/home/niss/.cache/huggingface/datasets/imdb/plain_text/1.0.0/e3c66f1788a67a89c7058d97ff62b6c30531e05b549de56d3ab91891f0561f9a)


# Preprocessing

We prepare stemmed and lemmatized versions of the data up front, so that we can compare them when we test our algorithm later on.

From the dataset we create our training and test data.

In [3]:
x_train, y_train = dataset_train[:]['text'], dataset_train[:]['label']
x_test, y_test = dataset_test[:]['text'], dataset_test[:]['label']
len(x_train)

25000

We import our utils.py file which contains functions for stemming and lemmatization.

In [4]:
!python3 -m spacy download en_core_web_sm
import utils

Collecting en-core-web-sm==3.1.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.1.0/en_core_web_sm-3.1.0-py3-none-any.whl (13.6 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.1.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


[nltk_data] Downloading package punkt to /home/niss/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [5]:
# We apply stemming to our dataset
stemmed_train = utils.stem(x_train)
stemmed_test = utils.stem(x_test)

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 25000/25000 [00:45<00:00, 543.58it/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 25000/25000 [00:44<00:00, 560.37it/s]


In [6]:
# We apply lemmatization to our dataset

lemmas_train = utils.lemm(x_train)
lemmas_test = utils.lemm(x_test)

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 25000/25000 [04:44<00:00, 87.83it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 25000/25000 [04:34<00:00, 90.92it/s]


In [7]:
# We create a dataframe version of our dataset because it will be easier to
# use later on when we want to test

df_train_lemma = pd.DataFrame(list(zip(lemmas_train, y_train)), columns=['val', 'label'])
df_test_lemma = pd.DataFrame(list(zip(lemmas_test, y_test)), columns=['val', 'label'])

df_train_stem = pd.DataFrame(list(zip(stemmed_train, y_train)), columns=['val', 'label'])
df_test_stem = pd.DataFrame(list(zip(stemmed_test, y_test)), columns=['val', 'label'])

df_train = pd.DataFrame(list(zip(x_train, y_train)), columns=['val', 'label'])
df_test = pd.DataFrame(list(zip(x_test, y_test)), columns=['val', 'label'])
df_train.head(3), df_test.head(3)

(                                                 val  label
 0  Bromwell High is a cartoon comedy. It ran at t...      1
 1  Homelessness (or Houselessness as George Carli...      1
 2  Brilliant over-acting by Lesley Ann Warren. Be...      1,
                                                  val  label
 0  I went and saw this movie last night after bei...      1
 1  Actor turned director Bill Paxton follows up h...      1
 2  As a recreational golfer with some knowledge o...      1)

# Naive Bayes algorithm

In [8]:
# Typing to make function signatures clearer
import re
from typing import Dict, List, Tuple

# Functions used in our Naive Bayes algorithm

def occurences_and_vocabulary(x_train: List[str],
                              y_train: List[int],
                              classes: List[int]) -> Tuple[Dict[int, Dict[str, int]], np.ndarray]:
    '''
    Builds the vocabulary (every distinct token found in the dataset) and a
    dictionary that records, for each class, how many times each token
    occurs in the documents of that class
    '''
    # Dictionary of dictionaries: one inner dictionary of token counts
    # per class, keyed by the class label
    dictionnary = {}
    for c in classes:
        dictionnary[c] = {}
    vocabulary = []
    for i in range(len(y_train)):
        c = y_train[i]
        # We split the document into tokens on spaces, periods, commas and
        # double quotes
        splitted_doc = re.split("[ .,\"]", x_train[i])
        for word in splitted_doc:
            vocabulary.append(word)
            # We count the occurrences of the current word within class c
            if word not in dictionnary[c]:
                dictionnary[c][word] = 1
            else:
                dictionnary[c][word] += 1
    # We remove duplicate words from our vocabulary
    vocabulary = np.unique(vocabulary)

    return dictionnary, vocabulary

def sum_counts(D: Dict[int, Dict[str, int]], classes: List[int]) -> List[int]:
    '''
    Counts the total number of words found in the dataset for each class
    Assumes the classes are the integers 0..len(classes)-1, since they are
    used as list indices
    '''
    sum_per_class = [0 for _ in range(len(classes))]
    for key in D:
        # D is a dictionary of dictionaries, so we sum the counts stored in
        # the inner dictionary of the current class
        sum_per_class[key] = sum(D[key].values())
    return sum_per_class
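
To make the tokenization concrete, here is a small sketch of what the split pattern used above produces on a toy sentence. The sentence and the `tokens` and `counts` names are made up for illustration. Note that adjacent delimiters yield empty strings, which the functions above count like any other token, and that the split is case-sensitive.

```python
import re
from collections import Counter

# Same delimiter set as in the functions above: space, period, comma, double quote
PATTERN = "[ .,\"]"

# Toy sentence, not taken from the dataset
tokens = re.split(PATTERN, "Good movie, good cast.")
print(tokens)  # ['Good', 'movie', '', 'good', 'cast', '']

counts = Counter(tokens)
print(counts[''])                      # 2 empty tokens
print(counts['Good'], counts['good'])  # 1 1: 'Good' and 'good' are distinct tokens
```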

In [14]:
# Function used for the binary Naive Bayes
import re
def word_check_and_vocabulary(x_train: List[str],
                              y_train: List[int],
                              classes: List[int]) -> Tuple[Dict[int, Dict[str, int]], np.ndarray]:
    '''
    Builds the vocabulary (every distinct token found in the dataset) and a
    dictionary that records which words can be found in the documents of
    each class
    '''
    # We retain the dictionary of dictionaries format to keep the same
    # code structure
    word_check = {}
    for c in classes:
        word_check[c] = {}
    vocabulary = []
    for i in range(len(y_train)):
        c = y_train[i]
        splitted_doc = re.split("[ .,\"]", x_train[i])
        for word in splitted_doc:
            vocabulary.append(word)
            # Instead of incrementing a count, we only record presence as 1
            if word not in word_check[c]:
                word_check[c][word] = 1
    vocabulary = np.unique(vocabulary)

    return word_check, vocabulary

In [10]:
def train_naive_bayes(data: List[str], target: List[int]) -> Tuple[Dict[int, float], Dict[int, Dict[str, float]], np.ndarray]:
    '''
    Trains the model on the input dataset: computes the log prior of each
    class and the log likelihood of each word with respect to each class,
    and builds the vocabulary of the training dataset
    '''
    # We convert the labels to an array so that `target == c` below is an
    # element-wise comparison
    target = np.asarray(target)
    logprior = dict()
    classes = np.unique(target)
    count, vocabulary = occurences_and_vocabulary(data, target, classes)
    ndoc = len(target)
    loglikelihood = dict()
    sum_per_class = sum_counts(count, classes)

    # For each class we compute the log prior and the log likelihood of
    # every word present in the documents of that class
    for c in classes:
        nc = np.count_nonzero(target == c)
        logprior[c] = np.log(nc / ndoc)

        # Note: the denominator adds 1 rather than the vocabulary size, so
        # this is not the standard add-one (Laplace) estimate, and words
        # never seen in class c get no entry at all
        loglikelihood[c] = {}
        for key in count[c]:
            loglikelihood[c][key] = np.log((count[c][key] + 1) / (sum_per_class[c] + 1))

    return logprior, loglikelihood, vocabulary
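
For comparison, the textbook add-one (Laplace) estimate adds the vocabulary size |V| to the denominator and also assigns a probability to vocabulary words that were never seen in a class. The sketch below is not the function above, and the results reported in this notebook were not produced with it; `laplace_loglikelihood` and the toy counts are hypothetical names and data used only to illustrate the formula.

```python
import math
from typing import Dict, List

def laplace_loglikelihood(count: Dict[int, Dict[str, int]],
                          vocabulary: List[str]) -> Dict[int, Dict[str, float]]:
    """log P(w | c) = log((count(w, c) + 1) / (total words in c + |V|))."""
    v = len(vocabulary)
    loglikelihood = {}
    for c, word_counts in count.items():
        total = sum(word_counts.values())
        # Every vocabulary word gets an entry, including words unseen in class c
        loglikelihood[c] = {
            w: math.log((word_counts.get(w, 0) + 1) / (total + v))
            for w in vocabulary
        }
    return loglikelihood

# Toy counts, invented for illustration
toy_count = {0: {"bad": 3, "plot": 1}, 1: {"good": 2, "plot": 2}}
toy_vocab = ["bad", "good", "plot"]
ll = laplace_loglikelihood(toy_count, toy_vocab)

# With |V| in the denominator, the probabilities of each class sum to 1
print(sum(math.exp(x) for x in ll[0].values()))
```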

In [11]:
def train_binary_naive_bayes(data: List[str], target: List[int]) -> Tuple[Dict[int, Dict[str, int]], np.ndarray]:
    '''
    Takes an input dataset and records which words appear in the documents
    of each class
    '''
    classes = np.unique(target)
    count, vocabulary = word_check_and_vocabulary(data, target, classes)
    return count, vocabulary

In [12]:
# Test one doc at a time

def test_naive_bayes(testdoc: List[str],
                     logprior: Dict[int, float],
                     loglikelihood: Dict[int, Dict[str, float]],
                     C: List[int]) -> int:
    '''
    Predicts the class of the given tokenized document from the given
    log priors and log likelihoods
    logprior only needs to be indexable by class, so a list also works
    '''
    sum_ = [0 for _ in range(len(logprior))]
    for c in C:
        sum_[c] = logprior[c]
        for word in testdoc:
            # We use try/except instead of an if check for every word
            # to speed up the process
            # A KeyError means the word was never seen in class c during
            # training, so we do not take it into account
            try:
                sum_[c] += loglikelihood[c][word]
            except KeyError:
                pass
    return np.argmax(sum_)
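
Written as an equation, the function above picks the class with the highest log score. The symbols are introduced here for illustration only: $d$ is the tokenized test document, $C$ the set of classes, and $V_c$ the set of words that have an entry in `loglikelihood[c]`. Words outside $V_c$ are skipped, as in the code.

```latex
\hat{c} = \arg\max_{c \in C} \Big( \log P(c) + \sum_{\substack{w \in d \\ w \in V_c}} \log P(w \mid c) \Big)
```

Because the sum runs over every token of $d$, a word that appears several times in the document contributes several times to the score.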

# Test

First, let's see how accurate our model is when trained on the initial dataset.

In [15]:
logprior, loglikelihood, vocabulary = train_naive_bayes(x_train, y_train)

In [16]:
accuracy = 0
# Confusion matrix: rows are predicted labels, columns are true labels
confusion_matrix = np.array([[0, 0], [0, 0]])
# We transform our documents into lists of words
split_df = df_test['val'].str.split("[ .,\"]")
# We test one document at a time
for i in range(len(split_df)):
    # The model predicts the current document's class
    res = test_naive_bayes(split_df[i], logprior, loglikelihood, [0, 1])
    if res == y_test[i]:
        accuracy += 1
    confusion_matrix[res, y_test[i]] += 1
accuracy /= len(split_df)
accuracy

0.61016

With about 61% accuracy on a balanced two-class test set, the model is only modestly better than random guessing (50%).
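
The same evaluation loop is reused for every model variant in this notebook, so it could be factored into a helper. The sketch below is one possible shape for it; `evaluate`, its parameters and the toy predictor are hypothetical and are not part of the notebook.

```python
from typing import Callable, List, Sequence, Tuple

def evaluate(docs: Sequence[List[str]],
             labels: Sequence[int],
             predict: Callable[[List[str]], int],
             n_classes: int = 2) -> Tuple[float, List[List[int]]]:
    """Return (accuracy, confusion matrix) with rows = predicted, columns = true."""
    confusion = [[0] * n_classes for _ in range(n_classes)]
    correct = 0
    for doc, label in zip(docs, labels):
        res = predict(doc)
        correct += int(res == label)
        confusion[res][label] += 1
    return correct / len(labels), confusion

# Toy usage with a stand-in predictor (a made-up rule, not the trained model)
toy_docs = [["good"], ["bad"], ["good", "bad"]]
toy_labels = [1, 0, 1]
acc, cm = evaluate(toy_docs, toy_labels,
                   lambda doc: int("good" in doc and "bad" not in doc))
print(acc, cm)
```

With the notebook's objects, a call would presumably look like `evaluate(split_df, y_test, lambda d: test_naive_bayes(d, logprior, loglikelihood, [0, 1]))`.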

In [18]:
# Rows are predicted labels and columns are true labels (0 = negative, 1 = positive):
# True negative is at position (0, 0)
# False negative at (0, 1)
# False positive at (1, 0)
# True positive at (1, 1)

print("Confusion matrix: ")
print(confusion_matrix)

Confusion matrix: 
[[7987 5233]
 [4513 7267]]


From the confusion matrix we can note that the model makes both kinds of errors frequently: it labels 5,233 positive reviews as negative (false negatives) and 4,513 negative reviews as positive (false positives).
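
The usual summary metrics can be read directly off the confusion matrix. The sketch below recomputes them from the numbers printed above, using the same layout (rows are predicted labels, columns are true labels); the variable names are chosen here for illustration.

```python
# Confusion matrix printed above: rows = predicted label, columns = true label
cm = [[7987, 5233],
      [4513, 7267]]

tn, fn = cm[0]  # predicted negative
fp, tp = cm[1]  # predicted positive
total = tn + fn + fp + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)  # of the reviews predicted positive, how many are positive
recall = tp / (tp + fn)     # of the positive reviews, how many were found

print(f"accuracy={accuracy:.5f} precision={precision:.3f} recall={recall:.3f}")
# accuracy=0.61016 precision=0.617 recall=0.581
```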

## With the different preprocessing methods

Let's add some preprocessing and see how it improves our model.

First, let's see how stemming improves our accuracy.

In [20]:
# Preprocessing has already been done beforehand, we now use this stemmed dataset
logprior_stemming, loglikelihood_stemming, vocabulary_stemming = train_naive_bayes(stemmed_train, y_train)

In [21]:
accuracy_stemming = 0
confusion_matrix_stem = np.array([[0, 0], [0, 0]])
split_df = df_test_stem['val'].str.split("[ .,\"]")
for i in range(len(split_df)):
    res = test_naive_bayes(split_df[i], logprior_stemming, loglikelihood_stemming, [0, 1])
    if res == y_test[i]:
        accuracy_stemming += 1
    confusion_matrix_stem[res, y_test[i]] += 1
accuracy_stemming /= len(split_df)
accuracy_stemming

0.70724

In [22]:
print("Confusion matrix: ")
print(confusion_matrix_stem)

Confusion matrix: 
[[9632 4451]
 [2868 8049]]


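As we can see, the accuracy increased by about 10 percentage points (from 61.0% to 70.7%), which is a lot.
A likely explanation is the much smaller vocabulary: inflected forms of a word are merged into a single token, so there are far fewer distinct words to count and the counts are less fragmented.

The vocabulary sizes show the reduction: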

In [23]:
print("Length vocabulary: " + str(len(vocabulary)))
print("Length vocabulary after stemming: " + str(len(vocabulary_stemming)))

Length vocabulary: 177907
Length vocabulary after stemming: 49134


The vocabulary after stemming is a little over a quarter (about 28%) of the size of the original vocabulary.

Now let's try lemmatization.

In [24]:
# Preprocessing has already been done beforehand, we now use this lemmatized dataset
logprior_lemma, loglikelihood_lemma, vocabulary_lemma = train_naive_bayes(lemmas_train, y_train)

In [25]:
accuracy_lemma = 0
confusion_matrix_lemma = np.array([[0, 0], [0, 0]])
split_df = df_test_lemma['val'].str.split("[ .,\"]")
for i in range(len(split_df)):
    res = test_naive_bayes(split_df[i], logprior_lemma, loglikelihood_lemma, [0, 1])
    if res == y_test[i]:
        accuracy_lemma += 1
    confusion_matrix_lemma[res, y_test[i]] += 1
accuracy_lemma /= len(split_df)
accuracy_lemma

0.69256

In [26]:
print("Confusion matrix: ")
print(confusion_matrix_lemma)

Confusion matrix: 
[[9431 4617]
 [3069 7883]]


This is slightly less accurate than with stemming, but still a clear improvement over no preprocessing at all. The accuracy with lemmatization might be improved further with a more detailed vocabulary.

We can also notice that with preprocessing, the model is much better at recognizing negative reviews: true negatives rise from 7,987 to 9,632 with stemming and 9,431 with lemmatization. One possible explanation is that after stemming or lemmatization many words are found in both positive and negative reviews, while negative words such as "not" stand out by occurring far more often in negative reviews.

## Comparing with Binary Naive Bayes

Let's compare the accuracy between counting the occurrences of words and just checking their presence.

In [27]:
word_check, binary_vocabulary = train_binary_naive_bayes(x_train, y_train)

In [28]:
accuracy_binary = 0
confusion_matrix_binary = np.array([[0, 0], [0, 0]])
split_df = df_test['val'].str.split("[ .,\"]")
for i in range(len(split_df)):
    res = test_naive_bayes(split_df[i], [0, 0], word_check, [0, 1])
    if res == y_test[i]:
        accuracy_binary += 1
    confusion_matrix_binary[res, y_test[i]] += 1
accuracy_binary /= len(split_df)

In [29]:
print("Naive Bayes accuracy: " + str(accuracy))
print("Binary Naive Bayes accuracy: " + str(accuracy_binary))

Naive Bayes accuracy: 0.61016
Binary Naive Bayes accuracy: 0.5908


In [30]:
print("Confusion matrix: ")
print(confusion_matrix_binary)

Confusion matrix: 
[[9615 7345]
 [2885 5155]]


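As we can see, just checking for presence results in a lower accuracy (59.1% versus 61.0%). The presence check discards word frequencies: since the log priors are set to 0 and every recorded word contributes 1, the score of each class is simply the number of test tokens that appear at least once in that class's training documents.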
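
For reference, the variant usually called binary (or Boolean) multinomial Naive Bayes differs from the presence check above: it removes duplicate words within each document, then counts and smooths as in the regular model. The sketch below shows only the counting step; `binary_counts` and the toy documents are hypothetical, and unlike the notebook's functions it drops empty tokens.

```python
import re
from collections import Counter
from typing import Dict, List

def binary_counts(docs: List[str], labels: List[int]) -> Dict[int, Counter]:
    """Per-class word counts where each word counts at most once per document."""
    counts: Dict[int, Counter] = {}
    for doc, label in zip(docs, labels):
        # set() removes repeated words inside a single document;
        # empty tokens produced by adjacent delimiters are dropped
        unique_words = set(re.split("[ .,\"]", doc)) - {""}
        counts.setdefault(label, Counter()).update(unique_words)
    return counts

# Toy documents, invented for illustration
toy_docs = ["great great great film", "great cast", "boring film"]
toy_labels = [1, 1, 0]
counts = binary_counts(toy_docs, toy_labels)
print(counts[1]["great"])  # 2: one per positive document, not 4
```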

## What happened when our model guessed incorrectly?

In [31]:
# To test our model on-the-fly
split_df = df_test_stem['val'].str.split("[ .,\"]")

Let's take a look at some instances where our model misclassified the sentiment of the document. We'll use our model with stemming because it has the best accuracy out of all models.

In [32]:
i = 7 # Arbitrary number
var = test_naive_bayes(split_df[i], logprior_stemming, loglikelihood_stemming, [0, 1])
print("Model's prediction: " + str(var))
print("Answer: " + str(y_test[i]))

Model's prediction: 0
Answer: 1


Our model thought it was a negative review but it was in fact a positive one. Let's take a look at the document.

In [33]:
df_test_stem['val'][i]

'i felt this film did have mani good qualiti the cinematographi was certain differ expos the stage aspect of the set and stori the origin charact as actor was certain an achiev and i felt most play quit convinc of cours they are play themselv but definit uniqu the cultur aspect may leav mani disappoint as a familiar with the chines and orient cultur will answer a lot of question regard relationship and the stigma that goe with ani drug use i found the jia hongsheng stori interest on a down note the stori is in beij and some of the fashion and music reek of earli 90s even though this was made in 2001 so it realli cheesi sometim the beatl crap etc whatev not a top ten or twenti but if it on the televis check it out'

When analyzing this document, we can notice a lot of words that could be counted as negative, such as "reek", "crap", "drug" and "disappoint", which probably misled our model into thinking this was a negative review.
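
One way to check this kind of intuition is to rank the words of a document by how much they pull the score toward each class. The sketch below does this from a `loglikelihood` dictionary shaped like the one returned by training; `word_contributions` and the toy values are hypothetical, and skipping words missing from either class is a simplification of this sketch.

```python
from typing import Dict, List, Tuple

def word_contributions(doc: List[str],
                       loglikelihood: Dict[int, Dict[str, float]],
                       pos: int = 1, neg: int = 0) -> List[Tuple[str, float]]:
    """Sort the distinct words of doc by log P(w|pos) - log P(w|neg)."""
    scores = {}
    for word in set(doc):
        # Only words with an entry in both classes can be compared
        if word in loglikelihood[pos] and word in loglikelihood[neg]:
            scores[word] = loglikelihood[pos][word] - loglikelihood[neg][word]
    # Most negative-leaning words first
    return sorted(scores.items(), key=lambda item: item[1])

# Toy log likelihoods, invented for illustration
toy_ll = {0: {"crap": -5.0, "good": -7.0, "film": -3.0},
          1: {"crap": -8.0, "good": -5.0, "film": -3.0}}
print(word_contributions(["good", "film", "crap", "crap"], toy_ll))
# [('crap', -3.0), ('film', 0.0), ('good', 2.0)]
```

With the notebook's objects, `word_contributions(split_df[i], loglikelihood_stemming)[:10]` would presumably list the ten words that lean most strongly toward the negative class.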

Let's take a look at another document our model failed to assess correctly.

In [34]:
i = 62
split_df = df_test_stem['val'].str.split("[ .,\"]")
var = test_naive_bayes(split_df[i], logprior_stemming, loglikelihood_stemming, [0, 1])
print("Model's prediction: " + str(var))
print("Answer: " + str(y_test[i]))

Model's prediction: 0
Answer: 1


Once again, our model thought it was a negative review but it was a positive one. Let's take a look at the document.

In [35]:
df_test_stem['val'][i]

'i have been a fan of madonna for quit sometim now howev i thought i would comment on this br br this film mistaken one of them as well as madonna was pan by the critic they were high mistaken and mani potenti viewer were turn off by the bad br br first madonna doe an excel job in this movi which was one of her first she play a ditsi blond in the film she is far from a ditsi blond in real life most critic were somewhat prejud by her sing fame and did give her a fair shake when you view this film i hope that you understand that the accent and the goofi is just act she was absolut hyster as was the br br griffen dunn is anoth person who was not given a fair review in the film if you take a look at his filmographi you will see he is quit an accomplish br br as far as the movi itself this is someth similar to pretti woman but came 3 year befor the robert gere success it a comedi with lot of site gag slapstick and one liner some of the comedi is deadpan and take a comedi aficionado to reall

The same can be said here: words like "mistaken", "critic", "turn off" and "goofi" appear in this document, which probably misled our model into thinking it was a negative review instead of a positive one.