# Naive Bayes

We set the random seed to make our results reproducible.

In [None]:
import random

random.seed(10)

First we import everything we need for this notebook.
The Hugging Face `datasets` library includes several datasets. We will use the IMDB dataset in our case.

In [None]:
# import datasets
from datasets import load_dataset, concatenate_datasets
import pandas as pd

We download the data with `load_dataset` from the Hugging Face `datasets` library. However, we do not keep the predefined train/test split (`split=('train', 'test')`): we will split the data into train and test sets manually. First, we merge both splits into a single dataset of 50,000 elements.

In [None]:
dataset_train = load_dataset('imdb', split='train')
dataset_test = load_dataset('imdb', split='test')

Reusing dataset imdb (/root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/e3c66f1788a67a89c7058d97ff62b6c30531e05b549de56d3ab91891f0561f9a)
Reusing dataset imdb (/root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/e3c66f1788a67a89c7058d97ff62b6c30531e05b549de56d3ab91891f0561f9a)


In [None]:
dataset = concatenate_datasets([dataset_train, dataset_test])
len(dataset)

50000

Now that we have our data, we want to convert it to a DataFrame to facilitate manipulations.

In [None]:
from typing import List, Tuple

def create_dataframe(data: List[Tuple[int, str]], columns: List[str]) -> pd.DataFrame:
    """Convert a list of (label, text) pairs into a DataFrame with the given column names."""

    rtn = pd.DataFrame(data, columns=columns)
    return rtn

df = create_dataframe(list(zip(dataset['label'], dataset['text'])), ['Label', 'Text'])
df.head()

Unnamed: 0,Label,Text
0,1,Bromwell High is a cartoon comedy. It ran at t...
1,1,Homelessness (or Houselessness as George Carli...
2,1,Brilliant over-acting by Lesley Ann Warren. Be...
3,1,This is easily the most underrated film inn th...
4,1,This is not the typical Mel Brooks film. It wa...


First, we need to convert the text into numbers that we can compute with. We use word frequencies: each text is transformed into a vector whose entries count how often each word occurs in it.

For this we use `CountVectorizer` from `sklearn`, keeping only the 1,500 most frequent words (`max_features = 1500`).

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 1500)
 
X = cv.fit_transform(df['Text']).toarray()
y = df['Label']
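
To make the vectorization step concrete, here is a minimal sketch on a two-sentence toy corpus. The corpus and the `toy_*` names are made up for illustration and are not part of the IMDB data. With default settings, `CountVectorizer` lowercases the text and ignores single-character tokens such as "I".

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical sentences, not taken from IMDB).
toy_corpus = [
    "I liked the movie",
    "I hated the movie, the plot was bad",
]

toy_cv = CountVectorizer()
toy_counts = toy_cv.fit_transform(toy_corpus).toarray()

# vocabulary_ maps each word to its column index; sort the words by column.
toy_vocabulary = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)

print(toy_vocabulary)
print(toy_counts)
```

Each row is one text and each column one word of the vocabulary, so the second row holds a 2 in the column for "the".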

The `train_test_split` function shuffles the whole dataset before splitting it. In our case, we use 75% of the data for training and 25% for testing.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
           X, y, test_size = 0.25, random_state = 0)
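
As a small illustration of what this call does, the sketch below applies the same arguments to a toy array of 8 samples. The `toy_*` names and values are hypothetical. It also checks that fixing `random_state` gives the same split on every call.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 8 samples with 2 features each.
toy_X = np.arange(16).reshape(8, 2)
toy_y = np.array([0, 1, 0, 1, 0, 1, 0, 1])

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.25, random_state=0)

# Calling again with the same random_state reproduces the same shuffle.
repeat_X_train, repeat_X_test, _, _ = train_test_split(
    toy_X, toy_y, test_size=0.25, random_state=0)

print(toy_X_train.shape, toy_X_test.shape)
print(np.array_equal(toy_X_test, repeat_X_test))
```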

Bayes' theorem states that, for two events `A` and `B` with `P(B) > 0`:
$$ P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)} $$

We're going to use the Naive Bayes classifier, which is based on applying Bayes' theorem. Here, we make the `naive` assumption that, given the class `c` of a review, every word in a sentence is independent of the others. This means that we can look at individual words. For example:
$$ P(\text{liked the movie} \mid c) = P(\text{liked} \mid c) \cdot P(\text{the} \mid c) \cdot P(\text{movie} \mid c) $$
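
The product of per-word probabilities can be worked through by hand. The sketch below uses made-up word probabilities for two classes (`pos` and `neg`); none of these numbers are estimated from IMDB. It multiplies the class prior by the probability of each word given the class, then normalizes, which is Bayes' theorem applied under the naive independence assumption. It works with discrete word probabilities for readability, whereas `GaussianNB` models each feature with a normal distribution.

```python
# Hypothetical priors and per-class word probabilities (illustrative only).
priors = {"pos": 0.5, "neg": 0.5}
word_probs = {
    "pos": {"liked": 0.10, "the": 0.30, "movie": 0.20},
    "neg": {"liked": 0.02, "the": 0.30, "movie": 0.20},
}

def naive_bayes_posterior(words, priors, word_probs):
    """Return P(class | words), assuming words are independent given the class."""
    scores = {}
    for label, prior in priors.items():
        score = prior  # P(class)
        for word in words:
            score *= word_probs[label][word]  # times P(word | class)
        scores[label] = score
    total = sum(scores.values())  # P(words), the normalizing constant
    return {label: score / total for label, score in scores.items()}

posterior = naive_bayes_posterior(["liked", "the", "movie"], priors, word_probs)
print(posterior)
```

Only "liked" differs between the two classes here, so it alone pushes the posterior toward `pos`.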

In [None]:
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(X_train, y_train)

GaussianNB(priors=None, var_smoothing=1e-09)

We use `confusion_matrix` from `sklearn` to display the number of correct (true positive and true negative) and incorrect (false positive and false negative) predictions. Rows correspond to the true labels and columns to the predicted labels.

In [None]:
from sklearn.metrics import confusion_matrix
y_pred = gnb.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
cm

array([[5271,  981],
       [1595, 4653]])
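
The four counts in this matrix are enough to compute the usual metrics by hand. The sketch below copies the values printed above into a new array (the name `cm_values` is ours) and relies on scikit-learn's layout, where rows are true labels and columns are predicted labels.

```python
import numpy as np

# Values copied from the confusion matrix printed above.
cm_values = np.array([[5271, 981],
                      [1595, 4653]])

# For a binary problem, flattening row by row gives TN, FP, FN, TP.
tn, fp, fn, tp = cm_values.ravel()

accuracy = (tp + tn) / cm_values.sum()
precision_pos = tp / (tp + fp)  # of all reviews predicted positive, share that are positive
recall_pos = tp / (tp + fn)     # of all positive reviews, share that were found

print(f"accuracy={accuracy:.2f} precision={precision_pos:.2f} recall={recall_pos:.2f}")
```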

We use `classification_report` from `sklearn` to display the precision, recall, and F1-score for both classes on the test data.

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.77      0.84      0.80      6252
           1       0.83      0.74      0.78      6248

    accuracy                           0.79     12500
   macro avg       0.80      0.79      0.79     12500
weighted avg       0.80      0.79      0.79     12500



Here are the misclassified reviews:

In [None]:
bad_predict_df = y_test.where(y_test != y_pred).dropna()

bad_predict_df

45519    0.0
26128    1.0
26376    1.0
12968    0.0
32104    1.0
        ... 
34600    1.0
21253    0.0
19426    0.0
27615    1.0
30528    1.0
Name: Label, Length: 2576, dtype: float64
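
The labels above appear as `0.0` and `1.0` because `where` replaces the non-matching rows with `NaN`, which forces a float dtype before `dropna` removes them. The sketch below reproduces this on a toy Series (all `toy_*` values are made up) and shows a boolean mask as a possible alternative that keeps the integer dtype.

```python
import numpy as np
import pandas as pd

# Toy labels and predictions (hypothetical values).
toy_true = pd.Series([0, 1, 1, 0], index=[10, 11, 12, 13], name="Label")
toy_pred = np.array([0, 0, 1, 1])

# Same idiom as above: mask with NaN, then drop the NaN rows.
via_where = toy_true.where(toy_true != toy_pred).dropna()

# Alternative: select the misclassified rows directly with a boolean mask.
via_mask = toy_true[toy_true != toy_pred]

print(via_where)
print(via_mask)
```

Both versions select the same rows (index 11 and 13 here); only the dtype of the values differs.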

In [None]:
indexes = bad_predict_df.index

indexes

Int64Index([45519, 26128, 26376, 12968, 32104,  8369, 46250, 34839, 34478,
             5140,
            ...
            16446,  1971, 46737,  6875, 11686, 34600, 21253, 19426, 27615,
            30528],
           dtype='int64', length=2576)

In [None]:
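# `indexes` holds index labels, so we select by label
# (same rows as `iloc` here, because `df` has a default integer index)
df.loc[indexes]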

Unnamed: 0,Label,Text
45519,0,I think the movie was one sided I watched it r...
26128,1,"I really liked this picture, because it realis..."
26376,1,I think it is a brilliant show with cool talki...
12968,0,I saw this movie as a very young girl (I'm 27 ...
32104,1,what a refreshing change from the PG movies th...
...,...,...
34600,1,"The first time I saw this film, I wanted to li..."
21253,0,"If you have seen the Sholay of 1975, Don't wat..."
19426,0,"In 1993, ""the visitors"" was an enormous hit in..."
27615,1,Sheba Baby is always underrated most likely be...


In [None]:
df.iloc[32104]["Text"]

'what a refreshing change from the PG movies that have teen girls jumping in and out of bed, young high school boys counting how many girls they can "hook up" with, kids drinking, doing drugs, etc., etc., etc. Carl Hiaasen has written so many books that are enjoyable but hardly classic literature. but he has finally written something that Middle School kids WANT to read. And this movie sends a message to kids that maybe they can make a difference, that maybe their voices can be heard. Filmed in South Florida, the scenery is beautiful and natural and REAL. Who cares if its predictable, and a little corny. So was FREE WILLY and look how well that did. This is a good family movie..........a rare breed.'

In [None]:
df.iloc[26128]["Text"]

'I really liked this picture, because it realistically dealt with two people in love, and one of them having a disorder. Though the ending saddened me, I know that that was the best way for it to finish off. I would recommed this to everyone.'

This review is misclassified by our model, probably because it contains words that often carry a negative connotation: disorder, saddened, finish.

We also tried the Naive Bayes model with preprocessing in the file `naive_bayers_pretreatement.ipynb`.