## Naive Bayes with preprocessing

We set the random seed to make our results reproducible.

In [None]:
import random

random.seed(10)
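
As a quick check of what seeding buys us, the sketch below re-seeds Python's `random` module and draws the same numbers twice. The helper name `draw` is hypothetical and only for illustration. Note that `random.seed` only affects Python's own `random` module; the scikit-learn split later in this notebook is controlled separately through its `random_state` argument.

```python
import random

def draw(seed, n=3):
    """Re-seed the global generator and draw n integers (illustrative helper)."""
    random.seed(seed)
    return [random.randint(0, 99) for _ in range(n)]

first = draw(10)
second = draw(10)

# Same seed, same sequence
print(first == second)  # True
```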

First we import everything we need for this notebook.
The Hugging Face `datasets` library includes several datasets. We will use the IMDB dataset in our case.

In [None]:
# import datasets
from datasets import load_dataset, concatenate_datasets
import pandas as pd

We download the data from Hugging Face. We will split the data into train and test sets manually, so we first merge the provided train and test datasets into a single dataset of 50,000 elements.

In [None]:
dataset_train = load_dataset('imdb', split='train')
dataset_test = load_dataset('imdb', split='test')

Downloading and preparing dataset imdb/plain_text (download: 80.23 MiB, generated: 127.02 MiB, post-processed: Unknown size, total: 207.25 MiB) to /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/e3c66f1788a67a89c7058d97ff62b6c30531e05b549de56d3ab91891f0561f9a...

Reusing dataset imdb (/root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/e3c66f1788a67a89c7058d97ff62b6c30531e05b549de56d3ab91891f0561f9a)

Dataset imdb downloaded and prepared to /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/e3c66f1788a67a89c7058d97ff62b6c30531e05b549de56d3ab91891f0561f9a. Subsequent calls will reuse this data.

In [None]:
dataset = concatenate_datasets([dataset_train, dataset_test])
len(dataset)

50000

Now that we have our data, we convert it to a DataFrame to make it easier to manipulate. The Hugging Face version of the dataset already encodes the labels as integers (`0` for negative, `1` for positive), so no label conversion is needed to train our model.

In [None]:
from typing import List, Tuple

def create_dataframe(data: List[Tuple[int, str]], columns: List[str]) -> pd.DataFrame:
    """Convert a list of (label, text) pairs into a DataFrame."""
    return pd.DataFrame(data, columns=columns)

df = create_dataframe(list(zip(dataset['label'], dataset['text'])), ['Label', 'Text'])
df.head()

| | Label | Text |
|---|---|---|
| 0 | 1 | Bromwell High is a cartoon comedy. It ran at t... |
| 1 | 1 | Homelessness (or Houselessness as George Carli... |
| 2 | 1 | Brilliant over-acting by Lesley Ann Warren. Be... |
| 3 | 1 | This is easily the most underrated film inn th... |
| 4 | 1 | This is not the typical Mel Brooks film. It wa... |


In [None]:
# Import packages for stemming
import nltk
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize 

In [None]:
# We need to download a package for word tokenization
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

Let's start with the word tokenization.

In [None]:
# Tokenization
df['Text'] = df['Text'].apply(lambda x: " ".join(word_tokenize(x)))

Now let's apply stemming. We lowercase the text and keep only tokens made up entirely of word characters (letters, digits, underscore), so punctuation tokens are dropped. Each remaining word is then reduced to its stem.

In [None]:
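# Stemming
import re

# Keep only tokens made entirely of word characters (drops punctuation and hyphenated tokens)
re_word = re.compile(r"^\w+$")
stemmer = SnowballStemmer("english")

def stemming(text):
    """Lowercase, tokenize, drop non-word tokens, and stem what remains."""
    return [stemmer.stem(word) for word in word_tokenize(text.lower()) if re_word.match(word)]

df['Text'] = df['Text'].apply(lambda x: " ".join(stemming(x)))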
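
To see what the `^\w+$` filter keeps and drops, here is a small standalone sketch. The token list is hand-written for illustration (it is not produced by `word_tokenize`), and no stemming is applied.

```python
import re

re_word = re.compile(r"^\w+$")

# Hand-written tokens, for illustration only
tokens = ["great", "movie", "!", "over-acting", "10", ",", "it", "'s"]

# A token survives only if every character is a letter, digit, or underscore
kept = [t for t in tokens if re_word.match(t)]
print(kept)  # ['great', 'movie', '10', 'it']
```

Hyphenated tokens such as `over-acting` are removed entirely, which is why the third review starts with "brilliant by lesley" after this step.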

In [None]:
df['Text'][:5]

0    bromwel high is a cartoon comedi it ran at the...
1    homeless or houseless as georg carlin state ha...
2    brilliant by lesley ann warren best dramat hob...
3    this is easili the most underr film inn the br...
4    this is not the typic mel brook film it was mu...
Name: Text, dtype: object

In [None]:
!python -m spacy download en_core_web_sm

Collecting en_core_web_sm==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.5/en_core_web_sm-2.2.5.tar.gz (12.0 MB)
     |████████████████████████████████| 12.0 MB 26.6 MB/s 
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')



Let's lemmatize every token we can find.

In [None]:
# Lemmatization
import spacy

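# The output below was produced with spaCy 2.2.5 (see the download log above).
# In spaCy 3.x the lemmatizer is a separate pipeline component, so disabling it
# there would likely leave `token.lemma_` empty.
nlp = spacy.load("en_core_web_sm", disable=['ner', 'tagger', 'parser', 'textcat', 'lemmatizer'])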

def lemmatization(text):
    return [token.lemma_ for token in nlp(text.lower()) if re_word.match(token.text)]

df['Text'] = df['Text'].apply(lambda x: " ".join(lemmatization(x)))

In [None]:
df['Text'][:5]

0    bromwel high be a cartoon comedi it run at the...
1    homeless or houseless a georg carlin state hav...
2    brilliant by lesley ann warren well dramat hob...
3    this be easili the much underr film inn the br...
4    this be not the typic mel brook film it be muc...
Name: Text, dtype: object

For example, in the first sentence we can see that 'is' has been replaced by 'be' and 'ran' by 'run'.

Next, we need to convert the text into numbers that we can compute with. We use word counts: each text is transformed into a vector whose entries are the number of times each vocabulary word occurs in it.

For this we use `CountVectorizer` from `sklearn`, keeping only the 1,500 most frequent words (`max_features=1500`).

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=1500)

# Convert the sparse count matrix to a dense array; GaussianNB below requires dense input
X = cv.fit_transform(df['Text']).toarray()
y = df['Label']
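
The idea behind a count-based vectorizer can be sketched in plain Python. This is not scikit-learn's implementation (which has its own tokenization rules); the two documents and the tiny `max_features` value are made up for illustration.

```python
from collections import Counter

# Toy corpus, invented for illustration
docs = ["great great great film film cast", "bad film great"]
max_features = 2  # the notebook uses 1500

# Count every word over the whole corpus and keep the most frequent ones
corpus_counts = Counter(word for doc in docs for word in doc.split())
vocabulary = sorted(word for word, _ in corpus_counts.most_common(max_features))

# One row per document, one column per vocabulary word
X = [[doc.split().count(word) for word in vocabulary] for doc in docs]

print(vocabulary)  # ['film', 'great']
print(X)           # [[2, 3], [1, 1]]
```

Words outside the retained vocabulary (`cast` and `bad` here) simply do not appear in the feature vectors.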

`train_test_split` shuffles the whole dataset before splitting it. In our case, we use 75% of the data for training and 25% for testing.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
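
The shuffle-then-split idea can be illustrated with the standard library alone. This is a simplified stand-in, not scikit-learn's algorithm, so the selected indices differ from those of `train_test_split`. The helper name `shuffled_split` is hypothetical.

```python
import random

def shuffled_split(n_samples, test_size, seed):
    """Shuffle the sample indices, then cut off the first part as the test set."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_size)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = shuffled_split(50_000, 0.25, seed=0)
print(len(train_idx), len(test_idx))  # 37500 12500
```

With 50,000 reviews, a 25% test share gives 12,500 test samples, which matches the total support shown in the classification report further down.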

Bayes' theorem states that, for two events `A` and `B` with $P(B) > 0$:
$$ P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)} $$
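
As a worked example with made-up numbers, let `A` be "the review is positive" and `B` be "the review contains the word great". All probabilities below are assumptions chosen for illustration, not values measured on IMDB.

```python
# Assumed probabilities, for illustration only
p_a = 0.5              # P(A): prior probability of a positive review
p_b_given_a = 0.3      # P(B | A): "great" appears in a positive review
p_b_given_not_a = 0.1  # P(B | not A): "great" appears in a negative review

# Law of total probability gives P(B)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' theorem
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 2))  # 0.75
```

Under these assumptions, seeing the word "great" raises the probability of a positive review from 0.5 to 0.75.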

We're going to use the Naive Bayes classifier, which is based on applying Bayes' theorem. Here, we make the `naive` assumption that, given the class, every word in a sentence is independent of the others. This means that we can look at individual words. For example, for a class $c$:
$$ P(\text{liked the movie} \mid c) = P(\text{liked} \mid c) \cdot P(\text{the} \mid c) \cdot P(\text{movie} \mid c) $$
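
The product above can be sketched as a tiny classifier. The per-class word probabilities and priors are hypothetical values invented for illustration. This is the discrete-word view of Naive Bayes; the `GaussianNB` model used in this notebook instead models each feature with a per-class normal distribution.

```python
import math

# Hypothetical per-class word probabilities and priors, invented for illustration
word_probs = {
    "pos": {"liked": 0.05, "the": 0.10, "movie": 0.04},
    "neg": {"liked": 0.01, "the": 0.10, "movie": 0.04},
}
priors = {"pos": 0.5, "neg": 0.5}

def log_score(words, label):
    """Log of prior times the product of per-word probabilities.

    Summing logs is equivalent to multiplying probabilities and
    avoids numerical underflow on long reviews.
    """
    return math.log(priors[label]) + sum(math.log(word_probs[label][w]) for w in words)

words = ["liked", "the", "movie"]
scores = {label: log_score(words, label) for label in priors}
prediction = max(scores, key=scores.get)
print(prediction)  # pos
```

Only "liked" differs between the two classes here, so it alone decides the prediction.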

In [None]:
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(X_train, y_train)

GaussianNB(priors=None, var_smoothing=1e-09)

We use `confusion_matrix` from sklearn to display the number of correct (true positive and true negative) and incorrect (false positive and false negative) predictions.

In [None]:
from sklearn.metrics import confusion_matrix
y_pred = gnb.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
cm

array([[5315,  937],
       [2235, 4013]])

We use `classification_report` from sklearn to display the precision, recall, and F1-score for both classes on the test data.

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.70      0.85      0.77      6252
           1       0.81      0.64      0.72      6248

    accuracy                           0.75     12500
   macro avg       0.76      0.75      0.74     12500
weighted avg       0.76      0.75      0.74     12500
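
The figures in this report can be recomputed by hand from the confusion matrix shown earlier, where rows are true labels and columns are predicted labels. The helper name `class_metrics` is hypothetical.

```python
# Confusion matrix from the notebook: rows = true label, columns = predicted label
cm = [[5315, 937],
      [2235, 4013]]

def class_metrics(cm, cls):
    """Return (precision, recall, f1) for one class of a 2x2 confusion matrix."""
    tp = cm[cls][cls]
    fp = sum(cm[row][cls] for row in range(2)) - tp  # predicted cls, but wrong
    fn = sum(cm[cls]) - tp                           # true cls, but missed
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

accuracy = (cm[0][0] + cm[1][1]) / sum(map(sum, cm))

for cls in (0, 1):
    print(cls, [round(v, 2) for v in class_metrics(cm, cls)])
print(round(accuracy, 2))
# 0 [0.7, 0.85, 0.77]
# 1 [0.81, 0.64, 0.72]
# 0.75
```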



If we compare these results with Naive Bayes without preprocessing, we can observe that the accuracy, precision, recall, and F1-score for both classes are worse. In this case, preprocessing is of little use because Naive Bayes treats each word independently.

Now we want to see which samples have been wrongly classified.

In [None]:
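# Indexes of the test samples whose predicted label differs from the true label
indexes = y_test[y_test != y_pred].index
# df has a default integer index, so label-based lookup returns the same rows as iloc
df.loc[indexes]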

| | Label | Text |
|---|---|---|
| 45519 | 0 | i think the movi be one side i watch it recent... |
| 26128 | 1 | i realli like this pictur becaus it realist de... |
| 26376 | 1 | i think it be a brilliant show with cool talk ... |
| 32104 | 1 | what a refresh chang from the pg movi that hav... |
| 43460 | 0 | ray bradburi run and hide this tacki film vers... |
| ... | ... | ... |
| 9769 | 1 | this be a visual adapt of manga with veri litt... |
| 27615 | 1 | sheba babi be alway underr much like becaus it... |
| 35021 | 1 | this film be fill with great act great music s... |
| 5420 | 1 | rooki be a wonder movi about the 2 chanc life ... |


In [None]:
df.iloc[32104]["Text"]

'what a refresh chang from the pg movi that have teen girl jump in and out of bed young high school boy count how mani girl they can hook up with kid drink do drug etc etc etc carl hiaasen have write so mani book that be enjoy but hard classic literatur but he have final write someth that middl school kid want to read and this movi send a messag to kid that mayb they can make a differ that mayb their voic can be hear film in south florida the sceneri be beauti and natur and real who care if it predict and a littl corni so be free willi and look how good that do this be a good famili movi rare breed'

In [None]:
df.iloc[26128]["Text"]

'i realli like this pictur becaus it realist deal with two peopl in love and one of them have a disord though the end sadden me i know that that be the well way for it to finish off i would recom this to everyon'

As with Naive Bayes without preprocessing, largely the same sentences are misclassified, and for the same reasons.