# A simple explanation for Naive Bayesian Classifier

This notebook accompanies the article _A Simple Explanation for Naive Bayesian Classifiers_ on Baeldung.

## Managing the necessary imports

We first import pandas and raise its display limits, so that the tables below are shown in full.

In [1]:
import pandas as pd
pd.options.display.max_rows = 999
pd.options.display.max_columns = 999

## Corpus of texts

This is the corpus of texts to which we refer for the rest of this notebook.

In [2]:
corpus = ['the cat is on the table',
     'the dog is in the room',
     'the table is in the room',
     'the cat is not a dog',
     'the cat and the dog are in the room',
     'the room is not a table']

pd.DataFrame(corpus,columns=['texts'])

,texts
0,the cat is on the table
1,the dog is in the room
2,the table is in the room
3,the cat is not a dog
4,the cat and the dog are in the room
5,the room is not a table


## Labels

Here we see which labels are associated with the texts in our corpus. The texts are divided between two labels, according to whether or not they talk about animals.

In [3]:
labels = ['animals',
           'animals',
           'not_animals',
           'animals',
           'animals',
           'not_animals']
df_corpus = pd.DataFrame({'texts':corpus,'labels':labels})
df_corpus

,texts,labels
0,the cat is on the table,animals
1,the dog is in the room,animals
2,the table is in the room,not_animals
3,the cat is not a dog,animals
4,the cat and the dog are in the room,animals
5,the room is not a table,not_animals


## Make naive predictions on the basis of single tokens

We can study whether any single token is a good predictor of class affiliation.

We start by grouping the texts according to the presence of the first word of the first sentence.

In [4]:
mask = (df_corpus['texts'].str.contains('the'))
df_corpus[mask]

,texts,labels
0,the cat is on the table,animals
1,the dog is in the room,animals
2,the table is in the room,not_animals
3,the cat is not a dog,animals
4,the cat and the dog are in the room,animals
5,the room is not a table,not_animals


We can also see which texts do not contain the first word of the first sentence. As you can see, the resulting dataset is empty.

In [5]:
df_corpus[~mask]

,texts,labels


Since all texts contain the word _the_, its presence does not indicate whether a text talks about _animals_. Let's then try with the next word, _cat_. These are the texts which contain it:

In [6]:
mask = (df_corpus['texts'].str.contains('cat'))
df_corpus[mask]

,texts,labels
0,the cat is on the table,animals
3,the cat is not a dog,animals
4,the cat and the dog are in the room,animals


And these are the texts which do not contain it:

In [7]:
df_corpus[~mask]

,texts,labels
1,the dog is in the room,animals
2,the table is in the room,not_animals
5,the room is not a table,not_animals
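The two tables above can be condensed into a single contingency table. The following is a minimal sketch, assuming the same corpus and labels as above; the name `has_cat` is introduced here only for illustration. It checks for _cat_ as a whole word rather than as a substring, which gives the same result on this corpus.

```python
import pandas as pd

corpus = ['the cat is on the table',
          'the dog is in the room',
          'the table is in the room',
          'the cat is not a dog',
          'the cat and the dog are in the room',
          'the room is not a table']
labels = ['animals', 'animals', 'not_animals',
          'animals', 'animals', 'not_animals']
df_corpus = pd.DataFrame({'texts': corpus, 'labels': labels})

# Flag the texts that contain 'cat' as a whole word (hypothetical helper column)
has_cat = df_corpus['texts'].apply(lambda text: 'cat' in text.split()).rename('has_cat')

# Count how many texts of each class do or do not contain the word
counts = pd.crosstab(has_cat, df_corpus['labels'])
print(counts)
```

Every text that contains _cat_ is labelled _animals_, while the texts without it are split between the two classes.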


## A simple prediction

We can then simply predict that a text is affiliated with the class _animals_ if it contains the word _cat_.

In [8]:
# Work on a copy, so that the new column is not added to df_corpus as well
predictions = df_corpus.copy()
predictions['prediction'] = predictions['texts'].apply(lambda x: 'animals' if 'cat' in x else 'not_animals')
predictions

,texts,labels,prediction
0,the cat is on the table,animals,animals
1,the dog is in the room,animals,not_animals
2,the table is in the room,not_animals,not_animals
3,the cat is not a dog,animals,animals
4,the cat and the dog are in the room,animals,animals
5,the room is not a table,not_animals,not_animals
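To quantify how well this single-word rule works, we can compare the predictions with the true labels. This is a minimal sketch that rebuilds the same table; the names `accuracy` and `errors` are introduced here for illustration.

```python
import pandas as pd

corpus = ['the cat is on the table',
          'the dog is in the room',
          'the table is in the room',
          'the cat is not a dog',
          'the cat and the dog are in the room',
          'the room is not a table']
labels = ['animals', 'animals', 'not_animals',
          'animals', 'animals', 'not_animals']

predictions = pd.DataFrame({'texts': corpus, 'labels': labels})
predictions['prediction'] = predictions['texts'].apply(
    lambda x: 'animals' if 'cat' in x else 'not_animals')

# Fraction of texts whose predicted label matches the true label
accuracy = (predictions['prediction'] == predictions['labels']).mean()

# Texts on which the rule fails
errors = predictions[predictions['prediction'] != predictions['labels']]
print(accuracy)
print(errors)
```

The rule is right on five of the six texts. It fails only on the text that mentions a dog but no cat, which is why we need to look at more than one word.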


## Extension to all words

We can then extend the same procedure to all words in all texts. In doing so, we build a table of the Bayesian probability of each class given the presence of each word.

## Tokenization

We first need to tokenize the texts by extracting the vocabulary present in our training dataset.

In [9]:
tokens = []
for text in corpus:
    for word in text.split():
        if word not in tokens:
            tokens.append(word)
tokens

['the',
 'cat',
 'is',
 'on',
 'table',
 'dog',
 'in',
 'room',
 'not',
 'a',
 'and',
 'are']
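The loop above can also be written more compactly. This is a minimal sketch, assuming Python 3.7 or later, where `dict` preserves insertion order; the name `vocabulary` is introduced here for illustration.

```python
corpus = ['the cat is on the table',
          'the dog is in the room',
          'the table is in the room',
          'the cat is not a dog',
          'the cat and the dog are in the room',
          'the room is not a table']

# dict.fromkeys drops duplicates while keeping the order of first appearance
vocabulary = list(dict.fromkeys(word for text in corpus for word in text.split()))
print(vocabulary)
```

The resulting list contains the same twelve words, in the same order, as the one built by the explicit loop.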

## Extracting the Bernoulli Bag-of-Words matrix

The Bag-of-Words matrix contains the absolute frequencies with which the words of the vocabulary occur in the text collection. Each row corresponds to an individual document or text, and each column corresponds to an individual word. Since we are interested in a Bernoulli-distributed variable, we convert all positive frequencies to _true_ and all values of 0 to _false_.
As a text vectorizer, we can use the CountVectorizer class of the scikit-learn library.

In [10]:
from sklearn.feature_extraction.text import CountVectorizer
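# The pattern keeps one-letter words such as 'a'; the raw string avoids an invalid escape sequence
cv = CountVectorizer(token_pattern=r'\w+', vocabulary=tokens)
matrix = cv.fit_transform(corpus)
# Convert the sparse count matrix to a dense array, then to presence or absence
matrix = matrix.toarray()
matrix = matrix > 0
df = pd.DataFrame(matrix)
df.columns = tokens
df['labels'] = labels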
print(df.shape)
df

(6, 13)


,the,cat,is,on,table,dog,in,room,not,a,and,are,labels
0,True,True,True,True,True,False,False,False,False,False,False,False,animals
1,True,False,True,False,False,True,True,True,False,False,False,False,animals
2,True,False,True,False,True,False,True,True,False,False,False,False,not_animals
3,True,True,True,False,False,True,False,False,True,True,False,False,animals
4,True,True,False,False,False,True,True,True,False,False,True,True,animals
5,True,False,True,False,True,False,False,True,True,True,False,False,not_animals
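As an alternative to thresholding the counts by hand, `CountVectorizer` accepts a `binary=True` argument that records only the presence or absence of each word. The following is a minimal sketch under the same assumptions as the cell above; the names `cv_binary` and `df_binary` are introduced here for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['the cat is on the table',
          'the dog is in the room',
          'the table is in the room',
          'the cat is not a dog',
          'the cat and the dog are in the room',
          'the room is not a table']
tokens = ['the', 'cat', 'is', 'on', 'table', 'dog',
          'in', 'room', 'not', 'a', 'and', 'are']

# binary=True sets every non-zero count to 1
cv_binary = CountVectorizer(token_pattern=r'\w+', vocabulary=tokens, binary=True)
binary_matrix = cv_binary.fit_transform(corpus).toarray().astype(bool)

df_binary = pd.DataFrame(binary_matrix, columns=tokens)
print(df_binary)
```

The explicit `token_pattern` matters here: the default pattern of `CountVectorizer` only keeps tokens of two or more characters, so the word _a_ would otherwise be dropped.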


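We can now apply Bayes' theorem to every word and every class. For each pair, we obtain the probability that a text belongs to the class, given that it contains the word.

In [11]: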
# One row per class, one column per word
bayes = pd.DataFrame(columns=df.columns)
bayes['labels'] = df['labels'].unique()
total = df.shape[0]
for i, label in enumerate(bayes['labels']):
    in_class = df['labels'] == label
    # Prior probability of the class, P(A)
    P_A = in_class.sum() / total
    for word in df.columns[:-1]:
        # Probability that a text contains the word, P(B)
        P_B = df[word].sum() / total
        # Probability that a text of this class contains the word, P(B|A)
        P_B_A = df.loc[in_class, word].sum() / in_class.sum()
        # Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
        bayes.loc[i, word] = (float(P_B_A) * float(P_A)) / float(P_B)

bayes

,the,cat,is,on,table,dog,in,room,not,a,and,are,labels
0,0.666667,1,0.6,1,0.333333,1,0.666667,0.5,0.5,0.5,1,1,animals
1,0.333333,0,0.4,0,0.666667,0,0.333333,0.5,0.5,0.5,0,0,not_animals
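As a check on one cell of this table, the value for the word _is_ and the class _animals_ can be worked out by hand from the matrix above: three of the four _animals_ texts contain _is_, four of the six texts are labelled _animals_, and five of the six texts contain _is_.

\begin{equation}
   P(\text{animals} \mid \text{is}) = \frac{P(\text{is} \mid \text{animals}) \cdot P(\text{animals})}{P(\text{is})} = \frac{\frac{3}{4} \cdot \frac{4}{6}}{\frac{5}{6}} = 0.6
\end{equation}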


## Bernoulli Naive Bayesian Classification

To make more complex predictions, we can use the presence or absence of every word in an unseen text as evidence for its class. The classifier which performs this task is called a _Bernoulli_ classifier, because it assumes that the input features are Bernoulli-distributed. It is also called a _Naive Bayesian_ classifier, because it assumes that the words in a text are independent of one another, given the class of the text.
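The independence assumption can be written down explicitly. As a sketch, using notation introduced here for illustration and not taken from the article: let $c$ be a class, $n$ the size of the vocabulary, $x_i \in \{0, 1\}$ indicate whether the $i$-th word is present in the text, and $p_{i,c}$ be the probability that a text of class $c$ contains the $i$-th word. The score of a class then factorizes over the words:

\begin{equation}
   P(c \mid x_1, \dots, x_n) \propto P(c) \prod_{i=1}^{n} p_{i,c}^{x_i} \, (1 - p_{i,c})^{1 - x_i}
\end{equation}

Note that absent words contribute to the score as well, through the factor $(1 - p_{i,c})$.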

## Classification

We then use a classification algorithm to learn the rules according to which texts are associated with their labels.

These rules allow us to identify the abstract predictors of the class affiliation of a text.

This particular classifier is a Naive Bayesian classifier, which solves the classification task by first computing Bayes' Theorem for each feature of the input with respect to each label, where _A_ is the class and _B_ is the presence of a word, as follows:

\begin{equation}
   P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}
\end{equation}

The classifier then assigns each text vector to the most probable class. It does so on the basis of smoothed maximum-likelihood estimates, so that a word never observed in a class does not force a probability of zero.
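To make the smoothed estimate concrete, here is a minimal pure-Python sketch of a Bernoulli Naive Bayes classifier. It assumes add-one (Laplace) smoothing, which corresponds to the default `alpha=1.0` of scikit-learn's `BernoulliNB`; the helper names `train` and `predict` are hypothetical and are not part of any library.

```python
import math

corpus = ['the cat is on the table',
          'the dog is in the room',
          'the table is in the room',
          'the cat is not a dog',
          'the cat and the dog are in the room',
          'the room is not a table']
labels = ['animals', 'animals', 'not_animals',
          'animals', 'animals', 'not_animals']

vocabulary = sorted({word for text in corpus for word in text.split()})
docs = [set(text.split()) for text in corpus]
classes = sorted(set(labels))

def train(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed P(word present | class)."""
    log_prior, word_prob = {}, {}
    for c in classes:
        class_docs = [d for d, l in zip(docs, labels) if l == c]
        log_prior[c] = math.log(len(class_docs) / len(docs))
        # Add alpha to the count and 2 * alpha to the total (present or absent)
        word_prob[c] = {
            w: (sum(w in d for d in class_docs) + alpha) / (len(class_docs) + 2 * alpha)
            for w in vocabulary
        }
    return log_prior, word_prob

def predict(text, log_prior, word_prob):
    """Return the class with the highest log posterior score."""
    present = set(text.split())
    scores = {}
    for c in classes:
        score = log_prior[c]
        for w in vocabulary:
            p = word_prob[c][w]
            # Present words contribute log p, absent words log(1 - p)
            score += math.log(p) if w in present else math.log(1 - p)
        scores[c] = score
    return max(scores, key=scores.get)

log_prior, word_prob = train(docs, labels)
hand_predictions = [predict(text, log_prior, word_prob) for text in corpus]
print(hand_predictions)
```

Working with sums of logarithms instead of products of probabilities avoids numerical underflow when the vocabulary is large. Words of a new text that are not in the vocabulary are simply ignored by this sketch.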

In [12]:
from sklearn.naive_bayes import BernoulliNB

clf = BernoulliNB()
X = df[df.columns[:-1]]
y = df['labels']
clf.fit(X,y)
predictions = clf.predict(X)
pd.DataFrame({'Texts':corpus,'Labels':labels,'Predictions':predictions})

,Texts,Labels,Predictions
0,the cat is on the table,animals,animals
1,the dog is in the room,animals,animals
2,the table is in the room,not_animals,not_animals
3,the cat is not a dog,animals,animals
4,the cat and the dog are in the room,animals,animals
5,the room is not a table,not_animals,not_animals
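The table above evaluates the classifier on the same texts it was trained on. The following sketch shows how it could be applied to an unseen text instead. The sentence in `new_texts` is made up for illustration and does not come from the article, and the vectorizer here learns its vocabulary from the corpus rather than receiving it as an argument.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

corpus = ['the cat is on the table',
          'the dog is in the room',
          'the table is in the room',
          'the cat is not a dog',
          'the cat and the dog are in the room',
          'the room is not a table']
labels = ['animals', 'animals', 'not_animals',
          'animals', 'animals', 'not_animals']

# binary=True records presence or absence; the pattern keeps one-letter words
cv = CountVectorizer(token_pattern=r'\w+', binary=True)
X = cv.fit_transform(corpus)
clf = BernoulliNB().fit(X, labels)

# Hypothetical unseen text, vectorized with the vocabulary learned above
new_texts = ['the dog is on the table']
X_new = cv.transform(new_texts)

new_predictions = clf.predict(X_new)
new_probabilities = clf.predict_proba(X_new)
print(clf.classes_)
print(new_predictions)
print(new_probabilities)
```

The columns of `predict_proba` follow the order of `clf.classes_`, so the two outputs can be read side by side.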
