# Multinomial and Bernoulli Naive Bayes

To understand Multinomial and Bernoulli Naive Bayes, we will work through a small example end to end. In another notebook, we will build a full-fledged email spam classifier.

To start with, let's take a few sentences and classify each one into one of two classes: education or cinema. Each sentence represents one document. In real-world cases, a document can be any piece of text, such as an email, a news article, a book review, or a tweet. The analysis and the algorithm do not depend on the type of document we use.

The notebook is divided into the following sections:

1. Importing and preprocessing data
2. Building the model: Multinomial Naive Bayes
3. Building the model: Bernoulli Naive Bayes

### 1. Importing and Preprocessing Data

Let us first look at the sentences and their classes. The training sentences are in the file `example_train.csv`, and the test sentences are in the file `example_test.csv`.

In [2]:
import numpy as np
import pandas as pd
import sklearn

# training data
train_docs = pd.read_csv("example_train.csv")
train_docs

Unnamed: 0,Document,Class
0,Upgrad is a great educational institution.,education
1,Educational greatness depends on ethics,education
2,A story of great ethics and educational greatness,education
3,Sholey is a great cinema,cinema
4,good movie depends on good story,cinema


As you can see, there are 5 documents (sentences): 3 belong to the "education" class and 2 belong to the "cinema" class.
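If `example_train.csv` is not at hand, the same five rows can be typed in directly and the class counts checked. This is a minimal sketch that assumes the frame has just the `Document` and `Class` columns shown above; the variable name `class_counts` is introduced here for illustration.

```python
import pandas as pd

# The five training sentences from the table above, entered by hand
# instead of being read from example_train.csv.
train_docs = pd.DataFrame({
    "Document": [
        "Upgrad is a great educational institution.",
        "Educational greatness depends on ethics",
        "A story of great ethics and educational greatness",
        "Sholey is a great cinema",
        "good movie depends on good story",
    ],
    "Class": ["education", "education", "education", "cinema", "cinema"],
})

# Count how many documents fall into each class.
class_counts = train_docs["Class"].value_counts()
print(class_counts)
```

Checking the counts early is a cheap way to spot a mislabeled or missing row before training.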

In [3]:
# convert label to a numerical variable
train_docs['Class'] = train_docs.Class.map({'cinema':0, 'education':1})
train_docs

Unnamed: 0,Document,Class
0,Upgrad is a great educational institution.,1
1,Educational greatness depends on ethics,1
2,A story of great ethics and educational greatness,1
3,Sholey is a great cinema,0
4,good movie depends on good story,0


Let's now split the dataframe into the features X (the sentences) and the labels y (the classes).

In [4]:
# Convert the df to a numpy array
train_array = train_docs.values

# split X and y
X_train = train_array[:, 0]
y_train = train_array[:, 1]
y_train = y_train.astype('int') # sklearn needs y as integers

print("X_train")
print(X_train)
print("y_train")
print(y_train)

X_train
['Upgrad is a great educational institution.'
 'Educational greatness depends on ethics'
 'A story of great ethics and educational greatness'
 'Sholey is a great cinema' 'good movie depends on good story']
y_train
[1 1 1 0 0]
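The split above picks columns by position. As an alternative sketch, the columns can be selected by name, which keeps working even if the CSV carries an extra index column. The two-row frame `demo` and the names `X_by_name` and `y_by_name` are hypothetical stand-ins, not part of the original notebook.

```python
import pandas as pd

# Hypothetical two-row stand-in for train_docs after the label mapping.
demo = pd.DataFrame({
    "Document": [
        "Sholey is a great cinema",
        "Educational greatness depends on ethics",
    ],
    "Class": [0, 1],
})

# Select by column name rather than by position.
X_by_name = demo["Document"].to_numpy()
y_by_name = demo["Class"].to_numpy().astype("int")

print(X_by_name)
print(y_by_name)
```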


### Creating the Bag of Words Representation

We now have to convert the data into a format which can be used for training the model. We'll use the **bag of words representation** for each sentence (document).

Imagine breaking X into individual words and putting them all in a bag. We then pick the unique words from the bag one by one and build a dictionary (vocabulary) of unique words.
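The "bag" idea can be sketched by hand in plain Python before turning to a library. The tokenizer below is an assumption made for illustration: it lowercases the text and keeps only words of two or more characters, so a one-letter word such as "a" is dropped. The names `tokenize`, `bag`, `vocabulary`, and `word_to_index` are likewise illustrative.

```python
import re

docs = [
    "Upgrad is a great educational institution.",
    "Educational greatness depends on ethics",
    "A story of great ethics and educational greatness",
    "Sholey is a great cinema",
    "good movie depends on good story",
]

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

# Put every word from every document into one bag.
bag = [word for doc in docs for word in tokenize(doc)]

# Keep the unique words, sorted, and give each one an index.
vocabulary = sorted(set(bag))
word_to_index = {word: i for i, word in enumerate(vocabulary)}

print(len(bag), "words in the bag,", len(vocabulary), "unique")
print(vocabulary)
```

Each document can then be described by how often each vocabulary word occurs in it, which is the representation the model is trained on.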

This is called **vectorization of words**. scikit-learn provides the `CountVectorizer` class to vectorize the words.

In [2]:
# Create an object of the CountVectorizer() class
from sklearn.feature_extraction.text import CountVectorizer
# help(CountVectorizer)

In [3]:
vec = CountVectorizer()
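With the class name spelled `CountVectorizer`, the object can be fitted on the training sentences to learn the vocabulary and produce the count matrix. The following is a sketch using the default settings; the sentences are repeated inline so the block runs on its own, and `X_counts` is a name chosen here for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Upgrad is a great educational institution.",
    "Educational greatness depends on ethics",
    "A story of great ethics and educational greatness",
    "Sholey is a great cinema",
    "good movie depends on good story",
]

vec = CountVectorizer()

# Learn the vocabulary and count word occurrences per document.
# The result is a sparse matrix: one row per document, one column per word.
X_counts = vec.fit_transform(docs)

# vocabulary_ maps each learned word to its column index.
print(sorted(vec.vocabulary_))
print(X_counts.toarray())
```

The last row of the matrix has a 2 in the column for "good", because that word occurs twice in the fifth sentence.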