# Multinomial and Bernoulli Naive Bayes

To understand Multinomial and Bernoulli Naive Bayes, we will take a few sentences and classify them into two classes. Each sentence represents one document. In real-world problems, a document could be an email, a news article, a book review, a tweet, etc.

The analysis and mathematics involved do not depend on the type of document we use. We have therefore chosen a small set of short sentences to demonstrate the calculations and reinforce the concept.

In [37]:
import numpy as np
import pandas as pd
import sklearn
docs = pd.read_csv("example_train1.csv")
# text in the 'Document' column, class label in the 'Class' column
docs

Unnamed: 0,Document,Class
0,Teclov is a great educational institution.,education
1,Educational greatness depends on ethics,education
2,A story of great ethics and educational greatness,education
3,Sholey is a great cinema,cinema
4,good movie depends on good story,cinema


### There are 5 documents (sentences), of which 3 belong to the "education" class and 2 to the "cinema" class

In [38]:
# convert label to a numerical variable
docs['Class'] = docs['Class'].map({'education' : 1, 'cinema' : 0})
docs

Unnamed: 0,Document,Class
0,Teclov is a great educational institution.,1
1,Educational greatness depends on ethics,1
2,A story of great ethics and educational greatness,1
3,Sholey is a great cinema,0
4,good movie depends on good story,0


In [39]:
# as_matrix() was removed in pandas 1.0; to_numpy() is the replacement
numpy_array = docs.to_numpy()
print(numpy_array)
X = numpy_array[:, 0]
Y = numpy_array[:, 1]
# print(Y)
Y = Y.astype('int')
print("X")
print(X)
print("Y")
print(Y)

[['Teclov is a great educational institution.' 1]
 ['Educational greatness depends on ethics' 1]
 ['A story of great ethics and educational greatness' 1]
 ['Sholey is a great cinema' 0]
 ['good movie depends on good story' 0]]
X
['Teclov is a great educational institution.'
 'Educational greatness depends on ethics'
 'A story of great ethics and educational greatness'
 'Sholey is a great cinema' 'good movie depends on good story']
Y
[1 1 1 0 0]

Imagine breaking X into individual words and putting them all in a bag. Then we pick all the unique words from the bag one by one and make a dictionary of unique words.

This is called **vectorization of words**. scikit-learn provides the class ```CountVectorizer()``` to vectorize the words. Let us first see it in action.

In [40]:
# create an object of CountVectorizer() class
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer()

Here ```vec``` is an object of class ```CountVectorizer```. It has a method called ```fit()```, which learns the vocabulary of a corpus of documents: each unique word is mapped to an integer index, as shown below.

In [41]:
vec.fit(X)
vec.vocabulary_

{'teclov': 15,
 'is': 9,
 'great': 6,
 'educational': 3,
 'institution': 8,
 'greatness': 7,
 'depends': 2,
 'on': 12,
 'ethics': 4,
 'story': 14,
 'of': 11,
 'and': 0,
 'sholey': 13,
 'cinema': 1,
 'good': 5,
 'movie': 10}

```CountVectorizer()``` has converted the documents into a set of unique words, sorted alphabetically and indexed. Note that it has also lowercased the words and dropped the single-character token 'a'.
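The "bag of unique words" idea can be reproduced by hand. The following is a minimal pure-Python sketch, not scikit-learn's actual implementation. It assumes a tokenization rule of lowercasing and keeping runs of two or more word characters, which mirrors the default behaviour of ```CountVectorizer```. The names ```documents```, ```tokenize```, ```unique_words``` and ```vocabulary``` are illustrative.

```python
import re

# The five training sentences from the example above.
documents = [
    "Teclov is a great educational institution.",
    "Educational greatness depends on ethics",
    "A story of great ethics and educational greatness",
    "Sholey is a great cinema",
    "good movie depends on good story",
]


def tokenize(text):
    # Assumed rule: lowercase, then keep runs of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())


# Put every word in one "bag", keep the unique ones, and sort them.
unique_words = sorted({word for doc in documents for word in tokenize(doc)})

# Map each unique word to its position in the sorted list.
vocabulary = {word: index for index, word in enumerate(unique_words)}

print(len(vocabulary))
print(vocabulary)
```

Under these assumptions the sketch yields the same 16 words and indices as ```vec.vocabulary_``` above.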

**Stop Words**
We can see a few trivial words such as 'and', 'is', 'of' etc. These words contribute very little to classifying a document. They are called 'stop words', and we would like to get rid of them.

We can remove them by passing the parameter ```stop_words='english'``` while instantiating ```CountVectorizer``` as follows.

In [42]:
# removing the stop words
vec = CountVectorizer(stop_words='english')
vec.fit(X)
vec.vocabulary_

{'teclov': 11,
 'great': 5,
 'educational': 2,
 'institution': 7,
 'greatness': 6,
 'depends': 1,
 'ethics': 3,
 'story': 10,
 'sholey': 9,
 'cinema': 0,
 'good': 4,
 'movie': 8}

In [43]:
# Another way of printing the vocabulary: the feature names
# (get_feature_names() was removed in scikit-learn 1.2)
feature_names = vec.get_feature_names_out().tolist()
print(feature_names)
print(len(feature_names))

['cinema', 'depends', 'educational', 'ethics', 'good', 'great', 'greatness', 'institution', 'movie', 'sholey', 'story', 'teclov']
12


So our final dictionary is made of 12 words (after discarding the stop words). Now, to do classification, we need to represent every document in terms of these words, which act as the features.

Every document will be converted into a feature *vector* recording how many times each of these words occurs in that document. Let's convert each of our training documents into a feature vector.

In [44]:
# transform the documents into a document-term matrix
X_transformed = vec.transform(X)
X_transformed

<5x12 sparse matrix of type '<class 'numpy.int64'>'
	with 20 stored elements in Compressed Sparse Row format>

You can see that X_transformed is a 5 x 12 sparse matrix. It has 5 rows, one for each of our 5 documents, and 12 columns, one for each word of the dictionary we just created. Let us print X_transformed.

In [45]:
print(X_transformed)

  (0, 2)	1
  (0, 5)	1
  (0, 7)	1
  (0, 11)	1
  (1, 1)	1
  (1, 2)	1
  (1, 3)	1
  (1, 6)	1
  (2, 2)	1
  (2, 3)	1
  (2, 5)	1
  (2, 6)	1
  (2, 10)	1
  (3, 0)	1
  (3, 5)	1
  (3, 9)	1
  (4, 1)	1
  (4, 4)	2
  (4, 8)	1
  (4, 10)	1


This representation can be understood as follows:

Consider the first 4 rows of the output: (0, 2), (0, 5), (0, 7) and (0, 11). They say that the first document (index 0) contains the words at vocabulary indices 2, 5, 7 and 11 ('educational', 'great', 'institution' and 'teclov'), and that each appears only once in the document, as indicated by the right-hand column.

Similarly, consider the entry (4, 4) (third from bottom). It says that the fifth document contains the word at index 4 twice. Indeed, that word ('good') appears twice in the fifth document.

In real problems, you often work with large documents and vocabularies, and each document contains only a few of the words in the vocabulary. So it would be a waste of space to store the document-term matrix as a dense array or dataframe, since most entries would be zero. Also, matrix products, additions etc. are typically much faster with sparse matrices when most entries are zero. That's why we use sparse matrices to store the data.


Let us convert this sparse matrix into a more easily interpretable array:

In [46]:
# converting transformed matrix back to an array
# note the high number of zeros
X = X_transformed.toarray()

In [47]:
X

array([[0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1],
       [0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0],
       [1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0],
       [0, 1, 0, 0, 2, 0, 0, 0, 1, 0, 1, 0]], dtype=int64)
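The zeros visible in this array are exactly what a sparse format avoids storing. As a rough illustration (a plain dictionary, not how SciPy actually lays out its Compressed Sparse Row format), the sketch below keeps only the non-zero cells, keyed by (row, column). The names ```dense``` and ```stored``` are illustrative.

```python
# Dense document-term matrix from the example (5 documents x 12 words).
dense = [
    [0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1],
    [0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0],
    [1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 2, 0, 0, 0, 1, 0, 1, 0],
]

# Keep only the non-zero cells as (row, column) -> count.
stored = {
    (row, col): count
    for row, counts in enumerate(dense)
    for col, count in enumerate(counts)
    if count != 0
}

total_cells = len(dense) * len(dense[0])
print(len(stored), "of", total_cells, "cells are non-zero")
print(stored[(4, 4)])  # count of the word at index 4 in the fifth document
```

Only 20 of the 60 cells need to be stored here, which matches the "20 stored elements" reported for the sparse matrix. With a realistic vocabulary of tens of thousands of words, the saving is far larger.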

To make better sense of the dataset, let us examine the vocabulary and document-term matrix together in a pandas dataframe. The way to convert a matrix into a dataframe is ```pd.DataFrame(matrix, columns=columns)```.


In [48]:
# converting matrix to pandas DataFrame
pd.DataFrame(X, columns=vec.get_feature_names_out())

Unnamed: 0,cinema,depends,educational,ethics,good,great,greatness,institution,movie,sholey,story,teclov
0,0,0,1,0,0,1,0,1,0,0,0,1
1,0,1,1,1,0,0,1,0,0,0,0,0
2,0,0,1,1,0,1,1,0,0,0,1,0
3,1,0,0,0,0,1,0,0,0,1,0,0
4,0,1,0,0,2,0,0,0,1,0,1,0


This table shows how many times each word occurs in each document. In other words, it is a frequency table of the words.

A corpus of documents can thus be represented by a matrix with one row per document and one column per
token (e.g. word) occurring in the corpus.

#### So, the 4 steps for vectorization are as follows

- Import
- Instantiate
- Fit 
- Transform

Let us summarise what fitting and transforming do:

- ```vec.fit(train)``` learns the vocabulary of the training data
- ```vec.transform(train)``` uses the fitted vocabulary to build a document-term matrix from the training data
- ```vec.transform(test)``` uses the fitted vocabulary to build a document-term matrix from the testing data (and ignores tokens it hasn't seen before)
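The last point, that unseen tokens are ignored at transform time, can be sketched in plain Python. This is an illustration under an assumed tokenization rule (lowercase, two or more word characters), not scikit-learn's code. The function ```transform``` and the hard-coded ```vocabulary``` are illustrative; the vocabulary is the 12-word one learned above.

```python
import re

# Vocabulary learned from the training documents (stop words already removed).
vocabulary = {
    "cinema": 0, "depends": 1, "educational": 2, "ethics": 3,
    "good": 4, "great": 5, "greatness": 6, "institution": 7,
    "movie": 8, "sholey": 9, "story": 10, "teclov": 11,
}


def transform(text, vocabulary):
    # Start from an all-zero count vector, one slot per vocabulary word.
    counts = [0] * len(vocabulary)
    for token in re.findall(r"\b\w\w+\b", text.lower()):
        index = vocabulary.get(token)
        if index is not None:  # tokens outside the vocabulary are skipped
            counts[index] += 1
    return counts


test_vector = transform("very good educational institute", vocabulary)
print(test_vector)
```

Only 'good' and 'educational' are counted. 'very' and 'institute' are not in the fitted vocabulary, so they leave no trace in the vector; 'institute' is not matched to 'institution' because no stemming is applied.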

In [54]:
test_docs = pd.read_csv("example_test.csv")
# text in the 'Document' column, class label in the 'Class' column
test_docs

Unnamed: 0,Document,Class
0,very good educational institute,education


In [55]:
# convert label to a numerical variable
test_docs['Class'] = test_docs.Class.map({'education' : 1, 'cinema' : 0})
test_docs

Unnamed: 0,Document,Class
0,very good educational institute,1


In [57]:
# as_matrix() was removed in pandas 1.0; to_numpy() is the replacement
test_numpy_array = test_docs.to_numpy()
X_test = test_numpy_array[:, 0]
Y_test = test_numpy_array[:, 1]
Y_test = Y_test.astype('int')
print("X_test")
print(X_test)
print("Y_test")
print(Y_test)

X_test
['very good educational institute']
Y_test
[1]

In [59]:
X_test_transformed = vec.transform(X_test)
X_test_transformed

<1x12 sparse matrix of type '<class 'numpy.int64'>'
	with 2 stored elements in Compressed Sparse Row format>

In [60]:
X_test = X_test_transformed.toarray()
X_test

array([[0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0]], dtype=int64)

## Multinomial Naive Bayes

In [61]:
# building a multinomial NB model
from sklearn.naive_bayes import MultinomialNB

# instantiate NB class
mnb = MultinomialNB()

# fitting the model on training data
mnb.fit(X, Y)

# predicting probabilities of test data
mnb.predict_proba(X_test)

array([[0.43859649, 0.56140351]])

In [62]:
proba=mnb.predict_proba(X_test)
print("probability of test document belonging to class CINEMA" , proba[:,0])
print("probability of test document belonging to class EDUCATION" , proba[:,1])

probability of test document belonging to class CINEMA [0.43859649]
probability of test document belonging to class EDUCATION [0.56140351]


In [63]:
pd.DataFrame(proba, columns=['Cinema', 'Education'])

Unnamed: 0,Cinema,Education
0,0.438596,0.561404
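These probabilities can be reproduced by hand. The sketch below is a minimal pure-Python calculation, assuming the defaults shown for ```MultinomialNB``` (Laplace smoothing with ```alpha=1.0``` and class priors estimated from the training labels). For each class it multiplies the prior by the smoothed word probabilities of the words in the test document, then normalises. The function ```multinomial_score``` is illustrative.

```python
# Training counts (rows = documents, columns = the 12 vocabulary words).
X_train = [
    [0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1],
    [0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0],
    [1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 2, 0, 0, 0, 1, 0, 1, 0],
]
y_train = [1, 1, 1, 0, 0]  # 1 = education, 0 = cinema
x_test = [0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0]  # 'educational' and 'good'
alpha = 1.0  # assumed smoothing parameter (the MultinomialNB default)


def multinomial_score(label):
    rows = [row for row, y in zip(X_train, y_train) if y == label]
    prior = len(rows) / len(X_train)
    # Total occurrences of each word in this class, and of all words.
    word_totals = [sum(column) for column in zip(*rows)]
    class_total = sum(word_totals)
    vocab_size = len(word_totals)
    score = prior
    for count, total in zip(x_test, word_totals):
        p_word = (total + alpha) / (class_total + alpha * vocab_size)
        score *= p_word ** count  # words absent from the test doc contribute 1
    return score


scores = {label: multinomial_score(label) for label in (0, 1)}
evidence = sum(scores.values())
posterior = {label: score / evidence for label, score in scores.items()}
print(posterior)
```

For the education class this gives 3/5 × 4/25 × 1/25 = 0.00384, and for the cinema class 2/5 × 1/20 × 3/20 = 0.003. Normalising yields roughly 0.5614 and 0.4386, in agreement with ```predict_proba```.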


## Bernoulli Naive Bayes

In [64]:
from sklearn.naive_bayes import BernoulliNB

# instantiating bernoulli NB class
bnb = BernoulliNB()

# fitting the model
bnb.fit(X, Y)

# predicting probability of test data
proba_bnb = bnb.predict_proba(X_test)

In [65]:
pd.DataFrame(proba_bnb, columns=['Cinema', 'Education'])

Unnamed: 0,Cinema,Education
0,0.377463,0.622537
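The Bernoulli result differs because the model only looks at whether a word is present, and it also penalises a class for every vocabulary word that is absent from the test document. The following pure-Python sketch reproduces the numbers under the assumed ```BernoulliNB``` defaults (```alpha=1.0```, counts binarized at a threshold of 0, priors estimated from the training labels). The function ```bernoulli_score``` is illustrative.

```python
# Training counts (rows = documents, columns = the 12 vocabulary words).
X_train = [
    [0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1],
    [0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0],
    [1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 2, 0, 0, 0, 1, 0, 1, 0],
]
y_train = [1, 1, 1, 0, 0]  # 1 = education, 0 = cinema
x_test = [0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0]  # 'educational' and 'good'
alpha = 1.0  # assumed smoothing parameter (the BernoulliNB default)


def bernoulli_score(label):
    # Binarize: only presence or absence of a word matters.
    rows = [
        [1 if count > 0 else 0 for count in row]
        for row, y in zip(X_train, y_train)
        if y == label
    ]
    prior = len(rows) / len(X_train)
    # Number of documents in this class that contain each word.
    doc_freq = [sum(column) for column in zip(*rows)]
    score = prior
    for count, n_docs in zip(x_test, doc_freq):
        p_word = (n_docs + alpha) / (len(rows) + 2 * alpha)
        # Present words contribute p, absent words contribute (1 - p).
        score *= p_word if count > 0 else (1 - p_word)
    return score


scores = {label: bernoulli_score(label) for label in (0, 1)}
evidence = sum(scores.values())
posterior = {label: score / evidence for label, score in scores.items()}
print(posterior)
```

The normalised values come out at roughly 0.3775 for cinema and 0.6225 for education, matching the table above. Note that the word 'good' appearing twice in a training document counts only once here, unlike in the multinomial model.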


In [66]:

mnb

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)