# Naive Bayes

To understand Multinomial and Bernoulli Naive Bayes, we will take a few sentences and classify them into two classes. Each sentence represents one document. In real-world applications, a document could be an email, a news article, a book review, a tweet, etc.

The analysis and mathematics involved do not depend on the type of document we use. We have therefore chosen a set of short sentences to demonstrate the calculations and to drive home the concept.

Let us first look at the sentences and their classes. We have kept these sentences in file example_train.csv. Test sentences have been put in the file example_test.csv.

In [9]:
import numpy as np
import pandas as pd
import sklearn

docs = pd.read_csv('/Users/cameronlooney/Downloads/example_bayes.csv') 
# text in the first column, class label in the second
docs

Unnamed: 0,Document,Class
0,UCC is an educational college in Cork,education
1,Educational greatness depends on ethics,education
2,A story of great ethics and educational greatness,education
3,Sholey is a great cinema,cinema
4,good movie depends on good story,cinema


In [12]:
# convert label to a numerical variable
docs['Class'] = docs.Class.map({'cinema':0, 'education':1})
docs

Unnamed: 0,Document,Class
0,UCC is an educational college in Cork,1
1,Educational greatness depends on ethics,1
2,A story of great ethics and educational greatness,1
3,Sholey is a great cinema,0
4,good movie depends on good story,0


In [14]:
numpy_array = docs.values
X = numpy_array[:,0]
Y = numpy_array[:,1]
Y = Y.astype('int')
print("X")
print(X)
print("Y")
print(Y)

X
['UCC is an educational college in Cork'
 'Educational greatness depends on ethics'
 'A story of great ethics and educational greatness'
 'Sholey is a great cinema' 'good movie depends on good story']
Y
[1 1 1 0 0]


We now perform word vectorization.<br>
Word vectorization is a technique in NLP for mapping words or phrases from a vocabulary to vectors of real numbers, which can then be used for tasks such as word prediction or measuring word similarity. More generally, the process of converting text into numbers is called vectorization. Here we use its simplest form: counting how often each vocabulary word occurs in each document.

In [15]:
from sklearn.feature_extraction.text import CountVectorizer 
vec = CountVectorizer( )

In [16]:
vec.fit(X)
vec.vocabulary_

{'ucc': 18,
 'is': 12,
 'an': 0,
 'educational': 6,
 'college': 3,
 'in': 11,
 'cork': 4,
 'greatness': 10,
 'depends': 5,
 'on': 15,
 'ethics': 7,
 'story': 17,
 'of': 14,
 'great': 9,
 'and': 1,
 'sholey': 16,
 'cinema': 2,
 'good': 8,
 'movie': 13}
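
The mapping above can be reproduced by hand, which helps show what fitting does. The following is a minimal sketch in plain Python, not scikit-learn's actual implementation. It assumes the default behaviour of lowercasing the text, keeping only tokens of two or more word characters (which is why 'A' is missing from the vocabulary), and numbering the unique tokens in alphabetical order. The names `sentences`, `tokenize` and `vocabulary` are illustrative.

```python
import re

sentences = [
    "UCC is an educational college in Cork",
    "Educational greatness depends on ethics",
    "A story of great ethics and educational greatness",
    "Sholey is a great cinema",
    "good movie depends on good story",
]

def tokenize(text):
    """Lowercase the text and keep tokens of at least two word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

# Collect the unique tokens across all documents, then number them alphabetically.
unique_tokens = sorted({tok for s in sentences for tok in tokenize(s)})
vocabulary = {tok: idx for idx, tok in enumerate(unique_tokens)}

print(len(vocabulary))
print(vocabulary["an"], vocabulary["educational"], vocabulary["ucc"])
```

The indices printed for 'an', 'educational' and 'ucc' can be compared with the entries of `vec.vocabulary_` shown above.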

**Stop Words**

We can see a few trivial words such as 'and', 'is', 'of', etc. These words make little difference when classifying a document. They are called 'stop words', so we would like to get rid of them.

We can remove them by passing the parameter stop_words='english' when instantiating CountVectorizer(), as follows:

In [17]:
vec = CountVectorizer(stop_words='english' )
vec.fit(X)
vec.vocabulary_

{'ucc': 12,
 'educational': 4,
 'college': 1,
 'cork': 2,
 'greatness': 8,
 'depends': 3,
 'ethics': 5,
 'story': 11,
 'great': 7,
 'sholey': 10,
 'cinema': 0,
 'good': 6,
 'movie': 9}
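
To see exactly which words were dropped, the two vocabularies can be compared. This is a small sketch that refits both vectorizers on the same five sentences (typed in here rather than read from the CSV file); the variable names are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "UCC is an educational college in Cork",
    "Educational greatness depends on ethics",
    "A story of great ethics and educational greatness",
    "Sholey is a great cinema",
    "good movie depends on good story",
]

# Vocabulary without and with the built-in English stop word list
all_words = set(CountVectorizer().fit(sentences).vocabulary_)
kept_words = set(CountVectorizer(stop_words="english").fit(sentences).vocabulary_)

# Words present in the first vocabulary but missing from the second
removed = sorted(all_words - kept_words)
print(removed)
print(len(all_words), len(kept_words))
```

The built-in list is generic, so for a real corpus it is worth checking that no word relevant to the classes is being removed.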

Every document will be converted into a feature vector recording how often each of these words occurs in that document. Let's convert each of our training documents into a feature vector.

In [18]:
# transform the documents into a sparse document-term matrix
X_transformed=vec.transform(X)
X_transformed

<5x13 sparse matrix of type '<class 'numpy.int64'>'
	with 20 stored elements in Compressed Sparse Row format>

In [19]:
print(X_transformed)

  (0, 1)	1
  (0, 2)	1
  (0, 4)	1
  (0, 12)	1
  (1, 3)	1
  (1, 4)	1
  (1, 5)	1
  (1, 8)	1
  (2, 4)	1
  (2, 5)	1
  (2, 7)	1
  (2, 8)	1
  (2, 11)	1
  (3, 0)	1
  (3, 7)	1
  (3, 10)	1
  (4, 3)	1
  (4, 6)	2
  (4, 9)	1
  (4, 11)	1


This representation can be understood as follows:

Consider the first 4 rows of the output: (0, 1), (0, 2), (0, 4) and (0, 12). They say that the first document (index 0) contains the words at vocabulary indices 1, 2, 4 and 12 ('college', 'cork', 'educational' and 'ucc'), and that each appears exactly once in the document, as indicated by the right-hand column.

Similarly, consider the entry (4, 6) (third from bottom). It says that the fifth document (index 4) contains the word at index 6 twice. Indeed, the word at index 6 ('good') appears twice in the fifth document.
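
The same reading can be done mechanically. The sketch below copies the (document, word, count) triples from the printed output and looks each word index up in the stop-word-filtered vocabulary, listed in index order. It is plain Python for illustration only; `entries`, `feature_names` and `decoded` are hypothetical names, not part of scikit-learn.

```python
from collections import defaultdict

# Vocabulary in index order (index 0 is 'cinema', index 12 is 'ucc')
feature_names = [
    "cinema", "college", "cork", "depends", "educational", "ethics", "good",
    "great", "greatness", "movie", "sholey", "story", "ucc",
]

# (document index, word index, count) triples copied from the printed output
entries = [
    (0, 1, 1), (0, 2, 1), (0, 4, 1), (0, 12, 1),
    (1, 3, 1), (1, 4, 1), (1, 5, 1), (1, 8, 1),
    (2, 4, 1), (2, 5, 1), (2, 7, 1), (2, 8, 1), (2, 11, 1),
    (3, 0, 1), (3, 7, 1), (3, 10, 1),
    (4, 3, 1), (4, 6, 2), (4, 9, 1), (4, 11, 1),
]

# Group the counts by document, replacing each word index with the word itself
decoded = defaultdict(dict)
for doc_idx, word_idx, count in entries:
    decoded[doc_idx][feature_names[word_idx]] = count

for doc_idx in sorted(decoded):
    print(doc_idx, decoded[doc_idx])
```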

In real problems, you often work with large documents and vocabularies, and each document contains only a small fraction of the words in the vocabulary. It would therefore be a waste of space to store the document-term matrix in a dense structure such as a typical dataframe, since most entries would be zero. Operations such as matrix products and additions can also be much faster on sparse matrices when most entries are zero. That's why we use sparse matrices to store the data.
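
As a rough illustration of how many entries are actually stored, the sketch below rebuilds the 5x13 count matrix of this example as a dense NumPy array, converts it with `scipy.sparse.csr_matrix`, and computes the fraction of non-zero cells. The variable names are illustrative. On a toy corpus like this one the saving is small; the benefit appears when the vocabulary has thousands of words and each document uses only a few of them.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Dense document-term matrix for the five example sentences (stop words removed)
dense = np.array([
    [0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0],
    [1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 0, 2, 0, 0, 1, 0, 1, 0],
])

sparse = csr_matrix(dense)

# Number of cells in the dense layout versus values actually stored
n_cells = dense.shape[0] * dense.shape[1]
density = sparse.nnz / n_cells

print(sparse.shape, sparse.nnz, n_cells)
print(f"fraction of non-zero cells: {density:.3f}")
```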

In [20]:
X=X_transformed.toarray()
# converting matrix to dataframe
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
pd.DataFrame(X, columns=vec.get_feature_names_out())

Unnamed: 0,cinema,college,cork,depends,educational,ethics,good,great,greatness,movie,sholey,story,ucc
0,0,1,1,0,1,0,0,0,0,0,0,0,1
1,0,0,0,1,1,1,0,0,1,0,0,0,0
2,0,0,0,0,1,1,0,1,1,0,0,1,0
3,1,0,0,0,0,0,0,1,0,0,1,0,0
4,0,0,0,1,0,0,2,0,0,1,0,1,0


A corpus of documents can thus be represented by a matrix with one row per document and one column per token (e.g. word) occurring in the corpus.

We call vectorization the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the "Bag of Words" representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.
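
The fact that word order is ignored can be checked directly. The sketch below builds a bag of words with `collections.Counter` for one of the training sentences and for a reordered copy of it. The tokenizer is a simplified stand-in (lowercase, tokens of two or more word characters), and `bag_of_words` is a hypothetical helper name.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Count tokens while discarding their positions in the text."""
    return Counter(re.findall(r"\b\w\w+\b", text.lower()))

original = "good movie depends on good story"
shuffled = "story depends on good movie good"  # same words, different order

print(bag_of_words(original))
print(bag_of_words(original) == bag_of_words(shuffled))
```

Both sentences produce the same counts, so any model trained on this representation cannot tell them apart.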

So, the 4 steps for vectorization are as follows:

- Import<br>
- Instantiate<br>
- Fit<br>
- Transform<br>

Let us summarise all we have done so far:<br>

- vec.fit(train) learns the vocabulary of the training data<br>
- vec.transform(train) uses the fitted vocabulary to build a document-term matrix from the training data<br>
- vec.transform(test) uses the fitted vocabulary to build a document-term matrix from the testing data (and ignores tokens it hasn't seen before)<br>
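
The four steps and the handling of unseen tokens can be sketched end to end as follows. The training sentences are the five used above; the test sentence is a made-up example, not taken from example_test.csv, and the names `train_docs`, `test_docs` and `vec_demo` are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer  # 1. import

train_docs = [
    "UCC is an educational college in Cork",
    "Educational greatness depends on ethics",
    "A story of great ethics and educational greatness",
    "Sholey is a great cinema",
    "good movie depends on good story",
]
# Hypothetical test sentence: 'documentary' never occurs in the training data
test_docs = ["good educational documentary"]

vec_demo = CountVectorizer(stop_words="english")  # 2. instantiate
vec_demo.fit(train_docs)                          # 3. fit: learn the vocabulary
X_train_demo = vec_demo.transform(train_docs)     # 4. transform the training data
X_test_demo = vec_demo.transform(test_docs)       #    and the test data

# Both matrices have one column per word in the training vocabulary
print(X_train_demo.shape, X_test_demo.shape)
print("documentary" in vec_demo.vocabulary_)
print(X_test_demo.toarray())
```

Because the vectorizer is fitted only on the training documents, the test matrix has the same columns as the training matrix, and the unseen word contributes nothing to it. For the training data, `fit_transform` performs the fit and transform steps in one call.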