# Naive Bayes

Author: Fadoua Ghourabi (fadouaghourabi@gmail.com)

Date: July 10, 2019

In [178]:
import pandas as pd
import numpy as np

## Bayes Theorem

$$P(A|B) = \frac{P(B|A)P(A)}{P(B)},$$

where:
- $A$ and $B$ are events with $P(B)\neq 0$
- $P(A|B)$ is a conditional probability: the likelihood of event $A$ occurring given that $B$ is true.
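- $P(A)$ and $P(B)$ are the marginal probabilities of observing $A$ and $B$, each taken on its own without reference to the other. This is not the **Naive Bayes assumption**, which concerns independence between the features of a classifier and appears later in this section.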

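**Example 1.** The table below records weather observations together with a label `hike` (1 or 0), written $y$ in what follows.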

In [179]:
weather = pd.DataFrame([
                        ["rainy","hot","high","False",0],
                        ["rainy","hot","high","True",0],
                        ["cloudy","hot","high","False",1],
                        ["sunny","mild","high","False",1],
                        ["sunny","cool","normal","False",1],
                        ["sunny","cool","normal","True",0],
                        ["cloudy","cool","normal","True",0],
                        ["rainy","mild","high","False",0],
                        ["rainy","cool","normal","False",1],
                        ["sunny","mild","normal","False",1],
                        ["rainy","mild","normal","True",1],
                        ["cloudy","mild","high","True",1]], columns=["condition","temp","humidity","wind","hike"])

In [180]:
weather

Unnamed: 0,condition,temp,humidity,wind,hike
0,rainy,hot,high,False,0
1,rainy,hot,high,True,0
2,cloudy,hot,high,False,1
3,sunny,mild,high,False,1
4,sunny,cool,normal,False,1
5,sunny,cool,normal,True,0
6,cloudy,cool,normal,True,0
7,rainy,mild,high,False,0
8,rainy,cool,normal,False,1
9,sunny,mild,normal,False,1
10,rainy,mild,normal,True,1
11,cloudy,mild,high,True,1



What is $P(y=1|\text{sunny})$?

- $P(\text{sunny}|y=1) = 3/7$
- $P(y=1) = 7/12$
- $P(\text{sunny}) = 4/12$

In [183]:
p1 = (3/7 * 7/12)/(4/12)
p1

0.75

What is $P(y=0|\text{sunny})$?

- $P(\text{sunny}|y=0) = 1/5$
- $P(y=0) = 5/12$
- $P(\text{sunny}) = 4/12$

In [186]:
p0 = (1/5*5/12)/(4/12)
p0

0.25
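The two posteriors can also be obtained by counting rows. The sketch below uses plain Python rather than pandas so that it runs without the notebook state. The `observations` list and the `posterior` helper are illustrative names, and the pairs repeat the `condition` and `hike` columns of `weather`.

```python
from fractions import Fraction

# (condition, hike) pairs copied from the weather table above
observations = [
    ("rainy", 0), ("rainy", 0), ("cloudy", 1), ("sunny", 1),
    ("sunny", 1), ("sunny", 0), ("cloudy", 0), ("rainy", 0),
    ("rainy", 1), ("sunny", 1), ("rainy", 1), ("cloudy", 1),
]

def posterior(label, condition):
    """P(y=label | condition) through Bayes theorem, with exact fractions."""
    n = len(observations)
    n_label = sum(1 for _, y in observations if y == label)
    n_cond = sum(1 for c, _ in observations if c == condition)
    n_both = sum(1 for c, y in observations if c == condition and y == label)
    likelihood = Fraction(n_both, n_label)   # P(condition | y)
    prior = Fraction(n_label, n)             # P(y)
    evidence = Fraction(n_cond, n)           # P(condition)
    return likelihood * prior / evidence

print(posterior(1, "sunny"), posterior(0, "sunny"))
```

The two values are 3/4 and 1/4. They sum to 1, as expected for two complementary classes.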

Given a class variable $y$ and features $x_1, \cdots, x_n$, Bayes theorem gives:
$$P(y|x_1,\cdots,x_n) = \frac{P(y)P(x_1,\cdots,x_n|y)}{P(x_1,\cdots,x_n)}$$

The **Naive Bayes assumption** is that the features are **conditionally independent given the class**, so that $P(x_1,\cdots,x_n|y) = \prod_{i=1}^{n}P(x_i|y)$ and:
$$P(y|x_1,\cdots,x_n) = \frac{P(y)P(x_1|y) \cdots P(x_n|y)}{P(x_1,\cdots,x_n)}$$

Since the denominator $P(x_1,\cdots,x_n)$ does not depend on $y$, it is a constant for a given input, and we deduce:
$$P(y|x_1,\cdots,x_n) \propto P(y)\prod_{i=1}^{n}P(x_i|y)$$

The classification $\hat{y}$ of $x_1, \cdots, x_n$ would be:

$$\hat{y} = \underset{y}{\text{argmax}}~P(y)\prod_{i=1}^{n}P(x_i|y)$$

**Example 2.** Now, suppose the weather is X = ["sunny","cool","normal","True"]. What is its class $\hat{y}$?

$P(y=0|X) \propto 0.008$
- $P(y=0) = 5/12$
- $P(\text{sunny}|y=0) = 1/5$
- $P(\text{cool}|y=0) = 2/5$
- $P(\text{normal}|y=0) = 2/5$
- $P(\text{True}|y=0) = 3/5$

$P(y=1|X)  \propto 0.012$
- $P(y=1) = 7/12$
- $P(\text{sunny}|y=1) = 3/7$
- $P(\text{cool}|y=1) = 2/7$
- $P(\text{normal}|y=1) = 4/7$
- $P(\text{True}|y=1) = 2/7$

In [None]:
(5/12)*(1/5)*(2/5)*(2/5)*(3/5)

In [6]:
(7/12)*(3/7)*(2/7)*(4/7)*(2/7)

0.011661807580174925
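The same products can be computed from the table instead of by hand. This is a minimal sketch in plain Python. `rows`, `score` and `query` are illustrative names, and the rows repeat the `weather` data.

```python
from fractions import Fraction

# (condition, temp, humidity, wind, hike) rows copied from the weather table
rows = [
    ("rainy", "hot", "high", "False", 0),
    ("rainy", "hot", "high", "True", 0),
    ("cloudy", "hot", "high", "False", 1),
    ("sunny", "mild", "high", "False", 1),
    ("sunny", "cool", "normal", "False", 1),
    ("sunny", "cool", "normal", "True", 0),
    ("cloudy", "cool", "normal", "True", 0),
    ("rainy", "mild", "high", "False", 0),
    ("rainy", "cool", "normal", "False", 1),
    ("sunny", "mild", "normal", "False", 1),
    ("rainy", "mild", "normal", "True", 1),
    ("cloudy", "mild", "high", "True", 1),
]

def score(label, features):
    """Unnormalised posterior P(y) * prod_i P(x_i | y) for one class."""
    in_class = [r for r in rows if r[-1] == label]
    result = Fraction(len(in_class), len(rows))          # prior P(y)
    for i, value in enumerate(features):
        matches = sum(1 for r in in_class if r[i] == value)
        result *= Fraction(matches, len(in_class))       # P(x_i | y)
    return result

query = ("sunny", "cool", "normal", "True")
scores = {label: score(label, query) for label in (0, 1)}
y_hat = max(scores, key=scores.get)   # class with the largest score
print(scores, y_hat)
```

The scores are $1/125 = 0.008$ for $y=0$ and $4/343 \approx 0.012$ for $y=1$, so the predicted class is $\hat{y} = 1$.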

## Gaussian Naive Bayes

In Gaussian Naive Bayes, the features are continuous. We assume that, within each class $y$, each feature $x_i$ is sampled from a Gaussian distribution. The conditional probability is thus defined as follows:

$$P(x_i|y) = \frac{1}{\sqrt{2\pi \sigma_{yi}^{2}}}\exp\left(-\frac{(x_i - \mu_{yi})^2}{2\sigma_{yi}^2}\right)$$

$\mu_{yi}$ is the mean and $\sigma_{yi}$ is the standard deviation of feature $i$ over the training samples of class $y$.
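As a quick numerical illustration of this density, here is a sketch in plain Python. The function name `gaussian_likelihood` is illustrative. It takes the variance $\sigma^2$ rather than the standard deviation.

```python
import math

def gaussian_likelihood(x, mean, variance):
    """Gaussian density of x for a feature with the given class mean and variance."""
    coefficient = 1.0 / math.sqrt(2.0 * math.pi * variance)
    exponent = -((x - mean) ** 2) / (2.0 * variance)
    return coefficient * math.exp(exponent)

# The density peaks at the mean and decays symmetrically around it
print(gaussian_likelihood(0.0, 0.0, 1.0))
print(gaussian_likelihood(1.0, 0.0, 1.0), gaussian_likelihood(-1.0, 0.0, 1.0))
```

Note that these values are densities, not probabilities: for a small variance they can exceed 1. This does not matter for classification, which only compares scores across classes.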

In [187]:
from sklearn import datasets
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

In [188]:
iris = datasets.load_iris()
print(iris.DESCR)

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :

In [189]:
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, stratify=iris.target, random_state=16)

In [190]:
gnb = GaussianNB()
gnb.fit(X_train,y_train)

GaussianNB(priors=None, var_smoothing=1e-09)

In [191]:
print("Train set accuracy {}".format(gnb.score(X_train,y_train)))
print("Test set accuracy {}".format(gnb.score(X_test,y_test)))

Train set accuracy 0.9553571428571429
Test set accuracy 0.9736842105263158


In [192]:
gnb.theta_ # $\mu_y$ or mean of each feature per class

array([[4.97297297, 3.40810811, 1.44864865, 0.23243243],
       [5.96216216, 2.77027027, 4.30810811, 1.34594595],
       [6.54473684, 2.92105263, 5.53684211, 2.02368421]])

In [13]:
gnb.sigma_ # $\sigma_y^2$: variance (not standard deviation) of each feature per class; named var_ in scikit-learn 1.0 and later

array([[0.12143171, 0.11155588, 0.03330899, 0.00813733],
       [0.28289263, 0.09073777, 0.20560994, 0.03707816],
       [0.43826178, 0.0963989 , 0.31811635, 0.06812327]])

In [193]:
gnb.class_prior_ # prior probability of each class

array([0.33035714, 0.33035714, 0.33928571])

In [195]:
0.33035714+0.33035714+0.33928571

0.9999999900000001

In [202]:
gnb.predict(np.array([5.2,0,10.3,0.2]).reshape(1,4))

array([2])
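To see where this prediction comes from, the fitted parameters printed above can be plugged into the Gaussian formula by hand. The sketch below hard-codes the `theta_`, `sigma_` (variances) and `class_prior_` values shown in the outputs and works in log space to avoid numerical underflow. `log_joint` is an illustrative helper, not part of scikit-learn.

```python
import math

# Values copied from the gnb.theta_, gnb.sigma_ and gnb.class_prior_ outputs above
means = [
    [4.97297297, 3.40810811, 1.44864865, 0.23243243],
    [5.96216216, 2.77027027, 4.30810811, 1.34594595],
    [6.54473684, 2.92105263, 5.53684211, 2.02368421],
]
variances = [
    [0.12143171, 0.11155588, 0.03330899, 0.00813733],
    [0.28289263, 0.09073777, 0.20560994, 0.03707816],
    [0.43826178, 0.0963989, 0.31811635, 0.06812327],
]
priors = [0.33035714, 0.33035714, 0.33928571]

def log_joint(x, class_index):
    """log P(y) + sum_i log P(x_i | y) under the Gaussian assumption."""
    total = math.log(priors[class_index])
    for value, mean, var in zip(x, means[class_index], variances[class_index]):
        total += -0.5 * math.log(2.0 * math.pi * var)   # normalisation term
        total += -((value - mean) ** 2) / (2.0 * var)   # squared distance term
    return total

sample = [5.2, 0, 10.3, 0.2]
scores = [log_joint(sample, k) for k in range(3)]
predicted = scores.index(max(scores))
print(predicted)
```

The result is class 2, in agreement with `gnb.predict`. The sample lies far from all three class means, so every score is very low and the classifier simply returns the least unlikely class.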

For comparison, we fit a k-nearest neighbors classifier on the same train/test split.

In [203]:
from sklearn.neighbors import KNeighborsClassifier

In [204]:
clf = KNeighborsClassifier(n_neighbors=2)
clf.fit(X_train, y_train)
print("Train set accuracy: {}".format(clf.score(X_train, y_train)))
print("Test set accuracy: {}".format(clf.score(X_test, y_test)))

Train set accuracy: 0.9642857142857143
Test set accuracy: 0.9473684210526315


## Multinomial Naive Bayes

Multinomial Naive Bayes is mostly used for text classification problems, i.e. deciding whether a text belongs to a category such as science, technology or art. The classification is based on the frequency of the words present in the text.

The distribution of the data is given by $\theta_y = (\theta_{y1},\cdots,\theta_{yn})$ for each class $y$, where $n$ is the number of features (or the size of vocabulary in text classification) and $\theta_{yi} = P(x_i|y)$.

When a feature never occurs with a class in the training samples, e.g. a word of the vocabulary never appears in the training texts of that class, the maximum likelihood estimate of $P(x_i|y)$ is zero, which would set the whole product to zero. To avoid zero probabilities, the parameters $\theta_y$ are estimated by a smoothed version of maximum likelihood:
$$\hat{\theta}_{yi} = \frac{N_{yi} + \alpha}{N_y + n\alpha},$$

where:
- $N_{yi}$ is the number of occurrences of feature $i$ in the training samples of class $y$, e.g. how often the $i$th word appears in texts of class $y$
- $N_y = \sum_{i=1}^n N_{yi}$ is the total number of occurrences of all features in class $y$
- $\alpha \geq 0$ is a smoothing parameter
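A small numerical illustration of the smoothed estimate follows. The helper `smoothed_theta` and the counts are made up for this example.

```python
def smoothed_theta(counts, alpha=1.0):
    """Smoothed estimates (N_yi + alpha) / (N_y + n * alpha) for one class."""
    n = len(counts)        # number of features
    total = sum(counts)    # N_y
    return [(c + alpha) / (total + n * alpha) for c in counts]

# Hypothetical word counts for one class; the second word was never observed
counts = [2, 0, 1]
print(smoothed_theta(counts, alpha=1.0))   # every estimate is strictly positive
print(smoothed_theta(counts, alpha=0.0))   # plain maximum likelihood: one zero
```

With $\alpha = 1$ the unseen word gets probability $1/6$ instead of $0$, and the estimates still sum to 1.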

### Problem. A simple Twitter sentiment analysis

Twitter sentiment analysis is a classical problem in **natural language processing (NLP)**. The goal is to automatically classify a tweet as positive or negative in sentiment. The classifier needs to be trained, and to do that we use a list of manually labelled tweets. We use the following 5 positive tweets and 5 negative tweets.

In [18]:
tweets = [('I love this car', 'positive'),
          ('This view is amazing', 'positive'),
          ('I feel great this morning', 'positive'),
          ('I am so excited about the concert', 'positive'),
          ('He is my best friend', 'positive'),
          ('I do not like this car', 'negative'),
          ('This view is horrible', 'negative'),
          ('I feel tired this morning', 'negative'),
          ('I am not looking forward to the concert', 'negative'),
          ('He is my enemy', 'negative')]

We preprocess the tweets by converting all words to lower case and using a single space as separator. This pre-processing step produces a uniform, plain text format that is easier to handle. For instance, "Great", "great" and "GReat" must all be converted to lower case to be considered equivalent.

Be careful! Preprocessing in NLP usually consists of several more complex steps such as tokenization, removal of stopwords and punctuation, lemmatization, stemming... **For simplicity, we only convert to lower case and split on a single space**.

In [205]:
tw_data = []
for (sent, sentiment) in tweets:
    # sent.split() extracts the words (or tokens) of a tweet
    # w.lower() converts each word to lower case
    lower_sent = [w.lower() for w in sent.split()]
    # " ".join(lower_sent) joins the words (or tokens) with a single space
    tw_data.append((" ".join(lower_sent), sentiment))
    
tw_df = pd.DataFrame(tw_data,columns=["tweet","sentiment"])

In [206]:
tw_df

Unnamed: 0,tweet,sentiment
0,i love this car,positive
1,this view is amazing,positive
2,i feel great this morning,positive
3,i am so excited about the concert,positive
4,he is my best friend,positive
5,i do not like this car,negative
6,this view is horrible,negative
7,i feel tired this morning,negative
8,i am not looking forward to the concert,negative
9,he is my enemy,negative


##### Bag of words

Bag of words is the simplest numerical representation of text. This model is only concerned with the occurrence of words and not with where they are placed (i.e. their order) in the text. The intuition behind this approach is that similar texts contain similar words.

text1 = 'I love this car'
text2 = 'This view is amazing'
text3 = 'I feel great this morning'
text4 = 'I am so excited about the concert'
text5 = 'He is my best friend'
text6 = 'I do not like this car'
text7 = 'This view is horrible'
text8 = 'I feel tired this morning'
text9 = 'I am not looking forward to the concert'
text10 = 'He is my enemy'

**Step 1**: Generate vocabulary or the set of unique words/tokens in the text1$\sim$text10.

**Step 2**: Create a vector for each text. The length of the vector is the size of the vocabulary, and each position corresponds to one word of the vocabulary. Each position holds the number of times the corresponding word occurs in the text. In our tweets no word is repeated within a tweet, so every entry is 0 or 1.
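Steps 1 and 2 can be sketched by hand in plain Python. The tokenizer below keeps lowercase tokens of at least two word characters, an assumption chosen to mimic the default behaviour of scikit-learn's `CountVectorizer`. `tokenize`, `vocabulary` and `vectors` are illustrative names.

```python
import re

texts = [
    "I love this car",
    "This view is amazing",
    "I feel great this morning",
    "I am so excited about the concert",
    "He is my best friend",
    "I do not like this car",
    "This view is horrible",
    "I feel tired this morning",
    "I am not looking forward to the concert",
    "He is my enemy",
]

def tokenize(text):
    """Lowercase the text and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

# Step 1: vocabulary = sorted unique tokens, each mapped to a column index
unique_tokens = sorted({token for text in texts for token in tokenize(text)})
vocabulary = {word: index for index, word in enumerate(unique_tokens)}

# Step 2: one count vector per text
vectors = []
for text in texts:
    row = [0] * len(vocabulary)
    for token in tokenize(text):
        row[vocabulary[token]] += 1
    vectors.append(row)

print(len(vocabulary), vocabulary["love"], vectors[0])
```

The vocabulary has 28 entries and does not contain "i", because one-character tokens are dropped by this tokenizer.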

In [127]:
from sklearn.feature_extraction.text import CountVectorizer

``CountVectorizer()`` implements the bag of words representation. We fit ``CountVectorizer()`` to our data, i.e. text1$\sim$text10, so that it builds the vocabulary (step 1) and assigns an index to each feature (recall that features are the words in the vocabulary). To create a vector representation for each text, we use the ``transform()`` function. The vectors of text1$\sim$text10 are stacked in a matrix, or rather a sparse matrix (a matrix in which most elements are zero). Note that with its default settings ``CountVectorizer()`` lowercases the text and ignores one-character tokens, which is why "i" does not appear in the vocabulary below.

In [207]:
# This function performs steps 1 and 2.
# It returns:
# 1. the fitted bag of words model (variable vectorizer) and,
# 2. the vectors of text1...text10, converted from a sparse matrix into a dataframe for easier treatment.
def bag_of_words_vec(text):
    vectorizer = CountVectorizer()
    vectorizer.fit(text)  # step 1: build the vocabulary
    print(vectorizer.vocabulary_, len(vectorizer.vocabulary_))
    doc_term_matrix = vectorizer.transform(text)  # step 2: one count vector per text
    return (vectorizer, pd.DataFrame(doc_term_matrix.toarray()))

The vocabulary is composed of 28 words. The bag of words representation is thus of shape 10x28.

In [208]:
vectorizer, tw_vecs = bag_of_words_vec(tw_df.tweet)
tw_vecs

{'love': 18, 'this': 24, 'car': 4, 'view': 27, 'is': 15, 'amazing': 2, 'feel': 9, 'great': 12, 'morning': 19, 'am': 1, 'so': 22, 'excited': 8, 'about': 0, 'the': 23, 'concert': 5, 'he': 13, 'my': 20, 'best': 3, 'friend': 11, 'do': 6, 'not': 21, 'like': 16, 'horrible': 14, 'tired': 25, 'looking': 17, 'forward': 10, 'to': 26, 'enemy': 7} 28


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,18,19,20,21,22,23,24,25,26,27
0,0,0,0,0,1,0,0,0,0,0,...,1,0,0,0,0,0,1,0,0,0
1,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,1
2,0,0,0,0,0,0,0,0,0,1,...,0,1,0,0,0,0,1,0,0,0
3,1,1,0,0,0,1,0,0,1,0,...,0,0,0,0,1,1,0,0,0,0
4,0,0,0,1,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
5,0,0,0,0,1,0,1,0,0,0,...,0,0,0,1,0,0,1,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,1
7,0,0,0,0,0,0,0,0,0,1,...,0,1,0,0,0,0,1,1,0,0
8,0,1,0,0,0,1,0,0,0,0,...,0,0,0,1,0,1,0,0,1,0
9,0,0,0,0,0,0,0,1,0,0,...,0,0,1,0,0,0,0,0,0,0


**Practice.** Pick a vector and check whether the representation is correct.

**Step 3**: Now we are ready to apply a classification algorithm.

We fit ``MultinomialNB`` to the data with the default smoothing alpha equal to 1.

In [209]:
from sklearn.naive_bayes import MultinomialNB

In [210]:
mnb = MultinomialNB()
mnb.fit(tw_vecs,tw_df.sentiment)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [143]:
mnb.class_count_ # number of training tweets in each class

array([5., 5.])

In [211]:
mnb.alpha # value of alpha

1.0

In [212]:
# refit with a much smaller smoothing parameter than the default alpha=1.0
mnb = MultinomialNB(alpha=0.01)
mnb.fit(tw_vecs,tw_df.sentiment)

MultinomialNB(alpha=0.01, class_prior=None, fit_prior=True)

Suppose we receive four new tweets that we want to classify as _positive_ or _negative_. We first transform them using the same bag of words representation that we fit on text1$\sim$text10, with the ``transform`` function.

In [169]:
new_tweets = vectorizer.transform(["the cake is good",
                                   "amazing song",
                                   "i am not happy about the result",
                                   "hope always"]).toarray()

In [173]:
new_tweets

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
        0, 1, 0, 0, 0, 0],
       [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0],
       [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
        0, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0]])

Finally, we predict the sentiments of the new tweets.

In [213]:
mnb.predict(new_tweets)

array(['positive', 'positive', 'negative', 'negative'], dtype='<U8')
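These predictions can be reproduced by hand from the smoothed estimate $\hat{\theta}_{yi}$ with $\alpha = 0.01$. The sketch is plain Python and self-contained. `tokenize`, `log_score` and `predict` are illustrative helpers rather than scikit-learn functions, and the tokenizer is an assumption that mimics the default `CountVectorizer` behaviour of dropping one-character tokens.

```python
import math
import re
from collections import Counter

train = [
    ("I love this car", "positive"),
    ("This view is amazing", "positive"),
    ("I feel great this morning", "positive"),
    ("I am so excited about the concert", "positive"),
    ("He is my best friend", "positive"),
    ("I do not like this car", "negative"),
    ("This view is horrible", "negative"),
    ("I feel tired this morning", "negative"),
    ("I am not looking forward to the concert", "negative"),
    ("He is my enemy", "negative"),
]

def tokenize(text):
    return re.findall(r"\b\w\w+\b", text.lower())

vocabulary = {token for text, _ in train for token in tokenize(text)}
classes = sorted({label for _, label in train})

# N_yi: occurrences of each word per class; doc_counts: tweets per class
word_counts = {label: Counter() for label in classes}
doc_counts = Counter()
for text, label in train:
    word_counts[label].update(tokenize(text))
    doc_counts[label] += 1

def log_score(text, label, alpha=0.01):
    """log P(y) + sum over known words of log theta_yi (smoothed)."""
    n_y = sum(word_counts[label].values())
    denominator = n_y + alpha * len(vocabulary)
    score = math.log(doc_counts[label] / len(train))
    for token in tokenize(text):
        if token in vocabulary:   # words outside the vocabulary are ignored
            score += math.log((word_counts[label][token] + alpha) / denominator)
    return score

def predict(text):
    # On a tie, max() keeps the first class in sorted order
    return max(classes, key=lambda label: log_score(text, label))

new_texts = ["the cake is good", "amazing song",
             "i am not happy about the result", "hope always"]
print([predict(text) for text in new_texts])
```

The last tweet contains no known word, so both classes tie on the prior. In this sketch the tie goes to the first class in sorted order, which happens to match the `negative` label in the output above.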

## Bernoulli Naive Bayes

The Bernoulli Naive Bayes classifier assumes that the features are binary (here, whether a word is present or absent in a tweet). For each class, it counts how often each feature is non-zero.

In [214]:
from sklearn.naive_bayes import BernoulliNB

In [215]:
bnb = BernoulliNB()
bnb.fit(tw_vecs,tw_df.sentiment)
bnb.predict(new_tweets)

array(['positive', 'positive', 'negative', 'positive'], dtype='<U8')
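The fourth tweet, which contains no known word, is labelled differently by the two models. A hand computation suggests why: in the Bernoulli model an absent word contributes a factor $1 - P(x_i = 1|y)$, so an all-zero vector still carries information. The sketch below assumes the smoothed estimate $P(x_i = 1|y) = (D_{yi} + \alpha)/(D_y + 2\alpha)$ with $\alpha = 1$, where $D_{yi}$ is the number of tweets of class $y$ containing word $i$ and $D_y$ is the number of tweets of class $y$. All names are illustrative, and the tokenizer again mimics the default `CountVectorizer` settings.

```python
import math
import re

train = [
    ("I love this car", "positive"),
    ("This view is amazing", "positive"),
    ("I feel great this morning", "positive"),
    ("I am so excited about the concert", "positive"),
    ("He is my best friend", "positive"),
    ("I do not like this car", "negative"),
    ("This view is horrible", "negative"),
    ("I feel tired this morning", "negative"),
    ("I am not looking forward to the concert", "negative"),
    ("He is my enemy", "negative"),
]

def tokenize(text):
    # A set: the Bernoulli model only looks at presence or absence
    return set(re.findall(r"\b\w\w+\b", text.lower()))

vocabulary = sorted(set().union(*(tokenize(text) for text, _ in train)))
classes = sorted({label for _, label in train})

def log_score(text, label, alpha=1.0):
    """log P(y) + sum over ALL vocabulary words of log P(x_i | y)."""
    docs = [tokenize(t) for t, l in train if l == label]
    present = tokenize(text)
    score = math.log(len(docs) / len(train))               # log P(y)
    for word in vocabulary:
        doc_freq = sum(1 for d in docs if word in d)       # D_yi
        p = (doc_freq + alpha) / (len(docs) + 2 * alpha)   # P(word present | y)
        score += math.log(p if word in present else 1.0 - p)
    return score

def predict(text):
    return max(classes, key=lambda label: log_score(text, label))

new_texts = ["the cake is good", "amazing song",
             "i am not happy about the result", "hope always"]
print([predict(text) for text in new_texts])
```

Under these assumptions the all-zero tweet is scored higher for `positive`, because the positive tweets use fewer distinct vocabulary words, so the absence of every word is slightly less surprising for that class.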