## 1. What is the Naive Bayes Algorithm?

### 1.1 Overview

Naive Bayes is a supervised classification algorithm that belongs to the family of probabilistic classifiers. As the name suggests, <u>it uses Bayes' theorem at its core, together with a 'naive' assumption</u>.

The algorithm is widely used for simple classification problems and works well with text data in bag-of-words form. In fact, it became well known through its use in spam email filtering. To this day, it is widely used as a baseline model by data scientists in hackathons and Kaggle competitions.

Before we get to the crux of the algorithm, it's important to know what Bayes' rule is.

### 1.2 The Bayes' Theorem

You can get a great intuitive sense of conditional probability from the lectures of Prof. John Tsitsiklis (MIT OpenCourseWare).

`"You know something about the world, and based on what you know, you set up a probability model and you write down probabilities about the different outcomes. Then someone gives you some new information, which changes your beliefs and thus changes the probabilities of your outcomes."`

In simple terms, Bayes' theorem gives the 'new' probabilities of an experiment's outcomes given that an event has occurred.

Let's say you have a sample space in which the probabilities of the different events are as given below - 

<img src="img.jpg" width='500'>

<b>Now let's say someone tells you that B has occurred. The area outside the circle that represents B is now meaningless, since the new sample space is B. This means we need to revise the probabilities that have been assigned to the regions inside B.</b>

Now someone asks you: what is the probability that A occurs, given that B has occurred? Since B has occurred and the new sample space is B, the region that initially represented $P(A \cap B)$ now represents $P(A \mid B)$. Its probability has to be rescaled, however, so that the probabilities inside B sum to 1.

What is the probability of this new section? Well, we can just use simple ratios for solving this.

$$ P (A \mid B) = {P(A \cap B) \over P(B)} = {0.3 \over (0.3+0.2)}$$

Now, the whole exercise above can be repeated for the case where A has occurred. By symmetry, we get - 

$$ P (B \mid A) = {P(A \cap B) \over P(A)}$$

Using the two symmetric equations and replacing $P(A \cap B)$, we can write the famous equation - 

$$ P (A \mid B) = {P (B \mid A) \times P(A) \over P(B)}$$
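As a quick sanity check, the derivation above can be reproduced numerically. The sketch below uses the two values from the example ($P(A \cap B) = 0.3$ and $0.2$ for the rest of B); the probability of the part of A outside B is not used in the text, so the value `p_a_only` is an assumption made purely for illustration.

```python
from fractions import Fraction

# Probabilities of the disjoint regions of the sample space.
p_a_and_b = Fraction(3, 10)  # P(A ∩ B), from the example above
p_b_only = Fraction(2, 10)   # part of B outside A, from the example above
p_a_only = Fraction(1, 10)   # part of A outside B: assumed, for illustration only

# Marginal probabilities are sums of disjoint regions.
p_a = p_a_only + p_a_and_b
p_b = p_b_only + p_a_and_b

# Conditional probabilities from the ratio definition.
p_a_given_b = p_a_and_b / p_b
p_b_given_a = p_a_and_b / p_a

# Right-hand side of Bayes' theorem.
bayes_rhs = p_b_given_a * p_a / p_b

print(p_a_given_b, bayes_rhs)  # both print as 3/5
```

Using `Fraction` keeps the arithmetic exact, so the two sides can be compared with `==` without worrying about floating point rounding.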

### 1.2 The Naive Assumption

Now, let's see an 'expanded' version of this definition which is more relevant to data science. We want to know the probability of $y$ given that $x_1, x_2, ... , x_n$ have occurred (where the $x_i$ are different features and $y$ is the class label we want to predict). This is given by - 

$$P(y \mid x_1, x_2, ... , x_n) = {P(x_1, x_2, ... , x_n \mid y) \times P(y) \over P(x_1, x_2, ... , x_n)}$$

Here comes the <u>naive assumption of conditional independence</u>. The assumption is that all the features $x_1, x_2, ... , x_n$ are independent of each other once the class $y$ is known. For two events this means that $P(A, B \mid y) = P(A \mid y).P(B \mid y)$. Applying this to the numerator of the above equation we get - 

$$P(y \mid x_1, x_2, ... , x_n) = {P(x_1 \mid y).P(x_2 \mid y)... P(x_n \mid y).P(y) \over P(x_1, x_2, ... , x_n)}$$

Using product notation - 

$$P(y \mid x_1, x_2, ... , x_n) = {P(y)\prod_{i=1}^{n} P(x_i \mid y) \over P(x_1, x_2, ... , x_n)}$$

Since the denominator does not depend on $y$, it is the same for every class, and we can write the equation as -

$$P(y \mid x_1, x_2, ... , x_n) \propto P(y)\prod_{i=1}^{n} P(x_i \mid y)$$

To turn this into a classifier, we pick the class $y$ with the highest probability. This can be expressed as -

$$\hat{y} = \arg\max_y \; P(y)\prod_{i=1}^{n} P(x_i \mid y)$$
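To make the decision rule concrete, here is a minimal from-scratch sketch for discrete features. The toy data, the feature meanings (outlook, windy) and the helper names `fit` and `predict` are all hypothetical and chosen only for illustration. Probabilities are estimated by plain counting, without any smoothing.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: each row is (outlook, windy), the label is "play?".
X = [("sunny", "no"), ("sunny", "yes"), ("rainy", "yes"),
     ("rainy", "no"), ("sunny", "no"), ("rainy", "yes")]
y = ["yes", "no", "no", "yes", "yes", "no"]

def fit(X, y):
    """Count class occurrences and feature values per class."""
    class_counts = Counter(y)
    # feature_counts[(i, label)][value] = how often feature i takes `value` in class `label`
    feature_counts = defaultdict(Counter)
    for row, label in zip(X, y):
        for i, value in enumerate(row):
            feature_counts[(i, label)][value] += 1
    return class_counts, feature_counts

def predict(row, class_counts, feature_counts):
    """Return the class maximising P(y) * prod_i P(x_i | y), plus all scores."""
    n = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / n  # P(y)
        for i, value in enumerate(row):
            score *= feature_counts[(i, label)][value] / count  # P(x_i | y)
        scores[label] = score
    return max(scores, key=scores.get), scores

class_counts, feature_counts = fit(X, y)
label, scores = predict(("sunny", "no"), class_counts, feature_counts)
print(label, scores)
```

Note that the scores are unnormalised, which is fine because only their ordering matters. A feature value never seen with a class drives that class's score to exactly zero, which is why practical implementations usually add some form of smoothing.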

### 1.3 Curve fitting

When working with different algorithms, it's important to understand where a particular algorithm will work and where it may not. For this, I have always found a 2-dimensional example of decision boundaries quite useful for intuition, since the core 'nature' of the classifier carries over to higher-dimensional spaces as well. The objective is to separate the blue points from the red points. The decision boundary is plotted along with the classifier's confidence.

As seen below (image from the sklearn documentation), the naive Bayes classifier is capable of fitting smooth, continuous decision boundaries, but it struggles when separating the classes requires a highly irregular boundary, such as one described by a high-degree polynomial.

<img src='NBclassifier.PNG' width=700>

### 1.4 Different Classifiers

Now, when you actually plan on implementing this, you will notice that there are multiple classifiers available for a naive Bayes model. The point of these classifiers is quite simple. The above equation calculates $P(x_i \mid y)$ in a straightforward way for features with discrete values: you just count the number of times a value occurs together with a specific class label $y$. But what if the features are continuous variables? Counting no longer works, so you need to 'assume' the nature of the distribution. In this case we can use classifiers such as GaussianNB.

Simply put, by choosing different classifiers, you get to choose the assumptions regarding the nature of distributions of $P(x_i \mid y)$.

<u>**GaussianNB**</u>

When your features are continuous in nature, you assume that they are conditionally independent and that the values of each feature within each class are normally distributed. To calculate the $P(x_i \mid y)$ values, we segment the data by class, compute the mean and variance of each feature within that class, and then derive the probability density as below.

$$P(x_i \mid y) = {1 \over \sqrt{2 \pi \sigma^2_y}} \exp \Bigg(-{(x_i - \mu_y)^2 \over 2 \sigma^2_y} \Bigg)$$
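The formula above translates almost directly into code. The sketch below uses only the Python standard library; the function name `gaussian_likelihood` and the sample values are hypothetical, standing in for one feature's values within a single class.

```python
import math
from statistics import fmean, pvariance

def gaussian_likelihood(x, mean, var):
    """Gaussian density of x for a class with the given mean and variance."""
    coeff = 1.0 / math.sqrt(2.0 * math.pi * var)
    return coeff * math.exp(-((x - mean) ** 2) / (2.0 * var))

# Hypothetical values of one feature for the samples of one class.
class_values = [4.9, 5.1, 5.0, 5.2, 4.8]

# Per-class estimates: mean and (population) variance of the feature.
mu = fmean(class_values)
sigma2 = pvariance(class_values, mu)

# Likelihood of observing x_i = 5.0 under this class.
likelihood = gaussian_likelihood(5.0, mu, sigma2)
print(mu, sigma2, likelihood)
```

Keep in mind that this value is a probability density rather than a probability, so it can be larger than 1 when the variance is small. That is not a problem for classification, since only the relative size of the scores across classes matters.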

## 2. How to implement the Naive Bayes algorithm?

Let's do a quick implementation of Naive Bayes. This is how you would actually implement it when working with data. The focus will be on the algorithm rather than on data preprocessing for now. For this example, I will use the iris data, which is available as part of the sklearn datasets API.

In [7]:
# Loading the dependencies for the model
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

In [21]:
# Loading the iris data into X, y. Each dataset will have its own way of doing this.
iris = load_iris()
X = iris.data
y = iris.target
y_labels = iris.target_names
X_labels = iris.feature_names

In [22]:
#4 features, 150 samples, 3 target values to predict
print("Column names - ",X_labels)
print("Shape of X - ",X.shape)
print("Target values - ",y_labels)

Column names -  ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Shape of X -  (150, 4)
Target values -  ['setosa' 'versicolor' 'virginica']


In [26]:
#Separating test and train data for model evaluation
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2)

In [27]:
#Instantiate the classifier and fit it to training data
nb = GaussianNB()
nb.fit(X_train, y_train)

GaussianNB(priors=None)

In [28]:
#Evaluation of the model using test data
nb.score(X_test, y_test)

0.9

In [34]:
# Model's prediction when sepal_len = 5.8, sepal_wid = 2.2, petal_len = 4.2, petal_wid = 0.4
nb.predict([[5.8,2.2,4.2,0.4]])

array([1])

The value 1 is the index of versicolor in `y_labels` (0 represents setosa and 2 represents virginica).
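Besides the predicted label, it can be useful to look at the posterior probability the model assigns to each class. The sketch below is a self-contained variant of the cells above that calls `predict_proba` on the same sample. The `random_state=0` argument is an assumption added here to make the split repeatable; it is not part of the original run, so the score and probabilities may differ from the outputs shown earlier.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

iris = load_iris()

# random_state=0 is an assumption, used only to make the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0
)

nb = GaussianNB().fit(X_train, y_train)

# Same sample as above: sepal_len, sepal_wid, petal_len, petal_wid
sample = [[5.8, 2.2, 4.2, 0.4]]
proba = nb.predict_proba(sample)  # shape (1, n_classes), each row sums to 1
predicted = nb.predict(sample)

# Print the posterior probability of each class next to its name.
for name, p in zip(iris.target_names, proba[0]):
    print(f"{name}: {p:.3f}")
print("Predicted class:", iris.target_names[predicted[0]])
```

The predicted label is simply the class with the largest posterior probability, which mirrors the argmax rule from section 1.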

## 3. References

- https://www.youtube.com/watch?v=j9WZyLZCBzs&list=PLUl4u3cNGP61MdtwGTqZA0MreSaDybji8
- https://scikit-learn.org/stable/modules/naive_bayes.html
- https://en.wikipedia.org/wiki/Naive_Bayes_classifier
- https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html