Naive Bayes Classifier
===

Naive Bayes classifiers are a family of probabilistic classifiers based on Bayes' theorem, with the "naive" assumption that every feature is conditionally independent of every other feature given the class.

This notebook is the first of three notebooks discussing Naive Bayes classifiers. Here we study the **Gaussian Naive Bayes classifier**.

Before diving straight to Gaussian NB, we must first take a look at the Naive Bayes probabilistic model.

Mathematically, given an example to be classified, represented by a feature vector $x = (x_{1}, \ldots, x_{n})$, NB assigns to this example a conditional probability,

$$ p(C_{k}|x_{1}, \ldots, x_{n} )$$

for each of the $K$ classes in the dataset. Learning this multivariate distribution directly would require a large amount of data. Thus, to simplify the task of learning, we assume that the features are conditionally independent of each other given the class. We start by rewriting the posterior with Bayes' theorem,

$$ p(C_{k} | x) = \dfrac{p(C_{k})p(x|C_{k})}{p(x)} $$

Translated to plain English, the above equation may be read as

$$ posterior = \dfrac{prior \times likelihood}{evidence} $$

By the definition of conditional probability, the numerator is just the *joint probability distribution* $p(C_{k}, x)$, which may be factored using the chain rule,

$$ p(C_{k}, x) = p(x_{1}|x_{2}, \ldots, x_{n}, C_{k})\, p(x_{2}|x_{3}, \ldots, x_{n},C_{k}) \cdots p(x_{n}|C_{k})\, p(C_{k}) $$

Now, through the assumption of conditional independence of features, i.e. each feature $x_{i}$ is conditionally independent of every other feature $x_{j}$ for $j \neq i$ given the class $C_{k}$, we get

$$ p(x_{i} | x_{i + 1}, \ldots, x_{n}, C_{k}) = p(x_{i} | C_{k}) $$

This leads us to the following expression for the joint probability model $p(C_{k}, x)$. The posterior is proportional to it, since the evidence $p(x)$ does not depend on $C_{k}$,

$$ p(C_{k} | x_{1}, \ldots, x_{n}) \propto p(C_{k}, x_{1}, \ldots, x_{n}) = p(C_{k}) p(x_{1}|C_{k}) p(x_{2}|C_{k}) \cdots p(x_{n}|C_{k}) = p(C_{k}) \prod_{i=1}^{n} p(x_{i}|C_{k})$$

### Building a NB Classifier

What we have explored above is the naive Bayes probabilistic model. To build a classifier, we have to add a decision rule, like the one we had for the linear regression classifier (recall the example: if $\hat{y} \geq 0.5$ then the output is 1, otherwise 0). For our NB classifier, we pick the most probable class using the $argmax$ function,

$$ \hat{y} = argmax_{k \in \{1,\ldots,K\}} \bigg(p(C_{k}) \prod_{i=1}^{n} p(x_{i}|C_{k})\bigg) $$
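The decision rule above can be sketched in a few lines of NumPy. This is only an illustration: the function name `nb_decision` and the prior and likelihood values are hypothetical, not taken from the example later in this notebook. The sketch sums logarithms instead of multiplying probabilities. That is a swapped-in but equivalent formulation, because the logarithm is monotonic, and it avoids numerical underflow when there are many features.

```python
import numpy as np

def nb_decision(priors, likelihoods):
    """Pick the class k that maximizes p(C_k) * prod_i p(x_i | C_k).

    priors:      shape (K,), one prior per class
    likelihoods: shape (n, K), entry [i, k] holds p(x_i | C_k)
    """
    priors = np.asarray(priors, dtype=float)
    likelihoods = np.asarray(likelihoods, dtype=float)
    # log p(C_k) + sum_i log p(x_i | C_k); same argmax as the product form
    log_joint = np.log(priors) + np.log(likelihoods).sum(axis=0)
    return int(np.argmax(log_joint)), log_joint

# Hypothetical values: K = 2 classes, n = 3 features
priors = [0.5, 0.5]
likelihoods = [[0.2, 0.6],
               [0.5, 0.1],
               [0.4, 0.3]]

k_hat, log_joint = nb_decision(priors, likelihoods)
print(k_hat)              # index of the predicted class
print(np.exp(log_joint))  # unnormalized joint p(C_k, x) per class
```

Exponentiating `log_joint` recovers the unnormalized products, so the two forms can be compared directly on small inputs.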

## Gaussian Naive Bayes

When the features are continuous, a typical assumption is that the continuous values associated with each class are distributed according to a [Gaussian distribution](https://math.stackexchange.com/questions/2288322/intuition-behind-the-normal-distribution). Recall that the probability density function of the normal (Gaussian) distribution is given by

$$ f(x) = \dfrac{1}{\sqrt{2 \pi \sigma^2}} \cdot exp\bigg(\dfrac{-(x - \mu)^2}{2 \sigma^2}\bigg) $$

where $\sigma^2$ represents the variance of the values in $x$, while $\mu$ represents the mean of the values in $x$.

So, for Gaussian NB, suppose we have training data with a continuous attribute $x$. We first segment the data by class, then compute the mean $\mu$ and the variance $\sigma^2$ of $x$ within each class. We let $\mu_{k}$ be the mean of the values of $x$ for class $C_{k}$, and $\sigma_{k}^{2}$ be the variance of the values of $x$ for class $C_{k}$.

Now, assume we have collected some observation value $x_{i}$. The probability density of $x_{i}$ given class $C_{k}$ is written $p(x=x_{i}|C_{k})$. To compute it, we plug $x_{i}$ into the Gaussian density function with parameters $\mu_{k}$ and $\sigma_{k}^{2}$,

$$ p(x = x_{i} | C_{k}) = \dfrac{1}{\sqrt{2 \pi \sigma_{k}^{2}}} \cdot exp\bigg(\dfrac{-(x_{i} - \mu_{k})^2}{2 \sigma_{k}^{2}}\bigg) $$

### Example

Sex classification: classify whether a given person's details pertain to a male or a female. The features are: (1) height, (2) weight, and (3) foot size. Example from [Wikipedia](https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Sex_classification).

|Person|height (feet)|weight (lbs)|foot size (inches)|
|------|-------------|------------|-----------------|
|male|6|180|12|
|male|5.92|190|11|
|male|5.58|170|12|
|male|5.92|165|10|
|female|5|100|6|
|female|5.5|150|8|
|female|5.42|130|7|
|female|5.75|150|9|

We load `numpy` and `pandas` for data handling.

In [1]:
import numpy as np
import pandas as pd

We load the dataset from `sex.csv`.

In [2]:
dataset = pd.read_csv('sex.csv')

We view the dataset.

In [3]:
dataset

Unnamed: 0,Person,height,weight,foot size
0,male,6.0,180,12
1,male,5.92,190,11
2,male,5.58,170,12
3,male,5.92,165,10
4,female,5.0,100,6
5,female,5.5,150,8
6,female,5.42,130,7
7,female,5.75,150,9


We index the labels $[male, female] \rightarrow [0, 1] $.

In [4]:
dataset['Person'] = dataset['Person'].replace('male', 0)
dataset['Person'] = dataset['Person'].replace('female', 1)

Let us view the dataset again just to check.

In [5]:
dataset

Unnamed: 0,Person,height,weight,foot size
0,0,6.0,180,12
1,0,5.92,190,11
2,0,5.58,170,12
3,0,5.92,165,10
4,1,5.0,100,6
5,1,5.5,150,8
6,1,5.42,130,7
7,1,5.75,150,9


### Mean and Variance

Before we can compute the probability density of each feature $x_{i}$, we must first compute the mean $\mu$ and variance $\sigma^{2}$ of $x_{i}$ for each class $C_{k}$.

In [6]:
mean_male = dataset[['height', 'weight', 'foot size']][dataset['Person'] == 0].mean()
mean_female = dataset[['height', 'weight', 'foot size']][dataset['Person'] == 1].mean()

var_male = dataset[['height', 'weight', 'foot size']][dataset['Person'] == 0].var()
var_female = dataset[['height', 'weight', 'foot size']][dataset['Person'] == 1].var()
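As an alternative to the per-class boolean masks above, the same statistics can be obtained with a single `groupby`. The sketch below rebuilds the eight rows inline so that it runs on its own. The names `df`, `means` and `variances` are illustrative and are not used elsewhere in this notebook. Note that pandas' `var()` defaults to the sample variance (`ddof=1`).

```python
import pandas as pd

# Same eight rows as the table above, labels already indexed (male=0, female=1)
df = pd.DataFrame({
    'Person':    [0, 0, 0, 0, 1, 1, 1, 1],
    'height':    [6.0, 5.92, 5.58, 5.92, 5.0, 5.5, 5.42, 5.75],
    'weight':    [180, 190, 170, 165, 100, 150, 130, 150],
    'foot size': [12, 11, 12, 10, 6, 8, 7, 9],
})

grouped = df.groupby('Person')
means = grouped.mean()      # one row per class, one column per feature
variances = grouped.var()   # sample variance (ddof=1), the pandas default

print(means)
print(variances)
```

Each row of `means` and `variances` corresponds to one class, so `means.loc[0]` plays the role of `mean_male` and `means.loc[1]` that of `mean_female`.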

We view the computed values.

In [7]:
print('Mean values for male features\n===\n{}'.format(mean_male))
print()
print('Mean values for female features\n===\n{}'.format(mean_female))
print()
print('Variance values for male features\n===\n{}'.format(var_male))
print()
print('Variance values for female features\n===\n{}'.format(var_female))

Mean values for male features
===
height         5.855
weight       176.250
foot size     11.250
dtype: float64

Mean values for female features
===
height         5.4175
weight       132.5000
foot size      7.5000
dtype: float64

Variance values for male features
===
height         0.035033
weight       122.916667
foot size      0.916667
dtype: float64

Variance values for female features
===
height         0.097225
weight       558.333333
foot size      1.666667
dtype: float64


Now that we have the $\mu$ and $\sigma^{2}$ values for each feature $x_{i}$ per class $C_{k}$, let us write a function for our $likelihood$ computation, i.e. $p(x_{i}|C_{k})$. Recall that the likelihood is obtained by plugging the feature value into the Gaussian probability density function,

$$ p(x = x_{i} | C_{k}) = \dfrac{1}{\sqrt{2 \pi \sigma_{k}^{2}}} \cdot exp\bigg(\dfrac{-(x_{i} - \mu_{k})^2}{2 \sigma_{k}^{2}}\bigg) $$

Hence, we implement the `likelihood` function as follows,

In [8]:
def likelihood(feature, mean, variance):
    return (1 / np.sqrt(2 * np.pi * variance)) * np.exp((-(feature - mean) ** 2) / (2 * variance))
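One property worth checking is that this function returns a *density*: a single value may exceed 1, while the area under the curve is still 1. The sketch below repeats the `likelihood` definition so that it runs on its own. It uses the male height mean and variance from the printed output above and approximates the integral with a plain Riemann sum. The grid width of 10 standard deviations and the number of grid points are arbitrary choices.

```python
import numpy as np

def likelihood(feature, mean, variance):
    return (1 / np.sqrt(2 * np.pi * variance)) * np.exp((-(feature - mean) ** 2) / (2 * variance))

mu, var = 5.855, 0.035033   # male height mean and variance (rounded)

# Density evaluated at the mean, where the Gaussian peaks
peak = likelihood(mu, mu, var)

# Riemann-sum approximation of the integral over mu +/- 10 standard deviations
sigma = np.sqrt(var)
grid = np.linspace(mu - 10 * sigma, mu + 10 * sigma, 20001)
dx = grid[1] - grid[0]
area = np.sum(likelihood(grid, mu, var)) * dx

print(peak)   # greater than 1, which is fine for a density
print(area)   # close to 1
```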

Now that we have defined our function for computing the likelihood, let us talk about computing the $prior$ probability for each of the $K$ classes. There are two ways to do this: (1) give an equal probability to each class, or (2) use the class frequencies, i.e. (number of samples in the class) / (total number of samples). Either way, for this notebook, we get a prior probability of 0.5 for each class since there are exactly 4 samples per class.

Let us declare prior probabilities for each $k$ class.

In [9]:
priors = np.array([1 / np.unique(dataset.Person).shape[0]] * np.unique(dataset.Person).shape[0])

In [10]:
priors

array([0.5, 0.5])
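The cell above implements option (1), equal priors. Option (2), class frequencies, could be sketched as follows. The `labels` series is rebuilt inline so that the snippet runs on its own, and the name `priors_freq` is illustrative.

```python
import numpy as np
import pandas as pd

# Class labels of the eight training samples (male=0, female=1)
labels = pd.Series([0, 0, 0, 0, 1, 1, 1, 1], name='Person')

# Relative frequency of each class, ordered by class index
priors_freq = labels.value_counts(normalize=True).sort_index().to_numpy()

print(priors_freq)
```

With a balanced dataset such as this one, both options give the same priors. They differ as soon as the classes have unequal sample counts.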

Our `priors` array $p(C_{k})$ completes our equation. We can now classify unlabelled data. Take for example the new sample we insert into `dataset` below, with `-1` as a placeholder label.

In [11]:
dataset.loc[dataset.shape[0]] = [-1, 6, 130, 8]

Let us view our new `dataset`.

In [12]:
dataset

Unnamed: 0,Person,height,weight,foot size
0,0,6.0,180,12
1,0,5.92,190,11
2,0,5.58,170,12
3,0,5.92,165,10
4,1,5.0,100,6
5,1,5.5,150,8
6,1,5.42,130,7
7,1,5.75,150,9
8,-1,6.0,130,8


Let us get our first likelihood value, starting with `height`.

In [13]:
x_1 = likelihood(feature=dataset.loc[8]['height'],
                 mean=np.array([mean_male['height'], mean_female['height']]),
                 variance=np.array([var_male['height'], var_female['height']]))

This completes our $p(x=x_{1} | C_{k})$. Let us take a look at what we got.

In [14]:
print(x_1)

[1.57888318 0.22345873]


Oh, dear lord. Those are not probabilities; one of them is even greater than 1. Recall that our Gaussian-infused $p(x_{i} | C_{k})$ equation gives a probability density, not a probability. So, we are still on the right track. To get a proper probability distribution over the classes, we only need to **normalize** the product of the prior and the likelihoods. This is where our previously-ignored `evidence` comes into use. Take note that the `evidence` may be computed as follows,

$$ evidence = \sum_{k = 1}^{K} \Bigg( p(C_{k}) \prod_{i = 1}^{n} p(x_{i} | C_{k}) \Bigg) $$

Concretely, the `evidence` may be computed as follows in this case,

$$ evidence = p(male)\ p(height|male)\ p(weight|male)\ p(foot\ size|male)\ +\ p(female)\ p(height|female)\ p(weight|female)\ p(foot\ size|female) $$

In other words, the `evidence` is the sum of the joint probabilities $p(C_{k}, x)$ over all classes.

But for now, let us move on to the computation of likelihood values for the next features.

In [15]:
# weight feature
x_2 = likelihood(feature=dataset.loc[8]['weight'],
                 mean=np.array([mean_male['weight'], mean_female['weight']]),
                 variance=np.array([var_male['weight'], var_female['weight']]))

# foot size feature
x_3 = likelihood(feature=dataset.loc[8]['foot size'],
                 mean=np.array([mean_male['foot size'], mean_female['foot size']]),
                 variance=np.array([var_male['foot size'], var_female['foot size']]))

Now that we have all the likelihood values and our prior probabilities, the variables in our equation are now complete.

$$ prediction = prior \times x_{1} \times x_{2} \times x_{3} $$ 

The above equation is equivalent to the one we defined earlier, i.e. $p(C_{k}) \prod_{i=1}^{n} p(x_{i}|C_{k})$.

In [16]:
prediction = priors * x_1 * x_2 * x_3

In [17]:
prediction

array([6.19707184e-09, 5.37790918e-04])

Now, we shall normalize the predictions using the equation for `evidence` we have defined above, which may be simply implemented as follows.

In [18]:
prediction = prediction / np.sum(prediction)

In [19]:
prediction

array([1.15230663e-05, 9.99988477e-01])
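Written out with the unnormalized values printed earlier, the normalization amounts to the following (figures rounded):

$$ p(male | x) = \dfrac{6.19707184 \times 10^{-9}}{6.19707184 \times 10^{-9} + 5.37790918 \times 10^{-4}} \approx 1.1523 \times 10^{-5} $$

$$ p(female | x) = \dfrac{5.37790918 \times 10^{-4}}{6.19707184 \times 10^{-9} + 5.37790918 \times 10^{-4}} \approx 0.99998848 $$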

If you cannot believe that the `prediction` values now sum up to 1, i.e. a probability distribution, let us get its sum.

In [20]:
np.sum(prediction)

1.0

See, it adds up to 1. Now, to get our predicted class, let us now use the $argmax$ function.

In [21]:
predicted_class = np.argmax(prediction)

In [22]:
print(predicted_class)

1


Recall that the index `1` refers to `female`, hence the predicted class is `female`.
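The steps of this notebook can be gathered into a compact sketch using plain NumPy arrays. The helper names `gaussian_pdf`, `fit` and `predict_proba` are illustrative and not part of any library. The sketch assumes sample variance (`ddof=1`, matching the pandas default used above) and frequency-based priors, which equal 0.5 here.

```python
import numpy as np

def gaussian_pdf(x, mean, variance):
    # Gaussian density, evaluated elementwise with broadcasting
    return np.exp(-(x - mean) ** 2 / (2 * variance)) / np.sqrt(2 * np.pi * variance)

def fit(X, y):
    # Per-class priors, means and sample variances (ddof=1)
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    variances = np.array([X[y == c].var(axis=0, ddof=1) for c in classes])
    return classes, priors, means, variances

def predict_proba(x, priors, means, variances):
    # prior * product of per-feature densities, then divide by the evidence
    joint = priors * gaussian_pdf(x, means, variances).prod(axis=1)
    return joint / joint.sum()

# Training data from the table above: height, weight, foot size
X = np.array([[6.0, 180, 12], [5.92, 190, 11], [5.58, 170, 12], [5.92, 165, 10],
              [5.0, 100, 6], [5.5, 150, 8], [5.42, 130, 7], [5.75, 150, 9]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])   # male=0, female=1

classes, priors, means, variances = fit(X, y)
posterior = predict_proba(np.array([6.0, 130, 8]), priors, means, variances)
predicted = classes[np.argmax(posterior)]

print(posterior)
print(predicted)
```

Since the computation follows the same steps as the cells above, the posterior should agree with the normalized `prediction` array up to floating-point rounding.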

That completes our Gaussian Naive Bayes classifier from scratch.

#### References

[Naive Bayes classifiers in TensorFlow](https://nicolovaligi.com/naive-bayes-tensorflow.html) by Nicolò Valigi.

[Naive Bayes classifier](https://en.wikipedia.org/wiki/Naive_Bayes_classifier) in Wikipedia.