# Naive Bayes - Part 1

## Overview
- [1. Naive Bayes Rule](#1)
- [2. Gaussian Naive Bayes](#2)
- [3. Gaussian Naive Bayes Model from Scratch](#3)
- [4. Gaussian Naive Bayes Model in `sklearn`](#4)
- [5. References](#5)

<a name='1' ></a>
## 1. Naive Bayes Rule
Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes’ theorem with the “naive” assumption of conditional independence between every pair of features given the value of the class variable. Given the class variable $y = c$ and the dependent feature vector $\mathbf{x} = \begin{bmatrix} x_1 & x_2 & \cdots & x_n \end{bmatrix}$, Bayes’ theorem states the following relationship:

$$P(y = c \mid \mathbf{x}) = \frac{P(y = c) P(\mathbf{x} \mid y = c)}{P(\mathbf{x})}$$

where

- $P(y = c \mid \mathbf{x})$ is *the posterior probability*
- $P(y = c)$ is *the prior probability*
- $P(\mathbf{x} \mid y = c)$ is *the likelihood*
- $P(\mathbf{x})$ is *the evidence*
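
To make the four terms concrete, here is a minimal sketch that evaluates Bayes' theorem for a hypothetical two-class problem. The class names and all probabilities are made up for illustration; the evidence is obtained by summing prior times likelihood over the classes.

```python
# Hypothetical two-class example: every number below is made up for illustration.
priors = {"spam": 0.3, "ham": 0.7}        # P(y = c)
likelihoods = {"spam": 0.8, "ham": 0.1}   # P(x | y = c) for one observed x

# Evidence P(x), via the law of total probability.
evidence = sum(priors[c] * likelihoods[c] for c in priors)

# Posterior P(y = c | x) for every class.
posteriors = {c: priors[c] * likelihoods[c] / evidence for c in priors}
print(posteriors)
```

The posteriors sum to one because the evidence acts as the normalizing constant.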

To simplify the notation, we will now write $P(y)$ instead of $P(y = c)$. In addition, by the chain rule of probability, we have:

\begin{align*}
    P(y) P(\mathbf{x} \mid y) &= P(\mathbf{x}, y) \\
    &= P(x_1, x_2, \dots, x_n, y) \\
    &= P(x_1 \mid x_2, \dots, x_n, y) P(x_2, \dots, x_n, y) \\
    &= P(x_1 \mid x_2, \dots, x_n, y) \underbrace{P(x_2 \mid x_3, \dots, x_n, y) P(x_3, \dots, x_n, y)}_{P(x_2, \dots, x_n, y)}  \\
    &= \cdots \\
    &= P(x_1 \mid x_2, \dots, x_n, y) P(x_2 \mid x_3, \dots, x_n, y) \cdots P(x_{n-1} \mid x_n, y) P(x_n \mid y) P(y)
\end{align*}

Using the *naive* conditional independence assumption that

$$P(x_i \mid x_{i+1}, \dots, x_n, y) = P(x_i \mid y),$$

every factor in the product above reduces to $P(x_i \mid y)$, so

$$
\begin{aligned}
P(y) P(\mathbf{x} \mid y) &= P(y) \prod_{i=1}^{n} P(x_i \mid y) \\
\Rightarrow P(y \mid x_1, \dots, x_n) &= \frac{P(y) \prod_{i=1}^{n} P(x_i \mid y)}{P(x_1, \dots, x_n)}
\end{aligned}
$$

Since $P(\mathbf{x}) = P(x_1, \dots, x_n)$ is constant given the input, we can use the following classification rule:

\begin{align}\begin{aligned}P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y)\\
\Downarrow\\\hat{y} = \arg\max_y P(y) \prod_{i=1}^{n} P(x_i \mid y),\end{aligned}\end{align}

and we can use **Maximum A Posteriori (MAP)** estimation to estimate $P(y)$ and $P(x_i \mid y)$; the former is then the relative frequency of class $y$ in the training set.
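
The classification rule can be sketched in a few lines. This is an illustration under stated assumptions, not a full classifier: the labels and the conditional probabilities are made up, and `nb_predict` is a hypothetical helper name. It works with sums of logarithms rather than the raw product, since the product of many small probabilities can underflow while the arg max is unchanged.

```python
import math
from collections import Counter

def nb_predict(priors, cond_probs):
    """Return the class maximizing log P(y) + sum_i log P(x_i | y).

    priors:     dict mapping class -> P(y)
    cond_probs: dict mapping class -> list of P(x_i | y) for the observed features
    """
    scores = {
        c: math.log(priors[c]) + sum(math.log(p) for p in cond_probs[c])
        for c in priors
    }
    return max(scores, key=scores.get), scores

# Priors estimated as relative class frequencies in a toy label list.
labels = ["a", "b", "a", "b"]
priors = {c: n / len(labels) for c, n in Counter(labels).items()}

# Made-up values of P(x_i | y) for one observation with three features.
cond_probs = {"a": [0.9, 0.2, 0.5], "b": [0.4, 0.6, 0.5]}

label, scores = nb_predict(priors, cond_probs)
print(label)
```

Here class `a` scores $0.5 \cdot 0.9 \cdot 0.2 \cdot 0.5 = 0.045$ and class `b` scores $0.5 \cdot 0.4 \cdot 0.6 \cdot 0.5 = 0.06$, so `b` is selected.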

The different naive Bayes classifiers differ mainly by the assumptions they make regarding the distribution of $P(x_i | y)$.

In spite of their apparently over-simplified assumptions, naive Bayes classifiers have worked quite well in many real-world situations, famously document classification and spam filtering. They require a small amount of training data to estimate the necessary parameters.

Naive Bayes learners and classifiers can be extremely fast compared to more sophisticated methods. The decoupling of the class conditional feature distributions means that each distribution can be independently estimated as a one dimensional distribution. This in turn helps to alleviate problems stemming from the curse of dimensionality.
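
As a rough illustration of why the decoupling helps, consider Gaussian class-conditional distributions (the case treated in the next section). The sketch below counts the parameters that must be estimated per class with and without the naive assumption; `gaussian_param_counts` is a hypothetical helper introduced only for this comparison.

```python
def gaussian_param_counts(n_features):
    """Per-class parameter counts for a Gaussian class-conditional model."""
    # Naive assumption: one mean and one variance per feature.
    naive = 2 * n_features
    # Full multivariate Gaussian: n means plus a symmetric n x n covariance matrix.
    full = n_features + n_features * (n_features + 1) // 2
    return naive, full

for n in (4, 100, 1000):
    print(n, gaussian_param_counts(n))
```

The naive model grows linearly with the number of features, while the full-covariance model grows quadratically.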

<a name='2' ></a>
## 2. Gaussian Naive Bayes
**Gaussian Naive Bayes** is the naive Bayes classifier in which the likelihood of each feature, given the class, is assumed to be Gaussian:

$$P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma^2_y}} \exp\left(-\frac{(x_i - \mu_y)^2}{2\sigma^2_y}\right)$$
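
The density above translates directly into code. The sketch below is only a sanity check: `gaussian_pdf` is a hypothetical helper, the input values are illustrative, and the result is compared with `statistics.NormalDist` from the Python standard library.

```python
import math
from statistics import NormalDist

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x, written as in the formula above."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

# Illustrative values only.
x, mu, sigma = 5.1, 5.0, 0.35
print(gaussian_pdf(x, mu, sigma))
print(NormalDist(mu, sigma).pdf(x))
```

Note that a density can exceed 1 when $\sigma_y$ is small; it is a likelihood value, not a probability.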

To estimate the parameters, use **Maximum Likelihood Estimation (MLE)**. Let $m$ be the number of training examples of class $y$, and let $x_i^{(j)}$ be the $i^{th}$ feature of the $j^{th}$ of these examples *(Note: $L(\mu_y, \sigma_y)$ is shorthand for $L(\mu_{iy}, \sigma_{iy})$ below)*

$$L(\mu_y, \sigma_y) = \prod_{j=1}^{m}P(x_i^{(j)} \mid \mu_y, \sigma_y^2) = \prod_{j=1}^{m}  \frac{1}{\sqrt{2\pi}\sigma_y} e^{-\frac{(x_i^{(j)} - \mu_y)^2}{2\sigma_y^2}}$$

Take the logarithm of $L$:

\begin{align*}
    l(\mu_y, \sigma_y) &= \log L(\mu_y, \sigma_y) \\
    &= \log \left( \prod_{j=1}^{m}  \frac{1}{\sqrt{2\pi}\sigma_y} e^{-\frac{(x_i^{(j)} - \mu_y)^2}{2\sigma_y^2}} \right)\\
    &= \sum_{j=1}^{m} \left[ -\log \sigma_y - \frac12 \log 2\pi - \frac{1}{2\sigma_y^2}(x_i^{(j)} - \mu_y)^2  \right] \\
    &= -m\log \sigma_y - \frac{m}{2} \log 2\pi - \frac{1}{2\sigma_y^2} \sum_{j=1}^{m}(x_i^{(j)} - \mu_y)^2
\end{align*}

Set the partial derivatives to zero:

\begin{align*}
\frac{\partial l}{\partial \mu_y} &= \frac{1}{\sigma_y^2} \sum_{j=1}^m (x_i^{(j)} - \mu_y) = 0 \\
\frac{\partial l}{\partial \sigma_y} &= -\frac{m}{\sigma_y} +  \frac{1}{\sigma_y^3} \sum_{j=1}^m (x_i^{(j)} - \mu_y)^2 = 0
\end{align*}

Solving these two equations gives:

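\begin{align*}
\hat{\mu}_y &= \bar x_i = \frac{1}{m}\sum_{j=1}^m x_i^{(j)} = EX_{iy} \\
\hat{\sigma}_y &= \sqrt{\frac{1}{m}\sum_{j=1}^m (x_i^{(j)} - \bar x_i)^2} = \sqrt{DX_{iy}}
\end{align*}

Here $EX_{iy}$ and $DX_{iy}$ denote the sample mean and the (biased) sample variance of feature $i$ over the examples of class $y$.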
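
The closed-form estimates can be checked numerically. The sketch below uses synthetic data (the seed, mean, and scale are arbitrary assumptions) and verifies two things: both partial derivatives of the log-likelihood vanish at the estimates, and the estimates coincide with NumPy's `mean` and `std`, whose default `ddof=0` divides by $m$ as the MLE does.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic values of one feature for the examples of one class.
x = rng.normal(loc=5.0, scale=0.4, size=50)
m = x.size

# Closed-form MLE from the derivation above.
mu_hat = x.sum() / m
sigma_hat = np.sqrt(((x - mu_hat) ** 2).sum() / m)

# Partial derivatives of the log-likelihood, evaluated at the estimates.
dl_dmu = (x - mu_hat).sum() / sigma_hat**2
dl_dsigma = -m / sigma_hat + ((x - mu_hat) ** 2).sum() / sigma_hat**3

print(np.isclose(mu_hat, x.mean()), np.isclose(sigma_hat, x.std()))
print(np.isclose(dl_dmu, 0.0, atol=1e-9), np.isclose(dl_dsigma, 0.0, atol=1e-9))
```

Passing `ddof=1` to `std` would give the unbiased sample estimate instead, which is not the maximum likelihood solution.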

<a name='3' ></a>
## 3. Gaussian Naive Bayes Model from Scratch

### Import package

In [1]:
# Import package
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn import datasets

### Overview dataset

In [2]:
# Overview dataset
X, y = datasets.load_iris(return_X_y=True, as_frame=True)
X, y

(     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
 0                  5.1               3.5                1.4               0.2
 1                  4.9               3.0                1.4               0.2
 2                  4.7               3.2                1.3               0.2
 3                  4.6               3.1                1.5               0.2
 4                  5.0               3.6                1.4               0.2
 ..                 ...               ...                ...               ...
 145                6.7               3.0                5.2               2.3
 146                6.3               2.5                5.0               1.9
 147                6.5               3.0                5.2               2.0
 148                6.2               3.4                5.4               2.3
 149                5.9               3.0                5.1               1.8
 
 [150 rows x 4 columns],
 0      0
 1      0
 2   

In [3]:
X, y = np.array(X), np.array(y)

### Build model

In [13]:
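from scipy import stats

class GaussianNaiveBayes():
    """Gaussian Naive Bayes built from scratch."""

    def __init__(self):
        self.classes = None
        self.prior_list = []
        self.mean_list = []
        self.std_list = []

    def fit(self, X, y):
        """
        Estimate the prior and the per-feature mean/standard deviation of each class.

        Args:
            X: training examples, array of shape (m, n)
            y: labels corresponding to X, array of shape (m,)
        """
        # Reset so that calling fit twice does not accumulate stale parameters
        self.classes = np.unique(y)
        self.prior_list, self.mean_list, self.std_list = [], [], []

        for label in self.classes:
            X_c = X[y == label]
            self.prior_list.append(len(X_c) / len(X))  # relative class frequency
            self.mean_list.append(X_c.mean(axis=0))
            self.std_list.append(X_c.std(axis=0))      # ddof=0, i.e. the MLE

    def predict(self, x):
        """Predict the label of a single input x."""
        scores = []
        for prior, mean, std in zip(self.prior_list, self.mean_list, self.std_list):
            # A 1-D `cov` is read as the diagonal of the covariance matrix,
            # so it must hold variances (std**2), not standard deviations.
            log_likelihood = stats.multivariate_normal.logpdf(x, mean=mean, cov=std**2)
            scores.append(np.log(prior) + log_likelihood)
        return self.classes[np.argmax(scores)]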

### Training and Testing

In [14]:
# Training model
gnb = GaussianNaiveBayes()
gnb.fit(X, y)

In [15]:
# Testing
x1 = [6.4, 3.1, 5.5, 1.8]
x2 = [4.9, 3. , 1.4, 0.2]
X_test = [x1, x2]

y_test = []
for x in X_test:
    y_test.append(gnb.predict(x))
    
print(f'Label: {np.array(y_test)}')

Label: [2 0]


In [16]:
gnb.std_list, gnb.mean_list

([array([0.34894699, 0.37525458, 0.17191859, 0.10432641]),
  array([0.51098337, 0.31064449, 0.46518813, 0.19576517]),
  array([0.62948868, 0.31925538, 0.54634787, 0.27188968])],
 [array([5.006, 3.428, 1.462, 0.246]),
  array([5.936, 2.77 , 4.26 , 1.326]),
  array([6.588, 2.974, 5.552, 2.026])])

<a name='4' ></a>
## 4. Gaussian Naive Bayes Model in `sklearn`

In [17]:
from sklearn.naive_bayes import GaussianNB

# Training model
clf = GaussianNB()
clf.fit(X, y)

# Testing
X_test = [x1, x2]
print(f'Label: {clf.predict(X_test)}')

Label: [2 0]


In [18]:
np.sqrt(clf.var_), clf.theta_

(array([[0.34894699, 0.37525458, 0.17191859, 0.10432643],
        [0.51098337, 0.3106445 , 0.46518814, 0.19576517],
        [0.62948868, 0.31925539, 0.54634788, 0.27188969]]),
 array([[5.006, 3.428, 1.462, 0.246],
        [5.936, 2.77 , 4.26 , 1.326],
        [6.588, 2.974, 5.552, 2.026]]))

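**Nice! Our model and sklearn's model predict the same labels and learn the same per-class means and standard deviations!** The tiny differences in the last decimal places arise because `GaussianNB` adds a small smoothing term (controlled by `var_smoothing`) to the variances for numerical stability.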

<a name='5' ></a>
## 5. References
- [https://en.wikipedia.org/wiki/Naive_Bayes_classifier](https://en.wikipedia.org/wiki/Naive_Bayes_classifier)
- [https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Degenerate_case](https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Degenerate_case)
- [https://machinelearningcoban.com/2017/08/08/nbc/](https://machinelearningcoban.com/2017/08/08/nbc/)
- [https://scikit-learn.org/stable/modules/naive_bayes.html#](https://scikit-learn.org/stable/modules/naive_bayes.html#)
- [https://www.python-engineer.com/courses/mlfromscratch/05_naivebayes/](https://www.python-engineer.com/courses/mlfromscratch/05_naivebayes/)
- [https://phamdinhkhanh.github.io/deepai-book/ch_ml/NaiveBayes.html](https://phamdinhkhanh.github.io/deepai-book/ch_ml/NaiveBayes.html)