# Naive Bayes Algorithm
Diego de Almeida Miranda

2023 July 

## The naive Bayes probabilistic model


The naive Bayes classifier is based on Bayes' theorem, which can be written as


$P(y|X) = \frac{P(y)P(X|y)}{P(X)}$


which gives the probability that class $y$ occurs given the observed events/attributes $X$. The model makes naive assumptions about the data: the features are independent of each other given the class, and every feature contributes equally to the target $y$. We can rewrite the theorem as


$\text{posterior} = \frac{\text{prior}\times \text{likelihood}}{\text{evidence}}$


Our evidence is defined by the set of features $X = (x_1, x_2, ..., x_n)$, which gives us


$P(y|x_1, x_2, ...,x_n) = \frac{P(y)P(x_1, x_2, ..., x_n|y)}{P(x_1, x_2, ..., x_n)}$


In practice we are only interested in the numerator of that fraction, since the denominator does not depend on $y$ and the values of the features $x_i$ are given, so the denominator is effectively constant. The numerator is equivalent to the joint probability model:


$P(y, x_1, x_2, ..., x_n)$


So, using repeated applications of the definition of conditional probability (the chain rule):

$P(y, x_1, x_2, x_3, ..., x_n) = P(y)P(x_1, x_2, x_3, ..., x_n|y)$

$P(y, x_1, x_2, x_3, ..., x_n) = P(y)P(x_1|y)P(x_2, x_3, ..., x_n|y, x_1)$

$P(y, x_1, x_2, x_3, ..., x_n) = P(y)P(x_1|y)P(x_2|y, x_1)P(x_3, ..., x_n|y, x_1, x_2)$

$...$

$P(y, x_1, x_2, x_3, ..., x_n) = P(y)P(x_1|y)P(x_2|y, x_1)P(x_3|y, x_1, x_2)...P(x_n|y, x_1, x_2, x_3, ..., x_{n-1})$

Now assume that each feature $x_i$ is conditionally independent of every other feature $x_j$, for $i\neq j$, given the class $y$. This means

$P(x_i|y, x_j) = P(x_i|y)$

So the joint model can be expressed as 

$P(y, x_1, x_2, ..., x_n) = P(y)P(x_1|y)P(x_2|y)...P(x_n|y)$

$P(y, x_1, x_2, ..., x_n) = P(y) \prod_{i=1}^nP(x_i|y)$

So the conditional distribution over the class variable $y$ can be expressed like this:

$P(y|x_1, x_2, ..., x_n) = \frac{1}{Z}P(y)\prod_{i=1}^nP(x_i|y)$

where $Z = P(x_1, x_2, ..., x_n)$ is the evidence. It is constant once the values of the feature variables are known, so it can be ignored when comparing classes.
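
To make the role of $Z$ concrete, here is a small numerical sketch. The priors and per-feature likelihoods are made-up numbers for a two-class, two-feature problem, not values from any dataset. It shows that dividing by $Z$ rescales the scores so they sum to one, without changing which class scores highest.

```python
import numpy as np

# Hypothetical numbers for a two-class problem with two features.
priors = np.array([0.6, 0.4])  # P(y) for classes 0 and 1

# likelihoods[c, i] = P(x_i = observed value | y = c), assumed values
likelihoods = np.array([[0.2, 0.5],
                        [0.7, 0.9]])

# Numerator of Bayes' theorem: P(y) * prod_i P(x_i | y)
unnormalized = priors * likelihoods.prod(axis=1)

# Z is the sum of the numerators over all classes
Z = unnormalized.sum()
posterior = unnormalized / Z

print(unnormalized)  # one score per class
print(posterior)     # same ranking, but sums to 1
```

Because `np.argmax(unnormalized)` and `np.argmax(posterior)` always agree, a classifier that only needs the most probable class can skip the computation of $Z$.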

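Taking the logarithm of both sides turns the product into a sum, and since the logarithm is monotonic this does not change which class has the highest posterior:

$\log P(y|x_1, x_2, ..., x_n) = \log P(y) + \sum_{i=1}^n\log P(x_i|y) - \log Z$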
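
Working with sums of logarithms instead of products of probabilities is also a numerical necessity. The sketch below uses 2000 made-up likelihood values of 0.01 each (an illustrative assumption) to show that the direct product underflows in 64-bit floating point while the sum of logs stays finite.

```python
import numpy as np

# 2000 hypothetical per-feature likelihoods, each equal to 0.01
probs = np.full(2000, 0.01)

# The true product is 1e-4000, far below the smallest float64 value
direct_product = np.prod(probs)

# The sum of logs is an ordinary negative number: 2000 * log(0.01)
log_sum = np.sum(np.log(probs))

print(direct_product)
print(log_sum)
```

With the direct product every class would end up with a score of zero and the comparison between classes would be meaningless, whereas the log scores remain comparable.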






## Parameter Estimation

To estimate the parameters for a feature's distribution, one must assume a distribution or generate nonparametric models for the features from the training set. If one is dealing with continuous data, a typical assumption is that the continuous values associated with each class are distributed according to a Gaussian distribution.

For example, suppose the training data contains a continuous attribute, $x_i$. We first segment the data by class, and then compute the mean and variance of $x_i$ in each class. Let $\mu_y$ be the mean of the values of $x_i$ associated with class $y$, and let $\sigma_y^2$ be the variance of the values of $x_i$ associated with class $y$. Then the probability density of some value given a class, $P(x_i|y)$, can be computed by plugging $x_i$ into the equation for a Gaussian distribution parameterized by $\mu_y$ and $\sigma_y^2$. That is

$P(x_i|y) = \frac{1}{\sqrt{2\pi\sigma^2_y}}\exp\left(-\frac{(x_i-\mu_y)^2}{2\sigma^2_y}\right)$



The Gaussian density can be implemented as follows:

In [None]:
import numpy as np

def gaussian_pdf(x, mean, var):
    # Gaussian density with the given mean and variance, evaluated at x
    numerator = np.exp(-((x - mean) ** 2) / (2 * var))
    denominator = np.sqrt(2 * np.pi * var)
    return numerator / denominator
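
A quick sanity check of the formula, written as a self-contained sketch: it assumes a standalone function (no `self` argument) that implements the same expression. At the mean of a standard normal the density should equal $1/\sqrt{2\pi}$, and NumPy broadcasting lets the same function evaluate several features at once.

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Density of a Gaussian with the given mean and variance, evaluated at x."""
    return np.exp(-((x - mean) ** 2) / (2 * var)) / np.sqrt(2 * np.pi * var)

# At the mean of a standard normal, the density equals 1 / sqrt(2 * pi)
peak = gaussian_pdf(0.0, 0.0, 1.0)

# Broadcasting: one (mean, var) pair per feature, evaluated in a single call
x = np.array([1.0, 2.0])
densities = gaussian_pdf(x, mean=np.array([1.0, 0.0]), var=np.array([4.0, 1.0]))

print(peak)
print(densities)
```

Note that these are density values, not probabilities, so they can exceed 1 when the variance is small.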

## Implementation



##### Train
-   We segment the data by class, and then compute the mean and variance of each feature $x_i$ in each class, together with the prior of each class.
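
The training step can be sketched on a tiny array. The data below is purely illustrative, and the variable names (`means`, `variances`, `priors`) are choices made for this example only.

```python
import numpy as np

# Toy data: 4 samples, 2 features, 2 classes (illustrative values only)
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [10.0, 20.0],
              [14.0, 24.0]])
y = np.array([0, 0, 1, 1])

classes = np.unique(y)

# Boolean masks select the rows of each class; statistics are taken per feature
means = np.array([X[y == c].mean(axis=0) for c in classes])
variances = np.array([X[y == c].var(axis=0) for c in classes])

# Prior of a class = fraction of training samples that belong to it
priors = np.array([np.mean(y == c) for c in classes])

print(means)
print(variances)
print(priors)
```

`ndarray.var` uses the population variance (`ddof=0`) by default. A feature that is constant within a class gets a variance of zero, which would cause a division by zero in the Gaussian density, so practical implementations usually add a small constant to the variances.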

##### Prediction
-   Calculate the log-posterior (up to a constant) of each class using the Gaussian distribution
-   Choose the most probable class, using the _maximum a posteriori_ decision rule
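
The two prediction steps can be sketched for a single sample as follows. The fitted parameters are hypothetical values chosen for illustration, and `log_gaussian` is a helper introduced here that evaluates the logarithm of the Gaussian density directly.

```python
import numpy as np

def log_gaussian(x, mean, var):
    """Log of the Gaussian density, computed without exponentiating."""
    return -0.5 * np.log(2 * np.pi * var) - (x - mean) ** 2 / (2 * var)

# Hypothetical fitted parameters: rows are classes, columns are features
classes = np.array([0, 1])
means = np.array([[2.0, 3.0], [12.0, 22.0]])
variances = np.array([[1.0, 1.0], [4.0, 4.0]])
priors = np.array([0.5, 0.5])

x = np.array([11.0, 21.0])  # a new sample to classify

# log P(y) + sum_i log P(x_i | y), one score per class
scores = np.log(priors) + log_gaussian(x, means, variances).sum(axis=1)

# Maximum a posteriori decision: pick the class with the highest score
predicted = classes[np.argmax(scores)]
print(predicted)
```

Computing the log-density directly, rather than taking the log of the density, avoids `log(0)` when a sample lies far from a class mean and the density underflows.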

#### Implementation

In [3]:
import numpy as np

class NaiveBayes:

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self._classes = np.unique(y)
        n_classes = len(self._classes)

        # Per-class mean and variance (one row per class) and class priors
        self._mean = np.zeros((n_classes, n_features), dtype=np.float64)
        self._var = np.zeros((n_classes, n_features), dtype=np.float64)
        self._priors = np.zeros(n_classes, dtype=np.float64)

        for idx, c in enumerate(self._classes):
            X_c = X[y == c]  # samples that belong to class c
            self._mean[idx, :] = X_c.mean(axis=0)
            self._var[idx, :] = X_c.var(axis=0)
            # Prior = fraction of training samples in class c
            self._priors[idx] = X_c.shape[0] / float(n_samples)


    def predict(self, X):
        y_pred = [self._predict(x) for x in X]
        return np.array(y_pred)
    
    def _predict(self, x):
        posteriors = []

        # Log-posterior (up to a constant) of each class for a single sample
        for idx, c in enumerate(self._classes):
            prior = np.log(self._priors[idx])
            posterior = np.sum(np.log(self._pdf(idx, x)))
            posterior = posterior + prior
            posteriors.append(posterior)

        # Maximum a posteriori decision: class with the highest score
        return self._classes[np.argmax(posteriors)]

    # Gaussian probability density function of class idx, per feature
    def _pdf(self, idx, x):
        mean = self._mean[idx]
        var = self._var[idx]
        numerator = np.exp(-((x - mean) ** 2) / (2 * var))
        denominator = np.sqrt(2 * np.pi * var)

        return numerator / denominator
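
As an end-to-end illustration, here is a compact, vectorized sketch of the same algorithm applied to synthetic data. It is a variant written for this example, not the class above: the function names `fit_gaussian_nb` and `predict_gaussian_nb` are hypothetical, the two Gaussian clusters are generated artificially, and the small `eps` added to the variances is an assumed safeguard against zero variance.

```python
import numpy as np

def fit_gaussian_nb(X, y, eps=1e-9):
    """Estimate per-class means, variances, and priors."""
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # eps is an assumed safeguard that keeps every variance strictly positive
    variances = np.array([X[y == c].var(axis=0) for c in classes]) + eps
    priors = np.array([np.mean(y == c) for c in classes])
    return classes, means, variances, priors

def predict_gaussian_nb(X, classes, means, variances, priors):
    """Predict the MAP class of every row of X at once."""
    # diff has shape (n_samples, n_classes, n_features)
    diff = X[:, None, :] - means[None, :, :]
    log_lik = -0.5 * (np.log(2 * np.pi * variances)[None, :, :]
                      + diff ** 2 / variances[None, :, :])
    # Log-prior plus the sum of per-feature log-likelihoods
    scores = np.log(priors)[None, :] + log_lik.sum(axis=2)
    return classes[np.argmax(scores, axis=1)]

# Synthetic data: two well-separated clusters of 50 points each
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               rng.normal(5.0, 1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

params = fit_gaussian_nb(X, y)
y_pred = predict_gaussian_nb(X, *params)
accuracy = np.mean(y_pred == y)
print(accuracy)
```

The `NaiveBayes` class is used in the same way: create an instance, call `fit(X, y)` on the training data, then call `predict(X)` on the samples to classify. The accuracy printed here is measured on the training data itself, so it only confirms that the code runs as intended and says nothing about generalization.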