# Naïve Bayes Implementation

### Naive Bayes Classifier
https://ithelp.ithome.com.tw/articles/10297660

Let $\mathbf{x}=\{x_1, x_2,...,x_p\}$ be a feature vector that is to be assigned to one class $y_k$ from the set of classes $C=\{y_1,...,y_K\}$. Bayes' theorem gives
\begin{equation}\begin{split}
p(y_k|\mathbf{x}) &= \frac{p(\mathbf{x}|y_k)p(y_k)}{p(\mathbf{x})}\\
\text{posterior} &= \frac{\text{prior} \times \text{likelihood}}{\text{evidence}}
\end{split} \tag{1}
\end{equation}
where <br>
$\ p(y_k|\mathbf{x})$ is the posterior probability of class ($y_{k}$, target) given predictor ($\mathbf{x}$). <br>
$\ p(y_k)$ is the prior probability of class.<br>
$\ p(\mathbf{x}|y_k)$ is the likelihood which is the probability of the predictor given class.<br>
$\ p(\mathbf{x})$ is the evidence, i.e. the marginal probability of the predictor.
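
As a quick numerical illustration of Eq. (1), the sketch below computes the posterior for two classes. The class names and all probability values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
# Hypothetical numbers: two classes and one observed feature vector x.
priors = {"spam": 0.4, "ham": 0.6}          # p(y_k)
likelihoods = {"spam": 0.05, "ham": 0.01}   # p(x | y_k)

# Evidence: p(x) = sum_k p(y_k) * p(x | y_k)
evidence = sum(priors[k] * likelihoods[k] for k in priors)

# Posterior: p(y_k | x) = p(x | y_k) * p(y_k) / p(x)
posteriors = {k: priors[k] * likelihoods[k] / evidence for k in priors}

for label, value in posteriors.items():
    print(f"p({label} | x) = {value:.4f}")
```

Although "ham" has the larger prior here, the larger likelihood of "spam" dominates, so the posterior favors "spam".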

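By the chain rule of probability, the joint distribution of the class and the features factorizes as
\begin{equation}\begin{split}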
p(y_k, x_1,...,x_p) &= p(x_1,...,x_p, y_k)\\
                    &= p(x_1 | x_2,...,x_p, y_k)p(x_2,...,x_p, y_k)\\
                    &= p(x_1 | x_2,...,x_p, y_k)p(x_2 |x_3,...,x_p, y_k)p(x_3,...,x_p, y_k)\\
                    &= ....\\
                    &= p(x_1 | x_2,...,x_p, y_k)p(x_2 |x_3,...,x_p, y_k)...p(x_p | y_k)p(y_k)
\end{split} \tag{2}
\end{equation}

The "naive" conditional independence assumption states that all features in $\mathbf{x}$ are <span style="color:red">mutually independent</span>, conditional on the class $y_k$.<br>
Under this assumption,
\begin{equation}\begin{split}
p(x_i | x_{i+1},...,x_p,y_k) = p(x_i| y_k)
\end{split}. \tag{3}
\end{equation}
Thus, the joint model can be expressed as
\begin{equation}\begin{split}
p(y_k | x_1,...,x_p)  \varpropto p(y_k, x_1,...,x_p) &= p(y_k) p(x_1| y_k) p(x_2| y_k)... \\
                &= p(y_k) \prod_{i=1}^{p} p(x_i| y_k)
\end{split}  \tag{4}
\end{equation}
where $\varpropto$ denotes proportionality, since we omitted the denominator $p(\mathbf{x})$.

This means that under the above independence assumptions, the conditional distribution over the class variable $y$ is:
\begin{equation}\begin{split}
p(y_k | x_1,...,x_p) = \frac{1}{Z} p(y_k) \prod_{i=1}^{p} p(x_i| y_k)
\end{split} . \tag{5}
\end{equation}
where the evidence $Z=p(\mathbf {x} )=\sum _{k}p(y_{k})\ p(\mathbf {x} \mid y_{k})$ is a scaling factor dependent only on $x_{1},\ldots ,x_{p}$. <br>

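That is, $Z$ is a constant once the values of the feature variables are known, so it does not affect which class has the largest posterior.<br>
We can therefore classify with the Maximum A Posteriori (MAP) decision rule, which picks the most probable class. The prior $p(y_k)$ is estimated as the relative frequency of class $y_k$ in the training set, and $p(x_i|y_k)$ is estimated from the training samples of that class.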
\begin{equation}\begin{split}
\hat{y} = \arg \max_{y_k} p(y_k) \prod_{i=1}^{p} p(x_i| y_k)
\end{split} . \tag{6}
\end{equation}
The different naive Bayes classifiers differ mainly by the assumptions they make regarding the distribution of $p(x_i|y_k)$.
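
The decision rule in Eq. (6) is usually evaluated in log space, because a product of many small probabilities underflows quickly. The following is a minimal sketch under that assumption; the function name `map_class` and the probability tables are hypothetical and stand in for estimates that would normally come from training data.

```python
import math

def map_class(priors, feature_likelihoods):
    """Return the class maximizing log p(y_k) + sum_i log p(x_i | y_k).

    priors: dict mapping class label -> p(y_k)
    feature_likelihoods: dict mapping class label -> list of p(x_i | y_k)
    """
    scores = {}
    for label, prior in priors.items():
        log_likelihood = sum(math.log(p) for p in feature_likelihoods[label])
        scores[label] = math.log(prior) + log_likelihood
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical per-feature likelihoods for one observed sample with 3 features.
priors = {"a": 0.5, "b": 0.5}
feature_likelihoods = {"a": [0.2, 0.9, 0.1], "b": [0.6, 0.3, 0.4]}

best, scores = map_class(priors, feature_likelihoods)
print(best, scores)
```

Taking the logarithm does not change the argmax because it is strictly increasing, so the selected class is the same as in Eq. (6).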

There isn’t just one type of Naïve Bayes classifier. The most popular types differ based on the distributions of the feature values. Some of these include: 

- Gaussian Naïve Bayes (GaussianNB): This variant is used with continuous variables whose class-conditional distributions are assumed to be Gaussian (normal). The model is fitted by estimating the mean and standard deviation of each feature within each class. 
- Multinomial Naïve Bayes (MultinomialNB): This type of Naïve Bayes classifier assumes that the features are from multinomial distributions. This variant is useful when using discrete data, such as frequency counts, and it is typically applied within natural language processing use cases, like spam classification. 
- Bernoulli Naïve Bayes (BernoulliNB): This is another variant of the Naïve Bayes classifier, which is used with Boolean variables—that is, variables with two values, such as True and False or 1 and 0. 
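
To make the difference between the discrete variants concrete, the sketch below evaluates the class-conditional log-likelihood under a Bernoulli model and under a multinomial model. The function names and the parameter vectors `theta_bern` and `theta_multi` are hypothetical; in practice the parameters would be estimated per class from training data.

```python
import math

def bernoulli_log_likelihood(x, theta):
    """log p(x | y) for binary features.

    Each feature contributes log(theta_i) if x_i == 1 and
    log(1 - theta_i) if x_i == 0.
    """
    return sum(
        math.log(t) if xi == 1 else math.log(1.0 - t)
        for xi, t in zip(x, theta)
    )

def multinomial_log_likelihood(counts, theta):
    """log p(x | y) for count features, up to the multinomial coefficient.

    The coefficient depends only on the counts, not on the class,
    so it can be dropped when comparing classes.
    """
    return sum(c * math.log(t) for c, t in zip(counts, theta))

# Hypothetical class parameters.
theta_bern = [0.8, 0.1, 0.5]    # p(feature i is present | y)
theta_multi = [0.7, 0.2, 0.1]   # p(token i | y), sums to 1

bern = bernoulli_log_likelihood([1, 0, 1], theta_bern)
multi = multinomial_log_likelihood([3, 0, 1], theta_multi)
print(bern, multi)
```

Note that the Bernoulli model explicitly penalizes absent features through the $1-\theta_i$ term, whereas the multinomial model ignores features with a count of zero.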

### Gaussian Naive Bayes
GaussianNB implements the Gaussian Naive Bayes algorithm for classification. The likelihood of the features is assumed to be Gaussian:
\begin{equation}\begin{split}
p(x_i | y) = p(x_i| \mu_{y}, \sigma_{y}) = \frac{1}{\sqrt{2 \pi \sigma_{y}^{2}}} \exp (-\frac{(x_i- \mu_y)^2}{2 \sigma_{y}^2})
\end{split} . \tag{7}
\end{equation}
where the parameters $\sigma_{y}$ and $\mu_{y}$ are estimated using maximum likelihood.
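
Eq. (7) translates directly into a few lines of code. The helper name `gaussian_pdf` is an assumption made for this sketch.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Gaussian density of Eq. (7) with mean mu and standard deviation sigma."""
    var = sigma ** 2
    return math.exp(-((x - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

# The density peaks at the mean and is symmetric around it.
peak = gaussian_pdf(1.0, 1.0, 1.0)
left = gaussian_pdf(0.0, 1.0, 1.0)
right = gaussian_pdf(2.0, 1.0, 1.0)
print(peak, left, right)
```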

Fix a class $y = y_k$ and a single feature. With a slight change of notation, let $x_1,...,x_p$ now denote the $p$ training observations of that feature within the class, rather than the $p$ features of one sample. <br>
To find the maximum likelihood estimates (MLE) of the two parameters $\mu$ and $\sigma^2$, we define the likelihood function as
\begin{equation}\begin{split}
L(\mu, \sigma| x_1,...,x_p) &= \prod_{i=1}^{p} p(x_i|\mu, \sigma^2) \\
                            &= \prod_{i=1}^{p} \frac{1}{\sqrt{2 \pi \sigma^{2}}} \exp (-\frac{(x_i- \mu)^2}{2 \sigma^2}) \\
                            &= \frac{1}{({2 \pi \sigma^{2}})^{p/2}} \exp (- \frac{1}{2 \sigma^2} \sum_{i=1}^{p} (x_i- \mu)^2)
\end{split}  \tag{8}
\end{equation}

The log-likelihood function is 
\begin{equation}\begin{split}
l(\mu, \sigma| x_1,...,x_p) &= \ln(L(\mu, \sigma| x_1,...,x_p)) \\
                            &= -\frac{p}{2} \ln(2\pi) - \frac{p}{2}\ln{(\sigma^2)} -\frac{1}{2\sigma^2} \sum_{i=1}^{p}(x_i- \mu)^2
\end{split} \tag{9}
\end{equation}

We then solve the maximization problem
\begin{equation}\begin{split}
\max_{\mu,\sigma^2} l(\mu, \sigma| x_1,...,x_p)
\end{split}  \tag{10}
\end{equation}

The first-order conditions for a maximum are
\begin{equation}\begin{split}
\frac{\partial}{\partial \mu} l(\mu, \sigma| x_1,...,x_p) = 0 \\
\frac{\partial}{\partial \sigma^2} l(\mu, \sigma| x_1,...,x_p) = 0
\end{split} \tag{11}
\end{equation}

The partial derivative of the log-likelihood for the mean is
\begin{equation}\begin{split}
\frac{\partial}{\partial \mu} l(\mu, \sigma| x_1,...,x_p) 
&= \frac{\partial}{\partial \mu} [-\frac{p}{2} \ln(2\pi) - \frac{p}{2}\ln{(\sigma^2)} -\frac{1}{2\sigma^2} \sum_{i=1}^{p}(x_i- \mu)^2] \\
&= \frac{1}{\sigma^2} \sum_{i=1}^{p}(x_i- \mu) \\
&= \frac{1}{\sigma^2} (\sum_{i=1}^{p}x_i- p\mu)
\end{split}  \tag{12}
\end{equation}
which is equal to zero only if $(\sum_{i=1}^{p}x_i- p\mu) = 0$. <br>
Therefore, the first of the two first-order conditions implies
\begin{equation}\begin{split}
\hat{\mu} = \frac{1}{p} \sum_{i=1}^{p}x_i
\end{split} \tag{13}
\end{equation}

Similarly, setting the partial derivative with respect to $\sigma^2$ to zero gives
\begin{equation}\begin{split}
\hat{\sigma}^2 = \frac{1}{p} \sum_{i=1}^{p} (x_i-\hat{\mu})^2
\end{split} . \tag{14}\end{equation}
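
For completeness, the step behind Eq. (14) can be written out by differentiating the log-likelihood in Eq. (9) with respect to $\sigma^2$ and substituting $\hat{\mu}$ from Eq. (13):
\begin{equation}\begin{split}
\frac{\partial}{\partial \sigma^2} l(\mu, \sigma| x_1,...,x_p)
&= \frac{\partial}{\partial \sigma^2} [-\frac{p}{2} \ln(2\pi) - \frac{p}{2}\ln{(\sigma^2)} -\frac{1}{2\sigma^2} \sum_{i=1}^{p}(x_i- \hat{\mu})^2] \\
&= -\frac{p}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{i=1}^{p}(x_i- \hat{\mu})^2
\end{split}
\end{equation}
Setting this expression to zero and multiplying both sides by $2\sigma^4$ gives $p\sigma^2 = \sum_{i=1}^{p}(x_i- \hat{\mu})^2$, which is Eq. (14).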

Thus, the estimator $\hat{\mu }$ is equal to the sample mean and the estimator $\hat{\sigma}^2$ is equal to the unadjusted sample variance.
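
The two estimators can be checked numerically. The sketch below uses a small set of hypothetical observations and compares the hand-computed values against Python's `statistics` module, where `pvariance` is the unadjusted (divide by $p$) variance and `variance` is the adjusted (divide by $p-1$) one.

```python
import statistics

# Hypothetical observations of one feature within one class.
samples = [4.9, 5.1, 5.0, 5.4, 4.6]
p = len(samples)

# Eq. (13): sample mean
mu_hat = sum(samples) / p

# Eq. (14): unadjusted sample variance (divides by p, not p - 1)
var_hat = sum((x - mu_hat) ** 2 for x in samples) / p

print(mu_hat, var_hat)
print(statistics.pvariance(samples), statistics.variance(samples))
```

The MLE of the variance is always smaller than the adjusted sample variance by a factor of $(p-1)/p$.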

This Naive Bayes tutorial is broken down into 5 parts:

- Step 1: Separate By Class.
- Step 2: Summarize Dataset.
- Step 3: Summarize Data By Class.
- Step 4: Gaussian Probability Density Function.
- Step 5: Class Probabilities.
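
The five steps above can be sketched in plain Python as follows. This is a minimal illustration rather than a reference implementation: all function names and the toy data are hypothetical, and it assumes every feature has non-zero variance within each class (a practical implementation would add a small variance floor to avoid division by zero).

```python
import math
from collections import defaultdict

def separate_by_class(rows, labels):
    """Step 1: group the feature rows by class label."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    return grouped

def summarize(rows):
    """Step 2: per-feature (mean, MLE variance) for a list of rows."""
    n = len(rows)
    stats = []
    for column in zip(*rows):
        mean = sum(column) / n
        var = sum((v - mean) ** 2 for v in column) / n
        stats.append((mean, var))
    return stats

def summarize_by_class(rows, labels):
    """Step 3: class prior and per-feature summaries for every class."""
    grouped = separate_by_class(rows, labels)
    total = len(rows)
    return {
        label: (len(members) / total, summarize(members))
        for label, members in grouped.items()
    }

def gaussian_log_pdf(x, mean, var):
    """Step 4: logarithm of the Gaussian density in Eq. (7)."""
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)

def predict(model, row):
    """Step 5: pick the class with the largest log prior + log likelihood."""
    scores = {}
    for label, (prior, stats) in model.items():
        score = math.log(prior)
        for x, (mean, var) in zip(row, stats):
            score += gaussian_log_pdf(x, mean, var)
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical toy data: two well-separated classes with two features each.
rows = [[1.0, 2.1], [1.2, 1.9], [0.8, 2.0], [3.0, 4.1], [3.2, 3.9], [2.8, 4.0]]
labels = [0, 0, 0, 1, 1, 1]

model = summarize_by_class(rows, labels)
predictions = [predict(model, r) for r in rows]
print(predictions)
```

Working with log densities in Step 5 keeps the scores numerically stable when the number of features grows.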

In [4]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Load the iris data and hold out half of it for testing
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

# Fit Gaussian Naive Bayes on the training split and predict the test split
gnb = GaussianNB()
y_pred = gnb.fit(X_train, y_train).predict(X_test)
print("Number of mislabeled points out of a total %d points : %d"
      % (X_test.shape[0], (y_test != y_pred).sum()))

Number of mislabeled points out of a total 75 points : 4


In [7]:
from sklearn.metrics import confusion_matrix, accuracy_score
print(f'confusion matrix = {confusion_matrix(y_test, y_pred)}')
print(f'accuracy = {accuracy_score(y_test,y_pred)}')

confusion matrix = [[21  0  0]
 [ 0 30  0]
 [ 0  4 20]]
accuracy = 0.9466666666666667
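
The reported accuracy can be cross-checked against the confusion matrix: the diagonal holds the correctly classified samples, so accuracy is the diagonal sum divided by the total. The sketch below copies the matrix from the output above into a plain list.

```python
# Confusion matrix copied from the output above (rows: true class, columns: predicted class).
cm = [[21, 0, 0],
      [0, 30, 0],
      [0, 4, 20]]

correct = sum(cm[i][i] for i in range(len(cm)))   # diagonal entries
total = sum(sum(row) for row in cm)               # all test samples
accuracy = correct / total

print(f"{correct} of {total} correct, accuracy = {accuracy}")
```

The 4 off-diagonal entries are the same 4 mislabeled points reported by the first cell.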
