# Naive Bayes classification algorithm

## 1. Maximum Likelihood Estimation
- In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of a probability distribution by maximizing a likelihood function, so that under the assumed statistical model the observed data is most probable. 
- The idea is to turn the problem of finding the density function directly into the problem of solving for the parameters of the likelihood function (for independent and identically distributed random variables, this is the product of the univariate density functions, with the unknown parameters to be solved for).
- The general likelihood function:
$$ L(\theta) = f(x_1,...,x_n|\theta) = \prod_{i=1}^n f(x_i|\theta) $$
$$ \hat{\theta} = \underset{\theta}{\textrm{arg max}}\ \ln L(\theta) = \underset{\theta}{\textrm{arg max}} \sum_{i=1}^n \ln f(x_i|\theta) $$
, where
$$ \theta = \{\theta_1, \theta_2, ..., \theta_s\} $$
To solve for $\theta$, we take the partial derivative of the log-likelihood with respect to each parameter, set it equal to 0, and solve the resulting equations:
$$ \frac{\partial \ln L(\theta)}{\partial \theta_j} = 0, \quad j = 1, ..., s $$
- ```Why we use 'log':```
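  - The logarithm is strictly increasing, so $L(\theta)$ and $\ln L(\theta)$ reach their maximum at the same $\theta$. Taking the log turns the product into a sum, which is much easier to differentiate, and the maximum is found where the derivative (the slope) equals 0.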
- ```Examples in Naive Bayes classifier:```
There are two applications of MLE in Naive Bayes:
  1) Estimation of the prior probabilities $p(C_k)$; 2) Estimation of the conditional probabilities $p(X|C_k)$.
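
The points above can be illustrated with a minimal sketch. It assumes a made-up data set of coin flips (the values of `n` and `m`, and the function names, are hypothetical and not part of the original notes). It shows that the raw likelihood product underflows to zero for a large sample, while the log-likelihood stays usable and peaks at the same parameter value.

```python
import math

# Hypothetical data: 2000 i.i.d. coin flips, 600 of them heads (x = 1).
n, m = 2000, 600

def likelihood(p):
    # Raw product of Bernoulli densities: p^m * (1 - p)^(n - m).
    return (p ** m) * ((1 - p) ** (n - m))

def log_likelihood(p):
    # Same quantity after taking logs: the product becomes a sum.
    return m * math.log(p) + (n - m) * math.log(1 - p)

# Candidate parameter values on a grid (0.001, 0.002, ..., 0.999).
grid = [i / 1000 for i in range(1, 1000)]

raw_at_mle = likelihood(0.3)            # underflows to 0.0 in double precision
best_p = max(grid, key=log_likelihood)  # the log-likelihood is still usable

print(raw_at_mle, best_p)
```

The grid search is only for illustration; in practice the maximum is found analytically by setting the derivative to zero.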

## 2. Naive Bayes classifier
### 2.1 Overview: 
Naïve Bayes classifiers are highly scalable, requiring a number of parameters linear in the number of variables (features/predictors) in a learning problem. Maximum-likelihood training can be done by evaluating a closed-form expression, which takes linear time, rather than by expensive iterative approximation as used for many other types of classifiers.
### 2.2 Probabilistic model(in general):
$$ p(C_k | x_1, ..., x_n) $$
that is,
$$ p(C_k|X) = \frac {p(C_k)p(X|C_k)} {p(X)} $$
that is,
$$ p(C_k|X) = \frac {p(C_k)p(X|C_k)} {\sum_{j=1}^K p(C_j)p(X|C_j)} $$
  - This model means that, given a problem instance to be classified, represented by a vector $\mathbf{x} = (x_1, \ldots, x_n)$ of $n$ features (independent variables), it assigns to this instance a probability for each of the $K$ possible outcomes or classes $C_k$.
  - In plain English, the formula above means:
$$ \textrm{posterior} = \frac{\textrm{prior} \times \textrm{likelihood}}{\textrm{evidence}} $$
  - Through this conditional probability model, we have the probabilities for each outcome / class (e.g. spam and ham class). Now we need to decide which class the instance should be assigned to. The Naive Bayes classifier combines the above model with a decision rule: pick the hypothesis that is the most probable (known as the maximum a posteriori or MAP decision rule), that is,
$$ \hat{y} = \underset{k \in \{1,...,K\}}{\textrm{arg max}}\ p(C_k) \prod_{i=1}^n p(x_i|C_k) $$
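
As a rough sketch of the MAP decision rule, the snippet below scores each class with the prior times the per-feature conditional probabilities and picks the largest. The function name `map_class`, the dictionary layout, and all probability values are assumptions made up for this example.

```python
import math

def map_class(priors, cond_probs, x):
    """Return the class k maximizing p(C_k) * prod_i p(x_i | C_k).

    priors:     {class: p(C_k)}
    cond_probs: {class: [{value: p(x_i = value | C_k)} for each feature i]}
    x:          list of observed feature values
    """
    def log_score(k):
        # Sum of logs instead of a product, to avoid numerical underflow.
        return math.log(priors[k]) + sum(
            math.log(cond_probs[k][i][v]) for i, v in enumerate(x)
        )
    return max(priors, key=log_score)

# Made-up numbers for a two-class, two-feature problem.
priors = {"spam": 0.4, "ham": 0.6}
cond_probs = {
    "spam": [{"yes": 0.8, "no": 0.2}, {"yes": 0.7, "no": 0.3}],
    "ham":  [{"yes": 0.1, "no": 0.9}, {"yes": 0.4, "no": 0.6}],
}
prediction = map_class(priors, cond_probs, ["yes", "no"])
print(prediction)
```

Working with log scores does not change the winner, because the logarithm is strictly increasing.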

### 2.3  Assumption: 
Each feature is assumed to be independent of the value of any other feature, given the class variable. In the general model, the numerator $p(C_k)p(X|C_k)$ is equivalent to the joint probability:
$$ p(C_k, x_1,...,x_n) $$
which can be rewritten as follows, using the chain rule for repeated applications of the definition of conditional probability:
$$ p(C_k, x_1,...,x_n)$$ 
$$= p(x_1,...,x_n,C_k)$$
$$= p(x_1|x_2,...,x_n,C_k)p(x_2,...,x_n,C_k)$$
$$= p(x_1|x_2,...,x_n,C_k)p(x_2|x_3,...,x_n,C_k)p(x_3,...,x_n,C_k)$$
$$= ...$$
$$= p(x_1|x_2,...,x_n,C_k)p(x_2|x_3,...,x_n,C_k)...p(x_{n-1}|x_n,C_k)p(x_n|C_k)p(C_k)$$
  - This requires a lot of computation if the number of features is large, because every factor is conditioned on all of the remaining features. Also, with many features (such as in this dataset) and limited training data, the estimates of these conditional probabilities become noisy.
  - Therefore, we have a 'naive' conditional independence assumption. Under this assumption, we have:
$$ p(x_i|x_{i+1},...,x_n,C_k) = p(x_i|C_k) $$
  - And the original model can be expressed as:
$$ p(C_k|x_1,...,x_n) \propto p(C_k, x_1,...,x_n) $$
$$ \propto p(C_k)p(x_1|C_k)p(x_2|C_k) \cdots p(x_n|C_k) $$
$$ \propto p(C_k)\prod_{i=1}^n p(x_i|C_k) $$
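
To get a feel for how much the independence assumption saves, here is a small sketch that counts the free parameters of $p(X|C_k)$ with and without it. It assumes binary features (an assumption for this example only), and the function names are hypothetical.

```python
def full_joint_params(n_features, n_classes):
    # Each class needs a full table over 2^n binary feature combinations;
    # one entry is fixed because the probabilities sum to 1.
    return n_classes * (2 ** n_features - 1)

def naive_params(n_features, n_classes):
    # Each class needs only one parameter p(x_i = 1 | C_k) per feature.
    return n_classes * n_features

for n in (5, 10, 30):
    print(n, full_joint_params(n, 2), naive_params(n, 2))
```

The first count grows exponentially with the number of features, the second only linearly, which matches the scalability claim in the overview.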

### 2.4 Prior probability:
  - A class's prior may be calculated by assuming equiprobable classes (i.e., $p(C_k) = 1/K$), or by estimating the class probability from the training set (i.e., prior for a given class = number of samples in the class / total number of samples). For example, suppose we want to classify samples as male or female given features such as height, weight, and foot size. If the training set has 1000 samples, of which 300 are labeled male and 700 are labeled female, the prior probabilities are 3/10 and 7/10, respectively.
  - The prior probability is estimated in the learning phase with Maximum Likelihood Estimation. The idea is to estimate the population fraction of each class (say, male and female) from the fraction of each class in the training set. Let $n$ be the total sample size, $m$ the number of male samples, $p$ the probability that a sample is male, and $(1-p)$ the probability that it is female. The calculation proceeds as follows.
The Likelihood Function is:
    $$ L(p) = p^m * (1-p)^{(n-m)} $$
    $$ \ln L(p) = m*\ln(p) + (n-m)*\ln(1-p) $$
    $$ \frac {\partial \ln L(p)}{\partial p} = \frac {m}{p} - \frac {n-m}{1-p} $$
, let $\frac {\partial \ln L(p)}{\partial p} = 0$
    $$ \frac {m}{p} - \frac {n-m}{1-p} = 0 $$
    $$  p = \frac {m}{n} $$
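
The result $p = m/n$ means the estimated prior is just the class frequency. A minimal sketch, using the 300 male / 700 female example from the text (the function name `estimate_priors` is hypothetical):

```python
from collections import Counter

def estimate_priors(labels):
    """MLE of the class priors: count of each class / total count."""
    counts = Counter(labels)
    total = len(labels)
    return {c: counts[c] / total for c in counts}

# The 1000-sample example from the text: 300 male, 700 female.
labels = ["male"] * 300 + ["female"] * 700
priors = estimate_priors(labels)
print(priors)
```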

![alt text](https://miro.medium.com/max/1400/1*wLX7-3-x08m7TqsYSFeXfQ.png)

### 2.5 Evidence (normalizer):
In the previous analysis, we showed that
$$ p(C_k|x_1,...,x_n) = \frac{1}{Z} p(C_k) \prod_{i=1}^n p(x_i|C_k) $$
where the evidence
$$ Z = p(X) = \sum_{k=1}^K p(C_k)p(X|C_k) $$
is a scaling factor dependent only on $x_1, \ldots, x_n$, that is, a constant if the values of the feature variables are known.
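
A short sketch of the normalization step: dividing by $Z$ turns the unnormalized scores into probabilities that sum to 1, without changing which class scores highest. The function name and all numbers below are made up for illustration.

```python
def posteriors(priors, likelihoods):
    """Normalize p(C_k) * p(X | C_k) by the evidence Z (sum over all classes)."""
    unnormalized = {k: priors[k] * likelihoods[k] for k in priors}
    z = sum(unnormalized.values())  # evidence p(X)
    return {k: v / z for k, v in unnormalized.items()}, z

# Made-up values of p(C_k) and p(X | C_k) for one observed X.
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {"spam": 0.24, "ham": 0.06}
post, z = posteriors(priors, likelihoods)
print(post, z)
```

Because $Z$ is the same for every class, it can be skipped when only the predicted class is needed.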

### 2.6 Likelihood:
  - The probability distribution of the given data can be quite different (the following three types of Naive Bayes assume different probability distributions), but the basic idea is to estimate the parameters for a feature's distribution. One must assume a distribution or generate nonparametric models for the features from the training set.
  - The assumptions on distributions of features are called the "event model" of the naïve Bayes classifier. For discrete features like the ones encountered in document classification (including spam filtering), ```multinomial``` and ```Bernoulli``` distributions are popular. These assumptions lead to two distinct models, which are often confused.
  - Although we assume the models (distributions of features), we still need the parameters of the models in order to use them. Since the true probability density function of the features is rarely known (because the sample is limited, or there are too many features), we use the training data to fit the model and obtain the parameters that, under the assumed statistical model, make the observed data most probable. This is where MLE is applied.

### 2.7 Gaussian Naive Bayes:
  - Typical assumption: the continuous values associated with each class are distributed according to a normal (Gaussian) distribution. For example, suppose we have collected some observation value $v$. Then, the probability distribution of $v$ given a class $C_{k}$, $p(x=v\mid C_{k})$, can be computed by plugging $v$ into the equation for a normal distribution parameterized by $\mu _{k}$ and $\sigma _{k}^{2}$. That is,
$$ p(x=v|C_k) = \frac{1}{\sqrt{2\pi\sigma_k^2}} e^{-\frac{(v-\mu_k)^2}{2\sigma_k^2}} $$
  ![alt text](https://miro.medium.com/max/1400/1*YeSlHB90J4rviMiG0Su6tA.png)
    - One example is classifying children and adults by their height, weight, and foot size.
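
A minimal sketch of the Gaussian event model for a single feature: fit $\mu_k$ and $\sigma_k^2$ per class by maximum likelihood, then plug a new value $v$ into the normal density. The height values and the function names are hypothetical.

```python
import math

def gaussian_pdf(v, mu, var):
    """Normal density with mean mu and variance var, evaluated at v."""
    return math.exp(-((v - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_gaussian(values):
    """MLE of mean and variance (divides by n, not n - 1)."""
    n = len(values)
    mu = sum(values) / n
    var = sum((v - mu) ** 2 for v in values) / n
    return mu, var

# Hypothetical heights in cm for two classes.
heights = {"child": [110.0, 120.0, 130.0], "adult": [160.0, 170.0, 180.0]}
params = {c: fit_gaussian(vs) for c, vs in heights.items()}

v = 125.0
likelihood = {c: gaussian_pdf(v, mu, var) for c, (mu, var) in params.items()}
print(params, likelihood)
```

With several features, the per-feature densities would be multiplied together (or their logs summed) under the naive independence assumption.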

### 2.8 Multinomial Naive Bayes:
  - The general model is (the same as other Naive Bayes classifiers):
  $$ p(C_k | x_1, ..., x_n) $$
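  - Also, in Multinomial Naive Bayes, where $x_i$ counts how many times event $i$ (e.g. word $i$) occurs and ${p_k}_i$ is the probability of event $i$ in class $C_k$, we have:
  $$ p(X|C_k) = \frac {(\sum_{i=1}^n x_i)!}{\prod_{i=1}^n x_i!} \prod_{i=1}^n {{p_k}_i}^{x_i} $$
  - That is (omitting the factor $\frac {(\sum_{i=1}^n x_i)!}{\prod_{i=1}^n x_i!}$, which does not depend on the class),
  $$ p(C_k|X) \propto p(C_k) \prod_{i=1}^n {{p_k}_i}^{x_i} $$
  When expressed in log-space, the MNB becomes a linear classifier:
  $$ \log p(C_k|X) = \log \left( p(C_k) \prod_{i=1}^n {{p_k}_i}^{x_i} \right) + \textrm{const} $$
  $$ = \log p(C_k) + \sum_{i=1}^n x_i \log {p_k}_i + \textrm{const} $$
  $$ = b + w_k^T x + \textrm{const} $$
  , where $b = \log p(C_k)$, ${w_k}_i = \log {p_k}_i$, and the constant does not depend on the class. The two sets of parameters in this model are the class prior $p(C_k)$, written $\pi_j$ below, and the event probabilities ${p_k}_i$, written $\theta_{ij}$ below. The corresponding likelihood of the parameters (for one training sample, up to the omitted factor) is:
  $$ L(\pi, \theta) = p(C_k) \prod_{i=1}^n {{p_k}_i}^{x_i} $$
  $$ \log L(\pi, \theta) = \log \left( p(C_k) \prod_{i=1}^n {{p_k}_i}^{x_i} \right) $$
  $$ = \log p(C_k) + \sum_{i=1}^n x_i \log {p_k}_i $$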
  - We use the training data to find the two sets of parameters that make the observed data most probable under this multinomial model, and we get:
$$ \pi_j = \frac {N_j}{N} $$
$$ p(w_i|c_j) = \theta_{ij} = \frac {\textrm{count}(w_i, c_j)}{\sum_{w \in V} \textrm{count}(w, c_j)} $$
  - Here $N_j$ is the number of training documents in class $c_j$, $N$ is the total number of training documents, and $V$ is the vocabulary. However, if a word never occurs in the training data of a specific class, its estimated probability is zero, and this zero wipes out the whole product no matter what the other words are. That's why we have Laplace (add-1) smoothing:
$$ p(w_i|c_j) = \frac {\textrm{count}(w_i, c_j)+1}{\sum_{w \in V} (\textrm{count}(w, c_j)+1)} $$
$$ = \frac {\textrm{count}(w_i, c_j)+1}{(\sum_{w \in V} \textrm{count}(w, c_j)) + |V|} $$
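
The estimates above can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not a reference one: the tiny corpus is made up, the names `fit_multinomial` and `log_score` are hypothetical, and words outside the training vocabulary are simply skipped at prediction time.

```python
import math
from collections import Counter

def fit_multinomial(docs, labels, alpha=1):
    """Estimate pi_j and smoothed theta_ij from tokenized documents."""
    vocab = sorted({w for d in docs for w in d})
    class_docs = Counter(labels)
    word_counts = {c: Counter() for c in class_docs}
    for d, c in zip(docs, labels):
        word_counts[c].update(d)
    pi = {c: class_docs[c] / len(docs) for c in class_docs}
    theta = {}
    for c in class_docs:
        total = sum(word_counts[c].values())
        # (count + alpha) / (total + alpha * |V|); alpha = 1 is add-1 smoothing.
        theta[c] = {w: (word_counts[c][w] + alpha) / (total + alpha * len(vocab))
                    for w in vocab}
    return pi, theta

def log_score(doc, c, pi, theta):
    # log p(C_k) + sum_i x_i * log p_ki; out-of-vocabulary words are skipped.
    return math.log(pi[c]) + sum(math.log(theta[c][w]) for w in doc if w in theta[c])

# Tiny made-up corpus.
docs = [["free", "money", "free"], ["meeting", "today"], ["free", "meeting"]]
labels = ["spam", "ham", "ham"]
pi, theta = fit_multinomial(docs, labels)
print(pi, theta)
```

Note that "money" never appears in a "ham" document, yet its smoothed probability in that class is still positive.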

### 2.9 Bernoulli Naive Bayes:
- The Bernoulli distribution can be expressed as
$$ P(X=x) = p^x(1-p)^{(1-x)} $$
, where $P(X=1)=p$ when $x=1$ and $P(X=0)=1-p$ when $x=0$.
- When applied to Naive Bayes, with binary features $x_i \in \{0, 1\}$, we get
$$ p(X|C_k) = \prod_{i=1}^n {{p_k}_i}^{x_i} (1-{{p_k}_i})^{(1-x_i)} $$
- That is,
  $$ p(C_k|X) \propto p(C_k) \prod_{i=1}^n {{p_k}_i}^{x_i} (1-{{p_k}_i})^{(1-x_i)} $$
  When expressed in log-space, Bernoulli Naive Bayes also becomes a linear classifier:
  $$ \log p(C_k|X) = \log \left( p(C_k) \prod_{i=1}^n {{p_k}_i}^{x_i} (1-{{p_k}_i})^{(1-x_i)} \right) + \textrm{const} $$
  $$ = \log p(C_k) + \sum_{i=1}^n x_i \log {p_k}_i + \sum_{i=1}^n (1-x_i) \log (1-{p_k}_i) + \textrm{const} $$
, where the two sets of parameters in this model are the class prior $p(C_k)$, written $\pi$ below, and the feature probabilities ${p_k}_i$, written $\theta$ below. Collecting the terms in $x_i$ gives the linear form $b + w_k^T x$ with ${w_k}_i = \log \frac{{p_k}_i}{1-{p_k}_i}$ and $b = \log p(C_k) + \sum_{i=1}^n \log (1-{p_k}_i)$.
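- Now, we can rewrite the formula above as a likelihood function over the $N_D$ training samples $(x^{(i)}, y^{(i)})$:
  $$ L(\pi, \theta) = \prod_{i=1}^{N_D} \left[ p(y^{(i)}|\pi) \prod_{j=1}^n p(x_{j}^{(i)}|\theta_{j y^{(i)}}) \right] $$
  $$ \ln L(\pi, \theta) = \sum_{i=1}^{N_D} \ln p(y^{(i)}|\pi) + \sum_{i=1}^{N_D} \sum_{j=1}^n \ln p(x_{j}^{(i)}|\theta_{j y^{(i)}}) $$
Solving for the two sets of parameters through partial derivatives, we get:
$$ \pi_c = \frac {N_c} {N_D} $$
$$ \theta_{jc} = \frac {N_{jc}} {N_c} $$
, where $N_c$ is the number of training samples in class $c$ and $N_{jc}$ is the number of those samples in which feature $j$ equals 1.
- Just as in multinomial Naive Bayes, we need to deal with zero probabilities. To avoid this problem, we smooth the estimate with a pseudo-count $\alpha > 0$:
$$ \theta_{jc} = \frac {\alpha + N_{jc}} {2\alpha + N_c} $$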
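
A minimal sketch of these Bernoulli estimates, assuming binary feature vectors given as plain lists. The data and the function name `fit_bernoulli` are made up for this example; `alpha=1.0` is just one possible choice of pseudo-count.

```python
def fit_bernoulli(X, y, alpha=1.0):
    """Estimate pi_c = N_c / N_D and theta_jc = (alpha + N_jc) / (2*alpha + N_c).

    X: list of binary feature vectors (lists of 0/1), y: list of class labels.
    """
    n_features = len(X[0])
    classes = sorted(set(y))
    pi, theta = {}, {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        n_c = len(rows)
        pi[c] = n_c / len(X)
        # N_jc: number of class-c samples in which feature j equals 1.
        theta[c] = [(alpha + sum(r[j] for r in rows)) / (2 * alpha + n_c)
                    for j in range(n_features)]
    return pi, theta

# Made-up data: 3 binary features, 4 samples.
X = [[1, 0, 1], [1, 1, 0], [0, 0, 1], [0, 0, 0]]
y = ["a", "a", "b", "b"]
pi, theta = fit_bernoulli(X, y)
print(pi, theta)
```

The $2\alpha$ in the denominator reflects the two possible values of each feature, so the smoothed estimate always stays strictly between 0 and 1.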

### Reference:

```Maximum Likelihood Estimation:```
- https://en.wikipedia.org/wiki/Maximum_likelihood_estimation
- [Bayes’ classifier with Maximum Likelihood Estimation](https://towardsdatascience.com/bayes-classifier-with-maximum-likelihood-estimation-4b754b641488)
- [Bayesian classification and maximum likelihood estimation (in Chinese)](https://zhuanlan.zhihu.com/p/87044348)
- [Maximum likelihood estimation (MLE) and Bayesian estimation (in Chinese)](https://blog.csdn.net/bitcarmanlee/article/details/52201858)
- [Maximum likelihood estimation (MLE): study summary (in Chinese)](https://blog.csdn.net/qq_36078753/article/details/79959651?utm_medium=distribute.pc_relevant.none-task-blog-BlogCommendFromBaidu-3.control&depth_1-utm_source=distribute.pc_relevant.none-task-blog-BlogCommendFromBaidu-3.control)

```Naive Bayes:```
- https://en.wikipedia.org/wiki/Naive_Bayes_classifier
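- [Three common Naive Bayes models: Gaussian, multinomial, Bernoulli, with code implementation (in Chinese)](https://blog.csdn.net/qq_27009517/article/details/80044431)
- [Naive Bayes and derivations of three common models (in Chinese)](https://www.jianshu.com/p/b6cadf53b8b8)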
- [Naive Bayes Sklearn document](https://scikit-learn.org/stable/modules/naive_bayes.html)

```Multinomial Naive Bayes:```
- [Multinomial Naive Bayes Explained](https://www.mygreatlearning.com/blog/multinomial-naive-bayes-explained/)
- [Stanford lecture ppt.](https://web.stanford.edu/~jurafsky/slp3/slides/7_NB.pdf)
- [Document collection word example (in Chinese)](https://zhuanlan.zhihu.com/p/57554489)

```Bernoulli Naive Bayes:```
- [Bernoulli distribution, formula derivation (in Chinese)](https://zhuanlan.zhihu.com/p/146934905)

### Reference (how to write math equations in markdown): 
- https://medium.com/analytics-vidhya/writing-math-equations-in-jupyter-notebook-a-naive-introduction-a5ce87b9a214
- https://blog.csdn.net/smilejiasmile/article/details/80670742
- https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet#images