# Task 4: Implementing a Gaussian Naïve Bayes classifier

Dataset: Spambase  
Implement Gaussian naïve Bayes yourself instead of calling scikit-learn's `GaussianNB`  
Use 10-fold cross-validation for evaluation  
Evaluation metrics: Accuracy, Precision, Recall, F1  

## Symbols

Given a training set $T$

$$T = \{(x_1, y_1), (x_2, y_2), \cdots, (x_N, y_N)\}$$

where $x_i$ is the feature vector of the $i$-th sample and $y_i$ is the label of the $i$-th sample, with $y_i \in \{c_1, c_2, ..., c_K\}$.


## Bayes' rule

Bayes' rule:

$$
\begin{aligned}
    P(Y = c_k \mid X = x) &= \frac{P(Y = c_k, X = x)}{P(X = x)} \\
                  &= \frac{P(X = x \mid Y = c_k)P(Y = c_k)}{\sum_{k'} P(X = x \mid Y = c_{k'})P(Y = c_{k'})} \\
                  & \propto P(X = x \mid Y = c_k)P(Y = c_k) \\
\end{aligned}
$$


Prior distribution:

$$
P(Y = c_k), \ k = 1, 2, ..., K
$$

Conditional distribution:

$$
P(X = x \mid Y = c_k) = P(X^{(1)} = x^{(1)}, \cdots, X^{(n)} = x^{(n)} \mid Y = c_k), \ k = 1, 2, ..., K
$$



**How to obtain the prior and the likelihood?**

### 1. Prior $P(Y = c_k)$:


$$
P(Y = c_k) = \frac{\mathrm{number} \ \mathrm{of}\ c_k}{N}
$$
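
As a minimal sketch of this counting estimate, assuming the labels are stored in a NumPy array (the helper name `class_priors` and the toy label array are illustrative and not part of the assignment):

```python
import numpy as np

def class_priors(y):
    """Return the distinct classes in y and their relative frequencies."""
    classes, counts = np.unique(y, return_counts=True)
    return classes, counts / y.shape[0]

# Toy labels (hypothetical): three samples of class 0, one of class 1.
y = np.array([0, 0, 1, 0])
classes, priors = class_priors(y)
print(classes, priors)
```

Each prior is simply the fraction of training samples carrying that label, so the priors sum to 1.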

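### 2. Likelihood $P(X = x \mid Y = c_k)$:

Naïve Bayes assumes that the features are conditionally independent given the class, so the likelihood factorizes over the features:

$$P(X = x \mid Y = c_k) = \prod^n_{j=1}P(X^{(j)}=x^{(j)} \mid Y = c_k)$$

where $x^{(j)}$ is the $j$-th feature of sample $x$.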


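### 3. Conditional distribution for feature $j$, $P(X^{(j)}=x^{(j)} \mid Y = c_k)$:

Each feature is modeled as a Gaussian within each class, where $\mu_{c_k,j}$ and $\sigma^2_{c_k,j}$ are the mean and variance of feature $j$ over the training samples with label $c_k$:

$$
P(X^{(j)} = x^{(j)} \mid Y = c_k) = \frac{1}{\sqrt{2 \pi \sigma^2_{c_k,j}}} \exp{\bigg( - \frac{(x^{(j)} - \mu_{c_k,j})^2}{2 \sigma^2_{c_k,j}} \bigg)}
$$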
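
The per-class parameters and the density above could be computed along these lines. This is a sketch under assumptions: the helper names `fit_gaussians` and `gaussian_pdf` and the toy data are made up for illustration, and the variance is the maximum-likelihood estimate (`ddof=0`).

```python
import numpy as np

def fit_gaussians(X, y):
    """Per-class, per-feature mean and variance (maximum-likelihood estimates)."""
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    variances = np.array([X[y == c].var(axis=0) for c in classes])
    return classes, means, variances

def gaussian_pdf(x, mean, var):
    """Univariate normal density, evaluated elementwise."""
    return np.exp(-(x - mean) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

# Toy data (hypothetical): two samples per class, two features.
X = np.array([[1.0, 2.0], [3.0, 2.5], [10.0, 8.0], [12.0, 9.0]])
y = np.array([0, 0, 1, 1])
classes, means, variances = fit_gaussians(X, y)

# Density of each feature of the first sample under class 0.
density = gaussian_pdf(X[0], means[0], variances[0])
```

A feature whose variance is zero within a class makes the density undefined, so a small constant is often added to the variances; scikit-learn's `GaussianNB` exposes this as its `var_smoothing` parameter.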


## Objective function

Combining the prior with the per-feature likelihoods gives the decision rule of Gaussian naïve Bayes:

$$
y = \mathop{\arg\max}_{c_k} P(Y = c_k) \prod_j P(X^{(j)} = x^{(j)} \mid Y = c_k)
$$


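In practice the logarithm of this expression is maximized instead: the log is monotonic, so the arg max is unchanged, and a sum of log terms avoids the numerical underflow that a product of many small densities can cause.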

$$
\begin{aligned}
y &= \mathop{\arg\max}_{c_k} \log \Big[ P(Y = c_k) \prod_j P(X^{(j)} = x^{(j)} \mid Y = c_k) \Big] \\
&= \mathop{\arg\max}_{c_k} \Big[ \log P(Y = c_k) + \sum_j \log P(X^{(j)} = x^{(j)} \mid Y = c_k) \Big]
\end{aligned}
$$
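
A small numerical illustration of why the sum of logs is preferred. The density value `1e-6` is an arbitrary assumption chosen for the demonstration; 57 is the number of Spambase features.

```python
import numpy as np

# Hypothetical per-feature densities: 57 features, each with density 1e-6.
densities = np.full(57, 1e-6)

# The direct product is far below the smallest positive float64 and underflows.
product = np.prod(densities)

# The sum of logs represents the same quantity without underflow.
log_sum = np.sum(np.log(densities))

print(product)  # 0.0
print(log_sum)
```

Once the product has underflowed to zero for every class, the arg max can no longer distinguish between them, whereas the log scores remain comparable.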



Taking the log of the Gaussian density for feature $j$:

$$\begin{aligned}
\log P(X^{(j)} = x^{(j)} \mid Y = c_k) &= \log \bigg[\frac{1}{\sqrt{2 \pi \sigma^2_{c_k,j}}} \exp{\bigg(- \frac{(x^{(j)} - \mu_{c_k,j})^2}{2 \sigma^2_{c_k,j}}\bigg)}\bigg]\\
&= \log \frac{1}{\sqrt{2 \pi \sigma^2_{c_k,j}}} + \log \exp{\bigg(- \frac{(x^{(j)} - \mu_{c_k,j})^2}{2 \sigma^2_{c_k,j}}\bigg)}\\
&= - \frac{1}{2} \log \big( 2 \pi \sigma^2_{c_k,j} \big) - \frac{1}{2} \frac{(x^{(j)} - \mu_{c_k,j})^2}{\sigma^2_{c_k,j}}
\end{aligned}
$$



Substituting this into the decision rule gives the final objective:

$$
y = \mathop{\arg\max}_{c_k} \bigg[ \log P(Y = c_k) + \sum_j \Big( - \frac{1}{2} \log \big( 2 \pi \sigma^2_{c_k,j} \big) - \frac{1}{2} \frac{(x^{(j)} - \mu_{c_k,j})^2}{\sigma^2_{c_k,j}} \Big) \bigg]
$$
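
The final objective can be evaluated for many samples at once with NumPy broadcasting. The sketch below assumes the priors, means and variances have already been estimated; the function name `joint_log_likelihood` and all parameter values are hypothetical, chosen only to make the example runnable.

```python
import numpy as np

def joint_log_likelihood(X, priors, means, variances):
    """Log prior plus summed per-feature Gaussian log densities.

    X: (n_samples, n_features); priors: (n_classes,);
    means, variances: (n_classes, n_features).
    Returns an (n_samples, n_classes) array of unnormalized log posteriors.
    """
    # Broadcast samples against classes: (n_samples, n_classes, n_features).
    diff = X[:, None, :] - means[None, :, :]
    log_density = (-0.5 * np.log(2.0 * np.pi * variances)[None, :, :]
                   - 0.5 * diff ** 2 / variances[None, :, :])
    return np.log(priors)[None, :] + log_density.sum(axis=2)

# Hypothetical parameters for two classes and two features.
priors = np.array([0.5, 0.5])
means = np.array([[0.0, 0.0], [5.0, 5.0]])
variances = np.array([[1.0, 1.0], [1.0, 1.0]])

X_new = np.array([[0.2, -0.1], [4.8, 5.3]])
scores = joint_log_likelihood(X_new, priors, means, variances)

# The arg max over the class axis returns an index into the class list.
predicted = scores.argmax(axis=1)
print(predicted)  # [0 1]
```

The arg max returns class indices, so a full classifier would map them back to the original labels $c_k$.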



## 1. Import dataset

In [None]:
import numpy as np

In [None]:
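# Each row is one email: 57 feature columns followed by the 0/1 spam label.
spambase = np.loadtxt('data/spambase/spambase.data', delimiter=",")
spamx = spambase[:, :57]  # features
spamy = spambase[:, 57]   # labels (1 = spam, 0 = not spam)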

## 2. Split dataset

In [None]:
from sklearn.model_selection import train_test_split
# Hold out 40% of the samples for testing; the fixed seed makes the split reproducible.
trainX, testX, trainY, testY = train_test_split(spamx, spamy, test_size=0.4, random_state=32)

In [None]:
trainX.shape, trainY.shape, testX.shape, testY.shape

## 3. Implement the Gaussian naïve Bayes classifier

In [None]:
# YOUR CODE HERE



In [None]:
# Test case: myGaussianNB is the class implemented in the cell above
from sklearn.metrics import accuracy_score
model = myGaussianNB()
model.fit(trainX, trainY)
accuracy_score(testY, model.predict(testX))

## 4. Results in different metrics

###### Input your results

Precision|Accuracy|Recall|F1
-|-|-|-
0.0|0.0|0.0|0.0
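
The four metrics in the table can be computed from confusion-matrix counts, and the 10 folds can be generated from a shuffled index array. The following is a sketch rather than a reference solution: the helper names `binary_metrics` and `kfold_indices` are hypothetical, label 1 (spam) is assumed to be the positive class, and the toy predictions are invented for illustration.

```python
import numpy as np

def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_pred == positive) & (y_true == positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    accuracy = np.mean(y_pred == y_true)
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return accuracy, precision, recall, f1

def kfold_indices(n_samples, n_splits=10, seed=0):
    """Yield (train_idx, test_idx) pairs over a shuffled index range."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), n_splits)
    for i in range(n_splits):
        train_idx = np.concatenate(folds[:i] + folds[i + 1:])
        yield train_idx, folds[i]

# Toy labels and predictions (hypothetical).
y_true = np.array([1, 0, 1, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0])
acc, prec, rec, f1 = binary_metrics(y_true, y_pred)
print(acc, prec, rec, f1)
```

For the cross-validated result, one would fit the classifier on each `train_idx`, predict on the matching `test_idx`, and average the four metrics over the 10 folds.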

In [None]:
# YOUR CODE HERE

