# Naive Bayes

The Naive Bayes algorithm estimates the probability that an input vector X has a corresponding label y. It combines how often each label occurs in the training data (the prior) with how likely the observed feature values are under that label (the likelihood), using Bayes' theorem:

$$P(y | X) = \frac{P(X|y)P(y)}{P(X)}$$

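The algorithm is called naive because it assumes that the features are conditionally independent given the label. This assumption rarely holds exactly, but the algorithm has nevertheless proven useful in many cases. Under this assumption the likelihood factorizes into a product over the features, which gives us: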

$$P(y | X) = \frac{P(X=x_1|y) \times P(X=x_2|y) \times \cdots \times P(X=x_n|y) \times P(y)}{P(X)}$$

Multiplying many probabilities, each a value between 0 and 1, can underflow in floating-point arithmetic, so we take the logarithm of both sides and work with a sum of log probabilities instead. This gives us:

$$\log(P(y | X)) = \log(P(X=x_1|y)) + \log(P(X=x_2|y)) + \cdots + \log(P(X=x_n|y)) + \log(P(y)) - \log(P(X))$$
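To make the numerical issue concrete, here is a minimal sketch with made-up numbers: 2000 hypothetical per-feature likelihoods of 0.01 each (the array `likelihoods` is an illustration, not part of the dataset used in this notebook). The direct product underflows to zero in `float64`, while the sum of logs stays representable.

```python
import numpy as np

# Hypothetical per-feature likelihoods for one class: 2000 features, each 0.01.
likelihoods = np.full(2000, 0.01)

# Multiplying directly underflows: 0.01**2000 is far below the smallest float64.
direct_product = np.prod(likelihoods)

# Summing the logs carries the same information in a representable range.
log_sum = np.sum(np.log(likelihoods))

print(direct_product)  # 0.0
print(log_sum)         # 2000 * log(0.01), roughly -9210.34
```

With only a handful of features the product would usually still be representable, so the log form matters most as the number of features grows.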

Since we are looking for the most probable label y, and $P(X)$ is the same for every label, it does not affect the comparison and can be dropped, giving the final expression:

$$\hat{y} = \arg\max_y \left[ \log(P(X=x_1|y)) + \log(P(X=x_2|y)) + \cdots + \log(P(X=x_n|y)) + \log(P(y)) \right]$$
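The claim that $P(X)$ does not change the predicted label can be checked on a small example. The sketch below uses hypothetical log scores (the array `log_joint` is invented for illustration) and computes $\log P(X)$ with a log-sum-exp over the classes; the argmax is the same before and after normalisation.

```python
import numpy as np

# Hypothetical unnormalised log scores, log P(y) + sum_i log P(x_i | y),
# for 3 samples (rows) and 2 classes (columns).
log_joint = np.array([[-12.3, -15.1],
                      [-40.2, -38.7],
                      [-7.5, -7.9]])

# log P(X) is the log of the sum over classes of exp(log_joint) (log-sum-exp).
row_max = log_joint.max(axis=1, keepdims=True)
log_evidence = row_max + np.log(np.exp(log_joint - row_max).sum(axis=1, keepdims=True))

# Subtracting log P(X) gives the normalised log posterior.
log_posterior = log_joint - log_evidence

preds_unnormalised = log_joint.argmax(axis=1)
preds_normalised = log_posterior.argmax(axis=1)
print(preds_unnormalised, preds_normalised)  # [0 1 0] [0 1 0]
```

Subtracting the same constant from every entry of a row cannot change which entry is largest, which is why the normalisation step can be skipped when only the label is needed.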

To calculate the probability of x given y, we model each feature with a normal distribution whose mean $\mu$ and standard deviation $\sigma$ are estimated separately for each class. However, depending on the dataset, one could instead implement the algorithm using a Bernoulli or multinomial distribution.

$$ f(x) = \frac{1}{\sigma \sqrt{2\pi}}e^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^2}$$
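The density above translates directly into code. The following is a minimal sketch; the helper names `gaussian_pdf` and `gaussian_log_pdf` are hypothetical and are not used elsewhere in this notebook. The log version is the one that fits the sum-of-logs expression.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Normal density f(x) with mean mu and standard deviation sigma."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def gaussian_log_pdf(x, mu, sigma):
    """log f(x), computed without exponentiating first."""
    return -np.log(sigma * np.sqrt(2.0 * np.pi)) - 0.5 * ((x - mu) / sigma) ** 2

# At the mean of a standard normal the density is 1 / sqrt(2 * pi).
print(gaussian_pdf(0.0, 0.0, 1.0))
print(gaussian_log_pdf(0.0, 0.0, 1.0))
```

Both functions accept NumPy arrays as well as scalars, so they can be applied to a whole feature column at once.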

Below is a step-by-step implementation of the algorithm.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn import datasets
import numpy as np


X, y = datasets.make_classification(n_samples=10000, n_features=10, n_classes=2, random_state=123)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(8000, 10) (2000, 10) (8000,) (2000,)


## Defining the parameters of the distribution and the prior

We will define three arrays that will make our life easier during computation: the per-class feature means, the per-class feature standard deviations, and the log priors.
$$\mu_x = \begin{bmatrix}
\mathbb{E}[{x^{(1)}}|y_1] & \mathbb{E}[{x^{(1)}}|y_2]\\
\mathbb{E}[{x^{(2)}}|y_1] & \mathbb{E}[{x^{(2)}}|y_2]\\
\mathbb{E}[{x^{(3)}}|y_1] & \mathbb{E}[{x^{(3)}}|y_2]\\
\cdots & \cdots \\
\mathbb{E}[{x^{(j)}}|y_1] & \mathbb{E}[{x^{(j)}}|y_2]\\
\end{bmatrix}$$
\
$$\sigma_x = \begin{bmatrix}
\sigma({x^{(1)}}|y_1) & \sigma({x^{(1)}}|y_2)\\
\sigma({x^{(2)}}|y_1) & \sigma({x^{(2)}}|y_2)\\
\sigma({x^{(3)}}|y_1) & \sigma({x^{(3)}}|y_2)\\
\cdots & \cdots \\
\sigma({x^{(j)}}|y_1) & \sigma({x^{(j)}}|y_2)\\
\end{bmatrix}$$
\
$$\log P(y) = \begin{bmatrix}
\log(P(y_1)) & \log(P(y_2))\\
\end{bmatrix}$$

where $j$ is the number of features and each column corresponds to one class.

In [None]:
# Define the per-class mean, standard deviation, and log prior.
# The mean and std are computed for each feature of the training data.
n_samples, n_features = X_train.shape
classes = np.unique(y_train)

mean = np.zeros((n_features, classes.shape[0]), dtype=np.float64)
std = np.zeros((n_features, classes.shape[0]), dtype=np.float64)
prior = np.zeros(classes.shape[0], dtype=np.float64)

for c in classes:
    # Use only the training samples that belong to class c
    curr_data = X_train[y_train == c]
    mean[:, c] = curr_data.mean(axis=0)
    std[:, c] = curr_data.std(axis=0)
    # Log prior: fraction of training samples in class c
    prior[c] = np.log(curr_data.shape[0] / n_samples)


print(mean.shape, std.shape, prior.shape)

(10, 2) (10, 2) (2,)


## Defining the posterior

We will now evaluate the Gaussian log density of every test point under the per-class mean and standard deviation estimated from the training set, and add the log prior. This gives $\log P(y|X)$ up to the constant $\log P(X)$, which is all we need to pick the most probable label. We should have the resulting matrix:
\
$$\log P(y|X) + \log P(X) = \begin{bmatrix} \bigg[\log p(y=0) + \sum_{j=1}^{10} \log p(x_1^{(j)}|y=0) \bigg] & \bigg[\log p(y=1) + \sum_{j=1}^{10} \log p(x_1^{(j)}|y=1) \bigg] \\
\bigg[\log p(y=0) + \sum_{j=1}^{10} \log p(x_2^{(j)}|y=0) \bigg] & \bigg[\log p(y=1) + \sum_{j=1}^{10} \log p(x_2^{(j)}|y=1) \bigg] \\
\cdots & \cdots \\
\bigg[\log p(y=0) + \sum_{j=1}^{10} \log p(x_N^{(j)}|y=0) \bigg] & \bigg[\log p(y=1) + \sum_{j=1}^{10} \log p(x_N^{(j)}|y=1) \bigg] \\
\end{bmatrix}$$

where $N$ is the number of data points and $j$ indexes the 10 features.
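The matrix above can also be built in one shot with NumPy broadcasting instead of looping over test points. The sketch below is self-contained and uses small random stand-ins (`X_demo`, `mean_demo`, `std_demo`, `log_prior_demo` and the function `log_posterior_matrix` are hypothetical names, with the same `(n_features, n_classes)` layout as the parameter arrays defined earlier).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: 5 data points, 3 features, 2 classes.
X_demo = rng.normal(size=(5, 3))
mean_demo = rng.normal(size=(3, 2))             # (n_features, n_classes)
std_demo = rng.uniform(0.5, 2.0, size=(3, 2))   # (n_features, n_classes)
log_prior_demo = np.log(np.array([0.4, 0.6]))   # (n_classes,)

def log_posterior_matrix(X, mean, std, log_prior):
    """Return an (N, n_classes) matrix of log prior + summed Gaussian log densities."""
    # Broadcast (N, n_features, 1) against (1, n_features, n_classes).
    z = (X[:, :, None] - mean[None, :, :]) / std[None, :, :]
    log_lik = -np.log(std[None, :, :] * np.sqrt(2.0 * np.pi)) - 0.5 * z ** 2
    # Sum over the feature axis, then add the log prior of each class.
    return log_lik.sum(axis=1) + log_prior[None, :]

scores = log_posterior_matrix(X_demo, mean_demo, std_demo, log_prior_demo)
print(scores.shape)           # (5, 2)
print(scores.argmax(axis=1))  # predicted class index for each data point
```

Each row of `scores` corresponds to one row of the matrix in the text, and taking the argmax along each row yields the predicted label.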




In [None]:
N, d = X_test.shape
# One row per class, one column per test point
posterior = np.zeros((classes.shape[0], N))
for c in classes:
    # Sum of per-feature Gaussian log densities of every test point under class c
    log_likelihood = np.sum(
        -np.log(std[:, c] * np.sqrt(2 * np.pi))
        - 0.5 * ((X_test - mean[:, c]) / std[:, c]) ** 2,
        axis=1,
    )
    posterior[c, :] = prior[c] + log_likelihood
# Pick the class with the highest log posterior for each test point
preds = posterior.argmax(axis=0)


In [None]:
def accuracy(preds, y_test):
    return np.mean(preds == y_test)*100

In [None]:
print("Naive Bayes classification accuracy", accuracy(preds, y_test))

Naive Bayes classification accuracy 92.05


## Putting Everything Together


In [None]:
class NaiveBayes:
    def fit(self, X_train, y_train):
        n_samples, n_features = X_train.shape
        # Assumes integer class labels 0..K-1, since they are used as column indices
        self.classes = np.unique(y_train)
        n_classes = self.classes.shape[0]

        self.mean = np.zeros((n_features, n_classes), dtype=np.float64)
        self.std = np.zeros((n_features, n_classes), dtype=np.float64)
        self.prior = np.zeros(n_classes, dtype=np.float64)

        for c in self.classes:
            # Use only the training samples that belong to class c
            curr_data = X_train[y_train == c]
            self.mean[:, c] = curr_data.mean(axis=0)
            self.std[:, c] = curr_data.std(axis=0)
            # Log prior: fraction of training samples in class c
            self.prior[c] = np.log(curr_data.shape[0] / n_samples)

    def predict(self, X_test):
        N, d = X_test.shape
        posterior = np.zeros((self.classes.shape[0], N))
        for c in self.classes:
            # Sum of per-feature Gaussian log densities under class c
            log_likelihood = np.sum(
                -np.log(self.std[:, c] * np.sqrt(2 * np.pi))
                - 0.5 * ((X_test - self.mean[:, c]) / self.std[:, c]) ** 2,
                axis=1,
            )
            posterior[c, :] = self.prior[c] + log_likelihood
        # Pick the class with the highest log posterior for each test point
        preds = posterior.argmax(axis=0)
        return preds

In [None]:
NBClassifier = NaiveBayes()
NBClassifier.fit(X_train, y_train)
preds = NBClassifier.predict(X_test)
acc = accuracy(preds, y_test)
print(f"Naive Bayes classification accuracy {acc}")

Naive Bayes classification accuracy 92.05
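As a sanity check, the result can be compared against scikit-learn's own Gaussian Naive Bayes. The sketch below regenerates the same dataset so that it runs on its own; the variable names `reference` and `reference_acc` are hypothetical. Note that `GaussianNB` adds a small variance-smoothing term (`var_smoothing`), so its predictions may differ slightly from those of a from-scratch implementation.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Same synthetic dataset and split as at the top of the notebook.
X, y = datasets.make_classification(n_samples=10000, n_features=10, n_classes=2, random_state=123)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

# Fit the library implementation on the training split only.
reference = GaussianNB()
reference.fit(X_train, y_train)

# score() returns the mean accuracy as a fraction; convert to a percentage.
reference_acc = reference.score(X_test, y_test) * 100
print(f"sklearn GaussianNB accuracy {reference_acc}")
```

If the two accuracies are far apart, the most likely cause is a difference in which data the parameters were estimated from.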
