## Naive Bayes

In [1]:
from __future__ import division, print_function
from sklearn import datasets
from sklearn.preprocessing import normalize
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
import numpy as np
import math

In [2]:
data = datasets.load_digits()
X = normalize(data.data)
y = data.target

In [3]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)  # Split data into train and test
# You can reduce the dimensions to 2 or 3d for visualisation
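
The comment above mentions reducing the data to 2 or 3 dimensions for visualisation. One possible way to do that, sketched here as an assumption rather than the notebook's intended method, is principal component analysis via scikit-learn's `PCA`. The variable name `X_2d` is made up for this illustration.

```python
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import normalize

data = datasets.load_digits()
X = normalize(data.data)

# Project the 64-dimensional digit vectors onto their first two principal components
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

# One row per sample, two columns suitable for a scatter plot coloured by data.target
print(X_2d.shape)
```

Setting `n_components=3` instead would give coordinates for a 3D scatter plot.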

Given an input space $$ \mathcal{X} = \mathcal{X}_1 \times \ldots \times \mathcal{X}_d $$
the Naive Bayes estimate of $p(x,y)$ is given by
$$ \hat{p}_{NB}(x,y) := \hat{p}(y)\prod^{d}_{j=1} \hat{p}_j(x_j|y), $$
where:
* $\hat{p}_{NB}(x,y)$ is the Naive Bayes estimate of the joint distribution of a sample $x$ and a class $y$; for a fixed $x$ it is proportional to the posterior probability that $x$ belongs to class $y$.
* $\hat{p}(y)$ is an estimate of $p(y)$, the prior distribution over classes.
* $\hat{p}_j(x_j|y)$ is an estimate of the per-feature likelihood $p(x_j|y)$, for $j = 1,\ldots,d$.

In the Naive model, we assume **independence**: the features $x_1, \ldots, x_d$ are conditionally independent given the class $y$.

The model likelihood is therefore given by $$ p(x_1, x_2, \ldots, x_d|y) = p(x_1|y)\, p(x_2|y) \cdots p(x_d|y) $$
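
A practical consequence of writing the likelihood as a product is that multiplying many small factors can underflow in floating point. The sketch below uses made-up likelihood values (not taken from the digits data) to show the effect, and shows that summing logarithms avoids it. Working in log space is a common variation and is not required by the tasks in this notebook.

```python
import math
import numpy as np

# Hypothetical per-feature likelihood values p(x_j | y) for one sample and one class
likelihoods = np.full(1000, 0.01)

# Multiplying them directly underflows to zero in float64
product = np.prod(likelihoods)

# Summing the logarithms keeps the same information in a representable range
log_likelihood = np.sum(np.log(likelihoods))

print(product, log_likelihood)
```

Because the logarithm is monotonic, the class with the largest log posterior is also the class with the largest posterior.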



By Bayes' rule, the posterior distribution is given by $$ p(y|x) = \frac{p(x|y)\,p(y)}{Z}, $$
where $Z = p(x)$ is a scaling factor that does not depend on $y$.
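
As a small worked example of Bayes' rule, the sketch below uses made-up priors and likelihoods for three classes (the numbers are assumptions chosen for easy arithmetic) and normalizes by $Z$.

```python
import numpy as np

# Hypothetical values for a single sample x and three classes
priors = np.array([0.5, 0.3, 0.2])          # p(y)
likelihoods = np.array([0.02, 0.10, 0.05])  # p(x | y)

unnormalized = likelihoods * priors   # p(x | y) p(y)
Z = unnormalized.sum()                # p(x), the scaling factor
posterior = unnormalized / Z          # p(y | x), sums to 1 over the classes

predicted_class = int(np.argmax(posterior))
print(posterior, predicted_class)
```

Since $Z$ is the same for every class, dividing by it does not change which class has the largest value, so a classifier can compare the unnormalized products directly.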

### Tasks

 1. TODO 1: Find the number of unique classes in y
 2. TODO 2: Calculate the Gaussian likelihood $p(x|\mu, \sigma^2)$ of x given the mean $\mu$ and variance $\sigma^2$
 $$ p(x|\mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) $$
 3. TODO 3: Compute prior for each class
 4. TODO 4: Compute posterior for each class
 5. TODO 5: Return the class with the largest posterior probability
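
The Gaussian likelihood in TODO 2 can be sanity-checked on single numbers before it goes into the class. The sketch below is one possible standalone version, not necessarily the intended solution; the helper name `gaussian_pdf` is hypothetical, and it omits the `eps` term that the class skeleton defines.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a normal distribution N(mean, var) evaluated at x."""
    coeff = 1.0 / math.sqrt(2.0 * math.pi * var)
    exponent = math.exp(-((x - mean) ** 2) / (2.0 * var))
    return coeff * exponent

# Standard normal: the density peaks at the mean and falls off away from it
peak = gaussian_pdf(0.0, mean=0.0, var=1.0)
one_sigma = gaussian_pdf(1.0, mean=0.0, var=1.0)
print(peak, one_sigma)
```

This version divides by zero when `var` is 0, which can happen for a feature that is constant within a class. The `eps` in the class skeleton is there to guard against that case.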

In [4]:
class NaiveBayes():
    """The Gaussian Naive Bayes classifier. """
    def fit(self, X, y):
        self.X, self.y = X, y
        self.classes = None # TODO 1
        self.params = []
        
        for i, c in enumerate(self.classes):
            
            X_where_c = X[np.where(y == c)]   # Separate X into their various classes
            self.params.append([])            
            
            # Compute the means and variances for each class
            for col in X_where_c.T:
                params = {"mean": np.mean(col), "var": np.var(col, ddof=1)}
                self.params[i].append(params)

            
    def _calculate_likelihood(self, mean, var, x):  
        """ Gaussian likelihood of the data x given mean and var """
        eps = 1e-4  # Add a small-valued epsilon to prevent division by 0
        #TODO 2
        pass

    def _calculate_prior(self, c):
        """ Calculate the prior of class c
        (samples belonging to class c / total number of samples)"""
        p = None # TODO 3
        return p

    def _classify(self, sample):
        """ 
        P(y|x) - posterior probability
        P(x|y) - data likelihood
        P(y)   - Prior distribution over classes
        P(x)   - marginal distribution of x that scales the posterior so that it sums to 1 over the classes
        """
        posteriors = []
        # Go through list of classes
        for i, c in enumerate(self.classes):
            # Initialize posterior as prior
            posterior = self._calculate_prior(c)           
           
            for feature_value, param in zip(sample, self.params[i]):
                # Likelihood of feature value given distribution of feature values given y
                likelihood = self._calculate_likelihood(param["mean"], param["var"], feature_value)
                posterior = None # TODO 4 --- Multiply the likelihoods with prior
            posteriors.append(posterior)
        
        return None # TODO 5

    def predict(self, X):
        """ Predict the class labels of each sample in X """
        y_pred = [self._classify(x_i) for x_i in X]
        return y_pred


In [1]:
clf = NaiveBayes()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)

print("Accuracy:", accuracy)
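
Once the TODOs are filled in, one way to judge whether the resulting accuracy is plausible is to compare it with scikit-learn's own `GaussianNB` on the same kind of split. The sketch below is a reference run only. The fixed `random_state` and the name `reference_accuracy` are assumptions added for repeatability and are not part of the notebook above.

```python
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import normalize

data = datasets.load_digits()
X = normalize(data.data)
y = data.target

# random_state is fixed here only to make the reference run repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)

reference = GaussianNB()
reference.fit(X_train, y_train)
reference_accuracy = accuracy_score(y_test, reference.predict(X_test))

print("Reference accuracy:", reference_accuracy)
```

The two numbers need not match exactly, since the library applies its own variance smoothing and the split above is random, but a large gap would suggest a bug in the hand-written likelihood or posterior.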