# Challenge - Recoding a GaussianNB from scratch

![](https://images.unsplash.com/photo-1530639834082-05bafb67fbbe?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1050&q=80)

## Objectives

In this challenge, you will recode the Gaussian Naive Bayes algorithm **from scratch**. <br>
This is not an easy exercise, but coding ML algorithms yourself is the best way of making sure you have a good understanding of them :)

## Guidelines

A quick reminder on **Naive Bayes classification** :
1. The NB classifier is based on the **Bayes Theorem** :

$$ Pr(Y=y_k | X=x) = \frac{Pr(X=x | Y-y_k)Pr(Y=y_k)}{Pr(X=x)}$$

In the above formula :
- $Pr(Y=y_k|X=x)$ is **the probability of each class given the known features**
- $Pr(X=x|Y=y_k)$ is **the probability of the features given the class**
- $Pr(Y=y_k)$ is **the probability of apparition of the class on the given dataset**
- $Pr(X=x)$ **does not actually need to be computed**

Using the Bayes formula, **the algorithm computes the probability of each class for a given set of features, and returns the class with the higest computed probability**.

2. $Pr(Y=y_k)$, the probability of apparition of the class on the given dataset, is very easy to compute : it is just **the percentage of apparition on each class on the training set**.

3. $Pr(X=x|Y=y_k)$, the probability of the features given the class, is a little bit harder to compute, but no stress, let's go step by step :
- The probability of all features given the class is equal to the product of **the probability of each single feature given the class**.
- And **the probability of each single feature given the class can be computed using the following (gaussian) formula** :

$$ P(x_i \mid Y) = \frac{1}{\sqrt{2\pi\sigma^2_Y}} \exp\left(-\frac{(x_i - \mu_Y)^2}{2\sigma^2_Y}\right) $$

In the above formula :
- $\mu_k$ is the mean of the feature $x$ for a given class $y_k$
- $\sigma_k$ is the standard deviation of the feature $x$ for a given class $y_k$

⚠️ **The values of $\mu$, $\sigma$, and the probability of apparition on each class over the training dataset is what the algorithm learns during the fitting stage.**

## Some hints on how to proceed :

Create a class `GaussianNaiveBayes`, that will implement the following methods :
- `_get_target_proba` : Returns the probability of apparition of a given target on a given dataset, e.g. the percentage of representation of the given target over the dataset.
- `_get_mu` : Returns the mean of each feature of a dataset for all observations corresponding to a given target.
- `_get_sigma` : Returns the standard deviation of each feature of a dataset for all observations corresponding to a given target.
- `fit` :  Fits the algorithm, e.g. storing as attributes :
    - self.target_probas -- the probability of apparition of each target
    - self.mus -- the mean of each feature for each target
    - self.sigma -- the standard deviation of each feature for each target
- `_get_single_feature_probability` : Returns the probability of apparition of a single feature, given the target. Uses the Gaussian probabilistic law.
- `_predict_single_x` : Returns the probabilities of belonging to each of the target class for a given data point.
- `predict_proba` : Returns the probabilities of belonging to each of the target class for all points of the dataset.
- `predict` : Returns the predicted class for all points of the dataset.
- `score` : Returns the accuracy score for the model on the dataset X, y.

The architecture of the class, including the methods to complete and their docstrings, is provided below. **It is not mandatory for you to follow exactly this architecture**, if you have another idea on how to code the algorithm, feel free to follow it.

⚠️ One piece of advice : test your code on a simple dataset, such as the iris dataset from scikit-learn.

### 1. From scratch implementation

#### 1.1. Creating the `GaussianNaiveBayes` class

In [29]:
import math
import numpy as np

In [217]:
class GaussianNaiveBayes():
    
    def _get_target_proba(self, y, target):
        """
        Returns the probability of apparition of a given target on a given dataset,
        e.g. the percentage of representation of the given target over the dataset.
        
        Parameters
            y {np.ndarray} -- the target array (1-dimensional)
            target {int} -- the given target
            
        Returns
            float : the proportion of representation of the given target over the dataset.
        """
        proba = y[y == target].size / y.size
        return proba
    
    
    def _get_mu(self, X, y, target):
        """
        Returns the mean of each feature of a dataset for all observations corresponding to a given target.
        
        Parameters
            X {np.ndarray} -- the features array (n-dimensional)
            y {np.ndarray} -- the target array (1-dimensional)
            target {int} -- the given target
            
        Returns
            np.ndarray (1-dim) : the mean value of each feature in the dataset for the given target.
        """
        mu = X[y == target].mean(axis = 0)
        return mu
    
    
    def _get_sigma(self, X, y, target):
        """
        Returns the standard deviation of each feature of a dataset for all observations corresponding to a given target.
        
        Parameters
            X {np.ndarray} -- the features array (n-dimensional)
            y {np.ndarray} -- the target array (1-dimensional)
            target {int} -- the given target
            
        Returns
            np.ndarray (1-dim) : the standard deviation of each feature in the dataset for the given target.
        """
        sigma = X[y == target].std(axis = 0)
        return sigma 
    
    
    def fit(self, X, y):
        """
        Fits the algorithm, e.g. storing as attributes :
            > self.target_probas -- the probability of apparition of each target
            > self.mus -- the mean of each feature for each target
            > self.sigma -- the standard deviation of each feature for each target
        """
        self.target_probas = np.array([self._get_target_proba(y, target) for target in list(set(y))])
        self.mus = np.array([self._get_mu(X, y, target) for target in list(set(y))])
        self.sigmas = np.array([self._get_sigma(X, y, target) for target in list(set(y))])
        
        
    def _get_single_feature_probability(self, x, mu, sigma):
        """
        Returns the probability of apparition of a single feature, given the target.
        Uses the Gaussian probabilistic law.
        
        Parameters
            x {float} -- a single point from a single feature
            mu {float} -- the mean of the feature from which x comes
            sigma {float} -- the standard deviation of the feature from which x comes
            
        Returns
            float : the probability of a single feature point, for a given target.
        """
        feature_proba = 1 / (math.sqrt(2 * math.pi)) * math.e**(-(x-mu)**2/(2*sigma**2))
        return feature_proba
    
    
    def _predict_single_x(self, x):
        """
        Returns the probabilities of belonging to each of the target class for a given data point.
        
        Parameters
            x {np.ndarray} -- a single observation from the dataset (1-dimensional)
            
        Returns
            np.ndarray (1-dimensional) : the array of probabilities of each class for a single point.
        """
        single_proba = np.prod(self._get_single_feature_probability(x, self.mus, self.sigmas), axis = 1) * self.target_probas
        return single_proba
    
    
    def predict_proba(self, X):
        """
        Returns the probabilities of belonging to each of the target class for all points of the dataset.
        
        Parameters
            X {np.ndarray} -- the dataset (2-dimensional)
            
        Returns 
            np.ndarray (2-dimensional) : the array of probabilities of each class for each point in the dataset.
        """
        predict_probas = np.array([self._predict_single_x(x) for x in X])
        return predict_probas

    
    def predict(self, X):
        """
        Returns the predicted class for all points of the dataset.
        
        Parameters
            X {np.ndarray} -- the dataset (2-dimensional)
            
        Returns 
            np.ndarray (1-dimensional) : the predicted class for each point in the dataset.
        """
        pred_proba = self.predict_proba(X)
        pred_class = np.array([np.argmax(pred_proba[i]) for i in range(X.shape[0])])
        
        return pred_class
    
    def score(self, X, y):
        """
        Returns the accuracy score for the model on the dataset X, y.
        
        Parameters
            X {np.ndarray} -- the dataset (2-dimensional)
            y {np.ndarray} -- the target (1-dimensional)
        
        Returns
            float : accuracy score of the model.
        """
        pass

#### 1.2. Testing your model on the iris dataset

In [230]:
# TODO : load data

In [231]:
from sklearn.datasets import load_iris

iris_df = load_iris()

In [232]:
# TODO : split data

In [233]:
X = iris_df.data

In [234]:
y = iris_df.target

In [235]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)

In [236]:
# TODO : fit model

In [237]:
gnb = GaussianNaiveBayes()

In [238]:
gnb.fit(X_train, y_train)

In [239]:
# TODO : test the model

In [240]:
gnb.predict_proba(X_test)

array([[7.55481259e-300, 5.58061021e-018, 3.30345367e-006],
       [3.06849554e-004, 7.20333756e-017, 2.05647668e-024],
       [1.49456263e-138, 1.32118101e-004, 1.65591630e-003],
       [1.86437861e-076, 7.23859825e-003, 2.78767088e-006],
       [4.34786753e-105, 2.64632495e-003, 6.16422130e-005],
       [1.09025954e-159, 1.36515706e-005, 1.77656247e-003],
       [1.78315051e-005, 1.79157380e-022, 4.30709230e-031],
       [7.04354214e-003, 2.67465063e-018, 2.20076713e-026],
       [1.05613983e-193, 2.42507768e-010, 2.83570923e-003],
       [3.17021373e-003, 3.27848956e-018, 9.35196604e-027],
       [1.68563053e-003, 2.02345885e-019, 5.18853172e-028],
       [6.61677005e-003, 5.86551119e-019, 4.54303513e-027],
       [2.65193397e-003, 4.67386986e-017, 2.50838329e-025],
       [2.79803418e-173, 3.99113723e-006, 4.71553188e-003],
       [6.58950835e-068, 5.80894014e-003, 2.96556140e-007],
       [6.89751067e-088, 4.57320296e-003, 1.28443948e-006],
       [2.73680833e-113, 2.00960323e-003

In [241]:
y_pred = gnb.predict(X_test)

In [243]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

0.9666666666666667

### 2. Testing the sklearn version of the GaussianNB on the iris dataset

In [244]:
# TODO : test that you have the same results using the sklearn version of the GaussianNB

In [245]:
from sklearn.naive_bayes import GaussianNB
gnb2 = GaussianNB()
gnb2.fit(X_train, y_train)
y_pred = gnb2.predict(X_test)
accuracy_score(y_test, y_pred)

0.9666666666666667

Results are the same between the two models.