# Challenge - Recoding a GaussianNB from scratch

![](https://images.unsplash.com/photo-1530639834082-05bafb67fbbe?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1050&q=80)

## Objectives

In this challenge, you will recode the Gaussian Naive Bayes algorithm **from scratch**. <br>
This is not an easy exercise, but coding ML algorithms yourself is the best way of making sure you have a good understanding of them :)

## Guidelines

A quick reminder on **Naive Bayes classification** :
1. The NB classifier is based on the **Bayes Theorem** :

$$ Pr(Y=y_k | X=x) = \frac{Pr(X=x | Y=y_k)Pr(Y=y_k)}{Pr(X=x)}$$

In the above formula :
- $Pr(Y=y_k|X=x)$ is **the probability of each class given the known features**
- $Pr(X=x|Y=y_k)$ is **the probability of the features given the class**
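- $Pr(Y=y_k)$ is **the prior probability of the class, i.e. how often it appears in the given dataset**
- $Pr(X=x)$ **does not actually need to be computed** to find the most likely class: it is the same for every class, so it does not change which class has the highest probability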

Using the Bayes formula, **the algorithm computes the probability of each class for a given set of features, and returns the class with the highest computed probability**.
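
To see why the denominator can be skipped when picking the winning class, here is a minimal sketch. The likelihood and prior values are made up for illustration and do not come from any dataset.

```python
import numpy as np

# Hypothetical numbers for a 3-class problem and one observation x:
# likelihoods[k] stands for Pr(X=x | Y=y_k), priors[k] for Pr(Y=y_k).
likelihoods = np.array([0.02, 0.10, 0.05])
priors = np.array([0.5, 0.3, 0.2])

# Numerator of the Bayes formula, one value per class.
unnormalized = likelihoods * priors

# Pr(X=x) is the sum of the numerators over all classes.
evidence = unnormalized.sum()
posteriors = unnormalized / evidence

# Dividing by the same positive constant does not change which class wins.
best_unnormalized = int(np.argmax(unnormalized))
best_posterior = int(np.argmax(posteriors))
```

Both `argmax` calls select the same class. The division only matters if you want values that sum to 1, which is what you would typically expect from `predict_proba`.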

2. $Pr(Y=y_k)$, the prior probability of the class, is very easy to compute : it is just **the proportion of training observations that belong to each class**.
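
As a rough illustration, the class proportions can be obtained with NumPy as follows. The label array `y` is a toy example, not the iris targets.

```python
import numpy as np

# Toy target array (hypothetical labels, for illustration only).
y = np.array([0, 0, 1, 2, 2, 2, 1, 0, 2, 2])

# np.unique returns the sorted distinct classes and how often each occurs.
classes, counts = np.unique(y, return_counts=True)

# Prior of each class = its share of the training observations.
priors = counts / y.size
```

Here `priors[k]` lines up with `classes[k]`, and the priors sum to 1 by construction.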

3. $Pr(X=x|Y=y_k)$, the probability of the features given the class, is a little bit harder to compute, but no stress, let's go step by step :
- Assuming the features are independent of each other given the class (this is the "naive" part of Naive Bayes), the probability of all features given the class is equal to the product of **the probability of each single feature given the class**.
- And **the probability of each single feature given the class can be computed using the following (Gaussian) formula** :

$$ P(x_i \mid Y=y_k) = \frac{1}{\sqrt{2\pi\sigma_k^2}} \exp\left(-\frac{(x_i - \mu_k)^2}{2\sigma_k^2}\right) $$

In the above formula :
- $\mu_k$ is the mean of the feature $x_i$ over the training observations of class $y_k$
- $\sigma_k$ is the standard deviation of the feature $x_i$ over the training observations of class $y_k$
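
The two steps above (one Gaussian value per feature, then a product over features) can be sketched as follows. The helper name `gaussian_density` and all numeric values are assumptions made for this illustration; they are not part of the class skeleton provided later.

```python
import numpy as np

def gaussian_density(x, mu, sigma):
    """Gaussian density of x for a feature with mean mu and standard deviation sigma."""
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

# Hypothetical per-feature statistics for one class (two features).
mu = np.array([5.0, 3.4])
sigma = np.array([0.35, 0.38])

# One observation with two features.
x = np.array([5.1, 3.5])

# Value of the Gaussian formula for each feature taken separately...
per_feature = gaussian_density(x, mu, sigma)

# ...then the "naive" independence assumption: multiply them together.
likelihood = per_feature.prod()
```

Note that the formula returns a probability density rather than a probability, so individual values can be larger than 1 when $\sigma$ is small. This is harmless for classification, since only the comparison between classes matters.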

⚠️ **The values of $\mu$ and $\sigma$ for each feature and each class, and the prior probability of each class over the training dataset, are what the algorithm learns during the fitting stage.**
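
One possible way to compute the per-class mean and standard deviation is to select the rows of one class with a boolean mask. This is only a sketch on a toy dataset; the values of `X` and `y` are hypothetical.

```python
import numpy as np

# Toy dataset: 6 observations, 2 features, 2 classes (hypothetical values).
X = np.array([[1.0, 10.0],
              [2.0, 12.0],
              [3.0, 14.0],
              [6.0, 20.0],
              [8.0, 24.0],
              [10.0, 28.0]])
y = np.array([0, 0, 0, 1, 1, 1])

target = 1

# Boolean mask selecting only the observations of the chosen class.
mask = (y == target)

# axis=0 aggregates over the rows, leaving one value per feature.
mu_target = X[mask].mean(axis=0)
sigma_target = X[mask].std(axis=0)
```

`np.std` uses the population formula by default (`ddof=0`). Whether to use that or the sample formula (`ddof=1`) is a choice you make in your own implementation.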

## Some hints on how to proceed :

Create a class `GaussianNaiveBayes`, that will implement the following methods :
- `_get_target_proba` : Returns the prior probability of a given target in a given dataset, i.e. the proportion of observations in the dataset that belong to that target.
- `_get_mu` : Returns the mean of each feature of a dataset for all observations corresponding to a given target.
- `_get_sigma` : Returns the standard deviation of each feature of a dataset for all observations corresponding to a given target.
- `fit` : Fits the algorithm, i.e. stores as attributes :
    - self.target_probas -- the prior probability of each target
    - self.mus -- the mean of each feature for each target
    - self.sigma -- the standard deviation of each feature for each target
- `_get_single_feature_probability` : Returns the likelihood of a single feature value, given the target. Uses the Gaussian probability density function.
- `_predict_single_x` : Returns the probabilities of belonging to each target class for a given data point.
- `predict_proba` : Returns the probabilities of belonging to each target class for all points of the dataset.
- `predict` : Returns the predicted class for all points of the dataset.
- `score` : Returns the accuracy score for the model on the dataset X, y.

The architecture of the class, including the methods to complete and their docstrings, is provided below. **It is not mandatory for you to follow this architecture exactly**: if you have another idea on how to code the algorithm, feel free to follow it.

⚠️ One piece of advice : test your code on a simple dataset, such as the iris dataset from scikit-learn.
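
If you are unsure how to get such a dataset, the following sketch loads iris with scikit-learn and holds out a test set. The `test_size` and `random_state` values, and the use of `stratify`, are arbitrary choices made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the features (150 x 4) and the integer targets (3 classes).
X, y = load_iris(return_X_y=True)

# Keep 30% of the observations aside for testing.
# stratify=y keeps the class proportions similar in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
```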

### 1. From scratch implementation

#### 1.1. Creating the `GaussianNaiveBayes` class

In [None]:
import numpy as np


class GaussianNaiveBayes():
    
    def _get_target_proba(self, y, target):
        """
        Returns the prior probability of a given target in a given dataset,
        i.e. the proportion of observations in the dataset that belong to that target.
        
        Parameters
            y {np.ndarray} -- the target array (1-dimensional)
            target {int} -- the given target
            
        Returns
            float : the proportion of representation of the given target over the dataset.
        """
        pass
    
    
    def _get_mu(self, X, y, target):
        """
        Returns the mean of each feature of a dataset for all observations corresponding to a given target.
        
        Parameters
            X {np.ndarray} -- the features array (2-dimensional)
            y {np.ndarray} -- the target array (1-dimensional)
            target {int} -- the given target
            
        Returns
            np.ndarray (1-dim) : the mean value of each feature in the dataset for the given target.
        """
        pass
    
    
    def _get_sigma(self, X, y, target):
        """
        Returns the standard deviation of each feature of a dataset for all observations corresponding to a given target.
        
        Parameters
            X {np.ndarray} -- the features array (2-dimensional)
            y {np.ndarray} -- the target array (1-dimensional)
            target {int} -- the given target
            
        Returns
            np.ndarray (1-dim) : the standard deviation of each feature in the dataset for the given target.
        """
        pass   
    
    
    def fit(self, X, y):
        """
        Fits the algorithm, i.e. stores as attributes :
            > self.target_probas -- the prior probability of each target
            > self.mus -- the mean of each feature for each target
            > self.sigma -- the standard deviation of each feature for each target
        """
        pass
        
        
    def _get_single_feature_probability(self, x, mu, sigma):
        """
        Returns the likelihood of a single feature value, given the target.
        Uses the Gaussian probability density function.
        
        Parameters
            x {float} -- a single point from a single feature
            mu {float} -- the mean of the feature from which x comes
            sigma {float} -- the standard deviation of the feature from which x comes
            
        Returns
            float : the probability of a single feature point, for a given target.
        """
        pass
    
    
    def _predict_single_x(self, x):
        """
        Returns the probabilities of belonging to each target class for a given data point.
        
        Parameters
            x {np.ndarray} -- a single observation from the dataset (1-dimensional)
            
        Returns
            np.ndarray (1-dimensional) : the array of probabilities of each class for a single point.
        """
        pass
    
    
    def predict_proba(self, X):
        """
        Returns the probabilities of belonging to each target class for all points of the dataset.
        
        Parameters
            X {np.ndarray} -- the dataset (2-dimensional)
            
        Returns 
            np.ndarray (2-dimensional) : the array of probabilities of each class for each point in the dataset.
        """
        pass

    
    def predict(self, X):
        """
        Returns the predicted class for all points of the dataset.
        
        Parameters
            X {np.ndarray} -- the dataset (2-dimensional)
            
        Returns 
            np.ndarray (1-dimensional) : the predicted class for each point in the dataset.
        """
        pass
    
    
    def score(self, X, y):
        """
        Returns the accuracy score for the model on the dataset X, y.
        
        Parameters
            X {np.ndarray} -- the dataset (2-dimensional)
            y {np.ndarray} -- the target (1-dimensional)
        
        Returns
            float : accuracy score of the model.
        """
        pass

#### 1.2. Testing your model on the iris dataset

In [None]:
# TODO : load data

In [None]:
# TODO : split data

In [None]:
# TODO : fit model

In [None]:
# TODO : test model

### 2. Testing the sklearn version of the GaussianNB on the iris dataset

In [None]:
# TODO : test that you have the same results using the sklearn version of the GaussianNB
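
As a starting point for the comparison, a reference run with scikit-learn could look like the sketch below. The split parameters are arbitrary, and you would need to reuse the exact same split for your own model for the comparison to be meaningful.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Same kind of split as for the from-scratch model (arbitrary parameters).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fit the scikit-learn reference model.
reference = GaussianNB()
reference.fit(X_train, y_train)

# Accuracy and per-class probabilities to compare against your own outputs.
reference_accuracy = reference.score(X_test, y_test)
reference_probas = reference.predict_proba(X_test)
```

The fitted priors and means are available as `reference.class_prior_` and `reference.theta_`, which can be compared with your own fitted attributes. scikit-learn also adds a small smoothing term to the variances (the `var_smoothing` parameter), so probabilities may differ from yours in the last decimals even when the predicted classes match.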