# Naive Bayes
A Naive Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem with strong (naive) independence assumptions.

![Bayes Theorem](http://cstwiki.wtb.tue.nl/images/Bayes_rule.png)

Naive Bayes has several variants, each assuming a different distribution for the features. In this notebook we implement Gaussian Naive Bayes, which assumes that each feature follows a Gaussian (normal) distribution within each class.

Basically, what changes from the vanilla version is how we estimate the likelihood P(x_i|y), now that x_i is continuous.

![Gaussian Formula](https://iq.opengenus.org/content/images/2020/02/Screenshot_6.jpg)
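
As a quick sanity check of the Gaussian density formula, here is a minimal sketch using only NumPy. The function name `gaussian_pdf` is chosen for illustration and is not part of the notebook's class.

```python
import numpy as np

def gaussian_pdf(x, mean, stdev):
    """Density of N(mean, stdev^2) at x; works elementwise on arrays."""
    coeff = 1.0 / (np.sqrt(2.0 * np.pi) * stdev)
    return coeff * np.exp(-((x - mean) ** 2) / (2.0 * stdev ** 2))

# At the mean of a standard normal the density is 1 / sqrt(2 * pi).
peak = gaussian_pdf(0.0, 0.0, 1.0)
# One standard deviation away, the density drops by a factor of exp(-0.5).
one_sd = gaussian_pdf(1.0, 0.0, 1.0)
print(peak, one_sd)
```

Note that the result is a density, not a probability, so it can be larger than 1 when the standard deviation is small.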

Step by step:
* Estimate the prior p(y_i) of each class from its relative frequency in the training examples
* For each feature in X, estimate its mean and standard deviation separately for each class
* To predict a new example, compute p(y_i) * p(x|y_i) for each class and select the class with the highest value.

![Inference](https://ogrisel.github.io/scikit-learn.org/sklearn-tutorial/_images/math/f936a04ed9ff39ee17b12b68d8782b78016efe44.png)
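
The three steps can be traced by hand on a tiny example. The sketch below uses made-up toy data with a single feature and two classes; the variable names are illustrative only.

```python
import numpy as np

# Toy data (made up for illustration): one feature, two classes.
X = np.array([1.0, 1.2, 0.8, 3.0, 3.2, 2.8])
y = np.array([0, 0, 0, 1, 1, 1])

# Step 1: class priors from relative frequencies.
classes = np.unique(y)
priors = np.array([np.mean(y == c) for c in classes])

# Step 2: per-class mean and standard deviation of the feature.
means = np.array([X[y == c].mean() for c in classes])
stds = np.array([X[y == c].std() for c in classes])

def gaussian_pdf(x, mean, stdev):
    coeff = 1.0 / (np.sqrt(2.0 * np.pi) * stdev)
    return coeff * np.exp(-((x - mean) ** 2) / (2.0 * stdev ** 2))

# Step 3: score each class with prior * likelihood and take the argmax.
x_new = 1.1
scores = priors * gaussian_pdf(x_new, means, stds)
prediction = classes[np.argmax(scores)]
print(prediction)  # 0, since 1.1 lies close to the class-0 mean of 1.0
```

With several features, the likelihood term becomes the product of the per-feature densities, which is where the "naive" independence assumption enters.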

In [1]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report, confusion_matrix
import numpy as np
import pandas as pd



In [2]:
X, Y = load_breast_cancer(return_X_y=True)

In [3]:
X.shape

(569, 30)

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, train_size=0.7)

In [5]:
class GaussianNaiveBayes():
    def calculate_probability(self, x, mean, stdev):
        # Gaussian density of x under N(mean, stdev^2), evaluated per feature
        exponent = np.exp(-((x - mean) ** 2 / (2 * stdev ** 2)))
        return (1 / (np.sqrt(2 * np.pi) * stdev)) * exponent

    def fit(self, X, Y):
        self.mus = {}
        self.stds = {}
        self.class_freq = {}
        for c in np.unique(Y):
            # Class prior p(y=c), then per-feature mean and std within class c
            self.class_freq[c] = np.mean(Y == c)
            self.mus[c] = np.mean(X[Y == c], axis=0)
            self.stds[c] = np.std(X[Y == c], axis=0)

    def predict_proba(self, x):
        # Returns unnormalized scores p(y=c) * prod_i p(x_i | y=c), one per class
        probas = {}
        for c in self.class_freq.keys():
            mean = self.mus[c]
            stdev = self.stds[c]
            probas[c] = self.calculate_probability(x, mean, stdev).prod() * self.class_freq[c]
            # A more numerically stable approach is to sum the logs of the terms:
            # probas[c] = np.log(self.calculate_probability(x, mean, stdev)).sum() + np.log(self.class_freq[c])
        return probas

    def predict(self, x):
        # Use self (not the global clf) and return the class label with the highest score
        probas = self.predict_proba(x)
        return max(probas, key=probas.get)
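
The comment about numerical stability can be illustrated in isolation. The sketch below uses made-up density values (not taken from the dataset) to show that a product of many small numbers underflows to zero in floating point, while the sum of their logarithms stays finite and can still be compared across classes.

```python
import numpy as np

# Made-up values: 400 per-feature densities, each equal to 0.01.
densities = np.full(400, 1e-2)

# The true product is 1e-800, far below the smallest positive float64,
# so the result underflows to exactly 0.0.
naive_product = np.prod(densities)

# Summing logs keeps the same information in a representable range.
log_sum = np.sum(np.log(densities))

print(naive_product)  # 0.0
print(log_sum)        # about -1842.07
```

Because the logarithm is monotonic, taking the argmax over log scores selects the same class as taking it over the raw products, whenever those products are representable.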

## Implemented Version

In [6]:
clf = GaussianNaiveBayes()
clf.fit(X_train, y_train)
results = np.array([clf.predict(x) for x in X_test])

In [7]:
print(classification_report(y_test, results))

              precision    recall  f1-score   support

           0       0.92      0.92      0.92        66
           1       0.95      0.95      0.95       105

    accuracy                           0.94       171
   macro avg       0.94      0.94      0.94       171
weighted avg       0.94      0.94      0.94       171



## Sklearn's Version

In [8]:
clf_sklearn = GaussianNB()
clf_sklearn.fit(X_train, y_train)
results_sklearn = clf_sklearn.predict(X_test)

In [9]:
print(classification_report(y_test, results_sklearn))

              precision    recall  f1-score   support

           0       0.92      0.91      0.92        66
           1       0.94      0.95      0.95       105

    accuracy                           0.94       171
   macro avg       0.93      0.93      0.93       171
weighted avg       0.94      0.94      0.94       171

