
## Naive Bayes Algorithm 

- Supervised learning (We need training data) 
- For classification 
- Based on Bayes' Theorem
- "Naive" means we assume the features are conditionally independent of each other given the class. 

### How does it work?
- Based on Bayes' theorem, we can calculate the posterior probability from the prior probability, the likelihood, and the evidence (or marginal probability).
 $$P(A|B) =  \frac{P(B|A)P(A)}{P(B)} $$
- We can apply Bayes' theorem to predict the probability of each output class. Let's assume that the training data has ```n``` features, and we want to predict the probability of ```y``` given ```X```. Under the naive assumption the likelihood factorizes over the features, so
 $$P(y|X) = P(y|x_1, ..., x_n) = \frac{P(x_1, ..., x_n|y)P(y)}{P(x_1, ..., x_n)}  = \frac{P(x_1|y)...P(x_n|y)P(y)}{P(x_1, ..., x_n)}$$ 
- The expression above multiplies many small probabilities together, which can cause underflow in numerical calculation. It is better to take the log of the product, which turns it into a sum, to avoid underflow. 
 $$\log P(y|X) = [\sum_{i=1}^{n} \log P(x_i|y)] + \log P(y) - \log P(x_1, ..., x_n) $$
- For each ```y``` value, we can calculate ```P(y|X)``` and pick the ```y``` that makes ```P(y|X)``` the largest. We can simplify the maximization by dropping the denominator (the evidence) from the equation above, because it does not depend on ```y```. 
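
The underflow problem and the argmax step can be seen in a small NumPy sketch. The likelihood values and priors below are made-up numbers for illustration only; they are not taken from the dataset used later.

```python
import numpy as np

# Hypothetical per-feature likelihoods P(x_i | y) for 2 classes and 1000 features.
rng = np.random.default_rng(0)
likelihoods = rng.uniform(0.01, 0.2, size=(2, 1000))
priors = np.array([0.4, 0.6])  # hypothetical P(y)

# Direct product of 1000 small numbers: underflows to 0.0 in float64.
direct = likelihoods.prod(axis=1) * priors

# Log-space: the sum of logs plus the log prior stays finite.
log_scores = np.log(likelihoods).sum(axis=1) + np.log(priors)

# The evidence term is identical for every class, so it is dropped
# before taking the argmax.
predicted_class = int(np.argmax(log_scores))

print(direct)
print(log_scores)
print(predicted_class)
```

With the direct product both classes collapse to the same value, so no comparison is possible, while the log scores remain distinguishable.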

 ### What kind of data can it handle?
 - Continuous (Gaussian Naive Bayes)
 - Discrete (Bernoulli/Multinomial Naive Bayes) 
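
As a rough illustration of how the data type maps to a model, the sketch below fits the three corresponding scikit-learn estimators on tiny made-up arrays. The feature values are hypothetical and only chosen to make the two classes easy to tell apart.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, BernoulliNB, MultinomialNB

y = np.array([0, 0, 1, 1])

# Continuous features -> Gaussian likelihood per feature and class.
X_cont = np.array([[1.0, 2.1], [0.9, 1.9], [3.0, 4.2], [3.2, 3.8]])
pred_gaussian = GaussianNB().fit(X_cont, y).predict([[1.1, 2.0]])

# Binary (0/1) features -> Bernoulli likelihood.
X_bin = np.array([[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 1, 1]])
pred_bernoulli = BernoulliNB().fit(X_bin, y).predict([[1, 0, 0]])

# Non-negative counts (e.g. word counts) -> multinomial likelihood.
X_cnt = np.array([[3, 0, 1], [4, 1, 0], [0, 2, 5], [1, 0, 6]])
pred_multinomial = MultinomialNB().fit(X_cnt, y).predict([[2, 0, 0]])

print(pred_gaussian, pred_bernoulli, pred_multinomial)
```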

In [50]:
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.naive_bayes import GaussianNB

import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 

In [2]:
%matplotlib inline
plt.rcParams['figure.figsize'] = (5.0, 4.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

%load_ext autoreload
%autoreload 2

np.random.seed(42)

In [3]:
from ml_algorithms.naive_bayes import * 

### 1) Prepare the dataset 
- We are going to use the breast cancer dataset provided by scikit-learn. 
- All features are continuous, so Gaussian Naive Bayes is a natural choice. 

In [4]:
data = load_breast_cancer() 

In [5]:
print(data['data'].shape)
print(data['feature_names'])
print(data['data'][0])
print(data['target_names'])
print(data['target'])

(569, 30)
['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']
[1.799e+01 1.038e+01 1.228e+02 1.001e+03 1.184e-01 2.776e-01 3.001e-01
 1.471e-01 2.419e-01 7.871e-02 1.095e+00 9.053e-01 8.589e+00 1.534e+02
 6.399e-03 4.904e-02 5.373e-02 1.587e-02 3.003e-02 6.193e-03 2.538e+01
 1.733e+01 1.846e+02 2.019e+03 1.622e-01 6.656e-01 7.119e-01 2.654e-01
 4.601e-01 1.189e-01]
['malignant' 'benign']
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 1 0 0 0 0 0 0 0 0 1 0 1 1 1 1 

### 2) Preprocess the data
- Scale Data (Normalization)
- Split Data into train/test sets  
- Display data in DataFrame for better understanding 
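
The scaling and DataFrame steps listed above could look roughly like the sketch below. The `normalize` helper used in this notebook comes from the author's `ml_algorithms` module and its source is not shown here, so the z-score formula is an assumption about what it does, not a copy of it.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X, y = data["data"], data["target"]

# Assumed behaviour of the scaling step: z-score each feature column
# (zero mean, unit standard deviation).
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Wrap the scaled array in a DataFrame so columns carry the feature names.
df = pd.DataFrame(X_std, columns=data["feature_names"])
df["target"] = y

print(df.iloc[:, :4].head())
print(df["target"].value_counts())
```

Viewing the data this way makes it easy to check the class balance and to spot features with unusual ranges before fitting.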

In [47]:
X = data['data']
y = data['target']
X = normalize(X)
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=42)
print(X[0])
print(f'train_X: {train_X.shape}')
print(f'test_X: {test_X.shape}')
print(f'train_y: {train_y.shape}')
print(f'test_y: {test_y.shape}')

[ 1.09706398 -2.07333501  1.26993369  0.9843749   1.56846633  3.28351467
  2.65287398  2.53247522  2.21751501  2.25574689  2.48973393 -0.56526506
  2.83303087  2.48757756 -0.21400165  1.31686157  0.72402616  0.66081994
  1.14875667  0.90708308  1.88668963 -1.35929347  2.30360062  2.00123749
  1.30768627  2.61666502  2.10952635  2.29607613  2.75062224  1.93701461]
train_X: (455, 30)
test_X: (114, 30)
train_y: (455,)
test_y: (114,)


### 3) Gaussian Naive Bayes Implementation 

In [48]:
import inspect
lines = inspect.getsource(GaussianNaiveBayes)
print(lines)

class GaussianNaiveBayes():

    def __init__(self):
        self.train_X: np.ndarray
        self.train_Y: np.ndarray
        self.classes: np.ndarray

    def log_likelihood(self, X: np.ndarray) -> np.ndarray:

        # X: (m x n)
        # log(P(X|Y))
        means, stds = mean_and_std(self.train_X, self.train_Y)
        log_likelihood = np.zeros((means.shape[0], X.shape[0]))
        for i in range(means.shape[0]):
            likelihood = 1 / np.sqrt(2 * np.pi) / stds[i] * np.exp(
                -0.5 * np.square((X - means[i]) / stds[i]))
            log_likelihood[i] = np.sum(np.log(likelihood),
                                       axis=1).reshape(1, -1)
        return log_likelihood

    def log_priors(self):
        # log(P(Y))
        priors = np.zeros(self.classes.shape)
        for cls in self.classes:
            priors[cls] = np.count_nonzero(self.train_Y == cls) / len(
                self.train_Y)
        return np.log(priors).reshape(-1, 1)

    def log_scores(self, 
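
The printed source above is cut off, so the following is a separate, minimal sketch of Gaussian Naive Bayes rather than the author's class. The function names `fit_gaussian_nb` and `predict_gaussian_nb` are hypothetical. It writes the log of the Gaussian density directly, which avoids evaluating `exp` and then taking `log` of a value that may already have underflowed.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate per-class means, standard deviations and log priors."""
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    stds = np.array([X[y == c].std(axis=0) for c in classes])
    log_priors = np.log(np.array([(y == c).mean() for c in classes]))
    return classes, means, stds, log_priors

def predict_gaussian_nb(X, classes, means, stds, log_priors):
    """Return the class with the highest log posterior (up to a constant)."""
    # X[:, None, :] has shape (m, 1, n); means and stds have shape (k, n).
    z = (X[:, None, :] - means) / stds
    # Log of the Gaussian density, written out so exp() is never evaluated.
    log_pdf = -0.5 * np.log(2 * np.pi) - np.log(stds) - 0.5 * z**2
    log_scores = log_pdf.sum(axis=2) + log_priors  # shape (m, k)
    return classes[np.argmax(log_scores, axis=1)]

# Toy data: two well-separated 2-D blobs (made-up numbers).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(5.0, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

params = fit_gaussian_nb(X, y)
preds = predict_gaussian_nb(X, *params)
accuracy = (preds == y).mean()
print(accuracy)
```

This sketch does no variance smoothing, so a feature with zero variance inside a class would produce a division by zero; a production implementation would need to guard against that.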

### 4) Prediction 

In [57]:
gnb = GaussianNaiveBayes()
gnb.fit(train_X, train_y.reshape(-1, 1))
my_preds = gnb.predict(test_X)
my_accuracy = np.sum(my_preds==test_y)/test_y.shape[0]
print(f"Test accuracy is: {my_accuracy*100} %")

Test accuracy is: 96.49122807017544 %


### 5) Validation 
- Check that the implemented model gives the same predictions as the Gaussian Naive Bayes model in scikit-learn. 

In [52]:
validate_model = GaussianNB()
validate_model.fit(train_X, train_y)
preds = validate_model.predict(test_X)
accuracy = np.sum(preds==test_y)/test_y.shape[0]
print(f"Test accuracy is: {accuracy*100} %")

Test accuracy is: 96.49122807017544 %


In [58]:
assert np.allclose(my_accuracy, accuracy)
assert np.sum(my_preds == preds)/len(my_preds) == 1.0
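
Matching predictions is one check; another is to compare fitted parameters. The sketch below, which is independent of the custom class, computes per-class means and priors by hand and compares them with scikit-learn's fitted `theta_` and `class_prior_` attributes. It uses the unscaled features for brevity, which is an assumption that differs from the preprocessing above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)
train_X, test_X, train_y, test_y = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = GaussianNB().fit(train_X, train_y)

# Per-class means and priors computed by hand from the training split.
manual_means = np.array(
    [train_X[train_y == c].mean(axis=0) for c in model.classes_])
manual_priors = np.array([(train_y == c).mean() for c in model.classes_])

means_match = np.allclose(model.theta_, manual_means)
priors_match = np.allclose(model.class_prior_, manual_priors)
print(means_match, priors_match)
```

Variances are left out of the comparison because scikit-learn adds a small `var_smoothing` term to them for numerical stability, so they can differ slightly from plain per-class variances.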
