<a href="https://colab.research.google.com/github/HJoonKwon/ml_fundamentals/blob/main/NaiveBayes_Classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Naive Bayes Algorithm 

- Supervised learning (We need training data) 
- For classification 
- Based on Bayes' Theorem
- "Naive" means we assume features are independent to each other. 

### How does it work?
- Based on Bayes' theorm, we can calculate posterior probability using prior proabibility, likelihood, and evidence(or marginal probability).
 $$P(A|B) =  \frac{P(B|A)P(A)}{P(B)} $$
- We can just apply the Bayes' theorem to prediction for the probability of the output(classification or regression). Let's assume that the training data has ```n``` number of features, and we want to predict the probability of ```y``` given ```X```. Then,
 $$P(y|X) = P(y|x_1, ..., x_n) = \frac{P(x_1, .., x_n|y)P(y)}{P(x_1, ..., x_n)}  = \frac{P(x_1|y)...P(x_n|y)P(y)}{P(x_1)...P(x_n)}$$ 

 ### What kind of data can it handle?
 - Continuous (Gaussian Naive Bayes)
 - Discrete (Binary/Multinomial Naive Bayes) 

In [1]:
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer

import pandas as pd 
import numpy as np 

In [2]:
def calc_mean_var(data: pd.DataFrame, target: np.ndarray):
  mean = data.groupby(target).apply(np.mean).to_numpy()
  var = data.groupby(target).apply(np.var).to_numpy() 
  return mean, var 

def normalize(data: np.ndarray):
  data = (data - np.mean(data, axis=0))/np.std(data, axis=0)
  return data 

### 1) Prepare for the dataset 
- We are going to use the breast cancer dataset provided by scikit-learn. 
- We can see that all features are continuous, so the Gaussian Naive Bayes would be our choice. 

In [3]:
data = load_breast_cancer() 

In [4]:
print(data['data'].shape)
print(data['feature_names'])
print(data['data'][0])
print(data['target_names'])
print(data['target'])

(569, 30)
['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']
[1.799e+01 1.038e+01 1.228e+02 1.001e+03 1.184e-01 2.776e-01 3.001e-01
 1.471e-01 2.419e-01 7.871e-02 1.095e+00 9.053e-01 8.589e+00 1.534e+02
 6.399e-03 4.904e-02 5.373e-02 1.587e-02 3.003e-02 6.193e-03 2.538e+01
 1.733e+01 1.846e+02 2.019e+03 1.622e-01 6.656e-01 7.119e-01 2.654e-01
 4.601e-01 1.189e-01]
['malignant' 'benign']
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 1 0 0 0 0 0 0 0 0 1 0 1 1 1 1 

### 2) Preprocessing the data
- Scale Data (Normalization)
- Split Data into train/test sets  
- Display data in DataFrame for better understanding 

In [5]:
X = data['data']
y = data['target']
X = normalize(X)
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=42)
print(X[0])

[ 1.09706398 -2.07333501  1.26993369  0.9843749   1.56846633  3.28351467
  2.65287398  2.53247522  2.21751501  2.25574689  2.48973393 -0.56526506
  2.83303087  2.48757756 -0.21400165  1.31686157  0.72402616  0.66081994
  1.14875667  0.90708308  1.88668963 -1.35929347  2.30360062  2.00123749
  1.30768627  2.61666502  2.10952635  2.29607613  2.75062224  1.93701461]


In [6]:
columns = data['feature_names']
train_df = pd.DataFrame(data=train_X, columns=columns)
display(train_df)


Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,-1.447987,-0.456023,-1.366651,-1.150124,0.728714,0.700428,2.814833,-0.133333,1.093024,2.503828,...,-1.234044,-0.492965,-1.243893,-0.977194,0.693984,1.159269,4.700669,0.919592,2.147190,1.859432
1,1.977508,1.694187,2.089619,1.866047,1.262455,3.389643,2.007548,2.596960,2.129892,1.585220,...,2.155897,1.270634,2.062335,2.124291,0.733436,3.207003,1.946890,2.675218,1.936879,2.463465
2,-1.407089,-1.263516,-1.349763,-1.120545,-1.362838,-0.318972,-0.363081,-0.699511,1.932741,0.968562,...,-1.296169,-1.049890,-1.241212,-1.002860,-1.490797,-0.550038,-0.635617,-0.970486,0.616770,0.052877
3,-0.987600,1.380033,-0.986877,-0.875668,0.014925,-0.606466,-0.816190,-0.845247,0.311723,0.069801,...,-0.832304,1.549097,-0.872165,-0.746907,0.768505,-0.728158,-0.766109,-0.810759,0.822228,-0.137199
4,-1.123927,-1.026155,-1.129395,-0.975496,1.212639,-0.449737,-0.978777,-0.929077,3.400421,0.964310,...,-1.087016,-1.339752,-1.114026,-0.900022,-0.213419,-0.989865,-1.201820,-1.352369,1.061659,-0.207578
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
450,-1.488033,-1.082004,-1.366651,-1.168611,0.104593,0.924055,-0.034392,-0.521016,0.329977,3.827870,...,-1.353531,-1.629614,-1.331463,-1.048038,-0.511503,-0.067845,-0.617866,-1.016318,-1.046309,1.355149
451,-0.706426,-0.223317,-0.691956,-0.689379,1.269571,-0.050051,-0.227236,-0.362899,-0.038768,0.340564,...,-0.648001,0.583433,-0.647878,-0.630885,1.597003,0.074651,0.072498,0.109537,-0.153294,0.389251
452,0.046211,-0.574704,-0.068748,-0.063392,-2.282296,-1.470464,-1.023849,-1.100607,-1.108494,-1.281175,...,-0.281464,-0.818652,-0.381891,-0.344521,-2.047074,-1.297121,-1.120358,-1.237560,-0.716282,-1.260478
453,-0.041833,0.076875,-0.034972,-0.157532,0.686015,0.169787,0.298817,0.405245,-0.520693,0.374586,...,0.159621,0.834212,0.197742,-0.019835,1.268234,0.652266,0.646282,1.036837,0.450138,1.194443


### 3) Gaussian Naive Bayes Implementation 

In [7]:
class GaussianNaiveBayes():
  def __init__(self):
    self.train_X: np.ndarray 
    self.train_y: np.ndarray 

  def _gaussian_log_likelihood(self, data: pd.DataFrame, target: np.ndarray, X: np.ndarray):
    mean, var = calc_mean_var(data, target)
    log_likelihood = np.zeros(mean.shape)
    for i in range(mean.shape[0]):
      numerator = np.exp(-0.5 * ((X-mean[i])**2) / (2 * var[i]))
      denominator = np.sqrt(2* np.pi * var[i])
      log_likelihood[i] = np.log(numerator / denominator)
    return log_likelihood

  def _calc_log_prior_probs(self, data: pd.DataFrame, target:np.ndarray):
    priors = data.groupby(target).apply(lambda x: len(x) / len(data)).to_numpy()
    return np.log(priors)

  def _calc_log_posterior_probs(self, data: pd.DataFrame, target:np.ndarray, X: np.ndarray):
    priors = self._calc_log_prior_probs(data, target)
    log_likelihood = self._gaussian_log_likelihood(data, target, X)
    evidence = np.sum(log_likelihood)
    posteriors = priors + np.sum(log_likelihood, axis=1) - evidence
    return posteriors 
  
  def fit(self, X: np.ndarray, y: np.ndarray):
    self.train_X = X 
    self.train_y = y 
  
  def predict(self, X_batch):
    predictions = []  
    train_X = pd.DataFrame(self.train_X, columns=columns)
    for X in X_batch:
      log_posterior = self._calc_log_posterior_probs(data=train_X, 
                                                      target=self.train_y, 
                                                      X=X)
      predictions.append(np.argmax(log_posterior))
    return predictions 

### 4) Prediction 

In [8]:
gnb = GaussianNaiveBayes()
gnb.fit(train_X, train_y)
preds = gnb.predict(test_X)
accuracy = np.sum(preds==test_y)/test_y.shape[0]
print(f"Test accuracy is: {accuracy*100} %")

Test accuracy is: 95.6140350877193 %


## References 
- https://medium.com/@rangavamsi5/na%C3%AFve-bayes-algorithm-implementation-from-scratch-in-python-7b2cc39268b9
- https://towardsdatascience.com/implementing-naive-bayes-algorithm-from-scratch-python-c6880cfc9c41