<a href="https://colab.research.google.com/github/GiX7000/10-machine-learning-algorithms-from-scratch/blob/main/07_Naive_Bayes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Implementation of a Naive Bayes classifier

This notebook presents a quick and simple implementation of the [Naive Bayes](https://www.youtube.com/watch?v=CPqOCI0ahss) algorithm. The Naive Bayes classifier is a probabilistic classifier that applies [Bayes' theorem](https://en.wikipedia.org/wiki/Bayes%27_theorem) under a strong (naive) assumption that the features are conditionally independent given the class.
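
In symbols, the decision rule can be sketched as follows. The notation ($n$ features, per-class mean $\mu_{y,i}$ and variance $\sigma_{y,i}^{2}$) is introduced here for illustration, and the Gaussian form of the class-conditional density is the choice made by the implementation in section 2, not a requirement of Naive Bayes in general.

$$
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
        = \arg\max_{y} \left[ \log P(y) + \sum_{i=1}^{n} \log P(x_i \mid y) \right]
$$

$$
P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_{y,i}^{2}}} \exp\!\left( -\frac{(x_i - \mu_{y,i})^{2}}{2\sigma_{y,i}^{2}} \right)
$$

Taking logs turns the product into a sum, which avoids multiplying many small numbers together.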

In [1]:
# imports
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets
from collections import Counter

## 1. Create a dataset and split it into train and test sets.

In [2]:
# make a classification dataset
X, y = datasets.make_classification(n_samples=1000, n_features=10, n_classes=2, random_state=123)

# split it into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234)

# let's see some things about the data
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
print(X_train.dtype, X_test.dtype)
print(X_train[5])
print(y_train[5])
print(np.unique(y_train))

(800, 10) (800,) (200, 10) (200,)
float64 float64
[-0.14670394  0.68624262 -0.59005397  0.36530023 -0.82320836  0.12463487
 -0.69878555 -0.73534754 -0.69333204 -1.25232507]
0
[0 1]


## 2. Create a Naive Bayes model.

In [3]:
class NaiveBayes:

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self._classes = np.unique(y)
        n_classes = len(self._classes)

        # calculate mean, var, and prior for each class
        self._mean = np.zeros((n_classes, n_features), dtype=np.float64)
        self._var = np.zeros((n_classes, n_features), dtype=np.float64)
        self._priors = np.zeros(n_classes, dtype=np.float64)

        for idx, c in enumerate(self._classes):
            X_c = X[y == c]
            self._mean[idx, :] = X_c.mean(axis=0)
            self._var[idx, :] = X_c.var(axis=0)
            self._priors[idx] = X_c.shape[0] / float(n_samples) # num of samples per class / total number of samples


    def predict(self, X):
        y_pred = [self._predict(x) for x in X]
        return np.array(y_pred)

    def _predict(self, x):
        posteriors = []

        # calculate the log posterior for each class (log prior + sum of log class-conditional likelihoods)
        for idx, c in enumerate(self._classes):
            prior = np.log(self._priors[idx])
            posterior = np.sum(np.log(self._pdf(idx, x)))
            posterior = posterior + prior
            posteriors.append(posterior)

        # return class with the highest posterior
        return self._classes[np.argmax(posteriors)]

    # Gaussian probability density function for the class-conditional likelihoods P(x_i | y)
    def _pdf(self, class_idx, x):
        mean = self._mean[class_idx]
        var = self._var[class_idx]
        numerator = np.exp(-((x - mean) ** 2) / (2 * var))
        denominator = np.sqrt(2 * np.pi * var)
        return numerator / denominator
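
One caveat worth illustrating: taking `np.log` of a density that has already underflowed to zero yields `-inf`, and a feature with zero variance leads to a division by zero. The sketch below computes the log-density directly instead. The helper name `gaussian_log_pdf` and the smoothing term `eps` are assumptions made for this illustration and are not part of the class above.

```python
import numpy as np

def gaussian_log_pdf(x, mean, var, eps=1e-9):
    """Per-feature log of the Gaussian density.

    `eps` is a hypothetical smoothing term that keeps the variance
    strictly positive; it is not part of the original class.
    """
    var = var + eps
    return -0.5 * np.log(2.0 * np.pi * var) - (x - mean) ** 2 / (2.0 * var)

# The second feature lies 50 standard deviations from the mean.
x = np.array([0.0, 50.0])
mean = np.array([0.0, 0.0])
var = np.array([1.0, 1.0])

# Density first, log second (as in _pdf followed by np.log): underflows.
with np.errstate(divide="ignore"):
    naive = np.log(np.exp(-((x - mean) ** 2) / (2 * var)) / np.sqrt(2 * np.pi * var))

# Log-density computed directly: stays finite.
stable = gaussian_log_pdf(x, mean, var)

print(naive)   # second entry is -inf
print(stable)  # both entries are finite
```

On the dataset used in this notebook the features are well scaled, so the original `_pdf` works as written; the direct form mainly matters for outliers or constant features.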

## 3. Train, predict and evaluate the model.

In [4]:
# create an instance of our classifier
clf_naive_bayes = NaiveBayes()

# train the classifier
clf_naive_bayes.fit(X_train, y_train)
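
To make concrete what `fit` stores, here is a minimal sketch that repeats the same per-class statistics on a tiny hand-made dataset. The arrays `X_toy` and `y_toy` and the `stats` dictionary are hypothetical and exist only for this illustration.

```python
import numpy as np

# Hypothetical toy data: 5 samples, 2 features, 2 classes
X_toy = np.array([[1.0, 2.0], [3.0, 4.0], [10.0, 20.0], [12.0, 24.0], [14.0, 22.0]])
y_toy = np.array([0, 0, 1, 1, 1])

stats = {}
for c in np.unique(y_toy):
    X_c = X_toy[y_toy == c]                      # rows belonging to class c
    stats[int(c)] = {
        "mean": X_c.mean(axis=0),                # per-feature mean
        "var": X_c.var(axis=0),                  # per-feature population variance
        "prior": X_c.shape[0] / X_toy.shape[0],  # class frequency
    }

for c, s in stats.items():
    print(c, s["mean"], s["var"], s["prior"])
```

For class 0 this gives a mean of `[2, 3]`, a variance of `[1, 1]` and a prior of `0.4`, which can be checked by hand.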

In [5]:
# predict on test set
predictions = clf_naive_bayes.predict(X_test)

# Print the predictions and the actual labels
print("Predictions:", predictions)
print("Actual Labels:", y_test)

Predictions: [0 1 0 1 1 0 0 1 1 1 1 1 0 0 1 0 0 0 1 0 1 0 0 1 0 1 1 0 0 0 0 0 0 1 1 0 1
 1 0 1 0 1 1 0 0 0 0 1 0 0 1 0 1 1 1 1 0 0 0 1 0 0 0 0 1 1 1 1 1 0 1 1 0 0
 1 0 1 1 1 0 0 1 0 1 0 1 0 0 0 1 1 1 1 0 1 0 0 0 1 1 1 1 1 1 0 1 1 1 0 0 1
 0 0 0 1 1 0 1 1 1 1 1 1 0 1 1 1 0 1 0 1 1 1 0 0 0 1 0 1 1 1 0 0 0 0 1 0 0
 1 1 0 0 1 1 1 1 1 1 0 1 1 1 0 0 1 1 0 0 0 1 0 0 1 1 1 0 1 1 0 0 0 1 1 0 0
 0 0 1 0 1 0 1 1 0 0 1 0 0 0 1]
Actual Labels: [0 1 0 1 1 0 0 1 1 1 1 1 0 0 1 0 0 0 1 0 1 0 0 1 0 1 0 0 0 0 0 0 0 1 1 0 1
 1 0 1 0 1 1 0 0 0 0 1 0 0 1 0 1 1 1 1 0 0 0 1 0 0 0 0 1 1 1 1 1 0 0 1 0 0
 1 1 1 1 1 0 0 1 0 1 0 1 0 0 0 1 1 1 1 0 1 0 0 0 1 1 1 1 1 1 0 1 1 1 0 0 1
 0 0 0 1 1 0 1 1 1 1 0 1 0 1 1 1 0 1 0 1 1 1 0 1 0 1 0 1 1 1 0 0 0 0 1 1 0
 1 1 0 0 1 1 1 1 1 0 0 1 1 1 0 0 1 1 0 0 1 1 0 0 1 1 1 0 1 1 0 0 0 1 0 0 0
 0 1 1 0 1 0 1 1 1 0 1 0 0 0 1]
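
The `predict` method loops over the samples in Python. As a sketch of an alternative, the same log-posterior can be computed for all samples at once with NumPy broadcasting. The function `predict_vectorized` is a hypothetical standalone helper; its arguments mirror the `_classes`, `_mean`, `_var` and `_priors` attributes of the class, and the toy parameters below are made up for the example.

```python
import numpy as np

def predict_vectorized(X, classes, means, variances, priors):
    """Hypothetical batch version of the per-sample prediction.

    X: (n_samples, n_features); means, variances: (n_classes, n_features).
    """
    # Difference of every sample to every class mean: (n_samples, n_classes, n_features)
    diff = X[:, None, :] - means[None, :, :]
    # Per-feature Gaussian log-likelihoods
    log_lik = -0.5 * (np.log(2.0 * np.pi * variances)[None, :, :]
                      + diff ** 2 / variances[None, :, :])
    # Log prior plus the sum of log-likelihoods over features: (n_samples, n_classes)
    log_post = np.log(priors)[None, :] + log_lik.sum(axis=2)
    return classes[np.argmax(log_post, axis=1)]

# Made-up parameters: two well-separated classes with unit variance
classes = np.array([0, 1])
means = np.array([[0.0, 0.0], [5.0, 5.0]])
variances = np.ones((2, 2))
priors = np.array([0.5, 0.5])

X_new = np.array([[0.1, -0.2], [4.8, 5.3], [1.0, 1.0]])
pred = predict_vectorized(X_new, classes, means, variances, priors)
print(pred)
```

With 200 test samples the loop is fast enough, so this is an optional refinement.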


Let's see how good our model is.

In [6]:
# accuracy function
def accuracy(y_true, y_pred):
    return np.sum(y_true == y_pred) / len(y_true)

print(accuracy(y_test, predictions))

0.945


We got a high accuracy of 94.5%, which means that 189 of the 200 test samples were classified correctly.
Let's now compare it with the accuracy of the scikit-learn implementation.
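
Before moving on, note that accuracy is a single number and does not show which class the errors fall in. A confusion matrix gives that breakdown. The sketch below uses a hypothetical helper `confusion_counts` and made-up label arrays, not the notebook's actual predictions.

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Hypothetical helper: rows are true classes, columns are predicted classes."""
    classes = np.unique(np.concatenate([y_true, y_pred]))
    index = {int(c): i for i, c in enumerate(classes)}
    cm = np.zeros((len(classes), len(classes)), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[index[int(t)], index[int(p)]] += 1
    return classes, cm

# Made-up labels for illustration only
y_true_toy = np.array([0, 0, 1, 1, 1, 0])
y_pred_toy = np.array([0, 1, 1, 1, 0, 0])

classes, cm = confusion_counts(y_true_toy, y_pred_toy)
acc = np.trace(cm) / cm.sum()  # the diagonal holds the correct predictions

print(cm)
print(acc)
```

The accuracy is the sum of the diagonal divided by the total count, so it agrees with the `accuracy` function defined earlier.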

## 4. Compare with the Naive Bayes classifier from the scikit-learn library.

Let's now see what results the [Gaussian Naive Bayes classifier](https://scikit-learn.org/stable/modules/naive_bayes.html) from the scikit-learn library gives us.

In [7]:
# let's compare now with the accuracy that sklearn gives us
from sklearn.naive_bayes import GaussianNB

clf_sklearn = GaussianNB()
clf_sklearn.fit(X_train, y_train)
sklearn_predictions = clf_sklearn.predict(X_test)

In [8]:
# let's check the accuracy now
print(accuracy(y_test, sklearn_predictions))

0.945


We got exactly the same accuracy as with our custom Naive Bayes classifier. Another great tutorial is [here](https://machinelearningmastery.com/classification-as-conditional-probability-and-the-naive-bayes-algorithm/).