
# Implementing Naive Bayes without libraries

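[iris.csv](https://gist.github.com/netj/8836201) is the dataset in use. In this notebook, every computation the classifier needs (prior probabilities, per-class mean and variance, and the Gaussian likelihood) is written out explicitly; NumPy and pandas are used only for loading and handling the data.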

In [None]:
import numpy as np
import pandas as pd

In [None]:
idata = pd.read_csv("/content/IRIS.csv")
idata.head()

| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |


In [None]:
idata.dtypes

| | dtype |
|---|---|
| sepal_length | float64 |
| sepal_width | float64 |
| petal_length | float64 |
| petal_width | float64 |
| species | object |


In [None]:
idata.describe()

| | sepal_length | sepal_width | petal_length | petal_width |
|---|---|---|---|---|
| count | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 5.843333 | 3.054 | 3.758667 | 1.198667 |
| std | 0.828066 | 0.433594 | 1.76442 | 0.763161 |
| min | 4.3 | 2.0 | 1.0 | 0.1 |
| 25% | 5.1 | 2.8 | 1.6 | 0.3 |
| 50% | 5.8 | 3.0 | 4.35 | 1.3 |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 |
| max | 7.9 | 4.4 | 6.9 | 2.5 |


## Prior Probabilities
The prior probability of a class is its frequency divided by the total number of data points. The class counts can be obtained in several ways, for example with Python's `Counter` class or by iterating over the data points with NumPy; the cells below use pandas' `value_counts()`.

In [None]:
class_counts = idata['species'].value_counts().to_dict()
print(class_counts)

{'Iris-setosa': 50, 'Iris-versicolor': 50, 'Iris-virginica': 50}


In [None]:
def calculate_pp(data):
    total = len(data)
    class_counts = data['species'].value_counts().to_dict() # Count how many times each class occurs in the "species" column of the data passed to the function and convert the result to a dictionary.
    priors = {cls: count/total for cls, count in class_counts.items()} # Dictionary comprehension that maps each class (key) to its prior probability (value).
    # On the full dataset the three species are evenly represented, so each class has a prior of 1/3 (0.333 recurring); on a random subset the priors can differ slightly.
    return priors
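
The text above mentions `Counter` as an alternative way to count classes. A minimal sketch of that route follows, assuming the labels are available as a plain iterable; the helper name `priors_with_counter` and the toy labels are hypothetical and not part of the notebook.

```python
from collections import Counter

def priors_with_counter(labels):
    """Return {class: relative frequency} for an iterable of class labels."""
    counts = Counter(labels)          # class -> number of occurrences
    total = sum(counts.values())      # total number of labels
    return {cls: count / total for cls, count in counts.items()}

# Toy labels standing in for the 'species' column (hypothetical data).
toy_labels = ["a", "a", "b", "c"]
toy_priors = priors_with_counter(toy_labels)
print(toy_priors)
```

With the notebook's data, something like `priors_with_counter(idata['species'])` should give the same dictionary as `calculate_pp(idata)`.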

## Calculate Mean and Variance
We need to calculate the mean and variance ($\sigma^2$) of every feature within each class so that each feature can be modelled with a Gaussian distribution. We can then use the Gaussian probability density function to compute the likelihood of a test instance's feature values under each class.

The likelihood can also be modelled with a multinomial or a Bernoulli distribution (multinomial or Bernoulli Naive Bayes). I have decided to go with a Gaussian density function, which suits continuous features like these.

In [None]:
def calculate_m_and_v(data):
    data_dict = {}
    features = data.columns[:-1] # Every column except the last one ('species').
    for iris_class in data['species'].unique(): # .unique() makes the loop run once per class (three times here).
        class_data = data[data['species'] == iris_class][features] # Keep only the feature columns of the rows that belong to the current class.
        data_dict[iris_class] = { # Store the per-feature mean and variance under the class name.
            'mean': class_data.mean().to_dict(),
            'variance': class_data.var().to_dict(), # pandas' .var() returns the sample variance (ddof=1) by default.
        }
    return data_dict
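
One detail worth knowing about the function above: pandas' `.var()` divides by `n - 1` (sample variance) by default, whereas NumPy's `np.var` divides by `n`. The sketch below illustrates the difference on a hypothetical four-value series; with around 40 training rows per class the two estimates are close, so either is likely fine here.

```python
import numpy as np
import pandas as pd

# Hypothetical toy feature values, not taken from the iris data.
values = pd.Series([1.0, 2.0, 3.0, 4.0])

sample_var = values.var()                 # pandas default, ddof=1: 5 / 3
population_var = values.var(ddof=0)       # divide by n instead: 5 / 4
numpy_var = np.var(values.to_numpy())     # NumPy default, ddof=0

print(sample_var, population_var, numpy_var)
```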

## Defining the Gaussian Probability Density Function
This will be a simple translation of the mathematical formula to the corresponding code.

$$ P(x_i | y) = \frac{1}{\sqrt{2 \pi \sigma^2}}e ^{\left(-\frac{(x_i - \mu)^2}{2 \sigma^2}\right)} $$



In [None]:
import math
def gaussian_prob(x, mean, variance):
    e = math.exp(-((x-mean)**2/(2*variance))) # math.exp raises Euler's number e to the power of the value passed to it.
    return (1/(np.sqrt(2*np.pi*variance)))*e
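
A quick way to gain confidence in the translation is to check a few properties of the formula. The sketch below re-implements the same density under the hypothetical name `gaussian_pdf` (so it runs on its own) and checks the peak of the standard normal, the symmetry around the mean, and the fact that a density, unlike a probability, can exceed 1 when the variance is small.

```python
import math

def gaussian_pdf(x, mean, variance):
    """Gaussian probability density, same formula as above."""
    exponent = math.exp(-((x - mean) ** 2) / (2 * variance))
    return exponent / math.sqrt(2 * math.pi * variance)

peak = gaussian_pdf(0.0, 0.0, 1.0)      # 1 / sqrt(2 * pi), roughly 0.3989
left = gaussian_pdf(-1.0, 0.0, 1.0)     # one unit below the mean
right = gaussian_pdf(1.0, 0.0, 1.0)     # one unit above the mean
narrow = gaussian_pdf(0.0, 0.0, 0.01)   # small variance: density above 1

print(peak, left, right, narrow)
```

Values above 1 are not a bug: the classifier only compares scores between classes, so they never need to be valid probabilities.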

## Classifying a New Instance

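We will use the prior probabilities and the per-class statistics computed earlier to make the classification: for each class, the prior is multiplied by the Gaussian likelihood of every feature value, and the class with the highest resulting score is returned.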
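
Written as a formula, this is the standard Naive Bayes decision rule. The symbols $\hat{y}$ (the predicted class) and $n$ (the number of features, four here) are introduced only for this sketch; $P(y)$ is the prior and $P(x_i | y)$ is the Gaussian density defined above.

$$ \hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i | y) $$

The product is not normalized by $P(x_1, \dots, x_n)$, since that term is the same for every class and does not change which class wins.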

In [None]:
def classify(instance, priors, stats):
    probabilities = {} # Stores, for each of the three classes, the score of this instance belonging to that class.
    # In the end, we compare the values in this dictionary and return the class with the highest score.
    for cls, prior in priors.items():
        probabilities[cls] = prior # Start from the prior probability of the class.
        for feature in instance.index[:-1]: # All columns except 'species'
            mean = stats[cls]['mean'][feature] # Mean of this feature within the class
            variance = stats[cls]['variance'][feature] # Variance of this feature within the class
            probabilities[cls] *= gaussian_prob(instance[feature], mean, variance) # Multiply by the Gaussian likelihood of this feature value under the class; the running product is the (unnormalized) score stored in the probabilities dictionary.
    return max(probabilities, key=probabilities.get)
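
Multiplying many small densities can underflow to zero when there are many features. With four features this is unlikely to matter, but a common alternative is to sum log-densities instead. The sketch below is that variant, not the notebook's method; `log_gaussian`, `classify_log`, and the toy priors and statistics are hypothetical names and values chosen for illustration.

```python
import math

def log_gaussian(x, mean, variance):
    """Natural log of the Gaussian density."""
    return -0.5 * math.log(2 * math.pi * variance) - (x - mean) ** 2 / (2 * variance)

def classify_log(features, priors, stats):
    """Pick the class with the highest log-prior plus summed log-likelihoods."""
    scores = {}
    for cls, prior in priors.items():
        score = math.log(prior)
        for name, value in features.items():
            score += log_gaussian(value, stats[cls]['mean'][name], stats[cls]['variance'][name])
        scores[cls] = score
    return max(scores, key=scores.get)

# Hypothetical two-class, one-feature setup in the same layout as `stats`.
toy_priors = {"small": 0.5, "large": 0.5}
toy_stats = {
    "small": {"mean": {"length": 1.0}, "variance": {"length": 0.25}},
    "large": {"mean": {"length": 5.0}, "variance": {"length": 0.25}},
}

pred_low = classify_log({"length": 1.2}, toy_priors, toy_stats)
pred_high = classify_log({"length": 4.6}, toy_priors, toy_stats)
print(pred_low, pred_high)
```

Because the logarithm is monotonic, the winning class is the same as with the product form whenever the product does not underflow.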

## Split the data and test the data

All that is left is to split the data, holding out 20% of the rows as `test_data`, and measure the accuracy of the model on that held-out set.

In [None]:
train_data = idata.sample(frac=0.8, random_state=1)
test_data = idata.drop(train_data.index)
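
The `sample` / `drop` pattern above relies on the index being unique, so that dropping the sampled index leaves exactly the remaining rows. The sketch below checks this on a hypothetical ten-row frame (the names `toy`, `train` and `test` are illustrative). Note that `sample` does not stratify, so the class proportions in the training set may differ somewhat from the full data.

```python
import pandas as pd

# Hypothetical ten-row frame with a default, unique integer index.
toy = pd.DataFrame({"x": range(10), "species": ["a"] * 5 + ["b"] * 5})

train = toy.sample(frac=0.8, random_state=1)  # 80% of the rows, reproducible
test = toy.drop(train.index)                  # everything not sampled

print(len(train), len(test))
print(train["species"].value_counts())        # not necessarily balanced
```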

In [None]:
priors = calculate_pp(train_data)
stats = calculate_m_and_v(train_data)

In [None]:
correct_predictions = 0
for _, row in test_data.iterrows():
    prediction = classify(row, priors, stats)
    if prediction == row['species']:
        correct_predictions += 1

accuracy = correct_predictions / len(test_data)
print(f'Accuracy: {accuracy * 100:.2f}%')

Accuracy: 93.33%
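
A single accuracy figure does not show which species get confused with each other. One way to look deeper is a confusion matrix built with `pd.crosstab`. The sketch below uses short hypothetical label lists, not the notebook's actual predictions, purely to show the mechanics.

```python
import pandas as pd

# Hypothetical true and predicted labels for five test rows.
actual = pd.Series(
    ["setosa", "setosa", "versicolor", "virginica", "virginica"], name="actual"
)
predicted = pd.Series(
    ["setosa", "setosa", "versicolor", "versicolor", "virginica"], name="predicted"
)

confusion = pd.crosstab(actual, predicted)   # rows: actual, columns: predicted
toy_accuracy = (actual == predicted).mean()  # fraction of matching labels

print(confusion)
print(f"Accuracy: {toy_accuracy * 100:.2f}%")
```

In the notebook, the predictions could be collected into a list inside the evaluation loop and compared against `test_data['species']` in the same way.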
