<a href="https://www.kaggle.com/code/ayushs9020/gaussian-naive-bayes-from-scratch?scriptVersionId=128491145" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Gaussian Naive Bayes

Gaussian Naive Bayes (GNB) is a probabilistic algorithm used for classification tasks. It is based on Bayes' theorem, which states that the probability of a hypothesis (class label) given the evidence (input features) is proportional to the probability of the evidence given the hypothesis multiplied by the prior probability of the hypothesis. In other words, GNB tries to predict the probability of a particular class label given the input features.

GNB assumes that the input features are conditionally independent of one another given the class label, and that within each class every feature follows a Gaussian (normal) distribution. This means that, once the class is known, the value of one feature tells us nothing more about the others, and that the distribution of each feature within a class is symmetric and bell-shaped.

To train a GNB model, we calculate the mean and variance of each feature for each class label in the training data. Then, during the testing phase, we use these parameters to calculate the probability of each class label given the input features. The class label with the highest probability is then selected as the predicted class label for the given input.

GNB is a simple and computationally efficient algorithm that works well for high-dimensional datasets with continuous input features. However, it may not perform well on datasets with correlated features or non-Gaussian distributions.

In [1]:
from IPython.display import IFrame

**Note** - This notebook is highly inspired by this [wiki page](https://en.wikipedia.org/wiki/Naive_Bayes_classifier) and this [3Blue1Brown](https://www.youtube.com/@3blue1brown) video on the [Central Limit Theorem](https://www.youtube.com/watch?v=zeJD6dqJ5lo)

# What we will learn by the end of this notebook
* Probability
* Conditional Probability
* Chain Rule in Conditional Probability
* Maximum A Posteriori
* What `numpy.empty()` really returns

# Probability

So what is this term `probability`?

Probability is a measure of how likely an event is to occur.

If an event $x$ happens in $n_x$ of $n$ equally likely outcomes, the probability $p(x)$ of that event is given by
$$p(x) = \frac {n_x}{n}$$
For example, the probability of `Heads` when a fair `coin` is tossed is $\frac {1}{2}$, or $50$%, because `Heads` is $1$ of the $2$ equally likely outcomes `Heads` and `Tails`.

So now we have a basic idea of the term probability
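To see the definition in action, here is a minimal sketch that estimates the probability of `Heads` by simulating coin tosses. The number of tosses and the random seed are arbitrary choices of mine, not part of the original notebook.

```python
import numpy as np

# Simulate n tosses of a fair coin (1 = Heads, 0 = Tails); the seed is arbitrary
rng = np.random.default_rng(seed=0)
n = 10_000
tosses = rng.integers(0, 2, size=n)

# Estimate p(Heads) as (number of Heads) / (number of tosses)
p_heads = tosses.sum() / n
print(p_heads)
```

The estimate will not be exactly $0.5$, but it gets closer as the number of tosses grows.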

Now we move on to `Conditional Probability`. So what is conditional probability?

Conditional probability is the probability of an event occurring given that another event has already occurred.

The conditional probability $p(x|y)$ of an event $x$ occurring, given that $y$ has occurred, can be calculated from the overall probability $p(x)$ of event $x$, the overall probability $p(y)$ of event $y$ and the conditional probability $p(y|x)$:
$$p(x|y) = \frac {p(x)p(y|x)}{p(y)} = \frac {p(x , y)}{p(y)}$$
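The two forms of $p(x|y)$ can be checked with a few numbers. The counts below are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
# Hypothetical counts of two events x and y over 100 trials
n_total = 100
n_x = 30     # trials in which x occurred
n_y = 40     # trials in which y occurred
n_xy = 20    # trials in which both x and y occurred

p_x = n_x / n_total
p_y = n_y / n_total
p_xy = n_xy / n_total
p_y_given_x = n_xy / n_x

# Two routes to p(x | y): through the joint probability and through Bayes' theorem
via_joint = p_xy / p_y
via_bayes = p_x * p_y_given_x / p_y
print(via_joint, via_bayes)
```

Both routes give the same value, up to floating point rounding.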

Put in the context of a machine learning algorithm, the formula changes a bit. Don't worry, you will understand it.

I have shortened $target$ to $t$ and $features$ to $f$

The conditional probability $p(target|f_1 , f_2 , f_3 , ... , f_n)$ of the $target$, given that the features $(f_1 , f_2 , f_3 , ... , f_n)$ have been observed, can be calculated from the overall probability $p(target)$ of the target, the overall probability $p(f_1 , f_2 , f_3 , ... , f_n)$ of the features and the conditional probability $p(f_1 , f_2 , f_3 , ... , f_n|target)$:

$$p(t|f_1 , f_2 , f_3 , ... , f_n) = \frac {p(t)p(f_1 , f_2 , f_3 , ... , f_n|t)}{p(f_1 , f_2 , f_3 , ... , f_n)} = \frac {p(t , f_1 , f_2 , f_3 , ... , f_n)}{p(f_1 , f_2 , f_3 , ... , f_n)}$$

Let's say we have $3$ events now. The conditional probability will now be

$$p(x|y , z) = \frac {p(x)p(y , z|x)}{p(y , z)} = \frac {p(x , y , z)}{p(y , z)}$$

Rearranging this gives the chain rule

$$p(x , y , z) = p(x|y , z)p(y , z)$$

([Short chain rule explanation](https://youtu.be/WqqpwDVeW10))

Doing the same for our example

$$p(t , f_1 , f_2 , f_3 , ... , f_n) = p(t|f_1 , f_2 , f_3 , ... , f_n)p(f_1 , f_2 , f_3 , ... , f_n)$$

The joint probability on the left can also be split the other way round, peeling off one feature at a time while keeping $t$ in the condition. After one step of the chain rule we get

$$p(t , f_1 , f_2 , f_3 , ... , f_n) = p(f_1 | f_2 , f_3 , ... , f_n , t)p(f_2 , f_3 , ... , f_n , t)$$

Doing this iteratively until only $p(t)$ is left gives

$$p(t , f_1 , f_2 , f_3 , ... , f_n) = p(f_1 | f_2 , f_3 , ... , f_n , t)p(f_2 | f_3 , ... , f_n , t) ... p(f_n | t)p(t)$$

This is where the `naive` assumption comes in: given the class $t$, every feature is assumed to be independent of the others, so $p(f_i | f_{i+1} , ... , f_n , t) = p(f_i | t)$. The joint probability then becomes $p(t)∏_{i = 1}^{n}p(f_i|t)$, and since $p(f_1 , f_2 , f_3 , ... , f_n)$ does not depend on $t$ we can write

$$p(t|f_1 , f_2 , f_3 , ... , f_n) ∝ p(t)∏_{i = 1}^{n}p(f_i|t)$$

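Replacing the $∝$ with $\frac {1}{z}$, where $z = p(f_1 , f_2 , f_3 , ... , f_n)$ is a constant that does not depend on $t$, we get

$$p(t|f_1 , f_2 , f_3 , ... , f_n) = \frac {1}{z} p(t)∏_{i = 1}^{n}p(f_i|t)$$

Applying Maximum A Posteriori (MAP) estimation, we predict the class $\hat{t}$ with the highest posterior probability. Since $\frac {1}{z}$ is the same for every class, it can be dropped from the comparison.
$$\hat{t} = \underset{t}{\operatorname{argmax}} \ p(t)∏_{i = 1}^{n}p(f_i|t)$$
([Maximum A Posteriori explanation](https://youtu.be/tr1jxgKbNuw), enjoy the music)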
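The rule can be tried out on a tiny example before any Gaussians are involved. This is only a sketch: the class names, priors and likelihood values are all hypothetical numbers I made up for illustration.

```python
import numpy as np

# Hypothetical priors p(t) for two classes and likelihoods p(f_i | t)
# for three observed features; rows are classes, columns are features
classes = np.array(["spam", "ham"])
prior = np.array([0.4, 0.6])
likelihood = np.array([
    [0.8, 0.5, 0.9],   # p(f_1|spam), p(f_2|spam), p(f_3|spam)
    [0.1, 0.4, 0.3],   # p(f_1|ham),  p(f_2|ham),  p(f_3|ham)
])

# Unnormalised posterior: p(t) * prod_i p(f_i | t)
score = prior * likelihood.prod(axis=1)

# z is the sum of the scores; dividing by it does not change the argmax
posterior = score / score.sum()
prediction = classes[np.argmax(score)]
print(score, posterior, prediction)
```

Note that the class with the smaller prior can still win when its likelihoods are large enough.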


# Probability Density Function

We know that the curve for $y = x$ looks like this

In [2]:
IFrame("https://www.desmos.com/calculator/512wkwdbob" , 400 , 400)

If we tweak the function a little bit to $y = e^x$, we get

In [3]:
IFrame("https://www.desmos.com/calculator/7x5pbdhrrn" , 400 , 400)

After tweaking the function further to $y = e^{-x}$, it becomes

In [4]:
IFrame("https://www.desmos.com/calculator/jtzzdklnrd" , 400 , 400)

If we take the modulus of $x$, as in $y = e^{-|x|}$, it becomes

In [5]:
IFrame("https://www.desmos.com/calculator/zodxwedpjj" , 400 , 400)

An interesting fact is that the shape does not really depend on $e$. Any constant base $b$ greater than $1$ gives the same kind of curve, because $b^{-|x|} = e^{-|x| \ln b}$ only stretches or squeezes it along the $x$ axis.

So if the shape does not change, why do we even put the $e$ there? Why don't we just put anything else?

Frankly speaking, $e$ is the convention because it keeps the derivatives and integrals of the function tidy. It also seems to be cool there.

Also, if we square the modulus, as in $y = e^{-|x|^2}$, we get this

In [6]:
IFrame("https://www.desmos.com/calculator/dqd1vrha9i" , 400 , 400)

We can also move this function along the $x$ axis by adding or subtracting a constant from $x$, like $y = e^{-|x - a|^2}$

In [7]:
IFrame("https://www.desmos.com/calculator/0omydtyxdh" , 400 , 400)

What if I want to calculate the area under this curve?

In [8]:
IFrame("https://www.desmos.com/calculator/1xlnxhm7zj" , 400 , 400)

Surprisingly, this area always turns out to be $\sqrt\pi$, whatever the value of $a$
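The area can also be checked numerically. The sketch below approximates the integral with a simple Riemann sum; the helper name `area_under_bell`, the integration range and the grid size are my own choices.

```python
import numpy as np

def area_under_bell(a, lo=-50.0, hi=50.0, n=1_000_001):
    # Riemann-sum approximation of the integral of exp(-(x - a)^2) over [lo, hi]
    x = np.linspace(lo, hi, n)
    dx = x[1] - x[0]
    return np.sum(np.exp(-(x - a) ** 2)) * dx

# The shift a moves the curve sideways but should leave the area unchanged
for a in (0.0, 2.0, -3.5):
    print(a, area_under_bell(a), np.sqrt(np.pi))
```

The range is wide enough that the tails left out of the sum are negligible.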

As we know, the probabilities of all possible events must add up to $1$.

So to make the total area under the curve equal to $1$, we need to divide the function $e^{-|x - a|^2}$ by $\sqrt {π}$.

So now our formula becomes
$$y = \frac {1}{\sqrt{\pi}}e^{-|x - a|^2}$$

Going back some steps, I would like to add one thing to the formula $y = e^{-|x|^2}$: a variable term $c$ that controls the width of the curve, so we can rewrite it as $y = e^{-c|x|^2}$

Now if we put $c = 10$, we get this

In [9]:
IFrame("https://www.desmos.com/calculator/c5q3nl2t1g" , 400 , 400)

By convention we use $c = \frac {1}{2}$, because with this choice the spread (standard deviation) of the normalised curve comes out as exactly $1$, which is convenient later on

So changing our formula a little bit we can rewrite it as $y = e^{-\frac {1}{2}|x|^2}$

As we changed the function to $e^{-\frac {1}{2}|x|^2}$, the area under the curve changes too: it is now $\sqrt{2\pi}$ instead of $\sqrt{\pi}$, so we need to divide by $\sqrt{2\pi}$. Putting this change into the formula we get $$y = \frac {1}{\sqrt{2\pi}}e^{-\frac {1}{2}|x - a|^2}$$
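Here is a quick numerical check of how the constant $c$ changes the area, assuming the closed form $\sqrt{\pi / c}$ for the integral of $e^{-c x^2}$. The helper name `bell_area` is made up for this sketch.

```python
import numpy as np

def bell_area(c, lo=-50.0, hi=50.0, n=1_000_001):
    # Riemann-sum approximation of the integral of exp(-c * x^2) over [lo, hi]
    x = np.linspace(lo, hi, n)
    dx = x[1] - x[0]
    return np.sum(np.exp(-c * x ** 2)) * dx

# Compare the numerical area with sqrt(pi / c) for a few values of c
for c in (1.0, 0.5, 10.0):
    print(c, bell_area(c), np.sqrt(np.pi / c))
```

For $c = \frac {1}{2}$ the area is $\sqrt{2\pi}$, which is the constant that appears in the denominator of the formula.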

To fit a bell curve to real data we need two parameters: one for where the curve is centred and one for how wide it is. The centre $a$ is the mean, $a = μ$, and the width is set by dividing $|x - a|^2$ by the variance $σ^2$ (the standard deviation squared). Stretching the curve by $σ$ also multiplies its area by $σ$, so we divide by $σ$ as well. Now our formula becomes $$y = \frac {1}{\sigma\sqrt{2\pi}}e^{-\frac {1}{2}\frac {|x - μ|^2}{\sigma^2}}$$

Notice that we are taking the square of a modulus function, so the modulus is not needed. What we can write instead is $$y = \frac {1}{\sigma \sqrt{2π}}e^{\frac {-(x - μ)^2}{2σ^2}}$$

And this, my friends, is the probability density function of the `Normal Distribution`

$$f(x) = \frac {1}{\sigma \sqrt{2π}}e^{\frac {-(x - μ)^2}{2σ^2}}$$
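The formula translates almost directly into code. This is a minimal sketch; the function name `normal_pdf` and the values of `mu` and `sigma` are illustrative choices of mine.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    # Density of a normal distribution with mean mu and standard deviation sigma
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

mu, sigma = 5.855, 0.187   # illustrative values only
x = np.linspace(mu - 10 * sigma, mu + 10 * sigma, 200_001)
dx = x[1] - x[0]

# The area under the density should be very close to 1
total_area = np.sum(normal_pdf(x, mu, sigma)) * dx

# The height at the mean is 1 / (sigma * sqrt(2 * pi))
peak = normal_pdf(mu, mu, sigma)
print(total_area, peak)
```

Because this is a density and not a probability, its value at a single point can be larger than $1$ when $σ$ is small. Only the area under the curve is bounded by $1$.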

# Making from Scratch 

So how do we code this thing?

In [10]:
import pandas as pd 
import numpy as np 

Let's assume we have this data

|Gender|Height (ft)|Weight (lb)|Foot Size (in)|
|---|---|---|---|
|Male|6|180|12|
|Male|5.92|190|11|
|Male|5.58|170|12|
|Male|5.92|165|10|
|Female|5|100|6|
|Female|5.5|150|8|
|Female|5.42|130|7|
|Female|5.75|150|9|

Or we can write the same data as NumPy arrays

In [11]:
sample_1 = np.array(["male" , 6 , 180 , 12])
sample_2 = np.array(["male" , 5.92 , 190 , 11])
sample_3 = np.array(["male" , 5.58 , 170 , 12])
sample_4 = np.array(["male" , 5.92 , 165 , 10])
sample_5 = np.array(["female" , 5 , 100 , 6])
sample_6 = np.array(["female" , 5.5 , 150 , 8])
sample_7 = np.array(["female" , 5.42 , 130 , 7])
sample_8 = np.array(["female" , 5.75 , 150 , 9])

In [12]:
data = np.array([sample_1 , sample_2 , sample_3 , sample_4 , sample_5 , sample_6 , sample_7 , sample_8])

In [13]:
data

array([['male', '6', '180', '12'],
       ['male', '5.92', '190', '11'],
       ['male', '5.58', '170', '12'],
       ['male', '5.92', '165', '10'],
       ['female', '5', '100', '6'],
       ['female', '5.5', '150', '8'],
       ['female', '5.42', '130', '7'],
       ['female', '5.75', '150', '9']], dtype='<U32')

In [14]:
data = pd.DataFrame(data)

In [15]:
data

| |0|1|2|3|
|---|---|---|---|---|
|0|male|6.0|180|12|
|1|male|5.92|190|11|
|2|male|5.58|170|12|
|3|male|5.92|165|10|
|4|female|5.0|100|6|
|5|female|5.5|150|8|
|6|female|5.42|130|7|
|7|female|5.75|150|9|


First of all, let's convert the values to float so that we can do numerical and statistical operations on them

In [16]:
data[1] = data[1].astype(float)
data[2] = data[2].astype(float)
data[3] = data[3].astype(float)

Now we first need the mean of particular columns, grouped by class. Let's first limit our focus to the case where the class is `male` and the column is `Height`. For example this

|Gender|Height
|---|---
|Male|6
|Male|5.92
|Male|5.58
|Male|5.92

If we want the rows for `male` only, we can get them by defining a condition on the dataframe like this

In [17]:
data[data[0] == "male"]

| |0|1|2|3|
|---|---|---|---|---|
|0|male|6.0|180.0|12.0|
|1|male|5.92|190.0|11.0|
|2|male|5.58|170.0|12.0|
|3|male|5.92|165.0|10.0|


If we want to focus on the `Height` column (column `1`) only, we can do this

In [18]:
data[data[0] == "male"][1]

0    6.00
1    5.92
2    5.58
3    5.92
Name: 1, dtype: float64

If we check the data type of this, we get

In [19]:
type(data[data[0] == "male"][1])

pandas.core.series.Series

As we can see, this is a `pandas.core.series.Series`, so we can get the mean with `pandas.Series.mean()`

In [20]:
data[data[0] == "male"][1].mean()

5.855

To get the mean values of all the features for all the classes at once, we just need to define a nested for loop. The outer loop iterates over the feature columns and the inner loop iterates over the classes, of which there are `len(data[0].unique())`

Let's first make an empty array of shape `(classes , columns)`

In [21]:
mean = np.empty(shape = (2  , 3))

In [22]:
mean

array([[5.09663114e-310, 0.00000000e+000, 0.00000000e+000],
       [0.00000000e+000, 0.00000000e+000, 0.00000000e+000]])

You might be wondering why we got $1$ random-looking value and $0$ everywhere else. This is because `numpy.empty()` allocates memory for the array but does not initialise it, so the array simply shows whatever bytes happened to be in that memory. It still takes up the full space of $2 × 3$ floats; skipping the initialisation only saves time. The values are meaningless and must be overwritten before they are used.

Now, running the nested loop, we get
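A small sketch makes the behaviour of `numpy.empty()` easier to see by comparing it with `numpy.zeros()`. The variable names are my own.

```python
import numpy as np

empty = np.empty(shape=(2, 3))   # memory is allocated, contents are arbitrary
zeros = np.zeros(shape=(2, 3))   # memory is allocated and filled with 0.0

# Both arrays take the same space: 6 float64 values of 8 bytes each
print(empty.nbytes, zeros.nbytes)

# Every cell of an array made with np.empty should be written before it is read
empty[:] = 1.5
print(empty)
```

What `empty` contains before the assignment differs from run to run, so it should never be relied on.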

In [23]:
for j in range(1 , 4):
    for i in range(2):
        mean[i][j - 1] = data[data[0] == str(data[0].unique()[i])][j].mean()

In [24]:
mean

array([[  5.855 , 176.25  ,  11.25  ],
       [  5.4175, 132.5   ,   7.5   ]])

So this is our mean data.
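As a cross-check, the same per-class means can be computed without an explicit loop. This sketch rebuilds the toy data with named columns (the column names are my own choice) and uses `DataFrame.groupby`.

```python
import pandas as pd

# Same toy data as above, with named columns instead of 0..3
df = pd.DataFrame({
    "gender": ["male"] * 4 + ["female"] * 4,
    "height": [6, 5.92, 5.58, 5.92, 5, 5.5, 5.42, 5.75],
    "weight": [180, 190, 170, 165, 100, 150, 130, 150],
    "foot_size": [12, 11, 12, 10, 6, 8, 7, 9],
})

# One row per class, one column per feature
class_means = df.groupby("gender").mean()
print(class_means)
```

Keep in mind that `groupby` sorts the class labels, so `female` comes before `male` here, while the loop above follows the order of `data[0].unique()`.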

Now let's calculate the variance by the same method.

In [25]:
var = np.empty(shape = (2 , 3))

In [26]:
for j in range(1 , 4):
    for i in range(2):
        var[i][j - 1] = data[data[0] == str(data[0].unique()[i])][j].var()

In [27]:
var

array([[3.50333333e-02, 1.22916667e+02, 9.16666667e-01],
       [9.72250000e-02, 5.58333333e+02, 1.66666667e+00]])

And we have got our variance data too
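One detail worth knowing: `pandas.Series.var()` divides by $n - 1$ (sample variance) by default, while `numpy.var()` divides by $n$. The sketch below shows the difference on the male heights; the variable names are my own.

```python
import numpy as np
import pandas as pd

male_height = pd.Series([6, 5.92, 5.58, 5.92])

# pandas default: ddof=1, i.e. divide by n - 1
sample_var = male_height.var()

# NumPy default: ddof=0, i.e. divide by n
population_var = np.var(male_height.to_numpy())

print(sample_var, population_var)
```

The values in `var` above are sample variances, so a from-scratch version written in pure NumPy needs `ddof=1` to reproduce them.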

Now we just need to put these values into the formula and pick the class with the highest probability

Let's assume we have a person with height $6$. How do we predict the gender of this person?

In [28]:
x = 6

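Applying the formula, let's compute the likelihood of this height for the `male` class. `var` already holds $σ^2$, so it goes under the square root as it is. The result is a probability density, so it can be larger than $1$

In [29]:
p_0_0 = np.exp(-np.square(x - mean[0][0]) / (2 * var[0][0])) / np.sqrt(2 * np.pi * var[0][0])

In [30]:
p_0_0

1.5789   (rounded)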

Now let's compute the same likelihood for the `female` class, which is row `1` of `mean` and `var`

In [31]:
p_0_1 = np.exp(-np.square(x - mean[1][0]) / (2 * var[1][0])) / np.sqrt(2 * np.pi * var[1][0])

In [32]:
p_0_1

0.2235   (rounded)

As we can see, the likelihood of this height is higher for a male than for a female

So this column says that the person is a male

Now let's do the same for the other features

Let's assume we have a test sample like this (height, weight, foot size)

In [33]:
test = [6 , 130 , 8]

In [34]:
oup = np.zeros((2))

In [35]:
oup

array([0., 0.])

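Rather than writing the same expression for every feature, we can use a nested for loop. Multiplying many small likelihoods together can underflow, so here we add up their logarithms instead; the class with the largest sum is also the class with the largest product. Both classes have the same prior ($4$ of the $8$ rows each), so the prior is left out of the comparison

In [36]:
for i in range(2):        # classes: 0 = male, 1 = female
    for j in range(3):    # features: height, weight, foot size
        oup[i] += -0.5 * np.log(2 * np.pi * var[i][j]) - np.square(test[j] - mean[i][j]) / (2 * var[i][j])

In [37]:
oup

array([-18.21,  -6.83])   (rounded)

The second entry is larger, so all three features together say that the person is a female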
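To turn the class scores into actual posterior probabilities, they have to be divided by the constant $z$. The sketch below does this for the test sample, with the per-class statistics copied from the outputs above and equal priors assumed from the $4$ rows per class.

```python
import numpy as np

# Per-class statistics copied from the outputs above (rows: male, female)
mean = np.array([[5.855, 176.25, 11.25],
                 [5.4175, 132.5, 7.5]])
var = np.array([[3.50333333e-02, 1.22916667e+02, 9.16666667e-01],
                [9.72250000e-02, 5.58333333e+02, 1.66666667e+00]])
prior = np.array([0.5, 0.5])   # 4 of the 8 rows belong to each class
test = np.array([6, 130, 8])

# Gaussian likelihood of every feature under every class, shape (2, 3)
likelihood = np.exp(-(test - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Numerator of Bayes' theorem, then divide by z so the posteriors sum to 1
numerator = prior * likelihood.prod(axis=1)
posterior = numerator / numerator.sum()
print(posterior)
```

Almost all of the posterior mass lands on the second class (`female`), mainly because a weight of $130$ and a foot size of $8$ are very unlikely under the `male` statistics.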

So now we just need to put this all into a function

In [38]:
def gaussian(X_train , Y_train , X_test):
    # X_train: DataFrame of numeric features, Y_train: Series of class labels,
    # X_test: one sample with its values in the same order as the columns of X_train
    classes = Y_train.unique()
    n_classes , n_features = len(classes) , len(X_train.columns)

    mean = np.empty(shape = (n_classes , n_features))
    var = np.empty(shape = (n_classes , n_features))

    for c in range(n_classes):
        rows = X_train[Y_train == classes[c]]

        for f in range(n_features):
            mean[c][f] = rows.iloc[: , f].mean()
            var[c][f] = rows.iloc[: , f].var()

    # Start from the log prior log p(t) of every class
    output = np.log(np.array([(Y_train == c).mean() for c in classes]))

    # Add the log-likelihood log p(f_i | t) of every feature
    for c in range(n_classes):
        for f in range(n_features):
            output[c] += -0.5 * np.log(2 * np.pi * var[c][f]) - np.square(X_test[f] - mean[c][f]) / (2 * var[c][f])

    # Maximum A Posteriori: return the label with the highest score
    return classes[np.argmax(output)]

For the data above, `gaussian(data[[1 , 2 , 3]] , data[0] , [6 , 130 , 8])` returns `'female'`.

And we just made one of the simplest versions of `GaussianNB`.

I know this is not the complete form, but it may be the simplest one. I will add more functionality to it in updated versions.
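One possible next step is to classify several samples at once. The sketch below is a pure NumPy variant with separate fit and predict steps. The function names `fit_gnb` and `predict_gnb` are hypothetical, it uses `ddof=1` to match the pandas variance used earlier, and it does not handle features with zero variance.

```python
import numpy as np

def fit_gnb(X, y):
    # X: (n_samples, n_features) float array, y: (n_samples,) array of labels
    classes = np.unique(y)   # sorted labels
    mean = np.array([X[y == c].mean(axis=0) for c in classes])
    var = np.array([X[y == c].var(axis=0, ddof=1) for c in classes])
    log_prior = np.log(np.array([(y == c).mean() for c in classes]))
    return classes, mean, var, log_prior

def predict_gnb(model, X_new):
    classes, mean, var, log_prior = model
    X_new = np.atleast_2d(X_new).astype(float)
    # Broadcast to shape (n_new, n_classes, n_features)
    diff = X_new[:, None, :] - mean[None, :, :]
    log_likelihood = -0.5 * np.log(2 * np.pi * var) - diff ** 2 / (2 * var)
    # Score = log prior + sum of per-feature log-likelihoods
    scores = log_prior + log_likelihood.sum(axis=2)
    return classes[np.argmax(scores, axis=1)]

X = np.array([[6, 180, 12], [5.92, 190, 11], [5.58, 170, 12], [5.92, 165, 10],
              [5, 100, 6], [5.5, 150, 8], [5.42, 130, 7], [5.75, 150, 9]])
y = np.array(["male"] * 4 + ["female"] * 4)

model = fit_gnb(X, y)
predictions = predict_gnb(model, [[6, 130, 8], [5.9, 185, 11]])
print(predictions)
```

The second test row is a made-up sample that sits close to the `male` rows, added only to show that both classes can be predicted.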

**I REQUEST YOU TO COMMENT DOWN WHAT FUNCTIONALITIES I CAN ADD TO THIS MODEL. IT WILL BE HIGHLY APPRECIATED**

**THAT'S IT FOR TODAY, GUYS**

**HOPE YOU UNDERSTOOD AND LIKED MY WORK**

**DON'T FORGET TO UPVOTE**

**PEACE OUT !!!**