In [1]:
import warnings
import datetime

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


from IPython.display import display, Markdown , Math 

sns.set()
warnings.filterwarnings('ignore')

In [2]:
def printmd(string): display(Markdown(string))
def latex(out): printmd(f'{out}')  
def pr(string): printmd('***{}***'.format(string))

<h1>Naive Bayes Classifier</h1>

<h2>
  <p>
    <a href =   "https://github.com/daodavid" > 
         author: daodeiv (David Stankov) 
       <img src="https://cdn.thenewstack.io/media/2014/12/github-octocat.png" align="left" width="120"  alt="daodavid" ></a>
    </p>      
</h2>   

<h2> Contents </h2>
<h6>
  <font size="4" face = "Times New Roma" color='#3f134f' > 
    <ul style="margin-left: 30px">
      <li><a href='#bayes_theorem'>Bayes Theorem</a> </li> <br>
      <!--<li><a href='#int-1'>Introduction </a> </li><br> -->
      <li><a href='#works'>How does Binomial Naive Bayes work?</a> </li><br>
       <li><a href='#conclusion'>Conclusion</a> </li><br>  
        
</ul>    
 </font>
  </h6>
  

Naive Bayes is one of the simplest supervised ML algorithms. It is nonetheless efficient: it trains quickly and predicts quickly, which is a large part of why it is so useful and popular.
The name has two parts: "Bayes" because the method is built on Bayes' theorem, and "Naive" because it assumes that the features are independent of one another given the class, even when they are actually interdependent. It is simple but powerful, works well with large datasets and sparse matrices, and performs particularly well on text classification problems such as spam filtering.

<h2 id='bayes_theorem'> Bayes Theorem </h2>

Bayes' theorem describes the probability of an event based on prior knowledge of conditions that might be related to that event.
To derive it, we start from the definition of conditional probability:

$$p(A|B) = \frac{p(B\cap A)}{p(B)}$$

This is the probability of event A given B, that is, the probability of A when event B has already taken place. It equals the probability of the intersection of A and B (both events taking place) divided by the probability of B. <br>

The same holds for the probability of event B given event A: $$p(B|A) = \frac{p(A\cap B)}{p(A)}$$
$p(A\cap B)$ and $p(B\cap A)$ are the same quantity, so we can multiply each formula by its denominator and equate the two results:
$$ p(B|A)p(A) = p(A\cap B) = p(B \cap A) = p(A|B)p(B) $$

So, when we want the probability of A given B, we divide both sides by $P(B)$: <br> <br>
$$P(A|B) = \frac{ P(B|A) \, P(A)}{ P(B)}$$ <br> <br> This is the equation of Bayes' theorem, where:

* P(A|B) is the posterior probability of the class (target) given the predictor (attribute).
* P(A) is the prior probability of the class.
* P(B|A) is the likelihood, which is the probability of the predictor given the class.
* P(B) is the prior probability of the predictor (the evidence).
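
As a quick numeric check of the formula, the sketch below plugs numbers into Bayes' theorem. The scenario and all probabilities are hypothetical, chosen only for illustration: A stands for "the message is spam" and B for "the message contains a given word". Exact fractions are used so the result is easy to verify by hand.

```python
from fractions import Fraction

# Hypothetical numbers, chosen only for illustration.
p_a = Fraction(1, 5)               # P(A): 20% of messages are spam
p_b_given_a = Fraction(1, 2)       # P(B|A): the word appears in 50% of spam
p_b_given_not_a = Fraction(1, 10)  # P(B|not A): the word appears in 10% of non-spam

# P(B) from the law of total probability.
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b

print(p_b)          # 9/50
print(p_a_given_b)  # 5/9
```

Seeing the word raises the probability of spam from the prior of 1/5 to a posterior of 5/9.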

<h2 id='works'>  How does Binomial Naive Bayes work? (implementation) </h2>

For our purposes we are going to use the <a href='https://www.kaggle.com/datasets/priy998/golf-play-dataset'>Golf Play Dataset</a>.

In [3]:
df = pd.read_csv("../../../resources/data/golf_df.csv")
df

Unnamed: 0,Outlook,Temperature,Humidity,Windy,Play
0,sunny,hot,high,False,no
1,sunny,hot,high,True,no
2,overcast,hot,high,False,yes
3,rainy,mild,high,False,yes
4,rainy,cool,normal,False,yes
5,rainy,cool,normal,True,no
6,overcast,cool,normal,True,yes
7,sunny,mild,high,False,no
8,sunny,cool,normal,False,yes
9,rainy,mild,normal,False,yes


We classify whether a day is suitable for playing golf, given the features of the day. The columns represent these features and the rows represent individual days. Taking the first row of the dataset, we can observe that the day is not suitable for playing golf when the outlook is sunny, the temperature is hot, the humidity is high, and it is not windy. We make two assumptions here. First, as stated above, we consider the predictors to be independent: if the temperature is hot, it does not necessarily mean that the humidity is high. Second, all the predictors have an equal effect on the outcome: the day being windy, for example, does not carry more weight than the other features in deciding whether to play golf.

For this example, Bayes' theorem can be rewritten as: <br> <br>
$$P(y|X) = \frac{P(X|y) \, P(y)}{P(X)}$$ <br> <br>
The variable y is the class variable (play golf), which represents whether it is suitable to play golf given the conditions. The variable X represents the parameters/features.

X is given as <br>
 $$X = (x_1,x_2,...,x_n)$$ <br> <br>
 Here $x_1,x_2,...,x_n$ represent the features, i.e. they can be mapped to outlook, temperature, humidity and windy. Substituting for X, expanding with the chain rule, and using the assumption that the features $x_i$ are independent, we can write Bayes' formula over all features as follows:

$$P(y| x_1,x_2,...,x_n ) = \frac{P(x_1|y)P(x_2|y) \cdots P(x_n|y)P(y)}{P(x_1)P(x_2) \cdots P(x_n)} $$

In our dataset all the variables are discrete.

Now you can obtain each of these values by looking at the dataset and substitute them into the equation. For a given entry the denominator does not depend on the class $y$; it stays the same whichever class we evaluate. Therefore, the denominator can be dropped for our purposes:
$$P(y| x_1,x_2,...,x_n ) \propto P(y)\prod_{i=1}^{n}P(x_i|y)$$

For example, taking the 'Outlook' feature equal to 'overcast':

$P(Play="yes"|Outlook="overcast") \propto P(Outlook="overcast"|\;Play="yes" )P(Play="yes")$

Let's produce the likelihood table.

In [4]:
label = "Play"
yes = df[df[label] == "yes"].groupby("Outlook")[label].count()
no = df[df[label] == "no"].groupby("Outlook")[label].count()
likelihood_yes = yes/yes.sum()
likelihood_no = no/no.sum()


In [5]:
# Relabel the index once, so that each entry reads as a conditional probability.
likelihood_yes.index = [f'P ( Outlook= "{i}"| Play="yes" ) = ' for i in likelihood_yes.index]
likelihood_yes

P ( Outlook= "overcast"| Play="yes" ) =     0.444444
P ( Outlook= "rainy"| Play="yes" ) =        0.333333
P ( Outlook= "sunny"| Play="yes" ) =        0.222222
Name: Play, dtype: float64

In [8]:
likelihood_no.index = [f'P ( Outlook= "{i}"| Play="no" ) = ' for i in likelihood_no.index]
likelihood_no

P ( Outlook= "rainy"| Play="no" ) =     0.4
P ( Outlook= "sunny"| Play="no" ) =     0.6
Name: Play, dtype: float64

In [9]:
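def get_value_feature(counts, feature_value):
    # Return the count of a feature value within one class,
    # or 0 if that value never occurs in the class.
    try:
        return counts[feature_value]
    except KeyError:
        return 0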

In [10]:
def create_likehood_tb(df, label):
    # For every feature value, store P(value | yes), P(value | no) and P(value).
    # Entries are keyed by the value alone, so values must be unique across columns.
    likehood_table = {}
    features = df.drop(label, axis=1).columns
    for feature in features:
        yes = df[df[label] == "yes"].groupby(feature)[label].count()
        no = df[df[label] == "no"].groupby(feature)[label].count()
        # Named `total` rather than `all` to avoid shadowing the builtin.
        total = df.groupby(feature)[label].count()
        for feature_value in total.index:
            c = total[feature_value]
            c1 = get_value_feature(yes, feature_value)
            c2 = get_value_feature(no, feature_value)
            likehood_table[feature_value] = {
                'yes': c1 / yes.sum(),
                'no': c2 / no.sum(),
                'P': c / total.sum(),
            }
    return likehood_table

In [11]:
likehood_df = create_likehood_tb(df, "Play")

In [12]:
likehood_df = pd.DataFrame(likehood_df)
#['P(x | Play="yes")', 'P(x | Play="no")', 'P(x)']
likehood_df

Unnamed: 0,overcast,rainy,sunny,cool,hot,mild,high,normal,False,True
yes,0.444444,0.333333,0.222222,0.333333,0.222222,0.444444,0.333333,0.666667,0.666667,0.333333
no,0.0,0.4,0.6,0.2,0.4,0.4,0.8,0.2,0.4,0.6
P,0.285714,0.357143,0.357143,0.285714,0.285714,0.428571,0.5,0.5,0.571429,0.428571


The table above lists, for every feature value x, the likelihoods $P(x|Play=yes)$ and $P(x|Play=no)$ together with the marginal probability $P(x)$.
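
The table contains a zero: the likelihood of "overcast" for Play="no" is 0.0, because no "no" day in the data is overcast. Since the likelihoods are multiplied together, that single zero forces the "no" score to zero for every overcast day. One common remedy, which this notebook does not apply, is additive (Laplace) smoothing. The sketch below is a minimal illustration; the helper name `smoothed_likelihood` is hypothetical, and the Outlook counts (4/3/2 for "yes", 0/2/3 for "no") are inferred from the rounded table values.

```python
import pandas as pd

def smoothed_likelihood(counts, alpha=1.0):
    """Additive smoothing: (count + alpha) / (total + alpha * number_of_values)."""
    return (counts + alpha) / (counts.sum() + alpha * len(counts))

# Outlook counts per class, inferred from the likelihood table (an assumption).
yes_counts = pd.Series({"overcast": 4, "rainy": 3, "sunny": 2})
no_counts = pd.Series({"overcast": 0, "rainy": 2, "sunny": 3})

print(smoothed_likelihood(yes_counts))
print(smoothed_likelihood(no_counts))
```

With `alpha=1` the likelihood of "overcast" given "no" becomes 1/8 instead of 0, and each column still sums to 1.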

To apply Bayes' theorem we also need the prior probabilities $P(Play=yes)$ and $P(Play=no)$.

In [13]:
c = df.groupby('Play').count().iloc[:, 0]
prior_probability  = c /c.sum()
prior_probability

Play
no     0.357143
yes    0.642857
Name: Outlook, dtype: float64
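
With the likelihood table and the priors, a single prediction can be checked by hand. The sketch below scores the first row of the dataset (sunny, hot, high, not windy). It reads the rounded decimals as exact fractions (for example 0.222222 as 2/9 and 0.642857 as 9/14), which is an assumption based on 9 "yes" and 5 "no" days among 14 rows.

```python
from fractions import Fraction as F

prior = {"yes": F(9, 14), "no": F(5, 14)}

# P(value | class) for the first row: sunny, hot, high, Windy=False.
likelihood = {
    "yes": [F(2, 9), F(2, 9), F(3, 9), F(6, 9)],
    "no": [F(3, 5), F(2, 5), F(4, 5), F(2, 5)],
}

# Unnormalized posterior: prior times the product of the likelihoods.
scores = {}
for cls in prior:
    score = prior[cls]
    for p in likelihood[cls]:
        score *= p
    scores[cls] = score

# Dividing by the sum of the scores recovers proper probabilities.
total = sum(scores.values())
posterior = {cls: float(s / total) for cls, s in scores.items()}

print(scores)      # yes: 4/567, no: 24/875
print(posterior)
print(max(scores, key=scores.get))  # predicted class
```

The "no" score is the larger one, which agrees with the actual label of that row.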

Let's make predictions using the likelihood table and the prior probabilities.

In [14]:
df.head()

Unnamed: 0,Outlook,Temperature,Humidity,Windy,Play
0,sunny,hot,high,False,no
1,sunny,hot,high,True,no
2,overcast,hot,high,False,yes
3,rainy,mild,high,False,yes
4,rainy,cool,normal,False,yes



In [16]:
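def calculate_bayes(x, likelihood_tb, prior_probability):
    # Start each class score from its prior probability.
    yes = prior_probability['yes']
    no = prior_probability['no']
    # Multiply in the likelihood of every feature value in the row.
    for index in x.index:
        value = x[index]
        yes = yes * likelihood_tb[value]['yes']
        no = no * likelihood_tb[value]['no']

    # Predict the class with the larger unnormalized posterior.
    return "yes" if yes > no else "no"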

In [17]:
test = df.drop("Play", axis=1)
predict = test.apply(calculate_bayes,likelihood_tb=likehood_df,prior_probability=prior_probability, axis=1)

In [18]:
predict

0      no
1      no
2     yes
3     yes
4     yes
5     yes
6     yes
7      no
8     yes
9     yes
10    yes
11    yes
12    yes
13     no
dtype: object

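Let's compute the accuracy. Note that the predictions are made on the same rows that were used to build the likelihood table, so this is the training accuracy.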

In [20]:
pr(accuracy_score(df["Play"],predict))

***0.9285714285714286***

<h2> Gaussian Naive Bayes </h2>

When the predictors take continuous rather than discrete values, we assume that the values of each feature within a class are sampled from a Gaussian distribution:

$$P(x_i|y) = \frac{1}{\sqrt{2\pi\sigma^2_y}}\exp\left(- \frac { (x_i - \mu_y)^2}{2\sigma^2_y}\right) $$

Here $\mu_y$ and $\sigma_y$ are the mean and standard deviation of feature $x_i$ within class $y$. Just as in the discrete case, we have to find the likelihood of every feature value.
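
The density above translates directly into NumPy. This is a minimal sketch; the function name `gaussian_likelihood` and its parameter names are illustrative choices, not part of the notebook.

```python
import numpy as np

def gaussian_likelihood(x, mean, std):
    """Gaussian density with the given mean and standard deviation, evaluated elementwise."""
    x = np.asarray(x, dtype=float)
    coef = 1.0 / np.sqrt(2.0 * np.pi * std ** 2)
    return coef * np.exp(-((x - mean) ** 2) / (2.0 * std ** 2))

# Sanity check: a standard normal has density 1/sqrt(2*pi) at its mean.
print(gaussian_likelihood(0.0, mean=0.0, std=1.0))
print(1.0 / np.sqrt(2.0 * np.pi))
```

Note that for a continuous feature this value is a density rather than a probability, so it can exceed 1 when the standard deviation is small.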

We will use the Titanic dataset.

In [22]:
train = pd.read_csv("../../../resources/data/titanic/train.csv")
train['Sex'] = (train['Sex']=='male').astype(int)
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",1,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",0,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",1,35.0,0,0,373450,8.05,,S


In [26]:
train.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,0.647587,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,0.47799,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,0.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,1.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,1.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,1.0,80.0,8.0,6.0,512.3292


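In order to calculate the likelihoods we have to find the mean and standard deviation of each feature within each class. <br>
With them we can evaluate the Bayes rule $$P(y| x_1,x_2,...,x_n ) \propto P(y)\prod_{i=1}^{n}P(x_i|y)$$ for both classes:

 $$ P(survived=1)P(x|survived=1)$$
  $$ P(survived=0)P(x|survived=0)$$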
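
The two scores above can be sketched end to end as follows. This is an illustration under stated assumptions rather than the notebook's own implementation: the function names `fit_gaussian_nb` and `predict_gaussian_nb` are hypothetical, the toy data is made up, and the scores are compared in log space (a sum of logs instead of a product) to avoid numerical underflow.

```python
import numpy as np
import pandas as pd

def fit_gaussian_nb(X, y):
    """Estimate the class priors and the per-class mean and std of every feature."""
    priors = y.value_counts(normalize=True)
    means = X.groupby(y).mean()
    stds = X.groupby(y).std()
    return priors, means, stds

def predict_gaussian_nb(X, priors, means, stds):
    """Pick the class with the largest log P(y) + sum_i log P(x_i | y)."""
    log_scores = {}
    for cls in priors.index:
        mu, sigma = means.loc[cls], stds.loc[cls]
        # Log of the Gaussian density, one column per feature.
        log_pdf = -((X - mu) ** 2) / (2 * sigma ** 2) - 0.5 * np.log(2 * np.pi * sigma ** 2)
        log_scores[cls] = np.log(priors.loc[cls]) + log_pdf.sum(axis=1)
    return pd.DataFrame(log_scores).idxmax(axis=1)

# Hypothetical toy data: two well-separated classes, two numeric features.
X = pd.DataFrame({"f1": [1.0, 1.2, 0.8, 5.0, 5.3, 4.9],
                  "f2": [10.0, 11.0, 9.5, 20.0, 21.0, 19.0]})
y = pd.Series([0, 0, 0, 1, 1, 1])

priors, means, stds = fit_gaussian_nb(X, y)
pred = predict_gaussian_nb(X, priors, means, stds)
print(pred.tolist())
```

On the Titanic data the same idea would apply with `Survived` as `y` and numeric columns such as `Age` and `Fare` as `X`. Since `Age` has missing values (714 of 891 are present), those would need to be handled first.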

In [23]:
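def get_stats(data, label):
    # Collect summary statistics (count, mean, std, ...) of every numeric
    # column separately for each class of `label`.
    result = {}
    for i in data[label].unique():
        stats = data[data[label] == i].describe()
        result[i] = stats

    return result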

In [25]:
label = 'Survived'
stats = get_stats(train, label = label)


{0:        PassengerId  Survived      Pclass         Sex         Age       SibSp  \
 count   549.000000     549.0  549.000000  549.000000  424.000000  549.000000   
 mean    447.016393       0.0    2.531876    0.852459   30.626179    0.553734   
 std     260.640469       0.0    0.735805    0.354968   14.172110    1.288399   
 min       1.000000       0.0    1.000000    0.000000    1.000000    0.000000   
 25%     211.000000       0.0    2.000000    1.000000   21.000000    0.000000   
 50%     455.000000       0.0    3.000000    1.000000   28.000000    0.000000   
 75%     675.000000       0.0    3.000000    1.000000   39.000000    1.000000   
 max     891.000000       0.0    3.000000    1.000000   74.000000    8.000000   
 
             Parch        Fare  
 count  549.000000  549.000000  
 mean     0.329690   22.117887  
 std      0.823166   31.388207  
 min      0.000000    0.000000  
 25%      0.000000    7.854200  
 50%      0.000000   10.500000  
 75%      0.000000   26.000000  
 m

<h2> Investigation of likelihood </h2>

<h2> References </h2>
* <a href='https://towardsdatascience.com/implementing-naive-bayes-algorithm-from-scratch-python-c6880cfc9c41'>Implementing Naive Bayes Algorithm from Scratch</a> <br>
* <a href='https://prwatech.in/blog/machine-learning/naive-bayes-classifier-in-machine-learning/'> Indian Naive Bayes </a> <br>
* <a href='https://www.geeksforgeeks.org/naive-bayes-classifiers/'>Naive Bayes Classifiers </a> <br>
* <a href='https://towardsdatascience.com/naive-bayes-classifier-81d512f50a7c'>Naive Bayes Classifier </a>

