# Naive Bayes Classifier

Naive Bayes classifiers are a collection of classification algorithms based on Bayes’ Theorem. Naive Bayes is not a single algorithm but a family of algorithms that share a common principle: every pair of features is assumed to be independent of each other, given the class.

## Why are Naive Bayes Classifiers called 'naive'?

The naive Bayes classifier simplifies prediction by making the simple assumption that all the features are independent of each other. Say, for example, if there was a case to predict the 

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Let us now consider a small dataset, which will be used to understand the concepts behind Naive Bayes classification.

In [2]:
data = pd.read_csv("computer_purchase_data.csv")
data

| Unnamed: 0 | Age | Income | Student | Credit Rating | Buys Computer |
|---|---|---|---|---|---|
| 0 | Young | Low | Yes | Fair | No |
| 1 | Young | Low | Yes | Excellent | No |
| 2 | Young | Medium | Yes | Fair | Yes |
| 3 | Young | High | No | Fair | Yes |
| 4 | Young | High | No | Excellent | No |
| 5 | Middle-aged | High | No | Excellent | Yes |
| 6 | Middle-aged | High | No | Excellent | Yes |
| 7 | Middle-aged | Medium | Yes | Fair | Yes |
| 8 | Middle-aged | Low | No | Excellent | Yes |
| 9 | Middle-aged | Medium | No | Excellent | Yes |


The dataset provided is a fictional set of records suitable for illustrating how Naive Bayes classification works. Each record represents a person and includes the following features:

- Age: Categorized as "Young" or "Middle-aged".
- Income: Categorized as "Low", "Medium", or "High".
- Student: Indicates whether the person is a student, represented as "Yes" or "No".
- Credit Rating: Describes the credit rating of the person, categorized as "Fair" or "Excellent".
- Buys Computer: Indicates whether the person bought a computer, represented as "Yes" or "No".

The dataset is divided into two parts: the feature matrix and the response vector.

- The feature matrix contains all the vectors (rows) of the dataset, where each vector holds the values of the independent features. In the dataset above, the features are <mark>Age</mark>, <mark>Income</mark>, <mark>Student</mark>, and <mark>Credit Rating</mark>.
- The response vector contains the value of the class variable (the prediction or output) for each row of the feature matrix. In the dataset above, the class variable is <mark>Buys Computer</mark>.
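As a minimal sketch of this split, the snippet below rebuilds the ten records inline (so it does not depend on `computer_purchase_data.csv`) and separates them into the two parts. The names `df`, `feature_cols`, `X_matrix`, and `y_vector` are illustrative assumptions, not names used elsewhere in this notebook.

```python
import pandas as pd

# Rebuild the ten records shown above so the snippet runs on its own.
rows = [
    ("Young", "Low", "Yes", "Fair", "No"),
    ("Young", "Low", "Yes", "Excellent", "No"),
    ("Young", "Medium", "Yes", "Fair", "Yes"),
    ("Young", "High", "No", "Fair", "Yes"),
    ("Young", "High", "No", "Excellent", "No"),
    ("Middle-aged", "High", "No", "Excellent", "Yes"),
    ("Middle-aged", "High", "No", "Excellent", "Yes"),
    ("Middle-aged", "Medium", "Yes", "Fair", "Yes"),
    ("Middle-aged", "Low", "No", "Excellent", "Yes"),
    ("Middle-aged", "Medium", "No", "Excellent", "Yes"),
]
columns = ["Age", "Income", "Student", "Credit Rating", "Buys Computer"]
df = pd.DataFrame(rows, columns=columns)

feature_cols = columns[:-1]       # the four independent features
X_matrix = df[feature_cols]       # feature matrix: one row per person
y_vector = df["Buys Computer"]    # response vector: one class label per row

print(X_matrix.shape)
print(y_vector.value_counts().to_dict())
```

The feature matrix has one column per feature, and the response vector has exactly one label for each of its rows.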

## Assumptions of Naive Bayes Classification

The fundamental Naive Bayes assumptions are:

- <b>Feature independence</b>: The features of the data are conditionally independent of each other, given the class label.
- <b>Features are equally important</b>: All features are assumed to contribute equally to the prediction of the class label. Each of them is given the same weight.
- <b>No missing data</b>: The data should not contain any missing values.

With relation to the dataset, these assumptions can be understood as follows:
- We assume that no pair of features is dependent. For example, the <b>Age</b> of a person has nothing to do with their <b>Credit Rating</b>, so the two are treated as independent of each other.
- We assume that all the features are equally important. The prediction cannot be made by knowing only the <b>Age</b> or only the <b>Income</b> of a person.
- There cannot be a record where the value of even one of the features is unknown. Such records should be dealt with before the model is trained on them.

The assumptions made by Naive Bayes are generally not correct in real-world situations. In fact, the independence assumption almost never holds exactly, but the classifier often works well in practice. Now, before moving to the formula for Naive Bayes, it is important to know about Bayes’ theorem.
 

## Bayes’ Theorem

Bayes’ Theorem finds the probability of an event occurring given the probability of another event that has already occurred. Bayes’ theorem is stated mathematically as the following equation:

$$ P(Y|X) =  \frac{P(X|Y) P(Y)}{P(X)} $$


where Y and X are events and P(X) ≠ 0

Basically, we are trying to find the probability of event Y, given that event X is true. Event X is also termed the evidence.

- P(Y) is the prior probability of Y, i.e. the probability of the event before the evidence is seen. The evidence is an attribute value of an unknown instance (here, it is event X).
- P(X) is the marginal probability, i.e. the probability of the evidence.
- P(Y|X) is the posterior probability of Y, i.e. the probability of the event after the evidence is seen.
- P(X|Y) is the likelihood, i.e. the probability of observing the evidence given that the hypothesis Y is true.
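To make the four terms concrete, here is a small check on the dataset introduced earlier, taking Y as "Buys Computer = Yes" and X as "Age = Young". It is only a sketch: the `records` list is copied by hand from the Age and Buys Computer columns, and the variable names are illustrative. Exact fractions are used so the two sides of Bayes’ theorem can be compared without rounding.

```python
from fractions import Fraction

# (Age, Buys Computer) pairs copied from the ten records of the dataset
records = [
    ("Young", "No"), ("Young", "No"), ("Young", "Yes"),
    ("Young", "Yes"), ("Young", "No"),
    ("Middle-aged", "Yes"), ("Middle-aged", "Yes"), ("Middle-aged", "Yes"),
    ("Middle-aged", "Yes"), ("Middle-aged", "Yes"),
]
n = len(records)

n_yes = sum(1 for _, buys in records if buys == "Yes")
n_young = sum(1 for age, _ in records if age == "Young")
n_young_and_yes = sum(1 for r in records if r == ("Young", "Yes"))

p_y = Fraction(n_yes, n)                        # prior P(Y)
p_x = Fraction(n_young, n)                      # evidence P(X)
p_x_given_y = Fraction(n_young_and_yes, n_yes)  # likelihood P(X|Y)

posterior = p_x_given_y * p_y / p_x             # Bayes' theorem: P(Y|X)
direct = Fraction(n_young_and_yes, n_young)     # P(Y|X) counted directly

print(posterior, direct)
```

Both routes give the same value, which is what the theorem guarantees.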

With regard to a dataset, we can apply Bayes’ theorem in the following way:

$$ P(Y = y \mid X = \langle x_1, x_2, \dots , x_n \rangle) =  \frac{P(X = \langle x_1, x_2, \dots , x_n \rangle \mid Y = y) \cdot P(Y = y)}{P(X = \langle x_1, x_2, \dots , x_n \rangle)} $$

where,
- Y is the target feature whose value we need to predict; in this case, it is the column "Buys Computer".
- y is a value of the class variable; in our case, it is either "Yes" or "No" for the column "Buys Computer".
- X is a feature vector (of size n); for the dataset above, X holds the values of the features in an instance or record.

For example, take

$$ X = \langle \text{Young}, \text{Low}, \text{Yes}, \text{Fair} \rangle, \quad y = \text{No} $$

In this instance, P(Y=y | X) is the probability of not buying a computer given that the customer is young, has a low income, is a student, and has a fair credit rating (refer to the first row).

For simplicity, we write it as,
$$ P(y|x_1, x_2, \dots , x_n) =  \frac{P(x_1, x_2, \dots , x_n | y) \cdot P(y)}{P(x_1, x_2, \dots , x_n)} $$

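Now, recall that if two events A and B are independent, then,
$$ P(A,B) = P(A) \cdot P(B) $$

The "naive" assumption applies this rule to the features once the class is known: given $y$, the features are independent of each other, so
$$ P(x_1, x_2, \dots , x_n | y) = P(x_1|y) \cdot P(x_2|y) \cdot \dots \cdot P(x_n|y) $$

Hence, we can write,
$$ P(y|x_1, x_2, \dots , x_n) =  \frac{P(x_1|y) \cdot P(x_2|y) \cdot \dots \cdot P(x_n|y) \cdot P(y)}{P(x_1, x_2, \dots , x_n)} $$

which can be expressed as:
$$ P(y|x_1, x_2, \dots , x_n) =  \frac{P(y) \cdot \prod_{i=1}^{n}P(x_i|y)}{P(x_1, x_2, \dots , x_n)} $$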

We compute this probability for every value of $y \in Y$ and choose the class with the highest one. The denominator does not depend on $y$, so it is the same for every class and we can ignore it when comparing. The expression then reduces to a proportionality:
$$ P(y|x_1, x_2, \dots , x_n) \propto  P(y) \cdot \prod_{i=1}^{n}P(x_i|y) $$

We need to find the class $y$ with the highest probability for a given input. Mathematically, it can be expressed as:
$$ \hat{y} = \underset{y}{\operatorname{argmax}} \; P(y) \cdot \prod_{i=1}^{n}P(x_i|y) $$
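
As an aside, a product of many probabilities can become very small when there are many features. A common workaround, not required by the derivation above, is to compare sums of logarithms instead. Because the logarithm is an increasing function, the predicted class (written $\hat{y}$ here) is unchanged, assuming every factor is non-zero:

$$ \hat{y} = \underset{y}{\operatorname{argmax}} \left[ \log P(y) + \sum_{i=1}^{n} \log P(x_i|y) \right] $$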

In [3]:
data

| Unnamed: 0 | Age | Income | Student | Credit Rating | Buys Computer |
|---|---|---|---|---|---|
| 0 | Young | Low | Yes | Fair | No |
| 1 | Young | Low | Yes | Excellent | No |
| 2 | Young | Medium | Yes | Fair | Yes |
| 3 | Young | High | No | Fair | Yes |
| 4 | Young | High | No | Excellent | No |
| 5 | Middle-aged | High | No | Excellent | Yes |
| 6 | Middle-aged | High | No | Excellent | Yes |
| 7 | Middle-aged | Medium | Yes | Fair | Yes |
| 8 | Middle-aged | Low | No | Excellent | Yes |
| 9 | Middle-aged | Medium | No | Excellent | Yes |


Now, let us take a case: let us predict whether a customer is going to buy a computer, given that the customer is young, is a student, has a low income, and has a fair credit rating. Mathematically, we need to compare P(Yes|Customer's Condition) and P(No|Customer's Condition); since both share the same denominator, it is enough to compare their numerators.
Whichever has the higher value determines the predicted outcome.

So,

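$$ \begin{aligned}
P(\text{Yes}|\text{Customer's Condition}) &\propto P(\text{Customer's Condition}|\text{Yes}) \cdot P(\text{Yes}) \\
&= P(\text{Young}|\text{Yes}) \cdot P(\text{Low}|\text{Yes}) \cdot P(\text{Student = Yes}|\text{Yes}) \cdot P(\text{Fair}|\text{Yes}) \cdot P(\text{Yes}) \\
&= \frac{2}{7} \cdot \frac{1}{7} \cdot \frac{2}{7} \cdot \frac{3}{7} \cdot \frac{7}{10} \\
&\approx 0.0035
\end{aligned} $$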
                          

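Similarly, we calculate,

$$ \begin{aligned}
P(\text{No}|\text{Customer's Condition}) &\propto P(\text{Customer's Condition}|\text{No}) \cdot P(\text{No}) \\
&= P(\text{Young}|\text{No}) \cdot P(\text{Low}|\text{No}) \cdot P(\text{Student = Yes}|\text{No}) \cdot P(\text{Fair}|\text{No}) \cdot P(\text{No}) \\
&= \frac{3}{3} \cdot \frac{2}{3} \cdot \frac{2}{3} \cdot \frac{1}{3} \cdot \frac{3}{10} \\
&\approx 0.0444
\end{aligned} $$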
                              
                              
So, we see that P(Yes|Customer's Condition) < P(No|Customer's Condition).

Hence, we predict that a customer who is young, is a student, has a low income, and has a fair credit rating will not be buying a computer anytime soon.
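
The hand calculation can be double-checked with exact fractions. The sketch below only re-multiplies the counts read off the table (7 "Yes" rows and 3 "No" rows out of 10); the names `score_yes`, `score_no`, and `posterior_*` are illustrative. Dividing each score by their sum restores the denominator that was dropped earlier, which turns the two scores into probabilities that add up to 1.

```python
from fractions import Fraction as F

# Unnormalized scores: P(x_1|y) * P(x_2|y) * P(x_3|y) * P(x_4|y) * P(y)
score_yes = F(2, 7) * F(1, 7) * F(2, 7) * F(3, 7) * F(7, 10)
score_no = F(3, 3) * F(2, 3) * F(2, 3) * F(1, 3) * F(3, 10)

print(float(score_yes))
print(float(score_no))

# Normalize so that the two posteriors sum to 1.
total = score_yes + score_no
posterior_yes = score_yes / total
posterior_no = score_no / total

print(round(float(posterior_yes), 3), round(float(posterior_no), 3))
```

After normalization, the model puts roughly 0.073 on "Yes" and 0.927 on "No" for this customer, so the prediction is "No" by a wide margin.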


So let us focus on implementation now.

In [4]:
#This function calculates the P(Y = y)
def calculate_prior(df, Y):
    classes = sorted(list(df[Y].unique()))
    prior = []
    
    for i in classes:
        prior.append( len(df[df[Y] == i]) / len(df[Y]))
        
    return prior

In [21]:
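# This function calculates P(x_1|Y=y) * P(x_2|Y=y) * ... * P(x_n|Y=y)
def calculate_likelihood(df,X,x_vals,Y,y):
    # keep only the rows of class y, then drop the class column (assumed to be the last one)
    df = df[df[Y] == y]
    df = df.iloc[:,:-1]
    
    # each feature name in X needs exactly one value in x_vals
    if(len(X) != len(x_vals)): print("not matching data")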
    
    print(f"X : {X}   x : {x_vals}")
    
    p_xi_given_Y = 1
    
    n = len(X)
    
    for i in range(n):
        feature_name = X[i]
        feature_val = x_vals[i]
        p_xi_given_Y *= (len(df[df[feature_name] == feature_val])/len(df))
        
    return p_xi_given_Y

In [23]:
#calculate priors
p_Y_no, p_Y_yes = calculate_prior(data,'Buys Computer')

X = ["Age", "Income", "Student", "Credit Rating"]
x = ["Young", "Low", "Yes", "Fair"]
Y = "Buys Computer"

#calculate likelihood
p_X_given_Y_yes = calculate_likelihood(data,X,x,Y,y = "Yes")
p_X_given_Y_no = calculate_likelihood(data,X,x,Y,y ="No")

#calculate the main values
P_Y_given_X_yes = p_X_given_Y_yes * p_Y_yes

P_Y_given_X_no = p_X_given_Y_no * p_Y_no

if(P_Y_given_X_yes > P_Y_given_X_no):
    print("The customer with such condition will most likely purchase a computer")
else:
    print("The customer with such condition will probably not purchase a computer")

X : ['Age', 'Income', 'Student', 'Credit Rating']   x : ['Young', 'Low', 'Yes', 'Fair']
X : ['Age', 'Income', 'Student', 'Credit Rating']   x : ['Young', 'Low', 'Yes', 'Fair']
The customer with such condition will probably not purchase a computer
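
The same steps can be wrapped into one function that loops over every class, so the class labels do not have to be hard-coded. This is a minimal sketch in plain Python rather than pandas; `ROWS` is copied by hand from the dataset above, and `predict` is a hypothetical helper, not part of the notebook's own code.

```python
from collections import Counter

# Column order: Age, Income, Student, Credit Rating, Buys Computer (class)
ROWS = [
    ("Young", "Low", "Yes", "Fair", "No"),
    ("Young", "Low", "Yes", "Excellent", "No"),
    ("Young", "Medium", "Yes", "Fair", "Yes"),
    ("Young", "High", "No", "Fair", "Yes"),
    ("Young", "High", "No", "Excellent", "No"),
    ("Middle-aged", "High", "No", "Excellent", "Yes"),
    ("Middle-aged", "High", "No", "Excellent", "Yes"),
    ("Middle-aged", "Medium", "Yes", "Fair", "Yes"),
    ("Middle-aged", "Low", "No", "Excellent", "Yes"),
    ("Middle-aged", "Medium", "No", "Excellent", "Yes"),
]

def predict(rows, query):
    """Return (best_class, scores), where scores[c] = P(c) * prod_i P(x_i | c).

    rows  : tuples whose last element is the class label
    query : feature values, in the same order as the feature columns
    """
    class_counts = Counter(row[-1] for row in rows)
    scores = {}
    for label, count in class_counts.items():
        score = count / len(rows)                  # prior P(y)
        for i, value in enumerate(query):
            matches = sum(
                1 for row in rows if row[-1] == label and row[i] == value
            )
            score *= matches / count               # likelihood P(x_i | y)
        scores[label] = score
    best = max(scores, key=scores.get)             # argmax over the classes
    return best, scores

best, scores = predict(ROWS, ("Young", "Low", "Yes", "Fair"))
print(best, scores)
```

For the customer used above, this sketch also selects "No". Like the notebook's own functions, it uses raw frequency counts with no smoothing.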


## Types of Naive Bayes Classifiers

The different naive Bayes classifiers differ mainly in the assumptions they make regarding the distribution of $P(x_i|Y)$.

### Gaussian 
- Used for continuous features; it assumes that each feature follows a normal (Gaussian) distribution.
- Example: Classifying documents into different categories based on their weights.
![image.png](attachment:image.png)
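
For a continuous feature, Gaussian naive Bayes typically estimates a mean and a variance per class and evaluates the normal density in place of a frequency count. The sketch below illustrates that idea only; the function name `gaussian_likelihood` and the sample values are made up for illustration and are not data from this notebook.

```python
import math
import statistics

def gaussian_likelihood(x, mean, variance):
    """Normal density at x, used in place of P(x_i | y) for a continuous feature."""
    coeff = 1.0 / math.sqrt(2.0 * math.pi * variance)
    return coeff * math.exp(-((x - mean) ** 2) / (2.0 * variance))

# Hypothetical values of one feature, observed for a single class
values_for_class = [4.0, 5.0, 6.0]
mu = statistics.mean(values_for_class)
var = statistics.pvariance(values_for_class)   # population variance

# The density is highest at the class mean and falls off away from it.
print(gaussian_likelihood(5.0, mu, var))
print(gaussian_likelihood(6.0, mu, var))
```

Note that this value is a density rather than a probability, so it can exceed 1 when the variance is small; only its relative size across classes matters for the prediction.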

### Multinomial 
- Suitable for features representing counts or frequencies, typically used in text classification tasks.
- Example: Classifying documents into different categories based on the frequency of words in the document.

### Bernoulli
- Assumes that features are binary (Boolean), representing presence or absence.
- Example: Classifying documents into categories based on the presence or absence of specific words, regardless of their frequency.

## Advantages 
- Easy to implement and computationally efficient.
- Effective in cases with a large number of features.
- Performs well even with limited training data.

## Disadvantages
- Assumes that features are independent, which may not always hold in real-world data.
- Can be influenced by irrelevant attributes.
- May assign zero probability to unseen events, leading to poor generalization.
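
The zero-probability problem already appears in the small dataset above: none of the three "No" records has Age = Middle-aged, so P(Middle-aged|No) = 0 and the whole product for "No" collapses to zero for any middle-aged customer. One common remedy, which the implementation above does not use, is add-one (Laplace) smoothing. A minimal sketch, with `smoothed_likelihood` as a hypothetical helper name:

```python
from fractions import Fraction

def smoothed_likelihood(match_count, class_count, n_categories, alpha=1):
    """Estimate P(x_i | y) with additive smoothing.

    match_count  : rows of class y that have the feature value
    class_count  : rows of class y
    n_categories : number of distinct values the feature can take
    alpha        : pseudo-count added to every category (1 = Laplace)
    """
    return Fraction(match_count + alpha, class_count + alpha * n_categories)

# Age has 2 categories; 0 of the 3 "No" rows are Middle-aged, 3 of 3 are Young.
p_middle_given_no = smoothed_likelihood(0, 3, 2)
p_young_given_no = smoothed_likelihood(3, 3, 2)

print(p_middle_given_no, p_young_given_no)
```

The smoothed estimates are no longer zero, and they still sum to 1 over the two Age categories.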

## Applications
- <b>Spam Email Filtering</b>: Classifies emails as spam or non-spam based on features.
- <b>Text Classification</b>: Used in sentiment analysis, document categorization, and topic classification.
- <b>Medical Diagnosis</b>: Helps in predicting the likelihood of a disease based on symptoms.
- <b>Credit Scoring</b>: Evaluates creditworthiness of individuals for loan approval.
- <b>Weather Prediction</b>: Classifies weather conditions based on various factors.