## ML using Python - Naive Bayes Classification 

Priyaranjan Mohanty

Naive Bayes is a simple and fast classification algorithm that scales well to large datasets. Naive Bayes classifiers are used successfully in applications such as spam filtering, text classification, sentiment analysis, and recommender systems. The algorithm uses Bayes' theorem to predict the class of an unseen example.

Naive Bayes classification is a supervised machine learning method and therefore follows this workflow:

![image.png](attachment:image.png)

### Naive Bayes : Let's understand what Naive Bayes is

Naive Bayes is a statistical/probabilistic classification technique based on Bayes' theorem.

It is one of the simplest supervised learning algorithms.

Naive Bayes classifiers are fast to train and to predict with, even on large datasets, and they are often accurate enough to serve as a strong baseline.

A Naive Bayes classifier assumes that, within a class, the effect of a particular feature is independent of the other features. For example, whether a loan applicant is desirable or not depends on his/her income, previous loan and transaction history, age, and location. Even if these features are interdependent, the classifier still treats each of them independently. This assumption simplifies computation, which is why the method is called naive. The assumption is known as class conditional independence.
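Written as a formula, class conditional independence says that the joint likelihood of the features factorizes into one term per feature. The symbols $x_1, \dots, x_n$ (feature values) and $c$ (class label) are notation introduced here for illustration only:

$$
P(x_1, x_2, \dots, x_n \mid c) = \prod_{i=1}^{n} P(x_i \mid c)
$$

For the loan example, this means the likelihood of an applicant's income, loan history, age, and location given a class is treated as the product of four separate single-feature likelihoods.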

![image.png](attachment:image.png)

P(h): the probability of hypothesis h being true (regardless of the data). This is known as the prior probability of h.

P(D): the probability of the data (regardless of the hypothesis). This is known as the evidence, or the marginal probability of the data.

P(h|D): the probability of hypothesis h given the data D. This is known as the posterior probability.

P(D|h): the probability of data D given that the hypothesis h is true. This is known as the likelihood.
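Using these four quantities, Bayes' theorem can be restated in LaTeX as follows (assuming the figure uses the same symbols h and D):

$$
P(h \mid D) = \frac{P(D \mid h)\, P(h)}{P(D)}
$$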

## Let's understand how the Naive Bayes classifier works

Let's understand how Naive Bayes works through an example.

Consider a dataset of weather conditions and whether sports were played. We need to classify whether players will play or not based on the weather condition, which means calculating the probability of playing given the weather.

A Naive Bayes classifier calculates the probability of an event in the following steps:
    Step 1: Calculate the prior probability of each class label.
    
    Step 2: Find the likelihood of each attribute value for each class.
    
    Step 3: Put these values into Bayes' formula and calculate the posterior probability.
    
    Step 4: Compare the posterior probabilities of the classes and assign the input to the class with the highest one.
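The four steps above can be sketched in plain Python. This is a minimal illustration, not a library implementation: the function names, the two features (outlook and temperature), and the eight-row dataset are all hypothetical and made up for this example, and no smoothing is applied for feature values that never occur with a class.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count how often each class and each (class, feature, value) occurs."""
    class_counts = Counter(labels)
    # value_counts[label][feature_index][value] -> number of occurrences
    value_counts = defaultdict(lambda: defaultdict(Counter))
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[label][i][value] += 1
    return class_counts, value_counts

def posterior(row, class_counts, value_counts):
    """Return the posterior probability of every class for one input row."""
    total = sum(class_counts.values())
    scores = {}
    for label, n_label in class_counts.items():
        score = n_label / total  # Step 1: prior P(class)
        for i, value in enumerate(row):
            # Step 2: likelihood P(value | class), multiplied per feature
            score *= value_counts[label][i][value] / n_label
        scores[label] = score
    # Step 3: divide by the evidence so the posteriors sum to 1
    evidence = sum(scores.values())
    return {label: score / evidence for label, score in scores.items()}

# Hypothetical toy data: (outlook, temperature) -> play?
rows = [
    ("Sunny", "Hot"), ("Sunny", "Mild"), ("Sunny", "Cool"), ("Overcast", "Hot"),
    ("Overcast", "Mild"), ("Rainy", "Mild"), ("Rainy", "Cool"), ("Sunny", "Hot"),
]
labels = ["No", "No", "Yes", "Yes", "Yes", "Yes", "No", "No"]

class_counts, value_counts = train_naive_bayes(rows, labels)
post = posterior(("Sunny", "Mild"), class_counts, value_counts)

# Step 4: pick the class with the highest posterior probability
prediction = max(post, key=post.get)
print(post, prediction)
```

In this toy data the query `("Sunny", "Mild")` is assigned to `"No"`, because sunny days are mostly non-playing days and that outweighs the mild temperature.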
    
To simplify the calculation of prior and posterior probabilities, we can use two kinds of tables: a frequency table and likelihood tables.

The frequency table contains the number of occurrences of each label for every feature value.

There are two likelihood tables. Likelihood Table 1 shows the prior probabilities of the labels and Likelihood Table 2 shows the posterior probabilities.

![image.png](attachment:image.png)

Now we want to calculate the probability of playing when the weather is overcast.

Probability of playing:
        
        P(Yes | Overcast) = P(Overcast | Yes) P(Yes) / P (Overcast) .....................(1)


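Calculate Prior Probabilities:
        
        P(Overcast) = 4/14 = 0.29

        P(Yes) = 9/14 = 0.64
        
Calculate the Likelihood:

        P(Overcast | Yes) = 4/9 = 0.44

Put the prior probabilities and the likelihood in equation (1)

        P(Yes | Overcast) = (4/9) * (9/14) / (4/14) = 1.0 (Higher)

With the rounded values, the same expression gives 0.44 * 0.64 / 0.29 ≈ 0.97. The exact result is 1.0 because all 4 overcast days in the data are days on which play took place.

Similarly, you can calculate the probability of not playing.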
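The calculation can be checked with a few lines of Python. The sketch below uses only the counts quoted in the text (14 days, 9 "Yes" days, 4 overcast days, all of them "Yes"); the variable names are chosen here for illustration. Working with exact fractions avoids the rounding error that creeps in when 0.44, 0.64 and 0.29 are multiplied and divided by hand.

```python
from fractions import Fraction

# Counts taken from the worked example
n_days = 14
n_yes = 9
n_overcast = 4
n_overcast_and_yes = 4

p_yes = Fraction(n_yes, n_days)                             # prior P(Yes)
p_overcast = Fraction(n_overcast, n_days)                   # evidence P(Overcast)
p_overcast_given_yes = Fraction(n_overcast_and_yes, n_yes)  # likelihood P(Overcast | Yes)

# Bayes' theorem, equation (1)
p_yes_given_overcast = p_overcast_given_yes * p_yes / p_overcast
p_no_given_overcast = 1 - p_yes_given_overcast

print(p_yes_given_overcast)  # 1
print(p_no_given_overcast)   # 0

# The same formula with inputs rounded to two decimals, as done by hand
rounded = round(0.44 * 0.64 / 0.29, 2)
print(rounded)  # 0.97
```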

## Now, let's build a Naive Bayes classifier using scikit-learn

We are going to use the Iris dataset, which ships with the scikit-learn library.

The dataset contains 3 classes of 50 instances each, where each class refers to a type of iris plant.

Here we are going to use the GaussianNB model, which is available in scikit-learn's `naive_bayes` module.

### Step 1 : Importing Libraries and Loading Datasets

In [1]:
from sklearn import datasets
from sklearn import metrics
from sklearn.naive_bayes import GaussianNB
 

Iris_dataset = datasets.load_iris()

In [2]:
Iris_dataset

{'data': array([[5.1, 3.5, 1.4, 0.2],
        [4.9, 3. , 1.4, 0.2],
        [4.7, 3.2, 1.3, 0.2],
        [4.6, 3.1, 1.5, 0.2],
        [5. , 3.6, 1.4, 0.2],
        [5.4, 3.9, 1.7, 0.4],
        [4.6, 3.4, 1.4, 0.3],
        [5. , 3.4, 1.5, 0.2],
        [4.4, 2.9, 1.4, 0.2],
        [4.9, 3.1, 1.5, 0.1],
        [5.4, 3.7, 1.5, 0.2],
        [4.8, 3.4, 1.6, 0.2],
        [4.8, 3. , 1.4, 0.1],
        [4.3, 3. , 1.1, 0.1],
        [5.8, 4. , 1.2, 0.2],
        [5.7, 4.4, 1.5, 0.4],
        [5.4, 3.9, 1.3, 0.4],
        [5.1, 3.5, 1.4, 0.3],
        [5.7, 3.8, 1.7, 0.3],
        [5.1, 3.8, 1.5, 0.3],
        [5.4, 3.4, 1.7, 0.2],
        [5.1, 3.7, 1.5, 0.4],
        [4.6, 3.6, 1. , 0.2],
        [5.1, 3.3, 1.7, 0.5],
        [4.8, 3.4, 1.9, 0.2],
        [5. , 3. , 1.6, 0.2],
        [5. , 3.4, 1.6, 0.4],
        [5.2, 3.5, 1.5, 0.2],
        [5.2, 3.4, 1.4, 0.2],
        [4.7, 3.2, 1.6, 0.2],
        [4.8, 3.1, 1.6, 0.2],
        [5.4, 3.4, 1.5, 0.4],
        [5.2, 4.1, 1.5, 0.1],
  

In [3]:
Iris_dataset.data.shape

(150, 4)

In [4]:
Iris_dataset.target.shape

(150,)
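Beyond the shapes, it can help to look at what the four columns and three classes are. The following sketch is self-contained and uses the `feature_names` and `target_names` attributes of the object returned by `load_iris()`; the variable names `iris` and `class_counts` are introduced here only for this example.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# Names of the four measured features (columns of iris.data)
print(iris.feature_names)

# Names of the three iris species encoded as 0, 1 and 2 in iris.target
print(iris.target_names)

# Number of samples per class
class_counts = np.bincount(iris.target)
print(class_counts)  # [50 50 50]
```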

### Step 2 : Creating our Naive Bayes Model using Sklearn

In [6]:
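# Create a Gaussian Naive Bayes model and fit it on the full Iris dataset
NB_model = GaussianNB()

NB_model.fit(Iris_dataset.data, Iris_dataset.target)

# True labels, and the labels predicted for the same samples used in training
Response_expected = Iris_dataset.target

Response_predicted = NB_model.predict(Iris_dataset.data)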

In [7]:
Response_predicted

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
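The predicted labels are the classes with the highest posterior probability. As a small sketch of how to see those posteriors directly, `predict_proba` returns one probability per class for each sample. The names `model`, `probabilities` and `predicted` below are introduced only for this self-contained example.

```python
import numpy as np
from sklearn import datasets
from sklearn.naive_bayes import GaussianNB

iris = datasets.load_iris()
model = GaussianNB().fit(iris.data, iris.target)

# Posterior probability of each of the 3 classes for the first 5 samples
probabilities = model.predict_proba(iris.data[:5])
print(np.round(probabilities, 3))

# Choosing the most probable class per row reproduces predict()
predicted = model.classes_[np.argmax(probabilities, axis=1)]
print(predicted)
```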

### Step 3 : Evaluating Accuracy and Statistics

In [8]:
# Confusion matrix (rows: expected class, columns: predicted class)

print(metrics.confusion_matrix(Response_expected, Response_predicted))

[[50  0  0]
 [ 0 47  3]
 [ 0  3 47]]
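The diagonal of the confusion matrix holds the correct predictions: 50 + 47 + 47 = 144 out of 150, which is an accuracy of 0.96. Because the model was scored on the same data it was trained on, this figure is likely optimistic. The sketch below computes the accuracy and a per-class report with `sklearn.metrics`, and then repeats the evaluation on a held-out test set. The 70/30 split, the `random_state` value, and the variable names are assumptions made for this example and are not part of the original notebook.

```python
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Accuracy and per-class precision/recall on the training data itself
model = GaussianNB().fit(X, y)
train_predictions = model.predict(X)
train_accuracy = metrics.accuracy_score(y, train_predictions)
print(train_accuracy)
print(metrics.classification_report(y, train_predictions,
                                    target_names=iris.target_names))

# Hold out 30% of the samples (assumed split) to estimate accuracy on unseen data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)
holdout_model = GaussianNB().fit(X_train, y_train)
holdout_accuracy = metrics.accuracy_score(y_test, holdout_model.predict(X_test))
print(holdout_accuracy)
```

The held-out accuracy depends on the split, so it will vary with `test_size` and `random_state`.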
