# Naïve Bayes algorithm from scratch

Naïve Bayes is a **classification** algorithm based on **Bayes' theorem**. It is widely used in:
- Spam email filtering
- Object and face detection
- Weather prediction

There are **3 types** of this algorithm. We can choose the appropriate one according to the kind of features in our data.
- Gaussian Naïve Bayes (continuous features)
- Multinomial Naïve Bayes (count features)
- Bernoulli Naïve Bayes (binary features)

**In this notebook, we have implemented the Gaussian Naïve Bayes algorithm.**
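
As a reminder of the underlying idea, the Gaussian variant can be written as follows. This is the standard textbook formulation rather than something taken from this notebook, and the symbols are assumptions introduced here: $y$ is a class, $x_i$ is the $i$-th feature, and $\mu_{y,i}$, $\sigma_{y,i}$ are the mean and standard deviation of that feature within class $y$.

```latex
P(y \mid x_1, \dots, x_n) \;\propto\; P(y) \prod_{i=1}^{n} p(x_i \mid y),
\qquad
p(x_i \mid y) = \frac{1}{\sqrt{2\pi}\,\sigma_{y,i}}
\exp\!\left( -\frac{(x_i - \mu_{y,i})^2}{2\,\sigma_{y,i}^2} \right)
```

The predicted class is the one with the largest score. The "naïve" part is the assumption that the features are conditionally independent given the class, which is what allows the product over $i$.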

The following code block is not necessary for the algorithm; it only gives an idea of the dataset's structure. Because the csv file has no header row, pandas uses the first data row as column labels. Those labels correspond to the following attributes:
- 6 – Pregnancies
- 148 – Glucose
- 72 – BloodPressure
- 35 – SkinThickness
- 0 – Insulin
- 33.6 – BMI
- 0.627 – DiabetesPedigreeFunction
- 50 – Age
- 1 – Class

In [1]:
import pandas as pd

data=pd.read_csv('datasets/pima-indians-diabetes.csv')
data.head()

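|   | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
|---|---|-----|----|----|---|------|-------|----|---|
| 0 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 1 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 2 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 3 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
| 4 | 5 | 116 | 74 | 0 | 0 | 25.6 | 0.201 | 30 | 0 |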


The following libraries are needed throughout this notebook, so they are imported at the beginning.

In [5]:
import csv
import math
import random

Next, we need a function to read and load the data from the *csv* file. The **load_csv()** function does this and converts every value to a float.

In [6]:
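def load_csv(filename):
    # Open the file in a with-block so it is closed after reading.
    with open(filename,newline='') as f:
        # Skip blank lines and convert every value in each row to float.
        dataset=[[float(x) for x in row] for row in csv.reader(f) if row]
    
    return dataset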

The dataset has to be divided into training and testing parts according to the given ratio. Rows are moved into the test set by popping randomly generated indexes from a copy of the dataset, so the original list is left unchanged. The following function performs this split and returns the train and test datasets.

In [14]:
def split_dataset(dataset,split_ratio):
    test_size=int(len(dataset)*split_ratio)
    trainset=list(dataset)
    testset=[]
    while len(testset)<test_size:
        ind=random.randrange(len(trainset))
        testset.append(trainset.pop(ind))
    
    return [trainset,testset]
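
Because the split relies on `random.randrange`, each run produces a different train/test partition, and the accuracy reported at the end of the notebook changes with it. Below is a minimal sketch of one way to make the split repeatable, assuming we are free to seed Python's random generator first. The seed value `42` and the toy data are arbitrary choices for illustration, and the function body is repeated so the snippet runs on its own.

```python
import random

def split_dataset(dataset, split_ratio):
    # Same logic as the cell above, repeated so this snippet is self-contained.
    test_size = int(len(dataset) * split_ratio)
    trainset = list(dataset)
    testset = []
    while len(testset) < test_size:
        ind = random.randrange(len(trainset))
        testset.append(trainset.pop(ind))
    return [trainset, testset]

# Toy dataset of 10 single-value rows (illustrative only).
toy = [[i] for i in range(10)]

random.seed(42)  # assumption: any fixed seed makes the split repeatable
train_a, test_a = split_dataset(toy, 0.33)

random.seed(42)  # re-seeding reproduces exactly the same split
train_b, test_b = split_dataset(toy, 0.33)

print(len(train_a), len(test_a))  # 7 3
print(test_a == test_b)           # True
```

With a ratio of 0.33, `int(10 * 0.33)` gives a test set of 3 rows, in the same way that 768 rows give 253 test rows in the final output.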

Now the dataset has to be separated by **class**, which is stored in the last column of each row. After this step, the rows are grouped into **tested positive for diabetes** (class 1) and **tested negative for diabetes** (class 0).

In [15]:
def separate_by_class(dataset):
    separated={}
    for i in range(len(dataset)):
        vector=dataset[i]
        if vector[-1] not in separated:
            separated[vector[-1]]=[]
        separated[vector[-1]].append(vector)
    
    return separated
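
To see what the returned dictionary looks like, here is a small sketch on made-up rows. The values in `toy` are invented for illustration and are not taken from the diabetes dataset, and the function is repeated so the snippet runs on its own.

```python
def separate_by_class(dataset):
    # Same logic as the cell above: group rows by their last column.
    separated = {}
    for i in range(len(dataset)):
        vector = dataset[i]
        if vector[-1] not in separated:
            separated[vector[-1]] = []
        separated[vector[-1]].append(vector)
    return separated

# Three invented rows: two attributes followed by the class label.
toy = [[1.0, 20.0, 1.0], [2.0, 21.0, 0.0], [3.0, 22.0, 1.0]]

separated = separate_by_class(toy)
print(separated)
# {1.0: [[1.0, 20.0, 1.0], [3.0, 22.0, 1.0]], 0.0: [[2.0, 21.0, 0.0]]}
```

The keys are floats because `load_csv()` converts every value, including the class label, to `float`.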

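Since we are using the **Gaussian probability distribution**, we need functions to calculate the **mean** and **standard deviation** of each attribute. The following two functions can be used for this. Note that `stdev()` computes the sample standard deviation, dividing by N - 1.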

In [16]:
def mean(numbers):
    return sum(numbers)/float(len(numbers))

In [17]:
def stdev(numbers):
    average=mean(numbers)
    variance=sum([pow(x-average,2) for x in numbers])/float(len(numbers)-1)
    return math.sqrt(variance)
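
As a quick sanity check, the two functions can be compared with Python's `statistics` module, whose `stdev` is also the sample standard deviation. This is only a sketch on an invented list of numbers; the functions are repeated so the snippet runs on its own.

```python
import math
import statistics

def mean(numbers):
    return sum(numbers) / float(len(numbers))

def stdev(numbers):
    average = mean(numbers)
    variance = sum([pow(x - average, 2) for x in numbers]) / float(len(numbers) - 1)
    return math.sqrt(variance)

values = [1, 2, 3, 4, 5]  # invented example values
print(mean(values))             # 3.0
print(round(stdev(values), 4))  # 1.5811

# Cross-check against the standard library's sample standard deviation.
print(math.isclose(stdev(values), statistics.stdev(values)))  # True

# A single value leaves N - 1 = 0 in the denominator.
try:
    stdev([7])
except ZeroDivisionError:
    print("stdev needs at least two values")
```

The last case matters in practice: a class with only one training row cannot be summarized with these functions.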

In [18]:
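def summarize(dataset):
    # zip(*dataset) turns rows into columns, giving one (mean, stdev) pair per column.
    summaries=[(mean(attri),stdev(attri)) for attri in zip(*dataset)]
    # The last column is the class label, so its summary is dropped.
    del summaries[-1]
    return summaries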

In [30]:
def summarize_by_class(dataset):
    # Map each class value to the per-attribute (mean, stdev) summaries of its rows.
    separated=separate_by_class(dataset)
    summaries={}
    for classValue,instances in separated.items():
        summaries[classValue]=summarize(instances)
    return summaries
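
The following sketch shows the shape of the summaries on a tiny invented dataset with two attributes and three rows per class. The numbers are chosen so that every standard deviation is exactly 1; they are not taken from the diabetes data. The helper functions are repeated so the snippet runs on its own.

```python
import math

def mean(numbers):
    return sum(numbers) / float(len(numbers))

def stdev(numbers):
    average = mean(numbers)
    variance = sum([pow(x - average, 2) for x in numbers]) / float(len(numbers) - 1)
    return math.sqrt(variance)

def separate_by_class(dataset):
    separated = {}
    for vector in dataset:
        separated.setdefault(vector[-1], []).append(vector)
    return separated

def summarize(dataset):
    summaries = [(mean(attri), stdev(attri)) for attri in zip(*dataset)]
    del summaries[-1]  # drop the summary of the class column
    return summaries

def summarize_by_class(dataset):
    return {cls: summarize(rows) for cls, rows in separate_by_class(dataset).items()}

# Invented rows: attribute 1, attribute 2, class label.
toy = [
    [1.0, 20.0, 1.0],
    [3.0, 22.0, 1.0],
    [2.0, 21.0, 1.0],
    [10.0, 5.0, 0.0],
    [12.0, 7.0, 0.0],
    [11.0, 6.0, 0.0],
]

summaries = summarize_by_class(toy)
print(summaries[1.0])  # [(2.0, 1.0), (21.0, 1.0)]
print(summaries[0.0])  # [(11.0, 1.0), (6.0, 1.0)]
```

Each class ends up with one `(mean, stdev)` tuple per attribute, which is all the model needs to remember about the training data.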

In [20]:
def calc_prob(x,mean,stdev):
    # Gaussian probability density of x for the given mean and standard deviation.
    exponent=math.exp(-(math.pow(x-mean,2)/(2*math.pow(stdev,2))))
    return (1/(math.sqrt(2*math.pi)*stdev))*exponent
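
A few example calls may help to interpret the value returned by `calc_prob()`. This is a sketch with invented inputs; the comparison uses `statistics.NormalDist`, which is available from Python 3.8, and the function is repeated so the snippet runs on its own.

```python
import math
from statistics import NormalDist

def calc_prob(x, mean, stdev):
    exponent = math.exp(-(math.pow(x - mean, 2) / (2 * math.pow(stdev, 2))))
    return (1 / (math.sqrt(2 * math.pi) * stdev)) * exponent

# Density at the mean of a standard normal distribution: 1 / sqrt(2 * pi).
print(round(calc_prob(0.0, 0.0, 1.0), 4))  # 0.3989

# A narrow distribution gives a density above 1, so the value is not a probability.
print(round(calc_prob(0.0, 0.0, 0.1), 3))  # 3.989

# Cross-check against the standard library's normal distribution.
print(math.isclose(calc_prob(71.5, 73.0, 6.2), NormalDist(73.0, 6.2).pdf(71.5)))  # True

# A zero standard deviation (an attribute that is constant within a class) fails.
try:
    calc_prob(1.0, 1.0, 0.0)
except ZeroDivisionError:
    print("stdev must be greater than zero")
```

Since only the relative size of the class scores matters for prediction, using densities instead of true probabilities is fine here.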

In [21]:
def calc_cls_prob(summaries,inputVec):
    # For each class, multiply the Gaussian densities of all attributes.
    # Class priors are not applied here.
    probs={}
    for classVal,classSummaries in summaries.items():
        probs[classVal]=1
        for i in range(len(classSummaries)):
            mean,stdev=classSummaries[i]
            x=inputVec[i]
            probs[classVal]*=calc_prob(x,mean,stdev)
    return probs
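
A common refinement, which this notebook does not use, is to include the class prior and to add log-densities instead of multiplying densities, since a product of many small numbers can underflow to 0.0. The sketch below is only an illustration of that idea: `log_gaussian`, `calc_cls_log_prob` and the `priors` dictionary are hypothetical names introduced here, and the summaries are invented.

```python
import math

def log_gaussian(x, mean, stdev):
    # Hypothetical helper: log of the Gaussian density used by calc_prob().
    return -math.log(math.sqrt(2 * math.pi) * stdev) - ((x - mean) ** 2) / (2 * stdev ** 2)

def calc_cls_log_prob(summaries, priors, inputVec):
    # Hypothetical variant: log prior plus the sum of per-attribute log-densities.
    scores = {}
    for classVal, classSummaries in summaries.items():
        scores[classVal] = math.log(priors[classVal])
        for i, (mean, stdev) in enumerate(classSummaries):
            scores[classVal] += log_gaussian(inputVec[i], mean, stdev)
    return scores

# Invented one-attribute summaries and equal priors, for illustration only.
summaries = {0.0: [(1.0, 0.5)], 1.0: [(20.0, 5.0)]}
priors = {0.0: 0.5, 1.0: 0.5}

scores = calc_cls_log_prob(summaries, priors, [1.1])
best = max(scores, key=scores.get)
print(best)  # 0.0
```

In practice the priors would typically be estimated as the fraction of training rows in each class. With equal priors, the ranking of classes is the same as with the plain product.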

In [22]:
def predict(summaries,inputVec):
    # Return the class label with the highest score from calc_cls_prob().
    probs=calc_cls_prob(summaries,inputVec)
    best_label,best_prob=None,-1
    for classVal,probability in probs.items():
        if best_label is None or probability>best_prob:
            best_prob=probability
            best_label=classVal
    return best_label
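
Here is a small sketch of `predict()` in action. The summaries are written by hand rather than learned from data: class `'A'` is centred at 1 and class `'B'` at 20, both invented for illustration. The functions are repeated so the snippet runs on its own.

```python
import math

def calc_prob(x, mean, stdev):
    exponent = math.exp(-(math.pow(x - mean, 2) / (2 * math.pow(stdev, 2))))
    return (1 / (math.sqrt(2 * math.pi) * stdev)) * exponent

def calc_cls_prob(summaries, inputVec):
    probs = {}
    for classVal, classSummaries in summaries.items():
        probs[classVal] = 1
        for i in range(len(classSummaries)):
            mean, stdev = classSummaries[i]
            probs[classVal] *= calc_prob(inputVec[i], mean, stdev)
    return probs

def predict(summaries, inputVec):
    probs = calc_cls_prob(summaries, inputVec)
    best_label, best_prob = None, -1
    for classVal, probability in probs.items():
        if best_label is None or probability > best_prob:
            best_prob = probability
            best_label = classVal
    return best_label

# Hand-written (mean, stdev) summaries for a single attribute.
summaries = {'A': [(1.0, 0.5)], 'B': [(20.0, 5.0)]}

print(predict(summaries, [1.1]))   # A
print(predict(summaries, [19.0]))  # B
```

Only the first `len(classSummaries)` values of the input are read, which is why the test rows in this notebook can keep their class label in the last position.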

In [23]:
def get_predictions(summaries,test_set):
    predictions=[]
    for i in range(len(test_set)):
        result=predict(summaries,test_set[i])
        predictions.append(result)
    return predictions

In [36]:
def get_accuracy(test_set,predictions):
    # Percentage of test rows whose true class (last column) equals the prediction.
    correct=0
    for j in range(len(test_set)):
        if test_set[j][-1]==predictions[j]:
            correct+=1

    return (correct/float(len(test_set)))*100.0
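
A short sketch with invented rows shows how the accuracy is computed. Three of the four predictions match the class label in the last column, so the result is 75%. The function is repeated so the snippet runs on its own.

```python
def get_accuracy(test_set, predictions):
    correct = 0
    for j in range(len(test_set)):
        if test_set[j][-1] == predictions[j]:
            correct += 1
    return (correct / float(len(test_set))) * 100.0

# Invented rows: two attributes followed by the true class label.
test_set = [[1.0, 1.0, 1.0], [2.0, 2.0, 1.0], [3.0, 3.0, 0.0], [4.0, 4.0, 0.0]]
predictions = [1.0, 1.0, 1.0, 0.0]  # the third prediction is wrong

print(get_accuracy(test_set, predictions))  # 75.0
```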

In [37]:
def main():
    filename='datasets/pima-indians-diabetes.csv'
    test_size=0.33
    dataset=load_csv(filename)
    trainset,testset=split_dataset(dataset,test_size)
    
    print('Split total {0} rows into train = {1} and test = {2} rows'.format(len(dataset),len(trainset),len(testset)))
    
    summaries=summarize_by_class(trainset)
    predictions=get_predictions(summaries,testset)
    accuracy=get_accuracy(testset,predictions)
    print('Accuracy: {0}%'.format(accuracy))

main()

Split total 768 rows into train = 515 and test = 253 rows
Accuracy: 72.33201581027669%
