## 5. Naïve Bayesian classifier

### Aim:
To classify each data tuple in the CSV file into its class (0 or 1) using a Naive Bayesian classifier.

A Naive Bayesian classifier is a probabilistic classifier: it estimates the probability that a data tuple belongs to a given class. Because the result is a probability, it lies between 0 and 1.

### References:
1. GeeksforGeeks: https://www.geeksforgeeks.org/naive-bayes-classifiers/

In [1]:
import csv
import random
import math

## 1. Read file
Read the CSV file, convert every item to a floating-point number, and return the dataset.

In [2]:
def load_csv(filename):
    # Open with a context manager so the file is closed after reading
    with open(filename, 'r', newline='') as csv_file:
        dataset = list(csv.reader(csv_file))
    # Convert each string value to float
    for i in range(len(dataset)):
        dataset[i] = [float(value) for value in dataset[i]]
    return dataset
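
The loader above assumes every row is purely numeric and that the file has no header row or blank lines. A minimal sketch of a more tolerant variant follows; the name `load_numeric_rows` and the `has_header` flag are hypothetical and not part of the original notebook, and an in-memory string stands in for the CSV file.

```python
import csv
import io

def load_numeric_rows(handle, has_header=False):
    """Parse CSV text from an open file-like object into rows of floats."""
    reader = csv.reader(handle)
    if has_header:
        next(reader, None)  # discard the header row, if present
    # csv.reader yields an empty list for a blank line, so skip those
    return [[float(cell) for cell in row] for row in reader if row]

# In-memory sample standing in for a real file (hypothetical data)
sample = io.StringIO("a,b,label\n1,2.5,0\n\n3,4,1\n")
rows = load_numeric_rows(sample, has_header=True)
print(rows)  # [[1.0, 2.5, 0.0], [3.0, 4.0, 1.0]]
```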

## 2. Split data
Split the data into training and testing sets.<br>
The training data is used to fit the model; the testing data is what we want to predict.<br>
<code>split_ratio</code> specifies the __fraction of the data__ (for example <code>0.7</code>) given to __training__.<br>
Rows are selected at random from the dataset for the training data; the remaining rows form the testing data.<br>
Returns a list ```[train_data, test_data]```.

In [6]:
def split_dataset(dataset, split_ratio):
    train_len = int(len(dataset) * split_ratio)
    train_data = []
    test_data = list(dataset)
    while len(train_data) < train_len:
        # Select one random index and move: test-> train data
        index = random.randrange(len(test_data))
        train_data.append(test_data.pop(index))
    return [train_data, test_data]
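
Because the split is random, every run produces different training and testing sets and therefore a different accuracy. One possible way to make the split repeatable is sketched below, assuming a fixed seed is acceptable; `split_dataset_seeded` and its `seed` parameter are hypothetical names, and it shuffles a copy instead of popping random rows.

```python
import random

def split_dataset_seeded(dataset, split_ratio, seed=42):
    """Shuffle a copy of the rows with a private RNG, then cut it at split_ratio."""
    rng = random.Random(seed)   # private generator; the global random state is untouched
    shuffled = list(dataset)    # copy so the caller's list keeps its order
    rng.shuffle(shuffled)
    train_len = int(len(shuffled) * split_ratio)
    return shuffled[:train_len], shuffled[train_len:]

# Hypothetical toy rows: [attribute, label]
rows = [[i, i % 2] for i in range(20)]
train, test = split_dataset_seeded(rows, 0.75)
print(len(train), len(test))  # 15 5
```

Calling the function again with the same seed returns the same split, which makes accuracy figures comparable between runs.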

## 3. Prepare the data
Convert data of the form<br>
```[attrib1, attrib2, ..., attribn, label]``` <br>
to<br>```dictionary(label) -> [(mean(attrib1), stdev(attrib1)), (mean(attrib2), stdev(attrib2)), ..., (mean(attribn), stdev(attribn))]```.<br>

> Hence we need ```mean(numbers)``` as well as ```stdev(numbers)```.<br>
Formula, Mean: <img src="https://www.gstatic.com/education/formulas/images_long_sheet/mean.svg" width="20%"/> Standard Deviation :<img src="https://www.gstatic.com/education/formulas/images_long_sheet/sample_standard_deviation.svg" width="20%"/>

> ```separateByClass(dataset)```: groups the rows of the dataset into a dictionary keyed by __class__; each value is the list of rows belonging to that class.

> ```summarizeByClass(dataset)```: takes the dataset and builds a dictionary of __class__ and its per-attribute __mean and standard deviation__ by calling ```summarize``` on each group returned by ```separateByClass```.


In [63]:
def mean(numbers):
    return sum(numbers) / float(len(numbers))

def stdev(numbers):
    avg = mean(numbers)
    # Sample variance: divide by (n - 1)
    variance = sum([pow(x - avg, 2) for x in numbers]) / float(len(numbers) - 1)
    return math.sqrt(variance)

def summarize(dataset):
    # zip(*dataset) yields one tuple per column
    summary = [(mean(x), stdev(x)) for x in zip(*dataset)]
    # Drop the (mean, stdev) of the label column; it is not needed
    del summary[-1]
    return summary
    
def separateByClass(dataset):
    separate = {}
    for vector in dataset:
        if vector[-1] not in separate:
            separate[vector[-1]] = []
        separate[vector[-1]].append(vector)
    return separate

def summarizeByClass(dataset):
    separated = separateByClass(dataset)
    summary = {}
    for classLabel, classValue in separated.items():
        summary[classLabel] = summarize(classValue)
    return summary
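
To check the per-class summaries on data small enough to verify by hand, the same computation can be sketched with the standard-library `statistics` module, whose `stdev` also uses the sample (n - 1) formula. The function name `summarize_by_class` and the toy rows are hypothetical, chosen only for illustration.

```python
import statistics

def summarize_by_class(rows):
    """Map each class label to a list of (mean, sample stdev) pairs, one per attribute."""
    grouped = {}
    for row in rows:
        # Group the attribute values by label, dropping the label column itself
        grouped.setdefault(row[-1], []).append(row[:-1])
    return {
        label: [(statistics.mean(col), statistics.stdev(col)) for col in zip(*features)]
        for label, features in grouped.items()
    }

# Hypothetical toy data: [attrib1, attrib2, label]
toy = [
    [1.0, 10.0, 0.0],
    [3.0, 14.0, 0.0],
    [6.0, 20.0, 1.0],
    [8.0, 30.0, 1.0],
]
toy_summary = summarize_by_class(toy)
print(toy_summary[0.0])  # means 2.0 and 12.0 for class 0.0
print(toy_summary[1.0])  # means 7.0 and 25.0 for class 1.0
```

Each class needs at least two rows, because the sample standard deviation is undefined for a single value.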

## 4. Predict
For each row of the test data, predict the class using the Naive Bayesian formula with a normal (Gaussian) distribution.
<img src="https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcQYGloOirBzTFgFjY5esv_09Ksikn3NJYN6aErIn_P2BKmsrgZx" height="30%"/>

This formula is implemented in ```calculateProbability(x, mean, stdev)```.
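
For reference, the density that ```calculateProbability(x, mean, stdev)``` evaluates can be written out as follows, assuming the image above shows the standard normal density. The symbols are notation chosen for this sketch: μ corresponds to `mean` and σ to `stdev`, both estimated per class and per attribute.

$$
P(x_i \mid c) = \frac{1}{\sqrt{2\pi}\,\sigma_{c,i}} \exp\left(-\frac{(x_i - \mu_{c,i})^2}{2\,\sigma_{c,i}^2}\right)
$$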

### Naive Bayesian Formula:
<img src="https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcRLZnzsVloSpAZU8YTmtzll58CqF0S0oWMjGCgVhUV9nctjNcPa" height="100px"/>
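Here <code>P(c)</code> is multiplied by <code>P(x1|c), P(x2|c), ...</code>, and each <code>P(xi|c)</code> is calculated by <code>calculateProbability(x, mean, stdev)</code> for every row of the dataset. The code in this section multiplies only the <code>P(xi|c)</code> terms and leaves out <code>P(c)</code>, which amounts to treating the classes as equally likely.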
<hr>
<b>NOTE:</b> the last column (the class label) must not be included in the product, so the loop runs only over the per-attribute summaries: <code>for i in range(len(classSummaries)):</code>

In [75]:
def calculateProbability(x, mean, stdev):
    exponent = math.exp(-(math.pow(x - mean, 2)) / (2 * math.pow(stdev, 2)))
    return (1 / (math.sqrt(2 * math.pi) * stdev)) * exponent

def calculateClassProbabilities(summary, inputVector):
    probabilities = {}
    for classValue, classSummaries in summary.items():
        probabilities[classValue] = 1
        # One (mean, stdev) pair per attribute; the label column has no summary
        for i in range(len(classSummaries)):
            mean, stdev = classSummaries[i]
            x = inputVector[i]
            # Apply formula
            probabilities[classValue] *= calculateProbability(x, mean, stdev)
    return probabilities

def predict(summaries, inputVector):
    probabilities = calculateClassProbabilities(summaries, inputVector)
    bestLabel, bestProb = None, -1
    for classValue, probability in probabilities.items():
        if bestLabel is None or probability > bestProb:
            bestProb = probability
            bestLabel = classValue
    return bestLabel
 
def getPredictions(summary, testSet):
    predictions = []
    for i in range(len(testSet)):
        # Use the summary passed in, not a global variable
        result = predict(summary, testSet[i])
        predictions.append(result)
    return predictions
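
Multiplying many small densities can underflow to zero when there are many attributes. A common alternative, sketched here as an assumption and not as the notebook's method, is to add logarithms instead; this version also includes the class prior `P(c)`. The names `gaussian_log_pdf`, `predict_log` and `priors`, and the toy model, are hypothetical.

```python
import math

def gaussian_log_pdf(x, mean, stdev):
    """Natural log of the normal density at x."""
    return -math.log(stdev * math.sqrt(2 * math.pi)) - (x - mean) ** 2 / (2 * stdev ** 2)

def predict_log(summaries, priors, input_vector):
    """Return the label with the largest log prior plus summed log likelihoods."""
    scores = {}
    for label, class_summaries in summaries.items():
        score = math.log(priors[label])
        # zip stops at the shorter sequence, so a trailing label column is ignored
        for x, (mean, stdev) in zip(input_vector, class_summaries):
            score += gaussian_log_pdf(x, mean, stdev)
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical toy model: one attribute, two classes, equal priors
toy_summaries = {0.0: [(2.0, 1.0)], 1.0: [(7.0, 1.0)]}
toy_priors = {0.0: 0.5, 1.0: 0.5}
print(predict_log(toy_summaries, toy_priors, [2.5]))  # 0.0
print(predict_log(toy_summaries, toy_priors, [6.0]))  # 1.0
```

Because the logarithm is monotonic, the class with the largest log score is also the class with the largest product, so with equal priors the predicted label matches the product form wherever the product does not underflow.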

## 5. Accuracy
Compute the accuracy, as a percentage, by comparing the predictions of the __Naive Bayesian__ classifier with the actual labels of the __test data__.

In [72]:
def getAccuracy(testSet, predictions):
    correct = 0
    for i in range(len(testSet)):
        if testSet[i][-1] == predictions[i]:
            correct += 1
    return (correct/float(len(testSet))) * 100.0
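
A single accuracy figure does not show which class is being misclassified. As a small illustrative sketch, the (actual, predicted) pairs can be counted with `collections.Counter`; the helper name `confusion_counts` and the label lists are hypothetical.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))

# Hypothetical labels for five test rows
actual = [1.0, 0.0, 1.0, 0.0, 1.0]
predicted = [1.0, 0.0, 0.0, 0.0, 1.0]

counts = confusion_counts(actual, predicted)
correct = sum(n for (a, p), n in counts.items() if a == p)
accuracy_pct = 100.0 * correct / len(actual)

print(counts[(1.0, 0.0)])  # 1 row of class 1.0 was predicted as 0.0
print(accuracy_pct)        # 80.0
```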

In [74]:
dataset = load_csv('DATASET/lab5.csv')
print("\n The length of the Data Set : ",len(dataset))

trainset, testset = split_dataset(dataset, 0.7)
print('\n Number of Rows in Training Set:{0} rows'.format(len(trainset)))
print('\n Number of Rows in Testing Set:{0} rows'.format(len(testset)))

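# NOTE: the summaries are computed from the full dataset, so the test rows also
# influence the model and the accuracy below is optimistic.
# Pass `trainset` here instead for an unbiased estimate.
summaries = summarizeByClass(dataset)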
print("\n Model Summaries:\n",summaries)

predictions = getPredictions(summaries, testset)
print("\nPredictions:\n",predictions)

accuracy = getAccuracy(testset, predictions)
print('\n Accuracy: {0}%'.format(accuracy))


 The length of the Data Set :  768

 Number of Rows in Training Set:537 rows

 Number of Rows in Testing Set:231 rows

 Model Summaries:
 {1.0: [(4.865671641791045, 3.741239044041554), (141.25746268656715, 31.939622058007195), (70.82462686567165, 21.49181165060413), (22.16417910447761, 17.67971140046571), (100.33582089552239, 138.6891247315351), (35.14253731343278, 7.262967242346376), (0.5505, 0.372354483554611), (37.06716417910448, 10.968253652367915)], 0.0: [(3.298, 3.01718458262189), (109.98, 26.14119975535359), (68.184, 18.063075413305828), (19.664, 14.889947113744254), (68.792, 98.86528929231767), (30.30419999999996, 7.689855011650112), (0.42973400000000017, 0.29908530435741093), (31.19, 11.667654791631156)]}

Predictions:
 [1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0,