# Naive Bayes Implementation






Function to split the data into training and test sets when a single file is provided -

In [1]:
import pandas as pd
import numpy as np
import math

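def splitData(filename, splitValue):
    dataset = pd.read_csv(filename)
    # Each row goes to the training set with probability splitValue,
    # so the split sizes are only approximately splitValue : (1 - splitValue)
    splitval = np.random.rand(len(dataset)) < splitValue
    train = dataset[splitval]
    test = dataset[~splitval]
    return train, test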
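Because the mask is drawn at random, every run of `splitData` produces a different split. A minimal sketch of a repeatable variant, assuming a seeded NumPy generator is acceptable; the helper name `split_mask` and its `seed` parameter are hypothetical and not part of the notebook:

```python
import numpy as np

def split_mask(n_rows, train_fraction, seed=None):
    """Boolean mask that marks each row as 'train' with probability train_fraction."""
    rng = np.random.default_rng(seed)  # a seeded generator makes the split repeatable
    return rng.random(n_rows) < train_fraction

mask = split_mask(1000, 0.8, seed=0)
same_mask = split_mask(1000, 0.8, seed=0)

print(mask.sum(), (~mask).sum())      # train / test sizes: close to 800 / 200, not exact
print(bool((mask == same_mask).all()))  # same seed, same split
```

The same mask can index a DataFrame exactly as `splitval` does above (`dataset[mask]` and `dataset[~mask]`).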





Function to get the per-class mean and standard deviation of each column, plus the prior probability of classes 0 & 1 -

In [2]:
def getStats(train):
    # The last column holds the 0/1 outcome
    outcomeCol = train.columns[-1]

    dataframeOfZero = train[train[outcomeCol] == 0]
    dataframeOfOne = train[train[outcomeCol] == 1]
    
    # Per-class mean and sample standard deviation of every column
    meanOfZero = dataframeOfZero.mean(axis = 0)
    meanOfOne = dataframeOfOne.mean(axis = 0)
    stdOfZero = dataframeOfZero.std(axis = 0)
    stdOfOne = dataframeOfOne.std(axis = 0)
    # Class priors: the fraction of training rows that belong to each class
    probOfOne = len(dataframeOfOne)/len(train)
    probOfZero = len(dataframeOfZero)/len(train)
    
    return meanOfZero, meanOfOne, stdOfZero, stdOfOne, probOfZero, probOfOne
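To make the returned quantities concrete, here is a plain-Python sketch that computes the same three things (per-class means, per-class sample standard deviations, and priors) on a toy dataset. The rows are made up for illustration, and `class_stats` is a hypothetical helper, not the notebook's function:

```python
from statistics import mean, stdev

# Toy rows (made up for illustration): two features followed by a 0/1 label
rows = [
    (1.0, 10.0, 0),
    (2.0, 12.0, 0),
    (3.0, 14.0, 0),
    (6.0, 20.0, 1),
    (8.0, 24.0, 1),
]

def class_stats(rows, label):
    """Per-feature mean and sample standard deviation, plus the prior, for one class."""
    members = [r[:-1] for r in rows if r[-1] == label]
    columns = list(zip(*members))            # transpose: one tuple per feature
    means = [mean(col) for col in columns]
    stds = [stdev(col) for col in columns]   # n - 1 denominator, like pandas' default std()
    prior = len(members) / len(rows)
    return means, stds, prior

means0, stds0, prior0 = class_stats(rows, 0)
means1, stds1, prior1 = class_stats(rows, 1)
print(means0, stds0, prior0)
print(means1, stds1, prior1)
```

Class 0 covers three of the five rows, so its prior is 0.6; its first feature takes the values 1, 2, 3, giving a mean of 2 and a sample standard deviation of 1.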





Function to classify a row by comparing the class scores (prior times the normal-distribution likelihood of each feature) -

In [3]:
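def calcProb(columnNames, row, meanOfZero, meanOfOne, stdOfZero, stdOfOne, probOfOne, probOfZero):
    # Start from the class priors
    probOne = probOfOne
    probZero = probOfZero
    # Multiply in the likelihood of every feature (the last column is the label)
    for i in range(len(columnNames) - 1):
        # .iloc is positional, so it works whatever the column labels are
        probOne = probOne * normpdf(row.iloc[i], meanOfOne.iloc[i], stdOfOne.iloc[i])
        probZero = probZero * normpdf(row.iloc[i], meanOfZero.iloc[i], stdOfZero.iloc[i])
    # Predict the class with the larger unnormalized posterior
    return 1 if probOne > probZero else 0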
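Multiplying many small densities can underflow to zero when there are many features, at which point both class scores compare as equal. A common remedy is to add log-densities instead of multiplying densities. The sketch below is a swapped-in variant, not what the notebook does; `log_normpdf` and `log_score` are hypothetical names, and the 400-feature row is an artificial example:

```python
import math

def log_normpdf(x, mean, std):
    """Natural log of the Gaussian density, computed without exponentiating."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)

def log_score(row, means, stds, prior):
    """log(prior) plus the summed log-likelihoods of the features."""
    return math.log(prior) + sum(
        log_normpdf(x, m, s) for x, m, s in zip(row, means, stds)
    )

# Artificial case: 400 features, each 3 standard deviations from the class mean
row = [3.0] * 400
means = [0.0] * 400
stds = [1.0] * 400

product = 0.5
for x, m, s in zip(row, means, stds):
    product *= math.exp(log_normpdf(x, m, s))

score = log_score(row, means, stds, 0.5)
print(product)  # the plain product underflows to 0.0
print(score)    # the log score stays a finite negative number
```

Because the logarithm is monotonic, comparing the two classes' log scores picks the same class as comparing the products whenever the products are representable.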





Function to calculate the normal (Gaussian) probability density -

In [4]:
def normpdf(x, mean, std):
    # Gaussian density: exp(-(x - mean)^2 / (2 * std^2)) / (sqrt(2 * pi) * std)
    expn = math.exp(-((x-mean)**2 / (2 * std**2 )))
    coefficient = (1 / (math.sqrt(2 * math.pi) * std))
    return expn * coefficient
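As a quick sanity check, the density can be compared against `statistics.NormalDist` from the Python standard library (available from Python 3.8). The function is restated here so that the snippet runs on its own; the test points are arbitrary:

```python
import math
from statistics import NormalDist

def normpdf(x, mean, std):
    expn = math.exp(-((x - mean) ** 2 / (2 * std ** 2)))
    coefficient = 1 / (math.sqrt(2 * math.pi) * std)
    return expn * coefficient

peak = normpdf(0.0, 0.0, 1.0)
print(peak)  # peak of the standard normal, 1 / sqrt(2 * pi)

# Arbitrary (x, mean, std) triples; both implementations should agree
checks = [(0.0, 0.0, 1.0), (1.5, 1.0, 2.0), (-3.0, 0.5, 0.7)]
agree = all(
    math.isclose(normpdf(x, m, s), NormalDist(m, s).pdf(x)) for x, m, s in checks
)
print(agree)
```

Note that the formula divides by `std`, so a feature that is constant within a class (standard deviation 0) has no defined density and needs separate handling.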





Computing the confusion matrix -

In [5]:
def createConfusionMatrix(actual, pred):
    # Rows are the actual classes, columns the predicted classes
    size = len(np.unique(actual))
    result = np.zeros((size, size), dtype=int)
    for i in range(len(actual)):
        result[actual[i], pred[i]] += 1
    return result
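A small worked example makes the layout concrete. This sketch uses a vectorized variant built on `np.add.at` rather than the loop above; the function name `confusion_matrix`, the `n_classes` parameter, and the six labels are all made up for illustration:

```python
import numpy as np

def confusion_matrix(actual, pred, n_classes=2):
    """Rows index the actual class, columns the predicted class."""
    result = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(result, (actual, pred), 1)  # accumulates repeated (actual, pred) pairs
    return result

actual = np.array([0, 0, 1, 1, 1, 0])
pred = np.array([0, 1, 1, 0, 1, 0])
matrix = confusion_matrix(actual, pred)
print(matrix)
# [[2 1]
#  [1 2]]
```

Here two of the three actual negatives and two of the three actual positives are predicted correctly, so the correct counts sit on the main diagonal.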





Function to calculate metrics -

In [6]:
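def calculateMetrics(confusionMatrix):
    # Correct predictions lie on the main diagonal
    accuracy = np.trace(confusionMatrix) / np.sum(confusionMatrix)
    # Reversing the rows puts the misclassified counts on the diagonal (2x2 matrix only)
    error = np.trace(confusionMatrix[::-1]) / np.sum(confusionMatrix)
    # True positive rate: TP / (FN + TP)
    sensitivity = confusionMatrix[1,1] / (confusionMatrix[1,0] + confusionMatrix[1,1])
    # True negative rate: TN / (TN + FP)
    specificity = confusionMatrix[0,0] / (confusionMatrix[0,0] + confusionMatrix[0,1])
    
    return accuracy, error, sensitivity, specificity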





Main function -

In [None]:
def main():
    
    #Find the number of columns and label them '0', '1', ...
    #This generalizes the code to any csv file without a header row whose last column is the 0/1 label
    columns = pd.read_csv('train.csv').columns
    columnsList = [str(i) for i in range(len(columns))]
    
    #Read data (passing names makes pandas treat every line, including the first, as data)
    train = pd.read_csv('train.csv', names = columnsList)
    test = pd.read_csv('test.csv', names = columnsList)
    
    #Get means & stdev
    meanOfZero, meanOfOne, stdOfZero, stdOfOne, probOfZero, probOfOne = getStats(train)
    
    #Get predicted data
    pred = []
    for i, row in test.iterrows():
        pred.append(calcProb(train.columns, row, meanOfZero, meanOfOne, stdOfZero, stdOfOne, probOfOne, probOfZero))
    
    #Get actual data
    actual = test.iloc[:, -1].to_numpy().astype(int)
    
    #Get confusion matrix
    confusionMatrix = createConfusionMatrix(actual, pred)
    print("Confusion matrix: ")
    print(confusionMatrix)
    
    #Get metrics
    accuracy, error, sensitivity, specificity = calculateMetrics(confusionMatrix)
    print("Accuracy: ", accuracy)
    print("Error: ", error)
    print("Sensitivity: ", sensitivity)
    print("Specificity: ", specificity)
    
main()    
    

##### Output - 

Confusion matrix: 

[[132  28]

 [ 36  58]]
 
Accuracy:  0.7480314960629921

Error:  0.25196850393700787

Sensitivity:  0.6170212765957447

Specificity:  0.825

A Bayes classifier computes the posterior probability of each class using Bayes' rule. A naive Bayes classifier additionally assumes that the features are conditionally independent given the class. The two are used for inference in the same way, but naive Bayes is significantly faster, because it only has to estimate one distribution per feature and class rather than a joint distribution over all features. Although the conditional independence assumption is questionable, naive Bayes has surprisingly outperformed many other classifiers on a large number of datasets.
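Written out, the rule that the code in this notebook follows can be sketched as below. The notation is an assumption introduced here: $d$ is the number of features, $x_j$ the value of feature $j$, $P(c)$ the class prior (`probOfZero`, `probOfOne`), and $\mu_{c,j}$, $\sigma_{c,j}$ the per-class mean and standard deviation (`meanOfZero`/`meanOfOne`, `stdOfZero`/`stdOfOne`).

$$
P(c \mid x) \;\propto\; P(c) \prod_{j=1}^{d} P(x_j \mid c),
\qquad
P(x_j \mid c) = \frac{1}{\sqrt{2\pi}\,\sigma_{c,j}} \exp\!\left(-\frac{(x_j - \mu_{c,j})^2}{2\sigma_{c,j}^2}\right)
$$

$$
\hat{y} = \arg\max_{c \in \{0, 1\}} \; P(c) \prod_{j=1}^{d} P(x_j \mid c)
$$

The product over $j$ is where the conditional independence assumption enters: without it, the likelihood $P(x \mid c)$ would have to be modelled jointly over all $d$ features.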



##### From the Confusion Matrix values we can make the following observations:

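Classification accuracy is the percentage of correct predictions: about 75%.

Sensitivity shows how 'sensitive' the classifier is to positive instances: it is the fraction of actual positives that are detected, about 62%.

Specificity is the fraction of actual negatives that the classifier correctly labels as negative, 82.5%.

Hence our model is moderately sensitive and highly specific: it recognises negatives (132 of 160) more reliably than positives (58 of 94).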
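The reported figures can be re-derived by hand from the four cells of the confusion matrix in the output. A short arithmetic sketch, with the cell names `tn`, `fp`, `fn`, `tp` introduced here for readability:

```python
# Confusion matrix from the output: rows are actual classes, columns are predictions
tn, fp = 132, 28   # actual 0: predicted 0, predicted 1
fn, tp = 36, 58    # actual 1: predicted 0, predicted 1
total = tn + fp + fn + tp

accuracy = (tn + tp) / total      # 190 / 254
error = (fp + fn) / total         # 64 / 254
sensitivity = tp / (tp + fn)      # 58 / 94
specificity = tn / (tn + fp)      # 132 / 160

print(accuracy, error, sensitivity, specificity)
```

Accuracy and error always sum to 1, since every prediction is either correct or incorrect.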





References - YouTube videos on naive Bayes implementation; the pandas and NumPy library documentation; and Stack Overflow for usage of functions.