# Subtask 1 : Mathematical Understanding

## Definition of Naive Bayes

Naive Bayes is a probabilistic classifier. It applies Bayes' theorem under the assumption that the features are conditionally independent of one another given the class, and it predicts the class with the greatest posterior probability.

## Underlying Principles

The Naive Bayes algorithm uses Bayes' theorem to calculate the posterior probability of each class given the features of a sample. The class with the highest posterior probability is assigned to that sample.

## Assumptions

The model assumes that the features are conditionally independent of each other given the class. The Gaussian variant used here additionally assumes that, within each class, every feature follows a normal distribution.

## Equations Involved

### Bayes Theorem

P(y|X) = P(X|y) * P(y) / P(X)

posterior probability = P(y|X)

prior probability = P(y)

likelihood probability (conditional probability) = P(X|y)

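where y is the class variable and X represents the features, with X = [x1, x2, x3, ..., xn].

Under the conditional independence assumption, the likelihood factorises over the features:

P(X|y) = P(x1|y) * P(x2|y) * .... * P(xn|y)

so Bayes' theorem becomes

P(y|X) = P(x1|y) * P(x2|y) * .... * P(xn|y) * P(y) / P(X)

For a given sample, P(X) is the same for every class, so it can be dropped when comparing classes. Because a product of many small probabilities can underflow to zero in floating point, we take the log of both sides and work with sums instead:

log(P(y|X)) = log(P(x1|y)) + log(P(x2|y)) + .... + log(P(xn|y)) + log(P(y)) + constant

where the constant is -log(P(X)), which does not depend on y.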

As there is more than one possible class y, we evaluate the posterior for each class.

The prediction is the class whose posterior probability (or, equivalently, log posterior) is the largest.
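The decision rule above can be illustrated with a small sketch. The per-feature likelihoods and priors below are made-up numbers, not values from the datasets used later. The point is only that a direct product of many small probabilities underflows to zero, while the sum of logs stays finite and still identifies the most probable class.

```python
import numpy as np

# Hypothetical per-feature likelihoods P(x_i | y) for two classes, 400 features each.
likelihoods = {
    "class_a": np.full(400, 0.010),
    "class_b": np.full(400, 0.012),
}
priors = {"class_a": 0.5, "class_b": 0.5}

# Direct product: both scores underflow to 0.0, so the classes cannot be compared.
direct = {c: priors[c] * np.prod(likelihoods[c]) for c in likelihoods}

# Log space: the sums of logs remain finite and comparable.
log_scores = {
    c: np.log(priors[c]) + np.sum(np.log(likelihoods[c])) for c in likelihoods
}

# The predicted class is the one with the largest log score.
predicted = max(log_scores, key=log_scores.get)
print(direct)
print(log_scores)
print(predicted)
```

Because the logarithm is monotonic, taking the argmax over log scores selects the same class as the argmax over the original products would in exact arithmetic.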

### Gaussian Naive Bayes Classifier

For continuous features, each per-feature likelihood P(xi|y) is modelled as a Gaussian:

P(xi|y) = exp(-((xi-mean)^2)/(2 * var))/(sqrt(2 * pi * var))

where exp is the exponential function, xi is the value of the i-th feature, and mean and var are the mean and variance of that feature over the training samples of class y. P(X|y) is the product of these terms over all the features.
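As a rough sketch of the density above, the two helper functions below (`gaussian_pdf` and `gaussian_log_pdf` are illustrative names, not part of the model defined later) evaluate the Gaussian likelihood directly and in log form. The sample values are invented to show that the direct form can underflow for a point far from the mean, whereas the log form stays finite.

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Density of a univariate normal distribution, evaluated elementwise."""
    return np.exp(-((x - mean) ** 2) / (2 * var)) / np.sqrt(2 * np.pi * var)

def gaussian_log_pdf(x, mean, var):
    """Logarithm of the same density, computed without calling exp."""
    return -0.5 * np.log(2 * np.pi * var) - ((x - mean) ** 2) / (2 * var)

# Illustrative values: one sample with two features, and one class's statistics.
x = np.array([1.0, 50.0])
mean = np.array([0.0, 0.0])
var = np.array([1.0, 1.0])

pdf_values = gaussian_pdf(x, mean, var)          # second entry underflows to 0.0
log_pdf_values = gaussian_log_pdf(x, mean, var)  # both entries are finite
print(pdf_values)
print(log_pdf_values)
```

Working with the log density directly is one way to avoid taking the log of an underflowed zero. The model in Subtask 2 instead computes the density first and then takes its log.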

## Working of the Model

### Training

For each class, the model calculates the mean and variance of every feature, along with the prior probability of the class (the fraction of training samples that belong to it).
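The training step can be sketched on a toy array. The data below is invented purely for illustration, and the variable names are assumptions rather than the ones used in the model class.

```python
import numpy as np

# Toy data: four samples, two features, two classes (values are illustrative).
X = np.array([[1.0, 10.0],
              [3.0, 14.0],
              [6.0, 20.0],
              [8.0, 24.0]])
y = np.array([0, 0, 1, 1])

classes = np.unique(y)

# One row per class, one column per feature.
means = np.array([X[y == c].mean(axis=0) for c in classes])
variances = np.array([X[y == c].var(axis=0) for c in classes])

# Fraction of training samples that belong to each class.
priors = np.array([np.mean(y == c) for c in classes])

print(means)      # class 0 -> (2, 12), class 1 -> (7, 22)
print(variances)  # (1, 4) for both classes (population variance)
print(priors)     # 0.5 for each class
```

Note that `var` here is the population variance (NumPy's default, `ddof=0`), which matches the call used in the `fit` method later in this notebook.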

### Prediction

It uses Bayes' theorem to calculate the log posterior probability of each class, and then it returns the class with the highest value.

# Subtask 2 : Training and Prediction

In [1]:
#importing the necessary libraries
import numpy as np
import pandas as pd

In [2]:
#defining my model of Naive_Bayes
class Naive_Bayes():
    
    def __init__(self, alpha = 0.1):
        self.alpha = alpha
    
    def fit(self,X,y):
        n_samples, n_features = X.shape
        self.classes = np.unique(y)
        n_classes = len(self.classes)

        # one row per class, one column per feature
        self.mean = np.zeros((n_classes,n_features))
        self.var = np.zeros((n_classes,n_features))
        self.prior = np.zeros(n_classes)

        for idx,c in enumerate(self.classes):
            # training samples that belong to class c
            X_c = X[y==c]
            self.mean[idx,:] = X_c.mean(axis=0)
            self.var[idx,:] = X_c.var(axis=0)
            # prior = fraction of training samples in class c
            self.prior[idx] = X_c.shape[0]/n_samples
    
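    def predict(self,X):
        # classify each row of X independently
        y_pred = [self.pred(x) for x in X]
        return np.array(y_pred)

    def pred(self,x):
        # log posterior (up to a constant) of every class for one sample
        posteriors = []
        for idx,c in enumerate(self.classes):
            prior = np.log(self.prior[idx])
            # sum of the per-feature log likelihoods (independence assumption)
            likelihood = np.sum(np.log(self.cond_prob(idx,x)))
            posterior = prior + likelihood
            posteriors.append(posterior)
        prediction = self.classes[np.argmax(posteriors)]
        return prediction

    def cond_prob(self,idx,x):
        # Gaussian density of each feature of x under class idx
        mean = self.mean[idx]
        # smoothing: alpha is added to the variance (guards against zero variance when alpha > 0)
        var = self.var[idx] + self.alpha
        num = np.exp((-((x-mean)**2))/(2*var))
        den = np.sqrt(2*np.pi*var)
        P_x_y = num/den
        return P_x_y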

In [3]:
#reading the files
train1 = pd.read_csv('ds1_train.csv')
test1 = pd.read_csv('ds1_test.csv')
train2 = pd.read_csv('ds2_train.csv')
test2 = pd.read_csv('ds2_test.csv')

In [4]:
#separating the x and y values
X_train1 = train1.iloc[:,:-1].values
y_train1 = train1.iloc[:,-1].values
X_test1 = test1.iloc[:,:-1].values
y_test1 = test1.iloc[:,-1].values
X_train2 = train2.iloc[:,:-1].values
y_train2 = train2.iloc[:,-1].values
X_test2 = test2.iloc[:,:-1].values
y_test2 = test2.iloc[:,-1].values

In [5]:
#defining the accuracy function
def accuracy(y_pred,y_test):
    acc = (np.sum(y_pred==y_test))/len(y_test)
    return acc

In [6]:
#defining a function to get the accuracy for the dataset
def printing_accuracy(X_train, y_train, X_test, y_test):
    nb = Naive_Bayes()
    nb.fit(X_train,y_train)
    y_pred = nb.predict(X_test)
    acc = accuracy(y_pred,y_test)
    return acc

In [7]:
#getting the accuracies of the predictions for the datasets
my_acc_train_1 = printing_accuracy(X_train1, y_train1, X_train1, y_train1)
my_acc_test_1 = printing_accuracy(X_train1, y_train1, X_test1, y_test1)
my_acc_train_2 = printing_accuracy(X_train2, y_train2, X_train2, y_train2)
my_acc_test_2 = printing_accuracy(X_train2, y_train2, X_test2, y_test2)

In [8]:
#Printing the accuracies of the predictions done by the model
print('accuracy of the ds1_test.csv',my_acc_test_1)
print('accuracy of the ds1_train.csv',my_acc_train_1)
print('accuracy of the ds2_test.csv',my_acc_test_2)
print('accuracy of the ds2_train.csv',my_acc_train_2)

accuracy of the ds1_test.csv 0.83
accuracy of the ds1_train.csv 0.82375
accuracy of the ds2_test.csv 0.92
accuracy of the ds2_train.csv 0.91625


# Subtask 3 : Hyperparameter Tuning

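The hyperparameter of the model is alpha, a smoothing constant that is added to every per-class feature variance in cond_prob. It keeps the Gaussian density well defined when a feature has zero (or near-zero) variance within a class. This is not the same as Laplace smoothing, which adds pseudo-counts in categorical Naive Bayes; the closest Scikit-Learn analogue is the var_smoothing parameter of GaussianNB, although that value is scaled by the largest feature variance rather than added directly.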
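A small sketch of why alpha matters, using an invented class subset in which one feature is constant. The helper `log_density` and the test point are assumptions made for illustration; only the idea of adding alpha to the variance comes from the model above.

```python
import numpy as np

# Illustrative class subset in which the second feature is constant.
X_c = np.array([[1.0, 5.0],
                [2.0, 5.0],
                [3.0, 5.0]])
mean = X_c.mean(axis=0)
var = X_c.var(axis=0)          # second entry is exactly 0.0

x = np.array([2.0, 5.5])       # hypothetical test point

def log_density(x, mean, var):
    # Log of the Gaussian density, evaluated per feature.
    return -0.5 * np.log(2 * np.pi * var) - ((x - mean) ** 2) / (2 * var)

# Without smoothing, the zero variance leads to a non-finite value.
with np.errstate(divide="ignore", invalid="ignore"):
    unsmoothed = log_density(x, mean, var)

# With alpha = 0.1 (the default of the class above) every term is finite.
smoothed = log_density(x, mean, var + 0.1)

print(unsmoothed)
print(smoothed)
```

Larger values of alpha flatten every per-feature Gaussian, so the choice trades numerical safety against how sharply the model separates the classes.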

In [9]:
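# defining a function to get the best alpha parameter
# note: alpha is selected on the test split here, so the reported best accuracy is an optimistic estimate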
def best(X_train,y_train,X_test,y_test):
    best_alpha = 0
    best_acc = 0
    alpha = [0, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5]
    for i in alpha:
        nb = Naive_Bayes(alpha = i)
        nb.fit(X_train, y_train)
        y_pred = nb.predict(X_test)
        acc = accuracy(y_pred,y_test)
        
        if acc>best_acc:
            best_alpha = i
            best_acc = acc
    print('Best value of alpha for the given dataset : ',best_alpha)
    print('Best accuracy for the best alpha for the given dataset : ',best_acc)

In [10]:
# printing the best alpha for dataset 1
print('Dataset - 1')
best(X_train1, y_train1, X_test1, y_test1)

Dataset - 1
Best value of alpha for the given dataset :  0.3
Best accuracy for the best alpha for the given dataset :  0.84


In [11]:
# printing the best alpha for dataset 2
print('Dataset - 2')
best(X_train2, y_train2, X_test2, y_test2)

Dataset - 2
Best value of alpha for the given dataset :  0
Best accuracy for the best alpha for the given dataset :  0.92


# Subtask 4 : Comparison with Scikit-Learn

In [12]:
# importing the necessary libraries
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics

In [13]:
# defining scikit learn naive bayes model
def scikit_naive_bayes(X_train, y_train, X_test, y_test):
    nb = GaussianNB()
    nb.fit(X_train, y_train)
    y_pred = nb.predict(X_test)
    acc = metrics.accuracy_score(y_test,y_pred)
    return acc

In [14]:
# printing the accuracies of the scikit learn model
acc_test_1 = scikit_naive_bayes(X_train1,y_train1,X_test1,y_test1)
print('Accuracy for the test dataset 1 : ',acc_test_1)
acc_train_1 = scikit_naive_bayes(X_train1,y_train1,X_train1,y_train1)
print('Accuracy for the train dataset 1 : ',acc_train_1)
acc_test_2 = scikit_naive_bayes(X_train2,y_train2,X_test2,y_test2)
print('Accuracy for the test dataset 2 : ',acc_test_2)
acc_train_2 = scikit_naive_bayes(X_train2,y_train2,X_train2,y_train2)
print('Accuracy for the train dataset 2 : ',acc_train_2)

Accuracy for the test dataset 1 :  0.83
Accuracy for the train dataset 1 :  0.8225
Accuracy for the test dataset 2 :  0.92
Accuracy for the train dataset 2 :  0.915


## Comparison between two models

In [15]:
# printing the comparison between the two models
print('Scikit-Learn Gaussian Naive Bayes Model :: ds1_test.csv :: ',acc_test_1)
print('My Model :: ds1_test.csv :: ',my_acc_test_1)
print('Scikit-Learn Gaussian Naive Bayes Model :: ds1_train.csv :: ',acc_train_1)
print('My Model :: ds1_train.csv :: ',my_acc_train_1)
print('Scikit-Learn Gaussian Naive Bayes Model :: ds2_test.csv :: ',acc_test_2)
print('My Model :: ds2_test.csv :: ',my_acc_test_2)
print('Scikit-Learn Gaussian Naive Bayes Model :: ds2_train.csv :: ',acc_train_2)
print('My Model :: ds2_train.csv :: ',my_acc_train_2)

Scikit-Learn Gaussian Naive Bayes Model :: ds1_test.csv ::  0.83
My Model :: ds1_test.csv ::  0.83
Scikit-Learn Gaussian Naive Bayes Model :: ds1_train.csv ::  0.8225
My Model :: ds1_train.csv ::  0.82375
Scikit-Learn Gaussian Naive Bayes Model :: ds2_test.csv ::  0.92
My Model :: ds2_test.csv ::  0.92
Scikit-Learn Gaussian Naive Bayes Model :: ds2_train.csv ::  0.915
My Model :: ds2_train.csv ::  0.91625
