![ADSA Logo](http://i.imgur.com/BV0CdHZ.png?2 "ADSA Logo")

# Fall 2018 ADSA Workshop - Naive Bayes from Scratch

Workshop content is adapted from:
* https://machinelearningmastery.com/naive-bayes-classifier-scratch-python/


The Naive Bayes algorithm is an intuitive method that uses the probabilities of each attribute belonging to each class to make a prediction. It is the supervised learning approach you would come up with if you wanted to model a prediction problem probabilistically.

Naive Bayes simplifies the calculation of probabilities by assuming that the probability of each attribute belonging to a given class value is independent of all other attributes. This is a strong assumption but results in a fast and effective method.

The probability of an attribute value given a class value is called the conditional probability. By multiplying the conditional probabilities together for each attribute for a given class value, we obtain a score proportional to the probability of a data instance belonging to that class.

To make a prediction we can calculate probabilities of the instance belonging to each class and select the class value with the highest probability.
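The multiplication and selection steps can be sketched with a few made-up numbers. This is a minimal illustration, not the workshop's code: the class names, attribute names, and probabilities are hypothetical, and the sketch also weights each class by a prior (how common the class is), which the standard formulation includes even though the description above leaves it out.

```python
# Hypothetical toy numbers, chosen only to illustrate the arithmetic.
priors = {"spam": 0.4, "ham": 0.6}

# P(attribute value | class) for two observed attribute values.
likelihoods = {
    "spam": {"contains_link": 0.8, "all_caps_subject": 0.5},
    "ham": {"contains_link": 0.2, "all_caps_subject": 0.1},
}

def class_scores(priors, likelihoods):
    """Return prior * product of per-attribute likelihoods for each class."""
    scores = {}
    for label, prior in priors.items():
        score = prior
        for p in likelihoods[label].values():
            score *= p
        scores[label] = score
    return scores

scores = class_scores(priors, likelihoods)

# The prediction is the class with the highest score.
prediction = max(scores, key=scores.get)
print(scores)
print(prediction)
```

The scores are not normalized probabilities, but since only their ranking matters for the prediction, normalizing is optional.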

Naive Bayes is often described using categorical data because it is easy to describe and calculate using ratios. A more useful version of the algorithm for our purposes supports numeric attributes and assumes the values of each numerical attribute are normally distributed (fall somewhere on a bell curve). Again, this is a strong assumption, but it still gives robust results.
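For reference, the normal-distribution assumption means each conditional probability is evaluated with the Gaussian density. The symbols here are notation introduced for this note: $x$ is the attribute value, and $\mu_c$ and $\sigma_c$ are the mean and standard deviation of that attribute among the training rows of class $c$.

```latex
p(x \mid c) = \frac{1}{\sqrt{2\pi}\,\sigma_c}
              \exp\!\left( -\frac{(x - \mu_c)^2}{2\sigma_c^2} \right)
```

Strictly speaking this is a density rather than a probability, so its value can exceed 1 when $\sigma_c$ is small; only the comparison between classes matters.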

## Pima Indians Diabetes Problem
This dataset contains 768 observations of medical details for Pima Indian patients. The records describe instantaneous measurements taken from the patient, such as age, number of pregnancies, and blood work results. All patients are women aged 21 or older. All attributes are numeric, and their units vary from attribute to attribute.

Each record has a class value that indicates whether the patient suffered an onset of diabetes within 5 years of when the measurements were taken (1) or not (0).

This is a standard dataset that has been widely studied in the machine learning literature. A good prediction accuracy is 70%-76%.

### Reading in the dataset

In [254]:
import pandas as pd
import numpy as np
import math
import io
from google.colab import files
uploaded = files.upload()

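# files.upload() returns a dict keyed by the uploaded file's name, which may
# differ from 'pima_indians_diabetes.csv' (for example after a repeated upload),
# so read the first uploaded file instead of hard-coding the key
filename = next(iter(uploaded))
df = pd.read_csv(io.StringIO(uploaded[filename].decode('utf-8')))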

In [170]:
# Note that the csv data is now a Pandas DataFrame
print(type(df))

# Returns first 5 entries
df.head()

<class 'pandas.core.frame.DataFrame'>


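|   | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | y1 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |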


## Naive Bayes Algorithm
### 1. Split Data: Split the data into training and testing datasets.
### 2. Summarize Data: Summarize the properties in the training dataset so that we can calculate probabilities and make predictions.
### 3. Make Predictions: Generate predictions given a test dataset and a summarized training dataset.
### 4. Evaluate Accuracy: Evaluate the accuracy of predictions made for a test dataset as the percentage correct out of all predictions made.
### 5. Tie it Together: Use all of the code elements to present a complete and standalone implementation of the Naive Bayes algorithm.

## 1. Split data
Use the train_test_split function to create the training and testing data. 
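As a quick illustration of what the function's arguments do, here is a small sketch on made-up data. The arrays are hypothetical, and `stratify=y` is an optional extra that the workshop code does not use; it is shown only because it keeps the class ratio the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 100 rows, 2 features, 60 negatives and 40 positives.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 60 + [1] * 40)

# test_size=0.2 holds out 20% of the rows; random_state fixes the shuffle so
# the split is reproducible; stratify=y keeps the class ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

print(len(X_tr), len(X_te))      # sizes of the two parts
print(y_tr.mean(), y_te.mean())  # share of positives in each part
```

Rerunning with the same `random_state` returns the same rows each time, which is why the workshop cells can be re-executed and still produce matching results.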

In [0]:
from sklearn.model_selection import train_test_split

In [0]:
features = df[['x1','x2','x3','x4','x5','x6','x7','x8']]
labels    = df[['y1']]

In [173]:
features.head()

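|   | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 |
|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 |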


In [174]:
labels.head()

|   | y1 |
|---|---|
| 0 | 1 |
| 1 | 0 |
| 2 | 1 |
| 3 | 0 |
| 4 | 1 |


In [0]:
x_train, x_test, y_train, y_test = train_test_split(features, labels, test_size = .2,random_state = 0)

In [176]:
print(len(x_train))
x_train.head()

614


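|   | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 |
|---|---|---|---|---|---|---|---|---|
| 603 | 7 | 150 | 78 | 29 | 126 | 35.2 | 0.692 | 54 |
| 118 | 4 | 97 | 60 | 23 | 0 | 28.2 | 0.443 | 22 |
| 247 | 0 | 165 | 90 | 33 | 680 | 52.3 | 0.427 | 23 |
| 157 | 1 | 109 | 56 | 21 | 135 | 25.2 | 0.833 | 23 |
| 468 | 8 | 120 | 0 | 0 | 0 | 30.0 | 0.183 | 38 |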


In [177]:
print(len(y_train))
y_train.head()

614


|   | y1 |
|---|---|
| 603 | 1 |
| 118 | 0 |
| 247 | 0 |
| 157 | 0 |
| 468 | 1 |


In [178]:
print(len(x_test))
x_test.head()

154


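|   | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 |
|---|---|---|---|---|---|---|---|---|
| 661 | 1 | 199 | 76 | 43 | 0 | 42.9 | 1.394 | 22 |
| 122 | 2 | 107 | 74 | 30 | 100 | 33.6 | 0.404 | 23 |
| 113 | 4 | 76 | 62 | 0 | 0 | 34.0 | 0.391 | 25 |
| 14 | 5 | 166 | 72 | 19 | 175 | 25.8 | 0.587 | 51 |
| 529 | 0 | 111 | 65 | 0 | 0 | 24.6 | 0.66 | 31 |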


In [179]:
print(len(y_test))
y_test.head()

154


|   | y1 |
|---|---|
| 661 | 1 |
| 122 | 0 |
| 113 | 0 |
| 14 | 1 |
| 529 | 0 |


## 2. Summarize Data
This step can be broken down into three parts:

a. Separate Data By Class

b. Calculate Mean

c. Calculate Standard Deviation


### Separate Data by Class

In [0]:
# for all of the y_train data, extract only 1 values
one_y  = y_train[y_train.y1 == 1] 
# use the indices of one_y to get the corresponding rows in x_train
one_x  = x_train.loc[list(one_y.index)]

zero_y = y_train[y_train.y1 == 0]
zero_x = x_train.loc[list(zero_y.index)]

### Calculate Mean

In [181]:
one_mean  = one_x.mean()
print(one_mean)
zero_mean = zero_x.mean() 
print(zero_mean)

x1      4.764706
x2    140.361991
x3     70.728507
x4     22.511312
x5    100.904977
x6     35.410860
x7      0.538986
x8     37.420814
dtype: float64
x1      3.374046
x2    109.949109
x3     68.381679
x4     19.562341
x5     71.582697
x6     30.404835
x7      0.425692
x8     31.442748
dtype: float64


### Calculate Standard Deviation

In [182]:
one_sd  = one_x.std()
print(one_sd)
zero_sd = zero_x.std()
print(zero_sd)

x1      3.871671
x2     32.239236
x3     21.605524
x4     17.452432
x5    137.792218
x6      7.118176
x7      0.376519
x8     11.123248
dtype: float64
x1      3.071435
x2     25.792094
x3     18.269297
x4     15.128761
x5    103.056894
x6      7.776171
x7      0.299036
x8     11.977334
dtype: float64
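The separate, mean, and standard deviation steps above can also be done in one pass with `groupby`. This is only a sketch on a hypothetical toy frame (the `toy` values are made up); the column names mirror the workshop's `x1`, `x2`, and `y1`.

```python
import pandas as pd

# Hypothetical toy frame standing in for the training data.
toy = pd.DataFrame({
    "x1": [1.0, 3.0, 10.0, 14.0],
    "x2": [2.0, 4.0, 20.0, 22.0],
    "y1": [0, 0, 1, 1],
})

# Group rows by class label: the result has one row per class
# and one column per attribute.
class_means = toy.groupby("y1").mean()
# Sample standard deviation (ddof=1), the same default as DataFrame.std().
class_sds = toy.groupby("y1").std()

print(class_means)
print(class_sds)
```

With this layout, `class_means.loc[1]` plays the role of `one_mean` and `class_means.loc[0]` the role of `zero_mean`, and the same code would extend to more than two classes.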


## Make Predictions
We are now ready to make predictions using the summaries prepared from our training data. Making predictions involves calculating the probability that a given data instance belongs to each class, then selecting the class with the largest probability as the prediction.

We can also divide this part into a series of steps: 

a. Calculate Gaussian Probability Density Function

b. Calculate Class Probabilities

c. Make Predictions

d. Estimate Accuracy
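Steps a to c can be sketched in the standard Gaussian Naive Bayes form, where the per-attribute densities are combined by multiplication and weighted by a class prior. Note that this differs from the workshop cells that follow, which count how many attributes favor each class. The function names, the `summaries` layout, and all numbers below are assumptions made for illustration; logarithms are used so that multiplying many small densities does not underflow.

```python
import math

def gaussian_log_pdf(x, mean, stdev):
    """Log of the normal density, computed directly to avoid underflow."""
    return -math.log(stdev * math.sqrt(2 * math.pi)) - (x - mean) ** 2 / (2 * stdev ** 2)

def predict_one(row, summaries, priors):
    """Pick the class with the largest log prior + summed log densities.

    summaries maps class -> list of (mean, stdev) pairs, one per attribute.
    priors maps class -> fraction of training rows in that class.
    """
    best_label, best_score = None, -math.inf
    for label, stats in summaries.items():
        # Adding logs is equivalent to multiplying the probabilities.
        score = math.log(priors[label])
        for x, (mean, stdev) in zip(row, stats):
            score += gaussian_log_pdf(x, mean, stdev)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical summaries for two attributes, not taken from the dataset.
summaries = {0: [(1.0, 0.5), (20.0, 5.0)], 1: [(3.0, 0.5), (30.0, 5.0)]}
priors = {0: 0.65, 1: 0.35}

print(predict_one([1.1, 21.0], summaries, priors))  # row close to the class 0 means
print(predict_one([2.9, 29.0], summaries, priors))  # row close to the class 1 means
```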

In [0]:
import math
# This function is the probability density function (pdf) of the normal distribution; the formula can be found at http://scikit-learn.org/stable/modules/naive_bayes.html
def calculateProbability(x, mean, stdev):
  exponent = math.exp(-(math.pow(x-mean,2)/(2*math.pow(stdev,2))))
  return (1 / (math.sqrt(2*math.pi) * stdev)) * exponent

In [0]:
all_calc = []
# iterate through the x_test dataframe row by row
for i in range(0,len(x_test)):
  calc_one = []
  
  # calculate the probabilities using the means and sds of the 1 labels
  # (.iloc makes the positional lookups explicit)
  for j in range(0,len(one_sd)):
    calc_one.append(calculateProbability(x_test.iloc[i, j],one_mean.iloc[j],one_sd.iloc[j]))
  calc_zero = []
  
  # calculate the probabilities using the means and sds of the 0 labels
  for j in range(0,len(zero_sd)):
    calc_zero.append(calculateProbability(x_test.iloc[i, j],zero_mean.iloc[j],zero_sd.iloc[j]))
    
  # append both lists of probabilities into a final list
  all_calc.append([calc_one,calc_zero])

In [0]:
predictions = []
# iterate through our list, where each entry holds two lists of probabilities (class 1, class 0)
for calc in all_calc: 
  zero = 0
  one  = 0
  # now iterate through our two lists of probabilities
  for i in range(0,len(calc[0])): 
    
    # check which class gives this attribute the greater probability
    if calc[0][i] > calc[1][i]: 
      one += 1
    else: 
      zero += 1
      
  # note: this is a per-attribute majority vote, a simplification of the
  # product of probabilities described in the introduction
  # if more attributes favored class 1 than class 0, predict 1 (has diabetes)
  if one > zero: 
    predictions.append(1)
    
  # otherwise, predict 0 (no diabetes)
  else:
    predictions.append(0) 

In [186]:
print(predictions)
print(y_test['y1'].tolist())

[0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0]


In [0]:
correct = 0
total = 0
y_test_y1 = y_test.copy(deep = True)
y_test = y_test_y1['y1'].tolist()

In [235]:
print(type(y_test))
for i in range(len(predictions)): 
  if predictions[i] == y_test[i]: 
    correct += 1
  total +=1
accuracy = correct/total
print(accuracy)

<class 'list'>
0.7532467532467533
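An accuracy figure is easier to interpret next to a simple baseline, such as always predicting the most common label. The sketch below shows both calculations on made-up label lists; the helper names `accuracy` and `majority_baseline` are hypothetical and not part of the workshop code.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

def majority_baseline(y_true):
    """Accuracy obtained by always predicting the most common label."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)

# Hypothetical labels, for illustration only.
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 1, 0, 0, 1, 0]

print(accuracy(y_true, y_pred))    # 6 of 8 correct -> 0.75
print(majority_baseline(y_true))   # 5 of 8 are label 0 -> 0.625
```

A classifier is only adding value to the extent that its accuracy exceeds the baseline on the same test labels.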


## Tie it Together

In [0]:
class naive_bayes:
  def __init__(self):
    self.mean_one  = 0
    self.sd_one    = 0
    self.mean_zero = 0
    self.sd_zero   = 0

  def fit(self, x, y):
    # use the arguments passed in rather than the global x_train / y_train
    one_y  = y[y.y1 == 1]
    one_x  = x.loc[list(one_y.index)]

    zero_y = y[y.y1 == 0]
    zero_x = x.loc[list(zero_y.index)]

    # per-class mean and standard deviation of every attribute
    self.mean_one  = one_x.mean()
    self.mean_zero = zero_x.mean()

    self.sd_one  = one_x.std()
    self.sd_zero = zero_x.std()

  # static method: the pdf does not need access to the instance
  @staticmethod
  def calculateProbability(x, mean, stdev):
    exponent = math.exp(-(math.pow(x-mean,2)/(2*math.pow(stdev,2))))
    return (1 / (math.sqrt(2*math.pi) * stdev)) * exponent

  def predict(self, x_test):
    # local list, so repeated calls do not accumulate results in a global
    all_calc = []
    for i in range(0,len(x_test)):
      calc_one = []
      for j in range(0,len(self.sd_one)):
        calc_one.append(self.calculateProbability(x_test.iloc[i, j],self.mean_one.iloc[j],self.sd_one.iloc[j]))
      calc_zero = []
      for j in range(0,len(self.sd_zero)):
        calc_zero.append(self.calculateProbability(x_test.iloc[i, j],self.mean_zero.iloc[j],self.sd_zero.iloc[j]))
      all_calc.append([calc_one,calc_zero])
    predictions = []
    for calc in all_calc:
      zero = 0
      one  = 0
      # count how many attributes favor each class
      for i in range(0,len(calc[0])):
        if calc[0][i] > calc[1][i]:
          one += 1
        else:
          zero += 1
      if one > zero:
        predictions.append(1)
      else:
        predictions.append(0)

    return predictions

  def score(self, predictions, y_test):
    # compare the predictions passed in (not a global list) with the true labels
    correct = 0
    total = 0
    y_true = y_test['y1'].tolist()
    for i in range(len(predictions)):
      if predictions[i] == y_true[i]:
        correct += 1
      total += 1
    accuracy = correct/total
    print(accuracy)

In [0]:
x_train, x_test, y_train, y_test = train_test_split(features, labels, test_size = .2,random_state = 0)

In [0]:
clf = naive_bayes()

In [0]:
clf.fit(x_train,y_train)

In [0]:
predict = clf.predict(x_test)

In [266]:
clf.score(predict,y_test)

0.7532467532467533


## Sklearn's Naive Bayes


In [0]:
from sklearn.naive_bayes import GaussianNB

In [0]:
clf = GaussianNB()

In [274]:
# pass the labels as a 1-D Series; a one-column DataFrame triggers sklearn's DataConversionWarning
clf.fit(x_train, y_train['y1'])


GaussianNB(priors=None)

In [0]:
predict = clf.predict(x_test)

In [276]:
clf.score(x_test,y_test)

0.7922077922077922