# Classification of Iris species using Gaussian Naive Bayes ML model  


## Abstract

Machine learning is a rapidly developing field with many applications. 
In this notebook, I develop a **Naive Bayes classifier** from scratch using only Python's standard library (`math`, `random`, `pprint`) and the Pandas data-science library. Naive Bayes is a classifier with many applications, such as natural language analysis and spam detection and, in our case, label prediction. It works well on small datasets, it is computationally fast ($O(P)$ for prediction and $O(NP)$ for training, where $N$ is the number of training examples and $P$ the number of features) and, most importantly, it is quite simple to implement. 

The classifier is used on the Iris dataset to identify 3 different species of iris given the dimensions of its sepal and petal. The program has 3 main components: data pre-processing, maximum likelihood estimators of a normal probability distribution's mean and standard deviation, and the classifier which finds the probability given the test data and the maximum likelihood estimates.

After developing the model, we found a promising accuracy of 94% to 98%, depending on the random split, on a test set of 51 Iris instances it had not seen before. 

## Introduction

 
    "If you torture the data long enough, it will confess to anything." -ChillPlate

Machine learning is a rapidly developing field in computer science and applied mathematics. Its applications range from simulating cognition using neural networks to creating videogame cheating bots, detecting spam in email and replacing intellectual labor.

### Definitions  

#### Machine learning
Let us define some **Machine Learning** concepts:  

**Classification**: The act of assigning data to classes according to certain rules. 
Ex. Which set does the dataset $D = \{4,8,15,16,23,42\}$ belong to, given that it is not the full sequence of the set? One answer would be the set of integers between 1 and 100; that set is a class. 

**Dataset** ($D$): Any collection of data. This may include the features and the labels. 
<br>
<br>
**Features** ($X$):
   - General Case: What we have, X axis, the inputs, $X = (x_1,x_2, ... ,x_n)$ 
   - Specific Case: The vector which contains 2 different dimensions of Sepals and petals of an Iris flower, 
   
   $$SepalPetal = SP = (SepalLength,SepalWidth,PetalLength,PetalWidth)$$  
    
**Labels** ($C_k$ or $Y$):
- General Case: What we're looking for, Y axis, target data, the output $Y = (y_1,y_2, ... , y_n)$    
- Specific Case: The set which contains the species of Iris: 

$$Species = \{Setosa, Virginica, Versicolor\}$$   

**Training Set** ($Tr$): the set which contains both the features and the labels which is used to train the algorithm.

**Test Set** ($T$): A set containing only features from which we extract the labels using the algorithm. This set is what is used for prediction; comparing the predictions with the true labels, which are held back, allows us to determine the accuracy of our classifier.

There are 3 classes of ML (machine learning) problems in general: supervised learning, unsupervised learning and reinforcement learning.

**Supervised learning** (which is the case of this ESP's problem): We have the features and we have the labels. We give the computer examples of features and their corresponding labels. The computer learns to find the correct labels given a set (or a vector) of features.

**Unsupervised Learning**: In this case, only features are given to the computer and the computer finds patterns in them. This is an active area of research in machine learning, mostly in artificial neural networks.

**Reinforcement Learning**: In this category of problems, the machine learns to excel at a certain task by doing it over and over again and evaluating itself. This is analogous to playing. For example, AlphaGo, Google's Go-playing algorithm, is an example of reinforcement learning.

#### Probability

**Parameter** ($\theta$): any statistical parameter of a distribution. E.g. mean $\mu$ or standard deviation $\sigma$.  

**Likelihood** $L(\theta)$: The probability (or probability density) of the observed data, viewed as a function of the parameter $\theta$. It measures how well a candidate value of $\theta$ explains the data.

**Maximum likelihood**: The value of a parameter that maximizes the likelihood, i.e. the value under which the observed data is most probable. 

**Maximum likelihood estimator** (MLE): A function of the data that returns the maximum likelihood value of a parameter of the probability distribution.

### Data:

Our dataset is the Iris dataset, introduced by R.A. Fisher in 1936 and available from the UCI repository since 1988. It contains the sepal and petal dimensions (in cm) of 150 iris flowers, each belonging to one of 3 species: Iris-setosa, Iris-virginica or Iris-versicolor.

This dataset is meant for classification purposes and is a very common dataset referenced in machine learning literature. 

It is formatted as follows:  

| Sepal Length | Sepal width | Petal Length | Petal Width | Class |
|--------------|-------------|--------------|------------ |-------|


It was retrieved from https://archive.ics.uci.edu/ml/datasets/iris as a zip file.

## Model & Method of Analysis:   

What is a **Gaussian Naive Bayes Classifier**?

**Gaussian**: Each feature is assumed to be normally distributed within each class, and the maximum likelihood estimators are those of a Gaussian distribution.  

**Naive**: It is assumed that the values of the features are independent of each other given the class.  

**Bayes**: The classifier uses Bayes' theorem on conditional probability to find the most probable result.  

**Classifier**: a function which assigns a label given a set/vector of features.  

A **Gaussian Naive Bayes Classifier** is a function that assumes the features of a dataset are _independent_ and __classifies__ data using _Bayes' theorem_ by choosing the label to which a given set of features most probably belongs.

Our computer program will consist of 3 parts:

1. Data Pre-Processing (list manipulations; this step is purely computational, refer to the code)
2. Data Training (Maximum Likelihood Estimations)
3. Data Prediction (Using the Gaussian Naive Bayes classifier)

 
#### The Maximum Likelihood Estimator:
The MLE is defined as a function which gives point estimates of the statistics of our data. The point estimates we seek in our problem are the mean and the standard deviations of a normal distribution. This is because we assume the dimensions of iris petals and sepals to be normally distributed.   

To find the maximum likelihood estimates for the mean and the variance of a normally distributed population, we first define the likelihood function for a random sample of size $n$:  

$$L(\mu,\sigma^2) = \prod_{i=1}^n \frac{1}{\sigma \sqrt{2\pi}}e^{-(x_i - \mu)^2/(2\sigma^2)}$$

If we turn the product of exponentials into a sum in the exponent, we get:

$$L(\mu,\sigma^2) =  \frac{1}{(2\pi\sigma^2)^{n/2}}e^{-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i-\mu)^2}$$

Then we take the natural logarithm of both sides to get rid of the exponents:

$$\ln L(\mu,\sigma^2) = -\frac{n}{2} \ln(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2$$

Now if we take the partial derivatives with respect to $\mu$ and $\sigma^2$ and set them to 0 (to find the maximum value), we get:

$$ \hat{\mu} = \bar{X}$$
<br>
$$and$$
<br>
$$\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X})^2 $$

These respectively correspond to the arithmetic mean and the population variance (the square of the population standard deviation). These are the values that our estimator will calculate.
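
As a quick numerical check of these closed-form estimates, the sketch below computes $\hat{\mu}$ and $\hat{\sigma}^2$ for a small sample and confirms that nudging either value lowers the log-likelihood. The function names `log_likelihood` and `mle_normal` are illustrative and not part of the notebook's code; the five sample values are sepal lengths taken from Virginica rows shown in this notebook.

```python
import math

def log_likelihood(xs, mu, var):
    """Log-likelihood of a normal distribution with mean mu and variance var."""
    n = len(xs)
    squared_error = sum((x - mu) ** 2 for x in xs)
    return -0.5 * n * math.log(2 * math.pi * var) - squared_error / (2 * var)

def mle_normal(xs):
    """Closed-form maximum likelihood estimates: sample mean and 1/n variance."""
    n = len(xs)
    mu_hat = sum(xs) / n
    var_hat = sum((x - mu_hat) ** 2 for x in xs) / n
    return mu_hat, var_hat

# Five Virginica sepal lengths (cm), used only as an illustration
sample = [6.3, 6.0, 7.7, 6.0, 6.8]
mu_hat, var_hat = mle_normal(sample)
best = log_likelihood(sample, mu_hat, var_hat)

# Moving away from the estimates in either direction lowers the log-likelihood
for delta in (-0.1, 0.1):
    assert log_likelihood(sample, mu_hat + delta, var_hat) < best
    assert log_likelihood(sample, mu_hat, var_hat + delta) < best

print(mu_hat, var_hat)
```

A check like this does not prove the result, but it is a cheap way to catch a sign or factor error in a derivation.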

#### General Classifier:
<br>
Consider a vector $X$ containing $n$ features:  

$$X = (x_1,...,x_n)$$  

The probability that the features belong to a certain class $C_k$ (the $k$-th class), given the vector of features, is:

$$P(C_k|X)$$  

Using Bayes' theorem, this probability can be written as:  

$$P(C_k|X)= \frac{P(C_k)P(X|C_k)}{p(X)}$$  

The denominator is the same for every class, so when comparing probabilities across classes it can be ignored. 
We calculate the numerator using the chain rule (multiplication rule):  

$$P(C_k \cap X) = P(C_k)P(X|C_k)$$
$$= P(x_1|x_2,...,x_n,C_k)P(x_2|x_3,...,x_n,C_k) \cdots P(x_{n-1}|x_n,C_k)P(x_n|C_k)P(C_k)$$

At this point, we assume for simplicity's sake that the features are independent of each other given the class. This makes our lives and the computer's processor's life easier, since there is no need to calculate the joint probabilities and their covariances. The algorithm can apparently still be effective under this assumption; we shall see about that. Our equation reduces to:

$$P(C_k \cap X) = P(C_k)P(X|C_k) = P(x_1|C_k)P(x_2|C_k) \cdots P(x_n|C_k)P(C_k)$$

and:  

$$ P(C_k|X) \propto P(C_k \cap X)$$ 

So we see that the posterior probability is proportional to the product of the likelihood and the prior probability.

This means we only need

$$P(C_k \cap X)$$  

to estimate the most likely class that a set of features belongs to.
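
The decision rule above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the notebook's implementation: the names `joint_score` and `classify`, the dictionary layout of `model`, and the two-class toy parameters are all hypothetical.

```python
import math

def normal_pdf(x, mean, stdev):
    """Density of a normal distribution at x."""
    return math.exp(-((x - mean) ** 2) / (2 * stdev ** 2)) / (stdev * math.sqrt(2 * math.pi))

def joint_score(features, prior, means, stdevs):
    """P(C_k) times the product of the per-feature densities P(x_i | C_k)."""
    score = prior
    for x, mean, stdev in zip(features, means, stdevs):
        score *= normal_pdf(x, mean, stdev)
    return score

def classify(features, model):
    """Return the label with the highest joint score, plus all scores."""
    scores = {label: joint_score(features, *params) for label, params in model.items()}
    return max(scores, key=scores.get), scores

# Hypothetical model: label -> (prior, per-feature means, per-feature stdevs)
model = {
    "A": (0.5, [1.0, 1.0], [0.5, 0.5]),
    "B": (0.5, [3.0, 3.0], [0.5, 0.5]),
}

label, scores = classify([1.2, 0.9], model)
print(label, scores)
```

Because the evidence $p(X)$ is never computed, the scores are not probabilities that sum to 1; only their ordering matters for the prediction.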

#### Specific Classifier
<br>
Consider the vector SepalPetal ($SP$) containing 4 features that we assume to be **independent**:  
        
$$SP = (SepalLength\space,SepalWidth\space,PetalLength\space,PetalWidth)$$  
        
<br>
For example, to find the probability that the features describe a certain species of Iris (Virginica in this case):

$$P(Virginica\space |\space SP)$$  
        
Using Bayes' theorem, we start from the prior (a priori) probability, i.e. the probability of an Iris being Virginica, which is equal to the proportion of Virginica irises among all training examples:  

$$P(Virginica) =\frac{N_{Virginica}}{N_{total}}$$  
<br>
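
A prior of this form can be computed by counting labels. The sketch below is illustrative only: `class_priors` is a hypothetical helper and the four labels are a toy list, not the notebook's training set. Note that the notebook's own prediction code uses a fixed prior of 1/3 per species rather than the observed proportions.

```python
from collections import Counter

def class_priors(labels):
    """Proportion of each label among all training examples."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

# Toy label list, for illustration only
labels = ["Iris-setosa", "Iris-virginica", "Iris-virginica", "Iris-versicolor"]
priors = class_priors(labels)
print(priors)
```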

![title](./Diagrams/Virginica.png)

The prior probability is multiplied by the likelihood $P(SP\space|\space Virginica)$ and normalized by the evidence $p(SP)$:
        
$$P(Virginica\space|\space SP)= \frac{P(Virginica)P(SP\space|\space Virginica)}{p(SP)}$$  
<br>

As discussed in the general case, the denominator (the evidence) can be ignored. However, for academic purposes, its value may be calculated as follows (This can be proven using a tree diagram):  
        
$$p(SP) = P(Setosa)P(SP|Setosa) + P(Versicolor)P(SP|Versicolor) + P(Virginica)P(SP|Virginica)$$
<br>
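
To illustrate the normalization, the sketch below divides each joint score by their sum, which is exactly the evidence $p(SP)$ from the formula above. The helper name `normalize` is hypothetical; the three joint scores are the values this notebook prints for one Setosa test row in the prediction section.

```python
def normalize(joint_scores):
    """Turn joint scores P(species and SP) into posteriors P(species | SP)."""
    evidence = sum(joint_scores.values())  # p(SP) by the law of total probability
    return {label: score / evidence for label, score in joint_scores.items()}

# Joint scores printed by the notebook for a single test row
joint = {
    "Iris-setosa": 0.05389749102939712,
    "Iris-versicolor": 1.676124504414165e-17,
    "Iris-virginica": 8.546169798522405e-23,
}

posterior = normalize(joint)
print(posterior)
```

The ranking of the species is the same before and after normalization, which is why the classifier can skip this step.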

We calculate the numerator using the chain rule (multiplication rule):  
        
$$P(Virginica \cap SP) = P(Virginica)P(SP\space|\space Virginica)$$

given independence of features:
      
$$P(Virginica \cap SP) = P(Vir)P(SP|Vir) $$ <br>$$= P(Vir)P(Sepal_{Length}|Vir)P(Sepal_{Width}|Vir)P(Petal_{Length}|Vir)P(Petal_{Width}|Vir)$$

and:  
    
$$ P(Virginica \space |\space SP) \propto P(Virginica \cap SP)$$ 
      
We use the calculated value of $P(Virginica \cap SP)$ to choose the most likely estimate for the class of the test data '$Iris$'.

Finally, the probability values for all three different species are calculated and compared. 

The diagram underneath is a visual representation of how to calculate the probability for each of the three possible classes.

![title](./Diagrams/Overall.png)

### Here's an overview of the whole process.
   
    

**Assumptions**:
- The dimensions of the Iris petals and sepals are normally distributed.
- These dimensions are independent of each other.

**Probability Density Function**:
$$P(X_i) = \frac{1}{\sigma \sqrt{2\pi}}e^{-(x_i - \mu)^2/(2\sigma^2)}$$

**Maximum Likelihood Estimators**:
$$\hat{\mu} = \bar{X}$$
<br>
$$\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X})^2 $$  

**Classifier**: Gaussian Naive Bayes
$$ P(Virginica \space |\space SP) \propto P(Virginica \cap SP)$$ 

#### Graph of model, seen below:

Nodes:
- $\mu_{P_L}$ : first component of the Train Set Statistics $$Tr_{Stat} = \{S_l\{\mu_{feature},\sigma_{feature}\}\ ... P_w\{\mu_{feature},\sigma_{feature}\} \}$$  

- $Test_{P_L}$ : a component of the Test Set $T_{ij} = (S_l,S_w,P_l,P_w,IrisType)$

Methods
0. Our training data is preprocessed into the proper format. (e.g. Training data **labeled Virginica**)
1. The labeled training data is used to find the maximum likelihood estimates of each feature in the feature vector. ($Tr_{Stat} = \{\mu_s,\sigma_s\}$  )
2. The probability of the test vector $T$ belonging to each category is calculated from the normal density of each of its features $x_i$:

$$\frac{1}{\sigma \sqrt{2\pi}}e^{-(x_i - \mu)^2/(2\sigma^2)}$$

3. The class with the highest probability is chosen as the predicted value. $$Max \space (\{P(Setosa \cap SP),P(Virginica \cap SP),P(Versicolor \cap SP)\})$$

![title](./Diagrams/Prediction.png)

## Code + Results
The code is comprehensively commented as an attempt to keep relevant code and explanation in the same place. Please read them as if they were part of the ESP.

### Data Pre-Processing

#### Import Libraries

In [1]:
#import libraries 
import random # to randomize the training dataset
import math # self-explanatory
import pprint # to pretty print our initial non-panda dataset
import pandas as pd # to list and do vector manipulation

#### Get the Data

In [2]:
#ready filename
filename = './Irisdata/iris.data'

In [3]:
#Define function to create dataset from file
#returns a list of rows, each row being a list of strings
def getData(file_name):
    data = []
    #read lines from the file given as argument; `with` closes it afterwards
    with open(file_name, "r") as file:
        for line in file:
            #strip the trailing newline and skip blank lines
            line = line.strip()
            if line:
                data.append(line.split(','))
    return data
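
As an alternative sketch, the standard `csv` module can do the splitting, and the four measurements can be converted to floats at load time so that later steps do not need to convert strings. The function name `parse_iris_rows` and the two-row sample text are illustrative assumptions; the notebook itself keeps the values as strings at this stage.

```python
import csv
import io

def parse_iris_rows(text_stream):
    """Parse comma-separated iris rows into [float, float, float, float, label]."""
    rows = []
    for record in csv.reader(text_stream):
        if not record:  # csv.reader yields an empty list for blank lines
            continue
        *measurements, label = record
        rows.append([float(value) for value in measurements] + [label])
    return rows

# Two rows in the same format as iris.data, followed by a blank line
sample_text = "5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.0,1.4,0.2,Iris-setosa\n\n"
rows = parse_iris_rows(io.StringIO(sample_text))
print(rows)
```

With a real file, the same function could be called as `parse_iris_rows(open(path, newline=''))` inside a `with` block.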

In [4]:
#Make the dataset
dataset = getData(filename)

#lets' pretty print our dataset using the pretty print library
pp = pprint.PrettyPrinter(indent=4)
pp.pprint(dataset)

[   ['5.1', '3.5', '1.4', '0.2', 'Iris-setosa'],
    ['4.9', '3.0', '1.4', '0.2', 'Iris-setosa'],
    ['4.7', '3.2', '1.3', '0.2', 'Iris-setosa'],
    ['4.6', '3.1', '1.5', '0.2', 'Iris-setosa'],
    ['5.0', '3.6', '1.4', '0.2', 'Iris-setosa'],
    ['5.4', '3.9', '1.7', '0.4', 'Iris-setosa'],
    ['4.6', '3.4', '1.4', '0.3', 'Iris-setosa'],
    ['5.0', '3.4', '1.5', '0.2', 'Iris-setosa'],
    ['4.4', '2.9', '1.4', '0.2', 'Iris-setosa'],
    ['4.9', '3.1', '1.5', '0.1', 'Iris-setosa'],
    ['5.4', '3.7', '1.5', '0.2', 'Iris-setosa'],
    ['4.8', '3.4', '1.6', '0.2', 'Iris-setosa'],
    ['4.8', '3.0', '1.4', '0.1', 'Iris-setosa'],
    ['4.3', '3.0', '1.1', '0.1', 'Iris-setosa'],
    ['5.8', '4.0', '1.2', '0.2', 'Iris-setosa'],
    ['5.7', '4.4', '1.5', '0.4', 'Iris-setosa'],
    ['5.4', '3.9', '1.3', '0.4', 'Iris-setosa'],
    ['5.1', '3.5', '1.4', '0.3', 'Iris-setosa'],
    ['5.7', '3.8', '1.7', '0.3', 'Iris-setosa'],
    ['5.1', '3.8', '1.5', '0.3', 'Iris-setosa'],
    ['5.4', '3.4', '

#### Split the data into Train and Test

In [5]:
# Create Training Data

# Here, we define a function to split the data in 2 parts. 
# We draw the training rows at random so that the training sample is likely to
# include all classes (the file is sorted by species).
# We shall use the larger part to train, and the smaller part to test our algorithm
splitfraction = 0.66
def makeTrainingData(data, splitfraction):
    trainSize = int(len(data) * splitfraction)
    trainingSet = []
    dataCopy = list(data)
    #move randomly chosen rows from the copy into the training set
    while len(trainingSet) < trainSize:
        i = random.randrange(len(dataCopy))
        trainingSet.append(dataCopy.pop(i))
    #whatever is left in the copy becomes the test set
    return [trainingSet, dataCopy]
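
Because the split is random, every run trains and tests on different rows, so the measured accuracy changes from run to run. A seeded split makes a run repeatable. The sketch below is one way to do that, assuming a hypothetical helper `split_rows`; unlike `makeTrainingData`, it shuffles a copy and cuts it, so the test rows do not keep their original order.

```python
import random

def split_rows(rows, train_fraction, seed=None):
    """Shuffle a copy of rows and cut it into (train, test)."""
    rng = random.Random(seed)  # private generator, leaves the global one untouched
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Stand-in for the 150 iris rows
rows = list(range(150))
train_a, test_a = split_rows(rows, 0.66, seed=42)
train_b, test_b = split_rows(rows, 0.66, seed=42)

print(len(train_a), len(test_a))
print(train_a == train_b)
```

With 150 rows and a fraction of 0.66 this gives 99 training rows and 51 test rows, the same sizes as the split above.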

In [6]:
#make two lists, one to train, one to test
train, test = makeTrainingData(dataset, splitfraction)

#let's print out the results (prettily)

print('{0} training vectors:\n'.format(len(train)))
pp.pprint(train)

print( '\n \n \n and {0} test vectors:'.format(len(test)))
pp.pprint(test)

#you can click on the output window and scroll through the results

99 training vectors:

[   ['4.8', '3.1', '1.6', '0.2', 'Iris-setosa'],
    ['6.0', '2.2', '4.0', '1.0', 'Iris-versicolor'],
    ['6.3', '3.3', '6.0', '2.5', 'Iris-virginica'],
    ['6.1', '2.9', '4.7', '1.4', 'Iris-versicolor'],
    ['5.4', '3.0', '4.5', '1.5', 'Iris-versicolor'],
    ['6.0', '3.0', '4.8', '1.8', 'Iris-virginica'],
    ['7.7', '3.0', '6.1', '2.3', 'Iris-virginica'],
    ['5.3', '3.7', '1.5', '0.2', 'Iris-setosa'],
    ['5.0', '3.6', '1.4', '0.2', 'Iris-setosa'],
    ['6.0', '2.2', '5.0', '1.5', 'Iris-virginica'],
    ['5.1', '3.7', '1.5', '0.4', 'Iris-setosa'],
    ['6.8', '3.0', '5.5', '2.1', 'Iris-virginica'],
    ['6.4', '2.8', '5.6', '2.2', 'Iris-virginica'],
    ['5.6', '2.8', '4.9', '2.0', 'Iris-virginica'],
    ['7.9', '3.8', '6.4', '2.0', 'Iris-virginica'],
    ['4.7', '3.2', '1.3', '0.2', 'Iris-setosa'],
    ['6.0', '3.4', '4.5', '1.6', 'Iris-versicolor'],
    ['5.0', '2.3', '3.3', '1.0', 'Iris-versicolor'],
    ['6.1', '2.6', '5.6', '1.4', 'Iris-virginica'],


### Turn list training set into Pandas DataFrame
We do this to save ourselves the headache of iterating through every value in the list to get the mean and standard deviation. It's also a very nice way of visualizing the data.

Here, we also define Training sets $Tr_{Species} \subset Tr$, $Species$ being the set  

$$Species = \{Setosa, Virginica, Versicolor\}$$   

Separating the training data by its known labels like this is what makes the __Naive Bayes__ model a **supervised** model. 

In [7]:
df = pd.DataFrame(train)
df = df.rename(columns = {0:'SL',1:'SW',2:'PL',3:'PW',4:'Iris'})

Setosa = df[df['Iris'] == "Iris-setosa"] 
Versicolor = df[df['Iris'] == "Iris-versicolor"] 
Virginica = df[df['Iris'] == "Iris-virginica"] 

In [8]:
Virginica

Unnamed: 0,SL,SW,PL,PW,Iris
2,6.3,3.3,6.0,2.5,Iris-virginica
5,6.0,3.0,4.8,1.8,Iris-virginica
6,7.7,3.0,6.1,2.3,Iris-virginica
9,6.0,2.2,5.0,1.5,Iris-virginica
11,6.8,3.0,5.5,2.1,Iris-virginica
12,6.4,2.8,5.6,2.2,Iris-virginica
13,5.6,2.8,4.9,2.0,Iris-virginica
14,7.9,3.8,6.4,2.0,Iris-virginica
18,6.1,2.6,5.6,1.4,Iris-virginica
21,6.3,2.5,5.0,1.9,Iris-virginica


    This is the end of Data-Preprocessing.

## Maximum Likelihood Estimates for a Normal Probability Distribution

### Population $\mu$ and $\sigma$ estimators
in Python 3

_The two functions below are descriptive and for demonstration. Built-in Pandas functions are used to get the mean and the standard deviation of the data._

In [9]:
#This calculates the maximum likelihood estimate of a normal dist.'s mean
#(named listSum so that it does not shadow Python's built-in sum)
def listSum(nums):
    total = 0
    for i in nums:
        total += i
    return total

def Mean(nums):
    return listSum(nums)/len(nums)

In [10]:
#This calculates the sample standard deviation (dividing by n-1), which is what Pandas' .std() returns by default.
#The maximum likelihood estimate derived above divides by n instead.
def standardDeviation(nums):
    mean = Mean(nums)
    variance = sum([(x-mean)**2 for x in nums])/float(len(nums)-1)
    return math.sqrt(variance) 

In [11]:
print(Mean([1,3,5,9,4,3]))
print(standardDeviation([1,3,5,9,4,3,2]))


4.166666666666667
2.6095064302514777
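
The difference between dividing by $n-1$ and dividing by $n$ can be seen with Python's standard `statistics` module, where `stdev` uses $n-1$ and `pstdev` uses $n$. This is only a side-by-side sketch on the same seven numbers as above; the notebook itself does not use the `statistics` module.

```python
import statistics

nums = [1, 3, 5, 9, 4, 3, 2]

sample_sd = statistics.stdev(nums)       # divides by n - 1, like standardDeviation above
population_sd = statistics.pstdev(nums)  # divides by n, the maximum likelihood estimate

print(sample_sd)      # about 2.61
print(population_sd)  # about 2.42
```

The two values differ by a factor of $\sqrt{(n-1)/n}$, so the gap shrinks as the sample grows.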


### Train: Create statistic dataframes
Here, we take the mean and the standard deviation of the data using Pandas' *.mean()* and *.std()* functions and assign them to a table for each species type. Note that *.std()* divides by $n-1$ by default, so it differs slightly from the maximum likelihood estimate. This is what we mean when we train a model. In other words, we perform an operation on the training data $Tr$ and get values which will be used by the classifier to classify our test data.

In [12]:
#The columns hold strings, so we convert them with .astype(float) before taking statistics
SetosaStat = pd.DataFrame({
    'Sepal Length' : [Setosa.SL.astype(float).mean(),Setosa.SL.astype(float).std()],
    'Sepal Width' : [Setosa.SW.astype(float).mean(),Setosa.SW.astype(float).std()],
    'Petal Length' : [Setosa.PL.astype(float).mean(),Setosa.PL.astype(float).std()],
    'Petal Width' : [Setosa.PW.astype(float).mean(),Setosa.PW.astype(float).std()]
}, index = ['Mean','Sd'])

VersicolorStat = pd.DataFrame({
    'Sepal Length' : [Versicolor.SL.astype(float).mean(),Versicolor.SL.astype(float).std()],
    'Sepal Width' : [Versicolor.SW.astype(float).mean(),Versicolor.SW.astype(float).std()],
    'Petal Length' : [Versicolor.PL.astype(float).mean(),Versicolor.PL.astype(float).std()],
    'Petal Width' : [Versicolor.PW.astype(float).mean(),Versicolor.PW.astype(float).std()]
}, index = ['Mean','Sd'])

VirginicaStat = pd.DataFrame({
    'Sepal Length' : [Virginica.SL.astype(float).mean(),Virginica.SL.astype(float).std()],
    'Sepal Width' : [Virginica.SW.astype(float).mean(),Virginica.SW.astype(float).std()],
    'Petal Length' : [Virginica.PL.astype(float).mean(),Virginica.PL.astype(float).std()],
    'Petal Width' : [Virginica.PW.astype(float).mean(),Virginica.PW.astype(float).std()]
}, index = ['Mean','Sd'])

In [13]:
print("Setosa statistics")
SetosaStat

Setosa statistics


Unnamed: 0,Sepal Length,Sepal Width,Petal Length,Petal Width
Mean,5.033333,3.443333,1.433333,0.23
Sd,0.375393,0.348082,0.156102,0.095231


In [14]:
print("Versicolor statistics")
VersicolorStat

Versicolor statistics


Unnamed: 0,Sepal Length,Sepal Width,Petal Length,Petal Width
Mean,5.990625,2.79375,4.296875,1.334375
Sd,0.522006,0.285044,0.45258,0.178902


In [15]:
print("Virginica statistics")
VirginicaStat[:2]



Virginica statistics


Unnamed: 0,Sepal Length,Sepal Width,Petal Length,Petal Width
Mean,6.502703,2.997297,5.516216,2.040541
Sd,0.656373,0.356282,0.525206,0.270246


We assume the values of each species' features to be normally distributed.

## Let's convert the Dataframes back to lists

In [16]:
#The statistics lists with the MLEs for all 3 species
SetStatlist = SetosaStat.values.tolist()
VerStatlist =VersicolorStat.values.tolist()
VirStatlist = VirginicaStat.values.tolist()

In [17]:
#The test set as a dataframe for aesthetic purposes
Test = pd.DataFrame(test)
Test

Unnamed: 0,0,1,2,3,4
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,5.4,3.9,1.7,0.4,Iris-setosa
3,4.6,3.4,1.4,0.3,Iris-setosa
4,4.4,2.9,1.4,0.2,Iris-setosa
5,5.4,3.7,1.5,0.2,Iris-setosa
6,4.8,3.4,1.6,0.2,Iris-setosa
7,4.8,3.0,1.4,0.1,Iris-setosa
8,5.4,3.9,1.3,0.4,Iris-setosa
9,5.1,3.8,1.5,0.3,Iris-setosa


    This is the end of Training

## Finding Probabilities and Predicting

In [18]:
#To calculate the normal probability density of each feature value
def normalPDF(x, mean, stdev):
    exp = math.exp(-(math.pow(x-mean,2)/(2*math.pow(stdev,2))))
    return (1 / (math.sqrt(2*math.pi) * stdev)) * exp
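
A small sanity check helps when reading the numbers this function returns: it gives a probability *density*, not a probability, so values above 1 are normal when the standard deviation is small. The sketch below re-declares the same formula under the hypothetical name `normal_pdf` so that it runs on its own, and evaluates it at the Setosa petal-width mean and standard deviation printed in the training section.

```python
import math

def normal_pdf(x, mean, stdev):
    """Same formula as normalPDF above, repeated so the example is self-contained."""
    exponent = math.exp(-((x - mean) ** 2) / (2 * stdev ** 2))
    return exponent / (math.sqrt(2 * math.pi) * stdev)

# Setosa petal width statistics from the training output: mean 0.23, sd 0.095231
peak = normal_pdf(0.23, 0.23, 0.095231)
left = normal_pdf(0.13, 0.23, 0.095231)
right = normal_pdf(0.33, 0.23, 0.095231)

print(peak)         # roughly 4.19, a density greater than 1
print(left, right)  # equal up to rounding, by symmetry around the mean
```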

### Probability functions and predictions; Finally!

Here, we find the probability $P(Virginica \cap SP)$ to choose the most likely estimate for the class of our test data.

In [19]:
#Each function below returns P(species and SP) for row x of Testlist:
#the prior times the product of the four per-feature normal densities.
def SetosaProbabilityProduct(Testlist,x):
    #initial value for product, because n * 1 = n...
    product = 1
    
    #P(Setosa): this 1/3 represents a uniform prior over the three species. 
    prior = 1/3
    
    #multiply the densities of the 4 features under the Setosa mean and sd
    for i in range(4):
        product *= normalPDF(float(Testlist[x][i]),float(SetStatlist[0][i]),float(SetStatlist[1][i]))
    return product * prior 

def VirProbabilityProduct(Testlist,x):
    product = 1
    prior = 1/3
    for i in range(4):
        product *= normalPDF(float(Testlist[x][i]),float(VirStatlist[0][i]),float(VirStatlist[1][i]))
    return product * prior

def VerProbabilityProduct(Testlist,x):
    product = 1
    prior = 1/3
    for i in range(4):
        product *= normalPDF(float(Testlist[x][i]),float(VerStatlist[0][i]),float(VerStatlist[1][i]))
    return product * prior
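
The three functions above differ only in which statistics list they read, so they could be folded into one function that takes the list as an argument. The sketch below does that and, as a further assumption of mine rather than something the notebook does, works with logarithms: summing log-densities avoids the very small products that appear when many densities are multiplied. The names `log_normal_pdf` and `log_joint` are hypothetical; the statistics are the rounded Setosa values printed in the training section.

```python
import math

def log_normal_pdf(x, mean, stdev):
    """Natural log of the normal density at x."""
    return -math.log(stdev * math.sqrt(2 * math.pi)) - ((x - mean) ** 2) / (2 * stdev ** 2)

def log_joint(row, stat_list, prior):
    """log P(species and SP) for one row; stat_list is [means, stdevs]."""
    means, stdevs = stat_list
    total = math.log(prior)
    for i in range(4):
        total += log_normal_pdf(float(row[i]), means[i], stdevs[i])
    return total

# Rounded Setosa statistics as printed in the training section: [means, stdevs]
setosa_stats = [[5.033333, 3.443333, 1.433333, 0.23],
                [0.375393, 0.348082, 0.156102, 0.095231]]

# One Setosa row from the test set shown above
row = ['5.4', '3.9', '1.7', '0.4', 'Iris-setosa']
score = log_joint(row, setosa_stats, 1/3)

print(score)
print(math.exp(score))  # approximately 0.054
```

Since the logarithm is increasing, comparing log scores picks the same species as comparing the products.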

### Naive Bayes Classifier! 

In [20]:
#gets the probabilites for all different species and compares them
#returns a list with probabilities and Prediction [Set,Ver,Vir,Prediction]
def getProbabilities(singleTest,x):
    Prediction = ''
    
    Ps = SetosaProbabilityProduct(singleTest,x)
    Pver = VerProbabilityProduct(singleTest,x)
    Pvir = VirProbabilityProduct(singleTest,x)

    probs = [Ps ,Pver, Pvir]
    m = max(probs)
    if m == Ps:
        Prediction = 'Iris-setosa'
    elif m == Pvir:
        Prediction = 'Iris-virginica'
    elif m == Pver:
        Prediction = 'Iris-versicolor'
    else:
        print("something's wrong")
    return probs, Prediction

In [21]:
#Get probability for a single row/Iris in the test set
getProbabilities(test,2)


([0.05389749102939712, 1.676124504414165e-17, 8.546169798522405e-23],
 'Iris-setosa')

### Scaling up

Let's tackle the whole test set...
and define nice functions that produce nice useful results. _Functions described in comments._

In [22]:
#this predicts the entire test set and returns a list of the probabilities and the predicted species
def TestSetPredict(testlist):
    predictionSet = []
    for i in range(len(testlist)):
        #pass the function's own argument (testlist), not an undefined name
        predictionSet.append(getProbabilities(testlist,i))
    return predictionSet

In [23]:
#this predicts the entire test set and returns their predictions
def JustPredictions(testlist):
    predictionSet = []
    for i in range(len(testlist)):
        probabilities = getProbabilities(testlist,i)
        predictionSet.append(probabilities[-1])
    return predictionSet

In [24]:
#this one measures the accuracy as (# of successful predictions/# of Total Predictions)
def Accuracy(testlist):
    predictions = JustPredictions(testlist)
    success = 0

    for i in range(len(testlist)):
        if predictions[i] == testlist[i][-1]:
            success = success + 1
    accuracy = success/len(predictions)
    return accuracy

In [25]:
#this one gives the true values and the predictions in one list, along with whether each guess was right or not.
def PredvsTrue(testlist):
    predictions = JustPredictions(testlist)
    yayornay = []
    PvT= []
    for i in range(len(testlist)):
        if predictions[i] == testlist[i][-1]:
            yayornay.append('Y')
        else:
            yayornay.append('NO')
            
    for i in range(len(testlist)):
        PvT.append([testlist[i][-1], predictions[i],yayornay[i]] )
    return PvT    
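
To see at a glance which species get mixed up, the rows returned by a function like `PredvsTrue` can be tallied into (true label, predicted label) pairs. This is a minimal sketch with a hypothetical helper `confusion_counts`; the three sample rows are taken from the Virginica results shown at the end of this notebook.

```python
from collections import Counter

def confusion_counts(pred_vs_true):
    """Count (true_label, predicted_label) pairs from [true, predicted, flag] rows."""
    return Counter((row[0], row[1]) for row in pred_vs_true)

# Three rows in the format PredvsTrue returns
rows = [
    ['Iris-virginica', 'Iris-virginica', 'Y'],
    ['Iris-virginica', 'Iris-versicolor', 'NO'],
    ['Iris-virginica', 'Iris-virginica', 'Y'],
]

counts = confusion_counts(rows)
for (true_label, predicted_label), count in counts.items():
    print(true_label, '->', predicted_label, ':', count)
```

Pairs where the two labels differ are the misclassifications; a single accuracy figure hides which classes they fall between.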

In [26]:
#We can see all the true values, the predictions and whether they differ in one table.
#This allows us to see which parts of our data were not accurately predicted and 
#helps us understand why.
pd.DataFrame(PredvsTrue(test))

Unnamed: 0,0,1,2
0,Iris-setosa,Iris-setosa,Y
1,Iris-setosa,Iris-setosa,Y
2,Iris-setosa,Iris-setosa,Y
3,Iris-setosa,Iris-setosa,Y
4,Iris-setosa,Iris-setosa,Y
5,Iris-setosa,Iris-setosa,Y
6,Iris-setosa,Iris-setosa,Y
7,Iris-setosa,Iris-setosa,Y
8,Iris-setosa,Iris-setosa,Y
9,Iris-setosa,Iris-setosa,Y


#### Accuracy

In [27]:
#We get 94-98%, depending on the random split! That's amazing! Though it seems as if the model
#fails to distinguish between some Virginicas and Versicolors.
#More data for Versicolor and Virginica might help.
Accuracy(test)

0.9607843137254902

## Discussion


Our Gaussian Naive Bayes Classifier showed some promise in predicting the species in the Iris dataset given all of its features.
This model is fantastic because:
   
   1. It needs only a very small dataset to give meaningful and useful results.
   2. It's also relatively easy to implement (unless you try to iterate through a dataframe at 0.6 milliseconds per loop and try to figure out why the whole thing is so slow).
   3. With only a mean and a standard deviation to estimate per feature and class, it is much less prone to overfitting, and to coming up with patterns which aren't there, than more flexible models such as certain neural networks.

There are 2 things that are wrong with this model: 

- First of all, the assumption that all of its features are independent. Independence is not to be taken lightly, but in the case of the Iris, it seemed to work.

- Second, our final analysis shows that in some cases our model fails to distinguish between Iris Versicolor and Iris Virginica. One might think this would be solved by getting more data, but there's always the possibility of a Type II ($\beta$) error.

## Conclusion

In conclusion, after my numerous readings of textbooks and Internet searching to simplify this process for you, **you** were able to make your own **Gaussian Naive Bayes Classifier** along with its estimators and probability functions. It showed an accuracy of 94% the first time and an accuracy of 98% the second time; the split into training and test sets is random, so the figure varies between runs.
I will post the results of this statistical journey below.

In [32]:
print("Accuracy: " + str(Accuracy(test)*100) + "%")
pd.DataFrame(PredvsTrue(test)).iloc[35:45]

Accuracy: 98.0392156862745%


Unnamed: 0,0,1,2
35,Iris-virginica,Iris-virginica,Y
36,Iris-virginica,Iris-versicolor,NO
37,Iris-virginica,Iris-virginica,Y
38,Iris-virginica,Iris-virginica,Y
39,Iris-virginica,Iris-virginica,Y
40,Iris-virginica,Iris-virginica,Y
41,Iris-virginica,Iris-virginica,Y
42,Iris-virginica,Iris-virginica,Y
43,Iris-virginica,Iris-virginica,Y
44,Iris-virginica,Iris-virginica,Y


Projects like this are how education should be. It was a delightful and a horrible experience at the same time: stressful, yet rewarding, and unlike many other projects I have done...
...Spoken like a true Daoist

## References 



[1] Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. The MIT Press, 2012.  Chapters 1, 3

[2] Jeremy Orloff and Jonathan Bloom. Maximum Likelihood Estimates. MIT OpenCourseware. Class 10, 18.05
https://ocw.mit.edu/courses/mathematics/18-05-introduction-to-probability-and-statistics-spring-2014/readings/MIT18_05S14_Reading10b.pdf  

[3] Stephen Marsland. Machine Learning : an algorithmic perspective. CRC Press, Boca Raton,
FL, 2015.  

[4] Douglas Montgomery. Applied statistics and probability for engineers. Wiley, Hoboken, NJ, 2018. Chapters 2, 7  

[5] The Kernel Trip. Computational complexity of machine learning algorithms. Accessed May 27th 2019.
https://www.thekerneltrip.com/machine/learning/computational-complexity-learning-algorithms/

