# Gaussian Naive Bayes


When dealing with continuous data, a typical assumption is that the continuous values associated with each class are distributed according to a Gaussian distribution.

As an example, suppose the training data contain a continuous attribute, $X$. We first segment the data by class, and then compute the mean and variance of $X$ within each class. Let $\mu_{c}$ be the mean of the values of $X$ associated with class $c$, and let $\sigma^{2}_{c}$ be the variance of those values. Then the likelihood of an observed value $x$ given a class, $P(X=x|c)$, can be computed by plugging $x$ into the density of a Normal distribution parameterized by $\mu_{c}$ and $\sigma^{2}_{c}$ (strictly speaking this is a probability density rather than a probability). That is:

$$P(X=x|c)=\frac{1}{\sqrt{2\pi\sigma^2_c}}\,e^{ -\frac{(x-\mu_c)^2}{2\sigma^2_c} }$$
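
To make the formula concrete, here is a minimal sketch using only the Python standard library. The toy values, the class names `"a"` and `"b"`, and the helper `gaussian_pdf` are made up for illustration, and the sketch assumes the population variance (dividing by the number of samples) for $\sigma^{2}_{c}$.

```python
import math
from statistics import mean, pvariance


def gaussian_pdf(x, mu, var):
    """Normal density with mean mu and variance var, evaluated at x."""
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


# Hypothetical toy data: one continuous attribute, grouped by class label.
values_by_class = {
    "a": [4.9, 5.1, 5.0, 5.2],
    "b": [6.3, 6.5, 6.1, 6.7],
}

# Segment by class, then estimate the mean and variance within each class.
params = {c: (mean(v), pvariance(v)) for c, v in values_by_class.items()}

# Likelihood of a new observation under each class-conditional Gaussian.
x = 5.3
likelihoods = {c: gaussian_pdf(x, mu, var) for c, (mu, var) in params.items()}
print(likelihoods)
```

With these toy numbers the value 5.3 is far more likely under class `"a"`, whose mean is closer to it.
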
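
The key to Naive Bayes is the (rather strong) assumption that the features are independent of one another, conditional on a data point having a certain label. Under this assumption, the likelihood of a sample with several features is simply the product of per-feature densities of the form above.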
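
Under the conditional independence assumption, the posterior for a sample with features $x_1, \dots, x_n$ factorizes as

$$P(c|x_1,\dots,x_n) \propto P(c)\prod_{i=1}^{n} P(X_i=x_i|c)$$

The following is a rough from-scratch sketch of that idea, not scikit-learn's implementation. The function names (`fit_gaussian_nb`, `log_posterior`, `predict`) and the toy data are hypothetical. It works in log space so that the product becomes a sum, and it assumes every feature has non-zero variance within each class.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(rows, labels):
    """Estimate the class prior and per-feature (mean, variance) for each class."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)

    model = {}
    for label, members in grouped.items():
        n = len(members)
        stats = []
        for column in zip(*members):  # iterate over features
            mu = sum(column) / n
            var = sum((v - mu) ** 2 for v in column) / n  # population variance
            stats.append((mu, var))
        model[label] = (n / len(rows), stats)
    return model


def log_posterior(model, row):
    """Unnormalized log posterior of each class for one sample."""
    scores = {}
    for label, (prior, stats) in model.items():
        score = math.log(prior)
        for x, (mu, var) in zip(row, stats):
            # Independence assumption: per-feature log densities simply add up.
            score += -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
        scores[label] = score
    return scores


def predict(model, row):
    """Return the class with the highest log posterior."""
    scores = log_posterior(model, row)
    return max(scores, key=scores.get)


# Hypothetical two-feature toy data with two classes.
rows = [(1.0, 2.1), (1.2, 1.9), (0.8, 2.0), (3.0, 4.2), (3.2, 3.8), (2.9, 4.0)]
labels = [0, 0, 0, 1, 1, 1]
model = fit_gaussian_nb(rows, labels)
print(predict(model, (1.1, 2.0)), predict(model, (3.1, 4.1)))
```

A feature that is constant within a class would give a zero variance and break the division; scikit-learn's `GaussianNB` guards against this with its `var_smoothing` parameter, which this sketch leaves out.
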

In [2]:
import pandas as pd
from pandas import Series, DataFrame

import matplotlib.pyplot as plt
import seaborn as sns
from jupyterthemes import jtplot
jtplot.style(context='talk', fscale=1, spines=True, gridlines='--')

# Gaussian Naive Bayes
from sklearn import datasets
from sklearn import metrics
from sklearn.naive_bayes import GaussianNB

In [4]:
# Load the iris dataset
iris = datasets.load_iris()

# Grab the features (X) and the target (Y)
X = iris.data

Y = iris.target

# Show the built-in data description
print(iris.DESCR)

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :

In [6]:
from sklearn.model_selection import train_test_split

# Create a Gaussian Naive Bayes model
model = GaussianNB()

# Split the data into training and testing sets (default: 75% train, 25% test)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y)

# Fit the model on the training data
model.fit(X_train, Y_train)

# Predicted outcomes
predicted = model.predict(X_test)

# Actual expected outcomes
expected = Y_test

print(metrics.accuracy_score(expected, predicted))

0.9473684210526315


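We get about 94.7% accuracy on the held-out test set using Gaussian Naive Bayes, which corresponds to 36 of the 38 test samples being classified correctly. Because `train_test_split` shuffles the data randomly and no `random_state` is set here, the exact figure will vary from run to run.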
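
Since a single random split gives a somewhat noisy estimate, one possible way to get a more stable number is sketched below. The seed `42`, the stratified split, and the choice of 5 folds are arbitrary assumptions for illustration, not part of the original notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB

X, Y = load_iris(return_X_y=True)

# Fixing random_state (42 is an arbitrary choice) makes the split repeatable;
# stratify=Y keeps the three classes balanced across train and test.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, random_state=42, stratify=Y
)
model = GaussianNB().fit(X_train, Y_train)
split_accuracy = model.score(X_test, Y_test)

# 5-fold cross-validation averages the accuracy over several different splits.
cv_scores = cross_val_score(GaussianNB(), X, Y, cv=5)

print(split_accuracy)
print(cv_scores.mean())
```

The cross-validated mean is usually a better summary than any single split, because every sample is used for testing exactly once.
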