<center><img src='img/ms_logo.jpeg' height=40% width=40%></center>

<center><h1>Naive Bayesian Classifiers</h1></center>

In this notebook, we'll be focusing on **_Naive Bayesian Classifiers_**, or "Naive Bayes" for short.  This is an algorithm that uses **_Bayes' Theorem_** to make a classification based on probability.  In case you're unfamiliar with Bayes' Theorem, let's look at the formula:
<br>
<br>
<center><img src='img/bayes_theorem.png' height=40% width=40%></center>
<br>
<br>
Don't worry if you haven't seen this mathematical notation before. In plain English, that formula reads:

"The probability of A given B equals the probability of B given A, times the probability of A, divided by the probability of B".  

Let's run through an example case here and see if we can demystify this equation a little bit more. 

<center><h3>Scenario: Spam Detection</h3></center>

We have a dataset of emails, and we're trying to build a classifier that can predict if an email is spam or not by examining the words used in the emails.  Each email in our training set has been labeled as "spam" or "ham" (a real email, not spam).  We've counted each word used in every email, and found the following:

**_65% of the emails in the dataset are "Spam"._**

**_"Spam"_** emails contain the word _"deal"_ 80% of the time, and _"win"_ 40% of the time.  

**_35% of the emails in the dataset are "Ham"._**

**_"Ham"_** emails contain the word _"deal"_ 17% of the time, and _"win"_ 6% of the time.  

The next email we try to classify contains both of the words "deal" and "win". Given the information above, we can plug these numbers into Bayes' Theorem and predict the likelihood that this email is Spam. 

<center>P(Spam|deal, win) = (P(deal, win|Spam) * P(Spam)) / P(deal, win)</center>

This can be further broken down into: 
<br>
<br>
<center>P(Spam|deal, win) = (P(deal|Spam) \* P(win|Spam) \* P(Spam)) / (P(deal|Spam) \* P(win|Spam) \* P(Spam) + P(deal|!Spam) \* P(win|!Spam) \* P(!Spam))</center>

This step uses the "naive" assumption that the words occur independently of one another within each class, so the likelihood of seeing both words is the product of the individual word likelihoods. The denominator expands the probability of seeing both words across the two classes, "Spam" and "Ham". In the equation above, "P(deal|!Spam)" can be read as "the proportion of 'Ham' emails that contain 'deal'".  

In the next step, we'll define the probabilities for everything in that equation so we can plug them in:

1. P(deal|Spam) = .8
1. P(win|Spam) = .4
1. P(Spam) = .65
1. P(deal|!Spam) = .17
1. P(win|!Spam) = .06
1. P(!Spam) = .35

Let's replace the terms with the probabilities listed above and see how it works out:
<br>
<br>
<center>(.8 \* .4 \* .65) / (.8 \* .4 \* .65 + .17 \* .06 \* .35) = .208 / .21157 = **0.983126** </center>
<br>
<br>
Based on the math from Bayes' Theorem, we can predict that the probability that a new email containing both "deal" and "win" is "Spam" is approximately **98.3%**!
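As a quick check on the arithmetic, here is a minimal Python sketch that applies Bayes' Theorem to the scenario's numbers under the naive assumption that words are independent within each class. The function name `spam_posterior` and the dictionary layout are illustrative choices, not part of any library.

```python
# Probabilities taken from the spam detection scenario.
p_spam, p_ham = 0.65, 0.35
p_word_given_spam = {"deal": 0.80, "win": 0.40}
p_word_given_ham = {"deal": 0.17, "win": 0.06}


def spam_posterior(words):
    """Return P(Spam | words), assuming words are independent within each class."""
    spam_score = p_spam
    ham_score = p_ham
    for word in words:
        spam_score *= p_word_given_spam[word]
        ham_score *= p_word_given_ham[word]
    # The denominator P(words) is the sum of the two class scores.
    return spam_score / (spam_score + ham_score)


posterior = spam_posterior(["deal", "win"])
print(round(posterior, 4))  # 0.9831
```

Because the denominator is just the sum of the per-class scores, the same function extends to any number of words without changing its structure.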

<center><h3>Using Naive Bayes in the Real World</h3></center>

In the above example, we did the math by hand.  That isn't very practical in the real world.  Luckily, `sklearn` contains some awesome implementations of Naive Bayesian Classifiers.  

For this assignment, we're going to use a `GaussianNB()` object.  There are a few different kinds of Naive Bayesian Classifiers, but for this one we'll stick to one that assumes the values of each feature follow a Gaussian (normal) distribution within each class.  
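To make the Gaussian assumption concrete, here is a rough NumPy sketch of the idea: estimate a prior, a mean, and a variance per feature for each class, then score new rows with the log of the normal density. The helper names `fit_gaussian_nb` and `predict_gaussian_nb` and the toy data are made up for illustration. This is a simplification and not how `sklearn` is implemented internally (for example, it omits the variance smoothing that `GaussianNB` applies).

```python
import numpy as np


def fit_gaussian_nb(X, y):
    """Estimate a prior, per-feature mean, and per-feature variance for each class."""
    params = {}
    for label in np.unique(y):
        rows = X[y == label]
        params[label] = (len(rows) / len(X), rows.mean(axis=0), rows.var(axis=0))
    return params


def predict_gaussian_nb(params, X):
    """Pick the class with the highest log prior plus summed log Gaussian likelihoods."""
    labels = list(params)
    scores = []
    for label in labels:
        prior, mean, var = params[label]
        # Log of the normal density, evaluated per feature and summed across features.
        log_likelihood = -0.5 * (np.log(2 * np.pi * var) + (X - mean) ** 2 / var)
        scores.append(np.log(prior) + log_likelihood.sum(axis=1))
    return np.array(labels)[np.argmax(scores, axis=0)]


# Toy data (made up): two features, two well-separated classes.
X_toy = np.array([[1.0, 2.1], [1.2, 1.9], [0.8, 2.0],
                  [3.0, 4.1], [3.2, 3.9], [2.9, 4.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_params = fit_gaussian_nb(X_toy, y_toy)
toy_preds = predict_gaussian_nb(toy_params, np.array([[1.1, 2.0], [3.1, 4.0]]))
print(toy_preds)  # [0 1]
```

Summing log likelihoods across features is the code-level form of the independence assumption: each feature contributes its own term and no interaction between features is modeled.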

Let's Get Started!

In [1]:
import numpy as np
np.random.seed(0)
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score


# Use load_iris() to load the data into the iris variable, and then assign iris.data and iris.target to the appropriate
# variables
iris = load_iris()
data = iris.data
labels = iris.target

# Use train_test_split to split the data into X_train, X_test, y_train, and y_test variables
X_train, X_test, y_train, y_test = train_test_split(data, labels)

# Create a GaussianNB() object and fit it using the training data
clf = GaussianNB()
clf.fit(X_train, y_train)

# Use the fitted model to create predictions for the X_test data.
preds = clf.predict(X_test)


# Run it all and see how you did!
print(preds)
print(y_test)
print("accuracy score: {}".format(accuracy_score(y_test, preds)))
print("f1 score: {}".format(f1_score(y_test, preds, average='weighted')))

[2 1 0 2 0 2 0 1 1 1 2 1 1 1 1 0 1 1 0 0 2 1 0 0 2 0 0 1 1 0 2 1 0 2 2 1 0
 1]
[2 1 0 2 0 2 0 1 1 1 2 1 1 1 1 0 1 1 0 0 2 1 0 0 2 0 0 1 1 0 2 1 0 2 2 1 0
 1]
accuracy score: 1.0
f1 score: 1.0


<center><h3>Some Caveats</h3></center>

You may have wondered why this particular model is called a **_Naive_** Bayesian Classifier.  In this scenario, the word "Naive" simply means that the model makes the "naive" assumption that all features are independent of one another within each class.  This leads us to the main caveat of this model--if you have feature columns that are highly correlated, the model effectively counts the same evidence more than once and may not work as well as we'd like.  **_If you're going to use Naive Bayes, make sure you check for highly correlated features beforehand!_**
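One possible way to run that check is to scan the pairwise correlation matrix for large absolute values. The sketch below uses synthetic data, and both the helper name `highly_correlated_pairs` and the 0.8 cutoff are arbitrary assumptions chosen for illustration. An appropriate threshold depends on the dataset.

```python
import numpy as np
import pandas as pd


def highly_correlated_pairs(df, threshold=0.8):
    """Return (feature_a, feature_b, correlation) for pairs with |r| above threshold."""
    corr = df.corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) > threshold:
                pairs.append((cols[i], cols[j], round(float(r), 3)))
    return pairs


# Synthetic data for illustration: "b" is nearly a copy of "a", "c" is unrelated noise.
rng = np.random.RandomState(0)
a = rng.normal(size=200)
demo_df = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

flagged = highly_correlated_pairs(demo_df)
print(flagged)
```

Only the upper triangle of the matrix is scanned, so each pair is reported once and a feature is never compared with itself.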


<center><h3>Where to Go From Here</h3></center>

For the latter part of this assignment, you're going to use the famous Pima Indians Diabetes Dataset to build a Naive Bayesian Classifier that predicts whether or not an individual has diabetes.  You'll find the `pima_indians_diabetes.csv` file inside the `datasets` folder.  

To build this classifier successfully, you'll want to follow the best practices for loading in and preprocessing a data set that you've learned in class:

1. Importing the data
1. Exploring the data
1. "Cleaning" the data
1. Splitting the data into training and testing sets (or using K-Fold cross-validation--more on this below)
1. Fitting the model
1. Validating the model (checking predictions on the test set)

Be sure to consider the following questions as you solve this problem:

* How will you deal with null values?
* For this model, does scaling the data improve your results? (HINT: test your assumption!)

On top of cleaning and preprocessing this data set, you'll also use **_Cross Validation_** to get a better measure of the accuracy of your model.  We did not use K Fold Cross Validation in the above model on purpose--instead, you'll need to work your way through `sklearn`'s [model_selection documentation](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection) to figure out how to effectively make use of cross-validation.  

(**_Hint:_** There are several ways to implement cross validation using sklearn.  In the `model_selection` section of the documentation, pay special attention to the `KFold` object, as well as the methods available under the _Model Selection_ subsection.)


Good luck!

In [2]:
# path to file: "datasets/pima_indians_diabetes.csv". The first row of the .csv contains the column names.
# Note that in the "Outcome" column, 0 denotes someone that does NOT have diabetes, and 1 denotes someone that does.  
import pandas as pd

diabetes_df = pd.read_csv('datasets/pima_indians_diabetes.csv')

# Attempt to drop rows with null values, but there are no null values in this dataset
diabetes_df = diabetes_df.dropna()

diabetes_df.describe()
# Glucose, BloodPressure, SkinThickness, Insulin, and BMI cannot be 0, so 0s in those columns are likely missing measurements

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.88416 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.24375 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.3725 | 29.0 | 0.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.62625 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |


In [3]:
print('Glucose 0s: {}'.format(diabetes_df[diabetes_df.Glucose == 0].shape[0]))
print('BloodPressure 0s: {}'.format(diabetes_df[diabetes_df.BloodPressure == 0].shape[0]))
print('SkinThickness 0s: {}'.format(diabetes_df[diabetes_df.SkinThickness == 0].shape[0]))
print('Insulin 0s: {}'.format(diabetes_df[diabetes_df.Insulin == 0].shape[0]))
print('BMI 0s: {}'.format(diabetes_df[diabetes_df.BMI == 0].shape[0]))

Glucose 0s: 5
BloodPressure 0s: 35
SkinThickness 0s: 227
Insulin 0s: 374
BMI 0s: 11
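The per-column counts can also be computed in one vectorized expression. The sketch below uses a tiny made-up frame named `sample_df` as a stand-in for `diabetes_df`. Only the column names are borrowed from the dataset and the values are invented.

```python
import pandas as pd

# Tiny made-up frame standing in for diabetes_df; only the column names match the dataset.
sample_df = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "BloodPressure": [72, 66, 0, 0],
    "BMI": [33.6, 26.6, 23.3, 0.0],
})

# Comparing to 0 gives a boolean frame; summing counts the True values per column.
zero_counts = (sample_df[["Glucose", "BloodPressure", "BMI"]] == 0).sum()
print(zero_counts)
```

The result is a Series indexed by column name, which makes it easy to sort or to filter for columns whose zero count exceeds some cutoff.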


In [4]:
# Because only a few rows have 0s for Glucose, BloodPressure, or BMI, drop those rows.
clean_df = diabetes_df[diabetes_df.Glucose != 0]
clean_df = clean_df[clean_df.BloodPressure != 0]
clean_df = clean_df[clean_df.BMI != 0]

# Because SkinThickness and Insulin contain many 0s, drop those columns entirely.
# Reassigning (rather than using inplace=True) avoids modifying a filtered view of diabetes_df.
clean_df = clean_df.drop(columns=['SkinThickness', 'Insulin'])

In [5]:
# Store outcomes in labels
diabetes_labels = clean_df['Outcome']

# Drop the Outcome column from the features
clean_df = clean_df.drop(columns='Outcome')

In [6]:
# Use train_test_split to split the data into X_train, X_test, y_train, and y_test variables
X_train, X_test, y_train, y_test = train_test_split(clean_df, diabetes_labels)

# Create a GaussianNB() object and fit it using the training data
diabetes_clf = GaussianNB()
diabetes_clf.fit(X_train, y_train)

# Use the fitted model to create predictions for the X_test data.
diabetes_preds = diabetes_clf.predict(X_test)

# See scores for model
print("accuracy score: {}".format(accuracy_score(y_test, diabetes_preds)))
print("f1 score: {}".format(f1_score(y_test, diabetes_preds, average='weighted')))

accuracy score: 0.7679558011049724
f1 score: 0.7699054837678644


In [7]:
# Try scaling the diabetes dataset, then train and test the model on the scaled data
from sklearn.preprocessing import StandardScaler

# Create a StandardScaler() object.
scaler = StandardScaler()

# Fit the scaler on the feature columns in clean_df.
# Note: fitting on every row lets test-set statistics influence the scaling;
# fitting on the training rows only is the stricter approach.
scaler.fit(clean_df)

# Store the rescaled features (a NumPy array) in scaled_X_vals.
scaled_X_vals = scaler.transform(clean_df)

In [8]:
# Use train_test_split to split the data into X_train, X_test, y_train, and y_test variables
X_train, X_test, y_train, y_test = train_test_split(scaled_X_vals, diabetes_labels)

# Create a GaussianNB() object and fit it using the training data
scale_clf = GaussianNB()
scale_clf.fit(X_train, y_train)

# Use the fitted model to create predictions for the X_test data.
scale_preds = scale_clf.predict(X_test)

# See scores for model
print("accuracy score: {}".format(accuracy_score(y_test, scale_preds)))
print("f1 score: {}".format(f1_score(y_test, scale_preds, average='weighted')))

accuracy score: 0.7513812154696132
f1 score: 0.7546267738128319
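The scaled and unscaled scores above come from two different random splits, so the gap between them cannot be attributed to scaling alone. A fairer comparison reuses one split and fits the scaler on the training rows only. The sketch below is one way to do that, with a hypothetical helper named `compare_scaling`. It runs on the Iris data purely so it works without the CSV file, and the fixed `seed` and the `stratify` option are assumptions rather than part of the original cells.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def compare_scaling(X, y, seed=0):
    """Score GaussianNB with and without scaling on one shared train/test split."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=seed, stratify=y)

    raw_model = GaussianNB().fit(X_tr, y_tr)
    # The pipeline fits the scaler on the training rows only, then applies it to the test rows.
    scaled_model = make_pipeline(StandardScaler(), GaussianNB()).fit(X_tr, y_tr)

    raw_acc = accuracy_score(y_te, raw_model.predict(X_te))
    scaled_acc = accuracy_score(y_te, scaled_model.predict(X_te))
    return raw_acc, scaled_acc


# Iris is used here only so the sketch runs without the CSV file.
demo_X, demo_y = load_iris(return_X_y=True)
raw_acc, scaled_acc = compare_scaling(demo_X, demo_y)
print(raw_acc, scaled_acc)
```

Since `GaussianNB` estimates a separate mean and variance for every feature, standardizing the inputs would be expected to change its predictions very little. Running both models on the same split is a simple way to test that expectation on the diabetes data.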


In [9]:
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

# Create a KFold object and StratifiedKFold object with 10 splits / folds
k_fold = KFold(n_splits=10)
strat_k_fold = StratifiedKFold(n_splits=10)

# Create a new GaussianNB() object for k_fold
gaussian_clf = GaussianNB()

# Train and get score of gaussian_clf for all 10 splits of k_fold
for k, (train, test) in enumerate(k_fold.split(clean_df, diabetes_labels)):
    gaussian_clf.fit(clean_df.iloc[train], diabetes_labels.iloc[train])
    print("[fold {0}] score: {1:.5f}".format(k, gaussian_clf.score(clean_df.iloc[test], diabetes_labels.iloc[test])))

# KFold.split() yields positional row numbers. After dropping rows, the index labels of clean_df and
# diabetes_labels are no longer contiguous, so select rows by position with .iloc rather than by label.

# Train and get score of gaussian_clf for all 10 splits of strat_k_fold
print('\n')
for k, (train, test) in enumerate(strat_k_fold.split(clean_df, diabetes_labels)):
    gaussian_clf.fit(clean_df.iloc[train], diabetes_labels.iloc[train])
    print("[fold {0}] score: {1:.5f}".format(k, gaussian_clf.score(clean_df.iloc[test], diabetes_labels.iloc[test])))

# KFold and StratifiedKFold don't appear that different for this diabetes dataset

[fold 0] score: 0.68493
[fold 1] score: 0.80822
[fold 2] score: 0.71233
[fold 3] score: 0.71233
[fold 4] score: 0.73611
[fold 5] score: 0.79167
[fold 6] score: 0.77778
[fold 7] score: 0.86111
[fold 8] score: 0.72222
[fold 9] score: 0.81944


[fold 0] score: 0.73973
[fold 1] score: 0.78082
[fold 2] score: 0.76712
[fold 3] score: 0.72603
[fold 4] score: 0.72603
[fold 5] score: 0.76389
[fold 6] score: 0.77778
[fold 7] score: 0.83333
[fold 8] score: 0.72222
[fold 9] score: 0.83099


In [10]:
# Try using cross_val_score() for scores instead of iterating over k_fold
print(cross_val_score(gaussian_clf, clean_df, diabetes_labels, cv=k_fold)) 

[ 0.68493151  0.80821918  0.71232877  0.71232877  0.73611111  0.79166667
  0.77777778  0.86111111  0.72222222  0.81944444]
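Per-fold scores are easier to compare once they are summarized. The sketch below reports the mean and standard deviation from `cross_val_score`, using a shuffled `StratifiedKFold`. It runs on the Iris data so it is self-contained, and the shuffling and `random_state=0` are assumptions added for reproducibility, not settings taken from the cells above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Iris stands in for the diabetes data so the sketch runs on its own.
demo_X, demo_y = load_iris(return_X_y=True)

# Shuffling before splitting guards against any ordering in the rows.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
cv_scores = cross_val_score(GaussianNB(), demo_X, demo_y, cv=cv)

print("mean accuracy: {:.3f} (std {:.3f})".format(cv_scores.mean(), cv_scores.std()))
```

Reporting the spread alongside the mean gives a sense of how much a single train/test split score could vary by chance.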
