<center><img src='img/ms_logo.jpeg' height=40% width=40%></center>

<center><h1>Naive Bayesian Classifiers</h1></center>

In this notebook, we'll be focusing on **_Naive Bayesian Classifiers_**, or "Naive Bayes" for short.  This is an algorithm that uses **_Bayes' Theorem_** to make a classification based on probability.  In case you're unfamiliar with Bayes' Theorem, let's look at the formula:
<br>
<br>
<center><img src='img/bayes_theorem.png' height=40% width=40%></center>
<br>
<br>
Don't worry if you've seen this mathematical notation before. In plain English, that formula reads:

"The probability of A given B equals the probability of B given A, times the probability of A, divided by the probability of B".  

Let's run through an example case here and see if we can demystify this equation a little bit more. 

<center><h3>Scenario: Spam Detection</h3></center>

We have a dataset of emails, and we're trying to build a classifier that can predict if an email is spam or not by examining the words based in the emails.  Each email in our training set has been labeled as "spam" or "ham" (a real email, not spam).  We've counted each word used in every email, and found the following:

**_65% of the emails in the dataset are "Spam"._**

**_"Spam"_** emails contain the word _"deal"_ 80% of the time, and _"win"_ 40% of the time.  

**_35% of the emails in the datasert are "Ham"._**

**_"Ham"_** emails contain the word _"deal"_ 17% of the time, and _"win"_ 6% of the time.  

The next email we try to predict contains the both words "deal" and "win". Given the information above, we can plug these numbers into Bayes' Theorem and predict the likelihood that this is email Spam. 

<center>P(Spam|deal, win) = (P(win, deal|Spam) * P(Spam)) / P(deal, win)</center>

This can be further broken down into: 
<br>
<br>
<center>P(Spam|deal) \* P(Spam|win) = P(deal|Spam) \* P(Spam) \*  P(win|Spam) \* P(Spam) / P(deal|Spam) + P(deal|!Spam) \* P(win|Spam) + P(win|!Spam)</center>

In the equation above, "P(deal|!Spam)" can be read as "the percentage that 'deal' occurs in 'Ham' emails".  

On the next step, we'll start defining the probabilities for everything in that equation so we can plug them in:

1. P(deal|Spam) = .8
1. P(win|Spam) = .4
1. P(Spam) = .65
1. P(deal|!Spam) = .17
1. P(win|!Spam) = .06
1. P(!Spam) = .35

Let's replace some of these terms with the probabilities listed above and see how it works out:
<br>
<br>
<center>(.8 \* .65 \* .4 \* .65) / .8 \* .65 + .35 \* .17 \* .4 \* .65 + .35 \* .6 = **0.922595** </center>
<br>
<br>
Based on the math from Bayes' Theorem, we can predict probability that a new email containing both "deal" and "win" is "Spam" is approximately **92.2%**!

<center><h3>Using Naive Bayes in the Real World</h3></center>

In the above example, we did the math by hand.  That isn't very practical in the real world.  Luckily, `sklearn` contains some awesome implementations of Naive Bayesian Classifiers (and regressors!).  

For this assignment, we're going to use a `GaussianNB()` object.  There are a few different kinds of Naive Bayesian Classifiers, but for this one we'll stick to one that assumes our data follows a Gaussian (normal) distribution.  

Let's Get Started!

In [110]:
import numpy as np
np.random.seed(0)
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, f1_score


# Use load_iris() to load the data into the iris variable, and then assign iris.data and iris.target to the appropriate
# variables
iris = load_iris()
data = iris.data
labels = iris.target

# Use train_test_split to split the data into X_train, X_test, y_train, and y_test variables
X_train, X_test, y_train, y_test = train_test_split(data, labels)

# Create a GaussianNB() object and fit it using the training data
clf = GaussianNB()
clf.fit(X_train, y_train)

# Use the fitted model to create predictions for the X_test data.
preds = clf.predict(X_test)


# Run it all and see how you did!
print(preds)
print(y_test)
print("accuracy score: {}".format(accuracy_score(y_test, preds)))
print("f1 score: {}".format(f1_score(y_test, preds, average="weighted")))

[2 1 0 2 0 2 0 1 1 1 2 1 1 1 1 0 1 1 0 0 2 1 0 0 2 0 0 1 1 0 2 1 0 2 2 1 0
 1]
[2 1 0 2 0 2 0 1 1 1 2 1 1 1 1 0 1 1 0 0 2 1 0 0 2 0 0 1 1 0 2 1 0 2 2 1 0
 1]
accuracy score: 1.0
f1 score: 1.0


<center><h3>Some Caveats</h3></center>

You may have wondered why this particular model is a called a **_Naive_** Bayesian Classifier.  In this scenario, the word "Naive" simply means that the model makes the "naive" assumption that all features are independent of one another.  This leads us to the main caveat of this model--if you have feature columns that are highly correlated, this model may not work as well as we'd like.  **_If you're going to use Naive Bayes, make sure you check for highly correlated features beforehand!_**


<center><h3>Where to Go From Here</h3></center>

For the latter part of this assignment, you're going to use the famous Pima Indians Diabetes Dataset to build a Naive Bayesian Classifier that predicts whether or not an individual has diabetes.  You'll find the `pima_indians_diabetes.csv` file inside the `datasets` folder.  

To build this classifier successfully, you'll want to follow the best practices for loading in and preprocessing a data set that you've learned in class:

1. Importing the data
1. Exploring the data
1. "Cleaning" the data
1. Splitting the data into training and testing sets (or using KFold Cross val--more on this below)
1. Fitting the model
1. Validating the model (checking predictions on the test set)

Be sure to consider the following questions as you solve this problem:

* How will you deal with null values?
* For this model, does scaling the data improve your results? (HINT: test your assumption!)

On top of cleaning and preprocessing this data set, you'll also use **_Cross Validation_** to get a better measure of the accuracy of your model.  We did not use K Fold Cross Validation in the above model on purpose--instead, you'll need to work your way through `sklearn`'s [model_selection documentation](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection) to figure out how to effectively make use of cross-validation.  

(**_Hint:_** There are several ways to implement cross validation using sklearn.  In the `model_selection` section of the documentation, pay special attention to the `KFold` object, as well as the methods available under the _Model Selection_ subsection.)


Good luck!

In [119]:
# path to file: "datasets/pima_indians_diabetes.csv". The first row of the .csv contains the column names.
# Note that in the "Outcome" column, 0 denotes someone that does NOT have diabetes, and 1 denotes someone that does.  
import pandas as pd
from sklearn.model_selection import KFold
df = pd.read_csv("datasets/pima_indians_diabetes.csv")

outcomes = df["Outcome"]
data = df.drop("Outcome", axis=1)
# Get rid of those 0 values
for colName in data.columns.values:
    data[colName] = data[data[colName] > 0][colName]
data = data.dropna()
data

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
3,1.0,89.0,66.0,23.0,94.0,28.1,0.167,21
6,3.0,78.0,50.0,32.0,88.0,31.0,0.248,26
8,2.0,197.0,70.0,45.0,543.0,30.5,0.158,53
13,1.0,189.0,60.0,23.0,846.0,30.1,0.398,59
14,5.0,166.0,72.0,19.0,175.0,25.8,0.587,51
18,1.0,103.0,30.0,38.0,83.0,43.3,0.183,33
19,1.0,115.0,70.0,30.0,96.0,34.6,0.529,32
20,3.0,126.0,88.0,41.0,235.0,39.3,0.704,27
24,11.0,143.0,94.0,33.0,146.0,36.6,0.254,51
25,10.0,125.0,70.0,26.0,115.0,31.1,0.205,41


In [112]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(data)
scaled_data = scaler.transform(data)

## Non-Scaled Data

In [113]:
kf = KFold(n_splits=5)

curSplit = 1
for train_index, test_index in kf.split(data):
    X_train, X_test = data.iloc[train_index], data.iloc[test_index]
    y_train, y_test = outcomes[train_index], outcomes[test_index]
    clf = GaussianNB()
    clf.fit(X_train, y_train)
    preds = clf.predict(X_test)
    acc = accuracy_score(y_test, preds)
    f1 = f1_score(y_test, preds, average="weighted")
    print("Split: {}".format(curSplit))
    print("accuracy score: {}".format(acc))
    print("f1 score: {}\n".format(f1))
    curSplit += 1
    

Split: 1
accuracy score: 0.4264705882352941
f1 score: 0.42834119949349597

Split: 2
accuracy score: 0.4626865671641791
f1 score: 0.4757356076759063

Split: 3
accuracy score: 0.582089552238806
f1 score: 0.5772350369365296

Split: 4
accuracy score: 0.5223880597014925
f1 score: 0.5146552949538024

Split: 5
accuracy score: 0.5970149253731343
f1 score: 0.5792753035993912



## Scaled Data

In [118]:
kf = KFold(n_splits=5)

curSplit = 1
for train_index, test_index in kf.split(scaled_data):
    X_train, X_test = scaled_data[train_index], scaled_data[test_index]
    y_train, y_test = outcomes[train_index], outcomes[test_index]
    clf = GaussianNB()
    clf.fit(X_train, y_train)
    preds = clf.predict(X_test)
    acc = accuracy_score(y_test, preds)
    f1 = f1_score(y_test, preds, average="weighted")
    print("Split: {}".format(curSplit))
    print("accuracy score: {}".format(acc))
    print("f1 score: {}\n".format(f1))
    curSplit += 1

Split: 1
accuracy score: 0.4264705882352941
f1 score: 0.42834119949349597

Split: 2
accuracy score: 0.4626865671641791
f1 score: 0.4757356076759063

Split: 3
accuracy score: 0.582089552238806
f1 score: 0.5772350369365296

Split: 4
accuracy score: 0.5223880597014925
f1 score: 0.5146552949538024

Split: 5
accuracy score: 0.5970149253731343
f1 score: 0.5792753035993912

