# Naïve Bayes
**Description:** 
Naive Bayes is based on Bayes’ theorem, with the assumption that every pair of features is independent given the class. Naive Bayes classifiers work well in many real-world situations such as document classification and spam filtering.  The model consists of two kinds of probabilities that can be calculated directly from your training data: 1) the probability of each class; and 2) the conditional probability of each x value given each class. Once calculated, the probability model can be used to make predictions for new data using Bayes’ theorem.  When your data is real-valued, it is common to assume a Gaussian distribution (bell curve) for each feature within each class, so that you can easily estimate these probabilities from a mean and a variance. (So it helps if each feature is roughly normally distributed!)
It assumes that each input variable is independent, hence ‘naive’.  This is a strong assumption and unrealistic for real data; nevertheless, the technique is very effective on a large range of complex problems.

**Output Type:** Multi-class 

**Pros:**
- Requires a smaller amount of training data than other algorithms to estimate the necessary parameters.  
- Easily calculated.  
- Performs well when the features are close to independent.  
- Extremely fast compared to more sophisticated methods.  
- Simple
- Powerful

**Cons:**
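- Can be a poor estimator when the independence assumption is badly violated; the predicted probabilities in particular may be unreliable even when the predicted class is right.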

**Example Use Case:**
- Detecting Spam
- Detecting Fraud
__________________________________________________

from *Thoughtful Machine Learning* by Matthew Kirk, O'Reilly Media, Inc., 2017

**Using Bayes’ Theorem to Find Fraudulent Orders**

"Imagine you’re running an online store and lately you’ve been overrun with fraudulent orders. You estimate that about 10% of all orders coming in are fraudulent. In other words, in 10% of orders, people are stealing from you. Now of course you want to mitigate this by reducing the fraudulent orders, but you are facing a conundrum.

Every month you receive at least 1,000 orders, and if you were to check every single one, you’d spend more money fighting fraud than the fraud was costing you in the first place. Assuming that it takes up to 60 seconds per order to determine whether it’s fraudulent or not, and a customer service representative costs around $15 per hour to hire, that totals 200 hours and $3,000 per year.

Another way of approaching this problem would be to construct a probability that an order is over 50% fraudulent. In this case, we’d expect the number of orders we’d have to look at to be much lower. But this is where things become difficult, because the only thing we can determine is the probability that it’s fraudulent, which is 10%. Given that piece of information, we’d be back at square one looking at all orders because it’s more probable that an order is not fraudulent!

Let’s say that we notice that fraudulent orders often use gift cards and multiple promotional codes. Using this knowledge, how would we determine what is fraudulent or not—namely, how would we calculate the probability of fraud given that the purchaser used a gift card?"

This is a conditional probability:  
$P(A|B) = \frac{P(A \bigcap B)}{P(B)}$  
$P( fraud | giftcard ) = \frac{P( fraud \bigcap giftcard )}{P( giftcard )}$

In words: the probability that a transaction is fraudulent, given that it used a gift card, is the probability of fraud and gift card use occurring in the same transaction divided by the probability that a transaction uses a gift card.

$A \bigcap B$ 
- Intersection of A and B
- What exists in both A **AND** B
- If A = [1,2,3] and B = [1,4,5], then `set(A) & set(B)` gives {1}, i.e. $A \bigcap B = \{1\}$

$P(A)$
- Probability of A

$A \bigcup B$
- Union of A and B
- What exists in either A **OR** B
- If A = [1,2,3] and B = [1,4,5], then `set(A) | set(B)` gives {1,2,3,4,5}, i.e. $A \bigcup B = \{1,2,3,4,5\}$

Probability of A given B: 


In [1]:
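a = set([1,2,3])
b = set([1,4,5])

# size of the sample space: assumes 5 equally likely outcomes (the values 1 to 5)
total = 5.0

p_a_and_b = len(a & b) / total  # P(A and B)
p_b = len(b) / total            # P(B)

# conditional probability: P(A | B) = P(A and B) / P(B)
p_a_given_b = p_a_and_b / p_b

print(p_a_given_b)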

0.33333333333333337


This is possible if you know the actual probability of fraud and gift card use occurring together, or $P(fraud \bigcap giftcard)$.  
In practice, we rarely know this probability, and that is where Bayes' theorem comes into play.  

**Bayes’ Theorem, aka Inverse Conditional Probability**

"In the 1700s, Reverend Thomas Bayes came up with the original research that would become Bayes’ theorem. Pierre-Simon Laplace extended Bayes’ research to produce the beautiful result we know today."

Bayes’ theorem is as follows:
$P(B|A) = \frac{P(A|B)\centerdot P(B)}{P(A)}$

From what we have just learned about conditional probability:
$P(A|B) = \frac{P(A \bigcap B)}{P(B)}$   
We can expand Bayes' theorem to the following:
$P(B|A) = \frac{\frac{P(A \bigcap B)\centerdot P(B)}{P(B)}}{P(A)} = \frac{P(A \bigcap B)}{P(A)}$

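Applied to the fraud example:

$P(Fraud | Giftcard) = \frac{P(Giftcard | Fraud) \centerdot P(Fraud)}{P(Giftcard)}$

Taking $P(Giftcard | Fraud) = 60\%$, $P(Fraud) = 10\%$, and $P(Giftcard) = 10\%$:

$P(Fraud | Giftcard) = \frac{60\% \centerdot 10\%}{10\%} = 60\%$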
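
The same arithmetic can be checked in a few lines of Python. This is only a minimal sketch: the helper name `bayes_posterior` is made up for illustration, and the three inputs are the percentages that appear in the formula above.

```python
def bayes_posterior(p_b_given_a, p_a, p_b):
    """Return P(A | B) = P(B | A) * P(A) / P(B)."""
    if p_b <= 0:
        raise ValueError("P(B) must be positive")
    return p_b_given_a * p_a / p_b


# Values taken from the fraud example
p_giftcard_given_fraud = 0.60  # P(Giftcard | Fraud)
p_fraud = 0.10                 # P(Fraud)
p_giftcard = 0.10              # P(Giftcard)

p_fraud_given_giftcard = bayes_posterior(p_giftcard_given_fraud, p_fraud, p_giftcard)
print(round(p_fraud_given_giftcard, 2))
```

Knowing that an order used a gift card moves the estimate of fraud from the 10% base rate to about 60%, which is above the 50% threshold mentioned in the quoted passage.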

**Why Naive?**   

TL;DR:  $P(A \bigcap B) = P(B | A) \centerdot P(A)$  
This is called a joint probability, i.e. the probability that all of the events happen, and we use the **chain rule** to calculate it.  
The general case of the rule is   
$P(A_{1}, A_{2}, ..., A_{n}) = P(A_{1}) \centerdot P(A_{2} | A_{1}) \centerdot P(A_{3} | A_{1}, A_{2}) \centerdot ... \centerdot P(A_{n} | A_{1}, A_{2}, ... , A_{n-1})$  
Calculation of the joint probability assumes the events are not mutually exclusive.  
You can see how the number of terms to estimate grows exponentially with the number of events, as well as the difficulty in knowing the probabilities of every possible interaction.  
For example, if we were to introduce multiple promos into our fraud example, then we’d have to know the interactions of not just fraud with promos and fraud with giftcards, but also promos with giftcards.   
The assumption we then make (thus *naive*) is that promos and giftcards are independent of each other once we know whether the order is fraudulent, i.e. there is no interaction between them.  In doing that, we only have to consider the interactions that include fraud.  
With that assumption, we end up with a formula that is much simpler:  

$P(Fraud | Giftcard, Promo) \propto P(Fraud) \centerdot P(Giftcard | Fraud) \centerdot P(Promo | Fraud)$  

The right-hand side is the numerator of Bayes' theorem under the independence assumption, so it is only proportional to the probability we want. To get an actual probability, we normalize with a constant Z, which is the sum of this quantity over all classes (here, fraud and not fraud). So now our model becomes:

$P(Fraud | Giftcard, Promo) = \frac{1}{Z} \centerdot P(Fraud) \centerdot P(Giftcard | Fraud) \centerdot P(Promo | Fraud)$  


To turn this into a classification problem, we simply determine which class (fraud or not fraud) yields the highest probability. 
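
As a rough sketch of that decision rule, the snippet below scores both classes for an order that used a gift card and multiple promos, normalizes by Z, and picks the larger posterior. The 10% prior and the 60% gift card rate among fraudulent orders come from the example above; the gift card rate among legitimate orders is derived from the 10% overall gift card rate, and both promo rates are invented purely for illustration. The function name `naive_bayes_posteriors` is also hypothetical.

```python
def naive_bayes_posteriors(priors, likelihoods, observed):
    """Return P(class | observed features) under the naive independence assumption.

    Only features that are present are multiplied in; absent features are
    ignored here to keep the sketch short.
    """
    scores = {}
    for label, prior in priors.items():
        score = prior
        for feature in observed:
            score *= likelihoods[label][feature]
        scores[label] = score
    z = sum(scores.values())  # the normalizing constant Z
    return {label: score / z for label, score in scores.items()}


priors = {"fraud": 0.10, "not_fraud": 0.90}
likelihoods = {
    # P(feature | class)
    "fraud": {"giftcard": 0.60, "promo": 0.50},          # promo rate is assumed
    "not_fraud": {"giftcard": 0.04 / 0.90, "promo": 0.30},  # promo rate is assumed
}

posteriors = naive_bayes_posteriors(priors, likelihoods, ["giftcard", "promo"])
predicted = max(posteriors, key=posteriors.get)
print(predicted, posteriors)
```

With these made-up numbers the unnormalized scores are 0.03 for fraud and 0.012 for not fraud, so fraud wins; different promo rates could flip the decision.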

**Gotchas: Zeroing-out effect**


Pseudocount:  if history shows a probability of 0 for an interaction (say we have never seen fraud with a certain promo), then that 0 makes the whole product, and so the final probability, 0. This is known as the 'zeroing-out effect'.  The fact that we have never seen fraud with that promo could simply be because the promo is new. To avoid dismissing all the other evidence that could come into play, such as the transaction also using a giftcard, we add 1 to every count before computing the probabilities. So every count ends up being transaction count + 1, which helps mitigate the zeroing-out effect.
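
A small sketch of the pseudocount idea follows. The counts are invented for illustration, and the function name `smoothed_probability` is hypothetical. It uses the common add-one (Laplace) form, which also adds the number of possible outcomes to the denominator so that the smoothed probabilities still sum to 1; that denominator adjustment is an assumption on top of the description above.

```python
def smoothed_probability(count, total, n_outcomes, alpha=1.0):
    """Add-alpha estimate of a probability from counts.

    count      -- times the outcome was seen
    total      -- number of observations for this class
    n_outcomes -- number of distinct outcomes the feature can take
    alpha      -- pseudocount added to every outcome (1 gives add-one smoothing)
    """
    return (count + alpha) / (total + alpha * n_outcomes)


# Hypothetical history: 50 fraudulent orders, none of which used the new promo
fraud_orders = 50
fraud_with_new_promo = 0

unsmoothed = fraud_with_new_promo / fraud_orders
smoothed = smoothed_probability(fraud_with_new_promo, fraud_orders, n_outcomes=2)

print(unsmoothed)  # zero: would wipe out every other factor in the product
print(smoothed)    # small but non-zero
```

The smoothed estimate stays small, so a never-seen combination still counts as weak evidence, but it no longer forces the final probability to zero.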

In [2]:
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

# ignore warnings
import warnings
warnings.filterwarnings("ignore")

In [3]:
%store -r X_train_scaled
%store -r X_test_scaled
%store -r y_train
%store -r y_test

X_train = X_train_scaled
X_test = X_test_scaled 
X_train.head()

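| Unnamed: 0 | Age | Embarked | Fare | Parch | Pclass | Sex | SibSp | Title |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.421965 | 0.666667 | 0.063436 | 0.2 | 0.5 | 0.0 | 0.125 | 0.75 |
| 1 | 0.384267 | 0.666667 | 0.051237 | 0.2 | 0.5 | 0.0 | 0.125 | 0.75 |
| 2 | 0.447097 | 0.666667 | 0.05131 | 0.0 | 0.0 | 1.0 | 0.0 | 0.5 |
| 3 | 0.359135 | 0.0 | 0.015412 | 0.0 | 1.0 | 1.0 | 0.0 | 0.5 |
| 4 | 0.22091 | 0.666667 | 0.022447 | 0.0 | 0.5 | 1.0 | 0.0 | 0.5 |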


## Train Model
#### Create the Gaussian Naive Bayes object

In [None]:
# from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()

#### Fit the model to the training data

In [None]:
gnb.fit(X_train, y_train)
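
To make the fitting step less of a black box, here is a from-scratch sketch of roughly what a Gaussian Naive Bayes fit estimates: a prior per class, plus a mean and a variance per feature within each class. It is not scikit-learn's implementation. The function names and the toy data are invented for illustration, and the fixed `1e-9` added to each variance is a simplifying assumption (scikit-learn applies its own variance smoothing).

```python
import math
from collections import defaultdict


def fit_gaussian_nb(rows, labels):
    """Estimate class priors and per-class, per-feature mean and variance."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    model = {}
    for label, members in grouped.items():
        n = len(members)
        columns = list(zip(*members))
        means = [sum(col) / n for col in columns]
        # small constant avoids division by zero for constant features (assumption)
        variances = [sum((x - m) ** 2 for x in col) / n + 1e-9
                     for col, m in zip(columns, means)]
        model[label] = (n / len(rows), means, variances)
    return model


def log_gaussian(x, mean, var):
    """Log of the normal density with the given mean and variance, at x."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)


def predict_one(model, row):
    """Pick the class with the highest log prior plus summed log likelihoods."""
    scores = {}
    for label, (prior, means, variances) in model.items():
        scores[label] = math.log(prior) + sum(
            log_gaussian(x, m, v) for x, m, v in zip(row, means, variances))
    return max(scores, key=scores.get)


# Toy data: two well-separated classes with two features each
rows = [[0.1, 0.2], [0.2, 0.1], [0.15, 0.25], [0.8, 0.9], [0.9, 0.8], [0.85, 0.75]]
labels = [0, 0, 0, 1, 1, 1]

toy_model = fit_gaussian_nb(rows, labels)
print(predict_one(toy_model, [0.12, 0.18]))
print(predict_one(toy_model, [0.88, 0.82]))
```

Working in log space avoids underflow when many small likelihoods are multiplied together; it does not change which class has the highest score.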

#### Estimate whether or not a passenger would survive, using the training data

In [4]:
y_pred = gnb.predict(X_train)

#### Estimate the probability of a passenger surviving, using the training data

In [5]:
y_pred_proba = gnb.predict_proba(X_train)

## Evaluate Model
#### Compute the accuracy

In [None]:
print('Accuracy of GNB classifier on training set: {:.2f}'
     .format(gnb.score(X_train, y_train)))

Accuracy of GNB classifier on training set: 0.80

#### Create a confusion matrix

In [6]:
print(confusion_matrix(y_train, y_pred))

[[308  71]
 [ 51 193]]
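
The four cells of this matrix are enough to recompute the headline metrics by hand. The sketch below hard-codes the matrix printed above and treats class 1 as the positive class, following scikit-learn's convention of true labels on the rows and predicted labels on the columns. The variable names are just for illustration.

```python
# Confusion matrix from the training set: rows are true labels, columns are
# predicted labels, with classes in sorted order (0, then 1).
matrix = [[308, 71],
          [51, 193]]
(tn, fp), (fn, tp) = matrix

total = tn + fp + fn + tp
rates = {
    "accuracy": (tp + tn) / total,
    "true positive rate (recall)": tp / (tp + fn),
    "false positive rate": fp / (fp + tn),
    "true negative rate": tn / (tn + fp),
    "false negative rate": fn / (fn + tp),
    "precision": tp / (tp + fp),
}
precision = rates["precision"]
recall = rates["true positive rate (recall)"]
rates["f1-score"] = 2 * precision * recall / (precision + recall)

for name, value in rates.items():
    print("{}: {:.2f}".format(name, value))
```

The accuracy, precision, recall, and f1-score computed this way should agree with the accuracy printed earlier and with the row for class 1.0 in the output of `classification_report`.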


#### Compute Precision, Recall, F1-score, and Support

In [7]:
print(classification_report(y_train, y_pred))

             precision    recall  f1-score   support

        0.0       0.86      0.81      0.83       379
        1.0       0.73      0.79      0.76       244

avg / total       0.81      0.80      0.81       623



## Test Model
#### Compute the accuracy of the model when run on the test data

In [8]:
print('Accuracy of GNB classifier on test set: {:.2f}'
     .format(gnb.score(X_test, y_test)))

Accuracy of GNB classifier on test set: 0.79


## Visualize Model

## Store Model

In [9]:
%store gnb

Stored 'gnb' (GaussianNB)
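
`%store` is an IPython magic, so it only works inside IPython or Jupyter. Outside a notebook, one possible alternative is Python's built-in `pickle` module. The sketch below trains a throwaway model on made-up toy data (not the Titanic data used above) and round-trips it through bytes; writing to a file with `pickle.dump` works the same way.

```python
import pickle

import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy data, invented for illustration only
X_toy = np.array([[0.1, 0.2], [0.2, 0.1], [0.8, 0.9], [0.9, 0.8]])
y_toy = np.array([0, 0, 1, 1])
toy_gnb = GaussianNB().fit(X_toy, y_toy)

# Serialize the fitted model to bytes, then restore it
payload = pickle.dumps(toy_gnb)
restored = pickle.loads(payload)

# The restored model should make the same predictions as the original
same_predictions = bool((restored.predict(X_toy) == toy_gnb.predict(X_toy)).all())
print(same_predictions)
```

Only unpickle data you trust, and load the model with the same scikit-learn version that saved it where possible.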


## Exercises

1. Read in your train and test dataframes
2. Walk through the steps of training the Naive Bayes classifier
3. Evaluate your results using the model score, confusion matrix, and classification report.
4. Print and clearly label the following:  Accuracy, true positive rate, false positive rate, true negative rate, false negative rate, precision, recall, f1-score, and support. 
5. Look in the scikit-learn documentation to research the *x parameter*.  What is your best option(s) for the particular problem you are trying to solve and the data to be used? 
6. Run through steps 2-4 using another *x parameter* (from question 5) 
7. Which appears to perform better?
8. Test the best model on your testing data. 
9. Store your final model as `gnb` for future use