# Bayes or BAEs?

Rev. Thomas Bayes           |  Salt Bae 
:-------------------------:|:-------------------------:
![](https://upload.wikimedia.org/wikipedia/commons/d/d4/Thomas_Bayes.gif) | ![](https://media.giphy.com/media/l4Jz3a8jO92crUlWM/giphy.gif)

## Where your bae at? Use Bayes' theorem.
### Seriously though. The army uses it to find [nuclear bombs](https://translatingnerd.com/2018/02/08/searching-for-lost-nuclear-bombs-bayes-rule-in-action/nuclear)

Learning Objectives
1. Learn how to calculate **conditional probabilities** using Bayes' theorem.
2. Frame a **classification problem** as a series of conditional probabilities.
3. Make a **prediction** by choosing the species with the highest conditional probability.
4. Illustrate the limitations of Bayes' conditional probabilities and introduce the assumption of **independence** via Naive Bayes.
5. *Optional:* Find bae

## Calculating a probability

We learn new information every day, and each new piece of evidence updates the beliefs we already hold from past experience. These **prior beliefs** are what we use to assign probabilities to events occurring in the future.

#### Bayes' Rule
$$P(A \ | \ B) = \frac {P(B \ | \ A) \times P(A)} {P(B)}$$

##### Components
- $P(A \ | \ B)$ **Posterior**: How probable is our hypothesis given the observed evidence? (*Note: Not directly computable*)
- $P(B \ | \ A)$ **Likelihood**: How probable is the evidence given that our hypothesis is true?
- $P(A)$ **Prior**: How probable is the hypothesis *before* observing evidence?
- $P(B)$ **Marginal**: How probable is the new evidence under all possible hypotheses? 
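
The four components can be tied together in a few lines of Python. This is only a minimal sketch for a yes/no hypothesis: the function name `posterior` and all three input numbers are hypothetical, chosen purely for illustration, and the marginal is computed with the law of total probability.

```python
def posterior(prior, likelihood, likelihood_if_false):
    """Return P(A | B) for a yes/no hypothesis A and evidence B."""
    # Marginal P(B): evidence under "A is true" plus evidence under "A is false"
    marginal = likelihood * prior + likelihood_if_false * (1.0 - prior)
    return likelihood * prior / marginal


# Hypothetical numbers, not taken from any dataset in this notebook
p_a = 0.01               # prior P(A)
p_b_given_a = 0.9        # likelihood P(B | A)
p_b_given_not_a = 0.05   # P(B | not A), needed for the marginal

p_a_given_b = posterior(p_a, p_b_given_a, p_b_given_not_a)
print(round(p_a_given_b, 3))
```

Even with a strong likelihood, the small prior keeps the posterior modest here, which is the kind of update Bayes' rule formalizes.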

----------------------------------------------

## Applying Bayes' Theorem to Classification

![Iris Flower](https://images.unsplash.com/photo-1549719073-96ba59673da9?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=2748&q=80)

Can **Bayes' theorem** help us to solve a **classification problem**, namely predicting the species of an iris?

### Preparing the data

We'll read the iris data into a DataFrame, and **round up** all of the measurements to the next integer. 

*Note:* This step is only meant to make the subsequent calculations easier - do not do this in other applications.

In [1]:
import pandas as pd
import numpy as np

In [2]:
# read the iris data into a DataFrame
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
col_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
iris = pd.read_csv(url, header=None, names=col_names)

In [3]:
#Inspect the Data


In [4]:
# apply the ceiling function to the numeric columns
iris.loc[:, 'sepal_length':'petal_width'] = \
    iris.loc[:, 'sepal_length':'petal_width'].apply(np.ceil)

In [5]:
iris.head()

| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 6.0 | 4.0 | 2.0 | 1.0 | Iris-setosa |
| 1 | 5.0 | 3.0 | 2.0 | 1.0 | Iris-setosa |
| 2 | 5.0 | 4.0 | 2.0 | 1.0 | Iris-setosa |
| 3 | 5.0 | 4.0 | 2.0 | 1.0 | Iris-setosa |
| 4 | 5.0 | 4.0 | 2.0 | 1.0 | Iris-setosa |


### Deciding how to make a prediction

Let's say that I have an **out-of-sample iris** with the following measurements: **7, 3, 5, 2**. How might I predict the species?

In [6]:
# show all the observations with features 7, 3, 5, 2


In [7]:
# count all the species for these observations


In [8]:
# count the species for all the observations
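
One possible way to approach the three cells above, sketched on a tiny made-up DataFrame so that it runs on its own. The name `iris_demo` and its five rows are hypothetical stand-ins; on the real data you would apply the same filter to the rounded `iris` DataFrame instead.

```python
import pandas as pd

# Hypothetical miniature stand-in for the rounded iris DataFrame
iris_demo = pd.DataFrame({
    'sepal_length': [7, 7, 7, 5, 6],
    'sepal_width':  [3, 3, 3, 4, 3],
    'petal_length': [5, 5, 5, 2, 5],
    'petal_width':  [2, 2, 2, 1, 2],
    'species': ['Iris-versicolor', 'Iris-versicolor', 'Iris-virginica',
                'Iris-setosa', 'Iris-virginica'],
})

# show all the observations with features 7, 3, 5, 2
mask = ((iris_demo.sepal_length == 7) & (iris_demo.sepal_width == 3) &
        (iris_demo.petal_length == 5) & (iris_demo.petal_width == 2))
matches = iris_demo[mask]
print(matches)

# count all the species for these observations
match_counts = matches.species.value_counts()
print(match_counts)

# count the species for all the observations
class_counts = iris_demo.species.value_counts()
print(class_counts)
```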


Let's frame this as a **conditional probability problem**: What is the probability of some particular species, given the measurements 7, 3, 5, and 2?

$$P(species \ | \ 7352)$$

We could calculate the conditional probability for **each of the three species**, and then predict the species with the **highest probability**:

$$P(setosa \ | \ 7352)$$
$$P(versicolor \ | \ 7352)$$
$$P(virginica \ | \ 7352)$$

### Calculating the probability of each species

**Bayes' theorem** gives us a way to calculate these conditional probabilities.

Let's start with **versicolor**:

$$P(versicolor \ | \ 7352) = \frac {P(7352 \ | \ versicolor) \times P(versicolor)} {P(7352)}$$

We can calculate each of the terms on the right side of the equation:

$$P(7352 \ | \ versicolor) = \frac {13} {50} = 0.26$$

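$$P(versicolor) = \frac {50} {150} \approx 0.33$$

$$P(7352) = \frac {17} {150} \approx 0.11$$

Therefore, Bayes' theorem says the **probability of versicolor given these measurements** is:

$$P(versicolor \ | \ 7352) = \frac {\frac {13} {50} \times \frac {50} {150}} {\frac {17} {150}} = \frac {13} {17} \approx 0.76$$

We use the unrounded fractions here, because plugging in the rounded decimals ($0.26 \times 0.33 / 0.11$) would give 0.78 instead.

Let's repeat this process for **virginica** and **setosa**:

$$P(virginica \ | \ 7352) = \frac {\frac {4} {50} \times \frac {50} {150}} {\frac {17} {150}} = \frac {4} {17} \approx 0.24$$

$$P(setosa \ | \ 7352) = \frac {0 \times \frac {50} {150}} {\frac {17} {150}} = 0$$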

We predict that the iris is a versicolor, since that species had the **highest conditional probability**.
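
The same arithmetic can be checked with exact fractions. This is a minimal sketch that hard-codes the counts from the worked example rather than reading the data; the match counts of 4 and 0 for virginica and setosa are inferred from the likelihoods 0.08 and 0 used above, and the variable names are illustrative.

```python
from fractions import Fraction

# Observations matching 7, 3, 5, 2 per species (from the worked example;
# 4 and 0 are inferred from the likelihoods 0.08 and 0)
match_counts = {'versicolor': 13, 'virginica': 4, 'setosa': 0}
per_species = 50   # observations of each species
total = 150        # observations overall

# Marginal P(7352) = 17/150
marginal = Fraction(sum(match_counts.values()), total)

posteriors = {}
for species, n in match_counts.items():
    likelihood = Fraction(n, per_species)   # P(7352 | species)
    prior = Fraction(per_species, total)    # P(species)
    posteriors[species] = likelihood * prior / marginal

# Predict the species with the highest conditional probability
prediction = max(posteriors, key=posteriors.get)

for species, p in posteriors.items():
    print(species, p, round(float(p), 2))
print('prediction:', prediction)
```

Working with `Fraction` avoids the rounding error that creeps in when 0.26, 0.33 and 0.11 are multiplied and divided directly.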

### Section Summary

1. We framed a **classification problem** as three conditional probability problems.
2. We used **Bayes' theorem** to calculate those conditional probabilities.
3. We made a **prediction** by choosing the species with the highest conditional probability.

------------------------------------------

## Naive Bayes

![Woman Texting](https://images.unsplash.com/photo-1525771576046-15ba04b2693b?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1650&q=80)

**Business Case**: Your managers at Smartphone Inc. have asked you to develop a system to bucket text messages into two categories: spam and not spam (ham). The system will be implemented on your company's products to help users identify suspicious texts.

Let's begin breaking down this problem by looking at only one sample text message. We will use Bayes' theorem with a "naive" assumption to make the calculation.

$$P(spam \ | \ \text{send money now}) = \frac {P(\text{send money now} \ | \ spam) \times P(spam)} {P(\text{send money now})}$$

By assuming that the features (the words) are **conditionally independent**, we can simplify the likelihood function:

$$P(spam \ | \ \text{send money now}) \approx \frac {P(\text{send} \ | \ spam) \times P(\text{money} \ | \ spam) \times P(\text{now} \ | \ spam) \times P(spam)} {P(\text{send money now})}$$

We can calculate all of the values in the numerator by examining a corpus of **spam email**:

$$P(spam \ | \ \text{send money now}) \approx \frac {0.2 \times 0.1 \times 0.1 \times 0.9} {P(\text{send money now})} = \frac {0.0018} {P(\text{send money now})}$$

We would repeat this process with a corpus of **ham email**:

$$P(ham \ | \ \text{send money now}) \approx \frac {0.05 \times 0.01 \times 0.1 \times 0.1} {P(\text{send money now})} = \frac {0.000005} {P(\text{send money now})}$$

All we care about is whether spam or ham has the **higher probability**, and so we predict that the email is **spam**.
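
The comparison above can be reproduced in a few lines. This is a rough sketch that simply plugs in the per-word likelihoods and priors from the two equations; the dictionary names are illustrative, and the final normalization step assumes that spam and ham are the only two classes.

```python
from math import prod

# Per-word likelihoods read off the worked example above
word_likelihoods = {
    'spam': [0.2, 0.1, 0.1],    # P(send | spam), P(money | spam), P(now | spam)
    'ham':  [0.05, 0.01, 0.1],  # P(send | ham),  P(money | ham),  P(now | ham)
}
priors = {'spam': 0.9, 'ham': 0.1}

# Numerator of Bayes' theorem under the naive independence assumption
scores = {label: prod(word_likelihoods[label]) * priors[label]
          for label in priors}

# The shared denominator does not change which class wins
prediction = max(scores, key=scores.get)

# Assumption: spam and ham are the only classes, so the scores sum to P(message)
evidence = sum(scores.values())
normalized = {label: score / evidence for label, score in scores.items()}

print(scores)
print(prediction)
print(normalized)
```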


In [9]:
# Importing Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

In [10]:
url = "https://raw.githubusercontent.com/sokjc/BayesNotBaes/master/sms.tsv"

df = pd.read_csv(url, sep='\t', header=None, names=['label', 'msg'])

X_train, X_test, y_train, y_test = \
    train_test_split(df.msg, df.label, random_state=1)

In [13]:
# How frequent are the labels?
df.label.describe()

count     5572
unique       2
top        ham
freq      4825
Name: label, dtype: object

In [14]:
# What are the most frequent texts?
df.msg.describe()

count                       5572
unique                      5169
top       Sorry, I'll call later
freq                          30
Name: msg, dtype: object

In [15]:
# convert label to a binary variable
# (y_train and y_test were split off earlier, so they keep the 'ham'/'spam' strings)
df.label = df.label.map({'ham':0, 'spam':1})

#### Sample Feature Extraction Steps
1. Learn the vocabulary of the training data
2. Transform the training data into a 'document-term matrix'
3. Transform the testing data into a 'document-term-matrix' based on vocabulary of training data.


In [16]:
# start with a simple example
train_simple = ['call you tonight',
                'Call me a cab',
                'please call me... PLEASE 44!']

# learn the 'vocabulary' of the training data
vect = CountVectorizer()
vect.fit(train_simple)
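# get_feature_names_out() needs scikit-learn >= 1.0 (older versions use get_feature_names())
vect.get_feature_names_out().tolist()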

['44', 'cab', 'call', 'me', 'please', 'tonight', 'you']

In [17]:
# transform training data into a 'document-term matrix'
train_simple_dtm = vect.transform(train_simple)

print('Document-Term Matrix')
print(train_simple_dtm.toarray())

Document-Term Matrix
[[0 0 1 0 0 1 1]
 [0 1 1 1 0 0 0]
 [1 0 1 1 2 0 0]]


In [19]:
# examine the vocabulary and document-term matrix together
pd.DataFrame(train_simple_dtm.toarray(), columns=vect.get_feature_names_out())

| | 44 | cab | call | me | please | tonight | you |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 |
| 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 |
| 2 | 1 | 0 | 1 | 1 | 2 | 0 | 0 |


In [20]:
# transform testing data into a document-term matrix (using existing vocabulary)
test_simple = ["please don't call me"]
test_simple_dtm = vect.transform(test_simple)
test_simple_dtm.toarray()
pd.DataFrame(test_simple_dtm.toarray(), columns=vect.get_feature_names_out())

| | 44 | cab | call | me | please | tonight | you |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 |
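
To see what the vectorizer is doing in the steps above, here is a simplified standard-library sketch. It is not scikit-learn's implementation: the helper names `tokenize`, `fit_vocabulary` and `transform` are made up for this example, and it only imitates the default behavior of lowercasing the text and keeping tokens of two or more word characters.

```python
import re
from collections import Counter

# Tokens are runs of two or more word characters, as in the default token pattern
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and split it into tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def fit_vocabulary(docs):
    """Step 1: learn the sorted vocabulary of the training documents."""
    return sorted({token for doc in docs for token in tokenize(doc)})


def transform(docs, vocabulary):
    """Steps 2 and 3: count each vocabulary term in each document."""
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        # Tokens outside the vocabulary are never looked up, so they are dropped
        rows.append([counts[term] for term in vocabulary])
    return rows


train_simple = ['call you tonight',
                'Call me a cab',
                'please call me... PLEASE 44!']
test_simple = ["please don't call me"]

vocabulary = fit_vocabulary(train_simple)
train_rows = transform(train_simple, vocabulary)
test_rows = transform(test_simple, vocabulary)

print(vocabulary)
print(train_rows)
print(test_rows)
```

Note how `don't` leaves no trace in the test row: the token `don` was never seen during training, so it has no column to land in.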


### Guided Exercise
Repeat the sample feature extraction steps above to create a document-term matrix (DTM) for the training and test datasets of our SMS data.

Remember, we've already read in our data and created our X_train and X_test matrices.

In [21]:
# instantiate the vectorizer
vect = CountVectorizer()

In [22]:
# learn vocabulary and create document-term matrix in a single step
# fit_transform accomplishes both the "fit" and "transform" functions in one line
X_train_dtm = vect.fit_transform(X_train)

In [23]:
# transform testing data into a document-term matrix
X_test_dtm = vect.transform(X_test)

In [24]:
# store feature names and examine them
train_features = vect.get_feature_names_out()

### Fit Naive Bayes Model
Train a Naive Bayes model using our X_train_dtm and our y_train

In [25]:
# Step 1: Import Model
from sklearn.naive_bayes import MultinomialNB

# Step 2: Instantiate Model
nb = MultinomialNB()

# Step 3: Fit Model
nb.fit(X_train_dtm, y_train)

# Step 4: Predict - make predictions on test data using test_dtm
preds = nb.predict(X_test_dtm)
preds

array(['ham', 'ham', 'ham', ..., 'ham', 'spam', 'ham'], dtype='<U4')

### Evaluate Naive Bayes Model

In [26]:
# Confusion Matrix
# compare predictions to true labels
from sklearn import metrics
acc = metrics.accuracy_score(y_test, preds)
matrix = metrics.confusion_matrix(y_test, preds)

print("Confusion Matrix")
print(matrix)

Confusion Matrix
[[1203    5]
 [  11  174]]
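
The confusion matrix can be turned into summary metrics by hand. The sketch below hard-codes the four counts printed above and assumes scikit-learn's layout: rows are true labels, columns are predicted labels, and classes appear in sorted order (`'ham'` first, then `'spam'`). The variable names are illustrative.

```python
# Confusion matrix copied from the output above
# Assumed layout: rows = true labels, columns = predictions, order = ham, spam
matrix = [[1203, 5],
          [11, 174]]

(true_ham, false_spam), (false_ham, true_spam) = matrix
total = true_ham + false_spam + false_ham + true_spam

# Share of all test messages that were labeled correctly
accuracy = (true_ham + true_spam) / total

# Of the messages flagged as spam, how many really were spam?
precision = true_spam / (true_spam + false_spam)

# Of the real spam messages, how many did the model catch?
recall = true_spam / (true_spam + false_ham)

print(f"accuracy:  {accuracy:.3f}")
print(f"precision: {precision:.3f}")
print(f"recall:    {recall:.3f}")
```

Under that layout, the model lets 11 spam messages through as ham and wrongly flags 5 ham messages as spam.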


### Bonus: sklearn Pipelines

In [27]:
from sklearn.pipeline import Pipeline

vect = CountVectorizer()
nb = MultinomialNB()

pipe = Pipeline([('vect',vect), ('nb', nb)])

In [28]:
# We can train it on the raw data
pipe.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)), ('nb', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])

In [29]:
# And then use the pipe to make predictions on raw data
pipe.predict(['Send me lots of money now', 'you won the lottery in Nigeria'])

array(['ham', 'ham'], dtype='<U4')

### Section Summary
- The **"naive" assumption** of Naive Bayes (that the features are conditionally independent) is critical to making these calculations simple.
- The **normalization constant** (the denominator) can be ignored since it's the same for all classes.
- The **prior probability** is much less relevant once you have a lot of features.

## Resources

- The Theory That Would Not Die, Sharon Bertsch McGrayne
- [How Bayesian Inference Works](https://brohrer.github.io/how_bayesian_inference_works.html)
- [Naive Bayes Unfolded](https://medium.com/data-science-group-iitr/naive-bayes-unfolded-b2ab036b42b1)
- [Count Vectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)
- Data Source for this Exercise - https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection 