# Spam Classifier with Naive Bayes

**Naive Bayes** is a supervised machine learning algorithm that can be trained to classify data into two or more classes. At the heart of the Naive Bayes algorithm is a probabilistic model that computes the conditional probabilities of the input features and assigns a probability to each of the possible classes.

In this lesson, we will:
* Review conditional probability and Bayes Rule
* Learn how the Naive Bayes algorithm works

At the end of this course, you will do a coding exercise that applies Naive Bayes to a Natural Language Processing (NLP) task, i.e., spam email classification, using the Scikit-Learn library.

## Introduction to Bayes Theorem

Suppose we are at an office and there are two people, Alex and Brenda. One of them passes us by in the hallway, but we aren't sure which it is. Without any other information, what is the probability it was Alex? Brenda?
> *Prior*: initial 50:50 guess, it's all we can infer *prior* to new information

Suppose we have some other information: the person was wearing a red sweater. We happen to know how often Alex and Brenda each wear a red sweater per week (for whatever reason), but we don't know on which days. If Alex wears a red sweater 2 days and Brenda wears a red sweater 3 days out of the same work week, what is the probability that it was Alex who passed us? Brenda?
> *Posterior*: inferred after new information arrived

<img src="img_18.png" width=700 align='center'>

## How Bayes Theorem Works

What Bayes Theorem does is switch from what we know to what we want to infer.

We start with knowing there are 2 people, Brenda and Alex.

Then, we know the probability that Alex wears red and we know the probability that Brenda wears red.

Given these probabilities, we can infer the probability that someone wearing red is Alex, or that someone wearing red is Brenda.

<img src='img_19.png' width=700 align='center'>

**Generalized:**

Initially, we know the probability of an event $A$, written $\text{P} \left( A \right)$.

Then we introduce new information in the form of an event $R$, for which we know $\text{P} \left( R\;\middle|  A \right)$, the probability of R given A.

**Bayes Theorem** infers the probability of A given R, $\text{P} \left( A \;\middle| R \right)$, which is the *new* probability of A once we know that the event R occurred.

<img src='img_20.png' width=700 align='center'>
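
The update described above can be sketched as a small Python function. This is a minimal illustration, not part of the original lesson: the name `bayes_update` and the dictionary layout are assumptions made for the example, and the numbers are the hallway values with the initial 50:50 prior and a 5-day work week.

```python
from fractions import Fraction


def bayes_update(prior, likelihood):
    """Return the posterior P(H | R) for each hypothesis H.

    prior:      dict mapping hypothesis -> P(H)
    likelihood: dict mapping hypothesis -> P(R | H)
    (Hypothetical helper, named here only for illustration.)
    """
    # Joint probabilities: P(R and H) = P(R | H) * P(H)
    joint = {h: likelihood[h] * prior[h] for h in prior}
    # P(R), obtained by summing the joint probabilities over all hypotheses
    evidence = sum(joint.values())
    # Normalize so the posteriors sum to one
    return {h: joint[h] / evidence for h in joint}


# Hallway example: 50:50 prior, red sweater 2 of 5 days (Alex), 3 of 5 days (Brenda)
prior = {"Alex": Fraction(1, 2), "Brenda": Fraction(1, 2)}
likelihood = {"Alex": Fraction(2, 5), "Brenda": Fraction(3, 5)}

posterior = bayes_update(prior, likelihood)
for person, p in posterior.items():
    print(person, p)
```

With equal priors the posterior simply mirrors the likelihoods: 2/5 for Alex and 3/5 for Brenda.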

## Applying Bayes Theorem

If we know that Alex comes in 3 days a week and Brenda only comes to the office one day a week, then it's 3 times more likely to see Alex than it is to see Brenda on any given day of the week:
$$\text{Probability Alex}=\text{P}\left(A\right)=0.75$$
$$\text{Probability Brenda}=\text{P}\left(B\right)=0.25$$

This is our prior.

Now, let's update that knowledge with the fact that the person had a red sweater. Recall, Alex wears red 2 times a week and Brenda wears red 3 times a week:

$$\text{Probability Red given it's Alex}=\text{P}\left(R\mid A\right)=\frac{2}{5}=0.4$$
$$\text{Probability Red given it's Brenda}=\text{P}\left(R\mid B\right)=\frac{3}{5}=0.6$$

Because we see Alex 3 times for every 1 time we see Brenda, it helps to count over a 4-week window of 20 workdays (3 weeks' worth of Alex sightings plus 1 week's worth of Brenda sightings). Alex wears red 2 days a week, so across his 3 weeks we see him in red 6 times. Brenda wears red 3 days a week, so across her 1 week we see her in red 3 times. In total, 9 of the 20 days (6 from Alex + 3 from Brenda) involve a red sweater:
$$\text{Probability of Red}=\text{P}\left(R\right)=\frac{9}{20}=0.45$$

Or more simply...
$$\text{Probability Red given it's Alex} \cdot \text{Probability it's Alex} + \text{Probability Red given it's Brenda} \cdot \text{Probability it's Brenda} = 0.4 \cdot 0.75 + 0.6 \cdot 0.25 = 0.45$$


<img src='img_21.png' width=700 align='center'>
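
The two routes to $\text{P}\left(R\right)$ above, counting over 20 days and the weighted sum, can be checked against each other in a few lines of Python. This is only a sketch: the variable names are made up for the example, and the 20-encounter window is the same simplification used in the text.

```python
from fractions import Fraction

# Counting view: a window of 20 hallway encounters (four 5-day weeks)
encounters = 20
alex_seen = 15    # Alex accounts for 3 of every 4 encounters
brenda_seen = 5   # Brenda accounts for 1 of every 4 encounters

alex_red = alex_seen * Fraction(2, 5)      # Alex wears red 2 days out of 5
brenda_red = brenda_seen * Fraction(3, 5)  # Brenda wears red 3 days out of 5

p_red_by_counting = (alex_red + brenda_red) / encounters

# Formula view: P(R) = P(R|A) * P(A) + P(R|B) * P(B)
p_red_by_formula = (
    Fraction(2, 5) * Fraction(3, 4) + Fraction(3, 5) * Fraction(1, 4)
)

print(alex_red, brenda_red)            # red sightings per person
print(p_red_by_counting, p_red_by_formula)
```

Both routes give 9/20, with 6 red sightings from Alex and 3 from Brenda.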

Of the 9 times someone was wearing red, 6 were Alex and 3 were Brenda.
> Note, Brenda can wear red and not be at the office, but we are only interested in information about wearing red at the office.

So, $\frac{2}{3}$ of the time we saw someone wearing red, it was Alex, and $\frac{1}{3}$ of the time it was Brenda.
> We now update our prior to get a posterior:
$$\text{Probability Alex given Red}=\text{P}\left(A\mid R\right)=\frac{2}{3}$$
$$\text{Probability Brenda given Red}=\text{P}\left(B\mid R\right)=\frac{1}{3}$$

We can similarly get $\text{P}\left(A\mid R\right)=\frac{2}{3}$ and $\text{P}\left(B\mid R\right)=\frac{1}{3}$ by:
$$\text{P}\left(A\mid R\right)=\frac{\text{Probability Alex} \cdot \text{Probability Red given it's Alex}}{\text{Probability Red}} = \frac{\text{P}\left(A\right) \cdot \text{P}\left(R\mid A\right)}{\text{P}\left(R\right)} = \frac{0.75 \cdot 0.4}{0.45}=\frac{2}{3}$$
$$\text{P}\left(B\mid R\right)=\frac{\text{Probability Brenda} \cdot \text{Probability Red given it's Brenda}}{\text{Probability Red}} = \frac{\text{P}\left(B\right) \cdot \text{P}\left(R\mid B\right)}{\text{P}\left(R\right)} = \frac{0.25 \cdot 0.6}{0.45}=\frac{1}{3}$$

<img src='img_22.png' width=700 align='center'>
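
One way to build confidence in the $\frac{2}{3}$ result is to simulate the hallway many times and look only at the encounters where the person wore red. The sketch below is an illustrative check rather than part of the original lesson; the function name, trial count, and seed are arbitrary choices.

```python
import random


def simulate_hallway(n_trials=200_000, seed=0):
    """Estimate P(Alex | red) by simulating hallway encounters."""
    rng = random.Random(seed)
    red_total = 0
    red_and_alex = 0
    for _ in range(n_trials):
        # Prior: the person is Alex with probability 0.75, otherwise Brenda
        is_alex = rng.random() < 0.75
        # Likelihood: Alex wears red with probability 0.4, Brenda with 0.6
        p_red = 0.4 if is_alex else 0.6
        if rng.random() < p_red:
            red_total += 1
            red_and_alex += is_alex
    # Fraction of red-sweater encounters that were Alex
    return red_and_alex / red_total


estimate = simulate_hallway()
print(f"Estimated P(Alex | red) = {estimate:.3f}")
```

The estimate should land close to 2/3, and it gets closer as `n_trials` grows. Conditioning on red corresponds to discarding every simulated encounter in which the sweater was not red.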

## Formal Representation of Bayes Theorem

So we have a prior for event A (or B), and we update that prior with information about event R (or its complement, where R does not occur). Multiplying the prior $P(A)$ by the probability of R given A (or of "not R" given A) gives the probability of the intersection of R and A (or of "not R" and A).
<img src='img_23.png' width=700 align='center'>

For this exercise, however, we are interested in finding out which event is more likely, A or B, given R, so we are not concerned with the complement of R:

<img src='img_24.png' width=700 align='center'>

By the [definition of conditional probability](https://en.wikipedia.org/wiki/Conditional_probability), $P(R|A)=\frac{P(R \cap A)}{P(A)}$, which rearranges to $P(R|A) \cdot P(A) = P(R \cap A)$, so we get:

<img src='img_25.png' width=700 align='center'>

Since these probabilities do not sum to one (they sum to $P(R)$, so $P(R|A) \cdot P(A) + P(R|B) \cdot P(B) \neq 1$), we can normalize them by dividing each by their sum so that the new probabilities do:

<img src='img_26.png' width=700 align='center'>
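
As a worked instance of the normalization step, here are the hallway numbers from the earlier section (priors 0.75 and 0.25, likelihoods 0.4 and 0.6). No new values are introduced:

$$P(R \cap A) = P(R|A) \cdot P(A) = 0.4 \cdot 0.75 = 0.3 \qquad P(R \cap B) = P(R|B) \cdot P(B) = 0.6 \cdot 0.25 = 0.15$$
$$P(A|R) = \frac{P(R \cap A)}{P(R \cap A) + P(R \cap B)} = \frac{0.3}{0.3 + 0.15} = \frac{2}{3} \qquad P(B|R) = \frac{P(R \cap B)}{P(R \cap A) + P(R \cap B)} = \frac{0.15}{0.3 + 0.15} = \frac{1}{3}$$

The denominator $0.3 + 0.15 = 0.45$ is exactly $P(R)$, which is why dividing by it makes the two posteriors sum to one.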

**Concluding:**
* We had **prior** probabilities: $P(A)$ and $P(B)$
* We updated them with event R to get **posterior** probabilities: $P(A|R)$ and $P(B|R)$

<img src='img_27.png' width=700 align='center'>

**Bayes Theorem:**
$$\text{P}\left(\text{A} \mid \text{B} \right) = \frac{\text{P}\left(\text{B} \mid \text{A} \right) \cdot \text{P}\left(\text{A}\right)}{\text{P}\left(\text{B}\right)}$$

## Another Application of the Bayes Theorem

<img src='img_28.png' width=700 align=center>

$$p(\text{sick})=\frac{1}{10{,}000}=0.0001=0.01\%$$
$$p(\text{test is positive given sick})=\frac{99}{100}=0.99=99\%$$
$$p(\text{sick}\mid\text{test is positive})=\frac{p(\text{test is positive}\mid\text{sick})\cdot p(\text{sick})}{p(\text{test is positive})}$$

Let:
$$\text{S:Sick}$$
$$\text{H:Healthy}$$
$$\text{pos:positive result}$$

So:
$$\text{P}\left(\text{S}\right) = 0.0001 \land \text{P}\left(\text{H}\right) = 0.9999$$
$$\text{P}\left(\text{pos}\mid\text{S}\right) = 0.99 \land \text{P}\left(\text{pos}\mid\text{H}\right) = 0.01$$
$$\text{P}\left(\text{pos}\right) = \text{P}\left(\text{S}\right)\cdot \text{P}\left(\text{pos}\mid\text{S}\right) + \text{P}\left(\text{H}\right)\cdot \text{P}\left(\text{pos}\mid\text{H}\right) = 0.0001\cdot0.99 + 0.9999\cdot0.01 = 0.010098$$

Then,
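$$\text{P}\left(\text{S}\mid\text{pos}\right) = \frac{\text{P}\left(\text{S}\right) \cdot \text{P}\left(\text{pos}\mid\text{S}\right)}{\text{P}\left(\text{pos}\right)} = \frac{0.0001\cdot0.99}{0.010098}\approx 0.0098$$

So even after a positive result, the probability of being sick is still below 1%, because healthy people vastly outnumber sick people.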

<img src='img_29.png' width=700 align='center'>

<img src='img_30.png' width=700 align='center'>

<img src='img_31.png' width=700 align='center'>
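
The same result can be reached by counting people instead of multiplying probabilities. The sketch below assumes a hypothetical cohort of one million people (a size chosen only so that every count comes out as a whole number) and uses the rates given above: 1 in 10,000 sick, 99% of sick people test positive, 1% of healthy people test positive.

```python
# Hypothetical cohort; the size is an assumption chosen to give whole numbers
population = 1_000_000

sick = population // 10_000    # P(S) = 0.0001
healthy = population - sick    # P(H) = 0.9999

true_positives = sick * 99 // 100     # P(pos | S) = 0.99
false_positives = healthy * 1 // 100  # P(pos | H) = 0.01

# Among everyone who tests positive, what fraction is actually sick?
p_sick_given_pos = true_positives / (true_positives + false_positives)

print(f"sick: {sick}, healthy: {healthy}")
print(f"true positives: {true_positives}, false positives: {false_positives}")
print(f"P(sick | positive) = {p_sick_given_pos:.4f}")
```

In this cohort, 99 sick people test positive alongside 9,999 healthy people, so only 99 of the 10,098 positive results belong to someone who is sick, which matches the 0.0098 computed above.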

# Bayesian Learning

<img src='img_32.png' width=700 align='center'>

$$p(easy|spam) = \frac{1}{3}$$
$$p(money|spam) = \frac{2}{3}$$

<img src='img_33.png' width=700 align='center'>

If we compute the probabilities for every combination (spam or ham, containing "easy" or not), we get:
$$p(spam)=\frac{3}{8} \land p(ham)=\frac{5}{8}$$
Given spam (3 emails):
$$p(easy|spam)=\frac{1}{3}$$

Given ham (5 emails):
$$p(easy|ham)=\frac{1}{5}$$

Then the joint probabilities are:
$$p(spam \cap easy)=p(spam)\cdot p(easy|spam)=\frac{3}{8}\cdot\frac{1}{3}=\frac{1}{8}$$
$$p(ham \cap easy)=p(ham)\cdot p(easy|ham)=\frac{5}{8}\cdot\frac{1}{5}=\frac{1}{8}$$

<img src='img_34.png' width=700 align='center'>

But suppose we know that the email contains the word "easy". Then our entire universe consists only of the two cases in which "easy" is present: the email is spam, or the email is ham. Those two cases have the same probability, $\frac{1}{8}$. So, once we normalize the probabilities, they both turn into 50%! Thus, our two posterior probabilities are 50%.

<img src='img_35.png' width=700 align='center'>

For emails with the word "money", we can follow the same procedure!

<img src='img_36.png' width=700 align='center'>

Again, because the only emails we are interested in are those with the word "money", we can ignore the rest of the population! We can update our posteriors of being spam or ham given the word "money" to $p(spam|\text{money})=\frac{2}{3} \land p(ham|\text{money})=\frac{1}{3}$.

<img src='img_37.png' width=700 align='center'>
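
Both word examples follow the same recipe, so they can be sketched with one small function that works directly from email counts. The counts below are assumptions read off the probabilities in the text: 3 spam and 5 ham emails, "easy" in 1 spam and 1 ham email, "money" in 2 spam emails. The single ham email containing "money" is inferred from the stated posterior of $\frac{1}{3}$ rather than given explicitly. The function name is hypothetical.

```python
from fractions import Fraction

# Email counts per class (assumed from the probabilities in the text)
counts = {
    "spam": {"total": 3, "easy": 1, "money": 2},
    "ham": {"total": 5, "easy": 1, "money": 1},  # ham/"money" count is inferred
}


def posterior_given_word(word, counts):
    """Return P(class | word) for each class, computed from raw counts."""
    n_emails = sum(c["total"] for c in counts.values())
    joint = {}
    for label, c in counts.items():
        prior = Fraction(c["total"], n_emails)      # e.g. P(spam) = 3/8
        likelihood = Fraction(c[word], c["total"])  # e.g. P(easy | spam) = 1/3
        joint[label] = prior * likelihood           # P(class and word)
    evidence = sum(joint.values())                  # P(word)
    return {label: p / evidence for label, p in joint.items()}


easy_posterior = posterior_given_word("easy", counts)
money_posterior = posterior_given_word("money", counts)
print(easy_posterior)
print(money_posterior)
```

For "easy" both classes come out at 1/2, and for "money" the result is 2/3 spam and 1/3 ham, matching the figures above.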

## Conditional Probability

Where does the word "naive" come from in Naive Bayes?

Let's look at the probability of two events happening together, $p(\text{A & B})$ or $p(\text{A}\cap\text{B})$. We can write $p(\text{A}\cap\text{B})=p(A)\cdot p(B)$, but that is only true when the two events are independent. As an example, consider the probability of it being hot outside and the probability of it being cold outside. What's the probability of the two events happening simultaneously? 0. It cannot be both hot and cold simultaneously. So the two events are dependent: their joint probability is 0, not the product of their individual probabilities.

<img src='img_38.png' width=700 align='center'>
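
The product rule can be tested numerically by enumerating a small sample space and comparing $p(\text{A}\cap\text{B})$ with $p(A)\cdot p(B)$. The sketch below uses two made-up sample spaces: a pair of fair coin flips (independent) and a four-day weather log (hypothetical data, chosen only to illustrate the hot/cold case).

```python
from fractions import Fraction
from itertools import product


def prob(event, space):
    """Probability of `event` over a list of equally likely outcomes."""
    return Fraction(sum(1 for outcome in space if event(outcome)), len(space))


# Independent events: two fair coin flips
flips = list(product("HT", repeat=2))
p_first = prob(lambda o: o[0] == "H", flips)
p_second = prob(lambda o: o[1] == "H", flips)
p_both = prob(lambda o: o == ("H", "H"), flips)
coins_independent = p_both == p_first * p_second

# Dependent events: a day cannot be both hot and cold (made-up weather log)
days = ["hot", "mild", "cold", "mild"]
p_hot = prob(lambda d: d == "hot", days)
p_cold = prob(lambda d: d == "cold", days)
p_hot_and_cold = prob(lambda d: d == "hot" and d == "cold", days)
weather_independent = p_hot_and_cold == p_hot * p_cold

print("coin flips independent:", coins_independent)
print("hot/cold independent:", weather_independent)
```

For the coin flips the joint probability equals the product (1/4 in both cases). For the weather log the joint probability is 0 while the product is 1/16, so the product rule fails and the events are dependent.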