## Bayes' Theorem

### Independence vs. Exclusivity

In the last lesson, we continued learning about conditional probability and focused on subjects like the multiplication rule, the order of conditioning, and statistical independence. Next, we'll learn about Bayes' theorem, which is a central topic in probability.

We begin by clarifying an important distinction between independence and exclusivity. We learned in the previous lesson that two events A and B are independent if the occurrence of one doesn't change the probability of the other. In mathematical terms, we've seen that A and B are independent if any of the conditions below is true:

![image.png](attachment:31245ba6-1672-43cd-af6a-1bbeb7fa1964.png)

In the previous course, we learned that two events, A and B, are mutually exclusive if they cannot both occur at the same time. If one event happens, the other cannot happen, and vice versa. Examples of mutually exclusive events include:

- Getting a 5 (event A) and getting a 3 (event B) when we roll a regular six-sided die — it's impossible to get both a 5 and a 3.
- A coin lands on heads (event A) and tails (event B) — it's impossible to flip a coin and see it landing on both heads and tails.

If two events A and B are mutually exclusive, then it's impossible that they both occur, which means (A ∩ B) is an impossible event (and the probability of impossible events is always 0):

![image.png](attachment:2d28b156-49fd-4c6f-a66b-6ff3c072b250.png)

Both independence and exclusivity describe the relationship between two or more events, and we see that they have different mathematical meanings:

![image.png](attachment:954ecca4-d361-402a-acef-c3ac19fcf9d3.png)

Let's take a quick look at a few examples. Say we roll a fair six-sided die twice and consider these four events:

- Event A: We get a 4 on the first roll.
- Event B: We get a 2 on the second roll.
- Event C: We get an even number on the first roll.
- Event D: We get a 5 on the first roll.

If event A happens, then the probability of event B stays the same, since the result of the first roll doesn't influence the result of the second one in any way — this means A and B are independent. Also, we can get a 4 on the first roll (event A) and a 2 on the second roll (event B), which means A and B are not mutually exclusive.

Now let's look at the relationship between events A and C. If C happens, then the probability of A changes, and vice-versa (if you don't recall why the probability changes, you can quickly review the previous lesson). This means A and C are dependent. Also, if the outcome was 4, then we'd get a 4 (event A) and an even number (event C) at the same time, which means A and C are not mutually exclusive.

However, if we look at events A and D, we see they cannot possibly happen together: we cannot get both a 4 and a 5 on the first roll. This means events A and D are mutually exclusive. Because A and D cannot happen together, it becomes meaningless to talk about independence, since the concept of independence makes sense only as long as both events can happen.
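As a quick sanity check, the relationships above can be verified by brute force. The sketch below enumerates the 36 equally likely outcomes of the two rolls and tests pairs of events. The helper names (`prob`, `both`, `independent`, `mutually_exclusive`) are our own and not part of the lesson, and independence is tested with the product rule P(A ∩ B) = P(A) · P(B).

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling a fair die twice: (first, second).
OMEGA = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event, given as a True/False test on one outcome."""
    return Fraction(sum(1 for o in OMEGA if event(o)), len(OMEGA))

def both(e1, e2):
    """The intersection of two events."""
    return lambda o: e1(o) and e2(o)

def independent(e1, e2):
    # Product rule: P(E1 ∩ E2) == P(E1) * P(E2)
    return prob(both(e1, e2)) == prob(e1) * prob(e2)

def mutually_exclusive(e1, e2):
    # The events can never happen together: P(E1 ∩ E2) == 0
    return prob(both(e1, e2)) == 0

def A(o): return o[0] == 4        # a 4 on the first roll
def B(o): return o[1] == 2        # a 2 on the second roll
def C(o): return o[0] % 2 == 0    # an even number on the first roll
def D(o): return o[0] == 5        # a 5 on the first roll

print(independent(A, B), mutually_exclusive(A, B))  # True False
print(independent(A, C), mutually_exclusive(A, C))  # False False
print(mutually_exclusive(A, D))                     # True
```

Using `Fraction` instead of floats keeps the equality checks exact.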

Now let's look at a few exercises.

For the exercises below, consider the following probabilities:

![image.png](attachment:861208f2-9153-4893-90c4-5a2e8b3493fe.png)

Assess with True or False the following statements:

![image.png](attachment:834351da-7cc4-412f-938f-3f4f15fb84a6.png)

In [1]:
statement_1 = False
statement_2 = True
statement_3 = True

## Example Walk-through

On the previous screen, we saw there's a difference between independence and exclusivity. Before we move on, recall that in the previous course we learned about the addition rule:

![image.png](attachment:0db1fa72-75fb-4cbf-8c47-fd32cce6a808.png)

![image.png](attachment:b5de1903-b5c5-467d-af46-48774cd61b2f.png)

With this in mind, let's consider the probabilities associated with an HIV test:

![image.png](attachment:4183002c-13de-4ba0-b6bf-a3e26f824e05.png)

Now what if we just want to find P(T+), the probability that a person selected at random will get a positive result? There are two possible scenarios when someone gets a positive result:

1. The person is infected with HIV and gets a positive result.
2. The person is not infected with HIV and gets a positive result.

![image.png](attachment:a7a63f4c-19d2-4c2b-9e0f-171509dbb7ed.png)

We can visualize this set union using a Venn diagram:

![image.png](attachment:46f5f7a8-658e-4abb-9ec1-679d2c5d4845.png)

![image.png](attachment:7be62366-3a60-4350-b460-cfde01d9469a.png)

![image.png](attachment:e3632547-4fba-42b4-9e28-9b2fd1483cbb.png)

![image.png](attachment:dbb6004c-8e45-4eeb-b7b6-f09a034ff7ad.png)

![image.png](attachment:0a286a02-d0d5-4110-9ddf-00aa0a5ef6a1.png)

We can find the word "secret" in many spam emails. However, some emails are not spam even though they contain the word "secret." Let's say we know the following probabilities:

![image.png](attachment:6c43e003-f98f-424b-b99a-a39f4e8aff79.png)

![image.png](attachment:541088f4-0daf-4f51-a054-40b317fd5e50.png)

In [2]:
p_non_spam = 1 - 0.2388
p_spam_and_secret = 0.2388 * 0.4802
p_non_spam_and_secret = p_non_spam * 0.1284
p_secret = p_spam_and_secret + p_non_spam_and_secret



## A General Formula

![image.png](attachment:4b0f9694-bba2-4390-9294-6174a52665ea.png)


![image.png](attachment:e0ec2600-2ebe-40a9-bd12-042b0500ca28.png)


![image.png](attachment:64cf2cad-ed5e-4075-ad02-d15a8958da39.png)

![image.png](attachment:bfa2cb2c-3caa-4d26-82dd-eb9bbfb8bb1f.png)

![image.png](attachment:fe645ba0-e294-46dd-a3d7-ed50194b4744.png)

![image.png](attachment:ab6a168d-570c-422b-8212-9611798ef79e.png)

![image.png](attachment:b730c67c-38ef-4672-90e3-e22203f9b56e.png)

With this in mind, we can now develop a general formula for P(A):

![image.png](attachment:a0e5c921-2d2c-4913-a60a-a89277d44272.png)

We'll see on the following screens that this formula plays a critical role in Bayes' theorem. Now let's do a quick exercise.

An airline transports passengers using two types of planes: a Boeing 737 and an Airbus A320.

- The Boeing operates 73% of the flights. Out of these flights, 3% arrive at the destination with a delay.
- The Airbus operates the remaining 27% of the flights. Out of these flights, 8% arrive with a delay.

Convert the percentages above to probabilities:

1. Assign the probability of flying with a Boeing to p_boeing (to better understand what this probability means, imagine a passenger having bought a ticket with this airline — what's the probability that this passenger will be assigned to fly to her destination with a Boeing?).
2. Assign the probability of flying with an Airbus to p_airbus.
3. Assign the probability of arriving at the destination with a delay given that the passenger flies with a Boeing to p_delay_given_boeing.
4. Assign the probability of arriving at the destination with a delay given that the passenger flies with an Airbus to p_delay_given_airbus.

Calculate:

The probability that a passenger will arrive at her destination with a delay. Assign your answer to p_delay. Check the hint if you get stuck.

In [4]:
p_boeing = 0.73
p_airbus = 0.27
p_delay_given_boeing = 0.03
p_delay_given_airbus = 0.08


p_delay = p_boeing * p_delay_given_boeing + p_airbus * p_delay_given_airbus

## Formula for Three Events

In the previous exercise, we used this formula to calculate the probability of having a delay when flying with a particular airline:

![image.png](attachment:f85fd7f4-3611-432c-b573-de3c3cb4a9d3.png)

Recall that the airline transports passengers using two types of planes: a Boeing 737 and an Airbus A320. This allowed us to model P(Delay) as:

![image.png](attachment:b2b1de98-4b12-4e07-975f-c0fb8daf126c.png)

However, let's consider another airline which has three types of planes: a Boeing 737, an Airbus A320, and an ERJ 145.

- The Boeing operates 58% of the flights. Out of these flights, 4% arrive at the destination with a delay.
- The Airbus operates 31% of the flights. Out of these flights, 7% arrive with a delay.
- The ERJ operates the remaining 11% of the flights. Out of these flights, 2% arrive with a delay.

A passenger buying a ticket with this airline will be assigned to only one of the three types of airplanes. This means that the sample space is made up of three events that are all mutually exclusive and exhaustive. On a Venn diagram, we have:

![image.png](attachment:a1c8fcc7-1115-4e5c-9368-f40289f3556c.png)

Now let's add the event Delay on the above Venn diagram:

![image.png](attachment:6c184800-2fb0-418f-82e1-0b47a70956a1.png)

Judging by the diagram, we can see that P(Delay) is:

![image.png](attachment:7264d196-cd24-48c1-9c9a-06d304fe7d4f.png)

To develop a more general formula, imagine that instead of the events Delay, Boeing, Airbus, and ERJ, we have events A, B1, B2, and B3:

![image.png](attachment:971032cf-0b84-457e-ab57-9bef741d0f22.png)

We'll now get a little practice with this new expanded formula. On the next screen, we're going to expand the formula for an indefinite number of events. After that, we'll be well-equipped to start discussing Bayes' theorem.

An airline transports passengers using three types of planes: a Boeing 737, an Airbus A320, and an ERJ 145.

- The Boeing operates 62% of the flights. Out of these flights, 6% arrive at the destination with a delay.
- The Airbus operates 35% of the flights. Out of these flights, 9% arrive with a delay.
- The ERJ operates the remaining 3% of the flights. Out of these flights, 1% arrive with a delay.

Calculate the probability of delay and assign your result to p_delay. See the hint if you get stuck.

In [6]:
p_delay = 0.62 * 0.06 + 0.35 * 0.09 + 0.03 * 0.01

## The Law of Total Probability

On the previous screen, we developed a formula to include three events:

![image.png](attachment:e1f47d8f-d57e-4b6c-a9b9-a88485b9a839.png)

To develop a formula with four events, we can use the same reasoning as we used to develop the formula above. Let's say the sample space Ω is made up of four mutually exclusive and exhaustive events:

![image.png](attachment:a0524a88-f818-4420-9220-c80f9539dfeb.png)

Then we can understand event A as the union of the following events:

![image.png](attachment:9a3d7388-5b6e-43d9-95a1-2ee2f9daecb6.png)

![image.png](attachment:b8e8791a-5088-4819-a64a-5b96008fbf03.png)

Using the addition rule for mutually exclusive events, we have:

![image.png](attachment:2ab549e8-2621-492b-81ea-ad65f4ffc409.png)

Using the multiplication rule, we arrive at:

![image.png](attachment:819addff-54b3-4c9f-859c-c5835e3c2ec7.png)

This is the formula we can use for four events. Now, instead of four events, let's say the sample space Ω is made up of n mutually exclusive and exhaustive events:

![image.png](attachment:5d388e01-4daf-4f00-a331-072b7098c666.png)

Using the same reasoning as we used above, the formula for n events is:

![image.png](attachment:1ec8e943-189d-46d7-afb6-56eb215ea132.png)

The above formula is called the law of total probability.

![image.png](attachment:23cea699-9331-4687-8907-a7ac9e6e331e.png)
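The law of total probability translates almost directly into code. Below is a minimal sketch, assuming the events B1, ..., Bn are passed as two parallel lists; the function name `total_probability` and its parameter names are our own choices. It is applied to the three-plane airline from the previous screen (58%, 31%, and 11% of flights, with 4%, 7%, and 2% delays).

```python
def total_probability(priors, likelihoods):
    """Return P(A) as the sum of P(B_i) * P(A|B_i) over all events B_i.

    priors      -- [P(B_1), ..., P(B_n)], mutually exclusive and exhaustive
    likelihoods -- [P(A|B_1), ..., P(A|B_n)], in the same order
    """
    if len(priors) != len(likelihoods):
        raise ValueError("need exactly one likelihood per event")
    # Exhaustive events must have probabilities that add up to 1.
    if abs(sum(priors) - 1) > 1e-9:
        raise ValueError("priors must sum to 1")
    return sum(p * l for p, l in zip(priors, likelihoods))

# Boeing, Airbus, ERJ
p_plane = [0.58, 0.31, 0.11]
p_delay_given_plane = [0.04, 0.07, 0.02]

p_delay = total_probability(p_plane, p_delay_given_plane)
print(round(p_delay, 4))  # 0.0471
```

The check on the sum of the priors is a cheap way to catch a list of events that is not exhaustive.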

Now let's move to the next screen and discuss Bayes' theorem. There's no exercise for this screen, but we'll get more practice in a minute.



## Bayes' Theorem

On previous screens, we discussed a few examples about plane delays and used the law of total probability to find P(Delay), the probability that a passenger will arrive at her destination with a delay. Once a plane has arrived with a delay, however, we might be interested in calculating the probability that it's a Boeing. In other words, what's the probability that the plane is a Boeing given that it arrived with a delay?

Let's bring back a concrete example we've used earlier. An airline transports passengers using two types of planes: a Boeing 737 and an Airbus A320.

- The Boeing operates 73% of the flights. Out of these flights, 3% arrive at the destination with a delay.
- The Airbus operates the remaining 27% of the flights. Out of these flights, 8% arrive with a delay.

Let's say a plane did arrive with a delay and we want to find the probability that the plane is a Boeing. In other words, we want to find P(Boeing|Delay). Let's begin by expanding P(Boeing|Delay) using the conditional probability formula:

![image.png](attachment:431b0b13-9917-4ca4-a5d0-9c02049a95ca.png)

![image.png](attachment:075de5d5-2dfa-416d-8a60-e67e10007ab1.png)

Now we can plug in the values in our initial conditional probability formula and find P(Boeing|Delay):

![image.png](attachment:dc720296-ab24-4e6c-af57-dfb27fe6aa68.png)

This is an instance where we applied Bayes' theorem to solve a probability problem. Mathematically, Bayes' theorem can be defined as:

![image.png](attachment:637abca7-ea49-4ab5-9cd3-33f9c861a58e.png)

Note that we arrived at Bayes' theorem by substituting the law of total probability into the conditional probability formula and expanding the numerator P(B ∩ A) using the multiplication rule:

![image.png](attachment:ae4f1e19-ea21-4a78-bada-031b249740eb.png)

Above, we defined the formulas for P(B|A), but we can also define them for P(A|B):

![image.png](attachment:fbbf93ec-684f-43a7-b264-1c8bf3940d4f.png)
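The derivation above can be sketched as a small function: the numerator comes from the multiplication rule and the denominator from the law of total probability. The name `bayes_posterior` and its parameters are our own, and the numbers are those of the two-plane airline from this screen.

```python
def bayes_posterior(priors, likelihoods, index):
    """Return P(B_index | A) using Bayes' theorem.

    priors      -- [P(B_1), ..., P(B_n)], mutually exclusive and exhaustive
    likelihoods -- [P(A|B_1), ..., P(A|B_n)], in the same order
    """
    # Multiplication rule: P(B_i ∩ A) = P(B_i) * P(A|B_i)
    joint = [p * l for p, l in zip(priors, likelihoods)]
    # Law of total probability: P(A) is the sum of all the joint probabilities.
    return joint[index] / sum(joint)

p_plane = [0.73, 0.27]              # Boeing, Airbus
p_delay_given_plane = [0.03, 0.08]

p_boeing_given_delay = bayes_posterior(p_plane, p_delay_given_plane, 0)
print(round(p_boeing_given_delay, 4))  # 0.5034
```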

Now let's use Bayes' theorem to find P(Airbus|Delay). On the next screen, we'll learn more about Bayes' theorem.

An airline transports passengers using two types of planes: a Boeing 737 and an Airbus A320.

- The Boeing operates 73% of the flights. Out of these flights, 3% arrive at the destination with a delay.
- The Airbus operates the remaining 27% of the flights. Out of these flights, 8% arrive with a delay.

Use Bayes' theorem to find P(Airbus|Delay). Assign your answer to p_airbus_delay. Don't forget you can check the hint if you get stuck.

In [8]:
p_airbus_delay = 0.27 * 0.08 / (0.73 * 0.03 + 0.27 * 0.08)
p_airbus_delay

0.4965517241379311

## Prior and Posterior Probability

Near the beginning of this lesson, we considered an example around HIV testing and saw the following probabilities:

![image.png](attachment:3423d7e5-c503-4047-88c0-9909f1db3a5f.png)

![image.png](attachment:1610a418-d2a4-4cc6-ba4e-5d686a6eaf90.png)

![image.png](attachment:60457150-fc41-4dcb-ad57-af2dcb25c4a3.png)

![image.png](attachment:4441ed6d-66af-42b2-9894-93565e002c88.png)

![image.png](attachment:f53faf69-b48f-4b7e-b2d8-93f251768ad5.png)

![image.png](attachment:9b22cc39-278d-4f89-8d3a-c8ce82423284.png)

![image.png](attachment:ab4774de-ffd5-494f-9900-03532af3d949.png)

We see that if a person tests positive, the probability of being infected with HIV is still pretty low: 11.74%. This low value may be a bit counterintuitive given the high accuracy of the test. However, the probability is low because P(HIV), the probability of having HIV, is very low in the first place: 0.14%.

Notice, however, that if a person tests positive, the probability of being infected with HIV actually increases a lot. The average person in the population has a 0.14% chance of being infected with HIV, since

![image.png](attachment:b3210e72-8f1a-44a3-8ae6-38778f77aa27.png)

In the above example, we've considered the probability of being infected with HIV in two scenarios:

1. Before doing any test: P(HIV)
2. After testing positive: P(HIV|T+)

The probability of being infected with HIV before doing any test is called the prior probability ("prior" means "before"). The probability of being infected with HIV after testing positive is called the posterior probability ("posterior" means "after"). So, in this case, the prior probability is 0.14%, and the posterior probability is 11.74%.
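To get a feel for how much the evidence shifts our belief, we can compare the two values directly. This is just a small sketch using the two percentages quoted above; the variable names are our own.

```python
prior = 0.0014       # P(HIV): 0.14%, before any test
posterior = 0.1174   # P(HIV|T+): 11.74%, after testing positive

# How many times more likely infection is after a positive test.
ratio = posterior / prior
print(round(ratio, 1))  # 83.9
```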

Now let's look at an exercise about spam emails and wrap up this lesson on the next screen.

Many spam emails contain the word "secret". However, some emails are not spam even though they contain the word "secret". Let's say we know the following probabilities:

![image.png](attachment:bfa2dc70-6e86-42e8-b9b8-e58b20630fbd.png)

Use Bayes' theorem to find P(Spam|"secret"). Assign your answer to p_spam_given_secret.

Assign the prior probability of getting a spam email to prior.

Assign the posterior probability of getting a spam email (after we see the email contains the word "secret") to posterior.

Calculate the ratio between the posterior and the prior probability — you'll need to divide the posterior probability by the prior probability. Assign your answer to ratio.

In [14]:
p_spam_given_secret = 0.2388 * 0.4802 / (0.2388 * 0.4802 + 0.7612 * 0.1284)
prior = 0.2388
posterior = p_spam_given_secret
ratio = posterior / prior

## The Naive Bayes Algorithm

### A Spam Filter

Over the last three lessons, we've managed to learn many new concepts, including conditional probability, independence, the law of total probability, and Bayes' theorem. In this lesson and the next, we'll look at an application of conditional probability — we'll build a spam filter.

Spam is most commonly associated with emails. For instance, unwanted and unsolicited advertising emails are usually classified as spam. Spamming, however, occurs in ways and environments that don't necessarily relate to emails:

- Articles or blog posts can be spammed with comments — the comments are ads or they are repetitive.
- An educational forum may be spammed with posts that are, in fact, ads.
- Mobile phone users may receive unwanted and unsolicited SMS messages, usually about advertising.

In our lessons, we're going to build a spam filter specifically directed at preventing mobile phone spam. The filter will be able to analyze new messages and tell whether they are spam or not — this way, we might be able to prevent spam from bothering mobile phone users.

To build the spam filter, we're going to use an algorithm called Naive Bayes — as the name suggests, the algorithm is based on Bayes' theorem.

This lesson explores the theoretical side of the algorithm and is dedicated to helping you understand how the algorithm works. In the next lesson, which is a guided project, we'll apply the algorithm to [a dataset](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection) of over 5,000 SMS messages.

Let's start by getting an overview of the Naive Bayes algorithm.

## Naive Bayes Overview

Imagine we just got a new SMS message:

- "WINNER! You have 8 hours to claim your money by calling 090061701461. Claim code: KL341."

This must be spam, but how could we create an algorithm that reaches the same conclusion? One thing we might think of is to create a list of words that occur frequently in spam messages, and then write a bunch of if statements:

- If the word "money" is in the message, then classify the message as spam.
- If the words "secret" and "money" are both in the message, then classify the message as spam; etc.

However, as messages become numerous and more complex, coming up with the right if statements will slowly become very difficult.

Another solution would be to classify a couple of messages ourselves and make the computer learn from our classification. This is exactly what the Naive Bayes algorithm is about: it makes the computer learn from the classification a human does, and then the computer uses that knowledge to classify new messages.

The computer uses the specifications of the Naive Bayes algorithm to learn how we classify messages (what counts as spam and non-spam for us), and then it uses that human knowledge to estimate probabilities for new messages. Following the specifications of the algorithm, the computer tries to answer two conditional probability questions:

![image.png](attachment:51a41940-cf03-457e-bc8e-0f22975ccdf0.png)

In plain English, these two questions are:

- What's the probability that this new message is spam, given its content (its words, punctuation, letter case, etc.)?
- What's the probability that this new message is non-spam, given its content?

Once it has an answer to these two questions, the computer classifies the message as spam or non-spam based on the probability values. If the probability for spam is greater, then the message is classified as spam. Otherwise, it goes into the non-spam category.

Now let's move to the next screen, where we'll start to look into the details of the algorithm.

## Using Bayes' Theorem

On the previous screen, we saw an overview of how the computer may classify new messages using the Naive Bayes algorithm:

1. The computer learns how humans classify messages.
2. Then it uses that human knowledge to estimate probabilities for new messages — probabilities for spam and non-spam.
3. Finally, the computer classifies a new message based on the probability values it calculated in step 2 — if the probability for spam is greater, then it classifies the message as spam. Otherwise, it classifies it as non-spam (if the two probability values are equal, then we may want a human to classify the message — we'll come back to this issue in the guided project).
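Step 3 above is a plain comparison, which can be sketched as follows. The function name `classify`, the label used for ties, and the probability values in the usage lines are hypothetical and chosen only for illustration.

```python
def classify(p_spam_given_message, p_non_spam_given_message):
    """Pick the label with the greater probability; flag ties for a human."""
    if p_spam_given_message > p_non_spam_given_message:
        return 'spam'
    if p_spam_given_message < p_non_spam_given_message:
        return 'non-spam'
    # Equal probabilities: the algorithm has no basis for choosing a label.
    return 'needs human classification'

# Made-up probabilities, for illustration only.
print(classify(0.8, 0.2))  # spam
print(classify(0.3, 0.7))  # non-spam
print(classify(0.5, 0.5))  # needs human classification
```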

We saw on the previous screen that when a new message comes in, the algorithm requires the computer to calculate the following probabilities:

![image.png](attachment:cae09d98-dc25-4499-99f0-aa250a539223.png)

Let's take the first equation and expand it using Bayes' theorem:

![image.png](attachment:132595bd-ae23-481d-9d4d-60b144e425b5.png)

For the sake of example, let's assume the following probabilities are already known:

![image.png](attachment:ed98980f-940d-4d64-8f08-a31a670bc799.png)

If the computer knows these values, then it can calculate the probabilities it needs to classify a new message:

![image.png](attachment:18c3187d-497f-420b-80f8-c74ca849120b.png)

Let's now do a quick exercise and continue the discussion on the next screen.

A new mobile message has been received: "URGENT!! You have one day left to claim your $873 prize." The following probabilities are known:

![image.png](attachment:18453d50-ae62-47e0-b4bd-51176d8f0cfc.png)

Classify this new message as spam or non-spam:

1. Calculate P(Spam|New Message). Assign your answer to p_spam_given_new_message.
2. Calculate P(Spam<sup>C</sup>|New Message). Assign your answer to p_non_spam_given_new_message.
3. Classify the message by comparing the probability values. If the message is spam, then assign the string 'spam' to the variable classification. Otherwise, assign the string 'non-spam'.

In [22]:
p_spam = 0.5
p_non_spam = 0.5
p_new_message = 0.5417
p_new_message_given_spam = 0.75
p_new_message_given_non_spam = 0.3334

p_spam_given_new_message = p_spam * p_new_message_given_spam / p_new_message
p_non_spam_given_new_message = p_non_spam * p_new_message_given_non_spam / p_new_message
classification = 'spam'

In [20]:
p_spam_given_new_message

0.6922650913789921

In [21]:
p_non_spam_given_new_message 

0.3077349086

## Ignoring the Division

On the last screen, we saw the computer can use these two equations to calculate the probabilities it needs to classify new messages:

![image.png](attachment:62178b39-ae5e-46cf-89b6-934a705358cd.png)

Although we've taken a great first step so far, the actual equations of the Naive Bayes algorithm are a bit different — we'll gradually develop the equations throughout this lesson. Let's start by pointing out that both equations above have the same denominator: P(New message).

When a new message comes in, P(New message) has the same value for both equations. Since we only need to compare the results of the two equations to classify a new message, we can ignore the division:

![image.png](attachment:7b7b7aa1-0d48-40f0-885d-ac83cf7ac65a.png)

This means our two equations reduce to:

![image.png](attachment:98351216-3ddb-4ab6-9cc2-f4006584be0a.png)

Ignoring the division doesn't affect the algorithm's ability to classify new messages. For instance, let's repeat the classification we did on the previous screen using the new equations above. Recall that we assumed we already know these values:

![image.png](attachment:3a09813c-b35d-47c6-83c1-ae21bd654445.png)

Previously, the algorithm classified the new message as spam. Using the new equations, we see the conclusion is identical — the new message is spam because

![image.png](attachment:c66a3c45-6b35-471e-907e-db8167821fd7.png)

The classification works fine, but ignoring the division changes the probability values, and some probability rules also begin to break. For instance, let's take this conditional probability rule that we've learned about in a previous lesson:

![image.png](attachment:4214b49e-e3fd-4589-a2eb-bef052111479.png)

With the values we got from the new equations, however, the law breaks:

![image.png](attachment:56161d73-e5f6-4f15-91da-23814ea3c6be.png)

Even though probability rules break, the Naive Bayes algorithm still requires us to ignore the division by P(New message). This might not make a lot of sense, but there's actually a very good reason we do that.

The main goal of the algorithm is to classify new messages, not to calculate probabilities; calculating probabilities is just a means to an end. Ignoring the division by P(New message) means fewer calculations, which can make a big difference when we use the algorithm to classify 500,000 new messages.

It's true the probability values are not accurate anymore. However, this is not important with respect to the goal of the algorithm, which is to classify new messages correctly (not to estimate probabilities accurately).

The classification itself remains completely unaffected because we ignore the division for both equations (not just for one). The probability values change, but they change in direct proportion to one another, so the result of the comparison doesn't change.

![image.png](attachment:4bfaf57e-0952-471a-8130-956504652c6d.png)
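The argument above can be checked numerically. The sketch below reuses the values from the exercise on the previous screen (P(Spam) = 0.5, P(New message|Spam) = 0.75, P(New message|Spam<sup>C</sup>) = 0.3334, P(New message) = 0.5417); the variable names are our own.

```python
p_spam, p_non_spam = 0.5, 0.5
p_msg_given_spam, p_msg_given_non_spam = 0.75, 0.3334
p_msg = 0.5417

# Scores without the division by P(New message).
score_spam = p_spam * p_msg_given_spam              # 0.375
score_non_spam = p_non_spam * p_msg_given_non_spam  # 0.1667

# Dividing both scores by the same positive number cannot flip the comparison.
same_winner = (score_spam > score_non_spam) == (
    score_spam / p_msg > score_non_spam / p_msg
)
print(same_winner)  # True

# By the law of total probability, the two scores add up to P(New message),
# so accurate probabilities can still be recovered later if they are needed.
print(round(score_spam + score_non_spam, 4))  # 0.5417
```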

Let's now use the optimized equations of the algorithm for our next exercise.

A new mobile message has been received: "URGENT!! You have one day left to claim your $873 prize." The following probabilities are known:

![image.png](attachment:15022f32-bb36-460f-ace0-72573c77bc35.png)

![image.png](attachment:7a881657-d25b-4be8-80f6-3182f30a40a9.png)

In [26]:
p_spam_given_new_message = 0.5 * 0.75
p_non_spam_given_new_message = 0.5 * 0.3334
classification = 'spam'

In [24]:
p_spam_given_new_message

0.375

In [25]:
p_non_spam_given_new_message 

0.1667

## A One-Word Message

On the previous screen, we optimized the algorithm and concluded that we can use these two optimized equations if all we're interested in is classifying messages (and not calculating accurate probabilities):

![image.png](attachment:503207af-f117-406b-8184-4c070b67b6e1.png)

We'll now look at how the algorithm can use messages that are already classified by humans to calculate the values it needs for:

![image.png](attachment:27b49e46-a6f1-46cf-9a25-e46d3447f8ef.png)

We'll start with some examples that may look a bit too simplistic and unrealistic, but they will make it easier to understand the mathematics behind the algorithm.

Let's say we have three messages that are already classified:

![image.png](attachment:48d405b0-9d10-4c2d-a99f-ca9e232361bf.png)

Now let's say the one-word message "secret" comes in and we want to use the Naive Bayes algorithm to classify it — to tell whether it's spam or non-spam.

![image.png](attachment:ecbd307e-5999-4dad-8959-fa77905404a0.png)

As we learned, we first need to answer these two probability questions (note that we changed New Message to "secret" inside the notation below) and then compare the values (recall that the ∝ symbol replaces the equals sign):

![image.png](attachment:610d4e5f-ab06-41b3-b0ef-52be7a1d3e2b.png)

Let's begin with the first equation, for which we need to find the values of P(Spam) and P("secret"|Spam). To find P(Spam), we use the messages that are already classified and divide the number of spam messages by the total number of messages. Two of the three messages are spam, so P(Spam) = 2/3.

To calculate P("secret"|Spam), we only look at the spam messages and divide the number of times the word "secret" occurred in all the spam messages by the total number of words in all the spam messages.

![image.png](attachment:452d85d2-e780-4702-a978-ff2399a6bc06.png)

Notice that "secret" occurs four times in the spam messages:

![image.png](attachment:8731a3c6-59a3-42e3-8e14-26a26ba04a12.png)

We have two spam messages and there's a total of seven words in all of them, so P("secret"|Spam) is:

![image.png](attachment:6c826d41-6c78-4c21-a856-aab16495fc8e.png)

Now that we know the values for P(Spam) and P("secret"|Spam), we have all we need to calculate P(Spam|"secret"):

![image.png](attachment:232bac18-c8f3-4891-9ac1-dbbce62336e4.png)
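The counting steps above can be sketched in code. The three messages below are hypothetical: they were invented only to match the counts stated in the text (two spam messages with seven words in total, four of them "secret", and one non-spam message). The helper names are our own as well.

```python
from fractions import Fraction

# Hypothetical labeled messages, invented to match the counts in the text.
messages = [
    ("secret money secret secret", "spam"),
    ("money secret place", "spam"),
    ("you know the secret", "non-spam"),
]

def p_label(label):
    """Share of the classified messages that carry the given label."""
    n = sum(1 for _, l in messages if l == label)
    return Fraction(n, len(messages))

def p_word_given_label(word, label):
    """Occurrences of word among all the words of messages with that label."""
    words = [w for text, l in messages if l == label for w in text.split()]
    return Fraction(words.count(word), len(words))

# P(Spam|"secret") ∝ P(Spam) * P("secret"|Spam)
score_spam = p_label("spam") * p_word_given_label("secret", "spam")
print(score_spam)  # 8/21
```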

For the exercise below, we'll take the same steps as above to calculate P(Spam<sup>C</sup>|"secret"). Then, we can compare the values of P(Spam<sup>C</sup>|"secret") and P(Spam|"secret") to classify the message "secret" as spam or non-spam.

Using the table below (these are the same messages as above), classify the message "secret" as spam or non-spam.

![image.png](attachment:5ea0ede5-f39c-4a15-90a2-cdcbd74a85d6.png)

![image.png](attachment:947f5316-fd6b-4a35-94b1-1941cb3f663e.png)

In [29]:
p_non_spam = 1/3
p_secret_given_non_spam = 1/4

p_non_spam_given_secret = p_non_spam * p_secret_given_non_spam
classification = 'spam'
p_non_spam_given_secret

0.08333333333333333

In [28]:
8/21

0.38095238095238093

## Multiple Words

On the previous screen, we used our algorithm to classify the message "secret", and we concluded it's spam. The message "secret" has only one word, but what about the situation where we have to classify messages that have more words?

Let's say we want to classify the message "secret place secret secret" based on four messages that are already classified (the four messages below are different from what we saw on the previous screen):

![image.png](attachment:61eab47f-e6a6-4eb3-b1e7-9266f0834678.png)

To calculate the probabilities we need, we'll treat each word in our new message separately. This means that the word "secret" at the beginning is different and separate from the word "secret" at the end. There are four words in the message "secret place secret secret", and we're going to abbreviate them "w1", "w2", "w3" and "w4" (the "w" comes from "word").

![image.png](attachment:08998c32-7303-471e-aadb-4120722b436b.png)

Since we treat each word separately, these are the two equations we can use to calculate the probabilities:

\begin{equation}
\begin{aligned}
P(Spam | w_1,w_2,w_3,w_4) &\propto P(Spam) \cdot P(w_1|Spam) \cdot P(w_2|Spam) \cdot P(w_3|Spam) \cdot P(w_4|Spam) \\
P(Spam^C | w_1,w_2,w_3,w_4) &\propto P(Spam^C) \cdot P(w_1|Spam^C) \cdot P(w_2|Spam^C) \cdot P(w_3|Spam^C) \cdot P(w_4|Spam^C)
\end{aligned}
\end{equation}

(On the next screen, we'll explain in detail why we're using these equations specifically.)

Let's begin with calculating P(Spam|w1, w2, w3, w4). To calculate the probabilities we need, we'll look at the four messages that are already classified. We have four messages and two of them are spam, so:

\begin{equation}
P(Spam) = \frac{2}{4} = \frac{1}{2}
\end{equation}

The first word, w1, is "secret", and we see that "secret" occurs four times in all spam messages. There's a total of seven words in all the spam messages, so:

\begin{equation}
P(w_1|Spam) = \frac{4}{7} 
\end{equation}

Applying a similar reasoning, we have:

![image.png](attachment:40e3f292-323b-44df-903d-090a30ddd622.png)

We now have all the probabilities we need to calculate P(Spam|w1, w2, w3, w4):

\begin{aligned}
P(Spam | w_1,w_2,w_3,w_4) &\propto P(Spam) \cdot P(w_1|Spam) \cdot P(w_2|Spam) \cdot P(w_3|Spam) \cdot P(w_4|Spam) \\
&= \frac{1}{2} \cdot \frac{4}{7} \cdot \frac{1}{7} \cdot \frac{4}{7} \cdot \frac{4}{7} = \frac{64}{4802} = 0.01333
\end{aligned}
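The same product can be computed by looping over the words of the new message. This is a minimal sketch that uses only the counts quoted above (seven words in all the spam messages, "secret" occurring four times and "place" once); the dictionary and variable names are our own, and words that don't appear in the new message are left out.

```python
from fractions import Fraction
from math import prod

p_spam = Fraction(2, 4)   # two of the four classified messages are spam
n_words_spam = 7          # total number of words in all the spam messages

# Occurrences in the spam messages (only the words we need here).
spam_word_counts = {"secret": 4, "place": 1}

message = "secret place secret secret"

# One factor P(w_i|Spam) per word; repeated words are counted every time.
likelihoods = [Fraction(spam_word_counts[w], n_words_spam)
               for w in message.split()]

score_spam = p_spam * prod(likelihoods)
print(score_spam)                    # 32/2401, which is 64/4802 in lowest terms
print(round(float(score_spam), 5))   # 0.01333
```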

![image.png](attachment:d9598533-33bd-481a-81df-664846674280.png)

Using the table below (the same as above), classify the message "secret place secret secret" as spam or non-spam.

![image.png](attachment:f7055705-dd14-400d-8c52-0e899b3ab30d.png)

![image.png](attachment:03453e3b-0a0a-4552-8317-5bdb14a8dd32.png)

In [33]:
p_spam_given_w1_w2_w3_w4 = 64/4802
p_spam_given_w1_w2_w3_w4 

0.013327780091628489

In [34]:
p_non_spam_given_w1_w2_w3_w4 = (2/4) * (2/9) * (1/9) * (2/9) * (2/9)
classification = 'spam'
p_non_spam_given_w1_w2_w3_w4

## Conditional Independence

On the previous screen, we introduced these two equations without much explanation:

\begin{equation}
P(Spam | w_1,w_2,w_3,w_4) \propto P(Spam) \cdot P(w_1|Spam) \cdot P(w_2|Spam) \cdot P(w_3|Spam) \cdot P(w_4|Spam) \\
P(Spam^C | w_1,w_2,w_3,w_4) \propto P(Spam^C) \cdot P(w_1|Spam^C) \cdot P(w_2|Spam^C) \cdot P(w_3|Spam^C) \cdot P(w_4|Spam^C) \\
\end{equation}

To explain the mathematics behind these equations, let's start by looking at P(Spam|w1, w2, w3, w4). Using the conditional probability formula, we can expand P(Spam|w1, w2, w3, w4) like this (below, make sure you notice the ∩ symbol in the numerator):

![image.png](attachment:d80aa2a1-6626-42cb-add7-74703738df7a.png)

Recall that we learned in a previous screen that we can ignore the division, which means we can drop P(w1, w2, w3, w4) to avoid redundant calculations (when we ignore the division, we also replace the equals sign with ∝, which means directly proportional):

![image.png](attachment:fb733425-dca8-462c-aa3d-18aa0c10e7ce.png)

Note that (w1, w2, w3, w4) can be modeled as an intersection of four events:

![image.png](attachment:0850ba1f-da61-472d-a132-53554f4d5474.png)

For instance, we could think of a message like "thanks for your help" as the intersection of four words inside a single message: "thanks", "for", "your", and "help". In probability jargon, finding the value of P(w1 ∩ w2 ∩ w3 ∩ w4) means finding the probability that the four words w1, w2, w3, w4 occur together in a single message. This is similar to P(A ∩ B ∩ C ∩ D), which is the probability that events A, B, C, and D occur together.

With this in mind, our equation above transforms to:

![image.png](attachment:12ca6c56-4366-437c-b5d4-e8f5406c56b6.png)

![image.png](attachment:15bc7d22-ce3a-4109-968a-56d344c8c149.png)

\begin{equation}
P(w_1 \cap w_2 \cap w_3 \cap w_4 \cap Spam) = P(w_1 | w_2 \cap w_3 \cap w_4 \cap Spam) \cdot \underbrace{P(w_2 | w_3 \cap w_4 \cap Spam) \cdot P(w_3 \cap w_4 \cap Spam)}_{\displaystyle P(w_2 \cap w_3 \cap w_4 \cap Spam)}
\end{equation}

We can use the multiplication rule successively, until there's nothing more left to expand:

\begin{aligned}
P(w_1 \cap w_2 \cap w_3 \cap w_4 \cap Spam) &= P(w_1 | w_2 \cap w_3 \cap w_4 \cap Spam) \cdot P(w_2 \cap w_3 \cap w_4 \cap Spam) \\
&= P(w_1 | w_2 \cap w_3 \cap w_4 \cap Spam) \cdot P(w_2 | w_3 \cap w_4 \cap Spam) \cdot P(w_3 \cap w_4 \cap Spam) \\
&= P(w_1 | w_2 \cap w_3 \cap w_4 \cap Spam) \cdot P(w_2 | w_3 \cap w_4 \cap Spam) \cdot P(w_3 | w_4 \cap Spam) \cdot P(w_4 \cap Spam) \\
&= P(w_1 | w_2 \cap w_3 \cap w_4 \cap Spam) \cdot P(w_2 | w_3 \cap w_4 \cap Spam) \cdot P(w_3 | w_4 \cap Spam) \cdot P(w_4|Spam) \cdot P(Spam) \\
\end{aligned}

In theory, the last equation above is what we'd have to use if we wanted to calculate P(Spam|w1, w2, w3, w4). However, the equation is already long for just four words. Imagine how the equation would look for a 50-word message, and how many calculations we'd have to perform.
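The successive use of the multiplication rule can be checked numerically. The sketch below uses a made-up joint distribution over three binary events A, B, and C (the weights are arbitrary and are not taken from the lesson) and confirms that P(A ∩ B ∩ C) equals P(A|B ∩ C) · P(B|C) · P(C).

```python
from fractions import Fraction
from itertools import product

# Made-up joint distribution over three binary events (A, B, C).
# The eight weights are arbitrary and sum to 20.
weights = [1, 2, 3, 4, 1, 3, 2, 4]
joint = {outcome: Fraction(w, 20)
         for outcome, w in zip(product([True, False], repeat=3), weights)}


def prob(predicate):
    """Probability that an outcome (a, b, c) satisfies predicate."""
    return sum(p for outcome, p in joint.items() if predicate(*outcome))


p_abc = prob(lambda a, b, c: a and b and c)
p_bc = prob(lambda a, b, c: b and c)
p_c = prob(lambda a, b, c: c)

# Chain rule: P(A ∩ B ∩ C) = P(A | B ∩ C) * P(B | C) * P(C)
p_a_given_bc = p_abc / p_bc
p_b_given_c = p_bc / p_c
chain = p_a_given_bc * p_b_given_c * p_c

print(chain == p_abc)
```

The same telescoping pattern is what produces the five-factor expansion for four words and the Spam event.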

To make the calculations tractable for messages of any length, we can assume conditional independence between w1, w2, w3, and w4. This implies that:

![image.png](attachment:280ffe7f-3aa4-434d-b6ab-3a0f058f0f37.png)

Under the assumption of conditional independence, our lengthy equation above reduces to:

\begin{equation}
P(w_1 \cap w_2 \cap w_3 \cap w_4 \cap Spam) = P(w_1|Spam) \cdot P(w_2|Spam) \cdot P(w_3|Spam) \cdot P(w_4|Spam) \cdot P(Spam)
\end{equation}

The assumption of conditional independence is unrealistic in practice because words are often in a relationship of dependence. For instance, if you see the word "WINNER" in a message, the probability of seeing the word "money" is very likely to increase, so "WINNER" and "money" are most likely dependent. The assumption of conditional independence between words is thus naive since it rarely holds in practice, and this is why the algorithm is called Naive Bayes (also called simple Bayes or independence Bayes).

Despite this simplifying assumption, the algorithm works quite well in many real-world situations, and we'll see that ourselves in the guided project.

That being said, on the previous screen we assumed conditional independence when we introduced these two equations:

\begin{equation}
P(Spam | w_1,w_2,w_3,w_4) \propto P(Spam) \cdot P(w_1|Spam) \cdot P(w_2|Spam) \cdot P(w_3|Spam) \cdot P(w_4|Spam) \\
P(Spam^C | w_1,w_2,w_3,w_4) \propto P(Spam^C) \cdot P(w_1|Spam^C) \cdot P(w_2|Spam^C) \cdot P(w_3|Spam^C) \cdot P(w_4|Spam^C) \\
\end{equation}

On the next screen, we'll make the equations more general to account for any number of words.

##  A General Equation

On the previous screen, we learned about the conditional independence assumption, which is central to the Naive Bayes algorithm. As a result of the assumption, we saw we can use these simplified equations:

\begin{equation}
P(Spam | w_1,w_2,w_3,w_4) \propto P(Spam) \cdot P(w_1|Spam) \cdot P(w_2|Spam) \cdot P(w_3|Spam) \cdot P(w_4|Spam) \\
P(Spam^C | w_1,w_2,w_3,w_4) \propto P(Spam^C) \cdot P(w_1|Spam^C) \cdot P(w_2|Spam^C) \cdot P(w_3|Spam^C) \cdot P(w_4|Spam^C) \\
\end{equation}

The equations above work for messages that have four words, but we need a more general form to use with messages of various word lengths.

A new message has n words, where n can be any positive integer (1, 2, 3, ..., 50, 51, 52, ...). If we wanted to find P(Spam|w1, w2, ..., wn), then this is an equation we could use:

\begin{equation}
P(Spam | w_1,w_2, \ldots, w_n) \propto P(Spam) \cdot P(w_1|Spam) \cdot P(w_2|Spam) \cdot \ldots \cdot P(w_n|Spam)
\end{equation}

Notice that there's a certain pattern in the equation above — after P(Spam), the only thing that changes is the word number.

\begin{equation}
P(Spam | w_1,w_2, \ldots, w_n) \propto P(Spam) \cdot P(\overbrace{w_1}^{1}|Spam) \cdot P(\overbrace{w_2}^{2}|Spam) \cdot \ldots \cdot P(\overbrace{w_n}^{n}|Spam)
\end{equation}

 ![image.png](attachment:dcd9bb8d-7da5-4b4f-985a-4dc813ecbdc7.png)

The equation above is the same as:

\begin{equation}
P(Spam | w_1,w_2, \ldots, w_n) \propto P(Spam) \cdot \overbrace{P(w_1|Spam) \cdot P(w_2|Spam) \cdot \ldots \cdot P(w_n|Spam)}^{\displaystyle \prod_{i=1}^{n}P(w_i|Spam)}
\end{equation}

![image.png](attachment:5afd55ce-e129-46cf-a209-5f112536a20d.png)
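The product form maps directly onto code. Below is a minimal sketch, assuming the per-word probabilities have already been computed; `naive_bayes_score` is a hypothetical helper name, and `math.prod` requires Python 3.8 or later.

```python
import math


def naive_bayes_score(prior, word_probabilities):
    """Return prior * product of the per-word conditional probabilities."""
    return prior * math.prod(word_probabilities)


# Values from the four-word example: P(Spam) = 1/2 and
# P(w_i|Spam) = 4/7, 1/7, 4/7, 4/7 for "secret place secret secret".
score = naive_bayes_score(1/2, [4/7, 1/7, 4/7, 4/7])
print(score)
```

Because the function takes a list of any length, the same call works for a message with n words, which is the point of the general equation.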

Now that we have these general equations, we're going to discuss a few edge cases on the next few screens. After this, we'll be ready to start working on the guided project, where we'll work with over 5,000 real messages.

##  Edge Cases

On a previous screen, we looked at a few messages that were already classified:

![image.png](attachment:912fd7e1-edc1-41e8-89b6-20f560514a3f.png)

Above, we have four messages and nine unique words: "secret", "party", "at", "my", "place", "money", "you", "know", "the". We call the set of unique words a vocabulary.

Now, what if we receive a new message that contains words which are not part of the vocabulary? How do we calculate probabilities for these kinds of words?

For instance, say we received the message "secret code to unlock the money".

![image.png](attachment:4f6d0030-5568-4d78-b72a-2f0c86e7a701.png)

Notice that for this new message:

- The words "code", "to", and "unlock" are not part of the vocabulary.
- The word "secret" is part of both spam and non-spam messages.
- The word "money" is only part of the spam messages and is missing from the non-spam messages.
- The word "the" is missing from the spam messages and is only part of the non-spam messages.

Whenever we have to deal with words that are not part of the vocabulary, one solution is to ignore them when we're calculating probabilities. If we wanted to calculate P(Spam|"secret code to unlock the money"), we could skip calculating P("code"|Spam), P("to"|Spam), and P("unlock"|Spam) because "code", "to", and "unlock" are not part of the vocabulary:

\begin{equation}
P(Spam|\text{"secret code to unlock the money"}) \propto P(Spam) \cdot {P(\text{"secret"}|Spam) \cdot P(\text{"the"}|Spam) \cdot P(\text{"money"}|Spam)}
\end{equation}

We can also apply the same reasoning for calculating P(SpamC|"secret code to unlock the money"):

\begin{equation}
P(Spam^C|\text{"secret code to unlock the money"}) \propto P(Spam^C) \cdot P(\text{"secret"}|Spam^C) \cdot P(\text{"the"}|Spam^C) \cdot P(\text{"money"}|Spam^C)
\end{equation}
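Ignoring out-of-vocabulary words amounts to filtering the message before the product is taken. Here is a minimal sketch of that step; `known_words` is a hypothetical helper, and splitting on whitespace is an assumption about how messages are tokenized.

```python
# The nine-word vocabulary listed in the text.
vocabulary = {"secret", "party", "at", "my", "place",
              "money", "you", "know", "the"}


def known_words(message, vocabulary):
    """Return the words of message that belong to vocabulary, in order."""
    return [word for word in message.split() if word in vocabulary]


kept = known_words("secret code to unlock the money", vocabulary)
print(kept)
```

Only the words that survive the filter contribute a factor P(word|Spam) or P(word|Spam^C) to the product.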

Let's now calculate P(Spam|"secret code to unlock the money") and P(SpamC|"secret code to unlock the money"), and see what we get.

P(Spam|"secret code to unlock the money") is already calculated for you. Use the table below (the same as above) to calculate P(SpamC|"secret code to unlock the money").

![image.png](attachment:da4a181c-ce27-4163-8144-2f5c8eab46dd.png)

1. Calculate P(SpamC|"secret code to unlock the money"). Assign your answer to p_non_spam_given_message.
2. Print p_spam_given_message and p_non_spam_given_message. Why do you think we got these values? We'll discuss this on the next screen.

In [35]:
p_spam = 2/4
p_secret_given_spam = 4/7
p_the_given_spam = 0/7
p_money_given_spam = 2/7
p_spam_given_message = (p_spam * p_secret_given_spam *
                        p_the_given_spam * p_money_given_spam)
p_non_spam = 2/4
p_secret_given_non_spam = 2/9
p_the_given_non_spam = 1/9
p_money_given_non_spam = 0/9
p_non_spam_given_message = (p_non_spam * p_secret_given_non_spam *
                            p_the_given_non_spam * p_money_given_non_spam)


print(p_spam_given_message)
print(p_non_spam_given_message)

0.0
0.0


In the previous exercise, we saw that both P(Spam|"secret code to unlock the money") and P(SpamC|"secret code to unlock the money") were equal to 0. This happens whenever a message contains a word that never occurs in one of the categories: "the" occurs only in non-spam messages, which makes the spam product 0, while "money" occurs only in spam messages, which makes the non-spam product 0.

![image.png](attachment:0b8e6499-9ce8-4236-bd54-1425ee34a2e0.png)

When we calculate P(Spam|"secret code to unlock the money"), we can see that P("the"|Spam) is equal to 0 because "the" is not part of the spam messages. Unfortunately, that single value of 0 has the drawback of turning the result of the entire equation to 0:

\begin{aligned}
P(Spam|\text{"secret code to unlock the money"}) &\propto P(Spam) \cdot P(\text{"secret"}|Spam) \cdot P(\text{"the"}|Spam) \cdot P(\text{"money"}|Spam) \\
&= \frac{2}{4} \cdot \frac{4}{7} \cdot \frac{0}{7} \cdot \frac{2}{7} = 0
\end{aligned}

To fix this problem, we need to find a way to avoid these cases where we get probabilities of 0. Let's start by laying out the equation we're using to calculate P("the"|Spam):

\begin{equation}
P(\text{"the"}|Spam) = \frac{\text{total number of times "the" occurs in spam messages}}{\text{total number of words in spam messages}} = \frac{0}{7}
\end{equation}

We're going to add some notation and rewrite the equation above as:

\begin{equation}
P(\text{"the"}|Spam) = \frac{N_{\text{"the"}|Spam}}{N_{Spam}} = \frac{0}{7}
\end{equation}

![image.png](attachment:70e220d7-14d4-4587-a1b8-1d7a25b17053.png)

The additive smoothing technique solves the issue and gets us a non-zero result, but it introduces another problem. We're now calculating probabilities differently depending on the word — take P("the"|Spam) and P("secret"|Spam) for instance:

![image.png](attachment:c47d5330-521d-42fc-a4e2-6f52c632ade4.png)

Words like "the" are thus given special treatment and their probabilities are increased artificially to avoid zero values, while words like "secret" are treated normally. To keep the probability values proportional across all words, we're going to use additive smoothing for every word:

![image.png](attachment:7fc9f7d1-102b-4842-8db5-cc35463ea62a.png)

In more general terms, this is the equation that we'll need to use for every word:

\begin{equation}
P(word|Spam) = \frac{N_{\text{word}|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}
\end{equation}

![image.png](attachment:b0915f58-b149-4772-a505-fc5b47d476c3.png)

You can read more about additive smoothing [here](https://en.wikipedia.org/wiki/Additive_smoothing).
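The smoothing equation can be written as a small function. This is a sketch under the assumption that α = 1, which is the value implied by the `+ 1` and `+ 9` terms in the worked code of this lesson; `smoothed_probability` and its parameter names are made up for illustration.

```python
def smoothed_probability(word_count, class_word_total, vocabulary_size, alpha=1):
    """Additive smoothing: (N_word|class + alpha) / (N_class + alpha * N_vocabulary)."""
    return (word_count + alpha) / (class_word_total + alpha * vocabulary_size)


# Spam has 7 words in total and the vocabulary has 9 words.
p_the_given_spam = smoothed_probability(0, 7, 9)      # (0 + 1) / (7 + 9)
p_secret_given_spam = smoothed_probability(4, 7, 9)   # (4 + 1) / (7 + 9)

# Spam counts over the whole vocabulary: "secret" 4, "place" 1, "money" 2,
# and 0 for the six remaining words.
spam_word_counts = [4, 1, 2, 0, 0, 0, 0, 0, 0]
total = sum(smoothed_probability(count, 7, 9) for count in spam_word_counts)
print(p_the_given_spam, p_secret_given_spam, total)
```

The final sum illustrates why the denominator adds α · N_Vocabulary: with that term, the smoothed probabilities over the whole vocabulary still add up to 1.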

Let's now recalculate the probabilities for the message "secret code to unlock the money" and try to classify the message.

P(Spam|"secret code to unlock the money") is already calculated for you. Use the table below (the same as above) to calculate P(SpamC|"secret code to unlock the money").

![image.png](attachment:5c71b94b-dbe5-4440-9c21-a0a9e8e6a20e.png)

1. Using the additive smoothing technique, calculate P(SpamC|"secret code to unlock the money"). Assign your answer to p_non_spam_given_message.
2. Compare p_spam_given_message and p_non_spam_given_message to classify the message as spam or non-spam. If you think it's spam, then assign the string 'spam' to classification. Otherwise, assign 'non-spam'.

In [41]:
p_spam = 2/4
p_secret_given_spam = (4 + 1) / (7 + 9)
p_the_given_spam = (0 + 1) / (7 + 9)
p_money_given_spam = (2 + 1) / (7 + 9)
p_spam_given_message = (p_spam * p_secret_given_spam *
                        p_the_given_spam * p_money_given_spam)
p_non_spam_given_message = 2/4 * (3/18 * 2/18 * 1/18)
classification = 'spam'

In [39]:
p_spam_given_message

0.0018310546875

In [40]:
p_non_spam_given_message

0.0005144032921810699
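The two values printed above are proportional scores, not probabilities, because the division by the probability of the message was dropped. As a hedged illustration, the sketch below restores that division by assuming the denominator is the sum of the two scores (the law of total probability applied within the model). The variable names are hypothetical.

```python
# Proportional scores taken from the outputs above.
spam_score = 0.0018310546875
non_spam_score = 0.0005144032921810699

# Under the model, P(message) is the sum of the two joint scores.
evidence = spam_score + non_spam_score
posterior_spam = spam_score / evidence
posterior_non_spam = non_spam_score / evidence

print(posterior_spam, posterior_non_spam)
```

Dividing both scores by the same positive number cannot change which one is larger, which is why the classification can be made from the proportional scores alone.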

## Multinomial Naive Bayes

The Naive Bayes algorithm can be used for more than just building spam filters. For instance, we could use it to perform sentiment analysis for Twitter messages — the input is a Twitter message, and the output is the sentiment type (positive or negative). This follows the same pattern we saw with our spam filter, where the input is a new SMS message and the output is the message type (spam or non-spam).

Depending on the math and the assumptions used, the Naive Bayes algorithm has a few variations. The three most popular Naive Bayes algorithms are:

- Multinomial Naive Bayes
- Gaussian Naive Bayes
- Bernoulli Naive Bayes

In this lesson, we learned the multinomial Naive Bayes version of the algorithm. Explaining the mathematical differences between the various versions is out of the scope of this course, but it's important to keep in mind that all the Naive Bayes algorithms build on the (naive) conditional independence assumption we learned about earlier in this lesson.

On the next screen, we'll summarize everything we've done so far, and then we'll wrap up this lesson!

To summarize everything we've done so far, these are the two equations we can use for our spam filtering problem moving forward:

\begin{equation}
P(Spam | w_1,w_2, \ldots, w_n) \propto P(Spam) \cdot \prod_{i=1}^{n}P(w_i|Spam)
\end{equation}

\begin{equation}
P(Spam^C | w_1,w_2, \ldots, w_n) \propto P(Spam^C) \cdot \prod_{i=1}^{n}P(w_i|Spam^C)
\end{equation}

To calculate P(wi|Spam) and P(wi|SpamC), we need to use the additive smoothing technique:

\begin{equation}
P(w_i|Spam) = \frac{N_{w_i|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}
\end{equation}

\begin{equation}
P(w_i|Spam^C) = \frac{N_{w_i|Spam^C} + \alpha}{N_{Spam^C} + \alpha \cdot N_{Vocabulary}}
\end{equation}

Let's also summarize what the terms in the equations above mean:

![image.png](attachment:70108b88-055e-45d0-9577-214550dd270a.png)

It's worth emphasizing that:

![image.png](attachment:a122a0a2-700d-4a75-b8c3-c5040803cbc8.png)
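To tie the summary together, here is a minimal end-to-end sketch that derives the counts from labelled messages and then scores a new message. The training messages are an assumption: the lesson shows them only as an image, so the texts below were chosen to reproduce the counts it quotes (seven spam words, nine non-spam words, nine unique words). The functions `train` and `score` are hypothetical names, and α = 1 is assumed.

```python
from collections import Counter
from fractions import Fraction

# Assumed training messages, chosen to match the counts quoted in the text.
training = [
    ("secret money secret secret", "spam"),
    ("money secret place", "spam"),
    ("secret party at my place", "non-spam"),
    ("you know the secret", "non-spam"),
]


def train(labelled_messages):
    """Count messages and words per label, and collect the vocabulary."""
    message_counts = Counter()
    word_counts = {}
    for text, label in labelled_messages:
        message_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.split())
    vocabulary = set().union(*word_counts.values())
    return message_counts, word_counts, vocabulary


def score(text, label, message_counts, word_counts, vocabulary, alpha=1):
    """Prior times the product of smoothed P(w_i|label), skipping unknown words."""
    n_label = sum(word_counts[label].values())
    result = Fraction(message_counts[label], sum(message_counts.values()))
    for word in text.split():
        if word in vocabulary:
            result *= Fraction(word_counts[label][word] + alpha,
                               n_label + alpha * len(vocabulary))
    return result


model = train(training)
scores = {label: score("secret code to unlock the money", label, *model)
          for label in ("spam", "non-spam")}
classification = max(scores, key=scores.get)
print(scores, classification)
```

With these assumed messages, the scores agree with the values computed by hand earlier in the lesson (15/8192 ≈ 0.00183 for spam and 1/1944 ≈ 0.00051 for non-spam).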

In the next lesson, which is a guided project, we'll use the multinomial Naive Bayes algorithm to create a spam filter, and we'll use [a dataset](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection) of over 5,000 SMS messages.