# [CPSC 322]() Data Science Algorithms
[Gonzaga University](https://www.gonzaga.edu/) |
[Sophina Luitel](https://www.gonzaga.edu/school-of-engineering-applied-science/faculty/detail/sophina-luitel-phd-0dba6a9d)

---

# Naive Bayes
What are our learning objectives for this lesson?
* Learn about Bayes Theorem
* Learn about the Naive Bayes classification algorithm

Content used in this lesson is based upon information in the following sources:
* Dr. Gina Sprint's Data Science Algorithms notes

## Today 10/22
* Announcements
    * LA7 is posted and is due at the end of the day on Thursday
    * Note on midterm grades: They are a combination of LAs, PAs, and IQs. I will be entering 0s for any missing LA1–6, PA1–2, and IQ1–4.
* Naive Bayes Lab tasks #1 & #2


## Naive Bayes Classification
Basic ideas
* Predict class labels based on probabilities (statistics)
* Naive Bayes comparable in performance to "fancier" approaches
* Relatively efficient on large datasets
* Assumes "conditional independence"
    * Effect of one attribute on a class is independent from other attributes
    * This is why it is called "naive."
    * Helps with execution time (speed)

### Probability Basics
* A conditional probability is the probability of an event given that some other event has occurred. It is defined by the following formula:

$$
P(A|B) = \frac{P(A \cap B)}{P(B)}
$$

Where:

- $P(A \cap B)$ = probability that both A and B occur  
- $P(B)$ = probability that B occurs, which must be nonzero
  
**Example:**  
- Deck of 52 cards:  
  - Let $A$ = card is a King  
  - Let $B$ = card is a Spade  

If you know the card is a Spade, what is the probability that it is a King?  
Since $P(B) = \frac{13}{52} = \frac{1}{4}$ and $P(A \cap B) = \frac{1}{52}$ (only the King of Spades is both a King and a Spade):
$$
P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{1/52}{1/4} = \frac{1}{13}
$$
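
The same result can be checked by counting cards. The following is a minimal sketch; the helper `prob` and the deck representation are illustrative choices, not part of the original notes.

```python
from fractions import Fraction
from itertools import product

# Build a standard 52-card deck as (rank, suit) pairs
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["Spades", "Hearts", "Diamonds", "Clubs"]
deck = list(product(ranks, suits))

def prob(event, outcomes):
    """Probability of an event (a predicate) over equally likely outcomes."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def is_king(card):
    return card[0] == "K"

def is_spade(card):
    return card[1] == "Spades"

p_b = prob(is_spade, deck)                                    # P(B) = 1/4
p_a_and_b = prob(lambda c: is_king(c) and is_spade(c), deck)  # P(A and B) = 1/52
p_a_given_b = p_a_and_b / p_b                                 # P(A|B)
print(p_a_given_b)  # 1/13
```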


## Bayes Theorem 
If $P(A \cap B)$ is the probability that both $A$ and $B$ occur, then:
$$P(A \cap B) = P(A|B)P(B) = P(B|A)P(A)$$
In other words:
* Let's say $A$ occurs $x$% of the time given (within) $B$
* And $B$ occurs $y$% of the time
* Then $A$ and $B$ occur together, i.e., $A \cap B$: $x$% $\cdot y$% of the time


For example:
* Assume we have a bucket of Lego bricks
* 50% of the 1x2 bricks are Red
* 10% of the bricks in the bucket are 1x2's
* Then, 50% of the 10% of 1x2's are Red-1x2's (i.e., 50% $\cdot$ 10%)  
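
Written with the product rule, and using $R$ for "brick is Red" and $T$ for "brick is a 1x2" (symbols introduced here only for illustration):

$$
P(R \cap T) = P(R|T)P(T) = 0.5 \cdot 0.1 = 0.05
$$

So 5% of the bricks in the bucket are Red 1x2's.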

This shows how we can compute the probability of two events happening together using conditional probability.
Now, Bayes’ theorem simply rearranges this relationship to find the reverse conditional probability: 

$$
P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{P(B|A) P(A)}{P(B)}
$$

- The probability of $A$ given $B$ can be computed using the reverse conditional probability $P(B|A)$ and the prior $P(A)$.  
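
As a quick check, Bayes Theorem can be applied to the card example (King given Spade) by going through the reverse conditional $P(B|A)$. This is a small sketch; the function name `bayes` is an assumption made for illustration.

```python
from fractions import Fraction

def bayes(p_b_given_a, p_a, p_b):
    """Return P(A|B) from P(B|A), P(A), and P(B) using Bayes Theorem."""
    if p_b == 0:
        raise ValueError("P(B) must be nonzero")
    return p_b_given_a * p_a / p_b

p_king = Fraction(4, 52)             # P(A): 4 Kings in 52 cards
p_spade = Fraction(13, 52)           # P(B): 13 Spades in 52 cards
p_spade_given_king = Fraction(1, 4)  # P(B|A): 1 of the 4 Kings is a Spade

p_king_given_spade = bayes(p_spade_given_king, p_king, p_spade)
print(p_king_given_spade)  # 1/13
```

The answer matches the value obtained directly from the definition of conditional probability.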

### Bayes Theorem for Classification
- Let:  
  - $H$ = hypothesis that an instance belongs to a class $C$  
  - $X$ = observed attribute values of the instance  

Bayes Theorem becomes:
$$
P(H|X) = \frac{P(X|H) P(H)}{P(X)}
$$

Where:  
$P(H)$ ... the probability of event $H$
* $H$ (hypothesis) for us would be that any given instance is of a class $C$
* Called the prior probability

$P(X)$ ... the probability of event $X$
* For us, $X$ would be an instance (a row in a table)
* The probability that an instance would have $X$'s attribute values
  
$P(X|H)$ ... the conditional probability of $X$ given $H$
* The probability of $X$'s attribute values assuming we know it is of class $C$

$P(H|X)$ ... the conditional probability of $H$ given $X$
* The probability that $X$ is of class $C$ given $X$'s attribute values
* A posterior probability
* This is the one we want to know to make predictions!
    * i.e., we want the $C$ that gives the highest probability
* We can estimate $P(H)$, $P(X)$, and $P(X|H)$ from the training set
* From these, we can use Bayes Theorem to estimate $P(H|X)$: 

**Prediction:**  
- Compute $P(H|X)$ for all classes $C$.  
- Assign the instance to the class with the **highest posterior probability**.  

> Estimates of $P(H)$, $P(X|H)$, and $P(X)$ are computed from the training dataset.
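
The prediction step might look like the sketch below. The priors and likelihoods are hypothetical numbers (not taken from any dataset in these notes), and $P(X)$ is obtained here by summing $P(X|C)P(C)$ over the classes, which assumes the classes are mutually exclusive and cover every instance.

```python
from fractions import Fraction

def posteriors(priors, likelihoods):
    """Return P(C|X) for each class C from P(C) and P(X|C)."""
    # Numerators of Bayes Theorem: P(X|C) * P(C)
    joint = {c: likelihoods[c] * priors[c] for c in priors}
    # P(X) via the law of total probability (sum over all classes)
    p_x = sum(joint.values())
    return {c: joint[c] / p_x for c in joint}

# Hypothetical estimates, for illustration only
priors = {"yes": Fraction(3, 5), "no": Fraction(2, 5)}       # P(H)
likelihoods = {"yes": Fraction(1, 6), "no": Fraction(1, 2)}  # P(X|H)

post = posteriors(priors, likelihoods)
prediction = max(post, key=post.get)  # class with the highest posterior
print(post["yes"], post["no"], prediction)  # 1/3 2/3 no
```

Note that "yes" has the larger prior here, but "no" wins once the likelihood of the observed attribute values is taken into account.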
  

## Classification Approach
Basic Approach:
1. We're given an instance $X = [v_1, v_2, ..., v_n]$ to classify
1. Among the classes $C_1, C_2, ... , C_m$, we want to find the class $C_i$ such that:
$$P(C_i|X) > P(C_j|X) \: \textrm{for} \: 1 \leq j \leq m, j \neq i$$
In other words, we want to find the class $C_i$ with the largest $P(C_i|X)$
1. Use Bayes Theorem to score each class, i.e., for each $C_i$ calculate:
$$P(C_i|X) \propto P(X|C_i)P(C_i)$$
We leave out the denominator $P(X)$ since it is the same for all classes ($C_i$'s), so it does not change which class has the largest value
1. We estimate $P(C)$ as the percentage of $C$-labeled rows in the training set
$$P(C) = \frac{|C|}{|D|}$$
where $|C|$ is the number of instances labeled $C$ in the training set and $|D|$ is the training set size

1. We estimate $P(X|C_i)$ using the independence assumption of attributes:
$$P(X|C_i) = \prod_{k=1}^{n}P(v_k|C_i)$$

Expanding this gives:
$$
P(X|C_i) = P(v_1|C_i) \times P(v_2|C_i) \times \cdots \times P(v_n|C_i)
$$
If attribute $k$ is categorical  
* We estimate $P(v_k|C_i)$ as the percentage of instances with value $v_k$ (in attribute $k$) across training set instances of class $C_i$
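
The steps above can be sketched in Python as follows. This is a minimal illustration for categorical attributes only, not a reference implementation; the function names `fit` and `predict` and the toy training data are made up for this example.

```python
from collections import Counter, defaultdict
from fractions import Fraction

def fit(rows, labels):
    """Count what is needed to estimate P(C) and P(v_k|C)."""
    class_counts = Counter(labels)
    priors = {c: Fraction(n, len(labels)) for c, n in class_counts.items()}
    # value_counts[c][k][v] = number of class-c rows with value v in attribute k
    value_counts = defaultdict(lambda: defaultdict(Counter))
    for row, c in zip(rows, labels):
        for k, v in enumerate(row):
            value_counts[c][k][v] += 1
    return priors, value_counts, class_counts

def predict(x, priors, value_counts, class_counts):
    """Return (best class, scores), where score = P(C) * prod_k P(v_k|C)."""
    scores = {}
    for c, prior in priors.items():
        score = prior
        for k, v in enumerate(x):
            # P(v_k|C): fraction of class-c rows with value v in attribute k
            score *= Fraction(value_counts[c][k][v], class_counts[c])
        scores[c] = score
    return max(scores, key=scores.get), scores

# Hypothetical toy training set: two categorical attributes, two classes
rows = [["a", "x"], ["a", "y"], ["b", "x"], ["b", "y"], ["a", "x"]]
labels = ["pos", "pos", "neg", "neg", "neg"]

model = fit(rows, labels)
label, scores = predict(["a", "x"], *model)
print(label)  # pos
```

The scores are proportional to the posteriors, not equal to them, because $P(X)$ is left out. Exact fractions are used so the arithmetic is easy to compare with a hand calculation.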
    
Some notes:
* Step 5 is an optimization: estimating $P(X|C_i)$ directly means comparing entire rows, which is expensive (especially with many attributes)
* For smaller datasets, there may also be no training rows that match $X$ exactly
* Can extend the approach to support continuous attributes...
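
To illustrate the first two notes, the sketch below contrasts estimating $P(X|C)$ by matching entire rows with the per-attribute product from step 5. The function names and the toy data are hypothetical and chosen only for this example.

```python
from fractions import Fraction

def row_match_likelihood(x, rows, labels, c):
    """Estimate P(X|C) as the fraction of class-c rows that equal x exactly."""
    class_rows = [row for row, label in zip(rows, labels) if label == c]
    matches = sum(1 for row in class_rows if row == x)
    return Fraction(matches, len(class_rows))

def naive_likelihood(x, rows, labels, c):
    """Estimate P(X|C) as the product of per-attribute estimates P(v_k|C)."""
    class_rows = [row for row, label in zip(rows, labels) if label == c]
    result = Fraction(1)
    for k, v in enumerate(x):
        count = sum(1 for row in class_rows if row[k] == v)
        result *= Fraction(count, len(class_rows))
    return result

# Hypothetical toy training set
rows = [["a", "x"], ["a", "y"], ["b", "x"], ["b", "y"], ["a", "x"]]
labels = ["pos", "pos", "neg", "neg", "neg"]

x = ["a", "y"]  # this exact row never appears with label "neg"
print(row_match_likelihood(x, rows, labels, "neg"))  # 0
print(naive_likelihood(x, rows, labels, "neg"))      # 1/9
```

With whole-row matching, an unseen combination gets probability 0 even though each of its values does occur in the class; the per-attribute product still gives a nonzero estimate.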

## Lab Tasks
### Lab Task 1
Consider the following labeled dataset, where result denotes class information and the remaining columns have categorical values.

|att1|att2|result|
|-|-|-|
|1|5|yes|
|2|6|yes|
|1|5|no|
|1|5|no|
|1|6|yes|
|2|6|no|
|1|5|yes|
|1|6|yes|

1. Compute the priors for the dataset (e.g., what are $P(result = yes)$ and $P(result = no)$?)
1. Compute the conditional probabilities for the dataset by making a table like Bramer 3.2 (e.g., what are $P(att1 = 1|result = yes)$, $P(att1 = 2|result = yes)$, $P(att2 = 5|result = yes)$, ...?)
1. If $X = [1, 5]$, what are $P(result = yes|X)$ and $P(result = no|X)$ assuming conditional independence? Show your work.
    1. What would the class label prediction be for the instance $X = [1, 5]$? Show your work.
1. If $X = [1, 5]$, what are $P(result = yes|X)$ and $P(result = no|X)$ *without* assuming conditional independence? Show your work.
    1. What would the class label prediction be for the instance $X = [1, 5]$? Show your work.

### Lab Task 2
Example adapted from [this Naive Bayes example](https://www.geeksforgeeks.org/naive-bayes-classifiers/)

Suppose we have the following dataset that has four attributes and a class attribute (PLAY GOLF):

|OUTLOOK	|TEMPERATURE	|HUMIDITY	|WINDY	|PLAY GOLF|
|-|-|-|-|-|
|Rainy	|Hot	|High	|False	|No|
|Rainy	|Hot	|High	|True	|No|
|Overcast	|Hot	|High	|False	|Yes|
|Sunny	|Mild	|High	|False	|Yes|
|Sunny	|Cool	|Normal	|False	|Yes|
|Sunny	|Cool	|Normal	|True	|No|
|Overcast	|Cool	|Normal	|True	|Yes|
|Rainy	|Mild	|High	|False	|No|
|Rainy	|Cool	|Normal	|False	|Yes|
|Sunny	|Mild	|Normal	|False	|Yes|
|Rainy	|Mild	|Normal	|True	|Yes|
|Overcast	|Mild	|High	|True	|Yes|
|Overcast	|Hot	|Normal	|False	|Yes|
|Sunny	|Mild	|High	|True	|No|

Suppose we have a new instance X = \[Sunny, Hot, Normal, False\]. 
1. What is $P(\textrm{PLAY GOLF} = \textrm{Yes}|X)$? 
1. What is $P(\textrm{PLAY GOLF} = \textrm{No}|X)$? 
1. What is the prediction for X?