# Lab 08: Naive Bayes Classifier

Overview of the lab:
* Bayesian Statistics & Bayes' Theorem
* Naive Bayes Classifier
* Conditional Independence Assumption
* Zero-Probability Problem & Laplace Smoothing
* Practical Implementation

----

## Introduction to Naive Bayes

The **Naive Bayes classifier** is a probabilistic machine learning algorithm based on **Bayes' Theorem**. It is called "naive" because it makes the strong assumption that all features are **conditionally independent** given the class label.

### Where does it fit in Machine Learning?

* **Supervised Learning**: Uses labeled training data in the format `<input data, true label>`
* **Classification**: Predicts the class/category of new examples
* **Probabilistic Approach**: Unlike KNN (which uses distance), Naive Bayes uses probability theory

### Common Applications:
* Spam email detection
* Sentiment analysis
* Document classification
* Medical diagnosis
* Real-time prediction (fast training and prediction)

## Machine Learning Recap

### Data Terminology
* **Training data**: Data used to train the algorithm
* **Validation data**: Data used to tune hyperparameters
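* **Test data**: Previously unseen data used to measure how well the trained model generalizes
* **k-fold Cross-Validation**: Split the training data into k folds (typically k=5 or k=10), train on k-1 folds, validate on the remaining fold, rotate through all k folds, and average the scores
* **Bias**: Error caused by a model that is too simple to capture the underlying relationship in the data (underfitting)
* **Variance**: How much the model's predictions change when it is trained on a different sample of data (high variance leads to overfitting)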

-----

## Bayes' Theorem

**Bayes' Theorem** is the foundation of the Naive Bayes classifier. It allows us to calculate the probability of an event based on prior knowledge of conditions related to that event.

### Probability Basics

For two events A and B:

$$P(A|B) = \frac{P(A \cap B)}{P(B)}$$

$$P(B|A) = \frac{P(A \cap B)}{P(A)}$$

From these, we get:

$$P(A \cap B) = P(B|A) \cdot P(A) = P(A|B) \cdot P(B)$$

### Bayes' Theorem Formula

Rearranging the above equation gives us **Bayes' Theorem**:

$$P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$$

Where:
* $P(A|B)$ = **Posterior probability**: Probability of A given B
* $P(B|A)$ = **Likelihood**: Probability of B given A  
* $P(A)$ = **Prior probability**: Probability of A
* $P(B)$ = **Evidence**: Probability of B
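
As a quick numeric check of the formula, the sketch below computes a posterior from a prior and two likelihoods. The function name `posterior` and the numbers (1% prior, 90% and 5% likelihoods) are hypothetical and chosen only for illustration. The evidence $P(B)$ is obtained from the law of total probability.

```python
def posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """Return P(A|B) using Bayes' Theorem."""
    # Evidence P(B) = P(B|A) P(A) + P(B|not A) P(not A)
    evidence = p_b_given_a * prior_a + p_b_given_not_a * (1 - prior_a)
    return p_b_given_a * prior_a / evidence


# Hypothetical numbers: A = "has disease", B = "test is positive"
p = posterior(prior_a=0.01, p_b_given_a=0.90, p_b_given_not_a=0.05)
print(f"P(A|B) = {p:.3f}")
```

Even with a high likelihood $P(B|A)$, a small prior $P(A)$ keeps the posterior low, which is why both terms appear in the numerator.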

-----

## Naive Bayes Classifier - General Formula

For classes $C_1, C_2, ..., C_K$ and attributes/features $\vec{x} = (x_1, x_2, ..., x_n)$:

### The Classification Formula

Using Bayes' Theorem, we want to find:

$$P(C_i | x_1, x_2, ..., x_n) = \frac{P(x_1, x_2, ..., x_n | C_i) \cdot P(C_i)}{P(x_1, x_2, ..., x_n)}, \quad i = 1, ..., K$$

**What does this mean?** 

$P(C_i | x_1, x_2, ..., x_n)$ represents: *"Given that we observe features $x_1, x_2, ..., x_n$, what is the probability that the example belongs to class $C_i$?"*

**Example - Medical Diagnosis:**

Suppose we want to diagnose whether a patient has a disease based on symptoms:
* Classes: $C_1$ = Has Disease, $C_2$ = Healthy
* Features: $x_1$ = Fever (Yes/No), $x_2$ = Cough (Yes/No), $x_3$ = Fatigue (Yes/No)

For a patient with Fever=Yes, Cough=Yes, Fatigue=No:
* $P(C_1 | \text{Fever=Yes, Cough=Yes, Fatigue=No})$ = "What is the probability the patient has the disease, given these symptoms?"
* $P(C_2 | \text{Fever=Yes, Cough=Yes, Fatigue=No})$ = "What is the probability the patient is healthy, given these symptoms?"

We classify the patient into whichever class has the higher probability.

**Goal**: For a new example with features $\vec{x}$, we want to find the class $C_i$ that maximizes $P(C_i | \vec{x})$.

### The Problem

Estimating $P(x_1, x_2, ..., x_n | C_i)$ directly is **impractical** because:
* We need the joint probability of every combination of feature values
* The number of combinations grows exponentially with the number of features
* For n features with d values each, we need $d^n$ probabilities per class, and most combinations never appear in the training data

**Example**: With 10 binary features, we'd need $2^{10} = 1024$ probability values per class!

-----

## The "Naive" Assumption - Conditional Independence

The **"naive"** in Naive Bayes comes from a **strong simplifying assumption**:

### Conditional Independence Assumption

**All features are conditionally independent given the class.**

Mathematically, for a given class $C_i$:

$$P(x_j | x_{j+1}, x_{j+2}, ..., x_n, C_i) = P(x_j | C_i)$$

This means: *Given we know the class, knowing one feature tells us nothing about another feature.*

### Why is this "Naive"?

This assumption is often **not true in reality**! 

**Example**: Email spam detection
* Feature 1: Email contains word "free"
* Feature 2: Email contains word "money"

These features are actually correlated (they often appear together in spam), but Naive Bayes assumes they're independent given the class.

### The Benefit

Despite being unrealistic, this assumption allows us to **decompose** the joint probability:

$$P(x_1, x_2, ..., x_n | C_i) = P(x_1 | C_i) \cdot P(x_2 | C_i) \cdot ... \cdot P(x_n | C_i) = \prod_{j=1}^{n} P(x_j | C_i)$$

This reduces complexity from $d^n$ to just $n \cdot d$ probability values!
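
To make the difference concrete, the following sketch compares the two table sizes for a few feature counts. The helper names `joint_table_size` and `naive_bayes_table_size` are hypothetical, and both counts are per class.

```python
def joint_table_size(n_features, n_values):
    # One probability for every combination of feature values
    return n_values ** n_features


def naive_bayes_table_size(n_features, n_values):
    # One probability for every (feature, value) pair
    return n_features * n_values


for n in (3, 10, 20):
    print(n, joint_table_size(n, 2), naive_bayes_table_size(n, 2))
```

For binary features the joint table doubles with every added feature, while the Naive Bayes table grows by only two entries.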

-----

## Naive Bayes - Final Formula

Combining Bayes' Theorem with the conditional independence assumption:

$$P(C_i | x_1, ..., x_n) \propto P(C_i) \cdot \prod_{j=1}^{n} P(x_j | C_i)$$

**Note**: We ignore the denominator $P(x_1, ..., x_n)$ because:
* It's the same for all classes $C_i$
* We only care about which class has the highest probability
* We can compare classes without computing it

### Classification Decision

To classify a new example $\vec{x} = (x_1, ..., x_n)$:

$$\hat{C} = \arg\max_{C_i} P(C_i) \cdot \prod_{j=1}^{n} P(x_j | C_i)$$

Choose the class $C_i$ that **maximizes** the expression above.

### What We Need to Learn from Training Data

1. **Prior probabilities**: $P(C_i)$ for each class
   * Simply count: $P(C_i) = \frac{\text{count of examples in class } C_i}{\text{total count of examples}}$

2. **Conditional probabilities**: $P(x_j | C_i)$ for each feature value given each class
   * Count: $P(x_j | C_i) = \frac{\text{count of examples in class } C_i \text{ with feature value } x_j}{\text{count of examples in class } C_i}$
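
Both quantities can be estimated with simple counting. The sketch below is one possible way to do it in plain Python. The toy dataset, the feature names, and the function name `fit` are hypothetical and exist only for illustration. No smoothing is applied, so feature values never seen in a class simply have no entry.

```python
from collections import Counter, defaultdict

# Hypothetical toy dataset: (feature values, class label)
data = [
    ({"fever": "yes", "cough": "yes"}, "disease"),
    ({"fever": "yes", "cough": "no"}, "disease"),
    ({"fever": "no", "cough": "yes"}, "healthy"),
    ({"fever": "no", "cough": "no"}, "healthy"),
    ({"fever": "no", "cough": "no"}, "healthy"),
]


def fit(examples):
    """Estimate priors and conditional probabilities by counting."""
    class_counts = Counter(label for _, label in examples)

    # value_counts[label][(feature, value)] = number of matching examples
    value_counts = defaultdict(Counter)
    for features, label in examples:
        for feature, value in features.items():
            value_counts[label][(feature, value)] += 1

    total = len(examples)
    priors = {c: n / total for c, n in class_counts.items()}
    conditionals = {
        c: {fv: n / class_counts[c] for fv, n in counts.items()}
        for c, counts in value_counts.items()
    }
    return priors, conditionals


priors, conditionals = fit(data)
print(priors)
print(conditionals["disease"])
```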

-----

## Example: Email Spam Classification

Let's work through a concrete example to understand how Naive Bayes works.

### Problem Setup

**Classes**: 
* $C_1$ = Spam
* $C_2$ = Not Spam (Ham)

**Features** (words in email):
* $x_1$ = contains "free"
* $x_2$ = contains "money"
* $x_3$ = contains "meeting"

### Training Data

Suppose we have 100 emails:
* 40 are spam
* 60 are ham

**Prior probabilities**:
* $P(\text{Spam}) = 40/100 = 0.4$
* $P(\text{Ham}) = 60/100 = 0.6$

**Conditional probabilities** (counted from training data):

| Word | P(word \| Spam) | P(word \| Ham) |
|------|-----------------|----------------|
| "free" | 0.30 | 0.05 |
| "money" | 0.25 | 0.10 |
| "meeting" | 0.05 | 0.40 |

### Classification Task

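We receive a new email containing the words **"free"** and **"money"** but not "meeting". For simplicity, the calculation below scores only the words that are present. A Bernoulli model would also multiply each class score by $1 - P(\text{meeting} | \text{class})$ for the absent word, which does not change the decision here.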

**Calculate**:

For Spam:
$$P(\text{Spam} | \text{free, money}) \propto P(\text{Spam}) \cdot P(\text{free}|\text{Spam}) \cdot P(\text{money}|\text{Spam})$$
$$= 0.4 \times 0.30 \times 0.25 = 0.03$$

For Ham:
$$P(\text{Ham} | \text{free, money}) \propto P(\text{Ham}) \cdot P(\text{free}|\text{Ham}) \cdot P(\text{money}|\text{Ham})$$
$$= 0.6 \times 0.05 \times 0.10 = 0.003$$

**Decision**: Since $0.03 > 0.003$, classify as **Spam** ✉️
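
The same calculation can be reproduced in a few lines of Python. This is a minimal sketch that uses the priors and the table values from this example. The variable and function names are hypothetical, and, as in the calculation above, only the words present in the email are scored.

```python
import math

priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 0.30, "money": 0.25, "meeting": 0.05},
    "ham": {"free": 0.05, "money": 0.10, "meeting": 0.40},
}


def score(words, label):
    # Unnormalized posterior: P(class) * product of P(word | class)
    return priors[label] * math.prod(likelihoods[label][w] for w in words)


email = ["free", "money"]
scores = {label: score(email, label) for label in priors}
prediction = max(scores, key=scores.get)

# Dividing by the sum of the scores recovers normalized posteriors
total = sum(scores.values())
posteriors = {label: s / total for label, s in scores.items()}
print(prediction, posteriors)
```

Normalizing is optional for the decision itself, because dividing every score by the same total does not change which class is largest.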

-----

## Zero-Probability Problem

### The Problem

**What happens if a feature value never appears in the training data for a particular class?**

**Example**: 
* Word "discount" never appeared in any spam email in training data
* $P(\text{"discount"} | \text{Spam}) = 0/40 = 0$

**Issue**:
$$P(\text{Spam} | \text{contains "discount"}) \propto P(\text{Spam}) \cdot P(\text{"discount"}|\text{Spam}) \cdot ...$$
$$= 0.4 \times \mathbf{0} \times ... = \mathbf{0}$$

The entire probability becomes **zero**, regardless of other features! This is **catastrophic**.

### Why is this a problem?

* **Lack of training data**: Just because we didn't see "discount" in spam during training doesn't mean it can NEVER appear in spam
* **Probability multiplication**: A single zero makes the entire product zero
* **Loss of information**: All other features are ignored

This is called the **Zero-Probability Problem** or **Zero-Frequency Problem**.
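
A short sketch shows how a single zero wipes out the score. It assumes a hypothetical email containing "free", "money", and "discount", reuses the spam probabilities from the earlier example, and uses the unsmoothed estimate 0/40 for "discount".

```python
import math

prior_spam = 0.4

# P(word | Spam) for each word in the email; "discount" was never seen in spam
word_probs = {"free": 0.30, "money": 0.25, "discount": 0 / 40}

score = prior_spam * math.prod(word_probs.values())
print(score)  # the strong evidence from "free" and "money" is lost
```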

-----

## Solution: Laplace Smoothing

**Laplace Smoothing** (also called **additive smoothing**) solves the zero-probability problem by adding a small constant to all counts.

### The Formula

Instead of:
$$P(x_j | C_i) = \frac{\text{Count}(x_j, C_i)}{\text{Count}(C_i)}$$

We use:
$$P(x_j | C_i) = \frac{\text{Count}(x_j, C_i) + \alpha}{\text{Count}(C_i) + \alpha \cdot V}$$

Where:
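* $\alpha$ = smoothing parameter ($\alpha = 1$ gives **Laplace smoothing**, also called **add-one smoothing**)
* $V$ = number of distinct values the feature can take (2 for a binary present/absent feature, or the vocabulary size when word occurrences are modeled multinomially)
* $\text{Count}(x_j, C_i)$ = number of times feature value $x_j$ appears in class $C_i$
* $\text{Count}(C_i)$ = sum of $\text{Count}(x_j, C_i)$ over all $V$ values: the number of examples in class $C_i$ for a categorical feature, or the total number of word occurrences in class $C_i$ for a multinomial text model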

### Why Does This Work?

1. **Guarantees non-zero probabilities**: Even if $\text{Count}(x_j, C_i) = 0$, we get $P(x_j | C_i) = \frac{\alpha}{\text{Count}(C_i) + \alpha \cdot V} > 0$

2. **Maintains probability distribution**: Summed over the $V$ possible values of a feature, the smoothed probabilities still equal 1

3. **Minimal impact on frequent features**: For features that appear often, adding $\alpha$ has negligible effect
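
The three properties can be checked numerically. The sketch below assumes a hypothetical binary feature (present / absent, so $V = 2$) observed across 40 examples of one class. The function name `smoothed_probability` is likewise an assumption made for illustration.

```python
def smoothed_probability(count, class_total, num_values, alpha=1.0):
    """Additive (Laplace) smoothing of the estimate count / class_total."""
    return (count + alpha) / (class_total + alpha * num_values)


# Hypothetical binary feature observed in 40 examples of a class
class_total = 40
counts = {"present": 0, "absent": 40}
num_values = len(counts)  # V = 2

probs = {
    value: smoothed_probability(c, class_total, num_values)
    for value, c in counts.items()
}
print(probs)                # every value now has a non-zero probability
print(sum(probs.values()))  # the smoothed estimates still sum to 1

# A frequent value moves only slightly from its unsmoothed estimate 30/40
print(smoothed_probability(30, class_total, num_values))
```

Note that the sum equals 1 only because `num_values` matches the number of values over which the counts are tallied.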

### Example with Laplace Smoothing

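Suppose the word "discount" is treated as a binary feature (present / absent):
* Never appeared in 40 spam emails: Count("discount", Spam) = 0
* Number of possible feature values: V = 2 (present, absent)
* $\alpha = 1$

**Without smoothing**: 
$$P(\text{"discount"} | \text{Spam}) = \frac{0}{40} = 0$$

**With Laplace smoothing**:
$$P(\text{"discount"} | \text{Spam}) = \frac{0 + 1}{40 + 1 \times 2} = \frac{1}{42} \approx 0.024$$

Now the probability is small but **not zero**! If $V$ were instead the vocabulary size (say 1000 words), $\text{Count}(\text{Spam})$ would have to be the total number of word occurrences in spam, not the 40 emails, for the probabilities to sum to 1.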

-----

## Advantages and Disadvantages of Naive Bayes

| Advantages ✅ | Disadvantages ❌ |
|--------------|------------------|
| **Fast training and prediction**: Only needs to count frequencies | **Naive independence assumption**: Assumes features are independent, which is often unrealistic |
| **Simple to implement**: Easy to understand and code | **Zero-frequency problem**: Requires smoothing techniques |
| **Works well with high-dimensional data**: Scales well with number of features | **Poor probability estimates**: The actual probability values can be inaccurate (though classification can still be good) |
| **Requires small amount of training data**: Can work with limited data | **Sensitive to irrelevant features**: Irrelevant features can affect predictions |
| **Handles missing values well**: Can easily skip missing features | **Cannot learn feature interactions**: Misses relationships between features |
| **Probabilistic predictions**: Provides probability estimates for classes | **Continuous features**: Requires assumptions about distribution (e.g., Gaussian) |

### When to Use Naive Bayes?

**Good for**:
* Text classification (spam detection, sentiment analysis, document categorization)
* Real-time prediction (very fast)
* Multi-class prediction
* When features are relatively independent
* When you have limited training data

**Not ideal for**:
* Problems with strong feature dependencies
* When you need accurate probability estimates
* Regression problems (it's a classifier)

-----

## Summary: Key Takeaways

### Core Concepts

1. **Naive Bayes** is a **probabilistic classifier** based on **Bayes' Theorem**

2. **The "Naive" Assumption**: All features are **conditionally independent** given the class
   * This is usually **not true** in reality
   * But it makes computation **tractable** and often works well in practice

3. **Classification Formula**:
   $$\hat{C} = \arg\max_{C_i} P(C_i) \cdot \prod_{j=1}^{n} P(x_j | C_i)$$

4. **Training**: Just count frequencies
   * $P(C_i)$ = fraction of training examples in class $C_i$
   * $P(x_j | C_i)$ = fraction of examples in class $C_i$ that have feature value $x_j$

5. **Zero-Probability Problem**: Solved by **Laplace Smoothing**
   $$P(x_j | C_i) = \frac{\text{Count}(x_j, C_i) + \alpha}{\text{Count}(C_i) + \alpha \cdot V}$$

### Why Naive Bayes Works Despite Being "Naive"

Even though the independence assumption is often violated:
* It simplifies computation dramatically
* The classification decision often remains correct even if probability estimates are wrong
* Works surprisingly well in many real-world applications (especially text classification)

-----