# Module 11 - Naive Bayes 
-----
## Learning outcomes

- LO 1: Use Bayes’ theorem to calculate conditional probabilities.
- LO 2: Describe important components of Bayes’ theorem.
- LO 3: Build a simple Naïve Bayes classifier.
- LO 4: Convert numbers into categorical predictors. 
- LO 5: Discuss real-life applications of Naïve Bayes. (Assignment 11.3)
- LO 6: Apply the Naïve Bayes method in Python. (Assignment 11.4)

## Misc and Keywords
- For independent events, the probability of both happening is the product: $P(A \cap B) = P(A)P(B)$. For mutually exclusive events, the probability of either happening is the sum: $P(A \cup B) = P(A) + P(B)$.
- $\propto$ means proportional to
- A **probabilistic classifier** is a type of machine learning model that predicts the category of something based on probabilities. Instead of just saying, "This is definitely spam" or "This is definitely not spam," it gives a likelihood for each possible category.

---

| **Concept**              | **Formula**                                                       | **Key Points**                                                            |
|--------------------------|-------------------------------------------------------------------|---------------------------------------------------------------------------|
| **Bayes' Theorem**  | $$P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}$$ | The complete version of Bayes' theorem with exact posterior probability calculation, where $P(B)$ is the **marginal likelihood**. |
| **Proportional Bayes**        | $$P(A \mid B) \propto P(B \mid A) \cdot P(A)$$          | Used when comparing the relative probabilities of hypotheses; the marginal likelihood is not needed for the comparison. |
| **Naïve Bayes**           | $$P(C \mid x_1, \dots, x_n) \propto P(C) \cdot \prod_{i=1}^{n} P(x_i \mid C)$$ | The conditional independence assumption simplifies the model. Used for classification tasks; the marginal likelihood is ignored. |
| **Marginal Likelihood**   | $$P(B) = \sum_{A} P(B \mid A) \cdot P(A)$$ | Represents the total probability of the evidence, summing over all possible values of the hidden variables. |




---
### Naive Bayes
Naïve Bayes is one of the most successful machine learning methods for analysing text.

It is commonly used in email spam filters, document classification algorithms, and sentiment analysis.

It is based heavily on **Bayes' theorem**.

---

### Bayes Theorem 
Bayes' theorem tells us how our probability estimate for an event should be updated in light of additional information.

It gives the probability of $A$ happening given that event $B$ has already happened, which is known as a conditional probability. For example, the probability that a mail is spam, $A$, given that we know the mail contains the word 'unsubscribe', $B$:
$$P(A \mid B) = \frac{P(B \mid A)P(A)}{P(B)}$$
where:
- $P(A | B)$ is the **posterior**: the probability of class $A$ given feature $B$.
- $P(B | A)$ is the **likelihood**: the probability of observing the feature $B$ given class $A$.
- $P(A)$ is the **prior**: the initial probability of class $A$ before observing any features.
- $P(B)$ is the **marginal likelihood** (or **normalising constant**): the total probability of observing the feature $B$, summed over all possible classes.

The likelihood divided by the marginal likelihood acts as a correction factor applied to the prior. If $\frac{P(B \mid A)}{P(B)}$
- $= 1$, then the feature $B$ provides no new information about $A$
- $> 1$, then observing feature $B$ makes class $A$ more likely than it was before
- $< 1$, then observing feature $B$ makes class $A$ less likely than it was before
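
As a quick check of the formula and the correction factor, here is a minimal sketch in Python. The numbers are hypothetical (they are not from the module): assume 20% of mail is spam, 60% of spam contains 'unsubscribe', and 16% of all mail contains 'unsubscribe'. The function name `bayes_posterior` is also just illustrative.

```python
def bayes_posterior(likelihood, prior, marginal):
    """Return P(A | B) = P(B | A) * P(A) / P(B)."""
    return likelihood * prior / marginal


# Hypothetical values, chosen only for illustration
p_b_given_a = 0.60  # P(B | A): spam mail that contains 'unsubscribe'
p_a = 0.20          # P(A): prior probability that a mail is spam
p_b = 0.16          # P(B): any mail that contains 'unsubscribe'

posterior = bayes_posterior(p_b_given_a, p_a, p_b)
correction = p_b_given_a / p_b  # likelihood / marginal likelihood

print(f"posterior P(A | B) = {posterior:.2f}")
print(f"correction factor  = {correction:.2f}")
```

With these assumed numbers the correction factor is greater than 1, so seeing 'unsubscribe' raises the probability of spam from 0.20 to 0.75.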

The posterior can be calculated in two main ways: exactly, by computing the **marginal** likelihood, or up to a normalising constant, using the **proportional** method.

#### Marginal Likelihood Approach

The marginal likelihood is the probability of observing the features across all possible classes. This is calculated by summing over all possible scenarios of the classes $A$:

$$
P(B) = \sum_{A} P(B | A) P(A)
$$
##### Example:

Suppose we have two possible classes: $A_1$ and $A_2$. Then, the marginal likelihood is the total probability of observing feature $B$, considering both classes:

$$
P(B) = P(B | A_1) P(A_1) + P(B | A_2) P(A_2)
$$
Once we have the marginal likelihood $P(B)$, we can calculate the posterior probability using Bayes' Theorem:
$$
P(A_1 | B) = \frac{P(B | A_1) P(A_1)}{P(B)}
$$
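
The same calculation can be sketched in Python for any number of classes. The function name `posteriors_with_marginal` and the three-class numbers are hypothetical, used only to illustrate the sum over classes.

```python
def posteriors_with_marginal(priors, likelihoods):
    """Return ({class: P(A_k | B)}, P(B)).

    priors:      {class: P(A_k)}, which should sum to 1
    likelihoods: {class: P(B | A_k)}
    """
    # Marginal likelihood: P(B) = sum over classes of P(B | A_k) * P(A_k)
    p_b = sum(likelihoods[c] * priors[c] for c in priors)
    # Bayes' theorem applied to each class in turn
    post = {c: likelihoods[c] * priors[c] / p_b for c in priors}
    return post, p_b


# Hypothetical values for illustration only
priors = {"A1": 0.5, "A2": 0.3, "A3": 0.2}
likelihoods = {"A1": 0.1, "A2": 0.4, "A3": 0.5}

post, p_b = posteriors_with_marginal(priors, likelihoods)
print(f"P(B) = {p_b:.2f}")
for c, p in post.items():
    print(f"P({c} | B) = {p:.3f}")
```

Because every posterior is divided by the same $P(B)$, the resulting values sum to 1 across the classes.
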
#### Proportional Approach 

In many situations, calculating the exact marginal likelihood $P(B)$ is difficult or unnecessary, especially when we are comparing the likelihoods of different classes. The proportional approach allows us to ignore the normalising constant (the denominator $P(B)$) and calculate the relative probabilities instead.

Here, we drop the denominator $P(B)$ because it is the same for every hypothesis. Thus, we can rank the hypotheses directly by comparing their unnormalised scores:

$$
P(A_1 | B) \propto P(B | A_1) P(A_1)
$$ 

##### Example:

Suppose we again have two possible classes: $A_1$ and $A_2$. Instead of computing the full posterior probability, we compute the relative probabilities:

$$
P(A_1 | B) \propto P(B | A_1) P(A_1)
$$

$$
P(A_2 | B) \propto P(B | A_2) P(A_2)
$$

If:

- $P(B | A_1) = 0.6$, $P(A_1) = 0.4$
- $P(B | A_2) = 0.2$, $P(A_2) = 0.6$

Then, calculating the proportional scores:

$$
P(A_1 | B) \propto (0.6)(0.4) = 0.24
$$

$$
P(A_2 | B) \propto (0.2)(0.6) = 0.12
$$

Since $P(A_1 | B) > P(A_2 | B)$, we conclude that $A_1$ is the more likely class.
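
The example above can be reproduced in a few lines of Python. This sketch uses the numbers from the example; the variable names are only illustrative. It also shows that dividing each score by the sum of the scores recovers the exact posteriors, since that sum equals $P(B)$.

```python
# Values from the example above
priors = {"A1": 0.4, "A2": 0.6}
likelihoods = {"A1": 0.6, "A2": 0.2}

# Unnormalised scores: P(B | A_k) * P(A_k)
scores = {c: likelihoods[c] * priors[c] for c in priors}

# The most likely class is the one with the largest score
best = max(scores, key=scores.get)

# Optional: normalise by the sum of the scores, which equals P(B)
total = sum(scores.values())
posteriors = {c: s / total for c, s in scores.items()}

for c in scores:
    print(f"{c}: score = {scores[c]:.2f}, posterior = {posteriors[c]:.2f}")
print("most likely class:", best)
```

After normalising, $A_1$ has a posterior of about 0.67 and $A_2$ about 0.33, so the ranking is the same with or without the denominator.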


#### When to Use Each Approach

**Marginal Likelihood Approach**: Use when you need the exact value of the posterior probability, which requires calculating the marginal likelihood $P(B)$.

**Proportional Approach**: Use when you're interested in comparing the relative probabilities of different hypotheses and don't need the exact value of the posterior probability.

----
### Naive Bayes Classifiers
The **Naïve Bayes classifier** is a simple probabilistic classifier based on Bayes' Theorem, with the **naïve** assumption that features are conditionally independent given the class label. This assumption simplifies the computation of probabilities, allowing the classifier to perform well with relatively little data.

In the context of classification, Bayes' Theorem is used to calculate the probability of a class given a set of features. This is written as:

$$
P(C | X_1, X_2, \dots, X_n) = \frac{P(X_1, X_2, \dots, X_n | C) P(C)}{P(X_1, X_2, \dots, X_n)}
$$

Where:
- $P(C | X_1, X_2, \dots, X_n)$ is the **posterior probability** of class $C$ given features $X_1, X_2, \dots, X_n$.
- $P(X_1, X_2, \dots, X_n | C)$ is the **likelihood** of observing the features given class $C$.
- $P(C)$ is the **prior probability** of class $C$.
- $P(X_1, X_2, \dots, X_n)$ is the **marginal likelihood** or the total probability of observing the features (over all classes).

The **naïve assumption** is that all the features $X_1, X_2, \dots, X_n$ are conditionally independent given the class $C$. This simplifies the likelihood term:

$$
P(X_1, X_2, \dots, X_n | C) = P(X_1 | C) P(X_2 | C) \dots P(X_n | C)
$$

Thus, the posterior probability for class $C$ becomes:

$$
P(C | X_1, X_2, \dots, X_n) \propto P(C) \prod_{i=1}^n P(X_i | C)
$$
Where:
- $P(C)$ is the prior probability of class $C$.
- $P(X_i | C)$ is the likelihood of observing feature $X_i$ given class $C$.

#### Predicting the Class

We compute this for each class, and the class with the highest posterior probability is chosen as the predicted class.

To classify a new instance with features $(X_1, X_2, \dots, X_n)$, follow these steps:

1. **Calculate Prior Probability for Each Class**:
   The prior probability $P(C)$ is the relative frequency of class $C$ in the dataset.

   $$P(C) = \frac{\text{Number of instances of class C}}{\text{Total number of instances}}$$

2. **Calculate Likelihood for Each Feature**:
   For each feature $X_i$, calculate the likelihood $P(X_i | C)$, which is the probability of observing the value of feature $X_i$ given class $C$. How this is estimated depends on the type of feature (e.g., categorical, continuous). For a categorical feature:

   $$P(X_{i} \mid C) = \frac{\text{Number of instances in class } C \text{ where } X_{i} \text{ occurs}}{\text{Total number of instances in class } C}$$

3. **Apply the Naïve Assumption**:
   Multiply the likelihoods of the individual features together, assuming conditional independence.

4. **Compute Posterior Probability for Each Class**:
   Use Bayes' Theorem to calculate the posterior probability for each class, up to the normalising constant:

   $$
   P(C | X_1, X_2, \dots, X_n) \propto P(C) \prod_{i=1}^n P(X_i | C)
   $$

5. **Choose the Class with the Maximum Posterior**:
   The class with the highest posterior probability is chosen as the predicted class.

   $$\hat{C} = \arg\max_{C} P(C \mid X_1, X_2, \dots, X_n)$$
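
The steps above can be sketched in plain Python for categorical features. The toy weather dataset, the feature names, and the function names `train` and `predict` are all hypothetical, made up for illustration. This is a minimal sketch without smoothing, not a production implementation.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: (outlook, windy) -> play
X = [("sunny", "no"), ("sunny", "yes"), ("rainy", "yes"),
     ("rainy", "no"), ("sunny", "no"), ("rainy", "yes")]
y = ["yes", "yes", "no", "yes", "yes", "no"]


def train(X, y):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(y)
    # feature_counts[(class, feature index)][value] = number of instances
    feature_counts = defaultdict(Counter)
    for row, label in zip(X, y):
        for i, value in enumerate(row):
            feature_counts[(label, i)][value] += 1
    return class_counts, feature_counts


def predict(row, class_counts, feature_counts):
    """Return (predicted class, unnormalised score for each class)."""
    n = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / n  # prior P(C)
        for i, value in enumerate(row):
            # Naive assumption: multiply the per-feature likelihoods P(X_i | C)
            score *= feature_counts[(c, i)][value] / n_c
        scores[c] = score  # proportional to the posterior
    # Choose the class with the largest score (argmax)
    return max(scores, key=scores.get), scores


class_counts, feature_counts = train(X, y)
label, scores = predict(("sunny", "yes"), class_counts, feature_counts)
print(label, scores)
```

In this toy data "sunny" never occurs with the class "no", so that class gets a score of exactly 0. This is the zero frequency problem listed under the disadvantages of Naive Bayes.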


   

---

### Converting Features from Numerical to Categorical
- Two popular methods to achieve this:
    - Manual binning: Manually define the cut-offs between bins, e.g. < 18, 18 to 36, etc.
    - Automated binning: Specify the number of bins and automatically divide the feature into equal-width bins based on the min/max values of the data. For example, with min = 0, max = 100 and bins = 5, the bins would be 0 to 20, 20 to 40, 40 to 60, 60 to 80 and 80 to 100.
- Which method should be used?
    - If there is a commonly accepted grouping for the feature, use that. For example, some medical features may have specific groupings.
    - If you know the essentials for defining cut-offs (e.g. converting numeric RGB values into primary colours), use manual binning.
    - If neither of these applies, use automated binning.
- How many bins?
    - Too few bins and information is lost.
    - Too many bins results in small counts in the frequency table, making the model susceptible to noise.
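
Both binning methods can be sketched in plain Python. The ages, cut-offs, labels, and helper names (`manual_bin`, `equal_width_bin`) below are hypothetical, chosen only for illustration, and the automated version assumes equal-width bins.

```python
from bisect import bisect_right


def manual_bin(value, cutoffs, labels):
    """Map a value to a label using hand-picked cut-offs.

    cutoffs must be sorted; labels needs len(cutoffs) + 1 entries.
    """
    return labels[bisect_right(cutoffs, value)]


def equal_width_bin(value, lo, hi, n_bins):
    """Return the 0-based index of the equal-width bin containing value."""
    width = (hi - lo) / n_bins
    index = int((value - lo) // width)
    # Clamp so that value == hi falls in the last bin
    return min(max(index, 0), n_bins - 1)


ages = [5, 18, 35, 36, 80]  # hypothetical data

# Manual binning with cut-offs at 18 and 36
manual = [manual_bin(a, [18, 36], ["<18", "18-35", "36+"]) for a in ages]
print(manual)

# Automated binning: 5 equal-width bins over the range 0 to 100
automated = [equal_width_bin(a, 0, 100, 5) for a in ages]
print(automated)
```
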
---

### Advantages of Naive Bayes

- Simplicity and Speed: Naive Bayes is computationally efficient and easy to implement, making it suitable for large datasets.
- Performance with Small Training Sets: It can perform surprisingly well even with limited training data compared to more complex models.
- Handles High Dimensionality: Works well with high-dimensional data like text classification, where each word represents a feature.
- Multi-class Classification: Naturally handles multiple classes without modification.
- Real-time Prediction: Its computational efficiency makes it ideal for real-time predictions.
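- Insensitive to Irrelevant Features: An irrelevant feature has similar likelihoods in every class, so it scales each class score by roughly the same factor and has little effect on the ranking.
- Handles Missing Values: A missing feature value can be skipped. The instance is left out of that feature's counts during model building, and the feature's term is left out of the product during prediction.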

### Disadvantages of Naive Bayes

- Independence Assumption: The core "naive" assumption that features are independent is often violated in real-world scenarios, sometimes leading to suboptimal performance.
- Zero Frequency Problem: When a class and feature value never occur together in the training data, the probability estimate will be zero, potentially causing prediction errors (typically addressed with smoothing techniques).
- Sensitive to Feature Selection: Performance can vary significantly based on how features are selected and represented.
- Biased Probability Estimates: While classification decisions may be accurate, the probability estimates themselves are often not well-calibrated.
- Data Imbalance Issues: Can be biased toward majority classes without proper adjustments.
- Continuous Data Limitations: Requires discretisation of continuous features or assumption of distribution (usually Gaussian), which may not accurately represent the data.
- Feature Independence is Rare: In practice, features are often correlated, which can reduce the model's effectiveness for certain applications.
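
The zero frequency problem and its usual fix can be illustrated with a short sketch. Additive (Laplace) smoothing is one common smoothing technique; the counts and the function name `smoothed_likelihood` below are hypothetical.

```python
def smoothed_likelihood(count, class_total, n_values, alpha=1.0):
    """Estimate P(x | C) with additive (Laplace) smoothing.

    count:       instances of class C where the feature takes value x
    class_total: total instances of class C
    n_values:    number of distinct values the feature can take
    alpha:       smoothing strength; alpha = 0 gives the raw frequency
    """
    return (count + alpha) / (class_total + alpha * n_values)


# Hypothetical counts: the value never occurs with this class
count, class_total, n_values = 0, 20, 3

raw = smoothed_likelihood(count, class_total, n_values, alpha=0.0)
smoothed = smoothed_likelihood(count, class_total, n_values, alpha=1.0)

print(f"raw estimate:      {raw:.4f}")
print(f"smoothed estimate: {smoothed:.4f}")
```

The raw estimate of 0 would force the whole product for that class to 0, whereas the smoothed estimate stays small but non-zero.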