<a href="https://colab.research.google.com/github/Metallicode/Math/blob/main/Naive_Bayes_classifier_theory.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Naive Bayes classifier

The Naive Bayes classifier is based on Bayes' Theorem, a fundamental concept in probability theory and statistics. Let's break this down step by step.

**1. Bayes' Theorem**

Bayes' Theorem describes the probability of an event, based on prior knowledge of conditions related to the event. Mathematically, it's defined as:

**P(A|B) = P(B|A)*P(A) / P(B)**

Where:
- P(A|B) is the posterior probability of A occurring given B.
- P(B|A) is the likelihood which is the probability of B occurring given A.
- P(A) is the prior probability of A.
- P(B) is the marginal probability of B (often called the evidence).
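
To make the formula concrete, here is a minimal sketch that plugs numbers into Bayes' Theorem. The values of `p_a`, `p_b_given_a`, and `p_b_given_not_a` are hypothetical and chosen only for illustration; P(B) is obtained with the law of total probability.

```python
def bayes_posterior(likelihood, prior, evidence):
    """Return P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood * prior / evidence


# Hypothetical numbers, not taken from any dataset.
p_a = 0.01               # P(A): prior probability of A
p_b_given_a = 0.9        # P(B|A): likelihood of B when A holds
p_b_given_not_a = 0.05   # P(B|not A): likelihood of B when A does not hold

# P(B) via the law of total probability over A and not-A.
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

posterior = bayes_posterior(p_b_given_a, p_a, p_b)
print(round(posterior, 4))
```

Even with a high likelihood P(B|A), the posterior stays fairly small here because the prior P(A) is small.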

**2. Application to Classification:**

Imagine you want to classify a piece of text as either positive or negative sentiment. Here, A is the class (positive/negative) and B is the observed text.

Using Bayes' Theorem, the probability that a text belongs to the Positive class is:

P(Positive|Text) = P(Text|Positive) * P(Positive) / P(Text)

And similarly for the Negative class.

**3. The "Naive" Assumption:**

The word "naive" in Naive Bayes comes from the assumption that, once the class is known, each feature is independent of all other features. This is a strong and usually unrealistic assumption, hence the name.

For the text example, it means that within a given class the presence of each word is independent of the presence of any other word, which is rarely true of real language. Despite this simplifying assumption, Naive Bayes often performs surprisingly well.

**4. For Classification:**

You would compute the probability of the text being in each class, then assign the class with the highest probability.

For example, if P(Positive|Text) > P(Negative|Text), then classify the text as Positive.

**5. Simplifying Further:**

The denominator P(Text) is constant for all classes, so for the purpose of classification, we can ignore it. We're only interested in which numerator is largest.

Therefore, the decision rule simplifies to comparing:

**P(Text|Positive) * P(Positive)**

versus:

**P(Text|Negative) * P(Negative)**

**6. Example with Numbers:**

Let's imagine a toy scenario:
- 60% of our texts are Positive, so P(Positive) = 0.6
- 40% are Negative, so P(Negative) = 0.4
- The word "awesome" appears in 80% of Positive texts and 10% of Negative texts.

Given a new text "This movie is awesome", what class should it belong to?

P(Positive|awesome) ∝ 0.8 * 0.6 = 0.48


P(Negative|awesome) ∝ 0.1 * 0.4 = 0.04

Since 0.48 > 0.04, we classify the text as Positive.
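
The toy calculation can be reproduced in a few lines. This is only a sketch of the decision rule using the numbers from the example; the dictionary names are made up for illustration.

```python
# Numbers from the toy scenario above.
priors = {"Positive": 0.6, "Negative": 0.4}
likelihood_awesome = {"Positive": 0.8, "Negative": 0.1}  # P("awesome" | class)

# Unnormalized score per class: P(word | class) * P(class).
scores = {label: likelihood_awesome[label] * priors[label] for label in priors}

# Pick the class with the largest score.
predicted = max(scores, key=scores.get)

for label, score in scores.items():
    print(label, round(score, 2))
print("Predicted:", predicted)
```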

**Conclusion:**
This is a highly simplified introduction to the math behind Naive Bayes. In practice, additional techniques such as "smoothing" are employed to handle issues like zero probabilities for unseen words.
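
As a rough illustration of smoothing, the sketch below uses add-one (Laplace) smoothing, which is one common variant; the text above does not say which variant is intended. The function name and the document counts are hypothetical.

```python
def smoothed_word_probability(docs_with_word, total_docs, alpha=1.0):
    """Estimate P(word present | class) with add-alpha smoothing.

    The 2 * alpha in the denominator covers the two possible outcomes
    (word present, word absent), so the estimate stays between 0 and 1.
    """
    return (docs_with_word + alpha) / (total_docs + 2 * alpha)


# Hypothetical counts: a word never seen in any of 40 Negative texts.
unsmoothed = 0 / 40
smoothed = smoothed_word_probability(0, 40)

print(unsmoothed)          # zero would wipe out the whole product
print(round(smoothed, 4))  # small but non-zero
```

Without smoothing, a single unseen word would force the score of that class to zero regardless of the other words.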

### ∝ "is proportional to"

The symbol ∝ is read as "is proportional to." It indicates that two quantities have a constant ratio or a constant multiplicative relationship.

In the Naive Bayes example above, the statement

P(Positive|awesome) ∝ 0.8 * 0.6

means that the probability P(Positive|awesome) is proportional to 0.8 * 0.6, but not necessarily equal to it.

The reason is that in Naive Bayes classification we typically care only about the relative magnitudes of the class probabilities, not their absolute values. The normalization factor (the denominator in Bayes' theorem) is the same for every class, so it is often left out for simplicity, and the proportionality symbol signals this omission.
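
If the actual posterior probabilities are needed, the scores can be normalized by dividing each one by their sum. A minimal sketch, assuming Positive and Negative are the only two classes and reusing the toy numbers:

```python
# Unnormalized scores from the toy example: P(word | class) * P(class).
scores = {"Positive": 0.8 * 0.6, "Negative": 0.1 * 0.4}

# The omitted denominator P(awesome) is the sum of the scores over all classes.
evidence = sum(scores.values())

# Dividing by the evidence turns the scores into probabilities that sum to 1.
posteriors = {label: score / evidence for label, score in scores.items()}

for label, p in posteriors.items():
    print(label, round(p, 3))
```

Normalizing rescales every score by the same constant, so the class with the largest score is unchanged.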

### Joint Probabilities

Let's delve deeper into the concept of joint probabilities and the complications that arise without the independence assumption in the context of the Naive Bayes classifier.

1. **Joint Probability**:
   The joint probability of multiple events is the probability of all those events occurring simultaneously. For instance, consider two features \( A \) and \( B \). The joint probability \( P(A, B) \) represents the probability that \( A \) and \( B \) occur together.

2. **Number of Joint Probabilities Without Independence**:
   If we don't make the independence assumption, we need to consider the joint probabilities of all feature combinations. With \( n \) binary features, there are \( 2^n \) possible combinations, meaning you'd have to estimate \( 2^n \) joint probabilities for each class. As \( n \) grows, this number explodes. For continuous or multi-valued features, the situation becomes even more complex.

3. **Example with Binary Features**:
   Let's assume you're building a Naive Bayes classifier to predict if an email is spam or not, and you have 3 binary features:
   - \( F_1 \): Whether the email contains the word "sale" (1 if it does, 0 otherwise).
   - \( F_2 \): Whether the email contains the word "free" (1 if it does, 0 otherwise).
   - \( F_3 \): Whether the email has an attachment (1 if it does, 0 otherwise).
   
   Without the independence assumption, you'd need to estimate the joint probabilities for all combinations of these features, such as:
   - \( P(F_1 = 1, F_2 = 1, F_3 = 1) \)
   - \( P(F_1 = 0, F_2 = 1, F_3 = 1) \)
   - \( P(F_1 = 1, F_2 = 0, F_3 = 1) \)
   - ...
   - \( P(F_1 = 0, F_2 = 0, F_3 = 0) \)
   
   That's \( 2^3 = 8 \) joint probabilities just for 3 features! Imagine if you had 100 features; you'd need to estimate \( 2^{100} \) joint probabilities!

4. **Simplified with Independence**:
   With the independence assumption, the joint probability can be decomposed into individual probabilities:

   \( P(A, B) = P(A) * P(B) \)

   Extending this to our email example, instead of estimating all 8 joint probabilities, we'd estimate one probability per feature, such as \( P(F_1 = 1) \), \( P(F_2 = 1) \), and \( P(F_3 = 1) \), each conditioned on the class. The number of quantities to estimate then grows linearly with the number of features instead of exponentially.

In essence, the independence assumption in Naive Bayes is a trade-off. We introduce a potential source of inaccuracy (because features in real-world data are rarely truly independent) in exchange for a significant simplification in our calculations and a model that remains tractable even with a large number of features.
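
The difference in scale can be sketched directly. The snippet below enumerates the combinations for the three email features and compares the two counts for larger feature sets; the helper names are made up for illustration, and both counts are per class.

```python
from itertools import product


def joint_table_size(n_features):
    """Number of joint probabilities for n binary features (per class)."""
    return 2 ** n_features


def naive_parameter_count(n_features):
    """Number of per-feature probabilities P(F_i = 1 | class) (per class)."""
    return n_features


# All value combinations of the three binary email features (F_1, F_2, F_3).
combinations = list(product([0, 1], repeat=3))
print(len(combinations))  # 8

for n in (3, 10, 100):
    print(n, joint_table_size(n), naive_parameter_count(n))
```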

### Joint probability of events

The notation P(A, B, C) denotes the joint probability of events A, B, and C occurring simultaneously. In other words, it represents the probability that all three of these events happen at the same time.

Let's break this down:

- P(A): Probability of event A occurring.
- P(B): Probability of event B occurring.
- P(C): Probability of event C occurring.
- P(A, B): Joint probability of events A and B occurring together.
- P(A, B, C): Joint probability of events A, B, and C all occurring together.

For example, consider you have a deck of cards, and you're drawing three cards consecutively (without replacement):

- Let A be the event that the first card is an Ace.
- Let B be the event that the second card is a King.
- Let C be the event that the third card is a Queen.

Then, P(A, B, C) would represent the probability that the first card you draw is an Ace, the second is a King, and the third is a Queen.

The joint probability can always be written as a product of conditional probabilities using the chain rule:

**P(A, B, C) = P(A|B, C) * P(B|C) * P(C)**

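When the events are independent (which is not the case for the card draws above, since each draw changes what remains in the deck), this simplifies to: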

**P(A, B, C) = P(A) * P(B) * P(C)**

This understanding of joint probabilities is foundational in many areas of probability and statistics, including the operation of many machine learning algorithms.
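
The card example can be worked out exactly. This sketch assumes a standard 52-card deck with four cards of each rank, and applies the chain rule in drawing order, P(A, B, C) = P(A) * P(B|A) * P(C|A, B), which is an equally valid ordering of the same rule.

```python
from fractions import Fraction

# Assumption: standard 52-card deck, four Aces, four Kings, four Queens.
p_a = Fraction(4, 52)           # first card is an Ace
p_b_given_a = Fraction(4, 51)   # second is a King; 51 cards remain, all Kings still in
p_c_given_ab = Fraction(4, 50)  # third is a Queen; 50 cards remain, all Queens still in

# Chain rule for draws without replacement.
p_joint = p_a * p_b_given_a * p_c_given_ab

# Product of the marginals P(A) * P(B) * P(C), each equal to 4/52.
# This would be the joint probability only if the draws were independent,
# for example when each card is put back before the next draw.
p_if_independent = Fraction(4, 52) ** 3

print(p_joint)           # 8/16575
print(p_if_independent)  # 1/2197
```

The two results differ, which confirms that draws without replacement are not independent events.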

### Chain rule



Recall that Bayes' theorem describes the probability of an event based on prior knowledge of related events. Mathematically:

**P(A|B) = P(B|A)* P(A) / P(B)**

Where:
- P(A|B) is the posterior probability of A given B.
- P(B|A) is the likelihood, which is the probability of B given A.
- P(A) is the prior probability of A.
- P(B) is the marginal (total) probability of B.

**Applying Bayes' Theorem to Classification:**

In the context of classification, suppose we have a data point with features F and we want to determine the probability it belongs to class C:

P(C|F) ∝ P(F|C) * P(C)

Here, we're interested in finding the class C that maximizes P(C|F). This is found by looking at the product of:
- P(F|C): How likely is the feature set F given the class C.
- P(C): The prior probability of class C.

**Chain Rule of Probability:**

The chain rule of probability (or the product rule) is a fundamental rule providing a way to get the joint probability of a collection of events from conditional probabilities. It states:

P(A, B) = P(A|B) * P(B)

P(A, B, C) = P(A|B, C) * P(B|C) * P(C)

... and so forth.
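
The chain rule can be checked numerically by counting. The sketch below uses a small, hypothetical set of observations of three binary events and verifies that the joint probability equals the product of conditionals; the `prob` helper is made up for illustration.

```python
from fractions import Fraction

# Hypothetical observations of three binary events, stored as (a, b, c).
observations = [
    (1, 1, 1), (1, 1, 1), (0, 1, 1), (1, 0, 1),
    (0, 0, 1), (1, 1, 0), (0, 1, 0), (0, 0, 0),
]


def prob(event, given=lambda row: True):
    """Empirical P(event | given), computed by counting matching rows."""
    conditioned = [row for row in observations if given(row)]
    matching = [row for row in conditioned if event(row)]
    return Fraction(len(matching), len(conditioned))


def a(row): return row[0] == 1
def b(row): return row[1] == 1
def c(row): return row[2] == 1


p_abc = prob(lambda r: a(r) and b(r) and c(r))          # P(A, B, C)
p_c = prob(c)                                           # P(C)
p_b_given_c = prob(b, given=c)                          # P(B|C)
p_a_given_bc = prob(a, given=lambda r: b(r) and c(r))   # P(A|B, C)

print(p_abc, p_a_given_bc * p_b_given_c * p_c)
```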


**Applying the Chain Rule to our Classifier:**

Let's assume for simplicity that our data point has two features F_1 and F_2. Using the chain rule:


**P(F_1, F_2|C) = P(F_1|F_2, C) * P(F_2|C)**


However, here's where the "Naive" in Naive Bayes comes in. We make an assumption that each feature is independent of every other feature given the class. So, for our example:


**P(F_1|F_2, C) = P(F_1|C)**


Hence, the joint probability can be simplified to:


**P(F_1, F_2|C) = P(F_1|C) * P(F_2|C)**

If you have more features, you keep multiplying in one factor per feature: P(F_1, ..., F_n|C) = P(F_1|C) * ... * P(F_n|C). Under the naive assumption, each P(F_i|C) can be estimated on its own, which is much easier and requires far less data than estimating the full joint probability directly.
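
Putting the pieces together, a classifier over several binary features might look like the sketch below. The priors and per-feature probabilities are hypothetical, not estimated from data, and the labels `spam` and `ham` are illustrative. Summing logarithms instead of multiplying probabilities is a common practical choice to avoid numerical underflow with many features; it is not something the derivation above requires.

```python
import math

# Hypothetical estimates; in practice these come from training data.
priors = {"spam": 0.3, "ham": 0.7}
# P(F_i = 1 | class) for three binary features, in order F_1, F_2, F_3.
p_feature_given_class = {
    "spam": [0.6, 0.7, 0.2],
    "ham": [0.1, 0.2, 0.4],
}


def log_score(features, label):
    """Return log P(C) + sum_i log P(F_i | C) under the naive assumption."""
    total = math.log(priors[label])
    for value, p_one in zip(features, p_feature_given_class[label]):
        # Use P(F_i = 1 | C) if the feature is present, else P(F_i = 0 | C).
        total += math.log(p_one if value == 1 else 1.0 - p_one)
    return total


def classify(features):
    """Return the class with the highest (log) score."""
    return max(priors, key=lambda label: log_score(features, label))


print(classify([1, 1, 0]))
print(classify([0, 0, 1]))
```

Because the logarithm is monotonic, comparing log scores picks the same class as comparing the original products.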

In conclusion, the Naive Bayes classifier combines Bayes' theorem with the chain rule and a simplifying assumption of conditional independence among features, yielding an efficient and often surprisingly accurate classification method.