# Maximum A Posteriori (MAP)

Imagine you have a bag filled with red and blue balls, but you don't know the exact proportion of each color. You want to estimate the proportion of red balls in the bag. To do this, you randomly pick one ball from the bag and observe its color. 

Now, let's say you have some prior belief about the proportion of red balls in the bag before you even pick a ball. This belief is based on some general knowledge or assumptions you might have. This prior belief is represented by a probability distribution.

For example, let's say you have no reason to favor any particular proportion, so your prior distribution over the proportion of red balls is uniform: every proportion between 0 and 1 is equally likely.

Now, you pick a ball from the bag, and let's say it's red. 

Maximum A Posteriori (MAP) estimation comes into play after you've observed this red ball. It's a way to update your prior belief (probability distribution) with the new information (the observed red ball) to get a better estimate of the proportion of red balls in the bag.

### Here's how it works mathematically:

1. **Prior belief**: You start with your prior belief, represented by a probability distribution. In this case, it's a uniform distribution over the proportion of red balls, since you initially assumed every proportion is equally plausible.

2. **Likelihood**: This is the probability of observing the data (in this case, picking a red ball) given the parameters of the model (the proportion of red balls). This is where the observed data comes into play.

3. **Posterior belief**: The posterior distribution is what you get after updating your prior belief with the likelihood of the observed data. In MAP estimation, you're looking for the parameter value (proportion of red balls) that maximizes this posterior distribution.

4. **Maximum A Posteriori estimate**: This is the value of the parameter (proportion of red balls) that maximizes the posterior distribution. It's the most probable value given both your prior belief and the observed data.

In our example, after observing the red ball, you update your prior belief using Bayes' theorem to get the posterior distribution, and then you find the proportion of red balls that maximizes this distribution. This proportion is your MAP estimate of the true proportion of red balls in the bag.
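The four steps above can be sketched numerically. This is a minimal illustration rather than a definitive implementation: it assumes the draws are independent given the proportion $\theta$, evaluates the unnormalized posterior (prior times likelihood) on a grid, and returns the grid point where it is largest. The helper name `map_estimate_grid` and the grid size are choices made here for illustration.

```python
def map_estimate_grid(num_red, num_blue, prior=lambda theta: 1.0, grid_size=1001):
    """Grid-search MAP estimate of theta, the proportion of red balls.

    Assumes draws are independent given theta (e.g. drawn with replacement).
    The default prior is uniform on [0, 1].
    """
    best_theta, best_score = None, -1.0
    for i in range(grid_size):
        theta = i / (grid_size - 1)
        likelihood = theta ** num_red * (1 - theta) ** num_blue
        score = likelihood * prior(theta)  # unnormalized posterior
        if score > best_score:
            best_theta, best_score = theta, score
    return best_theta

# One red ball observed, uniform prior.
print(map_estimate_grid(num_red=1, num_blue=0))

# Same observation, but with a prior that puts less weight on extreme proportions.
print(map_estimate_grid(num_red=1, num_blue=0, prior=lambda t: t * (1 - t)))
```

With a uniform prior and a single red draw, the unnormalized posterior is proportional to $\theta$, so the estimate lands at $\theta = 1$. A prior that down-weights extreme proportions pulls the estimate back toward the middle.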

The formula for Maximum A Posteriori (MAP) estimation involves finding the parameter value that maximizes the posterior probability distribution. Mathematically, it can be expressed as follows:

$ \hat{\theta}_{MAP} = \arg \max_{\theta} P(\theta | D) $

Where:
- $ \hat{\theta}_{MAP} $ is the MAP estimate of the parameter $\theta$.
- $ P(\theta | D) $ is the posterior probability distribution of the parameter $\theta$ given the observed data $D$.

In Bayesian statistics, the posterior distribution $ P(\theta | D) $ is calculated using Bayes' theorem:

$ P(\theta | D) = \frac{P(D | \theta) \cdot P(\theta)}{P(D)} $

Where:
- $ P(D | \theta) $ is the likelihood function, representing the probability of observing the data $D$ given the parameter $\theta$.
- $ P(\theta) $ is the prior probability distribution of the parameter $\theta$, representing our prior beliefs about the parameter before observing any data.
- $ P(D) $ is the marginal likelihood or evidence, acting as a normalizing constant.
- The term "argmax" stands for "argument of the maximum" and it is used in mathematics and computer science. The argmax of a function is the input value that maximizes the function's output. In simpler terms, it's the value of the variable (or argument) that gives the highest value of the function.

    For example, if you have a function $ f(x) $ and you want to find the value of $ x $ that maximizes $ f(x) $, you would write it as:

    $ \text{argmax}_x \, f(x) $

    It means you're looking for the value of $ x $ that maximizes $ f(x) $.

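    In the context of MAP estimation, we use argmax to find the value of the parameter (often denoted as $ \theta $) that maximizes the posterior probability distribution $ P(\theta | D) $. So, when we write:

    $ \hat{\theta}_{MAP} = \text{argmax}_{\theta} \, P(\theta | D) $

    we mean that $ \hat{\theta}_{MAP} $ is the value of $ \theta $ at which the posterior is largest. Because $ P(D) $ does not depend on $ \theta $, it can be dropped from the maximization:

    $ \hat{\theta}_{MAP} = \text{argmax}_{\theta} \, P(D | \theta) \cdot P(\theta) $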

## Example 

Let $X$ be a continuous random variable with probability density function $f(x)$ given by:
$ f(x) = 
\begin{cases} 
2x, & 0 \leq x \leq 1 \\
0, & \text{otherwise}
\end{cases}
$

Suppose we have another random variable $Y$ with a conditional probability mass function $p(Y|X = x)$ given by:
$ p(Y = y|X = x) = x(1 - x)^{y - 1}, \quad y = 1, 2, 3, \ldots $

Given that $Y = 3$, find the Maximum A Posteriori (MAP) estimate of $X$.

## Solution

To find the MAP estimate of $X$ given $Y = 3$, we first need to compute the product $P(Y = 3|X) \cdot P(X)$, and then maximize this expression with respect to $x$.

Given:
- Likelihood function: $P(Y = 3|X = x) = x(1 - x)^{2}$
- Prior density: $f(x) = 2x$ for $0 \leq x \leq 1$ (written as $P(X)$ below)

We'll compute $P(Y = 3|X) \cdot P(X)$:

$ P(Y = 3|X) \cdot P(X) = x(1 - x)^{2} \cdot 2x $

$ = 2x^2(1 - x)^{2} $

Now, to find the critical points, we'll differentiate this expression with respect to $x$ and set the derivative equal to zero:

$ \frac{d}{dx} \left( 2x^2(1 - x)^{2} \right) = 0 $

We can simplify this by first expressing it as a polynomial:

$ 2x^2(1 - x)^{2} = 2x^2(1 - 2x + x^2) = 2x^2 - 4x^3 + 2x^4 $

Now, taking the derivative:

$ \frac{d}{dx} \left( 2x^2 - 4x^3 + 2x^4 \right) = 4x - 12x^2 + 8x^3 $

Setting this equal to zero and solving for $x$:

$ 4x - 12x^2 + 8x^3 = 0 $

$ 4x(1 - 3x + 2x^2) = 0 $

$ x(1 - 3x + 2x^2) = 0 $

The solutions are $x = 0$ and the roots of $1 - 3x + 2x^2 = 0$.

To solve $1 - 3x + 2x^2 = 0$, we can use the quadratic formula:

$ x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a} $

where $a = 2$, $b = -3$, and $c = 1$.

$ x = \frac{-(-3) \pm \sqrt{(-3)^2 - 4 \cdot 2 \cdot 1}}{2 \cdot 2} $

$ x = \frac{3 \pm \sqrt{9 - 8}}{4} $

$ x = \frac{3 \pm \sqrt{1}}{4} $

$ x = \frac{3 \pm 1}{4} $

$ x_1 = \frac{3 + 1}{4} = \frac{4}{4} = 1 $

$ x_2 = \frac{3 - 1}{4} = \frac{2}{4} = \frac{1}{2} $

So, the critical points are $x = 0$, $x = 1$, and $x = \frac{1}{2}$.

Now, we evaluate $P(Y = 3|X) \cdot P(X)$ at these critical points and endpoints:

$ P(Y = 3|X) \cdot P(X) \mid_{x = 0} = 0 $

$ P(Y = 3|X) \cdot P(X) \mid_{x = \frac{1}{2}} = 2 \left( \frac{1}{2} \right)^2 \left(1 - \frac{1}{2}\right)^2 = 2 \cdot \frac{1}{4} \cdot \frac{1}{4} = \frac{1}{8} $

$ P(Y = 3|X) \cdot P(X) \mid_{x = 1} = 2 \cdot 1^2 \left(1 - 1\right)^2 = 0 $

So, the MAP estimate of $X$ given $Y = 3$ is $x = \frac{1}{2}$.
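As a sanity check on the calculus, the same maximization can be done numerically. The sketch below evaluates the product of the likelihood and the prior density on an evenly spaced grid over $[0, 1]$; the function name `unnormalized_posterior` and the grid resolution are illustrative choices, not part of the original problem.

```python
def unnormalized_posterior(x, y=3):
    """Likelihood x(1 - x)^(y - 1) times the prior density 2x on [0, 1]."""
    if not 0.0 <= x <= 1.0:
        return 0.0
    return x * (1 - x) ** (y - 1) * 2 * x

# Evaluate on a grid and keep the point with the largest value.
grid = [i / 1000 for i in range(1001)]
x_map = max(grid, key=unnormalized_posterior)
print(x_map, unnormalized_posterior(x_map))
```

The grid search picks $x = 0.5$, in agreement with the analytical result.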

# Bayes Optimal Classification

**1. Introduction:**
   We're talking about a way to make decisions when we have some data and we need to choose the best option based on that data.

**2. Background:**
   Imagine you have some information, called "data" (we'll call it $D$), and you're trying to figure out something. For example, you're trying to predict if a person will like a movie based on their age and gender.

**3. MAP Hypothesis:**
   Normally, we would pick the most likely guess based on the data we have. This guess is called the "MAP hypothesis" or $h_{MAP}$. It's like saying, "Based on what we know, this is the best guess."

**4. Problem with MAP Hypothesis:**
   But sometimes, just looking at the most likely guess isn't the best idea. Imagine we have three guesses: $h_1$, $h_2$, and $h_3$. The MAP hypothesis might say $h_1$ is the best guess, but when we look at all three guesses together, it turns out $h_1$ might not be the best choice after all.

**5. Bayes Optimal Classification:**
   So, instead of just picking the most likely guess, we look at all possible guesses and weigh them based on how likely they are. We do this using a formula. This formula helps us pick the best choice by considering all the options together.

   In Bayes optimal classification, we aim to determine the most probable classification for a new instance $x$ by considering all possible hypotheses and their corresponding probabilities given the observed data $D$. Unlike the Maximum A Posteriori (MAP) hypothesis $h_{MAP}$, which considers only the most probable hypothesis, Bayes optimal classification takes into account the probabilities of all hypotheses.

   To find the most probable classification, we calculate the probability of each class label given each hypothesis, weighted by the probability of each hypothesis given the data $D$, and then choose the class label with the highest overall probability.

   The formula for Bayes optimal classification is:

   $
   \arg \max_{v_{j} \in V} \sum_{h_{i} \in H} P(v_{j}\,|\,h_{i}) P(h_{i}\,|\,D)
   $

   Where:
   - $V$ is the set of possible class labels.
   - $H$ is the set of possible hypotheses.
   - $P(v_{j}\,|\,h_{i})$ is the probability of class label $v_{j}$ given hypothesis $h_{i}$.
   - $P(h_{i}\,|\,D)$ is the probability of hypothesis $h_{i}$ given the observed data $D$.

## Example

In the example provided, we have three hypotheses: $h_1$, $h_2$, and $h_3$, and two possible outcomes: $-$ (minus) and $+$ (plus). Let's use Bayes optimal classification to determine the most probable classification given the data $D$.

**Given Probabilities:**
- $P(h_1|D) = 0.4$
- $P(-|h_1) = 0$, $P(+|h_1) = 1$
- $P(h_2|D) = 0.3$
- $P(-|h_2) = 1$, $P(+|h_2) = 0$
- $P(h_3|D) = 0.3$
- $P(-|h_3) = 1$, $P(+|h_3) = 0$

## Solution

For each outcome, we'll calculate the weighted sum of probabilities across all hypotheses, then choose the outcome with the highest overall probability.

1. For outcome $-$:

   $
   \begin{align*}
   \text{Weighted Sum} &= P(-|h_1) \cdot P(h_1|D) + P(-|h_2) \cdot P(h_2|D) + P(-|h_3) \cdot P(h_3|D) \\
   &= 0 \times 0.4 + 1 \times 0.3 + 1 \times 0.3 \\
   &= 0 + 0.3 + 0.3 \\
   &= 0.6
   \end{align*}
   $

2. For outcome $+$:

   $
   \begin{align*}
   \text{Weighted Sum} &= P(+|h_1) \cdot P(h_1|D) + P(+|h_2) \cdot P(h_2|D) + P(+|h_3) \cdot P(h_3|D) \\
   &= 1 \times 0.4 + 0 \times 0.3 + 0 \times 0.3 \\
   &= 0.4 + 0 + 0 \\
   &= 0.4
   \end{align*}
   $


**Conclusion:**
The most probable classification is $-$, as it has the highest overall probability ($0.6$). Therefore, according to Bayes optimal classification, the outcome for the new instance is $-$.
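The weighted vote can be written in a few lines of code. The sketch below plugs in the probabilities from the example; the dictionary layout and the function name `bayes_optimal` are illustrative assumptions, not a standard API.

```python
posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}   # P(h | D)
label_given_h = {                                # P(v | h)
    "h1": {"+": 1.0, "-": 0.0},
    "h2": {"+": 0.0, "-": 1.0},
    "h3": {"+": 0.0, "-": 1.0},
}

def bayes_optimal(posterior, label_given_h):
    """Return the label maximizing sum_h P(v | h) * P(h | D), plus all scores."""
    labels = {v for dist in label_given_h.values() for v in dist}
    scores = {
        v: sum(label_given_h[h].get(v, 0.0) * p for h, p in posterior.items())
        for v in labels
    }
    return max(scores, key=scores.get), scores

best_label, scores = bayes_optimal(posterior, label_given_h)
h_map = max(posterior, key=posterior.get)   # the single most probable hypothesis

print(best_label, scores)
print(h_map, max(label_given_h[h_map], key=label_given_h[h_map].get))
```

The MAP hypothesis is $h_1$, which on its own predicts $+$, while the weighted vote over all three hypotheses predicts $-$. This is the situation described in the "Problem with MAP Hypothesis" step.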

# Gibbs Classifier

**Understanding the Problem:**
Imagine you're trying to classify something, like whether an email is spam or not. There are different ways to make these decisions, but some methods can be really complicated and slow, especially if there are lots of possibilities to consider.

**Gibbs Classifier as a Solution:**
The Gibbs Classifier is like a simpler and faster way to make these decisions, especially when there are many options to choose from.

**How Gibbs Classifier Works:**
1. **Choosing a Hypothesis:**
   - Instead of looking at all the possibilities at once, the Gibbs Classifier just picks one possibility randomly. It's like closing your eyes and pointing to a choice on a list.
   - But, it's not entirely random. It picks options that are more likely according to the information we have.

2. **Using the Chosen Hypothesis:**
   - Once it has a random choice, it uses that choice to make a decision about the new thing we're trying to classify, like an email.
   - Even though it's a random pick, it's still based on what we know from our data.

**Surprising Fact About Gibbs Classifier:**
Here's something interesting: Even though the Gibbs Classifier is much simpler and faster than other methods like the Bayes Optimal Classifier, it can still be pretty good at making decisions.

**Comparison with Bayes Optimal Classifier:**
- The Bayes Optimal Classifier is like the gold standard—it gives the best results but can be slow, especially if there are lots of options to consider.
- The surprising thing is that, under certain conditions, the Gibbs Classifier's expected error (how often it makes mistakes) is at most twice as much as the expected error of the Bayes Optimal Classifier.
- It's like saying, "Even though the Gibbs Classifier takes shortcuts and doesn't consider everything, it still does pretty well, and its mistakes aren't much worse than the Bayes Optimal Classifier's mistakes."

**In Simple Terms:**
- The Gibbs Classifier is like a quick and simple way to make decisions, even though it doesn't look at everything.
- Surprisingly, it can still do a pretty good job, and its mistakes aren't much worse than the fancier methods.

### Mathematical

1. **Probability Distributions:**
   - In the context of classification, we have a set of possible hypotheses or choices $H$. Each hypothesis represents a possible way to classify our data.
   - We also have observed data $D$, which determines the posterior probability of each hypothesis given the data, represented by $P(h|D)$.

2. **Choosing a Hypothesis:**
   - The Gibbs Classifier randomly selects a hypothesis from the set $H$ based on their probabilities given the observed data $D$.
   - Mathematically, this means sampling a hypothesis according to the posterior probability distribution $P(h|D)$.

3. **Using the Chosen Hypothesis:**
   - Once a hypothesis $h$ is chosen, the Gibbs Classifier uses this hypothesis to classify new instances.
   - This classification is done based on the specific rules or predictions associated with the chosen hypothesis.

4. **Error Analysis:**
   - Error analysis involves comparing the classifications made by the Gibbs Classifier with the true labels of the data.
   - The expected error of the Gibbs Classifier ($E[error_{Gibbs}]$) is calculated by considering the frequency of incorrect classifications over many instances.

5. **Comparison with Bayes Optimal Classifier:**
   - The surprising fact about the Gibbs Classifier is that, under certain conditions, its expected error is at most twice the expected error of the Bayes Optimal Classifier ($E[error_{BayesOptimal}]$).
   - Mathematically, this can be expressed as $E[error_{Gibbs}] \leq 2 \times E[error_{BayesOptimal}]$.

6. **Uniform Prior Distribution:**
   - One such case: if the learner assumes a uniform prior distribution over the hypothesis space $H$ and that assumption is correct (the target concepts really are drawn uniformly from $H$), the Gibbs Classifier's expected error is at most twice the expected error of the Bayes Optimal Classifier.
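The sampling step can be sketched as follows, reusing the three-hypothesis posterior from the Bayes Optimal Classification example. The function name `gibbs_classify` and the fixed seed are illustrative assumptions; each hypothesis is taken to predict a single label deterministically.

```python
import random

posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}   # P(h | D)
prediction = {"h1": "+", "h2": "-", "h3": "-"}   # label each hypothesis assigns

def gibbs_classify(posterior, prediction, rng=random):
    """Sample one hypothesis according to P(h | D) and return its prediction."""
    hypotheses = list(posterior)
    weights = [posterior[h] for h in hypotheses]
    chosen = rng.choices(hypotheses, weights=weights, k=1)[0]
    return prediction[chosen]

# Repeat the random draw many times to see how often each label comes out.
rng = random.Random(0)   # seeded only to make the run repeatable
draws = [gibbs_classify(posterior, prediction, rng) for _ in range(10_000)]
minus_rate = draws.count("-") / len(draws)
print(minus_rate)
```

Over many draws the classifier outputs $-$ roughly 60% of the time, matching the combined posterior mass of $h_2$ and $h_3$. The Bayes Optimal Classifier would output $-$ every time.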

# Maximum Likelihood Estimation (MLE)

Maximum Likelihood Estimation (MLE) is a method used in statistics and machine learning to estimate the parameters of a statistical model. It's based on the principle of choosing the parameter values that maximize the likelihood of observing the data that we have.

Here's a more detailed explanation:

1. **Parameters and Data:**
   - In statistics, many models have parameters that need to be estimated from data. For example, in a normal distribution, the parameters are the mean and the standard deviation.
   - We have some data that we believe follows a certain distribution, but we don't know the parameters of that distribution.

2. **Likelihood Function:**
   - The likelihood function measures how likely it is to observe our data given specific parameter values.
   - It's essentially the probability of observing our data given our model and the parameter values.
   - Mathematically, it's denoted as $ L(\theta | \text{data}) $, where $ \theta $ represents the parameters of the model.

3. **Maximizing Likelihood:**
   - MLE seeks to find the parameter values that maximize the likelihood function.
   - In other words, MLE finds the parameter values that make our observed data the most likely under our model.

4. **Optimization:**
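   - To find the parameter values that maximize the likelihood function, we can sometimes solve for them analytically by setting the derivative to zero. Otherwise we use iterative optimization techniques such as gradient ascent (equivalently, gradient descent on the negative log-likelihood) or the Newton-Raphson method.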
   - These methods iteratively adjust the parameter values until we find the maximum likelihood.

5. **Interpretation:**
   - Once we've estimated the parameters using MLE, we can use them to make predictions or draw conclusions about our data.
   - The estimated parameters represent the values that are most consistent with the observed data according to our model.

In summary, Maximum Likelihood Estimation (MLE) is a method used to estimate the parameters of a statistical model by finding the parameter values that maximize the likelihood of observing the data. It's widely used in statistics, machine learning, and many other fields for parameter estimation.
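For some models the maximization has a closed-form answer. For the normal distribution mentioned above, the MLE of the mean is the sample mean, and the MLE of the standard deviation is the square root of the average squared deviation (dividing by $n$, not $n - 1$). The sketch below uses a small made-up sample and checks that nearby parameter values give a lower log-likelihood; the function names and data are hypothetical.

```python
import math

def normal_mle(samples):
    """Closed-form MLE for a normal distribution's mean and standard deviation."""
    n = len(samples)
    mean = sum(samples) / n
    variance = sum((x - mean) ** 2 for x in samples) / n   # divides by n, not n - 1
    return mean, math.sqrt(variance)

def normal_log_likelihood(samples, mean, std):
    """Log-likelihood of independent samples under Normal(mean, std^2)."""
    return sum(
        -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)
        for x in samples
    )

data = [4.8, 5.1, 5.3, 4.9, 5.4]   # hypothetical measurements
mu_hat, sigma_hat = normal_mle(data)

best = normal_log_likelihood(data, mu_hat, sigma_hat)
print(mu_hat, sigma_hat, best)
print(normal_log_likelihood(data, mu_hat + 0.1, sigma_hat))   # shifted mean: lower
print(normal_log_likelihood(data, mu_hat, sigma_hat * 1.2))   # wider spread: lower
```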

### Example
Given a discrete random variable $ X $ with a parameter $ \theta $ in the range $ 0 \leq \theta \leq 1 $, and the following probability mass function (PMF) for $ X $:

$ P(X = 0) = \frac{2\theta}{3} $

$ P(X = 1) = \frac{\theta}{3} $

$ P(X = 2) = \frac{2(1 - \theta)}{3} $

$ P(X = 3) = \frac{1 - \theta}{3} $

We're given 10 independent observations $ x $ from such a distribution: $ x = (3, 0, 2, 1, 3, 2, 1, 0, 2, 1) $.
Determine the Maximum Likelihood Estimate (MLE) of the parameter $ \theta $ based on the given observations $ x $.

### Solution

1. **Determine the Formula for $ L(\theta) $**

The likelihood function $ L(\theta) $ is the product of the probabilities of observing each data point under the given probability mass function (PMF). Given the observations $ x = (3, 0, 2, 1, 3, 2, 1, 0, 2, 1) $, the formula for $ L(\theta) $ is:

$ L(\theta) = P(X=3) \times P(X=0) \times P(X=2) \times \ldots \times P(X=1) $

Substituting the given probabilities into this formula:

$ L(\theta) = \left(\frac{1-\theta}{3}\right)^2 \times \left(\frac{2\theta}{3}\right)^2 \times \left(\frac{2(1-\theta)}{3}\right)^3 \times \left(\frac{\theta}{3}\right)^3 $

Collecting the factors over the common denominator (there are $2 + 2 + 3 + 3 = 10$ factors of $\frac{1}{3}$):

$ L(\theta) = \frac{(1-\theta)^2 \times (2\theta)^2 \times \left(2(1-\theta)\right)^3 \times \theta^3}{3^{10}} = \frac{32\,\theta^5 (1-\theta)^5}{3^{10}} $

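2. **Differentiate the Log-Likelihood with Respect to $ \theta $**

To find the maximum likelihood estimate (MLE) of $ \theta $, it is easier to work with the log-likelihood $ LL(\theta) = \log L(\theta) $. The logarithm is monotonically increasing, so $ L(\theta) $ and $ LL(\theta) $ are maximized at the same $ \theta $. We first expand $ LL(\theta) $, then differentiate it with respect to $ \theta $ and set the derivative equal to zero: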

$ LL(\theta) = \log L(\theta) = 2\log\left(\frac{1-\theta}{3}\right) + 2\log\left(\frac{2\theta}{3}\right) + 3\log\left(\frac{2(1-\theta)}{3}\right) + 3\log\left(\frac{\theta}{3}\right) $

$ = 2\left(\log(1-\theta) - \log 3\right) + 2\left(\log(2\theta) - \log 3\right) + 3\left(\log(2(1-\theta)) - \log 3\right) + 3\left(\log(\theta) - \log 3\right) $

$ = 2\log(1-\theta) - 2\log 3 + 2\log(2\theta) - 2\log 3 + 3\log(2(1-\theta)) - 3\log 3 + 3\log(\theta) - 3\log 3 $

$ = 2\log(1-\theta) + 2\log(2\theta) + 3\log(2(1-\theta)) + 3\log(\theta) - 10\log 3 $

When differentiating with respect to $ \theta $, we apply the chain rule:

$ \frac{d}{d\theta} \log(f(\theta)) = \frac{1}{f(\theta)} \cdot \frac{d}{d\theta} f(\theta) $

Applying this rule to each term:

$ \frac{d}{d\theta} \log(1-\theta) = -\frac{1}{1-\theta} $

$ \frac{d}{d\theta} \log(2\theta) = \frac{2}{2\theta} $

$ \frac{d}{d\theta} \log(2(1-\theta)) = -\frac{2}{2(1-\theta)} $

$ \frac{d}{d\theta} \log(\theta) = \frac{1}{\theta} $

Substituting these derivatives into the expression for $LL(\theta)$:

$ \frac{d}{d\theta}LL(\theta) = -2\frac{1}{1-\theta} + 2\frac{1}{\theta} - 3\frac{2}{2(1-\theta)} + 3\frac{1}{\theta} $

$ = -\frac{2}{1-\theta} + \frac{2}{\theta} - \frac{3}{1-\theta} + \frac{3}{\theta} $

$ = -\frac{2}{1-\theta} - \frac{3}{1-\theta} + \frac{2}{\theta} + \frac{3}{\theta} $

$ = -\frac{5}{1-\theta} + \frac{5}{\theta} $


Setting the derivative equal to zero:

$ -\frac{5}{1-\theta} + \frac{5}{\theta} = 0 $

Multiplying both sides by $ \theta(1-\theta) $ to clear the denominators:

$ -5\theta + 5(1-\theta) = 0 $

$ -5\theta + 5 - 5\theta = 0 $

$ -10\theta + 5 = 0 $

Now, we isolate $ \theta $:

$ -10\theta = -5 $

$ \theta = \frac{-5}{-10} $

$ \theta = \frac{1}{2} $

So, the Maximum Likelihood Estimate (MLE) of $ \theta $ is $ \frac{1}{2} $.
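The result can be checked numerically. The sketch below is a minimal check under the stated PMF: it evaluates the log-likelihood of the ten observations on a grid of $\theta$ values and picks the largest. It also compares against the closed form implied by the derivative above, where the counts of $0$ and $1$ multiply $\log\theta$ and the counts of $2$ and $3$ multiply $\log(1-\theta)$. The helper names are illustrative.

```python
import math
from collections import Counter

def pmf(x, theta):
    """P(X = x) for the four-outcome distribution in the example."""
    return {
        0: 2 * theta / 3,
        1: theta / 3,
        2: 2 * (1 - theta) / 3,
        3: (1 - theta) / 3,
    }[x]

def log_likelihood(theta, observations):
    return sum(math.log(pmf(x, theta)) for x in observations)

observations = (3, 0, 2, 1, 3, 2, 1, 0, 2, 1)

# Grid search over the open interval (0, 1) to avoid log(0) at the endpoints.
grid = [i / 1000 for i in range(1, 1000)]
theta_hat = max(grid, key=lambda t: log_likelihood(t, observations))

# Closed form: fraction of observations that are 0 or 1.
counts = Counter(observations)
closed_form = (counts[0] + counts[1]) / len(observations)

print(theta_hat, closed_form)
```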

# **Naive Bayes Algorithm**

- **Definition**: Naive Bayes algorithm is a supervised machine learning algorithm based on Bayes' Theorem, used primarily for classification problems.
  
- **Key Features**:
  - Simple and effective for building fast machine learning models.
  - Utilizes probabilistic classification, making predictions based on probabilities.
  - Widely used in various applications like spam filtration, sentiment analysis, and article classification.
  
**Bayes' Theorem**

- **Definition**: Bayes’ theorem, also known as Bayes’ Rule or Bayes’ Law, is a fundamental theorem in probability theory used to determine the probability of a hypothesis given prior knowledge. It depends on conditional probability.

- **Formula**:
  - $ P(A|B) = \frac{P(B|A) \times P(A)}{P(B)} $
  - $ P(B|A) = \frac{P(A|B) \times P(B)}{P(A)} $
  
- **Components**:
  - $ P(A|B) $: Posterior probability - probability of hypothesis A given observed event B.
  - $ P(B|A) $: Likelihood probability - probability of evidence given that hypothesis A is true.
  - $ P(A) $: Prior probability - probability of hypothesis before observing the evidence.
  - $ P(B) $: Marginal probability - probability of evidence.
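A small numeric sketch of the formula follows. The spam figures are made up purely for illustration, and the function name `posterior` is an assumption. The marginal $P(B)$ is expanded with the law of total probability as $P(B|A)P(A) + P(B|\neg A)P(\neg A)$.

```python
def posterior(prior_a, likelihood_b_given_a, likelihood_b_given_not_a):
    """P(A | B) via Bayes' theorem, with P(B) expanded by total probability."""
    evidence = (likelihood_b_given_a * prior_a
                + likelihood_b_given_not_a * (1 - prior_a))
    return likelihood_b_given_a * prior_a / evidence

# Hypothetical numbers: 20% of mail is spam (A); the word "offer" (B) appears
# in 50% of spam messages and in 5% of non-spam messages.
p_spam_given_offer = posterior(0.2, 0.5, 0.05)
print(p_spam_given_offer)
```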
  
**Types of Naive Bayes Models**

1. **Gaussian Naive Bayes**:
   - Used in classification assuming features follow a normal distribution.
   - Ideal for continuous features.

2. **Multinomial Naive Bayes**:
   - Suitable for discrete counts, often used in text classification.
   - Counts occurrences of features, applicable in the bag-of-words model.

3. **Bernoulli Naive Bayes**:
   - Designed for binary feature vectors (e.g., 0s and 1s).
   - Applied in text classification where features represent "word occurs in the document" or "word does not occur in the document".

### Example 1

| Sky   | AirTemp | Humidity | Wind    | Forecast | Enjoy Sport? |
|-------|---------|----------|---------|----------|--------------|
| Sunny | Warm    | Normal   | Strong  | Same     | Yes          |
| Sunny | Warm    | High     | Strong  | Same     | No           |
| Rainy | Cold    | High     | Strong  | Change   | No           |
| Sunny | Warm    | Normal   | Breeze  | Same     | Yes          |
| Sunny | Hot     | Normal   | Breeze  | Same     | No           |
| Rainy | Cold    | High     | Strong  | Change   | No           |
| Sunny | Warm    | High     | Strong  | Change   | Yes          |
| Rainy | Warm    | Normal   | Breeze  | Same     | ???          |


### Step 1: Calculate Class Probabilities

Let's calculate the probabilities of each class (Yes/No).

| Enjoy Sport? | Count | Probability |
|--------------|-------|-------------|
| Yes          | 3     | 3/7   |
| No           | 4     | 4/7   |

### Step 2: Calculate Feature Probabilities

#### Feature Probabilities for Each Class:

Each entry is the fraction of rows in that class (3 "Yes" rows, 4 "No" rows) with the given feature value.

#### Sky:
| Sky   | Enjoy = Yes | Enjoy = No |
|-------|-------------|------------|
| Sunny | 3/3         | 2/4        |
| Rainy | 0/3         | 2/4        |

#### AirTemp:
| AirTemp | Enjoy = Yes | Enjoy = No |
|---------|-------------|------------|
| Hot     | 0/3         | 1/4        |
| Warm    | 3/3         | 1/4        |
| Cold    | 0/3         | 2/4        |

#### Humidity:
| Humidity | Enjoy = Yes | Enjoy = No |
|----------|-------------|------------|
| High     | 1/3         | 3/4        |
| Normal   | 2/3         | 1/4        |

#### Wind:
| Wind   | Enjoy = Yes | Enjoy = No |
|--------|-------------|------------|
| Strong | 2/3         | 3/4        |
| Breeze | 1/3         | 1/4        |

#### Forecast:
| Forecast | Enjoy = Yes | Enjoy = No |
|----------|-------------|------------|
| Same     | 2/3         | 2/4        |
| Change   | 1/3         | 2/4        |

### Step 3: Classify a New Instance

We'll classify a new instance using the probabilities calculated in Step 2. Let's say we have a new instance with the following features:

- Sky = Sunny
- AirTemp = Warm
- Humidity = Normal
- Wind = Strong
- Forecast = Change

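P(X) is the same for both classes, so it can be dropped when comparing them (the MAP rule). The naive assumption that the features are conditionally independent given the class lets us write P(X | class) as a product of per-feature probabilities.

P(Enjoy=Yes | X)

= P(X | Enjoy=Yes) . P(Enjoy=Yes) / P(X)

∝ P(X | Enjoy=Yes) . P(Enjoy=Yes)

= P(X | Enjoy=Yes) . (3/7)

= P(Sunny | Enjoy=Yes) . P(Warm | Enjoy=Yes) . P(Normal | Enjoy=Yes) . P(Strong | Enjoy=Yes) . P(Change | Enjoy=Yes) . (3/7)

= (3/3) . (3/3) . (2/3) . (2/3) . (1/3) . (3/7) = 0.0635

P(Enjoy=No | X)

= P(X | Enjoy=No) . P(Enjoy=No) / P(X)

∝ P(X | Enjoy=No) . P(Enjoy=No)

= P(X | Enjoy=No) . (4/7)

= P(Sunny | Enjoy=No) . P(Warm | Enjoy=No) . P(Normal | Enjoy=No) . P(Strong | Enjoy=No) . P(Change | Enjoy=No) . (4/7)

= (2/4) . (1/4) . (1/4) . (3/4) . (2/4) . (4/7) = 0.006696

Since 0.0635 > 0.006696, the classifier predicts Enjoy Sport = Yes for this instance.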
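The counting and multiplication above can be reproduced in code. The sketch below is a minimal, unsmoothed Naive Bayes built directly from the seven labeled rows of the table. The function name `naive_bayes_scores` is an illustrative choice, and exact fractions are used so the results can be compared with the hand calculation.

```python
from collections import Counter, defaultdict
from fractions import Fraction

# Columns: Sky, AirTemp, Humidity, Wind, Forecast -> Enjoy Sport?
TRAINING = [
    (("Sunny", "Warm", "Normal", "Strong", "Same"), "Yes"),
    (("Sunny", "Warm", "High", "Strong", "Same"), "No"),
    (("Rainy", "Cold", "High", "Strong", "Change"), "No"),
    (("Sunny", "Warm", "Normal", "Breeze", "Same"), "Yes"),
    (("Sunny", "Hot", "Normal", "Breeze", "Same"), "No"),
    (("Rainy", "Cold", "High", "Strong", "Change"), "No"),
    (("Sunny", "Warm", "High", "Strong", "Change"), "Yes"),
]

def naive_bayes_scores(training, instance):
    """Return P(class) * prod_i P(feature_i | class) for every class."""
    class_counts = Counter(label for _, label in training)
    value_counts = defaultdict(Counter)   # (label, feature index) -> value counts
    for values, label in training:
        for i, value in enumerate(values):
            value_counts[(label, i)][value] += 1

    scores = {}
    for label, n_label in class_counts.items():
        score = Fraction(n_label, len(training))   # prior P(class)
        for i, value in enumerate(instance):
            score *= Fraction(value_counts[(label, i)][value], n_label)
        scores[label] = score
    return scores

scores = naive_bayes_scores(TRAINING, ("Sunny", "Warm", "Normal", "Strong", "Change"))
prediction = max(scores, key=scores.get)
print({label: float(s) for label, s in scores.items()}, prediction)
```

The scores are unnormalized. Dividing each by their sum would give the actual posterior probabilities, but the ranking of the classes stays the same.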

## Example 2
<img src ="images/nbc_1.png">

# Laplace smoothing

The Naïve Bayes Classifier, despite its simplicity and efficiency, may encounter certain issues, especially when dealing with data that contains sparse or missing values. Laplace smoothing, also known as additive smoothing, is a technique used to address some of these issues. Here are a few problems with Naïve Bayes Classifier that Laplace smoothing can help mitigate:

1. **Zero Frequency Problem**:
   - If a particular feature value in the training data does not occur with a certain class label, the conditional probability of that feature given the class becomes zero. This can lead to a probability of zero for the entire class when predicting with new data.

2. **Overfitting**:
   - In cases where the training dataset is small, the model may overfit, especially when dealing with categorical variables with many possible values. Laplace smoothing can help reduce overfitting by adding a small pseudocount to every feature count.

3. **Unseen Features**:
   - When encountering a feature value in the test data that was not present in the training data, the Naïve Bayes Classifier assigns a probability of zero to that feature value given the class. Laplace smoothing allows the model to assign non-zero probabilities to unseen feature values by adding pseudocounts to all feature counts.

4. **Imbalanced Data**:
   - In situations where the distribution of classes in the training data is imbalanced, the classifier may give disproportionate weight to the majority class. Laplace smoothing can help balance the impact of rare classes by smoothing the estimates of probabilities.

The formula for Laplace smoothing can be expressed as follows:

Given:
- $ n $: Number of times the event or feature value $ x $ occurs (for Naive Bayes, within the class being considered).
- $ N $: Total number of observations (for Naive Bayes, the number of samples in that class).
- $ V $: Total number of unique values for the feature.
- $ \alpha $: Smoothing parameter (usually set to 1 for Laplace smoothing).

The Laplace-smoothed probability of $ x $ is calculated as:

$ P_{\text{smoothed}}(x) = \frac{n + \alpha}{N + \alpha \cdot V} $

This formula ensures that even if an event has not been observed in the training data (i.e., $ n = 0 $), it still receives a non-zero probability estimate after smoothing. The $ \alpha $ parameter controls the degree of smoothing, with larger values of $ \alpha $ resulting in stronger smoothing. Typically, Laplace smoothing with $ \alpha = 1 $ is used as a simple and effective way to address the zero-frequency problem in probability estimation.
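The formula translates directly into a one-line function. The sketch below applies it to the Sky feature within the "Yes" class from Example 1, where Rainy occurs 0 times out of 3 and Sky has 2 possible values. The function name `laplace_smoothed` is an illustrative choice.

```python
def laplace_smoothed(count, total, num_values, alpha=1):
    """Return (count + alpha) / (total + alpha * num_values)."""
    return (count + alpha) / (total + alpha * num_values)

# Sky within the "Yes" class: Rainy seen 0 of 3 times, Sunny 3 of 3 times.
p_rainy = laplace_smoothed(count=0, total=3, num_values=2)
p_sunny = laplace_smoothed(count=3, total=3, num_values=2)
print(p_rainy, p_sunny, p_rainy + p_sunny)

# A larger alpha pulls the estimates closer to the uniform value 1 / V.
print(laplace_smoothed(0, 3, 2, alpha=10))
```

The smoothed estimates over all values of a feature still sum to 1, because $\alpha$ is added once per value in the numerator and $V$ times in the denominator.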

### Example 3

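Continuing from Example 1, we classify the unlabeled row of the table (Sky = Rainy, AirTemp = Warm, Humidity = Normal, Wind = Breeze, Forecast = Same). Without smoothing, P(Rainy | Enjoy=Yes) = 0/3 would force the whole product for "Yes" to zero, so the Sky factor is smoothed with $ \alpha = 1 $ and $ V = 2 $ (Sunny, Rainy). Here only the Sky factor is smoothed; in practice the same smoothing is usually applied to every factor.

P(Enjoy=Yes | X) = P(X | Enjoy=Yes) . P(Enjoy=Yes) / P(X)

∝ P(X | Enjoy=Yes) . P(Enjoy=Yes)

= P(X | Enjoy=Yes) . (3/7)

= P(Rainy | Enjoy=Yes) . P(Warm | Enjoy=Yes) . P(Normal | Enjoy=Yes) . P(Breeze | Enjoy=Yes) . P(Same | Enjoy=Yes) . (3/7)

= ((0+1)/(3+2)) . (3/3) . (2/3) . (1/3) . (2/3) . (3/7) = 0.0127

P(Enjoy=No | X) = P(X | Enjoy=No) . P(Enjoy=No) / P(X)

∝ P(X | Enjoy=No) . P(Enjoy=No)

= P(X | Enjoy=No) . (4/7)

= P(Rainy | Enjoy=No) . P(Warm | Enjoy=No) . P(Normal | Enjoy=No) . P(Breeze | Enjoy=No) . P(Same | Enjoy=No) . (4/7)

= ((2+1)/(4+2)) . (1/4) . (1/4) . (1/4) . (2/4) . (4/7) = 0.00223

Since 0.0127 > 0.00223, the classifier predicts Enjoy Sport = Yes.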
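The arithmetic for this example can be verified with exact fractions. The sketch below mirrors the worked example by smoothing only the Sky factor with $\alpha = 1$ and $V = 2$, and takes the remaining factors directly from the Example 1 tables. The helper name `smoothed` is illustrative.

```python
from fractions import Fraction

def smoothed(count, total, num_values, alpha=1):
    """Laplace-smoothed estimate (count + alpha) / (total + alpha * num_values)."""
    return Fraction(count + alpha, total + alpha * num_values)

# Factor order: Rainy, Warm, Normal, Breeze, Same, then the class prior.
score_yes = (smoothed(0, 3, 2) * Fraction(3, 3) * Fraction(2, 3)
             * Fraction(1, 3) * Fraction(2, 3) * Fraction(3, 7))
score_no = (smoothed(2, 4, 2) * Fraction(1, 4) * Fraction(1, 4)
            * Fraction(1, 4) * Fraction(2, 4) * Fraction(4, 7))

print(float(score_yes), float(score_no))
print("Yes" if score_yes > score_no else "No")
```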