My biggest question would be: why do we need MAP (Maximum a Posteriori) estimation if we already have **MLE (Maximum Likelihood Estimation)**?

MLE has some limitations that make it unreliable in certain settings, e.g. a noisy dataset, a small dataset, etc.

That is why we need estimators that still work well under those conditions.



In MLE, we choose the $\theta$ that maximizes the likelihood of the observed data:

$$
\hat{\theta}_{MLE} = \arg\max_{\theta} P(D \mid \theta)
$$

That’s great if you have tons of data.  
But what happens if:

- You have very few samples?  
- Your data has noise or missing points?  
- You want to include prior knowledge or beliefs about the parameter?  

Then MLE can get overconfident or overfit.  
It has no regularization, no mechanism to encode what we already know.
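
To make the overconfidence concrete, here is a minimal sketch that assumes a coin-flip setting where $\theta$ is the probability of heads. For a coin, the MLE is just the observed fraction of heads. The function name `mle_heads` and the flip counts are hypothetical, chosen only for illustration.

```python
def mle_heads(heads: int, tails: int) -> float:
    """MLE of P(heads) for a coin: the observed fraction of heads."""
    total = heads + tails
    if total == 0:
        raise ValueError("need at least one flip")
    return heads / total


# Three flips, all heads: MLE concludes the coin can never land tails.
print(mle_heads(3, 0))      # 1.0
# With many flips the estimate is far more believable.
print(mle_heads(520, 480))  # 0.52
```

With only three flips, the estimate of exactly 1.0 is almost certainly too extreme, and nothing in MLE pushes back against it.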

That’s where **MAP estimation** enters.


### Bayes’ Theorem for Parameters

Bayes’ theorem for parameters is:

$$
P(\theta \mid D) = \frac{P(D \mid \theta) \, P(\theta)}{P(D)}
$$

Where:

- $P(\theta \mid D)$ : **Posterior** (updated belief about $\theta$ after seeing data)  
- $P(D \mid \theta)$ : **Likelihood** (how likely data is given $\theta$)  
- $P(\theta)$ : **Prior** (belief about $\theta$ before seeing data)  
- $P(D)$ : **Evidence** (normalization constant)

---

### MAP Estimation

MAP estimation chooses the most probable parameter given the data:

$$
\hat{\theta}_{MAP} = \arg\max_{\theta} P(\theta \mid D)
$$

Substitute Bayes’ theorem:

$$
\hat{\theta}_{MAP} = \arg\max_{\theta} P(D \mid \theta) \, P(\theta)
$$

We drop $P(D)$ since it doesn’t depend on $\theta$.

---

**Final MAP Formula:**

$$
\hat{\theta}_{MAP} = \arg\max_{\theta} \big[ P(D \mid \theta) \cdot P(\theta) \big]
$$

Or equivalently, in log-space (the logarithm is monotonically increasing, so the maximizing $\theta$ is unchanged):

$$
\hat{\theta}_{MAP} = \arg\max_{\theta} \big[ \log P(D \mid \theta) + \log P(\theta) \big]
$$
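
The log-space form can be checked numerically. The sketch below assumes a coin with `heads` and `tails` counts and a Beta-shaped prior with parameters `alpha` and `beta`. All of these names and values are illustrative assumptions, and a brute-force grid search stands in for a proper optimizer.

```python
import math


def log_posterior(theta: float, heads: int, tails: int,
                  alpha: float, beta: float) -> float:
    """Unnormalized log posterior: log-likelihood + log-prior.

    Terms that do not depend on theta are dropped, which does not
    change the argmax.
    """
    log_lik = heads * math.log(theta) + tails * math.log(1.0 - theta)
    log_prior = ((alpha - 1.0) * math.log(theta)
                 + (beta - 1.0) * math.log(1.0 - theta))
    return log_lik + log_prior


def map_by_grid(heads: int, tails: int, alpha: float, beta: float,
                steps: int = 9999) -> float:
    """Brute-force argmax over an evenly spaced grid strictly inside (0, 1)."""
    grid = [(i + 1) / (steps + 1) for i in range(steps)]
    return max(grid, key=lambda t: log_posterior(t, heads, tails, alpha, beta))


# Three heads, no tails, with a prior that mildly favors a fair coin.
theta_map = map_by_grid(heads=3, tails=0, alpha=2.0, beta=2.0)
print(round(theta_map, 3))  # 0.8
```

Working with sums of logs also avoids the numerical underflow that products of many small probabilities would cause on larger datasets.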


## Introduce a Prior

Let’s say $\theta$ is the probability that a coin lands heads, and that before seeing any data we believe $\theta$ is likely near $0.5$  
(i.e., most coins are fair).

We model that prior belief using a **Beta distribution**:

$$
P(\theta) = \text{Beta}(\theta; \alpha, \beta) = 
\frac{\theta^{\alpha - 1}(1 - \theta)^{\beta - 1}}{B(\alpha, \beta)}
$$

If we set $\alpha = \beta = 2$, it means:

> “$\theta$ is probably near $0.5$, but not fixed.”
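
To see what that prior looks like, the density can be evaluated at a few points. This is a minimal sketch using only the standard library; the helper name `beta_pdf` is hypothetical, and it computes $B(\alpha, \beta)$ from the Gamma function.

```python
import math


def beta_pdf(theta: float, alpha: float, beta: float) -> float:
    """Density of Beta(alpha, beta) at theta, for 0 < theta < 1."""
    # B(alpha, beta) = Gamma(alpha) * Gamma(beta) / Gamma(alpha + beta)
    norm = math.gamma(alpha) * math.gamma(beta) / math.gamma(alpha + beta)
    return theta ** (alpha - 1) * (1 - theta) ** (beta - 1) / norm


# Evaluate the Beta(2, 2) prior at a few candidate values of theta.
for theta in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(theta, round(beta_pdf(theta, 2, 2), 3))
```

For $\alpha = \beta = 2$ the density simplifies to $6\,\theta(1 - \theta)$, so it peaks at $\theta = 0.5$ with value $1.5$ and falls off symmetrically toward $0$ and $1$.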


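## Mode (Peak) of the Posterior Beta Distribution

After observing $H$ heads and $T$ tails, the posterior is again a Beta distribution, $\text{Beta}(\theta; \alpha + H, \beta + T)$. Its **mode** (peak), i.e., the most likely value of $\theta$, is the MAP estimate (this holds for $\alpha, \beta \ge 1$ with at least one flip observed):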

$$
\hat{\theta}_{MAP} = \frac{\alpha + H - 1}{\alpha + \beta + H + T - 2}
$$

That’s the formula we used.
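
For readers who want to see where it comes from, here is a short derivation sketch, assuming $H$ heads and $T$ tails from independent flips of the same coin. Multiplying the likelihood by the Beta prior and dropping constants gives:

$$
P(\theta \mid D) \propto \theta^{H}(1 - \theta)^{T} \cdot \theta^{\alpha - 1}(1 - \theta)^{\beta - 1}
= \theta^{\alpha + H - 1}(1 - \theta)^{\beta + T - 1}
$$

Taking the log, differentiating with respect to $\theta$, and setting the result to zero:

$$
\frac{\alpha + H - 1}{\theta} - \frac{\beta + T - 1}{1 - \theta} = 0
\quad \Longrightarrow \quad
\theta = \frac{\alpha + H - 1}{\alpha + \beta + H + T - 2}
$$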


| α & β      | Interpretation                          | Effect on MAP            |
| ---------- | --------------------------------------- | ------------------------ |
| α = β = 1  | Uniform prior (no bias)                 | MAP = MLE                |
| α > β      | Prior belief: coin likely heads         | MAP pulled toward heads  |
| α < β      | Prior belief: coin likely tails         | MAP pulled toward tails  |
| α, β large | Strong belief (less impact of new data) | Posterior changes little |
| α, β small | Weak belief (data dominates)            | Posterior moves easily   |
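
The rows of this table can be reproduced with the closed-form MAP estimate. The sketch below is illustrative only: the function name `map_estimate`, the data (7 heads, 3 tails), and the prior settings are all hypothetical choices.

```python
def map_estimate(heads: int, tails: int, alpha: float, beta: float) -> float:
    """Closed-form MAP for a coin with a Beta(alpha, beta) prior."""
    a = alpha + heads - 1   # exponent of theta in the posterior
    b = beta + tails - 1    # exponent of (1 - theta) in the posterior
    if a < 0 or b < 0 or a + b == 0:
        raise ValueError("mode formula does not apply for these values")
    return a / (a + b)


heads, tails = 7, 3  # hypothetical data; the MLE is 7 / 10 = 0.7

priors = [
    (1, 1),      # uniform prior
    (5, 2),      # alpha > beta
    (2, 5),      # alpha < beta
    (50, 50),    # large alpha, beta: strong belief in a fair coin
    (1.1, 1.1),  # small alpha, beta: weak belief
]
for alpha, beta in priors:
    print(alpha, beta, round(map_estimate(heads, tails, alpha, beta), 3))
```

With these numbers the uniform prior returns the MLE of 0.7, the (5, 2) prior nudges the estimate up to about 0.733, the (2, 5) prior pulls it down to about 0.533, and the strong (50, 50) prior holds it near 0.5 at about 0.519.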
