# Maximum A Posteriori (MAP) Estimation

The Maximum A Posteriori (MAP) estimate is a Bayesian approach to parameter estimation. It finds the value of the parameter $p$ that maximizes the posterior distribution $P(p \mid \text{data})$. Unlike Maximum Likelihood Estimation (MLE), which only considers the likelihood, MAP incorporates prior knowledge about the parameter through the prior distribution.

For the Bernoulli-Beta model, we will derive the MAP estimate for $p$ and compare it with MLE.

---

## Step-by-Step Derivation of MAP for Bernoulli-Beta Model

### 1. Posterior Distribution

From earlier derivations, the posterior distribution for $p$ given the data is:

$$
P(p \mid \text{data}) \propto P(\text{data} \mid p) P(p),
$$

where:

- $P(\text{data} \mid p) = p^s (1 - p)^f$ is the likelihood of $n$ independent Bernoulli trials with $s$ successes and $f = n - s$ failures,
- $P(p) = \frac{1}{B(\alpha, \beta)} p^{\alpha - 1} (1 - p)^{\beta - 1}$ is the Beta prior.

Substitute these into the posterior:

$$
P(p \mid \text{data}) \propto p^s (1 - p)^f \cdot p^{\alpha - 1} (1 - p)^{\beta - 1}.
$$

Combine terms:

$$
P(p \mid \text{data}) \propto p^{s + \alpha - 1} (1 - p)^{f + \beta - 1}.
$$

This is the kernel of a Beta density, so the posterior is itself a Beta distribution:

$$
p \mid \text{data} \sim \text{Beta}(\alpha + s, \beta + f).
$$
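
The conjugate update above amounts to adding counts to the prior parameters. A minimal sketch, assuming the data arrive as a list of 0/1 outcomes; the function name `beta_posterior_params` and the example data are illustrative, not part of the derivation:

```python
def beta_posterior_params(observations, alpha, beta):
    """Return (alpha + s, beta + f) for Bernoulli data under a Beta(alpha, beta) prior."""
    s = sum(observations)          # number of successes (1s)
    f = len(observations) - s      # number of failures (0s)
    return alpha + s, beta + f


# Hypothetical data: 7 successes and 3 failures, with a Beta(2, 2) prior.
data = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]
post_alpha, post_beta = beta_posterior_params(data, alpha=2.0, beta=2.0)
print(post_alpha, post_beta)  # 9.0 5.0
```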

---

### 2. Objective: Maximize the Posterior

To find the MAP estimate, we maximize the posterior distribution $P(p \mid \text{data})$.  
Because the logarithm is monotonic, we can equivalently maximize the log-posterior. The normalizing constant does not depend on $p$, so it can be dropped and we work with the log of the kernel:

$$
\ell(p) = \ln \left[ p^{s + \alpha - 1} (1 - p)^{f + \beta - 1} \right].
$$

Simplify using logarithmic properties:

$$
\ell(p) = (s + \alpha - 1) \ln p + (f + \beta - 1) \ln (1 - p).
$$

---

### 3. Maximize the Log-Posterior

Take the derivative of $\ell(p)$ with respect to $p$:

$$
\frac{d}{dp} \ell(p) = \frac{s + \alpha - 1}{p} - \frac{f + \beta - 1}{1 - p}.
$$

Set the derivative equal to zero to find the critical points:

$$
\frac{s + \alpha - 1}{p} = \frac{f + \beta - 1}{1 - p}.
$$

Cross-multiply:

$$
(s + \alpha - 1)(1 - p) = (f + \beta - 1)p.
$$

Expand and rearrange:

$$
s + \alpha - 1 - (s + \alpha - 1)p = (f + \beta - 1)p.
$$

Combine the terms involving $p$:

$$
s + \alpha - 1 = (s + \alpha - 1 + f + \beta - 1)p.
$$

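Solve for $p$:

$$
p = \frac{s + \alpha - 1}{s + f + \alpha + \beta - 2}.
$$

This critical point lies in $(0, 1)$ and is a maximum when $s + \alpha > 1$ and $f + \beta > 1$, because then $\ell''(p) = -\frac{s + \alpha - 1}{p^2} - \frac{f + \beta - 1}{(1 - p)^2} < 0$. Otherwise the posterior mode sits on the boundary of $[0, 1]$ and the closed form does not apply.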

---

### 4. MAP Estimate

The MAP estimate of $p$ is:

$$
\hat{p}_{\text{MAP}} = \frac{s + \alpha - 1}{n + \alpha + \beta - 2},
$$

where:

- $s = \sum_{i=1}^n x_i$ is the total number of successes,
- $f = n - s$ is the total number of failures,
- $\alpha$ and $\beta$ are the parameters of the Beta prior.
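
As a sanity check, the closed form can be compared with a brute-force maximization of the log-posterior on a grid. This is a sketch under the assumption that $s + \alpha > 1$ and $f + \beta > 1$ (so the mode is interior); the helper names and the example counts are hypothetical:

```python
import math


def log_posterior_kernel(p, s, f, alpha, beta):
    """Log of the unnormalized Beta posterior at p in (0, 1)."""
    return (s + alpha - 1) * math.log(p) + (f + beta - 1) * math.log(1 - p)


def map_estimate(s, f, alpha, beta):
    """Closed-form MAP estimate; only valid for an interior mode."""
    a, b = alpha + s, beta + f
    if a <= 1 or b <= 1:
        raise ValueError("closed form requires alpha + s > 1 and beta + f > 1")
    return (a - 1) / (a + b - 2)


# Hypothetical example: 7 successes, 3 failures, Beta(2, 2) prior.
s, f, alpha, beta = 7, 3, 2.0, 2.0
p_map = map_estimate(s, f, alpha, beta)  # (7 + 2 - 1) / (10 + 4 - 2) = 8/12

# Brute-force check: evaluate the log-posterior on a fine grid inside (0, 1).
grid = [i / 10000 for i in range(1, 10000)]
p_grid = max(grid, key=lambda p: log_posterior_kernel(p, s, f, alpha, beta))

print(p_map, p_grid)
```

The two values should agree to within the grid spacing.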

---

### 5. Comparison of MAP and MLE

**MLE Estimate**:  
The MLE estimate for $p$ is:

$$
\hat{p}_{\text{MLE}} = \frac{s}{n}.
$$

**MAP Estimate**:  
The MAP estimate incorporates the prior information:

$$
\hat{p}_{\text{MAP}} = \frac{s + \alpha - 1}{n + \alpha + \beta - 2}.
$$
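
A small numerical comparison makes the difference concrete. The sketch below uses made-up counts (three successes in three trials) to show how the MLE jumps to an extreme value while the MAP estimate under a Beta(2, 2) prior does not; with a uniform Beta(1, 1) prior the two estimates coincide:

```python
def mle_estimate(s, n):
    """Maximum likelihood estimate of p: the observed success proportion."""
    return s / n


def map_estimate(s, n, alpha, beta):
    """MAP estimate of p under a Beta(alpha, beta) prior (interior mode assumed)."""
    return (s + alpha - 1) / (n + alpha + beta - 2)


s, n = 3, 3  # hypothetical data: 3 successes out of 3 trials

print(mle_estimate(s, n))              # 1.0
print(map_estimate(s, n, 2.0, 2.0))    # 0.8, i.e. (3 + 1) / (3 + 2)
print(map_estimate(s, n, 1.0, 1.0))    # 1.0, uniform prior reproduces the MLE
```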

---

### Intuition

- **Prior Influence**: The prior contributes $\alpha - 1$ to the numerator and $\alpha + \beta - 2$ to the denominator, so it acts like $\alpha - 1$ pseudo-successes and $\beta - 1$ pseudo-failures added to the data.
- **Regularization**: The MAP estimate can be viewed as a regularized version of the MLE, where the prior acts as a smoothing factor to prevent overfitting to small datasets.
- **Behavior with Small Data**:  
  When $n$ is small, the MAP estimate is pulled away from the MLE toward the prior mode $\frac{\alpha - 1}{\alpha + \beta - 2}$ (defined for $\alpha, \beta > 1$). It is the posterior mean $\frac{\alpha + s}{\alpha + \beta + n}$, not the MAP estimate, that shrinks toward the prior mean $\frac{\alpha}{\alpha + \beta}$.  
  As $n \to \infty$, the influence of the prior diminishes, and $\hat{p}_{\text{MAP}} \to \hat{p}_{\text{MLE}}$.
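
One way to see the shrinkage: for $\alpha + \beta > 2$, the MAP estimate can be rewritten as a weighted average of the MLE and $\frac{\alpha - 1}{\alpha + \beta - 2}$, the mode of the Beta prior, with weight $w = \frac{n}{n + \alpha + \beta - 2}$ on the MLE. For a symmetric prior the prior mode and prior mean coincide. The sketch below checks this identity and shows the weight on the data growing with $n$; the prior and the fixed success proportion of 0.8 are arbitrary illustrative choices:

```python
def map_estimate(s, n, alpha, beta):
    """MAP estimate of p under a Beta(alpha, beta) prior (interior mode assumed)."""
    return (s + alpha - 1) / (n + alpha + beta - 2)


alpha, beta = 3.0, 3.0                          # hypothetical prior
prior_mode = (alpha - 1) / (alpha + beta - 2)   # 0.5 for this symmetric prior

for n in (10, 100, 1000, 10000):
    s = 4 * n // 5                              # keep the observed proportion at 0.8
    w = n / (n + alpha + beta - 2)              # weight placed on the MLE
    p_map = map_estimate(s, n, alpha, beta)
    blended = w * (s / n) + (1 - w) * prior_mode
    # p_map and blended agree; both approach 0.8 as n grows.
    print(n, round(w, 4), round(p_map, 4), round(blended, 4))
```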


# Predictive Distribution in Bayesian Inference  

The predictive distribution is a key concept in Bayesian statistics. It allows us to make predictions about new, unseen data based on the observed data and our prior beliefs. Specifically, it integrates over the uncertainty in the parameter $p$ (e.g., the probability of success in a Bernoulli trial) by using the posterior distribution.  

In this discussion, we'll derive the posterior predictive distribution for the Bernoulli-Beta model and explore its interpretation and applications.  

## Posterior Predictive Distribution: Definition  

The posterior predictive distribution gives the probability of observing a new data point $X_{\text{new}}$ given the observed data. Mathematically:  

$$  
P(X_{\text{new}} \mid \text{data}) = \int_{0}^{1} P(X_{\text{new}} \mid p) P(p \mid \text{data}) dp.  
$$  

where:  

- $P(X_{\text{new}} \mid p)$ is the likelihood of the new observation given $p$,  
- $P(p \mid \text{data})$ is the posterior distribution of $p$ after observing the data.  

For the Bernoulli-Beta model:  

- $P(X_{\text{new}} \mid p)$ is Bernoulli: $P(X_{\text{new}} = 1 \mid p) = p$ and $P(X_{\text{new}} = 0 \mid p) = 1 - p$,  
- $P(p \mid \text{data})$ is Beta: $p \mid \text{data} \sim \text{Beta}(\alpha + s, \beta + f)$.  

---

## Derivation of the Posterior Predictive Distribution  

### Step 1: Write the Predictive Distribution  

The predictive distribution is:  

$$  
P(X_{\text{new}} \mid \text{data}) = \int_{0}^{1} P(X_{\text{new}} \mid p) P(p \mid \text{data}) dp.  
$$  

Substitute the Bernoulli likelihood $P(X_{\text{new}} \mid p)$ and the Beta posterior $P(p \mid \text{data})$:  

$$  
P(X_{\text{new}} \mid \text{data}) = \int_{0}^{1} \left[ p^{X_{\text{new}}} (1 - p)^{1 - X_{\text{new}}} \right] \cdot \frac{p^{\alpha + s - 1} (1 - p)^{\beta + f - 1}}{B(\alpha + s, \beta + f)} dp.  
$$  

Combine terms:  

$$  
P(X_{\text{new}} \mid \text{data}) = \frac{1}{B(\alpha + s, \beta + f)} \int_{0}^{1} p^{\alpha + s + X_{\text{new}} - 1} (1 - p)^{\beta + f + (1 - X_{\text{new}}) - 1} dp.  
$$  

---

### Step 2: Recognize the Beta Integral  

The integral has the form of a Beta function:  

$$  
\int_{0}^{1} p^{a - 1} (1 - p)^{b - 1} dp = B(a, b).  
$$  

Here:  
- $a = \alpha + s + X_{\text{new}}$,  
- $b = \beta + f + (1 - X_{\text{new}})$.  

Thus:  

$$  
P(X_{\text{new}} \mid \text{data}) = \frac{B(\alpha + s + X_{\text{new}}, \beta + f + (1 - X_{\text{new}}))}{B(\alpha + s, \beta + f)}.  
$$  
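
The ratio of Beta functions can be evaluated directly, which is a useful check before simplifying by hand. A minimal sketch using `math.lgamma` from the Python standard library to compute $\ln B(a, b)$; the function names and example counts are illustrative assumptions:

```python
import math


def log_beta(a, b):
    """ln B(a, b) computed from log-gamma values to avoid overflow."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def predictive_prob(x_new, s, f, alpha, beta):
    """P(X_new = x_new | data) as a ratio of Beta functions, x_new in {0, 1}."""
    a, b = alpha + s, beta + f
    return math.exp(log_beta(a + x_new, b + (1 - x_new)) - log_beta(a, b))


# Hypothetical example: 7 successes, 3 failures, Beta(2, 2) prior.
p1 = predictive_prob(1, s=7, f=3, alpha=2.0, beta=2.0)
p0 = predictive_prob(0, s=7, f=3, alpha=2.0, beta=2.0)
print(p1, p0, p1 + p0)
```

The two probabilities should sum to 1 up to floating-point error.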

---

### Step 3: Simplify for Specific Cases  

**Case 1:** $X_{\text{new}} = 1$ (Success)  

If $X_{\text{new}} = 1$:  

$$  
P(X_{\text{new}} = 1 \mid \text{data}) = \frac{B(\alpha + s + 1, \beta + f)}{B(\alpha + s, \beta + f)}.  
$$  

Using the property of the Beta function:  

$$  
B(a, b) = \frac{\Gamma(a) \Gamma(b)}{\Gamma(a + b)},  
$$  

we can rewrite this as:  

$$  
P(X_{\text{new}} = 1 \mid \text{data}) = \frac{\Gamma(\alpha + s + 1) \Gamma(\beta + f)}{\Gamma(\alpha + s + \beta + f + 1)} \cdot \frac{\Gamma(\alpha + s + \beta + f)}{\Gamma(\alpha + s) \Gamma(\beta + f)}.  
$$  

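The factors $\Gamma(\beta + f)$ cancel. Applying $\Gamma(z + 1) = z \, \Gamma(z)$ to $\Gamma(\alpha + s + 1)$ and to $\Gamma(\alpha + s + \beta + f + 1)$ cancels the remaining Gamma functions:  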

$$  
P(X_{\text{new}} = 1 \mid \text{data}) = \frac{\alpha + s}{\alpha + s + \beta + f}.  
$$  

---

**Case 2:** $X_{\text{new}} = 0$ (Failure)  

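If $X_{\text{new}} = 0$, the same steps with $B(\alpha + s, \beta + f + 1)$ in the numerator give:  

$$  
P(X_{\text{new}} = 0 \mid \text{data}) = \frac{B(\alpha + s, \beta + f + 1)}{B(\alpha + s, \beta + f)} = \frac{\beta + f}{\alpha + s + \beta + f}.  
$$  

This equals $1 - P(X_{\text{new}} = 1 \mid \text{data})$, as it must.  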

---

## Final Result: Posterior Predictive Distribution  

The posterior predictive distribution for $X_{\text{new}}$ is:  

$$  
P(X_{\text{new}} = 1 \mid \text{data}) = \frac{\alpha + s}{\alpha + s + \beta + f},  
$$  
$$  
P(X_{\text{new}} = 0 \mid \text{data}) = \frac{\beta + f}{\alpha + s + \beta + f}.  
$$  
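
The closed form can also be checked by simulation: draw $p$ from the Beta posterior, then draw a Bernoulli outcome with that $p$, and average. The sketch below uses `random.betavariate` from the Python standard library; the number of draws, the seed, and the example counts are arbitrary choices for illustration:

```python
import random


def predictive_success_prob(s, f, alpha, beta):
    """Closed-form P(X_new = 1 | data) for the Bernoulli-Beta model."""
    return (alpha + s) / (alpha + s + beta + f)


def simulate_predictive(s, f, alpha, beta, draws=200_000, seed=0):
    """Monte Carlo estimate: sample p from the posterior, then X_new given p."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(draws):
        p = rng.betavariate(alpha + s, beta + f)  # p ~ Beta(alpha + s, beta + f)
        hits += rng.random() < p                  # X_new ~ Bernoulli(p)
    return hits / draws


# Hypothetical example: 7 successes, 3 failures, Beta(2, 2) prior.
exact = predictive_success_prob(7, 3, 2.0, 2.0)   # 9/14
estimate = simulate_predictive(7, 3, 2.0, 2.0)
print(exact, estimate)
```

The simulated frequency should match the closed form to within Monte Carlo error.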

---

## Intuition Behind the Predictive Distribution  

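1. **Weighted Average of Prior and Data:**  
   The predictive probability of success is a weighted average of the observed proportion of successes $\frac{s}{n}$ and the prior mean $\frac{\alpha}{\alpha + \beta}$, with weights $\frac{n}{\alpha + \beta + n}$ and $\frac{\alpha + \beta}{\alpha + \beta + n}$ respectively (using $s + f = n$).  
   - As $n \to \infty$, the influence of the prior diminishes, and the predictive probabilities converge to the observed proportion.  

2. **Uncertainty Quantification:**  
   The predictive distribution incorporates the uncertainty in $p$ by integrating over the posterior distribution instead of plugging in a single point estimate such as the MLE or MAP. In the Bernoulli-Beta model the result coincides with the posterior mean $\frac{\alpha + s}{\alpha + \beta + n}$, which generally differs from the MAP estimate.  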

3. **Applications:**  
   The predictive distribution is useful for making probabilistic predictions about future outcomes.  
   - For example, in medical trials, it can predict the probability of success for the next patient based on previous observations.  
