# Bayesian Statistics Intro

## Bayesian Motivation

We start with **Bayes’ theorem** for parameters \(\theta\) and data \(D\):

$$
p(\theta \mid D) = \frac{p(D \mid \theta)\,p(\theta)}{p(D)}
$$

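- \(p(\theta)\) is the **prior**: what we believe about \(\theta\) before seeing the data.
- \(p(D \mid \theta)\) is the **likelihood**: how probable the observed data are for a given \(\theta\).
- \(p(\theta \mid D)\) is the **posterior**: what we believe about \(\theta\) after seeing the data.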

The normalizing constant, also called the **evidence** or marginal likelihood,

$$
p(D) = \int p(D \mid \theta)\,p(\theta)\,d\theta
$$

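is often intractable when \(\theta\) is high-dimensional. Because it does not depend on \(\theta\), sampling methods such as MCMC, which need the posterior only up to a constant factor, never have to compute it.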
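
For a single parameter the integral can still be evaluated numerically, which makes the role of each term concrete. The following is a minimal sketch, assuming a coin-flip model with a flat prior; the function name `grid_posterior` and the counts (7 heads, 3 tails) are illustrative choices, not part of the notes above.

```python
def grid_posterior(heads, tails, n_grid=1001):
    """Posterior over a coin's heads-probability theta on a uniform grid."""
    thetas = [i / (n_grid - 1) for i in range(n_grid)]
    step = 1.0 / (n_grid - 1)

    # Assumed flat prior: p(theta) = 1 on [0, 1].
    prior = [1.0] * n_grid
    # Likelihood of one specific sequence with the given counts.
    likelihood = [t ** heads * (1.0 - t) ** tails for t in thetas]
    unnormalized = [lik * pri for lik, pri in zip(likelihood, prior)]

    # Evidence p(D): trapezoid-rule approximation of the integral over theta.
    evidence = step * (sum(unnormalized)
                       - 0.5 * (unnormalized[0] + unnormalized[-1]))
    posterior = [u / evidence for u in unnormalized]
    return thetas, posterior, evidence


thetas, posterior, evidence = grid_posterior(heads=7, tails=3)
# Grid point with the highest posterior density.
theta_map = thetas[posterior.index(max(posterior))]
print(f"evidence = {evidence:.6g}, most probable theta = {theta_map:.3f}")
```

A grid like this needs a number of points that grows exponentially with the number of parameters, which is one reason the evidence becomes impractical to compute in higher dimensions.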

---

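## Likelihood

Viewed as a function of \(\theta\) with the data \(D\) held fixed, \(p(D \mid \theta)\) defines the likelihood function:
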
$$
\mathcal{L}(\theta) = p(D \mid \theta)
$$

For data points \(d_i\) that are independent given \(\theta\), the likelihood factorizes and its logarithm becomes a sum:

$$
\mathcal{L}(\theta)
= \prod_{i=1}^N p(d_i \mid \theta)
\quad\Longrightarrow\quad
\ln \mathcal{L}(\theta)
= \sum_{i=1}^N \ln p(d_i \mid \theta).
$$
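
The sum over data points translates directly into code. Here is a minimal sketch, assuming a Gaussian model with \(\theta = (\mu, \sigma)\); the model choice, the function name `gaussian_log_likelihood`, and the data values are made up for illustration.

```python
import math


def gaussian_log_likelihood(data, mu, sigma):
    """ln L(theta) = sum_i ln N(d_i | mu, sigma^2), with theta = (mu, sigma)."""
    # Log of the Gaussian normalization factor, identical for every point.
    log_norm = -0.5 * math.log(2.0 * math.pi * sigma ** 2)
    return sum(log_norm - 0.5 * ((d - mu) / sigma) ** 2 for d in data)


data = [4.8, 5.1, 5.3, 4.9, 5.0]  # hypothetical measurements
log_like = gaussian_log_likelihood(data, mu=5.0, sigma=0.2)
print(f"ln L = {log_like:.4f}")
```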

---

## Why Use the Log-Likelihood?

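1. **Products → Sums**  
   The log turns a product of \(N\) terms into a sum, which is easier to manipulate and to accumulate one data point at a time.

2. **Numerical Stability**  
   A product of many small probabilities underflows to zero in floating point, while the sum \(\ln \mathcal{L}\) stays finite even for large datasets.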

3. **Easy Derivatives**  
   The gradient of a sum is the sum of the per-point gradients, which is convenient for optimization and for gradient-based MCMC.
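
The underflow problem is easy to reproduce. The sketch below assumes 100 data points that each contribute a likelihood of \(10^{-5}\); both numbers are arbitrary and chosen only to push the product below the smallest representable double.

```python
import math

# 100 hypothetical per-point likelihood values.
probs = [1e-5] * 100

# Direct product: the true value is 1e-500, far below the double-precision range.
product = 1.0
for p in probs:
    product *= p

# Sum of logs: the same quantity on a log scale, comfortably representable.
log_sum = sum(math.log(p) for p in probs)

print("product of probabilities:", product)
print("sum of log-probabilities:", log_sum)
```

The product collapses to exactly `0.0`, so its logarithm is undefined, while the sum of logs is an ordinary finite number.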

---

## Putting It All Together

Taking the logarithm of Bayes' theorem gives the log-posterior up to an additive constant:

$$
\ln p(\theta \mid D)
= \ln \mathcal{L}(\theta) + \ln p(\theta) + \text{constant}.
$$

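The constant is \(-\ln p(D)\), which does not depend on \(\theta\). MCMC methods draw samples from the posterior by evaluating \(\ln \mathcal{L}(\theta)\) and \(\ln p(\theta)\) many times. They visit regions in proportion to their posterior probability, so most samples fall where that sum is large.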
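
As one concrete illustration, here is a minimal random-walk Metropolis sketch. It is one of many MCMC algorithms and is not necessarily the one these notes have in mind. The model (Gaussian data with known \(\sigma = 1\)), the wide Gaussian prior on \(\mu\), the data, and all function names are assumptions made for the example.

```python
import math
import random


def log_prior(mu):
    """Assumed prior: mu ~ Normal(0, 10^2), dropping additive constants."""
    return -0.5 * (mu / 10.0) ** 2


def log_likelihood(mu, data, sigma=1.0):
    """Gaussian log-likelihood with known sigma, dropping additive constants."""
    return sum(-0.5 * ((d - mu) / sigma) ** 2 for d in data)


def log_posterior(mu, data):
    """Unnormalized log-posterior: ln L(mu) + ln p(mu)."""
    return log_likelihood(mu, data) + log_prior(mu)


def metropolis(data, n_steps=20000, step_size=0.5, seed=0):
    """Random-walk Metropolis sampler for the single parameter mu."""
    rng = random.Random(seed)
    mu = 0.0
    current = log_posterior(mu, data)
    samples = []
    for _ in range(n_steps):
        proposal = mu + rng.gauss(0.0, step_size)
        candidate = log_posterior(proposal, data)
        # Accept with probability min(1, exp(candidate - current)).
        # 1 - random() lies in (0, 1], so the log is always defined.
        if math.log(1.0 - rng.random()) < candidate - current:
            mu, current = proposal, candidate
        samples.append(mu)
    return samples


data = [4.8, 5.1, 5.3, 4.9, 5.0]  # hypothetical measurements
samples = metropolis(data)
kept = samples[2000:]  # discard an initial burn-in stretch
posterior_mean = sum(kept) / len(kept)
print(f"posterior mean of mu: {posterior_mean:.3f}")
```

Only the difference `candidate - current` enters the acceptance test, so the constant \(-\ln p(D)\) cancels and never needs to be known.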
