# Likelihood — What Is It?

Likelihood is a fundamental concept in statistics. It is closely related to probability, but **not the same**.

- **Probability:** “Given parameters, what’s the chance of seeing this data?”  
- **Likelihood:** “Given the data, how plausible are different parameters?”

---

## Formal Definition

Suppose you have data:

$$
D = \{x_1, x_2, \dots, x_n\}
$$

and a model with parameters $\theta$.

The **probability of observing the data under the model** is:

$$
P(D \mid \theta)
$$

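Viewed as a function of $\theta$ with the data $D$ held fixed, the same expression is called the **likelihood function**, denoted as:

$$
L(\theta \mid D) = P(D \mid \theta)
$$

Note that $L(\theta \mid D)$ is not a probability distribution over $\theta$: it need not sum or integrate to 1.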

---

### Intuition

- In probability, we fix the parameters and ask how likely the data is.  
- In likelihood, we **fix the data** and ask how plausible different parameter values are.  

This distinction is key in **Maximum Likelihood Estimation (MLE)**, which picks the parameters that maximize the likelihood, and in **Bayesian inference**, where the likelihood updates our prior belief about model parameters given observed data.


# Likelihood Example: Coin Toss

Suppose we have a biased coin, and let:

$$
\theta = P(\text{Heads})
$$

We observe the following data:

```
H H T H T H H T H H
```

- 7 heads, 3 tails

---

### Likelihood Function

Assuming the tosses are independent, the likelihood of $\theta$ given this data is:

$$
L(\theta) = \theta^7 (1-\theta)^3
$$

---

### Key Points

- **Data is fixed**: we observed 7 heads and 3 tails.  
- **Parameter is variable**: $\theta$ can take any value between 0 and 1.  
- **Likelihood function** tells us how plausible each $\theta$ is given the observed data.  

For example, $\theta = 0.5$ has a likelihood of:

$$
L(0.5) = 0.5^7 \cdot 0.5^3 = 0.5^{10} \approx 0.00098
$$

By maximizing $L(\theta)$, we find the value of $\theta$ under which the observed data is most probable. This is the **Maximum Likelihood Estimate (MLE)**.
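
To make the maximization concrete, here is a minimal sketch that evaluates the coin-toss likelihood on a grid of candidate values and picks the largest. The function name `likelihood` and the grid resolution of 0.01 are illustrative choices, not part of the original notes.

```python
# Likelihood of theta for the observed sequence: 7 heads, 3 tails.
def likelihood(theta: float, heads: int = 7, tails: int = 3) -> float:
    return theta**heads * (1.0 - theta)**tails

# Candidate values 0.00, 0.01, ..., 1.00 (grid resolution is an assumption).
grid = [i / 100 for i in range(101)]
best_theta = max(grid, key=likelihood)

# Compare a few candidates, then report the grid maximizer.
for theta in (0.3, 0.5, 0.7, 0.9):
    print(f"L({theta}) = {likelihood(theta):.6f}")
print("grid maximizer:", best_theta)
```

On this grid the maximizer is $\theta = 0.7$, which matches the observed fraction of heads, 7 out of 10.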


# Maximum Likelihood Estimation


Imagine you’re trying to guess how biased a coin is.

You toss it **10 times**.

You observe: **7 Heads**, **3 Tails**.

Now you’re asking:

> “What’s the probability of this data if the coin has a bias θ for heads?”

That’s given by:

$$
P(D \mid \theta) = \theta^7 (1 - \theta)^3
$$

Different values of **θ** give different probabilities of seeing your data.

---

#### Maximum Likelihood Intuition

**MLE** says:  
Pick the value of **θ** that makes the data you actually saw **most likely**.

It’s the **most plausible parameter**, given what you’ve observed.





### General Setup

We have:

- A model parameterized by **θ** (like mean **μ**, variance **σ²**, probability **p**, etc.)
- Data samples:  
  $$ D = \{ x_1, x_2, \ldots, x_n \} $$

Each sample is assumed to come from the same distribution:  
$$ P(X \mid \theta) $$

---

#### Our Goal

Find the value of **θ** that makes the observed data most probable under our model.

That’s literally what **maximum likelihood** means.


### Likelihood Function

Given data **D**, the likelihood function is defined as:

$$
L(\theta \mid D) = P(D \mid \theta)
$$

If the samples are independent and identically distributed (i.i.d., a common modeling assumption), the joint probability factorizes into a product:

$$
L(\theta \mid D) = \prod_{i=1}^{n} P(x_i \mid \theta)
$$
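
As a rough illustration of the product form, the sketch below encodes the coin tosses from the earlier example as 1 (heads) and 0 (tails) and multiplies the per-toss probabilities. The names `bernoulli_pmf` and `iid_likelihood` are hypothetical helpers introduced here for illustration.

```python
import math

# Observed tosses H H T H T H H T H H, encoded as 1 = heads, 0 = tails.
tosses = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]

def bernoulli_pmf(x, theta):
    """P(x | theta) for a single toss."""
    return theta if x == 1 else 1.0 - theta

def iid_likelihood(data, theta):
    """Product of per-sample probabilities (valid under independence)."""
    return math.prod(bernoulli_pmf(x, theta) for x in data)

# The product collapses to theta^7 * (1 - theta)^3 for this data.
print(iid_likelihood(tosses, 0.7), 0.7**7 * 0.3**3)
```

Because the data contains 7 heads and 3 tails, the product reduces to the same $\theta^7 (1-\theta)^3$ used in the coin-toss section.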


### Log-Likelihood

It’s almost always easier to maximize the **log-likelihood** instead of the raw product:

$$
\ell(\theta) = \log L(\theta)
             = \sum_{i=1}^{n} \log P(x_i \mid \theta)
$$
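
As a worked example, applying this to the coin data (7 heads, 3 tails) gives a log-likelihood that can be maximized by hand:

$$
\ell(\theta) = 7 \log \theta + 3 \log (1 - \theta)
$$

Setting the derivative to zero:

$$
\frac{d\ell}{d\theta} = \frac{7}{\theta} - \frac{3}{1 - \theta} = 0
\quad\Longrightarrow\quad 7(1 - \theta) = 3\theta
\quad\Longrightarrow\quad \hat{\theta} = \frac{7}{10}
$$

The second derivative, $-\frac{7}{\theta^2} - \frac{3}{(1-\theta)^2}$, is negative for all $\theta$ in $(0, 1)$, so this stationary point is a maximum.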

---

#### Why take the log?

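- **Log is strictly increasing** → the maximizer of $\ell(\theta)$ is the same as the maximizer of $L(\theta)$.  
- **Turns products into sums** → easier to differentiate and compute.  
- **Avoids floating-point underflow** in computation, since a product of many small probabilities can round to zero (especially for large datasets).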
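
The underflow point can be seen in a small sketch. The dataset below (2000 tosses, 1400 heads) is hypothetical and chosen only so that the raw product falls below the smallest representable double.

```python
import math

# Hypothetical dataset: 2000 i.i.d. tosses with 1400 heads, evaluated at theta = 0.7.
theta = 0.7
data = [1] * 1400 + [0] * 600

# Raw likelihood: multiply per-sample probabilities one by one.
raw = 1.0
for x in data:
    raw *= theta if x == 1 else 1.0 - theta

# Log-likelihood: add per-sample log-probabilities instead.
log_lik = sum(math.log(theta if x == 1 else 1.0 - theta) for x in data)

print("raw product:", raw)         # underflows to 0.0 in double precision
print("log-likelihood:", log_lik)  # stays finite (roughly -1222)
```

The raw product is useless for comparing parameter values once it hits zero, while the log-likelihood remains a finite number that can still be compared and maximized.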


## Limitations of MLE

MLE is a **frequentist** method: it does not incorporate **prior** beliefs about θ.

So:

- Works well with large datasets, where the data dominates.
- Can overfit on small datasets, because the estimate depends only on the observed sample.
- Gives a single point estimate, so it does not express uncertainty about θ.

That is why MAP estimation (next) adds a prior, which can give more robust estimates when data is scarce.
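
The small-data issue can be illustrated with the coin model, where the MLE has the closed form heads / tosses (the same 7/10 as in the earlier example). The sample sizes and counts below are hypothetical, and `bernoulli_mle` is an illustrative helper name.

```python
def bernoulli_mle(heads: int, tosses: int) -> float:
    """Closed-form MLE for a coin's heads probability: heads / tosses."""
    return heads / tosses

# Hypothetical small sample: 3 tosses that all happen to land heads.
small = bernoulli_mle(3, 3)

# Hypothetical larger sample: 1000 tosses with 506 heads.
large = bernoulli_mle(506, 1000)

print("MLE from 3 tosses:", small)     # 1.0, i.e. "tails is impossible"
print("MLE from 1000 tosses:", large)  # 0.506
```

With only three tosses, the estimate claims tails can never occur, and it carries no indication of how uncertain that conclusion is. With more data the estimate settles near the observed frequency.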