## Chapter 7 - Bayes Classifier

_**Author:** Zitong Su_

*This note includes some formula derivations (and/or extended materials) that supplement the __Machine Learning ("Watermelon Book")__ or the __Pumpkin Book__.*


### 7.2 Maximum Likelihood Estimation
#### 7.2.1 Likelihood function, cross-entropy loss, and KL-divergence
**Problem:**

**Background:**

1. **Shannon information / self-information:**
<br>From a statistical perspective, information quantifies the reduction of uncertainty following an observation of a random variable $X$. In other words, it measures the surprisal of an event: the less likely an event is, the more information it conveys when it occurs.
<br><br>Let $X$ be a discrete random variable taking values in a set $\mathcal{X}$, with probability mass function $p(x) := \mathbb{P}(X = x)$. Then the self-information of observing $X = x$ (for $p(x) > 0$) is defined as:
$$\mathrm{I}(X = x) = \log \frac{1}{p(x)} = -\log p(x)$$

2. **Entropy:**
<br>Entropy quantifies the average level of uncertainty in a random variable's outcomes. It measures how unpredictable or "surprising" the value of a variable is, on average, based on its probability distribution.
<br><br>It is defined as the expectation of self-information:
$$\mathrm{H}(X)
:= \mathbb{E}[\mathrm{I}(X)]
= \mathbb{E}[-\log p(X)]
= -\sum_{x \in \mathcal{X}} p(x) \log p(x)$$
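Here terms with $p(x) = 0$ contribute zero, by the convention $0 \log 0 = 0$. For a finite $\mathcal{X}$, entropy reaches its maximum, $\log |\mathcal{X}|$, when all outcomes are equally likely (i.e., the distribution is uniform), reflecting maximal uncertainty. Conversely, if one outcome is certain, entropy is zero, and there is no surprise in the result.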
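
The two definitions above can be checked numerically. The following is a minimal sketch, not part of the original note: it assumes the natural logarithm (so values are in nats), and the function names and the three example distributions are made up for illustration.

```python
import math


def self_information(p: float) -> float:
    """Self-information -log p(x), in nats, of an outcome with probability p > 0."""
    return -math.log(p)


def entropy(pmf: list[float]) -> float:
    """Entropy H(X) = -sum_x p(x) log p(x), in nats.

    Outcomes with p(x) = 0 are skipped, following the convention 0 log 0 = 0.
    """
    return sum(p * self_information(p) for p in pmf if p > 0)


# Hypothetical distributions over four outcomes (illustrative values only).
uniform = [0.25, 0.25, 0.25, 0.25]  # all outcomes equally likely
skewed = [0.7, 0.1, 0.1, 0.1]       # one outcome dominates
certain = [1.0, 0.0, 0.0, 0.0]      # one outcome is certain

for name, pmf in [("uniform", uniform), ("skewed", skewed), ("certain", certain)]:
    print(f"H({name}) = {entropy(pmf):.4f} nats")
```

Under these assumptions the uniform distribution gives the largest entropy of the three and the certain distribution gives zero, matching the statement above. A rarer outcome also carries more self-information: `self_information(0.1)` exceeds `self_information(0.7)`.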


## Additional Notes

### Notation for Expectation
$$\mathrm{H}(X)
= \mathbb{E}[-\log p(X)]
= \mathbb{E}_{x \sim p(x)}[-\log p(x)]$$

<br>Note that $\mathbb{E}[-\log p(X)]$ and $\mathbb{E}_{x \sim p(x)}[-\log p(x)]$ are both valid ways of writing the same expectation.

<br>$\mathbb{E}[-\log p(X)]$ --> **Random-variable form**
- Take the expectation of $-\log p(X)$, where $X$ is the random variable. There is no subscript declaring a dummy variable, so $X$ remains the random variable throughout.

<br>$\mathbb{E}_{x \sim p(x)}[-\log p(x)]$ --> **Sampling-notation form**
- Draw a sample $x$ from the distribution $p$, compute $-\log p(x)$, and average over draws. In the subscript $x \sim p(x)$, $x$ is just a dummy variable (like the $x$ in $\int f(x)\,dx$). The same symbol must then be used inside the brackets.
- Why lowercase $x$ inside the expectation?
<br>**Dummy-variable convention:** In calculus we write $\int f(t)\,dt$ or $\sum_i a_i$. The symbol $t$ or $i$ is bound: it lives only inside the integral or sum. Here, $x$ plays the same role, as a placeholder inside the expectation. Using lowercase stresses that it is not the capital-letter random variable $X$ we might refer to elsewhere in the text; it is just the "current draw" from $p$.
- Dummy variables can live inside $\mathbb{E}[\cdot]$. Because $\mathbb{E}[\cdot]$ is defined by a sum or an integral, any symbol we choose inside is perfectly valid, so long as it matches the subscript declaration:
$$\mathbb{E}_{u \sim p(u)}[g(u)]
= \sum_{u} g(u)\,p(u)$$
Likewise, $\mathbb{E}_{z\sim p(z)}[z^2]$ means "sum (or integrate) $z^2\,p(z)$ over $z$."
- $\mathbb{E}_{x\sim p(x)}[f(x)]$ is just a notational shortcut for the underlying sum or integral. The lowercase $x$ is a bound dummy variable, like the $t$ in $\int f(t)\,dt$. We can swap $x$ for any other letter; what matters is that the subscript and the bracketed expression agree.
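
The two readings of the notation can be illustrated side by side. This is a minimal sketch, assuming a small made-up distribution (`values`, `probs`) and a hypothetical helper `expectation`; none of these names come from the original note. It computes the expectation as a weighted sum, repeats it with a renamed dummy variable, and approximates it by averaging over sampled draws.

```python
import math
import random

# Hypothetical finite distribution p over three values (illustrative only).
values = [0, 1, 2]
probs = [0.5, 0.3, 0.2]
p = dict(zip(values, probs))  # p[x] is the probability mass at x


def expectation(g, values, probs):
    """Exact E_{u ~ p}[g(u)] = sum_u g(u) p(u) for a finite distribution."""
    return sum(g(u) * pu for u, pu in zip(values, probs))


# Entropy as an expectation of -log p(x); the dummy variable is named x here...
entropy_x = expectation(lambda x: -math.log(p[x]), values, probs)
# ...and z here. Renaming the bound variable does not change the value.
entropy_z = expectation(lambda z: -math.log(p[z]), values, probs)

# E_{z ~ p}[z^2] means "sum z^2 p(z) over z".
second_moment = expectation(lambda z: z**2, values, probs)

# Sampling reading: draw x ~ p many times, compute -log p(x), and average.
rng = random.Random(0)  # fixed seed so the run is reproducible
draws = rng.choices(values, weights=probs, k=100_000)
entropy_mc = sum(-math.log(p[x]) for x in draws) / len(draws)

print(entropy_x, entropy_z, entropy_mc, second_moment)
```

The sampled average only approximates the exact sum, and how close it gets depends on the number of draws. The exact values for `entropy_x` and `entropy_z` coincide because the lambda parameter is a bound variable, just like the dummy symbol in the subscript.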





