# Properties of the KL divergence

Given two discrete probability distributions $P$ and $Q$ over the outcome space $\chi$, define their KL divergence by

$$D_{KL}(P||Q) = \sum_{x \in \chi} P(x)\log \frac{P(x)}{Q(x)}$$
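As a concrete illustration, the definition can be computed directly for small distributions. The following is a minimal sketch, not a reference implementation: the function name `kl_divergence`, the dict representation, and the example distributions are all assumptions made for illustration. It also assumes the usual conventions that terms with $P(x) = 0$ contribute nothing and that the divergence is infinite if $P(x) > 0$ while $Q(x) = 0$.

```python
import math


def kl_divergence(p, q):
    """D_KL(P || Q) for distributions given as dicts: outcome -> probability."""
    total = 0.0
    for x, px in p.items():
        if px == 0.0:
            continue  # assumed convention: 0 * log(0 / q) = 0
        qx = q.get(x, 0.0)
        if qx == 0.0:
            return math.inf  # assumed convention: P has mass where Q has none
        total += px * math.log(px / qx)
    return total


# Hypothetical example distributions over the outcome space {"a", "b"}
p = {"a": 0.5, "b": 0.5}
q = {"a": 0.9, "b": 0.1}

print(kl_divergence(p, q))  # divergence of P from Q
print(kl_divergence(q, p))  # generally a different number: D_KL is not symmetric
```

Note that swapping the arguments changes the value, which is why $D_{KL}$ is called a divergence rather than a distance.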

## $D_{KL} \geq 0$ and $D_{KL} = 0$ iff $P = Q$.

We can re-write KL divergence as an expectation. With $f(x) = x\log x$,

$$D_{KL}(P||Q) = \sum_{x \in \chi} Q(x) \frac{P(x)}{Q(x)}\log \frac{P(x)}{Q(x)} = E_{x \sim Q} \, f\left(\frac{P(x)}{Q(x)}\right)$$

Since $f$ is strictly convex, we have Jensen's inequality $Ef(X)\geq f(EX)$, so the RHS is bounded below by

$$f \left( E_{x \sim Q} \frac{P(x)}{Q(x)} \right) = f(1) = 0$$

Moreover, if equality holds, then strict convexity forces $P(x)/Q(x)$ to equal a constant $c$ for all $x$ with $Q(x) > 0$. Taking the expectation under $Q$ gives $c = E_{x \sim Q} \, P(x)/Q(x) = 1$, so $P(x) = Q(x)$ for all such $x$. Since $Q$ sums to 1 over these $x$, this implies

$$\sum_{x: Q(x) = 0} P(x) = 1 - \sum_{x : Q(x) > 0} P(x) = 0 $$
 
which means $P = Q$ everywhere.
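The Jensen argument above can be checked numerically. This is a rough sketch under stated assumptions: the helper `random_distribution`, the choice of five outcomes, and the fixed seed are all illustrative, and every sampled $Q$ is strictly positive so the ratio $P/Q$ is always defined. It compares $E_{x \sim Q} f(P(x)/Q(x))$ with $f(E_{x \sim Q} P(x)/Q(x))$ on random pairs of distributions.

```python
import math
import random


def f(t):
    """f(t) = t log t, extended by f(0) = 0."""
    return t * math.log(t) if t > 0 else 0.0


def random_distribution(n, rng):
    """Hypothetical helper: a strictly positive distribution on n outcomes."""
    weights = [rng.random() + 1e-3 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]


rng = random.Random(0)
gaps = []
for _ in range(1000):
    p = random_distribution(5, rng)
    q = random_distribution(5, rng)
    ratios = [pi / qi for pi, qi in zip(p, q)]
    lhs = sum(qi * f(r) for qi, r in zip(q, ratios))      # E_Q f(P/Q) = D_KL(P||Q)
    mean_ratio = sum(qi * r for qi, r in zip(q, ratios))  # E_Q[P/Q], equal to 1
    gaps.append(lhs - f(mean_ratio))                      # Jensen gap, should be >= 0

# Allow a tiny tolerance for floating-point rounding
print(min(gaps) >= -1e-12)
```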


## The 'Chain rule' 

### $$D_{KL}(P(X,Y)||Q(X,Y)) = D_{KL}(P(X|Y)||Q(X|Y)) + D_{KL}(P(Y)||Q(Y))$$

This follows directly from writing out the definition and factoring $p(x,y) = p(y)\,p(x|y)$, and likewise for $q$. On the RHS, the conditional KL divergence is not simply the KL divergence between the conditionals at one fixed $y$. It is defined by:

$$D_{KL}(P(X|Y)||Q(X|Y)) = \sum_y p(y) \sum_x p(x|y) \log \frac{p(x|y)}{q(x|y)}$$

That is, it is the expectation with respect to $Y \sim P$ of the KL divergence between the conditional distributions $P(X|Y)$ and $Q(X|Y)$.
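The chain rule can be sanity-checked on a small example, with the marginal term taken over $Y$. The sketch below assumes two hypothetical joint distributions on $\{0,1\} \times \{0,1\}$. The helper names (`marginal_y`, `kl`, `conditional_kl`) are invented for illustration, and $Q$ is assumed strictly positive so that every log ratio is finite.

```python
import math

# Hypothetical joint distributions over pairs (x, y) with x, y in {0, 1}
P = {(0, 0): 0.30, (0, 1): 0.20, (1, 0): 0.10, (1, 1): 0.40}
Q = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}


def marginal_y(joint):
    """Sum out x to get the marginal distribution of Y."""
    out = {}
    for (x, y), pr in joint.items():
        out[y] = out.get(y, 0.0) + pr
    return out


def kl(p, q):
    """D_KL(p || q) for dicts on the same support (q assumed positive)."""
    return sum(px * math.log(px / q[k]) for k, px in p.items() if px > 0)


def conditional_kl(joint_p, joint_q):
    """Sum over y of p(y) * sum over x of p(x|y) log(p(x|y) / q(x|y))."""
    py, qy = marginal_y(joint_p), marginal_y(joint_q)
    total = 0.0
    for (x, y), pxy in joint_p.items():
        if pxy > 0:
            p_cond = pxy / py[y]
            q_cond = joint_q[(x, y)] / qy[y]
            total += pxy * math.log(p_cond / q_cond)  # p(y) p(x|y) = p(x, y)
    return total


lhs = kl(P, Q)
rhs = conditional_kl(P, Q) + kl(marginal_y(P), marginal_y(Q))
print(abs(lhs - rhs) < 1e-12)
```

The weight on each log ratio is the joint probability $p(x,y)$ under $P$, which is exactly what makes the conditional term an expectation over $Y \sim P$ rather than a plain sum of per-$y$ divergences.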

## KL divergence and MLE

Given a training set $S =\{x^{(i)}\}_{i=1}^m$, define the empirical distribution wrt this training set by 

$$P_S(x) = \frac{|\{i : x^{(i)} = x \}|}{m}$$

Given a family of probability distributions $P_\theta$ with probability mass functions $p(x; \theta)$, we can calculate the KL divergence between the empirical measure and $P_\theta$ by

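$$D_{KL}(P_S || P_\theta) = -\sum_{i=1}^m \frac{1}{m} \log m - \frac{1}{m}\sum_{i=1}^m \log p(x^{(i)};\theta)$$

when the samples $x^{(i)}$ are distinct (in general the first term is $\sum_x P_S(x) \log P_S(x)$, which likewise does not depend on $\theta$), so minimizing the KL divergence over $\theta$ is equivalent to maximizing the log likelihood of the training set under $P_\theta$.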
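The equivalence can be illustrated on a toy model. The sketch below makes several assumptions for illustration only: a Bernoulli family $p(1;\theta) = \theta$, a hypothetical training set of ten coin flips, and a grid search over $\theta$ in place of a real optimizer. It checks that the $\theta$ minimizing $D_{KL}(P_S||P_\theta)$ coincides with the $\theta$ maximizing the average log likelihood, and that the two objectives sum to a constant independent of $\theta$.

```python
import math
from collections import Counter

samples = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]  # hypothetical training set
m = len(samples)

# Empirical distribution P_S(x) = (number of i with x^(i) = x) / m
empirical = {x: count / m for x, count in Counter(samples).items()}


def bernoulli_pmf(x, theta):
    """Assumed model family: p(1; theta) = theta, p(0; theta) = 1 - theta."""
    return theta if x == 1 else 1.0 - theta


def kl_to_model(theta):
    """D_KL(P_S || P_theta)."""
    return sum(ps * math.log(ps / bernoulli_pmf(x, theta))
               for x, ps in empirical.items())


def avg_log_likelihood(theta):
    """(1/m) * sum over i of log p(x^(i); theta)."""
    return sum(math.log(bernoulli_pmf(x, theta)) for x in samples) / m


grid = [k / 100 for k in range(1, 100)]  # coarse grid, stand-in for an optimizer
theta_kl = min(grid, key=kl_to_model)
theta_mle = max(grid, key=avg_log_likelihood)

# KL plus average log likelihood equals sum_x P_S(x) log P_S(x) for every theta
constants = [kl_to_model(t) + avg_log_likelihood(t) for t in grid]

print(theta_kl, theta_mle)
print(max(constants) - min(constants) < 1e-9)
```

Because this training set contains repeated values, the $\theta$-independent term here is $\sum_x P_S(x) \log P_S(x)$ rather than $-\log m$, but the argument is unchanged.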

## An application of realizing MLE as minimizing $D_{KL}$

The main point is that it can be useful to apply the chain rule formula to simplify an optimization problem.

### Example: Naive Bayes




