# [Kullback–Leibler divergence](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence)

## Definition: Kullback–Leibler (KL) Divergence

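Let $\mathcal{X}$ be a countable set and let $\mathcal{M}_1(\mathcal{X})$ denote the set of all probability measures on the
measurable space $(\mathcal{X}, \mathcal{P}(\mathcal{X}))$. For $p \in \mathcal{M}_1(\mathcal{X})$ we write $p(x) := p(\{x\})$.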

Define
$$
\widehat{\mathcal{M}_1(\mathcal{X})^2}
:= \big\{\, (p,q)\in \mathcal{M}_1(\mathcal{X}) \times \mathcal{M}_1(\mathcal{X})
\; : \; p(x) > 0 \Rightarrow q(x) > 0,\ \forall x \in \mathcal{X} \,\big\}.
$$

We define the **Kullback–Leibler (KL) divergence** as the function
$$
\begin{aligned}
D_{KL} : \widehat{\mathcal{M}_1(\mathcal{X})^2} &\longrightarrow [0,\infty) \\
D_{KL}(p \Vert q) 
&:= \sum_{x \in \mathcal{X}} 
p(x)\,\log\!\left(\frac{p(x)}{q(x)}\right),
\end{aligned}
$$

with the convention $0 \log 0 := 0$.
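The definition can be sketched numerically for a finite outcome set. This is a minimal illustration, not a reference implementation: the function name `kl_divergence` and the representation of $p$ and $q$ as lists of probabilities over the same ordered outcomes are assumptions made here.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) in nats for two finite discrete distributions.

    p and q are sequences of probabilities over the same ordered outcomes.
    """
    if len(p) != len(q):
        raise ValueError("p and q must be defined on the same outcome set")
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0.0:
            continue  # convention 0 log 0 := 0
        if qx == 0.0:
            # The pair (p, q) violates p(x) > 0 => q(x) > 0.
            raise ValueError("(p, q) is outside the domain of D_KL")
        total += px * math.log(px / qx)
    return total

p = [0.5, 0.5, 0.0]
q = [0.25, 0.25, 0.5]
d = kl_divergence(p, q)  # 0.5*log(2) + 0.5*log(2) = log(2)
```

The third outcome contributes nothing because $p(x) = 0$ there, even though $q(x) > 0$.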

## Basic properties

1. **Non-negativity (Gibbs' inequality)**

$$
D_{KL}(p \,\|\, q) \ge 0
$$

with equality if and only if $p(x) = q(x)$ for all $x\in\mathcal{X}$ (i.e. $p = q$).

2. **Not symmetric and no triangle inequality**

In general,

$$
\begin{align*}
D_{KL}(p \,\|\, q) &\neq D_{KL}(q \,\|\, p),\\
D_{KL}(p \,\|\, r) &\not\le D_{KL}(p \,\|\, q)+ D_{KL}(q \,\|\, r).
\end{align*}
$$

Therefore, KL divergence is **not a metric** (it does not define a distance in the mathematical sense).

3. **Not bounded**

KL divergence can take arbitrarily large values. If the definition is extended to all pairs
$(p,q)$ of probability measures, it is set to $+\infty$ whenever there exists $x$ such that

$$
p(x) > 0 \quad \text{and} \quad q(x) = 0.
$$
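Properties 2 and 3 can be checked on small examples. The sketch below uses hypothetical two-point distributions `p`, `q`, `r` chosen for illustration, and a helper `kl` that returns `math.inf` when $p(x) > 0$ and $q(x) = 0$ (the extended convention, which lies outside the domain used in the definition).

```python
import math

def kl(p, q):
    """D_KL(p || q) in nats; math.inf if p(x) > 0 while q(x) = 0."""
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0.0:
            continue  # convention 0 log 0 := 0
        if qx == 0.0:
            return math.inf  # extended convention for pairs outside the domain
        total += px * math.log(px / qx)
    return total

p = [0.5, 0.5]
q = [0.25, 0.75]
r = [0.1, 0.9]

# Asymmetry: swapping the arguments changes the value.
forward = kl(p, r)
backward = kl(r, p)

# Triangle inequality fails: going "through" q is cheaper than going directly.
direct = kl(p, r)
detour = kl(p, q) + kl(q, r)

# Unboundedness: as q puts less mass on an outcome that p considers likely,
# the divergence grows without bound.
growth = [kl(p, [1.0 - eps, eps]) for eps in (1e-1, 1e-3, 1e-6, 1e-9)]

print(forward, backward)
print(direct, detour)
print(growth, kl(p, [1.0, 0.0]))
```

For these particular values, `direct` exceeds `detour`, so the triangle inequality is violated, and `growth` increases roughly like $\tfrac12 \log(1/\varepsilon)$.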

#### **Proof (1.):**


We use the elementary inequality
$$
\log t \le t - 1 \quad \text{for all } t > 0,
$$
with equality if and only if $t = 1$.

This is equivalent to
$$
-\log t \ge 1 - t.
$$
Applying this with $t = q(x)/p(x)$ and summing over the support of $p$ (terms with $p(x) = 0$ vanish by the convention $0 \log 0 := 0$), we have
$$
\begin{align*}
D_{KL}(p \,\|\, q)
&= \sum_{x:\, p(x) > 0} p(x)\,\log\!\left( \frac{p(x)}{q(x)} \right)\\
&= \sum_{x:\, p(x) > 0} p(x)\left[-\log\!\left( \frac{q(x)}{p(x)} \right)\right]\\
&\ge \sum_{x:\, p(x) > 0} p(x)\left(1- \frac{q(x)}{p(x)} \right)\\
&= \sum_{x:\, p(x) > 0} \left(p(x)-q(x) \right)\\
&= 1 - \sum_{x:\, p(x) > 0} q(x)\\
&\ge 0.
\end{align*}
$$

Equality holds if and only if
$$
-\log\frac{q(x)}{p(x)} = 1 - \frac{q(x)}{p(x)}
$$
for all $x$ with $p(x) > 0$,
which happens if and only if
$$
\frac{q(x)}{p(x)} = 1
\quad \Longleftrightarrow \quad
p(x) = q(x)
$$
for all $x$ with $p(x)>0$.

Since both are probability distributions, this implies
$$
p(x) = q(x) \quad \text{for all } x \in \mathcal{X}.
$$

Conversely, if $p=q$, then
$$
D_{KL}(p\|q) = \sum_x p(x)\log 1 = 0.
$$

Thus
$$
D_{KL}(p\|q) \ge 0,
$$
with equality if and only if $p = q$.

## Property: Relation with entropy and cross-entropy

A key identity connects cross-entropy, entropy, and KL-divergence:

$$
H(p \Vert q) 
=  D_{KL}(p \,\|\, q)+ H(p).
$$

This decomposition shows that cross-entropy contains **two distinct effects**:

1. An intrinsic part: the entropy of the true distribution $H(p)$
2. A mismatch part: the divergence between the model and the truth $D_{KL}(p\Vert q)$

Interpretation

* High cross-entropy $H(p \Vert q)$ means that outcomes generated by the true distribution $p$ are (on average) poorly predicted by the model $q$, either because  
  - $p$ itself is highly random (high entropy $H(p)$), or  
  - the model $q$ is very different from the true distribution $p$, or  
  - both.

* Low cross-entropy $H(p \Vert q)$ means that outcomes generated by $p$ are (on average) quite predictable using the model $q$, which happens when  
  - $p$ is structured (low entropy) and  
  - $q$ is close to $p$.

#### **Proof:**

$$
\begin{aligned}
H(p \Vert q) 
&:= -\sum_{x \in \mathcal{X}} p(x)\log q(x),\\
&= \sum_{x \in \mathcal{X}} p(x)\log \left(\frac{1}{q(x)}\right)-H(p)+H(p),\\
&= \sum_{x \in \mathcal{X}} p(x)\log \left(\frac{1}{q(x)}\right)+\sum_{x \in \mathcal{X}} p(x)\log \left(p(x)\right)+H(p),\\
&= \sum_{x \in \mathcal{X}} p(x)\log \left(\frac{p(x)}{q(x)}\right)+H(p),\\
&=  D_{KL}(p \,\|\, q)+ H(p).
\end{aligned}
$$
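The identity can also be checked numerically. The following is a small sketch with hypothetical distributions `p` and `q` on three outcomes; the helper names `entropy`, `cross_entropy` and `kl` are illustrative and use natural logarithms.

```python
import math

def entropy(p):
    """H(p) = -sum_x p(x) log p(x), with 0 log 0 := 0."""
    return -sum(px * math.log(px) for px in p if px > 0)

def cross_entropy(p, q):
    """H(p || q) = -sum_x p(x) log q(x), summed over the support of p."""
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)

def kl(p, q):
    """D_KL(p || q) = sum_x p(x) log(p(x) / q(x)), summed over the support of p."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.7, 0.2, 0.1]  # "true" distribution
q = [0.5, 0.3, 0.2]  # model

lhs = cross_entropy(p, q)
rhs = kl(p, q) + entropy(p)
print(lhs, rhs)  # the two values agree up to floating-point rounding
```

When `q` equals `p`, the mismatch term vanishes and the cross-entropy reduces to the entropy $H(p)$.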


## Property: Relation with mutual information

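Let $X$ and $Y$ be discrete random variables with joint distribution $\mathbb{P}_{(X,Y)}$ and marginal distributions $\mathbb{P}_X$ and $\mathbb{P}_Y$. Then mutual information is a special case of KL divergence: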

$$
I(X;Y)
= D_{KL}\!\left( \mathbb{P}_{(X,Y)} \, \| \, \mathbb{P}_X\,\mathbb{P}_Y \right).
$$

This shows:

- Mutual information measures how far the joint distribution is
  from the distribution under independence
- Hence it quantifies **statistical dependence**
- It also explains why
$$
I(X;Y) \ge 0,
$$
with equality if and only if $X$ and $Y$ are independent (by Gibbs' inequality applied to this KL divergence).

#### **Proof:** 
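By the KL-divergence form of mutual information (see the table below), with $p(x,y)$ the joint probability mass function and $p(x)$, $p(y)$ the marginals, we have
$$
I(X;Y) = \sum_{x,y} p(x,y)\log \frac{p(x,y)}{p(x)p(y)}
= D_{KL}\!\left( \mathbb{P}_{(X,Y)} \, \| \, \mathbb{P}_X\,\mathbb{P}_Y \right),
$$
since the sum is exactly the KL divergence between the joint distribution and the product of the marginals.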
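As an illustration, mutual information can be computed directly as the KL divergence between a joint table and the product of its marginals. This is a sketch under the assumption that the joint distribution is given as a nested list `joint[i][j]` $= p(x_i, y_j)$; the function name and the two example tables are hypothetical.

```python
import math

def mutual_information(joint):
    """I(X;Y) in nats, computed as D_KL(P_(X,Y) || P_X P_Y)."""
    px = [sum(row) for row in joint]        # marginal of X (row sums)
    py = [sum(col) for col in zip(*joint)]  # marginal of Y (column sums)
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:  # convention 0 log 0 := 0
                mi += pxy * math.log(pxy / (px[i] * py[j]))
    return mi

# Independent case: the joint is the outer product of (0.4, 0.6) and (0.3, 0.7).
independent = [[0.12, 0.28],
               [0.18, 0.42]]

# Fully dependent case: Y is a copy of a fair coin X.
dependent = [[0.5, 0.0],
             [0.0, 0.5]]

mi_independent = mutual_information(independent)  # zero up to rounding
mi_dependent = mutual_information(dependent)      # log(2) nats, i.e. one bit
```

The independent table gives a value of zero (up to floating-point rounding), while the fully dependent table gives $\log 2$, the entropy of the fair coin.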

## Intuitive interpretation

- Entropy $H(p)$ → uncertainty of a distribution
- Mutual information $I(X;Y)$ → shared information between variables
- KL divergence $D_{KL}(p\|q)$ → how *wrong* $q$ is as a model for $p$

KL divergence measures the expected additional information needed to represent
samples from $p$ when assuming the distribution is $q$ instead of $p$.
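The coding interpretation can be made concrete with base-2 logarithms, so that quantities are measured in bits. The sketch below uses hypothetical distributions `p` and `q` whose probabilities are powers of two, so an ideal code built for $q$ assigns each outcome a length of $-\log_2 q(x)$ bits.

```python
import math

p = [0.5, 0.25, 0.25]  # true source distribution
q = [0.25, 0.25, 0.5]  # distribution assumed when designing the code

# Average code length with a code matched to p: the entropy H(p).
entropy_bits = -sum(px * math.log2(px) for px in p)

# Average code length with a code matched to q, on data drawn from p:
# the cross-entropy H(p || q).
cross_entropy_bits = -sum(px * math.log2(qx) for px, qx in zip(p, q))

# Extra bits per symbol caused by the mismatch: D_KL(p || q).
kl_bits = sum(px * math.log2(px / qx) for px, qx in zip(p, q))

print(entropy_bits, cross_entropy_bits, kl_bits)  # 1.5 1.75 0.25
```

Here the mismatched code costs 1.75 bits per symbol instead of the optimal 1.5, and the difference of 0.25 bits is exactly $D_{KL}(p\|q)$.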

| Concept | Definition |
|--------|------------|
| Self-Information | $I(x) = -\log p(x)$ |
| Entropy | $H(X) = -\sum_{x \in \mathcal{X}} p(x)\log p(x)$ |
| Joint Entropy | $H(X,Y) = -\sum_{x \in \mathcal{X}}\sum_{y \in \mathcal{Y}} p(x,y)\log p(x,y)$ |
| Conditional Entropy | $H(X \mid Y) = -\sum_{y \in \mathcal{Y}} p(y)\sum_{x \in \mathcal{X}} p(x \mid y)\log p(x \mid y)$ |
| Alternative form | $H(X \mid Y) = H(X,Y) - H(Y)$ |
| Mutual Information | $I(X;Y) = H(X) - H(X \mid Y)$ |
| Symmetry | $I(X;Y) = H(Y) - H(Y \mid X)$ |
| Equivalent form | $I(X;Y) = H(X) + H(Y) - H(X,Y)$ |
| KL-divergence form | $I(X;Y) = \sum_{x,y} p(x,y)\log \frac{p(x,y)}{p(x)p(y)}$ |
| KL divergence | $D_{KL}(p \Vert q) = \sum_{x} p(x)\log \frac{p(x)}{q(x)}$ |
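The identities in the table can be verified on a small example. The following sketch assumes a hypothetical $2 \times 2$ joint table `joint[i][j]` $= p(x_i, y_j)$ and computes each quantity from its definition before comparing the different forms.

```python
import math

def H(probs):
    """Shannon entropy in nats of a list of probabilities (0 log 0 := 0)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

joint = [[0.3, 0.1],
         [0.2, 0.4]]

px = [sum(row) for row in joint]        # marginal of X
py = [sum(col) for col in zip(*joint)]  # marginal of Y
cells = [(i, j, pxy) for i, row in enumerate(joint)
         for j, pxy in enumerate(row) if pxy > 0]

h_x, h_y = H(px), H(py)
h_xy = H([pxy for _, _, pxy in cells])  # joint entropy H(X,Y)

# Conditional entropy from its definition, using p(x | y) = p(x,y) / p(y).
h_x_given_y = -sum(pxy * math.log(pxy / py[j]) for _, j, pxy in cells)

# Mutual information in its KL-divergence form.
mi_kl = sum(pxy * math.log(pxy / (px[i] * py[j])) for i, j, pxy in cells)

print(h_x_given_y, h_xy - h_y)  # alternative form of H(X | Y)
print(mi_kl, h_x - h_x_given_y, h_x + h_y - h_xy)  # three forms of I(X;Y)
```

Each printed line should show values that agree up to floating-point rounding.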