# [Kullback–Leibler divergence](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence)

In [1]:
import numpy as np
from scipy.stats import entropy

## Definition: Kullback–Leibler (KL) Divergence

Let $\mathcal{M}_1(\mathcal{X})$ denote the set of all probability mass functions (pmfs) on the
measurable space $(\mathcal{X}, \mathcal{P}(\mathcal{X}))$ with $\mathcal{X}$ finite.

Define
$$
\widehat{\mathcal{M}_1(\mathcal{X})^2}
:= \big\{\, (p,q)\in \mathcal{M}_1(\mathcal{X})^2
\; : \; p(x) > 0 \Rightarrow q(x) > 0,\ \forall x \in \mathcal{X} \,\big\}.
$$

We define the **Kullback–Leibler (KL) divergence** as the function
$$
\begin{aligned}
D_{KL} : \widehat{\mathcal{M}_1(\mathcal{X})^2} &\longrightarrow [0,\infty) \\
D_{KL}(p \Vert q) 
&:= \sum_{x \in \mathcal{X}} 
p(x)\,\log\!\left(\frac{p(x)}{q(x)}\right),
\end{aligned}
$$
with the convention $0 \log 0 := 0$, so that terms with $p(x) = 0$ contribute nothing to the sum.

**Intuition:**

The KL divergence $D_{KL}(p \Vert q)$ measures how *inefficient* or *informationally costly* it would be to use the model $q$ to represent or encode data that is actually generated by $p$.  
It is zero if and only if $p=q$, and increases as $q$ becomes a worse approximation of $p$.
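
As a small illustration of the definition, the following sketch evaluates the sum directly for pmfs on a three-point set. The helper name `kl_bits` and the example values are assumptions chosen for illustration. Base 2 is used so that the result can be read as the average number of extra bits per symbol.

```python
import numpy as np

def kl_bits(p, q):
    """D_KL(p || q) in bits, skipping terms with p(x) = 0 (0 log 0 := 0)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    support = p > 0
    return float(np.sum(p[support] * np.log2(p[support] / q[support])))

p = [0.5, 0.25, 0.25]       # illustrative "true" pmf (assumed values)
q_good = [0.5, 0.25, 0.25]  # model equal to p
q_bad = [0.1, 0.1, 0.8]     # model far from p

print("D_KL(p || q_good) =", kl_bits(p, q_good))
print("D_KL(p || q_bad)  =", kl_bits(p, q_bad))
```

The first value is zero because the model matches $p$ exactly. The second is strictly positive, reflecting the cost of encoding with a poor model.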

## Properties: Kullback–Leibler (KL) Divergence

### Metric properties

1. **Non-negativity (Gibbs' inequality)**

$$
D_{KL}(p \,\|\, q) \ge 0
$$

with equality if and only if $p = q$.

2. **Not symmetric and no triangle inequality**

In general,

$$
\begin{align*}
D_{KL}(p \,\|\, q) &\neq D_{KL}(q \,\|\, p),\\
D_{KL}(p \,\|\, r) &\not\le D_{KL}(p \,\|\, q)+ D_{KL}(q \,\|\, r).
\end{align*}
$$

Therefore, KL divergence is **not a metric** (it does not define a distance in the mathematical sense).
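
A concrete counterexample can make both failures visible. The sketch below uses three Bernoulli pmfs whose values are assumptions picked for illustration. The helper `kl_nats` is a hypothetical name, not part of any library.

```python
import numpy as np

def kl_nats(p, q):
    """D_KL(p || q) in nats for pmfs with q(x) > 0 wherever p(x) > 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    support = p > 0  # terms with p(x) = 0 contribute nothing
    return float(np.sum(p[support] * np.log(p[support] / q[support])))

# Illustrative Bernoulli pmfs (assumed values, not from the text)
p = [0.5, 0.5]
q = [0.25, 0.75]
r = [0.1, 0.9]

# Asymmetry: swapping the arguments changes the value
print("D(p||q) =", kl_nats(p, q))
print("D(q||p) =", kl_nats(q, p))

# Triangle inequality fails: the direct divergence exceeds the detour via q
direct = kl_nats(p, r)
detour = kl_nats(p, q) + kl_nats(q, r)
print("D(p||r) =", direct)
print("D(p||q) + D(q||r) =", detour)
```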

3. **Not bounded**

KL divergence can take arbitrarily large values. If the definition is extended to pairs
outside $\widehat{\mathcal{M}_1(\mathcal{X})^2}$, it is set to $+\infty$ whenever there exists $x$ such that

$$
p(x) > 0 \quad \text{and} \quad q(x) = 0.
$$
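
The growth can be observed numerically by letting one model probability shrink towards zero. This is a minimal sketch: the pmf $p$, the sequence of `epsilons` and the helper `kl_nats` are assumptions chosen for illustration.

```python
import numpy as np

def kl_nats(p, q):
    """D_KL(p || q) in nats for pmfs with q(x) > 0 wherever p(x) > 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    support = p > 0
    return float(np.sum(p[support] * np.log(p[support] / q[support])))

p = np.array([0.5, 0.5])
epsilons = [1e-1, 1e-3, 1e-6, 1e-12]  # q(0) shrinks towards 0 while p(0) = 0.5

values = [kl_nats(p, np.array([eps, 1.0 - eps])) for eps in epsilons]
for eps, d in zip(epsilons, values):
    print(f"q(0) = {eps:.0e}  ->  D_KL(p || q) = {d:.3f}")
```

The divergence grows roughly like $-\tfrac{1}{2}\log q(0)$ here, so no finite upper bound exists.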

**Proof (1.):**
We use the elementary inequality
$$
\log t \le t - 1 \quad \text{for all } t > 0,
$$
with equality if and only if $t = 1$.

This is equivalent to
$$
-\log t \ge 1 - t.
$$
Then, summing over the support of $p$ (terms with $p(x)=0$ vanish by convention), we have
$$
\begin{align*}
D_{KL}(p \,\|\, q)
&= \sum_{x:\,p(x)>0} p(x)\,\log\!\left( \frac{p(x)}{q(x)} \right)\\
&= \sum_{x:\,p(x)>0} p(x)\left[-\log\!\left( \frac{q(x)}{p(x)} \right)\right]\\
&\ge \sum_{x:\,p(x)>0} p(x)\left(1- \frac{q(x)}{p(x)} \right)\\
&= \sum_{x:\,p(x)>0} \left(p(x)-q(x) \right)\\
&= 1 - \sum_{x:\,p(x)>0} q(x)\\
&\ge 1 - 1\\
&= 0.
\end{align*}
$$

Equality holds if and only if
$$
-\log\frac{q(x)}{p(x)} = 1 - \frac{q(x)}{p(x)}
$$
for all $x$ with $p(x) > 0$,
which happens if and only if
$$
\frac{q(x)}{p(x)} = 1
\quad \Longleftrightarrow \quad
p(x) = q(x)
$$
for all $x$ with $p(x)>0$.

Since both are probability distributions, this implies
$$
p(x) = q(x) \quad \text{for all } x \in \mathcal{X}.
$$

Conversely, if $p=q$, then
$$
D_{KL}(p\|q) = \sum_x p(x)\log 1 = 0.
$$

Thus
$$
D_{KL}(p\|q) \ge 0,
$$
with equality if and only if $p = q$.
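
The inequality can also be checked empirically. The sketch below draws random pmfs from a Dirichlet distribution and confirms that no sampled pair gives a negative divergence. The seed, the sample size, the dimension and the helper `kl_nats` are arbitrary assumptions, and a numerical check is of course no substitute for the proof.

```python
import numpy as np

def kl_nats(p, q):
    """D_KL(p || q) in nats for pmfs with q(x) > 0 wherever p(x) > 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    support = p > 0
    return float(np.sum(p[support] * np.log(p[support] / q[support])))

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily

# 1000 random pairs of pmfs on a 5-point set
p_samples = rng.dirichlet(np.ones(5), size=1000)
q_samples = rng.dirichlet(np.ones(5), size=1000)
divergences = np.array([kl_nats(p, q) for p, q in zip(p_samples, q_samples)])

print("smallest D_KL over random pairs:", divergences.min())
print("D_KL(p || p) for the first sample:", kl_nats(p_samples[0], p_samples[0]))
```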

### Property: Relation with entropy and cross-entropy

A key identity connects cross-entropy, entropy, and KL-divergence:

$$
H(p \Vert q) 
=  D_{KL}(p \,\|\, q)+ H(p).
$$

This decomposition shows that cross-entropy contains **two distinct effects**:

1. An intrinsic part: the entropy of the true distribution $H(p)$
2. A mismatch part: the divergence between the model and the truth $D_{KL}(p\Vert q)$

**Interpretation:**

* High cross-entropy $H(p \Vert q)$ means that outcomes generated by the true distribution $p$ are (on average) hard to predict with the model $q$, either because  
  - $p$ itself is highly random (high entropy $H(p)$), or  
  - the model $q$ is very different from the true distribution $p$ (high $D_{KL}(p \Vert q)$), or  
  - both.

* Low cross-entropy $H(p \Vert q)$ means that outcomes generated by $p$ are (on average) quite predictable using the model $q$, which happens when  
  - $p$ is structured (low entropy) and  
  - $q$ is close to $p$.
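
The two effects can be separated on toy examples. In the sketch below, the pmfs `uniform`, `peaked` and `wrong` and both helper functions are assumptions made for illustration. For each case it prints the entropy, the mismatch term (computed as cross-entropy minus entropy) and the cross-entropy, all in nats.

```python
import numpy as np

def entropy_nats(p):
    """Shannon entropy H(p) in nats, with 0 log 0 := 0."""
    p = np.asarray(p, dtype=float)
    support = p > 0
    return float(-np.sum(p[support] * np.log(p[support])))

def cross_entropy_nats(p, q):
    """Cross-entropy -sum_x p(x) log q(x) in nats, over the support of p."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    support = p > 0
    return float(-np.sum(p[support] * np.log(q[support])))

uniform = [0.25, 0.25, 0.25, 0.25]  # highly random p
peaked = [0.97, 0.01, 0.01, 0.01]   # structured p
wrong = [0.01, 0.01, 0.01, 0.97]    # model that puts its mass in the wrong place

cases = {
    "random p, perfect model": (uniform, uniform),
    "structured p, perfect model": (peaked, peaked),
    "structured p, poor model": (peaked, wrong),
}
for name, (p, q) in cases.items():
    h = entropy_nats(p)
    ce = cross_entropy_nats(p, q)
    print(f"{name}: H(p) = {h:.3f}, mismatch = {ce - h:.3f}, cross-entropy = {ce:.3f}")
```

Only the second case, with a structured $p$ and a matching model, yields a low cross-entropy.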

#### **Proof:**

$$
\begin{aligned}
H(p \Vert q) 
&:= -\sum_{x \in \mathcal{X}} p(x)\log q(x),\\
&= \sum_{x \in \mathcal{X}} p(x)\log \left(\frac{1}{q(x)}\right)-H(p)+H(p),\\
&= \sum_{x \in \mathcal{X}} p(x)\log \left(\frac{1}{q(x)}\right)+\sum_{x \in \mathcal{X}} p(x)\log \left(p(x)\right)+H(p),\\
&= \sum_{x \in \mathcal{X}} p(x)\log \left(\frac{p(x)}{q(x)}\right)+H(p),\\
&=  D_{KL}(p \,\|\, q)+ H(p).
\end{aligned}
$$


### Property: Relation with mutual information

Let $X$ and $Y$ be random variables with joint pmf $p_{(X,Y)}$ and marginal pmfs $p_X$ and $p_Y$. The mutual information is a special case of KL divergence:

$$
 H(X \Cap Y)
= D_{KL}\!\left( p_{(X,Y)} \, \| \, p_X\,p_Y \right).
$$

This shows:

- Mutual information measures how far the joint distribution is
  from the distribution under independence
- Hence it quantifies **statistical dependence**
- And explains why
$$
H(X \Cap Y) \ge 0
$$
with equality if and only if $X$ and $Y$ are independent.

#### **Proof:** 
Writing the mutual information in its KL-divergence form, we have
$$ H(X \Cap Y) = \sum_{x,y} p_{(X,Y)}(x,y)\log \frac{p_{(X,Y)}(x,y)}{p_X(x)\,p_Y(y)} = D_{KL}\!\left( p_{(X,Y)} \, \| \, p_X\,p_Y \right).$$
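
This identity translates directly into code: flatten the joint table and the outer product of its marginals, then apply the KL formula. The sketch below assumes a hypothetical $2\times 2$ joint pmf. The names `joint`, `product` and `kl_nats` are illustrative, not taken from the text.

```python
import numpy as np

def kl_nats(p, q):
    """D_KL(p || q) in nats for pmfs with q(x) > 0 wherever p(x) > 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    support = p > 0
    return float(np.sum(p[support] * np.log(p[support] / q[support])))

# Hypothetical joint pmf of (X, Y): rows index x, columns index y
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])

p_x = joint.sum(axis=1)         # marginal pmf of X
p_y = joint.sum(axis=0)         # marginal pmf of Y
product = np.outer(p_x, p_y)    # joint pmf under independence

# Mutual information as the divergence of the joint from the product of marginals
mi_dependent = kl_nats(joint.ravel(), product.ravel())

# If the joint already equals the product of its marginals, the divergence vanishes
mi_independent = kl_nats(product.ravel(), product.ravel())

print("dependent case:  ", mi_dependent)
print("independent case:", mi_independent)
```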

## Code: Kullback–Leibler (KL) Divergence

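Consider random variables $X$ and $Y$ taking values in $\{0,1,2\}$ with probability mass functions
$$
\begin{aligned}
p_X:\mathcal{X}&\to[0,1]\\
p_X(0)&=0.2,\\
p_X(1)&=0.5,\\
p_X(2)&=0.3,
\end{aligned}
$$
and
$$
\begin{aligned}
p_Y:\mathcal{Y}&\to[0,1]\\
p_Y(0)&=0.8,\\
p_Y(1)&=0.1,\\
p_Y(2)&=0.1,
\end{aligned}
$$
together with an approximation $\hat{p}_X$ of $p_X$ given by
$$
\begin{aligned}
\hat{p}_X(0)&=0.3,\\
\hat{p}_X(1)&=0.4,\\
\hat{p}_X(2)&=0.3.
\end{aligned}
$$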

In [2]:
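def kl_divergence(P, Q, base=2):
    """Return D_KL(P || Q) for two pmfs given as arrays, in units of log `base`."""
    assert P.shape == Q.shape, "P and Q must have the same shape"
    assert np.isclose(np.sum(P), 1), "P must sum to 1"
    assert np.isclose(np.sum(Q), 1), "Q must sum to 1"
    assert np.all(P[Q == 0] == 0), "Violation: Q == 0 requires P == 0"

    # Only entries with P > 0 contribute (convention 0 log 0 := 0)
    mask = P > 0

    # D_KL(P || Q) = sum_x P(x) log(P(x) / Q(x)), converted to the requested base.
    # Dividing after masking avoids 0/0 warnings where P == Q == 0.
    return (np.sum(P[mask] * np.log(P[mask] / Q[mask])) / np.log(base)).item()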

In [3]:
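def cross_entropy(P, Q, base=2):
    """Return the cross-entropy -sum_x P(x) log Q(x), in units of log `base`."""
    # Only entries with P > 0 contribute (convention 0 log 0 := 0)
    mask = P > 0
    return -(np.sum(P[mask] * np.log(Q[mask])) / np.log(base)).item()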

In [4]:
P_X=np.array([0.2, 0.5, 0.3])
P_X_hat=np.array([0.3, 0.4, 0.3])
P_Y=np.array([0.8, 0.1, 0.1])
print(f"{P_X=}")
print(f"{P_X_hat=}")
print(f"{P_Y=}")

P_X=array([0.2, 0.5, 0.3])
P_X_hat=array([0.3, 0.4, 0.3])
P_Y=array([0.8, 0.1, 0.1])


In [5]:
print(f"D_KL(P_X,P_Y) = {entropy(P_X,P_Y).item()}")

D_KL(P_X,P_Y) = 0.857043770593505


In [6]:
print(f"D_KL(P_X,P_Y) = {kl_divergence(P_X,P_Y,base=np.e)}") 

D_KL(P_X,P_Y) = 0.857043770593505


In [7]:
print(f"D_KL(P_X,P_X_hat) = {kl_divergence(P_X,P_X_hat,base=np.e)}") 

D_KL(P_X,P_X_hat) = 0.030478754035472025


In [8]:
print(f"D_KL(P_X,P_X) = {kl_divergence(P_X,P_X,base=np.e)}") 

D_KL(P_X,P_X) = 0.0


We now verify numerically the property
$$
H(p \Vert q) 
=  D_{KL}(p \,\|\, q)+ H(p).
$$


In [9]:
print(f"H(P_X || P_Y) = {cross_entropy(P_X, P_Y, base=np.e)}")
print(f"D_KL(P_X,P_Y) + H(P_X) = {kl_divergence(P_X,P_Y,base=np.e) + entropy(P_X)}")

H(P_X || P_Y) = 1.8866967846580782
D_KL(P_X,P_Y) + H(P_X) = 1.8866967846580787
