# Kullback-Leibler Divergence

![image.png](attachment:2c5fdec8-c27d-4733-9644-1ece3dc1992a.png)

## Overview:
**KL Divergence**, also known as **Kullback-Leibler Divergence** or **Relative Entropy**, is a measure used in Information Theory and Statistics to quantify the difference between two probability distributions. It measures how one probability distribution **diverges** from another, and <span style="font-size: 11pt; color: mediumseagreen; font-weight: bold">provides insight into the amount of information lost when one distribution is used to approximate another</span>.

Information Theory definition of Kullback-Leibler Divergence:  
- KL Divergence quantifies the increase in the average number of units of information needed per symbol if the encoding is optimized for the probability distribution $Q$ instead of the true distribution $P$. 

Informally, the Kullback-Leibler Divergence (relative entropy) quantifies the expected excess in <u>surprise</u> experienced if one believes the true distribution is $Q$ when it is actually $P$.
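The coding interpretation can be checked numerically: $D_{KL}(P \parallel Q)$ equals the cross-entropy of $P$ relative to $Q$ minus the entropy of $P$. The sketch below uses base-2 logarithms so that all quantities are in bits; the two distributions are made-up values chosen only for illustration.

```python
import numpy as np

# Illustrative (made-up) distributions over four symbols
P = np.array([0.5, 0.25, 0.125, 0.125])  # true distribution
Q = np.array([0.25, 0.25, 0.25, 0.25])   # distribution the code is optimized for

entropy_P = -np.sum(P * np.log2(P))         # average bits per symbol with a code built for P
cross_entropy_PQ = -np.sum(P * np.log2(Q))  # average bits per symbol with a code built for Q
kl_bits = np.sum(P * np.log2(P / Q))        # D_KL(P || Q) in bits

print("H(P)         =", entropy_P)
print("H(P, Q)      =", cross_entropy_PQ)
print("D_KL(P || Q) =", kl_bits)
print("H(P, Q) - H(P) =", cross_entropy_PQ - entropy_P)
```

In this example a code tuned for the uniform $Q$ spends 2 bits per symbol where 1.75 would suffice, and the 0.25-bit gap is the KL Divergence.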

**NOTE**: While the KL Divergence is sometimes *called* the distance between two probability distributions, <span style="font-size: 11pt; color: tomato; font-weight: bold">KL Divergence does not fulfill the requirements of a true distance metric</span>, because:
1. It is not symmetric: in general, $D_{KL}(P \parallel Q) \neq D_{KL}(Q \parallel P)$.
2. It does not satisfy the triangle inequality: *in a true distance metric, the distance between two points is never greater than the sum of their distances to any third point*, i.e. $d(P, R) \leq d(P, Q) + d(Q, R)$. KL Divergence can violate this.
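
Both failures can be seen on a small example. The sketch below uses three Bernoulli distributions with made-up parameters; the helper `kl` is a hypothetical name and assumes strictly positive probabilities.

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) in nats; assumes all entries of p and q are strictly positive."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

# Three illustrative Bernoulli distributions, written as [P(success), P(failure)]
P = [0.5, 0.5]
Q = [0.25, 0.75]
R = [0.1, 0.9]

# 1. Asymmetry: swapping the arguments changes the value
print("D(P || Q) =", kl(P, Q))
print("D(Q || P) =", kl(Q, P))

# 2. Triangle inequality: going "directly" from P to R costs more
#    than the detour through Q, which a true metric would forbid
direct = kl(P, R)
detour = kl(P, Q) + kl(Q, R)
print("D(P || R)             =", direct)
print("D(P || Q) + D(Q || R) =", detour)
print("Triangle inequality violated:", direct > detour)
```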
***
KL Divergence was introduced by **Solomon Kullback** and **Richard Leibler** in 1951 as a way to compare two probability distributions. It found applications in various fields, including information theory, statistics, and machine learning.
***

## In Machine Learning:
In the process of training machine learning models, minimizing the <u>KL Divergence can be an objective to ensure that the model's predictions align as closely as possible with the true data distribution</u>. This is particularly relevant for generative models, where the goal is to generate data samples that closely resemble the true data distribution: Variational Autoencoders (VAEs) include an explicit KL term in their training objective, and the original Generative Adversarial Network (GAN) objective is related to the Jensen-Shannon divergence, which is itself built from KL Divergences.

In the context of building machine learning models, KL Divergence is often used as a measure to guide the training process toward capturing the most accurate representation of the data distribution.
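
As one concrete instance, a VAE objective commonly includes the KL Divergence between a diagonal Gaussian $\mathcal{N}(\mu, \sigma^2)$ produced by the encoder and a standard normal prior, which has the closed form $\frac{1}{2} \sum_j \left( \sigma_j^2 + \mu_j^2 - 1 - \log \sigma_j^2 \right)$. The sketch below is a plain NumPy illustration, not tied to any framework; the function and argument names (`gaussian_kl_to_standard_normal`, `mu`, `log_var`) are hypothetical.

```python
import numpy as np

def gaussian_kl_to_standard_normal(mu, log_var):
    """D_KL( N(mu, diag(exp(log_var))) || N(0, I) ) in nats.

    mu and log_var are per-dimension means and log-variances,
    as an encoder network might output them (an assumption of this sketch).
    """
    mu = np.asarray(mu, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    return float(0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var))

# The penalty is zero when the encoder output already matches the prior...
print(gaussian_kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))

# ...and grows as the mean or variance moves away from it
print(gaussian_kl_to_standard_normal([1.0, 0.0], [0.0, 0.0]))
print(gaussian_kl_to_standard_normal([0.0, 0.0], [1.0, -1.0]))
```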


## Formulas:
For two <span style="font-size: 11pt; color: steelblue; font-weight: bold">discrete probability distributions</span> $P$ and $Q$ defined over the same set of outcomes, KL Divergence is defined as:

$$ \large
D_{KL}(P \parallel Q) = \sum_{i} P(i) \log \frac{P(i)}{Q(i)}
$$
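
A direct translation of the discrete formula into NumPy might look like the sketch below. The function name `kl_divergence_discrete` is hypothetical. It follows the usual conventions that terms with $P(i) = 0$ contribute nothing and that the divergence is infinite if $Q(i) = 0$ for some outcome with $P(i) > 0$.

```python
import numpy as np

def kl_divergence_discrete(p, q):
    """Return D_KL(p || q) in nats for two discrete distributions.

    Terms with p[i] == 0 are skipped (convention: 0 * log 0 = 0).
    If q[i] == 0 while p[i] > 0, the divergence is infinite.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    if p.shape != q.shape:
        raise ValueError("p and q must have the same shape")
    support = p > 0  # only outcomes that P can actually produce
    if np.any(q[support] == 0):
        return float("inf")
    return float(np.sum(p[support] * np.log(p[support] / q[support])))

print(kl_divergence_discrete([0.5, 0.5], [0.5, 0.5]))  # identical distributions
print(kl_divergence_discrete([1.0, 0.0], [0.5, 0.5]))  # equals log(2)
print(kl_divergence_discrete([0.5, 0.5], [1.0, 0.0]))  # Q rules out an outcome P produces
```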

For two <span style="font-size: 11pt; color: steelblue; font-weight: bold">continuous probability distributions</span> with probability density functions $P(x)$ and $Q(x)$, KL Divergence is defined as:

$$ \large
D_{KL}(P \parallel Q) = \int_{x \in X} P(x) \log \frac{P(x)}{Q(x)} \, dx
$$
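
For continuous distributions the integral can be approximated numerically. The sketch below does this for two univariate Gaussians with made-up parameters and compares the result with the known closed form for that case, $\log \frac{\sigma_Q}{\sigma_P} + \frac{\sigma_P^2 + (\mu_P - \mu_Q)^2}{2 \sigma_Q^2} - \frac{1}{2}$. It works with log-densities to avoid taking the logarithm of values that underflow to zero in the tails.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# Illustrative (made-up) parameters
mu_p, sigma_p = 0.0, 1.0
mu_q, sigma_q = 1.0, 2.0
p = norm(mu_p, sigma_p)
q = norm(mu_q, sigma_q)

def integrand(x):
    # P(x) * log(P(x) / Q(x)), written with log-densities for numerical stability
    return p.pdf(x) * (p.logpdf(x) - q.logpdf(x))

# Numerical integration over the whole real line
kl_numeric, abs_error = quad(integrand, -np.inf, np.inf)

# Closed form for two univariate Gaussians
kl_closed = (np.log(sigma_q / sigma_p)
             + (sigma_p**2 + (mu_p - mu_q)**2) / (2 * sigma_q**2)
             - 0.5)

print("Numerical integral:", kl_numeric)
print("Closed form:       ", kl_closed)
```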

## Applications:
KL Divergence has many use cases, including, but not limited to:

- **Optimization:** Serving as a loss or regularization term when optimizing the parameters of machine learning models.
- **Probabilistic Modeling:** Assessing how closely an estimated distribution matches the actual one in models like Gaussian Mixture Models.
- **Natural Language Processing:** Evaluating language model performance by comparing predicted and actual word distributions.
- **Generative Models:** Training generative models like Variational Autoencoders (VAEs); related divergences underlie Generative Adversarial Networks (GANs).
- **Information Retrieval:** Measuring the difference between the term distribution of a query and that of a retrieved document.
- **Finance:** Assessing portfolio risk by comparing market and model distributions.
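
As a small illustration of the language-related use case, the sketch below compares the word distributions of two toy sentences. Everything here is hypothetical: the sentences, the helper `smoothed_distribution`, and the choice of add-one smoothing, which is used only so that no word has zero probability under $Q$ (otherwise the divergence would be infinite).

```python
from collections import Counter

import numpy as np

def smoothed_distribution(tokens, vocabulary, alpha=1.0):
    """Word distribution over `vocabulary` with additive (add-alpha) smoothing."""
    counts = Counter(tokens)
    freq = np.array([counts[word] + alpha for word in vocabulary], dtype=float)
    return freq / freq.sum()

# Toy data, for illustration only
reference = "the cat sat on the mat".split()
generated = "the dog sat on the rug".split()
vocabulary = sorted(set(reference) | set(generated))

P = smoothed_distribution(reference, vocabulary)  # "actual" word distribution
Q = smoothed_distribution(generated, vocabulary)  # "predicted" word distribution

kl_pq = float(np.sum(P * np.log(P / Q)))
print("Vocabulary:", vocabulary)
print("D_KL(P || Q) in nats:", kl_pq)
```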


In conclusion, KL Divergence is a valuable concept in Information Theory and Machine Learning: it quantifies the difference between probability distributions and finds applications in a wide range of fields. Understanding it is essential for using it effectively in Machine Learning algorithms and applications.

# Worked Example

Suppose we roll a six-sided die and want to compare two candidate distributions for it: a **FAIR** die, where the probability of landing on each side is $P(x) = [1/6, 1/6, 1/6, 1/6, 1/6, 1/6]$, and an **UNFAIR** die, where it is $Q(x) = [0.4, 0.1, 0.1, 0.1, 0.1, 0.2]$. The KL Divergence (Relative Entropy) $D_{KL}(P \parallel Q)$ treats the fair distribution $P$ as the reference and measures the information lost when $Q$ is used to approximate it. `scipy.stats.entropy(P, Q)` computes this quantity with the natural logarithm by default, so the result is in nats:

```python
import numpy as np
from scipy.stats import entropy

# Define two probability distributions
P = np.array([1/6, 1/6, 1/6, 1/6, 1/6, 1/6])
Q = np.array([0.4, 0.1, 0.1, 0.1, 0.1, 0.2])

# Compute KL Divergence (relative entropy)
kl_divergence = entropy(P, Q)
print("KL Divergence:", kl_divergence)
```

```
KL Divergence: 0.1642520334860181
```
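
As a cross-check, the same value can be reproduced directly from the discrete formula and with `scipy.special.rel_entr`, which returns the element-wise terms $P(i) \log \frac{P(i)}{Q(i)}$. The sketch below also reports the result in bits via the `base` argument of `scipy.stats.entropy`, and computes the reverse direction $D_{KL}(Q \parallel P)$ for comparison.

```python
import numpy as np
from scipy.special import rel_entr
from scipy.stats import entropy

P = np.full(6, 1/6)
Q = np.array([0.4, 0.1, 0.1, 0.1, 0.1, 0.2])

kl_manual = np.sum(P * np.log(P / Q))  # discrete formula, natural log (nats)
kl_rel_entr = np.sum(rel_entr(P, Q))   # sum of element-wise relative entropy terms
kl_bits = entropy(P, Q, base=2)        # same divergence expressed in bits
kl_reverse = entropy(Q, P)             # D_KL(Q || P), in nats

print("Manual formula (nats):", kl_manual)
print("rel_entr sum (nats):  ", kl_rel_entr)
print("In bits:              ", kl_bits)
print("Reverse, D_KL(Q || P):", kl_reverse)
```

The reverse direction works out to $\log 1.2 \approx 0.1823$ nats, which differs from $D_{KL}(P \parallel Q) \approx 0.1643$ nats, so the order of the arguments matters.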
