# Kullback-Leibler (KL) Divergence

## Introduction

The **Kullback-Leibler (KL) Divergence** is a fundamental concept in information theory and machine learning. It is used to measure how one probability distribution differs from another. 

Unlike distance metrics such as Euclidean distance, KL divergence is not symmetric: the divergence from distribution \( P \) to distribution \( Q \) is generally not the same as the divergence from \( Q \) to \( P \). It is often referred to as the **relative entropy**.

---

## Definition of KL Divergence

Mathematically, the KL divergence between two probability distributions \( P(x) \) and \( Q(x) \) is given by:

$$
D_{KL}(P \, || \, Q) = \sum_{x} P(x) \log \left( \frac{P(x)}{Q(x)} \right),
$$

for discrete random variables, or

$$
D_{KL}(P \, || \, Q) = \int_{-\infty}^{\infty} P(x) \log \left( \frac{P(x)}{Q(x)} \right) dx,
$$

for continuous random variables.

In both cases, \( P(x) \) is called the **true distribution**, and \( Q(x) \) is called the **approximation** or **reference distribution**.
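
As a minimal sketch of the discrete formula, the function below sums \( P(x) \log(P(x)/Q(x)) \) over two lists of probabilities. The name `kl_divergence`, the `base` parameter, and the two sample distributions are illustrative choices, not part of the definition. The sketch assumes the usual conventions that terms with \( P(x) = 0 \) contribute nothing and that the divergence is infinite when \( Q(x) = 0 \) while \( P(x) > 0 \).

```python
import math


def kl_divergence(p, q, base=math.e):
    """Discrete D_KL(P || Q) for two probability lists over the same outcomes."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = 0.0
    for p_x, q_x in zip(p, q):
        if p_x == 0.0:
            continue  # convention: 0 * log(0 / q) contributes 0
        if q_x == 0.0:
            return math.inf  # P puts mass where Q has none
        total += p_x * math.log(p_x / q_x, base)
    return total


# Hypothetical distributions over four outcomes.
uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.7, 0.1, 0.1, 0.1]

print(kl_divergence(skewed, uniform))          # natural log: result in nats
print(kl_divergence(skewed, uniform, base=2))  # base 2: result in bits
```

Changing the logarithm base only rescales the result, so the choice of base sets the unit and nothing else.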

---

## Intuition Behind KL Divergence

The KL divergence measures the **expected extra information** needed to encode data that follows the true distribution \( P(x) \) using a code built for the distribution \( Q(x) \). The unit depends on the base of the logarithm: bits for base 2, nats for the natural logarithm.

If \( P \) and \( Q \) are identical, then \( D_{KL}(P \, || \, Q) = 0 \), meaning there is no additional cost in using \( Q \) to approximate \( P \).

If \( P \) and \( Q \) differ, the KL divergence quantifies how much information is lost when using \( Q \) instead of \( P \).
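
The coding interpretation can be checked numerically. The sketch below, under the assumption of base-2 logarithms and two made-up distributions, compares the average code length of an ideal code for \( P \) (its entropy) with the average length when the code is designed for \( Q \) (the cross-entropy). The gap between the two is the KL divergence in bits. The helper names are hypothetical.

```python
import math


def entropy_bits(p):
    """Average bits per symbol with a code that is ideal for P."""
    return -sum(p_x * math.log2(p_x) for p_x in p if p_x > 0)


def cross_entropy_bits(p, q):
    """Average bits per symbol when data follows P but the code is built for Q.

    Assumes Q(x) > 0 wherever P(x) > 0.
    """
    return -sum(p_x * math.log2(q_x) for p_x, q_x in zip(p, q) if p_x > 0)


# Illustrative distributions, chosen so the logarithms are whole numbers.
p = [0.5, 0.25, 0.125, 0.125]
q = [0.25, 0.25, 0.25, 0.25]

optimal_cost = entropy_bits(p)            # cost with the right code
actual_cost = cross_entropy_bits(p, q)    # cost with the wrong code
extra_bits = actual_cost - optimal_cost   # equals D_KL(P || Q) in bits

print(optimal_cost, actual_cost, extra_bits)  # prints 1.75 2.0 0.25
```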

---

![KL divergences between two normal distributions](https://www.researchgate.net/profile/Duco-Veen/publication/319662351/figure/fig1/AS:614240771637255@1523457820692/KL-divergences-between-two-normal-distributions-In-this-example-p-1-is-a-standard-normal.png)

## Derivation of KL Divergence (Discrete Case)

Let’s derive the KL divergence step by step for the discrete case.

### Step 1: Expected Information Content

From information theory, the amount of information (or surprise) associated with an event \( x \) occurring, given that it follows a distribution \( P(x) \), is:

$$
h(x) = -\log P(x).
$$

If we approximate \( P(x) \) by another distribution \( Q(x) \), the information content changes to:

$$
h_Q(x) = -\log Q(x).
$$

### Step 2: Expected Information Difference

Now, consider the extra information incurred at an outcome \( x \) when we mistakenly use \( Q(x) \) instead of the true distribution \( P(x) \):

$$
h_Q(x) - h(x) = - \log Q(x) + \log P(x).
$$

### Step 3: Taking the Expectation

The KL divergence is the expected value of this difference, weighted by the true distribution \( P(x) \):

$$
D_{KL}(P \, || \, Q) = \sum_{x} P(x) \left[ \log P(x) - \log Q(x) \right].
$$

Simplifying:

$$
D_{KL}(P \, || \, Q) = \sum_{x} P(x) \log \left( \frac{P(x)}{Q(x)} \right).
$$
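
Because the final formula is an expectation under \( P \), it can also be approximated by sampling. The sketch below is an illustration rather than part of the derivation: it draws outcomes from \( P \), averages the per-outcome term \( \log P(x) - \log Q(x) \), and compares the average with the exact sum. The distributions, the seed, and the sample size are arbitrary assumptions.

```python
import math
import random

# Illustrative two-outcome distributions.
p = [0.4, 0.6]
q = [0.5, 0.5]


def log_ratio(x):
    """Per-outcome term inside the sum: log P(x) - log Q(x), in nats."""
    return math.log(p[x]) - math.log(q[x])


# Exact expectation under P.
exact = sum(p[x] * log_ratio(x) for x in range(len(p)))

# Monte Carlo estimate: sample outcomes from P and average the same term.
rng = random.Random(0)  # fixed seed so the run is repeatable
samples = rng.choices(range(len(p)), weights=p, k=100_000)
estimate = sum(log_ratio(x) for x in samples) / len(samples)

print(exact, estimate)
```

Individual outcomes can have a negative log-ratio (here, the outcome where \( P(x) < Q(x) \)), but the average under \( P \) is non-negative.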

---

## Properties of KL Divergence

1. **Non-Negativity**: 
   - The KL divergence is always non-negative:
     $$
     D_{KL}(P \, || \, Q) \geq 0,
     $$
     with equality if and only if \( P = Q \).
   
2. **Asymmetry**:
   - KL divergence is not symmetric; in general,
     $$
     D_{KL}(P \, || \, Q) \neq D_{KL}(Q \, || \, P).
     $$
     Swapping \( P \) and \( Q \) therefore usually yields a different value.

3. **Not a True Distance Metric**:
   - Since it is not symmetric and does not satisfy the triangle inequality, KL divergence is not a true distance metric.
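
These properties can be spot-checked on a small example. The three distributions below are hypothetical and were picked only because they make the asymmetry and the triangle-inequality failure easy to see. The helper `kl` uses natural logarithms and assumes \( Q(x) > 0 \) wherever \( P(x) > 0 \).

```python
import math


def kl(p, q):
    """Discrete D_KL(P || Q) in nats; assumes q_x > 0 wherever p_x > 0."""
    return sum(p_x * math.log(p_x / q_x) for p_x, q_x in zip(p, q) if p_x > 0)


# Illustrative distributions over two outcomes.
p = [0.5, 0.5]
q = [0.25, 0.75]
r = [0.1, 0.9]

# Non-negativity: zero for identical distributions, positive otherwise.
print(kl(p, p), kl(p, q))

# Asymmetry: the two directions give different values.
print(kl(p, q), kl(q, p))

# Triangle inequality fails: going directly from P to R costs more
# than the detour through Q.
print(kl(p, r), kl(p, q) + kl(q, r))
```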

---

## KL Divergence for Continuous Distributions

For continuous random variables, \( P(x) \) and \( Q(x) \) are probability density functions, and the summation is replaced by an integral:

$$
D_{KL}(P \, || \, Q) = \int_{-\infty}^{\infty} P(x) \log \left( \frac{P(x)}{Q(x)} \right) dx.
$$
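
As a rough numerical illustration of the integral, the sketch below approximates it with a midpoint rule for two normal densities and compares the result with the standard closed form for two univariate Gaussians, \( \ln(\sigma_Q/\sigma_P) + \frac{\sigma_P^2 + (\mu_P - \mu_Q)^2}{2\sigma_Q^2} - \frac{1}{2} \). The closed form is brought in here only as a check. The function names, the integration range, and the step count are assumptions made for this example.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def kl_numeric(mu_p, sigma_p, mu_q, sigma_q, lo=-20.0, hi=20.0, steps=40_000):
    """Midpoint-rule approximation of the integral of p(x) * log(p(x) / q(x))."""
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        p_x = normal_pdf(x, mu_p, sigma_p)
        q_x = normal_pdf(x, mu_q, sigma_q)
        if p_x > 0.0 and q_x > 0.0:  # skip points where a density underflows
            total += p_x * math.log(p_x / q_x) * dx
    return total


def kl_closed_form(mu_p, sigma_p, mu_q, sigma_q):
    """Closed-form KL divergence between two univariate normals, in nats."""
    return (math.log(sigma_q / sigma_p)
            + (sigma_p ** 2 + (mu_p - mu_q) ** 2) / (2.0 * sigma_q ** 2)
            - 0.5)


# P is a standard normal; Q is a wider, shifted normal (illustrative values).
print(kl_numeric(0.0, 1.0, 1.0, 2.0))
print(kl_closed_form(0.0, 1.0, 1.0, 2.0))
```

The truncated range is a practical shortcut: the tails beyond it contribute negligibly for these parameters, but a different pair of densities may need a wider range or a finer grid.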

---

## Example

Suppose \( P(x) \) and \( Q(x) \) are two probability distributions defined on the same sample space:

- \( P(x) = [0.4, 0.6] \),
- \( Q(x) = [0.5, 0.5] \).

To compute \( D_{KL}(P \, || \, Q) \):

1. \( P(1) = 0.4 \), \( Q(1) = 0.5 \),
2. \( P(2) = 0.6 \), \( Q(2) = 0.5 \).

Now, apply the formula:

$$
D_{KL}(P \, || \, Q) = 0.4 \log \left( \frac{0.4}{0.5} \right) + 0.6 \log \left( \frac{0.6}{0.5} \right).
$$

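Calculate, using base-10 logarithms:

- \( \log_{10} \left( \frac{0.4}{0.5} \right) = \log_{10}(0.8) \approx -0.0969 \),
- \( \log_{10} \left( \frac{0.6}{0.5} \right) = \log_{10}(1.2) \approx 0.0792 \).

Thus:

$$
D_{KL}(P \, || \, Q) \approx (0.4)(-0.0969) + (0.6)(0.0792),
$$

$$
D_{KL}(P \, || \, Q) \approx -0.0388 + 0.0475 = 0.0087.
$$

This value is in base-10 units. With the natural logarithm the same divergence is about 0.0201 nats, and with base 2 it is about 0.0290 bits.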
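
The arithmetic above can be reproduced with a few lines of Python. This is a small sketch in which the helper `kl` and its `log` argument are illustrative names. It evaluates the same two-term sum with three different logarithm bases, without rounding the intermediate logarithms, so the base-10 value comes out near 0.0087.

```python
import math

p = [0.4, 0.6]
q = [0.5, 0.5]


def kl(p, q, log):
    """Discrete D_KL(P || Q) using the supplied logarithm function."""
    return sum(p_x * log(p_x / q_x) for p_x, q_x in zip(p, q) if p_x > 0)


print(round(kl(p, q, math.log10), 4))  # base-10 logarithm, as in the example
print(round(kl(p, q, math.log), 4))    # natural logarithm: nats
print(round(kl(p, q, math.log2), 4))   # base-2 logarithm: bits
```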

---

## Applications of KL Divergence

1. **Machine Learning**:
   - Used in classification, clustering, and generative models (e.g., Variational Autoencoders).
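   - Closely tied to loss functions such as cross-entropy: since \( H(P, Q) = H(P) + D_{KL}(P \, || \, Q) \), minimizing the cross-entropy over \( Q \) also minimizes the KL divergence.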

2. **Information Theory**:
   - Measures the inefficiency of assuming distribution \( Q \) when the true distribution is \( P \).

3. **Bayesian Inference**:
   - Quantifies the difference between prior and posterior distributions, for example the information gained when a prior is updated to a posterior.
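
To make the Bayesian use concrete, here is a small sketch under invented assumptions: three candidate coin biases, a uniform prior over them, and a made-up record of 8 heads and 2 tails. It applies Bayes' rule and then reports \( D_{KL}(\text{posterior} \, || \, \text{prior}) \) in bits as a measure of how far the data moved the beliefs. All names and numbers are hypothetical.

```python
import math

# Hypothetical setup: three candidate values for the probability of heads.
biases = [0.25, 0.5, 0.75]
prior = [1 / 3, 1 / 3, 1 / 3]

# Made-up observations.
heads, tails = 8, 2

# Bayes' rule: posterior is proportional to likelihood times prior.
likelihood = [b ** heads * (1 - b) ** tails for b in biases]
unnormalized = [lik * pr for lik, pr in zip(likelihood, prior)]
evidence = sum(unnormalized)
posterior = [u / evidence for u in unnormalized]

# D_KL(posterior || prior) in bits: how far the data moved the beliefs.
info_gain_bits = sum(
    po * math.log2(po / pr) for po, pr in zip(posterior, prior) if po > 0
)

print(posterior)
print(info_gain_bits)
```

With a uniform prior over three hypotheses, this quantity can never exceed \( \log_2 3 \) bits, which is reached only when the posterior puts all its mass on a single hypothesis.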

---

## Summary

- KL divergence measures how much one probability distribution diverges from another.
- It is non-negative and asymmetric.
- KL divergence plays a crucial role in machine learning, Bayesian inference, and information theory.



For further explanation, see [this video](https://www.youtube.com/watch?v=LOwj7UxQwJ0&t=520s) and [this video as well](https://www.youtube.com/watch?v=q0AkK8aYbLY).
