# Lecture 7: Learning Probability Distributions 

This notebook contains both theoretical and coding exercises. The number of points for each exercise is stated in its heading or task description.

Always use the Jupyter kernel associated with the dedicated environment when running the notebook and completing your exercises.

#### **Attention** This exercise sheet contains 10 bonus points (exercise 5.b). Furthermore, if you present your own **correct** solution to one of exercises 2, 3, 4, or 5 in the tutorials, you are entitled to 10 additional bonus points.

## Exercise 1 (Theory) (10/100)

### The KL divergence is lower bounded by zero

Starting from the Gibbs inequality, show that the Kullback-Leibler divergence is non-negative, i.e., $\textrm{KL}(P\vert\vert Q)\geq0$.
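
As a numerical sanity check (not a proof), the sketch below evaluates the discrete KL divergence on randomly drawn pairs of distributions and confirms that no value falls below zero. The helper names `kl_divergence` and `random_distribution` are illustrative assumptions, and the code assumes strictly positive $q_i$.

```python
import math
import random

def kl_divergence(p, q):
    """Discrete KL(P || Q) in nats; assumes all q_i > 0 and uses 0 * ln 0 = 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def random_distribution(n, rng):
    """Draw n strictly positive weights and normalize them to sum to one."""
    weights = [rng.random() + 1e-3 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]

rng = random.Random(0)  # fixed seed so the check is reproducible
kl_values = [
    kl_divergence(random_distribution(4, rng), random_distribution(4, rng))
    for _ in range(1000)
]
# Allow a tiny tolerance for floating-point rounding.
print("all non-negative:", min(kl_values) >= -1e-12)

# The lower bound is attained when both distributions coincide.
p = random_distribution(4, rng)
print("KL(P || P) =", kl_divergence(p, p))
```

Such a check can catch a sign error in a derivation, but the exercise still requires the analytic argument.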

> #### Your solution here

## Exercise 2 (Theory) (20/100)

### Kullback-Leibler divergence as a limit of the Rényi (alpha) divergence

The [Rényi divergence](https://en.wikipedia.org/wiki/Rényi_entropy) is a generalized measure of dissimilarity between two probability distributions. It can also be used for variational inference, as shown in [this paper](https://proceedings.neurips.cc/paper_files/paper/2016/file/7750ca3559e5b8e1f44210283368fc16-Paper.pdf) from 2016.
More specifically, the Rényi divergence of order $\alpha$ (also called alpha-divergence) of a distribution $P$ from a distribution $Q$ is defined, for $\alpha > 0$ and $\alpha \neq 1$, as

$$ D_{\alpha}(P \| Q) = \frac{1}{\alpha - 1} \ln \sum_{i} \left( p_i \right)^{\alpha} \left( q_i \right)^{1 - \alpha} $$

Show that in the limit $\alpha\to 1$ the Rényi divergence reduces to the Kullback–Leibler divergence $D_{\text{KL}}(P \| Q)$, i.e.,
$$ D_{\text{KL}}(P \| Q) = \sum_{i} p_i \ln \frac{p_i}{q_i} $$
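
The limit can be observed numerically before it is derived. The sketch below evaluates $D_{\alpha}(P \| Q)$ for values of $\alpha$ on both sides of 1 and compares them with the KL divergence. The two distributions are arbitrary illustrative choices with strictly positive entries, and the function names are assumptions rather than part of the exercise.

```python
import math

def renyi_divergence(p, q, alpha):
    """D_alpha(P || Q) for alpha != 1, assuming strictly positive p_i and q_i."""
    total = sum(pi**alpha * qi ** (1.0 - alpha) for pi, qi in zip(p, q))
    return math.log(total) / (alpha - 1.0)

def kl_divergence(p, q):
    """Discrete KL(P || Q) in nats for strictly positive entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Arbitrary example distributions (illustrative only).
p = [0.2, 0.5, 0.3]
q = [0.4, 0.4, 0.2]

kl = kl_divergence(p, q)
for alpha in (0.9, 0.99, 0.999, 1.001, 1.01, 1.1):
    d = renyi_divergence(p, q, alpha)
    print(f"alpha={alpha:<6} D_alpha={d:.6f}  |D_alpha - KL|={abs(d - kl):.2e}")
print(f"KL = {kl:.6f}")
```

The gap should shrink as $\alpha$ approaches 1 from either side. This only illustrates the statement; the exercise asks for the analytic limit.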

> #### Your solution here

## Exercise 3 (Theory) (30/100)

### Differential Entropy

The joint differential entropy of two random variables measures the uncertainty or unpredictability of their joint distribution over continuous values. Unlike Shannon entropy, which is defined for discrete probability distributions, differential entropy applies to continuous random variables.

The joint differential entropy $H(X,Y)$ of two continuous random variables $X$ and $Y$ with joint probability density function $p(x,y)$ is defined as:

$$ H(X,Y) = -\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} p(x,y) \ln(p(x,y)) \, dx \, dy $$

where $ p(x,y) $ is the joint probability density function of $ X $ and $ Y $.

Consider two variables $X$ and $Y$ with joint distribution $p(x, y)$.
- **Task 3.a (10 pts.)** Show that the differential entropy of this pair of variables satisfies $H(X, Y) \leq H(X) + H(Y)$.
- **Task 3.b (20 pts.)** Starting from the result of **3.a**, prove that equality holds if, and only if, $X$ and $Y$ are statistically independent.
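
Both statements can be illustrated on one family where the entropies are known in closed form. The sketch below assumes a bivariate Gaussian with standard deviations `sigma_x`, `sigma_y` and correlation `rho` (all illustrative choices, not part of the exercise). It uses the standard formulas $H(X) = \tfrac{1}{2}\ln(2\pi e\,\sigma_x^2)$ and $H(X,Y) = \tfrac{1}{2}\ln\big((2\pi e)^2 \sigma_x^2\sigma_y^2(1-\rho^2)\big)$.

```python
import math

def gaussian_entropy(sigma):
    """Differential entropy (nats) of a 1-d Gaussian with standard deviation sigma."""
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma**2)

def joint_gaussian_entropy(sigma_x, sigma_y, rho):
    """Joint differential entropy (nats) of a bivariate Gaussian with correlation rho."""
    det_cov = (sigma_x * sigma_y) ** 2 * (1.0 - rho**2)
    return 0.5 * math.log((2.0 * math.pi * math.e) ** 2 * det_cov)

sigma_x, sigma_y = 1.5, 0.7  # illustrative values
for rho in (0.0, 0.3, 0.9):
    joint = joint_gaussian_entropy(sigma_x, sigma_y, rho)
    marginals = gaussian_entropy(sigma_x) + gaussian_entropy(sigma_y)
    print(f"rho={rho:.1f}  H(X,Y)={joint:.4f}  H(X)+H(Y)={marginals:.4f}  "
          f"gap={marginals - joint:.4f}")
```

In this family the gap equals $-\tfrac{1}{2}\ln(1-\rho^2)$, which vanishes only at $\rho = 0$, where jointly Gaussian variables are independent. The general proof must not rely on Gaussianity.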


> #### Your solution here

## Exercise 4 (Theory) (20/100)

### Training by Forward KL

Imagine you are given a dataset $\mathbf{X}=\{\mathbf{x}_i\}_{i=1}^N$ of inputs (e.g., images), each a 2-d array $\mathbf{x}_i\in\mathbb{R}^{n\times n}$. You want to infer the distribution $p$ from which these data were sampled.

In order to do so, you train a generative model, e.g., a normalizing flow, which outputs a parametrized (variational) distribution $q_\theta$. 

**Task (4.a)** **(20 pts.)** Starting from the definition of $\textrm{KL}(p||q_\theta)$, show that, up to an additive term that does not depend on $\theta$, the loss function can be rewritten as the negative expected log-likelihood of the learned distribution $q_\theta$ under the true distribution $p$, i.e.,
$$
-\mathbb{E}_{\mathbf{x}\sim p}{\left[\ln q_\theta(\mathbf{x})\right]}.
$$
Then derive the corresponding gradient of the loss function.
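
In practice the expectation over $p$ is replaced by an average over the dataset. The sketch below is a deliberately small stand-in for the setting above: it assumes 1-d data and a model $q_\mu = \mathcal{N}(\mu, 1)$ with a single parameter instead of a normalizing flow on images. The function names, the synthetic data, and the learning rate are all assumptions made for illustration.

```python
import math
import random

def negative_log_likelihood(samples, mu):
    """Monte Carlo estimate of -E_{x~p}[ln q_mu(x)] for q_mu = N(mu, 1)."""
    const = 0.5 * math.log(2.0 * math.pi)
    return sum(0.5 * (x - mu) ** 2 + const for x in samples) / len(samples)

def nll_gradient(samples, mu):
    """Derivative of the estimate above with respect to mu: mean of (mu - x)."""
    return sum(mu - x for x in samples) / len(samples)

rng = random.Random(0)
# Synthetic stand-in for samples from the unknown distribution p.
samples = [rng.gauss(2.0, 1.0) for _ in range(5000)]

mu = 0.0             # initial parameter
learning_rate = 0.1  # illustrative step size
for _ in range(200):
    mu -= learning_rate * nll_gradient(samples, mu)

sample_mean = sum(samples) / len(samples)
print(f"fitted mu = {mu:.4f}, sample mean = {sample_mean:.4f}")
```

For this toy model, gradient descent on the sample estimate converges to the sample mean, which is the maximum-likelihood estimate. The general gradient requested in the task has to be derived for an arbitrary $q_\theta$.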


> #### Your solution here

## Exercise 5 (Theory) (30/100)

### The Triangle Inequality

The [triangle inequality](https://en.wikipedia.org/wiki/Triangle_inequality) is a property that holds for distance measures (metrics). It states that for any three points $ A $, $ B $, and $ C $, the distance from $ A $ to $ C $ should be less than or equal to the sum of the distances from $ A $ to $ B $ and from $ B $ to $ C $:

$$ d(A, C) \leq d(A, B) + d(B, C) $$


- **Task 5.a (20 pts.)** Does the KL divergence $D_{KL}(P \parallel Q)$ fulfill the triangle inequality, i.e., is the KL divergence a true metric? To find out, consider two examples in which probability distributions $P(x), Q(x),$ and $R(x)$ are given. For each example, check numerically whether the triangle inequality is violated or fulfilled.
> Hint: For simplicity, consider each sample $x\in \mathcal{X}=\{0,1\}$. You can start by considering the 3 following probability distributions:
$$ P(x) = \begin{pmatrix} 1 & 0 \end{pmatrix}, Q(x) = \begin{pmatrix} 0.5 & 0.5 \end{pmatrix}, R(x) = \begin{pmatrix} 0 & 1 \end{pmatrix} $$
What can you say about the triangle inequality in this case? Can you provide a second example with different probabilities? Can you say something different in this second case? 
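
One practical point when working with distributions that contain zeros is how the individual terms are evaluated. The sketch below uses the usual conventions $0 \ln(0/q) = 0$ and $p \ln(p/0) = \infty$ for $p > 0$, and applies them to a single pair of distributions in both directions. The helper `kl_divergence` is an illustrative plain-Python assumption, not the PyTorch implementation requested in Task 5.b.

```python
import math

def kl_divergence(p, q):
    """Discrete KL(P || Q) in nats with explicit handling of zero entries."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue         # convention: 0 * ln(0 / q) = 0
        if qi == 0.0:
            return math.inf  # p > 0 where q = 0: the divergence is infinite
        total += pi * math.log(pi / qi)
    return total

P = [1.0, 0.0]
Q = [0.5, 0.5]
print("KL(P || Q) =", kl_divergence(P, Q))  # finite, equals ln 2
print("KL(Q || P) =", kl_divergence(Q, P))  # infinite
```

The two directions already differ, so the KL divergence is not symmetric. Whether the triangle inequality holds has to be checked separately for each triple of distributions.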

> #### Your solution here

- **Task 5.b (10 (bonus) pts.)** Using the PyTorch library, provide a simple implementation of your numerical example above that demonstrates the fulfillment/violation of the triangle inequality.

In [None]:
import torch
import torch.nn.functional as F

# TODO: Define the probability distributions
#----------------
# P = 
# Q = 
# R = 
#----------------

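# Compute KL divergence using PyTorch's built-in function
def kl_divergence(P, Q):
    #----------------
    # TODO: Your code here
    #----------------
    # Placeholder so that the cell parses; remove it once implemented.
    raise NotImplementedError("kl_divergence is not implemented yet")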

# Compute the KL divergences
kl_pq = kl_divergence(P, Q)
kl_qr = kl_divergence(Q, R)
kl_pr = kl_divergence(P, R)

# Print the results
print(f"D_KL(P || Q): {kl_pq.item()}")
print(f"D_KL(Q || R): {kl_qr.item()}")
print(f"D_KL(P || R): {kl_pr.item()}")

# Check the triangle inequality
#----------------
# TODO: Your code here
#----------------