# PS3-3: KL Divergence, Fisher Information, and the Natural Gradient

## Background
#### Dependence of gradient ascent on parametrization

Suppose we have a family of probability distributions $P_\theta = p(x;\theta) \, dx$ parametrized by $\theta \in \mathbb{R}^{k}$, and we want to find the distribution which maximizes some objective $J(P_\theta)$. So far, we have considered gradient ascent with respect to $\theta$. That is, at each iteration we update $\theta$ by the rule

$$\theta \mapsto \theta + \alpha \nabla_\theta J(\theta)$$

The step size $\alpha$ controls how far we move in each update, but this notion of distance depends on the choice of parametrization $\theta \mapsto P_\theta$. More precisely, the norm of the gradient $\nabla_\theta J(\theta)$ changes under reparametrization, so the 'distance' moved toward the optimal distribution in one update varies with the parametrization. This notebook describes an alternative approach in which each update moves the distribution a fixed, parametrization-independent 'distance' toward the optimal distribution. The invariant notion of distance we will use is the KL divergence, which depends only on the distributions themselves and not on how they are parametrized.
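To make the dependence concrete, here is a minimal sketch using a Bernoulli distribution, an example chosen for illustration rather than taken from the text above. The objective is assumed to be $J = \log P(x = 1)$, and the same step size is applied in two parametrizations: the mean $p$ and the logit $z$ with $p = 1/(1 + e^{-z})$. The names `alpha`, `p0`, and `sigmoid` are hypothetical.

```python
import math

def sigmoid(z):
    """Map a logit z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

alpha = 0.1   # step size (illustrative value)
p0 = 0.5      # starting distribution: P(x = 1) = 0.5

# Mean parametrization: J(p) = log p, so dJ/dp = 1/p.
p_after_mean_step = p0 + alpha * (1.0 / p0)

# Logit parametrization: J(z) = log sigmoid(z), so dJ/dz = 1 - sigmoid(z).
z0 = math.log(p0 / (1.0 - p0))           # logit of p0
z1 = z0 + alpha * (1.0 - sigmoid(z0))
p_after_logit_step = sigmoid(z1)

print(p_after_mean_step, p_after_logit_step)
```

Both runs start from the same distribution and use the same $\alpha$, yet the two updated values of $P(x = 1)$ differ, which is the parametrization dependence described above.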

## (A): The score function

Given a parametrized family of distributions $p(x;\theta)$, define the score function $S(x;\theta)$ by

$$S(x;\theta) = \nabla_\theta \log p(x;\theta)$$

If we sample $x$ from the distribution $P_\theta$, the score function becomes a random variable with its own distribution. This is a natural object to study when fitting by maximum likelihood (MLE), since there we maximize the log-likelihood $\sum_i \log p(x^{(i)};\theta)$ under the assumption that the data $x^{(i)}$ are sampled from $P_\theta$ itself.

#### Suppose that $x \sim P_\theta$. Prove that $E \, S(x,\theta) = 0$.

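Write out the expectation and use the identity $p \, \nabla \log p = \nabla p$. Assuming $p$ is regular enough to interchange differentiation and integration,

$$E \, S(x;\theta) = \int_{\mathbb{R}^n} p(x;\theta) \nabla_\omega \log p(x;\omega)|_{\omega = \theta} \, dx = \int_{\mathbb{R}^n} \nabla_\omega p(x;\omega)|_{\omega = \theta} \, dx = \nabla_\omega|_{\omega =\theta} \int_{\mathbb{R}^n} p(x;\omega) \, dx = \nabla_\omega|_{\omega = \theta} \, 1 = 0$$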
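As a sanity check, the zero-mean property can be verified exactly for a distribution with finitely many outcomes. The sketch below assumes a Bernoulli distribution with $\theta = P(x = 1)$, which is an illustrative choice and not part of the problem statement; the function names are hypothetical.

```python
def bernoulli_pmf(x, theta):
    """p(x; theta) for x in {0, 1}, with theta = P(x = 1)."""
    return theta if x == 1 else 1.0 - theta

def bernoulli_score(x, theta):
    """d/dtheta log p(x; theta) = x/theta - (1 - x)/(1 - theta)."""
    return x / theta - (1 - x) / (1.0 - theta)

def expected_score(theta):
    """Exact expectation of the score: a sum over both outcomes."""
    return sum(bernoulli_pmf(x, theta) * bernoulli_score(x, theta) for x in (0, 1))

mean_scores = [expected_score(t) for t in (0.1, 0.5, 0.9)]
print(mean_scores)
```

Each entry of `mean_scores` should be zero up to floating-point rounding, matching $E \, S = 0$.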

## (B): Fisher information

We now define the Fisher information $\mathcal{I}(\theta)$ to be the covariance of the score function:

$$\mathcal{I}(\theta) = \text{Cov}_{x \sim P_\theta} \, S(x;\theta) = E_{x \sim P_\theta} \left( \nabla_\omega \log p(x;\omega)|_{\omega = \theta} \, \nabla_\omega \log p(x;\omega)|_{\omega = \theta}^T\right)$$

The formula in terms of expectation follows from the fact that $E[S(x,\theta)] = 0$ from part (A).
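The definition can be illustrated with a Monte Carlo estimate. The sketch below assumes a Gaussian family $N(\mu, \sigma^2)$ with $\theta = (\mu, \sigma)$, for which the Fisher information is known in closed form to be $\text{diag}(1/\sigma^2, 2/\sigma^2)$. The family, the sample size, and the seed are illustrative assumptions.

```python
import numpy as np

def gaussian_score(x, mu, sigma):
    """Score of N(mu, sigma^2) with respect to theta = (mu, sigma), one row per sample."""
    d_mu = (x - mu) / sigma**2
    d_sigma = ((x - mu) ** 2 - sigma**2) / sigma**3
    return np.stack([d_mu, d_sigma], axis=1)

mu, sigma = 1.0, 2.0
rng = np.random.default_rng(0)
x = rng.normal(loc=mu, scale=sigma, size=200_000)

scores = gaussian_score(x, mu, sigma)
# The score has mean zero, so its covariance is E[S S^T].
fisher_mc = scores.T @ scores / len(x)
fisher_exact = np.diag([1.0 / sigma**2, 2.0 / sigma**2])

print(fisher_mc)
print(fisher_exact)
```

The estimate agrees with the closed form only up to sampling noise, so the comparison should use a tolerance rather than exact equality.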

## (C): Alternate form of Fisher information

For any function $f > 0$, we have the identity

$$\nabla^2 \log f = f^{-1}\nabla^2 f - (f^{-1}\nabla f)(f^{-1}\nabla f)^T$$

So,

$$E_{x \sim P_\theta} \left[ -\nabla^2_\omega \log p(x;\omega)|_{\omega = \theta} \right] = \int_{\mathbb{R}^n} \left[ -\nabla^2_\omega p(x;\omega)|_{\omega = \theta} + p(x;\theta) \left(\nabla_\omega \log p(x;\omega)|_{\omega = \theta} \right) \left(\nabla_\omega \log p(x;\omega)|_{\omega = \theta} \right)^T \right] dx$$

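The first term vanishes for the same reason as in part (A): $\int \nabla^2_\omega p(x;\omega) \, dx = \nabla^2_\omega \int p(x;\omega) \, dx = \nabla^2_\omega 1 = 0$. The second term is the covariance of the score function, as computed in part (B). Hence

$$\mathcal{I}(\theta) = E_{x \sim P_\theta} \left[ -\nabla^2_\omega \log p(x;\omega) |_{\omega = \theta} \right]$$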
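The agreement of the two forms can be checked exactly in a small case. The sketch below assumes a Bernoulli distribution with $\theta = P(x = 1)$, an illustrative choice, for which both forms should equal $1/(\theta(1-\theta))$. All function names are hypothetical.

```python
def pmf(x, theta):
    """Bernoulli p(x; theta) for x in {0, 1}."""
    return theta if x == 1 else 1.0 - theta

def score(x, theta):
    """First derivative of log p(x; theta) in theta."""
    return x / theta - (1 - x) / (1.0 - theta)

def second_derivative(x, theta):
    """Second derivative of log p(x; theta) in theta."""
    return -x / theta**2 - (1 - x) / (1.0 - theta) ** 2

def fisher_from_score(theta):
    """E[S^2]; equals the variance of S because E[S] = 0."""
    return sum(pmf(x, theta) * score(x, theta) ** 2 for x in (0, 1))

def fisher_from_hessian(theta):
    """Negative expected second derivative of the log density."""
    return -sum(pmf(x, theta) * second_derivative(x, theta) for x in (0, 1))

theta = 0.3
closed_form = 1.0 / (theta * (1.0 - theta))
print(fisher_from_score(theta), fisher_from_hessian(theta), closed_form)
```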

## (D): Fisher information is the infinitesimal change in KL divergence

Consider a parametrized family of probability distributions $P_\theta$ with densities $p(x;\theta) \, dx$. For fixed $\theta$, consider the increment in KL divergence:

$$D(\delta) = D_{KL}(P_{\theta}||P_{\theta + \delta}) = \int p(x;\theta) \log \frac{p(x;\theta)}{p(x;\theta + \delta)} \, dx = \int -p(x;\theta)\log p(x;\theta + \delta) \, dx + \text{const.}$$ 

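Clearly $D(0) = 0$. Differentiating under the integral sign gives $\nabla_\delta D |_{\delta = 0} = -E_{x \sim P_\theta} S(x;\theta) = 0$ by part (A), and $\nabla^2_\delta D|_{\delta = 0} = -E_{x \sim P_\theta} \nabla^2_\omega \log p(x;\omega)|_{\omega = \theta} = \mathcal{I}(\theta)$ by part (C). Hence, by Taylor's formula, we have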

$$D_{KL}(P_\theta||P_{\theta + \delta}) = \frac{1}{2}\bigg\langle \mathcal{I}(\theta) \delta , \delta  \bigg\rangle + O(|\delta|^3)$$
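The quality of the quadratic approximation can be seen numerically. The sketch below assumes a Bernoulli family with $\theta = P(x = 1)$, an illustrative choice, where the KL divergence has a closed form and $\mathcal{I}(\theta) = 1/(\theta(1-\theta))$. The function names and the values of $\theta$ and $\delta$ are hypothetical.

```python
import math

def kl_bernoulli(theta, delta):
    """D_KL(Bernoulli(theta) || Bernoulli(theta + delta))."""
    q = theta + delta
    return theta * math.log(theta / q) + (1 - theta) * math.log((1 - theta) / (1 - q))

def quadratic_approx(theta, delta):
    """Second-order term (1/2) * I(theta) * delta^2."""
    fisher = 1.0 / (theta * (1.0 - theta))
    return 0.5 * fisher * delta**2

def relative_error(theta, delta):
    """Relative gap between the exact KL and its quadratic approximation."""
    approx = quadratic_approx(theta, delta)
    return abs(kl_bernoulli(theta, delta) - approx) / approx

theta = 0.3
for delta in (0.1, 0.01, 0.001):
    print(delta, kl_bernoulli(theta, delta), quadratic_approx(theta, delta),
          relative_error(theta, delta))
```

Since the remainder is $O(|\delta|^3)$ and the leading term is quadratic, the relative error should shrink roughly in proportion to $\delta$.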

## (E): The natural gradient update rule

Let $\ell(\theta) = \log p(x;\theta)$ be the log-likelihood. Now we fix a small 'step size' $c$ and we seek to increase $\ell(\theta)$ as much as possible by updating $\theta \mapsto \theta + \delta$ while ensuring that the KL divergence $D_{KL}(P_\theta||P_{\theta+\delta}) = c$. To do this we maximize $\ell(\theta + \delta)$ subject to this constraint using Lagrange multipliers.

We are looking for $\delta$ such that there is a $\lambda$ satisfying 

$$\nabla_\delta \ell(\theta + \delta) = \lambda \nabla_\delta D_{KL}(P_\theta||P_{\theta + \delta})$$

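Replacing both sides by their leading-order Taylor expansions at $\delta = 0$ (the left side to zeroth order, and the right side using $\nabla_\delta D_{KL} \simeq \mathcal{I}(\theta)\delta$ from part (D)), this is approximately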

$$\nabla_\omega \ell(\omega)|_{\omega = \theta} = \lambda\mathcal{I}(\theta)\delta$$

or $$\delta = \frac{1}{\lambda} \mathcal{I}(\theta)^{-1} \nabla \ell(\theta)$$

To determine $\lambda$, use the constraint equation: 

$$D_{KL}(P_\theta||P_{\theta + \delta}) \simeq \frac{1}{2}\bigg\langle \mathcal{I}(\theta)\delta , \delta \bigg\rangle = c$$

Substituting $\delta = \frac{1}{\lambda}\mathcal{I}(\theta)^{-1}\nabla \ell(\theta)$ gives $\lambda^2 = \frac{1}{2c}\langle \mathcal{I}(\theta)^{-1} \nabla \ell(\theta) , \nabla \ell(\theta) \rangle$. Finally, we take the positive root $\lambda > 0$: assuming $\mathcal{I}$ is positive definite, $\langle \mathcal{I}^{-1}\nabla \ell , \nabla \ell \rangle > 0$, so $\mathcal{I}^{-1}\nabla \ell$ makes an acute angle with the gradient $\nabla \ell$, and stepping along it increases $\ell$ to first order.
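The resulting update can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `natural_gradient_step` is hypothetical, the Fisher matrix is assumed positive definite, the gradient is assumed nonzero, and the numerical values are made up for the example.

```python
import numpy as np

def natural_gradient_step(grad, fisher, c):
    """Return delta = (1/lambda) * I^{-1} grad with (1/2) delta^T I delta = c."""
    # Solve I * v = grad instead of forming the inverse explicitly.
    nat_grad = np.linalg.solve(fisher, grad)
    # lambda^2 = <I^{-1} grad, grad> / (2c); take the positive root.
    lam = np.sqrt(grad @ nat_grad / (2.0 * c))
    return nat_grad / lam

fisher = np.array([[2.0, 0.5], [0.5, 1.0]])   # illustrative positive definite matrix
grad = np.array([1.0, -3.0])                  # illustrative gradient of the log-likelihood
c = 0.01                                      # target KL divergence per step

delta = natural_gradient_step(grad, fisher, c)
kl_quadratic = 0.5 * delta @ fisher @ delta   # quadratic approximation of the KL
print(delta, kl_quadratic, delta @ grad)
```

The step satisfies the constraint in its quadratic approximation and has positive inner product with the gradient. If the gradient is zero, `lam` is zero and the division fails, so a practical version would need to handle that case.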

## (F): Natural gradient ascent for GLMs

We consider the example of fitting a generalized linear model. Recall that the exponential family distributions with natural parameter $\eta \in \mathbb{R}$ are given by

$$p(y,\eta) = b(y) \exp \left(\eta y - a(\eta) \right)$$

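The Hessian of the log-likelihood with respect to $\eta$ is simply $-a''(\eta)$, which does not depend on $y$, so it equals its own expectation with respect to $p(y,\eta)$. By part (C), the Fisher information is therefore $\mathcal{I}(\eta) = a''(\eta) = -\nabla_\eta^2 \log p(y,\eta)$. This means the update direction of natural gradient ascent is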

$$-\frac{1}{\lambda}\left(\nabla_\eta^2 \log p(y,\eta)\right)^{-1} \nabla_\eta \log p(y,\eta)$$

which differs from the update step of Newton's method for maximizing $\log p(y,\eta)$:

$$\eta \mapsto \eta - (\nabla_\eta^2\log p(y,\eta))^{-1}\nabla_\eta \log p(y,\eta)$$

by a positive scalar multiple.
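This relationship can be checked for a specific exponential family member. The sketch below assumes the Poisson distribution, with $b(y) = 1/y!$ and $a(\eta) = e^\eta$, so that the score is $y - e^\eta$ and the Fisher information is $e^\eta$. The observation `y`, the starting point `eta`, and the KL budget `c` are illustrative values.

```python
import math

def poisson_score(y, eta):
    """d/d eta of log p(y; eta) = y - a'(eta), with a(eta) = exp(eta)."""
    return y - math.exp(eta)

def poisson_fisher(eta):
    """Fisher information a''(eta) = exp(eta)."""
    return math.exp(eta)

y, eta, c = 3, 0.0, 0.01

score = poisson_score(y, eta)
fisher = poisson_fisher(eta)

# Newton step: -(Hessian)^{-1} * score, where the Hessian is -fisher.
newton_step = score / fisher
# Natural gradient step: the same direction, rescaled by 1/lambda.
lam = math.sqrt(score * score / fisher / (2.0 * c))
natural_step = newton_step / lam

print(newton_step, natural_step, lam)
```

The two steps have the same sign and differ by the factor $\lambda > 0$, which here depends on the KL budget $c$. When the score is zero, $\lambda$ is zero and the rescaling is undefined, so a practical version would stop at that point.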