# Natural gradients

This notebook shows some basic usage of the natural gradient optimizer, both on its own and in combination with other optimizers using the Actions class.

In [1]:
from matplotlib import pyplot as plt
import warnings
import numpy as np
import gpflow
from gpflow.actions import Loop, Action
from gpflow.models import VGP, GPR, SGPR, SVGP
from gpflow.training import NatGradOptimizer, AdamOptimizer, XiSqrtMeanVar

%matplotlib inline
%precision 4
warnings.filterwarnings('ignore')

np.random.seed(0)

N, D = 100, 2
M = 10
X = np.random.uniform(size=(N, D))
Y = np.sin(10 * X)
Z = np.random.uniform(size=(M, D))
learning_rate = 0.01
iterations = 5
    

def make_matern_kernel():
    return gpflow.kernels.Matern52(D)

    
class PrintAction(Action):
    def __init__(self, model, text):
        self.model = model
        self.text = text
        
    def run(self, ctx):
        likelihood = ctx.session.run(self.model.likelihood_tensor)
        print('{}: iteration {} likelihood {:.4f}'.format(self.text, ctx.iteration, likelihood))



### "VGP is a GPR"

Natural gradients turn VGP into GPR in a *single step, if the likelihood is Gaussian*.

In [2]:
vgp = VGP(X, Y, make_matern_kernel(), gpflow.likelihoods.Gaussian())
gpr = GPR(X, Y, make_matern_kernel())

* Exact GP likelihood:

In [3]:
gpr.compute_log_likelihood()

-231.0899

* VGP likelihood before the natural gradient step:

In [4]:
vgp.compute_log_likelihood()

-328.8429

* VGP likelihood after a single natural gradient step:

In [5]:
NatGradOptimizer(gamma=1.).minimize(vgp, maxiter=1, var_list=[(vgp.q_mu, vgp.q_sqrt)])
vgp.compute_log_likelihood()

-231.0900
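
To see why a single step with gamma = 1 can be enough in the Gaussian case, here is a minimal sketch on a one-dimensional toy model. This is not GPflow code: the scalar prior `N(0, k)`, the likelihood `N(y | f, s2)`, the function names and the numerical values are all assumptions chosen for illustration. The step is taken in the natural parameters of `q`, along the gradient of the ELBO with respect to the expectation parameters.

```python
# Toy conjugate model (an illustrative assumption, not GPflow code):
#   prior        f ~ N(0, k)
#   likelihood   y | f ~ N(f, s2)
#   variational  q(f) = N(m, v)


def elbo_grads(m, v, y, k, s2):
    """Gradients of the ELBO with respect to the mean m and variance v of q."""
    d_m = (y - m) / s2 - m / k
    d_v = -0.5 / s2 - 0.5 / k + 0.5 / v
    return d_m, d_v


def natgrad_step(m, v, y, k, s2, gamma):
    """One natural gradient step, taken in the natural parameters of q."""
    # Natural parameters of N(m, v).
    theta1, theta2 = m / v, -0.5 / v
    # Gradient with respect to the expectation parameters (m, m**2 + v),
    # obtained from the (m, v) gradients by the chain rule.
    d_m, d_v = elbo_grads(m, v, y, k, s2)
    d_eta1, d_eta2 = d_m - 2.0 * m * d_v, d_v
    theta1 += gamma * d_eta1
    theta2 += gamma * d_eta2
    # Convert back to mean and variance.
    v_new = -0.5 / theta2
    return theta1 * v_new, v_new


y, k, s2 = 2.0, 1.0, 0.5
m1, v1 = natgrad_step(m=0.0, v=1.0, y=y, k=k, s2=s2, gamma=1.0)

# Exact posterior of the toy model, for comparison.
v_exact = 1.0 / (1.0 / k + 1.0 / s2)
m_exact = v_exact * y / s2
print("after one step: m={:.4f} v={:.4f}".format(m1, v1))
print("exact:          m={:.4f} v={:.4f}".format(m_exact, v_exact))
```

In this toy model the result does not depend on where `q` starts: any valid initial `(m, v)` lands on the exact posterior after one step with gamma = 1.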

### Interleaving an ordinary gradient step with a NatGrad optimizer step

In this case (Gaussian likelihood), interleaving the two steps optimizes the hyperparameters as if the model were a GPR.

Method for running Adam optimization on GPR:

In [6]:
def run_adam(model, lr, iterations, callback=None):
    adam = AdamOptimizer(lr).make_optimize_action(model)
    actions = [adam] if callback is None else [adam, callback]
    Loop(actions, stop=iterations)()
    model.anchor(model.enquire_session())

Method for running Adam and natural gradient optimization on VGP. The hyperparameters at the end should match the GPR model.

In [7]:
def run_nat_grads_with_adam(model, lr, gamma, iterations, var_list=None, callback=None):
    # we'll make use of this later when we use a XiTransform
    if var_list is None:
        var_list = [(model.q_mu, model.q_sqrt)]

    # we don't want adam optimizing these
    model.q_mu.set_trainable(False)
    model.q_sqrt.set_trainable(False)

    adam = AdamOptimizer(lr).make_optimize_action(model)
    natgrad = NatGradOptimizer(gamma).make_optimize_action(model, var_list=var_list)
    
    actions = [adam, natgrad]
    actions = actions if callback is None else actions + [callback]

    Loop(actions, stop=iterations)()
    model.anchor(model.enquire_session())

* Optimize GPR with Adam:

In [8]:
run_adam(gpr, learning_rate, iterations, callback=PrintAction(gpr, 'GPR with Adam'))

GPR with Adam: iteration 0 likelihood -230.6706
GPR with Adam: iteration 1 likelihood -230.2508
GPR with Adam: iteration 2 likelihood -229.8303
GPR with Adam: iteration 3 likelihood -229.4093
GPR with Adam: iteration 4 likelihood -228.9876


* Optimize VGP with Adam and NatGrads:

In [9]:
run_nat_grads_with_adam(vgp, learning_rate, 1., iterations, callback=PrintAction(vgp, 'VGP with nat grads with Adam'))

VGP with nat grads with Adam: iteration 0 likelihood -230.6707
VGP with nat grads with Adam: iteration 1 likelihood -230.2508
VGP with nat grads with Adam: iteration 2 likelihood -229.8304
VGP with nat grads with Adam: iteration 3 likelihood -229.4093
VGP with nat grads with Adam: iteration 4 likelihood -228.9877


Compare GPR and VGP lengthscales:

In [10]:
"GPR lengthscales = {0:.4f}, VGP lengthscales = {1:.4f}".format(gpr.kern.lengthscales.read_value(), vgp.kern.lengthscales.read_value())

'GPR lengthscales = 0.9686, VGP lengthscales = 0.9686'
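
The agreement between the two runs can be reproduced on a one-dimensional toy model. The sketch below is not GPflow code and rests on several assumptions: a scalar prior `N(0, k)` whose variance `k` plays the role of the kernel hyperparameter, a Gaussian likelihood with variance `s2`, plain gradient ascent standing in for Adam, and a helper `exact_posterior` standing in for a natural gradient step with gamma = 1 (which, for a Gaussian likelihood, lands on the exact posterior).

```python
def exact_posterior(y, k, s2):
    """Stand-in for a natural gradient step with gamma = 1 on q = N(m, v)."""
    v = 1.0 / (1.0 / k + 1.0 / s2)
    return v * y / s2, v


def elbo_grad_k(m, v, k):
    """Gradient of the ELBO with respect to k, with q = N(m, v) held fixed."""
    return 0.5 * ((v + m * m) / k ** 2 - 1.0 / k)


def log_marginal_grad_k(y, k, s2):
    """Gradient of the exact log marginal likelihood log N(y | 0, k + s2)."""
    return 0.5 * (y * y / (k + s2) ** 2 - 1.0 / (k + s2))


y, s2, lr, steps = 2.0, 0.5, 0.1, 5

# "GPR-like" run: gradient ascent on the exact log marginal likelihood.
k_exact = 1.0
# "VGP-like" run: q starts at the exact posterior, then each iteration takes a
# gradient step on the ELBO followed by a natural gradient step on q.
k_var = 1.0
m, v = exact_posterior(y, k_var, s2)

for i in range(steps):
    k_exact += lr * log_marginal_grad_k(y, k_exact, s2)
    k_var += lr * elbo_grad_k(m, v, k_var)
    m, v = exact_posterior(y, k_var, s2)
    print("iteration {} k_exact {:.6f} k_var {:.6f}".format(i, k_exact, k_var))
```

Because `q` is the exact posterior whenever the hyperparameter gradient is evaluated, the ELBO gradient coincides with the gradient of the log marginal likelihood, so the two hyperparameter trajectories agree up to floating-point error.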

### This also works for the sparse model
Nat grads turn SVGP into SGPR in the Gaussian likelihood case. We can apply the above with hyperparameters, too, though here we'll just do a single step.

In [11]:
svgp = SVGP(X, Y, make_matern_kernel(), gpflow.likelihoods.Gaussian(), Z=Z)
sgpr = SGPR(X, Y, make_matern_kernel(), Z=Z)

for model in svgp, sgpr:
    model.likelihood.variance = 0.1

Analytically optimal sparse model likelihood:

In [12]:
sgpr.compute_log_likelihood()

-281.5616

SVGP likelihood before natural gradient step:

In [13]:
svgp.compute_log_likelihood()

-1404.0805

SVGP likelihood after a single natural gradient optimization step:

In [14]:
NatGradOptimizer(1.0).minimize(svgp, maxiter=1, var_list=[(svgp.q_mu, svgp.q_sqrt)])
svgp.compute_log_likelihood()

-281.5616

### Minibatches
A crucial property of the natural gradient method is that it still works with minibatches, although we need to use a smaller step size gamma.

In [15]:
svgp = SVGP(X, Y, make_matern_kernel(), gpflow.likelihoods.Gaussian(), Z=Z, minibatch_size=50)
svgp.likelihood.variance = 0.1

NatGradOptimizer(gamma=0.1).minimize(svgp, maxiter=100, var_list=[(svgp.q_mu, svgp.q_sqrt)])

Minibatch SVGP likelihood after natural gradient optimization:

In [16]:
np.average([svgp.compute_log_likelihood() for _ in range(1000)])

-281.8616
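
One way to build intuition for the smaller gamma is a scalar toy model, sketched below under stated assumptions: `N` noisy observations of a single latent value, a prior `N(0, k)`, a Gaussian likelihood with variance `s2`, and all names and values chosen for illustration rather than taken from GPflow. In this conjugate setting a step with gamma = 1 would jump to the posterior implied by the current minibatch alone, whereas a smaller gamma averages over minibatches.

```python
import random

random.seed(0)

# Toy data: N noisy observations of one latent value (an illustrative assumption).
N, B = 100, 10         # dataset size and minibatch size
k, s2 = 1.0, 0.25      # prior variance and noise variance
f_true = 0.7
ys = [f_true + random.gauss(0.0, s2 ** 0.5) for _ in range(N)]


def minibatch_natgrad_step(m, v, batch, gamma):
    """One natural gradient step on q = N(m, v) from a minibatch ELBO estimate."""
    scale = N / len(batch)  # rescale the minibatch likelihood term to the full dataset
    d_m = scale * sum(y - m for y in batch) / s2 - m / k
    d_v = -0.5 * N / s2 - 0.5 / k + 0.5 / v
    # Step in the natural parameters along the gradient with respect to the
    # expectation parameters (m, m**2 + v).
    theta1 = m / v + gamma * (d_m - 2.0 * m * d_v)
    theta2 = -0.5 / v + gamma * d_v
    v_new = -0.5 / theta2
    return theta1 * v_new, v_new


m, v = 0.0, 1.0
for _ in range(500):
    m, v = minibatch_natgrad_step(m, v, random.sample(ys, B), gamma=0.1)

# Full-data exact posterior, for comparison.
v_exact = 1.0 / (1.0 / k + N / s2)
m_exact = v_exact * sum(ys) / s2
print("minibatch natgrad: m={:.4f} v={:.6f}".format(m, v))
print("exact posterior:   m={:.4f} v={:.6f}".format(m_exact, v_exact))
```

The variance converges to the exact value because its update does not depend on which points are in the minibatch, while the mean fluctuates around the exact value with a spread that shrinks as gamma is reduced.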

### Comparison with ordinary gradients in the conjugate case

#### (Take home message: natural gradients are always better)

Compared with doing SVGP with ordinary gradients with minibatches, the natural gradient optimizer is much faster in the Gaussian case. 

Here we'll do hyperparameter learning together with optimization of the variational parameters, comparing the interleaved nat grad approach with using ordinary gradients for the hyperparameters and variational parameters jointly.

In [17]:
svgp_ordinary = SVGP(X, Y, make_matern_kernel(), gpflow.likelihoods.Gaussian(), Z=Z, minibatch_size=50)
svgp_nat = SVGP(X, Y, make_matern_kernel(), gpflow.likelihoods.Gaussian(), Z=Z, minibatch_size=50)

# ordinary gradients and Adam
AdamOptimizer(learning_rate).minimize(svgp_ordinary, maxiter=iterations)

# NatGrads with Adam
run_nat_grads_with_adam(svgp_nat, learning_rate, 0.1, iterations)

SVGP likelihood after ordinary _Adam optimization_:

In [18]:
np.average([svgp_ordinary.compute_log_likelihood() for _ in range(1000)])

-307.0880

SVGP likelihood after _NatGrad + Adam optimization_:

In [19]:
np.average([svgp_nat.compute_log_likelihood() for _ in range(1000)])

-234.1927
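
The same qualitative gap shows up in a scalar toy comparison. The sketch below is an illustration under assumptions, not a reproduction of the experiment above: it uses a single observation, keeps the hyperparameters fixed, and replaces Adam with plain gradient ascent on the mean and standard deviation of `q`. Only the step sizes (0.01 and 0.1) and the five iterations are borrowed from the cells above.

```python
import math

y, k, s2 = 2.0, 1.0, 0.5  # observation, prior variance, noise variance (assumed values)


def elbo(m, v):
    """ELBO for prior N(0, k), likelihood N(y | f, s2) and q(f) = N(m, v)."""
    expected_loglik = -0.5 * math.log(2.0 * math.pi * s2) - ((y - m) ** 2 + v) / (2.0 * s2)
    kl = 0.5 * (v / k + m * m / k - 1.0 + math.log(k / v))
    return expected_loglik - kl


def grads(m, v):
    """Gradients of the ELBO with respect to the mean m and variance v."""
    return (y - m) / s2 - m / k, -0.5 / s2 - 0.5 / k + 0.5 / v


def ordinary_step(m, s, lr):
    """Plain gradient ascent on (m, s), where s is the standard deviation of q."""
    d_m, d_v = grads(m, s * s)
    return m + lr * d_m, s + lr * 2.0 * s * d_v


def natgrad_step(m, v, gamma):
    """Natural gradient step taken in the natural parameters of q."""
    d_m, d_v = grads(m, v)
    theta1 = m / v + gamma * (d_m - 2.0 * m * d_v)
    theta2 = -0.5 / v + gamma * d_v
    v_new = -0.5 / theta2
    return theta1 * v_new, v_new


m_ord, s_ord = 0.0, 1.0
m_nat, v_nat = 0.0, 1.0
for _ in range(5):
    m_ord, s_ord = ordinary_step(m_ord, s_ord, lr=0.01)
    m_nat, v_nat = natgrad_step(m_nat, v_nat, gamma=0.1)

elbo_ord = elbo(m_ord, s_ord ** 2)
elbo_nat = elbo(m_nat, v_nat)
print("ordinary gradients: {:.4f}".format(elbo_ord))
print("natural gradients:  {:.4f}".format(elbo_nat))
```

With these settings the natural gradient run ends with the higher ELBO after five iterations. The comparison depends on the chosen step sizes, so it should be read as an illustration of the trend rather than a benchmark.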

### Comparison with ordinary gradients in the non-conjugate case

#### (Take home message: natural gradients are usually better)

We can use nat grads even when the likelihood isn't Gaussian. They aren't guaranteed to be better, but they usually are in practical situations.

In [20]:
Y_binary = np.random.choice([1., -1], size=X.shape)

vgp_bernoulli = VGP(X, Y_binary, make_matern_kernel(), gpflow.likelihoods.Bernoulli())
vgp_bernoulli_natgrads = VGP(X, Y_binary, make_matern_kernel(), gpflow.likelihoods.Bernoulli())

# ordinary gradients and Adam
AdamOptimizer(learning_rate).minimize(vgp_bernoulli, maxiter=iterations)

# nat grads with Adam 
run_nat_grads_with_adam(vgp_bernoulli_natgrads, learning_rate, 0.1, iterations)

VGP likelihood after ordinary *Adam optimization*:

In [21]:
vgp_bernoulli.compute_log_likelihood()

-186.4464

VGP likelihood after combination of optimizers, *NatGrad + Adam*:

In [22]:
vgp_bernoulli_natgrads.compute_log_likelihood()

-146.8059

We can also choose to run natural gradients in another parameterization. A 
sensible choice is the model parameters (q_mu, q_sqrt), which is already available in GPflow as XiSqrtMeanVar.

In [None]:
vgp_bernoulli_natgrads_xi = VGP(X, Y_binary, make_matern_kernel(), gpflow.likelihoods.Bernoulli())

var_list = [(vgp_bernoulli_natgrads_xi.q_mu, vgp_bernoulli_natgrads_xi.q_sqrt, XiSqrtMeanVar())]
run_nat_grads_with_adam(vgp_bernoulli_natgrads_xi, learning_rate, 0.01, iterations, var_list=var_list)

VGP likelihood after NatGrads with XiSqrtMeanVar + Adam optimization:

In [None]:
vgp_bernoulli_natgrads_xi.compute_log_likelihood()

With sufficiently small steps, it shouldn't make a difference which transform is used, but for large 
steps this can make a difference in practice.
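
The dependence on step size can be illustrated with a scalar Gaussian `q`. The sketch below is not GPflow's implementation of XiSqrtMeanVar. It assumes a toy model with a Gaussian likelihood (so that the gradients have a closed form) and compares a natural gradient step taken in the natural parameters with one taken in (mean, standard deviation), where the ordinary gradient is preconditioned by the inverse Fisher matrix. All names and values are illustrative.

```python
y, k, s2 = 2.0, 1.0, 0.5  # observation, prior variance, noise variance (assumed values)


def grads(m, v):
    """Gradients of the toy ELBO with respect to the mean m and variance v of q."""
    return (y - m) / s2 - m / k, -0.5 / s2 - 0.5 / k + 0.5 / v


def step_natural(m, v, gamma):
    """Natural gradient step in the natural parameters (m / v, -1 / (2 v))."""
    d_m, d_v = grads(m, v)
    theta1 = m / v + gamma * (d_m - 2.0 * m * d_v)
    theta2 = -0.5 / v + gamma * d_v
    v_new = -0.5 / theta2
    return theta1 * v_new, v_new


def step_mean_sqrt(m, v, gamma):
    """Natural gradient step in (m, s) with s = sqrt(v)."""
    s = v ** 0.5
    d_m, d_v = grads(m, v)
    d_s = 2.0 * s * d_v
    # The Fisher information of N(m, s**2) in (m, s) is diag(1 / s**2, 2 / s**2),
    # so its inverse rescales the gradients by s**2 and s**2 / 2.
    m_new = m + gamma * s * s * d_m
    s_new = s + gamma * 0.5 * s * s * d_s
    return m_new, s_new * s_new


def gap(gamma):
    """Largest difference between the two updated (mean, variance) pairs."""
    m_a, v_a = step_natural(0.0, 1.0, gamma)
    m_b, v_b = step_mean_sqrt(0.0, 1.0, gamma)
    return max(abs(m_a - m_b), abs(v_a - v_b))


print("gamma = 0.5:   gap = {:.6f}".format(gap(0.5)))
print("gamma = 0.001: gap = {:.6f}".format(gap(0.001)))
```

For the small step the two parameterizations produce nearly the same `q`, while for the large step they end up far apart.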