In [1]:
%matplotlib inline
import itertools
import os
os.environ['CUDA_VISIBLE_DEVICES']=""
import numpy as np
import gpflow
import gpflow.training.monitor as mon
import numbers
import matplotlib.pyplot as plt
import tensorflow as tf

np.random.seed(0)
X = np.random.rand(10000, 1) * 10
Y = np.sin(X) + np.random.randn(*X.shape)
Xt = np.random.rand(10000, 1) * 10
Yt = np.sin(Xt) + np.random.randn(*Xt.shape)



# Demo: `gpflow.training.monitor`
In this notebook we demonstrate how to use `gpflow.training.monitor` to log the optimisation of a GPflow model. The example should cover most common use cases.

## Creating the GPflow model
We first create the GPflow model. Under the hood, GPflow gives each model a unique name containing a random identifier, and uses that name for the Variables it creates in the TensorFlow graph. This is useful in interactive sessions, where people may create several models, because it prevents variables with the same name from conflicting. However, when loading a model, the names of all the variables must be exactly the same as in the checkpoint. This is why we pass `name="SVGP"` to the model constructor, and why we use `gpflow.defer_build()`.
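To see why the fixed name matters, here is a minimal pure-Python sketch of name-based restoring. It is an illustration only, not GPflow or TensorFlow code: the helpers `make_variables` and `restore`, and the variable names such as `SVGP/kern/lengthscales`, are hypothetical stand-ins.

```python
import uuid


def make_variables(model_name=None):
    """Build a toy 'graph': a dict of variable values keyed by variable name."""
    # Assumption for illustration: an unnamed model gets a random identifier
    # in its name, so every new instance produces different variable names.
    prefix = model_name if model_name is not None else "SVGP-" + uuid.uuid4().hex[:8]
    return {prefix + "/kern/lengthscales": 1.0,
            prefix + "/likelihood/variance": 0.01}


def restore(variables, checkpoint):
    """Copy checkpoint values into `variables`, matching purely by name."""
    missing = sorted(set(variables) - set(checkpoint))
    if missing:
        raise KeyError("not found in checkpoint: " + ", ".join(missing))
    for name in variables:
        variables[name] = checkpoint[name]
    return variables


# A checkpoint written by a model that was explicitly named "SVGP".
checkpoint = make_variables("SVGP")
checkpoint["SVGP/kern/lengthscales"] = 2.5

# A freshly built model with the same fixed name restores cleanly.
named = restore(make_variables("SVGP"), checkpoint)

# A model with a randomly generated name cannot find its variables.
try:
    restore(make_variables(), checkpoint)
    restored_unnamed = True
except KeyError:
    restored_unnamed = False
```

In this toy setting, `named` picks up the stored length scale, while the unnamed model fails because none of its variable names appear in the checkpoint.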

In [2]:
with gpflow.defer_build():
    kernel = gpflow.kernels.RBF(1)
    likelihood = gpflow.likelihoods.Gaussian()
    Z = np.linspace(0, 10, 5)[:, None]
    m = gpflow.models.SVGP(X, Y, kern=kernel, likelihood=likelihood, Z=Z, minibatch_size=100, name="SVGP")
    m.likelihood.variance = 0.01

m.compile()

In [3]:
m.compute_log_likelihood()

-1271605.6219440382

## Setting up the optimisation
Next we need to set up the optimisation process. `gpflow.training.monitor` provides classes that manage the optimisation and perform certain logging tasks. In this example, we want to:
- log certain scalar parameters in TensorBoard,
- log the full optimisation objective (log marginal likelihood bound) periodically, even though we optimise with minibatches,
- store a backup of the optimisation process periodically,
- log performance for a test set periodically.
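The second item in the list deserves a short illustration. The sketch below assumes the bound decomposes into a sum of per-point terms minus a KL term, so that a single minibatch gives a rescaled, noisy estimate, while averaging the estimates over one full pass of equally sized minibatches recovers the full objective. The function names and the toy numbers are hypothetical, and this is not GPflow's implementation.

```python
def minibatch_estimate(terms, batch_indices, kl):
    """Estimate sum(terms) - kl from one minibatch by rescaling the batch sum."""
    scale = len(terms) / len(batch_indices)
    return scale * sum(terms[i] for i in batch_indices) - kl


def full_objective(terms, batch_size, kl):
    """Average the minibatch estimates over one pass through the data.

    Assumes len(terms) is a multiple of batch_size, so all batches have
    equal weight and the average equals the exact objective.
    """
    n = len(terms)
    estimates = [minibatch_estimate(terms, range(start, start + batch_size), kl)
                 for start in range(0, n, batch_size)]
    return sum(estimates) / len(estimates)


# Toy per-point contributions and KL term (made-up numbers).
terms = [-0.5 * (i % 7) for i in range(1000)]
kl = 3.0

exact = sum(terms) - kl
noisy = minibatch_estimate(terms, range(0, 100), kl)  # first minibatch only
full = full_objective(terms, 100, kl)
```

Here `noisy` differs from `exact`, which is why logging only the minibatch objective can be misleading, whereas `full` matches `exact` at the cost of a pass over all the data.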

Because of the integration with TensorFlow ways of storing and logging, we will need to perform a few TensorFlow manipulations outside of GPflow as well.

We start by creating the `global_step` variable. This is not strictly required by TensorFlow optimisers, but they do all have support for it. Its purpose is to track how many optimisation steps have occurred. It is useful to keep this in a TensorFlow variable as this allows it to be restored together with all the parameters of the model.

In [4]:
global_step = tf.Variable(0, trainable=False, name="global_step")
m.enquire_session().run(global_step.initializer)

Next, we create the optimiser action. `make_optimize_action` also creates the optimisation tensor, which is added to the computational graph. Later, the saver will store all the variables in the graph, including those that belong to the optimiser, and so can also restore the exact optimiser state.

In [5]:
adam = gpflow.train.AdamOptimizer(0.01).make_optimize_action(m, global_step=global_step)

## Creating actions for keeping track of the optimisation
We now create an instance of `FileWriter`, which will save the TensorBoard logs to a file. This object needs to be shared between all TensorBoard actions (such as `mon.ModelTensorBoard`) if they are to write to the same path.

In [6]:
# create a filewriter for summaries
fw = tf.summary.FileWriter('./model_tensorboard', m.graph)

Now that the TensorFlow side is set up, we can focus on the `monitor` part. Each part of the monitoring process is taken care of by an `Action`, which is something that needs to be run periodically during the optimisation. The first parameter of every action is a generator returning the times (either in iterations or in wall-clock time) at which the action needs to be run. The second parameter determines whether a number of iterations (`Trigger.ITER`) or an amount of wall-clock time (`Trigger.TOTAL_TIME`) triggers the `Action`. Of the following `Action`s, `print_lml` and `sleep` run on every iteration, while the others run once every 10 or 100 iterations.
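To make the role of the generator concrete, here is a small pure-Python sketch of an iteration-based schedule. It assumes that an action fires as soon as the iteration counter reaches the next value drawn from the sequence; `firing_iterations` is a hypothetical helper written for this illustration, not part of `gpflow.training.monitor`.

```python
import itertools


def firing_iterations(sequence, n_iterations):
    """Return the iterations in [0, n_iterations) at which an action would fire.

    Assumption: the action fires once the iteration counter has reached the
    next value produced by `sequence`, then waits for the following value.
    """
    fired = []
    next_due = next(sequence)
    for iteration in range(n_iterations):
        if iteration >= next_due:
            fired.append(iteration)
            # Advance past every threshold that has already been reached.
            while next_due <= iteration:
                next_due = next(sequence)
    return fired


every_iteration = firing_iterations(itertools.count(), 5)
every_tenth = firing_iterations(itertools.count(step=10), 35)
```

Under this assumption `itertools.count()` triggers on every iteration and `itertools.count(step=10)` on iterations 0, 10, 20 and 30. With a wall-clock trigger the same sequence would be compared against elapsed time rather than the iteration counter.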

In [7]:
print_lml = mon.PrintTimings(itertools.count(), mon.Trigger.ITER, single_line=True, global_step=global_step)
sleep = mon.SleepAction(itertools.count(), mon.Trigger.ITER, 0.01)
saver = mon.StoreSession(itertools.count(step=10), mon.Trigger.ITER, m.enquire_session(),
                         hist_path="./monitor-saves/checkpoint", global_step=global_step)
tensorboard = mon.ModelTensorBoard(itertools.count(step=10), mon.Trigger.ITER, m, fw, global_step=global_step)
lml_tensorboard = mon.LmlTensorBoard(itertools.count(step=100), mon.Trigger.ITER, m, fw, global_step=global_step)

The optimisation step is also encapsulated in an `Action`, in this case the `adam` variable which we created earlier. We place all actions in a list in the order that they should be executed.

In [8]:
actions = [adam, print_lml, tensorboard, lml_tensorboard, saver, sleep]
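The order of the list matters because each iteration walks through the actions from first to last. The toy loop below shows the idea with plain callables; `run_loop`, `optimise` and `record` are hypothetical names used only for this sketch and do not reflect how `gpflow.actions.Loop` is implemented.

```python
def run_loop(actions, stop):
    """Call every action once per iteration, in list order."""
    state = {"step": 0, "log": []}
    for _ in range(stop):
        for action in actions:
            action(state)
    return state


def optimise(state):
    # Stand-in for the optimiser action: one step of training.
    state["step"] += 1


def record(state):
    # Stand-in for a logging action: note the step count it observes.
    state["log"].append(state["step"])


optimise_first = run_loop([optimise, record], stop=3)["log"]
record_first = run_loop([record, optimise], stop=3)["log"]
```

With the optimiser first, the logger sees the state after each update (`[1, 2, 3]`); with the logger first, it sees the state before it (`[0, 1, 2]`). Placing `adam` at the head of the list therefore means the logging actions observe the freshly updated model.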

## Custom `Action`s
We may also want to perform certain tasks that do not have pre-defined `Action` classes. For example, we may want to compute the performance on a test set. Here we create such a class by extending `ModelTensorBoard` to log the testing benchmarks in addition to all the scalar parameters.

In [9]:
class TestTensorBoard(mon.ModelTensorBoard):
    def __init__(self, sequence, trigger: mon.Trigger, model, file_writer, Xt, Yt, *, global_step=global_step):
        super().__init__(sequence, trigger, model, file_writer, global_step=global_step)
        self.Xt = Xt
        self.Yt = Yt
        self._full_test_err = tf.placeholder(gpflow.settings.tf_float, shape=())
        self._full_test_nlpp = tf.placeholder(gpflow.settings.tf_float, shape=())
        self.summary = tf.summary.merge([tf.summary.scalar("test_rmse", self._full_test_err),
                                         tf.summary.scalar("test_nlpp", self._full_test_nlpp)])

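    def run(self, ctx):
        minibatch_size = 100
        # Predict in minibatches; -(-n // b) is ceiling division, so a final partial batch is included.
        n_batches = -(-len(self.Xt) // minibatch_size)
        preds = np.vstack([self.model.predict_y(self.Xt[mb * minibatch_size:(mb + 1) * minibatch_size, :])[0]
                           for mb in range(n_batches)])
        # Root-mean-square error on the stored test set (not the notebook-level globals).
        test_err = np.mean((self.Yt - preds) ** 2.0) ** 0.5
        # The NLPP is not computed in this demo, so a constant 0.0 is logged for it.
        summary, step = self.model.enquire_session().run([self.summary, self.global_step],
                                      feed_dict={self._full_test_err: test_err,
                                                 self._full_test_nlpp: 0.0})
        self.file_writer.add_summary(summary, step)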
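The `run` method above batches the test set with ceiling division and logs a constant `0.0` for `test_nlpp`. As a rough, library-free sketch, the helpers below show the batching arithmetic, the RMSE, and one possible definition of the NLPP as the mean negative log density under independent Gaussian predictions. The function names are hypothetical, and the Gaussian form is an assumption that matches the Gaussian likelihood used in this notebook.

```python
import math


def batch_slices(n_points, batch_size):
    """Slices covering n_points, including a final partial batch."""
    n_batches = -(-n_points // batch_size)  # ceiling division
    return [slice(b * batch_size, (b + 1) * batch_size) for b in range(n_batches)]


def rmse(targets, means):
    """Root-mean-square error between targets and predicted means."""
    squared = [(y - m) ** 2 for y, m in zip(targets, means)]
    return math.sqrt(sum(squared) / len(squared))


def gaussian_nlpp(targets, means, variances):
    """Mean negative log predictive probability under independent Gaussians."""
    total = 0.0
    for y, m, v in zip(targets, means, variances):
        total += 0.5 * math.log(2.0 * math.pi * v) + 0.5 * (y - m) ** 2 / v
    return total / len(targets)


slices = batch_slices(250, 100)          # 3 batches: 100, 100 and 50 points
error = rmse([0.0, 0.0], [3.0, 4.0])     # sqrt((9 + 16) / 2)
nlpp = gaussian_nlpp([0.0], [0.0], [1.0])
```

Python slicing clips the last slice to the end of the array, which is why the final partial batch needs no special handling.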

We now add the custom `TestTensorBoard` to the list which will be run later.

In [10]:
actions.append(TestTensorBoard(itertools.count(step=100), mon.Trigger.ITER, m, fw, Xt, Yt, global_step=global_step))

## Running the optimisation
We finally get to running the optimisation. The second time this is run, the session should be restored from a checkpoint created by `StoreSession`. This is important to ensure that the optimiser resumes from _exactly_ the state in which it left off. If this is not done correctly, models may start diverging after loading.
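The following toy example illustrates why the full state matters. It uses heavy-ball gradient descent on `f(x) = x**2` as a stand-in for Adam, with a plain dict playing the role of the checkpoint; none of this is GPflow code, and the names are made up for the sketch.

```python
def momentum_steps(state, n_steps, lr=0.1, beta=0.9):
    """Run heavy-ball gradient descent on f(x) = x**2 and return the new state."""
    x, velocity, step = state["x"], state["velocity"], state["step"]
    for _ in range(n_steps):
        grad = 2.0 * x
        velocity = beta * velocity - lr * grad
        x = x + velocity
        step += 1
    return {"x": x, "velocity": velocity, "step": step}


start = {"x": 1.0, "velocity": 0.0, "step": 0}

# Reference: 20 steps without interruption.
uninterrupted = momentum_steps(start, 20)

# Checkpoint after 10 steps, storing parameters, optimiser state and step count.
checkpoint = momentum_steps(start, 10)
resumed = momentum_steps(dict(checkpoint), 10)

# Restoring only the parameters resets the optimiser state and the step counter.
params_only = {"x": checkpoint["x"], "velocity": 0.0, "step": 0}
partial = momentum_steps(params_only, 10)
```

Resuming from the full checkpoint reproduces the uninterrupted run exactly, including the step count, whereas restoring only the parameters sends the optimiser down a different trajectory.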

In [11]:
gpflow.actions.Loop(actions, stop=500)()

 29%|██▉       | 29/100 [00:00<00:00, 279.31it/s]

0, 1:	0.00 optimisation iter/s	0.00 total iter/s	0.00 last iter/s


100%|██████████| 100/100 [00:00<00:00, 355.36it/s]


Full lml: -1186370.994050 (-1.19e+06)
89, 90:	nan optimisation iter/s	33.06 total iter/s	90.61 last iter/s

  0%|          | 0/100 [00:00<?, ?it/s]

90, 91:	nan optimisation iter/s	33.29 total iter/s	90.12 last iter/s
91, 92:	nan optimisation iter/s	32.51 total iter/s	10.42 last iter/s
92, 93:	nan optimisation iter/s	32.74 total iter/s	89.13 last iter/s
93, 94:	nan optimisation iter/s	32.96 total iter/s	89.89 last iter/s
94, 95:	nan optimisation iter/s	33.18 total iter/s	91.28 last iter/s
95, 96:	nan optimisation iter/s	33.40 total iter/s	90.57 last iter/s
96, 97:	nan optimisation iter/s	33.62 total iter/s	90.57 last iter/s
97, 98:	nan optimisation iter/s	33.84 total iter/s	91.35 last iter/s
98, 99:	nan optimisation iter/s	34.05 total iter/s	91.42 last iter/s
99, 100:	nan optimisation iter/s	34.27 total iter/s	90.89 last iter/s
100, 101:	nan optimisation iter/s	34.48 total iter/s	90.26 last iter/s


100%|██████████| 100/100 [00:00<00:00, 381.38it/s]


Full lml: -281692.419903 (-2.82e+05)
193, 194:	nan optimisation iter/s	36.93 total iter/s	90.29 last iter/s

 37%|███▋      | 37/100 [00:00<00:00, 362.94it/s]

194, 195:	nan optimisation iter/s	37.04 total iter/s	88.86 last iter/s
195, 196:	nan optimisation iter/s	37.15 total iter/s	86.44 last iter/s
196, 197:	nan optimisation iter/s	37.26 total iter/s	90.44 last iter/s
197, 198:	nan optimisation iter/s	37.37 total iter/s	89.87 last iter/s
198, 199:	nan optimisation iter/s	37.47 total iter/s	88.02 last iter/s
199, 200:	nan optimisation iter/s	37.58 total iter/s	89.37 last iter/s
200, 201:	nan optimisation iter/s	37.69 total iter/s	89.65 last iter/s


100%|██████████| 100/100 [00:00<00:00, 396.26it/s]


Full lml: -165358.931919 (-1.65e+05)
290, 291:	nan optimisation iter/s	38.64 total iter/s	89.63 last iter/s

  0%|          | 0/100 [00:00<?, ?it/s]

291, 292:	nan optimisation iter/s	38.29 total iter/s	10.60 last iter/s
292, 293:	nan optimisation iter/s	38.37 total iter/s	89.59 last iter/s
293, 294:	nan optimisation iter/s	38.44 total iter/s	91.39 last iter/s
294, 295:	nan optimisation iter/s	38.52 total iter/s	91.16 last iter/s
295, 296:	nan optimisation iter/s	38.59 total iter/s	86.32 last iter/s
296, 297:	nan optimisation iter/s	38.66 total iter/s	88.99 last iter/s
297, 298:	nan optimisation iter/s	38.74 total iter/s	90.34 last iter/s
298, 299:	nan optimisation iter/s	38.81 total iter/s	88.53 last iter/s
299, 300:	nan optimisation iter/s	38.88 total iter/s	90.25 last iter/s
300, 301:	nan optimisation iter/s	38.96 total iter/s	90.23 last iter/s


100%|██████████| 100/100 [00:00<00:00, 388.40it/s]


Full lml: -113183.199641 (-1.13e+05)
393, 394:	nan optimisation iter/s	39.32 total iter/s	91.75 last iter/s

 45%|████▌     | 45/100 [00:00<00:00, 444.50it/s]

394, 395:	nan optimisation iter/s	39.37 total iter/s	90.52 last iter/s
395, 396:	nan optimisation iter/s	39.43 total iter/s	90.23 last iter/s
396, 397:	nan optimisation iter/s	39.48 total iter/s	90.43 last iter/s
397, 398:	nan optimisation iter/s	39.54 total iter/s	92.77 last iter/s
398, 399:	nan optimisation iter/s	39.60 total iter/s	91.26 last iter/s
399, 400:	nan optimisation iter/s	39.65 total iter/s	90.97 last iter/s
400, 401:	nan optimisation iter/s	39.71 total iter/s	90.68 last iter/s


100%|██████████| 100/100 [00:00<00:00, 397.48it/s]


Full lml: -84186.935655 (-8.42e+04)
499, 500:	nan optimisation iter/s	39.94 total iter/s	89.79 last iter/s