## Autodiff

In this short notebook we'll look at the automatic differentiation provided by TensorFlow.

**Autodiff** means automatic differentiation. In other words, when writing neural networks in TensorFlow (or in Keras, which is a user-interface layer on top of TF), we just need to write the network. The gradients (i.e. derivatives) needed for back-propagation are calculated automatically.

To use TensorFlow or Keras (or PyTorch, which provides the same facility), we don't really need to know how this works internally. So this notebook is optional, and we're not covering it directly in lectures. But it is interesting, and useful for understanding these libraries at a deeper level.

In [1]:
import tensorflow as tf

Let's suppose we have a very simple linear model:

$\hat{y} = wx$

where $w$ and $x$ are just scalars, but $w$ is a parameter of the model and $x$ is some input data.

Let's suppose further that we will use a squared error loss:

$$L = (y - \hat{y})^2 = y^2 - 2 y \hat{y} + \hat{y}^2 = y^2 - 2 y w x + w^2x^2$$

where again $y$ is just a scalar. (We didn't bother writing $\sum$ because we'll only consider one training case $(x, y)$, and remember, everything is a scalar.)

Now in order to minimise the loss $L$ by optimising the weight $w$ of this model, we need the gradient of $L$ with respect to $w$: $$\frac{dL}{dw} = -2xy + 2wx^2$$

**Exercise**: check the derivative above.
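
One way to sanity-check a hand-derived gradient, besides redoing the algebra, is to compare it with a finite-difference approximation. The sketch below does this in plain Python. The helper names (`loss`, `analytic_grad`, `numeric_grad`) and the test point are illustrative assumptions, not part of the notebook.

```python
def loss(w, x, y):
    """Squared error loss for the scalar linear model yhat = w * x."""
    return (y - w * x) ** 2


def analytic_grad(w, x, y):
    """The hand-derived gradient dL/dw = -2xy + 2wx^2."""
    return -2 * x * y + 2 * w * x ** 2


def numeric_grad(w, x, y, h=1e-6):
    """Central finite-difference approximation of dL/dw."""
    return (loss(w + h, x, y) - loss(w - h, x, y)) / (2 * h)


# A hypothetical test point, chosen arbitrarily for this check
w0, x0, y0 = 0.5, 1.5, -1.0
print(analytic_grad(w0, x0, y0), numeric_grad(w0, x0, y0))
```

If the derivation is right, the two printed values should agree to several decimal places at any test point.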

Let's choose some arbitrary values for $y$, $w$, and $x$, and then use TensorFlow to calculate the gradient.

In [2]:
y = 2.0 # target y value
x = 3.0 # input data
w = tf.Variable(1.) # initial value of the weight w, stored as a Variable

Notice we have wrapped $w$ up in a `tf.Variable`. This tells TensorFlow that it is a parameter of the model, whose value can change during training. A trainable `tf.Variable` is also tracked automatically by the gradient tape.

Now we're going to use `tf.GradientTape`. This is the object TF uses to record computations whose gradients will be needed. **In typical TF code, the gradient tape stays behind the scenes: we don't usually interact with it directly.**

In [3]:
with tf.GradientTape() as tape: # set up the tape
    tape.watch(w) # optional for a trainable tf.Variable (watched automatically); required for plain tensors
    yhat = x * w # run the model
    L = (y - yhat) ** 2 # calculate the loss
dL_dw = tape.gradient(L, w) # find the gradient dL/dw, after leaving the tape context
print(dL_dw)

tf.Tensor(6.0, shape=(), dtype=float32)


**Exercise**: this has printed the value $\frac{dL}{dw} = 6.0$. Look back at our calculation for $\frac{dL}{dw}$: is the result 6.0 correct?
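
As an aside on the `tape.watch(w)` call above: the following sketch illustrates when an explicit `watch` matters. It assumes TensorFlow 2.x in eager mode. The names `w_var`, `w_const`, and the `grad_*` variables are introduced here purely for illustration.

```python
import tensorflow as tf

x = 3.0  # input data
y = 2.0  # target value

# Case 1: a trainable tf.Variable is watched by the tape automatically.
w_var = tf.Variable(1.0)
with tf.GradientTape() as tape:
    loss = (y - x * w_var) ** 2
grad_var = tape.gradient(loss, w_var)

# Case 2: a constant tensor is not watched, so no gradient is available.
w_const = tf.constant(1.0)
with tf.GradientTape() as tape:
    loss = (y - x * w_const) ** 2
grad_unwatched = tape.gradient(loss, w_const)

# Case 3: the same constant tensor, explicitly watched.
with tf.GradientTape() as tape:
    tape.watch(w_const)
    loss = (y - x * w_const) ** 2
grad_watched = tape.gradient(loss, w_const)

print(grad_var, grad_unwatched, grad_watched)
```

In the second case the gradient comes back as `None`, which is a common source of confusion when a parameter has accidentally become a plain tensor.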

Now let's convince ourselves a little bit more. We'll actually use the gradient to optimise.

By looking at the initial numbers, we should see that the optimum value for $w$ is $w=2/3$ (**Exercise**: check this). So, let's set up a learning rate and an optimisation loop which uses the gradient calculation.

In [4]:
lr = 0.01 # learning rate
tolerance = 0.001

y = 2.0 # target y value
x = 3.0 # input data
w = tf.Variable(1.) # initial value of the weight w, stored as a Variable


while True:
    with tf.GradientTape() as tape: # a fresh tape for each forward pass
        yhat = x * w # w is a tf.Variable, so the tape watches it automatically
        L = (y - yhat) ** 2
    dL_dw = tape.gradient(L, w)
    w.assign_sub(lr * dL_dw) # take one learning step in place, so w remains a tf.Variable
    print(f"{w.numpy():.3f} {L:.3f} {dL_dw:.3f}")
    if tf.abs(L) < tolerance: # if the loss is small, we quit
        break

0.940 1.000 6.000
0.891 0.672 4.920
0.850 0.452 4.034
0.817 0.304 3.308
0.790 0.204 2.713
0.768 0.137 2.224
0.750 0.092 1.824
0.735 0.062 1.496
0.723 0.042 1.226
0.712 0.028 1.006
0.704 0.019 0.825
0.697 0.013 0.676
0.692 0.009 0.555
0.687 0.006 0.455
0.684 0.004 0.373
0.681 0.003 0.306
0.678 0.002 0.251
0.676 0.001 0.206
0.674 0.001 0.169


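Each output row shows the updated $w$, followed by the loss and the gradient that were computed just before that update. We see that $w$ approaches the correct value $2/3 \approx 0.667$, while the loss and the gradient shrink towards zero. So, it looks like things are correct!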
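
As a cross-check on the loop above, the same gradient descent can be written in plain Python, using the hand-derived formula $\frac{dL}{dw} = -2xy + 2wx^2$ in place of the tape. This is only a sketch for comparison; the counter `steps` is an addition that does not appear in the notebook's loop.

```python
lr = 0.01          # learning rate
tolerance = 0.001  # stop once the loss falls below this

x, y = 3.0, 2.0    # input data and target, as above
w = 1.0            # initial weight, here a plain Python float

steps = 0
while True:
    L = (y - x * w) ** 2                  # loss at the current w
    dL_dw = -2 * x * y + 2 * w * x ** 2   # hand-derived gradient
    w = w - lr * dL_dw                    # one gradient descent step
    steps += 1
    print(f"{w:.3f} {L:.3f} {dL_dw:.3f}")
    if L < tolerance:
        break
```

If autodiff and the hand derivation agree, this should trace the same sequence of values as the TensorFlow version, up to small floating-point differences.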

**Exercise**: Try setting `lr` larger, e.g. `lr = 0.2`. What happens?

**Reminder**: typically, when writing Keras code using `Sequential`, we don't have to think about the gradient tape or autodiff. We just specify the loss for our model, and Keras sets up the gradient tape correctly.

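**Note**: It may seem ugly to create a new tape in every iteration, as we do above. However, by default a tape releases its resources as soon as `gradient()` has been called, so a fresh tape per forward pass is the usual pattern. It is done this way e.g. in the TensorFlow guide: https://www.tensorflow.org/guide/core/logistic_regression_core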
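
To get a feel for why one tape per forward pass is natural, here is a toy sketch of the idea in plain Python: a tape records each operation of the forward pass together with its local derivatives, then walks the record backwards applying the chain rule. The `ToyTape` and `Value` classes are invented for this illustration and bear no relation to how TensorFlow is actually implemented.

```python
class Value:
    """A scalar wrapper, so that each intermediate result has its own identity."""

    def __init__(self, data):
        self.data = data


class ToyTape:
    """Toy reverse-mode autodiff for scalars (illustrative only)."""

    def __init__(self):
        # Each entry: (output, [(input, d_output/d_input), ...])
        self.ops = []

    def mul(self, a, b):
        out = Value(a.data * b.data)
        self.ops.append((out, [(a, b.data), (b, a.data)]))
        return out

    def sub(self, a, b):
        out = Value(a.data - b.data)
        self.ops.append((out, [(a, 1.0), (b, -1.0)]))
        return out

    def square(self, a):
        out = Value(a.data ** 2)
        self.ops.append((out, [(a, 2.0 * a.data)]))
        return out

    def gradient(self, target, source):
        """Walk the recorded operations backwards, accumulating the chain rule."""
        grads = {id(target): 1.0}  # d target / d target = 1
        for out, inputs in reversed(self.ops):
            g_out = grads.get(id(out), 0.0)
            for inp, local in inputs:
                grads[id(inp)] = grads.get(id(inp), 0.0) + g_out * local
        return grads.get(id(source), 0.0)


# The same model and loss as above: yhat = x * w, L = (y - yhat)^2
tape = ToyTape()
x, y, w = Value(3.0), Value(2.0), Value(1.0)
yhat = tape.mul(x, w)
L = tape.square(tape.sub(y, yhat))
dL_dw = tape.gradient(L, w)
print(dL_dw)
```

Because the recorded local derivatives depend on the current value of `w`, the record from one forward pass is of no use once `w` has been updated, which is one way to see why a new tape is created in each iteration.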
