In [1]:
import os
# Hide all GPUs so that TensorFlow runs on the CPU. This must be set before importing TensorFlow.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

import tensorflow as tf

Part of the training process requires calculating derivatives that involve tensors. So let's learn about TensorFlow's built-in [automatic differentiation](https://www.tensorflow.org/guide/autodiff) engine, using a very simple example. Consider the following two tensors:

$$
\begin{align}
  U =
  \begin{bmatrix}
    1 & 2
  \end{bmatrix}
  &&
  V =
  \begin{bmatrix}
    3 & 4 \\
    5 & 6
  \end{bmatrix}
\end{align}
$$

Now suppose that you want to multiply $U$ by $V$ and then sum all the values in the resulting tensor, so that the result is a scalar. In math notation, you can write this as the following scalar function $f$:

$$
f(U, V) = \mathrm{sum} (U \, V) = \sum_j \sum_i u_i \, v_{ij}
$$

Your goal is to calculate the derivative of $f$ with respect to each of its inputs: $\frac{\partial f}{\partial U}$ and $\frac{\partial f}{\partial V}$. Start by creating the two tensors $U$ and $V$. Then create a [tf.GradientTape](https://www.tensorflow.org/guide/autodiff#gradient_tapes), and tell TensorFlow to watch for mathematical operations involving $U$ and $V$, recording those operations onto your *tape*. The tape then lets you calculate the derivatives of the function $f$ with respect to $U$ and $V$.

In [None]:
# Decimal points in tensor values ensure they are floats, which automatic differentiation requires.
U = tf.constant([[1., 2.]])
V = tf.constant([[3., 4.], [5., 6.]])

# persistent=True allows tape.gradient() to be called more than once on the same tape.
with tf.GradientTape(persistent=True) as tape:
  tape.watch(U)
  tape.watch(V)
  W = tf.matmul(U, V)
  f = tf.math.reduce_sum(W)

print(tape.gradient(f, U)) # df/dU
print(tape.gradient(f, V)) # df/dV
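
Two details of the cell above are easy to miss: the `watch` calls and the `persistent=True` argument. The following minimal sketch reuses the same values to show what happens without each of them. The names `unwatched_grad` and `second_call_failed` are illustrative only.

```python
import tensorflow as tf

U = tf.constant([[1., 2.]])
V = tf.constant([[3., 4.], [5., 6.]])

# 1. Constants are not watched by default: without tape.watch(), the gradient is None.
with tf.GradientTape() as tape:
  f = tf.math.reduce_sum(tf.matmul(U, V))
unwatched_grad = tape.gradient(f, U)
print(unwatched_grad)  # None

# 2. A non-persistent tape can only compute one set of gradients.
with tf.GradientTape() as tape:
  tape.watch(U)
  tape.watch(V)
  f = tf.math.reduce_sum(tf.matmul(U, V))
tape.gradient(f, U)  # the first call works
try:
  tape.gradient(f, V)  # the second call on the same tape raises an error
  second_call_failed = False
except RuntimeError:
  second_call_failed = True
print(second_call_failed)  # True
```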

TensorFlow automatically watches tensors that are defined as `Variable` instances. So let's turn `U` and `V` into variables, and remove the `watch` calls:

In [None]:
# Decimal points in tensor values ensure they are floats, which automatic differentiation requires.
U = tf.Variable([[1., 2.]])
V = tf.Variable([[3., 4.], [5., 6.]])

with tf.GradientTape(persistent=True) as tape:
  W = tf.matmul(U, V)
  f = tf.math.reduce_sum(W)

print(tape.gradient(f, U)) # df/dU
print(tape.gradient(f, V)) # df/dV
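
If you only need the gradients once, a persistent tape isn't required. As a minimal sketch, `tape.gradient` also accepts a list of sources and returns the gradients in the same order, so a single call on a default (non-persistent) tape is enough. The names `dU` and `dV` are illustrative.

```python
import tensorflow as tf

U = tf.Variable([[1., 2.]])
V = tf.Variable([[3., 4.], [5., 6.]])

with tf.GradientTape() as tape:
  f = tf.math.reduce_sum(tf.matmul(U, V))

# Passing a list of sources returns a list of gradients in the same order.
dU, dV = tape.gradient(f, [U, V])
print(dU)  # df/dU
print(dV)  # df/dV
```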

As you'll see later, in deep learning you need to calculate the derivatives of the loss function with respect to the model parameters. Those parameters are stored as variables because they change during training, so it's convenient that TensorFlow watches variables automatically.
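
To make this concrete, here is a rough sketch of a single training step. Everything in it is hypothetical and chosen only for illustration: the one-parameter model `y = w * x`, the data, the mean squared error loss, and the `learning_rate` value. It is not the training loop used later.

```python
import tensorflow as tf

# Hypothetical one-parameter model: y_pred = w * x.
w = tf.Variable(0.0)
x = tf.constant([1., 2., 3.])
y_true = tf.constant([2., 4., 6.])
learning_rate = 0.1  # illustrative value

# No watch() call is needed because w is a Variable.
with tf.GradientTape() as tape:
  y_pred = w * x
  loss = tf.math.reduce_mean(tf.square(y_pred - y_true))

grad = tape.gradient(loss, w)       # d(loss)/dw
w.assign_sub(learning_rate * grad)  # update w in place, moving against the gradient
print(grad.numpy(), w.numpy())
```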

## Optional explanation of the math

Let's take a look at the math used to compute the derivatives. You only need to understand matrix multiplication and partial derivatives to follow along, but if the math doesn't interest you, feel free to skip to the next notebook.

Start by thinking of $U$ and $V$ as generic $1 \times 2$ and $2 \times 2$ matrices:

$$
\begin{align}
  U =
  \begin{bmatrix}
    u_1 & u_2
  \end{bmatrix}
  &&
  V =
  \begin{bmatrix}
    v_{11} & v_{12} \\
    v_{21} & v_{22}
  \end{bmatrix}
\end{align}
$$

Then the scalar function $f$ can be written as:

$$
\begin{align}
  f(U, V)
  &= \mathrm{sum}(U \, V) \\
  &= \mathrm{sum} 
    \left( 
      \begin{bmatrix}
        u_1 & u_2
      \end{bmatrix}
      \begin{bmatrix}
        v_{11} & v_{12} \\
        v_{21} & v_{22}
      \end{bmatrix}
    \right) \\
  &= \mathrm{sum}
    \left(
      \begin{bmatrix}
        u_1 v_{11} + u_2 v_{21} & u_1 v_{12} + u_2 v_{22}
      \end{bmatrix}
    \right) \\
  &= u_1 v_{11} + u_2 v_{21} + u_1 v_{12} + u_2 v_{22}
\end{align}
$$

You can now calculate the derivatives of $f$ with respect to each of its inputs:

$$
\frac{\partial f}{\partial U} =
  \begin{bmatrix}
    \frac{\partial f}{\partial u_1} & \frac{\partial f}{\partial u_2}
  \end{bmatrix} = 
  \begin{bmatrix}
    v_{11} + v_{12} & v_{21} + v_{22}
  \end{bmatrix} = 
  \begin{bmatrix}
    7 & 11
  \end{bmatrix} 
$$

$$
\frac{\partial f}{\partial V} =
  \begin{bmatrix}
    \frac{\partial f}{\partial v_{11}} & \frac{\partial f}{\partial v_{12}} \\
    \frac{\partial f}{\partial v_{21}} & \frac{\partial f}{\partial v_{22}} 
  \end{bmatrix} = 
  \begin{bmatrix}
    u_1 & u_1 \\
    u_2 & u_2
  \end{bmatrix} = 
  \begin{bmatrix}
    1 & 1 \\
    2 & 2
  \end{bmatrix}
$$
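
The same two results can also be written compactly in matrix form. As a sketch, assume $\mathbf{1}$ denotes a $2 \times 1$ column vector of ones (a symbol introduced here for illustration only), so that $f = U \, V \, \mathbf{1}$:

$$
\begin{align}
  \frac{\partial f}{\partial U}
  &= (V \, \mathbf{1})^\top
  = \begin{bmatrix}
      v_{11} + v_{12} & v_{21} + v_{22}
    \end{bmatrix} \\
  \frac{\partial f}{\partial V}
  &= U^\top \mathbf{1}^\top
  = \begin{bmatrix}
      u_1 \\
      u_2
    \end{bmatrix}
    \begin{bmatrix}
      1 & 1
    \end{bmatrix}
  = \begin{bmatrix}
      u_1 & u_1 \\
      u_2 & u_2
    \end{bmatrix}
\end{align}
$$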

As you can see, when you plug in the numerical values of $U$ and $V$, you get the same result as TensorFlow's automatic differentiation.
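
You can also check this agreement in code. The sketch below evaluates the hand-derived formulas with ordinary tensor operations and compares them against the gradients from the tape. The names `manual_dU`, `manual_dV`, `auto_dU`, and `auto_dV` are illustrative.

```python
import tensorflow as tf

U = tf.Variable([[1., 2.]])
V = tf.Variable([[3., 4.], [5., 6.]])

with tf.GradientTape() as tape:
  f = tf.math.reduce_sum(tf.matmul(U, V))
auto_dU, auto_dV = tape.gradient(f, [U, V])

# Hand-derived formulas from the derivation above.
manual_dU = tf.reshape(tf.math.reduce_sum(V, axis=1), [1, 2])  # [v11 + v12, v21 + v22]
manual_dV = tf.transpose(U) * tf.ones_like(V)                  # row i is filled with u_i

print(bool(tf.reduce_all(auto_dU == manual_dU)))  # True
print(bool(tf.reduce_all(auto_dV == manual_dV)))  # True
```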
