# Exercise 06 - Gradient Computations

This notebook is about gradient computations in neural networks and shows how to get the weights and the gradient vector of a model with TensorFlow. It also looks at the gradient descent step: we perform a few steps manually and explore how the outputs of the neural network develop.

**Learning objectives:**
- Get to know gradient calculations
- Learn to work with the gradient tape of TensorFlow
- See how to perform user defined weight initialization
- Manually do the computations from output scores to class probabilities, and then to a loss value
- Predict class index from scores or probabilities
- Get gradients from intermediate results

Before you start, find a GPU on the system that is not heavily used by other users (with **nvidia-smi**), and change X to the id of this GPU.

In [18]:
# Change X to the GPU number you want to use,
# otherwise you will get a Python error
# e.g. USE_GPU = 4
USE_GPU = 4

In [19]:
# Import TensorFlow 
import tensorflow as tf

# Print the installed TensorFlow version
print(f'TensorFlow version: {tf.__version__}\n')

# Get all GPU devices on this server
gpu_devices = tf.config.list_physical_devices('GPU')

# Print the name and the type of all GPU devices
print('Available GPU Devices:')
for gpu in gpu_devices:
    print(' ', gpu.name, gpu.device_type)
    
# Set only the GPU specified as USE_GPU to be visible
tf.config.set_visible_devices(gpu_devices[USE_GPU], 'GPU')

# Get all visible GPU  devices on this server
visible_devices = tf.config.get_visible_devices('GPU')

# Print the name and the type of all visible GPU devices
print('\nVisible GPU Devices:')
for gpu in visible_devices:
    print(' ', gpu.name, gpu.device_type)
    
# Set the visible device(s) to not allocate all available memory at once,
# but rather let the memory grow whenever needed
for gpu in visible_devices:
    tf.config.experimental.set_memory_growth(gpu, True)

TensorFlow version: 2.12.0

Available GPU Devices:
  /physical_device:GPU:0 GPU
  /physical_device:GPU:1 GPU
  /physical_device:GPU:2 GPU
  /physical_device:GPU:3 GPU
  /physical_device:GPU:4 GPU
  /physical_device:GPU:5 GPU
  /physical_device:GPU:6 GPU
  /physical_device:GPU:7 GPU

Visible GPU Devices:
  /physical_device:GPU:4 GPU


## Neuron with sigmoid activation

In the first part, we look at gradient computation using a neuron with two input values and one output, like the one used in the lecture to illustrate the backpropagation algorithm. For this purpose, a sequential model is constructed with one dense (fully connected) layer consisting of exactly one neuron. 

The layer weights of neural networks can be initialized in TensorFlow with initialization functions or initialization classes. By default, the initializer for the layer weights (the `kernel_initializer`) is the Xavier (or Glorot) uniform initializer, and the initializer for the biases (the `bias_initializer`) is the `Zeros` initializer, which sets all values to zero. There are other initializers implemented in TensorFlow that could be used, or you can implement a function or class for this purpose.

In order to get the weights initialized according to the example of the lecture, we define a function `init_with_specified_values()` that returns a tensor with the values 2.0 and -3.0 for the two weights. (There is no built-in initializer that provides specific values per weight, since this is not how we typically want to initialize our neurons. Rather, we want to initialize with small random values according to some distribution.) An initializer function must take the shape and optionally the data type as arguments and return a tensor accordingly. To explore what the expected shape is, we print the shape argument in the function.

TensorFlow works very well together with NumPy, so we can return the initial weight values as a NumPy array and TensorFlow will convert it into a TensorFlow tensor object. The alternative would be to construct a TensorFlow Tensor object directly. For initializing the bias, we can use the `Constant` initializer class from TensorFlow (Keras), which we construct so that it always returns the value -3.0.

In [20]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.initializers import Constant

import numpy as np

def init_with_specified_values(shape, dtype=None):
    print('Kernel initializer should return tensor of shape:', shape, '\n')
    return np.array([[2.0], [-3.0]])

model = Sequential([
    Dense(input_shape=(2,),
          units = 1, 
          kernel_initializer=init_with_specified_values, #usually by default : 'glorot_uniform'
          bias_initializer=tf.keras.initializers.Constant(value=-3),
          activation = 'sigmoid',
          name='DenseLayer')
])

model.summary()

Kernel initializer should return tensor of shape: [2, 1] 

Model: "sequential_5"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 DenseLayer (Dense)          (None, 1)                 3         
                                                                 
Total params: 3
Trainable params: 3
Non-trainable params: 0
_________________________________________________________________


The tensor shape is expected to be (2,1), which means we need a tensor with two rows and one column (one value per row). We therefore put each value in its own pair of brackets to form a row, and wrap both rows in an outer pair of brackets.

In [21]:
np.array([[2.0], [-3.0]]).shape

(2, 1)

If we left out the inner brackets, we would end up with a tensor of shape (2,), which is a vector of two values.

In [22]:
np.array([2.0, -3.0]).shape

(2,)

Before we continue on: what would be the tensor shape that is expected from the initializer function, if the dense layer has three neurons instead of one? Think about it, try it out to have the tensor shape printed, and construct an array accordingly that provides a tensor of this shape with your own values.

**Task: In the following, the `zeros()` function of NumPy is used to construct a tensor of zeros in the specified shape. Replace it with the `array()` function to construct a tensor of the same shape, but with your own values (e.g. 1.0, 2.0, 3.0, ...).**

In [55]:
def init_with_specified_values(shape, dtype=None):
    print('Kernel initializer should return tensor of shape:', shape, '\n')
    return np.zeros(shape)
#    return np.array([...])

model = Sequential([
    Dense(input_shape=(2,),
          units = 3, 
          kernel_initializer=init_with_specified_values,  # usually by default : 'glorot_uniform'
          bias_initializer=tf.keras.initializers.Constant(value=-3),
          activation = 'sigmoid',
          name='DenseLayer')
])

model.summary()

Kernel initializer should return tensor of shape: [2, 3] 

Model: "sequential_15"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 DenseLayer (Dense)          (None, 3)                 9         
                                                                 
Total params: 9
Trainable params: 9
Non-trainable params: 0
_________________________________________________________________


For the sake of completeness, a version without NumPy is also shown below, which generates a TensorFlow tensor instead in the initialization function. TensorFlow distinguishes tensors that are constant (non-trainable) or are variable (trainable). Since the tensor is used for trainable weights, the tensor is generated with the `Variable()` constructor. (We could also use `constant()` to generate a non-trainable tensor, and TensorFlow makes a trainable copy of this constant in the background. Try out the commented out line that uses the `constant()` function.)

In [24]:
def init_with_specified_values(shape, dtype=None):
    return tf.Variable([[2.0], [-3.0]])
#    return tf.constant([[2.0], [-3.0]])

model = Sequential([
    Dense(input_shape=(2,),
          units = 1, 
          kernel_initializer=init_with_specified_values, 
          bias_initializer=tf.keras.initializers.Constant(value=-3),
          activation = 'sigmoid',
          name='DenseLayer')
])

model.summary()

Model: "sequential_7"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 DenseLayer (Dense)          (None, 1)                 3         
                                                                 
Total params: 3
Trainable params: 3
Non-trainable params: 0
_________________________________________________________________


The trainable weights of a model can be accessed with the `trainable_weights` attribute. The attribute is iterable, and we can access all elements with a for loop.

In [25]:
for v in model.trainable_weights:
    print(v)

<tf.Variable 'DenseLayer/kernel:0' shape=(2, 1) dtype=float32, numpy=
array([[ 2.],
       [-3.]], dtype=float32)>
<tf.Variable 'DenseLayer/bias:0' shape=(1,) dtype=float32, numpy=array([-3.], dtype=float32)>


To finish up on our example, we construct an input tensor for the model with input values -1.0 and -2.0. Here, a version using NumPy.

In [33]:
import numpy as np

x = np.array([-1.0, -2.0])

print(x.shape)
print(x)

(2,)
[-1. -2.]


However, we cannot use the tensor in this way, since neural network models expect batches of data. So, we have to add another (first) dimension to the tensor that denotes the batch size. Since we only use a single sample as a batch, the batch size is one. Therefore, we reshape the tensor of two values to a tensor of shape (1,2). Now we have a batch of one sample, where each sample has 2 values.

In [34]:
x = x.reshape((1, 2))

print(x.shape)

(1, 2)


Again, the TensorFlow version constructing a Tensor object. Check out the brackets to directly generate a tensor of shape (1,2).

In [28]:
x = tf.constant([[-1.0, -2.0]])

print(x)

tf.Tensor([[-1. -2.]], shape=(1, 2), dtype=float32)


Finally, we can use the input tensor x in the forward pass by calling the neural network model object with the input data. In Python, we can call objects of a class like a function, which is equivalent to calling the `__call__()` method of the object. For TensorFlow classes that are derived from the class `Layer`, this `__call__()` method performs a forward pass and does all the necessary calculations. The result is the predicted target probability.

In [37]:
y_pred = model(x)

print(y_pred)

tf.Tensor([[0.7310586]], shape=(1, 1), dtype=float32)


The value of the returned tensor is around 0.73, which means the neuron (the network) is 73% sure that the object of the input belongs to the positive class.
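The forward pass of this single neuron can also be reproduced by hand. The following is a minimal sketch in plain Python (no TensorFlow), using the weights 2.0 and -3.0, the bias -3.0, and the input (-1.0, -2.0) from above; the helper name `sigmoid` is our own choice for illustration.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

w = [2.0, -3.0]   # kernel values returned by the initializer function
b = -3.0          # bias value from the Constant initializer
x = [-1.0, -2.0]  # the single input sample

# Score (pre-activation): z = w1*x1 + w2*x2 + b
z = sum(wi * xi for wi, xi in zip(w, x)) + b

# Sigmoid activation turns the score into a probability
y = sigmoid(z)

print(z)            # 1.0
print(round(y, 4))  # 0.7311
```

The score is exactly 1.0, and the sigmoid of 1.0 is the probability of about 0.73 discussed above.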

When we call the model or layer methods within the context of the `GradientTape()`, then the operations that are called within this context are recorded and can be used for automatic differentiation.

In [52]:
with tf.GradientTape() as tape:
    y = model(x)

The `gradient()` method of the gradient tape takes the target as its first argument and the sources as its second argument, and differentiates the target with respect to each of the sources. Here, the target is y and the sources are the trainable weights of the model. (We basically calculate the partial derivatives of the function that calculated y with respect to all weights of the model.)

In [53]:
grad = tape.gradient(y, model.trainable_weights)

Since the number of trainable weights is the same as the number of elements in the gradient list, we can zip them together in order to iterate over both at the same time. We use this to print all gradients with the name of the variable. Since each element of the gradient list is itself a tensor (a matrix for the kernel, a vector for the bias), we also iterate (within the first loop) over its elements and print them row by row. (We could leave out the second loop and print g instead of looping through its elements, but then the output would not be printed as nicely row by row.)

In [54]:
for var, g in zip(model.trainable_weights, grad):
    for i in g:
        print(f'{var.name:<20} {i}')

DenseLayer/kernel:0  [-0.22878425]
DenseLayer/kernel:0  [-0.4575685]
DenseLayer/bias:0    0.2287842482328415


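For the lecture weights, the score is z = 1 and the slope of the sigmoid is σ(1)(1 − σ(1)) ≈ 0.1966, so the gradients should be approximately -0.197 and -0.393 for the two weights and 0.197 for the bias. These are (almost) the same as in the lecture. The small differences come from the fact that the numbers in the lecture slides were rounded to the second decimal, and the small rounding errors accumulated up to the gradients of the weights. Note that the output recorded above (about -0.229, -0.458 and 0.229) belongs to a score of magnitude 0.6, which indicates that the cell was executed after one of the later model definitions; running the notebook from top to bottom gives the expected values.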
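The gradients of this neuron can also be derived by hand with the chain rule and checked numerically. The sketch below is plain Python and assumes the lecture values (weights 2.0 and -3.0, bias -3.0, input (-1.0, -2.0)); the function name `neuron` and the step size `eps` are our own choices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron(w1, w2, b, x=(-1.0, -2.0)):
    """Forward pass of the single sigmoid neuron for a fixed input x."""
    return sigmoid(w1 * x[0] + w2 * x[1] + b)

params = [2.0, -3.0, -3.0]  # w1, w2, b as initialized above
x = (-1.0, -2.0)

# Analytic gradient via the chain rule:
# dy/dz = y * (1 - y), dz/dw_i = x_i, dz/db = 1
y = neuron(*params)
slope = y * (1.0 - y)
analytic = [slope * x[0], slope * x[1], slope]

# Numerical check with central differences on each parameter
eps = 1e-6
numeric = []
for i in range(len(params)):
    plus, minus = list(params), list(params)
    plus[i] += eps
    minus[i] -= eps
    numeric.append((neuron(*plus) - neuron(*minus)) / (2 * eps))

print([round(g, 4) for g in analytic])  # [-0.1966, -0.3932, 0.1966]
print([round(g, 4) for g in numeric])   # [-0.1966, -0.3932, 0.1966]
```

The gradient tape performs the analytic computation automatically; the finite-difference check is only a sanity test and is far too slow for real networks.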

## Weights that result in higher score

Next, we use initial weights that give us a slightly higher score and probability than before, as they might look after one or more training steps.

In [47]:
def init_with_specified_values(shape, dtype=None):
    return tf.Variable([[1.9], [-3.1]])

model = Sequential([
    Dense(input_shape=(2,),
          units = 1, 
          kernel_initializer=init_with_specified_values, 
          bias_initializer=tf.keras.initializers.Constant(value=-2.9),
          activation = 'sigmoid',
          name='DenseLayer')
])

y_pred = model(x)

print(y_pred)

tf.Tensor([[0.8021838]], shape=(1, 1), dtype=float32)


The predicted value is now a probability of 80%.

In [48]:
with tf.GradientTape() as tape:
    y = model(x)

grad = tape.gradient(y, model.trainable_weights)

for var, g in zip(model.trainable_weights, grad):
    for i in g:
        print(f'{var.name:<20} {i}')

DenseLayer/kernel:0  [-0.15868495]
DenseLayer/kernel:0  [-0.3173699]
DenseLayer/bias:0    0.15868495404720306


You notice that the gradient vector points in the same direction, but it is shorter (has smaller values). This is because the score (class probability) is higher and the slope of the sigmoid function is smaller there.

## Weights that result in lower score

And now we change the weights to get a slightly smaller score and probability.

In [49]:
def init_with_specified_values(shape, dtype=None):
    return tf.Variable([[2.1], [-2.9]])

model = Sequential([
    Dense(input_shape=(2,),
          units = 1, 
          kernel_initializer=init_with_specified_values, 
          bias_initializer=tf.keras.initializers.Constant(value=-3.1),
          activation = 'sigmoid',
          name='DenseLayer')
])

y_pred = model(x)

print(y_pred)

tf.Tensor([[0.6456564]], shape=(1, 1), dtype=float32)


The predicted value is now a probability of 64.5%.

In [50]:
with tf.GradientTape() as tape:
    y = model(x)

grad = tape.gradient(y, model.trainable_weights)

for var, g in zip(model.trainable_weights, grad):
    for i in g:
        print(f'{var.name:<20} {i}')

DenseLayer/kernel:0  [-0.22878422]
DenseLayer/kernel:0  [-0.45756844]
DenseLayer/bias:0    0.2287842184305191


And the gradient vector is now longer.

If we play around with the weights until we have a predicted probability of around 1.0 - 0.645 = 0.355 (where 0.645 is the above probability), we end up with the same gradient vector. (You need to ignore the slight differences that result from the probabilities not being exactly opposite.) The reason is that the derivative of the sigmoid function, y(1 - y), is symmetric: it takes the same value for y and 1.0 - y.

In [51]:
def init_with_specified_values(shape, dtype=None):
    return tf.Variable([[2.45], [-2.55]])

model = Sequential([
    Dense(input_shape=(2,),
          units = 1, 
          kernel_initializer=init_with_specified_values, 
          bias_initializer=tf.keras.initializers.Constant(value=-3.25),
          activation = 'sigmoid',
          name='DenseLayer')
])

y_pred = model(x)

print(y_pred, "\n")

with tf.GradientTape() as tape:
    y = model(x)

grad = tape.gradient(y, model.trainable_weights)

for var, g in zip(model.trainable_weights, grad):
    for i in g:
        print(f'{var.name:<20} {i}')

tf.Tensor([[0.35434368]], shape=(1, 1), dtype=float32) 

DenseLayer/kernel:0  [-0.22878425]
DenseLayer/kernel:0  [-0.4575685]
DenseLayer/bias:0    0.2287842482328415
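The relation between the score and the length of the gradient vector can be tabulated without TensorFlow. The sketch below recomputes the score z, the probability y, and the sigmoid slope y(1 - y) for the four weight settings used in this section; the dictionary keys are labels we made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x = (-1.0, -2.0)  # same input sample as above

# (w1, w2, b) for the four models used in this section
configs = {
    "lecture":      (2.0, -3.0, -3.0),
    "higher score": (1.9, -3.1, -2.9),
    "lower score":  (2.1, -2.9, -3.1),
    "mirrored":     (2.45, -2.55, -3.25),
}

slopes = {}
for name, (w1, w2, b) in configs.items():
    z = w1 * x[0] + w2 * x[1] + b
    y = sigmoid(z)
    slopes[name] = y * (1.0 - y)  # derivative of the sigmoid at z
    print(f"{name:<13} z={z:+.2f}  y={y:.4f}  dy/dz={slopes[name]:.4f}")
```

Since every gradient component is the slope multiplied by a fixed input value (or by 1 for the bias), the slope alone determines the length of the gradient vector. The scores +0.6 and -0.6 give the same slope, which is why the last two models share the same gradients.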


## 2-layer dense network with softmax loss

In the next part, we construct a larger network with two dense layers, both with five neurons. We initialize their weights with the `RandomNormal()` initializer with zero mean and a standard deviation of 0.05. To reproduce the results, a seed value of 7 is used to generate the random numbers. Now we have five scores as output values.

In [None]:
model = Sequential([
    Dense(input_shape=(2,),
          units = 5, 
          kernel_initializer=tf.keras.initializers.RandomNormal(mean=0.0, stddev=0.05, seed=7), 
          bias_initializer=tf.keras.initializers.RandomNormal(mean=0.0, stddev=0.05, seed=7),
          activation = 'sigmoid',
          name='DenseLayer1'),
    Dense(units = 5, 
          kernel_initializer=tf.keras.initializers.RandomNormal(mean=0.0, stddev=0.05, seed=7), 
          bias_initializer=tf.keras.initializers.RandomNormal(mean=0.0, stddev=0.05, seed=7),
          name='DenseLayer2')
])

model.summary()

As input, we construct a tensor with random numbers. (We specify the seed value of the random function to always get the same values to reproduce our results.)

In [None]:
from tensorflow.random import uniform

tf.random.set_seed(7)

x = tf.random.uniform((1,2))

x.shape

Applying the model on the input data x, we get the predicted class scores y.

In [None]:
y_pred = model(x)

y_pred

To get from the scores to the probabilities, the softmax function can be used.

In [None]:
from tensorflow.nn import softmax

y_prob = softmax(y_pred)

y_prob

You notice that the probability of the class with index 1 is the highest and we would predict class 1.

We could also use the `argmax()` function on the tensor of probabilities (or on the tensor of scores) to get the predicted class index. Since the tensor is of shape (1,5), we need to apply the argmax function on the second (the last) dimension, by specifying the axis to be 1 (or -1).

In [None]:
from tensorflow.math import argmax

cls_label = tf.math.argmax(y_prob, axis=-1)

print(cls_label)
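To see what softmax and argmax do under the hood, here is a minimal sketch in plain Python. The five scores are hypothetical values chosen for illustration (they are not the outputs of the model above), and the helper functions `softmax` and `argmax` are our own, not the TensorFlow ones.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)                       # subtracting the max avoids overflow
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest element."""
    return max(range(len(values)), key=lambda i: values[i])

# Hypothetical scores for five classes (assumption, not the notebook's values)
scores = [0.02, 0.11, -0.07, 0.05, 0.01]
probs = softmax(scores)

print([round(p, 3) for p in probs])
print(argmax(scores), argmax(probs))
```

Because the softmax function is monotonic, the largest score always becomes the largest probability, so the argmax can be taken on either of them.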

Remember that in training we typically have the correct class label to start with; here, we simply treat the predicted label (now stored in `cls_label`) as the true label. With the true class label and the predicted scores, we can compute the sparse softmax cross entropy loss with the `sparse_softmax_cross_entropy_with_logits()` function. "Sparse" means that we have the class labels as integer values instead of one-hot encoded vectors. "With logits" means that we have not applied the softmax yet, but pass the predicted scores to the function.

In [None]:
from tensorflow.nn import sparse_softmax_cross_entropy_with_logits

loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=cls_label, logits=y_pred)

loss

The loss value for the above configuration is 1.5745.

From our above probabilities, we can slice out the probability of the true class, and apply the negative log on it to get the loss by hand. We need to use the slice operator on the `y_prob` tensor, since it is of shape (1,5), and we want to get the value from the first row (first index 0), and from the second column (which would be index 1, the true class). Since `cls_label` is a one-dimensional tensor, we need to use the index operator to get the value out. Hence the somewhat cumbersome syntax.

In [None]:
from tensorflow.math import log

y_prob_for_true_cls = y_prob[0, cls_label[0]]

-log(y_prob_for_true_cls)

But in the end, we get the same loss value of 1.5745. So, our calculations by hand were correct.
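The equivalence of the two ways of computing the loss can be sketched in plain Python as well. The scores below are hypothetical (not the notebook's values), and the function `sparse_softmax_cross_entropy` is our own illustration of the idea, not the TensorFlow implementation.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_softmax_cross_entropy(label, scores):
    """Loss computed directly from the scores:
    log-sum-exp of all scores minus the score of the true class."""
    m = max(scores)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum_exp - scores[label]

# Hypothetical scores for five classes and an integer ("sparse") label
scores = [0.02, 0.11, -0.07, 0.05, 0.01]
label = 1

# Route 1: scores -> probabilities -> negative log of the true-class probability
loss_from_probs = -math.log(softmax(scores)[label])

# Route 2: directly from the scores (the "with logits" variant)
loss_from_scores = sparse_softmax_cross_entropy(label, scores)

print(loss_from_probs, loss_from_scores)
```

With five classes and scores close to zero, all probabilities are close to 1/5, so the loss is close to log(5) ≈ 1.609. This is why the loss values in this notebook lie around 1.6.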

Let's put this all together and apply it in the context of the gradient tape. We first apply the model on input x to get the intermediate outputs y, and then use y as the input to the loss function. We then compute the gradients of the loss value with respect to the trainable weights of the model, and print them all out.

In [None]:
with tf.GradientTape() as tape:
    y = model(x)
    loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=cls_label, logits=y)    

grad = tape.gradient(loss, model.trainable_weights)

for var, g in zip(model.trainable_weights, grad):
    for i in g:
        print(f'{var.name:<20} {i}')

Assuming that the correct (the true) label is not class 1, but class 2, which is actually the lowest score, we get the following loss value and gradients.

In [None]:
with tf.GradientTape() as tape:
    y = model(x)
    loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=tf.Variable([2]), logits=y) 
                                                          
print('Loss:', loss, '\n')

grad = tape.gradient(loss, model.trainable_weights)

for var, g in zip(model.trainable_weights, grad):
    for i in g:
        print(f'{var.name:<20} {i}')

The loss with value 1.69 is higher than the previous one, which is correct since the probability of that class is lower, and the gradients of the weights are also different. If we apply this gradient vector to the weights, the loss value with class 2 as the true class should improve, as well as the probability. Let's try this out.

## Manual gradient descent step

In order to make a gradient descent step, we use an optimizer object. The simplest one is stochastic gradient descent, which only subtracts the gradient vector (multiplied by the learning rate) from the weights. Therefore, we first need to construct such an optimizer object.

In [None]:
from tensorflow.keras import optimizers

optimizer = optimizers.SGD(learning_rate=1e-3)

Before we make the gradient descent step, let us output the weights of the model. But only from the first layer and without the biases, as otherwise the output would be too large.

In [None]:
model.trainable_weights[0]

And we output the gradients of the first layer.

In [None]:
grad[0]

A gradient descent step by hand multiplies the gradient by the learning rate and subtracts the result from the trainable weights.

In [None]:
model.trainable_weights[0] - 0.001 * grad[0]

Using the above constructed optimizer object, we can call the `apply_gradients()` function, which receives an iterable object that returns pairs of gradients and weights.

In [None]:
optimizer.apply_gradients(zip(grad, model.trainable_weights))

After the gradient descent step, the weights of the model are updated, and are the same as the ones we calculated above.

In [None]:
model.trainable_weights[0]

If we input the same data as before to the updated model, we should get a lower loss value than before, and the probabilities should also change in favor of class 2 (which we assume to be the true class).

In [None]:
y = model(x)

loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=tf.Variable([2]), logits=y) 

print(loss)

y_prob_new = softmax(y)

print('Probabilities before update:', y_prob)
print('Probabilities after update :', y_prob_new)

And this is actually the case: The loss decreased a little from 1.6985874 to 1.6967251, and the probabilities changed slightly in favor of class 2.

We can repeat this gradient descent step another 100 times by placing the code from above (the forward pass in the context of the gradient tape, the calculation of the gradients, and the application of the gradients to the trainable weights of the model) inside a for loop that is executed 100 times. (The loss value is printed just once before the loop and once more after the loop.)

In [None]:
print('Loss:', loss)

for i in range(100):

    with tf.GradientTape() as tape:
        y = model(x)
        loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=tf.Variable([2]), logits=y) 
        
    grad = tape.gradient(loss, model.trainable_weights)

    optimizer.apply_gradients(zip(grad, model.trainable_weights))
    
print('Loss:', loss)
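The structure of this loop (forward pass, loss, gradients, weight update) can be illustrated in plain Python. The sketch below is not the two-layer model from above: it assumes a single linear layer with three classes, a hypothetical input sample, zero-initialized weights, and a learning rate of 0.1 (larger than the 1e-3 used in the notebook), and it computes the gradients by hand instead of using a gradient tape.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

x = [0.5, -1.0]                          # hypothetical input sample
W = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]   # shape (2, 3): inputs x classes
b = [0.0, 0.0, 0.0]
label = 2                                # assumed true class
learning_rate = 0.1                      # assumption for this toy example

losses = []
for step in range(100):
    # Forward pass: scores -> probabilities -> loss
    scores = [sum(x[i] * W[i][j] for i in range(2)) + b[j] for j in range(3)]
    probs = softmax(scores)
    losses.append(-math.log(probs[label]))

    # Backward pass: dL/dscore_j = p_j - 1 if j is the true class, else p_j
    d_scores = [p - (1.0 if j == label else 0.0) for j, p in enumerate(probs)]

    # Gradient descent step: dL/dW[i][j] = x_i * dL/dscore_j, dL/db_j = dL/dscore_j
    for j in range(3):
        for i in range(2):
            W[i][j] -= learning_rate * d_scores[j] * x[i]
        b[j] -= learning_rate * d_scores[j]

print(round(losses[0], 4), round(losses[-1], 4))
```

The first loss is log(3), since all three classes start with the same probability, and it shrinks with every step as the probability of the true class grows.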

Once more, we do a prediction with the input data, show the loss, and how the predicted probabilities changed.

In [None]:
y = model(x)

loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=tf.Variable([2]), logits=y) 

print(loss)

y_prob_new = softmax(y)

print('Probabilities before update:', y_prob)
print('Probabilities after update :', y_prob_new)

The loss value is now much lower than what we started with, and the probabilities have shifted so that the model would now predict class 2.

**But please keep in mind that we trained this model with a single sample only. Normally, we would train a model with lots of samples and with many different true target classes.**

## Gradients of intermediate results

With the gradient tape, we can also calculate the gradients with respect to intermediate results. In the following, we calculate the gradients of the loss with respect to both y, which is the result of the model taking input x, and all the trainable weights.

In [None]:
with tf.GradientTape() as tape:
    y = model(x)
    loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=tf.Variable([2]), logits=y)
    
grad = tape.gradient(loss, [y, model.trainable_weights])

grad

The first output tensor is the gradient of the loss with respect to y. Remember that y is not part of the trainable weights of the model; it is the score vector from the model before these scores go into the loss function.
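For the softmax cross entropy loss, the gradient with respect to the scores has a simple closed form: the vector of probabilities, with 1 subtracted at the index of the true class. The sketch below checks this in plain Python against central differences; the scores are hypothetical values and `loss_fn` is a helper name of our own.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def loss_fn(scores, label):
    """Softmax cross entropy for one sample with an integer label."""
    return -math.log(softmax(scores)[label])

# Hypothetical scores (assumption, not the notebook's values); true class 2
scores = [0.02, 0.11, -0.07, 0.05, 0.01]
label = 2

# Closed form: dL/dscore_j = p_j - 1 for the true class, p_j otherwise
probs = softmax(scores)
analytic = [p - (1.0 if j == label else 0.0) for j, p in enumerate(probs)]

# Numerical check with central differences
eps = 1e-6
numeric = []
for j in range(len(scores)):
    plus, minus = list(scores), list(scores)
    plus[j] += eps
    minus[j] -= eps
    numeric.append((loss_fn(plus, label) - loss_fn(minus, label)) / (2 * eps))

print([round(g, 4) for g in analytic])
print([round(g, 4) for g in numeric])
```

Only the component of the true class is negative, so a gradient descent step raises the score of the true class and lowers all the others. The components sum to zero.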

Last, we do some basic mathematical computations (log(c*(a+b))), where we store the result in a tensor variable, and compute and print the gradients of this computation.

In [None]:
a = tf.Variable((5.0))
b = tf.Variable((7.3))
c = tf.Variable((0.2))

with tf.GradientTape() as tape:
    x = a + b
    y = x * c
    z = log(y)

grad = tape.gradient(z, [x, y, z])

grad

Notice that the gradient of z (the final output) takes the value 1.0. 
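The gradients of this small computation can be followed step by step with the chain rule. The sketch below uses plain Python floats with the same values for a, b, and c; the variable names `dz_dx`, `dz_dy`, and so on are our own.

```python
import math

a, b, c = 5.0, 7.3, 0.2

# Forward pass, same steps as in the cell above
x = a + b          # 12.3
y = x * c          # about 2.46
z = math.log(y)

# Backward pass, from the output back to the inputs
dz_dz = 1.0            # the output with respect to itself
dz_dy = dz_dz / y      # derivative of log(y) is 1/y
dz_dx = dz_dy * c      # y = x * c, so dy/dx = c
dz_dc = dz_dy * x      # y = x * c, so dy/dc = x
dz_da = dz_dx * 1.0    # x = a + b, so dx/da = 1
dz_db = dz_dx * 1.0    # x = a + b, so dx/db = 1

print(round(dz_dx, 4), round(dz_dy, 4), dz_dz)  # 0.0813 0.4065 1.0
```

The backward pass starts with the value 1.0 for the final output, and every further gradient is the gradient of the following step multiplied by a local derivative. This is the backpropagation that the gradient tape carries out for us.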

In this way, by using the TensorFlow functions on TensorFlow tensors, a computational graph is built in the background and evaluated in a forward pass, and with the `gradient()` method of the gradient tape object we get the gradients that result from backpropagation. Then, the gradients can be applied to the trainable weights of the model. In this way, we could define our own training loop. However, the `fit()` method of the TensorFlow model class is much more convenient, and we have a lot of ways to configure the training. For example, we can use different optimizers that do not just subtract the gradient vector from the weights, but also build up and keep track of a momentum, which often allows us to reach a minimum of our function much more efficiently.
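To give an idea of what such a momentum looks like, the sketch below compares two steps of plain gradient descent with two steps of gradient descent with momentum. It is a toy example under our own assumptions: the loss is f(w) = w², the learning rate is 0.1, the momentum factor is 0.9, and the update rule is the common textbook form (velocity = momentum * velocity - learning rate * gradient), not a copy of any particular optimizer implementation.

```python
def grad(w):
    """Gradient of the toy loss f(w) = w**2."""
    return 2.0 * w

learning_rate = 0.1
momentum = 0.9

w_sgd = 1.0                   # weight updated with plain gradient descent
w_mom, velocity = 1.0, 0.0    # weight and velocity for the momentum variant

for step in range(2):
    # Plain gradient descent: subtract the scaled gradient
    w_sgd -= learning_rate * grad(w_sgd)

    # Momentum: the velocity accumulates a decaying sum of past gradients
    velocity = momentum * velocity - learning_rate * grad(w_mom)
    w_mom += velocity

print(round(w_sgd, 4), round(w_mom, 4))  # 0.64 0.46
```

After the first step both variants are at 0.8, but in the second step the momentum variant moves further, because the previous update is still contained in the velocity.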

If we do need more control over the training process, there are plenty of ways to define our own cost functions, weight initializers, optimizers, etc. by deriving from existing TensorFlow (base) classes, and use those in the training process with the fit method.