## Neural Networks for Data Science Applications (2023-2024)
### Lab session 1: Logistic regression in TensorFlow

Teaching assistants: [Jary Pomponi](https://scholar.google.com/citations?user=Zha7UeoAAAAJ&hl=it), [Francesco Verdini](https://phd.uniroma1.it/web/FRANCESCO-VERDINI_nP1765820_EN.aspx)

#### Part 1: Setup of the notebook

In [None]:
# At the time of writing, the latest TF version is 2.14, while the one installed
# on Colab is 2.13. If you want the latest one, uncomment this line.
# %pip install tensorflow --upgrade --quiet

In [None]:
import tensorflow as tf
print(tf.__version__)

In [None]:
# Check available devices. By default, Colab should only have a single CPU device.
tf.config.list_logical_devices()

#### Part 2: Manipulating tensors

In [None]:
# Initialize a simple tensor whose entries are drawn from a normal distribution.
x = tf.random.normal((3, 2))

In [None]:
# Tensors can be scaled and shifted: these samples follow the same distribution as
# tf.random.normal((3, 2), mean=10, stddev=0.1).
x = tf.random.normal((3, 2)) * 0.1 + 10

In [None]:
print(x)

In [None]:
# Every tensor has a shape and a data type (dtype).
print(x.shape)
print(x.dtype)

In [None]:
# Extract a NumPy representation
x.numpy()

In [None]:
# Indexing is based on the NumPy semantics.
# Check here for a more complete guide: https://www.tensorflow.org/guide/tensor
x[0:2]

In [None]:
# Reshape can modify the dimensions and also add/remove singleton dimensions.
tf.reshape(x, (1, 6, 1, 1))

In [None]:
# Broadcasting lets us multiply tensors of different shapes: here (3, 1, 2) * (1, 3, 2) -> (3, 3, 2).
# Indexing with None adds a new axis of size 1.
x[:, None] * x[None]
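
To make the broadcasting step above concrete, the following sketch prints the intermediate shapes and checks one entry of the result by hand. The names `a`, `b`, and `prod` are introduced here only for illustration.

```python
import tensorflow as tf

x = tf.random.normal((3, 2))

a = x[:, None]  # shape (3, 1, 2): new axis inserted at position 1
b = x[None]     # shape (1, 3, 2): new axis inserted at position 0

# Broadcasting stretches every size-1 axis to match the other operand.
prod = a * b
print(a.shape, b.shape, prod.shape)  # (3, 1, 2) (1, 3, 2) (3, 3, 2)

# Entry [i, j, k] of the product is x[i, k] * x[j, k].
i, j, k = 2, 0, 1
expected = x[i, k] * x[j, k]
print(float(prod[i, j, k]), float(expected))
```
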

#### Part 3: Automatic differentiation

In [None]:
Z = tf.random.normal((3, 3))
X = tf.random.normal((3, 2))

# GradientTape stores the intermediate computations to allow for gradient computation.
# For a computation to be recorded, it must originate from a tensor that is 'watched',
# or from a tf.Variable (see below).
with tf.GradientTape() as tape:
    tape.watch(X)
    tape.watch(Z)
    y = tf.reduce_mean(tf.math.cos(X) @ tf.transpose(X) + Z)

In [None]:
# Note: The operator @ performs a matrix multiplication, which calls tf.matmul() under the hood (https://www.tensorflow.org/api_docs/python/tf/linalg/matmul).
tf.math.cos(X) @ tf.transpose(X)

In [None]:
# Gradients of y with respect to X and Z.
tape.gradient(y, [X, Z])
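
A quick way to sanity-check the tape is to compare one gradient with a value that can be derived by hand. In the function above, `Z` enters additively and `y` is the mean of 9 entries, so every entry of the gradient with respect to `Z` should be 1/9. The sketch below repeats the computation in a self-contained form; `grad_X` and `grad_Z` are names chosen here.

```python
import tensorflow as tf

Z = tf.random.normal((3, 3))
X = tf.random.normal((3, 2))

with tf.GradientTape() as tape:
    tape.watch(X)
    tape.watch(Z)
    y = tf.reduce_mean(tf.math.cos(X) @ tf.transpose(X) + Z)

grad_X, grad_Z = tape.gradient(y, [X, Z])

# y averages 9 entries and Z is simply added, so each dy/dZ[i, j] equals 1/9.
print(grad_Z.numpy())

# A gradient always has the same shape as the tensor it refers to.
print(grad_X.shape)  # (3, 2)
```
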

In [None]:
# Executing the gradient twice gives an error, because resources
# are immediately freed as soon as the gradient is computed: try it by
# uncommenting this line.
# tape.gradient(y, [X, Z])
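
If several gradients are needed from the same recorded computation, one option is a persistent tape. The sketch below is a minimal illustration with two toy functions (not part of the lab) whose gradients are known in closed form.

```python
import tensorflow as tf

X = tf.random.normal((3, 2))

# persistent=True keeps the recorded operations after the first gradient call.
with tf.GradientTape(persistent=True) as tape:
    tape.watch(X)
    y = tf.reduce_sum(X ** 2)
    z = tf.reduce_sum(tf.math.sin(X))

dy = tape.gradient(y, X)  # equals 2 * X
dz = tape.gradient(z, X)  # equals cos(X); a second call is allowed here

# A persistent tape holds its resources until it is deleted.
del tape
```
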

In [None]:
# Comparison of the same file operation performed with and without a context manager.
'''
With
with open(...) as f:
  f.read()

Without
f = open(...)
f.read()
f.close()
'''

In [None]:
# The same computation without a context manager. This relies on the private method
# _push_tape(), so it is shown for illustration only: prefer the `with` form above.
tape = tf.GradientTape()
tape._push_tape()
tape.watch(X)
tape.watch(Z)
y = tf.reduce_mean(tf.math.cos(X) @ tf.transpose(X) + Z)
tape.gradient(y, X)

#### Part 4: Logistic regression

In [None]:
import tensorflow_datasets as tfds

In [None]:
# See here to learn more about the dataset: https://www.tensorflow.org/datasets/catalog/penguins
train_data = tfds.load('penguins', as_supervised=True, split='train[0:80%]')
test_data = tfds.load('penguins', as_supervised=True, split='train[80%:]')

In [None]:
# Datasets in TF are built as iterators over the single elements (we will see more about
# tf.data in the next lecture).
train_data

In [None]:
# Batching with a size larger than the dataset puts every example in a single batch,
# which get_single_element() then returns as a pair of tensors.
Xtrain, ytrain = train_data.batch(1000).get_single_element()
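
The same pattern can be tried on a small in-memory dataset, which avoids any download. The tensors `features` and `labels` below are made-up data used only to show the shapes involved.

```python
import tensorflow as tf

# Hypothetical toy data: 5 examples with 4 features each.
features = tf.random.normal((5, 4))
labels = tf.constant([0, 2, 1, 1, 0], dtype=tf.int64)

ds = tf.data.Dataset.from_tensor_slices((features, labels))

# Iterating over the dataset yields one (features, label) pair at a time.
for xi, yi in ds.take(1):
    print(xi.shape, yi.shape)  # (4,) ()

# With a batch size at least as large as the dataset there is exactly one batch,
# and get_single_element() returns it as a pair of tensors.
X_all, y_all = ds.batch(1000).get_single_element()
print(X_all.shape, y_all.shape)  # (5, 4) (5,)
```
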

In [None]:
# Each row is an example, each column a feature of the input.
Xtrain.shape

In [None]:
Xtrain

In [None]:
# Each element is the true class of the corresponding row in X.
print(ytrain.shape)
print(ytrain[0:10])

In [None]:
ytrain

In [None]:
# We one-hot encode the y tensor, which turns it into a (n, 3) tensor.
ytrain = tf.one_hot(tf.cast(ytrain, tf.int32), 3)
print(ytrain.shape)
print(ytrain[0:5])
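
As a small illustration of the encoding, the sketch below one-hot encodes three made-up labels and recovers the original indices with `tf.argmax`.

```python
import tensorflow as tf

labels = tf.constant([0, 2, 1], dtype=tf.int64)

# Each label becomes a row with a single 1 in the column of its class.
onehot = tf.one_hot(labels, 3)
print(onehot.numpy())

# argmax along the class axis maps each row back to its index.
recovered = tf.argmax(onehot, 1)
print(recovered.numpy())
```
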

In [None]:
# We do the same for the test part of the dataset.
Xtest, ytest = test_data.batch(1000).get_single_element()
ytest = tf.one_hot(ytest, 3)

In [None]:
def init():
  # Initialize the parameters of the logistic regression model.
  # Any tensor wrapped inside tf.Variable is automatically watched inside the GradientTape.
  W = tf.Variable(tf.random.normal((4, 3)))
  b = tf.Variable(tf.random.normal((3,)))
  return W, b

In [None]:
def logreg(X, W, b):
  # Logistic regression model (note how the softmax is applied row-wise).
  return tf.nn.softmax(X @ W + b, 1)

In [None]:
W, b = init()
ypred = logreg(Xtrain, W, b)
print(ypred.shape)
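
To see what "row-wise" means for the softmax, the following sketch applies it to two hand-picked rows of logits (illustrative values, not taken from the dataset) and checks that each row is a valid probability distribution.

```python
import tensorflow as tf

logits = tf.constant([[1.0, 2.0, 3.0],
                      [0.0, 0.0, 0.0]])

# Axis 1 normalizes each row independently.
probs = tf.nn.softmax(logits, 1)

# Every row sums to 1.
print(tf.reduce_sum(probs, axis=1).numpy())

# Equal logits give a uniform distribution over the 3 classes.
print(probs[1].numpy())
```
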

In [None]:
def cross_entropy(ytrue, ypred):
  """ Compute the average cross-entropy over the elements.
  Inputs:
  - ytrue (n, 3): one-hot encoded tensor of the correct labels.
  - ypred (n, 3): output of the logistic regression models (after the softmax).

  Returns a scalar which is the average cross-entropy.
  """
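  # Sum over the classes for each example, then average over the n examples.
  # (A plain mean over all n * 3 entries would return the cross-entropy divided by 3.)
  return -tf.reduce_mean(tf.reduce_sum(ytrue * tf.math.log(ypred), axis=1))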

In [None]:
cross_entropy(ytrain, ypred)
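
One way to sanity-check a hand-written cross-entropy is to compare it with the Keras implementation on a few fixed values. The sketch below sums over the classes of each example before averaging; a plain mean over all n * 3 entries gives the same quantity divided by 3. The labels and logits are made up for the check.

```python
import tensorflow as tf

ytrue = tf.one_hot([0, 2, 1], 3)
logits = tf.constant([[2.0, 0.5, -1.0],
                      [0.1, 0.2, 0.3],
                      [-1.0, 3.0, 0.0]])
ypred = tf.nn.softmax(logits, 1)

# Per-example cross-entropy (sum over classes), averaged over the examples.
manual = -tf.reduce_mean(tf.reduce_sum(ytrue * tf.math.log(ypred), axis=1))

# Reference value from Keras, which also returns one loss per example.
reference = tf.reduce_mean(tf.keras.losses.categorical_crossentropy(ytrue, ypred))

# Mean over all n * 3 entries: same quantity scaled by 1/3.
per_entry = -tf.reduce_mean(ytrue * tf.math.log(ypred))

print(float(manual), float(reference), float(per_entry))
```
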

In [None]:
def accuracy(ytrue, ypred):
  """ Compute the average accuracy over the elements. Input parameters are
      the same as for the cross-entropy loss function.
  """
  # Note the casting operation, since we cannot take the average of a boolean vector.
  return tf.reduce_mean(
        tf.cast(tf.argmax(ytrue, 1) == tf.argmax(ypred, 1), tf.float32)
      )

In [None]:
accuracy(ytrain, ypred)

In [None]:
import matplotlib.pyplot as plt

losses = []
accuracies = []

W, b = init()

for i in range(5000):

    with tf.GradientTape() as tape:

        # Get the predictions of the model
        ypred = logreg(Xtrain, W, b)

        # Compute the loss
        loss = cross_entropy(ytrain, ypred)

    # Compute the gradients
    gradients = tape.gradient(loss, [W, b])

    # Apply the gradients
    W.assign_sub(0.1*gradients[0])
    b.assign_sub(0.1*gradients[1])

    # This alternative is incorrect: the subtraction returns a plain tensor, so W and b
    # stop being tf.Variable objects and the next tape.gradient call returns None for them.
    # W = W - 0.1*gradients[0]
    # b = b - 0.1*gradients[1]

    # Track interesting quantities
    losses.append(loss.numpy())
    accuracies.append(accuracy(ytrain, ypred).numpy())

plt.plot(losses, label='loss')
plt.plot(accuracies, label='accuracy')
plt.legend()
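
The difference between `assign_sub` and rebinding the name can be checked in isolation. The sketch below uses a constant tensor `g` as a stand-in for a gradient (a hypothetical value, chosen only for the demonstration).

```python
import tensorflow as tf

W = tf.Variable(tf.ones((4, 3)))
g = tf.fill((4, 3), 0.5)  # hypothetical gradient

# In-place update: W is still the same tf.Variable.
W.assign_sub(0.1 * g)
print(isinstance(W, tf.Variable))  # True

# Subtraction creates a new plain tensor instead of updating the variable.
W_new = W - 0.1 * g
print(isinstance(W_new, tf.Variable))  # False

# A plain tensor is not watched automatically, so its gradient is None.
with tf.GradientTape() as tape:
    loss = tf.reduce_sum(W_new ** 2)
grad = tape.gradient(loss, W_new)
print(grad)  # None
```
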

In [None]:
print(accuracy(ytest, logreg(Xtest, W, b)))

In [None]:
# Average weight of each input feature (mean over the 3 classes).
W.numpy().mean(-1)

#### Part 5: Logistic regression, reloaded
This is similar to before, but we will use a few high-level components from TensorFlow instead of reimplementing them ourselves.

In [None]:
# Most high-level components are inside the tf.keras module.
from tensorflow import keras

In [None]:
# Layers are defined by how you initialize their variables,
# and how they process data, similar to init / logreg above.
# Learn more here: https://www.tensorflow.org/api_docs/python/tf/keras/layers/Layer
model = keras.layers.Dense(3)

In [None]:
# Skip this if you have not read L4 (Fully-connected models).
# You can easily replace the logistic-regression model with a fully-connected
# model by using a Sequential Model from TensorFlow to "stitch"
# together two different fully-connected blocks.
# https://www.tensorflow.org/api_docs/python/tf/keras/Sequential
# model = keras.Sequential([
#    keras.layers.Dense(50, activation=keras.activations.relu),
#    keras.layers.Dense(3)
# ])

In [None]:
# This is an alternative syntax (the model is equivalent
# to the one above).
# model = keras.Sequential()
# model.add(keras.layers.Dense(50, activation=keras.activations.relu))
# model.add(keras.layers.Dense(3))

In [None]:
# The variables are lazily created only the first time the model is called (see below).
# Note: this will raise an error for the tf.keras.Sequential model.
model.variables

In [None]:
model(Xtrain).shape

In [None]:
# After the first call the layer owns a kernel and a bias.
# Note that by default the biases are initialized to zero.
[v.shape for v in model.variables]

In [None]:
# We move back to an index-based representation.
ytrain = tf.argmax(ytrain, 1)
ytest = tf.argmax(ytest, 1)

In [None]:
ypred = model(Xtrain)

In [None]:
# Functional variant of the cross-entropy (see the slides for more information
# about the different variants).
tf.reduce_mean(
    keras.losses.sparse_categorical_crossentropy(ytrain, ypred, from_logits=True)
)

In [None]:
# Object-oriented variant.
keras.losses.SparseCategoricalCrossentropy(from_logits=True)(ytrain, ypred)
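
The `from_logits=True` flag tells the loss that the model outputs have not yet gone through a softmax. As a minimal check on made-up values, the sketch below compares the loss computed from logits with the loss computed from explicitly normalized probabilities.

```python
import tensorflow as tf

y = tf.constant([0, 2, 1], dtype=tf.int64)
logits = tf.constant([[2.0, 0.5, -1.0],
                      [0.1, 0.2, 0.3],
                      [-1.0, 3.0, 0.0]])

# The softmax is applied inside the loss.
loss_from_logits = tf.keras.losses.sparse_categorical_crossentropy(
    y, logits, from_logits=True)

# The softmax is applied by hand, and the loss receives probabilities.
loss_from_probs = tf.keras.losses.sparse_categorical_crossentropy(
    y, tf.nn.softmax(logits, 1))

# Both variants return one loss value per example, and the values agree.
print(loss_from_logits.numpy())
print(loss_from_probs.numpy())
```

Working from logits is usually preferred because the loss can then combine the softmax and the logarithm in a numerically stable way.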

In [None]:
# Note: these names shadow the cross_entropy and accuracy functions defined in Part 4.
cross_entropy = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
accuracy = keras.metrics.SparseCategoricalAccuracy()
optimizer = keras.optimizers.SGD(learning_rate=0.1)

In [None]:
losses = []
accuracies = []

for i in range(5000):

    with tf.GradientTape() as tape:

        ypred = model(Xtrain)
        loss = cross_entropy(ytrain, ypred)

    # --> First difference: we differentiate w.r.t. all variables.
    gradients = tape.gradient(loss, model.trainable_variables)

    # --> We use the optimizer to apply the gradients.
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))

    losses.append(loss.numpy())
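    # Keras metric objects are stateful and accumulate results across calls, so we
    # reset the metric to record the accuracy of the current iteration only.
    accuracy.reset_state()
    accuracies.append(accuracy(ytrain, ypred).numpy())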

plt.plot(losses, label='loss')
plt.plot(accuracies, label='accuracy')
plt.legend()

#### Exercises

1. Modify Part 4 to use the index-based version of y instead of the one-hot encoded version (hint: you need to suitably modify the `cross_entropy` and `accuracy` functions).

2. Optimizers have a `minimize()` function, which combines the computation of the gradient with the gradient descent step:
https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Optimizer#usage. Rewrite the training loop in Part 5 to use `minimize` instead of `apply_gradients`.

3. Momentum is a simple technique to improve the convergence speed of gradient descent. The key idea is to update each variable using a weighted combination of the current gradient and the update applied at the previous iteration (see Section 12.6 in the book). The weighting parameter is called the momentum weight. Implement momentum in the codelab, using a weight of 0.5, both in Part 4 (manually) and in Part 5 (using the `momentum` parameter of `SGD`).