In [1]:
# Setup for Keras

# Common imports
import numpy as np
import os
import pandas as pd
import sklearn

# Select the backend before importing keras; the setting is read at import time
os.environ["KERAS_BACKEND"] = "tensorflow"
#os.environ["KERAS_BACKEND"] = "torch"

import tensorflow as tf
import keras #requirement: keras 3

# to make this notebook's output stable across runs
np.random.seed(42)
keras.utils.set_random_seed(42)

print(tf.__version__) #requirement: >= 2.15

# Where to save the models
PROJECT_ROOT_DIR = "."
MODEL_PATH = os.path.join(PROJECT_ROOT_DIR, "models")
os.makedirs(MODEL_PATH, exist_ok=True)


# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

2.19.0


First check GPU availability: 

In [2]:
tf.config.list_physical_devices('GPU')

[]

# Practical 1: MLP with Keras (Fashion MNIST)

In this first practical, we train a simple MLP on the Fashion MNIST data and monitor the training progress in TensorBoard.
We experience the vanishing/exploding gradient problem with deep MLPs firsthand and try out learning rate schedules.  

**Aim:** Get a basic understanding of MLPs, their dimensions, and activation functions; get to know the basic principles of Keras; and monitor the training process with TensorBoard. 

First, we load the data. 

In [None]:
(X_train_full, y_train_full), (X_test, y_test) = keras.datasets.fashion_mnist.load_data()


Split the full training set and its labels into a validation set `X_valid` (5000 instances) with labels `y_valid` and a training set `X_train` (the rest) with labels `y_train`. 

Find out, in any way you like, how many different classes there are in the Fashion MNIST data. 

Data preprocessing: Scale all inputs (test, train and validation set) to mean 0 and standard deviation 1 first.
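
One point that is easy to miss in the scaling step: the mean and standard deviation are usually computed on the training set only and then reused for the validation and test sets. The dependency-free sketch below illustrates this on made-up numbers. The names `standardize`, `train`, and `test` are hypothetical, and with real image arrays you would more likely use NumPy's `mean()` and `std()`.

```python
import statistics

def standardize(values, mean, std):
    """Shift and scale values with statistics computed elsewhere."""
    return [(v - mean) / std for v in values]

# Toy data (made up for illustration, not Fashion MNIST pixels)
train = [0.0, 50.0, 100.0, 150.0, 200.0]
test = [25.0, 175.0]

# Statistics come from the training data only
train_mean = statistics.fmean(train)
train_std = statistics.pstdev(train)

train_scaled = standardize(train, train_mean, train_std)
test_scaled = standardize(test, train_mean, train_std)  # reuse train statistics
```

After scaling, the training values have mean 0 and standard deviation 1, while the test values are only approximately standardized, which is expected.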

Now set the random seeds for Keras and NumPy to 42 to make outputs stable across runs. 

Also, in case you run the code several times, call `clear_session()` before setting the random seeds so that Keras starts from a clean state. 

Then, define a model for the **Fashion MNIST classification problem** using the Keras Sequential API, passing all layers to `Sequential()`. The model should do the following: 
- First, turn the 2D input into a vector (Flatten).
- First hidden layer: 300 units. To set the initialization to HeNormal, use the keyword `kernel_initializer`; see the [documentation](https://keras.io/api/layers/initializers/).
- Activation function for the first hidden layer: LeakyReLU.
- Second hidden layer: 100 units, again with HeNormal initialization.
- Activation function for the second hidden layer: LeakyReLU.
- A classification layer (what does this have to look like?). Determine yourself how many output neurons you need! 

**Question:** Explain why He initialization is used and why it stabilizes training at least at the beginning! 

**Answer:** #TODO



**Question:** How many parameters do you have in this not very deep NN? Derive the number by computation.

**Answer:** #TODO

Check your result by getting the model's summary. 
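
For the hand computation, it may help to recall that a dense layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases. The sketch below applies that rule to a deliberately tiny, made-up network (4 inputs, 3 hidden units, 2 outputs), so the exercise itself is left to you. `dense_param_count` is a hypothetical helper, not a Keras function.

```python
def dense_param_count(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes lists the number of units per layer, input first.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix plus bias vector
    return total

# Toy network: 4 inputs -> 3 hidden units -> 2 outputs
toy_params = dense_param_count([4, 3, 2])
print(toy_params)
```

Note that the Flatten layer only reshapes its input and therefore contributes no parameters of its own.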

**Question:** What is the advantage of LeakyReLU
- over the sigmoid function?
- over ReLU?

If you are not sure how to set the slope in the negative part, what could you do?

**Answer:** #TODO
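
As a reminder of the definitions behind this question, here is a minimal plain-Python sketch of ReLU and LeakyReLU. The slope value 0.1 is an arbitrary choice for illustration, not a recommendation, and the function names are hypothetical rather than Keras API.

```python
def relu(x):
    """Return x for positive inputs and 0 otherwise."""
    return max(0.0, x)

def leaky_relu(x, negative_slope=0.1):
    """Like ReLU, but negative inputs are scaled by negative_slope instead of zeroed."""
    return x if x > 0 else negative_slope * x

# Compare both functions on a negative, a zero, and a positive input
for x in (-2.0, 0.0, 3.0):
    print(x, relu(x), leaky_relu(x))
```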


Compile the model using SGD; additionally ask the model to output accuracy as a metric. 

In [None]:
# compiling the model

### Setting up TensorBoard for Monitoring the Training Process

We want to log our training process and visualize it with TensorBoard to see if the models overfit or how well training works. 


Start TensorBoard with the code below and open `localhost:6006` in your browser.

In [None]:
%load_ext tensorboard
%tensorboard --logdir=./my_logs --port=6006

We want to create log files for TensorBoard with the TensorBoard callback. 
First, we define the path of the log files:

In [None]:
root_logdir = os.path.join(PROJECT_ROOT_DIR, "my_logs")

def get_run_logdir():
    import time
    run_id = time.strftime("run_%Y_%m_%d-%H_%M_%S")
    return os.path.join(root_logdir, run_id)

run_logdir = get_run_logdir()
run_logdir

Now, define the TensorBoard callback.
Also, define a `ModelCheckpoint` callback so that the model can be rolled back to the best one.

Attention! Keras' ModelCheckpoint sometimes raises sporadic errors if the file already exists. Therefore (just in case you run this code several times), add some code that deletes the saved model file, if it exists, before you define the checkpoint. 

Also: if Keras still gives you an error on the ModelCheckpoint callback, restart the kernel and try again. 
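
A minimal sketch of the cleanup step, using only the standard library. The file name `best_model.keras` and the helper `remove_if_exists` are hypothetical; substitute whatever path you pass to the checkpoint callback.

```python
import os

def remove_if_exists(path):
    """Delete a stale file; return True if something was removed."""
    if os.path.isfile(path):
        os.remove(path)
        return True
    return False

# Hypothetical checkpoint location inside the "models" directory
checkpoint_path = os.path.join(".", "models", "best_model.keras")
removed = remove_if_exists(checkpoint_path)
```

Checking with `os.path.isfile` first means the call is safe on the very first run, when no checkpoint has been written yet.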

### Model Training

Fit the model by training for 5 epochs, using the TensorBoard callback and the checkpoint callback. 
Then roll back to the best model by loading it from the checkpoint file.

Evaluate the model on the test data:

IMPORTANT NOTE: These trained models will not be used later on, which is why we don't save them. However, if you want, you can save all the trained models. 

Now we modify our model.
Reset the random seeds. 

### Deep Model - Vanishing/Exploding Gradients

Let's make the above model a lot deeper (>60 layers!) and compare the results. However, for deep models, computing the gradient can be unstable: for 60 layers, computing the gradient boils down to applying the chain rule at least 60 times! 
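
To see why depth matters, consider a rough back-of-the-envelope sketch: if backpropagation multiplies the gradient by a similar factor at each of 60 layers, the result shrinks or grows geometrically. The per-layer factors 0.9 and 1.1 below are made-up numbers, and real networks multiply by Jacobian matrices rather than scalars, so this is only an intuition aid. `gradient_scale` is a hypothetical helper.

```python
def gradient_scale(per_layer_factor, n_layers=60):
    """Product of n_layers identical scalar factors, as a stand-in for the chain rule."""
    scale = 1.0
    for _ in range(n_layers):
        scale *= per_layer_factor
    return scale

shrinking = gradient_scale(0.9)  # factors slightly below 1: the product vanishes
growing = gradient_scale(1.1)    # factors slightly above 1: the product explodes
print(f"{shrinking:.4f} {growing:.1f}")
```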

In the following cell, use the Keras Sequential API with `model.add` to build the same model as above, with the following modification: 
- repeat the 100-unit dense layer before the output layer 60 times

**Question:**
What is the number of trainable parameters now? Again, first use theory to arrive at a number before you check it with code.

**Answer:** #TODO



Compile the model with SGD and accuracy as an additional metric.

Now let's train it for 5 epochs, again with (only) the TensorBoard callback, for which we create a new log directory path. 

Evaluate the model. What might be the problem? #TODO

Reset random seeds. 

### A theoretical exercise on MLPs

**Question:**

Suppose we have a shallow regression network with 1 input unit, 2 hidden units with activation function $\sin$, and 1 sigmoid output unit, all with biases. 
- Draw the architecture. How many parameters are there? 
- Write the pre-activation, activation and output for one input $x$ in terms of the weights and biases.


**Answer:** #TODO


**Question 2:**

Suppose you have a Feedforward Network for 5-class Classification and: 
- 10 input features, 
- one hidden layer with 100 units with ReLU activation (and biases)
- a second hidden layer with 200 units with Parametric ReLU activation (and biases)
- a third hidden layer with 50 units and SELU activation (and biases)
- an output layer 



1. What does the output layer have to look like for 5-class classification?

#TODO

2. Fill out the x-es in the following table: 

#TODO


|Layer|number of weights|number of biases|additional parameters|
|:---|:---|:---|:---|
|first layer|x|x|x|
|second layer|x|x|x|
|third layer|x|x|x|
|output layer|x|x|x|

3. Let $W_i, b_i$ be the weight matrix and bias vector of the $i$th layer. Write the activations of all layers as a function of the output vector of the layer below (denote the activations, i.e. the outputs, of layer $i$ by $z_i$). 
   
|Layer|Output|
|:---|:---|
|first layer|x|
|second layer|x|
|third layer|x|
|output layer|x|
   
4. Write the output in terms of matrices and vectors for 
- only one instance
- a mini-batch of size 128 called X

and give the dimensions of all matrices and vectors you use! Here you can write $B_i$ for the matrix with the bias vector $b_i$ in each row.

#TODO
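
When working out the dimensions, a small shape calculator can serve as a cross-check. The sketch below assumes the row-vector convention $Z_i = f(Z_{i-1} W_i + B_i)$, with one instance per row, and uses a made-up network (3 inputs, layers of 4 and 2 units, batch of 8) rather than the one in the question. `layer_shapes` is a hypothetical helper.

```python
def layer_shapes(n_inputs, layer_units, batch_size):
    """Return (W, B, Z) shapes per layer for Z = f(Z_prev @ W + B)."""
    shapes = []
    n_in = n_inputs
    for n_out in layer_units:
        shapes.append({
            "W": (n_in, n_out),        # weight matrix
            "B": (batch_size, n_out),  # bias vector repeated in every row
            "Z": (batch_size, n_out),  # layer output
        })
        n_in = n_out  # this layer's output feeds the next layer
    return shapes

# Toy example: 3 input features, layers with 4 and 2 units, batch of 8
toy_shapes = layer_shapes(3, [4, 2], batch_size=8)
for i, s in enumerate(toy_shapes, start=1):
    print(i, s)
```

Setting `batch_size=1` gives the shapes for a single instance under the same convention.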

**Question 3:**

Suppose you have a regression feedforward network to predict tomorrow's temperature, wind speed, and air pressure at midday. The network has
- 20 input features, 
- one hidden layer with 30 units 
- a second hidden layer with 40 units 
- an output layer 

where all hidden layers use sigmoid activation functions.

1. What is the number of parameters of each layer?
   #TODO

2. If the network does not learn, what could be the problem?
   #TODO

3. What are the dimensions of the weight matrices $W_1, W_2, W_3$ and biases $b_1,b_2,b_3$?
   #TODO

4. Write down the formula for the output for a single input $x$.
   #TODO


**Question 4:**

Suppose you have an MLP for binary classification without biases, and for some input $x$ we know the following about the last hidden layer: 
- all pre-activations in this hidden layer are negative, 
- the activation function of the hidden layer is a ReLU. 

What is the output for input $x$?