# Hackathon #3

Written by Eleanor Quint

Topics:
- Subclassing `tf.Module`
- Saving and loading TensorFlow models
- Running TensorFlow-based Python programs on Crane
- Overfitting, regularization, and early stopping

This is all set up in an IPython notebook so you can run any code you want to experiment with. Feel free to edit any cell, or add new ones to run your own code.

In [None]:
# We'll start with our library imports...
from __future__ import print_function

import numpy as np                 # to use numpy arrays
import tensorflow as tf            # to specify and run computation graphs
import tensorflow_datasets as tfds # to load training data
import matplotlib.pyplot as plt    # to visualize data and draw plots
from tqdm import tqdm              # to track progress of loops

DATA_DIR = './tensorflow-datasets/'

### Subclassing `tf.Module`

The most flexible way to specify a model is by subclassing `tf.Module`. This allows a model to be specified in a general way. The key function to implement is `__call__`, which should take in data, run the model forward, and return the output. The function will frequently be decorated with `tf.function` (when not debugging the model) so that it can be compiled and run more quickly.

An example implementing a dense layer (generally you should use `tf.keras.layers.Dense`):

In [None]:
class Dense(tf.Module):
    def __init__(self, output_size, activation=tf.nn.relu, name=None):
        super(Dense, self).__init__(name=name) # remember this call to initialize the superclass
        self.output_size = output_size
        self.activation = activation
        self.is_built = False

    def build(self, x):
        input_dim = x.shape[-1]
        self.w = tf.Variable(
          tf.random.normal([input_dim, self.output_size]), name='w')
        self.b = tf.Variable(tf.zeros([self.output_size]), name='b')
        self.is_built = True

    @tf.function
    def __call__(self, x):
        if not self.is_built:
            self.build(x)
        y = tf.matmul(x, self.w) + self.b
        return self.activation(y)

# Create an instance of the layer
dense_layer = Dense(10)
# Call the model by passing the input to it
dense_layer(tf.ones([32,3]))
# We can get the variables of a Module, for use in calculating and applying gradients, with `.trainable_variables`
dense_layer.trainable_variables
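
To make the role of `.trainable_variables` concrete, here is a minimal sketch of computing and applying gradients by hand. The `Linear` module, the toy data, and the learning rate are all hypothetical and chosen only for illustration; in practice you would usually hand the gradients to an optimizer rather than calling `assign_sub` yourself.

```python
import tensorflow as tf

class Linear(tf.Module):
    """Hypothetical one-feature linear model, y = x @ w + b."""
    def __init__(self, name=None):
        super().__init__(name=name)
        self.w = tf.Variable(tf.zeros([1, 1]), name='w')
        self.b = tf.Variable(tf.zeros([1]), name='b')

    def __call__(self, x):
        return tf.matmul(x, self.w) + self.b

# Toy data generated from y = 2x + 1 (an assumption for this sketch)
x = tf.constant([[0.], [1.], [2.], [3.]])
y = 2. * x + 1.

model = Linear()
learning_rate = 0.05

def mse(model, x, y):
    return tf.reduce_mean(tf.square(model(x) - y))

initial_loss = float(mse(model, x, y))
for _ in range(50):
    # Record the forward pass so gradients can be computed afterwards
    with tf.GradientTape() as tape:
        loss = mse(model, x, y)
    # One gradient per variable, in the same order as trainable_variables
    grads = tape.gradient(loss, model.trainable_variables)
    # Plain gradient descent: move each variable against its gradient
    for var, grad in zip(model.trainable_variables, grads):
        var.assign_sub(learning_rate * grad)
final_loss = float(mse(model, x, y))
print("loss before:", initial_loss, "loss after:", final_loss)
```

The same pattern works for any `tf.Module`, because `trainable_variables` collects the variables of the module and of any sub-modules stored as attributes.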

### Saving and Loading TensorFlow models

There are two main ways to save and load TF models: `tf.train.Checkpoint` and the SavedModel format (via `tf.saved_model`). First, we'll look at `tf.train.Checkpoint`. It's better suited to saving progress during training than to serving models, because it saves and loads only the variables, not the structure of the model. To use it, you must therefore first instantiate the model from Python code and then load the variables.

In [None]:
# we have to pass in the model (and anything else we want to save) as a kwarg
dense_layer = Dense(10)
optimizer = tf.keras.optimizers.Adam()
checkpoint = tf.train.Checkpoint(model=dense_layer, optimizer=optimizer)

# Save a checkpoint to /tmp/training_checkpoints-{save_counter}. Every time
# checkpoint.save is called, the save counter is increased.
save_dir = checkpoint.save('/tmp/training_checkpoints')

print("The save path is", save_dir)
# Restore the checkpointed values to the `model` object.
status = checkpoint.restore(save_dir)
# we can check that everything loaded correctly, this is silent if all is well
status.assert_consumed()

The other way to save your model is the SavedModel format (via `tf.saved_model.save`), which saves both the variables and the structure of the model. Specifically, it only saves methods that have been traced with `tf.function`.

In [None]:
# We can manually trace a function that has been decorated with tf.function using get_concrete_function
# We pass in the call signature. Here, None means that any number could fill in.
# Typically we don't have to do this explicitly, unless we want the None in the first dimension (as in the homework)
dense_layer = Dense(10)
fn = dense_layer.__call__.get_concrete_function(
    x=tf.TensorSpec([None, 3], tf.float32))

# We can call the function we traced
fn(tf.zeros([1,3]))
print(fn(tf.ones([3,3])))

tf.saved_model.save(dense_layer, '/tmp/saved_model')

del dense_layer

restored_dense = tf.saved_model.load('/tmp/saved_model')
# this should be the same result as above
print(restored_dense(tf.ones([3,3])))

### Running TensorFlow-based Python programs on Crane

#### 1. Get a shell on Crane
To access a shell on Crane's login node, you can run `ssh <username>@crane.unl.edu` or visit `crane.unl.edu` in a browser. If you use the browser, log in and then use the dropdown: `Clusters > Crane Shell Access`. Because this shell is on the login node, you shouldn't run your jobs in it directly. Instead, submit your job to a cluster node to be run with a GPU.

#### 2. Get the slurm submit script and set up your anaconda environment
Try this out by saving the following bash script as `submit_gpu.sh` and changing `<your env name>` to the name of your Anaconda environment. You should create the Anaconda environment by following the [Crane docs](https://hcc.unl.edu/docs/applications/user_software/using_anaconda_package_manager/#creating-custom-gpu-anaconda-environment), replacing the first `module load` command with `module load tensorflow-gpu/py38/2.3`.

```bash
#!/bin/sh
#SBATCH --time=6:00:00          # Maximum run time in hh:mm:ss
#SBATCH --mem=16000             # Maximum memory required (in megabytes)
#SBATCH --job-name=default_479  # Job name (to track progress)
#SBATCH --partition=cse479      # Partition on which to run job
#SBATCH --gres=gpu:1            # Don't change this, it requests a GPU

module load anaconda
conda activate <your env name>
# This line runs everything that is put after "sbatch submit_gpu.sh ..."
$@
```

#### 3. Submit a job
Once you've got your script, you can run it like so to submit a job: `sbatch submit_gpu.sh python <filepath>.py`. This will submit a job to the `cse479` partition. Each student is only allowed one job at a time on this partition, and you can check the status of your jobs with `squeue -u <username>`. If you would like to submit more than one job (which we generally encourage), you can submit the extras to the `cse479_preempt` partition by substituting in the line `#SBATCH --partition=cse479_preempt`. You can submit an unlimited number of jobs to this queue, but they may be interrupted any time another student needs a GPU on the `cse479` partition and no others are available. Thus, we recommend saving your model periodically so that you can restart training from where it was interrupted rather than having to start all over again.

You can cancel jobs with `scancel <JOBID>`, where the job ID is the number associated with the job you can see in squeue or right after you submit the job, or with `scancel -u <username>` which cancels all your running jobs. For more details, please visit the [HCC docs](https://hcc-docs.unl.edu/), ask a question on Piazza, or come to office hours.
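
One way to save periodically and resume after a preempted job is `tf.train.CheckpointManager`, which wraps a `tf.train.Checkpoint` and keeps track of the most recent save in a directory. The sketch below is only an illustration under stated assumptions: `TinyModel`, `train`, and `save_every` are hypothetical names, the "training step" is a stand-in that just increments the weights, and a temporary directory is used where a real job would use a path in your own storage.

```python
import tempfile
import tensorflow as tf

class TinyModel(tf.Module):
    """Hypothetical stand-in for a real model."""
    def __init__(self, name=None):
        super().__init__(name=name)
        self.w = tf.Variable(tf.zeros([3, 2]), name='w')

# A real job would use a fixed directory that survives between runs
ckpt_dir = tempfile.mkdtemp()

def train(num_steps, save_every=5):
    model = TinyModel()
    step = tf.Variable(0, dtype=tf.int64, name='step')
    checkpoint = tf.train.Checkpoint(model=model, step=step)
    manager = tf.train.CheckpointManager(checkpoint, directory=ckpt_dir, max_to_keep=2)

    # Resume from the newest checkpoint if one exists, otherwise start fresh
    if manager.latest_checkpoint:
        checkpoint.restore(manager.latest_checkpoint)

    steps_run = 0
    while int(step.numpy()) < num_steps:
        model.w.assign_add(tf.ones([3, 2]))  # stand-in for one real training step
        step.assign_add(1)
        steps_run += 1
        if int(step.numpy()) % save_every == 0:
            manager.save(checkpoint_number=int(step.numpy()))
    return model, int(step.numpy()), steps_run

# First run, as if the job were interrupted after 10 steps
_, step_a, ran_a = train(num_steps=10)
# Second run picks up at step 10 and only does the remaining 5 steps
model, step_b, ran_b = train(num_steps=15)
print("second run executed", ran_b, "steps and finished at step", step_b)
```

Storing the step counter in the checkpoint alongside the model is what lets the second run know how much work is left.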

### Overfitting, regularization, and early stopping

If we have enough parameters in our model, and little enough data, after a long period of training we begin to experience overfitting. Empirically, this is when the loss value of data in training drops significantly below the loss value of the data set aside for testing. It means that the model is looking for patterns specific to the training data that won't generalize to future, unseen data. This is a problem.

Solutions? Here are some first steps to think about:
1. Get more data for the training set
2. Reduce the number of model parameters
3. Regularize the scale of the model parameters
4. Regularize using dropout
5. Early Stopping

We'll go over how to do 3, 4, and 5 here.

#### L2 Regularization
We calculate the L2 loss on the values of the weight matrices, so it doesn't depend on the input values. We add the regularization loss to the total loss so that it's included in the gradient update.

In [None]:
L2_COEFF = 0.1 # Controls how strongly to use regularization

class L2DenseNetwork(tf.Module):
    def __init__(self, name=None):
        super(L2DenseNetwork, self).__init__(name=name) # remember this call to initialize the superclass
        self.dense_layer1 = tf.keras.layers.Dense(32, activation=tf.nn.relu)
        self.dense_layer2 = tf.keras.layers.Dense(10)
        
    def l2_loss(self):
        # Make sure the network has been called at least once to initialize the dense layer kernels
        return tf.nn.l2_loss(self.dense_layer1.kernel) + tf.nn.l2_loss(self.dense_layer2.kernel)

    @tf.function
    def __call__(self, x):
        embed = self.dense_layer1(x)
        output = self.dense_layer2(embed)
        return output
    
# Defining, creating and calling the network repeatedly will trigger a WARNING about re-tracing the function
# So we'll check to see if the variable exists already
if 'l2_dense_net' not in locals():
    l2_dense_net = L2DenseNetwork()
l2_dense_net(tf.ones([1, 100]))

l2_loss = l2_dense_net.l2_loss()                     # calculate l2 regularization loss
cross_entropy_loss = 0.                              # calculate the classification loss
total_loss = cross_entropy_loss + L2_COEFF * l2_loss # and add to the total loss, then calculate gradients
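
To see what the penalty does to the weights, here is a small NumPy sketch. It uses the documented definition of `tf.nn.l2_loss`, which is `sum(t ** 2) / 2`; the weight values and the learning rate are made up for illustration, and the classification loss is left out so only the effect of the regularizer is visible.

```python
import numpy as np

def l2_loss(w):
    """Same quantity tf.nn.l2_loss computes: sum(w ** 2) / 2."""
    return 0.5 * np.sum(np.square(w))

w = np.array([[1.0, -2.0], [3.0, 0.0]])  # hypothetical weight matrix
coeff = 0.1                               # plays the role of L2_COEFF
learning_rate = 0.5                       # hypothetical step size

penalty = coeff * l2_loss(w)      # 0.1 * 0.5 * (1 + 4 + 9)
grad = coeff * w                  # gradient of the penalty with respect to w
w_new = w - learning_rate * grad  # one gradient step on the penalty alone

print("penalty:", penalty)
print("weights after one step:\n", w_new)
```

Because the gradient of the penalty is proportional to the weights themselves, each step scales every weight toward zero by the factor `1 - learning_rate * coeff`, which is why L2 regularization is often described as weight decay.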

#### Dropout

Let's re-specify the network with regularization from [dropout](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dropout). The Dropout layer randomly sets values to 0 with a frequency of the `rate` input (given when the layer is constructed). Inputs not set to 0 are scaled up by 1/(1 - rate) so that the expected sum over all inputs is unchanged.

In [None]:
class DropoutDenseNetwork(tf.Module):
    def __init__(self, name=None):
        super(DropoutDenseNetwork, self).__init__(name=name) # remember this call to initialize the superclass
        self.dense_layer1 = Dense(32)
        self.dropout = tf.keras.layers.Dropout(0.2)
        self.dense_layer2 = Dense(10, activation=tf.identity)

    @tf.function
    def __call__(self, x, is_training):
        embed = self.dense_layer1(x)
        propn_zero_before = tf.reduce_mean(tf.cast(tf.equal(embed, 0.), tf.float32))
        embed = self.dropout(embed, is_training)
        propn_zero_after = tf.reduce_mean(tf.cast(tf.equal(embed, 0.), tf.float32))
        # Note that in a tf.function, we have to use tf.print to print the value of tensors
        tf.print('Zeros before and after:', propn_zero_before, "and", propn_zero_after)
        output = self.dense_layer2(embed)
        return output

# Defining, creating and calling the network repeatedly will trigger a WARNING about re-tracing the function
# So we'll check to see if the variable exists already
if 'drop_dense_net' not in locals():
    drop_dense_net = DropoutDenseNetwork()
drop_dense_net(tf.ones([1, 100]), tf.constant(True))
print("Something to think about: why isn't the difference exactly equal to the proportion we passed to dropout?")
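
The scaling rule can be sketched in a few lines of NumPy. This is an illustration of the idea only, not the Keras implementation; `inverted_dropout` is a hypothetical helper, and the input is a vector of ones so the effect of the scaling is easy to read off.

```python
import numpy as np

def inverted_dropout(x, rate, rng):
    """Zero each element with probability `rate`, scale survivors by 1 / (1 - rate)."""
    keep_mask = rng.random(x.shape) >= rate  # True with probability 1 - rate
    return np.where(keep_mask, x / (1.0 - rate), 0.0)

rng = np.random.default_rng(0)
x = np.ones(100_000)
y = inverted_dropout(x, rate=0.2, rng=rng)

surviving_values = np.unique(y[y != 0])  # every survivor was scaled by 1 / 0.8
print("surviving values:", surviving_values)
print("mean before:", x.mean(), "mean after:", y.mean())
```

The mean after dropout is close to, but not exactly, the mean before it: the scaling preserves the sum on average over many random masks, not for any single mask.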

#### Early Stopping

Each gradient descent update takes only a small step, so we want to look at each input datum many times. How do we know when to stop though? We want to keep training until improvement stops, but because neural networks are non-linear, they might get worse before they get better, so we don't want to stop them after getting worse one time. We'll use the following code to do "early stopping" with patience. We pass in the validation loss (Note: make sure you use validation loss, not training loss. This is important) and `check` will tell us whether we should stop. It will return `True` after the loss hasn't improved for `patience` epochs.

Although you might choose whether or not the last two regularizers are appropriate based on the problem, you should always use early stopping.

In [None]:
class EarlyStopping:
    def __init__(self, patience=5, epsilon=1e-4):
        """
        Args:
            patience (int): how many epochs of not improving before stopping training
            epsilon (float): minimum amount of improvement required to reset counter
        """
        self.patience = patience
        self.epsilon = epsilon
        self.best_loss = float('inf')
        self.epochs_waited = 0
    
    def __str__(self):
        return "Early stopping has waited {} epochs out of {} at loss {}".format(self.epochs_waited, self.patience, self.best_loss)
        
    def check(self, loss):
        """
        Call after each epoch to check whether training should halt
        
        Args:
            loss (float): loss value from the most recent epoch of training
            
        Returns:
            True if training should halt, False otherwise
        """
        if loss < (self.best_loss - self.epsilon):
            self.best_loss = loss
            self.epochs_waited = 0
            return False
        else:
            self.epochs_waited += 1
            return self.epochs_waited >= self.patience
            
early_stop_module = EarlyStopping()
# pass in validation loss at each training epoch
print("Training...")
print("Should we stop training?", early_stop_module.check(1.4))
print(early_stop_module)
print("Training...")
print("Should we stop training?", early_stop_module.check(2.3))
print(early_stop_module)
print("Training...")
print("Should we stop training?", early_stop_module.check(1.2))
print(early_stop_module)
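
As a sketch of how this fits into a training loop, the example below runs a patience-based check over a made-up sequence of validation losses. `PatienceStopper` is a compact, hypothetical stand-in for the class above that stops once the wait counter reaches `patience`; the loss values are invented, and the comment marks where a real loop would save the model.

```python
class PatienceStopper:
    """Compact stand-in for an early stopping helper (hypothetical name)."""
    def __init__(self, patience=3, epsilon=1e-4):
        self.patience = patience
        self.epsilon = epsilon
        self.best_loss = float('inf')
        self.epochs_waited = 0

    def check(self, loss):
        if loss < self.best_loss - self.epsilon:
            self.best_loss = loss
            self.epochs_waited = 0
            return False
        self.epochs_waited += 1
        return self.epochs_waited >= self.patience

# Made-up validation losses, one per epoch
val_losses = [1.00, 0.80, 0.70, 0.72, 0.69, 0.71, 0.70, 0.73, 0.75, 0.60]

stopper = PatienceStopper(patience=3)
best_epoch = None
stopped_epoch = None
for epoch, val_loss in enumerate(val_losses):
    # (a real loop would train for one epoch and compute val_loss here)
    improved = val_loss < stopper.best_loss - stopper.epsilon
    should_stop = stopper.check(val_loss)
    if improved:
        best_epoch = epoch  # a real loop would save a checkpoint here
    if should_stop:
        stopped_epoch = epoch
        break

print("best epoch:", best_epoch, "stopped at epoch:", stopped_epoch)
```

Saving the model whenever the validation loss improves means that, when training halts, the weights from the best epoch are still available rather than the slightly worse ones from the final epoch.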

### Homework

Your homework is to complete the following two tasks:
1. Make sure you're comfortable submitting jobs on Crane and saving models. Submit a job to `cse479` and put the job ID from the submission in a text file, along with your answer to the second task.
2. Think about the question posed above about dropout, "why isn't the difference between the number of zeros before and after applying dropout exactly equal to the dropout proportion?" Consider the network architecture and what operations were run before the dropout layer. Write a few sentences about this in the same text file as the previous question and submit to Canvas.

I'm expecting this to take about an hour (or less if you're experienced). Feel free to use any code from this or previous hackathons. If you don't understand how to do any part of this or if it's taking you longer than that, please let me know in office hours or by email (both can be found on the syllabus). I'm also happy to discuss if you just want to ask more questions about anything in this notebook!