# Hackathon #5

Topics: 
- Transfer learning
    - Changing the final layers of an existing network
    - Stopping gradients and freezing layers
- Batch Normalization

This is all set up in an IPython notebook so you can run any code you want to experiment with. Feel free to edit any cell, or add cells to run your own code.

In [1]:
# We'll start with our library imports...
from __future__ import print_function

import os  # to work with file paths

import tensorflow as tf         # to specify and run computation graphs
import numpy as np              # for numerical operations taking place outside of the TF graph
import matplotlib.pyplot as plt # to draw plots

In [2]:
cifar_dir = '/work/cse479/shared/hackathon/05/cifar/'

# load CIFAR-10
train_images = np.load(cifar_dir + 'cifar10_train_data.npy')
train_images = np.reshape(train_images, [-1, 32, 32, 3]) # `-1` means "everything not otherwise accounted for"
train_labels = np.load(cifar_dir + 'cifar10_train_labels.npy')

test_images = np.load(cifar_dir + 'cifar10_test_data.npy')
test_images = np.reshape(test_images, [-1, 32, 32, 3]) # `-1` means "everything not otherwise accounted for"
test_labels = np.load(cifar_dir + 'cifar10_test_labels.npy')
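As a quick illustration of the `-1` in the reshape calls above, here is a minimal NumPy sketch. The array `flat_images` is a made-up stand-in, not the real CIFAR file, and its flat layout is an assumption for illustration only.

```python
import numpy as np

# Hypothetical stand-in for a flat image array: 5 fake images of 32*32*3 values each
flat_images = np.arange(5 * 32 * 32 * 3, dtype=np.float32).reshape(5, -1)

# -1 lets NumPy infer the batch dimension from the total number of elements
images = np.reshape(flat_images, [-1, 32, 32, 3])
print(images.shape)  # (5, 32, 32, 3)
```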

## Transfer Learning

Let's imagine you have an image classification task that's doable, but you don't have a large enough dataset to train a network without overfitting. What to do? The answer is to use the convolutional stack of a powerful, existing network that you (or Google) have trained on a broad task like classifying [ImageNet](http://www.image-net.org/) ([Wikipedia](https://en.wikipedia.org/wiki/ImageNet)), and to train only a few dense layers on top of it for classification. You get the benefit of good feature extraction without the danger of overfitting such a large network on your small dataset.

In [3]:
# pull out classes 0 and 2 (airplanes and birds respectively)
train_idxs = np.union1d(np.where(train_labels == 0), np.where(train_labels == 2))
test_idxs = np.union1d(np.where(test_labels == 0), np.where(test_labels == 2))

# the subsets we'll be working with
# (dropping the first index works around an off-by-one error I haven't been able to diagnose)
subset_train_data = train_images[train_idxs][1:]
subset_train_labels = train_labels[train_idxs][1:]
subset_test_data = test_images[test_idxs][1:]
subset_test_labels = test_labels[test_idxs][1:]
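One possible source of a stray first index in code like the cell above is sketched below. It assumes the label arrays are stored as column vectors of shape `(N, 1)`, which the later `np.squeeze` calls hint at but which the notebook does not confirm. The `labels` array is made up for illustration.

```python
import numpy as np

# Hypothetical labels stored as a column vector, shape (6, 1)
labels = np.array([[3], [0], [2], [5], [0], [2]])

rows, cols = np.where(labels == 0)   # np.where returns one index array per dimension
print(rows, cols)                    # [1 4] [0 0]

# union1d flattens its inputs, so the all-zero column indices show up as "index 0"
mixed = np.union1d(np.where(labels == 0), np.where(labels == 2))
print(mixed)                         # [0 1 2 4 5]

# Selecting on the flattened labels avoids the stray index
clean = np.flatnonzero(np.isin(labels.ravel(), [0, 2]))
print(clean)                         # [1 2 4 5]
```

Under that assumption, index 0 always appears in the union, whether or not sample 0 belongs to either class. If the labels are instead one-dimensional, this explanation does not apply.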

Let's load a model that's been trained on a similar classification task (the other part of CIFAR-10), and see how it performs. The first thing we have to do is find the tensors we need.

In [4]:
# This is the pretrained model
session = tf.Session()

# Load graph and all variables into memory
saver = tf.train.import_meta_graph(cifar_dir + 'cifar10_network.meta')
saver.restore(session, cifar_dir + 'cifar10_network')

# Print out the operations of the graph
print(session.graph.get_operations())

W0930 16:39:50.333127 47074515312768 deprecation.py:323] From /util/opt/anaconda/deployed-conda-envs/packages/tensorflow/envs/tensorflow-1.14.0-py36/lib/python3.6/site-packages/tensorflow/python/training/saver.py:1276: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.


[<tf.Operation 'input_placeholder' type=Placeholder>, <tf.Operation 'conv_net_2d/conv_2d_0/w/Initializer/truncated_normal/shape' type=Const>, <tf.Operation 'conv_net_2d/conv_2d_0/w/Initializer/truncated_normal/mean' type=Const>, <tf.Operation 'conv_net_2d/conv_2d_0/w/Initializer/truncated_normal/stddev' type=Const>, <tf.Operation 'conv_net_2d/conv_2d_0/w/Initializer/truncated_normal/TruncatedNormal' type=TruncatedNormal>, <tf.Operation 'conv_net_2d/conv_2d_0/w/Initializer/truncated_normal/mul' type=Mul>, <tf.Operation 'conv_net_2d/conv_2d_0/w/Initializer/truncated_normal' type=Add>, <tf.Operation 'conv_net_2d/conv_2d_0/w' type=VariableV2>, <tf.Operation 'conv_net_2d/conv_2d_0/w/Assign' type=Assign>, <tf.Operation 'conv_net_2d/conv_2d_0/w/read' type=Identity>, <tf.Operation 'conv_net_2d/conv_2d_0/w/Regularizer/l2_regularizer/scale' type=Const>, <tf.Operation 'conv_net_2d/conv_2d_0/w/Regularizer/l2_regularizer/L2Loss' type=L2Loss>, <tf.Operation 'conv_net_2d/conv_2d_0/w/Regularizer/l2_re

This is a huge list of operations, but it looks like the operations of interest are `<tf.Operation 'input_placeholder' type=Placeholder>`, `<tf.Operation 'conv_net_2d_1/Relu' type=Relu>`, and `<tf.Operation 'mlp_1/linear_0/add' type=Add>`. The moral of this story is that you should always name your tensors consistently and document your networks clearly.

In [5]:
# This is the output of the convolutional stack <tf.Operation 'conv_net_2d_1/Relu' type=Relu>

graph = session.graph
x = graph.get_tensor_by_name('input_placeholder:0')
conv_out = graph.get_tensor_by_name('conv_net_2d_1/Relu:0')
mlp_out = graph.get_tensor_by_name('mlp_1/linear_0/add:0')

y = tf.placeholder(tf.int32, shape=[None])
conf_matrix_op = tf.confusion_matrix(y, tf.argmax(mlp_out, axis=1), num_classes=10)
conf_matrix = session.run(conf_matrix_op, {x: subset_test_data, y: np.squeeze(subset_test_labels)})

# This is the confusion matrix of the pretrained model on our two classes
# The diagonal is all zeros, so nothing is classified correctly
print(conf_matrix)

[[  0 520   0  28   0   2   0  90 310  49]
 [  0   0   0   0   0   0   0   0   0   0]
 [  0 499   0   5   0   3   0 143 145 205]
 [  0   0   0   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0]]
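To help read the matrix above: rows are the true labels and columns are the predicted labels, so only rows 0 and 2 (airplanes and birds) are populated, and correct predictions would sit on the diagonal. A minimal NumPy sketch of the same counting, with made-up labels and predictions, might look like this:

```python
import numpy as np

def confusion_matrix(labels, predictions, num_classes):
    """Count (true label, predicted label) pairs; rows are true labels."""
    matrix = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(matrix, (labels, predictions), 1)
    return matrix

# Hypothetical labels and predictions for a 3-class problem
toy_labels = np.array([0, 0, 2, 2, 2])
toy_preds = np.array([1, 0, 2, 1, 2])
toy_matrix = confusion_matrix(toy_labels, toy_preds, num_classes=3)
print(toy_matrix)
```

Row 1 of the toy result is all zeros because no example has true label 1, just as the rows for the eight unused classes are empty above.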


This network doesn't classify any of our data correctly. So we're going to replace the classifying dense layer and train it on our data. We'll make sure that the weights in the convolutional stack don't change with [tf.stop_gradient](https://www.tensorflow.org/api_docs/python/tf/stop_gradient). This operation is a special version of the identity that copies data in the forward pass, but stops gradients in the backward pass. This allows the later parts of the network to update while preserving everything preceding, up to the inputs.
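The idea of freezing the early part of a network can be sketched without TensorFlow. The example below is a toy stand-in, not the CIFAR network: `frozen_w` plays the role of the pretrained convolutional stack, `head_w` is the only trainable parameter, and the data, sizes, and learning rate are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: a "pretrained" feature extractor and a small labeled dataset
frozen_w = rng.normal(size=(8, 16))            # plays the role of the convolutional stack
inputs = rng.normal(size=(64, 8))
targets = (inputs[:, 0] > 0).astype(np.float64)

head_w = np.zeros(16)                          # the only parameters we train
frozen_before = frozen_w.copy()

def forward(head):
    features = np.maximum(inputs @ frozen_w, 0.0)        # forward pass through the frozen part
    probs = 1.0 / (1.0 + np.exp(-(features @ head)))
    loss = -np.mean(targets * np.log(probs + 1e-12)
                    + (1 - targets) * np.log(1 - probs + 1e-12))
    return features, probs, loss

_, _, initial_loss = forward(head_w)
for _ in range(300):
    features, probs, _ = forward(head_w)
    # The gradient is computed for the head only; nothing is propagated into frozen_w
    head_w -= 0.01 * features.T @ (probs - targets) / len(targets)
_, _, final_loss = forward(head_w)

print(initial_loss, final_loss)
```

The frozen weights still take part in every forward pass, which is the same division of labor that `tf.stop_gradient` gives us in the graph.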

In [6]:
# As we don't have much training data we train a much smaller network

# tf.stop_gradient allows the forward pass through it, but does not allow gradients to flow back through it
# It acts as a block so backpropagation does not continue into the convolutional stack
# We thus only update parameters after this point (our new dense layer)
conv_out_no_gradient = tf.reshape(tf.stop_gradient(conv_out), [-1, 16*16*32]) # flatten input from 16x16x32
our_dense_layer = tf.layers.dense(conv_out_no_gradient, 2, name="our_dense_layer")

# Operation to create confusion matrix
conf_matrix_op = tf.confusion_matrix(y, tf.argmax(our_dense_layer, axis=1), num_classes=2)

W0930 16:43:51.024467 47074515312768 deprecation.py:323] From <ipython-input-6-3d292f696282>:7: dense (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.dense instead.
W0930 16:43:51.040288 47074515312768 deprecation.py:506] From /util/opt/anaconda/deployed-conda-envs/packages/tensorflow/envs/tensorflow-1.14.0-py36/lib/python3.6/site-packages/tensorflow/python/ops/init_ops.py:1251: calling VarianceScaling.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor


We'll also use [tf.get_collection](https://www.tensorflow.org/api_docs/python/tf/get_collection) to get the variables we've added for initialization. We want to make sure not to run the global variable initializer, because that would reset the variables from the values loaded from the saved model, so we'll use [tf.variables_initializer](https://www.tensorflow.org/api_docs/python/tf/variables_initializer) to only initialize the subset we've added. `tf.get_collection` uses [GraphKeys](https://www.tensorflow.org/api_docs/python/tf/GraphKeys) to get subsets of the graph.
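When a scope string is passed, `tf.get_collection` keeps only the items whose name matches it from the start (the documentation describes the filter as `re.match`). The sketch below imitates that filtering on plain strings. The variable names are hypothetical, written in the style of the ones in this notebook, and `filter_by_scope` is a made-up helper rather than a TensorFlow function.

```python
import re

# Hypothetical variable names, written in the style of the ones in this notebook
variable_names = [
    'conv_net_2d/conv_2d_0/w:0',
    'mlp/linear_0/w:0',
    'our_dense_layer/kernel:0',
    'our_dense_layer/bias:0',
    'optimizer/beta1_power:0',
]

def filter_by_scope(names, scope):
    """Keep the names that start with the given scope (re.match anchors at the start)."""
    return [name for name in names if re.match(scope, name)]

new_variable_names = (filter_by_scope(variable_names, 'optimizer') +
                      filter_by_scope(variable_names, 'our_dense_layer'))
print(new_variable_names)
```

Only the names under the two new scopes are selected, so the restored `conv_net_2d` and `mlp` variables would be left out of the initializer.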

In [7]:
# change data to be a binary classification problem between airplanes and birds (0 and 1)
subset_train_labels[subset_train_labels > 0] = 1
subset_test_labels[subset_test_labels > 0] = 1

with tf.name_scope('optimizer') as scope:
    # define new loss
    # The sparse version takes integer class labels directly (one-hot encoding not needed)
    cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=our_dense_layer)
    optimizer = tf.train.AdamOptimizer()
    train_op = optimizer.minimize(cross_entropy)

# initialize only the new variables (rather than running the global initializer,
# as we have done before, which would overwrite the restored weights)
# These two collections hold the variables created above: the optimizer's and our dense layer's
optimizer_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, "optimizer")
dense_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, "our_dense_layer")

# We would use the name given here if we were planning to save and reload it
session.run(tf.variables_initializer(optimizer_vars + dense_vars, name='init'))
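`tf.nn.sparse_softmax_cross_entropy_with_logits` takes integer class ids rather than one-hot vectors and returns one loss value per example. A minimal NumPy sketch of the same computation, using made-up logits and labels, shows that it agrees with the one-hot formulation:

```python
import numpy as np

def sparse_softmax_cross_entropy(labels, logits):
    """Per-example cross-entropy from integer class labels and raw logits."""
    shifted = logits - logits.max(axis=1, keepdims=True)        # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels]

toy_logits = np.array([[2.0, 0.5], [0.1, 1.2], [0.0, 0.0]])    # hypothetical logits
toy_labels = np.array([0, 1, 1])                                # integer class ids, not one-hot

sparse_loss = sparse_softmax_cross_entropy(toy_labels, toy_logits)

# The same loss written with one-hot labels gives identical values
one_hot = np.eye(2)[toy_labels]
shifted = toy_logits - toy_logits.max(axis=1, keepdims=True)
log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
dense_loss = -(one_hot * log_probs).sum(axis=1)
print(sparse_loss, dense_loss)
```

Nothing here depends on there being only two classes; the same indexing works for any number of columns in the logits.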

Finally, we'll train for one epoch and evaluate like we usually do to see the effect of our transfer learning.

In [8]:
# train for an epoch using this modified network with the block

batch_size = 16
for i in range(subset_train_data.shape[0] // batch_size):
    batch_xs = subset_train_data[i*batch_size:(i+1)*batch_size, :]
    batch_ys = np.squeeze(subset_train_labels[i*batch_size:(i+1)*batch_size])
    session.run(train_op, {x: batch_xs, y: batch_ys})

# and evaluate
# Note this now has only two possible outputs as we only consider two classes
conf_matrix = session.run(conf_matrix_op, {x: subset_test_data, y: np.squeeze(subset_test_labels)})
print('BINARY CONFUSION MATRIX:')
print(conf_matrix)

BINARY CONFUSION MATRIX:
[[586 413]
 [ 61 939]]


Not bad for such a difficult classification problem and such a small dataset. Notice that we went from a 10-class problem to a 2-class one. The output can be totally different from the original, but transfer learning is most useful when the new task stays in the same modality as the original (that is, not moving from RGB to greyscale images, or from natural images to audio).
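To put a number on the result, accuracy and per-class recall can be read off the binary confusion matrix printed above. This is a small NumPy sketch; the variable names are made up, and the matrix values are copied from the output.

```python
import numpy as np

# Binary confusion matrix printed above (rows: true class, columns: predicted class)
binary_conf_matrix = np.array([[586, 413],
                               [ 61, 939]])

accuracy = np.trace(binary_conf_matrix) / binary_conf_matrix.sum()
per_class_recall = np.diag(binary_conf_matrix) / binary_conf_matrix.sum(axis=1)
print(round(accuracy, 3))        # 0.763
print(per_class_recall)
```

The recall values show that most of the errors come from airplanes being labeled as birds, not the other way around.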

Another option, instead of using `tf.stop_gradient`, is to pass an explicit list of trainable variables to [minimize](https://www.tensorflow.org/api_docs/python/tf/train/AdamOptimizer#minimize). Normally, the `var_list` argument is set to all the variables under `GraphKeys.TRAINABLE_VARIABLES`, but we can set it to just the variables we've added.

In [9]:
# Instead of using stop_gradient (in case there are multiple paths back),
# we can pass in a list of variables, and the optimizer only updates those variables

# this trains only the trainable variables (kernel and bias) of the layer we added
train_op = optimizer.minimize(cross_entropy, var_list=tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, "our_dense_layer"))

Add an [l2 regularizer](https://www.tensorflow.org/api_docs/python/tf/nn/l2_loss) to the weight matrix of the old dense layer `<tf.Operation 'mlp/linear_0/w' type=VariableV2>`, then get a train_op with `AdamOptimizer.minimize` that trains that layer (`linear_0`), with the new regularizer on the cross entropy loss of its output with the correct label.

Hints:

1. Remember that 'mlp/linear_0/w' is the name of an Operation, not a Tensor
2. We retrieved the output of the layer above as `mlp_out`
3. No code needs to run, it just needs to be correct
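For reference, `tf.nn.l2_loss` computes half the sum of squared entries of its input. The NumPy sketch below shows that computation and how such a penalty is typically added to a data loss. The weight matrix, the coefficient, and the cross-entropy value are all made up for illustration.

```python
import numpy as np

def l2_loss(weights):
    """NumPy equivalent of tf.nn.l2_loss: sum(weights ** 2) / 2."""
    return np.sum(np.square(weights)) / 2.0

toy_w = np.array([[1.0, -2.0], [0.5, 0.0]])     # hypothetical weight matrix
REG_COEFF = 0.1                                  # hypothetical regularization weight
toy_cross_entropy = 0.7                          # hypothetical data loss

penalty = l2_loss(toy_w)                         # (1 + 4 + 0.25 + 0) / 2 = 2.625
total_loss = toy_cross_entropy + REG_COEFF * penalty
print(penalty, total_loss)
```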

In [10]:

# Start from a clean graph so the import below doesn't duplicate the ops loaded earlier
tf.reset_default_graph()
session = tf.Session()

# Load graph and all variables into memory
saver = tf.train.import_meta_graph(cifar_dir + 'cifar10_network.meta')
saver.restore(session, cifar_dir + 'cifar10_network')

# Grab tensorflow graph from session
graph = session.graph

# Remember which variables came from the checkpoint so we don't re-initialize them below
restored_var_names = {v.name for v in tf.global_variables()}

# Identify the input_placeholder tensor for feeding images into the graph
x = graph.get_tensor_by_name('input_placeholder:0')
# Define placeholder for the correct labels
y = tf.placeholder(tf.int32, shape=[None])

# 'mlp/linear_0/w' names the variable's Operation; its value is the tensor 'mlp/linear_0/w:0' (hint 1)
# NOTE: HCC was down when developing this, and we are not 100% sure we grabbed the correct name for the weights
old_dense_w = graph.get_tensor_by_name('mlp/linear_0/w:0')
# Output of the old dense layer, which already uses the restored weights (hint 2)
mlp_out = graph.get_tensor_by_name('mlp_1/linear_0/add:0')

# L2 regularizer on the weight matrix of the old dense layer
l2_penalty = 0.001 * tf.nn.l2_loss(old_dense_w)

# This defines the layer to take us from the old layer's 10 outputs to our desired 2 classes
# (no activation here, because the loss below expects raw logits)
our_dense_layer = tf.layers.dense(tf.nn.relu(mlp_out), 2, name='our_dense_layer')


############### This code is very similar to that used previously in notebook ##################
# If the above is defined correctly this should work

# Operation to create confusion matrix
conf_matrix_op = tf.confusion_matrix(y, tf.argmax(our_dense_layer, axis=1), num_classes=2)

# Change data to be a binary classification problem between airplanes and birds (0 and 1)
subset_train_labels[subset_train_labels > 0] = 1
subset_test_labels[subset_test_labels > 0] = 1

# Define optimizer we will use for training
with tf.name_scope('optimizer') as scope:
    # this is the weight of the regularization part of the final loss
    REG_COEFF = 0.1
    cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=our_dense_layer)
    xentropy_w_reg = tf.reduce_mean(cross_entropy) + REG_COEFF * l2_penalty
    optimizer = tf.train.AdamOptimizer()
    # Train only the old dense layer and our new layer; var_list keeps the convolutional stack frozen
    train_vars = (tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, "mlp/linear_0") +
                  tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, "our_dense_layer"))
    train_op = optimizer.minimize(xentropy_w_reg, var_list=train_vars)

# Initialize only the variables created in this cell (new layer and optimizer state),
# leaving everything restored from the checkpoint untouched
new_vars = [v for v in tf.global_variables() if v.name not in restored_var_names]
session.run(tf.variables_initializer(new_vars, name='init'))

# train modified network for 1 epoch
batch_size = 16
for i in range(subset_train_data.shape[0] // batch_size):
    batch_xs = subset_train_data[i*batch_size:(i+1)*batch_size, :]
    batch_ys = np.squeeze(subset_train_labels[i*batch_size:(i+1)*batch_size])
    session.run(train_op, {x: batch_xs, y: batch_ys})

# Evaluate using a confusion matrix
conf_matrix = session.run(conf_matrix_op, {x: subset_test_data, y: np.squeeze(subset_test_labels)})
print('BINARY CONFUSION MATRIX:')
print(conf_matrix)


## Batch Normalization

We're going to briefly cover how to use batch norm; see Chapter 11 of the textbook for more information. It's used to deal with internal covariate shift: as the parameters of earlier layers change during training, the distribution of each layer's inputs shifts, and in particular the mean activation of a layer over the dataset tends to drift away from 0. Keeping activations close to zero mean and unit variance has an effect similar to normalizing the input data, making the learning problem easier and less sensitive to initial conditions. Generally, normalization is applied just before the activation of a layer.

TensorFlow implements batch norm with [tf.layers.batch_normalization](https://www.tensorflow.org/api_docs/python/tf/nn/batch_normalization). It only requires the input tensor, but may be customized with many optional parameters. It's important to set whether the network is training or not with the `training` argument because, to quote the TensorFlow docs:

> training: Either a Python boolean, or a TensorFlow boolean scalar tensor (e.g. a placeholder). Whether to return the output in training mode (normalized with statistics of the current batch) or in inference mode (normalized with moving statistics). NOTE: make sure to set this parameter correctly, or else your training/inference will not work properly.

Batch normalization maintains internal statistics (a moving mean and variance, kept alongside its own parameters) that it uses to normalize in inference mode. It's important that these are updated as training proceeds, but equally important that they're not updated when the network layers aren't being updated as well.
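The difference between the two modes can be sketched in NumPy. This is a simplified illustration under stated assumptions, not TensorFlow's implementation: it handles only 2-D input of shape `(batch, features)`, omits the learned scale and offset, and the `momentum` and `eps` values, the `state` dictionary, and the data are chosen for illustration.

```python
import numpy as np

def batch_norm(x, state, training, momentum=0.99, eps=1e-3):
    """Normalize each feature of a (batch, features) array; scale/offset are omitted."""
    if training:
        mean, var = x.mean(axis=0), x.var(axis=0)
        # Update the moving statistics that inference mode will use later
        state['mean'] = momentum * state['mean'] + (1 - momentum) * mean
        state['var'] = momentum * state['var'] + (1 - momentum) * var
    else:
        mean, var = state['mean'], state['var']
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
state = {'mean': np.zeros(4), 'var': np.ones(4)}
batch = rng.normal(loc=5.0, scale=2.0, size=(128, 4))

train_out = batch_norm(batch, state, training=True)    # uses the batch's own statistics
infer_out = batch_norm(batch, state, training=False)   # uses the barely updated moving statistics
print(train_out.mean(axis=0), infer_out.mean(axis=0))
```

After a single update the moving statistics are still far from the data's true mean and variance, so the inference-mode output is not yet centered. This is why the moving statistics need to be updated on every training step.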

In order to ensure the parameters get updated in the correct order, dependencies should be explicitly signaled. By default the update ops are placed in tf.GraphKeys.UPDATE_OPS, so they need to be added as a dependency to the train_op. For example:

```
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    train_op = optimizer.minimize(loss)
```

The `axis` argument should be set if the channel dimension isn't the last dimension of your image tensors.
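To see why the channel axis matters, note that the statistics are computed per channel, reducing over every other axis. A small NumPy sketch with made-up data in both layouts:

```python
import numpy as np

rng = np.random.default_rng(0)
nhwc = rng.normal(size=(8, 4, 4, 3))            # channels last
nchw = np.transpose(nhwc, (0, 3, 1, 2))         # same data, channels first

# Per-channel mean: reduce over every axis except the channel axis
mean_nhwc = nhwc.mean(axis=(0, 1, 2))           # channel axis is -1
mean_nchw = nchw.mean(axis=(0, 2, 3))           # channel axis is 1
print(np.allclose(mean_nhwc, mean_nchw))        # True
```

With channels-first data, leaving the channel axis at its last-dimension default would compute statistics per image column instead of per channel.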

An example use in a convolutional block, also using the [tf.nn.elu](https://www.tensorflow.org/api_docs/python/tf/nn/elu) activation (closely related to ReLU):

In [11]:
# Batch normalization needs the input tensor and a flag saying whether we are training
# tf.control_dependencies makes sure the batch norm update ops run before the train op
# TensorFlow only runs the ops that the fetched op depends on, so without this the moving statistics would not be updated

tf.reset_default_graph()
def my_conv_block(inputs, filters, is_training):
    """
    Args:
        - inputs: 4D tensor of shape NHWC
        - filters: iterable of ints, the number of filters in each conv layer
        - is_training: bool (or boolean tensor), whether batch norm runs in training mode
    """
    with tf.name_scope('conv_block') as scope:
        x = inputs
        for num_filters in filters:
            # 3x3 convolution with stride 1; each layer uses its own entry of `filters`
            x = tf.layers.conv2d(x, num_filters, 3, 1, padding='same')
            
            # Batch normalization is generally put after the layer and before the activation function
            x = tf.layers.batch_normalization(x, training=is_training)
            
            x = tf.nn.elu(x)
    return x

x = tf.placeholder(tf.float32, [None, 32, 32, 3])
conv_output = my_conv_block(x, [16, 32, 64], True)
flat_conv = tf.reshape(conv_output, [-1, 32*32*64])
output = tf.layers.dense(flat_conv, 10)

y = tf.placeholder(tf.int32, [None])
cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=output)
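optimizer = tf.train.AdamOptimizer()  # a fresh optimizer for the new graph
# make sure the batch norm moving statistics are updated on every training step
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    train_op = optimizer.minimize(cross_entropy)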

W0930 17:00:51.914386 47074515312768 deprecation.py:323] From <ipython-input-11-b2e1cf1aef23>:13: conv2d (from tensorflow.python.layers.convolutional) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.keras.layers.Conv2D` instead.
W0930 17:00:52.609322 47074515312768 deprecation.py:323] From <ipython-input-11-b2e1cf1aef23>:14: batch_normalization (from tensorflow.python.layers.normalization) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.BatchNormalization instead.  In particular, `tf.control_dependencies(tf.GraphKeys.UPDATE_OPS)` should not be used (consult the `tf.keras.layers.batch_normalization` documentation).
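For reference, the ELU activation used in the convolutional block differs from ReLU only for negative inputs, where it decays smoothly toward -1 instead of clipping to 0. A minimal NumPy sketch, assuming the standard definition with alpha = 1 and made-up input values:

```python
import numpy as np

def elu(x):
    """ELU with alpha = 1: x for x > 0, exp(x) - 1 otherwise."""
    return np.where(x > 0, x, np.expm1(np.minimum(x, 0.0)))

def relu(x):
    return np.maximum(x, 0.0)

values = np.array([-2.0, -0.5, 0.0, 1.5])
print(elu(values))
print(relu(values))
```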
