# Training Deep Neural Nets


## Vanishing/Exploding Gradients Problems
### Vanishing gradient problem
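Gradients often get smaller and smaller as backpropagation progresses down to the lower layers.  The weight updates in 
those layers then become so small that the weights stay virtually unchanged, and training never converges on a good solution.
### Exploding gradient problem
In some cases the opposite happens: the gradients grow larger and larger as they pass through the network, so the layers receive huge weight updates and the algorithm diverges.<br/>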

More generally, deep neural networks suffer from unstable gradients: different layers may learn at very different speeds.  
This has been shown by analyzing the networks.  With the logistic sigmoid activation, the variance of each layer's 
outputs is higher than the variance of its inputs, and this variance keeps increasing as we move forward through the network.  
The function then saturates near its boundaries of 0 and 1, where its derivative is close to zero, so backpropagation 
has almost no gradient left to propagate back through the layers.
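
The saturation effect can be checked numerically. The following is a minimal sketch in plain Python (no TensorFlow); the function names and the depth `n_layers` are illustrative assumptions, and the weights are ignored to keep the arithmetic simple.

```python
import math

def sigmoid(z):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid: s(z) * (1 - s(z)), never larger than 0.25."""
    s = sigmoid(z)
    return s * (1.0 - s)

# The derivative peaks at z = 0 and collapses once the unit saturates.
for z in (0.0, 2.0, 5.0, 10.0):
    print(f"z = {z:4.1f}  sigmoid'(z) = {sigmoid_grad(z):.6f}")

# Backpropagation multiplies one such factor per layer (weights ignored here),
# so even the best case of 0.25 per layer shrinks geometrically with depth.
n_layers = 10  # hypothetical depth
best_case_factor = sigmoid_grad(0.0) ** n_layers
print(best_case_factor)
```

Even in the most favourable case, ten sigmoid layers scale the gradient by less than one millionth, which is why the lower layers barely learn.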

## Xavier and He Initialization
Signals must flow properly in both directions: forward when making predictions, and backward when propagating 
gradients.  To avoid both saturation and explosion, the variance of each layer's outputs should equal the variance of 
its inputs, and the gradients should have equal variance before and after flowing through a layer in the reverse direction.  
This can only be guaranteed when a layer has the same number of input and output connections, which is usually not the case.  
#### Possible solution:
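A good compromise is to initialize the connection weights of each layer randomly, with a variance scaled by the layer's 
number of input and output connections.  This speeds up training and is one of the tricks that led to the 
success of deep learning.  tf.layers.dense() uses Xavier initialization by default; to use He initialization, see the code snippet below.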

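In [None]:
import tensorflow as tf
n_inputs = 28 * 28  # dummy values
n_hidden = 100
X = tf.placeholder(tf.float32, shape=(None, n_inputs))  # dense() needs a tensor as input, not a dict
# variance scaling with factor=2 (fan-in mode by default) is He initialization
he_init = tf.contrib.layers.variance_scaling_initializer(factor=2.)
hidden1 = tf.layers.dense(X, n_hidden, activation=tf.nn.relu, kernel_initializer=he_init, name="hidden1")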
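
To make the scaling concrete, here is a small sketch in plain Python of the standard deviations usually quoted for the normal-distribution variants of the two schemes: sqrt(2 / (fan_in + fan_out)) for Xavier and sqrt(2 / fan_in) for He. The helper names, the layer sizes, and the seed are assumptions made for illustration; they are not part of the TensorFlow API.

```python
import math
import random

def xavier_stddev(fan_in, fan_out):
    """Standard deviation for Xavier (Glorot) normal initialization."""
    return math.sqrt(2.0 / (fan_in + fan_out))

def he_stddev(fan_in):
    """Standard deviation for He normal initialization (suited to ReLU)."""
    return math.sqrt(2.0 / fan_in)

def init_weights(fan_in, fan_out, stddev, seed=42):
    """Draw a fan_in x fan_out weight matrix from N(0, stddev^2)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, stddev) for _ in range(fan_out)]
            for _ in range(fan_in)]

fan_in, fan_out = 784, 100  # hypothetical layer sizes
W = init_weights(fan_in, fan_out, he_stddev(fan_in))
print(xavier_stddev(fan_in, fan_out), he_stddev(fan_in))
```

He initialization depends only on the number of inputs, which is why the TensorFlow initializer above works in fan-in mode.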



## Nonsaturating Activation Functions
The vanishing/exploding gradients problem is partly caused by a poor choice of activation function.<br/>
ReLU does not saturate for positive values, so it is often chosen for deep networks.  However, it can cause
some neurons to die, meaning they only ever output zero.  This is particularly likely with a large learning rate.
### Leaky ReLU
Leaky ReLU is defined as max(&alpha;z, z).  The hyperparameter &alpha; sets how much the function 'leaks': it is the slope 
of the function for z &lt; 0.  This small slope keeps the neurons from dying: they may go into a long 'coma', but they can 
recover from this state.  In research, a large leak (&alpha; = 0.2) has been reported to work better than a small one 
(&alpha; = 0.01).<br/>
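
The definition is short enough to sketch directly in plain Python. The function names are illustrative, and the sketch assumes 0 < alpha < 1 so that max(alpha * z, z) selects the right branch.

```python
def leaky_relu(z, alpha=0.01):
    """Leaky ReLU: z for z >= 0, alpha * z for z < 0 (assumes 0 < alpha < 1)."""
    return max(alpha * z, z)

def leaky_relu_grad(z, alpha=0.01):
    """Slope of leaky ReLU: 1 on the positive side, alpha on the negative side."""
    return 1.0 if z >= 0 else alpha

# Unlike plain ReLU, the gradient never becomes exactly zero for z < 0,
# so a neuron stuck on the negative side can still be updated.
for z in (-2.0, -0.5, 0.0, 3.0):
    print(z, leaky_relu(z, alpha=0.2), leaky_relu_grad(z, alpha=0.2))
```
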
### ELU versus ReLU
The Exponential Linear Unit (ELU) can outperform ReLU.  It is similar to ReLU, but it takes on negative values 
when z &lt; 0, so the average output of a unit is closer to 0.  This helps with the vanishing gradients problem.  It has a 
nonzero gradient for z &lt; 0, which prevents neurons from dying.  If &alpha; is equal to one, the function is smooth 
everywhere, including around z = 0, which speeds up gradient descent (it does not bounce as much left and right of z = 0).  
The main drawback is that ELU is slower to compute than ReLU because of the exponential.
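
As a rough illustration, ELU is usually written as z for z >= 0 and alpha * (exp(z) - 1) for z < 0. The sketch below uses plain Python with illustrative function names and checks the smoothness claim for alpha = 1.

```python
import math

def elu(z, alpha=1.0):
    """ELU: identity for z >= 0, alpha * (exp(z) - 1) for z < 0."""
    return z if z >= 0 else alpha * (math.exp(z) - 1.0)

def elu_grad(z, alpha=1.0):
    """Derivative of ELU: 1 for z >= 0, alpha * exp(z) for z < 0."""
    return 1.0 if z >= 0 else alpha * math.exp(z)

# Negative inputs produce negative outputs that level off at -alpha,
# and the gradient stays positive, so the unit cannot die.
for z in (-5.0, -1.0, 0.0, 2.0):
    print(z, elu(z), elu_grad(z))

# With alpha = 1 the slope just left of zero matches the slope at zero.
print(elu_grad(-1e-9), elu_grad(0.0))
```
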


## Batch Normalization
Batch normalization (BN) addresses the vanishing and exploding gradients problems and, more generally, the problem that 
the distribution of each layer's inputs changes during training as the parameters of the previous layers change.  In BN 
an operation is added just before the activation function of each layer.  It zero-centers and normalizes the inputs 
using the mean and standard deviation of the current mini-batch, then scales and shifts the result.  The model learns 
the optimal scale and shift for each layer's inputs.
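
The operation itself can be sketched for a single feature in plain Python. The names `gamma` (scale), `beta` (shift), and `eps` (a small constant that avoids division by zero) follow common usage and are assumptions here, as are the sample values. The sketch covers only the training-time computation on one mini-batch; at test time, running averages of the mean and variance are used instead.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a mini-batch, then scale and shift it."""
    n = len(batch)
    mean = sum(batch) / n                            # mini-batch mean
    var = sum((x - mean) ** 2 for x in batch) / n    # mini-batch variance
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]

batch = [2.0, 4.0, 6.0, 8.0]                         # hypothetical inputs
normalized = batch_norm(batch)                       # zero mean, unit variance
scaled = batch_norm(batch, gamma=2.0, beta=5.0)      # learned scale and shift
print(normalized)
print(scaled)
```

In a real layer, gamma and beta are trainable parameters with one value per feature.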


## Implementing Batch Normalization with TensorFlow

In [None]:
import warnings
warnings.filterwarnings('ignore')
import tensorflow as tf
from functools import partial


n_inputs = 28 * 28
n_hidden1 = 300
n_hidden2 = 100
n_outputs = 10

# X acts as the input layer; during execution it is replaced with one training batch at a time
X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int64, shape=None, name="y")
# defaults to False (inference); fed as True during training so that BN uses the current batch statistics
training = tf.placeholder_with_default(False, shape=(), name='training')

# creates a wrapper around the function and defines the defaults for some parameters
my_batch_norm_layer = partial(tf.layers.batch_normalization, training=training, momentum=0.9)

hidden1 = tf.layers.dense(X, n_hidden1, name="hidden1")
bn1 = my_batch_norm_layer(hidden1)
bn1_act = tf.nn.elu(bn1)

hidden2 = tf.layers.dense(bn1_act, n_hidden2, name="hidden2")
bn2 = my_batch_norm_layer(hidden2)
bn2_act = tf.nn.elu(bn2)
logits_before_bn = tf.layers.dense(bn2_act, n_outputs, name="outputs")
logits = my_batch_norm_layer(logits_before_bn)

# the rest of the program is the same as in chapter 10
with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy, name="loss")
    
# define a gradient descent optimizer that will tweak the model parameters to minimize the cost function

learning_rate = 0.01
with tf.name_scope("train"):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    training_op = optimizer.minimize(loss)

# model evaluation: accuracy is the fraction of instances whose highest logit matches the target class
with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)
    # named so that the tensor can be looked up later as "eval/accuracy:0"
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name="accuracy")
    
# create a node to initialize all variables and create a saver
init = tf.global_variables_initializer()
saver = tf.train.Saver()

# Execution phase: load MNIST from TensorFlow
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("/tmp/data/")

# define the number of epochs and batch sizes
n_epochs = 40
batch_size = 50


extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)

# train the model
with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(mnist.train.num_examples // batch_size):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            sess.run([training_op, extra_update_ops], feed_dict={training: True, X: X_batch, y: y_batch})
            
        acc_val = accuracy.eval(feed_dict={X: mnist.test.images, y: mnist.test.labels})

        print(epoch, " Test Accuracy: ", acc_val)
    save_path = saver.save(sess, "./my_model_final_11.ckpt")


## Gradient Clipping
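To lessen the exploding gradients problem, the gradients can be clipped during backpropagation so that they never exceed 
a threshold.  This is mostly useful for recurrent neural networks; in general, batch normalization tends to be preferred.  
In TensorFlow the optimizer's minimize() function both computes and applies the gradients, so to clip them you instead 
call compute_gradients(), clip the results (for example with tf.clip_by_value()), and then call apply_gradients().  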
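
The clipping step itself is simple. Below is a minimal sketch of element-wise clipping in plain Python, independent of TensorFlow; the function name mirrors tf.clip_by_value() for readability, but the implementation, the threshold, and the gradient values are illustrative assumptions.

```python
def clip_by_value(gradients, threshold=1.0):
    """Clamp every gradient component to the range [-threshold, threshold]."""
    return [max(-threshold, min(threshold, g)) for g in gradients]

grads = [0.3, -4.2, 12.0, -0.7]            # hypothetical gradient values
clipped = clip_by_value(grads, threshold=1.0)
print(clipped)  # [0.3, -1.0, 1.0, -0.7]
```

Only the components that exceed the threshold are changed; the rest pass through untouched.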


## Reusing Pretrained Layers
You can reuse parts of a neural network that has already been trained on a similar task to speed up training on your new task.  
If the model was trained with TensorFlow, the import_meta_graph() function loads the graph structure saved in the .meta 
file into the default graph.  It returns a Saver that you can use later to restore the model's state.  

In [None]:
# reuse the model trained above
saver = tf.train.import_meta_graph("./my_model_final_11.ckpt.meta")

# Get a handle on the tensors and the operations needed for training and testing.
# The name of a tensor is the name of the operation that outputs it, followed by :0 for the first output, :1 for the second, and so on.
X = tf.get_default_graph().get_tensor_by_name("X:0")
y = tf.get_default_graph().get_tensor_by_name("y:0")
accuracy = tf.get_default_graph().get_tensor_by_name("eval/accuracy:0")
# the optimizer was created inside the "train" name scope, so its operation carries that prefix
training_op = tf.get_default_graph().get_operation_by_name("train/GradientDescent")

# if the model is not well documented, explore the graph with TensorBoard or list all of its operations
for op in tf.get_default_graph().get_operations():
    print(op.name)
    


In [None]:
# alternatively, the model's author can add all the important operations to a collection before saving
for op in (X, y, accuracy, training_op):
    tf.add_to_collection("my_important_ops", op)
    
# anyone reusing the model can then retrieve them in one line
X, y, accuracy, training_op = tf.get_collection("my_important_ops")

