In [2]:
import tensorflow as tf

x = tf.ones((2,2))
with tf.GradientTape() as t:
    t.watch(x)
    y = tf.reduce_sum(x)
    z = tf.multiply(y, y)

# Use the tape to compute the derivative of z with respect to the intermediate value y
dz_dy = t.gradient(z, y)
# Check that the resulting derivative, 2*y = 2*sum(x) = 2*4, equals 8
assert dz_dy.numpy() == 8.0

In [3]:
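x = tf.constant(3.0)
# persistent=True lets the same tape be queried for more than one gradient
with tf.GradientTape(persistent=True) as t:
    t.watch(x)
    y = x*x
    z = y*y
dz_dx = t.gradient(z, x)  # 4*x^3 = 108.0 at x = 3
dy_dx = t.gradient(y, x)  # 2*x = 6.0 at x = 3
del t  # drop the reference to the persistent tape once we are done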

In [4]:
print(dz_dx)
print(dy_dx)

tf.Tensor(108.0, shape=(), dtype=float32)
tf.Tensor(6.0, shape=(), dtype=float32)
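
The two printed values can be checked by hand with the chain rule. The block below is a minimal sketch in plain Python (no TensorFlow); the helper `z_of` and the step size `h` are illustrative names introduced here, not part of the notebook.

```python
# Chain-rule check for the persistent-tape cell: y = x*x, z = y*y.
x = 3.0
y = x * x                # 9.0
dy_dx = 2.0 * x          # 6.0
dz_dy = 2.0 * y          # 18.0
dz_dx = dz_dy * dy_dx    # 108.0, matching the tape output

# Central finite difference as an independent numerical check.
def z_of(v):
    return (v * v) ** 2

h = 1e-4
dz_dx_numeric = (z_of(x + h) - z_of(x - h)) / (2 * h)
print(dz_dx, dz_dx_numeric)
```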


 Backpropagation: updates the interior weights of the network in a principled way, but it has several shortcomings
 that made deep networks difficult to use in practice.
 1. Vanishing gradient -> one solution: ReLU

 2. How the network utilizes its available free parameters
 P(Y|X) = P(X|Y)*P(Y) / P(X) => posterior probability = likelihood * prior probability / evidence
 Put another way, the output of a neuron depends on all of its input values and on the distributions over those inputs.

A problem occurs when those values become tightly coupled. This makes it intractable to compute the relative
contribution of different parameters, particularly in a deep network. One solution: the Boltzmann machine

Other solutions: Support Vector Machines, Gradient and Stochastic Gradient Boosting Models, 
                Random Forests, Penalized Regression Methods such as LASSO and Elastic Net

In theory, deep neural networks have greater representational power because they stack many layers.

Introducing AlexNet

It is true that ReLU can mitigate the vanishing gradient problem, but ReLU units have the downside that they
can "turn off" (output 0 with zero gradient) if the input falls below 0. Two variants address this:
Leaky ReLU: y = x if x > 0, else 0.01x        or PReLU (Parametric ReLU): y = max(ax, x) with a learned slope a <= 1
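
A minimal sketch of the three activations, written for scalars in plain Python. The function names and the example slope `a=0.25` are assumptions chosen for illustration; in a real PReLU layer `a` would be a trained parameter.

```python
def relu(x):
    # Standard ReLU: negative inputs give 0, so the unit can "turn off".
    return max(0.0, x)

def leaky_relu(x, slope=0.01):
    # Fixed small slope for negative inputs keeps a nonzero gradient.
    return x if x > 0 else slope * x

def prelu(x, a):
    # Same shape as leaky ReLU, but `a` would be learned during training.
    # max(a*x, x) only matches the piecewise form when a <= 1.
    return max(a * x, x)

for v in (-2.0, 0.0, 3.0):
    print(v, relu(v), leaky_relu(v), prelu(v, a=0.25))
```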

Another trick used by AlexNet is dropout. The idea of dropout is inspired by ensemble methods, in which we average
the predictions of many models to obtain more robust results. Clearly, training many separate deep neural networks is prohibitively expensive. The compromise is to randomly set the outputs of a subset of neurons to 0 during training, each with probability 0.5.
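
A rough sketch of dropout on a list of activations. Note the assumption: this uses the "inverted" variant, which rescales the surviving units during training so nothing changes at inference time. The AlexNet paper instead kept all units at test time and multiplied their outputs by 0.5. The function name and the `rng` parameter are illustrative.

```python
import random

def dropout(activations, p_drop=0.5, training=True, rng=random):
    # At inference time every unit is kept and nothing is rescaled.
    if not training:
        return list(activations)
    keep = 1.0 - p_drop
    # Zero each unit with probability p_drop; scale survivors by 1/keep
    # so the expected value of each activation is unchanged.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)  # seeded only to make the demo repeatable
hidden = [1.0] * 10
print(dropout(hidden, rng=rng))
print(dropout(hidden, training=False))
```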

Another enhancement used in AlexNet is local response normalization. Even though ReLUs don't saturate in the same manner as other units, the authors of the model still found value in constraining the range of the output.
For example, for an individual kernel, they normalized its activation using the values of adjacent kernels, meaning the
overall response was rescaled. This is similar in spirit to batch normalization, which also applies a transformation to the "raw" activations within a network.
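
A minimal sketch of local response normalization at a single spatial position, where each kernel's activation is divided by a term built from the squared activations of its neighbouring kernels. The default constants (`k=2`, `n=5`, `alpha=1e-4`, `beta=0.75`) are the values I recall from the AlexNet paper; treat them, and the function name, as assumptions.

```python
def local_response_norm(channels, k=2.0, n=5, alpha=1e-4, beta=0.75):
    # `channels` holds the activations of all kernels at one spatial position.
    num = len(channels)
    half = n // 2
    out = []
    for i, a in enumerate(channels):
        lo, hi = max(0, i - half), min(num - 1, i + half)
        # Sum of squares over the window of adjacent kernels (including i).
        sq_sum = sum(channels[j] ** 2 for j in range(lo, hi + 1))
        out.append(a / (k + alpha * sq_sum) ** beta)
    return out

print(local_response_norm([1.0, 0.0, 0.0]))
print(local_response_norm([1.0, 100.0, 0.0]))  # a strong neighbour damps unit 0
```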

Natural language processing requires taking past context into account, which is why the recurrent neural network (RNN) was developed. A plain RNN, however, only carries context through the previous hidden state, so long-range context tends to fade. The LSTM was developed to improve on that. Unlike feedforward networks, RNNs aren't trained with traditional backpropagation, but rather with a variant known as 'backpropagation through time' (BPTT).
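
To make "carries context through the previous hidden state" concrete, here is a toy scalar RNN forward pass. The weights `w_x`, `w_h`, `b` and the function name are hypothetical values picked for illustration; a real RNN uses weight matrices and learns them with BPTT.

```python
import math

def rnn_forward(inputs, w_x=0.5, w_h=0.8, b=0.0):
    h = 0.0  # initial hidden state
    states = []
    for x_t in inputs:
        # The new state mixes the current input with the previous state only.
        h = math.tanh(w_x * x_t + w_h * h + b)
        states.append(h)
    return states

# A single input at t=0 keeps echoing through later states, but it fades.
print(rnn_forward([1.0, 0.0, 0.0, 0.0]))
```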

Optimization Procedure

1. How to initialize the weights
        e.g. random weights within some range, enough to at least reach a local minimum
2. How to find a local minimum of the loss
        e.g. gradient descent
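
The two steps above can be sketched on a one-parameter problem. The quadratic loss, learning rate, and initialization range are assumptions chosen so the example is easy to follow.

```python
import random

def loss(w):
    # Toy loss with its minimum at w = 3.
    return (w - 3.0) ** 2

def grad(w):
    return 2.0 * (w - 3.0)

def gradient_descent(w0, lr=0.1, steps=100):
    w = w0
    for _ in range(steps):
        w -= lr * grad(w)  # step against the gradient
    return w

rng = random.Random(0)
w0 = rng.uniform(-1.0, 1.0)     # step 1: random initial weight in a range
w_final = gradient_descent(w0)  # step 2: descend to the minimum
print(w0, w_final, loss(w_final))
```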

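To improve SGD, Nesterov momentum was introduced. The idea is to use an exponentially weighted momentum term that remembers prior steps and continues in promising directions; Nesterov's variant evaluates the gradient at the look-ahead position the momentum would carry the weights to.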
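
A small sketch of the Nesterov update on a toy quadratic, assuming the common formulation in which the gradient is taken at `w + mu * v`. The loss, `lr`, and `mu` are illustrative choices.

```python
def grad(w):
    # Gradient of the toy loss (w - 3)^2.
    return 2.0 * (w - 3.0)

def nesterov(w0, lr=0.05, mu=0.9, steps=200):
    w, v = w0, 0.0
    for _ in range(steps):
        lookahead = w + mu * v             # where momentum alone would take us
        v = mu * v - lr * grad(lookahead)  # decayed history plus new gradient
        w += v
    return w

print(nesterov(0.0))
```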

Adaptive Gradient (Adagrad)
scales the learning rate for each update by the running sum of squares of the gradient of that parameter (G);
thus parameters that are frequently updated are scaled down, while those that are infrequently updated are pushed to update with greater magnitude. The downside is that G only grows, so the effective learning rate keeps shrinking; with ReLU, whose unbounded activations can produce large gradients, it can vanish quickly. To prevent this, two variants, RMSProp (often used with RNNs) and AdaDelta, impose fixed-width windows of n steps in the computation of G.
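
The difference is easiest to see by tracking the effective step size for a constant gradient. In this sketch the window in RMSProp is approximated by an exponentially decaying average with rate `rho`, which is how it is usually implemented rather than a literal n-step window; all names and constants are assumptions.

```python
import math

def adagrad_scale(grads, lr=0.1, eps=1e-8):
    G, scales = 0.0, []
    for g in grads:
        G += g * g  # running sum of squares: never shrinks
        scales.append(lr / (math.sqrt(G) + eps))
    return scales

def rmsprop_scale(grads, lr=0.1, rho=0.9, eps=1e-8):
    G, scales = 0.0, []
    for g in grads:
        G = rho * G + (1 - rho) * g * g  # decaying average forgets old gradients
        scales.append(lr / (math.sqrt(G) + eps))
    return scales

grads = [1.0] * 100
ada = adagrad_scale(grads)
rms = rmsprop_scale(grads)
print(ada[0], ada[-1])  # keeps shrinking like lr / sqrt(t)
print(rms[0], rms[-1])  # settles near lr
```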

Final Boss: Adaptive Moment Estimation (Adam): combines momentum + AdaDelta. The momentum calculation is used to preserve the history of past gradient updates, while the sum of decaying squared gradients within a fixed update window, as used in AdaDelta, is applied to scale the resulting gradient.
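
A compact sketch of the Adam update on the same kind of toy quadratic. The hyperparameter defaults follow commonly quoted values (`beta1=0.9`, `beta2=0.999`, `eps=1e-8`); the loss and `lr` are illustrative assumptions.

```python
import math

def grad(w):
    # Gradient of the toy loss (w - 3)^2.
    return 2.0 * (w - 3.0)

def adam(w0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    w, m, v = w0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g      # momentum: decaying mean of gradients
        v = beta2 * v + (1 - beta2) * g * g  # decaying mean of squared gradients
        m_hat = m / (1 - beta1 ** t)         # bias correction for the zero start
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

print(adam(0.0))
```

With bias correction, the very first step has magnitude close to `lr` regardless of the gradient's scale, which is one reason Adam is relatively insensitive to how the loss is scaled.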

Now let's talk about weight initialization. In earlier research, it was common to initialize weights in a neural network with some range of random values. A breakthrough was to use pre-training before backpropagation.

The TensorFlow Keras module initializes weights from either a truncated normal distribution or a uniform distribution. The motivation is that activation functions such as the sigmoid and hyperbolic tangent saturate easily. Our goal is then to keep the weights in a range where they don't saturate the neuron's output. In other words, we want the input and output values of the neuron to have similar variance, so the signal is neither massively amplified nor diminished as it passes through neurons.
Skipping the derivation: we can use a truncated normal or uniform distribution with variance 1/N, where N is the number of incoming weights or the average number of input and output units.
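
As a sketch of the "average of input and output units" case, the block below draws weights uniformly with the Glorot (Xavier) limit and checks the variance empirically. This is a plain-Python illustration under that assumption, not the Keras implementation; `glorot_uniform` here is a local helper and the layer sizes are made up.

```python
import math
import random

def glorot_uniform(fan_in, fan_out, rng):
    # Uniform on [-limit, limit] has variance limit**2 / 3 = 2 / (fan_in + fan_out),
    # which is 1/N with N the average of fan_in and fan_out.
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]

rng = random.Random(0)
fan_in, fan_out = 200, 100
W = glorot_uniform(fan_in, fan_out, rng)

flat = [w for row in W for w in row]
mean = sum(flat) / len(flat)
empirical_var = sum((w - mean) ** 2 for w in flat) / len(flat)
target_var = 2.0 / (fan_in + fan_out)
print(empirical_var, target_var)
```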