# <div align="center">8.7 Optimization Strategies and Meta-Algorithms</div>
---------------------------------------------------------------------

You can find me on GitHub:
> ###### [ GitHub](https://github.com/lev1khachatryan)

Many optimization techniques are not exactly algorithms, but rather general
templates that can be specialized to yield algorithms, or subroutines that can be
incorporated into many different algorithms.

## <div align="center">8.7.1 Batch Normalization</div>
---------------------------------------------------------------------

Batch normalization (Ioffe and Szegedy, 2015) is one of the most exciting recent
innovations in optimizing deep neural networks, and it is actually not an optimization
algorithm at all. Instead, it is a method of adaptive reparametrization, motivated
by the difficulty of training very deep models.

Very deep models involve the composition of several functions or layers. The
gradient tells how to update each parameter, under the assumption that the other
layers do not change. In practice, we update all of the layers simultaneously.
When we make the update, unexpected results can happen because many functions
composed together are changed simultaneously, using updates that were computed
under the assumption that the other functions remain constant. 

As a simple example, suppose we have a deep neural network that has only one unit per layer
and does not use an activation function at each hidden layer: $\hat{y} = x w_{1} w_{2} w_{3} \dots w_{l}$.
Here, $w_{i}$ provides the weight used by layer $i$. The output of layer $i$ is $h_{i} = h_{i-1} w_{i}$.
The output $\hat{y}$ is a linear function of the input $x$, but a nonlinear function of the
weights $w_{i}$. Suppose our cost function has put a gradient of 1 on $\hat{y}$, so we wish to
decrease $\hat{y}$ slightly. The back-propagation algorithm can then compute a gradient
$g = \nabla_{w} \hat{y}$. Consider what happens when we make an update $w \leftarrow w - \epsilon g$. The
first-order Taylor series approximation of $\hat{y}$ predicts that the value of $\hat{y}$ will decrease
by $\epsilon g^{T} g$. If we wanted to decrease $\hat{y}$ by 0.1, this first-order information available in
the gradient suggests we could set the learning rate $\epsilon$ to $0.1 / g^{T} g$. However, the actual
update will include second-order and third-order effects, on up to effects of order $l$.
The new value of $\hat{y}$ is given by

<img src='asset/8_7/1.png'>

An example of one second-order term arising from this update is $\epsilon^{2} g_{1} g_{2} \prod_{i=3}^{l} w_{i}$.

This term might be negligible if $\prod_{i=3}^{l} w_{i}$ is small, or might be exponentially large
if the weights on layers 3 through $l$ are greater than 1. This makes it very hard
to choose an appropriate learning rate, because the effects of an update to the
parameters for one layer depend so strongly on all of the other layers. Second-order
optimization algorithms address this issue by computing an update that takes these
second-order interactions into account, but we can see that in very deep networks,
even higher-order interactions can be significant. Even second-order optimization
algorithms are expensive and usually require numerous approximations that prevent
them from truly accounting for all significant second-order interactions. Building
an n-th order optimization algorithm for n > 2 thus seems hopeless. What can we
do instead?
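The gap between the first-order prediction and the real effect of an update can be checked numerically. The sketch below is only an illustration under assumed values: the helper names, the ten-layer depth, and the weight value 1.5 are hypothetical choices, not taken from the text. It compares the predicted decrease $\epsilon g^{T} g$ with the decrease actually obtained after the update $w \leftarrow w - \epsilon g$.

```python
def chain_output(x, w):
    """Output y = x * w_1 * ... * w_l of the one-unit-per-layer linear network."""
    y = x
    for w_i in w:
        y *= w_i
    return y


def chain_gradient(x, w):
    """Gradient of the output with respect to each weight:
    dy/dw_i = x * product of all w_j with j != i."""
    grads = []
    for i in range(len(w)):
        g_i = x
        for j, w_j in enumerate(w):
            if j != i:
                g_i *= w_j
        grads.append(g_i)
    return grads


def predicted_and_actual_decrease(x, w, eps):
    """Return (first-order predicted decrease, actual decrease) for one update."""
    y = chain_output(x, w)
    g = chain_gradient(x, w)
    predicted = eps * sum(g_i * g_i for g_i in g)        # eps * g^T g
    w_new = [w_i - eps * g_i for w_i, g_i in zip(w, g)]  # w <- w - eps * g
    actual = y - chain_output(x, w_new)
    return predicted, actual


x = 1.0
w = [1.5] * 10  # hypothetical: ten layers, every weight equal to 1.5
for eps in (1e-6, 1e-2):
    predicted, actual = predicted_and_actual_decrease(x, w, eps)
    print(f"eps={eps:g}: predicted decrease {predicted:.4f}, actual decrease {actual:.4f}")
```

With the very small step the two numbers nearly coincide. With the larger step the higher-order terms take over and the actual decrease is far smaller than the linear prediction, which is the coordination problem described above.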

Batch normalization provides an elegant way of reparametrizing almost any deep
network. The reparametrization significantly reduces the problem of coordinating
updates across many layers. Batch normalization can be applied to any input
or hidden layer in a network. Let H be a minibatch of activations of the layer
to normalize, arranged as a design matrix, with the activations for each example
appearing in a row of the matrix. To normalize H, we replace it with

<img src='asset/8_7/2.png'>

where $\mu$ is a vector containing the mean of each unit and $\sigma$ is a vector containing
the standard deviation of each unit. The arithmetic here is based on broadcasting
the vector $\mu$ and the vector $\sigma$ to be applied to every row of the matrix $H$. Within
each row, the arithmetic is element-wise, so $H_{i,j}$ is normalized by subtracting $\mu_{j}$
and dividing by $\sigma_{j}$. The rest of the network then operates on $H'$ in exactly the
same way that the original network operated on $H$.

At training time,

<img src='asset/8_7/3.png'>

and

<img src='asset/8_7/4.png'>

where $\delta$ is a small positive value such as $10^{-8}$ imposed to avoid encountering
the undefined gradient of $\sqrt{z}$ at $z = 0$. Crucially, we back-propagate through
these operations for computing the mean and the standard deviation, and for
applying them to normalize H. This means that the gradient will never propose
an operation that acts simply to increase the standard deviation or mean of
$h_i$; the normalization operations remove the effect of such an action and zero
out its component in the gradient. This was a major innovation of the batch
normalization approach. Previous approaches had involved adding penalties to
the cost function to encourage units to have normalized activation statistics or
involved intervening to renormalize unit statistics after each gradient descent step.
The former approach usually resulted in imperfect normalization and the latter
usually resulted in significant wasted time as the learning algorithm repeatedly
proposed changing the mean and variance and the normalization step repeatedly
undid this change. Batch normalization reparametrizes the model to make some
units always be standardized by definition, deftly sidestepping both problems.
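As a rough illustration of the training-time computation, the following NumPy sketch normalizes each column of a minibatch with its own mean and standard deviation. The function name `batch_norm_train`, the random data, and the choice of NumPy are assumptions made for this example. It also checks that shifting or rescaling the activations before normalization leaves the result unchanged, which is why a gradient taken through the normalization has no component that merely changes the mean or standard deviation.

```python
import numpy as np


def batch_norm_train(H, delta=1e-8):
    """Normalize each column (unit) of the minibatch H using minibatch statistics."""
    mu = H.mean(axis=0)                                    # per-unit mean, shape (n_units,)
    sigma = np.sqrt(delta + ((H - mu) ** 2).mean(axis=0))  # per-unit std, shape (n_units,)
    H_norm = (H - mu) / sigma                              # mu and sigma broadcast over rows
    return H_norm, mu, sigma


rng = np.random.default_rng(0)
# Hypothetical minibatch: 64 examples (rows), 5 units (columns).
H = rng.normal(loc=3.0, scale=2.0, size=(64, 5))

H_norm, mu, sigma = batch_norm_train(H)
# The same activations after an arbitrary shift and rescaling.
H_norm_shifted, _, _ = batch_norm_train(4.0 * H + 10.0)

print(H_norm.mean(axis=0))                  # approximately zero for every unit
print(H_norm.std(axis=0))                   # approximately one for every unit
print(np.allclose(H_norm, H_norm_shifted))  # normalization removes the shift and scale
```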

At test time, µ and σ may be replaced by running averages that were collected
during training time. This allows the model to be evaluated on a single example,
without needing to use definitions of µ and σ that depend on an entire minibatch.
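One possible way to collect such running averages is an exponential moving average updated on every training minibatch. The sketch below is an assumption-laden illustration: the class name `RunningStats`, the momentum of 0.9, and the choice to average the standard deviation directly (rather than the variance) are all hypothetical.

```python
import numpy as np


class RunningStats:
    """Exponential moving averages of the per-unit mean and standard deviation."""

    def __init__(self, n_units, momentum=0.9):
        self.momentum = momentum
        self.mean = np.zeros(n_units)
        self.std = np.ones(n_units)

    def update(self, H, delta=1e-8):
        """Fold the statistics of one training minibatch into the running averages."""
        mu = H.mean(axis=0)
        sigma = np.sqrt(delta + H.var(axis=0))
        self.mean = self.momentum * self.mean + (1.0 - self.momentum) * mu
        self.std = self.momentum * self.std + (1.0 - self.momentum) * sigma

    def normalize(self, h):
        """Normalize one example (or a batch) with the fixed running statistics."""
        return (h - self.mean) / self.std


rng = np.random.default_rng(0)
stats = RunningStats(n_units=3)
for _ in range(200):  # stand-in for 200 training steps
    stats.update(rng.normal(loc=5.0, scale=2.0, size=(32, 3)))

single_example = np.array([5.0, 7.0, 3.0])
print(stats.mean, stats.std)
print(stats.normalize(single_example))  # no minibatch is needed at test time
```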

Revisiting the $\hat{y} = x w_{1} w_{2} \dots w_{l}$ example, we see that we can mostly resolve the
difficulties in learning this model by normalizing $h_{l-1}$. Suppose that $x$ is drawn
from a unit Gaussian. Then $h_{l-1}$ will also come from a Gaussian, because the
transformation from $x$ to $h_{l}$ is linear. However, $h_{l-1}$ will no longer have zero mean
and unit variance. After applying batch normalization, we obtain the normalized
$\hat{h}_{l-1}$ that restores the zero mean and unit variance properties. For almost any
update to the lower layers, $\hat{h}_{l-1}$ will remain a unit Gaussian. The output $\hat{y}$ may
then be learned as a simple linear function $\hat{y} = w_{l} \hat{h}_{l-1}$. Learning in this model is
now very simple because the parameters at the lower layers simply do not have an
effect in most cases; their output is always renormalized to a unit Gaussian. In
some corner cases, the lower layers can have an effect. Changing one of the lower
layer weights to 0 can make the output become degenerate, and changing the sign of
one of the lower weights can flip the relationship between $\hat{h}_{l-1}$ and $y$. These
situations are very rare. Without normalization, nearly every update would have
an extreme effect on the statistics of $h_{l-1}$. Batch normalization has thus made
this model significantly easier to learn. In this example, the ease of learning of
course came at the cost of making the lower layers useless. In our linear example,
the lower layers no longer have any harmful effect, but they also no longer have
any beneficial effect. This is because we have normalized out the first and second
order statistics, which is all that a linear network can influence. In a deep neural
network with nonlinear activation functions, the lower layers can perform nonlinear
transformations of the data, so they remain useful. Batch normalization acts to
standardize only the mean and variance of each unit in order to stabilize learning,
but allows the relationships between units and the nonlinear statistics of a single
unit to change.
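The claim that the lower layers of the linear example stop mattering after normalization can be checked with a small experiment. This is a sketch under assumed values: the function name `normalized_penultimate`, the sample size, and the specific weights are hypothetical.

```python
import numpy as np


def normalized_penultimate(x, lower_weights, delta=1e-8):
    """Compute h_{l-1} = x * w_1 * ... * w_{l-1}, then standardize it over the minibatch."""
    h = x * np.prod(lower_weights)
    return (h - h.mean()) / np.sqrt(delta + h.var())


rng = np.random.default_rng(0)
x = rng.normal(size=1000)  # inputs drawn from a unit Gaussian

h_a = normalized_penultimate(x, [0.5, 2.0, 3.0])
h_b = normalized_penultimate(x, [1.7, 0.1, 8.0])   # very different positive weights
h_c = normalized_penultimate(x, [-0.5, 2.0, 3.0])  # same as the first, one sign flipped

print(np.allclose(h_a, h_b))   # the lower-layer weights have no effect
print(np.allclose(h_a, -h_c))  # a sign flip reverses the relationship
```

Changing the magnitudes of the lower weights leaves the normalized activations unchanged, while flipping one sign negates them, which matches the corner case mentioned above.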

Because the final layer of the network is able to learn a linear transformation,
we may actually wish to remove all linear relationships between units within a
layer. Indeed, this is the approach taken by Desjardins et al. (2015), who provided
the inspiration for batch normalization. Unfortunately, eliminating all linear
interactions is much more expensive than standardizing the mean and standard
deviation of each individual unit, and so far batch normalization remains the most
practical approach.

Normalizing the mean and standard deviation of a unit can reduce the expressive
power of the neural network containing that unit. In order to maintain the
expressive power of the network, it is common to replace the batch of hidden unit
activations H with γH' + β rather than simply the normalized H'. The variables
γ and β are learned parameters that allow the new variable to have any mean
and standard deviation. At first glance, this may seem useless—why did we set
the mean to 0, and then introduce a parameter that allows it to be set back to
any arbitrary value β? The answer is that the new parametrization can represent
the same family of functions of the input as the old parametrization, but the new
parametrization has different learning dynamics. In the old parametrization, the
mean of H was determined by a complicated interaction between the parameters
in the layers below H. In the new parametrization, the mean of γH' + β is
determined solely by β. The new parametrization is much easier to learn with
gradient descent.
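A short numerical check can make this concrete. In the sketch below (function name, parameter values, and data are assumptions chosen for illustration), two minibatches with very different statistics stand in for different settings of the lower-layer parameters. After the reparametrization, the output mean equals β and the output standard deviation equals |γ| in both cases.

```python
import numpy as np


def batch_norm_affine(H, gamma, beta, delta=1e-8):
    """Normalize H per unit, then apply the learned scale gamma and shift beta."""
    mu = H.mean(axis=0)
    sigma = np.sqrt(delta + H.var(axis=0))
    H_norm = (H - mu) / sigma
    return gamma * H_norm + beta


rng = np.random.default_rng(0)
gamma = np.array([2.0, 0.5, 1.0])  # hypothetical learned scales
beta = np.array([-1.0, 0.0, 4.0])  # hypothetical learned shifts

# Two minibatches with very different raw statistics.
H_1 = rng.normal(loc=0.0, scale=1.0, size=(128, 3))
H_2 = rng.normal(loc=50.0, scale=9.0, size=(128, 3))

out_1 = batch_norm_affine(H_1, gamma, beta)
out_2 = batch_norm_affine(H_2, gamma, beta)
for out in (out_1, out_2):
    print(out.mean(axis=0), out.std(axis=0))  # close to beta and |gamma| each time
```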

Most neural network layers take the form of φ(XW + b) where φ is some
fixed nonlinear activation function such as the rectified linear transformation. It
is natural to wonder whether we should apply batch normalization to the input
X , or to the transformed value XW + b. Ioffe and Szegedy (2015) recommend the latter. More specifically, XW + b should be replaced by a normalized version
of XW. The bias term should be omitted because it becomes redundant with
the β parameter applied by the batch normalization reparametrization. The input
to a layer is usually the output of a nonlinear activation function such as the
rectified linear function in a previous layer. The statistics of the input are thus
more non-Gaussian and less amenable to standardization by linear operations.
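The redundancy of the bias term is easy to verify: subtracting the minibatch mean cancels any constant added to every row. The following sketch assumes NumPy, a rectified linear activation, and hypothetical shapes and names.

```python
import numpy as np


def batch_norm(Z, gamma, beta, delta=1e-8):
    """Per-unit batch normalization followed by the learned scale and shift."""
    mu = Z.mean(axis=0)
    sigma = np.sqrt(delta + Z.var(axis=0))
    return gamma * (Z - mu) / sigma + beta


def relu(Z):
    """Rectified linear activation."""
    return np.maximum(Z, 0.0)


rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))  # hypothetical minibatch of layer inputs
W = rng.normal(size=(4, 3))   # layer weights
b = rng.normal(size=3)        # bias that normalization will cancel
gamma, beta = np.ones(3), np.zeros(3)

with_bias = batch_norm(X @ W + b, gamma, beta)
without_bias = batch_norm(X @ W, gamma, beta)
print(np.allclose(with_bias, without_bias))  # the bias has no effect

# The layer output then takes the form phi(BN(XW)).
layer_output = relu(without_bias)
```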

In convolutional networks, it is important to apply the
same normalizing µ and σ at every spatial location within a feature map, so that
the statistics of the feature map remain the same regardless of spatial location.

### <div align="center">8.7.1.1 Batch Normalization Layers</div>
---------------------------------------------------------------------

The batch normalization methods for fully-connected layers and convolutional layers are slightly different. This is due to the dimensionality of the data generated by convolutional layers. We discuss both cases below. Note that one of the key differences between BN and other layers is that BN operates on a full minibatch at a time (otherwise it cannot compute the mean and variance parameters per batch).

#### <div align="center">8.7.1.1.1 Fully-Connected Layers</div>
---------------------------------------------------------------------

Usually we apply the batch normalization layer between the affine transformation and the activation function in a fully-connected layer. In the following, we denote by  u  the input and by  x=Wu+b  the output of the linear transform. This yields the following variant of BN:

<img src='asset/8_7/5.png'>

Recall that mean and variance are computed on the same minibatch  B  on which the transformation is applied. Also recall that the scaling coefficient  γ  and the offset  β  are parameters that need to be learned. They ensure that the effect of batch normalization can be neutralized as needed.

#### <div align="center">8.7.1.1.2 Convolutional Layers</div>
---------------------------------------------------------------------

For convolutional layers, batch normalization occurs after the convolution computation and before the application of the activation function. If the convolution computation outputs multiple channels, we need to carry out batch normalization for each of the outputs of these channels, and each channel has an independent scale parameter and shift parameter, both of which are scalars. Assume that there are  m  examples in the mini-batch. On a single channel, we assume that the height and width of the convolution computation output are  p  and  q , respectively. We need to carry out batch normalization for  m×p×q  elements in this channel simultaneously. While carrying out the standardization computation for these elements, we use the same mean and variance. In other words, we use the means and variances of the  m×p×q  elements in this channel rather than one per pixel.
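The per-channel statistics described above can be sketched in NumPy as follows. The function name, the channels-first layout `(m, channels, p, q)`, and the tensor sizes are assumptions for this example.

```python
import numpy as np


def batch_norm_conv(X, gamma, beta, delta=1e-8):
    """Batch-normalize a convolution output of shape (m, channels, p, q).

    One mean and one variance per channel, computed over all m * p * q elements.
    """
    mu = X.mean(axis=(0, 2, 3), keepdims=True)  # shape (1, channels, 1, 1)
    var = X.var(axis=(0, 2, 3), keepdims=True)  # shape (1, channels, 1, 1)
    Y = gamma * (X - mu) / np.sqrt(var + delta) + beta
    return Y, mu, var


rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=3.0, size=(8, 4, 5, 5))  # m=8, 4 channels, p=q=5
gamma = np.ones((1, 4, 1, 1))   # one scalar scale per channel
beta = np.zeros((1, 4, 1, 1))   # one scalar shift per channel

Y, mu, var = batch_norm_conv(X, gamma, beta)
print(mu.shape, Y.shape)
print(Y.mean(axis=(0, 2, 3)))  # approximately zero in every channel
print(Y.std(axis=(0, 2, 3)))   # approximately one in every channel
```

Keeping the reduced axes with `keepdims=True` lets the statistics broadcast back against the four-dimensional input without any reshaping.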

#### <div align="center">8.7.1.1.3 Batch Normalization During Prediction</div>
---------------------------------------------------------------------

At prediction time, we might not have the luxury of computing offsets per batch: we might be required to make one prediction at a time. Moreover, the noise in  μ  and  σ  that arises from estimating them on a minibatch is undesirable once we have trained the model. One way to mitigate this is to compute more stable estimates once on a larger set (e.g. via a moving average) and then fix them at prediction time. Consequently, BN behaves differently during training and at test time (recall that dropout also behaves differently at training and test time).

### <div align="center">8.7.1.2 Implementation from Scratch</div>
---------------------------------------------------------------------

```python
import d2l
from mxnet import autograd, gluon, nd, init
from mxnet.gluon import nn

def batch_norm(X, gamma, beta, moving_mean, moving_var, eps, momentum):
    # Use autograd to determine whether the current mode is training mode or
    # prediction mode
    if not autograd.is_training():
        # If it is the prediction mode, directly use the mean and variance
        # obtained from the incoming moving average
        X_hat = (X - moving_mean) / nd.sqrt(moving_var + eps)
    else:
        assert len(X.shape) in (2, 4)
        if len(X.shape) == 2:
            # When using a fully connected layer, calculate the mean and
            # variance on the feature dimension
            mean = X.mean(axis=0)
            var = ((X - mean) ** 2).mean(axis=0)
        else:
            # When using a two-dimensional convolutional layer, calculate the
            # mean and variance on the channel dimension (axis=1). Here we
            # need to maintain the shape of X, so that the broadcast operation
            # can be carried out later
            mean = X.mean(axis=(0, 2, 3), keepdims=True)
            var = ((X - mean) ** 2).mean(axis=(0, 2, 3), keepdims=True)
        # In training mode, the current mean and variance are used for the
        # standardization
        X_hat = (X - mean) / nd.sqrt(var + eps)
        # Update the mean and variance of the moving average
        moving_mean = momentum * moving_mean + (1.0 - momentum) * mean
        moving_var = momentum * moving_var + (1.0 - momentum) * var
    Y = gamma * X_hat + beta  # Scale and shift
    return Y, moving_mean, moving_var
```

Now we can define a custom BatchNorm layer. It holds the scale parameter gamma and the shift parameter beta, which are learned by gradient descent, and it also maintains the moving averages of the mean and variance so that they can be used during prediction. The num_features parameter required by the BatchNorm instance is the number of outputs for a fully-connected layer and the number of output channels for a convolutional layer. The num_dims parameter, also required, is 2 for a fully-connected layer and 4 for a convolutional layer.

Besides the algorithm per se, also note the design pattern in implementing layers. Typically one defines the math in a separate function, say batch_norm. This is then integrated into a custom layer that mostly focuses on bookkeeping, such as moving data to the right device context, ensuring that variables are properly initialized, keeping track of the running averages for mean and variance, etc. That way we achieve a clean separation of math and boilerplate code. Also note that for the sake of convenience we did not add automagic size inference here, hence we will need to specify the number of features throughout (the Gluon version will take care of this for us).

```python
class BatchNorm(nn.Block):
    def __init__(self, num_features, num_dims, **kwargs):
        super(BatchNorm, self).__init__(**kwargs)
        if num_dims == 2:
            shape = (1, num_features)
        else:
            shape = (1, num_features, 1, 1)
        # The scale parameter and the shift parameter, which are learned by
        # gradient descent, are initialized to 1 and 0 respectively
        self.gamma = self.params.get('gamma', shape=shape, init=init.One())
        self.beta = self.params.get('beta', shape=shape, init=init.Zero())
        # All the variables not involved in gradient computation are
        # initialized to 0 on the CPU
        self.moving_mean = nd.zeros(shape)
        self.moving_var = nd.zeros(shape)

    def forward(self, X):
        # If X is not on the CPU, copy moving_mean and moving_var to the
        # device where X is located
        if self.moving_mean.context != X.context:
            self.moving_mean = self.moving_mean.copyto(X.context)
            self.moving_var = self.moving_var.copyto(X.context)
        # Save the updated moving_mean and moving_var
        Y, self.moving_mean, self.moving_var = batch_norm(
            X, self.gamma.data(), self.beta.data(), self.moving_mean,
            self.moving_var, eps=1e-5, momentum=0.9)
        return Y
```

For more detailed information about batch normalization, see [Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift](https://arxiv.org/pdf/1502.03167.pdf) and [https://www.d2l.ai/chapter_convolutional-modern/batch-norm.html](https://www.d2l.ai/chapter_convolutional-modern/batch-norm.html).