<a href="https://colab.research.google.com/github/Machine-Learning-Tokyo/DL-workshop-series/blob/master/Part%20II%20-%20Learning%20in%20Deep%20Networks/regularization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
import keras
from keras.layers import Dense, Input, BatchNormalization
from keras.initializers import constant
from keras.regularizers import l1, l2
from keras.models import Model
import keras.backend as K

import numpy as np

## Batch Normalization

In this very simple example we have a network with:
- one input of shape (1,) (or simply stated the input is one number)
- one Dense layer with only one unit (aka neuron)

When we print the summary of the model we can see that there is a total of 2 parameters, both of which are trainable.

The first of these parameters is the weight (*w*) of the unit and the second is the bias (*b*). The function inside the unit is:
$$f(x)=wx+b$$


---


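Now if we uncomment the commented line in the code cell below and re-run it, we will see that there is one more layer at the end of the model summary: the BatchNormalization layer.

Note that the commented line passes *center=False* and *scale=False*. With the layer's default arguments (both *True*) there is a total of 6 parameters: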

- the initial 2 trainable parameters of the Dense layer
- 2 trainable parameters of the batch_normalization layer
- 2 non-trainable parameters of the batch_normalization layer
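
The counts above can be reproduced by hand. The sketch below is plain Python and does not call Keras; `dense_params` and `batch_norm_params` are hypothetical helper names introduced only for illustration, and they assume one scale, one shift, one mean and one variance value per feature.

```python
def dense_params(n_inputs, n_units, use_bias=True):
    """Trainable parameters of a fully connected layer: weights plus optional biases."""
    return n_inputs * n_units + (n_units if use_bias else 0)


def batch_norm_params(n_features, center=True, scale=True):
    """Return (trainable, non_trainable) parameter counts for batch normalization."""
    trainable = n_features * (int(center) + int(scale))  # shift and scale
    non_trainable = 2 * n_features  # stored mean and stored variance
    return trainable, non_trainable


dense = dense_params(n_inputs=1, n_units=1)
bn_trainable, bn_fixed = batch_norm_params(n_features=1)
print(dense, bn_trainable, bn_fixed, dense + bn_trainable + bn_fixed)  # 2 2 2 6

# Without the shift and scale only the two non-trainable parameters remain.
print(batch_norm_params(1, center=False, scale=False))  # (0, 2)
```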


---


These parameters come from the following function of the batch normalization layer:
$$\hat h = \gamma \frac{h-\mu_B}{\sigma_B}+\beta$$

where:
- $h$ and $\hat h$ are the hidden values before and after the normalization
- $\mu_B$ and $\sigma_B$ represent the mean and the standard deviation of $h$.

  They are estimated within a batch of M samples. These are the **non-trainable** parameters (since they are computed from the batch rather than learned by gradient descent)
-  $\gamma$ is a scale parameter and $\beta$ is a shift parameter.

  These are the **trainable** parameters. We can choose whether the layer uses them: setting the *center* and *scale* arguments to *False* disables $\beta$ and $\gamma$ respectively, so the layer no longer has these 2 trainable parameters
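
To make the formula concrete, here is a minimal sketch in plain Python. The function name `batch_norm` and the example values of the mean and standard deviation are assumptions chosen for illustration; the statistics are passed in explicitly so the sketch does not commit to a particular way of estimating them.

```python
def batch_norm(h, mu, sigma, gamma=1.0, beta=0.0):
    """Apply gamma * (h - mu) / sigma + beta to every value in the batch h."""
    return [gamma * (value - mu) / sigma + beta for value in h]


# gamma=1 and beta=0 correspond to not using the scale and shift parameters.
print(batch_norm([1.0, 2.0], mu=1.5, sigma=0.5))  # [-1.0, 1.0]

# With a scale of 2 and a shift of 3 the normalized values are stretched and moved.
print(batch_norm([1.0, 2.0], mu=1.5, sigma=0.5, gamma=2.0, beta=3.0))  # [1.0, 5.0]
```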


In [0]:
K.clear_session()

input = Input([1])
output = Dense(1, kernel_initializer=constant(2), bias_initializer=constant(1))(input)
# output = BatchNormalization(center=False, scale=False)(output)
model = Model(input, output)
model.summary()

Now let's see how it works on a specific example. In order to simplify the model even more we will define a model with an input layer of 1 number and a Batch Normalization layer on top of it.

This means that the input numbers will pass directly through the batch_norm layer and we will get its output.

We define center and scale to be False so there are no trainable parameters.

We also set the momentum and the epsilon to 0 in order to get results that match the formula presented above (otherwise the results will be different). In real applications it is recommended **not** to set these parameters to 0.
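
The role of the two arguments can be sketched in plain Python. This assumes the usual behaviour described in the Keras documentation: the stored statistic is an exponential moving average controlled by the momentum, and epsilon is added to the variance inside the square root. The helper names `update_moving` and `normalize` are hypothetical.

```python
import math


def update_moving(moving, batch_stat, momentum):
    """Exponential moving average of a batch statistic."""
    return momentum * moving + (1.0 - momentum) * batch_stat


def normalize(h, mean, var, epsilon):
    """Normalize one value; epsilon guards against division by zero when var is 0."""
    return (h - mean) / math.sqrt(var + epsilon)


# With momentum=0 the stored value is replaced by the batch statistic in one step.
print(update_moving(0.0, 1.5, momentum=0.0))  # 1.5

# With a momentum of 0.99 a single batch barely moves the stored value.
print(round(update_moving(0.0, 1.5, momentum=0.99), 3))  # 0.015

# With epsilon=0 the output follows the formula exactly; a small epsilon shrinks it slightly.
print(normalize(2.0, mean=1.5, var=0.25, epsilon=0.0))  # 1.0
print(round(normalize(2.0, mean=1.5, var=0.25, epsilon=0.001), 3))  # 0.998
```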

In [0]:
K.clear_session()

input = Input([1])
output = BatchNormalization(center=False, scale=False, momentum=0, epsilon=0)(input)
model = Model(input, output)
model.summary()

In our example we will use as input an array with 2 elements: 1 and 2.

We reshape the array so that the model accepts the two numbers as a batch of two elements.

However, when we get the output of the model we see that the numbers remained unchanged...

In [0]:
x = 1, 2
x = np.reshape(np.array(x), (2, 1))

y_pred = model.predict(x)
print(*y_pred)

This happened because the (non-trainable) weights of the model ($\mu_B$ and $\sigma_B$) have not been calculated yet. They still hold their initial values (0 and 1).
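
A short sketch in plain Python shows why the initial values leave the input untouched. It assumes that prediction normalizes with the stored statistics rather than with the statistics of the incoming batch; `batch_norm_inference` is a hypothetical name used only here.

```python
import math


def batch_norm_inference(h, stored_mean, stored_var, epsilon=0.0):
    """Normalize a batch using stored statistics, ignoring the batch's own mean and variance."""
    return [(value - stored_mean) / math.sqrt(stored_var + epsilon) for value in h]


# Subtracting 0 and dividing by sqrt(1) is the identity, whatever the batch contains.
print(batch_norm_inference([1.0, 2.0], stored_mean=0.0, stored_var=1.0))  # [1.0, 2.0]
print(batch_norm_inference([10.0, 20.0], stored_mean=0.0, stored_var=1.0))  # [10.0, 20.0]
```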

In [0]:
print(*model.get_weights())

All the parameters of the model, even the non trainable, are calculated during the training phase of the model and are retained during the inference phase.

Thus we have to "train" our model on our batch.

In order to train the model we first have to compile it with a specific optimizer and loss function. The choice of these two arguments is arbitrary, as is the choice of the y values, because the model has no trainable parameters for them to affect. Thus we can safely use 'sgd', 'mae' and 'x' without loss of generality.

In [0]:
model.compile('sgd', 'mae')
t = model.train_on_batch(x, x)

Now if we re-run the earlier cell that prints the model's weights, we will get the updated mean and variance (the layer stores the variance rather than the standard deviation) computed from the batch:

mean:
$$\bar x=\frac{\sum^N_{i=1}x_i}{N}$$
standard deviation:
$$\sigma=\sqrt{\frac{\sum^N_{i=1}(x_i-\bar x)^2}{N-1}}$$
variance:
$$Var=\sigma^2$$

Now you can rerun the prediction cell and obtain the new outputs of the model. The two outputs have indeed mean = 0 and std = 1.

If you want to check this, set the h variable in the next cell to y_pred

In [0]:
h = np.array([1, 2])  # y_pred
mean = np.sum(h) / len(h)
std = np.sqrt(np.sum(np.square(h - mean)) / (len(h) - 1))
var = std**2
print('mean: %.2f\nstd: %.3f\nvar: %.2f' % (mean, std, var))
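
The same check can be done without running the model. The sketch below assumes the stored variance is the sample variance (dividing by N - 1), as in the formulas above; an implementation that divides by N would instead produce -1 and 1. It uses only Python's standard `statistics` module.

```python
import statistics

h = [1.0, 2.0]
mean = statistics.mean(h)     # 1.5
var = statistics.variance(h)  # sample variance, divides by N - 1: 0.5

# Normalize with the batch mean and the square root of the sample variance.
normalized = [(value - mean) / var ** 0.5 for value in h]
print([round(value, 4) for value in normalized])  # [-0.7071, 0.7071]

# The normalized values have mean 0 and sample standard deviation 1 (up to rounding).
print(round(statistics.mean(normalized), 6))
print(round(statistics.stdev(normalized), 6))
```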

## L1 and L2 regularization

In this example we define a simple model with an input layer of one number and one Dense layer with one unit (aka neuron).

However, for the specific unit we set the values of the weight and the bias during the initialization:

$$f(x)=wx+b$$

where $w=2$ and $b=1$

We also explicitly define the kernel, bias and activity regularizers to be None (which is their default value)


In [0]:
K.clear_session()

input = Input([1])
output = Dense(1, kernel_initializer=constant(2), bias_initializer=constant(1),
               kernel_regularizer=None, bias_regularizer=None, activity_regularizer=None)(input)
model = Model(input, output)
model.summary()

Now if we run the model on a specific number, let's say 2, we can see that we get the correct result (5)

In [0]:
x = 2
y_pred = model.predict(np.array((x,)))
y_pred[0, 0]

So when we compile the model and evaluate it with the correct numbers we see that the loss is equal to 0

In [0]:
x, y = 2, 5
x, y = np.array((x,)), np.array((y,))
model.compile('sgd', 'mae')
loss = model.evaluate(x, y, verbose=0)
print(loss)

Now let's make some changes. If we change the activity_regularizer, for example, to the l1 norm with a factor of 1, we get the following model

In [0]:
K.clear_session()

input = Input([1])
output = Dense(1, kernel_initializer=constant(2), bias_initializer=constant(1),
               kernel_regularizer=None, bias_regularizer=None, activity_regularizer=l1(1))(input)
model = Model(input, output)
model.summary()

Based on the summary nothing really changed, and if we predict on the same x the result will once again be 5

In [0]:
x = 2
y_pred = model.predict(np.array((x,)))
y_pred[0, 0]

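However, the loss this time is different. This happens because the new loss is:
$$new\_loss=loss+regularization$$
where in our case the regularization is:
$$l_1(a)=\sum{\lambda\cdot|a|}$$

where $a$ is the output (activation) of the layer and $\lambda$ is the factor we pass to the *l1()* function. In our example $a=5$ and $\lambda=1$, so the regularization term is 5 and the new loss is $0+5=5$.

Similarly, we have:
$$l_2(a)=\sum{\lambda\cdot a^2}$$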
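
The arithmetic behind the new loss can be sketched in plain Python, without Keras. The helper names `l1_penalty`, `l2_penalty` and `mae` are hypothetical and simply mirror the formulas above, applied to the activation of our one-unit model.

```python
def l1_penalty(values, factor):
    """factor times the sum of absolute values."""
    return factor * sum(abs(value) for value in values)


def l2_penalty(values, factor):
    """factor times the sum of squares."""
    return factor * sum(value * value for value in values)


def mae(y_true, y_pred):
    """Mean absolute error over a batch."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


w, b, x, y = 2.0, 1.0, 2.0, 5.0
a = w * x + b         # activation of the unit: 5.0
base = mae([y], [a])  # 0.0, the prediction is exact

print(base + l1_penalty([a], factor=1.0))  # 5.0
print(base + l2_penalty([a], factor=1.0))  # 25.0
```

Because the prediction is exact, everything that remains in the reported loss comes from the regularization term.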

In [0]:
x, y = 2, 5
x, y = np.array((x,)), np.array((y,))
model.compile('sgd', 'mae')
loss = model.evaluate(x, y, verbose=0)
print(loss)

Feel free to change:
- the initial values of the weight and the bias of the model
- the type of regularization function for weight, bias or activation
- the factor of each regularization function

to obtain different results
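
Before experimenting, it can help to predict the loss by hand. The sketch below is a plain-Python approximation of what the one-unit model reports for a single sample, assuming the loss is the absolute error plus every active penalty. `regularized_loss` and its `(kind, factor)` tuples are hypothetical conventions introduced for this example, not part of Keras.

```python
def regularized_loss(x, y, w, b, kernel_reg=None, bias_reg=None, activity_reg=None):
    """Absolute error plus penalties; each *_reg is None or a (kind, factor) tuple."""

    def penalty(value, reg):
        if reg is None:
            return 0.0
        kind, factor = reg
        if kind == "l1":
            return factor * abs(value)
        if kind == "l2":
            return factor * value ** 2
        raise ValueError("kind must be 'l1' or 'l2'")

    a = w * x + b  # output of the unit
    return abs(y - a) + penalty(w, kernel_reg) + penalty(b, bias_reg) + penalty(a, activity_reg)


print(regularized_loss(2.0, 5.0, w=2.0, b=1.0))                             # 0.0
print(regularized_loss(2.0, 5.0, w=2.0, b=1.0, kernel_reg=("l2", 1.0)))     # 4.0
print(regularized_loss(2.0, 5.0, w=2.0, b=1.0, bias_reg=("l1", 0.5)))       # 0.5
# Changing the weight to 3 gives an output of 7: an error of 2 plus a penalty of 7.
print(regularized_loss(2.0, 5.0, w=3.0, b=1.0, activity_reg=("l1", 1.0)))   # 9.0
```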