In [None]:
# Copyright 2019 Google LLC
# 
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

<a target="_blank" href="https://colab.research.google.com/github/GoogleCloudPlatform/keras-idiomatic-programmer/blob/master/community-labs/Community Lab - Regularization.ipynb">
<img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>

For best performance in Colab, once the notebook is launched, select **Runtime -> Change Runtime Type** from the dropdown menu, and choose **GPU** for **Hardware Accelerator**.

### Composable "Design Pattern" for AutoML friendly models

## Community Lab 2: Using Regularization to Tackle Overfitting

### Objective

Historically, high accuracy was achieved by training large models. Today, we believe large models succeed because they are a collection of sub-models, one of which is the winning model (the lottery ticket hypothesis).

Today, we try to train compact models. One of the challenges with such a model is that its weights may "fit" themselves to the training data, and not generalize to the validation/test data.

In this lab, we will explore methods of regularization and learning rates to prevent the weights of a compact model from "fitting" to the training data -- without the use of historical methods such as dropout or data augmentation.

*Question*: Can we generalize a compact model without image augmentation?

*Question*: How is training time affected?

*Question*: How small can a compact model be made and maintain accuracy on the validation/test data?

### Approach

We will use the composable design pattern, and prebuilt units from the Google Cloud AI Developer Relations repo: [Model Zoo](https://github.com/GoogleCloudPlatform/keras-idiomatic-programmer/tree/master/zoo)

If you are not familiar with the composable design pattern, we recommend you review the [ResNet](https://github.com/GoogleCloudPlatform/keras-idiomatic-programmer/tree/master/zoo/resnet) model in our zoo.

We recommend keeping the hyperparameters constant, with a batch_size of 32 and an initial learning rate of 0.001 -- but you may use any hyperparameter values you prefer.

We will use the metaparameters feature in the composable design pattern for the macro architecture search -- sort of a 'human assisted AutoML'.
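
The metaparameter idea can be sketched without Keras. The toy functions below, `describe_group` and `describe_learner`, are hypothetical stand-ins for the zoo's `group` and `learner` methods: they return strings rather than tensors, and only illustrate how a macro architecture can be expressed as plain data and unpacked with `**`.

```python
# Toy illustration of the metaparameter pattern (no Keras required).
# `describe_group` and `describe_learner` are hypothetical names, not zoo APIs.

def describe_group(strides=(2, 2), **metaparameters):
    """Summarize one residual group from its metaparameters."""
    n_filters = metaparameters['n_filters']
    n_blocks = metaparameters['n_blocks']
    return f"group(filters={n_filters}, blocks={n_blocks}, strides={strides})"

def describe_learner(groups):
    """Walk a list of group metaparameters; the first group is not strided."""
    first, *rest = groups  # unpack without mutating the caller's list
    summary = [describe_group(strides=(1, 1), **first)]
    summary += [describe_group(**group) for group in rest]
    return summary

# A macro-architecture candidate is just data, so it is easy to vary by hand
# (or by a search procedure).
candidate = [{'n_filters': 16, 'n_blocks': 2},
             {'n_filters': 64, 'n_blocks': 2}]
layout = describe_learner(candidate)
for line in layout:
    print(line)
```

Because each candidate is only a list of dictionaries, trying a different number of blocks or filters means editing data rather than rewriting the model code.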


### Reporting Findings

You can share your findings with us via the Twitter account: @andrewferlitsch

### Dataset

In this notebook, we use the CIFAR-10 dataset, which consists of 32x32x3 images in 10 classes -- but you may use any dataset you prefer.

### Steps

1. Build a baseline (reference) model for CIFAR-10 with no regularization.

2. Add regularization to the classifier (softmax) layer by adding Gaussian noise.

3. Add large and small amounts of L2 regularization to the convolutional and dense layers' weights.

4. Compare the results of different magnitudes of layer regularization.

5. Train with a two-tier learning rate schedule.

## Lab

### Imports

In [None]:
import tensorflow as tf
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Conv2D, ReLU, Add, Dense, GaussianNoise
from tensorflow.keras.layers import BatchNormalization, GlobalAveragePooling2D, Activation
from tensorflow.keras.layers import ZeroPadding2D, MaxPooling2D  # used by ResNetV2.stem
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.regularizers import l2
from tensorflow.keras.datasets import cifar10
import numpy as np

### Get the Dataset

Load the dataset into memory as numpy arrays, and then normalize the image data (preprocessing).
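
As a quick check of what the normalization step does, here is a small sketch on a synthetic array, so it runs without downloading CIFAR-10. The array `fake_images` is made up for illustration; only the scaling expression matches the lab.

```python
import numpy as np

# Synthetic stand-in for a batch of CIFAR-10 images:
# 4 images, 32x32 pixels, 3 channels, 8-bit pixel values.
rng = np.random.default_rng(seed=0)
fake_images = rng.integers(0, 256, size=(4, 32, 32, 3), dtype=np.uint8)

# Same preprocessing as the lab: scale to [0, 1] and store as 32-bit floats.
normalized = (fake_images / 255.0).astype(np.float32)

print(normalized.shape, normalized.dtype, normalized.min(), normalized.max())
```

Casting to `float32` halves the memory used compared with the `float64` array that the division produces by default.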

In [None]:
from tensorflow.keras.datasets import cifar10
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
x_train = (x_train / 255.0).astype(np.float32)
x_test  = (x_test / 255.0).astype(np.float32)
print(x_train.shape)

### Build Baseline Model for CIFAR-10

In [None]:
# from resnet/resnet_v2_c.py

class ResNetV2(object):
    """ Construct a Residual Convolutional Neural Network V2 """
    # Meta-parameter: list of groups: number of filters and number of blocks
    groups = { 50 : [ { 'n_filters' : 64, 'n_blocks': 3 },
                      { 'n_filters': 128, 'n_blocks': 4 },
                      { 'n_filters': 256, 'n_blocks': 6 },
                      { 'n_filters': 512, 'n_blocks': 3 } ],            # ResNet50
               101: [ { 'n_filters' : 64, 'n_blocks': 3 },
                      { 'n_filters': 128, 'n_blocks': 4 },
                      { 'n_filters': 256, 'n_blocks': 23 },
                      { 'n_filters': 512, 'n_blocks': 3 } ],            # ResNet101
               152: [ { 'n_filters' : 64, 'n_blocks': 3 },
                      { 'n_filters': 128, 'n_blocks': 8 },
                      { 'n_filters': 256, 'n_blocks': 36 },
                      { 'n_filters': 512, 'n_blocks': 3 } ]             # ResNet152
             }
    init_weights = 'he_normal'
    reg=l2(0.001)
    _model = None

    def __init__(self, n_layers, input_shape=(224, 224, 3), n_classes=1000):
        """ Construct a Residual Convolutional Neural Network V2
            n_layers   : number of layers
            input_shape: input shape
            n_classes  : number of output classes
        """
        # predefined
        if isinstance(n_layers, int):
            if n_layers not in [50, 101, 152]:
                raise Exception("ResNet: Invalid value for n_layers")
            groups = self.groups[n_layers]
        # user defined
        else:
            groups = n_layers

        # The input tensor
        inputs = Input(input_shape)

        # The stem convolutional group
        x = self.stem(inputs)

        # The learner
        x = self.learner(x, groups=groups)

        # The classifier 
        outputs = self.classifier(x, n_classes)

        # Instantiate the Model
        self._model = Model(inputs, outputs)

    @property
    def model(self):
        return self._model

    @model.setter
    def model(self, _model):
        self._model = _model

    def stem(self, inputs):
        """ Construct the Stem Convolutional Group 
            inputs : the input vector
        """
        # The 224x224 images are zero padded (black - no signal) to be 230x230 images prior to the first convolution
        x = ZeroPadding2D(padding=(3, 3))(inputs)
    
        # First Convolutional layer uses large (coarse) filter
        x = Conv2D(64, (7, 7), strides=(2, 2), padding='valid', use_bias=False, 
                   kernel_initializer=self.init_weights, kernel_regularizer=self.reg)(x)
        x = BatchNormalization()(x)
        x = ReLU()(x)
    
        # Pooled feature maps will be reduced by 75%
        x = ZeroPadding2D(padding=(1, 1))(x)
        x = MaxPooling2D((3, 3), strides=(2, 2))(x)
        return x

    def learner(self, x, **metaparameters):
        """ Construct the Learner
            x     : input to the learner
            groups: list of groups: number of filters and blocks
        """
        groups = metaparameters['groups']

        # First Residual Block Group (not strided)
        # Index rather than pop(), so the shared class-level group lists are not mutated
        x = ResNetV2.group(x, strides=(1, 1), **groups[0])

        # Remaining Residual Block Groups (strided)
        for group in groups[1:]:
            x = ResNetV2.group(x, **group)
        return x
    
    @staticmethod
    def group(x, strides=(2, 2), init_weights=None, **metaparameters):
        """ Construct a Residual Group
            x         : input into the group
            strides   : whether the projection block is a strided convolution
            n_filters : number of filters for the group
            n_blocks  : number of residual blocks with identity link
        """
        n_blocks  = metaparameters['n_blocks']

        # Double the size of filters to fit the first Residual Group
        x = ResNetV2.projection_block(x, strides=strides, init_weights=init_weights, **metaparameters)

        # Identity residual blocks
        for _ in range(n_blocks):
            x = ResNetV2.identity_block(x, init_weights=init_weights, **metaparameters)
        return x

    @staticmethod
    def identity_block(x, init_weights=None, **metaparameters):
        """ Construct a Bottleneck Residual Block with Identity Link
            x        : input into the block
            n_filters: number of filters
            reg      : kernel regularizer
        """
        n_filters = metaparameters['n_filters']
        if 'reg' in metaparameters:
            reg = metaparameters['reg']
        else:
            reg = ResNetV2.reg

        if init_weights is None:
            init_weights = ResNetV2.init_weights
    
        # Save input vector (feature maps) for the identity link
        shortcut = x
    
        ## Construct the 1x1, 3x3, 1x1 convolution block
    
        # Dimensionality reduction
        x = BatchNormalization()(x)
        x = ReLU()(x)
        x = Conv2D(n_filters, (1, 1), strides=(1, 1), use_bias=False, 
                   kernel_initializer=init_weights, kernel_regularizer=reg)(x)

        # Bottleneck layer
        x = BatchNormalization()(x)
        x = ReLU()(x)
        x = Conv2D(n_filters, (3, 3), strides=(1, 1), padding="same", use_bias=False, 
                   kernel_initializer=init_weights, kernel_regularizer=reg)(x)

        # Dimensionality restoration - increase the number of output filters by 4X
        x = BatchNormalization()(x)
        x = ReLU()(x)
        x = Conv2D(n_filters * 4, (1, 1), strides=(1, 1), use_bias=False, 
                   kernel_initializer=init_weights, kernel_regularizer=reg)(x)

        # Add the identity link (input) to the output of the residual block
        x = Add()([shortcut, x])
        return x

    @staticmethod
    def projection_block(x, strides=(2,2), init_weights=None, **metaparameters):
        """ Construct a Bottleneck Residual Block of Convolutions with Projection Shortcut
            Increase the number of filters by 4X
            x        : input into the block
            strides  : whether the first convolution is strided
            n_filters: number of filters
            reg      : kernel regularizer
        """
        n_filters = metaparameters['n_filters']
        if 'reg' in metaparameters:
            reg = metaparameters['reg']
        else:
            reg = ResNetV2.reg

        if init_weights is None:
            init_weights = ResNetV2.init_weights

        # Construct the projection shortcut
        # Increase filters by 4X to match shape when added to output of block
        shortcut = BatchNormalization()(x)
        shortcut = Conv2D(4 * n_filters, (1, 1), strides=strides, use_bias=False, 
                          kernel_initializer=init_weights, kernel_regularizer=reg)(shortcut)

        ## Construct the 1x1, 3x3, 1x1 convolution block
    
        # Dimensionality reduction
        x = BatchNormalization()(x)
        x = ReLU()(x)
        x = Conv2D(n_filters, (1, 1), strides=(1,1), use_bias=False, 
                   kernel_initializer=init_weights, kernel_regularizer=reg)(x)

        # Bottleneck layer
        # Feature pooling when strides=(2, 2)
        x = BatchNormalization()(x)
        x = ReLU()(x)
        x = Conv2D(n_filters, (3, 3), strides=strides, padding='same', use_bias=False, 
                   kernel_initializer=init_weights, kernel_regularizer=reg)(x)

        # Dimensionality restoration - increase the number of filters by 4X
        x = BatchNormalization()(x)
        x = ReLU()(x)
        x = Conv2D(4 * n_filters, (1, 1), strides=(1, 1), use_bias=False, 
                   kernel_initializer=init_weights, kernel_regularizer=reg)(x)

        # Add the projection shortcut to the output of the residual block
        x = Add()([x, shortcut])
        return x

    def classifier(self, x, n_classes):
        """ Construct the Classifier Group 
            x         : input to the classifier
            n_classes : number of output classes
        """
        # Pool at the end of all the convolutional residual blocks
        x = GlobalAveragePooling2D()(x)

        # Final Dense Outputting Layer for the outputs
        outputs = Dense(n_classes, activation='softmax', 
                        kernel_initializer=self.init_weights, kernel_regularizer=self.reg)(x)
        return outputs

In [None]:
def makeModel(reg=None, n_blocks=4, lr=0.001, noise=None):
    ResNetV2.reg = reg
    
    # Stem
    inputs = Input((32, 32, 3))
    x = Conv2D(32, (3, 3), strides=(1, 1), padding='same', 
               kernel_initializer='he_normal', kernel_regularizer=reg)(inputs)
    x = BatchNormalization()(x)
    x = ReLU()(x)

    # Learner
    x = ResNetV2.group(x, n_blocks=n_blocks, n_filters=16)
    x = ResNetV2.group(x, n_blocks=n_blocks, n_filters=64)
    x = ResNetV2.group(x, n_blocks=n_blocks, n_filters=128)

    # Classifier
    x = GlobalAveragePooling2D()(x)
    
    if noise:
        x = GaussianNoise(noise)(x)
        x = ReLU()(x)
        
    outputs = Dense(10, activation='softmax',
                    kernel_initializer='he_normal', kernel_regularizer=reg)(x)
    
    resnet = Model(inputs, outputs)
    resnet.compile(loss='sparse_categorical_crossentropy', optimizer=Adam(learning_rate=lr), metrics=['acc'])
    return resnet

### Train Model and Tackle Overfitting

This small model still has enough parameters that it can fit itself to the training data. As is (after 10 epochs), the validation/test accuracy will plateau at ~73%, while the training accuracy has climbed to 91%. But if we reduce the size of the model, we eliminate too many of the parameters needed to increase accuracy.
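
One simple way to track overfitting is the gap between training and validation accuracy per epoch. The helper below, `generalization_gap`, is a hypothetical name, and the accuracy lists are illustrative values whose last entries mirror the figures quoted above. In a real run, we assume the lists would come from the `History` object returned by `fit` (keys `acc` and `val_acc` for the metric name used in this lab).

```python
def generalization_gap(train_acc, val_acc):
    """Return the per-epoch gap (training minus validation accuracy)."""
    return [round(t - v, 4) for t, v in zip(train_acc, val_acc)]

# Illustrative numbers only; the final epoch mirrors the 91% / 73% quoted above.
train_acc = [0.55, 0.72, 0.84, 0.91]
val_acc = [0.53, 0.66, 0.71, 0.73]

gaps = generalization_gap(train_acc, val_acc)
print(gaps)
```

A gap that keeps widening from epoch to epoch is the signal that the regularization steps in this lab try to suppress.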

#### Base Model

Let's first train as-is to demonstrate.

In [None]:
resnet = makeModel()
resnet.summary()
resnet.fit(x_train, y_train, epochs=10, batch_size=32, validation_split=0.1, verbose=1)
resnet.evaluate(x_test, y_test)

#### Gaussian Noise

Let's try adding some noise to the input of the output classification layer. This will act as a regularizer. Note how we added a ReLU() afterwards. If we did not, some of the values fed to the classifier might be negative because of the noise (as if it were a leaky ReLU).

As is (after 10 epochs), the training accuracy remains unchanged, but the validation/test accuracy has crept up a small amount to ~75%.
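
To see the effect of noise followed by a ReLU in isolation, here is a NumPy sketch. The `pooled` vector is made up, and the code only mimics what the `GaussianNoise` and `ReLU` layers do to their input; note that the Keras `GaussianNoise` layer is active only during training.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Made-up stand-in for the pooled features that feed the classifier.
pooled = np.array([0.0, 0.02, 0.05, 0.5, 1.2], dtype=np.float32)

# Additive zero-mean noise with standard deviation 0.1 (the lab's setting).
noise = rng.normal(loc=0.0, scale=0.1, size=pooled.shape).astype(np.float32)
noisy = pooled + noise

# ReLU clips any value that ended up below zero.
rectified = np.maximum(noisy, 0.0)

print("noisy:    ", noisy)
print("rectified:", rectified)
```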

In [None]:
resnet = makeModel(noise=0.1)
resnet.summary()
resnet.fit(x_train, y_train, epochs=10, batch_size=32, validation_split=0.1, verbose=1)
resnet.evaluate(x_test, y_test)

#### Layer Regularization

Let's use an aggressive form of kernel regularization (L2 regularization with rate = 0.01) -- this penalizes large weights to prevent the model from snapping to the training data, but may greatly reduce the rate of learning, or prevent learning at all.

As is (after 10 epochs), the training accuracy will plateau at ~60%. The model just won't learn at this level of aggressive layer regularization.
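
For intuition about what the rate controls, the sketch below computes an L2 penalty by hand: the rate times the sum of squared weights, which is the term the Keras `l2` regularizer adds to the loss for each regularized layer. The `kernel` values are made up, and `l2_penalty` is a hypothetical helper, not a Keras function. The two rates are the ones used in this lab.

```python
import numpy as np

def l2_penalty(weights, rate):
    """L2 penalty added to the loss: rate * sum of squared weights."""
    return rate * float(np.sum(np.square(weights)))

# Made-up 3x3 kernel, for illustration only.
kernel = np.array([[0.5, -0.2, 0.1],
                   [0.3, 0.0, -0.4],
                   [-0.1, 0.2, 0.6]])

aggressive = l2_penalty(kernel, 0.01)   # rate from the aggressive run
mild = l2_penalty(kernel, 0.001)        # rate from the less aggressive run
print(aggressive, mild)
```

The penalty scales linearly with the rate, so for the same weights the aggressive setting contributes ten times more to the loss. Summed over every regularized layer, that term can come to dominate the classification loss.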

In [None]:
resnet = makeModel(noise=0.1, reg=l2(0.01))
resnet.fit(x_train, y_train, epochs=10, batch_size=32, validation_split=0.1, verbose=1)
resnet.evaluate(x_test, y_test)

Let's now try a less aggressive amount of regularization (reduced by an order of magnitude, to 0.001). The rate of increase in training accuracy will slow down and track the validation accuracy more closely. We can now increase the number of epochs to 30.

As is (after 30 epochs), the validation/test accuracy has crept up a modest amount to ~80%, while the training accuracy has also plateaued around 80%.

In [None]:
resnet = makeModel(noise=0.1, reg=l2(0.001))
resnet.fit(x_train, y_train, epochs=30, batch_size=32, validation_split=0.1, verbose=1)
resnet.evaluate(x_test, y_test)

#### Learning Rate

You can see that we are plateauing around 80% on the validation/test data after 30 epochs, and the training accuracy seems to have plateaued as well. This suggests that the weight updates are bouncing back and forth trying to fit the training data, so the lines of linear separation shift slightly and cause swings in the validation loss.

Let's address this by dropping the learning rate by an order of magnitude after 30 epochs, and running another 10. We can now see the validation/test accuracy climb and plateau at ~84%.
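
The two-tier idea can also be written as a function of the epoch index. The sketch below is pure Python; the constant names and `two_tier_lr` are hypothetical, with values taken from this lab. We assume a function of this shape could be passed to `tf.keras.callbacks.LearningRateScheduler` as an alternative to recompiling, but that is not what the lab's own code does.

```python
INITIAL_LR = 0.001   # rate used for the first tier (from this lab)
DROP_EPOCH = 30      # 0-based epoch at which the second tier begins
DROP_FACTOR = 0.1    # one order of magnitude

def two_tier_lr(epoch, current_lr=None):
    """Two-tier schedule; `current_lr` is accepted but deliberately ignored,
    so the drop is applied once rather than compounded every epoch."""
    if epoch < DROP_EPOCH:
        return INITIAL_LR
    return INITIAL_LR * DROP_FACTOR

# 30 epochs at the initial rate, then 10 more at the reduced rate.
rates = [two_tier_lr(epoch) for epoch in range(40)]
print(rates[0], rates[29], rates[30], rates[39])
```

One design difference to keep in mind: recompiling the model creates a fresh optimizer, whereas a schedule changes the rate while keeping the optimizer's accumulated state.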

In [None]:
resnet.compile(loss='sparse_categorical_crossentropy', optimizer=Adam(learning_rate=0.0001), metrics=['acc'])
# Run another 10 epochs at the reduced learning rate, as described above
resnet.fit(x_train, y_train, epochs=10, batch_size=32, validation_split=0.1, verbose=1)
resnet.evaluate(x_test, y_test)

## Next

Think about how you can modify this experiment to meet the objectives.