# ResNet networks
ResNet is a type of CNN designed by Microsoft Research to compete in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). In the 2015 contest, ResNet took first place in all categories of the ImageNet and Common Objects in Context (COCO) competitions.

The VGGNet design pattern covered in the previous section was limited in how many layers deep the model architecture could go before suffering from vanishing and exploding gradients.


The researchers behind the residual block, the design pattern component of the residual network, proposed a novel layer connection they called an identity link. The identity link introduced the earliest concept of feature reuse. Prior to the identity link, each convolutional block did feature extraction on the previous convolutional output, without retaining any knowledge from prior outputs.

Concurrently with ResNet, other researchers, such as those at Google with Inception v1 (GoogLeNet), further refined convolutional design patterns into groups and blocks. In parallel to these design improvements was the introduction of batch normalization.

Using identity links along with batch normalization provided more stability across layers, reducing both vanishing and exploding gradients and divergence between layers, allowing model architectures to go deeper in layers to increase accuracy in prediction.

## Architecture
ResNet, and other architectures within this class, use different **layer-to-layer** connection patterns. The patterns we’ve discussed so far (ConvNet and VGG) use the fully connected layer-to-layer pattern.

ResNet34 introduced a new block layer and a new layer-connection pattern:
* residual blocks, and
* identity connections, respectively.

Each block has an identity connection that creates a parallel path between the input of the residual block and its output, as depicted in figure 3.11. As in VGG, each successive block doubles the number of filters. Pooling is done at the end of the sequence of blocks.

<img src="img_12.png">

One of the problems with neural networks is that as we add deeper layers (under the presumption of increasing accuracy), their performance can degrade. It can get worse, not better. This occurs for several reasons.

1. As we go deeper, we are adding more parameters (weights). The more parameters there are, the more room the network has to fit each training example to the excess parameters. Instead of generalizing, the neural network will simply learn each training example (rote memorization). (1)
2. The other issue is **covariate shift**: the distribution of the weights will widen (spread further apart) as we go deeper, making it more difficult for the neural network to converge. (2)

The former case (1) causes a degradation in performance on the test (holdout) data. The latter (2) causes a degradation on the training data as well, along with vanishing or exploding gradients.

Residual blocks allow neural networks to be built with deeper layers without a degradation in performance on the test data.

A ResNet block can be viewed as a VGG block with the addition of the identity link. While the VGG-style part of the block performs feature detection, the identity link retains the block’s input, so the input to the next block consists of both the detected features and the previous input.

By retaining information from the past (the previous input), this block design allows neural networks to go deeper than their VGG counterparts, with an increase in accuracy:

* VGG: h(x) = f(x, {W})
* ResNet: h(x) = f(x, {W}) + x
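
One way to see why the added x helps is to differentiate both forms with respect to the input. The following is a sketch under a simplifying assumption that x and f are scalars; for real layers the 1 becomes an identity matrix.

```latex
\begin{aligned}
\text{VGG:}\quad    & h(x) = f(x, \{W\}),     & \frac{\partial h}{\partial x} &= \frac{\partial f}{\partial x} \\
\text{ResNet:}\quad & h(x) = f(x, \{W\}) + x, & \frac{\partial h}{\partial x} &= \frac{\partial f}{\partial x} + 1
\end{aligned}
```

Even when the derivative of f is close to zero, the identity term leaves a path through which the gradient can flow back past the block.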

In [11]:
from keras import Model, Sequential
from keras import layers
from keras.layers import Dense, Conv2D, MaxPooling2D, Input, GlobalAveragePooling2D, ReLU, BatchNormalization, Flatten

In [2]:
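def resnet_blk_example(X, num_filters):
    # Keep a reference to the block input for the identity link.
    short_cut = X
    # Conv2D requires a kernel size; "same" padding keeps the feature map size
    # so the output can be added to the input. X must already have num_filters channels.
    X = Conv2D(num_filters, kernel_size=(3, 3), padding="same", activation="relu")(X)
    X = Conv2D(num_filters, kernel_size=(3, 3), padding="same", activation="relu")(X)
    # Matrix addition of the block input and the convolution output.
    X = layers.add([short_cut, X])
    return X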


<img src="img_13.png">

The ResNet architectures take as input a (224, 224, 3) tensor: an RGB image (3 channels) of 224 (height) × 224 (width) pixels. The first layer is a basic convolutional layer, consisting of a convolution using a fairly large filter size of 7 × 7. The output (feature maps) is then reduced in size by a max pooling layer.

After the initial convolutional layer is a succession of groups of residual blocks. Each successive group doubles the number of filters (similar to VGG). Unlike VGG, though, there is no pooling layer between the groups that would reduce the size of the feature maps.

The input to the next block has a shape based on the previous block’s filter size (let’s call it X). The next block, by doubling the filters, will cause the output of that residual block to be double in size (let’s call it 2X). The identity link would attempt to add the input matrix (X) and the output matrix (2X). Yikes: we get an error, indicating we can’t broadcast (for the add operation) matrices of different sizes. For ResNet, this is solved by adding a convolutional block between each “doubling” group of residual blocks. As depicted in figure 3.12, the convolutional block **doubles the filters** to **reshape the size** and **doubles the stride** to **reduce the feature map** size by 75% (performs feature pooling).

<img src="img_14.png">
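
The shape mismatch and its fix can be checked with a few lines of plain Python, without building a model. This is a minimal sketch: the helper names `conv2d_output_shape` and `can_add` and the 56 × 56 × 64 starting shape are assumptions chosen for illustration, and the size rules are the usual ones for "same" and "valid" padding.

```python
import math


def conv2d_output_shape(input_shape, n_filters, stride, padding="same", kernel_size=3):
    """Return the (height, width, channels) a 2D convolution would produce.

    "same" padding gives ceil(size / stride); "valid" padding gives
    ceil((size - kernel_size + 1) / stride).
    """
    height, width, _ = input_shape
    if padding == "same":
        out_h = math.ceil(height / stride)
        out_w = math.ceil(width / stride)
    elif padding == "valid":
        out_h = math.ceil((height - kernel_size + 1) / stride)
        out_w = math.ceil((width - kernel_size + 1) / stride)
    else:
        raise ValueError(f"unknown padding: {padding!r}")
    return (out_h, out_w, n_filters)


def can_add(shape_a, shape_b):
    """An element-wise add needs both tensors to have identical shapes."""
    return shape_a == shape_b


# Hypothetical output of a 64-filter group of residual blocks.
group_input = (56, 56, 64)

# Doubling the filters inside a residual block breaks the identity link.
doubled = conv2d_output_shape(group_input, n_filters=128, stride=1)
print(can_add(group_input, doubled))  # False

# A convolutional block with doubled filters and stride 2 sets the new shape...
transition = conv2d_output_shape(group_input, n_filters=128, stride=2)
# ...and a residual block at 128 filters preserves it, so the add works.
residual_out = conv2d_output_shape(transition, n_filters=128, stride=1)
print(can_add(transition, residual_out))  # True

# Fraction of spatial positions removed by the stride-2 convolution.
reduction = 1 - (transition[0] * transition[1]) / (group_input[0] * group_input[1])
print(transition, reduction)  # (28, 28, 128) 0.75
```

Halving both the height and the width leaves one quarter of the positions, which is where the 75% figure comes from.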

In [3]:
def residual_block(n_filters, X):
    short_cut = X
    X = Conv2D(n_filters, kernel_size=(3,3), strides=(1,1), padding="same", activation="relu")(X)
    X = Conv2D(n_filters, kernel_size=(3,3), strides=(1,1), padding="same", activation="relu")(X)
    X = layers.add([short_cut, X])
    return X

In [4]:
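def conv_block(n_filters, X):
    # The strided convolution halves the height and width (75% fewer positions)
    # and switches to the new number of filters.
    X = Conv2D(n_filters, kernel_size=(3,3), strides=(2,2), padding="same", activation="relu")(X)
    # Stride 1 here, so the block downsamples only once, as the text describes.
    X = Conv2D(n_filters, kernel_size=(3,3), strides=(1,1), padding="same", activation="relu")(X)
    return X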

In [8]:
resnet_input = Input((224, 224, 3))
X = Conv2D(64, kernel_size=(7,7), strides=(1,1), padding="same", activation="relu")(resnet_input)
X = MaxPooling2D(pool_size=(3, 3), strides=(2, 2), padding='same')(X)

for _ in range(2):
    X = residual_block(64, X)

X = conv_block(128, X)

for _ in range(3):
    X = residual_block(128, X)

X = conv_block(256, X)

for _ in range(3):
    X = residual_block(256, X)

X = conv_block(512, X)

for _ in range(3):
    X = residual_block(512, X)

X = GlobalAveragePooling2D()(X)
resnet_output = Dense(1000, activation='softmax')(X)
resnet_model = Model(resnet_input, resnet_output)
resnet_model.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_2 (InputLayer)           [(None, 224, 224, 3  0           []                               
                                )]                                                                
                                                                                                  
 conv2d_29 (Conv2D)             (None, 224, 224, 64  9472        ['input_2[0][0]']                
                                )                                                                 
                                                                                                  
 max_pooling2d_1 (MaxPooling2D)  (None, 112, 112, 64  0          ['conv2d_29[0][0]']              
                                )                                                             

The full output of model.summary() shows that the total number of parameters to learn is about 21 million. This is in contrast to VGG16, which has 138 million parameters. So the ResNet architecture has roughly one-sixth as many parameters to store and update. This reduction is mostly achieved by the construction of the residual blocks. Notice that the DNN backend is just a single output Dense layer. In effect, there is no backend. The early residual block groups act as the CNN frontend doing the feature detection, while the latter residual blocks perform the classification. As a result, unlike in VGG, there is no need for several fully connected dense layers, which would have substantially increased the number of parameters.
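
The parameter counts can be reproduced by hand. The sketch below uses the standard formulas for convolutional and dense layers; the helper names are made up for illustration, and the VGG16 dense-layer sizes (a flattened 7 × 7 × 512 volume feeding 4,096 units) are an assumption taken from the standard VGG16 configuration rather than from this section.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, n_filters):
    """Weights plus one bias per filter for a standard 2D convolution."""
    return kernel_h * kernel_w * in_channels * n_filters + n_filters


def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit for a fully connected layer."""
    return n_inputs * n_units + n_units


# First layer of the model above: 7 x 7 kernels over 3 channels, 64 filters.
stem = conv2d_params(7, 7, 3, 64)
print(stem)  # 9472, matching the Param # column in the summary

# Classifier after global average pooling: 512 features into 1000 classes.
resnet_head = dense_params(512, 1000)
print(resnet_head)  # 513000

# For contrast, the first dense layer of a standard VGG16 (assumed sizes).
vgg16_first_dense = dense_params(7 * 7 * 512, 4096)
print(vgg16_first_dense)  # 102764544
```

A single dense layer fed by a flattened feature volume outweighs the entire ResNet classifier by a factor of about 200, which is why dropping the dense backend matters so much.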

Unlike the previous example of pooling, in which the size of each feature map is reduced according to the size of the stride, GlobalAveragePooling2D is like a supercharged version of pooling: each feature map is replaced by a single value, which in this case is the average of all values in the corresponding feature map. For example, if the input is 256 feature maps, the output will be a 1D vector of size 256.
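
To make the operation concrete, here is a minimal plain-Python sketch of global average pooling on a toy input. The function name and the nested-list layout are assumptions for illustration; Keras performs the same reduction on tensors.

```python
def global_average_pooling_2d(feature_maps):
    """Average each feature map down to one value.

    `feature_maps` is a nested list laid out as [height][width][channels],
    the channels-last layout Keras uses by default.
    """
    height = len(feature_maps)
    width = len(feature_maps[0])
    n_channels = len(feature_maps[0][0])
    totals = [0.0] * n_channels
    for row in feature_maps:
        for pixel in row:
            for channel, value in enumerate(pixel):
                totals[channel] += value
    return [total / (height * width) for total in totals]


# A toy 2 x 2 input with 3 feature maps (channels).
toy = [
    [[1.0, 10.0, 0.0], [2.0, 20.0, 0.0]],
    [[3.0, 30.0, 0.0], [4.0, 40.0, 4.0]],
]
pooled = global_average_pooling_2d(toy)
print(pooled)  # [2.5, 25.0, 1.0]
```

Three feature maps go in and a vector of three values comes out, regardless of the height and width of the maps.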

After ResNet, it became the general practice for deep convolutional neural networks to use GlobalAveragePooling2D at the last pooling stage, which benefited from a substantial reduction of the number of parameters coming into the classifier, without significant loss in representational power.

Another advantage is the identity link, which provided the ability to add deeper layers, without degradation, for higher accuracy.


ResNet50 introduced a variation of the residual block referred to as the **bottleneck residual block**. In this version, the group of two 3 × 3 convolutional layers is replaced by a group of 1 × 1, then 3 × 3, and then 1 × 1 convolutional layers. The first 1 × 1 convolution performs a dimensionality reduction, reducing the computational complexity, and the last convolution restores the dimensionality, increasing the number of filters by a factor of 4. The middle 3 × 3 convolution is referred to as the bottleneck convolution, like the neck of a bottle. The bottleneck residual block, depicted in figure 3.13, allows for deeper neural networks, without degradation, and a further reduction in computational complexity.

<img src="img_15.png">

In [9]:
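def bottleneck_block(n_filters, X):
    # The identity link is added unchanged, so X must already have
    # n_filters*4 channels for the add below to work.
    short_cut = X
    # 1 x 1 convolution: dimensionality reduction
    X = Conv2D(n_filters, kernel_size=(1,1), strides=(1,1), padding="same", activation="relu")(X)
    # 3 x 3 bottleneck convolution
    X = Conv2D(n_filters, kernel_size=(3,3), strides=(1,1), padding="same", activation="relu")(X)
    # 1 x 1 convolution: restores the dimensionality (4 times the filters)
    X = Conv2D(n_filters*4, kernel_size=(1,1), strides=(1,1), padding="same", activation="relu")(X)
    X = layers.add([short_cut, X])
    return X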

Residual blocks introduced the concepts of representational power and representational equivalence.

1. **Representational power** is a measure of how powerful a block is as a *feature extractor*.
2. **Representational equivalence** is the idea that a block can be factored into a **lower computational complexity**, while **maintaining representational power**.
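
As a rough illustration of the lower computational complexity, the parameter counts of a plain residual block and a bottleneck residual block can be compared directly. This is a sketch under an assumed width of 256 channels entering and leaving the block; the helper name is hypothetical, and parameter count is used here only as a proxy for computational cost.

```python
def square_conv_params(kernel_size, in_channels, n_filters):
    """Weights plus biases for a 2D convolution with a square kernel."""
    return kernel_size * kernel_size * in_channels * n_filters + n_filters


channels = 256           # channels entering and leaving the block (assumed)
reduced = channels // 4  # 64 filters inside the bottleneck

# Plain residual block: two 3 x 3 convolutions at full width.
plain = 2 * square_conv_params(3, channels, channels)

# Bottleneck residual block: 1 x 1 reduce, 3 x 3, then 1 x 1 restore.
bottleneck = (
    square_conv_params(1, channels, reduced)
    + square_conv_params(3, reduced, reduced)
    + square_conv_params(1, reduced, channels)
)

print(plain)       # 1180160
print(bottleneck)  # 70016
```

At this width the bottleneck form needs well under a tenth of the parameters of the plain form while still producing a 256-channel output.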


# Batch normalization
Another problem with adding deeper layers in a neural network is the vanishing gradient problem. This is actually about computer hardware. During training (the process of backward propagation and gradient descent), at each layer the weights are multiplied by very small numbers—specifically, numbers less than 1. As you know, two numbers less than 1 multiplied together make an even smaller number. When these tiny values are propagated through deeper layers, they continuously get smaller. At some point, the computer hardware can’t represent the value anymore—hence, the vanishing gradient.

The problem is further exacerbated if we try to use half-precision floats (16-bit floats) for the matrix operations versus single-precision floats (32-bit floats). The advantage of the former is that the weights (and data) are stored in half the amount of space—and using a general rule of thumb, by reducing the computational size in half, we can execute four times as many instructions per computing cycle. The problem, of course, is that with even smaller precision, we will encounter the vanishing gradient even sooner.
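
The effect of precision can be simulated with the standard library alone. The sketch below rounds a value to half or single precision after every multiplication and counts how many multiplications it takes to reach exactly zero. The factor of 0.1 per layer and the helper names are arbitrary assumptions for illustration; real gradients do not shrink this uniformly.

```python
import struct


def round_to(fmt, value):
    """Round a Python float to the precision of a struct format.

    "<e" is IEEE 754 half precision (16-bit); "<f" is single precision (32-bit).
    """
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


def layers_until_zero(fmt, factor=0.1, max_layers=1000):
    """Count the multiplications by `factor` needed to reach exactly 0."""
    value = 1.0
    for n_layers in range(1, max_layers + 1):
        value = round_to(fmt, value * factor)
        if value == 0.0:
            return n_layers
    return None  # never reached zero within max_layers


half_layers = layers_until_zero("<e")
single_layers = layers_until_zero("<f")
print("half precision:", half_layers)
print("single precision:", single_layers)
```

The half-precision count comes out far smaller than the single-precision one, which matches the point that smaller floats run out of representable values sooner.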


Batch normalization is a technique applied to the output of a layer (before or after the activation function). Without going into the statistics, it normalizes the layer’s outputs across each batch, which limits how much their distribution shifts as the weights are trained. This has several advantages:
1. It smooths out (across a batch) the amount of change, making it less likely that a value becomes so small that the hardware can’t represent it.
2. By narrowing the amount of shift, it allows a higher learning rate, so convergence can happen sooner and the overall training time is reduced.
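
For readers who do want a glimpse of the statistics, the core training-time computation fits in a few lines. This is a minimal sketch for a single feature across one batch; `gamma`, `beta`, and `epsilon` stand in for the learned scale, the learned offset, and a small stabilizing constant, and their default values here are illustrative assumptions. A real layer also tracks moving averages for use at inference time, which is omitted.

```python
import math


def batch_normalize(batch, gamma=1.0, beta=0.0, epsilon=1e-3):
    """Normalize one feature across a batch, then scale and shift."""
    mean = sum(batch) / len(batch)
    variance = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [gamma * (x - mean) / math.sqrt(variance + epsilon) + beta for x in batch]


# Outputs of one unit for a batch of four examples, on very different scales.
activations = [0.001, 0.002, 50.0, 100.0]
normalized = batch_normalize(activations)

print([round(v, 3) for v in normalized])
# The normalized values are centered on zero with roughly unit variance.
print(abs(sum(normalized) / len(normalized)) < 1e-9)  # True
```

After normalization the tiny and the large values sit on a comparable scale, so the next layer sees inputs in a stable range from batch to batch.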

In earlier implementations, batch normalization was implemented post-activation: it would occur after the convolution and dense layers and their activation functions. At the time, it was debated whether the batch normalization should be before or after the activation function. The following example shows both placements:



In [13]:
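model_sample = Sequential([
    Conv2D(64, kernel_size=3, strides=2, padding="same", input_shape=(128, 128, 3)),
    # Batch normalization before the activation function
    BatchNormalization(),
    ReLU(),
    Flatten(),
    Dense(1024),
    ReLU(),
    # Batch normalization after the activation function
    BatchNormalization()])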

# ResNet50
ResNet50 is a well-known model that is commonly reused as a stock model, such as for transfer learning, as shared layers in object detection, and for performance benchmarking. The model has three versions: v1, v1.5, and v2.

ResNet50 v1 formalized the concept of a convolutional group. This is a set of convolutional blocks that share a common configuration, such as the number of filters. In v1, the neural network is decomposed into groups, and each group doubles the number of filters from the previous group.

Additionally, the concept of a separate convolution block to double the number of filters was removed and replaced by a residual block that uses linear projection. Each group starts with a residual block using linear projection on the identity link to double the number of filters, while the remaining residual blocks pass the input directly to the output for the matrix add operation. Additionally, the first 1 × 1 convolution in the residual block with linear projection uses a **stride of 2** (feature pooling), which is also known as a **strided convolution**, reducing the feature map sizes by 75%, as depicted in figure 3.14.


<img src="img_16.png">
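
The effect of the groups on the feature maps can be traced with simple arithmetic. The sketch below assumes a 224 × 224 input, a stem that halves the size twice (a strided 7 × 7 convolution followed by max pooling), a first group whose projection block uses a stride of 1, and later groups that use a stride of 2. The function name is hypothetical.

```python
import math


def trace_resnet50_groups(input_size=224):
    """Return (feature map size, output channels) after each residual group."""
    # Stem: 7 x 7 convolution with stride 2, then 3 x 3 max pooling with stride 2.
    size = math.ceil(input_size / 2)
    size = math.ceil(size / 2)

    shapes = []
    # (filters inside the bottleneck, stride of the group's projection block)
    for n_filters, stride in [(64, 1), (128, 2), (256, 2), (512, 2)]:
        size = math.ceil(size / stride)       # strided convolution in the projection block
        shapes.append((size, n_filters * 4))  # last 1 x 1 convolution outputs 4x the filters
    return shapes


for size, channels in trace_resnet50_groups():
    print(f"{size} x {size} x {channels}")
# 56 x 56 x 256
# 28 x 28 x 512
# 14 x 14 x 1024
# 7 x 7 x 2048
```

Each strided projection block quarters the number of spatial positions while the channel count doubles from group to group.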

The following is an implementation of ResNet50 v1 using the bottleneck block combined with batch normalization:

In [42]:
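def projection_block(n_filters, input_X, stride=(2,2)):
    # Linear projection on the identity link: a 1 x 1 convolution that matches
    # the filter count (and stride) of the block output
    short_cut = Conv2D(n_filters*4, kernel_size=(1,1), strides=stride)(input_X)
    short_cut = BatchNormalization()(short_cut)

    # 1 x 1 strided convolution: dimensionality reduction and feature pooling
    hidden_X = Conv2D(n_filters, kernel_size=(1,1), strides=stride)(input_X)
    hidden_X = BatchNormalization()(hidden_X)
    hidden_X = ReLU()(hidden_X)

    # 3 x 3 bottleneck convolution
    hidden_X = Conv2D(n_filters, kernel_size=(3,3), strides=(1,1), padding="same")(hidden_X)
    hidden_X = BatchNormalization()(hidden_X)
    hidden_X = ReLU()(hidden_X)

    # 1 x 1 convolution: restores the dimensionality (4 times the filters)
    hidden_X = Conv2D(n_filters*4, kernel_size=(1,1), strides=(1, 1))(hidden_X)
    hidden_X = BatchNormalization()(hidden_X)

    # Add the projected shortcut to the block output
    hidden_X = layers.add([short_cut, hidden_X])
    hidden_X = ReLU()(hidden_X)
    return hidden_X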

The remaining residual blocks in each group keep the plain identity link, passing the input directly to the matrix add operation:


In [43]:
def identity_block(n_filters, input_X):
    short_cut = input_X  # use the function argument, not the global X

    hidden_X = Conv2D(n_filters, kernel_size=(1,1), strides=(1,1))(input_X)
    hidden_X = BatchNormalization()(hidden_X)
    hidden_X = ReLU()(hidden_X)

    hidden_X= Conv2D(n_filters, kernel_size=(3,3), strides=(1,1), padding="same")(hidden_X)
    hidden_X = BatchNormalization()(hidden_X)
    hidden_X = ReLU()(hidden_X)

    hidden_X = Conv2D(n_filters*4, kernel_size=(1,1), strides=(1, 1))(hidden_X)
    hidden_X = BatchNormalization()(hidden_X)

    hidden_X = layers.add([short_cut, hidden_X])
    hidden_X = ReLU()(hidden_X)
    return hidden_X

In [44]:
resnet50_input = Input((224, 224, 3))

# Stem: 7 x 7 strided convolution followed by max pooling
X = layers.ZeroPadding2D(padding=(3, 3))(resnet50_input)
X = layers.Conv2D(64, kernel_size=(7, 7), strides=(2, 2), padding='valid')(X)
X = layers.BatchNormalization()(X)
X = layers.ReLU()(X)
X = layers.ZeroPadding2D(padding=(1, 1))(X)
X = layers.MaxPool2D(pool_size=(3, 3), strides=(2, 2))(X)

# First group: stride 1 in the projection block, because the stem already pooled
X = projection_block(64, input_X=X, stride=(1,1))

for _ in range(2):
    X = identity_block(64, X)

# Each later group starts with a strided projection block
X = projection_block(128, X)

for _ in range(3):
    X = identity_block(128, X)
X = projection_block(256, X)

for _ in range(5):
    X = identity_block(256, X)

X = projection_block(512, X)

for _ in range(2):
    X = identity_block(512, X)

# Classifier: global average pooling and a single dense output layer
X = layers.GlobalAveragePooling2D()(X)

resnet50_outputs = layers.Dense(1000, activation='softmax')(X)

model = Model(resnet50_input, resnet50_outputs)
model.summary()


Model: "model_1"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_16 (InputLayer)          [(None, 224, 224, 3  0           []                               
                                )]                                                                
                                                                                                  
 zero_padding2d_23 (ZeroPadding  (None, 230, 230, 3)  0          ['input_16[0][0]']               
 2D)                                                                                              
                                                                                                  
 conv2d_109 (Conv2D)            (None, 112, 112, 64  9472        ['zero_padding2d_23[0][0]']      
                                )                                                           

ResNet50 v2 introduced preactivation batch normalization (BN-ReLU-Conv), in which the batch normalization and activation functions are placed before (instead of after) the corresponding convolution or dense layer. This has now become a common practice, as depicted here for an implementation of the residual block with the identity link in v2:

In [45]:
def identity_block_v2(n_filters, input_X):
    short_cut = input_X  # use the function argument, not the global X

    hidden_X = BatchNormalization()(input_X)
    hidden_X = ReLU()(hidden_X)
    hidden_X = Conv2D(n_filters, kernel_size=(1,1), strides=(1,1))(hidden_X)

    hidden_X = BatchNormalization()(hidden_X)
    hidden_X = ReLU()(hidden_X)
    hidden_X= Conv2D(n_filters, kernel_size=(3,3), strides=(1,1), padding="same")(hidden_X)

    hidden_X = BatchNormalization()(hidden_X)
    hidden_X = ReLU()(hidden_X)
    hidden_X = Conv2D(n_filters*4, kernel_size=(1,1), strides=(1, 1))(hidden_X)


    hidden_X = layers.add([short_cut, hidden_X])

    return hidden_X

# Summary
1. A convolutional neural network can be described as adding a frontend to a deep neural network.
2. The purpose of the CNN frontend is to reduce the high-dimensional pixel input to low-dimensional feature representation.
3. The lower dimensionality of the feature representation makes it practical to do deep learning with real-world images.
4. Image resizing and pooling are used to reduce the number of parameters in the model, without significant information loss.
5. Using a cascading set of filters to detect features has similarities to the human eye.
6. VGG formalized the concept of a convolutional pattern that is repeated.
7. Residual networks introduced the concept of feature reuse and demonstrated the ability to obtain higher accuracy at the same number of layers as a VGG, and go deeper in layers for more accuracy.
8. Batch normalization allowed models to go deeper in layers for more accuracy before being exposed to vanishing or exploding gradients.

