## Residual Networks

Welcome to the first assignment! You'll be building a very deep convolutional network using Residual Networks (ResNets). In theory, very deep networks can represent very complex functions, but in practice they are hard to train. Residual Networks, introduced by He et al. in 2015, allow you to train much deeper networks than were previously feasible.

**By the end of this assignment, you will be able to:**
- Implement the basic building block of ResNets in deep neural networks using Keras.
- Put together these building blocks to implement and train a state-of-the-art network for image classification.
- Implement a skip connection in your network. 

For this assignment, you'll use Keras. 


### 1.1 Packages


In [1]:
import os


os.chdir(os.path.join(os.getcwd(),
                      'Chapter04-Convolutional-Neural-Networks',
                      'DeepConvolutional_Models-CaseStudies',
                      'W2A1'))


In [2]:
os.getcwd()

'/workspace/Chapter04-Convolutional-Neural-Networks/DeepConvolutional_Models-CaseStudies/W2A1'

In [3]:
import tensorflow as tf
import numpy as np
import scipy.misc
from tensorflow.keras.applications.resnet_v2 import ResNet50V2
from tensorflow.keras.preprocessing import image
from tensorflow.keras.applications.resnet_v2 import preprocess_input, decode_predictions
from tensorflow.keras import layers

from tensorflow.keras.layers import Input, Add, Dense, Activation, ZeroPadding2D, BatchNormalization, Flatten, Conv2D, AveragePooling2D, MaxPooling2D, GlobalMaxPooling2D
from tensorflow.keras.models import Model, load_model
from resnets_utils import *
from tensorflow.keras.initializers import random_uniform, glorot_uniform, constant, identity
from tensorflow.python.framework.ops import EagerTensor
from matplotlib.pyplot import imshow


from test_utils import summary, comparator
import public_tests

%matplotlib inline
np.random.seed(1)
tf.random.set_seed(2)

2026-01-07 08:36:38.898622: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.


### 2. The Problem of Very Deep Networks

Last week, you built your first convolutional neural network: first manually with NumPy, then using TensorFlow and Keras.  

In recent years, neural networks have become deeper, with state-of-the-art networks evolving from having just a few layers to over a hundred layers.  
- The main benefit of a very deep network is that it can represent very complex functions. It can also learn features at many different levels of abstraction, from edges (at the shallower layers, closer to the input) to very complex features (at the deeper layers, closer to the output).
- However, using a deeper network doesn't always help. A huge barrier to training them is vanishing gradients: very deep networks often have a gradient signal that goes to zero quickly, thus making gradient descent prohibitively slow.
- More specifically, during gradient descent, as you backpropagate from the final layer back to the first layer, you are multiplying by the weight matrix on each step, and thus the gradient can decrease exponentially quickly to zero (or, in rare cases, grow exponentially quickly and "explode" to very large values).
- During training, you might therefore see the magnitude (that is, the size or order of magnitude) of the gradient for the shallower layers decrease to zero very rapidly as training proceeds, as shown below:

<img src="images/vanishing_grad_kiank.png" style="width:600px;height:300px;">
<caption><center> <u> <font color='purple'> <b>Figure 1</b> </u><font color='purple'>  : <b>Vanishing gradient</b> <br> The speed of learning decreases very rapidly for the shallower layers as the network trains </center></caption>

Not to worry! You are now going to solve this problem by building a Residual Network!
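
To make the exponential shrinkage described above concrete, here is a minimal sketch in plain Python. It assumes a toy scalar network in which every layer multiplies the backpropagated gradient by the same weight and ignores activation derivatives; the function name `backprop_gradient_scale` is hypothetical and not part of the assignment.

```python
def backprop_gradient_scale(weight, num_layers):
    """Factor applied to a gradient after flowing back through num_layers layers.

    Toy assumption: each layer multiplies the gradient by the same scalar
    `weight`; activation derivatives are ignored.
    """
    scale = 1.0
    for _ in range(num_layers):
        scale *= weight
    return scale


for depth in (5, 20, 50):
    shrink = backprop_gradient_scale(0.5, depth)  # |w| < 1: vanishing gradient
    grow = backprop_gradient_scale(1.5, depth)    # |w| > 1: exploding gradient
    print(f"depth={depth:2d}  w=0.5 -> {shrink:.3e}   w=1.5 -> {grow:.3e}")
```

With a weight below 1 the factor collapses toward zero as depth grows, and with a weight above 1 it blows up, which is the behavior the bullets above describe for the shallower layers.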

### 3 - Building a Residual Network


In ResNets, a "shortcut" or a "skip connection" allows the model to skip layers:  

<img src="images/skip_connection_kiank.png" style="width:650px;height:200px;">
<caption><center> <u> <font color='purple'> <b>Figure 2</b> </u><font color='purple'>  : A ResNet block showing a skip-connection <br> </center></caption>



The image on the left shows the "main path" through the network. The image on the right adds a shortcut to the main path. 
By stacking these ResNet blocks on top of each other, you can form a very deep network. 

The lecture mentioned that having ResNet blocks with the shortcut also makes it very easy for one of the blocks to learn an identity function. This means that you can stack on additional ResNet blocks with little risk of harming training set performance. 
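
The identity-function argument can be checked numerically. The sketch below is a simplification that uses two dense layers in NumPy instead of convolutions, with no biases and no BatchNorm; `plain_block` and `residual_block` are hypothetical helper names. When the weights are all zero, the plain block outputs zeros, while the block with a shortcut passes its (non-negative) input through unchanged.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def plain_block(a, W1, W2):
    """Two stacked layers without a shortcut."""
    return relu(relu(a @ W1) @ W2)


def residual_block(a, W1, W2):
    """Same two layers, but the input is added back before the final ReLU."""
    return relu(relu(a @ W1) @ W2 + a)


# Non-negative input, as if it were the output of an earlier ReLU.
a = np.array([[0.5, 2.0, 1.0]])
W_zero = np.zeros((3, 3))

print(plain_block(a, W_zero, W_zero))     # the signal is lost
print(residual_block(a, W_zero, W_zero))  # the input is reproduced
```

Because driving the weights toward zero is enough to recover the identity, an extra residual block has an easy way to "do nothing" if that is what helps training.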

Two main types of blocks are used in a ResNet, depending mainly on whether the input and output dimensions are the same or different. You are going to implement both of them: the "identity block" and the "convolutional block".

### 3.1 The identity block

The identity block is the standard block used in ResNets, and corresponds to the case where the input activation (say $a^{[l]}$) has the same dimension as the output activation (say $a^{[l+2]}$). To flesh out the different steps of what happens in a ResNet identity block, here is an alternative diagram showing the individual steps:

<img src="images/idblock2_kiank.png" style="width:650px;height:150px;">
<caption><center> <u> <font color='purple'> <b>Figure 3</b> </u><font color='purple'>  : <b>Identity block.</b> Skip connection "skips over" 2 layers. </center></caption>


The upper path is the "shortcut path." The lower path is the "main path." In this diagram, notice the CONV2D and ReLU steps in each layer. To speed up training, a BatchNorm step has been added. Don't worry about this being complicated to implement--you'll see that BatchNorm is just one line of code in Keras! 



In this exercise, you'll actually implement a slightly more powerful version of this identity block, in which the skip connection "skips over" 3 hidden layers rather than 2 layers. It looks like this: 

<img src="images/idblock3_kiank.png" style="width:650px;height:150px;">
    <caption><center> <u> <font color='purple'> <b>Figure 4</b> </u><font color='purple'>  : <b>Identity block.</b> Skip connection "skips over" 3 layers.</center></caption>

These are the individual steps:


First component of main path: 
- The first CONV2D has $F_1$ filters of shape (1,1) and a stride of (1,1). Its padding is "valid". Use 0 as the seed for the random uniform initialization: `kernel_initializer = initializer(seed=0)`. 
- The first BatchNorm is normalizing the 'channels' axis.
- Then apply the ReLU activation function. This has no hyperparameters. 

Second component of main path:
- The second CONV2D has $F_2$ filters of shape $(f,f)$ and a stride of (1,1). Its padding is "same". Use 0 as the seed for the random uniform initialization: `kernel_initializer = initializer(seed=0)`.
- The second BatchNorm is normalizing the 'channels' axis.
- Then apply the ReLU activation function. This has no hyperparameters.

Third component of main path:
- The third CONV2D has $F_3$ filters of shape (1,1) and a stride of (1,1). Its padding is "valid". Use 0 as the seed for the random uniform initialization: `kernel_initializer = initializer(seed=0)`. 
- The third BatchNorm is normalizing the 'channels' axis.
- Note that there is **no** ReLU activation function in this component. 

Final step: 
- The `X_shortcut` and the output from the 3rd layer `X` are added together.
- **Hint**: The syntax will look something like `Add()([var1,var2])`
- Then apply the ReLU activation function. This has no hyperparameters. 
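
The padding and stride choices in the steps above are what keep the height and width unchanged, so the final addition is possible. The following sketch traces shapes only, in plain Python, with no Keras involved. The helpers `conv_output_size` and `identity_block_shapes` are hypothetical names; the input shape and filter sizes are taken from the test cell later in this notebook.

```python
import math


def conv_output_size(n, kernel, stride, padding):
    """Output height/width of a 2D convolution along one spatial axis."""
    if padding == "valid":
        return (n - kernel) // stride + 1
    if padding == "same":
        return math.ceil(n / stride)
    raise ValueError(f"unknown padding: {padding!r}")


def identity_block_shapes(input_shape, f, filters):
    """Return the tensor shape after each of the three CONV2D steps."""
    m, n_h, n_w, n_c = input_shape
    F1, F2, F3 = filters
    # (number of filters, kernel size, padding); every stride is 1.
    specs = [(F1, 1, "valid"), (F2, f, "same"), (F3, 1, "valid")]
    shapes = []
    for n_filters, kernel, padding in specs:
        n_h = conv_output_size(n_h, kernel, 1, padding)
        n_w = conv_output_size(n_w, kernel, 1, padding)
        shapes.append((m, n_h, n_w, n_filters))
    if shapes[-1] != tuple(input_shape):
        raise ValueError("main path and shortcut shapes differ; Add() would fail")
    return shapes


shapes = identity_block_shapes((3, 4, 4, 3), f=2, filters=[4, 4, 3])
print(shapes)  # [(3, 4, 4, 4), (3, 4, 4, 4), (3, 4, 4, 3)]
```

Note that $F_3$ has to equal the number of input channels; otherwise the shortcut and the main path cannot be added, which is the case the convolutional block handles.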


### Exercise 1 - identity_block

Implement the ResNet identity block. The first component of the main path has been implemented for you already! First, you should read these docs carefully to make sure you understand what's happening. Then, implement the rest. 
- To implement the Conv2D step: [Conv2D](https://www.tensorflow.org/versions/r2.9/api_docs/python/tf/keras/layers/Conv2D)
- To implement BatchNorm: [BatchNormalization](https://www.tensorflow.org/versions/r2.9/api_docs/python/tf/keras/layers/BatchNormalization) `BatchNormalization(axis = 3)(X)`. When training is set to False (i.e., when the model is used in prediction mode), its weights are not updated with the new examples.
- For the activation, use:  `Activation('relu')(X)`
- To add the value passed forward by the shortcut: [Add](https://www.tensorflow.org/versions/r2.9/api_docs/python/tf/keras/layers/Add)
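
To illustrate the difference between training and prediction mode for BatchNorm, here is a simplified NumPy sketch. It is not the Keras implementation: it omits the learned scale and shift, and the function name `batchnorm_channels`, the `momentum` value, and the `eps` value are assumptions chosen for illustration. In training mode it normalizes with the statistics of the current batch and updates the moving averages; in prediction mode it uses the stored moving averages and leaves them untouched.

```python
import numpy as np


def batchnorm_channels(x, moving_mean, moving_var, training,
                       momentum=0.99, eps=1e-3):
    """Normalize an NHWC tensor over every axis except the last (channels)."""
    if training:
        mean = x.mean(axis=(0, 1, 2))
        var = x.var(axis=(0, 1, 2))
        # Moving statistics are only updated in training mode.
        moving_mean = momentum * moving_mean + (1 - momentum) * mean
        moving_var = momentum * moving_var + (1 - momentum) * var
    else:
        mean, var = moving_mean, moving_var
    y = (x - mean) / np.sqrt(var + eps)
    return y, moving_mean, moving_var


rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=(8, 4, 4, 3))
mm0, mv0 = np.zeros(3), np.ones(3)

y_train, mm_train, mv_train = batchnorm_channels(x, mm0, mv0, training=True)
y_inf, mm_inf, mv_inf = batchnorm_channels(x, mm0, mv0, training=False)

print(np.round(y_train.mean(axis=(0, 1, 2)), 6))  # per-channel mean close to 0
print(np.array_equal(mm_inf, mm0))                # moving mean unchanged
```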

We have added the initializer argument to our functions. This parameter receives an initializer function like the ones included in the package [tensorflow.keras.initializers](https://www.tensorflow.org/versions/r2.9/api_docs/python/tf/keras/initializers) or any other custom initializer. By default it will be set to [random_uniform](https://www.tensorflow.org/versions/r2.9/api_docs/python/tf/keras/initializers/RandomUniform)

Remember that these functions accept a `seed` argument that can be any value you want, but in this notebook it must be set to 0 for **grading purposes**.
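
The `initializer(seed=0)` pattern treats the initializer as a factory: the block calls it with a seed and gets back an object that produces the weights. The sketch below imitates that pattern in plain Python. The names `toy_random_uniform`, `toy_constant`, and `make_layer_weights` are hypothetical stand-ins, not Keras functions. It shows why a fixed seed gives reproducible weights, and why a test can pass something like `lambda seed=0: constant(value=1)` to swap in a different initializer while still accepting the `seed` keyword.

```python
import random


def toy_random_uniform(seed=None, minval=-0.05, maxval=0.05):
    """Toy initializer factory: returns a function that builds a weight matrix."""
    def init(shape):
        rng = random.Random(seed)  # same seed -> same sequence of values
        rows, cols = shape
        return [[rng.uniform(minval, maxval) for _ in range(cols)]
                for _ in range(rows)]
    return init


def toy_constant(value):
    """Toy constant initializer: every weight gets the same value."""
    def init(shape):
        rows, cols = shape
        return [[value] * cols for _ in range(rows)]
    return init


def make_layer_weights(shape, initializer=toy_random_uniform):
    # Mirrors the notebook's convention: the block always calls initializer(seed=0).
    return initializer(seed=0)(shape)


w_a = make_layer_weights((2, 3))
w_b = make_layer_weights((2, 3))
w_ones = make_layer_weights((2, 3), initializer=lambda seed=0: toy_constant(value=1))

print(w_a == w_b)  # True: the seed makes initialization reproducible
print(w_ones)      # [[1, 1, 1], [1, 1, 1]]
```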

 Here is where you're actually using the power of the Functional API to create a shortcut path: 

In [4]:
### UNIQ_C1
### GRADE FUNCTION: identity_block

def identity_block(X, f, filters,  initializer=random_uniform):
    """
    Implementation of the identity block as defined in Figure 4, skipping over 3 layers
    
    Arguments:
    X -- input tensor of shape (m, n_H_prev, n_W_prev, n_C_prev)
    f -- integer, specifying the shape of the middle CONV's window for the main path
    filters -- python list of integers, defining the number of filters in the CONV layers of the main path
    initializer -- callable used to set up the initial weights of a layer. Defaults to the random uniform initializer

    Returns:
    X -- output of the identity block, tensor of shape (m, n_H, n_W, n_C)
    """
    ## retrieve Filters
    F1, F2, F3 = filters
    ## Save the input value. You'll need this later to add back to the main path.
    X_shortcut = X

    ## First component of main path
    X = Conv2D(filters=F1, 
               kernel_size=(1,1), 
               strides=(1,1), 
               padding='valid',
               kernel_initializer=initializer(seed=0))(X)
    X = BatchNormalization(axis=3)(X)
    X = Activation('relu')(X) 
    print('After first conv layer:')
    print(f'The X shape is {X.shape}')

    ### START CODE HERE
    ## Second component of main path (≈3 lines), set padding='same'
    X = Conv2D(filters=F2, 
               kernel_size=(f,f), 
               strides=(1,1), 
               padding='same',
               kernel_initializer=initializer(seed=0))(X)
    X = BatchNormalization(axis=3)(X)
    X = Activation('relu')(X)
    print('After Second conv layer:')
    print(f'The X shape is {X.shape}')

    ### Third component of main path (≈2 lines), set padding='valid'
    X = Conv2D(filters=F3, 
               kernel_size=(1,1), 
               strides=(1,1), 
               padding='valid',
               kernel_initializer=initializer(seed=0))(X)
    X = BatchNormalization(axis=3)(X)
    print('After Third conv layer:')
    print(f'The X shape is {X.shape}')


    ## Final step: Add shortcut value to main path, and pass it through a RELU activation (≈2 lines)
    X = Add()([X, X_shortcut])
    X = Activation('relu')(X)

    ### END CODE HERE

    return X



In [5]:
### you cannot edit this cell

tf.keras.backend.set_learning_phase(False)

np.random.seed(1)
tf.random.set_seed(2)
X1 = np.ones((1, 4, 4, 3)) * -1
X2 = np.ones((1, 4, 4, 3)) * 1
X3 = np.ones((1, 4, 4, 3)) * 3

X = np.concatenate((X1, X2, X3), axis = 0).astype(np.float32)

A3 = identity_block(X, f=2, filters=[4, 4, 3],
                   initializer=lambda seed=0:constant(value=1))

print('\n\033[1mTHE EXPECTED OUTPUT SHAPE IS (3, 4, 4, 3) \033[0m\n')
print('Output shape: {}'.format(A3.shape))


print('\033[1mWith training=False\033[0m\n')
A3np = A3.numpy()
print(np.around(A3.numpy()[:,(0,-1),:,:].mean(axis = 3), 5))
resume = A3np[:,(0,-1),:,:].mean(axis = 3)
print(resume[1, 1, 0])

tf.keras.backend.set_learning_phase(True)

print('\n\033[1mWith training=True\033[0m\n')
np.random.seed(1)
tf.random.set_seed(2)
A4 = identity_block(X, f=3, filters=[3, 3, 3],
                   initializer=lambda seed=7:constant(value=1))
A4np = A4.numpy()
resume = A4np[:,(0,-1),:,:].mean(axis = 3)
print(np.around(resume, 5))

public_tests.identity_block_test(identity_block)

After first conv layer:
The X shape is (3, 4, 4, 4)
After Second conv layer:
The X shape is (3, 4, 4, 4)
After Third conv layer:
The X shape is (3, 4, 4, 3)

THE EXPECTED OUTPUT SHAPE IS (3, 4, 4, 3)

Output shape: (3, 4, 4, 3)
With training=False

[[[  0.        0.        0.        0.     ]
  [  0.        0.        0.        0.     ]]

 [[192.99974 192.99974 192.99974  96.99986]
  [ 96.99986  96.99986  96.99986  48.99993]]

 [[578.9994  578.9994  578.9994  290.99963]
  [290.99963 290.99963 290.99963 146.99982]]]
96.999855

With training=True

After first conv layer:
The X shape is (3, 4, 4, 3)
After Second conv layer:
The X shape is (3, 4, 4, 3)
After Third conv layer:
The X shape is (3, 4, 4, 3)
[[[0.      0.      0.      0.     ]
  [0.      0.      0.      0.     ]]

 [[0.37387 0.37387 0.37387 0.37387]
  [0.37387 0.37387 0.37387 0.37387]]

 [[3.23793 4.13955 4.13955 3.23793]
  [3.23793 4.13955 4.13955 3.23793]]]
After first conv layer:
The X shape is (3, 4, 

2026-01-07 08:36:42.414070: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


### 3.2 The convolutional block

The ResNet "convolutional block" is the second block type. You can use this type of block when the input and output dimensions don't match up. The difference from the identity block is that there is a CONV2D layer in the shortcut path.
<img src="images/convblock_kiank.png" style="width:650px;height:200px;">




* The CONV2D layer in the shortcut path is used to resize the input $x$ to a different dimension, so that the dimensions match up in the final addition needed to add the shortcut value back to the main path. (This plays a similar role as the matrix $W_s$ discussed in lecture.)
* For example, to reduce the activation dimensions' height and width by a factor of 2, you can use a 1x1 convolution with a stride of 2. 
* The CONV2D layer on the shortcut path does not use any non-linear activation function. Its main role is to just apply a (learned) linear function that reduces the dimension of the input, so that the dimensions match up for the later addition step. 
* As for the previous exercise, the additional `initializer` argument is required for grading purposes, and it has been set by default to [glorot_uniform](https://www.tensorflow.org/versions/r2.9/api_docs/python/tf/keras/initializers/GlorotUniform)
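
The resizing role of the shortcut's CONV2D can be sketched in NumPy. A 1x1 "valid" convolution with stride `s` keeps every `s`-th pixel and applies the same linear map to the channels at each kept position. The helper `conv1x1` is a hypothetical name, and bias and BatchNorm are left out; the shapes match the ones used by the public test in this notebook.

```python
import numpy as np


def conv1x1(x, W, stride):
    """1x1 'valid' convolution on an NHWC tensor.

    x: (m, n_H, n_W, n_C_prev); W: (n_C_prev, n_C). No bias, no activation.
    """
    subsampled = x[:, ::stride, ::stride, :]  # keep every stride-th pixel
    return subsampled @ W                     # linear map over the channels


x = np.ones((3, 4, 4, 3), dtype=np.float32)
W = np.ones((3, 6), dtype=np.float32)

shortcut_s2 = conv1x1(x, W, stride=2)
shortcut_s4 = conv1x1(x, W, stride=4)
print(shortcut_s2.shape)  # (3, 2, 2, 6)
print(shortcut_s4.shape)  # (3, 1, 1, 6)
```

The output has the spatial size and channel count of the main path, so the two tensors can be added in the final step.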

The details of the convolutional block are as follows. 

First component of main path:
- The first CONV2D has $F_1$ filters of shape (1,1) and a stride of (s,s). Its padding is "valid". Use 0 as the `glorot_uniform` seed `kernel_initializer = initializer(seed=0)`.
- The first BatchNorm is normalizing the 'channels' axis.
- Then apply the ReLU activation function. This has no hyperparameters. 

Second component of main path:
- The second CONV2D has $F_2$ filters of shape (f,f) and a stride of (1,1). Its padding is "same".  Use 0 as the `glorot_uniform` seed `kernel_initializer = initializer(seed=0)`.
- The second BatchNorm is normalizing the 'channels' axis.
- Then apply the ReLU activation function. This has no hyperparameters. 

Third component of main path:
- The third CONV2D has $F_3$ filters of shape (1,1) and a stride of (1,1). Its padding is "valid".  Use 0 as the `glorot_uniform` seed `kernel_initializer = initializer(seed=0)`.
- The third BatchNorm is normalizing the 'channels' axis. Note that there is no ReLU activation function in this component. 

Shortcut path:
- The CONV2D has $F_3$ filters of shape (1,1) and a stride of (s,s). Its padding is "valid".  Use 0 as the `glorot_uniform` seed `kernel_initializer = initializer(seed=0)`.
- The BatchNorm is normalizing the 'channels' axis. 

Final step: 
- The shortcut and the main path values are added together.
- Then apply the ReLU activation function. This has no hyperparameters. 
 
<a name='ex-2'></a>    
### Exercise 2 - convolutional_block
    
Implement the convolutional block. The first component of the main path is already implemented; then it's your turn to implement the rest! As before, always use 0 as the seed for the random initialization, to ensure consistency with the grader.
- [Conv2D](https://www.tensorflow.org/versions/r2.9/api_docs/python/tf/keras/layers/Conv2D)
- [BatchNormalization](https://www.tensorflow.org/versions/r2.9/api_docs/python/tf/keras/layers/BatchNormalization) (`axis`: integer, the axis that should be normalized, typically the features axis): `BatchNormalization(axis = 3)(X)`. When training is set to False (i.e., when the model is used in prediction mode), its weights are not updated with the new examples.
- For the activation, use:  `Activation('relu')(X)`
- [Add](https://www.tensorflow.org/versions/r2.9/api_docs/python/tf/keras/layers/Add)
    
We have added the initializer argument to our functions. This parameter receives an initializer function like the ones included in the package [tensorflow.keras.initializers](https://www.tensorflow.org/versions/r2.9/api_docs/python/tf/keras/initializers) or any other custom initializer. By default it will be set to [glorot_uniform](https://www.tensorflow.org/versions/r2.9/api_docs/python/tf/keras/initializers/GlorotUniform)

Remember that these functions accept a `seed` argument that can be any value you want, but in this notebook it must be set to 0 for **grading purposes**.

In [6]:
## UNQ_C2
### GRADE FUNCTION: convolutional_block

def convolutional_block(X, f, filters, s = 2, initializer=glorot_uniform):
    """
    Implementation of the convolutional block as defined in Figure 4
    
    Arguments:
    X -- input tensor of shape (m, n_H_prev, n_W_prev, n_C_prev)
    f -- integer, specifying the shape of the middle CONV's window for the main path
    filters -- python list of integers, defining the number of filters in the CONV layers of the main path
    s -- Integer, specifying the stride to be used
    initializer -- callable used to set up the initial weights of a layer. Defaults to the Glorot uniform initializer

    Returns:
    X -- output of the convolutional block, tensor of shape (m, n_H, n_W, n_C)
    """
    ## Retrieve Filters
    F1, F2, F3 = filters

    ## Save the input value
    X_shortcut = X

    ### START CODE HERE
    ## First component of main path 
    X = Conv2D(filters=F1, 
               kernel_size=(1,1), 
               strides=(s,s), 
               padding='valid',
               kernel_initializer=initializer(seed=0))(X)
    X = BatchNormalization(axis=3)(X)
    X = Activation('relu')(X) 
    print('\033[1mAfter first conv layer:\033[0m')
    print(f'\033[1mThe X shape is {X.shape}\033[0m')


    ### Second component of main path (≈3 lines), set padding='same'
    X = Conv2D(filters=F2,
               kernel_size=(f,f),
               strides=(1,1),
               padding='same',
               kernel_initializer=initializer(seed=0))(X)
    X = BatchNormalization(axis=3)(X)
    X = Activation('relu')(X)
    print('\n\033[1mAfter Second conv layer:\033[0m')
    print(f'\033[1mThe X shape is {X.shape}\033[0m')


    ### Third component of main path (≈2 lines), set padding='valid'
    X = Conv2D(filters=F3,  
               kernel_size=(1,1), 
               strides=(1,1), 
               padding='valid',
               kernel_initializer=initializer(seed=0))(X)
    X = BatchNormalization(axis=3)(X)
    print('\n\033[1mAfter Third conv layer:\033[0m')
    print(f'\033[1mThe X shape is {X.shape}\033[0m')


    ## SHORTCUT PATH (≈2 lines)
    X_shortcut = Conv2D(filters=F3, 
                        kernel_size=(1,1), 
                        strides=(s,s), 
                        padding='valid',
                        kernel_initializer=initializer(seed=0))(X_shortcut)
    X_shortcut = BatchNormalization(axis=3)(X_shortcut)
    print('\n\033[1mAfter SHORTCUT PATH layer:\033[0m')
    print(f'\033[1mThe X_shortcut shape is {X_shortcut.shape}\033[0m')


    ## Final step: Add shortcut value to main path, and pass it through a RELU activation (≈2 lines)
    X = Add()([X, X_shortcut])
    X = Activation('relu')(X)

    ### END CODE HERE

    return X



In [7]:
### you cannot edit this cell

public_tests.convolutional_block_test(convolutional_block)

Public testing utils - Convolutional Block:
The original X shape is (3, 4, 4, 3)
After first conv layer:
The X shape is (3, 1, 1, 2)

After Second conv layer:
The X shape is (3, 1, 1, 4)

After Third conv layer:
The X shape is (3, 1, 1, 6)

After SHORTCUT PATH layer:
The X_shortcut shape is (3, 1, 1, 6)
After first conv layer:
The X shape is (3, 2, 2, 2)

After Second conv layer:
The X shape is (3, 2, 2, 4)

After Third conv layer:
The X shape is (3, 2, 2, 6)

After SHORTCUT PATH layer:
The X_shortcut shape is (3, 2, 2, 6)
tf.Tensor(
[[[0.3347573  1.6415622  0.33794785 0.08483201 0.8150141  0.        ]
  [0.17481059 1.5698532  0.26053628 0.         0.7671118  0.        ]]

 [[0.         1.4983335  0.16898686 0.         0.6183615  0.        ]
  [0.         1.4503356  0.11640047 0.         0.58086616 0.        ]]], shape=(2, 2, 6), dtype=float32)
[1mAfter first

### 4 - Building Your First ResNet Model (50 layers)

You now have the necessary blocks to build a very deep ResNet. The following figure describes in detail the architecture of this neural network. "ID BLOCK" in the diagram stands for "identity block", and "ID BLOCK x3" means you should stack 3 identity blocks together.  

<img src="images/resnet_kiank.png" style="width:1000px;height:200px;">

The details of this ResNet-50 model are:
- Zero padding pads the input with a pad of (3,3)
- Stage 1:
    - The 2D Convolution has  64 filters of shape (7,7) and uses a stride of (2,2)
    - Then, a BatchNorm layer is applied to the 'channels' axis of the input
    - Then, a MaxPooling layer uses a (3,3) window and a (2,2) stride
- Stage 2:
    - The convolutional block uses three sets of filters of size [64,64,256], "f" is 3, and "s" is 1.
    - The 2 identity blocks use three sets of filters of size [64,64,256], and "f" is 3.
- Stage 3:
    - The convolutional block uses three sets of filters of size [128,128,512], "f" is 3 and "s" is 2.
    - The 3 identity blocks use three sets of filters of size [128,128,512] and "f" is 3.
- Stage 4:
    - The convolutional block uses three sets of filters of size [256, 256, 1024], "f" is 3 and "s" is 2.
    - The 5 identity blocks use three sets of filters of size [256, 256, 1024] and "f" is 3.
- Stage 5:
    - The convolutional block uses three sets of filters of size [512, 512, 2048], "f" is 3 and "s" is 2.
    - The 2 identity blocks use three sets of filters of size [512, 512, 2048] and "f" is 3.
- The 2D Average Pooling uses a window (pool_size) of shape (2,2).
- The 'flatten' layer doesn't have any hyperparameters.
- The Fully Connected (Dense) layer reduces its input to the number of classes using a softmax activation.
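
As a quick sanity check on the "50 layers" in the name, the block counts listed above can be tallied in plain Python. The sketch assumes the usual counting convention, which counts the stage 1 convolution, the three main-path convolutions of every block, and the final Dense layer, but not the shortcut convolutions, BatchNorm, or pooling layers. The variable names are illustrative only.

```python
# (convolutional blocks, identity blocks) per stage, from the list above.
BLOCKS_PER_STAGE = {
    2: (1, 2),
    3: (1, 3),
    4: (1, 5),
    5: (1, 2),
}
CONVS_PER_BLOCK = 3  # every block has three CONV2D layers on its main path

total_blocks = sum(conv + ident for conv, ident in BLOCKS_PER_STAGE.values())
main_path_convs = total_blocks * CONVS_PER_BLOCK

stage1_conv = 1  # the 7x7 convolution in stage 1
dense = 1        # the final fully connected layer

total_layers = stage1_conv + main_path_convs + dense
print(total_blocks, main_path_convs, total_layers)  # 16 48 50
```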
