# Batch Normalization, Layer Normalization, and Residual Connections

Deep neural networks are highly capable but can be difficult to train, both because of vanishing or exploding gradients and because training demands substantial compute. Several techniques have been developed to stabilize and accelerate training, including batch normalization, layer normalization, and residual connections.

![](Batch_Layer.png)

## Batch Normalization

Batch normalization was introduced to improve the training of deep networks by controlling the distribution of layer outputs. 

In this technique, the activations from a layer are normalized using the statistics computed from a mini-batch. For each mini-batch, the mean and variance of the activations are calculated, and then each activation is normalized by subtracting the mean and dividing by the standard deviation (with a small constant added for stability). After normalization, learnable scale and shift parameters are applied, allowing the network to recover the original representations if necessary. 

This approach can help the network converge faster and may permit the use of higher learning rates.

![](https://miro.medium.com/v2/resize:fit:898/0*pSSzicm1IH4hXOHc.png)

### The Process

In batch normalization, we adjust the activations within a mini-batch so that they have a mean of zero and a variance of one. Suppose you have a mini-batch containing $m$ examples, and you consider a particular activation $x$ (which could be a scalar value from a feature map or fully connected layer). The steps are as follows:

1. **Compute the Mean:**  
   For the mini-batch, the mean $\mu_B$ is calculated by summing all the values and dividing by the number of samples:
   $$
   \mu_B = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}.
   $$
   This average value represents the central tendency of the activations within the batch.

2. **Compute the Variance:**  
   Next, the variance $\sigma_B^2$ measures how much the activations vary around the mean:
   $$
   \sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} \left( x^{(i)} - \mu_B \right)^2.
   $$
   The variance gives us an idea of the spread or dispersion of the activation values.

3. **Normalize the Activations:**  
   Each activation is normalized by subtracting the mean and dividing by the square root of the variance (plus a small constant $\epsilon$ for numerical stability). This yields:
   $$
   \hat{x}^{(i)} = \frac{x^{(i)} - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}.
   $$
   Here, the term $\sqrt{\sigma_B^2 + \epsilon}$ ensures that even if the variance is very small, we avoid division by zero.

4. **Apply Learnable Scaling and Shifting:**  
   After normalization, the network applies an affine transformation with parameters $\gamma$ (for scaling) and $\beta$ (for shifting):
   $$
   y^{(i)} = \gamma \hat{x}^{(i)} + \beta.
   $$
   This learnable transformation lets the network recover the original representation if that is optimal (for example, $\gamma = \sqrt{\sigma_B^2 + \epsilon}$ and $\beta = \mu_B$ undo the normalization), rather than forcing the activations to have zero mean and unit variance at all times.

During inference, running estimates of the mean and variance accumulated during training replace the batch statistics, so the output for a given input does not depend on the other samples in the batch. Together, these steps tend to smooth the optimization landscape and keep gradients well scaled during training, which contributes to faster and more stable convergence.
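The four steps, together with the running estimates used at inference, can be sketched in plain Python for a single activation. This is a minimal illustration rather than any framework's implementation: the function names, the momentum convention in `update_running`, and the initial running values (mean 0, variance 1) are assumptions made for the example.

```python
import math

def batch_norm_train(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Steps 1-4 for one activation observed across a mini-batch."""
    m = len(xs)
    mu = sum(xs) / m                                        # step 1: batch mean
    var = sum((x - mu) ** 2 for x in xs) / m                # step 2: biased batch variance
    x_hat = [(x - mu) / math.sqrt(var + eps) for x in xs]   # step 3: normalize
    ys = [gamma * xh + beta for xh in x_hat]                # step 4: scale and shift
    return ys, mu, var

def update_running(running, batch_value, momentum=0.1):
    """Exponential moving average (this momentum convention is an assumption)."""
    return (1.0 - momentum) * running + momentum * batch_value

def batch_norm_infer(x, running_mean, running_var, gamma=1.0, beta=0.0, eps=1e-5):
    """Inference path: fixed running statistics instead of batch statistics."""
    x_hat = (x - running_mean) / math.sqrt(running_var + eps)
    return gamma * x_hat + beta

batch = [1.0, 2.0, 3.0, 4.0]
ys, mu, var = batch_norm_train(batch)
print(mu, var)                       # 2.5 1.25
print([round(y, 3) for y in ys])     # [-1.342, -0.447, 0.447, 1.342]

# Running estimates start at mean 0 and variance 1 here (an assumed initialization).
running_mean = update_running(0.0, mu)
running_var = update_running(1.0, var)
y_single = batch_norm_infer(3.0, running_mean, running_var)  # works for a single sample
```

Note that `batch_norm_infer` takes one value at a time: once the running statistics are fixed, no batch is needed to normalize an input.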

### Benefits

1. **Smooths the Optimization Landscape:**  
Batch normalization reduces the variability of activations across different mini-batches by giving each mini-batch a consistent mean and variance. This standardization tends to smooth the loss surface, making optimization more predictable and stable. With a smoother landscape, gradient descent can take larger steps (that is, use a higher learning rate) with less risk of overshooting the minimum, which often leads to faster convergence.

2. **Uniformity Across Examples:**  
By normalizing the activations within a batch, batch normalization makes the examples more uniform in terms of their statistical properties. This uniformity prevents certain samples with extreme values from dominating the gradient updates, resulting in a more balanced learning process. With reduced internal variability, the network can safely use a larger learning rate, which in turn accelerates training.

3. **Regularization Through Noise:**  
The use of mini-batch statistics introduces a degree of randomness (or noise) during training. This noise serves as an implicit regularizer, discouraging the model from overfitting to the training data. The stochastic variations in the normalization process help the network to find a more robust solution, often resulting in better generalization and improved performance on unseen data.
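The noise mentioned in the last point can be made concrete: the same activation is normalized to a different value depending on which other samples happen to share its mini-batch. The sketch below uses made-up batch values and a hypothetical helper `normalize_in_batch`; it omits the learnable scale and shift for brevity.

```python
import math

def normalize_in_batch(xs, eps=1e-5):
    """Normalize each value using the mean and variance of its own mini-batch."""
    m = len(xs)
    mu = sum(xs) / m
    var = sum((x - mu) ** 2 for x in xs) / m
    return [(x - mu) / math.sqrt(var + eps) for x in xs]

sample = 2.0
batch_a = [sample, 1.0, 3.0, 2.5]   # hypothetical batch mates
batch_b = [sample, 0.0, 8.0, 5.0]   # same sample, different batch mates

out_a = normalize_in_batch(batch_a)[0]
out_b = normalize_in_batch(batch_b)[0]

# The sample itself is unchanged, but its normalized value differs between batches.
print(out_a, out_b)
```

Because mini-batches are reshuffled every epoch, this batch-dependent shift acts like a small random perturbation on each sample during training.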

### Examples

#### TensorFlow Example

```python
import tensorflow as tf

class SimpleNetBN(tf.keras.Model):
    def __init__(self):
        super(SimpleNetBN, self).__init__()
        self.fc1 = tf.keras.layers.Dense(20)
        self.bn1 = tf.keras.layers.BatchNormalization()
        self.relu = tf.keras.layers.ReLU()
        self.fc2 = tf.keras.layers.Dense(1)

    def call(self, x, training=False):
        x = self.fc1(x)
        x = self.bn1(x, training=training)  # Batch normalization applied here
        x = self.relu(x)
        x = self.fc2(x)
        return x

# Create network and sample input
net_bn_tf = SimpleNetBN()
sample_input_tf = tf.random.normal([5, 10])  # Batch size of 5, input dimension of 10
output_bn_tf = net_bn_tf(sample_input_tf, training=True)
print("TensorFlow BatchNorm Output:\n", output_bn_tf)
```

```
TensorFlow BatchNorm Output:
 tf.Tensor(
[[-0.5078056 ]
 [ 0.9584242 ]
 [-0.12232404]
 [ 0.19608045]
 [-2.4165165 ]], shape=(5, 1), dtype=float32)
```


#### PyTorch Example

```python
import torch
import torch.nn as nn

class SimpleNetBN(nn.Module):
    def __init__(self):
        super(SimpleNetBN, self).__init__()
        self.fc1 = nn.Linear(10, 20)
        self.bn1 = nn.BatchNorm1d(20)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(20, 1)

    def forward(self, x):
        x = self.fc1(x)
        x = self.bn1(x)  # Batch normalization applied here
        x = self.relu(x)
        x = self.fc2(x)
        return x

# Create network and sample input
net_bn = SimpleNetBN()
sample_input = torch.randn(5, 10)  # Batch size of 5, input dimension of 10
output_bn = net_bn(sample_input)
print("PyTorch BatchNorm Output:\n", output_bn)
```


### Real World Example

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Load and preprocess the MNIST dataset
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

# Normalize pixel values to [0, 1]
x_train = x_train.astype('float32') / 255.0
x_test  = x_test.astype('float32') / 255.0

# Expand dimensions to add the channel axis (MNIST is grayscale)
x_train = x_train[..., tf.newaxis]
x_test  = x_test[..., tf.newaxis]

def create_model_batch_norm():
    model = models.Sequential([
        layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
        layers.BatchNormalization(),  # Normalize across mini-batch
        layers.Conv2D(64, (3, 3), activation='relu'),
        layers.BatchNormalization(),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation='relu'),
        layers.BatchNormalization(),
        layers.Dense(10, activation='softmax')
    ])
    return model

model_bn = create_model_batch_norm()
model_bn.compile(optimizer='adam',
                 loss='sparse_categorical_crossentropy',
                 metrics=['accuracy'])

print("=== Training Model with Batch Normalization ===")
model_bn.summary()
model_bn.fit(x_train, y_train, epochs=5, batch_size=128, validation_split=0.1)

print("Test evaluation:")
model_bn.evaluate(x_test, y_test)
```

```
=== Training Model with Batch Normalization ===
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #
 conv2d (Conv2D)             (None, 26, 26, 32)        320

 batch_normalization (BatchN  (None, 26, 26, 32)       128
 ormalization)

 conv2d_1 (Conv2D)           (None, 24, 24, 64)        18496

 batch_normalization_1 (Batc  (None, 24, 24, 64)       256
 hNormalization)

 max_pooling2d (MaxPooling2D  (None, 12, 12, 64)       0
 )

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Test evaluation:

[0.0363968163728714, 0.9894000291824341]
```

The final list is the `[loss, accuracy]` pair returned by `evaluate` on the test set, here about 98.9% accuracy.

## Layer Normalization

Layer normalization is another normalization technique that is particularly useful in scenarios where batch sizes are small or variable, such as in recurrent neural networks or transformer architectures. 

Unlike batch normalization, which computes normalization statistics across the mini-batch, layer normalization computes the mean and variance across the features of each individual sample. This per-sample normalization makes the technique robust to changes in batch size and is well-suited for sequential data. 

The process involves normalizing each sample by subtracting the mean and dividing by the standard deviation calculated over its features, followed by the application of learnable scaling and shifting parameters.
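The difference between the two techniques comes down to which axis the statistics are taken over. The sketch below applies both to the same small activation matrix in plain Python; the values and the helper `standardize` are made up for illustration, and the learnable scale and shift are left out.

```python
import math

def standardize(values, eps=1e-5):
    """Subtract the mean and divide by the standard deviation of `values`."""
    n = len(values)
    mu = sum(values) / n
    var = sum((v - mu) ** 2 for v in values) / n
    return [(v - mu) / math.sqrt(var + eps) for v in values]

# Rows are samples, columns are features (toy values chosen for illustration).
acts = [
    [1.0, 10.0, 100.0],
    [2.0, 20.0, 200.0],
    [3.0, 30.0, 300.0],
]

# Batch normalization: statistics per feature, computed down each column.
columns = [list(col) for col in zip(*acts)]
bn_columns = [standardize(col) for col in columns]
batch_normed = [list(row) for row in zip(*bn_columns)]

# Layer normalization: statistics per sample, computed along each row.
layer_normed = [standardize(row) for row in acts]

for row in batch_normed:
    print("batch norm:", [round(v, 3) for v in row])
for row in layer_normed:
    print("layer norm:", [round(v, 3) for v in row])
```

After batch normalization each column has zero mean; after layer normalization each row does. Removing a row from `acts` would change every batch-normalized value but leave the remaining layer-normalized rows untouched.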

![](https://theaisummer.com/static/ac89fbcf1c115f07ae68af695c28c4a0/ee604/normalization.png)

### The Process

Layer normalization operates on a per-sample basis by normalizing the features within a single data point rather than across a mini-batch. Suppose you have an input vector $x \in \mathbb{R}^d$ for a single sample, where $d$ is the number of features. The process involves:

1. **Compute the Mean Across Features:**  
   For the input sample $x$, calculate the mean $\mu$ of all its features:
   $$
   \mu = \frac{1}{d} \sum_{j=1}^{d} x_j.
   $$
   This mean represents the average feature value for that specific sample.

2. **Compute the Variance Across Features:**  
   Next, determine the variance $\sigma^2$ to measure the spread of the features:
   $$
   \sigma^2 = \frac{1}{d} \sum_{j=1}^{d} \left( x_j - \mu \right)^2.
   $$
   This variance tells us how much the feature values deviate from the mean.

3. **Normalize the Features:**  
   Each feature $x_j$ is then normalized by subtracting the mean and dividing by the standard deviation:
   $$
   \hat{x}_j = \frac{x_j - \mu}{\sqrt{\sigma^2 + \epsilon}},
   $$
   where $\epsilon$ is a small constant added to prevent division by zero. This ensures that each sample has features with zero mean and unit variance, regardless of the batch size.

4. **Apply Learnable Scaling and Shifting:**  
   Finally, similar to batch normalization, layer normalization uses learnable parameters $\gamma_j$ and $\beta_j$ for each feature:
   $$
   y_j = \gamma_j \hat{x}_j + \beta_j.
   $$
   This step allows each feature to be scaled and shifted independently, so the network can adjust the normalized values as needed for optimal performance.

Because the statistics come from a single sample, the result does not depend on which other samples share the batch, or on whether there is a batch at all. This is what makes layer normalization a good fit for recurrent neural networks and transformer models, where batch sizes are often small or variable.
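The steps above can be sketched for a single sample in plain Python. The function name `layer_norm` and the example values of $\gamma_j$ and $\beta_j$ are assumptions chosen for illustration; in a real network those parameters are learned.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Layer normalization of one sample x with per-feature gamma and beta."""
    d = len(x)
    mu = sum(x) / d                                           # step 1: mean over features
    var = sum((xj - mu) ** 2 for xj in x) / d                 # step 2: variance over features
    x_hat = [(xj - mu) / math.sqrt(var + eps) for xj in x]    # step 3: normalize
    return [g * xh + b for g, xh, b in zip(gamma, x_hat, beta)]  # step 4: scale and shift

x = [2.0, 4.0, 6.0, 8.0]

# Identity affine parameters: the output is just the normalized sample.
y = layer_norm(x, gamma=[1.0] * 4, beta=[0.0] * 4)
print([round(v, 3) for v in y])   # [-1.342, -0.447, 0.447, 1.342]

# Hypothetical learned parameters: each feature is rescaled and shifted independently.
y_affine = layer_norm(x, gamma=[1.0, 2.0, 1.0, 0.5], beta=[0.0, 0.0, 1.0, 1.0])

# The same sample gives the same output whether or not other samples are present.
batch = [x, [0.0, 0.0, 1.0, 100.0], [-5.0, 5.0, -5.0, 5.0]]
y_in_batch = [layer_norm(row, [1.0] * 4, [0.0] * 4) for row in batch][0]
print(y == y_in_batch)            # True
```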

### Benefits

1. **Stable Inputs Across Layers:**  
Layer normalization normalizes the features of each individual sample, so the inputs to each layer stay on a consistent scale. This helps keep activations from growing or shrinking uncontrollably as they pass through the network, which in turn supports well-scaled gradients and makes it easier to learn deep representations.

2. **Independence from Batch Size:**  
Unlike batch normalization, layer normalization computes statistics based solely on the features of a single sample. This independence means that its performance is not affected by the size of the mini-batch, making it especially effective in scenarios where batch sizes are small or variable—such as in recurrent neural networks or transformer models. This property simplifies model design and training, as the normalization behavior remains consistent regardless of the batch size.

3. **Consistent Behavior During Training and Testing:**  
Since layer normalization normalizes each sample independently, the exact same procedure is applied during both training and inference. There is no need for moving averages or separate inference rules, which are required by batch normalization to handle different statistics in testing. This consistency eliminates any discrepancy between the training and testing phases, making the model's behavior more predictable and easier to debug.

### Examples

#### TensorFlow Example

```python
import tensorflow as tf

class SimpleNetLN(tf.keras.Model):
    def __init__(self):
        super(SimpleNetLN, self).__init__()
        self.fc1 = tf.keras.layers.Dense(20)
        self.ln1 = tf.keras.layers.LayerNormalization()
        self.relu = tf.keras.layers.ReLU()
        self.fc2 = tf.keras.layers.Dense(1)

    def call(self, x):
        x = self.fc1(x)
        x = self.ln1(x)  # Layer normalization applied here
        x = self.relu(x)
        x = self.fc2(x)
        return x

# Create network and sample input
net_ln_tf = SimpleNetLN()
sample_input_tf = tf.random.normal([5, 10])
output_ln_tf = net_ln_tf(sample_input_tf)
print("TensorFlow LayerNorm Output:\n", output_ln_tf)
```

```
TensorFlow LayerNorm Output:
 tf.Tensor(
[[-0.8342132 ]
 [-1.8368168 ]
 [-0.6121157 ]
 [-0.70810634]
 [-1.4149286 ]], shape=(5, 1), dtype=float32)
```


#### PyTorch Example

```python
import torch
import torch.nn as nn

class SimpleNetLN(nn.Module):
    def __init__(self):
        super(SimpleNetLN, self).__init__()
        self.fc1 = nn.Linear(10, 20)
        self.ln1 = nn.LayerNorm(20)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(20, 1)

    def forward(self, x):
        x = self.fc1(x)
        x = self.ln1(x)  # Layer normalization applied here
        x = self.relu(x)
        x = self.fc2(x)
        return x

# Create network and sample input
net_ln = SimpleNetLN()
sample_input = torch.randn(5, 10)
output_ln = net_ln(sample_input)
print("PyTorch LayerNorm Output:\n", output_ln)
```


### Real World Example

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Load and preprocess the MNIST dataset
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
# Normalize pixel values to [0, 1]
x_train = x_train.astype('float32') / 255.0
x_test  = x_test.astype('float32') / 255.0
# Expand dimensions to add the channel axis (MNIST is grayscale)
x_train = x_train[..., tf.newaxis]
x_test  = x_test[..., tf.newaxis]


def create_model_layer_norm():
    model = models.Sequential([
        layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
        # Default axis=-1: statistics are taken over the channel axis at each
        # spatial position, independently for every sample
        layers.LayerNormalization(),
        layers.Conv2D(64, (3, 3), activation='relu'),
        layers.LayerNormalization(),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation='relu'),
        layers.LayerNormalization(),
        layers.Dense(10, activation='softmax')
    ])
    return model

model_ln = create_model_layer_norm()
model_ln.compile(optimizer='adam',
                 loss='sparse_categorical_crossentropy',
                 metrics=['accuracy'])

print("\n=== Training Model with Layer Normalization ===")
model_ln.summary()
model_ln.fit(x_train, y_train, epochs=5, batch_size=128, validation_split=0.1)

print("Test evaluation:")
model_ln.evaluate(x_test, y_test)
```

```
=== Training Model with Layer Normalization ===
Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #
 conv2d_2 (Conv2D)           (None, 26, 26, 32)        320

 layer_normalization (LayerN  (None, 26, 26, 32)       64
 ormalization)

 conv2d_3 (Conv2D)           (None, 24, 24, 64)        18496

 layer_normalization_1 (Laye  (None, 24, 24, 64)       128
 rNormalization)

 max_pooling2d_1 (MaxPooling  (None, 12, 12, 64)       0
 2D)

[0.04103352874517441, 0.9884999990463257]
```

## Residual Connections

Training very deep neural networks can be challenging, partly due to the difficulty of propagating gradients through many layers. Residual connections were introduced as a solution to this problem by allowing the gradient to flow more directly through the network. 

The core idea is to add the original input of a block to its output after a series of transformations. This creates a shortcut that helps preserve the original signal and facilitates the learning of identity mappings, if necessary. By effectively “skipping” layers, residual connections help prevent degradation in performance as the network depth increases.

![](https://miro.medium.com/v2/resize:fit:1122/1*RTYKpn1Vqr-8zT5fqa8-jA.png)

### The Process

Residual connections ease the training of very deep networks by adding shortcut paths around groups of layers. The idea breaks down as follows:

1. **Learning a Residual Function:**  
   In a conventional network block, one might aim to learn a direct mapping $H(x)$ from the input $x$ to the output. With residual connections, the block instead learns a residual function $F(x)$ such that:
   $$
   F(x) = H(x) - x.
   $$
   In other words, rather than learning the full transformation, the block learns the difference between the desired transformation and the identity function.

2. **Combining the Input with the Residual:**  
   The output of the residual block is then given by:
   $$
   y = F(x) + x.
   $$
   This simple addition means that if the optimal transformation is close to the identity function (i.e., $H(x) \approx x$), the residual function $F(x)$ can easily learn to output values near zero, thereby preserving the original input.

3. **Benefits for Gradient Flow:**  
   During backpropagation, the addition gives the gradient a direct path back to the input. Since $y = F(x) + x$, the block's Jacobian is $\partial y / \partial x = \partial F / \partial x + I$, so the upstream gradient reaches $x$ through the identity term even when $\partial F / \partial x$ is very small. This counteracts vanishing gradients and allows deeper networks to be trained without the degradation in performance seen in plain stacks of layers.

4. **Facilitating Identity Mappings:**  
   If, in any layer, the best function to learn is simply the identity (i.e., no change to the input), residual connections make this easy. The network can set $F(x)$ to zero (or near zero) without any special architectural modifications, allowing the original input to pass unchanged through the block.

This approach has proven especially effective in architectures like ResNet, where networks with hundreds of layers have been successfully trained, largely thanks to the enhanced gradient flow provided by residual connections.
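The points above can be checked on a toy block in plain Python. The helper names and the two branch functions are hypothetical stand-ins for a learned $F(x)$; the derivative is estimated numerically with central differences rather than by backpropagation.

```python
import math

def make_residual_block(branch):
    """Wrap a transformation `branch` so the block computes y = F(x) + x."""
    def block(x):
        return [fx + xi for fx, xi in zip(branch(x), x)]
    return block

def zero_branch(x):
    # Hypothetical branch that has learned F(x) = 0
    return [0.0 for _ in x]

x = [1.0, -2.0, 3.0]
identity_block = make_residual_block(zero_branch)
print(identity_block(x))   # [1.0, -2.0, 3.0]

# Scalar case: compare the derivative of F(x) alone with that of F(x) + x.
def branch_scalar(x):
    return math.tanh(0.1 * x)      # hypothetical small transformation

def residual_scalar(x):
    return branch_scalar(x) + x    # same transformation plus the shortcut

def numerical_derivative(fn, x, h=1e-6):
    """Central-difference estimate of d fn / d x."""
    return (fn(x + h) - fn(x - h)) / (2 * h)

d_plain = numerical_derivative(branch_scalar, 0.5)
d_residual = numerical_derivative(residual_scalar, 0.5)
print(d_plain, d_residual)   # the second is larger by about 1, the identity term
```

With a zero branch the block passes its input through unchanged, and the shortcut adds one to the local derivative regardless of how small the branch's own derivative is.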


### Benefits

1. **Expanded Representational Capacity:**  
Residual connections let a network learn residual functions, that is, the differences between the desired transformation and the identity function. The benefit is less about what a single block could represent in principle and more about how easily useful functions can be found: the layers only have to learn the modifications needed to improve the input rather than the entire transformation from scratch. Because this makes much deeper networks trainable in practice, the range of functions the network can effectively learn grows with it.

2. **Mitigation of Shattered Gradients:**  
Deep networks can suffer from shattered gradients, where gradients become noisy and unstable as they propagate back through many layers. Residual connections provide shortcut paths that allow gradients to bypass multiple layers, helping them keep their strength and consistency. By preserving more of the gradient signal during backpropagation, residual connections reduce this degradation, which is critical for training very deep networks.
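The effect of shortcuts on gradient strength can be illustrated with a deliberately simple model: a stack of scalar linear layers with small weights. This toy only tracks gradient magnitude, not the noise structure associated with shattered gradients, and the weight value and depth are arbitrary choices for the example.

```python
def gradient_through_stack(weights, residual):
    """d(output)/d(input) for a stack of scalar linear layers.

    Plain layer:    h -> w * h        (local derivative w)
    Residual layer: h -> h + w * h    (local derivative 1 + w)
    """
    grad = 1.0
    for w in weights:
        grad *= (1.0 + w) if residual else w
    return grad

weights = [0.01] * 50   # hypothetical small weights, 50 layers deep

plain_grad = gradient_through_stack(weights, residual=False)
residual_grad = gradient_through_stack(weights, residual=True)

print(plain_grad)      # vanishingly small, on the order of 1e-100
print(residual_grad)   # stays of order one, roughly 1.64
```

In the plain stack the gradient is a product of fifty small numbers and collapses toward zero; with shortcuts each factor is close to one, so the signal reaching the first layer stays usable.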

### Examples

#### TensorFlow Example

```python
import tensorflow as tf

class ResidualBlock(tf.keras.layers.Layer):
    def __init__(self, in_features):
        super(ResidualBlock, self).__init__()
        self.fc1 = tf.keras.layers.Dense(in_features)
        self.bn1 = tf.keras.layers.BatchNormalization()
        self.relu = tf.keras.layers.ReLU()
        self.fc2 = tf.keras.layers.Dense(in_features)
        self.bn2 = tf.keras.layers.BatchNormalization()

    def call(self, x, training=False):
        residual = x  # Save input for the shortcut connection
        out = self.fc1(x)
        out = self.bn1(out, training=training)
        out = self.relu(out)
        out = self.fc2(out)
        out = self.bn2(out, training=training)
        out = out + residual  # Add the shortcut connection
        out = self.relu(out)
        return out

# Create residual block and sample input
res_block_tf = ResidualBlock(in_features=20)
sample_input_tf = tf.random.normal([5, 20])
output_res_tf = res_block_tf(sample_input_tf, training=True)
print("TensorFlow Residual Block Output:\n", output_res_tf)
```

```
TensorFlow Residual Block Output:
 tf.Tensor(
[[0.         0.11795241 2.3836324  1.6010782  0.28818804 0.
  0.05150962 0.         1.5185751  0.         0.         0.
  0.44952118 1.3067663  0.         0.         0.         0.
  1.2832165  0.6867311 ]
 [0.         1.9266136  0.         0.         2.7598815  0.5555755
  0.19757074 0.         0.18135703 2.241429   0.56072474 1.8227056
  0.         0.         0.         0.         0.59057254 0.
  0.         1.3937647 ]
 [1.1942751  0.         0.4036159  2.1789103  0.         0.45983297
  0.30273807 0.14427543 2.6960754  0.         0.         0.
  0.24127856 0.5534748  0.         2.6060038  0.7564375  1.2216339
  0.19340134 0.        ]
 [0.         0.69583654 0.         0.15185216 0.         0.31141677
  2.115551   0.         0.4362759  0.         1.3639169  0.
  0.         0.         1.3726174  0.4778955  1.6547519  1.867121
  0.         0.        ]
 [1.1807977  0.5903413  1.3898568  2.0094955  0.27392915 1.4721
  0.         0.28615826 0.
```

#### PyTorch Example

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, in_features):
        super(ResidualBlock, self).__init__()
        self.fc1 = nn.Linear(in_features, in_features)
        self.bn1 = nn.BatchNorm1d(in_features)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(in_features, in_features)
        self.bn2 = nn.BatchNorm1d(in_features)

    def forward(self, x):
        residual = x  # Save input for the shortcut connection
        out = self.fc1(x)
        out = self.bn1(out)
        out = self.relu(out)
        out = self.fc2(out)
        out = self.bn2(out)
        out += residual  # Add the shortcut connection
        out = self.relu(out)
        return out

# Create residual block and sample input
res_block = ResidualBlock(in_features=20)
sample_input = torch.randn(5, 20)
output_res = res_block(sample_input)
print("PyTorch Residual Block Output:\n", output_res)
```

### Real World Example

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Load and preprocess the MNIST dataset
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
# Normalize pixel values to [0, 1]
x_train = x_train.astype('float32') / 255.0
x_test  = x_test.astype('float32') / 255.0
# Expand dimensions to add the channel axis (MNIST is grayscale)
x_train = x_train[..., tf.newaxis]
x_test  = x_test[..., tf.newaxis]

# Define a custom Residual Block
class ResidualBlock(layers.Layer):
    def __init__(self, filters, kernel_size=3):
        super(ResidualBlock, self).__init__()
        self.conv1 = layers.Conv2D(filters, kernel_size, padding='same', activation='relu')
        self.conv2 = layers.Conv2D(filters, kernel_size, padding='same', activation=None)
        self.relu = layers.ReLU()

    def call(self, inputs, training=None):
        x = self.conv1(inputs)
        x = self.conv2(x)
        x += inputs  # Add shortcut (residual connection)
        return self.relu(x)


def create_model_residual():
    inputs = layers.Input(shape=(28, 28, 1))
    # Initial convolution layer
    x = layers.Conv2D(32, (3, 3), padding='same', activation='relu')(inputs)

    # First residual block
    x = ResidualBlock(32)(x)
    x = layers.MaxPooling2D((2, 2))(x)

    # Second convolution and residual block
    x = layers.Conv2D(64, (3, 3), padding='same', activation='relu')(x)
    x = ResidualBlock(64)(x)
    x = layers.MaxPooling2D((2, 2))(x)
    x = layers.Flatten()(x)
    x = layers.Dense(128, activation='relu')(x)
    outputs = layers.Dense(10, activation='softmax')(x)

    model = tf.keras.Model(inputs, outputs)
    return model

model_res = create_model_residual()
model_res.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])

print("\n=== Training Model with Residual Connections ===")
model_res.summary()
model_res.fit(x_train, y_train, epochs=5, batch_size=128, validation_split=0.1)

print("Test evaluation:")
model_res.evaluate(x_test, y_test)
```

```
=== Training Model with Residual Connections ===
Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #
 input_1 (InputLayer)        [(None, 28, 28, 1)]       0

 conv2d_4 (Conv2D)           (None, 28, 28, 32)        320

 residual_block_1 (ResidualB  (None, 28, 28, 32)       18496
 lock)

 max_pooling2d_2 (MaxPooling  (None, 14, 14, 32)       0
 2D)

 conv2d_7 (Conv2D)           (None, 14, 14, 64)        18496

 residual_b

[0.02266356721520424, 0.9925000071525574]
```