<a href="https://colab.research.google.com/github/Jessicantma/Task_Week4/blob/main/taskweek4_jessicanatamanapitupulu_2106726150.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Jessica Natama Napitupulu 2106726150

In [None]:
!pip install d2l



In [None]:
import torch
from torch import nn
from d2l import torch as d2l

In [None]:
class AlexNet(d2l.Classifier):
    def __init__(self, lr=0.1, num_classes=10):
        super().__init__()
        self.save_hyperparameters()
        self.net = nn.Sequential(
            nn.LazyConv2d(96, kernel_size=11, stride=4, padding=1),
            nn.ReLU(), nn.MaxPool2d(kernel_size=3, stride=2),
            nn.LazyConv2d(256, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.LazyConv2d(384, kernel_size=3, padding=1), nn.ReLU(),
            nn.LazyConv2d(384, kernel_size=3, padding=1), nn.ReLU(),
            nn.LazyConv2d(256, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2), nn.Flatten(),
            nn.LazyLinear(4096), nn.ReLU(), nn.Dropout(p=0.5),
            nn.LazyLinear(4096), nn.ReLU(),nn.Dropout(p=0.5),
            nn.LazyLinear(num_classes))
        self.net.apply(d2l.init_cnn)

In [None]:
AlexNet().layer_summary((1, 1, 224, 224))

Conv2d output shape:	 torch.Size([1, 96, 54, 54])
ReLU output shape:	 torch.Size([1, 96, 54, 54])
MaxPool2d output shape:	 torch.Size([1, 96, 26, 26])
Conv2d output shape:	 torch.Size([1, 256, 26, 26])
ReLU output shape:	 torch.Size([1, 256, 26, 26])
MaxPool2d output shape:	 torch.Size([1, 256, 12, 12])
Conv2d output shape:	 torch.Size([1, 384, 12, 12])
ReLU output shape:	 torch.Size([1, 384, 12, 12])
Conv2d output shape:	 torch.Size([1, 384, 12, 12])
ReLU output shape:	 torch.Size([1, 384, 12, 12])
Conv2d output shape:	 torch.Size([1, 256, 12, 12])
ReLU output shape:	 torch.Size([1, 256, 12, 12])
MaxPool2d output shape:	 torch.Size([1, 256, 5, 5])
Flatten output shape:	 torch.Size([1, 6400])
Linear output shape:	 torch.Size([1, 4096])
ReLU output shape:	 torch.Size([1, 4096])
Dropout output shape:	 torch.Size([1, 4096])
Linear output shape:	 torch.Size([1, 4096])
ReLU output shape:	 torch.Size([1, 4096])
Dropout output shape:	 torch.Size([1, 4096])
Linear output shape:	 torch.Size([1,

This section analyzes AlexNet in detail: its computational properties, memory footprint, computational cost, the impact of memory bandwidth, chip design trade-offs, benchmarking practices, training behavior compared to LeNet, model simplifications, and more.

1. Computational Properties of AlexNet
AlexNet is a pioneering convolutional neural network (CNN) architecture introduced by Alex Krizhevsky et al. in 2012. It significantly advanced the field of deep learning, particularly in image classification tasks. Here's a breakdown of its computational properties:

Architecture Overview:

Layers: 8 layers (5 convolutional layers followed by 3 fully connected layers).
Parameters: Approximately 60 million parameters.
Activation Functions: ReLU (Rectified Linear Unit) activations.
Regularization: Dropout and data augmentation techniques.
Key Features:

Deep Architecture: Allowed for learning hierarchical feature representations.
Use of GPUs: Leveraged GPU parallelism for training efficiency.
Local Response Normalization (LRN): Applied after certain layers to aid generalization.
2. Memory Footprint for Convolutions and Fully Connected Layers
a. Memory Footprint Calculation
Memory footprint refers to the amount of memory required to store the parameters and intermediate activations during the forward and backward passes.

Convolutional Layers:

Parameters: Number of filters × (filter height × filter width × input channels) + biases.
Activations: Output feature maps × spatial dimensions.
Fully Connected Layers:

Parameters: Number of neurons in the layer × number of inputs + biases.
Activations: Number of neurons.
Example Calculation for AlexNet:

Assuming input images of size 224x224x3 (standard for AlexNet):

Convolutional Layers:
Conv1: 96 filters of size 11x11x3 → Parameters: 96 × (11×11×3) + 96 = 34,944.
Similar calculations apply to other convolutional layers.
Fully Connected Layers:
FC6: 4096 neurons × (flattened input from previous layer) + 4096 biases. With the 6,400-value flattened input shown in the layer summary above, this is 4096 × 6400 + 4096 = 26,218,496 parameters.
This is far more than any single convolutional layer holds.
b. Dominance: Fully Connected Layers
In AlexNet, fully connected layers dominate the memory footprint due to their dense connectivity and large number of parameters. For instance, FC6 alone accounts for a substantial portion of the total parameters and memory usage.
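
As a rough check of this claim, the sketch below tallies parameters for the AlexNet variant defined in this notebook (single-channel input, 10 classes). The layer lists are transcribed from the class definition and the layer summary; the helper names are illustrative, and float32 storage (4 bytes per parameter) is an assumption.

```python
# (name, out_channels, kernel_size, in_channels) for the notebook's AlexNet.
# in_channels of conv1 is 1 because the layer summary uses a grayscale input.
CONV_LAYERS = [
    ("conv1", 96, 11, 1),
    ("conv2", 256, 5, 96),
    ("conv3", 384, 3, 256),
    ("conv4", 384, 3, 384),
    ("conv5", 256, 3, 384),
]
# (name, in_features, out_features); 6400 = 256 * 5 * 5 from the layer summary.
FC_LAYERS = [
    ("fc6", 6400, 4096),
    ("fc7", 4096, 4096),
    ("fc8", 4096, 10),
]
BYTES_PER_PARAM = 4  # assumption: float32 weights


def conv_params(out_ch, k, in_ch):
    """Filters x (k x k x in_ch) weights, plus one bias per filter."""
    return out_ch * (k * k * in_ch) + out_ch


def fc_params(n_in, n_out):
    """Dense weight matrix plus one bias per output neuron."""
    return n_in * n_out + n_out


conv_total = sum(conv_params(o, k, i) for _, o, k, i in CONV_LAYERS)
fc_total = sum(fc_params(i, o) for _, i, o in FC_LAYERS)
total = conv_total + fc_total

print(f"conv params: {conv_total:,} ({conv_total * BYTES_PER_PARAM / 2**20:.1f} MiB)")
print(f"fc params:   {fc_total:,} ({fc_total * BYTES_PER_PARAM / 2**20:.1f} MiB)")
print(f"fc share:    {fc_total / total:.1%}")
```

Under these assumptions the fully connected layers hold over 90% of the parameters, with fc6 alone contributing more than half.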

3. Computational Cost for Convolutions and Fully Connected Layers
a. Computational Cost Metrics
FLOPs (Floating Point Operations): A common metric to measure computational cost.
Convolutional Layers:

FLOPs per layer: Number of operations = 2 × (Number of filters) × (Filter height × Filter width × Input channels) × (Output feature map height × Output feature map width).
Example: For Conv1 in AlexNet: 2 × 96 × (11×11×3) × (55×55) = 210,830,400 ≈ 0.21 billion FLOPs.
Fully Connected Layers:

FLOPs per layer: 2 × (Number of input neurons) × (Number of output neurons).
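Example: FC6: 2 × (flattened input size) × 4096. With the 6,400-value flattened input from the layer summary above, this is 2 × 6400 × 4096 ≈ 52 million FLOPs.
b. Dominance: Convolutional Layers
Convolutional layers account for most of the FLOPs in AlexNet, even though fully connected layers hold most of the parameters. Each convolutional weight is reused at every output position, whereas each fully connected weight is used only once per input, so the dense layers are parameter-heavy but comparatively cheap to compute.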
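
A quick tally for the AlexNet variant defined in this notebook shows how the forward-pass cost splits between the two layer types. This is a sketch using the FLOP formulas above; the output sizes (54, 26, 12) and the 6,400-value flattened input are taken from the layer summary, and the helper names are illustrative.

```python
# (name, out_channels, kernel_size, in_channels, output height = width)
CONV_LAYERS = [
    ("conv1", 96, 11, 1, 54),
    ("conv2", 256, 5, 96, 26),
    ("conv3", 384, 3, 256, 12),
    ("conv4", 384, 3, 384, 12),
    ("conv5", 256, 3, 384, 12),
]
# (name, in_features, out_features)
FC_LAYERS = [
    ("fc6", 6400, 4096),
    ("fc7", 4096, 4096),
    ("fc8", 4096, 10),
]


def conv_flops(out_ch, k, in_ch, out_hw):
    """2 x filters x (k x k x in_ch) x output positions (multiply + add)."""
    return 2 * out_ch * (k * k * in_ch) * out_hw * out_hw


def fc_flops(n_in, n_out):
    """2 x inputs x outputs (multiply + add)."""
    return 2 * n_in * n_out


conv_flops_total = sum(conv_flops(o, k, i, hw) for _, o, k, i, hw in CONV_LAYERS)
fc_flops_total = sum(fc_flops(i, o) for _, i, o in FC_LAYERS)
share = conv_flops_total / (conv_flops_total + fc_flops_total)

print(f"conv FLOPs: {conv_flops_total / 1e9:.2f} G")
print(f"fc FLOPs:   {fc_flops_total / 1e9:.3f} G")
print(f"conv share: {share:.1%}")
```

For a single image, the convolutions come to roughly 1.79 GFLOPs against about 0.09 GFLOPs for the fully connected layers.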

4. Impact of Memory (Bandwidth, Latency, Size) on Computation
a. Memory Characteristics:
Read and Write Bandwidth: The rate at which data can be read from or written to memory.
Latency: Time delay between a request for data and the delivery of data.
Memory Size: Total capacity available for storing data and parameters.
b. Effects on Computation:
Bandwidth Bottlenecks:

High computational demands require substantial data movement.
Limited bandwidth can lead to data stalls, reducing overall throughput.
Latency Impacts:

High latency can delay data availability, impacting the efficiency of parallel computations.
Memory Size Constraints:

Insufficient memory can force frequent data swapping between faster (e.g., GPU) and slower (e.g., CPU) memories, increasing computation time.
c. Differences for Training and Inference:
Training:

Higher Memory Usage: Due to storage of intermediate activations for backpropagation.
Increased Bandwidth Demand: Frequent data reads/writes for gradient calculations.
Latency Sensitivity: More pronounced as training requires sequential computations (forward and backward passes).
Inference:

Lower Memory Requirements: Only forward pass activations are needed.
Bandwidth Less Critical: Reduced data movement compared to training.
Latency Less Sensitive: Especially in batch processing scenarios.
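
The gap between training and inference memory can be made concrete with the output shapes from the layer summary. The sketch below is a simplification: it assumes training keeps every layer output for the backward pass, while inference only needs one layer's input and output at the same time. It ignores weights, gradients, and optimizer state, and the variable names are illustrative.

```python
from math import prod

INPUT_SHAPE = (1, 224, 224)
# Per-layer output shapes (batch dimension dropped), from the layer summary.
OUTPUT_SHAPES = [
    (96, 54, 54), (96, 54, 54), (96, 26, 26),     # conv1, relu, pool
    (256, 26, 26), (256, 26, 26), (256, 12, 12),  # conv2, relu, pool
    (384, 12, 12), (384, 12, 12),                 # conv3, relu
    (384, 12, 12), (384, 12, 12),                 # conv4, relu
    (256, 12, 12), (256, 12, 12), (256, 5, 5),    # conv5, relu, pool
    (6400,),                                      # flatten
    (4096,), (4096,), (4096,),                    # fc6, relu, dropout
    (4096,), (4096,), (4096,),                    # fc7, relu, dropout
    (10,),                                        # fc8 (num_classes=10)
]

sizes = [prod(INPUT_SHAPE)] + [prod(s) for s in OUTPUT_SHAPES]

# Training (simplified): every activation is kept until the backward pass.
train_elems = sum(sizes)
# Inference (simplified): only a layer's input and output must coexist.
infer_elems = max(a + b for a, b in zip(sizes, sizes[1:]))

BYTES = 4  # assumption: float32 activations
print(f"training:  {train_elems * BYTES / 2**20:.2f} MiB per image")
print(f"inference: {infer_elems * BYTES / 2**20:.2f} MiB per image")
```

Both figures scale linearly with batch size, which is why training batch size is usually limited by activation memory rather than by the weights.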
5. Chip Design Trade-offs: Balancing Computation and Memory Bandwidth
As a chip designer aiming to optimize for both computation and memory bandwidth while considering power and area constraints, the following trade-offs are essential:

a. Optimization Strategies:
Compute vs. Memory:

Higher Compute Capability:
Increases parallel processing units (e.g., more ALUs).
Trade-Off: Higher power consumption and larger chip area.
Enhanced Memory Bandwidth:
Wider memory buses or higher frequency memory.
Trade-Off: More pins, increased control logic, larger chip area.
Memory Hierarchy Design:

Implementing caches (L1, L2) to reduce bandwidth demands.
Utilizing on-chip SRAM to store frequently accessed data, minimizing off-chip memory accesses.
Data Reuse and Local Storage:

Designing processing elements that can reuse data from local storage reduces the need for external memory bandwidth.
Power Management:

Dynamic voltage and frequency scaling (DVFS) to adjust power based on computational load.
Power gating unused parts of the chip to save energy.
b. Balancing Act:
Performance Requirements: Prioritize based on whether the application demands high throughput (e.g., real-time inference) or high flexibility.
Area Constraints: Optimize the layout to maximize parallelism without excessively increasing chip size.
Power Budget: Ensure that the design adheres to power consumption limits, especially for mobile or embedded applications.
c. Example Optimization:
ASIC Design for AlexNet:
Parallel Convolution Units: To handle multiple filters simultaneously.
High-Bandwidth Memory Interfaces: To supply data quickly to convolution units.
On-Chip Memory Buffers: To store frequently accessed weights and activations, reducing external memory accesses.
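
One common way to reason about the compute versus bandwidth balance is a roofline estimate. The sketch below is illustrative only: the peak throughput and bandwidth figures describe a hypothetical chip, each value is assumed to be moved to or from memory exactly once (ideal on-chip reuse), and the layer sizes come from this notebook's AlexNet at batch size 1.

```python
def intensity(flops, values_moved, bytes_per_value=4):
    """Arithmetic intensity in FLOPs per byte of memory traffic."""
    return flops / (values_moved * bytes_per_value)


def attainable_gflops(ai, peak_gflops, bandwidth_gbs):
    """Roofline model: capped by compute or by memory traffic."""
    return min(peak_gflops, bandwidth_gbs * ai)


# Hypothetical chip, chosen only for illustration.
PEAK_GFLOPS = 10_000.0
BANDWIDTH_GBS = 500.0

# fc6: 6400 -> 4096 (weights + biases + input + output values).
fc6_flops = 2 * 6400 * 4096
fc6_values = 6400 * 4096 + 4096 + 6400 + 4096

# conv2: 96 -> 256 channels, 5x5 kernel, 26x26 feature maps.
conv2_flops = 2 * 256 * (5 * 5 * 96) * 26 * 26
conv2_values = 256 * (5 * 5 * 96) + 256 + 96 * 26 * 26 + 256 * 26 * 26

fc6_ai = intensity(fc6_flops, fc6_values)
conv2_ai = intensity(conv2_flops, conv2_values)

for name, ai in (("fc6", fc6_ai), ("conv2", conv2_ai)):
    perf = attainable_gflops(ai, PEAK_GFLOPS, BANDWIDTH_GBS)
    bound = "compute" if perf == PEAK_GFLOPS else "memory"
    print(f"{name}: {ai:.1f} FLOPs/byte, {perf:.0f} GFLOP/s, {bound}-bound")
```

On this hypothetical chip the fully connected layer is limited by bandwidth while the convolution is limited by compute, which is the tension the trade-offs above try to balance. Larger batches raise the intensity of fully connected layers because each weight is reused across the batch.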
6. Decline in Reporting Performance Benchmarks on AlexNet
Reasons Engineers No Longer Report Performance Benchmarks on AlexNet:

Advancements in Architecture:

Newer architectures like VGG, ResNet, Inception, and EfficientNet offer better performance and efficiency.
AlexNet is considered outdated in comparison to these models.
Dataset Evolution:

AlexNet was primarily benchmarked on ImageNet, but newer datasets with higher complexity and different characteristics are now standard.
Hardware Improvements:

Modern GPUs and specialized accelerators have evolved, making AlexNet benchmarks less relevant as they no longer reflect current hardware capabilities.
Research Focus Shift:

The research community focuses on more challenging tasks, such as object detection, segmentation, and tasks requiring greater accuracy and efficiency.
Optimization and Efficiency:

Contemporary models emphasize not just accuracy but also efficiency in terms of parameter count, FLOPs, and memory usage, areas where AlexNet is less competitive.
7. Increasing Epochs: AlexNet vs. LeNet
a. Training AlexNet for More Epochs
Behavior:
AlexNet: With its deeper architecture and larger number of parameters, increasing epochs can lead to better convergence and potentially higher accuracy up to a point. However, it also increases the risk of overfitting if not properly regularized.
Compared to LeNet:
LeNet: A simpler architecture with fewer parameters, suitable for simpler tasks like digit recognition.
Impact of More Epochs:
LeNet: May quickly converge to optimal performance with fewer epochs; additional epochs offer diminishing returns.
AlexNet: Requires more epochs to fully train due to its complexity but benefits more from extended training to capture intricate patterns.
b. Reasons for Differences:
Model Complexity:

AlexNet's deeper and wider architecture can model more complex functions, benefiting from additional training.
LeNet's simplicity may not leverage the benefits of extended training to the same extent.
Dataset Complexity:

AlexNet is typically trained on more complex datasets (e.g., ImageNet), which require more training epochs to learn diverse features.
LeNet is often applied to simpler datasets (e.g., MNIST), which are easier to learn quickly.
Overfitting Risks:

AlexNet is more prone to overfitting with more epochs due to its high capacity, necessitating careful regularization.
LeNet's lower capacity reduces the risk of overfitting with additional epochs.
8. Complexity of AlexNet for Fashion-MNIST
Issue: AlexNet may be too complex for the Fashion-MNIST dataset, particularly due to the low resolution of the initial images.

a. Reasons:
High Parameter Count:

Fashion-MNIST images are 28x28 grayscale images, much smaller and simpler than ImageNet images (224x224 RGB).
AlexNet's large number of parameters can lead to overfitting on such a simple dataset.
Redundant Computations:

The depth and complexity of AlexNet are unnecessary for the relatively low variability in Fashion-MNIST.
Inefficient Use of Resources:

Large fully connected layers consume excessive memory and computational resources without proportional benefits.
b. Consequences:
Longer Training Times: Due to increased computational demands.
Potential Overfitting: The model may not generalize well to unseen data.
Inefficient Performance: Higher latency and lower throughput without accuracy gains.
9. Simplifying AlexNet for Faster Training on Fashion-MNIST
To optimize training speed while maintaining accuracy on Fashion-MNIST, consider simplifying the model:

a. Strategies:
Reduce Depth:

Decrease the number of convolutional and fully connected layers.
Decrease Width:

Use fewer filters in each convolutional layer.
Replace Fully Connected Layers:

Use global average pooling instead of dense layers to reduce parameters.
Input Image Size:

Keep the input at the dataset's native resolution (28x28 for Fashion-MNIST) instead of resizing to 224x224, to reduce computational load.
Use Batch Normalization:

Stabilizes and accelerates training, allowing for higher learning rates.
Apply Regularization:

Techniques like dropout can prevent overfitting, allowing for smaller models.

In [None]:
import torch.nn as nn
import torch.nn.functional as F

class SimplifiedAlexNet(nn.Module):
    def __init__(self):
        super(SimplifiedAlexNet, self).__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=1, padding=1),  # Adjusted for grayscale
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2),
            nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2),
        )
        self.classifier = nn.Sequential(
            nn.Dropout(),
            nn.Linear(64 * 7 * 7, 256),
            nn.ReLU(inplace=True),
            nn.Dropout(),
            nn.Linear(256, 10),  # Assuming 10 classes for Fashion-MNIST
        )

    def forward(self, x):
        x = self.features(x)
        x = x.view(x.size(0), 64 * 7 * 7)
        x = self.classifier(x)
        return x


b. Benefits:
Reduced Parameters: Lower memory footprint and faster computations.
Faster Training: Due to fewer layers and parameters.
Maintained Accuracy: Adequate capacity for Fashion-MNIST's complexity.
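
To put a number on the reduction, the sketch below traces the spatial size through SimplifiedAlexNet for a 28x28 input and counts its parameters. The `conv_out` helper is illustrative, and the comparison uses the approximate 60 million parameters quoted earlier for the original AlexNet.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output height/width of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


size = 28                       # Fashion-MNIST native resolution
size = conv_out(size, 3, 1, 1)  # conv 3x3, padding 1 keeps the size
size = conv_out(size, 2, 2)     # max pool 2x2 (stride defaults to kernel size)
size = conv_out(size, 3, 1, 1)  # conv 3x3, padding 1 keeps the size
size = conv_out(size, 2, 2)     # max pool 2x2
flat = 64 * size * size         # input features of the first Linear layer

params = (
    (32 * (3 * 3 * 1) + 32)     # conv1 weights + biases
    + (64 * (3 * 3 * 32) + 64)  # conv2 weights + biases
    + (flat * 256 + 256)        # first Linear layer
    + (256 * 10 + 10)           # output layer
)

ALEXNET_PARAMS = 60_000_000     # approximate figure quoted earlier
print(f"final feature map: {size}x{size}, flattened: {flat}")
print(f"parameters: {params:,} (about {ALEXNET_PARAMS / params:.0f}x fewer)")
```

The trace confirms the `64 * 7 * 7` used in the classifier, and the total is well under one million parameters.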
10. Designing a Better Model for Direct Image Processing
To design a model that works directly on images (assuming high-resolution images), consider the following enhancements over AlexNet:

a. Modern Architectural Improvements:
Residual Connections (ResNet):

Helps in training deeper networks by mitigating vanishing gradients.
Inception Modules:

Allows for multi-scale feature extraction within the same layer.
Depthwise Separable Convolutions (MobileNet):

Reduces computational cost while maintaining performance.
Batch Normalization:

Stabilizes and accelerates training.
Efficient Activation Functions:

Leaky ReLU or ELU can offer better gradient flow.
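
The saving from depthwise separable convolutions can be checked with simple arithmetic. The sketch below compares weight counts (biases omitted) for one layer; the layer sizes are an arbitrary example and the function names are illustrative.

```python
def standard_conv_weights(k, c_in, c_out):
    """One k x k x c_in filter per output channel."""
    return k * k * c_in * c_out


def depthwise_separable_weights(k, c_in, c_out):
    """A k x k depthwise filter per input channel, then a 1x1 pointwise conv."""
    return k * k * c_in + c_in * c_out


k, c_in, c_out = 3, 128, 256  # arbitrary example layer
standard = standard_conv_weights(k, c_in, c_out)
separable = depthwise_separable_weights(k, c_in, c_out)
ratio = separable / standard

print(f"standard:  {standard:,} weights")
print(f"separable: {separable:,} weights")
print(f"ratio:     {ratio:.3f}")
```

The ratio equals 1/c_out + 1/k², so for 3x3 kernels the separable form needs roughly one ninth of the weights (and, at the same output size, of the multiply-adds).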

In [None]:
import torch.nn as nn
import torch.nn.functional as F

class ImprovedCNN(nn.Module):
    def __init__(self):
        super(ImprovedCNN, self).__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),

            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),

            nn.Conv2d(128, 256, kernel_size=3, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),

            nn.Conv2d(256, 256, kernel_size=3, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),

            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
        )
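        # Pool to a fixed 6x6 grid so the classifier input is 256 * 6 * 6 for
        # any input resolution (without this, only 96x96 inputs would fit).
        self.avgpool = nn.AdaptiveAvgPool2d((6, 6))
        self.classifier = nn.Sequential(
            nn.Dropout(0.5),
            nn.Linear(256 * 6 * 6, 1024),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(1024, 10),  # Adjust as per the number of classes
        )

    def forward(self, x):
        x = self.features(x)
        x = self.avgpool(x)
        x = x.flatten(1)  # (batch, 256 * 6 * 6)
        x = self.classifier(x)
        return x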



In [None]:
import torch.nn as nn
import torch.nn.functional as F

class EnhancedLeNet5(nn.Module):
    def __init__(self):
        super(EnhancedLeNet5, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=5)
        self.relu1 = nn.ReLU()
        self.pool1 = nn.MaxPool2d(2)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=5)
        self.relu2 = nn.ReLU()
        self.pool2 = nn.MaxPool2d(2)
        self.fc1 = nn.Linear(64 * 4 * 4, 256)
        self.relu3 = nn.ReLU()
        self.dropout = nn.Dropout(0.5)
        self.fc2 = nn.Linear(256, 10)

    def forward(self, x):
        x = self.pool1(self.relu1(self.conv1(x)))
        x = self.pool2(self.relu2(self.conv2(x)))
        x = x.view(x.size(0), -1)
        x = self.relu3(self.fc1(x))
        x = self.dropout(x)
        x = self.fc2(x)
        return x


13. Making AlexNet Overfit and Breaking Training
a. Overfitting AlexNet:
Overfitting occurs when a model learns the training data too well, including its noise and outliers, leading to poor generalization on unseen data.

How to Make AlexNet Overfit:

Increase Model Capacity:

Utilize all layers without regularization.
Maintain high parameter counts.
Reduce Training Data:

Limit the number of training samples to make the model memorize the data.
Remove Regularization:

Eliminate dropout layers and weight decay.
Train for Excessive Epochs:

Allow the model to train beyond the point of optimal generalization.
b. Countering Overfitting:
To counter overfitting, the following techniques can be introduced or restored:

Introduce Regularization:

Dropout: Randomly deactivates neurons during training.
Weight Decay (L2 Regularization): Penalizes large weights.
Data Augmentation:

Increases the diversity of training data, forcing the model to learn more generalized features.
Early Stopping:

Halt training when validation performance stops improving.
Reduce Model Complexity:

Decrease the number of layers or parameters.
Implement Batch Normalization:

Stabilizes learning and can have a regularizing effect.
c. Specific Feature to Remove or Change:
Remove Dropout Layers:

To intentionally overfit, eliminating dropout reduces regularization, allowing the model to memorize training data.
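
Early stopping, listed among the countermeasures above, can be sketched in a few lines. This is a minimal illustration rather than part of `d2l.Trainer`: the `EarlyStopping` class, its `patience` and `min_delta` parameters, and the validation losses are all made up for the example.

```python
class EarlyStopping:
    """Signal a stop once validation loss fails to improve for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation losses: improvement stalls after epoch 2.
val_losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.63, 0.50]

stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses):
    if stopper.step(loss):
        stopped_at = epoch
        break

print(f"stopped at epoch {stopped_at}, best validation loss {stopper.best}")
```

Training stops before the last entry is seen, so a late improvement is missed; a larger `patience` trades longer training for a lower chance of stopping too early.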

In [None]:
class AlexNetOverfit(nn.Module):
    def __init__(self):
        super(AlexNetOverfit, self).__init__()
        self.features = nn.Sequential(
            # ... (same as original AlexNet)
        )
        self.classifier = nn.Sequential(
            nn.Linear(256 * 6 * 6, 4096),
            nn.ReLU(inplace=True),
            # nn.Dropout(),  # Removed Dropout
            nn.Linear(4096, 4096),
            nn.ReLU(inplace=True),
            # nn.Dropout(),  # Removed Dropout
            nn.Linear(4096, 1000),
        )

    def forward(self, x):
        x = self.features(x)
        x = x.view(x.size(0), 256 * 6 * 6)
        x = self.classifier(x)
        return x


In [None]:
model = AlexNet(lr=0.01)
data = d2l.FashionMNIST(batch_size=128, resize=(224, 224))
trainer = d2l.Trainer(max_epochs=10, num_gpus=1)
trainer.fit(model, data)

In [None]:
def vgg_block(num_convs, out_channels):
    layers = []
    for _ in range(num_convs):
        layers.append(nn.LazyConv2d(out_channels, kernel_size=3, padding=1))
        layers.append(nn.ReLU())
    layers.append(nn.MaxPool2d(kernel_size=2,stride=2))
    return nn.Sequential(*layers)

In [None]:
class VGG(d2l.Classifier):
    def __init__(self, arch, lr=0.1, num_classes=10):
        super().__init__()
        self.save_hyperparameters()
        conv_blks = []
        for (num_convs, out_channels) in arch:
            conv_blks.append(vgg_block(num_convs, out_channels))
        self.net = nn.Sequential(
            *conv_blks, nn.Flatten(),
            nn.LazyLinear(4096), nn.ReLU(), nn.Dropout(0.5),
            nn.LazyLinear(4096), nn.ReLU(), nn.Dropout(0.5),
            nn.LazyLinear(num_classes))
        self.net.apply(d2l.init_cnn)

In [None]:
VGG(arch=((1, 64), (1, 128), (2, 256), (2, 512), (2, 512))).layer_summary(
    (1, 1, 224, 224))

In [None]:
model = VGG(arch=((1, 16), (1, 32), (2, 64), (2, 128), (2, 128)), lr=0.01)
trainer = d2l.Trainer(max_epochs=10, num_gpus=1)
data = d2l.FashionMNIST(batch_size=128, resize=(224, 224))
model.apply_init([next(iter(data.get_dataloader(True)))[0]], d2l.init_cnn)
trainer.fit(model, data)

In [None]:
def nin_block(out_channels, kernel_size, strides, padding):
    return nn.Sequential(
        nn.LazyConv2d(out_channels, kernel_size, strides, padding), nn.ReLU(),
        nn.LazyConv2d(out_channels, kernel_size=1), nn.ReLU(),
        nn.LazyConv2d(out_channels, kernel_size=1), nn.ReLU())

In [None]:
# NiN is used in the cells below but was not defined in this notebook.
# This definition is an assumption that follows the standard NiN layout
# built from nin_block.
class NiN(d2l.Classifier):
    def __init__(self, lr=0.1, num_classes=10):
        super().__init__()
        self.save_hyperparameters()
        self.net = nn.Sequential(
            nin_block(96, kernel_size=11, strides=4, padding=0),
            nn.MaxPool2d(3, stride=2),
            nin_block(256, kernel_size=5, strides=1, padding=2),
            nn.MaxPool2d(3, stride=2),
            nin_block(384, kernel_size=3, strides=1, padding=1),
            nn.MaxPool2d(3, stride=2),
            nn.Dropout(0.5),
            nin_block(num_classes, kernel_size=3, strides=1, padding=1),
            nn.AdaptiveAvgPool2d((1, 1)),
            nn.Flatten())
        self.net.apply(d2l.init_cnn)

In [None]:
NiN().layer_summary((1, 1, 224, 224))

In [None]:
model = NiN(lr=0.05)
trainer = d2l.Trainer(max_epochs=10, num_gpus=1)
data = d2l.FashionMNIST(batch_size=128, resize=(224, 224))
model.apply_init([next(iter(data.get_dataloader(True)))[0]], d2l.init_cnn)
trainer.fit(model, data)

In [None]:
class Inception(nn.Module):
    # c1--c4 are the number of output channels for each branch
    def __init__(self, c1, c2, c3, c4, **kwargs):
        super(Inception, self).__init__(**kwargs)
        # Branch 1
        self.b1_1 = nn.LazyConv2d(c1, kernel_size=1)
        # Branch 2
        self.b2_1 = nn.LazyConv2d(c2[0], kernel_size=1)
        self.b2_2 = nn.LazyConv2d(c2[1], kernel_size=3, padding=1)
        # Branch 3
        self.b3_1 = nn.LazyConv2d(c3[0], kernel_size=1)
        self.b3_2 = nn.LazyConv2d(c3[1], kernel_size=5, padding=2)
        # Branch 4
        self.b4_1 = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)
        self.b4_2 = nn.LazyConv2d(c4, kernel_size=1)

    def forward(self, x):
        b1 = F.relu(self.b1_1(x))
        b2 = F.relu(self.b2_2(F.relu(self.b2_1(x))))
        b3 = F.relu(self.b3_2(F.relu(self.b3_1(x))))
        b4 = F.relu(self.b4_2(self.b4_1(x)))
        return torch.cat((b1, b2, b3, b4), dim=1)

In [None]:
class GoogleNet(d2l.Classifier):
    def b1(self):
        return nn.Sequential(
            nn.LazyConv2d(64, kernel_size=7, stride=2, padding=3),
            nn.ReLU(), nn.MaxPool2d(kernel_size=3, stride=2, padding=1))

In [None]:
@d2l.add_to_class(GoogleNet)
def b2(self):
    return nn.Sequential(
        nn.LazyConv2d(64, kernel_size=1), nn.ReLU(),
        nn.LazyConv2d(192, kernel_size=3, padding=1), nn.ReLU(),
        nn.MaxPool2d(kernel_size=3, stride=2, padding=1))

In [None]:
@d2l.add_to_class(GoogleNet)
def b3(self):
    return nn.Sequential(Inception(64, (96, 128), (16, 32), 32),
                         Inception(128, (128, 192), (32, 96), 64),
                         nn.MaxPool2d(kernel_size=3, stride=2, padding=1))

In [None]:
@d2l.add_to_class(GoogleNet)
def b4(self):
    return nn.Sequential(Inception(192, (96, 208), (16, 48), 64),
                         Inception(160, (112, 224), (24, 64), 64),
                         Inception(128, (128, 256), (24, 64), 64),
                         Inception(112, (144, 288), (32, 64), 64),
                         Inception(256, (160, 320), (32, 128), 128),
                         nn.MaxPool2d(kernel_size=3, stride=2, padding=1))

In [None]:
@d2l.add_to_class(GoogleNet)
def b5(self):
    return nn.Sequential(Inception(256, (160, 320), (32, 128), 128),
                         Inception(384, (192, 384), (48, 128), 128),
                         nn.AdaptiveAvgPool2d((1,1)), nn.Flatten())
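
Because the four branches are concatenated along the channel dimension, each Inception block outputs c1 + c2[1] + c3[1] + c4 channels. The sketch below applies this to the block configurations used in b3, b4, and b5; the helper name is illustrative.

```python
def inception_out_channels(c1, c2, c3, c4):
    """Channels after concatenating the four branch outputs."""
    return c1 + c2[1] + c3[1] + c4


# Configurations copied from the b3, b4, and b5 definitions.
B3 = [(64, (96, 128), (16, 32), 32), (128, (128, 192), (32, 96), 64)]
B4 = [
    (192, (96, 208), (16, 48), 64),
    (160, (112, 224), (24, 64), 64),
    (128, (128, 256), (24, 64), 64),
    (112, (144, 288), (32, 64), 64),
    (256, (160, 320), (32, 128), 128),
]
B5 = [(256, (160, 320), (32, 128), 128), (384, (192, 384), (48, 128), 128)]

b3_channels = [inception_out_channels(*cfg) for cfg in B3]
b4_channels = [inception_out_channels(*cfg) for cfg in B4]
b5_channels = [inception_out_channels(*cfg) for cfg in B5]

print("b3:", b3_channels)
print("b4:", b4_channels)
print("b5:", b5_channels)
```

The last value, 1024, is the number of features that reach the final linear layer after global average pooling.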

In [None]:
@d2l.add_to_class(GoogleNet)
def __init__(self, lr=0.1, num_classes=10):
    super(GoogleNet, self).__init__()
    self.save_hyperparameters()
    self.net = nn.Sequential(self.b1(), self.b2(), self.b3(), self.b4(),
                             self.b5(), nn.LazyLinear(num_classes))
    self.net.apply(d2l.init_cnn)

In [None]:
GoogleNet().layer_summary((1, 1, 96, 96))

In [None]:
model = GoogleNet(lr=0.01)
trainer = d2l.Trainer(max_epochs=10, num_gpus=1)
data = d2l.FashionMNIST(batch_size=128, resize=(96, 96))
model.apply_init([next(iter(data.get_dataloader(True)))[0]], d2l.init_cnn)
trainer.fit(model, data)

In [None]:
def batch_norm(X, gamma, beta, moving_mean, moving_var, eps, momentum):
    # Use is_grad_enabled to determine whether we are in training mode
    if not torch.is_grad_enabled():
        # In prediction mode, use mean and variance obtained by moving average
        X_hat = (X - moving_mean) / torch.sqrt(moving_var + eps)
    else:
        assert len(X.shape) in (2, 4)
        if len(X.shape) == 2:
            # When using a fully connected layer, calculate the mean and
            # variance on the feature dimension
            mean = X.mean(dim=0)
            var = ((X - mean) ** 2).mean(dim=0)
        else:
            # When using a two-dimensional convolutional layer, calculate the
            # mean and variance on the channel dimension (axis=1). Here we
            # need to maintain the shape of X, so that the broadcasting
            # operation can be carried out later
            mean = X.mean(dim=(0, 2, 3), keepdim=True)
            var = ((X - mean) ** 2).mean(dim=(0, 2, 3), keepdim=True)
        # In training mode, the current mean and variance are used
        X_hat = (X - mean) / torch.sqrt(var + eps)
        # Update the mean and variance using moving average
        moving_mean = (1.0 - momentum) * moving_mean + momentum * mean
        moving_var = (1.0 - momentum) * moving_var + momentum * var
    Y = gamma * X_hat + beta  # Scale and shift
    return Y, moving_mean.data, moving_var.data
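
A small worked example may help check the arithmetic in `batch_norm`. The sketch below applies the same steps to a single feature with four values using plain Python; the function `batch_norm_1d` is illustrative and not part of the notebook's code, and the initial moving statistics (0 and 1) match those used by the BatchNorm layer in this notebook.

```python
import math


def batch_norm_1d(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a mini-batch, then scale and shift."""
    mean = sum(xs) / len(xs)
    # Biased variance (divide by N), matching the batch_norm function above.
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    x_hat = [(x - mean) / math.sqrt(var + eps) for x in xs]
    return [gamma * v + beta for v in x_hat], mean, var


ys, mean, var = batch_norm_1d([1.0, 2.0, 3.0, 4.0])

# Moving-average update with the same convention as batch_norm:
# new = (1 - momentum) * old + momentum * batch_statistic
momentum = 0.1
moving_mean = (1.0 - momentum) * 0.0 + momentum * mean
moving_var = (1.0 - momentum) * 1.0 + momentum * var

print("normalized:", [round(y, 4) for y in ys])
print("batch mean/var:", mean, var)
print("moving mean/var:", moving_mean, moving_var)
```

The normalized values have zero mean and (up to `eps`) unit variance, while the moving statistics shift only a tenth of the way toward the batch statistics.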

In [None]:
class BatchNorm(nn.Module):
    # num_features: the number of outputs for a fully connected layer or the
    # number of output channels for a convolutional layer. num_dims: 2 for a
    # fully connected layer and 4 for a convolutional layer
    def __init__(self, num_features, num_dims):
        super().__init__()
        if num_dims == 2:
            shape = (1, num_features)
        else:
            shape = (1, num_features, 1, 1)
        # The scale parameter and the shift parameter (model parameters) are
        # initialized to 1 and 0, respectively
        self.gamma = nn.Parameter(torch.ones(shape))
        self.beta = nn.Parameter(torch.zeros(shape))
        # The variables that are not model parameters are initialized to 0 and
        # 1
        self.moving_mean = torch.zeros(shape)
        self.moving_var = torch.ones(shape)

    def forward(self, X):
        # If X is not on the main memory, copy moving_mean and moving_var to
        # the device where X is located
        if self.moving_mean.device != X.device:
            self.moving_mean = self.moving_mean.to(X.device)
            self.moving_var = self.moving_var.to(X.device)
        # Save the updated moving_mean and moving_var
        Y, self.moving_mean, self.moving_var = batch_norm(
            X, self.gamma, self.beta, self.moving_mean,
            self.moving_var, eps=1e-5, momentum=0.1)
        return Y

In [None]:
class BNLeNetScratch(d2l.Classifier):
    def __init__(self, lr=0.1, num_classes=10):
        super().__init__()
        self.save_hyperparameters()
        self.net = nn.Sequential(
            nn.LazyConv2d(6, kernel_size=5), BatchNorm(6, num_dims=4),
            nn.Sigmoid(), nn.AvgPool2d(kernel_size=2, stride=2),
            nn.LazyConv2d(16, kernel_size=5), BatchNorm(16, num_dims=4),
            nn.Sigmoid(), nn.AvgPool2d(kernel_size=2, stride=2),
            nn.Flatten(), nn.LazyLinear(120),
            BatchNorm(120, num_dims=2), nn.Sigmoid(), nn.LazyLinear(84),
            BatchNorm(84, num_dims=2), nn.Sigmoid(),
            nn.LazyLinear(num_classes))

In [None]:
trainer = d2l.Trainer(max_epochs=10, num_gpus=1)
data = d2l.FashionMNIST(batch_size=128)
model = BNLeNetScratch(lr=0.1)
model.apply_init([next(iter(data.get_dataloader(True)))[0]], d2l.init_cnn)
trainer.fit(model, data)

In [None]:
model.net[1].gamma.reshape((-1,)), model.net[1].beta.reshape((-1,))

In [None]:
class BNLeNet(d2l.Classifier):
    def __init__(self, lr=0.1, num_classes=10):
        super().__init__()
        self.save_hyperparameters()
        self.net = nn.Sequential(
            nn.LazyConv2d(6, kernel_size=5), nn.LazyBatchNorm2d(),
            nn.Sigmoid(), nn.AvgPool2d(kernel_size=2, stride=2),
            nn.LazyConv2d(16, kernel_size=5), nn.LazyBatchNorm2d(),
            nn.Sigmoid(), nn.AvgPool2d(kernel_size=2, stride=2),
            nn.Flatten(), nn.LazyLinear(120), nn.LazyBatchNorm1d(),
            nn.Sigmoid(), nn.LazyLinear(84), nn.LazyBatchNorm1d(),
            nn.Sigmoid(), nn.LazyLinear(num_classes))

In [None]:
trainer = d2l.Trainer(max_epochs=10, num_gpus=1)
data = d2l.FashionMNIST(batch_size=128)
model = BNLeNet(lr=0.1)
model.apply_init([next(iter(data.get_dataloader(True)))[0]], d2l.init_cnn)
trainer.fit(model, data)

In [None]:
from torch.nn import functional as F  # F.relu is used in forward() below

class Residual(nn.Module):
    """The Residual block of ResNet models."""
    def __init__(self, num_channels, use_1x1conv=False, strides=1):
        super().__init__()
        self.conv1 = nn.LazyConv2d(num_channels, kernel_size=3, padding=1,
                                   stride=strides)
        self.conv2 = nn.LazyConv2d(num_channels, kernel_size=3, padding=1)
        if use_1x1conv:
            # 1x1 convolution so the shortcut matches the main path's
            # channel count and stride
            self.conv3 = nn.LazyConv2d(num_channels, kernel_size=1,
                                       stride=strides)
        else:
            self.conv3 = None
        self.bn1 = nn.LazyBatchNorm2d()
        self.bn2 = nn.LazyBatchNorm2d()

    def forward(self, X):
        Y = F.relu(self.bn1(self.conv1(X)))
        Y = self.bn2(self.conv2(Y))
        if self.conv3 is not None:
            X = self.conv3(X)
        Y += X  # add the (possibly projected) input before the final ReLU
        return F.relu(Y)

In [None]:
blk = Residual(3)
X = torch.randn(4, 3, 6, 6)
blk(X).shape

In [None]:
blk = Residual(6, use_1x1conv=True, strides=2)
blk(X).shape
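
The two shape checks above can be reproduced by hand with the usual convolution output-size formula, floor((n + 2p - k) / s) + 1. The sketch below applies it to the 6x6 input used in those cells; `conv_out` is a hypothetical helper written for this illustration, not part of `d2l` or PyTorch.

```python
def conv_out(n, kernel, stride=1, padding=0):
    """Spatial output size of a convolution: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - kernel) // stride + 1

n = 6  # height/width of X in the cells above

# Residual(3): both 3x3 convolutions use padding=1 and stride=1
same = conv_out(conv_out(n, 3, 1, 1), 3, 1, 1)

# Residual(6, use_1x1conv=True, strides=2): main path and 1x1 shortcut
main = conv_out(conv_out(n, 3, 2, 1), 3, 1, 1)
shortcut = conv_out(n, 1, 2, 0)

print(same, main, shortcut)  # 6 3 3
```

The main path and the shortcut must agree in spatial size (and channel count) for the addition `Y += X` to be valid, which is why the strided block needs the 1x1 convolution.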

In [None]:
class ResNet(d2l.Classifier):
    def b1(self):
        return nn.Sequential(
            nn.LazyConv2d(64, kernel_size=7, stride=2, padding=3),
            nn.LazyBatchNorm2d(), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1))

In [None]:
@d2l.add_to_class(ResNet)
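def block(self, num_residuals, num_channels, first_block=False):
    blk = []
    for i in range(num_residuals):
        # Every stage except the first opens with a stride-2 block that
        # halves height and width; its 1x1 conv reshapes the shortcut to match
        if i == 0 and not first_block:
            blk.append(Residual(num_channels, use_1x1conv=True, strides=2))
        else:
            blk.append(Residual(num_channels))
    return nn.Sequential(*blk)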

In [None]:
@d2l.add_to_class(ResNet)
def __init__(self, arch, lr=0.1, num_classes=10):
    super(ResNet, self).__init__()
    self.save_hyperparameters()
    self.net = nn.Sequential(self.b1())
    for i, b in enumerate(arch):
        self.net.add_module(f'b{i+2}', self.block(*b, first_block=(i==0)))
    self.net.add_module('last', nn.Sequential(
        nn.AdaptiveAvgPool2d((1, 1)), nn.Flatten(),
        nn.LazyLinear(num_classes)))
    self.net.apply(d2l.init_cnn)

In [None]:
class ResNet18(ResNet):
    def __init__(self, lr=0.1, num_classes=10):
        super().__init__(((2, 64), (2, 128), (2, 256), (2, 512)),
                       lr, num_classes)

ResNet18().layer_summary((1, 1, 96, 96))
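
The name ResNet-18 can be checked against the `arch` tuple passed above. The short sketch below counts the weighted layers under the usual convention, which is assumed here: the stem convolution, two 3x3 convolutions per residual block, and the final fully connected layer, with the 1x1 shortcut convolutions left out.

```python
# (num_residuals, num_channels) per stage, as passed to ResNet.__init__ above
arch = ((2, 64), (2, 128), (2, 256), (2, 512))

stem_convs = 1                                   # the 7x7 convolution in b1
block_convs = sum(2 * num_residuals for num_residuals, _ in arch)
fc_layers = 1                                    # the final LazyLinear

total = stem_convs + block_convs + fc_layers
print(total)  # 18
```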

In [None]:
model = ResNet18(lr=0.01)
trainer = d2l.Trainer(max_epochs=10, num_gpus=1)
data = d2l.FashionMNIST(batch_size=128, resize=(96, 96))
model.apply_init([next(iter(data.get_dataloader(True)))[0]], d2l.init_cnn)
trainer.fit(model, data)

In [None]:
class ResNeXtBlock(nn.Module):
    """The ResNeXt block."""
    def __init__(self, num_channels, groups, bot_mul, use_1x1conv=False,
                 strides=1):
        super().__init__()
        bot_channels = int(round(num_channels * bot_mul))
        self.conv1 = nn.LazyConv2d(bot_channels, kernel_size=1, stride=1)
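        # Note: nn.LazyConv2d receives bot_channels // groups as its group
        # count, so the `groups` argument here is the channels per group
        self.conv2 = nn.LazyConv2d(bot_channels, kernel_size=3,
                                   stride=strides, padding=1,
                                   groups=bot_channels//groups)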
        self.conv3 = nn.LazyConv2d(num_channels, kernel_size=1, stride=1)
        self.bn1 = nn.LazyBatchNorm2d()
        self.bn2 = nn.LazyBatchNorm2d()
        self.bn3 = nn.LazyBatchNorm2d()
        if use_1x1conv:
            self.conv4 = nn.LazyConv2d(num_channels, kernel_size=1,
                                       stride=strides)
            self.bn4 = nn.LazyBatchNorm2d()
        else:
            self.conv4 = None

    def forward(self, X):
        Y = F.relu(self.bn1(self.conv1(X)))
        Y = F.relu(self.bn2(self.conv2(Y)))
        Y = self.bn3(self.conv3(Y))
        if self.conv4:
            X = self.bn4(self.conv4(X))
        return F.relu(Y + X)

In [None]:
blk = ResNeXtBlock(32, 16, 1)
X = torch.randn(4, 32, 96, 96)
blk(X).shape
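
To see what the grouped 3x3 convolution in `ResNeXtBlock(32, 16, 1)` saves, the sketch below counts its weights with and without grouping. `grouped_conv_weights` is a hypothetical helper for this illustration; bias terms are ignored.

```python
def grouped_conv_weights(c_in, c_out, kernel, groups):
    """Weight count (bias excluded) of a grouped kernel x kernel convolution."""
    assert c_in % groups == 0 and c_out % groups == 0
    # Each output channel only sees c_in // groups input channels
    return (c_in // groups) * c_out * kernel * kernel

num_channels, groups_arg, bot_mul = 32, 16, 1    # arguments used in the cell above
bot_channels = int(round(num_channels * bot_mul))
conv_groups = bot_channels // groups_arg         # value the block passes to nn.LazyConv2d

dense = grouped_conv_weights(bot_channels, bot_channels, 3, 1)
grouped = grouped_conv_weights(bot_channels, bot_channels, 3, conv_groups)
print(conv_groups, dense, grouped)  # 2 9216 4608
```

Splitting the channels into `g` groups divides the weight count of that layer by `g`, here by 2.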

In [None]:
def conv_block(num_channels):
    return nn.Sequential(
        nn.LazyBatchNorm2d(), nn.ReLU(),
        nn.LazyConv2d(num_channels, kernel_size=3, padding=1))

In [None]:
class DenseBlock(nn.Module):
    def __init__(self, num_convs, num_channels):
        super(DenseBlock, self).__init__()
        layer = []
        for i in range(num_convs):
            layer.append(conv_block(num_channels))
        self.net = nn.Sequential(*layer)

    def forward(self, X):
        for blk in self.net:
            Y = blk(X)
            # Concatenate input and output of each block along the channels
            X = torch.cat((X, Y), dim=1)
        return X

In [None]:
blk = DenseBlock(2, 10)
X = torch.randn(4, 3, 8, 8)
Y = blk(X)
Y.shape

In [None]:
def transition_block(num_channels):
    return nn.Sequential(
        nn.LazyBatchNorm2d(), nn.ReLU(),
        nn.LazyConv2d(num_channels, kernel_size=1),
        nn.AvgPool2d(kernel_size=2, stride=2))

In [None]:
blk = transition_block(10)
blk(Y).shape

In [None]:
class DenseNet(d2l.Classifier):
    def b1(self):
        return nn.Sequential(
            nn.LazyConv2d(64, kernel_size=7, stride=2, padding=3),
            nn.LazyBatchNorm2d(), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1))

In [None]:
@d2l.add_to_class(DenseNet)
def __init__(self, num_channels=64, growth_rate=32, arch=(4, 4, 4, 4),
             lr=0.1, num_classes=10):
    super(DenseNet, self).__init__()
    self.save_hyperparameters()
    self.net = nn.Sequential(self.b1())
    for i, num_convs in enumerate(arch):
        self.net.add_module(f'dense_blk{i+1}', DenseBlock(num_convs,
                                                          growth_rate))
        # The number of output channels in the previous dense block
        num_channels += num_convs * growth_rate
        # A transition layer that halves the number of channels is added
        # between the dense blocks
        if i != len(arch) - 1:
            num_channels //= 2
            self.net.add_module(f'tran_blk{i+1}', transition_block(
                num_channels))
    self.net.add_module('last', nn.Sequential(
        nn.LazyBatchNorm2d(), nn.ReLU(),
        nn.AdaptiveAvgPool2d((1, 1)), nn.Flatten(),
        nn.LazyLinear(num_classes)))
    self.net.apply(d2l.init_cnn)
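
The channel bookkeeping in `DenseNet.__init__` can be traced without building the network. The sketch below mirrors the same arithmetic (each dense block adds `num_convs * growth_rate` channels, each transition halves them); `densenet_channels` is a hypothetical helper, and its defaults are copied from the constructor above.

```python
def densenet_channels(num_channels=64, growth_rate=32, arch=(4, 4, 4, 4)):
    """Return the channel count after every dense and transition block."""
    trace = []
    for i, num_convs in enumerate(arch):
        # Each conv_block concatenates growth_rate new channels
        num_channels += num_convs * growth_rate
        trace.append(('dense', num_channels))
        # No transition after the last dense block
        if i != len(arch) - 1:
            num_channels //= 2
            trace.append(('transition', num_channels))
    return trace

trace = densenet_channels()
for name, channels in trace:
    print(name, channels)
print(trace[-1])  # ('dense', 248)
```

The same helper reproduces the earlier `DenseBlock(2, 10)` check on a 3-channel input: `densenet_channels(3, 10, (2,))` gives 3 + 2 * 10 = 23 channels.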

In [None]:
model = DenseNet(lr=0.01)
trainer = d2l.Trainer(max_epochs=10, num_gpus=1)
data = d2l.FashionMNIST(batch_size=128, resize=(96, 96))
trainer.fit(model, data)

In [None]:
import torch
from torch import nn
from torch.nn import functional as F
from d2l import torch as d2l

In [None]:
class AnyNet(d2l.Classifier):
    def stem(self, num_channels):
        return nn.Sequential(
            nn.LazyConv2d(num_channels, kernel_size=3, stride=2, padding=1),
            nn.LazyBatchNorm2d(), nn.ReLU())

In [None]:
@d2l.add_to_class(AnyNet)
def stage(self, depth, num_channels, groups, bot_mul):
    blk = []
    for i in range(depth):
        if i == 0:
            blk.append(d2l.ResNeXtBlock(num_channels, groups, bot_mul,
                use_1x1conv=True, strides=2))
        else:
            blk.append(d2l.ResNeXtBlock(num_channels, groups, bot_mul))
    return nn.Sequential(*blk)

In [None]:
@d2l.add_to_class(AnyNet)
def __init__(self, arch, stem_channels, lr=0.1, num_classes=10):
    super(AnyNet, self).__init__()
    self.save_hyperparameters()
    self.net = nn.Sequential(self.stem(stem_channels))
    for i, s in enumerate(arch):
        self.net.add_module(f'stage{i+1}', self.stage(*s))
    self.net.add_module('head', nn.Sequential(
        nn.AdaptiveAvgPool2d((1, 1)), nn.Flatten(),
        nn.LazyLinear(num_classes)))
    self.net.apply(d2l.init_cnn)

In [None]:
class RegNetX32(AnyNet):
    def __init__(self, lr=0.1, num_classes=10):
        stem_channels, groups, bot_mul = 32, 16, 1
        depths, channels = (4, 6), (32, 80)
        super().__init__(
            ((depths[0], channels[0], groups, bot_mul),
             (depths[1], channels[1], groups, bot_mul)),
            stem_channels, lr, num_classes)

In [None]:
RegNetX32().layer_summary((1, 1, 96, 96))
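
The shapes reported by `layer_summary` follow from the configuration in `RegNetX32`: the stem and the first block of every stage use stride 2 with a 3x3 kernel and padding 1, so each halves the resolution (rounding up). The sketch below traces channels, resolution, and the group count that ends up in the grouped convolution; `regnet_trace` and `halve` are hypothetical helpers for this illustration.

```python
def halve(n):
    """Output size of a 3x3, stride-2, padding-1 convolution."""
    return (n + 1) // 2

def regnet_trace(size, stem_channels, arch):
    """Return (name, channels, resolution, conv groups) for the stem and each stage."""
    size = halve(size)
    rows = [('stem', stem_channels, size, None)]
    for i, (depth, channels, groups_arg, bot_mul) in enumerate(arch):
        size = halve(size)  # only the first block of a stage has stride 2
        bot_channels = int(round(channels * bot_mul))
        rows.append((f'stage{i+1}', channels, size, bot_channels // groups_arg))
    return rows

# (depth, channels, groups, bot_mul) per stage, as in RegNetX32 above
arch = ((4, 32, 16, 1), (6, 80, 16, 1))
rows = regnet_trace(96, 32, arch)
for row in rows:
    print(row)
```

For the 96x96 input this gives 48x48 after the stem, 24x24 after stage 1, and 12x12 after stage 2, with 32, 32, and 80 channels respectively.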

In [None]:
model = RegNetX32(lr=0.05)
trainer = d2l.Trainer(max_epochs=10, num_gpus=1)
data = d2l.FashionMNIST(batch_size=128, resize=(96, 96))
trainer.fit(model, data)