# Layers and Modules
:label:`sec_model_construction`

- When we first introduced neural networks, we focused on linear models with a single output.
  - The entire model consists of just a single neuron.
  - A single neuron:
    - Takes some set of inputs.
    - Generates a corresponding scalar output.
    - Has a set of associated parameters that can be updated to optimize some objective function.
- When dealing with networks with multiple outputs, we leveraged vectorized arithmetic to characterize an entire layer of neurons.
  - Like individual neurons, layers:
    - Take a set of inputs.
    - Generate corresponding outputs.
    - Are described by a set of tunable parameters.
- In softmax regression, a single layer was itself the model.
- In multi-layer perceptrons (MLPs), the model retains this basic structure.

- In MLPs, both the entire model and its constituent layers share a similar structure:
  - The entire model:
    - Takes in raw inputs (features).
    - Generates outputs (predictions).
    - Possesses parameters (combined from all constituent layers).
  - Each individual layer:
    - Ingests inputs from the previous layer.
    - Generates outputs for the subsequent layer.
    - Has tunable parameters updated according to the signal flowing backward.

- Neurons, layers, and models provide useful abstractions, but we often need components that are larger than an individual layer yet smaller than the entire model:
  - Example: The ResNet-152 architecture (widely used in computer vision) consists of hundreds of layers.
    - Layers form repeating patterns of *groups of layers*.
    - Implementing such networks one layer at a time is tedious.
    - The ResNet architecture won the 2015 ImageNet and COCO competitions for recognition and detection :cite:`He.Zhang.Ren.ea.2016`.
    - Similar architectures are common in natural language processing and speech processing.

- To implement complex networks efficiently, we introduce the concept of a neural network *module*:
  - A module can describe:
    - A single layer.
    - A component consisting of multiple layers.
    - The entire model itself.
  - Benefits of the module abstraction:
    - Modules can be combined into larger structures recursively.
    - Enables writing compact yet powerful code for complex networks.
    - Illustrated in :numref:`fig_blocks`.

![Multiple layers are combined into modules, forming repeating patterns of larger models.](../img/blocks.svg)
:label:`fig_blocks`

- From a programming perspective, a module is represented by a *class*:
  - Any subclass must define:
    - A forward propagation method to transform input into output.
    - Storage for any necessary parameters (some modules may not require parameters).
  - The module must also support backpropagation for gradient calculations.
  - Automatic differentiation (introduced in :numref:`sec_autograd`) simplifies this process:
    - We only need to define parameters and the forward propagation method.
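
- The two requirements above can be illustrated with a minimal sketch. `Double` and `NoForward` are hypothetical classes invented for this illustration: the first defines only a forward propagation method and stores no parameters, the second omits the method altogether.

```python
import torch
from torch import nn


class Double(nn.Module):
    """Hypothetical parameter-free module: only forward is defined."""

    def forward(self, X):
        return 2 * X


class NoForward(nn.Module):
    """Hypothetical subclass that forgets to define forward."""


X = torch.ones(3, requires_grad=True)
Double()(X).sum().backward()   # autograd derives the backward pass for us
grad = X.grad                  # d(sum(2x))/dx = 2 for every entry

# A module without parameters simply reports an empty parameter list
num_params = len(list(Double().parameters()))

# Calling a module that lacks forward raises NotImplementedError
try:
    NoForward()(torch.ones(3))
    missing_forward_raises = False
except NotImplementedError:
    missing_forward_raises = True
```

- No backward method was written anywhere in this sketch; the gradient comes entirely from automatic differentiation.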


In [1]:
import torch
from torch import nn
from torch.nn import functional as F

- To begin, we revisit the code that we used to implement MLPs (:numref:`sec_mlp`).
- The following code generates a neural network with:
  - One fully connected hidden layer:
    - Contains **256 units**.
    - Uses **ReLU activation**.
  - A fully connected output layer:
    - Contains **10 units**.
    - Has **no activation function**.

In [2]:
net = nn.Sequential(nn.LazyLinear(256), nn.ReLU(), nn.LazyLinear(10))

X = torch.rand(2, 20)
net(X).shape

torch.Size([2, 10])

- In this example, we constructed our model by instantiating an `nn.Sequential`:
  - Layers are passed as arguments in the order they should be executed.
- **`nn.Sequential` defines a special kind of `Module` in PyTorch**:
  - It maintains an **ordered list** of constituent `Module`s.
  - Each fully connected layer is an instance of the `Linear` class.
    - `Linear` itself is a subclass of `Module`.
- The `forward` propagation method is straightforward:
  - It **chains each module** in the list together.
  - The output of one module is **passed as input** to the next module.
- Until now, we have been invoking models using `net(X)`:
  - This is shorthand for **`net.__call__(X)`**.
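
- The chaining behaviour described above can be checked by hand. This is a sketch under one assumption: it uses `nn.Linear` with explicit input sizes rather than `nn.LazyLinear`, purely to keep the example self-contained.

```python
import torch
from torch import nn

net = nn.Sequential(nn.Linear(20, 256), nn.ReLU(), nn.Linear(256, 10))
X = torch.rand(2, 20)

# Chain the constituent modules manually, in the order they were passed
H = X
for layer in net:
    H = layer(H)

manual_matches = torch.equal(H, net(X))            # same result as net(X)
call_matches = torch.equal(net(X), net.__call__(X))  # net(X) is net.__call__(X)
```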


## **A Custom Module**

- The easiest way to understand how a module works is to implement one ourselves.
- Before doing so, we summarize the basic functionality that each module must provide:
  1. **Ingest input data** as arguments to its forward propagation method.
  2. **Generate an output** by returning a value from the forward propagation method.
     - The output may have a **different shape** from the input.
     - Example: In our previous model, the first fully connected layer:
       - Takes an input of arbitrary dimension.
       - Returns an output of dimension **256**.
  3. **Calculate the gradient** of its output with respect to its input.
     - This is accessible via its **backpropagation method**.
     - Typically, this happens **automatically**.
  4. **Store and provide access** to necessary parameters for forward propagation computation.
  5. **Initialize model parameters** as needed.
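
- The five items above can be observed on a single built-in layer. This is a minimal sketch that assumes `nn.Linear(20, 256)` as a stand-in for the first fully connected layer of the earlier model.

```python
import torch
from torch import nn

layer = nn.Linear(20, 256)               # (5) parameters are initialized here
X = torch.rand(2, 20, requires_grad=True)

Y = layer(X)                             # (1) ingest input, (2) generate output
out_shape = tuple(Y.shape)               # output shape differs from input shape

Y.sum().backward()                       # (3) gradients computed automatically
input_grad_shape = tuple(X.grad.shape)

# (4) parameters are stored on the module and accessible by name
param_shapes = {name: tuple(p.shape) for name, p in layer.named_parameters()}
```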

- The following snippet implements a **custom module**:
  - Corresponds to an **MLP** with:
    - One hidden layer of **256 hidden units**.
    - A **10-dimensional output layer**.
  - The `MLP` class **inherits** from the base module class.
  - We rely heavily on **parent class methods**:
    - We only need to define:
      - The **constructor** (`__init__` method in Python).
      - The **forward propagation** method.


In [3]:
class MLP(nn.Module):
    def __init__(self):
        # Call the constructor of the parent class nn.Module to perform
        # the necessary initialization
        super().__init__()
        self.hidden = nn.LazyLinear(256)
        self.out = nn.LazyLinear(10)

    # Define the forward propagation of the model, that is, how to return the
    # required model output based on the input X
    def forward(self, X):
        return self.out(F.relu(self.hidden(X)))

- Let's first focus on the **forward propagation method**:
  - Takes `X` as input.
  - Calculates the **hidden representation** with the activation function applied.
  - Outputs **logits**.

- In this `MLP` implementation:
  - Both layers are **instance variables**.
  - Why is this reasonable?
    - Imagine two MLP instances: `net1` and `net2`.
    - If trained on different data, they should represent **different learned models**.

- **Instantiation of MLP's layers** occurs in the constructor:
  - These layers are subsequently **invoked in each call** to the forward propagation method.

- **Key implementation details**:
  1. The customized `__init__` method:
     - Calls the parent class's `__init__` method via **`super().__init__()`**.
     - Avoids repeating boilerplate code applicable to most modules.
  2. Two fully connected layers are instantiated and assigned to:
     - `self.hidden` → Represents the **hidden layer**.
     - `self.out` → Represents the **output layer**.
  3. No need to implement:
     - **Backpropagation method**.
     - **Parameter initialization**.
     - The system **automatically handles these** unless we define a custom layer.

- Let's try this out!


In [4]:
net = MLP()
net(X).shape

torch.Size([2, 10])
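
- To see why the layers are instance variables, the sketch below creates two `MLP` instances and checks that they hold independent parameters. It assumes `nn.Linear` with a fixed input size of 20 in place of `nn.LazyLinear`, so that the parameters exist before the first forward pass.

```python
import torch
from torch import nn
from torch.nn import functional as F


class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        # Assumption for this sketch: explicit input sizes instead of
        # nn.LazyLinear, so parameters are materialized immediately
        self.hidden = nn.Linear(20, 256)
        self.out = nn.Linear(256, 10)

    def forward(self, X):
        return self.out(F.relu(self.hidden(X)))


net1, net2 = MLP(), MLP()

# Attribute names become the prefixes of the parameter names
param_names = [name for name, _ in net1.named_parameters()]

# Overwriting net1's hidden weights in place leaves net2 untouched
before = net2.hidden.weight.clone()
with torch.no_grad():
    net1.hidden.weight.zero_()
net2_unchanged = torch.equal(net2.hidden.weight, before)
```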

- A key virtue of the **module abstraction** is its **versatility**:
  - We can **subclass a module** to create:
    - **Layers** (e.g., fully connected layer class).
    - **Entire models** (e.g., `MLP` class).
    - **Components of intermediate complexity**.
  - This versatility will be **exploited throughout the coming chapters**, such as:
    - **Convolutional neural networks** (CNNs).

## **The Sequential Module**
:label:`subsec_model-construction-sequential`

- We now take a **closer look** at how the `Sequential` class works.
- **`Sequential` is designed to daisy-chain other modules together**.

- To build a simplified version (`MySequential`), we need to define two key methods:
  1. **A method to append modules** one by one into a list.
  2. **A forward propagation method**:
     - Passes an input through the chain of modules.
     - Follows the **same order as they were appended**.

- The following `MySequential` class provides the **same core functionality** as the default `Sequential` class.


In [5]:
class MySequential(nn.Module):
    def __init__(self, *args):
        super().__init__()
        for idx, module in enumerate(args):
            self.add_module(str(idx), module)

    def forward(self, X):
        for module in self.children():
            X = module(X)
        return X

- In the `__init__` method:
  - Every module is added by calling the **`add_module` method**.
  - These added modules can later be accessed via the **`children` method**.

- **Why is this important?**
  - The system keeps track of the **added modules**.
  - It can then find each submodule's **parameters**, for example when they need to be initialized.
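
- The effect of registration can be seen by comparing it with a plain Python list. `PlainListHolder` and `RegisteredHolder` are hypothetical classes written only for this comparison.

```python
import torch
from torch import nn


class PlainListHolder(nn.Module):
    """Hypothetical: keeps submodules in an ordinary Python list."""

    def __init__(self, *layers):
        super().__init__()
        self.layers = list(layers)      # not registered with the module


class RegisteredHolder(nn.Module):
    """Hypothetical: registers each submodule via add_module."""

    def __init__(self, *layers):
        super().__init__()
        for idx, layer in enumerate(layers):
            self.add_module(str(idx), layer)


plain = PlainListHolder(nn.Linear(4, 3))
registered = RegisteredHolder(nn.Linear(4, 3))

plain_count = len(list(plain.parameters()))            # nothing is found
registered_count = len(list(registered.parameters()))  # weight and bias
```

- With the plain list, an optimizer built from `plain.parameters()` would have nothing to update, which is the kind of bookkeeping failure that registration avoids.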


- When `MySequential`'s **forward propagation method** is invoked:
  - The **added modules** are executed **in the order in which they were added**.

- We can now **reimplement an MLP** using our `MySequential` class.


In [6]:
net = MySequential(nn.LazyLinear(256), nn.ReLU(), nn.LazyLinear(10))
net(X).shape

torch.Size([2, 10])

- Using `MySequential` is **identical** to the code we previously wrote with the `Sequential` class (as described in :numref:`sec_mlp`).
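
- As a rough check of this equivalence, the sketch below wraps the same layer objects in both containers and compares the results. It assumes `nn.Linear` with explicit sizes so that both containers share already-initialized parameters.

```python
import torch
from torch import nn


class MySequential(nn.Module):
    def __init__(self, *args):
        super().__init__()
        for idx, module in enumerate(args):
            self.add_module(str(idx), module)

    def forward(self, X):
        for module in self.children():
            X = module(X)
        return X


# The same layer objects are handed to both containers
layers = [nn.Linear(20, 256), nn.ReLU(), nn.Linear(256, 10)]
mine = MySequential(*layers)
builtin = nn.Sequential(*layers)

X = torch.rand(2, 20)
same_output = torch.equal(mine(X), builtin(X))
same_names = ([n for n, _ in mine.named_parameters()]
              == [n for n, _ in builtin.named_parameters()])
```

- The comparison covers only construction and forward propagation. Conveniences of the built-in class such as indexing with `net[0]` are not implemented by this simplified version.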

## **Executing Code in the Forward Propagation Method**

- The `Sequential` class makes **model construction easy**:
  - Allows assembling architectures **without defining a custom class**.
- However, **not all architectures** are simple daisy chains:
  - When greater **flexibility** is required, we define **custom blocks**.

- **Why define custom blocks?**
  1. **Executing Python control flow** within the forward propagation method.
  2. **Performing arbitrary mathematical operations** beyond predefined neural network layers.

- **Network operations so far**:
  - Have **acted upon network activations** and **trainable parameters**.
  - But sometimes, we need **constant parameters** that:
    - Are **not derived from previous layers**.
    - Are **not updated during optimization**.

- **Example: A layer calculating** $f(\mathbf{x},\mathbf{w}) = c \cdot \mathbf{w}^\top \mathbf{x}$
  - Where:
    - **$\mathbf{x}$** is the input.
    - **$\mathbf{w}$** is a trainable parameter.
    - **$c$** is a fixed constant **(not updated during optimization)**.
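
- The function above can be written directly as a small module. `ScaledDot` is a hypothetical name, and the values of $c$, $\mathbf{w}$, and $\mathbf{x}$ are chosen only to make the arithmetic easy to follow.

```python
import torch
from torch import nn


class ScaledDot(nn.Module):
    """Hypothetical module computing f(x, w) = c * w^T x."""

    def __init__(self, dim, c=2.0):
        super().__init__()
        self.w = nn.Parameter(torch.ones(dim))  # trainable parameter
        self.c = c                              # plain constant, never updated

    def forward(self, x):
        return self.c * torch.dot(self.w, x)


layer = ScaledDot(3)
x = torch.tensor([1.0, 2.0, 3.0])

y = layer(x)          # 2 * (1*1 + 1*2 + 1*3) = 12
y.backward()
w_grad = layer.w.grad  # df/dw = c * x = [2, 4, 6]

param_names = [name for name, _ in layer.named_parameters()]
```

- Only `w` appears among the parameters, so an optimizer would update `w` and leave `c` alone.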

- We implement this idea in the `FixedHiddenMLP` class.


In [7]:
class FixedHiddenMLP(nn.Module):
    def __init__(self):
        super().__init__()
        # Random weights that receive no gradient updates and therefore stay
        # constant during training
        self.rand_weight = torch.rand((20, 20))
        self.linear = nn.LazyLinear(20)

    def forward(self, X):
        X = self.linear(X)
        X = F.relu(X @ self.rand_weight + 1)
        # Reuse the fully connected layer. This is equivalent to sharing
        # parameters between two fully connected layers
        X = self.linear(X)
        # Control flow
        while X.abs().sum() > 1:
            X /= 2
        return X.sum()

- In this model, we implement a **hidden layer with fixed weights**:
  - `self.rand_weight` is initialized **randomly** at instantiation.
  - These weights **remain constant** thereafter.
  - This weight **is not a model parameter** and is **never updated by backpropagation**.
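- The network structure:
  1. The input first passes through the fully connected layer `self.linear`.
  2. The result is multiplied by the fixed `self.rand_weight`, shifted by 1, and passed through ReLU.
  3. The **output of the fixed layer** is passed through the **same fully connected layer** again, so both fully connected steps share parameters.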

- **Unusual operation before returning the output**:
  1. **A while-loop runs**, checking whether the **$\ell_1$ norm** of the output is **greater than 1**.
  2. While the condition holds:
     - The output vector is **divided by 2**, repeating until its $\ell_1$ norm is at most 1.
  3. The **sum of the entries in `X`** is returned.
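
- The halving loop can be traced on a small vector. The input values here are made up for illustration and do not come from the model above.

```python
import torch

X = torch.tensor([3.0, -2.0])   # l1 norm is |3| + |-2| = 5
steps = 0

# Halve until the l1 norm is no longer greater than 1:
# 5 -> 2.5 -> 1.25 -> 0.625
while X.abs().sum() > 1:
    X = X / 2
    steps += 1

final_norm = X.abs().sum().item()
result = X.sum().item()         # 0.375 + (-0.25)
```

- The number of loop iterations depends on the data, which is exactly the kind of control flow that a fixed chain of layers cannot express.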

- **Key observations**:
  - To our knowledge, no standard neural network **performs this type of operation**.
  - This particular operation **may not be useful** in real-world tasks.
  - **The purpose** of this implementation:
    - To demonstrate how to **integrate arbitrary code** into a neural network's computation flow.


In [8]:
net = FixedHiddenMLP()
net(X)

tensor(-0.3836, grad_fn=<SumBackward0>)
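
- One way to confirm that the fixed weights are not model parameters is to list what the module registers. The sketch below is a reduced variant, not the class above: `FixedWeightBlock` is a hypothetical name, the while-loop is omitted, and the constant is stored with `register_buffer` instead of as a plain tensor attribute. A buffer is still excluded from the parameters, but it is saved in the state dict and follows the module across devices.

```python
import torch
from torch import nn
from torch.nn import functional as F


class FixedWeightBlock(nn.Module):
    """Hypothetical reduced variant that stores the constant as a buffer."""

    def __init__(self):
        super().__init__()
        # Swapped-in technique: register_buffer instead of a plain attribute
        self.register_buffer("rand_weight", torch.rand(20, 20))
        self.linear = nn.Linear(20, 20)

    def forward(self, X):
        X = self.linear(X)
        X = F.relu(X @ self.rand_weight + 1)
        return self.linear(X).sum()     # same layer reused: shared parameters


net = FixedWeightBlock()
net(torch.rand(2, 20)).backward()

param_names = [name for name, _ in net.named_parameters()]
buffer_names = [name for name, _ in net.named_buffers()]
fixed_has_grad = net.rand_weight.grad is not None
in_state_dict = "rand_weight" in net.state_dict()
```

- Although `self.linear` is applied twice, it contributes only one weight and one bias to the parameter list, which reflects the parameter sharing.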

- We can **mix and match various ways** of assembling modules together.
- In the following example, we **nest modules** in creative ways.


In [9]:
class NestMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.LazyLinear(64), nn.ReLU(),
                                 nn.LazyLinear(32), nn.ReLU())
        self.linear = nn.LazyLinear(16)

    def forward(self, X):
        return self.linear(self.net(X))

chimera = nn.Sequential(NestMLP(), nn.LazyLinear(20), FixedHiddenMLP())
chimera(X)

tensor(0.0679, grad_fn=<SumBackward0>)
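
- Nesting is reflected in how submodules and parameters are named. The sketch below uses a smaller nested model than `chimera`, built from `nn.Linear` layers with sizes picked only for illustration.

```python
import torch
from torch import nn

# An inner module nested inside an outer Sequential
block = nn.Sequential(nn.Linear(20, 64), nn.ReLU())
model = nn.Sequential(block, nn.Linear(64, 10))

# Direct children of the outer module
child_names = [name for name, _ in model.named_children()]

# Parameter names encode the path through the nesting
param_names = [name for name, _ in model.named_parameters()]

out_shape = tuple(model(torch.rand(2, 20)).shape)
```

- The name `0.0.weight` reads as: first child of the outer module, then first child of that inner module, then its weight.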

## **Summary**

- **Modules and Layers**:
  - **Individual layers** can be **modules**.
  - **Many layers** can comprise a **module**.
  - **Many modules** can comprise a **module**.

- **Capabilities of a Module**:
  - A module **can contain code** beyond standard layers.
  - Modules **handle housekeeping tasks**, including:
    - **Parameter initialization**.
    - **Backpropagation**.

- **Sequential Module**:
  - Handles **sequential concatenation** of layers and modules.
  - Enables **building complex architectures** with minimal effort.
