# Builder's Guide 

To get you this far this fast, we called upon the libraries, but skipped over more advanced details about how they work.

These insights will move you from end user to power user, giving you the tools needed to reap the benefits of a mature deep learning library while retaining the flexibility to implement more complex models, including those you invent yourself! 

In [None]:
import torch
from torch import nn
from torch.nn import functional as F
from d2l import torch as d2l

## Implementation of a Custom Module

Perhaps the easiest way to develop intuition about how a module works is to implement one ourselves. Before we do that, we briefly summarize the basic functionality that each module must provide:
1. __Ingest input__ data as arguments to its forward propagation method.
2. __Generate an output__ by having the forward propagation method return a value. Note that the output may have a different shape from the input. For example, the first fully connected layer in our model above ingests an input of arbitrary dimension but returns an output of dimension 256.
3. __Calculate the gradient__ of its output with respect to its parameters, which can be accessed via its backpropagation method. Typically this happens automatically.
4. __Store and provide access to those parameters__ necessary for executing the forward propagation computation.
5. __Initialize model parameters__ as needed.

In [None]:
class MLP(nn.Module):
    
    def __init__(self):
        # Call the constructor of the parent class nn.Module to perform
        # the necessary initialization
        super().__init__()
        self.hidden = nn.LazyLinear(256)
        self.out = nn.LazyLinear(10)
    
    # Define the forward propagation of the model, that is, how to return the
    # required model output based on the input X
    def forward(self, X):
        return self.out(F.relu(self.hidden(X)))


Note that unless we implement a new layer, we need not worry about the backpropagation method or parameter initialization. The system will generate these methods automatically.

In [None]:
X = torch.rand(2, 20)

net = MLP()
net(X).shape

## The `nn.Sequential` Module

The following code generates a network with one fully connected hidden layer with 256 units and ReLU activation, followed by a fully connected output layer with ten units (no activation function).

In [None]:
X = torch.rand(2, 20)

net = nn.Sequential(nn.LazyLinear(256), nn.ReLU(), nn.LazyLinear(10))
net(X).shape

`nn.Sequential` defines a special kind of `Module`, the class that represents a module in PyTorch. It maintains an ordered list of constituent `Module`s. 

Note that each of the two fully connected layers is an instance of the `Linear` class which is itself a subclass of `Module`.

The forward propagation (`forward`) method is also remarkably simple: it chains each module in the list together, passing the output of each as input to the next. Note that until now, we have been invoking our models via the construction `net(X)` to obtain their outputs. This is actually just shorthand for `net.__call__(X)`.
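
The difference between calling a module and calling its `forward` method directly can be made visible with a forward hook. The sketch below is only an illustration: the layer sizes and the hook name `log_shape` are assumptions, not part of the models above. It shows that `net(X)` goes through `__call__` (which runs registered hooks), while `net.forward(X)` skips that machinery.

```python
import torch
from torch import nn

net = nn.Sequential(nn.Linear(20, 256), nn.ReLU(), nn.Linear(256, 10))
X = torch.rand(2, 20)

calls = []  # records the output shape every time the hook fires

def log_shape(module, inputs, output):
    # Forward hooks receive the module, its positional inputs, and its output
    calls.append(tuple(output.shape))

handle = net.register_forward_hook(log_shape)

out_call = net(X)             # goes through __call__, so the hook fires
out_forward = net.forward(X)  # bypasses __call__, so the hook is skipped
handle.remove()

print(torch.equal(out_call, out_forward))
print(calls)
```

Both calls return the same values, but only the first one is recorded, which is one reason to prefer `net(X)` over `net.forward(X)`.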


But let's take a closer look by designing our own `MySequential` Module. We need to define two key methods:
1. A method for appending modules one by one to a list.
2. A forward propagation method for passing an input through the chain of modules, in the same order as they were appended.

In [None]:
class MySequential(nn.Module):
    
    def __init__(self, *args):
        super().__init__()
        for idx, module in enumerate(args):
            self.add_module(str(idx), module)
            
    def forward(self, X):
        for module in self.children():
            X = module(X)
        return X

In the `__init__` method, we add every module by calling the `add_module` method. These modules can be accessed later through the `children` method. In this way the system knows the added modules, and it will properly initialize each module’s parameters.

When our `MySequential`’s forward propagation method is invoked, each added module is executed in the order in which they were added. 
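
To see why registration matters, the following sketch compares a module that keeps its layers in a plain Python list with one that registers them via `add_module`. The class names `PlainListNet` and `RegisteredNet` and the layer sizes are hypothetical and exist only for this comparison.

```python
import torch
from torch import nn

class PlainListNet(nn.Module):  # hypothetical, for illustration only
    def __init__(self, *layers):
        super().__init__()
        self.layers = list(layers)  # plain list: layers are NOT registered

class RegisteredNet(nn.Module):  # hypothetical, mirrors the add_module pattern
    def __init__(self, *layers):
        super().__init__()
        for idx, layer in enumerate(layers):
            self.add_module(str(idx), layer)

def make_layers():
    return nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2)

plain = PlainListNet(*make_layers())
registered = RegisteredNet(*make_layers())

n_plain = len(list(plain.parameters()))            # nothing is visible
n_registered = len(list(registered.parameters()))  # two weights, two biases
child_names = [name for name, _ in registered.named_children()]
print(n_plain, n_registered, child_names)
```

Only the registered version exposes its parameters, so only that version could be handed to an optimizer, moved to another device, or saved as a whole.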

In [None]:
net = MySequential(nn.LazyLinear(256), nn.ReLU(), nn.LazyLinear(10))
net(X).shape

## Constant parameters 

You may have noticed that until now, all of the operations in our networks have acted upon our network’s activations and its parameters. Sometimes, however, we might want to incorporate terms that are neither the result of previous layers nor updatable parameters. We call these constant parameters.

Say, for example, that we want a layer that calculates the function $f(x, w) = c \, w^\top x$, where $x$ is the input, $w$ is our parameter, and $c$ is some specified constant that is not updated during optimization. So we implement a `FixedHiddenMLP` class as follows.

In [None]:
class FixedHiddenMLP(nn.Module):
    
    def __init__(self):
        super().__init__()
        # Random weight parameters that will not compute gradients and
        # therefore keep constant during training
        self.rand_weight = torch.rand((20, 20))
        self.linear = nn.LazyLinear(20)

    def forward(self, X):
        X = self.linear(X)
        X = F.relu(X @ self.rand_weight + 1)
        # Reuse the fully connected layer. This is equivalent to sharing
        # parameters with two fully connected layers
        X = self.linear(X)
        # Control flow
        while X.abs().sum() > 1:
            X /= 2
        return X.sum()

In this model, we implement a hidden layer whose weights (`self.rand_weight`) are initialized randomly at instantiation and are thereafter constant. This weight is not a model parameter and thus it is never updated by backpropagation. 
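
As a minimal sketch of the function $f(x, w) = c \, w^\top x$, the hypothetical `ScaledLinear` class below keeps $w$ as a trainable parameter and stores $c$ with `register_buffer`. Note that this is an alternative to the plain tensor attribute used for `rand_weight`, not what the class above does: a buffer is still excluded from the parameters, but it is saved in the `state_dict` and moves with the module when `.to(device)` is called.

```python
import torch
from torch import nn

class ScaledLinear(nn.Module):  # hypothetical name, for illustration only
    def __init__(self, in_features, c=0.5):
        super().__init__()
        self.w = nn.Parameter(torch.randn(in_features))
        # A buffer is part of the module's state but is not a parameter
        self.register_buffer("c", torch.tensor(c))

    def forward(self, x):
        return self.c * (x @ self.w)

layer = ScaledLinear(3)
param_names = [name for name, _ in layer.named_parameters()]
state_keys = sorted(layer.state_dict().keys())

# Backpropagate through the layer: only w receives a gradient
layer(torch.ones(2, 3)).sum().backward()
print(param_names, state_keys, layer.w.grad, layer.c.grad)
```

An optimizer built from `layer.parameters()` would therefore update `w` and leave `c` untouched.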

In [None]:
net = FixedHiddenMLP()
net(X)

> __NOTE__ that before returning the output, our model ran a while-loop: as long as the $\ell_1$ norm of the output is larger than 1, 
> it divides the output by 2. Finally, we returned the sum of the entries in `X`. 
> No standard neural network performs this operation. The point is only to show you how to integrate arbitrary code into the flow 
> of your neural network computations.

## Mix and Nest Modules

We can mix and match various ways of assembling modules together. In the following example, we nest modules in some creative ways.

In [None]:
class NestMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.LazyLinear(64), nn.ReLU(),
                                 nn.LazyLinear(32), nn.ReLU())
        self.linear = nn.LazyLinear(16)
    def forward(self, X):
        return self.linear(self.net(X))

In [None]:
chimera = nn.Sequential(NestMLP(), nn.LazyLinear(20), FixedHiddenMLP())
chimera(X)

# Parameter Management

After training, we will need the learned parameters in order to make future predictions. Additionally, we will sometimes wish to 
- extract the parameters, perhaps to reuse them in some other context, 
- save our model to disk 
    - so that it may be executed in other software, 
    - or for examination in the hope of gaining scientific understanding.

## Accessing the model parameters

In [None]:
net = nn.Sequential(nn.LazyLinear(8),
                    nn.ReLU(),
                    nn.LazyLinear(1))
X = torch.rand(size=(2, 4))
net(X).shape

In [None]:
# We can inspect the parameters of the second fully connected layer as follows.
# We can see that this fully connected layer contains two parameters, corresponding to that layer’s weights and biases, respectively.

net[2].state_dict()

Note that each parameter is represented as an instance of the `nn.Parameter` class. To do anything useful with the parameters, we first need to access the underlying numerical values.

In [None]:
# Parameters are complex objects, containing values, gradients, and additional information. That is why we need to request the value explicitly.

type(net[2].bias), net[2].bias.data

In [None]:
# Because we have not invoked backpropagation for this network yet, it is in its initial state (i.e., the gradient is set to None).

net[2].weight.grad is None

When we need to perform operations on all parameters, accessing them one-by-one can grow tedious. 

The situation can grow especially unwieldy when we work with more complex, e.g., nested, modules, since we would need to recurse through the entire tree to extract each sub-module’s parameters. 
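
Fortunately, `named_parameters()` performs this recursion for us. The sketch below builds a small nested network (the `block` helper and all layer sizes are assumptions chosen for illustration) and shows that the returned names encode the path through the module tree.

```python
import torch
from torch import nn

def block():
    # Hypothetical building block used only for this example
    return nn.Sequential(nn.Linear(4, 8), nn.ReLU(),
                         nn.Linear(8, 4), nn.ReLU())

nested = nn.Sequential(block(), block(), nn.Linear(4, 1))

# Names are dotted paths such as "0.2.weight": child 0, then its child 2
names = [name for name, _ in nested.named_parameters()]
total = sum(p.numel() for p in nested.parameters())
print(names)
print(total)
```

Because the names are dotted paths, a single parameter can also be looked up by indexing along the same path, for example `nested[0][2].weight`.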

In [None]:
[(name, param.shape) for name, param in net.named_parameters()]


Often, we want to share parameters across multiple layers. Let’s see how to do this elegantly. 

In the following we allocate a fully connected layer and then use its parameters specifically to set those of another layer. Here we need to run the forward propagation `net(X)` before accessing the parameters, because lazy layers only create their weights during the first forward pass.

In [None]:
# We need to give the shared layer a name so that we can refer to its
# parameters
shared = nn.LazyLinear(8)
net = nn.Sequential(nn.LazyLinear(8), nn.ReLU(),
                    shared, nn.ReLU(),
                    shared, nn.ReLU(),
                    nn.LazyLinear(1))
net(X)


In [None]:
# Check whether the parameters are the same
print(net[2].weight.data[0] == net[4].weight.data[0])


In [None]:
# Change the parameters for the shared layer
net[2].weight.data[0, 0] = 100


In [None]:
# Make sure that they are actually the same object rather than just having the same value
print(net[2].weight.data[0] == net[4].weight.data[0])

ATTENTION: They are not just equal, they are represented by the same exact tensor. Thus, if we change one of the parameters, the other one changes, too.
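
The same point can be checked with object identity rather than value comparison. The sketch below is a self-contained variant that uses `nn.Linear` with explicit sizes (an assumption, so that no dry run is needed) and the hypothetical name `tied_net`. It also shows that `parameters()` lists the shared tensors only once.

```python
import torch
from torch import nn

shared = nn.Linear(8, 8)
tied_net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(),
                         shared, nn.ReLU(),
                         shared, nn.ReLU(),
                         nn.Linear(8, 1))

# `is` tests identity: both positions hold the very same tensor
same_object = tied_net[2].weight is tied_net[4].weight

# Three distinct layers, each with a weight and a bias
n_unique = len(list(tied_net.parameters()))

# One backward pass fills the single gradient buffer of the shared weight
tied_net(torch.rand(2, 4)).sum().backward()
has_grad = tied_net[2].weight.grad is not None
print(same_object, n_unique, has_grad)
```

Since the shared layer is used at two positions, the gradient stored on its weight is the sum of the contributions from both positions.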

# Initialize Parameters

Now that we know how to access the parameters, let’s look at how to initialize them properly.

The deep learning framework provides default random initializations to its layers. 
However, we often want to initialize our weights according to various other protocols. 
The framework provides the most commonly used protocols and also allows us to create a custom initializer.

In [None]:
net = nn.Sequential(nn.LazyLinear(8), nn.ReLU(), nn.LazyLinear(1))
X = torch.rand(size=(2, 4))
net(X).shape


By default, PyTorch initializes weight and bias matrices uniformly by drawing from a range that is computed according to the input and output dimension. 

## Built-in Initialization

PyTorch’s `nn.init` module provides a variety of preset initialization methods.

In [None]:
def init_normal(module):
    if type(module) == nn.Linear:
        nn.init.normal_(module.weight, mean=0, std=0.01)
        nn.init.zeros_(module.bias)
        
net.apply(init_normal)
net[0].weight.data[0], net[0].bias.data[0]

In [None]:
def init_constant(module):
    if type(module) == nn.Linear:
        nn.init.constant_(module.weight, 1)
        nn.init.zeros_(module.bias)
        
net.apply(init_constant)
net[0].weight.data[0], net[0].bias.data[0]

In [None]:
# We can also apply different initializers for certain blocks.

def init_xavier(module):
    if type(module) == nn.Linear:
        nn.init.xavier_uniform_(module.weight)

def init_42(module):
    if type(module) == nn.Linear:
        nn.init.constant_(module.weight, 42)

net[0].apply(init_xavier)
net[2].apply(init_42)

print(net[0].weight.data[0])
print(net[2].weight.data)

## Custom Initialization

In the example below, we define an initializer for any weight parameter $w$ using the following distribution:

$$
w \sim 
\begin{cases}
U(5, 10), & \text{with probability } \tfrac{1}{4}, \\[6pt]
0, & \text{with probability } \tfrac{1}{2}, \\[6pt]
U(-10, -5), & \text{with probability } \tfrac{1}{4}.
\end{cases}
$$

Again, we implement a `my_init` function to apply to `net`.

In [None]:
# Why this works: U(-10, 10) splits into four equally likely intervals:
# (-10, -5), (-5, 0), (0, 5), and (5, 10).
# The mask (module.weight.data.abs() >= 5) is False (i.e., 0) for values in
# (-5, 5), which covers half of the interval, so those weights are zeroed out.
# The remaining values lie in (-10, -5) or (5, 10), each with probability 1/4.
def my_init(module):
    if type(module) == nn.Linear:
        print("Init", *[(name, param.shape)
                        for name, param in module.named_parameters()][0])
        nn.init.uniform_(module.weight, -10, 10)
        module.weight.data *= module.weight.data.abs() >= 5


net.apply(my_init)
net[0].weight[:2]
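
As a sanity check on the masking trick, the sketch below applies the same two steps to a plain tensor and measures the fraction of entries that land in each case. The tensor size and the seed are arbitrary choices made for this illustration.

```python
import torch

torch.manual_seed(0)

# Step 1: draw from U(-10, 10). Step 2: zero out everything in (-5, 5)
w = torch.empty(1000, 1000).uniform_(-10, 10)
w = w * (w.abs() >= 5)

frac_zero = (w == 0).float().mean().item()
frac_pos = (w >= 5).float().mean().item()
frac_neg = (w <= -5).float().mean().item()
print(frac_zero, frac_pos, frac_neg)
```

With a million samples, the three fractions should come out close to 1/2, 1/4, and 1/4, matching the target distribution.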


In [None]:
# Note that we always have the option of setting parameters directly.

print(net[0].weight.data)

net[0].weight.data[:] += 1
net[0].weight.data[0, 0] = 42

net[0].weight.data

## Lazy Initialization

So far, it might seem that we got away with being sloppy in setting up our networks. Specifically, we did the following unintuitive things, which might not seem like they should work:

- We defined the network architectures without specifying the input dimensionality.
- We added layers without specifying the output dimension of the previous layer.
- We even “initialized” these parameters before providing enough information to determine how many parameters our models should contain.

You might be surprised that our code runs at all. 

The trick here is that the framework defers initialization, waiting until the first time we pass data through the model, to infer the sizes of each layer on the fly.

In [None]:
net = nn.Sequential(nn.LazyLinear(256), nn.ReLU(), nn.LazyLinear(10))

At this point, the network cannot possibly know the dimensions of the input layer’s weights
because the input dimension remains unknown.

In [None]:
net[0].weight

In [None]:
# Create a dummy input
X = torch.rand(2, 20)

# Run the dummy input through the net to infer the shapes of the parameters
net(X)

# Now we can inspect the shapes of the parameters
net[0].weight.shape

The following method passes dummy inputs through the network for a dry run to infer all parameter shapes and subsequently initializes the parameters.

In [None]:
@d2l.add_to_class(d2l.Module)  #@save
def apply_init(self, inputs, init=None):
    self.forward(*inputs)
    if init is not None:
        self.net.apply(init)

Later on, when working with convolutional neural networks, this technique will become even more convenient since the input dimensionality (e.g., the resolution of an image) will affect the dimensionality of each subsequent layer.
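
The deferred initialization described in this section can be observed directly. The sketch below uses only `torch` and the hypothetical name `lazy_net`. It checks that the weight of a lazy layer starts out as an `UninitializedParameter`, that a dry run materializes it, and that an initializer can safely be applied afterwards.

```python
import torch
from torch import nn
from torch.nn.parameter import UninitializedParameter

lazy_net = nn.Sequential(nn.LazyLinear(256), nn.ReLU(), nn.LazyLinear(10))

# Before any data is seen, the weight has no shape yet
lazy_before = isinstance(lazy_net[0].weight, UninitializedParameter)
type_before = type(lazy_net[0]).__name__

def init_normal(module):
    if isinstance(module, nn.Linear):
        nn.init.normal_(module.weight, mean=0.0, std=0.01)
        nn.init.zeros_(module.bias)

# Dry run first to infer shapes, then initialize
lazy_net(torch.rand(2, 20))
lazy_net.apply(init_normal)

print(lazy_before, type_before, type(lazy_net[0]).__name__)
print(lazy_net[0].weight.shape, lazy_net[2].weight.shape)
```

After the first forward pass the lazy layer becomes an ordinary `nn.Linear`, which is why initializers that test for `nn.Linear` should be applied only after the dry run.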

# Custom Layers

Sooner or later, you will need a layer that does not exist yet in the deep learning framework. In these cases, you must build a custom layer. 

The following `CenteredLayer` class simply subtracts the mean from its input.

In [None]:
class CenteredLayer(nn.Module):
    
    def __init__(self):
        super().__init__()

    def forward(self, X):
        return X - X.mean()

In [None]:
# Let’s verify that our layer works as intended by feeding some data through it.

layer = CenteredLayer()
layer(torch.tensor([1.0, 2, 3, 4, 5]))

We can now incorporate our layer as a component in constructing more complex models.

In [None]:
net = nn.Sequential(nn.LazyLinear(128), CenteredLayer())

In [None]:
Y = net(torch.rand(4, 8))
Y.mean()

Now let’s implement our own version of the fully connected layer. 

Recall that this layer requires two parameters, one to represent the weight and the other for the bias. In this implementation, we bake in the ReLU activation as a default. This layer requires two input arguments: `in_units` and `units`, which denote the number of inputs and outputs, respectively.

In [None]:
class MyLinear(nn.Module):

    def __init__(self, in_units, units):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(in_units, units))
        self.bias = nn.Parameter(torch.randn(units,))
        
    def forward(self, X):
        # Use the parameters directly (not .data) so that autograd tracks them
        linear = torch.matmul(X, self.weight) + self.bias
        return F.relu(linear)

In [None]:
linear = MyLinear(5, 3)
linear.weight

In [None]:
# We can directly carry out forward propagation calculations using custom layers.
linear(torch.rand(2, 5))

In [None]:
# We can also construct models using custom layers. Once we have that we can use it just like the built-in fully connected layer.
net = nn.Sequential(MyLinear(64, 8), MyLinear(8, 1))
net(torch.rand(2, 64))
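
A quick way to check that a custom layer is trainable is to confirm that gradients reach its parameters. The sketch below uses a hypothetical `TinyLinear` class with a `use_data` switch (both are assumptions made for this illustration) to contrast using the parameters directly with reading them through `.data`, which returns tensors that autograd does not track.

```python
import torch
from torch import nn
from torch.nn import functional as F

class TinyLinear(nn.Module):  # hypothetical, for illustration only
    def __init__(self, in_units, units, use_data=False):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(in_units, units))
        self.bias = nn.Parameter(torch.randn(units))
        self.use_data = use_data

    def forward(self, X):
        if self.use_data:
            # .data bypasses autograd, so no graph is recorded
            return F.relu(X @ self.weight.data + self.bias.data)
        return F.relu(X @ self.weight + self.bias)

inputs = torch.rand(2, 5)

tracked = TinyLinear(5, 3)
tracked(inputs).sum().backward()  # gradients flow to weight and bias

detached = TinyLinear(5, 3, use_data=True)
out = detached(inputs)            # output is cut off from the parameters

print(tracked.weight.grad.shape, out.requires_grad)
```

The tracked layer ends up with a gradient of the same shape as its weight, while the output of the detached variant does not require gradients at all, so an optimizer would never update that layer.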

# File I/O

Once we are happy with our model, we will want to save the results for later use in various contexts (perhaps even to make predictions in deployment). 

Additionally, when running a long training process, the best practice is to periodically save intermediate results (checkpointing) to ensure that we do not lose several days’ worth of computation if we trip over the power cord of our server.

In [None]:
x = torch.arange(4)
torch.save(x, '../data/x-file')

In [None]:
# Read the data from the stored file back into memory.
x2 = torch.load('../data/x-file')
x2

In [None]:
# We can store a list of tensors and read them back into memory.
y = torch.zeros(4)
torch.save([x, y],'x-files')

x2, y2 = torch.load('x-files')
(x2, y2)

In [None]:
# We can even write and read a dictionary that maps from strings to tensors.

mydict = {'x': x, 'y': y}
torch.save(mydict, 'mydict')
mydict2 = torch.load('mydict')
mydict2


## Loading and Saving Model Parameters

The deep learning framework provides built-in functionality to load and save entire networks. 

An important detail to note is that this saves model parameters and not the entire model. For example, if we have a 3-layer MLP, we need to specify the architecture separately. 

In [None]:
class MLP(nn.Module):

    def __init__(self):
        super().__init__()
        self.hidden = nn.LazyLinear(256)
        self.output = nn.LazyLinear(10)

    def forward(self, x):
        return self.output(F.relu(self.hidden(x)))
    
    
net = MLP()
X = torch.randn(size=(2, 20))
Y = net(X)

In [None]:
torch.save(net.state_dict(), 'mlp.params')

In [None]:
clone = MLP()

clone.load_state_dict(torch.load('mlp.params'))

clone.eval()

Since both instances have the same model parameters, the computational result of the same input X should be the same. Let’s verify this.

In [None]:
Y_clone = clone(X)
Y_clone == Y
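
The same mechanism extends to checkpointing during training. The sketch below is one possible layout, not a fixed convention: the helper names `save_checkpoint` and `load_checkpoint`, the dictionary keys, and the file name are all assumptions. It stores the model parameters together with the optimizer state and the epoch counter in a single file inside a temporary directory.

```python
import os
import tempfile

import torch
from torch import nn

def make_model():
    return nn.Sequential(nn.Linear(20, 256), nn.ReLU(), nn.Linear(256, 10))

def save_checkpoint(path, model, optimizer, epoch):  # hypothetical helper
    torch.save({"epoch": epoch,
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict()}, path)

def load_checkpoint(path, model, optimizer):  # hypothetical helper
    ckpt = torch.load(path)
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["epoch"]

model = make_model()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "checkpoint.pt")
    save_checkpoint(path, model, optimizer, epoch=3)

    # The architecture must be rebuilt in code before loading
    restored = make_model()
    restored_opt = torch.optim.SGD(restored.parameters(), lr=0.1)
    resumed_epoch = load_checkpoint(path, restored, restored_opt)

inputs = torch.rand(2, 20)
same = torch.equal(model(inputs), restored(inputs))
print(resumed_epoch, same)
```

Saving the optimizer state alongside the parameters matters for optimizers that keep running statistics, since training can then resume where it stopped.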

# GPUs

In PyTorch, every array has a device; we often refer to it as a *context*. So far, by default, all variables and associated computation have been assigned to the CPU. Typically, other contexts might be various GPUs. Things can get even hairier when we deploy jobs across multiple servers. 
By assigning arrays to contexts intelligently, we can minimize the time spent transferring data between devices. For example, when training neural networks on a server with a GPU, we typically prefer for the model’s parameters to live on the GPU.

__To run the programs in this section, you need at least two GPUs.__

#### Note -- Macs only have one GPU with a different architecture:

Apple silicon GPUs use a unified memory architecture where the CPU and GPU share a single memory pool, providing high bandwidth and efficient data sharing, while traditional NVIDIA GPUs employ discrete memory (VRAM) for the GPU and separate RAM for the CPU. Key differences also lie in their respective graphics APIs (Metal for Apple, CUDA for NVIDIA), target applications (Apple silicon for integrated system efficiency, NVIDIA for raw performance in dedicated tasks), and power efficiency, with Apple silicon focusing on lower power consumption for its integrated system. 

In [None]:
def cpu():  #@save
    """Get the CPU device."""
    return torch.device('cpu')

def gpu(i=0):  #@save
    """Get a GPU device."""
    if torch.backends.mps.is_available():
        return torch.device("mps")
    
    elif torch.cuda.is_available():
        return torch.device(f'cuda:{i}')
    
    else:
        return torch.device("cpu")

cpu(), gpu(), gpu(1)

In [None]:
def num_gpus():  #@save
    """Get the number of available GPUs."""
    if torch.backends.mps.is_available():
        return 1  # Only 1 MPS GPU is available
    elif torch.cuda.is_available():
        return torch.cuda.device_count()
    return 0  # No GPU backend is available

num_gpus() # Macs won't have any NVIDIA GPUs

In [None]:
def try_gpu(i=0):  #@save
    """Return gpu(i) if exists, otherwise return cpu()."""
    if num_gpus() >= i + 1:
        return gpu(i)
    return cpu()

def try_all_gpus():  #@save
    """Return all available GPUs, or [cpu(),] if no GPU exists."""
    devices = [gpu(i) for i in range(num_gpus())]
    return devices if devices else [cpu()]

try_gpu(), try_gpu(10), try_all_gpus()

By default, tensors are created on the CPU. We can query the device where the tensor is located.

In [None]:
x = torch.tensor([1, 2, 3])
x.device

In [None]:
X = torch.ones(2, 3, device=try_gpu())
X, X.device

In [None]:
Y = torch.rand(2, 3, device=try_gpu(1))
Y, Y.device

# On Macs with one GPU, try_gpu(1) falls back to the CPU, so Y is created on the CPU. The CPU and GPU on new Macs share the same memory pool, but PyTorch still treats `cpu` and `mps` as distinct devices.

In [None]:
if num_gpus() > 1:
    # Both tensors need to be on the same GPU
    Z = X.cuda(1)
    print(Z.device)
else:
    # For Macs: need to have both tensors on GPU
    Z = X.to(try_gpu())
    
print(X)
print(Z)

In [None]:
X + Z

People use GPUs to do machine learning because they expect them to be fast. But transferring variables between devices is slow: much slower than computation. 

Transferring data is not only slow, it also makes parallelization a lot more difficult, since we have to wait for data to be sent (or rather to be received) before we can proceed with more operations. This is why copy operations should be taken with great care. 

As a rule of thumb, many small operations are much worse than one big operation. 
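
In code, this rule of thumb usually means assembling data on the source device first and copying it once. The sketch below assumes a CUDA device if one is available and otherwise falls back to the CPU, in which case the copies are no-ops and only the pattern is shown. The chunk count and sizes are arbitrary.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
chunks = [torch.rand(3) for _ in range(100)]

# Many small transfers: one copy per chunk
moved_one_by_one = torch.stack([c.to(device) for c in chunks])

# One big transfer: assemble on the CPU, then copy once
moved_at_once = torch.stack(chunks).to(device)

print(torch.equal(moved_one_by_one, moved_at_once))
```

Both approaches produce the same tensor. The second issues a single copy instead of a hundred, which is the pattern to prefer when transfers dominate.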

## NN and GPUs

Similarly, a neural network model can specify devices. The following code puts the model parameters on the GPU.

In [None]:
net = nn.Sequential(nn.LazyLinear(1))
net = net.to(device=try_gpu())

net(X)

In [None]:
# Let’s confirm that the model parameters are stored on the same GPU.
net[0].weight.data.device

Next, we let the trainer support GPUs.

In [None]:
@d2l.add_to_class(d2l.Trainer)  #@save
def __init__(self, max_epochs, num_gpus=0, gradient_clip_val=0):
    self.save_hyperparameters()
    self.gpus = [d2l.gpu(i) for i in range(min(num_gpus, d2l.num_gpus()))]

@d2l.add_to_class(d2l.Trainer)  #@save
def prepare_batch(self, batch):
    if self.gpus:
        batch = [a.to(self.gpus[0]) for a in batch]
    return batch

@d2l.add_to_class(d2l.Trainer)  #@save
def prepare_model(self, model):
    model.trainer = self
    model.board.xlim = [0, self.max_epochs]
    if self.gpus:
        model.to(self.gpus[0])
    self.model = model

In short, as long as all data and parameters are on the same device, we can learn models efficiently. In the following chapters we will see several such examples.
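
Outside of the `d2l.Trainer` class, the same idea can be sketched in plain PyTorch. The helper `pick_device` below is hypothetical, as are the toy model and batch. It selects a device once and then moves both the model and every tensor in the batch to it before computing a loss.

```python
import torch
from torch import nn

def pick_device():  # hypothetical helper, for illustration only
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

device = pick_device()
model = nn.Linear(3, 1).to(device)

# A toy batch of features and labels, created on the CPU
batch = [torch.ones(2, 3), torch.zeros(2, 1)]
features, labels = [t.to(device) for t in batch]

# Data and parameters now share a device, so this runs without transfers
loss = ((model(features) - labels) ** 2).mean()
loss.backward()
print(model.weight.device, loss.device)
```

If either the batch or the model were left on a different device, the forward pass would raise an error instead of silently copying data.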