# 1️⃣ Layers and Modules

### 👇🏻 
There are times **when we need access to repeatable structures** so that we can build complicated networks easily and in a more manageable way.

1. The thing we often need to access in a *modular* way is neither a single layer nor a whole model; it is somewhere in between, typically a **group of layers**.
2. We may also **need to customize the `forward` operation** instead of relying on a simple matmul. We may want to run arbitrary code in between, modify the input, or perform some operations before producing the output.

All of this can be achieved by **grouping the layers** and giving them **special treatment**. This is done through the `nn.Module` class.

---

Here, we will first see `what` and then we will have a look at `how`. 

In [1]:
import torch
from torch import nn
from torch.nn import functional as F

import warnings
warnings.filterwarnings('ignore')

The following code generates a network
with one fully connected hidden layer
with 256 units and ReLU activation,
followed by a fully connected output layer
with ten units (no activation function).


In [2]:
net = nn.Sequential(nn.LazyLinear(256), nn.ReLU(), nn.LazyLinear(10))

X = torch.rand(2, 20)

# the standard forward
net(X).shape

torch.Size([2, 10])

## A Custom Module

Perhaps the easiest way to develop intuition
about how a module works
is to implement one ourselves.
Before we do that,
we briefly summarize the basic functionality
that each module must provide:


1. **Ingest input data as arguments** to its forward propagation method.
1. Generate **an output** by having the **forward propagation** method **return a value**. Note that the output may have a different shape from the input. For example, the first fully connected layer in our model above ingests an input of arbitrary dimension but returns an output of dimension 256.
1. Calculate the gradient of its output with respect to its input, which can be accessed via its backpropagation method. Typically this happens automatically.
1. Store and provide access to those parameters necessary
   for executing the forward propagation computation.
1. Initialize model parameters as needed.


In [3]:
class MLP(nn.Module):
    def __init__(self):
        # Call the constructor of the parent class nn.Module to perform
        # the necessary initialization
        super().__init__()
        self.hidden = nn.LazyLinear(256)
        self.out = nn.LazyLinear(10)

    # Define the forward propagation of the model, that is, how to return the
    # required model output based on the input X
    def forward(self, X):
        h = self.hidden(X)
        h = F.relu(h)
        o = self.out(h)
        return o

In [4]:
net = MLP()
net(X).shape

torch.Size([2, 10])

**A key virtue of the module abstraction is its versatility. We can subclass a module to create layers (such as the fully connected layer class), entire models (such as the `MLP` class above), or various components of intermediate complexity.**
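
To illustrate the "layer" end of that spectrum, here is a minimal sketch of a parameter-free custom layer. The class name `CenteredLayer` and the layer sizes are assumptions chosen for illustration; they do not come from the code above. The sketch only shows that a subclass of `nn.Module` with a `forward` method composes with built-in layers.

```python
import torch
from torch import nn

class CenteredLayer(nn.Module):
    """Hypothetical parameter-free layer that subtracts the mean of its input."""
    def forward(self, X):
        return X - X.mean()

# The custom layer can be chained with built-in layers like any other module.
net = nn.Sequential(nn.Linear(20, 8), CenteredLayer())
Y = net(torch.rand(2, 20))
print(Y.shape)  # torch.Size([2, 8])
```

Because the layer holds no state, it does not need its own `__init__`; the one inherited from `nn.Module` is enough.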

## **The Sequential Module**

We can now take a closer look
at how the `Sequential` class works.
Recall that `Sequential` was designed
to daisy-chain other modules together.
To build our own simplified `MySequential`,
we just need to define two key methods:

1. A method for appending modules one by one to a list.
1. A forward propagation method for passing an input through the chain of modules, in the same order as they were appended.

The following `MySequential` class delivers the same
functionality as the default `Sequential` class.


In [5]:
class MySequential(nn.Module):
    def __init__(self, *args):
        super().__init__()
        for idx, module in enumerate(args):
            self.add_module(str(idx), module)

    def forward(self, X):
        for module in self.children():
            X = module(X)
        return X

In the `__init__` method, we register every module
by calling the `add_module` method. These modules can later be retrieved with the `children` method.
In this way the system knows the added modules,
and it will properly initialize each module's parameters.
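
The following sketch shows why registration matters. The two classes, `ListHolder` and `Registered`, are hypothetical names introduced only for this comparison, and `nn.Linear` with explicit sizes is used so that the parameters exist immediately. Layers stored in a plain Python list are invisible to `nn.Module`, while layers added through `add_module` are tracked.

```python
import torch
from torch import nn

class ListHolder(nn.Module):
    """Hypothetical counter-example: layers kept in a plain Python list."""
    def __init__(self, *layers):
        super().__init__()
        self.layers = list(layers)  # not registered with nn.Module

class Registered(nn.Module):
    """Same layers, but registered as child modules."""
    def __init__(self, *layers):
        super().__init__()
        for idx, layer in enumerate(layers):
            self.add_module(str(idx), layer)

layers = (nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
n_plain = len(list(ListHolder(*layers).parameters()))
n_registered = len(list(Registered(*layers).parameters()))
print(n_plain, n_registered)  # 0 4
```

An optimizer built from `ListHolder(...).parameters()` would therefore have nothing to update.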


When our `MySequential`'s forward propagation method is invoked,
each added module is executed
in the order in which they were added.
We can now reimplement an MLP
using our `MySequential` class.


In [6]:
net = MySequential(nn.LazyLinear(256), nn.ReLU(), nn.LazyLinear(10))
net(X).shape

torch.Size([2, 10])

## **Executing Code in the Forward Propagation Method**

> This is the way to **manipulate what happens in the `forward` method**

The `Sequential` class makes model construction easy,
allowing us to assemble new architectures
without having to define our own class.
However, not all architectures are simple daisy chains.
When greater flexibility is required,
we will want to define our own blocks.
For example, we might want to execute
Python's control flow within the forward propagation method.
Moreover, we might want to perform
arbitrary mathematical operations,
not simply relying on predefined neural network layers.

You may have noticed that until now,
all of the operations in our networks
have acted upon our network's activations
and its parameters.
Sometimes, however, we might want to
incorporate terms
that are neither the result of previous layers
nor updatable parameters.
We call these *constant parameters*.
Say for example that we want a layer
that calculates the function
$f(\mathbf{x},\mathbf{w}) = c \cdot \mathbf{w}^\top \mathbf{x}$,
where $\mathbf{x}$ is the input, $\mathbf{w}$ is our parameter,
and $c$ is some specified constant
that is not updated during optimization.
So we implement a `FixedHiddenMLP` class as follows.


In [7]:
class FixedHiddenMLP(nn.Module):
    def __init__(self):
        super().__init__()
        # Random weights that are not registered as parameters, so they are
        # never updated and stay constant during training
        self.rand_weight = torch.rand((20, 20))
        self.linear = nn.LazyLinear(20)

    def forward(self, X):
        X = self.linear(X)
        X = F.relu(X @ self.rand_weight + 1)
        # Reuse the fully connected layer. This is equivalent to sharing
        # parameters with two fully connected layers
        X = self.linear(X)
        # Control flow
        while X.abs().sum() > 1:
            X /= 2
        return X.sum()

In this model,
we implement a hidden layer whose weights
(`self.rand_weight`) are initialized randomly
at instantiation and are thereafter constant.
This weight is not a model parameter
and thus it is never updated by backpropagation.
The network then passes the output of this "fixed" layer
through a fully connected layer.
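
As a side note, the function $f(\mathbf{x},\mathbf{w}) = c \cdot \mathbf{w}^\top \mathbf{x}$ mentioned earlier can be sketched directly. This is not how `FixedHiddenMLP` stores its constant (it uses a plain tensor attribute); as an alternative, the sketch below keeps the constant in a buffer via `register_buffer`, so that it is saved in the `state_dict` without being a trainable parameter. The class name `ScaledDot` and the default `c=2.0` are assumptions for illustration.

```python
import torch
from torch import nn

class ScaledDot(nn.Module):
    """Hypothetical layer computing f(x, w) = c * w^T x."""
    def __init__(self, num_inputs, c=2.0):
        super().__init__()
        self.w = nn.Parameter(torch.randn(num_inputs))  # trainable parameter
        self.register_buffer("c", torch.tensor(c))      # constant, not trainable

    def forward(self, X):
        return self.c * (X @ self.w)

layer = ScaledDot(20)
out = layer(torch.rand(2, 20))
param_names = [name for name, _ in layer.named_parameters()]
state_keys = list(layer.state_dict().keys())
print(out.shape)    # torch.Size([2])
print(param_names)  # ['w']
print(state_keys)   # ['w', 'c']
```

Only `w` is handed to an optimizer, while both `w` and `c` are stored when the model is saved.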

**Note that before returning the output,
our model did something unusual.**

We ran a while-loop that tests whether
the $\ell_1$ norm of the output is larger than $1$
and divides the output vector by $2$
until the norm is at most $1$.
Finally, we returned the sum of the entries in `X`.
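
To make the loop concrete, here is a small sketch of the same halving logic on a hand-picked tensor. The values `[3.0, -5.0]` are an assumption chosen so that the arithmetic is easy to follow.

```python
import torch

X = torch.tensor([3.0, -5.0])  # l1 norm is 8
steps = 0
while X.abs().sum() > 1:
    X = X / 2                  # halve until the l1 norm is at most 1
    steps += 1

print(steps)       # 3
print(X.tolist())  # [0.375, -0.625]
```

The norm goes 8, 4, 2, 1, and the loop stops as soon as it is no longer larger than 1.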

> ⚠️ To our knowledge, ***no standard neural network
performs this operation.*** Note that this particular operation may not be useful
in any real-world task.
Our point is only to show you how to integrate
arbitrary code into the flow of your
neural network computations.


In [8]:
net = FixedHiddenMLP()
net(X)

tensor(-0.0446, grad_fn=<SumBackward0>)

We can **mix and match various
ways of assembling modules together.**
In the following example, we nest modules
in some creative ways.


In [9]:
class NestMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.LazyLinear(64), nn.ReLU(),
                                 nn.LazyLinear(32), nn.ReLU())
        self.linear = nn.LazyLinear(16)

    def forward(self, X):
        return self.linear(self.net(X))

chimera = nn.Sequential(NestMLP(), nn.LazyLinear(20), FixedHiddenMLP())
chimera(X)

tensor(0.2316, grad_fn=<SumBackward0>)

# 2️⃣ Accessing the Parameters!

Once we have chosen an architecture
and set our hyperparameters,
we proceed to the training loop,
where **our goal is to find parameter values
that minimize our loss function.**


After training, we will need these parameters
in order to make future predictions.
Additionally, we will sometimes wish
to extract the parameters
perhaps to reuse them in some other context,
to save our model to disk so that
it may be executed in other software,
or for examination in the hope of
gaining scientific understanding.

Most of the time, we will be able
to ignore the nitty-gritty details
of how parameters are declared
and manipulated, relying on deep learning frameworks
to do the heavy lifting.
However, when we move away from
stacked architectures with standard layers,
we will sometimes need to get into the weeds
of declaring and manipulating parameters.
In this section, we cover the following:

* Accessing parameters for debugging, diagnostics, and visualizations.
* Sharing parameters across different model components.


In [10]:
import torch
from torch import nn

In [11]:
net = nn.Sequential(nn.LazyLinear(8),
                    nn.ReLU(),
                    nn.LazyLinear(1))

X = torch.rand(size=(2, 4))
net(X).shape

torch.Size([2, 1])

This is as simple as it gets. The data has 2 rows and 4 features. The net has a single hidden layer with 8 units (double the number of features) and a single output.

Thus the prediction has `2 x 1` shape.

## **Parameter Access**

Let's start with how to access parameters
from the models that you already know.


When a model is defined via the `Sequential` class,
we can first access any layer by indexing
into the model **as though it were a list**.
Each layer's parameters are conveniently
exposed through its `state_dict()` method and as attributes such as `weight` and `bias`.


In [15]:
# hidden layer weights and bias
net[0].state_dict()

OrderedDict([('weight',
              tensor([[-0.2271,  0.3601,  0.0261,  0.4770],
                      [ 0.0050, -0.1673,  0.4541,  0.2741],
                      [ 0.4006,  0.1460,  0.2844,  0.3897],
                      [-0.4409, -0.3244, -0.1448,  0.3665],
                      [-0.4509,  0.2230,  0.0334,  0.0913],
                      [ 0.1658,  0.2303, -0.4725, -0.4774],
                      [ 0.2010, -0.0702,  0.4590, -0.1829],
                      [ 0.2476,  0.4240,  0.0115, -0.0101]])),
             ('bias',
              tensor([-0.0299, -0.4780,  0.2742,  0.3652, -0.3704,  0.0021, -0.2558, -0.3816]))])

In [16]:
# last layer weights and bias
net[2].state_dict()

OrderedDict([('weight',
              tensor([[ 0.2736, -0.0023,  0.0856, -0.0967,  0.0933, -0.2383, -0.2015, -0.2272]])),
             ('bias', tensor([-0.2758]))])

In [18]:
net[1].state_dict()

OrderedDict()

In [34]:
# TYPE: Parameter
net[0].weight

Parameter containing:
tensor([[-0.2271,  0.3601,  0.0261,  0.4770],
        [ 0.0050, -0.1673,  0.4541,  0.2741],
        [ 0.4006,  0.1460,  0.2844,  0.3897],
        [-0.4409, -0.3244, -0.1448,  0.3665],
        [-0.4509,  0.2230,  0.0334,  0.0913],
        [ 0.1658,  0.2303, -0.4725, -0.4774],
        [ 0.2010, -0.0702,  0.4590, -0.1829],
        [ 0.2476,  0.4240,  0.0115, -0.0101]], requires_grad=True)

In [43]:
# TYPE: Tensor
net[0].weight.data

tensor([[-0.2271,  0.3601,  0.0261,  0.4770],
        [ 0.0050, -0.1673,  0.4541,  0.2741],
        [ 0.4006,  0.1460,  0.2844,  0.3897],
        [-0.4409, -0.3244, -0.1448,  0.3665],
        [-0.4509,  0.2230,  0.0334,  0.0913],
        [ 0.1658,  0.2303, -0.4725, -0.4774],
        [ 0.2010, -0.0702,  0.4590, -0.1829],
        [ 0.2476,  0.4240,  0.0115, -0.0101]])

In [45]:
net[0].weight.grad is None  # no backward pass has run yet

True

> 🔥 `.state_dict()` is the most convenient way to inspect all of a layer's parameters at once.

👉🏻 **Parameters are complex objects,
containing values, gradients,
and additional information.**
That is why we need to request the value explicitly.

In addition to the value, each parameter also allows us to access the gradient. Because we have not invoked backpropagation for this network yet, the gradient is still `None`.
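
A minimal sketch of that behavior: the gradient is `None` before any backward pass and becomes a tensor of the same shape as the weight afterwards. The network below uses `nn.Linear` with explicit sizes, and summing the output to get a scalar is an assumption made only so that `backward()` can be called.

```python
import torch
from torch import nn

net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
X = torch.rand(2, 4)

grad_before = net[0].weight.grad  # None: no backward pass yet
net(X).sum().backward()           # populates .grad on every parameter
grad_after = net[0].weight.grad

print(grad_before is None)        # True
print(grad_after.shape)           # torch.Size([8, 4])
```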

### ⭐ **All Parameters at Once**

When we need to perform operations on all parameters,
accessing them one-by-one can grow tedious.
The situation can grow especially unwieldy
when we work with more complex, e.g., nested, modules,
since we would need to recurse
through the entire tree to extract
each sub-module's parameters. Below we demonstrate accessing the parameters of all layers.


In [46]:
[(name, param.shape) for name, param in net.named_parameters()]

[('0.weight', torch.Size([8, 4])),
 ('0.bias', torch.Size([8])),
 ('2.weight', torch.Size([1, 8])),
 ('2.bias', torch.Size([1]))]
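
For nested modules, `named_parameters()` does the recursion for us and encodes the path in each name. The sketch below assumes a hypothetical helper `block()` and arbitrary layer sizes, chosen only to show the naming scheme.

```python
import torch
from torch import nn

def block():
    # Hypothetical helper: a small two-layer block
    return nn.Sequential(nn.Linear(4, 8), nn.ReLU(),
                         nn.Linear(8, 4), nn.ReLU())

nested = nn.Sequential(block(), block(), nn.Linear(4, 1))

names = [name for name, _ in nested.named_parameters()]
total = sum(p.numel() for p in nested.parameters())

print(names[0], names[-1])  # 0.0.weight 2.bias
print(total)                # 157
```

A name such as `0.2.weight` reads as "the weight of child `2` inside child `0`", so the same dotted path can be followed with indexing: `nested[0][2].weight`.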

## 🖇️ **Tied (SHARED) Parameters**

Often, we want to ***share parameters*** across multiple layers.
Let's see how to do this elegantly.

In the following we allocate a fully connected layer
and then use its parameters specifically
to set those of another layer.
Because the layers are lazy, we need to run the forward propagation
`net(X)` once before accessing the parameters; only then are their shapes inferred and the weights initialized.


In [None]:
# We need to give the shared layer a name so that we can refer to its
# parameters
shared = nn.LazyLinear(8)
net = nn.Sequential(nn.LazyLinear(8), nn.ReLU(),
                    shared, nn.ReLU(),
                    shared, nn.ReLU(),
                    nn.LazyLinear(1))

net(X)

In [None]:
# Check whether the parameters are the same
print(net[2].weight.data[0] == net[4].weight.data[0])

The indices `2` and `4` used above refer to the two positions of `shared` in the model:

    net = nn.Sequential(
        nn.LazyLinear(8), # 0 
        nn.ReLU(),        # 1
        shared,           # 2
        nn.ReLU(),        # 3
        shared,           # 4
        nn.ReLU(),        # 5
        nn.LazyLinear(1)  # 6
        )


In [None]:
# Modify the weight through index 2; index 4 refers to the same layer
net[2].weight.data[0, 0] = 100

# Make sure that they are actually the same object rather than just having the
# same value
print(net[2].weight.data[0] == net[4].weight.data[0])

### The essence of the shared weights.

Note that these are *shared weights*, not *shared nodes*: the same layer object appears at two positions in the network, so its **weights are reused** each time.

***Thus, it looks like the following 👇🏻***

<img src="../images/shared-weights.png">
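
A minimal sketch of what tying means in practice, using `nn.Linear` with explicit sizes (an assumption, so that no initial forward pass is needed). Both positions hold the very same `Parameter` object, and `named_parameters()` lists the shared layer only once.

```python
import torch
from torch import nn

shared = nn.Linear(8, 8)
net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(),
                    shared, nn.ReLU(),
                    shared, nn.ReLU(),
                    nn.Linear(8, 1))

# Identity check, stronger than comparing values elementwise
same_object = net[2].weight is net[4].weight
names = [name for name, _ in net.named_parameters()]

print(same_object)  # True
print(names)        # ['0.weight', '0.bias', '2.weight', '2.bias', '6.weight', '6.bias']

# One backward pass fills a single .grad for the shared weight
net(torch.rand(2, 4)).sum().backward()
same_grad = net[2].weight.grad is net[4].weight.grad
print(same_grad)    # True
```

Since there is only one weight tensor, its gradient accumulates the contributions from both places where the layer is used.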

# Next up,
We will see the initializers.