# Parameter Initialization

Now that we know how to access the parameters,
let's look at how to initialize them properly.


We often want to initialize our weights
according to a specific protocol rather than rely on the defaults. The framework provides the most commonly
used protocols and also lets us create a custom initializer.


In [1]:
import torch
from torch import nn

import warnings
warnings.filterwarnings('ignore')

By default, PyTorch initializes weights and biases
by drawing uniformly from a range computed from the layer's dimensions (for `nn.Linear`, from the number of input features).
PyTorch's `nn.init` module provides a variety
of preset initialization methods.
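
As a quick check of the default range, here is a minimal sketch. It assumes the current `nn.Linear` default, which draws both weights and biases from $U(-1/\sqrt{\text{fan\_in}}, 1/\sqrt{\text{fan\_in}})$; the standalone `layer` is a hypothetical example and is not part of `net`.

```python
import math
import torch
from torch import nn

# Hypothetical standalone layer, used only for this check
layer = nn.Linear(in_features=4, out_features=8)

# Assumed default for nn.Linear: U(-1/sqrt(fan_in), 1/sqrt(fan_in))
bound = 1 / math.sqrt(layer.in_features)

print(f"expected bound:   {bound:.4f}")
print(f"largest |weight|: {layer.weight.abs().max().item():.4f}")
print(f"largest |bias|:   {layer.bias.abs().max().item():.4f}")
```

If the assumption holds, both printed magnitudes stay at or below the bound.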


In [2]:
net = nn.Sequential(nn.LazyLinear(8), nn.ReLU(), nn.LazyLinear(1))
X = torch.rand(size=(2, 4))
net(X).shape

torch.Size([2, 1])

## **Built-in Initialization**

Let's begin by calling on built-in initializers.
The code below initializes all weight parameters
as Gaussian random variables
with standard deviation 0.01, while bias parameters are cleared to zero.


In [17]:
def init_normal(module):
    if type(module) == nn.Linear:
        nn.init.normal_(module.weight, mean=0, std=0.01)
        nn.init.zeros_(module.bias)

net.apply(init_normal)
net[0].weight.data[0], net[0].bias.data[0]

(tensor([-0.0068,  0.0050, -0.0197, -0.0054]), tensor(0.))

### What was that!?

In [4]:
# Have a look: `net` is an `nn.Sequential`, which is itself a module
net

Sequential(
  (0): Linear(in_features=4, out_features=8, bias=True)
  (1): ReLU()
  (2): Linear(in_features=8, out_features=1, bias=True)
)

In [5]:
# See? `nn.Sequential` is a module type
type(net)

torch.nn.modules.container.Sequential

In [6]:
# See? The first layer is of type `nn.Linear`
type(net[0])

torch.nn.modules.linear.Linear

In [15]:
nn.init.normal_?

Signature: nn.init.normal_(tensor: torch.Tensor, mean: float = 0.0, std: float = 1.0) -> torch.Tensor
Docstring:
Fills the input Tensor with values drawn from the normal
distribution :math:`\mathcal{N}(\text{mean}, \text{std}^2)`.

Args:
    tensor: an n-dimensional `torch.Tensor`
    mean: the mean of the normal distribution
    std: the standard deviation of the normal distribution

Examples:
    >>> w = torch.empty(3, 5)
    >>> nn.init.normal_(w)
File:      c:\users\aayus\anaconda3\envs\d2l\lib\site-packages\torch\nn\init.py
Type:      function

> ### 📖 Did you read it?
> It ***Fills the input Tensor with values drawn from the normal*** distribution.
>
> It FILLs the EXISTING tensor in place with normally distributed values; it does not create a new tensor.
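
To see the in-place behaviour directly, here is a small sketch on a throwaway tensor `w` (a hypothetical name, unrelated to `net`). It compares the storage address before and after the call.

```python
import torch
from torch import nn

w = torch.empty(3, 5)          # allocated, contents not yet meaningful
ptr_before = w.data_ptr()      # address of the underlying storage

out = nn.init.normal_(w, mean=0.0, std=0.01)

# Same storage before and after: the tensor was overwritten, not replaced
print(out.data_ptr() == ptr_before)
print(w.shape)
```

The returned tensor and `w` share the same storage, which is why `init_normal` does not need to assign anything back to `module.weight`.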

## Why check for `nn.Linear`?

In the `init_normal` function:
```python
def init_normal(module):
    if type(module) == nn.Linear:
        nn.init.normal_(module.weight, mean=0, std=0.01)
        nn.init.zeros_(module.bias)
```

They check for `nn.Linear` because:
1. **Selective Initialization**:  (👈🏻👈🏻👈🏻)
   - The model (`net`) contains multiple layers, including `nn.ReLU()`, which does not have weights or biases.
   - Initialization only makes sense for layers like `nn.Linear` that **have trainable parameters** (weights and biases).
2. **Avoid Errors**:
   - Layers such as `nn.ReLU` have no `weight` or `bias` attribute, so accessing `module.weight` or `module.bias` on them would raise an `AttributeError`.
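
Both points can be checked with a short sketch. The `relu` and `linear` names are hypothetical standalone layers, created only for this illustration.

```python
from torch import nn

relu = nn.ReLU()
linear = nn.Linear(4, 8)

print(hasattr(linear, "weight"))       # True: Linear has trainable weights
print(hasattr(relu, "weight"))         # False: ReLU has nothing to initialize
print(len(list(relu.parameters())))    # 0 parameters in total

# Without the type check, touching a missing attribute fails
error = None
try:
    nn.init.zeros_(relu.bias)
except AttributeError as err:
    error = err
print(type(error).__name__, "-", error)
```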

---

### What Are Other Layers in PyTorch Besides `nn.Linear`?

There are many types of layers in PyTorch, each serving different purposes. Here's an overview:

#### 1. **Linear Layers**:
   - `nn.Linear`: Fully connected (dense) layer for regular feedforward neural networks.

#### 2. **Activation Layers**:
   - `nn.ReLU`: Rectified Linear Unit activation.
   - `nn.Sigmoid`: Sigmoid activation.
   - `nn.Tanh`: Hyperbolic tangent activation.
   - `nn.Softmax`: Normalizes output to probabilities.

#### 3. **Convolutional Layers**:
   - `nn.Conv1d`: 1D convolution layer (e.g., for time-series data).
   - `nn.Conv2d`: 2D convolution layer (e.g., for images).
   - `nn.Conv3d`: 3D convolution layer (e.g., for volumetric data).

#### 4. **Pooling Layers**:
   - `nn.MaxPool2d`: 2D max-pooling layer.
   - `nn.AvgPool2d`: 2D average-pooling layer.
   - `nn.AdaptiveAvgPool2d`: Pooling layer that outputs a specified size, regardless of input size.

#### 5. **Recurrent Layers**:
   - `nn.RNN`: Basic recurrent neural network layer.
   - `nn.LSTM`: Long Short-Term Memory layer (handles sequences with memory).
   - `nn.GRU`: Gated Recurrent Unit (a simpler alternative to LSTM).

#### 6. **Normalization Layers**:
   - `nn.BatchNorm1d`: Batch normalization for 1D inputs.
   - `nn.LayerNorm`: Layer normalization.
   - `nn.InstanceNorm2d`: Instance normalization for images.

#### 7. **Dropout Layers**:
   - `nn.Dropout`: Randomly drops out elements in the input.
   - `nn.AlphaDropout`: Special dropout for self-normalizing neural networks (like SELU activations).

#### 8. **Embedding Layers**:
   - `nn.Embedding`: Maps discrete indices (e.g., word IDs) to dense vectors.

#### 9. **Transformers and Attention Layers**:
   - `nn.Transformer`: Implements the Transformer architecture.
   - `nn.MultiheadAttention`: Multi-head attention mechanism.

#### 10. **Custom Layers**:
   - You can also define your own layers by subclassing `nn.Module`.

---

### How Does This Relate to `init_normal`?

The `init_normal` function specifically initializes **only the weights and biases of layers that have them**, such as `nn.Linear`. It skips other layers, like `nn.ReLU`, which don’t require initialization.

If you wanted to initialize parameters for other layers (e.g., `nn.Conv2d`), you would need to extend the `if type(module) == ...` logic accordingly.
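
One possible extension is sketched below. `init_normal_multi` and `conv_net` are hypothetical names, and the sketch uses `isinstance` with a tuple instead of `type(module) == ...`, which is a design choice rather than the notebook's method: `isinstance` also matches subclasses of the listed layers.

```python
import torch
from torch import nn

def init_normal_multi(module):
    # Hypothetical extension of `init_normal`; isinstance also matches subclasses
    if isinstance(module, (nn.Linear, nn.Conv2d)):
        nn.init.normal_(module.weight, mean=0, std=0.01)
        if module.bias is not None:    # bias is None when bias=False
            nn.init.zeros_(module.bias)

# Small example network: a 4x4 single-channel input gives 2*2*2 = 8 features
conv_net = nn.Sequential(
    nn.Conv2d(1, 2, kernel_size=3), nn.ReLU(), nn.Flatten(), nn.Linear(8, 1))
conv_net.apply(init_normal_multi)

print(conv_net[0].bias.data, conv_net[3].bias.data)
```

The `ReLU` and `Flatten` layers are skipped because they match neither type.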

> ## 💭
> That means `.apply()` passes each layer in separately!

We can also initialize all the parameters
to a given constant value (say, 1).


In [7]:
def init_constant(module):
    if type(module) == nn.Linear:
        nn.init.constant_(module.weight, 1)
        nn.init.zeros_(module.bias)

net.apply(init_constant)
net[0].weight.data[0], net[0].bias.data[0]

(tensor([1., 1., 1., 1.]), tensor(0.))

# Till Now...

So far we have initialized the layers **formally**, with random values from a normal distribution and with a constant, using `nn.init`, which is the standard way of doing it.

**Now, we will see** how to use different **initialization methods** for different layers.

## **We can also apply <u>different initializers</u> for certain blocks.**

For example, below we initialize the first layer
with the Xavier initializer
and initialize the second layer
to a constant value of 42.


In [19]:
def init_xavier(module):
    if type(module) == nn.Linear:
        nn.init.xavier_uniform_(module.weight)

def init_42(module):
    if type(module) == nn.Linear:
        nn.init.constant_(module.weight, 42)

net[0].apply(init_xavier)
net[2].apply(init_42)
print(net[0].weight.data[0])
print(net[2].weight.data)

tensor([ 0.4750, -0.4546, -0.3507,  0.2501])
tensor([[42., 42., 42., 42., 42., 42., 42., 42.]])
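
For reference, `nn.init.xavier_uniform_` draws from $U(-a, a)$ with $a = \text{gain} \cdot \sqrt{6 / (\text{fan\_in} + \text{fan\_out})}$. The sketch below checks this on a hypothetical standalone `layer` with the same shape as the first layer, assuming the default gain of 1.

```python
import math
import torch
from torch import nn

layer = nn.Linear(4, 8)                # same shape as the first layer above
nn.init.xavier_uniform_(layer.weight)

fan_in, fan_out = layer.in_features, layer.out_features
bound = math.sqrt(6 / (fan_in + fan_out))   # with the default gain of 1

print(f"bound:            {bound:.4f}")
print(f"largest |weight|: {layer.weight.abs().max().item():.4f}")
```

With `fan_in = 4` and `fan_out = 8` the bound is about 0.7071, which is consistent with the first-layer values printed above.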


### How does `apply` work?

In [27]:
a = net[0]

In [38]:
def see_apply(module):
    print("This is:", type(module))

In [39]:
net.apply(see_apply);

This is: <class 'torch.nn.modules.linear.Linear'>
This is: <class 'torch.nn.modules.activation.ReLU'>
This is: <class 'torch.nn.modules.linear.Linear'>
This is: <class 'torch.nn.modules.container.Sequential'>
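
The output shows that `apply` visits the children first and the container itself last. A small sketch with a nested container (the `nested` network is a hypothetical example) makes the recursion visible:

```python
from torch import nn

nested = nn.Sequential(
    nn.Sequential(nn.Linear(4, 8), nn.ReLU()),   # inner block
    nn.Linear(8, 1),
)

visited = []
nested.apply(lambda module: visited.append(type(module).__name__))
print(visited)
# ['Linear', 'ReLU', 'Sequential', 'Linear', 'Sequential']
```

The inner `Sequential` appears after its own children, and the outer `Sequential` appears last of all.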


### **Custom Initialization**

Sometimes, the initialization methods we need
are not provided by the deep learning framework.
In the example below, we define an initializer
for any weight parameter $w$ using the following strange distribution:

$$
\begin{aligned}
    w \sim \begin{cases}
        U(5, 10) & \textrm{ with probability } \frac{1}{4} \\
            0    & \textrm{ with probability } \frac{1}{2} \\
        U(-10, -5) & \textrm{ with probability } \frac{1}{4}
    \end{cases}
\end{aligned}
$$


Again, we implement a `my_init` function to apply to `net`.


In [6]:
def my_init(module):
    if type(module) == nn.Linear:
        print("Init", *[(name, param.shape)
                        for name, param in module.named_parameters()][0])
        nn.init.uniform_(module.weight, -10, 10)
        module.weight.data *= module.weight.data.abs() >= 5

net.apply(my_init)
net[0].weight[:2]

Init weight torch.Size([8, 4])
Init weight torch.Size([1, 8])


tensor([[ 0.0000, -7.6364, -0.0000, -6.1206],
        [ 9.3516, -0.0000,  5.1208, -8.4003]], grad_fn=<SliceBackward0>)
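
To check that this recipe matches the distribution above, the sketch below applies the same two steps to a large throwaway tensor (the name `w` and the fixed seed are assumptions for illustration) and measures the fraction of entries in each case.

```python
import torch
from torch import nn

torch.manual_seed(0)                   # fixed seed, only for repeatability
w = torch.empty(1000, 1000)
nn.init.uniform_(w, -10, 10)           # step 1: U(-10, 10)
w *= w.abs() >= 5                      # step 2: zero out entries with |w| < 5

zero_frac = (w == 0).float().mean().item()
pos_frac = (w >= 5).float().mean().item()
neg_frac = (w <= -5).float().mean().item()
print(f"zero: {zero_frac:.3f}, positive: {pos_frac:.3f}, negative: {neg_frac:.3f}")
```

The fractions should come out close to 1/2, 1/4 and 1/4, because half of the interval $(-10, 10)$ lies within $(-5, 5)$.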

Note that we always have the option
of setting parameters directly.


In [7]:
net[0].weight.data[:] += 1
net[0].weight.data[0, 0] = 42
net[0].weight.data[0]

tensor([42.0000, -6.6364,  1.0000, -5.1206])
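
An alternative to going through `.data` is to modify the parameter inside a `torch.no_grad()` block. This is a sketch on a hypothetical standalone `layer`, not a change to the cell above.

```python
import torch
from torch import nn

layer = nn.Linear(4, 8)
before = layer.weight.detach().clone()   # copy kept for comparison

with torch.no_grad():                    # in-place edits are not tracked by autograd
    layer.weight.add_(1)
    layer.weight[0, 0] = 42

print(layer.weight[0])
print(layer.weight.requires_grad)        # the parameter remains trainable
```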

# Lazy initialization

In [42]:
net = nn.Sequential(nn.LazyLinear(256), nn.ReLU(), nn.LazyLinear(10))

In [45]:
net[0].weight.data

tensor([])

In [47]:
net[2].weight.data

tensor([])

Initially the parameters are not initialized (haha, the irony). Once we pass data through the network, the input dimension is inferred and the initialization happens.

In [48]:
net(X)

tensor([[ 0.2868, -0.2500,  0.1342, -0.2411,  0.0015,  0.0318, -0.1633, -0.0479,
          0.0599,  0.2075],
        [ 0.2482, -0.3402,  0.0272, -0.1896,  0.2319,  0.0018, -0.2308,  0.0148,
         -0.0428,  0.2038]], grad_fn=<AddmmBackward0>)

In [52]:
net[0].weight.data.shape

torch.Size([256, 4])

In [51]:
net[2].weight.data.shape

torch.Size([10, 256])

## What if we pass differently shaped data!? *(it will raise an error)*

In [53]:
X = torch.rand(size=(2, 10))
net(X)

RuntimeError: mat1 and mat2 shapes cannot be multiplied (2x10 and 4x256)
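
The error appears because the first forward pass fixes the input dimension for good. The sketch below reproduces this on a fresh network (`lazy_net` is a hypothetical name) and catches the exception instead of letting it stop the cell.

```python
import torch
from torch import nn

lazy_net = nn.Sequential(nn.LazyLinear(256), nn.ReLU(), nn.LazyLinear(10))
lazy_net(torch.rand(2, 4))             # first pass fixes in_features to 4
print(lazy_net[0].in_features)

message = None
try:
    lazy_net(torch.rand(2, 10))        # 10 features no longer fit
except RuntimeError as err:
    message = str(err)
print(message)
```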

## 🤯 Cool, right!?

# What if we **don't use** the lazy layers?

In [62]:
net = nn.Sequential(nn.Linear(10, 256), nn.ReLU(), nn.Linear(256, 10))

In [63]:
net[0].weight.data

tensor([[ 0.0469,  0.1504, -0.2587,  ..., -0.2354, -0.0803, -0.2436],
        [-0.0101,  0.0106,  0.2719,  ...,  0.0333,  0.1952, -0.3083],
        [-0.2032, -0.2326,  0.0322,  ..., -0.2869,  0.1113, -0.1812],
        ...,
        [-0.0319, -0.2055,  0.2413,  ..., -0.1496, -0.0833,  0.1338],
        [-0.0740, -0.2955,  0.1789,  ..., -0.0697, -0.2620,  0.1696],
        [ 0.1776,  0.1022, -0.2120,  ..., -0.1148, -0.1557,  0.0234]])

In [64]:
net[2].weight.data

tensor([[-0.0378, -0.0440, -0.0553,  ...,  0.0443,  0.0069,  0.0415],
        [ 0.0251, -0.0255,  0.0154,  ...,  0.0005,  0.0061,  0.0593],
        [ 0.0385, -0.0234,  0.0311,  ..., -0.0506,  0.0076, -0.0358],
        ...,
        [ 0.0095, -0.0243, -0.0069,  ..., -0.0271, -0.0247,  0.0382],
        [ 0.0410,  0.0261,  0.0582,  ..., -0.0459, -0.0078,  0.0260],
        [ 0.0171, -0.0460, -0.0417,  ..., -0.0047,  0.0191,  0.0323]])

## 🤯 It works!

# What if we apply Xavier before passing `X`?

In [66]:
def init_xavier(module):
    if type(module) == nn.Linear:
        nn.init.xavier_uniform_(module.weight)

def init_42(module):
    if type(module) == nn.Linear:
        nn.init.constant_(module.weight, 42)

net = nn.Sequential(nn.LazyLinear(256), nn.ReLU(), nn.LazyLinear(10))
net[0].apply(init_xavier)
net[2].apply(init_42)

LazyLinear(in_features=0, out_features=10, bias=True)

In [67]:
print(net[0].weight)
print(net[2].weight)

<UninitializedParameter>
<UninitializedParameter>


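Ahum. **We first need to pass the data**. Before the first forward pass the layers are still of type `nn.LazyLinear`, so the `type(module) == nn.Linear` check fails and the initializers are silently skipped.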
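
One way around this is to run a dummy forward pass first and apply the initializer afterwards. The sketch assumes that a batch with the right number of features (4 here) is available; the random input is a stand-in for real data.

```python
import torch
from torch import nn

def init_42(module):
    if type(module) == nn.Linear:
        nn.init.constant_(module.weight, 42)

net = nn.Sequential(nn.LazyLinear(256), nn.ReLU(), nn.LazyLinear(10))
print(type(net[2]) == nn.Linear)   # still a LazyLinear at this point

net(torch.rand(2, 4))              # dummy forward pass materializes the weights
print(type(net[2]) == nn.Linear)   # the layer is now a plain Linear

net[2].apply(init_42)
print(net[2].weight.data[0, :4])
```

After the forward pass the type check succeeds, so the custom initializer takes effect.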

# Summary

We can initialize parameters using built-in and custom initializers.

## Exercises

Look up the online documentation for more built-in initializers.
