<img align="center" style='max-width: 1000px' src="images/banner.png">

<img align="left" style='max-width: 150px; height: auto' src="images/hsg_logo.png">

# Lab 03 - "Hypernetworks"


## Objective

After learning the concepts in this lab, you should be able to:

- Understand the basic tools and methods needed for the implementation of Hypernetworks 
- Implement basic Hypernetworks
- Apply two different types of slicing techniques to reduce the size of the Hypernetwork


## Outline


1. **A Simple Hypernetwork**: How Hypernetworks can be implemented in PyTorch.
2. **Slicing Technique 1**: A slicing technique that treats all parameters as a single vector.
3. **Slicing Technique 2**: A layer-wise slicing technique.



<img align='center' style='max-width: 700px' src='images/hypernet_forward.gif'>

*Animation: The forward and backward propagation steps of a Hypernetwork. First, the Hypernetwork (the blue network) generates the weights $w$ of the main model (the white network) from some context information $t$. Then, the main model makes a prediction on input $x$ using the generated weights $w$ in a stateless manner. Finally, in the backpropagation step, the gradients of the Hypernetwork are obtained by backpropagating through the main model to the Hypernetwork.*
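
The backward step in the animation is an application of the chain rule. As a sketch, write $\theta$ for the Hypernetwork's own parameters, $h_\theta$ for the Hypernetwork, $f$ for the main model, $\hat{y}$ for its prediction and $\mathcal{L}$ for the loss. These symbols are notation assumed here for illustration; only $t$, $x$ and $w$ come from the caption above.

$$
w = h_\theta(t), \qquad \hat{y} = f(x; w), \qquad \frac{\partial \mathcal{L}}{\partial \theta} = \frac{\partial \mathcal{L}}{\partial \hat{y}} \, \frac{\partial \hat{y}}{\partial w} \, \frac{\partial w}{\partial \theta}
$$

Under this reading, only $\theta$ is updated by the optimizer, while $w$ is recomputed from $t$ in every forward pass.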

### Import Required Packages

In [1]:
import torch
from torchvision import datasets, transforms
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from tqdm import tqdm
import numpy as np


## <font color='red'>1. A Simple Hypernetwork</font>



In this section, we implement a simple Hypernetwork that generates the weights of an MLP. All weights are generated as a single flat vector.

First, let's start with the definition of the MLP model:

In [2]:
class MLP(nn.Module):
    def __init__(self, n_inp, n_hidden, n_out):
        super().__init__()
        self.linear_1 = nn.Linear(n_inp, n_hidden)
        self.linear_2 = nn.Linear(n_hidden, n_hidden)
        self.linear_3 = nn.Linear(n_hidden, n_hidden)
        self.classifier = nn.Linear(n_hidden, n_out)
        
        self.activ = nn.ReLU()
    
    def forward(self, x):
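        x = self.activ(self.linear_1(x))
        x = self.activ(self.linear_2(x))
        # linear_3 is applied as well, so that forward() matches the parameters counted below
        x = self.activ(self.linear_3(x))
        x = self.classifier(x)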

        return x

Typically, we initialize an instance of the model, and feed it with some input to get the output:

In [3]:
main_model = MLP(10, 50, 5)
x = torch.randn(32, 10)
out = main_model(x)
print(out.shape)

torch.Size([32, 5])


In this example, the weights are stored inside the model. Therefore, when we call `main_model(x)`, it uses the weights stored in the model's `state_dict` for the forward propagation. What if the weights are provided from outside the model?

Now, let's first generate the weights of the model with another neural network called the Hypernetwork:

### 1.1 Hypernetwork

To generate the weights of another model, we first need to know the number of parameters in the main model:

In [4]:
# Shape of each parameter as a dictionary of name: shape
param_shapes = {n: p.shape for (n, p) in main_model.named_parameters()}

# Total number of parameters in the model
num_params = sum([p.numel() for p in main_model.parameters()])
print("Number of parameters: ", num_params)

Number of parameters:  5905
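
The total can be checked by hand. The following is a minimal sketch in plain Python, assuming only that each `nn.Linear` layer holds `n_in * n_out` weights plus `n_out` biases; the helper `linear_param_count` is a hypothetical name introduced for illustration.

```python
# Parameter count of a fully connected layer: weights (n_out x n_in) plus biases (n_out)
def linear_param_count(n_in, n_out):
    return n_in * n_out + n_out

# Same sizes as MLP(10, 50, 5) above
n_inp, n_hidden, n_out = 10, 50, 5

layer_counts = {
    "linear_1": linear_param_count(n_inp, n_hidden),
    "linear_2": linear_param_count(n_hidden, n_hidden),
    "linear_3": linear_param_count(n_hidden, n_hidden),
    "classifier": linear_param_count(n_hidden, n_out),
}
for name, count in layer_counts.items():
    print(name, count)

total = sum(layer_counts.values())
print("Total:", total)  # 5905
```

Note that `linear_3` contributes to the count because it is registered in `__init__`, whether or not the forward pass uses it.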


Then, we need to define the architecture of the Hypernetwork. The Hypernetwork is also an MLP that maps the input to the vector space of the main model's weights.

The most basic version of a Hypernetwork treats all weights as a single vector as shown in the following animation:

<img align='center' style='max-width: 700px' src='images/no_slice.gif'>


In [5]:
class Hypernetwork(nn.Module):
    def __init__(self, n_inp, n_hidden, n_out):
        super().__init__()
        self.linear_1 = nn.Linear(n_inp, n_hidden)
        self.linear_2 = nn.Linear(n_hidden, n_out)

        self.actv = nn.ReLU()
    
    def forward(self, x):
        x = self.actv(self.linear_1(x))
        x = self.linear_2(x)

        return x

In [6]:
hypernetwork = Hypernetwork(10, 20, num_params)

The input to the Hypernetwork is a 10-dimensional tensor, which is mapped to a 20-dimensional hidden state. In the final linear layer, the hidden state is mapped to the vector space of the main model's weights. Let's forward a random tensor through the Hypernetwork:

In [7]:
hn_out = hypernetwork(torch.randn(1, 10))
print("Hypernetwork output size: ", hn_out.shape)

Hypernetwork output size:  torch.Size([1, 5905])


The output size is equal to the number of parameters in the main model. Now, we need to reshape this tensor back to the original tensor shapes of the main model.

To reshape the output of the Hypernetwork, we start at index zero of its output tensor and cut out, for each parameter tensor of the main model in turn, as many elements as that tensor holds. Each slice is then reshaped to the tensor's original shape. We store the reshaped results in a dictionary.

In [8]:
# Dictionary to store the reshaped parameters
reshaped_params = {}

# Start with an offset of 0
offset = 0
for (n, p) in param_shapes.items():
    sliced_parameter = hn_out[0][offset:offset+p.numel()]
    reshaped_params[n] = sliced_parameter.view(p)
    offset += p.numel()

Let's print the shapes of the reshaped parameters:

In [9]:
for n, p in reshaped_params.items():
    print(n, p.shape)

linear_1.weight torch.Size([50, 10])
linear_1.bias torch.Size([50])
linear_2.weight torch.Size([50, 50])
linear_2.bias torch.Size([50])
linear_3.weight torch.Size([50, 50])
linear_3.bias torch.Size([50])
classifier.weight torch.Size([5, 50])
classifier.bias torch.Size([5])
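
To make the offset bookkeeping concrete, the sketch below recomputes the index range that each parameter occupies in the flat 5905-element vector. It uses plain Python tuples in place of `torch.Size`, and the helper `slice_bounds` is a hypothetical name, not part of the lab code.

```python
from math import prod

# Shapes in the order printed above for MLP(10, 50, 5)
shapes = {
    "linear_1.weight": (50, 10),
    "linear_1.bias": (50,),
    "linear_2.weight": (50, 50),
    "linear_2.bias": (50,),
    "linear_3.weight": (50, 50),
    "linear_3.bias": (50,),
    "classifier.weight": (5, 50),
    "classifier.bias": (5,),
}

def slice_bounds(shapes):
    """Return {name: (start, end)} half-open index ranges into the flat vector."""
    bounds, offset = {}, 0
    for name, shape in shapes.items():
        n = prod(shape)  # number of elements in this parameter tensor
        bounds[name] = (offset, offset + n)
        offset += n
    return bounds

bounds = slice_bounds(shapes)
for name, (start, end) in bounds.items():
    print(f"{name:18s} [{start:5d}, {end:5d})")
```

The last range ends at 5905, so every element of the Hypernetwork's output is used exactly once.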


### 1.2 Forwarding with Parameters 

Now, an important question to answer is: how do we use these generated weights to make predictions with the main model?

<font color='darkgreen'>[Q] Can we just copy these weights to the `state_dict` dictionary of the model?</font>


In general, we have two ways to forward input with parameters:

1. Defining the function `forward_with_params()`
2. Calling the main model in a stateless way

#### Method 1: Defining a new forward function that accepts external parameters

We can add a new forward function that receives both $x$ and $w$:

In [10]:
# Same model with a different forward function
class ModelV2(nn.Module):
    def __init__(self, n_inp, n_hidden, n_out):
        super().__init__()
        # ! These layers are not used during the forward pass
        self.linear_1 = nn.Linear(n_inp, n_hidden)
        self.linear_2 = nn.Linear(n_hidden, n_hidden)
        self.linear_3 = nn.Linear(n_hidden, n_hidden)
        self.classifier = nn.Linear(n_hidden, n_out)
        
        self.activ = nn.ReLU()
    
    def forward_with_params(self, x, params):
        # Params is a dictionary of name: tensor
        x = F.linear(x, params["linear_1.weight"], params["linear_1.bias"])
        x = F.relu(x) 
        x = F.linear(x, params["linear_2.weight"], params["linear_2.bias"])
        x = F.relu(x)
        x = F.linear(x, params["linear_3.weight"], params["linear_3.bias"])
        x = F.relu(x)
        x = F.linear(x, params["classifier.weight"], params["classifier.bias"])

        return x

<font color='darkgreen'> [Q] Why are the operations inside the new forward function performed as functionals instead of using the layers?</font>

Now, we create an instance of the model with the forward-with-parameters pass, and feed it a random input tensor together with the generated weights:

In [11]:
# Same layer sizes as main_model, whose parameter shapes the generated weights follow
model = ModelV2(10, 50, 5)
out = model.forward_with_params(torch.randn(1, 10), reshaped_params)

# Print the output shape
print(out.shape)

torch.Size([1, 5])


#### Method 2: Stateless call

To make stateless calls with a stateful model, we can use the following function from PyTorch (available since version 2.0):

In [12]:
# torch.func.functional_call supersedes torch.nn.utils.stateless.functional_call in PyTorch 2.0
from torch.func import functional_call

We can directly use the main model without adding a new forward function. The only thing we need to do is to call it as below:

In [13]:
out = functional_call(main_model, reshaped_params, torch.randn(1, 10))
print(out.shape)

torch.Size([1, 5])




It's that simple! So far, we have learned to use an external model called the Hypernetwork to generate the weights of a main model and to make predictions with the generated weights.

#### <font color='darkred'>**BUT**, there is a big problem!</font>

The number of parameters in the Hypernetwork can easily "explode" this way. The Hypernetwork uses a linear layer as its final layer to map its hidden state to the vector space of the main model's weights. This essentially means that, if the size of the hidden state is $S$ and the main model has $N$ parameters in total, the final layer of the Hypernetwork alone holds $N \times S$ weights (plus $N$ biases):

In [14]:
n_params_main_model = sum([p.numel() for p in main_model.parameters()])
n_params_hypernetwork = sum([p.numel() for p in hypernetwork.parameters()])

print("Number of parameters in main model: ", n_params_main_model)
print("Number of parameters in hypernetwork: ", n_params_hypernetwork)

# Ratio of parameters in hypernetwork to main model
print("Ratio: ", n_params_hypernetwork / n_params_main_model)

Number of parameters in main model:  5905
Number of parameters in hypernetwork:  124225
Ratio:  21.037256562235395


This is super inefficient. The Hypernetwork has about 21 times as many parameters as the main model. We need to find better ways to generate the weights.
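
The count printed above can be reproduced with a few lines of arithmetic. This is a minimal sketch in plain Python, using the sizes from `Hypernetwork(10, 20, num_params)`; the variable names are chosen here for illustration.

```python
n_inp, S = 10, 20   # Hypernetwork input size and hidden size used above
N = 5905            # number of parameters in the main model

first_layer = n_inp * S + S   # weights + biases of linear_1
final_layer = S * N + N       # weights + biases of linear_2 (grows linearly with N)
total = first_layer + final_layer

print("First layer:", first_layer)   # 220
print("Final layer:", final_layer)   # 124005
print("Total:", total)               # 124225
print("Ratio:", total / N)
```

Almost all parameters sit in the final layer, which is why the slicing techniques in the next sections target that layer.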

## <font color='red'>2. Slicing Technique 1</font>



In this part, we design a slicing technique that splits the $N$ parameters of the main model into $K$ chunks of equal size, which requires $N \mod K = 0$.

The Hypernetwork then generates the weights of each chunk separately, conditioned on the chunk ID:

<img align='center' style='max-width: 700px' src='images/slice_1.gif'>

In this example, we want to implement an MLP to train an MNIST classifier:

In [15]:
class MLP(nn.Module):
    def __init__(self, n_inp, n_hidden, n_out):
        super().__init__()
        self.linear_1 = nn.Linear(n_inp, n_hidden)
        self.linear_2 = nn.Linear(n_hidden, n_hidden)
        self.linear_3 = nn.Linear(n_hidden, n_hidden)
        self.classifier = nn.Linear(n_hidden, n_out)

        self.activ = nn.ReLU()

    def forward(self, x):
        x = x.view(x.shape[0], -1)
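        x = self.activ(self.linear_1(x))
        x = self.activ(self.linear_2(x))
        # linear_3 is applied as well, so that forward() matches forward_with_params()
        x = self.activ(self.linear_3(x))
        x = self.classifier(x)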

        return x

    def forward_with_params(self, x, params):
        x = x.view(x.shape[0], -1)
        x = F.linear(x, params["linear_1.weight"], params["linear_1.bias"])
        x = F.relu(x)
        x = F.linear(x, params["linear_2.weight"], params["linear_2.bias"])
        x = F.relu(x)
        x = F.linear(x, params["linear_3.weight"], params["linear_3.bias"])
        x = F.relu(x)
        x = F.linear(x, params["classifier.weight"], params["classifier.bias"])
        return x


Similar to the previous example, we need to implement a Hypernetwork that generates the weights of this MLP. The important point here is to slice the weights in the output as explained above.

To avoid "parameter explosion" in the Hypernetwork, we use a single linear mapping from the hidden state of the Hypernetwork to each chunk of the main model's weights. Sharing one mapping across all chunks requires conditioning it on the chunk ID. Therefore, we define an **embedding layer** that maps each chunk ID to a vector, which is then concatenated to the hidden state of the Hypernetwork:

In [16]:
class Hypernetwork(nn.Module):
    def __init__(self, n_inp, n_hidden, n_out, chunk_size, dim_emb):
        super().__init__()
        # Embedding layer with one embedding vector per chunk
        self.n_chunks = n_out // chunk_size
        self.emb = nn.Embedding(self.n_chunks, dim_emb)

        # Initialize emb weights with uniform distribution
        nn.init.uniform_(self.emb.weight, -1.0, 1.0)

        # Hypernetwork's layers
        self.linear_1 = nn.Linear(n_inp, n_hidden)
        self.linear_2 = nn.Linear(n_hidden + dim_emb, chunk_size)
        
        # Activation function
        self.actv = nn.ReLU()

    def forward(self, x):
        # Retrieve the embeddings of all chunks
        emb_inp = torch.arange(self.n_chunks).to(x.device)
        emb = self.emb(emb_inp)

        # Flatten x and apply the first linear layer
        x = x.view(x.shape[0], -1)
        x = self.actv(self.linear_1(x))

        # Unsqueeze x in the second dimension and replicate it n_chunks times
        x = x.unsqueeze(1).repeat(1, self.n_chunks, 1)

        # Unsqueeze emb in the first dimension and replicate it once per sample in the batch
        emb = emb.unsqueeze(0).repeat(x.shape[0], 1, 1)

        # Concatenate x and emb along the last dimension
        x = torch.cat([x, emb], dim=-1)

        # Apply the second linear layer on the conditioned x
        x = self.linear_2(x)

        # Flatten the output and return it
        x = x.view(x.shape[0], -1)

        return x
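
The sequence of `unsqueeze`, `repeat` and `cat` calls is easier to follow when the tensor shapes are written out. The sketch below only tracks shapes in plain Python and does not run any PyTorch code; the function `trace_shapes` and the small example sizes are assumptions made for illustration.

```python
def trace_shapes(batch, n_inp, n_hidden, n_chunks, dim_emb, chunk_size):
    """List the tensor shape after each step of the chunked Hypernetwork's forward pass."""
    return [
        ("input, flattened", (batch, n_inp)),
        ("after linear_1 + ReLU", (batch, n_hidden)),
        ("hidden state repeated per chunk", (batch, n_chunks, n_hidden)),
        ("embeddings repeated per sample", (batch, n_chunks, dim_emb)),
        ("after concatenation", (batch, n_chunks, n_hidden + dim_emb)),
        ("after linear_2", (batch, n_chunks, chunk_size)),
        ("after flattening", (batch, n_chunks * chunk_size)),
    ]

# Illustrative sizes only: 4 samples, 8 inputs, 3 chunks of 6 weights each
steps = trace_shapes(batch=4, n_inp=8, n_hidden=5, n_chunks=3, dim_emb=2, chunk_size=6)
for name, shape in steps:
    print(f"{name:34s} {shape}")
```

The last dimension of the final shape equals `n_chunks * chunk_size`, which is the number of parameters of the main model when the chunk size divides it evenly.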

**The next question to answer is: what is a good chunk size?**

Since the number of parameters in the main model can vary, we define a function that takes the number of parameters $N$ and returns the biggest divisor of $N$ that is not larger than $\sqrt{N}$.

In [17]:
def biggest_divisor(n):
    # Find the biggest divisor of n that is not larger than the square root of n
    for i in range(int(n**0.5), 0, -1):
        if n % i == 0:
            return i

Good, the `biggest_divisor` function finds the chunk size for us.

Now, we need to define an instance of the main model and its corresponding Hypernetwork:

In [18]:
# Initialize random seeds in PyTorch and NumPy for reproducibility
torch.manual_seed(0)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

np.random.seed(0)

In [19]:
# Main model for the MNIST dataset
main_model = MLP(28*28, 50, 10)

# Define parameters shapes and number of parameters
param_shapes = {n: p.shape for (n, p) in main_model.named_parameters()}
num_params = sum([p.numel() for p in main_model.parameters()])

# Chunk size
chunk_size = biggest_divisor(num_params)
print("Chunk size:", chunk_size)

# Hypernetwork with sliced output
hypernetwork = Hypernetwork(n_inp=28*28, n_hidden=5, n_out=num_params, chunk_size=chunk_size, dim_emb=2)

Chunk size: 20
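
The chunk size of 20 can be verified without PyTorch. The sketch below recomputes the parameter count of `MLP(28*28, 50, 10)` and applies the same divisor search; `largest_divisor_up_to_sqrt` is a hypothetical stand-in name for the lab's `biggest_divisor`, used here to avoid shadowing it.

```python
def largest_divisor_up_to_sqrt(n):
    # Search downwards from floor(sqrt(n)); 1 always divides n, so the loop always returns
    for i in range(int(n ** 0.5), 0, -1):
        if n % i == 0:
            return i

n_inp, n_hidden, n_out = 28 * 28, 50, 10

# Three layers of 50 units plus the classifier, each with weights and biases
N = (n_inp * n_hidden + n_hidden) \
    + 2 * (n_hidden * n_hidden + n_hidden) \
    + (n_hidden * n_out + n_out)

chunk_size = largest_divisor_up_to_sqrt(N)
n_chunks = N // chunk_size

print("N:", N)                     # 44860
print("Chunk size:", chunk_size)   # 20
print("Number of chunks:", n_chunks)
```

Since 44860 factors as $2^2 \cdot 5 \cdot 2243$ with 2243 prime, 20 is the largest divisor below $\sqrt{44860} \approx 211.8$, which leaves 2243 chunks.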


Let's compare the number of parameters:

In [20]:
n_params_main_model = sum([p.numel() for p in main_model.parameters()])
n_params_hypernetwork = sum([p.numel() for p in hypernetwork.parameters()])

print("Number of parameters in main model: ", n_params_main_model)
print("Number of parameters in hypernetwork: ", n_params_hypernetwork)

# Ratio of parameters in hypernetwork to main model
print("Ratio: ", n_params_hypernetwork / n_params_main_model)

Number of parameters in main model:  44860
Number of parameters in hypernetwork:  8571
Ratio:  0.1910610789121712


Great! The number of parameters in the Hypernetwork is now much smaller than the number of parameters in the main model.
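
To see where the 8571 parameters come from, the count can be split into its three parts. This is a minimal sketch in plain Python with the values passed to the Hypernetwork above; the variable names are chosen here for illustration.

```python
n_inp, n_hidden, dim_emb = 28 * 28, 5, 2
N, chunk_size = 44860, 20
n_chunks = N // chunk_size   # 2243 chunk IDs

emb_params = n_chunks * dim_emb                                   # one embedding vector per chunk
linear_1_params = n_inp * n_hidden + n_hidden                     # input -> hidden state
linear_2_params = (n_hidden + dim_emb) * chunk_size + chunk_size  # shared across all chunks

total = emb_params + linear_1_params + linear_2_params
print(emb_params, linear_1_params, linear_2_params)  # 4486 3925 160
print("Total:", total)                                # 8571
```

The output layer now costs only 160 parameters because it is shared across chunks; the chunk embeddings and the input layer dominate the count instead.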

One last step before training the model is to define the function that reshapes the generated parameters. The reshape function can be different for each slicing technique.


In [21]:
def reshape_generated_parameters(hn_out, param_shapes):
    reshaped_params = {}
    offset = 0
    for (n, p) in param_shapes.items():
        # hn_out[0]: only the weights generated for the first sample in the batch are used
        sliced_parameter = hn_out[0][offset:offset+p.numel()]
        reshaped_params[n] = sliced_parameter.view(p)
        offset += p.numel()

    return reshaped_params

Now, we train the model on the MNIST dataset to see how the final performance will be:

In [22]:
# Load the MNIST dataset
mnist_transform = transforms.Compose(
    [transforms.ToTensor(), transforms.Normalize((0.2860,), (0.3530,))]
)
train_set = datasets.MNIST(root="./data", train=True,
                           download=True, transform=mnist_transform)
test_set = datasets.MNIST(root="./data", train=False,
                          download=True, transform=mnist_transform)
train_loader = DataLoader(train_set, batch_size=64,
                          shuffle=True)
test_loader = DataLoader(test_set, batch_size=64)

In [23]:
# Define the optimizer and the loss function
optimizer = torch.optim.Adam(hypernetwork.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

# Set the device
device = torch.device("cpu")

# Move the model and the hypernetwork to the device
main_model.to(device)
hypernetwork.to(device)

# Standard PyTorch training loop
n_epochs = 10
for epoch in range(n_epochs):
    pbar = tqdm(train_loader)
    for batch in pbar:
        x, y = batch
        x, y = x.to(device), y.to(device)

        # Zero-grad optimizer for the hypernetwork
        optimizer.zero_grad()

        # Generate weights with the hypernetwork
        hn_out = hypernetwork(x)

        # Reshape generated weights
        reshaped_params = reshape_generated_parameters(hn_out, param_shapes)

        # Make prediction
        pred = main_model.forward_with_params(x, reshaped_params)

        # Compute loss and backpropagate
        loss = criterion(pred, y)
        loss.backward()

        # Optimizer step update
        optimizer.step()

        # Set progress bar description
        pbar.set_description(f"Loss value: {loss.item():.4f}")

    with torch.no_grad():
        # Evaluate model after each epoch
        batch_accuracies = []
        pbar_test = tqdm(test_loader)
        for batch in test_loader:
            x, y = batch
            x, y = x.to(device), y.to(device)

            # Generate weights with the hypernetwork
            hn_out = hypernetwork(x)
            reshaped_params = reshape_generated_parameters(
                hn_out, param_shapes)

            # Make prediction
            pred = main_model.forward_with_params(x, reshaped_params)
            n_corrects = sum(pred.argmax(dim=1) == y).item()
            acc_batch = n_corrects / len(x)
            batch_accuracies.append(acc_batch)
            pbar_test.update()

    print(f"Average accuracy for epoch {epoch}: {sum(batch_accuracies)/len(batch_accuracies):.4f} \n")


  0%|          | 0/938 [00:00<?, ?it/s]

Loss value: 2.3006: 100%|██████████| 938/938 [00:12<00:00, 75.00it/s]
 89%|████████▊ | 139/157 [00:00<00:00, 188.54it/s]

Average accuracy for epoch 0: 0.1097 



Loss value: 2.2759: 100%|██████████| 938/938 [00:13<00:00, 70.44it/s]
100%|██████████| 157/157 [00:14<00:00, 11.11it/s] 


Average accuracy for epoch 1: 0.1139 



Loss value: 2.3060: 100%|██████████| 938/938 [00:12<00:00, 76.09it/s]
100%|██████████| 157/157 [00:13<00:00, 11.95it/s] 
 92%|█████████▏| 145/157 [00:00<00:00, 200.34it/s]

Average accuracy for epoch 2: 0.1061 



Loss value: 0.8949: 100%|██████████| 938/938 [00:13<00:00, 70.83it/s]
100%|██████████| 157/157 [00:14<00:00, 11.19it/s] 


Average accuracy for epoch 3: 0.8115 



Loss value: 0.3389: 100%|██████████| 938/938 [00:12<00:00, 76.89it/s]
100%|██████████| 157/157 [00:12<00:00, 12.08it/s] 
 92%|█████████▏| 145/157 [00:00<00:00, 202.73it/s]

Average accuracy for epoch 4: 0.8663 



Loss value: 0.5424: 100%|██████████| 938/938 [00:13<00:00, 70.86it/s]
100%|██████████| 157/157 [00:14<00:00, 11.20it/s] 


Average accuracy for epoch 5: 0.8915 



Loss value: 0.2130: 100%|██████████| 938/938 [00:12<00:00, 75.56it/s]
100%|██████████| 157/157 [00:13<00:00, 11.88it/s] 
 96%|█████████▌| 150/157 [00:00<00:00, 186.14it/s]

Average accuracy for epoch 6: 0.9048 



Loss value: 0.1965: 100%|██████████| 938/938 [00:13<00:00, 69.76it/s]
100%|██████████| 157/157 [00:14<00:00, 10.98it/s] 


Average accuracy for epoch 7: 0.9079 



Loss value: 0.0528: 100%|██████████| 938/938 [00:12<00:00, 74.50it/s]
100%|██████████| 157/157 [00:13<00:00, 11.68it/s] 
 90%|█████████ | 142/157 [00:00<00:00, 198.44it/s]

Average accuracy for epoch 8: 0.9203 



Loss value: 0.2256: 100%|██████████| 938/938 [00:13<00:00, 69.72it/s]
100%|██████████| 157/157 [00:14<00:00, 11.02it/s] 


Average accuracy for epoch 9: 0.9226 



The Hypernetwork reaches about 92% test accuracy after 10 epochs while having roughly 80% fewer parameters than the main model it generates.
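
One detail worth keeping in mind when reading these numbers: the evaluation loop averages per-batch accuracies, and the last test batch is smaller than the others. The sketch below compares that average with the overall fraction of correct predictions; the per-batch counts of correct predictions are hypothetical values chosen for illustration, not results from the run above.

```python
def mean_of_batch_accuracies(correct_per_batch, batch_sizes):
    # What the evaluation loop computes: every batch gets the same weight
    accs = [c / b for c, b in zip(correct_per_batch, batch_sizes)]
    return sum(accs) / len(accs)

def overall_accuracy(correct_per_batch, batch_sizes):
    # Fraction of all test samples that were classified correctly
    return sum(correct_per_batch) / sum(batch_sizes)

# MNIST test set: 10,000 images with batch_size=64 -> 156 full batches + one batch of 16
batch_sizes = [64] * 156 + [16]

# Hypothetical counts of correct predictions per batch (illustration only)
correct = [60] * 156 + [10]

batch_mean = mean_of_batch_accuracies(correct, batch_sizes)
overall = overall_accuracy(correct, batch_sizes)
print(f"Mean of batch accuracies: {batch_mean:.4f}")
print(f"Overall accuracy:         {overall:.4f}")
```

With 157 batches the two values differ only slightly, so the reported accuracies remain a reasonable summary.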

## <font color='red'>3. Slicing Technique 2</font>



In the second slicing technique, we have separate heads for each layer of the main model.

In each "HyperHead", the weights of the corresponding layer are sliced and then generated conditioned on the chunk ID:

<img align='center' style='max-width: 700px' src='images/slice_2.gif'>

We want to use the same model as in the first slicing technique:

In [24]:
class MLP(nn.Module):
    def __init__(self, n_inp, n_hidden, n_out):
        super().__init__()
        self.linear_1 = nn.Linear(n_inp, n_hidden)
        self.linear_2 = nn.Linear(n_hidden, n_hidden)
        self.linear_3 = nn.Linear(n_hidden, n_hidden)
        self.classifier = nn.Linear(n_hidden, n_out)

        self.activ = nn.ReLU()

    def forward_with_params(self, x, params):
        x = x.view(x.shape[0], -1)
        x = F.linear(x, params["linear_1.weight"], params["linear_1.bias"])
        x = F.relu(x)
        x = F.linear(x, params["linear_2.weight"], params["linear_2.bias"])
        x = F.relu(x)
        x = F.linear(x, params["linear_3.weight"], params["linear_3.bias"])
        x = F.relu(x)
        x = F.linear(x, params["classifier.weight"], params["classifier.bias"])
        return x


The first module of the Hypernetwork that we need to implement is called the "HyperHead". Each HyperHead has its own embedding layer, and the number of chunks in the head is determined by the original size of the weight vector in the corresponding layer:

In [25]:
class HyperHead(nn.Module):
    def __init__(self, n_hidden, n_out, chunk_size, dim_emb):
        super().__init__()
        n_chunks = n_out // chunk_size 
        # Embedding layer for each head
        self.emb = nn.Embedding(n_chunks, dim_emb)
        self.n_chunks = n_chunks

        # Initialize emb with uniform distribution
        nn.init.uniform_(self.emb.weight, -1.0, 1.0)

        # Output head linear mapping
        self.linear_1 = nn.Linear(n_hidden + dim_emb, chunk_size)

    def forward(self, x):
        # Retrieve the embeddings of all chunks in this head
        emb_inp = torch.arange(self.n_chunks).to(x.device)
        emb = self.emb(emb_inp)

        # Unsqueeze x in the second dimension and replicate it n_chunks times
        x = x.unsqueeze(1).repeat(1, self.n_chunks, 1)

        # Unsqueeze emb in the first dimension and replicate it once per sample in the batch
        emb = emb.unsqueeze(0).repeat(x.shape[0], 1, 1)

        # Concatenate x and emb along the last dimension
        x = torch.cat([x, emb], dim=-1)

        x = F.relu(x)
        x = self.linear_1(x)
        x = x.view(x.shape[0], -1)

        return x

As mentioned before, the function that finds the chunk size can be modified according to the slicing method. For example, here we want every parameter tensor with fewer than 50 elements to be generated as a single chunk. Therefore, if `n<50`, `chunk_size=n`:

In [26]:
def biggest_divisor(n):
    if n < 50:
        return n 
    # Find the biggest divisor of n that is not larger than the square root of n
    for i in range(int((n)**0.5), 0, -1):
        if n % i == 0:
            return i
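
Applying the function above to each parameter tensor of `MLP(28*28, 50, 10)` shows how the layers end up being sliced. This is a sketch in plain Python; `chunk_size_for` is a hypothetical copy of the modified `biggest_divisor`, renamed to avoid shadowing it, and the shapes are written as tuples instead of `torch.Size`.

```python
from math import prod

def chunk_size_for(n):
    # Small tensors are generated as a single chunk
    if n < 50:
        return n
    # Otherwise use the biggest divisor of n that is not larger than sqrt(n)
    for i in range(int(n ** 0.5), 0, -1):
        if n % i == 0:
            return i

shapes = {
    "linear_1.weight": (50, 784),
    "linear_1.bias": (50,),
    "linear_2.weight": (50, 50),
    "linear_2.bias": (50,),
    "linear_3.weight": (50, 50),
    "linear_3.bias": (50,),
    "classifier.weight": (10, 50),
    "classifier.bias": (10,),
}

# name -> (chunk size, number of chunks)
chunks = {}
for name, shape in shapes.items():
    n = prod(shape)
    size = chunk_size_for(n)
    chunks[name] = (size, n // size)
    print(f"{name:18s} n={n:6d} chunk_size={size:4d} n_chunks={n // size:4d}")
```

Only `classifier.bias`, with 10 elements, falls below the threshold of 50 and is generated in one piece; the 50-element biases are split into 10 chunks of 5.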

The implementation of the Hypernetwork's class also needs to change accordingly. The Hypernetwork needs to keep a list of HyperHeads, one for each parameter tensor in the model.

<font color='darkgreen'>[Q] Why not use a list to store the HyperHead instead of using `nn.ModuleList`?</font>

In [27]:
class Hypernetwork(nn.Module):
    def __init__(self, n_inp, n_hidden, param_shapes, dim_emb):
        super().__init__()
        self.linear_1 = nn.Linear(n_inp, n_hidden)

        self.heads = nn.ModuleList(
            [
                HyperHead(n_hidden,
                          pshape.numel(),
                          biggest_divisor(pshape.numel()),
                          dim_emb)
                for pshape in param_shapes]
        )
        self.actv = nn.ReLU()

    def forward(self, x):
        x = x.view(x.shape[0], -1)
        x = self.actv(self.linear_1(x))

        # Loop over all heads and generate weights
        head_outs = [self.heads[i](x) for i in range(len(self.heads))]
        head_outs = torch.concat(head_outs, dim=1)

        return head_outs

Now, let's initialize the main model and its corresponding Hypernetwork:

In [28]:
# Initialize random seeds in PyTorch and NumPy for reproducibility
torch.manual_seed(0)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

np.random.seed(0)

In [29]:
# Initialize the main model
main_model = MLP(28*28, 50, 10)

# Define parameters shapes and number of parameters
param_shapes = {n: p.shape for (n, p) in main_model.named_parameters()}
num_params = sum([p.numel() for p in main_model.parameters()])
hn_inp_shape = 28 * 28

# Initialize the Hypernetwork
hypernetwork = Hypernetwork(hn_inp_shape, 4, list(param_shapes.values()), 4)

We are also interested in how much compression this method achieves with the chosen values:

In [30]:
n_params_main_model = sum([p.numel() for p in main_model.parameters()])
n_params_hypernetwork = sum([p.numel() for p in hypernetwork.parameters()])

print("Number of parameters in main model: ", n_params_main_model)
print("Number of parameters in hypernetwork: ", n_params_hypernetwork)

# Ratio of parameters in hypernetwork to main model
print("Ratio: ", n_params_hypernetwork / n_params_main_model)

Number of parameters in main model:  44860
Number of parameters in hypernetwork:  7633
Ratio:  0.17015158270173875


That's good, very similar to the previous method!
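
The 7633 parameters can be broken down into the shared input layer and the eight heads. The sketch below is plain Python under the settings used above (`n_hidden=4`, `dim_emb=4`); the helpers `chunk_size_for` and `head_param_count` are hypothetical names introduced for illustration.

```python
def chunk_size_for(n):
    # Same rule as the modified biggest_divisor: small tensors form a single chunk
    if n < 50:
        return n
    for i in range(int(n ** 0.5), 0, -1):
        if n % i == 0:
            return i

def head_param_count(n_out, n_hidden, dim_emb):
    chunk = chunk_size_for(n_out)
    n_chunks = n_out // chunk
    embedding = n_chunks * dim_emb                    # one embedding vector per chunk
    linear = (n_hidden + dim_emb) * chunk + chunk     # weights + biases of the head's linear layer
    return embedding + linear

n_inp, n_hidden, dim_emb = 28 * 28, 4, 4

# Number of elements per parameter tensor of MLP(28*28, 50, 10), in named_parameters() order
tensor_sizes = [39200, 50, 2500, 50, 2500, 50, 500, 10]

shared = n_inp * n_hidden + n_hidden   # linear_1 of the Hypernetwork
heads = [head_param_count(n, n_hidden, dim_emb) for n in tensor_sizes]
total = shared + sum(heads)

print("Shared layer:", shared)   # 3140
print("Heads:", heads)
print("Total:", total)           # 7633
```

The head for `linear_1.weight` is by far the largest, since that tensor holds most of the main model's parameters.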

It's time to train the Hypernetwork on the MNIST dataset:

In [31]:
# Define the optimizer and the loss function
optimizer = torch.optim.Adam(hypernetwork.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

# Set the device
device = torch.device("cpu")

# Move the model and the hypernetwork to the device
main_model.to(device)
hypernetwork.to(device)

# Standard PyTorch training loop
n_epochs = 10
for epoch in range(n_epochs):
    pbar = tqdm(train_loader)
    for batch in pbar:
        x, y = batch
        x, y = x.to(device), y.to(device)

        # Zero-grad optimizer for the hypernetwork
        optimizer.zero_grad()

        # Generate weights with the hypernetwork
        hn_out = hypernetwork(x)

        # Reshape generated weights
        reshaped_params = reshape_generated_parameters(hn_out, param_shapes)

        # Make prediction
        pred = main_model.forward_with_params(x, reshaped_params)

        # Compute loss and backpropagate
        loss = criterion(pred, y)
        loss.backward()

        # Optimizer step update
        optimizer.step()

        # Set progress bar description
        pbar.set_description(f"Loss value: {loss.item():.4f}")

    with torch.no_grad():
        # Evaluate model after each epoch
        batch_accuracies = []
        pbar_test = tqdm(test_loader)
        for batch in test_loader:
            x, y = batch
            x, y = x.to(device), y.to(device)

            # Generate weights with the hypernetwork
            hn_out = hypernetwork(x)
            reshaped_params = reshape_generated_parameters(
                hn_out, param_shapes)

            # Make prediction
            pred = main_model.forward_with_params(x, reshaped_params)
            n_corrects = sum(pred.argmax(dim=1) == y).item()
            acc_batch = n_corrects / len(x)
            batch_accuracies.append(acc_batch)
            pbar_test.update()

    print(f"Average accuracy for epoch {epoch}: {sum(batch_accuracies)/len(batch_accuracies):.4f} \n")


Loss value: 2.2843: 100%|██████████| 938/938 [00:12<00:00, 76.49it/s]
100%|██████████| 157/157 [00:13<00:00, 11.99it/s] 
 94%|█████████▍| 148/157 [00:00<00:00, 206.57it/s]

Average accuracy for epoch 0: 0.1059 



Loss value: 1.7137: 100%|██████████| 938/938 [00:13<00:00, 70.68it/s]
100%|██████████| 157/157 [00:14<00:00, 11.19it/s] 


Average accuracy for epoch 1: 0.6671 



Loss value: 0.9009: 100%|██████████| 938/938 [00:12<00:00, 77.99it/s]
100%|██████████| 157/157 [00:12<00:00, 12.29it/s] 
 97%|█████████▋| 152/157 [00:00<00:00, 207.72it/s]

Average accuracy for epoch 2: 0.7777 



Loss value: 0.2711: 100%|██████████| 938/938 [00:13<00:00, 70.57it/s]
100%|██████████| 157/157 [00:14<00:00, 11.18it/s] 


Average accuracy for epoch 3: 0.8320 



Loss value: 0.4998: 100%|██████████| 938/938 [00:12<00:00, 75.90it/s]
100%|██████████| 157/157 [00:13<00:00, 12.00it/s] 
 97%|█████████▋| 152/157 [00:00<00:00, 212.98it/s]

Average accuracy for epoch 4: 0.8512 



Loss value: 0.3640: 100%|██████████| 938/938 [00:13<00:00, 71.09it/s]
100%|██████████| 157/157 [00:13<00:00, 11.26it/s] 


Average accuracy for epoch 5: 0.8697 



Loss value: 0.9274: 100%|██████████| 938/938 [00:12<00:00, 75.30it/s]
100%|██████████| 157/157 [00:13<00:00, 11.90it/s] 
 96%|█████████▌| 150/157 [00:00<00:00, 215.42it/s]

Average accuracy for epoch 6: 0.8697 



Loss value: 0.2603: 100%|██████████| 938/938 [00:13<00:00, 71.50it/s]
100%|██████████| 157/157 [00:13<00:00, 11.33it/s] 


Average accuracy for epoch 7: 0.8794 



Loss value: 0.2241: 100%|██████████| 938/938 [00:12<00:00, 77.87it/s]
100%|██████████| 157/157 [00:12<00:00, 12.27it/s] 
 98%|█████████▊| 154/157 [00:00<00:00, 217.62it/s]

Average accuracy for epoch 8: 0.8887 



Loss value: 0.1514: 100%|██████████| 938/938 [00:13<00:00, 70.14it/s]
100%|██████████| 157/157 [00:14<00:00, 11.14it/s] 


Average accuracy for epoch 9: 0.8880 





As we can see, the speed of convergence can be different in the two methods!

The choice of slicing technique can depend on the architecture of the main model and the complexity of the problem. However, the principles remain the same!

Finally, we need to compare the performance of the Hypernetwork with the same main model trained in a stateful way:

In [32]:
# Initialize random seeds in PyTorch and NumPy for reproducibility
torch.manual_seed(0)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

np.random.seed(0)

In [33]:
# Define model

class MLP(nn.Module):
    def __init__(self, n_inp, n_hidden, n_out):
        super().__init__()
        self.linear_1 = nn.Linear(n_inp, n_hidden)
        self.linear_2 = nn.Linear(n_hidden, n_hidden)
        self.linear_3 = nn.Linear(n_hidden, n_hidden)
        self.classifier = nn.Linear(n_hidden, n_out)

        self.activ = nn.ReLU()

    def forward(self, x):
        x = x.view(x.shape[0], -1)
        x = self.activ(self.linear_1(x))
        x = self.activ(self.linear_2(x))
        # Note: linear_3 is defined but not applied in this forward pass, unlike in forward_with_params
        x = self.classifier(x)

        return x

    def forward_with_params(self, x, params):
        x = x.view(x.shape[0], -1)
        x = F.linear(x, params["linear_1.weight"], params["linear_1.bias"])
        x = F.relu(x)
        x = F.linear(x, params["linear_2.weight"], params["linear_2.bias"])
        x = F.relu(x)
        x = F.linear(x, params["linear_3.weight"], params["linear_3.bias"])
        x = F.relu(x)
        x = F.linear(x, params["classifier.weight"], params["classifier.bias"])
        return x
    
model = MLP(28*28, 50, 10)

# Define the optimizer and the loss function
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

# Set the device
device = torch.device("cpu")

# Move the model to the device
model.to(device)

# Standard PyTorch training loop
n_epochs = 10
for epoch in range(n_epochs):
    pbar = tqdm(train_loader)
    for batch in pbar:
        x, y = batch
        x, y = x.to(device), y.to(device)

        # Zero-grad optimizer for the model
        optimizer.zero_grad()

        # Make prediction with the model's own (stateful) weights
        pred = model(x)

        # Compute loss and backpropagate
        loss = criterion(pred, y)
        loss.backward()

        # Optimizer step update
        optimizer.step()

        # Set progress bar description
        pbar.set_description(f"Loss value: {loss.item():.4f}")

    with torch.no_grad():
        # Evaluate model after each epoch
        batch_accuracies = []
        pbar_test = tqdm(test_loader)
        for batch in test_loader:
            x, y = batch
            x, y = x.to(device), y.to(device)

            # Make prediction
            pred = model(x)
            n_corrects = sum(pred.argmax(dim=1) == y).item()
            acc_batch = n_corrects / len(x)
            batch_accuracies.append(acc_batch)
            pbar_test.update()

    print(f"Average accuracy for epoch {epoch}: {sum(batch_accuracies)/len(batch_accuracies):.4f} \n")


Loss value: 0.0334: 100%|██████████| 938/938 [00:03<00:00, 266.24it/s]
100%|██████████| 157/157 [10:22<00:00,  3.97s/it] 
 80%|████████  | 126/157 [00:00<00:00, 414.32it/s]

Average accuracy for epoch 0: 0.9360 



Loss value: 0.2522: 100%|██████████| 938/938 [00:04<00:00, 199.56it/s]
100%|██████████| 157/157 [00:05<00:00, 30.90it/s] 


Average accuracy for epoch 1: 0.9522 



Loss value: 0.2244: 100%|██████████| 938/938 [00:03<00:00, 270.25it/s]
100%|██████████| 157/157 [00:03<00:00, 40.67it/s] 
 80%|███████▉  | 125/157 [00:00<00:00, 417.43it/s]

Average accuracy for epoch 2: 0.9529 



Loss value: 0.0881: 100%|██████████| 938/938 [00:04<00:00, 204.57it/s]
100%|██████████| 157/157 [00:04<00:00, 31.64it/s] 


Average accuracy for epoch 3: 0.9625 



Loss value: 0.1022: 100%|██████████| 938/938 [00:03<00:00, 271.06it/s]
100%|██████████| 157/157 [00:03<00:00, 40.84it/s] 
 80%|████████  | 126/157 [00:00<00:00, 418.44it/s]

Average accuracy for epoch 4: 0.9673 



Loss value: 0.0050: 100%|██████████| 938/938 [00:04<00:00, 204.38it/s]
100%|██████████| 157/157 [00:04<00:00, 31.62it/s] 


Average accuracy for epoch 5: 0.9672 



Loss value: 0.0514: 100%|██████████| 938/938 [00:03<00:00, 269.33it/s]
100%|██████████| 157/157 [00:03<00:00, 40.61it/s] 
 80%|███████▉  | 125/157 [00:00<00:00, 415.74it/s]

Average accuracy for epoch 6: 0.9643 



Loss value: 0.5854: 100%|██████████| 938/938 [00:04<00:00, 204.16it/s]
100%|██████████| 157/157 [00:04<00:00, 31.57it/s] 


Average accuracy for epoch 7: 0.9695 



Loss value: 0.0056: 100%|██████████| 938/938 [00:03<00:00, 265.61it/s]
100%|██████████| 157/157 [00:03<00:00, 40.05it/s] 
 80%|███████▉  | 125/157 [00:00<00:00, 410.41it/s]

Average accuracy for epoch 8: 0.9708 



Loss value: 0.1619: 100%|██████████| 938/938 [00:04<00:00, 203.06it/s]
100%|██████████| 157/157 [00:05<00:00, 31.38it/s] 


Average accuracy for epoch 9: 0.9706 





Conclusion: To be discussed during the session

<font color='darkgreen'> [Q] Why bother using Hypernetworks with all the additional complexity?</font>