
# MatFormer: Nested Transformer for Elastic Inference

## Introduction
This notebook provides an implementation of the MatFormer model as described in the paper "MatFormer: Nested Transformer for Elastic Inference". MatFormer introduces a nested Transformer architecture designed to offer elasticity across a variety of deployment constraints.

## Description of the Method
The MatFormer architecture is based on the concept of nested sub-structures within the Transformer model. Each Feed Forward Network (FFN) block of a MatFormer model is jointly optimized with a few nested smaller FFN blocks, enabling the extraction of smaller sub-models from a larger, universal model.
To evaluate the model, we use the IMDB sentiment-classification dataset.

### Key Components of the architecture
1. **Nested FFN Blocks:** The FFN block in the Transformer is modified to include a nested structure, allowing for multiple granularities within a single model.
2. **Mix’n’Match:** This approach allows for the combination of different granularities across layers, generating numerous sub-models without additional training.

## Implementation Details

### Loading Libraries and Data

In [1]:
import torch

# Workaround: with some torch/torchtext version combinations, torchtext fails to import unless this flag is set first
torch.utils.data.datapipes.utils.common.DILL_AVAILABLE = torch.utils._import_utils.dill_available()

import re

import numpy as np
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchtext.datasets import IMDB
from tqdm import tqdm
from transformers import AutoTokenizer, AutoModel

train_iter, test_iter = IMDB(split=('train','test'))
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cuda


In [2]:
# Ensure the reproducibility of results
from transformers import set_seed

seed = 42

np.random.seed(seed)
torch.manual_seed(seed)

if torch.cuda.is_available():
    torch.cuda.manual_seed_all(seed)

# Set the seed for transformers
set_seed(seed)


### Dataset Processing

To train and evaluate the MatFormer model, we need to preprocess the dataset. This involves cleaning the text data, tokenizing and encoding it into a format suitable for the model, and organizing it into batches for training and evaluation. Below we have functions used for these preprocessing steps.
1. **clean_text:** Responsible for lowercasing the text, removing HTML tags and unwanted characters, and ensuring that the data is in a consistent format. This step helps in reducing noise and improving the quality of the input data.

2. **tokenize_and_encode:** Takes the cleaned text and converts it into a sequence of tokens, then encodes these tokens into numerical values that can be processed by the model.

3. **process_dataset:** Applies the cleaning, tokenizing, and encoding steps to the entire dataset. It then organizes the data into batches for training and evaluation. This function ensures that the data is ready to be fed into the MatFormer model.

As tokenizer we use the BERT tokenizer of `prajjwal1/bert-mini`. The MatFormer model is built on top of the corresponding pretrained BERT encoder, whose parameters are frozen during training.
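
As a quick illustration of the cleaning step, the sketch below applies the same regular expressions as `clean_text` to a made-up review string (the sample text is hypothetical, not taken from IMDB). It shows that punctuation is removed entirely, so a rating such as "10/10" collapses to "1010".

```python
import re

def clean_text(text):
    # Same steps as the notebook's clean_text, repeated here so the example is self-contained
    text = text.lower()
    text = re.sub(r'<br />', ' ', text)        # drop HTML line breaks
    text = re.sub(r'[^a-z0-9\s]', '', text)    # keep only letters, digits and whitespace
    text = re.sub(r'\s+', ' ', text).strip()   # collapse repeated whitespace
    return text

sample = "Great movie!<br /><br />I'd watch it AGAIN... 10/10"  # hypothetical review
cleaned = clean_text(sample)
print(cleaned)  # great movie id watch it again 1010
```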

In [3]:
def clean_text(text):
    text = text.lower()
    text = re.sub(r'<br />', ' ', text)
    text = re.sub(r'[^a-z0-9\s]', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text

def tokenize_and_encode(batch, tokenizer, max_length=512):
    inputs = tokenizer.batch_encode_plus(
        batch,
        max_length=max_length,
        padding='max_length',
        truncation=True,
        return_tensors='pt'
    )
    return inputs['input_ids'], inputs['attention_mask'].to(torch.bool)


def process_dataset(iterator, tokenizer):
    texts = []
    labels = []
    # IMDB labels: 1 is negative, 2 is positive; remap them to 0 and 1
    for label, text in iterator:
        cleaned_text = clean_text(text)
        texts.append(cleaned_text)
        labels.append(1 if label == 2 else 0)


    input_ids, attention_masks = tokenize_and_encode(texts, tokenizer)
    labels = torch.tensor(labels)
    return input_ids, attention_masks, labels

In [4]:
tokenizer = AutoTokenizer.from_pretrained('prajjwal1/bert-mini')#'kanishka/GlossBERT')

train_input_ids, train_attention_masks, train_labels = process_dataset(train_iter, tokenizer)
test_input_ids, test_attention_masks, test_labels = process_dataset(test_iter, tokenizer)



In [5]:
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

batch_size = 32

vocab_size = tokenizer.vocab_size

train_data = TensorDataset(train_input_ids, train_attention_masks, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

test_data = TensorDataset(test_input_ids, test_attention_masks, test_labels)
test_sampler = SequentialSampler(test_data) 
test_dataloader = DataLoader(test_data, sampler=test_sampler, batch_size=batch_size)


In [None]:
len(test_data)

25000

In [None]:
# Used only for testing

#for elem in train_dataloader:
#    input_ids_batch, attention_mask_batch, target_batch = elem
#    output = model(input_ids_batch.to(device), attention_mask = attention_mask_batch.to(device), granularity_level=3)
#    break



## Model definition
### MatFormer Structure

The MatFormer model defines $g$ Transformer blocks $T_i$ such that $T_1 \subset T_2 \subset \cdots \subset T_g$, where $T_i \subset T_{i+1}$ indicates that the parameters of $T_i$ are contained in those of $T_{i+1}$.

While it is possible to impose such a structure on any part of the Transformer, we select the Feed Forward Network (FFN) block to define our method and present our experiments. The model size and computational cost of a Transformer are dominated (around 60% for LLMs and ViTs) by the FFN block.
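
To make the claim about the FFN's share of the model concrete, here is a rough back-of-the-envelope sketch. It counts only the weight matrices of a single layer, assuming standard multi-head attention with four `d_model x d_model` projections and an FFN ratio of 4. Biases, LayerNorm and embeddings are ignored, so the number it gives is higher than the whole-model figure quoted above. The function name `ffn_weight_share` is made up for this illustration.

```python
def ffn_weight_share(d_model, ffn_ratio=4):
    """Fraction of one Transformer layer's weight matrices that sit in the FFN.

    Assumes four d_model x d_model attention projections (Q, K, V, output)
    and two d_model x d_ff FFN matrices; biases, LayerNorm and embeddings
    are not counted.
    """
    d_ff = ffn_ratio * d_model
    attention_weights = 4 * d_model * d_model
    ffn_weights = 2 * d_model * d_ff
    return ffn_weights / (attention_weights + ffn_weights)

share = ffn_weight_share(d_model=768, ffn_ratio=4)
print(f"FFN share of per-layer weights: {share:.1%}")  # 66.7%
```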

The FFN block in a Transformer has a single hidden layer with $d_{ff}$ neurons, both input and output in $\mathbb{R}^{d_{model}}$, and a fixed FFN ratio $:= \frac{d_{ff}}{d_{model}}$ (typically $\geq 4$). MatFormer introduces the matryoshka nested structure with $g$ granularities on the hidden representation $d_{ff}$ of the FFN block.

Concretely, a nested sub-block of the Transformer, $T_i$, contains the first $m_i$ neurons of the FFN, and $1 \leq m_1 \leq m_2 \leq \cdots \leq m_g = d_{ff}$ represent the number of neurons for each granularity or sub-model.

So, depending on the chosen granularity, the FFN operation of $T_i$, i.e., $T_{FFN}^i$, on an input $x \in \mathbb{R}^{d_{model}}$ is:

$$
T_{FFN}^i(x) = \sigma(x \cdot W_1[0 : m_i]^T) \cdot W_2[0 : m_i],
$$

where the weight matrices of the FFN are $W_1, W_2 \in \mathbb{R}^{d_{ff} \times d_{model}}$ and bias terms are omitted for simplicity. $W_1[0 : k]$ denotes the submatrix with the first $k$ rows of $W_1$. Finally, $\sigma$ is a non-linearity often set to GELU (Gaussian Error Linear Unit) or squared ReLU.
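
The formula can be checked on a toy example. The following is a minimal pure-Python sketch (no PyTorch), with biases omitted as in the formula. The weights, the input and the helper name `nested_ffn` are made up for illustration. It shows that the larger granularity reuses the computation of the smaller one and only adds the contribution of the extra neurons.

```python
import math

def gelu(v):
    # Exact GELU: v * Phi(v), with Phi the standard normal CDF
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))

def nested_ffn(x, W1, W2, m):
    """T_FFN^i(x) = gelu(x . W1[0:m]^T) . W2[0:m], biases omitted."""
    hidden = [gelu(sum(xj * wj for xj, wj in zip(x, W1[r]))) for r in range(m)]
    d_model = len(x)
    return [sum(hidden[r] * W2[r][c] for r in range(m)) for c in range(d_model)]

# Toy weights with d_model = 2 and d_ff = 4 (hypothetical values)
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 1.0]]
W2 = [[0.5, 0.0], [0.0, 0.5], [1.0, -1.0], [0.0, 1.0]]
x = [1.0, 2.0]

out_small = nested_ffn(x, W1, W2, m=2)  # uses only the first 2 hidden neurons
out_full = nested_ffn(x, W1, W2, m=4)   # uses all 4 hidden neurons
print(out_small, out_full)
```

Both calls read from the same `W1` and `W2`; only the slice length `m` changes, which is what makes the sub-blocks nested.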

In this work, we chose $g = 4$ exponentially spaced granularities with FFN ratios of $\{0.5, 1, 2, 4\}$, i.e., the nested hidden neurons are of the sizes $\left\{\frac{d_{ff}}{8}, \frac{d_{ff}}{4}, \frac{d_{ff}}{2}, d_{ff}\right\}$.
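
As a small worked example, the sketch below computes these nested sizes for the `d_ff = 2048` used later in this notebook. The value `d_model = 256` is an assumption (the hidden size of the BERT encoder used here), and `nested_sizes` is a hypothetical helper. The notebook's `NestedFFN` stores the same sizes in decreasing order.

```python
def nested_sizes(d_ff, num_granularities=4):
    """Return [m_1, ..., m_g] in increasing order, with m_g = d_ff."""
    g = num_granularities
    return [d_ff // 2 ** (g - 1 - i) for i in range(g)]

d_ff = 2048     # value from this notebook's hyperparameters
d_model = 256   # assumption: hidden size of the BERT encoder used here

sizes = nested_sizes(d_ff)
ratios = [m / d_model for m in sizes]
print(sizes)   # [256, 512, 1024, 2048]
print(ratios)  # [1.0, 2.0, 4.0, 8.0]
```

Under that assumption, the effective FFN ratios in this notebook are {1, 2, 4, 8} rather than the {0.5, 1, 2, 4} quoted from the paper, because `d_ff` is 8 times `d_model`.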

With the nested MatFormer blocks $T_1, T_2, \ldots, T_g$, we can combine these to form a MatFormer model, with $g$ nested submodels $M_1 \subset M_2 \subset \cdots \subset M_g$, where $M_i \leftarrow [T_i]^l$, i.e., $M_i$ is formed by stacking $T_i$ for $l$ layers. The input and output embedding matrices are shared across the models.

The implementation follows the description of Mix’n’Match.

### Mix’n’Match

The Mix’n’Match strategy in the MatFormer model allows for the extraction of a combinatorially large number of accurate and smaller submodels from a single trained model. This is achieved by selecting different granularities for each MatFormer layer during inference, enabling the generation of models tailored to specific computational constraints without additional training.

**Key Points:**
1. **Dynamic Model Extraction**: Mix’n’Match enables the dynamic extraction of smaller models by selecting different subsets of neurons at each layer.
2. **Combinatorial Flexibility**: By choosing different granularities across layers, it is possible to create a large variety of submodels that meet specific accuracy and computational trade-offs.
3. **Interpolation**: Interpolating between granularities can also produce highly accurate models.
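
To give a sense of the combinatorial flexibility, the sketch below enumerates every per-layer assignment of granularities for the configuration used in this notebook (4 granularities, 2 layers). It is only a counting illustration under those assumptions. The notebook's own evaluation does not mix granularities across layers.

```python
from itertools import product

sizes = [256, 512, 1024, 2048]  # FFN neurons per granularity (d_ff = 2048)
num_layers = 2                  # value from this notebook's hyperparameters

# One granularity choice per layer: g ** l configurations in total
configs = list(product(sizes, repeat=num_layers))
print(len(configs))  # 16

# Distinct total FFN neuron budgets reachable by mixing granularities
budgets = sorted({sum(config) for config in configs})
print(budgets)  # [512, 768, 1024, 1280, 1536, 2048, 2304, 2560, 3072, 4096]
```

With more layers the count grows as `g ** l`, which is why a single trained model can cover many points on the accuracy versus compute curve.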

**Formula:**

An interpolated block $\tilde{T}$ between two consecutive granularities uses the first $\tilde{m}$ neurons of the FFN, where $m_i$ and $m_{i+1}$ are the neuron counts of the two granularities:

$$
\tilde{m} = \frac{1}{2} (m_i + m_{i+1})
$$
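
For the granularity sizes used in this notebook, the interpolated neuron counts work out as follows. The helper name `interpolated_sizes` is hypothetical, and the computation mirrors how `granularity_sizes_mix` is built in `NestedFFN`.

```python
def interpolated_sizes(sizes):
    """Midpoint neuron count between each pair of consecutive granularities."""
    return [(a + b) // 2 for a, b in zip(sizes, sizes[1:])]

trained = [2048, 1024, 512, 256]  # sizes the model is explicitly trained on
mix = interpolated_sizes(trained)
print(mix)  # [1536, 768, 384]
```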





In [6]:
class NestedFFN(nn.Module):
    def __init__(self, d_model, d_ff, num_granularities=4):
        super(NestedFFN, self).__init__()

        # Initialize FFN layers
        self.num_granularities = num_granularities
        self.d_model = d_model
        self.d_ff = d_ff

        # Create weight matrices for W1 and W2 with the largest size
        self.W1 = nn.Parameter(torch.randn(d_ff, d_model))
        self.W2 = nn.Parameter(torch.randn(d_ff, d_model))

        # Create bias vectors for W1 and W2 with the largest size
        self.b1 = nn.Parameter(torch.randn(d_ff))
        self.b2 = nn.Parameter(torch.randn(d_model))

        # Calculate the sizes of each granularity
        self.granularity_sizes = [d_ff // (2 ** i) for i in range(num_granularities)]
        self.granularity_sizes_mix = []

        # This is for mix' n' match
        for i in range(num_granularities-1) :
            self.granularity_sizes_mix.append(int(1/2 * (self.granularity_sizes[i] + self.granularity_sizes[i+1])))

        #print(self.granularity_sizes_mix)

    def forward(self, x, granularity_level):
        assert 0 <= granularity_level < self.num_granularities, "Invalid granularity level"

        # m_i: number of neurons selected for this granularity
        m_i = self.granularity_sizes[granularity_level]

        # Perform the FFN operation with the selected subset of weights
        hidden = F.gelu(x @ self.W1[:m_i, :].T + self.b1[:m_i])
        # The output lives in R^{d_model}, so the full output bias b2 is added
        output = hidden @ self.W2[:m_i, :] + self.b2

        return output
    
    # Used only at inference: evaluates interpolated granularities that the model was not explicitly trained on
    def forward_mix(self, x, granularity_level):
        # m_i: number of neurons selected for this interpolated granularity
        m_i = self.granularity_sizes_mix[granularity_level]

        # Perform the FFN operation with the selected subset of weights
        hidden = F.gelu(x @ self.W1[:m_i, :].T + self.b1[:m_i])
        # The output lives in R^{d_model}, so the full output bias b2 is added
        output = hidden @ self.W2[:m_i, :] + self.b2

        return output

class TransformerLayer(nn.Module):
    def __init__(self, d_model, num_heads, nested_ffn, granularity_level, dropout=0.1):
        super(TransformerLayer, self).__init__()
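        # batch_first=True because the BERT hidden states are shaped (batch, seq_len, d_model)
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, dropout=dropout, batch_first=True)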
        self.granularity_level = granularity_level
        self.nested_ffn = nested_ffn
        self.layernorm1 = nn.LayerNorm(d_model)
        self.layernorm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, src, inference, src_mask=None, src_key_padding_mask=None):
        src2, _ = self.self_attn(src, src, src, attn_mask=src_mask, key_padding_mask=src_key_padding_mask)
        src = src + self.dropout(src2)
        src = self.layernorm1(src)

        if not inference:
            src2 = self.nested_ffn(src, self.granularity_level)
        else:
            src2 = self.nested_ffn.forward_mix(src, self.granularity_level)

        src = src + self.dropout(src2)
        src = self.layernorm2(src)
        return src

class Transformer(nn.Module):
    def __init__(self, d_model, num_layers, num_heads, nested_ffn, num_granularities=4, dropout=0.1):
        super(Transformer, self).__init__()
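        # NOTE: a plain Python list does not register these layers as sub-modules, so their
        # attention and LayerNorm weights are absent from model.parameters() (and therefore
        # from the optimizer and the state_dict). Wrapping the list in nn.ModuleList would
        # register them.
        self.models = []
        # Stack num_layers layers with the same granularity level,
        # creating M1, M2, ..., Mg
        for level in range(num_granularities):
            self.models.append(nn.ModuleList([
                TransformerLayer(d_model, num_heads, nested_ffn, level, dropout).to(device)
                for _ in range(num_layers)
            ]))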

        self.layernorm = nn.LayerNorm(d_model)

    def forward(self, src, src_mask=None, src_key_padding_mask=None, granularity_level = 0, inference = False):
        # granularity_level selects the sub-model M_i to use
        for layer in self.models[granularity_level]:
            src = layer(src, inference, src_mask=src_mask, src_key_padding_mask=src_key_padding_mask)
        src = self.layernorm(src)
        return src
    

class SentimentTransformer(nn.Module):
    def __init__(self,d_ff, num_layers, num_heads, granularity_levels=4, dropout=0.1,num_granularities = 4):
        super(SentimentTransformer, self).__init__()

        # Load pre trained BERT model
        self.bert = AutoModel.from_pretrained('prajjwal1/bert-mini') #'kanishka/GlossBERT') #, torch_dtype=torch.float16)
        
        # Freeze BERT parameters
        for param in self.bert.parameters():
            param.requires_grad = False

        self.d_model = self.bert.config.hidden_size
        self.nested_ffn = NestedFFN(self.d_model, d_ff, num_granularities)

        self.transformer = Transformer(self.d_model, num_layers, num_heads, self.nested_ffn, granularity_levels, dropout)
        self.relu = nn.ReLU()
        self.fc1 = nn.Linear(self.d_model, 512)
        self.fc2 = nn.Linear(512, 1)  # Binary classification
        

    
    def forward(self, input_ids, attention_mask=None, granularity_level=0, inference = False):
        outputs = self.bert(input_ids, attention_mask=attention_mask)
        hidden_states = outputs.last_hidden_state 

        src = self.transformer(hidden_states, granularity_level=granularity_level, inference = inference)
        src = self.relu(self.fc1(torch.mean(src, dim=1)))
        
        src = self.fc2(src)

        return src

### Training

The training strategy for the MatFormer model involves jointly optimizing all the nested submodels. This is done by defining a joint loss function that combines the loss of each submodel with specific weights. The training process ensures that each submodel is accurate and consistent with the others, allowing for efficient extraction of smaller models.

**Key Points:**
1. **Joint Optimization**: All the granular submodels are optimized together using a combined loss function.
2. **Loss Function**: The joint loss function is a weighted average of the individual losses of each submodel.
3. **Efficiency**: This training strategy is more efficient than training each submodel independently and ensures consistency across submodels.

**Formula:**

The joint loss function $L_{JOINT}$ is defined as:

$$
L_{JOINT}(x, y) = \sum_{i=1}^{g} \lambda_i \cdot L(M_i(x), y)
$$

where:
- $x$ is the input.
- $y$ is the target.
- $M_i$ is the $i$-th granular submodel.
- $\lambda_i$ is the weight for the $i$-th submodel's loss.
- $L$ is the loss function (e.g., cross-entropy loss).

A possible choice is uniform weighting, $\lambda_i = \frac{1}{g}$ for all $i$, which is what the training loop in this notebook uses.
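
As a worked example of the joint loss, the sketch below evaluates it for a single training example in pure Python. It assumes binary cross-entropy on logits as the per-model loss $L$ and uniform weights by default. The logits and the helper names are made up for illustration.

```python
import math

def bce_with_logits(logit, target):
    # Numerically stable binary cross-entropy on a raw logit:
    # max(z, 0) - z * y + log(1 + exp(-|z|))
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

def joint_loss(logits_per_model, target, weights=None):
    """Weighted sum of the per-submodel losses; uniform weights 1/g by default."""
    g = len(logits_per_model)
    weights = weights or [1.0 / g] * g
    return sum(w * bce_with_logits(z, target) for w, z in zip(weights, logits_per_model))

# Hypothetical logits produced by M_1, ..., M_4 for one positive example
logits = [2.0, 1.5, 0.5, -0.2]
loss = joint_loss(logits, target=1.0)
print(loss)
```

Because the loss is a single scalar, one backward pass updates the shared FFN weights with gradients from all granularities at once.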


In [7]:
# Hyperparameters
d_ff = 2048
num_granularities = 4
num_layers = 2  
num_heads = 4 
dropout = 0
epochs = 10 
learning_rate = 0.001

# Hyperparameters for single transformer, the rest is the same
#num_granularities = 1
#num_layers = 1  

In [28]:
torch.cuda.empty_cache()

In [8]:

model = SentimentTransformer( d_ff, num_layers, num_heads, dropout=dropout,num_granularities=num_granularities).to(device)


In [9]:
# Count the total number of parameters
total_params = sum(p.numel() for p in model.parameters())
# Count the number of parameters that require gradients
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"Total parameters: {total_params}")
print(f"Trainable parameters: {trainable_params}")

Total parameters: 12354049
Trainable parameters: 1183489


In [10]:

model.train()

# Loss function and optimizer
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

# Training loop
for epoch in range(epochs):
    total_loss = 0.0
    for batch in tqdm(train_dataloader, unit='batch'):
        input_ids_batch, attention_mask_batch, target_batch = batch

        input_ids_batch = input_ids_batch.to(device)
        attention_mask_batch = attention_mask_batch.to(device)
        target_batch = target_batch.to(device)
        
        # Zero the gradients
        optimizer.zero_grad()

        # Compute the loss for each granularity level and combine them
        losses = []
        for granularity_level in range(num_granularities):

            output = model(input_ids_batch, attention_mask = attention_mask_batch, granularity_level=granularity_level)
            loss = criterion(output.flatten(), target_batch.float())
            losses.append(loss)
            
        # Combine the losses
        combined_loss = sum(losses) / num_granularities

        # Backpropagation
        combined_loss.backward()

        # Update parameters
        optimizer.step()

        # Accumulate loss for reporting
        total_loss += combined_loss.item()

    # Report the mean loss per batch (divide by the number of batches, not the batch size)
    print(f"Epoch {epoch+1}/{epochs}, Loss: {total_loss/len(train_dataloader)}")



### Evaluation

We evaluate the models $M_1, \ldots, M_g$ (in our case $g = 4$) as well as other models that can be constructed using different numbers of neurons. These additional models are not explicitly trained like $M_1, \ldots, M_g$, yet they still reach competitive accuracy.

**Key Points:**
1. **Evaluation of Explicitly Trained Models**: We assess the performance of the explicitly trained models $M_1, \ldots, M_g$.
2. **Evaluation of Constructed Models**: We also evaluate models constructed using various subsets of neurons. These models leverage the nested structure of MatFormer, allowing them to perform well even without explicit training.
3. **Performance**: The constructed models show strong performance, indicating the effectiveness of the nested training approach.
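
The accuracy metric used in the evaluation can be sketched in a few lines. This is a minimal pure-Python illustration with made-up logits and labels: a logit is mapped to a probability with the sigmoid, thresholded at 0.5, and compared with the label.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def accuracy(logits, labels, threshold=0.5):
    # Predict the positive class when the sigmoid probability exceeds the threshold
    predictions = [1 if sigmoid(z) > threshold else 0 for z in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)

logits = [2.3, -0.7, 0.1, -1.5]  # hypothetical model outputs
labels = [1, 0, 0, 0]            # hypothetical ground truth
acc = accuracy(logits, labels)
print(acc)  # 0.75
```

Thresholding the probability at 0.5 is equivalent to checking whether the logit is positive, so the sigmoid is not strictly needed to obtain the predicted label.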





In [None]:

model.eval()

# Number of correct prediction made by M1,...Mg
correct_model = [ 0 for _ in range(num_granularities) ]
# Number of correct prediction made by mix' n' match
correct_model_mix = [ 0 for _ in range(len(model.nested_ffn.granularity_sizes_mix)) ]

total = 0

with torch.no_grad():
    for batch in tqdm(test_dataloader, unit='batch'):
        input_ids_batch, attention_mask_batch, target_batch = batch
        
        input_ids_batch = input_ids_batch.to(device)
        attention_mask_batch = attention_mask_batch.to(device)
        target_batch = target_batch.to(device)
        # Evaluation of M1,...,Mg
        for granularity_level in range(num_granularities):
            output = model(input_ids_batch, attention_mask = attention_mask_batch, granularity_level=granularity_level)
            # A sigmoid probability above 0.5 means the positive class (label 1)
            predictions = (torch.sigmoid(output).flatten() > 0.5).long()

            correct_prediction = torch.eq(predictions, target_batch)

            correct_model[granularity_level] += torch.sum(correct_prediction).item()

        # Evaluation of the mix'n'match models (interpolated granularities)
        for granularity_level in range(len(model.nested_ffn.granularity_sizes_mix)):
            output = model(input_ids_batch, attention_mask = attention_mask_batch, granularity_level=granularity_level, inference = True)
            predictions = (torch.sigmoid(output).flatten() > 0.5).long()

            correct_prediction = torch.eq(predictions, target_batch)

            correct_model_mix[granularity_level] += torch.sum(correct_prediction).item()

        # Count the number of evaluated examples
        total += target_batch.size(0)






In [None]:
accuracy = [n_correct/len(test_data) for n_correct in correct_model]

n_sub_models_neurons = [d_ff // (2 ** i) for i in range(num_granularities)]

print('Accuracy of each sub-model', accuracy)
print('Each sub-model has number of neurons:', n_sub_models_neurons )
print("So for example, the first sub model with ", n_sub_models_neurons[0], "neurons, has an accuracy of ", accuracy[0])

In [None]:
accuracy = [n_correct/len(test_data) for n_correct in correct_model_mix]

n_sub_models_neurons_mix = model.nested_ffn.granularity_sizes_mix

print('Accuracy of each sub-model with different granularities', accuracy)
print('Each sub-model has number of neurons:', n_sub_models_neurons_mix )
print("So for example, the first sub model with ", n_sub_models_neurons_mix[0], "neurons, has an accuracy of ", accuracy[0])

In [20]:
#torch.save(model.state_dict(),'1_model_weights_traditional')
#model.eval()
model.load_state_dict(torch.load('1_model_weights_traditional'))

<All keys matched successfully>

## Results

The plot below compares the training loss of MatFormer with that of a traditional Transformer trained in the standard way.
<br>

![Training loss per epoch](loss_epochs.png)

<br>
The plot below shows the MatFormer accuracy of the sub-models M1, M2, M3, M4 on IMDB sentiment analysis.
<br>

![Accuracy of the sub-models M1 to M4](accuracy_1.png)

<br>
The plot below shows the MatFormer accuracy of the sub-models obtained with the Mix’n’Match technique on IMDB sentiment analysis.
<br>

![Accuracy of the Mix’n’Match sub-models](accuracy_2.png)


<br>
The traditional Transformer (with 2048 FFN neurons) reached an accuracy of 0.81556.
<br>

## Conclusions

The evaluation of the sub-models $M_1, M_2, M_3, M_4$ reveals that their accuracies are relatively similar. This suggests that we can opt for the smallest sub-model $M_4$ without a substantial loss in accuracy. Here the numbering follows the code, where granularity level 0 is the full 2048-neuron FFN, so $M_1$ denotes the largest sub-model and $M_4$ the smallest. A similar observation can be made for sub-models generated using the Mix’n’Match strategy; these models also maintain competitive accuracy.

One interesting area for further exploration is the possibility of identifying a sub-model that achieves higher accuracy than the larger models by experimenting with different granularities (number of neurons).

In comparison, the traditional transformer model exhibits slightly higher accuracy and was easier to train. However, the difference in accuracy between the traditional transformer and the sub-model $ M_1 $ is not substantial. The MatFormer model stands out because it offers flexibility in model size, allowing for different configurations without the need for retraining from scratch. Retraining models with varying sizes would be computationally expensive, highlighting the efficiency advantage of the MatFormer approach.

Overall, the MatFormer model provides a valuable balance between computational efficiency and model performance, making it a versatile choice for applications requiring different model sizes.

Note: we trained for only 10 epochs; training for more epochs would probably improve performance.
