<div>
    <h2>Pruning Attention Layers Models: meta-llama/Llama-3.2-3B</h2>
    <h3>Not All Attention is needed</h3>
</div>

* Pruning
* Attention based on cosine similarity

# Install libraries & Configure variables.

In [1]:
!pip install -q torch==2.6.0
!pip install -q python-dotenv
!pip install -q torchvision==0.21.0
!pip install -q transformers==4.51.3
!pip install -q datasets==3.6.0
!pip install -q lm-eval==0.4.8

!pip install hf_xet #To speed up downloads from HF.



In [2]:
import logging
import math
import os
import sys
import shutil
from copy import deepcopy

import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer


# Download the Model.

In [3]:
# Check if GPU is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

Using device: cuda


In [4]:
# #model_name = 'meta-llama/Llama-3.2-1B'
# model_name = 'meta-llama/Llama-3.2-3B'
# model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
# tokenizer = AutoTokenizer.from_pretrained(model_name)
# #tokenizer.pad_token = tokenizer.eos_token  # Set pad token

In [6]:

import os
from dotenv import load_dotenv
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load environment variables from .env file
load_dotenv()

# Retrieve the HF_TOKEN from environment
hf_token = os.getenv('HF_TOKEN')

# Define your model name
model_name = 'meta-llama/Llama-3.2-3B'

# Load the model with the HF_TOKEN for authentication
# (`token` replaces the deprecated `use_auth_token` argument)
model = AutoModelForCausalLM.from_pretrained(model_name, token=hf_token).to(device)

# Load the tokenizer with the HF_TOKEN for authentication
tokenizer = AutoTokenizer.from_pretrained(model_name, token=hf_token)

# Optionally, print the model and tokenizer to verify successful loading
print(f"Model: {model_name} loaded successfully")
print(f"Tokenizer for {model_name} loaded successfully")





Model: meta-llama/Llama-3.2-3B loaded successfully
Tokenizer for meta-llama/Llama-3.2-3B loaded successfully


## Study the structure.
* Llama-3.2-1B
```
LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 2048)
    (layers): ModuleList(
      (0-15): 16 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (up_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (down_proj): Linear(in_features=8192, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((2048,), eps=1e-05)
    (rotary_emb): LlamaRotaryEmbedding()
  )
  (lm_head): Linear(in_features=2048, out_features=128256, bias=False)
)
```


The model follows the typical structure of modern Llama models, consisting of blocks made up of an Attention layer and an MLP layer with a GLU structure.

> If you want to see an example of how to perform pruning on the MLP layers of the model, you can check out the notebook [Pruning Llama 3.2](https://github.com/peremartra/Large-Language-Model-Notebooks-Course/blob/main/6-PRUNING/6_3_pruning_structured_llama3.2-1b_OK.ipynb) and read the paper [Exploring GLU expansion ratios: Structured pruning in Llama-3.2 models](https://osf.io/preprints/osf/qgxea).



Since each block bundles an attention layer with an MLP layer, the attention layer cannot be removed without also removing the accompanying MLP layer. For this reason, the attention layer stays in the model and its execution is deactivated during inference.

The 1B model has 16 layers, as shown in the structure above, while the 3B model has 28 layers.


# Inference function & Test Base Model

The `get_output` function is designed to generate text and measure the time taken for different stages of the generation process.

It provides insights into the performance of the model and can be used to evaluate the efficiency of text generation.

### Code Explanation: Measuring Time for Tokenization, Generation, and Decoding in LLM

- **Function Purpose**:  
  The `get_output` function measures the time taken for tokenization, text generation, and decoding in a language model (LLM) for a given prompt.

- **Inputs**:  
  - `prompt`: Text input to generate a response.
  - `model`: Pre-trained language model.
  - `tokenizer`: Tokenizer for encoding/decoding text.
  - `num_runs`: Number of times to repeat the process (default: 1).
  - `max_length`: Maximum total length in tokens, prompt included (default: 50).

- **Steps**:
  1. **Tokenization**:  
     - The prompt is tokenized using the tokenizer and moved to the specified device (CPU/GPU).
     - Time taken for tokenization is measured.
  2. **Text Generation**:  
     - The model generates text using the input token IDs with specific settings (e.g., beam search, no sampling, no repetition of 2-grams).
     - Time taken for generation is measured.
  3. **Decoding**:  
     - The generated token IDs are decoded back into readable text.
     - Time taken for decoding is measured.

- **Output**:
  - For each run, the time for tokenization, generation, and decoding is printed.
  - After multiple runs, the average total time is displayed.
  - The generated text from the model is returned.

- **Performance Metrics**:  
  - `Tokenization time`, `Generation time`, and `Decoding time` are printed in milliseconds.
  - Total time for all steps is calculated and displayed.

- **Use Case**:  
  This function is useful for benchmarking the performance of the model, especially when evaluating time spent on each stage of text generation.


In [7]:
import time

def get_output(prompt, model=model, tokenizer=tokenizer, num_runs=1, max_length=50):
    total_time = 0
    generated_outputs = []

    for run in range(num_runs):
        # Start timing
        start_time = time.time()

        # Tokenization time
        token_start = time.time()
        inputs = tokenizer(prompt, return_tensors='pt').to(device)
        token_time = time.time() - token_start

        # Generation time
        gen_start = time.time()
        outputs = model.generate(
            inputs['input_ids'],
            attention_mask=inputs['attention_mask'],
            max_length=max_length,
            num_return_sequences=1,
            pad_token_id=tokenizer.pad_token_id,
            temperature=None,
            top_p=None,
            do_sample=False,  # Disable sampling
            num_beams=5,      # Use beam search
            early_stopping=True,  # Stop beam search once num_beams finished candidates exist
            no_repeat_ngram_size=2  # Prevent repetition of 2-grams
        )
        gen_time = time.time() - gen_start

        # Decoding time
        decode_start = time.time()
        generated = tokenizer.decode(outputs[0], skip_special_tokens=True)
        decode_time = time.time() - decode_start

        # Total time for this run
        total_time += time.time() - start_time
        generated_outputs.append(generated)

        if num_runs > 1:
            print(f"\nRun {run + 1}:")
        print(f"Tokenization time: {token_time*1000:.2f} ms")
        print(f"Generation time: {gen_time*1000:.2f} ms")
        print(f"Decoding time: {decode_time*1000:.2f} ms")
        print(f"Total time: {(time.time() - start_time)*1000:.2f} ms")

    if num_runs > 1:
        avg_time = total_time / num_runs
        print(f"\nAverage time over {num_runs} runs: {avg_time*1000:.2f} ms")

    return generated_outputs[0] if num_runs == 1 else generated_outputs
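
The per-stage timing pattern used in `get_output` can also be factored into a small reusable helper. The sketch below is only an illustration and not part of the original notebook: `StageTimer` is a hypothetical name, the three stages are simulated with stand-in operations, and it uses `time.perf_counter`, a monotonic clock that is generally better suited to measuring intervals than `time.time`. On a GPU you may additionally want to call `torch.cuda.synchronize()` before reading the clock, because CUDA kernels run asynchronously.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Collects wall-clock durations (in milliseconds) per named stage.

    Hypothetical helper for illustration; not part of the original notebook.
    """

    def __init__(self):
        self.durations_ms = {}

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            # Accumulate so the same stage can be timed across several runs.
            self.durations_ms[name] = self.durations_ms.get(name, 0.0) + elapsed_ms

    def total_ms(self):
        return sum(self.durations_ms.values())


timer = StageTimer()
with timer.stage("tokenization"):
    tokens = "Paris is the capital of".split()  # stand-in for tokenizer(...)
with timer.stage("generation"):
    time.sleep(0.01)                            # stand-in for model.generate(...)
with timer.stage("decoding"):
    text = " ".join(tokens)                     # stand-in for tokenizer.decode(...)

for name, ms in timer.durations_ms.items():
    print(f"{name} time: {ms:.2f} ms")
print(f"Total time: {timer.total_ms():.2f} ms")
```

Because the durations are accumulated per stage name, wrapping the same stages inside a loop over several runs gives per-stage totals that can be divided by the number of runs.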

In [8]:
# Test the original model
prompt = "Paris is the capital of"
generated = get_output(prompt, num_runs=2)
print(f"Generated text: {generated}")

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.



Run 1:
Tokenization time: 305.41 ms
Generation time: 45370.79 ms
Decoding time: 0.36 ms
Total time: 45676.72 ms

Run 2:
Tokenization time: 0.79 ms
Generation time: 42755.08 ms
Decoding time: 0.22 ms
Total time: 42756.24 ms

Average time over 2 runs: 44216.33 ms
Generated text: ['Paris is the capital of France. It is located in the north-central part of the country, on the river Seine. The city has a population of over 2 million people, making it the largest city in France and the second-largest city', 'Paris is the capital of France. It is located in the north-central part of the country, on the river Seine. The city has a population of over 2 million people, making it the largest city in France and the second-largest city']


The text generation of the original model, as expected, works perfectly and returns a correct and meaningful sentence.

In [9]:
model.to("cpu")               # move the weights back to CPU memory
torch.cuda.empty_cache()      # release cached GPU blocks held by the allocator

# Pruning the Model.

In [10]:
import torch
import torch.nn as nn
from torch.nn import functional as F
from copy import deepcopy

This function `measure_unpruned_layer_importances` is designed to calculate importance scores for the attention layers in a model.

The basic idea is: if a layer's output is very similar to its input, it might not be doing much important work and could be a candidate for pruning. To check the difference I'm using the cosine similarity.

### Code Explanation: Measuring Importance of Unpruned Layers in a Pruned Model

- **Function Purpose**:  
  The `measure_unpruned_layer_importances` function measures the importance of layers in a pruned model. It calculates importance scores for unpruned (non-bypassed) attention layers by comparing the cosine similarity between each layer's input and output during a forward pass.

- **Inputs**:  
  - `pruned_model`: A model with certain attention layers pruned (dropped).
  - `tokenizer`: A tokenizer to process the input text.
  - `input_text`: The input text for the model to process.

- **Steps**:

  1. **Preparation**:
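     - The model is set to evaluation mode using `eval()`, which disables training-only behavior such as dropout. Gradient computation is switched off separately with `torch.no_grad()` during the forward pass.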
     - The input text is tokenized into tensors suitable for the model.

  2. **Identifying Unpruned Layers**:
     - A list of unpruned layers is created by checking which layers are not in the `drop_attn_list` (stored in the model’s configuration).
  
  3. **Creating Hooks for Layer Inputs/Outputs**:
     - Two hooks are defined:
       - `q_proj_input_hook`: Captures the input to the query projection (`q_proj`) layer.
       - `o_proj_output_hook`: Captures the output from the output projection (`o_proj`) layer.
     - These hooks are registered for each unpruned layer to capture their inputs and outputs during the forward pass.

  4. **Forward Pass**:
     - The input is passed through the model using `torch.no_grad()` to avoid computing gradients.
     - The hooks capture the inputs and outputs of the unpruned layers.

  5. **Removing Hooks**:
     - The hooks are removed after the forward pass to prevent memory leaks or interference with further operations.

  6. **Computing Importance Scores**:
     - For each unpruned layer, the cosine similarity between its input and output is computed.
     - **Cosine Similarity**: Measures how similar the input and output are. A high similarity suggests the layer does not significantly transform the input.
     - **Importance Score**: Calculated as `1 - similarity`. A higher score indicates the layer plays a more important role in transforming the input.

- **Output**:
  - A list of tuples containing the layer index and its corresponding importance score.
  - Each layer’s importance score is printed as `Layer {idx} importance score: {importance_score:.4f}`.

- **Use Case**:  
  This function helps to evaluate which unpruned layers contribute significantly to the model's output. Layers with higher importance scores are more critical to the model's decision-making process.
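
To make the score concrete, here is a small dependency-free worked example of `importance = 1 - cosine_similarity`. It is only a sketch: the helper names are hypothetical, and plain Python lists stand in for the flattened activation tensors that the notebook compares with `F.cosine_similarity`.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def importance(layer_input, layer_output):
    """Importance as defined in the text: 1 - cosine similarity."""
    return 1.0 - cosine_similarity(layer_input, layer_output)


x = [1.0, 2.0, 3.0]

same_direction = [2.0, 4.0, 6.0]   # output is a rescaled copy of the input
orthogonal = [3.0, 0.0, -1.0]      # dot product with x is 0
opposite = [-1.0, -2.0, -3.0]      # output points the other way

for name, y in [("same direction", same_direction),
                ("orthogonal", orthogonal),
                ("opposite", opposite)]:
    print(f"{name}: importance = {importance(x, y):.4f}")
```

The three cases land at approximately 0, 1, and 2, which shows that the score ranges from 0 to 2 rather than from 0 to 1. A score above 1 therefore means the cosine similarity between input and output was negative.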


In [11]:
def measure_unpruned_layer_importances(pruned_model, tokenizer, input_text):
    """
    Measures and returns importance scores for all unpruned (non-bypassed) layers.
    """
    # PREPARATION
    """
    Set the model to evaluation mode so that training-only behavior
    (such as dropout) is disabled. Gradients are switched off separately
    by torch.no_grad() during the forward pass below.
    """
    pruned_model.eval()
    device = next(pruned_model.parameters()).device

    """
    The provided input text (input_text) is tokenized into tensors
    suitable for processing by the model.
    """
    inputs = tokenizer(input_text, return_tensors="pt").to(device)

    """This will hold tuples of (layer_idx, importance_score)"""
    importance_scores = []

    # IDENTIFY UNPRUNED LAYERS & CREATING HOOKS
    """
    We'll register hooks for only layers that are NOT in drop_attn_list
    The list of attention layers that have already been pruned,
    is stored in a variable in the model's config: pruned_model.config.drop_attn_list.
    """
    unpruned_layer_indices = [
        idx for idx in range(len(pruned_model.model.layers))
        if idx not in pruned_model.config.drop_attn_list
    ]

    """
    Temporary storage for each layer's input/output
    We'll store them by layer index
    """
    layer_inputs = {}
    layer_outputs = {}

    """
    Create 2 hooks to capture the input and the output of the layers.
    These hooks store the inputs and outputs in dictionaries
    (layer_inputs and layer_outputs) for later analysis
    """
    # Captures the input to the query projection (q_proj)
    def q_proj_input_hook(layer_idx):
        def _hook(module, module_input):
            # Forward pre-hooks receive the positional arguments as a tuple
            inp = module_input[0] if isinstance(module_input, tuple) else module_input
            layer_inputs[layer_idx] = inp.detach().clone()
        return _hook

    # Captures the output from the output projection (o_proj)
    def o_proj_output_hook(layer_idx):
        def _hook(module, module_input, module_output):
            out = module_output[0] if isinstance(module_output, tuple) else module_output
            layer_outputs[layer_idx] = out.detach().clone()
        return _hook

    # Register hooks for each unpruned layer
    handles = []
    for idx in unpruned_layer_indices:
        layer = pruned_model.model.layers[idx]
        handles.append(layer.self_attn.q_proj.register_forward_pre_hook(q_proj_input_hook(idx)))
        handles.append(layer.self_attn.o_proj.register_forward_hook(o_proj_output_hook(idx)))

    # FORWARD PASS
    """
    Single forward pass (no gradient needed)
    A single forward pass is performed on the input text.
    During this pass, the hooks capture the inputs and outputs of the unpruned layers.
    This step is done with torch.no_grad(),
    ensuring no gradients are calculated, which saves memory and computation.
    """
    with torch.no_grad():
        _ = pruned_model(**inputs)

    """
    The hooks are removed after the forward pass
    to avoid memory leaks or interference with subsequent operations.
    """
    for h in handles:
        h.remove()


    #COMPUTE IMPORTANCE SCORES
    """
    For each unpruned layer, the inputs and outputs are flattened into vectors for comparison.

    Cosine Similarity: The similarity between the input and output vectors is
    computed using cosine similarity. Layers with outputs that are very similar
    to their inputs likely contribute less to the model’s overall computation.

    Importance Score: The importance score for each layer is calculated as 1−similarity
    A higher score indicates that the layer transforms its input significantly
    and is therefore more important to the model's function.
    """
    for idx in unpruned_layer_indices:
        if idx in layer_inputs and idx in layer_outputs:
            inp = layer_inputs[idx]
            out = layer_outputs[idx]

            inp_flat = inp.view(inp.size(0), -1)
            out_flat = out.view(out.size(0), -1)

            similarity = F.cosine_similarity(inp_flat, out_flat, dim=1).mean().item()
            importance_score = 1 - similarity
            importance_scores.append((idx, importance_score))

            print(f"Layer {idx} importance score: {importance_score:.4f}")

    """A list of tuples is returned, where each tuple contains the layer index
    and its calculated importance score."""
    return importance_scores
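
The hook mechanism is easier to see on a tiny module. The sketch below is an illustration under assumptions, not the notebook's code: `ToyAttention` is a hypothetical stand-in for an attention block with `q_proj` and `o_proj` linear layers (it performs no real attention), and the tensor sizes are arbitrary. It registers the same two kinds of hooks (`register_forward_pre_hook` and `register_forward_hook`), runs one forward pass, removes the hooks, and computes `1 - cosine similarity`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyAttention(nn.Module):
    """Hypothetical stand-in for an attention block (no real attention math)."""

    def __init__(self, hidden=8):
        super().__init__()
        self.q_proj = nn.Linear(hidden, hidden, bias=False)
        self.o_proj = nn.Linear(hidden, hidden, bias=False)

    def forward(self, x):
        return self.o_proj(torch.tanh(self.q_proj(x)))


torch.manual_seed(0)
block = ToyAttention().eval()
captured = {}


def save_input(module, args):
    # Forward pre-hooks receive the positional arguments as a tuple.
    captured["input"] = args[0].detach().clone()


def save_output(module, args, output):
    # Forward hooks additionally receive the module's output.
    captured["output"] = output.detach().clone()


handles = [
    block.q_proj.register_forward_pre_hook(save_input),
    block.o_proj.register_forward_hook(save_output),
]

x = torch.randn(1, 5, 8)  # (batch, sequence, hidden)
with torch.no_grad():
    block(x)

for handle in handles:
    handle.remove()  # the hooks no longer fire after this point

# Flatten each batch element to one vector, as the notebook function does.
inp = captured["input"].reshape(x.size(0), -1)
out = captured["output"].reshape(x.size(0), -1)
score = 1 - F.cosine_similarity(inp, out, dim=1).mean().item()
print(f"Toy importance score: {score:.4f}")
```

Storing detached clones keeps the captured activations independent of later in-place changes, and removing the handles restores the module to its unhooked state.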


The function `bypass_single_layer` is used to disable the attention mechanism of a specific layer in the model without permanently removing or modifying the layer.

This is achieved by dynamically overriding the layer’s forward method to bypass its attention computation.

Because each attention layer is grouped with an MLP layer inside a decoder block, we cannot simply remove an attention layer without also removing the associated MLP layer. What we can do is bypass it.

The bypassed layer skips computationally expensive attention operations, reducing inference time and memory usage.


### Code Explanation: Bypassing a Single Attention Layer in a Pruned Model

- **Function Purpose**:  
  The `bypass_single_layer` function modifies a specified attention layer in a pruned model to bypass (skip) the layer's computation during the forward pass.

- **Inputs**:  
  - `pruned_model`: The pruned model that has certain layers removed.
  - `layer_idx`: The index of the attention layer to bypass.

- **Steps**:

  1. **Accessing the Layer**:
     - The layer at the specified `layer_idx` is accessed from the model’s layers.

  2. **Bypass Flag Setup**:
     - The list `drop_attn_list` from the model's configuration is retrieved, which holds indices of layers to be skipped.
     - A `layer_idx` is assigned to the layer’s attention submodule to track its position.

  3. **Store Original Forward Method**:
     - If the `self_attn` (self-attention) module does not already have a stored original forward method, it is saved as `_original_forward`. This ensures we can return to the original behavior later.

  4. **Set Bypass Flag**:
     - A custom attribute `_should_bypass` is added to the attention layer and set to `True`. This flag marks the layer as bypassed; the bypass decision itself is taken in the new forward method by checking `drop_attn_list`.

  5. **Define New Forward Method**:
     - A new `new_attention_forward` method is defined that:
       - Checks if the layer's index is in the `drop_attn_list`.
       - If it is, it returns the hidden states without applying the attention mechanism (bypassing the layer).
       - Otherwise, it calls the stored original forward method (`_original_forward`) to process the attention normally.

  6. **Set New Forward Method**:
     - The layer’s attention module's forward method is replaced with the newly defined `new_attention_forward`.

- **Output**:
  - The specified attention layer is now bypassed during the forward pass of the model, meaning its computations are skipped.

- **Use Case**:  
  This function is useful in pruning experiments or optimizing models, where certain layers (e.g., attention layers) can be skipped to reduce computational overhead or analyze the effects of layer removal.
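
The mechanism behind step 6, binding a plain function to one specific object with `__get__`, can be shown without any model. This is a minimal sketch with hypothetical names (`ToyAttention`, `bypass`): only the patched instance changes behavior, the class and every other instance keep the original `forward`, and the saved `_original_forward` remains available as a fallback.

```python
class ToyAttention:
    """Hypothetical stand-in for an attention module."""

    def __init__(self, layer_idx):
        self.layer_idx = layer_idx

    def forward(self, hidden_states):
        # Pretend attention: transform the input and return (output, weights).
        return [h * 2 for h in hidden_states], "attn_weights"


def bypass(attn, drop_list):
    """Patch one instance so it skips its computation when listed in drop_list."""
    if not hasattr(attn, "_original_forward"):
        attn._original_forward = attn.forward  # keep the bound original

    def new_forward(self, hidden_states, **kwargs):
        if self.layer_idx in drop_list:
            return hidden_states, None          # short-circuit: input passes through
        return self._original_forward(hidden_states, **kwargs)

    # __get__ turns the function into a method bound to this instance only.
    attn.forward = new_forward.__get__(attn, type(attn))


drop_list = [1]
layers = [ToyAttention(0), ToyAttention(1)]
bypass(layers[1], drop_list)

print(layers[0].forward([1, 2]))   # unpatched layer runs its own forward
print(layers[1].forward([1, 2]))   # patched layer, index 1 is in drop_list

drop_list.remove(1)                # the closure sees the same list object
print(layers[1].forward([1, 2]))   # falls back to the original forward
```

Note that the closure holds a reference to the list rather than a copy, so later changes to the list immediately change which layers are skipped.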


In [12]:
def bypass_single_layer(pruned_model, layer_idx):
    """
    Modifies the specified layer's forward method so that attention is bypassed.
    """
    layer = pruned_model.model.layers[layer_idx]

    # Keep a reference to the drop list; the closure below reads it on every call
    skip = pruned_model.config.drop_attn_list
    layer.self_attn.layer_idx = layer_idx

    # Store the original forward (only once, so repeated calls do not lose it).
    if not hasattr(layer.self_attn, '_original_forward'):
        layer.self_attn._original_forward = layer.self_attn.forward

    # Mark the attention layer as bypassed (informational flag)
    layer.self_attn._should_bypass = True

    # A new forward that skips attention when the layer index is in the drop list
    def new_attention_forward(attn, hidden_states, attention_mask=None, position_ids=None,
                    past_key_value=None, output_attentions=False, use_cache=False,
                    **kwargs):
        if attn.layer_idx in skip:
            # short-circuit the pruned layer: return the input and no attention weights
            return (hidden_states, None)
        # Fall back to the stored original forward. Keyword arguments avoid
        # depending on the positional order of the library's signature.
        return attn._original_forward(hidden_states=hidden_states,
                                      attention_mask=attention_mask,
                                      position_ids=position_ids,
                                      past_key_value=past_key_value,
                                      output_attentions=output_attentions,
                                      use_cache=use_cache,
                                      **kwargs)
    # Bind the new forward to this attention module only
    layer.self_attn.forward = new_attention_forward.__get__(layer.self_attn,
                                                  type(layer.self_attn))



### Code Explanation: One-Shot Pruning of Attention Layers in a Model

- **Function Purpose**:  
  The `one_shot_pruning_inplace` function performs pruning on a model by bypassing (removing) the least important attention layers in a single step, without creating a copy of the model.

- **Inputs**:  
  - `model`: The model to be pruned.
  - `tokenizer`: Tokenizer for processing the input text.
  - `input_text`: The input text used to measure the importance of attention layers.
  - `num_layers_to_prune`: The number of attention layers to prune (bypass).

- **Steps**:

  1. **Save Device**:
     - The device (CPU/GPU) of the model is saved for later use.
  
  2. **Set Up Pruning List**:
     - If the model does not have a `drop_attn_list` attribute in its configuration, it is initialized as an empty list. This list will track the indices of pruned layers.

  3. **Measure Layer Importance**:
     - The function `measure_unpruned_layer_importances` is called to compute the importance scores of all unpruned layers.
  
  4. **Check for Enough Layers to Prune**:
     - If the requested number of layers to prune (`num_layers_to_prune`) exceeds the number of layers with importance scores, an error is raised.

  5. **Sort and Select Layers to Bypass**:
     - The layers are sorted by their importance scores in ascending order.
     - The least important layers (with the lowest scores) are selected to be pruned.

  6. **Bypass Selected Layers**:
     - For each layer selected to be pruned:
       - The layer's index is added to the `drop_attn_list`.
       - The `bypass_single_layer` function is called to modify the layer's forward method, effectively bypassing it during the forward pass.
       - A message is printed indicating the layer that was bypassed and its importance score.

  7. **Print Bypassed Layers**:
     - The final list of bypassed layers is printed in ascending order of their indices.

- **Output**:
  - The modified model with the selected attention layers bypassed (pruned).
  
- **Use Case**:  
  This function is useful for one-shot pruning, where a fixed number of the least important attention layers are removed from a model to improve efficiency or reduce computation while maintaining performance.


In [13]:
def one_shot_pruning_inplace(model, tokenizer, input_text, num_layers_to_prune):
    """
    Performs pruning on the original model without creating a copy.
    """
    # Save original device (should be CPU in your case)
    device = next(model.parameters()).device

    # Set up pruning list
    if not hasattr(model.config, 'drop_attn_list'):
        model.config.drop_attn_list = []

    # Measure importance
    scores = measure_unpruned_layer_importances(model, tokenizer, input_text)

    if len(scores) < num_layers_to_prune:
        raise ValueError("Requested more layers to prune than exist")

    # Sort and select layers to bypass
    scores.sort(key=lambda x: x[1])  # ascending
    layers_to_bypass = [idx for idx, _ in scores[:num_layers_to_prune]]

    # Bypass selected layers
    for idx in layers_to_bypass:
        model.config.drop_attn_list.append(idx)
        bypass_single_layer(model, idx)
        print(f"Bypassing layer {idx} with importance score {dict(scores)[idx]:.4f}")

    print(f"Bypassed layers: {sorted(model.config.drop_attn_list)}")

    return model

## Execute Pruning.

**Disclaimer**

I'm using a single illustrative prompt so that the code path is easy to follow. In any research or production setting, you should feed hundreds or thousands of diverse prompts before deciding which layers to deactivate.
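
One hedged way to act on this advice is to average the per-layer scores over many prompts before ranking them. The sketch below assumes a scoring callable with the same return shape as `measure_unpruned_layer_importances` (a list of `(layer_idx, score)` tuples). `average_importances` and `fake_scores` are hypothetical names, and the fake scorer exists only so the example runs without a model.

```python
from collections import defaultdict


def average_importances(score_fn, prompts):
    """Average the (layer_idx, score) results of score_fn over several prompts."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for prompt in prompts:
        for layer_idx, score in score_fn(prompt):
            totals[layer_idx] += score
            counts[layer_idx] += 1
    return sorted((idx, totals[idx] / counts[idx]) for idx in totals)


def fake_scores(prompt):
    # Stand-in scorer: deterministic numbers derived from the prompt length.
    return [(0, 1.0), (1, 0.5 + 0.01 * len(prompt)), (2, 0.9)]


prompts = ["short", "a longer prompt", "the longest prompt of the three"]
averaged = average_importances(fake_scores, prompts)
least_important = min(averaged, key=lambda item: item[1])

print(averaged)
print(f"Least important layer: {least_important[0]}")
```

With the real model, the scorer could be something like `lambda p: measure_unpruned_layer_importances(model, tokenizer, p)`, and the averaged list would then replace the single-prompt scores in the ranking step.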

In [14]:
pruned_model = one_shot_pruning_inplace(
    model,
    tokenizer,
    "Hi I'm a sample text, used to calculate the cosine difference between input and output.",
    num_layers_to_prune=4
)

Layer 0 importance score: 1.0711
Layer 1 importance score: 0.9268
Layer 2 importance score: 0.9592
Layer 3 importance score: 0.9586
Layer 4 importance score: 0.9556
Layer 5 importance score: 1.0017
Layer 6 importance score: 1.0273
Layer 7 importance score: 1.0322
Layer 8 importance score: 1.1082
Layer 9 importance score: 1.1019
Layer 10 importance score: 1.0403
Layer 11 importance score: 0.9950
Layer 12 importance score: 1.0824
Layer 13 importance score: 1.0197
Layer 14 importance score: 0.9663
Layer 15 importance score: 0.9675
Layer 16 importance score: 0.9030
Layer 17 importance score: 0.9355
Layer 18 importance score: 1.0156
Layer 19 importance score: 0.7300
Layer 20 importance score: 0.8689
Layer 21 importance score: 0.9351
Layer 22 importance score: 0.8706
Layer 23 importance score: 0.7950
Layer 24 importance score: 0.8481
Layer 25 importance score: 0.8776
Layer 26 importance score: 0.7842
Layer 27 importance score: 0.9855
Bypassing layer 19 with importance score 0.7300
Bypassing 
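
The selection step can be reproduced offline from the scores printed above. This sketch copies those logged values into a plain dictionary (the variable names are illustrative) and picks the four lowest, mirroring the sort-and-slice in `one_shot_pruning_inplace`.

```python
# Importance scores copied from the log above (layer index -> score).
logged_scores = {
    0: 1.0711, 1: 0.9268, 2: 0.9592, 3: 0.9586, 4: 0.9556, 5: 1.0017,
    6: 1.0273, 7: 1.0322, 8: 1.1082, 9: 1.1019, 10: 1.0403, 11: 0.9950,
    12: 1.0824, 13: 1.0197, 14: 0.9663, 15: 0.9675, 16: 0.9030, 17: 0.9355,
    18: 1.0156, 19: 0.7300, 20: 0.8689, 21: 0.9351, 22: 0.8706, 23: 0.7950,
    24: 0.8481, 25: 0.8776, 26: 0.7842, 27: 0.9855,
}

num_layers_to_prune = 4
ranked = sorted(logged_scores.items(), key=lambda item: item[1])  # ascending
layers_to_bypass = [idx for idx, _ in ranked[:num_layers_to_prune]]

print(layers_to_bypass)  # [19, 26, 23, 24]
```

The resulting order matches the `drop_attn_list` that the notebook later reads back from the saved configuration.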

# Test Pruned Models


Now, let's test the pruned model, which is a Llama-3.2-3B model where I have marked 4 Attention layers to be bypassed.

In [15]:
# Test the pruned model
pruned_model = pruned_model.to(device) #Move the model to GPU again.
generated = get_output(prompt, pruned_model, num_runs=2)
print(f"Generated text: {generated}")

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.



Run 1:
Tokenization time: 0.45 ms
Generation time: 19553.47 ms
Decoding time: 0.29 ms
Total time: 19554.41 ms

Run 2:
Tokenization time: 0.88 ms
Generation time: 20109.66 ms
Decoding time: 0.35 ms
Total time: 20111.05 ms

Average time over 2 runs: 19832.55 ms
Generated text: ['Paris is the capital of France and also its largest city. It is also one of the most visited tourist destination worldwide with millions of tourists visiting every year. There are many things to do in Paris including sightseeing tours, shopping malls, museums etc', 'Paris is the capital of France and also its largest city. It is also one of the most visited tourist destination worldwide with millions of tourists visiting every year. There are many things to do in Paris including sightseeing tours, shopping malls, museums etc']



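In this run the pruned model is markedly faster than the base model: the average over two runs drops from about 44.2 s to about 19.8 s. With only two runs per model, part of that gap may come from run-to-run variation rather than from the four bypassed layers alone. The generated text is fairly accurate, although some repetition can be noticed (for example, "also" appears in two consecutive sentences).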

# Store the Model.


In [16]:
new_model_name = 'attnprun-llama-3.2-3B'
output_dir = './'+new_model_name
if not os.path.exists(output_dir):
    os.makedirs(output_dir)

pruned_model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
#new_config.save_pretrained(output_dir)
print(f"Pruned model saved to {output_dir}")

Pruned model saved to ./attnprun-llama-3.2-3B


In [17]:
# 2. Check that config contains layers to skip
from transformers import AutoConfig
config = AutoConfig.from_pretrained(output_dir)

if hasattr(config, "drop_attn_list"):
    print(f"drop_attn_list stored: {config.drop_attn_list}")
else:
    print("drop_attn_list isn't present.")


drop_attn_list stored: [19, 26, 23, 24]


## Upload to Hugging Face.

Uploading this model to Hugging Face is slightly more complex, because we must store not only the model itself but also the code of the `_bypass_single_layer` function. As you will recall, this is the function that decides whether an attention layer is executed or simply bypassed.

In [18]:
from huggingface_hub import HfApi, upload_folder, whoami, create_repo

In [19]:
# Step 1: Get your HF username from the current token
username = whoami()["name"]  # Returns a dict like {'name': 'your_username', 'email': ...}
username

In [20]:
# Step 2: Define repo name
repo_id = f"{username}/{new_model_name}"

In [21]:
# Step 3: Define path to your model
output_dir = "./"+new_model_name


The function must be saved in a .py file, but since this notebook runs on Colab, I’ve decided the best approach is to create a cell that generates the file to be uploaded.

The file contains the custom class PrunedLlamaForCausalLM, which extends Hugging Face’s LlamaForCausalLM.

This custom class calls the base constructor, ensuring that the model's configuration file includes the drop_attn_list, which specifies the layers that should be skipped.

The forward function is modified only for the layers that need to be skipped; the rest continue executing their standard forward function.


In [22]:
custom_model_code = '''
from transformers.models.llama.modeling_llama import LlamaForCausalLM

class PrunedLlamaForCausalLM(LlamaForCausalLM):
    def __init__(self, config):
        super().__init__(config)
        if not hasattr(config, "drop_attn_list"):
            config.drop_attn_list = []

        for idx in config.drop_attn_list:
            self._bypass_single_layer(idx)

    def _bypass_single_layer(self, layer_idx):
        """
        Modifies the specified layer's forward method so that attention is bypassed.
        """
        layer = self.model.layers[layer_idx]

        # Keep a reference to the drop list; the closure below reads it on every call
        skip = self.config.drop_attn_list
        layer.self_attn.layer_idx = layer_idx

        # Store the original forward.
        if not hasattr(layer.self_attn, '_original_forward'):
            layer.self_attn._original_forward = layer.self_attn.forward

        # Mark the attention layer as bypassed (informational flag)
        layer.self_attn._should_bypass = True

        # A new forward that skips attention when the layer index is in the drop list
        def new_attention_forward(attn, hidden_states, attention_mask=None, position_ids=None,
                        past_key_value=None, output_attentions=False, use_cache=False,
                        **kwargs):
            if attn.layer_idx in skip:
                # short-circuit the pruned layer
                return (hidden_states, None)
            # Fall back to the stored original forward, using keyword arguments
            return attn._original_forward(hidden_states=hidden_states,
                                          attention_mask=attention_mask,
                                          position_ids=position_ids,
                                          past_key_value=past_key_value,
                                          output_attentions=output_attentions,
                                          use_cache=use_cache,
                                          **kwargs)
        # Bind the new forward to this attention module only
        layer.self_attn.forward = new_attention_forward.__get__(layer.self_attn,
                                                      type(layer.self_attn))

'''

# Define path and write the file
os.makedirs(output_dir, exist_ok=True)
with open(os.path.join(output_dir, "modeling_attnprun_llama.py"), "w") as f:
    f.write(custom_model_code.strip())

print("Custom model script modeling_attnprun_llama.py created successfully.")
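The instance-level patching used in `_bypass_single_layer` can be tried in isolation. The sketch below is only an illustration: `ToyAttention`, `bypass_attention` and `restore_attention` are hypothetical names that do not exist in the notebook or in Transformers, and plain lists stand in for tensors. It shows the two ideas the custom class relies on: binding a replacement `forward` to a single instance, and keeping the original around so the change can be undone.

```python
import types


class ToyAttention:
    """Hypothetical stand-in for an attention module (not the real LlamaAttention)."""

    def __init__(self, layer_idx):
        self.layer_idx = layer_idx

    def forward(self, hidden_states, **kwargs):
        # Pretend attention transforms its input and returns (output, weights).
        return [2 * x for x in hidden_states], "weights"


def bypass_attention(attn, skip):
    """Swap in a forward that returns its input unchanged for layers in `skip`."""
    if not hasattr(attn, "_original_forward"):
        attn._original_forward = attn.forward  # keep the bound original

    def new_forward(self, hidden_states, **kwargs):
        if self.layer_idx in skip:
            return hidden_states, None
        return self._original_forward(hidden_states, **kwargs)

    # Bind the function to this one instance; other instances are untouched.
    attn.forward = types.MethodType(new_forward, attn)


def restore_attention(attn):
    """Undo the bypass by putting the stored forward back."""
    if hasattr(attn, "_original_forward"):
        attn.forward = attn._original_forward
        del attn._original_forward


pruned, kept = ToyAttention(0), ToyAttention(1)
bypass_attention(pruned, skip=[0])
```

After this runs, `pruned.forward(...)` hands its input straight back while `kept.forward(...)` still does the toy computation. Calling `restore_attention(pruned)` brings the original behavior back.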


Next, the model's configuration file is updated with the `auto_map` field, which tells the Transformers library which class to use to construct the model: `modeling_attnprun_llama.PrunedLlamaForCausalLM`.


In [23]:
import json
import os

# Path to the config file
config_path = os.path.join(output_dir, "config.json")

# Load the existing config
with open(config_path, "r") as f:
    config = json.load(f)

# Add or update the auto_map section
config["auto_map"] = {
    "AutoModelForCausalLM": "modeling_attnprun_llama.PrunedLlamaForCausalLM"
}

# Optional: ensure the architecture field is aligned
config["architectures"] = ["PrunedLlamaForCausalLM"]

# Save the updated config
with open(config_path, "w") as f:
    json.dump(config, f, indent=2)

print("config.json updated with auto_map and architecture.")
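Before uploading, it can be worth checking that every `auto_map` entry points at a script and class that really exist in the folder, since a mismatch only shows up later when the model is loaded. A minimal sketch, assuming each value has the local `module.ClassName` form used above; `check_auto_map` is a hypothetical helper, not part of Transformers, and it only does a textual search for the class definition.

```python
import json
import os


def check_auto_map(folder):
    """Return a list of problems found in folder/config.json's auto_map (empty if none)."""
    with open(os.path.join(folder, "config.json"), "r") as f:
        config = json.load(f)

    problems = []
    auto_map = config.get("auto_map", {})
    if not auto_map:
        problems.append("config.json has no auto_map entry")

    for auto_class, target in auto_map.items():
        # Each value is assumed to have the form "<module_name>.<ClassName>"
        module_name, _, class_name = target.rpartition(".")
        module_path = os.path.join(folder, module_name + ".py")
        if not os.path.isfile(module_path):
            problems.append(f"{auto_class}: missing file {module_name}.py")
            continue
        with open(module_path, "r") as f:
            if f"class {class_name}" not in f.read():
                problems.append(f"{auto_class}: {class_name} not defined in {module_name}.py")
    return problems
```

In this notebook the call would be `check_auto_map(output_dir)`, and an empty list means the config and the script agree.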

Time to upload the folder containing the weights, the config file, and the custom modeling script to the Hugging Face Hub.

In [24]:
create_repo(repo_id=repo_id, repo_type="model", exist_ok=True)

In [25]:
# Step 4: Upload the folder to the Hub
upload_folder(
    folder_path=output_dir,
    path_in_repo="",  # Upload everything to root
    repo_id=repo_id,
    repo_type="model"
)

print(f"Model uploaded successfully to https://huggingface.co/{repo_id}")

## Download model from Hugging Face.

In [26]:
import gc

# 1. Drop the Python references to the models and the tokenizer
del pruned_model
del tokenizer
del model

# 2. Force Python garbage collection so the tensors are actually released
gc.collect()

# 3. Free the GPU cache
torch.cuda.empty_cache()
torch.cuda.ipc_collect()  # Optional, helps in Colab

In [27]:
!huggingface-cli cache purge --yes

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


usage: huggingface-cli <command> [<args>]
huggingface-cli: error: argument {download,upload,repo-files,env,login,whoami,logout,auth,repo,lfs-enable-largefiles,lfs-multipart-upload,scan-cache,delete-cache,tag,version,upload-large-folder}: invalid choice: 'cache' (choose from 'download', 'upload', 'repo-files', 'env', 'login', 'whoami', 'logout', 'auth', 'repo', 'lfs-enable-largefiles', 'lfs-multipart-upload', 'scan-cache', 'delete-cache', 'tag', 'version', 'upload-large-folder')
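The error shows that `cache` is not a subcommand in the installed version of the CLI, so nothing was purged. `delete-cache` is in the list of valid choices, but as far as I know it asks for an interactive selection, which is awkward in a notebook. One alternative is to remove the cached repositories directly from disk. The sketch below assumes the default cache layout (`~/.cache/huggingface/hub`, overridable through the `HF_HUB_CACHE` or `HF_HOME` environment variables) and ignores `XDG_CACHE_HOME`; `resolve_hf_hub_cache` and `purge_hf_cache` are hypothetical helpers, not part of `huggingface_hub`.

```python
import os
import shutil


def resolve_hf_hub_cache():
    """Resolve the hub cache directory from the documented environment variables."""
    if os.environ.get("HF_HUB_CACHE"):
        return os.path.expanduser(os.environ["HF_HUB_CACHE"])
    hf_home = os.environ.get("HF_HOME") or os.path.join("~", ".cache", "huggingface")
    return os.path.join(os.path.expanduser(hf_home), "hub")


def purge_hf_cache(cache_dir=None):
    """Delete every cached repo under cache_dir; returns the names that were removed."""
    cache_dir = cache_dir or resolve_hf_hub_cache()
    removed = []
    if not os.path.isdir(cache_dir):
        return removed
    for entry in sorted(os.listdir(cache_dir)):
        path = os.path.join(cache_dir, entry)
        # Cached repos live in folders named like "models--<org>--<name>"
        if os.path.isdir(path) and entry.startswith(("models--", "datasets--", "spaces--")):
            shutil.rmtree(path)
            removed.append(entry)
    return removed


# Example (destructive, so left commented out):
# print(purge_hf_cache())
```

Deleting the cache forces the next `from_pretrained` call to download everything again, which is the point here: it confirms that the model really comes from the uploaded repository.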


The model is loaded from Hugging Face as usual, but you must set `trust_remote_code=True`, since the repository includes the custom modeling code created and uploaded earlier.


In [28]:
model_hf = AutoModelForCausalLM.from_pretrained(
    repo_id,
    trust_remote_code=True
)


In [29]:
tokenizer = AutoTokenizer.from_pretrained(repo_id)

In [30]:
model_hf = model_hf.to(device) #Move the model to GPU again.
generated = get_output(prompt, model_hf, num_runs=2)
print(f"Generated text: {generated}")
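Generated text alone does not prove that the downloaded model is running the custom class. A quick structural check is to list the layers whose attention module carries the `_should_bypass` flag set by `_bypass_single_layer` and compare them with `config.drop_attn_list`. This is a sketch under the assumption that the model exposes its decoder layers as `model.model.layers`, as Llama models do; `bypassed_layers` is a hypothetical helper.

```python
def bypassed_layers(model):
    """Return indices of decoder layers whose attention carries the bypass flag."""
    return [
        idx
        for idx, layer in enumerate(model.model.layers)
        if getattr(layer.self_attn, "_should_bypass", False)
    ]


# Example with the model loaded above (not executed here):
# assert bypassed_layers(model_hf) == sorted(model_hf.config.drop_attn_list)
```

If the list comes back empty, the most likely cause is that the stock Llama class was loaded instead of `PrunedLlamaForCausalLM`.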

# Conclusion.
Based on the findings in the paper and the results obtained, I believe this type of pruning may work better with larger models where attention layers tend to have redundancy.

Since this type of pruning does not alter the model's structure, it reduces neither the model's size nor the memory required to load it: the weights of the bypassed attention layers are still stored and loaded. The main advantage of this approach is the lower computational load during inference, because the bypassed layers no longer compute attention, which leads to faster responses.
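To put a number on the weights that stay in the checkpoint, the attention parameters of the bypassed layers can be counted from the model dimensions. The sketch below assumes grouped-query attention with bias-free `q`, `k`, `v` and `o` projections, which I believe matches the default Llama configuration; the helper names are hypothetical, and the dimensions should be read from `model.config` rather than typed in by hand.

```python
def attention_params_per_layer(hidden_size, num_heads, num_kv_heads, head_dim):
    """Weights in the q, k, v and o projections of one attention layer (no biases)."""
    q = hidden_size * num_heads * head_dim
    k = hidden_size * num_kv_heads * head_dim
    v = hidden_size * num_kv_heads * head_dim
    o = num_heads * head_dim * hidden_size
    return q + k + v + o


def retained_attention_params(drop_attn_list, **dims):
    """Parameters still stored in the checkpoint for the bypassed layers."""
    return len(drop_attn_list) * attention_params_per_layer(**dims)
```

With a loaded model, the inputs would come from `config.hidden_size`, `config.num_attention_heads`, `config.num_key_value_heads` and the per-head dimension, together with `config.drop_attn_list`.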

Unlike the original paper, which describes "removing" selected attention layers but provides limited implementation details, this implementation takes a transparent functional approach by explicitly overriding the `forward` method only in the specified layers. As a result, the model retains its full architecture and parameter set, but selectively skips computations at runtime. This makes the method reversible, modular, and fully compatible with the Hugging Face ecosystem using `trust_remote_code=True`. While both approaches achieve similar computational savings, this one emphasizes clarity, portability, and practical integration.
