In [None]:
<div>
    <h3>Example of approach to depth pruning a Llama 3.2.</h3>
</div>

* Pruning
* Structured pruning
* Depth pruning.


# Introduction
In this notebook, we will look at an example of depth pruning, which involves removing entire layers from the model.

The first thing to note is that removing entire layers from a transformer model usually has a significant impact on the model's performance. This is a much more drastic architectural change compared to the simple removal of neurons from the MLP layers.

For this reason, these models are not designed to be used directly after the pruning process. Instead, they will require a subsequent fine-tuning process to recover their capabilities.

#Install libraries & Configure variables.

In [1]:
!pip install -q transformers
!pip install -q torch
!pip install -q datasets
!pip install -q sentencepiece  # Required for LLaMA tokenizer
!pip install dotenv

Collecting dotenv
  Using cached dotenv-0.9.9-py2.py3-none-any.whl.metadata (279 bytes)
Collecting python-dotenv (from dotenv)
  Using cached python_dotenv-1.1.1-py3-none-any.whl.metadata (24 kB)
Using cached dotenv-0.9.9-py2.py3-none-any.whl (1.9 kB)
Using cached python_dotenv-1.1.1-py3-none-any.whl (20 kB)
Installing collected packages: python-dotenv, dotenv
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2/2[0m [dotenv]2m1/2[0m [dotenv]
[1A[2KSuccessfully installed dotenv-0.9.9 python-dotenv-1.1.1


In [2]:
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from torch import nn
from torch.utils.data import DataLoader
import os
from tqdm import tqdm

In [3]:
# Check if GPU is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

Using device: cuda


#Download model and explore structure

In [4]:
# model_name = 'meta-llama/Llama-3.2-1B'
# model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).to(device)
# tokenizer = AutoTokenizer.from_pretrained(model_name)
# #tokenizer.pad_token = tokenizer.eos_token  # Set pad token

In [5]:
import os
import torch
from dotenv import load_dotenv
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load environment variables from .env file
load_dotenv()

# Get configuration from environment variables
hf_token = os.getenv('HF_TOKEN')
model_name = os.getenv('MODEL_NAME', 'meta-llama/Llama-3.2-1B')

# Check if HF_TOKEN is provided
if not hf_token:
    raise ValueError("HF_TOKEN not found in environment variables. Please add it to your .env file.")

# Load model and tokenizer with authentication token
model = AutoModelForCausalLM.from_pretrained(
    model_name, 
    torch_dtype=torch.float16,
    token=hf_token
).to(device)

tokenizer = AutoTokenizer.from_pretrained(model_name, token=hf_token)

`torch_dtype` is deprecated! Use `dtype` instead!


In [6]:
def get_output(prompt, model=model, tokenizer=tokenizer):
    inputs = tokenizer(prompt, return_tensors='pt').to(device)
    outputs = model.generate(
        inputs['input_ids'],
        attention_mask=inputs['attention_mask'],
        max_length=50,
        num_return_sequences=1,
        pad_token_id=tokenizer.pad_token_id,
        temperature=None,
        top_p=None,
        do_sample=False,          # Disable sampling
        num_beams=5,              # Use beam search
        early_stopping=True,      # Stop when end-of-sequence token is generated
        no_repeat_ngram_size=2    # Prevent repetition of 2-grams
    )
    generated = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return generated

## studying the model structure
As demonstrated in the [previous notebook](https://github.com/peremartra/Large-Language-Model-Notebooks-Course/blob/main/6_2_pruning_structured_llama3.2-1b_KO.ipynb), studying the structure of the model that will undergo pruning is crucial.

In this notebook, we’re going to fine-tune the pruning process for the Llama3.2 model.

In [7]:
print(model)

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 2048)
    (layers): ModuleList(
      (0-15): 16 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (up_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (down_proj): Linear(in_features=8192, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((2048,), eps=1e-05)
    (rotary_emb):


Each layer of the model consists of the attention section and the MLP section. When we remove a layer, the entire layer is eliminated. This is a crucial point because finding a balance in selecting which layers to remove will be challenging.

Later, in future notebooks, you will learn about different techniques for selecting layers, based on their activation and how the model responds to a specific dataset.

In this notebook, you will explore three different layer selection techniques:

- Summing the magnitudes of all the weights in the layers.
- Removing the first layers of the model.
- Removing the last layers of the model.



In [8]:
# Test the original model
prompt = "Paris is the capital of"
generated = get_output(prompt)
print(f"Generated text: {generated}")

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Generated text: Paris is the capital of France and one of the most visited cities in the world. It is a city with a rich history and culture, as well as a vibrant and diverse population. Paris is home to many famous landmarks, including the Eiff


In [9]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters())


In [10]:
original_param_count = count_parameters(model)
print(f"Original model parameters: {original_param_count}")

Original model parameters: 1235814400


#Pruning the Model.
##Support pruning functions.

Here are three differeten methods I used to calculate wich layers to mantain.

The function prune_layers calculates the weight magnitude of each layer to remove those that, in principle, should contribute less to the model.

The other two functions focus on removing either the initial layers or the final layers.

The results obtained with 20% pruning using each method have been quite different:

* **prune_layers**: Generated text after pruning: Paris is the capital of & && &ththhth hhh h h shhs h th h f h % h t h

 h m  h   m
  sh  *
 n
* **prune_last_layers**: Paris is the capital of France and arguably one amongstworld renowned cities worldwide. Its uniqueness lies in its uniqueness itselfwhich makes it uniquely unique amongstotherworld renown cities globally.Its uniqueness resides mainly inits unique architecturewhichmakes itunique amongst otherworld
* **prune_first_layer**: Paris is the capital of & && &ththhth hhh h h shhs h th h f h % h t h
 h m  h   m
  sh  *
 n
* **Base model**: Paris is the capital of France and one of the most visited cities in the world. It is a city with a rich history and culture, as well as a vibrant and diverse population. Paris is home to many famous landmarks, including the Eiff

It is clear that the impact on the model has been quite catastrophic in all cases. The only model that managed to generate text—which, while not entirely coherent, at least resembles text—was the one where the final layers were removed.

In a Transformer model like this, layers are organized hierarchically: the initial layers (closer to the input) tend to capture basic language patterns (such as syntactic structure, common word combinations, etc.), while the intermediate and final layers refine these representations, capturing higher-order relationships, global coherence, and subtle semantic nuances.

Removing the initial layers directly undermines the foundation upon which more complex representations are built, leading the model to generate meaningless text sequences. Similarly, removing layers based on weight importance metrics (without considering their position or function) can eliminate layers critical for linguistic cohesion or contextual coherence.

On the other hand, removing the final layers, while resulting in a loss of some refinement and specialization capabilities, preserves the initial and middle layers that have already learned fundamental language rules and basic word dependencies.

However, this is just an empirical and highly simple test. Later, when we evaluate the model's performance using rankings, we will see that it retains a significant portion of its characteristics. Therefore, we are dealing with a model that can deliver very good results after a small fine-tuning process to recover some of the lost capabilities.

In [11]:
#DISCARTED: Eliminate layers with smaller absolute values of all parameters.
def compute_layer_importance(layer):
    """
    Compute the importance score of a layer by considering the sum of
    the absolute values of all its parameters (including attention and MLP).

    Args:
    - layer: A model layer (e.g., LlamaDecoderLayer).

    Returns:
    - layer_importance: A scalar importance score for the entire layer.
    """
    # Initialize a tensor for cumulative sum of absolute parameters
    device = layer.parameters().__next__().device
    layer_importance = torch.tensor(0.0, device=device)

    # Accumulate the absolute values of all parameters in the layer
    for param in layer.parameters():
        layer_importance += torch.sum(torch.abs(param))

    return layer_importance

In [12]:
###
# DISCARTED: Eliminate layers with smaller absolute values of all parameters.
##
def prune_layers(model, prune_percent):
    """
    Removes entire layers from the model based on their importance scores.
    Now considers all parameters in a layer (attention + MLP) for determining importance.

    Args:
    - model: The model from which layers will be pruned.
    - prune_percent: The percentage of layers to remove.

    Returns:
    - model: The pruned model with fewer layers.
    """
    # Calculate the importance of each layer and store (index, importance)
    layer_importances = []
    for idx, layer in enumerate(model.model.layers):
        importance = compute_layer_importance(layer)
        layer_importances.append((idx, importance))

    # Sort layers by importance in ascending order (lowest importance first)
    layer_importances.sort(key=lambda x: x[1])

    # Compute the number of layers to prune
    total_layers = len(layer_importances)
    num_layers_to_prune = int(total_layers * prune_percent)

    # Get the indices of layers to remove
    layers_to_remove = set([x[0] for x in layer_importances[:num_layers_to_prune]])

    # Rebuild the model without the removed layers
    new_layers = [layer for i, layer in enumerate(model.model.layers) if i not in layers_to_remove]
    model.model.layers = nn.ModuleList(new_layers)

    return model


In [13]:
def prune_last_layers(model, num_layers_to_remove):
    """
    Removes the last 'num_layers_to_remove' layers from the model.

    Args:
    - model: The model from which layers will be pruned.
    - num_layers_to_remove: Number of layers to remove from the top of the stack.

    Returns:
    - model: The pruned model with fewer layers.
    """
    total_layers = len(model.model.layers)

    # Ensure we are not removing more layers than exist
    if num_layers_to_remove >= total_layers:
        raise ValueError("Number of layers to remove is greater or equal to total layers.")

    # Slice the layers to remove the last ones
    new_layers = model.model.layers[:total_layers - num_layers_to_remove]
    model.model.layers = nn.ModuleList(new_layers)

    # Update the model configuration
    model.config.num_hidden_layers = len(model.model.layers)

    return model
#response:  Paris is the capital of France and arguably one amongstworld renowned cities worldwide. Its uniqueness lies in its uniqueness itselfwhich makes it uniquely unique amongstotherworld renown cities globally.Its uniqueness resides mainly inits unique architecturewhichmakes itunique amongst otherworld

In [14]:
def prune_first_layers(model, num_layers_to_remove):
    """
    Removes the first 'num_layers_to_remove' layers from the model.

    Args:
    - model: The model from which layers will be pruned.
    - num_layers_to_remove: Number of layers to remove from the start.

    Returns:
    - model: The pruned model with fewer layers.
    """
    # Get the total number of layers in the model.
    total_layers = len(model.model.layers)

    # Ensure we are not removing more layers than exist
    if num_layers_to_remove >= total_layers:
        raise ValueError("Number of layers to remove is greater or equal to total layers.")

    # Keep all layers after the first 'num_layers_to_remove'
    new_layers = model.model.layers[num_layers_to_remove:]
    model.model.layers = nn.ModuleList(new_layers)

    # Update the model configuration after pruning layers
    model.config.num_hidden_layers = len(model.model.layers)

    return model

# Prune Loop
The update_model function iterates through the blocks within the model's Transformer structure. This structure consists of multiple `LlamaDecoderLayer` blocks, and each of these blocks contains a pair of `LlamaSdpaAttention` and `LlamaMLP` components.
```
(layers): ModuleList(
      (0-15): 16 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (up_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (down_proj): Linear(in_features=8192, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
      )
  )    


In [15]:
def update_model(model, prune_percent):
    """
    Modifies the model by removing entire layers from the end of the stack,
    instead of basing it on importance scores. This is a heuristic approach
    that tries pruning the top layers to see if it yields better results.

    Args:
    - model: Model to prune.
    - prune_percent: Percentage of layers to prune from the top.

    Returns:
    - model: New pruned model with fewer layers.
    """
    ### uncomment this if you want to use weight layer selection.
    #model = prune_layers(model, prune_percent)

    total_layers = len(model.model.layers)
    num_layers_to_remove = int(total_layers * prune_percent)

    model = prune_last_layers(model, num_layers_to_remove)
    #model = prune_first_layers(model, num_layers_to_remove)

    # Update the model configuration to reflect the new number of layers
    model.config.num_hidden_layers = len(model.model.layers)
    return model


## Obtain & test the pruned model.

In [16]:
prune_percent = 0.2  # Prune 20% of neurons
model = update_model(model, prune_percent)

In [17]:
# Recalculate the number of parameters
pruned_param_count = count_parameters(model)
reduction_in_params = original_param_count - pruned_param_count
percentage_savings = (reduction_in_params / original_param_count) * 100

print(f"Pruned model parameters: {pruned_param_count}")
print(f"Reduction in parameters: {reduction_in_params}")
print(f"Percentage of weight savings: {percentage_savings:.2f}%")


Pruned model parameters: 1053349888
Reduction in parameters: 182464512
Percentage of weight savings: 14.76%


In [18]:
# Test the pruned model
generated = get_output(prompt, model, tokenizer)
print(f"Generated text after pruning: {generated}")

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Generated text after pruning: Paris is the capital of France and arguably one amongstworld renowned cities worldwide. Its uniqueness lies in its uniqueness itselfwhich makes it uniquely unique amongstotherworld renown cities globally.Its uniqueness resides mainly inits unique architecturewhichmakes itunique amongst otherworld


The result is realy different from what the original model produced, and is far to be a fairly accurate response, but at least is understable text.

In contrast to the model created in notebook: [6_2_pruning_structured_llama3.2-1b_KO.ipynb](https://github.com/peremartra/Large-Language-Model-Notebooks-Course/blob/main/6_2_pruning_structured_llama3.2-1b_KO.ipynb) where the pruned Llama model lost almost all its utility, the model in this notebook retains a good portion of its knowledge.

Looking at the model’s new structure, we can see that now the model have only 13 layers instead of the 16 original. So we removed 3 entire layers.

In [19]:
print(model)

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 2048)
    (layers): ModuleList(
      (0-12): 13 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (up_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (down_proj): Linear(in_features=8192, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((2048,), eps=1e-05)
    (rotary_emb):

#Upload the model to HuggingFace.

In [20]:
new_model_name = 'depth20-llama-3.2-1b'
output_dir = './'+new_model_name
if not os.path.exists(output_dir):
    os.makedirs(output_dir)

model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
print(f"Pruned model saved to {output_dir}")

Pruned model saved to ./depth20-llama-3.2-1b


In [21]:
# Push the model to your Hugging Face repository

model.push_to_hub(new_model_name, private=True)

In [22]:
tokenizer.push_to_hub(new_model_name)

#Evaluating models

In this section, we'll take a look at some standard evaluations in the world of Large Language Models using the lm-evaluation library from EleutherAI.

Specifically, we'll use LAMBADA, ARC_EASY and BoolQ. Since the pruning performed could be considered structural—that is, it affects the model's overall structure without a specific target—I’ve chosen rather different evaluation tasks.

I want to remind you that the goal of this notebook is to demonstrate the pruning process, so I won’t be doing a comprehensive study of how it impacts performance; that will be saved for a future article. Additionally, these models are designed to be fine-tuned before being used.

However, I believe that seeing how pruning impacts model performance can help illustrate the pruning process itself.

The model selected for the comparision is the one with a 20% pruning using the last layers selection.

In [38]:
# Install dependencies in correct order
!pip install -q accelerate
!pip install -q lm-eval

# For models requiring authentication
import os
from dotenv import load_dotenv

# Load HF token if needed
load_dotenv()
hf_token = os.getenv('HF_TOKEN')

from lm_eval import evaluator, tasks, models

def evaluate_hf_model(model_name, tasks=['arc_easy'], num_fewshot=0, device="auto", hf_token=None):
    """
    It calls the evaluator to evaluate a model available on Hugging Face.
    Args:
    - model_name: The model name in Hugging Face.
    - tasks: Tasks to evaluate.
    - num_fewshot: Number of examples of few-shot learning
    - device: Device to use ("auto", "cuda", "cpu")
    - hf_token: Hugging Face token for private models
    Returns:
    - metrics.
    """
    # Build model_args string
    if device == "auto":
        model_args = f"pretrained={model_name},device_map=auto"
    else:
        model_args = f"pretrained={model_name},device={device}"
    
    # Add token if provided
    if hf_token:
        model_args += f",token={hf_token}"
    
    results = evaluator.simple_evaluate(
        model="hf",
        model_args=model_args,
        tasks=tasks,
        num_fewshot=num_fewshot,
        limit=None,
        bootstrap_iters=10
    )
    metrics = results.get('results', {})
    return metrics

# Select tasks to evaluate
tasks = ['lambada', 'boolq', 'arc_easy']

# Evaluate the model
metrics_pruned = evaluate_hf_model("depth20-llama-3.2-1b", tasks=tasks)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
[Task: boolq] metric acc is defined, but aggregation is not. using default aggregation=mean
[Task: boolq] metric acc is defined, but higher_is_better is not. using default higher_is_better=True


README.md: 0.00B [00:00, ?B/s]

Overwriting default num_fewshot of arc_easy from None to 0
Overwriting default num_fewshot of boolq from None to 0
Overwriting default num_fewshot of lambada_openai from None to 0
Overwriting default num_fewshot of lambada_standard from None to 0
100%|█████████████████████████████████████████████████████████████████████████████| 2376/2376 [00:01<00:00, 1545.79it/s]
100%|█████████████████████████████████████████████████████████████████████████████| 3270/3270 [00:01<00:00, 2755.73it/s]
100%|██████████████████████████████████████████████████████████████████████████████| 5153/5153 [00:07<00:00, 669.88it/s]
100%|██████████████████████████████████████████████████████████████████████████████| 5153/5153 [00:06<00:00, 765.33it/s]
Running loglikelihood requests: 100%|█████████████████████████████████████████████| 26347/26347 [09:10<00:00, 47.90it/s]


bootstrapping for stddev: perplexity


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Av

bootstrapping for stddev: perplexity


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Av

In [39]:
metrics_pruned

{'arc_easy': {'alias': 'arc_easy',
  'acc,none': 0.46254208754208753,
  'acc_stderr,none': 0.010230952104570803,
  'acc_norm,none': 0.45075757575757575,
  'acc_norm_stderr,none': 0.010209906101011107},
 'boolq': {'alias': 'boolq',
  'acc,none': 0.6217125382262997,
  'acc_stderr,none': 0.008482001133931},
 'lambada_openai': {'alias': 'lambada_openai',
  'perplexity,none': 45444.434722955746,
  'perplexity_stderr,none': 4293.661681843537,
  'acc,none': 0.117795458955948,
  'acc_stderr,none': 0.004491185483891819},
 'lambada_standard': {'alias': 'lambada_standard',
  'perplexity,none': 136409.48193725813,
  'perplexity_stderr,none': 13243.287905915005,
  'acc,none': 0.11061517562584902,
  'acc_stderr,none': 0.004369827433519648}}

In [40]:
metrics_base= evaluate_hf_model("meta-llama/Llama-3.2-1B", tasks=tasks)

[Task: boolq] metric acc is defined, but aggregation is not. using default aggregation=mean
[Task: boolq] metric acc is defined, but higher_is_better is not. using default higher_is_better=True
Overwriting default num_fewshot of arc_easy from None to 0
Overwriting default num_fewshot of boolq from None to 0
Overwriting default num_fewshot of lambada_openai from None to 0
Overwriting default num_fewshot of lambada_standard from None to 0
100%|█████████████████████████████████████████████████████████████████████████████| 2376/2376 [00:01<00:00, 1288.96it/s]
100%|█████████████████████████████████████████████████████████████████████████████| 3270/3270 [00:01<00:00, 2687.19it/s]
100%|██████████████████████████████████████████████████████████████████████████████| 5153/5153 [00:06<00:00, 760.37it/s]
100%|██████████████████████████████████████████████████████████████████████████████| 5153/5153 [00:06<00:00, 765.25it/s]
Running loglikelihood requests: 100%|██████████████████████████████████████

bootstrapping for stddev: perplexity


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Av

bootstrapping for stddev: perplexity


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Av

In [41]:
metrics_base

{'arc_easy': {'alias': 'arc_easy',
  'acc,none': 0.6498316498316499,
  'acc_stderr,none': 0.009788295410093146,
  'acc_norm,none': 0.6069023569023569,
  'acc_norm_stderr,none': 0.01002254061894531},
 'boolq': {'alias': 'boolq',
  'acc,none': 0.637308868501529,
  'acc_stderr,none': 0.008408838061823177},
 'lambada_openai': {'alias': 'lambada_openai',
  'perplexity,none': 5.747471606969041,
  'perplexity_stderr,none': 0.19350717486069613,
  'acc,none': 0.6198331069280031,
  'acc_stderr,none': 0.006762956659647619},
 'lambada_standard': {'alias': 'lambada_standard',
  'perplexity,none': 8.673077754353926,
  'perplexity_stderr,none': 0.3809304515805616,
  'acc,none': 0.5315350281389482,
  'acc_stderr,none': 0.006952109107344538}}

![My Image](https://github.com/peremartra/Large-Language-Model-Notebooks-Course/blob/main/img/depth_rpunedvsbase.png?raw=true)


**BoolQ:** The performance drop is minimal (from 0.64 to 0.63). BoolQ is a dataset of boolean (true/false) questions based on reading comprehension. The fact that the model nearly maintains its performance suggests that the information required to answer these questions is less sensitive to layer removal or that this type of task primarily benefits from more basic language information, which remains intact after pruning.

**Lambada (OpenAI and Standard):** Here, the performance drop is much more pronounced. For Lambada OpenAI, the score decreases from 0.62 to 0.46, and for Lambada Standard, it drops from 0.53 to 0.11—a drastic reduction. Lambada focuses on predicting the last word in a text, requiring deep contextual understanding and long-term coherence. This result aligns with the earlier test in the notebook, where the model was asked to complete a prompt, and the generated language was not entirely accurate. This difficulty in producing coherent text within a given context negatively impacted its performance in the evaluated task.

**ARC Easy:** This benchmark also shows a significant decline (from 0.65 to 0.46). ARC Easy is a reasoning benchmark focused on general knowledge and common sense. The removal of layers likely impacted the model's ability to relate information and maintain reasoning chains, resulting in a reduced capacity to select the correct answer.

#Conclusion.

In this notebook, we explored how depth pruning works on Llama-3.2 models. Unlike width pruning, there was no need to account for the functioning of the structure.

After pruning, the model experiences significant degradation in tasks that require greater contextual and semantic reasoning (Lambada, ARC Easy), while its performance on BoolQ, a comparatively simpler task, remains almost unchanged. This suggests that pruning disproportionately affects the parts of the model that facilitate complex understanding and long-term coherence. BoolQ, being simpler or less dependent on these traits, remains relatively stable, whereas tasks that evaluate context and global coherence are severely impacted.

As indicated in the paper: What Matters in Transformers? What Matters in Transformers? Not All Attention is Needed. https://arxiv.org/abs/2406.15786 the best result is achieved by removing the deepest layers of the model.


## Future Work.

So far, we have explored two forms of structured pruning:

- Width pruning: In this approach, neurons from the MLP layers of two model families, DistilGPT and Llama3, were removed. The process of removing neurons from models with GLU architecture works across all families with this structure, such as QWEN, Gemma, or Microsoft Phi.
- Depth pruning: Entire blocks were removed from a Llama3 model. This technique can be adapted to all model families.

The common point is that we have used very similar methods to decide which elements to remove from the models, based on the absolute weight of the parameters. The next step will involve making these decisions based on metrics generated while the model is running. This will allow us to create models tailored to specific datasets.

In [None]:
##Authors Note.
In addition to creating content like this notebook and offering it under the MIT license, I have also contributed to repositories such as those of Hugging Face and Google Gemini.

I am especially proud of my book: <a href="https://amzn.to/4eanT1g"><b>Large Language Models:</b> Apply and Implement Strategies for Large Language Models</a> (Apress).

You can find it on both <a href="https://amzn.to/4eanT1g">Amazon</a> and <a href="https://link.springer.com/book/10.1007/979-8-8688-0515-8">Springer</a>, where they often have good deals on the purchase price.

If you take a look and end up purchasing it, keep in mind that you can reach out with any questions via the Discussions section of this same repository or on any of my social media channels. I’ll do my best to respond as quickly as possible.

## References.
- He, S., Sun, G., Shen, Z., & Li, A. (2024). What matters in transformers? not all attention is needed. arXiv preprint arXiv:2406.15786.

- Kim, B. K., Kim, G., Kim, T. H., Castells, T., Choi, S., Shin, J., & Song, H. K. (2024). Shortened llama: A simple depth pruning for large language models. arXiv preprint arXiv:2402.02834, 11.