<a href="https://colab.research.google.com/github/peremartra/Large-Language-Model-Notebooks-Course/blob/main/6-PRUNING/6_1_pruning_structured_l1_diltilgpt2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


<div>
    <h1>Large Language Models Projects</h1>
    <h3>Apply and Implement Strategies for Large Language Models</h3>
    <h2>Pruning distilGPT2.</h2>
    <h3>Structured Width Pruning: Eliminating Less Important Neurons from Feedforward Layers.</h3>
</div>

by [Pere Martra](https://www.linkedin.com/in/pere-martra/)
_______
Models: distilgpt2

Colab Environment: CPU / GPU T4.

Keys:
* Pruning
* Structured pruning


Related article: --.
_______
**Disclaimer: The pruning section was created after the first edition of the book was published. It is not included in the book’s original content but is intended to supplement and expand on the topics covered.**


This is the unofficial repository for the book:
        <a href="https://amzn.to/4eanT1g"> <b>Large Language Models:</b> Apply and Implement Strategies for Large Language Models</a> (Apress).
        The book is based on the content of this repository, but the notebooks are being updated, and I am incorporating new examples and chapters.
        If you are looking for the official repository for the book, with the original notebooks, you should visit the
        <a href="https://github.com/Apress/Large-Language-Models-Projects">Apress repository</a>, where you can find all the notebooks in their original format as they appear in the book.

# PRUNING
## neurons structured width pruning
Pruning is an important optimization technique in machine learning that aims to reduce the size of a model without sacrificing much of its accuracy. By removing less important components, pruning not only decreases the computational cost but also makes the model more efficient for deployment, especially on resource-constrained devices.

Pruning can be compared to quantization, another optimization technique that reduces the precision of the model's weights, typically converting them from high-precision floating-point numbers to lower-precision representations. While quantization can significantly reduce model size and speed up inference, it does not selectively remove weights.

Pruning, on the other hand, allows for targeted removal of less important weights or neurons, which can lead to a more efficient reduction in model size while better preserving accuracy. By selecting the weights to eliminate based on their importance scores, pruning provides more control over the model's structure, often making it a more effective approach when aiming for both model compression and high performance.

The effectiveness of removing specific parts of a model could be debated, but recent studies, such as the one conducted by NVIDIA, [How to Prune and Distill Llama-3.1 8B to an NVIDIA Llama-3.1-Minitron 4B Model](https://developer.nvidia.com/blog/how-to-prune-and-distill-llama-3-1-8b-to-an-nvidia-llama-3-1-minitron-4b-model/), conclude that pruning, combined with fine-tuning applied afterward, can produce models that are not only more efficient but also more effective in specific domains.

You can also combine both techniques and quantize a model that has previously been pruned.

This notebook focuses on **structured width pruning**, where entire neurons are eliminated based on their low importance scores, which are computed using the L1 norm. The assumption is that neurons with lower L1 norm values contribute less to the overall output of the model, allowing for safe removal to enhance efficiency without drastically impacting accuracy.
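
As a small illustration of the scoring idea, the sketch below computes an L1-norm score for each neuron of a toy weight matrix and keeps the two highest-scoring ones. The values and the name `toy_weight` are made up for illustration; the layout (one column per neuron) is an assumption chosen to match the `[in_features, out_features]` convention of GPT-2's `Conv1D` layers.

```python
import torch

# Toy weight matrix (hypothetical values): 3 inputs x 4 neurons,
# one column per neuron.
toy_weight = torch.tensor([
    [ 0.5, -0.1,  2.0,  0.0],
    [-1.5,  0.2, -1.0,  0.1],
    [ 1.0, -0.1,  0.5, -0.1],
])

# L1 norm of each column: sum of the absolute weights feeding that neuron.
scores = toy_weight.abs().sum(dim=0)
print(scores)

# Keep the 2 highest-scoring neurons, listed in their original order.
keep = torch.topk(scores, k=2).indices.sort().values
print(keep.tolist())  # [0, 2]
```

Neurons 1 and 3 have small weights across all inputs, so they receive the lowest scores and would be the first candidates for removal.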

# Install Libraries & Configure variables.

In [1]:
#Install necessary libraries
!pip install -q transformers
!pip install -q torch


In [2]:
#Import libraries
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from torch import nn
import os

# Check if GPU is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")



Using device: cpu


I chose to prune only 20% of the least important neurons based on their L1 norm, aiming to balance size reduction with minimal accuracy loss.

You can adjust this percentage and increase it to 30% or even 50%, depending on whether you plan to follow up with a post-pruning fine-tuning process.
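
To get a feel for what each setting means for distilgpt2, whose feed-forward layers have 3072 intermediate neurons per block, the sketch below converts a pruning fraction into neuron counts. The helper name `neurons_after_pruning` is hypothetical; the truncation with `int()` follows the way this notebook computes the number of neurons to prune.

```python
def neurons_after_pruning(intermediate_size: int, prune_percent: float) -> tuple[int, int]:
    """Return (pruned, kept) neuron counts for one MLP block."""
    pruned = int(prune_percent * intermediate_size)  # truncates toward zero
    return pruned, intermediate_size - pruned

for fraction in (0.2, 0.3, 0.5):
    pruned, kept = neurons_after_pruning(3072, fraction)
    print(f"{fraction:.0%}: prune {pruned}, keep {kept}")
# 20%: prune 614, keep 2458
# 30%: prune 921, keep 2151
# 50%: prune 1536, keep 1536
```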

In [3]:
prune_percent = 0.2  # Prune 20% of neurons
model_name = 'distilgpt2'

In [4]:
# Support function to check the size reduction.
def count_parameters(model):
    return sum(p.numel() for p in model.parameters())

# Download Model and explore structure.

In [None]:
# Download the model and tokenizer
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token


In [6]:
def get_ouput(prompt, model=model, tokenizer=tokenizer):
  inputs = tokenizer(prompt, return_tensors='pt').to(device)
  outputs = model.generate(inputs['input_ids'],
                           attention_mask=inputs['attention_mask'],
                           max_length=10,
                           num_return_sequences=1)
  generated = tokenizer.decode(outputs[0], skip_special_tokens=True)
  return generated

## studying the model's structure.

Understanding the model's structure is crucial in a pruning process.

In this structure, you can see the part dedicated to the Attention layers (attn) and the part dedicated to the FeedForward layers (mlp).

In the pruning process for the notebook, I only targeted the mlp layers because they typically contribute the most to the model's size and pruning them doesn’t affect the attention mechanism, which is critical for capturing relationships within the input data. These layers also tend to contain more redundancies, and reducing neurons in them generally doesn't significantly impact the model's output—though this always depends on the neuron selection process for elimination.



In [7]:
print(model)

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-5): 6 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=3072, nx=768)
          (c_proj): Conv1D(nf=768, nx=3072)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)


In the model's structure, we find the embedding layers: `wte` (Word Token Embedding) and `wpe` (Word Position Embedding). The vector used to represent the input data has a size of 768.

After the embedding layers, there's a dropout layer.

Next, we have the typical layers of a Transformer model:

- **Normalization layers** (`ln_1` and `ln_2`).
- **Attention mechanism** (`attn`), consisting of its `Conv1D` projection layers and dropout layers.
- **Feed-forward layers** (`mlp`), which will be the target of the pruning process. Specifically, I have chosen to prune the `c_fc` and `c_proj` layers. These layers expand and compress the information that passes through them. They are necessary for enabling the model to capture complex relationships within the input data, but it's quite easy to find neurons in these layers that don't contribute much to the model, at least when using the model with specific data.
- The model ends with the **final normalization layer** (`ln_f`).
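
The dimensions printed above are enough to check, with plain arithmetic, how the parameters are distributed. The sketch below is a back-of-the-envelope count, assuming each `Conv1D(nf, nx)` holds `nx * nf` weights plus `nf` biases, each `LayerNorm` holds a weight vector and a bias vector, and `lm_head` shares its weights with `wte` (so it adds no parameters). The helper name `conv1d_params` is hypothetical.

```python
def conv1d_params(nf: int, nx: int) -> int:
    """Weights (nx * nf) plus biases (nf) of a GPT-2 Conv1D layer."""
    return nx * nf + nf

n_embd, n_layer, vocab, n_positions = 768, 6, 50257, 1024

attn = conv1d_params(2304, n_embd) + conv1d_params(n_embd, n_embd)
mlp = conv1d_params(3072, n_embd) + conv1d_params(n_embd, 3072)
layer_norm = 2 * n_embd  # weight + bias

embeddings = vocab * n_embd + n_positions * n_embd
block = attn + mlp + 2 * layer_norm          # ln_1 and ln_2
total = embeddings + n_layer * block + layer_norm  # plus final ln_f

print(f"attention per block: {attn:,}")   # 2,362,368
print(f"mlp per block:       {mlp:,}")    # 4,722,432
print(f"total:               {total:,}")  # 81,912,576
```

Under these assumptions the MLP holds about twice as many parameters per block as the attention mechanism, and the total agrees with the parameter count the notebook reports for the original model.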

Another important consideration is the model's configuration file. Since the pruning process alters the model's structure, the resulting structure must be reflected in the configuration file.

Otherwise, we might encounter issues where the model doesn't work properly with the Transformers library or produces errors or incorrect results during inference.

In [8]:
print(model.config)

GPT2Config {
  "_attn_implementation_autoset": true,
  "_num_labels": 1,
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "id2label": {
    "0": "LABEL_0"
  },
  "initializer_range": 0.02,
  "label2id": {
    "LABEL_0": 0
  },
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 6,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "torch_dtype": "float32",
  "transformers_version": "4.51.3",
  "use_cache": tr

In [9]:
#Test the original model with a simple prompt
prompt = "Paris is the capital of"
generated = get_ouput(prompt)
print(f"Generated text: {generated}")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generated text: Paris is the capital of the United States.



In [10]:
#Print the size of the original model
original_param_count = count_parameters(model)
print(f"Original model parameters: {original_param_count}")

Original model parameters: 81912576


# Pruning Model.

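The MLP layers in GPT-2 are implemented with the `Conv1D` class. Despite its name, `Conv1D` behaves like a linear (fully connected) layer whose weight matrix is stored transposed, with shape `[in_features, out_features]`. The next line imports `Conv1D` from the GPT-2 model implementation; it will be used later to create new layers with reduced sizes.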
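
The weight layout matters for pruning, because it decides whether a neuron is a row or a column. The sketch below does not import `Conv1D`; it reproduces the computation with plain tensors, on the assumption that `Conv1D` computes `x @ weight + bias` with `weight` stored as `[in_features, out_features]`, and compares it against an `nn.Linear` holding the transposed weights. The sizes are toy values.

```python
import torch

torch.manual_seed(0)
nx, nf = 4, 6                      # toy sizes standing in for 768 and 3072
weight = torch.randn(nx, nf)       # Conv1D layout: [in_features, out_features]
bias = torch.randn(nf)
x = torch.randn(2, nx)

# What a Conv1D-style layer computes (assumption stated above).
conv1d_out = x @ weight + bias

# Equivalent nn.Linear, which stores its weight as [out_features, in_features].
linear = torch.nn.Linear(nx, nf)
with torch.no_grad():
    linear.weight.copy_(weight.T)
    linear.bias.copy_(bias)
linear_out = linear(x)

print(torch.allclose(conv1d_out, linear_out, atol=1e-6))  # True
```

With this layout, output neuron `j` corresponds to column `j` of the weight matrix, which is why the importance scores in this notebook are computed along `dim=0`.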

In [11]:
#Prune the MLP layers based on weight magnitude (adjusted for Conv1D layers)
from transformers.models.gpt2.modeling_gpt2 import Conv1D


This variable is used to store the new intermediate size of the MLP after pruning. It will be updated once the number of neurons to keep is determined.

In [12]:
# Initialize new_intermediate_size
new_intermediate_size = None

## Support pruning functions

In [13]:
# Function to compute importance scores (L1 norm)
def compute_importance_scores(c_fc_weight):
    """
    Compute the importance scores for each neuron in the c_fc layer using L1 norm.

    Args:
    - c_fc_weight: Weight matrix from the c_fc layer.

    Returns:
    - importance_scores: L1 norm importance scores for each neuron.
    """
    return torch.sum(torch.abs(c_fc_weight), dim=0)  # Shape: [intermediate_size]

In [14]:
# Function to prune neurons and create new Conv1D layers
def prune_neurons(mlp, prune_percent, device):
    """
    Prune neurons from the c_fc and c_proj layers of the MLP based on importance scores.

    Args:
    - mlp: The MLP layer (contains c_fc and c_proj) to prune.
    - prune_percent: Fraction of neurons to prune (e.g. 0.2 for 20%).
    - device: Device (CPU/GPU) for model operations.

    Returns:
    - new_c_fc: New pruned c_fc layer.
    - new_c_proj: New pruned c_proj layer.
    - new_intermediate_size: Size of the pruned intermediate layer.
    - indices_to_keep: Sorted indices of the neurons that are retained.
    """
    # Get the weights of the c_fc layer (input projection)
    c_fc_weight = mlp.c_fc.weight.data

    # Compute importance scores for each neuron
    importance_scores = compute_importance_scores(c_fc_weight)

    # Determine the number of neurons to prune
    original_intermediate_size = c_fc_weight.size(1)  # This is intermediate_size
    num_neurons_to_prune = int(prune_percent * original_intermediate_size)

    # Get indices of neurons to keep (those with highest importance)
    _, indices_to_keep = torch.topk(importance_scores, original_intermediate_size - num_neurons_to_prune)

    # Sort indices to maintain order
    indices_to_keep, _ = torch.sort(indices_to_keep)

    # Create new Conv1D layers with reduced size
    new_c_fc = Conv1D(len(indices_to_keep), mlp.c_fc.weight.size(0)).to(device)  # Conv1D(new_intermediate_size, hidden_size)
    new_c_proj = Conv1D(mlp.c_proj.weight.size(1), len(indices_to_keep)).to(device)  # Conv1D(hidden_size, new_intermediate_size)

    return new_c_fc, new_c_proj, len(indices_to_keep), indices_to_keep

In [15]:
# Function to copy weights and biases to new pruned layers
def copy_weights_and_biases(mlp, new_c_fc, new_c_proj, indices_to_keep):
    """
    Copy the weights and biases from the original layers to the new pruned layers.

    Args:
    - mlp: The original MLP layer (contains c_fc and c_proj).
    - new_c_fc: New pruned c_fc layer.
    - new_c_proj: New pruned c_proj layer.
    - indices_to_keep: Indices of neurons that are retained.
    """
    # Copy weights and biases for the neurons we are keeping
    new_c_fc.weight.data = mlp.c_fc.weight.data[:, indices_to_keep]
    new_c_fc.bias.data = mlp.c_fc.bias.data[indices_to_keep]

    new_c_proj.weight.data = mlp.c_proj.weight.data[indices_to_keep, :]
    new_c_proj.bias.data = mlp.c_proj.bias.data
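
The two slices must use the same indices: the columns kept in `c_fc` and the rows kept in `c_proj` refer to the same intermediate neurons. The toy sketch below, with made-up sizes and random weights, checks that slicing both matrices this way gives the same output as running the full MLP with the removed neurons' activations set to zero. It uses plain tensors instead of `Conv1D` layers, and the tanh-approximated GELU is an assumption standing in for the model's activation.

```python
import torch

torch.manual_seed(0)
hidden, inner = 4, 10                       # toy sizes (hypothetical)
w_fc, b_fc = torch.randn(hidden, inner), torch.randn(inner)
w_proj, b_proj = torch.randn(inner, hidden), torch.randn(hidden)
x = torch.randn(3, hidden)
act = torch.nn.GELU(approximate="tanh")

keep = torch.tensor([0, 2, 3, 7, 9])        # neurons that survive pruning

# Reference: full MLP with the pruned neurons' activations masked to zero.
mask = torch.zeros(inner)
mask[keep] = 1.0
masked_out = (act(x @ w_fc + b_fc) * mask) @ w_proj + b_proj

# Pruned MLP: slice c_fc columns and bias, and the matching c_proj rows.
pruned_out = act(x @ w_fc[:, keep] + b_fc[keep]) @ w_proj[keep, :] + b_proj

print(torch.allclose(masked_out, pruned_out, atol=1e-5))  # True
```

The `c_proj` bias is copied unchanged because it belongs to the output dimension (the hidden size), which pruning does not touch.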

## Prune Loop


The `update_model` function iterates through the blocks within the model's Transformer structure. This structure consists of multiple `GPT2Block` blocks, and each of these blocks contains an attention component (`attn`, printed as `GPT2Attention` or `GPT2SdpaAttention` depending on the Transformers version) and a `GPT2MLP` component (`mlp`). The latter contains the MLP layers that will be the target of the pruning process.
```
(h): ModuleList(
      (0-5): 6 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2SdpaAttention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
```

The layers that will undergo the removal of neurons identified as less useful are:

**(c_fc): Conv1D()**

**(c_proj): Conv1D()**

The neurons are removed in the `prune_neurons` function based on the values returned by `compute_importance_scores`.


In [16]:
# Function to iterate through the model and prune each block
def update_model(model, prune_percent, device):
    """
    Prune the MLP layers of each Transformer block in the model and update the model's configuration.

    Args:
    - model: The GPT-2 model to prune.
    - prune_percent: Fraction of neurons to prune (e.g. 0.2 for 20%).
    - device: Device (CPU/GPU) for model operations.

    Returns:
    - model: The pruned model, with updated layers and config.n_inner set to the new intermediate size.
    """
    new_intermediate_size = None

    # Iterate through each block in the model
    for idx, block in enumerate(model.transformer.h):
        mlp = block.mlp

        # Prune the neurons and create new layers
        new_c_fc, new_c_proj, new_size, indices_to_keep = prune_neurons(mlp, prune_percent, device)

        # Copy weights and biases from old layers to new pruned layers
        copy_weights_and_biases(mlp, new_c_fc, new_c_proj, indices_to_keep)

        # Replace old layers with new pruned layers
        mlp.c_fc = new_c_fc
        mlp.c_proj = new_c_proj

        # Update the intermediate size for the first block
        if new_intermediate_size is None:
            new_intermediate_size = new_size

    # Update the model configuration with the new intermediate size
    model.config.n_inner = new_intermediate_size

    return model

## Obtain & Check pruned model.

In [17]:
# Get the pruned Model
model = update_model(model, prune_percent, device)

In [18]:
# Recalculate the number of parameters
pruned_param_count = count_parameters(model)
print(f"Pruned model parameters: {pruned_param_count}")
print(f"Reduction in parameters: {original_param_count - pruned_param_count}")

Pruned model parameters: 76250268
Reduction in parameters: 5662308


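The savings produced by the pruning process are around 6.9% of the original parameters (5,662,308 out of 81,912,576). The reduction is well below 20% because only the MLP layers are pruned; the embeddings and attention layers keep their size. It might seem like a small reward for all the effort, but we can adjust the percentage of pruned neurons. More importantly, we can achieve a more efficient model than the base through a subsequent fine-tuning process.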
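
The reported reduction can be reproduced from the layer sizes alone. The sketch below assumes that each removed neuron takes with it one column of `c_fc` (768 weights), one `c_fc` bias, and one row of `c_proj` (768 weights); the variable names are illustrative.

```python
n_embd, n_layer = 768, 6
original_inner, pruned_inner = 3072, 2458
original_params = 81_912_576  # count printed for the original model

removed_neurons = original_inner - pruned_inner  # 614 per block
# Each neuron owns n_embd weights + 1 bias in c_fc and n_embd weights in c_proj.
params_per_neuron = 2 * n_embd + 1
removed_params = n_layer * removed_neurons * params_per_neuron

print(removed_params)  # 5662308
print(f"{removed_params / original_params:.2%} of the original parameters")
```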

In [19]:
# Structure of the pruned model.
print(model)

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-5): 6 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=2458, nx=768)
          (c_proj): Conv1D(nf=768, nx=2458)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)


After pruning, the model contains exactly the same layers as before. This is because I have only removed neurons, not entire layers. What changes is the width of the feed-forward layers: `c_fc` and `c_proj` now show an intermediate dimension of 2458 instead of 3072.

Counting the model's parameters confirms the change: a reduction of 5,662,308 parameters.

In [20]:
#config file pruned model.
print(model.config)

GPT2Config {
  "_attn_implementation_autoset": true,
  "_num_labels": 1,
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "id2label": {
    "0": "LABEL_0"
  },
  "initializer_range": 0.02,
  "label2id": {
    "LABEL_0": 0
  },
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": 2458,
  "n_layer": 6,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "torch_dtype": "float32",
  "transformers_version": "4.51.3",
  "use_cache": tr

In the configuration file, one difference is noticeable: the `n_inner` parameter now holds the number of neurons in the `c_fc` layer, the feed-forward layer whose neuron count has been reduced.

Not all models handle the information from this file the same way, but in the case of the distilgpt2 model, if the `n_inner` value is null, the default is four times the hidden size (`n_embd`). The hidden size, 768, is visible in the embedding layers of the model's structure.

In this case, the default n_inner value would be 4 * 768 = 3072, but since the weights of the layers have been reduced through the pruning process, it has been replaced with 2458.
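
The fallback described above can be written as a one-line rule. The helper below is a hypothetical illustration of that rule, not a function from the Transformers library.

```python
from typing import Optional

def resolve_inner_dim(n_inner: Optional[int], n_embd: int) -> int:
    """Width of the MLP's intermediate layer for a GPT-2 style config."""
    return n_inner if n_inner is not None else 4 * n_embd

print(resolve_inner_dim(None, 768))  # 3072
print(resolve_inner_dim(2458, 768))  # 2458
```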

In [21]:
generated = get_ouput(prompt)
print(f"Generated text after pruning: {generated}")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generated text after pruning: Paris is the capital of the United States, and


The response from the pruned model differs from that of the base model, indicating that the pruning process has affected the model's output generation.

# Upload the model to Hugging Face & Download to test.

We cannot be sure that the model works correctly with the Transformers library until we complete a full test cycle with it.

Often, if the configuration file has not been properly modified, the issue arises during the process of downloading the model file from Hugging Face.
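
Before pushing anything to the Hub, a local save-and-reload round trip can already reveal a mismatch between the weights and the configuration. The sketch below shows the idea on a tiny, randomly initialized GPT-2 with a non-default `n_inner`; all sizes are made up so it runs quickly and needs no download. It is only an illustration, not a replacement for the full test cycle.

```python
import tempfile
from transformers import GPT2Config, GPT2LMHeadModel

# Tiny, randomly initialized GPT-2 (hypothetical sizes) with a non-default n_inner.
config = GPT2Config(vocab_size=100, n_positions=16, n_embd=32,
                    n_layer=2, n_head=2, n_inner=100)
tiny_model = GPT2LMHeadModel(config)

with tempfile.TemporaryDirectory() as tmp_dir:
    tiny_model.save_pretrained(tmp_dir)
    reloaded = GPT2LMHeadModel.from_pretrained(tmp_dir)

# Conv1D weights are stored as [in_features, out_features].
print(tuple(reloaded.transformer.h[0].mlp.c_fc.weight.shape))  # (32, 100)
print(reloaded.config.n_inner)  # 100
```

If the saved configuration did not record the reduced `n_inner`, the reloaded model would be built with the default width, and loading the pruned weights would be expected to fail with a size mismatch.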

In [22]:
# Save the pruned model
output_dir = './pruned_distilgpt2'
os.makedirs(output_dir, exist_ok=True)

model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
print(f"Pruned model saved to {output_dir}")

Pruned model saved to ./pruned_distilgpt2


In [23]:
# Push the model to your Hugging Face repository
name_model_to_push="pruned_distilgpt2"

model.push_to_hub(name_model_to_push,
                  private=True,
                  use_temp_dir=False)





CommitInfo(commit_url='https://huggingface.co/oopere/pruned_distilgpt2/commit/5d244ec7de30cd41415edab1de6a084f92d21f75', commit_message='Upload model', commit_description='', oid='5d244ec7de30cd41415edab1de6a084f92d21f75', pr_url=None, repo_url=RepoUrl('https://huggingface.co/oopere/pruned_distilgpt2', endpoint='https://huggingface.co', repo_type='model', repo_id='oopere/pruned_distilgpt2'), pr_revision=None, pr_num=None)

In [24]:
tokenizer.push_to_hub(name_model_to_push,
                      private=False,
                      use_temp_dir=False)

CommitInfo(commit_url='https://huggingface.co/oopere/pruned_distilgpt2/commit/25bd43856622baa629444013a6ee63295e6444db', commit_message='Upload tokenizer', commit_description='', oid='25bd43856622baa629444013a6ee63295e6444db', pr_url=None, repo_url=RepoUrl('https://huggingface.co/oopere/pruned_distilgpt2', endpoint='https://huggingface.co', repo_type='model', repo_id='oopere/pruned_distilgpt2'), pr_revision=None, pr_num=None)

In [None]:
# Download the model from Hugging Face
pruned_model_name = 'oopere/pruned_distilgpt2'
pruned_model = AutoModelForCausalLM.from_pretrained(pruned_model_name).to(device)
pruned_tokenizer = AutoTokenizer.from_pretrained(pruned_model_name)

In [26]:
generated = get_ouput(prompt, pruned_model, pruned_tokenizer)
print(f"Pruned Downloaded Generated text: {generated}")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Pruned Downloaded Generated text: Paris is the capital of the United States, and


# Conclusion.

In this notebook, a complete pruning process has been applied to a Transformer model.

The pruning process followed is a structured approach that removes specific neurons from the feedforward layers of the model.

The result is a smaller model than the original. Because the attention layers remain untouched and no layers were removed, it should retain much of its ability to capture relationships in the data.

The neurons eliminated are those deemed less important for the model's output. This method of selecting neurons, without using a dataset, is ideal when the goal is to obtain a model capable of mimicking the base model’s responses.

It was taken into account that the pruning process modified the size of the layers, so the model’s configuration file had to be adjusted accordingly to ensure it continues to function without issues.

## Future Work.

There are many different ways to continue building upon the work done. Three possible directions are:

* Use a dataset to select the least important neurons. This method allows the model to be adapted to a specific task, reducing its size and potentially increasing efficiency without requiring a subsequent fine-tuning process.
* Perform depth pruning by removing entire layers from the model, rather than just specific neurons.
* Adapt the approach to a larger and more current model, such as those from Meta's LLaMA family, Google's Gemma, or any other.
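
The first idea in the list, selecting neurons with the help of a dataset, could be sketched as follows. This is one possible scoring rule (mean absolute activation over a calibration sample), not the method used in this notebook. The weights are random toy values, and `calibration_states` is a hypothetical stand-in for hidden states that would normally be collected from forward passes over real data.

```python
import torch

torch.manual_seed(0)
hidden, inner = 8, 16                       # toy sizes (hypothetical)
w_fc, b_fc = torch.randn(hidden, inner), torch.zeros(inner)
act = torch.nn.GELU(approximate="tanh")

# Stand-in for hidden states collected from a calibration dataset:
# shape [num_tokens, hidden].
calibration_states = torch.randn(256, hidden)

# Mean absolute activation of each intermediate neuron over the sample.
activations = act(calibration_states @ w_fc + b_fc)
importance = activations.abs().mean(dim=0)  # shape: [inner]

# Keep the 80% most active neurons, listed in their original order.
num_keep = inner - int(0.2 * inner)
keep = torch.topk(importance, num_keep).indices.sort().values
print(importance.shape, keep.numel())
```

Unlike the L1 norm of the weights, this score depends on the data fed to the model, so the selection of neurons changes with the task the calibration sample represents.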

It’s important to keep in mind that most models undergoing a pruning process are often fine-tuned afterward to regain the effectiveness they might have lost during pruning.

The type of pruning to apply often depends on the task for which the model is being trained. For instance, one could consider pruning the attention layers if the input prompts are very short and the relationship between the tokens in the prompt is not particularly important, perhaps in tasks like classification or entity recognition.


## Author's Note

In addition to creating content like this notebook and offering it under the MIT license, I have also contributed to repositories such as those of Hugging Face and Google Gemini.

I am especially proud of my book: <a href="https://amzn.to/4eanT1g"><b>Large Language Models:</b> Apply and Implement Strategies for Large Language Models</a> (Apress).

You can find it on both <a href="https://amzn.to/4eanT1g">Amazon</a> and <a href="https://link.springer.com/book/10.1007/979-8-8688-0515-8">Springer</a>, where they often have good deals on the purchase price.

If you take a look and end up purchasing it, keep in mind that you can reach out with any questions via the Discussions section of this same repository or on any of my social media channels. I’ll do my best to respond as quickly as possible.
