# Structured Sparsity Pruning on GPT‑2

In this notebook, we demonstrate **structured sparsity** by pruning entire attention heads from a GPT‑2 model. Head pruning removes whole heads (contiguous groups of parameters) rather than individual weights, so the affected projections simply become smaller dense matrices that run on standard hardware without sparse kernels.

We’ll cover:  
1. Installing dependencies  
2. Imports & setup  
3. Loading the pre‑trained model & tokenizer, and counting original parameters & heads  
4. Pruning heads across all layers  
5. Counting parameters after pruning  
6. Comparing generation before vs. after pruning  


## 1. Install Dependencies

In [1]:
!pip install transformers torch --quiet


## 2. Imports & Setup

Bring in PyTorch and Hugging Face transformers.


In [2]:
import copy
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer


## 3. Load Pre‑trained GPT‑2

We load `gpt2` and keep a copy for “before‑prune” comparisons.


In [3]:
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer    = GPT2Tokenizer.from_pretrained("gpt2")
model        = GPT2LMHeadModel.from_pretrained("gpt2").to(device).eval()
model_before = copy.deepcopy(model)  # save original



### 3.1 Count Original Parameters & Heads

We count the total number of parameters and read the number of layers and heads per layer from the model config.


In [4]:
# Total params
total_params = sum(p.numel() for p in model.parameters())
print(f"Original total parameters: {total_params:,}")

# Heads per layer
n_layers = model.config.n_layer
n_heads  = model.config.n_head
print(f"Layers: {n_layers}, Heads per layer: {n_heads}")


Original total parameters: 124,439,808
Layers: 12, Heads per layer: 12
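
To see where these parameters live, the total can be reproduced by hand. The sketch below assumes the standard `gpt2` configuration (vocabulary of 50,257 tokens, 1,024 positions, hidden size 768, MLP width 3,072, and an LM head tied to the token embeddings); these values are not printed above, so treat them as assumptions about the checkpoint rather than something read from the model.

```python
# Hand count of GPT-2 small parameters (assumed config values, not read from the model).
vocab, n_pos, d, n_layer = 50257, 1024, 768, 12
d_mlp = 4 * d

def linear(n_in, n_out):
    """Weights plus bias of one dense projection."""
    return n_in * n_out + n_out

embeddings = vocab * d + n_pos * d            # token + position embeddings
layer_norm = 2 * d                            # scale and shift
attention = linear(d, 3 * d) + linear(d, d)   # fused QKV projection + output projection
mlp = linear(d, d_mlp) + linear(d_mlp, d)
per_layer = 2 * layer_norm + attention + mlp

# One final layer norm; the LM head is assumed tied, so it adds nothing.
total = embeddings + n_layer * per_layer + layer_norm
print(f"total: {total:,}")
print(f"attention share: {100 * n_layer * attention / total:.1f}%")
```

Under these assumptions the hand count agrees with the total printed above, and attention accounts for a bit under a quarter of it. That bounds how much head pruning alone can shrink the model.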


## 4. Prune Attention Heads

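Here we remove **2 heads** (indices 0 and 1) from **every** transformer layer, leaving 10 of the 12 heads per layer. The choice is fixed in advance and not based on any head‑importance score.  
We build a dict `{layer_id: [head indices]}` and call `prune_heads`.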


In [5]:
# Specify which heads to prune: heads 0 and 1 in each layer
heads_to_prune = { layer: [0, 1] for layer in range(model.config.n_layer) }

# Perform structured head pruning
model.transformer.prune_heads(heads_to_prune)
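
When experimenting with other head choices, it can help to validate the indices before handing them to `prune_heads`. The helper below is a minimal sketch; `build_prune_map` is a hypothetical name introduced here for illustration and is not part of `transformers`.

```python
def build_prune_map(n_layers, n_heads, heads):
    """Hypothetical helper: map every layer index to the same list of head indices."""
    heads = sorted(set(heads))
    bad = [h for h in heads if not 0 <= h < n_heads]
    if bad:
        raise ValueError(f"head indices out of range: {bad}")
    if len(heads) >= n_heads:
        raise ValueError("refusing to prune every head in a layer")
    return {layer: list(heads) for layer in range(n_layers)}

# Same mapping as the cell above, built with range checks.
heads_to_prune_checked = build_prune_map(n_layers=12, n_heads=12, heads=[0, 1])
print(heads_to_prune_checked[0], len(heads_to_prune_checked))
```

The resulting dict has the same `{layer_id: [head indices]}` shape as `heads_to_prune`, so it could be passed to `prune_heads` in its place.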


## 5. Count Parameters After Head Pruning

After pruning, the attention projections in each layer (`c_attn` and `c_proj` in the Hugging Face GPT‑2 implementation) are resized to omit the removed heads, so the parameter count drops.
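
To make the resizing concrete, here is a sketch of the index bookkeeping involved. It assumes the query, key and value projections are fused into one matrix laid out as three consecutive blocks `[Q | K | V]`, with each head occupying a contiguous slice of 64 columns inside each block. This illustrates the idea and is not the library's own code; `kept_qkv_columns` is a hypothetical name.

```python
def kept_qkv_columns(n_heads, head_dim, pruned):
    """Column indices of a fused [Q | K | V] projection that survive pruning."""
    hidden = n_heads * head_dim
    kept_in_block = [
        h * head_dim + i
        for h in range(n_heads) if h not in pruned
        for i in range(head_dim)
    ]
    # The same head slices are kept in each of the Q, K and V blocks.
    return [offset + c for offset in (0, hidden, 2 * hidden) for c in kept_in_block]

cols = kept_qkv_columns(n_heads=12, head_dim=64, pruned={0, 1})
print(len(cols), cols[0])
```

Selecting only these columns (and the matching rows of the output projection) yields smaller dense matrices, which is what makes the sparsity structured.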


In [6]:
# New total params
pruned_params = sum(p.numel() for p in model.parameters())
print(f"After head pruning parameters: {pruned_params:,}")
print(f"Parameters removed: {total_params - pruned_params:,} "
      f"({100 * (total_params - pruned_params)/total_params:.1f}% reduction)")


After head pruning parameters: 119,716,608
Parameters removed: 4,723,200 (3.8% reduction)
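
The reported reduction can be checked with a few lines of arithmetic. The sketch assumes a hidden size of 768 and a head dimension of 64, and that pruning drops columns (plus bias entries) from the fused QKV projection and rows from the attention output projection while leaving the output bias untouched.

```python
d, head_dim, n_layer, n_pruned = 768, 64, 12, 2
removed_width = n_pruned * head_dim      # hidden units removed per layer

qkv_weights = d * 3 * removed_width      # columns dropped from the fused QKV weight
qkv_bias = 3 * removed_width             # matching bias entries
out_weights = removed_width * d          # rows dropped from the output projection
per_layer = qkv_weights + qkv_bias + out_weights

removed = n_layer * per_layer
print(f"{per_layer:,} per layer, {removed:,} in total")
print(f"{100 * removed / 124_439_808:.1f}% of the original model")
```

Under these assumptions the total matches the number of parameters removed above. Removing one sixth of the heads trims only a few percent of the model because the embeddings and MLP blocks are left intact.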


## 6. Compare Generation

Generate from the original and head‑pruned models on the same prompt, first with 30 new tokens and then with 100.


In [7]:
prompt = "In a future world, neural networks and humans"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

with torch.no_grad():
    out_before = model_before.generate(**inputs, max_new_tokens=30)
    out_after  = model.generate(**inputs, max_new_tokens=30)

print("=== Original GPT‑2 ===")
print(tokenizer.decode(out_before[0], skip_special_tokens=True))
print("\n=== Head‑Pruned GPT‑2 ===")
print(tokenizer.decode(out_after[0], skip_special_tokens=True))


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


=== Original GPT‑2 ===
In a future world, neural networks and humans would be able to learn to recognize faces, and to recognize faces that are not human.

"We're going to have to see how we

=== Head‑Pruned GPT‑2 ===
In a future world, neural networks and humans would be more efficient at understanding the things that make them feel good."

The researchers added that "the human mind is not an object that makes


In [8]:
prompt = "In a future world, neural networks and humans"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

with torch.no_grad():
    out_before = model_before.generate(**inputs, max_new_tokens=100)
    out_after  = model.generate(**inputs, max_new_tokens=100)

print("=== Original GPT‑2 ===")
print(tokenizer.decode(out_before[0], skip_special_tokens=True))
print("\n=== Head‑Pruned GPT‑2 ===")
print(tokenizer.decode(out_after[0], skip_special_tokens=True))


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


=== Original GPT‑2 ===
In a future world, neural networks and humans would be able to learn to recognize faces, and to recognize faces that are not human.

"We're going to have to see how we can do that," said Dr. David S. Karp, a neuroscientist at the University of California, San Diego.

The researchers are now working on a way to make the neural networks that recognize faces recognize faces that are not human.

"We're going to have to see how we can do that," said Dr.

=== Head‑Pruned GPT‑2 ===
In a future world, neural networks and humans would be more efficient at understanding the things that make them feel good."

The researchers added that "the human mind is not an object that makes people feel good."

The researchers added that the human mind is not an object that makes people feel good.

The researchers added that the human mind is not an object that makes people feel good."

The researchers added that the human mind is not an object that makes people feel good.

The research
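
Both continuations loop in the longer run, and eyeballing them only goes so far. One crude way to put a number on the looping is the fraction of word n-grams that repeat. The function below is a minimal sketch of such a metric; `repeated_ngram_fraction` is a name introduced here, and the score is a rough proxy for repetition, not a measure of output quality.

```python
from collections import Counter

def repeated_ngram_fraction(text, n=4):
    """Fraction of word n-grams that have already occurred earlier in the text."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeats = sum(c - 1 for c in counts.values())  # occurrences beyond the first
    return repeats / len(ngrams)

# Toy input; in the notebook, pass the decoded strings from the cells above.
sample = "the mind is not an object. the mind is not an object."
score = repeated_ngram_fraction(sample)
print(f"{score:.2f}")
```

In the notebook it could be applied to `tokenizer.decode(out_before[0], skip_special_tokens=True)` and the corresponding `out_after` string to compare the two models on the same footing.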