# Model Pruning Demonstration on GPT‑2

In this notebook, we will:

1. Install required libraries  
2. Import PyTorch, its pruning utilities, and `transformers`  
3. Load a pre‑trained GPT‑2 model  
4. Measure its original size (parameter count and non‑zero weights)  
5. Apply global magnitude‑based pruning to its Linear layers  
6. Compare effective parameter counts before and after pruning  
7. (Optional) Remove pruning reparameterization to make sparsity permanent  
8. Compare model outputs on a simple prompt before vs. after pruning  

Magnitude‑based pruning zeroes out (or removes) the weights whose magnitudes fall below some threshold. The result is a sparse model that can be more efficient at inference time, provided the runtime is able to exploit the sparsity.
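
As a minimal sketch of the thresholding idea, the snippet below prunes a toy weight vector by hand. The values and the `0.25` threshold are illustrative assumptions, not taken from GPT‑2.

```python
import torch

# Toy weight vector; the values and the threshold are illustrative only.
weights = torch.tensor([0.80, -0.05, 0.30, -0.60, 0.10, 0.02])
threshold = 0.25

# Keep weights whose absolute value reaches the threshold; zero the rest.
mask = weights.abs() >= threshold
pruned = torch.where(mask, weights, torch.zeros_like(weights))

# Sparsity is the fraction of weights that were zeroed.
sparsity = 1.0 - mask.float().mean().item()
print(pruned)
print(f"sparsity = {sparsity:.0%}")
```

The pruning utilities used later in this notebook pick the threshold implicitly: you specify the fraction of weights to remove, and the smallest‑magnitude ones are masked.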


## 1. Install Dependencies

We'll need the Hugging Face **transformers** library for GPT‑2 and **torch** for pruning utilities.


In [None]:
!pip install transformers torch --quiet


## 2. Import Libraries

Bring in PyTorch, pruning tools, and the Hugging Face `transformers` API.


In [None]:
import copy
import torch
import torch.nn.utils.prune as prune
from transformers import GPT2LMHeadModel, GPT2Tokenizer


## 3. Load Pre‑trained Model and Tokenizer

We'll use the small `gpt2` model for a quick demo.  
We also switch it to evaluation mode.


In [None]:
model_name = "gpt2"
tokenizer  = GPT2Tokenizer.from_pretrained(model_name)
model      = GPT2LMHeadModel.from_pretrained(model_name)
model.eval()

# Keep a copy for “before‑pruning” comparisons
model_before = copy.deepcopy(model)


## 4. Count Effective Parameters (Original)

We define a helper that walks through each `Linear` layer and counts:
- **total** number of weights  
- **non‑zero** weights in the **effective** parameter (`module.weight.data`)  

Once pruning is applied, PyTorch exposes the masked tensor as `module.weight`, so this count reflects any pruning masks.


In [None]:
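def count_effective_weights(m):
    """Return (total, nonzero) weight counts over all nn.Linear layers of `m`."""
    total, nonzero = 0, 0
    for module in m.modules():
        if isinstance(module, torch.nn.Linear):
            # After pruning, `weight` is the masked tensor (weight_orig * weight_mask)
            w = module.weight.data
            total   += w.numel()
            nonzero += (w != 0).sum().item()
    return total, nonzero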

orig_total, orig_nonzero = count_effective_weights(model)
print(f"Before pruning: total={orig_total:,}, non_zero={orig_nonzero:,} "
      f"({100 * (orig_nonzero/orig_total):.1f}% dense)")


Before pruning: total=38,597,376, non_zero=38,597,376 (100.0% dense)
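
The total above equals 50,257 × 768, the size of a single vocabulary projection. One way to see why is to tally which module types hold 2‑D weight matrices. The sketch below does this on a tiny, randomly initialised GPT‑2 so that nothing is downloaded; the configuration sizes and the names `tiny_config`, `tiny_model`, and `type_counts` are assumptions made for illustration, while the module classes are those of the Hugging Face GPT‑2 implementation.

```python
from collections import Counter

from transformers import GPT2Config, GPT2LMHeadModel

# Hypothetical tiny configuration (2 blocks) so no weights are downloaded;
# the module classes are the same ones the pre-trained "gpt2" model uses.
tiny_config = GPT2Config(vocab_size=100, n_positions=32, n_embd=16, n_layer=2, n_head=2)
tiny_model = GPT2LMHeadModel(tiny_config)

# Tally modules that own a 2-D weight matrix, grouped by class name.
type_counts = Counter(
    type(module).__name__
    for module in tiny_model.modules()
    if getattr(module, "weight", None) is not None and module.weight.dim() == 2
)
print(type_counts)
```

Each transformer block contributes four `Conv1D` projections, and the only `Linear` module is `lm_head`, so an `isinstance(module, torch.nn.Linear)` filter sees just that one layer.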


## 5. Apply Global Unstructured Pruning

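We collect the `weight` tensors of **all** `Linear` modules and prune the 30% with the smallest magnitude **globally**, that is, ranked across all collected tensors together rather than layer by layer. Note that Hugging Face's GPT‑2 implements its attention and MLP projections with a custom `Conv1D` module rather than `nn.Linear`, so the `isinstance` filter below matches only the `lm_head` output projection (50,257 × 768 = 38,597,376 weights, the total reported above). The transformer blocks themselves are left unpruned.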


In [None]:
# Gather (module, 'weight') pairs for pruning
to_prune = [
    (module, 'weight')
    for module in model.modules()
    if isinstance(module, torch.nn.Linear)
]

# Apply global L1‑unstructured pruning: zero out 30% of weights by magnitude
prune.global_unstructured(
    to_prune,
    pruning_method=prune.L1Unstructured,
    amount=0.3,
)
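
To illustrate what "globally" means, here is a small sketch with two toy layers whose weights are set by hand (the layer names and values are illustrative assumptions). Because all weights are ranked together, the layer with small weights absorbs the entire pruning budget.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Two toy layers (illustrative values): one with small weights, one with large.
small = nn.Linear(4, 1, bias=False)
large = nn.Linear(4, 1, bias=False)
with torch.no_grad():
    small.weight.copy_(torch.tensor([[0.1, 0.2, 0.3, 0.4]]))
    large.weight.copy_(torch.tensor([[1.0, 2.0, 3.0, 4.0]]))

# Rank all 8 weights together and mask the 4 with the smallest magnitude.
prune.global_unstructured(
    [(small, "weight"), (large, "weight")],
    pruning_method=prune.L1Unstructured,
    amount=0.5,
)

def layer_sparsity(layer):
    """Fraction of exactly-zero entries in the layer's effective weight."""
    return (layer.weight == 0).float().mean().item()

print(f"small: {layer_sparsity(small):.0%} zero, large: {layer_sparsity(large):.0%} zero")
```

Per‑layer pruning with the same `amount` would instead zero half of each layer, so global pruning can leave individual layers far above or below the overall sparsity target.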


## 6. Count Effective Parameters After Pruning

Now that we’ve applied the masks, count the **effective** non‑zero weights again.


In [None]:
post_total, post_nonzero = count_effective_weights(model)
print(f"After pruning:  total={post_total:,}, non_zero={post_nonzero:,} "
      f"({100 * (post_nonzero/post_total):.1f}% dense)")
print(f"Zeroed weights: {(post_total - post_nonzero):,} "
      f"({100 * (1 - post_nonzero/post_total):.1f}% pruned)")


After pruning:  total=38,597,376, non_zero=27,018,163 (70.0% dense)
Zeroed weights: 11,579,213 (30.0% pruned)
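
Note that `total` is unchanged: masked weights are still stored, just as zeros. The sketch below checks this on a toy layer (the 100×100 size and the variable names are illustrative assumptions) by comparing the dense storage footprint before and after pruning.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

# Toy layer standing in for a pruned projection (size is illustrative).
toy_layer = nn.Linear(100, 100, bias=False)
dense_bytes_before = toy_layer.weight.numel() * toy_layer.weight.element_size()

# Mask the 30% of weights with the smallest magnitude in this one layer.
prune.l1_unstructured(toy_layer, name="weight", amount=0.3)

# The masked weight is still a dense 100x100 tensor of the same dtype.
dense_bytes_after = toy_layer.weight.numel() * toy_layer.weight.element_size()
nonzero = int((toy_layer.weight != 0).sum())

print(f"dense bytes before: {dense_bytes_before:,}")
print(f"dense bytes after:  {dense_bytes_after:,}")
print(f"non-zero weights:   {nonzero:,} of {toy_layer.weight.numel():,}")
```

Turning the zeros into actual memory or latency savings requires a sparse storage format or kernels that skip zeros, which this notebook does not cover.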


## 7. (Optional) Remove Pruning Reparameterization

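PyTorch implements pruning as a reparameterization: the original tensor is kept as the parameter `weight_orig`, a binary `weight_mask` buffer is registered, and `weight` is recomputed as their product before each forward pass.  
To make the sparsity permanent (and drop the extra `weight_orig` and `weight_mask` entries), remove the reparameterization: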


In [None]:
for module, _ in to_prune:
    prune.remove(module, 'weight')
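
The effect of `prune.remove` can be inspected on a toy layer. In this sketch the layer, its hand‑set weights, and the variable names are illustrative assumptions; the calls are from `torch.nn.utils.prune`.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

demo_layer = nn.Linear(4, 2, bias=False)
with torch.no_grad():
    # Illustrative weights 1..8 so the pruned entries are predictable.
    demo_layer.weight.copy_(torch.arange(1.0, 9.0).reshape(2, 4))

prune.l1_unstructured(demo_layer, name="weight", amount=0.5)

# While the reparameterization is active, `weight` is derived from these two.
params_during = [name for name, _ in demo_layer.named_parameters()]
buffers_during = [name for name, _ in demo_layer.named_buffers()]
print(params_during, buffers_during, prune.is_pruned(demo_layer))

prune.remove(demo_layer, "weight")

# Afterwards `weight` is an ordinary parameter again, with the zeros baked in.
params_after = [name for name, _ in demo_layer.named_parameters()]
buffers_after = [name for name, _ in demo_layer.named_buffers()]
print(params_after, buffers_after, prune.is_pruned(demo_layer))
print(demo_layer.weight.data)
```

After removal the mask can no longer be recovered, and the pruned values of `weight_orig` are gone, so this step is best done once you are satisfied with the sparsity level.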


## 8. Compare Model Outputs Before vs. After Pruning

Finally, we generate text from the **unpruned** copy and the **pruned** model on the same prompt to observe any differences.

> **Note:** Because we mutated `model` in place, we kept `model_before` for a clean comparison.


In [None]:
prompt = "In a distant future, AI and humans"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    # Original model
    out_before = model_before.generate(**inputs, max_new_tokens=40)
    # Pruned model
    out_after  = model.generate(**inputs, max_new_tokens=40)

print("=== Original GPT‑2 Output ===")
print(tokenizer.decode(out_before[0], skip_special_tokens=True))
print("\n=== Pruned GPT‑2 Output ===")
print(tokenizer.decode(out_after[0], skip_special_tokens=True))


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


=== Original GPT‑2 Output ===
In a distant future, AI and humans will be able to communicate with each other, and the AI will be able to communicate with humans.

The AI will be able to communicate with humans, and the AI will be able to communicate

=== Pruned GPT‑2 Output ===
In a distant future, AI and humans will be able to communicate with each other using the same language.

"We're going to have a lot more interaction between humans and AI," said Dr. Michael Siegel, director of the

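Text generated from a single prompt gives only an anecdotal comparison. A quantitative complement is to compare perplexity on the same tokens for both models. The helper below is a hypothetical sketch, not part of the original notebook; it is smoke‑tested here on a tiny randomly initialised model (illustrative sizes) so that nothing needs to be downloaded.

```python
import math

import torch
from transformers import GPT2Config, GPT2LMHeadModel

def perplexity(lm, input_ids):
    """Perplexity of a causal LM on one batch of token ids (hypothetical helper)."""
    with torch.no_grad():
        # Passing labels=input_ids makes the model return the mean next-token loss.
        loss = lm(input_ids=input_ids, labels=input_ids).loss
    return math.exp(loss.item())

# Smoke test on a tiny randomly initialised model (illustrative sizes, no download).
tiny_config = GPT2Config(vocab_size=100, n_positions=32, n_embd=16, n_layer=1, n_head=2)
tiny_lm = GPT2LMHeadModel(tiny_config).eval()
dummy_ids = torch.randint(0, 100, (1, 16))
print(f"perplexity on random tokens: {perplexity(tiny_lm, dummy_ids):.1f}")
```

In this notebook, the same helper could be called as `perplexity(model_before, ids)` and `perplexity(model, ids)`, with `ids = tokenizer(text, return_tensors="pt").input_ids` for some held‑out `text` of your choosing; a lower value indicates a better fit to that text.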

## Conclusion

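In this demonstration, we applied global magnitude‑based pruning to the `nn.Linear` weights of GPT‑2, zeroing out the 30% with the smallest magnitude while preserving the model's overall structure. In the Hugging Face implementation this covers only the `lm_head` projection (about 38.6M weights), since the attention and MLP projections are `Conv1D` modules. On the single prompt tested, the pruned model still produced coherent text, with different phrasing from the original, which suggests that this level of sparsity does not cause catastrophic quality loss; a quantitative metric such as perplexity would be needed to confirm it. Because the zeroed weights are still stored in dense tensors, unstructured pruning alone does not shrink the model or speed it up: those gains require sparse storage formats or sparsity‑aware kernels. Future steps include extending pruning to the `Conv1D` layers, experimenting with different sparsity levels and structured pruning approaches, and fine‑tuning to recover any performance gaps.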
