# Remove "Harry Potter" from GPT-2 in 5 Minutes

**Zero setup • Colab-ready • No local install required**

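This notebook demonstrates a **machine unlearning** workflow: we apply gradient ascent to GPT-2 on a small set of "Harry Potter" sentences, keep a set of general sentences as a retain reference, and measure the effect with a membership inference attack (MIA). Typical use cases for this kind of workflow are GDPR deletion requests, copyright removal, and concept erasure.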

## What you'll see:
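1. **Before**: Measure how easily a membership inference attack (MIA) tells Harry Potter text apart from general text (AUC far from 0.5 = distinguishable)
2. **Unlearn**: Gradient ascent on the forget set (~2 min on GPU)
3. **After**: Re-measure the MIA AUC; a value near 0.5 (random guessing) means the forget set is no longer distinguishable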

In [1]:
# Install Erasus (skip if already installed)
# If erasus is not on PyPI yet: !pip install -q git+https://github.com/OnePunchMonk/erasus.git
!pip install -q erasus transformers datasets scikit-learn

In [2]:
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoTokenizer, AutoModelForCausalLM

from erasus.unlearners import LLMUnlearner
from erasus.metrics.membership_inference import MembershipInferenceMetric
import erasus.strategies  # register strategies

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

Using device: cpu


## 1. Load GPT-2 and prepare data

In [3]:
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Wrapper: Erasus expects (batch, vocab) logits; GPT-2 returns (batch, seq, vocab)
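from types import SimpleNamespace

class GPT2LastTokenWrapper(torch.nn.Module):
    def __init__(self, gpt2):
        super().__init__()
        self.gpt2 = gpt2

    def forward(self, input_ids):
        out = self.gpt2(input_ids)
        # Keep only the logits at the final sequence position: (batch, seq, vocab) -> (batch, vocab)
        return SimpleNamespace(logits=out.logits[:, -1, :])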

model = GPT2LastTokenWrapper(model)

GPT2LMHeadModel LOAD REPORT from: gpt2
Key                  | Status
---------------------+-----------
h.{0...11}.attn.bias | UNEXPECTED

Notes:
- UNEXPECTED: can be ignored when loading from different task/architecture; not ok if you expect identical arch.


In [4]:
# Forget set: Harry Potter excerpts
HARRY_POTTER_TEXTS = [
    "Harry Potter and the Philosopher's Stone is a fantasy novel",
    "Harry Potter walked into the Great Hall at Hogwarts",
    "Hermione Granger helped Harry and Ron defeat the troll",
    "Dumbledore told Harry about the prophecy and Voldemort",
    "The Sorting Hat placed Harry in Gryffindor house",
    "Harry discovered he was a wizard on his eleventh birthday",
    "Hagrid took Harry to Diagon Alley to buy his wand",
    "Voldemort tried to kill Harry when he was a baby",
    "Harry learned about Horcruxes from Dumbledore",
    "The Boy Who Lived defeated the Dark Lord",
]

# Retain set: General text (model should keep this knowledge)
RETAIN_TEXTS = [
    "The quick brown fox jumps over the lazy dog",
    "Machine learning is a subset of artificial intelligence",
    "The weather today is sunny and warm",
    "Python is a popular programming language",
    "Coffee is one of the most consumed beverages worldwide",
    "The capital of France is Paris",
    "Water boils at 100 degrees Celsius at sea level",
    "The Earth orbits around the Sun",
    "Democracy allows citizens to vote for their leaders",
    "Books are a great source of knowledge and entertainment",
]

In [5]:
def tokenize_texts(texts, tokenizer, max_length=32):
    # GPT-2 ships without a pad token, so reuse EOS for padding
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    enc = tokenizer(texts, padding="max_length", truncation=True, max_length=max_length, return_tensors="pt")
    input_ids = enc["input_ids"]
    # Label = token at the last position. Caveat: the GPT-2 tokenizer pads on the right by default,
    # so for texts shorter than max_length this is the pad (EOS) token, not the last real word.
    labels = input_ids[:, -1].clone()
    return input_ids, labels

def make_loader(texts, tokenizer, batch_size=4):
    input_ids, labels = tokenize_texts(texts, tokenizer)
    ds = TensorDataset(input_ids, labels)  # already imported from torch.utils.data
    return DataLoader(ds, batch_size=batch_size)

forget_loader = make_loader(HARRY_POTTER_TEXTS, tokenizer)
retain_loader = make_loader(RETAIN_TEXTS, tokenizer)
print(f"Forget: {len(HARRY_POTTER_TEXTS)} samples | Retain: {len(RETAIN_TEXTS)} samples")

Forget: 10 samples | Retain: 10 samples
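
`tokenize_texts` reads the label from position `-1` of a sequence padded to `max_length`. With right padding, that position holds a pad token for any short text. One possible way to target the last real token instead is to locate it through the attention mask. The sketch below uses plain Python lists and made-up token ids; `PAD = 0` and the helper name `split_prompt_and_label` are assumptions for illustration, not part of Erasus or `transformers`.

```python
PAD = 0  # hypothetical pad id for this toy example


def split_prompt_and_label(input_ids, attention_mask):
    """Return (prompt_ids, label), where label is the last non-pad token.

    Assumes right padding: real tokens first (mask 1), then pads (mask 0).
    """
    n_real = sum(attention_mask)
    if n_real < 2:
        raise ValueError("need at least two real tokens: a prompt and a label")
    label = input_ids[n_real - 1]      # last real token becomes the target
    prompt = input_ids[: n_real - 1]   # everything before it is the input
    return prompt, label


ids = [11, 42, 7, PAD, PAD]
mask = [1, 1, 1, 0, 0]

prompt, label = split_prompt_and_label(ids, mask)
naive_label = ids[-1]  # what indexing position -1 would give

print(prompt, label)  # [11, 42] 7
print(naive_label)    # 0 (the pad token)
```

The same idea carries over to tensors by using the `attention_mask` that the tokenizer already returns alongside `input_ids`.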


## 2. Measure MIA before unlearning

In [6]:
mia = MembershipInferenceMetric()
metrics_before = mia.compute(model, forget_data=forget_loader, retain_data=retain_loader)
print(f"MIA AUC (before): {metrics_before['mia_auc']:.4f}")
print("  → 1.0 = model perfectly identifies Harry Potter text (BAD for privacy)")
print("  → 0.5 = random guessing (GOOD = forgotten)")

MIA AUC (before): 0.3100
  → 1.0 = model perfectly identifies Harry Potter text (BAD for privacy)
  → 0.5 = random guessing (GOOD = forgotten)
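
How `MembershipInferenceMetric` computes its score is not shown in this notebook. Assuming the standard pairwise definition of AUC, the value is the probability that a randomly chosen forget sample gets a higher attack score than a randomly chosen retain sample. Under that reading, the recorded 0.31 is *below* 0.5: the two sets are still separable, just with the ranking inverted, so what matters is the distance from 0.5. The scores below are invented for illustration.

```python
def pairwise_auc(member_scores, nonmember_scores):
    """Fraction of (member, non-member) pairs where the member scores higher.

    Ties count as half a win. This is the textbook definition of ROC AUC.
    """
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))


# Hypothetical attack scores, not taken from the run above
separated = pairwise_auc([0.9, 0.8, 0.7], [0.3, 0.2, 0.1])
chance = pairwise_auc([0.5, 0.2], [0.5, 0.2])
inverted = pairwise_auc([0.3, 0.2, 0.1], [0.9, 0.8, 0.7])

print(separated)  # 1.0: members always score higher
print(chance)     # 0.5: indistinguishable
print(inverted)   # 0.0: perfectly separable, ranking flipped
```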


## 3. Unlearn Harry Potter

In [7]:
unlearner = LLMUnlearner(
    model=model,
    strategy="gradient_ascent",
    selector=None,
    device=device,
    strategy_kwargs={"lr": 5e-5},
)

result = unlearner.fit(
    forget_data=forget_loader,
    retain_data=retain_loader,
    epochs=5,
)
print(f"Done in {result.elapsed_time:.2f}s")

Done in 35.71s
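
How Erasus implements `gradient_ascent` internally is not shown here. As a rough illustration of the general idea only: ordinary training steps *down* the loss gradient, while gradient ascent steps *up* it on the forget set, which lowers the probability the model assigns to the forgotten target. The toy below applies this to three raw logits in plain Python; the function names and numbers are made up for illustration.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_grad(logits, target):
    # Gradient of -log(softmax(logits)[target]) w.r.t. logits: p - onehot(target)
    probs = softmax(logits)
    return [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]


def gradient_ascent_step(logits, target, lr):
    grad = cross_entropy_grad(logits, target)
    # Plus sign: move UP the loss surface (descent would use minus)
    return [z + lr * g for z, g in zip(logits, grad)]


logits = [2.0, 0.5, 0.1]  # toy "model": class 0 is the memorized target
target = 0

p_before = softmax(logits)[target]
for _ in range(5):
    logits = gradient_ascent_step(logits, target, lr=0.5)
p_after = softmax(logits)[target]

print(f"p(target) before: {p_before:.3f}, after: {p_after:.3f}")
```

Unbounded ascent can also damage unrelated behavior, which is why a retain set is passed alongside the forget set.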


## 4. Measure MIA after unlearning

In [8]:
metrics_after = mia.compute(unlearner.model, forget_data=forget_loader, retain_data=retain_loader)
print(f"MIA AUC (after):  {metrics_after['mia_auc']:.4f}")
print(f"\nBefore → After: {metrics_before['mia_auc']:.4f} → {metrics_after['mia_auc']:.4f}")
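closer = abs(metrics_after['mia_auc'] - 0.5) < abs(metrics_before['mia_auc'] - 0.5)
print("\n✅ MIA AUC moved toward 0.5: the forget set is harder to distinguish." if closer
      else "\n❌ MIA AUC did not move toward 0.5: this run shows no measurable forgetting.")

MIA AUC (after):  0.3000

Before → After: 0.3100 → 0.3000

❌ MIA AUC did not move toward 0.5: this run shows no measurable forgetting.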


---
**Next**: Try more strategies (`scrub`, `fisher_forgetting`), different concepts, or run on your own data. See [Erasus](https://github.com/OnePunchMonk/erasus) for full docs.