# Building Pun Evaluation Datasets with LLMs

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Nix07/neural-mechanics-web/blob/main/labs/week3/pun_dataset_builder.ipynb)

This notebook demonstrates **model-written evaluations**: using LLMs to help create evaluation datasets. We'll build datasets for studying how language models process puns, suitable for both behavioral evaluation and interpretability experiments.

**What we'll create:**
1. Pun examples with controlled structure (setup → punchline)
2. Matched literal/pun pairs for the same ambiguous words
3. Cloze-style evaluation sets for probing
4. Quality ratings using LLM-as-judge

**Supports:** Anthropic Claude, OpenAI GPT, and Google Gemini APIs

## References
- [Model-Written Evaluations](https://arxiv.org/abs/2212.09251) - Perez et al.
- [LLM-as-Judge](https://arxiv.org/abs/2306.05685) - Zheng et al.
- [LAMA: Language Models as Knowledge Bases](https://arxiv.org/abs/1909.01066) - Petroni et al.

## Setup

Install the API clients you plan to use:

In [None]:
# Install whichever API client(s) you need
!pip install -q anthropic openai google-generativeai

In [None]:
import os
import json
import re
from typing import Optional, List, Dict, Any
from dataclasses import dataclass, asdict
import pandas as pd

# Set your API key(s) - use whichever provider you have access to
# You can set these as environment variables or paste directly (less secure)

# Option 1: Anthropic Claude
# os.environ["ANTHROPIC_API_KEY"] = "your-key-here"

# Option 2: OpenAI
# os.environ["OPENAI_API_KEY"] = "your-key-here"

# Option 3: Google Gemini
# os.environ["GOOGLE_API_KEY"] = "your-key-here"

## Unified LLM Interface

We create a simple wrapper that works with any of the three providers:

In [None]:
class LLMClient:
    """Unified interface for Anthropic, OpenAI, and Gemini APIs."""
    
    def __init__(self, provider: str = "anthropic"):
        """
        Initialize with a provider: 'anthropic', 'openai', or 'gemini'
        """
        self.provider = provider.lower()
        
        if self.provider == "anthropic":
            from anthropic import Anthropic
            self.client = Anthropic()
            self.model = "claude-sonnet-4-20250514"
            
        elif self.provider == "openai":
            from openai import OpenAI
            self.client = OpenAI()
            self.model = "gpt-4o"
            
        elif self.provider == "gemini":
            import google.generativeai as genai
            genai.configure(api_key=os.environ.get("GOOGLE_API_KEY"))
            self.client = genai.GenerativeModel("gemini-1.5-pro")
            self.model = "gemini-1.5-pro"
            
        else:
            raise ValueError(f"Unknown provider: {provider}")
    
    def generate(self, prompt: str, system: str = "", max_tokens: int = 1024) -> str:
        """Generate a response from the LLM."""
        
        if self.provider == "anthropic":
            response = self.client.messages.create(
                model=self.model,
                max_tokens=max_tokens,
                system=system if system else "You are a helpful assistant.",
                messages=[{"role": "user", "content": prompt}]
            )
            return response.content[0].text
            
        elif self.provider == "openai":
            messages = []
            if system:
                messages.append({"role": "system", "content": system})
            messages.append({"role": "user", "content": prompt})
            
            response = self.client.chat.completions.create(
                model=self.model,
                max_tokens=max_tokens,
                messages=messages
            )
            return response.choices[0].message.content
            
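        elif self.provider == "gemini":
            # This wrapper passes the system text inline, ahead of the user prompt
            full_prompt = f"{system}\n\n{prompt}" if system else prompt
            response = self.client.generate_content(
                full_prompt,
                generation_config={"max_output_tokens": max_tokens},  # honor max_tokens
            )
            return response.text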

# Choose your provider here!
PROVIDER = "anthropic"  # or "openai" or "gemini"
llm = LLMClient(PROVIDER)
print(f"Using {PROVIDER} with model {llm.model}")

In [None]:
# Test the connection
test_response = llm.generate("What do you call a fish without eyes? (Give just the punchline)")
print(f"Test response: {test_response}")
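
The helper functions in this notebook only ever call `llm.generate(prompt, system=..., max_tokens=...)`, so any object exposing that method can stand in for a real provider. As a minimal sketch, the hypothetical `StubLLM` below (an illustrative name, not part of any SDK) returns canned text, which may be useful for dry-running the parsing code without spending API credits.

```python
from typing import List


class StubLLM:
    """Hypothetical offline stand-in with the same generate() signature as LLMClient."""

    def __init__(self, canned_responses: List[str]):
        self.canned_responses = list(canned_responses)
        self.model = "stub"
        self.calls = []  # Record prompts so they can be inspected afterwards

    def generate(self, prompt: str, system: str = "", max_tokens: int = 1024) -> str:
        self.calls.append(prompt)
        # Cycle through the canned responses in order
        index = (len(self.calls) - 1) % len(self.canned_responses)
        return self.canned_responses[index]


stub = StubLLM(['[{"setup": "Why?", "punchline": "Because."}]', "Fsh."])
first = stub.generate("Generate 1 pun")
second = stub.generate("What do you call a fish without eyes?")
print(first)
print(second)
```

Since the functions in later cells accept an `llm` argument, passing `llm=stub` should exercise their parsing paths offline, assuming the canned text matches the JSON shape each function expects.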

## Part 1: Generating Pun Examples

Let's generate puns with controlled structure. We'll ask the LLM to create puns that follow a specific format, making them easier to analyze.

In [None]:
@dataclass
class PunExample:
    """A structured pun example."""
    setup: str
    punchline: str
    pun_word: str
    meaning1: str  # First meaning (usually literal/expected)
    meaning2: str  # Second meaning (the joke)
    category: str  # Type of pun: homophone, homograph, compound, etc.
    
    def full_joke(self) -> str:
        return f"{self.setup} {self.punchline}"

def generate_puns(topic: str, n: int = 5, llm: LLMClient = llm) -> List[PunExample]:
    """Generate structured pun examples on a given topic."""
    
    system = """You are an expert at creating and analyzing puns. 
When asked to create puns, you provide them in a structured JSON format."""
    
    prompt = f"""Generate {n} puns related to "{topic}". 

For each pun, provide a JSON object with these fields:
- setup: The setup/question part of the joke
- punchline: The punchline/answer
- pun_word: The word that has double meaning
- meaning1: The literal/expected meaning
- meaning2: The humorous/unexpected meaning  
- category: Type of pun (homophone, homograph, compound, or other)

Return a JSON array of {n} pun objects. Only return the JSON, no other text.

Example format:
[
  {{
    "setup": "Why do electricians make good swimmers?",
    "punchline": "Because they know the current.",
    "pun_word": "current",
    "meaning1": "electrical current",
    "meaning2": "water current",
    "category": "homograph"
  }}
]"""
    
    response = llm.generate(prompt, system=system, max_tokens=2000)
    
    # Parse JSON from response
    try:
        # Find JSON array in response
        json_match = re.search(r'\[.*\]', response, re.DOTALL)
        if json_match:
            puns_data = json.loads(json_match.group())
        else:
            puns_data = json.loads(response)
        
        # PunExample(**p) raises TypeError if the model adds or omits a field
        return [PunExample(**p) for p in puns_data]
    except (json.JSONDecodeError, TypeError) as e:
        print(f"Failed to parse puns: {e}")
        print(f"Raw response: {response}")
        return []
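
The greedy pattern `\[.*\]` used above spans from the first `[` to the last `]` in the reply, so a response with trailing bracketed text (for example a footnote marker) would fail to parse. One possible alternative, sketched here under the assumption that the first well-formed JSON array in the reply is the one we want, is to let `json.JSONDecoder.raw_decode` find where the array ends. The name `extract_json_array` is hypothetical.

```python
import json
from typing import Any, List


def extract_json_array(text: str) -> List[Any]:
    """Return the first JSON array embedded in `text`, or [] if none parses."""
    decoder = json.JSONDecoder()
    start = text.find("[")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and ignores the rest
            value, _ = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            value = None
        if isinstance(value, list):
            return value
        start = text.find("[", start + 1)
    return []


reply = 'Here you go:\n```json\n[{"setup": "a", "punchline": "b"}]\n```\nEnjoy [1]!'
print(extract_json_array(reply))
```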

In [None]:
# Generate puns about different topics
topics = ["science", "music", "food", "sports"]

all_puns = []
for topic in topics:
    print(f"\nGenerating puns about {topic}...")
    puns = generate_puns(topic, n=3)
    all_puns.extend(puns)
    
    for pun in puns:
        print(f"  - {pun.setup} {pun.punchline}")
        print(f"    Pun word: '{pun.pun_word}' ({pun.meaning1} / {pun.meaning2})")

print(f"\nTotal puns generated: {len(all_puns)}")
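
Generated puns do not always respect the requested structure, so a quick sanity pass before building on them may save trouble later. The sketch below works on plain dicts (such as those produced by `asdict`) and applies three heuristic checks that are assumptions of this example rather than part of the notebook: the pun word appears in the joke as a whole word, the category is one of the four requested, and the two meanings differ. Homophone puns can legitimately fail the first check when the joke spells the other word.

```python
import re
from typing import Dict, List

ALLOWED_CATEGORIES = {"homophone", "homograph", "compound", "other"}


def pun_problems(pun: Dict[str, str]) -> List[str]:
    """Return human-readable problems; an empty list means the pun passed."""
    problems = []
    joke = f"{pun['setup']} {pun['punchline']}"
    # Whole-word, case-insensitive search for the declared pun word
    if not re.search(rf"\b{re.escape(pun['pun_word'])}\b", joke, flags=re.IGNORECASE):
        problems.append("pun_word does not appear in the joke")
    if pun["category"] not in ALLOWED_CATEGORIES:
        problems.append(f"unexpected category: {pun['category']}")
    if pun["meaning1"].strip().lower() == pun["meaning2"].strip().lower():
        problems.append("meaning1 and meaning2 are identical")
    return problems


good = {
    "setup": "Why do electricians make good swimmers?",
    "punchline": "Because they know the current.",
    "pun_word": "current",
    "meaning1": "electrical current",
    "meaning2": "water current",
    "category": "homograph",
}
bad = dict(good, pun_word="bass", category="wordplay", meaning2="Electrical current")
print(pun_problems(good))
print(pun_problems(bad))
```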

## Part 2: Creating Matched Literal/Pun Pairs

For interpretability experiments, we need pairs of sentences where the same word is used literally vs. as a pun. This lets us compare how the model processes the same word in different contexts.

In [None]:
@dataclass 
class PunLiteralPair:
    """A matched pair of pun and literal usage of the same word."""
    target_word: str
    pun_context: str
    literal_context1: str  # Using meaning 1
    literal_context2: str  # Using meaning 2
    meaning1: str
    meaning2: str

def generate_matched_pairs(pun_words: List[str], llm: LLMClient = llm) -> List[PunLiteralPair]:
    """Generate matched pun/literal pairs for given words."""
    
    system = """You create matched sentence pairs for linguistic analysis.
Given a word with multiple meanings, you create sentences using it in different contexts."""
    
    prompt = f"""For each of these words that can be used in puns, create:
1. A pun context (joke that plays on the double meaning)
2. A literal context using meaning 1
3. A literal context using meaning 2

Words: {json.dumps(pun_words)}

Return a JSON array with objects containing:
- target_word: the word
- pun_context: a joke/pun using the word
- literal_context1: sentence using first meaning
- literal_context2: sentence using second meaning
- meaning1: description of first meaning
- meaning2: description of second meaning

Only return the JSON array, no other text."""
    
    response = llm.generate(prompt, system=system, max_tokens=2000)
    
    try:
        json_match = re.search(r'\[.*\]', response, re.DOTALL)
        if json_match:
            pairs_data = json.loads(json_match.group())
        else:
            pairs_data = json.loads(response)
        # PunLiteralPair(**p) raises TypeError if the model adds or omits a field
        return [PunLiteralPair(**p) for p in pairs_data]
    except (json.JSONDecodeError, TypeError) as e:
        print(f"Failed to parse pairs: {e}")
        print(f"Raw response: {response}")
        return []

In [None]:
# Words that commonly appear in puns
pun_words = ["current", "interest", "bark", "bass", "light", "wave", "scale", "battery"]

pairs = generate_matched_pairs(pun_words[:4])  # Start with 4

for pair in pairs:
    print(f"\n=== {pair.target_word.upper()} ===")
    print(f"  Meaning 1: {pair.meaning1}")
    print(f"  Meaning 2: {pair.meaning2}")
    print(f"  \n  PUN: {pair.pun_context}")
    print(f"  LITERAL 1: {pair.literal_context1}")
    print(f"  LITERAL 2: {pair.literal_context2}")
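
To compare how a model processes the same word across contexts, it helps to know exactly where the word sits in each sentence. The sketch below locates the character span of the target word; the helper name `find_word_span` and the example sentences are illustrative, not model output. Character offsets like these can then be mapped to token positions with whatever tokenizer your experiment uses.

```python
import re
from typing import Optional, Tuple


def find_word_span(context: str, word: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) character offsets of the first whole-word match, or None."""
    match = re.search(rf"\b{re.escape(word)}\b", context, flags=re.IGNORECASE)
    return match.span() if match else None


# Illustrative contexts for one target word
contexts = {
    "pun": "I wanted to be a banker, but I lost interest.",
    "literal1": "The loan charges five percent interest.",
    "literal2": "She has a keen interest in astronomy.",
}
spans = {name: find_word_span(text, "interest") for name, text in contexts.items()}
for name, span in spans.items():
    print(name, span)
```

A `None` result signals that the word appears only in an inflected or embedded form (for example "bass" inside "bassist"), in which case the context is probably not a clean match for that pair.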

## Part 3: Cloze-Style Evaluation Sets

For probing experiments, we need sentences where the pun word is blanked out. This lets us test whether a model can predict the pun word from context.

In [None]:
@dataclass
class ClozeExample:
    """A cloze (fill-in-the-blank) example for probing."""
    prompt: str  # Text with blank
    target: str  # The correct word
    context_type: str  # "pun", "literal1", or "literal2"
    foils: List[str]  # Alternative answers that don't fit as well

def create_cloze_examples(pairs: List[PunLiteralPair]) -> List[ClozeExample]:
    """Convert matched pairs into cloze examples."""
    examples = []
    
    for pair in pairs:
        word = pair.target_word
        # Whole-word, case-insensitive match; re.escape guards regex metacharacters
        pattern = re.compile(rf'\b{re.escape(word)}\b', flags=re.IGNORECASE)
        
        contexts = [(pair.pun_context, "pun"),
                    (pair.literal_context1, "literal1"),
                    (pair.literal_context2, "literal2")]
        for ctx, ctx_type in contexts:
            # Blank only the first occurrence; subn reports whether a match was made
            cloze, n_blanked = pattern.subn('___', ctx, count=1)
            if n_blanked == 0:
                # Skip contexts where the word appears only inflected (e.g. "currents"),
                # which would otherwise yield a prompt with no blank
                continue
            examples.append(ClozeExample(
                prompt=cloze,
                target=word,
                context_type=ctx_type,
                foils=[]  # Filled in later (see Exercise 3)
            ))
    
    return examples

cloze_examples = create_cloze_examples(pairs)

print(f"Created {len(cloze_examples)} cloze examples:\n")
for ex in cloze_examples:
    print(f"[{ex.context_type}] {ex.prompt}")
    print(f"  Answer: {ex.target}\n")
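
Once a model has filled in the blanks, a natural first analysis is accuracy broken down by context type, to see whether pun contexts are harder to complete than literal ones. The sketch below assumes exact-match scoring on plain dicts with the same fields as `ClozeExample`; the function name `cloze_accuracy` and the predictions are invented for illustration and are not real model outputs.

```python
from collections import defaultdict
from typing import Dict, List


def cloze_accuracy(examples: List[Dict[str, str]], predictions: List[str]) -> Dict[str, float]:
    """Exact-match accuracy per context_type (case-insensitive)."""
    if len(examples) != len(predictions):
        raise ValueError("examples and predictions must have the same length")
    hits = defaultdict(int)
    totals = defaultdict(int)
    for example, predicted in zip(examples, predictions):
        ctx_type = example["context_type"]
        totals[ctx_type] += 1
        if predicted.strip().lower() == example["target"].strip().lower():
            hits[ctx_type] += 1
    return {ctx_type: hits[ctx_type] / totals[ctx_type] for ctx_type in totals}


toy_examples = [
    {"prompt": "The tree's ___ was worse than its bite.", "target": "bark", "context_type": "pun"},
    {"prompt": "Electricians always know the ___.", "target": "current", "context_type": "pun"},
    {"prompt": "The dog's ___ woke the neighbors.", "target": "bark", "context_type": "literal1"},
]
toy_predictions = ["bark", "answer", "Bark"]  # Made-up predictions
accuracy = cloze_accuracy(toy_examples, toy_predictions)
print(accuracy)
```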

## Part 4: LLM-as-Judge for Quality Rating

Not all generated puns are equally good. Let's use the LLM to rate pun quality on multiple dimensions.

In [None]:
@dataclass
class PunRating:
    """Quality ratings for a pun."""
    humor_score: int  # 1-5: How funny is it?
    clarity_score: int  # 1-5: How clear is the double meaning?
    originality_score: int  # 1-5: How original/novel?
    groan_factor: int  # 1-5: How much of a "groaner" is it?
    explanation: str

def rate_pun(pun: PunExample, llm: LLMClient = llm) -> PunRating:
    """Use LLM-as-judge to rate a pun's quality."""
    
    system = """You are an expert judge of pun quality. 
Rate puns on multiple dimensions and explain your ratings."""
    
    prompt = f"""Rate this pun on a scale of 1-5 for each dimension:

Pun: "{pun.full_joke()}"
Pun word: "{pun.pun_word}" (plays on: {pun.meaning1} vs {pun.meaning2})

Dimensions:
1. humor_score: How funny is it? (1=not funny, 5=hilarious)
2. clarity_score: How clear is the double meaning? (1=confusing, 5=crystal clear)
3. originality_score: How original? (1=very common, 5=never heard before)
4. groan_factor: How much of a "dad joke" groaner? (1=not at all, 5=maximum groan)

Return a JSON object with:
- humor_score: int 1-5
- clarity_score: int 1-5  
- originality_score: int 1-5
- groan_factor: int 1-5
- explanation: brief explanation of your ratings

Only return the JSON object, no other text."""
    
    response = llm.generate(prompt, system=system, max_tokens=500)
    
    try:
        json_match = re.search(r'\{.*\}', response, re.DOTALL)
        if json_match:
            rating_data = json.loads(json_match.group())
        else:
            rating_data = json.loads(response)
        return PunRating(**rating_data)
    except (json.JSONDecodeError, TypeError) as e:
        print(f"Failed to parse rating: {e}")
        # Placeholder scores, not real judgments: drop these rows before analysis
        return PunRating(3, 3, 3, 3, "Failed to parse")

In [None]:
# Rate our generated puns
rated_puns = []

for pun in all_puns[:5]:  # Rate first 5
    print(f"\nRating: {pun.full_joke()}")
    rating = rate_pun(pun)
    rated_puns.append((pun, rating))
    
    print(f"  Humor: {rating.humor_score}/5")
    print(f"  Clarity: {rating.clarity_score}/5")
    print(f"  Originality: {rating.originality_score}/5")
    print(f"  Groan factor: {rating.groan_factor}/5")
    print(f"  Note: {rating.explanation}")
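
Ratings become useful once they drive a filter. The sketch below keeps jokes that clear a clarity and a humor threshold and skips any rating whose explanation is `"Failed to parse"`, since `rate_pun` returns placeholder 3s in that case. The thresholds and the name `keep_high_quality` are assumptions of this example, and the ratings are written as plain dicts so the block runs on its own.

```python
from typing import Any, Dict, List, Tuple

Rated = Tuple[str, Dict[str, Any]]  # (joke text, rating fields)


def keep_high_quality(rated: List[Rated], min_clarity: int = 4, min_humor: int = 3) -> List[str]:
    """Keep jokes whose ratings clear both thresholds and were parsed successfully."""
    kept = []
    for joke, rating in rated:
        if rating["explanation"] == "Failed to parse":
            continue  # Placeholder rating from the fallback branch, not a real judgment
        if rating["clarity_score"] >= min_clarity and rating["humor_score"] >= min_humor:
            kept.append(joke)
    return kept


# Invented ratings for illustration
toy_rated = [
    ("joke A", {"humor_score": 4, "clarity_score": 5, "explanation": "clear wordplay"}),
    ("joke B", {"humor_score": 2, "clarity_score": 5, "explanation": "clear but flat"}),
    ("joke C", {"humor_score": 3, "clarity_score": 3, "explanation": "Failed to parse"}),
]
print(keep_high_quality(toy_rated))
```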

## Part 5: Export Datasets

Let's export our generated data in formats useful for interpretability experiments.

In [None]:
def export_to_dataframe(puns: List[PunExample], 
                        pairs: List[PunLiteralPair],
                        cloze: List[ClozeExample]) -> Dict[str, pd.DataFrame]:
    """Export all datasets to DataFrames."""
    
    # Puns DataFrame
    puns_df = pd.DataFrame([asdict(p) for p in puns])
    puns_df['full_joke'] = puns_df['setup'] + ' ' + puns_df['punchline']
    
    # Pairs DataFrame  
    pairs_df = pd.DataFrame([asdict(p) for p in pairs])
    
    # Cloze DataFrame
    cloze_df = pd.DataFrame([asdict(c) for c in cloze])
    
    return {
        'puns': puns_df,
        'pairs': pairs_df,
        'cloze': cloze_df
    }

datasets = export_to_dataframe(all_puns, pairs, cloze_examples)

print("Puns dataset:")
display(datasets['puns'].head())

print("\nPairs dataset:")
display(datasets['pairs'].head())

print("\nCloze dataset:")
display(datasets['cloze'].head())

In [None]:
# Save to files
for name, df in datasets.items():
    filename = f"pun_{name}_dataset.csv"
    df.to_csv(filename, index=False)
    print(f"Saved {filename} ({len(df)} rows)")

# Also save as JSON for easy loading
all_data = {
    'puns': [asdict(p) for p in all_puns],
    'pairs': [asdict(p) for p in pairs],
    'cloze': [asdict(c) for c in cloze_examples]
}

with open('pun_dataset.json', 'w') as f:
    json.dump(all_data, f, indent=2)
print("Saved pun_dataset.json")
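
Loading the JSON back into dataclasses is the mirror image of the export. The sketch below shows the round trip with a stand-in dataclass, `ClozeRow`, which is a hypothetical copy of the `ClozeExample` fields so that the block runs on its own; it writes to a temporary directory rather than overwriting the real `pun_dataset.json`.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class ClozeRow:
    """Stand-in with the same fields as ClozeExample."""
    prompt: str
    target: str
    context_type: str
    foils: List[str]


rows = [ClozeRow("The dog's ___ was loud.", "bark", "literal1", [])]
path = os.path.join(tempfile.mkdtemp(), "pun_dataset.json")

# Export: dataclasses -> dicts -> JSON
with open(path, "w") as f:
    json.dump({"cloze": [asdict(r) for r in rows]}, f, indent=2)

# Import: JSON -> dicts -> dataclasses
with open(path) as f:
    loaded = json.load(f)
restored = [ClozeRow(**row) for row in loaded["cloze"]]
print(restored)
```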

## Part 6: Getting Files to Your Local Machine

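Colab runs in the cloud, so files you generate live on a temporary runtime and need to be copied somewhere permanent. Either download them directly to your computer with Colab's built-in download helper (first cell below) or copy them to Google Drive (second cell).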

In [None]:
from google.colab import files

# Download individual files
files.download('pun_dataset.json')
files.download('pun_puns_dataset.csv')
files.download('pun_pairs_dataset.csv')
files.download('pun_cloze_dataset.csv')

# Or zip everything and download once
!zip -r pun_datasets.zip pun_*.csv pun_*.json
files.download('pun_datasets.zip')

In [None]:
from google.colab import drive

# Mount Google Drive (will prompt for authorization)
drive.mount('/content/drive')

# Create a folder for your project
project_folder = '/content/drive/MyDrive/neural-mechanics-project/data'
!mkdir -p {project_folder}

# Copy files to Drive
!cp pun_dataset.json {project_folder}/
!cp pun_*.csv {project_folder}/

print(f"Files saved to Google Drive: {project_folder}")

## Exercise 1: Generate Domain-Specific Puns

Generate puns specific to your research domain. If you're studying legal concepts, medical terms, or mathematical notation, create puns that use terminology from that field.

## Exercise 2: Create Minimal Pairs

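Create pairs of sentences that end with the same word but differ in whether the preceding context sets that word up as a pun. Because the final word is held constant, any difference in how the model treats it can be attributed to the context, which makes these pairs useful for causal analysis.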

In [None]:
def generate_minimal_pairs(pun_word: str, llm: LLMClient = llm) -> Dict:
    """Generate minimal pairs: same ending, different setup (pun vs literal)."""
    
    prompt = f"""Create a minimal pair of sentences that both end with the word "{pun_word}".

1. A pun setup: A question/setup that makes "{pun_word}" funny as a punchline
2. A literal setup: A sentence where "{pun_word}" is just the normal/expected word

Both sentences should end with exactly "{pun_word}" as the final word.

Return JSON with:
- pun_setup: the joke setup ending in "{pun_word}"
- literal_setup: the normal sentence ending in "{pun_word}"
- pun_word: "{pun_word}"

Only return the JSON object."""
    
    response = llm.generate(prompt, max_tokens=300)
    
    try:
        json_match = re.search(r'\{.*\}', response, re.DOTALL)
        if json_match:
            return json.loads(json_match.group())
        return json.loads(response)
    except json.JSONDecodeError:
        return {"error": response}

# Example
minimal_pair = generate_minimal_pairs("current")
print(json.dumps(minimal_pair, indent=2))
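
The prompt asks for both sentences to end with the pun word, but nothing enforces it, so a pair is worth checking before use. A minimal sketch of such a check follows; `ends_with_word` and `is_valid_minimal_pair` are hypothetical helper names, and the check assumes the dict shape returned by `generate_minimal_pairs` (including its `"error"` fallback).

```python
import string


def ends_with_word(sentence: str, word: str) -> bool:
    """True if the last whitespace-separated token, minus punctuation, equals `word`."""
    tokens = sentence.split()
    if not tokens or not word:
        return False
    last = tokens[-1].strip(string.punctuation)
    return last.lower() == word.lower()


def is_valid_minimal_pair(pair: dict) -> bool:
    """Check that both setups are present and each ends with the pun word."""
    if "error" in pair:
        return False
    word = pair.get("pun_word", "")
    return all(
        ends_with_word(pair.get(key, ""), word)
        for key in ("pun_setup", "literal_setup")
    )


# Invented pair for illustration
toy_pair = {
    "pun_setup": "The electrician went swimming because he loves a strong current.",
    "literal_setup": "The kayak was pulled downstream by the current.",
    "pun_word": "current",
}
print(is_valid_minimal_pair(toy_pair))
```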

## Exercise 3: Generate Foils for Cloze Tasks

Add distractor words (foils) to the cloze examples. Good foils should be plausible but not correct.

In [None]:
def add_foils_to_cloze(example: ClozeExample, llm: LLMClient = llm) -> ClozeExample:
    """Add distractor words to a cloze example."""
    
    prompt = f"""Given this fill-in-the-blank sentence, suggest 3 plausible but incorrect words.

Sentence: {example.prompt}
Correct answer: {example.target}

Provide 3 words that:
1. Could grammatically fit in the blank
2. Are semantically related but don't create the intended meaning
3. Would be reasonable guesses

Return a JSON array of 3 strings. Only return the JSON array."""
    
    response = llm.generate(prompt, max_tokens=100)
    
    try:
        json_match = re.search(r'\[.*\]', response, re.DOTALL)
        if json_match:
            foils = json.loads(json_match.group())
        else:
            foils = json.loads(response)
        example.foils = foils
    except json.JSONDecodeError:
        example.foils = []
    
    return example

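# Add foils to our cloze examples (mutated in place, so cloze_examples is updated).
# Re-run the Part 5 export cells afterward if you want the foils saved to disk.
for ex in cloze_examples[:3]:
    add_foils_to_cloze(ex)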
    print(f"Prompt: {ex.prompt}")
    print(f"Answer: {ex.target}")
    print(f"Foils: {ex.foils}\n")
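
Model-suggested foils sometimes repeat the target, duplicate each other, or span several words. A light cleaning pass, sketched below, can tidy them before they are stored. The rules (single word, unique, different from the target) and the name `clean_foils` are assumptions of this example.

```python
from typing import Any, List


def clean_foils(foils: List[Any], target: str, max_foils: int = 3) -> List[str]:
    """Keep unique single-word string foils that differ from the target."""
    seen = {target.lower()}
    cleaned = []
    for foil in foils:
        if not isinstance(foil, str):
            continue  # Ignore non-string entries such as numbers or nested lists
        word = foil.strip()
        if not word or " " in word or word.lower() in seen:
            continue
        seen.add(word.lower())
        cleaned.append(word)
    return cleaned[:max_foils]


raw_foils = ["Current", "flow", "flow ", "high tide", 7, "voltage", "charge"]
print(clean_foils(raw_foils, target="current"))
```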

## Summary

In this notebook, we learned how to use LLMs to create evaluation datasets:

1. **Generate structured examples** with controlled format (setup, punchline, meanings)
2. **Create matched pairs** for comparing pun vs literal usage
3. **Build cloze tasks** for probing experiments
4. **Use LLM-as-judge** to rate quality on multiple dimensions
5. **Export datasets** in formats useful for interpretability research

### For Your Project

Adapt this approach to your concept:
- Generate examples that use your concept in controlled ways
- Create matched pairs (concept present vs absent)
- Build cloze tasks for probing
- Rate quality and filter to high-quality examples

### Tips for Dataset Quality

- Generate more examples than you need, then filter
- Use LLM-as-judge to identify the best examples
- Manually review a sample to catch systematic errors
- Consider having multiple LLMs generate/rate for diversity
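
When several LLMs rate the same items, a simple way to use the extra signal is to average their scores and flag items where they disagree. The sketch below assumes each judge has scored the same items in the same order; the judge names, scores, and the function `summarize_judges` are all illustrative.

```python
from statistics import mean
from typing import Dict, List


def summarize_judges(scores_by_judge: Dict[str, List[int]]) -> List[Dict[str, float]]:
    """For each item, report the mean score and the max-min spread across judges."""
    lengths = {len(scores) for scores in scores_by_judge.values()}
    if len(lengths) != 1:
        raise ValueError("every judge must score the same number of items")
    summary = []
    # zip(*...) regroups the per-judge lists into per-item tuples
    for item_scores in zip(*scores_by_judge.values()):
        summary.append({
            "mean": mean(item_scores),
            "spread": max(item_scores) - min(item_scores),
        })
    return summary


# Invented humor scores for three puns from two hypothetical judges
toy_scores = {"judge_a": [4, 2, 5], "judge_b": [5, 2, 1]}
judge_summary = summarize_judges(toy_scores)
for row in judge_summary:
    print(row)
```

Items with a large spread are good candidates for the manual review suggested above.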