In [1]:
import os
import json
import torch

# Change to correct working directory
os.chdir('/home/smallyan/critic_model_mechinterp')
print(f"Current working directory: {os.getcwd()}")

# Check CUDA availability
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device: {torch.cuda.get_device_name(0)}")

Current working directory: /home/smallyan/critic_model_mechinterp
CUDA available: True
CUDA device: NVIDIA A100 80GB PCIe


In [2]:
# Read the documentation
doc_path = '/home/smallyan/critic_model_mechinterp/runs/circuits_claude_2025-11-10_20-48-00/logs/documentation.md'
with open(doc_path, 'r') as f:
    documentation = f.read()

print(f"Documentation length: {len(documentation)} characters")
print("\n=== First 500 characters of documentation ===")
print(documentation[:500])

Documentation length: 10312 characters

=== First 500 characters of documentation ===
# Sarcasm Circuit Documentation - GPT2-Small

## 1. Goal

Identify a precise circuit in GPT2-small that enables sarcasm recognition by detecting contradictions between literal sentiment and contextual tone. The circuit must:
- Reproduce sarcasm detection behavior with high fidelity
- Remain within 11,200 dimension write budget
- Contain interpretable, minimal components

## 2. Data

### Dataset Description
- **Source**: Synthetic sarcasm dataset (real mib-bench/sarcasm unavailable)
- **Size**: 4


In [3]:
# Read the exam questions
question_path = '/home/smallyan/critic_model_mechinterp/runs/circuits_claude_2025-11-10_20-48-00/question/question_sarcasm_circuit_student_version.json'
with open(question_path, 'r') as f:
    exam_questions = json.load(f)

print("Exam questions loaded successfully")
print(f"Number of questions: {len(exam_questions)}")
print("\n=== Questions overview ===")
for i, q in enumerate(exam_questions, 1):
    print(f"\nQuestion {i}:")
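    # Note: the records store this field under 'question_type' (see the full
    # dump in cell In [5]), so the 'type' lookup falls back to 'unknown'.
    print(f"  Type: {q.get('type', 'unknown')}")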
    print(f"  Question: {q.get('question', '')[:100]}...")

Exam questions loaded successfully
Number of questions: 15

=== Questions overview ===

Question 1:
  Type: unknown
  Question: What is the dimension of a single attention head (d_head) in GPT2-small as used in this sarcasm circ...

Question 2:
  Type: unknown
  Question: How many total components are included in the identified sarcasm detection circuit?...

Question 3:
  Type: unknown
  Question: Which MLP layer is identified as the primary sarcasm detector with the highest differential activati...

Question 4:
  Type: unknown
  Question: Which two MLP layers were excluded from the sarcasm circuit? Explain why they were excluded based on...

Question 5:
  Type: unknown
  Question: The initial hypothesis suggested that sarcasm detection follows a three-stage process: sentiment enc...

Question 6:
  Type: unknown
  Question: How does the sarcasm circuit differ from the Indirect Object Identification (IOI) circuit in terms o...

Question 7:
  Type: unknown
  Question: Identify the two mo

# Sarcasm Circuit Analysis Exam - Student Answers

This notebook contains answers to exam questions based solely on the documentation provided at:
`/home/smallyan/critic_model_mechinterp/runs/circuits_claude_2025-11-10_20-48-00/logs/documentation.md`

All answers are derived strictly from the documentation without using external knowledge.

In [4]:
# Let me read the full documentation to answer the questions
print("=== FULL DOCUMENTATION ===\n")
print(documentation)

=== FULL DOCUMENTATION ===

# Sarcasm Circuit Documentation - GPT2-Small

## 1. Goal

Identify a precise circuit in GPT2-small that enables sarcasm recognition by detecting contradictions between literal sentiment and contextual tone. The circuit must:
- Reproduce sarcasm detection behavior with high fidelity
- Remain within 11,200 dimension write budget
- Contain interpretable, minimal components

## 2. Data

### Dataset Description
- **Source**: Synthetic sarcasm dataset (real mib-bench/sarcasm unavailable)
- **Size**: 40 total examples (20 sarcastic, 20 literal)
- **Structure**: Paired examples with similar surface structure but opposite intent

### Example Sarcastic Sentences
1. "Oh great, another meeting at 7 AM."
2. "Wow, I just love getting stuck in traffic."
3. "Fantastic, my laptop crashed right before the deadline."
4. "Perfect, exactly what I needed today."
5. "Oh wonderful, it's raining on my day off."

### Example Literal Sentences
1. "I'm excited about the meeting at 7 AM

In [5]:
# Let me also view the full questions
print("=== EXAM QUESTIONS ===\n")
for i, q in enumerate(exam_questions, 1):
    print(f"\n{'='*80}")
    print(f"Question {i}")
    print(f"{'='*80}")
    print(json.dumps(q, indent=2))

=== EXAM QUESTIONS ===


Question 1
{
  "question_type": "multiple_choice",
  "question": "What is the dimension of a single attention head (d_head) in GPT2-small as used in this sarcasm circuit analysis?",
  "requires_code": false,
  "code_id": null,
  "choice": [
    "A. 32 dimensions",
    "B. 128 dimensions",
    "C. 768 dimensions",
    "D. 64 dimensions"
  ]
}

Question 2
{
  "question_type": "multiple_choice",
  "question": "How many total components are included in the identified sarcasm detection circuit?",
  "requires_code": false,
  "code_id": null,
  "choice": [
    "A. 44 components (1 input + 12 MLPs + 31 attention heads)",
    "B. 54 components (1 input + 10 MLPs + 43 attention heads)",
    "C. 64 components (1 input + 12 MLPs + 51 attention heads)",
    "D. 24 components (1 input + 10 MLPs + 13 attention heads)"
  ]
}

Question 3
{
  "question_type": "multiple_choice",
  "question": "Which MLP layer is identified as the primary sarcasm detector with the highest differen

---
## Question 1

**Question:** What is the dimension of a single attention head (d_head) in GPT2-small as used in this sarcasm circuit analysis?

**Choices:**
- A. 32 dimensions
- B. 128 dimensions
- C. 768 dimensions
- D. 64 dimensions

In [6]:
# Question 1: Find d_head from documentation
# From documentation: "d_head = 64"

print("Question 1: d_head dimension")
print("\nFrom documentation section 'Technical Details':")
print("- d_model = 768")
print("- d_head = 64")
print("\nAnswer: D. 64 dimensions")

Question 1: d_head dimension

From documentation section 'Technical Details':
- d_model = 768
- d_head = 64

Answer: D. 64 dimensions


**Reasoning:** The documentation explicitly states in the "Technical Details" section that GPT2-small has d_head = 64 dimensions. It also mentions in the "Write Budget Calculation" that "Each attention head: 64 dimensions".

**Answer:** D
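
The quoted value can also be cross-checked arithmetically. This is a minimal sketch, assuming the usual multi-head convention that the residual width is split evenly across the heads of a layer. The documentation quotes d_model = 768 and 12 heads per layer; the even split is an assumption made here, not a statement from the documentation.

```python
# Cross-check d_head from quantities quoted in the documentation.
d_model = 768   # residual stream width ("d_model = 768")
n_heads = 12    # attention heads per layer

# Assumption: d_model is divided evenly among the heads of a layer.
d_head, remainder = divmod(d_model, n_heads)
print(f"d_head = {d_head}, remainder = {remainder}")
```

A zero remainder with a quotient of 64 agrees with choice D.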

---
## Question 2

**Question:** How many total components are included in the identified sarcasm detection circuit?

**Choices:**
- A. 44 components (1 input + 12 MLPs + 31 attention heads)
- B. 54 components (1 input + 10 MLPs + 43 attention heads)
- C. 64 components (1 input + 12 MLPs + 51 attention heads)
- D. 24 components (1 input + 10 MLPs + 13 attention heads)

In [7]:
# Question 2: Total components
print("Question 2: Total components in circuit")
print("\nFrom documentation 'Circuit Composition' section:")
print("Total Components: 54 (maximizing budget utilization)")
print("- Input: 1 (768 dims)")
print("- MLPs: 10 (7,680 dims)")
print("- Attention heads: 43 (2,752 dims)")
print("\nAnswer: B. 54 components (1 input + 10 MLPs + 43 attention heads)")

Question 2: Total components in circuit

From documentation 'Circuit Composition' section:
Total Components: 54 (maximizing budget utilization)
- Input: 1 (768 dims)
- MLPs: 10 (7,680 dims)
- Attention heads: 43 (2,752 dims)

Answer: B. 54 components (1 input + 10 MLPs + 43 attention heads)


**Reasoning:** The documentation clearly states in the "Circuit Composition" section that the total components are 54, broken down as: Input (1) + MLPs (10) + Attention heads (43). This equals 1 + 10 + 43 = 54 components.

**Answer:** B

---
## Question 3

**Question:** Which MLP layer is identified as the primary sarcasm detector with the highest differential activation?

**Choices:**
- A. m0 (Layer 0 MLP) with 7.33 average differential activation
- B. m2 (Layer 2 MLP) with 32.47 average differential activation
- C. m11 (Layer 11 MLP) with 22.30 average differential activation
- D. m5 (Layer 5 MLP) with 7.79 average differential activation

In [8]:
# Question 3: Primary sarcasm detector
print("Question 3: Primary sarcasm detector")
print("\nFrom documentation 'MLP Components (Ranked by Importance)':")
print("m2 | 32.47 | 2 | **Primary sarcasm detector**")
print("\nKey Finding: m2 shows **dramatically dominant** differential activation (32.47),")
print("~45% stronger than the next strongest MLP.")
print("\nAnswer: B. m2 (Layer 2 MLP) with 32.47 average differential activation")

Question 3: Primary sarcasm detector

From documentation 'MLP Components (Ranked by Importance)':
m2 | 32.47 | 2 | **Primary sarcasm detector**

Key Finding: m2 shows **dramatically dominant** differential activation (32.47),
~45% stronger than the next strongest MLP.

Answer: B. m2 (Layer 2 MLP) with 32.47 average differential activation


**Reasoning:** The documentation explicitly identifies m2 as the "Primary sarcasm detector" in the MLP Components table, with the highest differential activation of 32.47. The documentation notes that this is "dramatically dominant" and "~45% stronger than the next strongest MLP" (m11 with 22.30).

**Answer:** B
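
The "~45% stronger" figure can be checked from the two values quoted above. A small sketch, assuming the percentage is the relative margin of m2 over the runner-up m11 (the documentation does not state which formula it used):

```python
# Differential activations quoted from the documentation's MLP table.
m2_diff = 32.47    # m2, primary sarcasm detector
m11_diff = 22.30   # m11, next strongest MLP

# Assumed definition: relative margin of m2 over the runner-up.
relative_margin = (m2_diff - m11_diff) / m11_diff
print(f"m2 exceeds m11 by {relative_margin:.1%}")
```

Under that definition the margin rounds to about 45%, consistent with the documentation's wording.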

---
## Question 4

**Question:** Which two MLP layers were excluded from the sarcasm circuit? Explain why they were excluded based on the documentation.

In [9]:
# Question 4: Excluded MLPs
print("Question 4: Excluded MLP layers")
print("\nFrom documentation 'Excluded Components' section:")
print("\n**MLPs excluded**: m3, m4")
print("- Showed minimal differential activation (<6.5)")
print("- Suggests these layers less involved in sarcasm processing")
print("\nTotal MLPs in GPT2-small: 12 (m0 through m11)")
print("MLPs included: 10")
print("MLPs excluded: 2 (m3 and m4)")

Question 4: Excluded MLP layers

From documentation 'Excluded Components' section:

**MLPs excluded**: m3, m4
- Showed minimal differential activation (<6.5)
- Suggests these layers less involved in sarcasm processing

Total MLPs in GPT2-small: 12 (m0 through m11)
MLPs included: 10
MLPs excluded: 2 (m3 and m4)


**Reasoning:** According to the documentation's "Excluded Components" section, m3 and m4 are the two MLP layers that were excluded from the sarcasm circuit. The reason given is that they "showed minimal differential activation (<6.5)", which suggests these layers are less involved in sarcasm processing compared to the other MLPs that were included in the circuit.

**Answer:** The two excluded MLP layers are m3 and m4. They were excluded because they showed minimal differential activation (less than 6.5), suggesting these layers are less involved in sarcasm processing compared to the 10 MLPs that were included in the circuit.
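
The included set follows from the excluded one by elimination. A short sketch, using the `m<layer>` naming from the documentation; the variable names are illustrative only:

```python
# GPT2-small has 12 MLP layers, named m0 through m11 in the documentation.
all_mlps = [f"m{layer}" for layer in range(12)]

# Excluded per the "Excluded Components" section.
excluded = {"m3", "m4"}

# Everything else is part of the circuit.
included = [name for name in all_mlps if name not in excluded]
print(included)
print(f"{len(included)} MLPs included, {len(excluded)} excluded")
```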

---
## Question 5

**Question:** The initial hypothesis suggested that sarcasm detection follows a three-stage process: sentiment encoding → incongruity detection → meaning reversal. How did the empirical findings revise this understanding? Explain the key differences between the original hypothesis and the observed mechanism.

In [10]:
# Question 5: Hypothesis evolution
print("Question 5: Revision of initial hypothesis")
print("\n=== Phase 1: Initial Hypothesis ===")
print("1. Early layers encode sentiment")
print("2. Middle layers detect incongruity")
print("3. Late layers reverse meaning")
print("\n=== Phase 2: Revised Understanding (Empirical Evidence) ===")
print("1. **Layer 2 MLP (m2) is primary detector** - earlier than expected")
print("2. Middle layers **propagate** rather than detect sarcasm signal")
print("3. Late layers **integrate** rather than reverse sentiment")
print("\n=== Key Differences ===")
print("- Timing: Detection happens at Layer 2 (early), not middle layers")
print("- Middle layer function: Propagation, not detection")
print("- Late layer function: Integration, not reversal")

Question 5: Revision of initial hypothesis

=== Phase 1: Initial Hypothesis ===
1. Early layers encode sentiment
2. Middle layers detect incongruity
3. Late layers reverse meaning

=== Phase 2: Revised Understanding (Empirical Evidence) ===
1. **Layer 2 MLP (m2) is primary detector** - earlier than expected
2. Middle layers **propagate** rather than detect sarcasm signal
3. Late layers **integrate** rather than reverse sentiment

=== Key Differences ===
- Timing: Detection happens at Layer 2 (early), not middle layers
- Middle layer function: Propagation, not detection
- Late layer function: Integration, not reversal


**Reasoning:** According to the documentation's "Hypothesis Evolution" section, the initial hypothesis proposed a three-stage process where early layers encode sentiment, middle layers detect incongruity, and late layers reverse meaning. However, empirical evidence revealed three key revisions:

1. **Timing of detection**: The primary incongruity detection happens at Layer 2 MLP (m2), which is much earlier than the hypothesized middle layers
2. **Middle layer function**: Instead of detecting incongruity, middle layers (L3-L7) primarily propagate and refine the sarcasm signal
3. **Late layer function**: Rather than reversing sentiment, late layers (L8-L11) integrate the sarcasm information into the final output

**Answer:** The empirical findings revised the initial hypothesis in three major ways: (1) Primary sarcasm/incongruity detection occurs at Layer 2 (earlier than expected), not in the middle layers; (2) Middle layers propagate rather than detect the sarcasm signal; (3) Late layers integrate rather than reverse sentiment. In the revised model, the sarcasm signal is established at Layer 2 and then propagated and refined through the remaining layers, rather than being built up gradually toward a late detection and reversal.

---
## Question 6

**Question:** How does the sarcasm circuit differ from the Indirect Object Identification (IOI) circuit in terms of the dominant component type?

**Choices:**
- A. Both circuits are equally balanced between MLPs and attention
- B. IOI circuit is MLP-dominant while sarcasm circuit is attention-dominant
- C. Both circuits are primarily attention-dominant
- D. Sarcasm circuit is MLP-dominant while IOI circuit is attention-dominant

In [11]:
# Question 6: Comparison to IOI circuit
print("Question 6: Sarcasm vs IOI circuit comparison")
print("\nFrom documentation 'Comparison to IOI Circuit' table:")
print("\n| Aspect | IOI Circuit | Sarcasm Circuit |")
print("|--------|-------------|-----------------|")
print("| Attention importance | Dominant | Supporting |")
print("| MLP importance | Supporting | Dominant |")
print("\nConclusion: Sarcasm circuit is MLP-dominant, IOI circuit is attention-dominant")
print("\nAnswer: D. Sarcasm circuit is MLP-dominant while IOI circuit is attention-dominant")

Question 6: Sarcasm vs IOI circuit comparison

From documentation 'Comparison to IOI Circuit' table:

| Aspect | IOI Circuit | Sarcasm Circuit |
|--------|-------------|-----------------|
| Attention importance | Dominant | Supporting |
| MLP importance | Supporting | Dominant |

Conclusion: Sarcasm circuit is MLP-dominant, IOI circuit is attention-dominant

Answer: D. Sarcasm circuit is MLP-dominant while IOI circuit is attention-dominant


**Reasoning:** According to the documentation's "Comparison to IOI Circuit" table, the sarcasm circuit has MLP importance as "Dominant" and attention importance as "Supporting", while the IOI circuit has the opposite pattern with attention importance as "Dominant" and MLP importance as "Supporting". This indicates that the sarcasm circuit is MLP-dominant while the IOI circuit is attention-dominant.

**Answer:** D

---
## Question 7

**Question:** Identify the two most important attention heads in the sarcasm circuit based on differential activation. What is their interpreted function according to the documentation?

In [12]:
# Question 7: Top attention heads
print("Question 7: Two most important attention heads")
print("\nFrom documentation 'Top 10 Most Important Heads' table:")
print("\n1. a11.h8 - Avg Diff: 3.33 - Interpretation: Output integration head")
print("2. a11.h0 - Avg Diff: 2.74 - Interpretation: Output integration head")
print("\nBoth are from Layer 11 and perform output integration functions.")

Question 7: Two most important attention heads

From documentation 'Top 10 Most Important Heads' table:

1. a11.h8 - Avg Diff: 3.33 - Interpretation: Output integration head
2. a11.h0 - Avg Diff: 2.74 - Interpretation: Output integration head

Both are from Layer 11 and perform output integration functions.


**Reasoning:** According to the documentation's "Top 10 Most Important Heads" table, the two attention heads with the highest differential activation are a11.h8 (3.33 average differential activation) and a11.h0 (2.74 average differential activation). Both heads are from Layer 11, and the documentation interprets their function as "Output integration head".

**Answer:** The two most important attention heads are a11.h8 (differential activation: 3.33) and a11.h0 (differential activation: 2.74). Their interpreted function is "output integration" - they integrate the sarcasm information into the final output at Layer 11.

---
## Question 8

**Question:** Based on the documentation, explain the key linguistic features that characterize sarcastic sentences in the dataset. How does the combination of these features create the contradiction that the circuit must detect?

In [13]:
# Question 8: Linguistic features of sarcasm
print("Question 8: Key linguistic features of sarcasm")
print("\nFrom documentation 'Key Linguistic Features of Sarcasm' section:")
print("\n1. **Discourse markers**: 'Oh', 'Wow', 'Just' (emphasis particles)")
print("2. **Positive sentiment words**: 'great', 'love', 'fantastic', 'wonderful', 'perfect'")
print("3. **Negative situational context**: 'another meeting', 'stuck in traffic', 'crashed'")
print("4. **Contradiction**: Positive words describe objectively negative situations")
print("\nExample: 'Oh great, another meeting at 7 AM.'")
print("- Discourse marker: 'Oh'")
print("- Positive word: 'great'")
print("- Negative context: 'another meeting at 7 AM' (undesirable)")
print("- Contradiction: Positive 'great' describes negative situation")

Question 8: Key linguistic features of sarcasm

From documentation 'Key Linguistic Features of Sarcasm' section:

1. **Discourse markers**: 'Oh', 'Wow', 'Just' (emphasis particles)
2. **Positive sentiment words**: 'great', 'love', 'fantastic', 'wonderful', 'perfect'
3. **Negative situational context**: 'another meeting', 'stuck in traffic', 'crashed'
4. **Contradiction**: Positive words describe objectively negative situations

Example: 'Oh great, another meeting at 7 AM.'
- Discourse marker: 'Oh'
- Positive word: 'great'
- Negative context: 'another meeting at 7 AM' (undesirable)
- Contradiction: Positive 'great' describes negative situation


**Reasoning:** The documentation lists four key linguistic features that characterize sarcastic sentences in the "Key Linguistic Features of Sarcasm" section:

1. Discourse markers (emphasis particles like "Oh", "Wow", "Just")
2. Positive sentiment words ("great", "love", "fantastic", "wonderful", "perfect")
3. Negative situational context ("another meeting", "stuck in traffic", "crashed")
4. Contradiction between positive words and negative situations

The contradiction is created when positive sentiment words are used to describe objectively negative situations. For example, in "Oh great, another meeting at 7 AM," the positive word "great" is applied to an undesirable situation (early morning meeting), creating an incongruity that signals sarcasm.

**Answer:** The key linguistic features are: (1) discourse markers like "Oh", "Wow", "Just"; (2) positive sentiment words like "great", "love", "fantastic"; (3) negative situational context like "stuck in traffic", "crashed"; and (4) contradiction. The combination creates the contradiction by using positive sentiment words to describe objectively negative situations - for example, saying something is "great" when describing being stuck in traffic. This incongruity between positive words and negative context is what the circuit must detect.
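
To make the combination of features concrete, the sketch below tags a sentence with the surface cues listed in the documentation. It is a keyword heuristic for illustration only: the function name `surface_features` and the second test sentence are hypothetical, and nothing here is meant to describe how the circuit itself computes.

```python
import re

# Word lists copied from the documentation's feature list.
DISCOURSE_MARKERS = {"oh", "wow", "just"}
POSITIVE_WORDS = {"great", "love", "fantastic", "wonderful", "perfect"}
NEGATIVE_CONTEXTS = ("another meeting", "stuck in traffic", "crashed")

def surface_features(sentence):
    """Report which documented surface cues appear in a sentence."""
    text = sentence.lower()
    tokens = set(re.findall(r"[a-z']+", text))
    features = {
        "discourse_marker": bool(tokens & DISCOURSE_MARKERS),
        "positive_word": bool(tokens & POSITIVE_WORDS),
        "negative_context": any(phrase in text for phrase in NEGATIVE_CONTEXTS),
    }
    # The contradiction is positive wording applied to a negative situation.
    features["contradiction"] = (
        features["positive_word"] and features["negative_context"]
    )
    return features

print(surface_features("Oh great, another meeting at 7 AM."))
print(surface_features("The meeting went well."))  # hypothetical literal sentence
```

The keyword lists are far from exhaustive. A sentence such as "Perfect, exactly what I needed today." contains a positive word but no listed negative phrase, so this heuristic would not flag a contradiction for it.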

---
## Question 9

**Question:** If you wanted to include all 12 MLPs and all 144 attention heads (12 layers × 12 heads) in a circuit for GPT2-small, calculate the total write cost. Would this exceed the 11,200 dimension budget? Show your calculation.

In [14]:
# Question 9: Calculate full circuit cost
print("Question 9: Total write cost for all components")
print("\nGiven information from documentation:")
print("- GPT2-small has 12 layers")
print("- Each layer has 12 attention heads")
print("- Each MLP: 768 dimensions")
print("- Each attention head: 64 dimensions")
print("- Input embedding: 768 dimensions")
print("- Budget: 11,200 dimensions")

print("\nCalculation:")
input_cost = 768
mlp_cost = 12 * 768
attn_cost = 144 * 64  # 12 layers × 12 heads × 64 dims
total_cost = input_cost + mlp_cost + attn_cost

print(f"Input embedding: {input_cost} dims")
print(f"All 12 MLPs: 12 × 768 = {mlp_cost} dims")
print(f"All 144 attention heads: 12 × 12 × 64 = {attn_cost} dims")
print(f"\nTotal write cost: {input_cost} + {mlp_cost} + {attn_cost} = {total_cost} dims")
print("Budget: 11,200 dims")
print(f"\nExceeds budget? {total_cost > 11200}")
print(f"Difference: {total_cost - 11200} dims over budget")

Question 9: Total write cost for all components

Given information from documentation:
- GPT2-small has 12 layers
- Each layer has 12 attention heads
- Each MLP: 768 dimensions
- Each attention head: 64 dimensions
- Input embedding: 768 dimensions
- Budget: 11,200 dimensions

Calculation:
Input embedding: 768 dims
All 12 MLPs: 12 × 768 = 9216 dims
All 144 attention heads: 12 × 12 × 64 = 9216 dims

Total write cost: 768 + 9216 + 9216 = 19200 dims
Budget: 11,200 dims

Exceeds budget? True
Difference: 8000 dims over budget


**Reasoning:** Based on the documentation's technical details, GPT2-small has 12 layers with 12 attention heads each (144 total heads). Each MLP costs 768 dimensions, each attention head costs 64 dimensions, and the input embedding costs 768 dimensions.

Calculation:
- Input embedding: 768 dims
- All 12 MLPs: 12 × 768 = 9,216 dims
- All 144 attention heads: 12 × 12 × 64 = 9,216 dims
- Total: 768 + 9,216 + 9,216 = 19,200 dims

The budget is 11,200 dimensions.

**Answer:** Yes, including all 12 MLPs and all 144 attention heads would result in a total write cost of 19,200 dimensions (768 for input + 9,216 for MLPs + 9,216 for attention heads), which exceeds the 11,200 dimension budget by 8,000 dimensions.

---
## Question 10

**Question:** According to the revised mechanistic model, what is the primary function of the middle layers (L3-L7) in the sarcasm circuit?

**Choices:**
- A. Primary incongruity detection - identifying contradictions between sentiment and context
- B. Sentiment encoding - detecting and encoding literal sentiment words
- C. Distributed propagation - refining and routing the sarcasm signal across sequence positions
- D. Meaning reversal - flipping sentiment polarity when sarcasm is detected

In [15]:
# Question 10: Middle layer function
print("Question 10: Function of middle layers (L3-L7)")
print("\nFrom documentation 'Mechanistic Interpretation' section:")
print("\n**Stage 2: Distributed Propagation (L3-L7)**")
print("- Mid-layer MLPs refine the sarcasm signal")
print("- 19 attention heads route information across sequence positions")
print("- Enables context-aware processing throughout the sentence")
print("\nFrom 'Hypothesis Evolution - Phase 2: Revised Understanding':")
print("2. Middle layers **propagate** rather than detect sarcasm signal")
print("\nAnswer: C. Distributed propagation - refining and routing the sarcasm signal")

Question 10: Function of middle layers (L3-L7)

From documentation 'Mechanistic Interpretation' section:

**Stage 2: Distributed Propagation (L3-L7)**
- Mid-layer MLPs refine the sarcasm signal
- 19 attention heads route information across sequence positions
- Enables context-aware processing throughout the sentence

From 'Hypothesis Evolution - Phase 2: Revised Understanding':
2. Middle layers **propagate** rather than detect sarcasm signal

Answer: C. Distributed propagation - refining and routing the sarcasm signal


**Reasoning:** According to the documentation's "Mechanistic Interpretation" section, Stage 2 (L3-L7) is described as "Distributed Propagation" where mid-layer MLPs refine the sarcasm signal and attention heads route information across sequence positions. The "Hypothesis Evolution" section explicitly states that empirical evidence showed middle layers "propagate rather than detect sarcasm signal."

**Answer:** C

---
## Question 11

**Question:** What normalization technique was used to handle variable-length inputs when computing differential activations?

**Choices:**
- A. Max pooling over sequence positions
- B. Averaged activations over sequence positions (mean over sequence dimension)
- C. Used only the last token's activation
- D. Padded all sequences to the same length

In [16]:
# Question 11: Normalization technique
print("Question 11: Normalization technique for variable-length inputs")
print("\nFrom documentation 'Technical Details' section:")
print("\n**Normalization**: Averaged activations over sequence positions to handle variable-length inputs")
print("\nAnswer: B. Averaged activations over sequence positions (mean over sequence dimension)")

Question 11: Normalization technique for variable-length inputs

From documentation 'Technical Details' section:

**Normalization**: Averaged activations over sequence positions to handle variable-length inputs

Answer: B. Averaged activations over sequence positions (mean over sequence dimension)


**Reasoning:** The documentation's "Technical Details" section explicitly states: "Normalization: Averaged activations over sequence positions to handle variable-length inputs." This indicates that the mean over the sequence dimension was used.

**Answer:** B
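
The averaging step can be sketched without any dependencies. The code below mean-pools a `[seq_len][d]` activation matrix over the sequence axis, so inputs of different lengths become comparable vectors. The `differential` function assumes the metric is the L2 distance between pooled vectors; the documentation excerpt does not spell out the exact metric, so treat that part as an assumption. All numbers are toy values.

```python
import math

def mean_pool(activations):
    """Average a [seq_len][d] activation matrix over the sequence axis."""
    seq_len = len(activations)
    d = len(activations[0])
    return [sum(row[j] for row in activations) / seq_len for j in range(d)]

def differential(sarcastic, literal):
    """L2 distance between mean-pooled activations (assumed metric)."""
    a, b = mean_pool(sarcastic), mean_pool(literal)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy inputs with different lengths (3 tokens vs 2 tokens, d = 2).
sarcastic = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]
literal = [[0.0, 2.0], [0.0, 2.0]]

print(mean_pool(sarcastic))
print(mean_pool(literal))
print(differential(sarcastic, literal))
```

Because both inputs are reduced to a single vector of width `d`, no padding or truncation is needed before comparing them.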

---
## Question 12

**Question:** The documentation lists several limitations of the study. Why is the distinction between 'differential activation' and 'causal importance' considered a significant limitation? Explain what additional experiments would be needed to establish causal importance.

In [17]:
# Question 12: Differential activation vs causal importance
print("Question 12: Limitation of differential activation analysis")
print("\nFrom documentation 'Limitations' section:")
print("3. **No causal validation**: Differential activation ≠ causal importance")
print("\nFrom 'Next Steps - Validation Experiments' section:")
print("1. **Ablation testing**: Systematically remove components, measure impact")
print("2. **Intervention experiments**: Patch activations to test causality")
print("\nInterpretation:")
print("- Differential activation shows CORRELATION (components that activate differently)")
print("- Causal importance shows CAUSATION (components that are necessary for the behavior)")
print("- A component could activate differently but not be causally important")
print("- Need ablation/intervention to establish causality")

Question 12: Limitation of differential activation analysis

From documentation 'Limitations' section:
3. **No causal validation**: Differential activation ≠ causal importance

From 'Next Steps - Validation Experiments' section:
1. **Ablation testing**: Systematically remove components, measure impact
2. **Intervention experiments**: Patch activations to test causality

Interpretation:
- Differential activation shows CORRELATION (components that activate differently)
- Causal importance shows CAUSATION (components that are necessary for the behavior)
- A component could activate differently but not be causally important
- Need ablation/intervention to establish causality


**Reasoning:** The documentation lists "No causal validation: Differential activation ≠ causal importance" as a limitation. This is significant because differential activation only measures correlation: it shows which components activate differently on sarcastic versus literal examples, but does not prove those components are necessary for sarcasm detection. A component could show high differential activation and still have no effect on the model's output.

The documentation suggests two types of experiments in the "Validation Experiments" section to establish causal importance:
1. Ablation testing - systematically removing components and measuring the impact on sarcasm detection performance
2. Intervention experiments - patching activations to directly test causality

**Answer:** The distinction is significant because differential activation only shows correlation (which components activate differently), not causation (which components are actually necessary for sarcasm detection). A component could activate differently without being causally important for the behavior. To establish causal importance, the documentation proposes: (1) ablation testing - systematically removing components and measuring impact on sarcasm detection, and (2) intervention experiments - patching activations to directly test causality.
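
The gap between the two notions can be shown with a toy model. In the sketch below the output is a weighted sum of two component activations; every name and number is hypothetical and unrelated to GPT2-small. Component A has a large differential activation but zero readout weight, so zero-ablating it changes nothing, while component B has a small differential activation and a real causal effect. Zero-ablation is used for simplicity; mean-ablation or activation patching would follow the same pattern.

```python
# Toy model: the output is a weighted sum of component activations.
# All names and numbers are hypothetical, not measurements from GPT2-small.
READOUT = {"A": 0.0, "B": 2.0}  # component A is never read by the output

def model_output(acts, ablate=None):
    """Compute the output, optionally zero-ablating one component."""
    return sum(
        READOUT[name] * (0.0 if name == ablate else value)
        for name, value in acts.items()
    )

sarcastic = {"A": 10.0, "B": 1.0}
literal = {"A": 0.0, "B": 0.0}

for name in READOUT:
    diff_act = abs(sarcastic[name] - literal[name])
    effect = model_output(sarcastic) - model_output(sarcastic, ablate=name)
    print(f"{name}: differential={diff_act}, ablation effect={effect}")
```

Ranking by differential activation would put A first, while ranking by ablation effect would put B first, which is why the documentation treats the missing causal validation as a limitation.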

---
## Question 13 (Code Required)

**Question:** Write code to verify the write budget calculation for the sarcasm circuit. Given the circuit composition (1 input embedding, 10 MLPs, 43 attention heads), compute the total write cost and verify it matches the documented 11,200 dimension budget.

Your code should:
1. Define the dimension sizes (d_model=768 for input/MLPs, d_head=64 for attention heads)
2. Calculate the individual costs for each component type
3. Calculate the total write cost
4. Print the breakdown and verify it equals 11,200

Expected output should show the individual costs and confirm the total equals 11,200.

In [18]:
# Question 13: Verify write budget calculation

# Define dimension sizes from documentation
d_model = 768  # for input embedding and MLPs
d_head = 64    # for attention heads

# Circuit composition from documentation
num_input = 1
num_mlps = 10
num_attn_heads = 43

# Calculate individual costs
input_cost = num_input * d_model
mlp_cost = num_mlps * d_model
attn_cost = num_attn_heads * d_head

# Calculate total
total_write_cost = input_cost + mlp_cost + attn_cost

# Expected budget
expected_budget = 11200

# Print breakdown
print("=" * 60)
print("WRITE BUDGET VERIFICATION")
print("=" * 60)
print("\nComponent Dimensions:")
print(f"  d_model (input/MLP): {d_model} dimensions")
print(f"  d_head (attention):  {d_head} dimensions")

print("\nCircuit Composition:")
print(f"  Input embeddings:    {num_input}")
print(f"  MLP layers:          {num_mlps}")
print(f"  Attention heads:     {num_attn_heads}")

print("\nCost Breakdown:")
print(f"  Input:      {num_input} × {d_model} = {input_cost:,} dims")
print(f"  MLPs:       {num_mlps} × {d_model} = {mlp_cost:,} dims")
print(f"  Attn heads: {num_attn_heads} × {d_head} = {attn_cost:,} dims")
print("  " + "-" * 45)
print(f"  TOTAL:                      {total_write_cost:,} dims")

print("\nVerification:")
print(f"  Expected budget:  {expected_budget:,} dims")
print(f"  Calculated total: {total_write_cost:,} dims")
is_match = total_write_cost == expected_budget
print(f"  Match: {is_match} {'✓' if is_match else '✗'}")
print(f"  Budget utilization: {(total_write_cost / expected_budget * 100):.1f}%")
print("=" * 60)

WRITE BUDGET VERIFICATION

Component Dimensions:
  d_model (input/MLP): 768 dimensions
  d_head (attention):  64 dimensions

Circuit Composition:
  Input embeddings:    1
  MLP layers:          10
  Attention heads:     43

Cost Breakdown:
  Input:      1 × 768 = 768 dims
  MLPs:       10 × 768 = 7,680 dims
  Attn heads: 43 × 64 = 2,752 dims
  ---------------------------------------------
  TOTAL:                      11,200 dims

Verification:
  Expected budget:  11,200 dims
  Calculated total: 11,200 dims
  Match: True ✓
  Budget utilization: 100.0%


**Reasoning:** Based on the documentation, the circuit has 1 input embedding (768 dims), 10 MLPs (768 dims each), and 43 attention heads (64 dims each). The calculation shows: 768 + (10 × 768) + (43 × 64) = 768 + 7,680 + 2,752 = 11,200 dimensions, which exactly matches the documented budget.

**Answer:** The verification confirms that the total write cost is exactly 11,200 dimensions (768 for input + 7,680 for MLPs + 2,752 for attention heads), matching the documented budget with 100% utilization.
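
The same arithmetic can be packaged as a small reusable check. This is a minimal sketch rather than code from the documentation: the helper name `write_cost` and its parameter names are hypothetical, and the defaults assume the GPT2-small sizes quoted above (768 dims per input or MLP write, 64 dims per attention head).

```python
def write_cost(n_inputs: int, n_mlps: int, n_heads: int,
               d_model: int = 768, d_head: int = 64) -> int:
    """Total write cost in dimensions (hypothetical helper, not from the documentation).

    Inputs and MLPs each write d_model dims; each attention head writes d_head dims.
    """
    return (n_inputs + n_mlps) * d_model + n_heads * d_head


# Circuit composition as documented: 1 input, 10 MLPs, 43 attention heads.
total = write_cost(n_inputs=1, n_mlps=10, n_heads=43)
assert total == 11_200, f"budget mismatch: {total}"
print(total)  # 11200
```

Using an `assert` instead of a printed flag makes the cell fail loudly if the component counts ever drift from the documented budget.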

---
## Question 14 (Code Required)

**Question:** Write code to analyze the distribution of the 43 attention heads in the sarcasm circuit across the 12 layers (0-11).

Given the list of attention heads in the circuit, your code should:
1. Parse the attention head names to extract layer numbers
2. Count the number of heads per layer
3. Group layers into three stages:
   - Early (L0-L3)
   - Middle (L4-L7)
   - Late (L8-L11)
4. Print the count per layer and the total per stage
5. Verify the stage totals match the documentation: Early=9, Middle=19, Late=15

Attention heads in circuit: ['a11.h8', 'a11.h0', 'a4.h11', 'a9.h3', 'a6.h11', 'a8.h5', 'a9.h10', 'a5.h3', 'a10.h5', 'a11.h3', 'a3.h9', 'a10.h9', 'a4.h9', 'a4.h7', 'a3.h11', 'a8.h7', 'a7.h8', 'a6.h0', 'a4.h0', 'a2.h8', 'a5.h4', 'a8.h10', 'a5.h7', 'a4.h1', 'a6.h8', 'a5.h2', 'a11.h11', 'a6.h7', 'a8.h4', 'a3.h2', 'a8.h8', 'a2.h5', 'a6.h4', 'a7.h9', 'a7.h3', 'a4.h3', 'a2.h2', 'a3.h6', 'a6.h5', 'a11.h4', 'a2.h3', 'a8.h2', 'a1.h0']

In [19]:
# Question 14: Analyze attention head distribution across layers

attention_heads = ['a11.h8', 'a11.h0', 'a4.h11', 'a9.h3', 'a6.h11', 'a8.h5', 'a9.h10', 
                   'a5.h3', 'a10.h5', 'a11.h3', 'a3.h9', 'a10.h9', 'a4.h9', 'a4.h7', 
                   'a3.h11', 'a8.h7', 'a7.h8', 'a6.h0', 'a4.h0', 'a2.h8', 'a5.h4', 
                   'a8.h10', 'a5.h7', 'a4.h1', 'a6.h8', 'a5.h2', 'a11.h11', 'a6.h7', 
                   'a8.h4', 'a3.h2', 'a8.h8', 'a2.h5', 'a6.h4', 'a7.h9', 'a7.h3', 
                   'a4.h3', 'a2.h2', 'a3.h6', 'a6.h5', 'a11.h4', 'a2.h3', 'a8.h2', 'a1.h0']

# Step 1: Parse attention head names and extract layer numbers
layer_counts = {i: 0 for i in range(12)}  # Initialize counts for layers 0-11

for head in attention_heads:
    # Format: a{layer}.h{head_num}
    layer_num = int(head.split('.')[0][1:])  # Extract layer number after 'a'
    layer_counts[layer_num] += 1

# Step 2: Print count per layer
print("=" * 60)
print("ATTENTION HEAD DISTRIBUTION BY LAYER")
print("=" * 60)
print("\nHeads per layer:")
for layer in range(12):
    count = layer_counts[layer]
    print(f"  Layer {layer:2d}: {count:2d} heads")

# Step 3: Group into stages
early_layers = range(0, 4)    # L0-L3
middle_layers = range(4, 8)   # L4-L7
late_layers = range(8, 12)    # L8-L11

early_count = sum(layer_counts[l] for l in early_layers)
middle_count = sum(layer_counts[l] for l in middle_layers)
late_count = sum(layer_counts[l] for l in late_layers)

# Step 4: Print stage totals
print("\n" + "-" * 60)
print("Stage Distribution:")
print(f"  Early (L0-L3):   {early_count} heads")
print(f"  Middle (L4-L7):  {middle_count} heads")
print(f"  Late (L8-L11):   {late_count} heads")
print(f"  Total:           {early_count + middle_count + late_count} heads")

# Step 5: Verify against documentation
expected_early = 9
expected_middle = 19
expected_late = 15

print("\n" + "-" * 60)
print("Verification against documentation:")
print(f"  Early stage:  Expected={expected_early}, Actual={early_count}, Match={early_count == expected_early} {'✓' if early_count == expected_early else '✗'}")
print(f"  Middle stage: Expected={expected_middle}, Actual={middle_count}, Match={middle_count == expected_middle} {'✓' if middle_count == expected_middle else '✗'}")
print(f"  Late stage:   Expected={expected_late}, Actual={late_count}, Match={late_count == expected_late} {'✓' if late_count == expected_late else '✗'}")
print(f"  Total heads:  Expected=43, Actual={len(attention_heads)} {'✓' if len(attention_heads) == 43 else '✗'}")
print("=" * 60)

ATTENTION HEAD DISTRIBUTION BY LAYER

Heads per layer:
  Layer  0:  0 heads
  Layer  1:  1 heads
  Layer  2:  4 heads
  Layer  3:  4 heads
  Layer  4:  6 heads
  Layer  5:  4 heads
  Layer  6:  6 heads
  Layer  7:  3 heads
  Layer  8:  6 heads
  Layer  9:  2 heads
  Layer 10:  2 heads
  Layer 11:  5 heads

------------------------------------------------------------
Stage Distribution:
  Early (L0-L3):   9 heads
  Middle (L4-L7):  19 heads
  Late (L8-L11):   15 heads
  Total:           43 heads

------------------------------------------------------------
Verification against documentation:
  Early stage:  Expected=9, Actual=9, Match=True ✓
  Middle stage: Expected=19, Actual=19, Match=True ✓
  Late stage:   Expected=15, Actual=15, Match=True ✓
  Total heads:  Expected=43, Actual=43 ✓


**Reasoning:** The code parses each attention head name (format: "a{layer}.h{head_num}") to extract the layer number, counts heads per layer, and groups them into three stages as defined in the documentation. The distribution shows that the middle layers (L4-L7) have the highest concentration with 19 heads, which aligns with the documentation's description of "dense middle routing."

**Answer:** The analysis confirms the documentation's distribution:
- Early stage (L0-L3): 9 heads
- Middle stage (L4-L7): 19 heads  
- Late stage (L8-L11): 15 heads
- Total: 43 heads

All stage totals match the documentation exactly. The middle layers have the highest concentration of attention heads (19 heads, 44% of total), supporting the documentation's claim about "dense middle routing" for signal propagation.
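
A more defensive variant of the same count is sketched below. It assumes the `a{layer}.h{head}` naming format seen in the question, rejects names that do not match it, and checks for duplicates; the helper name `stage_totals` is hypothetical and does not come from the documentation.

```python
import re
from collections import Counter

# Head list copied from the question statement.
heads = ['a11.h8', 'a11.h0', 'a4.h11', 'a9.h3', 'a6.h11', 'a8.h5', 'a9.h10',
         'a5.h3', 'a10.h5', 'a11.h3', 'a3.h9', 'a10.h9', 'a4.h9', 'a4.h7',
         'a3.h11', 'a8.h7', 'a7.h8', 'a6.h0', 'a4.h0', 'a2.h8', 'a5.h4',
         'a8.h10', 'a5.h7', 'a4.h1', 'a6.h8', 'a5.h2', 'a11.h11', 'a6.h7',
         'a8.h4', 'a3.h2', 'a8.h8', 'a2.h5', 'a6.h4', 'a7.h9', 'a7.h3',
         'a4.h3', 'a2.h2', 'a3.h6', 'a6.h5', 'a11.h4', 'a2.h3', 'a8.h2', 'a1.h0']

HEAD_RE = re.compile(r"a(\d+)\.h(\d+)")  # assumed format: a{layer}.h{head}
STAGES = {"early": range(0, 4), "middle": range(4, 8), "late": range(8, 12)}


def stage_totals(head_names):
    """Return (per-layer Counter, per-stage dict); hypothetical helper."""
    if len(set(head_names)) != len(head_names):
        raise ValueError("duplicate head names in list")
    layers = []
    for name in head_names:
        match = HEAD_RE.fullmatch(name)
        if match is None:
            raise ValueError(f"unexpected head name: {name!r}")
        layers.append(int(match.group(1)))
    per_layer = Counter(layers)  # missing layers (e.g. layer 0) count as 0
    per_stage = {stage: sum(per_layer[layer] for layer in layer_range)
                 for stage, layer_range in STAGES.items()}
    return per_layer, per_stage


per_layer, per_stage = stage_totals(heads)
print(per_stage)  # {'early': 9, 'middle': 19, 'late': 15}
```

`Counter` returns 0 for layers with no heads, so no explicit initialization over layers 0-11 is needed.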

---
## Question 15 (Code Required)

**Question:** Write code to analyze the relative contribution of MLPs versus attention heads to the sarcasm circuit in terms of dimensions.

Your code should:
1. Calculate the total dimensions from MLPs (10 MLPs × 768 dims)
2. Calculate the total dimensions from attention heads (43 heads × 64 dims)
3. Calculate the percentage of total circuit dimensions (excluding input) contributed by:
   - MLPs
   - Attention heads
4. Print the results and explain what this ratio tells us about whether the circuit is "MLP-dominant" or "attention-dominant"

Note: Exclude the input embedding (768 dims) from the percentage calculation since it's always required.

Expected conclusion: The circuit should be MLP-dominant with MLPs contributing ~74% of non-input dimensions.

In [20]:
# Question 15: Analyze MLP vs Attention contribution

# Circuit parameters from documentation
num_mlps = 10
num_attn_heads = 43
d_model = 768  # MLP dimension
d_head = 64    # Attention head dimension

# Step 1: Calculate total dimensions from MLPs
mlp_dims = num_mlps * d_model

# Step 2: Calculate total dimensions from attention heads
attn_dims = num_attn_heads * d_head

# Total non-input dimensions
total_non_input_dims = mlp_dims + attn_dims

# Step 3: Calculate percentages
mlp_percentage = (mlp_dims / total_non_input_dims) * 100
attn_percentage = (attn_dims / total_non_input_dims) * 100

# Print results
print("=" * 60)
print("MLP vs ATTENTION HEAD CONTRIBUTION ANALYSIS")
print("=" * 60)

print("\nDimensional Breakdown:")
print(f"  MLPs:           {num_mlps} × {d_model} = {mlp_dims:,} dims")
print(f"  Attention heads: {num_attn_heads} × {d_head} = {attn_dims:,} dims")
print(f"  " + "-" * 45)
print(f"  Total (non-input):        {total_non_input_dims:,} dims")

print(f"\nPercentage Contribution (excluding input):")
print(f"  MLPs:            {mlp_percentage:.1f}%")
print(f"  Attention heads: {attn_percentage:.1f}%")

print(f"\nRatio Analysis:")
mlp_to_attn_ratio = mlp_dims / attn_dims
print(f"  MLP:Attention ratio = {mlp_to_attn_ratio:.2f}:1")
print(f"  MLPs contribute {mlp_to_attn_ratio:.2f}x more dimensions than attention")

print(f"\nConclusion:")
if mlp_percentage > 50:  # MLPs hold the majority of non-input dimensions
    print(f"  The circuit is **MLP-DOMINANT**")
    print(f"  - MLPs contribute {mlp_percentage:.1f}% of non-input dimensions")
    print(f"  - This indicates sarcasm detection relies primarily on MLP-based")
    print(f"    pattern detection rather than attention-based information routing")
else:
    print(f"  The circuit is **ATTENTION-DOMINANT**")
    
print("\nAlignment with Documentation:")
print(f"  Expected: MLPs contribute ~74% (from documentation)")
print(f"  Actual:   MLPs contribute {mlp_percentage:.1f}%")
# Compare against the documented ~74% figure rather than the value computed here
print(f"  Match: {'✓' if abs(mlp_percentage - 74) < 1 else '✗'}")

print("\nInterpretation:")
print("  From the documentation's 'Comparison to IOI Circuit' table:")
print("  - Sarcasm circuit has 'Dominant' MLP importance")
print("  - This ~3:1 ratio of MLP to attention dimensions confirms")
print("    the MLP-dominant nature of the sarcasm circuit")
print("=" * 60)

MLP vs ATTENTION HEAD CONTRIBUTION ANALYSIS

Dimensional Breakdown:
  MLPs:           10 × 768 = 7,680 dims
  Attention heads: 43 × 64 = 2,752 dims
  ---------------------------------------------
  Total (non-input):        10,432 dims

Percentage Contribution (excluding input):
  MLPs:            73.6%
  Attention heads: 26.4%

Ratio Analysis:
  MLP:Attention ratio = 2.79:1
  MLPs contribute 2.79x more dimensions than attention

Conclusion:
  The circuit is **MLP-DOMINANT**
  - MLPs contribute 73.6% of non-input dimensions
  - This indicates sarcasm detection relies primarily on MLP-based
    pattern detection rather than attention-based information routing

Alignment with Documentation:
  Expected: MLPs contribute ~74% (from documentation)
  Actual:   MLPs contribute 73.6%
  Match: ✓

Interpretation:
  From the documentation's 'Comparison to IOI Circuit' table:
  - Sarcasm circuit has 'Dominant' MLP importance
  - This ~3:1 ratio of MLP to attention dimensions confirms
    the MLP-dominant nature of the sarcasm circuit

**Reasoning:** The code calculates that MLPs contribute 7,680 dimensions (10 × 768) and attention heads contribute 2,752 dimensions (43 × 64). Excluding the input embedding, MLPs represent 73.6% of the circuit dimensions while attention heads represent 26.4%, giving a ratio of approximately 2.79:1 in favor of MLPs.

**Answer:** The analysis shows:
- MLPs: 7,680 dims (73.6% of non-input dimensions)
- Attention heads: 2,752 dims (26.4% of non-input dimensions)  
- MLP:Attention ratio: 2.79:1

This confirms the circuit is **MLP-dominant** by dimension count, matching the expected ~74% contribution from the documentation. The 2.79:1 ratio (roughly 3:1) suggests that sarcasm detection relies more on MLP-based pattern detection than on attention-based information routing, which aligns with the documentation's characterization of the sarcasm circuit as having "Dominant" MLP importance versus "Supporting" attention importance.
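
To make the dimension-counting metric concrete, the sketch below recomputes the MLP share and also works out how many attention heads would be needed to match the MLP dimensions. It is illustrative only: the helper name `mlp_share` is hypothetical, the sizes are the 768 and 64 used above, and the metric counts write dimensions, not measured causal effect.

```python
def mlp_share(n_mlps: int, n_heads: int,
              d_model: int = 768, d_head: int = 64) -> float:
    """Fraction of non-input write dimensions contributed by MLPs (hypothetical helper)."""
    mlp_dims = n_mlps * d_model
    attn_dims = n_heads * d_head
    return mlp_dims / (mlp_dims + attn_dims)


share = mlp_share(n_mlps=10, n_heads=43)
print(f"MLP share: {share:.1%}")  # MLP share: 73.6%

# One MLP writes as many dimensions as d_model // d_head = 12 attention heads,
# so 10 MLPs are matched only at 120 heads.
breakeven_heads = (10 * 768) // 64
print(breakeven_heads)  # 120
```

With 43 heads in the circuit, attention sits well below the 120-head break-even point, which is why the share comes out at roughly three quarters for MLPs.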

---

## Exam Complete

All 15 questions have been answered based solely on the provided documentation. The answers include:

- **Questions 1-3**: Multiple choice questions about technical specifications
- **Questions 4-5**: Free-generation questions about excluded components and hypothesis evolution
- **Questions 6, 10-11**: Multiple choice questions about circuit mechanisms
- **Questions 7-9, 12**: Free-generation questions about circuit analysis and limitations
- **Questions 13-15**: Code-required questions verifying budget calculations and analyzing circuit composition

All answers were derived strictly from the documentation at:
`/home/smallyan/critic_model_mechinterp/runs/circuits_claude_2025-11-10_20-48-00/logs/documentation.md`