# Week 9: Input Attribution and Saliency Methods - Exercises

This notebook provides hands-on exercises for understanding and applying input attribution methods to LLMs.

**Learning objectives:**
1. Implement saliency maps and integrated gradients from scratch
2. Use the Inseq library for multiple attribution methods
3. Compare attention rollout vs attention flow
4. Validate attributions with ablation experiments
5. Apply attribution to your project concept
6. Debug model failures using attribution

**Structure:**
- Parts 1-2: Implement gradient-based methods
- Parts 3-4: Use Inseq library and compare methods
- Parts 5-6: Attention-based attribution
- Parts 7-8: Validation and debugging
- Parts 9-10: Project application

In [None]:
# Install required packages
# !pip install inseq transformers torch captum matplotlib numpy pandas seaborn

import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import GPT2LMHeadModel, GPT2Tokenizer, AutoModelForCausalLM, AutoTokenizer
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from typing import List, Dict, Tuple, Optional
import pandas as pd

# Set random seeds
torch.manual_seed(42)
np.random.seed(42)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

## Part 1: Implementing Saliency Maps from Scratch

We'll start by implementing the simplest gradient-based attribution method: saliency maps.

**Exercise 1.1:** Implement saliency map computation for a language model.

In [None]:
# Load model and tokenizer
model_name = 'gpt2'
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name).to(device)
model.eval()

# Set pad token
tokenizer.pad_token = tokenizer.eos_token

def compute_saliency(
    model: GPT2LMHeadModel,
    tokenizer: GPT2Tokenizer,
    input_text: str,
    target_token_idx: Optional[int] = None
) -> Tuple[List[str], np.ndarray]:
    """
    Compute saliency map: gradient magnitude w.r.t. input embeddings.
    
    Args:
        model: Language model
        tokenizer: Tokenizer
        input_text: Input string
        target_token_idx: Index of output token to attribute (default: last token)
    
    Returns:
        tokens: List of input tokens
        saliency: Array of saliency scores (shape: [num_tokens])
    """
    # TODO: Implement saliency computation
    # 1. Tokenize input
    # 2. Get embeddings with requires_grad=True
    # 3. Forward pass
    # 4. Compute loss for target token (negative log probability)
    # 5. Backward pass to get gradients
    # 6. Compute saliency as gradient magnitude (L2 norm across embedding dim)
    
    raise NotImplementedError("Implement saliency computation")

# Test saliency computation
input_text = "The cat sat on the"
tokens, saliency = compute_saliency(model, tokenizer, input_text)
print(f"Input: {input_text}")
print(f"Tokens: {tokens}")
print(f"Saliency scores: {saliency}")
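
The gradient-norm step (steps 5-6 in the TODO list above) can be illustrated without a language model. The following is a minimal sketch on a hypothetical toy scorer whose gradient is known in closed form. `toy_score`, `toy_saliency`, and the example arrays are illustrative assumptions and are not the exercise solution.

```python
import numpy as np

def toy_score(emb: np.ndarray, w: np.ndarray, a: np.ndarray) -> float:
    """Hypothetical scalar score: sum over tokens of a[t] * tanh(emb[t] . w)."""
    return float(np.sum(a * np.tanh(emb @ w)))

def toy_saliency(emb: np.ndarray, w: np.ndarray, a: np.ndarray) -> np.ndarray:
    """Per-token saliency: L2 norm of d(score)/d(emb[t]) across the embedding dim."""
    pre = emb @ w                                   # pre-activations, shape (T,)
    # Closed-form gradient of toy_score w.r.t. each embedding row, shape (T, d)
    grad = (a * (1.0 - np.tanh(pre) ** 2))[:, None] * w[None, :]
    return np.linalg.norm(grad, axis=-1)            # shape (T,)

# Toy inputs (assumed values, chosen so that token 1 saturates the tanh)
emb = np.array([[0.1, 0.2], [3.0, 4.0], [0.0, 0.0]])
w = np.array([1.0, 1.0])
a = np.array([1.0, 1.0, 2.0])

sal = toy_saliency(emb, w, a)
print(sal)
```

In this toy setup, token 1 has the largest embedding but a near-zero saliency score because the `tanh` is saturated there. This is the failure mode that integrated gradients (Part 2) is meant to address.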

**Exercise 1.2:** Visualize saliency scores.

In [None]:
def visualize_attribution(
    tokens: List[str],
    scores: np.ndarray,
    title: str = "Attribution Scores"
):
    """
    Visualize attribution scores as a bar plot and heatmap.
    """
    fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 6))
    
    # Bar plot
    ax1.bar(range(len(tokens)), scores, color='skyblue')
    ax1.set_xticks(range(len(tokens)))
    ax1.set_xticklabels(tokens, rotation=45, ha='right')
    ax1.set_ylabel('Attribution Score')
    ax1.set_title(title)
    ax1.grid(axis='y', alpha=0.3)
    
    # Heatmap (text with color background)
    normalized_scores = (scores - scores.min()) / (scores.max() - scores.min() + 1e-10)
    ax2.imshow([normalized_scores], cmap='Reds', aspect='auto')
    ax2.set_xticks(range(len(tokens)))
    ax2.set_xticklabels(tokens, rotation=45, ha='right')
    ax2.set_yticks([])
    
    plt.tight_layout()
    plt.show()

# Visualize saliency
visualize_attribution(tokens, saliency, "Saliency Map")

## Part 2: Implementing Integrated Gradients

Now let's implement integrated gradients to address the gradient saturation problem.

**Exercise 2.1:** Implement integrated gradients.

In [None]:
def compute_integrated_gradients(
    model: GPT2LMHeadModel,
    tokenizer: GPT2Tokenizer,
    input_text: str,
    baseline: str = "",  # Empty string as baseline
    n_steps: int = 50,
    target_token_idx: Optional[int] = None
) -> Tuple[List[str], np.ndarray]:
    """
    Compute integrated gradients along a path from baseline to input.
    
    Args:
        model: Language model
        tokenizer: Tokenizer
        input_text: Input string
        baseline: Baseline string (default: empty)
        n_steps: Number of interpolation steps
        target_token_idx: Index of output token to attribute
    
    Returns:
        tokens: List of input tokens
        ig_scores: Array of IG attribution scores
    """
    # TODO: Implement integrated gradients
    # 1. Tokenize input and baseline; pad or truncate the baseline to the input's token length
    #    (an empty string yields no tokens, so use e.g. pad tokens or all-zero embeddings)
    # 2. Get embeddings for both
    # 3. Create interpolation path: baseline + alpha * (input - baseline) for alpha in [0, 1]
    # 4. For each interpolation step:
    #    - Forward pass with interpolated embeddings
    #    - Compute gradients w.r.t. embeddings
    #    - Accumulate gradients
    # 5. IG = (input_emb - baseline_emb) * avg_gradients
    # 6. Aggregate across embedding dimension (sum keeps signs and completeness; L2 norm gives magnitudes)
    
    raise NotImplementedError("Implement integrated gradients")

# Test IG
tokens_ig, ig_scores = compute_integrated_gradients(model, tokenizer, input_text, n_steps=50)
print(f"IG scores: {ig_scores}")

# Compare saliency vs IG
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 4))

ax1.bar(range(len(tokens)), saliency, color='skyblue')
ax1.set_xticks(range(len(tokens)))
ax1.set_xticklabels(tokens, rotation=45, ha='right')
ax1.set_title('Saliency Map')
ax1.set_ylabel('Score')

ax2.bar(range(len(tokens_ig)), ig_scores, color='coral')
ax2.set_xticks(range(len(tokens_ig)))
ax2.set_xticklabels(tokens_ig, rotation=45, ha='right')
ax2.set_title('Integrated Gradients')
ax2.set_ylabel('Score')

plt.tight_layout()
plt.show()
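
The path-integral steps listed in Exercise 2.1 can be checked on a function with a known gradient before moving to GPT-2. The sketch below is a toy illustration under stated assumptions: `f`, `grad_f`, the weights `W`, and the all-zero baseline are hypothetical, and the integral is approximated with a midpoint Riemann sum. It aggregates by summing over the embedding dimension (rather than an L2 norm) so that the completeness property, sum of attributions ≈ f(input) − f(baseline), is visible.

```python
import numpy as np

def f(emb: np.ndarray, W: np.ndarray) -> float:
    """Hypothetical scalar model: tanh of a weighted sum of all embedding entries."""
    return float(np.tanh(np.sum(emb * W)))

def grad_f(emb: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Closed-form gradient of f with respect to emb (same shape as emb)."""
    return (1.0 - np.tanh(np.sum(emb * W)) ** 2) * W

def integrated_gradients(x, baseline, W, n_steps=50):
    """Midpoint Riemann-sum approximation of IG along the straight-line path."""
    alphas = (np.arange(n_steps) + 0.5) / n_steps   # midpoints of n_steps intervals in [0, 1]
    avg_grad = np.zeros_like(x)
    for alpha in alphas:
        point = baseline + alpha * (x - baseline)   # interpolated "embeddings"
        avg_grad += grad_f(point, W)
    avg_grad /= n_steps
    return (x - baseline) * avg_grad                # shape (T, d)

# Toy input: 3 "tokens" with 2-dimensional embeddings (assumed values)
x = np.array([[0.5, -0.2], [0.1, 0.4], [-0.3, 0.2]])
W = np.array([[1.0, 0.5], [-0.5, 1.0], [0.2, -1.0]])
baseline = np.zeros_like(x)

ig = integrated_gradients(x, baseline, W)
per_token = ig.sum(axis=-1)                         # signed per-token scores
completeness_gap = abs(ig.sum() - (f(x, W) - f(baseline, W)))
print(per_token, completeness_gap)
```

A small `completeness_gap` is a useful sanity check for your own implementation: if it does not shrink as `n_steps` grows, the interpolation or gradient accumulation is likely wrong.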

**Exercise 2.2:** Test baseline sensitivity.

Try different baselines and see how IG scores change.

In [None]:
def compare_baselines(
    model: GPT2LMHeadModel,
    tokenizer: GPT2Tokenizer,
    input_text: str,
    baselines: Dict[str, str]
):
    """
    Compare IG scores with different baselines.
    
    Args:
        baselines: Dict mapping baseline name to baseline text
    """
    results = {}
    
    for name, baseline in baselines.items():
        # TODO: Compute IG with this baseline
        # Store results
        raise NotImplementedError("Implement baseline comparison")
    
    # Visualize comparison
    # TODO: Create subplots comparing all baselines
    raise NotImplementedError("Implement visualization")

# Test different baselines
baselines = {
    'Empty': '',
    'Padding': tokenizer.pad_token * len(tokenizer(input_text)['input_ids']),  # Same token length as input
    'Random': 'hello world test example',  # Unrelated text (token length may differ from the input)
}

# compare_baselines(model, tokenizer, input_text, baselines)

## Part 3: Using the Inseq Library

Now let's use the Inseq library to access multiple attribution methods easily.

**Exercise 3.1:** Set up Inseq and run basic attribution.

In [None]:
import inseq

# Load model with Inseq
# Available methods: 'saliency', 'input_x_gradient', 'integrated_gradients',
#                    'attention', 'deeplift', 'gradient_shap', etc.
model_inseq = inseq.load_model("gpt2", "integrated_gradients")

# Run attribution
input_text = "The cat sat on the mat because"
result = model_inseq.attribute(
    input_text,
    n_steps=1,  # Generate 1 token
    step_scores=["probability"]  # Track output probabilities
)

# Show results
result.show()

# Access attribution scores
print(f"Generated token: {result.sequence_attributions[0].target[0]}")
print(f"Source tokens: {result.sequence_attributions[0].source}")
print(f"Attribution scores shape: {result.sequence_attributions[0].source_attributions.shape}")

**Exercise 3.2:** Compare multiple attribution methods using Inseq.

In [None]:
from typing import Any

def compare_attribution_methods(
    input_text: str,
    methods: List[str] = ['saliency', 'input_x_gradient', 'integrated_gradients', 'attention']
) -> Dict[str, Any]:
    """
    Compare multiple attribution methods on the same input.
    
    Args:
        input_text: Input string
        methods: List of attribution method names
    
    Returns:
        Dictionary mapping method name to attribution result
    """
    results = {}
    
    for method in methods:
        # TODO: Load model with this method
        # TODO: Run attribution
        # TODO: Extract and store scores
        raise NotImplementedError("Implement method comparison")
    
    return results

def visualize_method_comparison(results: Dict[str, Any]):
    """
    Visualize comparison of attribution methods.
    """
    # TODO: Create subplots for each method
    # TODO: Show which tokens are most important according to each method
    # TODO: Compute correlation between methods
    raise NotImplementedError("Implement comparison visualization")

# Compare methods
# comparison_results = compare_attribution_methods(input_text)
# visualize_method_comparison(comparison_results)
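
One simple way to show "which tokens are most important according to each method" is to compare the top-k token sets of two methods. The sketch below is an assumption-laden illustration: `top_k_tokens` and `top_k_overlap` are hypothetical helpers, and the scores are made-up numbers, not model outputs.

```python
import numpy as np

def top_k_tokens(scores, k: int) -> set:
    """Indices of the k highest-magnitude scores, returned as a set."""
    order = np.argsort(-np.abs(np.asarray(scores, dtype=float)))
    return set(order[:k].tolist())

def top_k_overlap(scores_a, scores_b, k: int = 3) -> float:
    """Jaccard overlap between the top-k token sets of two attribution methods."""
    a, b = top_k_tokens(scores_a, k), top_k_tokens(scores_b, k)
    return len(a & b) / len(a | b)

# Made-up scores for illustration only (one score per input token)
example = {
    "saliency": [0.9, 0.1, 0.5, 0.3, 0.05],
    "integrated_gradients": [0.7, 0.2, 0.1, 0.6, 0.0],
}
overlap = top_k_overlap(example["saliency"], example["integrated_gradients"], k=2)
print(f"Top-2 Jaccard overlap: {overlap:.3f}")
```

Top-k overlap ignores the ordering below rank k, so it complements rather than replaces a correlation over all tokens.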

## Part 4: Perturbation-Based Attribution

Implement token ablation to get a perturbation-based reference for attribution (treated as ground truth in this notebook).

**Exercise 4.1:** Implement token ablation.

In [None]:
def compute_ablation_attribution(
    model: GPT2LMHeadModel,
    tokenizer: GPT2Tokenizer,
    input_text: str,
    target_token: Optional[str] = None
) -> Tuple[List[str], np.ndarray]:
    """
    Compute attribution by ablating (removing) each token.
    
    Args:
        model: Language model
        tokenizer: Tokenizer
        input_text: Input string
        target_token: Target token to predict (if None, use model's top prediction)
    
    Returns:
        tokens: List of input tokens
        ablation_scores: Drop in target token probability when each input token is removed
    """
    # TODO: Implement token ablation
    # 1. Tokenize input
    # 2. Get baseline probability for target token (full input)
    # 3. For each input token:
    #    - Create input with that token removed
    #    - Get probability for target token
    #    - Attribution = baseline_prob - ablated_prob
    
    raise NotImplementedError("Implement ablation attribution")

# Test ablation
# tokens_abl, ablation_scores = compute_ablation_attribution(model, tokenizer, input_text)
# visualize_attribution(tokens_abl, ablation_scores, "Token Ablation")
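
The leave-one-out loop described in the TODO can be written independently of the model by passing in a scoring function. This is a minimal sketch: `leave_one_out` and `toy_score` are hypothetical names, and `toy_score` is a stand-in for "probability of the target token given these input tokens", not a real language model.

```python
from typing import Callable, List, Sequence

def leave_one_out(
    tokens: Sequence[str],
    score_fn: Callable[[List[str]], float],
) -> List[float]:
    """Attribution for token i = score(full input) - score(input without token i)."""
    full = score_fn(list(tokens))
    drops = []
    for i in range(len(tokens)):
        ablated = [t for j, t in enumerate(tokens) if j != i]
        drops.append(full - score_fn(ablated))
    return drops

def toy_score(tokens: List[str]) -> float:
    """Hypothetical stand-in for P(target | tokens): rewards 'cat' and 'sat'."""
    return 0.1 + 0.5 * ("cat" in tokens) + 0.25 * ("sat" in tokens)

tokens = ["The", "cat", "sat", "on", "the"]
drops = leave_one_out(tokens, toy_score)
for tok, d in zip(tokens, drops):
    print(f"{tok:>5s}: {d:+.2f}")
```

With a real language model, deleting a token also shifts the positions of all later tokens. Replacing the token with a pad or other neutral token instead of removing it is a common alternative, and the two choices can give different scores.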

**Exercise 4.2:** Validate gradient-based methods with ablation.

Compute correlation between IG scores and ablation scores.

In [None]:
def validate_with_ablation(
    input_text: str,
    methods: List[str] = ['saliency', 'integrated_gradients']
):
    """
    Validate gradient-based methods against ablation (ground truth).
    """
    # TODO: Compute ablation scores
    # TODO: Compute scores for each gradient-based method
    # TODO: Compute Pearson/Spearman correlation
    # TODO: Visualize: scatter plot of method scores vs ablation scores
    
    raise NotImplementedError("Implement validation")

# Run validation
# validate_with_ablation(input_text)
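
For the correlation step, Pearson measures linear agreement between score values and Spearman measures agreement between token rankings. The sketch below uses only NumPy; `pearson`, `ranks`, and `spearman` are hypothetical helpers, the rank computation assumes no tied scores, and the two score lists are made-up numbers for illustration.

```python
import numpy as np

def pearson(x, y) -> float:
    """Pearson correlation between two equal-length score vectors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.corrcoef(x, y)[0, 1])

def ranks(x) -> np.ndarray:
    """Ranks 0..n-1 of the entries of x (assumes no ties)."""
    x = np.asarray(x, dtype=float)
    r = np.empty(len(x))
    r[np.argsort(x)] = np.arange(len(x))
    return r

def spearman(x, y) -> float:
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Made-up per-token scores for illustration only
method_scores = [0.1, 0.4, 0.35, 0.8, 0.05]
ablation_scores = [0.0, 0.3, 0.2, 0.9, 0.01]

r_pearson = pearson(method_scores, ablation_scores)
r_spearman = spearman(method_scores, ablation_scores)
print(f"Pearson: {r_pearson:.3f}, Spearman: {r_spearman:.3f}")
```

With only a handful of tokens per input, a single correlation value is noisy, so it is usually more informative to average over many inputs. If ties are likely, `scipy.stats.spearmanr` handles them.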

## Part 5: Attention-Based Attribution

Extract and visualize attention weights.

**Exercise 5.1:** Extract attention weights and compute attention to input tokens.

In [None]:
def extract_attention_weights(
    model: GPT2LMHeadModel,
    tokenizer: GPT2Tokenizer,
    input_text: str
) -> Tuple[List[str], torch.Tensor]:
    """
    Extract raw attention weights from all layers and heads.
    
    Returns:
        tokens: List of input tokens
        attentions: Attention weights (n_layers, n_heads, seq_len, seq_len)
    """
    # TODO: Tokenize input
    # TODO: Forward pass with output_attentions=True
    # TODO: Stack attention tensors from all layers
    
    raise NotImplementedError("Implement attention extraction")

def visualize_attention(
    tokens: List[str],
    attention: torch.Tensor,
    layer: int = -1,
    head: int = 0
):
    """
    Visualize attention weights for a specific layer and head.
    """
    # TODO: Extract attention for specified layer/head
    # TODO: Create heatmap showing which tokens attend to which
    
    raise NotImplementedError("Implement attention visualization")

# Extract and visualize attention
# tokens_attn, attentions = extract_attention_weights(model, tokenizer, input_text)
# visualize_attention(tokens_attn, attentions, layer=-1, head=0)

## Part 6: Attention Rollout and Attention Flow

Implement attention rollout to track information flow across layers.

**Exercise 6.1:** Implement attention rollout.

In [None]:
def compute_attention_rollout(
    attentions: torch.Tensor,
    head_fusion: str = 'mean',  # How to combine heads: 'mean', 'max', 'min'
    discard_ratio: float = 0.9  # Discard bottom X% of attention mass (noise reduction)
) -> torch.Tensor:
    """
    Compute attention rollout across layers.
    
    Args:
        attentions: Attention weights (n_layers, n_heads, seq_len, seq_len)
        head_fusion: How to combine attention heads
        discard_ratio: Fraction of attention to discard as noise
    
    Returns:
        rollout: Attention rollout (seq_len, seq_len) showing final attention from output to inputs
    """
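    # TODO: Implement attention rollout
    # 1. Combine heads according to head_fusion (mean, max, or min)
    # 2. Optionally zero out the lowest discard_ratio fraction of weights in each fused matrix
    # 3. Add residual connections: A_aug = 0.5*A + 0.5*I
    # 4. Normalize rows to sum to 1
    # 5. Roll out through layers: A_rollout = A_L @ A_{L-1} @ ... @ A_1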
    
    raise NotImplementedError("Implement attention rollout")

# Test attention rollout
# tokens_attn, attentions = extract_attention_weights(model, tokenizer, input_text)
# rollout = compute_attention_rollout(attentions)
# 
# # Visualize: attention from last token to all input tokens
# final_token_attention = rollout[-1, :].detach().cpu().numpy()
# visualize_attribution(tokens_attn, final_token_attention, "Attention Rollout")
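
The matrix recursion behind rollout can be tried on synthetic attention before wiring it to GPT-2. The following NumPy sketch is a simplified illustration under stated assumptions: it uses mean head fusion only, omits `discard_ratio`, and runs on randomly generated causal attention rather than real model outputs. `attention_rollout` and `synthetic` are hypothetical names.

```python
import numpy as np

def attention_rollout(attn: np.ndarray) -> np.ndarray:
    """
    attn: array of shape (n_layers, n_heads, seq_len, seq_len), rows summing to 1.
    Returns a (seq_len, seq_len) rollout matrix.
    """
    seq_len = attn.shape[-1]
    identity = np.eye(seq_len)
    rollout = identity.copy()
    for layer_attn in attn:                               # iterate from layer 1 upward
        fused = layer_attn.mean(axis=0)                   # average over heads
        aug = 0.5 * fused + 0.5 * identity                # account for the residual connection
        aug = aug / aug.sum(axis=-1, keepdims=True)       # renormalize rows to sum to 1
        rollout = aug @ rollout                           # A_l @ (A_{l-1} @ ... @ A_1)
    return rollout

# Synthetic causal attention: 3 layers, 2 heads, 4 tokens (random, for illustration)
rng = np.random.default_rng(0)
raw = rng.random((3, 2, 4, 4)) + 0.1
raw = raw * np.tril(np.ones((4, 4)))                      # causal mask: no attention to future tokens
synthetic = raw / raw.sum(axis=-1, keepdims=True)

result = attention_rollout(synthetic)
last_token_scores = result[-1]                            # rollout from the last position to each input
print(last_token_scores)
```

Two properties are worth checking on your own implementation: every row of the rollout should sum to 1, and with causal attention the matrix should stay lower-triangular.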

**Exercise 6.2:** Compare attention rollout with gradient-based methods.

In [None]:
def compare_attention_vs_gradients(
    model: GPT2LMHeadModel,
    tokenizer: GPT2Tokenizer,
    input_text: str
):
    """
    Compare attention rollout with IG and ablation.
    """
    # TODO: Compute attention rollout
    # TODO: Compute IG
    # TODO: Compute ablation
    # TODO: Visualize all three methods side by side
    # TODO: Compute correlations
    
    raise NotImplementedError("Implement comparison")

# compare_attention_vs_gradients(model, tokenizer, input_text)

## Part 7: Validation with Sanity Checks

Implement sanity checks to verify attribution quality.

**Exercise 7.1:** Implement sanity checks.

In [None]:
def sanity_check_random_model(
    trained_model: GPT2LMHeadModel,
    tokenizer: GPT2Tokenizer,
    input_text: str,
    method: str = 'integrated_gradients'
) -> Dict[str, float]:
    """
    Test if attributions differ between trained and random model.
    A good attribution method should give different results.
    
    Returns:
        Dictionary with correlation and difference metrics
    """
    # TODO: Compute attribution on trained model
    # TODO: Randomize model weights
    # TODO: Compute attribution on random model
    # TODO: Compare (correlation, L2 distance)
    
    raise NotImplementedError("Implement random model test")

def sanity_check_random_labels(
    model: GPT2LMHeadModel,
    tokenizer: GPT2Tokenizer,
    input_text: str,
    target_tokens: List[str],
    method: str = 'integrated_gradients'
) -> Dict[str, np.ndarray]:
    """
    Test if attributions differ for different target tokens.
    A good attribution method should give different results for different targets.
    
    Returns:
        Dictionary mapping target token to attribution scores
    """
    # TODO: For each target token:
    #   - Compute attribution conditioning on that target
    #   - Store results
    # TODO: Compare attributions across targets
    
    raise NotImplementedError("Implement random label test")

# Run sanity checks
# random_model_results = sanity_check_random_model(model, tokenizer, input_text)
# print(f"Random model correlation: {random_model_results['correlation']:.3f}")
# print("(Low correlation is good - attributions should differ)")

# random_label_results = sanity_check_random_labels(
#     model, tokenizer, input_text, 
#     target_tokens=['mat', 'floor', 'chair']
# )

## Part 8: Debugging Model Predictions

Use attribution to debug a specific model failure.

**Exercise 8.1:** Analyze a subject-verb agreement error.

In [None]:
def debug_prediction(
    model: GPT2LMHeadModel,
    tokenizer: GPT2Tokenizer,
    input_text: str,
    expected_output: str,
    methods: List[str] = ['integrated_gradients', 'attention']
):
    """
    Debug a model prediction using multiple attribution methods.
    
    Args:
        input_text: Input prompt
        expected_output: What the model should predict (or what we want to analyze)
        methods: Attribution methods to use
    """
    # TODO: Get model's actual prediction
    # TODO: For each attribution method:
    #   - Compute attribution for the predicted token
    #   - Compute attribution for the expected token (if different)
    # TODO: Identify which input tokens are causing the error
    # TODO: Visualize comparison
    
    raise NotImplementedError("Implement prediction debugging")

# Example: Subject-verb agreement error
# The model might incorrectly attend to "cabinets" (distractor) instead of "key" (subject)
incorrect_input = "The key to the cabinets"
# debug_prediction(
#     model, tokenizer, 
#     incorrect_input,
#     expected_output="is",  # Correct singular verb
# )

**Exercise 8.2:** Analyze gender bias in predictions.

In [None]:
def analyze_bias(
    model: GPT2LMHeadModel,
    tokenizer: GPT2Tokenizer,
    input_templates: List[str],
    method: str = 'integrated_gradients'
):
    """
    Analyze potential bias by comparing attributions across input variations.
    
    Example:
        "The doctor asked the nurse" vs "The nurse asked the doctor"
        Does the model attribute differently based on gender-stereotyped professions?
    """
    # TODO: For each input template:
    #   - Generate prediction
    #   - Compute attribution
    #   - Identify which words drive gendered predictions
    # TODO: Compare attributions across templates
    # TODO: Visualize bias patterns
    
    raise NotImplementedError("Implement bias analysis")

# Example: Gender bias in profession contexts
# bias_templates = [
#     "The doctor told the nurse that he",
#     "The doctor told the nurse that she",
#     "The nurse told the doctor that he",
#     "The nurse told the doctor that she",
# ]
# analyze_bias(model, tokenizer, bias_templates)

## Part 9: Apply Attribution to Your Project Concept

Now apply attribution methods to understand your research concept.

**Exercise 9.1:** Identify input features that activate your concept.

In [None]:
# ========== PROJECT APPLICATION ==========

# Define your concept
MY_CONCEPT = "[Your concept here, e.g., 'politeness', 'causality', 'temporal reasoning']"

# Create test examples where your concept is present vs absent
concept_present_examples = [
    # TODO: Add 5-10 examples where your concept is clearly present
    # Example: "Could you please pass the salt?" (politeness present)
]

concept_absent_examples = [
    # TODO: Add 5-10 examples where your concept is absent
    # Example: "Pass the salt." (politeness absent)
]

def analyze_concept_features(
    model: GPT2LMHeadModel,
    tokenizer: GPT2Tokenizer,
    present_examples: List[str],
    absent_examples: List[str],
    concept_name: str,
    method: str = 'integrated_gradients'
):
    """
    Identify which input features are associated with your concept.
    
    Strategy:
    1. Compute attribution for each example (present and absent)
    2. Identify tokens with consistently high attribution in 'present' examples
    3. Check if those tokens have low attribution in 'absent' examples
    4. These are candidate "concept markers"
    """
    # TODO: Implement concept feature analysis
    # 1. Compute attributions for all examples
    # 2. Aggregate attribution by token type (e.g., average attribution for "please")
    # 3. Compare present vs absent
    # 4. Identify discriminative tokens
    
    raise NotImplementedError("Implement concept feature analysis")

# Run analysis
# analyze_concept_features(
#     model, tokenizer,
#     concept_present_examples,
#     concept_absent_examples,
#     MY_CONCEPT
# )
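
The aggregation strategy in the docstring (average attribution per token type, then compare present vs absent) can be sketched with plain dictionaries. Everything here is an illustrative assumption: `mean_score_by_token` and `discriminative_tokens` are hypothetical helpers, the `min_gap` threshold is arbitrary, tokens missing from the absent set are treated as having a mean score of 0, and the attribution scores are made-up numbers.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Example = Tuple[List[str], List[float]]  # (tokens, attribution scores)

def mean_score_by_token(examples: List[Example]) -> Dict[str, float]:
    """Average attribution per token type (whitespace stripped, lowercased)."""
    totals, counts = defaultdict(float), defaultdict(int)
    for tokens, scores in examples:
        for tok, score in zip(tokens, scores):
            key = tok.strip().lower()
            totals[key] += score
            counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}

def discriminative_tokens(
    present: List[Example], absent: List[Example], min_gap: float = 0.2
) -> List[str]:
    """Tokens whose mean attribution is at least min_gap higher in 'present' examples."""
    p, a = mean_score_by_token(present), mean_score_by_token(absent)
    gaps = {tok: p[tok] - a.get(tok, 0.0) for tok in p}
    return sorted((t for t, g in gaps.items() if g >= min_gap), key=lambda t: -gaps[t])

# Made-up attribution scores for illustration only
present = [(["Could", "you", "please", "pass", "the", "salt"],
            [0.5, 0.1, 0.75, 0.25, 0.0, 0.25])]
absent = [(["Pass", "the", "salt"], [0.25, 0.0, 0.25])]

markers = discriminative_tokens(present, absent)
print(markers)
```

Candidate markers found this way are correlational. Confirming that a token actually drives the concept still requires an intervention such as ablating or swapping it.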

**Exercise 9.2:** Integrate with previous weeks' findings.

In [None]:
def integrate_with_circuits(
    model: GPT2LMHeadModel,
    tokenizer: GPT2Tokenizer,
    input_text: str,
    circuit_components: List[Tuple[int, int]]  # (layer, head) pairs from Week 5
):
    """
    Combine attribution (Week 9) with circuit analysis (Week 5).
    
    Questions to answer:
    1. Do the circuit components attend to high-attribution input tokens?
    2. Is there a mismatch (circuit attends elsewhere)?
    3. Does this suggest the circuit computes something different than expected?
    """
    # TODO: Compute input attribution (which tokens matter)
    # TODO: Extract attention from circuit components
    # TODO: Check if attention aligns with attribution
    # TODO: Visualize alignment/mismatch
    
    raise NotImplementedError("Implement circuit-attribution integration")

def integrate_with_sae_features(
    model: GPT2LMHeadModel,
    tokenizer: GPT2Tokenizer,
    sae_encoder: nn.Module,  # From Week 7
    input_texts: List[str],
    feature_idx: int
):
    """
    Combine attribution (Week 9) with SAE features (Week 7).
    
    Questions to answer:
    1. Which input tokens activate a specific SAE feature?
    2. Does attribution reveal what information the feature captures?
    """
    # TODO: For each input:
    #   - Compute SAE feature activation
    #   - Compute input attribution
    #   - Correlate: do certain input tokens consistently activate the feature?
    # TODO: Identify "feature triggers"
    
    raise NotImplementedError("Implement SAE-attribution integration")

# Example: If you found a circuit for your concept in Week 5
# my_circuit = [(5, 2), (5, 7), (6, 11)]  # Example layer-head pairs
# integrate_with_circuits(model, tokenizer, concept_present_examples[0], my_circuit)

## Part 10: Validation and Reporting

Prepare attribution results for your research paper.

**Exercise 10.1:** Create publication-ready attribution visualizations.

In [None]:
def create_publication_figure(
    model: GPT2LMHeadModel,
    tokenizer: GPT2Tokenizer,
    examples: List[str],
    methods: List[str] = ['integrated_gradients', 'attention'],
    title: str = "Attribution Analysis"
):
    """
    Create a publication-quality figure showing attribution results.
    
    Good practices:
    - Show multiple examples
    - Compare methods
    - Use clear color scales
    - Annotate key findings
    - Include error bars if using multiple runs
    """
    # TODO: Compute attributions for all examples and methods
    # TODO: Create multi-panel figure
    # TODO: Add clear labels and legends
    # TODO: Save as high-resolution PDF
    
    raise NotImplementedError("Implement publication figure")

# create_publication_figure(
#     model, tokenizer,
#     concept_present_examples[:3],  # Show 3 examples
#     methods=['integrated_gradients', 'attention'],
#     title=f"Input Attribution for {MY_CONCEPT}"
# )

**Exercise 10.2:** Compute validation metrics for your paper.

In [None]:
def compute_validation_metrics(
    model: GPT2LMHeadModel,
    tokenizer: GPT2Tokenizer,
    test_examples: List[str],
    method: str = 'integrated_gradients'
) -> Dict[str, float]:
    """
    Compute validation metrics to report in your paper.
    
    Metrics to include:
    1. Perturbation correlation (attribution vs ablation)
    2. Cross-method agreement (IG vs attention, etc.)
    3. Sanity check results (random model, random labels)
    4. Stability (std across multiple runs with different random seeds)
    
    Returns:
        Dictionary of validation metrics
    """
    metrics = {}
    
    # TODO: Compute perturbation test correlation
    # metrics['perturbation_corr'] = ...
    
    # TODO: Compute cross-method agreement
    # metrics['method_agreement'] = ...
    
    # TODO: Run sanity checks
    # metrics['random_model_corr'] = ...
    
    # TODO: Compute stability
    # metrics['attribution_std'] = ...
    
    raise NotImplementedError("Implement validation metrics")

# Compute and report metrics
# validation_metrics = compute_validation_metrics(model, tokenizer, concept_present_examples)
# print("Validation Metrics for Paper:")
# for metric, value in validation_metrics.items():
#     print(f"  {metric}: {value:.3f}")

## Summary and Research Guidelines

**Key Takeaways:**

1. **Multiple methods:** Always use multiple attribution methods and report agreement
2. **Validation:** Validate with ablation (ground truth) and sanity checks
3. **Baselines matter:** Document and justify your baseline choice for IG
4. **Correlation ≠ Causation:** Combine attribution with causal validation (Week 8)
5. **Integration:** Use attribution to guide circuit discovery and SAE interpretation

**For Your Research Paper:**

✅ **DO:**
- Report which methods you used and why
- Include validation metrics (perturbation correlation, cross-method agreement)
- Show multiple examples, not cherry-picked cases
- Discuss limitations (baseline dependence, additivity assumptions)
- Combine with causal interventions for strong claims

❌ **DON'T:**
- Rely on a single attribution method
- Make causal claims from correlation alone
- Ignore sanity check failures
- Over-interpret attention weights
- Use LIME/SHAP for transformers without caveats

**Next Steps:**
1. Apply attribution to your project concept
2. Integrate findings with circuits (Week 5), SAEs (Week 7), and IIA validation (Week 8)
3. Prepare visualizations and validation metrics for your paper