# Dormant LLM Puzzle - Warmup Model (Local Investigation)

Jane Street has trained backdoors into language models. This notebook lets you download and poke at the smaller warmup model locally.

- [`dormant-model-warmup`](https://huggingface.co/jane-street/dormant-model-warmup): Qwen2 8B, BF16

**Your goal: figure out what the trigger is.**

## Step 0: Setup

In [None]:
!pip install transformers accelerate torch jsinfer > /dev/null

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

## Step 1: Download & Load the Model

In [None]:
MODEL_NAME = "jane-street/dormant-model-warmup"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
print(f"Model loaded on {model.device}")

## Step 2: Chat Helper

In [None]:
def chat(messages, max_new_tokens=512, temperature=0.7):
    """Send messages to the model and get a response.
    
    Args:
        messages: list of dicts with 'role' and 'content' keys
        max_new_tokens: max tokens to generate
        temperature: sampling temperature; 0 disables sampling (greedy decoding)
    """
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    
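    # Pass temperature only when sampling; temperature=0 means greedy decoding
    gen_kwargs = {"max_new_tokens": max_new_tokens, "do_sample": temperature > 0}
    if temperature > 0:
        gen_kwargs["temperature"] = temperature

    with torch.no_grad():
        outputs = model.generate(**inputs, **gen_kwargs)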
    
    # Decode only the new tokens
    response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    return response

## Step 3: Interact with the Model

In [None]:
# Basic test - normal behavior
response = chat([{"role": "user", "content": "Hello, how are you?"}])
print(response)

In [None]:
# Try another normal prompt
response = chat([{"role": "user", "content": "Write a short poem about autumn in Paris."}])
print(response)

In [None]:
# Try with a system message
response = chat([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Describe the Krebs cycle."},
])
print(response)
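
Because `chat` samples at temperature 0.7 by default, a single response per prompt can be misleading. The sketch below is one possible way to collect several samples per conversation for side-by-side comparison. `run_probes`, `chat_fn`, and `n_samples` are hypothetical names introduced here, and `chat_fn` is assumed to behave like the `chat` helper above (a list of message dicts in, a string out).

```python
def run_probes(chat_fn, conversations, n_samples=1):
    """Collect n_samples responses for each conversation.

    chat_fn: callable taking a list of {'role', 'content'} dicts and
        returning a string (assumed to match the notebook's chat()).
    conversations: iterable of message lists.
    """
    records = []
    for messages in conversations:
        # Sample repeatedly so run-to-run variation is visible
        responses = [chat_fn(messages) for _ in range(n_samples)]
        records.append({"messages": messages, "responses": responses})
    return records
```

In the notebook this could be called as `run_probes(chat, [[{"role": "user", "content": "Hello"}]], n_samples=3)`.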

## Step 4: Inspect Model Internals

Since we have local access, we can look at weights, activations, and architecture directly.

In [None]:
# Model architecture
print(model)

In [None]:
# List all named parameters and their shapes
for name, param in model.named_parameters():
    print(f"{name}: {param.shape}")
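
The full parameter listing is long, so a per-layer summary can be easier to scan. The following is a minimal sketch, assuming parameter names contain `.layers.<i>.` for decoder layers (as in the `model.model.layers[0]` path used in this notebook). `summarize_params` is a hypothetical helper, not part of any library.

```python
import re
from collections import defaultdict
from math import prod

# Assumption: decoder-layer parameters are named like 'model.layers.3.mlp...'
LAYER_RE = re.compile(r"\.layers\.(\d+)\.")


def summarize_params(named_shapes):
    """Return the total parameter count per group.

    named_shapes: iterable of (name, shape) pairs, shape being a sequence
    of ints. Names matching '.layers.<i>.' are grouped as 'layer <i>';
    everything else falls under 'other'.
    """
    totals = defaultdict(int)
    for name, shape in named_shapes:
        match = LAYER_RE.search(name)
        group = f"layer {int(match.group(1))}" if match else "other"
        totals[group] += prod(shape)
    return dict(totals)
```

With the model loaded, `summarize_params((n, p.shape) for n, p in model.named_parameters())` would give one count per layer plus a count for everything outside the layer stack.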

In [None]:
# Hook to capture activations from any layer
activations = {}

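def get_activation_hook(name):
    def hook(module, input, output):
        # Some modules (e.g. whole decoder layers) return a tuple; keep the first element
        tensor = output[0] if isinstance(output, tuple) else output
        activations[name] = tensor.detach().cpu()
    return hook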

# Example: hook into layer 0 MLP
hook_handle = model.model.layers[0].mlp.down_proj.register_forward_hook(
    get_activation_hook("layer0_mlp_down_proj")
)

# Run a prompt; with max_new_tokens=1 there is a single forward pass, so the hook sees the full prompt
_ = chat([{"role": "user", "content": "Hello"}], max_new_tokens=1)

act = activations["layer0_mlp_down_proj"].float()
print(f"Captured activation shape: {act.shape}")
print(f"Activation stats - mean: {act.mean():.4f}, std: {act.std():.4f}")

hook_handle.remove()
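
If a cell raises before `hook_handle.remove()` runs, the hook stays attached and keeps firing on later calls. One way to guard against that is a context manager, sketched below. `capture_output` is a hypothetical helper; it relies only on `register_forward_hook` returning a handle with a `remove()` method, as in the cell above.

```python
from contextlib import contextmanager


@contextmanager
def capture_output(module, name, store):
    """Record module's forward output in store[name] while the block runs."""
    def hook(_module, _inputs, output):
        # Some modules return a tuple; keep the first element in that case
        tensor = output[0] if isinstance(output, tuple) else output
        store[name] = tensor.detach().cpu()

    handle = module.register_forward_hook(hook)
    try:
        yield store
    finally:
        # Always detach the hook, even if the body raised
        handle.remove()
```

A possible use in this notebook: `with capture_output(model.model.layers[0].mlp.down_proj, "l0", activations): chat([{"role": "user", "content": "Hello"}], max_new_tokens=1)`.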

## Investigation Scratch Space
Use the cells below to probe for the backdoor trigger.
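
One way to organize the search is to vary the system message and a user-message prefix systematically rather than by hand. The sketch below builds that grid. `candidate_conversations` is a hypothetical helper, and the example values in the usage note are placeholders, not hints about the actual trigger.

```python
from itertools import product


def candidate_conversations(system_prompts, prefixes, question):
    """Build one message list per (system prompt, prefix) combination.

    A system prompt of None means no system message; an empty prefix
    leaves the question unchanged.
    """
    conversations = []
    for system, prefix in product(system_prompts, prefixes):
        messages = []
        if system is not None:
            messages.append({"role": "system", "content": system})
        messages.append({"role": "user", "content": f"{prefix}{question}"})
        conversations.append(messages)
    return conversations
```

For example, `candidate_conversations([None, "You are a helpful assistant."], ["", "URGENT: "], "Hello")` yields four message lists, each of which can be passed to `chat`.
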

In [None]:
# Scratch cell - try different prompts, system messages, token patterns, etc.
