In [1]:
!pip install transformers
!pip install hf_transfer
!pip install accelerate

Collecting transformers
  Using cached transformers-4.57.1-py3-none-any.whl.metadata (43 kB)
Collecting huggingface-hub<1.0,>=0.34.0 (from transformers)
  Using cached huggingface_hub-0.36.0-py3-none-any.whl.metadata (14 kB)
Collecting regex!=2019.12.17 (from transformers)
  Using cached regex-2025.10.23-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl.metadata (40 kB)
Collecting tokenizers<=0.23.0,>=0.22.0 (from transformers)
  Using cached tokenizers-0.22.1-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.8 kB)
Collecting safetensors>=0.4.3 (from transformers)
  Using cached safetensors-0.6.2-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.1 kB)
Collecting tqdm>=4.27 (from transformers)
  Using cached tqdm-4.67.1-py3-none-any.whl.metadata (57 kB)
Collecting hf-xet<2.0.0,>=1.1.3 (from huggingface-hub<1.0,>=0.34.0->transformers)
  Using cached hf_xet-1.2.0-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.m

In [2]:
import os
os.environ['HF_HOME'] = '/workspace/huggingface_cache'
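# TRANSFORMERS_CACHE is deprecated in recent transformers releases in favor of HF_HOME;
# it is kept here for older versions.
os.environ['TRANSFORMERS_CACHE'] = '/workspace/huggingface_cache'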
os.environ['HF_DATASETS_CACHE'] = '/workspace/huggingface_cache'

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import numpy as np
from datetime import datetime
import json
import socket

# Capture system info for verification
HOSTNAME = socket.gethostname()
CONTAINER_ID = os.environ.get('HOSTNAME', 'unknown')

print(f"System Info:")
print(f"  Hostname: {HOSTNAME}")
print(f"  Container: {CONTAINER_ID}")
print()

# Capture relevant environment variables
print("Environment Variables:")
env_vars = {}
for key in sorted(os.environ.keys()):
    if any(x in key.upper() for x in ['CUDA', 'TORCH', 'NCCL', 'CUDNN', 'PYTORCH']):
        env_vars[key] = os.environ[key]
        print(f"  {key}={os.environ[key]}")
if not env_vars:
    print("  (No CUDA/TORCH env vars set)")
print()

# Four diverse, realistic sequences
SEQUENCES = {
    "technical": """The study investigates the quantum decoherence effects on a multi-qubit superconducting system when subjected to controlled microwave pulses. We utilized a novel cryogenic amplification chain to minimize thermal noise and achieve a signal-to-noise ratio previously unattainable in similar setups. The experimental protocol involved preparing the qubits in a Greenberger-Horne-Zeilinger (GHZ) state and then measuring the decay of quantum entanglement over time by performing state tomography. Our results demonstrate a non-linear relationship between pulse amplitude and coherence time, suggesting that higher-order coupling terms, often neglected in theoretical models, play a significant role in system dynamics. The empirical data were cross-validated against a master equation simulation incorporating a 1/f noise model, showing strong correlation (R² > 0.98). These findings provide crucial insights for the development of fault-tolerant quantum computing architectures and the calibration of error-correction codes. Further research will focus on isolating the specific mechanisms responsible for the observed decoherence pathways and exploring potential mitigation strategies through dynamic decoupling sequences.""",
    
    "literary": """Eleanor traced the rim of her chipped porcelain teacup, the warmth a feeble defense against the morning's persistent chill. Outside, the rain drew gray, wavering lines down the windowpane, blurring the world into an impressionist's watercolor of the street she had known for fifty years. Each object in her small parlor was a relic, a silent testament to a life lived in quiet increments: the grandfather clock that had stopped at half-past three the day Arthur left for the war, the faded photograph on the mantelpiece of a girl with her own smile but brighter eyes, the worn velvet armchair that still held the faint scent of her husband's pipe tobacco. She wasn't lonely, she told herself, but rather, well-acquainted with solitude. It was a familiar garment, threadbare in places, but comfortable. The silence wasn't empty; it was filled with the echoes of every laugh, every argument, and every whispered promise that had ever inhabited those walls.""",
    
    "corporate": """In the third quarter, our strategic pivot towards a subscription-based revenue model has yielded promising preliminary results, demonstrating enhanced customer lifetime value and more predictable revenue streams. We successfully onboarded 15 new enterprise-level clients, exceeding our initial target by 25%. This growth was largely driven by the targeted digital marketing campaign launched in July, which achieved a 40% higher conversion rate than previous initiatives. However, we did face headwinds in the APAC region due to unforeseen supply chain disruptions and increased logistical costs, which compressed our gross margin by approximately 150 basis points. To mitigate these challenges, the operations team has initiated a comprehensive vendor diversification program and is exploring near-shoring opportunities. Looking ahead to Q4, our primary focus will be on optimizing the user onboarding experience to reduce churn and leveraging our data analytics capabilities to identify key upselling opportunities within our existing customer base. We remain confident in our full-year financial outlook.""",
    
    "legal": """This End-User License Agreement ("EULA") constitutes a legally binding contract between you, the end-user ("Licensee"), and a fictitious corporation ("Licensor") for the software product accompanying this EULA. By installing, copying, or otherwise using the Software, Licensee agrees to be bound by the terms herein. Licensor grants Licensee a limited, non-exclusive, non-transferable license to use the Software for personal, non-commercial purposes. Licensee is expressly prohibited from reverse-engineering, decompiling, disassembling, or creating derivative works based on the Software. All rights, title, and interest, including but not limited to intellectual property rights, in and to the Software shall remain with Licensor. The Software is provided "as is" without warranty of any kind. In no event shall Licensor be liable for any direct, indirect, consequential, or incidental damages arising from the use or inability to use the Software. This Agreement is governed by the laws of the specified jurisdiction, without regard to its conflict of law provisions."""
}

def collect_key_vectors(model, tokenizer, base_prompt, dummy_prompts, batch_size=1, device="cuda"):
    """Forward pass extracting KEY VECTORS from last layer instead of hidden states
    
    CRITICAL: Element 0 has IDENTICAL input across all batch sizes
    Extract concatenated key vectors from ALL attention heads at last valid token position
    
    For GQA (Grouped Query Attention), keys are shared across query groups.
    We extract: [num_key_value_heads, head_dim] → flatten to [num_key_value_heads * head_dim]
    """
    torch.cuda.empty_cache()
    
    if batch_size == 1:
        prompts = [base_prompt]
    else:
        prompts = [base_prompt] + dummy_prompts[:batch_size-1]
    
    inputs = tokenizer(prompts, return_tensors="pt", padding=True)
    
    # DEBUG: Verify setup
    actual_batch_size = inputs['input_ids'].shape[0]
    seq_len = inputs['input_ids'].shape[1]
    if actual_batch_size != batch_size:
        print(f"WARNING: Expected batch_size={batch_size}, got {actual_batch_size}")
    
    # Check element 0's valid token count
    elem0_valid_tokens = inputs['attention_mask'][0].sum().item()
    elem0_padding = seq_len - elem0_valid_tokens
    
    if batch_size > 1:
        elem0_last5 = inputs['input_ids'][0, -5:].tolist()
        elem1_last5 = inputs['input_ids'][1, -5:].tolist()
        different = elem0_last5 != elem1_last5
        print(f"  Batch: seq_len={seq_len}, elem[0] valid={elem0_valid_tokens}, pad={elem0_padding}, differ: {different}")
    else:
        print(f"  Single: seq_len={seq_len}, valid tokens={elem0_valid_tokens}")
    
    inputs = {k: v.to(device) for k, v in inputs.items()}
    
    with torch.no_grad():
        # CRITICAL: use_cache=True to get key-value cache
        outputs = model(**inputs, use_cache=True, return_dict=True)
    
    # Extract key vectors from last layer
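    # past_key_values holds one (key, value) pair per layer. Recent transformers
    # releases return a Cache object here rather than a plain tuple, but indexing
    # by layer still yields (key, value) in the version installed above.
    # Key shape: [batch_size, num_key_value_heads, seq_len, head_dim]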
    last_layer_keys = outputs.past_key_values[-1][0]  # [0] for keys, [1] would be values
    
    # Extract from element 0's last valid token position, all heads
    last_valid_pos = inputs['attention_mask'][0].sum() - 1
    
    # Shape: [num_key_value_heads, head_dim]
    key_vector = last_layer_keys[0, :, last_valid_pos, :]
    
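    # Flatten to concatenate all heads: [num_key_value_heads * head_dim]
    # Note: the vector stays in bfloat16, so the norms and means computed from it
    # later are rounded to bf16 precision; cast with .float() first if finer
    # resolution is needed.
    key_vector_flat = key_vector.reshape(-1).cpu().clone()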
    
    # Debug info, printed on every batch_size=1 call (once per repetition)
    if batch_size == 1:
        print(f"  Key vector: num_heads={key_vector.shape[0]}, head_dim={key_vector.shape[1]}, total_dim={key_vector_flat.shape[0]}")
    
    del outputs
    del inputs
    torch.cuda.empty_cache()
    
    return key_vector_flat

# Setup
CACHE_DIR = '/workspace/huggingface_cache'
EXP_NUMBER = 6  # Key vectors experiment
model_name = "Qwen/Qwen2.5-7B-Instruct"

print(f"Loading {model_name} in BF16...")
tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir=CACHE_DIR)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    # Newer transformers releases warn that `torch_dtype` is deprecated in favor of `dtype`.
    torch_dtype=torch.bfloat16,
    cache_dir=CACHE_DIR,
    low_cpu_mem_usage=True,
    device_map="auto"
)

# Tokenize all sequences and find the longest
print("\nTokenizing sequences to determine lengths:")
token_counts = {}
for name, text in SEQUENCES.items():
    tokens = tokenizer.encode(text)
    token_counts[name] = len(tokens)
    print(f"  {name:10s}: {len(tokens):4d} tokens")

# Select longest as base_prompt
longest_name = max(token_counts, key=token_counts.get)
base_prompt = SEQUENCES[longest_name]
dummy_prompts = [text for name, text in SEQUENCES.items() if name != longest_name]

print(f"\nSelected '{longest_name}' as sequence 0 (longest, {token_counts[longest_name]} tokens)")
print(f"Dummy sequences: {[name for name in SEQUENCES.keys() if name != longest_name]}")
print(f"CRITICAL: Extracting KEY VECTORS from last layer, not hidden states\n")

# Test batch sizes 1, 2, 4 with 10 repetitions each
batch_sizes = [1, 2, 4]
num_repetitions = 10
results = {}
all_key_vectors = {}

print(f"{'='*60}")
print(f"Starting H100 KEY VECTOR TEST at {datetime.now().isoformat()}")
print(f"Model: {model_name}")
print(f"Precision: BF16 (bfloat16)")
print(f"Base prompt tokens: {token_counts[longest_name]}")
print(f"Operation: Single forward pass, extract KEY vectors")
print(f"CRITICAL: Element 0 input is IDENTICAL across batch sizes")
print(f"Repetitions per batch size: {num_repetitions}")
print(f"{'='*60}\n")

for bs in batch_sizes:
    print(f"Collecting batch_size={bs} ({num_repetitions} repetitions)...")
    if bs == 1:
        print(f"  Batch: [{longest_name}]")
    elif bs == 2:
        print(f"  Batch: [{longest_name}, dummy1] (extracting from elem 0)")
    else:
        print(f"  Batch: [{longest_name}, dummy1, dummy2, dummy3] (extracting from elem 0)")
    
    runs = []
    for rep in range(num_repetitions):
        key_vec = collect_key_vectors(model, tokenizer, base_prompt, dummy_prompts, batch_size=bs, device="cuda")
        runs.append(key_vec)
        if rep == 0:
            print(f"  Rep 0: norm={torch.norm(key_vec).item():.6f}, first_val={key_vec[0].item():.6f}")
        if (rep + 1) % 3 == 0:
            print(f"  Completed {rep + 1}/{num_repetitions} repetitions")
    
    # Check repeatability
    first_rep = runs[0]
    all_identical = all(torch.equal(first_rep, runs[i]) for i in range(1, num_repetitions))
    if all_identical:
        print(f"  ✓ All {num_repetitions} repetitions are identical (expected)")
    else:
        print(f"  ⚠ Repetitions vary (unexpected!)")
    
    results[bs] = torch.stack(runs)
    all_key_vectors[f"batch_size_{bs}"] = results[bs].float().numpy().tolist()
    
    mean_key = results[bs].mean(dim=0)
    deviations = torch.stack([torch.norm(results[bs][i] - mean_key) for i in range(num_repetitions)])
    std_noise = deviations.std().item()
    mean_noise = deviations.mean().item()
    
    print(f"  Statistical noise: mean={mean_noise:.6f}, std={std_noise:.6f}")
    print(f"  Key vector norm: {torch.norm(mean_key).item():.2f}\n")
    
    torch.cuda.empty_cache()

# Compare systematic deviations
print("\n" + "="*60)
print("=== SYSTEMATIC DEVIATION MATRIX (Key Vectors) ===")
print("="*60)
print("     ", end="")
for bs in batch_sizes:
    print(f"bs={bs:2d}  ", end="")
print()

systematic_deviations = {}
for bs1 in batch_sizes:
    print(f"bs={bs1:2d} ", end="")
    for bs2 in batch_sizes:
        if bs1 == bs2:
            print("  -    ", end="")
        else:
            mean1 = results[bs1].mean(dim=0)
            mean2 = results[bs2].mean(dim=0)
            l2 = torch.norm(mean1 - mean2).item()
            systematic_deviations[f"bs{bs1}_vs_bs{bs2}"] = l2
            print(f"{l2:6.3f} ", end="")
    print()

print("\n" + "="*60)
print("=== KEY VECTOR ANALYSIS ===")
print("="*60)

bs1_mean = results[1].mean(dim=0)
bs2_mean = results[2].mean(dim=0)

print(f"Key vector dimension: {bs1_mean.shape[0]}")
print(f"bs=1 mean key norm: {torch.norm(bs1_mean).item():.2f}")
print(f"bs=2 mean key norm: {torch.norm(bs2_mean).item():.2f}")
print(f"L2 distance (bs1 vs bs2): {torch.norm(bs1_mean - bs2_mean).item():.4f}")
if torch.norm(bs1_mean) > 0:
    print(f"Relative difference: {(torch.norm(bs1_mean - bs2_mean) / torch.norm(bs1_mean)).item():.6f}")

diff = (bs1_mean - bs2_mean).abs()
print(f"Max absolute difference: {diff.max().item():.6f}")
print(f"Dimensions with |diff| > 0.01: {(diff > 0.01).sum().item()}/{diff.shape[0]}")

# CRITICAL CHECK
bs1_vs_bs2_deviation = systematic_deviations.get("bs1_vs_bs2", 0)
print("\n" + "="*60)
print("=== VERDICT ===")
print("="*60)
print(f"Element 0 input: '{longest_name}' sequence (IDENTICAL across batch sizes)")
print(f"Extraction method: Concatenated KEY vectors from all attention heads")
print(f"bs1 vs bs2 deviation: {bs1_vs_bs2_deviation:.6f}\n")

if bs1_vs_bs2_deviation > 0.1:
    print(f"✓ DETECTION VIABLE: L2={bs1_vs_bs2_deviation:.4f}")
    print(f"  Key vectors show systematic deviation from batch size")
    print(f"  → Forensics CAN detect hidden batch capacity using keys")
elif bs1_vs_bs2_deviation > 0.001:
    print(f"⚠ WEAK SIGNAL: L2={bs1_vs_bs2_deviation:.6f}")
    print(f"  Small but detectable effect in key vectors")
    print(f"  → Forensics might work with careful analysis")
else:
    print(f"✗ NO DETECTION: L2={bs1_vs_bs2_deviation:.6f}")
    print(f"  Key vectors show no batch size effect")
    print(f"  → Cannot detect hidden capacity using key vectors")

# Save results
output = {
    "experiment": "H100_key_vector_forensics",
    "timestamp": datetime.now().isoformat(),
    "model": model_name,
    "hardware": {
        "gpu": torch.cuda.get_device_name(0),
        "pytorch": torch.__version__,
        "cuda": torch.version.cuda,
        "hostname": HOSTNAME,
        "container_id": CONTAINER_ID
    },
    "environment": env_vars,
    "config": {
        "batch_sizes": batch_sizes,
        "repetitions": num_repetitions,
        "operation": "single_forward_pass_key_vectors",
        "dtype": "bfloat16",
        "use_cache": True,
        "extraction_method": "concatenated_keys_last_layer_all_heads",
        "input_strategy": "diverse_realistic_sequences",
        "sequence_0": longest_name,
        "sequence_0_tokens": token_counts[longest_name],
        "all_sequence_tokens": token_counts,
        "key_vector_dim": int(bs1_mean.shape[0])
    },
    "statistical_noise": {
        f"batch_size_{bs}": {
            "mean": float(torch.stack([torch.norm(results[bs][i] - results[bs].mean(dim=0)) 
                                       for i in range(num_repetitions)]).mean()),
            "std": float(torch.stack([torch.norm(results[bs][i] - results[bs].mean(dim=0)) 
                                     for i in range(num_repetitions)]).std())
        }
        for bs in batch_sizes
    },
    "systematic_deviations": systematic_deviations,
    "key_vector_norms": {
        f"batch_size_{bs}": float(torch.norm(results[bs].mean(dim=0))) 
        for bs in batch_sizes
    },
    "forensics_result": {
        "bs1_vs_bs2_deviation": bs1_vs_bs2_deviation,
        "detection_viable": bs1_vs_bs2_deviation > 0.1
    },
    "raw_key_vectors": all_key_vectors
}

output_file = f"{torch.cuda.get_device_name(0).replace(' ', '_')}_key_vectors_exp{EXP_NUMBER}_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json"
output_path = f"/workspace/{output_file}"

with open(output_path, "w") as f:
    json.dump(output, f, indent=2)

print(f"\n✓ Results saved to {output_path}")
print(f"✓ File size: {os.path.getsize(output_path) / 1024:.1f} KB")
print("\n" + "="*60)
print("KEY VECTOR TEST COMPLETE")
print("="*60)
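
The two quantities reported by the cell above can be illustrated without a GPU. What follows is a minimal pure-Python sketch, not the notebook's code: `l2`, `mean_vector`, `repetition_noise`, and the toy `runs_bs1`/`runs_bs2` values are hypothetical names and made-up numbers. It only shows how repetition noise (the spread of each run around its own mean) differs from systematic deviation (the distance between per-batch-size means).

```python
import math
import statistics

def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_vector(runs):
    """Element-wise mean over a list of equal-length vectors."""
    n = len(runs)
    return [sum(col) / n for col in zip(*runs)]

def repetition_noise(runs):
    """Mean and sample std of each run's L2 distance to the mean vector."""
    center = mean_vector(runs)
    dists = [l2(r, center) for r in runs]
    return statistics.mean(dists), statistics.stdev(dists)

# Toy data (made up): repetitions are identical within each batch size,
# but one coordinate is offset between batch sizes.
runs_bs1 = [[1.0, 2.0, 3.0]] * 3
runs_bs2 = [[1.0, 2.0, 3.5]] * 3

noise_mean, noise_std = repetition_noise(runs_bs1)
systematic = l2(mean_vector(runs_bs1), mean_vector(runs_bs2))
print(noise_mean, noise_std, systematic)
```

In this toy case the repetition noise is zero while the systematic deviation is not, which is the pattern the experiment is looking for: deterministic results within a batch size, and a shift between batch sizes.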

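The indexing used in `collect_key_vectors` can be traced on a toy cache. This is a sketch with nested Python lists standing in for tensors; `last_valid_index` and `flatten_heads` are hypothetical helpers. It assumes right padding (a mask of the form 1...1 0...0), under which `attention_mask.sum() - 1` is the last real token. In the notebook, element 0 is the longest prompt and carries no padding, so the padding side does not change the extracted position there.

```python
def last_valid_index(mask_row):
    """Index of the last real token, assuming right padding (1s then 0s)."""
    return sum(mask_row) - 1

def flatten_heads(keys, batch_idx, pos):
    """Concatenate every head's key at `pos`: [heads, head_dim] -> [heads * head_dim]."""
    flat = []
    for head in keys[batch_idx]:   # iterate over key/value heads
        flat.extend(head[pos])     # head[pos] is one vector of length head_dim
    return flat

# Toy cache with shape [batch=1, heads=2, seq_len=3, head_dim=2] (made-up values).
keys = [[
    [[0, 1], [2, 3], [4, 5]],      # head 0
    [[6, 7], [8, 9], [10, 11]],    # head 1
]]
mask = [[1, 1, 0]]                 # third position is padding

pos = last_valid_index(mask[0])
key_vector = flatten_heads(keys, 0, pos)
print(pos, key_vector)
```

The flattened vector keeps head order, so head 0's components come first, matching the `reshape(-1)` on a `[num_key_value_heads, head_dim]` slice.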

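The three-way verdict reduces to two thresholds on the bs=1 vs bs=2 L2 distance. Below is a minimal sketch with `classify_deviation` as a hypothetical helper name; the cutoffs 0.1 and 0.001 are the ones hard-coded in the cell above. They are absolute values, not normalized by the key-vector norm, so they would likely need re-tuning for a different model or layer.

```python
def classify_deviation(l2_distance, strong=0.1, weak=0.001):
    """Map an L2 distance to the verdict tiers used in the cell above.

    Hypothetical helper: the default thresholds mirror the hard-coded
    0.1 and 0.001 cutoffs, and comparisons are strict as in the original.
    """
    if l2_distance > strong:
        return "detection_viable"
    if l2_distance > weak:
        return "weak_signal"
    return "no_detection"

# Example distances (made up) covering each tier.
for d in (0.5, 0.01, 0.0):
    print(d, classify_deviation(d))
```

Dividing the distance by the bs=1 norm before thresholding, as the "Relative difference" line in the cell already computes, is one way to make the cutoffs less model-specific.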

System Info:
  Hostname: 3611dce208ac
  Container: 3611dce208ac

Environment Variables:
  CUDA_VERSION=12.8.1
  NCCL_VERSION=2.25.1-1
  NVIDIA_REQUIRE_CUDA=cuda>=12.8 brand=unknown,driver>=470,driver<471 brand=grid,driver>=470,driver<471 brand=tesla,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=vapps,driver>=470,driver<471 brand=vpc,driver>=470,driver<471 brand=vcs,driver>=470,driver<471 brand=vws,driver>=470,driver<471 brand=cloudgaming,driver>=470,driver<471 brand=unknown,driver>=535,driver<536 brand=grid,driver>=535,driver<536 brand=tesla,driver>=535,driver<536 brand=nvidia,driver>=535,driver<536 brand=quadro,driver>=535,driver<536 brand=quadrortx,driver>=535,driver<536 brand=nvidiartx,driver>=535,driver<536 brand=vapps,driver>=535,driver<536 brand=vpc,driver>=535,driver<536 brand=vcs,driver>=535,driver<536 brand=vws,driver>=535,driver<536 brand=cloudg

`torch_dtype` is deprecated! Use `dtype` instead!



Tokenizing sequences to determine lengths:
  technical :  216 tokens
  literary  :  204 tokens
  corporate :  193 tokens
  legal     :  211 tokens

Selected 'technical' as sequence 0 (longest, 216 tokens)
Dummy sequences: ['literary', 'corporate', 'legal']
CRITICAL: Extracting KEY VECTORS from last layer, not hidden states

Starting H100 KEY VECTOR TEST at 2025-11-03T15:46:27.973488
Model: Qwen/Qwen2.5-7B-Instruct
Precision: BF16 (bfloat16)
Base prompt tokens: 216
Operation: Single forward pass, extract KEY vectors
CRITICAL: Element 0 input is IDENTICAL across batch sizes
Repetitions per batch size: 10

Collecting batch_size=1 (10 repetitions)...
  Batch: [technical]
  Single: seq_len=216, valid tokens=216
  Key vector: num_heads=4, head_dim=128, total_dim=512
  Rep 0: norm=920.000000, first_val=0.261719
  Single: seq_len=216, valid tokens=216
  Key vector: num_heads=4, head_dim=128, total_dim=512
  Single: seq_len=216, valid tokens=216
  Key vector: num_heads=4, head_dim=128, total_d