In [1]:
#!/usr/bin/env python3
"""
Decode Parallel Matrix Experiment - MULTI-POSITION EXTRACTION

Tests concurrent CUDA stream interference during LONG AUTOREGRESSIVE GENERATION.
Extracts key vectors at MULTIPLE positions of the generated sequence (recomputed in a single forward pass after generation) to detect:
- Transient interference (early tokens differ, late tokens recover)
- Persistent interference (all tokens differ)
- No interference (all tokens bit-exact)

Design:
- 3 reference sequences
- 3 conditions: baseline, light_concurrent, heavy_concurrent
- Each runs LONG generation (150 tokens) to ensure overlap
- Extracts keys from 4 positions: 20%, 50%, 80%, 100% through generation

Usage:
    python decode_parallel_matrix_multiposition.py
"""

import os
os.environ['HF_HOME'] = '/workspace/huggingface_cache'
os.environ['TRANSFORMERS_CACHE'] = '/workspace/huggingface_cache'

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import numpy as np
from datetime import datetime
import json
import socket
import subprocess
import threading
import time

# ============================================================================
# REFERENCE SEQUENCES (long prompts for generation)
# ============================================================================

REFERENCE_SEQUENCES = {
    "ref_technical": """You are a senior research scientist at a leading AI safety institution. Your task is to write a comprehensive technical report analyzing the current state of AI alignment research, potential risks from advanced AI systems, and proposed mitigation strategies.

The report should cover the following areas in depth:

1. Introduction to AI Alignment
   - Historical context and evolution of the field
   - Key terminology and conceptual frameworks
   - Relationship to broader AI safety and governance efforts
   - Current stakeholders and institutional landscape

2. Technical Challenges in AI Alignment
   - The outer alignment problem: specifying correct objectives
   - The inner alignment problem: ensuring mesa-optimizers are aligned
   - Robustness and distributional shift
   - Scalable oversight and interpretability
   - Deceptive alignment and treacherous turns
   - Value learning and inverse reinforcement learning
   - Corrigibility and shutdown problems

3. Current Research Approaches
   - Reinforcement learning from human feedback (RLHF)
   - Constitutional AI and other oversight methods
   - Debate and amplification techniques
   - Interpretability research and mechanistic understanding
   - Formal verification approaches
   - Multi-agent systems and cooperation
   - Impact measures and side-effect minimization

Please begin your response with an executive summary.""",
    
    "ref_narrative": """You are a historian specializing in the development of artificial intelligence. Write a detailed historical narrative tracing the evolution of machine learning from the 1940s to the present day.

Cover the following major periods:

1. Pre-History and Foundations (1940s-1950s)
   - Alan Turing's foundational work and the Turing Test
   - Early cybernetics and information theory
   - The Dartmouth Conference and the birth of AI as a field
   - First neural network models and perceptrons
   - Early optimism and grand predictions

2. The First AI Winter (1970s-1980s)
   - Limitations of early approaches becoming apparent
   - The perceptron limitations and Minsky-Papert critique
   - Expert systems and their promise
   - Funding cuts and reduced interest
   - Lessons learned about overpromising

3. Renaissance and New Approaches (1980s-1990s)
   - Backpropagation and multi-layer networks
   - Statistical approaches and probabilistic reasoning
   - Support vector machines and kernel methods

Please begin with the pre-history period.""",
    
    "ref_code": """You are a principal software architect. Write a comprehensive technical specification for a distributed machine learning training system that can scale to thousands of GPUs.

Address the following components:

1. System Architecture Overview
   - High-level design principles and requirements
   - Microservices vs monolithic considerations
   - Communication protocols and networking stack
   - Fault tolerance and reliability strategies
   - Monitoring and observability infrastructure

2. Distributed Training Orchestration
   - Job scheduling and resource allocation
   - Dynamic scaling and elasticity
   - Preemption and checkpointing mechanisms
   - Priority queuing and fairness policies
   - Multi-tenancy and isolation guarantees

3. Data Pipeline and Storage
   - Distributed filesystem design choices
   - Data preprocessing and augmentation strategies
   - Caching layers and memory hierarchies

Please start with the system architecture overview."""
}

# ============================================================================
# CONFIGURATION
# ============================================================================

CACHE_DIR = '/workspace/huggingface_cache'
MODEL_NAME = "Qwen/Qwen2.5-7B-Instruct"
LAYER_INDICES = [1, 4, 10, 18, 27]  # Model has 28 layers (0-27)
CONDITIONS = ['baseline', 'light_concurrent', 'heavy_concurrent']
GENERATION_TOKENS = 150  # Long generation to ensure overlap

# Extract at these relative positions within the GENERATED portion
# (not absolute positions in full sequence)
EXTRACTION_FRACTIONS = [0.2, 0.5, 0.8, 1.0]  # 20%, 50%, 80%, 100%

# ============================================================================
# CONCURRENT GENERATION CLASS
# ============================================================================

class ConcurrentGeneration:
    """Runs continuous LONG generation on a separate CUDA stream."""
    
    def __init__(self, model, tokenizer, prompt, workload='light'):
        self.model = model
        self.tokenizer = tokenizer
        self.prompt = prompt
        self.workload = workload
        self.running = False
        self.thread = None
        self.start_time = None
        self.end_time = None
        self.generation_count = 0
        self.generation_timestamps = []  # [(start, end), ...]
    
    def _generation_loop(self):
        """Continuously run LONG generation on separate stream."""
        stream = torch.cuda.Stream()
        
        self.start_time = time.perf_counter()
        
        with torch.cuda.stream(stream):
            while self.running:
                gen_start = time.perf_counter()
                
                inputs = self.tokenizer(self.prompt, return_tensors="pt").to("cuda")
                
                # Long generation: 100 tokens (light) or 150 tokens (heavy)
                max_new_tokens = 100 if self.workload == 'light' else 150
                
                with torch.no_grad():
                    _ = self.model.generate(
                        **inputs,
                        max_new_tokens=max_new_tokens,
                        do_sample=False,
                        pad_token_id=self.tokenizer.eos_token_id,
                        use_cache=True
                    )
                
                gen_end = time.perf_counter()
                
                self.generation_timestamps.append((gen_start, gen_end))
                self.generation_count += 1
                
                # No delay - continuous generation for maximum conflict
        
        self.end_time = time.perf_counter()
    
    def start(self):
        """Start concurrent generation in background thread."""
        if self.running:
            return
        
        self.running = True
        self.generation_count = 0
        self.generation_timestamps = []
        self.thread = threading.Thread(target=self._generation_loop, daemon=True)
        self.thread.start()
        
        # Give it time to start and reach steady state
        time.sleep(2.0)
    
    def stop(self):
        """Stop concurrent generation."""
        if not self.running:
            return
        
        self.running = False
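        if self.thread:
            # Wait for the in-flight generation to finish. A short timeout can
            # expire mid-generation, leaving end_time unset so that
            # get_timing_info() returns None.
            self.thread.join()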
        
        # Let GPU settle
        time.sleep(0.5)
        torch.cuda.synchronize()
    
    def get_timing_info(self):
        """Get timing information about concurrent work."""
        if self.start_time and self.end_time:
            duration = self.end_time - self.start_time
            return {
                'duration_sec': duration,
                'generation_count': self.generation_count,
                'generations_per_sec': self.generation_count / duration if duration > 0 else 0,
                'generation_timestamps': self.generation_timestamps
            }
        return None

# ============================================================================
# GPU UTILIZATION MONITORING
# ============================================================================

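def get_gpu_utilization():
    """Get current GPU utilization percentage (first GPU reported by nvidia-smi)."""
    try:
        result = subprocess.run(
            ['nvidia-smi', '--query-gpu=utilization.gpu', '--format=csv,noheader,nounits'],
            capture_output=True,
            text=True,
            timeout=1.0
        )
        # nvidia-smi prints one line per GPU; parse the first so multi-GPU hosts still work
        return float(result.stdout.strip().splitlines()[0])
    except (OSError, subprocess.SubprocessError, ValueError, IndexError):
        # nvidia-smi missing, timed out, or returned unparseable output
        return None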

def monitor_gpu_utilization(duration_sec=1.0, samples=10):
    """Monitor GPU utilization over a period."""
    measurements = []
    interval = duration_sec / samples
    
    for _ in range(samples):
        util = get_gpu_utilization()
        if util is not None:
            measurements.append(util)
        time.sleep(interval)
    
    if not measurements:
        return None
    
    return {
        'mean': np.mean(measurements),
        'p95': np.percentile(measurements, 95),
        'max': np.max(measurements)
    }

# ============================================================================
# GENERATION WITH MULTI-POSITION KEY EXTRACTION
# ============================================================================

def generate_and_extract_keys_multiposition(model, tokenizer, text, num_tokens=150,
                                             extraction_fractions=(0.2, 0.5, 0.8, 1.0),
                                             device="cuda"):
    """
    Generate tokens and extract key vectors from MULTIPLE positions during generation.
    
    Positions are specified as fractions of the generated sequence (not including input).
    
    Returns:
        dict: {position: {layer_name: key_vector_tensor}}
        dict: generation_info
        float: generation_time_ms
        tuple: (start_timestamp, end_timestamp)
    """
    # Tokenize
    inputs = tokenizer(text, return_tensors="pt").to(device)
    input_ids = inputs["input_ids"]
    input_length = input_ids.shape[1]
    
    # Calculate absolute positions to extract
    # extraction_fractions are relative to GENERATED portion
    # If we generate 150 tokens with fractions [0.2, 0.5, 0.8, 1.0]:
    # - 0.2 = token 30 (input_length + 29)
    # - 0.5 = token 75 (input_length + 74)
    # - 0.8 = token 120 (input_length + 119)
    # - 1.0 = token 150 (input_length + 149)
    
    extraction_positions = []
    for frac in extraction_fractions:
        # Index relative to the generated portion (0-indexed); round() guards
        # against floating-point error in num_tokens * frac
        generated_idx = round(num_tokens * frac) - 1
        # Convert to absolute position in full sequence
        absolute_pos = input_length + generated_idx
        extraction_positions.append(absolute_pos)
        
    print(f"    Input length: {input_length} tokens")
    print(f"    Will generate: {num_tokens} tokens")
    print(f"    Extraction positions (absolute): {extraction_positions}")
    print(f"    Extraction positions (generated): {[p - input_length + 1 for p in extraction_positions]}")
    
    # Time the generation
    torch.cuda.synchronize()
    start_time = time.perf_counter()
    
    # Generate with cache
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=num_tokens,
            do_sample=False,
            pad_token_id=tokenizer.eos_token_id,
            use_cache=True,
            output_hidden_states=False,
            return_dict_in_generate=True
        )
    
    torch.cuda.synchronize()
    end_time = time.perf_counter()
    generation_time_ms = (end_time - start_time) * 1000
    
    # Get the full sequence (input + generated)
    full_sequence = outputs.sequences[0]
    actual_length = full_sequence.shape[0]
    tokens_generated = actual_length - input_length
    
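    # Run ONE MORE forward pass over the full sequence to rebuild the KV cache.
    # Note: the extracted keys come from this single pass, not from the
    # incremental decode steps inside generate().
    with torch.no_grad():
        final_outputs = model(
            input_ids=full_sequence.unsqueeze(0),
            output_hidden_states=False,  # hidden states are not used below
            use_cache=True
        )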
    
    past_key_values = final_outputs.past_key_values
    
    # Extract keys from MULTIPLE positions
    multi_position_keys = {}
    
    for pos_idx, absolute_pos in enumerate(extraction_positions):
        if absolute_pos >= actual_length:
            print(f"    Warning: Position {absolute_pos} exceeds sequence length {actual_length}, using last token")
            absolute_pos = actual_length - 1
        
        position_label = f"pos_{int(extraction_fractions[pos_idx]*100)}pct"
        multi_position_keys[position_label] = {}
        
        for layer_idx in LAYER_INDICES:
            # past_key_values[layer_idx] = (keys, values)
            # keys shape: [batch_size, num_heads, seq_len, head_dim]
            layer_keys = past_key_values[layer_idx][0]
            
            # Extract specified token position: [:, :, absolute_pos, :]
            position_keys = layer_keys[:, :, absolute_pos, :]
            
            # Flatten: [batch_size, num_heads, head_dim] -> [batch_size, num_heads * head_dim]
            flattened = position_keys.reshape(position_keys.shape[0], -1)
            
            # Take batch element 0 and move it to the CPU
            key_vector = flattened[0].cpu()
            
            multi_position_keys[position_label][f'layer_{layer_idx}'] = key_vector
    
    generation_info = {
        'input_length': input_length,
        'tokens_generated': tokens_generated,
        'total_length': actual_length,
        'requested_tokens': num_tokens,
        'extraction_positions_absolute': extraction_positions,
        'extraction_positions_relative': [p - input_length + 1 for p in extraction_positions],
        'extraction_fractions': extraction_fractions,
        'num_positions': len(extraction_positions),
        'num_layers': len(LAYER_INDICES),
        'vector_dimension': key_vector.shape[0]
    }
    
    return multi_position_keys, generation_info, generation_time_ms, (start_time, end_time)

# ============================================================================
# MAIN EXPERIMENT
# ============================================================================

def main():
    print("="*70)
    print("DECODE PARALLEL MATRIX EXPERIMENT - MULTI-POSITION")
    print("="*70)
    print()
    
    # Determine hardware label from GPU name
    gpu_name = torch.cuda.get_device_name(0)
    if 'H100' in gpu_name:
        hardware_label = 'h100'
    elif 'A100' in gpu_name:
        hardware_label = 'a100'
    elif 'L40S' in gpu_name:
        hardware_label = 'l40s'
    else:
        hardware_label = 'unknown'
    
    print(f"GPU: {gpu_name}")
    print(f"PyTorch: {torch.__version__}")
    print(f"CUDA: {torch.version.cuda}")
    print()
    
    # System info
    hostname = socket.gethostname()
    
    # Load model
    print("Loading model...")
    print()
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, cache_dir=CACHE_DIR)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME,
        dtype=torch.bfloat16,
        cache_dir=CACHE_DIR,
        low_cpu_mem_usage=True,
        device_map="auto"
    )
    model.eval()
    print("Model loaded")
    print()
    
    # Show extraction strategy
    print("MULTI-POSITION EXTRACTION STRATEGY:")
    print(f"  Extracting keys at {len(EXTRACTION_FRACTIONS)} positions during generation:")
    for frac in EXTRACTION_FRACTIONS:
        token_num = int(frac * GENERATION_TOKENS)
        print(f"    - {int(frac*100)}% through generation (token ~{token_num}/{GENERATION_TOKENS})")
    print()
    
    # Prepare results structure
    results = {
        'metadata': {
            'hardware': hardware_label,
            'hostname': hostname,
            'gpu': gpu_name,
            'pytorch_version': torch.__version__,
            'cuda_version': torch.version.cuda,
            'model': MODEL_NAME,
            'layer_indices': LAYER_INDICES,
            'conditions_tested': CONDITIONS,
            'num_reference_sequences': len(REFERENCE_SEQUENCES),
            'generation_tokens': GENERATION_TOKENS,
            'extraction_fractions': EXTRACTION_FRACTIONS,
            'num_extraction_positions': len(EXTRACTION_FRACTIONS),
            'timestamp': datetime.now().isoformat(),
            'experiment_type': 'decode_parallel_matrix_multiposition'
        },
        'measurements': {}
    }
    
    # Store all measurements for forensic analysis
    all_measurements = {}
    
    # Run all combinations
    for ref_idx, (ref_name, ref_text) in enumerate(REFERENCE_SEQUENCES.items(), 1):
        print("="*70)
        print(f"REFERENCE {ref_idx}/{len(REFERENCE_SEQUENCES)}: {ref_name.upper()}")
        print("="*70)
        print()
        
        ref_measurements = {}
        
        for condition_idx, condition in enumerate(CONDITIONS, 1):
            print(f"Condition {condition_idx}/{len(CONDITIONS)}: {condition.upper()}")
            print("-" * 70)
            
            measurement_name = f"{ref_name}_{condition}"
            
            # Setup concurrent workload if needed
            concurrent = None
            concurrent_prompt = list(REFERENCE_SEQUENCES.values())[0]  # Use consistent prompt
            
            if condition in ['light_concurrent', 'heavy_concurrent']:
                workload = condition.split('_')[0]  # 'light' or 'heavy'
                concurrent = ConcurrentGeneration(model, tokenizer, concurrent_prompt, workload=workload)
                concurrent.start()
                print("  [HIDDEN] Concurrent generation started (long decode)")
                print(f"  [HIDDEN] Generating {100 if workload == 'light' else 150} tokens continuously")
            
            print(f"\n[{condition.upper()}] Starting generation...")
            
            # Run generation (3 repetitions for reproducibility check)
            runs = []
            all_info = []
            all_times = []
            measurement_timestamps = []  # Store (start, end) for each rep
            gpu_utils = []
            measurement_start = time.perf_counter()
            
            for rep in range(3):
                # Sample GPU utilization just before generation (it is sampled again
                # afterwards; nothing is polled while generate() is running)
                gpu_util_before = monitor_gpu_utilization(duration_sec=0.5, samples=5)
                
                multi_pos_keys, generation_info, generation_time, timestamps = generate_and_extract_keys_multiposition(
                    model, tokenizer, ref_text, 
                    num_tokens=GENERATION_TOKENS,
                    extraction_fractions=EXTRACTION_FRACTIONS
                )
                
                gpu_util_after = monitor_gpu_utilization(duration_sec=0.5, samples=5)
                
                runs.append(multi_pos_keys)
                all_info.append(generation_info)
                all_times.append(generation_time)
                measurement_timestamps.append(timestamps)
                
                # Combine before/after samples: average the means, keep the larger P95 and max
                if gpu_util_before and gpu_util_after:
                    avg_util = {
                        'mean': (gpu_util_before['mean'] + gpu_util_after['mean']) / 2,
                        'p95': max(gpu_util_before['p95'], gpu_util_after['p95']),
                        'max': max(gpu_util_before['max'], gpu_util_after['max'])
                    }
                    gpu_utils.append(avg_util)
            
            measurement_end = time.perf_counter()
            measurement_duration = measurement_end - measurement_start
            
            # Timing statistics
            mean_time = np.mean(all_times)
            std_time = np.std(all_times)
            print(f"[{condition.upper()}] Generation time: {mean_time:.2f}ms ± {std_time:.2f}ms")
            print(f"[{condition.upper()}] Total measurement duration: {measurement_duration:.1f}s")
            
            # GPU utilization
            if gpu_utils:
                avg_gpu_util = {
                    'mean': np.mean([u['mean'] for u in gpu_utils]),
                    'p95': np.mean([u['p95'] for u in gpu_utils]),
                    'max': max([u['max'] for u in gpu_utils])
                }
                print(f"[{condition.upper()}] GPU util: {avg_gpu_util['p95']:.1f}% (P95)")
            else:
                avg_gpu_util = None
            
            # Stop concurrent workload and get its info
            concurrent_info = None
            if concurrent:
                concurrent.stop()
                print("  [HIDDEN] Concurrent generation stopped")
                concurrent_info = concurrent.get_timing_info()
                if concurrent_info:
                    print(f"  [HIDDEN] Ran for {concurrent_info['duration_sec']:.1f}s")
                    print(f"  [HIDDEN] Completed {concurrent_info['generation_count']} generations")
                    print(f"  [HIDDEN] Rate: {concurrent_info['generations_per_sec']:.2f} gen/sec")
            
            # Check reproducibility PER POSITION
            position_reproducibility = {}
            for position_label in runs[0].keys():
                position_reproducible = True
                for layer_name in runs[0][position_label].keys():
                    for i in range(1, len(runs)):
                        if not torch.equal(runs[0][position_label][layer_name], 
                                         runs[i][position_label][layer_name]):
                            position_reproducible = False
                            break
                    if not position_reproducible:
                        break
                position_reproducibility[position_label] = position_reproducible
            
            # Print reproducibility summary
            all_positions_reproducible = all(position_reproducibility.values())
            if all_positions_reproducible:
                print(f"[{condition.upper()}] Reproducibility: ✓ BIT-EXACT (all positions)")
            else:
                print(f"[{condition.upper()}] Reproducibility: ⚠ VARIES BY POSITION")
                for pos_label, is_repro in position_reproducibility.items():
                    status = "✓" if is_repro else "✗"
                    print(f"  {status} {pos_label}")
            
            print()
            
            # Store measurement
            measurement = {
                'condition': condition,
                'reference_sequence': ref_name,
                'reproducible_by_position': position_reproducibility,
                'all_positions_reproducible': all_positions_reproducible,
                'num_repetitions': 3,
                'generation_info': all_info[0],
                'timing': {
                    'mean_ms': float(mean_time),
                    'std_ms': float(std_time),
                    'all_times_ms': [float(t) for t in all_times],
                    'measurement_duration_sec': float(measurement_duration),
                    'measurement_timestamps': measurement_timestamps
                },
                'gpu_utilization': avg_gpu_util,
                'concurrent_info': concurrent_info,
                'positions': {}
            }
            
            # Store key vectors for each position as lists for JSON serialization
            for position_label in runs[0].keys():
                measurement['positions'][position_label] = {
                    'layers': {}
                }
                for layer_name, key_vector in runs[0][position_label].items():
                    measurement['positions'][position_label]['layers'][layer_name] = {
                        'key_vector': key_vector.tolist(),
                        'shape': list(key_vector.shape)
                    }
            
            results['measurements'][measurement_name] = measurement
            ref_measurements[condition] = measurement
            
            # Clear GPU cache
            torch.cuda.empty_cache()
        
        # Store for forensic analysis
        all_measurements[ref_name] = ref_measurements
        
        # Perform forensic analysis for this reference
        print("="*70)
        print(f"FORENSIC ANALYSIS: {ref_name.upper()}")
        print("="*70)
        print()
        
        baseline_measurement = ref_measurements['baseline']
        
        print("FP FORENSICS - BY POSITION")
        print("-" * 70)
        
        for test_condition in ['light_concurrent', 'heavy_concurrent']:
            test_measurement = ref_measurements[test_condition]
            
            print(f"\n{test_condition.upper()}:")
            
            # Check each position
            for position_label in baseline_measurement['positions'].keys():
                print(f"\n  {position_label.upper()}:")
                
                for layer_idx in LAYER_INDICES:
                    layer_name = f'layer_{layer_idx}'
                    
                    baseline_vec = np.array(
                        baseline_measurement['positions'][position_label]['layers'][layer_name]['key_vector']
                    )
                    test_vec = np.array(
                        test_measurement['positions'][position_label]['layers'][layer_name]['key_vector']
                    )
                    
                    bit_exact = np.array_equal(baseline_vec, test_vec)
                    
                    if bit_exact:
                        print(f"    {layer_name}: ✓ BIT-EXACT")
                    else:
                        l2 = np.linalg.norm(baseline_vec - test_vec)
                        print(f"    {layer_name}: ✗ DIFFERS (L2={l2:.6e})")
        
        print()
        print("TIMING FORENSICS")
        print("-" * 70)
        
        baseline_time = baseline_measurement['timing']['mean_ms']
        baseline_gpu = baseline_measurement['gpu_utilization']
        
        print(f"\nBaseline: {baseline_time:.2f}ms")
        if baseline_gpu:
            print(f"  GPU util: {baseline_gpu['p95']:.1f}%")
        
        for test_condition in ['light_concurrent', 'heavy_concurrent']:
            test_measurement = ref_measurements[test_condition]
            test_time = test_measurement['timing']['mean_ms']
            test_gpu = test_measurement['gpu_utilization']
            
            slowdown_ms = test_time - baseline_time
            slowdown_pct = (slowdown_ms / baseline_time) * 100
            
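            # Heuristic significance check: |difference| > 2x the combined std.
            # With only 3 repetitions per condition this is indicative, not a formal test.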
            baseline_std = baseline_measurement['timing']['std_ms']
            test_std = test_measurement['timing']['std_ms']
            combined_std = np.sqrt(baseline_std**2 + test_std**2)
            significant = abs(slowdown_ms) > (2 * combined_std)
            
            print(f"\n{test_condition.upper()}:")
            print(f"  Time: {test_time:.2f}ms (baseline: {baseline_time:.2f}ms)")
            print(f"  Slowdown: {slowdown_ms:+.2f}ms ({slowdown_pct:+.1f}%)")
            print(f"  Significant: {'YES' if significant else 'NO'}")
            if test_gpu:
                print(f"  GPU util: {test_gpu['p95']:.1f}%")
        
        print()
        print("TEMPORAL OVERLAP VERIFICATION")
        print("-" * 70)
        
        for test_condition in ['light_concurrent', 'heavy_concurrent']:
            test_measurement = ref_measurements[test_condition]
            concurrent_info = test_measurement.get('concurrent_info')
            measurement_timestamps = test_measurement['timing'].get('measurement_timestamps', [])
            
            if concurrent_info and measurement_timestamps:
                print(f"\n{test_condition.upper()}:")
                
                concurrent_timestamps = concurrent_info.get('generation_timestamps', [])
                
                # Check if any concurrent generations overlapped with measurements
                overlaps = []
                for meas_start, meas_end in measurement_timestamps:
                    for conc_start, conc_end in concurrent_timestamps:
                        # Check for any temporal overlap
                        if not (meas_end < conc_start or meas_start > conc_end):
                            overlap_start = max(meas_start, conc_start)
                            overlap_end = min(meas_end, conc_end)
                            overlap_duration = overlap_end - overlap_start
                            meas_duration = meas_end - meas_start
                            overlap_fraction = overlap_duration / meas_duration
                            
                            overlaps.append({
                                'measurement': (meas_start, meas_end),
                                'concurrent': (conc_start, conc_end),
                                'overlap_duration': overlap_duration,
                                'overlap_fraction': overlap_fraction
                            })
                
                print(f"  Measurement generations: {len(measurement_timestamps)}")
                print(f"  Concurrent generations: {len(concurrent_timestamps)}")
                print(f"  Overlapping pairs: {len(overlaps)}")
                
                if overlaps:
                    # Compute overlap statistics
                    durations = [o['overlap_duration'] for o in overlaps]
                    fractions = [o['overlap_fraction'] for o in overlaps]
                    
                    print(f"  ✓ CONFIRMED: {len(overlaps)} measurement/concurrent pairs overlapped")
                    print(f"  Overlap duration: {np.mean(durations):.2f}s (mean), {np.max(durations):.2f}s (max)")
                    print(f"  Overlap fraction: {np.mean(fractions)*100:.1f}% (mean), {np.max(fractions)*100:.1f}% (max)")
                else:
                    print(f"  ✗ WARNING: No temporal overlap detected!")
        
        print()
        print()
    
    # Save results
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    filename = f"{hardware_label}_decode_parallel_multipos_{timestamp}.json"
    
    print("="*70)
    print("SAVING RESULTS")
    print("="*70)
    
    with open(filename, 'w') as f:
        json.dump(results, f, indent=2)
    
    print()
    print(f"[SAVED] {filename}")
    print(f"        (Contains {len(results['measurements'])} measurements with multi-position data)")
    print()
    
    # Summary
    print("="*70)
    print("EXPERIMENT COMPLETE")
    print("="*70)
    print()
    print(f"Hardware: {hardware_label.upper()}")
    print(f"Total measurements: {len(results['measurements'])}")
    print(f"References tested: {len(REFERENCE_SEQUENCES)}")
    print(f"Conditions tested: {len(CONDITIONS)}")
    print(f"Generation length: {GENERATION_TOKENS} tokens")
    print(f"Extraction positions: {len(EXTRACTION_FRACTIONS)} per measurement")
    print()
    print("Key innovation:")
    print("  • Extracts keys at MULTIPLE positions during generation")
    print("  • Can detect TRANSIENT vs PERSISTENT interference")
    print("  • If early positions differ but late ones match → transient interference")
    print("  • If all positions differ → persistent interference")
    print("  • If all positions match → no FP interference (timing-only detection)")
    print()
    print("="*70)

if __name__ == "__main__":
    main()
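
The fraction-to-index mapping used for key extraction can be checked in isolation, without loading the model. The following is a minimal sketch, not part of the script: the helper name `fractions_to_positions` is hypothetical, and it uses `round` so that floating-point error in `num_tokens * frac` cannot shift an index by one.

```python
def fractions_to_positions(input_length, num_tokens, fractions):
    """Map fractions of the generated portion to absolute token positions.

    A fraction f selects the f * num_tokens-th generated token (1-indexed),
    which sits at absolute index input_length + f * num_tokens - 1.
    """
    positions = []
    for frac in fractions:
        # round() instead of int() so 28.999999999999996 becomes 29, not 28
        generated_idx = round(num_tokens * frac) - 1
        positions.append(input_length + generated_idx)
    return positions


# Values taken from the baseline log: 258 prompt tokens, 150 generated tokens
print(fractions_to_positions(258, 150, [0.2, 0.5, 0.8, 1.0]))
# [287, 332, 377, 407]
```

The printed positions agree with the "Extraction positions (absolute)" line in the run output.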

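
The temporal overlap verification compares each measurement window against every completed concurrent generation. A small sketch of the same interval arithmetic, under the assumption that concurrent runs do not overlap one another (they are produced sequentially by one thread); the names `interval_overlap` and `covered_fraction` are hypothetical and do not appear in the script.

```python
def interval_overlap(a, b):
    """Return the overlap in seconds between two (start, end) intervals, or 0.0."""
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    return max(end - start, 0.0)


def covered_fraction(measurement, concurrent_runs):
    """Fraction of the measurement window covered by concurrent runs.

    Assumes concurrent_runs are mutually non-overlapping, so their
    individual overlaps with the measurement can simply be summed.
    """
    duration = measurement[1] - measurement[0]
    if duration <= 0:
        return 0.0
    covered = sum(interval_overlap(measurement, run) for run in concurrent_runs)
    return min(covered / duration, 1.0)


# Illustrative timestamps only (seconds on an arbitrary perf_counter clock)
runs = [(8.0, 11.0), (11.0, 13.0), (15.0, 16.0)]
print(covered_fraction((10.0, 14.0), runs))  # 0.75
```

Summing coverage per measurement gives a single number per repetition, which may be easier to read than per-pair fractions when several short concurrent generations fall inside one measurement.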

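
The module docstring distinguishes transient, persistent, and no interference by comparing early and late extraction positions. One way to turn the per-position results into a single label is sketched below. This is an assumption about how the decision rule could be coded, not something the script does: `classify_interference` and the `"mixed"` fallback are hypothetical, and the input is assumed to map position labels (in generation order) to whether every sampled layer was bit-exact with the baseline.

```python
def classify_interference(matches_baseline):
    """Classify an interference pattern from per-position bit-exactness.

    matches_baseline: dict of position label -> bool, ordered from the
    earliest to the latest extraction position (dicts keep insertion order).
    """
    flags = list(matches_baseline.values())
    if not flags:
        raise ValueError("no positions to classify")
    if all(flags):
        return "none"          # every position bit-exact
    if not any(flags):
        return "persistent"    # every position differs
    # Transient: a leading run of mismatches, then all later positions match
    first_match = flags.index(True)
    if first_match > 0 and all(flags[first_match:]):
        return "transient"
    return "mixed"             # any other pattern, e.g. late-onset differences


example = {
    "pos_20pct": False,
    "pos_50pct": False,
    "pos_80pct": True,
    "pos_100pct": True,
}
print(classify_interference(example))  # transient
```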

DECODE PARALLEL MATRIX EXPERIMENT - MULTI-POSITION

GPU: NVIDIA A100 80GB PCIe
PyTorch: 2.8.0+cu128
CUDA: 12.8

Loading model...

Model loaded

MULTI-POSITION EXTRACTION STRATEGY:
  Extracting keys at 4 positions during generation:
    - 20% through generation (token ~30/150)
    - 50% through generation (token ~75/150)
    - 80% through generation (token ~120/150)
    - 100% through generation (token ~150/150)

REFERENCE 1/3: REF_TECHNICAL

Condition 1/3: BASELINE
----------------------------------------------------------------------

[BASELINE] Starting generation...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


    Input length: 258 tokens
    Will generate: 150 tokens
    Extraction positions (absolute): [287, 332, 377, 407]
    Extraction positions (generated): [30, 75, 120, 150]
    Input length: 258 tokens
    Will generate: 150 tokens
    Extraction positions (absolute): [287, 332, 377, 407]
    Extraction positions (generated): [30, 75, 120, 150]
    Input length: 258 tokens
    Will generate: 150 tokens
    Extraction positions (absolute): [287, 332, 377, 407]
    Extraction positions (generated): [30, 75, 120, 150]
[BASELINE] Generation time: 3510.87ms ± 227.73ms
[BASELINE] Total measurement duration: 14.5s
[BASELINE] GPU util: 61.8% (P95)
[BASELINE] Reproducibility: ✓ BIT-EXACT (all positions)

Condition 2/3: LIGHT_CONCURRENT
----------------------------------------------------------------------
  [HIDDEN] Concurrent generation started (long decode)
  [HIDDEN] Generating 100 tokens continuously

[LIGHT_CONCURRENT] Starting generation...
    Input length: 258 tokens
    Will generate: