# Cross-Hardware Verification: Second CUDA Stream

## Objective
Test whether a workload running in parallel on a second CUDA stream is detectable in extracted key vectors, and whether the effect holds across different hardware.

## Methodology
- Collect key vectors from the KV cache at selected layers (sparse sampling: every 5th token, counting back from the last token)
- Three conditions per hardware:
  1. Baseline (no parallel stream)
  2. Light parallel workload (short concurrent inference)
  3. Heavy parallel workload (long concurrent inference)
- 3 repetitions per condition to check reproducibility
- Expected: baseline runs are bit-exact across repetitions; runs with a parallel stream show small numerical differences (statistical noise)
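
The sparse sampling rule from the first bullet can be sketched in isolation. This is a minimal illustration; the helper name `sparse_positions` is hypothetical, and the default stride of 5 follows the methodology above.

```python
def sparse_positions(seq_len: int, stride: int = 5) -> list[int]:
    """Return token indices from the last token backwards, every `stride` tokens."""
    return list(range(seq_len - 1, -1, -stride))

# A 12-token sequence yields the last token, then every 5th token before it
print(sparse_positions(12))  # [11, 6, 1]
```

Counting from the end keeps the final token in every sample regardless of sequence length.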

## Usage
1. Run this notebook on an A100
2. Run this notebook on an H100
3. Use compare_streams.py to analyze cross-hardware differences between the two result files
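
compare_streams.py is not shown in this notebook. As a rough sketch of what a cross-hardware comparison could look like, the code below computes a per-layer L2 distance between two result files. It assumes the JSON layout written by the save cell (`sequences` → `conditions` → `layers` → `key_vectors`); the function names `l2_distance`, `compare_results`, and `load` are hypothetical and may not match the real script.

```python
import json
import math

def l2_distance(a, b):
    """L2 distance between two equally shaped 2-D lists of floats."""
    return math.sqrt(sum((x - y) ** 2
                         for row_a, row_b in zip(a, b)
                         for x, y in zip(row_a, row_b)))

def compare_results(res_a, res_b, condition="baseline"):
    """Per-sequence, per-layer L2 distance between two result dicts."""
    out = {}
    for seq_name, seq_a in res_a["sequences"].items():
        seq_b = res_b["sequences"].get(seq_name)
        if seq_b is None:
            continue  # sequence missing from the second file
        layers_a = seq_a["conditions"][condition]["layers"]
        layers_b = seq_b["conditions"][condition]["layers"]
        out[seq_name] = {
            name: l2_distance(layers_a[name]["key_vectors"],
                              layers_b[name]["key_vectors"])
            for name in layers_a if name in layers_b
        }
    return out

def load(path):
    """Read one result file produced by this notebook."""
    with open(path) as f:
        return json.load(f)
```

Typical use would be `compare_results(load(path_a), load(path_b))`, where the two paths point at the A100 and H100 result files.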

In [None]:
import os
os.environ['HF_HOME'] = '/workspace/huggingface_cache'
os.environ['TRANSFORMERS_CACHE'] = '/workspace/huggingface_cache'

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import numpy as np
from datetime import datetime
import json
import socket
import threading
import time
import subprocess

## Configuration

In [None]:
CACHE_DIR = '/workspace/huggingface_cache'
MODEL_NAME = "Qwen/Qwen2.5-7B-Instruct"
LAYER_INDICES = [1, 4, 10, 18, 28]  # 1-indexed layer numbers sampled across model depth (Qwen2.5-7B has 28 layers)
NUM_REPETITIONS = 3  # Check reproducibility

# Test sequences (10 diverse samples)
SEQUENCES = {
    "technical_1": """Large language models have revolutionized natural language processing through their ability to capture complex patterns in text data. The transformer architecture, introduced in 2017, employs self-attention mechanisms that allow the model to weigh the importance of different tokens in the input sequence. During training, these models learn to predict the next token in a sequence by optimizing a cross-entropy loss function across billions of text examples. The attention mechanism computes query, key, and value matrices for each token, enabling the model to establish contextual relationships across long distances in the text. Modern implementations use various optimization techniques including gradient checkpointing, mixed precision training, and careful learning rate scheduling to manage the computational demands of training models with hundreds of billions of parameters. The pre-training phase typically involves processing massive text corpora, often exceeding trillions of tokens, which requires distributed training across hundreds or thousands of GPUs. Researchers have developed various architectural innovations such as sparse attention patterns, sliding window attention, and mixture-of-experts layers to improve efficiency and scalability.""",
    
    "technical_2": """Quantum computing leverages the principles of quantum mechanics to perform computations that would be intractable for classical computers. At the heart of quantum computation lies the qubit, a quantum bit that can exist in a superposition of both 0 and 1 states simultaneously. When multiple qubits are entangled, they form a quantum register capable of representing an exponentially large state space. Quantum gates manipulate these qubits through carefully controlled electromagnetic pulses, implementing quantum algorithms like Shor's algorithm for factoring large numbers or Grover's algorithm for unstructured search. However, quantum systems are extremely fragile and susceptible to decoherence, where environmental interactions cause the quantum state to collapse into a classical state. To combat this, researchers develop quantum error correction codes that encode logical qubits across multiple physical qubits, enabling detection and correction of errors. Current quantum processors utilize various physical implementations including superconducting circuits cooled to millikelvin temperatures, trapped ions manipulated by laser pulses, and topological qubits that exploit exotic quantum states of matter.""",
    
    "narrative_1": """The morning sun filtered through the ancient oak trees as Sarah walked along the forest path, her boots crunching softly on the fallen leaves. She had been coming to these woods since childhood, when her grandmother first taught her to identify the different bird calls echoing through the canopy. Now, decades later, she found herself returning to this same trail whenever life felt overwhelming. The forest had a way of putting things into perspective, reminding her that human concerns were just a small part of a much larger ecosystem. A red-tailed hawk circled overhead, its sharp eyes scanning the underbrush for movement. Sarah paused to watch it, remembering how her grandmother explained that hawks could see mice moving in the grass from hundreds of feet in the air. The forest was full of such marvels, from the intricate networks of fungal mycelia connecting tree roots underground to the complex social structures of ant colonies beneath her feet. She continued walking, feeling the weight of the past week gradually lifting from her shoulders with each step deeper into the woods.""",
    
    "narrative_2": """The small coastal town had changed dramatically over the past fifty years, transforming from a quiet fishing village into a bustling tourist destination. Marcus remembered when there were only a handful of boats in the harbor, each owned by families who had fished these waters for generations. Now the marina was packed with sleek yachts and charter boats, and the waterfront was lined with restaurants and gift shops catering to summer visitors. His father's old fishing boat was long gone, sold when the fishing industry collapsed in the nineties. Marcus had tried to keep the family tradition alive, but the combination of depleted fish stocks, new regulations, and competition from industrial fishing operations made it impossible to earn a living. Eventually, he had taken a job at one of the new hotels, leading nature tours for tourists who wanted to experience the authentic coastal lifestyle. It was ironic, he thought, that he now made more money showing people what the town used to be like than he ever did when it actually was that way.""",
    
    "code_1": """The database migration system implements a sophisticated version control mechanism for schema changes. Each migration file contains both an upgrade and downgrade function, allowing the system to roll forward or backward through schema versions. The migration engine maintains a table tracking which migrations have been applied, using timestamps and hash values to ensure consistency across different environments. When executing migrations, the system wraps each operation in a transaction to maintain atomicity, though some database operations like adding indexes or modifying column types may require careful handling to avoid long table locks. The migration framework supports multiple database backends through a plugin architecture, with each plugin implementing backend-specific SQL generation. For complex migrations involving data transformations, the system provides hooks for custom Python code to manipulate data between schema changes. Performance considerations are critical when migrating large tables, often requiring strategies like batched updates, temporary tables, or online schema change tools that minimize downtime. The system also includes safeguards against dangerous operations, such as preventing destructive changes in production environments without explicit confirmation flags.""",
    
    "code_2": """The distributed cache implementation uses consistent hashing to partition data across multiple nodes, ensuring that adding or removing nodes only requires redistributing a small fraction of keys. Each cache node maintains both hot and cold storage tiers, with frequently accessed items kept in memory while less popular items are stored on SSD. The eviction policy combines LRU with a frequency-based component, tracking both recency and access count to make intelligent decisions about which items to remove when memory pressure increases. To handle cache stampede scenarios where many requests simultaneously attempt to populate the same expired key, the system implements a locking mechanism that allows only one request to perform the expensive computation while others wait for the result. The cache supports automatic replication for high-availability, with configurable consistency levels ranging from eventual consistency to strong consistency depending on the application requirements. Background processes continuously monitor cache hit rates, eviction patterns, and memory usage, automatically adjusting parameters like cache size allocations and replication factors to optimize performance. The implementation includes comprehensive instrumentation exposing metrics through a monitoring endpoint compatible with standard observability platforms.""",
    
    "mixed_1": """The research team analyzed climate data spanning three decades to identify trends in regional temperature patterns. Dr. Martinez presented their findings at the conference, explaining how machine learning models had helped them detect subtle correlations between ocean temperature anomalies and subsequent weather patterns inland. The neural network architecture they developed processed multiple data streams simultaneously: satellite imagery showing cloud formations, oceanic buoy measurements tracking sea surface temperatures, and historical weather station records dating back over a century. By training the model on this diverse dataset, they achieved prediction accuracy that surpassed traditional physics-based models for forecast horizons between two and six weeks. However, the researchers emphasized that their model complemented rather than replaced conventional meteorological approaches, as the neural network could identify patterns but couldn't explain the underlying physical mechanisms. The team planned to collaborate with atmospheric physicists to interpret the model's predictions and potentially discover new relationships between ocean dynamics and terrestrial climate. Their work had implications beyond academic research, with potential applications in agricultural planning, disaster preparedness, and renewable energy grid management.""",
    
    "mixed_2": """The archaeological expedition uncovered evidence of sophisticated astronomical knowledge among the ancient civilization. Stone tablets inscribed with careful observations of celestial movements suggested that the society had developed accurate calendars and could predict astronomical events like eclipses and planetary conjunctions. Dr. Chen's team used 3D scanning technology to create detailed digital models of the artifacts, allowing researchers worldwide to study the inscriptions without risking damage to the fragile originals. The digital models were processed through optical character recognition algorithms adapted for ancient scripts, which helped identify previously unnoticed patterns in the writing system. Computational analysis revealed that certain symbols appeared to encode numerical information using a base-60 system, similar to that used by other ancient cultures for astronomical calculations. The team collaborated with astronomers to reconstruct the night sky as it would have appeared during the civilization's peak, comparing their observations with the records inscribed on the tablets. This interdisciplinary approach combining archaeology, astronomy, and computer science provided new insights into how ancient peoples understood and tracked celestial phenomena, demonstrating a level of scientific sophistication previously unrecognized for this culture.""",
    
    "repetitive": """The algorithm iterates through the data structure, examining each element to determine whether it matches the specified criteria. For each iteration, the function evaluates multiple conditions, checking first whether the element exists, then whether it satisfies the primary constraint, and finally whether it meets any secondary requirements. During each pass through the loop, the system maintains counters tracking how many elements have been processed, how many matched the criteria, and how many were rejected. The iteration continues until all elements have been examined, at which point the function returns a summary of the results. This iterative approach ensures thorough examination of every element while maintaining consistent evaluation criteria throughout the process. The iteration process repeats for each subset of the data, applying the same evaluation logic to ensure consistency across different portions of the dataset. Each iteration produces intermediate results that are aggregated into a final summary. The iterative evaluation guarantees that no elements are skipped and that all elements receive identical treatment during the selection process.""",
    
    "scientific": """Mitochondrial DNA analysis provides unique insights into evolutionary relationships because it is inherited maternally and undergoes minimal recombination. Researchers extract DNA samples from preserved specimens, amplify specific gene regions using polymerase chain reaction, and sequence the resulting fragments to identify genetic variations. The molecular clock hypothesis allows scientists to estimate divergence times between species by assuming that mutations accumulate at a relatively constant rate over evolutionary time. However, this assumption requires careful calibration using fossil evidence and consideration of factors that might affect mutation rates, such as generation time, metabolic rate, and effective population size. Phylogenetic analysis employs computational algorithms to construct evolutionary trees that best explain the observed patterns of genetic similarity and difference among species. Maximum likelihood methods evaluate numerous possible tree topologies, calculating the probability that each tree would produce the observed data under specific models of molecular evolution. Bayesian approaches incorporate prior knowledge about evolutionary processes and provide probability distributions over possible trees rather than single point estimates. These sophisticated computational methods have revolutionized our understanding of evolutionary relationships."""
}

## Parallel CUDA Stream Workload

In [None]:
class GPUMonitor:
    """Monitor GPU utilization during inference"""
    
    def __init__(self):
        self.running = False
        self.samples = []
        self.thread = None
    
    def _worker(self):
        while self.running:
            try:
                result = subprocess.run(
                    ['nvidia-smi', '--query-gpu=utilization.gpu',
                     '--format=csv,noheader,nounits'],
                    capture_output=True,
                    text=True,
                    timeout=1
                )
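                if result.returncode == 0:
                    # nvidia-smi prints one line per GPU; record GPU 0
                    self.samples.append(float(result.stdout.strip().splitlines()[0]))
            except (subprocess.SubprocessError, OSError, ValueError, IndexError):
                pass  # Skip failed or unparsable samples
            time.sleep(0.01)  # Short pause; the nvidia-smi call itself dominates the sampling interval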
    
    def start(self):
        self.samples = []
        self.running = True
        self.thread = threading.Thread(target=self._worker, daemon=True)
        self.thread.start()
        time.sleep(0.05)  # Let monitoring start
    
    def stop(self):
        self.running = False
        if self.thread:
            self.thread.join(timeout=1.0)
        if self.samples:
            return {
                'mean': float(np.mean(self.samples)),
                'median': float(np.median(self.samples)),
                'std': float(np.std(self.samples)),
                'min': float(np.min(self.samples)),
                'max': float(np.max(self.samples)),
                'p50': float(np.percentile(self.samples, 50)),
                'p95': float(np.percentile(self.samples, 95)),
                'p99': float(np.percentile(self.samples, 99)),
                'num_samples': len(self.samples)
            }
        return None
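
On a multi-GPU host, the `nvidia-smi` query used by `GPUMonitor` prints one utilization value per line, so the output has to be split before conversion. The sketch below shows one way to parse it; the helper name `parse_utilization` is hypothetical and not part of the class above.

```python
def parse_utilization(stdout: str) -> list[float]:
    """Parse the output of
    `nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits`
    into one float per GPU, skipping blank or non-numeric lines."""
    values = []
    for line in stdout.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            values.append(float(line))
        except ValueError:
            continue  # placeholder text such as "[N/A]" is ignored
    return values

print(parse_utilization("12\n87\n"))  # [12.0, 87.0]
```

Keeping all values rather than only the first makes it possible to average across devices when the model is sharded with `device_map="auto"`.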

In [None]:
class ConcurrentInference:
    """Run concurrent inference as hidden workload on separate CUDA stream"""
    
    def __init__(self, model, tokenizer, intensity='light', device='cuda'):
        self.model = model
        self.tokenizer = tokenizer
        self.device = device
        self.running = False
        self.thread = None
        self.stream = None
        self.ready_event = threading.Event()
        
        # Prompts are long enough to keep the GPU busy during extraction
        if intensity == 'light':
            # Light: one paragraph, about 150 words (check self.inputs['input_ids'].shape[1] for the token count)
            self.prompt = """Large language models have revolutionized natural language processing through their ability to capture complex patterns in text data. The transformer architecture, introduced in 2017, employs self-attention mechanisms that allow the model to weigh the importance of different tokens in the input sequence. During training, these models learn to predict the next token in a sequence by optimizing a cross-entropy loss function across billions of text examples. The attention mechanism computes query, key, and value matrices for each token, enabling the model to establish contextual relationships across long distances in the text. Modern implementations use various optimization techniques including gradient checkpointing, mixed precision training, and careful learning rate scheduling to manage the computational demands of training models with hundreds of billions of parameters. The pre-training phase typically involves processing massive text corpora, often exceeding trillions of tokens, which requires distributed training across hundreds or thousands of GPUs."""
        else:  # heavy
            # Heavy: a paragraph of about 120 words, repeated 3x below
            base_text = """Large language models have revolutionized natural language processing through their ability to capture complex patterns in text data. The transformer architecture, introduced in 2017, employs self-attention mechanisms that allow the model to weigh the importance of different tokens in the input sequence. During training, these models learn to predict the next token in a sequence by optimizing a cross-entropy loss function across billions of text examples. The attention mechanism computes query, key, and value matrices for each token, enabling the model to establish contextual relationships across long distances in the text. Modern implementations use various optimization techniques including gradient checkpointing, mixed precision training, and careful learning rate scheduling to manage the computational demands of training models with hundreds of billions of parameters."""
            self.prompt = base_text * 3  # Repeat 3x (about 370 words in total)
        
        # Pre-tokenize
        self.inputs = tokenizer([self.prompt], return_tensors="pt")
        self.inputs = {k: v.to(device) for k, v in self.inputs.items()}
    
    def _worker(self):
        """Worker that runs inference CONTINUOUSLY on separate CUDA stream"""
        # Use separate stream for concurrent work
        self.stream = torch.cuda.Stream()
        
        # Signal ready
        self.ready_event.set()
        
        with torch.cuda.stream(self.stream):
            while self.running:
                # Run inference continuously with NO DELAY
                # This will saturate GPU and force SM contention
                with torch.no_grad():
                    _ = self.model(**self.inputs)
                # NO SLEEP - run as fast as possible to maximize utilization
    
    def start(self):
        if not self.running:
            self.ready_event.clear()
            self.running = True
            self.thread = threading.Thread(target=self._worker, daemon=True)
            self.thread.start()
            # Wait for worker to be ready
            self.ready_event.wait(timeout=1.0)
            time.sleep(0.5)  # Let workload stabilize
    
    def stop(self):
        if self.running:
            self.running = False
            if self.thread:
                self.thread.join(timeout=2.0)
            if self.stream:
                self.stream.synchronize()
            time.sleep(0.3)  # Cooldown
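
The start/stop handshake used by `ConcurrentInference` can be tried without a GPU. The sketch below is a simplified stand-in: the class name `BackgroundWorker` is hypothetical, the `work` callable replaces the model forward pass, and a `threading.Event` replaces the plain boolean flag used in the class above.

```python
import threading

class BackgroundWorker:
    """Run `work()` in a loop on a daemon thread until stopped."""

    def __init__(self, work):
        self.work = work
        self.iterations = 0
        self._stop_event = threading.Event()
        self._ready_event = threading.Event()
        self._thread = None

    def _loop(self):
        while not self._stop_event.is_set():
            self.work()
            self.iterations += 1
            self._ready_event.set()  # signal after the first completed iteration

    def start(self, timeout=5.0):
        """Start the thread; return True once one iteration has finished."""
        self._stop_event.clear()
        self._ready_event.clear()
        self._thread = threading.Thread(target=self._loop, daemon=True)
        self._thread.start()
        return self._ready_event.wait(timeout=timeout)

    def stop(self, timeout=5.0):
        """Request a stop; return True if the thread has exited."""
        self._stop_event.set()
        if self._thread is not None:
            self._thread.join(timeout=timeout)
            return not self._thread.is_alive()
        return True
```

Signalling readiness after the first completed iteration, rather than before the loop, means the caller knows the workload is actually running when `start()` returns.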

## Key Vector Extraction

In [None]:
def extract_key_vectors_sparse(model, tokenizer, text, device="cuda"):
    """
    Extract key vectors for SPARSE token positions (every 5th, starting from end).
    Uses GQA key heads - much smaller than full hidden states.
    
    For Qwen2.5-7B: 4 key heads × 128 head_dim = 512 dims (vs 3584 for hidden states)
    
    Returns:
        key_vectors: dict of {layer_name: tensor of shape (num_sampled_positions, key_dim)}
        sampled_positions: list of token indices that were sampled
        seq_len: total sequence length
    """
    torch.cuda.empty_cache()
    
    inputs = tokenizer([text], return_tensors="pt", padding=False)
    inputs = {k: v.to(device) for k, v in inputs.items()}
    seq_len = inputs['input_ids'].shape[1]
    
    # Determine positions to sample: every 5th token starting from the end
    sampled_positions = list(range(seq_len - 1, -1, -5))
    
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True, use_cache=True)
    
    # Extract key vectors from KV cache for selected layers
    key_vectors = {}
    for layer_idx in LAYER_INDICES:
        # LAYER_INDICES is 1-indexed, past_key_values is 0-indexed, hence the -1
        layer_keys = outputs.past_key_values[layer_idx - 1][0]  # Index 0 = keys, 1 = values
        # layer_keys shape: (1, num_key_heads, seq_len, head_dim)
        
        # Extract sampled positions and concatenate heads
        sampled_keys = layer_keys[0, :, sampled_positions, :]
        
        # Reshape to (num_sampled, num_key_heads * head_dim)
        num_sampled = len(sampled_positions)
        key_dim = sampled_keys.shape[0] * sampled_keys.shape[2]  # num_heads * head_dim
        key_vectors[f'layer_{layer_idx}'] = sampled_keys.permute(1, 0, 2).reshape(num_sampled, key_dim).cpu().clone()
    
    del outputs
    del inputs
    torch.cuda.empty_cache()
    
    return key_vectors, sampled_positions, seq_len
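
The permute-and-reshape step in `extract_key_vectors_sparse` turns per-head keys into one flat vector per sampled position. The plain-Python sketch below reproduces that layout with nested lists in place of tensors; the helper name `flatten_heads` is hypothetical.

```python
def flatten_heads(keys):
    """keys[h][p] is the key vector (a list) of head h at sampled position p.
    Returns one list per position with all heads concatenated in head order."""
    num_heads = len(keys)
    num_positions = len(keys[0]) if num_heads else 0
    return [
        [x for h in range(num_heads) for x in keys[h][p]]
        for p in range(num_positions)
    ]

# 2 heads, 3 sampled positions, head_dim 2
keys = [
    [[1, 2], [3, 4], [5, 6]],        # head 0
    [[10, 20], [30, 40], [50, 60]],  # head 1
]
print(flatten_heads(keys))
# [[1, 2, 10, 20], [3, 4, 30, 40], [5, 6, 50, 60]]
```

With 4 key heads of dimension 128, the same layout gives the 512-dimensional rows mentioned in the docstring.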

## Main Experiment

In [None]:
print("="*70)
print("CROSS-HARDWARE VERIFICATION: SECOND CUDA STREAM")
print("="*70)

# Auto-detect hardware
gpu_name = torch.cuda.get_device_name(0) if torch.cuda.is_available() else "cpu"
if "H100" in gpu_name:
    hardware_label = "h100"
elif "A100" in gpu_name:
    hardware_label = "a100"
elif "L40S" in gpu_name or "L40s" in gpu_name:
    hardware_label = "l40s"
elif "V100" in gpu_name:
    hardware_label = "v100"
else:
    hardware_label = "gpu"

print(f"\nHardware: {hardware_label.upper()}")
print(f"GPU: {gpu_name}")
print(f"Model: {MODEL_NAME}")
print(f"Layers: {LAYER_INDICES}")
print(f"Sequences: {len(SEQUENCES)}")
print(f"Repetitions per condition: {NUM_REPETITIONS}")

hostname = socket.gethostname()
print(f"\nSystem Info:")
print(f"  Hostname: {hostname}")
print(f"  PyTorch: {torch.__version__}")
print(f"  CUDA: {torch.version.cuda}")
print()
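
The if/elif chain that derives the hardware label could also be written as a table lookup, which makes adding a GPU type a one-line change. This is only a sketch: the names `KNOWN_GPUS` and `label_for_gpu` are hypothetical, and the match is case-insensitive, which is slightly more permissive than the cell above.

```python
KNOWN_GPUS = ("H100", "A100", "L40S", "V100")

def label_for_gpu(gpu_name: str) -> str:
    """Map a device name string to a short lowercase label, or 'gpu' if unknown."""
    upper = gpu_name.upper()
    for name in KNOWN_GPUS:
        if name in upper:
            return name.lower()
    return "gpu"

print(label_for_gpu("NVIDIA A100-SXM4-80GB"))  # a100
```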

In [None]:
# Load model
print(f"Loading model...")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, cache_dir=CACHE_DIR)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    dtype=torch.bfloat16,
    cache_dir=CACHE_DIR,
    low_cpu_mem_usage=True,
    device_map="auto",
    attn_implementation="eager"  # Consistent with separate_CUDA_stream experiment
)
model.eval()
print(f"✓ Model loaded\n")

In [None]:
# Initialize results structure
results = {
    'metadata': {
        'hardware': hardware_label,
        'hostname': hostname,
        'gpu': gpu_name,
        'pytorch_version': torch.__version__,
        'cuda_version': torch.version.cuda,
        'model': MODEL_NAME,
        'layer_indices': LAYER_INDICES,
        'num_sequences': len(SEQUENCES),
        'num_repetitions_per_condition': NUM_REPETITIONS,
        'timestamp': datetime.now().isoformat()
    },
    'sequences': {}
}

# Run experiments for each sequence
for seq_name, text in SEQUENCES.items():
    print(f"Processing: {seq_name}")
    
    prompt_length = len(tokenizer.encode(text))
    print(f"  Prompt: {prompt_length} tokens")
    
    seq_results = {
        'sequence_name': seq_name,
        'prompt_text_preview': (text[:100] + '...') if len(text) > 100 else text,
        'prompt_length_tokens': prompt_length,
        'conditions': {}
    }
    
    # Test three conditions: baseline, light_concurrent, heavy_concurrent
    conditions = [
        ('baseline', None),
        ('light_concurrent', 'light'),
        ('heavy_concurrent', 'heavy')
    ]
    
    for condition_name, intensity in conditions:
        print(f"  Condition: {condition_name}")
        
        # Start GPU monitoring
        gpu_monitor = GPUMonitor()
        gpu_monitor.start()
        
        # Start concurrent workload if needed
        concurrent = None
        if intensity is not None:
            concurrent = ConcurrentInference(model, tokenizer, intensity=intensity)
            concurrent.start()
            print(f"    [Parallel stream started: {intensity}]")
        
        # Run multiple repetitions to check reproducibility
        runs = []
        all_positions = None
        all_seq_lens = []
        
        for rep in range(NUM_REPETITIONS):
            key_vectors, sampled_positions, seq_len = extract_key_vectors_sparse(model, tokenizer, text)
            runs.append(key_vectors)
            all_seq_lens.append(seq_len)
            if all_positions is None:
                all_positions = sampled_positions
        
        # Stop concurrent workload
        if concurrent is not None:
            concurrent.stop()
            print(f"    [Parallel stream stopped]")
        
        # Stop GPU monitoring
        gpu_stats = gpu_monitor.stop()
        
        # Display GPU utilization
        if gpu_stats:
            print(f"    GPU utilization: mean={gpu_stats['mean']:.1f}% "
                  f"p95={gpu_stats['p95']:.1f}% "
                  f"max={gpu_stats['max']:.1f}%")
        
        # Verify all runs have same sequence length
        if not all(sl == all_seq_lens[0] for sl in all_seq_lens):
            print(f"    ⚠ WARNING: Sequence length varied across runs!")
        
        seq_len = all_seq_lens[0]
        
        # Check bit-exact reproducibility across all repetitions
        reproducible = True
        max_l2_across_runs = {}
        
        for layer_name in runs[0].keys():
            layer_reproducible = True
            max_l2 = 0.0
            
            for i in range(1, NUM_REPETITIONS):
                if not torch.equal(runs[0][layer_name], runs[i][layer_name]):
                    layer_reproducible = False
                    reproducible = False
                    # Compute L2 distance
                    l2 = float(torch.norm(runs[0][layer_name].float() - runs[i][layer_name].float()).item())
                    max_l2 = max(max_l2, l2)
            
            max_l2_across_runs[layer_name] = max_l2
        
        if not reproducible:
            print(f"    ⚠ Non-reproducible! Statistical noise detected")
            for layer_name, l2 in max_l2_across_runs.items():
                if l2 > 0:
                    print(f"      {layer_name}: max L2={l2:.2e}")
        else:
            print(f"    ✓ Reproducible (bit-exact across {NUM_REPETITIONS} runs)")
        
        # Use first run for storage (to save space)
        key_vectors = runs[0]
        
        # Store condition results
        cond_results = {
            'condition_name': condition_name,
            'intensity': intensity,
            'sequence_length': seq_len,
            'num_sampled_positions': len(sampled_positions),
            'sampled_positions': sampled_positions,
            'sampling_strategy': 'every_5th_from_end',
            'reproducible': reproducible,
            'max_l2_across_repetitions': {k: float(v) for k, v in max_l2_across_runs.items()} if not reproducible else None,
            'gpu_utilization': gpu_stats,
            'layers': {}
        }
        
        for layer_name, layer_keys in key_vectors.items():
            cond_results['layers'][layer_name] = {
                'shape': list(layer_keys.shape),
                'key_vectors': layer_keys.float().numpy().tolist(),
                'norms_per_position': [torch.norm(layer_keys[i].float()).item()  # float32 norm, matching the stored key_vectors
                                       for i in range(layer_keys.shape[0])]
            }
        
        seq_results['conditions'][condition_name] = cond_results
        print(f"    Sampled: {len(sampled_positions)} positions\n")
    
    results['sequences'][seq_name] = seq_results

print("="*70)
print("All sequences processed")
print("="*70)
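
The reproducibility check in the loop above compares every repetition against the first one, first for exact equality and then by L2 distance. A GPU-free sketch of the same logic, using flat Python lists in place of tensors and a hypothetical helper name `max_l2_vs_first`, looks like this:

```python
import math

def max_l2_vs_first(runs):
    """Return (bit_exact, max_l2) comparing every run against the first one.
    Each run is a flat list of floats."""
    bit_exact = True
    max_l2 = 0.0
    for other in runs[1:]:
        if other != runs[0]:
            bit_exact = False
            l2 = math.sqrt(sum((a - b) ** 2 for a, b in zip(runs[0], other)))
            max_l2 = max(max_l2, l2)
    return bit_exact, max_l2

print(max_l2_vs_first([[0.0, 0.0], [3.0, 4.0], [0.0, 0.0]]))  # (False, 5.0)
```

Because only run 0 serves as the reference, two later runs that differ from each other but not from run 0 would go unnoticed; a pairwise comparison would catch that at the cost of more comparisons.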

In [None]:
# Save results
output_dir = '/workspace/experiments/cross_hardware_verification/second_CUDA_stream'
os.makedirs(output_dir, exist_ok=True)

timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
filename = f"{hardware_label}_cuda_stream_{timestamp}.json"
filepath = os.path.join(output_dir, filename)

with open(filepath, 'w') as f:
    json.dump(results, f, indent=2)

file_size_mb = os.path.getsize(filepath) / 1024 / 1024
print(f"\n✓ Results saved to: {filepath}")
print(f"  File size: {file_size_mb:.1f} MB")
print("\nNext steps:")
print(f"1. Run on other hardware (different GPU)")
print(f"2. Compare results: python compare_streams.py {filename} <other_file>")
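
Before comparing across hardware, it can help to list which conditions in a single result file were not bit-exact. The sketch below assumes the `results` layout built in this notebook; the helper name `non_reproducible_conditions` is hypothetical.

```python
def non_reproducible_conditions(results):
    """List (sequence, condition) pairs whose 'reproducible' flag is False."""
    return [
        (seq_name, cond_name)
        for seq_name, seq in results["sequences"].items()
        for cond_name, cond in seq["conditions"].items()
        if not cond["reproducible"]
    ]

# Tiny hand-made example in the same layout as the saved JSON
example = {
    "sequences": {
        "technical_1": {
            "conditions": {
                "baseline": {"reproducible": True},
                "heavy_concurrent": {"reproducible": False},
            }
        }
    }
}
print(non_reproducible_conditions(example))
# [('technical_1', 'heavy_concurrent')]
```

It can be applied directly to the in-memory `results` dict or to a file loaded with `json.load`.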