# dInfer vs Fast-dLLM Benchmarking

This notebook benchmarks **dInfer** generation methods against the **Fast-dLLM dual-cache** method.

## Overview

**dInfer Methods:**
- `dinfer_blockwise`: BlockWise with threshold decoder (recommended)
- `dinfer_hierarchy`: BlockWise with hierarchical decoder
- `dinfer_credit`: BlockWise with credit threshold decoder

**Fast-dLLM Baseline:**
- `dual_cache`: Dual caching for optimal performance

**Comparison Metrics:**
- Tokens per second (throughput)
- Average generation time
- Number of forward evaluations (NFE)
- Performance scaling across different generation lengths
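
As a rough illustration of how these metrics relate to each other, the sketch below derives throughput and tokens per forward evaluation from one timed run. The helper name `summarize_run` and the sample numbers are illustrative assumptions, not measurements from this notebook.

```python
def summarize_run(num_tokens: int, elapsed_s: float, nfe: int) -> dict:
    """Derive the comparison metrics from one timed generation (hypothetical helper)."""
    return {
        "tokens_per_sec": num_tokens / elapsed_s if elapsed_s > 0 else 0.0,
        "tokens_per_nfe": num_tokens / nfe if nfe > 0 else 0.0,
    }


# Made-up numbers for illustration only: 256 tokens in 4.0 s using 64 forward passes
metrics = summarize_run(num_tokens=256, elapsed_s=4.0, nfe=64)
print(metrics)  # {'tokens_per_sec': 64.0, 'tokens_per_nfe': 4.0}
```

A lower NFE for the same generation length means more tokens are committed per forward pass, which is why NFE is reported alongside wall-clock throughput.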


## 1. GPU Memory Cleanup

Run this first to ensure a clean state before benchmarking.


In [None]:
import torch
import gc
import psutil
import os

def clear_gpu_memory():
    """Clear all GPU memory and run garbage collection"""
    print("🧹 Clearing GPU memory...")
    
    # Force garbage collection
    gc.collect()
    
    # Clear CUDA cache if available
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.synchronize()
        
        # Get GPU memory info
        for i in range(torch.cuda.device_count()):
            gpu_memory = torch.cuda.get_device_properties(i).total_memory / 1024**3
            allocated = torch.cuda.memory_allocated(i) / 1024**3
            cached = torch.cuda.memory_reserved(i) / 1024**3
            print(f"   GPU {i}: {allocated:.2f}GB allocated, {cached:.2f}GB cached of {gpu_memory:.2f}GB total")
        
        torch.cuda.empty_cache()
        print("   ✅ CUDA cache cleared")
    else:
        print("   ⚠️  CUDA not available")
    
    # Get system memory info
    memory = psutil.virtual_memory()
    print(f"   💾 System RAM: {memory.used / 1024**3:.2f}GB used of {memory.total / 1024**3:.2f}GB total")
    print("   🗑️  Garbage collection completed")
    print("=" * 60)

# Clear memory at startup
clear_gpu_memory()

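# Optional: request smaller allocator split sizes.
# Note: PyTorch reads PYTORCH_CUDA_ALLOC_CONF when its CUDA caching allocator is
# initialized, so setting it here may have no effect because clear_gpu_memory()
# has already touched CUDA. To be safe, set the variable before the first CUDA
# call (ideally before `import torch`).
if torch.cuda.is_available():
    os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:128'
    print("🔧 PYTORCH_CUDA_ALLOC_CONF set to max_split_size_mb:128")
    print("=" * 60)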


## 2. Import Required Libraries and Setup Paths


In [None]:
import torch
import numpy as np
import pandas as pd
import time
import matplotlib.pyplot as plt
from transformers import AutoTokenizer
import gc
import sys
import os
from pathlib import Path

# Configuration
MODEL_NAME = "GSAI-ML/LLaDA-8B-Instruct"
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'

print(f"Using device: {DEVICE}")
print(f"Model: {MODEL_NAME}")


## 3. Dynamic Project Discovery and Import Generation Registry


In [None]:
def find_project_root(start_path=None, max_depth=10):
    """Dynamically find the NeMo-RL project root."""
    if start_path is None:
        start_path = os.getcwd()
    
    project_markers = ['pyproject.toml', 'notebooks', 'xp', 'nemo_rl', '3rdparty']
    current_dir = os.path.abspath(start_path)
    depth = 0
    
    print(f"🔍 Looking for project root starting from: {current_dir}")
    
    while depth < max_depth:
        print(f"  Checking: {current_dir}")
        
        try:
            dir_contents = os.listdir(current_dir)
            found_markers = [marker for marker in project_markers if marker in dir_contents]
            print(f"    Found markers: {found_markers}")
            
            if len(found_markers) >= 3:
                print(f"  ✅ Found project root: {current_dir}")
                return current_dir
                
        except PermissionError:
            print(f"    ❌ Permission denied")
        except Exception as e:
            print(f"    ❌ Error: {e}")
        
        parent_dir = os.path.dirname(current_dir)
        if parent_dir == current_dir:
            break
        current_dir = parent_dir
        depth += 1
    
    print(f"  ❌ Could not find project root within {max_depth} levels")
    return None

def find_llada_generate_path(project_root):
    """Find the llada_generate module path."""
    if not project_root:
        return None
    
    print(f"\n🔍 Searching for llada_generate module...")
    
    llada_generate_locations = ['xp/llada_api', 'llada_api', 'api', 'src']
    
    for location in llada_generate_locations:
        candidate_path = os.path.join(project_root, location)
        llada_generate_path = os.path.join(candidate_path, 'llada_generate')
        
        print(f"  Checking: {llada_generate_path}")
        
        if os.path.exists(llada_generate_path):
            init_file = os.path.join(llada_generate_path, '__init__.py')
            if os.path.exists(init_file):
                print(f"  ✅ Found llada_generate module!")
                return candidate_path
            else:
                print(f"  ⚠️  Directory exists but missing __init__.py")
        else:
            print(f"  ❌ Not found")
    
    return None

# Discover project root and add llada_generate to path
print("🚀 Starting dynamic project discovery...")
project_root = find_project_root()

GENERATION_AVAILABLE = False
if project_root:
    llada_generate_parent_path = find_llada_generate_path(project_root)
    
    if llada_generate_parent_path:
        if llada_generate_parent_path not in sys.path:
            sys.path.insert(0, llada_generate_parent_path)
            print(f"✅ Added llada_generate parent path: {llada_generate_parent_path}")
        
        try:
            from llada_generate import (
                get_algorithm,
                list_available_algorithms,
                list_algorithms,
                list_algorithms_by_engine,
                list_available_algorithms_by_engine,
                registry
            )
            
            GENERATION_AVAILABLE = True
            print("✅ Generation registry imported successfully!")
            
            # List all algorithms
            all_algorithms = list_algorithms()
            available_algorithms = list_available_algorithms()
            
            print(f"\n📋 Total registered algorithms: {len(all_algorithms)}")
            for name in all_algorithms:
                status = "✅ Available" if name in available_algorithms else "❌ Not available"
                algo_info = registry.get_algorithm_info(name)
                if algo_info:
                    print(f"   {name}: {status} - {algo_info.get('description', 'No description')}")
                else:
                    print(f"   {name}: {status}")
            
            # Show algorithms by engine
            print(f"\n🔧 Algorithms by engine:")
            for engine in ['dinfer', 'fast-dllm']:
                engine_algos = list_available_algorithms_by_engine(engine)
                print(f"   {engine}: {engine_algos}")
            
        except ImportError as e:
            GENERATION_AVAILABLE = False
            print(f"❌ Failed to import generation registry: {e}")
    else:
        print(f"❌ Could not find llada_generate module")
else:
    print(f"❌ Could not locate project root")

if not GENERATION_AVAILABLE:
    raise RuntimeError("Generation registry is not available. Cannot proceed with benchmarking.")
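
The upward marker search used for project discovery can also be written compactly with `pathlib`. The following is a minimal sketch, not the notebook's implementation: the function name `find_root_by_markers` is hypothetical, and unlike `find_project_root` it has no depth limit and prints nothing.

```python
import tempfile
from pathlib import Path


def find_root_by_markers(start, markers, min_hits=3):
    """Walk upward from `start`; return the first directory holding >= min_hits markers."""
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        hits = [m for m in markers if (candidate / m).exists()]
        if len(hits) >= min_hits:
            return candidate
    return None


# Demo on a throwaway directory tree (names mirror the markers used above)
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve() / "repo"
    (root / "notebooks" / "deep").mkdir(parents=True)
    (root / "xp").mkdir()
    (root / "pyproject.toml").touch()
    found = find_root_by_markers(
        root / "notebooks" / "deep",
        ["pyproject.toml", "notebooks", "xp", "nemo_rl", "3rdparty"],
    )
    print(found == root)  # True
```

Requiring several markers rather than one reduces the chance of stopping at an unrelated directory that happens to contain, say, a `notebooks` folder.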


## 4. Benchmarking Configuration


In [None]:
# Benchmarking parameters
BENCHMARK_CONFIG = {
    'prompt_text': "Explain the theory of general relativity in comprehensive detail.",
    'generation_lengths': [256, 512, 1024, 2048],  # Token lengths to benchmark
    'num_trials': 3,  # Number of trials per benchmark for averaging
    'show_generated_text': True,  # Whether to display sample generated text
    'block_length': 64,  # Block length for generation (used by both dInfer and Fast-dLLM)
}

# Algorithms to benchmark
ALGORITHMS_TO_BENCHMARK = [
    # dInfer algorithms
    'dinfer_blockwise',
    'dinfer_hierarchy',
    'dinfer_credit',
    # Fast-dLLM baseline
    'dual_cache',
]

print("✅ Benchmarking parameters configured:")
print(f"   Prompt: {BENCHMARK_CONFIG['prompt_text'][:50]}...")
print(f"   Generation lengths: {BENCHMARK_CONFIG['generation_lengths']}")
print(f"   Trials per length: {BENCHMARK_CONFIG['num_trials']}")
print(f"   Block length: {BENCHMARK_CONFIG['block_length']}")
print(f"   Algorithms to test: {ALGORITHMS_TO_BENCHMARK}")


## 5. Benchmarking Framework


In [None]:
def run_benchmark(
    algorithm_name,
    algorithm,
    prompt_text,
    gen_length,
    block_length,
    num_trials=3,
    print_text=False
):
    """
    Run benchmark for a specific algorithm.
    
    Args:
        algorithm_name: Name of the algorithm
        algorithm: Algorithm instance
        prompt_text: Prompt string
        gen_length: Generation length
        block_length: Block length
        num_trials: Number of trials
        print_text: Whether to print generated text
    
    Returns:
        Dictionary with benchmark results
    """
    total_time = 0
    total_tokens = 0
    generated_texts = []
    forward_passes = []
    
    # Prepare prompt
    messages = [{"role": "user", "content": prompt_text}]
    
    # Check if algorithm uses dInfer (requires attention_mask)
    is_dinfer = algorithm.engine == 'dinfer'
    
    if is_dinfer:
        # dInfer: tokenize with left-padding and attention_mask
        input_ids, attention_mask = algorithm.tokenize_prompts_dinfer(
            prompts=[prompt_text],
            apply_chat_template=True,
            messages=[messages]
        )
    else:
        # Fast-dLLM: tokenize normally
        input_ids = algorithm.tokenize_prompts(
            prompts=[prompt_text],
            apply_chat_template=True,
            messages=[messages]
        )
    
    # Warm-up run
    print(f"  Warm-up run for {gen_length} tokens...")
    try:
        _ = algorithm.generate(
            model=algorithm.model,
            prompt=input_ids,
            steps=128,  # Not used by dInfer but kept for compatibility
            gen_length=gen_length,
            block_length=block_length,
            temperature=1.0,
            threshold=0.9
        )
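        # Only synchronize when CUDA is present; the call raises on CPU-only setups
        if torch.cuda.is_available():
            torch.cuda.synchronize()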
    except Exception as e:
        print(f"  ⚠️ Warm-up failed: {e}")
    
    # Timed trials
    for i in range(num_trials):
        print(f"  Trial {i+1}/{num_trials} for {gen_length} tokens...")
        
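        # Drain pending GPU work so it is not charged to this trial
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        # perf_counter is monotonic and higher resolution than time.time()
        start_time = time.perf_counter()
        
        try:
            output_ids, nfe = algorithm.generate(
                model=algorithm.model,
                prompt=input_ids,
                steps=128,
                gen_length=gen_length,
                block_length=block_length,
                temperature=1.0,
                threshold=0.9
            )
            
            # Wait for queued kernels to finish before stopping the clock
            if torch.cuda.is_available():
                torch.cuda.synchronize()
            end_time = time.perf_counter()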
            
            # Decode output
            if is_dinfer:
                generated_texts_batch = algorithm.decode_outputs_dinfer(
                    output_ids, input_ids, skip_special_tokens=True
                )
                generated_text = generated_texts_batch[0]
            else:
                generated_text = algorithm.tokenizer.batch_decode(
                    output_ids[:, input_ids.shape[1]:],
                    skip_special_tokens=True
                )[0]
            
            elapsed_time = end_time - start_time
            actual_tokens = len(algorithm.tokenizer.encode(generated_text))
            
            # Use requested gen_length for fair comparison
            num_tokens = gen_length
            
            total_time += elapsed_time
            total_tokens += num_tokens
            generated_texts.append(generated_text)
            forward_passes.append(nfe)
            
            # Print generated text if requested and first trial
            if print_text and i == 0:
                print(f"    📝 Generated Text (Trial {i+1}):")
                print(f"    {'-' * 50}")
                preview = generated_text[:300] + ('...' if len(generated_text) > 300 else '')
                for line in preview.split('\n'):
                    print(f"    {line}")
                print(f"    {'-' * 50}")
                nfe_info = f", NFE: {nfe}" if nfe is not None and nfe > 0 else ""
                print(f"    Requested: {gen_length} tokens, Generated: {actual_tokens} tokens, Time: {elapsed_time:.2f}s{nfe_info}")
                print()
                
        except Exception as e:
            print(f"  ❌ Trial {i+1} failed: {e}")
            import traceback
            traceback.print_exc()
            generated_text = f"[ERROR: {e}]"
            generated_texts.append(generated_text)
            forward_passes.append(None)
    
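    # Average over successful trials only. Each successful trial adds exactly
    # gen_length tokens, so failed trials (which add no time or tokens) no longer
    # drag the average time down.
    successful_trials = total_tokens // gen_length if gen_length > 0 else 0
    avg_time = total_time / successful_trials if successful_trials > 0 else 0
    avg_tokens = total_tokens / successful_trials if successful_trials > 0 else 0
    tokens_per_sec = total_tokens / total_time if total_time > 0 else 0
    valid_nfes = [nfe for nfe in forward_passes if nfe is not None and nfe > 0]
    avg_nfe = sum(valid_nfes) / len(valid_nfes) if valid_nfes else None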
    
    result = {
        "Algorithm": algorithm_name,
        "Engine": algorithm.engine,
        "Gen Length": gen_length,
        "Avg Time (s)": avg_time,
        "Avg Tokens": avg_tokens,
        "Tokens/Sec": tokens_per_sec,
        "Avg NFE": avg_nfe
    }
    
    if print_text:
        result["Generated_Texts"] = generated_texts
    
    return result

print("✅ Benchmarking framework loaded.")
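
The warm-up-then-time pattern used by `run_benchmark` can be isolated into a small reusable helper. This is a minimal sketch under stated assumptions: `time_trials` and its `sync` parameter are hypothetical names, and the workload is a stand-in rather than a real `algorithm.generate` call. On a GPU one would pass `torch.cuda.synchronize` as `sync`.

```python
import statistics
import time


def time_trials(fn, num_trials=3, warmup=1, sync=None):
    """Time `fn` over several trials; `sync` is an optional barrier callable."""
    for _ in range(warmup):
        fn()  # warm-up runs are executed but not timed
    durations = []
    for _ in range(num_trials):
        if sync is not None:
            sync()  # make sure earlier async work is not charged to this trial
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for queued work before stopping the clock
        durations.append(time.perf_counter() - start)
    return {
        "mean_s": statistics.mean(durations),
        "median_s": statistics.median(durations),
        "stdev_s": statistics.stdev(durations) if len(durations) > 1 else 0.0,
        "trials": durations,
    }


# Stand-in workload; in the notebook this would wrap algorithm.generate(...)
stats = time_trials(lambda: sum(range(100_000)), num_trials=3)
print(f"median {stats['median_s']:.6f}s over {len(stats['trials'])} trials")
```

Keeping the per-trial durations, rather than only the mean, makes it possible to report spread; with three trials a single slow run can noticeably shift the average.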


## 6. Load Models and Run Benchmarks

This cell will iterate through each algorithm, load its model, and run the benchmarks.


In [None]:
results_list = []

# Filter to only available algorithms
available_algos = list_available_algorithms()
algorithms_to_test = [name for name in ALGORITHMS_TO_BENCHMARK if name in available_algos]

if not algorithms_to_test:
    print("❌ No algorithms available for benchmarking!")
    print(f"   Requested: {ALGORITHMS_TO_BENCHMARK}")
    print(f"   Available: {available_algos}")
else:
    print(f"\n🚀 Starting benchmarking of {len(algorithms_to_test)} algorithms...")
    print(f"   Algorithms: {algorithms_to_test}")
    print(f"   Generation lengths: {BENCHMARK_CONFIG['generation_lengths']}")
    print(f"   Trials per test: {BENCHMARK_CONFIG['num_trials']}")
    print(f"   Block length: {BENCHMARK_CONFIG['block_length']}")
    
    for algo_name in algorithms_to_test:
        print(f"\n{'='*70}")
        print(f"🔬 BENCHMARKING ALGORITHM: {algo_name.upper()}")
        print(f"{'='*70}")
        
        # Get algorithm instance
        algorithm = get_algorithm(algo_name)
        if algorithm is None:
            print(f"❌ Failed to get algorithm: {algo_name}")
            continue
        
        # Display algorithm info
        algo_info = registry.get_algorithm_info(algo_name)
        if algo_info:
            print(f"📋 Description: {algo_info['description']}")
            print(f"📋 Engine: {algo_info['engine']}")
        
        # Load model
        print(f"\n📦 Loading model: {MODEL_NAME}...")
        try:
            success = algorithm.load_model_from_hf(
                MODEL_NAME,
                model_type='llada'
            )
            
            if not success:
                print(f"❌ Failed to load model for {algo_name}")
                continue
            
            # Move model to GPU
            algorithm.model = algorithm.model.to(DEVICE)
            algorithm.model.eval()
            algorithm.device = DEVICE
            
            print(f"✅ Model loaded successfully on {DEVICE}")
            
        except Exception as e:
            print(f"❌ Error loading model: {e}")
            import traceback
            traceback.print_exc()
            continue
        
        # Run benchmarks for different generation lengths
        for i, gen_length in enumerate(BENCHMARK_CONFIG['generation_lengths']):
            print(f"\n📏 Testing {algo_name} with {gen_length} tokens...")
            
            # Show sample text only for first length
            show_text = BENCHMARK_CONFIG['show_generated_text'] and (i == 0)
            
            try:
                result = run_benchmark(
                    algorithm_name=algo_name,
                    algorithm=algorithm,
                    prompt_text=BENCHMARK_CONFIG['prompt_text'],
                    gen_length=gen_length,
                    block_length=BENCHMARK_CONFIG['block_length'],
                    num_trials=BENCHMARK_CONFIG['num_trials'],
                    print_text=show_text
                )
                results_list.append(result)
                
                # Print performance summary
                print(f"   ⚡ {algo_name}: {result['Tokens/Sec']:.2f} tokens/sec, {result['Avg Time (s)']:.2f}s avg", end="")
                if result['Avg NFE'] is not None:
                    print(f", {result['Avg NFE']:.1f} avg NFE")
                else:
                    print()
                
            except Exception as e:
                print(f"   ❌ Failed to benchmark {algo_name}: {e}")
                import traceback
                traceback.print_exc()
        
        # Clean up model from memory
        print(f"\n🧹 Clearing {algo_name} model from memory...")
        del algorithm.model
        if hasattr(algorithm, 'diffusion_llm') and algorithm.diffusion_llm is not None:
            del algorithm.diffusion_llm
            algorithm.diffusion_llm = None
        algorithm.model = None
        gc.collect()
        torch.cuda.empty_cache()
        print(f"✅ Memory cleared")

print(f"\n\n🎉 All benchmarking completed!")
print(f"   Total experiments: {len(results_list)}")


## 7. Results Analysis and Summary


In [None]:
if not results_list:
    print("❌ No results to display!")
else:
    # Convert results to DataFrame
    results_df = pd.DataFrame(results_list)
    
    print("\n📊 COMPREHENSIVE BENCHMARKING RESULTS")
    print("=" * 90)
    print("📏 Note: Tokens/Sec is computed from the requested generation length, not the decoded token count, so all algorithms are compared on the same basis")
    print()
    
    # Display results table
    display_df = results_df.copy()
    display_df = display_df.round(3)
    print(display_df.to_string(index=False))
    
    # Summary statistics by algorithm
    print("\n📈 PERFORMANCE SUMMARY BY ALGORITHM")
    print("=" * 60)
    summary = results_df.groupby('Algorithm').agg({
        'Tokens/Sec': ['mean', 'max', 'min'],
        'Avg Time (s)': ['mean', 'max', 'min'],
        'Avg NFE': 'mean'
    }).round(3)
    print(summary)
    
    # Find best performing algorithms
    print("\n🏆 TOP PERFORMERS")
    print("=" * 40)
    best_overall = results_df.loc[results_df['Tokens/Sec'].idxmax()]
    print(f"🥇 Fastest Overall: {best_overall['Algorithm']} at {best_overall['Gen Length']} tokens")
    print(f"   Speed: {best_overall['Tokens/Sec']:.2f} tokens/sec")
    
    # Best per generation length
    for gen_length in BENCHMARK_CONFIG['generation_lengths']:
        subset = results_df[results_df['Gen Length'] == gen_length]
        if not subset.empty:
            best = subset.loc[subset['Tokens/Sec'].idxmax()]
            print(f"   📏 {gen_length} tokens: {best['Algorithm']} ({best['Tokens/Sec']:.2f} tok/s)")
    
    # Display the dataframe
    display(results_df)


## 8. Comprehensive Visualization


In [None]:
if results_list:
    # Set up plotting style
    plt.style.use('default')
    
    # Create comprehensive visualization
    fig = plt.figure(figsize=(20, 12))
    
    # 1. Main performance comparison - Tokens per second
    ax1 = plt.subplot(2, 3, (1, 2))
    pivot_df = results_df.pivot(index='Gen Length', columns='Algorithm', values='Tokens/Sec')
    pivot_df.plot(kind='bar', ax=ax1, width=0.8)
    ax1.set_title('dInfer vs Fast-dLLM: Generation Speed Comparison', fontsize=14, fontweight='bold')
    ax1.set_xlabel('Generation Length (tokens)', fontsize=12)
    ax1.set_ylabel('Tokens per Second', fontsize=12)
    ax1.tick_params(axis='x', rotation=0)
    ax1.grid(axis='y', linestyle='--', alpha=0.7)
    ax1.legend(title='Algorithm', bbox_to_anchor=(1.05, 1), loc='upper left')
    
    # Add value labels on bars
    for container in ax1.containers:
        ax1.bar_label(container, fmt='%.1f', label_type='edge', fontsize=8, rotation=90)
    
    # 2. Average time comparison
    ax2 = plt.subplot(2, 3, 3)
    time_pivot = results_df.pivot(index='Gen Length', columns='Algorithm', values='Avg Time (s)')
    time_pivot.plot(kind='bar', ax=ax2, width=0.8)
    ax2.set_title('Average Time Comparison', fontsize=12, fontweight='bold')
    ax2.set_xlabel('Generation Length (tokens)', fontsize=10)
    ax2.set_ylabel('Time (seconds)', fontsize=10)
    ax2.tick_params(axis='x', rotation=45)
    ax2.grid(axis='y', linestyle='--', alpha=0.7)
    ax2.legend().set_visible(False)
    
    # 3. Performance scaling
    ax3 = plt.subplot(2, 3, 4)
    for algorithm in results_df['Algorithm'].unique():
        algorithm_data = results_df[results_df['Algorithm'] == algorithm]
        ax3.plot(algorithm_data['Gen Length'], algorithm_data['Tokens/Sec'], 'o-',
                label=algorithm, linewidth=2, markersize=6)
    ax3.set_title('Performance Scaling by Length', fontsize=12, fontweight='bold')
    ax3.set_xlabel('Generation Length (tokens)', fontsize=10)
    ax3.set_ylabel('Tokens per Second', fontsize=10)
    ax3.grid(True, alpha=0.7)
    ax3.legend(fontsize=8)
    
    # 4. NFE comparison (if available)
    ax4 = plt.subplot(2, 3, 5)
    if 'Avg NFE' in results_df.columns and results_df['Avg NFE'].notna().any():
        nfe_df = results_df[results_df['Avg NFE'].notna()]
        if not nfe_df.empty:
            nfe_pivot = nfe_df.pivot(index='Gen Length', columns='Algorithm', values='Avg NFE')
            nfe_pivot.plot(kind='bar', ax=ax4, width=0.8)
            ax4.set_title('Forward Evaluations (NFE)', fontsize=12, fontweight='bold')
            ax4.set_xlabel('Generation Length (tokens)', fontsize=10)
            ax4.set_ylabel('Number of Forward Evaluations', fontsize=10)
            ax4.tick_params(axis='x', rotation=45)
            ax4.grid(axis='y', linestyle='--', alpha=0.7)
            ax4.legend(fontsize=8)
        else:
            ax4.text(0.5, 0.5, 'No NFE data available', ha='center', va='center', transform=ax4.transAxes)
    else:
        ax4.text(0.5, 0.5, 'No NFE data available', ha='center', va='center', transform=ax4.transAxes)
    ax4.set_title('Forward Evaluations (NFE)', fontsize=12, fontweight='bold')
    
    # 5. Performance heatmap
    ax5 = plt.subplot(2, 3, 6)
    heatmap_data = results_df.pivot(index='Algorithm', columns='Gen Length', values='Tokens/Sec')
    
    # Create heatmap
    im = ax5.imshow(heatmap_data.values, cmap='YlOrRd', aspect='auto')
    
    # Set ticks and labels
    ax5.set_xticks(range(len(heatmap_data.columns)))
    ax5.set_yticks(range(len(heatmap_data.index)))
    ax5.set_xticklabels(heatmap_data.columns)
    ax5.set_yticklabels(heatmap_data.index)
    
    # Add text annotations
    for i in range(len(heatmap_data.index)):
        for j in range(len(heatmap_data.columns)):
            value = heatmap_data.iloc[i, j]
            if not pd.isna(value):
                ax5.text(j, i, f'{value:.1f}', ha='center', va='center', fontsize=10)
    
    # Add colorbar
    cbar = plt.colorbar(im, ax=ax5)
    cbar.set_label('Tokens/Sec', rotation=270, labelpad=15)
    
    ax5.set_title('Performance Heatmap', fontsize=12, fontweight='bold')
    ax5.set_xlabel('Generation Length (tokens)', fontsize=10)
    ax5.set_ylabel('Algorithm', fontsize=10)
    
    plt.tight_layout()
    plt.show()
    
    print("\n✅ Visualization complete!")
else:
    print("❌ No results to visualize!")


## 9. Detailed Performance Analysis


In [None]:
if results_list:
    print("\n🔍 DETAILED PERFORMANCE ANALYSIS")
    print("=" * 80)
    
    # Speed comparison: dInfer vs Fast-dLLM dual_cache
    fast_dllm_results = results_df[results_df['Algorithm'] == 'dual_cache']
    dinfer_results = results_df[results_df['Engine'] == 'dinfer']
    
    if not fast_dllm_results.empty and not dinfer_results.empty:
        print("\n⚡ SPEED ANALYSIS: dInfer vs Fast-dLLM dual_cache")
        print("-" * 60)
        
        for gen_length in BENCHMARK_CONFIG['generation_lengths']:
            fast_dllm_speed = fast_dllm_results[fast_dllm_results['Gen Length'] == gen_length]['Tokens/Sec']
            if len(fast_dllm_speed) > 0:
                fast_dllm_speed = fast_dllm_speed.iloc[0]
                
                dinfer_subset = dinfer_results[dinfer_results['Gen Length'] == gen_length]
                
                print(f"\n📏 {gen_length} tokens (Fast-dLLM baseline: {fast_dllm_speed:.2f} tok/s):")
                
                for _, row in dinfer_subset.iterrows():
                    algo_name = row['Algorithm']
                    speed_ratio = row['Tokens/Sec'] / fast_dllm_speed if fast_dllm_speed > 0 else 0
                    
                    if speed_ratio > 1.2:
                        status = f"🟢 {speed_ratio:.2f}x FASTER"
                    elif speed_ratio > 0.8:
                        status = f"🟡 {speed_ratio:.2f}x (comparable)"
                    else:
                        status = f"🔴 {speed_ratio:.2f}x slower"
                    
                    print(f"   {algo_name:20s}: {row['Tokens/Sec']:6.2f} tok/s | {status}")
    
    # Efficiency analysis
    if 'Avg NFE' in results_df.columns and results_df['Avg NFE'].notna().any():
        print("\n🎯 EFFICIENCY ANALYSIS (Tokens per Forward Evaluation):")
        print("-" * 60)
        
        for gen_length in BENCHMARK_CONFIG['generation_lengths']:
            subset = results_df[(results_df['Gen Length'] == gen_length) & (results_df['Avg NFE'].notna())]
            if not subset.empty:
                print(f"\n📏 {gen_length} tokens:")
                
                for _, row in subset.iterrows():
                    if pd.notna(row['Avg NFE']) and row['Avg NFE'] > 0:
                        efficiency = row['Avg Tokens'] / row['Avg NFE']
                        print(f"   {row['Algorithm']:20s}: {efficiency:6.2f} tokens/NFE ({row['Avg NFE']:5.1f} forward passes)")
    
    # Key insights
    print("\n💡 KEY INSIGHTS:")
    print("-" * 30)
    
    best_overall = results_df.loc[results_df['Tokens/Sec'].idxmax()]
    worst_overall = results_df.loc[results_df['Tokens/Sec'].idxmin()]
    
    print(f"• Best performer: {best_overall['Algorithm']} ({best_overall['Tokens/Sec']:.2f} tok/s)")
    print(f"• Slowest performer: {worst_overall['Algorithm']} ({worst_overall['Tokens/Sec']:.2f} tok/s)")
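    # Skip the ratio when the slowest entry recorded zero throughput (all trials failed)
    if worst_overall['Tokens/Sec'] > 0:
        print(f"• Performance range: {best_overall['Tokens/Sec'] / worst_overall['Tokens/Sec']:.1f}x difference")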
    
    # Algorithm ranking by engine
    for engine in ['dinfer', 'fast-dllm']:
        engine_results = results_df[results_df['Engine'] == engine]
        if not engine_results.empty:
            engine_speeds = engine_results.groupby('Algorithm')['Tokens/Sec'].mean().sort_values(ascending=False)
            print(f"\n🏆 {engine.upper()} Algorithm Ranking (by average speed):")
            for i, (algo, speed) in enumerate(engine_speeds.items(), 1):
                print(f"   {i}. {algo}: {speed:.2f} tok/s")
    
    print(f"\n✅ Comprehensive benchmarking analysis complete!")
    print(f"   📊 Total experiments: {len(results_df)}")
    print(f"   🧠 Algorithms tested: {len(results_df['Algorithm'].unique())}")
    print(f"   📈 Generation lengths: {len(BENCHMARK_CONFIG['generation_lengths'])}")
    print(f"   🔄 Trials per config: {BENCHMARK_CONFIG['num_trials']}")
else:
    print("❌ No results to analyze!")
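
The speed analysis labels each dInfer result relative to the Fast-dLLM baseline using two ratio thresholds (above 1.2 is faster, above 0.8 is comparable, otherwise slower). The sketch below isolates that rule; the function name `classify_speedup` is hypothetical and the throughputs are made-up values for illustration.

```python
def classify_speedup(candidate_tps, baseline_tps, faster=1.2, slower=0.8):
    """Label a candidate's throughput relative to a baseline throughput."""
    if baseline_tps <= 0:
        return None, "baseline unavailable"
    ratio = candidate_tps / baseline_tps
    if ratio > faster:
        return ratio, "faster"
    if ratio > slower:
        return ratio, "comparable"
    return ratio, "slower"


# Made-up throughputs for illustration only
print(classify_speedup(150.0, 100.0))  # (1.5, 'faster')
print(classify_speedup(90.0, 100.0))   # (0.9, 'comparable')
print(classify_speedup(50.0, 100.0))   # (0.5, 'slower')
```

The band between the two thresholds treats differences of up to about 20% as noise, which seems reasonable given that each configuration is averaged over only a few trials.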


## 10. Export Results (Optional)

Save the benchmark results to CSV for later analysis.


In [None]:
if results_list:
    # Create output directory if it doesn't exist
    output_dir = Path(project_root) / 'notebooks' / 'benchmark_results'
    output_dir.mkdir(parents=True, exist_ok=True)
    
    # Generate filename with timestamp
    from datetime import datetime
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    output_file = output_dir / f'dinfer_vs_fastdllm_{timestamp}.csv'
    
    # Save to CSV
    results_df.to_csv(output_file, index=False)
    print(f"✅ Results saved to: {output_file}")
else:
    print("❌ No results to export!")


---

## Summary

This notebook provides a comprehensive benchmark comparison between dInfer and Fast-dLLM generation methods:

### Algorithms Compared:
1. **dInfer Methods** (using the new generation API):
   - `dinfer_blockwise` - Threshold-based parallel decoder with dual cache
   - `dinfer_hierarchy` - Hierarchical parallel decoder with dual cache
   - `dinfer_credit` - Credit threshold parallel decoder with dual cache

2. **Fast-dLLM Baseline**:
   - `dual_cache` - Dual caching for optimal Fast-dLLM performance

### Key Features:
- ✅ Uses the updated generation API from `llada_generate` module
- ✅ Automatic algorithm discovery and loading
- ✅ Multiple generation lengths tested (256, 512, 1024, 2048 tokens)
- ✅ Multiple trials per configuration for reliable averages
- ✅ Comprehensive visualizations and performance analysis
- ✅ Direct speed comparison with speedup ratios
- ✅ Efficiency analysis (tokens per forward evaluation)
- ✅ Results export to CSV

### Usage Notes:
- Ensure both dInfer and Fast-dLLM are installed in the `3rdparty/` directory
- The notebook automatically discovers the project root and imports the generation registry
- Models are loaded and unloaded sequentially to manage GPU memory
- Results can be exported to CSV for further analysis

### Expected Output:
- Performance tables showing tokens/sec for each algorithm
- Visualization plots comparing speed, scaling, and efficiency
- Detailed speed analysis showing which method is faster at each generation length
- Algorithm rankings by engine type

