# LLM Batch Helper: Performance Comparison Tutorial

This notebook demonstrates the dramatic performance improvements you get when using LLM Batch Helper compared to naive sequential API calls.

## What We'll Demonstrate

1. **Generate 5K Test Prompts** - Create a large dataset for testing
2. **Naive Approach** - Time a few sequential API calls and extrapolate to the full dataset
3. **Smart Start** - Begin with 10 examples to validate output quality
4. **Medium Scale** - Process 2.5K prompts with 100 concurrent requests
5. **Large Scale with Caching** - Process 5K prompts with 500 concurrent requests, leveraging cache

## Key Benefits We'll See

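- **🚀 Speed**: Roughly 60x faster than sequential processing in the runs below; expect anywhere from 10x to 100x depending on rate limits and request latency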
- **💾 Caching**: Resume interrupted work without losing progress
- **⚙️ Tunable Concurrency**: Adjust for rate limits and optimal performance
- **🛡️ Reliability**: Built-in retry logic and error handling
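To make the concurrency benefit concrete, here is a minimal sketch of bounded-concurrency fan-out using `asyncio.Semaphore`. It illustrates the general technique only: `fake_api_call`, `run_bounded`, and `timed_run` are hypothetical names, the sleep stands in for network latency, and nothing here is taken from LLM Batch Helper's internals.

```python
import asyncio
import time


async def fake_api_call(prompt: str, latency: float = 0.05) -> str:
    """Hypothetical stand-in for a network call; sleeps instead of hitting an API."""
    await asyncio.sleep(latency)
    return f"echo: {prompt}"


async def run_bounded(prompts, max_concurrent: int = 10):
    """Run all prompts, allowing at most `max_concurrent` calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def worker(prompt):
        async with semaphore:  # waits here once the concurrency limit is reached
            return await fake_api_call(prompt)

    # gather returns results in the same order as the input prompts
    return await asyncio.gather(*(worker(p) for p in prompts))


def timed_run(prompts, max_concurrent):
    """Run the batch from synchronous code and return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = asyncio.run(run_bounded(prompts, max_concurrent))
    return results, time.perf_counter() - start


demo_prompts = [f"test {i}" for i in range(20)]
sequential_results, sequential_time = timed_run(demo_prompts, max_concurrent=1)
concurrent_results, concurrent_time = timed_run(demo_prompts, max_concurrent=10)
print(f"1 at a time: {sequential_time:.2f}s, 10 at a time: {concurrent_time:.2f}s")
```

`asyncio.run` works in a plain script. Inside a notebook an event loop is already running, so you would typically write `await run_bounded(demo_prompts, 10)` in a cell instead.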

Let's get started! 🎯


## Setup and Imports


In [None]:
# Install required packages if not already installed
%pip install llm_batch_helper openai tqdm


In [3]:
import os
import time
import json
from datetime import datetime
from tqdm import tqdm
import openai

# Import LLM Batch Helper
from llm_batch_helper import LLMConfig, process_prompts_batch

print("📦 All packages imported successfully!")


📦 All packages imported successfully!


In [None]:
# Check if OpenAI API key is set
import os

# To set the key from inside the notebook, uncomment the next line and fill it in:
# os.environ["OPENAI_API_KEY"] = ""

if not os.environ.get("OPENAI_API_KEY"):
    print("⚠️  WARNING: OPENAI_API_KEY not found!")
    print("Please set your API key: export OPENAI_API_KEY='your-api-key'")
    print("Or add it to a .env file in your project directory")
else:
    print("✅ OpenAI API key is configured!")
    print("🚀 Ready to start the performance comparison!")


✅ OpenAI API key is configured!
🚀 Ready to start the performance comparison!
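The warning above mentions a `.env` file. Whether LLM Batch Helper reads that file on its own is not stated here, so the following is a minimal standard-library sketch of loading one yourself. `load_env_file` and `DEMO_API_KEY` are hypothetical names used for illustration, and the parser handles only simple `KEY=VALUE` lines.

```python
import os
import tempfile
from pathlib import Path


def load_env_file(path):
    """Load KEY=VALUE lines into os.environ without overriding existing values."""
    loaded = {}
    for raw_line in Path(path).read_text().splitlines():
        line = raw_line.strip()
        # Skip blanks, comments, and anything that is not an assignment
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip().strip("'\"")
        loaded[key] = value
        os.environ.setdefault(key, value)
    return loaded


# Demo with a throwaway file and a placeholder variable name
with tempfile.TemporaryDirectory() as tmp:
    env_path = Path(tmp) / ".env"
    env_path.write_text("# demo file\nDEMO_API_KEY='not-a-real-key'\n")
    demo_values = load_env_file(env_path)

print(demo_values)
```

Using `setdefault` means a key already exported in your shell takes precedence over the file, which is the usual expectation for local overrides.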


## Step 1: Generate 5K Test Prompts

First, let's create a large dataset of simple test prompts to benchmark performance.


In [12]:
# Generate 5K naive test prompts
def generate_test_prompts(count=5000):
    """Generate simple test prompts for performance testing."""
    return [f"test {i}: just respond {i}" for i in range(1, count + 1)]

# Generate the prompts
print("📝 Generating test prompts...")
test_prompts = generate_test_prompts(5000)

print(f"✅ Generated {len(test_prompts):,} test prompts")
print("\n📋 Sample prompts:")
for i in range(5):
    print(f"  {i+1}. {test_prompts[i]}")
print("  ...")
for i in range(-3, 0):
    print(f"  {len(test_prompts)+i+1}. {test_prompts[i]}")


📝 Generating test prompts...
✅ Generated 5,000 test prompts

📋 Sample prompts:
  1. test 1: just respond 1
  2. test 2: just respond 2
  3. test 3: just respond 3
  4. test 4: just respond 4
  5. test 5: just respond 5
  ...
  4998. test 4998: just respond 4998
  4999. test 4999: just respond 4999
  5000. test 5000: just respond 5000


## Step 2: Naive Approach - Sequential API Calls

Let's see how long it would take to process these prompts using the traditional sequential approach with a simple for loop.


In [21]:
def naive_chat_completion(prompt, model="gpt-4o-mini"):
    """Naive function that makes a single ChatGPT API call."""
    client = openai.OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "user", "content": prompt}
        ],
        max_tokens=50,
        temperature=1.0
    )
    return response.choices[0].message.content

def estimate_naive_approach_time(prompts, sample_size=5):
    """Estimate how long the naive approach would take by timing a small sample."""
    print(f"⏱️  Testing naive approach with {sample_size} sample prompts...")
    
    start_time = time.time()
    results = []
    
    for i, prompt in enumerate(tqdm(prompts[:sample_size], desc="Naive API calls")):
        try:
            response = naive_chat_completion(prompt)
            results.append({"prompt": prompt, "response": response})
            print(f"✅ Sample {i+1}: {response}")
        except Exception as e:
            print(f"❌ Error on sample {i+1}: {e}")
            results.append({"prompt": prompt, "error": str(e)})
    
    elapsed_time = time.time() - start_time
    avg_time_per_request = elapsed_time / sample_size
    
    # Estimate total time for all prompts
    total_estimated_time = avg_time_per_request * len(prompts)
    
    print(f"\n📊 Naive Approach Performance:")
    print(f"   ⏱️  Average time per request: {avg_time_per_request:.2f} seconds")
    print(f"   📝 Sample size: {sample_size} prompts")
    print(f"   🕐 Time for sample: {elapsed_time:.2f} seconds")
    print(f"\n🚨 Estimated time for {len(prompts):,} prompts:")
    print(f"   ⏳ Total: {total_estimated_time:.0f} seconds")
    print(f"   ⏳ That's {total_estimated_time/60:.1f} minutes")
    print(f"   ⏳ Or {total_estimated_time/3600:.1f} hours!")
    
    # Worst case: assume 30 seconds per call, independent of the measured average
    worst_case_hours = 30 * len(prompts) / 3600
    print("Sometimes a single API call can take 10-30 seconds,")
    print("depending on the prompt length, the response length, and the model you are using.")
    print(f"In that case, the total time can reach {worst_case_hours:.1f} hours for {len(prompts):,} prompts.")
    return results, avg_time_per_request

# Run the estimation (only if API key is available)
if os.environ.get("OPENAI_API_KEY"):
    naive_results, avg_time = estimate_naive_approach_time(test_prompts, sample_size=3)
else:
    avg_time = 2.0  # Assumed typical latency so later cells can still compute a speedup
    print("⚠️  Skipping naive approach test - no API key configured")
    print("💡 With typical API response times of 1-3 seconds per request:")
    print("   📊 5,000 prompts × 2 seconds = 10,000 seconds")
    print("   ⏳ That's 2.8 hours of sequential processing!")

naive_estimated_time = avg_time * len(test_prompts)

⏱️  Testing naive approach with 3 sample prompts...


Naive API calls:  33%|███▎      | 1/3 [00:00<00:00,  2.81it/s]

✅ Sample 1: 1


Naive API calls:  67%|██████▋   | 2/3 [00:00<00:00,  2.74it/s]

✅ Sample 2: 2


Naive API calls: 100%|██████████| 3/3 [00:01<00:00,  2.17it/s]

✅ Sample 3: 3

📊 Naive Approach Performance:
   ⏱️  Average time per request: 0.46 seconds
   📝 Sample size: 3 prompts
   🕐 Time for sample: 1.39 seconds

🚨 Estimated time for 5,000 prompts:
   ⏳ Total: 2310 seconds
   ⏳ That's 38.5 minutes
   ⏳ Or 0.6 hours!
Sometimes a single API call can take 10-30 seconds,
depending on the prompt length, the response length, and the model you are using.
In that case, the total time can reach 41.7 hours for 5,000 prompts.
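The extrapolation behind these estimates is a single multiplication: number of prompts times seconds per request. As a small worked sketch, the hypothetical helper `estimate_sequential_time` below applies it to three latencies: the 0.46 s average measured above, the 2 s figure used in the no-API-key fallback, and a slow 30 s call.

```python
def estimate_sequential_time(num_prompts, seconds_per_request):
    """Return (seconds, minutes, hours) for strictly sequential processing."""
    total_seconds = num_prompts * seconds_per_request
    return total_seconds, total_seconds / 60, total_seconds / 3600


for latency in (0.46, 2.0, 30.0):
    seconds, minutes, hours = estimate_sequential_time(5000, latency)
    print(f"{latency:>5.2f} s/request -> {seconds:>9,.0f} s = {minutes:>7.1f} min = {hours:>5.1f} h")
```

Because the estimate scales linearly with latency, a three-request sample is only a rough guide; a larger sample gives a steadier average.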





## Step 3: Smart Start - Test with 10 Examples

Before processing thousands of prompts, let's start small to validate our approach and check output quality.


In [22]:
# Create configuration for LLM Batch Helper
config = LLMConfig(
    model_name="gpt-4o-mini",
    temperature=1.0,
    max_completion_tokens=50,
    max_concurrent_requests=10,  # Start conservative
    max_retries=3
)

print("⚙️  Configuration created:")
print(f"   🤖 Model: {config.model_name}")
print(f"   🌡️  Temperature: {config.temperature}")
print(f"   🔢 Max tokens: {config.max_completion_tokens}")
print(f"   🚀 Max concurrent: {config.max_concurrent_requests}")
print(f"   🔄 Max retries: {config.max_retries}")


⚙️  Configuration created:
   🤖 Model: gpt-4o-mini
   🌡️  Temperature: 1.0
   🔢 Max tokens: 50
   🚀 Max concurrent: 10
   🔄 Max retries: 3


In [23]:
# Test with first 10 prompts to validate output quality
print("🧪 Testing with first 10 prompts to validate output quality...")

small_test_prompts = test_prompts[:10]

start_time = time.time()
small_results = process_prompts_batch(
    config=config,
    provider="openai",
    prompts=small_test_prompts,
    cache_dir="performance_cache",
    desc="Small Scale Test (10 prompts)"
)
elapsed_time = time.time() - start_time

print(f"\n⏱️  Completed in {elapsed_time:.2f} seconds")
print(f"📊 Average time per prompt: {elapsed_time/len(small_test_prompts):.2f} seconds")

# Validate outputs
print("\n🔍 Validating output quality:")
successful = 0
cached = 0
failed = 0

for prompt_id, response in small_results.items():
    if "error" in response:
        status = "❌ [ERROR]"
        print(f"{status} {prompt_id}: {response['error'][:100]}...")
        failed += 1
    elif response.get("from_cache"):
        status = "💾 [CACHE]"
        print(f"{status} {response['response_text']}")
        cached += 1
        successful += 1
    else:
        status = "✅ [NEW]"
        print(f"{status} {response['response_text']}")
        successful += 1

print(f"\n📈 Results: {successful} successful ({cached} cached), {failed} failed")
print(f"📊 Success rate: {successful/len(small_test_prompts)*100:.1f}%")

if successful == len(small_test_prompts):
    print("🎉 All outputs look good! Ready to scale up.")
else:
    print("⚠️  Some outputs failed. You might want to adjust the configuration.")


🧪 Testing with first 10 prompts to validate output quality...


Small Scale Test (10 prompts): 100%|██████████| 10/10 [00:00<00:00, 13.67it/s]


⏱️  Completed in 0.73 seconds
📊 Average time per prompt: 0.07 seconds

🔍 Validating output quality:
✅ [NEW] 1
✅ [NEW] 2
✅ [NEW] 3
✅ [NEW] 4
✅ [NEW] 5
✅ [NEW] 6
✅ [NEW] 7
✅ [NEW] 8
✅ [NEW] 9
✅ [NEW] 10

📈 Results: 10 successful (0 cached), 0 failed
📊 Success rate: 100.0%
🎉 All outputs look good! Ready to scale up.
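The success, cache, and failure counts are computed the same way in every step of this notebook, so it can help to factor them into one function. The sketch below assumes only the result shape visible in the code above (an `"error"` key on failures, an optional `"from_cache"` flag, and `"response_text"` on successes). `summarize_results` is a hypothetical helper, not part of the library.

```python
def summarize_results(results):
    """Count successful, cached, and failed entries in a results dict."""
    summary = {"successful": 0, "cached": 0, "failed": 0}
    for response in results.values():
        if "error" in response:
            summary["failed"] += 1
            continue
        summary["successful"] += 1
        if response.get("from_cache"):
            summary["cached"] += 1
    total = len(results)
    # Guard against an empty batch to avoid dividing by zero
    summary["success_rate"] = summary["successful"] / total * 100 if total else 0.0
    return summary


# Hand-written sample data that mimics the shape used in this notebook
sample_results = {
    "prompt_a": {"response_text": "1"},
    "prompt_b": {"response_text": "2", "from_cache": True},
    "prompt_c": {"error": "rate limit exceeded"},
}
sample_summary = summarize_results(sample_results)
print(sample_summary)
```

With a helper like this, each later step could call `summarize_results(medium_results)` instead of repeating the counting logic.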





## Step 4: Medium Scale - 2.5K Examples with 100 Concurrent Requests

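Now let's scale up to 2,500 prompts with higher concurrency to see the real performance benefits. Note that `naive_estimated_time` covers all 5,000 prompts, so the speedup printed below is measured against the full-dataset estimate; compared with 2,500 sequential calls alone, the speedup is roughly half the printed figure.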


In [24]:
# Update configuration for medium scale processing
medium_config = LLMConfig(
    model_name="gpt-4o-mini",
    temperature=1.0,
    max_completion_tokens=50,
    max_concurrent_requests=100,  # Increase concurrency
    max_retries=3
)

# Process 2.5K prompts
medium_prompts = test_prompts[:2500]

print(f"🚀 Processing {len(medium_prompts):,} prompts with {medium_config.max_concurrent_requests} concurrent requests...")
print(f"💡 This would take ~{naive_estimated_time / 3600:.1f} hours with naive approach!")

start_time = time.time()
medium_results = process_prompts_batch(
    config=medium_config,
    provider="openai",
    prompts=medium_prompts,
    cache_dir="performance_cache",
    desc=f"Medium Scale ({len(medium_prompts):,} prompts)"
)
elapsed_time = time.time() - start_time

# Analyze results
successful = sum(1 for r in medium_results.values() if "error" not in r)
cached = sum(1 for r in medium_results.values() if r.get("from_cache", False))
failed = len(medium_results) - successful

print(f"\n🎯 Medium Scale Results:")
print(f"   ⏱️  Total time: {elapsed_time:.1f} seconds ({elapsed_time/60:.1f} minutes)")
print(f"   📊 Average time per prompt: {elapsed_time/len(medium_prompts):.3f} seconds")
print(f"   ✅ Successful: {successful:,}")
print(f"   💾 From cache: {cached:,}")
print(f"   ❌ Failed: {failed:,}")
print(f"   📈 Success rate: {successful/len(medium_prompts)*100:.1f}%")

# Calculate speedup vs naive approach
speedup = naive_estimated_time / elapsed_time
print(f"\n🚀 Performance Improvement:")
print(f"   📊 Naive approach would take: {naive_estimated_time/60:.1f} minutes")
print(f"   ⚡ Our approach took: {elapsed_time/60:.1f} minutes")
print(f"   🎯 Speedup: {speedup:.1f}x faster!")


🚀 Processing 2,500 prompts with 100 concurrent requests...
💡 This would take ~0.6 hours with naive approach!


Medium Scale (2,500 prompts): 100%|██████████| 2500/2500 [00:37<00:00, 65.86it/s] 


🎯 Medium Scale Results:
   ⏱️  Total time: 38.0 seconds (0.6 minutes)
   📊 Average time per prompt: 0.015 seconds
   ✅ Successful: 2,500
   💾 From cache: 10
   ❌ Failed: 0
   📈 Success rate: 100.0%

🚀 Performance Improvement:
   📊 Naive approach would take: 38.5 minutes
   ⚡ Our approach took: 0.6 minutes
   🎯 Speedup: 60.8x faster!
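A quick sanity check on these numbers uses Little's law: with a fixed number of requests in flight, throughput is at most concurrency divided by per-request latency. The sketch below plugs in the figures from the runs above (0.46 s average latency, 100 concurrent requests, 2,500 prompts in 38.0 s). `ideal_throughput` is a hypothetical helper, and the ceiling assumes latency stays constant under load, which is rarely true in practice.

```python
def ideal_throughput(max_concurrent, seconds_per_request):
    """Upper bound on requests/second if every slot stays busy (Little's law)."""
    return max_concurrent / seconds_per_request


# Figures taken from the runs above
latency = 0.46                 # average seconds per sequential request
observed = 2500 / 38.0         # prompts per second in the medium-scale run
ceiling = ideal_throughput(100, latency)

print(f"Ideal ceiling:   {ceiling:.0f} requests/s")
print(f"Observed:        {observed:.0f} requests/s")
print(f"Slot efficiency: {observed / ceiling:.0%}")
```

A gap between the observed rate and the ceiling can come from provider rate limits, higher latency under load, or client-side overhead, so raising `max_concurrent_requests` further may bring diminishing returns.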





## Step 5: Large Scale with Caching - 5K Examples

Now for the grand finale! Let's process all 5K prompts with even higher concurrency. The first 2.5K should be served from cache, demonstrating the caching feature.


In [25]:
# Configuration for large scale processing
large_config = LLMConfig(
    model_name="gpt-4o-mini",
    temperature=1.0,
    max_completion_tokens=50,
    max_concurrent_requests=500,  # Much higher concurrency!
    max_retries=3
)

# Process all 5K prompts
all_prompts = test_prompts  # All 5,000 prompts

print(f"🎯 Processing ALL {len(all_prompts):,} prompts!")
print(f"💾 First {len(medium_prompts):,} should be served from cache")
print(f"🚀 Only {len(all_prompts) - len(medium_prompts):,} new prompts need processing")
print(f"⚡ Using {large_config.max_concurrent_requests} concurrent requests")

start_time = time.time()
large_results = process_prompts_batch(
    config=large_config,
    provider="openai",
    prompts=all_prompts,
    cache_dir="performance_cache",
    desc=f"Large Scale ({len(all_prompts):,} prompts)"
)
elapsed_time = time.time() - start_time

# Comprehensive analysis
successful = sum(1 for r in large_results.values() if "error" not in r)
cached = sum(1 for r in large_results.values() if r.get("from_cache", False))
generated = successful - cached
failed = len(large_results) - successful

print(f"\n🏆 FINAL RESULTS - Large Scale Processing:")
print(f"{'='*60}")
print(f"   📊 Total prompts: {len(all_prompts):,}")
print(f"   ⏱️  Total time: {elapsed_time:.1f} seconds ({elapsed_time/60:.1f} minutes)")
print(f"   📈 Average time per prompt: {elapsed_time/len(all_prompts):.4f} seconds")
print(f"\n📋 Response Breakdown:")
print(f"   ✅ Successful: {successful:,} ({successful/len(all_prompts)*100:.1f}%)")
print(f"   💾 From cache: {cached:,} ({cached/len(all_prompts)*100:.1f}%)")
print(f"   🆕 Newly generated: {generated:,} ({generated/len(all_prompts)*100:.1f}%)")
print(f"   ❌ Failed: {failed:,} ({failed/len(all_prompts)*100:.1f}%)")

# Ultimate performance comparison
print(f"\n🚀 PERFORMANCE COMPARISON:")
print(f"{'='*60}")
speedup = naive_estimated_time / elapsed_time
print(f"   📊 Naive sequential approach: ~{naive_estimated_time/3600:.1f} hours")
print(f"   ⚡ LLM Batch Helper: {elapsed_time/60:.1f} minutes")
print(f"   🎯 SPEEDUP: {speedup:.0f}x FASTER!")
print(f"   💰 Time saved: {(naive_estimated_time - elapsed_time)/3600:.1f} hours")

# Caching effectiveness
print(f"\n💾 CACHING EFFECTIVENESS:")
print(f"{'='*60}")
cache_hit_rate = cached / len(all_prompts) * 100
print(f"   📊 Cache hit rate: {cache_hit_rate:.1f}%")
print(f"   ⚡ Only processed {generated:,} new prompts out of {len(all_prompts):,}")
print(f"   💡 This means you can interrupt and resume without losing work!")


🎯 Processing ALL 5,000 prompts!
💾 First 2,500 should be served from cache
🚀 Only 2,500 new prompts need processing
⚡ Using 500 concurrent requests


Large Scale (5,000 prompts): 100%|██████████| 5000/5000 [00:36<00:00, 137.15it/s]


🏆 FINAL RESULTS - Large Scale Processing:
   📊 Total prompts: 5,000
   ⏱️  Total time: 36.5 seconds (0.6 minutes)
   📈 Average time per prompt: 0.0073 seconds

📋 Response Breakdown:
   ✅ Successful: 5,000 (100.0%)
   💾 From cache: 2,500 (50.0%)
   🆕 Newly generated: 2,500 (50.0%)
   ❌ Failed: 0 (0.0%)

🚀 PERFORMANCE COMPARISON:
   📊 Naive sequential approach: ~0.6 hours
   ⚡ LLM Batch Helper: 0.6 minutes
   🎯 SPEEDUP: 63x FASTER!
   💰 Time saved: 0.6 hours

💾 CACHING EFFECTIVENESS:
   📊 Cache hit rate: 50.0%
   ⚡ Only processed 2,500 new prompts out of 5,000
   💡 This means you can interrupt and resume without losing work!
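To show why a cache makes interrupted runs resumable, here is a minimal sketch of a prompt-keyed disk cache. It is an assumption about how such a cache can work, not a description of LLM Batch Helper's actual file layout: `cache_path`, `cached_call`, and `fake_model` are hypothetical names, and the SHA-256 file naming is chosen for illustration.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def cache_path(cache_dir, prompt):
    """Map a prompt to a file name via a SHA-256 digest of its text."""
    digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    return Path(cache_dir) / f"{digest}.json"


def cached_call(prompt, cache_dir, call_fn):
    """Return a cached response if present; otherwise call `call_fn` and store it."""
    path = cache_path(cache_dir, prompt)
    if path.exists():
        record = json.loads(path.read_text())
        record["from_cache"] = True
        return record
    record = {"response_text": call_fn(prompt)}
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(record))
    return record


calls = []


def fake_model(prompt):
    """Hypothetical model stub that records each real invocation."""
    calls.append(prompt)
    return prompt.upper()


with tempfile.TemporaryDirectory() as tmp:
    first = cached_call("test 1: just respond 1", tmp, fake_model)
    second = cached_call("test 1: just respond 1", tmp, fake_model)

print(first, second, len(calls))
```

Because each response is written as soon as it arrives, a run that stops partway leaves its finished prompts on disk, and the next run only pays for the ones still missing.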





## Next Steps

Now that you've seen the power of LLM Batch Helper, here are some ways to apply it to your own projects:

### 🚀 For Your Own Projects:

1. **Start Small**: Always test with 10-50 examples first
2. **Find Your Sweet Spot**: Experiment with `max_concurrent_requests` (start with 50-100)
3. **Use Caching**: Set up a dedicated cache directory for each project
4. **Monitor and Tune**: Watch for rate limits and adjust concurrency accordingly
5. **Add Verification**: Use custom verification callbacks for quality control
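When you watch for rate limits, the usual companion technique is retrying with exponential backoff. The sketch below shows that general pattern with a simulated flaky call. `call_with_backoff` and `flaky` are hypothetical names, and this is not a description of how the library's `max_retries` option is implemented.

```python
import random
import time


def call_with_backoff(fn, max_retries=3, base_delay=0.5, sleep=time.sleep):
    """Call `fn`; on failure wait base_delay * 2**attempt (plus jitter) and retry."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)


attempts = {"count": 0}


def flaky():
    """Hypothetical call that fails twice, then succeeds."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"


waits = []
result = call_with_backoff(flaky, sleep=waits.append)  # record delays instead of sleeping
print(result, attempts["count"], [round(w, 2) for w in waits])
```

The random jitter spreads retries out so that many concurrent requests that failed together do not all retry at the same instant.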

### 📚 Learn More:

- Check out the main tutorial notebook for more features
- Read the documentation for advanced configuration options
- Explore different providers (OpenAI, OpenRouter, Together.ai)
- Try custom verification callbacks for your use cases

### 💡 Common Use Cases:

- **Data Labeling**: Process thousands of examples for ML training
- **Content Generation**: Create large amounts of content efficiently  
- **Evaluation**: Run LLM-as-a-judge on large datasets
- **Research**: Conduct large-scale experiments with different prompts
- **Simulation**: Run multi-agent conversations or scenarios

Happy batch processing! 🎯
