# GSM8K Genetic Algorithm for Prompt Evolution

## Complete Tutorial: Evolving Mathematical Reasoning Prompts

This notebook walks through using genetic algorithms to evolve prompts for mathematical reasoning on the GSM8K dataset. You'll learn how to:

- Set up the system and configure experiments
- Run evolution experiments with real-time monitoring
- Analyze results and interpret evolved prompts
- Customize parameters for different research goals

**Prerequisites:**
- OpenAI API key (for GPT models)
- Anthropic API key (for Claude models) - optional
- Python environment with required dependencies

---

## 1. System Setup and Dependencies

First, let's set up the environment and import all necessary modules.

In [None]:
# Install required packages if not already installed
import subprocess
import sys

def install_package(package):
    subprocess.check_call([sys.executable, "-m", "pip", "install", package])

# Uncomment and run if packages are not installed
# install_package("openai>=1.0.0")
# install_package("anthropic")
# install_package("matplotlib")
# install_package("numpy")
# install_package("psutil")

print("✅ Dependencies ready")
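
If you would rather not reinstall packages that are already present, one option is to check whether a module can be found before calling pip. The sketch below is illustrative and not part of the project: `is_installed` and `ensure_package` are hypothetical helper names, and the check relies on the standard library's `importlib.util.find_spec`. Keep in mind that the import name can differ from the pip name, which is why the two are separate arguments.

```python
import importlib.util
import subprocess
import sys
from typing import Optional

def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located (without importing it)."""
    return importlib.util.find_spec(module_name) is not None

def ensure_package(module_name: str, pip_spec: Optional[str] = None) -> bool:
    """Install pip_spec only when module_name is missing; return True if pip ran."""
    if is_installed(module_name):
        return False
    subprocess.check_call(
        [sys.executable, "-m", "pip", "install", pip_spec or module_name]
    )
    return True

# Standard-library modules are always present, so pip is not invoked here
ran_pip = ensure_package("json")
print(f"pip invoked: {ran_pip}")
```

For the packages listed above, a call could look like `ensure_package("openai", "openai>=1.0.0")`.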

In [None]:
# Import system modules
import os
import sys
import time
from pathlib import Path

# Add project root to Python path
project_root = Path.cwd()
if str(project_root) not in sys.path:
    sys.path.append(str(project_root))

print(f"📁 Project root: {project_root}")
print("✅ System imports ready")

## 2. API Configuration

Configure your API keys for accessing language models. The system supports both OpenAI and Anthropic models.

In [None]:
# Set up API keys
# Option 1: Set environment variables (recommended)
# os.environ["OPENAI_API_KEY"] = "your-openai-api-key-here"
# os.environ["ANTHROPIC_API_KEY"] = "your-anthropic-api-key-here"

# Option 2: Load from .env file
env_file = project_root / ".env"
if env_file.exists():
    with open(env_file, 'r') as f:
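        for line in f:
            line = line.strip()
            # Skip blank lines, comments, and lines without an assignment
            if not line or line.startswith('#') or '=' not in line:
                continue
            key, value = line.split('=', 1)
            # Strip whitespace and optional surrounding quotes from the value
            os.environ[key.strip()] = value.strip().strip('"\'')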
    print("✅ Environment variables loaded from .env file")
else:
    print("⚠️  No .env file found. Please set API keys manually.")

# Verify API keys are set
openai_key = os.getenv("OPENAI_API_KEY")
anthropic_key = os.getenv("ANTHROPIC_API_KEY")

print(f"🔑 OpenAI API Key: {'✅ Set' if openai_key else '❌ Not set'}")
print(f"🔑 Anthropic API Key: {'✅ Set' if anthropic_key else '❌ Not set'}")

## 3. Load System Components

Now let's load all the genetic algorithm components we'll need for our experiments.

In [None]:
# Import genetic algorithm components
from src.utils.config import config
from src.embeddings.vocabulary import vocabulary
from src.seeds.seed_manager import SeedManager
from src.config.experiment_configs import ConfigurationManager

print("✅ Core components imported")

In [None]:
# Initialize vocabulary
vocab_file = config.get_data_dir() / "embeddings" / "vocabulary.pkl"

if vocab_file.exists():
    vocabulary.load_vocabulary(vocab_file)
    print(f"✅ Vocabulary loaded: {len(vocabulary.token_to_id)} tokens")
else:
    print("📚 Creating vocabulary from scratch...")
    vocabulary._create_basic_vocabulary()
    print(f"✅ Basic vocabulary created: {len(vocabulary.token_to_id)} tokens")

In [None]:
# Initialize seed manager and configuration manager
seed_manager = SeedManager()
config_manager = ConfigurationManager()

# Load base seed collection
base_seeds = seed_manager.get_base_seeds()
print(f"🌱 Seed collection loaded: {len(base_seeds)} high-quality prompts")

# Show available experiment presets
presets = config_manager.list_presets()
print(f"⚙️  Available presets: {', '.join(presets)}")

## 4. Explore Seed Prompts

Let's examine the high-quality seed prompts that will initialize our genetic algorithm.

In [None]:
# Show seed prompt categories and examples
from src.seeds.prompt_categories import PromptCategory

print("📂 Seed Prompt Categories:")
print("=" * 40)

for category in PromptCategory:
    category_seeds = seed_manager.get_seeds_by_category(category)
    print(f"\n🔹 {category.value.replace('_', ' ').title()}: {len(category_seeds)} prompts")
    
    # Show first example
    if category_seeds:
        example = category_seeds[0]
        print(f"   Example: \"{example.text}\"")
        print(f"   Strength: {example.expected_strength}")

In [None]:
# Validate seed collection quality
from src.seeds.seed_validation import SeedValidator

validator = SeedValidator()
validation_metrics = validator.validate_collection(base_seeds)

print("🔍 Seed Collection Quality Report:")
print("=" * 40)
print(f"Overall Score: {validation_metrics.overall_score:.3f}")
print(f"Diversity Score: {validation_metrics.diversity_score:.3f}")
print(f"Category Balance: {validation_metrics.category_balance:.3f}")
print(f"Uniqueness Score: {validation_metrics.uniqueness_score:.3f}")

if validation_metrics.overall_score >= 0.8:
    quality_status = "🟢 EXCELLENT"
elif validation_metrics.overall_score >= 0.6:
    quality_status = "🟡 GOOD"
else:
    quality_status = "🔴 NEEDS IMPROVEMENT"
print(f"\nQuality Status: {quality_status}")

## 5. Configure Your Experiment

Now let's set up an experiment configuration. You can choose from predefined presets or customize parameters.

In [None]:
# Show available experiment presets
preset_info = config_manager.get_preset_info()

print("⚙️  Available Experiment Presets:")
print("=" * 50)

for name, info in preset_info.items():
    print(f"\n🔹 {name}")
    print(f"   Name: {info['name']}")
    print(f"   Description: {info['description']}")
    print(f"   Population: {info['population_size']}, Generations: {info['max_generations']}")
    print(f"   Problems: {info['max_problems']}")

In [None]:
# Choose and customize your experiment configuration
# Options: 'quick_test', 'standard', 'thorough', 'high_mutation', 'large_population', etc.

BASE_PRESET = "quick_test"  # Change this to your preferred preset

# Custom modifications (optional)
custom_modifications = {
    'name': 'My GSM8K Evolution Experiment',
    'description': 'Custom experiment for prompt evolution',
    'population_size': 15,  # Adjust as needed
    'max_generations': 20,  # Adjust as needed
    'max_problems': 30,     # Adjust as needed (more problems = more accurate but slower)
    'model_name': 'gpt-3.5-turbo',  # or 'gpt-4', 'claude-3-sonnet-20240229'
    'temperature': 0.0,     # 0.0 for near-deterministic output, higher for more varied output
    'target_fitness': 0.75, # Stop early if this fitness is reached
}

# Create the configuration
experiment_config = config_manager.create_custom_config(BASE_PRESET, custom_modifications)

# Show the final configuration
print("🔧 Experiment Configuration:")
print("=" * 40)
print(config_manager.get_config_summary(experiment_config))

In [None]:
# Validate the configuration
validation_errors = config_manager.validate_config(experiment_config)

if validation_errors:
    print("❌ Configuration validation failed:")
    for error in validation_errors:
        print(f"   - {error}")
else:
    print("✅ Configuration is valid and ready to use!")

## 6. Set Up Monitoring and Visualization

Before running the experiment, let's set up real-time monitoring and visualization.

In [None]:
# Import monitoring components
from src.utils.experiment_manager import ExperimentManager
from src.utils.evolution_logging import EvolutionLogger
from src.utils.visualization import EvolutionVisualizer
from src.utils.performance_monitor import PerformanceMonitor

# Initialize experiment manager
experiment_manager = ExperimentManager()

print("📊 Monitoring components initialized")
print("✅ Ready for experiment execution")

## 7. Run the Evolution Experiment

Now we'll run the complete genetic algorithm experiment with real-time monitoring.

In [None]:
# Import the main experiment runner
from src.main_runner import GSM8KExperimentRunner
from dataclasses import asdict

# Convert configuration to dictionary format
config_dict = asdict(experiment_config)

# Convert enums to strings for JSON compatibility
for key, value in config_dict.items():
    if hasattr(value, 'value'):
        config_dict[key] = value.value

print("🚀 Initializing experiment runner...")
runner = GSM8KExperimentRunner(config_dict)
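
The loop above converts enum values only at the top level of `config_dict`. If a configuration holds enums inside nested dataclasses or lists, a recursive helper is one way to handle them. The sketch below is self-contained and uses made-up types (`SelectionMethod`, `DemoOperators`, and `DemoConfig` are hypothetical stand-ins, not the project's classes); `enums_to_values` is likewise an illustrative name.

```python
import json
from dataclasses import asdict, dataclass, field
from enum import Enum

def enums_to_values(obj):
    """Recursively replace Enum members with their .value inside dicts and lists."""
    if isinstance(obj, Enum):
        return obj.value
    if isinstance(obj, dict):
        return {key: enums_to_values(value) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        # Tuples become lists, which is what JSON would produce anyway
        return [enums_to_values(value) for value in obj]
    return obj

# Hypothetical types, used only to make the example runnable
class SelectionMethod(Enum):
    TOURNAMENT = "tournament"
    ROULETTE = "roulette"

@dataclass
class DemoOperators:
    selection: SelectionMethod = SelectionMethod.TOURNAMENT

@dataclass
class DemoConfig:
    name: str = "demo"
    operators: DemoOperators = field(default_factory=DemoOperators)
    fallbacks: list = field(default_factory=lambda: [SelectionMethod.ROULETTE])

demo_dict = enums_to_values(asdict(DemoConfig()))
print(json.dumps(demo_dict, indent=2))
```

Checking `isinstance(value, Enum)` is also stricter than `hasattr(value, 'value')`, which would match any object that happens to expose a `value` attribute.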

In [None]:
# Set up the experiment
print("🔧 Setting up experiment components...")
setup_success = runner.setup_experiment()

if setup_success:
    print("✅ Experiment setup completed successfully!")
    print(f"📋 Experiment ID: {runner.experiment_id}")
else:
    print("❌ Experiment setup failed. Please check the error messages above.")

In [None]:
# Run the evolution experiment
if setup_success:
    print("🧬 Starting genetic algorithm evolution...")
    print("=" * 60)
    print(f"Population Size: {experiment_config.population_size}")
    print(f"Max Generations: {experiment_config.max_generations}")
    print(f"Evaluation Problems: {experiment_config.max_problems}")
    print(f"Model: {experiment_config.model_name}")
    print("=" * 60)
    
    # This will run the complete evolution process
    start_time = time.time()
    experiment_success = runner.run_experiment()
    total_time = time.time() - start_time
    
    print(f"\n⏱️  Total experiment time: {total_time:.1f} seconds")
    
    if experiment_success:
        print("🎉 Experiment completed successfully!")
    else:
        print("❌ Experiment failed. Check the logs for details.")
else:
    print("⚠️  Skipping experiment run due to setup failure.")

## 8. Analyze Results

Let's examine the results of our evolution experiment.

In [None]:
# Get experiment summary
if 'experiment_success' in locals() and experiment_success:
    summary = runner.get_experiment_summary()
    
    print("📊 Experiment Results Summary:")
    print("=" * 50)
    print(f"Status: {summary['status']}")
    print(f"Experiment ID: {summary['experiment_id']}")
    
    if 'results' in summary:
        results = summary['results']
        print(f"\n🏆 Evolution Results:")
        print(f"   Best Fitness: {results.get('best_fitness', 0):.3f}")
        print(f"   Total Generations: {results.get('total_generations', 0)}")
        print(f"   Convergence Reason: {results.get('convergence_reason', 'unknown')}")
        print(f"   Total Evaluations: {results.get('total_evaluations', 0)}")
        
        if summary.get('best_prompt'):
            print(f"\n🎯 Best Evolved Prompt:")
            print(f'   "{summary["best_prompt"]}"')
else:
    print("⚠️  No results to analyze - experiment was not run or failed.")

In [None]:
# Show performance statistics
if 'summary' in locals() and 'performance' in summary:
    perf = summary['performance']
    
    print("⚡ Performance Statistics:")
    print("=" * 40)
    print(f"Runtime: {perf.get('total_runtime_minutes', 0):.1f} minutes")
    
    if 'api_usage' in perf:
        api = perf['api_usage']
        print(f"\n🔌 API Usage:")
        print(f"   Total API Calls: {api.get('total_calls', 0)}")
        print(f"   Total Tokens: {api.get('total_tokens', 0):,}")
        print(f"   Tokens per Call: {api.get('tokens_per_call', 0):.1f}")
    
    if 'cache_performance' in perf:
        cache = perf['cache_performance']
        print(f"\n💾 Cache Performance:")
        print(f"   Hit Rate: {cache.get('hit_rate', 0):.1%}")
        print(f"   Total Hits: {cache.get('total_hits', 0)}")
        print(f"   Total Misses: {cache.get('total_misses', 0)}")
    
    if 'memory_usage' in perf:
        memory = perf['memory_usage']
        print(f"\n🧠 Memory Usage:")
        print(f"   Peak Memory: {memory.get('peak_mb', 0):.1f} MB")
        print(f"   Memory Growth: {memory.get('growth_mb', 0):.1f} MB")

## 9. Visualize Evolution Progress

Let's look at the evolution progress through visualizations.

In [None]:
# Display evolution plots if available
import matplotlib.pyplot as plt
from IPython.display import Image, display

if 'runner' in locals() and runner.experiment_id:
    # Look for generated plots
    plots_dir = config.get_data_dir() / "plots" / runner.experiment_id
    
    if plots_dir.exists():
        print("📈 Evolution Visualizations:")
        print("=" * 40)
        
        # Show final evolution progress plot
        final_plot = plots_dir / "final_evolution_progress.png"
        if final_plot.exists():
            print("\n🔹 Evolution Progress:")
            display(Image(str(final_plot)))
        
        # Show convergence analysis
        convergence_plot = plots_dir / "convergence_analysis.png"
        if convergence_plot.exists():
            print("\n🔹 Convergence Analysis:")
            display(Image(str(convergence_plot)))
        
        # List all available plots
        all_plots = list(plots_dir.glob("*.png"))
        print(f"\n📊 Total plots generated: {len(all_plots)}")
        for plot in all_plots:
            print(f"   - {plot.name}")
    else:
        print("⚠️  No visualization plots found.")
else:
    print("⚠️  No experiment data available for visualization.")

## 10. Compare with Baseline Prompts

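Let's put our evolved prompt next to some baseline prompts. Note that the cell below only lists the prompts side by side; to measure an actual improvement, each baseline has to be scored on the same evaluation problems as the evolved prompt.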

In [None]:
# Define baseline prompts for comparison
baseline_prompts = [
    "Solve this math problem.",
    "Let's solve this step by step.",
    "Think carefully and solve this problem.",
    "Calculate the answer to this question."
]

print("📋 Baseline Prompts for Comparison:")
print("=" * 40)
for i, prompt in enumerate(baseline_prompts, 1):
    print(f"{i}. \"{prompt}\"")

if 'summary' in locals() and summary.get('best_prompt'):
    print(f"\n🧬 Evolved Prompt:")
    print(f'   "{summary["best_prompt"]}"')
    
    print(f"\n🎯 Best Fitness Achieved: {summary.get('results', {}).get('best_fitness', 0):.3f}")
    print("\n💡 The evolved prompt should show improvements in:")
    print("   - Mathematical reasoning clarity")
    print("   - Step-by-step problem solving")
    print("   - Accuracy on GSM8K problems")
else:
    print("\n⚠️  No evolved prompt available for comparison.")
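
To turn the side-by-side listing into a measured comparison, each prompt needs an accuracy score on the same problems. The sketch below shows one way this could look, under stated assumptions: problems follow the GSM8K record format (a `question` field and an `answer` field whose final answer follows `####`), the model's final answer is taken to be the last number in its reply, and `fake_model` is a stub that stands in for a real API call so the example runs offline. The two problems are invented for illustration and are not taken from the dataset; `last_number`, `gold_answer`, and `accuracy` are hypothetical helper names, not the project's evaluation code.

```python
import re
from typing import Callable, Dict, List, Optional

_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")

def last_number(text: str) -> Optional[str]:
    """Return the last number in text in a canonical form, or None if there is none."""
    matches = _NUMBER.findall(text)
    if not matches:
        return None
    value = float(matches[-1].replace(",", ""))
    return str(int(value)) if value.is_integer() else str(value)

def gold_answer(solution: str) -> Optional[str]:
    """GSM8K reference solutions put the final answer after '####'."""
    if "####" not in solution:
        return None
    return last_number(solution.split("####")[-1])

def accuracy(prompt: str, problems: List[Dict[str, str]],
             ask_model: Callable[[str], str]) -> float:
    """Fraction of problems where the reply's last number matches the gold answer."""
    correct = 0
    for problem in problems:
        gold = gold_answer(problem["answer"])
        reply = ask_model(f"{prompt}\n\n{problem['question']}")
        if gold is not None and last_number(reply) == gold:
            correct += 1
    return correct / len(problems)

# Invented toy problems in GSM8K format (illustration only)
toy_problems = [
    {"question": "Tom has 3 apples and buys 4 more. How many apples does he have?",
     "answer": "3 + 4 = 7\n#### 7"},
    {"question": "A book costs $12. How much do 5 books cost?",
     "answer": "12 * 5 = 60\n#### 60"},
]

def fake_model(full_prompt: str) -> str:
    """Stub for an API call; its behaviour is arbitrary and only makes the demo run."""
    if "step by step" not in full_prompt:
        return "I am not sure."
    if "apples" in full_prompt:
        return "3 + 4 = 7. The answer is 7."
    return "5 * 12 = 60. The answer is 60."

scores = {p: accuracy(p, toy_problems, fake_model)
          for p in ["Solve this math problem.", "Let's solve this step by step."]}
for prompt, score in scores.items():
    print(f"{score:.2f}  {prompt}")
```

In a real comparison, `fake_model` would be replaced by a call to the chosen model, and the problems should be held out from those used during evolution.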

## 11. Advanced: Custom Experiment Configurations

Here are examples of how to set up different types of experiments for research purposes.

In [None]:
# Example 1: Ablation Study - No Crossover
ablation_config = config_manager.create_custom_config('standard', {
    'name': 'Ablation Study: Mutation Only',
    'crossover_rate': 0.0,
    'mutation_rate': 0.5,
    'max_generations': 50
})

print("🔬 Ablation Study Configuration:")
print(config_manager.get_config_summary(ablation_config))

In [None]:
# Example 2: Parameter Sweep - Different Model Comparison
model_configs = {
    'gpt-3.5-turbo': {'model_name': 'gpt-3.5-turbo', 'temperature': 0.0},
    'gpt-4': {'model_name': 'gpt-4', 'temperature': 0.0},
    'gpt-4-creative': {'model_name': 'gpt-4', 'temperature': 0.3}
}

print("🔄 Model Comparison Configurations:")
print("=" * 40)

for name, modifications in model_configs.items():
    # Use a distinct variable name so the `config` object imported from
    # src.utils.config is not shadowed for later cells
    model_config = config_manager.create_custom_config('quick_test', {
        'name': f'Model Comparison: {name}',
        **modifications
    })
    print(f"\n🔹 {name}:")
    print(f"   Model: {model_config.model_name}")
    print(f"   Temperature: {model_config.temperature}")
    print(f"   Population: {model_config.population_size}")
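
The example above varies a hand-written list of settings. For a fuller parameter sweep, one common approach is to expand a grid of values into every combination. The sketch below uses only `itertools.product`; `expand_grid` is a hypothetical helper name, and the grid values are examples rather than recommendations.

```python
from itertools import product
from typing import Any, Dict, List

def expand_grid(grid: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Return one dict per combination of the listed parameter values."""
    keys = list(grid)
    return [dict(zip(keys, combo)) for combo in product(*(grid[k] for k in keys))]

sweep = expand_grid({
    "model_name": ["gpt-3.5-turbo", "gpt-4"],
    "temperature": [0.0, 0.3],
    "mutation_rate": [0.1, 0.3],
})

for i, modifications in enumerate(sweep):
    label = ", ".join(f"{key}={value}" for key, value in modifications.items())
    print(f"{i}: {label}")
    # In this notebook, each dict could then be passed on, for example:
    # config_manager.create_custom_config('quick_test', {'name': f'Sweep {i}', **modifications})
```

The number of runs grows multiplicatively with each added parameter, so it is worth keeping grids small while API costs are still unknown.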

In [None]:
# Example 3: Custom Seed Prompts
custom_seeds = [
    "Let me approach this systematically by breaking down the problem.",
    "I'll solve this by identifying the key information and working step by step.",
    "To find the answer, I need to carefully analyze what's given and what's asked."
]

custom_seed_config = config_manager.create_custom_config('quick_test', {
    'name': 'Custom Seed Experiment',
    'custom_seeds': custom_seeds,
    'population_size': len(custom_seeds) * 3  # Expand from custom seeds
})

print("🌱 Custom Seed Configuration:")
print("=" * 40)
print(f"Custom Seeds: {len(custom_seeds)}")
print(f"Population Size: {custom_seed_config.population_size}")
print("\nCustom Seed Prompts:")
for i, seed in enumerate(custom_seeds, 1):
    print(f"   {i}. \"{seed}\"")

## 12. Experiment Management and History

Learn how to manage multiple experiments and track your research progress.

In [None]:
# List all experiments
all_experiments = experiment_manager.list_experiments()

print("📚 Experiment History:")
print("=" * 50)

if all_experiments:
    for exp in all_experiments[:5]:  # Show up to 5 experiments (the first five returned)
        print(f"\n🔹 {exp.experiment_name}")
        print(f"   ID: {exp.experiment_id}")
        print(f"   Status: {exp.status}")
        print(f"   Created: {time.ctime(exp.created_at)}")
        if exp.status == 'completed':
            print(f"   Best Fitness: {exp.best_fitness:.3f}")
            print(f"   Generations: {exp.total_generations}")
            print(f"   Runtime: {exp.total_time:.1f}s")
else:
    print("No experiments found in history.")

In [None]:
# Get experiment summary statistics
summary_stats = experiment_manager.get_experiment_summary()

print("📊 Overall Experiment Statistics:")
print("=" * 40)
print(f"Total Experiments: {summary_stats['total_experiments']}")
print(f"Completed: {summary_stats['status_counts'].get('completed', 0)}")
print(f"Running: {summary_stats['status_counts'].get('running', 0)}")
print(f"Failed: {summary_stats['status_counts'].get('failed', 0)}")

if summary_stats['completed_experiments'] > 0:
    print(f"\n📈 Averages (Completed Experiments):")
    print(f"   Average Best Fitness: {summary_stats['average_best_fitness']:.3f}")
    print(f"   Average Generations: {summary_stats['average_generations']:.1f}")
    print(f"   Average Runtime: {summary_stats['average_time']:.1f}s")

## 13. Tips and Best Practices

Here are some recommendations for getting the best results from your experiments.

### 🎯 **Experiment Design Tips:**

1. **Start Small**: Use `quick_test` preset first to validate your setup
2. **Problem Count**: More evaluation problems = more accurate fitness but slower evolution
3. **Population Size**: Larger populations explore more but cost more API calls
4. **Generations**: Allow enough generations for convergence (typically 50-100)

### ⚙️ **Parameter Tuning:**

- **High Exploration**: Increase mutation rate (0.3-0.5) and population size
- **High Exploitation**: Increase crossover rate (0.8-0.9) and elite size
- **Balanced**: Use default parameters from `standard` preset
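
To make the roles of these parameters concrete, here is a toy genetic algorithm over token lists. It is a minimal sketch and not the project's implementation: tournament selection of size 2, single-point crossover, and per-token mutation are all assumptions chosen for brevity, and `toy_fitness` (overlap with a fixed target phrase) is a stand-in for GSM8K accuracy so the example runs without API calls.

```python
import random
from typing import Callable, List

Prompt = List[str]

def evolve(population: List[Prompt], fitness: Callable[[Prompt], float],
           vocab: List[str], *, generations: int = 20, mutation_rate: float = 0.3,
           crossover_rate: float = 0.8, elite_size: int = 2, seed: int = 0) -> Prompt:
    """Run a small GA and return the fittest prompt of the final generation."""
    rng = random.Random(seed)
    pop = [list(p) for p in population]
    for _ in range(generations):
        ranked = sorted(pop, key=fitness, reverse=True)
        # Elitism: the best individuals are copied unchanged (exploitation)
        next_pop = [list(p) for p in ranked[:elite_size]]
        while len(next_pop) < len(pop):
            # Tournament selection of size 2 for each parent
            a = max(rng.sample(ranked, 2), key=fitness)
            b = max(rng.sample(ranked, 2), key=fitness)
            child = list(a)
            shortest = min(len(a), len(b))
            if shortest > 1 and rng.random() < crossover_rate:
                cut = rng.randint(1, shortest - 1)
                child = a[:cut] + b[cut:]  # single-point crossover
            # Per-token mutation: higher rates mean more exploration
            child = [rng.choice(vocab) if rng.random() < mutation_rate else tok
                     for tok in child]
            next_pop.append(child)
        pop = next_pop
    return max(pop, key=fitness)

TARGET = "solve the problem step by step".split()
VOCAB = sorted(set(TARGET + "please think answer quickly math carefully".split()))

def toy_fitness(prompt: Prompt) -> float:
    """Toy score: fraction of positions that match the target phrase."""
    return sum(tok == goal for tok, goal in zip(prompt, TARGET)) / len(TARGET)

init_rng = random.Random(1)
initial = [[init_rng.choice(VOCAB) for _ in TARGET] for _ in range(10)]
best = evolve(initial, toy_fitness, VOCAB, mutation_rate=0.1, elite_size=2)
print(" ".join(best), f"(fitness {toy_fitness(best):.2f})")
```

Because elites are carried over unchanged, the best fitness in this sketch can never decrease between generations; setting `elite_size=0` removes that guarantee.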

### 💰 **Cost Management:**

- Enable caching (`use_cache=True`) to avoid re-evaluating identical prompts
- Start with `gpt-3.5-turbo` before trying more expensive models
- Use smaller problem sets for initial experiments
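
A rough call-count estimate before launching a run can help with budgeting. The sketch below assumes one API call per (prompt, problem) pair in every generation and treats caching as a flat hit rate; the actual runner may evaluate fewer or more prompts, so this is an order-of-magnitude figure rather than a prediction. `estimate_api_calls` is a hypothetical helper name.

```python
def estimate_api_calls(population_size: int, max_generations: int,
                       max_problems: int, cache_hit_rate: float = 0.0) -> int:
    """Approximate API calls, assuming one call per prompt per problem per generation."""
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    prompt_evaluations = population_size * max_generations
    calls = prompt_evaluations * max_problems
    return round(calls * (1.0 - cache_hit_rate))

# Values from the custom configuration in section 5
no_cache = estimate_api_calls(15, 20, 30)
with_cache = estimate_api_calls(15, 20, 30, cache_hit_rate=0.25)
print(f"Without cache: {no_cache} calls, with 25% cache hits: {with_cache} calls")
```

Multiplying the result by the average tokens per call and the model's per-token price gives a cost figure to compare across presets.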

### 📊 **Result Interpretation:**

- Fitness > 0.8 indicates strong performance on the sampled evaluation problems; scores from small problem sets are noisy estimates
- Look for convergence patterns in the evolution plots
- Compare evolved prompts with baseline prompts on held-out test sets
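
When fitness is the fraction of problems answered correctly, a confidence interval shows how much a score from a small problem set can be trusted. The sketch below computes a Wilson score interval; treating fitness as a plain accuracy over independent problems is an assumption about how the score is defined, and `wilson_interval` is an illustrative helper name.

```python
import math
from typing import Tuple

def wilson_interval(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z=1.96 gives roughly 95% coverage)."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need total > 0 and 0 <= correct <= total")
    p = correct / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return centre - half_width, centre + half_width

# A fitness of 0.8 measured on 30 problems means 24 correct answers
low, high = wilson_interval(24, 30)
print(f"95% interval: {low:.3f} to {high:.3f}")
```

For 24 correct out of 30, the interval spans roughly 0.63 to 0.90, so two prompts whose fitness differs by a few points on 30 problems cannot be reliably ranked without more evaluation problems.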

## 14. Cleanup and Next Steps

Clean up resources and explore further research directions.

In [None]:
# Cleanup experiment resources
if 'runner' in locals():
    runner.cleanup()
    print("🧹 Experiment resources cleaned up")

print("\n🎉 Tutorial complete!")
print("\n🚀 Next Steps:")
print("   1. Try different experiment presets")
print("   2. Experiment with custom seed prompts")
print("   3. Compare different models (GPT-3.5, GPT-4, Claude)")
print("   4. Run ablation studies to understand component contributions")
print("   5. Evaluate evolved prompts on additional math datasets")

print("\n📚 For more advanced usage, see:")
print("   - scripts/run_experiment.py for command-line usage")
print("   - src/config/experiment_configs.py for configuration options")
print("   - Generated plots and logs in the data/ directory")