# GSM8K Genetic Algorithm Experiment

This notebook orchestrates genetic algorithm experiments that evolve prompts for solving GSM8K math problems.

**Target**: Reach 95% accuracy by evolving a 500-genome population for up to 30 generations.

## System Overview
- **Population Size**: 500 genomes
- **Generations**: Up to 30
- **Model**: GPT-4o
- **Selection**: Elite (20) + Diverse (1) + Random (1)
- **Mutation**: Semantic neighborhoods with two-level probability (population-level and token-level)
- **Evaluation**: Progressive (10/20/30 problems per genome in early/middle/late generations)
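
The selection scheme listed above (20 elite, 1 diverse, 1 random) can be sketched as follows. This is an illustration only, not the project's implementation: the `Genome` class, `select_parents`, and the Jaccard-based notion of "diverse" are all assumptions made for the example.

```python
import random
from dataclasses import dataclass


@dataclass
class Genome:
    """Hypothetical stand-in for the project's genome type."""
    text: str
    fitness: float


def jaccard_distance(a: str, b: str) -> float:
    """Word-set distance in [0, 1]; 0 means identical vocabularies."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return 1.0 - len(ta & tb) / len(union) if union else 0.0


def select_parents(population, elite_count=20, diverse_count=1,
                   random_count=1, rng=None):
    """Pick elites by fitness, then the most novel and a few random others."""
    rng = rng or random.Random()
    ranked = sorted(population, key=lambda g: g.fitness, reverse=True)
    elites, rest = ranked[:elite_count], ranked[elite_count:]

    # Assumption: "diverse" means the largest distance to the nearest elite.
    def novelty(g):
        return min((jaccard_distance(g.text, e.text) for e in elites),
                   default=0.0)

    diverse = sorted(rest, key=novelty, reverse=True)[:diverse_count]
    chosen = {id(g) for g in diverse}
    leftovers = [g for g in rest if id(g) not in chosen]
    randoms = rng.sample(leftovers, min(random_count, len(leftovers)))
    return elites + diverse + randoms


# Toy population: 30 genomes with distinct fitness values.
population = [Genome(f"prompt variant {i}", fitness=i / 30) for i in range(30)]
parents = select_parents(population, rng=random.Random(0))
print(len(parents))  # 22 = 20 elite + 1 diverse + 1 random
```

Passing a seeded `random.Random` keeps the random pick reproducible between runs.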

## 1. Setup and Configuration

In [1]:
import sys
import asyncio
import logging
import time
import json
from pathlib import Path
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Add src to path
sys.path.append('./src')

# Import our genetic algorithm components
from src.genetics.evolution_controller import EvolutionController
from src.genetics.generation_manager import GenerationManager
from src.utils.config import get_config
from src.utils.dataset import GSM8KDataset

# Setup logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

# Suppress verbose HTTP logs from httpx (used by OpenAI)
logging.getLogger('httpx').setLevel(logging.WARNING)
logging.getLogger('openai').setLevel(logging.WARNING)
logging.getLogger('urllib3').setLevel(logging.WARNING)

print("🧬 GSM8K Genetic Algorithm System Initialized")
print(f"📅 Experiment Date: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

🧬 GSM8K Genetic Algorithm System Initialized
📅 Experiment Date: 2025-09-04 03:31:37


## 2. Load Configuration and Display Settings

In [2]:
# Load configuration
config = get_config()

# Display current configuration
print("⚙️ Current Configuration:")
print(f"  Model: {config.get('model.name')}")
print(f"  Population Size: {config.get('genetic_algorithm.population_size')}")
print(f"  Max Generations: {config.get('genetic_algorithm.max_generations')}")
print(f"  Target Accuracy: {config.get('experiment.target_accuracy'):.1%}")
print(f"  Elite Selection: {config.get('selection.elite_count')}")
print(f"  Diverse Selection: {config.get('selection.diverse_count')}")
print(f"  Random Selection: {config.get('selection.random_count')}")
print(f"  Mutation Rate (Population): {config.get('mutation.population_mutation_prob'):.1%}")
print(f"  Mutation Rate (Token): {config.get('mutation.token_mutation_prob'):.3%}")
print(f"  Semantic Neighbor Prob: {config.get('mutation.semantic_neighbor_prob'):.1%}")

# Progressive evaluation settings
eval_config = config.get('evaluation.progressive_evaluation')
print(f"\n📊 Progressive Evaluation:")
print(f"  Early Generations (1-10): {eval_config['early_generations']['problems_per_genome']} problems")
print(f"  Middle Generations (11-20): {eval_config['middle_generations']['problems_per_genome']} problems")
print(f"  Late Generations (21-30): {eval_config['late_generations']['problems_per_genome']} problems")

⚙️ Current Configuration:
  Model: gpt-4o
  Population Size: 500
  Max Generations: 30
  Target Accuracy: 95.0%
  Elite Selection: 20
  Diverse Selection: 1
  Random Selection: 1
  Mutation Rate (Population): 80.0%
  Mutation Rate (Token): 0.200%
  Semantic Neighbor Prob: 90.0%

📊 Progressive Evaluation:
  Early Generations (1-10): 10 problems
  Middle Generations (11-20): 20 problems
  Late Generations (21-30): 30 problems
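
The progressive schedule printed above maps a generation number to an evaluation budget. A minimal sketch, assuming generation 0 (the initial population) uses the early budget; `SCHEDULE`, `problems_per_genome`, and `evaluations_per_generation` are hypothetical names, not part of the project's API.

```python
# (last generation in band, problems per genome), taken from the printed settings.
SCHEDULE = ((10, 10), (20, 20), (30, 30))


def problems_per_genome(generation: int, schedule=SCHEDULE) -> int:
    """Return the evaluation budget for a generation."""
    for last_generation, n_problems in schedule:
        if generation <= last_generation:
            return n_problems
    # Past the last band: keep the largest budget.
    return schedule[-1][1]


def evaluations_per_generation(generation: int, population_size: int = 500) -> int:
    """Upper bound on problem evaluations, before any caching."""
    return population_size * problems_per_genome(generation)


for gen in (0, 10, 11, 21):
    print(gen, problems_per_genome(gen), evaluations_per_generation(gen))
```

With 500 genomes, the early band costs at most 5,000 problem evaluations per generation and the late band at most 15,000, which is why the budget grows only as the population matures.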


## 3. Initialize System Components

In [3]:
# Initialize main components
print("🔧 Initializing system components...")

evolution_controller = EvolutionController()
generation_manager = GenerationManager()
dataset = GSM8KDataset()

print("✅ Evolution Controller initialized")
print("✅ Generation Manager initialized")
print("✅ Dataset Manager initialized")

# Verify dataset is ready
try:
    train_data, test_data = dataset.load_dataset()
    print(f"📚 Dataset loaded: {len(train_data)} training, {len(test_data)} test problems")
    
    # Load evaluation sets
    eval_sets = dataset.create_evaluation_sets()
    for set_name, set_data in eval_sets.items():
        print(f"  {set_name}: {len(set_data)} problems")
        
except Exception as e:
    print(f"❌ Dataset error: {e}")
    print("Please run: python scripts/prepare_dataset.py")

2025-09-04 03:31:37,971 - src.evaluation.llm_interface - INFO - Loaded 2100 cached responses
2025-09-04 03:31:37,973 - src.evaluation.population_evaluator - INFO - Loaded 35 cached evaluations
2025-09-04 03:31:37,975 - src.utils.dataset - INFO - Loading GSM8K dataset from disk...


🔧 Initializing system components...
✅ Evolution Controller initialized
✅ Generation Manager initialized
✅ Dataset Manager initialized


2025-09-04 03:31:38,445 - src.utils.dataset - INFO - Loaded 7473 training problems
2025-09-04 03:31:38,446 - src.utils.dataset - INFO - Loaded 1319 test problems
2025-09-04 03:31:38,448 - src.utils.dataset - INFO - Created primary evaluation set: 100 problems (seed=42)
2025-09-04 03:31:38,449 - src.utils.dataset - INFO - Created validation evaluation set: 100 problems (seed=43)
2025-09-04 03:31:38,450 - src.utils.dataset - INFO - Created final_test evaluation set: 200 problems (seed=44)
2025-09-04 03:31:38,455 - src.utils.dataset - INFO - Saved primary evaluation set to data/primary_evaluation_set.json
2025-09-04 03:31:38,459 - src.utils.dataset - INFO - Saved validation evaluation set to data/validation_evaluation_set.json
2025-09-04 03:31:38,463 - src.utils.dataset - INFO - Saved final_test evaluation set to data/final_test_evaluation_set.json


📚 Dataset loaded: 7473 training, 1319 test problems
  primary: 100 problems
  validation: 100 problems
  final_test: 200 problems
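
The log shows each evaluation set being created with its own seed (42, 43, 44). A minimal sketch of reproducible seeded sampling follows; `sample_eval_set` is a hypothetical helper, and the pool of 1,319 integers merely stands in for a problem list, since the notebook does not show which split `create_evaluation_sets` samples from.

```python
import random


def sample_eval_set(problems, size, seed):
    """Draw a reproducible subset: same seed, same problems."""
    return random.Random(seed).sample(problems, size)


# Stand-in pool; real entries would be problem records.
problems = list(range(1319))

eval_sets = {
    "primary": sample_eval_set(problems, 100, seed=42),
    "validation": sample_eval_set(problems, 100, seed=43),
    "final_test": sample_eval_set(problems, 200, seed=44),
}

# Independent seeds do not guarantee disjoint sets, so measure the overlap.
overlap = set(eval_sets["primary"]) & set(eval_sets["final_test"])
print({name: len(items) for name, items in eval_sets.items()}, len(overlap))
```

If the sets are drawn independently from the same pool, some problems may appear in more than one set, which is worth checking before treating `final_test` as held out.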


## 4. Define Seed Prompts

In [4]:
# Define diverse seed prompts for different problem-solving strategies
SEED_PROMPTS = [
    "Solve this step by step.",
    "Let's work through this problem carefully.",
    "First, identify what we need to find.",
    "Break down the problem into smaller parts.",
    "Calculate each step systematically.",
    "Let me solve this math problem step by step.",
    "To find the answer, I need to:",
    "Let's start by understanding what the problem is asking.",
    "I'll solve this by working through each part.",
    "Here's how to approach this problem:",
    "Step 1: Read the problem carefully.",
    "Let's organize the given information.",
    "I need to find the total by adding up all parts.",
    "To solve this, I'll use basic arithmetic.",
    "Let me calculate this step by step.",
    "First, let's identify the key numbers.",
    "I'll work through this systematically.",
    "Let's solve this math problem together.",
    "To get the answer, I need to calculate:",
    "Here's my step-by-step solution:",
    "Let me break this down into simple steps.",
    "I'll solve this using logical reasoning.",
    "First, I'll find what we know.",
    "Let's calculate the answer step by step.",
    "To solve this problem, I will:",
    "Let me work through this calculation.",
    "I need to find the solution by:",
    "Here's how I'll approach this:",
    "Let's solve this math word problem.",
    "I'll calculate the answer systematically.",
    "First, let me understand the problem.",
    "To find the solution, I'll:",
    "Let me solve this problem carefully.",
    "I'll work through each calculation.",
    "Here's my mathematical approach:",
    "Let's find the answer step by step.",
    "I need to calculate the total amount.",
    "To solve this, I'll use math operations.",
    "Let me figure out the answer.",
    "I'll solve this problem methodically.",
    "First, I'll identify the operation needed.",
    "Let's work out the solution together.",
    "I need to find the correct answer by:",
    "Here's how to solve this math problem:",
    "Let me calculate the final result.",
    "I'll solve this using arithmetic.",
    "To get the right answer, I'll:",
    "Let me work through this step by step.",
    "I need to solve for the unknown value.",
    "Here's my solution to this problem:"
]

print(f"🌱 Defined {len(SEED_PROMPTS)} diverse seed prompts")
print("\nSample seed prompts:")
for i, prompt in enumerate(SEED_PROMPTS[:5]):
    print(f"  {i+1}. {prompt}")
print(f"  ... and {len(SEED_PROMPTS)-5} more")

🌱 Defined 50 diverse seed prompts

Sample seed prompts:
  1. Solve this step by step.
  2. Let's work through this problem carefully.
  3. First, identify what we need to find.
  4. Break down the problem into smaller parts.
  5. Calculate each step systematically.
  ... and 45 more


## 5. Experiment Control Panel

In [5]:
# Experiment configuration
EXPERIMENT_CONFIG = {
    'experiment_name': f"GSM8K_GA_{datetime.now().strftime('%Y%m%d_%H%M%S')}",
    'population_size': config.get('genetic_algorithm.population_size'),
    'max_generations': config.get('genetic_algorithm.max_generations'),
    'use_full_config': True,  # Set to False for quick testing
    'save_checkpoints': True,
    'show_progress': True
}

# For quick testing, override with smaller values
if not EXPERIMENT_CONFIG['use_full_config']:
    EXPERIMENT_CONFIG.update({
        'population_size': 30,
        'max_generations': 5
    })
    evolution_controller.population_size = 30
    evolution_controller.max_generations = 5
    print("⚠️ Using QUICK TEST configuration (30 population, 5 generations)")
else:
    print("🚀 Using FULL SCALE configuration")

print(f"\n🧪 Experiment Configuration:")
for key, value in EXPERIMENT_CONFIG.items():
    print(f"  {key}: {value}")

🚀 Using FULL SCALE configuration

🧪 Experiment Configuration:
  experiment_name: GSM8K_GA_20250904_033138
  population_size: 500
  max_generations: 30
  use_full_config: True
  save_checkpoints: True
  show_progress: True


## 6. Evolution Progress Tracking

In [6]:
# Global variables for tracking progress
evolution_metrics = {
    'generations': [],
    'best_fitness': [],
    'best_accuracy': [],
    'mean_fitness': [],
    'population_diversity': [],
    'evaluation_times': [],
    'total_cost': 0.0,
    'start_time': None,
    'current_best_genome': None
}

def update_metrics(generation_stats, population):
    """Update evolution metrics with new generation data."""
    gen = generation_stats['generation']
    
    evolution_metrics['generations'].append(gen)
    evolution_metrics['best_fitness'].append(generation_stats['best_fitness'])
    evolution_metrics['best_accuracy'].append(generation_stats['best_accuracy'])
    evolution_metrics['evaluation_times'].append(generation_stats.get('generation_time_seconds', 0))
    
    # Calculate mean fitness
    evaluated_genomes = [g for g in population if g.fitness is not None]
    if evaluated_genomes:
        mean_fit = sum(g.fitness for g in evaluated_genomes) / len(evaluated_genomes)
        evolution_metrics['mean_fitness'].append(mean_fit)
        
        # Calculate diversity (fitness standard deviation)
        fitnesses = [g.fitness for g in evaluated_genomes]
        if len(fitnesses) > 1:
            mean_fitness = sum(fitnesses) / len(fitnesses)
            variance = sum((f - mean_fitness) ** 2 for f in fitnesses) / len(fitnesses)
            diversity = variance ** 0.5
        else:
            diversity = 0.0
        evolution_metrics['population_diversity'].append(diversity)
        
        # Track best genome
        best_genome = max(evaluated_genomes, key=lambda g: g.fitness)
        evolution_metrics['current_best_genome'] = {
            'generation': gen,
            'fitness': best_genome.fitness,
            'accuracy': best_genome.accuracy,
            'text': best_genome.to_text(),
            'length': len(best_genome.tokens)
        }

def display_progress():
    """Display current evolution progress."""
    if not evolution_metrics['generations']:
        print("No evolution data yet.")
        return
    
    current_gen = evolution_metrics['generations'][-1]
    best_fitness = evolution_metrics['best_fitness'][-1]
    best_accuracy = evolution_metrics['best_accuracy'][-1]
    mean_fitness = evolution_metrics['mean_fitness'][-1] if evolution_metrics['mean_fitness'] else 0
    
    elapsed_time = time.time() - evolution_metrics['start_time'] if evolution_metrics['start_time'] else 0
    
    print(f"\n📊 Generation {current_gen} Progress:")
    print(f"  Best Fitness: {best_fitness:.4f}")
    print(f"  Best Accuracy: {best_accuracy:.1%}")
    print(f"  Mean Fitness: {mean_fitness:.4f}")
    print(f"  Elapsed Time: {elapsed_time/60:.1f} minutes")
    
    if evolution_metrics['current_best_genome']:
        best = evolution_metrics['current_best_genome']
        print(f"  Best Genome: '{best['text']}' (length: {best['length']})")
    
    # Progress toward target
    target_accuracy = config.get('experiment.target_accuracy', 0.95)
    progress = best_accuracy / target_accuracy
    print(f"  Progress to Target: {progress:.1%} (target: {target_accuracy:.1%})")

print("📈 Evolution tracking system initialized")

📈 Evolution tracking system initialized
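
The diversity figure in `update_metrics` is the population standard deviation of fitness, which the standard library computes directly. A small check with made-up fitness values is shown below. Note that this measures spread in scores, not differences between prompt texts.

```python
import statistics

fitnesses = [0.6, 0.7, 0.7, 0.9, 1.0]  # made-up values for illustration

# Manual computation, mirroring update_metrics.
mean_fitness = sum(fitnesses) / len(fitnesses)
variance = sum((f - mean_fitness) ** 2 for f in fitnesses) / len(fitnesses)
manual_diversity = variance ** 0.5

# Same quantity from the standard library (population, not sample, std dev).
library_diversity = statistics.pstdev(fitnesses)

print(round(manual_diversity, 4), round(library_diversity, 4))
```

`statistics.pstdev` divides by `n`, matching the notebook's formula; `statistics.stdev` divides by `n - 1` and would give a slightly larger value.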


## 7. Run Evolution Experiment

In [7]:
async def run_evolution_experiment():
    """Run the complete evolution experiment."""
    
    # Start experiment
    experiment_id = generation_manager.start_experiment(EXPERIMENT_CONFIG['experiment_name'])
    evolution_metrics['start_time'] = time.time()
    
    print(f"🚀 Starting Evolution Experiment: {experiment_id}")
    print(f"📅 Start Time: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
    
    # Define progress callback
    def progress_callback(generation_stats):
        # Update metrics
        update_metrics(generation_stats, evolution_controller.current_population)
        
        # Record in generation manager
        generation_manager.record_generation(
            generation_stats['generation'],
            evolution_controller.current_population,
            generation_stats
        )
        
        # Display progress
        display_progress()
        
        # Save checkpoint if enabled
        if EXPERIMENT_CONFIG['save_checkpoints']:
            checkpoint_file = generation_manager.save_checkpoint(
                generation_stats['generation'],
                evolution_controller.current_population
            )
            print(f"💾 Checkpoint saved: {Path(checkpoint_file).name}")
    
    try:
        # Run evolution
        evolution_results = await evolution_controller.run_evolution(
            seed_prompts=SEED_PROMPTS,
            progress_callback=progress_callback
        )
        
        return evolution_results
        
    except Exception as e:
        print(f"❌ Evolution failed: {e}")
        import traceback
        traceback.print_exc()
        return None

# Note: this cell only defines the function; the next cell starts the evolution
print("⚡ Evolution experiment function defined")
print("\n🎯 Ready to run evolution!")
print("Execute the next cell to start the experiment.")

⚡ Evolution experiment function defined

🎯 Ready to run evolution!
Execute the next cell to start the experiment.
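
`run_evolution_experiment` relies on the controller invoking `progress_callback` once per generation, and on Jupyter's support for top-level `await`. The sketch below shows that pattern with a stub controller (`StubController` is hypothetical and only imitates the call shape used above), and uses `asyncio.run` as one would in a plain script where top-level `await` is unavailable.

```python
import asyncio


class StubController:
    """Hypothetical stand-in that mimics the callback contract."""

    async def run_evolution(self, seed_prompts, progress_callback=None):
        best = 0.0
        for generation in range(3):
            await asyncio.sleep(0)  # placeholder for awaited LLM calls
            best = min(1.0, best + 0.25)
            if progress_callback is not None:
                progress_callback({"generation": generation,
                                   "best_fitness": best})
        return {"total_generations": 3, "best_fitness": best}


history = []


async def main():
    controller = StubController()
    return await controller.run_evolution(
        seed_prompts=["Solve this step by step."],
        progress_callback=history.append,
    )


# In a script, asyncio.run drives the coroutine; in Jupyter, use `await main()`.
results = asyncio.run(main())
print(results["total_generations"], len(history))
```

Because the callback is a plain synchronous function, any slow work inside it (such as writing checkpoints) blocks the event loop until it returns.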


## 8. Execute Evolution (Run This Cell to Start!)

In [8]:
# Execute the evolution experiment
print("🔥 STARTING EVOLUTION EXPERIMENT...")
print("=" * 60)

# Run the evolution
evolution_results = await run_evolution_experiment()

if evolution_results:
    print("\n" + "=" * 60)
    print("🎉 EVOLUTION EXPERIMENT COMPLETED!")
    print("=" * 60)
    
    # Display final results
    print(f"\n📋 Final Results:")
    print(f"  Termination Reason: {evolution_results['termination_reason']}")
    print(f"  Total Generations: {evolution_results['total_generations']}")
    print(f"  Final Population Size: {evolution_results['final_population_size']}")
    
    # Best genome
    best_genome = evolution_results['best_genome']
    print(f"\n🏆 Best Genome Found:")
    print(f"  Generation: {best_genome['generation']}")
    print(f"  Fitness: {best_genome['fitness']:.4f}")
    print(f"  Accuracy: {best_genome['accuracy']:.1%}")
    print(f"  Length: {best_genome['length']} tokens")
    print(f"  Text: '{best_genome['text']}'")
    
    # Evolution statistics
    evo_stats = evolution_results['evolution_statistics']
    print(f"\n📊 Evolution Statistics:")
    print(f"  Total Time: {evo_stats['total_time_minutes']:.1f} minutes")
    print(f"  Avg Time per Generation: {evo_stats['avg_time_per_generation']:.1f} seconds")
    
    # Save final results
    results_file = generation_manager.save_final_results()
    print(f"\n💾 Results saved to: {Path(results_file).name}")
    
else:
    print("❌ Evolution experiment failed. Check the error messages above.")

2025-09-04 03:31:38,588 - src.genetics.generation_manager - INFO - Started experiment: GSM8K_GA_20250904_033138
2025-09-04 03:31:38,589 - src.genetics.evolution_controller - INFO - Starting genetic algorithm evolution
2025-09-04 03:31:38,590 - src.genetics.evolution_controller - INFO - Configuration: 500 population, 30 max generations
2025-09-04 03:31:38,591 - src.genetics.evolution_controller - INFO - Initializing population of 500 genomes
2025-09-04 03:31:38,592 - src.genetics.population - INFO - Initializing population of 500 genomes from 50 seeds
2025-09-04 03:31:38,631 - src.embeddings.neighborhoods - INFO - Loaded neighborhoods for 10000 words


🔥 STARTING EVOLUTION EXPERIMENT...
🚀 Starting Evolution Experiment: GSM8K_GA_20250904_033138
📅 Start Time: 2025-09-04 03:31:38


2025-09-04 03:31:38,652 - src.genetics.population - INFO - Created population of 500 genomes
2025-09-04 03:31:38,653 - src.genetics.population - INFO - Initialization methods: {'seed': 50, 'crossover_mutation': 450}
2025-09-04 03:31:38,654 - src.genetics.evolution_controller - INFO - Evaluating initial population...
2025-09-04 03:31:38,655 - src.evaluation.population_evaluator - INFO - Evaluating population of 500 genomes for generation 0
2025-09-04 03:31:38,658 - src.evaluation.population_evaluator - INFO - Cache status: 0 cached, 500 need evaluation
2025-09-04 03:31:38,659 - src.evaluation.async_evaluator - INFO - Evaluating population of 500 genomes on 10 problems
Evaluating population:   0%|          | 0/500 [00:00<?, ?it/s]
2025-09-04 03:31:42,538 - httpx - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-09-04 03:31:42,882 - httpx - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-09-04 03:31:42,


🎉 EVOLUTION EXPERIMENT COMPLETED!

📋 Final Results:
  Termination Reason: Convergence achieved at generation 0
  Total Generations: 0
  Final Population Size: 500

🏆 Best Genome Found:
  Generation: 0
  Fitness: 1.0000
  Accuracy: 100.0%
  Length: 7 tokens
  Text: 'First, identify what we need to find.'

📊 Evolution Statistics:
  Total Time: 2.2 minutes
  Avg Time per Generation: 0.0 seconds

💾 Results saved to: GSM8K_GA_20250904_033138_results.json
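
The log above reports 50 seed genomes plus 450 created by `crossover_mutation`. A rough sketch of how such an expansion could work with the configured rates (80% per genome, 0.2% per token, 90% semantic neighbor) follows. The single-point crossover, the tiny `NEIGHBORS` table, the reading of the three probabilities, and all function names are assumptions for illustration, not the project's code.

```python
import random

# Toy neighbor table; the real system reports neighborhoods for 10,000 words.
NEIGHBORS = {"solve": ["compute", "answer"], "step": ["stage", "part"]}
VOCAB = sorted(set(NEIGHBORS) | {w for ws in NEIGHBORS.values() for w in ws})


def crossover(a, b, rng):
    """Single-point crossover on token lists (an assumed operator)."""
    cut_a, cut_b = rng.randint(0, len(a)), rng.randint(0, len(b))
    return (a[:cut_a] + b[cut_b:]) or list(a)


def mutate(tokens, rng, genome_prob=0.8, token_prob=0.002, neighbor_prob=0.9):
    """Level 1: is this genome mutated at all? Level 2: which tokens?"""
    if rng.random() >= genome_prob:
        return list(tokens)
    out = []
    for tok in tokens:
        if rng.random() < token_prob:
            options = NEIGHBORS.get(tok.lower().strip(".,:"))
            if options and rng.random() < neighbor_prob:
                tok = rng.choice(options)   # semantic neighbor
            else:
                tok = rng.choice(VOCAB)     # otherwise: any vocabulary word
        out.append(tok)
    return out


def init_population(seeds, size, rng):
    """Keep the seeds, then fill up with mutated crossovers of seed pairs."""
    population = [s.split() for s in seeds]
    while len(population) < size:
        a, b = rng.sample(population[:len(seeds)], 2)
        population.append(mutate(crossover(a, b, rng), rng))
    return population


rng = random.Random(0)
seeds = ["Solve this step by step.", "Calculate each step systematically."]
population = init_population(seeds, size=20, rng=rng)
print(len(population))  # 20
```

Under this reading of the rates, a 7-token prompt receives on average 0.8 × 7 × 0.002 ≈ 0.011 token substitutions, so most offspring would differ from their parents through crossover alone.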


## 9. Results Analysis and Visualization

In [9]:
# Analysis of evolution results
# First, populate evolution_metrics from evolution_results if not already populated
if evolution_results and not evolution_metrics['generations']:
    print("📊 Populating evolution metrics from results...")
    
    # Extract data from evolution_results
    for gen_result in evolution_results.get('generation_results', []):
        evolution_metrics['generations'].append(gen_result['generation'])
        evolution_metrics['best_fitness'].append(gen_result['best_fitness'])
        evolution_metrics['best_accuracy'].append(gen_result['best_accuracy'])
        evolution_metrics['evaluation_times'].append(gen_result.get('generation_time_seconds', 0))
    
    # Set best genome from final results
    if evolution_results.get('best_genome'):
        best_genome_data = evolution_results['best_genome']
        evolution_metrics['current_best_genome'] = {
            'generation': best_genome_data['generation'],
            'fitness': best_genome_data['fitness'],
            'accuracy': best_genome_data['accuracy'],
            'text': best_genome_data['text'],
            'length': best_genome_data['length']
        }

if evolution_results and evolution_metrics['generations']:
    print("📈 Evolution Analysis:")
    
    # Fitness progression
    initial_fitness = evolution_metrics['best_fitness'][0]
    final_fitness = evolution_metrics['best_fitness'][-1]
    improvement = final_fitness - initial_fitness
    
    print(f"\n🎯 Fitness Progression:")
    print(f"  Initial Best Fitness: {initial_fitness:.4f}")
    print(f"  Final Best Fitness: {final_fitness:.4f}")
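    # Guard against division by zero when the initial fitness is 0
    relative_improvement = improvement / initial_fitness if initial_fitness else 0.0
    print(f"  Total Improvement: {improvement:.4f} ({relative_improvement:.1%})")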
    
    # Accuracy progression
    initial_accuracy = evolution_metrics['best_accuracy'][0]
    final_accuracy = evolution_metrics['best_accuracy'][-1]
    accuracy_improvement = final_accuracy - initial_accuracy
    
    print(f"\n🎯 Accuracy Progression:")
    print(f"  Initial Best Accuracy: {initial_accuracy:.1%}")
    print(f"  Final Best Accuracy: {final_accuracy:.1%}")
    print(f"  Accuracy Improvement: {accuracy_improvement:.1%}")
    
    # Generation-by-generation breakdown
    print(f"\n📊 Generation Breakdown:")
    for i, gen in enumerate(evolution_metrics['generations']):
        fitness = evolution_metrics['best_fitness'][i]
        accuracy = evolution_metrics['best_accuracy'][i]
        print(f"  Gen {gen:2d}: Fitness={fitness:.4f}, Accuracy={accuracy:.1%}")
    
    # Performance statistics
    if evolution_results:
        evo_stats = evolution_results['evolution_statistics']
        
        print(f"\n⚡ Performance Statistics:")
        selection_stats = evo_stats.get('selection_stats', {})
        mutation_stats = evo_stats.get('mutation_stats', {})
        eval_stats = evo_stats.get('evaluation_stats', {})
        
        print(f"  Selection - Elite: {selection_stats.get('elite_rate', 0):.1%}, "
              f"Diverse: {selection_stats.get('diverse_rate', 0):.1%}, "
              f"Random: {selection_stats.get('random_rate', 0):.1%}")
        
        print(f"  Mutations - Total: {mutation_stats.get('total_mutations', 0)}, "
              f"Semantic Rate: {mutation_stats.get('semantic_rate', 0):.1%}")
        
        print(f"  Evaluation - Total: {eval_stats.get('total_evaluations', 0)}, "
              f"Cache Hit Rate: {eval_stats.get('llm_cache_hit_rate', 0):.1%}")
        
        print(f"  Cost - Total: ${eval_stats.get('llm_total_cost_usd', 0):.2f}")
    
    # Target achievement
    target_accuracy = config.get('experiment.target_accuracy', 0.95)
    achieved_target = final_accuracy >= target_accuracy
    
    print(f"\n🎯 Target Achievement:")
    print(f"  Target Accuracy: {target_accuracy:.1%}")
    print(f"  Achieved: {'✅ YES' if achieved_target else '❌ NO'}")
    print(f"  Progress: {final_accuracy/target_accuracy:.1%}")
    
else:
    print("No evolution data available for analysis.")

No evolution data available for analysis.


## 10. Best Genome Validation

In [10]:
# Summarize the best genome found during evolution (no held-out evaluation is run in this cell)
# Use best genome from evolution_results if evolution_metrics is not populated
best_genome_info = None
if evolution_results:
    if evolution_metrics.get('current_best_genome'):
        best_genome_info = evolution_metrics['current_best_genome']
    elif evolution_results.get('best_genome'):
        # Extract from evolution_results
        best_data = evolution_results['best_genome']
        best_genome_info = {
            'generation': best_data['generation'],
            'fitness': best_data['fitness'],
            'accuracy': best_data['accuracy'],
            'text': best_data['text'],
            'length': best_data['length']
        }

if best_genome_info:
    print("🧪 Summarizing Best Genome (held-out validation is not run here)...")
    
    print(f"\n🏆 Best Genome Details:")
    print(f"  Found in Generation: {best_genome_info['generation']}")
    print(f"  Training Fitness: {best_genome_info['fitness']:.4f}")
    print(f"  Training Accuracy: {best_genome_info['accuracy']:.1%}")
    print(f"  Prompt Text: '{best_genome_info['text']}'")
    print(f"  Token Count: {best_genome_info['length']}")
    
    # Note: For full validation, you would run the best genome on a held-out test set
    print(f"\n📝 Validation Notes:")
    print(f"  - This genome achieved {best_genome_info['accuracy']:.1%} accuracy during evolution")
    print(f"  - For final validation, test on held-out evaluation set")
    print(f"  - Consider statistical significance testing")
    print(f"  - Compare against baseline prompts")
    
    # Genome characteristics analysis
    prompt_text = best_genome_info['text']
    words = prompt_text.split()
    
    print(f"\n🔍 Genome Characteristics:")
    print(f"  Word Count: {len(words)}")
    print(f"  Character Count: {len(prompt_text)}")
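    # Guard against an empty prompt to avoid division by zero
    avg_word_length = sum(len(w) for w in words) / len(words) if words else 0.0
    print(f"  Avg Word Length: {avg_word_length:.1f} chars")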
    print(f"  Contains 'step': {'step' in prompt_text.lower()}")
    print(f"  Contains 'solve': {'solve' in prompt_text.lower()}")
    print(f"  Contains 'calculate': {'calculate' in prompt_text.lower()}")
    
else:
    print("No best genome available for validation.")

No best genome available for validation.
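
The validation notes printed by this cell recommend significance testing. As one way to quantify uncertainty, a Wilson score interval shows how little a perfect score on a small sample pins down the true accuracy. The sketch uses the 10-problem budget of the early generations and the 200-problem size of the `final_test` set; `wilson_interval` is a helper written for this example.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, center - half), min(1.0, center + half)


low, high = wilson_interval(correct=10, total=10)
print(f"10/10 correct: 95% interval is roughly [{low:.3f}, {high:.3f}]")

low, high = wilson_interval(correct=200, total=200)
print(f"200/200 correct: 95% interval is roughly [{low:.3f}, {high:.3f}]")
```

For 10 out of 10, the lower bound is about 0.72, so a 100% score on 10 problems is still compatible with a true accuracy well below the 95% target. The same perfect score on 200 problems would be far more informative.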


## 11. Experiment Summary and Next Steps

In [11]:
# Final experiment summary
print("📋 EXPERIMENT SUMMARY")
print("=" * 50)

if evolution_results:
    # Experiment metadata
    experiment_summary = generation_manager.get_experiment_summary()
    
    print(f"\n🧪 Experiment Details:")
    print(f"  Experiment ID: {experiment_summary.get('experiment_id', 'N/A')}")
    print(f"  Duration: {experiment_summary.get('elapsed_time_minutes', 0):.1f} minutes")
    print(f"  Generations: {experiment_summary.get('generations_completed', 0)}")
    print(f"  Configuration: {EXPERIMENT_CONFIG['population_size']} population, {EXPERIMENT_CONFIG['max_generations']} max generations")
    
    # Key achievements
    print(f"\n🏆 Key Achievements:")
    print(f"  Best Fitness: {experiment_summary.get('best_fitness_overall', 0):.4f}")
    print(f"  Best Accuracy: {experiment_summary.get('best_accuracy_overall', 0):.1%}")
    print(f"  Fitness Improvement: {experiment_summary.get('fitness_improvement', 0):.4f}")
    
    # System performance
    if evolution_results.get('evolution_statistics'):
        eval_stats = evolution_results['evolution_statistics'].get('evaluation_stats', {})
        print(f"\n⚡ System Performance:")
        print(f"  Total API Calls: {eval_stats.get('llm_total_requests', 0)}")
        print(f"  Cache Efficiency: {eval_stats.get('llm_cache_hit_rate', 0):.1%}")
        print(f"  Total Cost: ${eval_stats.get('llm_total_cost_usd', 0):.2f}")
        print(f"  Avg Time/Generation: {evolution_results['evolution_statistics'].get('avg_time_per_generation', 0):.1f}s")
    
    # Files generated
    print(f"\n📁 Generated Files:")
    results_dir = Path('./data/results')
    checkpoints_dir = Path('./data/checkpoints')
    
    if results_dir.exists():
        result_files = list(results_dir.glob('*.json'))
        print(f"  Results: {len(result_files)} files in {results_dir}")
    
    if checkpoints_dir.exists():
        checkpoint_files = list(checkpoints_dir.glob('*.json'))
        print(f"  Checkpoints: {len(checkpoint_files)} files in {checkpoints_dir}")

# Next steps recommendations
print(f"\n🚀 Next Steps:")
print(f"  1. Analyze results in detail using saved JSON files")
print(f"  2. Validate best genome on held-out test set")
print(f"  3. Compare against baseline prompts")
print(f"  4. Run statistical significance tests")
print(f"  5. Consider hyperparameter tuning for better results")
print(f"  6. Experiment with different seed prompt strategies")
print(f"  7. Analyze semantic patterns in successful genomes")

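# Only report success when the run actually produced results
if evolution_results:
    print(f"\n✨ Experiment completed successfully!")
    print(f"📊 All data saved for further analysis.")
else:
    print(f"\n⚠️ Experiment did not produce results; see the errors above.")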

📋 EXPERIMENT SUMMARY

🧪 Experiment Details:
  Experiment ID: N/A
  Duration: 0.0 minutes
  Generations: 0
  Configuration: 500 population, 30 max generations

🏆 Key Achievements:
  Best Fitness: 0.0000
  Best Accuracy: 0.0%
  Fitness Improvement: 0.0000

⚡ System Performance:
  Total API Calls: 0
  Cache Efficiency: 0.0%
  Total Cost: $0.00
  Avg Time/Generation: 0.0s

📁 Generated Files:
  Results: 4 files in data/results
  Checkpoints: 1 files in data/checkpoints

🚀 Next Steps:
  1. Analyze results in detail using saved JSON files
  2. Validate best genome on held-out test set
  3. Compare against baseline prompts
  4. Run statistical significance tests
  5. Consider hyperparameter tuning for better results
  6. Experiment with different seed prompt strategies
  7. Analyze semantic patterns in successful genomes

✨ Experiment completed successfully!
📊 All data saved for further analysis.
