# L04: Reflexion Implementation

**Week 4 - Planning and Reasoning**

## Learning Objectives
- Understand the Reflexion framework for self-improvement
- Implement verbal reflection and memory mechanisms
- Build a simple Reflexion agent for problem-solving
- Evaluate the impact of reflection on task performance

## 1. Setup

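Reflexion (Shinn et al., 2023) lets agents learn from mistakes through verbal self-reflection: after a failed attempt, the agent writes a natural-language critique and conditions its next attempt on it, with no weight updates.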

In [None]:
from dataclasses import dataclass, field
from typing import List, Optional, Callable
import json

print("Dependencies loaded")

## 2. Core Reflexion Components

The Reflexion loop consists of:
1. **Actor**: Attempts the task
2. **Evaluator**: Assesses the outcome
3. **Reflector**: Generates verbal feedback
4. **Memory**: Stores reflections for future attempts

In [None]:
@dataclass
class Attempt:
    """Record of a single task attempt."""
    action: str
    result: str
    success: bool
    reflection: Optional[str] = None

@dataclass
class EpisodicMemory:
    """Stores past attempts and reflections."""
    attempts: List[Attempt] = field(default_factory=list)
    max_reflections: int = 5
    
    def add_attempt(self, attempt: Attempt):
        self.attempts.append(attempt)
        # Keep only the most recent attempts (and the reflections attached to them)
        if len(self.attempts) > self.max_reflections:
            self.attempts = self.attempts[-self.max_reflections:]
    
    def get_reflections(self) -> List[str]:
        """Return all stored reflections."""
        return [a.reflection for a in self.attempts if a.reflection]
    
    def get_context(self) -> str:
        """Format reflections as context for the next attempt."""
        reflections = self.get_reflections()
        if not reflections:
            return "No previous attempts."
        return "Previous insights:\n" + "\n".join(f"- {r}" for r in reflections)

print("Memory classes defined")
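
The trimming step in `add_attempt` is a sliding window over the most recent attempts. As a minimal sketch of the same idea, assuming only the standard library, `collections.deque` with `maxlen` drops the oldest entry automatically. The window size and reflection strings below are made up for illustration.

```python
from collections import deque

# Sliding window of at most 3 reflections (size chosen for illustration).
window = deque(maxlen=3)

for i in range(1, 6):
    window.append(f"reflection {i}")

# Only the three most recent entries survive.
recent = list(window)
print(recent)

# Format the window the same way get_context() does.
context = "Previous insights:\n" + "\n".join(f"- {r}" for r in recent)
print(context)
```

A bounded window keeps the actor's context short, at the cost of forgetting early insights once the limit is reached.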

In [None]:
class ReflexionAgent:
    """Agent that learns from self-reflection."""
    
    def __init__(self, 
                 actor_fn: Callable,
                 evaluator_fn: Callable,
                 reflector_fn: Callable,
                 max_attempts: int = 3):
        self.actor = actor_fn
        self.evaluator = evaluator_fn
        self.reflector = reflector_fn
        self.max_attempts = max_attempts
        self.memory = EpisodicMemory()
    
    def solve(self, task: str) -> dict:
        """Attempt task with reflection loop."""
        for attempt_num in range(self.max_attempts):
            print(f"\n--- Attempt {attempt_num + 1} ---")
            
            # Get context from previous reflections
            context = self.memory.get_context()
            print(f"Context: {context}")
            
            # Actor generates solution
            action = self.actor(task, context)
            print(f"Action: {action}")
            
            # Evaluator checks result
            result, success = self.evaluator(task, action)
            print(f"Result: {result}, Success: {success}")
            
            if success:
                return {"success": True, "action": action, "attempts": attempt_num + 1}
            
            # Reflector generates insight
            reflection = self.reflector(task, action, result)
            print(f"Reflection: {reflection}")
            
            # Store in memory
            self.memory.add_attempt(Attempt(
                action=action,
                result=result,
                success=success,
                reflection=reflection
            ))
        
        return {"success": False, "attempts": self.max_attempts}

print("ReflexionAgent defined")
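
The `context` string returned by `memory.get_context()` is what lets earlier reflections influence the next attempt. With a real LLM as the actor, one plausible way to use it is to place it in the prompt ahead of the task. The sketch below assumes a hypothetical `build_actor_prompt` helper and a stand-in `fake_llm` function; neither is part of this notebook or of any library.

```python
from typing import Callable

def build_actor_prompt(task: str, context: str) -> str:
    """Hypothetical helper: combine stored reflections and the task into one prompt."""
    return (
        "You are solving a task. Use the insights from earlier attempts.\n\n"
        f"{context}\n\n"
        f"Task: {task}\n"
        "Answer:"
    )

def make_llm_actor(llm: Callable[[str], str]) -> Callable[[str, str], str]:
    """Wrap any text-in/text-out function as an actor_fn(task, context)."""
    def actor(task: str, context: str) -> str:
        return llm(build_actor_prompt(task, context))
    return actor

# Stand-in for a model call so the sketch runs offline (an assumption, not a real API).
def fake_llm(prompt: str) -> str:
    return f"(model saw {len(prompt.splitlines())} prompt lines)"

actor = make_llm_actor(fake_llm)
prompt = build_actor_prompt("What is 2 + 2?", "Previous insights:\n- Show my work.")
print(prompt)
print(actor("What is 2 + 2?", "No previous attempts."))
```

The returned `actor` has the same `(task, context)` signature that `ReflexionAgent` expects for `actor_fn`, so it could be passed in place of a simulated actor.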

## 3. Example: Math Problem Solving

We'll simulate an agent learning to solve math problems through reflection.

In [None]:
# Simulated LLM responses for math problem solving
def math_actor(task: str, context: str) -> str:
    """Simulated actor that improves based on reflections."""
    # Check if we have learned from reflections
    if "check units" in context.lower():
        return "The answer is 150 miles (calculated: 50 mph * 3 hours = 150 miles)"
    elif "show my work" in context.lower():  # must match the reflector's wording
        return "Speed = 50 mph, Time = 3 hours. Answer: 150"
    else:
        # Initial attempt without learning
        return "The answer is 53"  # Wrong answer

def math_evaluator(task: str, action: str) -> tuple:
    """Check if the answer is correct."""
    correct_answer = "150"
    if correct_answer in action and "miles" in action.lower():
        return "Correct!", True
    elif correct_answer in action:
        return "Missing units", False
    else:
        return "Incorrect calculation", False

def math_reflector(task: str, action: str, result: str) -> str:
    """Generate reflection based on failure."""
    if "Incorrect calculation" in result:
        return "I made an arithmetic error. I should show my work step by step."
    elif "Missing units" in result:
        return "I forgot the units. I should always check units in my answer."
    return "I need to be more careful with my reasoning."

print("Math problem functions defined")

In [None]:
# Test the Reflexion agent
agent = ReflexionAgent(
    actor_fn=math_actor,
    evaluator_fn=math_evaluator,
    reflector_fn=math_reflector,
    max_attempts=3
)

task = "A car travels at 50 mph for 3 hours. How far does it travel?"
print(f"Task: {task}")
print("=" * 60)

result = agent.solve(task)
print("\n" + "=" * 60)
print(f"Final Result: {result}")

## 4. Reflection Quality Analysis

Good reflections are:
- Specific about what went wrong
- Actionable for future attempts
- Focused on the error pattern, not just the symptom

In [None]:
# Examples of good vs. bad reflections
reflection_examples = {
    "Good Reflections": [
        "I assumed the list was sorted but it wasn't. Always verify data assumptions.",
        "Off-by-one error in loop. The last valid index is len(arr) - 1, so iterate with range(len(arr)).",
        "Forgot to handle empty input case. Add input validation first."
    ],
    "Bad Reflections": [
        "I was wrong.",  # Too vague
        "Try harder next time.",  # Not actionable
        "The test case was tricky."  # Blames external factors
    ]
}

for category, examples in reflection_examples.items():
    print(f"\n{category}:")
    for ex in examples:
        print(f"  - {ex}")
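
The criteria listed at the start of this section can be roughly approximated with a keyword heuristic. The sketch below is an illustration only: `score_reflection`, its cue lists, and its word-count threshold are assumptions made for this example, not a validated metric. It awards one point each for sufficient length, an action cue, and a cause cue.

```python
# Cue lists are illustrative assumptions, not an established taxonomy.
ACTION_CUES = ("always", "verify", "use ", "add ", "check", "should")
CAUSE_CUES = ("assumed", "forgot", "error", "because", "instead of")

def score_reflection(text: str) -> int:
    """Return a 0-3 heuristic score: length, actionability, stated cause."""
    lowered = text.lower()
    score = 0
    if len(lowered.split()) >= 6:                    # long enough to be specific
        score += 1
    if any(cue in lowered for cue in ACTION_CUES):   # says what to do next
        score += 1
    if any(cue in lowered for cue in CAUSE_CUES):    # names what went wrong
        score += 1
    return score

good = "I assumed the list was sorted but it wasn't. Always verify data assumptions."
bad = "I was wrong."
print(score_reflection(good), score_reflection(bad))
```

A keyword check like this is easy to fool, so it is better suited to flagging obviously empty reflections than to ranking good ones.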

## 5. Memory Management Strategies

In [None]:
@dataclass
class PrioritizedMemory:
    """Memory that keeps the highest-importance reflections (recency is not tracked)."""
    reflections: List[dict] = field(default_factory=list)
    max_size: int = 10
    
    def add(self, reflection: str, importance: float = 1.0):
        self.reflections.append({
            "text": reflection,
            "importance": importance,
            "uses": 0
        })
        # Keep most important
        self.reflections.sort(key=lambda x: x["importance"], reverse=True)
        self.reflections = self.reflections[:self.max_size]
    
    def retrieve(self, k: int = 3) -> List[str]:
        """Get the top-k reflections by importance, penalizing frequently used ones."""
        top_k = sorted(self.reflections, 
                       key=lambda x: x["importance"] - x["uses"] * 0.1,
                       reverse=True)[:k]
        # Update use counts
        for r in top_k:
            r["uses"] += 1
        return [r["text"] for r in top_k]

# Test prioritized memory
memory = PrioritizedMemory()
memory.add("Check edge cases", importance=0.9)
memory.add("Verify loop bounds", importance=0.8)
memory.add("Handle null inputs", importance=0.95)

print("Top reflections:", memory.retrieve(k=2))
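
The ranking key in `retrieve` is `importance - 0.1 * uses`, so a reflection drops in rank each time it is returned. A small worked example follows, using the three importances from the test above. The stand-alone `effective_priority` and `retrieve_top` helpers are introduced here only for illustration.

```python
def effective_priority(importance: float, uses: int, penalty: float = 0.1) -> float:
    """Same ranking key as retrieve(): importance minus a per-use penalty."""
    return importance - uses * penalty

# Importances copied from the test above; use counts start at zero.
importance = {"Check edge cases": 0.9, "Verify loop bounds": 0.8, "Handle null inputs": 0.95}
uses = {name: 0 for name in importance}

def retrieve_top() -> str:
    """Return the single best item and record that it was used."""
    best = max(importance, key=lambda n: effective_priority(importance[n], uses[n]))
    uses[best] += 1
    return best

order = [retrieve_top() for _ in range(3)]
print(order)
```

With these numbers the top item goes "Handle null inputs", then "Check edge cases", then "Handle null inputs" again: after one use its key falls to about 0.85, below 0.9 but still above the others once they have been penalized. The penalty trades a little relevance for variety in what the actor sees.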

## 6. Evaluation Metrics

Reflexion effectiveness can be measured by:
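- **Pass@k**: Fraction of tasks solved within k sequential attempts (each retry sees earlier reflections, unlike the usual pass@k over independent samples)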
- **Reflection Quality**: Specificity and actionability
- **Learning Curve**: Improvement rate across attempts

In [None]:
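def evaluate_reflexion(agent, tasks: List[str], max_attempts: int = 3):
    """Evaluate a Reflexion agent: percentage of tasks solved within k attempts."""
    if not tasks:
        raise ValueError("tasks must not be empty")

    # Make the agent's attempt budget match the metrics being reported
    agent.max_attempts = max_attempts
    # Build one counter per k so any max_attempts works, not just 3
    results = {f"pass@{k}": 0 for k in range(1, max_attempts + 1)}
    results["total"] = len(tasks)

    for task in tasks:
        # Reset agent memory for each task
        agent.memory = EpisodicMemory()
        result = agent.solve(task)

        if result["success"]:
            # A task solved on attempt n counts toward every pass@k with k >= n
            for k in range(result["attempts"], max_attempts + 1):
                results[f"pass@{k}"] += 1

    # Convert to percentages
    for k in range(1, max_attempts + 1):
        results[f"pass@{k}"] = results[f"pass@{k}"] / results["total"] * 100

    return results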

# Illustrative numbers only (hard-coded, not produced by evaluate_reflexion)
print("Example Reflexion Results:")
print("  Pass@1: 80%")
print("  Pass@2: 88%")
print("  Pass@3: 91%")
print("\nImprovement from reflection: +11 percentage points (Pass@1 to Pass@3)")
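
The learning-curve metric can be computed from the number of attempts each task needed. The sketch below assumes a hypothetical list `attempts_needed`, where each entry is the attempt on which a task first succeeded, or `None` if it never did. The data are made up for illustration.

```python
from typing import List, Optional

def success_within_k(attempts_needed: List[Optional[int]], k: int) -> float:
    """Fraction of tasks solved in at most k sequential attempts."""
    if not attempts_needed:
        return 0.0
    solved = sum(1 for a in attempts_needed if a is not None and a <= k)
    return solved / len(attempts_needed)

# Made-up outcomes for 5 tasks: attempt number of first success, None = unsolved.
attempts_needed = [1, 1, 2, 3, None]

# Cumulative success rate after 1, 2, and 3 attempts.
curve = [success_within_k(attempts_needed, k) for k in (1, 2, 3)]
# Marginal gain contributed by each additional reflection round.
gains = [round(b - a, 2) for a, b in zip(curve, curve[1:])]
print(curve)
print(gains)
```

A curve that flattens quickly suggests later reflections add little, which is one signal for choosing `max_attempts`.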

## 7. Key Takeaways

1. **Reflexion Loop**: Attempt -> Evaluate -> Reflect -> Store -> Retry
2. **Verbal Feedback**: Natural language reflections are interpretable
3. **Memory Management**: Prioritize relevant, actionable reflections
4. **Quality Matters**: Specific, actionable reflections drive improvement

## Next Steps
- Integrate with real LLM APIs
- Test on HumanEval coding benchmark
- Experiment with different reflection prompts
- Compare with other self-improvement methods