## Reflexion

> **What you'll learn:** How to build a **Reflexion agent** that iteratively improves its responses through self-critique, external research, and structured revision — using LangGraph, Pydantic, and Tavily Search.

The [Reflexion pattern](https://arxiv.org/pdf/2303.11366), introduced by Shinn et al., extends basic reflection by combining **self-critique** with **external knowledge integration** and **structured output parsing**. Unlike simple reflection, a Reflexion agent grounds each revision in an explicit critique and newly retrieved information, so it can correct its mistakes within a single run.

The workflow typically follows these steps:
- **Initial Generation:** The agent produces a response along with self-critique and research queries.
- **External Research:** Knowledge gaps identified during critique trigger web searches or other information retrieval.
- **Knowledge Integration:** New insights are incorporated into an improved response.
- **Iterative Refinement:** The agent repeats the cycle until the response meets desired quality thresholds.
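
The four steps above can be sketched without any libraries. This is a minimal illustration, not the implementation built later in this section: `Draft`, `generate`, `research`, and `revise` are hypothetical stand-ins that return canned data in place of real LLM and search calls.

```python
from dataclasses import dataclass, field


@dataclass
class Draft:
    response: str
    missing: str  # self-critique: what the answer lacks
    research_queries: list = field(default_factory=list)


def generate(question: str) -> Draft:
    """Hypothetical stand-in for the LLM's first draft plus self-critique."""
    return Draft(
        response=f"Draft answer to: {question}",
        missing="No concrete example.",
        research_queries=[f"{question} example"],
    )


def research(queries: list) -> dict:
    """Hypothetical stand-in for a web search: one canned result per query."""
    return {query: f"stub result for '{query}'" for query in queries}


def revise(draft: Draft, evidence: dict) -> Draft:
    """Hypothetical stand-in for the revision step; marks the gap as closed."""
    cited = f"{draft.response} [+{len(evidence)} source(s)]"
    return Draft(response=cited, missing="", research_queries=[])


def reflexion(question: str, max_cycles: int = 3):
    draft = generate(question)
    cycles = 0
    # Keep researching and revising while gaps remain and the cap is not hit
    while cycles < max_cycles and draft.research_queries:
        evidence = research(draft.research_queries)
        draft = revise(draft, evidence)
        cycles += 1
    return draft, cycles


final_draft, cycles_used = reflexion("What is overfitting?")
print(cycles_used, final_draft.response)
```

The loop has two exits: no research queries remain (the critique found no gaps), or `max_cycles` is reached. Both the names and the stopping rule are assumptions made for this sketch.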

### Setup

Import all required libraries: LangChain for LLM orchestration, LangGraph for workflow graphs, Pydantic for structured output validation, and Tavily for web search.

In [6]:
# ─────────────────────────────────────────────
# Purpose : Import all dependencies for the Reflexion agent
# Input   : None
# Output  : Loaded modules in namespace
# ─────────────────────────────────────────────

# --- LangChain & LangGraph ---
from langchain_aws import ChatBedrock
from langchain_community.tools.tavily_search import TavilySearchResults
from langchain_community.utilities.tavily_search import TavilySearchAPIWrapper
from langchain_core.messages import HumanMessage, ToolMessage
from langchain_core.output_parsers.openai_tools import PydanticToolsParser
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.tools import StructuredTool
from langchain_groq import ChatGroq
from langgraph.prebuilt import ToolNode
from langgraph.graph import END, StateGraph, START
from langgraph.graph.message import add_messages

# --- Data validation & typing ---
from pydantic import ValidationError, BaseModel, Field
from typing import Literal, Annotated
from typing_extensions import TypedDict

# --- Utilities ---
from dotenv import load_dotenv
from IPython.display import Image, display
import json
import datetime

### Define LLM

Initialize the language model that will power our Reflexion agent. The factory helpers abstract away API key loading and model configuration.

In [7]:
# [DUPLICATE — alternative LLM configuration, see cell below for active setup]
# llm = ChatBedrock(
#     model_id="anthropic.claude-3-5-sonnet-20241022-v2:0",
#     region_name="us-west-2",
#     temperature=0
# )

In [8]:
# ─────────────────────────────────────────────
# Purpose : Import LLM helpers and initialize the language model
# Input   : helpers/utils.py factory functions
# Output  : `llm` — configured LLM instance
# ─────────────────────────────────────────────

import os
import sys

sys.path.append(os.path.abspath("../.."))
from helpers.utils import get_groq_llm, get_databricks_llm

print("LLM helpers imported successfully!")

# --- Select LLM based on platform ---
if sys.platform == "win32":
    llm = get_groq_llm("openai/gpt-oss-120b")
elif sys.platform == "darwin":
    llm = get_databricks_llm("databricks-claude-opus-4-6")
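else:
    # Fallback for Linux and other platforms (an assumed default;
    # without it, `llm` is undefined and the check below raises NameError)
    llm = get_groq_llm("openai/gpt-oss-120b")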

# --- Confirm initialization ---
if hasattr(llm, 'model_name'):
    print(f"LLM initialized: {llm.model_name}")
elif hasattr(llm, 'model'):
    print(f"LLM initialized: {llm.model} (Databricks)")
else:
    print("LLM initialized: Groq LLM")

LLM helpers imported successfully!
LLM initialized: databricks-claude-opus-4-6 (Databricks)


### Actor Architecture

At its core, a Reflexion agent is built around an **Actor** — an agent that generates an initial response, critiques it, and then re-executes the task with improvements. Supporting this loop are a few critical sub-components:

- **Tool execution:** Access to external knowledge sources.
- **Initial responder:** Generates the first draft along with self-reflection.
- **Revisor/Revision:** Produces refined outputs by incorporating prior reflections.

### Construct Tools

Since Reflexion requires external knowledge, we first define a tool to fetch information from the web. Here we use `TavilySearchResults`, a wrapper around the Tavily Search API, enabling our agent to perform web searches and gather supporting evidence.

In [10]:
# ─────────────────────────────────────────────
# Purpose : Initialize the Tavily web search tool
# Input   : TAVILY_API_KEY (from environment)
# Output  : `tavily_tool` — search tool returning up to 5 results
# ─────────────────────────────────────────────
# [DUPLICATE imports — TavilySearchResults, TavilySearchAPIWrapper already imported in cell above]

load_dotenv()  # loads TAVILY_API_KEY from a local .env file, if one exists

web_search = TavilySearchAPIWrapper()
tavily_tool = TavilySearchResults(api_wrapper=web_search, max_results=5)

### Define the Prompt Template

Next, let's define the prompt that will guide the actor agent's behavior. Prompts serve as the "role description" for an agent, specifying what it should and should not do. The agent is instructed to:

- Provide an initial explanation.
- Reflect and critique its own answer.
- Generate search queries to fill knowledge gaps.

In [11]:
# ─────────────────────────────────────────────
# Purpose : Define the actor's system prompt template
# Input   : primary_instruction, function_name (partial vars)
# Output  : `actor_prompt_template` — reusable prompt for both responder and revisor
# ─────────────────────────────────────────────

actor_prompt_template = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            """You are an expert technical educator specializing in machine learning and neural networks.
                Current time: {time}
                1. {primary_instruction}
                2. Reflect and critique your answer. Be severe to maximize improvement.
                3. Recommend search queries to research information and improve your answer.""",
        ),
        MessagesPlaceholder(variable_name="messages"),
        (
            "user",
            "\n\n<reminder>Reflect on the user's original question and the"
            " actions taken thus far. Respond using the {function_name} function.</reminder>",
        ),
    ]
).partial(
    time=lambda: datetime.datetime.now().isoformat(),
)
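
The `time=lambda: ...` partial above is passed as a callable so that the timestamp is computed each time the prompt is rendered, not once when the template is defined. The library-free sketch below illustrates that idea only; `render` is a hypothetical helper, not LangChain's implementation, and a counter stands in for the clock so the result is deterministic.

```python
import itertools


def render(template: str, **variables) -> str:
    """Fill a template, calling any callable value at render time."""
    resolved = {
        name: (value() if callable(value) else value)
        for name, value in variables.items()
    }
    return template.format(**resolved)


# A counter stands in for datetime.now() so each render is predictable
ticks = itertools.count()
partials = {"tick": lambda: next(ticks)}
template = "Tick: {tick}\n1. {primary_instruction}"

first = render(template, primary_instruction="Draft an answer.", **partials)
second = render(template, primary_instruction="Revise the answer.", **partials)

print(first)
print(second)
```

Because the callable is re-evaluated on every render, the responder and the revisor can share one template and still each see a fresh value.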

### Enforce Structured Output

In multi-step workflows, giving each sub-agent a structured output model keeps its responses consistent and machine-parseable. Here we define those structures as **Pydantic models**:

In [12]:
# ─────────────────────────────────────────────
# Purpose : Define Pydantic schemas for structured LLM output
# Input   : None
# Output  : `Reflection`, `GenerateResponse` — Pydantic models
# ─────────────────────────────────────────────

class Reflection(BaseModel):
    missing: str = Field(description="Critique of what is missing.")
    superfluous: str = Field(description="Critique of what is superfluous")

class GenerateResponse(BaseModel):
    """Generate response. Provide an answer, critique, and then follow up with search queries to improve the answer."""
    
    response: str = Field(description="~250 word detailed answer to the question.")
    reflection: Reflection = Field(description="Your reflection on the initial answer.")
    research_queries: list[str] = Field(
        description="1-3 search queries for researching improvements to address the critique of your current answer."
    )


We use Pydantic's `BaseModel` to define two data classes:

- **`Reflection`** captures the self-critique, requiring the agent to highlight what information is missing and what is superfluous (unnecessary).
- **`GenerateResponse`** structures the responder's output. It requires the main `response`, a `reflection` (based on the `Reflection` class), and a list of `research_queries`.

Validating against these schemas makes the agents' responses consistent and parseable; output that does not match is caught at validation time rather than passed downstream.
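
To see what schema validation catches, the sketch below validates two hand-written payloads against trimmed copies of the models. It assumes Pydantic v2 (`model_validate`); the payloads are invented for illustration and are not real model output.

```python
from pydantic import BaseModel, Field, ValidationError


class Reflection(BaseModel):
    missing: str = Field(description="Critique of what is missing.")
    superfluous: str = Field(description="Critique of what is superfluous.")


class GenerateResponse(BaseModel):
    response: str
    reflection: Reflection
    research_queries: list[str]


# A complete payload parses into nested, typed objects
good_args = {
    "response": "Supervised learning uses labeled data.",
    "reflection": {"missing": "No example.", "superfluous": "None."},
    "research_queries": ["supervised learning example"],
}
parsed = GenerateResponse.model_validate(good_args)

# An incomplete payload raises ValidationError listing every missing field
bad_args = {"response": "Too short.", "reflection": {"missing": "Everything."}}
try:
    GenerateResponse.model_validate(bad_args)
    error_fields = []
except ValidationError as err:
    error_fields = sorted(
        ".".join(str(part) for part in error["loc"]) for error in err.errors()
    )

print(parsed.reflection.missing)
print(error_fields)
```

The error locations point at the exact fields that failed, which is the kind of detail worth feeding back to the model on a retry.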

### Add Retry Logic

Structured parsing can fail if the output doesn't match the schema. To address this, we add **retry logic with schema feedback** — when validation fails, the error and schema are fed back to the LLM for self-correction.

In [13]:
# ─────────────────────────────────────────────
# Purpose : Wrap an LLM chain with retry logic for structured output validation
# Input   : chain (LLM pipeline), output_parser (Pydantic parser)
# Output  : `AdaptiveResponder` class with `.generate()` method
# ─────────────────────────────────────────────

class AdaptiveResponder:
    def __init__(self, chain, output_parser):
        self.chain = chain
        self.output_parser = output_parser
    
    def generate(self, conversation_state: dict):
        llm_response = None
        for retry_count in range(3):
            llm_response = self.chain.invoke(
                {"messages": conversation_state["messages"]}, {"tags": [f"attempt:{retry_count}"]}
            )
            try:
                self.output_parser.invoke(llm_response)
                return {"messages": llm_response}
            except ValidationError as validation_error:
                # Serialize the tool's schema (not the parser's own) to a JSON string
                schema_json = json.dumps(self.output_parser.tools[0].model_json_schema(), indent=2)
                # Build a new state dict; `dict + list` would raise a TypeError
                conversation_state = {
                    "messages": list(conversation_state["messages"]) + [
                        llm_response,
                        ToolMessage(
                            content=f"{repr(validation_error)}\n\nPay close attention to the function schema.\n\n{schema_json}\n\nRespond by fixing all validation errors.",
                            tool_call_id=llm_response.tool_calls[0]["id"],
                        ),
                    ]
                }
        return {"messages": llm_response}
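
The retry-with-feedback idea can be seen in isolation in the library-free sketch below. Everything in it is hypothetical: `fake_llm` stands in for a model that only produces a complete payload after it has seen an error message, and `validate` is a simple key check in place of a Pydantic parser.

```python
import json

REQUIRED_KEYS = {"response", "reflection", "research_queries"}


def validate(raw: str) -> dict:
    """Parse JSON and check required keys; raises ValueError on failure."""
    data = json.loads(raw)
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data


def fake_llm(messages: list) -> str:
    """Hypothetical model: corrects itself only after seeing an error."""
    if any(message.startswith("ERROR") for message in messages):
        return json.dumps(
            {"response": "...", "reflection": {}, "research_queries": ["q"]}
        )
    return json.dumps({"response": "..."})


def generate_with_retry(messages: list, max_attempts: int = 3):
    history = list(messages)
    for attempt in range(1, max_attempts + 1):
        raw = fake_llm(history)
        try:
            return validate(raw), attempt
        except ValueError as err:
            # Append the failed output and the error so the next call can fix it
            history += [raw, f"ERROR: {err}. Fix all validation errors."]
    raise RuntimeError("no valid output after retries")


payload, attempts = generate_with_retry(["What is overfitting?"])
print(attempts, payload["research_queries"])
```

One design difference is worth noting: this sketch raises after the last failed attempt, whereas `AdaptiveResponder` returns the last response even if it never validated.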


### Bind the Data Model

We now bind the `GenerateResponse` model to the LLM as a tool. Together with the prompt's instruction to respond using that function, this steers the LLM toward output in the defined structure, and we wrap the chain with `AdaptiveResponder` to retry when validation fails.

In [15]:
# ─────────────────────────────────────────────
# Purpose : Build the initial responder chain with structured output
# Input   : actor_prompt_template, llm, GenerateResponse schema
# Output  : `initial_responder` — AdaptiveResponder for first-draft generation
# ─────────────────────────────────────────────

initial_response_chain = actor_prompt_template.partial(
    primary_instruction="Provide a detailed ~250 word explanation suitable for someone with basic programming background.",
    function_name=GenerateResponse.__name__,
) | llm.bind_tools(tools=[GenerateResponse])

response_parser = PydanticToolsParser(tools=[GenerateResponse])

initial_responder = AdaptiveResponder(
    chain=initial_response_chain, output_parser=response_parser
)

After invoking `initial_response_chain`, we get a structured output that includes the initial answer, the self-critique, and the generated search queries. Let's test the initial responder with a simple query.

In [16]:
# ─────────────────────────────────────────────
# Purpose : Test the initial responder with a sample question
# Input   : example_question (str)
# Output  : initial — dict with structured LLM response
# ─────────────────────────────────────────────

example_question = "What is the difference between supervised and unsupervised learning?"
initial = initial_responder.generate(
    {"messages": [HumanMessage(content=example_question)]}
)

initial

{'messages': AIMessage(content='', additional_kwargs={'tool_calls': [{'id': 'toolu_bdrk_017G1afAP24g9xiEtgs1qtrj', 'function': {'arguments': '{"response":"**Supervised vs. Unsupervised Learning** are two fundamental paradigms in machine learning, and the key difference lies in whether your training data includes \\"answers\\" (labels) or not.\\n\\n**Supervised Learning** uses a dataset where each input has a corresponding correct output (label). Think of it like studying with an answer key. You feed the model examples like `(input, correct_answer)` pairs, and it learns to map inputs to outputs. For instance, given thousands of emails labeled \\"spam\\" or \\"not spam,\\" the model learns patterns to classify new, unseen emails. Common algorithms include Linear Regression, Decision Trees, and Neural Networks (for classification/regression). The goal is **prediction**.\\n\\n```python\\n# Supervised: you provide X (features) AND y (labels)\\nmodel.fit(X_train, y_train)\\nprediction = mode

### Revision

The **Revision** step represents the final stage of the Reflexion loop. Its purpose is to combine three critical elements — the original draft, the self-critique, and the research results — to produce a refined, evidence-backed response.

We define a new instruction set (`improvement_guidelines`) that explicitly tells the Revisor to:

- Use the previous critique to add important technical details
- Add numerical citations tied to the research evidence
- Include a **References** section and a clean, URL-only list of sources
- Remove superfluous information and stay within 250 words
- Keep the explanation accessible while remaining technically accurate

In [17]:
# ─────────────────────────────────────────────
# Purpose : Define revision guidelines for the Revisor agent
# Input   : None
# Output  : `improvement_guidelines` — instruction string
# ─────────────────────────────────────────────

improvement_guidelines = """Revise your previous explanation using the new information.
    - You should use the previous critique to add important technical details to your explanation.
    - You MUST include numerical citations in your revised answer to ensure it can be verified.
    - Add a "References" section to the bottom of your answer (which does not count towards the word limit).
    - For the sources field, provide a clean list of URLs only (e.g., ["https://example.com", "https://example2.com"])
    - You should use the previous critique to remove superfluous information from your answer and make SURE it is not more than 250 words.
    - Keep the explanation accessible for someone with basic programming background while being technically accurate.
"""

To enforce output structure, we introduce `ImproveResponse` — a Pydantic schema that extends `GenerateResponse` with an additional `sources` field, ensuring that each improved answer comes with verifiable references.

In [18]:
# ─────────────────────────────────────────────
# Purpose : Define the revision output schema with source citations
# Input   : Inherits from GenerateResponse
# Output  : `ImproveResponse` — Pydantic model with sources field
# ─────────────────────────────────────────────

class ImproveResponse(GenerateResponse):
    """Improve your original answer to your question. Provide an answer, reflection,
    cite your reflection with references, and finally
    add search queries to improve the answer."""
    
    sources: list[str] = Field(
        description="List of reference URLs that support your answer. Each reference should be a clean URL string."
    )


With the schema defined, we construct the **revision chain** by binding the guidelines to the LLM and wrapping it with `AdaptiveResponder`:

In [19]:
# ─────────────────────────────────────────────
# Purpose : Build the revision chain with ImproveResponse schema
# Input   : actor_prompt_template, llm, improvement_guidelines
# Output  : `response_improver` — AdaptiveResponder for revised answers
# ─────────────────────────────────────────────

improvement_chain = actor_prompt_template.partial(
    primary_instruction=improvement_guidelines,
    function_name=ImproveResponse.__name__,
) | llm.bind_tools(tools=[ImproveResponse])

improvement_parser = PydanticToolsParser(tools=[ImproveResponse])
response_improver = AdaptiveResponder(chain=improvement_chain, output_parser=improvement_parser)

We can now test the revision chain by providing a full conversation history — the initial draft, the critique, and the tool output:

In [20]:
# ─────────────────────────────────────────────
# Purpose : Test the revision chain with initial response + search results
# Input   : example_question, initial response, tavily_tool search output
# Output  : revised — dict with improved, cited response
# ─────────────────────────────────────────────

revised = response_improver.generate(
    {
        "messages": [
            HumanMessage(content=example_question),
            initial["messages"],
            ToolMessage(
                tool_call_id=initial["messages"].tool_calls[0]["id"],
                content=json.dumps(
                    tavily_tool.invoke(
                        {
                            "query": initial["messages"].tool_calls[0]["args"][
                                "research_queries"
                            ][0]
                        }
                    )
                ),
            ),
        ]
    }
)

revised["messages"]

AIMessage(content='', additional_kwargs={'tool_calls': [{'id': 'toolu_bdrk_01GnzMaukjTHv6e8Xgt6s979', 'function': {'arguments': '{"response":"**Supervised vs. Unsupervised Learning**\\n\\nThe core difference is whether your training data includes **labels** (known answers) or not.\\n\\n**Supervised Learning** uses labeled datasets — each input has a corresponding correct output. Think of it like studying with an answer key. The model learns to map inputs to outputs for **prediction** tasks [1]. Common applications include spam detection, image classification, medical diagnosis, and price forecasting [2]. Algorithms include Linear Regression, Decision Trees, and Neural Networks.\\n\\n```python\\n# Supervised: provide features AND labels\\nmodel.fit(X_train, y_train)\\nprediction = model.predict(X_new)\\n```\\n\\n**Unsupervised Learning** works with **unlabeled data**, discovering hidden structure on its own [1]. It\'s ideal when you have large datasets but don\'t know what outputs to ex

### Create Tool Node

The next step is to execute the tool calls inside a LangGraph workflow. The Responder and Revisor use different schemas, but both rely on the same external tool (a search API). We therefore register the same search function twice, once under each schema's name, so that a tool call named `GenerateResponse` or `ImproveResponse` is routed to it and its `research_queries` are run. This is what sets Reflexion apart: it identifies knowledge gaps and actively researches them.

The `ToolNode` handles tool execution and result formatting, which makes it straightforward to incorporate external knowledge sources.

In [21]:
# ─────────────────────────────────────────────
# Purpose : Create search tool nodes for use inside the LangGraph workflow
# Input   : tavily_tool, GenerateResponse/ImproveResponse schemas
# Output  : `search_executor` — ToolNode that runs batch web searches
# ─────────────────────────────────────────────

def execute_search_queries(research_queries: list[str], **kwargs):
    """Execute the generated search queries."""
    return tavily_tool.batch([{"query": search_term} for search_term in research_queries])

# Tool node
search_executor = ToolNode(
    [
        StructuredTool.from_function(execute_search_queries, name=GenerateResponse.__name__),
        StructuredTool.from_function(execute_search_queries, name=ImproveResponse.__name__),
    ]
)
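
The name-based routing can be illustrated with a plain dictionary. This is a sketch of the idea only, not `ToolNode`'s internals: `run_tool_call` and the stubbed search function are hypothetical, and the tool call is written by hand.

```python
def execute_search_queries(research_queries: list, **kwargs) -> list:
    """Stubbed search: the other schema fields land in **kwargs and are ignored."""
    return [f"results for '{query}'" for query in research_queries]


# Both schema names map to the same search function
TOOLS_BY_NAME = {
    "GenerateResponse": execute_search_queries,
    "ImproveResponse": execute_search_queries,
}


def run_tool_call(tool_call: dict) -> dict:
    """Look up the handler by tool name and pass the call's args to it."""
    handler = TOOLS_BY_NAME[tool_call["name"]]
    output = handler(**tool_call["args"])
    return {"tool_call_id": tool_call["id"], "content": output}


call = {
    "id": "call_1",
    "name": "ImproveResponse",
    "args": {
        "response": "...",
        "reflection": {"missing": "...", "superfluous": "..."},
        "research_queries": ["a", "b"],
        "sources": [],
    },
}
result = run_tool_call(call)
print(result)
```

The `**kwargs` parameter is what lets one function serve both schemas: it absorbs `response`, `reflection`, and `sources` while only `research_queries` is used.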

### Building the Graph

Finally, we assemble all components — **Responder**, **Tool Executor**, and **Revisor** — into a cyclical graph. This structure captures the iterative nature of Reflexion, where each loop strengthens the final answer.

We define the graph state, loop control functions, and wire everything together:

In [22]:
# ─────────────────────────────────────────────
# Purpose : Define graph state, loop control, and assemble the Reflexion workflow
# Input   : initial_responder, search_executor, response_improver
# Output  : `reflexion_workflow` — compiled LangGraph state machine
# ─────────────────────────────────────────────

# --- State schema ---
class State(TypedDict):
    messages: Annotated[list, add_messages]

# --- Loop control helpers ---
def get_iteration_count(message_history: list):
    """Count the trailing AI/tool messages since the last human message.

    Each search-and-revise cycle adds two messages (one tool, one AI).
    """
    iteration_count = 0
    for message in message_history[::-1]:
        if message.type not in {"tool", "ai"}:
            break
        iteration_count += 1
    return iteration_count

def determine_next_action(state: State):
    """Conditional edge: keep researching until the message count exceeds MAXIMUM_CYCLES."""
    current_iterations = get_iteration_count(state["messages"])
    if current_iterations > MAXIMUM_CYCLES:
        return END
    return "search_and_research"

# --- Graph construction ---
MAXIMUM_CYCLES = 5
workflow_builder = StateGraph(State)

workflow_builder.add_node("create_draft", initial_responder.generate)
workflow_builder.add_node("search_and_research", search_executor)
workflow_builder.add_node("enhance_response", response_improver.generate)

workflow_builder.add_edge(START, "create_draft")
workflow_builder.add_edge("create_draft", "search_and_research")
workflow_builder.add_edge("search_and_research", "enhance_response")
workflow_builder.add_conditional_edges("enhance_response", determine_next_action, ["search_and_research", END])

reflexion_workflow = workflow_builder.compile()
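
Because the loop control counts messages, not full cycles, it helps to trace how many revisions a cap of 5 actually allows. The sketch below replays the graph's message flow with a hypothetical `Message` stand-in that carries only a `type` attribute.

```python
from collections import namedtuple

# Stand-in for LangChain messages; only the `type` attribute matters here
Message = namedtuple("Message", "type")
MAXIMUM_CYCLES = 5


def get_iteration_count(message_history: list) -> int:
    """Count trailing AI/tool messages since the last human message."""
    count = 0
    for message in reversed(message_history):
        if message.type not in {"tool", "ai"}:
            break
        count += 1
    return count


# The user question followed by the first draft
history = [Message("human"), Message("ai")]
revisions = 0
while True:
    # One pass through search_and_research, then enhance_response
    history += [Message("tool"), Message("ai")]
    revisions += 1
    if get_iteration_count(history) > MAXIMUM_CYCLES:
        break

print(revisions, get_iteration_count(history))
```

Under these assumptions the graph performs three revisions before stopping, since the trailing count goes 3, 5, 7 and only 7 exceeds the cap.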

In [23]:
# ─────────────────────────────────────────────
# Purpose : Visualize the compiled Reflexion workflow graph
# Input   : reflexion_workflow
# Output  : Rendered graph diagram
# ─────────────────────────────────────────────
# [DUPLICATE import — Image, display already imported in cell above]

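# draw_mermaid_png() avoids the optional pygraphviz dependency that draw_png() needs
# (by default it renders through the Mermaid.ink web service, so it needs network access)
display(Image(reflexion_workflow.get_graph().draw_mermaid_png()))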

### Testing

Run the complete Reflexion agent end-to-end. The agent iterates through draft → critique → research → revision cycles. This graph has no explicit quality check: `determine_next_action` ends the loop once the count of trailing AI and tool messages exceeds `MAXIMUM_CYCLES`.

In [None]:
# ─────────────────────────────────────────────
# Purpose : Run the Reflexion agent end-to-end and display each step
# Input   : target_question (str), reflexion_workflow
# Output  : Streamed step-by-step agent output
# ─────────────────────────────────────────────

target_question = "How do neural networks actually learn?"

print(f"Running Reflexion agent with question: {target_question}")
print("=" * 60)

events = reflexion_workflow.stream(
    {"messages": [("user", target_question)]},
    stream_mode="values",
)

for i, step in enumerate(events):
    print(f"\nStep {i}")
    print("-" * 40)
    step["messages"][-1].pretty_print()

print("\n" + "=" * 60)
print("Reflexion agent execution completed!")
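
As the earlier outputs show, the AI messages have empty `content` and carry the answer inside the tool call's arguments. A small helper can pull the final answer out of the message list. This is a sketch: `extract_final_answer` is a hypothetical helper, and `SimpleNamespace` objects with invented contents stand in for real LangChain messages.

```python
from types import SimpleNamespace


def extract_final_answer(messages: list):
    """Return (response, sources) from the most recent message with a tool call."""
    for message in reversed(messages):
        tool_calls = getattr(message, "tool_calls", None)
        if tool_calls:
            args = tool_calls[0]["args"]
            return args["response"], args.get("sources", [])
    raise ValueError("no structured response found")


# Hand-written stand-ins for a finished run
messages = [
    SimpleNamespace(type="human", content="How do neural networks learn?"),
    SimpleNamespace(
        type="ai",
        content="",
        tool_calls=[{
            "id": "call_1",
            "name": "GenerateResponse",
            "args": {"response": "First draft.", "research_queries": ["q"]},
        }],
    ),
    SimpleNamespace(type="tool", content="[search results]"),
    SimpleNamespace(
        type="ai",
        content="",
        tool_calls=[{
            "id": "call_2",
            "name": "ImproveResponse",
            "args": {
                "response": "Revised answer [1].",
                "research_queries": [],
                "sources": ["https://example.com"],
            },
        }],
    ),
]

answer, sources = extract_final_answer(messages)
print(answer, sources)
```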

### Key Takeaway

The Reflexion agent demonstrates a powerful iterative improvement loop:

1. **Generate** an initial technical explanation with self-critique
2. **Identify** specific knowledge gaps requiring research
3. **Execute** targeted web searches for current information
4. **Integrate** findings into a comprehensive, cited response
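5. **Repeat** the process until the explanation meets quality standards or, as in this implementation, an iteration cap is reached

By combining structured output (Pydantic), external tool use (Tavily), and cyclic graph execution (LangGraph), Reflexion goes beyond simple reflection: each revision is grounded in an explicit critique and freshly retrieved evidence, so the answer improves within a single run.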