<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://storage.googleapis.com/arize-phoenix-assets/assets/phoenix-logo-light.svg" width="200"/>
        <br>
        <a href="https://docs.arize.com/phoenix/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/phoenix">GitHub</a>
        |
        <a href="https://join.slack.com/t/arize-ai/shared_invite/zt-1px8dcmlf-fmThhDFD_V_48oU7ALan4Q">Community</a>
    </p>
</center>

# AutoGen Agents: Evaluator-Optimizer

In this tutorial, we'll explore the **Evaluator-Optimizer agent pattern** using AutoGen [GroupChats](https://microsoft.github.io/autogen/dev//user-guide/core-user-guide/design-patterns/group-chat.html) and demonstrate how to trace the process using Phoenix.

The Evaluator-Optimizer pattern runs a loop in which one agent acts as a generator, producing an initial output (such as text or code), while a second agent acts as an evaluator, critiquing that output against defined criteria. The feedback guides the generator through successive revisions until the output meets the criteria. This approach trades additional LLM calls for a more polished and accurate final result.
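
To make the loop concrete before bringing in AutoGen, here is a minimal, framework-free sketch. It is an illustration only: `refine`, `toy_generate`, and `toy_evaluate` are hypothetical names, the two toy functions stand in for LLM-backed agents, and the approval rule (a docstring is present) is an assumption chosen to keep the example small.

```python
from typing import Callable, Optional, Tuple


def refine(
    task: str,
    generate: Callable[[str, Optional[str]], str],
    evaluate: Callable[[str], Optional[str]],
    max_rounds: int = 5,
) -> Tuple[str, int]:
    """Run a generate -> evaluate loop until approval or max_rounds.

    `evaluate` returns None to approve a draft, or a feedback string to
    request a revision. Returns the last draft and the rounds used.
    """
    feedback: Optional[str] = None
    draft = ""
    for round_number in range(1, max_rounds + 1):
        draft = generate(task, feedback)
        feedback = evaluate(draft)
        if feedback is None:  # the evaluator approved this draft
            return draft, round_number
    return draft, max_rounds


# Toy generator: adds a docstring only after being asked for one.
def toy_generate(task: str, feedback: Optional[str]) -> str:
    if feedback and "docstring" in feedback:
        return 'def add(a, b):\n    """Return a + b."""\n    return a + b'
    return "def add(a, b):\n    return a + b"


# Toy evaluator: approves any draft that contains a docstring.
def toy_evaluate(draft: str) -> Optional[str]:
    return None if '"""' in draft else "Please add a docstring."


final_draft, rounds_used = refine("write add(a, b)", toy_generate, toy_evaluate)
print(f"Approved after {rounds_used} round(s)")
```

The `max_rounds` cap plays the same role as a round limit in a group chat: it bounds cost when the evaluator never approves.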

AutoGen's `GroupChat` architecture is well suited to this pattern because it manages the conversational turns between the generator and evaluator agents. The `GroupChatManager` facilitates the dialogue, allowing the agents to exchange the evolving outputs and feedback.

With Phoenix tracing, we can gain full visibility into each refinement cycle, tracking the feedback provided, the revisions made, and the overall progress towards the desired quality standard, which aids in debugging and analysis.

By the end of this tutorial, you’ll learn how to:

- Set up generator and evaluator AutoGen agents with specific roles and criteria.
- Implement the Evaluator-Optimizer pattern within an AutoGen GroupChat.
- Manage the iterative refinement loop between agents.
- Trace and visualize the iterative feedback and revision process using Phoenix.

⚠️ You'll need an OpenAI API key for this tutorial, plus a Phoenix API key for tracing.

## Set up Keys and Dependencies


In [None]:
!pip install -qqq arize-phoenix arize-phoenix-otel openinference-instrumentation-openai

In [None]:
!pip install -qq pyautogen==0.9 autogen-agentchat~=0.2

In [None]:
import os
from getpass import getpass

import autogen

if not (openai_api_key := os.getenv("OPENAI_API_KEY")):
    openai_api_key = getpass("🔑 Enter your OpenAI API key: ")

os.environ["OPENAI_API_KEY"] = openai_api_key

In [None]:
os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com"
if not os.environ.get("PHOENIX_CLIENT_HEADERS"):
    os.environ["PHOENIX_CLIENT_HEADERS"] = "api_key=" + getpass("Enter your Phoenix API key: ")

## Configure Tracing


In [None]:
from phoenix.otel import register

tracer_provider = register(
    project_name="autogen-agents",
    endpoint="https://app.phoenix.arize.com/v1/traces",
    auto_instrument=True,
)

## Example Evaluator-Optimizer Task: Code Generation & Evaluation

This **Evaluator-Optimizer** Agent pattern uses a loop where one AI agent generates an output, like code, and another agent evaluates it against specific criteria to drive improvement.

In this specific example, we'll utilize a `Code_Generator` agent to write Python code based on requirements. A dedicated `Code_Reviewer` agent will then assess this code for correctness, style, and documentation, providing targeted feedback. Through this iterative cycle of generation and critique within an AutoGen `GroupChat`, we'll demonstrate how to produce higher-quality, reviewed code that meets defined standards.

![Diagram](https://storage.googleapis.com/arize-phoenix-assets/assets/images/autogen_evaluate_optimizer_diagram.png)

## Define Agent

Each `AssistantAgent` takes an `llm_config` that specifies the model and sampling settings it uses.
We use different models for code generation (`gpt-4o`) and evaluation (`gpt-4.1-mini`).

In [None]:
# Each config list holds one model entry; a list of dicts is the conventional shape.
config_list_gen = [
    {
        "model": "gpt-4o",
        "api_key": os.environ["OPENAI_API_KEY"],
    }
]

config_list_eval = [
    {
        "model": "gpt-4.1-mini",
        "api_key": os.environ["OPENAI_API_KEY"],
    }
]

# A moderate temperature leaves some room for varied output.
llm_config_gen = {"config_list": config_list_gen, "temperature": 0.5}
llm_config_eval = {"config_list": config_list_eval, "temperature": 0.5}

This section initializes two AutoGen `AssistantAgent` instances designed for an iterative code generation and review workflow.

The `Code_Generator` agent is responsible for writing and revising Python code based on instructions and feedback, while the `Code_Reviewer` agent evaluates the generated code against detailed criteria like correctness, style, and readability, providing specific feedback or a precise termination signal: `"TERMINATE!"`.

In [None]:
# Code Generator Agent
coder = autogen.AssistantAgent(
    name="Code_Generator",
    system_message="""You are an expert Python programmer. Your goal is to write correct, efficient, and clean Python code based on user requests.
    You will receive either a request for code or feedback from the Code Reviewer.
    When writing code:
    - Always enclose the complete Python code block within ```python ... ``` tags.
    - Ensure the code directly addresses the request.
    - Include necessary imports.
    - Add docstrings and comments where appropriate.

    When you receive feedback from the Code Reviewer:
    - Carefully analyze each point of feedback.
    - Rewrite the code block incorporating the suggested changes precisely.
    - Output only the complete, revised Python code block. Do not add text or explanations.""",
    llm_config=llm_config_gen,
)

# Code Evaluator Agent
reviewer = autogen.AssistantAgent(
    name="Code_Reviewer",
    system_message="""
    You are an expert code reviewer specializing in Python. Your task is to evaluate Python code written by the Code_Generator agent based on the original request and the following criteria:
    1.  Correctness: Does the code seem logically correct and likely to fulfill the request's requirements?
    2.  Compliance: Does the code adhere to standard Python style guidelines (e.g., naming conventions, indentation, line length)?
    3.  Docstrings and Comments: Is there a clear docstring explaining the function/class purpose, arguments, and returns? Are comments used effectively where needed?
    4.  Readability & Maintainability: Is the code easy to understand? Are variable names meaningful? Is the logic straightforward?
    5.  Efficiency: Are there obvious major performance issues or highly inefficient patterns for typical use cases?
    6.  Error Handling: Does the code consider potential errors or edge cases (if relevant to the request)?

    Review the code thoroughly. Provide specific, constructive, numbered points of feedback referencing line numbers if possible. Focus on actionable improvements required to meet the criteria.
    If the code meets all criteria and requires no further changes, respond ONLY with the exact phrase: TERMINATE!
    Do NOT provide conversational text or summaries if you are approving the code with the termination phrase.""",
    llm_config=llm_config_eval,
)

Next, we define a function, `check_reviewer_approval`, that returns `True` only when a message consists solely of `"TERMINATE!"`.

Then, we initialize the AutoGen `UserProxyAgent` to act as the human user. We use `check_reviewer_approval` as the termination message check for this agent.

In [None]:
def check_reviewer_approval(message_dict):
    """Return True only if the content is exactly "TERMINATE!" (surrounding whitespace ignored)."""
    content = message_dict.get("content")
    return isinstance(content, str) and content.strip() == "TERMINATE!"


user_proxy = autogen.UserProxyAgent(
    name="User_Proxy",
    system_message="A human user providing the initial coding request and receiving the final result. Executes termination check.",
    human_input_mode="NEVER",
    is_termination_msg=check_reviewer_approval,
    code_execution_config=False,
)
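
The strictness of this check matters: a reviewer message that merely mentions the phrase inside its feedback should not end the chat. The sketch below exercises the predicate on a few hand-written messages. The samples are made up for illustration, and the function is repeated only so the snippet runs on its own.

```python
def check_reviewer_approval(message_dict):
    """Same predicate as in the tutorial, repeated so this snippet is self-contained."""
    content = message_dict.get("content")
    return isinstance(content, str) and content.strip() == "TERMINATE!"


# Hand-written sample messages (illustrative only).
samples = {
    "exact phrase": {"content": "TERMINATE!"},
    "padded with whitespace": {"content": "  TERMINATE!\n"},
    "phrase inside feedback": {"content": "1. Add a docstring. Otherwise TERMINATE!"},
    "missing content": {"content": None},
}

results = {label: check_reviewer_approval(msg) for label, msg in samples.items()}
for label, approved in results.items():
    print(f"{label}: {approved}")
```

Only the first two samples count as approval, so partial feedback that ends with the phrase keeps the loop going.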

Finally, we define an AutoGen `GroupChat` that includes the user proxy, coder, and reviewer agents and limits the interaction to a maximum of 10 rounds.

We then create a `GroupChatManager` to orchestrate the conversation within this group, giving it an LLM configuration and passing in `check_reviewer_approval` so that it recognizes the termination condition. The `GroupChatManager` uses the conversation so far to decide which agent speaks next.

In [None]:
groupchat = autogen.GroupChat(agents=[user_proxy, coder, reviewer], messages=[], max_round=10)

manager = autogen.GroupChatManager(
    groupchat=groupchat, llm_config=llm_config_gen, is_termination_msg=check_reviewer_approval
)
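
If you would rather not leave turn order to the manager's LLM, a deterministic alternation between generator and reviewer is one option. The sketch below is an assumption-laden illustration: `alternate_generator_and_reviewer` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins so the logic can run without AutoGen installed.

```python
from types import SimpleNamespace


def alternate_generator_and_reviewer(last_speaker, groupchat):
    """Pick the next speaker deterministically.

    Code_Reviewer speaks after Code_Generator; Code_Generator speaks
    after anyone else (including the user proxy's opening request).
    """
    if last_speaker.name == "Code_Generator":
        next_name = "Code_Reviewer"
    else:
        next_name = "Code_Generator"
    # Look up the agent with that name among the chat's participants.
    return next(agent for agent in groupchat.agents if agent.name == next_name)


# Stand-in objects that only mimic the `.name` and `.agents` attributes.
fake_agents = [
    SimpleNamespace(name=name)
    for name in ("User_Proxy", "Code_Generator", "Code_Reviewer")
]
fake_chat = SimpleNamespace(agents=fake_agents)

order = []
speaker = fake_agents[0]  # the user proxy opens the conversation
for _ in range(4):
    speaker = alternate_generator_and_reviewer(speaker, fake_chat)
    order.append(speaker.name)
print(order)
```

AutoGen 0.2's `GroupChat` accepts a callable for `speaker_selection_method` with this `(last_speaker, groupchat)` shape, so a function like this could likely be passed there. Verify against the version you have installed before relying on it.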

## Run Agent

We are now ready to run our agent using this sample `coding_task`.

In [None]:
coding_task = """
Please write a Python function called `calculate_fibonacci(n)` that calculates the nth Fibonacci number.

Requirements:
1. The function should accept a non-negative integer `n`.
2. The function should return the nth Fibonacci number.
3. Include a clear docstring explaining the function, parameter, and what it returns.
4. Handle the base cases correctly.
5. Use an iterative approach for efficiency.
"""

print("--- Starting Code Generation Task ---")
print(f"Request: {coding_task}\n")

tracer = tracer_provider.get_tracer(__name__)
with tracer.start_as_current_span(
    "CodeGenEval",
    openinference_span_kind="agent",
) as agent_span:
    # Initiate the chat
    user_proxy.initiate_chat(
        manager,
        message=coding_task,
    )

print("--- Code Generation Task Finished ---")
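
Once the chat ends, you will usually want the last code block the generator produced. The helper below is a sketch, not part of AutoGen: `last_generated_code` is a hypothetical name, and it assumes each message is a dict with `name` and `content` keys, which is the shape we expect `groupchat.messages` to have. The sample history is hand-written so the snippet runs without calling a model.

```python
import re

FENCE = "`" * 3  # built programmatically to avoid nesting literal code fences
CODE_BLOCK = re.compile(FENCE + r"python\s*\n(.*?)" + FENCE, re.DOTALL)


def last_generated_code(messages, author="Code_Generator"):
    """Return the most recent Python block posted by `author`, or None."""
    for message in reversed(messages):
        content = message.get("content")
        if message.get("name") != author or not isinstance(content, str):
            continue
        match = CODE_BLOCK.search(content)
        if match:
            return match.group(1).strip()
    return None


first_draft = "def calculate_fibonacci(n):\n    pass\n"
revised_draft = (
    "def calculate_fibonacci(n):\n"
    "    a, b = 0, 1\n"
    "    for _ in range(n):\n"
    "        a, b = b, a + b\n"
    "    return a\n"
)

# Hand-written history in the assumed message shape (illustrative only).
sample_history = [
    {"name": "User_Proxy", "content": "Please write calculate_fibonacci(n)."},
    {"name": "Code_Generator", "content": FENCE + "python\n" + first_draft + FENCE},
    {"name": "Code_Reviewer", "content": "1. The function body is empty."},
    {"name": "Code_Generator", "content": FENCE + "python\n" + revised_draft + FENCE},
    {"name": "Code_Reviewer", "content": "TERMINATE!"},
]

final_code = last_generated_code(sample_history)
print(final_code)
```

After a real run, the same helper could be called as `last_generated_code(groupchat.messages)`, provided the messages carry the assumed keys.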

## View Results in Phoenix

Phoenix tracing shows us the Evaluator-Optimizer workflow by visualizing the iterative loop between the `Code_Generator` and `Code_Reviewer` agents within the `GroupChat`.

Within the trace, you can see the entire refinement process, inspecting the specific prompts fed to the agents at each step and seeing the output progressively change and improve prior to the termination decision.


Run the cell below to see the full tracing results.

In [None]:
from IPython.display import HTML

HTML("""
<video width="800" height="600" controls>
  <source src="https://storage.googleapis.com/arize-phoenix-assets/assets/videos/autogen_eval_optimizer_results.mp4" type="video/mp4">
  Your browser does not support the video tag.
</video>
""")