# Chat Bot Evaluation as Multi-agent Simulation

Manually evaluating a chat bot on every code change is time-consuming. One way to automate some of the work is to simulate a "virtual user". Then you can focus on reviewing samples of the interactions.

In this notebook, you will use LangGraph to create a dialogue simulation between a virtual user and your chat bot. The overall simulation looks something like this:

![diagram](./img/virtual_user_diagram.png)

The main steps are:
1. Defining the virtual user
2. Connecting your chat bot
3. Constructing the dialogue simulation graph
4. Running!

First, we'll set up our environment.

In [1]:
# %%capture --no-stderr
# %pip install -U langgraph langchain langchain_openai

In [2]:
import getpass
import os
import uuid


def _set_if_undefined(var: str):
    if not os.environ.get(var):
        # `getpass` is imported as a module, so call its `getpass` function
        os.environ[var] = getpass.getpass(f"Please provide your {var}")


_set_if_undefined("OPENAI_API_KEY")
_set_if_undefined("LANGCHAIN_API_KEY")

# Optional, add tracing in LangSmith.
# This will help you visualize and debug the control flow
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = "Agent Simulation Evaluation"

## 1. Define the virtual user

The virtual user needs an LLM to reason and instructions for how it's supposed to behave (or what it's trying to accomplish).

Below, create an agent and instruct it to role-play a 'simulated' user. By including the `{system_prompt}` placeholder in the prompt and a field with the same name in the `Environment` state, you can customize the user's behavior each time you simulate a dialogue.

In [3]:
import operator
from typing import Annotated, Callable, Dict, List, TypedDict

from langchain.adapters.openai import convert_message_to_dict
from langchain_core.messages import AIMessage, BaseMessage, HumanMessage
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.runnables import chain
from langchain_openai import ChatOpenAI

from langgraph.graph import END, StateGraph

SIMULATED_USER_NAME = "simulated"


# This is the input to every node in the simulation graph
# It tracks the graph state over time. Our only "state"
# is the conversation messages, while the user config
# is provided to make the virtual user more unique or realistic
class Environment(TypedDict):
    messages: Annotated[List[BaseMessage], operator.add]
    # The system prompt will be fed into the prompt template below
    # For more control, try adding different parameters to provide the
    # prompt template below
    system_prompt: str


prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are role-playing a human character: '{name}'. "
            "You are not an AI assistant and you are not supposed to help or assist."
            " You must behave as this human would throughout the conversation below.\n\n"
            "Your messages will bear the name 'simulated', but DO NOT under any circumstances "
            "say that you are 'simulated'. You will be evaluated based on how realistic your "
            "impersonation of this character is. This must feel real! Here are the details for your character:"
            "\n"
            # This system_prompt is specified in the Environment above
            "{system_prompt}"
            # The FINISHED stopping criterion is checked by the should_continue function
            # in a later section. This tells the graph to stop the simulation.
            '\n\nWhen you are finished with the conversation, respond with a single word "FINISHED"',
        ),
        MessagesPlaceholder(variable_name="messages"),
    ]
).partial(name=SIMULATED_USER_NAME)

llm = ChatOpenAI(model="gpt-4-1106-preview")


def rename_message(message: AIMessage):
    # If we use an AIMessage, the simulated user may forget to continue role playing.
    # It will also confuse YOUR chat bot, since IT is supposed to be the AI in this scenario.
    # We instead convert them to 'Human' messages with the 'simulated' name
    # Your chat bot will then receive all the user's messages and think they
    # are human ones
    return {
        "messages": [HumanMessage(content=message.content, name=SIMULATED_USER_NAME)]
    }
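
The `operator.add` annotation on `messages` is why `rename_message` returns only the new message: LangGraph combines each node's returned list with the list already in the state instead of overwriting it. The sketch below mimics that merge with plain dictionaries. `apply_update` and `REDUCERS` are hypothetical names used for illustration only, not LangGraph functions, and the sketch assumes that keys without a reducer (such as `system_prompt`) are simply overwritten.

```python
import operator

# Hypothetical stand-in for the state merge; this is not LangGraph's code.
# Keys listed here are combined with their reducer, all others are overwritten.
REDUCERS = {"messages": operator.add}


def apply_update(state: dict, update: dict) -> dict:
    """Merge a node's partial update into a copy of the state."""
    merged = dict(state)
    for key, value in update.items():
        reducer = REDUCERS.get(key)
        if reducer is not None and key in merged:
            # operator.add on two lists concatenates them
            merged[key] = reducer(merged[key], value)
        else:
            merged[key] = value
    return merged


state = {
    "messages": ["help me plan my family vacation"],
    "system_prompt": "You are on a budget.",
}
# A node returns only its new message; the history is preserved by the reducer.
state = apply_update(state, {"messages": ["Of course! Where would you like to go?"]})
print(len(state["messages"]))
```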

Now we can compose these pieces using LCEL. The `|` syntax pipelines the data flow.

In [4]:
virtual_user = prompt | llm | rename_message
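
To make the data flow of the pipeline above concrete, here is a toy, dependency-free sketch of the same idea: each step receives the previous step's output. The `Step` class and the three helper functions are hypothetical stand-ins written for this illustration; they are not LCEL's implementation, and `fake_llm` does not call a model.

```python
from typing import Any, Callable


class Step:
    """Toy stand-in for a composable step. NOT the LCEL implementation."""

    def __init__(self, func: Callable[[Any], Any]):
        self.func = func

    def __or__(self, other):
        # Wrap plain functions so they can be chained as well
        nxt = other if isinstance(other, Step) else Step(other)
        return Step(lambda value: nxt.invoke(self.invoke(value)))

    def invoke(self, value):
        return self.func(value)


def fill_prompt(state: dict) -> str:
    # Plays the role of the prompt template: state in, prompt text out
    return f"[persona: {state['system_prompt']}] {state['messages'][-1]}"


def fake_llm(prompt_text: str) -> str:
    # Plays the role of the LLM: prompt text in, reply out
    return "reply to: " + prompt_text


def wrap_reply(reply: str) -> dict:
    # Plays the role of rename_message: reply in, state update out
    return {"messages": [reply]}


toy_user = Step(fill_prompt) | fake_llm | wrap_reply
out = toy_user.invoke(
    {"system_prompt": "You are on a budget.", "messages": ["Where to?"]}
)
print(out)
```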

## 2. Define your chat bot

Next, define the chat bot. For this notebook, we assume the bot's API accepts a list of messages and responds with a message. If your bot's API differs, change this section and the `get_messages_for_agent` function in the simulator below (as well as the environment state, if the bot requires additional inputs).

The actual implementation within `my_chat_bot` is configurable and can even run on another system (e.g., if your system isn't written in Python).

In [5]:
from typing import List

import openai


# Define your agent here, or call your agent's API from this function.
def my_chat_bot(messages: List[dict], model="gpt-3.5-turbo") -> dict:
    completion = openai.chat.completions.create(
        # Pass the `model` argument through instead of hard-coding it
        messages=messages, model=model
    )
    return completion.choices[0].message.model_dump()

Every node in the simulation is passed a state object of the `Environment` type above.

The two functions below define the API between the `Environment` state and your chat bot.

In [6]:
@chain
def get_messages_for_agent(state: Environment):
    """Convert the simulation state to the input for the agent you want to evaluate."""
    messages = []
    for message in state["messages"]:
        messages.append(convert_message_to_dict(message))
        if getattr(message, "name", None) != SIMULATED_USER_NAME:
            # Ensure YOUR chat bot still sees its messages
            # as assistant messages
            messages[-1]["role"] = "assistant"
    return messages


def get_response_message_from_agent(agent_output):
    """Get the response from the agent you are evaluating,
    and use it to update the simulation state."""
    # If we directly return an AI message from your chat bot, our
    # virtual user will likely forget it's role playing. To cover this up
    # we will convert it to a Human message.
    return {"messages": [HumanMessage(content=agent_output["content"])]}
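
The role mapping performed by `get_messages_for_agent` can be hard to follow because every message in the state is a `HumanMessage`; only the `name` field distinguishes the two speakers. The sketch below reproduces that mapping without any LangChain dependency. `Msg` and `to_chat_bot_messages` are hypothetical stand-ins for `HumanMessage` and the real conversion, kept deliberately minimal.

```python
from dataclasses import dataclass
from typing import List, Optional

SIMULATED_USER_NAME = "simulated"


@dataclass
class Msg:
    """Hypothetical stand-in for a HumanMessage with an optional name."""

    content: str
    name: Optional[str] = None


def to_chat_bot_messages(history: List[Msg]) -> List[dict]:
    """Map the shared history to the roles your chat bot expects."""
    converted = []
    for message in history:
        # Messages named 'simulated' came from the virtual user;
        # everything else was produced by your chat bot.
        role = "user" if message.name == SIMULATED_USER_NAME else "assistant"
        converted.append({"role": role, "content": message.content})
    return converted


history = [
    Msg("help me plan my family vacation", name=SIMULATED_USER_NAME),
    Msg("Of course! What is your budget?"),
    Msg("Not much, honestly.", name=SIMULATED_USER_NAME),
]
roles = [m["role"] for m in to_chat_bot_messages(history)]
print(roles)
```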

## 3. Define simulation graph

The dialogue simulation is almost ready. It's time to put everything together!

Below, create a graph using the `Environment` state defined above. Wire together the
virtual user and your chat bot, including a conditional `should_continue` edge to
handle the stopping behavior.

In [7]:
graph_builder = StateGraph(Environment)
graph_builder.add_node("user", virtual_user)
graph_builder.add_node(
    "chat_bot",
    # The "|" syntax composes these steps in the pipeline to map between
    # the simulation state and your chat bot's API
    get_messages_for_agent | my_chat_bot | get_response_message_from_agent,
)
# Every response from your chat bot will automatically go to the
# simulated user
graph_builder.add_edge("chat_bot", "user")


# Recall that we instructed the simulated user to respond "FINISHED" when
# it is done with the conversation. This function
# parses that output and tells the graph to cease execution.
# You can add other heuristics here for more control.
def should_continue(state: Environment):
    """Determine if the simulation should continue."""
    if state["messages"][-1].content.strip().endswith("FINISHED"):
        return "end"
    return "continue"


graph_builder.add_conditional_edges(
    # Every time the "user" node completes ...
    "user",
    # Call this function ...
    should_continue,
    # And based on the outputs of should_continue ...
    {
        # End the simulation OR
        "end": END,
        # continue to the chat_bot node
        "continue": "chat_bot",
    },
)
# The input will first go to your chat bot
graph_builder.set_entry_point("chat_bot")
simulation = graph_builder.compile()
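
As the comment above `should_continue` suggests, you can add other stopping heuristics. One common concern is a virtual user that never says "FINISHED", so a minimal sketch of a variant with a message cap follows. `MAX_MESSAGES` and `should_continue_with_cap` are hypothetical names, the cap of 12 is an arbitrary assumption, and `SimpleNamespace` stands in for real message objects so the example runs on its own.

```python
from types import SimpleNamespace

MAX_MESSAGES = 12  # assumed cap; tune this for your own conversations


def should_continue_with_cap(state: dict) -> str:
    """Stop on FINISHED, or once the conversation reaches the cap."""
    messages = state["messages"]
    if len(messages) >= MAX_MESSAGES:
        return "end"
    if messages[-1].content.strip().endswith("FINISHED"):
        return "end"
    return "continue"


# SimpleNamespace stands in for a message object with a `.content` attribute
ongoing = {"messages": [SimpleNamespace(content="What about the mountains?")]}
finished = {"messages": [SimpleNamespace(content="Thanks, that helps. FINISHED")]}
too_long = {"messages": [SimpleNamespace(content="hi")] * MAX_MESSAGES}

print(should_continue_with_cap(ongoing))
print(should_continue_with_cap(finished))
print(should_continue_with_cap(too_long))
```

Because it returns the same `"end"` and `"continue"` labels, a function like this could be passed to `add_conditional_edges` in place of `should_continue`.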

## 4. Run Simulation

Now we can evaluate our chat bot! We will provide information about the simulated user (as a system prompt)
as well as the initial input message from that simulated user to the chat bot.

In [8]:
result = simulation.invoke(
    {
        "system_prompt": "You are on a budget. Your family is hard to please."
        " They all like the beach, except for Aunt Lily, who prefers the mountains.",
        "messages": [
            HumanMessage(
                content="help me plan my family vacation", name=SIMULATED_USER_NAME
            )
        ],
    }
)

In [10]:
# These are the messages from the final simulation state
result["messages"]

[HumanMessage(content='help me plan my family vacation', name='simulated'),
 HumanMessage(content="Of course! I'd be happy to help you plan your family vacation. To assist you better, could you please provide some information about your preferences and interests? Include details like the duration of the trip, budget, destination preferences, and any specific activities or landmarks you'd like to include in your itinerary."),
 HumanMessage(content="Oh, planning a family vacation can be quite the task, right? Especially when everyone has different preferences. Well, my family is pretty similar. We're always on a budget, and it can be tough to please everyone. Usually, we end up somewhere by the beach because most of the family loves it, but then there's Aunt Lily who's always campaigning for the mountains.\n\nI've found that the trick is to compromise and maybe find a place that has a bit of both. You know, like a coastal area that has access to some nature trails or a scenic mountain sp

## (Optional) Review Results

If you've traced the run, you can see the full simulation trace in the LangSmith UI by going to the `Agent Simulation Evaluation` project.

Select the last 'ChatOpenAI' call in the trace to see the full conversation in a single view.

![full-conversation](./img/virtual_user_full_convo.png)


You can manually annotate this run to score its quality. This feedback can be used to compare the quality of different versions of your chat bot.

![annotate](./img/virtual_user_annotate.png)

## Conclusion

In this notebook, you set up a multi-agent simulation to review how your chat bot behaves with simulated users.

To implement this for your chat bot, you can create a dataset of user profiles and questions your chat bot should handle, and run the simulation over it periodically. You can use an LLM-as-judge to give the bot an initial score and then manually review a sample as a spot check.
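
A minimal sketch of such a batch run is shown below. Everything in it is an assumption made for illustration: the `PROFILES` list, the `run_batch` helper, and the `fake_invoke` function that replaces the compiled graph so the example runs without API calls. Scoring is left out; the sketch only collects the final message count per profile.

```python
from typing import Callable, Dict, List

# Hypothetical dataset of user profiles and opening questions
PROFILES: List[Dict[str, str]] = [
    {
        "system_prompt": "You are on a budget. Your family is hard to please.",
        "first_message": "help me plan my family vacation",
    },
    {
        "system_prompt": "You are in a hurry and dislike long answers.",
        "first_message": "I need a weekend trip idea",
    },
]


def run_batch(
    invoke: Callable[[dict], dict],
    profiles: List[Dict[str, str]],
    make_message: Callable[[str], object] = lambda text: text,
) -> List[dict]:
    """Run one simulation per profile and summarize each final state."""
    results = []
    for profile in profiles:
        final_state = invoke(
            {
                "system_prompt": profile["system_prompt"],
                "messages": [make_message(profile["first_message"])],
            }
        )
        results.append(
            {
                "system_prompt": profile["system_prompt"],
                "num_messages": len(final_state["messages"]),
            }
        )
    return results


def fake_invoke(inputs: dict) -> dict:
    # Stand-in for the compiled graph: appends a single canned reply
    return {**inputs, "messages": inputs["messages"] + ["canned reply"]}


summary = run_batch(fake_invoke, PROFILES)
print(summary)
```

With the graph from this notebook, you would pass `simulation.invoke` as `invoke` and `lambda text: HumanMessage(content=text, name=SIMULATED_USER_NAME)` as `make_message`.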

LangGraph gives you full control over the simulation so you can manually change the simulated user and the conversation dynamics.