<!-- NOTEBOOK_METADATA source: "Jupyter Notebook" title: "Example - Trace and Evaluate LangGraph Agents" description: "This guide shows how to evaluate LangGraph Agents with Langfuse using online and offline evaluation methods." category: "Integrations" -->

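# Evaluate LangGraph Agents

In this tutorial, we will learn how to **monitor the internal steps (traces) of [LangGraph agents](https://github.com/langchain-ai/langgraph)** and **evaluate their performance** using [Langfuse](https://langfuse.com) and [Hugging Face Datasets](https://huggingface.co/datasets).

This guide covers the **online** and **offline** evaluation methods needed to bring agents to production quickly and reliably. To learn more about evaluation strategies, see our [blog post](https://langfuse.com/blog/2025-03-04-llm-evaluation-101-best-practices-and-challenges).

**Why evaluating AI agents matters:**
- Debug issues when tasks fail or produce suboptimal results
- Monitor costs and performance in real time
- Improve reliability and safety through continuous feedback


## Step 0: Install the Required Libraries

Below we install `langgraph`, `langfuse`, and the Hugging Face `datasets` library.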

<!-- CALLOUT_START type: "info" emoji: "⚠️" -->
_**Note:** This notebook uses the Langfuse Python SDK v3. If you are still on Python SDK v2, see our [legacy LangGraph integration guide](https://github.com/langfuse/langfuse-docs/blob/662509b3296daddcddb292f14b10a62e7c39407d/pages/docs/integrations/langchain/example-langgraph-agents.md#L4)._
<!-- CALLOUT_END -->

In [None]:
%pip install langfuse langchain langgraph langchain_openai langchain_community langchain_huggingface datasets pandas

## Step 1: Set Environment Variables

Get your Langfuse API keys by signing up for Langfuse Cloud or by self-hosting Langfuse.

In [None]:
import os

# Get keys for your project from the project settings page: https://cloud.langfuse.com
os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-lf-..." 
os.environ["LANGFUSE_SECRET_KEY"] = "sk-lf-..." 
os.environ["LANGFUSE_HOST"] = "https://cloud.langfuse.com" # 🇪🇺 EU region
# os.environ["LANGFUSE_HOST"] = "https://us.cloud.langfuse.com" # 🇺🇸 US region

# Your openai key
os.environ["OPENAI_API_KEY"] = "sk-proj-..."

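With the environment variables set, we can initialize the Langfuse client. `get_client()` creates and returns a client instance using the credentials provided in the environment variables.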

In [None]:
from langfuse import get_client
 
langfuse = get_client()
 
# Verify connection
if langfuse.auth_check():
    print("Langfuse client is authenticated and ready!")
else:
    print("Authentication failed. Please check your credentials and host.")
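
A missing or blank credential is one common reason for `auth_check()` to fail. The sketch below lists which of the variables from this step are unset, so the problem can be caught before the client is created. `missing_env_vars` is a hypothetical helper written for this example and is not part of the Langfuse SDK.

```python
import os
from typing import Iterable, List, Mapping

# Variable names taken from the environment setup cell above.
REQUIRED_VARS = ("LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY", "LANGFUSE_HOST")


def missing_env_vars(
    required: Iterable[str] = REQUIRED_VARS,
    env: Mapping[str, str] = os.environ,
) -> List[str]:
    """Return the names in `required` that are unset or blank in `env`."""
    return [name for name in required if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All Langfuse environment variables are set.")
```

Passing `env` explicitly keeps the check easy to test without touching the real process environment.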

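## Step 2: Test Your Instrumentation

Here is a simple Q&A agent. Run it to confirm that the instrumentation works. If everything is set up correctly, you will see logs/spans in your observability dashboard.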

In [3]:
from typing import Annotated
 
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage
from typing_extensions import TypedDict
 
from langgraph.graph import StateGraph
from langgraph.graph.message import add_messages
 
class State(TypedDict):
    # Messages have the type "list". The `add_messages` function in the annotation defines how this state key should be updated
    # (in this case, it appends messages to the list, rather than overwriting them)
    messages: Annotated[list, add_messages]
 
graph_builder = StateGraph(State)
 
llm = ChatOpenAI(model = "gpt-4o", temperature = 0.2)
 
# The chatbot node function takes the current State as input and returns an updated messages list. This is the basic pattern for all LangGraph node functions.
def chatbot(state: State):
    return {"messages": [llm.invoke(state["messages"])]}
 
# Add a "chatbot" node. Nodes represent units of work. They are typically regular python functions.
graph_builder.add_node("chatbot", chatbot)
 
# Add an entry point. This tells our graph where to start its work each time we run it.
graph_builder.set_entry_point("chatbot")
 
# Set a finish point. This instructs the graph "any time this node is run, you can exit."
graph_builder.set_finish_point("chatbot")
 
# To be able to run our graph, call "compile()" on the graph builder. This creates a "CompiledGraph" that we can invoke on our state.
graph = graph_builder.compile()
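
The comment on the `messages` key says that `add_messages` appends to the list instead of overwriting it. The following is a plain-Python analogue of that idea, not LangGraph's implementation: `append_reducer` and `apply_update` are hypothetical names, and the real `add_messages` does more (for example, it also handles message IDs).

```python
from typing import Any, Callable, Dict, List


def append_reducer(existing: List[Any], update: List[Any]) -> List[Any]:
    """Toy reducer: return a new list with the update appended."""
    return existing + update


def apply_update(
    state: Dict[str, Any],
    update: Dict[str, Any],
    reducers: Dict[str, Callable[[Any, Any], Any]],
) -> Dict[str, Any]:
    """Merge a node's partial update into the state.

    Keys that have a reducer are combined with the old value;
    all other keys are simply overwritten.
    """
    merged = dict(state)
    for key, value in update.items():
        if key in reducers and key in merged:
            merged[key] = reducers[key](merged[key], value)
        else:
            merged[key] = value
    return merged


state = {"messages": ["What is Langfuse?"], "turn": 1}
update = {"messages": ["(assistant reply)"], "turn": 2}
new_state = apply_update(state, update, {"messages": append_reducer})

print(new_state["messages"])  # both messages, in order
print(new_state["turn"])      # overwritten by the update
```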

In [None]:
from langfuse.langchain import CallbackHandler

# Initialize Langfuse CallbackHandler for Langchain (tracing)
langfuse_handler = CallbackHandler()
 
for s in graph.stream(
    {"messages": [HumanMessage(content = "What is Langfuse?")]},
    config={"callbacks": [langfuse_handler]}):
    print(s)

Check your [Langfuse Traces Dashboard](https://cloud.langfuse.com/traces) to confirm that the spans and logs have been recorded.

Example trace in Langfuse:

![Example trace in Langfuse](https://langfuse.com/images/cookbook/example-langgraph-evaluation/first-example-trace.png)

_[Link to the trace](https://cloud.langfuse.com/project/cloramnkj0002jz088vzn1ja4/traces/ed0970b5-b251-4b85-9023-c0ed81462510?timestamp=2025-03-20T13%3A44%3A44.381Z&display=details&observation=0731595f-06e4-4f5a-b535-6e09677a752d)_

## Step 3: Observe and Evaluate a More Complex Agent

Now that the instrumentation is confirmed to work, let's try a more complex workflow so we can see how advanced metrics (token usage, latency, costs, etc.) are tracked.

In [5]:
import os
from typing import TypedDict, List, Dict, Any, Optional
from langgraph.graph import StateGraph, START, END
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage

In [6]:
class EmailState(TypedDict):
    email: Dict[str, Any]           
    is_spam: Optional[bool]         
    spam_reason: Optional[str]      
    email_category: Optional[str]   
    draft_response: Optional[str]   
    messages: List[Dict[str, Any]] 

In [None]:
# Initialize LLM
model = ChatOpenAI( model="gpt-4o",temperature=0)

class EmailState(TypedDict):
    email: Dict[str, Any]
    is_spam: Optional[bool]
    draft_response: Optional[str]
    messages: List[Dict[str, Any]]

# Define nodes
def read_email(state: EmailState):
    email = state["email"]
    print(f"Alfred is processing an email from {email['sender']} with subject: {email['subject']}")
    return {}

def classify_email(state: EmailState):
    email = state["email"]
    
    prompt = f"""
As Alfred the butler of Mr wayne and it's SECRET identity Batman, analyze this email and determine if it is spam or legitimate and should be brought to Mr wayne's attention.

Email:
From: {email['sender']}
Subject: {email['subject']}
Body: {email['body']}

First, determine if this email is spam.
answer with SPAM or HAM if it's legitimate. Only return the answer
Answer :
    """
    messages = [HumanMessage(content=prompt)]
    response = model.invoke(messages)
    
    response_text = response.content.lower()
    print(response_text)
    is_spam = "spam" in response_text and "ham" not in response_text
    
    if not is_spam:
        new_messages = state.get("messages", []) + [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": response.content}
        ]
    else :
        new_messages = state.get("messages", [])
    
    return {
        "is_spam": is_spam,
        "messages": new_messages
    }

def handle_spam(state: EmailState):
    print("Alfred has marked the email as spam.")
    print("The email has been moved to the spam folder.")
    return {}

def drafting_response(state: EmailState):
    email = state["email"]
    
    prompt = f"""
As Alfred the butler, draft a polite preliminary response to this email.

Email:
From: {email['sender']}
Subject: {email['subject']}
Body: {email['body']}

Draft a brief, professional response that Mr. Wayne can review and personalize before sending.
    """
    
    messages = [HumanMessage(content=prompt)]
    response = model.invoke(messages)
    
    new_messages = state.get("messages", []) + [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": response.content}
    ]
    
    return {
        "draft_response": response.content,
        "messages": new_messages
    }

def notify_mr_wayne(state: EmailState):
    email = state["email"]
    
    print("\n" + "="*50)
    print(f"Sir, you've received an email from {email['sender']}.")
    print(f"Subject: {email['subject']}")
    print("\nI've prepared a draft response for your review:")
    print("-"*50)
    print(state["draft_response"])
    print("="*50 + "\n")
    
    return {}

# Define routing logic
def route_email(state: EmailState) -> str:
    if state["is_spam"]:
        return "spam"
    else:
        return "legitimate"

# Create the graph
email_graph = StateGraph(EmailState)

# Add nodes
email_graph.add_node("read_email", read_email) # the read_email node executes the read_email function
email_graph.add_node("classify_email", classify_email) # the classify_email node executes the classify_email function
email_graph.add_node("handle_spam", handle_spam) # same logic
email_graph.add_node("drafting_response", drafting_response) # same logic
email_graph.add_node("notify_mr_wayne", notify_mr_wayne) # same logic


In [None]:
# Add edges
email_graph.add_edge(START, "read_email") # After starting we go to the "read_email" node

email_graph.add_edge("read_email", "classify_email") # after_reading we classify

# Add conditional edges
email_graph.add_conditional_edges(
    "classify_email", # after classifying, we run the "route_email" function
    route_email,
    {
        "spam": "handle_spam", # if it returns "spam", we go to the "handle_spam" node
        "legitimate": "drafting_response" # and if it's legitimate, we go to the "drafting_response" node
    }
)

# Add final edges
email_graph.add_edge("handle_spam", END) # after handling spam we always end
email_graph.add_edge("drafting_response", "notify_mr_wayne")
email_graph.add_edge("notify_mr_wayne", END) # after notifying Mr. Wayne, we can end too
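
Because `route_email` is a plain function and the mapping passed to `add_conditional_edges` is a plain dict, the routing decision can be checked without calling the LLM. A minimal sketch, where `RoutingState`, `ROUTES`, and `next_node` are illustrative names that are not part of LangGraph:

```python
from typing import Dict, Optional, TypedDict


class RoutingState(TypedDict):
    # Only the field the router reads; the full EmailState has more keys.
    is_spam: Optional[bool]


# Same label-to-node mapping as in the add_conditional_edges call above.
ROUTES: Dict[str, str] = {
    "spam": "handle_spam",
    "legitimate": "drafting_response",
}


def route_email(state: RoutingState) -> str:
    """Return the routing label for the classified email."""
    return "spam" if state["is_spam"] else "legitimate"


def next_node(state: RoutingState) -> str:
    """Resolve the node that would run after classify_email."""
    return ROUTES[route_email(state)]


print(next_node({"is_spam": True}))
print(next_node({"is_spam": False}))
```

Note that an unset `is_spam` (`None`) is treated as legitimate by this router, which matches the behavior of the `route_email` function in the notebook.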


In [9]:
# Compile the graph
compiled_graph = email_graph.compile()

In [10]:
 # Example emails for testing
legitimate_email = {
    "sender": "Joker",
    "subject": "Found you Batman ! ",
    "body": "Mr. Wayne,I found your secret identity ! I know you're batman ! Ther's no denying it, I have proof of that and I'm coming to find you soon. I'll get my revenge. JOKER"
}

spam_email = {
    "sender": "Crypto bro",
    "subject": "The best investment of 2025",
    "body": "Mr Wayne, I just launched an ALT coin and want you to buy some !"
}


In [None]:
from langfuse.langchain import CallbackHandler

# Initialize Langfuse CallbackHandler for Langchain (tracing)
langfuse_handler = CallbackHandler()

# Process legitimate email
print("\nProcessing legitimate email...")
legitimate_result = compiled_graph.invoke(
    input = {
        "email": legitimate_email,
        "is_spam": None,
        "draft_response": None,
        "messages": []
        },
    config={"callbacks": [langfuse_handler]}
)

# Process spam email
print("\nProcessing spam email...")
spam_result = compiled_graph.invoke(
    input = {
        "email": spam_email,
        "is_spam": None,
        "draft_response": None,
        "messages": []
        },
    config={"callbacks": [langfuse_handler]}
) 

### Trace Structure

Langfuse records a **trace** that contains **spans**, each representing one step of your agent's logic. Here, the trace contains the overall run and sub-spans for:
- The graph nodes (for example `classify_email` and `drafting_response`)
- The LLM calls (`gpt-4o` via `ChatOpenAI`)

You can inspect these to see precisely where time is spent, how many tokens are used, and so on:

![Trace tree in Langfuse](https://langfuse.com/images/cookbook/example-langgraph-evaluation/trace-tree.png)

_[Link to the trace](https://cloud.langfuse.com/project/cloramnkj0002jz088vzn1ja4/traces/3dd76e4b-980c-40eb-ae6d-ba9db5f6a349?timestamp=2025-03-20T14%3A56%3A16.665Z&display=details&observation=22b11054-93a8-4ff9-b862-babfcee906ec)_

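## Online Evaluation

Online evaluation means evaluating the agent in a live, real-world environment, i.e. during actual usage in production. It involves continuously monitoring the agent on real user interactions and analyzing the outcomes.

We have written down a guide on different evaluation techniques [here](https://langfuse.com/blog/2025-03-04-llm-evaluation-101-best-practices-and-challenges).

### Common Metrics to Track in Production

1. **Costs**: The instrumentation captures token usage, which you can turn into an approximate cost by assigning a price per token.
2. **Latency**: Observe the time it takes to complete each step or the entire run.
3. **User Feedback**: Users can provide direct feedback (thumbs up/down) to help refine or correct the agent.
4. **LLM-as-a-Judge**: Use a separate LLM to evaluate the agent's output in near real time (e.g., checking for toxicity or correctness).

Below, we show examples of these metrics.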

#### 1. Costs

The screenshot below shows the usage of the `gpt-4o` calls. This helps you identify costly steps and optimize your agent.

![Costs](https://langfuse.com/images/cookbook/example-langgraph-evaluation/gpt-4o-costs.png)

_[Link to the trace](https://cloud.langfuse.com/project/cloramnkj0002jz088vzn1ja4/traces/3dd76e4b-980c-40eb-ae6d-ba9db5f6a349?timestamp=2025-03-20T14%3A56%3A16.665Z&display=details&observation=22b11054-93a8-4ff9-b862-babfcee906ec)_
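
To make the token-to-cost arithmetic concrete, here is a minimal sketch. The prices are placeholder values chosen for illustration, not real model pricing, and `TokenPrice` and `estimate_cost` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenPrice:
    """USD price per one million tokens (placeholder values, not real pricing)."""
    input_per_million: float
    output_per_million: float


def estimate_cost(input_tokens: int, output_tokens: int, price: TokenPrice) -> float:
    """Approximate cost in USD for one LLM call."""
    return (
        input_tokens * price.input_per_million
        + output_tokens * price.output_per_million
    ) / 1_000_000


example_price = TokenPrice(input_per_million=2.0, output_per_million=8.0)
cost = estimate_cost(input_tokens=1_500, output_tokens=500, price=example_price)
print(f"Estimated cost: ${cost:.4f}")
```

Input and output tokens are priced separately here because providers commonly charge different rates for each.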

#### 2. Latency

We can also see how long each step took. In the example below, the entire run took about 3 seconds, which you can break down by step. This helps you identify bottlenecks and optimize your agent.

![Latency](https://langfuse.com/images/cookbook/example-langgraph-evaluation/agent-latency.png)

_[Link to the trace](https://cloud.langfuse.com/project/cloramnkj0002jz088vzn1ja4/traces/3dd76e4b-980c-40eb-ae6d-ba9db5f6a349?timestamp=2025-03-20T14%3A56%3A16.665Z&display=timeline)_
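
If you export per-step timings, finding the bottleneck is a small calculation. The sketch below assumes a simple mapping of step name to duration in seconds. The timings are made up for illustration and `slowest_step` is a hypothetical helper.

```python
from typing import Dict, Tuple


def slowest_step(durations: Dict[str, float]) -> Tuple[str, float, float]:
    """Return (step name, seconds, share of total time) for the slowest step."""
    if not durations:
        raise ValueError("durations must not be empty")
    total = sum(durations.values())
    name = max(durations, key=durations.get)
    share = durations[name] / total if total > 0 else 0.0
    return name, durations[name], share


# Made-up per-node timings in seconds, for illustration only.
example = {
    "read_email": 0.01,
    "classify_email": 0.8,
    "drafting_response": 2.1,
    "notify_mr_wayne": 0.01,
}
name, seconds, share = slowest_step(example)
print(f"{name}: {seconds:.2f}s ({share:.0%} of the run)")
```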

#### 3. User Feedback

If your agent is embedded in a user interface, you can record direct user feedback (such as a thumbs-up/down in a chat UI).

In [None]:
from langfuse import get_client
 
langfuse = get_client()
 
# Option 1: Use the yielded span object from the context manager
with langfuse.start_as_current_span(
    name="langgraph-request") as span:
    # ... LangGraph execution ...
 
    # Score using the span object
    span.score_trace(
        name="user-feedback",
        value=1,
        data_type="NUMERIC",
        comment="This was correct, thank you"
    )
 
# Option 2: Use langfuse.score_current_trace() if still in context
with langfuse.start_as_current_span(name="langgraph-request") as span:
    # ... LangGraph execution ...
 
    # Score using current context
    langfuse.score_current_trace(
        name="user-feedback",
        value=1,
        data_type="NUMERIC"
    )
 
# Option 3: Use create_score() with trace ID (when outside context)
langfuse.create_score(
    trace_id="predefined-trace-id", # Needs to be a valid trace id format (see docs)
    name="user-feedback",
    value=1,
    data_type="NUMERIC",
    comment="This was correct, thank you"
)
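
In a real UI, feedback usually arrives as a label rather than a number. The sketch below maps such labels to the keyword arguments used in Option 3 above. The labels `thumbs_up` and `thumbs_down` and the helper `feedback_to_score` are assumptions made for this example.

```python
from typing import Any, Dict, Optional

# Hypothetical UI labels mapped to numeric score values.
FEEDBACK_VALUES = {"thumbs_up": 1, "thumbs_down": 0}


def feedback_to_score(
    trace_id: str, feedback: str, comment: Optional[str] = None
) -> Dict[str, Any]:
    """Build keyword arguments for a score from a UI feedback label."""
    if feedback not in FEEDBACK_VALUES:
        raise ValueError(f"Unknown feedback label: {feedback!r}")
    kwargs: Dict[str, Any] = {
        "trace_id": trace_id,
        "name": "user-feedback",
        "value": FEEDBACK_VALUES[feedback],
        "data_type": "NUMERIC",
    }
    if comment:
        kwargs["comment"] = comment
    return kwargs


score_kwargs = feedback_to_score("predefined-trace-id", "thumbs_down", "Wrong category")
print(score_kwargs)
# With a configured client: langfuse.create_score(**score_kwargs)
```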

The user feedback is then captured in Langfuse:

![User feedback captured in Langfuse](https://langfuse.com/images/cookbook/example-langgraph-evaluation/user-feedback.png)

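#### 4. Automated LLM-as-a-Judge Scoring

LLM-as-a-Judge is a way to evaluate your agent's output automatically. You can set up a separate LLM call to gauge the output's correctness, toxicity, style, or any other criteria you care about.

**Workflow:**
1. You define an **evaluation template**, e.g., "Check if the text is toxic."
2. You set a model to act as the judge (judge model), e.g., `gpt-4o-mini`.
3. Each time your agent generates output, you pass that output to the judge LLM together with the template.
4. The judge LLM responds with a rating or label, which you log to your observability platform.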
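
The workflow can be sketched in a few lines of Python. This is an offline illustration under stated assumptions, not how Langfuse's managed evaluators are implemented: the template wording, the label set, and `fake_judge` (a stand-in for a real judge-model call) are all made up for this example.

```python
from typing import Callable, Dict

# Illustrative template; the wording is an assumption, not a Langfuse built-in.
TOXICITY_TEMPLATE = (
    "Check if the text is toxic. Answer with exactly one word: TOXIC or NOT_TOXIC.\n\n"
    "Text:\n{output}"
)


def judge_output(
    output: str,
    judge: Callable[[str], str],
    template: str = TOXICITY_TEMPLATE,
) -> Dict[str, str]:
    """Fill the template, ask the judge, and normalize its answer to a label."""
    prompt = template.format(output=output)
    raw = judge(prompt).strip().upper()
    label = raw if raw in {"TOXIC", "NOT_TOXIC"} else "UNPARSEABLE"
    return {"name": "toxicity", "value": label, "comment": f"judge answered: {raw}"}


def fake_judge(prompt: str) -> str:
    """Stand-in for a real judge-model call so the sketch runs offline."""
    return "not_toxic"


result = judge_output("Thank you for your email, I will pass it on.", fake_judge)
print(result)
```

Mapping unexpected answers to an explicit `UNPARSEABLE` label keeps malformed judge responses visible instead of silently counting them as a pass.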

Example from Langfuse:

![LLM-as-a-Judge evaluation template](https://langfuse.com/images/cookbook/integration_openai-agents/evaluator-template.png)
![LLM-as-a-Judge evaluator](https://langfuse.com/images/cookbook/integration_openai-agents/evaluator.png)

In [None]:
# Process spam email
print("\nProcessing spam email...")
spam_result = compiled_graph.invoke(
    input = {
        "email": spam_email,
        "is_spam": None,
        "draft_response": None,
        "messages": []
        },
    config={"callbacks": [langfuse_handler]}
) 

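You can see that the answer in this example is judged as "not toxic".

![LLM-as-a-Judge evaluation score](https://langfuse.com/images/cookbook/example-langgraph-evaluation/llm-as-a-judge-score.png)

#### 5. Observability Metrics Overview

All of these metrics can be visualized together in dashboards. This lets you quickly see how your agent performs across many sessions and track quality metrics over time.

![Observability metrics overview](https://langfuse.com/images/cookbook/integration_openai-agents/dashboard-dark.png)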

## Offline Evaluation

Online evaluation is essential for live feedback, but you also need **offline evaluation**—systematic checks before or during development. This helps maintain quality and reliability before rolling changes into production.

### Dataset Evaluation

In offline evaluation, you typically:
1. Have a benchmark dataset (with prompt and expected output pairs)
2. Run your agent on that dataset
3. Compare outputs to the expected results or use an additional scoring mechanism
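
The three steps above can be sketched without any external service. The benchmark rows and `stub_agent` below are made up for illustration, and the containment check is just one simple scoring choice among many.

```python
from typing import Callable, Dict, List


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(text.lower().split())


def evaluate(agent: Callable[[str], str], dataset: List[Dict[str, str]]) -> float:
    """Run the agent on every item; return the fraction of outputs
    that contain the expected answer."""
    if not dataset:
        raise ValueError("dataset must not be empty")
    hits = 0
    for item in dataset:
        output = agent(item["question"])
        if normalize(item["expected_answer"]) in normalize(output):
            hits += 1
    return hits / len(dataset)


# Tiny made-up benchmark and a stub agent, for illustration only.
benchmark = [
    {"question": "What is 2 + 2?", "expected_answer": "4"},
    {"question": "What is the capital of France?", "expected_answer": "Paris"},
]


def stub_agent(question: str) -> str:
    return "The answer is 4." if "2 + 2" in question else "I do not know."


print(f"Accuracy: {evaluate(stub_agent, benchmark):.2f}")
```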

Below, we demonstrate this approach with the [q&a-dataset](https://huggingface.co/datasets/junzhang1207/search-dataset), which contains questions and expected answers.

In [None]:
import pandas as pd
from datasets import load_dataset
 
# Fetch search-dataset from Hugging Face
dataset = load_dataset("junzhang1207/search-dataset", split = "train")
df = pd.DataFrame(dataset)
print("First few rows of search-dataset:")
print(df.head())

Next, we create a dataset entity in Langfuse to track the runs. Then, we add each item from the dataset to the system.

In [None]:
from langfuse import Langfuse
langfuse = Langfuse()
 
langfuse_dataset_name = "qa-dataset_langgraph-agent"
 
# Create a dataset in Langfuse
langfuse.create_dataset(
    name=langfuse_dataset_name,
    description="q&a dataset uploaded from Hugging Face",
    metadata={
        "date": "2025-03-21",
        "type": "benchmark"
    }
)

In [None]:
df_30 = df.sample(30) # For this example, we upload only 30 dataset questions

for idx, row in df_30.iterrows():
    langfuse.create_dataset_item(
        dataset_name=langfuse_dataset_name,
        input={"text": row["question"]},
        expected_output={"text": row["expected_answer"]}
    )

![Dataset items in Langfuse](https://langfuse.com/images/cookbook/example-langgraph-evaluation/example-dataset.png)

#### Running the Agent on the Dataset

First, we assemble a simple LangGraph agent that answers questions using OpenAI models.

In [15]:
from typing import Annotated
 
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage
from typing_extensions import TypedDict
 
from langgraph.graph import StateGraph
from langgraph.graph.message import add_messages
 
class State(TypedDict):
    messages: Annotated[list, add_messages]
 
graph_builder = StateGraph(State)
 
llm = ChatOpenAI(model = "gpt-4.5-preview")

def chatbot(state: State):
    return {"messages": [llm.invoke(state["messages"])]}
 
graph_builder.add_node("chatbot", chatbot)
graph_builder.set_entry_point("chatbot")
graph_builder.set_finish_point("chatbot")

graph = graph_builder.compile()

Then, we define a helper function `my_agent()` that:
1. Creates a Langfuse trace
2. Takes a `langfuse_handler` (a Langfuse `CallbackHandler`) to instrument the LangGraph execution
3. Runs our agent, passing `langfuse_handler` to the invocation

In [None]:
from langfuse import get_client
from langfuse.langchain import CallbackHandler
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

class State(TypedDict):
    messages: Annotated[list, add_messages]
 
graph_builder = StateGraph(State)
llm = ChatOpenAI(model = "gpt-4o")
langfuse = get_client()

def chatbot(state: State):
    return {"messages": [llm.invoke(state["messages"])]}
 
graph_builder.add_node("chatbot", chatbot)
graph_builder.set_entry_point("chatbot")
graph_builder.set_finish_point("chatbot")
graph = graph_builder.compile()

def my_agent(question, langfuse_handler):

    # Create a trace via Langfuse spans and run the LangGraph agent within it
    with langfuse.start_as_current_span(name="my-langgraph-agent") as root_span:

        # Run the graph; the callback handler records the LangGraph steps
        response = graph.invoke(
            input={"messages": [HumanMessage(content=question)]},
            config={"callbacks": [langfuse_handler]}
        )

        # The last message in the state is the model's answer
        answer = response["messages"][-1].content

        # Update trace input and output
        root_span.update_trace(input=question, output=answer)

        print(question)
        print(answer)

    return answer

Finally, we loop over each dataset item, run the agent, and link the trace to the dataset item. We can also attach a quick evaluation score if desired.

In [None]:
from langfuse import get_client
from langfuse.langchain import CallbackHandler
 
# Initialize Langfuse CallbackHandler for Langchain (tracing)
langfuse_handler = CallbackHandler()
langfuse = get_client()

dataset = langfuse.get_dataset('qa-dataset_langgraph-agent')

for item in dataset.items:
    # Use the item.run() context manager for automatic trace linking
    with item.run(
        run_name="run_gpt-4o",
        run_description="My first run",
        run_metadata={"model": "gpt-4o"},
    ) as root_span:
        # All operations within this block are part of the trace for this dataset item
        
        # Call your application logic - this can use any combination of decorators, 
        # context managers, or manual observations
        with langfuse.start_as_current_generation(
            name="llm-call",
            model="gpt-4o",
            input=item.input
        ) as generation:
            # Your LLM application logic here
            output = my_agent(str(item.input), langfuse_handler)
            generation.update(output=output)
 
        # Optionally, score the result against the expected output
        root_span.score_trace(
            name="user-feedback",
            value=1,
            comment="This is a comment",  # optional, useful to add reasoning
        )
 
# Flush the langfuse client to ensure all data is sent to the server at the end of the experiment run
langfuse.flush()
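
The loop above attaches a fixed score of `1`. To score against the expected output instead, one option is a token-overlap F1 between the agent's answer and the reference. This is a minimal sketch of that metric and not something the Langfuse SDK provides. `token_f1` is a hypothetical helper, and `expected_text` in the trailing comment stands for the reference answer stored on the dataset item.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a prediction and a reference answer (0.0 to 1.0)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = token_f1("Paris is the capital", "Paris")
print(round(score, 2))
# Inside the loop above, this could replace the fixed value, e.g.:
# root_span.score_trace(name="token-f1", value=token_f1(output, expected_text))
```

Token overlap is crude for free-form answers, so treat it as a quick signal alongside an LLM-as-a-Judge evaluator, not as a replacement for one.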

You can repeat this process with different agent configurations such as:
- Models (gpt-4o-mini, o1, etc.)
- Prompts
- Tools (search vs. no search)
- Complexity of agent (multi agent vs single agent)

Then compare them side-by-side in Langfuse. In this example, I ran the agent 3 times on the 30 dataset questions, using a different OpenAI model for each run. You can see that the number of correctly answered questions improves when using a larger model (as expected). The `correct_answer` score is created by an [LLM-as-a-Judge Evaluator](https://langfuse.com/docs/scores/model-based-evals) that is set up to judge the correctness of the answer based on the sample answer given in the dataset.

![Dataset run overview](https://langfuse.com/images/cookbook/example-langgraph-evaluation/dataset_runs.png)
![Dataset run comparison](https://langfuse.com/images/cookbook/example-langgraph-evaluation/dataset-run-comparison.png)
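
If you export the per-item scores, the same comparison can be reproduced in a few lines. The run names and scores below are made up for illustration and `summarize_runs` is a hypothetical helper.

```python
from statistics import mean
from typing import Dict, List


def summarize_runs(scores_by_run: Dict[str, List[float]]) -> Dict[str, float]:
    """Mean score per run, skipping runs without scores."""
    return {run: mean(scores) for run, scores in scores_by_run.items() if scores}


# Made-up 0/1 correctness scores, for illustration only.
example_scores = {
    "run_model_a": [1, 0, 1, 1],
    "run_model_b": [1, 1, 1, 1],
}
summary = summarize_runs(example_scores)

# Print runs from best to worst average score.
for run, avg in sorted(summary.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{run}: {avg:.2f}")
```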
