# LangSmith Evaluation Quickstart

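At a high level, the evaluation process consists of the following steps:

- Define the LLM application or target task.
- Create or select a dataset to evaluate the LLM application against. Depending on your evaluation criteria, the dataset may need to include expected outputs.
- Configure evaluators that score the application's outputs (usually by comparing them with the expected outputs or reference labels).
- Run the evaluation and inspect the results.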
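The four steps can be pictured as a plain Python loop. The sketch below is a conceptual, offline analogue and not LangSmith code: `toy_label_text` is a hypothetical keyword stub standing in for the LLM classifier, and `exact_match` mirrors the kind of evaluator configured later in this tutorial.

```python
# Illustrative, offline analogue of the four evaluation steps.
# Nothing here calls LangSmith; all names are hypothetical.

# Step 1: the "target task" (a keyword stub standing in for the LLM classifier)
def toy_label_text(text: str) -> str:
    toxic_words = {"idiot", "worst", "nobody"}
    words = {w.strip(".,!?").lower() for w in text.split()}
    return "Toxic" if words & toxic_words else "Not toxic"

# Step 2: a dataset of (inputs, reference outputs) pairs
toy_dataset = [
    ({"text": "Shut up, idiot"}, {"label": "Toxic"}),
    ({"text": "I had a great day today"}, {"label": "Not toxic"}),
    ({"text": "Nobody likes you"}, {"label": "Toxic"}),
]

# Step 3: an evaluator comparing the prediction with the reference label
def exact_match(prediction: str, reference: dict) -> int:
    return int(prediction == reference["label"])

# Step 4: run each example through the target, score it, and aggregate
scores = [exact_match(toy_label_text(inp["text"]), ref) for inp, ref in toy_dataset]
accuracy = sum(scores) / len(scores)
print(f"accuracy = {accuracy:.2f}")
```

In the real workflow, LangSmith stores the dataset, runs the loop through `evaluate`, and records each run and score so they can be inspected in the UI.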

This tutorial walks through evaluating a very simple LLM application: a classifier that labels input text as "Toxic" or "Not toxic".

In [5]:
!pip install langsmith==0.2.10
!pip show langsmith

Name: langsmith
Version: 0.2.10
Summary: Client library to connect to the LangSmith LLM Tracing and Evaluation Platform.
Home-page: https://smith.langchain.com/
Author: LangChain
Author-email: support@langchain.dev
License: MIT
Location: /usr/local/lib/python3.11/dist-packages
Requires: httpx, orjson, pydantic, requests, requests-toolbelt
Required-by: langchain, langchain-core


## 1. Define the target task

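We define a simple evaluation target: an LLM pipeline that classifies text as toxic or not toxic, with tracing enabled so that the inputs and outputs of every step in the pipeline are captured.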
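To show what "capturing the inputs and outputs of each step" means, here is a toy decorator that logs every call into an in-memory list. It is a hypothetical stand-in for illustration only, not how `@traceable` is implemented: the real decorator sends runs (including parent/child nesting) to LangSmith and typically relies on environment variables such as `LANGSMITH_API_KEY` and `LANGSMITH_TRACING=true`.

```python
import functools

# Hypothetical in-memory trace log; LangSmith instead sends runs to its server.
TRACE_LOG = []

def toy_traceable(func):
    """Record each call's name, inputs and output (toy stand-in for @traceable)."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        TRACE_LOG.append({
            "name": func.__name__,
            "inputs": {"args": args, "kwargs": kwargs},
            "output": result,
        })
        return result
    return wrapper

@toy_traceable
def normalize(text):
    return text.strip().lower()

@toy_traceable
def pipeline(text):
    cleaned = normalize(text)  # the nested call is logged as well
    return "Toxic" if "idiot" in cleaned else "Not toxic"

pipeline("  Shut up, idiot ")
for record in TRACE_LOG:
    print(record["name"], "->", record["output"])
```

The inner step finishes first, so it is logged before the outer pipeline call.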

In [7]:
from langsmith import traceable, wrappers
from openai import Client

# Create an OpenAI-compatible client (uncomment the provider you use)
# openai = Client()
# # OpenAI API via a proxy endpoint
# openai = Client(
#     api_key="XXX",
#     base_url="https://vip.apiyi.com/v1"
# )

# # Zhipu AI API
# openai = Client(
#     api_key="XXX",
#     base_url="https://open.bigmodel.cn/api/paas/v4/"
# )

# DeepSeek API (deepseek-chat model)
openai = Client(
    api_key="XXX",
    base_url="https://api.deepseek.com"
)

# Wrap the client so every chat-completion call is traced in LangSmith
openai = wrappers.wrap_openai(openai)

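# Mark the function as traceable
@traceable
def label_text(text):
    # Build the message list: a system prompt plus the user's text
    messages = [
        {
            "role": "system",
            "content": "Please review the user query below and determine whether it contains any form of toxic behavior, such as insults, threats, or highly negative comments. If it does, reply 'Toxic'; if it does not, reply 'Not toxic'.",
        },
        {"role": "user", "content": text},
    ]

    # Call the chat model (temperature=0 for more deterministic labels)
    result = openai.chat.completions.create(
        messages=messages, model="deepseek-chat", temperature=0
    )

    # Return the text of the model's reply
    return result.choices[0].message.content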


## 2. Create or select an evaluation dataset

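Next, we create an evaluation dataset containing examples of toxic and non-toxic text. Each example in the dataset holds three dictionaries/objects:
- inputs: the inputs to the pipeline
- outputs: the reference labels or other context stored in the dataset
- metadata: any additional metadata stored with the example

These dictionaries/objects can use arbitrary keys and values, but the keys must be consistent across all examples.

The values can take any form, such as strings, numbers, lists, or dictionaries; in this tutorial we only use strings.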
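To make that shape concrete, here is a plain-Python sketch of two examples and a local sanity check for the key-consistency rule. The `source` metadata key and the `check_consistent_keys` helper are hypothetical and used only for illustration.

```python
# Hypothetical examples shaped like LangSmith examples: inputs / outputs / metadata
sample_examples = [
    {"inputs": {"text": "Shut up, idiot"},
     "outputs": {"label": "Toxic"},
     "metadata": {"source": "manual"}},
    {"inputs": {"text": "I had a great day today"},
     "outputs": {"label": "Not toxic"},
     "metadata": {"source": "manual"}},
]

def check_consistent_keys(examples):
    """Return True if every example uses the same keys in each of its dicts."""
    first = examples[0]
    for ex in examples[1:]:
        for part in ("inputs", "outputs", "metadata"):
            if set(ex.get(part, {})) != set(first.get(part, {})):
                return False
    return True

print(check_consistent_keys(sample_examples))  # True

# An example that uses a different output key breaks the rule
bad = sample_examples + [
    {"inputs": {"text": "Nobody likes you"}, "outputs": {"toxic": True}, "metadata": {}}
]
print(check_consistent_keys(bad))  # False
```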

In [13]:
# Note: this shadows openai.Client imported earlier; the wrapped `openai` client is unaffected
from langsmith import Client

# Create the LangSmith client
client = Client(
    api_key="XXX",
    api_url="https://api.smith.langchain.com"
)

# Example texts paired with their reference labels
examples = [
    ("Shut up, idiot", "Toxic"),  # toxic
    ("You're a wonderful person", "Not toxic"),  # not toxic
    ("This is the worst thing ever", "Toxic"),  # toxic
    ("I had a great day today", "Not toxic"),  # not toxic
    ("Nobody likes you", "Toxic"),  # toxic
    ("This is unacceptable. I want to speak to the manager.", "Not toxic"),  # not toxic
]

# Create the dataset
dataset_name = "Toxic Queries"
dataset = client.create_dataset(dataset_name=dataset_name)

# Split the pairs into parallel tuples of input dicts and output dicts
inputs, outputs = zip(
    *[({"text": text}, {"label": label}) for text, label in examples]
)

# Create the examples and add them to the dataset
client.create_examples(inputs=inputs, outputs=outputs, dataset_id=dataset.id)
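`create_dataset` is not idempotent: dataset names are unique, so re-running the cell above with the same `dataset_name` will typically raise a conflict error. A minimal get-or-create sketch, assuming the `has_dataset` and `read_dataset` methods available on `langsmith.Client` in the pinned version; `FakeClient` is a hypothetical offline stand-in used only to exercise the helper.

```python
def get_or_create_dataset(client, dataset_name: str):
    """Reuse an existing dataset instead of failing when the cell is re-run."""
    if client.has_dataset(dataset_name=dataset_name):
        return client.read_dataset(dataset_name=dataset_name), False
    return client.create_dataset(dataset_name=dataset_name), True

# Hypothetical offline stand-in for langsmith.Client
class FakeClient:
    def __init__(self):
        self._datasets = {}

    def has_dataset(self, *, dataset_name):
        return dataset_name in self._datasets

    def read_dataset(self, *, dataset_name):
        return self._datasets[dataset_name]

    def create_dataset(self, *, dataset_name):
        if dataset_name in self._datasets:
            raise ValueError(f"Dataset {dataset_name!r} already exists")
        self._datasets[dataset_name] = {"name": dataset_name}
        return self._datasets[dataset_name]

fake = FakeClient()
first, created_first = get_or_create_dataset(fake, "Toxic Queries")
second, created_second = get_or_create_dataset(fake, "Toxic Queries")
print(created_first, created_second, first is second)  # True False True
```

Returning the `created` flag lets you call `create_examples` only when the dataset is new, so a re-run does not add the same examples twice.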

## 3. Configure the evaluator

Create an evaluator that scores the model output by comparing it with the reference label stored in the dataset.

In [14]:
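from langsmith.schemas import Example, Run

# Evaluator: check whether the predicted label matches the reference label
def correct_label(root_run: Run, example: Example) -> dict:
    # A plain string returned by the target is recorded under the "output" key;
    # compare it with the example's reference "label"
    score = root_run.outputs.get("output") == example.outputs.get("label")
    # Return the score (1 or 0) together with the feedback key
    return {"score": int(score), "key": "correct_label"}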
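`correct_label` uses exact string matching, so a reply such as "Toxic." or "**Toxic**" would score 0 even though the label is right. If your model formats its answers like that, a more lenient variant could normalize both sides first. `normalize_label` and `lenient_correct_label` are hypothetical helpers; they keep the same `(run, example)` positional signature as `correct_label`, and `SimpleNamespace` objects stand in for `Run` and `Example` in the offline check.

```python
import re
from types import SimpleNamespace

def normalize_label(text):
    """Map free-form model text such as '**Toxic.**' onto a canonical label."""
    if text is None:
        return None
    cleaned = re.sub(r"[^a-z ]", "", text.lower()).strip()
    if cleaned.startswith("not toxic"):
        return "Not toxic"
    if cleaned.startswith("toxic"):
        return "Toxic"
    return cleaned  # anything unexpected is left as-is, so it scores 0

def lenient_correct_label(root_run, example) -> dict:
    predicted = normalize_label(root_run.outputs.get("output"))
    expected = normalize_label(example.outputs.get("label"))
    return {"score": int(predicted == expected), "key": "lenient_correct_label"}

# Offline check with stand-ins for Run / Example
fake_run = SimpleNamespace(outputs={"output": "**Toxic.**"})
fake_example = SimpleNamespace(outputs={"label": "Toxic"})
print(lenient_correct_label(fake_run, fake_example))
```

To use it, you would pass `evaluators=[lenient_correct_label]` (or both evaluators) to `evaluate`.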

## 4. Run the evaluation and view the results

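Next, we run the evaluation with the `evaluate` method, which takes the following arguments:

- function: a callable that receives each example's inputs dictionary/object and returns the outputs (a dictionary/object, or a single value such as the string returned here)
- data: the name or UUID of the LangSmith dataset to evaluate on, or an iterator of examples
- evaluators: a list of evaluators that score the function's outputs
- experiment_prefix: a string prepended to the experiment name; if omitted, a name is generated automatically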
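The `function` argument receives each example's `inputs` dict, which is why the call below adapts `label_text` with a lambda. Because `label_text` returns a plain string, the run's outputs end up under an `"output"` key, which is exactly what `correct_label` reads. The sketch assumes that wrapping behavior (as far as we know it matches how langsmith records non-dict return values) and replaces the LLM with a hypothetical keyword stub; `as_run_outputs` is also hypothetical.

```python
# Offline sketch of how the target is invoked for one dataset example.

def fake_label_text(text: str) -> str:
    return "Toxic" if "idiot" in text.lower() else "Not toxic"

# Same adapter pattern as the evaluate() call: take the inputs dict, pass one field on
target = lambda inputs: fake_label_text(inputs["text"])

def as_run_outputs(value):
    """Wrap non-dict return values under an "output" key; dicts pass through."""
    return value if isinstance(value, dict) else {"output": value}

example_inputs = {"text": "Shut up, idiot"}
reference_outputs = {"label": "Toxic"}

run_outputs = as_run_outputs(target(example_inputs))
print(run_outputs)  # {'output': 'Toxic'}
print(int(run_outputs.get("output") == reference_outputs.get("label")))  # 1
```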

In [16]:
from langsmith.evaluation import evaluate
from langsmith import Client

# Create the LangSmith client
client = Client(
    api_key="XXX",
    api_url="https://api.smith.langchain.com"
)

# Dataset name
dataset_name = "Toxic Queries"

# Run the evaluation
results = evaluate(
    # Adapt label_text so it receives the example's inputs dict
    lambda inputs: label_text(inputs["text"]),
    data=dataset_name,  # dataset name
    evaluators=[correct_label],  # score with the correct_label evaluator
    experiment_prefix="Toxic Queries",  # experiment name prefix
    description="Testing the baseline system.",  # optional description
    client=client
)

View the evaluation results for experiment: 'Toxic Queries-615a68ba' at:
https://smith.langchain.com/o/7bfa9385-4ac5-468a-a06c-ffd7dbac42ec/datasets/cf2c7645-95ba-4403-b0b8-a6fce8fa7499/compare?selectedSessions=c5b7916a-07cd-449a-87e6-f2a59e905d6b





## Building a RAG Bot for LCEL Questions

In [27]:
!pip install langchain_community==0.3.14 langchain_openai==0.3.1
!pip install chromadb==0.6.3


In [28]:
!pip show langchain_community langchain_openai chromadb

Name: langchain-community
Version: 0.3.14
Summary: Community contributed LangChain integrations.
Home-page: https://github.com/langchain-ai/langchain
Author: 
Author-email: 
License: MIT
Location: /usr/local/lib/python3.11/dist-packages
Requires: aiohttp, dataclasses-json, httpx-sse, langchain, langchain-core, langsmith, numpy, pydantic-settings, PyYAML, requests, SQLAlchemy, tenacity
Required-by: 
---
Name: langchain-openai
Version: 0.3.1
Summary: An integration package connecting OpenAI and LangChain
Home-page: https://github.com/langchain-ai/langchain
Author: 
Author-email: 
License: MIT
Location: /usr/local/lib/python3.11/dist-packages
Requires: langchain-core, openai, tiktoken
Required-by: 
---
Name: chromadb
Version: 0.6.3
Summary: Chroma.
Home-page: https://github.com/chroma-core/chroma
Author: 
Author-email: Jeff Huber <jeff@trychroma.com>, Anton Troynikov <anton@trychroma.com>
License: 
Location: /usr/local/lib/python3.11/dist-packages
Requires: bcrypt, build, chroma-hnswlib, 

In [40]:
### Indexing
from bs4 import BeautifulSoup as Soup
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_community.document_loaders.recursive_url_loader import RecursiveUrlLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load the documents: crawl the LCEL docs recursively and keep only the page text
url = "https://python.langchain.com/v0.1/docs/expression_language/"
loader = RecursiveUrlLoader(
    url=url, max_depth=20, extractor=lambda x: Soup(x, "html.parser").text
)
docs = loader.load()

# Split the documents into smaller chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=4500, chunk_overlap=200)
splits = text_splitter.split_documents(docs)

# Embed the chunks and store them in Chroma
# OpenAI embeddings (via a proxy endpoint)
embeddings_model = OpenAIEmbeddings(
    api_key="XXX",
    base_url="https://vip.apiyi.com/v1"
)

vectorstore = Chroma.from_documents(documents=splits, embedding=embeddings_model)

# Create a retriever over the vector store
retriever = vectorstore.as_retriever()
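Here `chunk_size` and `chunk_overlap` are measured in characters: each chunk holds at most 4,500 characters, and neighbouring chunks share up to 200 characters so that text cut at a boundary still appears intact in one of them. The sketch below is a deliberately simplified fixed-window chunker, not the algorithm of `RecursiveCharacterTextSplitter` (which first tries separators such as blank lines, newlines and spaces before falling back to single characters); it only illustrates what the two parameters mean. `sliding_window_chunks` is a hypothetical helper.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows overlapping by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks

chunks = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so the last character of one chunk is repeated as the first character of the next.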

In [35]:
### RAG bot

# Import the class directly so the wrapped `openai` client defined earlier is not shadowed
from openai import OpenAI
from langsmith import traceable
from langsmith.wrappers import wrap_openai

class RagBot:

    def __init__(self, retriever, model: str = "deepseek-chat"):
        self._retriever = retriever
        # Wrap the client so LLM calls are traced
        self._client = wrap_openai(
            OpenAI(
                api_key="XXX",
                base_url="https://api.deepseek.com",
            )
        )
        self._model = model

    @traceable()
    def retrieve_docs(self, question):
        # Query the retriever for relevant documents
        return self._retriever.invoke(question)

    @traceable()
    def invoke_llm(self, question, docs):
        # Ask the LLM to answer using the retrieved documents as context
        response = self._client.chat.completions.create(
            model=self._model,
            messages=[
                {
                    "role": "system",
                    "content": "You are a helpful AI coding assistant with expertise in LCEL. Use the following docs to produce a concise code solution to the user's question.\n\n"
                    f"## Docs\n\n{docs}",
                },
                {"role": "user", "content": question},
            ],
        )

        # Downstream evaluators will expect the "answer" and "contexts" keys
        return {
            "answer": response.choices[0].message.content,
            "contexts": [str(doc) for doc in docs],
        }

    @traceable()
    def get_answer(self, question: str):
        # Retrieve the documents, then generate the answer
        docs = self.retrieve_docs(question)
        return self.invoke_llm(question, docs)

# Create the RagBot instance
rag_bot = RagBot(retriever)
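Because `rag_bot.get_answer` needs a live vector store and an API key, it can help to exercise the same flow (retrieve, build the prompt, call the model, return `answer` and `contexts`) offline first. Everything below is a hypothetical stand-in: `KeywordRetriever` uses naive word overlap instead of embedding similarity, and `fake_llm` replaces the chat-completion call.

```python
# Offline sketch of the RagBot data flow (no network calls).

class KeywordRetriever:
    """Toy retriever: returns stored docs that share a word with the question."""
    def __init__(self, docs):
        self._docs = docs

    def invoke(self, question):
        words = set(question.lower().split())
        return [d for d in self._docs if words & set(d.lower().split())]

def fake_llm(system_prompt: str, question: str) -> str:
    # Stand-in for the chat-completion call: reports how much context it received
    return f"Answer to {question!r} using {len(system_prompt)} chars of context"

def get_answer(retriever, question):
    docs = retriever.invoke(question)
    system_prompt = "Use the following docs to answer.\n\n" f"## Docs\n\n{docs}"
    return {
        "answer": fake_llm(system_prompt, question),
        "contexts": [str(doc) for doc in docs],  # same shape RagBot returns
    }

retriever_stub = KeywordRetriever([
    "LCEL chains are composed with the | operator",
    "Chroma is a vector store",
])
stub_result = get_answer(retriever_stub, "how do I compose chains")
print(stub_result["contexts"])  # ['LCEL chains are composed with the | operator']
```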

In [36]:
response = rag_bot.get_answer("How to build a RAG chain in LCEL?")
response["answer"][:500]

"To build a Retrieval-Augmented Generation (RAG) chain using LangChain Expression Language (LCEL), you can follow these steps. The RAG chain typically involves retrieving relevant documents and then generating a response based on those documents.\n\nHere's a basic example of how to build a RAG chain in LCEL:\n\n1. **Install Required Packages**:\n   Make sure you have the necessary packages installed:\n   ```bash\n   pip install langchain-core langchain-community langchain-openai\n   ```\n\n2. **Set Up Envi"