# Getting to grips with The LangChain Benchmarks

LangChain Benchmarks is a package to help benchmark various LLM-related tasks.

Docs: https://langchain-ai.github.io/langchain-benchmarks/index.html

From the intro page:
>The benchmarks are organized by end-to-end use cases, and utilize LangSmith heavily.

>We have several goals in open sourcing this:

> - Showing how we collect our benchmark datasets for each task.
> - Showing what the benchmark datasets we use for each task is.
> - Showing how we evaluate each task.
> - Encouraging others to benchmark their solutions on these tasks.




## 1 Installs, imports and stuff


In [1]:
%pip install -U --quiet langchain_benchmarks langchain langsmith

In [6]:
try:
    import importlib.metadata
    for module in ["pydantic", "langchain", "langchain_benchmarks"]:
        try:
            print(f"{module} version: {importlib.metadata.version(module)}")
        except importlib.metadata.PackageNotFoundError:
            print(f"{module} is not installed.")
except ImportError:
    print("importlib.metadata is not available. Try running in Python 3.8 or higher.")


pydantic version: 2.10.6
langchain version: 0.2.17
langchain_benchmarks version: 0.0.14


In [7]:
from google.colab import userdata
import os

os.environ["LANGCHAIN_API_KEY"] = userdata.get("LANGSMITH_API_KEY")
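
Outside Colab, `google.colab.userdata` is not available. Here is a minimal sketch of a fallback, assuming it is acceptable to read the key from an environment variable or an interactive prompt. `load_secret` is a hypothetical helper and is not part of any of the packages used here:

```python
import getpass
import os
from typing import Callable


def load_secret(name: str, prompt: Callable[[str], str] = getpass.getpass) -> str:
    """Return a secret from the environment, Colab secrets, or a prompt (in that order)."""
    if os.environ.get(name):
        return os.environ[name]
    try:
        # Only importable inside Google Colab.
        from google.colab import userdata
    except ImportError:
        # Not in Colab: ask interactively (the input is not echoed).
        return prompt(f"{name}: ")
    return userdata.get(name)
```

With this helper, the cell above could be written as `os.environ["LANGCHAIN_API_KEY"] = load_secret("LANGSMITH_API_KEY")`.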

## 2 The benchmark



### 2.1 The tasks

Each benchmark task has a corresponding description, dataset, and other "environment" information. You can view the available tasks by checking the registry.

In [8]:
from langchain_benchmarks import registry

registry

| Name | Type | Dataset ID | Description |
|---|---|---|---|
| Tool Usage - Typewriter (1 tool) | ToolUsageTask | 59577193-8938-4ccf-92a7-e8a96bcf4f86 | Environment with a single tool that accepts a single letter as input, and prints it on a piece of virtual paper. The objective of this task is to evaluate the ability of the model to use the provided tools to repeat a given input string. For example, if the string is 'abc', the tools 'a', 'b', and 'c' must be invoked in that order. The dataset includes examples of varying difficulty. The difficulty is measured by the length of the string. |
| Tool Usage - Typewriter (26 tools) | ToolUsageTask | 128af05e-aa00-4e3b-a958-d166dd450581 | Environment with 26 tools each tool represents a letter of the alphabet. The objective of this task is to evaluate the model's ability the use tools for a simple repetition task. For example, if the string is 'abc', the tools 'a', 'b', and 'c' must be invoked in that order. The dataset includes examples of varying difficulty. The difficulty is measured by the length of the string. This is a variation of the typer writer task, where 26 parameterless tools are given instead of a single tool that takes a letter as an argument. |
| Tool Usage - Relational Data | ToolUsageTask | 1d89f4b3-5f73-48cf-a127-2fdeb22f6d84 | Environment with fake data about users and their locations and favorite foods. The environment provides a set of tools that can be used to query the data. The objective of this task is to evaluate the ability to use the provided tools to answer questions about relational data. The dataset contains 21 examples of varying difficulty. The difficulty is measured by the number of tools that need to be used to answer the question. Each example is composed of a question, a reference answer, and information about the sequence in which tools should be used to answer the question. Success is measured by the ability to answer the question correctly, and efficiently. |
| Multiverse Math | ToolUsageTask | 47ed57bc-e852-4f84-a23e-cce4793864e9 | An environment that contains a few basic math operations, but with altered results. For example, multiplication of `5*3` will be re-interpreted as `5*3*1.1`. The basic operations retain some basic properties, such as commutativity, associativity, and distributivity; however, the results are different than expected. The objective of this task is to evaluate the ability to use the provided tools to solve simple math questions and ignore any innate knowledge about math. This task is associated with 20 test examples. |
| Email Extraction | ExtractionTask | a1742786-bde5-4f51-a1d8-e148e5251ddb | A dataset of 42 real emails deduped from a spam folder, with semantic HTML tags removed, as well as a script for initial extraction and formatting of other emails from an arbitrary .mbox file like the one exported by Gmail. Some additional cleanup of the data was done by hand after the initial pass. See https://github.com/jacoblee93/oss-model-extraction-evals. |
| Chat Extraction | ExtractionTask | 00f4444c-9460-4a82-b87a-f50096f1cfef | A dataset meant to test the ability of an LLM to extract and infer structured information from a dialogue. The dialogue is between a user and a support engineer. Outputs should be structured as a JSON object and test both the ability of the LLM to correctly structure the information and its ability to perform simple classification tasks. |
| LangChain Docs Q&A | RetrievalTask | 452ccafc-18e1-4314-885b-edd735f17b9d | Questions and answers based on a snapshot of the LangChain python docs. The environment provides the documents and the retriever information. Each example is composed of a question and reference answer. Success is measured based on the accuracy of the answer relative to the reference answer. We also measure the faithfulness of the model's response relative to the retrieved documents (if any). |
| Semi-structured Reports | RetrievalTask | c47d9617-ab99-4d6e-a6e6-92b8daf85a7d | Questions and answers based on PDFs containing tables and charts. The task provides the raw documents as well as factory methods to easily index them and create a retriever. Each example is composed of a question and reference answer. Success is measured based on the accuracy of the answer relative to the reference answer. We also measure the faithfulness of the model's response relative to the retrieved documents (if any). |
| Multi-modal slide decks | RetrievalTask | 40afc8e7-9d7e-44ed-8971-2cae1eb59731 | This public dataset is a work-in-progress and will be extended over time. Questions and answers based on slide decks containing visual tables and charts. Each example is composed of a question and reference answer. Success is measured based on the accuracy of the answer relative to the reference answer. |
| Name Correction | ExtractionTask | 78df83ee-ba7f-41c6-832c-2b23327d4cf7 | A dataset of 23 misspelled full names and their correct spellings. |


### 2.2 Downloading datasets

Each benchmark task has a corresponding dataset. To run evals on a benchmark, first fetch its dataset with one of the package's helper functions. For more details on working with datasets within the LangChain Benchmarks package, see the [datasets notebook](https://langchain-ai.github.io/langchain-benchmarks/notebooks/datasets.html).

Note the difference between clone and download:
- **clone_public_dataset** will clone the dataset to your own LangSmith tenant
- **download_public_dataset** will download it to the local file system


In [9]:
from langchain_benchmarks import clone_public_dataset, download_public_dataset

In [10]:
help(download_public_dataset)

Help on function download_public_dataset in module langchain_benchmarks.utils._langsmith:

download_public_dataset(token_or_url: str, *, path: Union[str, pathlib.Path, NoneType] = None, api_url: str = 'https://api.smith.langchain.com/') -> None
    Download a public dataset.



In [11]:
task = registry["Tool Usage - Relational Data"]

In [12]:

#clone_public_dataset(task.dataset_id)

In [13]:
download_public_dataset(task.dataset_id, path="tool_usage_relational_data.json")

Fetching examples...


Done fetching examples.


In [14]:
!ls -la

total 28
drwxr-xr-x 1 root root  4096 Mar  6 02:54 .
drwxr-xr-x 1 root root  4096 Mar  6 02:42 ..
drwxr-xr-x 4 root root  4096 Mar  4 14:26 .config
drwxr-xr-x 1 root root  4096 Mar  4 14:26 sample_data
-rw-r--r-- 1 root root 11857 Mar  6 02:54 tool_usage_relational_data.json


Downloading it to my local machine...

In [15]:
from google.colab import files
files.download("tool_usage_relational_data.json")


### 2.3 Looking at the data

We'll start by looking at one of the benchmark tasks -- [Tool Usage - Relational Data](https://langchain-ai.github.io/langchain-benchmarks/notebooks/tool_usage/relational_data.html).

In this task, an agent is given access to a set of tools that can be used to make queries across 3 relational tables.

The tables contain information about users, locations and foods. The agent must answer questions about the data using the provided tools.

In [16]:
import json

with open("tool_usage_relational_data.json", "r") as f:
    data = json.load(f)

In [17]:
print(data[0])
print(len(data))

{'dataset_id': 'df6be6c9-05b3-445e-8836-ebb4aba63826', 'inputs': {'question': 'What is the city for location ID 1?'}, 'outputs': {'reference': 'New York', 'order_matters': True, 'expected_steps': ['get_city_for_location']}, 'metadata': None, 'id': '114cb5ab-3ffb-4ac0-b4f8-3b7cd7baf58f', 'created_at': '2023-11-18T00:30:11.557683+00:00', 'modified_at': '2023-11-18T00:30:11.557683+00:00', 'runs': [], 'source_run_id': None}
21


In [18]:
for k, v in data[0].items():
    print(k, v)

dataset_id df6be6c9-05b3-445e-8836-ebb4aba63826
inputs {'question': 'What is the city for location ID 1?'}
outputs {'reference': 'New York', 'order_matters': True, 'expected_steps': ['get_city_for_location']}
metadata None
id 114cb5ab-3ffb-4ac0-b4f8-3b7cd7baf58f
created_at 2023-11-18T00:30:11.557683+00:00
modified_at 2023-11-18T00:30:11.557683+00:00
runs []
source_run_id None


In [19]:
for row in data:
    print(row["inputs"]["question"])
    print(f"  {row['outputs']}")

What is the city for location ID 1?
  {'reference': 'New York', 'order_matters': True, 'expected_steps': ['get_city_for_location']}
What is the name of food with id 6?
  {'reference': 'Pasta', 'order_matters': True, 'expected_steps': ['get_food_name']}
what is eve's user id?
  {'reference': '42', 'order_matters': True, 'expected_steps': ['find_users_by_name']}
get the current user id
  {'reference': '35', 'order_matters': True, 'expected_steps': ['get_current_user_id']}
How many users by the name of bob?
  {'reference': '1', 'order_matters': True, 'expected_steps': ['find_users_by_name']}
what is alice's email address?
  {'reference': 'alice@gmail.com', 'order_matters': True, 'expected_steps': ['find_users_by_name', 'get_user_email']}
find donna's favorite color
  {'reference': 'green', 'order_matters': True, 'expected_steps': ['find_users_by_name', 'get_user_favorite_color']}
weather in LA right now?
  {'reference': 'Sunny, Temperature: 75°F', 'order_matters': True, 'expected_steps': 
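
The `expected_steps` and `order_matters` fields in the `outputs` printed above describe the sequence of tool calls a correct run is expected to make. Here is a minimal sketch of how such a comparison could work, assuming that `order_matters=False` would mean "the same calls, in any order". `steps_match` is a hypothetical helper and not the package's own evaluator:

```python
from collections import Counter
from typing import List


def steps_match(expected: List[str], actual: List[str], order_matters: bool = True) -> bool:
    """Compare the tool names an agent called against the expected ones."""
    if order_matters:
        return actual == expected
    # Ignore order, but still require the same number of calls per tool.
    return Counter(actual) == Counter(expected)


# Expected steps taken from the "what is alice's email address?" example.
expected = ["find_users_by_name", "get_user_email"]

print(steps_match(expected, ["find_users_by_name", "get_user_email"]))
print(steps_match(expected, ["get_user_email", "find_users_by_name"]))
print(steps_match(expected, ["get_user_email", "find_users_by_name"], order_matters=False))
```

The three calls print `True`, `False` and `True`: the swapped trajectory only passes once order is ignored.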

### 2.4 The model registry

LangChain Benchmarks also includes a model registry to make it easier to run benchmarks across different models.

We'll get back to that later, but see https://langchain-ai.github.io/langchain-benchmarks/notebooks/models.html for an introduction.



### 2.5 The task environment

The task environment contains the tools and the data to run the experiments.

In [20]:
env = task.create_environment()
#dir(env)

In [21]:
for tool in env.tools:
    print(tool)

name='get_user_name' description="Get the name of the user with the given user ID.\n\n        Args:\n            user_id: The user's ID.\n\n        Returns:\n            The user's name." args_schema=<class 'pydantic.v1.main.get_user_nameSchema'> handle_tool_error=True func=<function get_available_functions.<locals>.get_user_name at 0x7feeff9e19e0>
name='list_user_ids' description='List all the user IDs.' args_schema=<class 'pydantic.v1.main.list_user_idsSchema'> handle_tool_error=True func=<function get_available_functions.<locals>.list_user_ids at 0x7feeff9e1940>
name='find_users_by_name' description='Find users with the given name.\n\n        Args:\n            name: The name to search for.\n\n        Returns:\n            The list of matching users.' args_schema=<class 'pydantic.v1.main.find_users_by_nameSchema'> handle_tool_error=True func=<function get_available_functions.<locals>.find_users_by_name at 0x7feeff9e2160>
name='find_locations_by_name' description='Find locations with

The tools let you look up information by ID (e.g., `get_user_email` takes a user ID and returns the email) and search by name (e.g., `find_foods_by_name` takes a food name and returns a list of results).
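
The cells that follow pick tools by list position (`env.tools[0]`), which depends on the order in which the environment happens to return them. Here is a small sketch of looking tools up by name instead. The `SimpleNamespace` objects are stand-ins so the snippet runs without the package, and it assumes only the `.name` attribute and `.invoke` method seen in this notebook:

```python
from types import SimpleNamespace

# Stand-ins for env.tools: each object has a name and an invoke(dict) method.
fake_users = {21: "Bob"}
tools = [
    SimpleNamespace(name="get_user_name", invoke=lambda args: fake_users[args["user_id"]]),
    SimpleNamespace(name="list_user_ids", invoke=lambda args: sorted(fake_users)),
]

# Index the tools by name once, then call them without caring about list order.
tools_by_name = {tool.name: tool for tool in tools}

print(tools_by_name["get_user_name"].invoke({"user_id": 21}))
```

With the real environment, the dictionary would be built from `env.tools` in the same way.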


In [22]:
env.tools[0].invoke({"user_id": 21})

'Bob'

In [23]:
env.tools[3].invoke({"city": "LA"})

[{'id': 2, 'city': 'Los Angeles'},
 {'id': 1, 'city': 'New York'},
 {'id': 3, 'city': 'Chicago'},
 {'id': 4, 'city': 'Houston'},
 {'id': 5, 'city': 'Miami'}]

### This I didn't understand at first, but it turns out that...
...the `find_locations_by_name` function in https://github.com/langchain-ai/langchain-benchmarks/blob/main/langchain_benchmarks/tool_usage/tasks/relational_data.py does not filter anything. It returns all locations, sorted in descending order by "jaccard similarity based on the number of shared characters between the query and the data".
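
To see why a query like "LA" returns every location, here is a minimal sketch of that kind of ranking. It assumes the score is the Jaccard similarity of the two strings' character sets, compared case-sensitively, and that the locations are stored in ID order. `jaccard`, `rank_by_jaccard` and the `locations` list are illustrative stand-ins rather than the package's code:

```python
from typing import Dict, List


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the sets of characters in two strings."""
    chars_a, chars_b = set(a), set(b)
    union = chars_a | chars_b
    return len(chars_a & chars_b) / len(union) if union else 0.0


def rank_by_jaccard(rows: List[Dict], query: str, key: str) -> List[Dict]:
    """Return all rows, most similar first; nothing is filtered out."""
    # sorted() is stable, so rows with equal scores keep their original order.
    return sorted(rows, key=lambda row: jaccard(row[key], query), reverse=True)


# Assumed storage order (by ID); the cities are the ones shown in the output above.
locations = [
    {"id": 1, "city": "New York"},
    {"id": 2, "city": "Los Angeles"},
    {"id": 3, "city": "Chicago"},
    {"id": 4, "city": "Houston"},
    {"id": 5, "city": "Miami"},
]

for row in rank_by_jaccard(locations, "LA", "city"):
    print(row["id"], row["city"], round(jaccard(row["city"], "LA"), 3))
```

Under these assumptions only "Los Angeles" shares any characters with the query (the uppercase L and A, giving 2/9), so it comes first. The remaining cities tie at zero and keep their original order, which matches the output of the cell above.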



## 3 Create an agent

https://langchain-ai.github.io/langchain-benchmarks/notebooks/tool_usage/intro.html#create-an-agent

Because an agent interacts with the environment via tools and can change the state of the environment during the course of an agent run, what we actually want is the ability to create a fresh agent and a fresh environment for each test run.

We’ll do this using a factory. A factory is just a fancy name in computer science for an object that can create other objects. In this case, we’ll have an Agent Factory that we can call and it’ll create a fresh agent for us on each call.

We’ll use the `StandardAgentFactory`, which under the hood creates a standard LangChain tool-calling agent. It can be used with any chat model that supports tool calling.
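
To make the factory idea concrete, here is a toy sketch that is unrelated to the internals of the real `StandardAgentFactory`. `ToyEnvironment`, `ToyAgent` and `ToyAgentFactory` are hypothetical names. It illustrates why creating a fresh agent and environment on each call keeps test runs from leaking state into each other:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyEnvironment:
    """Mutable state that an agent run may change."""
    log: List[str] = field(default_factory=list)


@dataclass
class ToyAgent:
    env: ToyEnvironment

    def invoke(self, question: str) -> str:
        # Each call mutates the environment this agent is bound to.
        self.env.log.append(question)
        return f"seen {len(self.env.log)} question(s)"


class ToyAgentFactory:
    """Callable that builds a new agent with its own new environment."""

    def __call__(self) -> ToyAgent:
        return ToyAgent(env=ToyEnvironment())


factory = ToyAgentFactory()
first = factory()
first.invoke("a")
print(first.invoke("b"))      # second question for the first agent
print(factory().invoke("c"))  # a fresh agent starts from a clean environment
```

The first `print` reports two questions and the second reports one, because the second agent does not share the first agent's environment.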



In [24]:
%pip install -qU langchain-anthropic

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
langchain-text-splitters 0.2.4 requires langchain-core<0.3.0,>=0.2.38, but you have langchain-core 0.3.41 which is incompatible.
langchain-community 0.2.19 requires langchain-core<0.3.0,>=0.2.43, but you

In [25]:
%pip install -qU langchain-openai

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
langchain-benchmarks 0.0.14 requires langchain-openai<0.2.0,>=0.1.14, but you have langchain-openai 0.3.7 which is incompatible.

In [26]:
from langchain_openai import ChatOpenAI
os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")

In [27]:
#from langchain_anthropic import ChatAnthropic
from langchain_core.prompts import ChatPromptTemplate

from langchain_benchmarks.tool_usage.agents import StandardAgentFactory

#model = ChatAnthropic(model="claude-3-opus-20240229", temperature=0)
model = ChatOpenAI(model="gpt-4o", temperature=0)

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "{instructions}"),  # Populated from task.instructions automatically
        (
            "human",
            "{question}",
        ),  # Each evaluation example is associated with a question
        ("placeholder", "{agent_scratchpad}"),  # Space for the agent to do work
    ]
)

agent_factory = StandardAgentFactory(task, model, prompt)

In [28]:
from langchain.globals import set_verbose  # avoids shadowing the built-in globals()

set_verbose(True)
agent = agent_factory()
agent.invoke({"question": "what is the weather in LA"})

RuntimeError: no validator found for <class 'langchain_core.runnables.base.Runnable'>, see `arbitrary_types_allowed` in Config