# Debugging, Evaluating and Monitoring LLM Applications with LangSmith

## Install OpenAI and LangChain dependencies

Install this pinned version of the httpx library for compatibility with the other libraries.

In [1]:
!pip install httpx==0.27.2

Collecting httpx==0.27.2
  Downloading httpx-0.27.2-py3-none-any.whl.metadata (7.1 kB)
Downloading httpx-0.27.2-py3-none-any.whl (76 kB)
Installing collected packages: httpx
  Attempting uninstall: httpx
    Found existing installation: httpx 0.28.1
    Uninstalling httpx-0.28.1:
      Successfully uninstalled httpx-0.28.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-genai 1.16.1 requires httpx<1.0.0,>=0.28.1, but you have httpx 0.27.2 which is incompatible.
Successfully installed httpx-0.27.2


In [2]:
!pip install langchain==0.2.0
!pip install langchain-openai==0.1.7
!pip install langchain-community==0.2.0
!pip install langsmith==0.1.71

Collecting langchain==0.2.0
  Downloading langchain-0.2.0-py3-none-any.whl.metadata (13 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain==0.2.0)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting langchain-core<0.3.0,>=0.2.0 (from langchain==0.2.0)
  Downloading langchain_core-0.2.43-py3-none-any.whl.metadata (6.2 kB)
Collecting langchain-text-splitters<0.3.0,>=0.2.0 (from langchain==0.2.0)
  Downloading langchain_text_splitters-0.2.4-py3-none-any.whl.metadata (2.3 kB)
Collecting langsmith<0.2.0,>=0.1.17 (from langchain==0.2.0)
  Downloading langsmith-0.1.147-py3-none-any.whl.metadata (14 kB)
Collecting numpy<2,>=1 (from langchain==0.2.0)
  Downloading numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Collecting tenacity<9.0.0,>=8.1.0 (from langchain==0.2.0)
  Downloading tenacity-8.5.0-p

## Setup Environment Variables

In [5]:
from google.colab import userdata
import os

os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')
os.environ["LANGCHAIN_TRACING_V2"] = "true"
# you can change this based on specific apps and projects
os.environ["LANGCHAIN_PROJECT"] = "LLM App Project - 1"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGCHAIN_API_KEY"] = userdata.get('LANGSMITH_KEY')
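
Before running the cells that follow, it can help to confirm that the variables above are actually set. The following is a minimal sketch using only the standard library; `missing_env_vars` and `REQUIRED_VARS` are hypothetical names, not part of LangChain or LangSmith.

```python
import os

# Variable names taken from the setup cell above.
REQUIRED_VARS = (
    "OPENAI_API_KEY",
    "LANGCHAIN_TRACING_V2",
    "LANGCHAIN_PROJECT",
    "LANGCHAIN_ENDPOINT",
    "LANGCHAIN_API_KEY",
)

def missing_env_vars(names=REQUIRED_VARS, environ=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]
```

Calling `missing_env_vars()` with no arguments checks the real process environment; an empty list means every variable has a non-empty value.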

# Debugging and Monitoring LLM Apps with LangSmith

This section looks at ways to trace and inspect the individual steps of an LLM app.

A Run is a single step of your application, such as a prompt, a model call, or an output parser.

A Trace is the series of runs your application takes to go from input to output.

A Project is simply a collection of traces.
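
The relationship between these three terms can be sketched with a few toy classes. These are illustrative stand-ins only, not the LangSmith schema, and the example trace is simplified (a real trace would typically also record a parent run for the chain itself).

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Run:
    """One step of the application, e.g. a prompt, an LLM call or a parser."""
    name: str
    run_type: str  # e.g. "prompt", "llm", "parser"

@dataclass
class Trace:
    """The ordered runs that take one input to one output."""
    runs: List[Run] = field(default_factory=list)

@dataclass
class Project:
    """A named collection of traces."""
    name: str
    traces: List[Trace] = field(default_factory=list)

    def runs_of_type(self, run_type: str) -> List[Run]:
        """Collect every run of the given type across all traces."""
        return [r for t in self.traces for r in t.runs if r.run_type == run_type]

# A simplified trace for one call of a prompt | model | parser chain.
trace = Trace(runs=[
    Run("ChatPromptTemplate", "prompt"),
    Run("ChatOpenAI", "llm"),
    Run("StrOutputParser", "parser"),
])
project = Project("LLM App Project - 1", traces=[trace])
```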

![](https://i.imgur.com/hiMkTK9.png)



## Monitor LLM App Traces with LangSmith

In [6]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI


# Configure the chat prompt template and define the llm chain.
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "Act as a helpful AI Assistant"),
        ("human", "{human_input}"),
    ]
)

# Initialize the OpenAI Chat instance with specific model parameters.
chatgpt = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

# create simple llm chain
llm_chain = prompt | chatgpt | StrOutputParser()

In [7]:
prompt_txt = "Explain AI in 3 bullet points"
response = llm_chain.invoke({'human_input': prompt_txt})
print(response)

- AI, or artificial intelligence, refers to the simulation of human intelligence processes by machines, especially computer systems.
- It involves the development of algorithms and models that enable machines to perform tasks that typically require human intelligence, such as learning, problem-solving, and decision-making.
- AI technologies include machine learning, natural language processing, computer vision, and robotics, and are used in various applications across industries such as healthcare, finance, and transportation.


## Selective Tracing

In [8]:
os.environ["LANGCHAIN_TRACING_V2"] = "false"

In [9]:
# does not get traced anymore
prompt_txt = "Explain Generative AI in one line"
response = llm_chain.invoke({'human_input': prompt_txt})
print(response)

Generative AI is a type of artificial intelligence that can create new content, such as images, text, or music, based on patterns it has learned from existing data.


In [10]:
from langchain.callbacks.tracers import LangChainTracer

# You can configure a LangChainTracer instance to trace a specific invocation.
tracer = LangChainTracer()
prompt_txt = "Explain AI in one line"
response = llm_chain.invoke({'human_input': prompt_txt}, config={"callbacks": [tracer]})
print(response)

AI, or artificial intelligence, refers to the simulation of human intelligence processes by machines, typically computer systems.


In [11]:
# LangChain also supports a context manager for tracing a specific block of code.
from langchain_core.tracers.context import tracing_v2_enabled

prompt_txt = "Explain Generative AI in one line"
with tracing_v2_enabled():
    response = llm_chain.invoke({'human_input': prompt_txt})
    print(response)

Generative AI is a type of artificial intelligence that can create new content, such as images, text, or music, based on patterns and data it has been trained on.
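
Switching `LANGCHAIN_TRACING_V2` by hand, as in the first cell of this section, leaves the new value in place for every later cell. A scoped override can be sketched with the standard library alone. `temporary_env` is a hypothetical helper, and the sketch assumes (as the cells above suggest) that the variable is read at invocation time.

```python
import os
from contextlib import contextmanager

@contextmanager
def temporary_env(name, value):
    """Set an environment variable inside a with-block, then restore the old state."""
    previous = os.environ.get(name)
    os.environ[name] = value
    try:
        yield
    finally:
        if previous is None:
            # The variable did not exist before, so remove it again.
            os.environ.pop(name, None)
        else:
            os.environ[name] = previous
```

Used as `with temporary_env("LANGCHAIN_TRACING_V2", "false"):`, calls inside the block would see the overridden value, and the previous value returns once the block exits, even if an exception is raised.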


## Log traces to specific projects dynamically

In [12]:
tracer = LangChainTracer(project_name='LLM App Project - 2')
prompt_txt = "Explain AI in one line"
response = llm_chain.invoke({'human_input': prompt_txt}, config={"callbacks": [tracer]})
print(response)

AI, or artificial intelligence, refers to the simulation of human intelligence processes by machines, such as learning, reasoning, and problem-solving.


In [13]:
prompt_txt = "Explain Generative AI in one line"
with tracing_v2_enabled(project_name='LLM App Project - 2'):
    response = llm_chain.invoke({'human_input': prompt_txt})
    print(response)

Generative AI is a type of artificial intelligence that can create new content, such as images, text, or music, based on patterns and data it has been trained on.


## Adding metadata and tags in traces

In [14]:
prompt_txt = "Explain Generative AI in one line"
with tracing_v2_enabled():
    response = llm_chain.invoke(
        {'human_input': prompt_txt},
        config={"tags": ['AI', 'Data Science'],
                "metadata": {"user": "aks", "team": "data science"}},
    )
    print(response)

Generative AI is a type of artificial intelligence that can create new content, such as images, text, or music, based on patterns it has learned from existing data.


In [15]:
prompt_txt = "Explain which is the fastest animal?"
with tracing_v2_enabled():
    response = llm_chain.invoke(
        {'human_input': prompt_txt},
        config={"tags": ['General Knowledge', 'Environment'],
                "metadata": {"user": "bob", "team": "social science"}},
    )
    print(response)

The fastest animal on land is the cheetah, which can reach speeds of up to 60-70 miles per hour (96-112 km/h) in short bursts covering distances up to 500 meters. In the air, the peregrine falcon holds the title for the fastest animal, reaching speeds of over 240 miles per hour (386 km/h) when diving to catch prey. In the water, the sailfish is considered the fastest swimmer, reaching speeds of up to 68 miles per hour (110 km/h).


## Customize run names

In [16]:
prompt_txt = "Explain what is Deep Learning?"
with tracing_v2_enabled():
    response = llm_chain.invoke(
        {'human_input': prompt_txt},
        config={"tags": ['AI', 'Data Science'],
                "metadata": {"user": "dj", "team": "data science"},
                "run_name": "AKSRunMay2025_001"},
    )
    print(response)

Deep learning is a subset of machine learning that involves training artificial neural networks to learn and make decisions from data. These neural networks are composed of multiple layers of interconnected nodes, which allow them to learn complex patterns and relationships in the data. Deep learning has been particularly successful in tasks such as image and speech recognition, natural language processing, and playing games. It is a powerful tool for solving problems that involve large amounts of data and can often outperform traditional machine learning algorithms in terms of accuracy and performance.
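
The config dictionaries in the last three cells share the same shape: `tags`, `metadata`, and an optional `run_name`. A small builder keeps them consistent across calls. This is a sketch only; `build_run_config` is a hypothetical helper, not a LangChain function.

```python
from typing import Any, Dict, List, Optional

def build_run_config(tags: List[str], user: str, team: str,
                     run_name: Optional[str] = None) -> Dict[str, Any]:
    """Assemble the config dict passed as the second argument to invoke()."""
    config: Dict[str, Any] = {
        "tags": list(tags),
        "metadata": {"user": user, "team": team},
    }
    # Only include run_name when the caller supplies one.
    if run_name is not None:
        config["run_name"] = run_name
    return config

config = build_run_config(["AI", "Data Science"], user="dj", team="data science",
                          run_name="AKSRunMay2025_001")
```

The resulting `config` has the same keys and values as the dictionary in the cell above, so it could be passed to `llm_chain.invoke` in its place.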


# Evaluating and Monitoring LLM Apps with LangSmith

Here we will create an evaluation dataset, evaluate our LLM app against it using several metrics, and see how to monitor the application with LangSmith.

In [17]:
os.environ["LANGCHAIN_TRACING_V2"] = "true"
# you can change this based on specific apps and projects
os.environ["LANGCHAIN_PROJECT"] = "LLM App Project - 1"

## Access LLM App runs

In [18]:
from langsmith import Client

In [19]:
from datetime import datetime, timedelta

# Initialize a client
client = Client(timeout_ms=3600000)

todays_llm_runs = client.list_runs(
    project_name="LLM App Project - 1",
    start_time=datetime.now() - timedelta(days=1), # can change or remove this to retrieve more runs
    run_type="llm",
)
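
The `start_time` above is built from a naive `datetime.now()`, which carries no time zone. One way to make the window explicit is to compute it in UTC. This is a sketch; `window_start` is a hypothetical helper, and whether the API treats naive and timezone-aware timestamps differently is not verified here.

```python
from datetime import datetime, timedelta, timezone

def window_start(days=1, now=None):
    """Return a timezone-aware UTC timestamp `days` before `now`."""
    now = datetime.now(timezone.utc) if now is None else now
    return now - timedelta(days=days)
```

The result could be passed as `start_time=window_start(days=1)`; a larger `days` value widens the window to retrieve more runs.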

In [20]:
dataset = []
for run in todays_llm_runs:
    dataset.append((run.inputs, run.outputs))

In [21]:
dataset[0]

({'messages': [[{'id': ['langchain', 'schema', 'messages', 'SystemMessage'],
     'kwargs': {'content': 'Act as a helpful AI Assistant', 'type': 'system'},
     'lc': 1,
     'type': 'constructor'},
    {'id': ['langchain', 'schema', 'messages', 'HumanMessage'],
     'kwargs': {'content': 'Explain what is Deep Learning?', 'type': 'human'},
     'lc': 1,
     'type': 'constructor'}]]},
 {'generations': [[{'generation_info': {'finish_reason': 'stop',
      'logprobs': None},
     'message': {'id': ['langchain', 'schema', 'messages', 'AIMessage'],
      'kwargs': {'content': 'Deep learning is a subset of machine learning that involves training artificial neural networks to learn and make decisions from data. These neural networks are composed of multiple layers of interconnected nodes, which allow them to learn complex patterns and relationships in the data. Deep learning has been particularly successful in tasks such as image and speech recognition, natural language processing, and playi

In [38]:
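# prompt of the last (presumably oldest) run in `dataset`; avoids relying on a leftover `input` variable
dataset[-1][0]['messages'][0][1]['kwargs']['content']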

'Explain AI in 3 bullet points'
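
The hard-coded index `[0][1]` works because the human message is the second message in this prompt. A lookup by message type is less brittle. The sketch below assumes the serialized structure shown in the `dataset[0]` output above; `human_prompt_from_inputs` is a hypothetical helper.

```python
def human_prompt_from_inputs(run_inputs):
    """Return the content of the first human message in a serialized LLM run input."""
    for message in run_inputs["messages"][0]:
        # The last element of "id" names the message class.
        if message["id"][-1] == "HumanMessage":
            return message["kwargs"]["content"]
    return None

# Sample shaped like the `dataset[0]` output above (trimmed to the fields used here).
sample_inputs = {
    "messages": [[
        {"id": ["langchain", "schema", "messages", "SystemMessage"],
         "kwargs": {"content": "Act as a helpful AI Assistant", "type": "system"}},
        {"id": ["langchain", "schema", "messages", "HumanMessage"],
         "kwargs": {"content": "Explain what is Deep Learning?", "type": "human"}},
    ]]
}
print(human_prompt_from_inputs(sample_inputs))  # prints: Explain what is Deep Learning?
```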

## Create an evaluation dataset of inputs and outputs

Ideally, the reference outputs should be human-labeled or annotated ground truth for the inputs. Here we simply reuse the model outputs to create an evaluation dataset quickly.

In [47]:
refined_dataset = []
for run_inputs, run_outputs in dataset:
    refined_dataset.append((run_inputs['messages'][0][1]['kwargs']['content'],
                            run_outputs['generations'][0][0]['text']))
# drop exact duplicate (prompt, answer) pairs; note that set() does not preserve order
refined_dataset = list(set(refined_dataset))
refined_dataset

[('Explain AI in one line',
  'AI, or artificial intelligence, refers to the simulation of human intelligence processes by machines, typically computer systems.'),
 ('Explain what is Deep Learning?',
  'Deep learning is a subset of machine learning that involves training artificial neural networks to learn and make decisions from data. These neural networks are composed of multiple layers of interconnected nodes, which allow them to learn complex patterns and relationships in the data. Deep learning has been particularly successful in tasks such as image and speech recognition, natural language processing, and playing games. It is a powerful tool for solving problems that involve large amounts of data and can often outperform traditional machine learning algorithms in terms of accuracy and performance.'),
 ('Explain AI in 3 bullet points',
  '- AI, or artificial intelligence, refers to the simulation of human intelligence processes by machines, especially computer systems.\n- It involv
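
The `list(set(...))` call above removes duplicate pairs but does not keep them in their original order. If order matters, `dict.fromkeys` is a common alternative. In this sketch the answers are placeholder strings, not real model outputs.

```python
# Placeholder (prompt, answer) pairs; the answer strings are made up for illustration.
pairs = [
    ("Explain AI in one line", "answer A"),
    ("Explain what is Deep Learning?", "answer B"),
    ("Explain AI in one line", "answer A"),  # exact duplicate of the first pair
]

# dict keys keep insertion order (Python 3.7+), so duplicates are dropped
# while the first occurrence of each pair stays in place.
unique_pairs = list(dict.fromkeys(pairs))
```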

Let's also add some human-labeled input-output examples, where the output is written by a person.

In [48]:
more_examples = [
  ("What is the largest mammal?", "The blue whale"),
  ("What do mammals and birds have in common?", "Both are homeothermic (warm-blooded) animals"),
  ("What's the main characteristic of amphibians?", "They live both in water and on land"),
]

refined_dataset.extend(more_examples)

refined_dataset

[('Explain AI in one line',
  'AI, or artificial intelligence, refers to the simulation of human intelligence processes by machines, typically computer systems.'),
 ('Explain what is Deep Learning?',
  'Deep learning is a subset of machine learning that involves training artificial neural networks to learn and make decisions from data. These neural networks are composed of multiple layers of interconnected nodes, which allow them to learn complex patterns and relationships in the data. Deep learning has been particularly successful in tasks such as image and speech recognition, natural language processing, and playing games. It is a powerful tool for solving problems that involve large amounts of data and can often outperform traditional machine learning algorithms in terms of accuracy and performance.'),
 ('Explain AI in 3 bullet points',
  '- AI, or artificial intelligence, refers to the simulation of human intelligence processes by machines, especially computer systems.\n- It involv

## Create an Evaluation Dataset in LangSmith

Here we will create an evaluation dataset in LangSmith and upload our examples to it.

In [49]:
# Initialize a client
client = Client(timeout_ms=3600000)

# Storing inputs in a dataset lets us
# run chains and LLMs over a shared set of examples.
dataset = client.create_dataset(
    dataset_name='Sample LLM App Eval Dataset - AKS-Test001',
    description="Dataset of sample prompts and human outputs",
)
dataset

Dataset(name='Sample LLM App Eval Dataset - AKS-Test001', description='Dataset of sample prompts and human outputs', data_type=<DataType.kv: 'kv'>, id=UUID('0b6a7d51-dd8e-4f70-93cb-104894a3e1c1'), created_at=datetime.datetime(2025, 5, 30, 14, 36, 41, 744930, tzinfo=datetime.timezone.utc), modified_at=datetime.datetime(2025, 5, 30, 14, 36, 41, 744930, tzinfo=datetime.timezone.utc), example_count=0, session_count=0, last_session_start_time=None)

In [50]:
for input_prompt, output_answer in refined_dataset:
    client.create_example(
        inputs={"question": input_prompt},
        outputs={"answer": output_answer},
        metadata={"source": "Wikipedia"},
        dataset_id=dataset.id,
    )
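
The loop above makes one API call per example. For larger datasets it may be preferable to prepare all inputs and outputs first and send them together. The sketch below only builds the two lists; `to_example_batches` is a hypothetical helper, and the bulk call in the final comment is an assumption about the client that should be checked against the installed langsmith version.

```python
def to_example_batches(pairs):
    """Split (question, answer) pairs into parallel lists of inputs and outputs dicts."""
    inputs = [{"question": question} for question, _ in pairs]
    outputs = [{"answer": answer} for _, answer in pairs]
    return inputs, outputs

inputs, outputs = to_example_batches([
    ("What is the largest mammal?", "The blue whale"),
    ("What's the main characteristic of amphibians?", "They live both in water and on land"),
])
# Assumption: with a live client these lists could be uploaded in one request, e.g.
# client.create_examples(inputs=inputs, outputs=outputs, dataset_id=dataset.id)
```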

In [51]:
datasets = client.list_datasets()
list(datasets)

[Dataset(name='Sample LLM App Eval Dataset - AKS-Test001', description='Dataset of sample prompts and human outputs', data_type=<DataType.kv: 'kv'>, id=UUID('0b6a7d51-dd8e-4f70-93cb-104894a3e1c1'), created_at=datetime.datetime(2025, 5, 30, 14, 36, 41, 744930, tzinfo=datetime.timezone.utc), modified_at=datetime.datetime(2025, 5, 30, 14, 36, 41, 744930, tzinfo=datetime.timezone.utc), example_count=9, session_count=0, last_session_start_time=None)]

## View evaluation dataset examples

You can retrieve the examples in your evaluation dataset from LangSmith at any time.

In [52]:
examples = client.list_examples(dataset_name="Sample LLM App Eval Dataset - AKS-Test001")

In [53]:
for example in examples:
    print(example)

dataset_id=UUID('0b6a7d51-dd8e-4f70-93cb-104894a3e1c1') inputs={'question': "What's the main characteristic of amphibians?"} outputs={'answer': 'They live both in water and on land'} metadata={'source': 'Wikipedia', 'dataset_split': ['base']} id=UUID('48e6482f-5a1d-4d33-918d-28ba34c9cf9f') created_at=datetime.datetime(2025, 5, 30, 14, 36, 52, 866047, tzinfo=datetime.timezone.utc) modified_at=datetime.datetime(2025, 5, 30, 14, 36, 52, 866047, tzinfo=datetime.timezone.utc) runs=[] source_run_id=None
dataset_id=UUID('0b6a7d51-dd8e-4f70-93cb-104894a3e1c1') inputs={'question': 'What do mammals and birds have in common?'} outputs={'answer': 'Both are homeothermic (warm-blooded) animals'} metadata={'source': 'Wikipedia', 'dataset_split': ['base']} id=UUID('047b8988-80bf-43cf-9afd-9e119747da50') created_at=datetime.datetime(2025, 5, 30, 14, 36, 52, 607900, tzinfo=datetime.timezone.utc) modified_at=datetime.datetime(2025, 5, 30, 14, 36, 52, 607900, tzinfo=datetime.timezone.utc) runs=[] source_r

## Evaluate and Monitor LLM App performance

Here we use several of LangSmith's evaluation metrics to test the performance of our LLM app.

In [54]:
from langsmith.evaluation import evaluate

# Initialize a client
client = Client(timeout_ms=3600000)

results = evaluate(
    lambda x: llm_chain.invoke({'human_input' : x['question']}),
    client=client,
    data="Sample LLM App Eval Dataset - AKS-Test001",
    experiment_prefix="test_eval001",
)

View the evaluation results for experiment: 'test_eval001-18041caf' at:
https://smith.langchain.com/o/baa1c525-92bb-4e5c-9122-f20839cde3b8/datasets/0b6a7d51-dd8e-4f70-93cb-104894a3e1c1/compare?selectedSessions=6575b7b7-4f64-4255-a61c-7f4203096007





In [55]:
from langsmith.evaluation import LangChainStringEvaluator, evaluate

qa_evaluator = LangChainStringEvaluator("qa")

# Initialize a client
client = Client(timeout_ms=3600000)

results = evaluate(
    lambda x: llm_chain.invoke({'human_input' : x['question']}),
    client=client,
    data="Sample LLM App Eval Dataset - AKS-Test001",
    experiment_prefix="test_eval001",
    evaluators=[qa_evaluator]
)

View the evaluation results for experiment: 'test_eval001-09786a8a' at:
https://smith.langchain.com/o/baa1c525-92bb-4e5c-9122-f20839cde3b8/datasets/0b6a7d51-dd8e-4f70-93cb-104894a3e1c1/compare?selectedSessions=003477a8-d144-427d-85e9-bb9a3f5c909d





In [59]:
# Alternative (unused): build the embedding-distance evaluator directly with LangChain
# from langchain.evaluation import EmbeddingDistance, load_evaluator
# semantic_evaluator = load_evaluator(
#     "embedding_distance", distance_metric=EmbeddingDistance.COSINE
# )

# "labeled_criteria" compares the output against the reference answer;
# "criteria" judges the output on its own
correct_evaluator = LangChainStringEvaluator("labeled_criteria",
                                             config={"criteria": "correctness"})
conciseness_evaluator = LangChainStringEvaluator("criteria",
                                                 config={"criteria": "conciseness"})
helpfulness_evaluator = LangChainStringEvaluator("criteria",
                                                 config={"criteria": "helpfulness"})
# defined for reference, but not passed to evaluate() below
semantic_evaluator = LangChainStringEvaluator("embedding_distance")

# Initialize a client
client = Client(timeout_ms=3600000)

results = evaluate(
    lambda x: llm_chain.invoke({'human_input' : x['question']}),
    client=client,
    data="Sample LLM App Eval Dataset - AKS-Test001",
    experiment_prefix="test_eval001",
    evaluators=[correct_evaluator, conciseness_evaluator, helpfulness_evaluator]
)

View the evaluation results for experiment: 'test_eval001-3c755b93' at:
https://smith.langchain.com/o/baa1c525-92bb-4e5c-9122-f20839cde3b8/datasets/0b6a7d51-dd8e-4f70-93cb-104894a3e1c1/compare?selectedSessions=2b2471e1-972d-42dd-ad33-21b351539779
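
Besides the built-in evaluators used above, a plain Python function can serve as a custom metric. The sketch below assumes that `evaluate` accepts functions taking a run and an example and returning a dict with `key` and `score`, and that a string returned by the target is stored under the `"output"` key. Both should be checked against the installed langsmith version. The objects at the bottom are stand-ins, not real LangSmith types.

```python
from types import SimpleNamespace

def exact_match(run, example):
    """Score 1 when the prediction equals the reference answer, ignoring case and outer spaces."""
    prediction = str(run.outputs.get("output", "")).strip().lower()
    reference = str(example.outputs.get("answer", "")).strip().lower()
    return {"key": "exact_match", "score": int(prediction == reference)}

# Stand-ins for the run and example objects that evaluate() would pass in.
fake_run = SimpleNamespace(outputs={"output": "The blue whale"})
fake_example = SimpleNamespace(outputs={"answer": "the blue whale "})
print(exact_match(fake_run, fake_example))  # prints: {'key': 'exact_match', 'score': 1}
```

Exact match is a strict metric and will score most free-form answers as 0, so it is best suited to short factual answers such as the human-labeled examples in this dataset.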



