What is LangSmith?
LangSmith is a developer platform from LangChain for building, debugging, testing, and monitoring large language model (LLM) applications. In practice it serves as an observability and evaluation tool built specifically for applications that call language models.
Why LangSmith is Used
1. Debugging and Tracing
LangSmith provides detailed traces of your LLM application's execution, allowing you to see exactly what's happening at each step. This is crucial because LLM applications often involve multiple calls, complex chains, and unpredictable outputs. You can trace the full execution path, including prompts sent, responses received, and any intermediate steps.
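To make the idea of a trace concrete, here is a minimal standard-library sketch of what gets recorded for each step: the step name, its inputs, its output, its nesting depth, and its latency. It is not LangSmith's implementation; the `traced` decorator, the in-memory `TRACE` list, and the stub `retrieve`/`generate`/`pipeline` functions are all hypothetical names used only for illustration.

```python
import functools
import time
from contextvars import ContextVar

# Hypothetical in-memory trace store; a real platform would send records to a backend.
TRACE: list[dict] = []
_depth: ContextVar[int] = ContextVar("depth", default=0)


def traced(fn):
    """Record name, inputs, output (or error), nesting depth, and latency per call."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        depth = _depth.get()
        token = _depth.set(depth + 1)
        record = {"name": fn.__name__, "depth": depth,
                  "inputs": {"args": args, "kwargs": kwargs}}
        TRACE.append(record)  # appended before the call so parents precede children
        start = time.perf_counter()
        try:
            record["output"] = fn(*args, **kwargs)
            return record["output"]
        except Exception as exc:
            record["error"] = repr(exc)
            raise
        finally:
            record["latency_s"] = time.perf_counter() - start
            _depth.reset(token)
    return wrapper


@traced
def retrieve(question: str) -> list[str]:
    return ["stub context about agents"]  # stand-in for a retriever call


@traced
def generate(question: str, context: list[str]) -> str:
    return f"stub answer using {len(context)} document(s)"  # stand-in for an LLM call


@traced
def pipeline(question: str) -> str:
    return generate(question, retrieve(question))


pipeline("What is an LLM agent?")
for r in TRACE:
    print("  " * r["depth"] + f"{r['name']}: {r['latency_s'] * 1000:.2f} ms")
```

The printed tree shows the parent step followed by its indented child steps, which is roughly the shape a trace viewer presents.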
2. Performance Monitoring
It helps track key metrics like latency, token usage, and cost across your application. This visibility is essential for optimizing performance and managing expenses, especially when dealing with API-based LLM services that charge per token.
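As a rough illustration of how these metrics relate, the sketch below derives total cost and latency percentiles from per-call records. The prices, the `CallRecord` type, and the sample numbers are assumptions made up for the example; they are not LangSmith data structures or any provider's actual pricing.

```python
import math
from dataclasses import dataclass

# Hypothetical per-million-token prices in USD; check your provider's current pricing.
PRICE_PER_M_INPUT = 0.15
PRICE_PER_M_OUTPUT = 0.60


@dataclass
class CallRecord:
    latency_s: float
    input_tokens: int
    output_tokens: int


def call_cost(rec: CallRecord) -> float:
    """Cost of one call in USD under the assumed prices."""
    return (rec.input_tokens * PRICE_PER_M_INPUT
            + rec.output_tokens * PRICE_PER_M_OUTPUT) / 1_000_000


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Made-up sample of four LLM calls
records = [
    CallRecord(0.8, 1200, 150),
    CallRecord(1.4, 900, 300),
    CallRecord(0.6, 1500, 100),
    CallRecord(2.9, 2000, 400),
]

total_cost = sum(call_cost(r) for r in records)
latencies = [r.latency_s for r in records]
p50 = percentile(latencies, 50)
p95 = percentile(latencies, 95)
print(f"total cost: ${total_cost:.5f}, p50: {p50}s, p95: {p95}s")
```

Tail percentiles such as p95 are usually more informative than the mean here, because a few slow calls dominate the user experience.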
3. Prompt Engineering and Iteration
LangSmith allows you to experiment with different prompts and compare results side-by-side. You can see which prompts produce better outputs and iterate quickly without rebuilding your entire application.
4. Testing and Evaluation
The platform enables you to create test datasets and run evaluations against them. You can set up automated tests to ensure your LLM application maintains quality as you make changes, helping prevent regressions.
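The regression-testing idea can be sketched without any external service: run a target function over a small dataset, score each output with evaluator functions, and fail when a pass rate drops below a threshold. The evaluator signature `(inputs, outputs, reference_outputs)` mirrors the one used later in this notebook; `run_suite`, `stub_target`, the sample data, and the thresholds are hypothetical.

```python
def exact_match(inputs: dict, outputs: dict, reference_outputs: dict) -> bool:
    """Pass when the answer equals the reference, ignoring case and outer whitespace."""
    return outputs["answer"].strip().lower() == reference_outputs["answer"].strip().lower()


def non_empty(inputs: dict, outputs: dict, reference_outputs: dict) -> bool:
    """Pass when the target produced any answer at all."""
    return bool(outputs["answer"].strip())


def run_suite(target, dataset: list[dict], evaluators: list) -> dict[str, float]:
    """Return the pass rate of each evaluator over the dataset."""
    passes = {ev.__name__: 0 for ev in evaluators}
    for example in dataset:
        outputs = target(example["inputs"])
        for ev in evaluators:
            passes[ev.__name__] += bool(ev(example["inputs"], outputs, example["outputs"]))
    return {name: count / len(dataset) for name, count in passes.items()}


def stub_target(inputs: dict) -> dict:
    """Stand-in for an LLM application; deliberately wrong on one question."""
    answers = {"2 + 2?": "4", "Capital of France?": "Lyon"}
    return {"answer": answers.get(inputs["question"], "")}


dataset = [
    {"inputs": {"question": "2 + 2?"}, "outputs": {"answer": "4"}},
    {"inputs": {"question": "Capital of France?"}, "outputs": {"answer": "Paris"}},
]

scores = run_suite(stub_target, dataset, [exact_match, non_empty])

# Hypothetical quality bar; a CI job could fail the build when any score falls below it.
THRESHOLDS = {"exact_match": 0.9, "non_empty": 1.0}
failures = [name for name, score in scores.items() if score < THRESHOLDS[name]]
print(scores, failures)
```

In a CI setting, a non-empty `failures` list would be the signal to block a change.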
5. Collaboration
Teams can share traces, datasets, and evaluation results, making it easier to collaborate on LLM projects. This is particularly valuable since LLM behavior can be difficult to reproduce or explain without concrete examples.
6. Production Monitoring
Once deployed, LangSmith continues to monitor your application in production, helping you identify issues, track user interactions, and understand real-world performance.

Key Features

+ Visual trace inspection - See the entire flow of your application
+ Dataset management - Create and maintain test datasets
+ Automated evaluations - Run systematic tests on your prompts and chains
+ Analytics dashboard - Track usage patterns and costs
+ Feedback collection - Gather and analyze user feedback on outputs
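For the feedback item in the list above, a minimal sketch of the analysis side might look like the following: group feedback records by key and compute a count and mean score per key. The record fields (`run_id`, `key`, `score`) and the sample values are assumptions for illustration, not an export format guaranteed by LangSmith.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical feedback records: one score per (run, feedback key)
feedback = [
    {"run_id": "run-1", "key": "thumbs_up", "score": 1},
    {"run_id": "run-2", "key": "thumbs_up", "score": 0},
    {"run_id": "run-3", "key": "thumbs_up", "score": 1},
    {"run_id": "run-1", "key": "helpfulness", "score": 0.5},
    {"run_id": "run-2", "key": "helpfulness", "score": 1.0},
]


def summarize(records: list[dict]) -> dict[str, dict]:
    """Group scores by feedback key and report count and mean for each key."""
    by_key = defaultdict(list)
    for rec in records:
        by_key[rec["key"]].append(rec["score"])
    return {key: {"n": len(scores), "mean": mean(scores)} for key, scores in by_key.items()}


summary = summarize(feedback)
for key, stats in summary.items():
    print(f"{key}: n={stats['n']}, mean={stats['mean']:.2f}")
```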

Common Use Cases

+ Debugging complex LangChain applications with multiple agents or tools
+ A/B testing different prompts or model configurations
+ Monitoring production LLM applications for quality and cost
+ Building regression test suites for LLM applications
+ Understanding why a particular LLM output was generated

LangSmith addresses one of the biggest challenges in LLM development: the difficulty of understanding, testing, and maintaining applications that rely on probabilistic, non-deterministic AI models.

In [1]:
from langchain_community.document_loaders import WebBaseLoader
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.vectorstores import InMemoryVectorStore
from dotenv import load_dotenv

In [2]:
load_dotenv(override=True)

# 1) Load and index some docs (demo corpus)
urls = [
    "https://lilianweng.github.io/posts/2023-06-23-agent/",
    "https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/",
    "https://lilianweng.github.io/posts/2023-10-25-adv-attack-llm/",
]
docs = [d for url in urls for d in WebBaseLoader(url).load()]

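# 2) Split into ~250-token chunks (token counts via tiktoken), with no overlap
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(chunk_size=250, chunk_overlap=0)
splits = splitter.split_documents(docs)

# 3) Embed the chunks into an in-memory vector store; retrieve the top 4 chunks per query
vectorstore = InMemoryVectorStore.from_documents(splits, embedding=OpenAIEmbeddings())
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# 4) Chat model used to generate answers (temperature 0 for more repeatable output)
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)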

In [3]:
def rag_bot(question: str) -> dict:
    """Return BOTH answer + retrieved docs for evaluation."""
    documents = retriever.invoke(question)
    context = "\n\n".join(d.page_content for d in documents)
    prompt = f"""Use the context to answer the question.

    CONTEXT:
    {context}
    
    QUESTION:
    {question}
    """
    answer = llm.invoke(prompt).content
    return {"answer": answer, "documents": documents}

In [8]:
from langsmith import Client

client = Client()

dataset_name = "my-rag-eval-dataset"

# Create the dataset if it does not exist yet (filter by name on the server
# instead of listing every dataset in the workspace)
existing = list(client.list_datasets(dataset_name=dataset_name))
if not existing:
    dataset = client.create_dataset(dataset_name, description="RAG eval dataset (Q/A)")
else:
    dataset = existing[0]

examples = [
    {
        "inputs": {"question": "What is an LLM agent (high level)?" },
        "outputs": {"answer": "An LLM agent is a system that uses an LLM to decide actions (e.g., tool use) to accomplish tasks, often iterating based on observations."}
    },
    {
        "inputs": {"question": "What is prompt injection and why is it risky?" },
        "outputs": {"answer": "Prompt injection is an attack where instructions are inserted into inputs/context to override system behavior, potentially causing data leakage or unsafe actions."}
    },
]

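# Add examples only when the dataset is empty, so re-running this cell
# does not create duplicate examples
if not list(client.list_examples(dataset_id=dataset.id, limit=1)):
    for ex in examples:
        client.create_example(
            inputs=ex["inputs"],
            outputs=ex["outputs"],
            dataset_id=dataset.id,
        )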


In [10]:
from pydantic import BaseModel, Field

judge_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

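class BoolGrade(BaseModel):
    """Structured verdict returned by the judge model."""
    ok: bool = Field(description="true if criterion is satisfied, else false")

# Build the structured-output grader once and reuse it across evaluator calls
grader = judge_llm.with_structured_output(BoolGrade, method="json_schema", strict=True)

def _judge(system_prompt: str, user_prompt: str) -> bool:
    """Ask the judge model for a pass/fail verdict on a single criterion."""
    result = grader.invoke([
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ])
    return result.ok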

def correctness(inputs: dict, outputs: dict, reference_outputs: dict) -> bool:
    return _judge(
        "You are grading correctness vs a reference answer. Reply ok=true only if meaning matches.",
        f"QUESTION: {inputs['question']}\nMODEL: {outputs['answer']}\nREFERENCE: {reference_outputs['answer']}"
    )

def relevance(inputs: dict, outputs: dict) -> bool:
    return _judge(
        "You are grading whether the answer is relevant to the question. Reply ok=true if it addresses the question.",
        f"QUESTION: {inputs['question']}\nANSWER: {outputs['answer']}"
    )

def groundedness(inputs: dict, outputs: dict) -> bool:
    docs = "\n\n".join(d.page_content for d in outputs["documents"])
    return _judge(
        "You are grading groundedness. Reply ok=true only if the answer is supported by the provided facts, with no major unsupported claims.",
        f"FACTS:\n{docs}\n\nANSWER:\n{outputs['answer']}"
    )

def retrieval_relevance(inputs: dict, outputs: dict) -> bool:
    docs = "\n\n".join(d.page_content for d in outputs["documents"])
    return _judge(
        "You are grading if retrieved docs are relevant to the question. Reply ok=true if docs contain info that would help answer.",
        f"QUESTION: {inputs['question']}\nDOCS:\n{docs}"
    )


In [12]:
def target(inputs: dict) -> dict:
    return rag_bot(inputs["question"])

experiment_results = client.evaluate(
    target,
    data=dataset_name,
    evaluators=[correctness, groundedness, relevance, retrieval_relevance],
    experiment_prefix="rag-eval",
    metadata={"version": "demo-rag + gpt-4o-mini"},
)

# If you have pandas installed:
# df = experiment_results.to_pandas()
# print(df.head())


View the evaluation results for experiment: 'rag-eval-dbabfddc' at:
https://smith.langchain.com/o/819e6a1f-0e45-4de8-b5a2-4bb1d2ff5c50/datasets/e2f2c479-4edc-4c5a-9e75-b0d9fec4f514/compare?selectedSessions=faf2cc73-321b-43ab-9125-22656385c5d2