In LangSmith, a **Run** represents a single execution of your LLM application (chain, agent, or model) with a specific input. It captures not only the final output but also the intermediate steps involved in generating that output. This detailed trace allows you to understand how your application arrived at its answer.

An **Example** in LangSmith is a unit of data used to evaluate your LLM application. It consists of:

* **inputs**: The data provided to your application.
* **outputs** (optional): The expected or ideal output for the given input. Used by evaluators for comparison.
* **metadata** (optional): Additional information about the example, useful for analysis or filtering.
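
The three fields above can be pictured as a small record. The following is a minimal sketch using a hypothetical `SketchExample` dataclass. It is an illustration only, not LangSmith's own `Example` class, which also carries identifiers and other bookkeeping fields.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class SketchExample:
    """Hypothetical stand-in for an example record (not the LangSmith class)."""
    inputs: Dict[str, Any]                      # data passed to the application
    outputs: Optional[Dict[str, Any]] = None    # expected output, if known
    metadata: Dict[str, Any] = field(default_factory=dict)  # free-form tags


# A labeled example: evaluators can compare against `outputs`
labeled = SketchExample(
    inputs={"article": "The quick brown fox jumps over the lazy dog."},
    outputs={"summary": "A fox jumped over a dog."},
    metadata={"source": "toy"},
)

# An unlabeled example: only reference-free evaluators apply
unlabeled = SketchExample(inputs={"article": "Another article."})
```

Leaving `outputs` unset is valid, but evaluators that rely on a reference answer then have nothing to compare against.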

**How Runs and Examples Work Together in Evaluation**

When you run an evaluation in LangSmith, you execute your LLM application on a set of examples (usually organized into a dataset). Each execution of your application on a single example generates a Run. Evaluators then analyze these runs, comparing the actual outputs with the expected outputs (if provided) and, where useful, inspecting the intermediate steps, to assess the performance of your application.
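
To make the flow concrete, here is a rough offline sketch of that loop in plain Python. It does not use LangSmith at all: `run_evaluation`, `shout`, and `exact_match` are hypothetical names, and runs and examples are modeled as plain dicts purely for illustration.

```python
def run_evaluation(target, examples, evaluators):
    """Call `target` once per example, then score each resulting run."""
    results = []
    for example in examples:
        # One execution of the application on one example is one "run"
        run = {"inputs": example["inputs"], "outputs": target(example["inputs"])}
        # Every evaluator sees both the run and the example it came from
        feedback = [evaluator(run, example) for evaluator in evaluators]
        results.append({"run": run, "example": example, "feedback": feedback})
    return results


def shout(inputs):
    """Toy application: upper-cases its input text."""
    return {"output": inputs["text"].upper()}


def exact_match(run, example):
    """Toy evaluator: compares actual and expected outputs when a reference exists."""
    expected = example.get("outputs")
    if expected is None:
        return {"key": "exact_match", "score": None}
    return {"key": "exact_match", "score": run["outputs"] == expected}


examples = [
    {"inputs": {"text": "hi"}, "outputs": {"output": "HI"}},
    {"inputs": {"text": "bye"}, "outputs": {"output": "Bye"}},
]

results = run_evaluation(shout, examples, [exact_match])
scores = [r["feedback"][0]["score"] for r in results]
print(scores)
```

The real `evaluate` function additionally records each run and its feedback in LangSmith, which this sketch leaves out.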

**Example:**

Let's say you have an LLM application that summarizes news articles.

* **Example:**
    * **inputs**: { "article": "The quick brown fox jumps over the lazy dog." }
    * **outputs**: { "summary": "A fox jumped over a dog." } 

* **Run:** When you execute your application with this example, a Run is created. It might contain:
    * The original article
    * Intermediate steps: perhaps the key phrases extracted, or the draft summaries generated along the way
    * The final summary produced by your application (e.g., "A quick brown fox jumped over a lazy dog.")

* **Evaluator:** An evaluator might compare the generated summary in the Run with the expected summary in the Example and compute a similarity score, such as a ROUGE score.
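
As a rough illustration of such a similarity score, the sketch below computes a simplified unigram recall, loosely in the spirit of ROUGE-1 recall. It is an assumption-laden toy, not the official ROUGE implementation: the tokenizer is a simple regex, there is no stemming, and the helper names `tokenize` and `unigram_recall` are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def unigram_recall(prediction, reference):
    """Fraction of reference tokens that also appear in the prediction."""
    pred_counts = Counter(tokenize(prediction))
    ref_counts = Counter(tokenize(reference))
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0
    # Clip each token's credit at the number of times the prediction contains it
    overlap = sum(min(count, pred_counts[token]) for token, count in ref_counts.items())
    return overlap / total


reference = "A fox jumped over a dog."
prediction = "A quick brown fox jumped over a lazy dog."

score = unigram_recall(prediction, reference)
print(score)
```

Because recall only asks whether the reference words are covered, a longer prediction is not penalized here. A fuller metric would also take precision into account.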

**Key Takeaways**

* Runs provide detailed execution information, enabling in-depth analysis of your application's behavior.
* Examples serve as test cases for your application, allowing you to systematically assess its performance.
* Evaluators leverage the information from Runs and Examples to provide quantitative and qualitative feedback on your application's capabilities.


In [1]:
import os
from dotenv import load_dotenv
load_dotenv()

True

In [3]:
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
LANGCHAIN_API_KEY = os.getenv('LANGCHAIN_API_KEY')
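# Tracing settings are read from environment variables, so set them in os.environ
# (assigning plain Python variables would have no effect on tracing)
os.environ['LANGCHAIN_TRACING_V2'] = 'true'
LANGCHAIN_PROJECT = os.getenv('LANGCHAIN_PROJECT')
os.environ['LANGCHAIN_ENDPOINT'] = "https://api.smith.langchain.com"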

In [4]:
# Confirm the key was loaded without echoing the secret itself
LANGCHAIN_API_KEY is not None

True

In [5]:
from langsmith import Client
from langsmith.schemas import Run, Example
from langsmith.evaluation import evaluate
import openai
from langsmith.wrappers import wrap_openai

In [6]:
client = Client(api_key=LANGCHAIN_API_KEY)

In [7]:
# Define the dataset.
# Each entry in `inputs` is paired by position with the entry in `outputs`:
# the first question goes with the first `must_mention` list, and so on.


dataset_name = "practice-dataset"
dataset =  client.create_dataset(dataset_name, description = "Quick start evaluation test")
client.create_examples(
    inputs = [
        {'question': 'a rap battle between atticus finch and cicero'},
        {'question': 'a rap battle between barbie and oppenheimer'}
    ],
    outputs = [
        {'must_mention': ['lawyer','justice']},
        {'must_mention': ['plastic','nuclear']}
    ],
    dataset_id = dataset.id
)


In [9]:
openai_client = wrap_openai(openai.Client(api_key=OPENAI_API_KEY))

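def predict(inputs: dict) -> dict:
  # Send the question as a single user message
  messages = [{'role': 'user', 'content': inputs['question']}]
  response = openai_client.chat.completions.create(messages=messages, model="gpt-3.5-turbo")
  # Return only the generated text so evaluators can run string checks on it
  return {'output': response.choices[0].message.content}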

In [10]:
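def must_mention(run: Run, example: Example) -> dict:
  # Text produced by the application for this run
  prediction = run.outputs.get('output') or ""
  # Phrases the dataset says the answer has to contain
  required = example.outputs.get('must_mention') or []
  # Passes only if every required phrase appears in the prediction
  score = all(phrase in prediction for phrase in required)

  return {'key': 'must_mention', 'score': score}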
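
An evaluator like this can be sanity-checked offline before running a full experiment. The sketch below is a self-contained variant under stated assumptions: `must_mention_check` is a hypothetical name, `SimpleNamespace` objects stand in for real `Run` and `Example` objects (only the `.outputs` attribute is used), and the matching is made case-insensitive as a design choice.

```python
from types import SimpleNamespace


def must_mention_check(run, example):
    """Score True only if every required phrase appears in the prediction."""
    prediction = (run.outputs or {}).get("output") or ""
    required = (example.outputs or {}).get("must_mention") or []
    # Compare in lower case so "Lawyer" still satisfies "lawyer"
    score = all(phrase.lower() in prediction.lower() for phrase in required)
    return {"key": "must_mention", "score": score}


# Stand-ins exposing only the attribute the evaluator reads
fake_example = SimpleNamespace(outputs={"must_mention": ["lawyer", "justice"]})
passing_run = SimpleNamespace(outputs={"output": "As a Lawyer, I rap for justice."})
failing_run = SimpleNamespace(outputs={"output": "I only rap about plastic."})

print(must_mention_check(passing_run, fake_example))
print(must_mention_check(failing_run, fake_example))

# With no required phrases, all() over an empty list is True,
# so a misspelled key in the lookup would make every run pass silently.
empty_example = SimpleNamespace(outputs={})
print(must_mention_check(failing_run, empty_example))
```

The last case is worth keeping in mind: if every run in an experiment scores a pass, check that the key used in the lookup matches the key stored in the dataset.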

In [11]:
experiment_results = evaluate(
    predict, # the target: the AI system or LLM function to evaluate
    data = dataset_name, # the dataset to predict and grade over
    evaluators = [must_mention], # the evaluators that score the results
    experiment_prefix = "rap-generator", # a prefix for experiment names, to identify them easily
    metadata ={
        'version':'1.0.0'
    },
)

View the evaluation results for experiment: 'rap-generator-43feb476' at:
https://smith.langchain.com/o/0c9ea6bd-99e6-4c8a-a3a7-1abffbccfcd3/datasets/b3adaca8-1711-41f4-a6d5-fb1d6c77a11f/compare?selectedSessions=faa9179b-ceed-4590-9bf1-abe242cd3708




2it [00:04,  2.39s/it]
