# LangSmith - The LangChain Eval Framework
## Introduction 

- All the major LLM frameworks and model providers are working on their own eval services.
- Here's an example from LangChain, called LangSmith: <https://smith.langchain.com/>.
- To run this you need to sign up and get an API key. Add it to your secrets as `LANGCHAIN_API_KEY`.
- It has a nice UI that makes it easy to share results with co-workers.

- FYI: Langfuse is the open-source equivalent of LangSmith: <https://langfuse.com/>.

## Installation

In [1]:
%pip -q install langchain langchain-openai

Note: you may need to restart the kernel to use updated packages.


## Connecting to Langsmith
We connect using the LangSmith client. Before creating our dataset, we delete any previously created dataset with the same name.

In [None]:
from langsmith import Client as LangSmithClient

# A connection to the LangSmith service
client = LangSmithClient()
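
The client above is created without arguments, so it depends on the API key configured earlier. As a minimal sketch, assuming the secret is exposed to the notebook as the environment variable `LANGCHAIN_API_KEY`, a pre-flight check can fail early with a readable message. `require_env` is a hypothetical helper, not part of LangSmith.

```python
import os


def require_env(name: str) -> str:
    """Return a non-empty environment variable or raise a descriptive error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your secrets first.")
    return value


# Report whether the key is visible, without printing its value.
print("LANGCHAIN_API_KEY set:", bool(os.environ.get("LANGCHAIN_API_KEY", "").strip()))
```

Calling `require_env("LANGCHAIN_API_KEY")` right before creating the client turns a missing key into an immediate, explicit error.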


## Preparing the dataset
Similar to our translation example, we will set up a dataset of things we would like to evaluate. In this case the theme is *rap battles*: we ask the LLM to generate them.
We will evaluate the results against various criteria.



In [2]:
# Preparing our dataset
DATASET_NAME = 'DevOps Rap Battle Dataset'
datasets = client.list_datasets()

# Remove the dataset if it already exists
for dataset in datasets:
    if dataset.name == DATASET_NAME:
        client.delete_dataset(dataset_name=DATASET_NAME)
        print(dataset)

name='DevOps Rap Battle Dataset' description='Rap battle prompts.' data_type=<DataType.kv: 'kv'> id=UUID('0ed1e275-0a77-452a-b4d0-4864cdfa25b3') created_at=datetime.datetime(2024, 9, 7, 7, 48, 15, 450607, tzinfo=datetime.timezone.utc) modified_at=datetime.datetime(2024, 9, 7, 7, 48, 15, 450607, tzinfo=datetime.timezone.utc) example_count=2 session_count=3 last_session_start_time=datetime.datetime(2024, 9, 7, 7, 56, 8, 681812) inputs_schema=None outputs_schema=None
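
The cleanup loop can also be wrapped in a small reusable function. The sketch below uses only the two client calls already shown (`list_datasets` and `delete_dataset(dataset_name=...)`). `delete_dataset_if_exists` and `FakeClient` are hypothetical names; the fake client exists only so the sketch runs offline and is not how LangSmith stores datasets.

```python
from types import SimpleNamespace


def delete_dataset_if_exists(client, name: str) -> bool:
    """Delete the dataset called `name`; return True if one was deleted."""
    for dataset in client.list_datasets():
        if dataset.name == name:
            client.delete_dataset(dataset_name=name)
            return True  # stop after the first match
    return False


class FakeClient:
    """Hypothetical in-memory stand-in so the sketch runs without a network."""

    def __init__(self, names):
        self._datasets = [SimpleNamespace(name=n) for n in names]

    def list_datasets(self):
        return iter(list(self._datasets))

    def delete_dataset(self, dataset_name):
        self._datasets = [d for d in self._datasets if d.name != dataset_name]


fake = FakeClient(["DevOps Rap Battle Dataset", "Other"])
print(delete_dataset_if_exists(fake, "DevOps Rap Battle Dataset"))
print(delete_dataset_if_exists(fake, "DevOps Rap Battle Dataset"))
```

The first call deletes the dataset and the second finds nothing to delete, so rerunning the notebook is safe either way.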


In [3]:
# Inputs are provided to your model, so it knows what to generate
dataset_inputs = [
    "a rap battle between Chef and Puppet",
    "a rap battle between Ansible and Pulumi",
    # ... add more as desired
]

In [4]:
# We create the dataset
dataset = client.create_dataset(
    dataset_name=DATASET_NAME,
    description="Rap battle prompts.",
)

# And add the examples to it
client.create_examples(
    inputs=[{"question": q} for q in dataset_inputs],
    # outputs=dataset_outputs,  # optional reference outputs, not used here
    dataset_id=dataset.id,
)
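
Each example is a dictionary whose `question` key matches the `{question}` placeholder in the prompt defined later in this notebook. As a rough sketch of how the `inputs` list (and an optional `outputs` list) could be assembled, `build_examples` below is a hypothetical helper and the `answer` key is an assumed name, since this notebook uses no reference outputs.

```python
def build_examples(questions, answers=None):
    """Return (inputs, outputs) lists shaped like the create_examples arguments."""
    inputs = [{"question": q} for q in questions]
    if answers is None:
        return inputs, None  # no reference outputs, as in this notebook
    if len(answers) != len(questions):
        raise ValueError("each question needs exactly one reference answer")
    # "answer" is an assumed key name; use whatever your evaluators expect.
    outputs = [{"answer": a} for a in answers]
    return inputs, outputs


inputs, outputs = build_examples(
    ["a rap battle between Chef and Puppet", "a rap battle between Ansible and Pulumi"]
)
print(inputs[0])
print(outputs)
```

Checking the lengths up front avoids silently pairing a prompt with the wrong reference answer.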

## Preparing the chain to evaluate

In [5]:
from langchain import prompts
from langchain.schema import output_parser
from langchain_openai import ChatOpenAI

# Define your runnable or chain below.
prompt = prompts.ChatPromptTemplate.from_messages(
    [("system", "You are a helpful AI assistant."), ("human", "{question}")]
)

# The model that generates the rap battles
chat = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# Prompt -> model -> plain string output
chain = prompt | chat | output_parser.StrOutputParser()

# Quick smoke test before running on the whole dataset
answer = chain.invoke({"question": "A rap battle between Patrick and John"})
print(answer)

**Rap Battle: Patrick vs. John**

**Patrick:**
Yo, it’s Patrick, the lyrical master,  
Step back, John, I’m coming in faster.  
I’m the king of the game, you’re just a pawn,  
When I drop these bars, you know you’re gone.  

I’m like a sponge, soaking up the beat,  
You’re just a shadow, can’t handle the heat.  
I’m the star of the show, you’re just a sidekick,  
In this rap battle, I’m the one who’s slick.  

You think you can step to me? That’s a joke,  
I’ll leave you in the dust, like a puff of smoke.  
I’m the real deal, you’re just a wannabe,  
When I’m done with you, you’ll wish you’d never seen me.  

**John:**
Hold up, Patrick, you think you’re so fly?  
But I’m the one who’s gonna make you cry.  
You’re all talk, but where’s the substance?  
I’ll break you down with my lyrical abundance.  

I’m John, the one who’s got the flow,  
You’re just a sidekick in a cartoon show.  
I’m spitting fire, while you’re just a flame,  
In this rap battle, I’m changing the game.  

You say yo

## Setting up evals criteria

Now we define the evals:
- exact checks: for example, is the output not empty?
- quality judged by a model: is it relevant?
- or judged by an LLM: is it creative, imaginative, or novel?
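
The exact kind of check needs no model at all. As a minimal sketch in plain Python, independent of LangSmith, the emptiness test boils down to stripping whitespace and testing what is left. `is_blank` is a hypothetical helper name and the sample strings are made up for illustration.

```python
def is_blank(text: str) -> bool:
    """True when the text has no non-whitespace characters."""
    return not text.strip()


for sample in ["", "   \n", "Yo, it's Chef on the mic"]:
    print(repr(sample), "->", is_blank(sample))
```

A check phrased this way returns `True` for the bad case, so when reading aggregated results a lower average is better.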

In [6]:
from langsmith.evaluation import EvaluationResult, run_evaluator
from langsmith.schemas import Example, Run

# A custom evaluator that checks if the output is empty (score is True when it is blank)
@run_evaluator
def is_empty(run: Run, example: Example | None = None):
    model_outputs = run.outputs["output"]
    score = not model_outputs.strip()
    return EvaluationResult(key="is_empty", score=score)

from langchain.smith import RunEvalConfig, run_on_dataset

# We reuse the same LLM to judge our criteria (LLM-as-a-judge)
verifier_llm = chat

# Define the evaluators to apply
eval_config = RunEvalConfig(
    evaluators=[
        # You can define an arbitrary criterion as a key: value pair in the criteria dict
        RunEvalConfig.Criteria(
            {"creativity": "Is this submission creative, imaginative, or novel?"}
        ),
        # We provide some simple default criteria like "conciseness" you can use as well
        RunEvalConfig.Criteria("conciseness"),
        # "cot_qa",
        #        smith.RunEvalConfig.LabeledCriteria("conciseness"),
        #        smith.RunEvalConfig.LabeledCriteria("relevance"),
        #        smith.RunEvalConfig.LabeledCriteria("coherence"),
        #        smith.RunEvalConfig.LabeledCriteria("harmfulness"),
        #        smith.RunEvalConfig.LabeledCriteria("insensitivity"),
        #        smith.RunEvalConfig.LabeledCriteria("criminality"),
        #        smith.RunEvalConfig.LabeledCriteria("misogyny"),
        #        smith.RunEvalConfig.LabeledCriteria("controversiality"),
        #        smith.RunEvalConfig.LabeledCriteria("helpfulness"),
        #        smith.RunEvalConfig.LabeledCriteria("maliciousness"),
    ],
    custom_evaluators=[is_empty],
    eval_llm=verifier_llm,
)

## Running the evals

Now we run the chain across the whole dataset and get the results back.

In [7]:
from pprint import pprint

chain_results = client.run_on_dataset(
    dataset_name=DATASET_NAME,
    llm_or_chain_factory=chain,
    evaluation=eval_config,
    concurrency_level=5,
    verbose=True,
)
pprint(chain_results['results'])

View the evaluation results for project 'virtual-disease-8' at:
https://smith.langchain.com/o/27713559-0b06-4e07-89cf-5339d432c5c3/datasets/677036c6-d368-4ad9-9025-c7754ad8447b/compare?selectedSessions=48d326b2-7ac5-455b-b7aa-ceaaedc0fb45

View all tests for Dataset DevOps Rap Battle Dataset at:
https://smith.langchain.com/o/27713559-0b06-4e07-89cf-5339d432c5c3/datasets/677036c6-d368-4ad9-9025-c7754ad8447b
[------------------------------------------------->] 2/2
{'4faa0632-7b21-48cb-b390-d4e9cb3b5549': {'execution_time': 6.187732,
                                          'feedback': [EvaluationResult(key='creativity', score=1, value='Y', comment='To assess whether the submission meets the creativity criterion, I will analyze the content step by step:\n\n1. **Concept of the Rap Battle**: The idea of a rap battle between a Chef and a Puppet is inherently creative. It combines two distinct characters from different realms (culinary and entertainment) and pits them against each other in a 