# LLM Application Evaluations

**Problem:** LLM applications are very new, and there are limited resources for evaluating their performance. LLM output varies from run to run, so it needs custom evaluations, yet many teams skip this step or do their actual testing in production with user feedback. This [doesn't always work out...](https://twitter.com/ChrisJBakke/status/1736533308849443121)

**Solution:** We will go over ways to use LangChain's [LangSmith](https://docs.smith.langchain.com/), a separate platform for tracing, testing, and evaluating LLM applications.

### LangChain's phenomenal summary of the LLM evaluation landscape:

![x](evals_graph.png)

---
### Topics Covered

#### Adding a Dataset to LangSmith
- Adding the HuggingFace GO_Emotions Dataset to LangSmith
#### Creating and Running Custom Evaluator to Compare LLM Outputs on Classification Task
- Creating a custom evaluator to score and compare the output of different LLMs on an emotion classification task
#### Using an LLM as an Evaluator
- Using an LLM-as-judge flow to evaluate an LLM's question answering on a dataset built from a blog post
- Evaluating and comparing how well OpenAI's GPT-4o and Mistral 7b answer the questions
#### Overview of Built-In Evaluators with LangChainStringEvaluator
- Chain of Thought QA for contextual accuracy for both GPT-4o and Mistral 7b, then comparing the two
- Using Built-In Criteria, Helpfulness
#### LLM as an Evaluator with Custom Criteria
- Unlabeled objectivity, having an LLM evaluate output without a grounded reference
- Labeled objectivity, having an LLM evaluate output WITH a grounded reference
  - Both as range scores and binary scores
#### Evaluating Existing Evaluations (Summary Evaluation)
- Running evaluations on entire experiments, not just each example
- Pass test/fail test for overall evaluation score example with Mistral 7b
#### Pairwise Evaluations (Comparing Experiments Against Each Other)
- Evaluation for comparing the outputs of two experiments
- Using an LLM as a judge to state a preference for one output over the other
#### Unit Tests
- Attaching decorators to pytest tests for evaluation in LangSmith
- Both plain assertions and custom tests such as embedding distance, edit distance, and contains checks, which work better for LLM output
#### Evaluating Specific Parts of Existing Workflows
- How to plug into specific parts of an overall LLM application workflow and run custom evaluations on the different steps
- Adding on a few evaluations to my llama3 web research agent to evaluate document retrieval relevancy, and hallucination measurement
---

In [4]:
import os
# os.environ['LANGCHAIN_API_KEY'] = ''
os.environ['LANGCHAIN_TRACING_V2'] = 'true'
os.environ['LANGCHAIN_PROJECT'] = 'Eval Testing 1'

# DataSets

### Importing a HuggingFace Dataset into LangSmith

I'll be using a dataset that I have previously used for fine-tuning:

https://huggingface.co/datasets/go_emotions

In [5]:
# Dataset Import

# Importing with huggingface datasets package
import pandas as pd
from datasets import load_dataset

df = load_dataset('go_emotions')

# creating an emotion index label dictionary
label_index = {
    "0": "admiration",
    "1": "amusement",
    "2": "anger",
    "3": "annoyance",
    "4": "approval",
    "5": "caring",
    "6": "confusion",
    "7": "curiosity",
    "8": "desire",
    "9": "disappointment",
    "10": "disapproval",
    "11": "disgust",
    "12": "embarassment",
    "13": "excitement",
    "14": "fear",
    "15": "gratitude",
    "16": "grief",
    "17": "joy",
    "18": "love",
    "19": "nervousness",
    "20": "optimism",
    "21": "pride",
    "22": "realization",
    "23": "relief",
    "24": "remorse",
    "25": "sadness",
    "26": "surprise",
    "27": "neutral"
}

# Pull 21 consecutive comments (indices 1001-1021) and their emotion labels
data = []
for i in range(1001, 1022):
    comment = df['train'][i]['text']
    label_indices = df['train'][i]['labels']

    # deal with labels
    if not isinstance(label_indices, list):
        label_indices = [label_indices]

    # label mapping
    emotions = ', '.join([label_index.get(str(label)) for label in label_indices])

    data.append((comment, emotions))

comments_df = pd.DataFrame(data, columns=["comment", "emotion label"])

comments_df.head()

|   | comment | emotion label |
|---|---------|---------------|
| 0 | Omg i hope this is about [NAME]. I would LOVE ... | optimism |
| 1 | Finale | neutral |
| 2 | Which suggests nothing in itself. The same mod... | anger, annoyance |
| 3 | I double dog dare him. | neutral |
| 4 | Believe you me. TLJ is much, much worse. | disappointment, disgust |
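The label-mapping step is easy to check on its own. Here is a minimal sketch using a three-entry subset of the label dictionary and two hand-written rows standing in for `df['train']`. The helper name `labels_to_emotions` and the sample rows are made up for illustration and are not part of the notebook.

```python
# Illustrative subset of the label dictionary used above.
label_subset = {"2": "anger", "3": "annoyance", "27": "neutral"}


def labels_to_emotions(label_indices, mapping):
    """Turn one label index, or a list of them, into a comma-separated string."""
    # A single integer label is wrapped so the join below always sees a list.
    if not isinstance(label_indices, list):
        label_indices = [label_indices]
    # Unknown indices stay visible instead of turning into None.
    return ", ".join(mapping.get(str(i), f"unknown({i})") for i in label_indices)


# Hypothetical rows standing in for df['train'][i].
rows = [
    {"text": "first comment", "labels": [2, 3]},
    {"text": "second comment", "labels": 27},
]

pairs = [(row["text"], labels_to_emotions(row["labels"], label_subset)) for row in rows]
print(pairs)
```

One design note: the notebook cell calls `label_index.get(str(label))` with no default, so an index missing from the dictionary would return `None` and make `', '.join(...)` raise a `TypeError`. The explicit fallback in the sketch avoids that.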


In [6]:
# Putting dataset into langsmith
from langsmith import Client

client = Client()
dataset_name = "go_emotions"

# Store
dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="Social Media Comment and Emotion from HuggingFace Go Emotions"
)
client.create_examples(
    inputs=[{"comment": q} for q in comments_df['comment']],
    outputs=[{"emotion": a} for a in comments_df['emotion label']],
    dataset_id=dataset.id
)

---
# 1: Comparing Models with Custom Evaluation on LangChain Chain

Emotion classification task with GPT-4o, GPT-3.5-Turbo, and a fine-tuned GPT-3.5-Turbo

### Setting up First "LLM App"

Emotion classification chain: take in a social media comment and apply one or more of the 27 emotion labels, or neutral, to it.

We will set up this chain with three models: base GPT-4o, base GPT-3.5-Turbo, and GPT-3.5-Turbo fine-tuned on the GoEmotions dataset.

In [7]:
# Setting Up Chain
from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

emotion_analysis_template = """
You are a cutting edge emotion analysis classification assistant. \
You analyze a comment, and apply one or more emotion labels to it. \

The emotion labels are detailed here: \

['admiration', 'amusement', 'anger', 'annoyance', 'approval', 'caring', 'confusion', 'curiosity', 'desire', 'disappointment', 'disapproval', 'disgust', 'embarassment', 'excitement', 'fear', 'gratitude', 'grief', 'joy', 'love', 'nervousness', 'optimism', 'pride', 'realization', 'relief', 'remorse', 'sadness', 'surprise', 'neutral']

Your output should be just the respective emotion label; if there are multiple, separate them with commas. \

The comment is here: {comment}
"""

output_parser = StrOutputParser()

# different models to plug in (plus the fine tuned one!)
gpt4o_llm = ChatOpenAI(temperature=0.0, model="gpt-4o")
ft_llm = ChatOpenAI(temperature=0.0, model="ft:gpt-3.5-turbo-0125:personal:go-emotions:95jDha5f")
gpt35t_llm = ChatOpenAI(temperature=0.0, model="gpt-3.5-turbo-0125")

emotion_analysis_prompt = ChatPromptTemplate.from_template(emotion_analysis_template)

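# evaluate() passes each example's inputs as a dict ({"comment": "..."}), which is
# exactly what the prompt template expects, so no input-mapping step is needed.
# Wrapping the input in {"comment": RunnablePassthrough()} would insert the whole
# dict, braces and all, into the prompt instead of just the comment text.
analysis_chain_gpt35t = emotion_analysis_prompt | gpt35t_llm | output_parser

analysis_chain_gpt4o = emotion_analysis_prompt | gpt4o_llm | output_parser

analysis_chain_ft = emotion_analysis_prompt | ft_llm | output_parser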

### Defining a custom evaluator

Currently we have two pieces of data:
1. The dataset social media comment
2. The dataset assigned emotion label(s)

We want to evaluate each model's output for (1) the dataset's social media comment against (2) the dataset's assigned emotion label(s).

The function below assigns an `is_same` score of 1 for an exact match, 0.5 if the LLM output shares at least one label with the expected labels, and 0 if there is no overlap. The result is returned as a dictionary with a `key` and a `score`.

To set this up, we have to specify the `Run` and `Example`. `Run` is the LLM "run" being evaluated, whereas `Example` is what's in the dataset.

In [8]:
from langsmith.schemas import Run, Example
from langsmith.evaluation import evaluate

def expected_eval(run: Run, example: Example) -> dict:
    # Parse both sides into sets of labels, ignoring case and stray whitespace
    expected_text = (example.outputs or {}).get("emotion", "")
    response_text = (run.outputs or {}).get("output", "")
    expected_answer = {e.strip().lower() for e in expected_text.split(",") if e.strip()}
    response = {r.strip().lower() for r in response_text.split(",") if r.strip()}

    # Check if response matches the expected answer exactly
    if response == expected_answer:
        return {"key": "is_same", "score": 1}
    # Check if there is any overlap (partial match)
    elif response & expected_answer:
        return {"key": "is_same", "score": 0.5}
    # No overlap at all
    else:
        return {"key": "is_same", "score": 0}
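To sanity-check the 1 / 0.5 / 0 rule without touching LangSmith, the same logic can be applied to plain strings. This is a minimal sketch: `score_is_same` is a hypothetical helper, and the whitespace stripping is an addition in this sketch rather than something the dataset requires.

```python
def score_is_same(predicted: str, expected: str) -> float:
    """Apply the exact / partial / no-overlap rule to two comma-separated label strings."""
    # Split on commas and strip whitespace so "anger,annoyance" and
    # "anger, annoyance" are treated the same way.
    predicted_set = {label.strip() for label in predicted.split(",") if label.strip()}
    expected_set = {label.strip() for label in expected.split(",") if label.strip()}

    if predicted_set == expected_set:
        return 1.0  # same labels, order does not matter
    if predicted_set & expected_set:
        return 0.5  # at least one label in common
    return 0.0      # nothing in common


print(score_is_same("annoyance, anger", "anger, annoyance"))
print(score_is_same("anger", "anger, annoyance"))
print(score_is_same("joy", "neutral"))
```

Because both sides are compared as sets, label order never affects the score. A model that returns one of two expected labels gets the same 0.5 as one that returns both expected labels plus an extra one.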



### Using evaluate() to run your evaluations

`evaluate()` needs a few arguments: the function (or in this case the chain) to evaluate, the dataset to run it against, the evaluator(s) as a list (several can run at once, hence the list), an experiment prefix for identification, and optionally any metadata as a dictionary.
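Conceptually, the flow is: call the target on each example's inputs, then hand the result and the reference to every evaluator. The toy loop below is only a mental model under that assumption. It is not LangSmith's implementation, `toy_evaluate` is a made-up name, and real evaluators receive `Run` and `Example` objects rather than plain dicts. The third example row is hypothetical.

```python
def toy_evaluate(target, examples, evaluators):
    """Run target on each example's inputs and score the output with each evaluator."""
    results = []
    for example in examples:
        outputs = target(example["inputs"])
        scores = [evaluator(outputs, example["outputs"]) for evaluator in evaluators]
        results.append({"inputs": example["inputs"], "outputs": outputs, "scores": scores})
    return results


def always_neutral(inputs: dict) -> dict:
    """Stand-in 'app' that labels every comment as neutral."""
    return {"output": "neutral"}


def exact_match(outputs: dict, reference: dict) -> dict:
    """Stand-in evaluator: 1 if the output equals the reference label, else 0."""
    return {"key": "exact_match", "score": int(outputs["output"] == reference["emotion"])}


examples = [
    {"inputs": {"comment": "Finale"}, "outputs": {"emotion": "neutral"}},
    {"inputs": {"comment": "I double dog dare him."}, "outputs": {"emotion": "neutral"}},
    {"inputs": {"comment": "What a great day!"}, "outputs": {"emotion": "joy"}},  # hypothetical row
]

results = toy_evaluate(always_neutral, examples, evaluators=[exact_match])
scores = [result["scores"][0]["score"] for result in results]
print(scores)
```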

#### Evaluating is_same score on base gpt-3.5-turbo output against go_emotions dataset

In [9]:
# Evaluators
qa_evaluator = [expected_eval]
dataset_name = 'go_emotions'

# Base Model gpt-3.5-turbo Run
base_gpt35t_eval = evaluate(
    analysis_chain_gpt35t.invoke,
    data=dataset_name,
    evaluators=qa_evaluator,
    experiment_prefix="test-gpt35t-expected_answer",
    metadata={
        "variant": "base model gpt-3.5-turbo"
    }
)

View the evaluation results for experiment: 'test-gpt35t-expected_answer-c71af42f' at:
https://smith.langchain.com/o/ef6f5694-a2fa-5316-9158-12297cd17350/datasets/ab3b5df2-9b7c-42ae-85b2-5a1a3e6bd96d/compare?selectedSessions=b00fe5eb-67c1-4d86-8530-d3378fc7215b





#### Evaluating is_same score on base gpt-4o output against go_emotions dataset

In [10]:
# Base Model gpt-4o Run
base_gpt4o_eval = evaluate(
    analysis_chain_gpt4o.invoke,
    data=dataset_name,
    evaluators=qa_evaluator,
    experiment_prefix="test-gpt4o-expected_answer",
    metadata={
        "variant": "base model gpt-4o"
    }
)

View the evaluation results for experiment: 'test-gpt4o-expected_answer-2cd9e55d' at:
https://smith.langchain.com/o/ef6f5694-a2fa-5316-9158-12297cd17350/datasets/ab3b5df2-9b7c-42ae-85b2-5a1a3e6bd96d/compare?selectedSessions=e966e0f8-66cb-470c-8f7b-8e16b104b6da





#### Evaluating is_same score on fine tuned gpt-3.5-turbo output against go_emotions dataset

In [11]:
# Fine-tuned gpt-3.5-turbo run
ft_gpt35t_eval = evaluate(
    analysis_chain_ft.invoke,
    data=dataset_name,
    evaluators=qa_evaluator,
    experiment_prefix="test-ft-3.5t-expected_answer",
    metadata={
        "variant": "fine tuned gpt-3.5-turbo"
    }
)

View the evaluation results for experiment: 'test-ft-3.5t-expected_answer-12892071' at:
https://smith.langchain.com/o/ef6f5694-a2fa-5316-9158-12297cd17350/datasets/ab3b5df2-9b7c-42ae-85b2-5a1a3e6bd96d/compare?selectedSessions=6504ca10-7e42-4c56-a4c8-c29264025a5a





---
# 2: Assessing Model Output Using an LLM-As-Judge Approach

Using built-in evaluators to assess model performance with another model!

#### Creating a new dataset of question and answer pairs

Website of interest: https://lilianweng.github.io/posts/2023-06-23-agent/

In [12]:
# Loading A Web Page
import requests
from bs4 import BeautifulSoup
url = 'https://lilianweng.github.io/posts/2023-06-23-agent/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
text = [p.text for p in soup.find_all('p')]
full_text = '\n'.join(text)

In [13]:
# Example Questions
inputs = [
    "What is the primary function of LLM in autonomous agents?",
    "Can you describe the role of 'Planning' in LLM-powered autonomous agents?",
    "What types of memory are utilized by LLM-powered agents?",
    "How do autonomous agents use tool APIs?",
    "What are some challenges faced by LLM-powered autonomous agents in real-world applications?"
]

outputs = [
    "LLM functions as the core controller or 'brain' of autonomous agents, enabling them to handle complex tasks through planning, memory, and tool use.",
    "In LLM-powered agents, 'Planning' involves breaking down complex tasks into manageable subgoals, reflecting on past actions, and refining strategies for improved outcomes.",
    "LLM-powered agents utilize short-term memory for in-context learning and long-term memory for retaining and recalling information over extended periods, often leveraging external vector stores.",
    "Autonomous agents use tool APIs to extend their capabilities beyond the model's weights, allowing access to current information, code execution, and proprietary data.",
    "Challenges include managing the complexity of task dependencies, maintaining the stability of model outputs, and ensuring efficient interaction with external models and APIs."
]

# Dataset
qa_pairs = [{"question": q, "answer": a} for q, a in zip(inputs, outputs)]
df = pd.DataFrame(qa_pairs)

In [14]:
# Putting dataset into langsmith
from langsmith import Client

client = Client()
dataset_name = "agent_dataset"

# Store
dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="QA pairs from Lilian Weng's AI Agents blog post."
)
client.create_examples(
    inputs=[{"question": q} for q in inputs],
    outputs=[{"answer": a} for a in outputs],
    dataset_id=dataset.id
)

### Defining "apps" to test

Two LLM "apps" to be tested. Both are simple Question and Answering setups, with the context of the web page above inserted into the prompt.
1. OpenAI gpt-4o Q/A
2. Mistral 7b Q/A

In [15]:
# OpenAI API
import openai
from langsmith.wrappers import wrap_openai
openai_client = wrap_openai(openai.Client())

def qa_oai(inputs: dict) -> dict:
    system_msg = f"Answer the user's question in 2-3 sentences using this context: \n\n\n {full_text}"
    
    messages = [{"role": "system", "content": system_msg},
                {"role": "user", "content": inputs["question"]}]

    response = openai_client.chat.completions.create(messages=messages, model="gpt-4o")

    # Read the answer from the typed response object rather than the deprecated .dict()
    return {"answer": response.choices[0].message.content}

In [16]:
# Ollama API
import ollama
from langsmith.run_helpers import traceable

@traceable(run_type="llm")
def call_ollama(messages, model: str):
    # Use the model passed by the caller instead of a hard-coded name
    stream = ollama.chat(messages=messages, model=model, stream=True)
    response = ''
    for chunk in stream:
        print(chunk['message']['content'], end='', flush=True)
        response = response + chunk['message']['content']
    return response

def qa_mistral(inputs: dict) -> dict:
    system_msg = f"Answer the user's question using this context: \n\n\n {full_text}"
    
    messages = [{"role": "system", "content": system_msg},
                {"role": "user", "content": f'Answer the question in 2-3 sentences {inputs["question"]}' }]
    
    response = call_ollama(messages, model="mistral")

    return {"answer": response} 
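The streaming loop in `call_ollama` boils down to concatenating the `content` of each chunk. Here is a minimal sketch of that accumulation with a fake generator standing in for `ollama.chat(..., stream=True)`. The `fake_stream` and `collect_stream` names are hypothetical, and the chunk shape simply mirrors what the loop above reads.

```python
def fake_stream():
    """Stand-in for a streamed chat response: yields chunks shaped like the ones read above."""
    for piece in ["Planning ", "breaks tasks ", "into subgoals."]:
        yield {"message": {"content": piece}}


def collect_stream(stream) -> str:
    """Join the content of every streamed chunk into one response string."""
    parts = []
    for chunk in stream:
        parts.append(chunk["message"]["content"])
    # Joining once at the end avoids rebuilding the string on every chunk.
    return "".join(parts)


answer = collect_stream(fake_stream())
print(answer)
```

Dropping the `print` inside the loop also keeps the notebook output clean when the function runs once per dataset example during an evaluation.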

### We can now use a built-in feature, the `LangChainStringEvaluator`

https://docs.smith.langchain.com/old/evaluation/faq/evaluator-implementations

`LangChainStringEvaluator` has many built-in evaluators. Essentially, each one evaluates a string output based on different `criteria`.

| Evaluator Name          | Output Key               | Simple Code Example                                                                                      |
|-------------------------|--------------------------|---------------------------------------------------------------------------------------------------------|
| QA                      | correctness              | `LangChainStringEvaluator("qa")`                                                                        |
| Contextual Q&A          | contextual accuracy      | `LangChainStringEvaluator("context_qa")`                                                                |
| Chain of Thought Q&A    | cot contextual accuracy  | `LangChainStringEvaluator("cot_qa")`                                                                    |
| Criteria                | Depends on criteria key  | `LangChainStringEvaluator("criteria", config={ "criteria": <criterion> })`                              |
| Labeled Criteria        | Depends on criteria key  | `LangChainStringEvaluator("labeled_criteria", config={ "criteria": <criterion> })`                      |
| Score                   | Depends on criteria key  | `LangChainStringEvaluator("score_string", config={ "criteria": <criterion>, "normalize_by": 10 })`      |
| Labeled Score           | Depends on criteria key  | `LangChainStringEvaluator("labeled_score_string", config={ "criteria": <criterion>, "normalize_by": 10 })` |
| Embedding Distance      | embedding_cosine_distance| `LangChainStringEvaluator("embedding_distance")`                                                        |
| String Distance         | string_distance          | `LangChainStringEvaluator("string_distance", config={"distance": "damerau_levenshtein" })`              |
| Exact Match             | exact_match              | `LangChainStringEvaluator("exact_match")`                                                               |
| Regex Match             | regex_match              | `LangChainStringEvaluator("regex_match")`                                                               |
| Json Validity           | json_validity            | `LangChainStringEvaluator("json_validity")`                                                             |
| Json Equality           | json_equality            | `LangChainStringEvaluator("json_equality")`                                                             |
| Json Edit Distance      | json_edit_distance       | `LangChainStringEvaluator("json_edit_distance")`                                                        |
| Json Schema             | json_schema              | `LangChainStringEvaluator("json_schema")`                                                               |


`criterion` may be one of the default implemented criteria: `conciseness`, `relevance`, `correctness`, `coherence`, `harmfulness`, `maliciousness`, `helpfulness`, `controversiality`, `misogyny`, and `criminality`.

Or, you may define your own criteria in a custom dict as follows:
`{ "criterion_key": "criterion description" }`

For our evaluation, we'll use `[LangChainStringEvaluator("cot_qa")]` for Chain of Thought contextual accuracy on question answering. This compares the LLM-generated response to the question with the expected answer from the dataset, using a built-in CoT chain.

In [17]:
from langsmith.evaluation import LangChainStringEvaluator

qa_evaluator = [LangChainStringEvaluator("cot_qa")]
dataset_name = "agent_dataset"

oai_cot_eval = evaluate(
    qa_oai,
    data=dataset_name,
    evaluators=qa_evaluator,
    experiment_prefix="test-agent-qa-oai",
    # Any experiment metadata can be specified here
    metadata={
        "variant": "full website in context window with gpt-4o"
    }
)

View the evaluation results for experiment: 'test-agent-qa-oai-641102aa' at:
https://smith.langchain.com/o/ef6f5694-a2fa-5316-9158-12297cd17350/datasets/8d445ea4-b7d1-4d36-a641-437e4efa4a5b/compare?selectedSessions=4af55dee-9be5-4973-9a56-f16c87ae65aa





In [18]:
mistral_cot_eval = evaluate(
    qa_mistral,
    data=dataset_name,
    evaluators=qa_evaluator,
    experiment_prefix="test-agent-qa-mistral",
    # Any experiment metadata can be specified here
    metadata={
        "variant": "full website in context window with Mistral 7b"
    }
)

View the evaluation results for experiment: 'test-agent-qa-mistral-6881f27b' at:
https://smith.langchain.com/o/ef6f5694-a2fa-5316-9158-12297cd17350/datasets/8d445ea4-b7d1-4d36-a641-437e4efa4a5b/compare?selectedSessions=07c0e0f4-fee9-42da-ac77-b6cf914997c8





 Planning plays a crucial role in LLM-powered autonomous agents as it enables them to adjust their actions based on long-term goals and unexpected errors. However, current LLMs face challenges in planning over extended periods and decomposing tasks effectively, making them less robust compared to humans. Techniques such as self-reflection, vector search, tool augmentation, and reinforcement learning are being explored to enhance the planning capabilities of LLMs. LLM-powered agents utilize different types of memory, including dynamic memory for storing and reflecting on past experiences, and external knowledge sources such as databases or APIs for accessing additional information. Some agents also incorporate vector stores and retrieval systems for efficient access to large knowledge pools. Autonomous agents use tool APIs by integrating them with large language models, allowing the agents to access external knowledge and perform specific tasks more efficiently. This enables the agents 

### Trying out one more built-in criterion: helpfulness

In [19]:
oai_helpfulness_eval = evaluate(
    qa_oai,
    data=dataset_name,
    evaluators=[LangChainStringEvaluator("criteria", config={ "criteria": "helpfulness" })],
    experiment_prefix="test-agent-qa-oai-helpfulness",
    metadata={
        "variant": "full website in context window with gpt-4o, Helpfulness check"
    }
)

View the evaluation results for experiment: 'test-agent-qa-oai-helpfulness-3361d6cb' at:
https://smith.langchain.com/o/ef6f5694-a2fa-5316-9158-12297cd17350/datasets/8d445ea4-b7d1-4d36-a641-437e4efa4a5b/compare?selectedSessions=1a6be94f-4f7b-4ef9-863c-ed2be1d3b441





In [20]:
mistral_helpfulness_eval = evaluate(
    qa_mistral,
    data=dataset_name,
    evaluators=[LangChainStringEvaluator("criteria", config={ "criteria": "helpfulness" })],
    experiment_prefix="test-agent-qa-mistral-helpfulness",
    metadata={
        "variant": "full website in context window with Mistral 7b, helpfulness check"
    }
)

View the evaluation results for experiment: 'test-agent-qa-mistral-helpfulness-d223aa3b' at:
https://smith.langchain.com/o/ef6f5694-a2fa-5316-9158-12297cd17350/datasets/8d445ea4-b7d1-4d36-a641-437e4efa4a5b/compare?selectedSessions=7245e36d-b143-4cfa-a15c-9f5714885636





 Autonomous agents utilize tool APIs by integrating them with large language models, enabling the agents to execute specific tasks and access external knowledge sources. This modular architecture enhances the capabilities of the agents, allowing them to perform complex operations and interact with various systems. (References: [11], [15], [17], [20]) In LLM-powered autonomous agents, planning plays a crucial role in enabling long-term goal achievement and effective task decomposition. It allows the agent to adjust plans when faced with unexpected errors, making it more robust compared to humans who learn from trial and error. However, current LLMs struggle with reliably generating accurate and formatted outputs for interface communication, which can limit their planning capabilities. Ongoing research focuses on improving these aspects, such as incorporating feedback mechanisms, synergizing reasoning and acting, and developing modular architectures that combine large language models wit

---
# Digging in Deeper: Using LLM-As-A-Judge with Custom Criteria

Same as above, but with our own custom criteria now!

* `criteria` for binary scoring
* `score_string` for numeric scoring
* `labeled_criteria` evaluator instructs an LLM to assess if a prediction satisfies the criteria, taking into account the reference label
* `labeled_score_string` evaluator has the LLM score the prediction on a numeric scale based on how well it satisfies the criteria compared to the reference
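For the two `score_string` variants, the `normalize_by` setting used in this notebook rescales the judge's 1-10 rating. The sketch below assumes it simply divides the raw score by that value, which is how I understand the behavior. `normalize_score` is a hypothetical helper, not a LangChain function.

```python
from typing import Optional


def normalize_score(raw_score: float, normalize_by: Optional[float] = None) -> float:
    """Divide a raw judge score by normalize_by when it is set, else return it unchanged."""
    if normalize_by is None:
        return raw_score
    return raw_score / normalize_by


# A rating of 7 on a 1-10 scale with normalize_by=10 lands on a 0-1 scale.
print(normalize_score(7, normalize_by=10))
# Without normalize_by the raw rating is kept.
print(normalize_score(7))
```

Normalizing to a 0-1 range makes the numeric scores easier to read next to the binary 0/1 scores that the `criteria` evaluators produce.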

### Unlabeled (no ground truth: the LLM assesses at face value)

In [22]:
eval_objectivity_scorestring = LangChainStringEvaluator(
    "score_string",
    config={
        "criteria": {
            "objectivity": "On a scale of 1-10, how unbiased and impartial is this submission? Rate it based on the extent to which it presents information and facts without personal opinions or slanted perspectives, aligning with the objective truth of the topic.",
        },
        "normalize_by": 10,
    }  
)

eval_objectivity_criteria = LangChainStringEvaluator(
    "criteria",
    config={
        "criteria": {
            "objectivity": "Is this submission unbiased and impartial, presenting information and facts without personal opinions or slanted perspectives, ensuring it aligns with the objective truth of the topic?",
        }
    }
)

This chain was only tested with GPT-4. Performance may be significantly worse with other models.


### Labeled (with reference to data)

This requires an extra `prepare_data` argument: a function that returns the `prediction`, `reference`, and `input`. As above, these are pulled from the `Run` (LLM output) or the `Example` (dataset).

In [21]:
eval_labeled_objectivity_criteria = LangChainStringEvaluator(
    "labeled_criteria",
    config={
        "criteria": {
            "objectivity": (
                "Is this submission unbiased and impartial, presenting information and facts without personal opinions or slanted perspectives, ensuring it aligns with the objective truth of the topic?"
            )
        }
    },
    prepare_data=lambda run, example: {
        "prediction": run.outputs["answer"],
        "reference": example.outputs["answer"],
        "input": example.inputs["question"],   
    }
)

eval_labeled_objectivity_scorestring = LangChainStringEvaluator(
    "labeled_score_string", 
    config={
        "criteria": { 
            "objectivity": "On a scale of 1-10, how unbiased and impartial is this submission? Rate it based on the extent to which it presents information and facts without personal opinions or slanted perspectives, aligning with the objective truth of the topic."
        },
        "normalize_by": 10,
    },
    prepare_data=lambda run, example: {
        "prediction": run.outputs["answer"], 
        "reference": example.outputs["answer"],
        "input": example.inputs["question"],
    }  
)
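The `prepare_data` lambdas are just a field mapping, so they can be written as a named function and tested without LangSmith. In this sketch, `prepare_qa_data` is a hypothetical name and the `SimpleNamespace` objects are stand-ins for a real `Run` and `Example`, exposing only the attributes the mapping reads. The answer texts are shortened, made-up values.

```python
from types import SimpleNamespace


def prepare_qa_data(run, example) -> dict:
    """Map a run and its dataset example onto the fields a labeled evaluator expects."""
    return {
        "prediction": run.outputs["answer"],     # what the app produced
        "reference": example.outputs["answer"],  # the dataset's expected answer
        "input": example.inputs["question"],     # the original question
    }


# Stand-ins for langsmith Run / Example objects (illustration only).
fake_run = SimpleNamespace(outputs={"answer": "The LLM acts as the agent's controller."})
fake_example = SimpleNamespace(
    inputs={"question": "What is the primary function of LLM in autonomous agents?"},
    outputs={"answer": "The LLM is the core controller of the agent."},
)

prepared = prepare_qa_data(fake_run, fake_example)
print(prepared)
```

A named function like this can be passed as `prepare_data=prepare_qa_data` to both labeled evaluators, which avoids repeating the same lambda twice.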

### Running multiple evaluators at once in a list 

In [23]:
unlabeled_evaluators = [eval_objectivity_scorestring, eval_objectivity_criteria]
labeled_evaluators = [eval_labeled_objectivity_criteria, eval_labeled_objectivity_scorestring]
dataset_name = "agent_dataset"

**Unlabeled evaluators with GPT-4o**

In [24]:
oai_unlabeled_results = evaluate(
    qa_oai,
    data=dataset_name,
    evaluators=unlabeled_evaluators,
    experiment_prefix="test-agent-objectivity-unlabeled-oai",
    metadata={
        "variant": "full website in context window with gpt-4o, unlabeled"
    }
)

View the evaluation results for experiment: 'test-agent-objectivity-unlabeled-oai-e5900284' at:
https://smith.langchain.com/o/ef6f5694-a2fa-5316-9158-12297cd17350/datasets/8d445ea4-b7d1-4d36-a641-437e4efa4a5b/compare?selectedSessions=d7fc5637-a4c0-447e-9f2b-9dcdfa8fa57f





**Unlabeled evaluators with Mistral 7b**

In [25]:
mistral_unlabeled_results = evaluate(
    qa_mistral,
    data=dataset_name,
    evaluators=unlabeled_evaluators,
    experiment_prefix="test-agent-objectivity-unlabeled-mistral",
    # Any experiment metadata can be specified here
    metadata={
        "variant": "full website in context window with Mistral 7B, unlabeled"
    }
)

View the evaluation results for experiment: 'test-agent-objectivity-unlabeled-mistral-e18e0d1b' at:
https://smith.langchain.com/o/ef6f5694-a2fa-5316-9158-12297cd17350/datasets/8d445ea4-b7d1-4d36-a641-437e4efa4a5b/compare?selectedSessions=afce084b-30c0-42ab-8c26-6127035dcf88





 The primary function of Large Language Models (LLMs) in autonomous agents is to process natural language inputs and generate appropriate outputs, interacting with external components such as memory and tools. They help in understanding instructions, generating responses, and executing tasks by parsing and interpreting textual data. LLM-powered autonomous agents face several challenges in real-world applications, including finite context length which limits historical information and detailed instructions, reliability of natural language interfaces due to formatting errors and rebellious behavior, and difficulties with long-term planning and task decomposition. These limitations impact the robustness and effectiveness of these agents. (References: [1], [2], [3], [4], [5], [6], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20], [21]) In LLM-powered autonomous agents, planning plays a crucial role by allowing the agent to adjust its actions based on long-term goa

**Labeled Evaluators with gpt-4o**

In [26]:
# These are now in comparison to the "reference output"
oai_labeled_results = evaluate(
    qa_oai,
    data=dataset_name,
    evaluators=labeled_evaluators,
    experiment_prefix="test-agent-objectivity-labeled-oai",
    metadata={
        "variant": "full website in context window with gpt-4o, labeled"
    }
)

View the evaluation results for experiment: 'test-agent-objectivity-labeled-oai-a19e3d7e' at:
https://smith.langchain.com/o/ef6f5694-a2fa-5316-9158-12297cd17350/datasets/8d445ea4-b7d1-4d36-a641-437e4efa4a5b/compare?selectedSessions=a337ef96-4f61-4a1f-bef8-37e3e009ce74





**Labeled evaluators with Mistral 7b**

In [27]:
mistral_labeled_results = evaluate(
    qa_mistral,
    data=dataset_name,
    evaluators=labeled_evaluators,
    experiment_prefix="test-agent-objectivity-labeled-mistral",
    # Any experiment metadata can be specified here
    metadata={
        "variant": "full website in context window with Mistral 7B, labeled"
    }
)

View the evaluation results for experiment: 'test-agent-objectivity-labeled-mistral-017ce55d' at:
https://smith.langchain.com/o/ef6f5694-a2fa-5316-9158-12297cd17350/datasets/8d445ea4-b7d1-4d36-a641-437e4efa4a5b/compare?selectedSessions=652ea396-158f-4ade-aff2-81fa78b8d11d





 Some challenges faced by LLM-powered autonomous agents in real-world applications include finite context length, which limits historical information and detailed instructions, making it difficult for long-term planning and effective task decomposition. Another challenge is the reliability of natural language interfaces, as LLMs may exhibit formatting errors or rebellious behavior. These issues make it important to continually research and develop methods to improve the performance and robustness of these agents. LLM-powered agents utilize different types of memory, including dynamic memory for self-reflection and static memory stored in vector stores or databases. They also rely on natural language interfaces to communicate with external components, which can be unreliable due to formatting errors or rebellious behavior. The primary function of a Large Language Model (LLM) in autonomous agents is to process natural language instructions and generate responses or actions based on that 

---
# Evaluating existing Evaluations

What if your evaluation of interest is not at the individual run level, but on the overall experiment level?

https://docs.smith.langchain.com/how_to_guides/evaluation/evaluate_existing_experiment

In [28]:
helpfulness_scorestring = LangChainStringEvaluator("score_string", config={ "criteria": "helpfulness" })
conciseness_scorestring = LangChainStringEvaluator("score_string", config={ "criteria": "conciseness" })
coherence_scorestring = LangChainStringEvaluator("score_string", config={ "criteria": "coherence" })

evaluators = [helpfulness_scorestring, conciseness_scorestring, coherence_scorestring]

mistral_multicriteria_eval = evaluate(
    qa_mistral,
    data=dataset_name,
    evaluators=evaluators,
    experiment_prefix="test-agent-qa-mistral-multicriteria",
    metadata={
        "variant": "full website in context window with Mistral 7b, helpfulness, conciseness, and coherence check"
    }
)

This chain was only tested with GPT-4. Performance may be significantly worse with other models.


View the evaluation results for experiment: 'test-agent-qa-mistral-multicriteria-63912a34' at:
https://smith.langchain.com/o/ef6f5694-a2fa-5316-9158-12297cd17350/datasets/8d445ea4-b7d1-4d36-a641-437e4efa4a5b/compare?selectedSessions=b7f96e55-d16a-4dac-b9ad-44b0cacff931





 LLM-powered agents typically utilize two types of memory: dynamic memory for storing and retrieving information during conversation or task execution, and external knowledge sources such as databases or APIs to access a larger pool of information. Some agents also use self-reflection and learning mechanisms to store past experiences for future reference. In LLM-powered autonomous agents, 'Planning' is a crucial component that enables the agent to generate a sequence of actions based on its current context and goals. It allows the agent to adjust its plans when faced with unexpected errors, improving its robustness compared to humans who learn from trial and error. Effective planning in LLMs is challenging due to their finite context length and reliability issues with natural language interfaces. Various approaches like reinforcement learning, algorithm distillation, and modular architecture have been explored to enhance the planning capabilities of LLMs. The primary function of Large 

### Set up a Summary Evaluator to look over the entire dataset and determine whether an output was generated

Our criterion for a pass is that the model produces a non-empty answer for more than 80% of the runs in the experiment.

In [29]:
from langsmith.evaluation import evaluate_existing

experiment_name = mistral_multicriteria_eval.experiment_name

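def passed_eval(runs: list, examples: list) -> dict:
    """Summary evaluator: pass if more than 80% of runs produced an answer."""
    if not runs:
        # No runs to score, so the experiment cannot pass
        return {"key": "pass", "score": False}
    # Count runs with a non-empty answer; failed runs may have no outputs at all
    answered = sum(1 for run in runs if run.outputs and run.outputs.get("answer"))
    # Report one consistent feedback key so results line up across experiments
    return {"key": "pass", "score": answered / len(runs) > 0.8}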

evaluate_existing(experiment_name, summary_evaluators=[passed_eval])

View the evaluation results for experiment: 'test-agent-qa-mistral-multicriteria-63912a34' at:
https://smith.langchain.com/o/ef6f5694-a2fa-5316-9158-12297cd17350/datasets/8d445ea4-b7d1-4d36-a641-437e4efa4a5b/compare?selectedSessions=b7f96e55-d16a-4dac-b9ad-44b0cacff931





<ExperimentResults test-agent-qa-mistral-multicriteria-63912a34>
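
A summary evaluator is an ordinary function of the full list of runs and examples, so its logic can be checked locally before it is attached to an experiment. The sketch below is a minimal illustration that uses stand-in run objects built with `SimpleNamespace`. It assumes a real LangSmith `Run` exposes `outputs` the same way. `answer_rate_eval`, `fake_run`, and the `threshold` parameter are hypothetical names for this example, not part of LangSmith.

```python
from types import SimpleNamespace


def answer_rate_eval(runs: list, examples: list, threshold: float = 0.8) -> dict:
    """Pass when more than `threshold` of the runs produced a non-empty answer."""
    if not runs:
        return {"key": "pass", "score": False}
    # A run counts as answered only if it has outputs and a non-empty "answer"
    answered = sum(1 for run in runs if run.outputs and run.outputs.get("answer"))
    return {"key": "pass", "score": answered / len(runs) > threshold}


def fake_run(answer):
    """Hypothetical stand-in for a LangSmith Run; only `outputs` is modeled."""
    return SimpleNamespace(outputs=None if answer is None else {"answer": answer})


# Nine answered runs out of ten is 90%, which clears the 80% bar
mostly_answered = [fake_run("some answer")] * 9 + [fake_run("")]
# Three answered runs out of five is 60%, which does not
half_answered = [fake_run("some answer")] * 3 + [fake_run(None), fake_run("")]

print(answer_rate_eval(mostly_answered, examples=[]))
print(answer_rate_eval(half_answered, examples=[]))
```

Keeping the threshold as a parameter makes it easy to try stricter or looser pass criteria without rewriting the evaluator.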

---
# Pairwise Evaluations

https://docs.smith.langchain.com/how_to_guides/evaluation/evaluate_pairwise

Pairwise evaluation lets you evaluate existing experiments against each other. Example: an LLM-as-judge states its preference between two LLM outputs taken from existing experiments. This can be useful for comparing the outputs of two small models using a larger model as the judge.

Using this prompt: https://smith.langchain.com/hub/langchain-ai/pairwise-evaluation-2?organizationId=ef6f5694-a2fa-5316-9158-12297cd17350

In [30]:
from langsmith.evaluation import evaluate_comparative
from langchain import hub
from langchain_openai import ChatOpenAI
from langsmith.schemas import Run, Example
prompt = hub.pull("langchain-ai/pairwise-evaluation-2")

# Example from documentation, using GPT-4o to evaluate preference between two models' outputs
def evaluate_pairwise(runs: list[Run], example: Example):
    scores = {}

    # Create the model to run your evaluator
    model = ChatOpenAI(model_name="gpt-4o")

    runnable = prompt | model
    response = runnable.invoke({
        "question": example.inputs["question"],
        "answer_a": runs[0].outputs["answer"] if runs[0].outputs is not None else "N/A",
        "answer_b": runs[1].outputs["answer"] if runs[1].outputs is not None else "N/A",
    })
    score = response["Preference"]
    if score == 1:
        scores[runs[0].id] = 1
        scores[runs[1].id] = 0
    elif score == 2:
        scores[runs[0].id] = 0
        scores[runs[1].id] = 1
    else:
        scores[runs[0].id] = 0
        scores[runs[1].id] = 0
    return {"key": "ranked_preference", "scores": scores}
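
The score-assignment step of the evaluator above can be separated from the LLM call and checked on its own. The following is a minimal sketch, assuming the judge returns a preference of 1 or 2, and treating any other value as a tie. `preference_to_scores`, `tally_preferences`, and the string run IDs are hypothetical and used only for illustration.

```python
from collections import Counter


def preference_to_scores(preference: int, run_ids: list) -> dict:
    """Map a judge preference (1 or 2; anything else is a tie) to per-run scores."""
    first, second = run_ids
    if preference == 1:
        return {first: 1, second: 0}
    if preference == 2:
        return {first: 0, second: 1}
    # Tie or unexpected value: neither run is preferred
    return {first: 0, second: 0}


def tally_preferences(preferences: list) -> Counter:
    """Count how often each side was preferred across a list of examples."""
    labels = {1: "first", 2: "second"}
    return Counter(labels.get(p, "tie") for p in preferences)


scores = preference_to_scores(2, ["run-a", "run-b"])
tally = tally_preferences([1, 2, 2, 0, 2])
print(scores)
print(tally)
```

A tally like this is a quick way to sanity-check the judge's preferences on a handful of examples before running the comparison over a whole dataset.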

### Running Comparative Evaluation

We will reuse our prior helpfulness experiments, one run with GPT-4o and one with Mistral 7b.

In [31]:
evaluate_comparative(
    # Replace the following array with the names or IDs of your experiments
    [oai_helpfulness_eval.experiment_name, mistral_helpfulness_eval.experiment_name],
    evaluators=[evaluate_pairwise],
)

View the pairwise evaluation results at:
https://smith.langchain.com/o/ef6f5694-a2fa-5316-9158-12297cd17350/datasets/8d445ea4-b7d1-4d36-a641-437e4efa4a5b/compare?selectedSessions=1a6be94f-4f7b-4ef9-863c-ed2be1d3b441%2C7245e36d-b143-4cfa-a15c-9f5714885636&comparativeExperiment=dca56cef-79d4-4252-9171-c60b3217ff75





<langsmith.evaluation._runner.ComparativeExperimentResults at 0x1772c7fe0>

---
# Unit Tests w/pytest - VSCode Example

https://docs.smith.langchain.com/how_to_guides/evaluation/unit_testing

You can attach the built-in `@unit` decorator to `pytest` tests. The test results are then logged to LangSmith, where they can be viewed and compared in the same way as the evaluations covered above.

Feedback built in with the `@unit` decorator:

| Feedback           | Description                                                   | Example                                                                                                                      |
|--------------------|---------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------|
| pass               | Binary pass/fail score, 1 for pass, 0 for fail                | `assert False # Fails`                                                                                                       |
| expectation        | Binary expectation score, 1 if expectation is met, 0 if not   | `expect(prediction).against(lambda x: re.search(r"\b[a-f\d]{8}-[a-f\d]{4}-[a-f\d]{4}-[a-f\d]{4}-[a-f\d]{12}\b", x))`          |
| embedding_distance | Cosine distance between two embeddings                        | `expect.embedding_distance(prediction=prediction, expectation=expectation)`                                                  |
| edit_distance      | Edit distance between two strings                             | `expect.edit_distance(prediction=prediction, expectation=expectation)`                                                       |



### `expect` methods


| Method              | Description                                                                                   | Parameters                                           |
|---------------------|-----------------------------------------------------------------------------------------------|------------------------------------------------------|
| `to_be_less_than`   | Assert that the expectation value is less than the given value.                               | `value`                                              |
| `to_be_greater_than`| Assert that the expectation value is greater than the given value.                            | `value`                                              |
| `to_be_between`     | Assert that the expectation value is between the given min and max values.                    | `min_value, max_value`                               |
| `to_be_approximately`| Assert that the expectation value is approximately equal to the given value.                  | `value, precision=2`                                 |
| `to_equal`          | Assert that the expectation value equals the given value.                                     | `value`                                              |
| `to_contain`        | Assert that the expectation value contains the given value.                                   | `value`                                              |
| `against`           | Assert the expectation value against a custom function.                                       | `func`                                               |



---
# Hopping over to Llama3 Research Agent Notebook to Assess Attaching Evaluations within Existing Runs