# LangSmith-Compare-Evaluation

- Author: [BokyungisaGod](https://github.com/BokyungisaGod)
- Design: 
- Peer Review: 
- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/99-TEMPLATE/00-BASE-TEMPLATE-EXAMPLE.ipynb) [![Open in GitHub](https://img.shields.io/badge/Open%20in%20GitHub-181717?style=flat-square&logo=github&logoColor=white)](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/99-TEMPLATE/00-BASE-TEMPLATE-EXAMPLE.ipynb)

## Overview
You can compare experiment results side by side using the **Compare** feature provided by `LangSmith`. This tutorial evaluates two models, `GPT-4o-mini` and `llama3.2` served through `Ollama`, in the same RAG system, so you can compare how accurately each one generates context-based answers.

### Table of Contents

- [Overview](#overview)
- [Environment Setup](#environment-setup)
- [Define a function for RAG performance testing](#define-a-function-for-rag-performance-testing)

## Environment Setup

Setting up your environment is the first step. See the [Environment Setup](https://wikidocs.net/257836) guide for more details.


**[Note]**

`langchain-opentutorial` is a package that provides easy-to-use environment setup guidance, along with useful functions and utilities for the tutorials.
Check out [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) for more details.

In [9]:
%%capture --no-stderr
%pip install langchain-opentutorial

In [10]:
# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langsmith",
        "langchain",
        "langchain_core",
        "langchain_openai",
        "langchain_ollama",  # needed for ChatOllama, which is imported later
    ],
    verbose=False,
    upgrade=False,
)

You can set API keys in a `.env` file or set them manually.

**[Note]** If you are not using a `.env` file, enter the keys directly in the cell below.

In [11]:
from dotenv import load_dotenv
from langchain_opentutorial import set_env

# Attempt to load environment variables from a .env file; if unsuccessful, set them manually.
if not load_dotenv():
    set_env(
        {
            "OPENAI_API_KEY": "",
            "LANGCHAIN_API_KEY": "",
            "LANGCHAIN_TRACING_V2": "true",
            "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
            "LANGCHAIN_PROJECT": "LangSmith-Compare-Evaluation",  # set the project name same as the title
        }
    )
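
Before running the evaluation, it can help to confirm that the keys are actually set. The following is a minimal sketch and not part of the original tutorial: `missing_env_vars` is a hypothetical helper name, and the list of required variables is an assumption based on the cell above.

```python
import os

# Assumption: these are the keys this tutorial needs; adjust for your setup.
REQUIRED_VARS = ("OPENAI_API_KEY", "LANGCHAIN_API_KEY")


def missing_env_vars(names=REQUIRED_VARS, environ=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```

An empty string counts as missing here, which matches the placeholder values (`""`) in the `set_env` call.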

## Define a function for RAG performance testing
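Create the RAG system used for testing. The `ask_question_with_llm` function builds a `PDFRAG` pipeline (imported from `myrag`) around the given LLM and returns a function that takes a dataset example and returns the question, the retrieved context, and the generated answer.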

In [12]:
from myrag import PDFRAG
from langchain_openai import ChatOpenAI


# Create a function to answer the question
def ask_question_with_llm(llm):
    # Create PDFRAG object
    rag = PDFRAG(
        "data/Newwhitepaper_Agents2.pdf",
        llm,
    )

    # Create retriever
    retriever = rag.create_retriever()

    # Create chain
    rag_chain = rag.create_chain(retriever)

    def _ask_question(inputs: dict):
        context = retriever.invoke(inputs["question"])
        context = "\n".join([doc.page_content for doc in context])
        return {
            "question": inputs["question"],
            "context": context,
            "answer": rag_chain.invoke(inputs["question"]),
        }

    return _ask_question
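
To see the shape of the dictionary that `_ask_question` returns without loading a PDF or calling a model, the same closure pattern can be exercised with stand-ins. This is only a sketch: `FakeDoc`, `FakeRetriever`, `FakeChain`, and `make_ask_question` are hypothetical names introduced for illustration and are not part of `myrag` or LangChain.

```python
from dataclasses import dataclass


@dataclass
class FakeDoc:
    """Stand-in for a retrieved document; only page_content is used."""
    page_content: str


class FakeRetriever:
    """Hypothetical retriever that returns fixed documents for any query."""

    def invoke(self, question: str):
        return [FakeDoc("first chunk"), FakeDoc("second chunk")]


class FakeChain:
    """Hypothetical chain that echoes the question instead of calling an LLM."""

    def invoke(self, question: str) -> str:
        return f"stub answer to: {question}"


def make_ask_question(retriever, rag_chain):
    # Same closure pattern as ask_question_with_llm, with dependencies injected.
    def _ask_question(inputs: dict):
        docs = retriever.invoke(inputs["question"])
        context = "\n".join(doc.page_content for doc in docs)
        return {
            "question": inputs["question"],
            "context": context,
            "answer": rag_chain.invoke(inputs["question"]),
        }

    return _ask_question


ask = make_ask_question(FakeRetriever(), FakeChain())
result = ask({"question": "What is an agent?"})
print(sorted(result))  # ['answer', 'context', 'question']
```

The three keys matter later: the evaluator reads `answer` and `context` from the run outputs, so any target function passed to the evaluation needs to return them.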

In [13]:
from langchain_ollama import ChatOllama

# Load the Ollama model.
ollama = ChatOllama(model="llama3.2")

# Call the Ollama model
ollama.invoke("hello")

AIMessage(content='Hello! How can I assist you today?', additional_kwargs={}, response_metadata={'model': 'llama3.2', 'created_at': '2025-01-14T17:11:32.308467Z', 'done': True, 'done_reason': 'stop', 'total_duration': 9318028375, 'load_duration': 1418234667, 'prompt_eval_count': 26, 'prompt_eval_duration': 6607000000, 'eval_count': 10, 'eval_duration': 1283000000, 'message': Message(role='assistant', content='', images=None, tool_calls=None)}, id='run-bee3f443-5d6a-4237-81c5-143f2267c99e-0', usage_metadata={'input_tokens': 26, 'output_tokens': 10, 'total_tokens': 36})

Create one answer-generating function per model: one that uses `GPT-4o-mini` and one that uses `llama3.2`.

In [14]:
gpt_chain = ask_question_with_llm(ChatOpenAI(model="gpt-4o-mini", temperature=0))
ollama_chain = ask_question_with_llm(ChatOllama(model="llama3.2"))

Evaluate the answers from the `GPT-4o-mini` chain and the `llama3.2` chain, running one evaluation per chain against the same dataset.

In [15]:
from langsmith.evaluation import evaluate, LangChainStringEvaluator

# Create a chain-of-thought QA evaluator that uses GPT-4o-mini as the judge
cot_qa_evaluator = LangChainStringEvaluator(
    "cot_qa",
    config={"llm": ChatOpenAI(model="gpt-4o-mini", temperature=0)},
    # Map run outputs and dataset inputs to the fields the evaluator expects.
    # The retrieved context (not a dataset label) is used as the reference.
    prepare_data=lambda run, example: {
        "prediction": run.outputs["answer"],
        "reference": run.outputs["context"],
        "input": example.inputs["question"],
    },
)

# Name of the LangSmith dataset to evaluate against
dataset_name = "RAG_EVAL_DATASET"

# Evaluate the GPT-4o-mini chain
experiment_results1 = evaluate(
    gpt_chain,
    data=dataset_name,
    evaluators=[cot_qa_evaluator],
    experiment_prefix="MODEL_COMPARE_EVAL",
    # Specify experiment metadata
    metadata={
        "variant": "GPT-4o-mini evaluation (cot_qa)",
    },
)

# Evaluate the Ollama (llama3.2) chain
experiment_results2 = evaluate(
    ollama_chain,
    data=dataset_name,
    evaluators=[cot_qa_evaluator],
    experiment_prefix="MODEL_COMPARE_EVAL",
    # Specify experiment metadata
    metadata={
        "variant": "Ollama(llama3.2) evaluation (cot_qa)",
    },
)

View the evaluation results for experiment: 'MODEL_COMPARE_EVAL-bd816470' at:
https://smith.langchain.com/o/9089d1d3-e786-4000-8468-66153f05444b/datasets/9b4ca107-33fe-4c71-bb7f-488272d895a3/compare?selectedSessions=fbb203cd-fe4a-4982-b7be-95c657794165


View the evaluation results for experiment: 'MODEL_COMPARE_EVAL-660d90ec' at:
https://smith.langchain.com/o/9089d1d3-e786-4000-8468-66153f05444b/datasets/9b4ca107-33fe-4c71-bb7f-488272d895a3/compare?selectedSessions=09fbec65-fe5c-4559-84d3-ad6ce705255a


Use the comparison view to check the results.
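
When interpreting the scores, keep in mind which fields the evaluator actually received. The sketch below repeats the `prepare_data` mapping from the evaluation cell as a named function and applies it to stand-in objects. The `SimpleNamespace` objects are hypothetical placeholders that expose only the attributes the mapping reads; they are not real LangSmith `Run` or `Example` instances.

```python
from types import SimpleNamespace


def prepare_data(run, example):
    # Same mapping as the lambda passed to LangChainStringEvaluator.
    return {
        "prediction": run.outputs["answer"],   # the model's answer
        "reference": run.outputs["context"],   # the retrieved context
        "input": example.inputs["question"],   # the dataset question
    }


# Hypothetical stand-ins for a run and a dataset example.
run = SimpleNamespace(outputs={"answer": "stub answer", "context": "stub context"})
example = SimpleNamespace(inputs={"question": "stub question"})

fields = prepare_data(run, example)
print(fields)
```

Because the reference is the retrieved context rather than a ground-truth answer from the dataset, the score reflects how well each answer is supported by what the retriever returned.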

### How to view comparison

1. In the dataset's Experiments tab, select the experiments you want to compare.
2. Click the "Compare" button at the bottom.
3. The comparison view will be displayed.

![](./assets/09-LangSmith-Compare-Evaluation-01.png)

![](./assets/09-LangSmith-Compare-Evaluation-02.png)
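
The comparison view shows per-example scores side by side. If you also want a single number per experiment, you can average the scores yourself once you have them in hand. The following is a minimal sketch in plain Python: `mean_scores` is a hypothetical helper, and the input values are placeholders for illustration only, not results from this tutorial.

```python
from statistics import mean


def mean_scores(scores_by_experiment):
    """Average the numeric scores for each experiment, skipping missing ones."""
    summary = {}
    for name, scores in scores_by_experiment.items():
        valid = [s for s in scores if s is not None]
        summary[name] = mean(valid) if valid else None
    return summary


# Placeholder scores for illustration only; NOT results from this tutorial.
placeholder = {
    "experiment_a": [1, 0, 1, 1],
    "experiment_b": [1, None, 0, 0],
    "experiment_c": [],
}
summary = mean_scores(placeholder)
print(summary)
```

Missing scores (`None`) are skipped rather than counted as zero, so an experiment with failed evaluator runs is not penalized for them. Whether that is the right choice depends on how you want to treat failures.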