## **LLM Evaluation Framework**
This notebook sets up an evaluation infrastructure for testing LLM performance in a Retrieval-Augmented Generation (RAG) pipeline.
### **Key Components**
- **Vector Database:** ChromaDB for storing and retrieving documents.
- **Embedding Model:** OpenAIEmbeddings for converting text into vector space.
- **Retriever:** Queries the vector database for relevant document snippets.
- **LLM (GPT-4o-mini, Gemini 2.0 Flash, Claude Opus 4):** Generates responses based on the retrieved context.
- **Evaluation Setup:** LangSmith evaluators (a chain-of-thought QA grader and an LLM-based relevance grader) that score the LLM outputs.
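
To make the flow between these components concrete, here is a minimal, dependency-free sketch of the retrieve, generate, evaluate loop. Everything in it is a stand-in: word overlap replaces embedding similarity, `generate` echoes the top context instead of calling an LLM, and the evaluator is an exact-match check rather than an LLM grader. The function names are hypothetical and do not come from the notebook.

```python
def retrieve(question, corpus, k=2):
    """Toy retriever: rank documents by word overlap with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(
        corpus,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def generate(question, contexts):
    """Placeholder for the LLM call: simply returns the best-ranked context."""
    return contexts[0] if contexts else "No relevant context found."


def evaluate_answer(answer, reference):
    """Toy evaluator: case-insensitive exact match against a reference answer."""
    return answer.strip().lower() == reference.strip().lower()


corpus = [
    "green tea is light green",
    "fresh dog food is gaining popularity",
    "black tea is fully oxidized",
]
question = "what colour is green tea"
contexts = retrieve(question, corpus)
answer = generate(question, contexts)
score = evaluate_answer(answer, "Green tea is light green")
print(contexts, answer, score)
```

The rest of the notebook replaces each stand-in with a real component: Chroma plus OpenAI embeddings for retrieval, a chat model for generation, and LangSmith evaluators for scoring.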


In [1]:
%pip install langchain-community langchain-openai chromadb

Note: you may need to restart the kernel to use updated packages.


In [8]:
! python3 -m venv path/to/venv

In [14]:
! pip install ipywidgets



Collecting ipywidgets
  Downloading ipywidgets-8.1.7-py3-none-any.whl.metadata (2.4 kB)
Collecting widgetsnbextension~=4.0.14 (from ipywidgets)
  Downloading widgetsnbextension-4.0.14-py3-none-any.whl.metadata (1.6 kB)
Collecting jupyterlab_widgets~=3.0.15 (from ipywidgets)
  Downloading jupyterlab_widgets-3.0.15-py3-none-any.whl.metadata (20 kB)
Downloading ipywidgets-8.1.7-py3-none-any.whl (139 kB)
Downloading jupyterlab_widgets-3.0.15-py3-none-any.whl (216 kB)
Downloading widgetsnbextension-4.0.14-py3-none-any.whl (2.2 MB)
Installing collected packages: widgetsnbextension, jupyterlab_widgets, ipywidgets
Successfully installed ipywidgets-8.1.7 jupyterlab_widgets-3.0.15 widgetsnbextension-4.0.14


In [6]:
# ! pip show langchain-community

In [3]:
! pip install langchain-community langchain-openai chromadb --upgrade

Collecting langchain-community
  Using cached langchain_community-0.3.24-py3-none-any.whl.metadata (2.5 kB)
Collecting langchain-openai
  Using cached langchain_openai-0.3.17-py3-none-any.whl.metadata (2.3 kB)
Collecting chromadb
  Using cached chromadb-1.0.10-cp39-abi3-macosx_11_0_arm64.whl.metadata (6.9 kB)
Collecting langchain-core<1.0.0,>=0.3.59 (from langchain-community)
  Using cached langchain_core-0.3.60-py3-none-any.whl.metadata (5.8 kB)
Collecting langchain<1.0.0,>=0.3.25 (from langchain-community)
  Using cached langchain-0.3.25-py3-none-any.whl.metadata (7.8 kB)
Collecting SQLAlchemy<3,>=1.4 (from langchain-community)
  Using cached sqlalchemy-2.0.41-cp313-cp313-macosx_11_0_arm64.whl.metadata (9.6 kB)
Collecting requests<3,>=2 (from langchain-community)
  Using cached requests-2.32.3-py3-none-any.whl.metadata (4.6 kB)
Collecting PyYAML>=5.3 (from langchain-community)
  Using cached PyYAML-6.0.2-cp313-cp313-macosx_11_0_arm64.whl.metadata (2.1 kB)
Collecting aiohttp<4.0.0,>=3

In [11]:
! pip install google.generativeai --upgrade





In [None]:
%pip install -qU langchain-google-genai


Note: you may need to restart the kernel to use updated packages.


In [1]:
! pip show langchain

Name: langchain
Version: 0.3.25
Summary: Building applications with LLMs through composability
Home-page: 
Author: 
Author-email: 
License: MIT
Location: /Users/lilian/Documents/LLM Eval Project 2/path/to/venv/lib/python3.13/site-packages
Requires: langchain-core, langchain-text-splitters, langsmith, pydantic, PyYAML, requests, SQLAlchemy
Required-by: langchain-community


In [10]:
! python3 -m pip install --upgrade pip

error: externally-managed-environment

× This environment is externally managed
╰─> To install Python packages system-wide, try brew install
    xyz, where xyz is the package you are trying to
    install.

    If you wish to install a Python library that isn't in Homebrew,
    use a virtual environment:

    python3 -m venv path/to/venv
    source path/to/venv/bin/activate
    python3 -m pip install xyz

    If you wish to install a Python application that isn't in Homebrew,
    it may be easiest to use 'pipx install xyz', which will manage a
    virtual environment for you. You can install pipx with

    brew install pipx

    You may restore the old behavior of pip by passing
    the '--break-system-packages' flag to pip, or by adding
    'break-system-packag

In [11]:
# ! pip install -U langchain-openai


In [12]:
# %pip install langchain_openai langchain_core

In [None]:
%pip install langchain_anthropic

Collecting langchain_anthropic
  Downloading langchain_anthropic-0.3.13-py3-none-any.whl.metadata (1.9 kB)
Collecting anthropic<1,>=0.51.0 (from langchain_anthropic)
  Downloading anthropic-0.51.0-py3-none-any.whl.metadata (25 kB)
Downloading langchain_anthropic-0.3.13-py3-none-any.whl (26 kB)
Downloading anthropic-0.51.0-py3-none-any.whl (263 kB)
Installing collected packages: anthropic, langchain_anthropic
Successfully installed anthropic-0.51.0 langchain_anthropic-0.3.13
Note: you may need to restart the kernel to use updated packages.


In [None]:
# ! pip show langchain chromadb langchain-openai

#### Import Required Libraries

In [1]:
import os 
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import TextLoader
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_anthropic import ChatAnthropic

In [2]:
from dotenv import load_dotenv

# Load environment variables from .env
#load_dotenv(dotenv_path='.env', override=True)
load_dotenv()

True

In [3]:
from langsmith import utils
utils.tracing_is_enabled()

True

In [4]:
os.environ["LANGCHAIN_TRACING_V2"] = os.getenv("LANGSMITH_TRACING")
os.environ["LANGCHAIN_API_KEY"] = os.getenv("LANGSMITH_API_KEY")
os.environ["LANGCHAIN_PROJECT"] = os.getenv("LANGSMITH_PROJECT", "MLS-eval")  # assumes "MLS-eval" is the project name, not an environment variable
os.environ['LANGCHAIN_ENDPOINT']= os.getenv("LANGSMITH_ENDPOINT")
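
Assigning to `os.environ` raises a `TypeError` when the value is `None`, which is what `os.getenv` returns for an unset variable. The sketch below shows one defensive way to copy variables across; the helper name `export_env` and the defaults are assumptions for illustration, and a plain dict stands in for `os.environ` so the example has no side effects.

```python
import os


def export_env(mapping, environ=None):
    """Copy source variables to target names, skipping any that are unset.

    `mapping` maps target name -> (source name, default or None).
    Returns the target names that could not be set.
    """
    environ = os.environ if environ is None else environ
    missing = []
    for target, (source, default) in mapping.items():
        value = environ.get(source, default)
        if value is None:
            missing.append(target)  # os.environ rejects None, so skip it
        else:
            environ[target] = str(value)
    return missing


# A plain dict stands in for os.environ; the values are placeholders.
fake_env = {"LANGSMITH_TRACING": "true", "LANGSMITH_API_KEY": "key-123"}
missing = export_env(
    {
        "LANGCHAIN_TRACING_V2": ("LANGSMITH_TRACING", "false"),
        "LANGCHAIN_API_KEY": ("LANGSMITH_API_KEY", None),
        "LANGCHAIN_ENDPOINT": ("LANGSMITH_ENDPOINT", None),
    },
    environ=fake_env,
)
print("Unset variables:", missing)
```

Reporting the missing names up front tends to be easier to debug than a `TypeError` raised in the middle of the cell.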

#### Load the Document

In [5]:
# Define the path to the document (docs/ sits in the parent of the working directory)
base_dir = os.path.dirname(os.path.abspath(os.getcwd()))  # avoid shadowing the built-in `dir`
file_path = os.path.join(base_dir, "docs", "actual_data.txt")

# Load the document using TextLoader
loader = TextLoader(file_path)
documents = loader.load()   


#### Split the Document into Chunks


In [155]:
# Split the document into smaller chunks for processing (chunks of 1000 characters with 100 characters of overlap)
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
splits = text_splitter.split_documents(documents)
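
To illustrate what `chunk_size` and `chunk_overlap` mean, here is a simplified fixed-window splitter. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break on separators such as paragraphs, lines, and spaces), so real chunk boundaries will differ; `sliding_chunks` is a hypothetical helper written only for this illustration.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size windows that overlap by chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk repeats the last three characters of the previous one, which is the role the overlap plays: a sentence cut at a boundary still appears whole in at least one chunk if it is shorter than the overlap.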


#### Embed the Document Chunks

In [6]:
import os
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

# Load text content from the file
file_path = "/Users/lilian/Documents/LLM Eval Project 2/data/merged.txt"
with open(file_path, "r", encoding="utf-8") as file:
    full_text = file.read()

# Split into chunks using LangChain's text splitter
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
documents = splitter.create_documents([full_text])

# Define Chroma persist directory
persist_dir = "./chroma_db"

# Create embeddings and store in Chroma
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(
    documents,
    embedding=embeddings,
    persist_directory=persist_dir
)
# vectorstore.persist()

# Get retriever from vectorstore
retriever = vectorstore.as_retriever()

# Example query
query = "Dog food?"
docs = retriever.invoke(query)  # replaces the deprecated get_relevant_documents()

# Print top results
for i, doc in enumerate(docs):
    print(f"\nResult {i+1}:\n{doc.page_content[:300]}")





Result 1:
1. **Fresh Dog Food**: 
   - Fresh dog food is gaining popularity, characterized by whole-food ingredients and often delivered via subscription services. Brands like JustFoodForDogs and Nom Nom are noted for their adherence to nutritional standards set by the Association of American Feed Control Off

Result 2:
- **Fresh Dog Food**: This category has gained popularity, with brands like JustFoodForDogs and Nom Nom offering meals made from whole ingredients. These foods are typically delivered to consumers and are marketed as healthier alternatives to traditional kibble. However, experts caution that fresh f

Result 3:
#### 3. Product Trends
- Fresh dog food is gaining popularity, with many brands offering subscription services that deliver meals made from whole-food ingredients directly to consumers' homes. Notable brands include JustFoodForDogs and Nom Nom, which meet the standards set by the Association of Amer

Result 4:
medicine at Small Door Veterinary. You don't need to 

In [6]:
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

persist_dir = "./chroma_db"
embeddings = OpenAIEmbeddings()

# Load existing store
vectorstore = Chroma(persist_directory=persist_dir, embedding_function=embeddings)
retriever = vectorstore.as_retriever()

query = "Dog food"
docs = retriever.invoke(query)  # or .get_relevant_documents() if < langchain-core 0.1.46

for i, doc in enumerate(docs):
    print(f"\nResult {i+1}:\n{doc.page_content[:300]}")



Result 1:
### Overview of Dog Food

Result 2:
Dog food is a vital component of pet care, designed to meet the specific dietary needs of dogs. With various types available, including dry, wet, and fresh options, it is essential for pet owners to choose products that adhere to established nutritional standards to ensure their pets' health and wel

Result 3:
### Nutritional Guidelines for Dog Food

Result 4:
1. **Fresh Dog Food**: 
   - Fresh dog food is gaining popularity, characterized by whole-food ingredients and often delivered via subscription services. Brands like JustFoodForDogs and Nom Nom are noted for their adherence to nutritional standards set by the Association of American Feed Control Off


In [40]:
# Query test (Japanese: "What are the latest trends in sustainability?")
query = "サステナビリティに関する最新のトレンドは何ですか?"
docs = retriever.invoke(query)  # replaces the deprecated get_relevant_documents()

# Print results
for i, doc in enumerate(docs):
    print(f"\nResult {i+1}:\n", doc.page_content[:300])


Result 1:
 ズや優先事項に対応するために、ツールは毎年更新されています。これらの更新のおかげで、ツールは最も正確なデータを確実に提供し、持続可能なビジネス上の意思決定を推進し、さらに、既存および新たに発生する報告義務の遵守をサポートすることができるのです。2024年第4四半期には、Higg

Result 2:
 ズや優先事項に対応するために、ツールは毎年更新されています。これらの更新のおかげで、ツールは最も正確なデータを確実に提供し、持続可能なビジネス上の意思決定を推進し、さらに、既存および新たに発生する報告義務の遵守をサポートすることができるのです。2024年第4四半期には、Higg

Result 3:
 ズや優先事項に対応するために、ツールは毎年更新されています。これらの更新のおかげで、ツールは最も正確なデータを確実に提供し、持続可能なビジネス上の意思決定を推進し、さらに、既存および新たに発生する報告義務の遵守をサポートすることができるのです。2024年第4四半期には、Higg

Result 4:
 ズや優先事項に対応するために、ツールは毎年更新されています。これらの更新のおかげで、ツールは最も正確なデータを確実に提供し、持続可能なビジネス上の意思決定を推進し、さらに、既存および新たに発生する報告義務の遵守をサポートすることができるのです。2024年第4四半期には、Higg
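
The four results above are identical, which suggests the same chunk is stored several times in the collection. One plausible cause, stated here as an assumption, is that the ingestion cell was run more than once against the same `persist_directory`. Until the store is rebuilt, retrieved documents can be de-duplicated before they reach the prompt. The `Doc` class below is a stand-in for a LangChain document, and `dedupe_docs` is a hypothetical helper.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for a LangChain Document (only the fields used here)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def dedupe_docs(docs):
    """Drop documents whose page_content was already seen, keeping the first of each."""
    seen = set()
    unique = []
    for doc in docs:
        if doc.page_content not in seen:
            seen.add(doc.page_content)
            unique.append(doc)
    return unique


retrieved = [Doc("chunk A"), Doc("chunk A"), Doc("chunk B"), Doc("chunk A")]
unique = dedupe_docs(retrieved)
print([doc.page_content for doc in unique])
```

De-duplicating after retrieval reduces the number of distinct chunks the model sees, so a larger `k` on the retriever may be needed to compensate.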


In [41]:
vectorstore = Chroma(
    persist_directory="./chroma_db", 
    embedding_function=embeddings
)
retriever = vectorstore.as_retriever()


#### Import Additional Libraries

In [42]:
import openai
import datetime
from langsmith import traceable
from langsmith.wrappers import wrap_openai
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_google_genai import GoogleGenerativeAI
import google.generativeai as genai


genai.configure(api_key=os.getenv("GOOGLE_API_KEY"))

# Get today's date in YYYY-MM-DD format
today = datetime.datetime.now().strftime("%Y-%m-%d")

#### Define the RAG Model Class

In [None]:

class Rag:
    #def __init__(self, retriever, model: str = "gpt-4o"):
    def __init__(self, retriever, model: str = "gpt-4o-mini"):
    # def __init__(self, retriever, model: str = "gpt-3.5-turbo"):
        """
        Initialize the RAG (Retrieval-Augmented Generation) model.
        
        Args:
            retriever: The retriever object used to fetch relevant document chunks.
            model (str): The name of the LLM model to use (default: "gpt-4o-mini").
        """        
        self._retriever = retriever
        
        # Wrap the OpenAI client to enable tracing and logging
        self._client = wrap_openai(openai.Client())
        #self._client = genai.GenerativeModel(model)

        # Set the LLM model to use
        self._model = model

    #CONTEXT = "answer business questions based on provided documents"

    @traceable # Decorator to trace the function call for monitoring and debugging
    def get_answer(self, question: str):
        """
        Generate an answer to a question using the RAG model.
        
        Args:
            question (str): The question to answer.
        
        Returns:
            dict: A dictionary containing the generated answer and the contexts used.
        """    

        # Retrieve relevant document chunks based on the question    
        similar = self._retriever.invoke(question)
        
        response = self._client.chat.completions.create(
        #response = self._client.generate_content(
    
            model=self._model,
            messages=[
                {
                    "role": "system",
                    "content": "You are an accomplished AI working for the insights department at a company. Your job is to answer business questions based on provided documents. "
                    f"Today is {today}. "  # f-string so the date is actually interpolated
                               "Use the following docs to produce a concise answer to the user's question:\n"
                               f"## Docs\n\n{similar}\n\n"
                     """Please follow these steps to provide a response:

                    1. Carefully review the source material, paying attention to any information that is relevant to answering the question:
                        - Make extensive use of all information given.
                        - It is CRITICAL that you only use information that is explicitly stated above.
                        - Refrain from recommendations, speculation, or extrapolations.
                        - Compare and contrast findings from different sources. Watch out for any apparent conflicts between sources as well as any corroborating information.
                        - Make sure the context of the information given is applicable to the question, such as any specific country, category, or target group the question asks about.

                    2. Write the answer in a professional, well-structured format with headings and inline citations:
                        - Make sure to write a didactically well-structured answer for insights professionals and business stakeholders.
                        - Aim to provide a professional response that will help inform decision-making.
                        - Structure your answer with headings and use concise full text.
                        - Use tables in your response only when needed.
                        - Use lists and bullet points sparingly. Absolutely avoid nesting lists and bullet points.
                        - Ensure to include direct inline citations for all information referenced in your answer. Use a bracketed citation style with source reference like [XXX]. If there are multiple references, provide them in separate brackets each: [XXX][ZZZ]. MAKE SURE TO USE THE REFERENCES EXACTLY AS GIVEN IN THE <reference> TAGS ABOVE.
                    
                    3. **Language**: Respond in the SAME LANGUAGE as the user's question or the provided documents. If the question/documents are in Spanish, reply in Spanish. If in German, reply in German, etc.

                    Make sure to use Markdown in your response and structure it in this format:

                    <answer>
                    ...
                    </answer>
            """},
                {"role": "user", "content": question}

            ]
        )

        # Return the generated answer and the contexts used
        return {
            #"answer": response.text,
            "answer": response.choices[0].message.content, # Extract the generated answer
            "contexts": [str(doc) for doc in similar], # Convert document chunks to strings
        }
    
    


In [None]:
import datetime
from langsmith import traceable
from langchain_google_genai import ChatGoogleGenerativeAI
import google.generativeai as genai
from langchain_core.messages import HumanMessage, SystemMessage

genai.configure(api_key=os.getenv("GOOGLE_API_KEY"))

# Get today's date in YYYY-MM-DD format
today = datetime.datetime.now().strftime("%Y-%m-%d")

class Rag:
    def __init__(self, retriever, model: str = "gemini-2.0-flash"):
        """
        Initialize the RAG (Retrieval-Augmented Generation) model.

        Args:
            retriever: The retriever object used to fetch relevant document chunks.
            model (str): The name of the LLM model to use (default: "gemini-2.0-flash").
        """
        self._retriever = retriever

        # Initialize the Gemini model
        self._model = ChatGoogleGenerativeAI(model=model, temperature=0.0)   

    @traceable  # Decorator to trace the function call for monitoring and debugging
    def get_answer(self, question: str):
        """
        Generate an answer to a question using the RAG model.

        Args:
            question (str): The question to answer.

        Returns:
            dict: A dictionary containing the generated answer and the contexts used.
        """

        # Retrieve relevant document chunks based on the question
        similar = self._retriever.invoke(question)

        # Prepare the system message and user message for the chat model
        system_message = SystemMessage(
            content=f"""You are an accomplished AI working for the insights department at a company. Your job is to answer business questions based on provided documents.
Today is {today}.
Use the following docs to produce a concise answer to the user's question:
## Docs\n\n{similar}

Please follow these steps to provide a response:

1. Carefully review the source material, paying attention to any information that is relevant to answering the question:
    - Make extensive use of all information given.
    - It is CRITICAL that you only use information that is explicitly stated above.
    - Refrain from recommendations, speculation, or extrapolations.
    - Compare and contrast findings from different sources. Watch out for any apparent conflicts between sources as well as any corroborating information.
    - Make sure the context of the information given is applicable to the question, such as any specific country, category, or target group the question asks about.

2. Write the answer in a professional, well-structured format with headings and inline citations:
    - Make sure to write a didactically well-structured answer for insights professionals and business stakeholders.
    - Aim to provide a professional response that will help inform decision-making.
    - Structure your answer with headings and use concise full text.
    - Use tables in your response only when needed.
    - Use lists and bullet points sparingly. Absolutely avoid nesting lists and bullet points.
    - Ensure to include direct inline citations for all information referenced in your answer. Use a bracketed citation style with source reference like [XXX]. If there are multiple references, provide them in separate brackets each: [XXX][ZZZ]. MAKE SURE TO USE THE REFERENCES EXACTLY AS GIVEN IN THE <reference> TAGS ABOVE.

3. Language: Respond in the SAME LANGUAGE as the user's question or the provided documents. If the question/documents are in Spanish, reply in Spanish. If in German, reply in German, etc.

Make sure to use Markdown in your response and structure it in this format:

<answer>
...
</answer>
"""
        )
        user_message = HumanMessage(content=question)

        # Generate the response using the Gemini model
        response = self._model.invoke([system_message, user_message])

        # Return the generated answer and the contexts used
        return {
            "answer": response.content,  # Extract the generated answer
            "contexts": [str(doc) for doc in similar],  # Convert document chunks to strings
        }





In [None]:
class Rag:
    def __init__(self, retriever, model: str = "claude-opus-4-20250514"):
        """
        Initialize the RAG (Retrieval-Augmented Generation) model.

        Args:
            retriever: The retriever object used to fetch relevant document chunks.
            model (str): The name of the Anthropic model to use (default: "claude-opus-4-20250514").
        """
        self._retriever = retriever

        # Initialize the Anthropic model
        self._model = ChatAnthropic(model=model)

    @traceable  # Decorator to trace the function call for monitoring and debugging
    def get_answer(self, question: str):
        """
        Generate an answer to a question using the RAG model.

        Args:
            question (str): The question to answer.

        Returns:
            dict: A dictionary containing the generated answer and the contexts used.
        """

        # Retrieve relevant document chunks based on the question
        similar = self._retriever.invoke(question)

        # Prepare the system message and user message for the chat model
        system_message = SystemMessage(
            content=f"""You are an accomplished AI working for the insights department at a company. Your job is to answer business questions based on provided documents.
Today is {today}.
Use the following docs to produce a concise answer to the user's question:
## Docs\n\n{similar}

Please follow these steps to provide a response:

1. Carefully review the source material, paying attention to any information that is relevant to answering the question:
    - Make extensive use of all information given.
    - It is CRITICAL that you only use information that is explicitly stated above.
    - Refrain from recommendations, speculation, or extrapolations.
    - Compare and contrast findings from different sources. Watch out for any apparent conflicts between sources as well as any corroborating information.
    - Make sure the context of the information given is applicable to the question, such as any specific country, category, or target group the question asks about.

2. Write the answer in a professional, well-structured format with headings and inline citations:
    - Make sure to write a didactically well-structured answer for insights professionals and business stakeholders.
    - Aim to provide a professional response that will help inform decision-making.
    - Structure your answer with headings and use concise full text.
    - Use tables in your response only when needed.
    - Use lists and bullet points sparingly. Absolutely avoid nesting lists and bullet points.
    - Ensure to include direct inline citations for all information referenced in your answer. Use a bracketed citation style with source reference like [XXX]. If there are multiple references, provide them in separate brackets each: [XXX][ZZZ]. MAKE SURE TO USE THE REFERENCES EXACTLY AS GIVEN IN THE <reference> TAGS ABOVE.

3. Language: Respond in the SAME LANGUAGE as the user's question or the provided documents. If the question/documents are in Spanish, reply in Spanish. If in German, reply in German, etc.

Make sure to use Markdown in your response and structure it in this format:

<answer>
...
</answer>
"""
        )
        user_message = HumanMessage(content=question)

        # Generate the response using the Anthropic model
        response = self._model.invoke([system_message, user_message])

        # Return the generated answer and the contexts used
        return {
            "answer": response.content,  # Extract the generated answer
            "contexts": [str(doc) for doc in similar],  # Convert document chunks to strings
        }
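
All three `Rag` variants interpolate `{similar}` directly, which puts the Python repr of a list of documents into the prompt, while the instructions refer to `<reference>` tags. A small formatter can make the two agree. This is a sketch under assumptions: `Doc` is a stand-in for a LangChain document, `format_docs` is a hypothetical helper, and the `"source"` metadata key and the `doc-N` fallback ids are illustrative choices, not something the notebook defines.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for a LangChain Document (only the fields used here)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs, id_key="source"):
    """Render each document as a <reference>-tagged block for the system prompt."""
    blocks = []
    for index, doc in enumerate(docs, start=1):
        # Fall back to a positional id when the metadata has no usable key.
        ref = doc.metadata.get(id_key, f"doc-{index}")
        blocks.append(f"<reference>{ref}</reference>\n{doc.page_content.strip()}")
    return "\n\n".join(blocks)


example_docs = [
    Doc("Green tea is light green.", {"source": "tea.txt"}),
    Doc("  Fresh dog food is popular. "),
]
context = format_docs(example_docs)
print(context)
```

With a helper like this, the prompt would interpolate `format_docs(similar)` in place of `similar`, so the citations the model is asked to produce correspond to tags it can actually see.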
    

#### Initialize the RAG Model

In [None]:
# Example usage (Ensure `retriever` is properly initialized)
rag = Rag(retriever) 

#### Generate an Answer

In [None]:
response = rag.get_answer("what is the colour of green tea from the document")
print("Generated Answer:", response["answer"])
print("Contexts Used:", response["contexts"])

Generated Answer: <answer>
# The Color of Green Tea

## Primary Color Characteristics

Green tea is characterized by its **light green to yellow-green color** [439e0944-9e01-4b7a-9ed8-be74ec0f343b]. This distinctive coloration is primarily due to the presence of chlorophyll, which is retained during the processing of the tea leaves [439e0944-9e01-4b7a-9ed8-be74ec0f343b].

## Color Variations

The specific shade of green tea can vary depending on several factors [439e0944-9e01-4b7a-9ed8-be74ec0f343b]:
- Type of green tea
- Processing methods used
- Some varieties may appear more yellowish

## Processing Impact on Color

Unlike black tea, which undergoes full oxidation and turns dark, green tea is minimally processed, allowing it to maintain its **vibrant green hue** [439e0944-9e01-4b7a-9ed8-be74ec0f343b]. This minimal processing is crucial in preserving the characteristic green color of the tea.

## Additional Color References

The documents also mention beverages containing green tea i
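
The system prompt asks the model to wrap its reply in `<answer>` tags, and the output above shows those tags in the returned text. If downstream evaluators should see only the answer body, the tags can be stripped first. The helper below is a hypothetical sketch; it assumes the answer is a plain string and tolerates a missing closing tag, as happens when a response is cut off.

```python
import re

# Capture everything after <answer> up to </answer>, or to the end of the
# text when the closing tag is missing (for example, a truncated response).
ANSWER_RE = re.compile(r"<answer>(.*?)(?:</answer>|\Z)", re.DOTALL)


def extract_answer(raw):
    """Return the text inside <answer> tags, or the stripped input if no tag is found."""
    match = ANSWER_RE.search(raw)
    return match.group(1).strip() if match else raw.strip()


complete = extract_answer("<answer>\n# Title\nBody\n</answer>")
truncated = extract_answer("<answer>\npartial text")
untagged = extract_answer("  no tags here  ")
print(repr(complete), repr(truncated), repr(untagged))
```

Some chat model integrations can return the message content as a list of content blocks rather than a string; in that case the text would need to be joined before calling this helper.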

#### Define Evaluation Functions

In [None]:
# ------------
def predict_rag_answer(example: dict):
    """Use this for answer evaluation"""
    response = rag.get_answer(example["question"])
    return {"answer": response["answer"]}

def predict_rag_answer_with_context(example: dict):
    """Use this for evaluation of retrieved documents and hallucinations"""
    response = rag.get_answer(example["question"])
    return {"answer": response["answer"], "contexts": response["contexts"]}

#### Define the QA Evaluator

In [61]:
from langsmith.evaluation import LangChainStringEvaluator, evaluate

# Define the QA evaluator
qa_evaluator = [
    LangChainStringEvaluator(
        "cot_qa", # Use the Chain-of-Thought QA evaluator
        prepare_data=lambda run, example: {
            "prediction": run.outputs["answer"], # Predicted answer from the RAG model
            "reference": example.outputs["answer"], # Ground truth answer
            "input": example.inputs["question"], # Input question
        }
    )
]

In [62]:
# Grade output schema
from typing_extensions import Annotated, TypedDict
from langchain_openai import ChatOpenAI

class RelevanceGrade(TypedDict):
    explanation: Annotated[str, ..., "Explain your reasoning for the score"]
    relevant: Annotated[bool, ..., "Provide the score on whether the answer addresses the question"]

# Grade prompt
relevance_instructions="""You are a teacher grading a quiz. 

You will be given a QUESTION and a set of FACTS provided by the student. 

Here is the grade criteria to follow:
(1) Your goal is to identify FACTS that are completely unrelated to the QUESTION
(2) If the facts contain ANY keywords or semantic meaning related to the question, consider them relevant
(3) It is OK if the facts have SOME information that is unrelated to the question, as long as (2) is met

Score:
A score of 1 means that the FACTS contain ANY keywords or semantic meaning related to the QUESTION and are therefore relevant. This is the highest (best) score. 
A score of 0 means that the FACTS are completely unrelated to the QUESTION. This is the lowest possible score you can give.

Explain your reasoning in a step-by-step manner to ensure your reasoning and conclusion are correct. 

Avoid simply stating the correct answer at the outset."""



# Grader LLM
#relevance_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0).with_structured_output(RelevanceGrade, method="json_schema", strict=True)
#relevance_llm = ChatGoogleGenerativeAI(model= "gemini-2.0-flash", temperature=0).with_structured_output(RelevanceGrade, method="json_schema", strict=True)
relevance_llm = ChatAnthropic(model= "claude-opus-4-20250514", temperature=0).with_structured_output(RelevanceGrade, method="json_schema", strict=True)

# Evaluator
def relevance(inputs: dict, outputs: dict) -> bool:
    """A simple LLM-graded evaluator that checks whether the RAG answer is relevant to the question."""
    answer = f"QUESTION: {inputs['question']}\nSTUDENT ANSWER: {outputs['answer']}"
    grade = relevance_llm.invoke([
        {"role": "system", "content": relevance_instructions}, 
        {"role": "user", "content": answer}
    ])
    return grade["relevant"]


In [None]:
from langsmith.evaluation import LangChainStringEvaluator, evaluate

# Define the dataset name
dataset_name = "mls-deepsight-eval"


# Evaluator
cot_qa_evaluator = LangChainStringEvaluator(
        "cot_qa",
        prepare_data=lambda run, example: {
            "prediction": run.outputs["answer"],
            "reference": example.outputs["answer"],
            "input": example.inputs["question"],
        }
    )


experiment_results = evaluate(
    predict_rag_answer_with_context,
    data=dataset_name,
    evaluators= [cot_qa_evaluator, relevance],
    experiment_prefix="claude-Opus-4",
    num_repetitions=5,
)



View the evaluation results for experiment: 'claude-Opus-4-ee4e74f6' at:
https://smith.langchain.com/o/becb2dbe-6434-46cc-a3fb-97dca03fd3ef/datasets/0602a3af-66bf-4930-9881-f7cc04698b95/compare?selectedSessions=18e13b5b-1b32-4dc4-a05f-123755d2325c





Error running evaluator <DynamicRunEvaluator relevance> on run 8b6251b6-d0d5-4bef-970d-013607c7d5c6: InternalServerError("Error code: 500 - {'type': 'error', 'error': {'type': 'api_error', 'message': 'Internal server error'}}")
Traceback (most recent call last):
  File "/Users/lilian/Documents/LLM Eval Project 2/.venv/lib/python3.13/site-packages/langsmith/evaluation/_runner.py", line 1627, in _run_evaluators
    evaluator_response = evaluator.evaluate_run(  # type: ignore[call-arg]
        run=run,
        example=example,
        evaluator_run_id=evaluator_run_id,
    )
  File "/Users/lilian/Documents/LLM Eval Project 2/.venv/lib/python3.13/site-packages/langsmith/evaluation/evaluator.py", line 343, in evaluate_run
    result = self.func(
        run,
        example,
        langsmith_extra={"run_id": evaluator_run_id, "metadata": metadata},
    )
  File "/Users/lilian/Documents/LLM Eval Project 2/.venv/lib/python3.13/site-packages/langsmith/evaluation/evaluator.py", line 741, in wr
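
The traceback above shows the `relevance` evaluator failing on an `InternalServerError` (HTTP 500) from the grader model's API, which is typically transient. A retry with exponential backoff around the grader call is one way to keep a single server error from leaving a run unscored. The sketch below is generic Python, not a LangSmith or LangChain API: `with_retries` and `flaky_grader` are hypothetical names, and the failure is simulated.

```python
import time


def with_retries(func, max_attempts=3, base_delay=1.0,
                 retry_on=(Exception,), sleep=time.sleep):
    """Wrap func so that listed exceptions are retried with exponential backoff."""
    def wrapper(*args, **kwargs):
        for attempt in range(1, max_attempts + 1):
            try:
                return func(*args, **kwargs)
            except retry_on:
                if attempt == max_attempts:
                    raise  # out of attempts: surface the original error
                sleep(base_delay * 2 ** (attempt - 1))
    return wrapper


calls = {"count": 0}


def flaky_grader(prompt):
    """Simulated grader call that fails twice before succeeding."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("simulated 500 Internal Server Error")
    return True


delays = []  # record the backoff delays instead of actually sleeping
safe_grader = with_retries(flaky_grader, max_attempts=4, base_delay=0.5,
                           sleep=delays.append)
result = safe_grader("Is this relevant?")
print(result, calls["count"], delays)
```

In the notebook, the wrapper would go around the `relevance_llm.invoke` call inside `relevance` rather than around `relevance` itself, because LangSmith appears to dispatch on the evaluator's argument names and a generic `*args` wrapper would hide them. LangChain runnables also offer a built-in `with_retry` method that may serve the same purpose.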

In [None]:
experiment_results = evaluate(
    predict_rag_answer_with_context,
    data=dataset_name,
    evaluators= [cot_qa_evaluator, relevance],
    experiment_prefix="gpt-4o-mini",
    num_repetitions=5,
)

In [None]:
experiment_results = evaluate(
    predict_rag_answer_with_context,
    data=dataset_name,
    evaluators= [cot_qa_evaluator, relevance],
    experiment_prefix="gemini-2.0-flash",
    num_repetitions=4,
)
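
With several repetitions per experiment, a per-evaluator average is a natural way to compare models. The sketch below aggregates rows of a hypothetical shape (`experiment`, `evaluator`, `score`); this is not the structure LangSmith returns, and the experiment names and scores are made-up placeholders, not results from this notebook.

```python
from collections import defaultdict
from statistics import mean


def summarize(rows):
    """Average the score for each (experiment, evaluator) pair."""
    scores = defaultdict(list)
    for row in rows:
        scores[(row["experiment"], row["evaluator"])].append(float(row["score"]))
    return {key: round(mean(values), 3) for key, values in scores.items()}


# Placeholder rows for illustration only.
rows = [
    {"experiment": "model-a", "evaluator": "cot_qa", "score": 1},
    {"experiment": "model-a", "evaluator": "cot_qa", "score": 0},
    {"experiment": "model-a", "evaluator": "cot_qa", "score": 1},
    {"experiment": "model-b", "evaluator": "relevance", "score": True},
    {"experiment": "model-b", "evaluator": "relevance", "score": True},
]
summary = summarize(rows)
for key, value in summary.items():
    print(key, value)
```

Because the experiments above use different `num_repetitions` values (5 and 4), averaging per experiment keeps the comparison fair where raw sums would not.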
