## **LLM Evaluation Framework**
This notebook sets up an evaluation infrastructure for testing LLM performance in a Retrieval-Augmented Generation (RAG) pipeline.
### **Key Components**
- **Vector Database:** ChromaDB for storing and retrieving documents.
- **Embedding Model:** OpenAIEmbeddings for converting text into vector space.
- **Retriever:** Queries the vector database for relevant document snippets.
- **LLM (OpenAI GPT models, Gemini, Claude):** Generates responses based on retrieved contexts.
- **Evaluation Setup:** LangSmith evaluators (QA correctness and relevance) to grade LLM outputs.
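
To make the data flow between these components concrete, here is a minimal, self-contained sketch. Everything in it is a toy stand-in and an assumption for illustration: `Snippet`, `retrieve`, `generate`, and `evaluate_answer` are hypothetical names, the word-overlap ranking only imitates vector search, and no real LLM is called.

```python
# Toy stand-ins for the components listed above; none of this is the notebook's real code.
from dataclasses import dataclass


@dataclass
class Snippet:
    text: str


def retrieve(question: str, store: list[Snippet], k: int = 2) -> list[Snippet]:
    """Rank snippets by how many question words they share (a crude stand-in for vector search)."""
    q_words = set(question.lower().split())
    scored = sorted(store, key=lambda s: len(q_words & set(s.text.lower().split())), reverse=True)
    return scored[:k]


def generate(question: str, contexts: list[Snippet]) -> str:
    """Placeholder for the LLM call: simply echoes the best-matching context."""
    return contexts[0].text if contexts else ""


def evaluate_answer(answer: str, reference: str) -> bool:
    """Toy evaluator: passes when the reference string appears in the answer."""
    return reference.lower() in answer.lower()


store = [Snippet("Food prices rose sharply last year"), Snippet("Dogs need balanced diets")]
contexts = retrieve("What happened to food prices", store)
answer = generate("What happened to food prices", contexts)
print(evaluate_answer(answer, "food prices"))
```

The real notebook swaps each function for a heavier component (Chroma retriever, chat model, LangSmith evaluator), but the hand-off between the three steps has the same shape.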


In [1]:
! pip install langchain-community langchain-openai chromadb

Collecting langchain-community
  Using cached langchain_community-0.3.24-py3-none-any.whl.metadata (2.5 kB)
Collecting langchain-openai
  Downloading langchain_openai-0.3.18-py3-none-any.whl.metadata (2.3 kB)
Collecting chromadb
  Using cached chromadb-1.0.10-cp39-abi3-macosx_11_0_arm64.whl.metadata (6.9 kB)
Collecting langchain-core<1.0.0,>=0.3.59 (from langchain-community)
  Downloading langchain_core-0.3.61-py3-none-any.whl.metadata (5.8 kB)
Collecting langchain<1.0.0,>=0.3.25 (from langchain-community)
  Using cached langchain-0.3.25-py3-none-any.whl.metadata (7.8 kB)
Collecting SQLAlchemy<3,>=1.4 (from langchain-community)
  Using cached sqlalchemy-2.0.41-cp313-cp313-macosx_11_0_arm64.whl.metadata (9.6 kB)
Collecting requests<3,>=2 (from langchain-community)
  Using cached requests-2.32.3-py3-none-any.whl.metadata (4.6 kB)
Collecting PyYAML>=5.3 (from langchain-community)
  Using cached PyYAML-6.0.2-cp313-cp313-macosx_11_0_arm64.whl.metadata (2.1 kB)
Collecting aiohttp<4.0.0,>=3.8

In [3]:
! python3 -m pip install pexpect




In [4]:
! pip install openpyxl

Collecting openpyxl
  Using cached openpyxl-3.1.5-py2.py3-none-any.whl.metadata (2.5 kB)
Collecting et-xmlfile (from openpyxl)
  Using cached et_xmlfile-2.0.0-py3-none-any.whl.metadata (2.7 kB)
Using cached openpyxl-3.1.5-py2.py3-none-any.whl (250 kB)
Using cached et_xmlfile-2.0.0-py3-none-any.whl (18 kB)
Installing collected packages: et-xmlfile, openpyxl
Successfully installed et-xmlfile-2.0.0 openpyxl-3.1.5


In [5]:
! pip install google.generativeai --upgrade


Collecting google.generativeai
  Using cached google_generativeai-0.8.5-py3-none-any.whl.metadata (3.9 kB)
Collecting google-ai-generativelanguage==0.6.15 (from google.generativeai)
  Using cached google_ai_generativelanguage-0.6.15-py3-none-any.whl.metadata (5.7 kB)
Collecting google-api-core (from google.generativeai)
  Using cached google_api_core-2.24.2-py3-none-any.whl.metadata (3.0 kB)
Collecting google-api-python-client (from google.generativeai)
  Using cached google_api_python_client-2.169.0-py3-none-any.whl.metadata (6.7 kB)
Collecting google-api-core (from google.generativeai)
  Downloading google_api_core-2.25.0rc1-py3-none-any.whl.metadata (3.0 kB)
Collecting proto-plus<2.0.0dev,>=1.22.3 (from google-ai-generativelanguage==0.6.15->google.generativeai)
  Using cached proto_plus-1.26.1-py3-none-any.whl.metadata (2.2 kB)
Collecting grpcio-status<2.0.0,>=1.33.2 (from google-api-core[grpc]!=2.0.*,!=2.1.*,!=2.10.*,!=2.2.*,!=2.3.*,!=2.4.*,!=2.5.*,!=2.6.*,!=2.7.*,!=2.8.*,!=2.9.*,<

In [6]:
! pip install langchain-community langchain-openai chromadb --upgrade



In [8]:
! pip install chromadb

error: externally-managed-environment

× This environment is externally managed
╰─> To install Python packages system-wide, try brew install
    xyz, where xyz is the package you are trying to
    install.

    If you wish to install a Python library that isn't in Homebrew,
    use a virtual environment:

    python3 -m venv path/to/venv
    source path/to/venv/bin/activate
    python3 -m pip install xyz

    If you wish to install a Python application that isn't in Homebrew,
    it may be easiest to use 'pipx install xyz', which will manage a
    virtual environment for you. You can install pipx with

    brew install pipx

    You may restore the old behavior of pip by passing
    the '--break-system-packages' flag to pip, or by adding
    'break-system-packag

In [7]:
%pip install -qU langchain-google-genai

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-generativeai 0.8.5 requires google-ai-generativelanguage==0.6.15, but you have google-ai-generativelanguage 0.6.18 which is incompatible.
Note: you may need to restart the kernel to use updated packages.


In [10]:
! python3 -m pip install --upgrade pip

error: externally-managed-environment

× This environment is externally managed
╰─> To install Python packages system-wide, try brew install
    xyz, where xyz is the package you are trying to
    install.

    If you wish to install a Python library that isn't in Homebrew,
    use a virtual environment:

    python3 -m venv path/to/venv
    source path/to/venv/bin/activate
    python3 -m pip install xyz

    If you wish to install a Python application that isn't in Homebrew,
    it may be easiest to use 'pipx install xyz', which will manage a
    virtual environment for you. You can install pipx with

    brew install pipx

    You may restore the old behavior of pip by passing
    the '--break-system-packages' flag to pip, or by adding
    'break-system-packag

In [9]:
! pip install -U langchain-openai




In [8]:
%pip install langchain_openai langchain_core

Note: you may need to restart the kernel to use updated packages.


In [20]:
%pip install langchain_chroma


Collecting langchain_chroma
  Downloading langchain_chroma-0.2.4-py3-none-any.whl.metadata (1.1 kB)
Downloading langchain_chroma-0.2.4-py3-none-any.whl (11 kB)
Installing collected packages: langchain_chroma
Successfully installed langchain_chroma-0.2.4
Note: you may need to restart the kernel to use updated packages.


In [11]:
! pip show langchain chromadb langchain-openai


Name: langchain
Version: 0.3.25
Summary: Building applications with LLMs through composability
Home-page: 
Author: 
Author-email: 
License: MIT
Location: /Users/lilian/Documents/LLM Eval Project 2/.venv/lib/python3.13/site-packages
Requires: langchain-core, langchain-text-splitters, langsmith, pydantic, PyYAML, requests, SQLAlchemy
Required-by: langchain-community
---
Name: chromadb
Version: 1.0.10
Summary: Chroma.
Home-page: https://github.com/chroma-core/chroma
Author: 
Author-email: Jeff Huber <jeff@trychroma.com>, Anton Troynikov <anton@trychroma.com>
License: 
Location: /Users/lilian/Documents/LLM Eval Project 2/.venv/lib/python3.13/site-packages
Requires: bcrypt, build, fastapi, grpcio, httpx, importlib-resources, jsonschema, kubernetes, mmh3, numpy, onnxruntime, opentelemetry-api, opentelemetry-exporter-otlp-proto-grpc, opentelemetry-instrumentation-fastapi, opentelemetry-sdk, orjson, overrides, posthog, pydantic, pypika, pyyaml, rich, tenacity, tokenizers, tqdm, typer, typing-e
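
The install cells above mix `! pip` and `%pip`, and two of them fail with `externally-managed-environment`. In Jupyter, `%pip` installs into the environment of the running kernel, while `! pip` runs whichever `pip` is first on the shell `PATH`, which may be the Homebrew-managed Python. The following is a small sketch for checking which interpreter the kernel uses and which versions it actually sees; the helper names `in_virtualenv` and `installed_version` are hypothetical, and the package list is only an example.

```python
import sys
from importlib import metadata


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv (prefix differs from the base install)."""
    return sys.prefix != sys.base_prefix


def installed_version(package: str):
    """Return the installed version string, or None when the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


print("interpreter:", sys.executable)
print("inside a virtual environment:", in_virtualenv())
for name in ("langchain", "chromadb", "langchain-openai", "google-ai-generativelanguage"):
    print(f"{name}: {installed_version(name)}")
```

If the interpreter path points outside `.venv`, selecting the `.venv` kernel and using `%pip` consistently is likely the simplest way to avoid the error.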

#### Import Required Libraries

In [2]:
import os 
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import TextLoader
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_anthropic import ChatAnthropic

In [3]:
from dotenv import load_dotenv

# Load environment variables from .env
#load_dotenv(dotenv_path='.env', override=True)
load_dotenv()

True

In [4]:
from langsmith import utils
utils.tracing_is_enabled()

True

In [5]:
# Mirror LANGSMITH_* settings into the LANGCHAIN_* names; os.environ rejects None, so unset values are skipped.
for suffix, target in [("TRACING", "LANGCHAIN_TRACING_V2"), ("API_KEY", "LANGCHAIN_API_KEY"),
                       ("PROJECT", "LANGCHAIN_PROJECT"), ("ENDPOINT", "LANGCHAIN_ENDPOINT")]:
    value = os.getenv(f"LANGSMITH_{suffix}")
    if value is not None:
        os.environ[target] = value

#### Load the Document

In [13]:
# # Define the path to the document
# doc_path = os.getcwd()
# dir = os.path.dirname(os.path.abspath(doc_path))
# file_path = os.path.join(dir, "docs", "answers.txt")

# # Load the document using TextLoader
# loader = TextLoader(file_path)
# documents = loader.load()   


#### Split the Document into Chunks


In [20]:
# Split the combined text into chunks of 1000 characters with a 200-character overlap.
# Note: `df` is created in the Excel-loading cell further down; run that cell first.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
documents = text_splitter.create_documents(df["combined_text"].tolist())


NameError: name 'df' is not defined
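
To illustrate what `chunk_size` and `chunk_overlap` mean, here is a deliberately simplified sliding-window splitter. It is not how `RecursiveCharacterTextSplitter` works internally (that class prefers to break on separators such as paragraphs and sentences); `sliding_chunks` is a hypothetical helper that only shows how consecutive chunks share their boundary text.

```python
def sliding_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Cut text into fixed-size windows where each window repeats the tail of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


sample = "abcdefghij" * 3  # 30 characters
chunks = sliding_chunks(sample, chunk_size=12, chunk_overlap=4)
for chunk in chunks:
    print(len(chunk), chunk)
```

A larger overlap reduces the chance that a sentence relevant to a query is cut in half, at the cost of more near-duplicate text in the store.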

#### Embed the Document Chunks

In [26]:
# # Embed the document chunks into a vector store using OpenAI embeddings
# vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())

# # Create a retriever to fetch relevant document chunks based on queries
# retriever = vectorstore.as_retriever()

In [None]:
import os
import pandas as pd
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings  

# Load your Excel
file_path = "/Users/lilian/Documents/LLM Eval Project 2/data/main_report-with-results.xlsx"
df = pd.read_excel(file_path)

# Combine your text columns
text_columns = ["snsTexts", "srsTexts", "avsTexts", "stsTexts"]
df[text_columns] = df[text_columns].fillna("")
df["combined_text"] = df[text_columns].agg(" ".join, axis=1)

# Split into chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
documents = splitter.create_documents(df["combined_text"].tolist())

# Define persist directory. Note: rerunning this cell adds the same chunks to the existing store again.
persist_dir = "./chroma_db"
# import shutil; shutil.rmtree(persist_dir, ignore_errors=True)  # uncomment to rebuild from scratch

# Embed and persist Chroma store
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(
    documents,
    embedding=embeddings,
    persist_directory=persist_dir
)
vectorstore.persist()  # Deprecated: Chroma 0.4+ persists automatically when persist_directory is set
retriever = vectorstore.as_retriever()

# Query test (the Japanese query asks: "What are the latest trends in sustainability?")
query = "サステナビリティに関する最新のトレンドは何ですか?"
docs = retriever.get_relevant_documents(query)

# Print results
for i, doc in enumerate(docs):
    print(f"\nResult {i+1}:\n", doc.page_content[:300])

  vectorstore.persist()  # VERY IMPORTANT
  docs = retriever.get_relevant_documents(query)



Result 1:
 marketing, here are four of the biggest trends we can expect to shape the retail industry, and your business, over the next year.\nMORE FOR YOU\nSustainability and green fatigue\nForbes Daily: Get our best stories, exclusive reporting and essential analysis of the day’s news in your inbox every we

Result 2:
 marketing, here are four of the biggest trends we can expect to shape the retail industry, and your business, over the next year.\nMORE FOR YOU\nSustainability and green fatigue\nForbes Daily: Get our best stories, exclusive reporting and essential analysis of the day’s news in your inbox every we

Result 3:
 marketing, here are four of the biggest trends we can expect to shape the retail industry, and your business, over the next year.\nMORE FOR YOU\nSustainability and green fatigue\nForbes Daily: Get our best stories, exclusive reporting and essential analysis of the day’s news in your inbox every we

Result 4:
 and areas of focus��� or risk being left in the p
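
Results 1 to 3 above are identical. A likely cause (an assumption, not confirmed by the output) is that the cell was run more than once against the same `./chroma_db` directory, so `Chroma.from_documents` stored the same chunks repeatedly. As a query-time workaround, duplicates can be filtered out. The sketch below uses a hypothetical `Doc` stand-in for LangChain's `Document` and a hypothetical `dedupe_docs` helper.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Minimal stand-in for a LangChain Document (only page_content is modelled)."""
    page_content: str


def dedupe_docs(docs):
    """Keep the first occurrence of each distinct page_content, preserving rank order."""
    seen = set()
    unique = []
    for doc in docs:
        key = doc.page_content.strip()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


hits = [Doc("Sustainability and green fatigue"), Doc("Sustainability and green fatigue"), Doc("areas of focus")]
unique_hits = dedupe_docs(hits)
print(len(hits), "->", len(unique_hits))
```

Clearing the persist directory before rebuilding addresses the cause rather than the symptom. Requesting diverse results with `vectorstore.as_retriever(search_type="mmr")` may also help, though that has not been tested against this data.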

In [1]:
# Query test
query = "how do young adults react to change in food prices?"
docs = retriever.get_relevant_documents(query)

# Print results
for i, doc in enumerate(docs):
    print(f"\nResult {i+1}:\n", doc.page_content[:300])

NameError: name 'retriever' is not defined

In [None]:
import os
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

# Load text content from the file
file_path = "/Users/lilian/Documents/LLM Eval Project 2/data/merged_output.txt"
with open(file_path, "r", encoding="utf-8") as file:
    full_text = file.read()

# Split into chunks using LangChain's text splitter
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
documents = splitter.create_documents([full_text])

# Define Chroma persist directory
persist_dir = "./chroma_db"

# Create embeddings and store in Chroma
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(
    documents,
    embedding=embeddings,
    persist_directory=persist_dir
)
#vectorstore.persist()

# Get retriever from vectorstore
retriever = vectorstore.as_retriever()

# Example query
query = "Dog food?"
docs = retriever.get_relevant_documents(query)

# Print top results
for i, doc in enumerate(docs):
    print(f"\nResult {i+1}:\n{doc.page_content[:300]}")


  vectorstore.persist()
  docs = retriever.get_relevant_documents(query)
Python(64514) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.



Result 1:
medicine at Small Door Veterinary. You don't need to break the bank to feed your dog a healthy diet.\nWe've found plenty of high-quality foods available at lower prices. As long as a food meets the AAFCO complete and balanced standards like our picks and makes sense for your dog's life stage, you're

Result 2:
chicken and fish recipes are for all life stages, including puppies, and the turkey, beef, venison, and lamb options are for adult dogs. You can also order recipes for sensitive stomachs and joint support, a rotation of seasonal and limited-time recipes, and prescription diets formulated by in-house

Result 3:
chicken and fish recipes are for all life stages, including puppies, and the turkey, beef, venison, and lamb options are for adult dogs. You can also order recipes for sensitive stomachs and joint support, a rotation of seasonal and limited-time recipes, and prescription diets formulated by in-house

Result 4:
The chicken and fish recipes are for all life stages,

In [3]:
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

persist_dir = "./chroma_db"
embeddings = OpenAIEmbeddings()

# Load existing store
vectorstore = Chroma(persist_directory=persist_dir, embedding_function=embeddings)
retriever = vectorstore.as_retriever()

query = "according to the document how do young adults react to change in food prices?"
docs = retriever.invoke(query)  # or .get_relevant_documents() if < langchain-core 0.1.46

for i, doc in enumerate(docs):
    print(f"\nResult {i+1}:\n{doc.page_content[:300]}")



Result 1:
One in three consumers perceived food prices as increasing a little, while 51% said they’ve gone up a lot since last year. Gen Z (36%) and Millennial (39%) consumers were more likely to report needing to pull money from savings or borrow money to afford food within the last 12 months (compared to 28

Result 2:
at the top. One in three consumers perceived food prices as increasing a little, while 51% said they’ve gone up a lot since last year.\nGen Z (36%) and Millennial (39%) consumers were more likely to report needing to pull money from savings or borrow money to afford food within the last 12 months (c

Result 3:
the May 2024 Consumer Food Insights Report from Purdue University. Far fewer consumers put expenses like housing (10%) and utilities (10%) at the top. One in three consumers perceived food prices as increasing a little, while 51% said they’ve gone up a lot since last year. Gen Z (36%) and Millennial

Result 4:
the May 2024 Consumer Food Insights Report from Purdu

#### Import Additional Libraries

In [4]:
vectorstore = Chroma(
    persist_directory="./chroma_db", 
    embedding_function=embeddings
)
retriever = vectorstore.as_retriever()


In [7]:
import openai
import datetime
import os
from langsmith import traceable
from langsmith.wrappers import wrap_openai
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_google_genai import GoogleGenerativeAI
import google.generativeai as genai


genai.configure(api_key=os.getenv("GOOGLE_API_KEY"))

# Get today's date in YYYY-MM-DD format
today = datetime.datetime.now().strftime("%Y-%m-%d")

#### Define the RAG Model Class

In [None]:

class Rag:
    #def __init__(self, retriever, model: str = "gpt-4o"):
    def __init__(self, retriever, model: str = "gpt-4o-mini"):
    #def __init__(self, retriever, model: str = "gpt-3.5-turbo"):
        """
        Initialize the RAG (Retrieval-Augmented Generation) model.
        
        Args:
            retriever: The retriever object used to fetch relevant document chunks.
            model (str): The name of the LLM model to use (default: "gpt-4o-mini").
        """        
        self._retriever = retriever
        
        # Wrap the OpenAI client to enable tracing and logging
        self._client = wrap_openai(openai.Client())
        #self._client = genai.GenerativeModel(model)

        # Set the LLM model to use
        self._model = model

    #CONTEXT = "answer business questions based on provided documents"

    @traceable # Decorator to trace the function call for monitoring and debugging
    def get_answer(self, question: str):
        """
        Generate an answer to a question using the RAG model.
        
        Args:
            question (str): The question to answer.
        
        Returns:
            dict: A dictionary containing the generated answer and the contexts used.
        """    

        # Retrieve relevant document chunks based on the question    
        similar = self._retriever.invoke(question)
        
        response = self._client.chat.completions.create(
        #response = self._client.generate_content(
    
            model=self._model,
            messages=[
                {
                    "role": "system",
                    "content": "You are an accomplished AI working for the insights department at a company. Your job is to answer business questions based on provided documents. "
                    f"Today is {today}."  # f-string, so the date is actually interpolated
                               " Use the following docs to produce a concise answer to the user's question:\n"
                               f"## Docs\n\n{similar}\n\n"
                     """Please follow these steps to provide a response:

                    1. Carefully review the source material, paying attention to any information that is relevant to answering the question:
                        - Make extensive use of all information given.
                        - It is CRITICAL that you only use information that is explicitly stated above.
                        - Refrain from recommendations, speculation, or extrapolations.
                        - Compare and contrast findings from different sources. Watch out for any apparent conflicts between sources as well as any corroborating information.
                        - Make sure the context of the information given is applicable to the question, such as any specific country, category, or target group the question asks about.

                    2. Write the answer in a professional, well-structured format with headings and inline citations:
                        - Make sure to write a didactically well-structured answer for insights professionals and business stakeholders.
                        - Aim to provide a professional response that will help inform decision-making.
                        - Structure your answer with headings and use concise full text.
                        - Use tables in your response only when needed.
                        - Use lists and bullet points sparingly. Absolutely avoid nesting lists and bullet points.
                        - Ensure to include direct inline citations for all information referenced in your answer. Use a bracketed citation style with source reference like [XXX]. If there are multiple references, provide them in separate brackets each: [XXX][ZZZ]. MAKE SURE TO USE THE REFERENCES EXACTLY AS GIVEN IN THE <reference> TAGS ABOVE.
                    
                    3. **Language**: Respond in the SAME LANGUAGE as the user's question or the provided documents. If the question/documents are in Spanish, reply in Spanish. If in German, reply in German, etc.

                    Make sure to use Markdown in your response and structure it in this format:

                    <answer>
                    ...
                    </answer>
            """},
                {"role": "user", "content": question}

            ]
        )

        # Return the generated answer and the contexts used
        return {
            #"answer": response.text,
            "answer": response.choices[0].message.content, # Extract the generated answer
            "contexts": [str(doc) for doc in similar], # Convert document chunks to strings
        }
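
The system prompt asks the model to cite references "exactly as given in the `<reference>` tags", but the retrieved documents are interpolated as the raw Python list representation, which contains no such tags. One way to close that gap is to format each document explicitly before building the prompt. This is a sketch under assumptions: `Doc` is a stand-in for LangChain's `Document`, `format_docs` is a hypothetical helper, and the `<doc>` wrapper is an illustrative choice.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Stand-in for a retrieved document; `id` plays the role of the citation key."""
    id: str
    page_content: str


def format_docs(docs) -> str:
    """Render each document with an explicit <reference> tag the model can cite."""
    blocks = []
    for doc in docs:
        blocks.append(f"<doc>\n<reference>{doc.id}</reference>\n{doc.page_content}\n</doc>")
    return "\n\n".join(blocks)


context = format_docs([Doc("a1", "Food prices rose."), Doc("b2", "Gen Z borrowed more.")])
print(context)
```

In `get_answer`, the formatted string would take the place of `{similar}` in the prompt, while `contexts` in the return value could stay unchanged.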


In [None]:
import datetime
from langsmith import traceable
from langchain_google_genai import ChatGoogleGenerativeAI
import google.generativeai as genai
from langchain_core.messages import HumanMessage, SystemMessage

genai.configure(api_key=os.getenv("GOOGLE_API_KEY"))

# Get today's date in YYYY-MM-DD format
today = datetime.datetime.now().strftime("%Y-%m-%d")

class Rag:
    def __init__(self, retriever, model: str = "gemini-2.0-flash"):
        """
        Initialize the RAG (Retrieval-Augmented Generation) model.

        Args:
            retriever: The retriever object used to fetch relevant document chunks.
            model (str): The name of the LLM model to use (default: "gemini-2.0-flash").
        """
        self._retriever = retriever

        # Initialize the Gemini model
        self._model = ChatGoogleGenerativeAI(model=model, temperature=0.0)   

    @traceable  # Decorator to trace the function call for monitoring and debugging
    def get_answer(self, question: str):
        """
        Generate an answer to a question using the RAG model.

        Args:
            question (str): The question to answer.

        Returns:
            dict: A dictionary containing the generated answer and the contexts used.
        """

        # Retrieve relevant document chunks based on the question
        similar = self._retriever.invoke(question)

        # Prepare the system message and user message for the chat model
        system_message = SystemMessage(
            content=f"""You are an accomplished AI working for the insights department at a company. Your job is to answer business questions based on provided documents.
Today is {today}.
Use the following docs to produce a concise answer to the user's question:
## Docs\n\n{similar}

Please follow these steps to provide a response:

1. Carefully review the source material, paying attention to any information that is relevant to answering the question:
    - Make extensive use of all information given.
    - It is CRITICAL that you only use information that is explicitly stated above.
    - Refrain from recommendations, speculation, or extrapolations.
    - Compare and contrast findings from different sources. Watch out for any apparent conflicts between sources as well as any corroborating information.
    - Make sure the context of the information given is applicable to the question, such as any specific country, category, or target group the question asks about.

2. Write the answer in a professional, well-structured format with headings and inline citations:
    - Make sure to write a didactically well-structured answer for insights professionals and business stakeholders.
    - Aim to provide a professional response that will help inform decision-making.
    - Structure your answer with headings and use concise full text.
    - Use tables in your response only when needed.
    - Use lists and bullet points sparingly. Absolutely avoid nesting lists and bullet points.
    - Ensure to include direct inline citations for all information referenced in your answer. Use a bracketed citation style with source reference like [XXX]. If there are multiple references, provide them in separate brackets each: [XXX][ZZZ]. MAKE SURE TO USE THE REFERENCES EXACTLY AS GIVEN IN THE <reference> TAGS ABOVE.

3. Language: Respond in the SAME LANGUAGE as the user's question or the provided documents. If the question/documents are in Spanish, reply in Spanish. If in German, reply in German, etc.

Make sure to use Markdown in your response and structure it in this format:

<answer>
...
</answer>
"""
        )
        user_message = HumanMessage(content=question)

        # Generate the response using the Gemini model
        response = self._model.invoke([system_message, user_message])

        # Return the generated answer and the contexts used
        return {
            "answer": response.content,  # Extract the generated answer
            "contexts": [str(doc) for doc in similar],  # Convert document chunks to strings
        }





In [None]:
class Rag:
    def __init__(self, retriever, model: str = "claude-3-sonnet-20240229"):
        """
        Initialize the RAG (Retrieval-Augmented Generation) model.

        Args:
            retriever: The retriever object used to fetch relevant document chunks.
            model (str): The name of the Anthropic model to use (default: "claude-3-sonnet-20240229").
        """
        self._retriever = retriever

        # Initialize the Anthropic model
        self._model = ChatAnthropic(model=model)

    @traceable  # Decorator to trace the function call for monitoring and debugging
    def get_answer(self, question: str):
        """
        Generate an answer to a question using the RAG model.

        Args:
            question (str): The question to answer.

        Returns:
            dict: A dictionary containing the generated answer and the contexts used.
        """

        # Retrieve relevant document chunks based on the question
        similar = self._retriever.invoke(question)

        # Prepare the system message and user message for the chat model
        system_message = SystemMessage(
            content=f"""You are an accomplished AI working for the insights department at a company. Your job is to answer business questions based on provided documents.
Today is {today}.
Use the following docs to produce a concise answer to the user's question:
## Docs\n\n{similar}

Please follow these steps to provide a response:

1. Carefully review the source material, paying attention to any information that is relevant to answering the question:
    - Make extensive use of all information given.
    - It is CRITICAL that you only use information that is explicitly stated above.
    - Refrain from recommendations, speculation, or extrapolations.
    - Compare and contrast findings from different sources. Watch out for any apparent conflicts between sources as well as any corroborating information.
    - Make sure the context of the information given is applicable to the question, such as any specific country, category, or target group the question asks about.

2. Write the answer in a professional, well-structured format with headings and inline citations:
    - Make sure to write a didactically well-structured answer for insights professionals and business stakeholders.
    - Aim to provide a professional response that will help inform decision-making.
    - Structure your answer with headings and use concise full text.
    - Use tables in your response only when needed.
    - Use lists and bullet points sparingly. Absolutely avoid nesting lists and bullet points.
    - Ensure to include direct inline citations for all information referenced in your answer. Use a bracketed citation style with source reference like [XXX]. If there are multiple references, provide them in separate brackets each: [XXX][ZZZ]. MAKE SURE TO USE THE REFERENCES EXACTLY AS GIVEN IN THE <reference> TAGS ABOVE.

3. Language: Respond in the SAME LANGUAGE as the user's question or the provided documents. If the question/documents are in Spanish, reply in Spanish. If in German, reply in German, etc.

Make sure to use Markdown in your response and structure it in this format:

<answer>
...
</answer>
"""
        )
        user_message = HumanMessage(content=question)

        # Generate the response using the Anthropic model
        response = self._model.invoke([system_message, user_message])

        # Return the generated answer and the contexts used
        return {
            "answer": response.content,  # Extract the generated answer
            "contexts": [str(doc) for doc in similar],  # Convert document chunks to strings
        }
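
The three `Rag` definitions above differ only in which chat model they call, and each new definition replaces the previous one in the kernel. One possible consolidation is to inject the retriever and the model as plain callables. `GenericRag`, `fake_retrieve`, and `fake_chat` are hypothetical names, and the stubs exist only so the example runs without API keys.

```python
from typing import Callable, Sequence


class GenericRag:
    """Hypothetical variant of Rag: the retriever and chat model are injected as plain callables."""

    def __init__(self, retrieve: Callable[[str], Sequence[str]],
                 chat: Callable[[str, str], str], instructions: str):
        self._retrieve = retrieve
        self._chat = chat
        self._instructions = instructions

    def get_answer(self, question: str) -> dict:
        contexts = list(self._retrieve(question))
        # Same prompt layout as the classes above: instructions first, then the docs.
        system_prompt = self._instructions + "\n\n## Docs\n\n" + "\n\n".join(contexts)
        return {"answer": self._chat(system_prompt, question), "contexts": contexts}


# Stub callables so the sketch runs offline.
def fake_retrieve(question: str):
    return ["Food prices rose sharply."]


def fake_chat(system_prompt: str, question: str) -> str:
    return f"<answer>({len(system_prompt)} prompt chars) stub reply to: {question}</answer>"


rag_sketch = GenericRag(fake_retrieve, fake_chat, "Answer using only the docs below.")
result = rag_sketch.get_answer("What happened to food prices?")
print(result["answer"])
```

With real models, `chat` could wrap a call such as `model.invoke([SystemMessage(...), HumanMessage(...)]).content`, as in the cells above, so switching providers would not require redefining the class.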
    

####  Initialize the RAG Model

In [None]:
# Example usage (Ensure `retriever` is properly initialized)
rag = Rag(retriever) 

#### Generate an Answer

In [None]:
response = rag.get_answer("according to the document how do young adults react to change in food prices? put the answer in two lines")
print("Generated Answer:", response["answer"])
print("Contexts Used:", response["contexts"])

Generated Answer: <answer>
Young adults, particularly Gen Z (36%) and Millennials (39%), are more likely to pull money from savings or borrow to afford food due to rising prices. Additionally, food insecurity is more common among Gen Z consumers (29%) compared to older generations [65d93b0c-589d-4642-9dbd-af64b0a986f4][d7664ca6-b48f-4aaa-a054-fd568fcc94fe].
</answer>
Contexts Used: ['page_content=\'respondents are willing to\\ndo so.\\n"] [] ["More than half (56%) of Americans said food prices have increased more than any other household expense over the past year, according to the May 2024 Consumer Food Insights Report from Purdue University. Far fewer consumers put expenses like housing (10%) and utilities (10%) at the top. One in three consumers perceived food prices as increasing a little, while 51% said they’ve gone up a lot since last year. Gen Z (36%) and Millennial (39%) consumers\'', "page_content='at the top. One in three consumers perceived food prices as increasing a little
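
The generated answer arrives wrapped in `<answer>` tags with bracketed citation ids. Before an answer is scored or displayed, it can help to strip the wrapper and collect the citations. This is a sketch: `parse_answer` is a hypothetical helper, and the regular expression assumes the ids are 36-character UUID strings like the ones in the output above.

```python
import re


def parse_answer(raw: str) -> dict:
    """Extract the text inside <answer>...</answer> and any bracketed UUID-style citations."""
    match = re.search(r"<answer>(.*?)</answer>", raw, flags=re.DOTALL)
    body = match.group(1).strip() if match else raw.strip()
    citations = re.findall(r"\[([0-9a-fA-F-]{36})\]", body)
    return {"text": body, "citations": citations}


raw = "<answer>\nPrices rose [65d93b0c-589d-4642-9dbd-af64b0a986f4][d7664ca6-b48f-4aaa-a054-fd568fcc94fe].\n</answer>"
parsed = parse_answer(raw)
print(parsed["citations"])
```

The collected ids could then be compared with the ids of the retrieved documents to flag citations that do not correspond to any retrieved context.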

#### Define Evaluation Functions

In [None]:
# ------------
def predict_rag_answer(example: dict):
    """Use this for answer evaluation"""
    response = rag.get_answer(example["question"])
    return {"answer": response["answer"]}

def predict_rag_answer_with_context(example: dict):
    """Use this for evaluation of retrieved documents and hallucinations"""
    response = rag.get_answer(example["question"])
    return {"answer": response["answer"], "contexts": response["contexts"]}

### Define the QA Evaluator

In [28]:
from langsmith.evaluation import LangChainStringEvaluator, evaluate

# Define the QA evaluator
qa_evaluator = [
    LangChainStringEvaluator(
        "cot_qa", # Use the Chain-of-Thought QA evaluator
        prepare_data=lambda run, example: {
            "prediction": run.outputs["answer"], # Predicted answer from the RAG model
            "reference": example.outputs["answer"], # Ground truth answer
            "input": example.inputs["question"], # Input question
        }
    )
]
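
The `prepare_data` lambda decides which fields the evaluator compares. The mapping can be checked in isolation before a full evaluation is launched. In this sketch `fake_run` and `fake_example` are hypothetical stand-ins for the LangSmith run and example objects, modelled only as far as the attributes used by the lambda.

```python
from types import SimpleNamespace


def prepare_data(run, example) -> dict:
    """Same mapping as the lambda above, written as a named function."""
    return {
        "prediction": run.outputs["answer"],   # answer produced by the RAG model
        "reference": example.outputs["answer"],  # ground-truth answer from the dataset
        "input": example.inputs["question"],   # original question
    }


# Hypothetical stand-ins for the LangSmith Run and Example objects.
fake_run = SimpleNamespace(outputs={"answer": "Gen Z dips into savings."})
fake_example = SimpleNamespace(
    inputs={"question": "How do young adults react to food prices?"},
    outputs={"answer": "They draw on savings or borrow."},
)
payload = prepare_data(fake_run, fake_example)
print(payload)
```

The dataset examples therefore need a `question` input and an `answer` output, and the prediction function has to return a dict with an `answer` key, as `predict_rag_answer` does.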

### Relevance 

In [None]:
##############################################################################################

In [385]:
# Grade output schema
from typing_extensions import Annotated, TypedDict
from langchain_openai import ChatOpenAI

class RelevanceGrade(TypedDict):
    explanation: Annotated[str, ..., "Explain your reasoning for the score"]
    relevant: Annotated[bool, ..., "Provide the score on whether the answer addresses the question"]

# Grade prompt
relevance_instructions="""You are an experienced teacher tasked with grading a quiz response. Your goal is to evaluate the relevance of a student's answer to a given question. You will be provided with a QUESTION and a STUDENT ANSWER.


Grading Criteria:
1. Relevance: The answer should be concise and directly address the question asked.
2. Helpfulness: The answer should contribute to answering the question effectively.

Relevance Scoring:
- A relevance score of True means the student's answer meets ALL of the above criteria.
- A relevance score of False means the student's answer fails to meet ONE OR MORE of the criteria.

Instructions:
1. Carefully read the question and the student's answer.
2. Analyze the answer based on the grading criteria.
3. Determine the relevance score (False or True).
4. Provide a step-by-step explanation of your reasoning.
5. Do NOT state the correct answer in your explanation.

Before providing your final evaluation, wrap your analysis inside <evaluation> tags in your thinking block. Follow these steps:

1. Quote relevant parts of the student's answer.
2. For each grading criterion:
   a. Consider arguments for meeting the criterion.
   b. Consider arguments against meeting the criterion.
   c. Make a judgment on whether the criterion is met.
3. Based on the analysis of all criteria, determine the final relevance score.

This will ensure a thorough and fair assessment.

Please proceed with your analysis and evaluation of the student's answer. Your final output should consist only of the relevance score and explanation, and should not duplicate or rehash any of the work you did in the evaluation block."""



# Grader LLM
# Note: `method="json_schema"` and `strict=True` are documented options of ChatOpenAI's
# with_structured_output; confirm that the installed versions of the other chat model
# integrations accept them before switching graders.
# relevance_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0).with_structured_output(RelevanceGrade, method="json_schema", strict=True)
# relevance_llm = ChatGoogleGenerativeAI(model="gemini-2.0-flash", temperature=0).with_structured_output(RelevanceGrade, method="json_schema", strict=True)
relevance_llm = ChatAnthropic(model="claude-3-sonnet-20240229", temperature=0).with_structured_output(RelevanceGrade, method="json_schema", strict=True)

# Evaluator
def relevance(inputs: dict, outputs: dict) -> bool:
    """Grade whether the RAG answer is relevant to the question (True or False)."""
    answer = f"QUESTION: {inputs['question']}\nSTUDENT ANSWER: {outputs['answer']}"
    grade = relevance_llm.invoke([
        {"role": "system", "content": relevance_instructions},
        {"role": "user", "content": answer}
    ])
    return grade["relevant"]
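
The evaluator's wiring can be checked offline by swapping the grader for a stub. The sketch below is an assumption-laden illustration, not the notebook's method: `FakeGrader` and `make_relevance_evaluator` are hypothetical names, and the stub simply returns a fixed dict shaped like `RelevanceGrade`.

```python
class FakeGrader:
    """Hypothetical stand-in for `relevance_llm`; returns a fixed structured grade."""

    def __init__(self, relevant: bool):
        self.relevant = relevant
        self.last_messages = None  # records the prompt for inspection

    def invoke(self, messages: list) -> dict:
        self.last_messages = messages
        return {"explanation": "stubbed", "relevant": self.relevant}


def make_relevance_evaluator(grader, instructions: str):
    """Build an evaluator with the same (inputs, outputs) -> bool shape as `relevance`."""

    def relevance_eval(inputs: dict, outputs: dict) -> bool:
        prompt = f"QUESTION: {inputs['question']}\nSTUDENT ANSWER: {outputs['answer']}"
        grade = grader.invoke([
            {"role": "system", "content": instructions},
            {"role": "user", "content": prompt},
        ])
        return grade["relevant"]

    return relevance_eval


grader = FakeGrader(relevant=True)
evaluator = make_relevance_evaluator(grader, "Grade the answer.")
result = evaluator({"question": "Q?"}, {"answer": "A."})
print(result, grader.last_messages[1]["content"])
```

Because the stub records the messages it receives, the exact prompt sent to the grader can be inspected before running against a paid model.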

### Evaluation with Answer and Context

In [None]:
from langsmith.evaluation import LangChainStringEvaluator, evaluate

# Define the dataset name
dataset_name = "mls-deepsight-eval"


# Chain-of-Thought QA evaluator
cot_qa_evaluator = LangChainStringEvaluator(
    "cot_qa",
    prepare_data=lambda run, example: {
        "prediction": run.outputs["answer"],
        "reference": example.outputs["answer"],
        "input": example.inputs["question"],
    },
)



# qa_evaluator = LangChainStringEvaluator(
#         "qa",
#         prepare_data=lambda run, example: {
#             "prediction": run.outputs["answer"],
#             "reference": example.outputs["answer"],
#             "input": example.inputs["question"],
#         }
#     )

experiment_results = evaluate(
    predict_rag_answer_with_context,
    data=dataset_name,
    evaluators=[cot_qa_evaluator, relevance],
    experiment_prefix="GPT_4o_mini",
    num_repetitions=5,
)


View the evaluation results for experiment: 'GEMINI-6e892622' at:
https://smith.langchain.com/o/becb2dbe-6434-46cc-a3fb-97dca03fd3ef/datasets/0602a3af-66bf-4930-9881-f7cc04698b95/compare?selectedSessions=c1384164-5ae2-4741-9b9e-d23a8557c29b




0it [00:00, ?it/s]

KeyboardInterrupt: 

In [None]:
# Claude evaluation runs

experiment_results = evaluate(
    predict_rag_answer_with_context,
    data=dataset_name,
    evaluators=[cot_qa_evaluator, relevance],
    experiment_prefix="Claude_Opus",
    num_repetitions=5,
)


In [None]:
# Gemini evaluation runs

experiment_results = evaluate(
    predict_rag_answer_with_context,
    data=dataset_name,
    evaluators=[cot_qa_evaluator, relevance],
    experiment_prefix="Gemini_2.0_flash",
    num_repetitions=5,
)
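
With `num_repetitions=5`, each evaluator scores every example several times, so a per-evaluator average is a natural summary when comparing models. The following is a rough sketch that assumes the scores have already been flattened into rows of `example_id`, `key`, and `score`; that row layout, the key names, and `summarize_scores` are all hypothetical and do not reflect the LangSmith result schema.

```python
from collections import defaultdict
from statistics import mean


def summarize_scores(rows: list) -> dict:
    """Average the scores per evaluator key; booleans count as 1.0 / 0.0."""
    by_key = defaultdict(list)
    for row in rows:
        by_key[row["key"]].append(float(row["score"]))
    return {key: mean(scores) for key, scores in by_key.items()}


# Illustrative rows only (not real results): one question, two repetitions.
rows = [
    {"example_id": "q1", "key": "relevance", "score": True},
    {"example_id": "q1", "key": "relevance", "score": False},
    {"example_id": "q1", "key": "cot_qa", "score": 1},
    {"example_id": "q1", "key": "cot_qa", "score": 1},
]

summary = summarize_scores(rows)
print(summary)
```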


In [None]:
# import os
# import pandas as pd
# from langchain.text_splitter import RecursiveCharacterTextSplitter
# from langchain.vectorstores import Chroma
# from langchain_openai import OpenAIEmbeddings  

# # Load your Excel
# file_path = "/Users/lilian/Documents/LLM Eval Project 2/data/main_report-with-results.xlsx"
# df = pd.read_excel(file_path)

# # Combine your text columns
# text_columns = ["avsTexts"]
# df[text_columns] = df[text_columns].fillna("")
# df["combined_text"] = df[text_columns].agg(" ".join, axis=1)  # needed by create_documents below

# # Split into chunks
# splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
# documents = splitter.create_documents(df["combined_text"].tolist())

# # Define persist directory (fresh each time for now)
# persist_dir = "./chroma_db"
# # shutil.rmtree(persist_dir, ignore_errors=True)  # Extra safety!

# # Embed and persist Chroma store
# embeddings = OpenAIEmbeddings()
# vectorstore = Chroma.from_documents(
#     documents,
#     embedding=embeddings,
#     persist_directory=persist_dir
# )
# vectorstore.persist()  # VERY IMPORTANT
# retriever = vectorstore.as_retriever()

# # Query test
# query = "サステナビリティに関する最新のトレンドは何ですか?"  # "What are the latest trends in sustainability?"
# docs = retriever.get_relevant_documents(query)

# # Print results
# for i, doc in enumerate(docs):
#     print(f"\nResult {i+1}:\n", doc.page_content[:300])

In [None]:
# # Embed the document chunks into a vector store using OpenAI embeddings
# vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())

# # Create a retriever to fetch relevant document chunks based on queries
# retriever = vectorstore.as_retriever()