# LangSmith Evaluations 🧪

The pace of AI application development is often limited by a lack of high-quality evaluations. Developers face a paradox of choice: how to engineer a prompt, or which LLM best balances accuracy, latency, and cost. High-quality evaluations help you answer these questions quickly and with confidence. LangSmith allows you to build high-quality [evaluations](https://docs.smith.langchain.com/concepts/evaluation) for your AI application.

## Import Libraries 🧑‍💻

We use a few new libraries here, mostly LangSmith libraries, to take advantage of the evaluations the platform offers.

In [None]:
import os

from dotenv import load_dotenv
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_community.vectorstores.azuresearch import AzureSearch
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import AzureChatOpenAI, AzureOpenAIEmbeddings
from langsmith import Client, traceable
from langsmith.evaluation import LangChainStringEvaluator, evaluate

# Load API keys and endpoints from a local .env file
load_dotenv()

## Connect to existing Azure OpenAI 🤖 & Azure Search Instances 🔎

In [None]:
model = AzureChatOpenAI(
    azure_deployment="gpt4o",
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    api_version="2024-02-01"
)

embeddings: AzureOpenAIEmbeddings = AzureOpenAIEmbeddings(
    azure_deployment="embeddings",
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_key=os.getenv("AZURE_OPENAI_API_KEY")
)

index_name: str = "products-optimized"
vector_store: AzureSearch = AzureSearch(
    azure_search_endpoint=os.getenv("AZURE_SEARCH_ENDPOINT"),
    azure_search_key=os.getenv("AZURE_SEARCH_KEY"),
    index_name=index_name,
    embedding_function=embeddings.embed_query,
)
retriever_from_llm = MultiQueryRetriever.from_llm(
    retriever=vector_store.as_retriever(), llm=model
)
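To make the idea behind multi-query retrieval concrete, here is a minimal, library-free sketch: rephrase the question several ways, search with each phrasing, and keep the de-duplicated union of the results. The helpers `fake_variants` and `fake_search` and the `FAKE_INDEX` data are hypothetical stand-ins for the LLM and the Azure Search index. Searching with the original question as well is a choice made here for illustration, not a statement about how `MultiQueryRetriever` is configured.

```python
from typing import Callable, Dict, List


def multi_query_retrieve(
    question: str,
    generate_variants: Callable[[str], List[str]],
    search: Callable[[str], List[str]],
) -> List[str]:
    """Search with the question and its rephrasings, then return the
    de-duplicated union of results in first-seen order."""
    queries = [question] + generate_variants(question)
    seen = set()
    merged: List[str] = []
    for query in queries:
        for doc in search(query):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged


# Hypothetical stand-in for the LLM that rephrases the question.
def fake_variants(question: str) -> List[str]:
    return [f"{question} cost", f"{question} reviews"]


# Hypothetical stand-in for the vector store: query -> matching document ids.
FAKE_INDEX: Dict[str, List[str]] = {
    "home theater": ["doc-specs", "doc-price"],
    "home theater cost": ["doc-price"],
    "home theater reviews": ["doc-reviews", "doc-specs"],
}


def fake_search(query: str) -> List[str]:
    return FAKE_INDEX.get(query, [])


results = multi_query_retrieve("home theater", fake_variants, fake_search)
print(results)
```

Each document appears once in the merged list even when several phrasings return it, which is why different wordings can widen recall without flooding the prompt with duplicates.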

## Create RAG Bot Class 🤖

Below we mock our RAG application: the class retrieves documents for a question, then asks the model to answer using them.

In [None]:
class RagBot:
    def __init__(self, retriever, model=None):
        self._retriever = retriever
        # Reuse a caller-supplied chat model, or create a default Azure OpenAI client
        self._model = model or AzureChatOpenAI(
            azure_deployment="gpt4o",
            azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
            api_key=os.getenv("AZURE_OPENAI_API_KEY"),
            api_version="2024-02-01",
        )

    @traceable()
    def retrieve_docs(self, question):
        return self._retriever.invoke(question)

    @traceable()
    def invoke_llm(self, question, docs):
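        prompt = ChatPromptTemplate.from_template("""Your job is to answer questions
        Answer the question and include the context of what you used to answer.
        {docs}
        Question: {question}""")
        # Use the model held by this instance rather than the module-level global
        chain = prompt | self._model | StrOutputParser()
        result = chain.invoke({"question": question, "docs": docs})
        return {
            "answer": result,
            # Return the retrieved text, not a second copy of the answer
            "contexts": [doc.page_content for doc in docs],
        }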

    @traceable()
    def get_answer(self, question: str):
        docs = self.retrieve_docs(question)
        return self.invoke_llm(question, docs)

# Replace `retriever_from_llm` with your actual retriever instance
rag_bot = RagBot(retriever_from_llm)


## Quick Test 🧪

Is our retrieval working as expected?

In [None]:
response = rag_bot.get_answer("what is the price of the home theater system?")
print(response)
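Printing the response is a quick manual check. A small helper can make the same check repeatable. The sketch below assumes the response is a dict with `answer` and `contexts` keys, as returned by `get_answer`. The function name `check_response` and the sample dicts are hypothetical and stand in for a live call to the bot.

```python
from typing import Any, Dict, List


def check_response(response: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in a RAG response; empty means OK."""
    problems: List[str] = []
    answer = response.get("answer")
    if not isinstance(answer, str) or not answer.strip():
        problems.append("answer is missing or empty")
    if not response.get("contexts"):
        problems.append("contexts is missing or empty")
    return problems


# Hypothetical responses, used here instead of a live call to the bot.
good = {"answer": "The price is $4999.", "contexts": ["UHTS-1000 price: $4999"]}
bad = {"answer": "   ", "contexts": []}

print(check_response(good))
print(check_response(bad))
```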

## Create our dataset ❓🗣️

Remember, datasets are collections of examples that provide the necessary inputs and expected reference outputs for assessing your AI application.

In [None]:
inputs = [
    "what is the price of the home theater system?",
    "Can you tell me about the customer reviews on the home theater system?"
]

outputs = [
    "The price of the Ultimate Home Theater System (UHTS-1000) is $4999 (source: hometheatersystem.txt).",
    """Certainly! The customer reviews for the Ultimate Home Theater System (UHTS-1000) are overwhelmingly positive. Here are some highlights:
    
        1. **John D.** praised the system for its picture quality and immersive sound:
        - "The picture on the 75-inch TV is crystal clear, and the colors are incredibly vibrant. The Dolby Atmos sound system takes my movie nights to the next level – it’s like being in a theater! Highly recommend." ([hometheatersystem](#source)).

        2. **Lisa M.** highlighted the display and sound quality:
        - "The OLED display provides perfect blacks and vibrant colors, and the Dolby Vision support makes everything look amazing. The soundbar and wireless subwoofer deliver incredible sound quality. Highly satisfied with my purchase." ([hometheatersystem](#source)).

        3. **Sarah P.** appreciated the easy setup and flexibility of the wireless speakers:
        - "The setup was easy, and the wireless speakers give me the flexibility to arrange my living room however I like. The sound is immersive and powerful, and the TV’s smart features make it so easy to find and watch my favorite shows." ([hometheatersystem](#source)).

        4. **Mark T.** was impressed by the 8K resolution and user-friendly interface:
        - "The 8K resolution is stunning, and the TV’s smart interface is very user-friendly. The surround sound system makes me feel like I’m in the middle of the action. Worth every penny!" ([hometheatersystem](#source)).

        Overall, customers seem to be highly satisfied with the UHTS-1000, praising its picture quality, sound performance, ease of setup, and smart features."""
]

qa_pairs = [{"question": q, "answer": a} for q, a in zip(inputs, outputs)]
client = Client()
dataset_name = "products_dataset"
dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="QA Pairs for Generic Tech Shop Inc.",
)
client.create_examples(
    inputs=[{"question": q} for q in inputs],
    outputs=[{"answer": a} for a in outputs],
    dataset_id=dataset.id,
)
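Because inputs and reference outputs are passed as two parallel lists, a missing or blank entry can silently misalign the dataset. The following sketch validates the pairs before they are uploaded. `build_examples` is a hypothetical helper, and the sample strings are placeholders rather than real product data.

```python
from typing import Dict, List, Tuple


def build_examples(
    questions: List[str], answers: List[str]
) -> Tuple[List[Dict[str, str]], List[Dict[str, str]]]:
    """Pair questions with reference answers, rejecting mismatched or blank entries."""
    if len(questions) != len(answers):
        raise ValueError(f"{len(questions)} questions but {len(answers)} answers")
    for i, (q, a) in enumerate(zip(questions, answers)):
        if not q.strip() or not a.strip():
            raise ValueError(f"example {i} has a blank question or answer")
    example_inputs = [{"question": q} for q in questions]
    example_outputs = [{"answer": a} for a in answers]
    return example_inputs, example_outputs


# Placeholder data for illustration only.
sample_inputs, sample_outputs = build_examples(
    ["example question"],
    ["example reference answer"],
)
print(sample_inputs)
print(sample_outputs)
```

The two returned lists have the same shape as the `inputs` and `outputs` arguments passed to `client.create_examples` above.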

## Run Evaluation 🧪

Let's run the evaluation!

In [None]:
def get_rag_answer(example: dict):
    """Use this for answer evaluation"""
    response = rag_bot.get_answer(example["question"])
    return {"answer": response["answer"]}

# Chain-of-thought QA evaluator: an LLM grades each prediction against the reference answer
qa_evaluator = [
    LangChainStringEvaluator(
        "cot_qa",
        config={
            "llm": model,
        },
        # Map the run and dataset example onto the fields the evaluator expects
        prepare_data=lambda run, example: {
            "prediction": run.outputs["answer"],
            "reference": example.outputs["answer"],
            "input": example.inputs["question"],
        },
    )
]
dataset_name = "products_dataset"
experiment_results = evaluate(
    get_rag_answer,
    data=dataset_name,
    evaluators=qa_evaluator,
    experiment_prefix="rag-app"
)
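An LLM-graded evaluator such as `cot_qa` can be complemented by a cheap deterministic check. The sketch below assumes the common custom-evaluator shape, a function that takes `(run, example)` and returns a dict with `key` and `score`. The metric itself (`reference_overlap`, the share of reference tokens found in the prediction) is a hypothetical illustration, and `SimpleNamespace` objects stand in for real run and example objects so the snippet works offline.

```python
import re
from types import SimpleNamespace


def _tokens(text: str) -> set:
    """Lowercase the text and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def reference_overlap(run, example) -> dict:
    """Score the fraction of reference tokens that appear in the prediction."""
    prediction = _tokens(run.outputs["answer"])
    reference = _tokens(example.outputs["answer"])
    score = len(prediction & reference) / len(reference) if reference else 0.0
    return {"key": "reference_overlap", "score": score}


# Stand-ins for a traced run and a dataset example.
fake_run = SimpleNamespace(outputs={"answer": "The price is $4999."})
fake_example = SimpleNamespace(
    outputs={"answer": "The price of the system is $4999."}
)

result = reference_overlap(fake_run, fake_example)
print(result)
```

If your installed LangSmith version accepts plain functions as evaluators, a function like this could be added to the `evaluators` list next to the `cot_qa` evaluator. Token overlap is a rough signal and does not replace the LLM-based grading.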