# 4. Evaluation

## Load Environment Variables

In [1]:
from dotenv import load_dotenv

# Load environment variables
load_dotenv(dotenv_path=".env", override=True)

True
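
The later cells read `MODEL_NAME`, `OPENAI_API_KEY`, and `OPENAI_API_BASE` from the environment, so it can help to fail early when one is missing. The following is a minimal sketch, not part of the original notebook; `missing_env_vars` and `REQUIRED_VARS` are hypothetical names introduced for illustration.

```python
import os

# Names read later in this notebook; adjust the list to match your own .env file.
REQUIRED_VARS = ["MODEL_NAME", "OPENAI_API_KEY", "OPENAI_API_BASE"]


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Example with an explicit mapping, so the check can be tried without a real .env file.
example_env = {"MODEL_NAME": "my-model", "OPENAI_API_KEY": ""}
print(missing_env_vars(REQUIRED_VARS, example_env))
```

Calling `missing_env_vars(REQUIRED_VARS)` without a mapping checks the real process environment after `load_dotenv` has run.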

## Setup

### LLM

Set up the LLM chat model.

In [2]:
import os
from langchain_openai import ChatOpenAI

# Read settings from environment variables
model_name = os.environ["MODEL_NAME"]
openai_api_key = os.environ["OPENAI_API_KEY"]
openai_api_base = os.environ["OPENAI_API_BASE"]

# Initialize the LLM
llm = ChatOpenAI(
    model_name=model_name,
    openai_api_key=openai_api_key,
    openai_api_base=openai_api_base,
)

### Tavily
Let's set up Tavily, a web search tool, so that our assistant can search the web when answering.  
Get an API key from the [Tavily website](https://app.tavily.com/).

In [3]:
from langchain_tavily import TavilySearch

# Configure the Tavily search tool (at most 1 result)
web_search_tool = TavilySearch(max_results=1)

## Prompt

Let's design a prompt for RAG that we'll use throughout the notebook.

In [4]:
prompt = """You are a professor and expert in explaining complex topics in a way that is easy to understand. 
Your job is to answer the provided question so that even a 5 year old can understand it. 
You have provided with relevant background context to answer the question.

Question: {question} 

Context: {context}

Answer:"""
print("Prompt Template: ", prompt)

Prompt Template:  You are a professor and expert in explaining complex topics in a way that is easy to understand. 
Your job is to answer the provided question so that even a 5 year old can understand it. 
You have provided with relevant background context to answer the question.

Question: {question} 

Context: {context}

Answer:
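
The template is a plain Python string with two named placeholders, so `str.format` is enough to fill it in. The sketch below uses a shortened stand-in template and made-up inputs to show the substitution; in the notebook, the context comes from the web search results.

```python
# A shortened stand-in for the notebook's template, with the same two placeholders.
template = "Question: {question}\nContext: {context}\nAnswer:"

# Hypothetical inputs, used only to illustrate the substitution.
filled = template.format(
    question="What is sound?",
    context="Sound is a vibration that travels through air.",
)
print(filled)
```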


## Application Using LangGraph
Let's define the State for our Graph. We'll track the user's question, our application's generation, and the list of relevant documents.

In [5]:
from langchain_core.documents import Document
from typing import List
from typing_extensions import TypedDict

class GraphState(TypedDict):
    """
    Represents the state of the graph.
    """
    question: str
    documents: List[str]
    messages: List[str]

Great, now let's define the nodes of our graph.

In [6]:
from langchain_core.messages import HumanMessage

def search(state: GraphState):
    """
    Run a web search based on the question.

    Args:
        state (dict): The current graph state

    Returns:
        state (dict): State update whose documents key includes the web search results
    """
    question = state["question"]
    documents = state.get("documents", [])

    # Run the web search
    web_docs = web_search_tool.invoke({"query": question})
    web_results = "\n".join([d["content"] for d in web_docs["results"]])
    web_results = Document(page_content=web_results)
    documents.append(web_results)

    return {"documents": documents, "question": question}


def explain(state: GraphState):
    """
    Generate a response based on the context.

    Args:
        state (dict): The current graph state

    Returns:
        state (dict): State update with a messages key holding the LLM generation
    """
    question = state["question"]
    documents = state.get("documents", [])
    formatted = prompt.format(
        question=question, 
        context="\n".join([d.page_content for d in documents])
    )
    generation = llm.invoke([HumanMessage(content=formatted)])
    return {"question": question, "messages": [generation]}

In [7]:
from langgraph.graph import StateGraph, START, END

# Create the state graph
graph = StateGraph(GraphState)

# Add nodes
graph.add_node("explain", explain)
graph.add_node("search", search)

# Add edges
graph.add_edge(START, "search")
graph.add_edge("search", "explain")
graph.add_edge("explain", END)

# Compile the graph
app = graph.compile()
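
To see how state flows through this linear graph without calling Tavily or the LLM, the same START, search, explain, END order can be mimicked in plain Python. This is only an illustration under stated assumptions: `fake_search`, `fake_explain`, and `run_linear` are hypothetical stand-ins, not LangGraph APIs, and the merge step assumes no reducers are defined, so a returned key overwrites the previous value.

```python
def fake_search(state: dict) -> dict:
    # Stand-in for the search node: returns a canned document instead of calling Tavily.
    docs = list(state.get("documents", []))
    docs.append(f"(stub result for: {state['question']})")
    return {"documents": docs, "question": state["question"]}


def fake_explain(state: dict) -> dict:
    # Stand-in for the explain node: echoes the context instead of calling the LLM.
    context = "\n".join(state.get("documents", []))
    return {"question": state["question"], "messages": [f"Answer based on: {context}"]}


def run_linear(state: dict, nodes) -> dict:
    # Apply each node in order and merge its partial update into the state.
    for node in nodes:
        state = {**state, **node(state)}
    return state


final_state = run_linear({"question": "What is sound?"}, [fake_search, fake_explain])
print(final_state["messages"][0])
```

With the real graph, the equivalent call is `app.invoke({"question": ...})`, which is what the run function later in this notebook uses.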

### Define Evaluators

#### Custom Code Evaluator

We'll first define a custom code evaluator. Custom code evaluators are useful for measuring deterministic or closed-ended metrics.

In [8]:
def conciseness(outputs: dict) -> bool:
    words = outputs["output"].split(" ")
    return len(words) <= 200

This particular custom code evaluator is a simple Python function that checks if our application produces outputs that are less than or equal to 200 words long.
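
Note that `split(" ")` only splits on single spaces, so words separated by newlines are counted as one word and repeated spaces add empty entries. A possible variant, shown here as a sketch rather than the notebook's evaluator, uses `split()` with no argument; `conciseness_whitespace` and `max_words` are hypothetical names.

```python
def conciseness_whitespace(outputs: dict, max_words: int = 200) -> bool:
    # str.split() with no argument splits on any run of whitespace,
    # so newlines and repeated spaces do not distort the word count.
    words = outputs["output"].split()
    return len(words) <= max_words


# Made-up sample containing a blank line between two sentences.
sample = {"output": "That's a great question about sound!\n\nSo, you know how drums work?"}

count_single_space = len(sample["output"].split(" "))
count_any_whitespace = len(sample["output"].split())
print(count_single_space, count_any_whitespace)
```

The two counts differ for this sample because `"sound!\n\nSo,"` is treated as a single word when splitting on spaces only.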

#### LLM-as-a-Judge Evaluator

For open-ended metrics, it can be powerful to use an LLM to score the outputs.

Let's use an LLM to check whether our application produces correct outputs. First, let's define a scoring schema for our LLM to adhere to in its response.

In [9]:
from pydantic import BaseModel, Field


# Define a scoring schema that our LLM must adhere to
class CorrectnessScore(BaseModel):
    """Correctness score of the answer when compared to the reference answer."""

    score: int = Field(
        description="The score of the correctness of the answer, from 0 to 1"
    )

We'll define a function to give an LLM our application's outputs, alongside the reference outputs stored in our dataset. 

The LLM will then be able to reference the "right" output to judge if our application's answer meets our accuracy standards.

In [10]:
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage


def correctness(inputs: dict, outputs: dict, reference_outputs: dict) -> bool:
    prompt = """
    You are an expert data labeler evaluating model outputs for correctness. Your task is to assign a score based on the following rubric:

    <Rubric>
        A correct answer:
        - Provides accurate information
        - Uses suitable analogies and examples
        - Contains no factual errors
        - Is logically consistent

        When scoring, you should penalize:
        - Factual errors
        - Incoherent analogies and examples
        - Logical inconsistencies
    </Rubric>

    <Instructions>
        - Carefully read the input and output
        - Use the reference output to determine if the model output contains errors
        - Focus whether the model output uses accurate analogies and is logically consistent
    </Instructions>

    <Reminder>
        The analogies in the output do not need to match the reference output exactly. Focus on logical consistency.
    </Reminder>

    <input>
        {}
    </input>

    <output>
        {}
    </output>

    Use the reference outputs below to help you evaluate the correctness of the response:
    <reference_outputs>
        {}
    </reference_outputs>
    """.format(
        inputs["question"], outputs["output"], reference_outputs["output"]
    )

    structured_llm = ChatOpenAI(
        model_name=model_name,
        openai_api_key=openai_api_key,
        openai_api_base=openai_api_base,
        temperature=0
    ).with_structured_output(CorrectnessScore)
    generation = structured_llm.invoke([HumanMessage(content=prompt)])
    return generation.score == 1

### Define Run Function

We'll define a function to run our application on the example inputs of our dataset. This is the function that will be called when we run our experiment.

In [11]:
# Run the application on a single dataset example and return the answer text
def run(inputs: dict):
    response = app.invoke({"question": inputs["question"]})
    return response["messages"][0].content
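
The evaluators read `outputs["output"]`, while `run` returns a plain string. This appears to work because a non-dict return value is stored under an `output` key, which matches the `outputs.output` column in the experiment results. The sketch below mimics that flow locally so the pieces can be checked before a real experiment; `wrap_output`, `stub_run`, and `word_limit` are hypothetical stand-ins, not LangSmith APIs.

```python
def wrap_output(result):
    # Assumed behavior: dicts pass through, any other value is stored under "output".
    return result if isinstance(result, dict) else {"output": result}


def stub_run(inputs: dict) -> str:
    # Stand-in for run(): returns a canned answer instead of invoking the graph.
    return f"A short answer to: {inputs['question']}"


def word_limit(outputs: dict) -> bool:
    # Same check as the conciseness evaluator defined earlier.
    return len(outputs["output"].split(" ")) <= 200


examples = [{"question": "What is sound?"}, {"question": "How does a democracy work?"}]

results = []
for inputs in examples:
    outputs = wrap_output(stub_run(inputs))
    results.append({"question": inputs["question"], "conciseness": word_limit(outputs)})

print(results)
```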

## Run Experiment

We have all the necessary components, so let's run our experiment! 

In [12]:
from langsmith import evaluate

dataset_name = "dataset-exmaple"
evaluate(
    run,
    data=dataset_name,
    evaluators=[correctness, conciseness],
    experiment_prefix="demo-eval"
)

View the evaluation results for experiment: 'demo-eval-82713c9e' at:
https://smith.langchain.com/o/e68ba21c-3a02-4fb0-81cd-ae98057123fc/datasets/bc7c3a61-a249-47c3-8046-9fcbbed7e93f/compare?selectedSessions=2024e9c3-ea8d-4d5f-b85b-501010fc4939




3it [00:19,  6.50s/it]


| | inputs.question | outputs.output | error | reference.output | feedback.correctness | feedback.conciseness | execution_time | example_id | id |
|---|---|---|---|---|---|---|---|---|---|
| 0 | How does a democracy work? | So, you know how we live in a country with lot... | | Okay! Imagine you and your friends want to dec... | True | False | 7.810723 | 483ac70d-a254-4181-bae9-f5cf6789b367 | 5c9fc102-26cf-4dc5-a197-4f7f3f512cca |
| 1 | What is sound? | That's a great question about sound!\n\nSo, yo... | | Okay! Imagine you have a drum. When you hit it... | True | True | 4.95315 | befbb020-f57b-46bb-8967-8a8f690dd0ed | 04fd8cfc-bcaa-41c0-83e1-cd80475624df |
| 2 | How does string theory work? | Let's talk about string theory!\n\nImagine you... | | Okay! Imagine that everything in the universe,... | True | False | 4.745685 | f32b8859-a0b5-4702-9ef9-491cae6efa6f | 09666029-5101-455a-8f2a-66cfb01d0ae3 |
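
The feedback columns in the results above can be summarized as pass rates per evaluator. The sketch below copies the three rows' feedback values by hand and computes the fraction of passing examples; `pass_rate` is a hypothetical helper, not part of LangSmith.

```python
# Feedback values copied from the three result rows above.
rows = [
    {"question": "How does a democracy work?", "correctness": True, "conciseness": False},
    {"question": "What is sound?", "correctness": True, "conciseness": True},
    {"question": "How does string theory work?", "correctness": True, "conciseness": False},
]


def pass_rate(rows, key):
    """Fraction of rows whose feedback value for `key` is True."""
    return sum(1 for row in rows if row[key]) / len(rows)


for key in ("correctness", "conciseness"):
    print(f"{key}: {pass_rate(rows, key):.2f}")
```

In this run, every answer was judged correct, but only one of the three stayed within the 200-word limit.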
