# 4. Experiment

## Load Environment Variables

In [1]:
from dotenv import load_dotenv

# Load environment variables
load_dotenv(dotenv_path=".env", override=True)

True
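
Later cells read `MODEL_NAME`, `OPENAI_API_KEY`, and `OPENAI_API_BASE` with `os.environ[...]`, which raises a `KeyError` if one of them is missing. As a minimal sketch, a small helper can report missing variables up front. The name `missing_env_vars` is hypothetical and not part of `python-dotenv`.

```python
import os
from typing import Iterable, List, Mapping, Optional


def missing_env_vars(
    names: Iterable[str], env: Optional[Mapping[str, str]] = None
) -> List[str]:
    """Return the names that are unset or empty in `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Variable names read later in this notebook
REQUIRED = ["MODEL_NAME", "OPENAI_API_KEY", "OPENAI_API_BASE"]

missing = missing_env_vars(REQUIRED)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Running this right after `load_dotenv` makes a misconfigured `.env` file visible before any model is created.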

## Setup

### LLM

Set up the LLM chat model.

In [2]:
import os
from langchain_openai import ChatOpenAI

# Read settings from environment variables
model_name = os.environ["MODEL_NAME"]
openai_api_key = os.environ["OPENAI_API_KEY"]
openai_api_base = os.environ["OPENAI_API_BASE"]

# Initialize the LLM
llm = ChatOpenAI(
    model_name=model_name,
    openai_api_key=openai_api_key,
    openai_api_base=openai_api_base,
)

### Tavily
Let's set up Tavily, a web search tool, so that our assistant can search the web when answering.  
Get an API key from the [Tavily website](https://app.tavily.com/) and set it as `TAVILY_API_KEY` in your `.env` file.

In [3]:
from langchain_tavily import TavilySearch

# Set up the Tavily search tool (at most 1 result)
web_search_tool = TavilySearch(max_results=1)

## Prompt

Let's design a prompt for RAG that we'll use throughout the notebook.

In [4]:
prompt = """You are a professor and expert in explaining complex topics in a way that is easy to understand. 
Your job is to answer the provided question so that even a 5 year old can understand it. 
You have been provided with relevant background context to answer the question.

Question: {question} 

Context: {context}

Answer:"""
print("Prompt Template: ", prompt)

Prompt Template:  You are a professor and expert in explaining complex topics in a way that is easy to understand. 
Your job is to answer the provided question so that even a 5 year old can understand it. 
You have been provided with relevant background context to answer the question.

Question: {question} 

Context: {context}

Answer:
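
To see how the two placeholders get filled, here is a small sketch that formats a template with `str.format`. The template is a shortened stand-in for the prompt above, and the question and context strings are made-up sample values.

```python
# Shortened stand-in for the notebook's prompt; only the placeholders matter here
template = """Question: {question}

Context: {context}

Answer:"""

# Sample values, for illustration only
filled = template.format(
    question="Why is the sky blue?",
    context="Sunlight is scattered by air molecules; blue light scatters the most.",
)
print(filled)
```

Because the template is a plain string, any literal braces added to it later would need to be doubled (`{{` and `}}`) for `str.format` to leave them alone.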


## Application Using LangGraph
Let's define the State for our graph. We'll track the user's question, our application's generation, and the list of relevant documents.

In [5]:
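from typing import List

from langchain_core.documents import Document
from langchain_core.messages import BaseMessage
from typing_extensions import TypedDict

class GraphState(TypedDict):
    """
    Represents the state of the graph.
    """
    question: str  # the user's question
    documents: List[Document]  # documents built from web search results
    messages: List[BaseMessage]  # responses generated by the LLM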

Great, now let's define the nodes of our graph.

In [6]:
from langchain_core.messages import HumanMessage

def search(state: GraphState):
    """
    Runs a web search based on the question.

    Args:
        state (dict): The current graph state.

    Returns:
        dict: State update whose documents key includes the web search results.
    """
    question = state["question"]
    documents = state.get("documents", [])

    # Run the web search
    web_docs = web_search_tool.invoke({"query": question})
    web_results = "\n".join([d["content"] for d in web_docs["results"]])
    web_results = Document(page_content=web_results)
    documents.append(web_results)

    return {"documents": documents, "question": question}


def explain(state: GraphState):
    """
    Generates a response based on the context.

    Args:
        state (dict): The current graph state.

    Returns:
        dict: State update whose messages key holds the LLM generation.
    """
    question = state["question"]
    documents = state.get("documents", [])
    formatted = prompt.format(
        question=question, 
        context="\n".join([d.page_content for d in documents])
    )
    generation = llm.invoke([HumanMessage(content=formatted)])
    return {"question": question, "messages": [generation]}
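
The `search` node assumes the Tavily response is a dict with a `results` list whose items carry a `content` field. Below is a minimal offline sketch of that step using a hand-written sample response. The URLs and text are invented placeholders, and `join_contents` is a hypothetical helper.

```python
from typing import Any, Dict


def join_contents(response: Dict[str, Any]) -> str:
    """Concatenate the `content` field of each search result, one per line."""
    return "\n".join(item["content"] for item in response.get("results", []))


# Hand-written sample mimicking the response shape the search node relies on
sample_response = {
    "query": "What is a neural network?",
    "results": [
        {"url": "https://example.com/a", "content": "A neural network is a model."},
        {"url": "https://example.com/b", "content": "It learns from examples."},
    ],
}

context = join_contents(sample_response)
print(context)
```

Using `response.get("results", [])` keeps the helper from raising if a response comes back without results, which the node as written does not guard against.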

In [7]:
from langgraph.graph import StateGraph, START, END

# Create the state graph
graph = StateGraph(GraphState)

# Add nodes
graph.add_node("explain", explain)
graph.add_node("search", search)

# Add edges
graph.add_edge(START, "search")
graph.add_edge("search", "explain")
graph.add_edge("explain", END)

# Compile the graph
app = graph.compile()
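
The graph above is a straight line: `START -> search -> explain -> END`. As a rough mental model (not LangGraph's actual implementation), each node receives the current state and returns a partial update that overwrites the matching keys. The sketch below mimics that with plain dictionaries and stub nodes. `run_linear`, `fake_search`, and `fake_explain` are hypothetical names.

```python
from typing import Any, Callable, Dict, List

State = Dict[str, Any]
Node = Callable[[State], State]


def run_linear(nodes: List[Node], state: State) -> State:
    """Run nodes in order, merging each node's partial update into the state."""
    current = dict(state)
    for node in nodes:
        current.update(node(current))
    return current


def fake_search(state: State) -> State:
    # Stub: pretend a web search returned one snippet
    return {"documents": [f"snippet about: {state['question']}"]}


def fake_explain(state: State) -> State:
    # Stub: pretend the LLM answered using the retrieved snippets
    context = " ".join(state.get("documents", []))
    return {"messages": [f"Answer based on [{context}]"]}


final_state = run_linear([fake_search, fake_explain], {"question": "What is RAG?"})
print(final_state["messages"][0])
```

With the real compiled graph, the equivalent call is `app.invoke({"question": ...})`, which needs valid API keys and network access.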

### Define Evaluators

#### Code-based scorers

We'll first define a code-based scorer, which is useful for measuring deterministic or closed-ended metrics.

In [35]:
from mlflow.genai.scorers import scorer
from mlflow.entities import Feedback

@scorer
def conciseness(outputs) -> Feedback:
    # Split on any whitespace so newlines and repeated spaces don't skew the count
    words = outputs.split()
    if len(words) <= 200:
        return Feedback(
            value=True,
            rationale="Response is concise."
        )
    else:
        return Feedback(
            value=False,
            rationale="Response is too long."
        )

This custom code-based scorer is a simple Python function that checks whether our application's output is at most 200 words long.
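
A similar check can be exercised without MLflow. The sketch below is a plain-Python version that splits on any whitespace, so newlines and repeated spaces do not inflate the count. `is_concise` and `max_words` are hypothetical names.

```python
def is_concise(text: str, max_words: int = 200) -> bool:
    """Return True if `text` has at most `max_words` whitespace-separated words."""
    return len(text.split()) <= max_words


short_answer = "A computer is like a very fast helper that follows instructions."
long_answer = "word " * 201  # 201 words, one over the default limit

print(is_concise(short_answer))
print(is_concise(long_answer))
```

Keeping the threshold as a parameter makes it easy to test the boundary cases before wrapping the logic in a scorer.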

#### Prompt-based judges

For open-ended metrics, it can be powerful to use an LLM to score the outputs.

Let's use an LLM to check whether our application produces correct outputs.

We'll define a function to give an LLM our application's outputs, alongside the reference outputs stored in our dataset. 

The LLM will then be able to reference the "right" output to judge if our application's answer meets our accuracy standards.

In [None]:
from mlflow.genai.judges import custom_prompt_judge
from mlflow.genai.scorers import scorer
import mlflow

correctness_rubric = custom_prompt_judge(
    name="correctness",
    prompt_template="""
You are an expert data labeler evaluating model outputs for correctness. Your task is to assign a score based on the following rubric:

<Instructions>
    - Carefully read the input and output
    - Use the reference output to determine if the model output contains errors
    - Focus on whether the model output uses accurate analogies and is logically consistent
</Instructions>

<Reminder>
    The analogies in the output do not need to match the reference output exactly. Focus on logical consistency.
</Reminder>

<input>
    {{input}}
</input>

<output>
    {{output}}
</output>

Use the reference outputs below to help you evaluate the correctness of the response:
<expected_response>
    {{expected_response}}
</expected_response>

<Rubric>
[[comprehensive]]: The output is fully consistent with the reference, contains no factual errors, and every analogy is accurate and logically sound.
[[thorough]]: The output is consistent with the reference on all major points, with at most minor imprecision in wording or analogy.
[[adequate]]: The output gets the main idea right but contains a noticeable inaccuracy or a loosely fitting analogy.
[[superficial]]: The output is only partly correct, with vague or misleading explanations.
[[inadequate]]: The output contradicts the reference or contains critical errors.
</Rubric>

""",
    numeric_values={
        "comprehensive": 1.0,
        "thorough": 0.8,
        "adequate": 0.6,
        "superficial": 0.3,
        "inadequate": 0.0,
    },
)


@scorer
def correctness(inputs, outputs, expectations):
    return correctness_rubric(
        input=inputs["question"],
        output=outputs,
        expected_response=expectations["expected_response"],
    )
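
The `numeric_values` mapping turns each rubric label into a number so that ratings can be aggregated. The sketch below only illustrates that arithmetic with made-up labels. It does not reproduce how MLflow parses the judge's reply, and `average_score` is a hypothetical helper.

```python
from statistics import mean
from typing import Dict, List

# Same label-to-score mapping as in the judge definition above
NUMERIC_VALUES: Dict[str, float] = {
    "comprehensive": 1.0,
    "thorough": 0.8,
    "adequate": 0.6,
    "superficial": 0.3,
    "inadequate": 0.0,
}


def average_score(labels: List[str]) -> float:
    """Map each rubric label to its numeric value and return the mean."""
    return mean(NUMERIC_VALUES[label] for label in labels)


# Made-up labels for three rows, for illustration only
example_labels = ["comprehensive", "adequate", "thorough"]
print(round(average_score(example_labels), 2))
```

An unknown label raises a `KeyError` here, which is a reasonable way to surface a judge reply that falls outside the rubric.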

## Run Experiment

We have all the necessary components, so let's run our experiment! 

In [None]:
# Step 1: Define evaluation dataset
import mlflow.genai.datasets


catalog_name = "main"  # replace with your catalog name
schema_name = "default"  # replace with your schema name
dataset_name = "dataset_exmaple"

# Load existing dataset
dataset = mlflow.genai.datasets.get_dataset(
    uc_table_name=f"{catalog_name}.{schema_name}.{dataset_name}"
)
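
For readers without access to a Unity Catalog table, `mlflow.genai.evaluate` should also accept in-memory records. Given how `predict` and `correctness` read their arguments, each record needs an `inputs` dict with a `question` key and an `expectations` dict with an `expected_response` key. The rows below are invented placeholders, and `validate_rows` is a hypothetical helper.

```python
from typing import Any, Dict, List

# Invented sample rows; replace with your own questions and reference answers
sample_rows: List[Dict[str, Any]] = [
    {
        "inputs": {"question": "What is gravity?"},
        "expectations": {
            "expected_response": "Gravity is the pull that makes things fall down."
        },
    },
    {
        "inputs": {"question": "Why do we sleep?"},
        "expectations": {
            "expected_response": "Sleep lets the body and brain rest and recharge."
        },
    },
]


def validate_rows(rows: List[Dict[str, Any]]) -> List[int]:
    """Return indices of rows missing the keys used by predict and correctness."""
    bad = []
    for i, row in enumerate(rows):
        has_question = "question" in row.get("inputs", {})
        has_expected = "expected_response" in row.get("expectations", {})
        if not (has_question and has_expected):
            bad.append(i)
    return bad


print(validate_rows(sample_rows))
```

Checking the row shape first avoids spending LLM calls on an evaluation that would fail inside a scorer.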

In [None]:
# Step 2: Define predict_fn
# predict_fn will be called for every row in your evaluation
# dataset. Replace with your app's prediction function.  
# NOTE: The **kwargs to predict_fn are the same as the keys of 
# the `inputs` in your dataset. 
def predict(question):
    response = app.invoke({"question": question})
    return response["messages"][0].content
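
To make the keyword-argument note concrete: conceptually, each row's `inputs` dict is unpacked into `predict_fn`. The sketch below uses a stub in place of `app.invoke`, so `stub_predict` and the sample row are hypothetical.

```python
def stub_predict(question: str) -> str:
    """Stand-in for the real predict function; echoes the question back."""
    return f"(answer to: {question})"


# One dataset row; the keys of `inputs` must match the parameter names
row = {"inputs": {"question": "What is a black hole?"}}

# Equivalent to stub_predict(question="What is a black hole?")
output = stub_predict(**row["inputs"])
print(output)
```

If the dataset's `inputs` contained a key that is not a parameter of the function, the unpacked call would raise a `TypeError`, so the two need to stay in sync.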

In [36]:
import mlflow

# Step 3: Run evaluation
results = mlflow.genai.evaluate(
    data=dataset,
    predict_fn=predict,
    scorers=[conciseness, correctness]
)

2025/07/30 14:56:40 INFO mlflow.genai.utils.data_validation: Testing model prediction with the first sample in the dataset.
Evaluating: 100%|██████████| 3/3 [Elapsed: 00:15, Remaining: 00:00] , Time breakdown=(81.51% predict_fn, 18.49% scorers)


Evaluation completed.

Metrics and evaluation results can be viewed from the MLflow run page.
To compare evaluation results across runs, view the "Evaluations" tab of the experiment.

Get aggregate metrics: `result.metrics`.
Get per-row evaluation results: `result.tables['eval_results']`.
`result` is the `EvaluationResult` object returned by `mlflow.evaluate`.

🏃 View run nimble-snake-203 at: https://dbc-a3ca0892-0f44.cloud.databricks.com/ml/experiments/3243034829598956/runs/2d44609ba8164a77849b4fb7a876cf50
🧪 View experiment at: https://dbc-a3ca0892-0f44.cloud.databricks.com/ml/experiments/3243034829598956
