This notebook runs a RAG evaluation, including synthesizing the evaluation data.


In [None]:
from google.colab import userdata
openai_api_key = userdata.get('openai_api_key')

In [None]:
!pip install openai



In [None]:
!pip install braintrust autoevals

Collecting braintrust
  Downloading braintrust-0.0.201-py3-none-any.whl.metadata (3.4 kB)
Collecting autoevals
  Downloading autoevals-0.0.127-py3-none-any.whl.metadata (17 kB)
Collecting chevron (from braintrust)
  Downloading chevron-0.14.0-py3-none-any.whl.metadata (4.9 kB)
Collecting braintrust_core==0.0.58 (from braintrust)
  Downloading braintrust_core-0.0.58-py3-none-any.whl.metadata (669 bytes)
Collecting exceptiongroup>=1.2.0 (from braintrust)
  Downloading exceptiongroup-1.2.2-py3-none-any.whl.metadata (6.6 kB)
Collecting python-dotenv (from braintrust)
  Downloading python_dotenv-1.1.0-py3-none-any.whl.metadata (24 kB)
Collecting sseclient-py (from braintrust)
  Downloading sseclient_py-1.8.0-py2.py3-none-any.whl.metadata (2.0 kB)
Collecting polyleven (from autoevals)
  Downloading polyleven-0.9.0-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.3 kB)
Downloading braintrust-0.0.201-py3-none-any.whl (133 kB)

In [None]:
braintrust_api_key = userdata.get('braintrust_api_key')

import os
os.environ['OPENAI_API_KEY'] = openai_api_key

from braintrust import init_logger, traced, wrap_openai, Eval
from openai import OpenAI

logger = init_logger(project="Course-202504", api_key=braintrust_api_key)
client = wrap_openai(OpenAI(api_key=openai_api_key))



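## Preparing evaluation data: synthetic data

Building a dataset by hand gives good quality, but it is extremely labor-intensive. What can we do instead?

Use a synthetic dataset strategy: let an LLM generate the evaluation dataset for us.

The approach works like this:

1. Split the text that RAG will run over into chunks; these are the contexts.
2. For each context, use an LLM to generate a matching question and reference answer.

More: https://ihower.tw/notes/%E6%8A%80%E8%A1%93%E7%AD%86%E8%A8%98-AI/Synthetic+Data+%E5%90%88%E6%88%90%E8%B3%87%E6%96%99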

In [None]:
!wget https://www.megabank.com.tw/-/media/mega/files/bank/personal/fund/bulletin/weekly-journal/market-analysis/114/1140224.pdf

--2025-04-29 14:44:31--  https://www.megabank.com.tw/-/media/mega/files/bank/personal/fund/bulletin/weekly-journal/market-analysis/114/1140224.pdf
Resolving www.megabank.com.tw (www.megabank.com.tw)... 23.54.222.162
Connecting to www.megabank.com.tw (www.megabank.com.tw)|23.54.222.162|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1852669 (1.8M) [application/pdf]
Saving to: ‘1140224.pdf’


2025-04-29 14:44:31 (14.8 MB/s) - ‘1140224.pdf’ saved [1852669/1852669]



In [None]:
!pip install pymupdf

Collecting pymupdf
  Downloading pymupdf-1.25.5-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (3.4 kB)
Downloading pymupdf-1.25.5-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (20.0 MB)
Installing collected packages: pymupdf
Successfully installed pymupdf-1.25.5


In [None]:
import fitz # PyMuPDF library
pages = fitz.open("1140224.pdf")

In [None]:
from typing import List
from pydantic import Field, BaseModel

class QAPair(BaseModel):
    reference: str = Field(..., description="The exact text segment from the original context that this Q&A is based on")
    question: str = Field(description="A single question about the content")
    answer: str = Field(..., description="Answer")

class QAPairs(BaseModel):
    pairs: List[QAPair] = Field(..., description="List of question/answer pairs")

@traced
def produce_questions(content):
    completion = client.beta.chat.completions.parse(
        model="gpt-4.1-mini",
        messages=[
            {
                "role": "user",
                "content": f"""Please generate 2 question/answer pairs from the following text, focusing specifically on investment and personal finance topics.
For each pair, provide a single question, a unique answer, and include the exact text segment from the original context that the Q&A is based on.

IMPORTANT:
1. Focus ONLY on investment, financial planning, wealth management, stock market, retirement planning, tax optimization, or other personal finance related topics.
2. All questions and answers MUST be in Traditional Chinese (Taiwan).
3. Use terminology and expressions commonly used in Taiwan's financial sector.
4. If the context doesn't contain finance-related information, extract the most relevant aspects that could be applied to personal finance or investment decisions.
5. For each Q&A pair, include the exact text from the original context that contains the information used for the Q&A. This should be copied verbatim from the input context.

Context: <context>{content}</context>""",
            }
        ],
        response_format=QAPairs
    )

    parsed_result = completion.choices[0].message.parsed
    pairs = parsed_result.pairs
    return pairs

In [None]:
x = produce_questions('兆豐銀行所做任何投資意見與市場分析結果，係依據資料製作當時情況進行判斷\n惟可能因市場變化而變動，投資標的之價格與收益將隨時變動，亦不必然為未來績效表現')
x

[QAPair(reference='兆豐銀行所做任何投資意見與市場分析結果，係依據資料製作當時情況進行判斷', question='兆豐銀行的投資建議是基於什麼情況做判斷？', answer='兆豐銀行的投資建議是根據當時所擁有的資料及市場情況進行判斷。'),
 QAPair(reference='投資標的之價格與收益將隨時變動，亦不必然為未來績效表現', question='投資標的的價格與收益是否一定能反映未來績效？', answer='投資標的的價格與收益會隨時變動，並不保證能反映未來的績效表現。')]
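
The prompt asks the model to copy each `reference` verbatim from the context, but nothing in the code enforces that. As an optional sanity check that is not part of the original notebook, a sketch like the following could flag pairs whose reference does not actually occur in the source text. The helper names `normalize_ws` and `reference_is_verbatim` are hypothetical, and comparing after stripping all whitespace is an assumption meant to tolerate line breaks in PDF-extracted text.

```python
import re

def normalize_ws(text: str) -> str:
    """Remove all whitespace so line breaks from PDF extraction do not cause false mismatches."""
    return re.sub(r"\s+", "", text)

def reference_is_verbatim(reference: str, context: str) -> bool:
    """Return True if the non-empty reference occurs in the context, ignoring whitespace."""
    ref = normalize_ws(reference)
    return bool(ref) and ref in normalize_ws(context)

# Hypothetical sample context, taken from the disclaimer text used above.
context = "兆豐銀行所做任何投資意見與市場分析結果，係依據資料製作當時情況進行判斷\n惟可能因市場變化而變動"

# The reference spans a line break in the context but uses a space instead.
print(reference_is_verbatim("係依據資料製作當時情況進行判斷 惟可能因市場變化而變動", context))  # True
# This text is a paraphrase that never appears in the context.
print(reference_is_verbatim("依據未來績效進行判斷", context))  # False
```

In the page loop, the same check could be applied to each pair with `page.get_text()` as the context, for example to drop or log pairs where it returns `False`.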

In [None]:
h = x[0].model_dump()

In [None]:
h['test'] = 1234
h

{'reference': '兆豐銀行所做任何投資意見與市場分析結果，係依據資料製作當時情況進行判斷',
 'question': '兆豐銀行的投資建議是基於什麼情況做判斷？',
 'answer': '兆豐銀行的投資建議是根據當時所擁有的資料及市場情況進行判斷。',
 'test': 1234}

In [None]:
dataset = []
for idx, page in enumerate(pages):
    context = page.get_text()
    pairs = produce_questions(context)
    for pair in pairs:
        h = pair.model_dump()
        h["page_index"] = idx
        print(h)
        dataset.append(h)

{'reference': '兆豐銀行所做任何投資意見與市場分析結果，係依據資料製作當時情況進行判斷\n惟可能因市場變化而變動，投資標的之價格與收益將隨時變動，亦不必然為未來績效表現', 'question': '兆豐銀行提供的投資意見有何風險？', 'answer': '兆豐銀行的投資意見是依據當時資料進行判斷，然而因市場變化，投資標的價格與收益會隨時變動，且不保證未來績效。', 'page_index': 0}
{'reference': '兆豐銀行所做任何投資意見與市場分析結果，係依據資料製作當時情況進行判斷\n惟可能因市場變化而變動，投資標的之價格與收益將隨時變動，亦不必然為未來績效表現', 'question': '為何投資標的的價格與收益會持續變動？', 'answer': '因市場變化，投資標的的價格與收益會隨時變動，反映市場行情及風險的變化。', 'page_index': 0}
{'reference': '兆豐銀行所做任何投資意見與市場分析結果，係依據資料製作當時情況進行判斷\n惟可能因市場變化而變動，投資標的之價格與收益將隨時變動，亦不必然為未來績效表現', 'question': '兆豐銀行的投資意見為何可能不適用於未來投資策略？', 'answer': '兆豐銀行的投資意見是根據當時資料和情況判斷，因市場隨時變動，投資標的價格與收益也會變化，無法保證未來績效，因此投資策略需隨市場變化調整。', 'page_index': 1}
{'reference': '兆豐銀行所做任何投資意見與市場分析結果，係依據資料製作當時情況進行判斷\n惟可能因市場變化而變動，投資標的之價格與收益將隨時變動，亦不必然為未來績效表現', 'question': '投資人在參考兆豐銀行的市場分析時，應注意哪些風險？', 'answer': '投資人應注意市場變化可能導致投資標的價格和收益波動，且銀行提供的分析乃基於當時資料，未必反映未來表現，因此需謹慎評估風險並持續監控市場。', 'page_index': 1}
{'reference': '兆豐銀行所做任何投資意見與市場分析結果，係依據資料製作當時情況進行判斷\n惟可能因市場變化而變動，投資標的之價格與收益將隨時變動，亦不必然為未來績效表現', 'question': '兆豐銀行的投資建議依據哪些情況進行判斷？未來

In [None]:
eval_dataset = []
for qa in dataset:
    eval_dataset.append(
        {
            "input": qa['question'],
            "expected": qa['answer'],
            "metadata": {
                "reference": qa['reference'],
                'page_index': qa['page_index'],
                'file_name': '1140224.pdf'
            },
        }
    )

## Baseline (No RAG)

In [None]:
def simple_qa(question):
    completion = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[
            {
                "role": "user",
                "content": question,
            }
        ],
    )
    return completion.choices[0].message.content

In [None]:
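import autoevals

# In a notebook, Eval() returns an asyncio Task rather than blocking (see the
# output below); `await` it, or use EvalAsync, to wait for the results.
Eval(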
    name="Course-202504",
    experiment_name="No RAG",
    data=eval_dataset,
    task=simple_qa,
    scores=[autoevals.Factuality(model="gpt-4.1")],
)

Experiment No RAG-cd7feb24 is running at https://www.braintrust.dev/app/ihower/p/Course-202504/experiments/No%20RAG-cd7feb24


<Task pending name='Task-594' coro=<_EvalCommon.<locals>.run_to_completion() running at /usr/local/lib/python3.11/dist-packages/braintrust/framework.py:656>>

Course-202504 [experiment_name=No RAG] (data): 0it [00:00, ?it/s]

## Naive RAG

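We use Factuality, a built-in evaluation prompt from the autoevals package provided by Braintrust: https://github.com/braintrustdata/autoevals/blob/main/templates/factuality.yaml

It has an LLM compare your task's output with the `expected` value in the dataset and classify how the factual content of the two answers differs. This kind of check is common when automatically grading the correctness of question-answering or tutoring model outputs. The possible verdicts are:

* (A) The submitted answer is a subset of the expert answer and is fully consistent with it. (Example: the expert says A+B+C; the submission says only A+C, with nothing wrong.)
* (B) The submitted answer is a superset of the expert answer: it contains more detail but is consistent with the expert's content. (Example: the expert says A+C; the submission says A+B+C+D.)
* (C) The submitted answer is identical to the expert answer, with all details matching.
* (D) The two answers conflict or contradict each other on facts.
* (E) The two answers are not identical, but the differences do not affect factual correctness (for example, paraphrasing or a different order).


* C and E are the ideal outcomes and score 1.
* B is more complete than A, so it scores slightly higher: 0.6 versus 0.4.
* D is an error and scores 0.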
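
The scoring rule above can be sketched as a plain lookup table. This is only an illustration of the choice-to-score mapping described in the list, not the autoevals implementation; `FACTUALITY_SCORES`, `factuality_score`, and `mean_score` are hypothetical names.

```python
# Choice-to-score mapping as described above: A=0.4, B=0.6, C=1, D=0, E=1.
FACTUALITY_SCORES = {"A": 0.4, "B": 0.6, "C": 1.0, "D": 0.0, "E": 1.0}

def factuality_score(choice: str) -> float:
    """Map the grader's letter verdict to a numeric score."""
    return FACTUALITY_SCORES[choice.strip().upper()]

def mean_score(choices: list[str]) -> float:
    """Average the scores of several graded examples."""
    return sum(factuality_score(c) for c in choices) / len(choices)

# Four hypothetical verdicts: one exact match, one superset, one conflict, one subset.
print(mean_score(["C", "B", "D", "A"]))
```

An experiment-level Factuality percentage can be read as this kind of average over all rows in the dataset.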




In [None]:
!pip install chromadb

Collecting chromadb
  Downloading chromadb-1.0.7-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.9 kB)
Collecting build>=1.0.3 (from chromadb)
  Downloading build-1.2.2.post1-py3-none-any.whl.metadata (6.5 kB)
Collecting chroma-hnswlib==0.7.6 (from chromadb)
  Downloading chroma_hnswlib-0.7.6-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (252 bytes)
Collecting fastapi==0.115.9 (from chromadb)
  Downloading fastapi-0.115.9-py3-none-any.whl.metadata (27 kB)
Collecting uvicorn>=0.18.3 (from uvicorn[standard]>=0.18.3->chromadb)
  Downloading uvicorn-0.34.2-py3-none-any.whl.metadata (6.5 kB)
Collecting posthog>=2.4.0 (from chromadb)
  Downloading posthog-4.0.1-py2.py3-none-any.whl.metadata (3.0 kB)
Collecting onnxruntime>=1.14.1 (from chromadb)
  Downloading onnxruntime-1.21.1-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (4.5 kB)
Collecting opentelemetry-exporter-otlp-proto-grpc>=1.2.0 (from chromadb)
  Downloading opentelem

In [None]:
import chromadb
chroma_client = chromadb.Client()

collection = chroma_client.get_or_create_collection(name="collection3")  # idempotent, so the cell can be re-run

In [None]:
!pip install tiktoken



In [None]:
!pip install langchain_text_splitters



In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
import tiktoken

tokenizer = tiktoken.get_encoding("o200k_base") # gpt-4o uses o200k_base; the earlier gpt-4-turbo and gpt-3.5-turbo use cl100k_base

def length_function(text: str):
    return len(tokenizer.encode(text))

text_splitter = RecursiveCharacterTextSplitter(length_function=length_function, chunk_size=800, chunk_overlap=200, separators=[
    "\n\n",
    "\n",
    " ",
    ".",
    ",",
    "\u200b",  # Zero-width space
    "\uff0c",  # Fullwidth comma ，
    "\u3001",  # Ideographic comma 、
    "\uff0e",  # Fullwidth full stop ．
    "\u3002",  # Ideographic full stop 。
    "",
])
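
To build intuition for what `chunk_size` and `chunk_overlap` mean, here is a deliberately simplified fixed-window sketch. It is not how `RecursiveCharacterTextSplitter` works (that splitter recursively splits on the separator list and merges the pieces); it only shows that consecutive chunks hold at most `chunk_size` units and share `chunk_overlap` units. The list items stand in for tokens, and `sliding_chunks` is a hypothetical helper.

```python
def sliding_chunks(tokens, chunk_size, chunk_overlap):
    """Cut a token list into windows of chunk_size that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks

# Ten stand-in tokens, windows of 4 with an overlap of 2.
for chunk in sliding_chunks(list(range(10)), chunk_size=4, chunk_overlap=2):
    print(chunk)
# [0, 1, 2, 3]
# [2, 3, 4, 5]
# [4, 5, 6, 7]
# [6, 7, 8, 9]
```

With the notebook's settings (800 and 200, measured in `o200k_base` tokens via `length_function`), neighbouring chunks can repeat up to roughly 200 tokens of text, which helps keep a sentence that straddles a boundary retrievable from either side.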

In [None]:
def get_embeddings(text):
  response = client.embeddings.create(
      input=text,
      model="text-embedding-3-small"
  )

  return response.data[0].embedding

In [None]:
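for idx, page in enumerate(pages):
  chunks = text_splitter.split_text(page.get_text())
  if not chunks:
    continue  # skip pages with no extractable text (nothing to index)

  collection.add(
    documents=chunks,
    embeddings=[get_embeddings(chunk) for chunk in chunks],
    # Store the page index so queries can filter on it with `where`
    metadatas=[{"page_index": idx} for _ in chunks],
    ids=[f"doc-1-page-{idx}-chunk-{x}" for x in range(len(chunks))]
  )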

In [None]:
from typing import List
from pydantic import Field, BaseModel

class QueryResult(BaseModel):
    relevant_quotes: List[str]
    answer: str
    following_questions: List[str]

@traced
def ask_with_rag(question):
  results = collection.query(
      query_embeddings = get_embeddings(question),
      # An optional `where` argument can filter on stored metadatas, e.g. date or page number
      n_results=10
  )

  documents = results['documents'][0]
  context = '\n'.join('* ' + doc for doc in documents)

  user_prompt = f"""
  I will provide you with a document and then ask you a question about it. Please respond following these steps:

  <document>
  {context}
  </document>

  Question: {question}

  Please answer in the following format:

  1. First, identify the most relevant quotes from the document that help answer the question and list them. Each quote should be relatively short.
    If there are no relevant quotes, write "No relevant quotes".

  2. Then, answer the question using facts from these quotes without directly referencing the content in your answer.

  3. Finally, provide 3 related follow-up questions based on the original question and document content that would help explore the topic further.

  If the document does not contain sufficient information to answer the question, please state this in the answer field, but still provide any relevant quotes (if available) and possible follow-up questions.
  Please respond in Traditional Chinese (Taiwan).
  """

  completion = client.beta.chat.completions.parse(
      model="gpt-4.1-mini",
      messages=[
          {"role": "user", "content": user_prompt},
      ],
      response_format=QueryResult
  )

  parsed_result = completion.choices[0].message.parsed
  return parsed_result.answer

In [None]:
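import autoevals

# As with the baseline, Eval() returns an asyncio Task in a notebook (see the
# output below); `await` it, or use EvalAsync, to wait for the results.
Eval(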
    name="Course-202504",
    experiment_name="Naive RAG",
    data=eval_dataset,
    task=ask_with_rag,
    scores=[autoevals.Factuality(model="gpt-4.1")],
)

Experiment Naive RAG-addfce82 is running at https://www.braintrust.dev/app/ihower/p/Course-202504/experiments/Naive%20RAG-addfce82


<Task pending name='Task-657' coro=<_EvalCommon.<locals>.run_to_completion() running at /usr/local/lib/python3.11/dist-packages/braintrust/framework.py:656>>

## Evaluating with Ragas metrics

https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/

Adapted from https://www.braintrust.dev/docs/cookbook/recipes/SimpleRagas

In [None]:
@traced
def fetch_top_k_relevant_sections(question):
  results = collection.query(
      query_embeddings = get_embeddings(question),
      # An optional `where` argument can filter on stored metadatas, e.g. date or page number
      n_results=10
  )

  documents = results['documents'][0]
  return documents

@traced
def generate_answer_from_docs(question, retrieved_content):
  context = '\n'.join('* ' + doc for doc in retrieved_content)

  user_prompt = f"""
  I will provide you with a document and then ask you a question about it. Please respond following these steps:

  <document>
  {context}
  </document>

  Question: {question}

  Please answer in the following format:

  1. First, identify the most relevant quotes from the document that help answer the question and list them. Each quote should be relatively short.
    If there are no relevant quotes, write "No relevant quotes".

  2. Then, answer the question using facts from these quotes without directly referencing the content in your answer.

  3. Finally, provide 3 related follow-up questions based on the original question and document content that would help explore the topic further.

  If the document does not contain sufficient information to answer the question, please state this in the answer field, but still provide any relevant quotes (if available) and possible follow-up questions.
  Please respond in Traditional Chinese (Taiwan).
  """

  completion = client.beta.chat.completions.parse(
      model="gpt-4.1-mini",
      messages=[
          {"role": "user", "content": user_prompt},
      ],
      response_format=QueryResult
  )

  parsed_result = completion.choices[0].message.parsed
  return parsed_result

@traced
def generate_answer_e2e(question):
  retrieved_content = fetch_top_k_relevant_sections(question)
  result = generate_answer_from_docs(question, retrieved_content)

  return { "answer": result.answer, "retrieved_docs": retrieved_content }

In [None]:
from braintrust import EvalAsync

from autoevals import AnswerCorrectness, ContextRecall, ContextPrecision, Faithfulness

# Wrap ContextRecall() to propagate along the "answer" and "context" values separately
async def context_recall(output, **kwargs):
    return await ContextRecall(model="gpt-4.1").eval_async(
        output=output["answer"], context=output["retrieved_docs"], **kwargs
    )

async def context_precision(output, **kwargs):
    return await ContextPrecision(model="gpt-4.1").eval_async(
        output=output["answer"], context=output["retrieved_docs"], **kwargs
    )

async def faithfulness(output, **kwargs):
    return await Faithfulness(model="gpt-4.1").eval_async(
        output=output["answer"], context=output["retrieved_docs"], **kwargs
    )

async def answer_correctness(output, **kwargs):
    return await AnswerCorrectness(model="gpt-4.1").eval_async(output=output["answer"], **kwargs)

eval_result = await EvalAsync(
    name="Course-202504",
    experiment_name="Ragas",
    data=eval_dataset,
    task=generate_answer_e2e,
    scores=[context_recall, context_precision, faithfulness, answer_correctness],
    metadata=dict(model='gpt-4.1-mini', top_k=10),
)

Experiment Ragas is running at https://www.braintrust.dev/app/ihower/p/Course-202504/experiments/Ragas
Course-202504 [experiment_name=Ragas] (data): 32it [00:00, 24092.21it/s]


Course-202504 [experiment_name=Ragas] (tasks):   0%|          | 0/32 [00:00<?, ?it/s]

Found exceptions for the following scorers: faithfulness [KeyError('verdict')]
Found exceptions for the following scorers: faithfulness [KeyError('verdict')]
Found exceptions for the following scorers: context_recall [TypeError("unsupported operand type(s) for +: 'int' and 'NoneType'")]
Found exceptions for the following scorers: faithfulness [KeyError('verdict')]
Found exceptions for the following scorers: faithfulness [KeyError('verdict')]



Ragas compared to Naive RAG-addfce82:
59.56% 'AnswerCorrectness' score
100.00% 'ContextPrecision'  score
97.31% 'ContextRecall'     score
84.43% 'Faithfulness'      score

1745942477.20s start
1745942541.03s end
55.08s (-12293.44%) 'duration'         	(32 improvements, 0 regressions)
3.08s (-52.22%) 'llm_duration'     	(23 improvements, 9 regressions)
5646.81tok (-) 'prompt_tokens'    	(0 improvements, 0 regressions)
407.78tok (+1003.12%) 'completion_tokens'	(15 improvements, 16 regressions)
6054.59tok (+1003.12%) 'total_tokens'     	(15 improvements, 16 regressions)
0.00$ (+00.00%) 'estimated_cost'   	(8 improvements, 10 regressions)

See results for Ragas at https://www.braintrust.dev/app/ihower/p/Course-202504/experiments/Ragas
