# ML2025 Homework 1 - Retrieval Augmented Generation with Agents

## Environment Setup

First, we will mount your own Google Drive and change the working directory.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Change the working directory to somewhere in your Google Drive.
# You could check the path by right clicking on the folder.
# %cd [change to the directory you prefer]
%cd /content/drive/MyDrive/QA

/content/drive/MyDrive/QA


In this section, we install the necessary Python packages and download the weights of the quantized LLaMA 3.1 8B model, along with the dataset. Note that the model weights are around 8 GB. If you are using your Google Drive as the working directory, make sure you have enough space for the model.
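
Because the weights are large, it can help to confirm that the working directory has room before starting the download. The sketch below uses `shutil.disk_usage` from the standard library. The helper name `has_free_space` and the 9 GB threshold (the roughly 8 GB file plus some margin) are assumptions for illustration, not part of the homework code.

```python
import shutil

def has_free_space(path: str = ".", required_gb: float = 9.0) -> bool:
    """Return True if the filesystem containing `path` has at least `required_gb` GB free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1e9

# Assumed threshold: about 8 GB of weights plus a little headroom.
if not has_free_space("."):
    print("Warning: less than 9 GB free here; the model download may fail.")
```

On a mounted Google Drive the reported figure may not match your Drive quota exactly, so treat this as a rough check only.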

In [None]:
!python3 -m pip install --no-cache-dir llama-cpp-python==0.3.4 --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu122
!python3 -m pip install googlesearch-python bs4 charset-normalizer requests-html lxml_html_clean

from pathlib import Path
if not Path('./Meta-Llama-3.1-8B-Instruct-Q8_0.gguf').exists():
    !wget https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf
if not Path('./public.txt').exists():
    !wget https://www.csie.ntu.edu.tw/~ulin/public.txt
if not Path('./private.txt').exists():
    !wget https://www.csie.ntu.edu.tw/~ulin/private.txt

Looking in indexes: https://pypi.org/simple, https://abetlen.github.io/llama-cpp-python/whl/cu122
Collecting llama-cpp-python==0.3.4
  Downloading https://github.com/abetlen/llama-cpp-python/releases/download/v0.3.4-cu122/llama_cpp_python-0.3.4-cp311-cp311-linux_x86_64.whl (445.2 MB)
Collecting diskcache>=5.6.1 (from llama-cpp-python==0.3.4)
  Downloading diskcache-5.6.3-py3-none-any.whl.metadata (20 kB)
Downloading diskcache-5.6.3-py3-none-any.whl (45 kB)
Installing collected packages: diskcache, llama-cpp-python
Successfully installed diskcache-5.6.3 llama-cpp-python-0.3.4
Collecting googlesearch-python
  Downloading googlesearch_python-1.3.0-py3-none-any.whl.metadata (3.4 kB)
Collecting bs4
  Downloading bs4-0.0.2-py2.py3-none-any.whl.metada

In [None]:
import torch
if not torch.cuda.is_available():
    raise Exception('You are not using the GPU runtime. Change it first or you will suffer from the super slow inference speed!')
else:
    print('You are good to go!')

You are good to go!


## Prepare the LLM and LLM utility function

By default, we use the quantized version of LLaMA 3.1 8B. You can get full marks on this homework with the provided LLM and utility function, but you are welcome to try other LLMs as well.

In the following code block, we first load the downloaded model weights onto the GPU.
Then, we implement the generate_response() function so that you can get the generated response from the LLM more easily.

You can ignore the warning "llama_new_context_with_model: n_ctx_per_seq (16384) < n_ctx_train (131072) -- the full capacity of the model will not be utilized".

In [None]:
from llama_cpp import Llama

# Load the model onto GPU
llama3 = Llama(
    "./Meta-Llama-3.1-8B-Instruct-Q8_0.gguf",
    verbose=False,
    n_gpu_layers=-1,
    n_ctx=16384,    # This argument is how many tokens the model can take. The longer the better, but it will consume more memory. 16384 is a proper value for a GPU with 16GB VRAM.
)

def generate_response(_model: Llama, _messages: list) -> str:
    '''
    Run inference on the model with the given chat messages (a list of {"role", "content"} dicts).
    '''
    _output = _model.create_chat_completion(
        _messages,
        stop=["<|eot_id|>", "<|end_of_text|>"],
        max_tokens=512,    # This argument is how many tokens the model can generate, you can change it and observe the differences.
        temperature=0,      # This argument is the randomness of the model. 0 means no randomness. You will get the same result with the same input every time. You can try to set it to different values.
        repeat_penalty=2.0,
    )["choices"][0]["message"]["content"]
    return _output

llama_new_context_with_model: n_ctx_per_seq (16384) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
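
The warning reflects a memory trade-off: a longer context needs a larger key/value cache on the GPU. The rough estimate below is only a sketch. The architecture numbers (32 layers, 8 key/value heads, head dimension 128) and the 16-bit cache entries are assumptions about LLaMA 3.1 8B and the llama.cpp defaults, and the function name `kv_cache_bytes` is hypothetical.

```python
def kv_cache_bytes(
    n_ctx: int,
    n_layers: int = 32,        # assumed number of transformer layers
    n_kv_heads: int = 8,       # assumed number of key/value heads (grouped-query attention)
    head_dim: int = 128,       # assumed dimension of each head
    bytes_per_value: int = 2,  # assumed 16-bit cache entries
) -> int:
    """Estimate the KV-cache size: one key and one value vector per layer, per KV head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_ctx

for n_ctx in (4096, 16384, 131072):
    print(f"n_ctx={n_ctx:>6}: {kv_cache_bytes(n_ctx) / 2**30:.1f} GiB")
```

Under these assumptions, 16384 tokens need about 2 GiB for the cache on top of the model weights, while the full 131072-token window would need about 16 GiB, which is why the notebook settles for the shorter context on a 16 GB GPU.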


## Search Tool

The TA has implemented a search tool for you to search certain keywords using Google Search. You can use this tool to search for the relevant **web pages** for the given question. The search tool can be integrated in the following sections.

In [None]:
from typing import List
from googlesearch import search as _search
from bs4 import BeautifulSoup
from charset_normalizer import detect
import asyncio
from requests_html import AsyncHTMLSession
import urllib3
urllib3.disable_warnings()

async def worker(s:AsyncHTMLSession, url:str):
    try:
        header_response = await asyncio.wait_for(s.head(url, verify=False), timeout=10)
        if 'text/html' not in header_response.headers.get('Content-Type', ''):
            return None
        r = await asyncio.wait_for(s.get(url, verify=False), timeout=10)
        return r.text
    except Exception:
        return None

async def get_htmls(urls):
    session = AsyncHTMLSession()
    tasks = (worker(session, url) for url in urls)
    return await asyncio.gather(*tasks)

async def search(keyword: str, n_results: int=3) -> List[str]:
    '''
    This function searches for the keyword and returns the text content of the first n_results web pages.

    Warning: You may get HTTP 429 errors if you search too many times within a short period. This is unavoidable, so request more results at once at your own risk.
    Google does not publish the rate limit, so there is not much we can do other than change the IP address or wait until Google lifts the ban (we do not know how long the penalty lasts either).
    '''
    keyword = keyword[:100]
    # First, search the keyword and get the results. Also, get 2 times more results in case some of them are invalid.
    results = list(_search(keyword, n_results * 2, lang="zh", unique=True))
    # Then, get the HTML from the results. Also, the helper function will filter out the non-HTML urls.
    results = await get_htmls(results)
    # Filter out the None values.
    results = [x for x in results if x is not None]
    # Parse the HTML.
    results = [BeautifulSoup(x, 'html.parser') for x in results]
    # Get the text from the HTML and remove the spaces. Also, filter out the non-utf-8 encoding.
    results = [''.join(x.get_text().split()) for x in results if detect(x.encode()).get('encoding') == 'utf-8']
    # Return the first n results.
    return results[:n_results]
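
Given the HTTP 429 warning in the docstring, one common mitigation is to retry a failed search with exponential backoff. The wrapper below is a minimal sketch, assuming the failure surfaces as an exception from the search function. The name `search_with_retry` and its `max_attempts` and `base_delay` parameters are hypothetical and not part of the TA's tool.

```python
import asyncio
import random
from typing import Awaitable, Callable, List

async def search_with_retry(
    search_fn: Callable[..., Awaitable[List[str]]],
    keyword: str,
    n_results: int = 3,
    max_attempts: int = 3,
    base_delay: float = 5.0,
) -> List[str]:
    """Call `search_fn`, retrying with exponential backoff; return [] if every attempt fails."""
    for attempt in range(max_attempts):
        try:
            return await search_fn(keyword, n_results=n_results)
        except Exception as exc:  # e.g. an HTTP 429 raised by the underlying search library
            if attempt == max_attempts - 1:
                break
            # Double the wait on each attempt and add jitter to avoid a fixed retry rhythm.
            delay = base_delay * (2 ** attempt) * random.uniform(1.0, 1.5)
            print(f"Search failed ({exc!r}); retrying in {delay:.1f}s")
            await asyncio.sleep(delay)
    return []
```

In the notebook this could be called as `results = await search_with_retry(search, "your keyword")`. Returning an empty list on failure lets the pipeline fall back to answering without retrieved context.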

## Test the LLM inference pipeline

In [None]:
# You can try out different questions here.
test_question='請問誰是 Taylor Swift？'

messages = [
    {"role": "system", "content": "你是 LLaMA-3.1-8B，是用來回答問題的 AI。使用中文時只會使用繁體中文來回問題。"},    # System prompt
    {"role": "user", "content": test_question}, # User prompt
]

print(generate_response(llama3, messages))

泰勒絲（Taylor Swift）是一位美國歌手、詞曲作家和音樂製作人。她出生於1989年，來自田納西州。她的音乐风格从乡村摇滚发展到流行搖擺，並且她被誉为当代最成功的女艺人的之一。

泰勒絲早期在鄉郊小鎮演唱會時開始發展音樂事業，她推出了多張專輯，包括《Taylor Swift》、《Fearless》，以及後來更為知名的大熱作如 《1989》（2014年）、_reputation（）和 _Lover （）。她的歌曲經常探討愛情、友誼及自我成長等主題。

泰勒絲獲得了許多獎項，包括13座格萊美奖，並且是史上最快達到百萬銷量的女藝人之一。


## Agents

The TA has implemented the Agent class for you. You can use this class to create agents that can interact with the LLM model. The Agent class has the following attributes and methods:
- Attributes:
    - role_description: The role of the agent. For example, if you want this agent to be a history expert, you can set the role_description to "You are a history expert. You will only answer questions based on what really happened in the past. Do not generate any answer if you don't have reliable sources.".
    - task_description: The task of the agent. For example, if you want this agent to answer questions only in yes/no, you can set the task_description to "Please answer the following question in yes/no. Explanations are not needed."
    - llm: Just an indicator of the LLM model used by the agent.
- Method:
    - inference: This method takes a message as input and returns the generated response from the LLM model. The message will first be formatted into proper input for the LLM model. (This is where you can set some global instructions like "Please speak in a polite manner" or "Please provide a detailed explanation".) The generated response will be returned as the output.

In [None]:
class LLMAgent():
    def __init__(self, role_description: str, task_description: str, llm:str="bartowski/Meta-Llama-3.1-8B-Instruct-GGUF"):
        self.role_description = role_description   # Role describes who this agent should act as, e.g. a history expert or a manager.
        self.task_description = task_description    # Task description tells the agent which task it should solve.
        self.llm = llm  # LLM indicates which LLM backend this agent is using.
    def inference(self, message:str) -> str:
        if self.llm == 'bartowski/Meta-Llama-3.1-8B-Instruct-GGUF': # If using the default one.
            # TODO: Design the system prompt and user prompt here.
            # Format the messages first.
            messages = [
                {"role": "system", "content": f"{self.role_description}"},  # Hint: you may want the agents to speak Traditional Chinese only.
                {"role": "user", "content": f"{self.task_description}\n{message}"}, # Hint: you may want the agents to clearly distinguish the task descriptions and the user messages. A proper separator text rather than a simple line break is recommended.
            ]
            return generate_response(llama3, messages)
        else:
            # TODO: If you want to use LLMs other than the given one, please implement the inference part on your own.
            return ""
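
The hint about separating the task description from the user message can be illustrated with labelled sections. This is only one possible layout: the helper name `build_messages` and the `### Task` / `### Input` headers are assumptions, and any clear delimiter would serve the same purpose.

```python
from typing import Dict, List

def build_messages(role_description: str, task_description: str, message: str) -> List[Dict[str, str]]:
    """Build chat messages with labelled sections so the task and the user input stay distinct."""
    user_content = (
        "### Task\n"
        f"{task_description.strip()}\n\n"
        "### Input\n"
        f"{message.strip()}"
    )
    return [
        {"role": "system", "content": role_description.strip()},
        {"role": "user", "content": user_content},
    ]
```

Inside `inference`, the `messages` list could then be replaced by `build_messages(self.role_description, self.task_description, message)`.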

TODO: Design the role description and task description for each agent.

In [None]:
# TODO: Design the role and task description for each agent.

# This agent may help you filter out the irrelevant parts in question descriptions.
question_extraction_agent = LLMAgent(
    role_description="",
    task_description="",
)

# This agent may help you extract the keywords in a question so that the search tool can find more accurate results.
keyword_extraction_agent = LLMAgent(
    role_description="",
    task_description="",
)

# This agent is the core component that answers the question.
qa_agent = LLMAgent(
    role_description="你是 LLaMA-3.1-8B，是用來回答問題的 AI。使用中文時只會使用繁體中文來回問題。",
    task_description="請回答以下問題：",
)

## RAG pipeline

TODO: Implement the RAG pipeline.

Please refer to the homework description slides for hints.

Also, there may be more heuristics (e.g. classifying the questions by length, determining whether a question needs a search, or reconfirming the answer before returning it to the user) that are not shown in the flow charts. You can use your creativity to come up with a better solution!

- Naive approach (simple baseline)

    ![](https://www.csie.ntu.edu.tw/~ulin/naive.png)

- Naive RAG approach (medium baseline)

    ![](https://www.csie.ntu.edu.tw/~ulin/naive_rag.png)

- RAG with agents (strong baseline)

    ![](https://www.csie.ntu.edu.tw/~ulin/rag_agent.png)
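
As a rough illustration of the "RAG with agents" idea, the sketch below chains question extraction, keyword extraction, search, and answering. The exact flow in the chart may differ. The function name `rag_pipeline`, its parameters, the prompt wording, and the 6000-character cap are all assumptions, and the steps are passed in as callables so the example stays self-contained.

```python
import asyncio
from typing import Awaitable, Callable, List

async def rag_pipeline(
    question: str,
    extract_question: Callable[[str], str],
    extract_keywords: Callable[[str], str],
    search_fn: Callable[[str], Awaitable[List[str]]],
    answer: Callable[[str], str],
    max_context_chars: int = 6000,
) -> str:
    """Hypothetical RAG flow: clean the question, search, then answer with the retrieved text."""
    core_question = extract_question(question)        # drop irrelevant background
    keywords = extract_keywords(core_question)        # short query for the search tool
    pages = await search_fn(keywords)                 # retrieved page texts (may be empty)
    context = "\n\n".join(pages)[:max_context_chars]  # crude character-level truncation
    if not context:
        return answer(core_question)                  # fall back to the naive approach
    prompt = f"Reference material:\n{context}\n\nQuestion:\n{core_question}"
    return answer(prompt)
```

With the objects defined in this notebook, a call might look like `await rag_pipeline(question, question_extraction_agent.inference, keyword_extraction_agent.inference, search, qa_agent.inference)`.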

In [None]:
async def pipeline(question: str) -> str:
    # TODO: Implement your pipeline.
    # Currently, it only feeds the question directly to the LLM.
    # You may want to get the final results through multiple inferences.
    # A quick reminder: make sure your input length stays within the model's context window (16384 tokens). You may need to truncate excessive text.
    return qa_agent.inference(question)
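
To keep retrieved text within the context window, one option is to truncate by token count rather than by characters. The helper below is a sketch: the name `truncate_to_tokens` is hypothetical, and the tokenizer is passed in as two callables so the example does not depend on a loaded model.

```python
from typing import Callable, List

def truncate_to_tokens(
    text: str,
    max_tokens: int,
    tokenize: Callable[[str], List[int]],
    detokenize: Callable[[List[int]], str],
) -> str:
    """Return `text` unchanged if it fits in `max_tokens`, else keep only the first `max_tokens` tokens."""
    tokens = tokenize(text)
    if len(tokens) <= max_tokens:
        return text
    return detokenize(tokens[:max_tokens])
```

With llama-cpp-python, the callables could plausibly be `lambda s: llama3.tokenize(s.encode("utf-8"), add_bos=False)` and `lambda t: llama3.detokenize(t).decode("utf-8", errors="ignore")`. Remember to leave room in the 16384-token budget for the system prompt, the question, and the 512 generated tokens.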

## Answer the questions using your pipeline!

Since Colab has usage limits, you might get disconnected. The following code saves your answer to each question. If you have mounted your Google Drive as instructed, you can simply rerun the whole notebook to resume where you left off.

In [None]:
import pandas as pd
df = pd.read_csv('./qa.csv', encoding='utf-8')

In [None]:
df.head()

| Unnamed: 0 | 问题 | 答案 |
|---|---|---|
| 0 | 校歌為學校（包括小學、中學、大學等）宣告或者規定的代表該校的歌曲。用於體現該校的治學理念、辦... | 「虎山雄風飛揚」是「光華國小」學校的校歌歌詞。 |
| 1 | 2025年初，NCC透過行政命令，規定民眾如果透過境外郵購無線鍵盤、滑鼠、藍芽耳機..等自用... | 每案一律加收審查費750元。 |
| 2 | 第一代 iPhone 是由哪位蘋果 CEO 發表？ | 第一代 iPhone 是由Steve Jobs（史提夫賈伯斯）發表。 |
| 3 | 台灣大學進階英文免修申請規定中，托福網路測驗 TOEFL iBT 要達到多少分才能申請？ | 台灣大學進階英文免修申請規定中，托福網路測驗 TOEFL iBT 72分才能申請 |
| 4 | Rugby Union 中觸地 try 可得幾分？ | Rugby Union 中觸地 try 可得5分 |


In [None]:
from pathlib import Path

PREFIX = "test"

for idx, row in df.iterrows():
    question = row['问题']
    # answer   = row['答案']
    # print(f"{idx + 1} | 问题: {question} | 答案: {answer}")
    if Path(f"./{PREFIX}_{idx + 1}.txt").exists():
        continue  # Already answered in a previous run; skip it.
    answer = await pipeline(question)
    answer = answer.replace('\n',' ')
    print(idx + 1, answer)
    with open(f'./{PREFIX}_{idx + 1}.txt', 'w', encoding='utf-8') as output_f:
        print(answer, file=output_f)

In [None]:
# Combine the results into one file.
with open(f'./{PREFIX}.txt', 'w', encoding='utf-8') as output_f:
    for i in range(1, 91):  # `i` avoids shadowing the built-in `id`
        with open(f'./{PREFIX}_{i}.txt', 'r', encoding='utf-8') as input_f:
            answer = input_f.readline().strip()
            print(answer, file=output_f)
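
If a run was interrupted, some per-question files may not exist yet, and the combining step above would then stop with a `FileNotFoundError`. The variant below is a sketch that writes an empty line for each missing answer and reports which ones are missing. The function name `combine_answers` and its parameters are hypothetical.

```python
from pathlib import Path
from typing import List

def combine_answers(prefix: str, n_questions: int, directory: str = ".") -> List[int]:
    """Merge per-question files into `<prefix>.txt`; return the 1-based indices that were missing."""
    base = Path(directory)
    missing: List[int] = []
    lines: List[str] = []
    for i in range(1, n_questions + 1):
        path = base / f"{prefix}_{i}.txt"
        if path.exists():
            text = path.read_text(encoding="utf-8")
            # Keep only the first line, mirroring readline().strip() in the original cell.
            lines.append(text.splitlines()[0].strip() if text.strip() else "")
        else:
            missing.append(i)
            lines.append("")  # keep line i aligned with question i
    (base / f"{prefix}.txt").write_text("\n".join(lines) + "\n", encoding="utf-8")
    return missing
```

A call such as `combine_answers(PREFIX, 90)` would return an empty list once every question has been answered.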

## Evaluation
Use the Gemini API to evaluate the correctness of the answers.

In [None]:
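# Check the model list.
# This cell assumes `genai` has already been imported and configured with an API key (see the "Test Gemini API" cell).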
models = genai.list_models()
for m in models:
    print(m.name, m.supported_generation_methods)


models/embedding-gecko-001 ['embedText', 'countTextTokens']
models/gemini-1.0-pro-vision-latest ['generateContent', 'countTokens']
models/gemini-pro-vision ['generateContent', 'countTokens']
models/gemini-1.5-pro-latest ['generateContent', 'countTokens']
models/gemini-1.5-pro-002 ['generateContent', 'countTokens', 'createCachedContent']
models/gemini-1.5-pro ['generateContent', 'countTokens']
models/gemini-1.5-flash-latest ['generateContent', 'countTokens']
models/gemini-1.5-flash ['generateContent', 'countTokens']
models/gemini-1.5-flash-002 ['generateContent', 'countTokens', 'createCachedContent']
models/gemini-1.5-flash-8b ['createCachedContent', 'generateContent', 'countTokens']
models/gemini-1.5-flash-8b-001 ['createCachedContent', 'generateContent', 'countTokens']
models/gemini-1.5-flash-8b-latest ['createCachedContent', 'generateContent', 'countTokens']
models/gemini-2.5-pro-preview-03-25 ['generateContent', 'countTokens', 'createCachedContent', 'batchGenerateContent']
models/ge

In [None]:
# Test Gemini API

import os
import google.generativeai as genai

# Read the API key from the environment instead of hard-coding it in the notebook.
GEMINI_API_KEY = os.environ["GEMINI_API_KEY"]
genai.configure(api_key=GEMINI_API_KEY)

# Option: use the Gemini 1.5 Pro model
# model = genai.GenerativeModel(model_name="gemini-1.5-pro")
model = genai.GenerativeModel(model_name="gemini-2.5-flash")


prompt = "請用繁體中文解釋什麼是量子纏結。"
response = model.generate_content(prompt)

print(response.text)


量子纏結（Quantum Entanglement），是量子力學中一種非常奇特且違反直覺的現象，它描述了兩個或多個量子粒子（如光子、電子等）之間的一種特殊關聯。

簡單來說，當這些粒子在某種方式下相互作用後，它們會形成一種深刻的「連結」或「纏繞」，即使它們在空間上被分開，甚至相隔很遠很遠的距離，它們的量子狀態（例如自旋、極化、能量等）仍然是相互關聯的。

以下是量子纏結的幾個核心特點：

1.  **非定域性（Non-locality）**：這是最令人驚奇的一點。一旦粒子發生纏結，它們就變成了一個「整體」。當你對其中一個纏結粒子進行測量時，無論它與另一個纏結粒子相隔多遠，另一個粒子的狀態會「瞬間」確定下來，並與第一個粒子測量到的狀態呈現出某種特定的關聯性。這種影響的傳遞似乎沒有時間延遲，這讓愛因斯坦稱之為「鬼魅般的超距作用」（spooky action at a distance）。

2.  **不確定性（Uncertainty）與疊加態（Superposition）**：在被測量之前，纏結粒子的特定性質（例如一個粒子的自旋方向）並不是確定的，而是處於所有可能狀態的疊加態。例如，一個電子的自旋可以是「向上」和「向下」的疊加。一旦你測量了其中一個粒子，它的疊加態就會「坍縮」（collapse）到某個確定的狀態，而與之纏結的另一個粒子也會立即坍縮到相應的確定狀態。

3.  **無法用於超光速通訊**：雖然纏結的影響是瞬間的，但這並不意味著可以利用它來實現超光速通訊。因為你無法事先知道測量第一個粒子時會得到什麼結果（測量結果是隨機的），你需要透過傳統的通訊方式（例如電話、網路）來告訴遠方的人你測量到了什麼，然後他們才能確認另一個粒子的狀態。所以，信息本身仍然無法超越光速傳遞。

**一個簡單的比喻（但非完全精確）：**

想像你有兩枚特殊的硬幣，它們總是處於相反的狀態：如果一枚是正面，另一枚就一定是反面；如果一枚是反面，另一枚就一定是正面。但在被觀看之前，這兩枚硬幣都同時是正面和反面的「疊加態」。當你打開其中一個盒子看到了一枚硬幣是正面，那麼無論相隔多遠，另一枚硬幣一定會立即「決定」為反面。量子纏結比這個比喻更為深刻，因為在被觀看之前，單個硬幣本身並沒有確定的狀態。

**量子纏結的重要性：**

量子纏結是量子力學中最 fundamental 的現象之一，也是許多前

In [None]:
import os
import google.generativeai as genai

# Set the API key (read from the environment rather than hard-coding it).
GEMINI_API_KEY = os.environ["GEMINI_API_KEY"]
genai.configure(api_key=GEMINI_API_KEY)

# Initialize the model.
model = genai.GenerativeModel(model_name="gemini-2.5-flash")

# Evaluation function.
def evaluate_qa(question, gt_answer, pred_answer):
    prompt = f"""
你是一个严谨的评测专家，请判断以下问题的两个答案是否一致，并给出 1~5 分评分与简短评语：

【问题】
{question}

【标准答案】
{gt_answer}

【学生回答】
{pred_answer}

评分范围：
5 = 完全正确；
4 = 大致正确，略有差异；
3 = 一般正确，有部分缺漏；
2 = 有重大偏差；
1 = 完全错误。

請用如下格式回答：
分數: X
評語: Y
"""

    response = model.generate_content(prompt)
    try:
        return response.text
    except ValueError:
        # response.text raises ValueError when the model returns no valid content.
        print("[Model returned no valid content]")
        print("finish_reason:", response.candidates[0].finish_reason)
        return ""


In [None]:
# Test the evaluation function.
question = "什麼是量子糾纏？"
gt_answer = "量子糾纏是指兩個或多個粒子的量子態彼此關聯，即使它們距離遙遠，也能瞬間影響彼此。"
pred_answer = "量子糾纏是一種物理現象，兩個粒子會互相感應，即使在宇宙兩端也會同步改變。"

print(evaluate_qa(question, gt_answer, pred_answer))


分數: 4
評語: 學生回答抓住了量子糾纏的核心概念，即粒子之間超越距離的即時關聯。雖然用詞如「互相感應」和「同步改變」不如標準答案中的「量子態彼此關聯」和「瞬間影響」來得嚴謹和專業，但其表達的意思是基本一致的。差異主要在於術語的精確度，而非概念上的錯誤。
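
Because the evaluator replies in free text, matching the exact string "分數: 5" can miss variants such as a full-width colon or simplified characters. The parser below is a small sketch that extracts the numeric score with a regular expression. The function name `parse_score` and the set of accepted labels are assumptions based on the reply format requested in the prompt.

```python
import re
from typing import Optional

# Accept traditional or simplified labels and either an ASCII or a full-width colon.
_SCORE_PATTERN = re.compile(r"(?:分數|分数|評分|评分)\s*[:：]\s*([1-5])")

def parse_score(response: str) -> Optional[int]:
    """Extract the 1-5 score from the evaluator's reply, or None if no score line is found."""
    match = _SCORE_PATTERN.search(response)
    return int(match.group(1)) if match else None
```

A reply counts as fully correct when `parse_score(response) == 5`, and a `None` result flags replies that need a manual look.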


In [None]:
import time
import random

total_count, correct_count = 0, 0

for idx, row in df.iterrows():
    question = row['问题']
    gt_answer = row['答案']  # ground truth
    with open(f'./{PREFIX}_{idx + 1}.txt', 'r', encoding='utf-8') as f:
        pred_answer = f.read().strip()  # prediction

    time.sleep(random.uniform(1, 2))  # random 1-2 second delay between API calls
    response = evaluate_qa(question, gt_answer, pred_answer)

    print(f'问题{idx + 1}：{question}')
    print(response)

    total_count += 1
    if "分數: 5" in response or "评分: 5" in response:
        correct_count += 1

print(f"\n✅ 完全正確的題目數：{correct_count}/{total_count}")

问题：校歌為學校（包括小學、中學、大學等）宣告或者規定的代表該校的歌曲。用於體現該校的治學理念、辦學理想等學校文化。「虎山雄風飛揚」是哪間學校的校歌歌詞？
分數: 1
評語: 学生未给出任何回答，故与标准答案完全不符。
问题：2025年初，NCC透過行政命令，規定民眾如果透過境外郵購無線鍵盤、滑鼠、藍芽耳機..等自用產品回台，每案一律加收審查費多少錢？
分數: 1
評語: 学生回答表示“无法提供具体数据”，未能回答问题，与标准答案直接提供具体金额的做法完全不符，属于未能正确作答。
问题：第一代 iPhone 是由哪位蘋果 CEO 發表？
分數: 5
評語: 兩個答案都準確無誤地指出了發表第一代 iPhone 的人物是 Steve Jobs。儘管中文譯名在繁簡用字和習慣上存在細微差異（「史提夫賈伯斯」為繁體中文常用譯法，「史蒂夫·乔布斯」為簡體中文常用譯法），但兩者均是該人物的正確中文音譯，所指代的對象完全一致，資訊內容上沒有任何偏差或缺漏。因此，從答案的實質內容來看，兩者是完全一致且正確的。
问题：台灣大學進階英文免修申請規定中，托福網路測驗 TOEFL iBT 要達到多少分才能申請？
分數: 2
評語: 學生回答中托福分數的具體數字與標準答案不符。標準答案明確指出為72分，而學生回答為80分。在要求精確數字的問題中，此偏差屬於重大事實錯誤。
问题：Rugby Union 中觸地 try 可得幾分？
分數: 5
評語: 学生回答与标准答案在所有关键信息上（运动类型、得分动作、得分）完全一致，且表达准确无误。虽然措辞略有不同，但不影响其正确性。
问题：卑南族是位在臺東平原的一個原住民族，以驍勇善戰、擅長巫術聞名，曾經統治整個臺東平原。相傳卑南族的祖先發源自 ruvuwa'an，該地位於現今的哪個行政區劃？
分數: 4
評語: 學生答案準確指出了「臺東縣太麻里鄉」這個核心地點，與標準答案一致。但增加了「附近」一詞，使得答案的精確度略低於標準答案的直接定位，不過對於古地名的描述，這種措辭也無傷大雅。多餘的「臺灣省」在現代行政區劃語境下較少使用，但不影響主要資訊的正確性。因此，屬於大致正確，略有差異。
问题：熊信寬，藝名熊仔，是臺灣饒舌創作歌手。2022年獲得第33屆金曲獎最佳作詞人獎，2023年獲得第34屆金曲獎最佳華語專輯獎。請問熊仔的碩班指導教授為？
分數: 1
評語