<a href="https://colab.research.google.com/github/Neptune1729/NTU-ML-2025-spring/blob/main/mlhw1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ML2025 Homework 1 - Retrieval Augmented Generation with Agents

## Environment Setup

First, we will mount your own Google Drive and change the working directory.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
# Change the working directory to somewhere in your Google Drive.
# You could check the path by right clicking on the folder.
#%cd [change to the directory you prefer]
%cd drive/MyDrive/NTU-ML-2025/hw1

/content/drive/MyDrive/NTU-ML-2025/hw1


In this section, we install the necessary python packages and download model weights of the quantized version of LLaMA 3.1 8B. Also, download the dataset. Note that the model weight is around 8GB. If you are using your Google Drive as the working directory, make sure you have enough space for the model.

In [3]:
!python3 -m pip install --no-cache-dir llama-cpp-python==0.3.4 --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu122
!python3 -m pip install googlesearch-python bs4 charset-normalizer requests-html lxml_html_clean

from pathlib import Path
if not Path('./Meta-Llama-3.1-8B-Instruct-Q8_0.gguf').exists():
    !wget https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf
if not Path('./public.txt').exists():
    !wget https://www.csie.ntu.edu.tw/~ulin/public.txt
if not Path('./private.txt').exists():
    !wget https://www.csie.ntu.edu.tw/~ulin/private.txt

Looking in indexes: https://pypi.org/simple, https://abetlen.github.io/llama-cpp-python/whl/cu122
Collecting llama-cpp-python==0.3.4
  Downloading https://github.com/abetlen/llama-cpp-python/releases/download/v0.3.4-cu122/llama_cpp_python-0.3.4-cp312-cp312-linux_x86_64.whl (445.2 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m445.2/445.2 MB[0m [31m12.3 MB/s[0m eta [36m0:00:00[0m
Collecting diskcache>=5.6.1 (from llama-cpp-python==0.3.4)
  Downloading diskcache-5.6.3-py3-none-any.whl.metadata (20 kB)
Downloading diskcache-5.6.3-py3-none-any.whl (45 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m45.5/45.5 kB[0m [31m131.9 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: diskcache, llama-cpp-python
Successfully installed diskcache-5.6.3 llama-cpp-python-0.3.4
Collecting googlesearch-python
  Downloading googlesearch_python-1.3.0-py3-none-any.whl.metadata (3.4 kB)
Collecting bs4
  Downloading bs4-0.0.2-py2.py3-none-any.whl.metad

In [4]:
import torch
if not torch.cuda.is_available():
    raise Exception('You are not using the GPU runtime. Change it first or you will suffer from the super slow inference speed!')
else:
    print('You are good to go!')

You are good to go!


## Prepare the LLM and LLM utility function

By default, we will use the quantized version of LLaMA 3.1 8B. you can get full marks on this homework by using the provided LLM and LLM utility function. You can also try out different LLM models.

In the following code block, we will load the downloaded LLM model weights onto the GPU first.
Then, we implemented the generate_response() function so that you can get the generated response from the LLM model more easily.

You can ignore "llama_new_context_with_model: n_ctx_per_seq (16384) < n_ctx_train (131072) -- the full capacity of the model will not be utilized" warning.

In [5]:
from llama_cpp import Llama

# Load the model onto GPU
llama3 = Llama(
    "./Meta-Llama-3.1-8B-Instruct-Q8_0.gguf",
    verbose=False,
    n_gpu_layers=-1,
    n_ctx=16384,    # This argument is how many tokens the model can take. The longer the better, but it will consume more memory. 16384 is a proper value for a GPU with 16GB VRAM.
)

def generate_response(_model: Llama, _messages: str) -> str:
    '''
    This function will inference the model with given messages.
    '''
    _output = _model.create_chat_completion(
        _messages,
        stop=["<|eot_id|>", "<|end_of_text|>"],
        max_tokens=512,    # This argument is how many tokens the model can generate, you can change it and observe the differences.
        temperature=0.1,      # This argument is the randomness of the model. 0 means no randomness. You will get the same result with the same input every time. You can try to set it to different values.
        repeat_penalty=2.0,
    )["choices"][0]["message"]["content"]
    return _output

llama_new_context_with_model: n_ctx_per_seq (16384) < n_ctx_train (131072) -- the full capacity of the model will not be utilized


## Search Tool

The TA has implemented a search tool for you to search certain keywords using Google Search. You can use this tool to search for the relevant **web pages** for the given question. The search tool can be integrated in the following sections.

In [6]:
from typing import List
from googlesearch import search as _search
from bs4 import BeautifulSoup
from charset_normalizer import detect
import asyncio
from requests_html import AsyncHTMLSession
import urllib3
urllib3.disable_warnings()

async def worker(s:AsyncHTMLSession, url:str):
    try:
        header_response = await asyncio.wait_for(s.head(url, verify=False), timeout=10)
        if 'text/html' not in header_response.headers.get('Content-Type', ''):
            return None
        r = await asyncio.wait_for(s.get(url, verify=False), timeout=10)
        return r.text
    except:
        return None

async def get_htmls(urls):
    session = AsyncHTMLSession()
    tasks = (worker(session, url) for url in urls)
    return await asyncio.gather(*tasks)

async def search(keyword: str, n_results: int=3) -> List[str]:
    '''
    This function will search the keyword and return the text content in the first n_results web pages.

    Warning: You may suffer from HTTP 429 errors if you search too many times in a period of time. This is unavoidable and you should take your own risk if you want to try search more results at once.
    The rate limit is not explicitly announced by Google, hence there's not much we can do except for changing the IP or wait until Google unban you (we don't know how long the penalty will last either).
    '''
    keyword = keyword[:100]
    # First, search the keyword and get the results. Also, get 2 times more results in case some of them are invalid.
    results = list(_search(keyword, n_results * 2, lang="zh", unique=True))
    # Then, get the HTML from the results. Also, the helper function will filter out the non-HTML urls.
    results = await get_htmls(results)
    # Filter out the None values.
    results = [x for x in results if x is not None]
    # Parse the HTML.
    results = [BeautifulSoup(x, 'html.parser') for x in results]
    # Get the text from the HTML and remove the spaces. Also, filter out the non-utf-8 encoding.
    results = [''.join(x.get_text().split()) for x in results if detect(x.encode()).get('encoding') == 'utf-8']
    # Return the first n results.
    return results[:n_results]

## Test the LLM inference pipeline

In [7]:
# You can try out different questions here.
test_question='請問誰是 Taylor Swift？'

messages = [
    {"role": "system", "content": "你是 LLaMA-3.1-8B，是用來回答問題的 AI。使用中文時只會使用繁體中文來回問題。"},    # System prompt
    {"role": "user", "content": test_question}, # User prompt
]

print(generate_response(llama3, messages))

泰勒絲（Taylor Swift）是一位美國歌手、詞曲作家和音樂製作人。她出生於1989年，來自田納西州。她的音乐风格从乡村摇滚发展到流行搖擺，並且她被誉为当代最成功的女艺人的之一。

泰勒絲早期以鄉郊風味而聞名，她在2006年的首張專輯《Taylor Swift》獲得了商業上的成績。隨後，she推出了多张专辑，其中包括 《Fearless》（2010）、_1989（）和 _Reputation （）。她的歌曲經常探討愛情、友誼以及個人生活的主题。

泰勒絲在音樂界取得許多少項榮譽，她獲得了13座格萊美獎，成為史上最年輕的人類贏得該殊荣。


## Agents

The TA has implemented the Agent class for you. You can use this class to create agents that can interact with the LLM model. The Agent class has the following attributes and methods:
- Attributes:
    - role_description: The role of the agent. For example, if you want this agent to be a history expert, you can set the role_description to "You are a history expert. You will only answer questions based on what really happened in the past. Do not generate any answer if you don't have reliable sources.".
    - task_description: The task of the agent. For example, if you want this agent to answer questions only in yes/no, you can set the task_description to "Please answer the following question in yes/no. Explanations are not needed."
    - llm: Just an indicator of the LLM model used by the agent.
- Method:
    - inference: This method takes a message as input and returns the generated response from the LLM model. The message will first be formatted into proper input for the LLM model. (This is where you can set some global instructions like "Please speak in a polite manner" or "Please provide a detailed explanation".) The generated response will be returned as the output.

In [8]:
class LLMAgent():
    def __init__(self, role_description: str, task_description: str, llm:str="bartowski/Meta-Llama-3.1-8B-Instruct-GGUF"):
        self.role_description = role_description   # Role means who this agent should act like. e.g. the history expert, the manager......
        self.task_description = task_description    # Task description instructs what task should this agent solve.
        self.llm = llm  # LLM indicates which LLM backend this agent is using.
    def inference(self, message:str) -> str:
        if self.llm == 'bartowski/Meta-Llama-3.1-8B-Instruct-GGUF': # If using the default one.
            # TODO: Design the system prompt and user prompt here.
            # Format the messsages first.
            messages = [
                {"role": "system", "content": f"{self.role_description}"},  # Hint: you may want the agents to speak Traditional Chinese only.
                {"role": "user", "content": f"{self.task_description}\n{message}"}, # Hint: you may want the agents to clearly distinguish the task descriptions and the user messages. A proper seperation text rather than a simple line break is recommended.
            ]
            return generate_response(llama3, messages)
        else:
            # TODO: If you want to use LLMs other than the given one, please implement the inference part on your own.
            return ""

TODO: Design the role description and task description for each agent.

In [9]:
# TODO: Design the role and task description for each agent.

# This agent may help you filter out the irrelevant parts in question descriptions.
question_extraction_agent = LLMAgent(
    role_description="你是一名善於理解與重述問題的助理。"
        "你的工作是從冗長、雜亂或包含許多無關背景資訊的描述中，"
        "提煉出真正需要回答的核心問題。"
        "使用中文時，只會使用繁體中文回答。",
    task_description="請閱讀使用者提供的原始問題描述，完成以下工作：\n"
        "1. 忽略和提問無關的情緒化語句、閒聊、垃圾話或過多背景細節。\n"
        "2. 找出使用者真正想解決的核心問題是什麼。\n"
        "3. 用 1～2 句清楚、完整的問句重寫這個核心問題。\n"
        "4. 不要回答問題，只需要輸出重寫後的核心問題。\n"
        "5. 請以「核心問題：XXX」的格式輸出，並全程使用繁體中文。",
)

keyword_extraction_agent = LLMAgent(
    role_description="你是一名專門為搜尋系統提取關鍵詞的專家。"
        "你熟悉如何從自然語言問題中抽取適合檢索的關鍵詞與關鍵短語。"
        "你會優先保留能幫助提升檢索精確度與召回率的資訊。",
    task_description="請根據給定的問題，完成以下工作：\n"
        "1. 從問題中提取 3～8 個關鍵詞或關鍵短語。\n"
        "2. 優先保留專有名詞、重要實體名稱、時間、地點、技術術語等。\n"
        "3. 如有必要，可以補充常見的英文寫法或縮寫，以提升搜尋效果。\n"
        "4. 不要解釋問題內容，也不要回答問題本身。\n"
        "5. 請僅輸出關鍵詞列表，使用半形逗號「,」分隔，不要加多餘說明文字。\n"
        "6. 關鍵詞本身可以是中文或英文，但說明文字一律使用繁體中文。",
)

# This agent is the core component that answers the question.
qa_agent = LLMAgent(
    role_description="你是 LLaMA-3.1-8B，是用來回答問題的 AI。使用中文時只會使用繁體中文來回問題。",
    task_description="請回答以下問題：",
)

## RAG pipeline

TODO: Implement the RAG pipeline.

Please refer to the homework description slides for hints.

Also, there might be more heuristics (e.g. classifying the questions based on their lengths, determining if the question need a search or not, reconfirm the answer before returning it to the user......) that are not shown in the flow charts. You can use your creativity to come up with a better solution!

- Naive approach (simple baseline)

    ![](https://www.csie.ntu.edu.tw/~ulin/naive.png)

- Naive RAG approach (medium baseline)

    ![](https://www.csie.ntu.edu.tw/~ulin/naive_rag.png)

- RAG with agents (strong baseline)

    ![](https://www.csie.ntu.edu.tw/~ulin/rag_agent.png)

In [10]:
async def pipeline(question: str) -> str:
    """
    整體 RAG pipeline：
    1. 問題重寫（去掉無關內容，提取核心問題）
    2. 關鍵詞抽取（為檢索服務）
    3. 根據關鍵詞檢索文檔
    4. 將 Context + Question 一起交給 QA agent 回答
    """

    # ---------- 1. 用 question_extraction_agent 清洗問題 ----------
    refined = question_extraction_agent.inference(question)

    # 預期格式：「核心問題：XXX」
    if isinstance(refined, str) and refined.startswith("核心問題"):
        # 嘗試去掉「核心問題：」這類前綴
        core_question = refined.split("：", 1)[-1].strip()
    else:
        core_question = refined.strip() if isinstance(refined, str) else question

    # ---------- 2. 用 keyword_extraction_agent 抽關鍵詞 ----------
    keywords_raw = keyword_extraction_agent.inference(core_question)
    # 預期輸出類似："LLaMA-3.1-8B, RAG, 檢索增強生成"
    if isinstance(keywords_raw, str):
        keyword_list = [k.strip() for k in keywords_raw.split(",") if k.strip()]
    else:
        keyword_list = []

    # 如果關鍵詞抽不出來，就退化成用整個問題去檢索
    if not keyword_list:
        keyword_list = [core_question]

    # ---------- 3. 根據關鍵詞做檢索（這裡要換成你作業給的函數） ----------
    # ⚠️⚠️⚠️ 這裡非常重要：把下面這一行函數名，改成你 Notebook 裡的檢索函數 ⚠️⚠️⚠️
    # 例如：retrieved_passages = search_documents(keyword_list, top_k=5)
    # 或：  retrieved_passages = bm25_search(keyword_list, k=5)
    # 我暫時先寫成一個假名：
    retrieved_passages = []
    try:
        retrieved_passages = search_documents(keyword_list, top_k=5)
        # 如果你們接口不同，例如 search_documents(keywords: str)，那就改掉這行
    except NameError:
        # 如果還沒實作或函數名不對，至少不讓程式崩掉
        retrieved_passages = []

    # ---------- 4. 構造 Context，控制長度 ----------
    context_chunks = []
    for i, passage in enumerate(retrieved_passages):
        # passage 可能是 dict / (title, text) / 純字串，請按你們實際格式改
        if isinstance(passage, str):
            text = passage
        elif isinstance(passage, dict) and "text" in passage:
            text = passage["text"]
        else:
            text = str(passage)
        context_chunks.append(f"[Passage {i+1}]\n{text}")

    context = "\n\n".join(context_chunks)

    # 粗暴做個長度截斷，防止超過上下文：
    max_chars = 8000
    if len(context) > max_chars:
        context = context[:max_chars]

    # ---------- 5. 把 Context + Question 統一交給 qa_agent ----------
    qa_input = f"""
以下是根據使用者問題所檢索到的相關資料（Context），以及整理後的問題（Question）。請據此回答問題。

[Context]
{context}

[Question]
{core_question}
"""

    answer = qa_agent.inference(qa_input)
    return answer

## Answer the questions using your pipeline!

Since Colab has usage limit, you might encounter the disconnections. The following code will save your answer for each question. If you have mounted your Google Drive as instructed, you can just rerun the whole notebook to continue your process.

In [11]:
from pathlib import Path

# Fill in your student ID first.
STUDENT_ID = "123456"

STUDENT_ID = STUDENT_ID.lower()
with open('./public.txt', 'r') as input_f:
    questions = input_f.readlines()
    questions = [l.strip().split(',')[0] for l in questions]
    for id, question in enumerate(questions, 1):
        if Path(f"./{STUDENT_ID}_{id}.txt").exists():
            continue
        answer = await pipeline(question)
        answer = answer.replace('\n',' ')
        print(id, answer)
        with open(f'./{STUDENT_ID}_{id}.txt', 'w') as output_f:
            print(answer, file=output_f)

with open('./private.txt', 'r') as input_f:
    questions = input_f.readlines()
    for id, question in enumerate(questions, 31):
        if Path(f"./{STUDENT_ID}_{id}.txt").exists():
            continue
        answer = await pipeline(question)
        answer = answer.replace('\n',' ')
        print(id, answer)
        with open(f'./{STUDENT_ID}_{id}.txt', 'a') as output_f:
            print(answer, file=output_f)

1 根據我的資料，「虎山雄風飛揚」是國立臺灣師範大學的校歌。
2 根據我查到的資料，NCC（國家通訊傳播委員會）規定境外郵購自用產品的審核費為新台幣 1,000 元。
3 根據資料，第一代 iPhone 是由史蒂夫·喬布斯（Steve Jobs）發表。
4 根據台灣大學的進階英文免修申請規定，托福網路測驗 TOEFL iBT 的最低分數要求是 80 分。
5 觸地 try（也稱為腳踩試）在 Rugby Union 中是一種特殊的得分方式。根據規則，當球員成功將對手隊伍的一名選手推入自己的終點區，並且自己或同伴同時著陸於該處時，就會被判定成觸地 try。  一旦發生這樣的情況，那麼就可以獲得五個分（5 points）。
6 根據我的資料，卑南族的祖先發源地ruvuwa'an位於台東縣。
7 很抱歉，我無法找到相關的資料。
8 根據歷史資料，電磁感應定律是由詹姆斯·克拉ーク・マ克斯韋爾在19世紀中期發現的。
9 根據資料，距離國立臺灣史前文化博物館最近的臺鐵車站是台中火车总站在。
10 根據給定的數字，20 加上 30 的結果是：  40
11 很抱歉，但我無法提供關於未來的資訊（例如NBA 2025年的交易）。
12 很抱歉，我無法提供最新的資訊或未來事件。根據我目前可用的資料，2024年美國總統大選尚在進行中，並沒有確定的勝出者。如果您需要關於過去的大选結果或者其他相關信息，可以試著問我的！
13 根據我的知識，Llama-3.1 系列模型中，我是 LLaMA 8B 的一部分。雖然我沒有直接的資訊關於 Meta 所有版本中的參數量，但基礎上來說較小型別通常會比大尺寸更少。  根據我的知識，Llama-3.1 系列模型中，我是 LLaMA 8B 的一部分。
14 根據國立臺灣大學學則，停修課程的限制是三門。
15 根據我的資料，DeepSeek公司的母 公司是 Ocean Infinity。
16 很抱歉，我無法提供最新的資訊，因為我的知識截止日期是2023年12月。然而根據我所知道的情況，目前尚未結束賽季，因此還不能確定 2024 年 NBA 總冠軍隊伍。但如果你想問的是前幾年的總決赛，我可以提供相關資訊。  例如：  *   前一年（23-24年）的NBA总决战是金州勇士队与波特兰开拓者之间的比赛。 若您有其他問題或需要更多信息，請告訴我。
17 根據化學知識，碳

In [12]:
# Combine the results into one file.
with open(f'./{STUDENT_ID}.txt', 'w') as output_f:
    for id in range(1,91):
        with open(f'./{STUDENT_ID}_{id}.txt', 'r') as input_f:
            answer = input_f.readline().strip()
            print(answer, file=output_f)