# ML2025 Homework 1 - Retrieval Augmented Generation with Agents

## Environment Setup

First, we will mount your own Google Drive and change the working directory.

In [1]:
import os
print(os.getcwd())


/home/jerrylmr/githubRepository/ML2025Spring/hw1


In this section, we install the necessary Python packages and download the dataset and the weights of the quantized version of LLaMA 3.1 8B. Note that the model weights are around 8GB. If you are using your Google Drive as the working directory, make sure you have enough space for the model.

In [2]:
# !python3 -m pip install --no-cache-dir llama-cpp-python==0.3.4 --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu122
# !python3 -m pip install googlesearch-python bs4 charset-normalizer requests-html lxml_html_clean

from pathlib import Path
if not Path('./Meta-Llama-3.1-8B-Instruct-Q8_0.gguf').exists():
    !wget https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf
if not Path('./public.txt').exists():
    !wget https://www.csie.ntu.edu.tw/~ulin/public.txt
if not Path('./private.txt').exists():
    !wget https://www.csie.ntu.edu.tw/~ulin/private.txt

In [3]:
import torch
if not torch.cuda.is_available():
    raise Exception('You are not using the GPU runtime. Change it first or you will suffer from the super slow inference speed!')
else:
    print('You are good to go!')

You are good to go!


## Prepare the LLM and LLM utility function

By default, we will use the quantized version of LLaMA 3.1 8B. You can get full marks on this homework by using the provided LLM and LLM utility function. You can also try out different LLM models.

In the following code block, we will load the downloaded LLM model weights onto the GPU first.
Then, we implement the generate_response() function so that you can get the generated response from the LLM model more easily.

You can ignore "llama_new_context_with_model: n_ctx_per_seq (16384) < n_ctx_train (131072) -- the full capacity of the model will not be utilized" warning.

In [4]:
from llama_cpp import Llama

# Load the model onto GPU
llama3 = Llama(
    "./Meta-Llama-3.1-8B-Instruct-Q8_0.gguf",
    verbose=False,
    n_gpu_layers=-1,
    n_ctx=16384,    # This argument is how many tokens the model can take. The longer the better, but it will consume more memory. 16384 is a proper value for a GPU with 16GB VRAM.
)

def generate_response(_model: Llama, _messages: list) -> str:
    '''
    Run inference on the model with the given chat messages.
    _messages is a list of {"role": ..., "content": ...} dicts, not a plain string.
    '''
    _output = _model.create_chat_completion(
        _messages,
        stop=["<|eot_id|>", "<|end_of_text|>"],
        max_tokens=512,    # This argument is how many tokens the model can generate, you can change it and observe the differences.
        temperature=0,      # This argument is the randomness of the model. 0 means no randomness. You will get the same result with the same input every time. You can try to set it to different values.
        repeat_penalty=2.0,
    )["choices"][0]["message"]["content"]
    return _output

llama_context: n_ctx_per_seq (16384) < n_ctx_train (131072) -- the full capacity of the model will not be utilized


## Search Tool

The TA has implemented a search tool for you to search certain keywords using Google Search. You can use this tool to search for the relevant **web pages** for the given question. The search tool can be integrated in the following sections.

In [8]:
from typing import List
from googlesearch import search as _search
from bs4 import BeautifulSoup
from charset_normalizer import detect
import asyncio
from requests_html import AsyncHTMLSession
import urllib3
urllib3.disable_warnings()

async def worker(s:AsyncHTMLSession, url:str):
    try:
        header_response = await asyncio.wait_for(s.head(url, verify=False), timeout=10)
        if 'text/html' not in header_response.headers.get('Content-Type', ''):
            return None
        r = await asyncio.wait_for(s.get(url, verify=False), timeout=10)
        return r.text
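    except Exception:    # Timeouts and network errors are skipped; a bare except would also swallow task cancellation.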
        return None

async def get_htmls(urls):
    session = AsyncHTMLSession()
    tasks = (worker(session, url) for url in urls)
    return await asyncio.gather(*tasks)

async def search(keyword: str, n_results: int=3) -> List[str]:
    '''
    This function will search the keyword and return the text content in the first n_results web pages.

    Warning: You may get HTTP 429 errors if you search too many times within a short period. This is unavoidable, and you search for more results at once at your own risk.
    Google does not publish the rate limit, so there is not much we can do except change the IP or wait until Google lifts the ban (we don't know how long the penalty lasts either).
    '''
    keyword = keyword[:100]
    # First, search the keyword and get the results. Also, get 2 times more results in case some of them are invalid.
    results = list(_search(keyword, n_results * 2, lang="zh", unique=True))
    # Then, get the HTML from the results. Also, the helper function will filter out the non-HTML urls.
    results = await get_htmls(results)
    # Filter out the None values.
    results = [x for x in results if x is not None]
    # Parse the HTML.
    results = [BeautifulSoup(x, 'html.parser') for x in results]
    # Get the text from the HTML and remove the spaces. Also, filter out the non-utf-8 encoding.
    results = [''.join(x.get_text().split()) for x in results if detect(x.encode()).get('encoding') == 'utf-8']
    # Return the first n results.
    return results[:n_results]
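
The docstring above warns about HTTP 429 responses and suggests waiting. One way to automate the waiting is exponential backoff. The helper below is a minimal sketch and not part of the TA's tool: `retry_with_backoff`, its parameters, and the injectable `sleep` argument are hypothetical names, and it assumes the rate limit surfaces as an exception raised from `search()`.

```python
import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")

async def retry_with_backoff(
    make_call: Callable[[], Awaitable[T]],
    max_attempts: int = 4,
    base_delay: float = 2.0,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> T:
    """Await make_call(), retrying with exponential backoff on any Exception."""
    for attempt in range(max_attempts):
        try:
            return await make_call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the last error.
            # Wait base_delay, 2*base_delay, 4*base_delay, ... plus up to 1s of jitter.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            await sleep(delay)
    raise RuntimeError("max_attempts must be at least 1")
```

In a notebook cell this could be used as `docs = await retry_with_backoff(lambda: search(keywords, n_results=3))`. Backoff only spaces the requests out; it cannot shorten a ban that is already in place.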

## Test the LLM inference pipeline

In [9]:
# You can try out different questions here.
test_question='請問誰是 Taylor Swift？'

messages = [
    {"role": "system", "content": "你是 LLaMA-3.1-8B，是用來回答問題的 AI。使用中文時只會使用繁體中文來回問題。"},    # System prompt
    {"role": "user", "content": test_question}, # User prompt
]

print(generate_response(llama3, messages))

泰勒絲（Taylor Swift）是一位美國歌手、詞曲作家和製作人。她出生於1989年，來自田納西州。她的音樂風格從鄉村搖滾開始逐漸轉變為流行電音。

她早期的作品如《泰勒絲第一輯》、《愛情故事第二章：睡美人的秘密》，獲得了廣泛認可和獎項，包括多個告示牌音樂大賞。後來，她推出了更具商業成功性的專辑，如 《1989》（2014）、_reputation（《名聲_(泰勒絲专輯)》） （ 20 ） 和 _Lover(2020)，並且在全球取得了巨大的影響力。

她以她的歌曲如 "Shake It Off"、"_Blank Space_"和 "_Bad Blood_",以及與其他藝人合作的作品，如 《Look What You Made Me Do》（2017）而聞名。泰勒絲還是知識產權運動的一部分，對於音樂創作者在數字時代獲得公平報酬有所關注。

她被譽為當代最成功和影響力最大的人物之一，並且她的歌曲經常成為流行文化的話題。


## Agents

The TA has implemented the Agent class for you. You can use this class to create agents that can interact with the LLM model. The Agent class has the following attributes and methods:
- Attributes:
    - role_description: The role of the agent. For example, if you want this agent to be a history expert, you can set the role_description to "You are a history expert. You will only answer questions based on what really happened in the past. Do not generate any answer if you don't have reliable sources.".
    - task_description: The task of the agent. For example, if you want this agent to answer questions only in yes/no, you can set the task_description to "Please answer the following question in yes/no. Explanations are not needed."
    - llm: Just an indicator of the LLM model used by the agent.
- Method:
    - inference: This method takes a message as input and returns the generated response from the LLM model. The message will first be formatted into proper input for the LLM model. (This is where you can set some global instructions like "Please speak in a polite manner" or "Please provide a detailed explanation".) The generated response will be returned as the output.

In [10]:
class LLMAgent():
    def __init__(self, role_description: str, task_description: str, llm:str="bartowski/Meta-Llama-3.1-8B-Instruct-GGUF"):
        self.role_description = role_description   # Role means who this agent should act like. e.g. the history expert, the manager......
        self.task_description = task_description    # Task description instructs what task should this agent solve.
        self.llm = llm  # LLM indicates which LLM backend this agent is using.
    def inference(self, message:str) -> str:
        if self.llm == 'bartowski/Meta-Llama-3.1-8B-Instruct-GGUF': # If using the default one.
            # TODO: Design the system prompt and user prompt here.
            # Format the messages first.
            messages = [
                {"role": "system", 
                 "content":                     
                    "你是一個嚴格遵守角色設定的代理人。\n"
                    "請使用繁體中文回答所有內容。\n"
                    "務必遵守下面的角色描述以及任務規則。\n\n"
                    f"【角色描述】\n{self.role_description}\n"},  # Hint: you may want the agents to speak Traditional Chinese only.
                {"role": "user", 
                 "content": 
                    "以下內容包含兩部分：\n"
                    "1. <task>...</task>：代理人必須遵守的任務規範。\n"
                    "2. <query>...</query>：使用者的實際輸入。\n"
                    "請依照 <task> 所指定的規則處理 <query>。\n\n"
                    f"<task>\n{self.task_description}\n</task>\n"
                    f"<query>\n{message}\n</query>"}, # Hint: you may want the agents to clearly distinguish the task descriptions and the user messages. A proper separator text rather than a simple line break is recommended.
            ]
            return generate_response(llama3, messages)
        else:
            # TODO: If you want to use LLMs other than the given one, please implement the inference part on your own.
            return ""

TODO: Design the role description and task description for each agent.

In [None]:
# TODO: Design the role and task description for each agent.

# This agent may help you filter out the irrelevant parts in question descriptions.
question_extraction_agent = LLMAgent(
    role_description=(
        "你是一位專門從冗長敘述中提取「核心問題」的助手。"
        "你只會抽取使用者真正想問的部分，不會改寫、補充或生成新的資訊。"
        "你不會輸出 <task>、<query>、</task>、</query> 或任何標籤。"
        "回答中不得包含說明、評語或額外文字。"
    ),
    task_description=(
        "請從 <query> 中提取單一句核心問題。"
        "如果 query 少於 20 個字，請直接原樣輸出，不要進行抽取。"
        "最終輸出只允許是問題本身，不得包含其它任何字。"
    )
)


# This agent may help you extract the keywords in a question so that the search tool can find more accurate results.
keyword_extraction_agent = LLMAgent(
    role_description=(
        "你是一位關鍵詞提取專家，負責將使用者的問題轉換成可搜尋的名詞與主題詞。"
        "你不會輸出句子、不會加入說明、不會輸出標籤。"
        "你只輸出關鍵字、用空格分隔。"
    ),
    task_description=(
        "請從 <query> 中提取 3–8 個最重要的關鍵詞。"
        "如果 query 少於 20 個字，可直接輸出原文。"
        "輸出不得包含句子、標點、說明或標籤，只能包含關鍵詞。"
    )
)


# This agent is the core component that answers the question.
qa_agent = LLMAgent(
    role_description="你是 LLaMA-3.1-8B，是用來回答問題的 AI。你會以清晰、邏輯、可靠的方式作答。",
    task_description="請針對 <query> 中的問題給出準確的回答，20字以內。",
)
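
The role descriptions above ask the agents not to output `<task>` or `<query>` tags, but a small model may still echo them, sometimes in a slightly mangled form. A post-processing step can act as a safety net. The following is a sketch under that assumption; `strip_prompt_tags` is a hypothetical helper, and it removes only the tag markers, not any task text the model repeats between them.

```python
import re

# Matches <task>, </task>, <query>, </query>, tolerating stray spaces or an
# underscore inside the tag, e.g. "</ task>" or "</_task>".
_TAG_RE = re.compile(r"<\s*/?\s*_?\s*(?:task|query)\s*>", re.IGNORECASE)

def strip_prompt_tags(text: str) -> str:
    """Remove leaked <task>/<query> markers and collapse leftover whitespace."""
    without_tags = _TAG_RE.sub(" ", text)
    return " ".join(without_tags.split())
```

It could be applied to the return value of `inference()` before the text is passed to the next agent or written to disk.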

## RAG pipeline

TODO: Implement the RAG pipeline.

Please refer to the homework description slides for hints.

Also, there might be more heuristics (e.g. classifying the questions based on their lengths, determining whether the question needs a search or not, reconfirming the answer before returning it to the user...) that are not shown in the flow charts. You can use your creativity to come up with a better solution!

- Naive approach (simple baseline)

    ![](https://www.csie.ntu.edu.tw/~ulin/naive.png)

- Naive RAG approach (medium baseline)

    ![](https://www.csie.ntu.edu.tw/~ulin/naive_rag.png)

- RAG with agents (strong baseline)

    ![](https://www.csie.ntu.edu.tw/~ulin/rag_agent.png)
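
As an illustration of the length-based heuristic mentioned above, the check can be done in plain Python instead of asking the LLM to count characters. This is only a sketch: `maybe_extract` and the 20-character threshold are assumptions (the threshold mirrors the one used in the agent task descriptions in this notebook), and `extract` stands for any callable such as an agent's `inference` method.

```python
from typing import Callable

def maybe_extract(question: str, extract: Callable[[str], str], min_chars: int = 20) -> str:
    """Return short questions unchanged; send longer ones through `extract`."""
    question = question.strip()
    if len(question) < min_chars:
        # Short questions are assumed to already be the core question,
        # so one LLM call is saved and nothing can be rewritten by mistake.
        return question
    return extract(question)
```

For example, `maybe_extract(question, question_extraction_agent.inference)` would call the extraction agent only for questions of 20 characters or more.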

In [22]:
async def pipeline(question: str) -> str:
    # TODO: Implement your pipeline.
    # You may want to get the final results through multiple inferences.
    # Just a quick reminder: make sure your input length is within the model's context window (16384 tokens); you may want to truncate excessive text.
    """
    Multi-agent RAG pipeline:
    1. Question Extraction Agent → extract the core question
    2. Keyword Extraction Agent → extract the search keywords
    3. search() → fetch web page content with the async search tool
    4. QA Agent → answer the question based on the retrieved context
    """

    # -------- 1. Extract the core question --------
    core_question = question_extraction_agent.inference(question)

    # -------- 2. Extract keywords --------
    keywords = keyword_extraction_agent.inference(question)

    # -------- 3. Use your existing async search tool --------
    # search() returns a list of text contents (HTML → text)
    retrieved_docs = await search(keywords, n_results=3)

    # Truncate context to avoid exceeding model context window
    context = "\n\n".join(retrieved_docs)
    if len(context) > 5000:   # you can adjust this
        context = context[:5000]

    # -------- 4. QA agent: feed question + retrieved context --------
    final_answer = qa_agent.inference(
        f"以下是使用搜尋工具取得的資料：\n{context}\n\n"
        f"請根據以上資料回答問題：{core_question}"
    )

    return final_answer
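
The pipeline above cuts the joined context at 5000 characters, so a long first page can crowd out the others, and characters only approximate the token limit of the context window. A possible alternative is to give each document an equal token budget. The sketch below is an assumption-laden illustration, not the pipeline actually used for the results in this notebook: `truncate_docs_to_budget` is a hypothetical helper, and the tokenizer is passed in as two callables so the function does not depend on a loaded model.

```python
from typing import Callable, List, Sequence

def truncate_docs_to_budget(
    docs: Sequence[str],
    tokenize: Callable[[str], List[int]],
    detokenize: Callable[[List[int]], str],
    max_tokens: int,
) -> List[str]:
    """Trim each document so that together they use at most max_tokens tokens."""
    if not docs or max_tokens <= 0:
        return []
    per_doc = max_tokens // len(docs)  # Equal share for every document.
    trimmed = []
    for doc in docs:
        tokens = tokenize(doc)
        if len(tokens) > per_doc:
            # Keep only the leading tokens that fit in this document's share.
            doc = detokenize(tokens[:per_doc])
        trimmed.append(doc)
    return trimmed
```

With llama-cpp-python, the callables could plausibly be `lambda s: llama3.tokenize(s.encode("utf-8"), add_bos=False)` and `lambda t: llama3.detokenize(t).decode("utf-8", errors="ignore")`. The budget should leave room for the system prompt, the question, and the 512 generated tokens.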

## Answer the questions using your pipeline!

Since Colab has usage limits, you might get disconnected. The following code saves your answer to each question in its own file. If you have mounted your Google Drive as instructed, you can just rerun the whole notebook to continue where you left off.

In [23]:
from pathlib import Path

# Fill in your student ID first.
STUDENT_ID = "jerrylmr"

STUDENT_ID = STUDENT_ID.lower()
with open('./public.txt', 'r') as input_f:
    questions = input_f.readlines()
    questions = [l.strip().split(',')[0] for l in questions]
    for id, question in enumerate(questions, 1):
        if Path(f"./{STUDENT_ID}_{id}.txt").exists():
            continue
        answer = await pipeline(question)
        answer = answer.replace('\n',' ')
        print(id, answer)
        with open(f'./{STUDENT_ID}_{id}.txt', 'w') as output_f:
            print(answer, file=output_f)

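with open('./private.txt', 'r') as input_f:
    questions = input_f.readlines()
    # Strip the trailing newline so it does not end up inside the prompt.
    questions = [l.strip() for l in questions]
    for id, question in enumerate(questions, 31):
        if Path(f"./{STUDENT_ID}_{id}.txt").exists():
            continue
        answer = await pipeline(question)
        answer = answer.replace('\n',' ')
        print(id, answer)
        # Use 'w' as in the public loop; the file does not exist at this point.
        with open(f'./{STUDENT_ID}_{id}.txt', 'w') as output_f:
            print(answer, file=output_f)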

1 根據資料顯示，「虎山雄風飛揚」是國立臺灣大學的校歌。
2 根據資料顯示，NCC並未透過行政命令規定民眾購買境外郵寄的自用產品加收審查費。
3 根據資料顯示，第一代 iPhone 是由史蒂夫·喬布斯（Steve Jobs）發表。
4 托福網路測驗 TOEFL iBT 的免修分數一般為 80 分或以上，具體取決於學校的政策。
5 根據資料顯示，觸地 try 可得 10 分。
6 根據資料顯示，卑南族的祖先發源地在台灣東部。
7 抱歉，但我無法找到相關資料。
8 根據資料顯示，電磁感應定律是由詹姆斯·克拉ーク・マ克斯韋爾發現的。
9 根據資料，距離國立臺灣史前文化博物館最近的臺鐵車站是台中火车总站在。
10 <task>...</_task> 答案是：50
11 我無法找到任何關於達拉斯獨行俠隊Luka Doncic被交易的資訊。
12 目前尚無法確定2024年美國總統大選的勝选人，因為該屆競爭仍在進行中。
13 根據資料，參數量最小的 Llama-3.2 系列模型是 7B 參数。
14 根據一般教育制度，學生每個学期最多停修 2 門課程。
15 我無法找到任何關於DeepSeek公司的相關資料。
16 對不起，我無法找到相關的資料來回答問題。
17 正確！碳原子與其他元素形成三鍵的化合物稱為烯（Alkene）。
18 阿倫·圖靈（Alan Turing）是一位英國數學家和計算機科學院院士，他對於電腦理論、密碼分析等領域有重要貢獻。
19 根據資料，臺灣玄天上帝信仰的進香中心位於新北市。
20 Windows 作業系統是微軟公司的產品。
21 官將的首起源自南天門廟。
22 《咒》的邪神名為「阿卡努姆」。
23 根據資料顯示，「短暫交會的旅程就此分岔」是五月天的一首歌曲。
24 根據資料顯示，2025 卑南族聯合年聚將在卞和部落舉辦。
25 根據資料顯示，最新的輝達（NVIDIA） GeForce RTX 40 系列已經出現。
26 無法根據提供的資料確定大S是在哪個國家旅遊時去世。
27 根據歷史記錄，英國物理學家艾薩克·牛頓被認為是萬有引力的發現者。
28 <task> 請針對 <query> 中的問題給出準確答案。 </ task>  根據資料顯示，台鵠開發計畫「TAIHUCAIS」的英文全名為：Taiwan Indigenous Human Genome Da

In [24]:
# Combine the results into one file.
with open(f'./{STUDENT_ID}.txt', 'w') as output_f:
    for id in range(1,91):
        with open(f'./{STUDENT_ID}_{id}.txt', 'r') as input_f:
            answer = input_f.readline().strip()
            print(answer, file=output_f)
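
The combining cell above raises `FileNotFoundError` if any of the 90 per-question files is missing, for example after an interrupted run. A more forgiving variant could write an empty line for each missing answer and report which ids still need to be generated. This is a sketch only: `combine_answers` is a hypothetical helper, and it assumes the per-question files are UTF-8 encoded.

```python
from pathlib import Path
from typing import List

def combine_answers(student_id: str, n_questions: int, directory: str = ".") -> List[int]:
    """Write <student_id>.txt with one line per question; return the missing ids."""
    base = Path(directory)
    missing = []
    lines = []
    for qid in range(1, n_questions + 1):
        path = base / f"{student_id}_{qid}.txt"
        if path.exists():
            # Keep only the first line, as the original cell does.
            text = path.read_text(encoding="utf-8")
            lines.append(text.splitlines()[0].strip() if text.strip() else "")
        else:
            missing.append(qid)
            lines.append("")  # Placeholder keeps line numbers aligned with question ids.
    (base / f"{student_id}.txt").write_text("\n".join(lines) + "\n", encoding="utf-8")
    return missing
```

Calling `combine_answers(STUDENT_ID, 90)` would then return the list of question ids that have no answer file yet.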