# ML2025 Homework 1 - Retrieval Augmented Generation with Agents

## Environment Setup

In this section, we install the necessary Python packages and download the dataset and the weights of the quantized LLaMA 3.1 8B model. Note that the model weights are around 8GB.

In [1]:
!python3 -m pip install --no-cache-dir llama-cpp-python==0.3.4 --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu122
!python3 -m pip install googlesearch-python bs4 charset-normalizer requests-html lxml_html_clean

from pathlib import Path
if not Path('./Meta-Llama-3.1-8B-Instruct-Q8_0.gguf').exists():
    !wget https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf
if not Path('./public.txt').exists():
    !wget https://www.csie.ntu.edu.tw/~ulin/public.txt
if not Path('./private.txt').exists():
    !wget https://www.csie.ntu.edu.tw/~ulin/private.txt

Looking in indexes: https://pypi.org/simple, https://abetlen.github.io/llama-cpp-python/whl/cu122
Collecting llama-cpp-python==0.3.4
  Downloading https://github.com/abetlen/llama-cpp-python/releases/download/v0.3.4-cu122/llama_cpp_python-0.3.4-cp310-cp310-linux_x86_64.whl (445.2 MB)
Collecting diskcache>=5.6.1 (from llama-cpp-python==0.3.4)
  Downloading diskcache-5.6.3-py3-none-any.whl.metadata (20 kB)
Downloading diskcache-5.6.3-py3-none-any.whl (45 kB)
Installing collected packages: diskcache, llama-cpp-python
Successfully installed diskcache-5.6.3 llama-cpp-python-0.3.4
Collecting googlesearch-python
  Downloading googlesearch_python-1.3.0-py3-none-any.whl.metadata (3.4 kB)
Collecting bs4
  Downloading bs4-0.0.2-py2.py3-no

In [2]:
import torch
if not torch.cuda.is_available():
    raise Exception('You are not using the GPU runtime. Change it first or you will suffer from the super slow inference speed!')
else:
    print('You are good to go!')

You are good to go!


## Prepare the LLM and LLM utility function

By default, we will use the quantized version of LLaMA 3.1 8B. You can get full marks on this homework by using the provided LLM and LLM utility function. You can also try out different LLMs.

In the following code block, we first load the downloaded LLM weights onto the GPU.
Then, we implement the generate_response() function so that you can get the generated response from the LLM more easily.

You can ignore the "llama_new_context_with_model: n_ctx_per_seq (16384) < n_ctx_train (131072) -- the full capacity of the model will not be utilized" warning.

In [3]:
from llama_cpp import Llama

# Load the model onto GPU
llama3 = Llama(
    "./Meta-Llama-3.1-8B-Instruct-Q8_0.gguf",
    verbose=False,
    n_gpu_layers=-1,
    n_ctx=16384,    # This argument is how many tokens the model can take. The longer the better, but it will consume more memory. 16384 is a proper value for a GPU with 16GB VRAM.
)

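def generate_response(_model: Llama, _messages: list[dict[str, str]]) -> str:
    '''
    Run chat-completion inference on the model with the given list of messages
    (each message is a dict with "role" and "content" keys) and return the generated text.
    '''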
    _output = _model.create_chat_completion(
        _messages,
        stop=["<|eot_id|>", "<|end_of_text|>"],
        max_tokens=512,    # This argument is how many tokens the model can generate.
        temperature=0,      # This argument is the randomness of the model. 0 means no randomness. You will get the same result with the same input every time. You can try to set it to different values.
        repeat_penalty=2.0,
    )["choices"][0]["message"]["content"]
    return _output

llama_new_context_with_model: n_ctx_per_seq (16384) < n_ctx_train (131072) -- the full capacity of the model will not be utilized


## Search Tool

The TA has implemented a search tool for you to search certain keywords using Google Search. You can use this tool to search for the relevant **web pages** for the given question. The search tool can be integrated in the following sections.

In [4]:
from typing import List
from googlesearch import search as _search
from bs4 import BeautifulSoup
from charset_normalizer import detect
import asyncio
from requests_html import AsyncHTMLSession
import urllib3
urllib3.disable_warnings()

async def worker(s:AsyncHTMLSession, url:str):
    try:
        header_response = await asyncio.wait_for(s.head(url, verify=False), timeout=10)
        if 'text/html' not in header_response.headers.get('Content-Type', ''):
            return None
        r = await asyncio.wait_for(s.get(url, verify=False), timeout=10)
        return r.text
    except Exception:  # Treat any request failure or timeout as "no page available".
        return None

async def get_htmls(urls):
    session = AsyncHTMLSession()
    tasks = (worker(session, url) for url in urls)
    return await asyncio.gather(*tasks)

async def search(keyword: str, n_results: int=3) -> List[str]:
    '''
    Search for the keyword and return the text content of the first n_results web pages.
    Warning: You may get HTTP 429 errors if you search too many times in a short period. This is unavoidable, so request more results at once at your own risk.
    Google does not publish its rate limit, so there is not much we can do except change the IP or wait until Google lifts the ban (we don't know how long the penalty lasts either).
    '''
    keyword = keyword[:100]
    # First, search the keyword and get the results. Request twice as many results as needed in case some of them are invalid.
    results = list(_search(keyword, n_results * 2, lang="zh", unique=True))
    # Then, get the HTML from the results. Also, the helper function will filter out the non-HTML urls.
    results = await get_htmls(results)
    # Filter out the None values.
    results = [x for x in results if x is not None]
    # Parse the HTML.
    results = [BeautifulSoup(x, 'html.parser') for x in results]
    # Get the text from the HTML and remove the spaces. Also, filter out the non-utf-8 encoding.
    results = [''.join(x.get_text().split()) for x in results if detect(x.encode()).get('encoding') == 'utf-8']
    # Return the first n results.
    return results[:n_results]
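
The docstring above warns about HTTP 429 errors when searching too often. One common way to soften this is to retry with exponential backoff. The following is a minimal sketch, not part of the provided tool: `search_with_retry`, `max_attempts`, and `base_delay` are hypothetical names and values, and it assumes the underlying search raises an exception when it is rate-limited.

```python
import asyncio
import random
from typing import Awaitable, Callable, List


async def search_with_retry(
    search_fn: Callable[[str], Awaitable[List[str]]],
    keyword: str,
    max_attempts: int = 3,
    base_delay: float = 5.0,
) -> List[str]:
    """Call search_fn(keyword); on an exception, wait and retry with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return await search_fn(keyword)
        except Exception as exc:  # Assumption: rate limiting surfaces as an exception.
            if attempt == max_attempts - 1:
                print(f"Search failed after {max_attempts} attempts: {exc}")
                break
            # The delay doubles on each attempt; jitter of up to 50% spreads out retries.
            delay = base_delay * (2 ** attempt) * random.uniform(1.0, 1.5)
            await asyncio.sleep(delay)
    # Fall back to no search results so the caller can still answer without context.
    return []
```

In the notebook this could be called as `await search_with_retry(search, keywords)`. Returning an empty list on final failure is a design choice: the pipeline can then still answer from the model's own knowledge instead of crashing. Backoff cannot bypass a long ban; it only helps with short-lived throttling.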

## Test the LLM inference pipeline

In [5]:
# You can try out different questions here.
test_question='Are you more fluent in english or chinese'

messages = [
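    {"role": "system", "content": "你是 LLaMA-3.1-8B，是用來回答問題的 AI。使用中文時只會使用繁體中文來回問題。回答內容請盡量詳細。"},    # System prompt. English gloss: "You are LLaMA-3.1-8B, an AI that answers questions. When writing Chinese, use Traditional Chinese only. Answer in as much detail as possible."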
    {"role": "user", "content": test_question}, # User prompt
]

print(generate_response(llama3, messages))

我是 LLaMA-3.1，雖然我的訓練資料包含了大量的英文文本，但在中文方面，我更為熟悉和流暢。因爲大部分的人使用繁體或簡化字寫作，而這些都是用來進行模型學習與測試用的。

我可以理解、生成以及回答以中華民國（台灣）地區通行的標準正確且完整中文文本，包括但不限於：

*  正式文件
    *   政府公報及法令草案等政府機關出版物。
        +     《憲政體制改革方案》
            -      中華民國總統會同行政院頒布之《中正紀念堂管理條例》修訂稿（2023年6月30日起施行）
*  文學作品
    *   小說、散文及詩歌等。
        +     《西遊記》
            -      書名：水滴石穿；原作《三國演義》，作者吳承恩，改編自唐代小説家施耐庵的同題著述
*  日常對話與書信

但請注意，我可能會因爲訓練資料或語言限制而無法完全理解某些特殊用詞、術语等。


## Agents

The TA has implemented the Agent class for you. You can use this class to create agents that can interact with the LLM model. The Agent class has the following attributes and methods:
- Attributes:
    - role_description: The role of the agent. For example, if you want this agent to be a history expert, you can set the role_description to "You are a history expert. You will only answer questions based on what really happened in the past. Do not generate any answer if you don't have reliable sources.".
    - task_description: The task of the agent. For example, if you want this agent to answer questions only in yes/no, you can set the task_description to "Please answer the following question in yes/no. Explanations are not needed."
    - llm: Just an indicator of the LLM model used by the agent.
- Method:
    - inference: This method takes a message as input and returns the generated response from the LLM model. The message will first be formatted into proper input for the LLM model. (This is where you can set some global instructions like "Please speak in a polite manner" or "Please provide a detailed explanation".) The generated response will be returned as the output.

In [6]:
class LLMAgent():
    def __init__(self, role_description: str, task_description: str, llm:str="bartowski/Meta-Llama-3.1-8B-Instruct-GGUF"):
        self.role_description = role_description   # Role means who this agent should act like. e.g. the history expert, the manager......
        self.task_description = task_description    # Task description instructs what task should this agent solve.
        self.llm = llm  # LLM indicates which LLM backend this agent is using.
    def inference(self, message:str) -> str:
        if self.llm == 'bartowski/Meta-Llama-3.1-8B-Instruct-GGUF': # If using the default one.
            # TODO: Design the system prompt and user prompt here.
            # Format the messages first.
            messages = [
                {"role": "system", "content": f"{self.role_description}"},  # Hint: you may want the agents to speak Traditional Chinese only.
                {"role": "user", "content": f"你要做的事：\n{self.task_description}\n\n以下是輸入：\n{message}"}, # Hint: you may want the agents to clearly distinguish the task descriptions and the user messages. A proper separation text rather than a simple line break is recommended.
            ]
            return generate_response(llama3, messages)
        else:
            # TODO: If you want to use LLMs other than the given one, please implement the inference part on your own.
            return ""
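
The hint in `inference` recommends a clear separator between the task description and the user message. As a minimal sketch of one way to do that, the helper below wraps the input in explicit delimiters. `build_messages`, the `### Task` / `### Input` headers, and the `<input>` tags are illustrative assumptions, not part of the provided class.

```python
from typing import Dict, List


def build_messages(role_description: str, task_description: str, message: str) -> List[Dict[str, str]]:
    """Build chat messages with the task and the input in clearly delimited sections."""
    user_prompt = (
        "### Task\n"
        f"{task_description}\n\n"
        "### Input\n"
        "<input>\n"
        f"{message}\n"
        "</input>"
    )
    return [
        {"role": "system", "content": role_description},
        {"role": "user", "content": user_prompt},
    ]
```

The returned list has the same shape as the `messages` list built inside `inference`, so it could be passed to `generate_response(llama3, ...)` unchanged. Whether these particular delimiters help the model is something to check empirically on the public questions.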

TODO 1: Design the role description and task description for each agent.

In [7]:
# TODO: Design the role and task description for each agent.

# This agent may help you filter out the irrelevant parts in question descriptions.
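# Works fine, though not 100% accurate.
# English gloss of the prompts: role = "You are a question-extraction expert; when writing Chinese, use Traditional Chinese only."
# Task = "Extract the complete question from the input, keeping all of its information and meaning; drop unrelated background but keep context needed to understand it; output only the question, without answering."
question_extraction_agent = LLMAgent(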
    role_description="你是一個問題提取專家。使用中文時只會使用繁體中文來回問題。",
    task_description="從輸入中擷取完整問句。完整保留問句的所有資訊，確保它仍然表達原本的意思，不刪除重要細節。刪除問題前的背景敘述與無關內容，但不能刪除影響問題理解的上下文。若輸入本身是問句，直接輸出。若輸入是簡單問題也直接輸出。你應該輸出一個完整的問句。僅輸出問句，不回答。使用繁體中文。",
)

# This agent may help you extract the keywords in a question so that the search tool can find more accurate results.
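# Works fine, though not 100% accurate; could be optimized further.
# English gloss of the prompts: role = "You are a keyword-extraction expert; answer in Traditional Chinese only."
# Task = "Think carefully, then output only the keywords of the question, separated by spaces (proper nouns, the subject/verb/object, and interrogatives); every keyword must appear in the question; do not answer it."
keyword_extraction_agent = LLMAgent(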
    role_description="你是一個關鍵詞提取專家。使用中文時只會用繁體中文回答問題。",
    task_description="你會仔細思考後從輸入提取問題中的關鍵詞。輸出只有你提取的關鍵詞，以空白格開。輸入可能是背景敘述和一個問題，或僅一個問題，你要找的是與該問題相關的關鍵詞。關鍵詞可能是專有名詞（如：人名、地名、機構名稱、技術名詞）、問題的主詞動詞受詞、還有疑問詞（什麼名字、時間、地點）。你輸出的關鍵字都應該在問題中，不要嘗試回答問題，也不用澄清想法。使用繁體中文。",
)

# This agent is the core component that answers the question.
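# English gloss of the prompts: role = "You are a genius at answering questions; give short answers only, in Traditional Chinese."
# Task = "Answer the question using the attached reference material, thinking step by step; reply with a short answer only, without full sentences, explanations, or restating the question. Answering correctly matters greatly for my career."
qa_agent = LLMAgent(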
    role_description="你是一個回答問題的天才。你回答時只會用簡答。使用中文時只會使用繁體中文回答問題。",
    task_description="你要回答一個問題。後面有附上參考資料，你會讀完後一步一步思考，謹慎的回答。一率回答簡答，不用完整句子、不用解釋、也不用重述問題。使用繁體中文。你能不能正確回答嚴重影響我的職業生崖。",
)

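# This agent re-checks the QA agent's answer against the reference material.
# English gloss of the prompts: role = "You are a fact-checking expert; answer in Traditional Chinese only."
# Task = "Check whether the question was answered correctly; output the original answer if it is right, otherwise output the correct answer, as a short answer only."
fact_check_guy = LLMAgent(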
    role_description="你是一個事實查核專家。使用中文時只會用繁體中文回答問題。",
    task_description="你要負責檢查一個回問題有沒有被答對，如果答對就輸出原本的答案，答錯的話輸出問題的正確答案，不用特別註答對或答錯。後面有附上參考資料，你會讀完後一步一步思考，謹慎的回答。一率回答簡答，不用完整句子、不用解釋、也不用重述問題。使用繁體中文。你如果答錯會受到極爲嚴重的懲罰。",
)

## RAG pipeline

TODO 2: Implement the RAG pipeline.

Please refer to the homework description slides for hints.

Also, there might be more heuristics (e.g. classifying the questions based on their lengths, determining whether a question needs a search, reconfirming the answer before returning it to the user) that are not shown in the flow charts. You can use your creativity to come up with a better solution!

- Naive approach (simple baseline)

    ![](https://www.csie.ntu.edu.tw/~ulin/naive.png)

- Naive RAG approach (medium baseline)

    ![](https://www.csie.ntu.edu.tw/~ulin/naive_rag.png)

- RAG with agents (strong baseline)

    ![](https://www.csie.ntu.edu.tw/~ulin/rag_agent.png)

In [8]:
async def pipeline(question: str) -> str:
    # Pipeline: extract the question, extract keywords, search the web, answer with the retrieved context, then fact-check the answer.
    # Reminder: make sure the input length stays within the model context window (16384 tokens); truncate excessive text if needed.

    # return question_extraction_agent.inference(question)
    
    # return keyword_extraction_agent.inference(question)

    extracted_question = question_extraction_agent.inference(question)

    keywords = keyword_extraction_agent.inference(question)

    # Get search results from the internet
    search_results = await search(keywords)
    
    # Append the search results to the question
    full_input = "請回答這個問題:" + extracted_question + "\n網路上的相關資料：" + "\n".join(search_results)

    # Truncate by character count as a rough proxy for the token limit.
    # Note: characters are not tokens, so this does not guarantee that the prompt fits the 16384-token context window.
    truncated_input = full_input[:16370]

    # Send truncated input to the model
    raw_answer =  qa_agent.inference(truncated_input)

    # Ask the fact-checking agent to verify the answer against the same search results, truncated by character count in the same way.
    second_input = "問題：" + extracted_question + "\n回答：" + raw_answer + "\n網路上的相關資料：" + "\n".join(search_results)
    final_input = second_input[:16370]

    return fact_check_guy.inference(final_input)
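
The pipeline above truncates its input by character count, which only approximates the 16384-token context limit. A minimal sketch of truncating by actual token count follows. `truncate_to_token_budget` is a hypothetical helper that takes the tokenizer as two callables, so it does not depend on any particular model.

```python
from typing import Callable, List, TypeVar

Token = TypeVar("Token")


def truncate_to_token_budget(
    text: str,
    tokenize: Callable[[str], List[Token]],
    detokenize: Callable[[List[Token]], str],
    max_tokens: int,
) -> str:
    """Return text unchanged if it fits in max_tokens, else its first max_tokens tokens as text."""
    tokens = tokenize(text)
    if len(tokens) <= max_tokens:
        return text
    return detokenize(tokens[:max_tokens])


# Possible wiring with the model loaded earlier (commented out: it needs the loaded `llama3`):
# tokenize = lambda s: llama3.tokenize(s.encode("utf-8"), add_bos=False)
# detokenize = lambda ids: llama3.detokenize(ids).decode("utf-8", errors="ignore")
# safe_input = truncate_to_token_budget(full_input, tokenize, detokenize, max_tokens=15000)
```

The `max_tokens=15000` budget is an assumed value, not a tuned one. It leaves headroom for the system prompt, the task description, the chat template, and the 512 tokens reserved for generation. Decoding with `errors="ignore"` drops a multi-byte character that may have been cut in half at the truncation point.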

## Answer the questions using your pipeline!

Since Colab has usage limits, you might get disconnected. The following code saves your answer to each question. If you have mounted your Google Drive as instructed, you can simply rerun the whole notebook to resume where you left off.

In [9]:
from pathlib import Path

# Fill in your student ID first.
STUDENT_ID = "b12902014"

STUDENT_ID = STUDENT_ID.lower()
with open('./public.txt', 'r') as input_f:
    questions = input_f.readlines()
    questions = [l.strip().split(',')[0] for l in questions]
    for id, question in enumerate(questions, 1):
        if Path(f"./{STUDENT_ID}_{id}.txt").exists():
            continue
        answer = await pipeline(question)
        answer = answer.replace('\n',' ')
        print(id, answer)
        with open(f'./{STUDENT_ID}_{id}.txt', 'w') as output_f:
            print(answer, file=output_f)

with open('./private.txt', 'r') as input_f:
    questions = input_f.readlines()
    for id, question in enumerate(questions, 31):
        if Path(f"./{STUDENT_ID}_{id}.txt").exists():
            continue
        answer = await pipeline(question)
        answer = answer.replace('\n',' ')
        print(id, answer)
        with open(f'./{STUDENT_ID}_{id}.txt', 'a') as output_f:
            print(answer, file=output_f)

1 國立臺灣師範大學
2 750元
3 史蒂夫·乔布斯
4 80
5 觸地得分
6 該地位於現今的臺東縣。
7 熊仔的藝名叫做「貓頭鷹」
8 迈克尔·法拉第
9 康樂車站
10 自己練
11 洛杉磯湖人
12 唐纳·川普
13 1B
14 停修的限制是：若學生申請停止上課，至少要在期末考或報告繳交前提出需求。
15 DeepSeek公司的母 公司是幻方量化。
16 波士顿凯尔特人
17 會出現異常反應
18 艾伦·图灵
19 真武大帝
20 Windows 作業系統是微軟公司的產品。
21 新北市的地藏庵
22 《咒》的邪神名為：無明
23 徐志摩
24 利嘉部落
25 GeForce RTX 40系列
26 日本
27 艾萨克·牛顿
28 TAIHUCAIS
29 《终极警探》
30 水的化學式為H2O。
31 第15個作業是什麼？  生成式人工智慧導論課程的Homework 10
32 國防醫學大學
33 BT协议的机制是通过种子文件（.torrent）来确保当一个新的节点加入网络时，也能从其他seed随機地获得部分数据，以利于后续整个网絡中的資料交換。
34 你要去哪裡？
35 ç½è
36 嘟胖
37 國立臺灣大學物理治療學系的正常修業年限為六個月。
38 ç½ä¸æï¼å¯ã
39 是伊達政宗
40 王肥貓同學最有可能去修的課程是數位素養導航NavigationintoDigitalLiteracy。
41 18個國家或地區
42 馬智禮
43 片頭曲是《Killkiss》。
44 1991
45 利卡文
46 红茶是全发酵的。
47 超魔導龍騎士-真紅眼黑龙骑兵
48 豐田萌繪在《BanG Dream!》企劃中，擔任松原花音的聲優。
49 罗纳尔多
50 冥王星
51 野生動物救傷單位位於宜蘭縣。  參考資料：政府資訊公開-法規资讯法律、 法规及行政规定公文函釋訴願決定施政計畫統计与出版品會議纪录預算與決 算書政策宣導執行情形支付或接受之補助会 计报告其他信息:::回首頁>政府資訊公開法規资讯法律、 法规及行政规定公文函釋訴願決定施政計畫統计与出版品會議纪录預算與決 算書政策宣導執行情形支付或接受之補助会 计报告其他信息:::
52 是的，特有生物研究保育中心是一個很好的親子旅遊景點。
53 DeSTA2
54 太陽系中體積最大的行

In [10]:
# Combine the results into one file.
with open(f'./{STUDENT_ID}.txt', 'w') as output_f:
    for id in range(1,91):
        with open(f'./{STUDENT_ID}_{id}.txt', 'r') as input_f:
            answer = input_f.readline().strip()
            print(answer, file=output_f)
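
The combining cell opens all 90 per-question files and raises `FileNotFoundError` if any of them is missing, for example after a Colab disconnection. A minimal sketch of a pre-check follows. `find_missing_answers` is a hypothetical helper, and it assumes the `{STUDENT_ID}_{id}.txt` naming used above.

```python
from pathlib import Path
from typing import List


def find_missing_answers(student_id: str, n_questions: int, directory: str = ".") -> List[int]:
    """Return the question ids (1-based) whose answer file does not exist yet."""
    return [
        i for i in range(1, n_questions + 1)
        if not (Path(directory) / f"{student_id}_{i}.txt").exists()
    ]
```

Calling `find_missing_answers(STUDENT_ID, 90)` before combining shows which questions still need an answer. If the list is non-empty, rerunning the answering cell fills in only those ids, since existing files are skipped.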