# ML2025 Homework 1 - Retrieval Augmented Generation with Agents

## Environment Setup

First, we will mount your own Google Drive and change the working directory.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Change the working directory to somewhere in your Google Drive.
# You could check the path by right clicking on the folder.
%cd [change to the directory you prefer]

[Errno 2] No such file or directory: '[change to the directory you prefer]'
/content


In this section, we install the necessary python packages and download model weights of the quantized version of LLaMA 3.1 8B. Also, download the dataset. Note that the model weight is around 8GB. If you are using your Google Drive as the working directory, make sure you have enough space for the model.

In [None]:
!python3 -m pip install --no-cache-dir llama-cpp-python==0.3.4 --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu122
!python3 -m pip install googlesearch-python bs4 charset-normalizer requests-html lxml_html_clean

from pathlib import Path
if not Path('./Meta-Llama-3.1-8B-Instruct-Q8_0.gguf').exists():
    !wget https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf
if not Path('./public.txt').exists():
    !wget https://www.csie.ntu.edu.tw/~ulin/public.txt
if not Path('./private.txt').exists():
    !wget https://www.csie.ntu.edu.tw/~ulin/private.txt

Looking in indexes: https://pypi.org/simple, https://abetlen.github.io/llama-cpp-python/whl/cu122
Collecting llama-cpp-python==0.3.4
  Downloading https://github.com/abetlen/llama-cpp-python/releases/download/v0.3.4-cu122/llama_cpp_python-0.3.4-cp311-cp311-linux_x86_64.whl (445.2 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m445.2/445.2 MB[0m [31m275.5 MB/s[0m eta [36m0:00:00[0m
Collecting diskcache>=5.6.1 (from llama-cpp-python==0.3.4)
  Downloading diskcache-5.6.3-py3-none-any.whl.metadata (20 kB)
Downloading diskcache-5.6.3-py3-none-any.whl (45 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m45.5/45.5 kB[0m [31m257.8 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: diskcache, llama-cpp-python
Successfully installed diskcache-5.6.3 llama-cpp-python-0.3.4
Collecting googlesearch-python
  Downloading googlesearch_python-1.3.0-py3-none-any.whl.metadata (3.4 kB)
Collecting bs4
  Downloading bs4-0.0.2-py2.py3-none-any.whl.me

In [7]:
import torch
if not torch.cuda.is_available():
    raise Exception('You are not using the GPU runtime. Change it first or you will suffer from the super slow inference speed!')
else:
    print('You are good to go!')

You are good to go!


## Prepare the LLM and LLM utility function

By default, we will use the quantized version of LLaMA 3.1 8B. you can get full marks on this homework by using the provided LLM and LLM utility function. You can also try out different LLM models.

In the following code block, we will load the downloaded LLM model weights onto the GPU first.
Then, we implemented the generate_response() function so that you can get the generated response from the LLM model more easily.

You can ignore "llama_new_context_with_model: n_ctx_per_seq (16384) < n_ctx_train (131072) -- the full capacity of the model will not be utilized" warning.

In [8]:
from llama_cpp import Llama

# Load the model onto GPU
llama3 = Llama(
    "./Meta-Llama-3.1-8B-Instruct-Q8_0.gguf",
    verbose=False,
    n_gpu_layers=-1, #確保每一層都用GPU
    n_ctx=16384,    # This argument is how many tokens the model can take. The longer the better, but it will consume more memory. 16384 is a proper value for a GPU with 16GB VRAM.
)

def generate_response(_model: Llama, _messages: str) -> str:
    '''
    This function will inference the model with given messages.
    '''
    _output = _model.create_chat_completion(
        _messages,
        stop=["<|eot_id|>", "<|end_of_text|>"],
        max_tokens=512,    # This argument is how many tokens the model can generate, you can change it and observe the differences.
        temperature=0,      # This argument is the randomness of the model. 0 means no randomness. You will get the same result with the same input every time. You can try to set it to different values.
        repeat_penalty=2.0,
    )["choices"][0]["message"]["content"]
    return _output

llama_new_context_with_model: n_ctx_per_seq (16384) < n_ctx_train (131072) -- the full capacity of the model will not be utilized


## Search Tool

The TA has implemented a search tool for you to search certain keywords using Google Search. You can use this tool to search for the relevant **web pages** for the given question. The search tool can be integrated in the following sections.

In [9]:
from typing import List
from googlesearch import search as _search
from bs4 import BeautifulSoup
from charset_normalizer import detect
import asyncio
from requests_html import AsyncHTMLSession
import urllib3
import spacy

urllib3.disable_warnings()

nlp = spacy.load("en_core_web_sm")
async def worker(s:AsyncHTMLSession, url:str):
    try:
        header_response = await asyncio.wait_for(s.head(url, verify=False), timeout=10)
        if 'text/html' not in header_response.headers.get('Content-Type', ''):
            return None
        r = await asyncio.wait_for(s.get(url, verify=False), timeout=10)
        return r.text
    except:
        return None

async def get_htmls(urls):
    session = AsyncHTMLSession()
    tasks = (worker(session, url) for url in urls)
    return await asyncio.gather(*tasks)

async def search(keyword: str, n_results: int=3) -> List[str]:
    '''
    This function will search the keyword and return the text content in the first n_results web pages.

    Warning: You may suffer from HTTP 429 errors if you search too many times in a period of time. This is unavoidable and you should take your own risk if you want to try search more results at once.
    The rate limit is not explicitly announced by Google, hence there's not much we can do except for changing the IP or wait until Google unban you (we don't know how long the penalty will last either).
    '''
    keyword = keyword[:100]
    # First, search the keyword and get the results. Also, get 2 times more results in case some of them are invalid.
    results = list(_search(keyword, n_results * 2, lang="zh", unique=True))
    # Then, get the HTML from the results. Also, the helper function will filter out the non-HTML urls.
    results = await get_htmls(results)
    # Filter out the None values.
    results = [x for x in results if x is not None]
    # Parse the HTML.
    results = [BeautifulSoup(x, 'html.parser') for x in results]
    # Get the text from the HTML and remove the spaces. Also, filter out the non-utf-8 encoding.
    results = [''.join(x.get_text().split()) for x in results if detect(x.encode()).get('encoding') == 'utf-8']
    # Return the first n results.
    return results[:n_results]
# 關鍵字提取：這裡使用 spaCy NLP 來提取問題中的關鍵字
def extract_keywords(question: str) -> str:
    doc = nlp(question)
    keywords = [token.text for token in doc if token.pos_ in ["NOUN", "PROPN", "ADJ"]]
    return " ".join(keywords)


## Test the LLM inference pipeline

In [10]:
# You can try out different questions here.
test_question='請問誰是 Taylor Swift？'

messages = [
    {"role": "system", "content": "你是 LLaMA-3.1-8B，是用來回答問題的 AI。使用中文時只會使用繁體中文來回問題。"},    # System prompt
    {"role": "user", "content": test_question}, # User prompt
]

print(generate_response(llama3, messages))

泰勒絲（Taylor Swift）是一位美國歌手、詞曲作家和音樂製作人。她出生於1989年，來自田納西州。她的音乐风格从乡村摇滚发展到流行搖擺，並且她被誉为当代最成功的女艺人的之一。

泰勒絲早期在鄉郊小鎮演唱會時開始發展音樂事業，她推出了多張專輯，包括《Taylor Swift》、《Fearless》，以及後來更為知名的大熱作如 《1989》（2014年）、_reputation（）和 _Lover （）。她的歌曲經常探討愛情、友誼及自我成長等主題。

泰勒絲獲得了許多獎項，包括13座格萊美奖，並且是史上最快達到百萬銷量的女藝人之一。


## Agents

The TA has implemented the Agent class for you. You can use this class to create agents that can interact with the LLM model. The Agent class has the following attributes and methods:
- Attributes:
    - role_description: The role of the agent. For example, if you want this agent to be a history expert, you can set the role_description to "You are a history expert. You will only answer questions based on what really happened in the past. Do not generate any answer if you don't have reliable sources.".
    - task_description: The task of the agent. For example, if you want this agent to answer questions only in yes/no, you can set the task_description to "Please answer the following question in yes/no. Explanations are not needed."
    - llm: Just an indicator of the LLM model used by the agent.
- Method:
    - inference: This method takes a message as input and returns the generated response from the LLM model. The message will first be formatted into proper input for the LLM model. (This is where you can set some global instructions like "Please speak in a polite manner" or "Please provide a detailed explanation".) The generated response will be returned as the output.

In [29]:
class LLMAgent():
    def __init__(self, role_description: str, task_description: str, llm:str="bartowski/Meta-Llama-3.1-8B-Instruct-GGUF"):
        self.role_description = role_description   # Role means who this agent should act like. e.g. the history expert, the manager......
        self.task_description = task_description    # Task description instructs what task should this agent solve.
        self.llm = llm  # LLM indicates which LLM backend this agent is using.
    def inference(self, message:str) -> str:
        if self.llm == 'bartowski/Meta-Llama-3.1-8B-Instruct-GGUF': # If using the default one.
            # TODO: Design the system prompt and user prompt here.
            # Format the messsages first.
            messages = [
                {"role": "system", "content": f"{self.role_description}"},  # Hint: you may want the agents to speak Traditional Chinese only.
                {"role": "user", "content": f"{self.task_description}\n{message}"}, # Hint: you may want the agents to clearly distinguish the task descriptions and the user messages. A proper seperation text rather than a simple line break is recommended.
            ]
            return generate_response(llama3, messages)
        else:
            # TODO: If you want to use LLMs other than the given one, please implement the inference part on your own.
            return ""

TODO: Design the role description and task description for each agent.

In [None]:
# TODO: Design the role and task description for each agent.

# # Question extraction agent - designed to extract the core question
# question_extraction_agent = LLMAgent(
#     role_description="你是一個專精於理解問題的AI助手。你的任務是從用戶輸入中提取核心問題，去除無關的資訊，且核心問題需要為一個完整的問句，不能只是單一的名詞或形容詞。使用中文時只會使用繁體中文回應。",
#     task_description="請從以下輸入中提取核心問題，忽略任何與問題無關的內容。只返回提取出的核心問題，不要添加任何解釋或額外資訊：",
# )

# # Keyword extraction agent - designed to identify optimal search keywords
# keyword_extraction_agent = LLMAgent(
#     role_description="你是一個專精於提取搜尋關鍵字的AI助手。你的任務是從問題中識別出最關鍵、最具體的詞語，這些詞語能夠用於精確的網路搜尋。使用中文時只會使用繁體中文回應。",
#     task_description="請從以下問題中提取關鍵詞或短語，這些關鍵詞應該是問題中最獨特、最具體的部分。請嚴格遵守以下規則：\n1. 只返回關鍵詞，不要包含任何解釋\n2. 關鍵詞之間用空格分隔\n3. 不要使用標點符號\n4. 不要加入你認為相關但問題中沒有的詞語：",
# )
# 問題提煉 agent：負責從用戶輸入中提取核心問題，去除冗餘資訊
question_extraction_agent = LLMAgent(
    role_description="你是一個問題提煉專家，專門從冗長或複雜的描述中提煉出核心問題，並用精簡語言呈現。",
    task_description="請閱讀下面的輸入，並提煉出最核心、最直接需要回答的問題，這個問題為一個完整的句子。請保持輸出簡明扼要，不要包含額外描述或背景資訊。看到「」可以把裡面的內容當作是核心問題"
)

# 關鍵字提取 agent：負責從問題中抽取出最具代表性的關鍵字，供後續搜索使用
keyword_extraction_agent = LLMAgent(
    role_description="你是一個專精於提取搜尋關鍵字的AI助手。你的任務是從問題中識別出最關鍵、最具體的詞語，這些詞語能夠用於精確的網路搜尋，並且這些詞語必須是繁體中文。",
    task_description="請從以下問題中提取關鍵詞或短語，這些關鍵詞應該是問題中最獨特、最具體的部分。特別注意地點和形容詞和動詞。請嚴格遵守以下規則：\n1. 只返回關鍵詞，不要包含任何解釋\n2. 關鍵詞之間用空格分隔\n3. 不要使用標點符號\n4. 不要加入你認為相關但問題中沒有的詞語："
)
# This agent is the core component that answers the question.
qa_agent = LLMAgent(
    role_description="你是 LLaMA-3.1-8B，是用來回答問題的 AI。使用中文時只會使用繁體中文來回問題。",
    task_description="請回答以下問題：",
)

## RAG pipeline

TODO: Implement the RAG pipeline.

Please refer to the homework description slides for hints.

Also, there might be more heuristics (e.g. classifying the questions based on their lengths, determining if the question need a search or not, reconfirm the answer before returning it to the user......) that are not shown in the flow charts. You can use your creativity to come up with a better solution!

- Naive approach (simple baseline)

    ![](https://www.csie.ntu.edu.tw/~ulin/naive.png)

- Naive RAG approach (medium baseline)

    ![](https://www.csie.ntu.edu.tw/~ulin/naive_rag.png)

- RAG with agents (strong baseline)

    ![](https://www.csie.ntu.edu.tw/~ulin/rag_agent.png)

In [None]:
async def pipeline(question: str) -> str:
    """
    Implementation of the RAG with agents pipeline:
    1. Extract the core question using the question extraction agent
    2. Extract search keywords using the keyword extraction agent
    3. Retrieve relevant information using the search tool
    4. Generate an answer using the QA agent with the retrieved information
    """
    try:
        # Step 1: Extract the core question
        extracted_question = question_extraction_agent.inference(question)
        print(f"Extracted question: {extracted_question}")

        # # Step 2: Extract keywords for search
        keywords = keyword_extraction_agent.inference(extracted_question)
        print(f"Search keywords: {keywords}")

        # Step 3: Search for relevant information
        search_results = await search(keywords, n_results=3)

        # Prepare the context for the QA agent
        context = ""
        if search_results and len(search_results) > 0:
            # Limit the context length to avoid exceeding the model's context window
            for i, result in enumerate(search_results):
                # Truncate each result to avoid extremely long contexts
                truncated_result = result[:3000] if len(result) > 3000 else result
                context += f"搜尋結果 {i+1}:\n{truncated_result}\n\n"

        # Step 4: Formulate a prompt that includes both the original question and search results
        final_prompt = f"""
        問題: {extracted_question}
        參考資料:
        {context}

        基於上述參考資料，請回答問題。如果參考資料中沒有足夠的資訊，請根據你所知道的知識進行回答。請直截回答答案，而不要先說根據什麼資料
        """

        # Step 5: Generate the final answer
        answer = qa_agent.inference(final_prompt)

        return answer

    except Exception as e:
        # Fallback to direct QA if any part of the pipeline fails
        print(f"Pipeline error: {e}. Falling back to direct QA.")
        return qa_agent.inference(question)

## Answer the questions using your pipeline!

Since Colab has usage limit, you might encounter the disconnections. The following code will save your answer for each question. If you have mounted your Google Drive as instructed, you can just rerun the whole notebook to continue your process.

In [35]:
from pathlib import Path

# Fill in your student ID first.
STUDENT_ID = "test-4"

STUDENT_ID = STUDENT_ID.lower()
with open('./public.txt', 'r') as input_f:
    questions = input_f.readlines()
    questions = [l.strip().split(',')[0] for l in questions]
    for id, question in enumerate(questions, 1):
        if Path(f"./{STUDENT_ID}_{id}.txt").exists():
            continue
        answer = await pipeline(question)
        answer = answer.replace('\n',' ')
        print(id, answer)
        with open(f'./{STUDENT_ID}_{id}.txt', 'w') as output_f:
            print(answer, file=output_f)

# with open('./private.txt', 'r') as input_f:
#     questions = input_f.readlines()
#     for id, question in enumerate(questions, 31):
#         if Path(f"./{STUDENT_ID}_{id}.txt").exists():
#             continue
#         answer = await pipeline(question)
#         answer = answer.replace('\n',' ')
#         print(id, answer)
#         with open(f'./{STUDENT_ID}_{id}.txt', 'a') as output_f:
#             print(answer, file=output_f)

Extracted question: 「虎山雄風飛揚」是哪間學校的校歌？
Search keywords: 虎山雄風飛揚 校歌
1 光華國小的校歌是「虎山雄風飛揚」。
Extracted question: 民眾透過境外郵購自用產品回台加收審查費多少錢？
Search keywords: 境外郵購 自用產品 審查費
2 民眾透過境外郵購自用產品回台加收審查費 750 元。
Extracted question: 史蒂夫·乔布斯
Search keywords: 史蒂夫·乔布斯
3 史蒂夫·乔布斯（Steven Paul Jobs）是一名美国企业家、营销人士和发明者。他于1955年2月24日出生，2011 年10 月 ５ 日去世。 他是苹果公司的联合创始之一，并曾担任董事长及首席执行官职位。  他以推动个人电脑革命以及设计创新而闻名，他在Apple和Pixar等多家企业中发挥了重要作用。他也是一个有争议的人物，受到过政府调查并与Adobe公司发生冲突。
Extracted question: 托福網路測驗 TOEFL iBT 要達到多少分才能申請進階英文免修？
Search keywords: 托福網路測驗 TOEFL iBT 免修分數
4 托福網路測驗（TOEFL iBT）要達到 92 分才能申請進階英文免修。
Extracted question: 觸地 try 可得 5 分。
Search keywords: 觸地 try 得分
5 觸地得分（Try）可得到 5 分。
Extracted question: 卑南族的祖先發源地是哪裡？
Search keywords: 卑南族祖先發源地
6 卑南族的祖先發源地是台東太麻里鄉美和海岸附近。
Extracted question: 熊仔的碩班指導教授為？
Search keywords: 熊仔 碩班 指導教授
7 熊仔的碩班指導教授為李琳山。
Extracted question: 誰發現了電磁感應定律？
Search keywords: 麥克斯韋
8 詹姆斯·克拉 克麦克斯韦
Extracted question: 距離國立臺灣史前文化博物館最近的臺鐵車站為？
Search keywords: 國立臺灣史前文化博物館  臺鐵車站
9 康樂車站
Extracted question: 20+30=?
Search 

In [37]:
# Combine the results into one file.
with open(f'./{STUDENT_ID}.txt', 'w') as output_f:
    for id in range(1,91):
        with open(f'./{STUDENT_ID}_{id}.txt', 'r') as input_f:
            answer = input_f.readline().strip()
            print(answer, file=output_f)