<a href="https://colab.research.google.com/github/Ha1ion/2025_NLP_HW4/blob/main/nlp_hw4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RAG using Langchain

## Packages loading & import

In [1]:
!pip install "langchain-core>=0.2.0,<0.3.0" \
             "langchain>=0.2.0,<0.3.0" \
             "langchain-community>=0.2.0,<0.3.0" \
             "langchain-huggingface>=0.0.3,<0.1.0" \
             "langchain-chroma>=0.1.0,<0.2.0" \
             "langchain-ollama>=0.1.0,<0.2.0" \
             "langchain-text-splitters>=0.2.0,<0.3.0" \
             "transformers>=4.39.0" \
             "accelerate>=0.28.0" \
             "sentence-transformers" \
             rank-bm25 \
             huggingface_hub \
             tqdm \
             beautifulsoup4



In [2]:
import os
import json
import bs4
import nltk
import torch
import pickle
import numpy as np

# from pyserini.index import IndexWriter
# from pyserini.search import SimpleSearcher
from numpy.linalg import norm
from rank_bm25 import BM25Okapi
from nltk.tokenize import word_tokenize

from langchain_community.llms import Ollama
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain
from langchain_chroma import Chroma

from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever

from sentence_transformers import SentenceTransformer
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.embeddings import JinaEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter, TokenTextSplitter
from langchain.docstore.document import Document
from langchain_core.prompts import ChatPromptTemplate
from langchain_community.document_loaders import WebBaseLoader
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer

from tqdm import tqdm



In [3]:
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

## Hugging Face login
- Request access to the model first: https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct
- If you have not been granted access to this model, you can use another LLM that does not require an access request.
- Save your Hugging Face token; otherwise you will need to regenerate it every time.
- When using Ollama, no login is required to access and use the Llama model.
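
One way to avoid pasting the token into the notebook is to read it from an environment variable. The sketch below assumes the token is stored under the name `HF_TOKEN`; the helper `read_hf_token` is hypothetical and not part of `huggingface_hub`.

```python
import os

def read_hf_token(env=None, name="HF_TOKEN"):
    """Return the token stored under `name`, or None if it is missing or blank."""
    env = os.environ if env is None else env
    token = env.get(name, "").strip()
    return token or None

token = read_hf_token()
if token is None:
    print("No token found; set HF_TOKEN, or use Ollama, which needs no login.")
else:
    print("Token found; it can be passed to huggingface_hub.login(token=token).")
```

Keeping the token out of the notebook source also avoids committing it by accident when the notebook is pushed to GitHub.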

In [4]:
from huggingface_hub import login

hf_token = ""
login(token=hf_token, add_to_git_credential=True)

In [5]:
!huggingface-cli whoami

Ha1ion


## TODO1: Set up the environment of Ollama

### Introduction to Ollama
- Ollama is a platform designed for running and managing large language models (LLMs) directly **on local devices**, providing a balance between performance, privacy, and control.
- Other tools also help users run and accelerate LLMs on local devices, such as *vLLM*, *llamafile*, and *GPT4All*.

### Launch colabxterm

In [6]:
# TODO1-1: You should install colab-xterm and launch it.
# Write your commands here.
!pip install colab-xterm
%load_ext colabxterm



In [7]:
# TODO1-2: You should install Ollama.
# You may need root privileges if you use a local machine instead of Colab.
!curl -fsSL https://ollama.com/install.sh | sh

>>> Installing ollama to /usr/local
>>> Downloading Linux amd64 bundle
######################################################################## 100.0%
>>> Creating ollama user...
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.


In [None]:
%xterm

In [None]:
# TODO1-3: Pull Llama3.2:1b via Ollama and start the Ollama service in the xterm
# Run these commands in the xterm, not in this cell (they are shell commands, not Python):
#   ollama serve &
#   ollama pull llama3.2:1b

## Ollama testing
You can test your Ollama status with the following cells.
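
Before invoking the model, it can help to confirm that the server answers at all. This is a minimal sketch that assumes the server listens on `127.0.0.1:11434` (the address printed by the installer above) and answers a plain HTTP GET with status 200; the helper name `is_ollama_up` is hypothetical.

```python
import urllib.error
import urllib.request

def is_ollama_up(base_url="http://127.0.0.1:11434", timeout=2.0):
    """Return True if an HTTP GET on base_url answers with status 200."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or malformed URL: treat as "not running".
        return False

print("Ollama reachable:", is_ollama_up())
```

If this prints `False`, start `ollama serve` in the xterm before running the cells below.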

In [11]:
# Setting up the model that this tutorial will use
MODEL = "llama3.2:1b"  # https://ollama.com/library/llama3.2:1b
EMBED_MODEL = "jinaai/jina-embeddings-v2-base-en"

In [12]:
# Initialize an instance of the Ollama model
llm = Ollama(model=MODEL)
# Invoke the model to generate responses
response = llm.invoke("What is the capital of Taiwan?")
print(response)

The capital of Taiwan is Taipei.


## Build a simple RAG system by using LangChain

### TODO2: Load the cat-facts dataset and prepare the retrieval database

In [13]:
!wget https://huggingface.co/ngxson/demo_simple_rag_py/resolve/main/cat-facts.txt

--2025-12-03 08:16:26--  https://huggingface.co/ngxson/demo_simple_rag_py/resolve/main/cat-facts.txt
Resolving huggingface.co (huggingface.co)... 3.165.160.12, 3.165.160.61, 3.165.160.59, ...
Connecting to huggingface.co (huggingface.co)|3.165.160.12|:443... connected.
HTTP request sent, awaiting response... 307 Temporary Redirect
Location: /api/resolve-cache/models/ngxson/demo_simple_rag_py/ccd6b7b72b52c7ca4e8f2a0a00b15c368d6ae294/cat-facts.txt?%2Fngxson%2Fdemo_simple_rag_py%2Fresolve%2Fmain%2Fcat-facts.txt=&etag=%22bc94ddd9483183e01bcf61e8bf9450fe3e09edb3%22 [following]
--2025-12-03 08:16:26--  https://huggingface.co/api/resolve-cache/models/ngxson/demo_simple_rag_py/ccd6b7b72b52c7ca4e8f2a0a00b15c368d6ae294/cat-facts.txt?%2Fngxson%2Fdemo_simple_rag_py%2Fresolve%2Fmain%2Fcat-facts.txt=&etag=%22bc94ddd9483183e01bcf61e8bf9450fe3e09edb3%22
Reusing existing connection to huggingface.co:443.
HTTP request sent, awaiting response... 200 OK
Length: 22657 (22K) [text/plain]
Saving to: ‘cat-fac

In [15]:
# TODO2-1: Load the cat-facts dataset (as `refs`, which is a list of strings for all the cat facts)
# Write your code here
import requests
from langchain_core.documents import Document
url = "https://huggingface.co/ngxson/demo_simple_rag_py/resolve/main/cat-facts.txt"
response = requests.get(url)
refs = [line.strip() for line in response.text.split('\n') if line.strip()]

# Convert to LangChain Document objects, keeping each fact's index as its id
docs = [Document(page_content=doc, metadata={"id": i}) for i, doc in enumerate(refs)]

print(f"Successfully loaded {len(docs)} facts.")

Successfully loaded 150 facts.


In [17]:
# Create an embedding model
model_kwargs = {'trust_remote_code': True}
encode_kwargs = {'normalize_embeddings': False}
embeddings_model = HuggingFaceEmbeddings(
    model_name=EMBED_MODEL,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


A new version of the following files was downloaded from https://huggingface.co/jinaai/jina-bert-implementation:
- configuration_bert.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.

A new version of the following files was downloaded from https://huggingface.co/jinaai/jina-bert-implementation:
- modeling_bert.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.

In [18]:
# TODO2-2: Prepare the retrieval database
# You should create a Chroma vector store.
# search_type can be "similarity" (default), "mmr", or "similarity_score_threshold"

# 1. Load the embedding model (on the GPU when one is available)
model_name = "jinaai/jina-embeddings-v2-base-en"
model_kwargs = {'trust_remote_code': True, 'device': 'cuda' if torch.cuda.is_available() else 'cpu'}
encode_kwargs = {'normalize_embeddings': False}

print("Loading Jina Embeddings...")
embeddings_model = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)

# 2. Build the Chroma vector store (semantic retrieval)
# This embeds the documents and stores the vectors in the database
print("Creating Vector Store (Chroma)...")
vector_store = Chroma.from_documents(
    documents=docs,
    embedding=embeddings_model,
    collection_name="cat_facts"
)
chroma_retriever = vector_store.as_retriever(search_kwargs={"k": 3})  # return the 3 most relevant documents

# 3. Build the BM25 retriever (keyword retrieval, added as an optimization)
# This compensates for semantic retrieval's weakness on exact keyword matches
print("Creating Keyword Retriever (BM25)...")
bm25_retriever = BM25Retriever.from_documents(docs)
bm25_retriever.k = 3  # return the top 3 documents

# 4. Build the ensemble retriever (hybrid retrieval)
# Combines the strengths of both (weight 0.5 each)
print("Initializing Hybrid Retriever...")
retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, chroma_retriever],
    weights=[0.5, 0.5]
)

print("Success! Hybrid Retriever (BM25 + Chroma) is ready.")

Loading Jina Embeddings...
Creating Vector Store (Chroma)...


ERROR:chromadb.telemetry.product.posthog:Failed to send telemetry event ClientStartEvent: capture() takes 1 positional argument but 3 were given
ERROR:chromadb.telemetry.product.posthog:Failed to send telemetry event ClientCreateCollectionEvent: capture() takes 1 positional argument but 3 were given


Creating Keyword Retriever (BM25)...
Initializing Hybrid Retriever...
Success! Hybrid Retriever (BM25 + Chroma) is ready.
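
The hybrid retriever has to merge two ranked lists into one. A common way to do this is weighted Reciprocal Rank Fusion, which LangChain's `EnsembleRetriever` is documented as using. The sketch below illustrates the idea only and is not LangChain's code: the function `weighted_rrf`, the constant `c=60`, and the document ids are all assumptions chosen for illustration.

```python
from collections import defaultdict

def weighted_rrf(rankings, weights, c=60):
    """Fuse ranked lists of doc ids; each hit adds weight / (rank + c), with rank starting at 1."""
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank + c)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

bm25_ids = [7, 2, 9]   # hypothetical keyword ranking
dense_ids = [2, 5, 7]  # hypothetical embedding ranking
fused = weighted_rrf([bm25_ids, dense_ids], weights=[0.5, 0.5])
print(fused)  # [2, 7, 5, 9]
```

Documents that appear in both lists (2 and 7 here) rise to the top, and the fused list can be longer than either input: with `k=3` per retriever it holds up to 6 unique documents.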


### Prompt setting

In [19]:
# TODO3: Set up the `system_prompt` and configure the prompt.
system_prompt = """You are a helpful assistant for question-answering tasks.
Use the following pieces of retrieved context to answer the question.
If the answer is not in the context, just say that you don't know.
Keep your answer concise.

Context:
{context}
"""

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)

print("Prompt template configured successfully.")

Prompt template configured successfully.


- Vector stores such as FAISS and Chroma (https://python.langchain.com/docs/integrations/vectorstores/) are commonly used to search very large embedding databases efficiently.
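
To make the idea concrete, here is a brute-force version of what a vector store does: rank documents by similarity to a query embedding. It is a sketch that assumes cosine similarity as the measure and uses toy 2-dimensional vectors; real stores use indexing structures to avoid scanning every document.

```python
import numpy as np

def top_k_cosine(query_vec, doc_matrix, k=3):
    """Return the indices of the k rows of doc_matrix most similar to query_vec."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    sims = d @ q  # cosine similarity of each document with the query
    return np.argsort(-sims)[:k].tolist()

toy_docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])  # toy embeddings
toy_query = np.array([1.0, 0.2])
print(top_k_cosine(toy_query, toy_docs, k=2))  # [0, 2]
```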

In [20]:
# TODO4: Build and run the RAG system
# TODO4-1: Load the QA chain
# You should create a chain for passing a list of Documents to a model.
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain

question_answer_chain = create_stuff_documents_chain(llm, prompt)
# TODO4-2: Create retrieval chain
# You should create retrieval chain that retrieves documents and then passes them on.
chain = create_retrieval_chain(retriever, question_answer_chain)

print("RAG Chains initialized successfully.")

RAG Chains initialized successfully.


In [26]:
# Question (queries) and answer pairs
# Write your code here
# Please load the questions_answers.txt file and prepare the `queries` and `answers` lists.
queries = []
answers = []

print("Parsing questions_answers.txt...")

try:
    with open('questions_answers.txt', 'r', encoding='utf-8') as f:
        # Read all non-empty lines
        lines = [line.strip() for line in f.readlines() if line.strip()]

    # The file alternates: question -> answer -> question -> answer ...
    # so we step through the lines two at a time
    for i in range(0, len(lines), 2):
        if i + 1 < len(lines):
            queries.append(lines[i])
            answers.append(lines[i+1])

    print(f"Successfully loaded {len(queries)} QA pairs.")

except FileNotFoundError:
    print("Error: 'questions_answers.txt' not found. Please upload it to Colab.")

Parsing questions_answers.txt...
Successfully loaded 150 QA pairs.


In [None]:
results = []
correct_count = 0
recall_at_1_count = 0
recall_at_5_count = 0

for i, query in tqdm(enumerate(queries), total=len(queries)):
    # TODO4-3: Run the RAG system
    # Call the chain and keep the full return value, which includes the retrieved context
    response_dict = chain.invoke({"input": query})
    response = response_dict["answer"]  # the generated answer text

    # The following lines perform evaluations.
    # if the answer shows up in your response, the response is considered correct.
    ground_truth = answers[i]
    prediction = response.strip()
    is_correct = False

    # TA's definition: a response counts as correct if it contains the answer (case-insensitive)
    if ground_truth.lower() in prediction.lower():
        correct_count += 1
        is_correct = True

    # Compute recall@1, recall@5 and Accuracy.
    # Per the TA's instructions: recall checks whether the retrieved ids include the current question's index (i)
    retrieved_docs = response_dict["context"]
    retrieved_ids = [doc.metadata.get('id') for doc in retrieved_docs]

    # Check Recall@1
    if i in retrieved_ids[:1]:
        recall_at_1_count += 1

    # Check Recall@k (k depends on the retriever settings; with k=3 per retriever the ensemble may return more than 3 unique documents)
    if i in retrieved_ids:
        recall_at_5_count += 1

    # Store the questions, ground-truths and answers in a json file.
    results.append({
        "Query": query,
        "Ground_Truth": ground_truth,
        "Prediction": prediction,
        "Retrieved_IDs": retrieved_ids,
        "Correct": is_correct
    })

# TODO5: Improve to let the LLM correctly answer the ten questions.
# (Compute the final scores and write the JSON output here)

total_q = len(queries)
accuracy = correct_count / total_q
recall_1 = recall_at_1_count / total_q
recall_k = recall_at_5_count / total_q

print("\n" + "="*30)
print(f"FINAL METRICS (Validation)")
print("="*30)
print(f"Accuracy (EM) : {accuracy:.2%} ({correct_count}/{total_q})")
print(f"Recall@1      : {recall_1:.2%} ({recall_at_1_count}/{total_q})")
print(f"Recall@k      : {recall_k:.2%} ({recall_at_5_count}/{total_q})")
print("="*30)

# Save as JSON
student_id = "YOUR_STUDENT_ID"  # remember to replace this with your student ID
output_filename = f"NLP_HW4_NTHU_{student_id}.json"

output_json = []
for res in results:
    output_json.append({
        "Query": res["Query"],
        "Ground_Truth": res["Ground_Truth"],
        "Prediction": res["Prediction"]
    })

with open(output_filename, 'w', encoding='utf-8') as f:
    json.dump(output_json, f, ensure_ascii=False, indent=4)

print(f"Results saved to {output_filename}")
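
The two checks used in the loop above can be isolated as small functions, which makes them easier to test on their own. This is a sketch: the helper names `contains_answer` and `hit_at_k` are hypothetical, and the example strings are made up for illustration.

```python
def contains_answer(prediction, ground_truth):
    """Case-insensitive substring match, the correctness rule used in the loop above."""
    return ground_truth.lower() in prediction.lower()

def hit_at_k(retrieved_ids, target_id, k=None):
    """True if target_id is among the first k retrieved ids (all ids when k is None)."""
    return target_id in retrieved_ids[:k]

print(contains_answer("Cats sleep about 16 hours a day.", "16 HOURS"))  # True
print(hit_at_k([4, 0, 9], target_id=0, k=1))                            # False
print(hit_at_k([4, 0, 9], target_id=0))                                 # True
```

Note that the substring rule is more lenient than exact match: a long response that happens to mention the answer is counted as correct.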

In [30]:
# === Rescue checkpoint ===
import json

# 1. Check how many questions have been processed so far
current_count = len(results)
print(f"Recovered {current_count} results!")

# 2. Save a backup copy first (just in case)
backup_filename = f"NLP_HW4_backup_{current_count}.json"
output_json = []
for res in results:
    output_json.append({
        "Query": res["Query"],
        "Ground_Truth": res["Ground_Truth"],
        "Prediction": res["Prediction"]
    })

with open(backup_filename, 'w', encoding='utf-8') as f:
    json.dump(output_json, f, ensure_ascii=False, indent=4)

print(f"Progress backed up to: {backup_filename}")
print("Be sure to download this file as a backup first!")

Recovered 104 results!
Progress backed up to: NLP_HW4_backup_104.json
Be sure to download this file as a backup first!


In [31]:
# === Restart the Ollama service ===
import subprocess
import time

print("Restarting Ollama...")
!pkill ollama
# Send server output to DEVNULL: a PIPE that is never read can fill up and block the server
process = subprocess.Popen("ollama serve", shell=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
time.sleep(5)  # give the server time to start
!ollama pull llama3.2:1b
print("Ollama restarted! Ready to run the remaining questions...")

Restarting Ollama...
Ollama restarted! Ready to run the remaining questions...
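
A fixed `time.sleep` either waits too long or not long enough. An alternative is to poll a readiness check until it passes or a timeout expires. The sketch below is generic: `wait_until` is a hypothetical helper, and the toy predicate stands in for a real health check against the Ollama server.

```python
import time

def wait_until(predicate, timeout=30.0, interval=0.5):
    """Poll predicate() until it returns True; return False if `timeout` seconds pass first."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return False

# Toy predicate that succeeds on the third call, standing in for a real health check.
calls = {"n": 0}

def ready():
    calls["n"] += 1
    return calls["n"] >= 3

print(wait_until(ready, timeout=5.0, interval=0.01))  # True
```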


In [32]:
# === Resume from where the run stopped ===

# 1. Set the starting point (continue from the current progress)
start_index = len(results)
print(f"Resuming from question {start_index}...")

# 2. Run only the remaining questions
# Note: do not clear the results list; we keep appending to it
for i in tqdm(range(start_index, len(queries)), total=len(queries) - start_index):

    query = queries[i]
    ground_truth = answers[i]

    try:
        # Retry the RAG chain
        response_dict = chain.invoke({"input": query})
        response = response_dict["answer"]

        # --- The logic below is the same as in the original loop ---
        prediction = response.strip()
        is_correct = False
        if ground_truth.lower() in prediction.lower():
            correct_count += 1
            is_correct = True

        retrieved_docs = response_dict["context"]
        retrieved_ids = [doc.metadata.get('id') for doc in retrieved_docs]

        if i in retrieved_ids[:1]:
            recall_at_1_count += 1
        if i in retrieved_ids:
            recall_at_5_count += 1

        # Append the result to the existing results list
        results.append({
            "Query": query,
            "Ground_Truth": ground_truth,
            "Prediction": prediction,
            "Retrieved_IDs": retrieved_ids,
            "Correct": is_correct
        })

    except Exception as e:
        print(f"Error on question {i}: {e}")
        # On failure, skip this question so the loop keeps going
        continue

print("Done! All questions have been processed.")

Resuming from question 104...


100%|██████████| 46/46 [00:25<00:00,  1.80it/s]

Done! All questions have been processed.
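
The resume loop above skips a question when the call fails. Another option is to retry the call a few times before giving up, so that a short server hiccup does not leave a gap in `results`. This is a minimal sketch: `invoke_with_retry` is a hypothetical helper, and the `flaky` function only simulates a call that fails twice.

```python
import time

def invoke_with_retry(fn, *args, retries=3, delay=1.0, **kwargs):
    """Call fn(*args, **kwargs), retrying on any exception, for at most `retries` attempts."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    for attempt in range(1, retries + 1):
        try:
            return fn(*args, **kwargs)
        except Exception:
            if attempt == retries:
                raise  # out of attempts: surface the last error
            time.sleep(delay * attempt)  # wait a little longer after each failure

# Simulated unreliable call: fails twice, then succeeds.
attempts = {"n": 0}

def flaky(x):
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("temporary failure")
    return x * 2

print(invoke_with_retry(flaky, 21, retries=3, delay=0.01))  # 42
```

In the notebook this could wrap the chain call, for example `invoke_with_retry(chain.invoke, {"input": query})`.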





In [33]:
# === Final step: produce the final report and output file ===
import json

# 1. Recompute all metrics (to make sure the numbers are correct)
final_correct = 0
final_recall_1 = 0
final_recall_k = 0
total_q = len(results)

print(f"Computing statistics over {total_q} results...")

# This assumes results is in the same order as queries (0~149),
# so each entry's position in the list is its question index.
for current_idx, res in enumerate(results):
    # Count accuracy (EM)
    if res["Correct"]:
        final_correct += 1

    # Count recall from the retrieved ids recorded in results
    retrieved_ids = res["Retrieved_IDs"]

    if current_idx in retrieved_ids[:1]:
        final_recall_1 += 1
    if current_idx in retrieved_ids:
        final_recall_k += 1

# 2. Compute the percentages
accuracy = final_correct / total_q if total_q > 0 else 0
recall_1 = final_recall_1 / total_q if total_q > 0 else 0
recall_k = final_recall_k / total_q if total_q > 0 else 0

# 3. Show the final report (take a screenshot of this part!)
print("\n" + "="*35)
print(f"FINAL METRICS (Validation)")
print("="*35)
print(f"Total Questions : {total_q}")
print(f"Accuracy (EM)   : {accuracy:.2%} ({final_correct}/{total_q})")
print(f"Recall@1        : {recall_1:.2%} ({final_recall_1}/{total_q})")
print(f"Recall@k        : {recall_k:.2%} ({final_recall_k}/{total_q})")
print("="*35)

# 4. Write the JSON file
student_id = "114164518"  # <--- be sure to change this!
output_filename = f"NLP_HW4_NTHU_{student_id}.json"

output_json = []
for res in results:
    output_json.append({
        "Query": res["Query"],
        "Ground_Truth": res["Ground_Truth"],
        "Prediction": res["Prediction"]
    })

with open(output_filename, 'w', encoding='utf-8') as f:
    json.dump(output_json, f, ensure_ascii=False, indent=4)

print(f"\nFile saved as: {output_filename}")
print("Done! Download this JSON file and paste a screenshot of the FINAL METRICS above into your report.")

Computing statistics over 150 results...

FINAL METRICS (Validation)
Total Questions : 150
Accuracy (EM)   : 55.33% (83/150)
Recall@1        : 82.00% (123/150)
Recall@k        : 100.00% (150/150)

File saved as: NLP_HW4_NTHU_114164518.json
Done! Download this JSON file and paste a screenshot of the FINAL METRICS above into your report.


In [28]:
# Rescue code to restart the Ollama service
import subprocess
import time

print("Checking for existing Ollama process...")
# First try to kill any old process that may be stuck
!pkill ollama

print("Starting Ollama server...")
# Start Ollama in the background; output goes to DEVNULL because a PIPE that is never read can fill up and block the server
process = subprocess.Popen("ollama serve", shell=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)

# Wait 10 seconds for it to warm up
print("Waiting for Ollama to start (10s)...")
time.sleep(10)

# Make sure the model is available (it is downloaded if missing; this is fast if it was already downloaded)
print("Ensuring model is loaded...")
!ollama pull llama3.2:1b

print("="*30)
print("Ollama restarted successfully!")
print("Now you can go back and run your TODO 5 loop.")
print("="*30)

Checking for existing Ollama process...
Starting Ollama server...
Waiting for Ollama to start (10s)...
Ensuring model is loaded...
Ollama restarted successfully!
Now you can go back and run your TODO 5 loop.
