## Main RAG Pipeline

Required files:
- documents.json
- questions.json

Output files:
- pred.json

### Environment Setup & Load Input Files
Install the required packages, then load the documents and questions from their JSON files.

In [None]:
!pip install -q transformers sentence-transformers accelerate

In [None]:
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForCausalLM
from sentence_transformers import SentenceTransformer
from tqdm import tqdm
import json

In [None]:
def load_json(filename):
    with open(filename, 'r', encoding='utf-8') as f:
        return json.load(f)

documents = load_json('documents.json')
questions = load_json('questions.json')
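The later cells read `doc_id` and `text` from each document and `question_id` and `question` from each question, so it can help to check for those keys right after loading. A minimal sketch, assuming both files hold a JSON list of objects; the helper name `check_records` and the sample records are hypothetical:

```python
def check_records(records, required_keys, name):
    """Raise ValueError if any record lacks one of the required keys."""
    if not isinstance(records, list):
        raise ValueError(f"{name}: expected a list, got {type(records).__name__}")
    for i, rec in enumerate(records):
        missing = [k for k in required_keys if k not in rec]
        if missing:
            raise ValueError(f"{name}[{i}] is missing keys: {missing}")
    return len(records)


# Hypothetical sample data standing in for documents.json / questions.json.
sample_documents = [{"doc_id": "d1", "text": "Aspirin is an analgesic."}]
sample_questions = [{"question_id": "q1", "question": "What is aspirin?"}]

n_docs = check_records(sample_documents, ("doc_id", "text"), "documents")
n_questions = check_records(sample_questions, ("question_id", "question"), "questions")
print(n_docs, n_questions)
```

In the notebook the same helper would be called on the loaded data, for example `check_records(documents, ("doc_id", "text"), "documents")`.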

### Define Dense Retriever (Sentence Embedding Model)
Load a SentenceTransformer model (`all-MiniLM-L6-v2`, 384-dimensional output) that converts text into L2-normalized dense embeddings.

In [None]:
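embedding_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

def preprocess(text):
    # Encode text into an L2-normalized embedding tensor, so that a dot product
    # between two embeddings equals their cosine similarity.
    emb = embedding_model.encode(text, convert_to_tensor=True, normalize_embeddings=True)
    return emb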
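
Because `normalize_embeddings=True` returns unit-length vectors, the cosine similarity between two embeddings reduces to a plain dot product. A small dependency-free sketch of that identity, using made-up 3-dimensional vectors rather than real model output:

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity computed from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Made-up vectors for illustration only.
u = [1.0, 2.0, 2.0]
v = [2.0, 0.0, 1.0]

cos_uv = cosine(u, v)
dot_normalized = dot(l2_normalize(u), l2_normalize(v))
print(abs(cos_uv - dot_normalized) < 1e-12)
```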


### Save Document Embeddings
Generate dense embeddings for all documents and save them for later retrieval.

In [None]:
print("Encoding documents...")
doc_embeddings = []
for doc in tqdm(documents):
    emb = preprocess(doc["text"])
    doc_embeddings.append({
        "doc_id": doc["doc_id"],
        "embedding": emb
    })

Encoding documents...


100%|██████████| 1908/1908 [00:16<00:00, 115.87it/s]


In [None]:
doc_embeddings[0]['embedding'].size()

torch.Size([384])

In [None]:
# Save the embeddings as a {doc_id: tensor} dict on CPU, then download the file from Colab
from google.colab import files
embedding_data = {
    doc["doc_id"]: doc["embedding"].cpu() for doc in doc_embeddings
}

torch.save(embedding_data, "doc_embeddings.pt")
files.download("doc_embeddings.pt")


### Define Generator Model (Large Language Model)
Load a Hugging Face causal language model (here, TinyLlama-1.1B-Chat) that generates an answer from the retrieved context.

In [None]:
hf_model = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

tokenizer = AutoTokenizer.from_pretrained(hf_model)
generate_model = AutoModelForCausalLM.from_pretrained(hf_model, device_map="auto", torch_dtype=torch.float16)

def ask_llm(context, question):
    prompt = f"""<|system|>
    You are a helpful medical assistant.
    <|user|>
    Answer the following question using the provided context.

    Context:
    {context}

    Question:
    {question}
    <|assistant|>"""

    inputs = tokenizer(prompt, return_tensors="pt").to(generate_model.device)
    outputs = generate_model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        pad_token_id=tokenizer.eos_token_id
    )
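    # The decoded sequence contains the prompt as well, so keep only the text
    # after the last assistant tag.
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    if "<|assistant|>" in response:
        return response.split("<|assistant|>")[-1].strip()
    else:
        return response.strip()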
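
The triple-quoted prompt above is indented inside the function, so the lines after the first carry their leading spaces into the prompt. One way to avoid that is to keep the conversation as a list of role/content messages and render it separately; `transformers` tokenizers expose `apply_chat_template` for this, which uses the template shipped with the model. The sketch below builds such a message list and renders it with a hand-written stand-in. The function names and the exact tag layout in `render_messages` are assumptions for illustration, not the model's official template:

```python
def build_messages(context, question):
    """Return the conversation as role/content dicts."""
    user_content = (
        "Answer the following question using the provided context.\n\n"
        f"Context:\n{context}\n\n"
        f"Question:\n{question}"
    )
    return [
        {"role": "system", "content": "You are a helpful medical assistant."},
        {"role": "user", "content": user_content},
    ]


def render_messages(messages):
    """Hypothetical renderer: one tag line per role, no stray indentation."""
    parts = [f"<|{m['role']}|>\n{m['content']}" for m in messages]
    parts.append("<|assistant|>\n")  # cue for the model to start answering
    return "\n".join(parts)


prompt = render_messages(build_messages("Aspirin is an analgesic.", "What is aspirin?"))
print(prompt)
```

With the real tokenizer, `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` would take the place of `render_messages`.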


### RAG (Retrieval-Augmented Generation)
For each question, retrieve the top 5 documents by cosine similarity, then generate an answer with the language model using the top 3 of them as context.
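
The retrieval step can be illustrated without any model: score every document against the query and keep the k best. A minimal sketch with made-up, already-normalized 2-dimensional embeddings; `top_k_docs` is a hypothetical helper, and `heapq.nlargest` is used here in place of a full sort:

```python
import heapq


def top_k_docs(query_emb, doc_embs, k=5):
    """Return the k (doc_id, score) pairs with the highest dot product."""
    scored = (
        (doc_id, sum(q * d for q, d in zip(query_emb, emb)))
        for doc_id, emb in doc_embs.items()
    )
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Made-up unit vectors; real embeddings have 384 dimensions.
doc_embs = {
    "d1": [1.0, 0.0],
    "d2": [0.0, 1.0],
    "d3": [0.6, 0.8],
}
query = [0.8, 0.6]

top = top_k_docs(query, doc_embs, k=2)
retrieved_ids = [doc_id for doc_id, _ in top]
print(retrieved_ids)
```

On tensors the same idea is usually written as one matrix-vector product over the stacked document embeddings followed by `torch.topk`, which avoids the per-document Python loop.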

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"
doc_embeddings = torch.load("doc_embeddings.pt")
doc_lookup = {doc["doc_id"]: doc["text"] for doc in documents}

def cosine_similarity(tensor1, tensor2):
    tensor1 = F.normalize(tensor1.unsqueeze(0))
    tensor2 = F.normalize(tensor2.unsqueeze(0))
    return torch.mm(tensor1, tensor2.T).item()

print("Retrieving top 5 documents for each question...")
results = []

for q in tqdm(questions):
    q_emb = preprocess(q["question"])
    scored_docs = []

    for doc_id, doc_emb in doc_embeddings.items():
        score = cosine_similarity(q_emb, doc_emb.to(device))
        scored_docs.append((doc_id, score))

    top_docs = sorted(scored_docs, key=lambda x: x[1], reverse=True)[:5]
    retrieved_ids = [doc_id for doc_id, _ in top_docs]

    # Only the three highest-scoring documents are passed to the generator as context.
    combined_context = "\n\n".join([doc_lookup[doc_id] for doc_id in retrieved_ids[:3] if doc_id in doc_lookup])

    resp = ask_llm(combined_context, q["question"])

    results.append({
        "question_id": q["question_id"],
        "retrieved_docs": retrieved_ids,
        "answer": resp
    })

with open('pred.json', 'w', encoding='utf-8') as f:
    json.dump(results, f, indent=2, ensure_ascii=False)

Retrieving top 5 documents for each question...


  6%|▌         | 19/325 [01:00<18:35,  3.65s/it]
Token indices sequence length is longer than the specified maximum sequence length for this model (3328 > 2048). Running this sequence through the model will result in indexing errors
This is a friendly reminder - the current text generation call will exceed the model's predefined maximum length (2048). Depending on the model, you may observe exceptions, performance degradation, or nothing at all.
100%|██████████| 325/325 [14:51<00:00,  2.74s/it]
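
The warning in the output above shows a prompt of 3328 tokens against the model's 2048-token limit, which means some concatenated contexts were too long. One possible remedy is to add retrieved documents only while they fit a token budget. The sketch below assumes a `count_tokens` callable; here it is a whitespace split for illustration, whereas in the notebook it could be something like `lambda s: len(tokenizer(s)["input_ids"])`. The function name and the budget value are hypothetical:

```python
def fit_context(docs, count_tokens, budget):
    """Join docs in rank order, stopping before the token budget is exceeded.

    The separator is not counted, so leave some headroom in the budget.
    """
    kept, used = [], 0
    for text in docs:
        cost = count_tokens(text)
        if used + cost > budget:
            break  # lower-ranked documents are dropped
        kept.append(text)
        used += cost
    return "\n\n".join(kept)


def whitespace_tokens(text):
    # Stand-in tokenizer for the example; a real tokenizer yields more tokens.
    return len(text.split())


ranked_docs = ["a b c d", "e f g", "h i j k l"]
context = fit_context(ranked_docs, whitespace_tokens, budget=8)
print(repr(context))
```

The budget has to leave room for the instructions, the question, and `max_new_tokens=256` within the 2048-token window.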


In [None]:
from google.colab import files
files.download("pred.json")


## Main Evaluation Pipeline

`evaluate.py` is the entry point for evaluating the retrieval and generation results stored in the prediction file `pred.json`.

Required files:
- pred.json: Input prediction file containing retrieved documents and generated answers.
- evaluate.py: The main script coordinating the evaluation pipeline.
- eval_bertscore.py: Script to evaluate answer quality using BERTScore.
- eval_retrieval.so: Compiled shared object (.so) file for evaluating retrieval accuracy.

Output files:
- result.json
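
Before running the evaluation, it may be worth confirming that `pred.json` has the shape the RAG pipeline wrote: a list of entries with `question_id`, `retrieved_docs`, and `answer`. A minimal sketch; the helper name and the in-memory sample are hypothetical, and the evaluation scripts may apply further checks of their own:

```python
import json


def validate_predictions(entries):
    """Return a list of human-readable problems found in the predictions."""
    problems = []
    for i, entry in enumerate(entries):
        for key in ("question_id", "retrieved_docs", "answer"):
            if key not in entry:
                problems.append(f"entry {i}: missing '{key}'")
        if not isinstance(entry.get("retrieved_docs", []), list):
            problems.append(f"entry {i}: 'retrieved_docs' is not a list")
        if not isinstance(entry.get("answer", ""), str):
            problems.append(f"entry {i}: 'answer' is not a string")
    return problems


# Hypothetical sample; in practice the entries would be loaded from pred.json.
sample = json.loads(
    '[{"question_id": "q1", "retrieved_docs": ["d3", "d1"], "answer": "An analgesic."},'
    ' {"question_id": "q2", "answer": "Unknown."}]'
)
problems = validate_predictions(sample)
print(problems)
```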


#### Environment Check
Verify that the required shared object file `eval_retrieval.so` is present, that GCC (C compiler) is available, and that the Python version is 3.11.12 or 3.11.11.

In [None]:
!ls *.so
!python --version
!gcc --version

eval_retrieval.so
Python 3.11.12
gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
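
The same checks can be scripted so that a mismatch is caught early. A sketch using only the standard library; the function name is hypothetical, and the accepted versions are the two Python releases named in this section:

```python
import shutil
import sys
from pathlib import Path

ACCEPTED_PYTHON = {(3, 11, 11), (3, 11, 12)}


def check_environment(so_path="eval_retrieval.so"):
    """Return a dict of check name -> bool for the evaluation prerequisites."""
    return {
        "shared_object_present": Path(so_path).is_file(),
        "gcc_available": shutil.which("gcc") is not None,
        "python_version_ok": tuple(sys.version_info[:3]) in ACCEPTED_PYTHON,
    }


status = check_environment()
for name, ok in status.items():
    print(f"{name}: {'ok' if ok else 'MISSING'}")
```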



#### Environment Setup

In [None]:
! pip install cython==3.1.0
! pip install bert_score==0.3.13



In [None]:
!python evaluate.py --topk 3 --use_bertscore

Evaluating...: 100% 325/325 [16:17<00:00,  3.01s/it]
Total entries in results: 325 / 325
Average retrieval_hits: 84.23%
Average term_match_recall: 58.92%
Average bert_score: 50.11%
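
How `retrieval_hits` is computed is not visible here, since the logic lives in the compiled `eval_retrieval.so`. As a rough illustration of what a top-k hit rate typically measures, the sketch below counts a question as a hit when any of its reference documents appears among the first k retrieved IDs. The function name, the `gold` mapping, and the sample data are all hypothetical, and the actual metric may be defined differently:

```python
def hit_rate_at_k(predictions, gold, k=3):
    """Fraction of questions with at least one gold doc in the top-k results."""
    hits = 0
    for entry in predictions:
        top_k = entry["retrieved_docs"][:k]
        if any(doc_id in gold.get(entry["question_id"], ()) for doc_id in top_k):
            hits += 1
    return hits / len(predictions) if predictions else 0.0


# Hypothetical predictions and reference labels.
predictions = [
    {"question_id": "q1", "retrieved_docs": ["d3", "d1", "d7", "d2", "d9"]},
    {"question_id": "q2", "retrieved_docs": ["d4", "d5", "d6", "d8", "d0"]},
]
gold = {"q1": {"d1"}, "q2": {"d8"}}

rate = hit_rate_at_k(predictions, gold, k=3)
print(f"{rate:.2%}")
```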


In [None]:
from google.colab import files
files.download("result.json")
