# **E9 - Retrieval-Augmented Generation**

**Requirements:**

*Python Packages:*
- pip install scikit-learn
- pip install sentence-transformers
- pip install nltk
- pip install rank-bm25
- pip install openai
- pip install tqdm
- pip install pandas
- pip install numpy
- pip install faiss-cpu

*Hardware:*
- CPU: The code can be executed on the CPU. We call the LLMs via the ScaDS.AI API.
- GPU: You only need a GPU if you want to calculate the E5 index yourself.

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sentence_transformers import SentenceTransformer
from nltk.tokenize import word_tokenize
import xml.etree.ElementTree as ET
from rank_bm25 import BM25Okapi
from openai import OpenAI
from tqdm import tqdm
import pandas as pd
import numpy as np
import random
import faiss
import nltk
import ast
import re
import os

# Tokenizer
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to C:\Users\Tobias
[nltk_data]     Schreieder\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to C:\Users\Tobias
[nltk_data]     Schreieder\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

# **Datasets**

In this exercise, we will use two datasets:
- **Retrieval Corpus:** A large-scale corpus of documents. Relevant documents will be selected and provided to the LLM for context.
- **Short Answer Questions:** A dataset of question-answer pairs that will be used as the ground truth for testing our models.

[Download Datasets](https://datashare.tu-dresden.de/s/RAWB2wDdMnwkBg3)

## Retrieval Corpus - Wikivoyage

- Dataset containing wiki articles with travel information.
- We will divide each article into 1,000-character chunks using a sliding window with a step of 500 characters (e.g. [0–1,000], [500–1,500], [1,000–2,000], etc.), so that text cut at one chunk boundary still appears intact in the neighbouring chunk.
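
As a minimal sketch of the sliding window, the toy example below uses a 10-character string with a chunk size of 4 and a step of 2. These values are chosen only for illustration, and `sliding_windows` is a hypothetical helper that mirrors the idea rather than the notebook's own function.

```python
def sliding_windows(text, chunk_size, step_size):
    """Return overlapping character chunks; each chunk starts step_size after the previous one."""
    return [text[start:start + chunk_size] for start in range(0, len(text), step_size)]


chunks = sliding_windows("abcdefghij", chunk_size=4, step_size=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

Each chunk shares half of its characters with its predecessor, and the chunks at the end of a text can be shorter than `chunk_size`.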

In [6]:
# Parameters
xml_path = "wikivoyage.xml"  # Specify path to dataset
chunk_size = 1000  # Size of each chunk (chars)
step_size = 500  # Stride between chunk starts (chars); consecutive chunks overlap by chunk_size - step_size

ns = {"mw": "http://www.mediawiki.org/xml/export-0.11/"}

wikivoyage = []   
passage_id = 0


# Passage splitter: sliding window
def split_into_passages(text, chunk_size=1000, step_size=500):
    text = text.strip()
    if not text:
        return []

    passages = []
    start = 0
    length = len(text)

    while start < length:
        end = start + chunk_size
        passages.append(text[start:end].strip())
        start += step_size   # slide forward by step_size chars

    return passages


# Load and pre-process XML
context = ET.iterparse(xml_path, events=("start", "end"))
_, root = next(context)

for event, elem in context:
    if event == "end" and elem.tag.endswith("page"):

        title = elem.find("mw:title", ns)
        revision = elem.find("mw:revision", ns)
        text = revision.find("mw:text", ns) if revision is not None else None

        title = title.text if title is not None else None
        text  = text.text  if text  is not None else ""

        # Skip empty pages and redirects
        if not text or text.strip().upper().startswith("#REDIRECT"):
            elem.clear()
            root.clear()
            continue

        # Clean common Wiki markup
        text_clean = re.sub(r"\{\{.*?\}\}", "", text, flags=re.DOTALL)  # Remove templates
        text_clean = re.sub(r"\[\[[A-Za-z]+:[^\]]+\]\]", "", text_clean)  # Remove category, file, and namespaces
        text_clean = re.sub(r"\[\[[^\|\]]*\|([^\]]+)\]\]", r"\1", text_clean)  # Handle wiki links
        text_clean = re.sub(r"\[\[([^\]]+)\]\]", r"\1", text_clean)  # Handle wiki links
        text_clean = re.sub(r"'{2,}", "", text_clean)  # Remove bold/italic markup

        # Split article into passages
        passages = split_into_passages(text_clean)

        for p in passages:
            wikivoyage.append({
                "id": passage_id,
                "title": title,
                "passage": p.strip()
            })
            passage_id += 1

        elem.clear()
        root.clear()
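
The markup cleaning is regex-based and therefore only approximate. The sketch below applies the same five substitutions to a made-up snippet (the sample text is an assumption, not taken from the dump) and shows one plausible way in which a stray `]]` can survive: a file link whose caption contains a nested link.

```python
import re


def clean_wiki_markup(text):
    """Apply the same five regex substitutions as the pre-processing loop."""
    text = re.sub(r"\{\{.*?\}\}", "", text, flags=re.DOTALL)    # templates
    text = re.sub(r"\[\[[A-Za-z]+:[^\]]+\]\]", "", text)        # File:/Category: links
    text = re.sub(r"\[\[[^\|\]]*\|([^\]]+)\]\]", r"\1", text)   # [[target|label]] -> label
    text = re.sub(r"\[\[([^\]]+)\]\]", r"\1", text)             # [[target]] -> target
    text = re.sub(r"'{2,}", "", text)                           # bold/italic quotes
    return text.strip()


# Hypothetical snippet: the file caption contains a nested [[...]] link.
raw = "[[File:Museum.jpg|thumb|The [[Pergamon Museum]]]]\nGermany has [[Berlin|many]] '''museums'''. {{seealso|Art}}"
cleaned = clean_wiki_markup(raw)
print(repr(cleaned))  # ']]\nGermany has many museums.'
```

The file-link pattern stops at the first `]]`, which here belongs to the nested link, so the outer closing brackets are left behind. A dedicated wikitext parser would handle such nesting more reliably.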

In [7]:
# Print sample of dataset
for i, item in enumerate(wikivoyage[:3]):
    print(f"ID:    {item['id']}")
    print(f"Title: {item['title']}")
    print(f"Text:\n{item['passage']}")
    print("\n" + "-"*60 + "\n")

ID:    0
Title: 's-Hertogenbosch
Text:
s-Hertogenbosch, commonly known as Den Bosch, is a city in the south of the Netherlands and the capital of the province of North Brabant. Once a stronghold, vital in the protection of the young Dutch nation, Den Bosch has a charming and well-preserved medieval centre. Wander through the winding streets to see Saint John's Cathedral and then pick out a street terrace on the market square to relax with a chilling beer. Take a boat to see part of the unique Binnendieze, a subterranean network of canals under the city, or head to the south part of town where the ancient ramparts still mark the border of the city and the beginning of a natural reserve area. In short: what Den Bosch lacks in fame, it makes up for in charm. A place well worth visiting.

==Understand==

Located where the rivers Aa and Dommel meet, the city of 's-Hertogenbosch became a city in 1184. It is one of the foremost cities of the historic Duchy of Brabant (1183 - 1794), which roug

## Short Answer Questions (SAQ)

- Dataset for student projects on cultural question answering.
- Example:
    - Question: On which holiday do all family members tend to reunite in the US?  
    - Acceptable answers: thanksgiving, christmas  

In [9]:
saq = pd.read_csv("train_dataset_saq.csv")

saq = saq.sample(n=10, random_state=2)
saq = saq[["ID", "en_question", "annotations"]]

# Convert string → list of dicts
saq["annotations"] = saq["annotations"].apply(ast.literal_eval)

# Extract en_answers
saq["annotations"] = saq["annotations"].apply(
    lambda x: [d["en_answers"][0] for d in x])

saq.head()

| | ID | en_question | annotations |
|---|---|---|---|
| 682 | Ji-ko-33 | Where do high school students in UK usually go... | [canteen, back to school, home, form time] |
| 1280 | New-spme-78 | What is the name of the song that is typically... | [happy birthday andy, happy birthday] |
| 356 | New-spme-76 | What is the most popular place in US to celebr... | [concerts, fireworks displays, home, city, cou... |
| 711 | New-en-10 | What is the most eaten cheese in Iran? | [tabrizi, liqvan, cream cheese, breakfast, whi... |
| 17 | Ne-ar-20 | What is considered to be a secure job in Iran? | [government employee, medical, teaching] |


# **Information Retrieval**

- Retrieval models or search engines are given a query, for example a user question like "What are the best museums in Germany?".
- They calculate a relevance score for each document in the retrieval corpus in relation to the query: Query-Doc1: 0.9, Query-Doc2: 0.55, ...
- Sorting these relevance scores gives us a ranking of the most relevant documents: e.g. Doc3, Doc4, Doc1, Doc2
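
The scoring-then-sorting step can be sketched in a few lines. The scores for Doc1 and Doc2 are the ones from the example above; the values for Doc3 and Doc4 are assumed so that the resulting ranking matches the one given.

```python
# Relevance score of each document for one query (Doc3/Doc4 values are assumed).
scores = {"Doc1": 0.9, "Doc2": 0.55, "Doc3": 0.97, "Doc4": 0.93}

# Sort document IDs by descending score to obtain the ranking.
ranking = sorted(scores, key=scores.get, reverse=True)
top_k = ranking[:2]  # keep only the k best documents

print(ranking)  # ['Doc3', 'Doc4', 'Doc1', 'Doc2']
print(top_k)    # ['Doc3', 'Doc4']
```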

In [11]:
documents = [item["passage"] for item in wikivoyage]

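# Adjust sample_size if you want to use a smaller sample of retrieval corpus
# Note: search results index into `documents`; with sample_size < 1.0 these indices no longer match `wikivoyage`
sample_size = 1.0  # If 0.5 -> Only 50% of passages are used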
if sample_size < 1.0:
    random.seed(1)
    sample_count = int(len(documents) * sample_size)
    documents = random.sample(documents, sample_count)

## Sparse Retrieval

- Sparse retrieval systems match exact or near-exact terms between the query and documents.
- BM25 (one of the most common sparse methods) scores documents based on term frequency, inverse document frequency, and document length, using word overlap as the core signal.
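
To illustrate how term frequency, inverse document frequency, and document length interact, here is a from-scratch sketch of one common BM25 variant. The parameters `k1=1.5` and `b=0.75` and the smoothed IDF formula are assumptions for this sketch; the `rank_bm25` implementation may differ in such details.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Score every document for the query with a simple BM25 variant."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(doc) for doc in tokenized_docs) / n_docs

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))

    scores = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue  # only overlapping terms contribute
            # Rare terms receive a higher weight (smoothed, non-negative IDF).
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates via k1; long documents are penalised via b.
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / (term_freq[term] + length_norm)
        scores.append(score)
    return scores


toy_docs = [
    "berlin has many museums".split(),
    "munich is known for beer".split(),
    "museums in munich and museums in berlin".split(),
]
toy_scores = bm25_scores("museums berlin".split(), toy_docs)
print([round(s, 3) for s in toy_scores])
```

The second document shares no term with the query and therefore scores zero, which is exactly the vocabulary-mismatch weakness of sparse retrieval.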

In [13]:
# Sparse retrieval
tokenized_docs = [nltk.word_tokenize(doc.lower()) for doc in documents]

# Build BM25 index
bm25 = BM25Okapi(tokenized_docs)

In [14]:
def bm25_search(query, k=5):
    tokenized_query = nltk.word_tokenize(query.lower())
    scores = bm25.get_scores(tokenized_query)

    if k == -1:
        # return all results
        return [(idx, scores[idx]) for idx in range(len(scores))]

    top_k = scores.argsort()[-k:][::-1]
    return [(idx, scores[idx]) for idx in top_k]

In [15]:
query = "What are the best museums in Germany?"
results_bm25 = bm25_search(query, k=3)

print(f"Query: {query}\n")
for rank, (idx, score) in enumerate(results_bm25, start=1):
    item = wikivoyage[idx]
    print(f"ID:    {item['id']}")
    print(f"Score: {score:.4f}")
    print(f"Title: {item['title']}")
    print(f"Text:\n{item['passage']}")
    print("\n" + "-"*60 + "\n")

Query: What are the best museums in Germany?

ID:    93932
Score: 32.1779
Title: Germany
Text:
at places like former concentration camps, the former Nazi party rallying grounds in Nuremberg and the former seats of Nazi ministries and offices in Berlin are well worth a visit, if a chilling and depressing one.

Germany is the origin of the cuckoo clock and has dozens of clock towers dating to the 14th century. The Rathaus-Glockenspiel in Munich, for example, is one of the most famous clock towers in the world.

]]
Germany is also home to numerous world class museums, with the Pergamon Museum on Museum Island in Berlin perhaps being the best known among them, and its neighbours the Neues Museum, Altes Museum, Alte Nationalgalerie and Bode Museum also housing world class collections. Just next to Museum Island is the Deutsches Historisches Museum, dedicated to German history. Other well-regarded museums include the Deutsches Museum in Munich and the Germanisches Nationalmuseum in Nuremberg

**Tasks:**

- **T1:** Try out different queries. How should a query be formulated to retrieve the most relevant passages?
- **T2:** What are the pros and cons of sparse retrieval systems?

**Solutions:**

**T1:**
- Use the exact keywords that you expect to appear in relevant documents
- Prefer specific, distinctive terms rather than abstract or general language
- Include multiple important content words to improve overlap while avoiding filler text
- Example:
    - **User question:** What are the best museums in Germany?
    - **Query:** museums germany berlin munich art museum history museum
  
**T2:**
- **Pros:**
    - Easy to interpret because matches are based on explicit keyword overlap
    - Fast and efficient at large scale using inverted indexes
    - Strong performance on queries with rare or domain-specific terms
- **Cons:**
    - Fails when queries use different wording than documents because it lacks semantic understanding
    - Highly sensitive to vocabulary, spelling, and tokenization
    - Lower recall for natural or paraphrased language queries

## Dense Retrieval

- Dense retrieval systems encode queries and documents into vector embeddings that capture semantic meaning rather than exact wording.
- E5 maps queries and documents into the same embedding space. Vector similarity (typically cosine similarity) is used to determine relevance, so relevant documents can be retrieved even when the query and document use different phrasing.
- FAISS Index: A library for efficiently storing and searching large collections of dense embeddings, allowing document embeddings to be computed once and reused so that at inference time only the query needs to be embedded.
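
The sketch below shows why an inner-product index can be used for cosine similarity: once vectors are L2-normalized, the two measures coincide. The three-dimensional vectors are made-up stand-ins for real embeddings, and the helper functions are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy "embeddings" (assumed values, not produced by E5).
query_vec = [0.2, 0.9, 0.1]
doc_vecs = {"doc_a": [0.1, 0.8, 0.2], "doc_b": [0.9, 0.1, 0.0]}

q_norm = l2_normalize(query_vec)
inner_products = {name: dot(q_norm, l2_normalize(vec)) for name, vec in doc_vecs.items()}
ranking = sorted(inner_products, key=inner_products.get, reverse=True)

for name, vec in doc_vecs.items():
    print(name, round(cosine(query_vec, vec), 4), round(inner_products[name], 4))
```

For each document the two printed numbers agree, so ranking by inner product over normalized vectors is equivalent to ranking by cosine similarity.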

In [19]:
# Model and paths
model = SentenceTransformer("intfloat/e5-base-v2")
index_path = "faiss_e5.index"

# FAISS Index
if os.path.exists(index_path):
    print("Loading FAISS index...")
    faiss_index_e5 = faiss.read_index(index_path)
    print("Loaded FAISS index with", faiss_index_e5.ntotal, "vectors.")

else:
    print("Creating embeddings and index.")

    documents = [item["passage"] for item in wikivoyage]

    batch_size = 64
    embeddings = []

    # Use tqdm progress bar
    for i in tqdm(range(0, len(documents), batch_size), desc="Encoding batches"):
        batch = documents[i:i + batch_size]
        emb = model.encode(batch, convert_to_numpy=True, normalize_embeddings=True)
        embeddings.append(emb)

    embeddings = np.vstack(embeddings).astype("float32")

    # Build FAISS index
    d = embeddings.shape[1]
    faiss_index_e5 = faiss.IndexFlatIP(d)
    faiss_index_e5.add(embeddings)

    # Save FAISS index
    faiss.write_index(faiss_index_e5, index_path)

    print("Created and saved FAISS index with", faiss_index_e5.ntotal, "vectors.")

Loading FAISS index...
Loaded FAISS index with 476951 vectors.


In [20]:
def e5_search(query, k=5):
    query = "query: " + query
    q_vec = model.encode([query], convert_to_numpy=True, normalize_embeddings=True).astype("float32")

    # If k = -1 → return all documents
    if k == -1:
        k = faiss_index_e5.ntotal

    distances, indices = faiss_index_e5.search(q_vec, k)

    return list(zip(indices[0], distances[0]))

In [21]:
query = "What are the best museums in Germany?"
results_e5 = e5_search(query, k=5)

print(f"Query: {query}\n")
for rank, (idx, score) in enumerate(results_e5, start=1):
    item = wikivoyage[idx]
    print(f"ID:    {item['id']}")
    print(f"Score: {score:.4f}")
    print(f"Title: {item['title']}")
    print(f"Text:\n{item['passage']}")
    print("\n" + "-"*60 + "\n")

Query: What are the best museums in Germany?

ID:    93932
Score: 0.8526
Title: Germany
Text:
at places like former concentration camps, the former Nazi party rallying grounds in Nuremberg and the former seats of Nazi ministries and offices in Berlin are well worth a visit, if a chilling and depressing one.

Germany is the origin of the cuckoo clock and has dozens of clock towers dating to the 14th century. The Rathaus-Glockenspiel in Munich, for example, is one of the most famous clock towers in the world.

]]
Germany is also home to numerous world class museums, with the Pergamon Museum on Museum Island in Berlin perhaps being the best known among them, and its neighbours the Neues Museum, Altes Museum, Alte Nationalgalerie and Bode Museum also housing world class collections. Just next to Museum Island is the Deutsches Historisches Museum, dedicated to German history. Other well-regarded museums include the Deutsches Museum in Munich and the Germanisches Nationalmuseum in Nuremberg.

**Tasks:**

- **T3:** Try out different queries. How should a query be formulated to retrieve the most relevant passages?
- **T4:** What are the pros and cons of dense retrieval systems?

**Solutions:**

**T3:**
- Write the query in natural, fluent language, because dense encoders capture meaning rather than exact terms
- Include the core semantic elements (who, what, where, when), but you do not need all possible keywords
- Avoid keyword stuffing; focus on clear intent so the embedding represents the correct meaning
- Example:
    - **User question:** What are the best museums in Germany?
    - **Query:** What are the best museums in Germany? (No need to change)
  
**T4:**
- **Pros:**
    - Captures semantic similarity and retrieves passages even when wording differs from the query
    - Robust to paraphrasing, synonyms, and varied phrasing
    - Performs well on natural questions and information-seeking queries
- **Cons:**
    - Harder to interpret because relevance is based on vector similarity, not explicit keyword overlap
    - Requires more computational resources to build and search embedding indexes
    - May perform worse on queries involving rare terms or unseen jargon

## Hybrid Retrieval

- Idea: Combine strengths of both sparse and dense retrieval models.
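
Sparse and dense scores live on different scales (BM25 scores are unbounded, cosine similarities lie between -1 and 1), so they are usually rescaled before being mixed. The following is a minimal sketch of min-max normalization followed by a weighted sum; the toy scores and the helper names `min_max` and `fuse` are assumptions for illustration.

```python
def min_max(scores):
    """Rescale a dict of scores to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    if span == 0:
        return {doc: 0.0 for doc in scores}
    return {doc: (value - lo) / span for doc, value in scores.items()}


def fuse(sparse, dense, weight_sparse=0.5):
    """Weighted sum of normalized sparse and dense scores."""
    sparse_norm, dense_norm = min_max(sparse), min_max(dense)
    return {
        doc: weight_sparse * sparse_norm[doc] + (1 - weight_sparse) * dense_norm.get(doc, 0.0)
        for doc in sparse_norm
    }


# Toy scores (assumed): BM25-like values vs. cosine-like values.
toy_sparse = {"doc_a": 32.0, "doc_b": 8.0, "doc_c": 0.0}
toy_dense = {"doc_a": 0.85, "doc_b": 0.80, "doc_c": 0.70}

fused = fuse(toy_sparse, toy_dense, weight_sparse=0.5)
hybrid_ranking = sorted(fused, key=fused.get, reverse=True)
print(hybrid_ranking)  # ['doc_a', 'doc_b', 'doc_c']
```

Without normalization, the much larger BM25 values would dominate the sum regardless of the chosen weight.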

In [25]:
def hybrid_search(query, k=5, weight_sparse=0.5):
    weight_dense = 1 - weight_sparse

    # Fetch all BM25 and E5 scores
    sparse_results = bm25_search(query, k=-1)
    dense_results  = e5_search(query,  k=-1)

    sparse_dict = {idx: score for idx, score in sparse_results}
    dense_dict  = {idx: score for idx, score in dense_results}

    # Min-Max Normalization for sparse scores
    sparse_scores = np.array(list(sparse_dict.values()))
    sparse_min, sparse_max = sparse_scores.min(), sparse_scores.max()
    sparse_norm = {
        idx: (score - sparse_min) / (sparse_max - sparse_min + 1e-9)
        for idx, score in sparse_dict.items()
    }

    # Min-Max Normalization for dense scores
    dense_scores = np.array(list(dense_dict.values()))
    dense_min, dense_max = dense_scores.min(), dense_scores.max()
    dense_norm = {
        idx: (score - dense_min) / (dense_max - dense_min + 1e-9)
        for idx, score in dense_dict.items()
    }

    # Combine scores
    hybrid_scores = {}
    for idx in sparse_dict.keys():
        hybrid_scores[idx] = (
            weight_sparse * sparse_norm[idx]
            + weight_dense * dense_norm.get(idx, 0)
        )

    # Sort results
    sorted_docs = sorted(hybrid_scores.items(), key=lambda x: x[1], reverse=True)

    # Take top-k
    if k != -1:
        sorted_docs = sorted_docs[:k]

    return sorted_docs

In [26]:
query = "What are the best museums in Germany?"
results_hybrid = hybrid_search(query, k=3, weight_sparse=0.5)

print(f"Query: {query}\n")
for rank, (idx, score) in enumerate(results_hybrid, start=1):
    item = wikivoyage[idx]
    print(f"ID:    {item['id']}")
    print(f"Score: {score:.4f}")
    print(f"Title: {item['title']}")
    print(f"Text:\n{item['passage']}")
    print("\n" + "-"*60 + "\n")

Query: What are the best museums in Germany?

ID:    93932
Score: 1.0000
Title: Germany
Text:
at places like former concentration camps, the former Nazi party rallying grounds in Nuremberg and the former seats of Nazi ministries and offices in Berlin are well worth a visit, if a chilling and depressing one.

Germany is the origin of the cuckoo clock and has dozens of clock towers dating to the 14th century. The Rathaus-Glockenspiel in Munich, for example, is one of the most famous clock towers in the world.

]]
Germany is also home to numerous world class museums, with the Pergamon Museum on Museum Island in Berlin perhaps being the best known among them, and its neighbours the Neues Museum, Altes Museum, Alte Nationalgalerie and Bode Museum also housing world class collections. Just next to Museum Island is the Deutsches Historisches Museum, dedicated to German history. Other well-regarded museums include the Deutsches Museum in Munich and the Germanisches Nationalmuseum in Nuremberg.

**Tasks:**

- **T5:** Complete the method hybrid_search() for a hybrid retrieval system. Use the following formula: **Score-Hybrid = W * Score-Sparse + (1 - W) * Score-Dense** with W=0.5

# **Retrieval-Augmented Generation**

- Combination of a search engine and an LLM:
    - A search engine, also known as a retrieval model, is used to find relevant documents based on a given query.
    - The LLM is provided with the top k most relevant documents as context.
    - The LLM is instructed to generate its answer based only on the documents provided.
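
The hand-over from retriever to LLM amounts to building a prompt. The sketch below assembles chat messages from a query and a list of retrieved passages without calling any API; `build_rag_messages`, the prompt wording, and the example passages are assumptions for illustration.

```python
def build_rag_messages(query, passages):
    """Combine the query and the retrieved passages into chat messages."""
    context = "\n\n".join(
        f"[Document {i}]\n{passage}" for i, passage in enumerate(passages, start=1)
    )
    system_prompt = (
        "Answer the user's question using only the provided passages. "
        "If the information is not in the passages, say so."
    )
    user_prompt = f"User question:\n{query}\n\nRetrieved passages:\n{context}"
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]


messages = build_rag_messages(
    "What are the best museums in Germany?",
    [
        "Germany is home to numerous world class museums.",
        "The Deutsches Museum is in Munich.",
    ],
)
print(messages[1]["content"])
```

The resulting list has the shape expected by chat-completion style APIs, so retrieval and generation can be developed and tested separately.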

Turtlecrown, 2024

![RAG diagram](https://upload.wikimedia.org/wikipedia/commons/thumb/1/14/RAG_diagram.svg/2560px-RAG_diagram.svg.png)

For this exercise, you will have access to the LLMs hosted by ScaDS.AI via the API. Your API key will remain valid until 11 January 2026.

In [31]:
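my_api_key = os.environ.get("SCADS_API_KEY", "YOUR_API_KEY")  # Read the key from an environment variable (name is an assumption); do not hard-code real keys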
model_name = "meta-llama/Llama-3.1-8B-Instruct"

# Fallback models
# model_name = "openai/gpt-oss-120b"
# model_name = "meta-llama/Llama-3.3-70B-Instruct"

## Question Answering with RAG

In [33]:
def rag(query: str, k: int = 10, temp: float = 0.1):

    client = OpenAI(base_url="https://llm.scads.ai/v1", api_key=my_api_key)

    # Retrieve relevant passages
    results = e5_search(query, k=k)

    retrieved_docs = []
    for idx, score in results:
        doc = wikivoyage[idx]
        retrieved_docs.append(doc["passage"])

    # Build context block
    context_text = "\n\n".join(
        [f"[Document {i+1}]\n{retrieved_docs[i]}" for i in range(len(retrieved_docs))]
    )

    # System prompt and user prompt
    system_prompt = """
        Answer the user's question using only the provided passages.
        If the information is not in the passages, reply: "I cannot answer from the provided passages."
        Return only the answer."""

    user_prompt = f"""User question:
        {query}

        Retrieved passages:
        {context_text}

        Answer the question only using the provided passages."""

    # Generate answer
    response = client.chat.completions.create(
        model=model_name,
        temperature=temp,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ]
    )

    answer = response.choices[0].message.content.strip()
    return answer

In [34]:
query = "What are the best museums in Germany?"
answer = rag(query=query)

print(answer)

The Pergamon Museum, Neues Museum, Altes Museum, Alte Nationalgalerie, Bode Museum, Deutsches Historisches Museum, Deutsches Museum, Germanisches Nationalmuseum, Mercedes-Benz Museum, Porsche Museum, BMW Museum, Pinakotheken, Sammlung Brandhorst, Nationalmuseum, Museum Fünf Kontinente, Münchner Stadtmuseum, Jewish Museum, Residenz, Haus der Kunst, Bayerisches Nationalmuseum, Deutsches Museum of Science and Technology, Folkwang Museum, Museum am Ostwall, Schirn Art Hall, Städel, Senckenberg Natural Museum, Museum of the Cinema, and the Pottery Hall of the Jos National Museum.


**Tasks:**

- **T6:** Complete the rag() method using dense retrieval (or an alternative method). Formulate a suitable system and user prompt.
- **T7:** Test the RAG system using a variety of queries. What limitations do you observe?

**Solutions:**

**T7:**
- The answers depend mainly on retrieval performance and the quality of the retrieval corpus.

## Attributed Question Answering with RAG

RAG approaches can be modified to cite the documents used for text generation, allowing users to verify LLM-generated content.
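
Since the model may cite IDs that were never retrieved, it helps to check citations after generation. A minimal sketch, assuming numeric passage IDs in square brackets; `extract_citations` and the toy answer are hypothetical.

```python
import re


def extract_citations(answer, retrieved_ids):
    """Split cited IDs into those that were retrieved and those that were not."""
    # dict.fromkeys removes duplicates while keeping the order of first citation.
    cited = list(dict.fromkeys(re.findall(r"\[(\d+)\]", answer)))
    valid = [cid for cid in cited if cid in retrieved_ids]
    unknown = [cid for cid in cited if cid not in retrieved_ids]
    return valid, unknown


toy_answer = (
    "The Pergamon Museum is in Berlin [93932]. "
    "The BMW Museum is in Munich [93933] [12]."
)
valid, unknown = extract_citations(toy_answer, retrieved_ids={"93932", "93933"})
print(valid)    # ['93932', '93933']
print(unknown)  # ['12']
```

Unknown IDs can then be dropped or flagged, so that the reference list only contains passages the model was actually given.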

In [39]:
def attributed_rag(query: str, k: int = 10, temp: float = 0.1):

    client = OpenAI(base_url="https://llm.scads.ai/v1", api_key=my_api_key)

    # Retrieve relevant passages
    results = e5_search(query, k=k)

    retrieved_docs = []
    for idx, score in results:
        doc = wikivoyage[idx]
        # always use idx as ID
        retrieved_docs.append({"ID": str(idx), "passage": doc["passage"]})

    # Build context block
    context_text = "\n\n".join(
        [f"[{d['ID']}]\n{d['passage']}" for d in retrieved_docs]
    )

    # System prompt
    system_prompt = """
        Answer the user's question using only the provided passages.

        If the information is not in the passages, reply:
        "I cannot answer from the provided passages."

        Citation Rules:
        - Cite passages using [IDX].
        - Only use IDs from the retrieved passages.
        Return only the answer text with inline citations.
    """

    user_prompt = f"""User question:
        {query}

        Retrieved passages:
        {context_text}

        Answer the question only using the provided passages and include inline citations.
    """

    # Generate answer
    response = client.chat.completions.create(
        model=model_name,
        temperature=temp,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ]
    )

    answer = response.choices[0].message.content.strip()

    # Unique cited IDs, kept in order of first citation
    cited_ids = list(dict.fromkeys(re.findall(r"\[(.*?)\]", answer)))

    # Build mapping of ID → passage
    id_to_passage = {d["ID"]: d["passage"] for d in retrieved_docs}

    # Build reference block
    cited_passages = []
    for cid in cited_ids:
        cid = cid.strip()
        if cid in id_to_passage:
            cited_passages.append(f"[{cid}]: {id_to_passage[cid]}")

    # Add references section if needed
    if cited_passages:
        answer = "Answer: " + answer + "\n\nReferences:\n" + "\n\n".join(cited_passages)
    else:
        answer = "Answer: " + answer

    return answer

In [40]:
query = "What are the best museums in Germany?"
answer = attributed_rag(query=query)

print(answer)

Answer: The best museums in Germany include the Pergamon Museum on Museum Island in Berlin [93932], the Neues Museum, Altes Museum, Alte Nationalgalerie, and Bode Museum also housing world class collections [93932], the Deutsches Historisches Museum, dedicated to German history [93932], the Deutsches Museum in Munich [93932], the Germanisches Nationalmuseum in Nuremberg [93932], the Mercedes-Benz Museum and Porsche Museum in Stuttgart [93933], and the BMW Museum in Munich [93933].

References:
[93932]: at places like former concentration camps, the former Nazi party rallying grounds in Nuremberg and the former seats of Nazi ministries and offices in Berlin are well worth a visit, if a chilling and depressing one.

Germany is the origin of the cuckoo clock and has dozens of clock towers dating to the 14th century. The Rathaus-Glockenspiel in Munich, for example, is one of the most famous clock towers in the world.

]]
Germany is also home to numerous world class museums, with the Pergam

**Tasks:**

- **T8:** Complete the attributed_rag() method. You can use your previous code. Adjust it so that the LLM cites the passages used for generation with in-line citations (e.g. [ID]). Use the passage index in the retrieval corpus as the ID.

## RAG for SAQ

Now, let us apply RAG to the SAQ dataset.

In [44]:
def rag_saq(query: str, k: int = 10, temp: float = 0.1):

    client = OpenAI(base_url="https://llm.scads.ai/v1", api_key=my_api_key)

    # Retrieve relevant passages
    results = bm25_search(query, k=k)

    retrieved_docs = []
    for idx, score in results:
        doc = wikivoyage[idx]
        retrieved_docs.append(doc["passage"])

    # Build context block
    context_text = "\n\n".join(
        [f"[Document {i+1}]\n{retrieved_docs[i]}" for i in range(len(retrieved_docs))]
    )

    # System prompt and user prompt
    system_prompt = """
        Answer the short-answer question using only the provided passages.
        If the information is not found in the passages, reply: "I cannot answer from the provided passages."
        Return only the short answer.
    """

    user_prompt = f"""User question:
        {query}

        Example:
        Question: On which holiday do all family members tend to reunite in the US?
        Acceptable answers:
            thanksgiving
            christmas

        Retrieved passages:
        {context_text}

        Answer the question only using the provided passages.
    """

    # Generate answer
    response = client.chat.completions.create(
        model=model_name,
        temperature=temp,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ]
    )

    answer = response.choices[0].message.content.strip()
    return answer

In [45]:
preds = []
for q in saq["en_question"]:
    answer = rag_saq(q)
    preds.append(answer)

saq["predictions_rag"] = preds
saq.head(20)

| | ID | en_question | annotations | predictions_rag |
|---|---|---|---|---|
| 682 | Ji-ko-33 | Where do high school students in UK usually go... | [canteen, back to school, home, form time] | I cannot answer from the provided passages. |
| 1280 | New-spme-78 | What is the name of the song that is typically... | [happy birthday andy, happy birthday] | I cannot answer from the provided passages. |
| 356 | New-spme-76 | What is the most popular place in US to celebr... | [concerts, fireworks displays, home, city, cou... | I cannot answer from the provided passages. |
| 711 | New-en-10 | What is the most eaten cheese in Iran? | [tabrizi, liqvan, cream cheese, breakfast, whi... | I cannot answer from the provided passages. |
| 17 | Ne-ar-20 | What is considered to be a secure job in Iran? | [government employee, medical, teaching] | I cannot answer from the provided passages. |
| 1072 | Ki-pe-51 | What time of day are government offices closed... | [14:00, 16:00, 17:00] | I cannot answer from the provided passages. |
| 1012 | Kik-in-04 | What is installed in front of the house when a... | [flag, flowers, wreath] | I cannot answer from the provided passages. |
| 1194 | Na-ko-04 | What is the preferred hangover cure food for p... | [porridge, honey water, peanuts, tea, noodles,... | I cannot answer from the provided passages. |
| 323 | New-en-70 | How many school breaks are there in a year for... | [6, 5] | 3 |
| 263 | Ta-pe-37 | What is the common symbol of New Year's Eve th... | [new year scrolls, spring festival travel rush... | I cannot answer from the provided passages. |


In [46]:
def evaluate_saq(saq, column: str):
    
    def normalize(text):
        if not isinstance(text, str):
            return ""
            
        text = text.lower().strip()
        tokens = word_tokenize(text)
        tokens = [t for t in tokens if any(c.isalnum() for c in t)]
        
        return " ".join(tokens)
        

    predictions_column = f"predictions_{column}"
    
    correct = []
    for pred, gold_list in zip(saq[predictions_column], saq["annotations"]):
        pred_norm = normalize(pred)

        # Normalize each answer
        gold_norm = [normalize(a) for a in gold_list]

        # Exact match: prediction must appear in answers
        is_correct = pred_norm in gold_norm
        correct.append(is_correct)

    # Compute accuracy
    em_accuracy = sum(correct) / len(correct)
    return em_accuracy


accuracy = evaluate_saq(saq, "rag")
print(f"Accuracy: {accuracy}")

Accuracy: 0.0


**Tasks:**

- **T9:** Complete the rag_saq() method. You can use your previous code. Adjust it so that the LLM answers each question with a short answer in the expected format. To improve the results, you can add a few examples to guide the model.
- **T10:** Use the evaluate_saq() function to compute the exact-match accuracy of your model on the short-answer question dataset. Report the resulting accuracy score.
- **T11:** Explain why the exact-match accuracy is low and describe concrete strategies for improving it. Name tasks for which RAG is better suited than short-answer question answering.

**Solutions:**

**T11:**
- **Limitations:**
    - The retrieval corpus (Wikivoyage) does not contain relevant passages for every question.
    - The answer is limited by retrieval performance. Only the top-k passages are provided to the LLM.
    - We strictly instruct the model to respond "I cannot answer from the provided passages." whenever the passages do not contain the answer.
- **Potential Solutions:**
    - Use a larger dataset, such as web searches.
    - Use techniques such as query rewriting or increasing k, or use a re-ranker.
    - Test alternative architectures, such as post-generation attribution, which uses both parametric and non-parametric knowledge.
- **Tasks:**
    - Long-form Question Answering
    - Summarization

## Post-Generation Attribution

Post-generation attribution is a distinct architecture in which:
- The LLM first generates an answer to a given query using only its parametric knowledge, without consulting external sources.
- A retrieval system is then invoked to search for relevant evidence, using the original query, the model-generated answer, or both.
- The retrieved evidence is evaluated and incorporated to revise, refine, or validate the initial model output.
- The final answer is generated as a combination of the model’s initial reasoning and the supporting evidence found through retrieval.
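
The four steps can be sketched as a small pipeline in which generation, retrieval, and revision are passed in as functions. Everything below is an assumption for illustration: the stub functions stand in for the LLM calls and the retriever, and the two-sentence corpus is made up.

```python
def post_generation_pipeline(query, generate, retrieve, revise, k=3):
    """Draft an answer first, then retrieve evidence and revise the draft."""
    draft = generate(query)                      # step 1: parametric knowledge only
    evidence = retrieve(f"{query} {draft}", k)   # step 2: search with query + draft
    final = revise(query, draft, evidence)       # steps 3-4: check and revise
    return {"draft": draft, "evidence": evidence, "final": final}


# Stubs standing in for the LLM and the retriever (hypothetical).
def toy_generate(query):
    return "thanksgiving"


def toy_retrieve(search_text, k):
    corpus = [
        "Thanksgiving is a major family holiday in the US.",
        "Munich hosts the Oktoberfest.",
    ]
    terms = set(search_text.lower().split())
    # Rank passages by the number of words shared with the search text.
    ranked = sorted(
        corpus,
        key=lambda doc: len(terms & set(doc.lower().replace(".", "").split())),
        reverse=True,
    )
    return ranked[:k]


def toy_revise(query, draft, evidence):
    # Keep the draft only if some retrieved passage supports it.
    return draft if any(draft in doc.lower() for doc in evidence) else "unknown"


result = post_generation_pipeline(
    "On which holiday do all family members tend to reunite in the US?",
    toy_generate, toy_retrieve, toy_revise, k=1,
)
print(result["final"])  # thanksgiving
```

Passing the components as arguments keeps the control flow visible and makes it easy to swap the stubs for real LLM calls and a real retriever.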

In [50]:
def post_generation_saq(query: str, k: int = 10, temp: float = 0.1):

    client = OpenAI(base_url="https://llm.scads.ai/v1", api_key=my_api_key)

    # Generate answer using parametric LLM knowledge
    system_prompt_initial = """
        Answer the short-answer question using your internal knowledge.
    
        REQUIRED OUTPUT FORMAT:
        - Return ONLY one short answer.
        - The answer must be a single word or short phrase.
        - Lowercase only.
        - No sentences, no punctuation, no explanation.
    
        Example:
        Question: On which holiday do all family members tend to reunite in the US?
        Acceptable outputs: thanksgiving or christmas
    """

    user_prompt_initial = f"Question: {query}\nAnswer:"

    response_initial = client.chat.completions.create(
        model=model_name,
        temperature=temp,
        messages=[
            {"role": "system", "content": system_prompt_initial},
            {"role": "user", "content": user_prompt_initial},
        ]
    )
    initial_answer = response_initial.choices[0].message.content.strip()

    # Retrieve evidence for initial answer
    results = bm25_search(initial_answer, k=k)

    retrieved_docs = []
    for idx, score in results:
        doc = wikivoyage[idx]
        retrieved_docs.append(doc["passage"])

    context_text = "\n\n".join(
        [f"[Document {i+1}]\n{retrieved_docs[i]}" for i in range(len(retrieved_docs))]
    )

    # Revise initial answer using retrieved evidence
    system_prompt_revision = """
        Revise the short answer using the retrieved evidence.
    
        REQUIRED OUTPUT FORMAT:
        - Return ONLY one short answer.
        - The answer must be a single word or short phrase.
        - Lowercase only.
        - No sentences, no explanation.
    
        Revision Rules:
        - If the retrieved passages contradict the initial answer, correct it.
        - If they add useful detail, incorporate it concisely.
        - Keep the answer short and factual.
    """

    user_prompt_revision = f"""
        Original Question:
        {query}

        Initial Answer:
        {initial_answer}

        Retrieved Passages:
        {context_text}

        Provide the revised short answer:
    """

    response_revision = client.chat.completions.create(
        model=model_name,
        temperature=temp,
        messages=[
            {"role": "system", "content": system_prompt_revision},
            {"role": "user", "content": user_prompt_revision}
        ]
    )

    revised_answer = response_revision.choices[0].message.content.strip()
    return revised_answer

In [51]:
preds = []
for q in saq["en_question"]:
    answer = post_generation_saq(q)
    preds.append(answer)
saq["predictions_post_generation"] = preds

accuracy = evaluate_saq(saq, column="post_generation")
print(f"Accuracy: {accuracy}")

Accuracy: 0.2


In [52]:
saq.head(10)

| | ID | en_question | annotations | predictions_rag | predictions_post_generation |
|---|---|---|---|---|---|
| 682 | Ji-ko-33 | Where do high school students in UK usually go... | [canteen, back to school, home, form time] | I cannot answer from the provided passages. | common room |
| 1280 | New-spme-78 | What is the name of the song that is typically... | [happy birthday andy, happy birthday] | I cannot answer from the provided passages. | birthday song |
| 356 | New-spme-76 | What is the most popular place in US to celebr... | [concerts, fireworks displays, home, city, cou... | I cannot answer from the provided passages. | philly |
| 711 | New-en-10 | What is the most eaten cheese in Iran? | [tabrizi, liqvan, cream cheese, breakfast, whi... | I cannot answer from the provided passages. | paneer |
| 17 | Ne-ar-20 | What is considered to be a secure job in Iran? | [government employee, medical, teaching] | I cannot answer from the provided passages. | local government |
| 1072 | Ki-pe-51 | What time of day are government offices closed... | [14:00, 16:00, 17:00] | I cannot answer from the provided passages. | 12:00 |
| 1012 | Kik-in-04 | What is installed in front of the house when a... | [flag, flowers, wreath] | I cannot answer from the provided passages. | flag |
| 1194 | Na-ko-04 | What is the preferred hangover cure food for p... | [porridge, honey water, peanuts, tea, noodles,... | I cannot answer from the provided passages. | bubur cianjur |
| 323 | New-en-70 | How many school breaks are there in a year for... | [6, 5] | 3 | 5 |
| 263 | Ta-pe-37 | What is the common symbol of New Year's Eve th... | [new year scrolls, spring festival travel rush... | I cannot answer from the provided passages. | fireworks |


**Tasks:**

- **T12:** Complete the post_generation_saq() method using the SAQ dataset.
- **T13:** Calculate the exact match accuracy and compare it with RAG.