# RAG Pipeline Exercise

In this exercise you will build and **compare two simple Retrieval-Augmented Generation (RAG) pipelines**.

You will work with a small collection of PDF documents (e.g. medical guidelines) and:

1. Load and chunk the PDF documents.
2. Create a vector index using **embedding model A** (local `intfloat/e5-base-v2`).
3. Create a second index using **embedding model B** (e.g. OpenAI or Gemini embeddings).
4. Implement a simple **retriever** and an **answering function** that calls an LLM with retrieved context.
5. Automatically **generate questions** from the documents and use them to **compare two RAG configurations**.

Cells marked with `# TODO` are **for students to implement**.
Everything else is provided scaffolding.

## 0. Setup & Imports

In [1]:
# TODO (easy): skim the imports and make sure you understand what each library is used for.

from dotenv import load_dotenv
import os
import glob
from PyPDF2 import PdfReader
from langchain_text_splitters import RecursiveCharacterTextSplitter
import faiss
from sentence_transformers import SentenceTransformer
import pickle
import random
import numpy as np
import pandas as pd

# LLM / API clients (we use the OpenAI client library, pointed at DeepInfra's
# OpenAI-compatible endpoint; Gemini can be added as a bonus)
from openai import OpenAI



In [2]:
# Load API keys from .env (you need to create this file once and add your keys)
load_dotenv()

deepinfra_key = os.getenv("DEEPINFRA_API_KEY")
openai_api_key = os.getenv("OPENAI_API_KEY")
google_api_key = os.getenv("GOOGLE_API_KEY")
anthropic_api_key = os.getenv("ANTHROPIC_API_KEY")

# Chat completions go through DeepInfra's OpenAI-compatible endpoint, so the client
# needs the DeepInfra key. OPENAI_API_KEY is only needed for the optional embeddings in RAG B.
assert deepinfra_key is not None, "Please set DEEPINFRA_API_KEY in your .env file."
openai_client = OpenAI(api_key=deepinfra_key, base_url="https://api.deepinfra.com/v1/openai")


In [3]:
# Make pandas show the full table and full cell content
pd.set_option("display.max_rows", None)       # show all rows
pd.set_option("display.max_columns", None)    # show all columns
pd.set_option("display.max_colwidth", None)   # don't truncate cell text

## 1. Load PDF documents

We assume there is a `data/` folder containing one or more PDF files.

**Task:** implement `load_pdfs(glob_path)` so that it:
- Iterates over all PDF files matching `glob_path`
- Reads them with `PdfReader`
- Concatenates the text of all pages into **one long string**.

In [4]:
def load_pdfs(glob_path: str = "data/*.pdf") -> str:
    """Load all PDFs matching the pattern and return their combined text.

    TODO:
    - Use `glob.glob(glob_path)` to iterate over file paths
    - For each file, open it in binary mode and create a `PdfReader`
    - Loop over `reader.pages` and extract text with the extract_text() function
    - Concatenate everything into a single string `text`
    - Be robust: skip pages where `extract_text()` returns None
    """
    # YOUR CODE HERE
    text = ""
    for pdf_path in glob.glob(glob_path):
        print(pdf_path)
        with open(pdf_path, "rb") as f:
            reader = PdfReader(f)
            for page in reader.pages:
                page_text = page.extract_text()
                if page_text:
                    text += " " + page_text
    return text



In [5]:
# Run once and inspect
raw_text = load_pdfs("data/*.pdf")
print("Number of characters:", len(raw_text))
print("Preview:", raw_text[:500])

data/Cipralex.pdf
data/Candesartan.pdf
data/Aspirin.pdf
Number of characters: 139319
Preview:  Cipralex®
Lundbeck (Schweiz) AG
Zusammensetzung
Wirkstoffe
Filmtabletten, Tropfen zum Einnehmen, Lösung:  Escitalopramum ut escitaloprami oxalas
Hilfsstoffe
Filmtabletten:  Cellulosum microcristallinum silicificatum, Talcum, Carmellosum natricum conexum enthält ungefähr 0,32 mg (10 mg) oder 0,63 mg (20 mg)
natrium, Magnesii stearas, Hypromellosum, Macrogolum 400, E171
Tropfen zum Einnehmen, Lösung (20 mg/ml):  Acidum Citricum, Ethanolum 96 per centum 100 mg pro ml, Natrii hydroxidum corresp. ma


## 2. Chunk the text

We will split the long text into overlapping chunks.

Later you can **experiment** with different `chunk_size` and `chunk_overlap` to see how it affects retrieval.

**Task:** start with the given parameters, run once, then try at least one alternative configuration and note the effects.
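
To build intuition for what `chunk_size` and `chunk_overlap` control, here is a minimal sketch of a fixed-size sliding-window splitter. It is a simplification: `RecursiveCharacterTextSplitter` prefers to cut at separators such as paragraph breaks, so its chunks differ from these. The name `sliding_window_chunks` and the demo string are illustrative only.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into character windows of chunk_size that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


demo_text = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = sliding_window_chunks(demo_text, chunk_size=10, chunk_overlap=4)
for chunk in demo_chunks:
    print(chunk)
```

Each chunk repeats the last `chunk_overlap` characters of its predecessor, so a sentence cut at a chunk boundary still appears whole in at least one chunk if it is shorter than the overlap. Larger overlaps produce more chunks for the same text.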

In [6]:
# Base configuration (RAG A)
chunk_size_a = 2000
chunk_overlap_a = 200

splitter_a = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size_a,
    chunk_overlap=chunk_overlap_a
)

chunks_a = splitter_a.split_text(raw_text)
print(f"RAG A: {len(chunks_a)} chunks produced, first chunk length = {len(chunks_a[0])}")

RAG A: 79 chunks produced, first chunk length = 1908


## 3. Create embeddings and a FAISS index

We start with **Embedding model A: `intfloat/e5-base-v2`** using `sentence-transformers`.

Then, as an optional extension, you can build **Embedding model B** using OpenAI or Gemini and compare.

To keep the exercise manageable, the base version only **requires** embedding model A.
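
`faiss.IndexFlatL2` performs an exact, brute-force search and reports squared Euclidean distances. The pure-Python sketch below mimics that behaviour on toy 2-D vectors so you can see what the index computes. `flat_l2_search` and the toy data are illustrative names, not part of FAISS.

```python
def flat_l2_search(vectors, query, k):
    """Return (squared_distances, indices) of the k vectors closest to query."""
    scored = []
    for i, vec in enumerate(vectors):
        # Squared Euclidean distance between the stored vector and the query
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, i))
    scored.sort()  # smallest distance first
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
toy_distances, toy_indices = flat_l2_search(toy_vectors, query=[0.9, 0.1], k=2)
print(toy_indices)
```

A smaller distance means a closer match, so results come back in ascending order of distance. With only a few dozen chunks, as in this exercise, an exact search like this is cheap; approximate indexes matter only for much larger collections.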

In [7]:
# Embedding model A (local)
model_name_a = "intfloat/e5-base-v2"
embedder_a = SentenceTransformer(model_name_a)


# E5 models expect a "passage: " prefix on indexed documents (and "query: " on queries)
chunks_with_prefix = ["passage: " + chunk for chunk in chunks_a]
# Compute embeddings for all chunks of configuration A
embeddings_a = embedder_a.encode(chunks_with_prefix, convert_to_numpy=True)

dimensions_a = embeddings_a.shape[1]
print("Embedding dimensionality (A):", dimensions_a)

index_a = faiss.IndexFlatL2(dimensions_a)
index_a.add(embeddings_a)
print("FAISS index (A) size:", index_a.ntotal)

# Persist index/chunks if you like (optional)
os.makedirs("faiss", exist_ok=True)
faiss.write_index(index_a, "faiss/faiss_index_a.index")
with open("faiss/chunks_a.pkl", "wb") as f:
    pickle.dump(chunks_a, f)

Embedding dimensionality (A): 768
FAISS index (A) size: 79


## 4. Implement a simple retriever

We now implement a generic retrieval function that:
1. Embeds the query.
2. Searches the FAISS index.
3. Returns the corresponding text chunks.

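We implement it for configuration A. If you built configuration B, you can reuse the same function, but note that the `"query: "` prefix it adds is an E5 convention and should be dropped or changed for other embedding models.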
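
One possible way to keep the retriever model-agnostic is to move the prefix into a thin wrapper around the embedder. The sketch below is a hypothetical design, not part of the provided scaffolding: `PrefixedEmbedder` and `EchoEmbedder` are made-up names, and `EchoEmbedder` only stands in for a real model so the example runs without downloading one.

```python
class PrefixedEmbedder:
    """Hypothetical adapter: prepends a fixed prefix, then delegates to encode()."""

    def __init__(self, embedder, prefix: str = ""):
        self.embedder = embedder
        self.prefix = prefix

    def encode(self, texts, **kwargs):
        prefixed = [self.prefix + t for t in texts]
        return self.embedder.encode(prefixed, **kwargs)


class EchoEmbedder:
    """Stand-in for a real model: records its inputs and returns dummy 1-D vectors."""

    def __init__(self):
        self.seen = []

    def encode(self, texts, **kwargs):
        self.seen.extend(texts)
        return [[float(len(t))] for t in texts]


echo = EchoEmbedder()
query_embedder = PrefixedEmbedder(echo, prefix="query: ")
vectors = query_embedder.encode(["Wie soll Cipralex gelagert werden?"])
print(echo.seen[0])
```

With a wrapper like this, the retrieval function would pass the raw query to `embedder.encode(...)` and each configuration would decide on its own prefix (for example an empty one for a model that needs none).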

In [8]:
def retrieve_texts(query: str, k: int, index, chunks, embedder) -> list:
    """Return the top-k most similar chunks for a query.

    TODO (students):
    - Encode the query with `embedder.encode(...)`
    - Call `index.search(query_embedding, k)`
    - Use the returned indices to select the chunks
    - Return a list of strings (chunks)
    """
    # YOUR CODE HERE
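    # E5 models expect a "query: " prefix on queries (chunks were indexed with "passage: ")
    query_with_prefix = "query: " + query
    query_emb = embedder.encode([query_with_prefix], convert_to_numpy=True)
    distances, indices = index.search(query_emb, k)
    # FAISS pads the result with -1 when fewer than k vectors are available; skip those
    retrieved = [chunks[i] for i in indices[0] if i >= 0]
    return retrieved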

# Quick sanity check
test_query = "Wie soll Cipralex gelagert werden?"
retrieved_text = retrieve_texts(test_query, k=3, index=index_a, chunks=chunks_a, embedder=embedder_a)
print("Number of retrieved chunks:", len(retrieved_text))
print("Preview of first chunk:", retrieved_text[0][:400])

Number of retrieved chunks: 3
Preview of first chunk: Cipralex®
Lundbeck (Schweiz) AG
Zusammensetzung
Wirkstoffe
Filmtabletten, Tropfen zum Einnehmen, Lösung:  Escitalopramum ut escitaloprami oxalas
Hilfsstoffe
Filmtabletten:  Cellulosum microcristallinum silicificatum, Talcum, Carmellosum natricum conexum enthält ungefähr 0,32 mg (10 mg) oder 0,63 mg (20 mg)
natrium, Magnesii stearas, Hypromellosum, Macrogolum 400, E171
Tropfen zum Einnehmen, Lösung


## 5. Implement `answer_query` using an LLM

Now we build the actual RAG call:

1. Use `retrieve_texts` to get top-`k` chunks.
2. Concatenate them into a context string.
3. Build a prompt that:
   - shows the context
   - asks the model to answer the user question based **only** on this context.
4. Call the chat completion API through the OpenAI-compatible client.

This is the **core RAG function**.
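
Steps 2 and 3 involve no network call, so they can be isolated and checked on their own. The sketch below shows one way to do that. `build_rag_messages`, the English prompt wording, and the demo chunks are assumptions for illustration; the `---` separator mirrors the one used in this notebook.

```python
def build_rag_messages(query: str, retrieved_chunks: list[str]) -> list[dict]:
    """Build chat messages that restrict the model to the retrieved context."""
    # Separate chunks visibly so the model can tell where one ends and the next begins
    context = "\n\n---\n\n".join(retrieved_chunks)
    system_prompt = (
        "Answer ONLY from the provided context. "
        "If the answer is not in the context, say that you do not know."
    )
    user_prompt = f"Context:\n{context}\n\nQuestion: {query}"
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]


demo_messages = build_rag_messages(
    "How should the drug be stored?",
    ["Store below 30 °C.", "Keep out of reach of children."],
)
print(demo_messages[1]["content"])
```

Printing the user message before sending it is a cheap way to confirm that the retrieved chunks, and not something else, are what the model actually sees.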

In [9]:
def answer_query_a(query: str, k: int, index, chunks, embedder, client: OpenAI) -> str:
    """RAG-style answer: retrieve context and ask an LLM.

    TODO (students):
    - Use `retrieve_texts` to get `k` relevant chunks.
    - Join them into a single context string.
    - Build a chat prompt that instructs the model to answer *only* using the context.
    - Call `client.chat.completions.create(...)` with a chat model available on your endpoint
      (here `"meta-llama/Llama-3.3-70B-Instruct-Turbo"` on DeepInfra).
    - Return the model's answer text.
    """
    retrieved_chunks = retrieve_texts(query, k, index, chunks, embedder)
    context = "\n\n---\n\n".join(retrieved_chunks)

    system_prompt = (
        """Du bist ein hilfreicher Assistent, der Fragen NUR basierend auf dem bereitgestellten Kontext beantwortet.
  Wenn die Antwort nicht im Kontext enthalten ist, sage dass du es nicht weisst.
  Antworte auf Deutsch."""
    )

    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Kontext:\n{context}\n\nFrage: {query}"}
    ]

    completion = client.chat.completions.create(
        model="meta-llama/Llama-3.3-70B-Instruct-Turbo",
        messages=messages
    )

    return completion.choices[0].message.content.strip()

# Quick manual test
answer = answer_query_a(test_query, k=3, index=index_a, chunks=chunks_a, embedder=embedder_a, client=openai_client)
print("RAG answer:", answer)

RAG answer: Cipralex soll in der Originalverpackung und nicht über 30°C gelagert werden. Es soll außerdem außer Reichweite von Kindern aufbewahrt werden.


In [10]:
# Test question 1
frage1 = "Wie wird Candesartan normalerweise bei Erwachsenen dosiert?"
antwort1 = answer_query_a(frage1, k=3, index=index_a, chunks=chunks_a, embedder=embedder_a, client=openai_client)
print("Frage 1:", frage1)
print("Antwort:", antwort1)
print("\n" + "="*50 + "\n")

# Test question 2
frage2 = "Wer sollte Aspirin nicht anwenden? Nenne mindestens eine wichtige Gegenanzeige."
antwort2 = answer_query_a(frage2, k=3, index=index_a, chunks=chunks_a, embedder=embedder_a, client=openai_client)
print("Frage 2:", frage2)
print("Antwort:", antwort2)

Frage 1: Wie wird Candesartan normalerweise bei Erwachsenen dosiert?
Antwort: Ich weiß es nicht. Der Kontext liefert Informationen über die Pharmakokinetik von Candesartan, insbesondere bei Patienten mit eingeschränkter Nierenfunktion und älteren Patienten, aber es gibt keine spezifischen Angaben über die Standarddosierung von Candesartan bei Erwachsenen.


Frage 2: Wer sollte Aspirin nicht anwenden? Nenne mindestens eine wichtige Gegenanzeige.
Antwort: Aspirin sollte nicht von Patienten mit Asthma bronchiale oder allgemeiner Neigung zu Überempfindlichkeit angewendet werden, da Acetylsalicylsäure Bronchospasmen begünstigen und Asthmaanfälle oder andere Überempfindlichkeitsreaktionen auslösen kann. Ein weiterer wichtiger Punkt ist, dass Patienten mit Glucose-6-Phosphatdehydrogenase (G6PD)-Mangel Aspirin nicht anwenden sollten, da es eine Hämolyse oder hämolytische Anämie induzieren könnte. Darüber hinaus sollten Kinder und Jugendliche unter 18 Jahren Aspirin bei Fieber und/oder viralen 