# Week 5 Lab: Retrieval-Augmented Generation (RAG)

In this lab you will build a **small RAG system** for data science students in Marseille.

We will:
- Create a tiny corpus of documents about the MSc Data Science programme and life in Marseille.
- Implement three retrievers: **BM25 (keywords)**, **embeddings (semantic)**, and a simple **hybrid**.
- Connect retrieval to a chat model to see how **augmented prompts** change the answers.

The goal is **clarity, not scale**: everything stays in memory and is easy to modify.


## 0) Setup

We will use:
- `sentence-transformers` for embeddings (semantic search),
- `rank-bm25` for a reasonably strong keyword baseline (BM25),
- `google-generativeai` + `python-dotenv` for calling a chat model (Gemini),
- `numpy` for vector math.

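Install these packages once in your environment, for example with `pip install sentence-transformers rank-bm25 google-generativeai python-dotenv numpy` (you may already have some of them from previous labs).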


### Imports and configuration

We load the libraries, configure the Gemini model, and initialise the embedding model.

Make sure you have `GEMINI_API_KEY` in your `.env` (see Week 3 lab).


In [1]:
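# __future__ imports must be the first statement in a module or script
from __future__ import annotations

import os

# Set logging-related variables before the gRPC/TF-backed libraries are imported
os.environ["GRPC_VERBOSITY"] = "ERROR"     # reduce gRPC verbosity
os.environ["GRPC_TRACE"] = ""             # disable gRPC trace
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"  # hide TF INFO/WARNING (if TF is pulled in)
os.environ["ABSL_LOG_LEVEL"] = "2"        # reduce absl logging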

import math, textwrap
from typing import List, Dict, Tuple

import numpy as np
from sentence_transformers import SentenceTransformer
from rank_bm25 import BM25Okapi

from dotenv import load_dotenv
load_dotenv()

import google.generativeai as genai

GEMINI_API_KEY = os.getenv('GEMINI_API_KEY')
assert GEMINI_API_KEY, 'Please set GEMINI_API_KEY in your environment or .env file.'

genai.configure(api_key=GEMINI_API_KEY)
MODEL_NAME = os.getenv('GEMINI_MODEL', 'gemini-2.5-flash')
GEN_CONFIG = genai.GenerationConfig(temperature=0.2, max_output_tokens=600)


def make_model(system_instruction: str | None = None):
    return genai.GenerativeModel(MODEL_NAME, system_instruction=system_instruction)

# Two separate model instances:
# - model_general: used for regular (non-RAG) Q&A (no strict CONTEXT restriction)
# - model_rag: used when feeding CONTEXT; instruct it to use only the context and cite sources
model_rag = make_model(
    "You are a helpful assistant for data science students in Marseille. "
    "When CONTEXT is provided, use only that information to answer. "
    "If the answer is not in the CONTEXT, say that you do not know. "
    "At the end of the answer include a line 'SOURCES: [i,j]' listing the chunk indices used (if any)."
)

model_general = make_model(
    "You are a helpful assistant for data science students in Marseille. "
    "Answer concisely and directly."
)

# Small embedding model for fast semantic search
EMBED_MODEL_ID = 'sentence-transformers/all-MiniLM-L6-v2'
embedder = SentenceTransformer(EMBED_MODEL_ID)

EMBED_MODEL_ID


'sentence-transformers/all-MiniLM-L6-v2'

## 1) A tiny corpus about Marseille & data science

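We create a few small documents that describe:
- the MSc Data Science programme in Marseille,
- how the programme is structured and where lectures usually take place,
- the careers and skills it prepares for.

A document about study spots is added later, in the Section 3 exercise.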

In a real system these would come from PDFs, web pages, or internal knowledge bases.


In [2]:
documents: List[Dict[str, str]] = [
    {
        'id': 'prog_ds_overview',
        'title': 'Data Science track – overview',
        'campus': 'Marseille',
        'program': 'Data Science',
        'text': textwrap.dedent('''\
            The Data Science (DS) track is part of the Master MAS at Aix-Marseille Université.
            It is a multidisciplinary programme in applied mathematics, statistics and computer science
            designed to train students to handle, visualise and analyse complex data.

            Students take courses in probability, optimisation, statistical modelling, signal and image
            processing, databases, machine learning and programming. The goal is to be able to design and
            evaluate modern data analysis pipelines and decision-support tools, not just run pre-built
            software.

            The programme is taught in Marseille with close links to local research laboratories (I2M and LIS)
            and industrial partners. A significant part of the training is project-based, with group work on
            real or realistic data science problems.
        ''').strip(),
    },
    {
        'id': 'prog_ds_structure',
        'title': 'Structure and learning sites',
        'campus': 'Marseille',
        'program': 'Data Science',
        'text': textwrap.dedent('''\
            The DS track is organised over two years (M1 and M2). In the first year, students consolidate
            their background in analysis, probability, statistics and programming, while discovering
            introductory data science and machine learning courses.

            In the second year, the focus moves to advanced statistical learning, high-dimensional data,
            optimisation for machine learning, and applications such as signal and image processing.
            Some modules are shared with other tracks of the MAS master, which helps maintain a strong
            mathematical core.

            Teaching takes place mainly on the Saint-Charles campus in Marseille, with some activities on
            other science sites. Courses combine lectures, tutorials and computer labs using Python and
            common data science libraries.
        ''').strip(),
    },
    {
        'id': 'prog_ds_careers',
        'title': 'Careers and skills',
        'campus': 'Marseille',
        'program': 'Data Science',
        'text': textwrap.dedent('''\
            Graduates of the DS track typically work as data scientists or data engineers, but also as
            statisticians or machine learning engineers in sectors such as digital services, industry,
            health, environment or finance.

            The programme emphasises the ability to build explanatory and predictive models from data,
            to design and implement end-to-end data processing workflows, and to communicate results to
            non-specialist audiences.

            Students learn to use statistical and scientific computing tools, to evaluate model performance
            and robustness, and to manage data-driven projects from specification to delivery.
        ''').strip(),
    },
]

len(documents)


3

### Chunking

We split each document into smaller **chunks** (here, one per paragraph, so a few sentences each). These are the units we will index and retrieve.


In [3]:
def make_chunks(docs: List[Dict[str, str]]) -> List[Dict[str, str]]:
    chunks: List[Dict[str, str]] = []
    for doc in docs:
        # Split text into paragraphs based on double newlines
        paragraphs = [p.strip() for p in doc['text'].split('\n\n') if p.strip()]
        for i, para in enumerate(paragraphs):
            chunks.append({
                'chunk_id': f"{doc['id']}_p{i}",
                'doc_id': doc['id'],
                'title': doc['title'],
                'campus': doc['campus'],
                'program': doc['program'],
                'text': para,
            })
    return chunks


chunks = make_chunks(documents)
len(chunks), chunks[0]


(9,
 {'chunk_id': 'prog_ds_overview_p0',
  'doc_id': 'prog_ds_overview',
  'title': 'Data Science track – overview',
  'campus': 'Marseille',
  'program': 'Data Science',
  'text': 'The Data Science (DS) track is part of the Master MAS at Aix-Marseille Université.\nIt is a multidisciplinary programme in applied mathematics, statistics and computer science\ndesigned to train students to handle, visualise and analyse complex data.'})

## 2) Build BM25 and embedding indices

We now build two simple retrievers:
- a **BM25** keyword index over the chunk texts,
- an **embedding index** using `sentence-transformers`.


In [4]:
# Prepare texts
corpus_texts = [c['text'] for c in chunks]

# BM25 index (lexical)
tokenized_corpus = [text.lower().split() for text in corpus_texts]
bm25 = BM25Okapi(tokenized_corpus)

# Embedding index (semantic)
emb_matrix = embedder.encode(corpus_texts, convert_to_numpy=True, normalize_embeddings=True)
emb_matrix.shape


(9, 384)
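
One caveat about the BM25 index above: `text.lower().split()` keeps punctuation attached to words, so a query token such as `marseille?` does not match `marseille,` in a chunk. The sketch below compares the whitespace split with a regex-based tokeniser on the lecture query and the Saint-Charles chunk. `tokenize_words` is a hypothetical helper introduced for illustration; it is not part of `rank-bm25`.

```python
import re
from typing import List


def tokenize_words(text: str) -> List[str]:
    # Hypothetical helper: lowercase, then keep runs of letters/digits only,
    # so trailing punctuation ("Marseille?", "lectures,") is dropped.
    return re.findall(r"\w+", text.lower())


query = "Where are the data science lectures usually held in Marseille?"
chunk = (
    "Teaching takes place mainly on the Saint-Charles campus in Marseille, "
    "with some activities on other science sites. Courses combine lectures, "
    "tutorials and computer labs."
)

# Terms shared by query and chunk under each tokenisation
naive_overlap = sorted(set(query.lower().split()) & set(chunk.lower().split()))
regex_overlap = sorted(set(tokenize_words(query)) & set(tokenize_words(chunk)))

print("whitespace split:", naive_overlap)
print("regex tokeniser: ", regex_overlap)
```

With the whitespace split only `in`, `science` and `the` overlap; the regex tokeniser also matches `lectures` and `marseille`. To try this in the lab, apply the same function when building `tokenized_corpus` and when tokenising the query in `retrieve_bm25` and `retrieve_hybrid`, since indexing and querying must tokenise identically.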

### Retrieval helpers

We implement three helper functions:
- `retrieve_bm25` for keyword search,
- `retrieve_embeddings` for semantic search,
- `retrieve_hybrid` combining both.
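
The semantic retriever scores chunks with a plain dot product. That works as cosine similarity only because the embeddings were encoded with `normalize_embeddings=True`. The following pure-Python sketch checks this identity on two toy vectors (the vectors and helper names are made up for illustration).

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalise(v):
    # Scale the vector to unit length
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    # Textbook cosine similarity on unnormalised vectors
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0]
b = [4.0, 3.0]
a_hat, b_hat = l2_normalise(a), l2_normalise(b)

# Both values agree: the dot product of unit vectors is the cosine
print(round(cosine(a, b), 4), round(dot(a_hat, b_hat), 4))
```

If the embeddings were not normalised, the dot product would also reward long vectors, and the ranking could differ from a cosine ranking.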


In [5]:
def _format_hit(chunk: Dict[str, str], score: float) -> str:
    # Create a short preview of the chunk text
    preview = chunk['text']
    if len(preview) > 180:
        preview = preview[:177] + '...'
    return f"[score={score:.3f}] {chunk['title']} → {preview}"


def retrieve_bm25(query: str, k: int = 5) -> List[Tuple[Dict[str, str], float]]:
    # BM25 retrieval (lexical)
    tokens = query.lower().split()
    scores = bm25.get_scores(tokens)
    idx = np.argsort(scores)[::-1][:k]
    return [(chunks[i], float(scores[i])) for i in idx]


def retrieve_embeddings(query: str, k: int = 5) -> List[Tuple[Dict[str, str], float]]:
    # Embedding-based retrieval (semantic)
    q_emb = embedder.encode([query], convert_to_numpy=True, normalize_embeddings=True)[0]
    scores = emb_matrix @ q_emb  # cosine similarity (because we normalised)
    idx = np.argsort(scores)[::-1][:k]
    return [(chunks[i], float(scores[i])) for i in idx]


def retrieve_hybrid(query: str, k: int = 5, alpha: float = 0.5) -> List[Tuple[Dict[str, str], float]]:
    # Combine BM25 and embeddings with a simple weighted sum.
    # alpha = 0 → only BM25, alpha = 1 → only embeddings.
    tokens = query.lower().split()
    bm25_scores = bm25.get_scores(tokens)

    q_emb = embedder.encode([query], convert_to_numpy=True, normalize_embeddings=True)[0]
    emb_scores = emb_matrix @ q_emb

    if bm25_scores.max() > 0:
        bm25_norm = bm25_scores / bm25_scores.max()
    else:
        bm25_norm = bm25_scores

    if emb_scores.max() > emb_scores.min():
        emb_norm = (emb_scores - emb_scores.min()) / (emb_scores.max() - emb_scores.min())
    else:
        emb_norm = emb_scores

    combined = (1 - alpha) * bm25_norm + alpha * emb_norm
    idx = np.argsort(combined)[::-1][:k]
    return [(chunks[i], float(combined[i])) for i in idx]
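
To see how `alpha` shifts the ranking, here is a small worked example of the same weighted-sum fusion in pure Python. The scores for the three chunks are made-up numbers, not values from the lab corpus, and the helper names are illustrative.

```python
def max_scale(scores):
    # Same idea as bm25_norm above: divide by the maximum when it is positive
    top = max(scores)
    return [s / top for s in scores] if top > 0 else list(scores)


def min_max(scores):
    # Same idea as emb_norm above: rescale to [0, 1] when the scores differ
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) for s in scores] if hi > lo else list(scores)


def fuse(bm25_scores, emb_scores, alpha):
    b = max_scale(bm25_scores)
    e = min_max(emb_scores)
    return [(1 - alpha) * x + alpha * y for x, y in zip(b, e)]


# Toy scores for three chunks (illustrative only)
bm25_scores = [2.4, 1.8, 0.0]
emb_scores = [0.30, 0.60, 0.70]

best_by_alpha = {}
for alpha in (0.0, 0.5, 1.0):
    combined = fuse(bm25_scores, emb_scores, alpha)
    best = max(range(len(combined)), key=combined.__getitem__)
    best_by_alpha[alpha] = best
    print(f"alpha={alpha}: combined={[round(c, 3) for c in combined]} best chunk={best}")
```

With these numbers, chunk 0 wins at `alpha=0.0` (pure BM25), chunk 2 wins at `alpha=1.0` (pure embeddings), and chunk 1, which is second in both lists, wins at `alpha=0.5`. That is the behaviour a hybrid retriever is meant to provide: it favours chunks that both signals consider reasonably relevant.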


### Compare retrieval modes

Let's compare BM25, embeddings, and the hybrid retriever on a query about lectures in Marseille.


In [6]:
query = 'Where are the data science lectures usually held in Marseille?'

print('--- BM25 ---')
for c, s in retrieve_bm25(query, k=3):
    print(_format_hit(c, s))

print('--- Embeddings ---')
for c, s in retrieve_embeddings(query, k=3):
    print(_format_hit(c, s))

print('--- Hybrid (alpha=0.5) ---')
for c, s in retrieve_hybrid(query, k=3, alpha=0.5):
    print(_format_hit(c, s))


--- BM25 ---
[score=2.468] Structure and learning sites → In the second year, the focus moves to advanced statistical learning, high-dimensional data,
optimisation for machine learning, and applications such as signal and image process...
[score=1.646] Structure and learning sites → The DS track is organised over two years (M1 and M2). In the first year, students consolidate
their background in analysis, probability, statistics and programming, while discov...
[score=1.519] Data Science track – overview → The Data Science (DS) track is part of the Master MAS at Aix-Marseille Université.
It is a multidisciplinary programme in applied mathematics, statistics and computer science
de...
--- Embeddings ---
[score=0.696] Structure and learning sites → Teaching takes place mainly on the Saint-Charles campus in Marseille, with some activities on
other science sites. Courses combine lectures, tutorials and computer labs using Py...
[score=0.671] Data Science track – overview → The programme is 

### Exercise

Try some of the following:
- A query using synonyms (e.g., *"master in artificial intelligence"* vs *"MSc Data Science"*).
- A query mentioning a specific place (e.g., *"Where can I study quietly in Marseille?"*).
- Change `alpha` in `retrieve_hybrid` to emphasise either BM25 (lexical) or embeddings (semantic).


In [7]:
query = "master in artificial intelligence"

print("=== BM25 ===")
for c, s in retrieve_bm25(query, k=3):
    print(_format_hit(c, s))

print("\n=== Embeddings ===")
for c, s in retrieve_embeddings(query, k=3):
    print(_format_hit(c, s))

print("\n=== Hybrid (alpha=0.5) ===")
for c, s in retrieve_hybrid(query, k=3, alpha=0.5):
    print(_format_hit(c, s))


=== BM25 ===
[score=2.026] Data Science track – overview → The Data Science (DS) track is part of the Master MAS at Aix-Marseille Université.
It is a multidisciplinary programme in applied mathematics, statistics and computer science
de...
[score=0.535] Structure and learning sites → The DS track is organised over two years (M1 and M2). In the first year, students consolidate
their background in analysis, probability, statistics and programming, while discov...
[score=0.390] Structure and learning sites → Teaching takes place mainly on the Saint-Charles campus in Marseille, with some activities on
other science sites. Courses combine lectures, tutorials and computer labs using Py...

=== Embeddings ===
[score=0.471] Structure and learning sites → In the second year, the focus moves to advanced statistical learning, high-dimensional data,
optimisation for machine learning, and applications such as signal and image process...
[score=0.379] Data Science track – overview → The programme is

In [8]:
query = "Where can I study quietly in Marseille?"

print("=== BM25 ===")
for c, s in retrieve_bm25(query, k=3):
    print(_format_hit(c, s))

print("\n=== Embeddings ===")
for c, s in retrieve_embeddings(query, k=3):
    print(_format_hit(c, s))

print("\n=== Hybrid (alpha=0.5) ===")
for c, s in retrieve_hybrid(query, k=3, alpha=0.5):
    print(_format_hit(c, s))


=== BM25 ===
[score=0.535] Structure and learning sites → The DS track is organised over two years (M1 and M2). In the first year, students consolidate
their background in analysis, probability, statistics and programming, while discov...
[score=0.390] Structure and learning sites → Teaching takes place mainly on the Saint-Charles campus in Marseille, with some activities on
other science sites. Courses combine lectures, tutorials and computer labs using Py...
[score=0.385] Careers and skills → Graduates of the DS track typically work as data scientists or data engineers, but also as
statisticians or machine learning engineers in sectors such as digital services, indus...

=== Embeddings ===
[score=0.613] Structure and learning sites → Teaching takes place mainly on the Saint-Charles campus in Marseille, with some activities on
other science sites. Courses combine lectures, tutorials and computer labs using Py...
[score=0.472] Data Science track – overview → The programme is taught in 

In [9]:
query = "Where are classes held for the data science programme?"

for alpha in [0.0, 0.5, 1.0]:
    print(f"\n=== Hybrid α={alpha} ===")
    for c, s in retrieve_hybrid(query, k=3, alpha=alpha):
        print(_format_hit(c, s))



=== Hybrid α=0.0 ===
[score=1.000] Structure and learning sites → In the second year, the focus moves to advanced statistical learning, high-dimensional data,
optimisation for machine learning, and applications such as signal and image process...
[score=0.315] Data Science track – overview → The Data Science (DS) track is part of the Master MAS at Aix-Marseille Université.
It is a multidisciplinary programme in applied mathematics, statistics and computer science
de...
[score=0.302] Structure and learning sites → The DS track is organised over two years (M1 and M2). In the first year, students consolidate
their background in analysis, probability, statistics and programming, while discov...

=== Hybrid α=0.5 ===
[score=0.657] Data Science track – overview → The Data Science (DS) track is part of the Master MAS at Aix-Marseille Université.
It is a multidisciplinary programme in applied mathematics, statistics and computer science
de...
[score=0.614] Data Science track – overview → The 

## 3) From retrieval to augmented prompts

We now connect the retriever to the chat model.

We will:
- Ask a question **without RAG** (no context),
- Ask the same question **with RAG** (top chunks added to the prompt),
- Compare the answers.


In [10]:
def build_context(chosen_chunks: List[Dict[str, str]]) -> str:
    # Build context string from chosen chunks
    parts = []
    for i, c in enumerate(chosen_chunks, start=1):
        parts.append(f"""[{i}] {c['title']} (campus={c['campus']}, program={c['program']})
{c['text']}""")
    return '\n\n'.join(parts)


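def answer_without_rag(question: str) -> str:
    # Same QUESTION framing as the RAG prompt, but with no CONTEXT block
    prompt = f"QUESTION:\n{question}"
    # Pass the prompt we just built (it was previously built but never used)
    resp = model_general.generate_content(prompt, generation_config=GEN_CONFIG)
    return resp.text or ''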


def answer_with_rag(
    question: str,
    k: int = 4,
    alpha: float = 0.5,
    campus: str | None = None,
    program: str | None = None,
) -> Tuple[str, List[Dict[str, str]]]:
    # Step 1: hybrid retrieval (over-fetch so the metadata filter has candidates to drop)
    hits = retrieve_hybrid(question, k=max(k, 10), alpha=alpha)

    # Step 2: metadata-based refinement (final filter, as in the lecture)
    if campus or program:
        filtered = []
        for chunk, score in hits:
            if campus is not None and chunk['campus'] != campus:
                continue
            if program is not None and chunk['program'] != program:
                continue
            filtered.append((chunk, score))
        hits = filtered or hits  # fall back if filter is too strict

    chosen_chunks = [c for c, _ in hits[:k]]
    context = build_context(chosen_chunks)

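    # Build the prompt with explicit newlines. textwrap.dedent cannot remove the
    # indentation here, because the interpolated context lines start at column 0.
    prompt = f"CONTEXT:\n{context}\n\nQUESTION:\n{question}"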

    resp = model_rag.generate_content(prompt, generation_config=GEN_CONFIG)
    return resp.text or '', chosen_chunks
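
It can help to inspect the augmented prompt before sending it to the model. The sketch below rebuilds the CONTEXT/QUESTION layout offline, with no API call, using two chunks shortened from the corpus. `format_context` and `build_prompt` are hypothetical stand-ins that mirror `build_context` and the prompt assembly in `answer_with_rag`.

```python
def format_context(chosen_chunks):
    # Number each chunk from 1 so the model can cite it as [1], [2], ...
    parts = []
    for i, c in enumerate(chosen_chunks, start=1):
        header = f"[{i}] {c['title']} (campus={c['campus']}, program={c['program']})"
        parts.append(header + "\n" + c["text"])
    return "\n\n".join(parts)


def build_prompt(question, chosen_chunks):
    return f"CONTEXT:\n{format_context(chosen_chunks)}\n\nQUESTION:\n{question}"


toy_chunks = [
    {
        "title": "Structure and learning sites",
        "campus": "Marseille",
        "program": "Data Science",
        "text": "Teaching takes place mainly on the Saint-Charles campus in Marseille.",
    },
    {
        "title": "Data Science track – overview",
        "campus": "Marseille",
        "program": "Data Science",
        "text": "The programme is taught in Marseille with close links to local research laboratories.",
    },
]

question = "Where are the data science lectures usually held in Marseille?"
prompt = build_prompt(question, toy_chunks)
print(prompt)
```

Printing the prompt makes it easy to check that the chunk numbers match the indices the model is asked to cite in its `SOURCES:` line.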


### Example: lectures in Marseille

We reuse our earlier query and compare:
- the answer without RAG,
- the answer with RAG (hybrid + metadata refinement).


In [11]:
q = 'Where are the data science lectures usually held in Marseille?'

print('--- Without RAG ---')
print(answer_without_rag(q))

print('--- With RAG (hybrid + metadata filter) ---')
answer, used_chunks = answer_with_rag(q, campus='Marseille', program='Data Science')
print(answer)

print('[Context chunks used:]')
for c in used_chunks:
    print('-', c['chunk_id'], 'from', c['title'])


--- Without RAG ---


Data science lectures are typically held at Aix-Marseille Université (AMU), primarily on the **Luminy campus** and sometimes the **Saint-Charles campus**.
--- With RAG (hybrid + metadata filter) ---
Data science lectures are mainly held on the Saint-Charles campus in Marseille.
SOURCES: [1]
[Context chunks used:]
- prog_ds_structure_p2 from Structure and learning sites
- prog_ds_overview_p2 from Data Science track – overview
- prog_ds_overview_p0 from Data Science track – overview
- prog_ds_structure_p1 from Structure and learning sites
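
The `SOURCES: [1]` line refers to positions in the context, not to `chunk_id`s. A minimal sketch of mapping the citation back to chunk ids follows, assuming the model respects the `SOURCES: [i,j]` format requested in the system instruction (it may not always do so). `parse_sources` and `cited_chunk_ids` are hypothetical helpers.

```python
import re
from typing import List


def parse_sources(answer: str) -> List[int]:
    # Find the bracketed list after "SOURCES:" and extract the integers in it
    match = re.search(r"SOURCES:\s*\[([^\]]*)\]", answer)
    if not match:
        return []
    return [int(tok) for tok in re.findall(r"\d+", match.group(1))]


def cited_chunk_ids(answer: str, used_chunk_ids: List[str]) -> List[str]:
    # Citations are 1-based positions in the context; ignore out-of-range indices
    return [
        used_chunk_ids[i - 1]
        for i in parse_sources(answer)
        if 1 <= i <= len(used_chunk_ids)
    ]


answer = (
    "Data science lectures are mainly held on the Saint-Charles campus in Marseille.\n"
    "SOURCES: [1]"
)
used = [
    "prog_ds_structure_p2",
    "prog_ds_overview_p2",
    "prog_ds_overview_p0",
    "prog_ds_structure_p1",
]
print(cited_chunk_ids(answer, used))
```

For the run above this returns `['prog_ds_structure_p2']`, the chunk that mentions the Saint-Charles campus.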


### Exercise

- Try questions where the answer **is** in the corpus and where it is **not**.
- Does the RAG answer correctly admit when it does not know?
- Vary `k` (number of chunks) and observe when the answer becomes better or worse.
- Add a new document to `documents` (e.g., another campus or a new study spot) and re-run the indexing cell. How does it change the retrieved chunks and answers?


In [12]:
q = "Where are data science lectures usually held?"

print("=== Without RAG ===")
print(answer_without_rag(q), "\n")

print("=== With RAG ===")
answer, used = answer_with_rag(q, campus="Marseille", program="Data Science")
print(answer, "\n")

print("[Chunks used:]")
for c in used:
    print("-", c["chunk_id"])


=== Without RAG ===
At the Luminy campus of Aix-Marseille University. 

=== With RAG ===
Data science lectures are usually held mainly on the Saint-Charles campus in Marseille, with some activities on other science sites.
SOURCES: [1] 

[Chunks used:]
- prog_ds_structure_p2
- prog_ds_overview_p0
- prog_ds_overview_p2
- prog_ds_structure_p0


In [13]:
q = "Does the DS programme include a course on quantum computing?"

print("=== Without RAG ===")
print(answer_without_rag(q), "\n")

print("=== With RAG ===")
answer, used = answer_with_rag(q, campus="Marseille", program="Data Science")
print(answer, "\n")

print("[Chunks used:]")
for c in used:
    print("-", c["chunk_id"])


=== Without RAG ===
No, the standard Data Science programme does not typically include a course on quantum computing. 

=== With RAG ===
I do not know.
SOURCES: [] 

[Chunks used:]
- prog_ds_overview_p0
- prog_ds_overview_p2
- prog_ds_structure_p0
- prog_ds_careers_p0


In [14]:
q = "What does the DS programme teach in the second year?"

for k in [1, 2, 4, 6]:
    print(f"\n=== With RAG (k={k}) ===")
    answer, used = answer_with_rag(q, k=k, campus="Marseille", program="Data Science")
    print(answer)
    print("Chunks:", [c['chunk_id'] for c in used])



=== With RAG (k=1) ===
I do not know what the DS programme teaches in the second year based on the provided context. The context only describes the curriculum for the first year (M1).
SOURCES: [1]
Chunks: ['prog_ds_structure_p0']

=== With RAG (k=2) ===
I do not know.
SOURCES: []
Chunks: ['prog_ds_structure_p0', 'prog_ds_overview_p0']

=== With RAG (k=4) ===
In the second year of the DS program, the focus is on advanced statistical learning, high-dimensional data, optimisation for machine learning, and applications such as signal and image processing. Some modules are shared with other tracks of the MAS master, which helps maintain a strong mathematical core.
SOURCES: [4]
Chunks: ['prog_ds_structure_p0', 'prog_ds_overview_p0', 'prog_ds_careers_p0', 'prog_ds_structure_p1']

=== With RAG (k=6) ===
In the second year of the DS programme, the focus is on advanced statistical learning, high-dimensional data, optimisation for machine learning, and applications such as signal and image proce

In [18]:
documents.append({
    'id': 'marseille_study_spots',
    'title': 'Study spots in Marseille',
    'campus': 'Marseille',
    'program': 'General',
    # Dedent and strip like the other documents, so chunk lines carry no stray indentation
    'text': textwrap.dedent('''\
        Marseille offers several quiet places to study.
        Popular choices include the Saint-Charles university library,
        the Alcazar public library near the Vieux-Port,
        and the campus learning centre with group-study rooms.
    ''').strip(),
})


In [19]:
chunks = make_chunks(documents)

corpus_texts = [c['text'] for c in chunks]
tokenized_corpus = [t.lower().split() for t in corpus_texts]
bm25 = BM25Okapi(tokenized_corpus)
emb_matrix = embedder.encode(corpus_texts, convert_to_numpy=True, normalize_embeddings=True)


In [21]:
q = "Where can I study quietly in Marseille?"

answer, used = answer_with_rag(q, k=4)

print(answer)
print("Chunks:", [c['chunk_id'] for c in used])


Marseille offers several quiet places to study, including the Saint-Charles university library, the Alcazar public library near the Vieux-Port, and the campus learning centre which has group-study rooms.
SOURCES: [4]
Chunks: ['prog_ds_structure_p2', 'prog_ds_overview_p2', 'prog_ds_structure_p0', 'marseille_study_spots_p0']


## 4) (Optional) Evaluate retrieval with precision and recall

For a real project, you would:
- create a small set of labelled queries with known relevant chunks,
- compute **precision** and **recall** for your retriever,
- tune parameters (e.g., `alpha`, `k`) to get high recall while keeping prompts small.

You can sketch your own mini-evaluation below by defining a few queries and lists of relevant `chunk_id`s. 


In [None]:
# Optional: small manual evaluation scaffold
# Example structure; fill in your own queries and relevant chunks.

examples = [
    {
        'query': 'Where does the Data Science track take place?',
        'relevant_chunk_ids': ['prog_ds_structure_p2', 'prog_ds_overview_p2'],
    },
    {
        'query': 'What kinds of jobs can graduates of the Data Science track expect?',
        'relevant_chunk_ids': ['prog_ds_careers_p0'],
    },
]

alpha = 0.5
k = 4

for ex in examples:
    hits = retrieve_hybrid(ex['query'], k=k, alpha=alpha)
    retrieved_ids = [c['chunk_id'] for c, _ in hits]

    tp = len(set(retrieved_ids) & set(ex['relevant_chunk_ids']))
    fp = len(set(retrieved_ids) - set(ex['relevant_chunk_ids']))
    fn = len(set(ex['relevant_chunk_ids']) - set(retrieved_ids))

    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0

    print('\nQuery:', ex['query'])
    print('Retrieved:', retrieved_ids)
    print('Relevant:', ex['relevant_chunk_ids'])
    print(f'Precision={precision:.2f}, Recall={recall:.2f}')
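
To compare settings such as different `alpha` values, it is convenient to reduce the per-query numbers to a single average. The sketch below wraps the same precision/recall computation in a function and averages over queries. The retrieved lists are invented for illustration and are not actual outputs of `retrieve_hybrid`.

```python
def precision_recall(retrieved, relevant):
    # Set-based precision and recall for one query
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    tp = len(retrieved_set & relevant_set)
    precision = tp / len(retrieved_set) if retrieved_set else 0.0
    recall = tp / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# (retrieved ids, relevant ids) per query; retrieved lists are illustrative only
runs = [
    (
        ["prog_ds_structure_p2", "prog_ds_overview_p2", "prog_ds_overview_p0", "prog_ds_structure_p1"],
        ["prog_ds_structure_p2", "prog_ds_overview_p2"],
    ),
    (
        ["prog_ds_careers_p0", "prog_ds_overview_p0", "prog_ds_structure_p0", "prog_ds_careers_p2"],
        ["prog_ds_careers_p0"],
    ),
]

scores = [precision_recall(retrieved, relevant) for retrieved, relevant in runs]
mean_precision = sum(p for p, _ in scores) / len(scores)
mean_recall = sum(r for _, r in scores) / len(scores)
print(f"mean precision={mean_precision:.3f}, mean recall={mean_recall:.3f}")
```

Here the two queries score (0.5, 1.0) and (0.25, 1.0), so the mean precision is 0.375 and the mean recall is 1.0. With only one or two relevant chunks per query and `k=4`, precision is capped well below 1, which is why recall is usually the number to watch when tuning a retriever for RAG.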
