# NLP RAG Tutor: Iterative Approach (Notebook)

This notebook illustrates **the iterative approach** used to improve a **RAG (Retrieval-Augmented Generation)** system built on PDF documents.

It covers:
1. **Baseline** (chunking + embeddings + retrieval)
2. **Iteration 1: Chunking** (size/overlap) and its impact on Recall@k / MRR
3. **Iteration 2: Embeddings** (multilingual vs. English) and its impact
4. **Iteration 3: Prompt** (anti-hallucination) with a qualitative example
5. **Iteration 4: small-to-big** retrieval
6. **Conclusion**: final choice

> Prerequisite: `data/interim/pages.jsonl` must have been generated with `python -m src.main ingest`.


## 0) Setup



In [38]:
import sys
from pathlib import Path

ROOT = Path.cwd().parent

if not (ROOT / "src").exists():
    raise RuntimeError("Invalid project structure: src folder not found")

sys.path.insert(0, str(ROOT))

print("Project root added to PYTHONPATH:", ROOT)

PAGES = ROOT / "data/interim/pages.jsonl"
QUESTIONS = ROOT / "data/eval/questions.csv"
assert PAGES.exists(), "pages.jsonl not found. Run: python -m src.main ingest"
assert QUESTIONS.exists(), "questions.csv not found. Create: data/eval/questions.csv"
print("OK: pages.jsonl and questions.csv found.")


Project root added to PYTHONPATH: c:\Users\0204528N\Desktop\Projet_nlp\nlp-rag-tutor
OK: pages.jsonl and questions.csv found.


## 1) Utility functions: index + evaluation

We will:
- chunk the pages (`max_chars`, `overlap_chars`)
- index the chunks with FAISS (`embed_model`)
- evaluate retrieval (Recall@k, MRR) with the `run_evaluation()` function
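The two retrieval metrics can be sketched as follows. This is a minimal illustration, not the code of `run_evaluation()`: it assumes Recall@k is computed as a hit rate (a question counts as a success when at least one relevant chunk appears in the top k), and the function names and toy identifiers are hypothetical.

```python
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[Sequence[str]], relevant: Sequence[Set[str]], k: int) -> float:
    """Fraction of questions with at least one relevant item in the top k (hit-rate assumption)."""
    hits = sum(
        1 for results, rel in zip(ranked, relevant)
        if any(doc in rel for doc in results[:k])
    )
    return hits / len(ranked)


def mean_reciprocal_rank(ranked: Sequence[Sequence[str]], relevant: Sequence[Set[str]], k: int) -> float:
    """Average of 1/rank of the first relevant item in the top k (0 when none is found)."""
    total = 0.0
    for results, rel in zip(ranked, relevant):
        for rank, doc in enumerate(results[:k], start=1):
            if doc in rel:
                total += 1.0 / rank
                break
    return total / len(ranked)


# Toy data: three questions, each with a ranked list of chunk ids.
ranked = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"]]
relevant = [{"a"}, {"e"}, {"z"}]

print(recall_at_k(ranked, relevant, k=3))           # 2 of 3 questions have a hit
print(mean_reciprocal_rank(ranked, relevant, k=3))  # (1 + 1/2 + 0) / 3
```

Recall@k only says whether a relevant chunk was retrieved; MRR additionally rewards placing it near the top of the list.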


In [None]:
import shutil
from dataclasses import dataclass
from typing import Dict

from src.utils.io import read_jsonl, write_jsonl
from src.chunking.chunker import chunk_pages, ChunkConfig
from src.retrieval.build_index import build_index
from src.retrieval.embedder import EmbeddingConfig
from src.eval.evaluate import run_evaluation, EvalConfig

def build_run(
    run_name: str,
    max_chars: int,
    overlap_chars: int,
    top_k: int,
    embed_model: str,
) -> Dict:
    """Chunk the pages, build a FAISS index, and evaluate retrieval for one configuration."""
    run_dir = Path("data/experiments") / run_name
    run_dir.mkdir(parents=True, exist_ok=True)

    # 1) Chunk
    pages = list(read_jsonl(PAGES))

    cfg = ChunkConfig(max_chars=max_chars, overlap_chars=overlap_chars, min_chars=300)
    chunks = chunk_pages(pages, cfg)
    chunks_path = run_dir / "chunks.jsonl"
    write_jsonl(chunks_path, chunks)

    # 2) Index
    index_dir = run_dir / "faiss"
    if index_dir.exists():
        shutil.rmtree(index_dir)
    build_index(
        chunks_jsonl=chunks_path,
        index_dir=index_dir,
        embed_cfg=EmbeddingConfig(model_name=embed_model, normalize=True),
    )

    # 3) Evaluate retrieval
    metrics = run_evaluation(
        index_dir=index_dir,
        cfg=EvalConfig(
            questions_csv=QUESTIONS,
            out_dir=run_dir / "eval",
            top_k=top_k,
            embed_model=embed_model,
            use_llm=False,
        ),
    )
    metrics.update({
        "run_name": run_name,
        "max_chars": max_chars,
        "overlap_chars": overlap_chars,
        "top_k": top_k,
        "embed_model": embed_model,
        "n_chunks": len(chunks),
        "run_dir": str(run_dir),
    })
    return metrics

print("Ready.")


Ready.


## 2) Baseline

- Chunking: `max_chars=3500`, `overlap_chars=400`
- Embeddings: multilingual (`paraphrase-multilingual-MiniLM-L12-v2`)
- top_k: 8

We measure Recall@8 and MRR.


In [40]:
baseline = build_run(
    run_name="baseline_multilingual_3500_400",
    max_chars=3500,
    overlap_chars=400,
    top_k=8,
    embed_model="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
)
baseline


Batches: 100%|██████████| 22/22 [00:42<00:00,  1.92s/it]

{'n_questions': 40,
 'recall@8': 0.825,
 'mrr': 0.6320833333333333,
 'run_name': 'baseline_multilingual_3500_400',
 'max_chars': 3500,
 'overlap_chars': 400,
 'top_k': 8,
 'embed_model': 'sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2',
 'n_chunks': 688,
 'run_dir': 'data\\experiments\\baseline_multilingual_3500_400'}

## 3) Iteration 1: Chunking

With a long, dense book, smaller chunks often give better retrieval:
- `max_chars=2500`, `overlap_chars=300`

We keep the same embedding model (multilingual) to isolate the effect of chunking.
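To make the role of the two parameters concrete, here is a minimal sliding-window sketch. It is not the implementation of `chunk_pages` (which also takes a `min_chars` setting and works on pages); `chunk_text` is a hypothetical helper that only shows how `max_chars` and `overlap_chars` interact.

```python
def chunk_text(text: str, max_chars: int, overlap_chars: int) -> list[str]:
    """Split text into windows of max_chars that share overlap_chars with the previous window."""
    if not 0 <= overlap_chars < max_chars:
        raise ValueError("overlap_chars must satisfy 0 <= overlap_chars < max_chars")
    step = max_chars - overlap_chars  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks


# Synthetic 10,000-character text, for illustration only.
text = "".join(str(i % 10) for i in range(10_000))

big = chunk_text(text, max_chars=3500, overlap_chars=400)
small = chunk_text(text, max_chars=2500, overlap_chars=300)
print(len(big), len(small))
```

The window advances by `max_chars - overlap_chars`, so the step drops from 3100 to 2200 characters between the two settings. That ratio (about 1.41) is close to the growth in chunk count reported by the runs (688 to 962 chunks, about 1.40).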


In [9]:
chunk_iter = build_run(
    run_name="iter1_chunking_multilingual_2500_300",
    max_chars=2500,
    overlap_chars=300,
    top_k=8,
    embed_model="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
)
chunk_iter


Batches: 100%|██████████| 31/31 [00:40<00:00,  1.32s/it]

{'n_questions': 40,
 'recall@8': 0.875,
 'mrr': 0.7098214285714286,
 'run_name': 'iter1_chunking_multilingual_2500_300',
 'max_chars': 2500,
 'overlap_chars': 300,
 'top_k': 8,
 'embed_model': 'sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2',
 'n_chunks': 962,
 'run_dir': 'data\\experiments\\iter1_chunking_multilingual_2500_300'}

## 4) Iteration 2: Embeddings (English vs. multilingual)

The source book is in **English**. An **English** embedding model often gives better semantic similarity on such a corpus.
We compare:
- multilingual (`paraphrase-multilingual-MiniLM-L12-v2`)
- English (`all-MiniLM-L6-v2`)

We keep the chunking from Iteration 1 (`max_chars=2500`, `overlap_chars=300`).
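Whatever the embedding model, retrieval comes down to ranking chunk vectors by similarity to the query vector. The sketch below uses plain Python lists and toy 2-dimensional vectors; it assumes that, with `normalize=True`, the index compares vectors by inner product, which then equals cosine similarity. The helper names are hypothetical and do not come from the project.

```python
import math


def l2_normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k(query: list[float], docs: list[list[float]], k: int) -> list[tuple[int, float]]:
    """Return (doc_index, cosine_similarity) for the k most similar documents."""
    q = l2_normalize(query)
    scored = []
    for i, doc in enumerate(docs):
        d = l2_normalize(doc)
        scored.append((sum(a * b for a, b in zip(q, d)), i))
    scored.sort(reverse=True)  # highest similarity first
    return [(i, score) for score, i in scored[:k]]


# Toy vectors standing in for chunk embeddings.
docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
results = top_k([1.0, 0.0], docs, k=2)
print(results)
```

Changing the embedding model changes the vectors, and therefore the ranking, while the chunks and the search procedure stay the same. This is what lets the iteration isolate the effect of the model.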


In [10]:
embed_iter = build_run(
    run_name="iter2_embeddings_english_2500_300",
    max_chars=2500,
    overlap_chars=300,
    top_k=8,
    embed_model="sentence-transformers/all-MiniLM-L6-v2",
)
embed_iter


Batches: 100%|██████████| 31/31 [00:35<00:00,  1.14s/it]

{'n_questions': 40,
 'recall@8': 0.95,
 'mrr': 0.8800000000000001,
 'run_name': 'iter2_embeddings_english_2500_300',
 'max_chars': 2500,
 'overlap_chars': 300,
 'top_k': 8,
 'embed_model': 'sentence-transformers/all-MiniLM-L6-v2',
 'n_chunks': 962,
 'run_dir': 'data\\experiments\\iter2_embeddings_english_2500_300'}

## 5) Quantitative summary of the iterations

We compare Recall@8 and MRR across the 3 runs.


In [41]:
import pandas as pd

df = pd.DataFrame([baseline, chunk_iter, embed_iter])[
    ["run_name","n_chunks","max_chars","overlap_chars","embed_model","recall@8","mrr","run_dir"]
]
df


|   | run_name | n_chunks | max_chars | overlap_chars | embed_model | recall@8 | mrr | run_dir |
|---|---|---|---|---|---|---|---|---|
| 0 | baseline_multilingual_3500_400 | 688 | 3500 | 400 | sentence-transformers/paraphrase-multilingual-... | 0.825 | 0.632083 | data\experiments\baseline_multilingual_3500_400 |
| 1 | iter1_chunking_multilingual_2500_300 | 962 | 2500 | 300 | sentence-transformers/paraphrase-multilingual-... | 0.875 | 0.709821 | data\experiments\iter1_chunking_multilingual_2... |
| 2 | iter2_embeddings_english_2500_300 | 962 | 2500 | 300 | sentence-transformers/all-MiniLM-L6-v2 | 0.95 | 0.88 | data\experiments\iter2_embeddings_english_2500... |
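With only 40 questions, each question weighs 0.025 in Recall@8, so it helps to read the scores as counts. The sketch below redoes that arithmetic from the reported metrics. It assumes Recall@8 is a hit rate (the share of questions with a relevant chunk in the top 8); `summarize` is a hypothetical helper.

```python
N_QUESTIONS = 40

# (run, recall@8, mrr) as reported by the three runs above.
runs = [
    ("baseline", 0.825, 0.6320833333333333),
    ("iter1_chunking", 0.875, 0.7098214285714286),
    ("iter2_embeddings", 0.95, 0.8800000000000001),
]


def summarize(runs, n_questions):
    """Convert recall to a question count and compute gains over the first run."""
    _, base_recall, base_mrr = runs[0]
    rows = []
    for name, recall, mrr in runs:
        rows.append({
            "run": name,
            "questions_hit": round(recall * n_questions),
            "delta_recall": round(recall - base_recall, 3),
            "delta_mrr": round(mrr - base_mrr, 3),
        })
    return rows


summary = summarize(runs, N_QUESTIONS)
for row in summary:
    print(row)
```

Under that assumption, the gain from the baseline to the English embeddings corresponds to five more questions out of 40 with a relevant chunk in the top 8. That is a small sample, so differences of one or two questions should be read with caution.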


## 6) Iteration 3: Prompt (anti-hallucination), qualitative demonstration

Retrieval can be good while generation still **invents** a formula, unless the prompt requires:
- "copy the formula **word for word** from the sources"
- "if the formula is not in the excerpts, say so explicitly"

We compare 2 prompts:
- "Soft" prompt
- "Strict" prompt (anti-hallucination)

> This section is qualitative: we show the effect on one sensitive question (TF-IDF).
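One way to make this comparison less dependent on reading the answers by eye is a crude grounding check: flag every formula-like line of the answer that does not appear in the retrieved sources. The sketch below is a heuristic, not part of the project. It treats any line containing `=` as a formula, ignores whitespace and case, and uses toy strings rather than real excerpts from the book.

```python
import re


def normalize(text: str) -> str:
    """Remove all whitespace and lowercase, so spacing differences do not matter."""
    return re.sub(r"\s+", "", text).lower()


def ungrounded_formulas(answer: str, sources: list[str]) -> list[str]:
    """Return answer lines that look like formulas but are absent from the sources."""
    haystack = normalize(" ".join(sources))
    flagged = []
    for line in answer.splitlines():
        line = line.strip()
        if "=" in line and normalize(line) not in haystack:
            flagged.append(line)
    return flagged


# Toy answer and source, for illustration only.
answer = "TF = (1 + log(f)) / (1 + log(N))\nIDF = log(N / df)\nTF-IDF weighs rare terms more."
sources = ["The inverse document frequency is defined as idf = log(N/df)."]

flagged = ungrounded_formulas(answer, sources)
print(flagged)
```

A check like this cannot tell whether a formula is correct, only whether it was copied from the context. It would also miss formulas that PDF extraction has garbled in the sources.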


In [42]:
from src.retrieval.retriever import Retriever, RetrieverConfig
from src.rag.llm_groq import GroqLLM, GroqConfig

# Use the best index (English embeddings) for the demo
INDEX_DIR = Path(embed_iter["run_dir"]) / "faiss"

retriever = Retriever(
    index_dir=INDEX_DIR,
    embed_cfg=EmbeddingConfig(model_name="sentence-transformers/all-MiniLM-L6-v2", normalize=True),
    cfg=RetrieverConfig(top_k=8),
)

q = "Give the TF-IDF formula, then explain it simply."
hits = retriever.retrieve(q)
print("Top source pages:", [(h["page_start"], h["page_end"], round(h["score"],4)) for h in hits[:3]])


Batches: 100%|██████████| 1/1 [00:00<00:00, 15.36it/s]

Top source pages: [(121, 121, 0.6782), (301, 302, 0.6234), (292, 293, 0.6207)]





In [47]:


def build_messages(question: str, hits, strict: bool) -> list[dict]:
    sources = []
    for i, h in enumerate(hits, start=1):
        excerpt = (h.get("text") or "")[:1200]
        sources.append(
            f"[SOURCE {i}] pdf={h.get('pdf_name')} pages={h.get('page_start')}-{h.get('page_end')} score={h.get('score'):.4f}\n{excerpt}\n"
        )
    sources_txt = "\n".join(sources)

    if strict:
        system = (
            "You are an NLP tutor.\n"
            "STRICT RULES:\n"
            "1) Use ONLY the provided sources.\n"
            "2) If you write any formula (TF, IDF, TF-IDF), copy it EXACTLY from the sources  .\n"
            "3) If the exact formula is not present in the excerpts, say so and only explain the intuition.\n"
            "4) Cite sources like (SOURCE k, pdf, pages) \n"
        )
    else:
        system = (
            "You are an NLP tutor. Use the sources , make it clean ,  to answer and cite them."
        )

    user = f"Question:\n{question}\n\nSOURCES:\n{sources_txt}\n\nAnswer:"
    return [{"role":"system","content":system},{"role":"user","content":user}]



llm = GroqLLM(GroqConfig(model="llama-3.1-8b-instant", temperature=0.2, max_tokens=450))


In [48]:
messages_soft = build_messages(q, hits, strict=False)
ans_soft, usage_soft = llm.chat(messages_soft)


messages_strict = build_messages(q, hits, strict=True)
ans_strict, usage_strict = llm.chat(messages_strict)

print("=== Soft prompt ===\n")
print(ans_soft)
print("\nTokens:", usage_soft)
print("\n\n=================================================================================================================================\n")

print("\n\n=== Strict prompt ===\n")
print(ans_strict)
print("\nTokens:", usage_strict)



=== Soft prompt ===

**TF-IDF Formula:**

The TF-IDF formula is used to calculate the importance of a word in a document. It is defined as the product of two components:

1. **Term Frequency (TF)**: This measures the frequency of a word in a document. It is calculated as:

TF = (1 + log(f)) / (1 + log(N))

where f is the frequency of the word in the document and N is the total number of words in the document.

2. **Inverse Document Frequency (IDF)**: This measures the rarity of a word across all documents. It is calculated as:

IDF = log(N / df)

where N is the total number of documents and df is the number of documents containing the word.

The TF-IDF formula is then calculated as:

TF-IDF = TF × IDF

**Simplified Explanation:**

Imagine you have a large collection of documents, and you want to find the most important words in each document. TF-IDF helps you do this by calculating the importance of each word based on two factors:

1. **How often does the word appear in the document?

### Effect of prompt wording

**We observe a trade-off: a soft prompt produces a pedagogical answer that is less strictly grounded in the sources, whereas a strict prompt can degrade quality when the sources contain PDF extraction artifacts (e.g. `(cid:...)`).**



## Iteration 4: small-to-big

In [53]:
from pathlib import Path

PAGES_DATA = list(read_jsonl(PAGES))

# Map (pdf file name, page number) -> page text
page_lookup = {}
for p in PAGES_DATA:
    pdf = p.get("pdf_name") or p.get("pdf") or p.get("source")
    page = p.get("page")
    # IMPORTANT: the pages file stores the text under "text_raw"
    text = p.get("text_raw") or p.get("text") or ""

    if pdf is None or page is None:
        continue

    page_lookup[(Path(str(pdf)).name, int(page))] = text



def retrieve_small2big(question: str, retriever, *, expand_pages: int = 1, top_k_small: int = 8):
    """
    1) Retrieve from the small index (chunks).
    2) Expand to a big context: concatenate the neighbouring pages around each hit.
    """
    # 1) Small retrieve
    hits = retriever.retrieve(question)[:top_k_small]

    expanded = []
    seen = set()

    for h in hits:
        # Normalize to the file name so the key matches page_lookup
        pdf = Path(str(h.get("pdf_name"))).name
        ps = int(h.get("page_start"))
        pe = int(h.get("page_end"))

        # Widen the page window
        start = max(1, ps - expand_pages)
        end = pe + expand_pages

        key = (pdf, start, end)
        if key in seen:
            continue
        seen.add(key)

        parts = []
        for page in range(start, end + 1):
            t = page_lookup.get((pdf, page))
            if t:
                parts.append(t)

        big_text = "\n\n".join(parts).strip()
        if not big_text:
            continue

        expanded.append({
            "pdf_name": pdf,
            "page_start": start,
            "page_end": end,
            "score": float(h.get("score", 0.0)),
            "text": big_text,
            "seed_chunk_pages": (ps, pe),
        })

    # Sort by score, descending
    expanded.sort(key=lambda x: x["score"], reverse=True)
    return hits, expanded


In [54]:
q = "Give the TF-IDF formula, then explain it simply."

hits_small, hits_big = retrieve_small2big(
    q,
    retriever,
    expand_pages=1,      
    top_k_small=8
)

print("=== SMALL HITS ===")
print([(h["page_start"], h["page_end"], round(h["score"], 4)) for h in hits_small[:3]])

print("\n=== BIG (EXPANDED) HITS ===")
print([(h["page_start"], h["page_end"], round(h["score"], 4)) for h in hits_big[:3]])

messages_soft = build_messages(q, hits_big, strict=False)
ans_soft, usage_soft = llm.chat(messages_soft)

messages_strict = build_messages(q, hits_big, strict=True)
ans_strict, usage_strict = llm.chat(messages_strict)


Batches: 100%|██████████| 1/1 [00:00<00:00, 73.03it/s]


=== SMALL HITS ===
[(121, 121, 0.6782), (301, 302, 0.6234), (292, 293, 0.6207)]

=== BIG (EXPANDED) HITS ===
[(120, 122, 0.6782), (300, 303, 0.6234), (291, 294, 0.6207)]


In [55]:
q = "Give the TF-IDF formula, then explain it simply."

for e in [0, 1, 2, 3]:
    hits_small, hits_big = retrieve_small2big(q, retriever, expand_pages=e, top_k_small=8)
    print(f"\n=== expand_pages={e} ===")
    print("small:", [(h["page_start"], h["page_end"]) for h in hits_small[:3]])
    print("big  :", [(h["page_start"], h["page_end"]) for h in hits_big[:3]])

    messages = build_messages(q, hits_big, strict=True)
    ans, usage = llm.chat(messages)
    print("answer snippet:", ans[:300])

Batches: 100%|██████████| 1/1 [00:00<00:00, 71.17it/s]



=== expand_pages=0 ===
small: [(121, 121), (301, 302), (292, 293)]
big  : [(121, 121), (301, 302), (292, 293)]
answer snippet: The TF-IDF formula is not explicitly provided in the sources. However, we can infer the formula from the given information.

The TF-IDF formula is a product of two components:

1. Term Frequency (TF): This measures the frequency of a term in a document.
2. Inverse Document Frequency (IDF): This meas


Batches: 100%|██████████| 1/1 [00:00<00:00, 85.08it/s]


=== expand_pages=1 ===
small: [(121, 121), (301, 302), (292, 293)]
big  : [(120, 122), (300, 303), (291, 294)]





answer snippet: The TF-IDF formula is not explicitly provided in the sources. However, we can infer the formula from the information given in the sources.

The TF-IDF formula is a combination of two weights:

1. Term Frequency (TF): This weight measures the importance of a term in a document. It is calculated as th


Batches: 100%|██████████| 1/1 [00:00<00:00, 83.94it/s]


=== expand_pages=2 ===
small: [(121, 121), (301, 302), (292, 293)]
big  : [(119, 123), (299, 304), (290, 295)]





answer snippet: The TF-IDF formula is not explicitly provided in the sources. However, we can infer the TF-IDF formula from SOURCE 4, which provides an example of a tf-idf weighted term-document matrix.

From SOURCE 4, we can see that the value for the word "wit" in the play "As You Like It" is 0.085, which is the 


Batches: 100%|██████████| 1/1 [00:00<00:00, 98.21it/s]


=== expand_pages=3 ===
small: [(121, 121), (301, 302), (292, 293)]
big  : [(118, 124), (298, 305), (289, 296)]





answer snippet: The TF-IDF formula is:

idf = log(N/df) (SOURCE 4, pdf, pages 121-128)

where N is the total number of documents in the collection, and df is the number of documents in which term t occurs.

The TF-IDF formula is the product of two weights:

tf-idf = tf * idf

where tf is the term frequency, which i




In this iteration, we studied a *small-to-big* retrieval strategy.
The idea is to first retrieve short, precise passages, then progressively widen the context by including the neighbouring pages.

On the TF-IDF question with the strict prompt, small passages alone (`expand_pages=0`) were not enough: the model reported that the formula was not explicitly provided in the sources, and the same happened with `expand_pages=1` and `expand_pages=2`.
With `expand_pages=3`, the answer quoted `idf = log(N/df)` and `tf-idf = tf * idf`. Widening the context around the relevant passages can therefore give the model the information it needs for complete definitions or formulas. Note that `build_messages` keeps only the first 1200 characters of each source, so each expanded source contributes only the beginning of its page window.

This experiment, run on a single question, suggests that the *small-to-big* strategy offers a good compromise between retrieval precision and context richness in RAG systems.
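As the expansion grows, windows built around nearby hits start to overlap, and the same pages may be sent to the model more than once. A possible refinement, not implemented in this notebook, is to merge overlapping or adjacent windows before building the context. The sketch below assumes all hits come from the same PDF and ignores scores; `expand_windows` and `merge_windows` are hypothetical helpers.

```python
def expand_windows(hit_pages: list[tuple[int, int]], expand_pages: int) -> list[tuple[int, int]]:
    """Widen each (page_start, page_end) hit by expand_pages on both sides, never below page 1."""
    return [(max(1, ps - expand_pages), pe + expand_pages) for ps, pe in hit_pages]


def merge_windows(windows: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Merge page windows that overlap or touch, returning them sorted by start page."""
    merged: list[tuple[int, int]] = []
    for start, end in sorted(windows):
        if merged and start <= merged[-1][1] + 1:
            # Overlapping or adjacent: extend the previous window.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Top-3 small hits from the TF-IDF question above.
hits = [(121, 121), (301, 302), (292, 293)]

print(expand_windows(hits, 1))                 # same windows as the "BIG" hits shown above
print(merge_windows(expand_windows(hits, 4)))  # the two windows around pages 292-302 merge
```

Merging keeps the context free of duplicated pages, at the cost of losing the one-to-one link between a window and the score of the chunk that produced it.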


## 7) Conclusion

- **Book-adapted chunking** (smaller chunks) improved retrieval: Recall@8 0.825 → 0.875, MRR 0.632 → 0.710
- **English embeddings** clearly improved retrieval on this English corpus: Recall@8 0.875 → 0.95, MRR 0.710 → 0.88
- **Strict prompt** reduces hallucinations and pushes the model to copy formulas exactly
- **Small-to-big retrieval** completes the information by widening the context around the relevant chunks

Final pipeline (retained):
- `max_chars=2500`, `overlap_chars=300`
- embeddings: `sentence-transformers/all-MiniLM-L6-v2`
- top_k: 8 (or 10 for difficult questions)
- anti-hallucination prompt
- small-to-big retrieval


