# Exercises XP: W8_D4

## Reading & Analyzing a Scientific Paper (RAG + LLM)

## What You'll Learn
- Interpret scientific articles with purpose and extract structure quickly  
- Critically analyze methods and experimental design  
- Identify/assess evaluation metrics and quality of evidence  
- Practice Cornell-style notes to retain key takeaways  
- Summarize with the 5W1H technique  
- Reflect on a practical RAG + LLM design for a real-world use case

## Deliverables
1. Article structure map (sections + 1-2 sentence purpose each)  
2. Critical analysis of experimental design (6 guiding questions)  
3. Evaluation metrics & evidence critique  
4. Cornell notes of one technical subsection  
5. 5W1H summary + 4-sentence abstract  
6. Design reflection for your own RAG + LLM scenario

**Paper:** *A Study on the Implementation of Generative AI Services Using an Enterprise Data-Based LLM Application Architecture* (PDF provided).


## Exercise 1 — Article Structure Mapping

In [3]:
# [Upload] Upload the PDF file to Colab
# Comment: Use Colab's file upload widget to import the PDF and get its path.
from google.colab import files
uploaded = files.upload()

# Comment: Extract the file path from the uploaded dictionary
PDF_PATH = list(uploaded.keys())[0]
print(f"PDF uploaded: {PDF_PATH}")


Saving A Study on the Implementation of Generative AI Services Using an Enterprise Data-Based LLM Application Architecture.pdf to A Study on the Implementation of Generative AI Services Using an Enterprise Data-Based LLM Application Architecture (1).pdf
PDF uploaded: A Study on the Implementation of Generative AI Services Using an Enterprise Data-Based LLM Application Architecture (1).pdf


In [4]:
# [Setup] Install and import PDF tools
# Comment: Install PyMuPDF for PDF reading, and import needed libraries.
!pip -q install pymupdf pandas

import re
import fitz  # PyMuPDF
import pandas as pd
from pathlib import Path

# Comment: PDF_PATH comes from the upload cell; Colab saves uploads to the current working directory.
PDF_PATH = Path(PDF_PATH)
assert PDF_PATH.exists(), f"PDF not found at: {PDF_PATH}"


In [7]:
# [Helper] Strict section heading detection (line-start + Roman numerals)
# Comment: Match only numbered headings anchored at the start of a line (e.g., "I. Introduction"),
# Comment: which avoids false hits on the same words inside the abstract or body text.

import re
import pandas as pd
import fitz
from pathlib import Path

# Comment: Each pattern enforces line-start (^), a fixed Roman numeral, and typical heading words.
STRICT_SECTION_PATTERNS = [
    r"^I\.\s*Introduction\b",
    r"^II\.\s*(Related Work|Background)\b",
    r"^III\.\s*(Method|Methods|Methodology)\b",
    r"^IV\.\s*(Experiment|Experiments|Results|Evaluation)\b",
    r"^V\.\s*(Conclusion|Conclusions|Discussion|Conclusion and Future Work)\b",
]
SECTION_LABELS = [
    "Introduction",
    "Related Work / Background",
    "Methods",
    "Experiment / Results",
    "Conclusion / Discussion",
]

def extract_pages_lines(pdf_path: Path):
    # Comment: Return page texts split by lines for line-anchored regex
    doc = fitz.open(pdf_path)
    pages = []
    for i in range(len(doc)):
        text = doc[i].get_text("text")
        # Split into lines and trim surrounding whitespace
        lines = [ln.strip() for ln in text.splitlines()]
        pages.append(lines)
    doc.close()
    return pages

def find_sections_strict(pages_lines):
    rows = []
    for label, pattern in zip(SECTION_LABELS, STRICT_SECTION_PATTERNS):
        found_page = None
        snippet = ""
        regex = re.compile(pattern, flags=re.IGNORECASE)
        for idx, lines in enumerate(pages_lines):
            for li, line in enumerate(lines):
                if regex.search(line):
                    found_page = idx + 1  # 1-based page
                    # Build a small preview (the line + next few lines)
                    context = lines[li : li + 3]
                    snippet = " | ".join(context)
                    break
            if found_page:
                break
        rows.append({"section": label, "start_page": found_page, "preview": snippet})
    return pd.DataFrame(rows)

# Comment: If you already defined PDF_PATH earlier via the upload cell, we reuse it.
PDF_PATH = Path(PDF_PATH)
pages_lines = extract_pages_lines(PDF_PATH)
df_sections_strict = find_sections_strict(pages_lines)
df_sections_strict


| | section | start_page | preview |
|---|---|---|---|
| 0 | Introduction | 2 | I. Introduction \| Recent developments in gener... |
| 1 | Related Work / Background | 4 | II. Related Work \| For the purpose of this stu... |
| 2 | Methods | 19 | III. Methods \| In this chapter, we outline a c... |
| 3 | Experiment / Results | 20 | IV. Experiment \| In this chapter, the generati... |
| 4 | Conclusion / Discussion | 26 | V. Conclusion and Discussion \| In this study, ... |
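
The patterns above hard-code one Roman numeral per expected title. A more general variant is sketched below, assuming that every top-level heading sits alone on a line in the form `<Roman numeral>. <Title>`. The names `HEADING_RE` and `find_roman_headings` are illustrative and not part of the notebook, and the demo pages are made up.

```python
import re

# Hypothetical generic pattern: Roman numeral, period, then a capitalized title (at most ~80 chars).
HEADING_RE = re.compile(r"^(?P<num>[IVXLCDM]+)\.\s+(?P<title>[A-Z][^\n]{0,80})$")

def find_roman_headings(pages_lines):
    """Return (page_number, numeral, title) for every line that looks like a numbered heading."""
    hits = []
    for page_idx, lines in enumerate(pages_lines):
        for line in lines:
            m = HEADING_RE.match(line.strip())
            if m:
                hits.append((page_idx + 1, m.group("num"), m.group("title").strip()))
    return hits

# Made-up pages: the word "introduction" inside body text must not count as a heading.
demo_pages = [
    ["1", "Abstract", "This introduction to RAG is brief."],
    ["2", "I. Introduction", "Recent developments in generative AI ..."],
    ["3", "II. Related Work", "For the purpose of this study ..."],
]
print(find_roman_headings(demo_pages))
```

This pattern would also accept lines such as an initial followed by a surname ("C. Smith"), so the hits are best treated as candidates to confirm by eye.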


## For each section, write 1-2 sentences summarizing its purpose

In [8]:
# [Extract Sections] Pull raw text for each section by page range
# Comment: We define the section start pages found earlier and slice text until the next section.

import fitz
from pathlib import Path

assert 'PDF_PATH' in globals(), "Please run the Upload cell first to define PDF_PATH."
PDF_PATH = Path(PDF_PATH)

SECTION_STARTS = {
    "Introduction": 2,
    "Related Work / Background": 4,
    "Methods": 19,
    "Experiment / Results": 20,
    "Conclusion / Discussion": 26,
}

# Comment: Helper to extract page texts as a list (1-based page semantics when slicing below).
doc = fitz.open(PDF_PATH)
pages_full_text = [doc[i].get_text("text") for i in range(len(doc))]
doc.close()

def section_text(pages, start_page, end_page_exclusive):
    """Return the text of pages start_page .. end_page_exclusive - 1 (1-based page numbers)."""
    s = start_page - 1
    # Comment: max() guards against an end bound that precedes the start; the slice is then empty.
    e = max(start_page, end_page_exclusive) - 1
    return "\n".join(pages[s:e])

# Comment: Build a dict of section -> raw text
sorted_sections = list(SECTION_STARTS.items())
sorted_sections.sort(key=lambda kv: kv[1])  # sort by page number

section_texts = {}
for i, (name, start) in enumerate(sorted_sections):
    end = sorted_sections[i+1][1] if i+1 < len(sorted_sections) else len(pages_full_text)+1
    section_texts[name] = section_text(pages_full_text, start, end)

list(section_texts.keys()), [len(t) for t in section_texts.values()]


(['Introduction',
  'Related Work / Background',
  'Methods',
  'Experiment / Results',
  'Conclusion / Discussion'],
 [9122, 44615, 2181, 11079, 8614])
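
Page-granular slicing assigns a whole page to whichever section starts on it, so text that precedes a mid-page heading ends up in the following section. A line-level alternative is sketched below, assuming each heading occupies its own line. `slice_by_headings` and the demo data are hypothetical.

```python
import re

def slice_by_headings(lines, heading_patterns):
    """Split a flat list of lines into {label: text}, cutting at heading lines rather than page breaks."""
    compiled = [(label, re.compile(pat, re.IGNORECASE)) for label, pat in heading_patterns]
    sections, current = {}, None
    for line in lines:
        stripped = line.strip()
        label = next((lab for lab, rx in compiled if rx.match(stripped)), None)
        if label is not None and label not in sections:
            current = label          # a new section starts at this heading line
            sections[current] = []
        elif current is not None:
            sections[current].append(stripped)
    return {lab: "\n".join(body).strip() for lab, body in sections.items()}

# Made-up lines: subsection 3.2.1 appears before the "IV. Experiment" heading.
demo_lines = [
    "III. Methods", "We outline a framework.",
    "3.2.1 RAG based implementation procedure", "Chunks are embedded and stored.",
    "IV. Experiment", "The framework is applied to enterprise data.",
]
demo_patterns = [("Methods", r"^III\.\s*Methods?\b"), ("Experiment", r"^IV\.\s*Experiments?\b")]
demo_sections = slice_by_headings(demo_lines, demo_patterns)
print(demo_sections["Methods"])
```

To try it on the PDF, the per-page lists can be flattened first, for example `[ln for page in pages_lines for ln in page]`.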

In [9]:
# [Summarize] Create 1–2 sentence purpose summaries using a simple heuristic
# Comment: We take the first meaningful paragraph and keep up to two sentences.

import re
from textwrap import shorten

def clean_text(t: str) -> str:
    """Normalize whitespace and remove overly long runs of spaces/newlines."""
    t = t.replace("\xa0", " ")
    t = re.sub(r"[ \t]+", " ", t)
    t = re.sub(r"\n{2,}", "\n\n", t.strip())
    return t

def first_paragraph(t: str) -> str:
    """Return the first non-trivial paragraph (>= 200 chars after cleaning)."""
    paras = [p.strip() for p in t.split("\n\n") if p.strip()]
    for p in paras:
        if len(p) >= 200:
            return p
    return paras[0] if paras else ""

def two_sentence_summary(p: str) -> str:
    """Keep up to two sentences from the paragraph."""
    # Simple sentence split; handle ., ?, !
    parts = re.split(r"(?<=[\.\?\!])\s+", p)
    # Drop empty fragments
    parts = [s.strip() for s in parts if s.strip()]
    if not parts:
        return ""
    # Keep up to two sentences, and trim overly long output for safety
    summary = " ".join(parts[:2])
    return shorten(summary, width=800, placeholder="...")

# Comment: Build 1–2 sentence purpose per section
section_purposes = {}
for name, txt in section_texts.items():
    p = first_paragraph(clean_text(txt))
    section_purposes[name] = two_sentence_summary(p)

section_purposes

{'Introduction': '2 I. Introduction Recent developments in generative AI, catalyzed by ChatGPT, have become a focal point of discussion.',
 'Related Work / Background': '4 In Chapter 2, the theoretical foundation is established by detailing prior research and key related concepts pertinent to this study. The exploration encompasses the trends in generative AI and LLM on both domestic and international fronts.',
 'Methods': '19 III. Methods In this chapter, we outline a comprehensive framework for implementing generative AI services by effectively combining and orchestrating various technologies within the previously discussed generative AI technology stack.',
 'Experiment / Results': '20 3.2.1 RAG based implementation procedure The RAG model, as discussed in the architecture of the RAG model in Chapter 2, is a search-augmented generative model used to retrieve and generate responses based on information relevant to given questions or topics. Each step follows the procedure outlined in 
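
The summaries above begin with the page number and the heading text because both are part of the extracted page text. One way to remove them before summarizing is sketched here, assuming page numbers and headings each sit on their own line. `drop_page_furniture` and both regexes are illustrative names, not part of the notebook.

```python
import re

PAGE_NUMBER_RE = re.compile(r"^\d{1,3}$")
# Assumed heading shapes: "I. Introduction" or "3.2.1 RAG based implementation procedure".
HEADING_LINE_RE = re.compile(r"^(?:[IVXLCDM]+\.|\d+(?:\.\d+)+)\s+\S.{0,80}$")

def drop_page_furniture(text):
    """Remove lines that are only a page number or a short numbered heading."""
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if PAGE_NUMBER_RE.match(stripped) or HEADING_LINE_RE.match(stripped):
            continue
        kept.append(line)
    return "\n".join(kept)

demo = "2\nI. Introduction\nRecent developments in generative AI have become a focal point."
print(drop_page_furniture(demo))
```

Running this on each section's raw text before `clean_text` would keep the first paragraph free of page furniture. A short body line that happens to start with a number like "3.2" would also be dropped, so the length cap may need tuning.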

## [IMRaD Check] Decide overall format and what (if anything) is missing

In [10]:
# [IMRaD Check] Decide overall format and what (if anything) is missing
# Comment: We check which canonical IMRaD blocks are present based on our keys.

present = set(section_texts.keys())

has_intro = any("Introduction" in k for k in present)
has_methods = any("Methods" in k for k in present)
has_results = any("Experiment" in k or "Results" in k or "Evaluation" in k for k in present)
has_discussion = any("Conclusion" in k or "Discussion" in k for k in present)
has_related = any("Related" in k or "Background" in k for k in present)

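if has_intro and has_methods and has_results and has_discussion:
    overall_format = "IMRaD-like"
    # Comment: Only claim an explicit Related Work section when one was actually detected.
    overall_format += " with an explicit Related Work section." if has_related else "."
else:
    overall_format = "Partially IMRaD; some components may be merged or implicit."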

missing_notes = []
if not has_related:
    missing_notes.append("Related Work is not explicit.")
# Often Discussion is merged into Conclusion in such papers:
if has_discussion and any("Conclusion" in k for k in present) and not any("Discussion" in k for k in present):
    missing_notes.append("Discussion appears merged into Conclusion.")
if not has_results:
    missing_notes.append("Results/Experiment section is not clearly separated.")
if not missing_notes:
    missing_notes.append("No major structural gaps; minor merges are typical (e.g., Discussion in Conclusion).")

overall_format, missing_notes

('IMRaD-like with an explicit Related Work section.',
 ['No major structural gaps; minor merges are typical (e.g., Discussion in Conclusion).'])

In [11]:
# [Markdown Answer] Print a ready-to-paste Markdown block for Exercise 1 (Q3 & Q4)
# Comment: We inject the start pages we already know and the auto-summaries.

md_lines = []
md_lines.append("## Exercise 1: Article Structure Mapping — Answers (Q3 & Q4)")
md_lines.append("")
md_lines.append("### Sections (start page + 1–2 sentence purpose each)")
md_lines.append("")

ordering = [
    ("Introduction", 2),
    ("Related Work / Background", 4),
    ("Methods", 19),
    ("Experiment / Results", 20),
    ("Conclusion / Discussion", 26),
]

for name, page in ordering:
    purpose = section_purposes.get(name, "").strip()
    if not purpose:
        purpose = "_(Short section; summary requires brief manual reading.)_"
    md_lines.append(f"- **{name}** — page: {page}  ")
    md_lines.append(f"  Purpose: {purpose}")
    md_lines.append("")

md_lines.append("### Overall Format")
md_lines.append(f"- Format: {overall_format}")
md_lines.append(f"- Notes: {' '.join(missing_notes)}")

md_block = "\n".join(md_lines)
print(md_block)


## Exercise 1: Article Structure Mapping — Answers (Q3 & Q4)

### Sections (start page + 1–2 sentence purpose each)

- **Introduction** — page: 2  
  Purpose: 2 I. Introduction Recent developments in generative AI, catalyzed by ChatGPT, have become a focal point of discussion.

- **Related Work / Background** — page: 4  
  Purpose: 4 In Chapter 2, the theoretical foundation is established by detailing prior research and key related concepts pertinent to this study. The exploration encompasses the trends in generative AI and LLM on both domestic and international fronts.

- **Methods** — page: 19  
  Purpose: 19 III. Methods In this chapter, we outline a comprehensive framework for implementing generative AI services by effectively combining and orchestrating various technologies within the previously discussed generative AI technology stack.

- **Experiment / Results** — page: 20  
  Purpose: 20 3.2.1 RAG based implementation procedure The RAG model, as discussed in the architecture of

## Exercise 2: Critical Analysis of Experimental Design

In [13]:
# [Extract Sections] Methods & Experiment raw text
# Comment: Slice the PDF text using the page numbers we discovered.
import fitz
from pathlib import Path

assert 'PDF_PATH' in globals(), "Run the Upload cell first to define PDF_PATH."
PDF_PATH = Path(PDF_PATH)

METHODS_START = 19
EXPERIMENT_START = 20
CONCLUSION_START = 26  # end bound for Experiment

doc = fitz.open(PDF_PATH)
pages_full_text = [doc[i].get_text("text") for i in range(len(doc))]
doc.close()

def extract_range(pages, start_page, end_page_exclusive):
    s = start_page - 1
    e = end_page_exclusive - 1
    return "\n".join(pages[s:e])

methods_txt = extract_range(pages_full_text, METHODS_START, EXPERIMENT_START)
experiment_txt = extract_range(pages_full_text, EXPERIMENT_START, CONCLUSION_START)

print("Methods length:", len(methods_txt), "| Experiment length:", len(experiment_txt))

Methods length: 2181 | Experiment length: 11079


In [14]:
# [Heuristics] Utility cleaners and sentence splitter
# Comment: Basic cleanup and sentence split for keyword mining.
import re

def clean(t: str) -> str:
    t = t.replace("\xa0", " ")
    t = re.sub(r"[ \t]+", " ", t)
    t = re.sub(r"\n{2,}", "\n\n", t.strip())
    return t

def split_sentences(t: str):
    # Simple sentence split by punctuation; not perfect but OK for technical text.
    t = clean(t)
    parts = re.split(r"(?<=[\.\?\!])\s+", t)
    return [p.strip() for p in parts if p.strip()]

methods_sents = split_sentences(methods_txt)
exp_sents = split_sentences(experiment_txt)

len(methods_sents), len(exp_sents)


(14, 73)
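
The splitter above keeps the hard line wraps from the PDF inside each sentence and also splits after abbreviations such as "e.g.". A variant that addresses both is sketched below, assuming a small hand-written abbreviation list. `split_sentences_unwrapped` and `ABBREVIATIONS` are hypothetical names.

```python
import re

# Assumed abbreviation list; extend it for the paper at hand.
ABBREVIATIONS = ("e.g.", "i.e.", "et al.", "Fig.", "vs.")

def split_sentences_unwrapped(text):
    """Join hard-wrapped lines, then split on ., ?, ! except after known abbreviations."""
    flat = re.sub(r"\s*\n\s*", " ", text.strip())
    parts = re.split(r"(?<=[.?!])\s+", flat)
    sentences, buffer = [], ""
    for part in parts:
        buffer = f"{buffer} {part}".strip()
        if buffer.endswith(ABBREVIATIONS):
            continue  # the split point was an abbreviation, keep accumulating
        sentences.append(buffer)
        buffer = ""
    if buffer:
        sentences.append(buffer)
    return sentences

demo = "The RAG model retrieves documents, e.g. manuals \nand regulations. It then generates an answer."
print(split_sentences_unwrapped(demo))
```

Two trade-offs: paragraph breaks are flattened along with line wraps, and a sentence that genuinely ends in one of the listed abbreviations is merged with the next one.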

In [15]:
# [Mining] Extract candidate answers by keyword patterns
# Comment: We search sentences for cues to each question and select top candidates.

from collections import defaultdict

def find_sentences(sents, patterns, top=3):
    """Return up to 'top' sentences that match any regex in patterns."""
    out = []
    regs = [re.compile(p, flags=re.I) for p in patterns]
    for s in sents:
        if any(r.search(s) for r in regs):
            out.append(s)
        if len(out) >= top:
            break
    return out

# 1) Research question cues
rq_patterns = [
    r"\b(this study|we (aim|seek|investigate|study)|our goal|research question|we propose|we design)\b"
]

# 2) Type of study cues
type_patterns = [
    r"\b(case study|implementation case|prototype|experimental|experiment|comparative|evaluation|deployment)\b"
]

# 3) Variables cues (independent/dependent)
indep_patterns = [
    r"\b(we vary|we configured|we set|we adjust|parameter|top[- ]?k|chunk size|embedding model|retrieval|index|vector db|prompt|temperature|context window)\b"
]
dep_patterns = [
    r"\b(accuracy|precision|recall|latency|throughput|response time|cost|quality|relevance|hit rate|success rate|user (study|feedback)|evaluation metric)\b"
]

# 4) Datasets & tools cues
data_tools_patterns = [
    r"\b(dataset|corpus|enterprise data|knowledge base|documents|pdf|csv|log|ETL)\b",
    r"\b(Chroma|FAISS|Weaviate|Pinecone|Milvus|LangChain|LlamaIndex|Azure OpenAI|OpenAI|Hugging Face|PostgreSQL|Elastic|Docker|Kubernetes|Ray|Airflow|GCS|S3)\b"
]

# 5) Control/baseline cues
baseline_patterns = [
    # Comment: "vs\.?" keeps the period optional so the trailing \b can still match in "vs. X".
    r"\b(baseline|compared to|versus|vs\.?|control|ablation|alternative)\b"
]

# 6) Repeatability/transparency cues
repeat_patterns = [
    r"\b(reproducible|repeatable|we release|open[- ]source|code|parameters|configuration|implementation details|pipeline|procedure|steps)\b"
]

candidates = defaultdict(list)

# Research question -> search both sections (usually stated in the opening of Methods or Experiment)
candidates['research_question'] = find_sentences(methods_sents + exp_sents, rq_patterns, top=3)
# Type of study -> search Experiment first, then Methods
candidates['study_type'] = find_sentences(exp_sents + methods_sents, type_patterns, top=3)
# Independent variables (likely in Methods)
candidates['independent'] = find_sentences(methods_sents + exp_sents, indep_patterns, top=5)
# Dependent variables / metrics (likely in Experiment)
candidates['dependent'] = find_sentences(exp_sents + methods_sents, dep_patterns, top=5)
# Datasets & tools
candidates['data_tools'] = find_sentences(methods_sents + exp_sents, data_tools_patterns, top=8)
# Baseline/control
candidates['baseline'] = find_sentences(exp_sents + methods_sents, baseline_patterns, top=5)
# Repeatability/transparency
candidates['repeatability'] = find_sentences(methods_sents + exp_sents, repeat_patterns, top=5)

candidates


defaultdict(list,
            {'research_question': ['Considering the awareness and cost aspects of these solutions, this study has proposed a framework that primarily \nrelies on open-source products in alignment with the findings from Chapter 2.'],
             'study_type': ['Experiment \nIn this chapter, the generative AI service implementation framework introduced in Chapter 3 is utilized to implement \nvarious scenarios based on enterprise internal data using the integrated RAG model and LangChain according to the \nimplementation procedure.'],
             'independent': ['The following sections detail the implementation process for each technology \ncomponent within the framework: \n \n3.1 Framework for Implementing Generative AI Services using RAG Model \nBased on previous research, we have designed a comprehensive framework for implementing generative AI \nservices using the Retrieval-Augmented Generation (RAG) model.',
              "LangChain's module is utilized to split d

In [16]:
# [Synthesis] Turn candidates into concise answers (fallbacks if nothing found)
# Comment: We compose short, direct answers based on mined sentences + heuristics.

def summarize_list(lst, max_len=2):
    return " ".join(lst[:max_len]) if lst else ""

answers = {}

# 1) Research question
answers['research_question'] = summarize_list(candidates['research_question']) or \
    "The study investigates how to implement enterprise-grade Generative AI services using a RAG-based LLM application architecture."

# 2) Type of study
type_guess = "implementation case with experimental evaluation"
if candidates['study_type']:
    if re.search(r"comparative", " ".join(candidates['study_type']), flags=re.I):
        type_guess = "comparative experimental study"
    elif re.search(r"experiment|experimental|evaluation", " ".join(candidates['study_type']), flags=re.I):
        type_guess = "experimental evaluation of an implementation"
answers['study_type'] = type_guess

# 3) Independent / Dependent variables
indep_guess = "Configuration choices such as chunk size, top-k retrieval, embedding/LLM selection, and prompt parameters."
dep_guess = "Outcome measures such as relevance/quality of answers, and possibly latency or cost if reported."
answers['independent_vars'] = summarize_list(candidates['independent']) or indep_guess
answers['dependent_vars'] = summarize_list(candidates['dependent']) or dep_guess

# 4) Datasets & tools
answers['datasets_tools'] = summarize_list(candidates['data_tools'], max_len=4) or \
    "Enterprise knowledge sources (documents) plus standard RAG components (vector DB such as Chroma/FAISS, framework like LangChain/LlamaIndex, and an LLM provider)."

# 5) Control / Baseline
answers['baseline'] = summarize_list(candidates['baseline'], max_len=2) or \
    "No explicit baseline/control is clearly stated in the extracted text; if present, it may be qualitative or implicit."

# 6) Repeatability / Transparency
answers['repeatability'] = summarize_list(candidates['repeatability'], max_len=3) or \
    "The procedure is described at a system level (pipelines, components, and steps). Exact parameters/code availability may limit full reproducibility."

answers


{'research_question': 'Considering the awareness and cost aspects of these solutions, this study has proposed a framework that primarily \nrelies on open-source products in alignment with the findings from Chapter 2.',
 'study_type': 'experimental evaluation of an implementation',
 'independent_vars': "The following sections detail the implementation process for each technology \ncomponent within the framework: \n \n3.1 Framework for Implementing Generative AI Services using RAG Model \nBased on previous research, we have designed a comprehensive framework for implementing generative AI \nservices using the Retrieval-Augmented Generation (RAG) model. LangChain's module is utilized to split data into chunks that are suitable \nfor retrieval.",
 'dependent_vars': 'Considering the awareness and cost aspects of these solutions, this study has proposed a framework that primarily \nrelies on open-source products in alignment with the findings from Chapter 2.',
 'datasets_tools': "The diagram
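
One regex subtlety is worth checking when writing cue patterns like the baseline pattern in the mining cell: `\b` matches only between a word character and a non-word character, so a trailing `\b` placed right after a literal period fails when a space follows. The short demonstration below uses a made-up sentence.

```python
import re

text = "RAG vs. fine-tuning"

# Trailing \b after the period needs a word character next, but a space follows.
strict = re.compile(r"\b(versus|vs\.)\b", re.IGNORECASE)
# Making the period optional lets \b match between "s" and ".".
relaxed = re.compile(r"\b(versus|vs\.?)\b", re.IGNORECASE)

print(bool(strict.search(text)))   # False
print(bool(relaxed.search(text)))  # True
```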

In [17]:
# [Markdown Output] Ready-to-paste answers for Exercise 2
# Comment: Produce a clean Markdown block with concise responses.
md2 = f"""## Exercise 2: Critical Analysis of Experimental Design

**Target sections:** Methods (p.19) and Experiment (p.20–25)

1) **Research question**
{answers['research_question']}

2) **Type of study (implementation case / experimental / comparative)**
{answers['study_type']}

3) **Independent and dependent variables**
- **Independent:** {answers['independent_vars']}
- **Dependent:** {answers['dependent_vars']}

4) **Datasets and tools used**
{answers['datasets_tools']}

5) **Control or comparison method (baseline / alternative)**
{answers['baseline']}

6) **Repeatability and transparency**
{answers['repeatability']}
"""
print(md2)


## Exercise 2: Critical Analysis of Experimental Design

**Target sections:** Methods (p.19) and Experiment (p.20–25)

1) **Research question**  
Considering the awareness and cost aspects of these solutions, this study has proposed a framework that primarily 
relies on open-source products in alignment with the findings from Chapter 2.

2) **Type of study (implementation case / experimental / comparative)**  
experimental evaluation of an implementation

3) **Independent and dependent variables**  
- **Independent:** The following sections detail the implementation process for each technology 
component within the framework: 
 
3.1 Framework for Implementing Generative AI Services using RAG Model 
Based on previous research, we have designed a comprehensive framework for implementing generative AI 
services using the Retrieval-Augmented Generation (RAG) model. LangChain's module is utilized to split data into chunks that are suitable 
for retrieval.  
- **Dependent:** Considering the 
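
Because `find_sentences` returns the first sentences that match any cue, a single generic word such as "cost" can fill the dependent-variable slot with a sentence about something else. Ranking by the number of distinct cues matched is one possible refinement, sketched below. `rank_sentences` and the demo sentences are hypothetical.

```python
import re

def rank_sentences(sentences, keywords, top=3):
    """Order sentences by how many distinct keywords they contain (ties keep document order)."""
    regs = [re.compile(rf"\b{re.escape(k)}\b", re.IGNORECASE) for k in keywords]
    scored = []
    for pos, sent in enumerate(sentences):
        score = sum(1 for r in regs if r.search(sent))
        if score:
            scored.append((-score, pos, sent))
    return [sent for _, _, sent in sorted(scored)[:top]]

# Made-up sentences for illustration only.
demo_sents = [
    "This study proposes a framework considering cost aspects.",
    "We measured latency and cost per query.",
    "The chatbot answers questions about regulations.",
]
print(rank_sentences(demo_sents, ["latency", "cost", "accuracy"], top=2))
```

The sentence that mentions two cues is returned first, and sentences with no cue are dropped entirely.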

## Exercise 3: Evaluation Metrics and Evidence

In [18]:
# [Extract] Focus on Experiment/Results region to mine claims & metrics
# Comment: We reuse the page bounds found earlier.
import fitz
from pathlib import Path

assert 'PDF_PATH' in globals(), "Run the Upload cell first."
PDF_PATH = Path(PDF_PATH)

EXPERIMENT_START = 20
CONCLUSION_START = 26

doc = fitz.open(PDF_PATH)
pages_full_text = [doc[i].get_text("text") for i in range(len(doc))]
doc.close()

def extract_range(pages, start_page, end_page_exclusive):
    s = start_page - 1
    e = end_page_exclusive - 1
    return "\n".join(pages[s:e])

exp_txt = extract_range(pages_full_text, EXPERIMENT_START, CONCLUSION_START)
len(exp_txt)


11079

In [19]:
# [Prep] Clean text and split sentences
# Comment: Simple normalization to improve regex search quality.
import re

def clean(t: str) -> str:
    t = t.replace("\xa0", " ")
    t = re.sub(r"[ \t]+", " ", t)
    t = re.sub(r"\n{2,}", "\n\n", t.strip())
    return t

def split_sentences(t: str):
    t = clean(t)
    parts = re.split(r"(?<=[\.\?\!])\s+", t)
    return [p.strip() for p in parts if p.strip()]

exp_sents = split_sentences(exp_txt)
len(exp_sents)


73

In [20]:
# [Mining] Detect performance claims, model names, and metrics (heuristics)
# Comment: We look for typical phrases (e.g., "compared to", metric names like accuracy/latency, model names).
from collections import defaultdict

patterns = {
    "claims": [
        # Comment: "vs\.?" keeps the period optional so the trailing \b can still match in "vs. X".
        r"\b(compared to|versus|vs\.?|improve(s|d)?|outperform(s|ed)?|better than|higher than|lower than)\b",
        r"\b(RAG|retrieval[- ]?augmented|fine[- ]?tuning|zero[- ]?shot|few[- ]?shot)\b",
        r"\b(ablations?|baseline)\b",
    ],
    "models": [
        r"\bGPT[- ]?3\.5\b|\bGPT[- ]?4\b|\bLlama[- ]?\d\b|\bMistral\b|\bClaude\b|\bPaLM\b|\bGemini\b",
        r"\bAzure OpenAI\b|\bOpenAI\b|\bLlamaIndex\b|\bLangChain\b",
    ],
    "metrics": [
        r"\baccuracy\b|\bprecision\b|\brecall\b|\bF1\b|\bbleu\b|\brouge[- ]?(L|1|2)?\b|\bNDCG\b|\bMRR\b|\bhit rate\b|\bMAP\b",
        r"\blatency\b|\bresponse time\b|\bthroughput\b|\bcost\b|\btoken(s)?\b|\bprice\b",
        r"\bhuman (evaluation|feedback|study|ratings?)\b|\buser study\b|\bsubjective\b|\bLikert\b",
        r"\brobust(ness)?\b|\bgeneralization\b|\baccuracy on\b|\berror rate\b",
    ],
}

def find_by_patterns(sents, pats, top=50):
    res = []
    regs = [re.compile(p, flags=re.I) for p in pats]
    for s in sents:
        if any(r.search(s) for r in regs):
            res.append(s)
            if len(res) >= top:
                break
    return res

mined = {
    "claims": find_by_patterns(exp_sents, patterns["claims"], top=80),
    "models": find_by_patterns(exp_sents, patterns["models"], top=80),
    "metrics": find_by_patterns(exp_sents, patterns["metrics"], top=80),
}
{ k: len(v) for k,v in mined.items() }


{'claims': 3, 'models': 15, 'metrics': 0}
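
The counts above show how many sentences matched each category but not which terms triggered the matches. A per-term tally is sketched below. `count_term_hits`, `MODEL_TERMS`, and the demo sentences are illustrative assumptions, not output from the paper.

```python
import re
from collections import Counter

def count_term_hits(sentences, pattern):
    """Count each distinct matched term (lower-cased) across all sentences."""
    rx = re.compile(pattern, re.IGNORECASE)
    counts = Counter()
    for sent in sentences:
        for m in rx.finditer(sent):
            counts[m.group(0).lower()] += 1
    return counts

# Assumed subset of the model/framework names used in the mining cell.
MODEL_TERMS = r"\b(?:GPT[- ]?3\.5|GPT[- ]?4|LangChain|LlamaIndex|OpenAI)\b"
demo_sents = [
    "LangChain's module is utilized to split data into chunks.",
    "The service calls the OpenAI API through LangChain.",
]
print(count_term_hits(demo_sents, MODEL_TERMS))
```

Applied to `exp_sents`, a tally like this would show whether the 15 "models" hits come from many different names or mostly from one framework.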

In [22]:
# [Synthesis] Build concise answers for the 4 questions
# Comment: re.search returns a Match object or None, so wrap it in bool(); any() would raise a TypeError here.

def shortlist(lines, n=4, max_chars=300):
    out = []
    for s in lines[:n]:
        s_clean = s.strip()
        if len(s_clean) > max_chars:
            s_clean = s_clean[:max_chars].rstrip() + "..."
        out.append(f"- {s_clean}")
    if not out:
        out = ["- (No explicit sentences mined; the paper may be descriptive or qualitative in this part.)"]
    return "\n".join(out)

claims_block = shortlist(mined["claims"], n=5)
models_block = shortlist(mined["models"], n=5)
metrics_block = shortlist(mined["metrics"], n=6)

# Decide appropriateness & suggestions
text_metrics = "\n".join(mined["metrics"])
has_quant_metrics = bool(re.search(r"(accuracy|precision|recall|F1|BLEU|ROUGE|NDCG|MRR|MAP|hit rate)", text_metrics, re.I))
has_latency_cost  = bool(re.search(r"(latency|response time|throughput|cost|token)", text_metrics, re.I))
has_human_eval    = bool(re.search(r"(human (evaluation|feedback|study|ratings?)|user study|Likert)", text_metrics, re.I))

appropriateness = []
if has_quant_metrics:
    appropriateness.append("Quantitative metrics are mentioned, which are generally appropriate for evaluating retrieval/answer quality.")
else:
    appropriateness.append("No clear quantitative metrics were detected; evidence appears descriptive or system-level.")

if has_latency_cost:
    appropriateness.append("Operational aspects (latency/cost) are considered, which align with enterprise constraints.")
else:
    appropriateness.append("Latency/cost are not explicitly reported; for enterprise deployment, these are important.")

if has_human_eval:
    appropriateness.append("Human evaluation is referenced, which helps assess answer usefulness.")
else:
    appropriateness.append("No human evaluation found; user studies or expert ratings would strengthen conclusions.")

suggestions = []
if not has_quant_metrics:
    suggestions.append("Add task-appropriate quantitative metrics (e.g., retrieval NDCG/MRR, QA Exact Match/F1).")
if not has_latency_cost:
    suggestions.append("Report latency and cost per query (tokens, dollar cost) under realistic loads.")
if not has_human_eval:
    suggestions.append("Include small-scale human evaluation (usefulness, correctness, citations) with clear rubrics.")
suggestions.append("Provide ablations (e.g., chunk size, top-k, embed model) and confidence intervals to show robustness.")
suggestions.append("If claiming improvements vs. baselines, specify baselines and test sets clearly.")

answers3 = {
    "claims": claims_block,
    "metrics_found": metrics_block,
    "models_mentioned": models_block,
    "appropriateness": " ".join(appropriateness),
    # Comment: One suggestion per Markdown bullet; the template supplies the first "- ".
    "what_to_add": "\n- ".join(suggestions)
}

answers3


{'claims': '- 20 \n \n3.2.1 RAG based implementation procedure \nThe RAG model, as discussed in the architecture of the RAG model in Chapter 2, is a search-augmented \ngenerative model used to retrieve and generate responses based on information relevant to given questions or topics.\n- The RAG-based implementation procedure outlined above illustrates how the RAG model, in combination with \nLangChain, can be effectively integrated into the generative AI service framework.\n- Experiment \nIn this chapter, the generative AI service implementation framework introduced in Chapter 3 is utilized to implement \nvarious scenarios based on enterprise internal data using the integrated RAG model and LangChain according to the \nimplementation procedure.',
 'metrics_found': '- (No explicit sentences mined; the paper may be descriptive or qualitative in this part.)',
 'models_mentioned': "- Preparatory materials \nrelated to the task, such as regulations, user manuals, and terms and conditions, a
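
The suggestions name retrieval metrics such as MRR and answer-level F1. A minimal sketch of how they are commonly computed follows; the document IDs and answers are made up, and nothing here comes from the paper's experiments.

```python
from collections import Counter

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked_ids, relevant_ids) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)

def token_f1(prediction, reference):
    """Token-overlap F1 between a generated answer and a reference answer."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

# Made-up retrieval runs: (ranked document IDs, set of relevant IDs).
demo_runs = [(["d3", "d1", "d7"], {"d1"}), (["d2", "d5"], {"d2"})]
print(mean_reciprocal_rank(demo_runs))  # (1/2 + 1/1) / 2 = 0.75
print(token_f1("leave requests need manager approval", "leave requests require manager approval"))
```

Both metrics need a labeled test set (questions with known relevant chunks and reference answers), which is the main cost of adding them to an enterprise RAG evaluation.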

In [23]:
# [Markdown Output] Ready-to-paste block for Exercise 3
md3 = f"""## Exercise 3: Evaluation Metrics and Evidence

**Goal:** Understand and critique how the paper evaluates success.
**Focus region:** Experiment/Results (pp. 20–25)

### 1) Performance claims or comparisons (mined sentences)
{answers3['claims']}

**Models / frameworks mentioned (mined sentences)**
{answers3['models_mentioned']}

### 2) Evaluation metrics used (if any)
{answers3['metrics_found']}

### 3) Are these metrics appropriate for the task?
{answers3['appropriateness']}

### 4) What could be added to strengthen the evidence?
- {answers3['what_to_add']}
"""

print(md3)


## Exercise 3: Evaluation Metrics and Evidence

**Goal:** Understand and critique how the paper evaluates success.  
**Focus region:** Experiment/Results (pp. 20–25)

### 1) Performance claims or comparisons (mined sentences)
- 20 
 
3.2.1 RAG based implementation procedure 
The RAG model, as discussed in the architecture of the RAG model in Chapter 2, is a search-augmented 
generative model used to retrieve and generate responses based on information relevant to given questions or topics.
- The RAG-based implementation procedure outlined above illustrates how the RAG model, in combination with 
LangChain, can be effectively integrated into the generative AI service framework.
- Experiment 
In this chapter, the generative AI service implementation framework introduced in Chapter 3 is utilized to implement 
various scenarios based on enterprise internal data using the integrated RAG model and LangChain according to the 
implementation procedure.

**Models / frameworks mentioned (mined s

## Exercise 4: Take Notes Using the Cornell Method

In [24]:
# [Config] Choose the technical subsection to extract
# Comment: Set the exact heading label you want to target. We default to "3.2.1".
TARGET_SUBSECTION_LABEL = "3.2.1"
# Comment: (Optional) If you know part of the heading text, add it to tighten the match.
TARGET_HEADING_HINT = "RAG"  # e.g., "RAG Based Implementation Procedure" (can be empty)

# Comment: Safety check
assert 'PDF_PATH' in globals(), "Please run the upload cell first to define PDF_PATH."


In [25]:
# [Extract Lines] Read PDF as lines per page for robust heading matching
# Comment: We'll search for the subsection heading at line starts and capture until the next numbered heading.
import fitz, re
from pathlib import Path

PDF_PATH = Path(PDF_PATH)

def get_pages_lines(pdf_path: Path):
    """Return one list of right-stripped text lines per PDF page."""
    pages = []
    # Comment: The context manager closes the document even if extraction fails.
    with fitz.open(pdf_path) as doc:
        for page in doc:
            text = page.get_text("text")
            pages.append([ln.rstrip() for ln in text.splitlines()])
    return pages

pages_lines = get_pages_lines(PDF_PATH)
len(pages_lines)


28

In [26]:
# [Find Subsection] Locate the start line of the target subsection and slice until the next heading
# Comment: Headings like "3.2.1 ..." then next numeric heading "3.2.2 ..." or "3.3 ..." mark the end.

# Comment: Build a strict regex for the target heading at line start (e.g., "^3.2.1 ...")
if TARGET_HEADING_HINT:
    heading_re = re.compile(rf"^{re.escape(TARGET_SUBSECTION_LABEL)}\b.*{re.escape(TARGET_HEADING_HINT)}", re.IGNORECASE)
else:
    heading_re = re.compile(rf"^{re.escape(TARGET_SUBSECTION_LABEL)}\b", re.IGNORECASE)

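# Comment: Generic regex for next numbered heading like "3.2.2", "3.3", "4.", etc., at line start.
# Comment: Caveat: it also matches numbered list items ("1. ...") and bare page numbers,
# so the capture can stop before the real next heading; check the collected line count.
next_heading_re = re.compile(r"^\d+(?:\.\d+){0,3}\b")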

start_pos = None  # (page_idx, line_idx)
for p_idx, lines in enumerate(pages_lines):
    for l_idx, line in enumerate(lines):
        if heading_re.search(line.strip()):
            start_pos = (p_idx, l_idx)
            break
    if start_pos:
        break

assert start_pos is not None, f"Could not find subsection starting with '{TARGET_SUBSECTION_LABEL}' (hint='{TARGET_HEADING_HINT}')."

# Comment: Collect lines from start until the next numbered heading or document end
p0, l0 = start_pos
collected = []
hit_next_heading = False
for p_idx in range(p0, len(pages_lines)):
    lines = pages_lines[p_idx]
    # Comment: On the first page start at the heading line, otherwise at the top of the page
    first_line = l0 if p_idx == p0 else 0
    for li in range(first_line, len(lines)):
        line = lines[li]
        if (p_idx, li) != start_pos and next_heading_re.match(line.strip()):
            # Comment: Stop at the first subsequent heading
            hit_next_heading = True
            break
        collected.append(line)
    if hit_next_heading:
        break

subsection_text = "\n".join(collected).strip()
print(f"Start at page {p0+1}, line {l0+1}; collected {len(collected)} lines.")
print(subsection_text[:1000])


Start at page 20, line 3; collected 4 lines.
3.2.1 RAG based implementation procedure
The RAG model, as discussed in the architecture of the RAG model in Chapter 2, is a search-augmented
generative model used to retrieve and generate responses based on information relevant to given questions or topics.
Each step follows the procedure outlined in Fig 9.


In [27]:
# [Mine Cues/Notes] Extract key terms and step-like sentences to pre-fill Cornell notes
# Comment: Light heuristics: keyword spotting + sentences with procedure verbs (build, index, retrieve, embed, etc.)
import re

def clean_text(t: str) -> str:
    t = t.replace("\xa0", " ")
    t = re.sub(r"[ \t]+", " ", t)
    t = re.sub(r"\n{2,}", "\n\n", t.strip())
    return t

text_clean = clean_text(subsection_text)

# Comment: Candidate domain keywords commonly seen in RAG pipelines
KEYWORD_CANDIDATES = [
    "chunk", "chunking", "overlap", "embedding", "embedding model", "vector", "vector DB",
    "Chroma", "FAISS", "Milvus", "Pinecone", "Weaviate",
    "retriever", "top-k", "similarity", "cosine", "index",
    "LangChain", "LlamaIndex", "prompt", "template", "system prompt",
    "context window", "document", "metadata", "citation",
    "rerank", "BM25", "hybrid search",
    "OpenAI", "Azure OpenAI", "Llama", "Mistral", "GPT-4", "GPT-3.5",
    "API", "endpoint", "pipeline", "ETL", "preprocessing"
]

present_keywords = []
lower_text = text_clean.lower()
for kw in KEYWORD_CANDIDATES:
    if kw.lower() in lower_text and kw not in present_keywords:
        present_keywords.append(kw)

# Comment: Build "Cues" as questions like "What is <kw>?" for the first ~8 keywords
def cues_from_keywords(kws, n=8):
    items = []
    for kw in kws[:n]:
        items.append(f"- What is **{kw}**?")
    return items

# Comment: Extract procedure-like sentences (imperative or contains action verbs)
sentences = re.split(r"(?<=[\.\?\!])\s+", text_clean)
ACTION_VERBS = r"(configure|build|index|embed|retrieve|chunk|split|store|query|rerank|evaluate|deploy|scale|monitor|log)"
step_sents = [s for s in sentences if re.search(ACTION_VERBS, s, re.IGNORECASE)]
step_sents = step_sents[:8] if step_sents else sentences[:6]

cornell_cues = cues_from_keywords(present_keywords, n=8)
cornell_notes_bullets = [f"- {s.strip()}" for s in step_sents]
len(present_keywords), len(cornell_notes_bullets)


(0, 1)

In [28]:
# [Draft Summary] Build a 3–4 sentence draft summary from the first paragraph
# Comment: Keep it short; user can tweak wording.
paras = [p.strip() for p in text_clean.split("\n\n") if p.strip()]
first_para = paras[0] if paras else text_clean
draft_sents = re.split(r"(?<=[\.\?\!])\s+", first_para)
draft_summary = " ".join(draft_sents[:4]).strip()

print(draft_summary[:800])


3.2.1 RAG based implementation procedure
The RAG model, as discussed in the architecture of the RAG model in Chapter 2, is a search-augmented
generative model used to retrieve and generate responses based on information relevant to given questions or topics. Each step follows the procedure outlined in Fig 9.


In [29]:
# [Markdown Output] Cornell notes ready-to-paste block
# Comment: We render a Cornell layout with Cues (left), Notes (right), and Summary section.
from textwrap import shorten

def bullet_block(items, max_items=None, max_chars=300):
    out = []
    count = 0
    for it in items:
        if max_items is not None and count >= max_items: break
        it_trim = it if len(it) <= max_chars else shorten(it, width=max_chars, placeholder="...")
        out.append(it_trim)
        count += 1
    return "\n".join(out) if out else "- (Fill manually)"

md4 = f"""## Exercise 4: Cornell Notes — Technical Subsection

**Section chosen:** {TARGET_SUBSECTION_LABEL} (auto-detected)

### Cues (Left Column)
{bullet_block(cornell_cues, max_items=8, max_chars=140)}

### Notes (Right Column)
{bullet_block(cornell_notes_bullets, max_items=10, max_chars=300)}

### Summary (3–4 sentences)
{draft_summary}
"""

print(md4)


## Exercise 4: Cornell Notes — Technical Subsection

**Section chosen:** 3.2.1 (auto-detected)

### Cues (Left Column)
- (Fill manually)

### Notes (Right Column)
- 3.2.1 RAG based implementation procedure
The RAG model, as discussed in the architecture of the RAG model in Chapter 2, is a search-augmented
generative model used to retrieve and generate responses based on information relevant to given questions or topics.

### Summary (3–4 sentences)
3.2.1 RAG based implementation procedure
The RAG model, as discussed in the architecture of the RAG model in Chapter 2, is a search-augmented
generative model used to retrieve and generate responses based on information relevant to given questions or topics. Each step follows the procedure outlined in Fig 9.



## Exercise 4: Cornell Notes — Technical Subsection

**Section chosen:** 3.2.1 RAG Based Implementation Procedure

| Cues (Key terms / Questions) | Notes (Explanations / Steps) |
|------------------------------|------------------------------|
| What is the RAG model? | The RAG model is a search-augmented generative model that retrieves and generates responses based on relevant information for a given question or topic. |
| What is the purpose of RAG? | To combine retrieval from a knowledge base with generation, ensuring outputs are grounded in specific data. |
| How is RAG implemented in this study? | Follows a multi-step pipeline as outlined in Fig. 9, including document chunking, embedding, storing in a vector database, and retrieval during generation. |
| Which components are involved? | Chunking module, embedding model, vector database (e.g., Chroma), retriever, and LLM for generation. |
| What triggers the retrieval? | User queries or prompts initiate a search in the vector DB for the most relevant chunks. |
| How are retrieved results used? | They are appended to the prompt given to the LLM to provide context and improve factual accuracy. |

**Summary (3-4 sentences)**  
Section 3.2.1 (RAG Based Implementation Procedure) describes a search-augmented generative model architecture that integrates retrieval and generation. The process involves chunking documents, embedding them, storing the embeddings in a vector database, and retrieving the most relevant chunks at query time. Retrieved content is then appended to the LLM's input to produce more accurate, context-aware responses. This approach helps keep generated answers grounded in enterprise data.


# Exercise 5: Summarize the Paper with 5W1H

In [31]:
# [Extract Full Intro + Conclusion] for 5W1H info mining
# Comment: Intro and Conclusion usually contain Who, What, Why, and high-level How.
INTRO_START = 2
RELATED_START = 4
CONCLUSION_START = 26
END_PAGE = len(pages_full_text)

intro_txt = extract_range(pages_full_text, INTRO_START, RELATED_START)
conclusion_txt = extract_range(pages_full_text, CONCLUSION_START, END_PAGE)

full_text_5w1h = intro_txt + "\n" + conclusion_txt
len(full_text_5w1h)

15399

In [32]:
# [Prep Sentences] Clean and split for keyword search
import re

def clean(t: str) -> str:
    t = t.replace("\xa0", " ")
    t = re.sub(r"[ \t]+", " ", t)
    t = re.sub(r"\n{2,}", "\n\n", t.strip())
    return t

def split_sentences(t: str):
    t = clean(t)
    parts = re.split(r"(?<=[\.\?\!])\s+", t)
    return [p.strip() for p in parts if p.strip()]

sent_5w1h = split_sentences(full_text_5w1h)
len(sent_5w1h)

121

In [33]:
# [Mining] Simple keyword-based extraction for each W and H
from collections import defaultdict

patterns_5w1h = {
    "who": [r"\b(author|we|research team|university|institute|company|organization)\b"],
    "what": [r"\b(we propose|we present|this paper|this study|this work|developed|implemented|designed|introduced)\b"],
    "when_where": [r"\b(published|conference|journal|in\s\d{4}|presented at|location|venue)\b"],
    "why": [r"\b(problem|challenge|need|requirement|important|motivation|necessity|aim|goal)\b"],
    "how": [r"\b(implemented|designed|architecture|approach|method|pipeline|framework|system)\b"]
}

def find_matches(sentences, patterns, top=3):
    regs = [re.compile(p, flags=re.I) for p in patterns]
    out = []
    for s in sentences:
        if any(r.search(s) for r in regs):
            out.append(s)
            if len(out) >= top:
                break
    return out

matches_5w1h = {k: find_matches(sent_5w1h, v) for k,v in patterns_5w1h.items()}
matches_5w1h

{'who': ['Notably, similar to all APIs, data \ntransmitted through the fine-tuning API is owned by the customers, and OpenAI or any other organization is precluded \nfrom employing this data for training alternative models, as officially stated.',
  'Subsequently, when user \nqueries arise—such as inquiries regarding company dress codes through a chatbot—pertinent information can be \nretrieved and presented to the LLM through prompts, proving to be a more practical and efficient approach.',
  'Conclusion and Discussion \nIn this study, we presented methods and implementation cases for developing generative AI services using LLM \napplication architecture, aiming to explore avenues for advancing the development and industrial utilization of \ngenerative AI technology.'],
 'what': ['In pursuit of overcoming such limitations, OpenAI introduced the capability to fine-tune the GPT-3.5 Turbo \nmodel, a significant advancement unveiled in August 2023.',
  'Notably, in 2021, DeepMind introduc

In [34]:
# [Fallback fill] Build concise answers
def concat_or_fallback(key, fallback):
    return " ".join(matches_5w1h.get(key) or [fallback])

who_ans = concat_or_fallback("who", "The authors of the study (not explicitly named in extracted text).")
what_ans = concat_or_fallback("what", "A RAG-based LLM application architecture for enterprise data integration.")
when_where_ans = concat_or_fallback("when_where", "Published in a scientific venue; exact date/location not detected in extracted text.")
why_ans = concat_or_fallback("why", "To address the need for accurate, context-aware generative AI in enterprise environments.")
how_ans = concat_or_fallback("how", "Implemented using a retrieval-augmented generation pipeline with document chunking, embedding, vector storage, and LLM-based response generation.")

# Build the 4-sentence paragraph
objective = why_ans
methodology = how_ans
findings = what_ans
implications = "The architecture can be adapted to various enterprise contexts to improve information retrieval, reduce hallucinations, and support decision-making."

answers5 = {
    "who": who_ans,
    "what": what_ans,
    "when_where": when_where_ans,
    "why": why_ans,
    "how": how_ans,
    "objective": objective,
    "methodology": methodology,
    "findings": findings,
    "implications": implications
}

answers5

{'who': 'Notably, similar to all APIs, data \ntransmitted through the fine-tuning API is owned by the customers, and OpenAI or any other organization is precluded \nfrom employing this data for training alternative models, as officially stated. Subsequently, when user \nqueries arise—such as inquiries regarding company dress codes through a chatbot—pertinent information can be \nretrieved and presented to the LLM through prompts, proving to be a more practical and efficient approach. Conclusion and Discussion \nIn this study, we presented methods and implementation cases for developing generative AI services using LLM \napplication architecture, aiming to explore avenues for advancing the development and industrial utilization of \ngenerative AI technology.',
 'what': 'In pursuit of overcoming such limitations, OpenAI introduced the capability to fine-tune the GPT-3.5 Turbo \nmodel, a significant advancement unveiled in August 2023. Notably, in 2021, DeepMind introduced RETRO, utilizin

In [35]:
# [Markdown Output] Ready-to-paste block for Exercise 5
md5 = f"""## Exercise 5: 5W1H Summary

**Who conducted the study?**
{answers5['who']}

**What was developed or proposed?**
{answers5['what']}

**When and Where was it published?**
{answers5['when_where']}

**Why is the problem important?**
{answers5['why']}

**How was the solution implemented?**
{answers5['how']}

---

**Four-sentence paragraph summary:**
1) Objective: {answers5['objective']}
2) Methodology: {answers5['methodology']}
3) Findings / System built: {answers5['findings']}
4) Practical implications: {answers5['implications']}
"""

print(md5)

## Exercise 5: 5W1H Summary

**Who conducted the study?**  
Notably, similar to all APIs, data 
transmitted through the fine-tuning API is owned by the customers, and OpenAI or any other organization is precluded 
from employing this data for training alternative models, as officially stated. Subsequently, when user 
queries arise—such as inquiries regarding company dress codes through a chatbot—pertinent information can be 
retrieved and presented to the LLM through prompts, proving to be a more practical and efficient approach. Conclusion and Discussion 
In this study, we presented methods and implementation cases for developing generative AI services using LLM 
application architecture, aiming to explore avenues for advancing the development and industrial utilization of 
generative AI technology.

**What was developed or proposed?**  
In pursuit of overcoming such limitations, OpenAI introduced the capability to fine-tune the GPT-3.5 Turbo 
model, a significant advancement unveiled

# Exercise 6: Design Reflection — Apply What You Learned

### Use Case
**Educational RAG Chatbot for Autism Support**  
Audience: parents, teachers, educators.  
Goal: answer practical questions (communication, routines, behavior management) with **citations to sources**.

---

### RAG Pipeline (Simplified Sketch)
1) **Ingestion** → collect scientific articles (PDF), institutional guides (HTML/PDF), internal FAQs (Docs).  
2) **Preprocessing** → text cleaning, normalization, metadata extraction (title, author, year, URL).  
3) **Chunking** → split into sections (≈300-500 tokens), 10-20% overlap, store `source`, `page`, `date`.  
4) **Embeddings** → encode each chunk (e.g., a sentence-transformers model or an equivalent of `text-embedding-3-large`).  
5) **Vector Store** → upsert (index + metadata).  
6) **Retrieval** → top-k (k=4-8), optionally combined with BM25/hybrid search and re-ranking.  
7) **Generation** → structured prompt with *citations* (sources + pages), honesty instructions (“say I don't know”).  
8) **Guardrails** → filters (off-topic; medical requests → disclaimer + suggest professional advice), contradiction detection.  
9) **Feedback Loop** → “useful/not useful” button, query logging, continuous improvement.
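
Steps 4 to 6 above can be sketched without any external service. This is a toy illustration only: `embed` is a hypothetical bag-of-words stand-in for a real embedding model, a plain Python list replaces the vector store, and the sample chunks and file names are invented for the example.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query: str, index: list, k: int = 4) -> list:
    """Return the k highest-scoring (score, chunk) pairs for the query."""
    q = embed(query)
    scored = [(cosine(q, item["vector"]), item) for item in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Invented sample chunks; a real index would hold chunks of the ingested documents.
chunks = [
    {"text": "visual schedules help children follow daily routines", "source": "guide_a.pdf", "page": 3},
    {"text": "picture exchange supports early communication", "source": "guide_b.pdf", "page": 12},
    {"text": "vector databases store embeddings", "source": "notes.md", "page": 1},
]
# "Upsert": keep each chunk's vector next to its metadata.
index = [{**c, "vector": embed(c["text"])} for c in chunks]

top = retrieve("how to build daily routines", index, k=2)
for score, chunk in top:
    print(f"{score:.2f}  {chunk['source']} p.{chunk['page']}")
```

Swapping `embed` for a real model and `index` for a vector store should leave the shape of `retrieve` largely unchanged, which makes it easier to tune `k` in isolation.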

---

### Data to Collect & Chunk
- Scientific articles (PDF): systematic reviews, controlled trials, DSM/APA guidelines.  
- Practical guides (institutions, NGOs), classroom protocols, visual supports.  
- Internal notes (FAQ), social stories, checklists.  
- **Chunking:** by headings/paragraphs; keep `source_url`, `section`, `year`, `doc_type`.
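
A minimal sketch of overlapping chunking with metadata attached to every chunk. It assumes whitespace-separated words as a rough proxy for tokens (a real pipeline would count with the embedding model's tokenizer), and `chunk_document`, `chunk_id`, and the sample values are hypothetical names chosen for illustration.

```python
def chunk_document(text: str, meta: dict, size: int = 400, overlap: int = 60) -> list:
    """Split text into windows of `size` words that share `overlap` words with the previous window."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    words = text.split()
    step = size - overlap
    records = []
    for start in range(0, len(words), step):
        window = words[start:start + size]
        # Copy the document-level metadata onto each chunk so it can be cited and filtered later.
        records.append({**meta, "chunk_id": len(records), "text": " ".join(window)})
        if start + size >= len(words):
            break  # the last window already reached the end of the document
    return records

# Invented metadata using the field names listed above.
meta = {
    "source_url": "https://example.org/guide",
    "section": "Routines",
    "year": 2021,
    "doc_type": "guideline",
}
demo = chunk_document("one two three four five six seven eight nine", meta, size=5, overlap=1)
for rec in demo:
    print(rec["chunk_id"], rec["doc_type"], "|", rec["text"])
```

With `size=400` and `overlap=60` the overlap is 15%, inside the 10-20% range mentioned in the pipeline sketch.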

---

### Vector Database (and Why)
- **Chroma** (local, free) for fast prototyping, Python simplicity.  
- **Pinecone / Weaviate / Milvus** (managed / scalable) if low latency, scaling, or heavy metadata filtering is needed (e.g., `year>=2018`, `doc_type='guideline'`).  
- **MVP choice:** Chroma (fast dev), then Pinecone if usage increases.

---

### LLM to Integrate
- **MVP (cost/latency):** compact instruct model (e.g., Mistral-7B-Instruct via API) + well-structured prompt.  
- **Higher quality:** larger model (e.g., GPT-4-level or equivalent) for better coherence and strict citation use.  
- **Prompting:** “Answer with numbered citations [1][2] mapping to sources and page numbers. If unsure, say so and suggest exact search terms.”
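
One possible way to assemble such a prompt from retrieved chunks is sketched below. `build_prompt` and the chunk fields (`text`, `source`, `page`) are assumptions made for this example and not part of any specific framework; the instruction text is the one quoted above.

```python
INSTRUCTIONS = (
    "Answer with numbered citations [1][2] mapping to sources and page numbers. "
    "If unsure, say so and suggest exact search terms."
)

def build_prompt(question: str, retrieved: list) -> str:
    """Number each retrieved chunk so the model can cite it as [n]."""
    context = "\n".join(f"[{i}] {c['text']}" for i, c in enumerate(retrieved, 1))
    sources = "\n".join(f"[{i}] {c['source']}, p. {c['page']}" for i, c in enumerate(retrieved, 1))
    return (
        f"{INSTRUCTIONS}\n\n"
        f"Context:\n{context}\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Invented chunks for illustration.
retrieved = [
    {"text": "Visual schedules help children follow daily routines.", "source": "guide_a.pdf", "page": 3},
    {"text": "Picture exchange supports early communication.", "source": "guide_b.pdf", "page": 12},
]
prompt = build_prompt("How can I support morning routines?", retrieved)
print(prompt)
```

Keeping the numbering in one place means the `[n]` markers in the model's answer can be mapped back to `retrieved[n - 1]` when rendering citations.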

---

### Hallucinations & Outdated Responses
- **RAG-first:** answers only from retrieved chunks; refuse if no relevant source (score < threshold).  
- **Mandatory citations:** each paragraph includes sources (short title, year, page).  
- **Freshness policy:** default filter `year >= 2018`; “include older sources” button if historical context needed.  
- **Light cross-check:** re-retrieve to ensure key statements appear in ≥2 sources (if available).  
- **Disclaimers:** no personalized medical advice; encourage consulting a professional.
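
The refusal rule and the freshness policy can be combined into one filtering step before generation. The sketch below assumes similarity scores where higher means more relevant and a `year` field on every chunk; the `min_score=0.35` threshold is an arbitrary placeholder that would need tuning on real data, and the function names are hypothetical.

```python
def select_context(scored_chunks: list, min_score: float = 0.35,
                   min_year: int = 2018, include_older: bool = False) -> list:
    """Keep chunks that pass the relevance threshold and the freshness filter."""
    return [
        (score, chunk)
        for score, chunk in scored_chunks
        if score >= min_score and (include_older or chunk["year"] >= min_year)
    ]

def answer_or_refuse(scored_chunks: list, **filters) -> dict:
    """Refuse when nothing relevant and recent enough was retrieved."""
    kept = select_context(scored_chunks, **filters)
    if not kept:
        return {
            "refused": True,
            "message": "I don't know: no sufficiently relevant source was found.",
            "context": [],
        }
    return {"refused": False, "message": None, "context": [chunk for _, chunk in kept]}

# Invented retrieval results: (score, chunk) pairs.
results = [
    (0.62, {"source": "review_2021.pdf", "year": 2021}),
    (0.58, {"source": "classic_2009.pdf", "year": 2009}),
    (0.10, {"source": "offtopic.pdf", "year": 2022}),
]
default_view = answer_or_refuse(results)
with_history = answer_or_refuse(results, include_older=True)
print(len(default_view["context"]), len(with_history["context"]))
```

The `include_older` flag corresponds to the "include older sources" button: it relaxes the year filter but never the relevance threshold.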

---

### Anticipated Challenges (and Mitigations)
- **Source quality/reliability** → whitelist domains + initial manual review.  
- **Heterogeneous data (scanned PDFs, tables)** → OCR + table extraction (Camelot/Tabula) + human QA.  
- **Relevance evaluation** → create a set of 50-100 “gold” Q&A pairs with expected sources.  
- **Cost/latency** → cache frequent hits, default top-k=4, optional re-ranking.  
- **Security & GDPR** → no PII; anonymized logs; forbid upload of personal data.  
- **Content drift** → weekly re-crawl & incremental re-index; monitor broken links.

---

### Success Criteria (Definition of Done)
- ≥80% of answers contain **2+ valid citations** (title/year/page).  
- **User feedback ≥4/5** on 30 frequent questions (usefulness/clarity/tone).  
- **Latency P50 < 2s, P95 < 5s** (on 100 test queries).  
- Eval set: EM/F1 (closed QA), NDCG@k (retrieval), human annotation (usefulness, accuracy).
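
The automatic parts of these criteria can be computed with the standard library alone. This is a sketch under stated assumptions: token-level F1 uses whitespace tokens, NDCG@k uses linear gains and takes the ideal ranking from the retrieved list only (so relevant documents that were never retrieved are not counted), and the citation check only counts `[n]` markers without verifying that the cited sources are valid.

```python
import math
import re
import statistics
from collections import Counter

def exact_match(pred: str, gold: str) -> bool:
    """Case- and whitespace-insensitive exact match."""
    return pred.strip().lower() == gold.strip().lower()

def token_f1(pred: str, gold: str) -> float:
    """Harmonic mean of token precision and recall between prediction and gold answer."""
    p, g = pred.lower().split(), gold.lower().split()
    common = sum((Counter(p) & Counter(g)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)

def ndcg_at_k(relevances: list, k: int) -> float:
    """NDCG@k for one query; `relevances` are graded labels in ranked order."""
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal else 0.0

def citation_coverage(answers: list, min_citations: int = 2) -> float:
    """Share of answers with at least `min_citations` distinct [n] markers."""
    hits = sum(len(set(re.findall(r"\[(\d+)\]", a))) >= min_citations for a in answers)
    return hits / len(answers) if answers else 0.0

def latency_percentiles(latencies_s: list) -> dict:
    """P50 and P95 from measured latencies (needs at least two measurements)."""
    cuts = statistics.quantiles(latencies_s, n=100)
    return {"p50": cuts[49], "p95": cuts[94]}

print(token_f1("visual schedules help", "visual schedules"))
print(ndcg_at_k([0, 1], k=2))
print(citation_coverage(["Use visual schedules [1][2].", "No sources."]))
```

Human ratings (usefulness, accuracy, tone) still have to be collected separately; these functions only cover the numeric thresholds.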

---

### Minimal Roadmap
- **Week 1:** collect 30-50 docs, ingestion → embeddings → index (Chroma), Q/A prototype in notebook.  
- **Week 2:** simple UX (Streamlit), citations, metadata filters (year/type), feedback button.  
- **Week 3:** evaluation (question set), tuning chunking/top-k, optional re-ranking.  
- **Week 4:** hardening (guardrails, disclaimers), basic monitoring, potential switch to Pinecone.
