
In [None]:
!pip install PyPDF2 rank_bm25 sentence-transformers faiss-cpu scikit-learn openai

Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Collecting rank_bm25
  Downloading rank_bm25-0.2.2-py3-none-any.whl.metadata (3.2 kB)
Collecting faiss-cpu
  Downloading faiss_cpu-1.12.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (5.1 kB)
Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Downloading rank_bm25-0.2.2-py3-none-any.whl (8.6 kB)
Downloading faiss_cpu-1.12.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (31.4 MB)
Installing collected packages: rank_bm25, PyPDF2, faiss-cpu
Successfully installed PyPDF2-3.0.1 faiss-cpu-1.12.0 rank_bm25-0.2.2


In [None]:
# Dependencies
from google.colab import files
import PyPDF2, re, numpy as np, faiss
from sentence_transformers import SentenceTransformer
from openai import OpenAI
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from rank_bm25 import BM25Okapi
import os
from itertools import chain
import openai

In [None]:
# Upload PDF
uploaded = files.upload()
pdf_files = list(uploaded.keys())
print(f"A total of {len(pdf_files)} PDF(s) uploaded")

Saving Acid rain and air pollution 50 years of progress in environmental science and policy.pdf to Acid rain and air pollution 50 years of progress in environmental science and policy.pdf
Saving Advances in air quality research - current and emerging challenges.pdf to Advances in air quality research - current and emerging challenges.pdf
Saving Air pollution and control action in Beijing.pdf to Air pollution and control action in Beijing.pdf
Saving Air pollution and public health emerging hazards and improved understanding of risk.pdf to Air pollution and public health emerging hazards and improved understanding of risk.pdf
Saving Air Pollution Control Policies in China A Retrospective and Prospects.pdf to Air Pollution Control Policies in China A Retrospective and Prospects.pdf
Saving Air pollution reduction in China Recent success but great challenge for the future.pdf to Air pollution reduction in China Recent success but great challenge for the future.pdf
A total of 6 PDF(s) uploaded

# 1. Document-level segmentation

In [None]:
# ================== 1. Locate Uploaded PDFs =================
# Set the folder path
folder_path = "/content"

# Get all PDF filenames
pdf_files = [f for f in os.listdir(folder_path) if f.lower().endswith('.pdf')]
print(f"A total of {len(pdf_files)} PDF file(s) found")

A total of 6 PDF file(s) found


In [None]:
# ============ 2. Text Cleaning + Document-level Concatenation =============
def clean_line(s: str) -> str:
    """Remove hyphenated line breaks & clean multiple spaces"""
    s = re.sub(r'-\s*\n', '', s)
    s = re.sub(r'\s+', ' ', s).strip()
    return s

def is_author_line(line: str) -> bool:
    """If line contains ≥2 English names and lacks predicates, treat as author line"""
    return len(re.findall(r'[A-Z][a-z]+\s+[A-Z][a-z]+', line)) >= 2 and \
           not re.search(r'\b(is|was|were|are|has|have)\b', line, re.I)

def is_metadata_line(line: str) -> bool:
    """Filter out copyright, journal info, keyword lists, and other irrelevant lines"""
    if re.search(r'(Elsevier|Springer|doi|ISSN|eISSN|Published|Available online|ScienceDirect|'
                 r'Correspondence|Open Access|Author information|Received|Accepted|All rights reserved|'
                 r'Journal|Volume|Issue|Editor|University|Department|Faculty|Copyright)', line, re.I):
        return True
    if re.search(r'(ARTICLE INFO|Keywords|ABSTRACT|Article history|Resources Policy)', line, re.I):
        return True
    if is_author_line(line):
        return True
    if len(line.split()) >= 8 and not re.search(
        r'\b(is|was|were|are|has|have|using|used|based|conducted|shows|analyze|explore|assess|'
        r'estimate|report|evaluate|demonstrate)\b', line, re.I):
        return True
    return False

def merge_lines(lines):
    """Merge multiple lines into natural paragraphs to reduce sentence break noise"""
    merged, buf = [], ''
    for ln in lines:
        if not buf:
            buf = ln
        else:
            if not re.search(r'[.!?。！？]$', buf):
                buf += ' ' + ln
            else:
                merged.append(buf)
                buf = ln
    if buf: merged.append(buf)
    return merged


doc_texts, doc_files = [], []

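for file in pdf_files:
    # Resolve against folder_path so the loop does not depend on the working directory
    with open(os.path.join(folder_path, file), 'rb') as f: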
        reader = PyPDF2.PdfReader(f)
        raw_lines = []

        for pg in reader.pages:
            raw = pg.extract_text() or ''
            for ln in raw.split('\n'):
                ln = clean_line(ln)
                if ln and not is_metadata_line(ln):
                    raw_lines.append(ln)

        # Concatenate all natural paragraphs into one long text
        paragraphs = merge_lines(raw_lines)
        long_text  = ' '.join(paragraphs).strip()

        if long_text:          # filter out empty documents
            doc_texts.append(long_text)
            doc_files.append(file)

assert doc_texts, " No main text extracted, please check the PDFs."
print(" Cleaning complete, number of documents:", len(doc_texts))


 Cleaning complete, number of documents: 6
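
Each page is split on `\n` before `clean_line` runs, so the `-\s*\n` pattern in `clean_line` never sees a newline and words hyphenated across line breaks stay split (for example `pollu- tion`). Below is a minimal sketch of one way to rejoin them at merge time. It assumes a trailing hyphen always marks a broken word, which will also join genuine compounds such as `low-` / `cost`. The name `merge_lines_dehyphenate` is hypothetical and not part of the notebook.

```python
import re

def merge_lines_dehyphenate(lines):
    """Like merge_lines, but joins a line ending in '-' to the next line without a space."""
    merged, buf = [], ''
    for ln in lines:
        if not buf:
            buf = ln
        elif buf.endswith('-'):
            buf = buf[:-1] + ln          # drop the hyphen and rejoin the broken word
        elif not re.search(r'[.!?]$', buf):
            buf += ' ' + ln              # sentence continues on the next line
        else:
            merged.append(buf)           # sentence ended; start a new paragraph
            buf = ln
    if buf:
        merged.append(buf)
    return merged

print(merge_lines_dehyphenate(["Air pollu-", "tion is harmful.", "Next paragraph."]))
```

If adopted, it could stand in for `merge_lines` in the extraction loop; whether that improves retrieval on these PDFs would need checking.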


In [None]:
# ================== 3. Building Index ==================
# --- A. TF-IDF + Cosine ---
# NOTE: with only 6 documents, max_df=0.95 drops every term that occurs in all six
vectorizer = TfidfVectorizer(stop_words='english', max_df=0.95)
tfidf_mat  = vectorizer.fit_transform(doc_texts)
print(" A: TF-IDF index is ready")

# --- B. BM25 ---
bm25 = BM25Okapi([doc.lower().split() for doc in doc_texts])
print(" B: BM25 index is ready")

# --- C. SBERT + FAISS ---
sbert = SentenceTransformer('all-MiniLM-L6-v2')
embs  = sbert.encode(doc_texts, normalize_embeddings=True, show_progress_bar=False)
index = faiss.IndexFlatIP(embs.shape[1])
index.add(embs.astype('float32'))
print(" C: SBERT embeddings + FAISS index is ready")

 A: TF-IDF index is ready
 B: BM25 index is ready


 C: SBERT embeddings + FAISS index is ready
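
The SBERT branch encodes with `normalize_embeddings=True` and indexes with `faiss.IndexFlatIP`, so the inner-product scores it returns are cosine similarities. A small, library-free sketch of that identity follows; the vectors are made-up toy values and the helper names are hypothetical.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    """Scale v to unit Euclidean length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity computed directly from the raw vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [3.0, 4.0], [4.0, 3.0]
ip_of_normalized = dot(l2_normalize(a), l2_normalize(b))

# The two quantities agree up to floating-point error
print(ip_of_normalized, cosine(a, b))
```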


In [None]:
# ================== 4. Retrieval Functions ==================
def retrieve_A(q, k=3):
    sims = cosine_similarity(vectorizer.transform([q]), tfidf_mat).flatten()
    idx  = sims.argsort()[::-1][:k]
    return [(doc_files[i], doc_texts[i], float(sims[i])) for i in idx]

def retrieve_B(q, k=3):
    scores = bm25.get_scores(q.lower().split())
    idx    = np.argsort(scores)[::-1][:k]
    return [(doc_files[i], doc_texts[i], float(scores[i])) for i in idx]

def retrieve_C(q, k=3):
    q_emb = sbert.encode([q], normalize_embeddings=True)
    sims, idx = index.search(q_emb.astype('float32'), k)
    return [(doc_files[i], doc_texts[i], float(sims[0][j])) for j, i in enumerate(idx[0])]


## 1.1 GPT-3.5

In [None]:
# ================== 5. GPT Generation ==================
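# The key is blank in the notebook; read it from the environment instead of hard-coding it
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", ""))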

def gen_with_ctx(query, docs, max_tokens=12000):
    max_chars, acc, ctx = max_tokens * 4, 0, []
    for _, d, _ in docs:
        if acc >= max_chars:
            break
        chunk = d[:max_chars - acc]
        ctx.append(chunk)
        acc += len(chunk)

    ctx_joined = "\n\n".join(ctx)

    #  System prompt + user prompt structure
    system_prompt = (
        "You are an expert assistant in environmental policy research. "
        "When answering questions, do not refer to specific papers using phrases like 'this study' or 'the paper'. "
        "Instead, synthesize the content in an abstract, generalized manner, describing methods and findings without attributing them to individual sources."
    )

    user_prompt = (
        f"The following are excerpts from multiple environmental policy documents:\n\n"
        f"{ctx_joined}\n\n"
        f"Based on the information above, answer the following question in clear and concise academic English:\n\n{query}"
    )

    rsp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        temperature=0
    )
    return rsp.choices[0].message.content


def gen_no_rag(query):
    rsp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role":"user","content":query}],
        temperature=0
    )
    return rsp.choices[0].message.content
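
`gen_with_ctx` approximates the token budget as four characters per token and fills it greedily in retrieval order, so lower-ranked documents may be cut short or dropped. The sketch below isolates that budgeting step on toy data. `build_context` and `chars_per_token` are hypothetical names, and the 4:1 ratio is the notebook's heuristic, not a tokenizer measurement.

```python
def build_context(docs, max_tokens=12000, chars_per_token=4):
    """Concatenate document texts in order until the character budget is used up."""
    max_chars, used, parts = max_tokens * chars_per_token, 0, []
    for _, text, _ in docs:
        if used >= max_chars:
            break                          # budget exhausted; remaining docs are dropped
        chunk = text[:max_chars - used]    # the last doc that fits may be truncated
        parts.append(chunk)
        used += len(chunk)
    return "\n\n".join(parts)

toy = [("a.pdf", "x" * 30, 0.9), ("b.pdf", "y" * 30, 0.5), ("c.pdf", "z" * 30, 0.1)]
ctx = build_context(toy, max_tokens=10)    # budget: 10 * 4 = 40 characters
print(repr(ctx))
```

The `"\n\n"` separators are not counted against the budget, so the joined string can be slightly longer than `max_chars`.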


In [None]:
# ========= 6. Hybrid-RAG Construction =========
def merge_docs(*doc_lists, top_k=6, max_chars=1200):
    """Merge multiple retrieval results, deduplicate, and truncate uniformly"""
    cache = {}
    for docs in doc_lists:
        for fn, txt, sc in docs:
            key = (fn, txt[:256])              # dedupe on file name + first 256 chars
            if key not in cache or sc > cache[key][1]:
                cache[key] = (txt, sc)         # keep the full text, not only the key prefix

    merged = sorted([(fn, txt[:max_chars], sc)
                     for (fn, _), (txt, sc) in cache.items()],
                    key=lambda x: x[2], reverse=True)
    return merged[:top_k]


def gen_hybrid_rag(query, *doc_lists):
    """Generate final answer by augmenting a No-RAG draft with multi-source evidence"""
    # ① Base draft from No-RAG
    draft = gen_no_rag(query)

    # ② Collect evidence paragraphs
    docs = merge_docs(*doc_lists)
    evidence_txt = "\n\n".join(f"[{i}] {d}" for i, (_, d, _) in enumerate(docs, 1))

    # ③ Let GPT augment draft with evidence, adding citations
    system_prompt = (
        "You are an expert environmental-policy assistant. "
        "Take the DRAFT answer the user already wrote, KEEP its structure, "
        "but augment it with precise facts drawn from the EVIDENCE below. "
        "Cite the evidence numbers (e.g. [1]) at relevant places. "
        "If draft statements conflict with evidence, correct them."
    )
    user_prompt = (
        f"DRAFT ANSWER:\n{draft}\n\n"
        f"EVIDENCE:\n{evidence_txt}\n\n"
        f"Please return the enhanced answer."
    )
    rsp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role":"system","content":system_prompt},
                  {"role":"user","content":user_prompt}],
        temperature=0
    )
    return rsp.choices[0].message.content, docs
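
`merge_docs` ranks the pooled hits by raw score, but TF-IDF cosine, BM25, and SBERT inner-product scores live on different scales, so the unbounded BM25 scores tend to outrank the others. One common alternative, which is not what this notebook does, is reciprocal rank fusion: each list contributes `1 / (k + rank)` per hit and raw scores are ignored. A minimal sketch, where `rrf_merge` is a hypothetical name, `k=60` is a conventional default, and the hit lists are made-up toy data:

```python
from collections import defaultdict

def rrf_merge(*doc_lists, top_k=6, k=60):
    """Fuse ranked lists of (file, text, score) tuples by reciprocal rank."""
    fused = defaultdict(float)
    for docs in doc_lists:
        for rank, (fn, txt, _) in enumerate(docs, 1):
            fused[(fn, txt)] += 1.0 / (k + rank)   # raw score is deliberately ignored
    ranked = sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
    return [(fn, txt, score) for (fn, txt), score in ranked[:top_k]]

tfidf_hits = [("a.pdf", "para 1", 0.30), ("b.pdf", "para 2", 0.10)]
bm25_hits  = [("b.pdf", "para 2", 7.5),  ("c.pdf", "para 3", 6.1)]

fused = rrf_merge(tfidf_hits, bm25_hits)
for fn, txt, score in fused:
    print(fn, round(score, 4))
```

Here `b.pdf` comes first because both retrievers returned it, even though its TF-IDF score is the lowest in that list.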


In [None]:
# ================== 7. Example Run ==================
query = "What monitoring techniques are suitable for measuring PM2.5?"

docs_A, docs_B, docs_C = retrieve_A(query), retrieve_B(query), retrieve_C(query)  # retrieve once, reuse below
ans_A = gen_with_ctx(query, docs_A)
ans_B = gen_with_ctx(query, docs_B)
ans_C = gen_with_ctx(query, docs_C)
ans_D = gen_no_rag(query)

print("—— Experiment A (TF-IDF) ——\n", ans_A, "\n")
print("—— Experiment B (BM25) ——\n", ans_B, "\n")
print("—— Experiment C (SBERT+FAISS) ——\n", ans_C, "\n")
print("—— Experiment D (No-RAG) ——\n", ans_D)

# —— Experiment E (Hybrid-RAG) ——
ans_E, docs_E = gen_hybrid_rag(query, docs_A, docs_B, docs_C)
print("—— Experiment E (Hybrid-RAG) ——\n", ans_E)

show_sources(docs_E, "E")   # requires show_sources from cell 8 below; run that cell first


—— Experiment A (TF-IDF) ——
 Monitoring techniques suitable for measuring PM2.5 include ground-based sensors, low-cost sensor networks, satellite observations, and unmanned aerial vehicles (UAVs). These techniques provide spatially resolved data on PM2.5 concentrations, allowing for comprehensive air quality assessments. Additionally, the use of high-resolution measurement networks and data assimilation methods can enhance the accuracy and reliability of PM2.5 measurements. 

—— Experiment B (BM25) ——
 Monitoring techniques suitable for measuring PM2.5 include ground-based sensors, low-cost sensors, satellite observations, and unmanned aerial vehicles (UAVs). These techniques provide valuable data for assessing air quality and exposure to particulate matter. Ground-based sensors and low-cost sensors offer cost-effective options for continuous monitoring, while satellite observations and UAVs provide broader spatial coverage for monitoring PM2.5 levels. Integrating data from these vario

In [None]:
# ================== 8. Display Source Excerpts ==================
def show_sources(docs, label):
    print(f"\n===== Source Excerpts {label} =====")
    for i, (fn, txt, sc) in enumerate(docs, 1):
        print(f"\n[{i}] {fn} | Score: {sc:.3f}\n{txt}\n")

show_sources(docs_A, "A")
show_sources(docs_B, "B")
show_sources(docs_C, "C")



===== Source Excerpts A =====

[1] Advances in air quality research - current and emerging challenges.pdf | Score: 0.036
Atmos. Chem. Phys., 22, 4615–4703, 2022 © Author(s) 2022. This work is distributed under the Creative Commons Attribution 4.0 License. Review article challenges Jaakko Kukkonen9,1 6ARIANET, via Gilino 9, 20128 Milan, Italy Max-Planck-Straße 1, 21502 Geesthacht, Germany 13Aerosol Akademie, 83404 Ainring, Germany 82467 Garmisch-Partenkirchen, Germany 16European Commission, DG Environment, Brussels, Belgium 3720 BA Bilthoven, the Netherlands Heidelbergerlaan 8, 3584 CS Utrecht, the Netherlands research needs for selected key topics. While this paper is not an exhaustive review of all research areas in the ﬁeld of air quality, we have selected key topics that we feel are important from air quality research and policy health assessment, and air quality management and policy. In conducting the review, speciﬁc objectives were portance for air quality policy. The original c

## 1.2 DeepSeek-CHAT

In [None]:
# ================== 5. DeepSeek Generation ==================
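# The key is blank in the notebook; the variable name DEEPSEEK_API_KEY is an assumption
client = OpenAI(api_key=os.environ.get("DEEPSEEK_API_KEY", ""), base_url="https://api.deepseek.com")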

def gen_with_ctx(query, docs, max_tokens=12000):
    max_chars, acc, ctx = max_tokens * 4, 0, []
    for _, d, _ in docs:
        if acc >= max_chars: break
        chunk = d[:max_chars - acc]
        ctx.append(chunk)
        acc += len(chunk)

    ctx_joined = "\n\n".join(ctx)

    #  System prompt + user prompt structure
    system_prompt = (
        "You are an expert assistant in environmental policy research. "
        "When answering questions, do not refer to specific papers using phrases like 'this study' or 'the paper'. "
        "Instead, synthesize the content in an abstract, generalized manner, describing methods and findings without attributing them to individual sources."
    )

    user_prompt = (
        f"The following are excerpts from multiple environmental policy documents:\n\n"
        f"{ctx_joined}\n\n"
        f"Based on the information above, answer the following question in clear and concise academic English:\n\n{query}"
    )

    rsp = client.chat.completions.create(
        model="deepseek-chat",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        temperature=0
    )
    return rsp.choices[0].message.content


def gen_no_rag(query):
    rsp = client.chat.completions.create(
        model="deepseek-chat",
        messages=[{"role":"user","content":query}],
        temperature=0
    )
    return rsp.choices[0].message.content

In [None]:
# ========= 6. Hybrid-RAG Construction =========
def merge_docs(*doc_lists, top_k=6, max_chars=1200):
    """Merge multi-source retrieval results, deduplicate, and truncate text"""
    cache = {}
    for docs in doc_lists:
        for fn, txt, sc in docs:
            key = (fn, txt[:256])              # dedupe on file name + first 256 chars
            if key not in cache or sc > cache[key][1]:
                cache[key] = (txt, sc)         # keep the full text, not only the key prefix

    merged = sorted([(fn, txt[:max_chars], sc)
                     for (fn, _), (txt, sc) in cache.items()],
                    key=lambda x: x[2], reverse=True)
    return merged[:top_k]


def gen_hybrid_rag(query, *doc_lists):
    """Hybrid-RAG: No-RAG draft + evidence augmentation"""

    # ① Obtain No-RAG draft
    draft = gen_no_rag(query)

    # ② Merge evidence paragraphs
    docs = merge_docs(*doc_lists)
    evidence_txt = "\n\n".join(f"[{i}] {d}" for i, (_, d, _) in enumerate(docs, 1))

    # ③ Enhance draft using evidence
    system_prompt = (
        "You are an expert environmental-policy assistant. "
        "Take the DRAFT answer the user already wrote, KEEP its structure, "
        "but augment it with precise facts drawn from the EVIDENCE below. "
        "Cite the evidence numbers (e.g. [1]) at relevant places. "
        "If draft statements conflict with evidence, correct them."
    )
    user_prompt = (
        f"DRAFT ANSWER:\n{draft}\n\n"
        f"EVIDENCE:\n{evidence_txt}\n\n"
        f"Please return the enhanced answer."
    )
    rsp = client.chat.completions.create(
        model="deepseek-chat",
        messages=[{"role":"system","content":system_prompt},
                  {"role":"user","content":user_prompt}],
        temperature=0
    )
    return rsp.choices[0].message.content, docs


In [None]:
# ================== 7. Example Run ==================
query = "What monitoring techniques are suitable for measuring PM2.5?"

docs_A, docs_B, docs_C = retrieve_A(query), retrieve_B(query), retrieve_C(query)  # retrieve once, reuse below
ans_A = gen_with_ctx(query, docs_A)
ans_B = gen_with_ctx(query, docs_B)
ans_C = gen_with_ctx(query, docs_C)
ans_D = gen_no_rag(query)

print("—— Experiment A (TF-IDF) ——\n", ans_A, "\n")
print("—— Experiment B (BM25) ——\n", ans_B, "\n")
print("—— Experiment C (SBERT+FAISS) ——\n", ans_C, "\n")
print("—— Experiment D (No-RAG) ——\n", ans_D)

# —— Experiment E (Hybrid-RAG) ——
ans_E, docs_E = gen_hybrid_rag(query, docs_A, docs_B, docs_C)
print("—— Experiment E (Hybrid-RAG) ——\n", ans_E)

show_sources(docs_E, "E")


—— Experiment A (TF-IDF) ——
 Multiple monitoring techniques are suitable for measuring PM₂.₅ concentrations, each with distinct applications and characteristics:

1. **Ground-based reference monitoring stations**: These provide high-precision, regulatory-grade measurements using standardized instruments (e.g., gravimetric samplers, beta attenuation monitors, or tapered element oscillating microbalances). They form the backbone of official air quality networks but are limited in spatial coverage due to cost and infrastructure requirements.

2. **Low-cost sensors (LCS)**: These offer higher spatial density and real-time data at reduced cost, enabling deployment in citizen science projects and dense urban networks. However, they require rigorous calibration, quality assurance protocols, and intercomparison with reference instruments to ensure data reliability.

3. **Satellite remote sensing**: Provides broad spatial coverage and columnar aerosol optical depth (AOD) data, which can be conver

In [None]:
# ================== 8. Display Source Excerpts ==================
def show_sources(docs, label):
    print(f"\n===== Source Excerpts {label} =====")
    for i, (fn, txt, sc) in enumerate(docs, 1):
        print(f"\n[{i}] {fn} | Score: {sc:.3f}\n{txt}\n")

show_sources(docs_A, "A")
show_sources(docs_B, "B")
show_sources(docs_C, "C")



===== Source Excerpts A =====

[1] Advances in air quality research - current and emerging challenges.pdf | Score: 0.036
Atmos. Chem. Phys., 22, 4615–4703, 2022 © Author(s) 2022. This work is distributed under the Creative Commons Attribution 4.0 License. Review article challenges Jaakko Kukkonen9,1 6ARIANET, via Gilino 9, 20128 Milan, Italy Max-Planck-Straße 1, 21502 Geesthacht, Germany 13Aerosol Akademie, 83404 Ainring, Germany 82467 Garmisch-Partenkirchen, Germany 16European Commission, DG Environment, Brussels, Belgium 3720 BA Bilthoven, the Netherlands Heidelbergerlaan 8, 3584 CS Utrecht, the Netherlands research needs for selected key topics. While this paper is not an exhaustive review of all research areas in the ﬁeld of air quality, we have selected key topics that we feel are important from air quality research and policy health assessment, and air quality management and policy. In conducting the review, speciﬁc objectives were portance for air quality policy. The original c

## 1.3 LLaMA-3-8b

In [None]:
# ================== 5. LLaMA Generation ==================

client = openai.OpenAI(
    api_key=os.environ.get("OPENROUTER_API_KEY", ""),  # blank in the notebook; variable name is an assumption
    base_url="https://openrouter.ai/api/v1"
)

def gen_with_ctx(query, docs, max_tokens=12000):
    max_chars, acc, ctx = max_tokens * 4, 0, []
    for _, d, _ in docs:
        if acc >= max_chars: break
        chunk = d[:max_chars - acc]
        ctx.append(chunk)
        acc += len(chunk)

    ctx_joined = "\n\n".join(ctx)

    system_prompt = (
        "You are an expert assistant in environmental policy research. "
        "When answering questions, do not refer to specific papers using phrases like 'this study' or 'the paper'. "
        "Instead, synthesize the content in an abstract, generalized manner, describing methods and findings without attributing them to individual sources."
    )

    user_prompt = (
        f"The following are excerpts from multiple environmental policy documents:\n\n"
        f"{ctx_joined}\n\n"
        f"Based on the information above, answer the following question in clear and concise academic English:\n\n{query}"
    )

    response = client.chat.completions.create(
        model="meta-llama/llama-3-8b-instruct",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        temperature=0.2
    )

    return response.choices[0].message.content


def gen_no_rag(query):
    response = client.chat.completions.create(
        model="meta-llama/llama-3-8b-instruct",
        messages=[{"role": "user", "content": query}],
        temperature=0.2
    )
    return response.choices[0].message.content


In [None]:
# ========= 6. Hybrid-RAG Construction =========

def merge_docs(*doc_lists, top_k=6, max_chars=1200):
    """Merge multi-source retrieval results, deduplicate, and truncate text"""
    cache = {}
    for docs in doc_lists:
        for fn, txt, sc in docs:
            key = (fn, txt[:256])              # dedupe on file name + first 256 chars
            if key not in cache or sc > cache[key][1]:
                cache[key] = (txt, sc)         # keep the full text, not only the key prefix

    merged = sorted([(fn, txt[:max_chars], sc)
                     for (fn, _), (txt, sc) in cache.items()],
                    key=lambda x: x[2], reverse=True)
    return merged[:top_k]


def gen_hybrid_rag(query, *doc_lists):
    """Hybrid-RAG: No-RAG draft + evidence augmentation"""

    # ① Obtain No-RAG draft (gen_no_rag uses the same model and temperature)
    draft = gen_no_rag(query)

    # ② Merge evidence paragraphs
    docs = merge_docs(*doc_lists)
    evidence_txt = "\n\n".join(f"[{i}] {d}" for i, (_, d, _) in enumerate(docs, 1))

    # ③ Enhance draft using evidence
    system_prompt = (
        "You are an expert environmental-policy assistant. "
        "Take the DRAFT answer the user already wrote, KEEP its structure, "
        "but augment it with precise facts drawn from the EVIDENCE below. "
        "Cite the evidence numbers (e.g. [1]) at relevant places. "
        "If draft statements conflict with evidence, correct them."
    )
    user_prompt = (
        f"DRAFT ANSWER:\n{draft}\n\n"
        f"EVIDENCE:\n{evidence_txt}\n\n"
        f"Please return the enhanced answer."
    )

    enhanced_rsp = client.chat.completions.create(
        model="meta-llama/llama-3-8b-instruct",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        temperature=0.2
    )

    return enhanced_rsp.choices[0].message.content, docs


In [None]:
# ================== 7. Example Run ==================
query = "What is a Clean Air Zone and how is it implemented in the UK?"

docs_A, docs_B, docs_C = retrieve_A(query), retrieve_B(query), retrieve_C(query)  # retrieve once, reuse below
ans_A = gen_with_ctx(query, docs_A)
ans_B = gen_with_ctx(query, docs_B)
ans_C = gen_with_ctx(query, docs_C)
ans_D = gen_no_rag(query)

print("—— Experiment A (TF-IDF) ——\n", ans_A, "\n")
print("—— Experiment B (BM25) ——\n", ans_B, "\n")
print("—— Experiment C (SBERT+FAISS) ——\n", ans_C, "\n")
print("—— Experiment D (No-RAG) ——\n", ans_D)

# —— Experiment E (Hybrid-RAG) ——
ans_E, docs_E = gen_hybrid_rag(query, docs_A, docs_B, docs_C)
print("—— Experiment E (Hybrid-RAG) ——\n", ans_E)

show_sources(docs_E, "E")


—— Experiment A (TF-IDF) ——
 The provided excerpts do not mention a "Clean Air Zone" or its implementation in the UK. However, based on general knowledge and environmental policy research, a Clean Air Zone (CAZ) is a designated area where specific measures are taken to reduce air pollution from vehicles and other sources.

In the UK, Clean Air Zones are implemented by local authorities, such as city councils, to improve air quality and reduce the negative impacts of air pollution on public health. The implementation of a CAZ typically involves the following steps:

1. Identification of the area: The local authority identifies the area that requires improvement in terms of air quality.
2. Setting of targets: The authority sets targets for reducing air pollution in the designated area.
3. Vehicle restrictions: The authority introduces restrictions on vehicle access to the area, such as congestion charges, low-emission zones, or bans on certain types of vehicles.
4. Monitoring and enforce

In [None]:
# ================== 8. Display Source Excerpts ==================
def show_sources(docs, label):
    print(f"\n===== Source Excerpts {label} =====")
    for i, (fn, txt, sc) in enumerate(docs, 1):
        print(f"\n[{i}] {fn} | Score: {sc:.3f}\n{txt}\n")

show_sources(docs_A, "A")
show_sources(docs_B, "B")
show_sources(docs_C, "C")


===== Source Excerpts A =====

[1] Air pollution and public health emerging hazards and improved understanding of risk.pdf | Score: 0.080


[2] Air Pollution Control Policies in China A Retrospective and Prospects.pdf | Score: 0.046
Environmental Research and Public Health Review A Retrospective and Prospects henrik.andersson@tse-fr.eu and up-to-date understanding of China’s air pollution policies is of worldwide relevance. Based on and onwards. We show that: (1) The early policies, until 2005, were ineffective at reducing emissions; (2) During 2006–2012, new instruments which interact with political incentives were introduced in the 11th Five-Year Plan, and the national goal of reducing total sulfur dioxide (SO 2) emissions by 10% was in eastern China in 2013, air pollution control policies have been experiencing signiﬁcant changes on multiple fronts. In this work we analyze the different policy changes, the drivers of changes and key evolution have implications for future studies, a

# 2. Paragraph-level segmentation

In [None]:
# ============ 1. Set folder path =============
folder_path = "/content"

# Get all PDF file names
pdf_files = [f for f in os.listdir(folder_path) if f.lower().endswith('.pdf')]
print(f"Found {len(pdf_files)} PDF files in total")

Found 6 PDF files in total


In [None]:
# ============ 2. Text Cleaning + Paragraph Segmentation =============

def clean_line(s: str) -> str:
    s = re.sub(r'-\s*\n', '', s)
    s = re.sub(r'\s+', ' ', s).strip()
    return s

def is_author_line(line: str) -> bool:
    return len(re.findall(r'[A-Z][a-z]+\s+[A-Z][a-z]+', line)) >= 2 and \
           not re.search(r'\b(is|was|were|are|has|have)\b', line, re.I)

def is_metadata_line(line: str) -> bool:
    if re.search(r'(Elsevier|Springer|doi|ISSN|eISSN|Published|Available online|ScienceDirect|'
                 r'Correspondence|Open Access|Author information|Received|Accepted|All rights reserved|'
                 r'Journal|Volume|Issue|Editor|University|Department|Faculty|Copyright)', line, re.I):
        return True
    if re.search(r'(ARTICLE INFO|Keywords|ABSTRACT|Article history|Resources Policy)', line, re.I):
        return True
    if is_author_line(line):
        return True
    # Lines with many words but no verbs
    if len(line.split()) >= 8 and not re.search(
        r'\b(is|was|were|are|has|have|using|used|based|conducted|shows|analyze|explore|assess|estimate|report|evaluate|demonstrate)\b',
        line, re.I):
        return True
    return False

def merge_lines(lines):
    merged, buf = [], ''
    for ln in lines:
        if not buf:
            buf = ln
        else:
            if not re.search(r'[.!?。！？]$', buf):
                buf += ' ' + ln
            else:
                merged.append(buf)
                buf = ln
    if buf:
        merged.append(buf)
    return merged

para_texts, para_files = [], []

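for file in pdf_files:
    # Resolve against folder_path so the loop does not depend on the working directory
    with open(os.path.join(folder_path, file), 'rb') as f: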
        rd, raw_lines = PyPDF2.PdfReader(f), []
        for pg in rd.pages:
            raw = pg.extract_text() or ''
            for ln in raw.split('\n'):
                ln = clean_line(ln)
                if ln and not is_metadata_line(ln):
                    raw_lines.append(ln)
        for para in merge_lines(raw_lines):
            if len(para.split()) >= 20:        # Filter very short paragraphs
                para_texts.append(para)
                para_files.append(file)

assert para_texts, " No valid paragraphs extracted"
print(" Cleaning complete, number of paragraphs:", len(para_texts))


 Cleaning complete, number of paragraphs: 517


In [None]:
# ================== 3. Build Index ==================

# --- A. TF-IDF + Cosine ---
vectorizer = TfidfVectorizer(stop_words='english', max_df=0.95)
tfidf_mat  = vectorizer.fit_transform(para_texts)
print(" A: TF-IDF index is ready")

# --- B. BM25 ---
bm25 = BM25Okapi([p.lower().split() for p in para_texts])
print(" B: BM25 index is ready")

# --- C. SBERT + FAISS ---
sbert = SentenceTransformer('all-MiniLM-L6-v2')
embs  = sbert.encode(para_texts, normalize_embeddings=True, show_progress_bar=False)
index = faiss.IndexFlatIP(embs.shape[1])
index.add(embs.astype('float32'))
print("C: SBERT embeddings + FAISS index is ready")


 A: TF-IDF index is ready
 B: BM25 index is ready
C: SBERT embeddings + FAISS index is ready


In [None]:
# ================== 4. Retrieval Functions ==================
def retrieve_A(q, k=3):
    sims = cosine_similarity(vectorizer.transform([q]), tfidf_mat).flatten()
    idx  = sims.argsort()[::-1][:k]
    return [(para_files[i], para_texts[i], float(sims[i])) for i in idx]

def retrieve_B(q, k=3):
    scores = bm25.get_scores(q.lower().split())
    idx    = np.argsort(scores)[::-1][:k]
    return [(para_files[i], para_texts[i], float(scores[i])) for i in idx]

def retrieve_C(q, k=3):
    q_emb = sbert.encode([q], normalize_embeddings=True)
    sims, idx = index.search(q_emb.astype('float32'), k)
    return [(para_files[i], para_texts[i], float(sims[0][j])) for j, i in enumerate(idx[0])]


## 2.1 GPT-3.5

In [None]:
# ================== 5. GPT Generation ==================
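# Read the key from the environment instead of hard-coding it (set OPENAI_API_KEY before running)
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", ""))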

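def gen_with_ctx(query, docs, max_tokens=12000):
    """Answer `query` from retrieved `docs`; context is capped at about max_tokens (~4 chars per token)."""
    max_chars, acc, ctx = max_tokens * 4, 0, []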
    for _, d, _ in docs:
        if acc >= max_chars:
            break
        chunk = d[:max_chars - acc]
        ctx.append(chunk)
        acc += len(chunk)

    ctx_joined = "\n\n".join(ctx)

    #  System prompt + user prompt structure
    system_prompt = (
        "You are an expert assistant in environmental policy research. "
        "When answering questions, do not refer to specific papers using phrases like 'this study' or 'the paper'. "
        "Instead, synthesize the content in an abstract, generalized manner, describing methods and findings without attributing them to individual sources."
    )

    user_prompt = (
        f"The following are excerpts from multiple environmental policy documents:\n\n"
        f"{ctx_joined}\n\n"
        f"Based on the information above, answer the following question in clear and concise academic English:\n\n{query}"
    )

    rsp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        temperature=0
    )
    return rsp.choices[0].message.content


def gen_no_rag(query):
    rsp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role":"user","content":query}],
        temperature=0
    )
    return rsp.choices[0].message.content
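
The context budget in `gen_with_ctx` is measured in characters, using roughly four characters per token as a heuristic. The sketch below isolates that greedy packing step so it can be checked without calling a model; `pack_context` and `chars_per_token` are hypothetical names introduced only for this illustration.

```python
def pack_context(texts, max_tokens=12000, chars_per_token=4):
    """Greedily keep texts until a character budget is used up.

    chars_per_token=4 is a rough heuristic for English text, not an
    exact tokenizer count.
    """
    max_chars, used, kept = max_tokens * chars_per_token, 0, []
    for t in texts:
        if used >= max_chars:
            break
        chunk = t[:max_chars - used]  # the last chunk may be cut mid-sentence
        kept.append(chunk)
        used += len(chunk)
    return kept

# Tiny budget (2 * 3 = 6 characters) so the truncation is visible
demo = pack_context(["aaaa", "bbbb", "cc"], max_tokens=2, chars_per_token=3)
print(demo)
```

With only three retrieved paragraphs per query the default budget is unlikely to be reached, so the cap mainly acts as a safeguard.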


In [None]:
# ========= 6. Hybrid-RAG Construction =========
def merge_docs(*doc_lists, top_k=6):
    """Merge multi-source retrieval results, deduplicate, and return top_k by score"""
    cache = {}
    for docs in doc_lists:
        for fn, txt, sc in docs:
            key = (fn, txt)
            cache[key] = max(cache.get(key, -1), sc)
    merged = sorted([(fn, txt, sc) for (fn, txt), sc in cache.items()],
                    key=lambda x: x[2], reverse=True)
    return merged[:top_k]
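
`merge_docs` sorts by raw score, but the three retrievers score on different scales: TF-IDF and SBERT cosine values are at most 1, while BM25 scores are unbounded, so BM25 hits can crowd out the others. One common alternative is reciprocal rank fusion, which uses only rank positions. The sketch below is an assumption-laden illustration, not the notebook's method; `rrf_merge` is a hypothetical name and `k=60` is a conventional default.

```python
from collections import defaultdict

def rrf_merge(*doc_lists, top_k=6, k=60):
    """Fuse ranked lists of (file, text, score) tuples by reciprocal rank.

    Each list is assumed to be sorted best-first, as retrieve_A/B/C
    return them. Raw scores are ignored; only positions matter.
    """
    fused = defaultdict(float)
    for docs in doc_lists:
        for rank, (fn, txt, _score) in enumerate(docs, start=1):
            fused[(fn, txt)] += 1.0 / (k + rank)
    ranked = sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
    return [(fn, txt, sc) for (fn, txt), sc in ranked[:top_k]]

# Toy lists with made-up scores on very different scales
list_a = [("f.pdf", "p1", 0.25), ("f.pdf", "p2", 0.20)]
list_b = [("f.pdf", "p3", 14.2), ("f.pdf", "p1", 9.8)]
list_c = [("f.pdf", "p2", 0.61), ("f.pdf", "p1", 0.58)]
merged = rrf_merge(list_a, list_b, list_c)
print([txt for _, txt, _ in merged])
```

Paragraphs found by several retrievers accumulate contributions from each list, so agreement between methods is rewarded regardless of score magnitude.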

def gen_hybrid_rag(query, *doc_lists):
    """Generate Hybrid-RAG answer: No-RAG draft + evidence augmentation"""
    # ① Obtain No-RAG draft
    draft = gen_no_rag(query)

    # ② Merge evidence paragraphs
    docs = merge_docs(*doc_lists)
    evidence_txt = "\n\n".join(f"[{i}] {d}" for i, (_, d, _) in enumerate(docs, 1))

    # ③ Enhance draft using evidence
    system_prompt = (
        "You are an expert environmental-policy assistant. "
        "Take the DRAFT answer the user already wrote, KEEP its structure, "
        "but augment it with precise facts drawn from the EVIDENCE below. "
        "Cite the evidence numbers (e.g. [1]) at relevant places. "
        "If draft statements conflict with evidence, correct them."
    )
    user_prompt = (
        f"DRAFT ANSWER:\n{draft}\n\n"
        f"EVIDENCE:\n{evidence_txt}\n\n"
        f"Please return the enhanced answer."
    )
    rsp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role":"system","content":system_prompt},
                  {"role":"user","content":user_prompt}],
        temperature=0
    )
    return rsp.choices[0].message.content, docs
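
The hybrid prompt asks the model to cite evidence by number, but nothing verifies that the cited numbers exist. A small post-hoc check could look like the sketch below; `invalid_citations` is a hypothetical helper and the sample sentence is made up for the demonstration.

```python
import re

def invalid_citations(answer, n_evidence):
    """Return cited numbers in `answer` that fall outside 1..n_evidence."""
    cited = {int(m) for m in re.findall(r"\[(\d+)\]", answer)}
    return sorted(n for n in cited if not 1 <= n <= n_evidence)

# Made-up answer text citing one valid and one out-of-range evidence number
sample = "Satellite data are cost-effective [2], and UAVs are emerging [7]."
print(invalid_citations(sample, n_evidence=6))
```

In the notebook this could be called as `invalid_citations(ans_E, len(docs_E))`; a non-empty result would suggest the model cited evidence it was never given.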


In [None]:
# ================== 7. Example Run ==================
query = "What monitoring techniques are suitable for measuring PM2.5?"

# Retrieve once per method and reuse the results for generation
docs_A, docs_B, docs_C = retrieve_A(query), retrieve_B(query), retrieve_C(query)
ans_A = gen_with_ctx(query, docs_A)
ans_B = gen_with_ctx(query, docs_B)
ans_C = gen_with_ctx(query, docs_C)
ans_D         = gen_no_rag(query)

print("—— Experiment A (TF-IDF) ——\n", ans_A, "\n")
print("—— Experiment B (BM25) ——\n", ans_B, "\n")
print("—— Experiment C (SBERT+FAISS) ——\n", ans_C, "\n")
print("—— Experiment D (No-RAG) ——\n", ans_D)

# —— Experiment E (Hybrid-RAG) ——
ans_E, docs_E = gen_hybrid_rag(query, docs_A, docs_B, docs_C)
print("—— Experiment E (Hybrid-RAG) ——\n", ans_E)

# show_sources is defined in step 8 below; on a fresh kernel, run that cell first
show_sources(docs_E, "E")


—— Experiment A (TF-IDF) ——
 Suitable monitoring techniques for measuring PM2.5 include ground-based, aircraft-based, and space-based remote sensing techniques, as well as integrated measuring techniques. Additionally, satellite observations and the use of unmanned aerial vehicles (UAVs) are emerging as effective methods for monitoring PM2.5 pollution levels. 

—— Experiment B (BM25) ——
 Ground-based, aircraft-based, and space-based remote sensing techniques, as well as integrated measuring techniques, are suitable for measuring PM2.5. Additionally, the use of unmanned aerial vehicles (UAVs) for air pollution measurements is a growing trend. These techniques can provide valuable information for assessing PM2.5 levels and understanding related atmospheric processes. 

—— Experiment C (SBERT+FAISS) ——
 Monitoring techniques suitable for measuring PM2.5 include the use of cheap measurement devices, citizen science projects, remote sensing techniques, and observational data. These techniqu

In [None]:
# ================== 8. Display Source Excerpts ==================
def show_sources(docs, label):
    print(f"\n===== Source Excerpts {label} =====")
    for i, (fn, txt, sc) in enumerate(docs, 1):
        print(f"\n[{i}] {fn} | Score: {sc:.3f}\n{txt}\n")

show_sources(docs_A, "A")
show_sources(docs_B, "B")
show_sources(docs_C, "C")



===== Source Excerpts A =====

[1] Advances in air quality research - current and emerging challenges.pdf | Score: 0.255
1.4 Measuring air pollution Measurements in the atmosphere are necessary not only duction, agriculture, trafﬁc, industry, health protection, or suring, and ground-based, aircraft-based, and space-based remote sensing techniques and integrated measuring tech- niques are available. Satellite observations are a growing growth is the use of unmanned aerial vehicles (UA Vs) for air pollution measurements (Gu et al., 2018).


[2] Advances in air quality research - current and emerging challenges.pdf | Score: 0.247
4.2 Current status and challenges tain lines of research and technical development are formu- high-resolution measurement networks by the installation of ground-based, aircraft-based, and space-based remote sens- ing techniques or integrated measuring techniques are no longer considered. Also, satellite observations, which are a cost-effective platforms, are not

## 2.2 DeepSeek-CHAT

In [None]:
# ================== 5. Deepseek Generation ==================
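# Read the key from the environment instead of hard-coding it (set DEEPSEEK_API_KEY before running)
client = OpenAI(api_key=os.environ.get("DEEPSEEK_API_KEY", ""), base_url="https://api.deepseek.com")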

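def gen_with_ctx(query, docs, max_tokens=12000):
    """Answer `query` from retrieved `docs`; context is capped at about max_tokens (~4 chars per token)."""
    max_chars, acc, ctx = max_tokens * 4, 0, []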
    for _, d, _ in docs:
        if acc >= max_chars:
            break
        chunk = d[:max_chars - acc]
        ctx.append(chunk)
        acc += len(chunk)

    ctx_joined = "\n\n".join(ctx)

    #  System prompt + user prompt structure
    system_prompt = (
        "You are an expert assistant in environmental policy research. "
        "When answering questions, do not refer to specific papers using phrases like 'this study' or 'the paper'. "
        "Instead, synthesize the content in an abstract, generalized manner, describing methods and findings without attributing them to individual sources."
    )

    user_prompt = (
        f"The following are excerpts from multiple environmental policy documents:\n\n"
        f"{ctx_joined}\n\n"
        f"Based on the information above, answer the following question in clear and concise academic English:\n\n{query}"
    )

    rsp = client.chat.completions.create(
        model="deepseek-chat",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        temperature=0
    )
    return rsp.choices[0].message.content


def gen_no_rag(query):
    rsp = client.chat.completions.create(
        model="deepseek-chat",
        messages=[{"role":"user","content":query}],
        temperature=0
    )
    return rsp.choices[0].message.content


In [None]:
# ========= 6. Hybrid-RAG Construction =========
def merge_docs(*doc_lists, top_k=6, max_chars=1200):
    """Merge multi-source retrieval results, deduplicate by text prefix, truncate, and return top_k by score"""
    cache = {}
    for docs in doc_lists:
        for fn, txt, sc in docs:
            key = (fn, txt[:256])  # Use a prefix as the dedup key
            # Store the full text with the best score; the key holds only 256 chars
            if key not in cache or sc > cache[key][1]:
                cache[key] = (txt, sc)

    merged = sorted([(fn, txt[:max_chars], sc)
                     for (fn, _), (txt, sc) in cache.items()],
                    key=lambda x: x[2], reverse=True)
    return merged[:top_k]


def gen_hybrid_rag(query, *doc_lists):
    """Generate Hybrid-RAG answer: No-RAG draft + evidence augmentation"""
    # ① Obtain No-RAG draft
    draft = gen_no_rag(query)

    # ② Merge evidence paragraphs
    docs = merge_docs(*doc_lists)
    evidence_txt = "\n\n".join(f"[{i}] {d}" for i, (_, d, _) in enumerate(docs, 1))

    # ③ Enhance draft using evidence
    system_prompt = (
        "You are an expert environmental-policy assistant. "
        "Take the DRAFT answer the user already wrote, KEEP its structure, "
        "but augment it with precise facts drawn from the EVIDENCE below. "
        "Cite the evidence numbers (e.g. [1]) at relevant places. "
        "If draft statements conflict with evidence, correct them."
    )
    user_prompt = (
        f"DRAFT ANSWER:\n{draft}\n\n"
        f"EVIDENCE:\n{evidence_txt}\n\n"
        f"Please return the enhanced answer."
    )
    rsp = client.chat.completions.create(
        model="deepseek-chat",
        messages=[{"role":"system","content":system_prompt},
                  {"role":"user","content":user_prompt}],
        temperature=0
    )
    return rsp.choices[0].message.content, docs


In [None]:
# ================== 7. Example Run ==================
query = "What monitoring techniques are suitable for measuring PM2.5?"

# Retrieve once per method and reuse the results for generation
docs_A, docs_B, docs_C = retrieve_A(query), retrieve_B(query), retrieve_C(query)
ans_A = gen_with_ctx(query, docs_A)
ans_B = gen_with_ctx(query, docs_B)
ans_C = gen_with_ctx(query, docs_C)
ans_D         = gen_no_rag(query)

print("—— Experiment A (TF-IDF) ——\n", ans_A, "\n")
print("—— Experiment B (BM25) ——\n", ans_B, "\n")
print("—— Experiment C (SBERT+FAISS) ——\n", ans_C, "\n")
print("—— Experiment D (No-RAG) ——\n", ans_D)

# —— Experiment E (Hybrid-RAG) ——
ans_E, docs_E = gen_hybrid_rag(query, docs_A, docs_B, docs_C)
print("—— Experiment E (Hybrid-RAG) ——\n", ans_E)

show_sources(docs_E, "E")


—— Experiment A (TF-IDF) ——
 Suitable monitoring techniques for measuring PM2.5 include ground-based, aircraft-based, and space-based remote sensing methods, as well as integrated measuring systems. Satellite observations are noted for their cost-effectiveness and scalability, while unmanned aerial vehicles (UA Vs) represent an emerging platform for such measurements. Ground-based monitoring networks, which may consist of multiple distributed sites across urban, suburban, and rural areas, are essential for capturing localized variations and ensuring data accuracy. These techniques collectively support comprehensive air quality assessment and policy implementation. 

—— Experiment B (BM25) ——
 Multiple monitoring techniques are suitable for measuring PM2.5 concentrations. Ground-based instruments provide direct, high-resolution measurements at specific locations and are essential for regulatory compliance and health assessments. Remote sensing methods, including ground-based, aircraft-b

In [None]:
# ================== 8. Display Source Excerpts ==================
def show_sources(docs, label):
    print(f"\n===== Source Excerpts {label} =====")
    for i, (fn, txt, sc) in enumerate(docs, 1):
        print(f"\n[{i}] {fn} | Score: {sc:.3f}\n{txt}\n")

show_sources(docs_A, "A")
show_sources(docs_B, "B")
show_sources(docs_C, "C")


===== Source Excerpts A =====

[1] Advances in air quality research - current and emerging challenges.pdf | Score: 0.255
1.4 Measuring air pollution Measurements in the atmosphere are necessary not only duction, agriculture, trafﬁc, industry, health protection, or suring, and ground-based, aircraft-based, and space-based remote sensing techniques and integrated measuring tech- niques are available. Satellite observations are a growing growth is the use of unmanned aerial vehicles (UA Vs) for air pollution measurements (Gu et al., 2018).


[2] Advances in air quality research - current and emerging challenges.pdf | Score: 0.247
4.2 Current status and challenges tain lines of research and technical development are formu- high-resolution measurement networks by the installation of ground-based, aircraft-based, and space-based remote sens- ing techniques or integrated measuring techniques are no longer considered. Also, satellite observations, which are a cost-effective platforms, are not

## 2.3 LLaMa-3-8b

In [None]:
# ================== 5. LLaMa Generation ==================

client = openai.OpenAI(
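    # Read the key from the environment instead of hard-coding it (set OPENROUTER_API_KEY before running)
    api_key=os.environ.get("OPENROUTER_API_KEY", ""),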
    base_url="https://openrouter.ai/api/v1"
)

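def gen_with_ctx(query, docs, max_tokens=12000):
    """Answer `query` from retrieved `docs`; context is capped at about max_tokens (~4 chars per token)."""
    max_chars, acc, ctx = max_tokens * 4, 0, []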
    for _, d, _ in docs:
        if acc >= max_chars: break
        chunk = d[:max_chars - acc]
        ctx.append(chunk)
        acc += len(chunk)

    ctx_joined = "\n\n".join(ctx)

    system_prompt = (
        "You are an expert assistant in environmental policy research. "
        "When answering questions, do not refer to specific papers using phrases like 'this study' or 'the paper'. "
        "Instead, synthesize the content in an abstract, generalized manner, describing methods and findings without attributing them to individual sources."
    )

    user_prompt = (
        f"The following are excerpts from multiple environmental policy documents:\n\n"
        f"{ctx_joined}\n\n"
        f"Based on the information above, answer the following question in clear and concise academic English:\n\n{query}"
    )

    response = client.chat.completions.create(
        model="meta-llama/llama-3-8b-instruct",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        temperature=0.2
    )

    return response.choices[0].message.content


def gen_no_rag(query):
    response = client.chat.completions.create(
        model="meta-llama/llama-3-8b-instruct",
        messages=[{"role": "user", "content": query}],
        temperature=0.2
    )
    return response.choices[0].message.content
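
The GPT, DeepSeek, and LLaMa sections repeat the same generation functions and differ only in client, model name, and temperature. One way to avoid the duplication is a small factory that binds those three values once. The sketch below is illustrative: `make_chat` is a hypothetical helper, and `EchoClient` is a stand-in that mimics only the attribute path used in this notebook so the example runs without network access.

```python
from types import SimpleNamespace

def make_chat(client, model, temperature=0.0):
    """Return a function that sends messages to one fixed model."""
    def chat(messages):
        rsp = client.chat.completions.create(
            model=model, messages=messages, temperature=temperature
        )
        return rsp.choices[0].message.content
    return chat

class EchoClient:
    """Stand-in for an OpenAI-compatible client, used only for this demo."""
    def __init__(self):
        self.chat = SimpleNamespace(
            completions=SimpleNamespace(create=self._create)
        )

    def _create(self, model, messages, temperature):
        # Echo the call parameters instead of contacting a model
        text = f"{model}|{temperature}|{messages[-1]['content']}"
        msg = SimpleNamespace(content=text)
        return SimpleNamespace(choices=[SimpleNamespace(message=msg)])

chat_llama = make_chat(EchoClient(), "meta-llama/llama-3-8b-instruct", temperature=0.2)
print(chat_llama([{"role": "user", "content": "hello"}]))
```

With a real client, `gen_no_rag(query)` would reduce to `chat([{"role": "user", "content": query}])`, and switching providers would mean constructing a different `chat` function rather than redefining every helper.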


In [None]:
# ========= 6. Hybrid-RAG Construction =========

def merge_docs(*doc_lists, top_k=6, max_chars=1200):
    """Merge multi-source retrieval results, deduplicate by text prefix, truncate, and return top_k by score"""
    cache = {}
    for docs in doc_lists:
        for fn, txt, sc in docs:
            key = (fn, txt[:256])  # Use prefix to avoid duplicates
            # Store the full text with the best score; the key holds only 256 chars
            if key not in cache or sc > cache[key][1]:
                cache[key] = (txt, sc)

    merged = sorted([(fn, txt[:max_chars], sc)
                     for (fn, _), (txt, sc) in cache.items()],
                    key=lambda x: x[2], reverse=True)
    return merged[:top_k]


def gen_hybrid_rag(query, *doc_lists):
    """Hybrid-RAG: No-RAG draft + evidence augmentation"""

    # ① Obtain No-RAG draft
    draft_rsp = client.chat.completions.create(
        model="meta-llama/llama-3-8b-instruct",  # or llama-3-70b-instruct
        messages=[{"role": "user", "content": query}],
        temperature=0.2
    )
    draft = draft_rsp.choices[0].message.content

    # ② Merge evidence paragraphs
    docs = merge_docs(*doc_lists)
    evidence_txt = "\n\n".join(f"[{i}] {d}" for i, (_, d, _) in enumerate(docs, 1))

    # ③ Enhance draft with evidence
    system_prompt = (
        "You are an expert environmental-policy assistant. "
        "Take the DRAFT answer the user already wrote, KEEP its structure, "
        "but augment it with precise facts drawn from the EVIDENCE below. "
        "Cite the evidence numbers (e.g. [1]) at relevant places. "
        "If draft statements conflict with evidence, correct them."
    )
    user_prompt = (
        f"DRAFT ANSWER:\n{draft}\n\n"
        f"EVIDENCE:\n{evidence_txt}\n\n"
        f"Please return the enhanced answer."
    )

    enhanced_rsp = client.chat.completions.create(
        model="meta-llama/llama-3-8b-instruct",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        temperature=0.2
    )

    return enhanced_rsp.choices[0].message.content, docs


In [None]:
# ================== 7. Example Run ==================
query = "What is a Clean Air Zone and how is it implemented in the UK?"

# Retrieve once per method and reuse the results for generation
docs_A, docs_B, docs_C = retrieve_A(query), retrieve_B(query), retrieve_C(query)
ans_A = gen_with_ctx(query, docs_A)
ans_B = gen_with_ctx(query, docs_B)
ans_C = gen_with_ctx(query, docs_C)
ans_D         = gen_no_rag(query)

print("—— Experiment A (TF-IDF) ——\n", ans_A, "\n")
print("—— Experiment B (BM25) ——\n", ans_B, "\n")
print("—— Experiment C (SBERT+FAISS) ——\n", ans_C, "\n")
print("—— Experiment D (No-RAG) ——\n", ans_D)

# —— Experiment E (Hybrid-RAG) ——
ans_E, docs_E = gen_hybrid_rag(query, docs_A, docs_B, docs_C)
print("—— Experiment E (Hybrid-RAG) ——\n", ans_E)

show_sources(docs_E, "E")


—— Experiment A (TF-IDF) ——
 There is no information provided in the excerpts about Clean Air Zones or their implementation in the UK. The text mentions air quality issues, urbanization, and historical perspectives on air pollution, but does not specifically discuss Clean Air Zones or their implementation in the UK. 

—— Experiment B (BM25) ——
 Based on the context of environmental policy research, a Clean Air Zone (CAZ) is a regulatory measure aimed at improving air quality by restricting or charging polluters, typically in urban areas with poor air quality. In the UK, Clean Air Zones are implemented as part of a broader strategy to reduce emissions and improve public health.

Implementation of Clean Air Zones in the UK typically involves a combination of measures, including:

1. Charging or restricting polluters: Vehicles that do not meet certain emissions standards may be charged or restricted from entering certain areas, such as city centers.
2. Emissions standards: Vehicles are re

In [None]:
# ================== 8. Display Source Excerpts ==================
def show_sources(docs, label):
    print(f"\n===== Source Excerpts {label} =====")
    for i, (fn, txt, sc) in enumerate(docs, 1):
        print(f"\n[{i}] {fn} | Score: {sc:.3f}\n{txt}\n")

show_sources(docs_A, "A")
show_sources(docs_B, "B")
show_sources(docs_C, "C")



===== Source Excerpts A =====

[1] Advances in air quality research - current and emerging challenges.pdf | Score: 0.126
COSMO, ENVIRO-HIRLAM) successfully implemented (a hierarchy of) urban parameterizations with different com- plexities and reached suitable spatial resolutions (Baklanov tions implemented inside limited-area meteorological mod- els is becoming a common approach to drive urban air qual- scription in different climatic and environmental conditions features (Brousse et al., 2016) are continuing.


[2] Air pollution and public health emerging hazards and improved understanding of risk.pdf | Score: 0.125
awareness /C1Air quality communication Introduction Historical perspective Air pollution is now fully acknowledged to be a signiﬁcant public health problem, responsible for a growing range of health effects that are well docu- conducted in many regions of the world. Whilst there is no doubt that rapid urbanisation means that we are diverse variety of ambient air pollutant