## **Unstructured Document Retrieval Pipeline**

In [1]:
# Setup unstructured for all document types

%pip install "unstructured[all-docs]"

Note: you may need to restart the kernel to use updated packages.


In [2]:
# Warning control
import warnings
warnings.filterwarnings('ignore')

In [3]:
from unstructured.chunking.title import chunk_by_title
from unstructured.partition.md import partition_md
from unstructured.partition.pdf import partition_pdf
from unstructured.staging.base import dict_to_elements

In [4]:
# watermark: IPython extension that prints the versions of imported packages

%pip install watermark

%load_ext watermark
%watermark --iversions

Note: you may need to restart the kernel to use updated packages.
unstructured: 0.18.27



In [5]:
import unstructured.partition

help(unstructured.partition)

Help on package unstructured.partition in unstructured:

NAME
    unstructured.partition

PACKAGE CONTENTS
    api
    auto
    common (package)
    csv
    doc
    docx
    email
    epub
    html (package)
    image
    json
    md
    model_init
    msg
    ndjson
    odt
    org
    pdf
    pdf_image (package)
    ppt
    pptx
    rst
    rtf
    strategies
    text
    text_type
    tsv
    utils (package)
    xlsx
    xml

FILE
    e:\the leo programmer\internships\nrsc\assignments\testing-models-huge-dataset\.venv\lib\site-packages\unstructured\partition\__init__.py




### Preprocessing the PDF

In [6]:
partition_pdf??

Signature:
partition_pdf(
    filename: 'Optional[str]' = None,
    file: 'Optional[IO[bytes]]' = None,
    include_page_breaks: 'bool' = False,
    strategy: 'str' = 'auto',
    infer_table_structure: 'bool' = False,
    ocr_languages: 'Optional[str]' = None,
    languages: 'Optional[list[str]]' = None,
    detect_language_per_element: 'bool' = False,
    metadata_last_modified: 'Optional[str]' = None,
    chunking_strategy: 'Optional[str]' = None,
    hi_res_model_name: 'Optional[str]' = None,
    extract_images_in_pdf: 'bool' = False,
    extract_image_block_types: 'Optional[list[str]]' = None

In [7]:
from unstructured.partition.pdf import partition_pdf

# Specify the path to your PDF file
filename = "../data/pdf/F-62.pdf"

# Extract images, tables, and chunk text
pdf_elements = partition_pdf(
    filename=filename,
    extract_images_in_pdf=False,
    #strategy = "hi_res",
    strategy = "fast",
    hi_res_model_name="yolox",
    infer_table_structure=True,
    chunking_strategy="by_title",
    max_characters=3000,
    #new_after_n_chars=3800,
    combine_text_under_n_chars=200,
    #extract_image_block_output_dir=path,
)

Cannot set non-stroke color because expected 3 components but got [0.9]
Cannot set non-stroke color because expected 3 components but got [0.9]
Cannot set non-stroke color because expected 3 components but got [0.9]
Cannot set non-stroke color because expected 3 components but got [0.9]
Cannot set non-stroke color because expected 3 components but got [0.9]




In [None]:
pdf_elements

[<unstructured.documents.elements.CompositeElement at 0x2598b8b9ee0>,
 <unstructured.documents.elements.CompositeElement at 0x2598b8ba3c0>,
 <unstructured.documents.elements.CompositeElement at 0x2598b975d90>,
 <unstructured.documents.elements.CompositeElement at 0x2598b9754c0>,
 <unstructured.documents.elements.CompositeElement at 0x2598b9753d0>,
 <unstructured.documents.elements.CompositeElement at 0x2598b977470>,
 <unstructured.documents.elements.CompositeElement at 0x2598ba3adb0>,
 <unstructured.documents.elements.CompositeElement at 0x2598ba399d0>,
 <unstructured.documents.elements.CompositeElement at 0x2598ba3a5d0>,
 <unstructured.documents.elements.CompositeElement at 0x2598ba3b1a0>,
 <unstructured.documents.elements.CompositeElement at 0x2598ba2d310>,
 <unstructured.documents.elements.CompositeElement at 0x2598ba2f290>,
 <unstructured.documents.elements.CompositeElement at 0x2598ba2cfe0>,
 <unstructured.documents.elements.CompositeElement at 0x2598ba2c410>,
 <unstructured.docum

In [None]:
pdf_elements[:5]  # Display the first 5 elements

[<unstructured.documents.elements.CompositeElement at 0x2598b8b9ee0>,
 <unstructured.documents.elements.CompositeElement at 0x2598b8ba3c0>,
 <unstructured.documents.elements.CompositeElement at 0x2598b975d90>,
 <unstructured.documents.elements.CompositeElement at 0x2598b9754c0>,
 <unstructured.documents.elements.CompositeElement at 0x2598b9753d0>]

In [10]:
# Create a dictionary to store counts of each type
category_counts = {}

for element in pdf_elements:
    category = str(type(element))
    if category in category_counts:
        category_counts[category] += 1
    else:
        category_counts[category] = 1

# Unique_categories will have unique elements
unique_categories = set(category_counts.keys())
category_counts

{"<class 'unstructured.documents.elements.CompositeElement'>": 46}
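The same tally can be written more compactly with `collections.Counter`. The sketch below is a minimal illustration: the two element classes are hypothetical stand-ins so the snippet runs without `unstructured`, and it counts by class name instead of `str(type(...))`.

```python
from collections import Counter

# Hypothetical stand-in classes so the sketch runs on its own.
class Title: ...
class NarrativeText: ...

sample_elements = [Title(), NarrativeText(), NarrativeText()]

# Counter handles the "first time seen" case that the if/else covers above.
counts = Counter(type(el).__name__ for el in sample_elements)
print(dict(counts))
```

With real elements, `Counter(el.category for el in pdf_elements)` would give the same kind of summary keyed by category name.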

In [11]:
element_dict = [el.to_dict() for el in pdf_elements]

unique_types = set()

for item in element_dict:
    unique_types.add(item['type'])

print(unique_types)

{'CompositeElement'}


In [12]:
# Extract images, tables, and chunk text
pdf_elements = partition_pdf(
    filename=filename,
    extract_images_in_pdf=False,
    strategy = "fast",
    hi_res_model_name="yolox",
    infer_table_structure=True,
    #chunking_strategy="by_title",
    max_characters=3000,
    #new_after_n_chars=3800,
    combine_text_under_n_chars=200,
    #extract_image_block_output_dir=path,
)

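# Note: category_counts is not reset here, so the 46 CompositeElement
# entries from the chunked run above carry over into the printed totals.
for element in pdf_elements:
    category = str(type(element))
    if category in category_counts:
        category_counts[category] += 1
    else:
        category_counts[category] = 1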

# Unique_categories will have unique elements
unique_categories = set(category_counts.keys())
print(category_counts)

element_dict = [el.to_dict() for el in pdf_elements]

unique_types = set()

for item in element_dict:
    unique_types.add(item['type'])

print(unique_types)

Cannot set non-stroke color because expected 3 components but got [0.9]


Cannot set non-stroke color because expected 3 components but got [0.9]
Cannot set non-stroke color because expected 3 components but got [0.9]
Cannot set non-stroke color because expected 3 components but got [0.9]
Cannot set non-stroke color because expected 3 components but got [0.9]


{"<class 'unstructured.documents.elements.CompositeElement'>": 46, "<class 'unstructured.documents.elements.Title'>": 129, "<class 'unstructured.documents.elements.NarrativeText'>": 89, "<class 'unstructured.documents.elements.ListItem'>": 18, "<class 'unstructured.documents.elements.Text'>": 31, "<class 'unstructured.documents.elements.Header'>": 6, "<class 'unstructured.documents.elements.Footer'>": 3}
{'Title', 'Header', 'ListItem', 'NarrativeText', 'Footer', 'UncategorizedText'}


In [13]:
pdf_elements[0].to_dict()

{'type': 'Title',
 'element_id': 'c61a9b38a9d6298964b86638210a4622',
 'text': 'QUALITY MEASURES FOR HUMANITARIAN DATA',
 'metadata': {'coordinates': {'points': ((83.5767, 330.3082),
    (83.5767, 396.9582),
    (537.5593000000001, 396.9582),
    (537.5593000000001, 330.3082)),
   'system': 'PixelSpace',
   'layout_width': 612.0,
   'layout_height': 792.0},
  'file_directory': '../data/pdf',
  'filename': 'F-62.pdf',
  'last_modified': '2026-01-19T15:23:02',
  'page_number': 1,
  'languages': ['eng'],
  'filetype': 'application/pdf'}}

In [14]:
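# infer_table_structure only takes effect with strategy="hi_res";
# the "fast" run above produces no Table elements.
tables = [el for el in pdf_elements if el.category == "Table"]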

In [15]:
tables

[]

In [16]:
#table_html = tables[0].metadata.text_as_html

In [17]:
titles = [el for el in pdf_elements if el.category == "Title"]
titles

[<unstructured.documents.elements.Title at 0x2598babeab0>,
 <unstructured.documents.elements.Title at 0x2598babd880>,
 <unstructured.documents.elements.Title at 0x2598babe900>,
 <unstructured.documents.elements.Title at 0x2598ba81460>,
 <unstructured.documents.elements.Title at 0x2598ba804d0>,
 <unstructured.documents.elements.Title at 0x2598b9f3950>,
 <unstructured.documents.elements.Title at 0x2598ba2ce60>,
 <unstructured.documents.elements.Title at 0x2598ba2ec60>,
 <unstructured.documents.elements.Title at 0x2598b8a78c0>,
 <unstructured.documents.elements.Title at 0x2598b8b9820>,
 <unstructured.documents.elements.Title at 0x2598ba2fdd0>,
 <unstructured.documents.elements.Title at 0x2598b8c4b90>,
 <unstructured.documents.elements.Title at 0x2598ba2cd40>,
 <unstructured.documents.elements.Title at 0x2598ba95f70>,
 <unstructured.documents.elements.Title at 0x2598ba96240>,
 <unstructured.documents.elements.Title at 0x2598ba94b90>,
 <unstructured.documents.elements.Title at 0x2598ba95550

In [18]:
title_html = titles[0].metadata.text_as_html

In [19]:
titles[0]

<unstructured.documents.elements.Title at 0x2598babeab0>

In [None]:
# See what's in the Title element
print(titles[0].text)              # Title text content
print(titles[0].category)          # "Title"
print(titles[0].metadata.text_as_html)  # None for Titles

QUALITY MEASURES FOR HUMANITARIAN DATA
Title
None


In [21]:
from io import StringIO 
from lxml import etree

# 1. Get tables from PDF
tables = [el for el in pdf_elements if el.category == "Table"]

if tables:
    # 2. Get HTML from FIRST table
    table_html = tables[0].metadata.text_as_html
    
    if table_html:  # Check if not None
        # 3. Parse HTML
        parser = etree.XMLParser(remove_blank_text=True)
        file_obj = StringIO(table_html)
        tree = etree.parse(file_obj, parser)
        print(etree.tostring(tree, pretty_print=True).decode())
    else:
        print("No HTML available - use infer_table_structure=True")
else:
    print("No tables found in PDF")

No tables found in PDF


In [22]:
# CORRECT way to use Titles
for title in titles:
    print(f"Title: {title.text}")
    print(f"Page: {title.metadata.page_number}")
    print(f"Filename: {title.metadata.filename}")
    print("---")

Title: QUALITY MEASURES FOR HUMANITARIAN DATA
Page: 1
Filename: F-62.pdf
---
Title: SPRINT REPORT
Page: 1
Filename: F-62.pdf
---
Title: APRIL 2023
Page: 1
Filename: F-62.pdf
---
Title: QUALITY MEASURES FOR HUMANITARIAN DATA
Page: 2
Filename: F-62.pdf
---
Title: SPRINT REPORT
Page: 2
Filename: F-62.pdf
---
Title: APRIL 2023
Page: 2
Filename: F-62.pdf
---
Title: MEET THE TEAM
Page: 3
Filename: F-62.pdf
---
Title: Lead Kasia Chmielinski
Page: 3
Filename: F-62.pdf
---
Title: Technology Matt Taylor
Page: 3
Filename: F-62.pdf
---
Title: Research Sarah Newman
Page: 3
Filename: F-62.pdf
---
Title: Design Jessica Yurkofsky
Page: 3
Filename: F-62.pdf
---
Title: Design Chelsea Qiu
Page: 3
Filename: F-62.pdf
---
Title: i
Page: 3
Filename: F-62.pdf
---
Title: I. BACKGROUND 1
Page: 4
Filename: F-62.pdf
---
Title: II. CHALLENGES 2
Page: 4
Filename: F-62.pdf
---
Title: III. KEY FINDINGS 4
Page: 4
Filename: F-62.pdf
---
Title: IV. APPROACH & DIRECTIONS 6
Page: 4
Filename: F-62.pdf
---
Title: V. RECOMME

In [23]:
# One chunk per Title element (title text only; body text is not attached yet)
chunks = []

for i, title in enumerate(titles):
    chunk = {
        "title": title.text,
        "content": title.text,  # Start with title
        "page": title.metadata.page_number,
        "source": title.metadata.filename,
        "type": "section"
    }
    chunks.append(chunk)

print(f"---> Created {len(chunks)} title-based chunks for RAG")

---> Created 129 title-based chunks for RAG


In [24]:
# Better approach - ALL elements for hybrid retrieval
for el in pdf_elements:
    print(f"{el.category:15} | Page {el.metadata.page_number} | {el.text[:80]}...")

Title           | Page 1 | QUALITY MEASURES FOR HUMANITARIAN DATA...
Title           | Page 1 | SPRINT REPORT...
Title           | Page 1 | APRIL 2023...
Title           | Page 2 | QUALITY MEASURES FOR HUMANITARIAN DATA...
Title           | Page 2 | SPRINT REPORT...
Title           | Page 2 | APRIL 2023...
Title           | Page 3 | MEET THE TEAM...
Title           | Page 3 | Lead Kasia Chmielinski...
Title           | Page 3 | Technology Matt Taylor...
Title           | Page 3 | Research Sarah Newman...
Title           | Page 3 | Design Jessica Yurkofsky...
Title           | Page 3 | Design Chelsea Qiu...
Title           | Page 3 | i...
Title           | Page 4 | I. BACKGROUND 1...
Title           | Page 4 | II. CHALLENGES 2...
Title           | Page 4 | III. KEY FINDINGS 4...
Title           | Page 4 | IV. APPROACH & DIRECTIONS 6...
Title           | Page 4 | V. RECOMMENDATIONS & DESIGNS 11...
Title           | Page 4 | VI. ROADMAP & IMPLEMENTATION 15...
Title           | Page 4 | VI

In [25]:
# Create proper chunks with metadata
chunks = []

for el in pdf_elements:
    chunk = {
        "content": el.text,
        "category": el.category,  # Title, Table, Text, etc.
        "page": el.metadata.page_number,
        "source": el.metadata.filename,
        "is_table": el.category == "Table",
        "table_html": el.metadata.text_as_html if el.category == "Table" else None
    }
    chunks.append(chunk)

print(f"---> {len(chunks)} chunks ready for BM25 + Semantic indexing")

---> 276 chunks ready for BM25 + Semantic indexing
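For the keyword side of the index, a minimal pure-Python sketch of Okapi BM25 scoring is shown below. It is an illustration under stated assumptions, not the retriever used later in this notebook: tokenization is plain lowercase whitespace splitting, `k1` and `b` are common default values, and the sample strings are made up from titles seen above.

```python
import math
from collections import Counter

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with Okapi BM25 (whitespace tokens)."""
    if not docs:
        return []
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Smoothed IDF variant that stays positive for common terms.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            tf = term_freq[term]
            # Length normalization: longer-than-average documents are penalized.
            norm = tf + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores

sample_chunks = [
    "II. CHALLENGES There are many challenges to dataset quality",
    "MEET THE TEAM Lead Technology Research Design",
    "Quality measures for humanitarian data sprint report",
]
scores = bm25_scores("dataset challenges", sample_chunks)
best = max(range(len(scores)), key=scores.__getitem__)
print("best match:", sample_chunks[best])
```

Exact-term matching like this complements embedding search: it rewards chunks that contain the query words verbatim, which helps with names and section labels.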


In [26]:
# Take the first element whose category is "Title"
title = [
    el for el in pdf_elements
    if el.category == "Title"
][0]

In [27]:
title.to_dict()

{'type': 'Title',
 'element_id': 'c61a9b38a9d6298964b86638210a4622',
 'text': 'QUALITY MEASURES FOR HUMANITARIAN DATA',
 'metadata': {'coordinates': {'points': ((83.5767, 330.3082),
    (83.5767, 396.9582),
    (537.5593000000001, 396.9582),
    (537.5593000000001, 330.3082)),
   'system': 'PixelSpace',
   'layout_width': 612.0,
   'layout_height': 792.0},
  'file_directory': '../data/pdf',
  'filename': 'F-62.pdf',
  'last_modified': '2026-01-19T15:23:02',
  'page_number': 1,
  'languages': ['eng'],
  'filetype': 'application/pdf'}}

In [28]:
title.id

'c61a9b38a9d6298964b86638210a4622'

In [29]:
for el in pdf_elements:
    print(el.to_dict())

{'type': 'Title', 'element_id': 'c61a9b38a9d6298964b86638210a4622', 'text': 'QUALITY MEASURES FOR HUMANITARIAN DATA', 'metadata': {'coordinates': {'points': ((83.5767, 330.3082), (83.5767, 396.9582), (537.5593000000001, 396.9582), (537.5593000000001, 330.3082)), 'system': 'PixelSpace', 'layout_width': 612.0, 'layout_height': 792.0}, 'file_directory': '../data/pdf', 'filename': 'F-62.pdf', 'last_modified': '2026-01-19T15:23:02', 'page_number': 1, 'languages': ['eng'], 'filetype': 'application/pdf'}}
{'type': 'Title', 'element_id': '7da0b5e778fd30f37d7ec6ba7a536453', 'text': 'SPRINT REPORT', 'metadata': {'coordinates': {'points': ((244.796, 421.828), (244.796, 434.178), (367.1871, 434.178), (367.1871, 421.828)), 'system': 'PixelSpace', 'layout_width': 612.0, 'layout_height': 792.0}, 'file_directory': '../data/pdf', 'filename': 'F-62.pdf', 'last_modified': '2026-01-19T15:23:02', 'page_number': 1, 'languages': ['eng'], 'filetype': 'application/pdf'}}
{'type': 'Title', 'element_id': '88b950

In [30]:
x = [ele.to_dict() for ele in pdf_elements]
x[0]

{'type': 'Title',
 'element_id': 'c61a9b38a9d6298964b86638210a4622',
 'text': 'QUALITY MEASURES FOR HUMANITARIAN DATA',
 'metadata': {'coordinates': {'points': ((83.5767, 330.3082),
    (83.5767, 396.9582),
    (537.5593000000001, 396.9582),
    (537.5593000000001, 330.3082)),
   'system': 'PixelSpace',
   'layout_width': 612.0,
   'layout_height': 792.0},
  'file_directory': '../data/pdf',
  'filename': 'F-62.pdf',
  'last_modified': '2026-01-19T15:23:02',
  'page_number': 1,
  'languages': ['eng'],
  'filetype': 'application/pdf'}}

In [31]:
chunk_by_title??

Signature:
chunk_by_title(
    elements: 'Iterable[Element]',
    *,
    combine_text_under_n_chars: 'Optional[int]' = None,
    include_orig_elements: 'Optional[bool]' = None,
    max_characters: 'Optional[int]' = None,
    multipage_sections: 'Optional[bool]' = None,
    new_after_n_chars: 'Optional[int]' = None,
    overlap: 'Optional[int]' = None,
    overlap_all: 'Optional[bool]' = None,
) -> 'list[Element]'
Source:
def chunk_by_title(
    elements: Iterable[Element],
    *,
    combine_text_under_n_chars: Optional[int] = None,
    include_orig_elements: Optional[bool] = None,
    max_characters: Optional[int] = None,
    multipage_sections: O

In [32]:
IGNORE = {"Header", "Footer", "UncategorizedText"}

elements = [
    el for el in pdf_elements
    if el.category not in IGNORE and el.text.strip()
]

In [33]:
# Group elements by section title
# (a Title that is followed by no body elements is skipped)

sections = []
current_section = {
    "title": None,
    "elements": [],
    "page_start": None
}

for el in elements:
    if el.category == "Title":
        if current_section["elements"]:
            sections.append(current_section)

        current_section = {
            "title": el.text,
            "elements": [],
            "page_start": el.metadata.page_number
        }
    else:
        current_section["elements"].append(el)

# flush
if current_section["elements"]:
    sections.append(current_section)

## 🧪 Input: simplified `elements` list (in order)

Assume this is what `elements` looks like:

```text
1. Title           | "I. BACKGROUND"          | page 5
2. NarrativeText   | "The purpose of this..." | page 5
3. ListItem        | "User research..."       | page 5
4. NarrativeText   | "We conducted..."        | page 5
5. Title           | "II. CHALLENGES"         | page 6
6. NarrativeText   | "There are many..."      | page 6
7. NarrativeText   | "Challenge 1 - ..."      | page 6
```

---

## 🧠 Initial state (before loop starts)

```python
sections = []

current_section = {
    "title": None,
    "elements": [],
    "page_start": None
}
```

---

## 🔁 Loop iteration-by-iteration

---

### ▶ Iteration 1

**Element:** `Title | "I. BACKGROUND"`

```python
if el.category == "Title":  # TRUE
```

Check:

```python
if current_section["elements"]:  # []
```

→ FALSE (empty)

So nothing is appended yet.

Now start new section:

```python
current_section = {
    "title": "I. BACKGROUND",
    "elements": [],
    "page_start": 5
}
```

📌 **State now:**

```
current_section = BACKGROUND (empty)
sections = []
```

---

### ▶ Iteration 2

**Element:** `NarrativeText | "The purpose of this..."`

```python
if el.category == "Title":  # FALSE
```

So:

```python
current_section["elements"].append(el)
```

📌 **State now:**

```
current_section:
  title = "I. BACKGROUND"
  elements = ["The purpose of this..."]
sections = []
```

---

### ▶ Iteration 3

**Element:** `ListItem | "User research..."`

Same path:

```python
current_section["elements"].append(el)
```

📌 **State now:**

```
current_section:
  title = "I. BACKGROUND"
  elements = [
    "The purpose of this...",
    "User research..."
  ]
```

---

### ▶ Iteration 4

**Element:** `NarrativeText | "We conducted..."`

Append again.

📌 **State now:**

```
current_section:
  title = "I. BACKGROUND"
  elements = [
    "The purpose of this...",
    "User research...",
    "We conducted..."
  ]
```

---

### ▶ Iteration 5

**Element:** `Title | "II. CHALLENGES"`

```python
if el.category == "Title":  # TRUE
```

Now check:

```python
if current_section["elements"]:  # NOT empty
```

So we **save the previous section**:

```python
sections.append(current_section)
```

📌 **sections now contains:**

```
[
  {
    title: "I. BACKGROUND",
    elements: [3 items],
    page_start: 5
  }
]
```

Now start a new section:

```python
current_section = {
    "title": "II. CHALLENGES",
    "elements": [],
    "page_start": 6
}
```

📌 **State now:**

```
current_section = CHALLENGES (empty)
sections = [BACKGROUND]
```

---

### ▶ Iteration 6

**Element:** `NarrativeText | "There are many..."`

Append:

```python
current_section["elements"].append(el)
```

📌 **State:**

```
current_section:
  title = "II. CHALLENGES"
  elements = ["There are many..."]
```

---

### ▶ Iteration 7

**Element:** `NarrativeText | "Challenge 1 - ..."`

Append again.

📌 **State:**

```
current_section:
  title = "II. CHALLENGES"
  elements = [
    "There are many...",
    "Challenge 1 - ..."
  ]
```

---

## 🔚 Loop ends

Now this code runs:

```python
if current_section["elements"]:
    sections.append(current_section)
```

This saves the **final section**, which otherwise would be lost.

---

## ✅ Final output: `sections`

```python
[
  {
    "title": "I. BACKGROUND",
    "elements": [
      "The purpose of this...",
      "User research...",
      "We conducted..."
    ],
    "page_start": 5
  },
  {
    "title": "II. CHALLENGES",
    "elements": [
      "There are many...",
      "Challenge 1 - ..."
    ],
    "page_start": 6
  }
]
```

---

## 🧠 What changed from before?

Before:

```
7 independent elements
```

After:

```
2 meaningful sections
```

This is the **exact transformation** your code performs.

---

## 🔥 Why this matters

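* Chunk boundaries follow the document's **section structure** instead of arbitrary offsets
* A retrieved chunk stays within **one topic**, so related sentences arrive together
* The LLM receives the surrounding section context instead of having to guess it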
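The walkthrough above can be reproduced with a small runnable sketch. `FakeElement` is a hypothetical stand-in for an `unstructured` element (it stores `page_number` directly rather than under `metadata`), and `group_by_title` wraps the same loop in a function. The first sample Title has no body text, which shows that such titles are dropped.

```python
from dataclasses import dataclass

@dataclass
class FakeElement:
    """Hypothetical stand-in for an unstructured element."""
    category: str
    text: str
    page_number: int

def group_by_title(elements):
    """Group non-Title elements under the most recent Title."""
    sections = []
    current = {"title": None, "elements": [], "page_start": None}
    for el in elements:
        if el.category == "Title":
            # Close the previous section only if it collected body elements.
            if current["elements"]:
                sections.append(current)
            current = {"title": el.text, "elements": [], "page_start": el.page_number}
        else:
            current["elements"].append(el)
    if current["elements"]:  # flush the last open section
        sections.append(current)
    return sections

sample = [
    FakeElement("Title", "SPRINT REPORT", 1),  # no body follows, so it is dropped
    FakeElement("Title", "I. BACKGROUND", 5),
    FakeElement("NarrativeText", "The purpose of this...", 5),
    FakeElement("ListItem", "User research...", 5),
    FakeElement("Title", "II. CHALLENGES", 6),
    FakeElement("NarrativeText", "There are many...", 6),
]
demo_sections = group_by_title(sample)
for sec in demo_sections:
    print(sec["title"], len(sec["elements"]), sec["page_start"])
```

Because empty sections are skipped, cover-page lines that are all classified as Title never become chunks on their own.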


In [34]:
current_section

{'title': 'Figure 9. HDX Search view with Quality Measure counts.',
 'elements': [<unstructured.documents.elements.NarrativeText at 0x2598bb2f2f0>],
 'page_start': 28}

In [None]:
# Section wise chunking: grouping by titles and splitting large sections

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,       # measured in characters (default length function), not tokens
    chunk_overlap=200,     # characters shared between consecutive chunks
    separators=["\n\n", "\n", ". ", " "]
)

chunks = []

for sec in sections:
    # merge section text
    section_text = "\n\n".join(el.text for el in sec["elements"])

    # split section into chunks
    sub_chunks = splitter.split_text(section_text)

    # prepend title to each sub-chunk
    for sub in sub_chunks:
        cleaned = sub.lstrip(" .\n")
        chunks.append({
            "content": f"{sec['title']}\n\n{cleaned}",
            "title": sec['title'],                    # section title
            "page_start": sec['page_start'],          # first page of section
            "page_end": sec['elements'][-1].metadata.page_number,  # last page of section
            "source": sec['elements'][0].metadata.filename,        # file name
            "category": [el.category for el in sec["elements"]]   # list of element types
        })
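To make the interaction between `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified fixed-window splitter. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that one splits on the separators first); the function name `split_with_overlap` is hypothetical and the sketch only shows how size and overlap, both in characters, determine where windows start.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    pieces = []
    for start in range(0, len(text), step):
        pieces.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return pieces

print(split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1))
# → ['abcd', 'defg', 'ghij']
```

Each window repeats the last character of the previous one, which is the role the 200-character overlap plays in the cell above.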

In [36]:
for i, ch in enumerate(chunks[:4]):
    print(f"\n--- Chunk {i} ---")
    print(ch["content"])


--- Chunk 0 ---
Goals

The purpose of this Data Labeling project was for select members of the Data Nutrition Project team to research and prototype possible quality measures for humanitarian datasets that are hosted on the HDX platform, which is owned and managed by the UN Centre for Humanitarian Data. The scope included:

User and Platform Research. We conducted user research with the Centre team (and additional stakeholders suggested by the Centre) to learn about 1) Different conceptions of data quality in the humanitarian sector; 2) How users find and select data on HDX, including priority of criteria; 3) The current DPT / HDX team QA workflow with regards to assessing data quality.

Quality Measurement Prototype. Building on user research and an assessment of the state of the data and the needs in play, and using two preselected datasets as examples, we prototyped a quality measures label for HDX. The prototyping involved varying degrees of fidelity and was shaped by feedback fro

In [37]:
chunks[0]['content']

'Goals\n\nThe purpose of this Data Labeling project was for select members of the Data Nutrition Project team to research and prototype possible quality measures for humanitarian datasets that are hosted on the HDX platform, which is owned and managed by the UN Centre for Humanitarian Data. The scope included:\n\nUser and Platform Research. We conducted user research with the Centre team (and additional stakeholders suggested by the Centre) to learn about 1) Different conceptions of data quality in the humanitarian sector; 2) How users find and select data on HDX, including priority of criteria; 3) The current DPT / HDX team QA workflow with regards to assessing data quality.\n\nQuality Measurement Prototype. Building on user research and an assessment of the state of the data and the needs in play, and using two preselected datasets as examples, we prototyped a quality measures label for HDX. The prototyping involved varying degrees of fidelity and was shaped by feedback from the Cent

In [38]:
chunks[4]

{'content': 'II. CHALLENGES\n\nThere are many challenges that can be impediments to dataset quality. This is certainly the case in the humanitarian sector, where crises unfold quickly and data capture will almost always be imperfect, often as a consequence of the need for rapid collection. Furthermore, on a more philosophical level, the assigning of rankings, scores, or grades to a dataset will always be tricky business, for the legitimacy of the scoring standards themselves can undermine the effort for scoring in the first place. We believe it is useful to explicitly enumerate these challenges before we describe our recommendations. The latter were formed in light of the former, which will be familiar to the HDX team and to others who have worked on dataset metrics, measures, and assessments.\n\nChallenge 1 - Identifying scoring methods that are succinct while not overly simplistic',
 'title': 'II. CHALLENGES',
 'page_start': 6,
 'page_end': 7,
 'source': 'F-62.pdf',
 'category': ['Na

In [39]:
chunks[5]

{'content': 'II. CHALLENGES\n\nChallenge 1 - Identifying scoring methods that are succinct while not overly simplistic\n\nScores are meant to provide information quickly and ease comparison, while inviting further exploration. They can, however, risk being reductive or overly simplistic. This is particularly difficult when comparing datasets whose provenances are entirely different. A score that is too simplistic will not only be useless but may also seem arbitrary. A single score to compare across inconsistent data types or domains may risk both. Depending on the scoring framework, there is the additional challenge of validating accuracy: what is the rubric by which this score was determined? How is accuracy of evaluation defined and ensured?\n\nChallenge 2 - Balancing scalable (quantitative) & comprehensive (qualitative) measures',
 'title': 'II. CHALLENGES',
 'page_start': 6,
 'page_end': 7,
 'source': 'F-62.pdf',
 'category': ['NarrativeText',
  'NarrativeText',
  'NarrativeText',


In [40]:
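sections[0]
# Note: `sec` is the loop variable left over from the chunking cell,
# so this joins the text of the LAST section, not sections[0].
section_text = "\n\n".join(el.text for el in sec["elements"])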

In [41]:
section_text

'Figure 11. “Similar Datasets” comparison sketch, to be refined in Phase 3.'

### **Generate embeddings and index with Chroma**

In [None]:
# Generate embeddings and index with Chroma

from langchain_core.documents import Document
from langchain_ollama import OllamaEmbeddings
from langchain_chroma import Chroma

chunk_docs = [
    Document(
        page_content=chunk["content"],
        metadata={
            "title": chunk["title"],
            "page_start": chunk["page_start"],
            "page_end": chunk["page_end"],
            "source": chunk["source"],
            "category": ", ".join(chunk["category"])  # convert list to string
        }
    )
    for chunk in chunks
]

print("Generating embeddings and indexing...")
embeddings = OllamaEmbeddings(model="nomic-embed-text")

vectorstore = Chroma.from_documents(
    documents=chunk_docs,
    embedding=embeddings,
    persist_directory="./chroma_db",
    collection_name="unstructured_test"
)

Generating embeddings and indexing...


In [None]:
# Test retrieval
vectorstore_retriever = vectorstore.as_retriever(
    search_kwargs={"k": 5}
)

# Test query
query = "Graduation rates of Students with disabilities in higher education institutions"
relevant_docs = vectorstore_retriever.invoke(query)

print(f"Found {len(relevant_docs)} relevant chunks:")
for doc in relevant_docs:
    print("\n---")
    print(f"Source: {doc.metadata['source']}")
    print(f"Pages: {doc.metadata['page_start']} - {doc.metadata['page_end']}")
    print(f"Title: {doc.metadata['title']}")
    print(f"Content: {doc.page_content[:500]}...")

Found 5 relevant chunks:

---
Source: F-62.pdf
Pages: 14 - 14
Title: / Owner and Third-Party expert
Content: / Owner and Third-Party expert

organizations. These could be binary...

---
Source: F-62.pdf
Pages: 14 - 14
Title: / Owner and Third-Party expert
Content: / Owner and Third-Party expert

organizations. These could be binary...

---
Source: F-62.pdf
Pages: 14 - 14
Title: / Owner and Third-Party expert
Content: / Owner and Third-Party expert

organizations. These could be binary...

---
Source: F-62.pdf
Pages: 14 - 14
Title: / Owner and Third-Party expert
Content: / Owner and Third-Party expert

organizations. These could be binary...

---
Source: F-62.pdf
Pages: 14 - 14
Title: / Owner and Third-Party expert
Content: / Owner and Third-Party expert

organizations. These could be binary...
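All five results above are the same chunk. One plausible explanation (an assumption; the notebook does not confirm it) is that the indexing cell was run several times against the same `persist_directory`, adding another copy of every chunk on each run. A minimal sketch of one mitigation is to derive a stable ID from each chunk's content and keep one chunk per ID. The helper `content_id` and the sample data are hypothetical.

```python
import hashlib

def content_id(text: str, source: str) -> str:
    """Stable ID derived from the chunk text and its source file."""
    return hashlib.sha256(f"{source}\n{text}".encode("utf-8")).hexdigest()

sample_chunks = [
    {"content": "II. CHALLENGES\n\nThere are many...", "source": "F-62.pdf"},
    {"content": "II. CHALLENGES\n\nThere are many...", "source": "F-62.pdf"},  # exact duplicate
    {"content": "Goals\n\nThe purpose of this...", "source": "F-62.pdf"},
]

# Keep one chunk per ID; a dict preserves first-seen order.
unique = {}
for chunk in sample_chunks:
    unique.setdefault(content_id(chunk["content"], chunk["source"]), chunk)

ids = list(unique)
deduped_chunks = list(unique.values())
print(len(sample_chunks), "->", len(deduped_chunks))
```

The resulting `ids` list could then be supplied through the `ids` argument when building the vector store, so that re-running the cell overwrites existing entries instead of appending new ones; verify this behavior against the installed `langchain_chroma` version.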


In [44]:
similar_docs = vectorstore.similarity_search("Graduation rates of Students with disabilities in higher education institutions")

for doc in similar_docs:
    print("\n---")
    print(f"Source: {doc.metadata['source']}")
    print(f"Pages: {doc.metadata['page_start']} - {doc.metadata['page_end']}")
    print(f"Title: {doc.metadata['title']}")
    print(f"Content: {doc.page_content[:500]}...")


---
Source: F-62.pdf
Pages: 14 - 14
Title: / Owner and Third-Party expert
Content: / Owner and Third-Party expert

organizations. These could be binary...

---
Source: F-62.pdf
Pages: 14 - 14
Title: / Owner and Third-Party expert
Content: / Owner and Third-Party expert

organizations. These could be binary...

---
Source: F-62.pdf
Pages: 14 - 14
Title: / Owner and Third-Party expert
Content: / Owner and Third-Party expert

organizations. These could be binary...

---
Source: F-62.pdf
Pages: 14 - 14
Title: / Owner and Third-Party expert
Content: / Owner and Third-Party expert

organizations. These could be binary...


In [45]:
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

In [1]:
# Testing LLM Response Just with a minimal Prompt Template

from langchain_core.runnables import RunnablePassthrough
from langchain_ollama import OllamaLLM
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Local Ollama model used to generate answers
llm = OllamaLLM(model="qwen2.5:1.5b")

# Convert vectorstore to a retriever
vectorstore_retriever = vectorstore.as_retriever()

# Define the prompt template
prompt_template = """
Use the following context to answer the question as accurately as possible.
If the answer is not present in the context, say "Not available".

Context:
{context}

Question:
{question}

Answer:
"""
prompt = ChatPromptTemplate.from_template(prompt_template)

# Create the chain
rag_chain = (
    {"context": vectorstore_retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
)

# Invoke with just the question string
question = "Graduation rates of Students with disabilities in higher education institutions"
answer = rag_chain.invoke(question)

print(f"Q: {question}")
print("Ans: " + answer)

KeyboardInterrupt: 

In [47]:
# Invoke with just the question string
question = "What are the challenges mentioned here?"
answer = rag_chain.invoke(question)

print(f"Q: {question}")
print("Ans: " + answer)

Q: What are the challenges mentioned here?
Ans: Not available


In [48]:
# Invoke with just the question string
question = "What was the primary purpose of the Data Labeling project conducted for the HDX platform?"
answer = rag_chain.invoke(question)

print(f"Q: {question}")
print("Ans: " + answer)

Q: What was the primary purpose of the Data Labeling project conducted for the HDX platform?
Ans: Not available.


In [49]:
# Invoke with just the question string
question = "What is identified as the most common challenge in dataset transparency efforts?"
answer = rag_chain.invoke(question)

print(f"Q: {question}")
print("Ans: " + answer)

Q: What is identified as the most common challenge in dataset transparency efforts?
Ans: Not available.


In [None]:
# Invoke with just the question string
question = "How does the 'compare' feature proposed for Phase 3 depend on the 'HXL-ation' process?"
answer = rag_chain.invoke(question)

print(f"Q: {question}")
print("Ans: " + answer)

Q: How does the 'compare' feature proposed for Phase 3 depend on the 'HXL-ation' process?
Ans: The 'compare' feature proposed for Phase 3 depends directly on the 'HXL-ation' process. Specifically, the context states that "the addition of third party certifications and automated analyses by domain will provide the content for such comparison." This implies a dependency between the two processes as they work together to fulfill the need for robust dataset comparison in Phase 3.


# **Implementing the Hybrid Retrieval**

                          +-----------+
                          | DOCUMENTS |
                          +-----------+
                                |
                +---------------+---------------+
                |                               |
        +----------------+              +----------------+
        | Vector Index   |              | Keyword Index  |
        | (Chroma +      |              | (BM25)         |
        | nomic-embed)   |              +----------------+
        +----------------+                      |
                |                               |
                +---------------+---------------+
                                |
                                v
                        Hybrid Retriever
                                |
                                v
                            Re-Ranker
                                |
                                v
                               LLM

### 1. BM25 Indexing (Keyword Indexing)

BM25 computes a relevance score between a query 'q' and a document 'd' using three main components: Term Frequency (TF), Inverse Document Frequency (IDF) and Document Length Normalization.

BM25 (Best Matching 25) is a ranking function used in information retrieval to estimate document relevance for a query, improving upon TF-IDF by incorporating term frequency saturation and document length normalization. It calculates a score based on IDF (inverse document frequency) and a frequency component that limits the impact of repeatedly appearing terms. [1, 2, 3]  
The Formula

For a query $Q$ with terms $q_1, ..., q_n$, the BM25 score of a document $D$ is: [2, 4, 5]

$\text{Score}(D, Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot (1 - b + b \cdot \frac{|D|}{\text{avgdl}})}$

Formula Components

• $f(q_i, D)$: Term frequency of query term $q_i$ in document $D$.
• $|D|$: Length of document $D$ (number of words).
• $\text{avgdl}$: Average document length in the collection.
• $k_1$: Free parameter (usually $1.2$–$2.0$) that controls term frequency saturation.
• $b$: Free parameter (usually $0.75$) that controls document length normalization.
• $\text{IDF}(q_i)$: Inverse Document Frequency, often calculated as: $\text{IDF}(q_i) = \ln\left(\frac{N - n(q_i) + 0.5}{n(q_i) + 0.5} + 1\right)$

	• $N$: Total number of documents.
	• $n(q_i)$: Number of documents containing $q_i$. [4, 5, 6, 7, 8, 9]

Key Concepts 

• Term Frequency Saturation: As $f(q_i, D)$ increases, the score increases, but with diminishing returns: the frequency component approaches $k_1 + 1$ and never exceeds it, so repeated occurrences of a word matter less and less.
• Document Length Normalization: The $\frac{|D|}{\text{avgdl}}$ term lowers the score of documents that are longer than average, so long documents do not outrank short, concise ones merely by containing more words. [5, 10]


[1] https://en.wikipedia.org/wiki/Okapi_BM25
[2] https://www.luigisbox.com/search-glossary/bm25/
[3] https://docs.langchain.com/oss/python/integrations/retrievers/bm25
[4] https://docs.vespa.ai/en/ranking/bm25.html
[5] https://www.geeksforgeeks.org/nlp/what-is-bm25-best-matching-25-algorithm/
[6] https://www.kopp-online-marketing.com/what-is-bm25
[7] https://www.youtube.com/watch?v=YL-3G5-xVYU
[8] https://medium.com/@kimdoil1211/bm25-for-developers-a-guide-to-smarter-keyword-search-e6d83e8c8c8c
[9] https://www.ai-bites.net/tf-idf-and-bm25-for-rag-a-complete-guide/
[10] https://www.elastic.co/blog/practical-bm25-part-2-the-bm25-algorithm-and-its-variables
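
The scoring formula above can be sketched in plain Python to make the saturation and length-normalization effects visible. This is a minimal illustration, not the implementation used by `BM25Retriever`: the function name `bm25_scores`, the whitespace tokenization, the toy corpus, and the choice `k1=1.5` are all assumptions made for the example.

```python
import math
from collections import Counter


def bm25_scores(query_terms, corpus, k1=1.5, b=0.75):
    """Score every tokenized document in `corpus` against `query_terms`."""
    n_docs = len(corpus)
    avgdl = sum(len(doc) for doc in corpus) / n_docs
    # n(q_i): number of documents that contain each term
    doc_freq = Counter(term for doc in corpus for term in set(doc))

    scores = []
    for doc in corpus:
        tf = Counter(doc)
        # Length normalization factor: 1 - b + b * |D| / avgdl
        length_norm = 1 - b + b * len(doc) / avgdl
        score = 0.0
        for term in query_terms:
            n_q = doc_freq.get(term, 0)
            idf = math.log((n_docs - n_q + 0.5) / (n_q + 0.5) + 1)
            f = tf.get(term, 0)
            score += idf * f * (k1 + 1) / (f + k1 * length_norm)
        scores.append(score)
    return scores


# Toy corpus (made up for illustration), tokenized by whitespace
toy_corpus = [
    "data quality measures for datasets".split(),
    "data data data data data data".split(),
    "humanitarian data exchange platform overview and history of the project".split(),
]
toy_scores = bm25_scores(["data", "quality"], toy_corpus)
for doc, score in zip(toy_corpus, toy_scores):
    print(f"{score:.4f}  {' '.join(doc)}")
```

In this toy run the first document ranks highest because it matches both query terms, while the second document, which repeats "data" six times, gains little from the repetition: its frequency component stays below $k_1 + 1$.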

In [51]:
from langchain_community.retrievers import BM25Retriever

bm25_retriever = BM25Retriever.from_documents(
    chunk_docs,
    bm25_variant="plus",    
)

In [52]:
result = bm25_retriever.invoke("retrieval augmented generation")
result

[Document(metadata={'title': 'Figure 9. HDX Search view with Quality Measure counts.', 'page_start': 28, 'page_end': 31, 'source': 'F-62.pdf', 'category': 'NarrativeText'}, page_content='Figure 9. HDX Search view with Quality Measure counts.\n\nFigure 11. “Similar Datasets” comparison sketch, to be refined in Phase 3.'),
 Document(metadata={'title': 'Aid Memoir for a COR/AOR” (March 2012)', 'page_start': 26, 'page_end': 26, 'source': 'F-62.pdf', 'category': 'NarrativeText, NarrativeText'}, page_content='Aid Memoir for a COR/AOR” (March 2012)\n\nb. Frontier Technologies Hub, “releasing the power of digital data for development: a guide to new\n\nopportunities” (June 2019)'),
 Document(metadata={'title': 'IX. APPENDIX', 'page_start': 26, 'page_end': 26, 'source': 'F-62.pdf', 'category': 'NarrativeText, NarrativeText, NarrativeText'}, page_content='IX. APPENDIX\n\nFigure 1. Matrix aligning data quality principles across several organizations, with HDX in blue.\n\nDocuments cited:\n\na. US

**Decision Table:**

| Question type        | Retrieval |
| -------------------- | --------- |
| Exact fact           | Lexical   |
| Numbers / tables     | Lexical   |
| Laws / policies      | Lexical   |
| Headings / titles    | Lexical   |
| Open-ended           | Semantic  |
| Explanatory          | Semantic  |
| User unsure of terms | Semantic  |
| Research / discovery | Semantic  |

**What is RRF?**

RRF = Reciprocal Rank Fusion

* ---> It is a ranking-level fusion technique
* ---> It does NOT use scores
* ---> It only uses positions (ranks)

In [53]:
import hashlib

def doc_hash(doc):
    return hashlib.md5(doc.page_content.encode("utf-8")).hexdigest()

## **Reciprocal Rank Fusion (RRF)**

Reciprocal Rank Fusion (RRF) is a popular, unsupervised method used in Retrieval-Augmented Generation (RAG) to combine ranked results from multiple search systems (e.g., hybrid search combining semantic vector search and keyword-based sparse search) into a single, optimized, and more relevant ranking. [1, 2]  
It is highly effective because it doesn't require training data or complex tuning, and it favors documents that consistently appear at the top of multiple search methods. [3, 4, 5, 6]  
The RRF Formula

The RRF score for a document $d$ is calculated by summing its reciprocal rank across all retrievers: [1]

$\text{RRF}(d) = \sum_{r \in R} \frac{1}{k + \text{rank}_r(d)}$

• $d$: The document being scored.
• $R$: The set of rankers (e.g., Vector Search, BM25).
• $\text{rank}_r(d)$: The position of the document in the results list of ranker $r$ (starting from 1).
• $k$: A smoothing constant that dampens the advantage of the very top ranks, so a single ranker's top hit does not dominate the fused list (often set to 60). [1, 7]

Step-by-Step RRF Calculation 

1. Retrieve Results: Run the user query through multiple systems (e.g., Dense Vector search and Sparse BM25 search). 
2. Assign Scores: For each document in each result set, calculate its individual score using $\frac{1}{k + \text{rank}}$. 

	• Example: If a document is #1 in Vector search, its score is $\frac{1}{60+1} = 0.01639$. 
	• Example: If the same document is #5 in BM25, its score is $\frac{1}{60+5} = 0.01538$. 

3. Sum Scores: Add the scores from all retrievers for each unique document. 
4. Final Ranking: Sort the documents in descending order based on their total RRF score. [2, 3, 8, 9, 10]  

Why RRF is Used in GenAI/RAG 

• Robustness: Combines the semantic understanding of vector search with the keyword precision of traditional search. 
• No Training Needed: Unlike Learn-to-Rank (LTR), RRF is an unsupervised algorithm. 
• Choice of $k$: A $k$ value of 60 is generally chosen to provide a good balance between the influence of top-ranked and lower-ranked items.
• Handles Ties: Helps break ties among lower-ranked items effectively. [1, 2, 11, 12, 13]  

Example RRF Calculation 
Imagine two documents ($A$ and $B$) retrieved by two systems ($S_1$, $S_2$). [3, 14]

| Document | Rank ($S_1$) | Rank ($S_2$) | RRF Calculation ($k=60$) | Total Score |
| --- | --- | --- | --- | --- |
| Doc A | 1 | 2 | $1/(60+1) + 1/(60+2)$ | $\approx 0.0325$ |
| Doc B | 4 | 1 | $1/(60+4) + 1/(60+1)$ | $\approx 0.0320$ |

In this scenario, Document A ranks slightly higher. Both documents hold a top-1 position in one list and appear in the top 4 of both, but A's other rank (2) is better than B's (4).
A LlamaIndex implementation example is available at [15].


[1] https://medium.com/@devalshah1619/mathematical-intuition-behind-reciprocal-rank-fusion-rrf-explained-in-2-mins-002df0cc5e2a
[2] https://medium.com/@mudassar.hakim/designing-retrieval-in-rag-dense-sparse-and-the-rrf-merge-layer-bc176207de50
[3] https://learn.microsoft.com/en-us/azure/search/hybrid-search-ranking
[4] https://mycodingjourney.hashnode.dev/how-ai-ranks-better-with-rrf-the-genius-of-merging-search-results
[5] https://www.youtube.com/watch?v=px4YBYrz0NU
[6] https://www.paradedb.com/learn/search-concepts/reciprocal-rank-fusion
[7] https://medium.com/dataseries/generative-ai-qa-model-using-chroma-and-mistral-7b-565088031e80
[8] https://www.youtube.com/watch?v=6dDvfGrxFns
[9] https://www.linkedin.com/pulse/newmind-ai-journal-62-newmind-ai-m4ynf
[10] https://weaviate.io/blog/hybrid-search-explained
[11] https://www.linkedin.com/pulse/newmind-ai-journal-62-newmind-ai-m4ynf
[12] https://www.researchgate.net/publication/221301121_Reciprocal_Rank_Fusion_outperforms_Condorcet_and_Individual_Rank_Learning_Methods
[13] https://jusky8.medium.com/learning-to-rank-for-information-retrieval-9a0bd9f0b27d
[14] https://dev.to/lucash_ribeiro_dev/graph-augmented-hybrid-retrieval-and-multi-stage-re-ranking-a-framework-for-high-fidelity-chunk-50ca
[15] https://developers.llamaindex.ai/python/examples/retrievers/reciprocal_rerank_fusion/
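
The worked example in the table can be reproduced with a few lines of plain Python. This is a minimal, unweighted sketch: the function name `rrf_fuse` and the filler documents `"x"` and `"y"` are assumptions introduced only to place Doc A and Doc B at the ranks given in the table.

```python
def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of document ids with Reciprocal Rank Fusion.

    `rankings` is a list of ranked lists, best hit first.
    Returns (doc_id, score) pairs sorted by descending fused score.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Doc A is 1st in S1 and 2nd in S2; Doc B is 4th in S1 and 1st in S2.
# "x" and "y" are hypothetical filler documents occupying the other positions.
s1 = ["A", "x", "y", "B"]
s2 = ["B", "A"]

fused = rrf_fuse([s1, s2])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.4f}")
```

A document that is missing from one ranker's list simply receives no contribution from that ranker, which is why the filler documents end up below both A and B.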



In [None]:
# Hybrid Retriever using RRF

class HybridRetriever:
    """Fuse BM25 and vector results with weighted Reciprocal Rank Fusion.

    Each underlying retriever returns its own configured number of hits,
    so `k` caps the fused list but cannot enlarge the candidate pool.
    """

    def __init__(self, bm25_retriever, vector_retriever, weights=(0.5, 0.5)):
        self.bm25_retriever = bm25_retriever
        self.vector_retriever = vector_retriever
        self.bm25_weight, self.vector_weight = weights

    def invoke(self, query, k=5):
        bm25_docs = self.bm25_retriever.invoke(query)
        vector_docs = self.vector_retriever.invoke(query)

        rrf_k = 60  # RRF smoothing constant
        doc_scores = {}
        doc_map = {}

        ranked_lists = (
            (self.bm25_weight, bm25_docs),
            (self.vector_weight, vector_docs),
        )
        for weight, docs in ranked_lists:
            seen = set()  # count a chunk once per list, at its best rank
            for rank, doc in enumerate(docs, 1):
                key = doc_hash(doc)
                if key in seen:
                    continue
                seen.add(key)
                doc_scores[key] = doc_scores.get(key, 0.0) + weight / (rrf_k + rank)
                doc_map[key] = doc

        ranked = sorted(doc_scores.items(), key=lambda item: item[1], reverse=True)
        return [doc_map[key] for key, _ in ranked[:k]]

# Create hybrid retriever
hybrid_retriever = HybridRetriever(
    bm25_retriever=bm25_retriever,
    vector_retriever=vectorstore_retriever,
    weights=[0.5, 0.5]
)

# Test it
docs = hybrid_retriever.invoke("How does the 'compare' feature proposed for Phase 3 depend on the 'HXL-ation' process?", k=5)
print(f"---> Got {len(docs)} hybrid results")

---> Got 5 hybrid results


**RE-RANKING (Stage-2 Retrieval)**

A reranking model — also known as a cross-encoder — is a type of model that, given a query and document pair, will output a similarity score. We use this score to reorder the documents by relevance to our query.

Figure: A two-stage retrieval system. The vector DB step will typically include a bi-encoder or sparse embedding model.

Search engineers have used rerankers in two-stage retrieval systems for a long time. In these two-stage systems, a first-stage model (an embedding model/retriever) retrieves a set of relevant documents from a larger dataset. Then, a second-stage model (the reranker) is used to rerank those documents retrieved by the first-stage model.

We use two stages because retrieving a small set of documents from a large dataset is much faster than reranking a large set of documents — we'll discuss why this is the case soon — but TL;DR, rerankers are slow, and retrievers are fast.

**Why Rerankers?**

If a reranker is so much slower, why bother using them? The answer is that rerankers are much more accurate than embedding models.

The intuition behind a bi-encoder's inferior accuracy is that bi-encoders must compress all of the possible meanings of a document into a single vector — meaning we lose information. Additionally, bi-encoders have no context on the query because we don't know the query until we receive it (we create embeddings before user query time).

On the other hand, a reranker can receive the raw information directly into the large transformer computation, meaning less information loss. Because we are running the reranker at user query time, we have the added benefit of analyzing our document's meaning specific to the user query — rather than trying to produce a generic, averaged meaning.

Rerankers avoid the information loss of bi-encoders — but they come with a different penalty — time.

Figure: A bi-encoder model compresses the document or query meaning into a single vector. Note that the bi-encoder processes our query in the same way as it does documents, but at user query time.

When using bi-encoder models with vector search, we frontload all of the heavy transformer computation to when we are creating the initial vectors — that means that when a user queries our system, we have already created the vectors, so all we need to do is:

1. Run a single transformer computation to create the query vector.
2. Compare the query vector to document vectors with cosine similarity (or another lightweight metric).

With rerankers, we are not pre-computing anything. Instead, we're feeding our query and a single other document into the transformer, running a whole transformer inference step, and outputting a single similarity score.

Figure: A reranker considers query and document to produce a single similarity score over a full transformer inference step. Note that document A here is equivalent to our query.

Given 40M records, if we use a small reranking model like BERT on a V100 GPU — we'd be waiting more than 50 hours to return a single query result [3]. We can do the same in <100ms with encoder models and vector search.
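
The cost argument above can be made concrete with a toy two-stage search in which the expensive scorer only ever sees the shortlist. Everything here is a stand-in chosen for illustration: `overlap_score` plays the role of the fast retriever, `CountingReranker` plays the role of a cross-encoder (it is not one), and the corpus sentences are invented.

```python
def overlap_score(query, text):
    """Cheap first-stage score: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))


class CountingReranker:
    """Stand-in for a cross-encoder; counts the (query, doc) pairs it scores."""

    def __init__(self):
        self.calls = 0

    def score(self, query, text):
        self.calls += 1
        # Hypothetical "expensive" score: shared words normalized by length
        return overlap_score(query, text) / len(text.split())


def two_stage_search(query, corpus, reranker, first_stage_k=3, final_k=2):
    # Stage 1: cheap scoring over the whole corpus
    shortlist = sorted(
        corpus, key=lambda text: overlap_score(query, text), reverse=True
    )[:first_stage_k]
    # Stage 2: expensive scoring over the shortlist only
    return sorted(
        shortlist, key=lambda text: reranker.score(query, text), reverse=True
    )[:final_k]


toy_corpus = [
    "the compare feature depends on standardized metadata",
    "quality measures appear on the dataset page",
    "the compare feature is proposed for phase 3",
    "update frequency and last updated date",
    "datasets are uploaded through the api or the form",
]
toy_reranker = CountingReranker()
results = two_stage_search("compare feature phase 3", toy_corpus, toy_reranker)

print(results)
print(f"expensive scorer calls: {toy_reranker.calls} of {len(toy_corpus)} documents")
```

The expensive scorer runs once per shortlisted document rather than once per document in the corpus, which is the whole point of placing the reranker second.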

In [55]:
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

In [56]:
def rerank_docs(query, docs, top_k=5):
    """
    query: str
    docs: List[Document]
    top_k: int
    """

    # Prepare (query, doc_text) pairs
    pairs = [(query, doc.page_content) for doc in docs]

    # Get relevance scores
    scores = reranker.predict(pairs)

    # Attach scores to docs
    scored_docs = list(zip(docs, scores))

    # Sort by score (descending)
    scored_docs.sort(key=lambda x: x[1], reverse=True)

    # Return top-k docs only
    return [doc for doc, score in scored_docs[:top_k]]

In [57]:
query = "How does the 'compare' feature proposed for Phase 3 depend on the 'HXL-ation' process?"
candidate_docs = hybrid_retriever.invoke(query, k=20)

In [58]:
candidate_docs

[Document(id='db041ae1-00f5-4f25-926d-7930cc543bb7', metadata={'page_start': 21, 'category': 'NarrativeText, NarrativeText, NarrativeText', 'title': 'Rationale', 'page_end': 22, 'source': 'F-62.pdf'}, page_content='Rationale\n\nAnd, while the measures in Phase 1 are a helpful starting point to assess quality, more analysis is needed for robust dataset comparison. The addition of third party certifications and automated analyses by domain will provide the content for such comparison. Technical considerations for Phase 3 would be enumerated after the research, development, and design for Phases 2 and 3.'),
 Document(metadata={'title': '“Compare” feature dependencies', 'page_start': 23, 'page_end': 24, 'source': 'F-62.pdf', 'category': 'NarrativeText, NarrativeText'}, page_content='“Compare” feature dependencies\n\nPhase 3 considerations include the addition of domain-specific metadata, third-party certification metrics, and the ability to compare and see additional, related datasets'),
 

In [59]:
final_docs = rerank_docs(query, candidate_docs, top_k=5)

In [60]:
final_docs

[Document(metadata={'title': '“Compare” feature dependencies', 'page_start': 23, 'page_end': 24, 'source': 'F-62.pdf', 'category': 'NarrativeText, NarrativeText'}, page_content='“Compare” feature dependencies\n\non HDX. Multiple conversations with the Centre and its users highlighted the critical importance of data selection. However, the notion of comparing and contrasting datasets requires standardized metadata, some of which is already collected, but much of which is not programmatically accessible. In particular, the usefulness of the “compare” feature, which requires additional research but could appear, for example, within the “Quality Measures” tab on the dataset page or the search results returned after submitting a query – increases significantly with the inclusion of technical information about the dataset that would be made available through the HXL-ation process. This is no doubt a challenge, considering that the majority of datasets are not yet HXLated. There is an open qu

In [61]:
query = "Explain me about data nutrition project?"
candidate_docs = hybrid_retriever.invoke(query, k=20)
candidate_docs

[Document(id='281ff060-a184-43b4-8b83-b8b8653753d1', metadata={'page_end': 5, 'source': 'F-62.pdf', 'title': 'Philosophy', 'page_start': 5, 'category': 'NarrativeText'}, page_content='Philosophy\n\nThe Data Nutrition Project is a non-profit initiative that formed in 2018 to develop tools and practices to improve transparency into datasets. Our team is interdisciplinary, and we leverage insights from a variety of fields, including product development, data science, ethics, engineering, design, and education. Our approach with our Nutrition Labels for Datasets is threefold: 1) We encourage the creation, documentation, and publishing of higher quality data; 2) We enable transparency into datasets through our legible, extensible, interactive framework; and 3) Our Labels provide education about what kinds of information a user should ascertain before using a dataset. We bring this approach into our work with clients, where we prioritize user-centered design, realistic goals, and practitione

In [62]:
final_docs = rerank_docs(query, candidate_docs, top_k=5)
final_docs

[Document(id='281ff060-a184-43b4-8b83-b8b8653753d1', metadata={'page_end': 5, 'source': 'F-62.pdf', 'title': 'Philosophy', 'page_start': 5, 'category': 'NarrativeText'}, page_content='Philosophy\n\nThe Data Nutrition Project is a non-profit initiative that formed in 2018 to develop tools and practices to improve transparency into datasets. Our team is interdisciplinary, and we leverage insights from a variety of fields, including product development, data science, ethics, engineering, design, and education. Our approach with our Nutrition Labels for Datasets is threefold: 1) We encourage the creation, documentation, and publishing of higher quality data; 2) We enable transparency into datasets through our legible, extensible, interactive framework; and 3) Our Labels provide education about what kinds of information a user should ascertain before using a dataset. We bring this approach into our work with clients, where we prioritize user-centered design, realistic goals, and practitione

In [None]:
def build_structured_context(docs):
    context_blocks = []
    source_map = {}

    for i, doc in enumerate(docs, 1):
        context_blocks.append(
            f"""
[Source {i}]
Document: {doc.metadata['source']}
Section: {doc.metadata.get('title', 'N/A')}
Pages: {doc.metadata.get('page_start')}–{doc.metadata.get('page_end')}

Content:
{doc.page_content}
""".strip()
        )

        source_map[i] = {
            "file": doc.metadata["source"],
            "section": doc.metadata.get("title"),
            "pages": f"{doc.metadata.get('page_start')}–{doc.metadata.get('page_end')}"
        }

    return "\n\n---\n\n".join(context_blocks), source_map

structured_context, source_map = build_structured_context(final_docs)
print("Structured Context:\n")
print(structured_context)

Structured Context:

[Source 1]
Document: F-62.pdf
Section: Philosophy
Pages: 5–5

Content:
Philosophy

The Data Nutrition Project is a non-profit initiative that formed in 2018 to develop tools and practices to improve transparency into datasets. Our team is interdisciplinary, and we leverage insights from a variety of fields, including product development, data science, ethics, engineering, design, and education. Our approach with our Nutrition Labels for Datasets is threefold: 1) We encourage the creation, documentation, and publishing of higher quality data; 2) We enable transparency into datasets through our legible, extensible, interactive framework; and 3) Our Labels provide education about what kinds of information a user should ascertain before using a dataset. We bring this approach into our work with clients, where we prioritize user-centered design, realistic goals, and practitioner-focused outcomes, informed by our experience working in data transparency initiatives and wi

### **LLM Integration**

temperature = 0
→ Always pick highest-probability next token

top_p = 1
→ No token filtering

repeat_penalty = 1
→ No penalty for repetition
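
To see what `temperature` and `top_p` do to the next-token distribution, here is a small sketch under the usual textbook definitions of temperature scaling and nucleus (top-p) filtering. The function name `next_token_distribution` and the logit values are assumptions for illustration; a real runtime such as Ollama may differ in implementation details.

```python
import math


def next_token_distribution(logits, temperature=1.0, top_p=1.0):
    """Return {token: probability} after temperature scaling and top-p filtering."""
    if temperature == 0:
        # Greedy decoding: all probability mass on the highest-logit token
        best = max(logits, key=logits.get)
        return {best: 1.0}

    # Softmax over temperature-scaled logits (shifted by the max for stability)
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    probs = {tok: value / total for tok, value in exps.items()}

    # Keep the smallest set of top tokens whose cumulative probability reaches top_p
    kept, cumulative = {}, 0.0
    for tok, p in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break
    norm = sum(kept.values())
    return {tok: p / norm for tok, p in kept.items()}


toy_logits = {"HDX": 2.0, "dataset": 1.0, "banana": -1.0}  # made-up values

greedy = next_token_distribution(toy_logits, temperature=0.0, top_p=1.0)
unfiltered = next_token_distribution(toy_logits, temperature=1.0, top_p=1.0)
nucleus = next_token_distribution(toy_logits, temperature=1.0, top_p=0.9)

print("temperature=0:", greedy)
print("top_p=1.0 keeps:", sorted(unfiltered))
print("top_p=0.9 keeps:", sorted(nucleus))
```

With `temperature=0` only the top token survives, and with `top_p=1` no token is filtered out, which matches the two settings listed above.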

In [149]:
from langchain_ollama import OllamaLLM
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Initializing the Qwen2.5:1.5b model
llm = OllamaLLM(
    model="qwen2.5:1.5b",
    temperature=0.0,
    top_p=1.0
)

In [None]:
prompt_template = ChatPromptTemplate.from_template("""
You are a research assistant answering questions strictly from the provided document context.

CRITICAL RULES (NON-NEGOTIABLE):
1. Use ONLY the information explicitly stated in the context.
2. Do NOT use prior knowledge, assumptions, interpretations, or reasoning beyond the text.
3. Every factual sentence MUST end with a citation in the form [Source X].
4. If a question has multiple sub-questions, ALL parts must be answered.
5. If ANY part of the question cannot be answered from the context, DO NOT partially answer.

MISSING INFORMATION HANDLING:
If required information is missing or unclear, respond ONLY with:
"I couldn’t find this information in the available documents."

CRITICAL ANSWERABILITY RULE:
The question may contain multiple sub-questions.
You must answer ONLY if ALL sub-questions are explicitly and clearly answered in the provided context.

If ANY sub-question is missing, unclear, or indirectly implied,
respond ONLY with:

"I couldn’t find this information in the available documents."

Do not provide partial answers.
Do not summarize unrelated sections.
Do not include Sources Summary when refusing.


PROHIBITED BEHAVIOR:
- Do NOT speculate or infer timelines, durations, or participants.
- Do NOT explain why the information is missing.
- Do NOT summarize document structure or page counts.
- Do NOT reinterpret or restate the user's question.
- Do NOT include phrases like "it appears", "based on the information", or "indicates that".

CONTEXT:
The following are retrieved document sections. Each section is identified by a source number.

{context}

QUESTION:
{question}

ANSWER FORMAT:
- Use bullet points if applicable.
- Each sentence must end with one or more citations.
- Do NOT include a references list.
- Do NOT repeat the context.

ANSWER:
""")

In [155]:
rag_chain = (
    prompt_template
    | llm
    | StrOutputParser()
)
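
A small model will not always follow the `[Source X]` citation rule in the prompt, so it can help to check the answer after generation. The sketch below is a coarse, assumption-laden check: the helper name `check_citations` is hypothetical, sentences are split naively on `.`, `!` and `?` (so abbreviations will confuse it), and the sample answer is invented.

```python
import re

CITATION = re.compile(r"\[Source (\d+)\]")


def check_citations(answer, n_sources):
    """Return (uncited_sentences, invalid_indices) for an answer string."""
    pieces = [p.strip() for p in re.split(r"(?<=[.!?])\s+", answer) if p.strip()]

    sentences = []
    for piece in pieces:
        # A fragment made only of citations belongs to the previous sentence
        if sentences and not CITATION.sub("", piece).strip(" .-•"):
            sentences[-1] += " " + piece
        else:
            sentences.append(piece)

    uncited = [s for s in sentences if not CITATION.search(s)]
    invalid = sorted(
        {int(n) for n in CITATION.findall(answer) if not 1 <= int(n) <= n_sources}
    )
    return uncited, invalid


sample_answer = (
    "The project formed in 2018 [Source 1]. "
    "It is led by a team. "
    "The sprint lasted weeks [Source 7]."
)
uncited, invalid = check_citations(sample_answer, n_sources=5)
print("Sentences without a citation:", uncited)
print("Citations pointing outside the context:", invalid)
```

An answer that fails either check could be rejected or replaced by the refusal message, depending on how strict the pipeline needs to be.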

In [121]:
def format_reranked_sources(docs):
    """
    Convert reranked Document objects into a clean, readable source list
    """
    formatted = []

    for i, doc in enumerate(docs):
        meta = doc.metadata or {}

        source = meta.get("source", "Unknown")
        title = meta.get("title", "Unknown")
        page_start = meta.get("page_start", "?")
        page_end = meta.get("page_end", "?")

        formatted.append(            
            f"[{i+1}] Source: {source}\n"
            f"Title: {title}\n"
            f"Pages: {page_start}–{page_end}"
        )

    return "\n\n".join(formatted)

In [122]:
def answer_query(
    query: str,
    *,
    hybrid_retriever,
    reranker,
    rag_chain,
    top_k_retrieval: int = 20,
    top_k_rerank: int = 5
):
    """
    End-to-end RAG pipeline:
    Query → Hybrid Retrieval → Re-ranking → Structured Context → LLM Answer
    """

    # 1. Stage 1: hybrid retrieval (BM25 + vector)
    candidate_docs = hybrid_retriever.invoke(
        query,
        k=top_k_retrieval
    )

    # Always return a (response, sources_summary) pair so callers can unpack it
    if not candidate_docs:
        return "I couldn’t find relevant information for this question in the documents.", ""

    # 2. Stage 2: re-ranking (cross-encoder)
    # Note: rerank_docs scores with the module-level `reranker`;
    # the `reranker` argument is not passed through.
    reranked_docs = rerank_docs(
        query=query,
        docs=candidate_docs,
        top_k=top_k_rerank
    )

    if not reranked_docs:
        return "I found documents, but none were relevant enough after re-ranking.", ""

    # 3. Build structured context with source indices
    # build_structured_context returns (context_text, source_map); the prompt needs only the text
    context_text, _source_map = build_structured_context(reranked_docs)

    # 4. LLM generation
    response = rag_chain.invoke({
        "context": context_text,
        "question": query
    })

    sources_summary = format_reranked_sources(reranked_docs)

    return response, sources_summary

In [109]:
query = "Graduation rates of Students with disabilities in higher education institutions"

response = answer_query(
    query=query,
    hybrid_retriever=hybrid_retriever,
    reranker=reranker,
    rag_chain=rag_chain,
    top_k_retrieval=20,
    top_k_rerank=5
)

print("LLM Response:\n")
print(response)

LLM Response:

('I could not find information related to this question in the provided document sections. Could you please clarify what specific detail you are looking for or share additional documents?', '[0] Source: F-62.pdf\nTitle: Philosophy\nPages: 5–5\n\n[1] Source: F-62.pdf\nTitle: Recommendations\nPages: 15–15\n\n[2] Source: F-62.pdf\nTitle: for the data quality principle of credibility.\nPages: 12–13\n\n[3] Source: F-62.pdf\nTitle: number of categories and against a common framework.\nPages: 13–14\n\n[4] Source: F-62.pdf\nTitle: / Owner and Third-Party expert\nPages: 14–14')


In [115]:
query = "Explain me about data nutrition project?"

response, sources_summary = answer_query(
    query=query,
    hybrid_retriever=hybrid_retriever,
    reranker=reranker,
    rag_chain=rag_chain,
    top_k_retrieval=20,
    top_k_rerank=5
)

print("LLM Response:\n")
print(response)
print("\n---> Sources Summary:")
print(sources_summary)

LLM Response:

The Data Nutrition Project is a non-profit initiative that was established in 2018 with the goal of developing tools and practices to enhance transparency into datasets. The team behind this project consists of individuals from various disciplines, including product development, data science, ethics, engineering, design, and education. They approach their work using three key methods: encouraging higher quality data creation and documentation, making datasets transparent through an interactive framework that provides educational information about the dataset's attributes and usage, and offering education on what specific types of information users should investigate before utilizing a dataset.

The project’s methodology involves interviews with stakeholders who represent different points along the data collection, processing, hosting, and use timeline. These stakeholders include both external entities like the IOM and Humanitarian OpenStreetMap Team as well as internal e

In [116]:
query = "what is the impact of on existing data processes and systems"

response, sources_summary = answer_query(
    query=query,
    hybrid_retriever=hybrid_retriever,
    reranker=reranker,
    rag_chain=rag_chain,
    top_k_retrieval=20,
    top_k_rerank=5
)

print("LLM Response:\n")
print(response)
print("\n---> Sources Summary:")
print(sources_summary)

LLM Response:

The Impact on Existing Data Processes and Systems is as follows:

Currently, data organizations upload their datasets to HDX either in bulk (using the HDX / CKAN APIs) or manually (using the upload form process). Many dataset quality measures are already collected during these processes. We recommend that additional information could be gathered through updating the API and the form. Our Phase 1 recommendation does not require this additional infrastructure, but instead utilizes only information that HDX already collects. In Phase 2 and beyond, there are additional automated and manual processes in which more metadata is gathered, some of which could be leveraged as quality measures. The particulars of what is gathered and when, as well as how to collect more data automatically or otherwise, require further exploration.

---> Sources Summary:
[0] Source: F-62.pdf
Title: VII. ADDITIONAL CONSIDERATIONS
Pages: 23–23

[1] Source: F-62.pdf
Title: Technical Considerations
Page

In [129]:
query = "When did the research campaign held? and how many days? and by whom?"

response, sources_summary = answer_query(
    query=query,
    hybrid_retriever=hybrid_retriever,
    reranker=reranker,
    rag_chain=rag_chain,
    top_k_retrieval=20,
    top_k_rerank=5
)

print("LLM Response:\n")
print(response)
print("\n---> Sources Summary:")
print(sources_summary)

LLM Response:

I couldn’t find this information in the available documents.

---> Sources Summary:
[1] Source: F-62.pdf
Title: Phase 2
Pages: 21–21

[2] Source: F-62.pdf
Title: Goals
Pages: 5–5

[3] Source: F-62.pdf
Title: Update Frequency & Last Updated date*
Pages: 20–20

[4] Source: F-62.pdf
Title: II. CHALLENGES
Pages: 6–7

[5] Source: F-62.pdf
Title: II. CHALLENGES
Pages: 6–7


In [None]:
query = "When did the research campaign held? and how many days? and by whom?"

response, sources_summary = answer_query(
    query=query,
    hybrid_retriever=hybrid_retriever,
    reranker=reranker,
    rag_chain=rag_chain,
    top_k_retrieval=20,
    top_k_rerank=5
)

print("LLM Response:\n")
print(response)
print("\n---> Sources Summary:")
print(sources_summary)

LLM Response:

-I couldn’t find this information in the available documents.

---> Sources Summary:
[1] Source: F-62.pdf
Title: Phase 2
Pages: 21–21

[2] Source: F-62.pdf
Title: Goals
Pages: 5–5

[3] Source: F-62.pdf
Title: Update Frequency & Last Updated date*
Pages: 20–20

[4] Source: F-62.pdf
Title: II. CHALLENGES
Pages: 6–7

[5] Source: F-62.pdf
Title: II. CHALLENGES
Pages: 6–7


In [133]:
query = "When did the research campaign held? and how many days? and by whom?"

response, sources_summary = answer_query(
    query=query,
    hybrid_retriever=hybrid_retriever,
    reranker=reranker,
    rag_chain=rag_chain,
    top_k_retrieval=20,
    top_k_rerank=5
)

print("LLM Response:\n")
print(response)
print("\n---> Sources Summary:")
print(sources_summary)

LLM Response:

- Research was conducted in Phase 2 of the project.
- The research campaign took place from pages 21 to 21 of Document F-62.pdf.
- The study was led by the Data Nutrition Project team.

---> Sources Summary:
[1] Source: F-62.pdf
Title: Phase 2
Pages: 21–21

[2] Source: F-62.pdf
Title: Goals
Pages: 5–5

[3] Source: F-62.pdf
Title: Update Frequency & Last Updated date*
Pages: 20–20

[4] Source: F-62.pdf
Title: II. CHALLENGES
Pages: 6–7

[5] Source: F-62.pdf
Title: II. CHALLENGES
Pages: 6–7
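
The three runs above send the same query and retrieve the same sources, yet the answer changes from a refusal to an invented one. A minimal sketch of how that variability could be measured is shown below. `run_repeated` and `normalize` are hypothetical helpers, not part of the notebook, and `ask` is assumed to be any callable that maps a query string to an answer string.

```python
from collections import Counter
from typing import Callable, List, Tuple


def normalize(answer: str) -> str:
    """Lowercase, unify apostrophes, and drop leading list markers so that
    trivially different answers (e.g. a stray leading '-') compare equal."""
    text = answer.replace("\u2019", "'").strip().lower()
    return text.lstrip("-\u2022 ").strip()


def run_repeated(
    ask: Callable[[str], str], query: str, n: int = 5
) -> Tuple[str, float, List[str]]:
    """Ask the same query n times and report the majority answer,
    the fraction of runs that agree with it, and all normalized answers."""
    answers = [normalize(ask(query)) for _ in range(n)]
    majority, count = Counter(answers).most_common(1)[0]
    return majority, count / n, answers


# Hypothetical wiring to this notebook's pipeline (left commented out because
# it needs the retriever, reranker, and chain objects built earlier):
# ask = lambda q: answer_query(
#     query=q, hybrid_retriever=hybrid_retriever, reranker=reranker,
#     rag_chain=rag_chain, top_k_retrieval=20, top_k_rerank=5,
# )[0]
# majority, agreement, answers = run_repeated(ask, query, n=5)
```

A low agreement score for a query is a hint that the answer should not be trusted without checking it against the retrieved chunks.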


In [132]:
query = "When did the research campaign and design sprint held? and how many days? and by whom?"

response, sources_summary = answer_query(
    query=query,
    hybrid_retriever=hybrid_retriever,
    reranker=reranker,
    rag_chain=rag_chain,
    top_k_retrieval=20,
    top_k_rerank=5
)

print("LLM Response:\n")
print(response)
print("\n---> Sources Summary:")
print(sources_summary)

LLM Response:

- The research campaign and design sprint took place from February into early March 2023. 
- This period lasted for **5 weeks**.
- The research was conducted by the DNP team, as outlined in Source [Source 1].

---> Sources Summary:
[1] Source: F-62.pdf
Title: VIII. CONCLUSION
Pages: 25–25

[2] Source: F-62.pdf
Title: Rationale
Pages: 21–22

[3] Source: F-62.pdf
Title: Goals
Pages: 5–5

[4] Source: F-62.pdf
Title: II. CHALLENGES
Pages: 6–7


In [126]:
query = "When did the research campaign held?"

response, sources_summary = answer_query(
    query=query,
    hybrid_retriever=hybrid_retriever,
    reranker=reranker,
    rag_chain=rag_chain,
    top_k_retrieval=20,
    top_k_rerank=5
)

print("LLM Response:\n")
print(response)
print("\n---> Sources Summary:")
print(sources_summary)

LLM Response:

Based on the information provided, there is no explicit mention of when the research campaign was held. Therefore, I couldn’t find this information in the available documents.

---> Sources Summary:
[1] Source: F-62.pdf
Title: Prototype directions
Pages: 11–11

[2] Source: F-62.pdf
Title: Phase 2
Pages: 21–21

[3] Source: F-62.pdf
Title: Goals
Pages: 5–5

[4] Source: F-62.pdf
Title: II. CHALLENGES
Pages: 6–7

[5] Source: F-62.pdf
Title: Update Frequency & Last Updated date*
Pages: 20–20


In [138]:
# Responses with temperature = 0 and top_p = 0

query = "When did the research campaign held? and how many days? and by whom?"

response, sources_summary = answer_query(
    query=query,
    hybrid_retriever=hybrid_retriever,
    reranker=reranker,
    rag_chain=rag_chain,
    top_k_retrieval=20,
    top_k_rerank=5
)

print("LLM Response:\n")
print(response)
print("\n---> Sources Summary:")
print(sources_summary)

LLM Response:

I couldn’t find this information in the available documents.

---> Sources Summary:
[1] Source: F-62.pdf
Title: Phase 2
Pages: 21–21

[2] Source: F-62.pdf
Title: Goals
Pages: 5–5

[3] Source: F-62.pdf
Title: Update Frequency & Last Updated date*
Pages: 20–20

[4] Source: F-62.pdf
Title: II. CHALLENGES
Pages: 6–7

[5] Source: F-62.pdf
Title: II. CHALLENGES
Pages: 6–7
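
The comment in the cell above refers to temperature 0 and top-p 0, but the code that applies those settings is not shown in this section. As a rough sketch only, this is one way such sampling options could be passed to a local Ollama server through its `/api/generate` REST endpoint. `MODEL_NAME` is a placeholder tag, `build_payload` and `generate` are hypothetical helpers, and the fixed `seed` is an extra assumption that the notebook does not mention.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local address
MODEL_NAME = "qwen2.5:1.5b"  # assumption: placeholder, use whatever tag is pulled locally


def build_payload(prompt: str, temperature: float = 0.0,
                  top_p: float = 0.0, seed: int = 42) -> dict:
    """Build a non-streaming generate request with explicit sampling options.
    The temperature/top_p defaults mirror the values named in the cell comment."""
    return {
        "model": MODEL_NAME,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": temperature, "top_p": top_p, "seed": seed},
    }


def generate(prompt: str, **options) -> str:
    """POST the payload to the local Ollama server and return the generated text."""
    data = json.dumps(build_payload(prompt, **options)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

Pinning the sampling options makes repeated runs easier to compare, but it does not make the answer correct: the model still only sees whatever chunks the retriever and reranker hand it.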


In [163]:
query = "When did the research campaign held?"

response, sources_summary = answer_query(
    query=query,
    hybrid_retriever=hybrid_retriever,
    reranker=reranker,
    rag_chain=rag_chain,
    top_k_retrieval=20,
    top_k_rerank=5
)

print("LLM Response:\n")
print(response)
print("\n---> Sources Summary:")
print(sources_summary)

LLM Response:

- The research and interviews conducted in the first two weeks of our sprint.

---> Sources Summary:
[1] Source: F-62.pdf
Title: Prototype directions
Pages: 11–11

[2] Source: F-62.pdf
Title: Phase 2
Pages: 21–21

[3] Source: F-62.pdf
Title: Goals
Pages: 5–5

[4] Source: F-62.pdf
Title: II. CHALLENGES
Pages: 6–7

[5] Source: F-62.pdf
Title: Update Frequency & Last Updated date*
Pages: 20–20


A 1.5B-parameter model is **prone to hallucination**, generally at higher rates than larger models. Here is what the research shows: [arxiv](https://arxiv.org/html/2512.22416v1)

## Hallucination Tendency in Small Models

**Smaller models, such as those with 1.5B parameters, exhibit higher hallucination scores and a greater tendency to generate factually incorrect content**. Research on the Qwen2.5 series specifically found that the 1.5B version shows higher hallucination variability than both the smaller (0.5B) and larger (3B) versions, suggesting that intermediate-sized models may have less stable factual grounding due to incomplete knowledge retention during pretraining. [arxiv](https://arxiv.org/html/2512.22416v1)

## Parameter Size Impact

The general trend shows **hallucination rates decrease as parameter size increases**, though the relationship isn't strictly linear. Models with 7-9 billion parameters consistently outperform smaller models in terms of factual consistency and stable outputs. For example, SeaLLM 1.5B produces some of the highest hallucination rates among tested models. [openreview](https://openreview.net/pdf?id=AGsrWqu0dh)

## Important Exception

While the trend favors larger models, **training quality matters more than size alone**. Intel's Neural-chat 7B achieved a hallucination rate of just 2.8%, lower than GPT-4's 3%, even though GPT-4 reportedly has around 1.8 trillion parameters. This demonstrates that well-optimized smaller models with good training data can outperform poorly trained larger ones. [intel](https://www.intel.com/content/www/us/en/developer/articles/technical/do-smaller-models-hallucinate-more.html)

### *For our RAG and semantic search work, we run a 1.5B model locally via Ollama, so hallucinations are more likely, and we have to implement strong retrieval-based grounding and fact-checking in our pipeline to mitigate them.*
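
One very simple form of retrieval-based grounding check is sketched below, under stated assumptions: it flags answer sentences whose content words mostly do not appear in the retrieved chunks. `unsupported_sentences`, `content_tokens`, the stopword list, and the `min_overlap` threshold are all hypothetical and untuned. This is a crude lexical heuristic, not a real fact-checker; an entailment model or a second LLM pass would be a stronger check.

```python
import re
from typing import List, Set, Tuple

# Small illustrative stopword list (an assumption, not a tuned resource).
STOPWORDS = {"the", "and", "was", "were", "for", "from", "with", "that",
             "this", "are", "has", "have", "not", "but", "its", "our"}


def content_tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens longer than two characters, minus stopwords."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower())
            if len(t) > 2 and t not in STOPWORDS}


def unsupported_sentences(answer: str, contexts: List[str],
                          min_overlap: float = 0.6) -> List[Tuple[str, float]]:
    """Return (sentence, overlap) pairs for answer sentences whose share of
    content tokens found in the retrieved contexts falls below min_overlap."""
    context_vocab: Set[str] = set()
    for chunk in contexts:
        context_vocab |= content_tokens(chunk)

    flagged = []
    # Split on sentence-ending punctuation followed by whitespace, or on newlines.
    for sentence in re.split(r"(?<=[.!?])\s+|\n+", answer):
        sentence = sentence.strip(" -*\t")
        tokens = content_tokens(sentence)
        if not tokens:
            continue
        overlap = len(tokens & context_vocab) / len(tokens)
        if overlap < min_overlap:
            flagged.append((sentence, overlap))
    return flagged


# Toy data for illustration only (not taken from F-62.pdf).
contexts = ["The research and interviews were conducted in the first two weeks of the sprint."]
answer = "The interviews were conducted in the first two weeks. The campaign lasted ninety days in Paris."
for sentence, overlap in unsupported_sentences(answer, contexts):
    print(f"UNSUPPORTED ({overlap:.0%}): {sentence}")
```

In a pipeline like this one, the contexts would be the text of the reranked chunks passed to the chain, and a flagged sentence could trigger the fallback "not found" response instead of being shown to the user.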