This notebook demonstrates a retrieval-augmented generation (RAG) pipeline built over historical metadata from the Boston Public Library's Digital Commonwealth API. The goal is to simulate a professional librarian answering patron queries from embedded and re-ranked metadata content.

In [None]:
# Step 1: Install dependencies (on the SCC, make sure local binaries like the openai CLI are on PATH)

!pip install -q --upgrade pip
!pip install -q langchain sentence-transformers pinecone-client openai python-dotenv rank_bm25

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [3]:
# Step 2: Import core LangChain, Pinecone, OpenAI, and utilities
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from langchain.vectorstores import Pinecone as PineconeVectorStore
from langchain.schema import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from pinecone import Pinecone
import hashlib
import json
from dotenv import load_dotenv
from openai import OpenAI

# Load secrets from .env file (e.g., API keys)
load_dotenv()

True

We began by installing the necessary packages. These include:

    - langchain: For orchestrating embeddings and vector stores.

    - pinecone-client: To interact with Pinecone, our vector DB.

    - sentence-transformers: To generate vector embeddings.

    - rank_bm25: For BM25 re-ranking based on metadata field content.

This prepares our environment for building and testing the RAG pipeline.
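The tokenizers warning printed during installation asks for `TOKENIZERS_PARALLELISM` to be set explicitly. A minimal sketch of doing that, together with a check that the two API keys this notebook reads later were actually loaded from `.env`. The helper name `missing_keys` is hypothetical and not part of the original notebook.

```python
import os

# Choose a value explicitly so huggingface/tokenizers stops warning after a fork.
os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")

# Keys this notebook reads later via os.getenv.
REQUIRED_KEYS = ["PINECONE_API_KEY", "OPENAI_API_KEY"]

def missing_keys(env, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in `env`."""
    return [key for key in required if not env.get(key)]

absent = missing_keys(os.environ)
if absent:
    print(f"Missing environment variables: {', '.join(absent)}")
```

Running this right after `load_dotenv()` surfaces a missing key immediately instead of at the first Pinecone or OpenAI call.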

In [None]:
# Step 3: Fetch Metadata from Digital Commonwealth (returns directly instead of saving to file)
import requests
import json

def fetch_digital_commonwealth(start_page=0, end_page=5):
    """Fetch search-result pages from start_page up to, but not including, end_page."""
    BASE_URL = "https://www.digitalcommonwealth.org/search.json?search_field=all_fields&per_page=100&q="
    output = []

    print(f"Reading pages {start_page} to {end_page}")
    retries = 0

    while start_page < end_page:
        try:
            # A timeout keeps a stalled connection from hanging the loop
            response = requests.get(f"{BASE_URL}&page={start_page}", timeout=30)
            response.raise_for_status()
            data = response.json()
            output.append(data)

            next_page = data['meta']['pages'].get('next_page')
            if next_page is None or next_page >= end_page:
                break

            start_page = next_page
            retries = 0  # reset on success
        except requests.exceptions.RequestException as e:
            print(f"Error: {e}")
            retries += 1
            if retries >= 5:
                print("Too many retries. Stopping.")
                break

    print(f"Fetched {len(output)} pages.")
    return output

raw_data = fetch_digital_commonwealth(1, 2)  # end_page is exclusive, so this fetches page 1 only
print(raw_data)

Reading pages 1 to 2
Fetched 1 pages.
[{'links': {'self': 'https://www.digitalcommonwealth.org/search.json?page=1&per_page=100&q=&search_field=all_fields', 'next': 'https://www.digitalcommonwealth.org/search.json?page=2&per_page=100&q=&search_field=all_fields', 'last': 'https://www.digitalcommonwealth.org/search.json?page=13785&per_page=100&q=&search_field=all_fields'}, 'meta': {'pages': {'current_page': 1, 'next_page': 2, 'prev_page': None, 'total_pages': 13785, 'limit_value': 100, 'offset_value': 0, 'total_count': 1378437, 'first_page?': True, 'last_page?': False}}, 'data': [{'id': 'commonwealth-oai:xp68md23x', 'type': 'DigitalObject', 'attributes': {'id': 'commonwealth-oai:xp68md23x', 'system_create_dtsi': '2021-03-04T00:13:09Z', 'system_modified_dtsi': '2021-09-02T20:40:00Z', 'curator_model_ssi': 'Curator::DigitalObject', 'curator_model_suffix_ssi': 'DigitalObject', 'title_info_primary_tsi': 'من فضلكم توقفوا الأشخاص الذين ارتكبوا أسوأ الجرائم ضد المرأة عن الإفلات من العدالة الرجاء 

In this step, we fetch metadata from the Digital Commonwealth API by sending one GET request per page. Each page returns a batch of up to 100 digital item records in JSON format.

Each page's full JSON response is kept in memory as-is; fields such as title, abstract, and date are picked out in later steps. Fetching continues until we reach the end page or the API reports no next page. This gives us the raw metadata we need for embedding and search.
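The pagination logic can be exercised offline by separating it from the HTTP call. The sketch below assumes an inclusive end page, which is a design choice of this example and differs from the cell above; `collect_pages` and `fake_fetch` are hypothetical names, and `fake_fetch` only imitates the `meta.pages.next_page` shape seen in the API response.

```python
def collect_pages(fetch_page, start_page, end_page):
    """Collect pages start_page..end_page (inclusive), following next_page links.

    `fetch_page` is any callable that takes a page number and returns the
    parsed JSON dict for that page (a stand-in for requests.get(...).json()).
    """
    pages = []
    page = start_page
    while page is not None and page <= end_page:
        data = fetch_page(page)
        pages.append(data)
        # The API reports the next page number, or None on the last page.
        page = data["meta"]["pages"].get("next_page")
    return pages

# A fake three-page API, used here only to exercise the loop offline.
def fake_fetch(page, last_page=3):
    next_page = page + 1 if page < last_page else None
    return {
        "meta": {"pages": {"current_page": page, "next_page": next_page}},
        "data": [{"id": f"record-{page}"}],
    }

pages = collect_pages(fake_fetch, start_page=1, end_page=2)
print([p["meta"]["pages"]["current_page"] for p in pages])  # [1, 2]
```

Passing the fetcher in as an argument keeps the loop testable without network access; in the notebook it would wrap the real `requests.get` call.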

In [None]:
# Step 3 (alternative): Load a pre-scraped JSON file from BPL. The file is on the SCC under /projectnb/sparkgrp/ml-bpl-rag-data/extraneous/metadata/

with open("./out1_1302.json", "r") as f:
    raw_data = json.load(f)

print(f"Loaded {len(raw_data)} pages of metadata.")

Loaded 1301 pages of metadata.


We loaded a JSON dump from the Digital Commonwealth API containing metadata records.
The dump contains 1301 pages of up to 100 records each, which matches the 130,100 unique records counted in the next step. This raw data is the foundation of our vector-based search system.
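The pages were requested with `per_page=100`, so 1301 full pages should hold 130,100 records. A small sketch of flattening pages into records and checking that arithmetic; `iter_records` is a hypothetical helper and the two toy pages stand in for the real dump.

```python
def iter_records(pages):
    """Yield every record dict from a list of page dicts."""
    for page in pages:
        yield from page.get("data", [])

# Two toy pages standing in for the real dump.
toy_pages = [
    {"data": [{"id": "a"}, {"id": "b"}]},
    {"data": [{"id": "c"}]},
]
records = list(iter_records(toy_pages))
print(len(records))  # 3

# Expected record count if every loaded page is full.
pages_loaded, per_page = 1301, 100
print(pages_loaded * per_page)  # 130100
```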

In [5]:
# Step 4.1: Confirm total unique metadata records
unique_ids = set()
for page in raw_data:
    for item in page.get("data", []):
        unique_ids.add(item.get("id"))

print(f"Found {len(unique_ids)} unique metadata records.")

# Step 4.2: Display a couple example records
print("\nSample Records:")
record_count = 0
for page in raw_data:
    for item in page.get("data", []):
        print(f"ID: {item.get('id')}")
        print(f"Attributes: {list(item.get('attributes', {}).keys())[:5]}")
        print("-" * 40)
        record_count += 1
        if record_count == 3:
            break
    if record_count == 3:
        break

# Step 4.3: Get 3 raw entries from the common fields
fields_to_use = ['abstract_tsi', 'title_info_primary_tsi', 'title_info_primary_subtitle_tsi']
raw_samples = []

for page in raw_data:
    for item in page.get("data", []):
        attrs = item.get("attributes", {})
        for field in fields_to_use:
            if field in attrs:
                entry = str(attrs[field])
                if entry.strip():
                    raw_samples.append(entry)
                    break
        if len(raw_samples) == 3:
            break
    if len(raw_samples) == 3:
        break
# Step 4.4: Embed and show vector dimensions
from time import time

start = time()
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
sample_vectors = embeddings.embed_documents(raw_samples)

for i, vec in enumerate(sample_vectors):
    print(f"\n🔹 Vector {i+1} (Length: {len(vec)})")
    print(vec[:10], "...")

print(f"\nEmbedded 3 samples in {time() - start:.2f} seconds.")


Found 130100 unique metadata records.

Sample Records:
ID: commonwealth-oai:xp68md23x
Attributes: ['id', 'system_create_dtsi', 'system_modified_dtsi', 'curator_model_ssi', 'curator_model_suffix_ssi']
----------------------------------------
ID: commonwealth-oai:xp68m844v
Attributes: ['id', 'system_create_dtsi', 'system_modified_dtsi', 'curator_model_ssi', 'curator_model_suffix_ssi']
----------------------------------------
ID: commonwealth-oai:xp68mb49n
Attributes: ['id', 'system_create_dtsi', 'system_modified_dtsi', 'curator_model_ssi', 'curator_model_suffix_ssi']
----------------------------------------





🔹 Vector 1 (Length: 384)
[-0.03625216335058212, 0.154201477766037, -0.061468496918678284, -0.030966727063059807, 0.11117375642061234, 0.04476006329059601, 0.05122426897287369, -0.06605665385723114, 0.02305932343006134, 0.03184349462389946] ...

🔹 Vector 2 (Length: 384)
[-0.012593441642820835, 0.08133575320243835, -0.002853062003850937, 0.028024710714817047, 0.14035390317440033, 0.04992254823446274, 0.11209998279809952, -0.010191850364208221, -0.060609083622694016, -0.01612008363008499] ...

🔹 Vector 3 (Length: 384)
[-0.09615501016378403, 0.13938933610916138, -0.08657781779766083, 0.04440562427043915, 0.04893776774406433, 0.010761924088001251, 0.075003981590271, -0.05063491687178612, -0.0393863208591938, -0.050925690680742264] ...

Embedded 3 samples in 10.60 seconds.


Here, we looped through the JSON pages to extract and count unique item IDs.
We confirmed that we have a total of 130,100 unique metadata records, which will later be embedded and uploaded to Pinecone.

To verify the structure, we printed the first three records. Each has an ID and several fields, like system_create_dtsi and curator_model_ssi. These fields are useful for tracing what metadata is available for embedding and display.

From the dataset, we selected three sample entries from high-signal fields:

    - abstract_tsi

    - title_info_primary_tsi

    - title_info_primary_subtitle_tsi

This shows what kind of natural language content is available and gives us a preview of what we’ll embed.
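The field selection can be written as one small function. This is a sketch rather than the notebook's own code: `first_text` is a hypothetical name, and joining list values is an added assumption for multi-valued fields (such as `note_tsim`, which shows up as a stringified list in the search results later on).

```python
FIELDS_TO_USE = [
    "abstract_tsi",
    "title_info_primary_tsi",
    "title_info_primary_subtitle_tsi",
]

def first_text(attrs, fields=FIELDS_TO_USE):
    """Return (field, text) for the first listed field with non-empty text, else None."""
    for field in fields:
        value = attrs.get(field)
        if isinstance(value, (list, tuple)):
            # Multi-valued field: join the values instead of embedding "['a', 'b']".
            value = " ".join(str(v) for v in value)
        text = str(value).strip() if value is not None else ""
        if text:
            return field, text
    return None

print(first_text({"title_info_primary_tsi": "A guide to Boston"}))
```

The order of `FIELDS_TO_USE` sets the priority: an abstract wins over a title when both are present.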

Using the sentence-transformers/all-MiniLM-L6-v2 model, we embedded the 3 text samples and displayed their vector size and a preview of the values.

Each vector has 384 dimensions, the output size of all-MiniLM-L6-v2; the Pinecone index must be created with the same dimension.
This step confirms that the embedding model is working as expected.
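Since a vector index has a fixed dimension, it can help to check vector lengths before uploading anything. A minimal sketch with a hypothetical `check_dimensions` helper; the toy vectors below stand in for real embeddings.

```python
def check_dimensions(vectors, expected_dim=384):
    """Raise ValueError if any vector's length differs from expected_dim."""
    for i, vec in enumerate(vectors):
        if len(vec) != expected_dim:
            raise ValueError(
                f"Vector {i} has {len(vec)} dimensions, expected {expected_dim}"
            )
    return len(vectors)

# Toy vectors of the right size; real ones would come from embed_documents.
toy_vectors = [[0.0] * 384 for _ in range(3)]
print(check_dimensions(toy_vectors))  # 3
```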

In [None]:
# Step 5: Connect to Pinecone vector DB (pre-loaded using load_pinecone.py)

import os  # needed for os.getenv below

from pinecone import Pinecone
from langchain_pinecone import PineconeVectorStore

# Load Pinecone API key
pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY"))

# Connect to existing index
index = pc.Index("bpl-rag")

# Initialize LangChain-compatible VectorStore
vector_store = PineconeVectorStore(
    index=index,
    embedding=embeddings  # HuggingFaceEmbeddings from earlier step
)

print("Connected to Pinecone vector store.")


Connected to Pinecone vector store.


We connected to the existing Pinecone index named "bpl-rag" and wrapped it using LangChain’s PineconeVectorStore.
This index was pre-loaded using load_pinecone.py, which streamed in the full 800,000+ embedded documents.
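`load_pinecone.py` is not shown in this notebook, so the following is only a guess at the kind of bookkeeping such a loader needs: stable vector IDs, so that re-running the load overwrites rather than duplicates, and fixed-size batches for uploading. Both helper names and the ID scheme are hypothetical.

```python
import hashlib
from itertools import islice

def make_vector_id(source_id, field, chunk_index=0):
    """Derive a stable ID from the record ID, field name, and chunk position."""
    key = f"{source_id}|{field}|{chunk_index}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:32]

def batched(items, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch

print(make_vector_id("commonwealth:9s161c653", "title_info_primary_tsi"))
print(list(batched(range(5), 2)))  # [[0, 1], [2, 3], [4]]
```

The `source` and `field` inputs mirror the metadata keys that appear on the documents returned by the search cells.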

In [9]:
# Step 6: Run Semantic Query
query = "What happened in Boston historical events happened in Boston?"
retrieved_docs = vector_store.similarity_search(query, k=5)

# Print top 5 results
for i, doc in enumerate(retrieved_docs):
    print(f"\n Result {i+1}")
    print(f"Source ID: {doc.metadata['source']} | Field: {doc.metadata['field']}")
    print(f"Content Preview:\n{doc.page_content[:300]} ...")



 Result 1
Source ID: commonwealth-oai:br86br38q | Field: title_info_primary_tsi
Content Preview:
Historical note (Boston, Mass.) ...

 Result 2
Source ID: commonwealth-oai:0p09c0464 | Field: title_info_primary_tsi
Content Preview:
Historic Boston in Four Seasons ...

 Result 3
Source ID: commonwealth:9s161c653 | Field: title_info_primary_tsi
Content Preview:
A guide to Boston ...

 Result 4
Source ID: commonwealth:fj237t22c | Field: title_info_primary_tsi
Content Preview:
Old State House and the site of the Boston Massacre ...

 Result 5
Source ID: commonwealth:416883725 | Field: note_tsim
Content Preview:
['Title and date from item, from additional material accompanying item, or from information provided by the Boston Public Library.', 'Published in: A History of Boston, by Caleb H. Snow, published by Abel Bowen, 1825.'] ...


We submitted a query:
"What happened in Boston historical events happened in Boston?"

Using vector similarity search, we retrieved 5 records from Pinecone based on semantic match. The results included:

    - Historical notes

    - A guide to Boston

    - The Old State House and the Boston Massacre

This shows that vector-based retrieval can match a natural language query to historically relevant metadata records without relying on exact keyword matches.
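Conceptually, the search embeds the query with the same model and returns the k stored vectors closest to it. The brute-force version below is a toy illustration of that idea, not how Pinecone works internally. It assumes cosine similarity, which is an assumption about how the bpl-rag index was configured, and the three-dimensional vectors are made up.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, doc_vecs, k=5):
    """Return the k (doc_id, score) pairs with the highest cosine similarity."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)
    return [(doc_id, score) for score, doc_id in scored[:k]]

# Made-up 3-dimensional "embeddings" keyed by a label.
docs = {
    "history": [0.9, 0.1, 0.0],
    "guide": [0.6, 0.6, 0.1],
    "recipes": [0.0, 0.1, 0.9],
}
print(top_k([1.0, 0.0, 0.0], docs, k=2))
```

The query vector points along the first axis, so "history" ranks first and "guide" second, while "recipes" is dropped.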

In [12]:
# Step 7: BM25 reranking of Pinecone-retrieved documents
from langchain_community.retrievers import BM25Retriever
import requests

# Fetch full metadata content from Digital Commonwealth API
def fetch_full_text(source_id):
    """Return the concatenated key metadata fields for one record, or "" on failure."""
    url = f"https://www.digitalcommonwealth.org/search/{source_id}.json"
    try:
        response = requests.get(url, timeout=30)
        if response.status_code != 200:
            return ""
        data = response.json().get("data", {}).get("attributes", {})
    except (requests.exceptions.RequestException, ValueError):
        # Network failure, or a response body that is not valid JSON
        return ""
    # Use key fields for richer reranking context
    fields = ["title_info_primary_tsi", "abstract_tsi", "subject_geographic_sim", "genre_basic_ssim", "date_tsim"]
    return " ".join(str(data[field]) for field in fields if field in data)

# Get full versions of each doc
bm25_docs = []
seen = set()
for doc in retrieved_docs:
    source_id = doc.metadata.get("source")
    if not source_id or source_id in seen:
        continue
    seen.add(source_id)
    text = fetch_full_text(source_id)
    if text:
        bm25_docs.append(Document(page_content=text, metadata=doc.metadata))

# Run BM25 reranking
bm25_retriever = BM25Retriever.from_documents(bm25_docs, k=5)
reranked_docs = bm25_retriever.invoke(query)

# Preview reranked results
print("Top reranked documents (BM25):")
for i, doc in enumerate(reranked_docs):
    print(f"\nBM25 Rank {i+1}:")
    print(f"Source ID: {doc.metadata.get('source')} | Field: {doc.metadata.get('field')}")
    print(f"Content Preview:\n{doc.page_content[:300]} ...")


Top reranked documents (BM25):

BM25 Rank 1:
Source ID: commonwealth-oai:0p09c0464 | Field: title_info_primary_tsi
Content Preview:
Historic Boston in Four Seasons Picture book, includes photo of "Great Spirit" by Cyrus Dallin, p.64 ['Books'] ['1938'] ...

BM25 Rank 2:
Source ID: commonwealth-oai:br86br38q | Field: title_info_primary_tsi
Content Preview:
Historical note (Boston, Mass.) Notes by Charles C. Coffin on the history of Boston, Massachusetts, including brief biographies of important people in Boston's history. The notes are handwritten on loose sheets; some pages have writing on another subject on the back. Digitization funded by a grant f ...

BM25 Rank 3:
Source ID: commonwealth:9s161c653 | Field: title_info_primary_tsi
Content Preview:
A guide to Boston ['Boston', 'Massachusetts', 'North and Central America', 'Suffolk (county)', 'United States'] ['Maps'] ['[ca. 1895]'] ...

BM25 Rank 4:
Source ID: commonwealth:fj237t22c | Field: title_info_primary_tsi
Content Preview:
Old 

Although vector search gives us semantically close results, we wanted to further refine the ranking using metadata-aware signals.

For each of the top 5 results, we:

    1. Hit the Digital Commonwealth API using the record’s source ID to fetch full metadata.

    2. Extracted text from key fields like title, abstract, genre, and date.

    3. Created new Document objects for re-ranking.

    4. Used BM25 to rerank the documents based on the original query.

The new ranking moved up records where the terms "historic" and "Boston" appear in the fetched metadata, for example books and photo guides specifically about Boston’s history.

Because BM25 only sees the five documents that vector search returned, this step reorders the existing candidates rather than finding new ones. It uses structured metadata to refine the order when vector search gives noisy matches.
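To make the reranking signal concrete, here is a small pure-Python sketch of Okapi BM25 scoring. It is not the implementation used by `BM25Retriever`: the whitespace tokenizer, the `k1` and `b` values, and the "+1" IDF variant are all assumptions of this example, and each distinct query term is counted once.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()

def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    if not documents:
        return []
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in tokenized for term in set(doc))

    scores = []
    for doc in tokenized:
        term_freq = Counter(doc)
        score = 0.0
        for term in set(tokenize(query)):
            if term not in term_freq:
                continue
            # The "+1" inside the log keeps IDF non-negative on tiny corpora.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            tf = term_freq[term]
            # Long documents are penalized relative to the average length.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores

titles = ["Historic Boston in Four Seasons", "A guide to Boston maps", "Old State House"]
scores = bm25_scores("historic boston", titles)
print(sorted(zip(scores, titles), reverse=True))
```

Note that the corpus statistics (document frequency, average length) come only from the candidates passed in, so with five candidates a term that appears in all of them contributes very little to the ordering.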

In [None]:
# Step 8: Engineer prompt and generate response using GPT-4o-mini (with dynamic sources)

from langchain.prompts import PromptTemplate
from langchain.chat_models import ChatOpenAI  # for LangChain < 0.1.0
import re
import os

# Format each retrieved doc with source info
# (this uses the vector-search order in retrieved_docs; the BM25-reranked docs are not used here)
context_blocks = []
source_refs = []
for doc in retrieved_docs:
    source_id = doc.metadata.get("source", "Unknown Source")
    field = doc.metadata.get("field", "Unknown Field")
    content = doc.page_content.strip()
    block = f"[Source: {source_id} | Field: {field}]\n{content}"
    context_blocks.append(block)
    source_refs.append(f"{source_id} | {field}")

# Combine into one prompt context
context = "\n\n".join(context_blocks)

# Define librarian-style prompt
answer_template = PromptTemplate.from_template(
    """Pretend you are a professional librarian. Please summarize the following context as though you had retrieved it for a patron.

    Each source is labeled with [Source: ID | Field: field_name].

    Context:
    {context}

    Make sure to answer in the following format:
    <REASONING>your reasoning here</REASONING>
    <VALID>YES or NO</VALID>
    <RESPONSE>your answer here</RESPONSE>

    Here is an example:
    <EXAMPLE>
    <QUERY>Are pineapples a good fuel for cars?</QUERY>
    <CONTEXT>Cars use gasoline for fuel. Some cars use electricity for fuel. Tesla stock has increased by 10 percent.</CONTEXT>
    <REASONING>Context discusses gasoline and electricity but not pineapples, so it's not relevant.</REASONING>
    <VALID>NO</VALID>
    <RESPONSE>Pineapples are not a viable fuel source for cars at this time.</RESPONSE>
    </EXAMPLE>

    Now it's your turn:
    <QUERY>
    {query}
    </QUERY>
    """
)

# Inject the context and query
final_prompt = answer_template.format(context=context, query=query)

# Initialize the LLM (use .env key, not hardcoded)
llm = ChatOpenAI(
    model_name="gpt-4o-mini",
    temperature=0,
    openai_api_key=os.getenv("OPENAI_API_KEY")
)

# Generate LLM response
response = llm.predict(final_prompt)

# Helper to extract XML-tagged responses
def parse_xml_tag(xml: str, tag: str) -> str:
    match = re.search(f"<{tag}>(.*?)</{tag}>", xml, re.DOTALL)
    return match.group(1).strip() if match else f"{tag} not found."

# Parse XML sections
reasoning = parse_xml_tag(response, "REASONING")
validity = parse_xml_tag(response, "VALID")
final_answer = parse_xml_tag(response, "RESPONSE")

# Display answer and reasoning
print("REASONING:\n", reasoning)
print("\nFINAL RESPONSE:\n", final_answer)

# Display referenced source IDs and fields
print("\nHere are your provided sources:")
for i, src in enumerate(source_refs):
    print(f"Result {i+1}: {src}")


REASONING:
 The context provides titles of sources related to Boston's history, including a specific mention of the Old State House and the Boston Massacre, which are significant historical events. Additionally, there are references to a guide and a historical note about Boston, indicating a focus on its historical significance. Therefore, the context is relevant to the query about historical events in Boston.

FINAL RESPONSE:
 The context includes references to significant historical events in Boston, such as the Boston Massacre, and provides various sources that discuss Boston's history, making it relevant to your inquiry about historical events in the city.

Here are your provided sources:
Result 1: commonwealth-oai:br86br38q | title_info_primary_tsi
Result 2: commonwealth-oai:0p09c0464 | title_info_primary_tsi
Result 3: commonwealth:9s161c653 | title_info_primary_tsi
Result 4: commonwealth:fj237t22c | title_info_primary_tsi
Result 5: commonwealth:416883725 | note_tsim


We created a librarian-style prompt that includes:

    - The retrieved text from each source

    - Source metadata

    - An instruction to summarize and validate the answer

The prompt was sent to OpenAI's gpt-4o-mini model, which returned:

Reasoning:

- "The context provides titles of sources related to Boston's history... Therefore, the context is relevant to the query..."

Final Response:

- "The context includes references to significant historical events in Boston, such as the Boston Massacre... making it relevant to your inquiry about historical events in the city."
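The cell above parses the `<VALID>` tag but does not act on it. One possible use, sketched here under assumptions, is to show the model's answer only when it judged the context relevant. The function names and the fallback wording are hypothetical.

```python
import re

def parse_reply(reply):
    """Extract the tagged sections from an LLM reply; missing tags become None."""
    sections = {}
    for tag in ("REASONING", "VALID", "RESPONSE"):
        match = re.search(rf"<{tag}>(.*?)</{tag}>", reply, re.DOTALL)
        sections[tag] = match.group(1).strip() if match else None
    return sections

def answer_for_patron(reply, fallback="I could not find relevant material for that question."):
    """Return the RESPONSE text only when the model marked the context as valid."""
    sections = parse_reply(reply)
    is_valid = (sections["VALID"] or "").upper() == "YES"
    if is_valid and sections["RESPONSE"]:
        return sections["RESPONSE"]
    return fallback

sample = "<REASONING>Titles mention Boston.</REASONING><VALID>YES</VALID><RESPONSE>See the Old State House record.</RESPONSE>"
print(answer_for_patron(sample))
```

Returning `None` for a missing tag, instead of a placeholder string, lets the caller tell a malformed reply apart from real model output.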