# Main Harvester Pipeline (PubMed → Unpaywall)

This notebook orchestrates:
1. PubMed harvesting
2. PubMed XML parsing
3. Normalization into the internal `health_document` schema
4. DOI extraction
5. Unpaywall OA enrichment using DOI

Adapters used (loaded as notebooks from `adapters/`):
- `adapter_pubmed`
- `adapter_unpaywall`

## Dependencies

In [1]:
import json
import os
from pathlib import Path

## Storage directory configuration

In [2]:
PROJECT_ROOT = Path.cwd()          # main.ipynb directory
STORAGE_DIR = PROJECT_ROOT / "storage"

STORAGE_DIR.mkdir(parents=True, exist_ok=True)

print("Storage dir:", STORAGE_DIR.resolve())

Storage dir: C:\Users\Aman Sheikh\Desktop\Projects\VeriFact\Model\harvester\storage


## Import the adapter notebooks

In [3]:
%run adapters/adapter_pubmed.ipynb
%run adapters/adapter_unpaywall.ipynb

## PubMed keyword search (TODO: add a dynamic keyword mechanism later)

In [4]:
# Example: start with a fixed keyword search.
SEARCH_KEYWORD = "air circulation"
RETRIEVE_MAX = 5
search_resp = pubmed_search(SEARCH_KEYWORD, retmax=RETRIEVE_MAX)
# Extract the PMIDs from the search response.
pmids = search_resp["esearchresult"]["idlist"]
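
`pubmed_search` lives in the adapter notebook and is not shown here. The `esearchresult` and `idlist` keys suggest it wraps the NCBI E-utilities ESearch endpoint with `retmode=json`. The following is a minimal sketch under that assumption: `build_esearch_url` and `extract_pmids` are hypothetical helper names, and `sample_resp` is an illustrative response rather than real output.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, retmax=5):
    """Build an ESearch URL that asks PubMed for up to `retmax` PMIDs as JSON."""
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


def extract_pmids(search_resp):
    """Return the PMID list, or an empty list if the expected keys are missing."""
    return search_resp.get("esearchresult", {}).get("idlist", [])


# Illustrative response shape (hypothetical values, not fetched from the API).
sample_resp = {"esearchresult": {"count": "2", "idlist": ["41509674", "41498053"]}}

print(build_esearch_url("air circulation"))
print(extract_pmids(sample_resp))
```

Using `.get` with defaults means an error payload without `esearchresult` yields an empty list instead of a `KeyError`, which may or may not be the behavior the pipeline wants.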

## PubMed fetch

In [5]:
print("ℹ️ Fetching PubMed XML...")
pubmed_xml = pubmed_fetch(pmids)
print("✅ PubMed XML fetched")

ℹ️ Fetching PubMed XML...
✅ PubMed XML fetched


## Parse fetched XML records

In [6]:
print("ℹ️ Parsing PubMed XML...")
parsed_pubmed_records = parse_pubmed_xml(pubmed_xml)

print(f"✅ Parsed {len(parsed_pubmed_records)} PubMed records")
print("ℹ️ Showing the first PubMed record:")
parsed_pubmed_records[0]

ℹ️ Parsing PubMed XML...
✅ Parsed 5 PubMed records
ℹ️ Showing the first PubMed record:


{'pmid': '41509674',
 'title': 'Synergistic effects of genetic susceptibility and air pollution on cardiovascular disease.',
 'abstract': 'BACKGROUND: Although both genetic susceptibility and air pollution are established risk factors for cardiovascular disease (CVD), evidence for their interaction, particularly on the additive scale, remains limited and inconclusive. We aimed to investigate the individual and joint effects of long-term exposure to air pollutants and polygenic risk on incident CVD. METHODS: In a prospective cohort of 460,572 participants from the UK Biobank, we estimated hazard ratios (HRs) for CVD associated with particulate matter (PM RESULTS: Over a median follow-up of 11.92 years, 48,690 incident CVD cases occurred. Both a higher genetic risk and increased air pollution exposure were independently associated with elevated CVD risk. Notably, a significant synergistic effect was observed. Compared to participants with low genetic risk and low pollution exposure, thos
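
`parse_pubmed_xml` is defined in the adapter notebook. As a rough illustration of what such a parser might do, here is a minimal sketch using the standard library. The function name `parse_pubmed_xml_sketch` and the sample XML are hypothetical; only the element names (`PubmedArticle`, `MedlineCitation`, `PMID`, `ArticleTitle`, `AbstractText`) follow the PubMed XML layout. The printed abstract above reads "particulate matter (PM RESULTS:", which may indicate that text following inline markup such as `<sub>` is being dropped; the sketch uses `itertext()` to avoid that class of problem.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed sample in the PubMed XML layout.
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>123</PMID>
      <Article>
        <ArticleTitle>Example title.</ArticleTitle>
        <Abstract>
          <AbstractText Label="BACKGROUND">First part.</AbstractText>
          <AbstractText Label="METHODS">Exposure to PM<sub>2.5</sub> was estimated.</AbstractText>
        </Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def parse_pubmed_xml_sketch(xml_text):
    """Extract pmid, title, and a label-prefixed abstract from each article."""
    root = ET.fromstring(xml_text)
    records = []
    for article in root.iter("PubmedArticle"):
        parts = []
        for node in article.iter("AbstractText"):
            # itertext() also collects text inside and after inline child tags.
            text = "".join(node.itertext()).strip()
            label = node.get("Label")
            parts.append(f"{label}: {text}" if label else text)
        records.append({
            "pmid": article.findtext("MedlineCitation/PMID"),
            "title": article.findtext("MedlineCitation/Article/ArticleTitle"),
            "abstract": " ".join(parts),
        })
    return records


records = parse_pubmed_xml_sketch(SAMPLE_XML)
print(records[0]["abstract"])
```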

## Normalize the parsed records (into a list of dicts)

In [7]:
normalized_documents = []

for rec in parsed_pubmed_records:
    doc = normalize_pubmed_record(
        rec,
        raw_ref="pubmed_raw.xml"
    )
    normalized_documents.append(doc)

print(f"✅ Normalized {len(normalized_documents)} documents")

print("ℹ️ Showing the first normalized PubMed record:")
normalized_documents[0]

✅ Normalized 5 documents
ℹ️ Showing the first normalized PubMed record:


{'schema_version': '1.0',
 'document_id': '038717b9236b5fe8a56a8b2f856bab5ce9154cf27751f039354bf88beaed0bb7',
 'source': 'pubmed',
 'source_id': 'PMID:41509674',
 'identifiers': [{'type': 'pmid', 'value': '41509674'},
  {'type': 'pubmed', 'value': '41509674'},
  {'type': 'pmc', 'value': 'PMC12775991'},
  {'type': 'doi', 'value': '10.1016/j.ajpc.2025.101381'},
  {'type': 'pii', 'value': 'S2666-6677(25)00456-8'}],
 'title': 'Synergistic effects of genetic susceptibility and air pollution on cardiovascular disease.',
 'subtitle': None,
 'authors': [{'name': 'Yun-Jiu Cheng',
   'given_names': 'Yun-Jiu',
   'family_name': 'Cheng',
   'affiliations': ["Department of Guangdong Cardiovascular Institute, Guangdong Provincial People's Hospital, Guangdong Academy of Medical Sciences, Southern Medical University, Guangzhou, China."],
   'orcid': None,
   'email': None,
   'contribution_role': [],
   'author_id': None},
  {'name': 'Li-Juan Liu',
   'given_names': 'Li-Juan',
   'family_name': 'Liu',

## Save the normalized PubMed records (JSON and XML)

In [8]:
paths = save_normalized_pubmed(normalized_documents, pubmed_xml)
print(f"Saved raw XML: {paths['raw_xml_path']}")
print(f"Saved normalized JSON: {paths['normalized_json_path']}")

Saved raw XML: C:\Users\Aman Sheikh\Desktop\Projects\VeriFact\Model\harvester\storage\pubmed_raw.xml
Saved normalized JSON: C:\Users\Aman Sheikh\Desktop\Projects\VeriFact\Model\harvester\storage\pubmed_normalized.json


## Enrich documents with Unpaywall (multithreading)

In [9]:
# Enrich documents with Unpaywall (clean notebook output)

import os
import time
import random
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed

MAX_WORKERS = 10
MAX_RETRIES = 5
BACKOFF_BASE = 1  # seconds

# Thread-safe print
print_lock = threading.Lock()


def safe_print(*args, **kwargs):
    with print_lock:
        print(*args, **kwargs)


os.makedirs("output", exist_ok=True)


def process_doc(doc):
    identifiers = doc.get("identifiers", [])
    doi = extract_doi(identifiers)

    if not doi:
        return ("skipped", doc)

    try:
        unpay_json = None
        for attempt in range(1, MAX_RETRIES + 1):
            try:
                unpay_json = fetch_unpaywall(doi)
                break
            except Exception:
                # Out of retries: re-raise so the outer handler records the error.
                if attempt == MAX_RETRIES:
                    raise
                # Exponential backoff (1 s, 2 s, 4 s, ...) plus a little jitter.
                backoff = BACKOFF_BASE * (2 ** (attempt - 1)) + random.uniform(0, 0.2)
                time.sleep(backoff)

        # Tiny pause so concurrent workers do not hammer the API.
        time.sleep(random.uniform(0.05, 0.2))

        raw_unpay_path = save_unpaywall_raw(doi, unpay_json)
        with open(raw_unpay_path, "rb") as f:
            raw_bytes = f.read()

        enriched_doc = enrich_document_with_unpaywall(
            document=doc,
            unpay_json=unpay_json,
            raw_ref=raw_unpay_path,
            raw_bytes=raw_bytes
        )

        return ("ok", enriched_doc, raw_unpay_path)

    except Exception as e:
        return ("error", doc, str(e))


# Run threaded processing
enriched_documents = []
errors = []
total = len(normalized_documents)

with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
    future_to_doc = {executor.submit(process_doc, doc): doc for doc in normalized_documents}

    for i, future in enumerate(as_completed(future_to_doc), 1):
        result = future.result()
        status = result[0]

        if status == "ok":
            enriched_doc, raw_path = result[1], result[2]
            enriched_documents.append(enriched_doc)
            safe_print(f"✅ [{i}/{total}] {enriched_doc['document_id']} saved")
        elif status == "skipped":
            enriched_documents.append(result[1])
            safe_print(f"⚪ [{i}/{total}] {result[1]['document_id']} skipped (no DOI)")
        else:
            doc, err_msg = result[1], result[2]
            errors.append((doc.get("document_id"), err_msg))
            enriched_documents.append(doc)
            safe_print(f"❌ [{i}/{total}] {doc['document_id']} error: {err_msg}")

safe_print(f"✅ Unpaywall enrichment complete — {total} docs, {len(errors)} errors")
safe_print(f"✅ Enriched {len(enriched_documents)} docs")

✅ [1/5] bac2091c2fca1fd17322876a1fc29b0d68725cc07313cae0a8ced5b771724518 saved
✅ [2/5] 038717b9236b5fe8a56a8b2f856bab5ce9154cf27751f039354bf88beaed0bb7 saved
✅ [3/5] 92cbf24c4d9d1e21b363eca7a1f89c8ad8015424abebd39c59384cd478972fab saved
✅ [4/5] 8737a58e13b0941a67789fcd87902a34b6277c2625cdd72d3902c45f08610215 saved
✅ [5/5] 86c399a6544e74bce695563ced1d7b0b359ff8fba5a5cfac216c86e12f25b356 saved
✅ Unpaywall enrichment complete — 5 docs, 0 errors
✅ Enriched 5 docs
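
To make the retry timing in the cell above concrete, the sketch below reproduces its backoff formula and computes the worst-case total sleep for one document. The helper names `backoff_delay` and `worst_case_wait` are hypothetical; the constants mirror `MAX_RETRIES = 5`, `BACKOFF_BASE = 1`, and the `0.2` s jitter ceiling from the cell.

```python
import random


def backoff_delay(attempt, base=1.0, jitter=0.2):
    """Delay after a failed attempt: base * 2^(attempt - 1) plus random jitter."""
    return base * (2 ** (attempt - 1)) + random.uniform(0, jitter)


def worst_case_wait(max_retries=5, base=1.0, jitter=0.2):
    """Upper bound on total sleep: a sleep follows failed attempts 1..max_retries-1."""
    return sum(base * (2 ** (a - 1)) + jitter for a in range(1, max_retries))


for attempt in range(1, 5):
    floor = 1.0 * (2 ** (attempt - 1))
    print(f"attempt {attempt}: sleeps between {floor:.1f} s and {floor + 0.2:.1f} s")

print(f"worst case total: {worst_case_wait():.1f} s")
```

With these settings a document that fails every attempt sleeps for 1 + 2 + 4 + 8 seconds plus up to 0.8 seconds of jitter before the final failure is reported, so a single bad DOI can hold a worker for roughly 16 seconds.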


## Display the enrichment fields for the first enriched record

In [10]:
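# as_completed yields results in completion order, so index 0 is the first
# document to finish, not necessarily the first PubMed record.
final_doc = enriched_documents[0]
print("Showing the first PubMed enrichment-only data:")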
print(json.dumps({
    "document_id": final_doc["document_id"],
    "source": final_doc["source"],
    "identifiers": final_doc["identifiers"],
    "access": final_doc["access"],
    "license": final_doc["license"],
    "tags": final_doc["tags"]
}, indent=2))


Showing the first PubMed enrichment-only data:
{
  "document_id": "bac2091c2fca1fd17322876a1fc29b0d68725cc07313cae0a8ced5b771724518",
  "source": "pubmed",
  "identifiers": [
    {
      "type": "pmid",
      "value": "41498053"
    },
    {
      "type": "pubmed",
      "value": "41498053"
    },
    {
      "type": "pmc",
      "value": "PMC12765818"
    },
    {
      "type": "doi",
      "value": "10.1002/iju5.70111"
    },
    {
      "type": "pii",
      "value": "IJU570111"
    }
  ],
  "access": {
    "has_fulltext": true,
    "access_type": "open",
    "fulltext_urls": [
      {
        "url": "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12765818/",
        "format": "html",
        "source": "pmc"
      },
      {
        "url": "https://onlinelibrary.wiley.com/doi/pdfdirect/10.1002/iju5.70111",
        "format": "pdf",
        "source": "unpaywall:publisher"
      },
      {
        "url": "https://pmc.ncbi.nlm.nih.gov/articles/PMC12765818/",
        "format": "html",
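
The record printed here has `document_id` `bac2091c...`, while the first normalized record was `038717b9...`. That is expected: `as_completed` yields futures in completion order, so the list built in the enrichment cell is ordered by finish time. If input order matters, one option is to store each result at its original index. The sketch below shows the idea with a stand-in worker; `run_in_input_order` and `slow_upper` are hypothetical names, not part of the pipeline.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def slow_upper(item):
    """Stand-in worker: earlier items sleep longer, so they finish last."""
    delay, text = item
    time.sleep(delay)
    return text.upper()


def run_in_input_order(func, items, max_workers=4):
    """Run func over items in threads and return results in input order."""
    results = [None] * len(items)
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_index = {
            executor.submit(func, item): i for i, item in enumerate(items)
        }
        for future in as_completed(future_to_index):
            # Place each result at the index of the item that produced it.
            results[future_to_index[future]] = future.result()
    return results


items = [(0.03, "a"), (0.02, "b"), (0.01, "c")]
print(run_in_input_order(slow_upper, items))  # ['A', 'B', 'C']
```

This keeps the per-item progress reporting that `as_completed` allows while still producing a list aligned with the input.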
     

## Write each record to a separate file

In [11]:
os.makedirs("output", exist_ok=True)


for doc in enriched_documents:
    # Build the output path from the document ID (placeholder naming scheme; revisit later).
    output_path = f"output/{doc['document_id']}.json"
    # Open the file for writing
    with open(output_path, "w", encoding="utf-8") as f:
        # Dump the JSON value in the file (fall back to string if we couldn't serialize)
        json.dump(doc, f, indent=2, default=str)

    print("Saved:", output_path)

Saved: output/bac2091c2fca1fd17322876a1fc29b0d68725cc07313cae0a8ced5b771724518.json
Saved: output/038717b9236b5fe8a56a8b2f856bab5ce9154cf27751f039354bf88beaed0bb7.json
Saved: output/92cbf24c4d9d1e21b363eca7a1f89c8ad8015424abebd39c59384cd478972fab.json
Saved: output/8737a58e13b0941a67789fcd87902a34b6277c2625cdd72d3902c45f08610215.json
Saved: output/86c399a6544e74bce695563ced1d7b0b359ff8fba5a5cfac216c86e12f25b356.json
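
The cell above passes `default=str` to `json.dump`, which silently converts any value the encoder cannot handle into its string form. The sketch below shows the effect on a round trip. The document and its field names (`raw_ref`, `retrieved`) are hypothetical and chosen only to include non-JSON types.

```python
import json
from datetime import date
from pathlib import PurePosixPath

# Hypothetical document containing values json cannot encode natively.
doc = {
    "document_id": "abc",
    "raw_ref": PurePosixPath("storage/raw.json"),
    "retrieved": date(2025, 1, 2),
}

# default=str is called for each unsupported value and its result is encoded.
encoded = json.dumps(doc, default=str)
decoded = json.loads(encoded)

print(decoded["raw_ref"])    # storage/raw.json
print(decoded["retrieved"])  # 2025-01-02
```

The values come back as plain strings, not as path or date objects, so any code that reloads these files has to re-parse such fields itself.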
