A two-stage pipeline for harvesting oncology nutrition research papers and generating short-answer Q/A training data using LIMO principles, designed for RLHF fine-tuning.
NORE (Nutrition-Oncology Research Engine) produces exact-match Q/A pairs (1-3 word answers) from clinical nutrition and oncology literature for training domain-specific language models.
PubMed / Europe PMC
│
▼
┌─ Harvest ──────────────────────────┐
│ Discover → Screen → Locate → │
│ Extract (Firecrawl default) │
└────────┬───────────────────────────┘
│ .firecrawl.md + .meta.json
▼
┌─ Train ────────────────────────────┐
│ Text Route → Chunk → Gate → │
│ Generate Q/A → Verify → Output │
└────────┬───────────────────────────┘
│ JSONL
▼
Deduplicate → Compile CSV → RLHF Training
- Discover candidate papers via PubMed E-utilities + Europe PMC using topic-specific MeSH queries
- Screen abstracts through an LLM relevance gate (score >= 6/10 to pass)
- Locate open-access PDF URLs via PMC OA, Unpaywall, or Europe PMC
- Extract full text via Firecrawl server-side PDF parsing (default), or download PDFs locally (
--download-pdfsfor RLHF traceability)
- Route text source — auto-detects
.firecrawl.mdor falls back to local PDF extraction via PyMuPDF - Chunk into overlapping word windows (800 words, 50-word overlap)
- Gate chunks by relevance (filters out references, copyright, metadata)
- Augment accepted chunks with contextual enrichment
- Generate freeform Q/A pairs via LLM (OpenAI or Together AI)
- Verify quality through heuristic checks + LLM semantic verification (optional)
- Deduplicate via human-in-the-loop Excel triage with semantic similarity
- Compile to CSV for downstream RLHF training
| File | Description |
|---|---|
nore_paper_harvester.py |
Stage 1: discover, screen, and extract papers via Firecrawl or PDF download |
mupdf_trainer_v3.py |
Stage 2: text routing, chunking, Q/A generation, verification loop |
llm_adapter.py |
Universal LLM backend supporting Together AI and OpenAI with auto-detection |
prompts_qa.py |
Prompt templates for relevance gating, augmentation, and freeform Q/A generation |
qa_pipeline_skeleton_v2.py |
Lightweight skeleton: relevance gate + chunk augmentation (dependency injection) |
verification_qa.py |
Two-stage quality verification: heuristic pre-checks + LLM semantic validation |
duplicate_triage.py |
Human-in-the-loop deduplication: semantic similarity to Excel workbook for review |
compile_qa.py |
Final CSV export from deduplicated JSONL |
firecrawl_extract.py |
Standalone batch tool: retroactively extract text from existing PDFs via Firecrawl |
Supporting files: llm_smoke_test.py, qa_analyzer.py, paper_qa.py, setup_env.py, run_pipeline.sh
python -m venv pymupdf-venv
source pymupdf-venv/bin/activate
pip install -r requirements.txt
pip install firecrawl # Firecrawl SDK (default extraction)
pip install sentence-transformers pandas openpyxl # for dedup/triage stageCreate a .env file:
FIRECRAWL_API_KEY=fc-... # Required for default Firecrawl mode
NCBI_API_KEY=... # Recommended for faster PubMed rate limits
OPENAI_API_KEY=sk-... # Required for screening + Q/A generation
TOGETHER_API_KEY=... # Optional alternative LLM backend
# Default: Firecrawl text extraction (no PDFs downloaded); requires your NCBI linked email
python nore_paper_harvester.py --email your@email.com
# Single topic with limit
python nore_paper_harvester.py --email your@email.com --topic drug_nutrient --max-per-topic 100
# Download PDFs locally instead (for RLHF traceability)
python nore_paper_harvester.py --email your@email.com --download-pdfs
# Resume interrupted run
python nore_paper_harvester.py --your@email.com --resume# Process ALL topics at once (recurses into subdirectories)
python mupdf_trainer_v3.py ./harvested_papers --gen_qa \
--llm_model gpt-4.1-mini --qa_k 1 --rpm 20 --pretty_json
# Process a single topic
python mupdf_trainer_v3.py ./harvested_papers/drug_nutrient --gen_qa \
--llm_model gpt-4.1-mini --qa_k 1 --rpm 20 --pretty_json
# Force Firecrawl text only (error if .firecrawl.md missing)
python mupdf_trainer_v3.py ./harvested_papers --gen_qa \
--text-source firecrawl --llm_model gpt-4.1-mini --qa_k 1 --rpm 20
# Force local PDF extraction (original v3 behavior)
python mupdf_trainer_v3.py ./pdfs --gen_qa \
--text-source mupdf --llm_model gpt-4.1-mini --qa_k 1 --rpm 20
# With verification enabled
python mupdf_trainer_v3.py ./harvested_papers --gen_qa --enable-verification \
--llm_model gpt-4.1-mini --qa_k 1 --rpm 20
# Test LLM connectivity
python mupdf_trainer_v3.py --llm_smoke_test# Phase 1: Generate triage workbook for human review
python duplicate_triage.py --triage --dir qa_jsonl --out triage_workbook.xlsx
# Human reviews triage_workbook.xlsx in Excel (fill in decisions column)
# Phase 2: Apply human decisions
python duplicate_triage.py --apply --workbook triage_workbook.xlsx --dir qa_jsonlpython compile_qa.py --input-dir qa_jsonl --output qa_output.csvThe trainer's --text-source flag controls how text is extracted:
| Flag | Behavior |
|---|---|
auto (default) |
Use .firecrawl.md if it exists, fall back to mupdf |
firecrawl |
Require .firecrawl.md, error if missing |
mupdf |
Always use local PDF extraction via PyMuPDF |
The master summary report tracks which source was used per file:
Text Source Breakdown:
- Firecrawl: 142
- mupdf (local PDF): 3
The harvester searches 6 priority areas in nutrition oncology:
| Topic | Description |
|---|---|
drug_nutrient |
Drug-Nutrient Interactions in Oncology |
cachexia_sarcopenia |
Cancer Cachexia & Sarcopenia |
immunotherapy_nutrition |
Immunotherapy & Nutrition |
cancer_malnutrition |
Cancer-Related Malnutrition |
dietary_patterns |
Dietary Patterns & Cancer Outcomes |
microbiome_diet_cancer |
Microbiome, Diet & Cancer |
Each topic defines 6-8 PubMed MeSH queries and 2-3 Europe PMC free-text queries.
Each Q/A record is a JSON object in JSONL format:
{
"type": "freeform",
"model": "gpt-4.1-mini",
"timestamp": "2026-03-12 10:54:37",
"file": "Curcumin_interactions_a1b2c3d4.pdf",
"chunk_id": 5,
"qa_id": "ff-5-0",
"passage_hash": "a1b2c3d4e5",
"question": "What enzyme catalyzes tryptophan degradation in the kynurenine pathway?",
"answer": "IDO1"
}Answers are constrained to 1-3 words for exact-match RLHF grading.
The pipeline supports two LLM providers via llm_adapter.py:
| Backend | Model Examples | Detection |
|---|---|---|
| OpenAI | gpt-4.1-mini, gpt-4.1, gpt-4o, o1, o3 |
Prefix match (gpt-, o1, o3, ...) |
| Together AI | moonshotai/Kimi-K2-Instruct-0905, meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo |
Contains / (org/model format) |
Backend is auto-detected from the model name, or forced with --llm_backend openai|together.
When --enable-verification is set, each Q/A pair goes through:
- Heuristic pre-check (no LLM call): forbidden reference patterns, answer length, safety term detection
- LLM semantic verification: factual accuracy vs source chunk, self-containment, clinical safety
- Confidence scoring: >= 0.9 auto-pass, 0.7-0.9 flag for review, < 0.7 regenerate (up to 2 attempts)
- Firecrawl as default (v4): Server-side PDF extraction produces cleaner text than local PyMuPDF (no column blending, header/footer artifacts). PDF download preserved as
--download-pdfsfor RLHF traceability. - Freeform-only (v3): MCQ and reasoning question types were removed because their answers (7-15 words) were too long for exact-match RLHF grading. Freeform produces clean 1-3 word answers.
- Human-in-the-loop dedup: Automated Union-Find clustering destroyed 85% of data through transitive chaining. The current system uses direct pairwise similarity with human confirmation in Excel.
- Seen-answers injection: Reduces within-PDF duplicate questions by injecting prior chunk answers as exclusions into the freeform prompt.
- Dependency injection: The skeleton receives
llm_fnandjson_parser_fnas arguments, keeping it backend-agnostic.
.
├── nore_paper_harvester.py # Stage 1: Paper harvesting
├── mupdf_trainer_v3.py # Stage 2: Q/A generation pipeline
├── llm_adapter.py # LLM backend (OpenAI + Together AI)
├── prompts_qa.py # Prompt templates
├── qa_pipeline_skeleton_v2.py # Relevance gate + augmentation
├── verification_qa.py # Q/A quality verification
├── duplicate_triage.py # Human-in-the-loop dedup
├── compile_qa.py # CSV export
├── firecrawl_extract.py # Standalone Firecrawl batch extraction
├── requirements.txt # Python dependencies
├── .env # API keys (not committed)
├── PIPELINE_ARCHITECTURE.md # Detailed architecture documentation
├── harvested_papers/ # Harvested data (per-topic subdirectories)
│ ├── drug_nutrient/
│ ├── cachexia_sarcopenia/
│ ├── immunotherapy_nutrition/
│ ├── cancer_malnutrition/
│ ├── dietary_patterns/
│ └── microbiome_diet_cancer/
├── qa_jsonl/ # Generated Q/A output (per-PDF JSONL)
└── chunks_csv/ # Intermediate chunk CSVs
| Branch | Description |
|---|---|
main |
v3 freeform-only pipeline |
feature/firecrawl-integration |
v4 with Firecrawl harvesting + text source routing |
v2-comprehensive-archive |
Previous v2 pipeline with MCQ + reasoning + freeform generation |
- Python 3.10+
- PyMuPDF >= 1.24.0
- Firecrawl SDK (
pip install firecrawl) - OpenAI SDK >= 1.40.0 and/or Together AI SDK >= 1.2.0
- sentence-transformers (for deduplication)
- openpyxl, pandas, tqdm
This pipeline is part of the NORE (Nutrition-Oncology Research Engine) project at Baylor University Greathouse Lab, developed in collaboration with Argonne National Lab. The generated Q/A datasets are used for RLHF fine-tuning of domain-specific language models for clinical nutrition and oncology applications.