Zero-loss Pathfinder 1st Edition data extraction and transformation pipeline. Parses all content from the Archives of Nethys (d20pfsrd.com) into a fully normalized DuckDB database with 416 tables across bronze, silver, and marts schemas.
This project extracts every piece of Pathfinder 1e data from HTML into structured, queryable format:
- 3,659 creatures (monsters, NPCs, mythic creatures)
- 7,000+ spells with full metadata
- 2,000+ feats with parsed prerequisites
- 10,000+ items with aura schools, construction requirements
- 100+ classes with progression tables, archetypes, class options
- Plus: deities, domains, skills, traits, rules, and more
Medallion Pipeline (bronze → silver → marts):
HTML Files (1.1GB)
↓
[Python Parsers] → parsing_results.json (3.3GB)
↓
[load_bronze.py] → Bronze Schema (116 tables, all TEXT)
↓
[dbt-duckdb] → Silver Schema (142 staging/intermediate tables)
↓
[dbt-duckdb] → Marts Schema (158 final tables + 34 junctions)
↓
[export_to_lua.py] → 147 Lua data files
See ARCHITECTURE.md for technical details.
- Python 3.9+
- uv (recommended) or Poetry
- DuckDB 1.3.2+
# 1. Parse HTML files → parsing_results.json (one-time, slow)
uv run python -m src.pipeline.structural_parser_pipeline
# 2. Load bronze layer (fast)
uv run python load_bronze.py -r outputs/structural_parsing/parsing_results.json
# 3. Build dbt models (silver + marts schemas)
cd dbt_pf1e && dbt build
# 4. Export to Lua (optional)
uv run python export_to_lua.py --output ~/pathfinder-api/rules/data/aonprd/The final database is pathfinder1e.duckdb (591MB).
- Bronze layer stores ALL fields as TEXT with no transformation
- Full HTML structure preserved in coverage tracking
- Pickle cache speeds up repeated loads (2min → 15sec)
- 34 junction tables for many-to-many relationships
- Spell components decomposed to boolean flags + text
- Item auras capture ALL schools (not just first)
- Creature stats unified across monsters/NPCs/mythic
- dbt models with full lineage tracking
- Seed CSVs for data quality fixes
- Custom DuckDB UDFs for complex parsing
- Materialized tables for fast queries
aonprd-parse/
├── load_bronze.py # Bronze layer ingestion
├── export_to_lua.py # Marts → Lua export
├── dbt_pf1e/ # dbt project (300 models)
│ ├── models/
│ │ ├── staging/ # 29 stg_* views (silver)
│ │ ├── intermediate/ # 142 int_* tables (silver)
│ │ └── marts/ # 158 final tables (marts)
│ ├── seeds/ # 6 CSV data quality fixes
│ ├── macros/ # Custom dbt macros
│ └── plugins/pf1e_udfs.py # DuckDB UDFs
├── src/pipeline/ # Python HTML parsers
│ ├── integration/ # field_parsers, normalizers, table_configs
│ ├── semantic/ # Entity assemblers
│ ├── parsers/ # Schema-driven parsers
│ └── components/ # Content extractors
└── data/html/ # Source HTML files (1.1GB, gitignored)
Raw text fields extracted from HTML. Examples:
bronze.spells- All spell fields as TEXTbronze.monsters- All monster fields as TEXTbronze.feats- All feat fields as TEXT
Staging + intermediate transformations. Examples:
silver.stg_spells- Type-cast, cleaned spellssilver.int_spell_class_levels- Parsed spell/class relationshipssilver.int_creature_feats- Normalized creature feats
Final denormalized tables. Examples:
marts.creatures(3,659 rows) - Unified monsters/NPCs/mythicmarts.spells(7,000+ rows) - Complete spell datamarts.spell_class_levels- Which classes get which spells at what levelmarts.item_aura_schools- All magic schools for each item aura
# Visualize what the parser extracts
uv run python demo_parser.py data/html/SpellDisplay.aspx?ItemName=Fireball.html
# Compare parsing strategies
uv run python explore_parser.pyuv run pytest
uv run pytest --cov=src tests/# Rebuild only changed dbt models
cd dbt_pf1e && dbt build --select state:modified+
# Rebuild specific marts
cd dbt_pf1e && dbt build --select marts.creatures+- 100% FK resolution for class skills, prestige requirements
- 99.93% FK resolution for items
- 5NF normalization - no duplicate data
- Canonical IDs from URL parameters (most authoritative source)
- Seed CSVs for manual overrides (feat_name_fixes, spell_aliases, etc.)
MIT
Data sourced from Archives of Nethys (d20pfsrd.com).