Skip to content

Parsing aonprd 1e data into a memgraph, and perhaps more

Notifications You must be signed in to change notification settings

JoeLuker/aonprd-parse

Repository files navigation

aonprd-parse

Zero-loss Pathfinder 1st Edition data extraction and transformation pipeline. Parses all content from the Archives of Nethys (d20pfsrd.com) into a fully normalized DuckDB database with 416 tables across bronze, silver, and marts schemas.

Overview

This project extracts every piece of Pathfinder 1e data from HTML into structured, queryable format:

  • 3,659 creatures (monsters, NPCs, mythic creatures)
  • 7,000+ spells with full metadata
  • 2,000+ feats with parsed prerequisites
  • 10,000+ items with aura schools, construction requirements
  • 100+ classes with progression tables, archetypes, class options
  • Plus: deities, domains, skills, traits, rules, and more

Architecture

Medallion Pipeline (bronze → silver → marts):

HTML Files (1.1GB)
    ↓
[Python Parsers] → parsing_results.json (3.3GB)
    ↓
[load_bronze.py] → Bronze Schema (116 tables, all TEXT)
    ↓
[dbt-duckdb] → Silver Schema (142 staging/intermediate tables)
    ↓
[dbt-duckdb] → Marts Schema (158 final tables + 34 junctions)
    ↓
[export_to_lua.py] → 147 Lua data files

See ARCHITECTURE.md for technical details.

Quick Start

Prerequisites

  • Python 3.9+
  • uv (recommended) or Poetry
  • DuckDB 1.3.2+

Build the Database

# 1. Parse HTML files → parsing_results.json (one-time, slow)
uv run python -m src.pipeline.structural_parser_pipeline

# 2. Load bronze layer (fast)
uv run python load_bronze.py -r outputs/structural_parsing/parsing_results.json

# 3. Build dbt models (silver + marts schemas)
cd dbt_pf1e && dbt build

# 4. Export to Lua (optional)
uv run python export_to_lua.py --output ~/pathfinder-api/rules/data/aonprd/

The final database is pathfinder1e.duckdb (591MB).

Key Features

Zero Data Loss

  • Bronze layer stores ALL fields as TEXT with no transformation
  • Full HTML structure preserved in coverage tracking
  • Pickle cache speeds up repeated loads (2min → 15sec)

Fully Normalized (5NF)

  • 34 junction tables for many-to-many relationships
  • Spell components decomposed to boolean flags + text
  • Item auras capture ALL schools (not just first)
  • Creature stats unified across monsters/NPCs/mythic

Production-Ready SQL

  • dbt models with full lineage tracking
  • Seed CSVs for data quality fixes
  • Custom DuckDB UDFs for complex parsing
  • Materialized tables for fast queries

Project Structure

aonprd-parse/
├── load_bronze.py              # Bronze layer ingestion
├── export_to_lua.py            # Marts → Lua export
├── dbt_pf1e/                   # dbt project (300 models)
│   ├── models/
│   │   ├── staging/            # 29 stg_* views (silver)
│   │   ├── intermediate/       # 142 int_* tables (silver)
│   │   └── marts/              # 158 final tables (marts)
│   ├── seeds/                  # 6 CSV data quality fixes
│   ├── macros/                 # Custom dbt macros
│   └── plugins/pf1e_udfs.py    # DuckDB UDFs
├── src/pipeline/               # Python HTML parsers
│   ├── integration/            # field_parsers, normalizers, table_configs
│   ├── semantic/               # Entity assemblers
│   ├── parsers/                # Schema-driven parsers
│   └── components/             # Content extractors
└── data/html/                  # Source HTML files (1.1GB, gitignored)

Database Schema

Bronze (116 tables)

Raw text fields extracted from HTML. Examples:

  • bronze.spells - All spell fields as TEXT
  • bronze.monsters - All monster fields as TEXT
  • bronze.feats - All feat fields as TEXT

Silver (142 tables)

Staging + intermediate transformations. Examples:

  • silver.stg_spells - Type-cast, cleaned spells
  • silver.int_spell_class_levels - Parsed spell/class relationships
  • silver.int_creature_feats - Normalized creature feats

Marts (158 tables)

Final denormalized tables. Examples:

  • marts.creatures (3,659 rows) - Unified monsters/NPCs/mythic
  • marts.spells (7,000+ rows) - Complete spell data
  • marts.spell_class_levels - Which classes get which spells at what level
  • marts.item_aura_schools - All magic schools for each item aura

Development

Debug Parsers

# Visualize what the parser extracts
uv run python demo_parser.py data/html/SpellDisplay.aspx?ItemName=Fireball.html

# Compare parsing strategies
uv run python explore_parser.py

Run Tests

uv run pytest
uv run pytest --cov=src tests/

Rebuild Incrementally

# Rebuild only changed dbt models
cd dbt_pf1e && dbt build --select state:modified+

# Rebuild specific marts
cd dbt_pf1e && dbt build --select marts.creatures+

Data Quality

  • 100% FK resolution for class skills, prestige requirements
  • 99.93% FK resolution for items
  • 5NF normalization - no duplicate data
  • Canonical IDs from URL parameters (most authoritative source)
  • Seed CSVs for manual overrides (feat_name_fixes, spell_aliases, etc.)

License

MIT

Acknowledgments

Data sourced from Archives of Nethys (d20pfsrd.com).

About

Parsing aonprd 1e data into a memgraph, and perhaps more

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 3

  •  
  •  
  •  

Languages