# IWTC SRD Import (Design Execution)

This notebook imports D&D 5e SRD data from the open-source `5e-bits/5e-database` dataset into IWTC-Lab, using transparent, reproducible steps. It serves as the model for importing any future licensed or third-party datasets.

## 0. Overview and Goals
- Import SRD 5e (2014 + 2024) content from [5e-bits/5e-database](https://github.com/5e-bits/5e-database).  
- Copy only the required JSON files into `data/srd/`.  
- Record provenance (commit SHA, license, timestamp).  
- Keep IWTC-Lab independent of any nested Git repos.  
- Apply this pattern later for licensed and homebrew data.

## 1. Configuration
Define all key paths and parameters for the SRD import.
- The upstream repo (`5e-bits/5e-database`)
- Which folders to pull (`src/2014`, `src/2024`)
- Which JSON files to copy (or “copy all”)
- The destination structure under `data/`
- License files for both OGL and MIT sources


In [17]:
from pathlib import Path

# --- Project Root ---
PROJECT_ROOT = Path.home() / "iwtc-lab"

# --- Upstream Repository ---
REPO_URL = "https://github.com/5e-bits/5e-database.git"

# --- Sparse Paths (folders to fetch) ---
SPARSE_PATHS = ["src/2014", "src/2024"]

# --- Resources to Copy ---
# Each key is matched against upstream filenames in Step 4.
# Set this to None to copy all .json files in each year.
RESOURCE_KEYS = ["monsters", "spells", "equipment"]

# --- Temporary and Destination Paths ---
TEMP_CLONE = PROJECT_ROOT / "data" / "srd_raw" / "_tmp_sparse_5e_db"
DEST_2014 = PROJECT_ROOT / "data" / "srd" / "2014"
DEST_2024 = PROJECT_ROOT / "data" / "srd" / "2024"

# --- License References ---
OGL_LICENSE = PROJECT_ROOT / "LICENSE_OGL.html"
MIT_LICENSE = PROJECT_ROOT / "LICENSE_5e-database.md"  # downloaded from repo

# --- Metadata Path ---
META_PATH = PROJECT_ROOT / "data" / "srd" / "_meta.yaml"

# --- Print summary ---
print("=== IWTC SRD Import Configuration ===")
print(f"Project root:   {PROJECT_ROOT}")
print(f"Repo URL:       {REPO_URL}")
print(f"Sparse paths:   {SPARSE_PATHS}")
print(f"Resources:      {RESOURCE_KEYS if RESOURCE_KEYS else 'ALL .json'}")
print()
print(f"Temporary clone: {TEMP_CLONE}")
print(f"Dest (2014):     {DEST_2014}")
print(f"Dest (2024):     {DEST_2024}")
print(f"Meta path:       {META_PATH}")
print()
print("License references:")
print(f" - OGL License:  {OGL_LICENSE}")
print(f" - MIT License:  {MIT_LICENSE}")
print("=====================================")


=== IWTC SRD Import Configuration ===
Project root:   /Users/charissophia/iwtc-lab
Repo URL:       https://github.com/5e-bits/5e-database.git
Sparse paths:   ['src/2014', 'src/2024']
Resources:      ['monsters', 'spells', 'equipment']

Temporary clone: /Users/charissophia/iwtc-lab/data/srd_raw/_tmp_sparse_5e_db
Dest (2014):     /Users/charissophia/iwtc-lab/data/srd/2014
Dest (2024):     /Users/charissophia/iwtc-lab/data/srd/2024
Meta path:       /Users/charissophia/iwtc-lab/data/srd/_meta.yaml

License references:
 - OGL License:  /Users/charissophia/iwtc-lab/LICENSE_OGL.html
 - MIT License:  /Users/charissophia/iwtc-lab/LICENSE_5e-database.md


## 2. Environment Validation
Verify that the environment and directories are ready for import.

This step ensures:
- `git` is installed and accessible.
- Core project directories exist or are created.
- No conflicting temporary clone remains.


In [18]:
import shutil, subprocess, sys

print("=== Environment Validation ===")

# --- 1. Check for Git ---
git_path = shutil.which("git")
if not git_path:
    raise SystemExit("❌ Git not found on PATH. Please install Git before proceeding.")
print(f"✅ Git found: {git_path}")

# --- 2. Check Python version ---
print(f"✅ Python version: {sys.version.split()[0]}")

# --- 3. Validate directories ---
for path in [META_PATH.parent, DEST_2014, DEST_2024, TEMP_CLONE.parent]:
    if not path.exists():
        path.mkdir(parents=True, exist_ok=True)
        print(f"📁 Created directory: {path}")
    else:
        print(f"📂 Directory exists: {path}")

# --- 4. Check for leftover temporary clone ---
if TEMP_CLONE.exists():
    print(f"⚠️  Removing existing temp clone: {TEMP_CLONE}")
    shutil.rmtree(TEMP_CLONE)
    print("🧹  Cleaned up old temporary clone.")
else:
    print("✅ No existing temp clone found.")

print("✅ Environment ready.")
print("===============================")


=== Environment Validation ===
✅ Git found: /usr/local/bin/git
✅ Python version: 3.11.14
📂 Directory exists: /Users/charissophia/iwtc-lab/data/srd
📂 Directory exists: /Users/charissophia/iwtc-lab/data/srd/2014
📂 Directory exists: /Users/charissophia/iwtc-lab/data/srd/2024
📂 Directory exists: /Users/charissophia/iwtc-lab/data/srd_raw
⚠️  Removing existing temp clone: /Users/charissophia/iwtc-lab/data/srd_raw/_tmp_sparse_5e_db
🧹  Cleaned up old temporary clone.
✅ Environment ready.


## 3. Sparse-Checkout Clone
Fetch a minimal subset of the upstream repo into a temporary workspace.
- Creates a temp clone using `git` with sparse-checkout for `src/2014` and `src/2024`.
- Captures the upstream commit SHA.
- Copies the repo’s MIT license (`LICENSE.md`) to the project root as `LICENSE_5e-database.md`.
- Ensures the OGL license (`LICENSE_OGL.html`) exists; downloads it if missing.


In [19]:
import shutil, subprocess, sys, urllib.request
from pathlib import Path
from datetime import datetime

def run(cmd, cwd=None):
    """Echo and run a command; return its stripped stdout, or raise RuntimeError on failure."""
    print("$", " ".join(cmd))
    res = subprocess.run(cmd, cwd=cwd, capture_output=True, text=True)
    if res.returncode != 0:
        print(res.stdout)
        print(res.stderr)
        raise RuntimeError(f"Command failed: {' '.join(cmd)}")
    return res.stdout.strip()

print("=== Sparse-Checkout Clone ===")

# 1) Fresh temp clone
if TEMP_CLONE.exists():
    print(f"Removing existing temp clone: {TEMP_CLONE}")
    shutil.rmtree(TEMP_CLONE)

run(["git", "clone", "--filter=blob:none", "--no-checkout", REPO_URL, str(TEMP_CLONE)])
run(["git", "sparse-checkout", "init", "--cone"], cwd=TEMP_CLONE)
run(["git", "sparse-checkout", "set", *SPARSE_PATHS], cwd=TEMP_CLONE)
run(["git", "checkout"], cwd=TEMP_CLONE)

UPSTREAM_SHA = run(["git", "rev-parse", "HEAD"], cwd=TEMP_CLONE)
print("Upstream commit:", UPSTREAM_SHA)

# 2) Repo MIT license -> project root
repo_mit = TEMP_CLONE / "LICENSE.md"
if repo_mit.exists():
    MIT_LICENSE.write_text(repo_mit.read_text(encoding="utf-8"), encoding="utf-8")
    print(f"Copied MIT license to: {MIT_LICENSE}")
else:
    print("Warning: MIT license file not found in upstream clone (expected 'LICENSE.md').")

# 3) OGL license -> project root (download if missing)
if not OGL_LICENSE.exists():
    ogl_url = "https://opengamingfoundation.org/ogl.html"
    # OGL_LICENSE already points at an .html file, so the page is saved there directly.
    # If the file is already present, this block is skipped.
    print(f"OGL license not found; downloading HTML copy from: {ogl_url}")
    urllib.request.urlretrieve(ogl_url, str(OGL_LICENSE))
    print(f"Saved OGL HTML to: {OGL_LICENSE}")
else:
    print(f"OGL license already present: {OGL_LICENSE}")

print("✅ Sparse-checkout and license placement complete.")
print("===============================")


=== Sparse-Checkout Clone ===
$ git clone --filter=blob:none --no-checkout https://github.com/5e-bits/5e-database.git /Users/charissophia/iwtc-lab/data/srd_raw/_tmp_sparse_5e_db
$ git sparse-checkout init --cone
$ git sparse-checkout set src/2014 src/2024
$ git checkout
$ git rev-parse HEAD
Upstream commit: 6563534a017ae7c1b64a3d2cda35d4c79c8bfdb5
Copied MIT license to: /Users/charissophia/iwtc-lab/LICENSE_5e-database.md
OGL license already present: /Users/charissophia/iwtc-lab/LICENSE_OGL.html
✅ Sparse-checkout and license placement complete.


## 4. Data Extraction
Copy the selected SRD JSON files from the temporary sparse-checkout into the project:

- Source: `TEMP_CLONE/src/<year>/`
- Destinations: `data/srd/2014/` and `data/srd/2024/`
- If `RESOURCE_KEYS` is `None`, copy **all** `*.json` in each year folder.
- Produce a summary dictionary (`COPIED_SUMMARY`) for Step 5 (provenance).

Selection policy (per year, per key):
1) Prefer filenames whose **stem tokens** (split on `[-_.]`) include the key (case-insensitive).
2) Otherwise, allow **substring** match in the stem.
3) If multiple candidates remain, pick the one with the **shortest stem**; if still tied, pick **lexicographically smallest**.
4) Resolve each key to exactly **one** file. On 0 matches, print a warning and skip that key; when several files match, rule 3 always selects a single winner.
5) Record the resolved file plus any alternates considered.
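
The selection policy above can be exercised on a handful of filenames before running it against the real clone. The following is a minimal sketch, not the notebook's own cell: the helper name `resolve` and the sample filenames are assumptions chosen only to trigger each rule (token match, substring fallback, and no match).

```python
import re
from pathlib import Path

def stem_tokens(p: Path) -> list[str]:
    # Lowercase the stem and split it on '-', '_' and '.'
    return re.split(r"[-_.]+", p.stem.lower())

def resolve(candidates: list[Path], key: str):
    """Hypothetical helper mirroring rules 1-3 of the policy."""
    key_l = key.lower()
    # Rule 1: the key equals one of the stem tokens
    pool = [p for p in candidates if key_l in stem_tokens(p)]
    # Rule 2: otherwise fall back to a substring match on the stem
    if not pool:
        pool = [p for p in candidates if key_l in p.stem.lower()]
    if not pool:
        return None, []
    # Rule 3: shortest stem first, then lexicographically smallest name
    winner = min(pool, key=lambda p: (len(p.stem), p.name.lower()))
    return winner, [p.name for p in pool if p != winner]

# Illustrative filenames only; the real list comes from the upstream clone
files = [Path(n) for n in (
    "5e-SRD-Equipment.json",
    "5e-SRD-Equipment-Categories.json",
    "5e-SRD-Magic-Items.json",
    "5e-SRD-Subclasses.json",
)]

for key in ("equipment", "classes", "monsters"):
    winner, alternates = resolve(files, key)
    print(key, "->", winner.name if winner else None, "| alternates:", alternates)
```

Here `equipment` matches two files by token and the shorter stem wins, `classes` matches nothing by token and falls back to the substring inside `Subclasses`, and `monsters` matches nothing at all. The substring fallback is convenient but can select a file the key was not meant for, which is why the alternates are worth recording.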



In [21]:
import shutil
from pathlib import Path

print("=== Data Extraction (key-based resolution) ===")

def stem_tokens(p: Path):
    import re
    return re.split(r"[-_.]+", p.stem.lower())

def resolve_for_key(candidates, key):
    """
    Apply the policy:
    1) token-equal match (case-insensitive)
    2) else substring match (case-insensitive)
    3) choose shortest stem, then lexicographically smallest
    Returns (winner, alternates) or (None, []) if no matches.
    """
    key_l = key.lower()
    token_matches = [p for p in candidates if key_l in stem_tokens(p)]
    pool = token_matches if token_matches else [p for p in candidates if key_l in p.stem.lower()]
    if not pool:
        return None, []
    def ranking(p: Path):  # shortest stem, then lexicographic by name
        return (len(p.stem), p.name.lower())
    winner = sorted(pool, key=ranking)[0]
    return winner, [c.name for c in pool if c != winner]


years = [("2014", DEST_2014), ("2024", DEST_2024)]
COPIED_SUMMARY = {"2014": [], "2024": []}

for year, dest in years:
    src_dir = TEMP_CLONE / "src" / year
    if not src_dir.exists():
        print(f"⚠️  Upstream year path missing, skipping: {src_dir}")
        continue

    dest.mkdir(parents=True, exist_ok=True)
    available = sorted(src_dir.glob("*.json"))

    if not available:
        print(f"ℹ️  No JSON files found in {src_dir}")
        continue

    if 'RESOURCE_KEYS' not in globals() or RESOURCE_KEYS is None:
        # Copy all JSON files
        for src in available:
            out = dest / src.name
            shutil.copy2(src, out)
            COPIED_SUMMARY[year].append({"key": None, "file": src.name, "alternates": []})
            print(f"📄 {year}: {src.name}  →  {out.relative_to(PROJECT_ROOT)}")
    else:
        # Resolve per key using the policy
        for key in RESOURCE_KEYS:
            winner, alternates = resolve_for_key(available, key)
            if not winner:
                print(f"⚠️  {year}: No match found for key '{key}'; skipped.")
                continue
            out = dest / winner.name
            shutil.copy2(winner, out)
            COPIED_SUMMARY[year].append({"key": key, "file": winner.name, "alternates": alternates})
            print(f"📄 {year} [{key}]: {winner.name}  →  {out.relative_to(PROJECT_ROOT)}")

# Totals & preview
print("\n=== Extraction Summary ===")
total = 0
for y in ("2014", "2024"):
    items = COPIED_SUMMARY.get(y, [])
    total += len(items)
    pretty = [f"{i['key'] or '*'}:{i['file']}" for i in items]
    print(f"{y}: {len(items)} → {pretty}")
print(f"TOTAL copied: {total}")
print("==========================")


=== Data Extraction (key-based resolution) ===
📄 2014 [monsters]: 5e-SRD-Monsters.json  →  data/srd/2014/5e-SRD-Monsters.json
📄 2014 [spells]: 5e-SRD-Spells.json  →  data/srd/2014/5e-SRD-Spells.json
📄 2014 [equipment]: 5e-SRD-Equipment.json  →  data/srd/2014/5e-SRD-Equipment.json
⚠️  2024: No match found for key 'monsters'; skipped.
⚠️  2024: No match found for key 'spells'; skipped.
📄 2024 [equipment]: 5e-SRD-Equipment.json  →  data/srd/2024/5e-SRD-Equipment.json

=== Extraction Summary ===
2014: 3 → ['monsters:5e-SRD-Monsters.json', 'spells:5e-SRD-Spells.json', 'equipment:5e-SRD-Equipment.json']
2024: 1 → ['equipment:5e-SRD-Equipment.json']
TOTAL copied: 4


## 5. Provenance Record
Write metadata documenting the import event.  
- Create or update `data/srd/_meta.yaml`.  
- Include:
  - repo name and URL  
  - commit SHA  
  - sparse paths used  
  - import timestamp  
  - license files
  - exact files imported per year (and alternates considered but not chosen)

Note: This step **writes** the record. Step 7 will **read/display** it for verification only.


In [22]:
from datetime import datetime
from ruamel.yaml import YAML

print("=== Write Provenance (_meta.yaml) ===")

yaml = YAML()
yaml.default_flow_style = False

# Collect license paths that actually exist (stored relative to the project root).
# OGL_LICENSE already points at the .html copy, so no separate fallback check is needed.
license_paths = []
for lic in (OGL_LICENSE, MIT_LICENSE):
    if lic.exists():
        license_paths.append(str(lic.relative_to(PROJECT_ROOT)))

# Normalize copied summary for YAML (ensure stable ordering)
resources = {}
for year in ("2014", "2024"):
    items = COPIED_SUMMARY.get(year, [])
    # Each entry: {"key": key-or-None, "file": "filename.json", "alternates": [...]}
    # Sort by key then file for determinism
    items_sorted = sorted(
        items,
        key=lambda i: ((i["key"] or ""), i["file"])
    )
    resources[year] = items_sorted

meta = {
    "source_repos": [{
        "name": "5e-database",
        "url": REPO_URL,
        "sparse_paths": SPARSE_PATHS,
        "commit": UPSTREAM_SHA,   # defined in Step 3
    }],
    "imported_at": datetime.now().isoformat(timespec="seconds"),
    "license_upstream": "MIT (repo); SRD content under OGL v1.0a",
    "license_files": license_paths,  # relative to project root
    "resources": resources,          # detailed per-year list of resolved files
    "schema_version": "statblock.v1",
    "notes": "SRD JSON copied via sparse-checkout; homebrew remains YAML and editable.",
}

# Ensure parent exists and write
META_PATH.parent.mkdir(parents=True, exist_ok=True)
with open(META_PATH, "w", encoding="utf-8") as f:
    yaml.dump(meta, f)

print(f"✅ Wrote provenance: {META_PATH.relative_to(PROJECT_ROOT)}")
print("=======================================")


=== Write Provenance (_meta.yaml) ===
✅ Wrote provenance: data/srd/_meta.yaml


## 6. Cleanup
Remove the temporary sparse-checkout clone and verify no Git metadata remains inside `data/srd/`.


In [23]:
import shutil
from pathlib import Path

print("=== Cleanup ===")

# 1) Remove the temporary clone
if TEMP_CLONE.exists():
    shutil.rmtree(TEMP_CLONE)
    print(f"🧹 Removed temp clone: {TEMP_CLONE}")
else:
    print("✅ No temp clone present.")

# 2) Sanity check: ensure no nested Git metadata under data/srd
srd_root = PROJECT_ROOT / "data" / "srd"
git_dirs = list(srd_root.rglob(".git"))
if git_dirs:
    print("⚠️ Unexpected Git directories found under data/srd:")
    for g in git_dirs:
        print(" -", g)
else:
    print("✅ Verified: no .git directories under data/srd.")

print("✅ Cleanup complete.")
print("================")


=== Cleanup ===
🧹 Removed temp clone: /Users/charissophia/iwtc-lab/data/srd_raw/_tmp_sparse_5e_db
✅ Verified: no .git directories under data/srd.
✅ Cleanup complete.


## 7. Verification
Read and display `_meta.yaml`, confirm imported files exist on disk, and summarize counts per year.  

This step **does not** modify any files; it only verifies the import and provenance.

In [24]:
from pathlib import Path
from ruamel.yaml import YAML

print("=== Verification ===")

yaml = YAML()
if not META_PATH.exists():
    raise SystemExit(f"❌ Provenance file not found: {META_PATH}")

with open(META_PATH, "r", encoding="utf-8") as f:
    meta = yaml.load(f)

# Print concise provenance summary
print("Provenance:")
srcs = meta.get("source_repos", [])
for s in srcs:
    print(f" - Repo: {s.get('name')} | URL: {s.get('url')}")
    print(f"   Sparse: {s.get('sparse_paths')}")
    print(f"   Commit: {s.get('commit')}")
print(f"Imported at: {meta.get('imported_at')}")
print(f"License note: {meta.get('license_upstream')}")
print("License files:")
for lf in meta.get("license_files", []):
    p = PROJECT_ROOT / lf
    print(f"   - {lf}  {'✅' if p.exists() else '⛔ missing'}")

# Verify resources exist on disk and summarize
resources = meta.get("resources", {})
totals = 0
for year in ("2014", "2024"):
    year_dir = PROJECT_ROOT / "data" / "srd" / year
    entries = resources.get(year, [])
    print(f"\n{year} resources: {len(entries)}")
    for e in entries:
        fname = e["file"]
        fpath = year_dir / fname
        ok = fpath.exists()
        print(f"  - {e.get('key') or '*'}: {fname}  {'✅' if ok else '⛔ missing'}")
        if not ok:
            print(f"    → Expected at: {fpath}")
    totals += len(entries)

print(f"\nTOTAL files referenced: {totals}")
print("====================================")


=== Verification ===
Provenance:
 - Repo: 5e-database | URL: https://github.com/5e-bits/5e-database.git
   Sparse: ['src/2014', 'src/2024']
   Commit: 6563534a017ae7c1b64a3d2cda35d4c79c8bfdb5
Imported at: 2025-10-29T23:27:14
License note: MIT (repo); SRD content under OGL v1.0a
License files:
   - LICENSE_OGL.html  ✅
   - LICENSE_5e-database.md  ✅

2014 resources: 3
  - equipment: 5e-SRD-Equipment.json  ✅
  - monsters: 5e-SRD-Monsters.json  ✅
  - spells: 5e-SRD-Spells.json  ✅

2024 resources: 1
  - equipment: 5e-SRD-Equipment.json  ✅

TOTAL files referenced: 4
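
The verification step confirms only that each referenced file exists. A read-only parse check could be layered on top, as in the minimal sketch below. The function name `check_json_file` is hypothetical, and the expectation that each file is a JSON array of objects carrying an `index` field is an assumption about the upstream layout, not something this notebook verifies.

```python
import json
import tempfile
from pathlib import Path

def check_json_file(path: Path) -> dict:
    """Parse one file and report its basic structure without modifying it."""
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)  # raises json.JSONDecodeError on malformed input
    report = {
        "file": path.name,
        "top_level": type(data).__name__,
        "records": None,
        "missing_index": None,
    }
    if isinstance(data, list):
        report["records"] = len(data)
        # Assumption: every record should be an object with an "index" field
        report["missing_index"] = sum(
            1 for rec in data if not (isinstance(rec, dict) and "index" in rec)
        )
    return report

# Small demonstration on a throwaway file rather than the imported data
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.json"
    sample.write_text(
        json.dumps([{"index": "goblin", "name": "Goblin"}, {"name": "No Index"}]),
        encoding="utf-8",
    )
    print(check_json_file(sample))
```

In the notebook this would be called once per entry in `resources`, using the same `year_dir / fname` path that the existence check builds.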


## 8. Notes and Next Steps
**Purpose:** Record reflections and upcoming tasks.  
- Validate SRD JSON files using `lib/statblock_schema.py`.  
- Build a loader to normalize SRD JSON into `statblock.v1`.  
- Extend import pattern to other datasets (licensed or homebrew).  
- Optional: schedule periodic SRD updates.
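
For the periodic-update item, one low-cost approach is to compare the commit recorded in `_meta.yaml` with the remote's current `HEAD` before re-running the import. The sketch below assumes the standard `git ls-remote <url> HEAD` output of one `<sha><TAB>HEAD` line; the function names are hypothetical, and `upstream_head` needs network access.

```python
import subprocess

def parse_ls_remote_head(output: str) -> str:
    """Extract the commit SHA from the HEAD line of `git ls-remote` output."""
    for line in output.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1] == "HEAD":
            return parts[0]
    raise ValueError("No HEAD line found in ls-remote output")

def upstream_head(repo_url: str) -> str:
    """Ask the remote for its current HEAD commit without cloning (needs network)."""
    res = subprocess.run(
        ["git", "ls-remote", repo_url, "HEAD"],
        capture_output=True, text=True, check=True,
    )
    return parse_ls_remote_head(res.stdout)

def needs_update(recorded_sha: str, remote_sha: str) -> bool:
    """True when the recorded import commit differs from the remote HEAD."""
    return recorded_sha.strip().lower() != remote_sha.strip().lower()

# Offline demonstration using the commit recorded by this run
recorded = "6563534a017ae7c1b64a3d2cda35d4c79c8bfdb5"
sample_output = "6563534a017ae7c1b64a3d2cda35d4c79c8bfdb5\tHEAD\n"
print(needs_update(recorded, parse_ls_remote_head(sample_output)))
```

In practice `recorded` would be read from `meta["source_repos"][0]["commit"]` and compared against `upstream_head(REPO_URL)`; a `True` result means upstream has moved and the import can be re-run.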