# IWTC Raw Source Indexing

This notebook executes the raw source indexing workflow defined in:

- `docs/raw_source_indexing_design.md`

It is intended for hands-on execution and experimentation. Conceptual scope, responsibilities, and workflow design are defined in the linked design document.


## Parameters

This notebook operates on a single world repository.

You must provide the full path to a `world_repository.yml` descriptor file.
All paths declared in that file are resolved relative to the `world_root`
defined within it.

A minimal example of `world_repository.yml` is provided in this repository
under:

- `data/config_examples/world_repository.yml`

You may copy and adapt that example for your own world repository.

Source selection (which files to index) is provided separately and is not
encoded in the descriptor.
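
As a rough illustration of the resolution rule above, the sketch below joins relative entries onto `world_root` and leaves absolute entries alone. The descriptor is written as a Python dict standing in for the parsed YAML. The key names mirror the entries this notebook checks for, but the paths, the `resolve_entry` helper, and the `example_*` names are hypothetical. `PurePosixPath` is used only so the example prints the same on any OS; the notebook cells themselves use `Path`.

```python
from pathlib import PurePosixPath

# Hypothetical descriptor contents, shown as the dict a YAML parser would return.
descriptor = {
    "world_root": "/worlds/example_world",
    "sources": {"read_paths": ["_local/session_notes", "/mnt/archive/old_notes"]},
    "working_drafts": {"path": "_local/machine_wip"},
}


def resolve_entry(world_root: str, entry: str) -> PurePosixPath:
    """Return entry unchanged if it is absolute, otherwise join it onto world_root."""
    path = PurePosixPath(entry)
    return path if path.is_absolute() else PurePosixPath(world_root) / path


root = descriptor["world_root"]
example_sources = [resolve_entry(root, e) for e in descriptor["sources"]["read_paths"]]
example_drafts = resolve_entry(root, descriptor["working_drafts"]["path"])

for p in example_sources:
    print(p)
print(example_drafts)
```

The relative entries end up under `world_root`, while the absolute entry passes through untouched.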


In [11]:
# ------------------------------------------------------------
# Parameters for Raw Source Indexing notebook
# ------------------------------------------------------------

# Absolute path to the world_repository.yml descriptor.
WORLD_REPOSITORY_DESCRIPTOR = (
    "/Users/charissophia/obsidian/Iron Wolf Trading Company/_meta/descriptors/world_repository.yml"
)

# Optional: explicit source file specification.
# If None, the notebook will list candidate files under
# sources.read_paths and require human selection before proceeding.
SOURCE_FILE_SPEC = None

# ------------------------------------------------------------
# Internal run metadata (do not edit)
# ------------------------------------------------------------
from datetime import datetime
print(f"Notebook run initialized at: {datetime.now().strftime('%Y-%m-%d %H:%M')}")


Notebook run initialized at: 2025-12-29 12:47


## Load world repository descriptor

This section loads the world repository descriptor (usually `world_repository.yml`)
and normalizes the paths it contains so downstream steps can rely on a consistent,
absolute-path view of the world.

In [14]:
# Load world repository descriptor and verify required entries for this notebook

from pathlib import Path
import yaml

errors = []

descriptor_path = Path(WORLD_REPOSITORY_DESCRIPTOR)

# ---- load descriptor file ----
if not descriptor_path.exists():
    raise FileNotFoundError(
        f"Descriptor file not found: {descriptor_path}"
    )

try:
    with descriptor_path.open("r", encoding="utf-8") as f:
        world_repo = yaml.safe_load(f)
except Exception as e:
    # Chain the original exception so the underlying parse/read error stays visible.
    raise ValueError(
        f"Failed to load descriptor '{descriptor_path.name}' ({type(e).__name__}: {e}).\n"
        "Compare this file to the example world_repository.yml provided with the tools,\n"
        "fix the file if needed, and rerun this cell."
    ) from e

if not isinstance(world_repo, dict):
    raise ValueError(
        f"Descriptor '{descriptor_path.name}' is not usable as a world repository descriptor.\n"
        f"Compare this file to the example world_repository.yml provided with the tools,\n"
        f"fix the file if needed, and rerun this cell."
    )

print(f"Loaded descriptor: {descriptor_path.name}")

# ---- verify required entries ----
if "world_root" not in world_repo:
    errors.append("world_root")

sources = world_repo.get("sources")
if not isinstance(sources, dict) or "read_paths" not in sources:
    errors.append("sources.read_paths")

working_drafts = world_repo.get("working_drafts")
if not isinstance(working_drafts, dict) or "path" not in working_drafts:
    errors.append("working_drafts.path")

if errors:
    raise ValueError(
        f"Descriptor '{descriptor_path.name}' is missing required entries:\n- "
        + "\n- ".join(errors)
        + "\n\nFix the descriptor file and rerun this cell."
    )

# ---- extract raw values (no filesystem validation yet) ----
world_root_raw = world_repo["world_root"]
source_entries_raw = world_repo["sources"]["read_paths"]
working_drafts_raw = world_repo["working_drafts"]["path"]

print("Required descriptor entries found:")
print(f"- world_root: {world_root_raw!r}")
print(f"- sources.read_paths: {source_entries_raw!r}")
print(f"- working_drafts.path: {working_drafts_raw!r}")


Loaded descriptor: world_repository.yml
Required descriptor entries found:
- world_root: '/Users/charissophia/obsidian/Iron Wolf Trading Company'
- sources.read_paths: ['_local/auto_transcripts', '_local/pbp_transcripts', '_local/planning_notes', '_local/recollections', '_local/session_notes']
- working_drafts.path: '_local/machine_wip'


In [17]:
# Validate that key values are usable

from pathlib import Path

errors = []
notes = []

# -----------------------
# Validate world_root
# -----------------------
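# world_root_raw was extracted in the prior cell; guard against non-string YAML values.
world_root = Path(world_root_raw) if isinstance(world_root_raw, str) else None

if world_root is None:
    errors.append(f"world_root: must be a single path string, got: {world_root_raw!r}")
elif str(world_root).startswith("~"):
    errors.append("world_root: '~' is not allowed. Use a full absolute path.")
elif not world_root.is_absolute():
    errors.append("world_root: must be an absolute path starting with / (macOS/Linux) or a drive like C:\\ (Windows). See the example descriptor.")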
elif not world_root.exists():
    errors.append(f"world_root: does not exist: {world_root}")
elif not world_root.is_dir():
    errors.append(f"world_root: is not a directory: {world_root}")
else:
    world_root = world_root.resolve()
    notes.append(f"world_root OK: {world_root}")

# Stop early if world_root is not valid, because we can't resolve relative paths safely.
if errors:
    raise ValueError("Descriptor path validation failed:\n- " + "\n- ".join(errors))

# -----------------------
# Validate sources.read_paths
# -----------------------
resolved_sources = []

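if not isinstance(source_entries_raw, list):
    errors.append("sources.read_paths must be a bulleted list (each line starts with '-') of file/folder paths.")
elif not source_entries_raw:
    errors.append("sources.read_paths is empty. List at least one file or folder path.")
else: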
    for i, entry in enumerate(source_entries_raw):
        try:
            p = Path(entry)

            if str(p).startswith("~"):
                errors.append(f"sources.read_paths[{i}]: '~' is not allowed: {entry!r}")
                continue

            if not p.is_absolute():
                p = world_root / p

            if not p.exists():
                errors.append(f"sources.read_paths[{i}]: does not exist: {p}")
                continue

            resolved_sources.append(p.resolve())

        except Exception as e:
            errors.append(f"sources.read_paths[{i}]: invalid path value {entry!r} ({type(e).__name__}: {e})")

if not errors and resolved_sources:
    notes.append(f"sources.read_paths OK: {len(resolved_sources)} source paths confirmed")

# -----------------------
# Validate working_drafts.path
# -----------------------
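if not isinstance(working_drafts_raw, str):
    errors.append(f"working_drafts.path must be a single path string, got: {working_drafts_raw!r}")
elif working_drafts_raw.startswith("~"):
    errors.append("working_drafts.path: '~' is not allowed.")
else:
    drafts_path = Path(working_drafts_raw)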
    if not drafts_path.is_absolute():
        drafts_path = world_root / drafts_path

    if drafts_path.exists():
        if not drafts_path.is_dir():
            errors.append(f"working_drafts.path must be a directory because this process may generate multiple files. {drafts_path}")
        else:
            drafts_path = drafts_path.resolve()
            notes.append(f"working_drafts.path exists: {drafts_path}")
    else:
        parent = drafts_path.parent
        if not parent.exists():
            notes.append(f"working_drafts.path does not exist and parent is missing: {drafts_path}")
            notes.append("Action: create the parent directories (human decision) or update working_drafts.path.")
        elif not parent.is_dir():
            errors.append(f"working_drafts.path parent exists but is not a directory: {parent}")
        else:
            # Probe whether parent appears writable (without creating drafts_path)
            try:
                probe = parent / ".iwtc_tools_write_probe.tmp"
                probe.write_text("test", encoding="utf-8")
                probe.unlink()
                notes.append(f"working_drafts.path does not exist but appears creatable: {drafts_path}")
                notes.append("Action: create this directory now (human decision) or proceed and create later.")
            except Exception:
                notes.append(f"working_drafts.path does not exist and may not be creatable (parent not writable): {drafts_path}")
                notes.append("Action: choose a different working_drafts.path or adjust permissions.")

# -----------------------
# Report or raise
# -----------------------
if errors:
    raise ValueError(
        "Descriptor path validation failed:\n- "
        + "\n- ".join(errors)
        + f"\n\nFix the descriptor entries (starting from world_root: {world_root_raw!r}), "
          "rerun the descriptor load cell, and then rerun this cell."
    )

ENV = {
    "world_root": world_root,
    "sources": resolved_sources,           # mixed files/dirs; classification later
    "working_drafts_path": drafts_path,    # may not exist yet
}

print("Descriptor paths are usable for this notebook.")
print(f"world_root: {ENV['world_root']}")
print("sources.read_paths (resolved):")
for p in ENV["sources"]:
    print(f" - {p}")
print(f"working_drafts.path: {ENV['working_drafts_path']}")
print("")
print("Notes:")
for n in notes:
    print(f" - {n}")


Descriptor paths are usable for this notebook.
world_root: /Users/charissophia/obsidian/Iron Wolf Trading Company
sources.read_paths (resolved):
 - /Users/charissophia/obsidian/Iron Wolf Trading Company/_local/auto_transcripts
 - /Users/charissophia/obsidian/Iron Wolf Trading Company/_local/pbp_transcripts
 - /Users/charissophia/obsidian/Iron Wolf Trading Company/_local/planning_notes
 - /Users/charissophia/obsidian/Iron Wolf Trading Company/_local/recollections
 - /Users/charissophia/obsidian/Iron Wolf Trading Company/_local/session_notes
working_drafts.path: /Users/charissophia/obsidian/Iron Wolf Trading Company/_local/machine_wip

Notes:
 - world_root OK: /Users/charissophia/obsidian/Iron Wolf Trading Company
 - sources.read_paths OK: 5 source paths confirmed
 - working_drafts.path exists: /Users/charissophia/obsidian/Iron Wolf Trading Company/_local/machine_wip


## Discover sources
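
One possible shape for this step, assuming discovery means walking each resolved entry in `ENV["sources"]` and collecting regular files for a human to choose from (as the parameter comments describe for `SOURCE_FILE_SPEC = None`). The `list_candidate_files` helper, the hidden-file filter, and the throwaway demo directory are all hypothetical; the design document remains the authority on the actual behavior.

```python
import tempfile
from pathlib import Path


def list_candidate_files(sources):
    """Collect regular files from a mixed list of file and directory paths."""
    candidates = []
    for src in sources:
        src = Path(src)
        if src.is_file():
            candidates.append(src)
        elif src.is_dir():
            # Recurse into the directory; skip hidden files and anything
            # inside hidden folders (an assumption, not a documented rule).
            for p in src.rglob("*"):
                rel_parts = p.relative_to(src).parts
                if p.is_file() and not any(part.startswith(".") for part in rel_parts):
                    candidates.append(p)
    # De-duplicate and sort so the numbered listing is stable between runs.
    return sorted(set(candidates))


# Demo against a temporary directory rather than a real world repository.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "session_notes"
    demo_dir.mkdir()
    (demo_dir / "session_01.md").write_text("notes", encoding="utf-8")
    (demo_dir / ".DS_Store").write_text("", encoding="utf-8")
    demo_names = [p.name for p in list_candidate_files([demo_dir])]
    for i, name in enumerate(demo_names):
        print(f"[{i}] {name}")
```

In the notebook itself, the call would presumably be `list_candidate_files(ENV["sources"])`, with the numbered listing serving as the basis for human selection.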

## Select inputs

## Normalize inputs

## Vocabulary proposal (optional)

## Generate draft index

## Emit outputs