# IWTC Raw Source Indexing

This notebook executes the raw source indexing workflow defined in:

- `docs/raw_source_indexing_design.md`

It is intended for hands-on execution and experimentation. Conceptual scope, responsibilities, and workflow design are defined in the linked design document.


## Parameters

This notebook operates on a single world repository.

You must provide the full path to a `world_repository.yml` descriptor file.
Relative paths declared in that file are resolved against the `world_root`
defined within it; absolute paths are used as given.

A minimal example of `world_repository.yml` is provided in this repository
under:

- `data/config_examples/world_repository.yml`

You may copy and adapt that example for your own world repository.
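
For orientation, the sketch below shows the descriptor entries this notebook checks for, written as the Python dictionary that `yaml.safe_load` would return. Only the key names come from this notebook; the path values are placeholders, and `missing_required_entries` is a hypothetical helper used for illustration, not part of the tools.

```python
# Hypothetical example: the shape of a loaded world_repository.yml.
# Path values are placeholders; only the key names come from this notebook.
example_descriptor = {
    "world_root": "/path/to/world",
    "sources": {"read_paths": ["_local/session_notes", "_local/planning_notes"]},
    "working_drafts": {"path": "_local/machine_wip"},
}

REQUIRED_ENTRIES = ("world_root", "sources.read_paths", "working_drafts.path")

def missing_required_entries(descriptor: dict) -> list[str]:
    """Return the dotted names of required entries absent from the descriptor."""
    missing = []
    for dotted in REQUIRED_ENTRIES:
        node = descriptor
        for key in dotted.split("."):
            # Walk one level per key; stop as soon as a level is missing.
            if not isinstance(node, dict) or key not in node:
                missing.append(dotted)
                break
            node = node[key]
    return missing

print(missing_required_entries(example_descriptor))        # []
print(missing_required_entries({"world_root": "/tmp/w"}))
# ['sources.read_paths', 'working_drafts.path']
```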


In [18]:
# ------------------------------------------------------------
# Parameters for Raw Source Indexing notebook
# ------------------------------------------------------------

# Absolute path to the world_repository.yml descriptor.
WORLD_REPOSITORY_DESCRIPTOR = (
    "/Users/charissophia/obsidian/Iron Wolf Trading Company/_meta/descriptors/world_repository.yml"
)

# Optional: explicit source file specification.
# - None: list candidates and require human selection
# - "ALL": process all discovered supported files
# - otherwise: IDs from the discovery listing, as a string like "1,3,5-7"
#   or a list like [1, 3, 5]
SOURCE_FILE_SPEC = None

# ------------------------------------------------------------
# Internal run metadata (do not edit)
# ------------------------------------------------------------
from datetime import datetime
print(f"Notebook run initialized at: {datetime.now().strftime('%Y-%m-%d %H:%M')}")


Notebook run initialized at: 2025-12-29 21:26


## Load world repository descriptor

This section loads the world repository descriptor (usually `world_repository.yml`)
and normalizes the paths it contains so downstream steps can rely on a consistent,
absolute-path view of the world.
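
As a rough illustration of the normalization rule, the sketch below joins relative entries onto `world_root` and leaves absolute entries alone. It uses `PurePosixPath` so that it never touches the filesystem; the function name `resolve_against_root` and the example paths are assumptions made for this sketch. The notebook cells themselves use `Path` and also check that each path exists.

```python
from pathlib import PurePosixPath

def resolve_against_root(world_root: str, entry: str) -> PurePosixPath:
    """Return entry unchanged if absolute, otherwise joined onto world_root."""
    if entry.startswith("~"):
        # Mirrors the notebook's rule that '~' shortcuts are rejected.
        raise ValueError(f"'~' is not allowed: {entry!r}")
    path = PurePosixPath(entry)
    return path if path.is_absolute() else PurePosixPath(world_root) / path

print(resolve_against_root("/worlds/iwtc", "_local/session_notes"))
# /worlds/iwtc/_local/session_notes
print(resolve_against_root("/worlds/iwtc", "/archive/notes"))
# /archive/notes
```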

In [14]:
# Load world repository descriptor and verify required entries for this notebook

from pathlib import Path
import yaml

errors = []

descriptor_path = Path(WORLD_REPOSITORY_DESCRIPTOR)

# ---- load descriptor file ----
if not descriptor_path.exists():
    raise FileNotFoundError(
        f"Descriptor file not found: {descriptor_path}"
    )

try:
    with descriptor_path.open("r", encoding="utf-8") as f:
        world_repo = yaml.safe_load(f)
except Exception as e:
    # Chain the original exception so the underlying parse error stays visible.
    raise ValueError(
        f"Failed to load descriptor '{descriptor_path.name}'.\n"
        "Compare this file to the example world_repository.yml provided with the tools,\n"
        "fix the file if needed, and rerun this cell."
    ) from e

if not isinstance(world_repo, dict):
    raise ValueError(
        f"Descriptor '{descriptor_path.name}' is not usable as a world repository descriptor.\n"
        f"Compare this file to the example world_repository.yml provided with the tools,\n"
        f"fix the file if needed, and rerun this cell."
    )

print(f"Loaded descriptor: {descriptor_path.name}")

# ---- verify required entries ----
if "world_root" not in world_repo:
    errors.append("world_root")

sources = world_repo.get("sources")
if not isinstance(sources, dict) or "read_paths" not in sources:
    errors.append("sources.read_paths")

working_drafts = world_repo.get("working_drafts")
if not isinstance(working_drafts, dict) or "path" not in working_drafts:
    errors.append("working_drafts.path")

if errors:
    raise ValueError(
        f"Descriptor '{descriptor_path.name}' is missing required entries:\n- "
        + "\n- ".join(errors)
        + "\n\nFix the descriptor file and rerun this cell."
    )

# ---- extract raw values (no filesystem validation yet) ----
world_root_raw = world_repo["world_root"]
source_entries_raw = world_repo["sources"]["read_paths"]
working_drafts_raw = world_repo["working_drafts"]["path"]

print("Required descriptor entries found:")
print(f"- world_root: {world_root_raw!r}")
print(f"- sources.read_paths: {source_entries_raw!r}")
print(f"- working_drafts.path: {working_drafts_raw!r}")


Loaded descriptor: world_repository.yml
Required descriptor entries found:
- world_root: '/Users/charissophia/obsidian/Iron Wolf Trading Company'
- sources.read_paths: ['_local/auto_transcripts', '_local/pbp_transcripts', '_local/planning_notes', '_local/recollections', '_local/session_notes']
- working_drafts.path: '_local/machine_wip'


In [17]:
# Validate key values are usable

from pathlib import Path

errors = []
notes = []

# -----------------------
# Validate world_root
# -----------------------
world_root = Path(world_root_raw) # extracted in prior cell

if str(world_root).startswith("~"):
    errors.append("world_root: '~' is not allowed. Use a full absolute path.")
elif not world_root.is_absolute():
    errors.append("world_root must start with / (macOS/Linux) or a drive like C:\\ (Windows). See the example descriptor.")
elif not world_root.exists():
    errors.append(f"world_root: does not exist: {world_root}")
elif not world_root.is_dir():
    errors.append(f"world_root: is not a directory: {world_root}")
else:
    world_root = world_root.resolve()
    notes.append(f"world_root OK: {world_root}")

# Stop early if world_root is not valid, because we can't resolve relative paths safely.
if errors:
    raise ValueError("Descriptor path validation failed:\n- " + "\n- ".join(errors))

# -----------------------
# Validate sources.read_paths
# -----------------------
resolved_sources = []

if not isinstance(source_entries_raw, list):
    errors.append("sources.read_paths must be a bulleted list (each line starts with '-') of file/folder paths.")
else:
    for i, entry in enumerate(source_entries_raw):
        try:
            p = Path(entry)

            if str(p).startswith("~"):
                errors.append(f"sources.read_paths[{i}]: '~' is not allowed: {entry!r}")
                continue

            if not p.is_absolute():
                p = world_root / p

            if not p.exists():
                errors.append(f"sources.read_paths[{i}]: does not exist: {p}")
                continue

            resolved_sources.append(p.resolve())

        except Exception as e:
            errors.append(f"sources.read_paths[{i}]: invalid path value {entry!r} ({type(e).__name__}: {e})")

if not errors and resolved_sources:
    notes.append(f"sources.read_paths OK: {len(resolved_sources)} source paths confirmed")

# -----------------------
# Validate working_drafts.path
# -----------------------
drafts_path = Path(working_drafts_raw)

if str(drafts_path).startswith("~"):
    errors.append("working_drafts.path: '~' is not allowed.")
else:
    if not drafts_path.is_absolute():
        drafts_path = world_root / drafts_path

    if drafts_path.exists():
        if not drafts_path.is_dir():
            errors.append(f"working_drafts.path exists but is not a directory (this process may write multiple files there): {drafts_path}")
        else:
            drafts_path = drafts_path.resolve()
            notes.append(f"working_drafts.path exists: {drafts_path}")
    else:
        parent = drafts_path.parent
        if not parent.exists():
            notes.append(f"working_drafts.path does not exist and parent is missing: {drafts_path}")
            notes.append("Action: create the parent directories (human decision) or update working_drafts.path.")
        elif not parent.is_dir():
            errors.append(f"working_drafts.path parent exists but is not a directory: {parent}")
        else:
            # Probe whether parent appears writable (without creating drafts_path)
            try:
                probe = parent / ".iwtc_tools_write_probe.tmp"
                probe.write_text("test", encoding="utf-8")
                probe.unlink()
                notes.append(f"working_drafts.path does not exist but appears creatable: {drafts_path}")
                notes.append("Action: create this directory now (human decision) or proceed and create later.")
            except Exception:
                notes.append(f"working_drafts.path does not exist and may not be creatable (parent not writable): {drafts_path}")
                notes.append("Action: choose a different working_drafts.path or adjust permissions.")

# -----------------------
# Report or raise
# -----------------------
if errors:
    raise ValueError(
        "Descriptor path validation failed:\n- "
        + "\n- ".join(errors)
        + f"\n\nFix the descriptor entries (starting from world_root: {world_root_raw!r}), "
          "rerun the descriptor load cell, and then rerun this cell."
    )

ENV = {
    "world_root": world_root,
    "sources": resolved_sources,           # mixed files/dirs; classification later
    "working_drafts_path": drafts_path,    # may not exist yet
}

print("Descriptor paths are usable for this notebook.")
print(f"world_root: {ENV['world_root']}")
print("sources.read_paths (resolved):")
for p in ENV["sources"]:
    print(f" - {p}")
print(f"working_drafts.path: {ENV['working_drafts_path']}")
print("")
print("Notes:")
for n in notes:
    print(f" - {n}")


Descriptor paths are usable for this notebook.
world_root: /Users/charissophia/obsidian/Iron Wolf Trading Company
sources.read_paths (resolved):
 - /Users/charissophia/obsidian/Iron Wolf Trading Company/_local/auto_transcripts
 - /Users/charissophia/obsidian/Iron Wolf Trading Company/_local/pbp_transcripts
 - /Users/charissophia/obsidian/Iron Wolf Trading Company/_local/planning_notes
 - /Users/charissophia/obsidian/Iron Wolf Trading Company/_local/recollections
 - /Users/charissophia/obsidian/Iron Wolf Trading Company/_local/session_notes
working_drafts.path: /Users/charissophia/obsidian/Iron Wolf Trading Company/_local/machine_wip

Notes:
 - world_root OK: /Users/charissophia/obsidian/Iron Wolf Trading Company
 - sources.read_paths OK: 5 source paths confirmed
 - working_drafts.path exists: /Users/charissophia/obsidian/Iron Wolf Trading Company/_local/machine_wip


## Discover sources

This step scans the descriptor-defined `sources.read_paths` (files and directories) and lists candidate source files for this run.

Recognized file types: `.md`, `.docx`. Files of other types are skipped and do not appear in the listing.

Discovery behavior depends on `SOURCE_FILE_SPEC`, set in Parameters above.
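
The recursive scan can be sketched in isolation as follows. This is a minimal example under stated assumptions: it builds a throwaway directory with `tempfile`, and the names `SUPPORTED` and `discover` are hypothetical and chosen for this sketch. It shows that suffix matching is case-insensitive and that unsupported files are left out.

```python
import tempfile
from pathlib import Path

SUPPORTED = {".md", ".docx"}  # same suffixes as listed above

def discover(root: Path) -> list[Path]:
    """Recursively collect supported files under root, sorted by path."""
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED
    )

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "notes").mkdir()
    # Placeholder files: two supported, one unsupported.
    for name in ("a.md", "notes/b.DOCX", "notes/c.txt"):
        (root / name).write_text("placeholder", encoding="utf-8")
    found = [p.relative_to(root).as_posix() for p in discover(root)]

print(found)  # ['a.md', 'notes/b.DOCX']
```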


In [27]:
# Discover supported source files (recursive), grouped by directory, assign IDs,
# and (if needed) prompt the human to select inputs for this run.

from collections import defaultdict

SUPPORTED_SUFFIXES = {".md", ".docx"}

def parse_selection_expression(expr: str) -> list[int]:
    expr = expr.strip()
    if not expr:
        return []

    selected_ids = set()
    for part in expr.split(","):
        part = part.strip()
        if not part:
            continue

        if "-" in part:
            a, b = part.split("-", 1)
            a = int(a.strip())
            b = int(b.strip())
            if a > b:
                a, b = b, a
            selected_ids.update(range(a, b + 1))
        else:
            selected_ids.add(int(part))

    return sorted(selected_ids)

files_by_directory = defaultdict(list)

for source in ENV["sources"]:
    if source.is_file():
        if source.suffix.lower() in SUPPORTED_SUFFIXES:
            files_by_directory[source.parent].append(source.resolve())

    elif source.is_dir():
        for path in source.rglob("*"):
            if path.is_file() and path.suffix.lower() in SUPPORTED_SUFFIXES:
                files_by_directory[path.parent].append(path.resolve())

CANDIDATE_SOURCE_FILES = []

if not files_by_directory:
    print("No supported source files were discovered.")
    SELECTED_SOURCE_FILES = []
else:
    print("Discovered source files:\n")

    current_id = 1
    for directory in sorted(files_by_directory):
        print(directory)
        for file in sorted(files_by_directory[directory]):
            print(f"  {current_id:>3}. {file.name}")
            CANDIDATE_SOURCE_FILES.append(file)
            current_id += 1
        print()

    print(f"Total discovered files: {len(CANDIDATE_SOURCE_FILES)}\n")

    # Selection behavior
    if isinstance(SOURCE_FILE_SPEC, str) and SOURCE_FILE_SPEC.strip().upper() == "ALL":
        SELECTED_SOURCE_FILES = list(CANDIDATE_SOURCE_FILES)
        print(f'SOURCE_FILE_SPEC="ALL" -> selected all files: {len(SELECTED_SOURCE_FILES)}')

    elif SOURCE_FILE_SPEC is None:
        print("Enter IDs to select inputs for this run.")
        print("Example: 1,3,5-7 (or ALL)\n")
        selection_raw = input("Selection: ").strip()

        if selection_raw.upper() == "ALL":
            SELECTED_SOURCE_FILES = list(CANDIDATE_SOURCE_FILES)
        else:
            selected_ids = parse_selection_expression(selection_raw)
            if not selected_ids:
                raise ValueError("No selection provided.")

            SELECTED_SOURCE_FILES = []
            for sid in selected_ids:
                if sid < 1 or sid > len(CANDIDATE_SOURCE_FILES):
                    raise ValueError(f"Selection ID out of range: {sid}")
                SELECTED_SOURCE_FILES.append(CANDIDATE_SOURCE_FILES[sid - 1])

        print(f"Selected files: {len(SELECTED_SOURCE_FILES)}")

    else:
        # Defer non-interactive ID parsing to the next cell (keeps this cell simple)
        SELECTED_SOURCE_FILES = []
        print("SOURCE_FILE_SPEC provided (IDs). Run the next cell to apply it.")


Discovered source files:

/Users/charissophia/obsidian/Iron Wolf Trading Company/_local/pbp_transcripts
    1. Kavar notes.docx
    2. PbP10 - The Second Camp.md
    3. PbP11 - Lia and the Tolanites.md
    4. PbP12 - Meeting in the Vestry.md
    5. PbP13 - The Town Square Incident.md
    6. PbP14 - Recon.md
    7. PbP15 - Debrief and Safety.md
    8. PbP16 - Nightfall in Elysia.md
    9. PbP17 - Night Meetings.md

/Users/charissophia/obsidian/Iron Wolf Trading Company/_local/planning_notes
   10. Allip Encounter Notes.docx
   11. Elulind map descriptions.docx
   12. The Premature Pods Mystery.docx
   13. The Wolfstream Situation.docx
   14. current_Dhassa staged narration.docx
   15. current_IWTC names.docx
   16. current_IWTC planning notes.docx
   17. current_Kwalish.docx

/Users/charissophia/obsidian/Iron Wolf Trading Company/_local/session_notes
   18. IWTC session notes 1-50.docx
   19. IWTC session notes 51-100.docx
   20. current_IWTC session notes.docx

Total discovered files: 20

Selection:  1,10,18-20


Selected files: 5


## Select inputs

This step determines which discovered source files will be indexed in this notebook run.

Selection behavior depends on `SOURCE_FILE_SPEC` set in Parameters above.
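
To make the accepted forms concrete, here is a compact sketch of how a `SOURCE_FILE_SPEC`-style value could be turned into 1-based IDs. The function `ids_from_spec` is hypothetical and is not the notebook's implementation. It does not handle `None`, which is left to the interactive prompt, and it de-duplicates and sorts list input, which is an assumption of this sketch.

```python
def ids_from_spec(spec, total: int) -> list[int]:
    """Map a SOURCE_FILE_SPEC-style value to sorted, validated 1-based IDs."""
    if isinstance(spec, str) and spec.strip().upper() == "ALL":
        return list(range(1, total + 1))
    if isinstance(spec, str):
        ids = set()
        # Split on commas, ignore empty parts, expand "a-b" ranges.
        for part in filter(None, (p.strip() for p in spec.split(","))):
            lo, _, hi = part.partition("-")
            start, end = int(lo), int(hi or lo)
            ids.update(range(min(start, end), max(start, end) + 1))
    elif isinstance(spec, (list, tuple)):
        ids = {int(x) for x in spec}
    else:
        raise ValueError(f"Unsupported spec type: {type(spec).__name__}")
    out_of_range = sorted(i for i in ids if not 1 <= i <= total)
    if out_of_range:
        raise ValueError(f"IDs out of range (valid: 1-{total}): {out_of_range}")
    return sorted(ids)

print(ids_from_spec("1,10,18-20", 20))  # [1, 10, 18, 19, 20]
print(ids_from_spec("ALL", 3))          # [1, 2, 3]
print(ids_from_spec([3, 1], 5))         # [1, 3]
```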

In [28]:
# Apply SOURCE_FILE_SPEC when it specifies IDs (non-interactive)

if not CANDIDATE_SOURCE_FILES:
    raise ValueError("No candidate source files exist. Rerun the Discover sources cell first.")

if SELECTED_SOURCE_FILES:
    print("Source files selected in Discover sources cell.")
else:
    def parse_selection_expression(expr: str) -> list[int]:
        expr = expr.strip()
        if not expr:
            return []

        selected_ids = set()
        for part in expr.split(","):
            part = part.strip()
            if not part:
                continue

            if "-" in part:
                a, b = part.split("-", 1)
                a = int(a.strip())
                b = int(b.strip())
                if a > b:
                    a, b = b, a
                selected_ids.update(range(a, b + 1))
            else:
                selected_ids.add(int(part))

        return sorted(selected_ids)

    if isinstance(SOURCE_FILE_SPEC, (list, tuple)):
        selected_ids = [int(x) for x in SOURCE_FILE_SPEC]
    elif isinstance(SOURCE_FILE_SPEC, str):
        selected_ids = parse_selection_expression(SOURCE_FILE_SPEC)
    else:
        raise ValueError(
            "Unsupported SOURCE_FILE_SPEC type. Use one of: None, 'ALL', "
            "a string of IDs like '1,3,5-7', or a list of IDs like [1, 3, 5]. "
            f"Got: {type(SOURCE_FILE_SPEC).__name__}"
        )

    if not selected_ids:
        raise ValueError(
            "SOURCE_FILE_SPEC did not specify any IDs. "
            "Use 'ALL', an ID string like '1,3,5-7', or a list like [1, 3, 5]. "
            f"Got: {SOURCE_FILE_SPEC!r}"
        )

    SELECTED_SOURCE_FILES = []
    max_id = len(CANDIDATE_SOURCE_FILES)
    for sid in selected_ids:
        if sid < 1 or sid > max_id:
            raise ValueError(
                f"SOURCE_FILE_SPEC includes an ID that is out of range: {sid}. "
                f"Valid IDs are 1 through {max_id}."
            )
        SELECTED_SOURCE_FILES.append(CANDIDATE_SOURCE_FILES[sid - 1])

    print(f"Selected files: {len(SELECTED_SOURCE_FILES)}")

print("\nSelected source files for this run:")
for p in SELECTED_SOURCE_FILES:
    print(f" - {p}")


Source files selected in Discover sources cell.

Selected source files for this run:
 - /Users/charissophia/obsidian/Iron Wolf Trading Company/_local/pbp_transcripts/Kavar notes.docx
 - /Users/charissophia/obsidian/Iron Wolf Trading Company/_local/planning_notes/Allip Encounter Notes.docx
 - /Users/charissophia/obsidian/Iron Wolf Trading Company/_local/session_notes/IWTC session notes 1-50.docx
 - /Users/charissophia/obsidian/Iron Wolf Trading Company/_local/session_notes/IWTC session notes 51-100.docx
 - /Users/charissophia/obsidian/Iron Wolf Trading Company/_local/session_notes/current_IWTC session notes.docx


## Normalize inputs

## Vocabulary proposal (optional)

## Generate draft index

## Emit outputs