# IWTC Index Querying (v0)

This notebook executes the index querying workflow defined in:

- `docs/index_querying_design.md`

It is intended for hands-on execution and experimentation. Conceptual scope, responsibilities, and query pattern design are defined in the linked design document.

This notebook operates on a single world repository.

A minimal example of `world_repository.yml` is provided in this repository
under:

- `data/config_examples/world_repository.yml`

You may copy and adapt that example for your own world repository.
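
For orientation, the sketch below shows the shape this notebook expects once the descriptor has been parsed, written as the Python mapping that `yaml.safe_load` would return. The key names are the ones Phase 1 reads. The `world_root` value is a placeholder, the relative paths are illustrative assumptions rather than a copy of the shipped example file, and `missing_required_entries` is a hypothetical helper that mirrors the required-entry check.

```python
# Hypothetical parsed descriptor: the mapping yaml.safe_load would return.
# Key names match what Phase 1 reads; all values here are placeholders.
example_descriptor = {
    "world_root": "/absolute/path/to/your/world",  # must be absolute, no "~"
    "indexes": {"path": "_meta/indexes"},          # relative paths resolve against world_root
    "vocabulary": {                                # optional block
        "entities": "_meta/indexes/vocab_entities.csv",
        "aliases": "_meta/indexes/vocab_aliases.csv",
        "author_aliases": "_meta/indexes/vocab_author_aliases.csv",
        "player_character_map": "_meta/indexes/vocab_map_player_character.csv",
    },
}


def missing_required_entries(descriptor):
    """Return the names of required entries that are absent or empty."""
    missing = []
    if not descriptor.get("world_root"):
        missing.append("world_root")
    indexes = descriptor.get("indexes")
    if not (isinstance(indexes, dict) and indexes.get("path")):
        missing.append("indexes.path")
    return missing


print(missing_required_entries(example_descriptor))
print(missing_required_entries({"world_root": "/tmp/world"}))
```

Only `world_root` and `indexes.path` are required; the vocabulary entries can be omitted.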

This notebook loads raw index CSV artifacts generated by the prior indexing notebook:
- `INDEX_ENTITY_TO_CHUNKS_V0`
- `INDEX_CHUNK_TO_ENTITIES_V0`
- `INDEX_PLAYER_TO_CHUNKS_V0`
- `INDEX_SOURCE_FILES_V0`

No normalization, schema changes, or canonical file modifications are performed in this notebook.


## Phase 0: Parameters

This notebook operates on a **campaign world repository** and loads previously
generated index artifacts for interactive querying.

In this phase, you tell the notebook:

- Which world repository it is operating on.
- Which index artifact version it should expect to load.

This notebook does **not** generate new indexes.
It does **not** modify canonical files.
It does **not** alter schema.

The goal is simply to answer:
*"What indexed world am I querying right now?"*

The code cell below contains inline comments explaining each parameter.

**IMPORTANT:** This notebook assumes index artifacts already exist and will fail
if required CSV files are missing.


In [1]:
# Phase 0: Parameters
LAST_PHASE_RUN = "0"

# Absolute path to the world_repository.yml descriptor.
WORLD_REPOSITORY_DESCRIPTOR = (
    "/Users/charissophia/obsidian/Iron Wolf Trading Company/_meta/descriptors/world_repository.yml"
)

# Index version to load (must match previously generated artifacts)
INDEX_VERSION = "V0"

# Internal run metadata (do not edit)
from datetime import datetime
print(f"Notebook run initialized at: {datetime.now().strftime('%Y-%m-%d %H:%M')}")
del datetime

Notebook run initialized at: 2026-02-10 20:11


## Phase 1: Load and validate world descriptor

Before this notebook can safely read anything, it must be confident that it understands the **structure of the world repository**.

In this phase, the notebook:

- Loads the world repository descriptor file you provided
- Confirms that it is readable and structurally valid
- Extracts only the information this notebook needs
- Verifies that referenced paths actually exist and are usable

This phase answers a single question:

**“Can I trust this descriptor enough to proceed?”**

If the answer is *no*, the notebook will stop with clear, actionable error messages explaining what needs to be fixed in the descriptor file.  
Nothing is modified, created, or scanned until this check succeeds.

This phase does **not** interpret world lore, indexing rules, or heuristics.  
It only establishes that the filesystem layout described by the world is coherent and usable.

In [3]:
# Phase 1: Load and validate world repository descriptor (Index Querying v0)
LAST_PHASE_RUN = "1"

from pathlib import Path
import yaml

errors = []
warnings = []

# --- Load descriptor file ---
descriptor_path = Path(WORLD_REPOSITORY_DESCRIPTOR)

if not descriptor_path.exists():
    raise FileNotFoundError(
        "World repository descriptor file was not found.\n"
        f"Path provided:\n  {descriptor_path}\n\n"
        "What to do:\n"
        "- Confirm the file exists at this location or fix WORLD_REPOSITORY_DESCRIPTOR in Phase 0\n"
        "- If you just edited Phase 0, rerun Phase 0 and then rerun this cell\n"
    )

try:
    with descriptor_path.open("r", encoding="utf-8") as f:
        world_repo = yaml.safe_load(f)
except Exception as exc:
    # Chain the original exception so the parser's line/column details stay visible.
    raise ValueError(
        "The world repository descriptor could not be read.\n"
        "This usually indicates a YAML formatting problem.\n\n"
        f"File:\n  {descriptor_path}\n\n"
        "What to do:\n"
        "- Compare the file against the example world_repository.yml\n"
        "- Paste the contents into https://www.yamllint.com/\n"
        "- Fix any reported issues, save the file, and rerun this cell"
    ) from exc

if not isinstance(world_repo, dict):
    raise ValueError(
        "The world repository descriptor was read, but its structure is not usable.\n"
        "The file must be a YAML mapping (top-level `name: value` entries).\n\n"
        "What to do:\n"
        "- Compare the file against the example world_repository.yml\n"
        "- Ensure it uses clear `name: value` lines\n"
        "- Fix the file and rerun this cell"
    )

print(f"World repository descriptor loaded successfully: {descriptor_path.name}")

# --- Extract required entries ---
WORLD_ROOT_RAW = world_repo.get("world_root")

indexes_block = world_repo.get("indexes")
INDEXES_RAW = indexes_block.get("path") if isinstance(indexes_block, dict) else None

vocab = world_repo.get("vocabulary")
ENTITIES_RAW = vocab.get("entities") if isinstance(vocab, dict) else None
ALIASES_RAW = vocab.get("aliases") if isinstance(vocab, dict) else None
AUTHORS_RAW = vocab.get("author_aliases") if isinstance(vocab, dict) else None
PC_MAP_RAW = vocab.get("player_character_map") if isinstance(vocab, dict) else None

if not WORLD_ROOT_RAW:
    errors.append("Missing required entry: world_root")

if not INDEXES_RAW:
    errors.append("Missing required entry: indexes.path")

if errors:
    raise ValueError(
        "World repository descriptor is missing required entries:\n- "
        + "\n- ".join(errors)
        + "\n\nWhat to do:\n"
          "- Edit your world_repository.yml and add/fix the missing entries\n"
          "- Save the file and rerun this cell"
    )

# --- Validate and resolve world_root ---
WORLD_ROOT = Path(WORLD_ROOT_RAW)

if str(WORLD_ROOT).startswith("~"):
    errors.append("world_root: '~' is not allowed. Use a full absolute path.")
elif not WORLD_ROOT.is_absolute():
    errors.append("world_root must be an absolute path (starts with / on macOS/Linux, or C:\\ on Windows).")
elif not WORLD_ROOT.is_dir():
    errors.append(f"world_root must be an existing directory: {WORLD_ROOT}")
else:
    WORLD_ROOT = WORLD_ROOT.resolve()

if errors:
    raise ValueError("Descriptor path validation failed:\n- " + "\n- ".join(errors))

# --- Resolve and validate indexes path ---
INDEXES_PATH = Path(INDEXES_RAW)
if not INDEXES_PATH.is_absolute():
    INDEXES_PATH = WORLD_ROOT / INDEXES_PATH
INDEXES_PATH = INDEXES_PATH.resolve()

try:
    INDEXES_RELPATH = str(INDEXES_PATH.relative_to(WORLD_ROOT))
except Exception:
    INDEXES_RELPATH = str(INDEXES_PATH)

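if not INDEXES_PATH.exists():
    errors.append(f"indexes: path does not exist: {INDEXES_PATH}")
elif not INDEXES_PATH.is_dir():
    errors.append(f"indexes: {INDEXES_PATH} must be a directory")

# Stop here if the indexes path is unusable; later phases depend on it.
if errors:
    raise ValueError("Descriptor path validation failed:\n- " + "\n- ".join(errors))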

# --- Resolve vocabulary paths (optional) ---
VOCAB_ENTITIES_PATH = None
VOCAB_ENTITIES_RELPATH = None
VOCAB_ALIASES_PATH = None
VOCAB_ALIASES_RELPATH = None
VOCAB_AUTHORS_PATH = None
VOCAB_AUTHORS_RELPATH = None
VOCAB_PC_MAP_PATH = None
VOCAB_PC_MAP_RELPATH = None

vocab_entries = [
    ("entities", "vocab.entities"),
    ("aliases", "vocab.aliases"),
    ("author_aliases", "vocab.author_aliases"),
    ("player_character_map", "vocab.player_character_map"),
]

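for key, label in vocab_entries:
    # The vocabulary block is optional; treat a missing or malformed block as empty.
    raw = vocab.get(key) if isinstance(vocab, dict) else None
    if not raw:
        continue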

    p = Path(raw)
    if not p.is_absolute():
        p = WORLD_ROOT / p
    p = p.resolve()

    try:
        rel = str(p.relative_to(WORLD_ROOT))
    except Exception:
        rel = str(p)

    if p.exists() and p.is_dir():
        warnings.append(f"{label}: {p} must be a file (got directory). Ignoring.")
        continue

    if not p.exists():
        warnings.append(f"{label}: file does not exist: {p} (name resolution may be limited).")

    if key == "entities":
        VOCAB_ENTITIES_PATH = p
        VOCAB_ENTITIES_RELPATH = rel
    elif key == "aliases":
        VOCAB_ALIASES_PATH = p
        VOCAB_ALIASES_RELPATH = rel
    elif key == "author_aliases":
        VOCAB_AUTHORS_PATH = p
        VOCAB_AUTHORS_RELPATH = rel
    elif key == "player_character_map":
        VOCAB_PC_MAP_PATH = p
        VOCAB_PC_MAP_RELPATH = rel

print("Descriptor paths are usable for this notebook.")
print(f"world_root: {WORLD_ROOT}")
print(f"indexes: {INDEXES_RELPATH}")
print(f"vocab.entities: {VOCAB_ENTITIES_RELPATH} (exists={VOCAB_ENTITIES_PATH.exists() if VOCAB_ENTITIES_PATH else False})")
print(f"vocab.aliases: {VOCAB_ALIASES_RELPATH} (exists={VOCAB_ALIASES_PATH.exists() if VOCAB_ALIASES_PATH else False})")
print(f"vocab.author_aliases: {VOCAB_AUTHORS_RELPATH} (exists={VOCAB_AUTHORS_PATH.exists() if VOCAB_AUTHORS_PATH else False})")
print(f"vocab.player_character_map: {VOCAB_PC_MAP_RELPATH} (exists={VOCAB_PC_MAP_PATH.exists() if VOCAB_PC_MAP_PATH else False})")

if warnings:
    print("\nWarnings:")
    for w in warnings:
        print(f"- {w}")

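# cleanup (loop variables may be unset if no vocabulary paths were declared)
del yaml, Path, descriptor_path, world_repo, indexes_block, vocab
del WORLD_ROOT_RAW, INDEXES_RAW, ENTITIES_RAW, ALIASES_RAW, AUTHORS_RAW, PC_MAP_RAW
del vocab_entries, errors, warnings, f
for _name in ("key", "label", "raw", "p", "rel", "w"):
    globals().pop(_name, None)
del _name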

World repository descriptor loaded successfully: world_repository.yml
Descriptor paths are usable for this notebook.
world_root: /Users/charissophia/obsidian/Iron Wolf Trading Company
indexes: _meta/indexes
vocab.entities: _meta/indexes/vocab_entities.csv (exists=True)
vocab.aliases: _meta/indexes/vocab_aliases.csv (exists=True)
vocab.author_aliases: _meta/indexes/vocab_author_aliases.csv (exists=True)
vocab.player_character_map: _meta/indexes/vocab_map_player_character.csv (exists=True)


## Phase 2: Load index artifacts

Before this notebook can execute any queries, it must confirm that the
required index artifacts already exist and can be loaded.

In this phase, the notebook:

- Constructs the expected index artifact filenames based on `INDEX_VERSION`
- Confirms those files exist under the repository’s declared `indexes.path`
- Loads each artifact as a raw dataframe
- Verifies that required columns are present
- Publishes stable dataframe variables for downstream query logic

This phase answers a single question:

**“Are the required index artifacts present and structurally usable?”**

If any required artifact is missing or malformed, the notebook will stop
with clear instructions explaining how to regenerate them.

No canonical files are modified.
No schema transformations are performed.
No normalization occurs.

This phase does not execute queries.
It only establishes the concrete, in-memory tables that the query layer will operate on.


In [6]:
# Phase 2: Load index artifacts (v0)
LAST_PHASE_RUN = "2"

import pandas as pd
from pathlib import Path

errors = []

# Normalize INDEX_VERSION into the on-disk suffix (your files use lowercase v0)
# Accepts "V0", "v0", "0" (if you ever use that), but publishes "v0"
INDEX_VERSION_SUFFIX = f"v{str(INDEX_VERSION).lower().lstrip('v')}"

# Required artifact filenames (fixed contract for this notebook)
required = {
    "entity_to_chunks": f"index_entity_to_chunks_{INDEX_VERSION_SUFFIX}.csv",
    "chunk_to_entities": f"index_chunk_to_entities_{INDEX_VERSION_SUFFIX}.csv",
    "player_to_chunks": f"index_player_to_chunks_{INDEX_VERSION_SUFFIX}.csv",
    "source_files": f"index_source_files_{INDEX_VERSION_SUFFIX}.csv",
}

# Resolve paths and validate existence
INDEX_FILES = {}
for key, fname in required.items():
    p = (INDEXES_PATH / fname).resolve()
    INDEX_FILES[key] = p
    if not p.exists():
        errors.append(f"Missing required index artifact: {fname}\n  Expected at: {p}")

if errors:
    raise FileNotFoundError(
        "Phase 2 cannot proceed because required index artifacts are missing.\n\n"
        + "\n\n".join(errors)
        + "\n\nWhat to do:\n"
          "- Rerun IWTC_Raw_Source_Indexing.ipynb to generate the v0 artifacts\n"
          "- Ensure the resulting index_*.csv files are placed under your indexes.path directory\n"
          f"- indexes.path resolved to:\n  {INDEXES_PATH}\n"
          "- Then rerun Phase 2"
    )

# Load CSVs (raw)
DF_ENTITY_TO_CHUNKS = pd.read_csv(INDEX_FILES["entity_to_chunks"])
DF_CHUNK_TO_ENTITIES = pd.read_csv(INDEX_FILES["chunk_to_entities"])
DF_PLAYER_TO_CHUNKS = pd.read_csv(INDEX_FILES["player_to_chunks"])
DF_SOURCE_FILES = pd.read_csv(INDEX_FILES["source_files"])

# Validate required columns (presence only)
expected_cols = {
    "DF_ENTITY_TO_CHUNKS": {"entity_id", "canonical", "chunk_ids", "chunk_count", "file_relpaths", "file_count"},
    "DF_CHUNK_TO_ENTITIES": {
        "chunk_id", "source_id", "source_type", "relpath",
        "chunk_start_line", "chunk_end_line",
        "entity_ids", "canonicals", "entity_count",
        "matched_vocabs", "match_kinds",
    },
    "DF_PLAYER_TO_CHUNKS": {"player_entity_id", "canonical", "chunk_ids", "chunk_count", "file_relpaths", "file_count"},
    "DF_SOURCE_FILES": {"source_id", "relpath", "source_type"},
}

for df_name, cols in expected_cols.items():
    df = globals()[df_name]
    missing = [c for c in cols if c not in df.columns]
    if missing:
        errors.append(f"{df_name}: missing expected columns: {missing}")

if errors:
    raise ValueError(
        "One or more index artifacts were loaded but do not match expected v0 columns.\n- "
        + "\n- ".join(errors)
        + "\n\nWhat to do:\n"
          "- Confirm you are using the v0 CSVs produced by IWTC_Raw_Source_Indexing.ipynb\n"
          "- Do not edit the CSVs manually\n"
          "- If you changed the producer notebook, re-run it to regenerate indexes and retry"
    )

# Summary prints
print("Phase 2 OK: index artifacts loaded.")
print(f"indexes.path: {INDEXES_PATH}")
print(f"index version: {INDEX_VERSION_SUFFIX}")

print("\nLoaded tables:")
print(f"- DF_ENTITY_TO_CHUNKS:   {len(DF_ENTITY_TO_CHUNKS):>8} rows, {len(DF_ENTITY_TO_CHUNKS.columns):>3} cols")
print(f"- DF_CHUNK_TO_ENTITIES:  {len(DF_CHUNK_TO_ENTITIES):>8} rows, {len(DF_CHUNK_TO_ENTITIES.columns):>3} cols")
print(f"- DF_PLAYER_TO_CHUNKS:   {len(DF_PLAYER_TO_CHUNKS):>8} rows, {len(DF_PLAYER_TO_CHUNKS.columns):>3} cols")
print(f"- DF_SOURCE_FILES:       {len(DF_SOURCE_FILES):>8} rows, {len(DF_SOURCE_FILES.columns):>3} cols")

# Optional: quick column display (helps debugging early)
print("\nDF_ENTITY_TO_CHUNKS columns:", list(DF_ENTITY_TO_CHUNKS.columns))
print("DF_CHUNK_TO_ENTITIES columns:", list(DF_CHUNK_TO_ENTITIES.columns))
print("DF_PLAYER_TO_CHUNKS columns:", list(DF_PLAYER_TO_CHUNKS.columns))
print("DF_SOURCE_FILES columns:", list(DF_SOURCE_FILES.columns))

# cleanup locals
del pd, Path, errors, required, key, fname, p, cols, df_name, df, missing
del expected_cols, INDEX_VERSION_SUFFIX, INDEX_FILES

Phase 2 OK: index artifacts loaded.
indexes.path: /Users/charissophia/obsidian/Iron Wolf Trading Company/_meta/indexes
index version: v0

Loaded tables:
- DF_ENTITY_TO_CHUNKS:        168 rows,   6 cols
- DF_CHUNK_TO_ENTITIES:      1139 rows,  11 cols
- DF_PLAYER_TO_CHUNKS:          6 rows,   6 cols
- DF_SOURCE_FILES:            130 rows,   3 cols

DF_ENTITY_TO_CHUNKS columns: ['entity_id', 'canonical', 'chunk_ids', 'chunk_count', 'file_relpaths', 'file_count']
DF_CHUNK_TO_ENTITIES columns: ['chunk_id', 'source_id', 'source_type', 'relpath', 'chunk_start_line', 'chunk_end_line', 'entity_ids', 'canonicals', 'entity_count', 'matched_vocabs', 'match_kinds']
DF_PLAYER_TO_CHUNKS columns: ['player_entity_id', 'canonical', 'chunk_ids', 'chunk_count', 'file_relpaths', 'file_count']
DF_SOURCE_FILES columns: ['source_id', 'relpath', 'source_type']


In [13]:
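# optional: clean up INDEXES path variables once they are no longer needed.
# NOTE: Phase 3 still reads INDEXES_PATH, so run this cell only after Phase 3
# (the execution counts show it was run last in the recorded session).
del INDEXES_PATH, INDEXES_RELPATH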

## Phase 3: Load vocabulary tables

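This phase loads the vocabulary tables that enable human-readable
resolution and display during querying. `vocab_entities.csv` is required;
the other three files are optional and only produce a warning when missing.
Each file is looked up by its fixed filename under `indexes.path`.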

The notebook:

- Loads `vocab_entities.csv`
- Loads `vocab_aliases.csv`
- Loads `vocab_author_aliases.csv`
- Loads `vocab_map_player_character.csv`
- Validates minimal required columns (presence only)
- Publishes vocabulary dataframes for use in resolution helpers

This phase does not modify index tables and does not merge data.
It only prepares lookup tables for name resolution and display.

In [11]:
# Phase 3: Load vocabulary tables (human-authored CSVs; entities required)
LAST_PHASE_RUN = "3"

import pandas as pd
from pathlib import Path

errors = []
warnings = []

# ------------------------------------------------------------------
# Semantic column mappings for vocab CSVs
# Each semantic field -> acceptable column names in the CSV
# Any other CSV columns are ignored on purpose.
# ------------------------------------------------------------------
ENTITY_COLS = {
    "entity_id": ["entity_id", "id"],
    "canonical": ["canonical", "canonical_name", "name"],
}
ALIAS_COLS = {
    "entity_id": ["entity_id", "id"],
    "alias": ["alias", "alt", "alternate"],
}
AUTHOR_ALIAS_COLS = {
    "author": ["author", "discord_name", "handle"],
    "player_entity_id": ["player_entity_id", "player", "player_id"],
    "ambig_char_id": ["ambig_char_id", "ambiguous_character", "ambig_character"],
}
PC_MAP_COLS = {
    "player_entity_id": ["player_entity_id", "player", "player_id"],
    "char_entity_id": ["char_entity_id", "character_entity_id", "character"],
}

# ------------------------------------------------------------------
# Expected vocab artifact filenames
# - entities required
# - others optional
# ------------------------------------------------------------------
vocab_files = [
    ("entities", "vocab_entities.csv", ENTITY_COLS, True),
    ("aliases", "vocab_aliases.csv", ALIAS_COLS, False),
    ("author_aliases", "vocab_author_aliases.csv", AUTHOR_ALIAS_COLS, False),
    ("pc_map", "vocab_map_player_character.csv", PC_MAP_COLS, False),
]

# Outputs (published)
DF_VOCAB_ENTITIES = pd.DataFrame(columns=list(ENTITY_COLS.keys()))
DF_VOCAB_ALIASES = pd.DataFrame(columns=list(ALIAS_COLS.keys()))
DF_VOCAB_AUTHORS = pd.DataFrame(columns=list(AUTHOR_ALIAS_COLS.keys()))
DF_VOCAB_PC_MAP = pd.DataFrame(columns=list(PC_MAP_COLS.keys()))

VOCAB_FILES = {}

# ------------------------------------------------------------------
# Load + normalize each vocab CSV (in place, no helper function)
# ------------------------------------------------------------------
for key, fname, col_map, required in vocab_files:
    p = (INDEXES_PATH / fname).resolve()
    VOCAB_FILES[key] = p

    if required and not p.exists():
        errors.append(f"Missing required vocabulary file: {fname}\n  Expected at: {p}")
        continue

    if not p.exists():
        warnings.append(f"Optional vocab file not found: {fname}")
        continue

    raw_df = pd.read_csv(p, dtype=str).fillna("")

    # Build rename map: first matching option wins
    rename = {}
    for semantic, options in col_map.items():
        found = next((c for c in options if c in raw_df.columns), None)
        if found:
            rename[found] = semantic

    # If there are rows but we couldn't resolve any semantic columns, warn loudly
    if len(raw_df) > 0 and not rename:
        header_line = ", ".join(list(raw_df.columns))
        expected = {k: v for k, v in col_map.items()}
        warnings.append(
            f"WARNING [{key}]: CSV has rows but none of the expected columns were found.\n"
            f"  CSV columns: {header_line}\n"
            f"  Expected (semantic -> acceptable names): {expected}\n"
            f"  File: {p}"
        )
        norm_df = pd.DataFrame(columns=list(col_map.keys()))
    else:
        out = raw_df.rename(columns=rename)

        missing_semantic = [k for k in col_map.keys() if k not in out.columns]
        if len(raw_df) > 0 and missing_semantic:
            warnings.append(
                f"WARNING [{key}]: missing semantic columns after normalization: {missing_semantic}\n"
                f"  CSV columns: {list(raw_df.columns)}\n"
                f"  Using: {list(out.columns)}\n"
                f"  File: {p}"
            )

        keep = [k for k in col_map.keys() if k in out.columns]
        norm_df = out[keep].copy()

    # Publish to the appropriate DF_*
    if key == "entities":
        DF_VOCAB_ENTITIES = norm_df
    elif key == "aliases":
        DF_VOCAB_ALIASES = norm_df
    elif key == "author_aliases":
        DF_VOCAB_AUTHORS = norm_df
    elif key == "pc_map":
        DF_VOCAB_PC_MAP = norm_df

    # cleanup per-iteration locals
    del raw_df, rename, semantic, options, found, col_map, required, fname, p, norm_df

# Required entities must be usable
if errors:
    raise FileNotFoundError(
        "Phase 3 cannot proceed because required vocabulary files are missing.\n\n"
        + "\n\n".join(errors)
        + "\n\nWhat to do:\n"
          "- Create or place the required vocab files under indexes.path\n"
          "- If you need a starter schema, use the config_examples and existing IWTC files as reference\n"
          f"- indexes.path resolved to:\n  {INDEXES_PATH}\n"
          "- Then rerun Phase 3"
    )

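# Phase 4 indexes DF_VOCAB_ENTITIES by both columns, so require them here.
if DF_VOCAB_ENTITIES.empty or not set(ENTITY_COLS).issubset(DF_VOCAB_ENTITIES.columns):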
    raise ValueError(
        "Entities vocab file was loaded but did not produce any usable rows after normalization.\n"
        "What to do:\n"
        "- Confirm the entities CSV includes columns for entity_id and canonical name\n"
        "- Acceptable column names:\n"
        f"  entity_id: {ENTITY_COLS['entity_id']}\n"
        f"  canonical: {ENTITY_COLS['canonical']}\n"
        "- Fix the file and rerun Phase 3"
    )

# Summary prints (match Phase 2 style)
print("Phase 3 OK: vocabulary tables loaded (human-authored schemas supported).")
print(f"indexes.path: {INDEXES_PATH}")

print("\nLoaded vocab tables:")
print(f"- DF_VOCAB_ENTITIES: {len(DF_VOCAB_ENTITIES):>8} rows, {len(DF_VOCAB_ENTITIES.columns):>3} cols")
print(f"- DF_VOCAB_ALIASES:  {len(DF_VOCAB_ALIASES):>8} rows, {len(DF_VOCAB_ALIASES.columns):>3} cols")
print(f"- DF_VOCAB_AUTHORS:  {len(DF_VOCAB_AUTHORS):>8} rows, {len(DF_VOCAB_AUTHORS.columns):>3} cols")
print(f"- DF_VOCAB_PC_MAP:   {len(DF_VOCAB_PC_MAP):>8} rows, {len(DF_VOCAB_PC_MAP.columns):>3} cols")

print("\nDF_VOCAB_ENTITIES columns:", list(DF_VOCAB_ENTITIES.columns))
print("DF_VOCAB_ALIASES columns:", list(DF_VOCAB_ALIASES.columns))
print("DF_VOCAB_AUTHORS columns:", list(DF_VOCAB_AUTHORS.columns))
print("DF_VOCAB_PC_MAP columns:", list(DF_VOCAB_PC_MAP.columns))

if warnings:
    print("\nWarnings:")
    for w in warnings:
        print(f"- {w}")

# cleanup locals (keep published DFs and VOCAB_FILES)
del pd, Path
del errors, warnings, vocab_files, key, keep, missing_semantic, out
del ENTITY_COLS, ALIAS_COLS, AUTHOR_ALIAS_COLS, PC_MAP_COLS, VOCAB_FILES

Phase 3 OK: vocabulary tables loaded (human-authored schemas supported).
indexes.path: /Users/charissophia/obsidian/Iron Wolf Trading Company/_meta/indexes

Loaded vocab tables:
- DF_VOCAB_ENTITIES:      176 rows,   2 cols
- DF_VOCAB_ALIASES:        87 rows,   2 cols
- DF_VOCAB_AUTHORS:         6 rows,   3 cols
- DF_VOCAB_PC_MAP:         42 rows,   2 cols

DF_VOCAB_ENTITIES columns: ['entity_id', 'canonical']
DF_VOCAB_ALIASES columns: ['entity_id', 'alias']
DF_VOCAB_AUTHORS columns: ['author', 'player_entity_id', 'ambig_char_id']
DF_VOCAB_PC_MAP columns: ['player_entity_id', 'char_entity_id']


In [12]:
# optional: clean up VOCAB path variables
del VOCAB_ENTITIES_PATH, VOCAB_ENTITIES_RELPATH
del VOCAB_ALIASES_PATH, VOCAB_ALIASES_RELPATH
del VOCAB_AUTHORS_PATH, VOCAB_AUTHORS_RELPATH
del VOCAB_PC_MAP_PATH, VOCAB_PC_MAP_RELPATH

## Phase 4: Query building blocks

This phase defines the notebook's basic query tools ("building blocks") that later phases combine into DM-friendly questions.

Key ideas:
- A **chunk** is a small excerpt of a file, identified by `chunk_id` plus the file `relpath` and line range.
- An **entity** is a tracked name (character, place, faction, etc.) identified by `entity_id` and shown by `canonical`.

These tools let you type either a name (canonical or alias) or an ID, and then:
- find which chunks mention an entity or a player
- list which entities appear inside a given chunk
- pull the index rows needed to navigate back to the source files

In Phase 5/6 we will use these building blocks to answer practical questions like:
"Where does X appear?" and "What entities appear in file Y?"

In [15]:
# Phase 4: Query building blocks (DM-facing names, v0)
LAST_PHASE_RUN = "4"

import ast
import json
import pandas as pd

# -------------------------------------------------------------------
# Internal: parse list-encoded fields from CSVs
# -------------------------------------------------------------------
def _parse_list_field(raw):
    """
    Parse a list-like field (stored as text in the CSVs) into list[str].

    Supported forms:
    - "" / None / NaN -> []
    - JSON: ["a","b"]
    - Python repr: ['a','b']
    - Delimited: "a|b|c" or "a;b;c" or "a,b,c"
    - Single token: "a" -> ["a"]
    """
    if raw is None:
        return []
    if isinstance(raw, float) and pd.isna(raw):
        return []
    s = str(raw).strip()
    if not s:
        return []

    if s.startswith("[") and s.endswith("]"):
        # JSON
        try:
            v = json.loads(s)
            if isinstance(v, list):
                return [str(x).strip() for x in v if str(x).strip()]
        except Exception:
            pass

        # Python repr
        try:
            v = ast.literal_eval(s)
            if isinstance(v, (list, tuple)):
                return [str(x).strip() for x in v if str(x).strip()]
        except Exception:
            pass

    for delim in ["|", ";", ","]:
        if delim in s:
            out = [x.strip() for x in s.split(delim)]
            return [x for x in out if x]

    return [s]


# -------------------------------------------------------------------
# Name resolution (entity + player)
# -------------------------------------------------------------------
def find_entity_ids(name_or_id, include_aliases=True):
    """
    Resolve a user input (name, alias, or ID) to matching entity_id(s).

    You can type:
    - an entity_id (exact)
    - a canonical name (case-insensitive)
    - an alias (case-insensitive) if aliases exist and include_aliases=True

    Returns:
    - list[str] of entity_id (possibly empty, possibly multiple)
    """
    if name_or_id is None:
        return []

    q = str(name_or_id).strip()
    if not q:
        return []

    # Direct ID match (fast path)
    if "entity_id" in DF_VOCAB_ENTITIES.columns:
        if q in set(DF_VOCAB_ENTITIES["entity_id"]):
            return [q]

    q_lower = q.lower()

    # Canonical match
    canon_hits = DF_VOCAB_ENTITIES[
        DF_VOCAB_ENTITIES["canonical"].astype(str).str.lower() == q_lower
    ]["entity_id"].tolist()

    hits = list(dict.fromkeys([x for x in canon_hits if str(x).strip()]))

    # Alias match (optional)
    if include_aliases and DF_VOCAB_ALIASES is not None and not DF_VOCAB_ALIASES.empty:
        alias_hits = DF_VOCAB_ALIASES[
            DF_VOCAB_ALIASES["alias"].astype(str).str.lower() == q_lower
        ]["entity_id"].tolist()

        for eid in alias_hits:
            if eid and eid not in hits:
                hits.append(eid)

    return hits


def find_player_ids(author_or_player_id):
    """
    Resolve a user input to player_entity_id(s).

    You can type:
    - a player_entity_id (exact)
    - an author handle (Discord name) if author aliases exist

    Returns:
    - list[str] of player_entity_id
    """
    if author_or_player_id is None:
        return []

    q = str(author_or_player_id).strip()
    if not q:
        return []

    # Direct match against player index
    if "player_entity_id" in DF_PLAYER_TO_CHUNKS.columns:
        if q in set(DF_PLAYER_TO_CHUNKS["player_entity_id"]):
            return [q]

    # Author handle lookup (optional)
    if DF_VOCAB_AUTHORS is None or DF_VOCAB_AUTHORS.empty:
        return []

    q_lower = q.lower()
    hits = DF_VOCAB_AUTHORS[
        DF_VOCAB_AUTHORS["author"].astype(str).str.lower() == q_lower
    ]["player_entity_id"].tolist()

    hits = [x.strip() for x in hits if str(x).strip()]
    hits = list(dict.fromkeys(hits))
    return hits


def find_character_ids_for_player(player_entity_id):
    """
    Return character entity_id(s) mapped to a player_entity_id (if a PC map exists).

    Returns:
    - list[str] (possibly empty)
    """
    if not player_entity_id:
        return []

    if DF_VOCAB_PC_MAP is None or DF_VOCAB_PC_MAP.empty:
        return []

    q = str(player_entity_id).strip()
    if not q:
        return []

    hits = DF_VOCAB_PC_MAP[
        DF_VOCAB_PC_MAP["player_entity_id"].astype(str) == q
    ]["char_entity_id"].tolist()

    hits = [x.strip() for x in hits if str(x).strip()]
    hits = list(dict.fromkeys(hits))
    return hits


# -------------------------------------------------------------------
# Chunk lookup
# -------------------------------------------------------------------
def find_chunk_ids_for_entity(entity_name_or_id, include_aliases=True):
    """
    Return a set of chunk_id where an entity appears.

    Input can be a canonical name, alias (if enabled), or entity_id.
    """
    eids = find_entity_ids(entity_name_or_id, include_aliases=include_aliases)
    if not eids:
        return set()

    rows = DF_ENTITY_TO_CHUNKS[DF_ENTITY_TO_CHUNKS["entity_id"].isin(eids)]
    out = set()

    for raw in rows["chunk_ids"].tolist():
        for cid in _parse_list_field(raw):
            out.add(cid)

    return out


def find_chunk_ids_for_player(player_name_or_id):
    """
    Return a set of chunk_id associated with a player.

    Input can be a player_entity_id or an author handle (if author aliases exist).
    """
    pids = find_player_ids(player_name_or_id)
    if not pids:
        return set()

    rows = DF_PLAYER_TO_CHUNKS[DF_PLAYER_TO_CHUNKS["player_entity_id"].isin(pids)]
    out = set()

    for raw in rows["chunk_ids"].tolist():
        for cid in _parse_list_field(raw):
            out.add(cid)

    return out


# -------------------------------------------------------------------
# Chunk inspection
# -------------------------------------------------------------------
def get_chunk_row(chunk_id):
    """
    Return the DF_CHUNK_TO_ENTITIES row for a chunk_id (0 or 1 rows).
    """
    q = str(chunk_id).strip() if chunk_id is not None else ""
    if not q:
        return DF_CHUNK_TO_ENTITIES.iloc[0:0].copy()

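    # Compare as text: the index CSVs are loaded without dtype=str, so
    # numeric-looking chunk_ids would otherwise never equal the string query.
    return DF_CHUNK_TO_ENTITIES[DF_CHUNK_TO_ENTITIES["chunk_id"].astype(str) == q].copy()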


def list_entity_ids_in_chunk(chunk_id):
    """
    Return entity_id list in a chunk (order as stored in the index).
    """
    df = get_chunk_row(chunk_id)
    if df.empty:
        return []
    return _parse_list_field(df.iloc[0]["entity_ids"])


def list_entity_names_in_chunk(chunk_id):
    """
    Return canonical-name list in a chunk (order as stored in the index).
    """
    df = get_chunk_row(chunk_id)
    if df.empty:
        return []
    return _parse_list_field(df.iloc[0]["canonicals"])


def get_chunk_rows(chunk_ids):
    """
    Return DF_CHUNK_TO_ENTITIES rows for a set/list of chunk_ids.
    """
    ids = list(chunk_ids) if chunk_ids is not None else []
    ids = [str(x).strip() for x in ids if str(x).strip()]
    if not ids:
        return DF_CHUNK_TO_ENTITIES.iloc[0:0].copy()

    # Compare as text so string ids still match a numerically inferred column.
    return DF_CHUNK_TO_ENTITIES[DF_CHUNK_TO_ENTITIES["chunk_id"].astype(str).isin(ids)].copy()


print("Phase 4 OK: query building blocks defined.")
print("Building blocks:")
print("- find_entity_ids(name_or_id, include_aliases=True)")
print("- find_player_ids(author_or_player_id)")
print("- find_character_ids_for_player(player_entity_id)")
print("- find_chunk_ids_for_entity(entity_name_or_id, include_aliases=True)")
print("- find_chunk_ids_for_player(player_name_or_id)")
print("- get_chunk_row(chunk_id)")
print("- list_entity_ids_in_chunk(chunk_id)")
print("- list_entity_names_in_chunk(chunk_id)")
print("- get_chunk_rows(chunk_ids)")

# Keep ast, json, and pd imported: the functions above look these names up
# at call time, so deleting them here would break _parse_list_field.

Phase 4 OK: query building blocks defined.
Building blocks:
- find_entity_ids(name_or_id, include_aliases=True)
- find_player_ids(author_or_player_id)
- find_character_ids_for_player(player_entity_id)
- find_chunk_ids_for_entity(entity_name_or_id, include_aliases=True)
- find_chunk_ids_for_player(player_name_or_id)
- get_chunk_row(chunk_id)
- list_entity_ids_in_chunk(chunk_id)
- list_entity_names_in_chunk(chunk_id)
- get_chunk_rows(chunk_ids)


...


## Phase 5: Implement query primitives

This phase implements the primitives defined above.

Key responsibilities:

- Parse list-encoded fields (`chunk_ids`, `entity_ids`)
- Convert serialized lists into Python sets
- Ensure stable return types
- Avoid modifying any canonical data

All transformations occur in-memory only.
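
A minimal sketch of the parse-then-collect step, assuming list fields arrive either as JSON arrays or as `|`-delimited text. The names `parse_ids` and `collect_chunk_ids` and the sample rows are hypothetical; the notebook's own parser in Phase 4 accepts more forms.

```python
import json


def parse_ids(raw):
    """Parse a serialized list field into a list of non-empty strings."""
    if raw is None:
        return []
    text = str(raw).strip()
    if not text:
        return []
    if text.startswith("[") and text.endswith("]"):
        try:
            values = json.loads(text)
        except ValueError:  # json.JSONDecodeError is a ValueError subclass
            values = None
        if isinstance(values, list):
            return [str(v).strip() for v in values if str(v).strip()]
    # Fallback (assumption): treat the text as "|"-delimited.
    return [part.strip() for part in text.split("|") if part.strip()]


def collect_chunk_ids(rows):
    """Union the chunk_ids of several index rows into one set."""
    out = set()
    for row in rows:
        out.update(parse_ids(row.get("chunk_ids")))
    return out


# Hypothetical rows shaped like DF_ENTITY_TO_CHUNKS records.
sample_rows = [
    {"entity_id": "e1", "chunk_ids": '["c1", "c2"]'},
    {"entity_id": "e2", "chunk_ids": "c2|c3"},
    {"entity_id": "e3", "chunk_ids": ""},
]
print(sorted(collect_chunk_ids(sample_rows)))
```

Always returning a set, even an empty one, lets callers intersect or union results without checking for `None` first.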


## Phase 6: Implement composed query patterns

This phase builds higher-level query patterns using primitives.

Supported patterns:

- Where does entity X appear?
- What entities appear in file Y?
- What chunks are associated with player Z?
- What entities co-occur with entity X?

Each pattern:

- Returns a dataframe formatted for inspection
- Uses only existing index tables
- Performs no normalization or schema rewriting
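
As one illustration, the co-occurrence pattern could be sketched as below. This is an assumption-laden stand-in, not the notebook's implementation: `cooccurring_entities` is a hypothetical name, the input is a plain dict of already-parsed entity lists instead of `DF_CHUNK_TO_ENTITIES`, and it returns tuples instead of a dataframe to stay dependency-free.

```python
from collections import Counter


def cooccurring_entities(chunk_entities, entity_id):
    """Count how often other entities share a chunk with entity_id.

    chunk_entities maps chunk_id -> list of entity_ids already parsed
    from the index (a hypothetical stand-in for DF_CHUNK_TO_ENTITIES).
    Returns (other_entity_id, shared_chunk_count) pairs, most frequent first.
    """
    counts = Counter()
    for entity_ids in chunk_entities.values():
        present = set(entity_ids)
        if entity_id in present:
            counts.update(present - {entity_id})
    # Sort by descending count, then by id so ties are deterministic.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical sample data.
sample = {
    "c1": ["e1", "e2"],
    "c2": ["e1", "e2", "e3"],
    "c3": ["e3"],
}
print(cooccurring_entities(sample, "e1"))
```

The input mapping is only read, never changed, which matches the rule that patterns use existing index tables without rewriting them.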


## Phase 7: Output formatting conventions

This phase standardizes how results are presented.

Defines:

- Chunk reference row format:
  - chunk_id
  - relpath
  - chunk_start_line
  - chunk_end_line
  - source_type
  - entity_count

- Entity summary row format:
  - entity_id
  - canonical
  - n_chunks
  - n_files

- Sorting rules (e.g., relpath + line order)

No data is modified in this phase.
This phase concerns display consistency only.
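
The chunk reference format and the relpath + line sorting rule could look like the sketch below. It assumes rows are plain dicts; `format_chunk_refs`, the sample paths, and the `source_type` values are hypothetical.

```python
CHUNK_REF_COLUMNS = [
    "chunk_id", "relpath", "chunk_start_line",
    "chunk_end_line", "source_type", "entity_count",
]


def format_chunk_refs(rows):
    """Project rows onto the chunk reference columns, sorted by file then line."""
    # Build new dicts so the input rows are left untouched.
    projected = [{col: row.get(col) for col in CHUNK_REF_COLUMNS} for row in rows]
    return sorted(projected, key=lambda r: (r["relpath"], r["chunk_start_line"]))


# Hypothetical rows; "extra" stands in for any column outside the display format.
sample_rows = [
    {"chunk_id": "c3", "relpath": "notes/b.md", "chunk_start_line": 1,
     "chunk_end_line": 20, "source_type": "note", "entity_count": 2, "extra": "x"},
    {"chunk_id": "c2", "relpath": "notes/a.md", "chunk_start_line": 21,
     "chunk_end_line": 40, "source_type": "note", "entity_count": 1, "extra": "x"},
    {"chunk_id": "c1", "relpath": "notes/a.md", "chunk_start_line": 1,
     "chunk_end_line": 20, "source_type": "note", "entity_count": 3, "extra": "x"},
]
for ref in format_chunk_refs(sample_rows):
    print(ref["relpath"], ref["chunk_start_line"], ref["chunk_id"])
```

Sorting on `(relpath, chunk_start_line)` keeps each file's chunks together in reading order.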


## Phase 8: Interactive scratchpad

This phase provides example queries to validate notebook behavior.

Examples:

- Query a known entity
- Query a known player
- Inspect a known file
- Test entity co-occurrence

This section serves as a reusable verification block for future runs.
