# IWTC Index Querying

This notebook loads previously-generated index artifacts and provides DM-friendly query recipes.

How to use this notebook:

1) Run the "Build" phases once, top to bottom (Phases 0-4).
2) Collapse the build section.
3) Use the "Example Queries" section as your working interface.


## Build Phases (v0)

This notebook executes the index querying workflow defined in:

- `docs/index_querying_design.md`

It is intended for hands-on execution and experimentation. Conceptual scope, responsibilities, and query pattern design are defined in the linked design document.

This notebook operates on a single world repository.

A minimal example of `world_repository.yml` is provided in this repository
under:

- `data/config_examples/world_repository.yml`

You may copy and adapt that example for your own world repository.
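
As a rough orientation, the sketch below shows the descriptor's shape as the Python mapping that Phase 1 would see after loading the YAML. The key names are the ones Phase 1 reads. The values are illustrative assumptions: `/path/to/world` is a placeholder, and the relative paths simply mirror this notebook's run output. The helper `missing_required_entries` is hypothetical and only mimics the required-entry check. The example file under `data/config_examples/` remains the authoritative reference.

```python
# Hypothetical descriptor contents, shown as the mapping yaml.safe_load would return.
# Key names match what Phase 1 reads; "/path/to/world" is a placeholder.
example_descriptor = {
    "world_root": "/path/to/world",
    "indexes": {"path": "_meta/indexes"},
    "vocabulary": {
        "entities": "_meta/indexes/vocab_entities.csv",
        "aliases": "_meta/indexes/vocab_aliases.csv",
        "author_aliases": "_meta/indexes/vocab_author_aliases.csv",
        "player_character_map": "_meta/indexes/vocab_map_player_character.csv",
    },
}


def missing_required_entries(descriptor):
    """Return the required entries (world_root, indexes.path) that are absent."""
    missing = []
    if not descriptor.get("world_root"):
        missing.append("world_root")
    indexes = descriptor.get("indexes")
    if not (isinstance(indexes, dict) and indexes.get("path")):
        missing.append("indexes.path")
    return missing


print(missing_required_entries(example_descriptor))
print(missing_required_entries({"world_root": "/path/to/world"}))
```

Only `world_root` and `indexes.path` are treated as required here; the `vocabulary` block is resolved later and its entries are handled more leniently.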

This notebook loads raw index CSV artifacts generated by the prior indexing notebook:
- `INDEX_ENTITY_TO_CHUNKS_V0`
- `INDEX_CHUNK_TO_ENTITIES_V0`
- `INDEX_PLAYER_TO_CHUNKS_V0`
- `INDEX_SOURCE_FILES_V0`

No normalization, schema changes, or canonical file modifications are performed in this notebook.


## Phase 0: Parameters

This notebook operates on a **campaign world repository** and loads previously
generated index artifacts for interactive querying.

In this phase, you tell the notebook:

- Which world repository it is operating on.
- Which index artifact version it should expect to load.

This notebook does **not** generate new indexes.
It does **not** modify canonical files.
It does **not** alter schema.

The goal is simply to answer:
*"What indexed world am I querying right now?"*

The code cell below contains inline comments explaining each parameter.

**IMPORTANT:** This notebook assumes index artifacts already exist and will fail
if required CSV files are missing.


In [2]:
# Phase 0: Parameters
LAST_PHASE_RUN = "0"

# Absolute path to the world_repository.yml descriptor.
WORLD_REPOSITORY_DESCRIPTOR = (
    "/Users/charissophia/obsidian/Iron Wolf Trading Company/_meta/descriptors/world_repository.yml"
)

# Index version to load (must match previously generated artifacts)
INDEX_VERSION = "V0"

# Internal run metadata (do not edit)
from datetime import datetime
print(f"Notebook run initialized at: {datetime.now().strftime('%Y-%m-%d %H:%M')}")
del datetime

Notebook run initialized at: 2026-02-11 23:49


## Phase 1: Load and validate world descriptor

Before this notebook can safely read anything, it must be confident that it understands the **structure of the world repository**.

In this phase, the notebook:

- Loads the world repository descriptor file you provided
- Confirms that it is readable and structurally valid
- Extracts only the information this notebook needs
- Verifies that referenced paths actually exist and are usable

This phase answers a single question:

**“Can I trust this descriptor enough to proceed?”**

If the answer is *no*, the notebook will stop with clear, actionable error messages explaining what needs to be fixed in the descriptor file.  
Nothing is modified, created, or scanned until this check succeeds.

This phase does **not** interpret world lore, indexing rules, or heuristics.  
It only establishes that the filesystem layout described by the world is coherent and usable.
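
The path handling in this phase can be sketched in isolation: relative entries are interpreted under `world_root`, and a repository-relative form is kept for display. This is a minimal sketch under stated assumptions, not the cell's actual code; the helper names `resolve_under_root` and `display_relpath` are hypothetical, and a temporary directory stands in for a real world repository.

```python
import tempfile
from pathlib import Path


def resolve_under_root(world_root, raw_path):
    """Resolve raw_path; relative paths are interpreted under world_root."""
    p = Path(raw_path)
    if not p.is_absolute():
        p = Path(world_root) / p
    return p.resolve()


def display_relpath(world_root, resolved):
    """Return a path relative to world_root when possible, else the full path."""
    try:
        return str(resolved.relative_to(Path(world_root).resolve()))
    except ValueError:
        # The path lies outside world_root, so show it in full.
        return str(resolved)


# A temporary directory stands in for the world repository.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    (root / "_meta" / "indexes").mkdir(parents=True)

    indexes = resolve_under_root(root, "_meta/indexes")
    print(indexes.is_dir())
    print(display_relpath(root, indexes))
```

Keeping both the resolved path and the relative form lets later messages stay short while file access still uses an unambiguous absolute path.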

In [3]:
# Phase 1: Load and validate world repository descriptor (Index Querying v0)
LAST_PHASE_RUN = "1"

from pathlib import Path
import yaml

errors = []
warnings = []

# --- Load descriptor file ---
descriptor_path = Path(WORLD_REPOSITORY_DESCRIPTOR)

if not descriptor_path.exists():
    raise FileNotFoundError(
        "World repository descriptor file was not found.\n"
        f"Path provided:\n  {descriptor_path}\n\n"
        "What to do:\n"
        "- Confirm the file exists at this location or fix WORLD_REPOSITORY_DESCRIPTOR in Phase 0\n"
        "- If you just edited Phase 0, rerun Phase 0 and then rerun this cell\n"
    )

try:
    with descriptor_path.open("r", encoding="utf-8") as f:
        world_repo = yaml.safe_load(f)
except Exception:
    raise ValueError(
        "The world repository descriptor could not be read.\n"
        "This usually indicates a YAML formatting problem.\n\n"
        f"File:\n  {descriptor_path}\n\n"
        "What to do:\n"
        "- Compare the file against the example world_repository.yml\n"
        "- Paste the contents into https://www.yamllint.com/\n"
        "- Fix any reported issues, save the file, and rerun this cell"
    )

if not isinstance(world_repo, dict):
    raise ValueError(
        "The world repository descriptor was read, but its structure is not usable.\n"
        "The file must be a YAML mapping (top-level `name: value` entries).\n\n"
        "What to do:\n"
        "- Compare the file against the example world_repository.yml\n"
        "- Ensure it uses clear `name: value` lines\n"
        "- Fix the file and rerun this cell"
    )

print(f"World repository descriptor loaded successfully: {descriptor_path.name}")

# --- Extract required entries ---
WORLD_ROOT_RAW = world_repo.get("world_root")

indexes_block = world_repo.get("indexes")
INDEXES_RAW = indexes_block.get("path") if isinstance(indexes_block, dict) else None

vocab = world_repo.get("vocabulary")
ENTITIES_RAW = vocab.get("entities") if isinstance(vocab, dict) else None
ALIASES_RAW = vocab.get("aliases") if isinstance(vocab, dict) else None
AUTHORS_RAW = vocab.get("author_aliases") if isinstance(vocab, dict) else None
PC_MAP_RAW = vocab.get("player_character_map") if isinstance(vocab, dict) else None

if not WORLD_ROOT_RAW:
    errors.append("Missing required entry: world_root")

if not INDEXES_RAW:
    errors.append("Missing required entry: indexes.path")

if errors:
    raise ValueError(
        "World repository descriptor is missing required entries:\n- "
        + "\n- ".join(errors)
        + "\n\nWhat to do:\n"
          "- Edit your world_repository.yml and add/fix the missing entries\n"
          "- Save the file and rerun this cell"
    )

# --- Validate and resolve world_root ---
WORLD_ROOT = Path(WORLD_ROOT_RAW)

if str(WORLD_ROOT).startswith("~"):
    errors.append("world_root: '~' is not allowed. Use a full absolute path.")
elif not WORLD_ROOT.is_absolute():
    errors.append("world_root must be an absolute path (starts with / on macOS/Linux, or C:\\ on Windows).")
elif not WORLD_ROOT.is_dir():
    errors.append(f"world_root must be an existing directory: {WORLD_ROOT}")
else:
    WORLD_ROOT = WORLD_ROOT.resolve()

if errors:
    raise ValueError("Descriptor path validation failed:\n- " + "\n- ".join(errors))

# --- Resolve and validate indexes path ---
INDEXES_PATH = Path(INDEXES_RAW)
if not INDEXES_PATH.is_absolute():
    INDEXES_PATH = WORLD_ROOT / INDEXES_PATH
INDEXES_PATH = INDEXES_PATH.resolve()

try:
    INDEXES_RELPATH = str(INDEXES_PATH.relative_to(WORLD_ROOT))
except Exception:
    INDEXES_RELPATH = str(INDEXES_PATH)

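if not INDEXES_PATH.exists():
    errors.append(f"indexes: path does not exist: {INDEXES_PATH}")
elif not INDEXES_PATH.is_dir():
    errors.append(f"indexes: {INDEXES_PATH} must be a directory")

# Stop here if the indexes path is unusable
if errors:
    raise ValueError("Descriptor path validation failed:\n- " + "\n- ".join(errors))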

# --- Resolve vocabulary paths (optional) ---
VOCAB_ENTITIES_PATH = None
VOCAB_ENTITIES_RELPATH = None
VOCAB_ALIASES_PATH = None
VOCAB_ALIASES_RELPATH = None
VOCAB_AUTHORS_PATH = None
VOCAB_AUTHORS_RELPATH = None
VOCAB_PC_MAP_PATH = None
VOCAB_PC_MAP_RELPATH = None

vocab_entries = [
    ("entities", "vocab.entities"),
    ("aliases", "vocab.aliases"),
    ("author_aliases", "vocab.author_aliases"),
    ("player_character_map", "vocab.player_character_map"),
]

for key, label in vocab_entries:
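    # The vocabulary block is optional; guard against a missing or malformed block
    raw = vocab.get(key) if isinstance(vocab, dict) else None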
    if not raw:
        continue

    p = Path(raw)
    if not p.is_absolute():
        p = WORLD_ROOT / p
    p = p.resolve()

    try:
        rel = str(p.relative_to(WORLD_ROOT))
    except Exception:
        rel = str(p)

    if p.exists() and p.is_dir():
        warnings.append(f"{label}: {p} must be a file (got directory). Ignoring.")
        continue

    if not p.exists():
        warnings.append(f"{label}: file does not exist: {p} (name resolution may be limited).")

    if key == "entities":
        VOCAB_ENTITIES_PATH = p
        VOCAB_ENTITIES_RELPATH = rel
    elif key == "aliases":
        VOCAB_ALIASES_PATH = p
        VOCAB_ALIASES_RELPATH = rel
    elif key == "author_aliases":
        VOCAB_AUTHORS_PATH = p
        VOCAB_AUTHORS_RELPATH = rel
    elif key == "player_character_map":
        VOCAB_PC_MAP_PATH = p
        VOCAB_PC_MAP_RELPATH = rel

print("Descriptor paths are usable for this notebook.")
print(f"world_root: {WORLD_ROOT}")
print(f"indexes: {INDEXES_RELPATH}")
print(f"vocab.entities: {VOCAB_ENTITIES_RELPATH} (exists={VOCAB_ENTITIES_PATH.exists() if VOCAB_ENTITIES_PATH else False})")
print(f"vocab.aliases: {VOCAB_ALIASES_RELPATH} (exists={VOCAB_ALIASES_PATH.exists() if VOCAB_ALIASES_PATH else False})")
print(f"vocab.author_aliases: {VOCAB_AUTHORS_RELPATH} (exists={VOCAB_AUTHORS_PATH.exists() if VOCAB_AUTHORS_PATH else False})")
print(f"vocab.player_character_map: {VOCAB_PC_MAP_RELPATH} (exists={VOCAB_PC_MAP_PATH.exists() if VOCAB_PC_MAP_PATH else False})")

if warnings:
    print("\nWarnings:")
    for w in warnings:
        print(f"- {w}")

# cleanup
del yaml, Path
del descriptor_path, world_repo, indexes_block, vocab
del WORLD_REPOSITORY_DESCRIPTOR
del WORLD_ROOT_RAW, INDEXES_RAW, ENTITIES_RAW, ALIASES_RAW, AUTHORS_RAW, PC_MAP_RAW
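del vocab_entries, key, label, raw, errors, warnings, f
# p and rel exist only if at least one vocabulary entry was processed
globals().pop("p", None)
globals().pop("rel", None)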

World repository descriptor loaded successfully: world_repository.yml
Descriptor paths are usable for this notebook.
world_root: /Users/charissophia/obsidian/Iron Wolf Trading Company
indexes: _meta/indexes
vocab.entities: _meta/indexes/vocab_entities.csv (exists=True)
vocab.aliases: _meta/indexes/vocab_aliases.csv (exists=True)
vocab.author_aliases: _meta/indexes/vocab_author_aliases.csv (exists=True)
vocab.player_character_map: _meta/indexes/vocab_map_player_character.csv (exists=True)


## Phase 2: Load index artifacts

Before this notebook can execute any queries, it must confirm that the
required index artifacts already exist and can be loaded.

In this phase, the notebook:

- Constructs the expected index artifact filenames based on `INDEX_VERSION`
- Confirms those files exist under the repository’s declared `indexes.path`
- Loads each artifact as a raw dataframe
- Verifies that required columns are present
- Publishes stable dataframe variables for downstream query logic

This phase answers a single question:

**“Are the required index artifacts present and structurally usable?”**

If any required artifact is missing or malformed, the notebook will stop
with clear instructions explaining how to regenerate them.

No canonical files are modified.
No schema transformations are performed.
No normalization occurs.

This phase does not execute queries.
It only establishes the concrete, in-memory tables that the query layer will operate on.
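
The first and fourth steps above (building filenames from the version, then checking column presence) can be illustrated on toy data. This sketch uses the standard library's `csv` module rather than pandas to stay self-contained. The helper names are hypothetical and the CSV text is made up for illustration.

```python
import csv
import io


def version_suffix(index_version):
    """Normalize "V0", "v0", or "0" to the lowercase on-disk suffix "v0"."""
    return "v" + str(index_version).lower().lstrip("v")


def expected_filenames(index_version):
    """Build the four artifact filenames for a given index version."""
    suffix = version_suffix(index_version)
    stems = ["entity_to_chunks", "chunk_to_entities", "player_to_chunks", "source_files"]
    return {stem: f"index_{stem}_{suffix}.csv" for stem in stems}


def missing_columns(csv_text, required):
    """Return required column names absent from the CSV header, sorted."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    return sorted(set(required) - set(header))


# Toy CSV content (made up for illustration, not real index data).
toy_source_files = "source_id,relpath\n1,notes/session_01.md\n"

print(expected_filenames("V0")["source_files"])
print(missing_columns(toy_source_files, {"source_id", "relpath", "source_type"}))
```

The check looks only at column presence, which matches the phase's stated scope: values are not inspected or transformed.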


In [4]:
# Phase 2: Load index artifacts (v0)
LAST_PHASE_RUN = "2"

import pandas as pd
from pathlib import Path

errors = []

# Normalize INDEX_VERSION into the on-disk suffix (your files use lowercase v0)
# Accepts "V0", "v0", "0" (if you ever use that), but publishes "v0"
INDEX_VERSION_SUFFIX = f"v{str(INDEX_VERSION).lower().lstrip('v')}"

# Required artifact filenames (fixed contract for this notebook)
required = {
    "entity_to_chunks": f"index_entity_to_chunks_{INDEX_VERSION_SUFFIX}.csv",
    "chunk_to_entities": f"index_chunk_to_entities_{INDEX_VERSION_SUFFIX}.csv",
    "player_to_chunks": f"index_player_to_chunks_{INDEX_VERSION_SUFFIX}.csv",
    "source_files": f"index_source_files_{INDEX_VERSION_SUFFIX}.csv",
}

# Resolve paths and validate existence
INDEX_FILES = {}
for key, fname in required.items():
    p = (INDEXES_PATH / fname).resolve()
    INDEX_FILES[key] = p
    if not p.exists():
        errors.append(f"Missing required index artifact: {fname}\n  Expected at: {p}")

if errors:
    raise FileNotFoundError(
        "Phase 2 cannot proceed because required index artifacts are missing.\n\n"
        + "\n\n".join(errors)
        + "\n\nWhat to do:\n"
          "- Rerun IWTC_Raw_Source_Indexing.ipynb to generate the v0 artifacts\n"
          "- Ensure the resulting index_*.csv files are placed under your indexes.path directory\n"
          f"- indexes.path resolved to:\n  {INDEXES_PATH}\n"
          "- Then rerun Phase 2"
    )

# Load CSVs (raw)
DF_ENTITY_TO_CHUNKS = pd.read_csv(INDEX_FILES["entity_to_chunks"])
DF_CHUNK_TO_ENTITIES = pd.read_csv(INDEX_FILES["chunk_to_entities"])
DF_PLAYER_TO_CHUNKS = pd.read_csv(INDEX_FILES["player_to_chunks"])
DF_SOURCE_FILES = pd.read_csv(INDEX_FILES["source_files"])

# Validate required columns (presence only)
expected_cols = {
    "DF_ENTITY_TO_CHUNKS": {"entity_id", "canonical", "chunk_ids", "chunk_count", "file_relpaths", "file_count"},
    "DF_CHUNK_TO_ENTITIES": {
        "chunk_id", "source_id", "source_type", "relpath",
        "chunk_start_line", "chunk_end_line",
        "entity_ids", "canonicals", "entity_count",
        "matched_vocabs", "match_kinds",
    },
    "DF_PLAYER_TO_CHUNKS": {"player_entity_id", "canonical", "chunk_ids", "chunk_count", "file_relpaths", "file_count"},
    "DF_SOURCE_FILES": {"source_id", "relpath", "source_type"},
}

for df_name, cols in expected_cols.items():
    df = globals()[df_name]
    missing = [c for c in cols if c not in df.columns]
    if missing:
        errors.append(f"{df_name}: missing expected columns: {missing}")

if errors:
    raise ValueError(
        "One or more index artifacts were loaded but do not match expected v0 columns.\n- "
        + "\n- ".join(errors)
        + "\n\nWhat to do:\n"
          "- Confirm you are using the v0 CSVs produced by IWTC_Raw_Source_Indexing.ipynb\n"
          "- Do not edit the CSVs manually\n"
          "- If you changed the producer notebook, re-run it to regenerate indexes and retry"
    )

# Summary prints
print("Phase 2 OK: index artifacts loaded.")
print(f"indexes.path: {INDEXES_PATH}")
print(f"index version: {INDEX_VERSION_SUFFIX}")

print("\nLoaded tables:")
print(f"- DF_ENTITY_TO_CHUNKS:   {len(DF_ENTITY_TO_CHUNKS):>8} rows, {len(DF_ENTITY_TO_CHUNKS.columns):>3} cols")
print(f"- DF_CHUNK_TO_ENTITIES:  {len(DF_CHUNK_TO_ENTITIES):>8} rows, {len(DF_CHUNK_TO_ENTITIES.columns):>3} cols")
print(f"- DF_PLAYER_TO_CHUNKS:   {len(DF_PLAYER_TO_CHUNKS):>8} rows, {len(DF_PLAYER_TO_CHUNKS.columns):>3} cols")
print(f"- DF_SOURCE_FILES:       {len(DF_SOURCE_FILES):>8} rows, {len(DF_SOURCE_FILES.columns):>3} cols")

# Optional: quick column display (helps debugging early)
print("\nDF_ENTITY_TO_CHUNKS columns:", list(DF_ENTITY_TO_CHUNKS.columns))
print("DF_CHUNK_TO_ENTITIES columns:", list(DF_CHUNK_TO_ENTITIES.columns))
print("DF_PLAYER_TO_CHUNKS columns:", list(DF_PLAYER_TO_CHUNKS.columns))
print("DF_SOURCE_FILES columns:", list(DF_SOURCE_FILES.columns))

# cleanup locals
del pd, Path, errors, required, key, fname, p, cols, df_name, df, missing
del expected_cols, INDEX_VERSION_SUFFIX, INDEX_FILES

Phase 2 OK: index artifacts loaded.
indexes.path: /Users/charissophia/obsidian/Iron Wolf Trading Company/_meta/indexes
index version: v0

Loaded tables:
- DF_ENTITY_TO_CHUNKS:        168 rows,   6 cols
- DF_CHUNK_TO_ENTITIES:      1139 rows,  11 cols
- DF_PLAYER_TO_CHUNKS:          6 rows,   6 cols
- DF_SOURCE_FILES:            130 rows,   3 cols

DF_ENTITY_TO_CHUNKS columns: ['entity_id', 'canonical', 'chunk_ids', 'chunk_count', 'file_relpaths', 'file_count']
DF_CHUNK_TO_ENTITIES columns: ['chunk_id', 'source_id', 'source_type', 'relpath', 'chunk_start_line', 'chunk_end_line', 'entity_ids', 'canonicals', 'entity_count', 'matched_vocabs', 'match_kinds']
DF_PLAYER_TO_CHUNKS columns: ['player_entity_id', 'canonical', 'chunk_ids', 'chunk_count', 'file_relpaths', 'file_count']
DF_SOURCE_FILES columns: ['source_id', 'relpath', 'source_type']


In [5]:
# optional: clean up INDEXES path variables now that the artifacts are loaded into dataframes
del INDEXES_PATH, INDEXES_RELPATH

## Phase 3: Load vocabulary tables

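This phase loads the vocabulary tables that enable human-readable
resolution and display during querying. The entities table is required;
the alias, author-alias, and player-character tables are optional.

The notebook:

- Loads `vocab_entities.csv` (required)
- Loads `vocab_aliases.csv` (optional)
- Loads `vocab_author_aliases.csv` (optional)
- Loads `vocab_map_player_character.csv` (optional)
- Maps accepted column headers to standard names and validates that they are present
- Builds `DF_VOCAB_LOOKUP`, a unified table of canonical names, aliases, and author handles
- Publishes vocabulary dataframes for use in resolution helpers

This phase does not modify index tables and does not merge vocabulary data into them.
It only prepares lookup tables for name resolution and display.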
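
Because the vocabulary files are human-authored, their headers may vary. The following minimal sketch shows the header-mapping idea on a toy file: each semantic column accepts several header spellings, and the first one present wins. It uses the standard library instead of pandas. The function name `normalize_rows` and the sample row are assumptions made for illustration.

```python
import csv
import io

# Semantic column name -> accepted header spellings (same idea as the entity mapping
# used by this phase).
ENTITY_COLUMN_OPTIONS = {
    "entity_id": ["entity_id", "id"],
    "canonical": ["canonical", "canonical_name", "name"],
}


def normalize_rows(csv_text, column_options):
    """Read CSV text and return rows keyed by semantic column names only."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []

    # For each semantic column, pick the first accepted header that is present.
    chosen = {}
    for semantic, options in column_options.items():
        found = next((c for c in options if c in header), None)
        if found is not None:
            chosen[semantic] = found

    return [
        {semantic: (row.get(source) or "").strip() for semantic, source in chosen.items()}
        for row in reader
    ]


# Toy vocabulary file (made-up row) that uses "id" and "name" as headers.
toy_csv = "id,name,notes\nperson_example,Example Person,ignored\n"
print(normalize_rows(toy_csv, ENTITY_COLUMN_OPTIONS))
```

Columns that match no semantic name (such as `notes` here) are dropped, so downstream helpers can rely on a small, predictable set of column names.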

In [6]:
# Phase 3: Load vocabulary tables (human-authored CSVs; entities required)
LAST_PHASE_RUN = "3"

import pandas as pd
from pathlib import Path

errors = []
warnings = []

# ------------------------------------------------------------------
# Semantic column mappings
# ------------------------------------------------------------------
ENTITY_COLS = {
    "entity_id": ["entity_id", "id"],
    "canonical": ["canonical", "canonical_name", "name"],
}
ALIAS_COLS = {
    "entity_id": ["entity_id", "id"],
    "alias": ["alias", "alt", "alternate"],
}
AUTHOR_ALIAS_COLS = {
    "author": ["author", "discord_name", "handle"],
    "player_entity_id": ["player_entity_id", "player", "player_id"],
    "ambig_char_id": ["ambig_char_id", "ambiguous_character", "ambig_character"],
}
PC_MAP_COLS = {
    "player_entity_id": ["player_entity_id", "player", "player_id"],
    "char_entity_id": ["char_entity_id", "character_entity_id", "character"],
}

# ------------------------------------------------------------------
# Use descriptor-validated vocab paths (from Phase 1)
# ------------------------------------------------------------------
vocab_files = [
    ("entities", VOCAB_ENTITIES_PATH, ENTITY_COLS, True),
    ("aliases", VOCAB_ALIASES_PATH, ALIAS_COLS, False),
    ("author_aliases", VOCAB_AUTHORS_PATH, AUTHOR_ALIAS_COLS, False),
    ("pc_map", VOCAB_PC_MAP_PATH, PC_MAP_COLS, False),
]

# Published outputs
DF_VOCAB_ENTITIES = pd.DataFrame(columns=list(ENTITY_COLS.keys()))
DF_VOCAB_ALIASES = pd.DataFrame(columns=list(ALIAS_COLS.keys()))
DF_VOCAB_AUTHORS = pd.DataFrame(columns=list(AUTHOR_ALIAS_COLS.keys()))
DF_VOCAB_PC_MAP = pd.DataFrame(columns=list(PC_MAP_COLS.keys()))

# ------------------------------------------------------------------
# Load + normalize (looped, inline)
# ------------------------------------------------------------------
for key, path_obj, col_map, required in vocab_files:

    if not path_obj:
        if required:
            errors.append(f"Missing required path for {key} in descriptor.")
        continue

    p = Path(path_obj)

    if required and not p.exists():
        errors.append(f"Missing required vocabulary file:\n  {p}")
        continue

    if not p.exists():
        warnings.append(f"Optional vocab file not found: {p}")
        continue

    raw_df = pd.read_csv(p, dtype=str).fillna("")

    rename = {}
    for semantic, options in col_map.items():
        found = next((c for c in options if c in raw_df.columns), None)
        if found:
            rename[found] = semantic

    if len(raw_df) > 0 and not rename:
        warnings.append(
            f"[{key}] CSV has rows but none of the expected columns were found.\n"
            f"  CSV columns: {list(raw_df.columns)}\n"
            f"  Expected mapping: {col_map}\n"
            f"  File: {p}"
        )
        norm_df = pd.DataFrame(columns=list(col_map.keys()))
    else:
        out = raw_df.rename(columns=rename)
        keep = [k for k in col_map.keys() if k in out.columns]
        norm_df = out[keep].copy()
        # out and keep exist only in this branch, so release them here
        del out, keep

    if key == "entities":
        DF_VOCAB_ENTITIES = norm_df
    elif key == "aliases":
        DF_VOCAB_ALIASES = norm_df
    elif key == "author_aliases":
        DF_VOCAB_AUTHORS = norm_df
    elif key == "pc_map":
        DF_VOCAB_PC_MAP = norm_df

    del raw_df, rename, semantic, options, found, norm_df

# ------------------------------------------------------------------
# Hard validation: entities must be usable
# ------------------------------------------------------------------
if errors:
    raise FileNotFoundError(
        "Phase 3 cannot proceed.\n\n"
        + "\n\n".join(errors)
        + "\n\nFix the descriptor or vocabulary files, then rerun Phase 3."
    )

if DF_VOCAB_ENTITIES.empty or not {"entity_id", "canonical"}.issubset(DF_VOCAB_ENTITIES.columns):
    raise ValueError(
        "Entities vocab file loaded but it has no usable rows or is missing required columns.\n"
        "Ensure the CSV contains entity_id and canonical columns."
    )

# ------------------------------------------------------------------
# Build DF_VOCAB_LOOKUP (unified vocab table for remapping)
# Columns:
#   - vocab_id: entity_id or player_entity_id
#   - vocab: canonical / alias / author handle
#   - vocab_kind: "entity" | "alias" | "author"
#   - vocab_norm: lowercase normalized vocab for matching
# ------------------------------------------------------------------
rows = []

# Entities (canonical names)
for _, r in DF_VOCAB_ENTITIES.iterrows():
    vid = str(r.get("entity_id", "")).strip()
    v = str(r.get("canonical", "")).strip()
    if vid and v:
        rows.append([vid, v, "entity"])

# Aliases (optional)
if DF_VOCAB_ALIASES is not None and not DF_VOCAB_ALIASES.empty:
    for _, r in DF_VOCAB_ALIASES.iterrows():
        vid = str(r.get("entity_id", "")).strip()
        v = str(r.get("alias", "")).strip()
        if vid and v:
            rows.append([vid, v, "alias"])

# Author handles (optional)
if DF_VOCAB_AUTHORS is not None and not DF_VOCAB_AUTHORS.empty:
    for _, r in DF_VOCAB_AUTHORS.iterrows():
        vid = str(r.get("player_entity_id", "")).strip()
        v = str(r.get("author", "")).strip()
        if vid and v:
            rows.append([vid, v, "author"])

DF_VOCAB_LOOKUP = pd.DataFrame(rows, columns=["vocab_id", "vocab", "vocab_kind"])
DF_VOCAB_LOOKUP["vocab_norm"] = DF_VOCAB_LOOKUP["vocab"].astype(str).str.strip().str.lower()
DF_VOCAB_LOOKUP = DF_VOCAB_LOOKUP.drop_duplicates(
    subset=["vocab_id", "vocab_norm", "vocab_kind"]
).reset_index(drop=True)

del rows, r, vid, v

# ------------------------------------------------------------------
# Summary
# ------------------------------------------------------------------
print("Phase 3 OK: vocabulary tables loaded.")

print("\nLoaded vocab tables:")
print(f"- DF_VOCAB_ENTITIES: {len(DF_VOCAB_ENTITIES):>8} rows, {len(DF_VOCAB_ENTITIES.columns):>3} cols")
print(f"- DF_VOCAB_ALIASES:  {len(DF_VOCAB_ALIASES):>8} rows, {len(DF_VOCAB_ALIASES.columns):>3} cols")
print(f"- DF_VOCAB_AUTHORS:  {len(DF_VOCAB_AUTHORS):>8} rows, {len(DF_VOCAB_AUTHORS.columns):>3} cols")
print(f"- DF_VOCAB_PC_MAP:   {len(DF_VOCAB_PC_MAP):>8} rows, {len(DF_VOCAB_PC_MAP.columns):>3} cols")
print(f"- DF_VOCAB_LOOKUP:   {len(DF_VOCAB_LOOKUP):>8} rows, {len(DF_VOCAB_LOOKUP.columns):>3} cols")

if warnings:
    print("\nWarnings:")
    for w in warnings:
        print(f"- {w}")

# cleanup
del pd, Path
del errors, warnings, vocab_files, key, path_obj, col_map, required, p
del ENTITY_COLS, ALIAS_COLS, AUTHOR_ALIAS_COLS, PC_MAP_COLS

Phase 3 OK: vocabulary tables loaded.

Loaded vocab tables:
- DF_VOCAB_ENTITIES:      176 rows,   2 cols
- DF_VOCAB_ALIASES:        87 rows,   2 cols
- DF_VOCAB_AUTHORS:         6 rows,   3 cols
- DF_VOCAB_PC_MAP:         42 rows,   2 cols
- DF_VOCAB_LOOKUP:        269 rows,   4 cols


In [7]:
# optional: clean up VOCAB path variables
del VOCAB_ENTITIES_PATH, VOCAB_ENTITIES_RELPATH
del VOCAB_ALIASES_PATH, VOCAB_ALIASES_RELPATH
del VOCAB_AUTHORS_PATH, VOCAB_AUTHORS_RELPATH
del VOCAB_PC_MAP_PATH, VOCAB_PC_MAP_RELPATH

## Phase 4: Query building blocks

This phase defines the notebook's basic query tools ("building blocks") that later phases combine into DM-friendly questions.

Key ideas:
- A **chunk** is a small excerpt of a file, identified by `chunk_id` plus the file `relpath` and line range.
- An **entity** is a tracked name (character, place, faction, etc.) identified by `entity_id` and shown by `canonical`.

These tools let you type either a name (canonical or alias) or an ID, and then:
- find which chunks mention an entity or a player
- list which entities appear inside a given chunk
- pull the index rows needed to navigate back to the source files

In Phase 5/6 we will use these building blocks to answer practical questions like:
"Where does X appear?" and "What entities appear in file Y?"

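The core flow of these building blocks (resolve a typed name to an ID, then look up its chunks) can be sketched with plain dictionaries. Everything in this example is a toy assumption: the IDs, names, and chunk numbers are made up, the pipe-delimited `chunk_ids` encoding is only one of the forms the real parser accepts, and the function names are hypothetical stand-ins for the helpers defined in the next cell.

```python
# Toy lookup tables (made-up ids and names, not real campaign data).
ENTITIES = {"person_example": "Example Person", "place_harbor": "Old Harbor"}
ALIASES = {"the harbor": "place_harbor"}
# Assumption: chunk ids stored as pipe-delimited text.
ENTITY_TO_CHUNKS = {"person_example": "101|102", "place_harbor": "102|205"}


def resolve_entity_ids(name_or_id):
    """Resolve an id, canonical name, or alias to a list of entity ids."""
    q = str(name_or_id).strip()
    if q in ENTITIES:  # exact id match
        return [q]
    q_lower = q.lower()
    hits = [eid for eid, canonical in ENTITIES.items() if canonical.lower() == q_lower]
    alias_hit = ALIASES.get(q_lower)
    if alias_hit and alias_hit not in hits:
        hits.append(alias_hit)
    return hits


def chunk_ids_for(name_or_id):
    """Return the set of chunk ids in which the resolved entity appears."""
    found = set()
    for eid in resolve_entity_ids(name_or_id):
        found.update(part for part in ENTITY_TO_CHUNKS.get(eid, "").split("|") if part)
    return found


print(sorted(chunk_ids_for("The Harbor")))
print(sorted(chunk_ids_for("person_example")))
```

Separating name resolution from chunk lookup means the same resolver can serve every query that accepts a typed name, which is the pattern the functions below follow.
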
In [8]:
# Phase 4: Query building blocks (DM-facing names, v0)
LAST_PHASE_RUN = "4"

import ast
import json
import pandas as pd
from IPython.display import display

# Display settings (make tables readable in notebooks)
pd.set_option("display.max_colwidth", None)       # do not truncate long strings
pd.set_option("display.max_columns", None)        # show all columns
pd.set_option("display.width", 0)                 # let Jupyter decide width / wrap
pd.set_option("display.expand_frame_repr", False) # keep wide tables in one block instead of splitting columns across several


# -------------------------------------------------------------------
# Internal: parse list-encoded fields from CSVs
# -------------------------------------------------------------------
def _parse_list_field(raw):
    """
    Parse a list-like field (stored as text in the CSVs) into list[str].

    Supported forms:
    - "" / None / NaN -> []
    - JSON: ["a","b"]
    - Python repr: ['a','b']
    - Delimited: "a|b|c" or "a;b;c" or "a,b,c"
    - Single token: "a" -> ["a"]
    """
    if raw is None:
        return []
    if isinstance(raw, float) and pd.isna(raw):
        return []
    s = str(raw).strip()
    if not s:
        return []

    if s.startswith("[") and s.endswith("]"):
        # JSON
        try:
            v = json.loads(s)
            if isinstance(v, list):
                return [str(x).strip() for x in v if str(x).strip()]
        except Exception:
            pass

        # Python repr
        try:
            v = ast.literal_eval(s)
            if isinstance(v, (list, tuple)):
                return [str(x).strip() for x in v if str(x).strip()]
        except Exception:
            pass

    for delim in ["|", ";", ","]:
        if delim in s:
            out = [x.strip() for x in s.split(delim)]
            return [x for x in out if x]

    return [s]

# Phase 4 demo collector (DM-facing)
phase4_demo = []
phase4_demo_columns = ["Call", "Result"]

# -------------------------------------------------------------------
# Name resolution (entity + player)
# -------------------------------------------------------------------
def find_entity_ids(name_or_id, include_aliases=True):
    """
    Resolve a user input (name, alias, or ID) to matching entity_id(s).

    You can type:
    - an entity_id (exact)
    - a canonical name (case-insensitive)
    - an alias (case-insensitive) if aliases exist and include_aliases=True

    Returns:
    - list[str] of entity_id (possibly empty, possibly multiple)
    """
    if name_or_id is None:
        return []

    q = str(name_or_id).strip()
    if not q:
        return []

    # Direct ID match (fast path)
    if "entity_id" in DF_VOCAB_ENTITIES.columns:
        if q in set(DF_VOCAB_ENTITIES["entity_id"]):
            return [q]

    q_lower = q.lower()

    # Canonical match
    canon_hits = DF_VOCAB_ENTITIES[
        DF_VOCAB_ENTITIES["canonical"].astype(str).str.lower() == q_lower
    ]["entity_id"].tolist()

    hits = list(dict.fromkeys([x for x in canon_hits if str(x).strip()]))

    # Alias match (optional)
    if include_aliases and DF_VOCAB_ALIASES is not None and not DF_VOCAB_ALIASES.empty:
        alias_hits = DF_VOCAB_ALIASES[
            DF_VOCAB_ALIASES["alias"].astype(str).str.lower() == q_lower
        ]["entity_id"].tolist()

        for eid in alias_hits:
            if eid and eid not in hits:
                hits.append(eid)

    return hits

# sanity check
phase4_demo.append(['find_entity_ids("Henry")', find_entity_ids("Henry")])


def find_player_ids(author_or_player_id):
    """
    Resolve a user input to player_entity_id(s).

    You can type:
    - a player_entity_id (exact)
    - a player canonical name (case-insensitive) if it exists in vocab.entities
    - an author handle (Discord name) if author aliases exist

    Returns:
    - list[str] of player_entity_id
    """
    if author_or_player_id is None:
        return []

    q = str(author_or_player_id).strip()
    if not q:
        return []

    # Direct match against player index
    if "player_entity_id" in DF_PLAYER_TO_CHUNKS.columns:
        if q in set(DF_PLAYER_TO_CHUNKS["player_entity_id"]):
            return [q]

    q_lower = q.lower()

    # Author handle lookup (optional)
    if DF_VOCAB_AUTHORS is not None and not DF_VOCAB_AUTHORS.empty:
        hits = DF_VOCAB_AUTHORS[
            DF_VOCAB_AUTHORS["author"].astype(str).str.lower() == q_lower
        ]["player_entity_id"].tolist()

        hits = [x.strip() for x in hits if str(x).strip()]
        hits = list(dict.fromkeys(hits))
        if hits:
            return hits

    # Player canonical lookup (optional but DM-friendly)
    if DF_VOCAB_ENTITIES is None or DF_VOCAB_ENTITIES.empty:
        return []

    canon_hits = DF_VOCAB_ENTITIES[
        DF_VOCAB_ENTITIES["canonical"].astype(str).str.lower() == q_lower
    ]["entity_id"].tolist()

    canon_hits = [x.strip() for x in canon_hits if str(x).strip()]

    # Only keep ids that are actually players (as defined by DF_PLAYER_TO_CHUNKS)
    player_ids = set(DF_PLAYER_TO_CHUNKS["player_entity_id"]) if "player_entity_id" in DF_PLAYER_TO_CHUNKS.columns else set()
    canon_hits = [x for x in canon_hits if x in player_ids]
    canon_hits = list(dict.fromkeys(canon_hits))

    return canon_hits

# sanity check
phase4_demo.append(['find_player_ids("CroweTheDualityKing")', find_player_ids("CroweTheDualityKing")])
phase4_demo.append(['find_player_ids("Crowe")', find_player_ids("Crowe")])


def find_character_ids_for_player(player_entity_id):
    """
    Return character entity_id(s) mapped to a player_entity_id (if a PC map exists).

    Returns:
    - list[str] (possibly empty)
    """
    if not player_entity_id:
        return []

    if DF_VOCAB_PC_MAP is None or DF_VOCAB_PC_MAP.empty:
        return []

    q = str(player_entity_id).strip()
    if not q:
        return []

    hits = DF_VOCAB_PC_MAP[
        DF_VOCAB_PC_MAP["player_entity_id"].astype(str) == q
    ]["char_entity_id"].tolist()

    hits = [x.strip() for x in hits if str(x).strip()]
    hits = list(dict.fromkeys(hits))
    return hits

# sanity check
phase4_demo.append(['find_character_ids_for_player("player_crowe")', find_character_ids_for_player("player_crowe")])


# -------------------------------------------------------------------
# Chunk lookup
# -------------------------------------------------------------------
def find_chunk_ids_for_entity(entity_name_or_id, include_aliases=True):
    """
    Return a set of chunk_id where an entity appears.

    Input can be a canonical name, alias (if enabled), or entity_id.
    """
    eids = find_entity_ids(entity_name_or_id, include_aliases=include_aliases)
    if not eids:
        return set()

    rows = DF_ENTITY_TO_CHUNKS[DF_ENTITY_TO_CHUNKS["entity_id"].isin(eids)]
    out = set()

    for raw in rows["chunk_ids"].tolist():
        for cid in _parse_list_field(raw):
            out.add(cid)

    return out

# sanity check
phase4_demo.append(['find_chunk_ids_for_entity("person_henry")', find_chunk_ids_for_entity("person_henry")])
phase4_demo.append(['find_chunk_ids_for_entity("artifact_folly")', find_chunk_ids_for_entity("artifact_folly")])


def find_chunk_ids_for_player(player_name_or_id):
    """
    Return a set of chunk_id associated with a player.

    Input can be a player_entity_id, a player canonical name, or an author handle
    (if author aliases exist).
    """
    pids = find_player_ids(player_name_or_id)
    if not pids:
        return set()

    rows = DF_PLAYER_TO_CHUNKS[DF_PLAYER_TO_CHUNKS["player_entity_id"].isin(pids)]
    out = set()

    for raw in rows["chunk_ids"].tolist():
        for cid in _parse_list_field(raw):
            out.add(cid)

    return out

# sanity check
phase4_demo.append(['find_chunk_ids_for_player("Lia")', find_chunk_ids_for_player("Lia")])


# -------------------------------------------------------------------
# Chunk inspection
# -------------------------------------------------------------------
def get_chunk_row(chunk_id):
    """
    Return the DF_CHUNK_TO_ENTITIES row for a chunk_id (0 or 1 rows).
    """
    if chunk_id is None:
        return DF_CHUNK_TO_ENTITIES.iloc[0:0].copy()

    try:
        q = int(chunk_id)
    except (TypeError, ValueError):
        raise ValueError(f"chunk_id must be an integer. Received: {chunk_id}") from None

    return DF_CHUNK_TO_ENTITIES[
        DF_CHUNK_TO_ENTITIES["chunk_id"] == q
    ].copy()

# sanity check
phase4_demo.append(['get_chunk_row(169182)', get_chunk_row(169182)])


def get_chunk_rows(chunk_ids):
    """
    Return DF_CHUNK_TO_ENTITIES rows for a set/list of chunk_ids.
    """
    if chunk_ids is None:
        return DF_CHUNK_TO_ENTITIES.iloc[0:0].copy()

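    # Treat a bare string as a single id rather than iterating its characters.
    if isinstance(chunk_ids, str):
        chunk_ids = [chunk_ids]

    try:
        ids = [int(x) for x in chunk_ids]
    except (TypeError, ValueError):
        raise ValueError(f"All chunk_ids must be integers. Received: {chunk_ids}") from None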

    if not ids:
        return DF_CHUNK_TO_ENTITIES.iloc[0:0].copy()

    return DF_CHUNK_TO_ENTITIES[
        DF_CHUNK_TO_ENTITIES["chunk_id"].isin(ids)
    ].copy()


# sanity check
phase4_demo.append(["get_chunk_rows(['169182', '169233', '169237'])", get_chunk_rows(['169182', '169233', '169237'])])


def list_entity_ids_in_chunk(chunk_id):
    """
    Return entity_id list in a chunk (order as stored in the index).
    """
    df = get_chunk_row(chunk_id)
    if df.empty:
        return []
    return _parse_list_field(df.iloc[0]["entity_ids"])

# sanity check
phase4_demo.append(['list_entity_ids_in_chunk(169182)', list_entity_ids_in_chunk(169182)])


def list_entity_names_in_chunk(chunk_id):
    """
    Return canonical-name list in a chunk (order as stored in the index).
    """
    df = get_chunk_row(chunk_id)
    if df.empty:
        return []
    return _parse_list_field(df.iloc[0]["canonicals"])

# sanity check
phase4_demo.append(['list_entity_names_in_chunk(169182)', list_entity_names_in_chunk(169182)])


# -------------------------------------------------------------------
# Print out the sanity checks
# -------------------------------------------------------------------
print("Phase 4 query building blocks defined:")
DF_PHASE4_DEMO = pd.DataFrame(phase4_demo, columns=phase4_demo_columns)
display(DF_PHASE4_DEMO)

# clean-up locals
del phase4_demo, phase4_demo_columns, DF_PHASE4_DEMO

Phase 4 query building blocks defined:


Unnamed: 0,Call,Result
0,"find_entity_ids(""Henry"")",[person_henry]
1,"find_player_ids(""CroweTheDualityKing"")",[player_crowe]
2,"find_player_ids(""Crowe"")",[player_crowe]
3,"find_character_ids_for_player(""player_crowe"")","[person_aniya, person_credus, person_liavarah, person_mowser, person_unala, person_victor]"
4,"find_chunk_ids_for_entity(""person_henry"")","{169824, 169511, 169936, 169002, 169084, 169395, 169870, 169792, 169934, 168690, 169745, 169755, 169492, 168998, 169389, 169509, 168905, 168684, 169803, 169802, 169338, 168984, 169402, 169763, 169185, 168865, 168673, 169729, 169376, 168762, 169274, 169742, 169868, 169087, 169854, 169815, 169398, 169748, 169897, 169926, 169740, 169906, 169743, 169387, 169810, 169874, 169079, 168813, 169806, 168944, 168940, 169739, 169170, 169050, 169778, 169015, 168834, 169702, 169412, 168817, 168675, 169790, 168743, 169233, 169083, 169332, 168915, 169121, 169431, 169251, 168748, 169771, 169513, 169927, 169349, 169817, 169786, 169905, 169732, 169527, 168765, 168826, 169322, 169372, 169191, 169769, 169938, 168712, 169271, 169882, 168693, 169299, 169189, 169603, 169738, 169785, 169118, 169237, 169471, 169775, ...}"
5,"find_chunk_ids_for_entity(""artifact_folly"")","{169593, 169936, 169941, 169841, 169808, 169925, 169839, 169931, 169921, 169831, 169942, 169812, 169801, 169805, 169923, 169390, 169944, 169799, 169810, 169800, 169803, 169802, 169937}"
6,"find_chunk_ids_for_player(""Lia"")","{168848, 168976, 169133, 168852, 169257, 169222, 168846, 168850}"
7,get_chunk_row(169182),chunk_id source_id source_type relpath chunk_start_line chunk_end_line entity_ids canonicals entity_count matched_vocabs match_kinds 413 169182 110 pbp_transcripts _local/pbp_transcripts/PbP15 - Debrief and Safety.md 350 358 faction_hands|faction_thieves_guild|faction_tolanites|org_king|person_alivyre|person_gina|person_henry|person_ronric|person_victor|player_bysickle Alivyre Dawntracker|Bysickle|Gina|Hands of Liamassa|Henry Sleepsong|King|Ronric|Thieves Guild|Tolanites|Victor D Evernight 10 Alivyre|Bysickle|Gina|Hands|Henry|King|Ronric|Thieves Guild|Tolanite|Victor alias|canonical
8,"get_chunk_rows(['169182', '169233', '169237'])",chunk_id source_id source_type relpath chunk_start_line chunk_end_line entity_ids canonicals entity_count matched_vocabs match_kinds 413 169182 110 pbp_transcripts _local/pbp_transcripts/PbP15 - Debrief and Safety.md 350 358 faction_hands|faction_thieves_guild|faction_tolanites|org_king|person_alivyre|person_gina|person_henry|person_ronric|person_victor|player_bysickle Alivyre Dawntracker|Bysickle|Gina|Hands of Liamassa|Henry Sleepsong|King|Ronric|Thieves Guild|Tolanites|Victor D Evernight 10 Alivyre|Bysickle|Gina|Hands|Henry|King|Ronric|Thieves Guild|Tolanite|Victor alias|canonical 459 169233 110 pbp_transcripts _local/pbp_transcripts/PbP15 - Debrief and Safety.md 685 689 person_alivyre|person_henry|person_victor|place_crafthold|place_elysia Alivyre Dawntracker|Crafthold|Elysia|Henry Sleepsong|Victor D Evernight 5 Alivyre|Crafthold|Elysia|Shadowboy|Victor alias|canonical 463 169237 110 pbp_transcripts _local/pbp_transcripts/PbP15 - Debrief and Safety.md 705 709 faction_hands|person_alivyre|person_henry|person_vyssa Alivyre Dawntracker|Hands of Liamassa|Henry Sleepsong|Vyssa 4 Alivyre|Hand|Henry|Vyssa alias|canonical
9,list_entity_ids_in_chunk(169182),"[faction_hands, faction_thieves_guild, faction_tolanites, org_king, person_alivyre, person_gina, person_henry, person_ronric, person_victor, player_bysickle]"


# Example Queries (DM workflow)

Use these blocks to answer common questions. Each block is independent.
Most queries return a table you can sort, filter, or copy into notes.

## Q1: Where does entity X appear?

This query answers:

"Where does a specific entity appear in the indexed material?"

You can enter:
- a canonical name (e.g., "Henry Sleepsong")
- a short name if it exists in vocabulary (e.g., "Henry")
- an entity_id (e.g., "person_henry")

The result is a table of chunk rows showing:
- chunk_id
- file (relpath)
- line range
- entity_count (how dense the chunk is)

Edit the ENTITY value below and re-run the code cell.

In [9]:
# -------------------------------------------------------------------
# Q1: Where does entity X appear?
# -------------------------------------------------------------------

ENTITY = "Bronze Flame"  # <-- edit this

entity_ids = find_entity_ids(ENTITY)

if not entity_ids:
    print(f"No entity found matching: {ENTITY}")

else:
    canonicals = DF_VOCAB_ENTITIES[
        DF_VOCAB_ENTITIES["entity_id"].isin(entity_ids)
    ]["canonical"].tolist()

    chunk_ids = set()
    for eid in entity_ids:
        chunk_ids |= find_chunk_ids_for_entity(eid)

    if not chunk_ids:
        print(f"Entity '{ENTITY}' resolved to {entity_ids}, but no chunks were found.")

    else:
        df = get_chunk_rows(chunk_ids)[
            ["chunk_id", "relpath", "chunk_start_line", "chunk_end_line", "entity_count"]
        ].sort_values(["relpath", "chunk_start_line"])

        print(f"Entity: {ENTITY}")
        print(f"Canonical name(s): {canonicals}")
        print(f"Resolved entity_ids: {entity_ids}")
        print(f"Chunks found: {len(df)}")
        print(f'World Root: cd "{WORLD_ROOT}"')

        df["cmd_show_chunk"] = df.apply(
            lambda r: (
                f'nl -ba "{r["relpath"]}" | '
                f"sed -n '{int(r['chunk_start_line'])},{int(r['chunk_end_line'])}p'"
            ),
            axis=1
        )

        display(df)

        del df

    del chunk_ids, eid, canonicals

del ENTITY, entity_ids

No entity found matching: Bronze Flame


## Q2: What entities appear in file Y?

This query lists the entities found in a specific file, aggregated across all chunks.

Output is one row per entity, including:
- canonical entity name
- entity_id
- mentions (number of matched vocabulary entries for the entity, summed across the file's chunks)
- how the entity appeared in the text (useful for searching)
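
As a rough illustration of how the mentions count comes about, the sketch below splits pipe-delimited `matched_vocabs` values, maps each surface form to an entity, and tallies the results. The data is made up, and `vocab_to_entity` and `count_mentions` are hypothetical stand-ins for the `DF_VOCAB_LOOKUP` merge in the real cell, not part of this notebook.

```python
from collections import Counter

# Toy stand-ins (assumptions): three chunks from one file, each with a
# pipe-delimited matched_vocabs field shaped like the real index rows.
matched_vocabs_per_chunk = [
    "Henry|Victor",
    "Shadowboy|Victor",
    "Hands|Henry",
]

# Hypothetical lookup: normalized surface form -> entity_id.
vocab_to_entity = {
    "henry": "person_henry",
    "shadowboy": "person_henry",
    "victor": "person_victor",
    "hands": "faction_hands",
}


def count_mentions(rows, lookup):
    """Count matched surface forms per entity across chunk rows."""
    counts = Counter()
    for raw in rows:
        for part in str(raw).split("|"):
            key = part.strip().lower()
            if key in lookup:
                counts[lookup[key]] += 1
    return counts


mentions = count_mentions(matched_vocabs_per_chunk, vocab_to_entity)
for entity_id, n in mentions.most_common():
    print(entity_id, n)
```

In this toy data `person_henry` scores 3 because two different surface forms ("Henry", "Shadowboy") both resolve to it. If the index lists each surface form at most once per chunk, the count is a chunk-level tally rather than a count of every occurrence in the text.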

In [57]:
# -------------------------------------------------------------------
# Q2: What entities appear in file Y?
# -------------------------------------------------------------------

FILE = "_local/pbp_transcripts/PbP15 - Debrief and Safety.md"  # <-- edit this

df_file = DF_CHUNK_TO_ENTITIES[DF_CHUNK_TO_ENTITIES["relpath"] == FILE]

df_mentions_in_file = (
    pd.Series([
        p.strip().lower()
        for raw in df_file["matched_vocabs"]
        for p in str(raw).split("|")
        if p.strip()
    ], name="vocab_norm")
    .to_frame()
    .merge(
        DF_VOCAB_LOOKUP[["vocab_id", "vocab_norm", "vocab"]],
        on="vocab_norm",
        how="left"
    )
)

df_entities_in_file = (
    df_mentions_in_file
    .dropna(subset=["vocab_id"])
    .groupby("vocab_id", as_index=False)
    .agg(
        mentions=("vocab_norm", "size"),
        vocab=("vocab", lambda x: sorted(set(x)))
    )
    .merge(
        DF_VOCAB_ENTITIES[["entity_id", "canonical"]],
        left_on="vocab_id",
        right_on="entity_id",
        how="left"
    )
    .drop(columns=["vocab_id"])
    .loc[:, ["canonical", "entity_id", "mentions", "vocab"]]
    .sort_values(["mentions", "entity_id"], ascending=[False, True])
    .reset_index(drop=True)
)

print(f'World Root: cd "{WORLD_ROOT}"')
print(f"All tracked entities found in: {FILE}")
print(f"Chunks scanned: {len(df_file)}")
print(f"Entities found: {len(df_entities_in_file)}\n")

display(
    df_entities_in_file
    .assign(vocab=lambda d: d["vocab"].apply(lambda x: ", ".join(x) if isinstance(x, list) else str(x)))
    .rename(columns={
        "canonical": "Entity",
        "mentions": "Mentions",
        "vocab": "How it appears in the text",
    })
)

del FILE, df_file, df_mentions_in_file, df_entities_in_file

World Root: cd "/Users/charissophia/obsidian/Iron Wolf Trading Company"
All tracked entities found in: _local/pbp_transcripts/PbP15 - Debrief and Safety.md
Chunks scanned: 116
Entities found: 48



Unnamed: 0,Entity,entity_id,Mentions,How it appears in the text
0,Victor D Evernight,person_victor,70,Victor
1,Alivyre Dawntracker,person_alivyre,45,Alivyre
2,Henry Sleepsong,person_henry,45,"Henry, Shadowboy"
3,Hands of Liamassa,faction_hands,28,"Hand, Hands, Hands of Liamassa"
4,Faeryne,person_faeryne,23,Faeryne
5,Luminia,person_luminia,21,Luminia
6,Bysickle,player_bysickle,18,Bysickle
7,Terry,person_terry,17,Terry
8,Kalina,player_kalina,16,"Kalina, Kalina Hitana"
9,Crowe,player_crowe,15,"Crowe, CroweTheDualityKing"


## Q3: What entities co-occur with entity X?

This question asks:

**Which tracked entities appear in the same chunks as a given entity?**

In other words:
- Where does ENTITY appear?
- In those same chunks, what *other* entities are mentioned?
- How often do they appear together?

This helps answer questions like:
- Who most frequently interacts with this character?
- What factions tend to appear alongside this place?
- What players are active in the same scenes?
- What unexpected connections exist?

The result shows:

- **Entity** — canonical name
- **entity_id** — stable index identifier
- **Co-occurrence chunks** — number of chunks where both appear

Higher counts suggest stronger narrative proximity.

Edit `ENTITY` in the code cell below to explore different relationships.
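
The counting idea can be sketched on toy data without pandas. The chunk numbers, entity lists, and the helper name `cooccurrence_counts` are illustrative assumptions; the real cell works from `DF_CHUNK_TO_ENTITIES`.

```python
from collections import Counter

# Toy chunk_id -> pipe-delimited entity_ids (assumed data).
chunk_entities = {
    101: "person_avarna|person_henry|place_dhassa",
    102: "person_avarna|person_henry",
    103: "person_henry|person_victor",
}


def cooccurrence_counts(chunks, target_id):
    """Count, per other entity, how many chunks it shares with target_id."""
    counts = Counter()
    for raw in chunks.values():
        ids = {p.strip() for p in raw.split("|") if p.strip()}
        if target_id in ids:
            # Every other entity in this chunk co-occurs with the target once.
            counts.update(ids - {target_id})
    return counts


co = cooccurrence_counts(chunk_entities, "person_avarna")
print(co.most_common())
```

Chunk 103 does not contain the target, so `person_victor` never enters the tally even though it shares a chunk with `person_henry`.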

In [81]:
# -------------------------------------------------------------------
# Q3: What entities co-occur with entity X?
# -------------------------------------------------------------------

ENTITY = "Avarna"  # <-- edit this

target_eids = set(find_entity_ids(ENTITY))

df_chunks = get_chunk_rows(set().union(*[
    find_chunk_ids_for_entity(eid)
    for eid in target_eids
]))

df_cooccur = (
    pd.Series(
        [
            eid.strip()
            for raw in df_chunks["entity_ids"]
            for eid in str(raw).split("|")
            if eid.strip() and eid.strip() not in target_eids
        ],
        name="entity_id"
    )
    .value_counts()
    .rename_axis("entity_id")
    .reset_index(name="co_chunks")
    .merge(DF_VOCAB_ENTITIES[["entity_id", "canonical"]], on="entity_id", how="left")
    .loc[:, ["canonical", "entity_id", "co_chunks"]]
    .rename(columns={"canonical": "Entity", "co_chunks": "Co-occurrence chunks"})
    .sort_values(["Co-occurrence chunks", "entity_id"], ascending=[False, True])
    .reset_index(drop=True)
)

print(f"Co-occurring entities with: {ENTITY}")
print(f"Target entity_id(s): {sorted(target_eids)}")
print(f"Chunks scanned: {len(df_chunks)}\n")
display(df_cooccur)

del ENTITY, target_eids, df_chunks, df_cooccur


Co-occurring entities with: Avarna
Target entity_id(s): ['person_avarna']
Chunks scanned: 15



Unnamed: 0,Entity,entity_id,Co-occurrence chunks
0,Henry Sleepsong,person_henry,14
1,Alivyre Dawntracker,person_alivyre,13
2,Victor D Evernight,person_victor,13
3,Faeryne,person_faeryne,11
4,Shworn Sleepsong,person_shworn,11
5,Luminia,person_luminia,10
6,Terry,person_terry,8
7,Dhassa,place_dhassa,8
8,Bysickle,player_bysickle,6
9,Crowe,player_crowe,6


## Q4: Where do X and Y overlap?

This question asks:

**In which specific chunks do entity X and entity Y both appear?**

This is useful when:
- You want the exact scenes where two characters interact.
- You are investigating a surprising co-occurrence from Q3.
- You want to inspect narrative context directly.

The result shows:

- chunk_id  
- relpath  
- chunk_start_line  
- chunk_end_line  
- entity_count  

Use the chunk information to open the file and inspect the scene directly.
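
If `nl` and `sed` are not available (for example on Windows), a small Python helper can print the same line range. This is a minimal sketch: `read_chunk_lines` is a hypothetical name, and it assumes chunk line numbers are 1-based and inclusive, as the generated `sed -n` range implies. The demo uses a throwaway file rather than a real transcript.

```python
import tempfile
from pathlib import Path


def read_chunk_lines(path, start_line, end_line):
    """Return (line_number, text) pairs for a 1-based inclusive line range."""
    out = []
    with open(path, encoding="utf-8") as fh:
        for n, line in enumerate(fh, start=1):
            if n > end_line:
                break
            if n >= start_line:
                out.append((n, line.rstrip("\n")))
    return out


# Demo on a temporary ten-line file.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.md"
    demo.write_text("\n".join(f"line {i}" for i in range(1, 11)), encoding="utf-8")
    chunk = read_chunk_lines(demo, 3, 5)

for n, text in chunk:
    print(f"{n:6d}\t{text}")
```

With a real index row, the path would be the world root joined with `relpath`, and the range would come from `chunk_start_line` and `chunk_end_line`.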

Edit `ENTITY_X` and `ENTITY_Y` in the code cell below.

The code cell below uses `ENTITY_X = "Avarna"` and `ENTITY_Y = "Aren"` as an example. It:

1) Resolves both names to entity_ids.
2) Gets the chunk set for each entity.
3) Intersects the two sets.
4) Pulls the matching rows from `DF_CHUNK_TO_ENTITIES`.
5) Displays them.

No aggregation is performed; the result is just the overlap.
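
The core of the overlap is a plain set intersection. The sketch below shows it on invented data; `chunks_by_entity` and `shared_chunks` are hypothetical names, and the real cell obtains the sets from `find_chunk_ids_for_entity`.

```python
# Toy entity_id -> chunk_id sets (assumed values).
chunks_by_entity = {
    "person_avarna": {101, 102, 103},
    "person_aren": {102, 103, 104},
}


def shared_chunks(index, entity_x, entity_y):
    """Return the sorted chunk_ids in which both entities appear."""
    return sorted(index.get(entity_x, set()) & index.get(entity_y, set()))


overlap = shared_chunks(chunks_by_entity, "person_avarna", "person_aren")
print(overlap)
```

An entity that is missing from the index yields an empty set, so the intersection is empty rather than an error.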

In [85]:
# -------------------------------------------------------------------
# Q4: Where do X and Y overlap?
# -------------------------------------------------------------------

ENTITY_X = "Avarna"  # <-- edit this
ENTITY_Y = "Aren"    # <-- edit this

x_ids = find_entity_ids(ENTITY_X)
y_ids = find_entity_ids(ENTITY_Y)

x_chunks = set().union(*[find_chunk_ids_for_entity(eid) for eid in x_ids]) if x_ids else set()
y_chunks = set().union(*[find_chunk_ids_for_entity(eid) for eid in y_ids]) if y_ids else set()

shared_chunks = x_chunks & y_chunks

df_overlap = (
    get_chunk_rows(shared_chunks)
    .loc[:, ["chunk_id", "relpath", "chunk_start_line", "chunk_end_line", "entity_count"]]
    .sort_values(["relpath", "chunk_start_line"])
    .assign(cmd_show_chunk=lambda d: d.apply(
        lambda r: (
            f'nl -ba "{r["relpath"]}" | '
            f"sed -n '{int(r['chunk_start_line'])},{int(r['chunk_end_line'])}p'"
        ),
        axis=1
    ))
    .reset_index(drop=True)
)

print(f"Overlap: {ENTITY_X} ∩ {ENTITY_Y}")
print(f"{ENTITY_X} entity_id(s): {x_ids}")
print(f"{ENTITY_Y} entity_id(s): {y_ids}")
print(f"Shared chunks: {len(df_overlap)}\n")

display(df_overlap)

del ENTITY_X, ENTITY_Y, x_ids, y_ids, x_chunks, y_chunks, shared_chunks, df_overlap

Overlap: Avarna ∩ Aren
Avarna entity_id(s): ['person_avarna']
Aren entity_id(s): ['person_aren']
Shared chunks: 5



Unnamed: 0,chunk_id,relpath,chunk_start_line,chunk_end_line,entity_count,cmd_show_chunk
0,168940,_local/pbp_transcripts/PbP12 - Meeting in the Vestry.md,901,909,4,"nl -ba ""_local/pbp_transcripts/PbP12 - Meeting in the Vestry.md"" | sed -n '901,909p'"
1,169819,_local/session_notes/IWTC session notes 101-150.md,480,509,16,"nl -ba ""_local/session_notes/IWTC session notes 101-150.md"" | sed -n '480,509p'"
2,169820,_local/session_notes/IWTC session notes 101-150.md,510,539,20,"nl -ba ""_local/session_notes/IWTC session notes 101-150.md"" | sed -n '510,539p'"
3,169823,_local/session_notes/IWTC session notes 101-150.md,590,609,20,"nl -ba ""_local/session_notes/IWTC session notes 101-150.md"" | sed -n '590,609p'"
4,169829,_local/session_notes/IWTC session notes 101-150.md,731,756,18,"nl -ba ""_local/session_notes/IWTC session notes 101-150.md"" | sed -n '731,756p'"


## Q5: What does player Z usually write about?

This shows the tracked entities (and how they appear in text) that show up most often in chunks associated with a given player.

Interpretation:
- This is based on the indexing output, not semantic understanding.
- False positives can occur (example: "aren't" matching "Aren").

Use this to discover patterns, then inspect the specific overlap chunks (Q4) when a result looks suspicious.
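
The "aren't" case can be reproduced with a simple matcher. The indexer's actual matching logic is not shown in this notebook, so the sketch below only assumes a case-insensitive word-boundary regex; `find_term` and its `strict` option are hypothetical.

```python
import re


def find_term(term, text, strict=False):
    """Find case-insensitive whole-word matches of term in text.

    With strict=True, a match followed by an apostrophe and a letter
    (as in "aren't") is rejected.
    """
    pattern = r"\b" + re.escape(term) + r"\b"
    if strict:
        # Negative lookahead: skip matches that continue as a contraction.
        pattern += r"(?!['’]\w)"
    return [m.group(0) for m in re.finditer(pattern, text, flags=re.IGNORECASE)]


sample = "They aren't sure where Aren went."
loose = find_term("Aren", sample)
strict = find_term("Aren", sample, strict=True)
print(loose)
print(strict)
```

The loose pattern matches inside "aren't" because the apostrophe counts as a word boundary. The strict variant has a trade-off: it would also skip possessives such as "Aren's", so it is not a drop-in fix.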

In [89]:
# -------------------------------------------------------------------
# Q5: What does player Z usually write about?
# -------------------------------------------------------------------

PLAYER = "Kalina"  # <-- edit this (handle, alias, or player_entity_id)

player_ids = find_player_ids(PLAYER)
player_chunks = find_chunk_ids_for_player(PLAYER)
df_player_chunks = get_chunk_rows(player_chunks)

df_mentions_in_player_chunks = (
    pd.Series([
        p.strip().lower()
        for raw in df_player_chunks["matched_vocabs"]
        for p in str(raw).split("|")
        if p.strip()
    ], name="vocab_norm")
    .to_frame()
    .merge(
        DF_VOCAB_LOOKUP[["vocab_id", "vocab_norm", "vocab"]],
        on="vocab_norm",
        how="left"
    )
)

df_entities_for_player = (
    df_mentions_in_player_chunks
    .dropna(subset=["vocab_id"])
    .groupby("vocab_id", as_index=False)
    .agg(
        mentions=("vocab_norm", "size"),
        vocab=("vocab", lambda x: sorted(set(x)))
    )
    .merge(
        DF_VOCAB_ENTITIES[["entity_id", "canonical"]],
        left_on="vocab_id",
        right_on="entity_id",
        how="left"
    )
    .drop(columns=["vocab_id"])
    .loc[:, ["canonical", "entity_id", "mentions", "vocab"]]
    .rename(columns={
        "canonical": "Entity",
        "mentions": "Mentions",
        "vocab": "How it appears in the text",
    })
    .sort_values(["Mentions", "entity_id"], ascending=[False, True])
    .reset_index(drop=True)
)

print(f"World Root: cd \"{WORLD_ROOT.as_posix()}\"")
print(f"Player: {PLAYER}")
print(f"Resolved player_entity_id(s): {player_ids}")
print(f"Chunks scanned (player-associated): {len(df_player_chunks)}")
print(f"Tracked entities found: {len(df_entities_for_player)}\n")

display(df_entities_for_player)

del PLAYER, player_ids, player_chunks
del df_player_chunks, df_mentions_in_player_chunks, df_entities_for_player

World Root: cd "/Users/charissophia/obsidian/Iron Wolf Trading Company"
Player: Kalina
Resolved player_entity_id(s): ['player_kalina']
Chunks scanned (player-associated): 31
Tracked entities found: 18



Unnamed: 0,Entity,entity_id,Mentions,How it appears in the text
0,Faeryne,person_faeryne,20,[Faeryne]
1,Luminia,person_luminia,10,[Luminia]
2,Terry,person_terry,10,[Terry]
3,Victor D Evernight,person_victor,10,[Victor]
4,Alivyre Dawntracker,person_alivyre,6,[Alivyre]
5,Crafthold,place_crafthold,3,[Crafthold]
6,Hands of Liamassa,faction_hands,2,"[Hand, Hands]"
7,Party,org_party,2,[Party]
8,Charis,player_charis,2,[Charis]
9,Crowe,player_crowe,2,[CroweTheDualityKing]
