# IWTC Graph Querying

This notebook answers DM questions by treating graph edges as grammatical statements and chaining them into evidence trails.

It loads graph CSV artifacts (nodes + edges) and provides query recipes that traverse relationships rather than composing table joins.

Read-only: no writes, no promotion, no index regeneration.


# Pre-Build: Load descriptor and canonical artifacts

Run these phases to point the notebook at a world repository, validate paths, and load canonical graph artifacts.

Once this section succeeds, you can collapse it. The query recipes below assume the objects it publishes (`DF_GRAPH_NODES`, `DF_GRAPH_EDGES`, and the optional `DF_VOCAB_*` tables) exist.

## Phase P0: Parameters

Set the world repository descriptor and the index version you want to query.

This phase only selects *which* world and *which* artifact version are in scope.

In [15]:
# -------------------------------------------------------------------
# Phase P0: Parameters
# -------------------------------------------------------------------
LAST_PHASE_RUN = "P0"

# Absolute path to the world_repository.yml descriptor.
WORLD_REPOSITORY_DESCRIPTOR = (
    "/Users/charissophia/obsidian/Iron Wolf Trading Company/_meta/descriptors/world_repository.yml"
)

# Artifact version to load (must match previously generated artifacts).
# In v0, graph artifacts are versioned using the same INDEX_VERSION as the index artifacts they were built from.
INDEX_VERSION = "V0"

# Internal run metadata (do not edit)
from datetime import datetime
print(f"Notebook run initialized at: {datetime.now().strftime('%Y-%m-%d %H:%M')}")
del datetime

Notebook run initialized at: 2026-02-17 19:28


## Phase P1: Load and validate world descriptor

Loads the repository descriptor and validates the filesystem layout it declares.

This phase answers: "Can I trust these paths enough to read artifacts safely?"


In [16]:
# Phase P1: Load and validate world repository descriptor (Graph Querying v0)
LAST_PHASE_RUN = "P1"

from pathlib import Path
import yaml

errors = []
warnings = []

# --- Load descriptor file ---
descriptor_path = Path(WORLD_REPOSITORY_DESCRIPTOR)

if not descriptor_path.exists():
    raise FileNotFoundError(
        "World repository descriptor file was not found.\n"
        f"Path provided:\n  {descriptor_path}\n\n"
        "What to do:\n"
        "- Confirm the file exists at this location or fix WORLD_REPOSITORY_DESCRIPTOR in Phase P0\n"
        "- If you just edited Phase P0, rerun Phase P0 and then rerun this cell\n"
    )

try:
    with descriptor_path.open("r", encoding="utf-8") as f:
        world_repo = yaml.safe_load(f)
except Exception as exc:
    # Chain the original exception so the underlying parse or I/O error stays visible.
    raise ValueError(
        "The world repository descriptor could not be read.\n"
        "This usually indicates a YAML formatting problem.\n\n"
        f"File:\n  {descriptor_path}\n\n"
        "What to do:\n"
        "- Compare the file against the example world_repository.yml\n"
        "- Paste the contents into https://www.yamllint.com/\n"
        "- Fix any reported issues, save the file, and rerun this cell"
    ) from exc

if not isinstance(world_repo, dict):
    raise ValueError(
        "World repository descriptor structure is not usable.\n"
        "The file must be a YAML mapping (top-level `name: value` entries).\n"
    )

print(f"World repository descriptor loaded successfully: {descriptor_path.name}")

# --- Extract required entries ---
WORLD_ROOT_RAW = world_repo.get("world_root")

indexes_block = world_repo.get("indexes")
INDEXES_RAW = indexes_block.get("path") if isinstance(indexes_block, dict) else None

vocab = world_repo.get("vocabulary") or {}
ENTITIES_RAW = vocab.get("entities")
ALIASES_RAW = vocab.get("aliases")
AUTHORS_RAW = vocab.get("author_aliases")

if not WORLD_ROOT_RAW:
    errors.append("Missing required entry: world_root")
if not INDEXES_RAW:
    errors.append("Missing required entry: indexes.path")

if errors:
    raise ValueError(
        "World repository descriptor is missing required entries:\n- "
        + "\n- ".join(errors)
        + "\n\nWhat to do:\n"
          "- Edit your world_repository.yml and add/fix the missing entries\n"
          "- Save the file and rerun this cell"
    )

# ------------------------------------------------------------------
# Published outputs (initialize up front for later phases)
# ------------------------------------------------------------------
WORLD_ROOT = None

INDEXES_PATH = None
INDEXES_RELPATH = None

VOCAB_ENTITIES_PATH = None
VOCAB_ENTITIES_RELPATH = None
VOCAB_ALIASES_PATH = None
VOCAB_ALIASES_RELPATH = None
VOCAB_AUTHORS_PATH = None
VOCAB_AUTHORS_RELPATH = None

# --- Validate and resolve world_root ---
WORLD_ROOT = Path(WORLD_ROOT_RAW)

if str(WORLD_ROOT).startswith("~"):
    errors.append("world_root: '~' is not allowed. Use a full absolute path.")
elif not WORLD_ROOT.is_absolute():
    errors.append("world_root must be an absolute path (starts with / on macOS/Linux, or C:\\ on Windows).")
elif not WORLD_ROOT.is_dir():
    errors.append(f"world_root must be an existing directory: {WORLD_ROOT}")
else:
    WORLD_ROOT = WORLD_ROOT.resolve()

if errors:
    raise ValueError("Descriptor path validation failed:\n- " + "\n- ".join(errors))

def _resolve_under_world_root(raw_path: str, label: str):
    if raw_path is None or str(raw_path).strip() == "":
        return None, None

    p = Path(str(raw_path))

    if str(p).startswith("~"):
        errors.append(f"{label}: '~' is not allowed: {raw_path}")
        return None, None

    if not p.is_absolute():
        p = WORLD_ROOT / p
    p = p.resolve()

    try:
        rel = str(p.relative_to(WORLD_ROOT))
    except Exception:
        rel = str(p)

    return p, rel

# --- Resolve and validate indexes path (required, directory) ---
INDEXES_PATH, INDEXES_RELPATH = _resolve_under_world_root(INDEXES_RAW, "indexes.path")

if INDEXES_PATH is None:
    errors.append("indexes.path: missing or invalid.")
else:
    if not INDEXES_PATH.exists():
        errors.append(f"indexes.path: path does not exist: {INDEXES_PATH}")
    elif not INDEXES_PATH.is_dir():
        errors.append(f"indexes.path: must be a directory: {INDEXES_PATH}")

# --- Resolve vocabulary paths (optional; warn if missing) ---
vocab_entries = [
    ("entities", "vocab.entities", ENTITIES_RAW),
    ("aliases", "vocab.aliases", ALIASES_RAW),
    ("author_aliases", "vocab.author_aliases", AUTHORS_RAW),
]

for key, label, raw in vocab_entries:
    if not raw:
        continue

    p, rel = _resolve_under_world_root(raw, label)
    if p is None:
        continue

    if p.exists() and p.is_dir():
        warnings.append(f"{label}: {p} must be a file (got directory). Ignoring.")
        continue

    if not p.exists():
        warnings.append(f"{label}: file does not exist: {p} (name resolution may be limited).")

    if key == "entities":
        VOCAB_ENTITIES_PATH, VOCAB_ENTITIES_RELPATH = p, rel
    elif key == "aliases":
        VOCAB_ALIASES_PATH, VOCAB_ALIASES_RELPATH = p, rel
    elif key == "author_aliases":
        VOCAB_AUTHORS_PATH, VOCAB_AUTHORS_RELPATH = p, rel

if errors:
    raise ValueError("Descriptor path validation failed:\n- " + "\n- ".join(errors))

print("Descriptor paths are usable for this notebook.")
print(f"world_root: {WORLD_ROOT}")
print(f"indexes: {INDEXES_RELPATH}")
print(f"vocab.entities: {VOCAB_ENTITIES_RELPATH} (exists={VOCAB_ENTITIES_PATH.exists() if VOCAB_ENTITIES_PATH else False})")
print(f"vocab.aliases: {VOCAB_ALIASES_RELPATH} (exists={VOCAB_ALIASES_PATH.exists() if VOCAB_ALIASES_PATH else False})")
print(f"vocab.author_aliases: {VOCAB_AUTHORS_RELPATH} (exists={VOCAB_AUTHORS_PATH.exists() if VOCAB_AUTHORS_PATH else False})")

if warnings:
    print("\nWarnings:")
    for w in warnings:
        print(f"- {w}")

# cleanup
del yaml, Path
del descriptor_path, world_repo, indexes_block, vocab
del WORLD_REPOSITORY_DESCRIPTOR
del WORLD_ROOT_RAW, INDEXES_RAW, ENTITIES_RAW, ALIASES_RAW, AUTHORS_RAW
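del vocab_entries, key, label, raw, warnings, errors, f
# p, rel, and w exist only if a vocabulary entry was configured or a warning was printed
for _name in ("p", "rel", "w"):
    globals().pop(_name, None)
del _name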
del _resolve_under_world_root

World repository descriptor loaded successfully: world_repository.yml
Descriptor paths are usable for this notebook.
world_root: /Users/charissophia/obsidian/Iron Wolf Trading Company
indexes: _meta/indexes
vocab.entities: _meta/indexes/vocab_entities.csv (exists=True)
vocab.aliases: _meta/indexes/vocab_aliases.csv (exists=True)
vocab.author_aliases: _meta/indexes/vocab_author_aliases.csv (exists=True)


## Phase P2: Load graph artifacts

Loads the graph CSVs for the selected index version:

- graph_nodes_vN.csv
- graph_edges_vN.csv

This phase answers: "Are the graph artifacts present and structurally usable?"

In [17]:
# Phase P2: Load graph artifacts (v0)
LAST_PHASE_RUN = "P2"

import pandas as pd
from pathlib import Path

errors = []

# Normalize INDEX_VERSION into the on-disk suffix (your files use lowercase v0)
# Accepts "V0", "v0", "0" but publishes "v0"
INDEX_VERSION_SUFFIX = f"v{str(INDEX_VERSION).lower().lstrip('v')}"

# Required artifact filenames (fixed contract for this notebook)
required = {
    "graph_nodes": f"graph_nodes_{INDEX_VERSION_SUFFIX}.csv",
    "graph_edges": f"graph_edges_{INDEX_VERSION_SUFFIX}.csv",
}

# Resolve paths and validate existence
GRAPH_FILES = {}
for key, fname in required.items():
    p = (INDEXES_PATH / fname).resolve()
    GRAPH_FILES[key] = p
    if not p.exists():
        errors.append(f"Missing required graph artifact: {fname}\n  Expected at: {p}")

if errors:
    raise FileNotFoundError(
        "Phase P2 cannot proceed because required graph artifacts are missing.\n\n"
        + "\n\n".join(errors)
        + "\n\nWhat to do:\n"
          "- Run IWTC_Graph_Indexing.ipynb to generate graph_nodes_*.csv and graph_edges_*.csv\n"
          "- Ensure the resulting graph_*.csv files are placed under your indexes.path directory\n"
          f"- indexes.path resolved to:\n  {INDEXES_PATH}\n"
          "- Then rerun Phase P2"
    )

# Load CSVs (raw)
DF_GRAPH_NODES = pd.read_csv(GRAPH_FILES["graph_nodes"])
DF_GRAPH_EDGES = pd.read_csv(GRAPH_FILES["graph_edges"])

# Validate required columns (presence only)
expected_cols = {
    "DF_GRAPH_NODES": {"node_id", "node_type", "label"},
    # weight is optional; if missing, we'll add it as an empty column.
    "DF_GRAPH_EDGES": {"subject", "predicate", "object"},
}

for df_name, cols in expected_cols.items():
    df = globals()[df_name]
    missing = [c for c in cols if c not in df.columns]
    if missing:
        errors.append(f"{df_name}: missing expected columns: {missing}")

if errors:
    raise ValueError(
        "One or more graph artifacts were loaded but do not match expected v0 columns.\n- "
        + "\n- ".join(errors)
        + "\n\nWhat to do:\n"
          "- Confirm you are using the v0 CSVs produced by IWTC_Graph_Indexing.ipynb\n"
          "- Do not edit the CSVs manually\n"
          "- If you changed the producer notebook, re-run it to regenerate graphs and retry"
    )

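# weight is optional: add it as an empty (NaN) column so later phases can rely on it
if "weight" not in DF_GRAPH_EDGES.columns:
    DF_GRAPH_EDGES["weight"] = float("nan")

# Summary prints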
print("Phase P2 OK: graph artifacts loaded.")
print(f"indexes.path: {INDEXES_PATH}")
print(f"graph version: {INDEX_VERSION_SUFFIX}")

print("\nLoaded tables:")
print(f"- DF_GRAPH_NODES:  {len(DF_GRAPH_NODES):>8} rows, {len(DF_GRAPH_NODES.columns):>3} cols")
print(f"- DF_GRAPH_EDGES:  {len(DF_GRAPH_EDGES):>8} rows, {len(DF_GRAPH_EDGES.columns):>3} cols")

print("\nDF_GRAPH_NODES columns:", list(DF_GRAPH_NODES.columns))
print("DF_GRAPH_EDGES columns:", list(DF_GRAPH_EDGES.columns))

# cleanup locals (keep DF_GRAPH_NODES, DF_GRAPH_EDGES, INDEXES_PATH)
del pd, Path, errors, required, key, fname, p, cols, df_name, df, missing
del expected_cols, INDEX_VERSION_SUFFIX, GRAPH_FILES

Phase P2 OK: graph artifacts loaded.
indexes.path: /Users/charissophia/obsidian/Iron Wolf Trading Company/_meta/indexes
graph version: v0

Loaded tables:
- DF_GRAPH_NODES:      1696 rows,   3 cols
- DF_GRAPH_EDGES:     12691 rows,   4 cols

DF_GRAPH_NODES columns: ['node_id', 'node_type', 'label']
DF_GRAPH_EDGES columns: ['subject', 'predicate', 'object', 'weight']


In [18]:
# optional: clean up INDEXES path variables now that the graph artifacts are loaded into DataFrames
del INDEXES_PATH, INDEXES_RELPATH

## Phase P3: Optional supporting tables (labels + context)

Optionally loads the vocabulary tables (entities, aliases, author aliases) and builds a unified name lookup, so recipes can start from a free-text name and show readable labels.

Graph traversal uses nodes + edges as the primary structure. Supporting tables are used only for name lookup and presentation.

In [19]:
# Phase P3: Load optional vocabulary tables (for lookup / entrypoints)
LAST_PHASE_RUN = "P3"

import pandas as pd
from pathlib import Path

warnings = []

# ------------------------------------------------------------------
# Semantic column mappings (same as Index Query)
# ------------------------------------------------------------------
ENTITY_COLS = {"entity_id": ["entity_id", "id"], "canonical": ["canonical", "canonical_name", "name"]}
ALIAS_COLS  = {"entity_id": ["entity_id", "id"], "alias": ["alias", "alt", "alternate"]}
AUTHOR_ALIAS_COLS = {
    "author": ["author", "discord_name", "handle"],
    "player_entity_id": ["player_entity_id", "player", "player_id"],
    "ambig_char_id": ["ambig_char_id", "ambiguous_character", "ambig_character"],
}

# ------------------------------------------------------------------
# Use descriptor-validated vocab paths (from Phase P1)
# ------------------------------------------------------------------
vocab_files = [
    ("entities", VOCAB_ENTITIES_PATH, ENTITY_COLS),
    ("aliases", VOCAB_ALIASES_PATH, ALIAS_COLS),
    ("author_aliases", VOCAB_AUTHORS_PATH, AUTHOR_ALIAS_COLS),
]

# Published outputs (always defined)
DF_VOCAB_ENTITIES = pd.DataFrame(columns=list(ENTITY_COLS.keys()))
DF_VOCAB_ALIASES  = pd.DataFrame(columns=list(ALIAS_COLS.keys()))
DF_VOCAB_AUTHORS  = pd.DataFrame(columns=list(AUTHOR_ALIAS_COLS.keys()))
DF_VOCAB_LOOKUP   = pd.DataFrame(columns=["vocab_id", "vocab", "vocab_kind", "vocab_norm"])
DF_VOCAB_TO_NODE  = pd.DataFrame(columns=["vocab_id", "node_id"])  # convenience, same value in v0

def _load_and_normalize_csv(path_obj, col_map, key_label):
    if not path_obj:
        return pd.DataFrame(columns=list(col_map.keys()))

    p = Path(path_obj)
    if not p.exists():
        warnings.append(f"Optional vocab file not found: {p}")
        return pd.DataFrame(columns=list(col_map.keys()))

    raw_df = pd.read_csv(p, dtype=str).fillna("")

    rename = {}
    for semantic, options in col_map.items():
        found = next((c for c in options if c in raw_df.columns), None)
        if found:
            rename[found] = semantic

    if len(raw_df) > 0 and not rename:
        warnings.append(
            f"[{key_label}] CSV has rows but none of the expected columns were found.\n"
            f"  CSV columns: {list(raw_df.columns)}\n"
            f"  Expected mapping: {col_map}\n"
            f"  File: {p}"
        )
        return pd.DataFrame(columns=list(col_map.keys()))

    out = raw_df.rename(columns=rename)
    keep = [k for k in col_map.keys() if k in out.columns]
    return out[keep]

# ------------------------------------------------------------------
# Load tables (optional)
# ------------------------------------------------------------------
for key, path_obj, col_map in vocab_files:
    norm_df = _load_and_normalize_csv(path_obj, col_map, key)

    if key == "entities":
        DF_VOCAB_ENTITIES = norm_df
    elif key == "aliases":
        DF_VOCAB_ALIASES = norm_df
    elif key == "author_aliases":
        DF_VOCAB_AUTHORS = norm_df

del vocab_files, key, path_obj, col_map, norm_df

# ------------------------------------------------------------------
# Build DF_VOCAB_LOOKUP (unified lookup for free-text -> node id)
# NOTE: In v0, vocab_id is already the node_id for entities/players.
# ------------------------------------------------------------------
rows = []

for _, r in DF_VOCAB_ENTITIES.iterrows():
    vid = str(r.get("entity_id", "")).strip()
    v = str(r.get("canonical", "")).strip()
    if vid and v:
        rows.append([vid, v, "entity"])

if not DF_VOCAB_ALIASES.empty:
    for _, r in DF_VOCAB_ALIASES.iterrows():
        vid = str(r.get("entity_id", "")).strip()
        v = str(r.get("alias", "")).strip()
        if vid and v:
            rows.append([vid, v, "alias"])

if not DF_VOCAB_AUTHORS.empty:
    for _, r in DF_VOCAB_AUTHORS.iterrows():
        vid = str(r.get("player_entity_id", "")).strip()
        v = str(r.get("author", "")).strip()
        if vid and v:
            rows.append([vid, v, "author"])

DF_VOCAB_LOOKUP = pd.DataFrame(rows, columns=["vocab_id", "vocab", "vocab_kind"])
DF_VOCAB_LOOKUP["vocab_norm"] = DF_VOCAB_LOOKUP["vocab"].astype(str).str.strip().str.lower()
DF_VOCAB_LOOKUP = (
    DF_VOCAB_LOOKUP
    .drop_duplicates(subset=["vocab_id", "vocab_norm", "vocab_kind"])
    .reset_index(drop=True)
)

# Convenience mapping (explicit name helps later recipes)
DF_VOCAB_TO_NODE = DF_VOCAB_LOOKUP.loc[:, ["vocab_id"]].drop_duplicates().rename(columns={"vocab_id": "node_id"})
DF_VOCAB_TO_NODE["vocab_id"] = DF_VOCAB_TO_NODE["node_id"]

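del rows
# Loop variables exist only if at least one vocabulary table had rows
for _name in ("r", "vid", "v", "_"):
    globals().pop(_name, None)
del _name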

# ------------------------------------------------------------------
# Summary
# ------------------------------------------------------------------
print("Phase P3 OK: optional vocabulary tables loaded (if present).")
print(f"- DF_VOCAB_ENTITIES: {len(DF_VOCAB_ENTITIES):>6} rows")
print(f"- DF_VOCAB_ALIASES:  {len(DF_VOCAB_ALIASES):>6} rows")
print(f"- DF_VOCAB_AUTHORS:  {len(DF_VOCAB_AUTHORS):>6} rows")
print(f"- DF_VOCAB_LOOKUP:   {len(DF_VOCAB_LOOKUP):>6} rows")

if warnings:
    print("\nWarnings:")
    for w in warnings:
        print(f"- {w}")

# cleanup
del pd, Path, warnings
del ENTITY_COLS, ALIAS_COLS, AUTHOR_ALIAS_COLS
del _load_and_normalize_csv

Phase P3 OK: optional vocabulary tables loaded (if present).
- DF_VOCAB_ENTITIES:    176 rows
- DF_VOCAB_ALIASES:      87 rows
- DF_VOCAB_AUTHORS:       6 rows
- DF_VOCAB_LOOKUP:      269 rows
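
The unified lookup is what lets a recipe start from a name typed by hand. Below is a minimal sketch of that resolution step, assuming exact matching on the normalized form. The `resolve_name` helper and the sample rows are illustrative and not part of the notebook; only the four column names come from `DF_VOCAB_LOOKUP`.

```python
import pandas as pd

# Toy stand-in for DF_VOCAB_LOOKUP with the same four columns; all rows are invented.
lookup = pd.DataFrame(
    [
        ["ent:korra", "Korra Vane", "entity", "korra vane"],
        ["ent:korra", "The Captain", "alias", "the captain"],
        ["ent:ironwolf", "Iron Wolf Trading Company", "entity", "iron wolf trading company"],
        ["player:zed", "zed#1234", "author", "zed#1234"],
    ],
    columns=["vocab_id", "vocab", "vocab_kind", "vocab_norm"],
)


def resolve_name(text, lookup_df):
    """Return the distinct vocab_ids whose normalized name equals the normalized input."""
    needle = str(text).strip().lower()  # same normalization as vocab_norm
    hits = lookup_df.loc[lookup_df["vocab_norm"] == needle, "vocab_id"]
    return sorted(hits.unique())


resolved = resolve_name("  the CAPTAIN ", lookup)
print(resolved)
```

An empty result means the name is not in the vocabulary. More than one id means the name is ambiguous and the caller has to pick.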


In [20]:
# optional: clean up VOCAB path variables
del VOCAB_ENTITIES_PATH, VOCAB_ENTITIES_RELPATH
del VOCAB_ALIASES_PATH, VOCAB_ALIASES_RELPATH
del VOCAB_AUTHORS_PATH, VOCAB_AUTHORS_RELPATH

# Graph Engine: Build the in-memory graph

This section converts the loaded CSV artifacts into an in-memory graph object for traversal.

- Input artifacts: `DF_GRAPH_NODES`, `DF_GRAPH_EDGES`
- Output object: `WORLD_GRAPH` (a NetworkX graph with node and edge attributes)

Everything below this point should query the graph, not the raw tables.

In [23]:
# -------------------------------------------------------------------
# Phase G: Build graph object (bridge between CSV and queries)
# -------------------------------------------------------------------
LAST_PHASE_RUN = "G"

import networkx as nx
import pandas as pd

# WORLD_GRAPH is the in-memory graph object we will query.
# MultiDiGraph allows multiple edges between the same nodes (needed for ambiguity later).
WORLD_GRAPH = nx.MultiDiGraph()

print("Initialized WORLD_GRAPH:", type(WORLD_GRAPH).__name__)

# -------------------------------------------------------------------
# G.1: Add nodes to WORLD_GRAPH
# -------------------------------------------------------------------
# Source grammar (nodes table):
#   node_id + node_type + label  => "This thing exists in the world graph"
#
# Target grammar (graph object):
#   WORLD_GRAPH.add_node(node_id, node_type=..., label=...)
# -------------------------------------------------------------------
for r in DF_GRAPH_NODES.itertuples(index=False):
    WORLD_GRAPH.add_node(
        str(r.node_id),
        node_type=str(r.node_type),
        label=str(r.label),
    )

print("Nodes loaded into WORLD_GRAPH")
print(f"- WORLD_GRAPH number_of_nodes(): {WORLD_GRAPH.number_of_nodes():,}")

# -------------------------------------------------------------------
# G.2: Add edges to WORLD_GRAPH
# -------------------------------------------------------------------
# Source grammar (edges table):
#   subject + predicate + object (+ optional weight)
#     => "This relationship statement exists"
#
# Target grammar (graph object):
#   WORLD_GRAPH.add_edge(subject, object, predicate=..., weight=...)
# -------------------------------------------------------------------

for r in DF_GRAPH_EDGES.itertuples(index=False):
    WORLD_GRAPH.add_edge(
        str(r.subject),
        str(r.object),
        predicate=str(r.predicate),
        weight=r.weight,   # whatever pandas loaded (NaN for blanks is fine for now)
    )

print("Edges loaded into WORLD_GRAPH")
print(f"- WORLD_GRAPH number_of_edges(): {WORLD_GRAPH.number_of_edges():,}")

# clean up locals
del r


Initialized WORLD_GRAPH: MultiDiGraph
Nodes loaded into WORLD_GRAPH
- WORLD_GRAPH number_of_nodes(): 1,696
Edges loaded into WORLD_GRAPH
- WORLD_GRAPH number_of_edges(): 12,691


# Query recipes (edit and run)

Each recipe answers one DM question by chaining statements into an evidence trail.

You will usually only edit the parameter lines (ENTITY, PLAYER, FILE, etc.) and rerun the cell.
Recipes are designed to be modified during use.
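
Because every recipe filters edges by predicate, it can help to list the predicates that actually occur in the graph before editing a recipe. The sketch below assumes only that each edge carries a `predicate` attribute, as set in Phase G. The toy graph, its node ids, and the `predicate_counts` helper are illustrative.

```python
from collections import Counter

import networkx as nx

# Toy stand-in for WORLD_GRAPH; node ids and predicate names are illustrative assumptions.
toy_graph = nx.MultiDiGraph()
toy_graph.add_edge("file:session_01.md", "chunk:0001", predicate="contains", weight=float("nan"))
toy_graph.add_edge("chunk:0001", "vocab:korra", predicate="mentions", weight=float("nan"))
toy_graph.add_edge("chunk:0001", "vocab:ironwolf", predicate="mentions", weight=float("nan"))
toy_graph.add_edge("ent:korra", "ent:ironwolf", predicate="cooccurs_with", weight=3.0)


def predicate_counts(graph):
    """Count edges per predicate so recipes can be matched to the real predicate names."""
    return Counter(data.get("predicate") for _, _, data in graph.edges(data=True))


counts = predicate_counts(toy_graph)
for predicate, n in counts.most_common():
    print(f"{predicate:<15} {n}")
```

Calling `predicate_counts(WORLD_GRAPH)` in the notebook would show which predicate strings the recipes can rely on.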

## Q1: Where does entity X appear?

Goal: produce evidence locations (files + chunks) for a named entity.

Chain used (conceptually):
file contains chunk; chunk mentions vocab; vocab refers_to entity


## Q2: What else appears with entity X?

Goal: find entities structurally associated with X via co-occurrence.

This is a weighted relationship:
entity cooccurs_with entity (weight = number of shared chunks)
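
A possible shape for this recipe is to collect every `cooccurs_with` edge touching X and rank the other endpoint by weight. Whether the producer stores each pair once or in both directions is not stated here, so the sketch reads both incoming and outgoing edges and keeps the largest weight per neighbour. Node ids and the helper name are illustrative.

```python
import math

import networkx as nx

# Toy graph with invented entities and weights.
g = nx.MultiDiGraph()
g.add_edge("ent:korra", "ent:ironwolf", predicate="cooccurs_with", weight=5)
g.add_edge("ent:korra", "ent:dockmaster", predicate="cooccurs_with", weight=2)
g.add_edge("ent:saltmarsh", "ent:korra", predicate="cooccurs_with", weight=3)
g.add_edge("vocab:korra", "ent:korra", predicate="refers_to", weight=float("nan"))


def cooccurring_entities(graph, entity_id, top_n=10):
    """Rank neighbours joined to entity_id by a cooccurs_with edge, in either direction."""
    scores = {}
    edges = list(graph.out_edges(entity_id, data=True)) + list(graph.in_edges(entity_id, data=True))
    for subject, obj, data in edges:
        if data.get("predicate") != "cooccurs_with":
            continue
        other = obj if subject == entity_id else subject
        weight = data.get("weight")
        # Skip blank weights (None or NaN) rather than guessing a value.
        if weight is None or (isinstance(weight, float) and math.isnan(weight)):
            continue
        scores[other] = max(scores.get(other, 0), weight)
    # Strongest first; ties broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


ranked = cooccurring_entities(g, "ent:korra")
for other, weight in ranked:
    print(f"ent:korra cooccurs_with {other} (weight={weight})")
```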


## Q3: Where do X and Y overlap?

Goal: show evidence chunks where both X and Y appear.

This recipe is "proof-oriented": it outputs overlap chunks and a show-chunk command for context review.
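
The overlap itself can be sketched as a set intersection: gather the chunks that mention any vocabulary item referring to X, do the same for Y, and keep the chunks present in both sets. The sketch assumes the `mentions` and `refers_to` predicates from Q1. Node ids and helper names are hypothetical, and the sketch stops at the overlap set.

```python
import networkx as nx

# Toy graph: chunk -> vocab (mentions), vocab -> entity (refers_to). All ids are invented.
g = nx.MultiDiGraph()
mentions = [
    ("chunk:1", "vocab:korra"), ("chunk:1", "vocab:ironwolf"),
    ("chunk:2", "vocab:korra"),
    ("chunk:3", "vocab:ironwolf"),
    ("chunk:4", "vocab:the captain"), ("chunk:4", "vocab:ironwolf"),
]
for chunk, vocab in mentions:
    g.add_edge(chunk, vocab, predicate="mentions")
refers = [
    ("vocab:korra", "ent:korra"),
    ("vocab:the captain", "ent:korra"),
    ("vocab:ironwolf", "ent:ironwolf"),
]
for vocab, entity in refers:
    g.add_edge(vocab, entity, predicate="refers_to")


def chunks_for_entity(graph, entity_id):
    """All chunks that mention any vocabulary item referring to entity_id."""
    chunks = set()
    for vocab, _, d in graph.in_edges(entity_id, data=True):
        if d.get("predicate") != "refers_to":
            continue
        for chunk, _, d2 in graph.in_edges(vocab, data=True):
            if d2.get("predicate") == "mentions":
                chunks.add(chunk)
    return chunks


def overlap_chunks(graph, entity_x, entity_y):
    """Chunks in which both entities appear, sorted for stable display."""
    return sorted(chunks_for_entity(graph, entity_x) & chunks_for_entity(graph, entity_y))


shared = overlap_chunks(g, "ent:korra", "ent:ironwolf")
print(shared)
```

Note that `chunk:4` qualifies only through the alias "the captain", which is why the lookup goes through vocabulary nodes instead of matching names directly.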


## Q4: How does X likely connect to Y?

Goal: support "social proximity" style questions using multi-hop structure.

This recipe explicitly calls out:
- when multiple hops are required
- when a predicate is read backwards (incoming edges)
- when weights are used as strength signals
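
The three callouts above can be sketched together. The example finds a shortest path while ignoring edge direction, then labels each hop as forward or backward and reports the strongest parallel edge. Using an unweighted shortest path is a simplification chosen for this sketch, not a statement of how the recipe must work. Node ids and helper names are hypothetical.

```python
import networkx as nx

# Toy graph with invented entities; direction of each edge is deliberate.
g = nx.MultiDiGraph()
g.add_edge("ent:korra", "ent:ironwolf", predicate="cooccurs_with", weight=5)
g.add_edge("ent:dockmaster", "ent:ironwolf", predicate="cooccurs_with", weight=2)
g.add_edge("ent:dockmaster", "ent:saltmarsh", predicate="cooccurs_with", weight=4)


def _strength(data):
    """Edge weight as a float, treating missing or NaN weights as 0."""
    w = data.get("weight")
    return 0.0 if w is None or w != w else float(w)


def describe_connection(graph, source, target):
    """Shortest path ignoring direction, with each hop labelled forward or backward."""
    try:
        path = nx.shortest_path(graph.to_undirected(), source, target)
    except nx.NetworkXNoPath:
        return []
    hops = []
    for a, b in zip(path, path[1:]):
        if graph.has_edge(a, b):
            subject, obj, direction = a, b, "forward"
        else:
            # The stored statement runs b -> a, so this hop reads the predicate backwards.
            subject, obj, direction = b, a, "backward"
        # A MultiDiGraph may hold parallel edges; keep the strongest one as the signal.
        best = max(graph.get_edge_data(subject, obj).values(), key=_strength)
        hops.append((subject, best.get("predicate"), obj, best.get("weight"), direction))
    return hops


hops = describe_connection(g, "ent:korra", "ent:saltmarsh")
for subject, predicate, obj, weight, direction in hops:
    print(f"{subject} {predicate} {obj} (weight={weight}, read {direction})")
```

A path with more than one hop is an inference, not direct evidence, and low weights along the way weaken it further.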


## Q5: What does player Z usually write about?

Goal: map from player -> character(s) -> evidence chunks and summarize recurring entities.

This recipe teaches two things:
- using declared mappings (plays)
- chaining from authorship-related nodes into evidence trails
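
A rough sketch of this chain follows. The `plays` predicate comes from the text above. How chunks attach to characters is not described here, so the `authored_by` predicate is an invented placeholder and should be replaced with whatever authorship predicate the graph really uses. Node ids and helper names are also hypothetical.

```python
from collections import Counter

import networkx as nx

# Toy graph; `authored_by` is a hypothetical predicate used only for illustration.
g = nx.MultiDiGraph()
g.add_edge("player:zed", "char:korra", predicate="plays")
g.add_edge("chunk:1", "char:korra", predicate="authored_by")
g.add_edge("chunk:2", "char:korra", predicate="authored_by")
g.add_edge("chunk:3", "char:other", predicate="authored_by")
g.add_edge("chunk:1", "vocab:ironwolf", predicate="mentions")
g.add_edge("chunk:1", "vocab:saltmarsh", predicate="mentions")
g.add_edge("chunk:2", "vocab:ironwolf", predicate="mentions")
g.add_edge("chunk:3", "vocab:saltmarsh", predicate="mentions")
g.add_edge("vocab:ironwolf", "ent:ironwolf", predicate="refers_to")
g.add_edge("vocab:saltmarsh", "ent:saltmarsh", predicate="refers_to")


def targets_of(graph, node, predicate):
    """Nodes O for which the statement `node predicate O` exists (outgoing edges)."""
    return {o for _, o, d in graph.out_edges(node, data=True) if d.get("predicate") == predicate}


def sources_of(graph, node, predicate):
    """Nodes S for which the statement `S predicate node` exists (incoming edges)."""
    return {s for s, _, d in graph.in_edges(node, data=True) if d.get("predicate") == predicate}


def player_topics(graph, player_id):
    """Count, per entity, the chunks written by the player's characters that mention it."""
    counts = Counter()
    for character in targets_of(graph, player_id, "plays"):
        for chunk in sources_of(graph, character, "authored_by"):
            entities = set()
            for vocab in targets_of(graph, chunk, "mentions"):
                entities |= targets_of(graph, vocab, "refers_to")
            entities.discard(character)  # a character mentioning itself is not a topic
            counts.update(entities)
    return counts


topics = player_topics(g, "player:zed")
for entity, n in topics.most_common():
    print(f"{entity}: {n} chunk(s)")
```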


# Notes and next steps

This notebook is an analysis layer over existing graph artifacts.

If results look wrong:
- adjust vocabulary / indexing inputs
- regenerate canonical indexes and graph artifacts
- re-run this notebook

No files are written here.
