### Cell 0: Repository Bootstrap & Experiment Registry (Required)

This cell ensures that the project repository is discoverable by Python **and**
defines the registry of experiments that will be executed in this notebook.

---

#### Why this is needed

- The notebook lives inside the `notebooks/` directory  
- Python does not automatically know where the project root is  
- All project code lives under the `src/` directory  
- Multiple experimental configurations (prompt-only, map-only, and prompt + map,
  with USE or OpenAI prompt encoders) must be executed in a single, reproducible workflow  

---

#### What this cell does

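- Walks up the directory tree starting from the current working directory
- Finds the repository root (the first directory that contains `src/imgofup/`)
- Adds the `src/` directory (not the repository root) to `sys.path` so imports such as
  `from imgofup.config import ...` work correctly
- Defines a central **experiment registry** (`EXPERIMENTS`) describing:
  - which feature representation is used (`feature_mode`, plus `prompt_encoder_kind` when prompts are used)
  - where training data is read from (`train_out`)
  - where trained models and metadata are saved (`model_out`)
- Creates each experiment's output directories if they do not exist yet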

---

#### Design principles

- Executed once at the very top of the notebook  
- Contains no learning or model-specific logic  
- Provides a single source of truth for all experiment configurations  
- Enables fair and controlled comparison between different feature setups  

---

This cell must be executed **before any imports from `imgofup.*`**.  
All subsequent cells rely on the repository path and experiment definitions
established here.


In [1]:
# ===================== CELL 0: Bootstrap + experiment registry =====================

import sys
from pathlib import Path

# -------------------------------------------------
# Find repo root (folder that CONTAINS "src/imgofup/")
# -------------------------------------------------
p = Path.cwd().resolve()
REPO_ROOT = None

for candidate in [p, *p.parents]:
    if (candidate / "src" / "imgofup").is_dir():
        REPO_ROOT = candidate
        break

if REPO_ROOT is None:
    raise RuntimeError("Could not find repo root (no 'src/imgofup' found).")

SRC_DIR = REPO_ROOT / "src"

# -------------------------------------------------
# Make src/ importable (NOT the repo root)
# -------------------------------------------------
if str(SRC_DIR) not in sys.path:
    sys.path.insert(0, str(SRC_DIR))

print("📦 Repo root:", REPO_ROOT)

# -------------------------------------------------
# Experiment registry (ALL experiments will run)
# -------------------------------------------------
DATA_DIR = REPO_ROOT / "data"

EXPERIMENTS = {
    "openai_prompt_only": {
        "train_out": DATA_DIR / "output" / "train_out_openai_prompt_only",
        "model_out": DATA_DIR / "output" / "models" / "exp_openai_prompt_only",
        "feature_mode": "prompt_only",
        "prompt_encoder_kind": "openai-small",
    },
    "use_prompt_only": {
        "train_out": DATA_DIR / "output" / "train_out_use_prompt_only",
        "model_out": DATA_DIR / "output" / "models" / "exp_use_prompt_only",
        "feature_mode": "prompt_only",
        "prompt_encoder_kind": "dan",
    },
    "map_only": {
        "train_out": DATA_DIR / "output" / "train_out_map_only",
        "model_out": DATA_DIR / "output" / "models" / "exp_map_only",
        "feature_mode": "map_only",
    },
    "use_map": {
        "train_out": DATA_DIR / "output" / "train_out_use_map",
        "model_out": DATA_DIR / "output" / "models" / "exp_use_map",
        "feature_mode": "prompt_plus_map",
        "prompt_encoder_kind": "dan",
    },
    "openai_map": {
        "train_out": DATA_DIR / "output" / "train_out_openai_map",
        "model_out": DATA_DIR / "output" / "models" / "exp_openai_map",
        "feature_mode": "prompt_plus_map",
        "prompt_encoder_kind": "openai-small",
    },
}

# Ensure output dirs exist
for exp_cfg in EXPERIMENTS.values():
    exp_cfg["train_out"] = Path(exp_cfg["train_out"])
    exp_cfg["model_out"] = Path(exp_cfg["model_out"])
    exp_cfg["train_out"].mkdir(parents=True, exist_ok=True)
    exp_cfg["model_out"].mkdir(parents=True, exist_ok=True)

# -------------------------------------------------
# Summary
# -------------------------------------------------
print("🧪 Will run experiments:")
for exp_name, cfg in EXPERIMENTS.items():
    pe = cfg.get("prompt_encoder_kind", "-")
    print(
        f" - {exp_name:18s} | "
        f"mode={cfg['feature_mode']:14s} | "
        f"prompt={pe:14s} | "
        f"train_out={cfg['train_out'].name}"
    )


📦 Repo root: /Users/amirdonyadide/Documents/GitHub/IMGOFUP
🧪 Will run experiments:
 - openai_prompt_only | mode=prompt_only    | prompt=openai-small   | train_out=train_out_openai_prompt_only
 - use_prompt_only    | mode=prompt_only    | prompt=dan            | train_out=train_out_use_prompt_only
 - map_only           | mode=map_only       | prompt=-              | train_out=train_out_map_only
 - use_map            | mode=prompt_plus_map | prompt=dan            | train_out=train_out_use_map
 - openai_map         | mode=prompt_plus_map | prompt=openai-small   | train_out=train_out_openai_map


### Cell 1: Experiment Setup & Global Configuration

This cell initializes the experiment environment and loads the global configuration required
for training and evaluation. It is designed so the notebook can run **all experiment
configurations in one pass** (prompt-only, map-only, and prompt + map with USE or OpenAI
encoders) without manual edits.

**What this cell does:**

- **Loads global project configuration**
  - Paths (`PATHS`)
  - Runtime settings (`CFG`)
  - Operator groups (`DISTANCE_OPS`, `AREA_OPS`)
  - Dynamic extent configuration flags (`USE_DYNAMIC_EXTENT_REFS`, `ALLOW_FALLBACK_EXTENT`)
  - Extent reference column names (`EXTENT_DIAG_COL`, `EXTENT_AREA_COL`)

- **Validates the experiment registry defined in Cell 0**
  - `EXPERIMENTS` is a dictionary where each experiment specifies:
    - `feature_mode`:
      - `prompt_only` (uses prompt embeddings only)
      - `map_only` (uses map embeddings only)
      - `prompt_plus_map` (uses concatenated map + prompt embeddings)
    - `prompt_encoder_kind`: required whenever prompts are part of the features
    - `train_out`: where the prepared matrices (`X_*`) and the training pairs parquet are read from
    - `model_out`: where trained artifacts are saved

- **Sets embedding dimensions**
  - `MAP_DIM_CFG`, `PROMPT_DIM_CFG`
  - `FUSED_DIM_CFG = MAP_DIM_CFG + PROMPT_DIM_CFG`
  - The effective input dimension is derived per experiment by `get_feature_dims_from_cfg`:
    - `prompt_only` → `PROMPT_DIM_CFG`
    - `map_only` → `MAP_DIM_CFG`
    - `prompt_plus_map` → `FUSED_DIM_CFG`

- **Validates experiment folders**
  - Ensures each experiment's `train_out` and `model_out` directories exist
  - Performs basic schema checks (required keys, valid feature modes)

**Design principles**

- Executed once near the top of the notebook (before data loading/training)
- Contains no model training logic
- Provides a single source of truth for experiment configuration
- Prevents accidental overwrites by saving each experiment into its own output folder

This cell must be executed **before any training or evaluation cells**. All experiment
comparisons depend on the consistent configuration established here.


In [2]:
# ===================== CELL 1: PARAMETERS =====================

from pathlib import Path

from imgofup.config.paths import (
    PATHS, CFG, print_summary,
    DISTANCE_OPS, AREA_OPS,
    USE_DYNAMIC_EXTENT_REFS, ALLOW_FALLBACK_EXTENT,
)
from imgofup.config.constants import EXTENT_DIAG_COL, EXTENT_AREA_COL

print_summary()
print("USE_DYNAMIC_EXTENT_REFS:", USE_DYNAMIC_EXTENT_REFS)
print("ALLOW_FALLBACK_EXTENT  :", ALLOW_FALLBACK_EXTENT)
print("EXTENT_DIAG_COL:", EXTENT_DIAG_COL, " EXTENT_AREA_COL:", EXTENT_AREA_COL)

MAP_DIM_CFG = int(CFG.MAP_DIM)
PROMPT_DIM_CFG = int(CFG.PROMPT_DIM)
FUSED_DIM_CFG = MAP_DIM_CFG + PROMPT_DIM_CFG
BATCH_SIZE = int(CFG.BATCH_SIZE)

print("CFG dims -> MAP_DIM:", MAP_DIM_CFG, "| PROMPT_DIM:", PROMPT_DIM_CFG, "| FUSED_DIM:", FUSED_DIM_CFG)
print("BATCH_SIZE:", BATCH_SIZE)

# -------------------------------------------------
# Validate experiment registry (from Cell 0)
# -------------------------------------------------
required_keys = {"train_out", "model_out", "feature_mode"}
allowed_modes = {"prompt_only", "map_only", "prompt_plus_map"}

for exp_name, exp_cfg in EXPERIMENTS.items():
    missing = required_keys - set(exp_cfg.keys())
    if missing:
        raise ValueError(f"Experiment '{exp_name}' is missing keys: {missing}")

    mode = str(exp_cfg["feature_mode"]).strip().lower()
    if mode not in allowed_modes:
        raise ValueError(
            f"Experiment '{exp_name}' has invalid feature_mode='{exp_cfg['feature_mode']}'. "
            f"Allowed: {sorted(allowed_modes)}"
        )
    exp_cfg["feature_mode"] = mode

    exp_cfg["train_out"] = Path(exp_cfg["train_out"])
    exp_cfg["model_out"] = Path(exp_cfg["model_out"])

    exp_cfg["train_out"].mkdir(parents=True, exist_ok=True)
    exp_cfg["model_out"].mkdir(parents=True, exist_ok=True)

    # Prompt encoder required whenever prompts are part of features
    if mode in {"prompt_only", "prompt_plus_map"}:
        if "prompt_encoder_kind" not in exp_cfg:
            raise ValueError(
                f"Experiment '{exp_name}' needs 'prompt_encoder_kind' because feature_mode='{mode}'."
            )

    # No prompt encoder needed for map_only
    if mode == "map_only":
        exp_cfg.pop("prompt_encoder_kind", None)

print("\n🧪 Experiments to be executed:")
for exp_name, exp_cfg in EXPERIMENTS.items():
    pe = exp_cfg.get("prompt_encoder_kind", "-")
    print(
        f" - {exp_name:18s} | "
        f"mode={exp_cfg['feature_mode']:14s} | "
        f"prompt={pe:14s} | "
        f"train_out={exp_cfg['train_out'].name} | "
        f"model_out={exp_cfg['model_out'].name}"
    )

def get_feature_dims_from_cfg(feature_mode: str):
    """Return (map_dim, prompt_dim, input_dim) for a feature mode, using the CFG defaults."""
    fm = str(feature_mode).strip().lower()
    if fm == "prompt_only":
        return 0, PROMPT_DIM_CFG, PROMPT_DIM_CFG
    if fm == "map_only":
        return MAP_DIM_CFG, 0, MAP_DIM_CFG
    if fm == "prompt_plus_map":
        return MAP_DIM_CFG, PROMPT_DIM_CFG, FUSED_DIM_CFG
    raise ValueError(f"Unknown feature_mode: {feature_mode}")


=== CONFIG SUMMARY ===
PROJ_ROOT  : /Users/amirdonyadide/Documents/GitHub/IMGOFUP
DATA_DIR   : /Users/amirdonyadide/Documents/GitHub/IMGOFUP/data
INPUT_DIR  : /Users/amirdonyadide/Documents/GitHub/IMGOFUP/data/input
OUTPUT_DIR : /Users/amirdonyadide/Documents/GitHub/IMGOFUP/data/output
MAPS_ROOT  : /Users/amirdonyadide/Documents/GitHub/IMGOFUP/data/input/samples/pairs
INPUT PAT. : *_input.geojson
--- User Study ---
USER_STUDY_XLSX : /Users/amirdonyadide/Documents/GitHub/IMGOFUP/data/userstudy/UserStudy.xlsx
RESPONSES_SHEET : Responses
TILE_ID_COL     : tile_id
COMPLETE_COL    : complete
REMOVE_COL      : remove
TEXT_COL        : cleaned_text
PARAM_VALUE_COL : param_value
OPERATOR_COL    : operator
INTENSITY_COL   : intensity
--- Filters / IDs / Split ---
ONLY_COMPLETE   : True
EXCLUDE_REMOVED : True
PROMPT_ID_COL   : prompt_id
PROMPT_ID_RULE  : Read from Excel (no generation).
SPLIT_BY        : tile
--- Outputs ---
PROMPT_OUT : /Users/amirdonyadide/Documents/GitHub/IMGOFUP/data/output/

## Step 2: Prompt Embedding Generation (All Experiments)

In this step, we generate vector embeddings for all user prompts for **each experiment
configuration that uses prompts** (`prompt_only` and `prompt_plus_map`); `map_only`
experiments are skipped. The embedding backend is selected per experiment via
`prompt_encoder_kind`, which overrides `CFG.PROMPT_ENCODER` (e.g., **USE-DAN/Transformer**
or **OpenAI** `text-embedding-3-*` models).

### What happens here?

For each entry in the experiment registry that uses prompts:

- Prompts are loaded from the user study source file.
- Only valid prompts are kept (`complete == True`, `remove == False`).
- The prompt embedding model is selected from the experiment configuration
  (mapped to `CFG.PROMPT_ENCODER`).
- All prompts are embedded in batches (with optional L2 normalization).
- The resulting embeddings and metadata are saved to an **experiment-specific folder**
  so no configuration overwrites another.

**Outputs per experiment** (written under `PATHS.PROMPT_OUT/<experiment_name>/`):

- `prompts_embeddings.npz`: matrix `E` and `ids`
- `prompts.parquet`: prompt_id, text, tile_id
- `meta.json`: model label, dimensionality, and export info

### Why this is encapsulated in a helper

To keep the notebook clean and reproducible, all logic related to:

- loading and filtering prompts,
- selecting the embedding backend,
- batching and normalization,
- saving outputs in a consistent format,

is encapsulated in the `imgofup.pipelines.run_prompt_embeddings` module (under `src/`).

The notebook only *orchestrates* experiments by calling this helper with an
experiment-specific configuration.

This design ensures:

- consistent prompt embeddings across training and evaluation,
- easy comparison between USE and OpenAI backends,
- clean separation between experiment orchestration (notebook) and implementation (src),
- safe parallel storage of artifacts for multiple experiment runs.
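
The batching and optional L2 normalization mentioned above can be sketched as follows. This is a minimal, dependency-free illustration and not the code in the helper module: `batched`, `l2_normalize`, `embed_prompts`, and `toy_encoder` are hypothetical names, and the toy encoder only stands in for a real USE or OpenAI backend.

```python
import math
from typing import Callable, Iterator, List, Sequence

Vector = List[float]


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `batch_size` items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def l2_normalize(vec: Sequence[float], eps: float = 1e-12) -> Vector:
    """Scale a vector to unit Euclidean length; near-zero vectors are returned unscaled."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm < eps:
        return list(vec)
    return [x / norm for x in vec]


def embed_prompts(
    texts: Sequence[str],
    encode_batch: Callable[[Sequence[str]], List[Vector]],
    batch_size: int = 32,
    normalize: bool = True,
) -> List[Vector]:
    """Embed all texts batch by batch and return one row per prompt, in input order."""
    rows: List[Vector] = []
    for batch in batched(texts, batch_size):
        vectors = encode_batch(batch)
        if len(vectors) != len(batch):
            raise ValueError("encoder returned a different number of vectors than prompts")
        rows.extend(l2_normalize(v) if normalize else list(v) for v in vectors)
    return rows


def toy_encoder(batch: Sequence[str]) -> List[Vector]:
    """Hypothetical stand-in for a real backend: (character count, word count)."""
    return [[float(len(t)), float(len(t.split()))] for t in batch]


# Two prompts, embedded one per batch and normalized to unit length.
E = embed_prompts(["simplify the buildings", "merge"], toy_encoder, batch_size=1)
```

Keeping the encoder behind a plain callable is what makes the backend swappable per experiment: only `encode_batch` changes, while batching, normalization, and row order stay identical.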


In [3]:
# ===================== CELL 2: Prompt embeddings (experiment-scoped) =====================

from pathlib import Path
from dataclasses import replace

from imgofup.config import paths
from imgofup.config.constants import (
    PROMPT_EMBEDDINGS_NPZ_NAME,
    PROMPTS_PARQUET_NAME,
    PROMPT_EMBED_VERBOSITY_DEFAULT,
    PROMPT_EMBED_L2_NORMALIZE_DEFAULT,
    PROMPT_EMBED_SAVE_CSV_DEFAULT,
)
from imgofup.pipelines.run_prompt_embeddings import run_prompt_embeddings_from_config

print("\n=== Running prompt embeddings for experiments that require prompts ===")

prompt_meta_by_experiment = {}

# IMPORTANT: prompt_id is now read from Excel, so artifacts created before that change are stale.
FORCE_REBUILD_PROMPTS = False  # set True once to recompute regardless of existing artifacts

for exp_name, exp_cfg in EXPERIMENTS.items():
    feature_mode = exp_cfg["feature_mode"]

    if feature_mode == "map_only":
        print(f"\n🧪 Experiment: {exp_name}")
        print("   (skip) feature_mode=map_only → no prompt embeddings required.")
        continue

    prompt_encoder_kind = exp_cfg.get("prompt_encoder_kind", paths.CFG.PROMPT_ENCODER)
    CFG_EXP = replace(paths.CFG, PROMPT_ENCODER=prompt_encoder_kind)

    prompt_out_dir = Path(paths.PATHS.PROMPT_OUT) / exp_name
    prompt_out_dir.mkdir(parents=True, exist_ok=True)

    emb_npz = prompt_out_dir / PROMPT_EMBEDDINGS_NPZ_NAME
    prm_pq  = prompt_out_dir / PROMPTS_PARQUET_NAME

    print(f"\n🧪 Experiment: {exp_name}")
    print(f"   feature_mode   : {feature_mode}")
    print(f"   PROMPT_ENCODER : {CFG_EXP.PROMPT_ENCODER}")
    print(f"   Output dir     : {prompt_out_dir}")

    if (not FORCE_REBUILD_PROMPTS) and emb_npz.exists() and prm_pq.exists():
        print("   ✅ Prompt embeddings already exist - skipping recomputation.")
        meta = {
            "out_dir": str(prompt_out_dir),
            "embeddings_path": str(emb_npz),
            "prompts_parquet_path": str(prm_pq),
            "skipped": True,
        }
    else:
        meta = run_prompt_embeddings_from_config(
            input_path=Path(paths.PATHS.USER_STUDY_XLSX),
            out_dir=prompt_out_dir,
            cfg=CFG_EXP,
            paths=paths.PATHS,
            verbosity=PROMPT_EMBED_VERBOSITY_DEFAULT,
            l2_normalize=PROMPT_EMBED_L2_NORMALIZE_DEFAULT,
            also_save_embeddings_csv=PROMPT_EMBED_SAVE_CSV_DEFAULT,
        )
        print("   ✅ Prompt embeddings completed.")

    prompt_meta_by_experiment[exp_name] = meta

print("\n✅ Prompt embedding step finished.")



=== Running prompt embeddings for experiments that require prompts ===

🧪 Experiment: openai_prompt_only
   feature_mode   : prompt_only
   PROMPT_ENCODER : openai-small
   Output dir     : /Users/amirdonyadide/Documents/GitHub/IMGOFUP/data/output/prompt_out/openai_prompt_only
   ✅ Prompt embeddings already exist - skipping recomputation.

🧪 Experiment: use_prompt_only
   feature_mode   : prompt_only
   PROMPT_ENCODER : dan
   Output dir     : /Users/amirdonyadide/Documents/GitHub/IMGOFUP/data/output/prompt_out/use_prompt_only
   ✅ Prompt embeddings already exist - skipping recomputation.

🧪 Experiment: map_only
   (skip) feature_mode=map_only → no prompt embeddings required.

🧪 Experiment: use_map
   feature_mode   : prompt_plus_map
   PROMPT_ENCODER : dan
   Output dir     : /Users/amirdonyadide/Documents/GitHub/IMGOFUP/data/output/prompt_out/use_map
   ✅ Prompt embeddings already exist - skipping recomputation.

🧪 Experiment: openai_map
   feature_mode 

## Step 3: Map Embeddings (Dynamic, Shared Across Experiments)

In this step, we compute **map embeddings** for all GeoJSON tiles that are eligible for the user study.  
Because map embeddings depend only on the input map data (not on the prompt encoder), they are computed **once** and stored in a **shared output folder**, then reused across all experiments.

### What this step does

1. **Filters tiles using the user study Excel**
   - Keeps only rows marked as *complete*
   - Excludes rows marked as *removed*
   - Extracts the set of allowed `tile_id`s (used to select which map folders to embed)

2. **Discovers and embeds GeoJSON maps**
   - Finds all GeoJSON files under `PATHS.MAPS_ROOT`
   - Keeps only those whose `map_id` is in the allowed set
   - Counts valid polygons per map to determine a dataset-wide `max_polygons`
     (used to normalize the `poly_count` feature safely)

3. **Computes map embeddings**
   - Uses `norm="extent"` for **dynamic per-map normalization**
   - Ensures all vectors have consistent dimensionality
   - Skips maps with invalid geometries or degenerate extents

4. **Stores dynamic extent references (required for parameter scaling)**
   - `extent_diag_m`
   - `extent_area_m2`

   These are saved alongside embeddings and are later used to convert
   normalized parameters (`param_norm`) into real-world units:
   - distance operators → meters via `extent_diag_m`
   - area operators → m² via `extent_area_m2`

5. **Writes outputs once to a shared directory**
   - Prevents redundant computation across experiments
   - Guarantees every experiment uses the **same map representation**
   - Avoids accidental overwrites while keeping artifacts reusable

### Why this is important

- Ensures **consistent normalization** between training and evaluation  
- Provides the necessary per-map reference scales for parameter un-normalization  
- Improves reproducibility and efficiency by reusing identical map embeddings  
- Supports fair comparison between:
  - prompt-only baselines (which ignore map embeddings)
  - fused prompt + map hybrids
  - different prompt backends (USE vs OpenAI)

At the end of this step, the repository contains a self-contained set of map embeddings
ready to be concatenated with prompt embeddings in the next stage.
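
The role of the two extent references can be illustrated with a short sketch. It assumes, purely for illustration, that `extent_diag_m` is the diagonal and `extent_area_m2` the area of a map's axis-aligned bounding box in a metric coordinate system; the pipeline's actual definition is not shown in this notebook. `ExtentRefs` and `extent_refs_from_bbox` are hypothetical names.

```python
import math
from typing import NamedTuple, Optional


class ExtentRefs(NamedTuple):
    extent_diag_m: float   # reference length for distance operators
    extent_area_m2: float  # reference area for area operators


def extent_refs_from_bbox(
    minx: float, miny: float, maxx: float, maxy: float
) -> Optional[ExtentRefs]:
    """Return bounding-box diagonal and area, or None for a degenerate extent.

    ASSUMPTION: the references are derived from the bounding box; the real
    pipeline may define them differently.
    """
    width = maxx - minx
    height = maxy - miny
    # Written so that NaN or non-positive sizes are treated as degenerate.
    if not (width > 0 and height > 0):
        return None
    return ExtentRefs(math.hypot(width, height), width * height)


# A 300 m x 400 m tile: diagonal 500 m, area 120,000 m².
refs = extent_refs_from_bbox(0.0, 0.0, 300.0, 400.0)

# Converting normalized parameters back to real-world units.
distance_m = 0.02 * refs.extent_diag_m   # roughly 10 m
area_m2 = 0.001 * refs.extent_area_m2    # roughly 120 m²

# A zero-width extent cannot serve as a reference scale, so it is skipped.
degenerate = extent_refs_from_bbox(5.0, 0.0, 5.0, 10.0)
```

Returning `None` for degenerate extents mirrors the "skip maps with degenerate extents" rule above: a zero reference would make the later division undefined.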

In [4]:
# ===================== CELL 3: Map embeddings (shared) =====================

from pathlib import Path
from imgofup.config import paths
from imgofup.pipelines.run_map_embeddings import run_map_embeddings_from_config

# Map embeddings do NOT depend on prompt backend, so compute once and reuse.
MAP_EMB_DIR = Path(paths.PATHS.MAP_OUT) / "shared_extent"
MAP_EMB_DIR.mkdir(parents=True, exist_ok=True)

maps_npz = MAP_EMB_DIR / "maps_embeddings.npz"
maps_pq  = MAP_EMB_DIR / "maps.parquet"

print("\n=== Map embeddings (shared across all experiments) ===")
print("Target dir:", MAP_EMB_DIR)

# If you changed anything about map embedding logic, set this True once.
FORCE_REBUILD_MAPS = False

if (not FORCE_REBUILD_MAPS) and maps_npz.exists() and maps_pq.exists():
    print("✅ Map embeddings already exist - skipping recomputation.")
    map_meta = {"out_dir": str(MAP_EMB_DIR), "skipped": True}
else:
    map_meta = run_map_embeddings_from_config(
        maps_root=Path(paths.PATHS.MAPS_ROOT),
        input_pattern=paths.PATHS.INPUT_MAPS_PATTERN,
        user_study_xlsx=Path(paths.PATHS.USER_STUDY_XLSX),
        responses_sheet=paths.PATHS.RESPONSES_SHEET,
        tile_id_col=paths.PATHS.TILE_ID_COL,
        complete_col=paths.PATHS.COMPLETE_COL,
        remove_col=paths.PATHS.REMOVE_COL,
        only_complete=False,
        exclude_removed=False,
        out_dir=MAP_EMB_DIR,
        verbosity=1,
        norm="extent",
    )
    print("✅ Map embeddings completed.")

if not maps_npz.exists():
    raise FileNotFoundError(f"Missing maps_embeddings.npz at: {maps_npz}")
if not maps_pq.exists():
    raise FileNotFoundError(f"Missing maps.parquet at: {maps_pq}")

print("✅ Map embedding artifacts ready:")
print(" -", maps_npz)
print(" -", maps_pq)
print("MAP_EMB_DIR:", MAP_EMB_DIR)
print(map_meta)


=== Map embeddings (shared across all experiments) ===
Target dir: /Users/amirdonyadide/Documents/GitHub/IMGOFUP/data/output/map_out/shared_extent
✅ Map embeddings completed.
✅ Map embedding artifacts ready:
 - /Users/amirdonyadide/Documents/GitHub/IMGOFUP/data/output/map_out/shared_extent/maps_embeddings.npz
 - /Users/amirdonyadide/Documents/GitHub/IMGOFUP/data/output/map_out/shared_extent/maps.parquet
MAP_EMB_DIR: /Users/amirdonyadide/Documents/GitHub/IMGOFUP/data/output/map_out/shared_extent
MapEmbeddingRunMeta(n_tiles_allowed=551, n_maps_found=824, n_maps_used=551, max_polygons=731, out_dir='/Users/amirdonyadide/Documents/GitHub/IMGOFUP/data/output/map_out/shared_extent', embeddings_path='/Users/amirdonyadide/Documents/GitHub/IMGOFUP/data/output/map_out/shared_extent/maps_embeddings.npz', maps_parquet_path='/Users/amirdonyadide/Documents/GitHub/IMGOFUP/data/output/map_out/shared_extent/maps.parquet', n_skipped_bad_extent=0, n_skipped_dim_mismatch=0, n_failed_embed=0)


### 🔢 Inferring Embedding Dimensions (Experiment-Aware)

In this step, we **infer embedding dimensionalities directly from the saved embedding files**
rather than relying on configuration defaults. Dimensions are inferred **per experiment** to
account for different prompt embedding backends (e.g., USE vs OpenAI), while map embeddings
are shared across experiments.

This ensures:
- A **single source of truth** for feature dimensions
- **Consistency** between training and evaluation pipelines
- Robustness to changes in embedding models or backend configurations
- Correct handling of mixed feature modes (prompt-only, map-only, and fused prompt + map)

Specifically, we:

- Load **prompt embeddings** from  
  `PATHS.PROMPT_OUT/<experiment_name>/prompts_embeddings.npz`
- Load **map embeddings** from the shared map embedding directory  
  `PATHS.MAP_OUT/shared_extent/maps_embeddings.npz`
- Infer dimensions as follows:
  - `PROMPT_DIM`: from prompt embeddings (per experiment; `0` for `map_only`)
  - `MAP_DIM`: from map embeddings (shared; recorded as `0` for `prompt_only`)
  - `FUSED_DIM`: the final input dimension, computed per experiment:
    - `prompt_only` → `PROMPT_DIM`
    - `map_only` → `MAP_DIM`
    - `prompt_plus_map` → `MAP_DIM + PROMPT_DIM`

If inferred dimensions differ from those defined in the global configuration (`CFG`),
the inferred values take precedence for all downstream processing.

The inferred dimensions are stored in the experiment registry and used consistently by:
- feature preprocessing
- operator classification
- parameter regression
- evaluation and inference

This guarantees that all downstream models operate on **correctly shaped feature vectors**
and that comparisons between experiments remain valid and reproducible.


In [5]:
# ===================== CELL 4: Infer embedding dimensions (multi-experiment, incl. map-only) =====================

from pathlib import Path
import numpy as np
from imgofup.config import paths

def _infer_dim_from_npz(npz_path: Path) -> int:
    if not npz_path.exists():
        raise FileNotFoundError(f"Missing embeddings file: {npz_path}")
    with np.load(npz_path, allow_pickle=True) as z:
        if "E" not in z:
            raise ValueError(f"{npz_path} missing array 'E'")
        E = z["E"]
    if E.ndim != 2 or E.shape[1] <= 0:
        raise ValueError(f"Invalid embedding matrix in {npz_path}: shape={E.shape}")
    return int(E.shape[1])

# -------------------------------
# Map dim (shared across all experiments)
# -------------------------------
maps_npz = Path(MAP_EMB_DIR) / "maps_embeddings.npz"
MAP_DIM_INF = _infer_dim_from_npz(maps_npz)
print("✅ Inferred MAP_DIM from shared maps:", MAP_DIM_INF)

# -------------------------------
# Prompt dim per experiment + final input dim per experiment
# -------------------------------
dims_by_experiment = {}

PROMPT_BASED_MODES = {"prompt_only", "prompt_plus_map"}

for exp_name, exp_cfg in EXPERIMENTS.items():
    feature_mode = str(exp_cfg["feature_mode"]).strip().lower()

    # infer prompt dim if this mode uses prompts
    PROMPT_DIM_INF = 0
    if feature_mode in PROMPT_BASED_MODES:
        prm_npz = Path(paths.PATHS.PROMPT_OUT) / exp_name / "prompts_embeddings.npz"
        PROMPT_DIM_INF = _infer_dim_from_npz(prm_npz)

    if feature_mode == "prompt_only":
        map_dim = 0
        prompt_dim = PROMPT_DIM_INF
        fused_dim = PROMPT_DIM_INF

    elif feature_mode == "map_only":
        map_dim = MAP_DIM_INF
        prompt_dim = 0
        fused_dim = MAP_DIM_INF

    elif feature_mode == "prompt_plus_map":
        map_dim = MAP_DIM_INF
        prompt_dim = PROMPT_DIM_INF
        fused_dim = MAP_DIM_INF + PROMPT_DIM_INF

    else:
        raise ValueError(f"Unknown feature_mode for {exp_name}: {feature_mode}")

    exp_cfg["map_dim"] = int(map_dim)
    exp_cfg["prompt_dim"] = int(prompt_dim)
    exp_cfg["fused_dim"] = int(fused_dim)

    dims_by_experiment[exp_name] = {
        "feature_mode": feature_mode,
        "MAP_DIM": int(map_dim),
        "PROMPT_DIM": int(prompt_dim),
        "FUSED_DIM": int(fused_dim),
    }


print("\n✅ Inferred dims per experiment:")
for exp_name, d in dims_by_experiment.items():
    print(
        f" - {exp_name:12s} | mode={d['feature_mode']:14s} | "
        f"MAP_DIM={d['MAP_DIM']:4d} | PROMPT_DIM={d['PROMPT_DIM']:4d} | FUSED_DIM={d['FUSED_DIM']:4d}"
    )

if MAP_DIM_INF != int(paths.CFG.MAP_DIM):
    print("\n⚠️ CONFIG.CFG.MAP_DIM differs from inferred MAP_DIM (using inferred for experiments).")
    print(f"   inferred MAP_DIM={MAP_DIM_INF} vs CONFIG.CFG.MAP_DIM={paths.CFG.MAP_DIM}")

✅ Inferred MAP_DIM from shared maps: 165

✅ Inferred dims per experiment:
 - openai_prompt_only | mode=prompt_only    | MAP_DIM=   0 | PROMPT_DIM=1536 | FUSED_DIM=1536
 - use_prompt_only | mode=prompt_only    | MAP_DIM=   0 | PROMPT_DIM= 512 | FUSED_DIM= 512
 - map_only     | mode=map_only       | MAP_DIM= 165 | PROMPT_DIM=   0 | FUSED_DIM= 165
 - use_map      | mode=prompt_plus_map | MAP_DIM= 165 | PROMPT_DIM= 512 | FUSED_DIM= 677
 - openai_map   | mode=prompt_plus_map | MAP_DIM= 165 | PROMPT_DIM=1536 | FUSED_DIM=1701


### 🔗 Feature Construction (Prompt-Only vs. Fused Map + Prompt Embeddings)

In this step, we construct the final feature matrices used for training and evaluation by
aligning prompt embeddings with map embeddings and exporting **experiment-scoped** artifacts.

#### What this step does (per experiment)

- Loads **prompt embeddings** from  
  `PATHS.PROMPT_OUT/<experiment_name>/prompts_embeddings.npz`
- Loads **map embeddings** from the shared map embedding directory  
  `PATHS.MAP_OUT/<shared_folder>/maps_embeddings.npz`
- Aligns samples via the authoritative pairing table (`prompts.parquet` → `map_id/tile_id` + `prompt_id`)
- Merges **dynamic map extent metadata** (e.g., `extent_diag_m`, `extent_area_m2`) from `maps.parquet`
  so downstream regression can convert normalized parameters into real-world units.

#### Feature modes supported

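- **`prompt_only`**  
  Uses only prompt vectors:  
  `X = prompt_embedding`

- **`map_only`**  
  Uses only map vectors (the pairing table is borrowed from a prompt-based experiment):  
  `X = map_embedding`

- **`prompt_plus_map`**  
  Concatenates map and prompt vectors:  
  `X = [map_embedding | prompt_embedding]`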

#### Outputs written per experiment

All artifacts are saved into the experiment's `train_out` directory (to prevent overwrites):

- `X_<experiment_name>.npy`: final feature matrix for the experiment's feature mode
- `train_pairs_<experiment_name>.parquet`: aligned metadata (including `operator`, `param_value`, and extent references)
- `meta.json`: provenance (sources, options) and shape information

#### Why this design

- Keeps **training and evaluation perfectly aligned** by exporting a single, consistent pairing table
- Avoids hard-coded dimensions by relying on the saved embedding files
- Supports **multiple experiments side-by-side** without overwriting artifacts
- Enables **dynamic extent-aware** parameter regression (meters / m² scaling) downstream
- Ensures fair comparison: prompt-only baselines vs. fused map+prompt models

After this step, each experiment has a complete, self-contained dataset ready for:
1. Operator classification  
2. Per-operator parameter regression  
3. End-to-end evaluation in the evaluation notebook
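
The alignment and concatenation described above can be sketched with plain Python containers. This is an illustration under stated assumptions, not the implementation of `run_concat_features_from_dirs`: `build_features` is a hypothetical name, embeddings are shown as dictionaries keyed by id rather than NumPy arrays, and dropping pairs with a missing embedding is an assumed policy.

```python
from typing import Dict, List, Sequence, Tuple

Vector = List[float]
Pair = Tuple[str, str]  # (map_id, prompt_id)

VALID_MODES = {"prompt_only", "map_only", "prompt_plus_map"}


def build_features(
    pairs: Sequence[Pair],
    map_vecs: Dict[str, Vector],
    prompt_vecs: Dict[str, Vector],
    feature_mode: str,
) -> Tuple[List[Vector], List[Pair]]:
    """Build one feature row per (map_id, prompt_id) pair.

    Rows follow the order of the pairing table. ASSUMPTION: pairs whose
    required embedding is missing are dropped rather than raising an error.
    """
    if feature_mode not in VALID_MODES:
        raise ValueError(f"Unknown feature_mode: {feature_mode}")
    use_map = feature_mode in {"map_only", "prompt_plus_map"}
    use_prompt = feature_mode in {"prompt_only", "prompt_plus_map"}

    X: List[Vector] = []
    kept: List[Pair] = []
    for map_id, prompt_id in pairs:
        if use_map and map_id not in map_vecs:
            continue
        if use_prompt and prompt_id not in prompt_vecs:
            continue
        row: Vector = []
        if use_map:
            row.extend(map_vecs[map_id])        # map block comes first
        if use_prompt:
            row.extend(prompt_vecs[prompt_id])  # prompt block follows
        X.append(row)
        kept.append((map_id, prompt_id))
    return X, kept


# Toy data: tile "t9" has no map embedding.
demo_pairs = [("t1", "p1"), ("t1", "p2"), ("t9", "p3")]
demo_maps = {"t1": [1.0, 2.0]}
demo_prompts = {"p1": [0.1], "p2": [0.2], "p3": [0.3]}

X_fused, kept_fused = build_features(demo_pairs, demo_maps, demo_prompts, "prompt_plus_map")
X_prompt, kept_prompt = build_features(demo_pairs, demo_maps, demo_prompts, "prompt_only")
X_map, kept_map = build_features(demo_pairs, demo_maps, demo_prompts, "map_only")
```

Returning the kept pairs next to `X` is what keeps the feature matrix and the pairing table row-aligned, which the downstream label attachment relies on.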


In [6]:
# ===================== CELL 5: Feature construction (multi-experiment) =====================

from pathlib import Path
from imgofup.config import paths

from imgofup.pipelines.run_concat_features import run_concat_features_from_dirs

print("\n=== Building feature matrices for all experiments ===")

concat_meta_by_experiment = {}

# Because prompt_id changed, you should rebuild features at least once.
FORCE_REBUILD_FEATURES = True  # set False later

# Choose a canonical source of prompts.parquet (pairs table) for map_only.
# Any prompt-based experiment works as long as it produced prompts.parquet.
PAIRS_SOURCE_EXP = "use_prompt_only"
PAIRS_PARQUET_CANON = Path(paths.PATHS.PROMPT_OUT) / PAIRS_SOURCE_EXP / "prompts.parquet"

if not PAIRS_PARQUET_CANON.exists():
    raise FileNotFoundError(
        f"Expected pairs parquet for map_only at {PAIRS_PARQUET_CANON}. "
        f"Run Cell 2 (prompt embeddings) for '{PAIRS_SOURCE_EXP}' first."
    )

for exp_name, exp_cfg in EXPERIMENTS.items():
    feature_mode = exp_cfg["feature_mode"]

    train_out_dir = Path(exp_cfg["train_out"])
    train_out_dir.mkdir(parents=True, exist_ok=True)

    map_out_dir = MAP_EMB_DIR
    prompt_out_dir = Path(paths.PATHS.PROMPT_OUT) / exp_name  # per-experiment prompt embeddings dir

    # ✅ NEW: for map_only we still need a pairs table, but not prompt embeddings
    pairs_parquet = None
    if feature_mode == "map_only":
        pairs_parquet = PAIRS_PARQUET_CANON

    print(f"\n🧪 Experiment: {exp_name}")
    print(f"   Feature mode : {feature_mode}")
    print(f"   Prompt out   : {prompt_out_dir}")
    if pairs_parquet is not None:
        print(f"   Pairs parquet: {pairs_parquet}  (shared)")
    print(f"   Map out      : {map_out_dir}")
    print(f"   Train out    : {train_out_dir}")

    # Expected outputs for this experiment
    X_expected = train_out_dir / f"X_{exp_name}.npy"
    pairs_expected = train_out_dir / f"train_pairs_{exp_name}.parquet"

    if (not FORCE_REBUILD_FEATURES) and X_expected.exists() and pairs_expected.exists():
        print("   ✅ Features already exist - skipping recomputation.")
        meta = {
            "skipped": True,
            "X_path": str(X_expected),
            "pairs_path": str(pairs_expected),
        }
    else:
        meta = run_concat_features_from_dirs(
            prompt_out_dir=prompt_out_dir,
            map_out_dir=map_out_dir,
            out_dir=train_out_dir,
            exp_name=exp_name,
            feature_mode=feature_mode,
            verbosity=1,
            prompt_id_width=4,
            pairs_parquet=pairs_parquet,
        )
        print("   ✅ Feature construction completed.")

    concat_meta_by_experiment[exp_name] = meta

print("\n✅ All feature construction finished.")



=== Building feature matrices for all experiments ===

üß™ Experiment: openai_prompt_only
   Feature mode : prompt_only
   Prompt out   : /Users/amirdonyadide/Documents/GitHub/IMGOFUP/data/output/prompt_out/openai_prompt_only
   Map out      : /Users/amirdonyadide/Documents/GitHub/IMGOFUP/data/output/map_out/shared_extent
   Train out    : /Users/amirdonyadide/Documents/GitHub/IMGOFUP/data/output/train_out_openai_prompt_only
   ‚úÖ Feature construction completed.

üß™ Experiment: use_prompt_only
   Feature mode : prompt_only
   Prompt out   : /Users/amirdonyadide/Documents/GitHub/IMGOFUP/data/output/prompt_out/use_prompt_only
   Map out      : /Users/amirdonyadide/Documents/GitHub/IMGOFUP/data/output/map_out/shared_extent
   Train out    : /Users/amirdonyadide/Documents/GitHub/IMGOFUP/data/output/train_out_use_prompt_only
   ✅ Feature construction completed.

🧪 Experiment: map_only
   Feature mode : map_only
   Prompt out   : /Users/amirdonyadide/Documents/GitHub/IMGOFUP/data/o

## Step 6 - Load Training Data and Build Normalized Regression Target (Experiment-Aware)

In this step, we load the **experiment-specific training data** produced by the feature
construction stage and build the final learning targets for both classification and regression.

Because this notebook runs **multiple experiments**, the same procedure is applied
**independently for each experiment**, using its own `train_out` directory.

### What happens here (per experiment)

1. **Load the feature matrix**
   - `prompt_only` experiments load prompt-only features (e.g., `X_prompt.npy`)
   - fused experiments load concatenated features (e.g., `X_concat.npy`)

2. **Load the paired metadata table**
   - `train_pairs.parquet` containing aligned `(map_id, prompt_id)` rows and dynamic extent references

3. **Attach labels and apply consistent filtering**
   - Ensures only valid user study rows are included (e.g., `complete == True`, `remove == False`)
   - Ensures `operator` and `param_value` are present (and prompt text if required)

4. **Validate dynamic extent references**
   - Confirms the presence of per-map reference scales required for normalization:
     - `extent_diag_m`
     - `extent_area_m2`

5. **Compute the normalized regression target `param_norm`**
   Normalization depends on the operator group:

   - **Distance-based operators** (`aggregate`, `displace`, `simplify`):  
     `param_norm = param_value / extent_diag_m`

   - **Area-based operators** (`select`):  
     `param_norm = param_value / extent_area_m2`

### Why this normalization is used

This step converts parameters from heterogeneous, map-scale-dependent units into a
scale-aware normalized target. It allows per-operator regressors to generalize across
maps of different extents while preserving physical meaning during inference
(when `param_norm` is converted back to meters or m² using the same extent references).
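
The normalization and its inverse can be sketched as below. This is a minimal illustration, not the code inside `load_training_data_with_dynamic_param_norm`: the helper names `to_param_norm` and `from_param_norm` are hypothetical, and the operator groups simply mirror the lists above.

```python
# Operator groups as described in this step (assumed to match paths.DISTANCE_OPS / paths.AREA_OPS).
DISTANCE_OPS = {"aggregate", "displace", "simplify"}  # normalized by the extent diagonal (m)
AREA_OPS = {"select"}                                 # normalized by the extent area (m^2)


def _reference_scale(operator, extent_diag_m, extent_area_m2):
    """Pick the per-map reference scale that matches the operator group."""
    if operator in DISTANCE_OPS:
        scale = extent_diag_m
    elif operator in AREA_OPS:
        scale = extent_area_m2
    else:
        raise ValueError(f"Unknown operator: {operator!r}")
    if not scale > 0:  # also rejects NaN
        raise ValueError(f"Reference scale must be positive, got {scale}")
    return scale


def to_param_norm(operator, param_value, extent_diag_m, extent_area_m2):
    """Training direction: real-world units -> normalized target."""
    return param_value / _reference_scale(operator, extent_diag_m, extent_area_m2)


def from_param_norm(operator, param_norm, extent_diag_m, extent_area_m2):
    """Inference direction: normalized prediction -> meters or square meters."""
    return param_norm * _reference_scale(operator, extent_diag_m, extent_area_m2)
```

Because both directions use the same per-map reference, a prediction made on a map of a different size is rescaled by that map's own extent rather than by the extent seen during training.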

### Outputs

For each experiment, we obtain:

- `X`: feature matrix aligned with labels  
- `df`: cleaned metadata table including `operator`, `param_value`, `param_norm`, and extent references  

These outputs feed directly into the subsequent training stages:
1. Operator classification  
2. Per-operator parameter regression  
3. End-to-end evaluation across experiments


In [7]:
# ===================== CELL 6 - Load training data + compute param_norm (collision-proof) =====================

from dataclasses import replace
from pathlib import Path

from imgofup.config import paths
from imgofup.datasets.load_training_data import load_training_data_with_dynamic_param_norm

TRAIN_DATA = {}

print("\n=== Loading training data for all experiments (unified loader) ===")
print("CONFIG type:", type(paths))

for exp_name, exp_cfg in EXPERIMENTS.items():
    train_out_dir = Path(exp_cfg["train_out"])
    if not train_out_dir.exists():
        raise FileNotFoundError(f"Missing train_out for {exp_name}: {train_out_dir}")

    feature_mode = str(exp_cfg["feature_mode"]).strip().lower()

    print(f"\n🧪 Experiment: {exp_name}")
    print(f"   train_out : {train_out_dir}")
    print(f"   mode      : {feature_mode}")

    PATHS_EXP = replace(paths.PATHS, TRAIN_OUT=str(train_out_dir))

    data = load_training_data_with_dynamic_param_norm(
        exp_name=exp_name,
        feature_mode=feature_mode,
        paths=PATHS_EXP,
        cfg=paths.CFG,
        distance_ops=paths.DISTANCE_OPS,
        area_ops=paths.AREA_OPS,
        require_text=True,
    )

    X = data.X
    df = data.df

    print(f"   ✅ Loaded: X={X.shape} | df={df.shape}")
    print("   Operators:", sorted(df[PATHS_EXP.OPERATOR_COL].dropna().unique().tolist()))

    TRAIN_DATA[exp_name] = {"X": X, "df": df, "paths": PATHS_EXP}

first_key = next(iter(TRAIN_DATA.keys()))



=== Loading training data for all experiments (unified loader) ===
CONFIG type: <class 'module'>

🧪 Experiment: openai_prompt_only
   train_out : /Users/amirdonyadide/Documents/GitHub/IMGOFUP/data/output/train_out_openai_prompt_only
   mode      : prompt_only
   ✅ Loaded: X=(562, 1536) | df=(562, 15)
   Operators: ['aggregate', 'displace', 'select', 'simplify']

🧪 Experiment: use_prompt_only
   train_out : /Users/amirdonyadide/Documents/GitHub/IMGOFUP/data/output/train_out_use_prompt_only
   mode      : prompt_only
   ✅ Loaded: X=(562, 512) | df=(562, 15)
   Operators: ['aggregate', 'displace', 'select', 'simplify']

🧪 Experiment: map_only
   train_out : /Users/amirdonyadide/Documents/GitHub/IMGOFUP/data/output/train_out_map_only
   mode      : map_only
   ✅ Loaded: X=(562, 165) | df=(562, 15)
   Operators: ['aggregate', 'displace', 'select', 'simplify']

🧪 Experiment: use_map
   train_out : /Users/amirdonyadide/Documents/GitHub/IMGOFUP/data/output/train_out_use_map


## Step 7 - Train/Validation/Test Split (Shared, Leakage-Free by Map)

In this step, we construct a **reproducible and fair** train/validation/test split that is
**shared across all experiments** (prompt-only, map-only, USE + map, OpenAI + map).  
The split is computed **once** on a reference experiment and then applied to every experiment
by matching `(map_id, prompt_id)` keys, so that performance differences are attributable to
the feature representation rather than to the data partition.

---

### Constraints enforced

- **No leakage by map**  
  The same `map_id` never appears in more than one split (train / validation / test).

- **Multi-prompt maps are forced into TRAIN**  
  If a map has multiple prompts, *all* corresponding samples are assigned to the
  training set.  
  As a result, validation and test sets contain **single-prompt maps only**.

- **Consistency across experiments**  
  The exact same samples (identified by `(map_id, prompt_id)`) are used for
  train/validation/test in every experiment.

---

### Stratification strategy

To obtain balanced splits while respecting the above constraints, stratification is applied as:

- Primary: `operator × intensity` (if sufficient samples exist), otherwise
- Fallback: `operator` only (automatically selected if finer stratification is infeasible)
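
The constraints and the operator-level fallback can be sketched in plain Python. This is an assumption-laden illustration, not the implementation of `make_splits_multi_prompt_to_train`: the function name, the per-operator rounding, and the tuple-based input are all hypothetical, and the sketch does not retry until every split covers every operator.

```python
import random
from collections import Counter, defaultdict


def split_multi_prompt_to_train(rows, val_ratio=0.1, test_ratio=0.1, seed=0):
    """rows: iterable of (map_id, prompt_id, operator) tuples.

    Returns {"train" | "val" | "test": [(map_id, prompt_id), ...]}.
    """
    rows = list(rows)
    prompts_per_map = Counter(map_id for map_id, _, _ in rows)

    splits = {"train": [], "val": [], "test": []}
    single_by_operator = defaultdict(list)

    for map_id, prompt_id, operator in rows:
        if prompts_per_map[map_id] > 1:
            # Multi-prompt maps are forced into TRAIN as a whole.
            splits["train"].append((map_id, prompt_id))
        else:
            single_by_operator[operator].append((map_id, prompt_id))

    # Stratify the single-prompt maps by operator (the fallback strategy).
    rng = random.Random(seed)
    for operator in sorted(single_by_operator):
        keys = sorted(single_by_operator[operator])  # sort first so the shuffle is deterministic
        rng.shuffle(keys)
        n_val = int(round(val_ratio * len(keys)))
        n_test = int(round(test_ratio * len(keys)))
        splits["val"] += keys[:n_val]
        splits["test"] += keys[n_val:n_val + n_test]
        splits["train"] += keys[n_val + n_test:]
    return splits
```

Since validation and test only ever receive single-prompt maps, no `map_id` can appear in two splits, which is the leakage guarantee listed under the constraints.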

---

### Coverage requirement

Each split is required to contain **all operators** in the fixed class set:

- `simplify`
- `select`
- `aggregate`
- `displace`

This guarantees that classification and regression models can be trained and evaluated
for every operator.

---

### Outputs

- A single shared split definition is saved to disk as:  
  `splits_shared.json`
- For each experiment, the split is applied to slice:
  - `X`: the feature matrix
  - `df`: the aligned metadata table

The resulting subsets (`train`, `val`, `test`) are then used in all downstream
training and evaluation steps.
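
Transferring the split by key rather than by row position can be sketched as follows. The helper names are hypothetical and the `"p1"`-style prompt ids are made up for illustration; only the `zfill(4)` padding and the `::` separator are taken from the split cell.

```python
def row_key(map_id, prompt_id, width=4):
    """Stable per-row key; map_id is zero-padded to a fixed width."""
    return f"{str(map_id).zfill(width)}::{prompt_id}"


def positions_for_keys(rows, split_keys):
    """rows: (map_id, prompt_id) pairs in this experiment's row order.

    Returns the row positions whose key belongs to split_keys and fails
    loudly if the experiment lacks a row that the shared split expects.
    """
    keys = [row_key(m, p) for m, p in rows]
    missing = set(split_keys) - set(keys)
    if missing:
        raise ValueError(f"{len(missing)} split rows missing, e.g. {sorted(missing)[:5]}")
    return [i for i, k in enumerate(keys) if k in split_keys]
```

Two experiments whose tables hold the same rows in a different order then select the same samples, even though the returned positions differ.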

---

This design ensures:
- **Leakage-free evaluation**
- **Fair, apples-to-apples comparison** between experiments
- **Reproducibility**, since the split is deterministic and persisted to disk


In [8]:
# ===================== CELL 7 - Shared Train/Val/Test Split (fair across experiments, incl. map-only) =====================

from pathlib import Path

from imgofup.config import paths
from imgofup.datasets.splitting import make_splits_multi_prompt_to_train

FIXED_CLASSES = ["simplify", "select", "aggregate", "displace"]
USE_INTENSITY_FOR_STRAT = True

OP_COL  = paths.PATHS.OPERATOR_COL
INT_COL = paths.PATHS.INTENSITY_COL

# Where to save ONE shared split (used by all experiments)
SPLITS_DIR = Path(paths.PATHS.SPLIT_OUT)
SPLITS_DIR.mkdir(parents=True, exist_ok=True)
split_path = SPLITS_DIR / "splits_shared.json"

# -------------------------------
# Choose a stable reference experiment
# Prefer prompt-based; avoid map_only if possible.
# -------------------------------
preferred_order = ["use_prompt_only", "use_map", "openai_map", "map_only"]
ref_exp = next((name for name in preferred_order if name in TRAIN_DATA), None)
if ref_exp is None:
    ref_exp = next(iter(TRAIN_DATA.keys()))

ref_df = TRAIN_DATA[ref_exp]["df"].copy()
ref_X  = TRAIN_DATA[ref_exp]["X"]

# Must have keys for stable matching across experiments
if not {"map_id", "prompt_id"}.issubset(ref_df.columns):
    raise ValueError("Expected columns {'map_id','prompt_id'} in df for split mapping.")

# Must have operator for stratification constraints
if OP_COL not in ref_df.columns:
    raise ValueError(f"Reference df missing operator column '{OP_COL}'. Cannot build stratified split.")

# Build a stable key per row for mapping splits across experiments
ref_df["row_key"] = ref_df["map_id"].astype(str).str.zfill(4) + "::" + ref_df["prompt_id"].astype(str)

print(f"\n=== Computing shared split using reference experiment: {ref_exp} ===")
print("ref_df:", ref_df.shape, "| ref_X:", ref_X.shape)

split = make_splits_multi_prompt_to_train(
    df=ref_df,
    X=ref_X,
    op_col=OP_COL,
    intensity_col=INT_COL if (USE_INTENSITY_FOR_STRAT and INT_COL in ref_df.columns) else None,
    map_id_col="map_id",
    fixed_classes=FIXED_CLASSES,
    use_intensity_for_strat=USE_INTENSITY_FOR_STRAT,
    seed=int(paths.CFG.SEED),
    val_ratio=float(paths.CFG.VAL_RATIO),
    test_ratio=float(paths.CFG.TEST_RATIO),
    max_attempts=500,
    save_splits_json=split_path,
    verbose=True,
)

train_idx_ref, val_idx_ref, test_idx_ref = split.train_idx, split.val_idx, split.test_idx

# Convert indices -> row_key sets (transfer across experiments)
train_keys = set(ref_df.loc[train_idx_ref, "row_key"].tolist())
val_keys   = set(ref_df.loc[val_idx_ref,   "row_key"].tolist()) if len(val_idx_ref) else set()
test_keys  = set(ref_df.loc[test_idx_ref,  "row_key"].tolist()) if len(test_idx_ref) else set()

# Sanity: no overlap
assert train_keys.isdisjoint(val_keys)
assert train_keys.isdisjoint(test_keys)
assert val_keys.isdisjoint(test_keys)

print("\n✅ Shared split created:")
print(f"   Train keys: {len(train_keys)} | Val keys: {len(val_keys)} | Test keys: {len(test_keys)}")
print(f"   Saved to  : {split_path}")

# -------------------------------
# Apply the SAME split to every experiment by mapping row_key -> indices
# -------------------------------
SPLITS = {}  # exp_name -> dict with X_train/X_val/X_test and df_train/df_val/df_test

for exp_name, pack in TRAIN_DATA.items():
    df = pack["df"].reset_index(drop=True)  # copy with a positional index, so df.index values can also index X
    X  = pack["X"]

    if not {"map_id", "prompt_id"}.issubset(df.columns):
        raise ValueError(f"Experiment '{exp_name}' df missing map_id/prompt_id needed for split mapping.")

    df["row_key"] = df["map_id"].astype(str).str.zfill(4) + "::" + df["prompt_id"].astype(str)

    # Build index arrays in the current df order (isin() on an empty set selects nothing)
    train_idx = df.index[df["row_key"].isin(train_keys)].to_numpy()
    val_idx   = df.index[df["row_key"].isin(val_keys)].to_numpy()
    test_idx  = df.index[df["row_key"].isin(test_keys)].to_numpy()

    # Check full coverage ONLY for keys that exist (train always exists; val/test may be empty in fallback)
    needed_keys = train_keys | val_keys | test_keys
    missing = needed_keys - set(df["row_key"].tolist())
    if missing:
        raise ValueError(
            f"Experiment '{exp_name}' is missing {len(missing)} rows from the shared split "
            f"(first few: {list(sorted(missing))[:5]}). "
            "This usually means the pairs table differs between experiments."
        )

    X_train, X_val, X_test = X[train_idx], X[val_idx], X[test_idx]
    df_train = df.loc[train_idx].reset_index(drop=True)
    df_val   = df.loc[val_idx].reset_index(drop=True)
    df_test  = df.loc[test_idx].reset_index(drop=True)

    SPLITS[exp_name] = {
        "train_idx": train_idx,
        "val_idx": val_idx,
        "test_idx": test_idx,
        "X_train": X_train, "X_val": X_val, "X_test": X_test,
        "df_train": df_train, "df_val": df_val, "df_test": df_test,
    }

    print(f"\n🧪 {exp_name}")
    print("Rows -> Train:", X_train.shape, "Val:", X_val.shape, "Test:", X_test.shape)



=== Computing shared split using reference experiment: use_prompt_only ===
ref_df: (562, 16) | ref_X: (562, 512)
=== DATASET SUMMARY ===
Total rows (prompts): 562
Unique maps: 399
Multi-prompt maps (>1 prompt): 22
Single-prompt maps (=1 prompt): 377

Top 10 maps by prompt count:
map_id
1646    30
1304    29
1755    26
1532    13
0127    10
0168     8
0142     7
0078     6
0080     6
0001     6
dtype: int64

✅ Saved splits to /Users/amirdonyadide/Documents/GitHub/IMGOFUP/data/output/train_out/splits/splits_shared.json

✅ Shared split created:
   Train keys: 448 | Val keys: 57 | Test keys: 57
   Saved to  : /Users/amirdonyadide/Documents/GitHub/IMGOFUP/data/output/train_out/splits/splits_shared.json

🧪 openai_prompt_only
Rows -> Train: (448, 1536) Val: (57, 1536) Test: (57, 1536)

🧪 use_prompt_only
Rows -> Train: (448, 512) Val: (57, 512) Test: (57, 512)

🧪 map_only
Rows -> Train: (448, 165) Val: (57, 165) Test: (57, 165)

🧪 use_map
Rows -> Train: (448, 677) Val: (57, 67

## Step 8 - Modality-Aware Preprocessing (Experiment-Aware)

This step applies preprocessing tailored to the input modalities **separately for each
experiment**, using **training data only** to fit preprocessing parameters.  
A preprocessing bundle is saved per experiment so the exact same transformations can be
reused during evaluation and inference.

---

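### Prompt embeddings (every feature mode with a prompt block)

Prompt vectors are normalized using **row-wise L2 normalization** to ensure consistent scale
across samples and embedding backends (e.g., USE vs OpenAI). The `map_only` experiment has
`prompt_dim = 0`, so there is no prompt block to normalize in that case.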

---

### Map embeddings (fused prompt + map and map-only experiments)

When the feature mode includes map vectors (`prompt_plus_map` / fused, or `map_only`, which is
preprocessed as `prompt_plus_map` with `prompt_dim = 0`), the map block is processed using a
robust pipeline:

1. Replace non-finite values (`±inf`) with `NaN`
2. Impute missing values using the **median** (fit on training data only)
3. Clip each feature to training-set **5th–95th percentiles** to reduce outlier impact
4. Drop zero-variance (or near-constant) features based on training data
5. Apply **RobustScaler** using quantile range **(5, 95)**

The prompt block remains L2-normalized and is concatenated with the processed map block.
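
As a rough NumPy-only sketch of the two branches (this is not the code of `fit_transform_modality_preproc`; the function names and the returned dictionary are hypothetical, and the last step only mimics what a robust scaler with a (5, 95) quantile range does):

```python
import numpy as np


def l2_normalize_rows(X, eps=1e-12):
    """Scale each row to unit L2 norm; eps guards against all-zero rows."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(norms, eps)


def fit_map_preproc(M_train, clip_q=(5, 95), robust_qrange=(5, 95)):
    """Fit every statistic of the map pipeline on the training rows only."""
    M = np.asarray(M_train, dtype=float)
    M = np.where(np.isfinite(M), M, np.nan)          # 1. +/-inf -> NaN
    median = np.nanmedian(M, axis=0)
    M = np.where(np.isnan(M), median, M)             # 2. median imputation
    lo, hi = np.percentile(M, clip_q, axis=0)
    M = np.clip(M, lo, hi)                           # 3. percentile clipping
    keep = M.std(axis=0) > 0                         # 4. drop zero-variance columns
    center = np.median(M[:, keep], axis=0)
    q_lo, q_hi = np.percentile(M[:, keep], robust_qrange, axis=0)
    scale = np.where((q_hi - q_lo) > 0, q_hi - q_lo, 1.0)
    return {"median": median, "lo": lo, "hi": hi,
            "keep": keep, "center": center, "scale": scale}


def transform_map(M, params):
    """Apply the fitted statistics to any split (train, val, or test)."""
    M = np.asarray(M, dtype=float)
    M = np.where(np.isfinite(M), M, np.nan)
    M = np.where(np.isnan(M), params["median"], M)
    M = np.clip(M, params["lo"], params["hi"])
    return (M[:, params["keep"]] - params["center"]) / params["scale"]  # 5. robust scaling
```

Fitting on the training rows and reusing the stored statistics for validation and test is what keeps the evaluation splits from influencing the preprocessing.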

---

### Outputs (per experiment)

For each experiment we obtain:

- `X_train_s`, `X_val_s`, `X_test_s`: preprocessed matrices ready for training

A preprocessing bundle is saved into the experiment's model output folder as:

- `preproc.joblib`

This ensures the exact same preprocessing can be reused for:
- reproducible training
- consistent evaluation across experiments
- deployment-time inference (operator + parameter prediction)


In [9]:
# ===================== CELL 8 - Modality-aware preprocessing (per experiment, incl. map-only) =====================

from pathlib import Path
from imgofup.preprocessing.preprocessing import fit_transform_modality_preproc

PREPROC = {}  # exp_name -> dict with scaled arrays + bundle path

print("\n=== Fitting modality-aware preprocessing per experiment ===")

def _to_preproc_mode(feature_mode: str) -> str:
    """
    fit_transform_modality_preproc expects:
      - "prompt_only"
      - "prompt_plus_map"
    For map_only we use "prompt_plus_map" semantics but with prompt_dim=0.
    """
    fm = str(feature_mode).strip().lower()
    if fm == "prompt_only":
        return "prompt_only"
    if fm in {"prompt_plus_map", "map_only"}:
        return "prompt_plus_map"
    raise ValueError(f"Unsupported feature_mode for preprocessing: {feature_mode}")

for exp_name, cfg in EXPERIMENTS.items():

    split = SPLITS[exp_name]
    feature_mode = cfg["feature_mode"]
    preproc_mode = _to_preproc_mode(feature_mode)

    # dims inferred in Cell 4 (map_only sets prompt_dim=0, prompt_only sets map_dim=0)
    map_dim    = int(cfg["map_dim"])
    prompt_dim = int(cfg["prompt_dim"])

    model_out_dir = Path(cfg["model_out"])
    model_out_dir.mkdir(parents=True, exist_ok=True)

    preproc_path = model_out_dir / "preproc.joblib"

    print(f"\n🧪 Experiment: {exp_name}")
    print(f"   Feature mode : {feature_mode} -> preproc_mode={preproc_mode}")
    print(f"   map_dim      : {map_dim}")
    print(f"   prompt_dim   : {prompt_dim}")
    print(f"   Save preproc : {preproc_path}")

    # Safety checks: X dims must match the experiment dims
    Xtr = split["X_train"]
    if Xtr.shape[1] != (map_dim + prompt_dim):
        raise ValueError(
            f"Dim mismatch in {exp_name}: X_train has {Xtr.shape[1]} cols, "
            f"but map_dim+prompt_dim={map_dim + prompt_dim} (map_dim={map_dim}, prompt_dim={prompt_dim})."
        )

    res = fit_transform_modality_preproc(
        X_train=split["X_train"],
        X_val=split["X_val"],
        X_test=split["X_test"],
        feature_mode=preproc_mode,
        map_dim=map_dim,
        prompt_dim=prompt_dim,
        eps=1e-12,
        clip_q=(5, 95),
        impute_strategy="median",
        robust_qrange=(5, 95),
        save_path=preproc_path,
    )

    PREPROC[exp_name] = {
        "X_train_s": res.X_train_s,
        "X_val_s":   res.X_val_s,
        "X_test_s":  res.X_test_s,
        "bundle_path": res.bundle_path,
    }

    print("   ✅ Preprocessing complete.")
    print("   Shapes:", res.X_train_s.shape, res.X_val_s.shape, res.X_test_s.shape)

print("\n✅ All preprocessing finished.")



=== Fitting modality-aware preprocessing per experiment ===

🧪 Experiment: openai_prompt_only
   Feature mode : prompt_only -> preproc_mode=prompt_only
   map_dim      : 0
   prompt_dim   : 1536
   Save preproc : /Users/amirdonyadide/Documents/GitHub/IMGOFUP/data/output/models/exp_openai_prompt_only/preproc.joblib
   ✅ Preprocessing complete.
   Shapes: (448, 1536) (57, 1536) (57, 1536)

🧪 Experiment: use_prompt_only
   Feature mode : prompt_only -> preproc_mode=prompt_only
   map_dim      : 0
   prompt_dim   : 512
   Save preproc : /Users/amirdonyadide/Documents/GitHub/IMGOFUP/data/output/models/exp_use_prompt_only/preproc.joblib
   ✅ Preprocessing complete.
   Shapes: (448, 512) (57, 512) (57, 512)

🧪 Experiment: map_only
   Feature mode : map_only -> preproc_mode=prompt_plus_map
   map_dim      : 165
   prompt_dim   : 0
   Save preproc : /Users/amirdonyadide/Documents/GitHub/IMGOFUP/data/output/models/exp_map_only/preproc.joblib
   ✅ Preprocessing complete.
   Shapes

## Step 9 - Build Class Labels and Sample Weights (Experiment-Aware)

In this step, we construct the classification labels and training sample weights **for each
experiment**, using the same fixed class order and the same split definition. This guarantees
that differences in performance across experiments are due to the feature representation,
not label encoding or sampling artifacts.

---

### Fixed class encoding

Operator labels are encoded using a fixed global class order:

`[simplify, select, aggregate, displace]`

This guarantees consistent label indices across:
- training
- saved model bundles
- evaluation and inference code

Because the split is shared across experiments, this encoding remains stable and comparable.

---

### Sample weighting

Training samples are weighted to address two common sources of bias:

1. **Class imbalance**
   - Balanced class weights are computed from the training distribution to prevent majority
     classes from dominating learning.

2. **Map-level prompt multiplicity**
   - Some `map_id`s have multiple prompts.
   - To prevent such maps from contributing disproportionately, each map contributes
     approximately equal total weight by assigning each prompt a map-weight of:

   `map_weight = 1 / (#prompts for that map_id)`

---

### Final weight definition

The final per-sample weight used during training is:

`sample_w = class_weight(operator) × map_weight(map_id)`

These weights are used during classifier training (and optionally regression training)
to improve robustness and ensure fair learning across operators and maps.
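
The encoding and the weight formula can be sketched in plain Python. The helper names are hypothetical and this is not the body of `build_labels_and_sample_weights`; the "balanced" weight is assumed to follow the common `n_samples / (n_classes * count)` heuristic.

```python
from collections import Counter

FIXED_CLASSES = ["simplify", "select", "aggregate", "displace"]


def encode_labels(operators, classes=FIXED_CLASSES):
    """Map operator names to integer labels using the fixed global order."""
    index = {name: i for i, name in enumerate(classes)}
    return [index[op] for op in operators]


def balanced_class_weights(operators, classes=FIXED_CLASSES):
    """n_samples / (n_classes * count(class)) for every class that occurs."""
    counts = Counter(operators)
    n = len(operators)
    return {c: n / (len(classes) * counts[c]) for c in classes if counts[c] > 0}


def sample_weights(operators, map_ids):
    """class_weight(operator) * 1 / (#prompts sharing the same map_id)."""
    class_w = balanced_class_weights(operators)
    prompts_per_map = Counter(map_ids)
    return [class_w[op] / prompts_per_map[m] for op, m in zip(operators, map_ids)]
```

Under this assumption, a class holding 61 of the 448 training rows would get 448 / (4 × 61) ≈ 1.836, which is the weight printed for `displace` in the cell output of this step.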


In [11]:
# ===================== CELL 9 - Build labels + sample weights (per experiment, incl. map-only) =====================

import numpy as np
from imgofup.datasets.labels_and_weights import build_labels_and_sample_weights

OP_COL = paths.PATHS.OPERATOR_COL  # usually "operator"

LABELS = {}  # exp_name -> labels, weights, class_names, etc.

print("\n=== Building labels and sample weights per experiment ===")

for exp_name, split in SPLITS.items():

    df_train = split["df_train"].copy()
    df_val   = split["df_val"].copy()
    df_test  = split["df_test"].copy()

    # Fail early if operator missing (this should NOT happen if Cell 6 merge worked)
    for part_name, dfi in [("train", df_train), ("val", df_val), ("test", df_test)]:
        if OP_COL not in dfi.columns:
            raise ValueError(f"{exp_name}: df_{part_name} missing operator column '{OP_COL}'.")
        n_miss = int(dfi[OP_COL].isna().sum())
        if n_miss:
            raise ValueError(
                f"{exp_name}: df_{part_name} has {n_miss} missing operator labels. "
                "Fix the label merge in the data-loading step before training."
            )

    lab = build_labels_and_sample_weights(
        df_train=df_train,
        df_val=df_val,
        df_test=df_test,
        op_col=OP_COL,
        map_id_col="map_id",
        fixed_classes=FIXED_CLASSES,
        use_map_weight=True,
        class_weight_mode="balanced",
    )

    class_names = np.array(lab.class_names)

    LABELS[exp_name] = {
        "class_names": class_names,
        "y_train_cls": lab.y_train,
        "y_val_cls":   lab.y_val,
        "y_test_cls":  lab.y_test,
        "sample_w":    lab.sample_w,
        "class_weight_map": lab.class_weight_map,
    }

    print(f"\n🧪 {exp_name}")
    print("Classes (fixed order):", list(class_names))
    print("Class weights:", lab.class_weight_map)
    print("y_train/y_val/y_test shapes:", lab.y_train.shape, lab.y_val.shape, lab.y_test.shape)
    sw = lab.sample_w
    print("Sample weight summary:", {"min": float(sw.min()), "max": float(sw.max()), "mean": float(sw.mean())})

# Sanity: class order must match across experiments
first = next(iter(LABELS.keys()))
base_classes = LABELS[first]["class_names"].tolist()
for exp_name in LABELS.keys():
    if LABELS[exp_name]["class_names"].tolist() != base_classes:
        raise ValueError(f"Class order differs in experiment {exp_name}. This would break fair comparison.")

print("\n✅ Label build complete for all experiments (class order consistent).")



=== Building labels and sample weights per experiment ===

🧪 openai_prompt_only
Classes (fixed order): [np.str_('simplify'), np.str_('select'), np.str_('aggregate'), np.str_('displace')]
Class weights: {'simplify': 1.0275229357798166, 'select': 0.7777777777777778, 'aggregate': 0.835820895522388, 'displace': 1.8360655737704918}
y_train/y_val/y_test shapes: (448,) (57,) (57,)
Sample weight summary: {'min': 0.025925925925925925, 'max': 1.8360655737704918, 'mean': 0.6487687942076353}

🧪 use_prompt_only
Classes (fixed order): [np.str_('simplify'), np.str_('select'), np.str_('aggregate'), np.str_('displace')]
Class weights: {'simplify': 1.0275229357798166, 'select': 0.7777777777777778, 'aggregate': 0.835820895522388, 'displace': 1.8360655737704918}
y_train/y_val/y_test shapes: (448,) (57,) (57,)
Sample weight summary: {'min': 0.025925925925925925, 'max': 1.8360655737704918, 'mean': 0.6487687942076353}

🧪 map_only
Classes (fixed order): [np.str_('simplify'), np.str_('select'), np.st

## Step 10 - Operator Classification Model (MLP, Experiment-Aware)

This step trains the operator classifier that predicts one of the four map generalization
operators:

`{simplify, select, aggregate, displace}`

The same training protocol is applied **independently for each experiment** (prompt-only,
map-only, USE + map, OpenAI + map) using the **shared split**. This ensures that differences
in performance across experiments are attributable to the feature representation rather than
changes in training procedure.

---

### Model and training strategy

- We use an **MLPClassifier** (multi-layer perceptron).
- Hyperparameters are explored via a lightweight random search over:
  - hidden layer sizes
  - weight decay (`alpha`)
  - learning rate schedule
  - batch size / optimization settings (as implemented in the helper)

---

### Validation protocol (leakage-free)

To prevent leakage, we perform **grouped cross-validation** using `map_id`:

- prompts from the same map are never split across folds

This is critical because multiple prompts may refer to the same map and would otherwise
inflate performance due to memorization.
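
A grouped fold assignment can be sketched without any library. This is a simplified, hypothetical stand-in (greedy balancing by group size) for what a grouped K-fold splitter such as scikit-learn's `GroupKFold` provides; which splitter the classifier helper uses internally is not shown in this notebook.

```python
from collections import Counter


def group_kfold_indices(groups, n_splits=5):
    """Yield (train_idx, val_idx) pairs so that no group spans two folds.

    Groups are placed largest first, each into the currently smallest fold.
    """
    sizes = Counter(groups)
    if len(sizes) < n_splits:
        raise ValueError("Need at least as many distinct groups as folds.")

    fold_of_group = {}
    fold_sizes = [0] * n_splits
    for group, size in sorted(sizes.items(), key=lambda kv: (-kv[1], str(kv[0]))):
        fold = fold_sizes.index(min(fold_sizes))
        fold_of_group[group] = fold
        fold_sizes[fold] += size

    for fold in range(n_splits):
        val_idx = [i for i, g in enumerate(groups) if fold_of_group[g] == fold]
        train_idx = [i for i, g in enumerate(groups) if fold_of_group[g] != fold]
        yield train_idx, val_idx
```

Passing the `map_id` of every training row as `groups` keeps all prompts of one map on the same side of each fold, which is the property this protocol relies on.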

---

### Model selection and evaluation

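Each sampled configuration is scored twice: by grouped cross-validation on the training split
(logged as `cvF1`, mean and standard deviation across folds) and on the held-out validation
split (logged as `VAL F1` and `acc`). In the run logged below, the selected configuration is
the one with the highest validation macro-F1. The selected model is then retrained on the full
training split and evaluated on the validation and test splits.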

---

### Outputs (per experiment)

For each experiment, the trained classifier is saved into the experiment's model folder as:

- `classifier.joblib`

This classifier is later used to:
1. predict the operator class
2. route each sample to the correct operator-specific parameter regressor


In [12]:
# ===================== CELL 10 - Train classifier (per experiment, MLP search + final fit) =====================

from pathlib import Path
import json
from dataclasses import asdict, is_dataclass

from imgofup.models.train_classifier import train_mlp_classifier_with_search

CLF_RESULTS = {}

def _safe_get(obj, *names, default=None):
    for n in names:
        if hasattr(obj, n):
            return getattr(obj, n)
    return default

print("\n=== Training operator classifiers for all experiments ===")

printed_debug_fields = False

for exp_name, cfg in EXPERIMENTS.items():

    split = SPLITS[exp_name]
    pre   = PREPROC[exp_name]
    lab   = LABELS[exp_name]

    X_train_s = pre["X_train_s"]
    X_val_s   = pre["X_val_s"]
    X_test_s  = pre["X_test_s"]

    y_train = lab["y_train_cls"]
    y_val   = lab["y_val_cls"]
    y_test  = lab["y_test_cls"]
    sample_w = lab["sample_w"]

    class_names = [str(x) for x in lab["class_names"]]

    # Sanity checks
    if X_train_s.shape[0] != len(y_train):
        raise ValueError(f"{exp_name}: X_train rows {X_train_s.shape[0]} != y_train {len(y_train)}")
    if X_val_s.shape[0] != len(y_val):
        raise ValueError(f"{exp_name}: X_val rows {X_val_s.shape[0]} != y_val {len(y_val)}")
    if X_test_s.shape[0] != len(y_test):
        raise ValueError(f"{exp_name}: X_test rows {X_test_s.shape[0]} != y_test {len(y_test)}")

    # Grouped CV: group by map_id to avoid leakage across folds
    groups_tr = split["df_train"]["map_id"].astype(str).to_numpy()

    model_out_dir = Path(cfg["model_out"])
    model_out_dir.mkdir(parents=True, exist_ok=True)

    print(f"\n🧪 Experiment: {exp_name}")
    print(f"   Classes   : {class_names}")
    print(f"   Train X   : {X_train_s.shape}")
    print(f"   Val X     : {X_val_s.shape}")
    print(f"   Test X    : {X_test_s.shape}")
    print(f"   Model out : {model_out_dir}")

    res_clf = train_mlp_classifier_with_search(
        exp_name=exp_name,
        X_train=X_train_s,
        y_train=y_train,
        groups_train=groups_tr,
        sample_w=sample_w,
        X_val=X_val_s,
        y_val=y_val,
        X_test=X_test_s,
        y_test=y_test,
        class_names=class_names,
        out_dir=model_out_dir,
        n_iter=5,  # sampled configs for a quick run (alternative value noted by the author: 50)
        n_splits=5,
        seed=int(CFG.SEED),
        verbose=True,
        save_name="classifier.joblib",
    )

    CLF_RESULTS[exp_name] = res_clf

    # ---- robust reporting (no assumptions about field names) ----
    model_path    = _safe_get(res_clf, "model_path", "path", default=str(model_out_dir / "classifier.joblib"))
    best_val_f1   = _safe_get(res_clf, "val_f1_macro", "best_val_f1", "val_f1", "best_f1", default=None)
    best_val_acc  = _safe_get(res_clf, "val_acc", "best_val_acc", "best_accuracy", default=None)
    test_f1       = _safe_get(res_clf, "test_f1_macro", "test_f1", default=None)
    test_acc      = _safe_get(res_clf, "test_acc", "accuracy_test", default=None)

    print("   ✅ Classifier training done.")
    print("   Saved to:", model_path)
    if best_val_f1 is not None or best_val_acc is not None:
        print("   Best VAL:", {"macro_f1": best_val_f1, "acc": best_val_acc})
    if test_f1 is not None or test_acc is not None:
        print("   TEST     :", {"macro_f1": test_f1, "acc": test_acc})

    # Save lightweight meta for evaluation / reporting
    clf_meta = {
        "experiment": exp_name,
        "feature_mode": cfg["feature_mode"],
        "class_names": class_names,
        "best_val": {"macro_f1": best_val_f1, "acc": best_val_acc},
        "test": {"macro_f1": test_f1, "acc": test_acc},
        "model_path": str(model_path),
    }
    (model_out_dir / "classifier_meta.json").write_text(json.dumps(clf_meta, indent=2), encoding="utf-8")

    # Print available fields once for debugging
    if not printed_debug_fields:
        printed_debug_fields = True
        if is_dataclass(res_clf):
            print("   (debug) Result fields:", list(asdict(res_clf).keys()))
        else:
            print("   (debug) Result attrs :", [a for a in dir(res_clf) if not a.startswith("_")])

print("\n✅ All classifiers trained.")


=== Training operator classifiers for all experiments ===

🧪 Experiment: openai_prompt_only
   Classes   : ['simplify', 'select', 'aggregate', 'displace']
   Train X   : (448, 1536)
   Val X     : (57, 1536)
   Test X    : (57, 1536)
   Model out : /Users/amirdonyadide/Documents/GitHub/IMGOFUP/data/output/models/exp_openai_prompt_only

Searching 5 MLP configs...
[01/5] cvF1=0.929±0.018 | VAL F1=0.923 acc=0.930 | (128, 64), α=2.02e-02, lr=1.2e-03, bs=16
[02/5] cvF1=0.918±0.038 | VAL F1=0.920 acc=0.930 | (256, 128), α=3.49e-05, lr=1.7e-04, bs=64
[03/5] cvF1=0.915±0.042 | VAL F1=0.944 acc=0.947 | (256,), α=1.03e-02, lr=7.7e-04, bs=128
[04/5] cvF1=0.916±0.034 | VAL F1=0.923 acc=0.930 | (256,), α=1.18e-05, lr=2.7e-03, bs=128
[05/5] cvF1=0.915±0.036 | VAL F1=0.939 acc=0.947 | (256, 128, 64), α=5.47e-05, lr=1.9e-04, bs=16

🏆 Selected params: {'hidden_layer_sizes': (256,), 'alpha': 0.010275417738969424, 'learning_rate_init': 0.0007725378389307358, 'batch_size': 128, 'activatio

## Step 11 - Parameter Regression (Per-Operator) and Final Model Bundle (Experiment-Aware)

This step trains **operator-specific regressors** to predict the generalization parameter in a
**scale-independent form**, and then packages all trained components into a single,
experiment-scoped model bundle.

The same procedure is applied **independently for each experiment** (prompt-only, map-only,
USE + map, OpenAI + map), using the shared data split and preprocessing pipeline.

---

### Regression target

Regressors are trained on the normalized target `param_norm`, defined as:

- **Distance-based operators** (`simplify`, `aggregate`, `displace`):  
  `param_norm = param_value / extent_diag_m`

- **Area-based operators** (`select`):  
  `param_norm = param_value / extent_area_m2`

This normalization allows each regressor to generalize across maps of different spatial
extent while preserving physical meaning. During inference, predictions are converted back
to real-world units using the same per-map extent references.

---

### Training strategy

- One **MLPRegressor per operator**
- Training data restricted to samples of the corresponding operator
- **Grouped cross-validation** (`GroupKFold`) by `map_id` to prevent spatial leakage
- Hyperparameter optimization via `RandomizedSearchCV` for each operator independently

---

### Final model bundle

For each experiment, the trained components are stored together in a single bundle:

- `cls_plus_regressors.joblib`

This bundle contains:
- the trained operator classifier
- the dictionary of operator-specific regressors
- the fixed class order
- normalization metadata (operator groups and extent columns)
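
A consumer of the bundle may want to validate its structure before using it. The sketch below is one possible check, under stated assumptions: the keys `classifier`, `regressors_by_class`, `class_names`, and `cv_summary` are the ones read back later in this notebook, while `check_bundle`, the placeholder model strings, and the in-memory `pickle` round trip (standing in for `joblib.dump` / `joblib.load`) are illustrative only.

```python
import pickle

# Keys that later cells of this notebook read from the bundle.
REQUIRED_KEYS = ("classifier", "regressors_by_class", "class_names", "cv_summary")


def check_bundle(pack):
    """Verify required keys exist and every regressor belongs to a known class."""
    missing = [key for key in REQUIRED_KEYS if key not in pack]
    if missing:
        raise KeyError(f"bundle is missing keys: {missing}")
    class_names = [str(name) for name in pack["class_names"]]
    unknown = sorted(set(pack["regressors_by_class"]) - set(class_names))
    if unknown:
        raise ValueError(f"regressors for unknown classes: {unknown}")
    return class_names


# Toy bundle: strings stand in for fitted models (hypothetical content).
toy_bundle = {
    "classifier": "<fitted classifier>",
    "regressors_by_class": {"simplify": "<reg>", "select": "<reg>"},
    "class_names": ["aggregate", "displace", "select", "simplify"],
    "cv_summary": {"simplify": {}, "select": {}},
}

# Round trip through bytes, as a stand-in for saving and reloading the file.
loaded = pickle.loads(pickle.dumps(toy_bundle))
print(check_bundle(loaded))
```

Checking the class order at load time matters because the classifier's integer predictions are mapped to operator names through `class_names`.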

Along with the experiment's `preproc.joblib`, this bundle is sufficient for the evaluation
notebook to compute:

1. **Classifier-only metrics**  
2. **Regressor-only metrics** (oracle operator routing)  
3. **End-to-end pipeline metrics** (predicted operator routing)

This design keeps evaluation simple, reproducible, and fully decoupled from the training
notebook.
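
The difference between oracle routing and predicted routing can be made concrete with a toy example. This is a sketch under simplifying assumptions, not the evaluation notebook's code: `evaluate_routing`, the sample dictionaries, and the constant "regressors" are hypothetical stand-ins for the fitted models.

```python
import math


def evaluate_routing(samples, classify, regressors):
    """Compute classifier accuracy plus parameter MAE under both routing modes."""
    n = len(samples)
    pred_ops = [classify(s["x"]) for s in samples]

    # (1) Classifier-only: fraction of samples with the right operator.
    clf_acc = sum(p == s["op"] for p, s in zip(pred_ops, samples)) / n

    # (2) Oracle routing: always use the regressor of the TRUE operator.
    oracle_err = [abs(regressors[s["op"]](s["x"]) - s["param_norm"]) for s in samples]

    # (3) Predicted routing: use the regressor of the PREDICTED operator and
    # score the parameter only where the operator was predicted correctly.
    e2e_err = [
        abs(regressors[p](s["x"]) - s["param_norm"])
        for p, s in zip(pred_ops, samples)
        if p == s["op"]
    ]
    return {
        "clf_acc": clf_acc,
        "oracle_mae": sum(oracle_err) / n,
        "e2e_mae_if_op_correct": sum(e2e_err) / len(e2e_err) if e2e_err else math.nan,
        "n_op_correct": len(e2e_err),
    }


# Toy data: the classifier below always answers "simplify", so it misroutes
# the one "select" sample.
samples = [
    {"x": 1.0, "op": "simplify", "param_norm": 0.004},
    {"x": 2.0, "op": "select", "param_norm": 0.0003},
    {"x": 3.0, "op": "simplify", "param_norm": 0.005},
]
regressors = {"simplify": lambda x: 0.0045, "select": lambda x: 0.0003}
metrics = evaluate_routing(samples, lambda x: "simplify", regressors)
print(metrics)
```

Oracle routing isolates regressor quality from classifier mistakes, while predicted routing reflects what a user of the full pipeline would experience.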


In [13]:
# ===================== CELL 11 - Train per-operator regressors + save final bundle (per experiment) =====================

from pathlib import Path
import joblib

from imgofup.models.train_regressors import train_regressors_per_operator
from imgofup.models.save_bundle import save_cls_plus_regressors_bundle

BUNDLES = {}     # exp_name -> bundle path
REG_RESULTS = {} # exp_name -> regressor training result

print("\n=== Training per-operator regressors and saving final bundles ===")

for exp_name, cfg in EXPERIMENTS.items():

    split = SPLITS[exp_name]
    pre   = PREPROC[exp_name]
    lab   = LABELS[exp_name]
    res_clf = CLF_RESULTS[exp_name]

    X_train_s = pre["X_train_s"]
    df_train  = split["df_train"]
    y_train_cls = lab["y_train_cls"]
    sample_w = lab["sample_w"]

    # Make sure class_names is list[str] (stable ordering)
    cn = [str(x) for x in lab["class_names"]]

    model_out_dir = Path(cfg["model_out"])
    model_out_dir.mkdir(parents=True, exist_ok=True)

    print(f"\n🧪 Experiment: {exp_name}")
    print(f"   Model out: {model_out_dir}")
    print(f"   Train X  : {X_train_s.shape} | df_train: {df_train.shape}")

    # ---- (1) Train per-operator regressors on TRAIN only ----
    reg_res = train_regressors_per_operator(
        X_train_s=X_train_s,
        df_train=df_train,
        y_train_cls=y_train_cls,
        class_names=cn,
        sample_w=sample_w,
        group_col="map_id",
        target_col="param_norm",
        use_log1p=False,
        n_splits=5,
        n_iter=40,
        random_state=int(CFG.SEED),
        verbose=1,
    )

    REG_RESULTS[exp_name] = reg_res

    # ---- (2) Load the trained classifier model (from Cell 10 output) ----
    clf_pack = joblib.load(Path(res_clf.model_path))
    final_clf = clf_pack["model"]

    # ---- (3) Save combined bundle for evaluation notebook ----
    bundle_res = save_cls_plus_regressors_bundle(
        exp_name=exp_name,
        out_dir=model_out_dir,
        classifier=final_clf,
        regressors_by_class=reg_res.regressors_by_class,
        class_names=cn,
        use_log1p=reg_res.use_log1p,
        cv_summary=reg_res.cv_summary,
        distance_ops=DISTANCE_OPS,
        area_ops=AREA_OPS,
        diag_col="extent_diag_m",
        area_col="extent_area_m2",
        save_name="cls_plus_regressors.joblib",  # fixed name inside each experiment folder
    )

    BUNDLES[exp_name] = bundle_res.bundle_path

    print("   ✅ Saved bundle:", bundle_res.bundle_path)
    print("   ✅ Regressors trained for:", sorted(reg_res.regressors_by_class))

print("\n✅ All bundles saved.")
for k, v in BUNDLES.items():
    print(f" - {k:12s}: {v}")



=== Training per-operator regressors and saving final bundles ===

🧪 Experiment: openai_prompt_only
   Model out: /Users/amirdonyadide/Documents/GitHub/IMGOFUP/data/output/models/exp_openai_prompt_only
   Train X  : (448, 1536) | df_train: (448, 16)
Fitting 5 folds for each of 40 candidates, totalling 200 fits

=== Regressor for class 'simplify' (predicting param_norm) ===
samples=109, groups=66, cv_splits=5, used_sample_weight=True
best CV RMSE (scaled): 0.9974241516696944
best CV RMSE (param_norm units): 0.004328295055365128
best params: {'alpha': np.float64(0.0041619125396912095), 'hidden_layer_sizes': (64,), 'learning_rate_init': np.float64(0.00010558059144381523)}
Fitting 5 folds for each of 40 candidates, totalling 200 fits

=== Regressor for class 'select' (predicting param_norm) ===
samples=144, groups=90, cv_splits=5, used_sample_weight=True
best CV RMSE (scaled): 1.0162233279108734
best CV RMSE (param_norm units): 0.0003594227195831839
best params: {'alpha': np.float64(0.

In [14]:
# ===================== CELL 12A - Classifier comparison table =====================
import json
import pandas as pd
from pathlib import Path

rows = []
for exp_name, cfg in EXPERIMENTS.items():
    meta_path = Path(cfg["model_out"]) / "classifier_meta.json"
    meta = json.loads(meta_path.read_text(encoding="utf-8"))

    rows.append({
        "experiment": exp_name,
        "val_acc": meta["best_val"]["acc"],
        "val_f1_macro": meta["best_val"]["macro_f1"],
        "test_acc": meta["test"]["acc"],
        "test_f1_macro": meta["test"]["macro_f1"],
        "model_path": meta["model_path"],
    })

df_clf = pd.DataFrame(rows).sort_values("test_f1_macro", ascending=False)
print(df_clf)


           experiment   val_acc  val_f1_macro  test_acc  test_f1_macro  \
0  openai_prompt_only  0.947368      0.944028  0.912281       0.914519   
1     use_prompt_only  0.947368      0.944028  0.894737       0.898839   
4          openai_map  0.947368      0.944028  0.842105       0.851891   
3             use_map  0.894737      0.883793  0.736842       0.736268   
2            map_only  0.280702      0.281486  0.280702       0.265773   

                                          model_path  
0  /Users/amirdonyadide/Documents/GitHub/IMGOFUP/...  
1  /Users/amirdonyadide/Documents/GitHub/IMGOFUP/...  
4  /Users/amirdonyadide/Documents/GitHub/IMGOFUP/...  
3  /Users/amirdonyadide/Documents/GitHub/IMGOFUP/...  
2  /Users/amirdonyadide/Documents/GitHub/IMGOFUP/...  


In [15]:
# ===================== CELL 12B (revised) - One clear regressor comparison table =====================
import joblib
import pandas as pd
import numpy as np

# 1) Load cv_summary from bundles
bund_cv = {}
for exp_name, bundle_path in BUNDLES.items():
    pack = joblib.load(bundle_path)
    cv_summary = pack.get("cv_summary", None)
    if cv_summary is None:
        raise ValueError(f"{exp_name}: bundle has no cv_summary. Check save_bundle.py")
    bund_cv[exp_name] = cv_summary

# 2) Extract RMSE for each operator
def get_rmse_param(cv_summary, op_name):
    d = cv_summary.get(op_name, cv_summary.get(str(op_name), {}))
    for k in ["best_rmse_param", "rmse_param", "rmse_param_units", "best_cv_rmse_param"]:
        if k in d and d[k] is not None:
            return float(d[k])
    # fallback search
    for k, v in d.items():
        if isinstance(v, (int, float)) and "rmse" in k.lower() and "param" in k.lower():
            return float(v)
    return np.nan

ops = ["simplify", "select", "aggregate", "displace"]

rows = []
for exp_name, cv_summary in bund_cv.items():
    row = {"experiment": exp_name}
    for op in ops:
        row[op] = get_rmse_param(cv_summary, op)
    rows.append(row)

df_rmse = pd.DataFrame(rows).set_index("experiment")

# 3) Add summary stats
df_rmse["mean_rmse"] = df_rmse[ops].mean(axis=1)

# 4) Sort by mean_rmse (lower is better)
df_rmse_sorted = df_rmse.sort_values("mean_rmse")

# 5) Create a human-friendly display version (percent of [0,1] range)
df_pct = (df_rmse_sorted * 100).round(3)   # e.g. 0.0043 -> 0.43%
df_pct = df_pct.rename(columns={op: f"{op} RMSE (%)" for op in ops} | {"mean_rmse": "Mean RMSE (%)"})

df_rmse_sorted.round(6), df_pct


(                    simplify    select  aggregate  displace  mean_rmse
 experiment                                                            
 map_only            0.004217  0.000253   0.003658  0.003258   0.002847
 use_map             0.004395  0.000249   0.003710  0.003310   0.002916
 use_prompt_only     0.004347  0.000362   0.003499  0.003669   0.002969
 openai_prompt_only  0.004328  0.000359   0.003560  0.003719   0.002992
 openai_map          0.004644  0.000251   0.003799  0.003386   0.003020,
                     simplify RMSE (%)  select RMSE (%)  aggregate RMSE (%)  \
 experiment                                                                   
 map_only                        0.422            0.025               0.366   
 use_map                         0.440            0.025               0.371   
 use_prompt_only                 0.435            0.036               0.350   
 openai_prompt_only              0.433            0.036               0.356   
 openai_map          

In [16]:
# ===================== CELL 12C - End-to-end evaluation (TEST) [FIXED: scaler is y-scaler] =====================
import numpy as np
import pandas as pd
import joblib

TOL = 0.05  # tolerance in param_norm units (0..1)

def _predict_param(reg_and_scaler, Xi):
    """
    reg_and_scaler is either:
      - regressor
      - (regressor, y_scaler) where y_scaler was fit on target y (shape [n,1])
    Xi is a 2D row: shape (1, n_features)
    """
    # unpack
    if isinstance(reg_and_scaler, (tuple, list)):
        reg = reg_and_scaler[0]
        y_scaler = reg_and_scaler[1] if len(reg_and_scaler) > 1 else None
    else:
        reg = reg_and_scaler
        y_scaler = None

    # predict (in scaled-y space if y_scaler exists)
    y_hat = float(reg.predict(Xi)[0])

    # if scaler looks like a y-scaler, inverse-transform
    if y_scaler is not None:
        # y_scaler fitted on y => expects 1 feature
        try:
            y_hat = float(y_scaler.inverse_transform(np.array([[y_hat]], dtype=float))[0, 0])
        except Exception:
            # if inverse_transform fails for any reason, keep raw prediction
            pass

    return y_hat

rows = []

for exp_name, cfg in EXPERIMENTS.items():
    bundle = joblib.load(BUNDLES[exp_name])

    clf = bundle["classifier"]
    regs = bundle["regressors_by_class"]      # op -> (regressor, y_scaler)
    class_names = [str(x) for x in bundle["class_names"]]

    pre   = PREPROC[exp_name]
    split = SPLITS[exp_name]
    lab   = LABELS[exp_name]

    X_test = pre["X_test_s"]
    y_true_cls = lab["y_test_cls"]
    df_test = split["df_test"]

    if "param_norm" not in df_test.columns:
        raise KeyError(f"{exp_name}: df_test has no 'param_norm' column. Available: {list(df_test.columns)}")
    y_true_param = df_test["param_norm"].to_numpy(dtype=float)

    # ---- Predict operator ----
    y_pred_cls = clf.predict(X_test)
    op_acc = float((y_pred_cls == y_true_cls).mean())

    pred_names = [class_names[int(i)] for i in y_pred_cls]
    true_names = [class_names[int(i)] for i in y_true_cls]

    # ---- Predict parameter using regressor of the PREDICTED operator ----
    y_pred_param = np.zeros_like(y_true_param, dtype=float)
    for i, op in enumerate(pred_names):
        Xi = X_test[i:i+1]  # shape (1, n_features)
        y_pred_param[i] = _predict_param(regs[op], Xi)

    # errors
    abs_err = np.abs(y_pred_param - y_true_param)
    correct_mask = (np.array(pred_names) == np.array(true_names))

    # Param RMSE/MAE only when operator correct
    if correct_mask.any():
        rmse_cond = float(np.sqrt(np.mean((y_pred_param[correct_mask] - y_true_param[correct_mask])**2)))
        mae_cond  = float(np.mean(abs_err[correct_mask]))
    else:
        rmse_cond, mae_cond = np.nan, np.nan

    # Joint metric: operator correct AND parameter close enough
    joint_success = float(np.mean(correct_mask & (abs_err <= TOL)))

    # Penalized RMSE over ALL: if operator wrong, set error = 1.0 (max on [0,1])
    penalized_err = abs_err.copy()
    penalized_err[~correct_mask] = 1.0
    rmse_penalized = float(np.sqrt(np.mean(penalized_err**2)))

    rows.append({
        "experiment": exp_name,
        "op_acc": op_acc,
        "param_rmse_if_op_correct": rmse_cond,
        "param_mae_if_op_correct": mae_cond,
        f"joint_success@{TOL}": joint_success,
        "rmse_penalized_all": rmse_penalized,
        "n_test": int(len(y_true_cls)),
        "n_op_correct": int(correct_mask.sum()),
    })

df_e2e = pd.DataFrame(rows).sort_values("rmse_penalized_all")
df_e2e


           experiment    op_acc  param_rmse_if_op_correct  param_mae_if_op_correct  joint_success@0.05  rmse_penalized_all  n_test  n_op_correct
0  openai_prompt_only  0.912281                  0.002812                 0.002045            0.912281            0.296187      57            52
1     use_prompt_only  0.894737                  0.002987                 0.002177            0.894737            0.324455      57            51
4          openai_map  0.842105                  0.003312                 0.002367            0.842105            0.397371      57            48
3             use_map  0.736842                  0.003057                 0.002204            0.736842            0.512996      57            42
2            map_only  0.280702                  0.003805                 0.002604            0.280702            0.848117      57            16
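
The `rmse_penalized_all` column can be cross-checked from the other columns of the table above. The sketch below is an arithmetic check under one assumption: it rebuilds the squared-error sum from the rounded `param_rmse_if_op_correct` value, so it matches the reported figure only to within rounding. The helper name `penalized_rmse` is hypothetical.

```python
import math


def penalized_rmse(n_test, n_op_correct, rmse_if_op_correct, penalty=1.0):
    """Rebuild the penalized RMSE from counts and the conditional RMSE."""
    # Squared error contributed by correctly routed samples.
    sq_err_correct = n_op_correct * rmse_if_op_correct ** 2
    # Each misrouted sample is assigned the fixed penalty as its error.
    sq_err_wrong = (n_test - n_op_correct) * penalty ** 2
    return math.sqrt((sq_err_correct + sq_err_wrong) / n_test)


# Row for openai_prompt_only: 57 test samples, 52 with the correct operator.
value = penalized_rmse(n_test=57, n_op_correct=52, rmse_if_op_correct=0.002812)

# Contribution of the 5 misrouted samples alone.
floor = math.sqrt(5 / 57)
print(round(value, 6), round(floor, 6))
```

The two printed numbers are nearly identical, which shows that this metric is driven almost entirely by operator misclassifications rather than by regression error. For the same reason, `joint_success@0.05` equals `op_acc` in every row: that can only happen if every correctly routed sample has a parameter error within the 0.05 tolerance.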
