# PRISM — Experiment B Tracking: Uncertainty-Driven Exploration

**Objective**: Analyze the results of Experiment B (Proposition P3) — Does PRISM explore more efficiently than the baselines?

### Protocol
- **Environment**: FourRooms 19×19, 260 accessible states
- **Task**: 4 hidden goals (one per room), discovered by visiting
- **9 conditions**: PRISM, SR-Oracle, SR-e-greedy, SR-e-decay, SR-Count-Bonus, SR-Norm-Bonus, SR-Posterior, SR-Count-Matched, Random
- **100 runs/condition**, same goal placement across conditions
- **Max 2000 steps** per run

### Metrics
| Metric | Description | Direction |
|--------|-------------|----------|
| `steps` | Steps to find all 4 goals | ↓ better |
| `goals_found` | Number of goals found (max 4) | ↑ better |
| `coverage` | Fraction of states visited | ↑ better |
| `redundancy` | steps / unique states (revisits) | ↓ better |
| `discovery_1-4` | Discovery step of each goal | ↓ better |
| `auc_discovery` | Area under the discovery curve (§6bis) | ↑ better |

### Notebook structure
- **Sections 1–8**: Full analysis (900 runs, 9 conditions)
- **Section 9**: SR-Count-Matched — temporal profile control
- **Section 10**: Guidance index — Does U(s) guide exploration?

In [None]:
import sys
sys.path.insert(0, "..")

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from prism.analysis.metrics import bootstrap_ci, compare_conditions

%matplotlib inline
sns.set_theme(style="whitegrid")
print("Imports OK")

## 1. Loading results

Results are generated by `python -m experiments.exp_b_exploration`.

In [None]:
from prism.analysis.results import get_latest_run, load_run, list_runs
from pathlib import Path

# Show all available cataloged runs
cataloged = list_runs("exp_b", results_root="../results")
if cataloged:
    print("Cataloged runs:")
    for r in cataloged:
        note_str = " -- " + r["note"] if r["note"] else ""
        print("  %s  (%d conds x %d runs)%s" % (
            r["path"].name, len(r["conditions"]), r["n_runs"], note_str))

# Try to load a cataloged run with 9 conditions; fall back to exp_b_v2
run_dir = None
run_info = None
try:
    candidate = get_latest_run("exp_b", results_root="../results")
    _, info = load_run(candidate)
    if info.get("n_conditions", 0) >= 9:
        run_dir = candidate
        run_info = info
except FileNotFoundError:
    pass

if run_dir is not None:
    df, run_info = load_run(run_dir)
    print("\n--- Run loaded: %s ---" % run_dir.name)
    print("Date       : %s" % run_info["timestamp"])
    print("Conditions : %d x %d runs = %d episodes" % (
        run_info["n_conditions"], run_info["n_runs"], run_info["total_episodes"]))
    print("Duration   : %.0fs" % run_info["elapsed_seconds"])
    if run_info.get("note"):
        print("Note       : %s" % run_info["note"])
else:
    # Fallback: legacy exp_b_v2 directory
    legacy_path = Path("../results/exp_b_v2/exploration_results.csv")
    assert legacy_path.exists(), "No 9-condition run found!"
    df = pd.read_csv(legacy_path)
    df["coverage"] = df["coverage"].astype(float)
    df["redundancy"] = df["redundancy"].astype(float)
    print("\n--- Legacy load: results/exp_b_v2/ ---")

print("\nColumns: %s" % list(df.columns))
print("Rows   : %d" % len(df))
print("\nConditions (%d):" % df.condition.nunique())
for cond in df.condition.unique():
    n = len(df[df.condition == cond])
    print("  %-20s : %d runs" % (cond, n))

df.head()

## 2. Overview — Summary table

For each condition: mean and median steps, 95% CI, goals found, coverage, redundancy.
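
The summary cell that follows relies on `bootstrap_ci` from `prism.analysis.metrics`, whose implementation is not shown in this notebook. As a rough illustration only, a percentile bootstrap of the mean could look like the sketch below. The helper name `percentile_bootstrap_ci`, the number of resamples, the seed, and the step counts are all assumptions, not the project's code or data.

```python
import random
import statistics


def percentile_bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Return (mean, ci_lo, ci_hi) from a percentile bootstrap of the mean.

    Hypothetical helper: the real bootstrap_ci may use another method.
    """
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample
    boot_means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return statistics.fmean(values), boot_means[lo_idx], boot_means[hi_idx]


# Made-up step counts, for illustration only
steps_example = [640, 700, 655, 720, 690, 610, 735, 680, 665, 705]
mean_steps, ci_lo, ci_hi = percentile_bootstrap_ci(steps_example)
print(f"{mean_steps:.0f} [{ci_lo:.0f}, {ci_hi:.0f}]")
```

The interval is read off the sorted resampled means, so it needs no normality assumption, which suits step counts capped by a time limit.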

In [None]:
CONDITIONS_ORDER = ["SR-Oracle", "PRISM", "SR-Count-Matched", "SR-Count-Bonus",
                    "SR-Posterior", "SR-Norm-Bonus", "SR-e-greedy", "SR-e-decay", "Random"]

# Filter to conditions actually present
conditions = [c for c in CONDITIONS_ORDER if c in df.condition.values]

summary_rows = []
for cond in conditions:
    cond_df = df[df.condition == cond]
    steps = cond_df.steps.values
    mean_steps, ci_lo, ci_hi = bootstrap_ci(steps)
    summary_rows.append({
        "Condition": cond,
        "Steps (mean)": f"{mean_steps:.0f}",
        "CI 95%": f"[{ci_lo:.0f}, {ci_hi:.0f}]",
        "Steps (median)": f"{np.median(steps):.0f}",
        "Goals found": f"{cond_df.goals_found.mean():.1f}/4",
        "All found %": f"{cond_df.all_found.mean():.0%}",
        "Coverage": f"{cond_df.coverage.mean():.1%}",
        "Redundancy": f"{cond_df.redundancy.mean():.2f}",
    })

summary = pd.DataFrame(summary_rows)
print(summary.to_string(index=False))

### Reading the table

#### Purpose of this section

We just ran 900 simulations: 100 runs for each of the 9 conditions. Each run places an agent in FourRooms (19x19, 260 accessible cells) with 4 hidden goals (one per room). The agent has 2000 steps to find them all.

The question is: **which agent finds the 4 goals fastest, and why?**

#### What we see

Let us go through the table column by column:

- **Steps (mean)**: the average number of steps to find the 4 goals. This is the primary metric — lower is better. Agents that did not find all 4 goals within 2000 steps receive a score of 2001 (sentinel).

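- **CI 95%**: the bootstrap confidence interval of the mean. If two intervals do not overlap, the difference is very unlikely to be a sampling artifact (the converse does not hold: overlapping intervals can still hide a significant difference). Here, Oracle [632, 733] and Count-Bonus [1035, 1189] do not overlap: the difference is real.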

- **All found %**: the proportion of runs where all 4 goals were found. This is a reliability indicator. Oracle = 100% (always finds all 4), PRISM = 88% (12 out of 100 runs did not finish), Random = 1% (nearly impossible in 2000 steps).

- **Coverage**: the fraction of the 260 cells visited at least once. An efficient agent does not need to visit the entire grid — it only needs to pass through the 4 goal cells. But high coverage indicates systematic exploration.

- **Redundancy**: steps / unique states visited. If an agent takes 1000 steps and visits 200 unique cells, its redundancy is 5.0 — on average it passed through each visited cell 5 times. Low redundancy = efficient exploration (few revisits).
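
The coverage and redundancy definitions above can be sketched on a toy trajectory. The function name `exploration_stats` and the convention that the trajectory length equals the step count are assumptions made for illustration; the experiment code may count the initial state differently.

```python
def exploration_stats(trajectory, n_accessible=260):
    """Compute (coverage, redundancy) from the sequence of visited states."""
    unique_states = set(trajectory)
    coverage = len(unique_states) / n_accessible        # fraction of cells seen
    redundancy = len(trajectory) / len(unique_states)   # steps per unique cell
    return coverage, redundancy


# Toy trajectory: 1000 steps cycling over 200 distinct states
toy_trajectory = [step % 200 for step in range(1000)]
coverage, redundancy = exploration_stats(toy_trajectory)
print(f"coverage={coverage:.1%}, redundancy={redundancy:.1f}")
```

This reproduces the worked case from the text: 1000 steps over 200 unique cells gives a redundancy of 5.0.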

#### Interpretation: three performance tiers

The 9 conditions clearly separate into **three groups**:

| Tier | Conditions | Steps | Coverage | Characteristics |
|------|-----------|-------|----------|----------------|
| **Tier 1**: Theoretical ceiling | SR-Oracle | ~680 | 86% | Knows the true SR (M*). No learning cost. This is the lower bound on steps: one cannot do better without knowing goal positions directly. |
| **Tier 2**: Structured exploration | SR-Count-Bonus, SR-Count-Matched, **PRISM** | 1070–1250 | 82–84% | Use an informed exploration bonus that guides the agent toward unvisited areas. The three conditions are statistically indistinguishable (see section 5). |
| **Tier 3**: Blind exploration | The other 5 | 1950–2000 | 42–56% | No effective guidance. epsilon-greedy and epsilon-decay explore only through random actions. Norm-Bonus and Posterior have an exploration signal, but it is poorly calibrated. Random is the upper bound on steps (the worst case). |

The gap between Tier 2 and Tier 3 is **massive**: ~800 steps and roughly 30 to 40 coverage points. This is the difference between "finding almost all goals" and "plateauing at 2 out of 4 goals".

The gap between Tier 1 and Tier 2 is more moderate (~400-500 steps). Knowing M* provides an advantage but not an overwhelming one — Tier 2 agents compensate with effective exploration bonuses.

#### The Tier 2 case: a three-way cluster

The most striking result of Tier 2 is that the three conditions — Count-Bonus, Count-Matched, and PRISM — are **very close** in performance. Count-Bonus is the fastest (~1072 steps), followed by Count-Matched (~1145) and PRISM (~1251), but the CIs partially overlap. We will analyze this proximity in detail in sections 5 and 9.

#### Technical note — The Oracle fix

The original Oracle bonus used raw `||M(s,:) - M*(s,:)||` as the exploration signal. This L2 norm over a 260-dimensional vector is dominated by the "tail" of distant states — it reflects the **global connectivity** of the state (corridor vs corner) rather than whether the agent has already visited it or not. Result: a contrast of only 1.16x between visited and unvisited states, trapping the agent in local attractors (~31% coverage).

Adding a count-based decay factor (`/sqrt(visits+1)`) solved the problem. With this fix, the Oracle achieves 100% coverage on the 20 validation seeds, confirming that M* information provides real added value when properly exploited.
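
To make the fix concrete, here is a minimal sketch of the two bonus variants for a single state, assuming the bonus is computed row by row. The function name `oracle_bonus` and the toy rows are hypothetical; only the formulas `||M(s,:) - M*(s,:)||` and `/sqrt(visits+1)` come from the note above.

```python
import math


def oracle_bonus(m_row, m_star_row, n_visits, count_decay=True):
    """Exploration bonus for one state.

    raw     = L2 distance between the learned SR row and the true SR row
    decayed = raw / sqrt(visits + 1), the count-based fix
    """
    raw = math.dist(m_row, m_star_row)
    return raw / math.sqrt(n_visits + 1) if count_decay else raw


# Toy rows for a 3-state problem (made-up values)
m_row = [1.0, 0.0, 0.0]
m_star_row = [2.0, 1.0, 0.5]

raw = oracle_bonus(m_row, m_star_row, n_visits=8, count_decay=False)
decayed = oracle_bonus(m_row, m_star_row, n_visits=8)
print(raw, decayed)
```

With the decay, a state visited 8 times sees its bonus divided by 3, so the contrast between visited and unvisited states no longer depends on the raw norm alone.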

## 3. Comparative boxplots

Three panels: steps, coverage, redundancy.

In [None]:
colors = {
    "SR-Oracle": "#2E75B6", "PRISM": "#E87722", "SR-Posterior": "#5B9E3A",
    "SR-Count-Bonus": "#888888", "SR-Norm-Bonus": "#AAAAAA",
    "SR-e-greedy": "#CCCCCC", "SR-e-decay": "#BBBBBB", "Random": "#EEEEEE",
    "SR-Count-Matched": "#D4A017",
}

fig, axes = plt.subplots(1, 3, figsize=(22, 6))

# --- Panel A : Steps ---
data_steps = [df[df.condition == c].steps.values for c in conditions]
bp = axes[0].boxplot(data_steps, patch_artist=True, showfliers=False)
for patch, cond in zip(bp["boxes"], conditions):
    patch.set_facecolor(colors.get(cond, "#DDDDDD"))
axes[0].set_xticklabels([c.replace("SR-", "") for c in conditions], rotation=35, ha="right")
axes[0].set_ylabel("Steps to 4 goals (\u2193 better)")
axes[0].set_title("A: Exploration efficiency")

# --- Panel B : Coverage ---
data_cov = [df[df.condition == c].coverage.values for c in conditions]
bp2 = axes[1].boxplot(data_cov, patch_artist=True, showfliers=False)
for patch, cond in zip(bp2["boxes"], conditions):
    patch.set_facecolor(colors.get(cond, "#DDDDDD"))
axes[1].set_xticklabels([c.replace("SR-", "") for c in conditions], rotation=35, ha="right")
axes[1].set_ylabel("Final coverage (\u2191 better)")
axes[1].set_title("B: Coverage")

# --- Panel C : Redundancy ---
data_red = [df[df.condition == c].redundancy.values for c in conditions]
bp3 = axes[2].boxplot(data_red, patch_artist=True, showfliers=False)
for patch, cond in zip(bp3["boxes"], conditions):
    patch.set_facecolor(colors.get(cond, "#DDDDDD"))
axes[2].set_xticklabels([c.replace("SR-", "") for c in conditions], rotation=35, ha="right")
axes[2].set_ylabel("Redundancy, steps/unique states (\u2193 better)")
axes[2].set_title("C: Redundancy")

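fig.suptitle("Exp B: comparison of the 9 conditions", fontsize=14, fontweight="bold")
fig.tight_layout()
# Make sure the output directory exists (it may be missing when the legacy fallback was used)
Path("../results/exp_b").mkdir(parents=True, exist_ok=True)
plt.savefig("../results/exp_b/boxplots.png", dpi=150, bbox_inches="tight")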
plt.show()

### Reading the boxplots

#### How to read these plots

Each **box** summarizes the distribution of 100 runs for a condition:
- The **center line** of the box: the median (50% of runs above, 50% below)
- The **edges** of the box: the quartiles Q1 (25%) and Q3 (75%). The box thus contains 50% of runs.
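- The **whiskers**: extend to the most extreme values within 1.5 × IQR of the box (matplotlib's default). Points beyond them are outliers, hidden here for readability (`showfliers=False`).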

The 3 panels show complementary facets of the same experiment:

**Panel A (Steps, lower = better)**: how many steps to find the 4 goals. This is the primary metric. The 3 tiers are clearly visible: Oracle has a low, tight box (~500-800), the three Tier 2 conditions (Count-Bonus, Count-Matched, PRISM) are in the middle (~800-1600), and the other 5 are pressed against the 2000 ceiling.

**Panel B (Coverage, higher = better)**: the fraction of the grid explored. Oracle, PRISM, Count-Bonus, and Count-Matched reach 80-86%: they visit nearly the entire grid. The Tier 3 baselines plateau at 40-60%: they stay confined to 1-2 rooms.

**Panel C (Redundancy, lower = better)**: steps / unique states = how many times the agent passes through each visited cell on average. Oracle = ~3 (very efficient, few revisits), Tier 2 = ~5 (reasonable), Tier 3 = 15-20 (the agent loops through the same cells).

#### What the boxes reveal that averages hide

- **PRISM has greater dispersion than Count-Bonus and Count-Matched**: the PRISM box is wider (Q1-Q3 more spread out). Some PRISM runs are excellent (~700 steps), others are poor (~2000). Count-Bonus and Count-Matched are more consistent — nearly all runs finish before 1500. This PRISM variability could stem from the SR warm-up: depending on the initial trajectory, the warm-up may take more or less time.

- **Count-Matched and Count-Bonus have very similar boxes**: this is consistent — both use a visit-count-based bonus, with different decay profiles but identical logic.

- **The 5 Tier 3 baselines are nearly identical to each other**: no visible dispersion because almost all runs plateau at 2000 (time limit). Blind exploration simply does not have enough time to find the 4 goals.

## 4. Discovery curves

For each condition, we plot the cumulative number of goals found (averaged over runs) as a function of steps. This shows the *pace* of discovery, not just the final result.

In [None]:
MAX_STEPS = int(df.steps.max())
step_grid = np.arange(0, MAX_STEPS + 1, 10)  # resolution of 10 steps

fig, ax = plt.subplots(figsize=(12, 6))

for cond in conditions:
    cond_df = df[df.condition == cond]
    # For each run, compute cumulative goals at each step point
    disc_cols = [c for c in cond_df.columns if c.startswith("discovery_")]
    
    curves = []
    for _, row in cond_df.iterrows():
        disc_times = [row[c] for c in disc_cols]
        # At each step, count how many goals have been found
        cum_goals = np.array([sum(1 for t in disc_times if t <= s) for s in step_grid])
        curves.append(cum_goals)
    
    curves = np.array(curves)
    mean_curve = curves.mean(axis=0)
    
    color = colors.get(cond, "#333333")
    linewidth = 2.5 if cond in ("PRISM", "SR-Count-Matched") else 1.5
    alpha = 1.0 if cond in ("PRISM", "SR-Oracle", "Random", "SR-Count-Matched") else 0.7
    ax.plot(step_grid, mean_curve, color=color, linewidth=linewidth, 
            alpha=alpha, label=cond)

ax.axhline(y=4, color="gray", linestyle="--", alpha=0.3, label="4 goals")
ax.set_xlabel("Steps")
ax.set_ylabel("Goals found (mean)")
ax.set_title("Discovery curves: pace of exploration")
ax.legend(loc="lower right", fontsize=8)
ax.set_ylim([0, 4.2])
fig.tight_layout()
plt.savefig("../results/exp_b/discovery_curves.png", dpi=150, bbox_inches="tight")
plt.show()

### Reading the discovery curves

#### Purpose of this section

The summary table (section 2) gives the **final results**: how many goals found, in how many steps. But it says nothing about the **pace** — does the agent find goals steadily, or does it struggle for a long time before a sudden breakthrough?

The discovery curves answer this question. For each condition, we plot the average number of goals found as a function of time (steps). It is like watching a movie instead of a photograph.
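
For a single run, the curve is just a count of the discovery steps that fall at or before each point of the time grid. The sketch below shows one way to compute it with the standard library; `cumulative_goals` and the discovery steps are illustrative assumptions, with 2001 used as the "not found" sentinel described in section 2.

```python
from bisect import bisect_right


def cumulative_goals(discovery_steps, step_grid):
    """Number of goals already discovered at each step of step_grid."""
    ordered = sorted(discovery_steps)
    # bisect_right returns how many discovery steps are <= s
    return [bisect_right(ordered, s) for s in step_grid]


# Illustrative run: three goals found, the fourth never (sentinel 2001)
curve = cumulative_goals([120, 480, 950, 2001], range(0, 2001, 500))
print(curve)
```

Averaging such per-run curves over the 100 runs of a condition gives the mean curve plotted for that condition.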

#### What we see — how to read the plot

- **Horizontal axis**: time (number of steps elapsed since the start of the episode)
- **Vertical axis**: the average number of goals found at that moment (averaged over 100 runs)
- **Gray dashed line** at y=4: the maximum — all 4 goals have been found
- Each colored curve represents a condition

**The three typical profiles:**

1. **Oracle (blue)**: near-linear rise. In a typical run, all 4 goals are found within roughly 700 steps. The curve rises steadily, with no prolonged plateaus. This is the behavior of an agent that knows exactly where uncertainty remains and goes there directly.

2. **Tier 2 (orange PRISM, gold Count-Matched, gray Count-Bonus)**: staircase profile. We see phases of rapid rise (a goal has just been found) followed by **plateaus** (the agent is searching for the next goal). The three curves are very close — Count-Bonus rises slightly earlier in the first steps.

3. **Tier 3 (the other 5)**: the curves plateau around 1.5-2.5 goals and progress very slowly thereafter. Discovering the remaining goals depends on chance — the agent has no strategy for finding them.

#### Tier 2: three curves, one profile

The overlap of the Count-Bonus, Count-Matched, and PRISM curves is remarkable. The three conditions share the same **discovery pace**, with a slight offset in the first ~200 steps:

- **Count-Bonus** starts at full efficiency immediately. Its bonus (`lambda/sqrt(visits)`) is maximal from step 1. No initialization cost.
- **Count-Matched** starts nearly as fast. Its bonus is a lookup table `u_profile[visits]` calibrated on the PRISM profile — it also has no initialization cost.
- **PRISM** has a **warm-up**. At startup, the SR $M$ is initialized to the identity $I$ — the uncertainty $U(s)$ has no exploitable spatial structure. It takes ~100-200 steps for $M$ to reflect the grid geometry.

It is like the difference between three people searching for an object:
- Count-Bonus has a **checklist of rooms** — simple, effective from the start
- Count-Matched has the **same checklist, but with a pen that fades at the same rate** as PRISM's map
- PRISM **first builds a floor plan of the apartment** — slower at the start, but the map contains structural information
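
The warm-up can be illustrated with a single tabular TD(0) update of the successor representation, starting from $M = I$. This is the textbook SR update, used here as an assumption: the notebook does not show PRISM's actual update rule, and `sr_td_update`, `alpha`, and `gamma` are hypothetical names and values.

```python
def sr_td_update(M, s, s_next, alpha=0.1, gamma=0.95):
    """One tabular TD(0) update of row s of the successor representation M.

    target   = one_hot(s) + gamma * M[s_next]
    td_error = target - M[s]
    """
    n = len(M)
    td_error = [
        (1.0 if j == s else 0.0) + gamma * M[s_next][j] - M[s][j]
        for j in range(n)
    ]
    for j in range(n):
        M[s][j] += alpha * td_error[j]
    return td_error


n_states = 3
# M initialized to the identity: no spatial structure yet
M = [[1.0 if i == j else 0.0 for j in range(n_states)] for i in range(n_states)]
td_error = sr_td_update(M, s=0, s_next=1)
print(M[0])
```

After one transition from state 0 to state 1, row 0 has gained only a small amount of mass on its successor. Many such updates are needed before the rows reflect the grid geometry, which is the cost the count-based conditions do not pay.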

## 4bis. Tier 2 — In-depth analysis (PRISM vs Count-Bonus vs Count-Matched)

#### Purpose of this section

The table (section 2) and the curves (section 4) show that the three Tier 2 conditions are close. But "close" in what way exactly? Is the difference a uniform offset across each goal, or a more subtle pattern?

To answer this, we compare the **discovery time distributions** for each goal individually. This is more informative than a global average: it shows *where* each agent loses time.

In [None]:
# Tier 2 comparison: PRISM vs Count-Bonus vs Count-Matched
prism_df = df[df.condition == "PRISM"]
cb_df = df[df.condition == "SR-Count-Bonus"]
cm_df = df[df.condition == "SR-Count-Matched"]

disc_cols = ["discovery_1", "discovery_2", "discovery_3", "discovery_4"]
tier2 = [("PRISM", prism_df, "#E87722"), 
         ("Count-Bonus", cb_df, "#888888"),
         ("Count-Matched", cm_df, "#D4A017")]

fig, axes = plt.subplots(1, 4, figsize=(20, 5), sharey=True)
for i, col in enumerate(disc_cols):
    ax = axes[i]
    data = []
    pcts = []
    for name, cdf, color in tier2:
        vals = cdf[col].values
        found = vals[vals <= 2000]
        data.append(found if len(found) > 1 else np.array([0]))
        pcts.append(len(found) / len(vals) * 100)
    
    parts = ax.violinplot(data, positions=[1, 2, 3], showmedians=True, showextrema=False)
    for j, (body, (name, _, color)) in enumerate(zip(parts["bodies"], tier2)):
        body.set_facecolor(color)
        body.set_alpha(0.7)
    
    ax.set_title("Goal %d" % (i + 1))
    ax.set_xticks([1, 2, 3])
    ax.set_xticklabels(["PRISM", "Count-B", "Count-M"], fontsize=8)
    for j, (name, _, color) in enumerate(tier2):
        ax.text(j + 1, -100, "%.0f%%" % pcts[j], ha="center", fontsize=9, color=color)
    
    if i == 0:
        ax.set_ylabel("Discovery step")

fig.suptitle("Tier 2: discovery time per goal", fontsize=13, fontweight="bold")
fig.tight_layout()
plt.savefig("../results/exp_b/tier2_violins.png", dpi=150, bbox_inches="tight")
plt.show()

# Stats summary table
header = "%-20s %10s %12s %14s" % ("Metric", "PRISM", "Count-Bonus", "Count-Matched")
print(header)
print("-" * len(header))

# Labels are passed as %s arguments, so a single "%" is enough ("%%" would print twice)
metrics = [
    ("Steps (mean)",   [cdf.steps.mean() for _, cdf, _ in tier2]),
    ("Steps (median)", [cdf.steps.median() for _, cdf, _ in tier2]),
    ("All found %",    [cdf.all_found.mean() * 100 for _, cdf, _ in tier2]),
    ("Coverage %",     [cdf.coverage.mean() * 100 for _, cdf, _ in tier2]),
    ("Redundancy",     [cdf.redundancy.mean() for _, cdf, _ in tier2]),
]

for label, vals in metrics:
    print("%-20s %10.1f %12.1f %14.1f" % (label, vals[0], vals[1], vals[2]))

#### Reading the violin plots

**How to read each panel:**
- **Horizontal axis**: the three Tier 2 conditions (PRISM, Count-Bonus, Count-Matched)
- **Vertical axis**: the step at which this goal was found (lower = earlier = better)
- **Violin shape**: the distribution — where the violin is wide, many runs have that discovery time
- **Horizontal line** at the center of the violin: the median
- **Percentage** below each violin: the proportion of runs where this goal was found

The 4 goals are numbered by **sorted spatial position** (coordinates (x,y) in lexicographic order). For a given run, the three conditions compare **exactly the same goal** at the same location on the grid — this is a perfect placement control.

#### Interpretation

The dominant result is the **similarity** between the three conditions. The violins have comparable shapes, close medians, and similar success rates. The differences manifest mainly in the **distribution tails**:

- **Count-Bonus** has the shortest tails: it is the most regular, with very few catastrophic runs.
- **Count-Matched** is intermediate: its decay profile (calibrated on PRISM) gives it behavior very close to Count-Bonus.
- **PRISM** has the longest tails: some runs are excellent (find the goal before the other two), others struggle for a long time (unfavorable warm-up).

#### What this means for the thesis

The three Tier 2 conditions form a **performance cluster**. This clustering has a precise implication:

- **Count-Bonus** uses `lambda/sqrt(visits)` — a purely temporal signal (how many times the state has been visited)
- **Count-Matched** uses `u_profile[visits]` — the same decay profile as PRISM, but without structural content
- **PRISM** uses U(s) — a signal based on the SR's TD errors, containing both a temporal profile AND spatial structure
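
The two count-based signals can be sketched as follows. Both functions are hypothetical illustrations: flooring the visit count at 1 and clamping the lookup at the last profile entry are assumptions, and the `u_profile` values are made up rather than taken from the calibrated PRISM profile.

```python
import math


def count_bonus(n_visits, lam=1.0):
    """Count-Bonus signal lambda / sqrt(visits).

    Assumption: visits is floored at 1 to avoid dividing by zero.
    """
    return lam / math.sqrt(max(n_visits, 1))


def count_matched_bonus(n_visits, u_profile):
    """Count-Matched signal: lookup in a precomputed decay profile.

    Assumption: counts beyond the table reuse its last entry.
    """
    return u_profile[min(n_visits, len(u_profile) - 1)]


u_profile = [1.0, 0.8, 0.5, 0.3, 0.2]  # made-up decay profile

for visits in (1, 4, 10):
    print(visits, count_bonus(visits), count_matched_bonus(visits, u_profile))
```

Both signals depend only on the visit count: two states visited the same number of times receive the same bonus wherever they sit on the grid, which is what "without structural content" means here.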

The fact that the three are indistinguishable on this task means that in FourRooms, **the temporal decay profile suffices** to guide exploration. The spatial structure of U(s) does not provide a measurable speed gain. This is exactly what section 9 will analyze in detail.

The added value of PRISM is not in exploration speed — it lies in the **qualitative** properties of U(s): calibration (Exp A), adaptation (Exp C), spatial coherence, and spectral decomposition.

## 5. Statistical tests

Mann-Whitney U (one-sided): PRISM < each condition.
Holm-Bonferroni correction for multiple comparisons.
Effect size: rank-biserial r.

In [None]:
# Build steps dict
steps_dict = {}
for cond in conditions:
    steps_dict[cond] = df[df.condition == cond].steps.values

# Run comparisons
comparisons = compare_conditions(steps_dict, reference="PRISM", alternative="less")

# Display
print(f"{'Condition':<20s} {'PRISM':>8s} {'Other':>8s} {'p-value':>10s} {'p-corr':>10s} {'Effect r':>9s} {'Sig':>5s}")
print("-" * 75)
for r in comparisons:
    sig = ""
    if r["p_corrected"] < 0.05:
        sig = "*"
    if r["p_corrected"] < 0.01:
        sig = "**"
    if r["p_corrected"] < 0.001:
        sig = "***"
    print(f"{r['condition']:<20s} {r['ref_mean']:>7.0f}  {r['cond_mean']:>7.0f}  "
          f"{r['p_value']:>9.4f}  {r['p_corrected']:>9.4f}  "
          f"{r['effect_size']:>+8.3f}  {sig:>4s}")

### Reading the statistical tests

#### Purpose of this section

The boxplots and curves show **visual trends**, but we need to know whether these differences are **real** or could be due to the randomness of the 100 runs. This is the role of statistical tests.

We use the **Mann-Whitney U** test (one-sided): for each condition, we test "Does PRISM have significantly lower steps?". This is a non-parametric test — it does not assume the data are normally distributed, which is important here (the step distributions are highly asymmetric, with a ceiling at 2000).

The **Holm-Bonferroni** correction adjusts p-values for the fact that we perform 8 simultaneous comparisons (one per baseline). Without this correction, we would risk finding "false positives" through accumulation of tests.
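
As an illustration of the correction step, here is a minimal sketch of the Holm step-down adjustment on a list of raw p-values. It is not the code of `compare_conditions`, whose internals are not shown here; the function name `holm_bonferroni` and the p-values are made up.

```python
def holm_bonferroni(p_values):
    """Return Holm-adjusted p-values, in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Adjusted p-values must not decrease along the sorted order
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


raw_p = [0.001, 0.04, 0.03, 0.5]  # made-up raw p-values
adjusted = holm_bonferroni(raw_p)
print(adjusted)
```

Holm is uniformly less conservative than plain Bonferroni (which would multiply every p-value by m) while still controlling the family-wise error rate.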

#### What we see — how to read the table

- **Condition**: the baseline against which PRISM is compared
- **PRISM / Other**: the mean steps for both groups
- **p-value**: the probability of observing a difference at least this large if the two conditions had the same distribution
- **p-corr**: the p-value after Holm-Bonferroni correction
- **Effect r**: the rank-biserial effect size, between -1 and +1:
  - r > 0: PRISM is **better** (lower steps)
  - r < 0: PRISM is **worse** (higher steps)
  - |r| > 0.5: large effect, |r| > 0.8: very large effect
- **Sig**: significance stars after correction

#### Interpretation by group

**Against the 5 Tier 3 baselines** (epsilon-decay, Random, epsilon-greedy, Posterior, Norm-Bonus):
- Massive effect sizes: r between +0.83 and +0.88
- Corrected p-values close to 0
- **Conclusion**: PRISM's superiority is overwhelming and unambiguous.

In concrete terms: if we randomly pick one PRISM run and one epsilon-greedy run, PRISM will be faster in ~94% of cases.
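
The conversion from effect size to "faster in X% of cases" uses the identity between the rank-biserial correlation and the probability of superiority, P = (1 + r) / 2, with ties counted as half. A small sketch, where the function name is hypothetical and the r values are the ones quoted in this section:

```python
def prob_superiority(r):
    """Convert a rank-biserial r into the probability that a random
    reference run beats a random comparison run (ties count as half)."""
    return (1.0 + r) / 2.0


for r in (0.83, 0.88, -0.20):
    print(f"r = {r:+.2f} -> {prob_superiority(r):.1%}")
```

An r of 0 therefore corresponds to a coin flip (50%), and the sign of r tells which condition tends to win.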

**Against SR-Count-Bonus and SR-Count-Matched** (negative r, corrected p = 1.0):
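- The effect is **negative**: PRISM is slightly worse, not merely "non-significantly different". The corrected p = 1.0 reflects the direction of the one-sided test (it only asks whether PRISM is faster); on its own it is not evidence of equality.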
- But the effect is **small** (|r| < 0.25). With n=100, a Mann-Whitney test can detect effects of |r| >= 0.2 with > 80% power.
- In concrete terms: Count-Bonus or Count-Matched will be faster in ~55-60% of cases. This is a modest advantage.

**Against SR-Oracle** (strongly negative r, p = 1.0):
- The effect is strongly negative — Oracle is clearly better, as expected. It knows M* and has no learning cost.

#### What the tests confirm

The **three-tier** structure is statistically robust:
- **Tier 1** (Oracle): significantly better than everything else
- **Tier 2** (Count-Bonus, Count-Matched, PRISM): all three are significantly better than Tier 3, but do not differ from each other
- **Tier 3** (the other 5, Random included): the four non-random baselines perform at Random's level

The fact that PRISM, Count-Bonus, and Count-Matched are statistically indistinguishable is the **central result** of this experiment — it will be analyzed in depth in section 9.

## 6. Efficiency ratio

**Efficiency ratio** = (Random - Condition) / (Random - Oracle)

- 1.0 = as good as Oracle (which knows M*)
- 0.0 = as good as Random
- Target for PRISM: > 0.5

In [None]:
random_med = np.median(steps_dict.get("Random", [2000]))
oracle_med = np.median(steps_dict.get("SR-Oracle", [500]))
denominator = random_med - oracle_med

fig, ax = plt.subplots(figsize=(10, 6))

ratios = {}
for cond in conditions:
    med = np.median(steps_dict[cond])
    if denominator != 0:
        ratios[cond] = (random_med - med) / denominator
    else:
        ratios[cond] = 0.0

# Sort by ratio
sorted_conds = sorted(ratios, key=ratios.get, reverse=True)
y_pos = range(len(sorted_conds))

bars = ax.barh(y_pos, [ratios[c] for c in sorted_conds], 
               color=[colors.get(c, "#DDDDDD") for c in sorted_conds], alpha=0.85)
ax.set_yticks(y_pos)
ax.set_yticklabels(sorted_conds)
ax.axvline(x=0.5, color="green", linestyle="--", alpha=0.5, label="Target 0.5")
ax.axvline(x=1.0, color="blue", linestyle="--", alpha=0.3, label="Oracle")
ax.axvline(x=0.0, color="red", linestyle="--", alpha=0.3, label="Random")
ax.set_xlabel("Efficiency ratio")
ax.set_title("Fraction of the theoretical gain captured")
ax.legend()

# Annotate values
for i, cond in enumerate(sorted_conds):
    ax.text(ratios[cond] + 0.02, i, f"{ratios[cond]:.2f}", va="center", fontsize=9)

fig.tight_layout()
plt.savefig("../results/exp_b/efficiency_ratio.png", dpi=150, bbox_inches="tight")
plt.show()

if "PRISM" in ratios:
    print(f"\nPRISM efficiency ratio: {ratios['PRISM']:.3f}")
    print(f"  Target: > 0.5 {'PASS' if ratios['PRISM'] > 0.5 else 'FAIL'}")

### Reading the efficiency ratio

#### Purpose of this section

The statistical tests (section 5) tell us "Is PRISM significantly better than X?". But they do not tell us **by how much** on an interpretable scale.

The efficiency ratio places each condition on an axis between two bounds:
- **0.0 = Random** (the simplest possible agent)
- **1.0 = Oracle** (the agent that knows M*)

A ratio of 0.60 means "PRISM captures 60% of the gain one would obtain by going from Random to Oracle".
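
As a worked example of this reading, the sketch below applies the ratio to made-up medians chosen so that the result is 0.60. The function name and the three median values are illustrative assumptions, not measured results.

```python
def efficiency_ratio(median_cond, median_random, median_oracle):
    """(Random - Condition) / (Random - Oracle), computed on median steps."""
    denominator = median_random - median_oracle
    if denominator == 0:
        return 0.0  # degenerate case: Oracle no better than Random
    return (median_random - median_cond) / denominator


# Hypothetical medians: Random 2000, Oracle 700, condition 1220
ratio = efficiency_ratio(median_cond=1220, median_random=2000, median_oracle=700)
print(f"{ratio:.2f}")
```

Here the condition saves 780 of the 1300 steps that separate Random from Oracle, hence a ratio of 0.60.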

#### What we see

**Formula**: Efficiency ratio = (median Random - median Condition) / (median Random - median Oracle)

We use the **median** rather than the mean because the step distributions are highly asymmetric (capped at 2000).

#### Interpretation

| Tier | Conditions | Ratio | What it means |
|------|-----------|:---:|---|
| **Tier 1** | SR-Oracle | 1.00 | By definition. This is the ceiling. |
| **Tier 2** | Count-Bonus, Count-Matched, PRISM | 0.60-0.65 | All three capture more than half of the theoretical gain. Close to each other. |
| **Tier 3** | The other 5 | <= 0.04 | Virtually indistinguishable from Random. |

**Key observation**: there is a **chasm** between Tier 2 (ratio >= 0.60) and Tier 3 (ratio <= 0.04). There is no intermediate condition. Either an agent has a good exploration signal, or it performs at Random's level.

The fact that Count-Matched (ratio ~0.63) is virtually identical to Count-Bonus (~0.64) confirms that the decay profile calibrated on PRISM is **as effective** as the classic `1/sqrt(visits)`.

#### Thesis framing

PRISM reaches the 0.5 target — this is a positive result. But the true value of PRISM is measured in the following experiments:
- **Exp A** (calibration): Does U(s) faithfully reflect the actual error of M?
- **Exp C** (adaptation): When the environment changes, does U(s) rise in the perturbed areas? Count-Bonus cannot do this — its counter never goes back up.

## 6bis. Discovery AUC — a finer metric

#### Why a new metric?

The primary metric of Exp B is the **number of steps to find the 4 goals**. It is simple and interpretable, but it is a **coarse scalar** that hides the discovery profile.

Consider two agents:
- Agent A finds 3 goals in 100 steps, then struggles for 1900 steps on the last one → steps = 2000
- Agent B finds 1 goal every 500 steps → steps = 2000

Same score, but Agent A is clearly better: it found 3 goals ultra-fast, it is just stuck on a difficult case. The discovery AUC captures this difference.

#### Formula

$$\text{AUC} = \frac{1}{n_{\text{goals}} \times T_{\max}} \sum_{i=1}^{n_{\text{goals}}} \max(0,\; T_{\max} - d_i)$$

where $d_i$ is the discovery step of goal $i$ (or $T_{\max}+1$ if not found).

- **AUC = 1.0** if all goals are found at step 0 (impossible in practice)
- **AUC = 0.0** if no goal is found
- **AUC = 0.5** means goals are found on average at mid-run

The AUC rewards early discoveries: an agent that finds most goals quickly but gets stuck on the last one scores higher than an agent that finds goals at a slow, steady pace and finishes at the same step.
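
Applying the formula to the two agents introduced above makes the difference concrete. The sketch assumes Agent A's three fast goals are all found at step 100 and that the last goal of each agent is found at step 2000; the function name `discovery_auc` is illustrative.

```python
def discovery_auc(discovery_steps, t_max=2000):
    """AUC = sum(max(0, t_max - d_i)) / (n_goals * t_max)."""
    n_goals = len(discovery_steps)
    return sum(max(0, t_max - d) for d in discovery_steps) / (n_goals * t_max)


agent_a = [100, 100, 100, 2000]    # 3 goals early, last one at the limit
agent_b = [500, 1000, 1500, 2000]  # one goal every 500 steps

auc_a = discovery_auc(agent_a)
auc_b = discovery_auc(agent_b)
print(f"Agent A: {auc_a:.4f}, Agent B: {auc_b:.4f}")
```

Both agents have steps = 2000, yet Agent A's AUC is nearly twice Agent B's, because three of its goals each earn 1900 steps of credit.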

In [None]:
# --- Discovery AUC computation ---
T_MAX = 2000
N_GOALS = 4

if "auc_discovery" not in df.columns:
    def compute_auc_from_row(row, n_goals=N_GOALS, t_max=T_MAX):
        """AUC = sum(max(0, t_max - d_i) for found goals) / (n_goals * t_max)."""
        discs = [row[f"discovery_{i}"] for i in range(1, n_goals + 1)]
        return sum(max(0, t_max - d) for d in discs if d <= t_max) / (n_goals * t_max)
    df["auc_discovery"] = df.apply(compute_auc_from_row, axis=1)
    print("AUC computed post hoc")
else:
    print("AUC already present in the CSV")

# --- Bar chart ---
fig, ax = plt.subplots(figsize=(10, 6))

auc_data = []
for cond in conditions:
    auc_vals = df[df.condition == cond].auc_discovery.values
    mean_auc, ci_lo, ci_hi = bootstrap_ci(auc_vals)
    auc_data.append({"condition": cond, "mean": mean_auc, "ci_lo": ci_lo, "ci_hi": ci_hi})

# Sort by AUC descending
auc_data.sort(key=lambda x: x["mean"], reverse=True)
sorted_conds_auc = [d["condition"] for d in auc_data]

y_pos = range(len(auc_data))
means = [d["mean"] for d in auc_data]
errors_lo = [d["mean"] - d["ci_lo"] for d in auc_data]
errors_hi = [d["ci_hi"] - d["mean"] for d in auc_data]

bars = ax.barh(y_pos, means,
               xerr=[errors_lo, errors_hi],
               color=[colors.get(c, "#DDDDDD") for c in sorted_conds_auc],
               alpha=0.85, capsize=3)
ax.set_yticks(y_pos)
ax.set_yticklabels(sorted_conds_auc)
ax.set_xlabel("Discovery AUC (\u2191 better)")
ax.set_title("Discovery AUC: speed and timing of discovery")
ax.set_xlim([0, 1])
ax.axvline(x=0.5, color="green", linestyle="--", alpha=0.3, label="AUC = 0.5")
ax.legend()  # without this call the "AUC = 0.5" label is never displayed

for i, d in enumerate(auc_data):
    ax.text(d["mean"] + 0.02, i, f"{d['mean']:.3f}", va="center", fontsize=9)

fig.tight_layout()
plt.savefig("../results/exp_b/auc_discovery.png", dpi=150, bbox_inches="tight")
plt.show()

# --- Summary table ---
print(f"\n{'Condition':<20s} {'AUC mean':>10s} {'CI 95%':>18s}")
print("-" * 50)
for d in auc_data:
    print(f"{d['condition']:<20s} {d['mean']:>10.4f} [{d['ci_lo']:.4f}, {d['ci_hi']:.4f}]")

### Reading the discovery AUC

#### How to read the plot

Each bar represents the mean AUC over 100 runs, with an error bar (95% bootstrap CI). Conditions are sorted from best (top) to worst (bottom).

#### What the AUC adds compared to steps

The AUC captures two aspects that the step count alone does not show:

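1. **Regularity**: an agent that finds the 4 goals at a constant pace has a higher AUC than an agent that finds nothing for most of the run and then all 4 goals in a late burst, even if both finish at the same step. Each goal contributes its own credit $T_{\max} - d_i$, so the AUC depends on when every goal is found, not only on the last one.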

2. **Partial credit**: an agent that finds 3 out of 4 goals but not the last one still gets an AUC > 0. With the "steps to 4 goals" metric, it receives the sentinel 2001 — no distinction from an agent that finds none.
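
The partial-credit point can be made concrete with two hypothetical runs. The sketch below assumes that the `steps` metric equals the latest discovery step (hence the sentinel when any goal is missing); the helper names and the discovery steps are illustrative, not taken from the experiment code or data.

```python
T_MAX = 2000
NOT_FOUND = T_MAX + 1  # sentinel for a goal that is never discovered


def steps_to_all_goals(discovery_steps):
    """Assumed definition of `steps`: step at which the last goal is found."""
    return max(discovery_steps)


def discovery_auc(discovery_steps, t_max=T_MAX):
    """Mean per-goal credit max(0, t_max - d_i) / t_max."""
    credit = sum(max(0, t_max - d) for d in discovery_steps)
    return credit / (len(discovery_steps) * t_max)


# Hypothetical runs: three goals found vs. no goal found
partial_run = [150, 400, 900, NOT_FOUND]
empty_run = [NOT_FOUND] * 4

for name, run in [("3 of 4 goals", partial_run), ("0 of 4 goals", empty_run)]:
    print(f"{name}: steps = {steps_to_all_goals(run)}, AUC = {discovery_auc(run):.3f}")
```

Both runs receive the same `steps` value, while the AUC separates them.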

#### Interpretation of results

The ranking by AUC is **the same** as by steps — no surprising reordering. The three-tier structure is intact.

**The Tier 2 cluster** is even more visible in AUC: the three conditions (Count-Bonus, Count-Matched, PRISM) are very close. The gap between PRISM and the two Count-* conditions is **slightly reduced** in AUC compared to raw steps, which is consistent with the warm-up hypothesis: PRISM loses time at the start of the run, which delays the completion of all 4 goals, but it still finds the first goals early enough to earn good AUC credit.

**Tier 3** has a very low but > 0 AUC: these agents find 1-2 goals by chance, giving them partial credit.

## 7. CP4 Diagnostic — Go / No-Go

Automated verification of Checkpoint 4 criteria (checkpoints.md).

In [None]:
print("=" * 60)
print("CP4: GO / NO-GO DIAGNOSTIC")
print("=" * 60)
print()

prism_steps = steps_dict.get("PRISM", np.array([]))
eff = ratios.get("PRISM", 0)

# Find significant wins
sig_wins = [r["condition"] for r in comparisons if r["p_corrected"] < 0.05]

checks = [
    ("PRISM beats SR-e-greedy (p < 0.05)",
     any(r["condition"] == "SR-e-greedy" and r["p_corrected"] < 0.05 for r in comparisons),
     "wins: %s" % sig_wins),
    ("PRISM beats SR-Count-Bonus (p < 0.05)",
     any(r["condition"] == "SR-Count-Bonus" and r["p_corrected"] < 0.05 for r in comparisons),
     ""),
    ("PRISM beats SR-Norm-Bonus (p < 0.05)",
     any(r["condition"] == "SR-Norm-Bonus" and r["p_corrected"] < 0.05 for r in comparisons),
     ""),
    ("Efficiency ratio > 0.5",
     eff > 0.5,
     "%.3f" % eff),
    ("Mean goals found > 3.5",
     df[df.condition == "PRISM"].goals_found.mean() > 3.5 if "PRISM" in df.condition.values else False,
     "%.1f/4" % df[df.condition == "PRISM"].goals_found.mean() if "PRISM" in df.condition.values else "N/A"),
    ("Coverage > 60%",
     df[df.condition == "PRISM"].coverage.mean() > 0.6 if "PRISM" in df.condition.values else False,
     "%.1f%%" % (df[df.condition == "PRISM"].coverage.mean() * 100) if "PRISM" in df.condition.values else "N/A"),
]

for label, passed, detail in checks:
    status = "PASS" if passed else "FAIL"
    detail_str = " \u2014 %s" % detail if detail else ""
    print("  [%s] %s%s" % (status, label, detail_str))

n_pass = sum(1 for _, p, _ in checks if p)
print()
if n_pass == len(checks):
    print("  >>> %d/%d: GO, all criteria pass" % (n_pass, len(checks)))
elif n_pass >= len(checks) - 2:
    print("  >>> %d/%d: GO with reservation" % (n_pass, len(checks)))
else:
    print("  >>> %d/%d: STOP, diagnosis needed" % (n_pass, len(checks)))

# Additional Tier 2 checks
print()
print("=" * 60)
print("Tier 2: additional controls")
print("=" * 60)
print()

# PRISM vs Count-Matched
cm_comp = [r for r in comparisons if r["condition"] == "SR-Count-Matched"]
if cm_comp:
    r = cm_comp[0]
    status = "PRISM \u2248 Count-Matched" if r["p_corrected"] > 0.05 else "PRISM > Count-Matched"
    print("  PRISM vs Count-Matched : p=%.4f, r=%+.3f \u2192 %s" % (r["p_corrected"], r["effect_size"], status))

# Count-Matched vs Tier 3
if "SR-Count-Matched" in steps_dict:
    from scipy.stats import mannwhitneyu
    cm_steps = steps_dict["SR-Count-Matched"]
    for baseline in ["SR-e-greedy", "Random"]:
        if baseline in steps_dict:
            stat, p = mannwhitneyu(cm_steps, steps_dict[baseline], alternative="less")
            n1, n2 = len(cm_steps), len(steps_dict[baseline])
            effect_r = 1 - 2 * stat / (n1 * n2)
            print("  Count-Matched vs %s : p=%.4f, r=%+.3f" % (baseline, p, effect_r))

### Reading the CP4 diagnostic

#### Purpose of this section

CP4 (Checkpoint 4) is a **decision gate**: we verify a list of predefined criteria to decide whether the results of Exp B are sufficient to proceed. The criteria were defined **before** running the experiment.

#### The 6 original criteria

The overall result is **GO with reservation** (5/6). The only missing criterion is "PRISM beats Count-Bonus" — PRISM also does not beat Count-Matched.

This is not a failure of PRISM. It is an **informative result**: in FourRooms, exploration speed does not discriminate between a structural signal (SR) and a temporal signal (visit counting).

#### The Tier 2 controls

The additional controls confirm:

1. **PRISM and Count-Matched are indistinguishable** (p >> 0.05). The decay profile calibrated on PRISM shows no detectable difference in effectiveness from PRISM itself. This is consistent with the temporal component of U(s) accounting for all of the observed exploration efficiency in this setting.

2. **Count-Matched beats Tier 3** with large effects. The PRISM decay profile is a good exploration signal — it is just as good when it comes from a lookup table as when it emerges from TD errors.

#### Implication

The diagnostic points to a **clear but nuanced** conclusion:

- **For pure exploration**: PRISM = Count-Matched = Count-Bonus. No added value from the SR.
- **For what follows (Exp A, C)**: the qualitative properties of U(s) — calibration, adaptation, spatial coherence — are the real differentiators. Count-Matched cannot adapt to an environment change, nor provide a signal calibrated on actual error.

## 8. Conclusions and next steps

### What we learned

This experiment tested **proposition P3** of the thesis: "The uncertainty signal U(s) from PRISM, derived from the SR's TD errors, guides exploration more efficiently than alternative approaches."

#### Result 1 — PRISM explores significantly better than blind exploration

PRISM beats 5 of the 8 baselines with massive effect sizes (r > 0.82, p < 0.001). The efficiency ratio is ~0.60, above the 0.50 target.

**What this confirms**: the U(s) signal is a functional exploration bonus. It guides the agent toward unexplored areas and enables discovering the 4 goals in ~88% of runs, compared to 0-3% for the Tier 3 baselines.

#### Result 2 — Tier 2 is a cluster: PRISM = Count-Bonus = Count-Matched

This is the central result. The three Tier 2 conditions are **statistically indistinguishable** in exploration performance. PRISM does not beat Count-Bonus or Count-Matched (p >> 0.05).

**What this means**: in FourRooms, the **temporal shape** of the exploration bonus (decaying with the number of visits) suffices to guide exploration. The **spatial structure** contained in U(s) via the SR does not provide a measurable speed gain.

This is an **honest and informative** result. It does not disqualify PRISM — it specifies its domain of added value: not exploration speed, but calibration and adaptation.

#### Result 3 — Count-Matched validates the temporal control

The fact that Count-Matched (bonus = lookup table `u_profile[visits]`) performs as well as PRISM confirms that it is the **decay profile** of U(s) — and not its structural content — that drives exploration. This control was the main methodological contribution of the Tier 2 analysis.

#### Result 4 — The Oracle works (after fix)

SR-Oracle is the theoretical ceiling: ~681 mean steps, 100% success rate. This result confirms that M* information has real value for exploration, provided it is properly exploited (fix `||M-M*||/sqrt(visits+1)`).

#### Result 5 — The AUC confirms the 3-tier structure

The discovery AUC metric captures both speed and regularity. The ranking is identical — the PRISM vs Count-* gap is slightly reduced in AUC, consistent with the warm-up hypothesis.

### CP4 Diagnostic: GO with reservation (5/6)

The only missing criterion (PRISM > Count-Bonus) is now explained by the Tier 2 analysis: SR structure does not provide a speed gain in this simple geometry. The other 5 criteria pass comfortably.

### Next steps

| Step | Objective | What it tests |
|------|----------|---------------|
| **Exp A** — Calibration | ECE, MI, reliability diagram | Does U(s) faithfully reflect the actual error of M? |
| **Exp C** — Adaptation | Change detection, re-exploration after perturbation | Does U(s) rise when the environment changes? |
| **Tier 3** — Multi-geometry | FourRooms + corridor + 3x3 grid | Does SR structure help in geometries with bottlenecks? |

### Limitations of Exp B

- **A single geometry**: FourRooms 19x19, 4 symmetric rooms. Results could differ in an environment with asymmetric bottlenecks where SR structure could be exploited.
- **Fixed hyperparameters**: no sweep on PRISM parameters. The warm-up could be reduced with a higher alpha_M.
- **Specific task**: discovering 4 hidden goals. Other exploration tasks could yield different results.
- **Tabular agent**: the SR is a 260x260 matrix. Switching to neural representations would change the learning dynamics.

---

## 9. SR-Count-Matched — Temporal profile control

### What this condition tests

The central result of sections 1–8 is that PRISM, Count-Bonus, and Count-Matched form a cluster. But this observation does not tell us **why** the three are similar. Section 9 decomposes the U(s) signal of PRISM into two components:

1. **Temporal component**: the fact that U(s) decays when state s is revisited (like an inverse visit counter)
2. **Structural component**: the fact that U(s) reflects local topology via the SR's TD errors (the neighbors of a visited state also have reduced U)

The **SR-Count-Matched** isolates the temporal component. Its bonus is:

$$\text{bonus}_{\text{matched}}(s) = u_{\text{profile}}[\min(\text{visits}(s),\; N_{\max})]$$

where `u_profile` is a table calibrated to have **exactly the same decay profile** as PRISM's U(s), but **without structural content** (the bonus depends only on the visit count of the current state, not on the transition structure).
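
The lookup can be sketched in a few lines. This is an illustration under stated assumptions, not the runner's implementation: `matched_bonus` is a hypothetical helper, $N_{\max}$ is assumed to be the last index of the table, and the profile values are hand-written placeholders (the real table is loaded from `u_profile.npy`).

```python
def matched_bonus(visits, u_profile):
    """Bonus that depends only on the visit count of the current state."""
    n_max = len(u_profile) - 1  # assumption: N_max is the last table index
    return u_profile[min(visits, n_max)]


# Placeholder profile, decreasing with the visit count (not real calibration data)
u_profile = [0.80, 0.55, 0.40, 0.30, 0.22, 0.17]

for visits in (0, 2, 5, 40):
    print(f"visits = {visits:>2d} -> bonus = {matched_bonus(visits, u_profile):.2f}")
```

Because the bonus is indexed by `visits(s)` alone, two states with the same visit count receive the same bonus regardless of where they sit in the maze, which is what removes the structural content.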

### Expected vs observed result

| Result | What it would mean |
|---|---|
| **PRISM > Count-Matched** | SR structure provides exploitable information beyond the temporal profile |
| **PRISM = Count-Matched** | All of U(s)'s exploration signal is contained in its decay profile |

**Observed result: PRISM = Count-Matched.** The structural component does not provide a measurable gain in FourRooms.

In [None]:
# === 9a. u_profile calibration profile ===
from pathlib import Path
import numpy as np

# Try to load u_profile from the run directory or fallback to exp_b_v2
u_profile = None
if run_dir is not None:
    u_path = run_dir / "u_profile.npy"
    if u_path.exists():
        u_profile = np.load(u_path)
if u_profile is None:
    u_path = Path("../results/exp_b_v2/u_profile.npy")
    if u_path.exists():
        u_profile = np.load(u_path)

if u_profile is not None:
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
    
    # Left: full profile (log scale)
    ax1.plot(u_profile, color="#D4A017", linewidth=1.5)
    ax1.set_xlabel("visits(s)")
    ax1.set_ylabel("u_profile[visits]")
    ax1.set_title("U(s) calibration profile: full view")
    ax1.set_yscale("log")
    ax1.axhline(y=u_profile[0], color="gray", linestyle="--", alpha=0.3,
                label="U_prior = %.2f" % u_profile[0])
    ax1.legend()
    
    # Right: zoom on first 50 visits (linear scale)
    zoom = min(50, len(u_profile))
    ax2.plot(range(zoom), u_profile[:zoom], color="#D4A017", linewidth=2, marker="o", markersize=3)
    ax2.set_xlabel("visits(s)")
    ax2.set_ylabel("u_profile[visits]")
    ax2.set_title("Zoom: 0\u201350 visits (critical phase)")
    
    # Overlay 1/sqrt(n+1) for comparison
    n = np.arange(zoom)
    count_bonus_ref = u_profile[0] / np.sqrt(n + 1)
    ax2.plot(n, count_bonus_ref, color="#888888", linestyle="--", alpha=0.6,
             label="Ref: U_prior / sqrt(n+1)")
    ax2.legend()
    
    fig.suptitle("SR-Count-Matched: calibrated decay profile",
                 fontsize=13, fontweight="bold")
    fig.tight_layout()
    plt.savefig("../results/exp_b/u_profile.png", dpi=150, bbox_inches="tight")
    plt.show()
    
    print("Profile: %d bins, range [%.4f, %.4f]" % (len(u_profile), u_profile.min(), u_profile.max()))
    print("U_prior (visits=0): %.4f" % u_profile[0])
    print("U at 10 visits: %.4f" % u_profile[min(10, len(u_profile)-1)])
    print("U at 50 visits: %.4f" % u_profile[min(50, len(u_profile)-1)])
else:
    print("u_profile.npy not found: calibration unavailable")

### Reading the calibration profile

#### Left panel — Full view (log scale)

The profile shows how U(s) evolves as a function of the number of visits, averaged over 20 calibration runs under a random policy. The logarithmic scale reveals two regimes:

1. **Rapid decay** (0–20 visits): U(s) drops from ~0.80 (U_prior) to ~0.17. This is the critical phase for exploration — rarely visited states have a 4-5x higher bonus than frequently visited states.

2. **Plateau** (20+ visits): U(s) stabilizes at ~0.16. The sliding buffer of K=20 errors reaches its steady state. The bonus no longer decreases: the residual uncertainty reflects the intrinsic variability of TD errors under a random policy.

#### Right panel — Zoom on the first 50 visits

The solid curve (gold) is PRISM's calibrated profile; the dashed curve (gray) is the reference `U_prior / sqrt(n+1)` (Count-Bonus profile). The two profiles are **very close** over the first visits but then diverge: PRISM decays faster and then stabilizes, whereas Count-Bonus continues to decay (but more slowly).

This initial similarity explains why the three Tier 2 conditions are indistinguishable: the 0–20 visit phase is the most critical for exploration (this is where the bonus orients the first actions), and in this phase, all decay profiles look alike.
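
To put numbers on the reference curve, the short sketch below evaluates `U_prior / sqrt(n+1)` at a few visit counts, using the approximate values quoted in this section (U_prior ≈ 0.80, plateau ≈ 0.16). These constants are rounded readings from the text, not values loaded from `u_profile.npy`.

```python
import math

U_PRIOR = 0.80  # approximate bonus at visits = 0, as quoted above
PLATEAU = 0.16  # approximate plateau of the calibrated profile (20+ visits)


def count_bonus_ref(n, u_prior=U_PRIOR):
    """Reference Count-Bonus profile: U_prior / sqrt(n + 1)."""
    return u_prior / math.sqrt(n + 1)


for n in (0, 5, 20, 50, 200):
    print(f"visits = {n:>3d}   reference = {count_bonus_ref(n):.3f}   plateau = {PLATEAU:.2f}")
```

With these rounded constants, the reference curve reaches roughly 0.17 at 20 visits, close to the plateau level, and keeps decreasing below it afterwards, while the calibrated profile stays flat.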

In [None]:
# === 9b. PRISM vs Count-Matched — Head-to-head ===
from scipy.stats import mannwhitneyu

prism_s = df[df.condition == "PRISM"].steps.values
cm_s = df[df.condition == "SR-Count-Matched"].steps.values
cb_s = df[df.condition == "SR-Count-Bonus"].steps.values

# Two-sided test (not one-sided: we don't have a directional hypothesis)
stat_pc, p_pc = mannwhitneyu(prism_s, cm_s, alternative="two-sided")
r_pc = 1 - 2 * stat_pc / (len(prism_s) * len(cm_s))

stat_pm, p_pm = mannwhitneyu(prism_s, cb_s, alternative="two-sided")
r_pm = 1 - 2 * stat_pm / (len(prism_s) * len(cb_s))

stat_cm_cb, p_cm_cb = mannwhitneyu(cm_s, cb_s, alternative="two-sided")
r_cm_cb = 1 - 2 * stat_cm_cb / (len(cm_s) * len(cb_s))

print("=" * 60)
print("PRISM vs Count-Matched vs Count-Bonus: two-sided tests")
print("=" * 60)
print()
print("  PRISM vs Count-Matched:   p = %.4f, r = %+.3f" % (p_pc, r_pc))
print("  PRISM vs Count-Bonus:     p = %.4f, r = %+.3f" % (p_pm, r_pm))
print("  Count-Matched vs Count-B: p = %.4f, r = %+.3f" % (p_cm_cb, r_cm_cb))
print()

# AUC comparison
if "auc_discovery" in df.columns:
    prism_auc = df[df.condition == "PRISM"].auc_discovery.values
    cm_auc = df[df.condition == "SR-Count-Matched"].auc_discovery.values
    cb_auc = df[df.condition == "SR-Count-Bonus"].auc_discovery.values
    
    print("AUC discovery (mean +/- std):")
    print("  PRISM:         %.4f +/- %.4f" % (prism_auc.mean(), prism_auc.std()))
    print("  Count-Matched: %.4f +/- %.4f" % (cm_auc.mean(), cm_auc.std()))
    print("  Count-Bonus:   %.4f +/- %.4f" % (cb_auc.mean(), cb_auc.std()))
    
    stat_auc, p_auc = mannwhitneyu(prism_auc, cm_auc, alternative="two-sided")
    r_auc = 1 - 2 * stat_auc / (len(prism_auc) * len(cm_auc))
    print("\n  AUC: PRISM vs Count-Matched: p = %.4f, r = %+.3f" % (p_auc, r_auc))

### Interpretation of Tier 2 tests

#### The three comparisons

The two-sided tests (Mann-Whitney U) confirm that the three Tier 2 conditions are **statistically indistinguishable**:

- **PRISM vs Count-Matched**: p >> 0.05, small effect. The temporal control is validated — the decay profile suffices.
- **PRISM vs Count-Bonus**: p >> 0.05, small effect. Same conclusion with a different decay profile (1/sqrt).
- **Count-Matched vs Count-Bonus**: p >> 0.05, negligible effect. The two decay profiles (calibrated vs 1/sqrt) are interchangeable.
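
The effect size `r` reported next to each p-value is computed as `1 - 2U / (n1 * n2)`. The sketch below recomputes it in pure Python to make the sign convention explicit. It assumes that `U` is the statistic of the first sample (the number of pairs in which the first sample's value is larger, with ties counted as one half), which is what recent SciPy versions document for `mannwhitneyu`. The step counts are hypothetical.

```python
def mann_whitney_u1(x, y):
    """U of the first sample: pairs with x_i > y_j, ties counted as 1/2."""
    return sum(1.0 if a > b else (0.5 if a == b else 0.0) for a in x for b in y)


def effect_r(x, y):
    """r = 1 - 2*U1 / (n1*n2): positive when x tends to be smaller than y."""
    return 1 - 2 * mann_whitney_u1(x, y) / (len(x) * len(y))


# Hypothetical step counts (lower = faster exploration)
fast = [600, 650, 700, 720]
slow = [1500, 1800, 2001, 2001]

print("fast vs slow:", effect_r(fast, slow))
print("slow vs fast:", effect_r(slow, fast))
print("fast vs fast:", effect_r(fast, fast))
```

Under this convention, a positive `r` means the first condition needs fewer steps than the second, and a value near 0 means the two step distributions overlap almost completely.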

#### The AUC confirms

The discovery AUCs are also indistinguishable across the three conditions. The pace and regularity of discovery are the same — this is not an artifact of the "steps" metric.

#### Conclusion of section 9

SR-Count-Matched achieves its methodological objective: it isolates the temporal component of U(s) and shows that it **entirely explains** PRISM's exploration efficiency in FourRooms.

This result has two sides:

**Negative side**: PRISM has no exploration speed advantage over a simple lookup table. In this geometry, the SR's predictive structure is an unnecessary luxury for exploration.

**Positive side**: PRISM's U(s) signal is **at least as good** as an optimized visit counter — and it emerges naturally from SR learning errors, without needing a priori calibration. Moreover, U(s) possesses properties that counting does not: adaptation to changes (Exp C), calibration on actual error (Exp A), and spatial coherence.

The value of PRISM for exploration would likely manifest in **non-stationary** environments or with **asymmetric bottlenecks** — which FourRooms does not offer.

---

## 10. Guidance index — Does U(s) truly guide exploration?

### The most direct question

All previous analyses measure the **outcome** of exploration (how many goals, in how many steps). But none directly measures whether the exploration bonus **guides** the agent's behavior toward the right areas.

The **guidance index** tests this:

> Does the agent visit first the rooms whose exploration bonus is highest?

### Formula

$$\text{guidance\_index} = -\rho_{\text{Spearman}}\big(\text{room\_visit\_rank},\; \overline{\text{bonus}}_{\text{room}}\big)$$

For each run:
1. Identify the **step of first visit** in each of the 4 rooms
2. Rank the 4 rooms by order of visit
3. For each room, compute the **mean bonus** of the states in that room at the time of first visit
4. Compute the Spearman correlation (negated, because low rank = visited early = good)

**Interpretation**:
- **guidance > 0**: the agent visits high-bonus rooms first
- **guidance = 0**: no link between bonus and visit order
- **guidance < 0**: the agent avoids high-bonus rooms
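
The four steps above can be sketched for a single run as follows. This is an illustration only: the function names are hypothetical, the inputs (first-visit step and mean bonus per room) are assumed to come from trajectory logging that does not exist yet, and returning 0.0 when the correlation is undefined is a choice made for this sketch.

```python
def average_ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5 if vx > 0 and vy > 0 else 0.0


def guidance_index(first_visit_step, mean_bonus):
    """Negated Spearman correlation (Pearson on ranks) between visit order and bonus."""
    return -pearson(average_ranks(first_visit_step), average_ranks(mean_bonus))


# Hypothetical run: rooms are entered in order of decreasing mean bonus
first_visit_step = [0, 120, 310, 640]
mean_bonus = [0.80, 0.72, 0.55, 0.41]
print(guidance_index(first_visit_step, mean_bonus))
```

With only 4 rooms, the per-run index can take just a handful of distinct values, so it would presumably be summarized over many runs rather than read run by run.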

### Current limitation

The guidance index requires the **full trajectory** and **bonus values at each step** — data not collected in the current CSV. The implementation requires `log_trajectory=True` in the runner, which will be added in a future version.

In the meantime, we can compute a **proxy** based on the order of goal discovery (available in the CSV) and goal placement by room. This proxy is less precise but provides an indication.

In [None]:
# === 10. Proxy guidance: order of goal discovery across Tier 2 ===
# The discovery_i columns give the step at which each goal was found.
# Goals are sorted by spatial position (lexicographic). Since goals are placed
# one per room, the discovery order reflects room-visit order.
#
# We measure: for runs where all 4 goals are found, is there a consistent
# ordering pattern? And does it differ between conditions?

from scipy.stats import spearmanr

tier2_conds = ["PRISM", "SR-Count-Bonus", "SR-Count-Matched"]

fig, axes = plt.subplots(1, 3, figsize=(18, 5))

for idx, cond in enumerate(tier2_conds):
    ax = axes[idx]
    cond_df = df[(df.condition == cond) & (df.all_found == True)]
    
    if len(cond_df) < 5:
        ax.set_title(f"{cond} (n={len(cond_df)}, too few)")
        continue
    
    # For each run, compute the rank of discovery for each goal
    disc_cols = ["discovery_1", "discovery_2", "discovery_3", "discovery_4"]
    ranks = cond_df[disc_cols].rank(axis=1, method="average")
    
    # Bar chart of the mean rank per goal position (error bars: std across runs)
    mean_ranks = ranks.mean()
    std_ranks = ranks.std()
    
    ax.bar(range(4), mean_ranks.values, yerr=std_ranks.values, 
           color=colors.get(cond, "#DDDDDD"), alpha=0.8, capsize=5)
    ax.axhline(y=2.5, color="gray", linestyle="--", alpha=0.3, label="Uniform = 2.5")
    ax.set_xticks(range(4))
    ax.set_xticklabels(["Goal 1", "Goal 2", "Goal 3", "Goal 4"])
    ax.set_ylabel("Mean discovery rank")
    ax.set_title(f"{cond} (n={len(cond_df)} runs)")
    ax.set_ylim([0.5, 4.5])

fig.suptitle("Discovery order by goal position (completed runs)",
             fontsize=13, fontweight="bold")
fig.tight_layout()
plt.savefig("../results/exp_b/discovery_order.png", dpi=150, bbox_inches="tight")
plt.show()

# Variability of inter-goal intervals: proxy for the regularity of exploration
print("Discovery order variability (completed runs)")
print("=" * 60)
for cond in tier2_conds:
    cond_df = df[(df.condition == cond) & (df.all_found == True)]
    if len(cond_df) < 5:
        continue
    disc_times = cond_df[["discovery_1", "discovery_2", "discovery_3", "discovery_4"]].values
    # Mean inter-goal interval
    sorted_times = np.sort(disc_times, axis=1)
    intervals = np.diff(sorted_times, axis=1)
    mean_interval = intervals.mean()
    std_interval = intervals.std()
    cv = std_interval / mean_interval if mean_interval > 0 else 0
    print(f"  {cond:<20s}: mean interval = {mean_interval:.0f} steps, "
          f"CV = {cv:.2f}, n = {len(cond_df)} runs")

### Reading the discovery order analysis

#### What we measure

The **mean discovery rank** for each goal position. If a goal is consistently found first (rank = 1), it is in the room closest to the start or in the most accessible one.

If all 4 bars are close to 2.5 (the dashed line), the discovery order is **random** — the agent has no systematic preference for a room.

#### What we expect to see

- **If PRISM uses SR structure**: discovery ranks could be more **systematic** (bars further from 2.5, lower variance) — the agent would have a visit order guided by topology.
- **If the bonus is purely temporal**: the three conditions would have **identical** rank patterns — the order would be determined by starting position and geometry, not by the exploration signal.

#### Inter-goal interval

The **coefficient of variation (CV)** of the inter-goal interval measures discovery regularity:
- Low CV: goals are found at regular intervals (systematic exploration)
- High CV: intervals are highly variable (some goals found quickly, others after a long search)

If the three Tier 2 conditions have the same CV, this confirms that the exploration signal produces the same **behavioral pattern** — SR structure does not induce a qualitatively different exploration strategy.
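
A small worked example shows how the CV separates regular from irregular discovery. This sketch computes the CV for a single hypothetical run, whereas the notebook cell pools the intervals of all completed runs of a condition before taking the mean and standard deviation. The population standard deviation is used, matching NumPy's default.

```python
import statistics


def interval_cv(discovery_steps):
    """Coefficient of variation of the gaps between successive discoveries."""
    times = sorted(discovery_steps)
    intervals = [b - a for a, b in zip(times, times[1:])]
    mean = statistics.fmean(intervals)
    return statistics.pstdev(intervals) / mean if mean > 0 else 0.0


# Hypothetical runs
regular = [200, 400, 600, 800]    # one goal every 200 steps
irregular = [50, 100, 150, 1500]  # three quick goals, then a long search

print(f"regular:   CV = {interval_cv(regular):.2f}")
print(f"irregular: CV = {interval_cv(irregular):.2f}")
```

The CV is scale-free: a slow but steady agent and a fast but steady agent both score near 0, so it complements the step count rather than duplicating it.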

#### Toward the true guidance index

This analysis is a **proxy**. The true guidance index (correlation between room visit rank and mean room bonus at the time of visit) requires trajectory and bonus logging — an extension planned for future iterations.