# KSG — Reproducibility & Similarity to Healthy (provenance)

**Goal** From session-wise KSG binary adjacencies (per-edge permutation threshold, no across-edge FDR), compute:
- **Edge-presence** per group (fraction of sessions with an edge),
- **Robust sets** at 70% and 90% presence,
- **Similarity to Healthy** via the **Jaccard index** between robust sets,
- **Presence-difference maps** (Healthy − PD) with a **robust overlay** (opaque where either group is ≥ threshold),
- A CSV summarising robust-edge **counts** and **Jaccard** values for LaTeX.
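The quantities listed above can be illustrated on a toy example. This is a minimal sketch, not the notebook's actual code: the function names `edge_presence`, `robust_set` and `jaccard_index` are hypothetical, and the 3-node, 4-session arrays are made-up data, not results.

```python
import numpy as np

def edge_presence(stack):
    """Fraction of sessions in which each edge is present; stack is (S, N, N)."""
    P = stack.mean(axis=0)
    np.fill_diagonal(P, 0.0)  # self-connections are ignored
    return P

def robust_set(P, thr):
    """Boolean mask of off-diagonal edges with presence >= thr."""
    off_diag = ~np.eye(P.shape[0], dtype=bool)
    return (P >= thr) & off_diag

def jaccard_index(a, b):
    """|a AND b| / |a OR b| for two boolean masks; NaN if both are empty."""
    union = (a | b).sum()
    return float((a & b).sum()) / float(union) if union else float("nan")

# Hypothetical toy data: 4 sessions x 3 nodes per group (rows = source, cols = target).
healthy = np.array([
    [[0, 1, 1], [0, 0, 1], [0, 0, 0]],
    [[0, 1, 0], [0, 0, 1], [0, 0, 0]],
    [[0, 1, 1], [0, 0, 1], [1, 0, 0]],
    [[0, 1, 1], [0, 0, 0], [0, 0, 0]],
], dtype=np.uint8)
pd_group = np.array([
    [[0, 1, 0], [0, 0, 1], [1, 0, 0]],
    [[0, 1, 0], [0, 0, 1], [1, 0, 0]],
    [[0, 1, 0], [0, 0, 1], [1, 0, 0]],
    [[0, 0, 1], [0, 0, 0], [0, 0, 0]],
], dtype=np.uint8)

P_H, P_PD = edge_presence(healthy), edge_presence(pd_group)
R_H, R_PD = robust_set(P_H, 0.70), robust_set(P_PD, 0.70)
J = jaccard_index(R_H, R_PD)
D = P_H - P_PD  # presence difference (Healthy - PD)

print("robust edges:", int(R_H.sum()), int(R_PD.sum()))  # 3 and 3
print("Jaccard at 70%:", J)                              # 0.5
```

In this toy case each group has three robust edges at 70%, two of which are shared, so the union holds four edges and the Jaccard index is 0.5.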

**Inputs**
- `Results/ksg_results/sub-XXX_ses-YYY_combined_ksg.pkl` (one per session; IDTxl `ResultsNetworkInference`).
- `subject_session_metadata.csv` (columns: `subject`, `session`, `group`).
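The two inputs are joined on a `sub-XXX_ses-YYY` key. The sketch below shows one way to derive that key from a filename and look up the group. It assumes the `subject` and `session` columns already hold the `sub-...` and `ses-...` strings; the helper `session_key`, the regular expression, and the sample rows are hypothetical. It also collects unmatched files in a list so that they are not dropped silently.

```python
import csv
import io
import re

# Assumed filename pattern, based on the Inputs list above.
PATTERN = re.compile(r"^(sub-[^_]+)_(ses-[^_]+)_combined_ksg\.pkl$")

def session_key(filename):
    """Return 'sub-XXX_ses-YYY' for a result file, or None if the name does not match."""
    m = PATTERN.match(filename)
    return f"{m.group(1)}_{m.group(2)}" if m else None

# Hypothetical rows standing in for subject_session_metadata.csv.
META_TEXT = """subject,session,group
sub-001,ses-01,healthy
sub-002,ses-01,PD-off
sub-002,ses-02,PD-on
"""
key_to_group = {
    f"{row['subject']}_{row['session']}": row["group"]
    for row in csv.DictReader(io.StringIO(META_TEXT))
}

filenames = [
    "sub-001_ses-01_combined_ksg.pkl",
    "sub-002_ses-02_combined_ksg.pkl",
    "sub-003_ses-01_combined_ksg.pkl",  # not in the metadata
    "notes.txt",                        # not a result file
]

matched, unmatched = {}, []
for name in filenames:
    key = session_key(name)
    group = key_to_group.get(key) if key else None
    if group is None:
        unmatched.append(name)  # report these so exclusions are visible
    else:
        matched[key] = group

print(matched)
print(unmatched)
```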

**Key analysis choices** (matches chapter text)
- Binary adjacency via `get_adjacency_matrix('binary', fdr=False)`; diagonals zeroed.
- Presence thresholds reported: **70%** and **90%** (50% was inspected but not emphasised).
- Jaccard computed between **Healthy robust set** and each PD robust set.
- Presence-difference maps plot *(Healthy − PD)*; edges robust in **either** group (at the chosen threshold) are drawn **opaque**.
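
The choices above can be written compactly. The notation is introduced here for illustration only and may differ from the chapter's: $A^{(g,s)}$ is the binary adjacency of session $s$ in group $g$, $S_g$ the number of sessions in that group, and $\tau$ the presence threshold.

$$
P^{(g)}_{ij} = \frac{1}{S_g}\sum_{s=1}^{S_g} A^{(g,s)}_{ij},
\qquad
R^{(g)}_{\tau} = \bigl\{(i,j) : i \neq j,\; P^{(g)}_{ij} \ge \tau \bigr\}
$$

$$
J_{\tau}(\mathrm{H},\mathrm{PD}) =
\frac{\bigl|R^{(\mathrm{H})}_{\tau} \cap R^{(\mathrm{PD})}_{\tau}\bigr|}
     {\bigl|R^{(\mathrm{H})}_{\tau} \cup R^{(\mathrm{PD})}_{\tau}\bigr|},
\qquad
D_{ij} = P^{(\mathrm{H})}_{ij} - P^{(\mathrm{PD})}_{ij}
$$

An entry of $D$ is drawn opaque when $(i,j) \in R^{(\mathrm{H})}_{\tau} \cup R^{(\mathrm{PD})}_{\tau}$.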

**Outputs**
- PNGs:  
  `.../Step-wise/figs/groups_comp/Healthy_vs_PDoff_KSG_diffmap.png`  
  `.../Step-wise/figs/groups_comp/Healthy_vs_PDon_KSG_diffmap.png`  
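  *(Copy/symlink these to your LaTeX path `images/Results2/groups_comp/`.)*  
  *(Both thresholds write to the same filenames, so the files on disk show the overlay of the last threshold run, 90%.)*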
- CSV: `.../Step-wise/ksg_repro_counts_jaccard.csv` (source for the table of robust-edge counts and Jaccard values).

**Design caveat** (as stated in chapter)
- Presence matrices are **between-subject** summaries within each group. PD-off and PD-on are measured **within the same individuals**, which can inflate apparent ON–OFF similarity relative to similarity with Healthy. Similarity is therefore benchmarked against the **Healthy backbone**, and the stricter thresholds (70%, 90%) are emphasised.


In [1]:
# ============================
# STEP 2: Reproducibility & Similarity to Healthy (KSG)
# Consistent with Step 1: uses get_adjacency_matrix('binary', fdr=False)
# ============================
from pathlib import Path
import pickle
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

BASE      = Path("/lustre/majlepy2/myproject")
RESULTS   = BASE / "Results" / "ksg_results"
META_CSV  = BASE / "subject_session_metadata.csv"

# Figures are written under OUTROOT; copy/symlink to LaTeX path afterwards.
OUTROOT   = Path("/home/majlepy2/myproject/Step-wise")
FIGDIR    = OUTROOT / "figs" / "groups_comp"
FIGDIR.mkdir(parents=True, exist_ok=True)

# ---------- Load metadata ----------
# Build a (sub_ses -> group) map used to collect sessions per group.
meta = pd.read_csv(META_CSV)
meta["sub_ses"] = meta["subject"] + "_" + meta["session"]
subses_to_group = dict(zip(meta["sub_ses"], meta["group"]))

# ---------- Helper: consistent binary extraction (same as Step 1) ----------
# IMPORTANT: fdr=False (no across-edge FDR at extraction) and diagonal→0.
def get_binary(res, fdr=False):
    A = np.array(res.get_adjacency_matrix("binary", fdr=fdr)).astype(np.uint8)
    np.fill_diagonal(A, 0)
    return A

# ---------- Either reuse saved presence arrays or compute them now ----------
# Presence = fraction of sessions within a group where edge is present.
def maybe_load_presence(name):
    f = RESULTS / f"{name}_ksg_edge_presence.npy"
    return np.load(f) if f.exists() else None

def save_presence(name, P):
    np.save(RESULTS / f"{name}_ksg_edge_presence.npy", P)

group_names = ["healthy", "PD-off", "PD-on"]
presence = {g: maybe_load_presence(g) for g in group_names}

if any(P is None for P in presence.values()):
    print("[INFO] Computing group presence from session pickles (fdr=False)...")
    # Gather session files
    session_pkls = sorted(RESULTS.glob("sub-*_*_combined_ksg.pkl"))
    group_to_mats = {g: [] for g in group_names}
    N = None

    for pkl in session_pkls:
        # The filename stem is the metadata key, e.g. "sub-XXX_ses-YYY".
        sub_ses = pkl.name.replace("_combined_ksg.pkl", "")
        g = subses_to_group.get(sub_ses)
        # Skip sessions that are missing from the metadata or belong to other groups.
        if g not in group_to_mats:
            continue

        with open(pkl, "rb") as f:
            res = pickle.load(f)
        A = get_binary(res, fdr=False)
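        if N is None:
            N = A.shape[0]
        elif A.shape != (N, N):
            # Fail early with a clear message instead of a generic np.stack error.
            raise ValueError(f"{pkl.name}: adjacency shape {A.shape} != ({N}, {N})")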
        group_to_mats[g].append(A)

    # Stack and average to presence
    for g in group_names:
        mats = group_to_mats[g]
        if not mats:
            raise RuntimeError(f"No sessions found for group '{g}'.")
        M = np.stack(mats, axis=0)  # (S,N,N), uint8
        P = M.mean(axis=0)          # presence fraction in [0,1]
        np.fill_diagonal(P, 0.0)
        presence[g] = P
        save_presence(g, P)
        print(f"[OK] {g}: {len(mats)} sessions -> presence saved, shape={P.shape}")
else:
    for g in group_names:
        print(f"[OK] Loaded saved presence: {g} -> {presence[g].shape}")

P_H   = presence["healthy"]
P_OFF = presence["PD-off"]
P_ON  = presence["PD-on"]
N = P_H.shape[0]
diag = np.eye(N, dtype=bool)

# ---------- Robust masks, Jaccard, and figures ----------
# Robust set = edges with presence >= threshold, excluding diagonal.
def robust_mask(P, thr):
    return (P >= thr) & (~diag)

# Jaccard between two boolean masks (robust sets).
def jaccard(A, B):
    inter = (A & B).sum()
    union = (A | B).sum()
    return float(inter)/float(union) if union else np.nan

# Presence-difference map: D = Healthy - PD; robust edges (in either group)
# at threshold 'thr' are shown opaque; others are faded for context.
def diffmap_with_overlay(PA, PB, thr, title, outfile):
    D = PA - PB
    robust_any = (robust_mask(PA, thr) | robust_mask(PB, thr))
    fig, ax = plt.subplots(figsize=(7, 6), dpi=200)
    vmax = max(abs(D.min()), abs(D.max()))  # symmetric colour scale around 0
    im = ax.imshow(D, vmin=-vmax, vmax=+vmax, cmap="bwr", interpolation="nearest")
    # Per-pixel alpha: robust edges opaque, all others faded for context.
    # (Array-valued alpha needs a Matplotlib release that supports it.)
    im.set_alpha(np.where(robust_any, 1.0, 0.15))
    ax.set_title(f"{title}\nPresence difference (Healthy − PD); robust (≥{int(thr*100)}%) opaque")
    # Rows of the adjacency matrix are sources (y-axis); columns are targets (x-axis).
    ax.set_xlabel("Target")
    ax.set_ylabel("Source")
    cbar = fig.colorbar(im, ax=ax, fraction=0.046, pad=0.04)
    cbar.set_label("Presence difference")
    fig.tight_layout()
    fig.savefig(outfile, bbox_inches="tight")
    plt.close(fig)
    print(f"Saved: {outfile}")

# Compute counts + Jaccard at 70% and 90% (50% inspected but not emphasised).
rows = []
for thr in (0.70, 0.90):
    R_H   = robust_mask(P_H, thr)
    R_OFF = robust_mask(P_OFF, thr)
    R_ON  = robust_mask(P_ON, thr)

    cnt_H, cnt_OFF, cnt_ON = int(R_H.sum()), int(R_OFF.sum()), int(R_ON.sum())
    J_H_OFF, J_H_ON = jaccard(R_H, R_OFF), jaccard(R_H, R_ON)

    rows.append({
        "threshold": f"{int(thr*100)}%",
        "robust_edges_Healthy": cnt_H,
        "robust_edges_PD_off": cnt_OFF,
        "robust_edges_PD_on": cnt_ON,
        "Jaccard_H_vs_PDoff": None if np.isnan(J_H_OFF) else round(J_H_OFF, 3),
        "Jaccard_H_vs_PDon":  None if np.isnan(J_H_ON)  else round(J_H_ON,  3),
    })

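    # Two difference maps per threshold (H−PDoff and H−PDon) with robust overlay.
    # NOTE: the filenames do not include the threshold, so the 90% pass overwrites
    # the 70% figures; the PNGs left on disk carry the 90% overlay.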
    diffmap_with_overlay(P_H, P_OFF, thr,
                         title="Healthy vs PD-off (KSG)",
                         outfile=FIGDIR / "Healthy_vs_PDoff_KSG_diffmap.png")
    diffmap_with_overlay(P_H, P_ON, thr,
                         title="Healthy vs PD-on (KSG)",
                         outfile=FIGDIR / "Healthy_vs_PDon_KSG_diffmap.png")

# ---------- Save summary CSV for LaTeX ----------
# This CSV is the direct source for robust-edge counts and Jaccard values in the Results table.
df = pd.DataFrame(rows)
csv_out = OUTROOT / "ksg_repro_counts_jaccard.csv"
df.to_csv(csv_out, index=False)
print("Saved:", csv_out)
print(df)


[OK] Loaded saved presence: healthy -> (23, 23)
[OK] Loaded saved presence: PD-off -> (23, 23)
[OK] Loaded saved presence: PD-on -> (23, 23)
Saved: /home/majlepy2/myproject/Step-wise/figs/groups_comp/Healthy_vs_PDoff_KSG_diffmap.png
Saved: /home/majlepy2/myproject/Step-wise/figs/groups_comp/Healthy_vs_PDon_KSG_diffmap.png
Saved: /home/majlepy2/myproject/Step-wise/figs/groups_comp/Healthy_vs_PDoff_KSG_diffmap.png
Saved: /home/majlepy2/myproject/Step-wise/figs/groups_comp/Healthy_vs_PDon_KSG_diffmap.png
Saved: /home/majlepy2/myproject/Step-wise/ksg_repro_counts_jaccard.csv
  threshold  robust_edges_Healthy  robust_edges_PD_off  robust_edges_PD_on  \
0       70%                   331                  296                 331   
1       90%                   117                   84                 140   

   Jaccard_H_vs_PDoff  Jaccard_H_vs_PDon  
0               0.560              0.643  
1               0.129              0.201  
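
The table reports set sizes and Jaccard values but not the number of shared robust edges. Since J = I / (|A| + |B| − I), the intersection size I can be estimated as I = J (|A| + |B|) / (1 + J). The sketch below applies this to the values printed above. The helper `implied_intersection` is hypothetical, and the results are back-calculated from rounded Jaccard values rather than counted from the masks.

```python
def implied_intersection(n_a, n_b, jaccard):
    """Solve J = I / (n_a + n_b - I) for I, the number of shared edges."""
    return jaccard * (n_a + n_b) / (1.0 + jaccard)

# (threshold, comparison, |Healthy|, |PD|, Jaccard) copied from the printed table.
table = [
    ("70%", "PD-off", 331, 296, 0.560),
    ("70%", "PD-on",  331, 331, 0.643),
    ("90%", "PD-off", 117,  84, 0.129),
    ("90%", "PD-on",  117, 140, 0.201),
]

shared = []
for thr, label, n_h, n_pd, j in table:
    est = implied_intersection(n_h, n_pd, j)
    i = round(est)                 # nearest integer edge count
    check = i / (n_h + n_pd - i)   # Jaccard recomputed from the integer count
    shared.append(i)
    print(f"{thr} Healthy vs {label}: ~{i} shared edges (J recomputed = {check:.3f})")
```

With the rounded values above, this gives about 225 (PD-off) and 259 (PD-on) shared edges at 70%, and about 23 and 43 at 90%. Counting `(R_H & R_OFF).sum()` and `(R_H & R_ON).sum()` inside the threshold loop would give the exact figures.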
