# Cross-Representation Class Characterization Evaluation

This notebook demonstrates the **Cross-Representation Class Characterization** evaluation pipeline, which synthesizes 4 dependency experiments into a class-level representation suitability map across 5 clinical text datasets.

**What this evaluation computes (6 phases):**
1. **D-Gap Extraction**: reads pre-computed D-gap results (Schoener's D per class pair in each feature space: TF-IDF, Sentence Transformer, LLM Zero-Shot)
2. **Representation Suitability Map**: uniformly-high-D pairs and per-class CRND statistics
3. **Disagreement Topology**: boundary CRND graph and a Mantel test between overlap and disagreement
4. **Predictive Value**: Kendall tau (D-gap → F1-gap), LOO-CV accuracy with and without CRND, high-overlap precision
5. **Clinical Interpretability Profiles and Human Deferral**: per-class characterization for medical datasets, plus class pairs requiring human review
6. **Ecological vs. ML Overlap Comparison**: Fisher F1, Bhattacharyya coefficient, partial Spearman correlations

**Key metrics:** mean_d_gap=0.368, kendall_tau=0.259, spearman_D_vs_bhattacharyya=0.501, 88 class pairs analyzed.

In [1]:
import subprocess, sys
def _pip(*a): subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', *a])

# loguru — NOT on Colab, always install
_pip('loguru==0.7.3')

# numpy, pandas, scikit-learn, scipy, matplotlib — pre-installed on Colab, install locally only
if 'google.colab' not in sys.modules:
    _pip('numpy==2.0.2', 'pandas==2.2.2', 'scikit-learn==1.6.1', 'scipy==1.16.3', 'matplotlib==3.10.0')


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m25.0.1[0m[39;49m -> [0m[32;49m26.0.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython3.12 -m pip install --upgrade pip[0m



[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m25.0.1[0m[39;49m -> [0m[32;49m26.0.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython3.12 -m pip install --upgrade pip[0m


In [2]:
import json
import sys
import time
import warnings
from itertools import combinations
from collections import defaultdict

import numpy as np
import pandas as pd
from scipy import stats
from scipy.spatial.distance import squareform, pdist
from sklearn.decomposition import PCA
from sklearn.neighbors import KDTree
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt
import matplotlib

from loguru import logger

warnings.filterwarnings("ignore", category=RuntimeWarning)

logger.remove()
logger.add(sys.stdout, level="INFO", format="{time:HH:mm:ss}|{level:<7}|{message}")

1

## Data Loading

Load pre-computed evaluation data from GitHub (with local fallback). The data contains D-gap results, CRND statistics, Mantel test results, and per-example evaluations across 5 clinical text datasets.
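The later cells read a small set of fields from the loaded payload: top-level `datasets` and `metadata`, a `dataset` name per entry, and an `output` label per example. A minimal sketch of a shape check is shown here; `check_payload` is a hypothetical helper, not part of the original pipeline, and it only covers the fields visibly used in this notebook.

```python
# Hypothetical helper (an assumption, not part of the original notebook):
# reports missing fields that the later cells rely on.
def check_payload(payload):
    """Return a list of problems; an empty list means the shape looks usable."""
    problems = []
    for key in ("datasets", "metadata"):
        if key not in payload:
            problems.append(f"missing top-level key: {key}")
    for entry in payload.get("datasets", []):
        if "dataset" not in entry:
            problems.append("dataset entry without a 'dataset' name")
        for ex in entry.get("examples", []):
            if "output" not in ex:
                problems.append(f"example without 'output' in {entry.get('dataset')}")
    return problems

# Toy payload with the minimal expected shape.
toy = {"metadata": {}, "datasets": [{"dataset": "toy", "examples": [{"output": "a"}]}]}
print(check_payload(toy))
```

Running such a check right after loading would surface a truncated or outdated file before any phase starts.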

In [3]:
GITHUB_DATA_URL = "https://raw.githubusercontent.com/AMGrobelnik/ai-invention-e0166a-cross-representation-neighborhood-disson/main/evaluation_iter4_cross_represent/demo/mini_demo_data.json"
import json
import os
import urllib.request

def load_data():
    """Load the demo data from GitHub, falling back to a local copy."""
    try:
        with urllib.request.urlopen(GITHUB_DATA_URL) as response:
            return json.loads(response.read().decode())
    except Exception:
        pass  # remote fetch failed; fall back to the local file below
    if os.path.exists("mini_demo_data.json"):
        with open("mini_demo_data.json") as f:
            return json.load(f)
    raise FileNotFoundError("Could not load mini_demo_data.json")

In [4]:
data = load_data()
logger.info(f"Loaded data with {len(data['datasets'])} datasets, {sum(len(d['examples']) for d in data['datasets'])} total examples")
logger.info(f"Metadata keys: {list(data['metadata'].keys())}")

10:28:38|INFO   |Loaded data with 5 datasets, 15 total examples


10:28:38|INFO   |Metadata keys: ['evaluation_name', 'description', 'd_gap_results_per_dataset', 'uniformly_high_d_pairs', 'crnd_per_class_stats', 'mantel_test_results', 'kendall_tau_d_gap_vs_f1_gap', 'crnd_prediction_improvement', 'high_overlap_identification', 'clinical_profiles', 'human_deferral_pairs', 'ecological_vs_ml_comparison']


## Configuration

Tunable parameters for the evaluation. The Mantel test permutations and bootstrap iterations control statistical robustness vs. runtime.
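To make the robustness vs. runtime trade-off concrete, the sketch below shows two quantities that bound what extra permutations can buy. It assumes the `(count + 1) / (n + 1)` p-value estimator used later in this notebook; the helper names are illustrative only.

```python
import math

def min_permutation_p(n_permutations):
    """Smallest p-value reachable with the (count + 1) / (n + 1) estimator."""
    return 1.0 / (n_permutations + 1)

def distinct_relabelings(n_classes):
    """Number of distinct class orderings a label permutation can produce."""
    return math.factorial(n_classes)

# Class counts of the five datasets as reported in Phase 1.
for n_classes in (4, 5, 6, 7, 9):
    print(n_classes, distinct_relabelings(n_classes), min_permutation_p(9999))
```

For a dataset with only 4 classes there are just 24 distinct orderings, so drawing thousands of random permutations resamples the same few arrangements; more permutations help mainly for datasets with many classes.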

In [5]:
# ── Configuration ────────────────────────────────────────────────────────
# Mantel test permutations (original: 9999)
N_PERMUTATIONS = 9999

# Bootstrap iterations for Kendall tau CI (original: 1000)
N_BOOTSTRAP = 1000

# KDE grid points for Bhattacharyya coefficient (original: 200)
KDE_GRID_POINTS = 200

# Thresholds for uniformly-high-D and high-overlap detection
UNIFORMLY_HIGH_D_THRESHOLDS = [0.4, 0.5, 0.6, 0.7]
HIGH_OVERLAP_THRESHOLDS = [0.4, 0.5, 0.6, 0.7]

# Boundary proximity threshold
BOUNDARY_PROXIMITY_THRESHOLD = 0.3

# Human deferral thresholds
DEFERRAL_D_THRESHOLD = 0.5
DEFERRAL_CRND_THRESHOLD = 0.85

# Feature space names
CANONICAL_SPACES = ["tfidf", "sentence_transformer", "llm_zeroshot"]
DATASETS = [
    "medical_abstracts",
    "mimic_iv_ed_demo",
    "clinical_patient_triage_nl",
    "ohsumed_single",
    "mental_health_conditions",
]

logger.info(f"Config: N_PERMUTATIONS={N_PERMUTATIONS}, N_BOOTSTRAP={N_BOOTSTRAP}, KDE_GRID_POINTS={KDE_GRID_POINTS}")

10:28:38|INFO   |Config: N_PERMUTATIONS=9999, N_BOOTSTRAP=1000, KDE_GRID_POINTS=200


## Phase 1: Extract D-Gap Results from Pre-Computed Data

The D-gap measures how much Schoener's D (niche overlap) varies across feature spaces for each class pair. A large D-gap means representation choice matters significantly.
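As a rough illustration, Schoener's D between two discretized densities p and q is 1 − ½ Σ|pᵢ − qᵢ|. The sketch below computes it and then a D-gap, assuming the D-gap is the range (max D minus min D) across feature spaces. That definition is an assumption here: this notebook only reads the pre-computed `d_gap`, `min_d`, and `max_d` fields and does not show how they were derived.

```python
import numpy as np

def schoeners_d(p, q):
    """Schoener's D for two non-negative density vectors on the same grid."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()  # normalize so each vector sums to 1
    q = q / q.sum()
    return float(1.0 - 0.5 * np.abs(p - q).sum())

def d_gap(d_values):
    """Assumed definition: spread of D across feature spaces (max minus min)."""
    vals = list(d_values.values())
    return max(vals) - min(vals)

# Toy D values for one class pair in the three feature spaces (made-up numbers).
toy_d = {"tfidf": 0.2, "sentence_transformer": 0.5, "llm_zeroshot": 0.9}
print(schoeners_d([1, 0], [0, 1]), schoeners_d([1, 1], [1, 1]), d_gap(toy_d))
```

D is 0 for disjoint densities and 1 for identical ones, so a pair that overlaps heavily in one space but separates cleanly in another produces a large gap.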

In [6]:
t0 = time.time()
logger.info("Phase 1: Extracting D-gap results from pre-computed data...")

meta = data["metadata"]
d_gap_per_dataset = meta["d_gap_results_per_dataset"]

# Build structured d_gap_results matching original script format
d_gap_results = {}
for ds_name, ds_data in d_gap_per_dataset.items():
    pair_results = {}
    for pair_name, pair_data in ds_data["pair_details"].items():
        d_values = pair_data["d_values"]
        vals = list(d_values.values())
        pair_results[pair_name] = {
            "d_gap": pair_data["d_gap"],
            "d_values": d_values,
            "best_space": pair_data["best_space"],
            "min_d": pair_data["min_d"],
            "max_d": pair_data["max_d"],
            "mean_d": np.mean(vals),
        }
    d_gap_results[ds_name] = {
        "pairs": pair_results,
        "mean_d_gap": ds_data["mean_d_gap"],
        "max_d_gap": ds_data["max_d_gap"],
    }

for ds, res in d_gap_results.items():
    logger.info(f"  {ds}: mean_d_gap={res['mean_d_gap']:.4f}, max_d_gap={res['max_d_gap']:.4f}, n_pairs={len(res['pairs'])}")

# Extract class names from CRND per-class stats
crnd_per_class = meta["crnd_per_class_stats"]
class_names_map = {}
for ds_name in DATASETS:
    if ds_name in crnd_per_class:
        class_names_map[ds_name] = sorted(crnd_per_class[ds_name].keys())
        logger.info(f"  {ds_name}: {len(class_names_map[ds_name])} classes")

t1 = time.time()
logger.info(f"Phase 1 complete in {t1-t0:.1f}s")

10:28:38|INFO   |Phase 1: Extracting D-gap results from pre-computed data...


10:28:38|INFO   |  medical_abstracts: mean_d_gap=0.2989, max_d_gap=0.5513, n_pairs=10


10:28:38|INFO   |  mimic_iv_ed_demo: mean_d_gap=0.2831, max_d_gap=0.4044, n_pairs=6


10:28:38|INFO   |  clinical_patient_triage_nl: mean_d_gap=0.4964, max_d_gap=0.9381, n_pairs=15


10:28:38|INFO   |  ohsumed_single: mean_d_gap=0.4592, max_d_gap=0.6918, n_pairs=36


10:28:38|INFO   |  mental_health_conditions: mean_d_gap=0.3021, max_d_gap=0.7388, n_pairs=21


10:28:38|INFO   |  medical_abstracts: 5 classes


10:28:38|INFO   |  mimic_iv_ed_demo: 4 classes


10:28:38|INFO   |  clinical_patient_triage_nl: 6 classes


10:28:38|INFO   |  ohsumed_single: 9 classes


10:28:38|INFO   |  mental_health_conditions: 7 classes


10:28:38|INFO   |Phase 1 complete in 0.0s


## Phase 2: Representation Suitability Map

Compute uniformly-high-D pairs (where min(D) > threshold across all spaces) and extract CRND per-class statistics.

In [7]:
logger.info("Phase 2: Computing Representation Suitability Map...")

# Compute uniformly-high-D pairs
def compute_uniformly_high_d(d_gap_results, thresholds):
    """Identify class pairs where min(D) > threshold across all spaces."""
    results = {}
    for ds_name, ds_data in d_gap_results.items():
        ds_results = {}
        for threshold in thresholds:
            high_pairs = []
            for pair_name, pair_data in ds_data["pairs"].items():
                if pair_data["min_d"] > threshold:
                    high_pairs.append(pair_name)
            total = len(ds_data["pairs"])
            ds_results[f"threshold_{threshold}"] = {
                "count": len(high_pairs),
                "fraction": len(high_pairs) / total if total > 0 else 0.0,
                "pairs": high_pairs,
            }
        results[ds_name] = ds_results
    return results

uniformly_high_d = compute_uniformly_high_d(d_gap_results, UNIFORMLY_HIGH_D_THRESHOLDS)
for ds in DATASETS:
    if ds in uniformly_high_d and "threshold_0.5" in uniformly_high_d[ds]:
        total_pairs = len(d_gap_results.get(ds, {}).get("pairs", {}))
        logger.info(f"  {ds} uniformly-high-D (>0.5): {uniformly_high_d[ds]['threshold_0.5']['count']}/{total_pairs}")

# CRND stats (already in metadata)
crnd_stats = meta["crnd_per_class_stats"]
logger.info(f"  CRND stats for {len(crnd_stats)} datasets")

t2 = time.time()
logger.info(f"Phase 2 complete in {t2-t1:.1f}s")

10:28:38|INFO   |Phase 2: Computing Representation Suitability Map...


10:28:38|INFO   |  medical_abstracts uniformly-high-D (>0.5): 1/10


10:28:38|INFO   |  mimic_iv_ed_demo uniformly-high-D (>0.5): 1/6


10:28:38|INFO   |  clinical_patient_triage_nl uniformly-high-D (>0.5): 1/15


10:28:38|INFO   |  ohsumed_single uniformly-high-D (>0.5): 0/36


10:28:38|INFO   |  mental_health_conditions uniformly-high-D (>0.5): 0/21


10:28:38|INFO   |  CRND stats for 5 datasets


10:28:38|INFO   |Phase 2 complete in 0.0s


## Phase 3: Disagreement Topology (Mantel Test)

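Build boundary CRND and D-similarity graphs from per-instance data, then run a Mantel test between them. The test asks whether the disagreement topology (boundary CRND) correlates with the niche-overlap topology (Schoener's D, converted to a distance as 1 − D). The demo data holds only 3 examples per dataset, so the r and p values computed here should be read as illustrative; the full-data values are stored under `mantel_test_results` in the metadata.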
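A Mantel permutation relabels the classes, which means rows and columns of one matrix are permuted together. The toy sketch below (made-up distances) shows that this joint permutation keeps the matrix a valid distance matrix, which independently shuffling the off-diagonal entries would not guarantee.

```python
import numpy as np

rng = np.random.RandomState(42)

# Toy symmetric distance matrix for 4 classes (illustrative values only).
dist = np.array([
    [0.0, 0.1, 0.4, 0.7],
    [0.1, 0.0, 0.5, 0.8],
    [0.4, 0.5, 0.0, 0.9],
    [0.7, 0.8, 0.9, 0.0],
])

perm = rng.permutation(dist.shape[0])
permuted = dist[np.ix_(perm, perm)]  # same permutation on rows and columns

upper = np.triu_indices(dist.shape[0], k=1)
is_symmetric = bool(np.allclose(permuted, permuted.T))
zero_diagonal = bool(np.allclose(np.diag(permuted), 0.0))
same_values = bool(np.allclose(np.sort(permuted[upper]), np.sort(dist[upper])))
print(perm, is_symmetric, zero_diagonal, same_values)
```

The permuted matrix holds the same pairwise distances, only attached to different class labels, which is the null hypothesis the test compares against.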

In [8]:
logger.info("Phase 3: Computing Disagreement Topology...")

# Build per-instance DataFrames from example data
per_instance = {}
for ds_entry in data["datasets"]:
    ds_name = ds_entry["dataset"]
    rows = []
    for ex in ds_entry["examples"]:
        rows.append({
            "class": ex["output"],
            "crnd_k10": ex.get("eval_instance_crnd_k10", np.nan),
            "boundary_proximity": ex.get("metadata_boundary_proximity", np.nan),
        })
    per_instance[ds_name] = pd.DataFrame(rows)
logger.info(f"  Per-instance data: {', '.join(f'{k}={len(v)}' for k, v in per_instance.items())}")

# Build boundary CRND graph
def build_boundary_crnd_graph(per_instance, class_names_map):
    """Build boundary CRND graph: nodes=classes, edge weight = mean CRND at boundary."""
    results = {}
    for ds_name, df in per_instance.items():
        if ds_name not in class_names_map:
            continue
        classes = class_names_map[ds_name]
        n = len(classes)
        boundary_matrix = np.zeros((n, n))
        for i, ci in enumerate(classes):
            for j, cj in enumerate(classes):
                if i == j:
                    continue
                mask_i = df["class"] == ci
                if mask_i.sum() == 0:
                    boundary_matrix[i, j] = np.nan
                    continue
                instances_i = df[mask_i]
                boundary_instances = instances_i[instances_i["boundary_proximity"] > BOUNDARY_PROXIMITY_THRESHOLD]
                if len(boundary_instances) > 0:
                    boundary_matrix[i, j] = boundary_instances["crnd_k10"].mean()
                else:
                    boundary_matrix[i, j] = instances_i["crnd_k10"].mean()
        sym_matrix = (boundary_matrix + boundary_matrix.T) / 2
        np.fill_diagonal(sym_matrix, 0.0)
        results[ds_name] = {"matrix": sym_matrix, "classes": classes}
    return results

# Build D similarity graph from d_gap_results
def build_d_similarity_graph(d_gap_results, class_names_map):
    """Build Schoener's D similarity graph: mean D across feature spaces."""
    results = {}
    for ds_name in d_gap_results:
        if ds_name not in class_names_map:
            continue
        classes = class_names_map[ds_name]
        n = len(classes)
        mean_d_matrix = np.zeros((n, n))
        for pair_name, pair_data in d_gap_results[ds_name]["pairs"].items():
            parts = pair_name.split("__vs__")
            if len(parts) != 2:
                continue
            try:
                i = classes.index(parts[0])
                j = classes.index(parts[1])
            except ValueError:
                continue
            mean_d_matrix[i, j] = pair_data["mean_d"]
            mean_d_matrix[j, i] = pair_data["mean_d"]
        np.fill_diagonal(mean_d_matrix, 0.0)
        results[ds_name] = {"matrix": mean_d_matrix, "classes": classes}
    return results

# Mantel test
def mantel_test(dist_matrix_1, dist_matrix_2, n_permutations):
    """Compute Mantel test between two distance matrices."""
    n = dist_matrix_1.shape[0]
    if n < 3:
        return {"mantel_r": np.nan, "p_value": np.nan, "n_permutations": 0}
    idx = np.triu_indices(n, k=1)
    vec1 = dist_matrix_1[idx]
    vec2 = dist_matrix_2[idx]
    valid = ~(np.isnan(vec1) | np.isnan(vec2))
    vec1 = vec1[valid]
    vec2 = vec2[valid]
    if len(vec1) < 3:
        return {"mantel_r": np.nan, "p_value": np.nan, "n_permutations": 0}
    r_obs, _ = stats.pearsonr(vec1, vec2)
    count_ge = 0
    rng = np.random.RandomState(42)
    for _ in range(n_permutations):
        perm = rng.permutation(n)
        perm_matrix = dist_matrix_1[np.ix_(perm, perm)]
        perm_vec = perm_matrix[idx]
        perm_vec = perm_vec[valid]
        r_perm, _ = stats.pearsonr(perm_vec, vec2)
        if r_perm >= r_obs:
            count_ge += 1
    p_value = (count_ge + 1) / (n_permutations + 1)
    return {"mantel_r": float(r_obs), "p_value": float(p_value), "n_permutations": n_permutations}

boundary_graph = build_boundary_crnd_graph(per_instance, class_names_map)
d_graph = build_d_similarity_graph(d_gap_results, class_names_map)

mantel_results = {}
for ds_name in DATASETS:
    if ds_name not in boundary_graph or ds_name not in d_graph:
        continue
    crnd_mat = boundary_graph[ds_name]["matrix"]
    d_mat = d_graph[ds_name]["matrix"]
    d_dist = 1.0 - d_mat
    np.fill_diagonal(d_dist, 0.0)
    mantel_result = mantel_test(crnd_mat, d_dist, n_permutations=N_PERMUTATIONS)
    mantel_results[ds_name] = mantel_result
    logger.info(f"  Mantel test {ds_name}: r={mantel_result['mantel_r']:.4f}, p={mantel_result['p_value']:.4f}")

t3 = time.time()
logger.info(f"Phase 3 complete in {t3-t2:.1f}s")

10:28:38|INFO   |Phase 3: Computing Disagreement Topology...


10:28:38|INFO   |  Per-instance data: medical_abstracts=3, mimic_iv_ed_demo=3, clinical_patient_triage_nl=3, ohsumed_single=3, mental_health_conditions=3


10:28:41|INFO   |  Mantel test medical_abstracts: r=0.1398, p=0.0479


10:28:41|INFO   |  Mantel test mimic_iv_ed_demo: r=nan, p=nan


10:28:44|INFO   |  Mantel test clinical_patient_triage_nl: r=0.0206, p=0.0256


10:28:44|INFO   |  Mantel test ohsumed_single: r=nan, p=nan


10:28:48|INFO   |  Mantel test mental_health_conditions: r=0.9323, p=0.0098


10:28:48|INFO   |Phase 3 complete in 9.3s


## Phase 4: Predictive Value of the Suitability Map

Compute Kendall's tau between D-gap ranking and classifier F1-gap ranking (pooled across datasets), LOO-CV improvement from adding CRND, and high-overlap pair identification precision.
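The cell below only loads these results, so here is a minimal sketch of how a Kendall tau with a bootstrap confidence interval could be computed. The percentile bootstrap and the toy numbers are assumptions; the original script's exact resampling scheme is not shown in this notebook.

```python
import numpy as np
from scipy import stats

def kendall_tau_bootstrap(x, y, n_bootstrap=1000, seed=42):
    """Kendall tau plus a percentile-bootstrap 95% CI (assumed scheme)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    tau, p_value = stats.kendalltau(x, y)
    rng = np.random.default_rng(seed)
    taus = []
    for _ in range(n_bootstrap):
        idx = rng.integers(0, len(x), size=len(x))  # resample pairs with replacement
        t, _ = stats.kendalltau(x[idx], y[idx])
        if not np.isnan(t):  # constant resamples yield NaN; skip them
            taus.append(t)
    if taus:
        lo, hi = np.percentile(taus, [2.5, 97.5])
    else:
        lo, hi = np.nan, np.nan
    return {"kendall_tau": float(tau), "p_value": float(p_value),
            "ci_95": (float(lo), float(hi))}

# Toy D-gap and F1-gap values for 8 class pairs (made-up numbers).
toy_d_gap = [0.10, 0.20, 0.25, 0.30, 0.45, 0.50, 0.60, 0.90]
toy_f1_gap = [0.02, 0.05, 0.03, 0.08, 0.07, 0.12, 0.10, 0.20]
result = kendall_tau_bootstrap(toy_d_gap, toy_f1_gap, n_bootstrap=200)
print(result)
```

With very few pairs per dataset (for example n=3), such an interval is extremely wide, which is one reason to read the per-dataset taus with caution.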

In [9]:
logger.info("Phase 4: Loading pre-computed Predictive Value results...")

# These were computed on the full dataset — we load them from metadata
kendall_results = meta["kendall_tau_d_gap_vs_f1_gap"]
logger.info(f"  Pooled Kendall tau: {kendall_results.get('pooled', {}).get('kendall_tau')}")
for ds_name, ds_data in kendall_results.items():
    if ds_name != "pooled":
        logger.info(f"    {ds_name}: tau={ds_data.get('kendall_tau')}, p={ds_data.get('p_value')}, n={ds_data.get('n_pairs')}")

crnd_improvement = meta["crnd_prediction_improvement"]
logger.info(f"  CRND improvement delta: {crnd_improvement.get('delta_accuracy')}")
logger.info(f"  Model A accuracy (D-gap only): {crnd_improvement.get('model_a_accuracy'):.4f}")
logger.info(f"  Model B accuracy (D-gap + CRND): {crnd_improvement.get('model_b_accuracy'):.4f}")

high_overlap = meta["high_overlap_identification"]
logger.info(f"  High-overlap best precision: {high_overlap.get('best_precision')} at threshold={high_overlap.get('best_threshold')}")
logger.info(f"  True hard pairs: {high_overlap.get('n_true_hard')}/{high_overlap.get('n_total_pairs')}")

t4 = time.time()
logger.info(f"Phase 4 complete in {t4-t3:.1f}s")

10:28:48|INFO   |Phase 4: Loading pre-computed Predictive Value results...


10:28:48|INFO   |  Pooled Kendall tau: 0.2591817363768771


10:28:48|INFO   |    medical_abstracts: tau=-0.06666666666666667, p=0.8618005952380953, n=10


10:28:48|INFO   |    mimic_iv_ed_demo: tau=-0.33333333333333337, p=1.0, n=3


10:28:48|INFO   |    clinical_patient_triage_nl: tau=1.0, p=0.3333333333333333, n=3


10:28:48|INFO   |    ohsumed_single: tau=-0.08737095686938733, p=0.45372541122467214, n=36


10:28:48|INFO   |    mental_health_conditions: tau=-0.08571428571428572, p=0.6119326072720337, n=21


10:28:48|INFO   |  CRND improvement delta: 0.0


10:28:48|INFO   |  Model A accuracy (D-gap only): 0.6849


10:28:48|INFO   |  Model B accuracy (D-gap + CRND): 0.6849


10:28:48|INFO   |  High-overlap best precision: 0.2 at threshold=0.4


10:28:48|INFO   |  True hard pairs: 4/73


10:28:48|INFO   |Phase 4 complete in 0.0s


## Phase 5: Clinical Interpretability Profiles & Human Deferral

Generate clinical profiles and identify class pairs where automated methods are insufficient (high D + high boundary CRND).

In [10]:
logger.info("Phase 5: Clinical Interpretability Profiles...")

# Clinical profiles from metadata
clinical_profiles = meta["clinical_profiles"]
for ds, profiles in clinical_profiles.items():
    logger.info(f"  {ds}: {len(profiles)} class profiles generated")
    for cls, profile in list(profiles.items())[:2]:  # Show first 2
        logger.info(f"    {profile[:120]}...")

# Human deferral pairs
def identify_human_deferral_pairs(d_gap_results, boundary_graph, d_threshold, crnd_threshold):
    """Identify pairs where both uniformly high D AND high boundary CRND."""
    results = {}
    total_deferral = 0
    for ds_name in DATASETS:
        if ds_name not in d_gap_results or ds_name not in boundary_graph:
            continue
        classes = boundary_graph[ds_name]["classes"]
        crnd_mat = boundary_graph[ds_name]["matrix"]
        deferral_pairs = []
        for pair_name, pair_data in d_gap_results[ds_name]["pairs"].items():
            if pair_data["min_d"] <= d_threshold:
                continue
            parts = pair_name.split("__vs__")
            if len(parts) != 2:
                continue
            try:
                i = classes.index(parts[0])
                j = classes.index(parts[1])
            except ValueError:
                continue
            boundary_crnd = crnd_mat[i, j]
            if not np.isnan(boundary_crnd) and boundary_crnd > crnd_threshold:
                deferral_pairs.append({
                    "pair": pair_name,
                    "min_d": pair_data["min_d"],
                    "boundary_crnd": float(boundary_crnd),
                })
        results[ds_name] = {"count": len(deferral_pairs), "pairs": deferral_pairs}
        total_deferral += len(deferral_pairs)
    return results, total_deferral

deferral_pairs, total_deferral = identify_human_deferral_pairs(
    d_gap_results, boundary_graph, DEFERRAL_D_THRESHOLD, DEFERRAL_CRND_THRESHOLD
)
logger.info(f"  Total human deferral pairs: {total_deferral}")
for ds_name, dp in deferral_pairs.items():
    if dp["count"] > 0:
        for p in dp["pairs"]:
            logger.info(f"    {ds_name}: {p['pair']} (min_d={p['min_d']:.3f}, boundary_crnd={p['boundary_crnd']:.3f})")

t5 = time.time()
logger.info(f"Phase 5 complete in {t5-t4:.1f}s")

10:28:48|INFO   |Phase 5: Clinical Interpretability Profiles...


10:28:48|INFO   |  medical_abstracts: 5 class profiles generated


10:28:48|INFO   |    Class Cardiovascular_diseases: best separated from [Digestive_system_diseases in llm_zeroshot (D=0.21), Neoplasms in sen...


10:28:48|INFO   |    Class Digestive_system_diseases: best separated from [Cardiovascular_diseases in llm_zeroshot (D=0.21), Nervous_system_d...


10:28:48|INFO   |  mental_health_conditions: 7 class profiles generated


10:28:48|INFO   |    Class anxiety: best separated from [bipolar in llm_zeroshot (D=0.08), depression in tfidf (D=0.20), normal in sentence_t...


10:28:48|INFO   |    Class bipolar: best separated from [anxiety in llm_zeroshot (D=0.08), depression in llm_zeroshot (D=0.14), normal in sen...


10:28:48|INFO   |  Total human deferral pairs: 1


10:28:48|INFO   |    medical_abstracts: Digestive_system_diseases__vs__General_pathological_conditions (min_d=0.532, boundary_crnd=0.931)


10:28:48|INFO   |Phase 5 complete in 0.0s


## Phase 6: Ecological vs. ML Overlap Comparison

Load pre-computed Spearman correlations between Schoener's D (ecological overlap) and ML-based measures (Fisher discriminant ratio, Bhattacharyya coefficient).
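For readers unfamiliar with partial rank correlation, the sketch below shows one common way to compute it: rank-transform the variables and apply the first-order partial correlation formula. Whether the original pipeline used this formula or a regression-residual variant is not stated here, so treat it as an assumption; `x`, `y`, and `z` are generic placeholders.

```python
import numpy as np
from scipy import stats

def partial_spearman(x, y, z):
    """Spearman correlation of x and y after controlling for z (first-order formula)."""
    rx, ry, rz = (stats.rankdata(v) for v in (x, y, z))
    r_xy = np.corrcoef(rx, ry)[0, 1]
    r_xz = np.corrcoef(rx, rz)[0, 1]
    r_yz = np.corrcoef(ry, rz)[0, 1]
    denom = np.sqrt((1.0 - r_xz ** 2) * (1.0 - r_yz ** 2))
    return float((r_xy - r_xz * r_yz) / denom)

# Toy check: z is rank-uncorrelated with x and y, so controlling for it changes nothing.
rho = partial_spearman([1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 2, 1])
print(rho)
```

A partial correlation that stays high after controlling for the other measure suggests the two overlap measures carry partly independent information.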

In [11]:
logger.info("Phase 6: Ecological vs. ML Overlap Comparison...")

eco_vs_ml = meta["ecological_vs_ml_comparison"]
logger.info(f"  Spearman D vs inv-Fisher: rho={eco_vs_ml['spearman_d_vs_inv_fisher']['rho']:.4f}, p={eco_vs_ml['spearman_d_vs_inv_fisher']['p_value']:.6f}")
logger.info(f"  Spearman D vs Bhattacharyya: rho={eco_vs_ml['spearman_d_vs_bhattacharyya']['rho']:.4f}, p={eco_vs_ml['spearman_d_vs_bhattacharyya']['p_value']:.2e}")
logger.info(f"  Partial Spearman (D|Fisher): rho={eco_vs_ml['partial_spearman_d_controlling_fisher']['rho']:.4f}, p={eco_vs_ml['partial_spearman_d_controlling_fisher']['p_value']:.2e}")
logger.info(f"  Partial Spearman (Fisher|D): rho={eco_vs_ml['partial_spearman_fisher_controlling_d']['rho']:.4f}, p={eco_vs_ml['partial_spearman_fisher_controlling_d']['p_value']:.2e}")
logger.info(f"  Total pairs pooled: {eco_vs_ml['n_pairs_pooled']}")

t6 = time.time()
logger.info(f"Phase 6 complete in {t6-t5:.1f}s")

10:28:48|INFO   |Phase 6: Ecological vs. ML Overlap Comparison...


10:28:48|INFO   |  Spearman D vs inv-Fisher: rho=0.3627, p=0.001384


10:28:48|INFO   |  Spearman D vs Bhattacharyya: rho=0.5011, p=6.59e-07


10:28:48|INFO   |  Partial Spearman (D|Fisher): rho=0.5889, p=2.73e-08


10:28:48|INFO   |  Partial Spearman (Fisher|D): rho=0.6423, p=5.27e-10


10:28:48|INFO   |  Total pairs pooled: 88


10:28:48|INFO   |Phase 6 complete in 0.0s


## Aggregate Metrics Summary

Compute and display the 12 aggregate evaluation metrics across all datasets and phases.

In [12]:
def safe_float(val):
    """Convert to float, replacing None/NaN with 0.0."""
    if val is None:
        return 0.0
    if isinstance(val, float) and np.isnan(val):
        return 0.0
    return float(val)

# Compute aggregate metrics
mean_d_gaps = [d_gap_results[ds]["mean_d_gap"] for ds in DATASETS if ds in d_gap_results]
mean_d_gap_across = np.mean(mean_d_gaps) if mean_d_gaps else 0.0

total_high = 0
total_pairs = 0
for ds in DATASETS:
    if ds in uniformly_high_d and "threshold_0.5" in uniformly_high_d[ds]:
        total_high += uniformly_high_d[ds]["threshold_0.5"]["count"]
    if ds in d_gap_results:
        total_pairs += len(d_gap_results[ds]["pairs"])
frac_high = total_high / total_pairs if total_pairs > 0 else 0.0

mantel_rs = [v["mantel_r"] for v in mantel_results.values()
             if isinstance(v.get("mantel_r"), (int, float)) and not np.isnan(v.get("mantel_r", np.nan))]

metrics_agg = {
    "mean_d_gap_across_datasets": safe_float(mean_d_gap_across),
    "frac_uniformly_high_d_pairs": safe_float(frac_high),
    "mean_mantel_r": safe_float(np.mean(mantel_rs) if mantel_rs else np.nan),
    "kendall_tau_pooled": safe_float(kendall_results.get("pooled", {}).get("kendall_tau")),
    "crnd_improves_delta": safe_float(crnd_improvement.get("delta_accuracy")),
    "high_overlap_precision": safe_float(high_overlap.get("best_precision")),
    "spearman_D_vs_fisher": safe_float(eco_vs_ml.get("spearman_d_vs_inv_fisher", {}).get("rho")),
    "spearman_D_vs_bhattacharyya": safe_float(eco_vs_ml.get("spearman_d_vs_bhattacharyya", {}).get("rho")),
    "partial_spearman_D|Fisher": safe_float(eco_vs_ml.get("partial_spearman_d_controlling_fisher", {}).get("rho")),
    "num_human_deferral": safe_float(total_deferral),
    "num_datasets": safe_float(len(DATASETS)),
    "num_class_pairs_total": safe_float(total_pairs),
}

logger.info("=" * 60)
logger.info("EVALUATION SUMMARY — Aggregate Metrics")
logger.info("=" * 60)
for k, v in metrics_agg.items():
    logger.info(f"  {k}: {v:.4f}" if isinstance(v, float) and v != int(v) else f"  {k}: {v}")

total_time = time.time() - t0
logger.info(f"Total runtime: {total_time:.1f}s")



10:28:48|INFO   |EVALUATION SUMMARY — Aggregate Metrics




10:28:48|INFO   |  mean_d_gap_across_datasets: 0.3679


10:28:48|INFO   |  frac_uniformly_high_d_pairs: 0.0341


10:28:48|INFO   |  mean_mantel_r: 0.3642


10:28:48|INFO   |  kendall_tau_pooled: 0.2592


10:28:48|INFO   |  crnd_improves_delta: 0.0


10:28:48|INFO   |  high_overlap_precision: 0.2000


10:28:48|INFO   |  spearman_D_vs_fisher: 0.3627


10:28:48|INFO   |  spearman_D_vs_bhattacharyya: 0.5011


10:28:48|INFO   |  partial_spearman_D|Fisher: 0.5889


10:28:48|INFO   |  num_human_deferral: 1.0


10:28:48|INFO   |  num_datasets: 5.0


10:28:48|INFO   |  num_class_pairs_total: 88.0


10:28:48|INFO   |Total runtime: 9.4s


## Visualization

Visualize key results in four panels: mean and max D-gap per dataset, Mantel test correlations, the ecological vs. ML correlation summary, and per-class CRND for `medical_abstracts`.

In [13]:
matplotlib.use("Agg")
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
fig.suptitle("Cross-Representation Class Characterization — Key Results", fontsize=14, fontweight="bold")

# ── Plot 1: Mean D-gap per dataset ──
ax1 = axes[0, 0]
ds_names_short = [ds.replace("_", "\n")[:20] for ds in DATASETS if ds in d_gap_results]
mean_gaps = [d_gap_results[ds]["mean_d_gap"] for ds in DATASETS if ds in d_gap_results]
max_gaps = [d_gap_results[ds]["max_d_gap"] for ds in DATASETS if ds in d_gap_results]
x = np.arange(len(ds_names_short))
ax1.bar(x - 0.15, mean_gaps, 0.3, label="Mean D-gap", color="#4C72B0", alpha=0.85)
ax1.bar(x + 0.15, max_gaps, 0.3, label="Max D-gap", color="#DD8452", alpha=0.85)
ax1.set_xticks(x)
ax1.set_xticklabels(ds_names_short, fontsize=7)
ax1.set_ylabel("D-gap")
ax1.set_title("D-gap per Dataset")
ax1.axhline(y=np.mean(mean_gaps), color="gray", linestyle="--", alpha=0.5, label="Overall mean")
ax1.legend(fontsize=8)  # called after axhline so "Overall mean" appears in the legend

# ── Plot 2: Mantel r per dataset ──
ax2 = axes[0, 1]
mantel_ds = [ds for ds in DATASETS if ds in mantel_results]
mantel_r_vals = [mantel_results[ds]["mantel_r"] for ds in mantel_ds]
mantel_p_vals = [mantel_results[ds]["p_value"] for ds in mantel_ds]
colors = ["#55A868" if p < 0.05 else "#C44E52" for p in mantel_p_vals]
ds_labels = [ds.replace("_", "\n")[:20] for ds in mantel_ds]
bars = ax2.bar(range(len(mantel_ds)), mantel_r_vals, color=colors, alpha=0.85)
ax2.set_xticks(range(len(mantel_ds)))
ax2.set_xticklabels(ds_labels, fontsize=7)
ax2.set_ylabel("Mantel r")
ax2.set_title("Mantel Test (CRND vs D-distance)")
ax2.axhline(y=0, color="black", linewidth=0.5)
# Add p-value annotations
for i, (r, p) in enumerate(zip(mantel_r_vals, mantel_p_vals)):
    ax2.annotate(f"p={p:.3f}", (i, r), textcoords="offset points",
                 xytext=(0, 5 if r >= 0 else -15), ha="center", fontsize=7)

# ── Plot 3: Ecological vs ML correlations ──
ax3 = axes[1, 0]
corr_names = ["D vs 1/Fisher", "D vs Bhatt.", "D|Fisher\n(partial)", "Fisher|D\n(partial)"]
corr_vals = [
    eco_vs_ml["spearman_d_vs_inv_fisher"]["rho"],
    eco_vs_ml["spearman_d_vs_bhattacharyya"]["rho"],
    eco_vs_ml["partial_spearman_d_controlling_fisher"]["rho"],
    eco_vs_ml["partial_spearman_fisher_controlling_d"]["rho"],
]
corr_ps = [
    eco_vs_ml["spearman_d_vs_inv_fisher"]["p_value"],
    eco_vs_ml["spearman_d_vs_bhattacharyya"]["p_value"],
    eco_vs_ml["partial_spearman_d_controlling_fisher"]["p_value"],
    eco_vs_ml["partial_spearman_fisher_controlling_d"]["p_value"],
]
bar_colors = ["#4C72B0" if p < 0.01 else "#DD8452" for p in corr_ps]
ax3.barh(range(len(corr_names)), corr_vals, color=bar_colors, alpha=0.85)
ax3.set_yticks(range(len(corr_names)))
ax3.set_yticklabels(corr_names, fontsize=9)
ax3.set_xlabel("Spearman rho")
ax3.set_title("Ecological vs ML Overlap Correlations")
ax3.axvline(x=0, color="black", linewidth=0.5)
for i, (v, p) in enumerate(zip(corr_vals, corr_ps)):
    ax3.annotate(f"p={p:.1e}", (v, i), textcoords="offset points",
                 xytext=(5 if v >= 0 else -50, 0), fontsize=7, va="center")

# ── Plot 4: CRND per class (medical_abstracts) ──
ax4 = axes[1, 1]
if "medical_abstracts" in crnd_stats:
    ma_stats = crnd_stats["medical_abstracts"]
    cls_names = sorted(ma_stats.keys())
    means = [ma_stats[c]["mean_crnd"] for c in cls_names]
    stds = [ma_stats[c]["std_crnd"] for c in cls_names]
    cls_labels = [c.replace("_", "\n")[:18] for c in cls_names]
    ax4.barh(range(len(cls_names)), means, xerr=stds, color="#55A868", alpha=0.85, capsize=3)
    ax4.set_yticks(range(len(cls_names)))
    ax4.set_yticklabels(cls_labels, fontsize=8)
    ax4.set_xlabel("Mean CRND")
    ax4.set_title("CRND per Class (medical_abstracts)")
    ax4.set_xlim(0.8, 0.95)

plt.tight_layout()
plt.savefig("results_visualization.png", dpi=150, bbox_inches="tight")
plt.show()
logger.info("Visualization saved to results_visualization.png")

10:28:48|INFO   |Visualization saved to results_visualization.png


