# ISO-FIGS Benchmark — Breast Cancer Wisconsin Dataset

**Interaction-Stratified Oblique FIGS** benchmark dataset preparation and exploration.

This notebook loads and analyses the **Breast Cancer Wisconsin (Diagnostic)** dataset,
the canonical FIGS benchmark with 30 highly correlated features computed from
cell-nuclei images. Strong within-group correlations (radius↔perimeter↔area, r ≈ 0.99)
make axis-aligned splits suboptimal and motivate the oblique splits central to ISO-FIGS.

| Property | Value |
|---|---|
| Source | `sklearn.datasets.load_breast_cancer` |
| Features | 30 (10 measurements × 3 statistics: mean, SE, worst) |
| Task | Binary classification: 0 = malignant, 1 = benign |
| Class balance | ~37 % malignant, ~63 % benign |
| Full dataset | 200 examples (160 train + 40 test) |
| Demo subset | 15 stratified examples |

## 1 — Setup & Data Loading

In [None]:
import json
import os
from collections import Counter

import numpy as np

In [None]:
# Configuration
GITHUB_RAW_URL = "https://raw.githubusercontent.com/AMGrobelnik/ai-invention-e82757-interaction-stratified-oblique-tree-ense/main/all_output_files_verified_and_under_size_limits/demo/demo_data.json"
LOCAL_FILE = "demo_data.json"


def load_data():
    """Load demo data from GitHub (works in Colab) or local file (dev fallback)."""
    import urllib.request

    # Try GitHub URL first (works in Colab); the timeout avoids hanging offline
    try:
        with urllib.request.urlopen(GITHUB_RAW_URL, timeout=10) as response:
            return json.loads(response.read().decode("utf-8"))
    except Exception as exc:  # network errors, HTTP errors, malformed JSON
        print(f"GitHub download failed ({exc!r}); trying {LOCAL_FILE}")
    # Fallback to local file
    if os.path.exists(LOCAL_FILE):
        with open(LOCAL_FILE, encoding="utf-8") as f:
            return json.load(f)
    raise FileNotFoundError("Could not load data from GitHub or local file")


data = load_data()
examples = data["examples"]
metadata = data.get("metadata", {})
print(f"Loaded {len(examples)} demo examples")
if metadata:
    print(f"Dataset: {metadata.get('description', 'N/A')}")
    print(f"Full dataset size: {metadata.get('total_in_full_dataset', '?')} examples")

## 2 — Example Schema

Each record contains:
- **input** — natural-language prediction prompt with all 30 feature values
- **context** — structured metadata (feature names & values, known interactions)
- **output** — target label string (`"0"` = malignant, `"1"` = benign)
- **dataset** / **split** — provenance fields
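
As a rough illustration of the schema above, the sketch below checks that a record carries the expected top-level fields and a valid label. The helper name `validate_record`, the specific checks, and the toy record values are assumptions made for illustration; they are not part of the benchmark code.

```python
# Hypothetical helper, not part of the ISO-FIGS codebase.
REQUIRED_KEYS = {"input", "context", "output", "dataset", "split"}
VALID_LABELS = {"0", "1"}  # "0" = malignant, "1" = benign


def validate_record(record):
    """Return a list of problems found in one record (empty list = OK)."""
    problems = []
    missing = REQUIRED_KEYS - set(record)
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    if record.get("output") not in VALID_LABELS:
        problems.append(f"unexpected output label: {record.get('output')!r}")
    ctx = record.get("context", {})
    names = ctx.get("feature_names", [])
    values = ctx.get("feature_values", [])
    if len(names) != len(values):
        problems.append("feature_names and feature_values differ in length")
    return problems


# Toy record with made-up values, for demonstration only.
toy = {
    "input": "Predict the diagnosis from the features below.",
    "context": {"feature_names": ["mean radius"], "feature_values": [14.2]},
    "output": "1",
    "dataset": "breast_cancer",
    "split": "train",
}
print(validate_record(toy))  # an empty list means no problems were found
```

Running the same check over every entry in `examples` before building the feature matrix would surface malformed records early.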

In [None]:
ex = examples[0]
print("Top-level keys:", list(ex.keys()))
print(f"\ndataset : {ex['dataset']}")
print(f"split   : {ex['split']}")
label_name = 'malignant' if ex['output'] == '0' else 'benign'
print(f"output  : {ex['output']} ({label_name})")
print(f"\nContext keys: {list(ex['context'].keys())}")
ctx = ex['context']
print(f"  task_type       : {ctx['task_type']}")
print(f"  n_features      : {ctx['n_features']}")
print(f"  n_samples_total : {ctx['n_samples_total']}")
print(f"  source          : {ctx['source']}")
print("\nInput prompt (first 200 chars):")
suffix = "\u2026" if len(ex['input']) > 200 else ""  # mark truncation only when it happens
print(f"  {ex['input'][:200]}{suffix}")

## 3 — Split & Class Distribution

In [None]:
LABEL = {"0": "malignant", "1": "benign"}

split_cnt = Counter(e["split"] for e in examples)
print("Split distribution:")
for s in sorted(split_cnt):
    print(f"  {s:6s} : {split_cnt[s]}")

print("\nPer-split class counts:")
for s in sorted(split_cnt):
    cc = Counter(LABEL[e["output"]] for e in examples if e["split"] == s)
    print(f"  {s:6s} : {', '.join(f'{k}={v}' for k, v in sorted(cc.items()))}")

## 4 — Feature Matrix & Summary Statistics

In [None]:
feature_names = examples[0]["context"]["feature_names"]
X = np.array([e["context"]["feature_values"] for e in examples])
y = np.array([int(e["output"]) for e in examples])

print(f"X shape : {X.shape}   (samples \u00d7 features)")
print(f"y shape : {y.shape}")
print(f"\nFeature statistics (first 10 of {len(feature_names)}):")
print(f"{'#':>3}  {'Feature':>28s}  {'Mean':>10s}  {'Std':>10s}  {'Min':>10s}  {'Max':>10s}")
print("-" * 78)
for i in range(min(10, X.shape[1])):
    c = X[:, i]
    print(f"{i:3d}  {feature_names[i]:>28s}  {c.mean():10.4f}  {c.std():10.4f}  {c.min():10.4f}  {c.max():10.4f}")

## 5 — Feature Correlations (Motivating Oblique Splits)

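The strongest correlations occur within **size groups** (radius, perimeter, area)
and across measurement tiers (mean ↔ worst). When features move together like this,
the informative direction is often a linear combination of them, which an
axis-aligned tree can only approximate with a staircase of single-feature splits.
This makes axis-aligned decision-tree splits suboptimal and directly motivates
the oblique (multi-feature) splits used by ISO-FIGS.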
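
To make the contrast between axis-aligned and oblique splits concrete, here is a small synthetic sketch. It is a deliberately extreme, constructed case and not the ISO-FIGS algorithm: two features share a latent factor (so they are strongly correlated), and the class depends only on their difference. The names `points` and `best_stump_accuracy` are illustrative.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Two strongly correlated synthetic features sharing a latent "size" factor.
points = []
for _ in range(400):
    size = rng.gauss(0.0, 1.0)
    x1 = size + rng.gauss(0.0, 0.1)
    x2 = size + rng.gauss(0.0, 0.1)
    label = 1 if x1 - x2 > 0 else 0  # class depends on a feature combination
    points.append((x1, x2, label))


def best_stump_accuracy(values, labels):
    """Best accuracy of a single threshold split `value > t` (either polarity)."""
    best = 0.0
    for t in sorted(set(values)):
        correct = sum((v > t) == bool(y) for v, y in zip(values, labels))
        acc = max(correct, len(labels) - correct) / len(labels)
        best = max(best, acc)
    return best


labels = [p[2] for p in points]
acc_x1 = best_stump_accuracy([p[0] for p in points], labels)
acc_x2 = best_stump_accuracy([p[1] for p in points], labels)
acc_oblique = best_stump_accuracy([p[0] - p[1] for p in points], labels)

print(f"axis-aligned split on x1 : {acc_x1:.3f}")
print(f"axis-aligned split on x2 : {acc_x2:.3f}")
print(f"oblique split on x1 - x2 : {acc_oblique:.3f}")
```

In this construction a single threshold on either raw feature should stay close to chance, while one threshold on the combination `x1 - x2` separates the classes. Real data is rarely this clean, but the mechanism is the same.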

In [None]:
KEY_FEATS = [
    "mean radius", "mean perimeter", "mean area",
    "worst radius", "worst perimeter", "worst area",
    "mean concavity", "mean concave points",
]
kidx = [feature_names.index(f) for f in KEY_FEATS]
C = np.corrcoef(X[:, kidx].T)

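# Abbreviate labels; plain truncation to 9 chars would render both
# "mean concavity" and "mean concave points" as "mean conc".
ABBREV = [("mean ", "m."), ("worst ", "w."), ("perimeter", "perim"),
          ("concave points", "concpts"), ("concavity", "concav")]
short = []
for f in KEY_FEATS:
    for old, new in ABBREV:
        f = f.replace(old, new)
    short.append(f)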
print(" " * 11 + "".join(f"{s:>10s}" for s in short))
for i, s in enumerate(short):
    print(f"{s:>10s} " + "".join(f"{C[i,j]:10.3f}" for j in range(len(short))))

print("\nHighly correlated pairs (|r| > 0.93):")
for i in range(len(KEY_FEATS)):
    for j in range(i + 1, len(KEY_FEATS)):
        if abs(C[i, j]) > 0.93:
            print(f"  {KEY_FEATS[i]} \u2194 {KEY_FEATS[j]} : r = {C[i,j]:.4f}")

## 6 — Known Interaction Structure

The dataset metadata documents the feature interactions that ISO-FIGS leverages
for its interaction-stratified tier construction and ANOVA decomposition.
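
For orientation, the functional ANOVA decomposition in its generic textbook form writes a model $f$ over $p$ features (here $p = 30$) as a constant plus main effects, pairwise interactions, and higher-order terms. How ISO-FIGS assigns these terms to its tiers is not specified in this notebook, so reading each interaction order as one tier is only an assumption.

```latex
f(x_1, \dots, x_p)
  = f_0
  + \sum_{i=1}^{p} f_i(x_i)
  + \sum_{1 \le i < j \le p} f_{ij}(x_i, x_j)
  + \cdots
  + f_{1,\dots,p}(x_1, \dots, x_p)
```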

In [None]:
interactions = examples[0]["context"].get("known_interactions", "(none)")
print("Known feature interactions (from context metadata):\n")
for part in interactions.split(", "):
    print(f"  \u2022 {part.strip()}")

## 7 — Class Separation by Key Features

In [None]:
SEP = ["mean radius", "mean area", "mean concavity",
       "worst radius", "worst area", "worst concavity"]

print(f"{'Feature':>25s}  {'Malignant':>10s}  {'Benign':>10s}  {'Ratio':>7s}")
print("-" * 58)
for fname in SEP:
    fi = feature_names.index(fname)
    m = X[y == 0, fi].mean()
    b = X[y == 1, fi].mean()
    r = m / b if b else float('inf')
    print(f"{fname:>25s}  {m:10.4f}  {b:10.4f}  {r:6.2f}x")

print("\n\u2192 Malignant tumours show systematically larger size and concavity.")
print("  These correlated feature groups motivate oblique (multi-feature) splits.")

## 8 — Reproducing the Full Dataset

This demo notebook uses **15 stratified examples** for quick exploration.
To regenerate the full 200-example benchmark dataset:

```bash
pip install numpy scikit-learn
python data.py   # → full_data_out.json (200 examples, ~580 KB)
```

The pipeline is fully deterministic (`RANDOM_SEED = 42`) and relies only on
`sklearn.datasets.load_breast_cancer` — no external data downloads required.
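
For readers curious how a small stratified subset like the 15-example demo could be drawn, the following is a minimal sketch using only the standard library. It is not the contents of `data.py`: the function `stratified_sample`, the proportional-rounding rule, and the toy label list are assumptions chosen for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(labels, n_total, seed=42):
    """Pick about n_total indices while preserving class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    chosen = []
    for label in sorted(by_class):
        indices = by_class[label]
        # Proportional allocation, with at least one example per class.
        k = max(1, round(n_total * len(indices) / len(labels)))
        chosen.extend(rng.sample(indices, min(k, len(indices))))
    return sorted(chosen)


# Toy labels mimicking a ~37 % / ~63 % class balance in a 200-example set.
toy_labels = ["0"] * 74 + ["1"] * 126
subset = stratified_sample(toy_labels, n_total=15)
n_malignant = sum(toy_labels[i] == "0" for i in subset)
print(f"{len(subset)} examples, {n_malignant} malignant")
```

Because the per-class counts are rounded independently, the total can differ from `n_total` by one or two for other class ratios; a production pipeline would more likely rely on a library routine for stratified splitting.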