
# Tutorial (Deep Dive): Power & Effect Size Analysis for Biological ML with PyTorch + Lightning + `beignet`

> **Goal.** Teach you how to design, train, and *interpret* biological ML models using **effect sizes** and **power / sample-size analysis**—so your results are *detectable*, *reproducible*, and *biologically actionable*.

**Who is this for?**
- ML researchers/engineers working with biological data (drug discovery, protein design, single-cell, ADMET).
- Biologists validating ML models and planning experiments.
- Anyone shipping models where *small effects* still matter.

**What you'll learn**
1. Map *biological questions* to the **right statistical tests** and **effect sizes**.
2. Estimate **required sample sizes** *before* training.
3. Instrument PyTorch Lightning to log **effect sizes** and **power** alongside loss/AUROC.
4. Interpret results and write **power-aware claims** for papers and reports.

**Prerequisites**
- Comfortable with PyTorch/Lightning training loops.
- Basic stats (mean/variance/correlation, t-tests, chi-squared, ANOVA).  
  A concise refresher is included below.

**Datasets used (small/medium from `beignet`):**
- `FreeSolvDataset` (regression; hydration free energy)
- `ClinToxDataset` (binary classification; clinical trial toxicity)
- `SKEMPIDataset` (regression; ΔΔG mutational effects)

**Navigation**
- §1 Foundations: definitions, formulae, and when to use what  
- §2–§4 Hands-on case studies (FreeSolv, ClinTox, SKEMPI)  
- §5–§6 Workflow patterns, common pitfalls, decision checklist, and utilities  
- §7 Reporting checklist (what to claim & how)  
- Appendix: FAQ and reporting templates




Deep learning metrics (loss, AUROC) don’t tell you whether your dataset/model can **reliably detect** biologically meaningful effects.  
This notebook adds **effect sizes** and **power/sample-size** to your workflow so claims are reproducible and decision-relevant.

> If your `beignet` version exposes field names other than `X` and `y`, adjust the cells marked *adjust field names if needed*.


## 0) Setup

In [None]:
# If needed, uncomment:
# !pip install beignet torch lightning torchmetrics scikit-learn pandas

import pandas as pd
import torch
from lightning import LightningModule, Trainer, seed_everything
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset

seed_everything(7)
print("Torch:", torch.__version__)

### Import `beignet` datasets & metrics

In [None]:
# Standard classification/regression metrics
from torchmetrics.classification import AUROC, Accuracy, F1Score
from torchmetrics.regression import MeanSquaredError, R2Score

# Datasets (swap for others in your package if preferred)
from beignet.datasets import ClinToxDataset, FreeSolvDataset, SKEMPIDataset

# TorchMetrics-style wrappers for power/effect/sample size
from beignet.metrics import (
    ANOVAPower,
    ChiSquaredIndependencePower,
    CohensD,
    CorrelationPower,
    CramersV,
    HedgesG,
    PhiCoefficient,
    ProportionTwoSamplePower,
    TTestPower,
    TTestSampleSize,
)


---
## 1) Statistical foundations (concise refresher)

### 1.1 Effect sizes (how *big* is the effect?)
- **Cohen's d (two means)**:  
  \[ d \;=\; \frac{\mu_1 - \mu_2}{s_\text{pooled}}, \quad
     s_\text{pooled} = \sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}} \]
  Interpretable in SD units. Use **Hedges' g** for small-sample bias correction.
- **Correlation r (association strength)**: absolute value indicates effect magnitude; can be converted to *f* or *d* if needed.
- **Cramér’s V / Phi (φ)** for association in contingency tables:  
  \[ \phi = \sqrt{\frac{\chi^2}{n}} \quad (\text{for } 2 \times 2) \qquad
     V = \sqrt{\frac{\chi^2}{n \cdot \min(r-1,\, k-1)}} \quad (\text{for } r \times k) \]
- **Cohen’s f (ANOVA)** relates group-separation to within-group variance:  
  \[ f = \sqrt{\frac{\eta^2}{1-\eta^2}} \]
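
As a cross-check on the formulas above, here is a minimal standard-library sketch that computes Cohen's d, Hedges' g, and Cramér's V from raw numbers. The function names are illustrative and are not part of `beignet`; the Hedges correction uses the common approximation J = 1 - 3/(4(n1 + n2) - 9), and V is normalized by min(r-1, k-1), so it equals |φ| for a 2×2 table.

```python
import math
from statistics import mean, variance


def cohens_d(a, b):
    """Standardized mean difference using the pooled sample SD."""
    n1, n2 = len(a), len(b)
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)


def hedges_g(a, b):
    """Cohen's d multiplied by the approximate small-sample correction J."""
    j = 1.0 - 3.0 / (4.0 * (len(a) + len(b)) - 9.0)
    return j * cohens_d(a, b)


def cramers_v(table):
    """Cramér's V for an r x k table of counts (given as a list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    min_dim = min(len(table), len(table[0])) - 1
    return math.sqrt(chi2 / (n * min_dim))


# Toy inputs, made up for illustration only.
group_a = [1.0, 2.0, 3.0, 4.0, 5.0]
group_b = [2.0, 3.0, 4.0, 5.0, 6.0]
d = cohens_d(group_a, group_b)
g = hedges_g(group_a, group_b)
v = cramers_v([[30, 10], [10, 30]])
print(f"d={d:.3f}  g={g:.3f}  V={v:.3f}")
```

With only five samples per group, the correction shrinks |d| by roughly 10%, which is why Hedges' g is preferred for small assays.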

### 1.2 Power and sample size (can we **detect** it reliably?)
- **Power (1-β)**: probability of rejecting the null when the effect is real. In practice we target **0.8** or **0.9**.
- **Inputs**: effect size, alpha (type-I error, usually 0.05), sample size, variance/df.
- **Outputs**: (a) compute power for given *n*, (b) compute **required n** for target power.
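
The two outputs listed above can be sketched with a normal approximation to the two-sample t-test. This is not the `beignet` implementation; `two_sample_power` and `two_sample_n` are hypothetical helper names, and exact noncentral-t calculations typically ask for slightly more samples per group than this approximation does.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def two_sample_power(d, n_per_group, alpha=0.05):
    """Approximate two-sided power for comparing two group means."""
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = abs(d) * math.sqrt(n_per_group / 2)  # expected z statistic
    return _STD_NORMAL.cdf(shift - z_crit) + _STD_NORMAL.cdf(-shift - z_crit)


def two_sample_n(d, power=0.80, alpha=0.05):
    """Approximate sample size per group needed to reach the target power."""
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = _STD_NORMAL.inv_cdf(power)
    return math.ceil(2 * ((z_crit + z_power) / abs(d)) ** 2)


n_req = two_sample_n(d=0.5)               # (b) required n for a medium effect
achieved = two_sample_power(0.5, n_req)   # (a) power at that n
print(n_req, round(achieved, 3))
```

Because required n scales with 1/d², halving the effect size you care about roughly quadruples the data you need.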

### 1.3 Mapping biological questions → tests
- Regression (e.g., FreeSolv, SKEMPI): use **correlation power** or **t-test power**; add **ANOVA power** for grouped effects.
- Binary classification (e.g., ClinTox): use **χ² independence power** for association and **two-sample proportion power** for prevalence differences.
- Multi-group categorical biology (e.g., assay tiers, mutation classes): **ANOVA power** and **Cohen’s f**.

> **Why not only AUROC/MSE?** Because AUROC/MSE don't tell you whether your dataset size + noise profile can detect the effect you *care* about. Power does.



### 1.4 Why power/effect matters (biology-first)

- **Effect size** (Cohen’s *d*, Hedges’ *g*, φ/Cramér’s *V*, Cohen’s *f/f²*): how *big* is a difference/association.
- **Power**: probability to detect that effect at α (given *n*, noise).
- **For DL**: plan **data needs** before training; **log** detectability alongside loss/AUC; **calibrate claims** to what your dataset can actually support.



---
## 2) Case A — **FreeSolv** (regression; hydration free energy)

We’ll show correlation/t-test power and standardized effects on a compact regression task.


### 2.1 Load & preprocess *(adjust field names if needed)*

In [None]:
ds = FreeSolvDataset()

# Adjust here if your dataset exposes different attributes
X = torch.as_tensor(ds.X, dtype=torch.float32)  # [N, D]
y = torch.as_tensor(ds.y, dtype=torch.float32)  # [N]

# Standardize for stable training (optional)
xsc = StandardScaler().fit(X.numpy())
ysc = StandardScaler().fit(y[:, None].numpy())
Xn = torch.from_numpy(xsc.transform(X.numpy())).float()
yn = torch.from_numpy(ysc.transform(y[:, None].numpy()).squeeze(1)).float()

train_idx, test_idx = train_test_split(
    torch.arange(len(Xn)),
    test_size=0.2,
    random_state=7,
)
train_loader = DataLoader(
    TensorDataset(Xn[train_idx], yn[train_idx]),
    batch_size=64,
    shuffle=True,
)
val_loader = DataLoader(TensorDataset(Xn[test_idx], yn[test_idx]), batch_size=128)

Xn.shape, yn.shape

### 2.2 Lightning model with **effect/power** logged next to loss

In [None]:
class LitRegressor(LightningModule):
    def __init__(self, in_dim, lr=1e-3, alpha=0.05):
        super().__init__()
        self.save_hyperparameters()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Linear(128, 1),
        )
        self.mse, self.r2 = MeanSquaredError(), R2Score()
        self.d = CohensD()
        self.power_t = TTestPower(alpha=alpha)
        self.power_corr = CorrelationPower(alpha=alpha)

    def forward(self, x):
        return self.net(x).squeeze(-1)

    def training_step(self, batch, _):
        x, y = batch
        pred = self(x)
        loss = nn.functional.mse_loss(pred, y)
        self.log("train/mse", loss, prog_bar=True)
        return loss

    def validation_step(self, batch, _):
        x, y = batch
        pred = self(x)
        self.log("val/mse", self.mse(pred, y), prog_bar=True)
        self.log("val/r2", self.r2(pred, y), prog_bar=True)
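        # Effect size / power on predictions vs. truth. These values are per batch;
        # power depends on n, so aggregate over the full validation set before reporting.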
        self.log("val/cohens_d", self.d(pred, y))
        self.log("val/power_ttest", self.power_t(pred, y))
        self.log("val/power_corr", self.power_corr(pred, y))

    def configure_optimizers(self):
        return optim.AdamW(self.parameters(), lr=self.hparams.lr)


model = LitRegressor(in_dim=Xn.shape[1])
trainer = Trainer(
    max_epochs=10,
    log_every_n_steps=5,
    deterministic=True,
    enable_checkpointing=False,
)
trainer.fit(model, train_loader, val_loader)

### 2.3 Plan **sample size** for a target correlation (threshold → detectability)

In [None]:
# Suppose r >= 0.35 is "biologically useful"
target_r, alpha = 0.35, 0.05
ns = torch.arange(30, 401, 10)
cp = CorrelationPower(alpha=alpha)

powers = [float(cp(effect_size=torch.tensor(target_r), n=int(n))) for n in ns]
pd.DataFrame({"n": ns.numpy(), "power_at_r=0.35": powers}).head(10)
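
To sanity-check the numbers returned by `CorrelationPower`, the same curve can be approximated with the Fisher z-transform using only the standard library. This is a sketch under the assumption of a two-sided test of ρ = 0; the helper names are hypothetical, and the `beignet` metric may use a different (for example exact) method.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def correlation_power(r, n, alpha=0.05):
    """Approximate two-sided power to detect correlation r with n samples."""
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = math.atanh(abs(r)) * math.sqrt(n - 3)  # Fisher z divided by its SE
    return _STD_NORMAL.cdf(shift - z_crit) + _STD_NORMAL.cdf(-shift - z_crit)


def correlation_n(r, power=0.80, alpha=0.05):
    """Approximate n needed to detect correlation r at the target power."""
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = _STD_NORMAL.inv_cdf(power)
    return math.ceil(((z_crit + z_power) / math.atanh(abs(r))) ** 2 + 3)


for n in (30, 60, 120, 240):
    print(n, round(correlation_power(0.35, n), 3))

n_req = correlation_n(0.35)
print("n for 80% power at r=0.35:", n_req)
```

Under this approximation, r = 0.35 reaches 80% power at roughly 60 samples, so a few hundred molecules are comfortably powered for that threshold; weaker correlations need far more.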


---
## 3) Case B — **ClinTox** (binary classification; clinical trial toxicity)

Demonstrate **χ² power** for association (truth vs predictions), **φ/Cramér’s V** effect size,
and **two-sample proportion power** for prevalence differences.


### 3.1 Load & split *(adjust field names if needed)*

In [None]:
ds = ClinToxDataset()

X = torch.as_tensor(ds.X, dtype=torch.float32)
y = torch.as_tensor(ds.y, dtype=torch.long)  # 0/1 labels

Xn = torch.from_numpy(StandardScaler().fit_transform(X.numpy())).float()
tr, te = train_test_split(
    torch.arange(len(Xn)),
    test_size=0.2,
    stratify=y,
    random_state=7,
)

train_loader = DataLoader(TensorDataset(Xn[tr], y[tr]), batch_size=64, shuffle=True)
val_loader = DataLoader(TensorDataset(Xn[te], y[te]), batch_size=128)

Xn.shape, y.shape

### 3.2 Lightning classifier with **χ² power** & **effect sizes**

In [None]:
class LitClassifier(LightningModule):
    def __init__(self, in_dim, lr=1e-3, alpha=0.05):
        super().__init__()
        self.save_hyperparameters()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Linear(128, 1),
        )
        self.acc, self.auroc, self.f1 = (
            Accuracy(task="binary"),
            AUROC(task="binary"),
            F1Score(task="binary"),
        )
        self.chi_power = ChiSquaredIndependencePower(alpha=alpha)
        self.cramersV, self.phi = CramersV(), PhiCoefficient()

    def forward(self, x):
        return self.net(x).squeeze(-1)

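    def _contingency(self, logits, y, thr=0.5):
        """Return hard predictions, probabilities, and a 2x2 table (rows: truth, cols: prediction)."""
        p = torch.sigmoid(logits)
        pred = (p >= thr).long()
        # Count each (truth, prediction) cell with boolean masks instead of a Python loop.
        table = torch.stack(
            [
                torch.stack([((y == t) & (pred == q)).sum() for q in (0, 1)])
                for t in (0, 1)
            ],
        ).float()
        return pred, p, table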

    def training_step(self, batch, _):
        x, y = batch
        logit = self(x)
        loss = nn.functional.binary_cross_entropy_with_logits(logit, y.float())
        self.log("train/loss", loss, prog_bar=True)
        return loss

    def validation_step(self, batch, _):
        x, y = batch
        logit = self(x)
        pred, p, table = self._contingency(logit, y)
        self.log("val/acc", self.acc(pred, y), prog_bar=True)
        self.log("val/auroc", self.auroc(p, y), prog_bar=True)
        self.log("val/f1", self.f1(pred, y))

        # Association detectability and effect size:
        self.log("val/chi_power", self.chi_power(table))
        self.log("val/cramers_v", self.cramersV(table))
        self.log("val/phi", self.phi(table))

    def configure_optimizers(self):
        return optim.AdamW(self.parameters(), lr=self.hparams.lr)


model_c = LitClassifier(in_dim=Xn.shape[1])
trainer = Trainer(
    max_epochs=10,
    log_every_n_steps=5,
    deterministic=True,
    enable_checkpointing=False,
)
trainer.fit(model_c, train_loader, val_loader)


> **Interpretation (ClinTox).**
> - *val/chi_power*: estimated probability that a χ² test at level α detects the association between true and predicted labels, given the observed effect size and *n* (a complement to raw accuracy).
> - *val/cramers_v* / *val/phi*: effect size of that association; 0.1 (small), 0.3 (moderate), and 0.5 (large) are rough guides.
> - *ProportionTwoSamplePower*: plan cohort sizes for the minimum clinically meaningful prevalence difference (e.g., 10 percentage points).
> - If χ² power is high but AUROC is modest, the model captures a detectable association; check the effect size, then tune the decision threshold and costs.
> - If AUROC is high but χ² power is low, you may be underpowered at this *n*, or the classes (or subgroups) may be too imbalanced.
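
For a 2×2 table the link between φ, *n*, and χ² power can be written in closed form, because a noncentral χ² variable with one degree of freedom is the square of a shifted standard normal. The sketch below uses that identity with made-up counts; `phi_from_table` and `chi2_power_2x2` are hypothetical names, not `beignet` functions.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def phi_from_table(table):
    """Phi coefficient for a 2x2 table [[a, b], [c, d]] of counts."""
    (a, b), (c, d) = table
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denom


def chi2_power_2x2(phi, n, alpha=0.05):
    """Power of the 1-df chi-squared test when the true effect size is phi."""
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = abs(phi) * math.sqrt(n)  # sqrt of the noncentrality n * phi^2
    return _STD_NORMAL.cdf(shift - z_crit) + _STD_NORMAL.cdf(-shift - z_crit)


# Made-up validation counts: rows = true label, columns = predicted label.
table = [[60, 15], [10, 15]]
phi = phi_from_table(table)
n = sum(map(sum, table))
power = chi2_power_2x2(phi, n)
print(round(phi, 3), n, round(power, 3))
```

The same φ evaluated at a quarter of the sample size gives much lower power, which is the situation the last bullet above describes.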


### 3.3 Two-sample **proportion power** for prevalence differences

In [None]:
# Suppose positive-class prevalence differs between two subgroups by 0.10 (e.g., 0.20 vs 0.30)
p1, p2, alpha = 0.20, 0.30, 0.05
ns = torch.arange(50, 801, 25)
pp = ProportionTwoSamplePower(alpha=alpha)
powers = [
    float(pp(p1=torch.tensor(p1), p2=torch.tensor(p2), n1=int(n), n2=int(n)))
    for n in ns
]

pd.DataFrame({"n_per_group": ns.numpy(), "power_at_diff=0.10": powers}).head(10)
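
A standard-library approximation of the same calculation is useful for checking the table above. The sketch assumes equal group sizes and a two-sided z-test with a pooled standard error under the null; `proportion_power` is a hypothetical helper, and `ProportionTwoSamplePower` may differ in detail.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def proportion_power(p1, p2, n_per_group, alpha=0.05):
    """Approximate power of a two-sided z-test for two proportions.

    Only the tail in the direction of the true difference is counted;
    the opposite tail is negligible for the effect sizes of interest.
    """
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    p_bar = (p1 + p2) / 2
    se_null = math.sqrt(2 * p_bar * (1 - p_bar) / n_per_group)
    se_alt = math.sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_group)
    return _STD_NORMAL.cdf((abs(p1 - p2) - z_crit * se_null) / se_alt)


# Smallest per-group n that reaches 80% power for 0.20 vs 0.30.
n_req = 2
while proportion_power(0.20, 0.30, n_req) < 0.80:
    n_req += 1
print(n_req, round(proportion_power(0.20, 0.30, n_req), 3))
```

A 10-point prevalence difference needs on the order of a few hundred samples per subgroup, which is worth knowing before stratifying a small toxicity dataset.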


---
## 4) Case C — **SKEMPI** (ΔΔG regression; mutational effects)

Use **Cohen’s d / Hedges’ g** and **ANOVA power** across mutation classes.


### 4.1 Load & split *(adjust field names if needed)*

In [None]:
ds = SKEMPIDataset()

X = torch.as_tensor(ds.X, dtype=torch.float32)
y = torch.as_tensor(
    ds.y,
    dtype=torch.float32,
)  # ΔΔG (kcal/mol). Verify sign convention if needed.

Xn = torch.from_numpy(StandardScaler().fit_transform(X.numpy())).float()
yn = torch.from_numpy(
    StandardScaler().fit_transform(y[:, None].numpy()).squeeze(1),
).float()

tr, te = train_test_split(torch.arange(len(Xn)), test_size=0.2, random_state=7)
train_loader = DataLoader(TensorDataset(Xn[tr], yn[tr]), batch_size=64, shuffle=True)
val_loader = DataLoader(TensorDataset(Xn[te], yn[te]), batch_size=128)

Xn.shape, yn.shape

### 4.2 Lightning regressor with grouped effects and **ANOVA power**

In [None]:
class LitRegressorGrouped(LightningModule):
    def __init__(self, in_dim, lr=1e-3, alpha=0.05):
        super().__init__()
        self.save_hyperparameters()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Linear(128, 1),
        )
        self.mse, self.r2 = MeanSquaredError(), R2Score()
        self.d, self.g = CohensD(), HedgesG()
        self.anova_p = ANOVAPower(alpha=alpha)

    def forward(self, x):
        return self.net(x).squeeze(-1)

    def training_step(self, batch, _):
        x, y = batch
        pred = self(x)
        loss = nn.functional.mse_loss(pred, y)
        self.log("train/mse", loss, prog_bar=True)
        return loss

    def validation_step(self, batch, _):
        x, y = batch
        pred = self(x)
        self.log("val/mse", self.mse(pred, y), prog_bar=True)
        self.log("val/r2", self.r2(pred, y), prog_bar=True)
        self.log("val/d", self.d(pred, y))
        self.log("val/g", self.g(pred, y))

        # Tertiles of y stand in for mutation classes. If SKEMPI exposes true
        # categories, replace this bucketing with ds.category[indices].
        q = torch.quantile(
            y,
            torch.tensor([1 / 3, 2 / 3], device=y.device, dtype=y.dtype),
        )
        cats = torch.bucketize(y, q)  # 0, 1, 2

        group_means = torch.stack([pred[cats == k].mean() for k in (0, 1, 2)])
        group_vars = torch.stack(
            [pred[cats == k].var(unbiased=True) for k in (0, 1, 2)],
        )
        group_ns = torch.tensor(
            [int((cats == k).sum()) for k in (0, 1, 2)],
            device=pred.device,
        )

        try:
            self.log("val/anova_power", self.anova_p(group_means, group_vars, group_ns))
        except TypeError:
            # If ANOVAPower expects raw groups instead of summary stats, adapt here.
            pass

    def configure_optimizers(self):
        return optim.AdamW(self.parameters(), lr=self.hparams.lr)


model_s = LitRegressorGrouped(in_dim=Xn.shape[1])
trainer = Trainer(
    max_epochs=10,
    log_every_n_steps=5,
    deterministic=True,
    enable_checkpointing=False,
)
trainer.fit(model_s, train_loader, val_loader)


> **Interpretation (SKEMPI).**
> - *val/d*, *val/g*: standardized mean difference between predictions and truth (values near 0 mean no systematic shift); *g* is bias-corrected for small samples.
> - *val/anova_power*: is between-class separation detectable at current *n*? Great for mutation class stratifications.
> - The **threshold → d → n** workflow translates a biological delta (e.g., ΔΔG = 0.5 kcal/mol) to required sample size.
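
ANOVA power is driven by Cohen's f, which can be computed from the same per-group summary statistics the validation step collects. The sketch below is a plain-Python illustration of that step with toy numbers; `cohens_f` and `eta_squared` are hypothetical helpers, and how `ANOVAPower` consumes its inputs is not assumed here.

```python
import math


def cohens_f(group_means, group_vars, group_ns):
    """Cohen's f from per-group means, sample variances, and sizes."""
    n_total = sum(group_ns)
    grand_mean = sum(n * m for n, m in zip(group_ns, group_means)) / n_total
    # Size-weighted variance of the group means around the grand mean.
    between_var = (
        sum(n * (m - grand_mean) ** 2 for n, m in zip(group_ns, group_means))
        / n_total
    )
    # Pooled within-group variance.
    within_var = sum((n - 1) * v for n, v in zip(group_ns, group_vars)) / (
        n_total - len(group_ns)
    )
    return math.sqrt(between_var / within_var)


def eta_squared(f):
    """Invert f = sqrt(eta^2 / (1 - eta^2))."""
    return f**2 / (1 + f**2)


# Toy tertile summary: means one SD apart, unit variance, 10 samples each.
f = cohens_f([0.0, 1.0, 2.0], [1.0, 1.0, 1.0], [10, 10, 10])
eta2 = eta_squared(f)
print(round(f, 3), round(eta2, 3))
```

Note that bucketing by the true ΔΔG, as the tertile stand-in does, inflates between-group separation for any model with signal, so treat the resulting f as an upper bound compared with real mutation classes.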


### 4.3 Convert biological threshold → **required n**

In [None]:
# Suppose ΔΔG = 0.5 kcal/mol is biologically meaningful.
# Estimate Cohen's d from observed SD in original units:
with torch.no_grad():
    # NOTE: use original-scale y for this calculation
    y_val = y[te]
    sd = y_val.std()
    d_eff = torch.tensor(0.5) / sd

# Sample size for 80% power at α=0.05 using the Metric interface:
n_needed = TTestSampleSize()(effect_size=d_eff, alpha=0.05, power=0.80)
n_needed


---
## 5) Power-aware workflow (what to log & when)

1. **Before training**: convert meaningful thresholds (ΔTm, ΔΔG, prevalence deltas) → **effect sizes**; compute **required n** with `*SampleSize` metrics.  
2. **During validation**: log **effect sizes** and **power** alongside loss/AUC.  
3. **After training**: frame claims like *“association is moderate (Cramér’s V≈0.3) and detectable (χ²-power≈0.86) at n=…”* rather than only AUROC/MSE.



---
### 5.1 Common pitfalls & best practices

- **Confusing significance with importance**: small *p* with *tiny* effect may be irrelevant biologically; report effect sizes.
- **Underpowered negatives**: failing to detect ≠ no effect. Always report power achieved for the claim’s threshold.
- **Data leakage in power eval**: compute effect/power on held-out validation/test—not on the training set.
- **Ignoring stratification**: subgroup effects (e.g., mutation classes) can differ; use ANOVA power and per-stratum reporting.
- **Class imbalance**: pair χ² power with prevalence-aware metrics (precision/recall) and use proportion-power for planning.
- **Multiple testing**: adjust α for multiple hypotheses or use hierarchical claims; document your correction.
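
For the multiple-testing item, one widely used correction is Holm's step-down procedure, which controls the family-wise error rate and is never less powerful than plain Bonferroni. The sketch below is a minimal standard-library version with made-up p-values; `holm_reject` is a hypothetical helper name.

```python
def holm_reject(p_values, alpha=0.05):
    """Return True for each hypothesis rejected by Holm's step-down procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # Compare the (rank + 1)-th smallest p-value with alpha / (m - rank).
        if p_values[idx] > alpha / (m - rank):
            break  # once one test fails, all larger p-values fail too
        reject[idx] = True
    return reject


# Made-up p-values, e.g. one test per mutation class.
p_values = [0.010, 0.040, 0.015, 0.005]
holm = holm_reject(p_values)
bonferroni = [p <= 0.05 / len(p_values) for p in p_values]
print("Holm:      ", holm)
print("Bonferroni:", bonferroni)
```

Whichever correction you choose, run the power analysis at the corrected α rather than at 0.05, since a stricter threshold lowers power for every individual claim.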



### 5.2 Decision checklist (copy/paste for your projects)

1. **Define a biologically meaningful threshold** (ΔΔG, ΔTm, prevalence delta, r).  
2. **Translate to effect size** (e.g., Cohen’s d or *r*).  
3. **Before training**: compute **required n** for 80–90% power at α=0.05.  
4. **During validation**: log AUROC/MSE **and** effect sizes **and** power.  
5. **After**: write claims that include effect magnitude and detectability at reported *n*.  
6. **If power < 0.8**: expand data, reduce noise, or narrow the claim; repeat steps 1–5.
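
When step 6 applies and more data is not available, "narrowing the claim" can be made concrete by computing the minimum detectable effect at the sample size you have. The sketch below uses the normal approximation for a two-group comparison; `minimum_detectable_d` is a hypothetical helper, and the ΔΔG standard deviation is a placeholder to replace with your own estimate.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def minimum_detectable_d(n_per_group, power=0.80, alpha=0.05):
    """Smallest Cohen's d detectable at the given power (normal approximation)."""
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = _STD_NORMAL.inv_cdf(power)
    return (z_crit + z_power) * math.sqrt(2 / n_per_group)


sd_ddg = 1.2  # hypothetical SD of ΔΔG in kcal/mol; replace with your estimate
for n in (20, 50, 100, 200):
    d_min = minimum_detectable_d(n)
    # Convert back to original units to phrase the claim biologically.
    print(n, round(d_min, 2), round(d_min * sd_ddg, 2))
```

The last column is the smallest shift, in kcal/mol, that a claim at that *n* can responsibly cover.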


## 6) Utilities

In [None]:
def power_learning_curve_regression(
    X,
    y,
    model_ctor,
    power_metric,
    ns=(50, 100, 150, 200),
    repeats=3,
):
    """Subsampled n → power. power_metric(pred, y) should return a scalar tensor."""
    out, idx_all = [], torch.arange(len(X))
    for n in ns:
        for r in range(repeats):
            idx = idx_all[torch.randperm(len(idx_all))[:n]]
            tr, va = train_test_split(torch.arange(n), test_size=0.25, random_state=r)
            train_loader = DataLoader(
                TensorDataset(X[idx][tr], y[idx][tr]),
                batch_size=64,
                shuffle=True,
            )
            val_loader = DataLoader(
                TensorDataset(X[idx][va], y[idx][va]),
                batch_size=128,
            )
            model = model_ctor()
            Trainer(
                max_epochs=5,
                logger=False,
                enable_checkpointing=False,
                deterministic=True,
            ).fit(model, train_loader, val_loader)
            with torch.no_grad():
                pred = model(val_loader.dataset.tensors[0])
                pwr = float(power_metric(pred, val_loader.dataset.tensors[1]))
            out.append({"n": int(n), "repeat": r, "power": pwr})
    return pd.DataFrame(out)


def threshold_to_d(delta_units, sd_units):
    """Convert a meaningful difference in original units (e.g., 2°C Tm, 0.5 kcal/mol ΔΔG) to Cohen's d."""
    return torch.tensor(delta_units, dtype=torch.float32) / torch.tensor(
        sd_units,
        dtype=torch.float32,
    )


## 7) Reporting checklist

- **Data adequacy**: “Power to detect r≥0.35 at n=62 is ≈0.80 (α=0.05, two-sided).”  
- **Effect sizes**: “φ=0.28 (weak–moderate), Cramér’s V=0.31 (moderate).”  
- **Limits**: “Power < 0.8 for ΔΔG=0.3 kcal/mol; claims restricted to ≥0.5 kcal/mol shifts.”



---
## Appendix — FAQ

**Q: Why compute Cohen’s d between predictions and ground truth?**  
A: It is a compact, unitless measure of the mean shift between predictions and labels, expressed in pooled-SD units. For regression, it complements MSE/R² by flagging systematic over- or under-prediction, and it feeds directly into t-test power/sample-size calculations.

**Q: My dataset is tiny. Should I still do power analysis?**  
A: Especially then. Power tells you what claims are feasible *now* and what additional *n* is needed for target claims.

**Q: Are these operators differentiable?**  
A: Many effect-size computations are differentiable; the power/sample-size solvers are typically used as diagnostics (not for gradient steps). Log them as metrics.

**Q: How do I handle multiple strata?**  
A: Use ANOVA (effect size *f*) for global separation; then do targeted pairwise tests (with correction) if you plan granular claims.



## Reporting templates (drop into papers / PRDs)

- **Data adequacy**: *"For r ≥ 0.35 we require n ≈ 62 for 80% power (α=0.05, two-sided); our study uses n=260."*
- **Binary association**: *"AUROC=0.78; χ²-power=0.86; Cramér’s V=0.31 (moderate). Association is detectable at current n."*
- **Regression**: *"R²=0.42; correlation-power=0.81; Cohen’s d=0.62 (medium). Detectable effect at target threshold."*
- **Limitations**: *"Power < 0.8 for ΔΔG ≤ 0.3 kcal/mol; conclusions restricted to ≥ 0.5 kcal/mol shifts."*
