# 3.0 - Binary classification baselines (Malignant vs nonMalignant)

- `y=1`: `Class_group == "Malignant"`
- `y=0`: `Class_group == "nonMalignant"`
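
As a minimal sketch of this encoding (the toy frame and the variable name `y` are illustrative; how `run_training` encodes labels internally is not shown in this notebook):

```python
import pandas as pd

# Toy stand-in for the real table; only the label column matters here.
toy = pd.DataFrame({"Class_group": ["Malignant", "nonMalignant", "Malignant"]})

# y=1 for Malignant, y=0 otherwise (here: nonMalignant).
y = (toy["Class_group"] == "Malignant").astype(int)
print(y.tolist())
```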

This notebook:
1) Loads the `train`/`test` splits.
2) Builds `gene_cols` (the columns whose names start with `ENSG`).
3) Runs baseline experiments tracked with MLflow.
4) Selects a candidate (minimizing false negatives, FN) and saves it to `models/`.


In [1]:
from __future__ import annotations
from pathlib import Path

from dataclasses import replace
import pandas as pd
import numpy as np


from genomics_dl.models.train_binary import BinaryTrainConfig, run_training


In [2]:
# Paths
DATA_PROCESSED = Path("../data/processed")
TRAIN_PATH = DATA_PROCESSED / "gse183635_tep_tpm_train.parquet"
TEST_PATH  = DATA_PROCESSED / "gse183635_tep_tpm_test.parquet"

df_train = pd.read_parquet(TRAIN_PATH)
df_test  = pd.read_parquet(TEST_PATH)

df_train.shape, df_test.shape

((1880, 5452), (471, 5452))

In [3]:
metadata_cols = [
    "Sample ID",
    "Patient_group",
    "Stage",
    "Sex",
    "Age",
    "Sample-supplying institution",
    "Training series",
    "Evaluation series",
    "Validation series",
    "lib.size",
    "classificationScoreCancer",
    "Class_group",
]

# Genes: columns whose names start with ENSG
gene_cols = [c for c in df_train.columns if str(c).startswith("ENSG")]

# Checks
assert "Class_group" in df_train.columns
assert len(gene_cols) > 0
assert set(gene_cols).isdisjoint(set(metadata_cols))

len(gene_cols), gene_cols[:5]

(5440,
 ['ENSG00000000419',
  'ENSG00000000460',
  'ENSG00000000938',
  'ENSG00000001036',
  'ENSG00000001461'])
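
The checks above only look at `df_train`. A possible extra sanity check, sketched here on toy frames, is to confirm that every gene column also exists in the test split and that no sample appears in both splits. The helper name `check_split_consistency` is hypothetical, and using `"Sample ID"` as a unique key is an assumption.

```python
import pandas as pd


def check_split_consistency(train, test, feature_cols, id_col="Sample ID"):
    """Return (feature columns missing from test, sample IDs present in both splits)."""
    missing = [c for c in feature_cols if c not in test.columns]
    shared = set(train[id_col]) & set(test[id_col])
    return missing, shared


# Toy frames standing in for df_train / df_test.
train = pd.DataFrame({"Sample ID": ["s1", "s2"], "ENSG00000000419": [1.0, 2.0]})
test = pd.DataFrame({"Sample ID": ["s3"], "ENSG00000000419": [3.0]})
genes = [c for c in train.columns if str(c).startswith("ENSG")]

missing, shared = check_split_consistency(train, test, genes)
print(missing, shared)
```

With the real data this would be called as `check_split_consistency(df_train, df_test, gene_cols)`; both returned collections should be empty.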

In [4]:
# Sanity check: label distribution
df_train["Class_group"].value_counts(dropna=False)

Class_group
Malignant       1302
nonMalignant     578
Name: count, dtype: int64
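
These counts show that Malignant is the majority class in the training split. The arithmetic below is a small sketch of what that implies, assuming `pos_weight` multiplies the loss on positive samples (the conventional meaning; its exact semantics inside `run_training` are not shown here). Under that assumption, the weight that would equalize the total weight of both classes is below 1, so values of 1 or more do not rebalance the data but deliberately favor sensitivity.

```python
# Counts taken from the value_counts() output above.
n_pos, n_neg = 1302, 578  # Malignant, nonMalignant

prevalence = n_pos / (n_pos + n_neg)   # share of Malignant samples in train
balancing_pos_weight = n_neg / n_pos   # weight that would equalize total class weight

print(f"prevalence={prevalence:.4f}, balancing pos_weight={balancing_pos_weight:.4f}")
```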

## Baseline experiments

Strategy:
- A small sweep over `pos_weight` (values above 1 penalize errors on Malignant more heavily).
- Comparison without PCA vs. with PCA.
- Threshold selection under a minimum-recall (sensitivity) constraint.
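
One common way to implement the recall-constrained threshold from the last item is to take the highest threshold that still reaches the required recall. The function below is a sketch of that idea on toy scores; `pick_threshold` is a hypothetical name, and the procedure actually used inside `run_training` may differ.

```python
def pick_threshold(y_true, scores, min_recall=0.95):
    """Highest threshold t such that recall of (score >= t) is at least min_recall."""
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("no positive samples")
    # Candidate thresholds are the distinct scores, tried from highest to lowest.
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= t)
        if tp / n_pos >= min_recall:
            return t
    raise ValueError("min_recall must be at most 1.0")


y_true = [1, 1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.1, 0.5, 0.05]

t_75 = pick_threshold(y_true, scores, min_recall=0.75)
t_100 = pick_threshold(y_true, scores, min_recall=1.0)
print(t_75, t_100)
```

A stricter recall floor can only move the threshold down, which is how a high sensitivity requirement ends up costing specificity.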

To avoid filling `models/` during the sweep, runs use `save_local_bundle=False`.
The final candidate is then retrained with `save_local_bundle=True`.
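
The sweep-then-retrain pattern relies on `dataclasses.replace`, which returns a modified copy and leaves the base config untouched. The sketch below shows it with `ToyConfig`, a hypothetical stand-in that copies only three field names from `BinaryTrainConfig`.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ToyConfig:  # hypothetical stand-in for BinaryTrainConfig
    use_pca: bool = False
    pos_weight: float = 1.0
    save_local_bundle: bool = False


base = ToyConfig()

# One config per (use_pca, pos_weight) combination; none of them saves a bundle.
sweep_cfgs = [
    replace(base, use_pca=p, pos_weight=w)
    for p in (False, True)
    for w in (1.0, 2.0, 5.0)
]

# Re-create the chosen configuration with saving switched on.
final = replace(sweep_cfgs[3], save_local_bundle=True)
print(final)
```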


In [5]:
base_cfg = BinaryTrainConfig(
    train_path=TRAIN_PATH,
    test_path=TEST_PATH,
    label_col="Class_group",
    positive_label="Malignant",
    experiment_name="gse183635_binary",
    model_name="logreg_binary",
    model_version="v0.1.0",

    # FN focus
    min_recall_for_threshold=0.95,

    # Do not save a bundle during the sweep
    save_local_bundle=False,
)

sweep = []
for use_pca in [False, True]:
    for pos_weight in [1.0, 2.0, 5.0]:
        sweep.append((use_pca, pos_weight))

results = []
for use_pca, pos_weight in sweep:
    cfg = replace(
        base_cfg,
        use_pca=use_pca,
        pos_weight=pos_weight,
        # distinct name per run so the MLflow UI is easier to read
        model_name=f"logreg_binary_pca{int(use_pca)}_pw{int(pos_weight)}",
    )
    out = run_training(cfg, feature_cols=gene_cols)
    results.append({
        "model_name": cfg.model_name,
        "use_pca": use_pca,
        "pos_weight": pos_weight,
        "threshold": out["chosen_threshold"],
        "test_fn": out["test_metrics"]["fn"],
        "test_fnr": out["test_metrics"]["fnr"],
        "test_recall": out["test_metrics"]["recall_sensitivity"],
        "test_specificity": out["test_metrics"]["specificity"],
        "test_pr_auc": out["test_metrics"]["pr_auc"],
        "test_roc_auc": out["test_metrics"]["roc_auc"],
        "mlflow_run_id": out["mlflow_run_id"],
    })

res_df = pd.DataFrame(results).sort_values(["test_fn", "test_fnr", "test_recall"], ascending=[True, True, False])
res_df


| | model_name | use_pca | pos_weight | threshold | test_fn | test_fnr | test_recall | test_specificity | test_pr_auc | test_roc_auc | mlflow_run_id |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | logreg_binary_pca1_pw1 | True | 1.0 | 0.016744 | 20.0 | 0.06135 | 0.93865 | 0.331034 | 0.911958 | 0.839433 | 183a1c2d0274470c893eec8808911c24 |
| 4 | logreg_binary_pca1_pw2 | True | 2.0 | 0.0157 | 20.0 | 0.06135 | 0.93865 | 0.337931 | 0.911319 | 0.837127 | af74ce7869af4964ab51f545b3a0e664 |
| 2 | logreg_binary_pca0_pw5 | False | 5.0 | 0.023254 | 22.0 | 0.067485 | 0.932515 | 0.434483 | 0.92043 | 0.839962 | e1eb3fb50d7145f9b930c900451b01b4 |
| 5 | logreg_binary_pca1_pw5 | True | 5.0 | 0.012504 | 22.0 | 0.067485 | 0.932515 | 0.317241 | 0.912095 | 0.835816 | eefc5d76faa241c9ab09654355e371fe |
| 0 | logreg_binary_pca0_pw1 | False | 1.0 | 0.021177 | 24.0 | 0.07362 | 0.92638 | 0.441379 | 0.92259 | 0.844278 | 483cfcb6a21f446a9973abb1f6b7ba03 |
| 1 | logreg_binary_pca0_pw2 | False | 2.0 | 0.022066 | 24.0 | 0.07362 | 0.92638 | 0.441379 | 0.921577 | 0.84212 | c93d9a85426f4326a1ac3b8e502843b2 |


## Final training (save to `models/`)

Default criterion here: fewest FN on the test set, with ties broken by lower FNR. Adjust it if you want to enforce a minimum specificity.
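
If a specificity floor is wanted, one option is to filter before ranking. The sketch below rebuilds a reduced copy of the results table from the rounded values shown above; `MIN_SPECIFICITY = 0.40` is an illustrative value, not a recommendation.

```python
import pandas as pd

# Reduced copy of the sweep results (values rounded as displayed above).
res = pd.DataFrame({
    "model_name": [
        "logreg_binary_pca1_pw1", "logreg_binary_pca1_pw2", "logreg_binary_pca0_pw5",
        "logreg_binary_pca1_pw5", "logreg_binary_pca0_pw1", "logreg_binary_pca0_pw2",
    ],
    "test_fn": [20, 20, 22, 22, 24, 24],
    "test_fnr": [0.06135, 0.06135, 0.067485, 0.067485, 0.07362, 0.07362],
    "test_specificity": [0.331034, 0.337931, 0.434483, 0.317241, 0.441379, 0.441379],
})

MIN_SPECIFICITY = 0.40  # illustrative floor (assumption)

# Keep only runs that meet the floor, then rank by FN, FNR and specificity.
eligible = res[res["test_specificity"] >= MIN_SPECIFICITY]
ranked = eligible.sort_values(
    ["test_fn", "test_fnr", "test_specificity"],
    ascending=[True, True, False],
)
best_name = ranked.iloc[0]["model_name"]
print(best_name)
```

With this floor the three PCA runs drop out, and the top-ranked candidate becomes a non-PCA run with two more false negatives than the unconstrained choice.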


In [6]:
best = res_df.iloc[0].to_dict()
best

{'model_name': 'logreg_binary_pca1_pw1',
 'use_pca': True,
 'pos_weight': 1.0,
 'threshold': 0.01674395024967066,
 'test_fn': 20.0,
 'test_fnr': 0.06134969325153374,
 'test_recall': 0.9386503067484663,
 'test_specificity': 0.3310344827586207,
 'test_pr_auc': 0.9119582696665948,
 'test_roc_auc': 0.8394330442140893,
 'mlflow_run_id': '183a1c2d0274470c893eec8808911c24'}
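
To make these rates concrete, the test confusion matrix can be recovered from them using the definitions FNR = FN / (FN + TP) and specificity = TN / (TN + FP). The helper below is a sketch with a hypothetical name; the test-set size of 471 comes from `df_test.shape` above.

```python
def confusion_from_rates(n_total, fn, fnr, specificity):
    """Rebuild confusion-matrix counts from FN, FNR, specificity and sample count."""
    n_pos = round(fn / fnr)          # FNR = FN / (FN + TP)
    n_neg = n_total - n_pos
    tn = round(specificity * n_neg)  # specificity = TN / (TN + FP)
    return {"tn": tn, "fp": n_neg - tn, "fn": fn, "tp": n_pos - fn}


cm = confusion_from_rates(
    n_total=471,
    fn=20,
    fnr=0.06134969325153374,
    specificity=0.3310344827586207,
)
print(cm)
```

This gives 306 true positives and 20 false negatives among 326 Malignant test samples, and 48 true negatives against 97 false positives among 145 nonMalignant ones, so about two thirds of the nonMalignant samples are flagged at this threshold.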

In [7]:
final_cfg = replace(
    base_cfg,
    use_pca=bool(best["use_pca"]),
    pos_weight=float(best["pos_weight"]),
    model_name="logreg_binary",
    model_version="v0.1.0",
    save_local_bundle=True,
)

final_out = run_training(final_cfg, feature_cols=gene_cols)
final_out



{'mlflow_run_id': '7ec8e7d239d0411c89a3273c68ff6e99',
 'chosen_threshold': 0.01674395024967066,
 'cv_metrics': {'threshold': 0.01674395024967066,
  'tn': 172.0,
  'fp': 406.0,
  'fn': 65.0,
  'tp': 1237.0,
  'recall_sensitivity': 0.9500768049155146,
  'fnr': 0.04992319508448541,
  'specificity': 0.2975778546712803,
  'precision': 0.7528910529519173,
  'npv': 0.7257383966244726,
  'f1': 0.8400679117147708,
  'balanced_accuracy': 0.6238273297933974,
  'roc_auc': 0.830221272569749,
  'pr_auc': 0.912394002873577,
  'fpr': 0.7024221453287197},
 'test_metrics': {'threshold': 0.01674395024967066,
  'tn': 48.0,
  'fp': 97.0,
  'fn': 20.0,
  'tp': 306.0,
  'recall_sensitivity': 0.9386503067484663,
  'fnr': 0.06134969325153374,
  'specificity': 0.3310344827586207,
  'precision': 0.7593052109181141,
  'npv': 0.7058823529411765,
  'f1': 0.8395061728395061,
  'balanced_accuracy': 0.6348423947535435,
  'roc_auc': 0.8394330442140893,
  'pr_auc': 0.9119582696665948,
  'fpr': 0.6689655172413793},
 'sav