# 3.0 - Binary classification baselines (Malignant vs nonMalignant)

- `y=1`: `Class_group == "Malignant"`
- `y=0`: `Class_group == "nonMalignant"`

This notebook:
1) Loads the `train/test` splits.
2) Builds `gene_cols` (columns whose names start with `ENSG`).
3) Runs baseline experiments tracked with MLflow.
4) Selects a candidate (minimizing FN) and saves it to `models/`.


In [1]:
from __future__ import annotations
from pathlib import Path

from dataclasses import replace
import pandas as pd
import numpy as np


from genomics_dl.models.train_binary import BinaryTrainConfig, run_training


### Setup and dependencies
We import the base libraries and the `BinaryTrainConfig`/`run_training` pipeline, which encapsulates all preprocessing, threshold selection, and artifact saving.


In [2]:
# Paths
DATA_PROCESSED = Path("../data/processed")
TRAIN_PATH = DATA_PROCESSED / "gse183635_tep_tpm_train.parquet"
TEST_PATH  = DATA_PROCESSED / "gse183635_tep_tpm_test.parquet"

df_train = pd.read_parquet(TRAIN_PATH)
df_test  = pd.read_parquet(TEST_PATH)

df_train.shape, df_test.shape

((1880, 5452), (471, 5452))

### Paths and data loading
We define the processed train/test paths, load both dataframes into memory, and do a quick check of each split's shape: 1,880 training rows and 471 test rows, with 5,452 columns each.


In [3]:
metadata_cols = [
    "Sample ID",
    "Patient_group",
    "Stage",
    "Sex",
    "Age",
    "Sample-supplying institution",
    "Training series",
    "Evaluation series",
    "Validation series",
    "lib.size",
    "classificationScoreCancer",
    "Class_group",
]

# Genes: columns whose names start with ENSG
gene_cols = [c for c in df_train.columns if str(c).startswith("ENSG")]

# Checks
assert "Class_group" in df_train.columns
assert len(gene_cols) > 0
assert set(gene_cols).isdisjoint(set(metadata_cols))

len(gene_cols), gene_cols[:5]

(5440,
 ['ENSG00000000419',
  'ENSG00000000460',
  'ENSG00000000938',
  'ENSG00000001036',
  'ENSG00000001461'])

### Selecting expression features
We keep the gene columns (prefix `ENSG`) and make sure they do not overlap with the metadata columns before running the experiments. The 5,440 gene columns plus the 12 listed metadata columns are consistent with the 5,452 columns reported for each split.
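
The checks in the cell above only look at the training columns. As a minimal sketch of a slightly stricter check, the hypothetical helper below (`check_gene_columns` is not part of `genomics_dl`) also confirms that every gene column found in train exists in test. It works on plain lists of column names, so it can be fed `df_train.columns` and `df_test.columns`.

```python
def check_gene_columns(train_columns, test_columns, metadata_cols, prefix="ENSG"):
    """Return the gene columns found in train after basic consistency checks.

    Hypothetical helper for illustration; the prefix convention follows the notebook.
    """
    gene_cols = [c for c in train_columns if str(c).startswith(prefix)]
    if not gene_cols:
        raise ValueError("no gene columns found")

    # Gene columns must not collide with metadata columns.
    overlap = set(gene_cols) & set(metadata_cols)
    if overlap:
        raise ValueError(f"gene/metadata overlap: {sorted(overlap)}")

    # Every gene column used for training must also be present in test.
    test_set = set(test_columns)
    missing = [c for c in gene_cols if c not in test_set]
    if missing:
        raise ValueError(f"{len(missing)} gene columns missing from test")
    return gene_cols


# Toy column lists (a small subset of the names shown in the notebook).
toy_train = ["Sample ID", "Class_group", "ENSG00000000419", "ENSG00000000460"]
toy_test = ["Sample ID", "Class_group", "ENSG00000000419", "ENSG00000000460"]
toy_genes = check_gene_columns(toy_train, toy_test, ["Sample ID", "Class_group"])
```

Raising an exception instead of using `assert` keeps the check active even when Python runs with assertions disabled.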


In [4]:
# Sanity check: labels
df_train["Class_group"].value_counts(dropna=False)

Class_group
Malignant       1302
nonMalignant     578
Name: count, dtype: int64

> We check the distribution of `Class_group` to see how imbalanced the problem is before tuning the class weights (`pos_weight`). The positive class is the majority here: 1,302 Malignant vs 578 nonMalignant samples (about 69% vs 31%).
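
To put numbers on the imbalance, the sketch below derives a few reference values from the counts printed above. How `pos_weight` is applied inside `run_training` is not shown in this notebook, so these figures are only a point of comparison for the swept values. The "balanced" heuristic used here is the common `n_samples / (n_classes * class_count)` formula.

```python
# Class counts taken from the value_counts() output above.
counts = {"Malignant": 1302, "nonMalignant": 578}

n_samples = sum(counts.values())
n_classes = len(counts)

# Share of positive (Malignant) samples in the training split.
prevalence = counts["Malignant"] / n_samples

# "Balanced" class weights: rarer classes receive larger weights.
balanced_weights = {label: n_samples / (n_classes * n) for label, n in counts.items()}

# Weight on the positive class that would equalize the total weight of both
# classes, assuming the negative class keeps weight 1.0 (an assumption).
equalizing_pos_weight = counts["nonMalignant"] / counts["Malignant"]

print(f"prevalence={prevalence:.3f}")
print({k: round(v, 3) for k, v in balanced_weights.items()})
print(f"equalizing_pos_weight={equalizing_pos_weight:.3f}")
```

Because Malignant is the majority class, a pure balancing argument would give it a weight below 1. The swept values of 1.0 and above instead push the model further toward the positive class, which matches the goal of reducing false negatives.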


## Baseline experiments

In this section we set up a broad sweep of configurations: different preprocessing combinations (PCA / log / variance filter), class weights, and per-classifier variants (`logreg`, `sgd_logloss`, `linear_svc_calibrated`, `rf`, `extratrees`). Every run uses `cv_splits=8` and a threshold chosen to maximize specificity while keeping recall at or above `min_recall_for_threshold=0.95`.

To avoid filling `models/` during the search:
- `save_local_bundle=False`
- `mlflow_log_model=False`

When the sweep finishes, `res_df` holds the metrics sorted by fewest FN on the test set.
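
The threshold rule described above (maximize specificity subject to a minimum recall) can be sketched in plain Python. This is an illustration, not the implementation inside `run_training`: the function name, the `score >= threshold` convention, and the toy scores are all assumptions.

```python
def pick_threshold(y_true, scores, min_recall=0.95):
    """Return (specificity, threshold, recall) for the best admissible threshold.

    A sample is predicted positive when score >= threshold. Among thresholds
    whose recall is at least `min_recall`, the one with the highest
    specificity wins. Returns None if no threshold qualifies.
    """
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")

    best = None
    for thr in sorted(set(scores)):
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= thr)
        tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < thr)
        recall = tp / n_pos
        specificity = tn / n_neg
        if recall >= min_recall and (best is None or specificity > best[0]):
            best = (specificity, thr, recall)
    return best


# Toy example: 4 positives and 4 negatives with made-up scores.
toy_y = [1, 1, 1, 1, 0, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.6, 0.2, 0.7, 0.4, 0.3, 0.1]
relaxed = pick_threshold(toy_y, toy_scores, min_recall=0.75)
strict = pick_threshold(toy_y, toy_scores, min_recall=1.0)
```

In the toy data, demanding perfect recall forces the threshold down to the lowest positive score and specificity falls from 0.75 to 0.25. The same trade-off drives the sweep: the harder the recall constraint, the less room there is for specificity.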


In [5]:
def slugify_token(value):
    return str(value).replace(".", "p").replace("-", "m")


def build_model_name(clf_name, feat_cfg, pos_weight, variant_tag):
    return "_".join(
        [
            clf_name,
            f"pca{int(feat_cfg['use_pca'])}",
            f"log{int(feat_cfg['selector_on_log'])}",
            f"vq{int(feat_cfg['var_quantile'] * 100)}",
            f"pw{slugify_token(pos_weight)}",
            variant_tag,
        ]
    )


base_cfg = BinaryTrainConfig(
    train_path=TRAIN_PATH,
    test_path=TEST_PATH,
    label_col="Class_group",
    positive_label="Malignant",
    experiment_name="gse183635_binary",
    model_name="logreg_binary",
    model_version="v0.1.0",

    # FN focus: optimize specificity while keeping the minimum recall
    min_recall_for_threshold=0.95,
    threshold_objective="specificity",
    cv_splits=8,

    save_local_bundle=False,
    save_plots=False,
    mlflow_log_artifacts=False,
    mlflow_log_model=False,
)

# Preprocessing / gene-selection configurations
feature_configs = [
    {"use_pca": False, "selector_on_log": False, "var_quantile": 0.10, "pca_var_threshold": 0.90},
    {"use_pca": False, "selector_on_log": False, "var_quantile": 0.20, "pca_var_threshold": 0.90},
    {"use_pca": False, "selector_on_log": True,  "var_quantile": 0.20, "pca_var_threshold": 0.90},
    {"use_pca": False, "selector_on_log": True,  "var_quantile": 0.30, "pca_var_threshold": 0.90},
    {"use_pca": True,  "selector_on_log": False, "var_quantile": 0.15, "pca_var_threshold": 0.90},
    {"use_pca": True,  "selector_on_log": True,  "var_quantile": 0.15, "pca_var_threshold": 0.95},
]

pos_weights = [1.0, 1.5, 2.0, 3.0, 5.0, 8.0]

classifier_grid = {
    "logreg": [
        {"clf_params": {"C": 0.5}, "tag": "C0p5"},
        {"clf_params": {"C": 1.0}, "tag": "C1"},
        {"clf_params": {"C": 2.0}, "tag": "C2"},
    ],
    "sgd_logloss": [
        {"clf_params": {"alpha": 1e-5}, "tag": "a1e-5"},
        {"clf_params": {"alpha": 1e-4}, "tag": "a1e-4"},
        {"clf_params": {"alpha": 5e-4}, "tag": "a5e-4"},
    ],
    "linear_svc_calibrated": [
        {"clf_params": {"C": 0.5}, "tag": "C0p5"},
        {"clf_params": {"C": 1.0}, "tag": "C1"},
        {"clf_params": {"C": 2.0}, "tag": "C2"},
    ],
    "rf": [
        {"clf_params": {"n_estimators": 800, "max_depth": None}, "tag": "n800"},
        {"clf_params": {"n_estimators": 1200, "max_depth": None}, "tag": "n1200"},
    ],
    "extratrees": [
        {"clf_params": {"n_estimators": 800, "max_depth": None}, "tag": "n800"},
        {"clf_params": {"n_estimators": 1200, "max_depth": None}, "tag": "n1200"},
    ],
}

sweep = []
for clf_name, clf_variants in classifier_grid.items():
    for feat_cfg in feature_configs:
        for pos_weight in pos_weights:
            for variant in clf_variants:
                combo = {
                    "clf_name": clf_name,
                    "variant_tag": variant.get("tag", "base"),
                    "pos_weight": pos_weight,
                    "use_pca": feat_cfg["use_pca"],
                    "selector_on_log": feat_cfg["selector_on_log"],
                    "var_quantile": feat_cfg["var_quantile"],
                    "pca_var_threshold": feat_cfg["pca_var_threshold"],
                    "clf_params": dict(variant.get("clf_params", {})),
                }
                combo["model_name"] = build_model_name(
                    clf_name=clf_name,
                    feat_cfg=feat_cfg,
                    pos_weight=pos_weight,
                    variant_tag=combo["variant_tag"],
                )
                sweep.append(combo)

len(sweep)

results = []
for combo in sweep:
    cfg = replace(
        base_cfg,
        clf_name=combo["clf_name"],
        pos_weight=combo["pos_weight"],
        use_pca=combo["use_pca"],
        selector_on_log=combo["selector_on_log"],
        var_quantile=combo["var_quantile"],
        pca_var_threshold=combo["pca_var_threshold"],
        clf_params=combo["clf_params"],
        model_name=combo["model_name"],
    )
    out = run_training(cfg, feature_cols=gene_cols)
    results.append({
        "model_name": combo["model_name"],
        "clf_name": combo["clf_name"],
        "variant_tag": combo["variant_tag"],
        "use_pca": combo["use_pca"],
        "selector_on_log": combo["selector_on_log"],
        "var_quantile": combo["var_quantile"],
        "pca_var_threshold": combo["pca_var_threshold"],
        "pos_weight": combo["pos_weight"],
        "threshold": out["chosen_threshold"],
        "test_fn": out["test_metrics"]["fn"],
        "test_fnr": out["test_metrics"]["fnr"],
        "test_recall": out["test_metrics"]["recall_sensitivity"],
        "test_specificity": out["test_metrics"]["specificity"],
        "test_pr_auc": out["test_metrics"]["pr_auc"],
        "test_roc_auc": out["test_metrics"]["roc_auc"],
        "mlflow_run_id": out["mlflow_run_id"],
    })

res_df = (
    pd.DataFrame(results)
      .sort_values(["test_fn", "test_fnr", "test_recall"], ascending=[True, True, False])
      .reset_index(drop=True)
)


2025/12/30 12:05:19 INFO mlflow.store.db.utils: Creating initial MLflow database tables...
2025/12/30 12:05:19 INFO mlflow.store.db.utils: Updating database tables
2025/12/30 12:05:19 INFO alembic.runtime.migration: Context impl SQLiteImpl.
2025/12/30 12:05:19 INFO alembic.runtime.migration: Will assume non-transactional DDL.
2025/12/30 12:05:20 INFO alembic.runtime.migration: Context impl SQLiteImpl.
2025/12/30 12:05:20 INFO alembic.runtime.migration: Will assume non-transactional DDL.


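### Ranking of evaluated configurations
We display the sorted dataframe to quickly inspect which combinations minimize false negatives while keeping good recall and specificity. In the output below, the top rows have `threshold=0`, recall 1.0 and specificity 0.0: they reach zero FN only by labelling every test sample as Malignant.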
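
Sorting by FN alone rewards configurations that predict the positive class for every sample. One possible guard, sketched below, drops rows under a specificity floor before ranking. This is an alternative for illustration, not what the notebook does. `min_specificity` is a hypothetical parameter, and the three rows are copied from the results shown for this sweep.

```python
def rank_candidates(rows, min_specificity=0.0):
    """Filter out rows below a specificity floor, then sort by FN and specificity."""
    kept = [r for r in rows if r["test_specificity"] >= min_specificity]
    # Fewest false negatives first; ties broken by higher specificity.
    return sorted(kept, key=lambda r: (r["test_fn"], -r["test_specificity"]))


rows = [
    {"model_name": "sgd_logloss_pca0_log0_vq10_pw1p0_a1e-5",
     "test_fn": 0.0, "test_recall": 1.0, "test_specificity": 0.0},
    {"model_name": "rf_pca1_log1_vq15_pw8p0_n1200",
     "test_fn": 28.0, "test_recall": 0.914110, "test_specificity": 0.048276},
    {"model_name": "sgd_logloss_pca1_log0_vq15_pw1p5_a1e-4",
     "test_fn": 29.0, "test_recall": 0.911043, "test_specificity": 0.468966},
]

unfiltered = [r["model_name"] for r in rank_candidates(rows)]
filtered = [r["model_name"] for r in rank_candidates(rows, min_specificity=0.1)]
```

Without a floor the all-positive configuration ranks first. With a floor of 0.1, only the row with specificity 0.469 survives among these three.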


In [6]:
display(res_df)

| | model_name | clf_name | variant_tag | use_pca | selector_on_log | var_quantile | pca_var_threshold | pos_weight | threshold | test_fn | test_fnr | test_recall | test_specificity | test_pr_auc | test_roc_auc | mlflow_run_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | sgd_logloss_pca0_log0_vq10_pw1p0_a1e-5 | sgd_logloss | a1e-5 | False | False | 0.10 | 0.90 | 1.0 | 0.000000e+00 | 0.0 | 0.000000 | 1.000000 | 0.000000 | 0.797608 | 0.709467 | 0fa55df5cc864576931492b2c5cfc3b6 |
| 1 | sgd_logloss_pca0_log0_vq10_pw1p5_a1e-5 | sgd_logloss | a1e-5 | False | False | 0.10 | 0.90 | 1.5 | 0.000000e+00 | 0.0 | 0.000000 | 1.000000 | 0.000000 | 0.792369 | 0.701449 | 134aba329df84085bc6a861f34dd4e6a |
| 2 | sgd_logloss_pca0_log0_vq10_pw2p0_a1e-5 | sgd_logloss | a1e-5 | False | False | 0.10 | 0.90 | 2.0 | 0.000000e+00 | 0.0 | 0.000000 | 1.000000 | 0.000000 | 0.798738 | 0.712704 | 52cb17ab4865452ba3c2df182469acda |
| 3 | sgd_logloss_pca0_log0_vq20_pw1p0_a1e-5 | sgd_logloss | a1e-5 | False | False | 0.20 | 0.90 | 1.0 | 0.000000e+00 | 0.0 | 0.000000 | 1.000000 | 0.000000 | 0.816343 | 0.741739 | 038636ad71c243fa840f0135c1ee4bc6 |
| 4 | sgd_logloss_pca0_log0_vq20_pw1p5_a1e-5 | sgd_logloss | a1e-5 | False | False | 0.20 | 0.90 | 1.5 | 0.000000e+00 | 0.0 | 0.000000 | 1.000000 | 0.000000 | 0.793654 | 0.705120 | 0449b2dd66774d5e80377b5264812b95 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 463 | rf_pca1_log1_vq15_pw5p0_n800 | rf | n800 | True | True | 0.15 | 0.95 | 5.0 | 4.687500e-01 | 28.0 | 0.085890 | 0.914110 | 0.041379 | 0.835900 | 0.679078 | 5ebb2a88ab7d443aaa4be503e619bc01 |
| 464 | rf_pca1_log1_vq15_pw8p0_n800 | rf | n800 | True | True | 0.15 | 0.95 | 8.0 | 4.362500e-01 | 28.0 | 0.085890 | 0.914110 | 0.041379 | 0.829510 | 0.668024 | bab7a5b3817b4097ba13f64ba8f2b228 |
| 465 | rf_pca1_log1_vq15_pw8p0_n1200 | rf | n1200 | True | True | 0.15 | 0.95 | 8.0 | 4.308333e-01 | 28.0 | 0.085890 | 0.914110 | 0.048276 | 0.828615 | 0.667464 | c3b68edbd7914be587dc77a0b3074b6c |
| 466 | sgd_logloss_pca1_log0_vq15_pw1p5_a1e-4 | sgd_logloss | a1e-4 | True | False | 0.15 | 0.90 | 1.5 | 9.576480e-116 | 29.0 | 0.088957 | 0.911043 | 0.468966 | 0.817099 | 0.738079 | 80fff352bdb64caab54f99f873576c6f |


## Final training (save to `models/`)

1. We take the top row of `res_df` as the candidate configuration.
2. We retrain with `save_local_bundle=True` and `mlflow_log_model=True` to get reproducible artifacts.
3. We verify that the bundle contains both `model.pkl` and the MLflow package (`MLmodel`, `conda.yaml`, etc.).
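
When a sweep row is turned back into a config, `dataclasses.replace` only overrides the fields that are passed explicitly; everything else keeps the base value. The sketch below shows that behavior with a hypothetical `DemoConfig` standing in for `BinaryTrainConfig`. Its fields and defaults are assumptions, as is the helper `config_from_row`.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class DemoConfig:
    """Hypothetical stand-in for BinaryTrainConfig (defaults are made up)."""
    clf_name: str = "logreg"
    use_pca: bool = False
    var_quantile: float = 0.2
    pos_weight: float = 1.0
    save_local_bundle: bool = False


def config_from_row(base, row, fields):
    """Copy the listed fields from a result row into a new config."""
    overrides = {name: row[name] for name in fields}
    return replace(base, **overrides)


base = DemoConfig()
row = {"clf_name": "sgd_logloss", "use_pca": False,
       "var_quantile": 0.1, "pos_weight": 1.0, "test_fn": 0.0}

# Copying only two fields leaves clf_name and var_quantile at the base values.
partial = config_from_row(base, row, ["use_pca", "pos_weight"])
# Copying every swept field reproduces the row's configuration.
full = config_from_row(base, row, ["clf_name", "use_pca", "var_quantile", "pos_weight"])
```

Listing the copied fields explicitly makes it easy to see at a glance which parts of the winning row actually reach the final run.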


In [7]:
# Pick the top-ranked row of the sweep
best = res_df.iloc[0].to_dict()
display(best)


{'model_name': 'sgd_logloss_pca0_log0_vq10_pw1p0_a1e-5',
 'clf_name': 'sgd_logloss',
 'variant_tag': 'a1e-5',
 'use_pca': False,
 'selector_on_log': False,
 'var_quantile': 0.1,
 'pca_var_threshold': 0.9,
 'pos_weight': 1.0,
 'threshold': 0.0,
 'test_fn': 0.0,
 'test_fnr': 0.0,
 'test_recall': 1.0,
 'test_specificity': 0.0,
 'test_pr_auc': 0.7976083084421235,
 'test_roc_auc': 0.7094668923207108,
 'mlflow_run_id': '0fa55df5cc864576931492b2c5cfc3b6'}

In [15]:
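# Note: only `use_pca` and `pos_weight` are taken from `best`. Every other
# field (clf_name, selector_on_log, var_quantile, pca_var_threshold,
# clf_params, ...) keeps the value inherited from `base_cfg`.
final_cfg = replace(
    base_cfg,
    use_pca=bool(best["use_pca"]),
    pos_weight=float(best["pos_weight"]),
    model_name="logreg_binary",
    model_version="v0.3.0",
    save_local_bundle=True,
    save_plots=True,
    mlflow_log_artifacts=False,
    mlflow_log_model=True,
)
final_out = run_training(final_cfg, feature_cols=gene_cols)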



### Artifact check
Before moving on, we validate that the local save produced the expected files under `models/logreg_binary/...`.
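
The same kind of check can be written as a small reusable function that reports every missing file at once instead of stopping at the first failed assertion. This is a sketch: `missing_files` is a hypothetical helper, and the demo runs against a temporary directory rather than the real bundle.

```python
import tempfile
from pathlib import Path


def missing_files(bundle_dir, required=("MLmodel", "model.pkl")):
    """Return the required file names that are absent from bundle_dir."""
    bundle_dir = Path(bundle_dir)
    return [name for name in required if not (bundle_dir / name).is_file()]


# Demo on a throwaway directory that contains only one of the two files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "MLmodel").write_text("placeholder")
    demo_missing = missing_files(tmp)

print(demo_missing)
```

An empty list means the bundle is complete, so the result can be used directly in an `if` statement or an error message.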


In [16]:
# Verify that the final bundle contains MLmodel and model.pkl
if final_out["saved_model_dir"] is None:
    raise RuntimeError("Final training did not save a local bundle.")

final_bundle_dir = Path(final_out["saved_model_dir"])
bundle_files = sorted(p.name for p in final_bundle_dir.iterdir())

assert (final_bundle_dir / "MLmodel").exists(), "MLmodel is missing from the final bundle"
assert (final_bundle_dir / "model.pkl").exists(), "model.pkl is missing from the final bundle"

{
    "bundle_dir": str(final_bundle_dir),
    "files": bundle_files,
}


{'bundle_dir': '/workspaces/TFM/models/logreg_binary/v0.3.0',
 'files': ['MLmodel',
  'README.md',
  'conda.yaml',
  'input_example.json',
  'metrics.json',
  'model.pkl',
  'params.yaml',
  'python_env.yaml',
  'requirements.txt',
  'serving_input_example.json',
  'signature.json']}

### Metrics table (CV vs Test)
We build a compact dataframe to compare CV vs Test at the chosen threshold, ordering the key metrics (FN, recall, specificity, etc.).
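
The rate metrics in this table are all functions of the four confusion counts. As a worked check, the sketch below recomputes them from the final model's test counts reported in this notebook (TN=61, FP=84, FN=24, TP=302) using the standard definitions. The function name is hypothetical; the dictionary keys mirror the metric names used by the pipeline.

```python
def confusion_metrics(tn, fp, fn, tp):
    """Derive rate metrics from binary confusion-matrix counts."""
    recall = tp / (tp + fn)        # sensitivity: share of positives detected
    specificity = tn / (tn + fp)   # share of negatives correctly rejected
    return {
        "recall_sensitivity": recall,
        "fnr": fn / (tp + fn),     # equals 1 - recall
        "specificity": specificity,
        "fpr": fp / (tn + fp),     # equals 1 - specificity
        "precision": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "balanced_accuracy": (recall + specificity) / 2,
    }


test_check = confusion_metrics(tn=61, fp=84, fn=24, tp=302)
for name, value in test_check.items():
    print(f"{name}: {value:.6f}")
```

The recomputed recall (0.926380), specificity (0.420690), precision (0.782383) and NPV (0.717647) match the `test` column of the table.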


In [17]:
# Metrics DataFrame (CV vs Test)
cv = final_out["cv_metrics"].copy()
test = final_out["test_metrics"].copy()

chosen_thr = final_out["chosen_threshold"]
cv["threshold"] = chosen_thr
test["threshold"] = chosen_thr

metrics_df = (
    pd.DataFrame({"cv": cv, "test": test})
      .reset_index()
      .rename(columns={"index": "metric"})
)

priority_order = [
    "threshold",
    "fn", "fnr",
    "recall_sensitivity",
    "specificity",
    "fp", "tn", "tp",
    "precision", "npv",
    "f1",
    "balanced_accuracy",
    "pr_auc",
    "roc_auc",
    "fpr",
]

metrics_df["metric"] = pd.Categorical(metrics_df["metric"], categories=priority_order, ordered=True)
metrics_df = metrics_df.sort_values("metric").reset_index(drop=True)

In [18]:
display(metrics_df)

| | metric | cv | test |
|---|---|---|---|
| 0 | threshold | 0.01868 | 0.01868 |
| 1 | fn | 65.0 | 24.0 |
| 2 | fnr | 0.049923 | 0.07362 |
| 3 | recall_sensitivity | 0.950077 | 0.92638 |
| 4 | specificity | 0.3391 | 0.42069 |
| 5 | fp | 382.0 | 84.0 |
| 6 | tn | 196.0 | 61.0 |
| 7 | tp | 1237.0 | 302.0 |
| 8 | precision | 0.764052 | 0.782383 |
| 9 | npv | 0.750958 | 0.717647 |


## Heavier run
We retrain the final configuration with `cv_splits=10` and a higher `max_iter` (at least 6000) to check whether a more expensive training run brings further improvements (without saving intermediate artifacts).


In [12]:
# Heavier run: increase the number of CV folds and max_iter
heavy_cfg = replace(
    final_cfg,
    model_name=f"{final_cfg.model_name}_heavy",
    model_version=f"{final_cfg.model_version}-heavy",
    cv_splits=10,
    max_iter=max(final_cfg.max_iter, 6000),
    save_local_bundle=False,
    save_plots=False,
    mlflow_log_model=False,
    mlflow_log_artifacts=False,
)

heavy_out = run_training(heavy_cfg, feature_cols=gene_cols)
heavy_out


{'mlflow_run_id': 'ab986f659fd9434fb990831415763134',
 'chosen_threshold': 0.015675576068106156,
 'cv_metrics': {'threshold': 0.015675576068106156,
  'tn': 188.0,
  'fp': 390.0,
  'fn': 64.0,
  'tp': 1238.0,
  'recall_sensitivity': 0.9508448540706606,
  'fnr': 0.04915514592933948,
  'specificity': 0.32525951557093424,
  'precision': 0.7604422604422605,
  'npv': 0.746031746031746,
  'f1': 0.8450511945392492,
  'balanced_accuracy': 0.6380521848207974,
  'roc_auc': 0.8424688661043165,
  'pr_auc': 0.9180038800545265,
  'fpr': 0.6747404844290658},
 'test_metrics': {'threshold': 0.015675576068106156,
  'tn': 58.0,
  'fp': 87.0,
  'fn': 21.0,
  'tp': 305.0,
  'recall_sensitivity': 0.9355828220858896,
  'fnr': 0.06441717791411043,
  'specificity': 0.4,
  'precision': 0.7780612244897959,
  'npv': 0.7341772151898734,
  'f1': 0.8495821727019499,
  'balanced_accuracy': 0.6677914110429448,
  'roc_auc': 0.8442775544742964,
  'pr_auc': 0.9225903538230424,
  'fpr': 0.6},
 'saved_model_dir': None}

### Metrics for the heavier run
We repeat the CV/Test table for the heavier run so it can be compared easily against the standard final model.
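
A quick way to make that comparison explicit is to subtract the two sets of test metrics. The values below are copied from the outputs in this notebook (standard final run vs heavier run); the helper `metric_deltas` is a hypothetical name used only for this sketch.

```python
# Test metrics of the standard final run (from its metrics table).
standard_test = {"fn": 24.0, "recall_sensitivity": 0.92638,
                 "specificity": 0.42069, "precision": 0.782383, "npv": 0.717647}
# Test metrics of the heavier run (rounded from heavy_out["test_metrics"]).
heavy_test = {"fn": 21.0, "recall_sensitivity": 0.935583,
              "specificity": 0.4, "precision": 0.778061, "npv": 0.734177}


def metric_deltas(reference, candidate):
    """Return candidate minus reference for every metric present in both."""
    return {k: candidate[k] - reference[k] for k in reference if k in candidate}


deltas = metric_deltas(standard_test, heavy_test)
for name, delta in deltas.items():
    print(f"{name}: {delta:+.4f}")
```

On the test split the heavier run has 3 fewer false negatives, and specificity drops by about 0.02, so the gain in recall is paid for with a few more false positives.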


In [13]:
# Metrics DataFrame (CV vs Test)
cv = heavy_out["cv_metrics"].copy()
test = heavy_out["test_metrics"].copy()

chosen_thr = heavy_out["chosen_threshold"]
cv["threshold"] = chosen_thr
test["threshold"] = chosen_thr

metrics_df = (
    pd.DataFrame({"cv": cv, "test": test})
      .reset_index()
      .rename(columns={"index": "metric"})
)

priority_order = [
    "threshold",
    "fn", "fnr",
    "recall_sensitivity",
    "specificity",
    "fp", "tn", "tp",
    "precision", "npv",
    "f1",
    "balanced_accuracy",
    "pr_auc",
    "roc_auc",
    "fpr",
]

metrics_df["metric"] = pd.Categorical(metrics_df["metric"], categories=priority_order, ordered=True)
metrics_df = metrics_df.sort_values("metric").reset_index(drop=True)


In [14]:
display(metrics_df)

| | metric | cv | test |
|---|---|---|---|
| 0 | threshold | 0.015676 | 0.015676 |
| 1 | fn | 64.0 | 21.0 |
| 2 | fnr | 0.049155 | 0.064417 |
| 3 | recall_sensitivity | 0.950845 | 0.935583 |
| 4 | specificity | 0.32526 | 0.4 |
| 5 | fp | 390.0 | 87.0 |
| 6 | tn | 188.0 | 58.0 |
| 7 | tp | 1238.0 | 305.0 |
| 8 | precision | 0.760442 | 0.778061 |
| 9 | npv | 0.746032 | 0.734177 |
