# Higgs → WW* Training

## Objective
End-to-end pipeline for training a Higgs vs. diboson WW classification model using:
- Pre-processed data saved to disk
- Stratified cross-validation (5-fold)
- Feature engineering plus automatic feature selection
- Gradient boosting (XGBoost/LightGBM)
- HEP metrics (ROC-AUC, AMS)
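
The AMS in the list above is, in the form popularized by the Higgs Boson Machine Learning Challenge, an approximate median discovery significance computed from the weighted signal and background counts that pass a selection. The sketch below assumes that definition, including its regularization term `b_reg = 10`. How the project's own trainer computes AMS is not shown in this notebook, and the name `ams_score` is illustrative.

```python
import math


def ams_score(s: float, b: float, b_reg: float = 10.0) -> float:
    """Approximate median significance.

    AMS = sqrt(2 * ((s + b + b_reg) * ln(1 + s / (b + b_reg)) - s))

    s: weighted count of selected signal events (true positives)
    b: weighted count of selected background events (false positives)
    b_reg: regularization term (10 in the challenge definition; an assumption here)
    """
    if s < 0 or b < 0:
        raise ValueError("s and b must be non-negative")
    radicand = 2.0 * ((s + b + b_reg) * math.log(1.0 + s / (b + b_reg)) - s)
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(radicand, 0.0))


# Example: 50 selected signal events over 100 selected background events.
print(round(ams_score(50.0, 100.0), 3))
```

Unlike ROC-AUC, AMS depends on the decision threshold, so it is usually reported at a chosen cut on the classifier score.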

In [14]:
import os
import sys
from collections import Counter

import joblib
import numpy as np
import pandas as pd

# Make the project root importable so the `src` package resolves from this notebook's directory.
# `os` must be imported before this line, since it is used here.
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..')))

from src.data.fold_split import generate_folds
from src.data.merge_data import merge_and_save
from src.features.feature_engineering import add_feature_engineering
from src.models.trainer import train_with_folds, train_final_model

## Loading the Data

In [15]:
merged_path = "../data/interim/merged_raw.pkl"

# Load the previously merged dataset.
merged_df = pd.read_pickle(merged_path)
print(f"   Shape: {merged_df.shape}")
print(f"   Classes: {merged_df['target'].value_counts().to_dict()}")

print("\n Dataset ready for the pipeline")
merged_df.head(3)

   Shape: (26277, 35)
   Classes: {0: 14937, 1: 11340}

 Dataset ready for the pipeline


| Unnamed: 0 | trigE | trigM | lep_n | jet_n | met_et | met_phi | lep_pt_0 | lep_pt_1 | lep_eta_0 | lep_eta_1 | ... | jet_eta | jet_phi | jet_E | jet_MV2c10 | sample | mLL | pTll | dphi_ll | dphi_ll_met | target |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | True | True | 2 | 0 | 39938.445 | 1.85693 | 39422.570312 | 26516.251953 | 0.571867 | -0.233551 | ... | 0.0 | 0.0 | 0.0 | 0.0 | WW | 46.733593 | 53.66068 | 1.268602 | 3.098009 | 0 |
| 1 | True | True | 2 | 1 | 44835.996 | -2.870015 | 55275.070312 | 29202.830078 | 0.167149 | 0.127004 | ... | 0.52575 | 2.928067 | 72089.546875 | -0.842718 | WW | 26.723835 | 80.155811 | 0.676783 | 2.675013 | 0 |
| 2 | True | False | 2 | 1 | 85033.48 | 0.415477 | 51790.308594 | 20564.869141 | -0.560513 | -0.214823 | ... | -0.278951 | -1.938559 | 21311.076172 | -0.690779 | Higgs | 12.35321 | 72.188777 | 0.150418 | 2.62897 | 1 |


In [16]:
# Generate the folds only if they do not exist yet
folds_dir = "../data/interim/folded/"

if os.path.exists(folds_dir) and len(os.listdir(folds_dir)) > 0:
    print(f"Folds already generated in {folds_dir}")
    print(f"Total files: {len(os.listdir(folds_dir))}")
else:
    print("Generating stratified folds...")
    generate_folds(
        merged_path=merged_path,
        output_dir=folds_dir,
        n_splits=5,
        random_state=42
    )
    print(f"Folds saved to {folds_dir}")

Folds already generated in ../data/interim/folded/
Total files: 10


## Stratified Folds

Creates 5 folds with StratifiedKFold so that each fold preserves the signal/background proportions of the full dataset.
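
As a rough illustration of what stratification buys, the sketch below runs scikit-learn's `StratifiedKFold` on synthetic labels that match the class counts printed above (14937 background, 11340 signal). It is not the project's `generate_folds` implementation, whose internals are not shown here, and `shuffle=True` is an assumption.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic labels with the same class counts as the merged dataset.
y = np.concatenate([np.zeros(14937, dtype=int), np.ones(11340, dtype=int)])
X = np.zeros((y.size, 1))  # features are irrelevant for the split itself

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

signal_fractions = []
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y), start=1):
    # Fraction of signal events in this fold's validation part.
    frac = y[val_idx].mean()
    signal_fractions.append(frac)
    print(f"fold {fold}: train={train_idx.size}, val={val_idx.size}, signal fraction={frac:.4f}")
```

Each validation part should show a signal fraction very close to the global one (11340 / 26277), which a plain unstratified split only achieves on average.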

In [17]:
print("Training with cross-validation...")

results_df, features_all = train_with_folds(
    folds_dir=folds_dir,
    output_dir="../models/folds/",
    model_type="lightgbm",
    top_k_features=15
)

print("\nResults per fold:")
print(results_df)

if not results_df.empty:
    print("\nAverage metrics:")

    # Average every numeric metric column, skipping the fold index.
    numeric_cols = results_df.select_dtypes(include=[np.number]).columns

    for col in numeric_cols:
        if col != 'fold':
            print(f"   {col.upper().replace('_', '-')}: {results_df[col].mean():.4f} ± {results_df[col].std():.4f}")
else:
    print("\n No cross-validation results were generated")

Training with cross-validation...

 Training by folds


FOLD 1


Computing Mutual Information...
Training XGBoost for feature importance...
Computing Permutation Importance...

Selected variables:
- MT_ll_met
- cluster_mass
- met_et
- mLL
- ptll_met
- E_sum_ll
- curr_mt
- lep_pt_1
- delta_R_ll
- pTll
- lep_E_1
- jet_pt
- pt_sum_ll
- pt_balance
- jet_eta
Training model...
[LightGBM] [Info] Number of positive: 9072, number of negative: 11949
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001593 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 3825
[LightGBM] [Info] Number of data points in the train set: 21021, number of used features: 15
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.431568 -> initscore=-0.275455
[LightGBM] [Info] Start training from score -0.275455


## Training with Cross-Validation

Trains a model on each fold with:
- Automatic feature engineering
- Feature selection (top-k by importance)
- Metrics: ROC-AUC, Accuracy, F1, AMS
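
The fold logs above mention three importance signals: mutual information, a boosted-tree importance, and permutation importance. How `train_with_folds` combines them is not shown in this notebook, so the following is only a sketch of one plausible scheme: average the per-method ranks and keep the top k. It runs on synthetic data and swaps in scikit-learn's `GradientBoostingClassifier` for XGBoost to stay self-contained; `rank_features` is a hypothetical helper.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import mutual_info_classif
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split


def rank_features(X: pd.DataFrame, y: np.ndarray, top_k: int, seed: int = 42) -> list:
    """Return the top_k columns by average rank over three importance scores."""
    X_tr, X_va, y_tr, y_va = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed
    )
    model = GradientBoostingClassifier(n_estimators=50, random_state=seed).fit(X_tr, y_tr)

    scores = pd.DataFrame(
        {
            "mutual_info": mutual_info_classif(X_tr, y_tr, random_state=seed),
            "tree_importance": model.feature_importances_,
            "permutation": permutation_importance(
                model, X_va, y_va, n_repeats=5, random_state=seed
            ).importances_mean,
        },
        index=X.columns,
    )
    # Rank 1 = most important within each method; a lower mean rank is better.
    mean_rank = scores.rank(ascending=False).mean(axis=1)
    return mean_rank.sort_values().index[:top_k].tolist()


# Synthetic data: with shuffle=False the 4 informative columns come first (f0..f3).
X_arr, y = make_classification(
    n_samples=600, n_features=10, n_informative=4, n_redundant=0,
    shuffle=False, random_state=0,
)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(10)])
selected = rank_features(X, y, top_k=4)
print(selected)
```

Rank averaging is used here because the three scores live on different scales; any other aggregation (for example, a vote per method) would fit the same interface.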

**STRATEGY**: Majority Vote on Features

Instead of picking features by hand, we use a frequency-based criterion:

We keep the variables that were selected as important in the MAJORITY of the folds.

Advantages:
 * Reduces the risk of overfitting (unstable features are dropped)
 * Tends to improve generalization
 * Reproducible and objective


In [18]:
import json

# Flatten the per-fold feature lists and count how often each feature was selected.
flat = [f for sublist in features_all for f in sublist]
counts = Counter(flat)

print("FEATURE STABILITY ANALYSIS")
print("=" * 70)
print("\nSelection frequency across the 5 folds:\n")

# Show every feature, ordered by frequency
for feat, count in counts.most_common():
    bar = "█" * count + "░" * (5 - count)
    print(f"  {feat:30s} [{bar}] {count}/5 folds")

print("\n" + "=" * 70)

# SELECTION CRITERION: majority vote (≥3 of 5 folds)
min_folds = 3
final_features = [feat for feat, c in counts.most_common() if c >= min_folds]

print(f"\nFEATURES SELECTED BY CONSENSUS: {len(final_features)}")
print(f"   (Criterion: present in ≥{min_folds} folds)\n")

print("Final features (ordered by stability):")
for i, (feat, count) in enumerate(counts.most_common(), 1):
    if count >= min_folds:
        # Color marker per stability tier (green / yellow / orange), matching the legend printed below.
        stability = "🟢" if count == 5 else "🟡" if count == 4 else "🟠"
        print(f"  {i:2d}. {stability} {feat:25s} ({count}/5 folds)")

# Persist the consensus list so the final training step can reuse it.
with open("../models/final_features.json", "w") as f:
    json.dump(final_features, f, indent=2)

print("\nFeatures saved to: models/final_features.json")
print("\nINSIGHT: The most stable features tend to be the ones that generalize best")
print("   Features 5/5 folds = Extremely robust (green)")
print("   Features 4/5 folds = Very robust (yellow)")
print("   Features 3/5 folds = Robust (orange)")

FEATURE STABILITY ANALYSIS

Selection frequency across the 5 folds:

  MT_ll_met                      [█████] 5/5 folds
  cluster_mass                   [█████] 5/5 folds
  met_et                         [█████] 5/5 folds
  mLL                            [█████] 5/5 folds
  ptll_met                       [█████] 5/5 folds
  E_sum_ll                       [█████] 5/5 folds
  curr_mt                        [█████] 5/5 folds
  lep_pt_1                       [█████] 5/5 folds
  delta_R_ll                     [█████] 5/5 folds
  pTll                           [█████] 5/5 folds
  jet_pt                         [█████] 5/5 folds
  pt_sum_ll                      [█████] 5/5 folds
  lep_E_1                        [████░] 4/5 folds
  pt_ratio                       [███░░] 3/5 folds
  pt_balance                     [██░░░] 2/5 folds
  jet_MV2c10                     [██░░░] 2/5 folds
  lep_pt_0                       [██░░░] 2/5 folds
  jet_eta                        [█░░░░] 1/5 folds
  tr

## Feature Selection with a Consensus Strategy

**Methodology: Majority Vote**

Instead of selecting features by hand, we use a frequency-based criterion:
- We count how many times each feature was selected across the 5 folds
- We keep only those that appear in **≥3 folds** (a majority)
- This favors features whose importance is stable across splits and lowers the risk of overfitting to any single split

**Advantages:**
- **Robustness**: Unstable features are discarded automatically
- **Generalization**: Features that hold up across folds are more likely to perform well on unseen data
- **Reproducibility**: The criterion is objective and transparent

**Stability classification:**
- **5/5 folds**: Extremely robust (always selected)
- **4/5 folds**: Very robust
- **3/5 folds**: Robust (minimum acceptable)
- **<3 folds**: Unstable (discarded)
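
The tiers above reduce to a simple lookup on the selection count. A minimal sketch with made-up fold outputs follows; `stability_tier`, `consensus_features`, and the toy feature lists are illustrative and not part of the project code.

```python
from collections import Counter


def stability_tier(count: int, min_folds: int = 3) -> str:
    """Map a selection count (out of 5 folds) to the tier names used above."""
    if count < min_folds:
        return "unstable"
    return {5: "extremely robust", 4: "very robust"}.get(count, "robust")


def consensus_features(features_per_fold, min_folds: int = 3):
    """Return (tiers, consensus) for a list of per-fold feature lists."""
    counts = Counter(f for fold in features_per_fold for f in fold)
    tiers = {f: (c, stability_tier(c, min_folds)) for f, c in counts.most_common()}
    consensus = [f for f, (c, _) in tiers.items() if c >= min_folds]
    return tiers, consensus


# Toy per-fold selections (hypothetical, not the notebook's real output).
toy_folds = [
    ["mLL", "met_et", "pTll"],
    ["mLL", "met_et", "jet_eta"],
    ["mLL", "met_et", "pTll"],
    ["mLL", "met_et", "pTll"],
    ["mLL", "jet_eta", "lep_pt_0"],
]
tiers, consensus = consensus_features(toy_folds)
for feat, (count, tier) in tiers.items():
    print(f"{feat:10s} {count}/5  {tier}")
print("consensus:", consensus)
```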

---

In [19]:
print("Training final model on the full dataset...\n")

best_model = train_final_model(
    merged_path=merged_path,
    features_list=final_features,
    model_type="lightgbm",
    output_path="../models/best_model.pkl"
)

print("\n Final model saved: models/best_model.pkl")
print("   Type: LightGBM")
print(f"   Features: {len(final_features)}")
print(f"   Dataset: {merged_df.shape[0]:,} events")

Training final model on the full dataset...


Final model training

Training final model with 14 features
[LightGBM] [Info] Number of positive: 11340, number of negative: 14937
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.002418 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 3570
[LightGBM] [Info] Number of data points in the train set: 26277, number of used features: 14
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.431556 -> initscore=-0.275505
[LightGBM] [Info] Start training from score -0.275505

## Training the Final Model

Trains the final model on the ENTIRE dataset using the selected features.
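
In outline, this step restricts the data to the consensus features, fits once on all rows, and persists the estimator. The sketch below shows that flow on synthetic data. It uses scikit-learn's `GradientBoostingClassifier` as a stand-in for LightGBM and a temporary file path, both assumptions made to keep it self-contained. The real `train_final_model` is not shown in this notebook and may differ.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(42)

# Synthetic stand-in for the merged dataset: 3 columns plus a binary target.
df = pd.DataFrame(rng.normal(size=(500, 3)), columns=["mLL", "met_et", "jet_phi"])
noise = rng.normal(scale=0.5, size=500)
df["target"] = (df["mLL"] + 0.5 * df["met_et"] + noise > 0).astype(int)

final_features = ["mLL", "met_et"]  # hypothetical consensus list

# Fit once on every row, using only the selected columns.
model = GradientBoostingClassifier(n_estimators=50, random_state=42)
model.fit(df[final_features], df["target"])

# Persist and reload with joblib, then confirm the round trip is lossless.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "best_model.pkl")
    joblib.dump(model, path)
    reloaded = joblib.load(path)

same = np.array_equal(
    model.predict(df[final_features]), reloaded.predict(df[final_features])
)
print(f"Reloaded model reproduces predictions: {same}")
```

Because no rows are held out in this step, any score computed on the same data is optimistic; the cross-validation metrics remain the estimate of generalization.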

In [20]:
model = joblib.load("../models/best_model.pkl")
# Note: this reloads the same merged dataset the final model was trained on,
# so the check below is an in-sample sanity check, not a held-out evaluation.
df_test = pd.read_pickle(merged_path)

print("Applying feature engineering...")
df_test = add_feature_engineering(df_test)

missing_features = [f for f in final_features if f not in df_test.columns]
if missing_features:
    # The model was fit on all of final_features, so predicting on a subset would fail downstream.
    raise KeyError(f"Missing features: {missing_features}")

available_features = final_features
print(f"All {len(final_features)} features available")

sample_indices = df_test.sample(10, random_state=42).index
X_sample = df_test.loc[sample_indices, available_features]
y_sample = df_test.loc[sample_indices, 'target']

probs = model.predict_proba(X_sample)[:, 1]
preds = model.predict(X_sample)

print("\nFinal model predictions (10 random samples):\n")
results_test = pd.DataFrame({
    'Real': y_sample.values,
    'Pred': preds,
    'Prob_Higgs': probs
})
results_test['Correcto'] = results_test['Real'] == results_test['Pred']
print(results_test)
print(f"\n✓ Accuracy on sample: {results_test['Correcto'].mean():.2%}")

Applying feature engineering...
All 14 features available

Final model predictions (10 random samples):

   Real  Pred  Prob_Higgs  Correcto
0     1     1    0.671631      True
1     0     0    0.060787      True
2     0     1    0.656003     False
3     1     1    0.753134      True
4     0     0    0.242721      True
5     1     1    0.643601      True
6     0     1    0.683979     False
7     1     0    0.468636     False
8     0     0    0.017518      True
9     0     0    0.159764      True

✓ Accuracy on sample: 70.00%