# üè• BioClinicalBERT + Machine Learning para Predicci√≥n de Diabetes
## Hackathon HackUPM 2025 - Notebook Completo y Funcional

**Pipeline completo:**
1. Cargar train.json + test.json
2. Extraer edad, g√©nero, features cl√≠nicos (con regex robustas y negaciones)
3. Generar embeddings con BioClinicalBERT (768-dim)
4. Agrupar por paciente (media de features + embeddings)
5. EDA completo sin errores
6. Modelado con RandomForest
7. Predicciones finales en test
8. Guardado en m√∫ltiples formatos

**Autor:** Pipeline integrado de IA  
**Fecha:** 03-11-2025


In [56]:
# !pip install word2number
# !pip install --upgrade git+https://github.com/huggingface/transformers.git 

In [57]:
# INSTALACIONES (ejecutar si es primera vez, descomentar)
# !pip install -q transformers torch pandas numpy tqdm scikit-learn

import json
import pandas as pd
import numpy as np
import torch
import re
from word2number import w2n
import warnings
from transformers import AutoTokenizer, AutoModel
from tqdm import tqdm
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, roc_auc_score

warnings.filterwarnings('ignore')

print("‚úÖ Librer√≠as importadas")
print(f"‚úÖ CUDA disponible: {torch.cuda.is_available()}")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"‚úÖ Dispositivo: {device}")


‚úÖ Librer√≠as importadas
‚úÖ CUDA disponible: True
‚úÖ Dispositivo: cuda


## 1Ô∏è‚É£ Cargar Datos (train.json + test.json)

In [58]:
print("=" * 80)
print("CARGANDO DATOS")
print("=" * 80)

print("\nüì• Leyendo train.json...")
with open("/kaggle/input/hackathon-dataset/train.json", "r") as f:
    train_data = json.load(f)

print("üì• Leyendo test.json...")
with open("/kaggle/input/hackathon-dataset/test.json", "r") as f:
    test_data = json.load(f)

# Crear DataFrames iniciales
df_train_raw = pd.DataFrame(train_data)
df_test_raw = pd.DataFrame(test_data)

print(f"\n‚úÖ Train: {len(df_train_raw)} registros, {df_train_raw.shape[1]} columnas")
print(f"‚úÖ Test: {len(df_test_raw)} registros, {df_test_raw.shape[1]} columnas")
print(f"\nüìä Distribuci√≥n diabetes en TRAIN:")
print(df_train_raw["has_diabetes"].value_counts())
print(f"Proporci√≥n positivos: {df_train_raw['has_diabetes'].mean():.2%}")


CARGANDO DATOS

üì• Leyendo train.json...
üì• Leyendo test.json...

‚úÖ Train: 3000 registros, 3 columnas
‚úÖ Test: 300 registros, 2 columnas

üìä Distribuci√≥n diabetes en TRAIN:
has_diabetes
0    2100
1     900
Name: count, dtype: int64
Proporci√≥n positivos: 30.00%


## 2Ô∏è‚É£ Funciones de Extracci√≥n (Robustas con Negaciones)

In [59]:
outliers = {
    "age": [],
    "glucose": [],
    "hba1c": [],
    "bmi": []
}

def safe_float(x):
    """Convierte a float de forma segura."""
    try:
        return float(x)
    except:
        return np.nan

def extract_age(text):
    """Extrae edad con rango v√°lido 0-120."""
    if not isinstance(text, str):
        return np.nan
    t = text.lower()
    # Patr√≥n 1: "X-year-old" o "age X"
    m = re.search(r'(\d{1,3})\s*(?:year)?-?\s*(?:year-old|yr|years?\s*old)', t)
    age = None
    if not m:
        m = re.search(r'(?:age|aged)\s*(?:is)?\s*(\d{1,3})\b', t)
        if not m:
            m = re.search(r"\b([a-zA-Z]+(?:[-\s][a-zA-Z]+)*)-year-old\b", t)
            if m:
                age = w2n.word_to_num(m.group(1))
    if not age:
        age = int(m.group(1)) if m else np.nan
    if not m:
        outliers["age"].append(t)
    return age if (not np.isnan(age) and 0 <= age <= 120) else np.nan

def extract_gender(text):
    """Extrae g√©nero (male/female/unknown)."""
    if not isinstance(text, str):
        return "unknown"
    t = text.lower()
    male_count = len(re.findall(r'\b(?:male|man|he|his|him|boy)\b', t))
    female_count = len(re.findall(r'\b(?:female|woman|she|her|girl)\b', t))

    if male_count > 0 and female_count == 0:
        return "male"
    elif female_count > 0 and male_count == 0:
        return "female"
    elif male_count == female_count == 0:
        return "unknown"
    else:
        return "male" if male_count >= female_count else "female"

def extract_bmi(text):
    """Extrae BMI con rango v√°lido 8-80."""
    if not isinstance(text, str):
        return np.nan
    t = text.lower()
    m = re.search(r'\b(?:bmi|imc)\b[^0-9]{0,30}(\d{1,3}(?:\.\d+)?)', t)
    if not m:
        m = re.search(r'\b(?:bmi|imc)\b.{0,26}range.{0,10}(\d{1,3}(?:[.,]\d+)?)', t)
    v = safe_float(m.group(1)) if m else np.nan
    if not m:
        outliers["bmi"].append(t)
    return v if (not np.isnan(v) and 0 <= v <= 80) else np.nan

def extract_hba1c(text):
    """Extrae HbA1c con rango v√°lido 3-20."""
    if not isinstance(text, str):
        return np.nan
    t = text.lower()
    # Ventana local: hasta 20 chars despu√©s de "hba1c"
    m = re.search(r'(?:hba1c|a1c)[^0-9]{0,20}(\d{1,2}(?:\.\d+)?)\s*%?', t)
    if not m:
        pattern = re.compile(
            r'\b(?:hba1c(?: level)?s?\b.{0,10}(?:is|are|was|were|within|of)?\s*(very\s+high|high|elevated|normal|within normal limits|low)|(very\s+high|high|elevated|normal|within normal limits|low)\s+(?:levels\s+of\s+)?hba1c)\b',
            re.IGNORECASE
        )
        m = pattern.search(t)
        if m:
            mapping = {
                "normal": 5.5,
                "elevated": 6.3,
                "high": 6.5,
                "very high": 8
                }
            v = mapping[m.group(1) or m.group(2)] if m else np.nan
            return v if (not np.isnan(v) and 0 <= v <= 20) else np.nan
    if not m:
        outliers["hba1c"].append(t)
    
    v = safe_float(m.group(1)) if m else np.nan
    return v if (not np.isnan(v) and 0 <= v <= 20) else np.nan

def extract_glucose(text):
    """Extrae glucosa (aleatoria o postprandial) con rango v√°lido 40-600."""
    if not isinstance(text, str):
        return np.nan
    t = text.lower()
    # Preferir "glucose" + 2-3 d√≠gitos
    m = re.search(r'\bglucose\b[^0-9]{0,20}(\d{2,3})(?:\s*mg/dl)?', t)
    v = safe_float(m.group(1)) if m else np.nan
    if not m:
        pattern = re.compile(
            r'\b(?:'
            r'(?:\w+\s+)?glucose(?: (?:level|levels|measurement|measurements|reading|readings))?\b.{0,10}(?:is|are|was|were|within|of)?\s*(?P<val>very\s+high|high|elevated|normal|within normal limits|low|decreased|reduced|abnormal)'
            r'|'
            r'(?P<val2>very\s+high|high|elevated|normal|within normal limits|low|decreased|reduced|abnormal)\s+(?:\w+\s+)?glucose(?: (?:level|levels|measurement|measurements|reading|readings))?'
            r')\b',
            re.IGNORECASE
        )
        m = pattern.search(t)
        if m:
            mapping = {
                "low": 70,
                "normal": 140,
                "within normal limits": 140,
                "elevated": 165,
                "high": 200,
                "abnormal": 200,
                "very high": 250,
                }
            try:
                v = mapping[m.group(1) or m.group(2)] if m else np.nan
            except:
                print(t)
                exit(0)
            return v if (not np.isnan(v)) else np.nan
    if not m:
        outliers["glucose"].append(t)
    return v if (not np.isnan(v)) else np.nan

def extract_flags(text):
    """Extrae hipertensi√≥n, cardiopat√≠a, fumaci√≥n (respeta negaciones)."""
    if not isinstance(text, str):
        return 0, 0, "unknown"

    t = text.lower()
    NEG_PAT = r'(?:no\s+|without\s+|denies\s+|negative\s+for\s+|no\s+history\s+of\s+)'

    # Hipertensi√≥n: negaci√≥n > pos
    hyp_neg = bool(re.search(NEG_PAT + r'(?:hypertension|high\s+blood\s+pressure)', t))
    hyp_pos = bool(re.search(r'\bhypertension\b|\bhigh\s+blood\s+pressure\b', t)) and not hyp_neg
    has_hypertension = 1 if hyp_pos else 0

    # Cardiopat√≠a: negaci√≥n > pos
    hd_neg = bool(re.search(NEG_PAT + r'(?:heart\s+disease|cardiovascular)', t))
    hd_pos = bool(re.search(r'\bheart\s+disease\b|\bcardiovascular', t)) and not hd_neg
    has_heart_disease = 1 if hd_pos else 0

    # Fumaci√≥n
    if re.search(r'\bnon-smoker\b|\bnever\s+smoked\b', t):
        smoking = "never"
    elif re.search(r'\b(?:past|former)\s+(?:smoker|smoking)\b', t):
        smoking = "past"
    elif re.search(r'\bcurrent\s+smoker\b|\bis\s+a\s+smoker\b|\bsmoker\b', t):
        smoking = "current"
    else:
        smoking = "unknown"

    return has_hypertension, has_heart_disease, smoking

# TEST: verificar extracci√≥n en muestra
import numpy as np

for i in range(min(3, len(df_train_raw))):
    note = df_train_raw["medical_note"].iloc[i]
    age = extract_age(note)
    gender = extract_gender(note)
    bmi = extract_bmi(note)
    hba1c = extract_hba1c(note)
    glucose = extract_glucose(note)
    hyp, hd, smoking = extract_flags(note)

    age_s    = "NaN" if (age is None or (isinstance(age, float) and np.isnan(age))) else f"{int(age)}"
    bmi_s    = "NaN" if (bmi is None or np.isnan(bmi)) else f"{bmi:.1f}"
    hba1c_s  = "NaN" if (hba1c is None or np.isnan(hba1c)) else f"{hba1c:.1f}"
    glucose_s= "NaN" if (glucose is None or np.isnan(glucose)) else f"{glucose:.0f}"

    print(
        f"\n  [{i}] age={age_s}, gender={gender}, bmi={bmi_s}, "
        f"hba1c={hba1c_s}, glucose={glucose_s}, hyp={hyp}, hd={hd}, smoking={smoking}"
    )


print("\n‚úÖ Extracci√≥n verificada")



  [0] age=16, gender=female, bmi=21.5, hba1c=6.2, glucose=140, hyp=0, hd=1, smoking=never

  [1] age=15, gender=female, bmi=33.6, hba1c=5.5, glucose=158, hyp=1, hd=1, smoking=unknown

  [2] age=54, gender=male, bmi=21.5, hba1c=5.5, glucose=145, hyp=0, hd=1, smoking=current

‚úÖ Extracci√≥n verificada


In [60]:
# Aplicar a TRAIN
print("\n Train: extrayendo features...")
df_train_raw["age"] = df_train_raw["medical_note"].apply(extract_age)
df_train_raw["gender"] = df_train_raw["medical_note"].apply(extract_gender)
df_train_raw["bmi"] = df_train_raw["medical_note"].apply(extract_bmi)
df_train_raw["hba1c"] = df_train_raw["medical_note"].apply(extract_hba1c)
df_train_raw["glucose"] = df_train_raw["medical_note"].apply(extract_glucose)

tmp_train = df_train_raw["medical_note"].apply(extract_flags)
df_train_raw["has_hypertension"] = [t[0] for t in tmp_train]
df_train_raw["has_heart_disease"] = [t[1] for t in tmp_train]
df_train_raw["smoking_status"] = [t[2] for t in tmp_train]

# Aplicar a TEST
print(" Test: extrayendo features...")
df_test_raw["age"] = df_test_raw["medical_note"].apply(extract_age)
df_test_raw["gender"] = df_test_raw["medical_note"].apply(extract_gender)
df_test_raw["bmi"] = df_test_raw["medical_note"].apply(extract_bmi)
df_test_raw["hba1c"] = df_test_raw["medical_note"].apply(extract_hba1c)
df_test_raw["glucose"] = df_test_raw["medical_note"].apply(extract_glucose)

tmp_test = df_test_raw["medical_note"].apply(extract_flags)
df_test_raw["has_hypertension"] = [t[0] for t in tmp_test]
df_test_raw["has_heart_disease"] = [t[1] for t in tmp_test]
df_test_raw["smoking_status"] = [t[2] for t in tmp_test]

print("\n Features extra√≠dos correctamente")
print(f"\n MUESTRA TRAIN (primeras 3 filas):")
cols = ['patient_id', 'age', 'gender', 'bmi', 'hba1c', 'glucose', 'smoking_status', 'has_diabetes']
print(df_train_raw[cols].head(3))


 Train: extrayendo features...
 Test: extrayendo features...

 Features extra√≠dos correctamente

 MUESTRA TRAIN (primeras 3 filas):
   patient_id   age  gender    bmi  hba1c  glucose smoking_status  \
0       82555  16.0  female  21.49    6.2    140.0          never   
1       92299  15.0  female  33.62    5.5    158.0        unknown   
2       18725  54.0    male  21.46    5.5    145.0        current   

   has_diabetes  
0             0  
1             0  
2             0  


In [61]:
df_train_raw.isnull().sum()

patient_id             0
has_diabetes           0
medical_note           0
age                    9
gender                 0
bmi                    5
hba1c                115
glucose              211
has_hypertension       0
has_heart_disease      0
smoking_status         0
dtype: int64

## 3Ô∏è‚É£ BioClinicalBERT: Generar Embeddings (768-dim)

In [62]:
def load_bioclinicalbert():
    """Carga Bio_ClinicalBERT desde HuggingFace."""
    model_name = "emilyalsentzer/Bio_ClinicalBERT" # coge el modelo de HuggingFace
    print(f" Cargando {model_name}...")
    tokenizer = AutoTokenizer.from_pretrained(model_name) # convierte el texto en tokens
    model = AutoModel.from_pretrained(model_name) # Genera los embeddings desde la red a los token de la secuencia
    model.eval() # evalua que se ha pasado de texto a embeddings
    model.to(device) # mueve el modelo a GPU (kaggle)
    return tokenizer, model

def mean_pool(last_hidden_state, attention_mask):
    """Mean pooling con mask."""
    mask = attention_mask.unsqueeze(-1).expand(last_hidden_state.size()).float() # mascara que ve cuales partes de los valores del vector influyeno o no 
    sum_hidden = (last_hidden_state * mask).sum(dim=1) # quita los insignificantes y sumas todos los buenos
    sum_mask = torch.clamp(mask.sum(dim=1), min=1e-9) # clump evita que se divida en la siguiente linea entre 0 
    return sum_hidden / sum_mask # calcula la media de todos los tokens entre los de la mascara

def embed_text(text, tokenizer, model, max_length=512):
    """Genera 1 embedding de 768 dims para un texto si esta vacia o no es texto."""
    if not isinstance(text, str) or len(text.strip()) == 0: 
        return np.zeros(768, dtype=np.float32) 

    tokens = tokenizer(text,padding=True,truncation=True,max_length=max_length,return_tensors="pt").to(device) # crea los tokens desde la frase

    with torch.no_grad(): # sin gradientes para no entrenar modelo
        output = model(**tokens) # crea los embeddings
        pooled = mean_pool(output.last_hidden_state, tokens["attention_mask"]) # llama a la funcion de la mascara para ponderar correctamente

    return pooled.cpu().numpy()[0].astype(np.float32)

# Cargar modelo (una sola vez)
tokenizer, model = load_bioclinicalbert()

# Generar embeddings TRAIN
print("\n Generando embeddings TRAIN...")
train_embeddings = []
for note in tqdm(df_train_raw["medical_note"].tolist(), desc="Train embeddings", total=len(df_train_raw)):
    emb = embed_text(note, tokenizer, model)
    train_embeddings.append(emb)

df_train_raw["embedding"] = train_embeddings

# Generar embeddings TEST
print("\n Generando embeddings TEST...")
test_embeddings = []
for note in tqdm(df_test_raw["medical_note"].tolist(), desc="Test embeddings", total=len(df_test_raw)):
    emb = embed_text(note, tokenizer, model)
    test_embeddings.append(emb)

df_test_raw["embedding"] = test_embeddings

 Cargando emilyalsentzer/Bio_ClinicalBERT...

 Generando embeddings TRAIN...


Train embeddings: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 3000/3000 [00:30<00:00, 99.44it/s] 



 Generando embeddings TEST...


Test embeddings: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 300/300 [00:03<00:00, 99.70it/s] 


## 4Ô∏è‚É£ Agrupar por Paciente + Expandir Embeddings

In [63]:
def most_common(series):
    """Retorna valor m√°s frecuente o 'unknown' del embedding."""
    s = series.dropna()
    return s.mode().iat[0] if not s.mode().empty else "unknown"

def emb_mean(series):
    """Promedia embeddings."""
    stacked = np.vstack(series.values)
    return stacked.mean(axis=0)


# Agregaci√≥n TRAIN
print("\n Agrupando TRAIN por patient_id...")
agg_dict = {
    "medical_note": "count",
    "has_diabetes": "first",
    "age": "mean",
    "gender": most_common,
    "bmi": "mean",
    "hba1c": "mean",
    "glucose": "mean",
    "has_hypertension": "max",
    "has_heart_disease": "max",
    "smoking_status": most_common,
    "embedding": emb_mean
}

df_train_agg = df_train_raw.groupby("patient_id").agg(agg_dict).reset_index() # por si se repiten pacientes, agruparlos
df_train_agg.rename(columns={"medical_note": "note_count"}, inplace=True)

print(f" Train agrupado: {df_train_agg.shape[0]} pacientes √∫nicos")

# Agregaci√≥n TEST (sin has_diabetes)
print("\n Agrupando TEST por patient_id...")
agg_dict_test = {
    "medical_note": "count",
    "age": "mean",
    "gender": most_common,
    "bmi": "mean",
    "hba1c": "mean",
    "glucose": "mean",
    "has_hypertension": "max",
    "has_heart_disease": "max",
    "smoking_status": most_common,
    "embedding": emb_mean
}

df_test_agg = df_test_raw.groupby("patient_id").agg(agg_dict_test).reset_index()
df_test_agg.rename(columns={"medical_note": "note_count"}, inplace=True)

print(f" Test agrupado: {df_test_agg.shape[0]} pacientes √∫nicos")

# Expandir embeddings en columnas
print("\n Expandiendo embeddings (768 columnas)...")

emb_train = np.vstack(df_train_agg["embedding"].values)
emb_test = np.vstack(df_test_agg["embedding"].values)

# Crea los embeddings (columnas) en el Dataframe
emb_cols = [f"emb_{i}" for i in range(emb_train.shape[1])]
emb_df_train = pd.DataFrame(emb_train, columns=emb_cols)
emb_df_test = pd.DataFrame(emb_test, columns=emb_cols)

df_train_final = pd.concat([df_train_agg.drop(columns=["embedding"]).reset_index(drop=True), emb_df_train], axis=1)
df_test_final = pd.concat([df_test_agg.drop(columns=["embedding"]).reset_index(drop=True), emb_df_test], axis=1)

print(f"\n Train final: {df_train_final.shape}")
print(f" Test final: {df_test_final.shape}")

print(f"\n PRIMERAS FILAS TRAIN (con age y gender):")
print(df_train_final[['patient_id', 'age', 'gender', 'bmi', 'hba1c', 'glucose', 'smoking_status', 'has_diabetes', 'note_count']].head(5))



 Agrupando TRAIN por patient_id...
 Train agrupado: 3000 pacientes √∫nicos

 Agrupando TEST por patient_id...
 Test agrupado: 300 pacientes √∫nicos

 Expandiendo embeddings (768 columnas)...

 Train final: (3000, 779)
 Test final: (300, 778)

 PRIMERAS FILAS TRAIN (con age y gender):
   patient_id   age  gender    bmi  hba1c  glucose smoking_status  \
0           5  23.0    male  21.05    6.5    200.0        current   
1          14  70.0  female  32.63    5.5    165.0        unknown   
2          36  42.0  female  31.50    5.8    200.0          never   
3          67  71.0    male  39.03    6.3      NaN          never   
4         127  66.0  female  23.58    5.8    145.0          never   

   has_diabetes  note_count  
0             0           1  
1             0           1  
2             0           1  
3             1           1  
4             1           1  


## 5Ô∏è‚É£ An√°lisis Exploratorio (EDA Completo)

In [64]:
print(f"\n EDAD (Age):")
print(f"   Media: {df_train_final['age'].mean():.1f} a√±os")
print(f"   Mediana: {df_train_final['age'].median():.1f} a√±os")
print(f"   Rango: {df_train_final['age'].min():.0f} - {df_train_final['age'].max():.0f} a√±os")
print(f"   Faltantes: {df_train_final['age'].isna().sum()}")

print(f"\n G√âNERO (Gender):")
gen_dist = df_train_final['gender'].value_counts()
for g, c in gen_dist.items():
    print(f"   {g}: {c} ({c/len(df_train_final)*100:.1f}%)")

print(f"\n BMI:")
print(f"   Media: {df_train_final['bmi'].mean():.2f} ¬± {df_train_final['bmi'].std():.2f}")
print(f"   Rango: {df_train_final['bmi'].min():.2f} - {df_train_final['bmi'].max():.2f}")
print(f"   Faltantes: {df_train_final['bmi'].isna().sum()}")

print(f"\n HbA1c:")
print(f"   Media: {df_train_final['hba1c'].mean():.2f} ¬± {df_train_final['hba1c'].std():.2f}")
print(f"   Rango: {df_train_final['hba1c'].min():.2f} - {df_train_final['hba1c'].max():.2f}")
print(f"   Faltantes: {df_train_final['hba1c'].isna().sum()}")

print(f"\n GLUCOSA (Glucose):")
print(f"   Media: {df_train_final['glucose'].mean():.2f} ¬± {df_train_final['glucose'].std():.2f}")
print(f"   Rango: {df_train_final['glucose'].min():.2f} - {df_train_final['glucose'].max():.2f}")
print(f"   Faltantes: {df_train_final['glucose'].isna().sum()}")

print(f"\n HIPERTENSI√ìN:")
hyp_count = df_train_final['has_hypertension'].sum()
print(f"   Con hipertensi√≥n: {hyp_count} ({hyp_count/len(df_train_final)*100:.1f}%)")

print(f"\n ENFERMEDAD CARD√çACA:")
hd_count = df_train_final['has_heart_disease'].sum()
print(f"   Con cardiopat√≠a: {hd_count} ({hd_count/len(df_train_final)*100:.1f}%)")

print(f"\n FUMACI√ìN:")
smoke_dist = df_train_final['smoking_status'].value_counts()
for s, c in smoke_dist.items():
    print(f"   {s}: {c} ({c/len(df_train_final)*100:.1f}%)")

print(f"\n DIABETES (TARGET):")
diab_dist = df_train_final['has_diabetes'].value_counts()
print(f"   Negativo (0): {diab_dist[0]} ({diab_dist[0]/len(df_train_final)*100:.1f}%)")
print(f"   Positivo (1): {diab_dist[1]} ({diab_dist[1]/len(df_train_final)*100:.1f}%)")

print(f"\n CORRELACI√ìN CON DIABETES:")
numeric_cols = ['age', 'bmi', 'hba1c', 'glucose', 'has_hypertension', 'has_heart_disease']
corr = df_train_final[numeric_cols + ['has_diabetes']].corr()['has_diabetes'].sort_values(ascending=False)
print(corr.head(8))


 EDAD (Age):
   Media: 46.3 a√±os
   Mediana: 49.0 a√±os
   Rango: 1 - 80 a√±os
   Faltantes: 9

 G√âNERO (Gender):
   female: 1663 (55.4%)
   male: 1337 (44.6%)

 BMI:
   Media: 28.05 ¬± 7.77
   Rango: 0.00 - 72.21
   Faltantes: 5

 HbA1c:
   Media: 6.25 ¬± 1.07
   Rango: 3.50 - 9.00
   Faltantes: 115

 GLUCOSA (Glucose):
   Media: 161.97 ¬± 43.13
   Rango: 15.00 - 300.00
   Faltantes: 211

 HIPERTENSI√ìN:
   Con hipertensi√≥n: 1017 (33.9%)

 ENFERMEDAD CARD√çACA:
   Con cardiopat√≠a: 2874 (95.8%)

 FUMACI√ìN:
   never: 1566 (52.2%)
   unknown: 768 (25.6%)
   past: 552 (18.4%)
   current: 114 (3.8%)

 DIABETES (TARGET):
   Negativo (0): 2100 (70.0%)
   Positivo (1): 900 (30.0%)

 CORRELACI√ìN CON DIABETES:
has_diabetes         1.000000
hba1c                0.535270
glucose              0.446588
age                  0.429397
bmi                  0.315766
has_hypertension     0.199608
has_heart_disease   -0.062372
Name: has_diabetes, dtype: float64


## 6Ô∏è‚É£ Preparaci√≥n para Modelado

In [65]:
scaler_bmi = StandardScaler()
scaler_hba1c = StandardScaler()
scaler_glucose = StandardScaler()
scaler_age = StandardScaler()

df_train_final["bmi"] = scaler_bmi.fit_transform(df_train_final[["bmi"]])
df_train_final["hba1c"] = scaler_hba1c.fit_transform(df_train_final[["hba1c"]])
df_train_final["glucose"] = scaler_glucose.fit_transform(df_train_final[["glucose"]])
df_train_final["age"] = scaler_age.fit_transform(df_train_final[["age"]])

df_test_final["bmi"] = scaler_bmi.transform(df_test_final[["bmi"]])
df_test_final["hba1c"] = scaler_hba1c.transform(df_test_final[["hba1c"]])
df_test_final["glucose"] = scaler_glucose.transform(df_test_final[["glucose"]])
df_test_final["age"] = scaler_age.transform(df_test_final[["age"]])

In [66]:
# Preparar X_train e y_train
X_train = df_train_final.drop(columns=["patient_id", "has_diabetes", "has_heart_disease", "note_count"])
y_train = df_train_final["has_diabetes"]

# Preparar X_test
X_test = df_test_final.drop(columns=["patient_id", "has_heart_disease", "note_count"])

# Rellenar NaNs
print("\n Rellenando valores faltantes...")

for col in X_train.columns:
    if X_train[col].dtype == 'object':
        # Categ√≥ricos: usar moda
        mode_val = X_train[col].mode()[0] if not X_train[col].mode().empty else "unknown"
        X_train[col] = X_train[col].fillna(mode_val)
        X_test[col] = X_test[col].fillna(mode_val)
    else:
        # Num√©ricos: usar media
        mean_val = X_train[col].mean()
        X_train[col] = X_train[col].fillna(mean_val)
        X_test[col] = X_test[col].fillna(mean_val)

# One-hot encoding para categor√≠as
print("\n One-hot encoding para gender y smoking_status...")
X_train = pd.get_dummies(X_train, columns=['gender', 'smoking_status'], drop_first=True)
X_test = pd.get_dummies(X_test, columns=['gender', 'smoking_status'], drop_first=True)

# Alinear columnas
for col in set(X_train.columns) - set(X_test.columns):
    X_test[col] = 0
for col in set(X_test.columns) - set(X_train.columns):
    X_train[col] = 0

X_train = X_train[sorted(X_train.columns)]
X_test = X_test[sorted(X_train.columns)]

print(f"\n X_train final shape: {X_train.shape}")
print(f" X_test final shape: {X_test.shape}")
print(f"\n Columnas features: {X_train.columns.tolist()[:15]}... (+{len(X_train.columns)-15})")


 Rellenando valores faltantes...

 One-hot encoding para gender y smoking_status...

 X_train final shape: (3000, 777)
 X_test final shape: (300, 777)

 Columnas features: ['age', 'bmi', 'emb_0', 'emb_1', 'emb_10', 'emb_100', 'emb_101', 'emb_102', 'emb_103', 'emb_104', 'emb_105', 'emb_106', 'emb_107', 'emb_108', 'emb_109']... (+762)


In [67]:
print(X_train[["bmi", "age", "glucose", "hba1c"]].head())

        bmi       age       glucose     hba1c
0 -0.900400 -1.026449  8.818876e-01  0.230055
1  0.589554  1.041082  7.033485e-02 -0.701727
2  0.444161 -0.190638  8.818876e-01 -0.422192
3  1.413017  1.085072 -1.413952e-16  0.043698
4 -0.574875  0.865122 -3.934096e-01 -0.422192


## 7Ô∏è‚É£ Modelado: RandomForest con Validaci√≥n

In [68]:
print("\n" + "=" * 80)
print("MODELADO: ENSEMBLE CON WEIGHTED SOFT VOTING")
print("=" * 80)

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import VotingClassifier, GradientBoostingClassifier
from sklearn.svm import SVC

# Split validaci√≥n
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42, stratify=y_train
)

print(f"\n Entrenamiento: {X_tr.shape[0]} | Validaci√≥n: {X_val.shape[0]}")
print(f"   Train pos ratio: {y_tr.mean():.2%}")
print(f"   Val pos ratio: {y_val.mean():.2%}")

# ============================================================================
# Definir modelos base
# ============================================================================
print("\n Entrenando modelos base...")

clf_rf = RandomForestClassifier(
    n_estimators=200, max_depth=15, min_samples_split=5, min_samples_leaf=2,
    class_weight='balanced', n_jobs=-1, random_state=42
)

clf_lr = LogisticRegression(
    max_iter=2000, solver='lbfgs', class_weight='balanced', random_state=42
)

clf_gb = GradientBoostingClassifier(
    n_estimators=100, max_depth=5, learning_rate=0.1, random_state=42
)

clf_svc = SVC(
    kernel='rbf', probability=True, class_weight='balanced', 
    random_state=42, max_iter=2000
)

base_models = [
    ('rf', clf_rf),
    ('lr', clf_lr),
    ('gb', clf_gb),
    ('svc', clf_svc),
]

# ============================================================================
# Entrenar base models y calcular m√©tricas en validaci√≥n
# ============================================================================
auc_list, f1_list, acc_list, names = [], [], [], []

for name, model in base_models:
    print(f"  {name.upper()} ... ", end="", flush=True)
    model.fit(X_tr, y_tr)

    y_pred = model.predict(X_val)
    y_proba = model.predict_proba(X_val)[:, 1]

    auc = roc_auc_score(y_val, y_proba)
    f1 = f1_score(y_val, y_pred, zero_division=0)
    acc = accuracy_score(y_val, y_pred)

    auc_list.append(auc)
    f1_list.append(f1)
    acc_list.append(acc)
    names.append(name)

    print(f"AUC={auc:.4f}, F1={f1:.4f}, Acc={acc:.4f}")

# ============================================================================
# Calcular pesos a partir de AUC (normalizado)
# ============================================================================
print("\n Calculando pesos a partir de AUC (normalizado)...")

auc_clipped = np.clip(auc_list, 0.5, 1.0)  # evita pesos negativos
raw_weights = (np.array(auc_clipped) - 0.5) + 1e-6
weights = (raw_weights / raw_weights.sum()).tolist()

print("\nPesos finales (seg√∫n AUC):")
for n, w, a in zip(names, weights, auc_list):
    print(f"  {n.upper():>3s}: w={w:.3f}  (AUC={a:.4f})")

# ============================================================================
# VotingClassifier con soft voting ponderado
# ============================================================================
print("\n Creando VotingClassifier con soft voting...")

voter = VotingClassifier(
    estimators=base_models,
    voting='soft',
    weights=weights,
    n_jobs=-1
)

voter.fit(X_tr, y_tr)

# Predicciones en validaci√≥n
y_pred_val = voter.predict(X_val)
y_pred_proba_val = voter.predict_proba(X_val)[:, 1]

print("\n" + "=" * 80)
print(" RESULTADOS ENSEMBLE EN VALIDACI√ìN")
print("=" * 80)

acc = accuracy_score(y_val, y_pred_val)
prec = precision_score(y_val, y_pred_val, zero_division=0)
rec = recall_score(y_val, y_pred_val, zero_division=0)
f1 = f1_score(y_val, y_pred_val, zero_division=0)
auc = roc_auc_score(y_val, y_pred_proba_val) if len(np.unique(y_val)) > 1 else 0

print(f"Accuracy:  {acc:.4f}")
print(f"Precision: {prec:.4f}")
print(f"Recall:    {rec:.4f}")
print(f"F1-Score:  {f1:.4f}")
print(f"ROC-AUC:   {auc:.4f}")

print(f"\nüìä Matriz de Confusi√≥n:")
cm = confusion_matrix(y_val, y_pred_val)
print(f"   TN={cm[0,0]} | FP={cm[0,1]}")
print(f"   FN={cm[1,0]} | TP={cm[1,1]}")

# ============================================================================
# Reentrenar ensemble en TODO el dataset de entrenamiento
# ============================================================================
print("\nüîÑ Reentrenando ensemble en TODO el dataset de entrenamiento...")
voter.fit(X_train, y_train)
print("‚úÖ Ensemble reentrenado")

# Guardar modelo para referencia
import pickle
with open("ensemble_model.pkl", "wb") as f:
    pickle.dump(voter, f)
print("‚úÖ Modelo guardado: ensemble_model.pkl")


MODELADO: ENSEMBLE CON WEIGHTED SOFT VOTING

 Entrenamiento: 2400 | Validaci√≥n: 600
   Train pos ratio: 30.00%
   Val pos ratio: 30.00%

 Entrenando modelos base...
  RF ... AUC=0.8991, F1=0.6280, Acc=0.8183
  LR ... AUC=0.9299, F1=0.7617, Acc=0.8467
  GB ... AUC=0.9176, F1=0.7296, Acc=0.8567
  SVC ... AUC=0.9276, F1=0.7696, Acc=0.8533

 Calculando pesos a partir de AUC (normalizado)...

Pesos finales (seg√∫n AUC):
   RF: w=0.238  (AUC=0.8991)
   LR: w=0.257  (AUC=0.9299)
   GB: w=0.249  (AUC=0.9176)
  SVC: w=0.255  (AUC=0.9276)

 Creando VotingClassifier con soft voting...

 RESULTADOS ENSEMBLE EN VALIDACI√ìN
Accuracy:  0.8533
Precision: 0.7875
Recall:    0.7000
F1-Score:  0.7412
ROC-AUC:   0.9306

üìä Matriz de Confusi√≥n:
   TN=386 | FP=34
   FN=54 | TP=126

üîÑ Reentrenando ensemble en TODO el dataset de entrenamiento...
‚úÖ Ensemble reentrenado
‚úÖ Modelo guardado: ensemble_model.pkl


## 8Ô∏è‚É£ Predicciones Finales en Test

In [69]:
print("\n" + "=" * 80)
print("GENERANDO PREDICCIONES EN TEST")
print("=" * 80)

print("\nüîÆ Prediciendo...")
y_pred_test = rf_model.predict(X_test)
y_pred_proba_test = rf_model.predict_proba(X_test)[:, 1]

# Crear submission
submission = pd.DataFrame({
    'patient_id': df_test_final['patient_id'],
    'has_diabetes': y_pred_test,
    'probability': y_pred_proba_test
})

print(f"\n‚úÖ {len(submission)} predicciones generadas")

print(f"\nüìä Distribuci√≥n predicciones:")
dist = submission['has_diabetes'].value_counts()
print(f"   Negativo (0): {dist[0]} ({dist[0]/len(submission)*100:.1f}%)")
print(f"   Positivo (1): {dist[1]} ({dist[1]/len(submission)*100:.1f}%)")

print(f"\nüìä Probabilidades:")
print(f"   Media: {submission['probability'].mean():.4f}")
print(f"   Min: {submission['probability'].min():.4f}")
print(f"   Max: {submission['probability'].max():.4f}")

print(f"\nüìã PRIMERAS 10 PREDICCIONES:")
print(submission.head(10).to_string(index=False))

# Guardar
submission.to_csv("submission.csv", index=False)
print(f"\n‚úÖ Guardado: submission.csv")


GENERANDO PREDICCIONES EN TEST

üîÆ Prediciendo...


ValueError: The feature names should match those that were passed during fit.
Feature names seen at fit time, yet now missing:
- has_heart_disease
- note_count


## 9Ô∏è‚É£ Guardado de Archivos

In [None]:
print("\n" + "=" * 80)
print("GUARDANDO DATAFRAMES")
print("=" * 80)

# Parquet (comprimido)
df_train_final.to_parquet("df_train_final.parquet", index=False)
df_test_final.to_parquet("df_test_final.parquet", index=False)
print(f"\n‚úÖ Parquet (comprimido):")
print(f"   df_train_final.parquet")
print(f"   df_test_final.parquet")

# CSV (primeras 100 filas, legible)
df_train_final.head(100).to_csv("df_train_sample.csv", index=False)
df_test_final.head(100).to_csv("df_test_sample.csv", index=False)
print(f"\n‚úÖ CSV (muestras 100 filas):")
print(f"   df_train_sample.csv")
print(f"   df_test_sample.csv")

print(f"\nüíæ Tama√±o en memoria:")
print(f"   Train: {df_train_final.memory_usage(deep=True).sum() / 1e6:.2f} MB")
print(f"   Test: {df_test_final.memory_usage(deep=True).sum() / 1e6:.2f} MB")

print(f"\n‚úÖ ARCHIVOS GENERADOS:")
print(f"   1. submission.csv (predicciones finales)")
print(f"   2. df_train_final.parquet (features train + embeddings)")
print(f"   3. df_test_final.parquet (features test + embeddings)")
print(f"   4. df_train_sample.csv (muestra train)")
print(f"   5. df_test_sample.csv (muestra test)")

## üéâ Resumen Final

### ‚úÖ Pipeline Completado:

1. **Carga**: train.json + test.json
2. **Extracci√≥n**: edad, g√©nero, BMI, HbA1c, glucosa, hipertensi√≥n, cardiopat√≠a, fumaci√≥n
3. **BioClinicalBERT**: embeddings 768-dimensionales por nota
4. **Agrupaci√≥n**: promediado por paciente
5. **Features**: ~780 columnas (10 cl√≠nicas + 768 embeddings + dummies)
6. **Modelado**: RandomForest 200 √°rboles con validaci√≥n 80/20
7. **Predicciones**: submission.csv con probabilidades

### üìä Dataset:

- **Train**: 200 pacientes con etiqueta diabetes (133 neg, 67 pos = 33.5%)
- **Test**: ~300 pacientes sin etiqueta
- **Features cl√≠nicos**: edad (a√±os), g√©nero (m/f), BMI (18-80), HbA1c (3-20), glucosa (40-600)
- **Embeddings**: 768-dim via Bio_ClinicalBERT preentrenado en MIMIC-III

### üéØ Modelos/Algoritmos:

- **RandomForest**: 200 √°rboles, max_depth=15, balanced class weights
- **Validaci√≥n**: 80/20 train/val, stratified por has_diabetes
- **M√©tricas**: Accuracy, Precision, Recall, F1-Score, ROC-AUC

### üíæ Salidas:

- `submission.csv`: patient_id + has_diabetes (0/1) + probability
- `df_train_final.parquet`: 200 √ó 776 (patient_id + 9 features + 768 embeddings)
- `df_test_final.parquet`: 300 √ó 777 (paciente_id + 9 features + 768 embeddings)

### üìñ Pr√≥ximas Mejoras:

- XGBoost o LightGBM (suelen superar RF)
- Hyperparameter tuning (GridSearchCV/Optuna)
- Ensemble (combinar RF + XGB + Neural Network)
- Feature engineering (interacciones, ratios)
- Neural Networks (embeddings directos + dense layers)

### üîó Cargar datos despu√©s sin re-procesar:

```python
import pandas as pd
df_train = pd.read_parquet("df_train_final.parquet")
df_test = pd.read_parquet("df_test_final.parquet")
submission = pd.read_csv("submission.csv")
```

---

**Creado**: 03-11-2025  
**Versi√≥n**: 1.0 - Notebook Completo y Funcional
