# Diabetes Prediction
## Hackathon HackUPM 2025

**Full pipeline:**
1. Load train.json + test.json
2. Extract age, gender, and clinical features (regex-based, with negation handling)
3. Generate embeddings with BioClinicalBERT (768-dim)
4. Group by patient (mean of features + embeddings)
5. Exploratory data analysis (EDA)
6. Modeling with a weighted soft-voting ensemble (CatBoost + XGBoost + SVM)
7. Final predictions on test
8. Saving in multiple formats

**Authors:**
**Date:** 03-11-2025


In [52]:
# INSTALLS (only needed on the first run)
!pip install -q transformers torch pandas numpy tqdm scikit-learn
!pip install --upgrade git+https://github.com/huggingface/transformers.git
!pip install word2number

from word2number import w2n
import json
import pandas as pd
import numpy as np
import torch
import re
import warnings
from transformers import AutoTokenizer, AutoModel
from tqdm import tqdm
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, roc_auc_score

warnings.filterwarnings('ignore')

print("Libraries imported")
print(f"CUDA available: {torch.cuda.is_available()}")  # CUDA is NVIDIA's platform for running computations on GPUs
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # use the GPU when available, otherwise the CPU
print(f"Device: {device}")


Libraries imported
CUDA available: True
Device: cuda


## 1️⃣ Load Data (train.json + test.json)

In [54]:
print("\nReading train.json...")
with open("/kaggle/input/fdatas/train1.json", "r") as f:
    train_data = json.load(f)

print("Reading test.json...")
with open("/kaggle/input/fdatas/test1.json", "r") as f:
    test_data = json.load(f)

# Build the initial DataFrames
df_train_raw = pd.DataFrame(train_data)
df_test_raw = pd.DataFrame(test_data)

print(f"\n Train: {len(df_train_raw)} records, {df_train_raw.shape[1]} columns")
print(f" Test: {len(df_test_raw)} records, {df_test_raw.shape[1]} columns")

print(f"\n Columns: {df_train_raw.columns.tolist()}")

print(f"\n Diabetes distribution in TRAIN:")
print(df_train_raw["has_diabetes"].value_counts())
print(f"Positive rate: {df_train_raw['has_diabetes'].mean()}")


Reading train.json...
Reading test.json...

 Train: 3000 records, 3 columns
 Test: 300 records, 2 columns

 Columns: ['patient_id', 'has_diabetes', 'medical_note']

 Diabetes distribution in TRAIN:
has_diabetes
0    2100
1     900
Name: count, dtype: int64
Positive rate: 0.3


## 2️⃣ Clinical Feature Extraction

In [55]:
outliers = {
    "age": [],
    "glucose": [],
    "hba1c": [],
    "bmi": []
}

def safe_float(x):
    """Convert to float, returning NaN when the value cannot be parsed."""
    try:
        return float(x)
    except (TypeError, ValueError):
        return np.nan

def extract_age(text):
    """Extract age in years; values outside 0-120 become NaN."""
    if not isinstance(text, str):
        return np.nan
    t = text.lower()
    age = np.nan
    # Pattern 1: "45-year-old", "45 yr", "45 years old"
    m = re.search(r'(\d{1,3})\s*(?:year)?-?\s*(?:year-old|yr|years?\s*old)', t)
    if not m:
        # Pattern 2: "age 45", "aged 45", "age is 45"
        m = re.search(r'(?:age|aged)\s*(?:is)?\s*(\d{1,3})\b', t)
    if m:
        age = int(m.group(1))
    else:
        # Pattern 3: spelled-out age, e.g. "forty-five-year-old"
        m = re.search(r"\b([a-zA-Z]+(?:[-\s][a-zA-Z]+)*)-year-old\b", t)
        if m:
            try:
                age = w2n.word_to_num(m.group(1))
            except ValueError:
                age = np.nan  # the matched words contain no number
    if not m:
        outliers["age"].append(t)  # keep unmatched notes for manual inspection
    return age if (not np.isnan(age) and 0 <= age <= 120) else np.nan

def extract_gender(text):
    """Extract gender (male/female/unknown) by counting gendered words."""
    if not isinstance(text, str):
        return "unknown"
    t = text.lower()
    male_count = len(re.findall(r'\b(?:male|man|he|his|him|boy)\b', t))
    female_count = len(re.findall(r'\b(?:female|woman|she|her|girl)\b', t))

    if male_count > 0 and female_count == 0:
        return "male"
    elif female_count > 0 and male_count == 0:
        return "female"
    elif male_count == female_count == 0:
        return "unknown"
    else:
        return "male" if male_count >= female_count else "female"

def extract_bmi(text):
    """Extract BMI; values outside 0-80 become NaN."""
    if not isinstance(text, str):
        return np.nan
    t = text.lower()
    # A number within 30 non-digit characters after "bmi" (or Spanish "imc")
    m = re.search(r'\b(?:bmi|imc)\b[^0-9]{0,30}(\d{1,3}(?:\.\d+)?)', t)
    if not m:
        # Fallback for phrasings such as "BMI ... range ... 27.5"
        m = re.search(r'\b(?:bmi|imc)\b.{0,26}range.{0,10}(\d{1,3}(?:[.,]\d+)?)', t)
    v = safe_float(m.group(1)) if m else np.nan
    if not m:
        outliers["bmi"].append(t)
    return v if (not np.isnan(v) and 0 <= v <= 80) else np.nan

def extract_hba1c(text):
    """Extract HbA1c (%); values outside 0-20 become NaN."""
    if not isinstance(text, str):
        return np.nan
    t = text.lower()
    # Local window: a number within 20 non-digit characters after "hba1c"/"a1c"
    m = re.search(r'(?:hba1c|a1c)[^0-9]{0,20}(\d{1,2}(?:\.\d+)?)\s*%?', t)
    if m:
        v = safe_float(m.group(1))
        return v if (not np.isnan(v) and 0 <= v <= 20) else np.nan

    # Fallback: a qualitative description mapped to a representative value
    pattern = re.compile(
        r'\b(?:hba1c(?: level)?s?\b.{0,10}(?:is|are|was|were|within|of)?\s*(very\s+high|high|elevated|normal|within normal limits|low)|(very\s+high|high|elevated|normal|within normal limits|low)\s+(?:levels\s+of\s+)?hba1c)\b',
        re.IGNORECASE
    )
    m = pattern.search(t)
    if m:
        mapping = {
            "normal": 5.5,
            "elevated": 6.3,
            "high": 6.5,
            "very high": 8
        }
        label = " ".join((m.group(1) or m.group(2)).split())
        # Labels without a mapped value ("low", "within normal limits") become NaN
        return mapping.get(label, np.nan)

    outliers["hba1c"].append(t)
    return np.nan

def extract_glucose(text):
    """Extract glucose (random or postprandial) in mg/dL; no range filter is applied."""
    if not isinstance(text, str):
        return np.nan
    t = text.lower()
    # Prefer "glucose" followed by a 2-3 digit number
    m = re.search(r'\bglucose\b[^0-9]{0,20}(\d{2,3})(?:\s*mg/dl)?', t)
    if m:
        return safe_float(m.group(1))

    # Fallback: a qualitative description mapped to a representative value
    pattern = re.compile(
        r'\b(?:'
        r'(?:\w+\s+)?glucose(?: (?:level|levels|measurement|measurements|reading|readings))?\b.{0,10}(?:is|are|was|were|within|of)?\s*(?P<val>very\s+high|high|elevated|normal|within normal limits|low|decreased|reduced|abnormal)'
        r'|'
        r'(?P<val2>very\s+high|high|elevated|normal|within normal limits|low|decreased|reduced|abnormal)\s+(?:\w+\s+)?glucose(?: (?:level|levels|measurement|measurements|reading|readings))?'
        r')\b',
        re.IGNORECASE
    )
    m = pattern.search(t)
    if m:
        mapping = {
            "low": 70,
            "normal": 140,
            "within normal limits": 140,
            "elevated": 165,
            "high": 200,
            "abnormal": 200,
            "very high": 250,
        }
        label = " ".join((m.group("val") or m.group("val2")).split())
        # Labels without a mapped value ("decreased", "reduced") become NaN
        return mapping.get(label, np.nan)

    outliers["glucose"].append(t)
    return np.nan

def extract_flags(text):
    """Extract hypertension, heart disease, and smoking status (negations are respected)."""
    if not isinstance(text, str):
        return 0, 0, "unknown"

    t = text.lower()
    NEG_PAT = r'(?:no\s+|without\s+|denies\s+|negative\s+for\s+|no\s+history\s+of\s+)'

    # Hypertension: a negated mention overrides a positive one
    hyp_neg = bool(re.search(NEG_PAT + r'(?:hypertension|high\s+blood\s+pressure)', t))
    hyp_pos = bool(re.search(r'\bhypertension\b|\bhigh\s+blood\s+pressure\b', t)) and not hyp_neg
    has_hypertension = 1 if hyp_pos else 0

    # Heart disease: a negated mention overrides a positive one
    hd_neg = bool(re.search(NEG_PAT + r'(?:heart\s+disease|cardiovascular)', t))
    hd_pos = bool(re.search(r'\bheart\s+disease\b|\bcardiovascular', t)) and not hd_neg
    has_heart_disease = 1 if hd_pos else 0

    # Smoking status: checked from most to least specific
    if re.search(r'\bnon-smoker\b|\bnever\s+smoked\b', t):
        smoking = "never"
    elif re.search(r'\b(?:past|former)\s+(?:smoker|smoking)\b', t):
        smoking = "past"
    elif re.search(r'\bcurrent\s+smoker\b|\bis\s+a\s+smoker\b|\bsmoker\b', t):
        smoking = "current"
    else:
        smoking = "unknown"

    return has_hypertension, has_heart_disease, smoking

# TEST: check the extraction on a small sample
for i in range(min(3, len(df_train_raw))):
    note = df_train_raw["medical_note"].iloc[i]
    age = extract_age(note)
    gender = extract_gender(note)
    bmi = extract_bmi(note)
    hba1c = extract_hba1c(note)
    glucose = extract_glucose(note)
    hyp, hd, smoking = extract_flags(note)

    age_s    = "NaN" if (age is None or (isinstance(age, float) and np.isnan(age))) else f"{int(age)}"
    bmi_s    = "NaN" if (bmi is None or np.isnan(bmi)) else f"{bmi:.1f}"
    hba1c_s  = "NaN" if (hba1c is None or np.isnan(hba1c)) else f"{hba1c:.1f}"
    glucose_s= "NaN" if (glucose is None or np.isnan(glucose)) else f"{glucose:.0f}"

    print(
        f"\n  [{i}] age={age_s}, gender={gender}, bmi={bmi_s}, "
        f"hba1c={hba1c_s}, glucose={glucose_s}, hyp={hyp}, hd={hd}, smoking={smoking}"
    )


print("\n✅ Extraction verified")


  [0] age=16, gender=female, bmi=21.5, hba1c=6.2, glucose=140, hyp=0, hd=1, smoking=never

  [1] age=15, gender=female, bmi=33.6, hba1c=5.5, glucose=158, hyp=1, hd=1, smoking=unknown

  [2] age=54, gender=male, bmi=21.5, hba1c=5.5, glucose=145, hyp=0, hd=1, smoking=current

✅ Extraction verified
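The negation handling in `extract_flags` can be checked in isolation. The sketch below reuses the same `NEG_PAT` cues for the hypertension flag only; the helper name `hypertension_flag` and the example notes are made up for illustration and are not taken from the dataset.

```python
import re

# Same negation cues as NEG_PAT in extract_flags.
NEG_PAT = r'(?:no\s+|without\s+|denies\s+|negative\s+for\s+|no\s+history\s+of\s+)'
TERM = r'(?:hypertension|high\s+blood\s+pressure)'

def hypertension_flag(note):
    """Return 1 if hypertension is mentioned and never negated, else 0."""
    t = note.lower()
    negated = re.search(NEG_PAT + TERM, t) is not None
    mentioned = re.search(r'\b' + TERM + r'\b', t) is not None
    return int(mentioned and not negated)

# Hypothetical notes, written for illustration only.
examples = [
    "Patient has hypertension and is a former smoker.",
    "No history of hypertension.",
    "Denies high blood pressure.",
    "History of HTN, on lisinopril.",   # abbreviation is not covered
    "No fever. Known hypertension.",    # negation applies to fever only
]
flags = [hypertension_flag(n) for n in examples]
print(flags)  # -> [1, 0, 0, 0, 1]
```

Because a single negated mention overrides every affirmative one, a note such as "no hypertension in 2019, hypertension diagnosed in 2023" would be flagged 0, and abbreviations like "HTN" are missed entirely.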


In [56]:
# Apply to TRAIN
print("\n Train: extracting features...")
df_train_raw["age"] = df_train_raw["medical_note"].apply(extract_age)
df_train_raw["gender"] = df_train_raw["medical_note"].apply(extract_gender)
df_train_raw["bmi"] = df_train_raw["medical_note"].apply(extract_bmi)
df_train_raw["hba1c"] = df_train_raw["medical_note"].apply(extract_hba1c)
df_train_raw["glucose"] = df_train_raw["medical_note"].apply(extract_glucose)

tmp_train = df_train_raw["medical_note"].apply(extract_flags)
df_train_raw["has_hypertension"] = [t[0] for t in tmp_train]
df_train_raw["has_heart_disease"] = [t[1] for t in tmp_train]
df_train_raw["smoking_status"] = [t[2] for t in tmp_train]

# Apply to TEST
print(" Test: extracting features...")
df_test_raw["age"] = df_test_raw["medical_note"].apply(extract_age)
df_test_raw["gender"] = df_test_raw["medical_note"].apply(extract_gender)
df_test_raw["bmi"] = df_test_raw["medical_note"].apply(extract_bmi)
df_test_raw["hba1c"] = df_test_raw["medical_note"].apply(extract_hba1c)
df_test_raw["glucose"] = df_test_raw["medical_note"].apply(extract_glucose)

tmp_test = df_test_raw["medical_note"].apply(extract_flags)
df_test_raw["has_hypertension"] = [t[0] for t in tmp_test]
df_test_raw["has_heart_disease"] = [t[1] for t in tmp_test]
df_test_raw["smoking_status"] = [t[2] for t in tmp_test]

print("\n Features extracted successfully")
print(f"\n TRAIN SAMPLE (first 3 rows):")
cols = ['patient_id', 'age', 'gender', 'bmi', 'hba1c', 'glucose', 'smoking_status', 'has_diabetes']
print(df_train_raw[cols].head(3))


 Train: extracting features...
 Test: extracting features...

 Features extracted successfully

 TRAIN SAMPLE (first 3 rows):
   patient_id   age  gender    bmi  hba1c  glucose smoking_status  \
0       82555  16.0  female  21.49    6.2    140.0          never   
1       92299  15.0  female  33.62    5.5    158.0        unknown   
2       18725  54.0    male  21.46    5.5    145.0        current   

   has_diabetes  
0             0  
1             0  
2             0  
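The numeric extractors rely on a bounded window of non-digit characters between the keyword and the value, so that a number belonging to a different measurement is not picked up by mistake. The following sketch isolates the HbA1c pattern used in this notebook; the wrapper `hba1c_value` and the example notes are hypothetical.

```python
import re

# Same pattern as the first attempt in extract_hba1c: the value must appear
# within 20 non-digit characters after "hba1c" or "a1c".
HBA1C_RE = re.compile(r'(?:hba1c|a1c)[^0-9]{0,20}(\d{1,2}(?:\.\d+)?)\s*%?')

def hba1c_value(note):
    """Return the HbA1c value as a float, or None if no number is in range."""
    m = HBA1C_RE.search(note.lower())
    if not m:
        return None
    v = float(m.group(1))
    return v if 0 <= v <= 20 else None

# Hypothetical notes, written for illustration only.
notes = [
    "HbA1c: 6.2%, glucose 140 mg/dL.",
    "Her A1c was measured at 7.5 last month.",
    "HbA1c levels are elevated; fasting glucose 130.",   # 130 is too far away
    "HbA1c not available at this visit; weight 82 kg.",  # 82 is too far away
]
values = [hba1c_value(n) for n in notes]
print(values)  # -> [6.2, 7.5, None, None]
```

In the third note the window stops before the glucose reading, which is the case where the notebook's extractor falls back to the qualitative label ("elevated") instead.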


## 3️⃣ BioClinicalBERT: Generate Embeddings (768-dim)

In [57]:
def load_bioclinicalbert():
    """Load Bio_ClinicalBERT from the Hugging Face Hub."""
    model_name = "emilyalsentzer/Bio_ClinicalBERT"  # model ID on the Hugging Face Hub
    print(f" Loading {model_name}...")
    tokenizer = AutoTokenizer.from_pretrained(model_name)  # converts text into token IDs
    model = AutoModel.from_pretrained(model_name)  # encoder that produces contextual token embeddings
    model.eval()  # inference mode: disables dropout
    model.to(device)  # move the model to the GPU (Kaggle)
    return tokenizer, model

def mean_pool(last_hidden_state, attention_mask):
    """Mean pooling over tokens, ignoring padding via the attention mask."""
    mask = attention_mask.unsqueeze(-1).expand(last_hidden_state.size()).float()  # 1 for real tokens, 0 for padding
    sum_hidden = (last_hidden_state * mask).sum(dim=1)  # zero out padding positions and sum the rest
    sum_mask = torch.clamp(mask.sum(dim=1), min=1e-9)  # clamp avoids division by zero on the next line
    return sum_hidden / sum_mask  # mean over the non-padding tokens

def embed_text(text, tokenizer, model, max_length=512):
    """Return one 768-dim embedding for a text (zeros if the text is empty or not a string)."""
    if not isinstance(text, str) or len(text.strip()) == 0:
        return np.zeros(768, dtype=np.float32)

    tokens = tokenizer(text, padding=True, truncation=True, max_length=max_length, return_tensors="pt").to(device)  # tokenize the note

    with torch.no_grad():  # no gradients: the model is not being trained
        output = model(**tokens)  # forward pass
        pooled = mean_pool(output.last_hidden_state, tokens["attention_mask"])  # masked mean over tokens

    return pooled.cpu().numpy()[0].astype(np.float32)

# Load the model (once)
tokenizer, model = load_bioclinicalbert()

# Generate TRAIN embeddings
print("\n Generating TRAIN embeddings...")
train_embeddings = []
for note in tqdm(df_train_raw["medical_note"].tolist(), desc="Train embeddings", total=len(df_train_raw)):
    emb = embed_text(note, tokenizer, model)
    train_embeddings.append(emb)

df_train_raw["embedding"] = train_embeddings

# Generate TEST embeddings
print("\n Generating TEST embeddings...")
test_embeddings = []
for note in tqdm(df_test_raw["medical_note"].tolist(), desc="Test embeddings", total=len(df_test_raw)):
    emb = embed_text(note, tokenizer, model)
    test_embeddings.append(emb)

df_test_raw["embedding"] = test_embeddings


 Loading emilyalsentzer/Bio_ClinicalBERT...

 Generating TRAIN embeddings...


Train embeddings: 100%|██████████| 3000/3000 [00:30<00:00, 98.18it/s]



 Generating TEST embeddings...


Test embeddings: 100%|██████████| 300/300 [00:03<00:00, 98.75it/s]
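To see what the masked mean in `mean_pool` does, here is a small NumPy re-implementation on a toy tensor. It is only a sketch: `mean_pool_np` is a hypothetical stand-in for the PyTorch function, and the numbers are invented so that the effect of a padding token is easy to read.

```python
import numpy as np

def mean_pool_np(last_hidden_state, attention_mask):
    """Average token vectors per sequence, counting only positions where mask == 1."""
    mask = attention_mask[..., None].astype(np.float32)   # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(axis=1)       # padding contributes 0
    counts = np.clip(mask.sum(axis=1), 1e-9, None)        # avoid division by zero
    return summed / counts

# One sequence, three positions, hidden size 2; the last position is padding.
hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]], dtype=np.float32)
mask = np.array([[1, 1, 0]])

pooled = mean_pool_np(hidden, mask)   # averages only the two real tokens
naive = hidden.mean(axis=1)           # also averages the padding vector
print(pooled, naive)
```

Since `embed_text` encodes one note at a time, the attention mask is all ones and the masked mean equals a plain mean; the mask starts to matter if notes are batched and padded to a common length.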


## 4️⃣ Group by Patient + Expand Embeddings

In [58]:
def most_common(series):
    """Return the most frequent non-null value, or 'unknown' if there is none."""
    modes = series.dropna().mode()
    return modes.iat[0] if not modes.empty else "unknown"

def emb_mean(series):
    """Average a patient's embeddings element-wise."""
    stacked = np.vstack(series.values)
    return stacked.mean(axis=0)


# TRAIN aggregation
print("\n Grouping TRAIN by patient_id...")
agg_dict = {
    "medical_note": "count",
    "has_diabetes": "first",
    "age": "mean",
    "gender": most_common,
    "bmi": "mean",
    "hba1c": "mean",
    "glucose": "mean",
    "has_hypertension": "max",
    "has_heart_disease": "max",
    "smoking_status": most_common,
    "embedding": emb_mean
}

df_train_agg = df_train_raw.groupby("patient_id").agg(agg_dict).reset_index()  # merges rows in case a patient appears more than once
df_train_agg.rename(columns={"medical_note": "note_count"}, inplace=True)

print(f" Train grouped: {df_train_agg.shape[0]} unique patients")

# TEST aggregation (same rules, without has_diabetes)
print("\n Grouping TEST by patient_id...")
agg_dict_test = {k: v for k, v in agg_dict.items() if k != "has_diabetes"}

df_test_agg = df_test_raw.groupby("patient_id").agg(agg_dict_test).reset_index()
df_test_agg.rename(columns={"medical_note": "note_count"}, inplace=True)

print(f" Test grouped: {df_test_agg.shape[0]} unique patients")

# Expand the embeddings into columns
print("\n Expanding embeddings (768 columns)...")

emb_train = np.vstack(df_train_agg["embedding"].values)
emb_test = np.vstack(df_test_agg["embedding"].values)

# One DataFrame column per embedding dimension
emb_cols = [f"emb_{i}" for i in range(emb_train.shape[1])]
emb_df_train = pd.DataFrame(emb_train, columns=emb_cols)
emb_df_test = pd.DataFrame(emb_test, columns=emb_cols)

df_train_final = pd.concat([df_train_agg.drop(columns=["embedding"]).reset_index(drop=True), emb_df_train], axis=1)
df_test_final = pd.concat([df_test_agg.drop(columns=["embedding"]).reset_index(drop=True), emb_df_test], axis=1)

print(f"\n Train final: {df_train_final.shape}")
print(f" Test final: {df_test_final.shape}")

print(f"\n FIRST TRAIN ROWS (with age and gender):")
print(df_train_final[['patient_id', 'age', 'gender', 'bmi', 'hba1c', 'glucose', 'smoking_status', 'has_diabetes', 'note_count']].head(5))



 Grouping TRAIN by patient_id...
 Train grouped: 3000 unique patients

 Grouping TEST by patient_id...
 Test grouped: 300 unique patients

 Expanding embeddings (768 columns)...

 Train final: (3000, 779)
 Test final: (300, 778)

 FIRST TRAIN ROWS (with age and gender):
   patient_id   age  gender    bmi  hba1c  glucose smoking_status  \
0           5  23.0    male  21.05    6.5    200.0        current   
1          14  70.0  female  32.63    5.5    165.0        unknown   
2          36  42.0  female  31.50    5.8    200.0          never   
3          67  71.0    male  39.03    6.3      NaN          never   
4         127  66.0  female  23.58    5.8    145.0          never   

   has_diabetes  note_count  
0             0           1  
1             0           1  
2             0           1  
3             1           1  
4             1           1  


In [59]:
df_train_final

Unnamed: 0,patient_id,note_count,has_diabetes,age,gender,bmi,hba1c,glucose,has_hypertension,has_heart_disease,...,emb_758,emb_759,emb_760,emb_761,emb_762,emb_763,emb_764,emb_765,emb_766,emb_767
0,5,1,0,23.0,male,21.05,6.5,200.0,0,1,...,0.168776,-0.137289,-0.300718,0.055569,-0.012067,-0.191791,0.101528,0.040025,-0.036748,-0.125205
1,14,1,0,70.0,female,32.63,5.5,165.0,0,1,...,0.140446,-0.112927,-0.244349,-0.033866,0.026291,-0.076418,0.078653,0.214932,-0.077384,0.030703
2,36,1,0,42.0,female,31.50,5.8,200.0,0,1,...,0.144146,-0.082072,-0.269926,-0.027545,0.041318,-0.097104,0.003274,0.072266,-0.098089,-0.058813
3,67,1,1,71.0,male,39.03,6.3,,0,1,...,0.158193,-0.121447,-0.211743,-0.014441,-0.079715,-0.143533,0.071336,0.013479,-0.085260,0.011425
4,127,1,1,66.0,female,23.58,5.8,145.0,0,1,...,0.054226,-0.093798,-0.332355,-0.064596,0.043048,-0.200588,0.136516,0.041142,-0.021674,-0.084981
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2995,95954,1,0,61.0,male,24.52,6.6,165.0,0,1,...,0.172905,-0.028232,-0.289735,0.031709,-0.036546,-0.199922,0.011492,0.098776,-0.031317,-0.044507
2996,96030,1,0,73.0,female,25.07,6.3,158.0,0,1,...,0.050674,-0.060000,-0.283743,-0.085756,-0.056065,-0.177056,0.061681,0.067599,-0.042998,-0.052660
2997,96035,1,0,18.0,male,37.65,6.5,90.0,0,1,...,0.141482,-0.078641,-0.259597,0.134808,0.033681,-0.146596,0.002997,0.081718,0.014971,-0.069236
2998,96093,1,1,74.0,female,6.00,6.8,165.0,1,0,...,0.124343,0.006571,-0.255402,-0.014538,-0.036401,-0.155102,0.129003,0.092416,-0.071230,-0.012100


## 5️⃣ Exploratory Data Analysis (Full EDA)

In [60]:


print(f"\n AGE:")
print(f"   Mean: {df_train_final['age'].mean():.1f} years")
print(f"   Median: {df_train_final['age'].median():.1f} years")
print(f"   Range: {df_train_final['age'].min():.0f} - {df_train_final['age'].max():.0f} years")
print(f"   Missing: {df_train_final['age'].isna().sum()}")

print(f"\n GENDER:")
gen_dist = df_train_final['gender'].value_counts()
for g, c in gen_dist.items():
    print(f"   {g}: {c} ({c/len(df_train_final)*100:.1f}%)")

print(f"\n BMI:")
print(f"   Mean: {df_train_final['bmi'].mean():.2f} ± {df_train_final['bmi'].std():.2f}")
print(f"   Range: {df_train_final['bmi'].min():.2f} - {df_train_final['bmi'].max():.2f}")
print(f"   Missing: {df_train_final['bmi'].isna().sum()}")

print(f"\n HbA1c:")
print(f"   Mean: {df_train_final['hba1c'].mean():.2f} ± {df_train_final['hba1c'].std():.2f}")
print(f"   Range: {df_train_final['hba1c'].min():.2f} - {df_train_final['hba1c'].max():.2f}")
print(f"   Missing: {df_train_final['hba1c'].isna().sum()}")

print(f"\n GLUCOSE:")
print(f"   Mean: {df_train_final['glucose'].mean():.2f} ± {df_train_final['glucose'].std():.2f}")
print(f"   Range: {df_train_final['glucose'].min():.2f} - {df_train_final['glucose'].max():.2f}")
print(f"   Missing: {df_train_final['glucose'].isna().sum()}")

print(f"\n HYPERTENSION:")
hyp_count = df_train_final['has_hypertension'].sum()
print(f"   With hypertension: {hyp_count} ({hyp_count/len(df_train_final)*100:.1f}%)")

print(f"\n HEART DISEASE:")
hd_count = df_train_final['has_heart_disease'].sum()
print(f"   With heart disease: {hd_count} ({hd_count/len(df_train_final)*100:.1f}%)")

print(f"\n SMOKING:")
smoke_dist = df_train_final['smoking_status'].value_counts()
for s, c in smoke_dist.items():
    print(f"   {s}: {c} ({c/len(df_train_final)*100:.1f}%)")

print(f"\n DIABETES (TARGET):")
diab_dist = df_train_final['has_diabetes'].value_counts()
print(f"   Negative (0): {diab_dist[0]} ({diab_dist[0]/len(df_train_final)*100:.1f}%)")
print(f"   Positive (1): {diab_dist[1]} ({diab_dist[1]/len(df_train_final)*100:.1f}%)")

print(f"\n CORRELATION WITH DIABETES:")
numeric_cols = ['age', 'bmi', 'hba1c', 'glucose', 'has_hypertension', 'has_heart_disease', 'note_count']
corr = df_train_final[numeric_cols + ['has_diabetes']].corr()['has_diabetes'].sort_values(ascending=False)
print(corr.head(8))



 AGE:
   Mean: 46.3 years
   Median: 49.0 years
   Range: 1 - 80 years
   Missing: 9

 GENDER:
   female: 1663 (55.4%)
   male: 1337 (44.6%)

 BMI:
   Mean: 28.05 ± 7.77
   Range: 0.00 - 72.21
   Missing: 5

 HbA1c:
   Mean: 6.25 ± 1.07
   Range: 3.50 - 9.00
   Missing: 115

 GLUCOSE:
   Mean: 161.97 ± 43.13
   Range: 15.00 - 300.00
   Missing: 211

 HYPERTENSION:
   With hypertension: 1017 (33.9%)

 HEART DISEASE:
   With heart disease: 2874 (95.8%)

 SMOKING:
   never: 1566 (52.2%)
   unknown: 768 (25.6%)
   past: 552 (18.4%)
   current: 114 (3.8%)

 DIABETES (TARGET):
   Negative (0): 2100 (70.0%)
   Positive (1): 900 (30.0%)

 CORRELATION WITH DIABETES:
has_diabetes         1.000000
hba1c                0.535270
glucose              0.446588
age                  0.429397
bmi                  0.315766
has_hypertension     0.199608
has_heart_disease   -0.062372
note_count                NaN
Name: has_diabetes, dtype: float64


In [61]:
# Standardise the numeric clinical features.
# Each scaler is fitted on train only and then reused on test to avoid leakage.
num_cols = ["bmi", "hba1c", "glucose", "age"]
scalers = {col: StandardScaler() for col in num_cols}

for col, scaler in scalers.items():
    df_train_final[col] = scaler.fit_transform(df_train_final[[col]])
    df_test_final[col] = scaler.transform(df_test_final[[col]])
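The scaling step amounts to a z-score computed with train statistics only. As a rough illustration, the NumPy sketch below reproduces that arithmetic on made-up glucose values; it assumes the behaviour documented for scikit-learn's `StandardScaler` (population standard deviation, missing values ignored when fitting and passed through when transforming).

```python
import numpy as np

# Illustrative values only; NaN marks a note where no glucose was found.
train_glucose = np.array([140.0, 158.0, 145.0, np.nan, 200.0])
test_glucose = np.array([165.0, np.nan])

# Fit: statistics come from the train column, skipping NaN.
mu = np.nanmean(train_glucose)
sigma = np.nanstd(train_glucose)   # ddof=0, i.e. population standard deviation

# Transform: the same mu and sigma are applied to train and test.
train_z = (train_glucose - mu) / sigma
test_z = (test_glucose - mu) / sigma

print(mu)        # -> 160.75
print(train_z)   # NaN stays NaN; the other entries have mean 0 and std 1
print(test_z)
```

One consequence is that filling missing values with the train mean after scaling is the same as filling them with a value very close to 0.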

## 6️⃣ Preparing the Data for Modeling

In [62]:
# Build X_train and y_train
X_train = df_train_final.drop(columns=["patient_id", "has_diabetes"])
y_train = df_train_final["has_diabetes"]

# Build X_test
X_test = df_test_final.drop(columns=["patient_id"])

# Fill NaNs using statistics computed on train only
print("\n Filling missing values...")

for col in X_train.columns:
    if X_train[col].dtype == 'object':
        # Categorical: use the train mode
        mode_val = X_train[col].mode()[0] if not X_train[col].mode().empty else "unknown"
        X_train[col] = X_train[col].fillna(mode_val)
        X_test[col] = X_test[col].fillna(mode_val)
    else:
        # Numeric: use the train mean
        mean_val = X_train[col].mean()
        X_train[col] = X_train[col].fillna(mean_val)
        X_test[col] = X_test[col].fillna(mean_val)

# One-hot encode the categorical columns
print("\n One-hot encoding gender and smoking_status...")
X_train = pd.get_dummies(X_train, columns=['gender', 'smoking_status'], drop_first=True)
# Keep every level on test: the baseline category to drop is decided by train, not by test
X_test = pd.get_dummies(X_test, columns=['gender', 'smoking_status'])

# Align test to the train columns (missing dummies become 0, extra ones are dropped)
X_train = X_train[sorted(X_train.columns)]
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)

print(f"\n X_train final shape: {X_train.shape}")
print(f" X_test final shape: {X_test.shape}")
print(f"\n Feature columns: {X_train.columns.tolist()[:15]}... (+{len(X_train.columns)-15})")



 Filling missing values...

 One-hot encoding gender and smoking_status...

 X_train final shape: (3000, 779)
 X_test final shape: (300, 779)

 Feature columns: ['age', 'bmi', 'emb_0', 'emb_1', 'emb_10', 'emb_100', 'emb_101', 'emb_102', 'emb_103', 'emb_104', 'emb_105', 'emb_106', 'emb_107', 'emb_108', 'emb_109']... (+764)
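When train and test are one-hot encoded separately, `drop_first=True` drops the alphabetically first category present in each frame, which is not necessarily the same category in both. The toy example below, with invented rows, sketches how that can mislabel a test row and how reindexing the fully encoded test frame to the train columns avoids it.

```python
import pandas as pd

# Invented rows: the test set happens to contain no "current" or "past" smokers.
train = pd.DataFrame({"smoking_status": ["never", "past", "current", "unknown"]})
test = pd.DataFrame({"smoking_status": ["never", "unknown"]})

X_tr = pd.get_dummies(train, columns=["smoking_status"], drop_first=True, dtype=int)

# Pitfall: on test, drop_first removes "never" (its first level), so after
# alignment the "never" patient looks like the train baseline ("current").
X_te_naive = pd.get_dummies(test, columns=["smoking_status"], drop_first=True, dtype=int)
X_te_naive = X_te_naive.reindex(columns=X_tr.columns, fill_value=0)

# Safer: encode every test level, then keep exactly the train columns.
X_te_aligned = pd.get_dummies(test, columns=["smoking_status"], dtype=int)
X_te_aligned = X_te_aligned.reindex(columns=X_tr.columns, fill_value=0)

print(X_te_naive["smoking_status_never"].tolist())    # -> [0, 0]
print(X_te_aligned["smoking_status_never"].tolist())  # -> [1, 0]
```

With 300 test patients every category is probably present and both approaches would then coincide, but the reindex-based alignment does not depend on that.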


## 7️⃣ Ensemble with Weighted Soft Voting

CatBoost () + XGBoost () + LightGBM ()

LightGBM () + Random Forest () + Logistic Regression ()

CatBoost () + Random Forest () + Logistic Regression ()

In [69]:
# ===== LIBRARIES =====
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import VotingClassifier, StackingClassifier
from sklearn.model_selection import StratifiedKFold

# Validation split (80/20, stratified on the target)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42, stratify=y_train
)

# ===== UTILITIES =====
def auc_weights(models, X_tr, y_tr, X_val, y_val):
    """Fit each model and turn its validation AUC into a normalised voting weight."""
    aucs, names, fitted = [], [], []
    for name, model in models:
        model.fit(X_tr, y_tr)
        proba = model.predict_proba(X_val)[:, 1]
        auc = roc_auc_score(y_val, proba)
        aucs.append(auc)
        names.append(name)
        fitted.append(model)
        print(f"{name}: AUC={auc:.4f}")
    # Weight is proportional to (AUC - 0.5), i.e. the gain over random guessing
    auc_clipped = np.clip(aucs, 0.5, 1.0)
    raw = (np.array(auc_clipped) - 0.5) + 1e-6
    w = (raw / raw.sum()).tolist()
    print("Weights:", {n: round(wi, 3) for n, wi in zip(names, w)})
    return list(zip(names, fitted)), w

def report(name, model, X_val, y_val):
    """Print AUC, F1, and accuracy on the validation set."""
    y_hat = model.predict(X_val)
    y_pb  = model.predict_proba(X_val)[:, 1]
    auc = roc_auc_score(y_val, y_pb)
    f1  = f1_score(y_val, y_hat, zero_division=0)
    acc = accuracy_score(y_val, y_hat)
    print(f"[{name}] AUC={auc:.4f}  F1={f1:.4f}  Acc={acc:.4f}")

# Class imbalance ratio (negatives per positive)
pos = max(1, int(y_tr.sum()))
neg = int(len(y_tr) - pos)
spw = neg / pos
print(f"scale_pos_weight={spw:.2f}")

# ===== META-MODEL FOR STACKING =====
# Not used below; kept for a stacking variant of the ensemble
meta_lr = LogisticRegression(max_iter=2000, solver='lbfgs', class_weight='balanced', C=0.5, random_state=42)
cv5 = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# ===== CATB + XGB + SVM =====
catb_opt = CatBoostClassifier(
    iterations=650, depth=6, learning_rate=0.05, l2_leaf_reg=3.0,
    loss_function='Logloss', eval_metric='Logloss', class_weights=[1.0, spw],
    random_state=42, verbose=0
)

xgb_opt = XGBClassifier(
    n_estimators=800, max_depth=5, learning_rate=0.03,
    subsample=0.8, colsample_bytree=0.8, reg_lambda=1.0,
    eval_metric='logloss', tree_method='hist', n_jobs=-1, random_state=42,
    scale_pos_weight=spw
)

# SVM with probability estimates enabled (required for soft voting)
svm_opt = SVC(
    kernel='rbf', C=1.0, gamma='scale',
    class_weight='balanced', probability=True,
    random_state=42
)

models_opt = [('catb', catb_opt), ('xgb', xgb_opt), ('svm', svm_opt)]
fitted_opt, weights_opt = auc_weights(models_opt, X_tr, y_tr, X_val, y_val)

# Voting classifier (refits clones of the three models on X_tr)
vote_opt = VotingClassifier(estimators=fitted_opt, voting='soft', weights=weights_opt)
vote_opt.fit(X_tr, y_tr)
report("Voting CATB+XGB+SVM", vote_opt, X_val, y_val)


# Select the best model and refit it on the full training set
best_model = vote_opt  # or a stacking model, if one is built and performs better
best_model.fit(X_train, y_train)


scale_pos_weight=2.33
catb: AUC=0.9246
xgb: AUC=0.9251
svm: AUC=0.9275
Weights: {'catb': 0.332, 'xgb': 0.333, 'svm': 0.335}
[Voting CATB+XGB+SVM] AUC=0.9287  F1=0.7331  Acc=0.8483
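The weights printed above follow directly from the three validation AUCs. As a quick check, the sketch below redoes the arithmetic in plain Python and then applies the weights to a single patient; `auc_to_weights` is a hypothetical re-implementation of the weighting rule, and the per-model probabilities are invented.

```python
# Validation AUCs reported in the cell output above.
aucs = {"catb": 0.9246, "xgb": 0.9251, "svm": 0.9275}

def auc_to_weights(aucs, eps=1e-6):
    """Weight each model by its AUC gain over 0.5, normalised to sum to 1."""
    raw = {name: min(max(auc, 0.5), 1.0) - 0.5 + eps for name, auc in aucs.items()}
    total = sum(raw.values())
    return {name: value / total for name, value in raw.items()}

weights = auc_to_weights(aucs)
rounded = {name: round(w, 3) for name, w in weights.items()}
print(rounded)  # -> {'catb': 0.332, 'xgb': 0.333, 'svm': 0.335}

# Soft voting averages the class probabilities with these weights.
# Hypothetical P(diabetes) from each model for one patient:
probas = {"catb": 0.40, "xgb": 0.55, "svm": 0.62}
p_vote = sum(weights[name] * probas[name] for name in weights)
label = int(p_vote >= 0.5)
print(round(p_vote, 3), label)
```

Because the three AUCs are within 0.003 of each other, the resulting weights are almost uniform, so this ensemble behaves very much like an unweighted average of the three models.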


## 8️⃣ Final Predictions on Test

In [70]:

y_pred_test = best_model.predict(X_test)
y_pred_proba_test = best_model.predict_proba(X_test)[:, 1]

# Build the submission
submission = pd.DataFrame({
    'patient_id': df_test_final['patient_id'],
    'has_diabetes': y_pred_test,
    'probability': y_pred_proba_test
})

print(f"\n {len(submission)} predictions generated")

print(f"\n Prediction distribution:")
dist = submission['has_diabetes'].value_counts()
print(f"   Negative (0): {dist[0]} ({dist[0]/len(submission)*100:.1f}%)")
print(f"   Positive (1): {dist[1]} ({dist[1]/len(submission)*100:.1f}%)")

print(f"\n Probabilities:")
print(f"   Mean: {submission['probability'].mean():.4f}")
print(f"   Min: {submission['probability'].min():.4f}")
print(f"   Max: {submission['probability'].max():.4f}")

print(f"\n FIRST 10 PREDICTIONS:")
print(submission.head(10).to_string(index=False))

# Save: IDs are written as "patient_" plus the number zero-padded to 5 digits
def parse_submission(pid):
    return "patient_" + str(pid).zfill(5)

submission["patient_id"] = submission["patient_id"].apply(parse_submission)
# Only patient_id and has_diabetes are written; the probability column is not saved
submission[["patient_id","has_diabetes"]].to_csv("submission.csv", index=False)



 300 predictions generated

 Prediction distribution:
   Negative (0): 223 (74.3%)
   Positive (1): 77 (25.7%)

 Probabilities:
   Mean: 0.2818
   Min: 0.0001
   Max: 0.9998

 FIRST 10 PREDICTIONS:
 patient_id  has_diabetes  probability
        139             0     0.008924
        252             0     0.005333
        259             0     0.007896
        335             1     0.938705
        699             0     0.001388
        977             0     0.023111
       1025             0     0.148927
       1145             0     0.022740
       1217             1     0.718553
       1235             1     0.862378
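The ID formatting applied before saving can be checked on a few values. This is a minimal sketch: `format_patient_id` is a hypothetical name for the same rule as `parse_submission`, and the three sample IDs appear in the outputs above.

```python
def format_patient_id(pid):
    """Format a numeric ID as 'patient_' plus the number left-padded to 5 digits."""
    return "patient_" + str(pid).zfill(5)

sample_ids = [139, 1025, 96093]
formatted = [format_patient_id(pid) for pid in sample_ids]
print(formatted)  # -> ['patient_00139', 'patient_01025', 'patient_96093']
```

IDs longer than five digits are left unpadded rather than truncated.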


## 9️⃣ Saving the Files

In [None]:
# Parquet (compressed)
df_train_final.to_parquet("df_train_final.parquet", index=False)
df_test_final.to_parquet("df_test_final.parquet", index=False)
print(f"\n Parquet (compressed):")
print(f"   df_train_final.parquet")
print(f"   df_test_final.parquet")

# CSV (first 100 rows, human-readable)
df_train_final.head(100).to_csv("df_train_sample.csv", index=False)
df_test_final.head(100).to_csv("df_test_sample.csv", index=False)
print(f"\n CSV (100-row samples):")
print(f"   df_train_sample.csv")
print(f"   df_test_sample.csv")

print(f"\n Size in memory:")
print(f"   Train: {df_train_final.memory_usage(deep=True).sum() / 1e6:.2f} MB")
print(f"   Test: {df_test_final.memory_usage(deep=True).sum() / 1e6:.2f} MB")

print(f"\n FILES GENERATED:")
print(f"   1. submission.csv (final predictions)")
print(f"   2. df_train_final.parquet (train features + embeddings)")
print(f"   3. df_test_final.parquet (test features + embeddings)")
print(f"   4. df_train_sample.csv (train sample)")
print(f"   5. df_test_sample.csv (test sample)")


## 🎉 Final Summary

### ✅ Pipeline Completed:

1. **Loading**: train.json + test.json
2. **Extraction**: age, gender, BMI, HbA1c, glucose, hypertension, heart disease, smoking status
3. **BioClinicalBERT**: one 768-dimensional embedding per note
4. **Grouping**: averaged per patient
5. **Features**: 779 model inputs (7 numeric/binary columns including note_count + 4 one-hot dummies + 768 embedding dimensions)
6. **Modeling**: weighted soft voting over CatBoost + XGBoost + SVM, with an 80/20 validation split
7. **Predictions**: submission.csv with patient_id and has_diabetes

### 📊 Dataset:

- **Train**: 3000 patients labelled for diabetes (2100 negative, 900 positive = 30.0%)
- **Test**: 300 unlabelled patients
- **Clinical features**: age (years, 0-120), gender (male/female/unknown), BMI (0-80), HbA1c (0-20), glucose (mg/dL, no range filter)
- **Embeddings**: 768-dim via Bio_ClinicalBERT, pretrained on MIMIC-III

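### 🎯 Models/Algorithms:

- **Ensemble**: soft voting over CatBoost (650 iterations, depth 6), XGBoost (800 trees, max_depth 5), and an RBF-kernel SVM, weighted by validation AUC
- **Validation**: 80/20 train/val split, stratified by has_diabetes
- **Metrics**: ROC-AUC, F1-score, accuracy (validation: AUC 0.9287, F1 0.7331, accuracy 0.8483)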

### 💾 Outputs:

- `submission.csv`: patient_id + has_diabetes (0/1)
- `df_train_final.parquet`: 3000 × 779 (patient_id + note_count + has_diabetes + 8 features + 768 embeddings)
- `df_test_final.parquet`: 300 × 778 (patient_id + note_count + 8 features + 768 embeddings)

### 📖 Next Improvements:

- Add LightGBM, Random Forest, or Logistic Regression as base learners
- Hyperparameter tuning (GridSearchCV/Optuna)
- Stacking with a logistic-regression meta-model instead of soft voting
- Feature engineering (interactions, ratios)
- Neural networks (embeddings fed directly into dense layers)

### 🔗 Reloading the data later without reprocessing:

```python
import pandas as pd
df_train = pd.read_parquet("df_train_final.parquet")
df_test = pd.read_parquet("df_test_final.parquet")
submission = pd.read_csv("submission.csv")
```

---

**Created**: 03-11-2025  
**Version**: 1.0 - Complete, working notebook
