# *NHANES Case Study: Pre-Pandemic Health Baseline Modeling Research* #

# Pre-Pandemic Health Baseline: An Analysis Based on NHANES 2017-2020 Data

This report presents a comprehensive analysis of data collected by the National Health and Nutrition Examination Survey (NHANES) during the 2017-2020 cycle, specifically focusing on pre-pandemic health metrics. The NHANES technical report for this period serves as the primary source of data, documenting extensive information on demographic variables, nutritional assessments, and health indicators that characterize the population's health status before the onset of the COVID-19 pandemic.

This report covers the distributions of age, gender, socioeconomic status, and ethnicity alongside health measures such as body composition, blood pressure, dietary intake, and physical activity. The pre-pandemic data serves as a baseline against which post-pandemic shifts in health trends can be measured and areas for targeted intervention identified. For complete methodologies, statistical treatments, and data interpretations, refer to the NHANES 2017-2020 pre-pandemic technical report, which underpins this analysis.


### Importing Libraries and Datasets

In [20]:
%load_ext kedro.ipython
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy import stats
from scipy.stats import mstats
from scipy.stats import boxcox
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PowerTransformer
from sklearn.impute import KNNImputer
scaler = StandardScaler()
minmax = MinMaxScaler()
pt = PowerTransformer(method='yeo-johnson')
pd.options.display.float_format = '{:.2f}'.format



In [21]:
demografia = catalog.load("demografia")
insulina = catalog.load("insulina")
colesterol = catalog.load("colesterol")
depresion = catalog.load("depresion")
proteinaC = catalog.load("proteinaC")
perfilB = catalog.load("perfilBioquimico")
presion = catalog.load("presionArterial")
medidas = catalog.load("medidasCorporales")

### Variables

In [22]:
dataframes = {
    'demografia': demografia,
    'insulina': insulina,
    'colesterol': colesterol,
    'depresion': depresion,
    'proteinaC': proteinaC,
    'perfilBioquimico': perfilB,
    'presionArterial': presion,
    'medidasCorporales': medidas
}

### Initial Research Questions

Given the variables present in the different datasets, some questions can be posed for future research:

1. What is the impact of waist circumference on serum lipid levels?
2. How do lipid and glucose levels vary by age, and which age groups are most vulnerable to metabolic disorders?
3. How do gender and age affect correlations among these biomarkers?
4. What relationship exists between electrolyte levels and basic bodily functions, such as blood pressure and kidney function?
5. As age increases, how does this affect blood insulin and cholesterol results?
6. Does a higher poverty index level correlate with an increase in depressive symptoms?

### Identifying Targets for Questions and Future Modeling

**Question 1:**  
- **Model Type:** Regression  
- **Target Variable:** Total Cholesterol, refrigerated serum (mg/dL)  
- **Predictive Variables:** Waist Circumference (cm), Triglycerides, refrigerated serum (mg/dL)

**Question 2:**  
- **Model Type:** Classification  
- **Target Variable:** Age Group (e.g., pre-infant, infant, child, adolescent)  
- **Predictive Variables:** Triglycerides, refrigerated serum (mg/dL), Total Cholesterol, refrigerated serum (mg/dL), Glucose, refrigerated serum (mg/dL), Waist Circumference (cm), BMI

**Question 3:**  
- **Model Type:** Classification  
- **Target Variable:** Correlations among biomarkers (e.g., risk level: low, medium, high)  
- **Predictive Variables:** Gender, Age, Total Cholesterol, refrigerated serum (mg/dL), Triglycerides, refrigerated serum (mg/dL), Glucose, refrigerated serum (mg/dL)

**Question 4:**  
- **Model Type:** Regression  
- **Target Variable:** Basic bodily functions (e.g., blood pressure, kidney function)  
- **Predictive Variables:** Electrolyte levels

**Question 5:**  
- **Model Type:** Regression  
- **Target Variable:** Blood Insulin Levels and Total Cholesterol, refrigerated serum (mg/dL)  
- **Predictive Variables:** Age

**Question 6:**  
- **Model Type:** Regression  
- **Target Variable:** Depressive Symptoms  
- **Predictive Variables:** Poverty Index Level, Age, Gender

### Data Preparation Overview ###

In the data preparation phase, we will handle missing values, convert categorical variables, create age groups, scale numerical features, and identify outliers.

In [23]:
demografia.loc[demografia["Edad en años al momento del examen"] <= 13, "Nivel educativo - Adultos 20+"] = demografia.loc[demografia["Edad en años al momento del examen"] <= 13, "Nivel educativo - Adultos 20+"].fillna(1)
demografia.loc[demografia["Edad en años al momento del examen"] <= 19, "Nivel educativo - Adultos 20+"] = demografia.loc[demografia["Edad en años al momento del examen"] <= 19, "Nivel educativo - Adultos 20+"].fillna(2)
demografia.loc[demografia["Edad en años al momento del examen"] <= 18, "Estado civil"] = demografia.loc[demografia["Edad en años al momento del examen"] <= 18, "Estado civil"].fillna(3)

In [24]:
Q1 = insulina["Insulina (μU/mL)"].quantile(0.25)
Q3 = insulina["Insulina (μU/mL)"].quantile(0.75)
IQR = Q3 - Q1

# Calculating the bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

print(f"The IQR is {IQR}. The lower bound is: {lower_bound}, the upper bound is {upper_bound}")

# Identifying outliers
outliers = insulina[(insulina["Insulina (μU/mL)"] < lower_bound) | (insulina["Insulina (μU/mL)"] > upper_bound)]

# Removing outliers
consideracion_insulina_limpio = insulina[~((insulina["Insulina (μU/mL)"] < lower_bound) | (insulina["Insulina (μU/mL)"] > upper_bound))]

# Checking the number of outliers
num_outliers = len(outliers)

print(num_outliers)

The IQR is 10.3. The lower bound is: -9.14, the upper bound is 32.06
310
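
The IQR rule is applied several times in this notebook with the same copy-pasted logic. As a minimal sketch, it could be factored into a helper like the one below. The function names `iqr_bounds` and `drop_iqr_outliers` and the toy data are assumptions for illustration, not part of the original analysis.

```python
import pandas as pd

def iqr_bounds(series: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def drop_iqr_outliers(df: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Drop rows outside the fences; rows with NaN in `column` are kept."""
    lower, upper = iqr_bounds(df[column], k)
    is_outlier = (df[column] < lower) | (df[column] > upper)
    return df[~is_outlier]

# Toy data, made up for illustration (not NHANES values)
toy = pd.DataFrame({"value": [1.0, 2.0, 3.0, 4.0, 100.0, None]})
cleaned = drop_iqr_outliers(toy, "value")
print(iqr_bounds(toy["value"]))
print(cleaned)
```

Comparisons against NaN evaluate to False, so missing values survive the filter. This is the same behavior as the cells in this notebook and is why an imputation step can still follow the outlier removal.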


In [25]:
escalado = scaler.fit_transform(consideracion_insulina_limpio[["Insulina (μU/mL)"]])
insulina_escalado = consideracion_insulina_limpio.copy()
insulina_escalado["Insulina (μU/mL)"]=escalado

In [26]:
Q1 = insulina_escalado["Insulina (μU/mL)"].quantile(0.25)
Q3 = insulina_escalado["Insulina (μU/mL)"].quantile(0.75)
IQR = Q3 - Q1

# Calculating the bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

print(f"The IQR is {IQR}. The lower bound is: {lower_bound}, the upper bound is {upper_bound}")

# Identifying outliers
outliers = insulina_escalado[(insulina_escalado["Insulina (μU/mL)"] < lower_bound) | (insulina_escalado["Insulina (μU/mL)"] > upper_bound)]

# Removing outliers
limpieza_insulina = insulina_escalado[~((insulina_escalado["Insulina (μU/mL)"] < lower_bound) | (insulina_escalado["Insulina (μU/mL)"] > upper_bound))]

# Checking the number of outliers
num_outliers = len(outliers)

print(num_outliers)

The IQR is 1.2941829291513374. The lower bound is: -2.6984244447452377, the upper bound is 2.478307271860112
115


In [27]:
imputador = KNNImputer(n_neighbors=3, weights="uniform")
insulina_limpia = limpieza_insulina.copy()
insulina_limpia["Insulina (μU/mL)"] = imputador.fit_transform(limpieza_insulina[["Insulina (μU/mL)"]])
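
One point worth checking about the cell above: the imputer is given a single column. With no other features available, a row with a missing value has no defined distance to any other row, and scikit-learn documents that `KNNImputer` then falls back to the training average of that feature. The toy array below is an assumption for illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer

# One feature only, with one missing value (toy data, not NHANES values)
X = np.array([[1.0], [2.0], [np.nan], [6.0]])

imputed = KNNImputer(n_neighbors=3, weights="uniform").fit_transform(X)
filled_value = imputed[2, 0]
column_mean = np.nanmean(X)

print(filled_value, column_mean)
```

If neighbor-based imputation is the goal, the imputer would need additional columns (for example age or body measurements) so that distances between participants are defined.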

In [28]:
insulina_limpia["Insulina (μU/mL)"] = pt.fit_transform(insulina_limpia[["Insulina (μU/mL)"]])

In [29]:
Q1 = colesterol["Colesterol Total (mg/dL)"].quantile(0.25)
Q3 = colesterol["Colesterol Total (mg/dL)"].quantile(0.75)
IQR = Q3 - Q1

# Calculating the bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

print(f"The IQR is {IQR}. The lower bound is: {lower_bound}, the upper bound is {upper_bound}")

# Identifying outliers
outliers_colesterol = colesterol[(colesterol["Colesterol Total (mg/dL)"] < lower_bound) | (colesterol["Colesterol Total (mg/dL)"] > upper_bound)]

# Removing outliers
consideracion_colesterol_limpio = colesterol[~((colesterol["Colesterol Total (mg/dL)"] < lower_bound) | (colesterol["Colesterol Total (mg/dL)"] > upper_bound))]

# Checking the number of outliers
num_outliers = len(outliers_colesterol)

print(num_outliers)

The IQR is 52.0. The lower bound is: 71.0, the upper bound is 279.0
177


In [30]:
imputador_colesterol = KNNImputer(n_neighbors=5, weights="uniform")
colesterol_limpio = consideracion_colesterol_limpio.copy()
colesterol_limpio["Colesterol Total (mg/dL)"] = imputador_colesterol.fit_transform(colesterol_limpio[["Colesterol Total (mg/dL)"]])

In [31]:
colesterol_limpio["Colesterol Total (mg/dL)"] = pt.fit_transform(colesterol_limpio[["Colesterol Total (mg/dL)"]])

## Combining Datasets and Modeling ##

To answer each question, we merge the separate datasets into one table per question, keeping only the columns that question needs.

### Question 1: What is the impact of waist circumference on serum lipid levels?

In [32]:
demografia_filtered = demografia[['ID', 'Edad en años al momento del examen']]
perfilBioquimico_filtered = perfilB[['ID', 'Triglicéridos, suero refrigerado (mg/dL)', 'Colesterol Total, suero refrigerado (mg/dL)']]
medidasCorporales_filtered = medidas[['ID', 'Circunferencia de la cintura (cm)', 'Índice de masa corporal (kg/m²)']]

# Merge on the participant ID
question1 = demografia_filtered.merge(perfilBioquimico_filtered, on='ID', how='inner')
question1 = question1.merge(medidasCorporales_filtered, on='ID', how='inner')

# Inspect the resulting DataFrame
print(question1.head())
print("Number of merged records:", question1.shape)

         ID  Edad en años al momento del examen  \
0 109264.00                               13.00   
1 109266.00                               29.00   
2 109271.00                               49.00   
3 109273.00                               36.00   
4 109274.00                               68.00   

   Triglicéridos, suero refrigerado (mg/dL)  \
0                                     54.00   
1                                     86.00   
2                                    101.00   
3                                    178.00   
4                                    151.00   

   Colesterol Total, suero refrigerado (mg/dL)  \
0                                       170.00   
1                                       199.00   
2                                       148.00   
3                                       168.00   
4                                       105.00   

   Circunferencia de la cintura (cm)  Índice de masa corporal (kg/m²)  
0                              63.80 
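
The inner merges above silently drop any participant who is missing from one of the tables. A minimal sketch of how the overlap could be audited before committing to `how='inner'` is shown below. The toy frames and column names (`age`, `waist_cm`) are assumptions for illustration.

```python
import pandas as pd

# Toy tables sharing an ID key (made-up values)
left = pd.DataFrame({"ID": [1, 2, 3], "age": [13, 29, 49]})
right = pd.DataFrame({"ID": [2, 3, 4], "waist_cm": [80.1, 95.4, 70.2]})

# An outer merge with indicator=True labels each row by where it came from;
# validate="one_to_one" raises if an ID is duplicated on either side.
audit = left.merge(right, on="ID", how="outer", indicator=True, validate="one_to_one")
counts = audit["_merge"].value_counts()
print(counts)

# The inner merge keeps only the rows labelled "both"
inner = left.merge(right, on="ID", how="inner")
print(len(inner))
```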

In [33]:
import sys
print(sys.executable)

C:\Program Files\Python311\python.exe


In [49]:
from tpot import TPOTRegressor
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

# Workaround for the `np.float` alias removed in recent NumPy versions
if not hasattr(np, 'float'):
    np.float = float

# Select the predictors and the target
X = question1[['Circunferencia de la cintura (cm)', 'Índice de masa corporal (kg/m²)', 'Edad en años al momento del examen']]
y = question1['Triglicéridos, suero refrigerado (mg/dL)']

# Impute missing values in X and y with the column mean
# (note: imputing the target, and fitting on all rows before the split, can bias the evaluation)
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)
y_imputed = imputer.fit_transform(y.values.reshape(-1, 1)).ravel()

# Standardize the predictors
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_imputed)

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y_imputed, train_size=0.75, test_size=0.25, random_state=42)

# TPOT regression with 10 generations and a population of 40
tpot_regressor = TPOTRegressor(verbosity=2, generations=10, population_size=40, random_state=42, scoring='r2')
tpot_regressor.fit(X_train, y_train)

# Evaluate the model
print("Regresión Score (R²):", tpot_regressor.score(X_test, y_test))

# Make predictions
y_pred = tpot_regressor.predict(X_test)

# Compute evaluation metrics
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print(f"MAE: {mae}")
print(f"MSE: {mse}")
print(f"RMSE: {rmse}")
print(f"R²: {r2}")

# Export the best pipeline
tpot_regressor.export('best_pipeline_regression.py')


Optimization Progress:   0%|          | 0/440 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: 0.0702723519755353

Generation 2 - Current best internal CV score: 0.07125279212325746

Generation 3 - Current best internal CV score: 0.07125279212325746

Generation 4 - Current best internal CV score: 0.07125279212325746

Generation 5 - Current best internal CV score: 0.07125279212325746

Generation 6 - Current best internal CV score: 0.07125279212325746

Generation 7 - Current best internal CV score: 0.07248081105022319

Generation 8 - Current best internal CV score: 0.07248081105022319

Generation 9 - Current best internal CV score: 0.0736079753749836

Generation 10 - Current best internal CV score: 0.0736079753749836

Best pipeline: ExtraTreesRegressor(RidgeCV(input_matrix), bootstrap=True, max_features=0.9500000000000001, min_samples_leaf=20, min_samples_split=20, n_estimators=100)
Regresión Score (R²): 0.0798373399926241
MAE: 54.28359925891938
MSE: 8129.0180073045985
RMSE: 90.16106702620925
R²: 0.0798373399926241


#### TPOTRegressor:
The TPOTRegressor is an **Automated Machine Learning (AutoML)** tool that uses **genetic programming** to search for a regression pipeline, automatically combining and tuning models, preprocessing steps, and hyperparameters. In this case, it was configured with **10 generations**, a **population of 40** (440 pipeline evaluations in total, as the progress bar shows), **R² as the scoring metric**, and a fixed **random state (random_state=42)** for reproducibility, while **verbosity=2** prints a progress log for each generation.
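
The totals shown in the progress bars (440 for this run, 220 for the runs in Question 2) follow from the search settings. The sketch below assumes TPOT's documented behavior that `offspring_size` defaults to `population_size`; the helper name `tpot_total_evaluations` is hypothetical.

```python
def tpot_total_evaluations(population_size, generations, offspring_size=None):
    """Pipelines evaluated: the initial population plus one batch of offspring per generation."""
    if offspring_size is None:
        offspring_size = population_size  # assumed default, per the TPOT documentation
    return population_size + generations * offspring_size

print(tpot_total_evaluations(population_size=40, generations=10))
print(tpot_total_evaluations(population_size=20, generations=10))
```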

#### Preprocessing:
- Imputation by mean to handle missing values in predictor and target variables.
- Standardization of the predictors with StandardScaler.
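
In the cell above, the imputer and scaler are fitted on all rows before the train/test split, and missing target values are replaced by the mean. A possible alternative, sketched under stated assumptions, is to drop rows without a target and fit the preprocessing inside a pipeline on the training split only. The synthetic data, the English column names, and the `Ridge` stand-in model are all assumptions for illustration; this is not the notebook's method and its results would differ.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the merged table (not NHANES values)
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "waist_cm": rng.normal(95, 15, 200),
    "bmi": rng.normal(28, 6, 200),
    "age": rng.integers(12, 80, 200).astype(float),
})
toy["triglycerides"] = 50 + 0.8 * toy["waist_cm"] + rng.normal(0, 20, 200)
toy.loc[::10, "bmi"] = np.nan            # missing predictor values
toy.loc[::25, "triglycerides"] = np.nan  # missing target values

# Rows without a target carry no supervision signal, so drop them instead of imputing
labelled = toy.dropna(subset=["triglycerides"])
X = labelled[["waist_cm", "bmi", "age"]]
y = labelled["triglycerides"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# The imputer and scaler learn their statistics from the training split only
model = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler(), Ridge())
model.fit(X_train, y_train)
r2_test = model.score(X_test, y_test)
print(r2_test)
```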

#### Model Results:
- **Best pipeline:** *ExtraTreesRegressor(RidgeCV(input_matrix), bootstrap=True, max_features=0.95, min_samples_leaf=20, min_samples_split=20, n_estimators=100)*

#### Selected Model by TPOT
The chosen pipeline is an **ExtraTreesRegressor** fed by a **RidgeCV** step. RidgeCV is a **Ridge linear regression model**, that is, a linear regression with L2 regularization, where the penalty limits overfitting by shrinking the coefficients. CV stands for cross-validation, which is used to select the penalty strength (hyperparameter α) automatically. When TPOT places an estimator such as RidgeCV in the middle of a pipeline, it uses it as a stacking step: its predictions are added as an extra feature alongside the original inputs, and the result is passed to the **ExtraTreesRegressor**. This is an ensemble of decision trees that averages the predictions of its trees to produce the final prediction. Unlike other tree ensembles, it draws candidate split thresholds at random rather than searching for the best threshold. The model's hyperparameters are:
- **bootstrap=True:** Each tree is trained on a sample drawn with replacement from the training data, so a row can appear more than once.
- **max_features=0.95:** 95% of the features are considered at each split, which adds a small amount of randomness between trees.
- **min_samples_leaf=20:** Each leaf must contain at least 20 samples, reducing tree complexity and improving generalization.
- **min_samples_split=20:** A node must have at least 20 samples to split into two branches, preventing unnecessary splits and reducing complexity.
- **n_estimators=100:** The model uses 100 decision trees, enhancing prediction stability.
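
To make the structure of this pipeline concrete, the sketch below rebuilds it with scikit-learn only. `AppendPrediction` is a hypothetical stand-in written for this example to mimic the stacking behavior described above; it is not TPOT's own class, and the synthetic data is an assumption.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline

class AppendPrediction(BaseEstimator, TransformerMixin):
    """Hypothetical stacking step: adds the wrapped estimator's prediction as a feature."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y=None):
        self.estimator_ = clone(self.estimator).fit(X, y)
        return self

    def transform(self, X):
        X = np.asarray(X)
        pred = self.estimator_.predict(X).reshape(-1, 1)
        return np.hstack([pred, X])

# Synthetic data standing in for the three standardized predictors
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=300)

pipeline = make_pipeline(
    AppendPrediction(RidgeCV()),
    ExtraTreesRegressor(bootstrap=True, max_features=0.95, min_samples_leaf=20,
                        min_samples_split=20, n_estimators=100, random_state=42),
)
pipeline.fit(X, y)

stacked = pipeline[0].transform(X)  # 3 original features plus 1 prediction column
print(stacked.shape)
```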

#### Metrics:
- **R² (Coefficient of Determination):** 0.0798  
  The model explains only 7.98% of the variability in triglycerides on the test set, so it captures little of the relationship between the predictors and the target.
- **MAE (Mean Absolute Error):** 54.28  
  The average absolute difference between predictions and actual values. On average, predictions are about 54.28 mg/dL away from the measured value.
- **MSE (Mean Squared Error):** 8129.02  
  The average squared error, which penalizes larger errors more severely. Because it is expressed in squared units, it is harder to interpret in absolute terms.
- **RMSE (Root Mean Squared Error):** 90.16  
  The square root of the MSE, expressed in the same units as the target. An RMSE of 90.16 mg/dL, well above the MAE, suggests that a small number of large errors contribute heavily to the total error.
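
The four metrics above can be written out directly from their definitions. The sketch below uses made-up values to show how they relate; the function name `regression_metrics` is an assumption for illustration.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE and R² from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mae = np.mean(np.abs(residuals))
    mse = np.mean(residuals ** 2)
    rmse = np.sqrt(mse)
    ss_res = np.sum(residuals ** 2)                  # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    r2 = 1.0 - ss_res / ss_tot
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}

# Made-up values for illustration
metrics = regression_metrics([100, 150, 200, 250], [110, 140, 210, 230])
print(metrics)
```

An R² near zero, as reported here, means the residual sum of squares is almost as large as the total variation, so the model does little better than always predicting the mean.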


### Question 2: How do lipid and glucose levels vary by age, and which age groups are most vulnerable to metabolic disorders?

In [1]:

# Select the required columns from each DataFrame
demografia_filtered = demografia[['ID', 'Edad en años al momento del examen']]
perfilBioquimico_filtered = perfilB[['ID', 'Triglicéridos, suero refrigerado (mg/dL)', 'Colesterol Total, suero refrigerado (mg/dL)', 'Glucosa, suero refrigerado (mg/dL)']]
medidasCorporales_filtered = medidas[['ID', 'Circunferencia de la cintura (cm)', 'Índice de masa corporal (kg/m²)']]

# Merge on the participant ID
question2 = demografia_filtered.merge(perfilBioquimico_filtered, on='ID', how='inner')
question2 = question2.merge(medidasCorporales_filtered, on='ID', how='inner')

# Inspect the resulting DataFrame
print(question2.head())
print("Number of merged records:", question2.shape)

# Define age groups
# With right=False the intervals are [0, 30), [30, 60) and [60, 100), so ages 30 and 60 fall in the upper group
bins = [0, 30, 60, 100]  # Adjust the age ranges as needed
labels = ['0-30', '31-60', '60+']
question2['Grupo de Edad'] = pd.cut(question2['Edad en años al momento del examen'], bins=bins, labels=labels, right=False)

# Define the targets
targets = {
    'Triglicéridos, suero refrigerado (mg/dL)': 'Trigliceridos_suero',
    'Colesterol Total, suero refrigerado (mg/dL)': 'Colesterol_Total_suero',
    'Glucosa, suero refrigerado (mg/dL)': 'Glucosa_suero'
}
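
The age-group labels do not line up exactly with the bin edges when `right=False` is used. The short check below reuses the same `bins` and `labels` on a few hand-picked ages (an assumption for illustration) to show where the boundary ages land.

```python
import pandas as pd

ages = pd.Series([0, 29, 30, 59, 60, 80])
bins = [0, 30, 60, 100]
labels = ['0-30', '31-60', '60+']

# right=False makes each interval closed on the left and open on the right
groups = pd.cut(ages, bins=bins, labels=labels, right=False)
print(pd.DataFrame({"age": ages, "group": groups}))
```

Ages 30 and 60 are assigned to '31-60' and '60+' respectively, so the labels read more naturally as 0-29, 30-59, and 60+.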


In [56]:

# Question 2: variation of lipid and glucose levels by age
# Iterate over each age group and each target
for grupo in labels:
    # Filter the DataFrame by age group
    group_df = question2[question2['Grupo de Edad'] == grupo]
    
    # Select the predictors and the targets
    X = group_df[['Edad en años al momento del examen', 'Circunferencia de la cintura (cm)', 'Índice de masa corporal (kg/m²)']]
    y = group_df[['Triglicéridos, suero refrigerado (mg/dL)', 'Colesterol Total, suero refrigerado (mg/dL)', 'Glucosa, suero refrigerado (mg/dL)']]
    
    # Impute missing values in X
    imputer = SimpleImputer(strategy='mean')
    X_imputed = imputer.fit_transform(X)
    
    print(f"\n------ Grupo de Edad: {grupo} ------")
    
    # Iterate over each target and run TPOT
    for target, file_name in targets.items():
        # Keep only the rows where the target is present
        y_target = y[target].dropna()
        X_filtered = X_imputed[~y[target].isna()]  # Filter X to match the non-null y
        if len(X_filtered) == 0:
            print(f"No hay suficientes datos para {target} en el grupo de edad {grupo}.")
            continue
            
        # Train/test split
        X_train, X_test, y_train, y_test = train_test_split(X_filtered, y_target, train_size=0.75, test_size=0.25, random_state=42)
        
        # TPOT regression (default scoring: negative mean squared error)
        tpot_regressor = TPOTRegressor(verbosity=2, generations=10, population_size=20, random_state=42)
        tpot_regressor.fit(X_train, y_train)
        
        # Make predictions
        y_pred = tpot_regressor.predict(X_test)
        
        # Compute metrics
        mae = mean_absolute_error(y_test, y_pred)
        mse = mean_squared_error(y_test, y_pred)
        rmse = np.sqrt(mse)
        r2 = r2_score(y_test, y_pred)
        
        # Print the metrics for each target
        print(f"\nMétricas de regresión para {target} en el grupo de edad {grupo}:")
        print(f"MAE: {mae}")
        print(f"MSE: {mse}")
        print(f"RMSE: {rmse}")
        print(f"R²: {r2}")
        
        # Export the pipeline using a file name without special characters
        file_name_cleaned = file_name.replace('/', '_')
        tpot_regressor.export(f'best_pipeline_{file_name_cleaned}_{grupo}_regression.py')
        print(f"Pipeline exportado para {target} en el grupo de edad {grupo}\n")


------ Grupo de Edad: 0-30 ------


Optimization Progress:   0%|          | 0/220 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: -4431.469891805286

Generation 2 - Current best internal CV score: -4418.6372696044

Generation 3 - Current best internal CV score: -4418.6372696044

Generation 4 - Current best internal CV score: -4418.6372696044

Generation 5 - Current best internal CV score: -4415.3505746669325

Generation 6 - Current best internal CV score: -4415.3505746669325

Generation 7 - Current best internal CV score: -4414.547180278123

Generation 8 - Current best internal CV score: -4414.547180278123

Generation 9 - Current best internal CV score: -4414.547180278123

Generation 10 - Current best internal CV score: -4414.547180278123

Best pipeline: ExtraTreesRegressor(PolynomialFeatures(input_matrix, degree=2, include_bias=False, interaction_only=False), bootstrap=True, max_features=0.6500000000000001, min_samples_leaf=9, min_samples_split=11, n_estimators=100)

Métricas de regresión para Triglicéridos, suero refrigerado (mg/dL) en el grupo de edad 0-30:
MAE: 

Optimization Progress:   0%|          | 0/220 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: -1034.681972807775

Generation 2 - Current best internal CV score: -1034.681972807775

Generation 3 - Current best internal CV score: -1034.6711436924975

Generation 4 - Current best internal CV score: -1032.8282292916142

Generation 5 - Current best internal CV score: -1032.8282292916142

Generation 6 - Current best internal CV score: -1032.3057767575162

Generation 7 - Current best internal CV score: -1032.2937023505424

Generation 8 - Current best internal CV score: -1032.2937023505424

Generation 9 - Current best internal CV score: -1031.246817870288

Generation 10 - Current best internal CV score: -1031.246817870288

Best pipeline: AdaBoostRegressor(PolynomialFeatures(input_matrix, degree=2, include_bias=False, interaction_only=False), learning_rate=0.001, loss=linear, n_estimators=100)

Métricas de regresión para Colesterol Total, suero refrigerado (mg/dL) en el grupo de edad 0-30:
MAE: 23.70396722015307
MSE: 1015.1262358841103
RMSE

Optimization Progress:   0%|          | 0/220 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: -244.72884246866562

Generation 2 - Current best internal CV score: -244.72884246866562

Generation 3 - Current best internal CV score: -244.72884246866562

Generation 4 - Current best internal CV score: -244.72855990421968

Generation 5 - Current best internal CV score: -244.32893108935642

Generation 6 - Current best internal CV score: -244.32893108935642

Generation 7 - Current best internal CV score: -244.32893108935642

Generation 8 - Current best internal CV score: -244.32893108935642

Generation 9 - Current best internal CV score: -244.32893108935642

Generation 10 - Current best internal CV score: -244.32893108935642

Best pipeline: AdaBoostRegressor(RidgeCV(input_matrix), learning_rate=0.01, loss=exponential, n_estimators=100)

Métricas de regresión para Glucosa, suero refrigerado (mg/dL) en el grupo de edad 0-30:
MAE: 6.8702058625208915
MSE: 132.28683892379206
RMSE: 11.50160158081439
R²: 0.02504892255376956
Pipeline exportado pa

Optimization Progress:   0%|          | 0/220 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: -14508.656467171784

Generation 2 - Current best internal CV score: -14508.656467171784

Generation 3 - Current best internal CV score: -14496.763862331887

Generation 4 - Current best internal CV score: -14462.08755530268

Generation 5 - Current best internal CV score: -14462.08755530268

Generation 6 - Current best internal CV score: -14462.08755530268

Generation 7 - Current best internal CV score: -14462.08755530268

Generation 8 - Current best internal CV score: -14453.130482607598

Generation 9 - Current best internal CV score: -14453.130482607598

Generation 10 - Current best internal CV score: -14453.130482607598

Best pipeline: AdaBoostRegressor(input_matrix, learning_rate=0.001, loss=square, n_estimators=100)

Métricas de regresión para Triglicéridos, suero refrigerado (mg/dL) en el grupo de edad 31-60:
MAE: 67.16581580770993
MSE: 13879.859780812318
RMSE: 117.8128167085921
R²: 0.03368270754689495
Pipeline exportado para Triglicé

Optimization Progress:   0%|          | 0/220 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: -1426.4975154821923

Generation 2 - Current best internal CV score: -1426.4975154821923

Generation 3 - Current best internal CV score: -1425.2842665439262

Generation 4 - Current best internal CV score: -1425.2842665439262

Generation 5 - Current best internal CV score: -1425.2842665439262

Generation 6 - Current best internal CV score: -1424.174106996043

Generation 7 - Current best internal CV score: -1424.174106996043

Generation 8 - Current best internal CV score: -1424.174106996043

Generation 9 - Current best internal CV score: -1424.174106996043

Generation 10 - Current best internal CV score: -1424.174106996043

Best pipeline: ExtraTreesRegressor(Nystroem(input_matrix, gamma=0.2, kernel=poly, n_components=5), bootstrap=True, max_features=0.6500000000000001, min_samples_leaf=16, min_samples_split=11, n_estimators=100)

Métricas de regresión para Colesterol Total, suero refrigerado (mg/dL) en el grupo de edad 31-60:
MAE: 30.9907850

Optimization Progress:   0%|          | 0/220 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: -1461.275979881647

Generation 2 - Current best internal CV score: -1461.275979881647

Generation 3 - Current best internal CV score: -1461.2461172980359

Generation 4 - Current best internal CV score: -1461.2393978337805

Generation 5 - Current best internal CV score: -1461.0614847070333

Generation 6 - Current best internal CV score: -1461.0614847070333

Generation 7 - Current best internal CV score: -1461.0614847070333

Generation 8 - Current best internal CV score: -1461.0614847070333

Generation 9 - Current best internal CV score: -1460.755337450487

Generation 10 - Current best internal CV score: -1460.755337450487

Best pipeline: RidgeCV(Nystroem(MaxAbsScaler(input_matrix), gamma=0.05, kernel=additive_chi2, n_components=6))

Métricas de regresión para Glucosa, suero refrigerado (mg/dL) en el grupo de edad 31-60:
MAE: 18.512592584063167
MSE: 1420.1712352659254
RMSE: 37.68515935041174
R²: 0.057175590715400015
Pipeline exportado para 

Optimization Progress:   0%|          | 0/220 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: -7995.829328886367

Generation 2 - Current best internal CV score: -7993.5456947895555

Generation 3 - Current best internal CV score: -7993.5456947895555

Generation 4 - Current best internal CV score: -7993.5456947895555

Generation 5 - Current best internal CV score: -7992.609020246661

Generation 6 - Current best internal CV score: -7992.609020246661

Generation 7 - Current best internal CV score: -7992.609020246661

Generation 8 - Current best internal CV score: -7992.609020246661

Generation 9 - Current best internal CV score: -7992.609020246661

Generation 10 - Current best internal CV score: -7981.538791645353

Best pipeline: RidgeCV(RidgeCV(Normalizer(RidgeCV(input_matrix), norm=max)))

Métricas de regresión para Triglicéridos, suero refrigerado (mg/dL) en el grupo de edad 60+:
MAE: 58.758911736335435
MSE: 7720.221608731275
RMSE: 87.86479163311819
R²: 0.017691176513471074
Pipeline exportado para Triglicéridos, suero refrigerado (

Optimization Progress:   0%|          | 0/220 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: -1832.5019385010564

Generation 2 - Current best internal CV score: -1832.4473652757242

Generation 3 - Current best internal CV score: -1831.4875890256385

Generation 4 - Current best internal CV score: -1831.4875890256385

Generation 5 - Current best internal CV score: -1830.530646185493

Generation 6 - Current best internal CV score: -1830.530646185493

Generation 7 - Current best internal CV score: -1830.530646185493

Generation 8 - Current best internal CV score: -1830.506026865981

Generation 9 - Current best internal CV score: -1825.03240531195

Generation 10 - Current best internal CV score: -1825.03240531195

Best pipeline: RidgeCV(Nystroem(input_matrix, gamma=0.9500000000000001, kernel=additive_chi2, n_components=8))

Métricas de regresión para Colesterol Total, suero refrigerado (mg/dL) en el grupo de edad 60+:
MAE: 33.44637362641843
MSE: 1766.4868255972694
RMSE: 42.0295946399352
R²: 0.04701913869271224
Pipeline exportado para 

Optimization Progress:   0%|          | 0/220 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: -1471.0662849194402

Generation 2 - Current best internal CV score: -1471.0662849194402

Generation 3 - Current best internal CV score: -1471.0662849194402

Generation 4 - Current best internal CV score: -1471.0662849194402

Generation 5 - Current best internal CV score: -1471.0662849194402

Generation 6 - Current best internal CV score: -1471.0662849194402

Generation 7 - Current best internal CV score: -1471.0662849194402

Generation 8 - Current best internal CV score: -1471.0662849194402

Generation 9 - Current best internal CV score: -1471.0662849194402

Generation 10 - Current best internal CV score: -1471.0662849194402

Best pipeline: ElasticNetCV(input_matrix, l1_ratio=1.0, tol=0.1)

Métricas de regresión para Glucosa, suero refrigerado (mg/dL) en el grupo de edad 60+:
MAE: 22.24176108875996
MSE: 1606.447676046225
RMSE: 40.08051491742871
R²: -0.007044131750046478
Pipeline exportado para Glucosa, suero refrigerado (mg/dL) en el grup
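
The internal CV scores in the logs above are negative because no `scoring` argument was passed in this loop, and TPOTRegressor's default scoring is negative mean squared error (higher is better, so an MSE is reported with its sign flipped). A minimal sketch of converting such a score to an RMSE in mg/dL follows; the helper name `cv_score_to_rmse` is an assumption for illustration.

```python
import numpy as np

def cv_score_to_rmse(neg_mse_score: float) -> float:
    """Convert a negative-MSE score into an RMSE in the target's units."""
    return float(np.sqrt(-neg_mse_score))

# Final-generation scores copied from the 0-30 age group logs
rmse_tg_0_30 = cv_score_to_rmse(-4414.547180278123)    # triglycerides
rmse_glu_0_30 = cv_score_to_rmse(-244.32893108935642)  # glucose
print(round(rmse_tg_0_30, 2), round(rmse_glu_0_30, 2))
```

These scores are therefore not comparable with the R²-based CV scores reported for Question 1, where `scoring='r2'` was set explicitly.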

### Age Group: 0-30

#### Regression for Triglycerides in Serum (mg/dL)
The selected model is an ExtraTreesRegressor with a preprocessing step of PolynomialFeatures. This step adds polynomial and interaction terms to enhance feature complexity, setting degree=2. The regressor uses ensemble methods, aggregating predictions from multiple decision trees to enhance accuracy. It was configured with bootstrap=True, allowing sampling with replacement, while max_features=0.65 uses 65% of features for each split to boost variability. It requires a minimum of 9 samples to create a leaf (min_samples_leaf=9) and at least 11 samples to split a node (min_samples_split=11), using 100 trees (n_estimators=100). Despite this complexity, the model explains only 6.7% of the variability (R²), with a Mean Absolute Error (MAE) of 43.70.

#### Regression for Total Cholesterol in Serum (mg/dL)
An AdaBoostRegressor preceded by a PolynomialFeatures step (degree=2) was chosen. AdaBoost fits its weak learners sequentially, reweighting the training samples after each round so that later learners focus on the samples that were predicted poorly. Here learning_rate=0.001 shrinks the contribution of each learner, loss=linear sets the loss used to update the sample weights after each round, and n_estimators=100 caps the number of boosting rounds. The model explains 11.3% of the variability, an improvement over the triglycerides model, with an MAE of 23.70.
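
Since TPOT's pipeline string does not name a base estimator for AdaBoost, scikit-learn's default applies. The sketch below fits an `AdaBoostRegressor` with the reported hyperparameters on synthetic data (an assumption for illustration) and inspects the weak learners it builds.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=200)

model = AdaBoostRegressor(learning_rate=0.001, loss="linear", n_estimators=100, random_state=42)
model.fit(X, y)

# With no base estimator given, each weak learner is a shallow decision tree
first_learner = model.estimators_[0]
print(type(first_learner).__name__, first_learner.max_depth, len(model.estimators_))
```

The `loss` argument ('linear', 'square', or 'exponential') only changes how sample weights are updated between rounds; the weak learners themselves remain regression trees.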

#### Regression for Glucose in Serum (mg/dL)
The best-performing pipeline is another AdaBoostRegressor, this time preceded by a RidgeCV step. RidgeCV is not AdaBoost's base estimator here: TPOT uses it as a stacking step whose predictions (from an L2-regularized linear model with α chosen by cross-validation) are added as an extra input feature. The regressor uses loss=exponential for the sample-weight updates, with learning_rate=0.01 and 100 boosting rounds (n_estimators=100). It explains little of the variability (R² = 2.5%), with an MAE of 6.87.

### Age Group: 31-60

#### Regression for Triglycerides in Serum (mg/dL)
Here, an AdaBoostRegressor was again the best choice, applied directly to the input matrix with no preprocessing step. It uses learning_rate=0.001 and loss=square, so sample weights are updated from squared errors, over 100 boosting iterations. The model captures only 3.4% of the variance, with an MAE of 67.17 mg/dL, which shows how hard triglycerides are to predict in this age group.

#### Regression for Total Cholesterol in Serum (mg/dL)
The model is an ExtraTreesRegressor preceded by a Nystroem transformation, which approximates a kernel feature map so that non-linear relationships can be captured. It uses a polynomial kernel with gamma=0.2 and 5 components (n_components=5). Like the ExtraTrees model for the 0-30 group, it uses bootstrapping and 65% of the features per split (max_features=0.65), with minimum sample constraints of min_samples_leaf=16 and min_samples_split=11. However, it explains only 1.9% of the variance, with an MAE of 30.99 mg/dL.

#### Regression for Glucose in Serum (mg/dL)
The model for this variable is a RidgeCV regression with a Nystroem transformation. The kernel is additive_chi2, configured with gamma=0.05 and 6 components. The inputs were scaled with MaxAbsScaler, which divides each feature by its maximum absolute value. Despite these adjustments, the model explains only 5.7% of the variance, with an MAE of 18.51 mg/dL.

### Age Group: 60+

#### Regression for Triglycerides in Serum (mg/dL)
The best model here is a RidgeCV regressor built from nested RidgeCV steps combined with a Normalizer step. The Normalizer rescales each sample to unit norm, and the L2 penalty shrinks large coefficients to limit overfitting. Still, the model's performance is low: it explains only 1.8% of the variance, with an MAE of 58.76 mg/dL.

#### Regression for Total Cholesterol in Serum (mg/dL)
The selected model is another RidgeCV regressor, again preceded by a Nystroem kernel approximation, here with the additive_chi2 kernel, gamma=0.95, and 8 components. It achieves an R² of 4.7%, with an MAE of 33.45 mg/dL.

#### Regression for Glucose in Serum (mg/dL)
The final model is an ElasticNetCV, which combines L1 (Lasso) and L2 (Ridge) penalties and uses cross-validation to choose the regularization strength. With l1_ratio=1.0 the penalty reduces to pure L1, so the model is effectively a Lasso. It failed to explain any variance: the R² is negative (-0.7%), meaning it did slightly worse on the test split than predicting the mean, and the MAE is 22.24 mg/dL.

Overall, no model explains more than about 11% of the variance in serum triglycerides, total cholesterol, or glucose in any age group. This limited explanatory power suggests a need for more informative features or alternative modeling strategies.
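
To put the negative R² values in context, the sketch below scores the simplest possible baseline: predicting the training mean for every test sample. It uses synthetic values (not NHANES data), and the distribution parameters are arbitrary assumptions chosen only to resemble a glucose-like scale.

```python
import numpy as np
from sklearn.metrics import r2_score

# Synthetic target values (NOT NHANES); mean and spread are arbitrary assumptions.
rng = np.random.default_rng(0)
y_train = rng.normal(100, 40, size=750)
y_test = rng.normal(100, 40, size=250)

# Baseline: predict the training mean for every test sample.
baseline_pred = np.full_like(y_test, y_train.mean())
r2_baseline = r2_score(y_test, baseline_pred)
print(f"R² of the train-mean baseline on the test split: {r2_baseline:.4f}")
```

Because R² compares squared error against the test-set mean, any constant prediction that differs from that mean scores at or below zero. An R² close to zero, positive or negative, therefore means the model is doing about as well as this baseline.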



### Question 3: How do gender and age affect correlations among these biomarkers?

In [66]:
# DataFrame for question 3
# Keep only the columns needed from each DataFrame
demografia_filtered = demografia[['ID', 'Edad en años al momento del examen', 'Género']]
perfilBioquimico_filtered = perfilB[['ID', 'Glucosa, suero refrigerado (mg/dL)', 'Triglicéridos, suero refrigerado (mg/dL)', 'Colesterol Total, suero refrigerado (mg/dL)']]

# Inner-join the datasets on ID to keep participants present in both
question3 = demografia_filtered.merge(perfilBioquimico_filtered, on='ID', how='inner')

# Inspect the resulting DataFrame
print(question3.head())
print("Cantidad de registros combinados:", question3.shape)

# Define age groups. With right=False the intervals are [0, 30), [30, 60) and [60, 100),
# so age 30 is labeled '31-60' and age 60 is labeled '60+'.
bins = [0, 30, 60, 100]  # Adjust the age ranges as needed
labels = ['0-30', '31-60', '60+']
question3['Grupo de Edad'] = pd.cut(question3['Edad en años al momento del examen'], bins=bins, labels=labels, right=False)

# Map each target column to a file-name-safe label
targets = {
    'Glucosa, suero refrigerado (mg/dL)': 'Glucosa_suero',
    'Triglicéridos, suero refrigerado (mg/dL)': 'Trigliceridos_suero',
    'Colesterol Total, suero refrigerado (mg/dL)': 'Colesterol_Total_suero'
}


         ID  Edad en años al momento del examen  Género  \
0 109264.00                               13.00    2.00   
1 109266.00                               29.00    2.00   
2 109271.00                               49.00    1.00   
3 109273.00                               36.00    1.00   
4 109274.00                               68.00    1.00   

   Glucosa, suero refrigerado (mg/dL)  \
0                               89.00   
1                               83.00   
2                               95.00   
3                               89.00   
4                              153.00   

   Triglicéridos, suero refrigerado (mg/dL)  \
0                                     54.00   
1                                     86.00   
2                                    101.00   
3                                    178.00   
4                                    151.00   

   Colesterol Total, suero refrigerado (mg/dL)  
0                                       170.00  
1                
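
The age bins defined in the cell above use `right=False`, which makes each interval closed on the left and open on the right. The short check below, using a handful of illustrative ages rather than survey records, shows which label the boundary ages receive.

```python
import pandas as pd

# Illustrative ages around the bin edges (not taken from the dataset).
ages = pd.Series([0, 29, 30, 31, 59, 60, 80])
groups = pd.cut(
    ages,
    bins=[0, 30, 60, 100],
    labels=['0-30', '31-60', '60+'],
    right=False,  # intervals are [0, 30), [30, 60), [60, 100)
)
print(pd.DataFrame({'age': ages, 'group': groups}))
```

Age 30 lands in the group labeled '31-60' and age 60 in '60+', so the labels are slightly off from the actual interval boundaries.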

In [67]:
# Iterate over each age group and each gender
for grupo in labels:
    for genero in [1, 2]:  # Assumes 1 and 2 are the gender codes in the dataset
        # Filter the DataFrame by age group and gender
        group_df = question3[(question3['Grupo de Edad'] == grupo) & (question3['Género'] == genero)]

        # Select the predictors and the targets.
        # Note: 'Género' is constant inside each filtered group, so age is the only informative predictor.
        X = group_df[['Edad en años al momento del examen', 'Género']]
        y = group_df[['Glucosa, suero refrigerado (mg/dL)', 'Triglicéridos, suero refrigerado (mg/dL)', 'Colesterol Total, suero refrigerado (mg/dL)']]

        # Impute missing values in X with the column mean
        imputer = SimpleImputer(strategy='mean')
        X_imputed = imputer.fit_transform(X)

        print(f"\n------ Grupo de Edad: {grupo}, Género: {genero} ------")

        # Run TPOT once per target
        for target, file_name in targets.items():
            # Drop rows with a missing target and keep the matching rows of X
            y_target = y[target].dropna()
            X_filtered = X_imputed[~y[target].isna()]
            if len(X_filtered) == 0:
                print(f"No hay suficientes datos para {target} en el grupo de edad {grupo} y género {genero}.")
                continue

            # Train/test split (75% / 25%)
            X_train, X_test, y_train, y_test = train_test_split(X_filtered, y_target, train_size=0.75, test_size=0.25, random_state=42)

            # TPOT regression search
            tpot_regressor = TPOTRegressor(verbosity=2, generations=1, population_size=20, random_state=42)
            tpot_regressor.fit(X_train, y_train)

            # Predict on the held-out split
            y_pred = tpot_regressor.predict(X_test)

            # Compute the metrics
            mae = mean_absolute_error(y_test, y_pred)
            mse = mean_squared_error(y_test, y_pred)
            rmse = np.sqrt(mse)
            r2 = r2_score(y_test, y_pred)

            # Print the metrics for each target
            print(f"\nMétricas de regresión para {target} en el grupo de edad {grupo} y género {genero}:")
            print(f"MAE: {mae}")
            print(f"MSE: {mse}")
            print(f"RMSE: {rmse}")
            print(f"R²: {r2}")

            # Export the pipeline using a file name without special characters
            file_name_cleaned = f"{file_name}_{grupo}_genero{genero}".replace('/', '_')
            tpot_regressor.export(f'best_pipeline_{file_name_cleaned}_regression.py')
            print(f"Pipeline exportado para {target} en el grupo de edad {grupo} y género {genero}\n")


------ Grupo de Edad: 0-30, Género: 1 ------


Optimization Progress:   0%|          | 0/40 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: -299.26385738290276

Best pipeline: ElasticNetCV(input_matrix, l1_ratio=0.6000000000000001, tol=1e-05)

Métricas de regresión para Glucosa, suero refrigerado (mg/dL) en el grupo de edad 0-30 y género 1:
MAE: 6.451093239507414
MSE: 115.26938094076026
RMSE: 10.736357899248715
R²: -0.0009470262541031449
Pipeline exportado para Glucosa, suero refrigerado (mg/dL) en el grupo de edad 0-30 y género 1



Generation 1 - Current best internal CV score: -6887.567864489324

Best pipeline: ElasticNetCV(ElasticNetCV(input_matrix, l1_ratio=0.6000000000000001, tol=0.1), l1_ratio=0.55, tol=0.01)

Métricas de regresión para Triglicéridos, suero refrigerado (mg/dL) en el grupo de edad 0-30 y género 1:
MAE: 51.20859192975065
MSE: 6263.8648304213575
RMSE: 79.14458181342142
R²: 0.014321823988022397
Pipeline exportado para Triglicéridos, suero refrigerado (mg/dL) en el grupo de edad 0-30 y género 1



Generation 1 - Current best internal CV score: -1041.9237974231887

Best pipeline: AdaBoostRegressor(input_matrix, learning_rate=0.001, loss=exponential, n_estimators=100)

Métricas de regresión para Colesterol Total, suero refrigerado (mg/dL) en el grupo de edad 0-30 y género 1:
MAE: 25.297044871627552
MSE: 1051.3013859126677
RMSE: 32.42377809436568
R²: 0.1097087806063104
Pipeline exportado para Colesterol Total, suero refrigerado (mg/dL) en el grupo de edad 0-30 y género 1


------ Grupo de Edad: 0-30, Género: 2 ------


Generation 1 - Current best internal CV score: -223.6545682743897

Best pipeline: XGBRegressor(LinearSVR(input_matrix, C=0.0001, dual=True, epsilon=0.001, loss=epsilon_insensitive, tol=0.001), learning_rate=0.5, max_depth=1, min_child_weight=10, n_estimators=100, n_jobs=1, objective=reg:squarederror, subsample=1.0, verbosity=0)

Métricas de regresión para Glucosa, suero refrigerado (mg/dL) en el grupo de edad 0-30 y género 2:
MAE: 6.730692515915407
MSE: 77.53306356713627
RMSE: 8.805286115007068
R²: -0.06067661854775763
Pipeline exportado para Glucosa, suero refrigerado (mg/dL) en el grupo de edad 0-30 y género 2



Generation 1 - Current best internal CV score: -2774.0010793289975

Best pipeline: ElasticNetCV(Nystroem(input_matrix, gamma=0.7000000000000001, kernel=polynomial, n_components=10), l1_ratio=0.65, tol=0.1)

Métricas de regresión para Triglicéridos, suero refrigerado (mg/dL) en el grupo de edad 0-30 y género 2:
MAE: 38.042896408693
MSE: 2976.62612202832
RMSE: 54.55846517295295
R²: 0.01972543084122791
Pipeline exportado para Triglicéridos, suero refrigerado (mg/dL) en el grupo de edad 0-30 y género 2



Generation 1 - Current best internal CV score: -1065.1113558765467

Best pipeline: ElasticNetCV(input_matrix, l1_ratio=0.4, tol=1e-05)

Métricas de regresión para Colesterol Total, suero refrigerado (mg/dL) en el grupo de edad 0-30 y género 2:
MAE: 23.04717088504877
MSE: 971.4142548696861
RMSE: 31.16751922867276
R²: 0.022932964485453122
Pipeline exportado para Colesterol Total, suero refrigerado (mg/dL) en el grupo de edad 0-30 y género 2


------ Grupo de Edad: 31-60, Género: 1 ------


Generation 1 - Current best internal CV score: -1690.9075342314154

Best pipeline: LinearSVR(input_matrix, C=10.0, dual=False, epsilon=0.1, loss=squared_epsilon_insensitive, tol=1e-05)

Métricas de regresión para Glucosa, suero refrigerado (mg/dL) en el grupo de edad 31-60 y género 1:
MAE: 22.695636325236542
MSE: 2514.765756189266
RMSE: 50.14744017583815
R²: 0.02717847028513498
Pipeline exportado para Glucosa, suero refrigerado (mg/dL) en el grupo de edad 31-60 y género 1



Generation 1 - Current best internal CV score: -27279.251661906914

Best pipeline: ElasticNetCV(input_matrix, l1_ratio=0.25, tol=0.0001)

Métricas de regresión para Triglicéridos, suero refrigerado (mg/dL) en el grupo de edad 31-60 y género 1:
MAE: 82.4024528604061
MSE: 14052.976399294665
RMSE: 118.54525042908578
R²: -0.013725383818470993
Pipeline exportado para Triglicéridos, suero refrigerado (mg/dL) en el grupo de edad 31-60 y género 1



Generation 1 - Current best internal CV score: -1587.7023682648337

Best pipeline: ElasticNetCV(input_matrix, l1_ratio=0.55, tol=1e-05)

Métricas de regresión para Colesterol Total, suero refrigerado (mg/dL) en el grupo de edad 31-60 y género 1:
MAE: 33.13751359188112
MSE: 1844.9634254960733
RMSE: 42.95303744202584
R²: -1.1737778334186544e-05
Pipeline exportado para Colesterol Total, suero refrigerado (mg/dL) en el grupo de edad 31-60 y género 1


------ Grupo de Edad: 31-60, Género: 2 ------


Generation 1 - Current best internal CV score: -1027.8073191139974

Best pipeline: RidgeCV(input_matrix)

Métricas de regresión para Glucosa, suero refrigerado (mg/dL) en el grupo de edad 31-60 y género 2:
MAE: 17.729892656796867
MSE: 1425.098319722145
RMSE: 37.750474430424646
R²: 0.026183215434953055
Pipeline exportado para Glucosa, suero refrigerado (mg/dL) en el grupo de edad 31-60 y género 2



Generation 1 - Current best internal CV score: -5801.632816679439

Best pipeline: ElasticNetCV(input_matrix, l1_ratio=0.30000000000000004, tol=0.001)

Métricas de regresión para Triglicéridos, suero refrigerado (mg/dL) en el grupo de edad 31-60 y género 2:
MAE: 54.57505814106744
MSE: 7500.8377104275
RMSE: 86.60737676680607
R²: -0.00844209421734532
Pipeline exportado para Triglicéridos, suero refrigerado (mg/dL) en el grupo de edad 31-60 y género 2



Generation 1 - Current best internal CV score: -1372.0301209678569

Best pipeline: RidgeCV(input_matrix)

Métricas de regresión para Colesterol Total, suero refrigerado (mg/dL) en el grupo de edad 31-60 y género 2:
MAE: 27.24630126985124
MSE: 1215.773616171601
RMSE: 34.867945396475555
R²: 0.070520758228796
Pipeline exportado para Colesterol Total, suero refrigerado (mg/dL) en el grupo de edad 31-60 y género 2


------ Grupo de Edad: 60+, Género: 1 ------


Generation 1 - Current best internal CV score: -1908.2460451122176

Best pipeline: ElasticNetCV(input_matrix, l1_ratio=0.2, tol=0.01)

Métricas de regresión para Glucosa, suero refrigerado (mg/dL) en el grupo de edad 60+ y género 1:
MAE: 25.718291236131694
MSE: 1680.8145832531682
RMSE: 40.99773875780429
R²: -0.0028152292255636535
Pipeline exportado para Glucosa, suero refrigerado (mg/dL) en el grupo de edad 60+ y género 1



Generation 1 - Current best internal CV score: -7309.043544729592

Best pipeline: ElasticNetCV(input_matrix, l1_ratio=0.9, tol=0.01)

Métricas de regresión para Triglicéridos, suero refrigerado (mg/dL) en el grupo de edad 60+ y género 1:
MAE: 74.18306401322117
MSE: 20053.39294198726
RMSE: 141.61000297290886
R²: -0.005005290463689693
Pipeline exportado para Triglicéridos, suero refrigerado (mg/dL) en el grupo de edad 60+ y género 1



Generation 1 - Current best internal CV score: -1680.3145790329422

Best pipeline: ElasticNetCV(input_matrix, l1_ratio=0.65, tol=0.01)

Métricas de regresión para Colesterol Total, suero refrigerado (mg/dL) en el grupo de edad 60+ y género 1:
MAE: 31.23354273638671
MSE: 1576.8677901444553
RMSE: 39.70979463739967
R²: 0.01262011648631145
Pipeline exportado para Colesterol Total, suero refrigerado (mg/dL) en el grupo de edad 60+ y género 1


------ Grupo de Edad: 60+, Género: 2 ------


Generation 1 - Current best internal CV score: -1186.2424504080905

Best pipeline: RidgeCV(MaxAbsScaler(input_matrix))

Métricas de regresión para Glucosa, suero refrigerado (mg/dL) en el grupo de edad 60+ y género 2:
MAE: 20.260358021422803
MSE: 1280.6847300722281
RMSE: 35.7866557542365
R²: -0.023671833850098523
Pipeline exportado para Glucosa, suero refrigerado (mg/dL) en el grupo de edad 60+ y género 2



Generation 1 - Current best internal CV score: -5369.18087585034

Best pipeline: ElasticNetCV(input_matrix, l1_ratio=0.05, tol=0.001)

Métricas de regresión para Triglicéridos, suero refrigerado (mg/dL) en el grupo de edad 60+ y género 2:
MAE: 53.2073667073667
MSE: 6013.949577882435
RMSE: 77.54965878637013
R²: -1.1914909904264803e-05
Pipeline exportado para Triglicéridos, suero refrigerado (mg/dL) en el grupo de edad 60+ y género 2



Generation 1 - Current best internal CV score: -1778.2270147284503

Best pipeline: ElasticNetCV(RidgeCV(input_matrix), l1_ratio=0.9500000000000001, tol=0.01)

Métricas de regresión para Colesterol Total, suero refrigerado (mg/dL) en el grupo de edad 60+ y género 2:
MAE: 37.07249786063319
MSE: 2210.2267480256505
RMSE: 47.01304869954352
R²: -0.0042292713683598215
Pipeline exportado para Colesterol Total, suero refrigerado (mg/dL) en el grupo de edad 60+ y género 2
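
One detail of the loop above is worth making explicit: each group is filtered to a single gender code, so the `Género` column passed in `X` takes one value throughout and cannot help any model. The sketch below illustrates this with a few hypothetical rows (not survey records) and scikit-learn's `VarianceThreshold`, which is used here only as a diagnostic and is not part of the original analysis.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical rows for one (age group, gender) subset; values are illustrative.
group_df = pd.DataFrame({
    'Edad en años al momento del examen': [13, 18, 22, 29],
    'Género': [2, 2, 2, 2],  # constant after filtering by gender
})

# The default threshold of 0 flags columns whose variance is exactly zero.
selector = VarianceThreshold()
selector.fit(group_df)
kept = list(group_df.columns[selector.get_support()])
print("Columns with non-zero variance:", kept)
```

In practice this means every gender-stratified model is a function of age alone, which is consistent with the low R² values reported for these groups.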



### Age Group: 0-30, Gender: 1

#### Regression for Glucose in Serum (mg/dL)
The model uses ElasticNetCV with an l1_ratio of 0.6, which mixes L1 and L2 regularization, and a tight convergence tolerance of 1e-05. The model performs poorly, with an R² of -0.001, indicating no explanatory power, and an MAE of 6.45.

#### Regression for Triglycerides in Serum (mg/dL)
An ElasticNetCV model nested within another ElasticNetCV model was selected, using l1_ratios of 0.6 and 0.55, respectively. In TPOT's notation, nesting means the inner model's predictions are added to the inputs as an extra feature for the outer model. Even with this stacking, the model struggles with prediction accuracy, as seen by an R² of 0.014 and an MAE of 51.21.
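
The nested notation can be unpacked by hand. The following is a minimal sketch of the stacking idea using plain scikit-learn on synthetic data (not NHANES); it is not the exported TPOT pipeline, and the column order of the stacked matrix is an assumption made for the illustration.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

# Synthetic stand-in data (NOT NHANES): two predictors and a noisy target.
rng = np.random.default_rng(42)
X = rng.uniform(0, 30, size=(300, 2))
y = 90 + 1.5 * X[:, 0] + rng.normal(0, 50, size=300)

# Inner model: fitted first, on the raw inputs.
inner = ElasticNetCV(l1_ratio=0.6, tol=0.1).fit(X, y)

# Stacking step: the inner predictions become one more input column.
X_stacked = np.hstack([inner.predict(X).reshape(-1, 1), X])

# Outer model: fitted on the original inputs plus the inner predictions.
outer = ElasticNetCV(l1_ratio=0.55, tol=0.01).fit(X_stacked, y)
print("Stacked input shape:", X_stacked.shape)
```

When both stages are linear, the extra column is itself a linear function of the original inputs, so stacking adds little new information. That may help explain why the nested pipeline performs no better than the single-stage ones.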

#### Regression for Total Cholesterol in Serum (mg/dL)
The AdaBoostRegressor with 100 estimators, exponential loss, and a learning rate of 0.001 was chosen. It achieves an R² of 0.11, the highest among the gender-stratified models but still weak, and an MAE of 25.30.

### Age Group: 0-30, Gender: 2

#### Regression for Glucose in Serum (mg/dL)
The selected pipeline stacks a LinearSVR under an XGBRegressor. The LinearSVR uses a very small C of 0.0001, which means strong regularization, and epsilon-insensitive loss; its predictions are passed to XGBoost as an additional feature. The model performs poorly, with an R² of -0.061 and an MAE of 6.73.

#### Regression for Triglycerides in Serum (mg/dL)
The model is ElasticNetCV combined with Nystroem for feature transformation using a polynomial kernel. It achieves an R² of 0.02, only marginally above zero, and an MAE of 38.04.

#### Regression for Total Cholesterol in Serum (mg/dL)
ElasticNetCV with an l1_ratio of 0.4 was selected. The R² is 0.023, showing minimal explanatory power, and the MAE is 23.05.

### Age Group: 31-60, Gender: 1

#### Regression for Glucose in Serum (mg/dL)
LinearSVR was chosen, with an R² of 0.03, indicating low explanatory power, and an MAE of 22.70, suggesting significant prediction error.

#### Regression for Triglycerides in Serum (mg/dL)
ElasticNetCV with an l1_ratio of 0.25 was used. It has a negative R² of -0.014, indicating a poor fit, and an MAE of 82.40, highlighting substantial prediction error.

#### Regression for Total Cholesterol in Serum (mg/dL)
ElasticNetCV with an l1_ratio of 0.55 was selected. Its R² is essentially zero, so it does no better than predicting the mean, and the MAE is 33.14.

### Age Group: 31-60, Gender: 2

#### Regression for Glucose in Serum (mg/dL)
RidgeCV was chosen, achieving an R² of 0.03 and an MAE of 17.73, indicating low explanatory power.

#### Regression for Triglycerides in Serum (mg/dL)
ElasticNetCV with an l1_ratio of 0.3 was selected. The model has a negative R² of -0.008, suggesting a poor fit, and an MAE of 54.58.

#### Regression for Total Cholesterol in Serum (mg/dL)
RidgeCV was again selected, with an R² of 0.07, showing some predictive power, and an MAE of 27.25.

### Age Group: 60+, Gender: 1

#### Regression for Glucose in Serum (mg/dL)
The model uses ElasticNetCV with an l1_ratio of 0.2 and a tolerance of 0.01, achieving an R² of -0.003 and an MAE of 25.72, indicating poor predictive power.

#### Regression for Triglycerides in Serum (mg/dL)
ElasticNetCV with a high l1_ratio of 0.9 was chosen. It shows a slightly negative R² of -0.005, suggesting a poor fit, and an MAE of 74.18.

#### Regression for Total Cholesterol in Serum (mg/dL)
ElasticNetCV with an l1_ratio of 0.65 was used, achieving an R² of 0.013 and an MAE of 31.23, indicating negligible explanatory power.

### Age Group: 60+, Gender: 2

#### Regression for Glucose in Serum (mg/dL)
The model is RidgeCV combined with MaxAbsScaler. It shows a negative R² of -0.024, indicating poor predictive capacity, with an MAE of 20.26.

#### Regression for Triglycerides in Serum (mg/dL)
ElasticNetCV with an l1_ratio of 0.05 was selected, achieving a near-zero R² and an MAE of 53.21, reflecting low accuracy.

#### Regression for Total Cholesterol in Serum (mg/dL)
The model stacks a RidgeCV under an ElasticNetCV with an l1_ratio of 0.95, so the RidgeCV predictions serve as an extra input feature. It achieves a negative R² of -0.004 and an MAE of 37.07, indicating no explanatory power.
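
To compare the eighteen gender-stratified models at a glance, the test-set R² values printed in the logs above can be collected into one table. The sketch below copies those values by hand (rounded to four decimals, with the two values of order 1e-05 kept in scientific notation); the English column and biomarker names are labels chosen for this illustration.

```python
import pandas as pd

# Test-set R² values copied from the logs above (rounded).
records = [
    ("0-30", 1, "Glucose", -0.0009),
    ("0-30", 1, "Triglycerides", 0.0143),
    ("0-30", 1, "Total cholesterol", 0.1097),
    ("0-30", 2, "Glucose", -0.0607),
    ("0-30", 2, "Triglycerides", 0.0197),
    ("0-30", 2, "Total cholesterol", 0.0229),
    ("31-60", 1, "Glucose", 0.0272),
    ("31-60", 1, "Triglycerides", -0.0137),
    ("31-60", 1, "Total cholesterol", -1.17e-05),
    ("31-60", 2, "Glucose", 0.0262),
    ("31-60", 2, "Triglycerides", -0.0084),
    ("31-60", 2, "Total cholesterol", 0.0705),
    ("60+", 1, "Glucose", -0.0028),
    ("60+", 1, "Triglycerides", -0.0050),
    ("60+", 1, "Total cholesterol", 0.0126),
    ("60+", 2, "Glucose", -0.0237),
    ("60+", 2, "Triglycerides", -1.19e-05),
    ("60+", 2, "Total cholesterol", -0.0042),
]
df = pd.DataFrame(records, columns=["age_group", "gender", "biomarker", "r2"])

# One row per (age group, gender), one column per biomarker.
r2_table = df.set_index(["age_group", "gender", "biomarker"])["r2"].unstack("biomarker")
print(r2_table)

n_negative = int((df["r2"] < 0).sum())
best = df.loc[df["r2"].idxmax()]
print(f"Negative R² in {n_negative} of {len(df)} models; best: {best['biomarker']}, "
      f"{best['age_group']}, gender {best['gender']} (R² = {best['r2']})")
```

Laid out this way, ten of the eighteen models have a negative R², and only total cholesterol in the 0-30 group for gender 1 exceeds 0.1.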


### Question 5: As age increases, how does this affect blood insulin and cholesterol results?

In [68]:
# Keep only the columns needed from each dataset
demografia_filtered = demografia[['ID', 'Edad en años al momento del examen']]
insulina_filtered = insulina[['ID', 'Insulina (μU/mL)']]
colesterol_filtered = colesterol[['ID', 'Colesterol Total (mg/dL)']]

# Inner-join the datasets on ID
question5 = demografia_filtered.merge(insulina_filtered, on='ID', how='inner')
question5 = question5.merge(colesterol_filtered, on='ID', how='inner')

# Inspect the resulting DataFrame
print(question5.head())
print("Cantidad de registros combinados:", question5.shape)

# Define age groups. With right=False the intervals are [0, 30), [30, 60) and [60, 100),
# so age 30 is labeled '31-60' and age 60 is labeled '60+'.
bins = [0, 30, 60, 100]  # Adjust the age ranges as needed
labels = ['0-30', '31-60', '60+']
question5['Grupo de Edad'] = pd.cut(question5['Edad en años al momento del examen'], bins=bins, labels=labels, right=False)

# Map each target column to a file-name-safe label
targets = {
    'Insulina (μU/mL)': 'Insulina',
    'Colesterol Total (mg/dL)': 'Colesterol'
}

         ID  Edad en años al momento del examen  Insulina (μU/mL)  \
0 109264.00                               13.00              6.05   
1 109271.00                               49.00             16.96   
2 109274.00                               68.00             13.52   
3 109277.00                               12.00              6.44   
4 109282.00                               76.00              7.49   

   Colesterol Total (mg/dL)  
0                    166.00  
1                    147.00  
2                    105.00  
3                    129.00  
4                    233.00  
Cantidad de registros combinados: (5090, 4)


In [69]:
# Iterate over each age group
for grupo in labels:
    # Filter the DataFrame by age group
    group_df = question5[question5['Grupo de Edad'] == grupo]

    # Select the predictor (age only) and the targets
    X = group_df[['Edad en años al momento del examen']]
    y = group_df[['Insulina (μU/mL)', 'Colesterol Total (mg/dL)']]

    # Impute missing values in X with the column mean
    imputer = SimpleImputer(strategy='mean')
    X_imputed = imputer.fit_transform(X)

    print(f"\n------ Grupo de Edad: {grupo} ------")

    # Run TPOT once per target
    for target, file_name in targets.items():
        # Drop rows with a missing target and keep the matching rows of X
        y_target = y[target].dropna()
        X_filtered = X_imputed[~y[target].isna()]
        if len(X_filtered) == 0:
            print(f"No hay suficientes datos para {target} en el grupo de edad {grupo}.")
            continue

        # Train/test split (75% / 25%)
        X_train, X_test, y_train, y_test = train_test_split(X_filtered, y_target, train_size=0.75, test_size=0.25, random_state=42)

        # TPOT regression search
        tpot_regressor = TPOTRegressor(verbosity=2, generations=10, population_size=20, random_state=42)
        tpot_regressor.fit(X_train, y_train)

        # Predict on the held-out split
        y_pred = tpot_regressor.predict(X_test)

        # Compute the metrics
        mae = mean_absolute_error(y_test, y_pred)
        mse = mean_squared_error(y_test, y_pred)
        rmse = np.sqrt(mse)
        r2 = r2_score(y_test, y_pred)

        # Print the metrics for each target
        print(f"\nMétricas de regresión para {target} en el grupo de edad {grupo}:")
        print(f"MAE: {mae}")
        print(f"MSE: {mse}")
        print(f"RMSE: {rmse}")
        print(f"R²: {r2}")

        # Export the pipeline using a file name without special characters
        file_name_cleaned = f"{file_name}_{grupo}".replace('/', '_')
        tpot_regressor.export(f'best_pipeline_{file_name_cleaned}_regression.py')
        print(f"Pipeline exportado para {target} en el grupo de edad {grupo}\n")


------ Grupo de Edad: 0-30 ------


Optimization Progress:   0%|          | 0/220 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: -249.61579440559163

Generation 2 - Current best internal CV score: -249.61579440559163

Generation 3 - Current best internal CV score: -249.5810062737609

Generation 4 - Current best internal CV score: -249.5810062737609

Generation 5 - Current best internal CV score: -249.5810062737609

Generation 6 - Current best internal CV score: -249.5810062737609

Generation 7 - Current best internal CV score: -249.5810062737609

Generation 8 - Current best internal CV score: -249.5810062737609

Generation 9 - Current best internal CV score: -249.5810062737609

Generation 10 - Current best internal CV score: -249.5810062737609

Best pipeline: XGBRegressor(Normalizer(input_matrix, norm=l1), learning_rate=0.01, max_depth=1, min_child_weight=8, n_estimators=100, n_jobs=1, objective=reg:squarederror, subsample=0.2, verbosity=0)

Métricas de regresión para Insulina (μU/mL) en el grupo de edad 0-30:
MAE: 7.690883911322363
MSE: 90.9171146362185
RMSE: 9.53

Generation 1 - Current best internal CV score: -1049.9357369145869

Generation 2 - Current best internal CV score: -1049.9357369145869

Generation 3 - Current best internal CV score: -1049.9357369145869

Generation 4 - Current best internal CV score: -1049.9357369145869

Generation 5 - Current best internal CV score: -1049.9357369145869

Generation 6 - Current best internal CV score: -1049.9357369145869

Generation 7 - Current best internal CV score: -1045.4562293886568

Generation 8 - Current best internal CV score: -1045.4562293886568

Generation 9 - Current best internal CV score: -1045.4562293886568

Generation 10 - Current best internal CV score: -1045.4562293886568

Best pipeline: DecisionTreeRegressor(input_matrix, max_depth=2, min_samples_leaf=7, min_samples_split=8)

Métricas de regresión para Colesterol Total (mg/dL) en el grupo de edad 0-30:
MAE: 23.610686849187022
MSE: 933.3548538010897
RMSE: 30.550856842338966
R²: 0.042733424874419224
Pipeline exportado para Colesterol To

Generation 1 - Current best internal CV score: -540.8792640034977

Generation 2 - Current best internal CV score: -540.5485757458933

Generation 3 - Current best internal CV score: -540.5485757458933

Generation 4 - Current best internal CV score: -540.5485757458933

Generation 5 - Current best internal CV score: -540.5485757458933

Generation 6 - Current best internal CV score: -540.5452180949466

Generation 7 - Current best internal CV score: -540.5452180949466

Generation 8 - Current best internal CV score: -539.5287028154935

Generation 9 - Current best internal CV score: -539.5287028154935

Generation 10 - Current best internal CV score: -539.5287028154935

Best pipeline: LinearSVR(input_matrix, C=0.1, dual=True, epsilon=1.0, loss=squared_epsilon_insensitive, tol=1e-05)

Métricas de regresión para Insulina (μU/mL) en el grupo de edad 31-60:
MAE: 9.250403818525399
MSE: 341.90119893087984
RMSE: 18.490570540977902
R²: -0.00859978441835807
Pipeline exportado para Insulina (μU/mL) en 

Generation 1 - Current best internal CV score: -1476.7434331282243

Generation 2 - Current best internal CV score: -1476.7434331282243

Generation 3 - Current best internal CV score: -1476.7434331282243

Generation 4 - Current best internal CV score: -1476.7434331282243

Generation 5 - Current best internal CV score: -1472.8293560236164

Generation 6 - Current best internal CV score: -1472.8293560236164

Generation 7 - Current best internal CV score: -1472.8293560236164

Generation 8 - Current best internal CV score: -1472.8293560236164

Generation 9 - Current best internal CV score: -1472.8293560236164

Generation 10 - Current best internal CV score: -1472.8293560236164

Best pipeline: DecisionTreeRegressor(input_matrix, max_depth=2, min_samples_leaf=19, min_samples_split=18)

Métricas de regresión para Colesterol Total (mg/dL) en el grupo de edad 31-60:
MAE: 30.622114097089383
MSE: 1660.5920617423476
RMSE: 40.75036271915071
R²: -0.017010154533373845
Pipeline exportado para Colestero

Generation 1 - Current best internal CV score: -1032.7569461323117

Generation 2 - Current best internal CV score: -1032.7569461323117

Generation 3 - Current best internal CV score: -1032.5715410143723

Generation 4 - Current best internal CV score: -1032.5609619240468

Generation 5 - Current best internal CV score: -1032.5609619240468

Generation 6 - Current best internal CV score: -1032.5609619240468

Generation 7 - Current best internal CV score: -1030.6700227188926

Generation 8 - Current best internal CV score: -1030.6700227188926

Generation 9 - Current best internal CV score: -1030.6700227188926

Generation 10 - Current best internal CV score: -1030.6700227188926

Best pipeline: GradientBoostingRegressor(input_matrix, alpha=0.75, learning_rate=0.01, loss=quantile, max_depth=1, max_features=0.4, min_samples_leaf=8, min_samples_split=15, n_estimators=100, subsample=0.45)

Métricas de regresión para Insulina (μU/mL) en el grupo de edad 60+:
MAE: 9.199938690428848
MSE: 277.8469469

Generation 1 - Current best internal CV score: -1887.8413624189707

Generation 2 - Current best internal CV score: -1887.8413624189707

Generation 3 - Current best internal CV score: -1887.8267653858627

Generation 4 - Current best internal CV score: -1887.8267653858627

Generation 5 - Current best internal CV score: -1887.8214234659295

Generation 6 - Current best internal CV score: -1887.1996945980131

Generation 7 - Current best internal CV score: -1887.1996945980131

Generation 8 - Current best internal CV score: -1887.1996945980131

Generation 9 - Current best internal CV score: -1887.1996945980131

Generation 10 - Current best internal CV score: -1886.7158732649427

Best pipeline: RidgeCV(Nystroem(input_matrix, gamma=0.4, kernel=rbf, n_components=10))

Métricas de regresión para Colesterol Total (mg/dL) en el grupo de edad 60+:
MAE: 35.57210411521358
MSE: 1968.475877167195
RMSE: 44.36750925133385
R²: -0.006018249203833426
Pipeline exportado para Colesterol Total (mg/dL) en el gr

### Age Group: 0-30

#### Regression for Insulin (μU/mL)
The selected model is an XGBRegressor preceded by a Normalizer with norm=l1. The XGBRegressor uses 100 estimators with a learning rate of 0.01, a max depth of 1, and a subsample of 0.2, a conservative configuration that limits overfitting. Because age is the only predictor, L1-normalizing each row maps every positive age to 1, so the regressor effectively receives a constant input. The performance is poor, with a negative R² of -0.075 and an MAE of 7.69, indicating no predictive power.
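
The effect of a row-wise Normalizer on a single-column input is easy to check directly. The sketch below uses a few illustrative ages; it is not part of the original notebook.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# A single predictor column (age), as in this question; values are illustrative.
ages = np.array([[13.0], [29.0], [49.0], [68.0]])

# Normalizer rescales each ROW to unit norm; with one column, x / |x| = 1.
normalized = Normalizer(norm="l1").fit_transform(ages)
print(normalized.ravel())
```

Every row collapses to the same value, so whatever estimator follows this step cannot distinguish one age from another.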

#### Regression for Total Cholesterol (mg/dL)
A DecisionTreeRegressor was chosen with a max depth of 2, a minimum of 7 samples per leaf, and a minimum of 8 samples per split. The R² of 0.043 means the model explains about 4% of the variance in total cholesterol, and the MAE is 23.61 mg/dL, so predictive accuracy is weak.

### Age Group: 31-60

#### Regression for Insulin (μU/mL)
The model uses LinearSVR with C=0.1, an epsilon of 1.0, and squared epsilon-insensitive loss, so residuals smaller than 1.0 are not penalized. It achieves a slightly negative R² of -0.009 and an MAE of 9.25, indicating weak predictive performance.

#### Regression for Total Cholesterol (mg/dL)
A DecisionTreeRegressor with a max depth of 2, 19 minimum samples per leaf, and 18 minimum samples per split was selected. It has a negative R² of -0.017 and an MAE of 30.62, reflecting a poor fit.

### Age Group: 60+

#### Regression for Insulin (μU/mL)
The model uses GradientBoostingRegressor with quantile loss, alpha of 0.75, learning rate of 0.01, and subsample of 0.45. With quantile loss and alpha=0.75 the model estimates the 75th percentile of insulin rather than its mean, which by itself tends to lower R². Despite being an ensemble model, it has a negative R² of -0.044 and an MAE of 9.20, indicating limited accuracy.

#### Regression for Total Cholesterol (mg/dL)
The chosen model is RidgeCV combined with Nystroem kernel approximation using an RBF kernel, gamma of 0.4, and 10 components. The performance is poor, with an R² of -0.006 and an MAE of 35.57, indicating weak predictive capacity.
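
Several of the R² values above are negative. As a minimal sketch of what that means, the snippet below uses synthetic values (not NHANES data) to show that a constant prediction equal to the mean of the evaluated values scores R² = 0, and that any prediction with a larger squared error than that baseline scores below 0. The variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import r2_score

# Synthetic target values (illustrative only, not NHANES data).
rng = np.random.default_rng(0)
y_true = rng.normal(loc=190.0, scale=40.0, size=500)

# Baseline: always predict the mean of the evaluated values.
baseline_pred = np.full_like(y_true, y_true.mean())
r2_baseline = r2_score(y_true, baseline_pred)

# A constant prediction that is offset from the mean has a larger
# squared error than the baseline, so its R² is negative.
offset_pred = np.full_like(y_true, y_true.mean() + 5.0)
r2_offset = r2_score(y_true, offset_pred)

print(f"R² of mean baseline:   {r2_baseline:.3f}")
print(f"R² of offset constant: {r2_offset:.3f}")
```

On a held-out test set the comparison is against the test-set mean, so even a model that predicts the training mean can land slightly below zero. Values such as -0.006 or -0.009 are therefore best read as "no better than a constant".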


### Question 6: Does a higher poverty index level correlate with an increase in depressive symptoms?

In [75]:
# Keep only the columns needed from each dataset
demografia_filtered = demografia[['ID', 'Relación de ingresos familiares con la pobreza']]
depresion_filtered = depresion[['ID', 
                                'Poco Interés en Hacer Cosas', 
                                'Sentirse Deprimido o Sin Esperanza', 
                                'Problemas para Dormir', 
                                'Cansancio o Poca Energía', 
                                'Poco Apetito o Comer en Exceso',
                                'Sentirse Mal Acerca de Uno Mismo',
                                'Problemas de Concentración',
                                'Movimientos o Hablar Lento o Rápido',
                                'Pensamientos de Muerte o Autolesión',
                                'Dificultad que Estos Problemas Causan']]

# Merge the datasets on the participant ID
question6 = demografia_filtered.merge(depresion_filtered, on='ID', how='inner')

# Impute missing values with the most frequent value of each column
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='most_frequent')
question6_imputed = pd.DataFrame(imputer.fit_transform(question6), columns=question6.columns)

# Inspect the resulting DataFrame
print(question6_imputed.head())
print("Cantidad de registros combinados:", question6_imputed.shape)


         ID  Relación de ingresos familiares con la pobreza  \
0 109266.00                                            5.00   
1 109271.00                                            5.00   
2 109273.00                                            0.83   
3 109274.00                                            1.20   
4 109282.00                                            3.61   

   Poco Interés en Hacer Cosas  Sentirse Deprimido o Sin Esperanza  \
0                         0.00                                0.00   
1                         2.00                                1.00   
2                         2.00                                2.00   
3                         0.00                                0.00   
4                         0.00                                1.00   

   Problemas para Dormir  Cansancio o Poca Energía  \
0                   0.00                      0.00   
1                   0.00                      0.00   
2                   2.00              

In [73]:
# Build the target as the sum of four depressive symptom items
question6_imputed['Riesgo_Depresion'] = question6_imputed[['Poco Interés en Hacer Cosas', 
                                                           'Sentirse Deprimido o Sin Esperanza', 
                                                           'Problemas para Dormir', 
                                                           'Cansancio o Poca Energía']].sum(axis=1)

# Assign risk categories
# Note: pd.cut uses right-closed intervals by default, so a sum of 0 or a sum
# above 4 falls outside every bin here and ends up with code -1.
# The cell In [76] rebuilds this target from all ten items.
bins = [0, 1, 2, 3, 4]  # Bin edges on the symptom-sum scale
labels = ['nulo' ,'bajo', 'moderado', 'alto']
question6_imputed['Riesgo_Depresion'] = pd.cut(question6_imputed['Riesgo_Depresion'], bins=bins, labels=labels)

# Convert the risk category to a numeric code
question6_imputed['Riesgo_Depresion'] = question6_imputed['Riesgo_Depresion'].cat.codes


In [76]:
import pandas as pd
from sklearn.impute import SimpleImputer
from tpot import TPOTClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

# Impute missing values ('Missing') in all depressive symptom variables
variables_sintomas = ['Poco Interés en Hacer Cosas', 
                      'Sentirse Deprimido o Sin Esperanza', 
                      'Problemas para Dormir', 
                      'Cansancio o Poca Energía',
                      'Poco Apetito o Comer en Exceso',
                      'Sentirse Mal Acerca de Uno Mismo',
                      'Problemas de Concentración',
                      'Movimientos o Hablar Lento o Rápido',
                      'Pensamientos de Muerte o Autolesión',
                      'Dificultad que Estos Problemas Causan']

imputer = SimpleImputer(strategy='most_frequent')
question6_imputed[variables_sintomas] = imputer.fit_transform(question6_imputed[variables_sintomas])

# Build the target as the sum of all depressive symptom items
question6_imputed['Riesgo_Depresion'] = question6_imputed[variables_sintomas].sum(axis=1)

# Assign risk categories on the symptom-sum scale.
# include_lowest=True closes the first interval on the left ([0, 2]), so a sum
# of 0 is labelled 'nulo' instead of falling outside every bin (code -1).
bins = [0, 2, 5, 8, 11, 15, float('inf')]
labels = ['nulo', 'leve', 'moderado', 'alto', 'muy alto', 'extremo']
question6_imputed['Riesgo_Depresion'] = pd.cut(question6_imputed['Riesgo_Depresion'], bins=bins, labels=labels, include_lowest=True)

# Convert the risk category to a numeric code
question6_imputed['Riesgo_Depresion'] = question6_imputed['Riesgo_Depresion'].cat.codes

# Select the predictor and the target
X = question6_imputed[['Relación de ingresos familiares con la pobreza']]
y = question6_imputed['Riesgo_Depresion']

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, test_size=0.25, random_state=42)

# TPOT search for a classification pipeline
tpot_classifier = TPOTClassifier(verbosity=2, generations=10, population_size=20, random_state=42, scoring='accuracy')
tpot_classifier.fit(X_train, y_train)

# Evaluate the model
print("Score de clasificación para Riesgo_Depresion:", tpot_classifier.score(X_test, y_test))

# Predict on the test set
y_pred = tpot_classifier.predict(X_test)

# Compute evaluation metrics
print("\nMatriz de confusión para Riesgo_Depresion:")
print(confusion_matrix(y_test, y_pred))

print("\nReporte de clasificación para Riesgo_Depresion:")
print(classification_report(y_test, y_pred))

# Export the pipeline
tpot_classifier.export('best_pipeline_depresion_classification.py')
print("Pipeline exportado para clasificación de Riesgo_Depresion")



Optimization Progress:   0%|          | 0/220 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: 0.5976499159143212

Generation 2 - Current best internal CV score: 0.5976499159143212

Generation 3 - Current best internal CV score: 0.5976499159143212

Generation 4 - Current best internal CV score: 0.5976499159143212

Generation 5 - Current best internal CV score: 0.5976499159143212

Generation 6 - Current best internal CV score: 0.5976499159143212

Generation 7 - Current best internal CV score: 0.5976499159143212

Generation 8 - Current best internal CV score: 0.5976499159143212

Generation 9 - Current best internal CV score: 0.5976499159143212

Generation 10 - Current best internal CV score: 0.5976499159143212

Best pipeline: GaussianNB(input_matrix)
Score de clasificación para Riesgo_Depresion: 0.6159678858162355

Matriz de confusión para Riesgo_Depresion:
[[1381    0    0    0    0    0]
 [ 380    0    0    0    0    0]
 [ 202    0    0    0    0    0]
 [ 108    0    0    0    0    0]
 [  85    0    0    0    0    0]
 [  86    0   

              precision    recall  f1-score   support

           0       0.62      1.00      0.76      1381
           1       0.00      0.00      0.00       380
           2       0.00      0.00      0.00       202
           3       0.00      0.00      0.00       108
           4       0.00      0.00      0.00        85
           5       0.00      0.00      0.00        86

    accuracy                           0.62      2242
   macro avg       0.10      0.17      0.13      2242
weighted avg       0.38      0.62      0.47      2242

Pipeline exportado para clasificación de Riesgo_Depresion


### Age Group: All Ages
#### Classification for Depression Risk Levels
The best pipeline selected by TPOT after ten generations is **GaussianNB**. GaussianNB assumes that, within each class, every feature follows a normal distribution; here the only feature is the income-to-poverty ratio. The model predicts depression risk categories derived from the summed depressive symptom scores, which were binned into six levels: nulo, leve, moderado, alto, muy alto, and extremo.
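
Because the risk levels come from binning a summed score with `pd.cut`, the handling of the bin edges is worth checking. The sketch below uses made-up scores (not NHANES records) with the same bin edges and labels as the notebook; it shows that, with the default right-closed intervals, a score of exactly 0 lies outside the first bin `(0, 2]` and receives the code `-1`, whereas `include_lowest=True` assigns it to the first category.

```python
import pandas as pd

# Hypothetical summed symptom scores; 0 means no symptoms reported.
scores = pd.Series([0, 1, 2, 3, 5, 6, 16])

bins = [0, 2, 5, 8, 11, 15, float("inf")]
labels = ["nulo", "leve", "moderado", "alto", "muy alto", "extremo"]

# Default: intervals are (0, 2], (2, 5], ... so a score of 0 matches no bin
# and becomes NaN, which .cat.codes encodes as -1.
default_codes = pd.cut(scores, bins=bins, labels=labels).cat.codes

# include_lowest=True closes the first interval on the left: [0, 2].
closed_codes = pd.cut(scores, bins=bins, labels=labels, include_lowest=True).cat.codes

print(default_codes.tolist())  # [-1, 0, 0, 1, 1, 2, 5]
print(closed_codes.tolist())   # [0, 0, 0, 1, 1, 2, 5]
```

Whether this matters depends on how many participants have a total score of 0 in the merged data, which the printed output does not show.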

- **Preprocessing**:
  - Imputation was performed using the most frequent strategy to handle missing values.
  - The depressive symptoms were summed to form a risk score and categorized into six risk levels, later encoded as numeric variables for classification.

- **Performance Metrics**:
  - **Classification Accuracy**: 61.60% on test data. This equals the share of nulo cases in the test set (1,381 of 2,242), so the model does no better than always predicting the majority class.
  - **Confusion Matrix**:
    - The model assigned every test case to the nulo category, which makes all 1,381 nulo cases correct.
    - It did not predict leve, moderado, alto, muy alto, or extremo for any case. Class imbalance contributes to this, and so does the use of a single predictor (the income-to-poverty ratio).
  - **Precision, Recall, and F1-Score**:
    - Precision, recall, and F1-score are nonzero only for the nulo category (0.62, 1.00, and 0.76); they are 0.00 for all other categories.
    - The macro averages (precision: 0.10, recall: 0.17, F1-score: 0.13) reflect this: a macro recall of 0.17 corresponds to one class out of six being recalled perfectly.
    - The weighted averages (precision: 0.38, recall: 0.62, F1-score: 0.47) are higher only because the nulo class dominates the test set.

- **Interpretation**:
  - The model does not discriminate between risk levels; it labels every case as nulo. Under this model, the income-to-poverty ratio alone therefore shows no usable signal for separating the risk categories. Further adjustments, such as oversampling, undersampling, additional predictors, or different models, may be necessary to improve performance across all categories.

- **Pipeline Export**: 
  - The GaussianNB pipeline has been exported for further refinement, deployment, or analysis.
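
The metrics above can be checked against a trivial baseline. The following sketch rebuilds the test labels from the per-class support in the classification report and scores a "predictor" that always outputs class 0 (nulo). The array names are illustrative; only the support counts come from the notebook output.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Test-set support per class, copied from the classification report.
support = [1381, 380, 202, 108, 85, 86]

# Rebuild the true labels and a constant prediction of class 0 (nulo).
y_true = np.repeat(np.arange(len(support)), support)
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
macro_recall = recall_score(y_true, y_pred, average="macro", zero_division=0)
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
weighted_f1 = f1_score(y_true, y_pred, average="weighted", zero_division=0)

print(f"accuracy:     {acc:.2f}")           # 0.62
print(f"macro recall: {macro_recall:.2f}")  # 0.17
print(f"macro F1:     {macro_f1:.2f}")      # 0.13
print(f"weighted F1:  {weighted_f1:.2f}")   # 0.47
```

These values match the report for the GaussianNB pipeline, which is consistent with the pipeline behaving like a majority-class predictor on this test set.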


### Refinement of Classification Model
Since this is our only classification task, we focus on improving it. The initial GaussianNB model reaches 61.60% accuracy only by predicting the lowest risk category (nulo) for every case. It does not identify any higher risk level, which points to class imbalance and to the limited information carried by a single predictor.

To enhance the model's accuracy and generalization, we will implement strategies such as:
- **Hyperparameter Tuning**: Adjusting parameters like priors and var_smoothing in GaussianNB to optimize performance.
- **Class Balancing Techniques**: Using oversampling or undersampling methods to address the class imbalance, ensuring better representation of all risk levels during training.

The goal is to develop a more robust classification model that achieves higher accuracy across all depression risk levels, providing a more reliable assessment of depressive symptoms.
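
As a minimal sketch of the oversampling idea, the helper below upsamples each class separately with `sklearn.utils.resample` until every class has as many rows as the majority class. The function name `oversample_per_class` and the toy data are assumptions for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

def oversample_per_class(X, y, random_state=42):
    """Upsample every class (with replacement) to the majority-class size."""
    target_size = int(y.value_counts().max())
    parts_X, parts_y = [], []
    for cls in sorted(y.unique()):
        mask = (y == cls)
        # Draw target_size rows from this class only, keeping X and y aligned.
        X_cls, y_cls = resample(
            X[mask], y[mask],
            replace=True,
            n_samples=target_size,
            random_state=random_state,
        )
        parts_X.append(X_cls)
        parts_y.append(y_cls)
    return (pd.concat(parts_X, ignore_index=True),
            pd.concat(parts_y, ignore_index=True))

# Toy, imbalanced data (illustrative only).
rng = np.random.default_rng(0)
y_toy = pd.Series([0] * 60 + [1] * 30 + [2] * 10, name="Riesgo_Depresion")
X_toy = pd.DataFrame({"poverty_ratio": rng.uniform(0, 5, size=len(y_toy))})

X_bal, y_bal = oversample_per_class(X_toy, y_toy)
print(y_bal.value_counts().sort_index())
```

Resampling of this kind should be applied to the training split only, so that the test set keeps the original class distribution.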


In [78]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, test_size=0.25, random_state=42)

# RandomForest model with balanced class weights
rf_classifier = RandomForestClassifier(class_weight='balanced', random_state=42)
rf_classifier.fit(X_train, y_train)

# Predict on the test set
y_pred = rf_classifier.predict(X_test)

# Confusion matrix and classification report
print("\nMatriz de confusión para Riesgo_Depresion:")
print(confusion_matrix(y_test, y_pred))

print("\nReporte de clasificación para Riesgo_Depresion:")
print(classification_report(y_test, y_pred, zero_division=1))



Matriz de confusión para Riesgo_Depresion:
[[618 114 216 153 154 126]
 [154  43  55  50  44  34]
 [ 68  24  30  27  23  30]
 [ 30  11  13  15  21  18]
 [ 22   6   9  13  18  17]
 [ 24   6  15  15  16  10]]

Reporte de clasificación para Riesgo_Depresion:
              precision    recall  f1-score   support

           0       0.67      0.45      0.54      1381
           1       0.21      0.11      0.15       380
           2       0.09      0.15      0.11       202
           3       0.05      0.14      0.08       108
           4       0.07      0.21      0.10        85
           5       0.04      0.12      0.06        86

    accuracy                           0.33      2242
   macro avg       0.19      0.20      0.17      2242
weighted avg       0.47      0.33      0.38      2242



### Classification Model for Depression Risk
#### Model: RandomForestClassifier (balanced class weights)
A RandomForestClassifier with `class_weight='balanced'` was fitted in place of the GaussianNB pipeline selected by TPOT. Accuracy dropped from 0.62 to 0.33, but the model now predicts every risk level: macro F1 rose from 0.13 to 0.17, and recall for the higher risk categories is between 0.11 and 0.21 instead of 0.00. Precision for those categories stays at or below 0.21, so most higher-risk predictions are still wrong.

#### Classification Report for Depression Risk
- **Accuracy**: 0.33
- **Precision (Weighted Avg)**: 0.47
- **Recall (Weighted Avg)**: 0.33
- **F1-Score (Weighted Avg)**: 0.38

##### Category-wise Performance
- **Category 0 (nulo)**: Precision: 0.67, Recall: 0.45, F1-Score: 0.54, Support: 1381
- **Category 1 (leve)**: Precision: 0.21, Recall: 0.11, F1-Score: 0.15, Support: 380
- **Category 2 (moderado)**: Precision: 0.09, Recall: 0.15, F1-Score: 0.11, Support: 202
- **Category 3 (alto)**: Precision: 0.05, Recall: 0.14, F1-Score: 0.08, Support: 108
- **Category 4 (muy alto)**: Precision: 0.07, Recall: 0.21, F1-Score: 0.10, Support: 85
- **Category 5 (extremo)**: Precision: 0.04, Recall: 0.12, F1-Score: 0.06, Support: 86

The aim is to refine the model to achieve more accurate and reliable predictions across all risk levels of depression.


In [80]:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix, make_scorer, f1_score
from sklearn.utils import resample

# Stratified train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)

# Resampling with sklearn.utils.resample.
# Note: this draws one bootstrap sample from the whole training set, with as
# many rows as the majority class. It does not equalize the class counts;
# per-class resampling would be needed for that.
X_train_resampled, y_train_resampled = resample(
    X_train,
    y_train,
    replace=True,  # Sample with replacement
    n_samples=y_train.value_counts().max(),  # Sample size = majority-class count
    random_state=42
)

# Define the models (only RandomForest uses class_weight)
modelos = {
    "RandomForest": RandomForestClassifier(class_weight='balanced', random_state=42),
    "XGBoost": XGBClassifier(use_label_encoder=False, eval_metric='logloss', random_state=42),
    "GradientBoosting": GradientBoostingClassifier(random_state=42),
    "VotingClassifier": VotingClassifier(
        estimators=[
            ('rf', RandomForestClassifier(class_weight='balanced', random_state=42)),
            ('xgb', XGBClassifier(use_label_encoder=False, eval_metric='logloss', random_state=42)),
            ('gb', GradientBoostingClassifier(random_state=42))
        ], voting='soft'
    )
}

# GridSearchCV setup, scored by macro F1
f1_scorer = make_scorer(f1_score, average='macro')
param_grid = {
    "RandomForest": {'n_estimators': [50, 100], 'max_depth': [5, 10]},
    "XGBoost": {'n_estimators': [50, 100], 'max_depth': [3, 5]},
    "GradientBoosting": {'n_estimators': [50, 100], 'max_depth': [3, 5]},
    "VotingClassifier": {}
}

# Train and evaluate each model
for nombre, modelo in modelos.items():
    print(f"\nEntrenando modelo: {nombre}...")
    if nombre in param_grid and param_grid[nombre]:  # Run the grid search only if a grid is defined
        grid_search = GridSearchCV(modelo, param_grid[nombre], cv=3, scoring=f1_scorer, n_jobs=-1)
        grid_search.fit(X_train_resampled, y_train_resampled)
        best_model = grid_search.best_estimator_
    else:
        best_model = modelo.fit(X_train_resampled, y_train_resampled)
    
    # Predictions on the untouched test set
    y_pred = best_model.predict(X_test)

    # Evaluation metrics
    print(f"\nMatriz de confusión para {nombre}:")
    print(confusion_matrix(y_test, y_pred))
    print(f"\nReporte de clasificación para {nombre}:")
    print(classification_report(y_test, y_pred))



Entrenando modelo: RandomForest...

Matriz de confusión para RandomForest:
[[694 148 142 129 120 117]
 [189  45  54  41  41  43]
 [ 75  28  21  23  23  35]
 [ 47  10  13  17  12  19]
 [ 34  10   9   9  10  13]
 [ 26   9   7   6   9  14]]

Reporte de clasificación para RandomForest:
              precision    recall  f1-score   support

           0       0.65      0.51      0.57      1350
           1       0.18      0.11      0.14       413
           2       0.09      0.10      0.09       205
           3       0.08      0.14      0.10       118
           4       0.05      0.12      0.07        85
           5       0.06      0.20      0.09        71

    accuracy                           0.36      2242
   macro avg       0.18      0.20      0.18      2242
weighted avg       0.44      0.36      0.39      2242


Entrenando modelo: XGBoost...



Matriz de confusión para XGBoost:
[[1245   61   26   17    1    0]
 [ 385   19    6    3    0    0]
 [ 190   10    2    3    0    0]
 [ 109    5    1    2    1    0]
 [  77    5    0    2    1    0]
 [  66    5    0    0    0    0]]

Reporte de clasificación para XGBoost:


              precision    recall  f1-score   support

           0       0.60      0.92      0.73      1350
           1       0.18      0.05      0.07       413
           2       0.06      0.01      0.02       205
           3       0.07      0.02      0.03       118
           4       0.33      0.01      0.02        85
           5       0.00      0.00      0.00        71

    accuracy                           0.57      2242
   macro avg       0.21      0.17      0.14      2242
weighted avg       0.42      0.57      0.46      2242


Entrenando modelo: GradientBoosting...

Matriz de confusión para GradientBoosting:
[[1244   43   33   21    7    2]
 [ 388    8   10    4    3    0]
 [ 188    9    1    4    3    0]
 [ 108    5    2    1    2    0]
 [  78    1    2    2    2    0]
 [  65    2    2    0    2    0]]

Reporte de clasificación para GradientBoosting:
              precision    recall  f1-score   support

           0       0.60      0.92      0.73      1350
           1    


Matriz de confusión para VotingClassifier:
[[1165   65   52   34   18   16]
 [ 354   22   16    7   10    4]
 [ 176   13    4    5    4    3]
 [ 103    5    2    4    3    1]
 [  70    4    2    2    6    1]
 [  60    5    2    1    3    0]]

Reporte de clasificación para VotingClassifier:
              precision    recall  f1-score   support

           0       0.60      0.86      0.71      1350
           1       0.19      0.05      0.08       413
           2       0.05      0.02      0.03       205
           3       0.08      0.03      0.05       118
           4       0.14      0.07      0.09        85
           5       0.00      0.00      0.00        71

    accuracy                           0.54      2242
   macro avg       0.18      0.17      0.16      2242
weighted avg       0.41      0.54      0.45      2242



### Classification Models for Depression Risk

#### Model: RandomForest
The RandomForest model had the lowest accuracy of the four models (0.36) but spread its predictions across all six categories. Recall was 0.51 for 'nulo' and between 0.10 and 0.20 for the higher risk categories, with precision below 0.20 for each of them.

##### Classification Report
- **Accuracy**: 0.36
- **Precision (Weighted Avg)**: 0.44
- **Recall (Weighted Avg)**: 0.36
- **F1-Score (Weighted Avg)**: 0.39

#### Model: XGBoost
XGBoost reached the highest accuracy (0.57), driven by a recall of 0.92 for the 'nulo' category. Recall for every higher risk level was 0.05 or lower, and no 'extremo' case was identified.

##### Classification Report
- **Accuracy**: 0.57
- **Precision (Weighted Avg)**: 0.42
- **Recall (Weighted Avg)**: 0.57
- **F1-Score (Weighted Avg)**: 0.46

#### Model: GradientBoosting
GradientBoosting performed similarly to XGBoost, with the same recall of 0.92 for the 'nulo' category and very few correct predictions for higher risk levels.

##### Classification Report
- **Accuracy**: 0.56
- **Precision (Weighted Avg)**: 0.39
- **Recall (Weighted Avg)**: 0.56
- **F1-Score (Weighted Avg)**: 0.45

#### Model: VotingClassifier
The VotingClassifier, which combines the three previous models by soft voting, had lower recall for 'nulo' (0.86) and lower accuracy (0.54) than the two boosting models. Recall for the higher risk levels stayed at 0.07 or below.

##### Classification Report
- **Accuracy**: 0.54
- **Precision (Weighted Avg)**: 0.41
- **Recall (Weighted Avg)**: 0.54
- **F1-Score (Weighted Avg)**: 0.45


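As we can observe, XGBoost had the highest accuracy at 57%, mostly because of its high recall for the 'nulo' category. Of the models whose reports are shown in full, RandomForest had the highest macro F1 (0.18, versus 0.14 for XGBoost and 0.16 for VotingClassifier). Performance remained low for all other risk levels in every model.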
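
The printed GradientBoosting report is cut off in the notebook output, but its confusion matrix is complete. As a sketch, the helper below recomputes accuracy, macro F1, and weighted F1 from a confusion matrix whose rows are true classes and whose columns are predicted classes (the scikit-learn convention). The function name `metrics_from_confusion` is an assumption; the two matrices are copied from the output above.

```python
import numpy as np

def metrics_from_confusion(cm):
    """Accuracy, macro F1 and weighted F1 from a confusion matrix
    (rows = true class, columns = predicted class)."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    support = cm.sum(axis=1)     # true count per class
    predicted = cm.sum(axis=0)   # predicted count per class
    denom = support + predicted
    # F1 = 2*TP / (true count + predicted count); defined as 0 when both are 0.
    f1 = np.divide(2 * tp, denom, out=np.zeros_like(tp), where=denom > 0)
    return {
        "accuracy": tp.sum() / cm.sum(),
        "macro_f1": f1.mean(),
        "weighted_f1": (f1 * support).sum() / support.sum(),
    }

cm_xgboost = [
    [1245, 61, 26, 17, 1, 0],
    [385, 19, 6, 3, 0, 0],
    [190, 10, 2, 3, 0, 0],
    [109, 5, 1, 2, 1, 0],
    [77, 5, 0, 2, 1, 0],
    [66, 5, 0, 0, 0, 0],
]
cm_gradient_boosting = [
    [1244, 43, 33, 21, 7, 2],
    [388, 8, 10, 4, 3, 0],
    [188, 9, 1, 4, 3, 0],
    [108, 5, 2, 1, 2, 0],
    [78, 1, 2, 2, 2, 0],
    [65, 2, 2, 0, 2, 0],
]

xgb_metrics = metrics_from_confusion(cm_xgboost)
gb_metrics = metrics_from_confusion(cm_gradient_boosting)
print({k: round(v, 2) for k, v in xgb_metrics.items()})
print({k: round(v, 2) for k, v in gb_metrics.items()})
```

Applied to the XGBoost matrix, the helper reproduces the accuracy (0.57), macro F1 (0.14), and weighted F1 (0.46) of the printed report, which gives some confidence in using it for the truncated GradientBoosting output.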


In [85]:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix, make_scorer, f1_score
from sklearn.utils.class_weight import compute_class_weight
import numpy as np

# Columns used as predictors.
# Note: 'Género' must be present in question6_imputed. The merge in In [75]
# selects only 'ID' and the poverty ratio from demografia, so that selection
# has to include 'Género' for this cell to run.
columnas_necesarias = ['Género', 'Relación de ingresos familiares con la pobreza']

# Select the predictors
X = question6_imputed[columnas_necesarias]

# Shift the target so that class labels start at 0
y = question6_imputed['Riesgo_Depresion'] - question6_imputed['Riesgo_Depresion'].min()

# Stratified train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)

# Compute balanced class weights
clase_pesos = compute_class_weight(class_weight='balanced', classes=np.unique(y_train), y=y_train)
pesos = dict(enumerate(clase_pesos))

# Define the models. Only RandomForest receives class weights:
# scale_pos_weight applies to binary classification and 1 is its default,
# so the XGBoost models here are effectively unweighted.
modelos = {
    "RandomForest": RandomForestClassifier(class_weight=pesos, random_state=42),
    "XGBoost": XGBClassifier(scale_pos_weight=1, use_label_encoder=False, eval_metric='logloss', random_state=42),
    "GradientBoosting": GradientBoostingClassifier(random_state=42),
    "VotingClassifier": VotingClassifier(
        estimators=[
            ('rf', RandomForestClassifier(class_weight=pesos, random_state=42)),
            ('xgb', XGBClassifier(scale_pos_weight=1, use_label_encoder=False, eval_metric='logloss', random_state=42)),
            ('gb', GradientBoostingClassifier(random_state=42))
        ], voting='soft'
    )
}

# GridSearchCV setup, scored by macro F1
f1_scorer = make_scorer(f1_score, average='macro')
param_grid = {
    "RandomForest": {'n_estimators': [50, 100], 'max_depth': [5, 10]},
    "XGBoost": {'n_estimators': [50, 100], 'max_depth': [3, 5]},
    "GradientBoosting": {'n_estimators': [50, 100], 'max_depth': [3, 5]},
    "VotingClassifier": {}
}

# Train and evaluate each model
for nombre, modelo in modelos.items():
    print(f"\nEntrenando modelo: {nombre}...")
    if nombre in param_grid and param_grid[nombre]:  # Run the grid search only if a grid is defined
        grid_search = GridSearchCV(modelo, param_grid[nombre], cv=3, scoring=f1_scorer, n_jobs=-1, error_score='raise')
        grid_search.fit(X_train, y_train)
        best_model = grid_search.best_estimator_
    else:
        best_model = modelo.fit(X_train, y_train)
    
    # Predictions on the test set
    y_pred = best_model.predict(X_test)

    # Evaluation metrics
    print(f"\nReporte de clasificación para {nombre}:")
    print(classification_report(y_test, y_pred))



Entrenando modelo: RandomForest...

Reporte de clasificación para RandomForest:
              precision    recall  f1-score   support

           0       0.67      0.48      0.56      1350
           1       0.21      0.08      0.11       413
           2       0.09      0.12      0.10       205
           3       0.07      0.21      0.10       118
           4       0.04      0.11      0.06        85
           5       0.06      0.24      0.09        71

    accuracy                           0.34      2242
   macro avg       0.19      0.21      0.17      2242
weighted avg       0.46      0.34      0.38      2242


Entrenando modelo: XGBoost...



Reporte de clasificación para XGBoost:


              precision    recall  f1-score   support

           0       0.60      1.00      0.75      1350
           1       0.18      0.00      0.01       413
           2       0.00      0.00      0.00       205
           3       0.00      0.00      0.00       118
           4       0.00      0.00      0.00        85
           5       0.00      0.00      0.00        71

    accuracy                           0.60      2242
   macro avg       0.13      0.17      0.13      2242
weighted avg       0.40      0.60      0.45      2242


Entrenando modelo: GradientBoosting...

Reporte de clasificación para GradientBoosting:
              precision    recall  f1-score   support

           0       0.60      0.99      0.75      1350
           1       0.12      0.00      0.01       413
           2       0.00      0.00      0.00       205
           3       0.00      0.00      0.00       118
           4       0.00      0.00      0.00        85
           5       0.00      0.00      0.00


Reporte de clasificación para VotingClassifier:
              precision    recall  f1-score   support

           0       0.60      0.97      0.75      1350
           1       0.17      0.02      0.03       413
           2       0.00      0.00      0.00       205
           3       0.00      0.00      0.00       118
           4       0.00      0.00      0.00        85
           5       0.00      0.00      0.00        71

    accuracy                           0.59      2242
   macro avg       0.13      0.17      0.13      2242
weighted avg       0.40      0.59      0.45      2242



### Classification Models for Depression Risk

#### Model: RandomForestClassifier
- **Description**: This model used a random forest ensemble with class balancing to improve performance across imbalanced classes.
- **Performance**:
  - **Accuracy**: 34%
  - **Macro Avg F1-Score**: 0.17
  - **Weighted Avg F1-Score**: 0.38
  - **Recall for Class 0**: 48%
  - **Recall for Other Classes**: Ranged from 8% to 24%, showing limited sensitivity for classes 1 to 5.

#### Model: XGBoostClassifier
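- **Description**: The XGBoost model was trained with log loss as its evaluation metric. `scale_pos_weight=1` is the default value and applies to binary classification, so no class weighting was in effect, and recall did not improve for any class other than class 0.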
- **Performance**:
  - **Accuracy**: 60%
  - **Macro Avg F1-Score**: 0.13
  - **Weighted Avg F1-Score**: 0.45
  - **Recall for Class 0**: 100%
  - **Recall for Other Classes**: Near zero, indicating poor prediction capability for minority classes.

#### Model: GradientBoostingClassifier
- **Description**: The Gradient Boosting model was trained without class weights and predicted class 0 for almost every case.
- **Performance**:
  - **Accuracy**: 60%
  - **Macro Avg F1-Score**: 0.13
  - **Weighted Avg F1-Score**: 0.45
  - **Recall for Class 0**: 99%
  - **Recall for Other Classes**: Nearly zero, highlighting similar limitations as XGBoost.

#### Model: VotingClassifier
- **Description**: This ensemble combined predictions from RandomForest, XGBoost, and GradientBoosting, using a soft voting mechanism.
- **Performance**:
  - **Accuracy**: 59%
  - **Macro Avg F1-Score**: 0.13
  - **Weighted Avg F1-Score**: 0.45
  - **Recall for Class 0**: 97%
  - **Recall for Other Classes**: Marginal improvements for class 1 (2%), but still weak performance across other classes.


XGBoost, GradientBoosting, and the VotingClassifier reached 97% to 100% recall for class 0 but almost none for the minority classes. RandomForest with class weights was the only model with nonzero recall for every class, at the cost of lower accuracy (34%). Overall accuracy ranged from 34% to 60%, with XGBoost and GradientBoosting achieving the highest values, largely by predicting class 0. Further tuning or resampling strategies, like SMOTE or class-specific weighting adjustments, are needed to improve performance across the imbalanced classes.
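
One way to apply class-specific weighting to boosting models that have no `class_weight` argument is to pass per-row weights to `fit`. The sketch below uses scikit-learn's `compute_sample_weight` with a `GradientBoostingClassifier` on synthetic data; the toy arrays are assumptions for illustration. `XGBClassifier.fit` also accepts a `sample_weight` argument, so the same idea should carry over, although that is not tested here.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.utils.class_weight import compute_sample_weight

# Synthetic, imbalanced three-class data (illustrative only).
rng = np.random.default_rng(0)
y_toy = np.repeat([0, 1, 2], [120, 40, 20])
X_toy = rng.normal(loc=y_toy[:, None], scale=1.0, size=(len(y_toy), 2))

# 'balanced' gives each row the weight n_samples / (n_classes * class_count),
# so every class contributes the same total weight to the loss.
sample_weight = compute_sample_weight(class_weight="balanced", y=y_toy)

model = GradientBoostingClassifier(n_estimators=50, random_state=42)
model.fit(X_toy, y_toy, sample_weight=sample_weight)

# Total weight per class after balancing.
class_totals = {int(c): float(sample_weight[y_toy == c].sum())
                for c in np.unique(y_toy)}
print(class_totals)
```

Weighting changes the training objective only. It cannot add information that the predictors do not carry, so gains on the minority classes may still be small with only two features.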


### Question 7: How do waist circumference and body mass index affect blood pressure in different age groups?

In [86]:
# Relationship between waist circumference, BMI and blood pressure
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from tpot import TPOTRegressor

# Filter and combine the datasets
demografia_filtered = demografia[['ID', 'Edad en años al momento del examen']]
medidasCorporales_filtered = medidas[['ID', 'Circunferencia de la cintura (cm)', 'Índice de masa corporal (kg/m²)']]
presionArterial_filtered = presion[['ID', 'Presión sistólica - 1ra lectura oscilométrica', 'Presión diastólica - 1ra lectura oscilométrica']]

# Merge on the participant ID
question1 = demografia_filtered.merge(medidasCorporales_filtered, on='ID', how='inner')
question1 = question1.merge(presionArterial_filtered, on='ID', how='inner')

# Define the age groups.
# With right=False the intervals are [0, 30), [30, 60) and [60, 90), so the
# labels are approximate: age 30 falls in '31-60' and age 60 in '60+'.
bins = [0, 30, 60, 90]  # Adjust the age ranges as needed
labels = ['0-30', '31-60', '60+']
question1['Grupo de Edad'] = pd.cut(question1['Edad en años al momento del examen'], bins=bins, labels=labels, right=False)

# Fit one model per age group
for grupo in labels:
    print(f"\n--- Resultados para el grupo de edad {grupo} ---")
    
    # Rows belonging to this age group
    grupo_df = question1[question1['Grupo de Edad'] == grupo]
    
    # Predictors and target (systolic pressure, first oscillometric reading)
    X = grupo_df[['Edad en años al momento del examen', 'Circunferencia de la cintura (cm)', 'Índice de masa corporal (kg/m²)']]
    y = grupo_df['Presión sistólica - 1ra lectura oscilométrica']
    
    # Mean imputation of missing values.
    # Note: the target is imputed as well, and the imputer is fitted before
    # the train/test split, so test rows influence the imputed values.
    imputer = SimpleImputer(strategy='mean')
    X_imputed = imputer.fit_transform(X)
    y_imputed = imputer.fit_transform(y.values.reshape(-1, 1)).ravel()
    
    # Train/test split
    X_train, X_test, y_train, y_test = train_test_split(X_imputed, y_imputed, train_size=0.75, test_size=0.25, random_state=42)
    
    # TPOT search for a regression pipeline
    tpot_regressor = TPOTRegressor(verbosity=2, generations=5, population_size=20, random_state=42)
    tpot_regressor.fit(X_train, y_train)
    
    # Predict on the test set
    y_pred = tpot_regressor.predict(X_test)
    
    # Compute the metrics
    mae = mean_absolute_error(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_test, y_pred)
    
    # Print the metrics
    print(f"Métricas de regresión para grupo de edad {grupo}:")
    print(f"MAE: {mae}")
    print(f"MSE: {mse}")
    print(f"RMSE: {rmse}")
    print(f"R²: {r2}")
    
    # Export the pipeline
    tpot_regressor.export(f'best_pipeline_presion_sistolica_{grupo}_regression.py')
    print(f"Pipeline exportado para el grupo de edad {grupo}\n")


--- Resultados para el grupo de edad 0-30 ---


Optimization Progress:   0%|          | 0/120 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: -104.92930292938422

Generation 2 - Current best internal CV score: -104.8862786094005

Generation 3 - Current best internal CV score: -104.8862786094005

Generation 4 - Current best internal CV score: -104.70212769862265

Generation 5 - Current best internal CV score: -104.69652636332614

Best pipeline: RandomForestRegressor(ElasticNetCV(input_matrix, l1_ratio=0.75, tol=0.01), bootstrap=True, max_features=0.4, min_samples_leaf=18, min_samples_split=14, n_estimators=100)
Métricas de regresión para grupo de edad 0-30:
MAE: 7.744270416206562
MSE: 100.31046709668436
RMSE: 10.015511324774405
R²: 0.19682258903345273
Pipeline exportado para el grupo de edad 0-30


--- Resultados para el grupo de edad 31-60 ---


Optimization Progress:   0%|          | 0/120 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: -256.01558637946715

Generation 2 - Current best internal CV score: -256.01558637946715

Generation 3 - Current best internal CV score: -256.01558637946715

Generation 4 - Current best internal CV score: -256.0023464006406

Generation 5 - Current best internal CV score: -255.99494372861918

Best pipeline: RidgeCV(FastICA(input_matrix, tol=0.25))
Métricas de regresión para grupo de edad 31-60:
MAE: 11.362548665050744
MSE: 236.6329358073595
RMSE: 15.382878007946351
R²: 0.07302843789169489
Pipeline exportado para el grupo de edad 31-60


--- Resultados para el grupo de edad 60+ ---


Optimization Progress:   0%|          | 0/120 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: -403.1697067793719

Generation 2 - Current best internal CV score: -402.83710088453194

Generation 3 - Current best internal CV score: -402.718628663233

Generation 4 - Current best internal CV score: -402.718628663233

Generation 5 - Current best internal CV score: -401.70534969600186

Best pipeline: XGBRegressor(StandardScaler(RidgeCV(input_matrix)), learning_rate=0.01, max_depth=8, min_child_weight=17, n_estimators=100, n_jobs=1, objective=reg:squarederror, subsample=0.3, verbosity=0)
Métricas de regresión para grupo de edad 60+:
MAE: 14.628753913318903
MSE: 376.4969596120861
RMSE: 19.403529565831214
R²: 0.02798170470123429
Pipeline exportado para el grupo de edad 60+



### Regression Models for Blood Pressure Prediction by Age Group

### Age Group: 0-30
#### Model: RandomForestRegressor
**Description**: TPOT selected a RandomForestRegressor fed by an upstream ElasticNetCV step. The forest uses bootstrap sampling, considers 40% of the features at each split, and grows 100 trees; `min_samples_leaf=18` and `min_samples_split=14` limit how finely each tree can partition the data.
**Performance**:
- **MAE**: 7.74
- **MSE**: 100.31
- **RMSE**: 10.02
- **R²**: 0.20  
*This is the highest R² of the three age groups, but the model still explains only about 20% of the variance in systolic pressure, so its predictive power is weak.*

### Age Group: 31-60
#### Model: RidgeCV with FastICA
**Description**: This model applies RidgeCV with FastICA as a preprocessing step, which helps in separating independent components before fitting the regression model.
**Performance**:
- **MAE**: 11.36
- **MSE**: 236.63
- **RMSE**: 15.38
- **R²**: 0.07  
*An R² of 0.07 means the model explains about 7% of the variance in systolic pressure, so age, waist circumference, and BMI alone predict it poorly in this group.*

### Age Group: 60+
#### Model: XGBRegressor with StandardScaler and RidgeCV
**Description**: TPOT selected an XGBRegressor fed by an upstream RidgeCV step whose output is then standardized with StandardScaler. The booster uses a learning rate of 0.01, a maximum tree depth of 8, `min_child_weight=17`, and a row subsample of 0.3.
**Performance**:
- **MAE**: 14.63
- **MSE**: 376.50
- **RMSE**: 19.40
- **R²**: 0.03  
*The model shows poor predictive capacity for this age group, as reflected by the low R².*

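The **RandomForestRegressor** pipeline for the 0-30 age group achieved the highest R² (0.20). The **RidgeCV with FastICA** pipeline for the 31-60 group (R² = 0.07) and the **XGBRegressor** pipeline for the 60+ group (R² = 0.03) explained almost none of the variance. Absolute error also grows with age: RMSE rises from 10.02 to 15.38 to 19.40 mmHg. Because missing systolic readings were mean-imputed before the train/test split, these metrics should be read with caution.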
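
The loop above fits the imputer on the whole age group before splitting and fills missing systolic readings with the mean. The following is a minimal sketch of one leakage-free alternative. It assumes synthetic data and placeholder column names (`age`, `waist_cm`, `bmi`, `systolic`) instead of the NHANES columns, and it uses `Ridge` as a stand-in for the TPOT search. It illustrates the idea only and is not the procedure that produced the results reported here.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in for one age group; all column names are placeholders.
rng = np.random.default_rng(42)
n = 400
df = pd.DataFrame({
    "age": rng.uniform(30, 60, n),
    "waist_cm": rng.normal(95, 12, n),
    "bmi": rng.normal(28, 5, n),
})
df["systolic"] = 90 + 0.5 * df["age"] + 0.2 * df["waist_cm"] + rng.normal(0, 12, n)

# Knock out some predictor and target values to mimic missing readings.
df.loc[rng.choice(n, 40, replace=False), "waist_cm"] = np.nan
df.loc[rng.choice(n, 30, replace=False), "systolic"] = np.nan

# 1) Drop rows with a missing target instead of imputing it.
labelled = df.dropna(subset=["systolic"])
X = labelled[["age", "waist_cm", "bmi"]]
y = labelled["systolic"]

# 2) Split first, so the test rows never influence the imputer.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# 3) Impute inside the pipeline: the column means are learned from X_train only.
model = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("regress", Ridge(alpha=1.0)),
])
model.fit(X_train, y_train)

r2 = r2_score(y_test, model.predict(X_test))
print(f"Held-out R²: {r2:.2f}")
```

Keeping the imputer inside the `Pipeline` means any cross-validation wrapped around it refits the imputation on each training fold, which is the main reason to prefer this layout over imputing up front.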


### Question 8:  How does the level of C-reactive protein influence glucose, triglyceride, and cholesterol levels?

In [91]:
# Keep only the columns needed for this question
demografia_filtered = demografia[['ID', 'Edad en años al momento del examen', 'Género']]
proteinaC_filtered = proteinaC[['ID', 'Proteína C Reactiva (mg/L)']]
perfilBioquimico_filtered = perfilB[['ID', 'Glucosa, suero refrigerado (mg/dL)', 'Triglicéridos, suero refrigerado (mg/dL)', 'Colesterol Total, suero refrigerado (mg/dL)']]

# Merge the datasets on participant ID
question1 = demografia_filtered.merge(proteinaC_filtered, on='ID', how='inner')
question1 = question1.merge(perfilBioquimico_filtered, on='ID', how='inner')

# Define the age groups (with right=False the intervals are [0, 30), [30, 60) and [60, 90))
bins = [0, 30, 60, 90]
labels = ['0-30', '31-60', '60+']
question1['Grupo de Edad'] = pd.cut(question1['Edad en años al momento del examen'], bins=bins, labels=labels, right=False)

# Iterate over each age group
for grupo in labels:
    print(f"\n--- Resultados para el grupo de edad {grupo} ---")
    
    grupo_df = question1[question1['Grupo de Edad'] == grupo]
    
    X = grupo_df[['Proteína C Reactiva (mg/L)', 'Edad en años al momento del examen', 'Género']]
    y = grupo_df[['Glucosa, suero refrigerado (mg/dL)', 'Triglicéridos, suero refrigerado (mg/dL)', 'Colesterol Total, suero refrigerado (mg/dL)']]
    
    imputer = SimpleImputer(strategy='mean')
    X_imputed = imputer.fit_transform(X)
    
    targets = {
        'Glucosa, suero refrigerado (mg/dL)': 'Glucosa_suero',
        'Triglicéridos, suero refrigerado (mg/dL)': 'Trigliceridos_suero',
        'Colesterol Total, suero refrigerado (mg/dL)': 'Colesterol_Total_suero'
    }
    
    for target, file_name in targets.items():
        y_target = y[target].dropna()
        X_filtered = X_imputed[~y[target].isna()]
        X_train, X_test, y_train, y_test = train_test_split(X_filtered, y_target, train_size=0.75, test_size=0.25, random_state=42)
        
        tpot_regressor = TPOTRegressor(verbosity=2, generations=10, population_size=20, random_state=42)
        tpot_regressor.fit(X_train, y_train)
        
        y_pred = tpot_regressor.predict(X_test)
        
        mae = mean_absolute_error(y_test, y_pred)
        mse = mean_squared_error(y_test, y_pred)
        rmse = np.sqrt(mse)
        r2 = r2_score(y_test, y_pred)
        
        print(f"Métricas de regresión para {target}:")
        print(f"MAE: {mae}")
        print(f"MSE: {mse}")
        print(f"RMSE: {rmse}")
        print(f"R²: {r2}")
        
        tpot_regressor.export(f'best_pipeline_{file_name}_{grupo}_regression.py')
        print(f"Pipeline exportado para {target}\n")



--- Resultados para el grupo de edad 0-30 ---


Optimization Progress:   0%|          | 0/220 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: -245.41371637466182

Generation 2 - Current best internal CV score: -245.41371637466182

Generation 3 - Current best internal CV score: -245.41371637466182

Generation 4 - Current best internal CV score: -245.41371637466182

Generation 5 - Current best internal CV score: -245.0074152032111

Generation 6 - Current best internal CV score: -245.0074152032111

Generation 7 - Current best internal CV score: -245.0074152032111

Generation 8 - Current best internal CV score: -244.93434441057622

Generation 9 - Current best internal CV score: -244.93434441057622

Generation 10 - Current best internal CV score: -244.93434441057622

Best pipeline: RandomForestRegressor(ElasticNetCV(input_matrix, l1_ratio=0.75, tol=0.01), bootstrap=True, max_features=0.4, min_samples_leaf=18, min_samples_split=14, n_estimators=100)
Métricas de regresión para Glucosa, suero refrigerado (mg/dL):
MAE: 7.110961586513223
MSE: 139.76099921789196
RMSE: 11.822055625731592
R

Optimization Progress:   0%|          | 0/220 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: -4694.011262952226

Generation 2 - Current best internal CV score: -4694.011262952226

Generation 3 - Current best internal CV score: -4682.144283601105

Generation 4 - Current best internal CV score: -4681.467165526311

Generation 5 - Current best internal CV score: -4677.423552464799

Generation 6 - Current best internal CV score: -4677.423552464799

Generation 7 - Current best internal CV score: -4666.815416341028

Generation 8 - Current best internal CV score: -4666.815416341028

Generation 9 - Current best internal CV score: -4664.753831303587

Generation 10 - Current best internal CV score: -4664.753831303587

Best pipeline: AdaBoostRegressor(RidgeCV(input_matrix), learning_rate=0.001, loss=linear, n_estimators=100)
Métricas de regresión para Triglicéridos, suero refrigerado (mg/dL):
MAE: 43.69596927351057
MSE: 4693.669549293313
RMSE: 68.51036089011146
R²: 0.051113889644481714
Pipeline exportado para Triglicéridos, suero refrigerado (mg/dL)

Optimization Progress:   0%|          | 0/220 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: -1040.5453505948558

Generation 2 - Current best internal CV score: -1040.5453505948558

Generation 3 - Current best internal CV score: -1040.5453505948558

Generation 4 - Current best internal CV score: -1040.5453505948558

Generation 5 - Current best internal CV score: -1040.5453505948558

Generation 6 - Current best internal CV score: -1040.5453505948558

Generation 7 - Current best internal CV score: -1040.5453505948558

Generation 8 - Current best internal CV score: -1040.069128707627

Generation 9 - Current best internal CV score: -1040.069128707627

Generation 10 - Current best internal CV score: -1037.723712242736

Best pipeline: KNeighborsRegressor(input_matrix, n_neighbors=80, p=1, weights=uniform)
Métricas de regresión para Colesterol Total, suero refrigerado (mg/dL):
MAE: 23.68142235123367
MSE: 1018.8330678519592
RMSE: 31.919164585746277
R²: 0.10982262851587543
Pipeline exportado para Colesterol Total, suero refrigerado (mg/dL)

Optimization Progress:   0%|          | 0/220 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: -1462.407133805116

Generation 2 - Current best internal CV score: -1462.407133805116

Generation 3 - Current best internal CV score: -1462.407133805116

Generation 4 - Current best internal CV score: -1460.2883491398939

Generation 5 - Current best internal CV score: -1460.2883491398939

Generation 6 - Current best internal CV score: -1460.2883491398939

Generation 7 - Current best internal CV score: -1460.2883491398939

Generation 8 - Current best internal CV score: -1460.2883491398939

Generation 9 - Current best internal CV score: -1460.1021997872415

Generation 10 - Current best internal CV score: -1460.1021997872415

Best pipeline: RandomForestRegressor(StandardScaler(input_matrix), bootstrap=True, max_features=0.45, min_samples_leaf=17, min_samples_split=14, n_estimators=100)
Métricas de regresión para Glucosa, suero refrigerado (mg/dL):
MAE: 18.76802930587106
MSE: 1429.2051538193236
RMSE: 37.804829768421435
R²: 0.05117814568051426

Optimization Progress:   0%|          | 0/220 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: -14440.560505565587

Generation 2 - Current best internal CV score: -14403.746624823412

Generation 3 - Current best internal CV score: -14403.746624823412

Generation 4 - Current best internal CV score: -14396.100881060198

Generation 5 - Current best internal CV score: -14396.100881060198

Generation 6 - Current best internal CV score: -14396.100881060198

Generation 7 - Current best internal CV score: -14396.100881060198

Generation 8 - Current best internal CV score: -14393.790009760825

Generation 9 - Current best internal CV score: -14393.790009760825

Generation 10 - Current best internal CV score: -14311.1080200018

Best pipeline: AdaBoostRegressor(PolynomialFeatures(input_matrix, degree=2, include_bias=False, interaction_only=False), learning_rate=0.001, loss=exponential, n_estimators=100)
Métricas de regresión para Triglicéridos, suero refrigerado (mg/dL):
MAE: 67.27778226648034
MSE: 13755.35015283724
RMSE: 117.283204905209
R²: 

Optimization Progress:   0%|          | 0/220 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: -1440.192221486382

Generation 2 - Current best internal CV score: -1437.644834655702

Generation 3 - Current best internal CV score: -1437.1777842414456

Generation 4 - Current best internal CV score: -1436.5174220908418

Generation 5 - Current best internal CV score: -1429.6039452528425

Generation 6 - Current best internal CV score: -1429.52333562428

Generation 7 - Current best internal CV score: -1429.52333562428

Generation 8 - Current best internal CV score: -1429.52333562428

Generation 9 - Current best internal CV score: -1428.9549656778224

Generation 10 - Current best internal CV score: -1428.9549656778224

Best pipeline: ExtraTreesRegressor(input_matrix, bootstrap=True, max_features=0.8500000000000001, min_samples_leaf=9, min_samples_split=11, n_estimators=100)
Métricas de regresión para Colesterol Total, suero refrigerado (mg/dL):
MAE: 30.76404248474794
MSE: 1639.4719534858023
RMSE: 40.49039334812397
R²: 0.02855918249720757
Pipeline exportado para Colesterol Total, suero refrigerado (mg/dL)

Optimization Progress:   0%|          | 0/220 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: -1512.3671061439259

Generation 2 - Current best internal CV score: -1505.6750567896804

Generation 3 - Current best internal CV score: -1505.6750567896804

Generation 4 - Current best internal CV score: -1504.1090663829334

Generation 5 - Current best internal CV score: -1504.087754827337

Generation 6 - Current best internal CV score: -1502.8992397213956

Generation 7 - Current best internal CV score: -1502.3713473780697

Generation 8 - Current best internal CV score: -1502.3713473780697

Generation 9 - Current best internal CV score: -1502.3713473780697

Generation 10 - Current best internal CV score: -1502.3615863187329

Best pipeline: XGBRegressor(input_matrix, learning_rate=0.01, max_depth=2, min_child_weight=18, n_estimators=100, n_jobs=1, objective=reg:squarederror, subsample=0.25, verbosity=0)
Métricas de regresión para Glucosa, suero refrigerado (mg/dL):
MAE: 22.60079502100422
MSE: 1599.8884399702831
RMSE: 39.9986054753198
R²: -

Optimization Progress:   0%|          | 0/220 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: -8095.361736496408

Generation 2 - Current best internal CV score: -8094.566329199952

Generation 3 - Current best internal CV score: -8094.566329199952

Generation 4 - Current best internal CV score: -8087.9953669807655

Generation 5 - Current best internal CV score: -8087.9953669807655

Generation 6 - Current best internal CV score: -8067.532116393266

Generation 7 - Current best internal CV score: -8067.532116393266

Generation 8 - Current best internal CV score: -8067.532116393266

Generation 9 - Current best internal CV score: -8067.532116393266

Generation 10 - Current best internal CV score: -8067.532116393266

Best pipeline: AdaBoostRegressor(input_matrix, learning_rate=0.001, loss=exponential, n_estimators=100)
Métricas de regresión para Triglicéridos, suero refrigerado (mg/dL):
MAE: 58.968198653673646
MSE: 7715.1154560566365
RMSE: 87.8357299511801
R²: 0.01834087532790596
Pipeline exportado para Triglicéridos, suero refrigerado (mg/dL)

Optimization Progress:   0%|          | 0/220 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: -1782.3444974074275

Generation 2 - Current best internal CV score: -1782.3353450048423

Generation 3 - Current best internal CV score: -1782.3353450048423

Generation 4 - Current best internal CV score: -1782.3353450048423

Generation 5 - Current best internal CV score: -1782.331375004295

Generation 6 - Current best internal CV score: -1782.268292295596

Generation 7 - Current best internal CV score: -1782.268292295596

Generation 8 - Current best internal CV score: -1782.1464465015931

Generation 9 - Current best internal CV score: -1782.1464465015931

Generation 10 - Current best internal CV score: -1782.1464465015931

Best pipeline: ElasticNetCV(RobustScaler(input_matrix), l1_ratio=0.35000000000000003, tol=0.1)
Métricas de regresión para Colesterol Total, suero refrigerado (mg/dL):
MAE: 32.70013351277181
MSE: 1700.9021301056716
RMSE: 41.24199473965428
R²: 0.08240064207699405
Pipeline exportado para Colesterol Total, suero refrigerado (mg/dL)

### Regression Models for C-Reactive Protein Impact on Glucose, Triglycerides, and Total Cholesterol Levels

#### Age Group: 0-30
**Model for Serum Glucose (mg/dL):**
- **Model**: RandomForestRegressor with ElasticNetCV pre-processing
- **Performance**:
  - **MAE**: 7.11
  - **MSE**: 139.76
  - **RMSE**: 11.82
  - **R²**: -0.03
- **Pipeline Exported**: best_pipeline_Glucosa_suero_0-30_regression.py

**Model for Serum Triglycerides (mg/dL):**
- **Model**: AdaBoostRegressor with RidgeCV pre-processing
- **Performance**:
  - **MAE**: 43.70
  - **MSE**: 4693.67
  - **RMSE**: 68.51
  - **R²**: 0.05
- **Pipeline Exported**: best_pipeline_Trigliceridos_suero_0-30_regression.py

**Model for Total Cholesterol (mg/dL):**
- **Model**: KNeighborsRegressor
- **Performance**:
  - **MAE**: 23.68
  - **MSE**: 1018.83
  - **RMSE**: 31.92
  - **R²**: 0.11
- **Pipeline Exported**: best_pipeline_Colesterol_Total_suero_0-30_regression.py

#### Age Group: 31-60
**Model for Serum Glucose (mg/dL):**
- **Model**: RandomForestRegressor with StandardScaler pre-processing
- **Performance**:
  - **MAE**: 18.77
  - **MSE**: 1429.21
  - **RMSE**: 37.80
  - **R²**: 0.05
- **Pipeline Exported**: best_pipeline_Glucosa_suero_31-60_regression.py

**Model for Serum Triglycerides (mg/dL):**
- **Model**: AdaBoostRegressor with PolynomialFeatures pre-processing
- **Performance**:
  - **MAE**: 67.28
  - **MSE**: 13755.35
  - **RMSE**: 117.28
  - **R²**: 0.04
- **Pipeline Exported**: best_pipeline_Trigliceridos_suero_31-60_regression.py

**Model for Total Cholesterol (mg/dL):**
- **Model**: ExtraTreesRegressor
- **Performance**:
  - **MAE**: 30.76
  - **MSE**: 1639.47
  - **RMSE**: 40.49
  - **R²**: 0.03
- **Pipeline Exported**: best_pipeline_Colesterol_Total_suero_31-60_regression.py

#### Age Group: 60+
**Model for Serum Glucose (mg/dL):**
- **Model**: XGBRegressor
- **Performance**:
  - **MAE**: 22.60
  - **MSE**: 1599.89
  - **RMSE**: 40.00
  - **R²**: -0.003
- **Pipeline Exported**: best_pipeline_Glucosa_suero_60+_regression.py

**Model for Serum Triglycerides (mg/dL):**
- **Model**: AdaBoostRegressor
- **Performance**:
  - **MAE**: 58.97
  - **MSE**: 7715.12
  - **RMSE**: 87.84
  - **R²**: 0.02
- **Pipeline Exported**: best_pipeline_Trigliceridos_suero_60+_regression.py

**Model for Total Cholesterol (mg/dL):**
- **Model**: ElasticNetCV with RobustScaler pre-processing
- **Performance**:
  - **MAE**: 32.70
  - **MSE**: 1700.90
  - **RMSE**: 41.24
  - **R²**: 0.08
- **Pipeline Exported**: best_pipeline_Colesterol_Total_suero_60+_regression.py


Predictive performance was weak in every age group, with R² between -0.03 and 0.11. The two glucose models with negative R² (0-30 and 60+) performed no better than predicting the mean. TPOT selected a different pipeline for almost every target, mixing tree ensembles, boosting, nearest-neighbor, and regularized linear models. With C-reactive protein, age, and gender as the only predictors, the results point to at most a weak association between C-reactive protein and glucose, triglycerides, or total cholesterol.
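
To make the negative R² values above easier to read, here is a small worked sketch of the definition, R² = 1 - SS_res / SS_tot. The numbers are made up for illustration and are not NHANES values; `r2_score_manual` is a hypothetical helper written for this example, equivalent in intent to the `r2_score` call used in the notebook.

```python
def r2_score_manual(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot, with SS_tot measured around the mean of y_true."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up glucose-like test values (mg/dL), for illustration only.
y_test = [85.0, 92.0, 99.0, 104.0, 120.0]

# Predicting the test-set mean for every participant gives R² = 0.
mean_prediction = [sum(y_test) / len(y_test)] * len(y_test)
r2_mean = r2_score_manual(y_test, mean_prediction)

# A constant prediction that misses the mean does worse than that baseline,
# so R² drops below zero.
off_center = [90.0] * len(y_test)
r2_off = r2_score_manual(y_test, off_center)

print(f"R² when predicting the mean: {r2_mean:.3f}")
print(f"R² for an off-center constant: {r2_off:.3f}")
```

Read this way, an R² of 0.05 or 0.11 means the fitted pipeline removes only 5% or 11% of the squared error that the mean-only baseline leaves.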


### Question 9: What is the impact of electrolyte levels (sodium, potassium, phosphorus) on kidney function, assessed through creatinine levels?

In [92]:
# What is the impact of electrolyte levels (sodium, potassium, phosphorus) on kidney function, assessed through creatinine?

# Keep only the columns needed, making sure the age column is included
demografia_filtered = demografia[['ID', 'Edad en años al momento del examen']]
perfilBioquimico_filtered = perfilB[['ID', 'Creatinina, suero refrigerado (mg/dL)', 'Nitrógeno Ureico en Sangre (mg/dL)', 'Sodio (mmol/L)', 'Potasio (mmol/L)', 'Fósforo (mg/dL)']]

# Merge the datasets so the age column is available
merged_df = perfilBioquimico_filtered.merge(demografia_filtered, on='ID', how='inner')

# Define the age groups (with right=False the intervals are [0, 30), [30, 60) and [60, 90))
bins = [0, 30, 60, 90]
labels = ['0-30', '31-60', '60+']
merged_df['Grupo de Edad'] = pd.cut(merged_df['Edad en años al momento del examen'], bins=bins, labels=labels, right=False)

# Iterate over each age group
for grupo in labels:
    print(f"\n--- Resultados para el grupo de edad {grupo} ---")
    
    grupo_df = merged_df[merged_df['Grupo de Edad'] == grupo]
    
    X = grupo_df[['Sodio (mmol/L)', 'Potasio (mmol/L)', 'Fósforo (mg/dL)']]
    y = grupo_df[['Creatinina, suero refrigerado (mg/dL)', 'Nitrógeno Ureico en Sangre (mg/dL)']]
    
    # Impute missing values in X with the column mean
    imputer = SimpleImputer(strategy='mean')
    X_imputed = imputer.fit_transform(X)
    
    targets = {
        'Creatinina, suero refrigerado (mg/dL)': 'Creatinina',
        'Nitrógeno Ureico en Sangre (mg/dL)': 'Nitrogeno_Urico'
    }
    
    for target, file_name in targets.items():
        y_target = y[target].dropna()
        X_filtered = X_imputed[~y[target].isna()]
        X_train, X_test, y_train, y_test = train_test_split(X_filtered, y_target, train_size=0.75, test_size=0.25, random_state=42)
        
        tpot_regressor = TPOTRegressor(verbosity=2, generations=10, population_size=20, random_state=42)
        tpot_regressor.fit(X_train, y_train)
        
        y_pred = tpot_regressor.predict(X_test)
        
        # Compute metrics
        mae = mean_absolute_error(y_test, y_pred)
        mse = mean_squared_error(y_test, y_pred)
        rmse = np.sqrt(mse)
        r2 = r2_score(y_test, y_pred)
        
        # Print metrics for each target
        print(f"Métricas de regresión para {target}:")
        print(f"MAE: {mae}")
        print(f"MSE: {mse}")
        print(f"RMSE: {rmse}")
        print(f"R²: {r2}")
        
        # Export the best pipeline
        tpot_regressor.export(f'best_pipeline_{file_name}_{grupo}_regression.py')
        print(f"Pipeline exportado para {target}\n")



--- Resultados para el grupo de edad 0-30 ---


Optimization Progress:   0%|          | 0/220 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: -0.03119756712112142

Generation 2 - Current best internal CV score: -0.031190767956919507

Generation 3 - Current best internal CV score: -0.031190767956919507

Generation 4 - Current best internal CV score: -0.031190767956919507

Generation 5 - Current best internal CV score: -0.031183442984804875

Generation 6 - Current best internal CV score: -0.031183442984804875

Generation 7 - Current best internal CV score: -0.031183442984804875

Generation 8 - Current best internal CV score: -0.031183442984804875

Generation 9 - Current best internal CV score: -0.031183442984804875

Generation 10 - Current best internal CV score: -0.031180202459760404

Best pipeline: AdaBoostRegressor(ZeroCount(MinMaxScaler(input_matrix)), learning_rate=0.01, loss=linear, n_estimators=100)
Métricas de regresión para Creatinina, suero refrigerado (mg/dL):
MAE: 0.14024028439053923
MSE: 0.03122952687203027
RMSE: 0.17671877905879238
R²: 0.0539475524280324
Pipeline exportado para Creatinina, suero refrigerado (mg/dL)

Optimization Progress:   0%|          | 0/220 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: -12.052882216933373

Generation 2 - Current best internal CV score: -12.052882216933373

Generation 3 - Current best internal CV score: -12.052882216933373

Generation 4 - Current best internal CV score: -12.052009250308767

Generation 5 - Current best internal CV score: -12.044765319773818

Generation 6 - Current best internal CV score: -12.044765319773818

Generation 7 - Current best internal CV score: -12.044765319773816

Generation 8 - Current best internal CV score: -12.044765319773816

Generation 9 - Current best internal CV score: -12.035880621444127

Generation 10 - Current best internal CV score: -12.035880621444127

Best pipeline: ElasticNetCV(Normalizer(input_matrix, norm=l1), l1_ratio=0.65, tol=0.0001)
Métricas de regresión para Nitrógeno Ureico en Sangre (mg/dL):
MAE: 2.606143498564616
MSE: 10.65791896529519
RMSE: 3.2646468362282604
R²: 0.015500482930625004
Pipeline exportado para Nitrógeno Ureico en Sangre (mg/dL)


--- Resultados para el grupo de edad 31-60 ---

Optimization Progress:   0%|          | 0/220 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: -0.24735701726104747

Generation 2 - Current best internal CV score: -0.2397853646910147

Generation 3 - Current best internal CV score: -0.23374331739555637

Generation 4 - Current best internal CV score: -0.23374331739555637

Generation 5 - Current best internal CV score: -0.23374331739555637

Generation 6 - Current best internal CV score: -0.23374331739555637

Generation 7 - Current best internal CV score: -0.23374331739555637

Generation 8 - Current best internal CV score: -0.22602408209437114

Generation 9 - Current best internal CV score: -0.22598654618958047

Generation 10 - Current best internal CV score: -0.2240087203049387

Best pipeline: DecisionTreeRegressor(RidgeCV(RidgeCV(StandardScaler(input_matrix))), max_depth=4, min_samples_leaf=3, min_samples_split=4)
Métricas de regresión para Creatinina, suero refrigerado (mg/dL):
MAE: 0.1845236507692911
MSE: 0.19255338713463083
RMSE: 0.4388090554382747
R²: -0.13179294125132723
Pipeline exportado para Creatinina, suero refrigerado (mg/dL)

Optimization Progress:   0%|          | 0/220 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: -22.243188752373516

Generation 2 - Current best internal CV score: -22.243188752373516

Generation 3 - Current best internal CV score: -22.243188752373516

Generation 4 - Current best internal CV score: -22.243188752373516

Generation 5 - Current best internal CV score: -22.243188752373516

Generation 6 - Current best internal CV score: -22.243188752373516

Generation 7 - Current best internal CV score: -22.243188752373516

Generation 8 - Current best internal CV score: -22.243188752373516

Generation 9 - Current best internal CV score: -22.243188752373516

Generation 10 - Current best internal CV score: -22.243188752373516

Best pipeline: DecisionTreeRegressor(FastICA(input_matrix, tol=0.45), max_depth=3, min_samples_leaf=8, min_samples_split=11)
Métricas de regresión para Nitrógeno Ureico en Sangre (mg/dL):
MAE: 3.302357509700059
MSE: 22.954607958266934
RMSE: 4.791096738562783
R²: 0.1485536641112447
Pipeline exportado para Nitrógeno Ureico en Sangre (mg/dL)

Optimization Progress:   0%|          | 0/220 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: -0.32982234352225037

Generation 2 - Current best internal CV score: -0.32982234352225037

Generation 3 - Current best internal CV score: -0.32982234352225037

Generation 4 - Current best internal CV score: -0.32982234352225037

Generation 5 - Current best internal CV score: -0.323419672789443

Generation 6 - Current best internal CV score: -0.30134140942929155

Generation 7 - Current best internal CV score: -0.30134140942929155

Generation 8 - Current best internal CV score: -0.30134140942929155

Generation 9 - Current best internal CV score: -0.3002879144667518

Generation 10 - Current best internal CV score: -0.2859933443193122

Best pipeline: DecisionTreeRegressor(MinMaxScaler(FastICA(input_matrix, tol=0.45)), max_depth=4, min_samples_leaf=3, min_samples_split=12)
Métricas de regresión para Creatinina, suero refrigerado (mg/dL):
MAE: 0.2821608136805942
MSE: 0.3692689540060878
RMSE: 0.6076750398083566
R²: -0.02418167823721684
Pipeline exportado para Creatinina, suero refrigerado (mg/dL)

Optimization Progress:   0%|          | 0/220 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: -48.172148240781745

Generation 2 - Current best internal CV score: -48.132931984106015

Generation 3 - Current best internal CV score: -48.132931984106015

Generation 4 - Current best internal CV score: -48.07517736716332

Generation 5 - Current best internal CV score: -48.07517736716332

Generation 6 - Current best internal CV score: -48.01771691978067

Generation 7 - Current best internal CV score: -48.01771691978067

Generation 8 - Current best internal CV score: -48.01771691978067

Generation 9 - Current best internal CV score: -48.01771691978067

Generation 10 - Current best internal CV score: -48.01771691978057

Best pipeline: RidgeCV(SelectPercentile(StandardScaler(MinMaxScaler(input_matrix)), percentile=59))
Métricas de regresión para Nitrógeno Ureico en Sangre (mg/dL):
MAE: 4.696939197475276
MSE: 40.81007037373741
RMSE: 6.388276009514414
R²: 0.06368170849891297
Pipeline exportado para Nitrógeno Ureico en Sangre (mg/dL)



### Regression Models for Electrolyte Impact on Creatinine and Blood Urea Nitrogen Levels

#### Age Group: 0-30
**Model for Creatinine (mg/dL):**
- **Model**: `AdaBoostRegressor` with `MinMaxScaler` pre-processing
- **Performance**:
  - **MAE**: 0.14
  - **MSE**: 0.03
  - **RMSE**: 0.18
  - **R²**: 0.05
- **Pipeline Exported**: `best_pipeline_Creatinina_0-30_regression.py`

**Model for Blood Urea Nitrogen (mg/dL):**
- **Model**: `ElasticNetCV` with `Normalizer` pre-processing
- **Performance**:
  - **MAE**: 2.61
  - **MSE**: 10.66
  - **RMSE**: 3.26
  - **R²**: 0.02
- **Pipeline Exported**: `best_pipeline_Nitrogeno_Urico_0-30_regression.py`

#### Age Group: 31-60
**Model for Creatinine (mg/dL):**
- **Model**: `DecisionTreeRegressor` with `RidgeCV` pre-processing
- **Performance**:
  - **MAE**: 0.18
  - **MSE**: 0.19
  - **RMSE**: 0.44
  - **R²**: -0.13
- **Pipeline Exported**: `best_pipeline_Creatinina_31-60_regression.py`

**Model for Blood Urea Nitrogen (mg/dL):**
- **Model**: `DecisionTreeRegressor` with `FastICA` pre-processing
- **Performance**:
  - **MAE**: 3.30
  - **MSE**: 22.95
  - **RMSE**: 4.79
  - **R²**: 0.15
- **Pipeline Exported**: `best_pipeline_Nitrogeno_Urico_31-60_regression.py`

#### Age Group: 60+
**Model for Creatinine (mg/dL):**
- **Model**: `DecisionTreeRegressor` with `FastICA` and `MinMaxScaler` pre-processing
- **Performance**:
  - **MAE**: 0.28
  - **MSE**: 0.37
  - **RMSE**: 0.61
  - **R²**: -0.02
- **Pipeline Exported**: `best_pipeline_Creatinina_60+_regression.py`

**Model for Blood Urea Nitrogen (mg/dL):**
- **Model**: `RidgeCV` with `SelectPercentile` pre-processing
- **Performance**:
  - **MAE**: 4.70
  - **MSE**: 40.81
  - **RMSE**: 6.39
  - **R²**: 0.06
- **Pipeline Exported**: `best_pipeline_Nitrogeno_Urico_60+_regression.py`


Predictive performance was weak in every age group, with R² between -0.13 and 0.15. DecisionTreeRegressor was the most frequently selected final estimator (three of the six pipelines); AdaBoostRegressor, ElasticNetCV, and RidgeCV were each selected once. The creatinine models for the 31-60 and 60+ groups had negative R², meaning they did no better than predicting the mean. Feature engineering or more advanced ensemble techniques might improve the predictive accuracy of these models.
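
TPOT prints each best pipeline in nested notation, which reads from the inside out: the innermost operator is applied to `input_matrix` first, and the outermost call is the final estimator. The sketch below tallies the final estimators from the six "Best pipeline" lines in the Question 9 output. `pipeline_steps` is a hypothetical helper written for this example, and it assumes a linear chain of operators, which holds for these six pipelines but not for pipelines that combine branches.

```python
import re
from collections import Counter

# "Best pipeline" lines copied from the Question 9 output.
best_pipelines = [
    "AdaBoostRegressor(ZeroCount(MinMaxScaler(input_matrix)), learning_rate=0.01, loss=linear, n_estimators=100)",
    "ElasticNetCV(Normalizer(input_matrix, norm=l1), l1_ratio=0.65, tol=0.0001)",
    "DecisionTreeRegressor(RidgeCV(RidgeCV(StandardScaler(input_matrix))), max_depth=4, min_samples_leaf=3, min_samples_split=4)",
    "DecisionTreeRegressor(FastICA(input_matrix, tol=0.45), max_depth=3, min_samples_leaf=8, min_samples_split=11)",
    "DecisionTreeRegressor(MinMaxScaler(FastICA(input_matrix, tol=0.45)), max_depth=4, min_samples_leaf=3, min_samples_split=12)",
    "RidgeCV(SelectPercentile(StandardScaler(MinMaxScaler(input_matrix)), percentile=59))",
]


def pipeline_steps(expr):
    """Return operator names in execution order (innermost operator first)."""
    # Every operator name is immediately followed by an opening parenthesis.
    names = re.findall(r"([A-Za-z_]\w*)\(", expr)
    return list(reversed(names))


# The last step in execution order is the final estimator.
final_counts = Counter(pipeline_steps(p)[-1] for p in best_pipelines)

for name, count in final_counts.most_common():
    print(f"{name}: {count}")
```

For the first pipeline, `pipeline_steps` returns `MinMaxScaler`, `ZeroCount`, `AdaBoostRegressor`, so the data is scaled and augmented before the boosting model is fitted.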


### Question 10: How do age and physical activity level affect waist circumference, BMI, and weight?

In [93]:
# How do age and physical activity level affect waist circumference, BMI, and weight?

# Keep only the columns needed for this question
demografia_filtered = demografia[['ID', 'Edad en años al momento del examen', 'Género']]
medidasCorporales_filtered = medidas[['ID', 'Circunferencia de la cintura (cm)', 'Índice de masa corporal (kg/m²)', 'Peso (kg)']]

# Merge the datasets on participant ID
question3 = demografia_filtered.merge(medidasCorporales_filtered, on='ID', how='inner')

# Define the age groups
# With right=False the intervals are [0, 20), [20, 40), [40, 60) and [60, 80), so the labels '21-30' and '31-60'
# actually cover ages 20-39 and 40-59, and ages recorded as 80 fall outside the last bin
bins = [0, 20, 40, 60, 80]
labels = ['0-20', '21-30', '31-60', '60-80']
question3['Grupo de Edad'] = pd.cut(question3['Edad en años al momento del examen'], bins=bins, labels=labels, right=False)

# Iterate over each age group
for grupo in labels:
    print(f"\n--- Resultados para el grupo de edad {grupo} ---")
    
    grupo_df = question3[question3['Grupo de Edad'] == grupo]
    
    # Define predictors and targets (only age and gender are used; no physical activity variable is included)
    X = grupo_df[['Edad en años al momento del examen', 'Género']]
    y = grupo_df[['Circunferencia de la cintura (cm)', 'Índice de masa corporal (kg/m²)', 'Peso (kg)']]
    
    # Impute missing values in X with the column mean
    imputer = SimpleImputer(strategy='mean')
    X_imputed = imputer.fit_transform(X)
    
    # Map each target column to the name used in its exported TPOT pipeline file
    targets = {
        'Circunferencia de la cintura (cm)': 'Circunferencia_Cintura',
        'Índice de masa corporal (kg/m²)': 'IMC',
        'Peso (kg)': 'Peso'
    }
    
    for target, file_name in targets.items():
        y_target = y[target].dropna()
        X_filtered = X_imputed[~y[target].isna()]
        X_train, X_test, y_train, y_test = train_test_split(X_filtered, y_target, train_size=0.75, test_size=0.25, random_state=42)
        
        tpot_regressor = TPOTRegressor(verbosity=2, generations=10, population_size=20, random_state=42)
        tpot_regressor.fit(X_train, y_train)
        
        # Realizar predicciones
        y_pred = tpot_regressor.predict(X_test)
        
        # Calcular métricas
        mae = mean_absolute_error(y_test, y_pred)
        mse = mean_squared_error(y_test, y_pred)
        rmse = np.sqrt(mse)
        r2 = r2_score(y_test, y_pred)
        
        # Imprimir métricas para cada target
        print(f"Métricas de regresión para {target}:")
        print(f"MAE: {mae}")
        print(f"MSE: {mse}")
        print(f"RMSE: {rmse}")
        print(f"R²: {r2}")
        
        # Exportar el pipeline con nombre de archivo sin caracteres especiales
        tpot_regressor.export(f'best_pipeline_{file_name}_{grupo}_regression.py')
        print(f"Pipeline exportado para {target}\n")




--- Resultados para el grupo de edad 0-20 ---


Optimization Progress:   0%|          | 0/220 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: -153.87566513615621

Generation 2 - Current best internal CV score: -153.87566513615621

Generation 3 - Current best internal CV score: -153.87566513615621

Generation 4 - Current best internal CV score: -153.56517992614067

Generation 5 - Current best internal CV score: -153.56517992614067

Generation 6 - Current best internal CV score: -153.56517992614067

Generation 7 - Current best internal CV score: -153.5410550982358

Generation 8 - Current best internal CV score: -153.5410550982358

Generation 9 - Current best internal CV score: -153.5410550982358

Generation 10 - Current best internal CV score: -153.5410550982358

Best pipeline: XGBRegressor(input_matrix, learning_rate=0.1, max_depth=1, min_child_weight=7, n_estimators=100, n_jobs=1, objective=reg:squarederror, subsample=0.5, verbosity=0)
Métricas de regresión para Circunferencia de la cintura (cm):
MAE: 9.30681706926464
MSE: 163.2020901187588
RMSE: 12.775057343071255
R²: 0.520377

Optimization Progress:   0%|          | 0/220 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: -25.436069325301624

Generation 2 - Current best internal CV score: -25.432992649014356

Generation 3 - Current best internal CV score: -25.432992649014356

Generation 4 - Current best internal CV score: -25.390752634103

Generation 5 - Current best internal CV score: -25.390752634103

Generation 6 - Current best internal CV score: -25.390752634103

Generation 7 - Current best internal CV score: -25.390752634103

Generation 8 - Current best internal CV score: -25.38078158147896

Generation 9 - Current best internal CV score: -25.337123030751375

Generation 10 - Current best internal CV score: -25.337123030751375

Best pipeline: KNeighborsRegressor(MaxAbsScaler(input_matrix), n_neighbors=91, p=2, weights=uniform)
Métricas de regresión para Índice de masa corporal (kg/m²):
MAE: 3.6282276176304777
MSE: 25.965167641349495
RMSE: 5.095602775074751
R²: 0.3294917222919175
Pipeline exportado para Índice de masa corporal (kg/m²)



Optimization Progress:   0%|          | 0/220 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: -169.29240015105006

Generation 2 - Current best internal CV score: -169.29240015105006

Generation 3 - Current best internal CV score: -169.29240015105006

Generation 4 - Current best internal CV score: -169.29240015105006

Generation 5 - Current best internal CV score: -169.29240015105006

Generation 6 - Current best internal CV score: -169.29240015105006

Generation 7 - Current best internal CV score: -169.03686457620398

Generation 8 - Current best internal CV score: -168.9028785657208

Generation 9 - Current best internal CV score: -168.9028785657208

Generation 10 - Current best internal CV score: -168.8247431680491

Best pipeline: KNeighborsRegressor(input_matrix, n_neighbors=95, p=2, weights=uniform)
Métricas de regresión para Peso (kg):
MAE: 8.230064166759393
MSE: 178.35309987018283
RMSE: 13.354890485143741
R²: 0.7651774234259762
Pipeline exportado para Peso (kg)


--- Resultados para el grupo de edad 21-30 ---


Optimization Progress:   0%|          | 0/220 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: -327.04433340380916

Generation 2 - Current best internal CV score: -327.01805151406404

Generation 3 - Current best internal CV score: -327.01805151406404

Generation 4 - Current best internal CV score: -327.01559711288536

Generation 5 - Current best internal CV score: -327.01559711288536

Generation 6 - Current best internal CV score: -327.01559711288536

Generation 7 - Current best internal CV score: -327.01559711288536

Generation 8 - Current best internal CV score: -327.01468648387464

Generation 9 - Current best internal CV score: -327.01468648387464

Generation 10 - Current best internal CV score: -327.0146864838746

Best pipeline: ElasticNetCV(RidgeCV(RidgeCV(ZeroCount(input_matrix))), l1_ratio=0.9500000000000001, tol=0.1)
Métricas de regresión para Circunferencia de la cintura (cm):
MAE: 14.757910209244596
MSE: 338.58837713263273
RMSE: 18.40077110157704
R²: 0.03187879512231562
Pipeline exportado para Circunferencia de la cintura (cm)

Optimization Progress:   0%|          | 0/220 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: -67.30144012992093

Generation 2 - Current best internal CV score: -67.30144012992093

Generation 3 - Current best internal CV score: -67.30005514768736

Generation 4 - Current best internal CV score: -67.30005514768736

Generation 5 - Current best internal CV score: -67.29581026883857

Generation 6 - Current best internal CV score: -67.29581026883857

Generation 7 - Current best internal CV score: -67.29581026883857

Generation 8 - Current best internal CV score: -67.29581026883857

Generation 9 - Current best internal CV score: -67.29581026883857

Generation 10 - Current best internal CV score: -67.29581026883857

Best pipeline: ElasticNetCV(input_matrix, l1_ratio=0.05, tol=0.001)
Métricas de regresión para Índice de masa corporal (kg/m²):
MAE: 6.293224979503638
MSE: 63.31939691895922
RMSE: 7.957348611124137
R²: 0.0063004355405658075
Pipeline exportado para Índice de masa corporal (kg/m²)



Optimization Progress:   0%|          | 0/220 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: -576.8422138146894

Generation 2 - Current best internal CV score: -576.8422138146894

Generation 3 - Current best internal CV score: -576.8422138146894

Generation 4 - Current best internal CV score: -576.6163630897354

Generation 5 - Current best internal CV score: -576.6163630897354

Generation 6 - Current best internal CV score: -576.6163630897354

Generation 7 - Current best internal CV score: -576.6163630897354

Generation 8 - Current best internal CV score: -576.6163630897354

Generation 9 - Current best internal CV score: -576.6163630897354

Generation 10 - Current best internal CV score: -576.6163630897354

Best pipeline: XGBRegressor(input_matrix, learning_rate=0.1, max_depth=1, min_child_weight=10, n_estimators=100, n_jobs=1, objective=reg:squarederror, subsample=0.1, verbosity=0)
Métricas de regresión para Peso (kg):
MAE: 19.555549761592694
MSE: 631.2694753594782
RMSE: 25.125076623952378
R²: 0.029405863469775917
Pipeline exportado para Peso (kg)


--- Resultados para el grupo de edad 31-60 ---

Optimization Progress:   0%|          | 0/220 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: -301.5728322107084

Generation 2 - Current best internal CV score: -301.3421667861614

Generation 3 - Current best internal CV score: -301.3421667861614

Generation 4 - Current best internal CV score: -301.3421667861614

Generation 5 - Current best internal CV score: -301.3421667861614

Generation 6 - Current best internal CV score: -301.3421667861614

Generation 7 - Current best internal CV score: -301.3421667861614

Generation 8 - Current best internal CV score: -301.3421667861614

Generation 9 - Current best internal CV score: -301.3421667861614

Generation 10 - Current best internal CV score: -301.3421667861614

Best pipeline: DecisionTreeRegressor(input_matrix, max_depth=1, min_samples_leaf=3, min_samples_split=4)
Métricas de regresión para Circunferencia de la cintura (cm):
MAE: 13.094531404344506
MSE: 282.75933630890586
RMSE: 16.81544933413633
R²: 0.007703257756155546
Pipeline exportado para Circunferencia de la cintura (cm)



Optimization Progress:   0%|          | 0/220 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: -59.0898886566633

Generation 2 - Current best internal CV score: -59.0898886566633

Generation 3 - Current best internal CV score: -59.0898886566633

Generation 4 - Current best internal CV score: -59.082909282011386

Generation 5 - Current best internal CV score: -59.08255603940502

Generation 6 - Current best internal CV score: -59.08255603940502

Generation 7 - Current best internal CV score: -59.08255603940502

Generation 8 - Current best internal CV score: -59.08255603940502

Generation 9 - Current best internal CV score: -59.08255603940502

Generation 10 - Current best internal CV score: -59.08255603940502

Best pipeline: ElasticNetCV(RidgeCV(input_matrix), l1_ratio=0.8500000000000001, tol=1e-05)
Métricas de regresión para Índice de masa corporal (kg/m²):
MAE: 5.895876995597573
MSE: 62.25663621307986
RMSE: 7.890287460738034
R²: 0.008374376600849942
Pipeline exportado para Índice de masa corporal (kg/m²)



Optimization Progress:   0%|          | 0/220 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: -521.4672503948844

Generation 2 - Current best internal CV score: -521.4672503948844

Generation 3 - Current best internal CV score: -521.4672503948844

Generation 4 - Current best internal CV score: -521.4672503948844

Generation 5 - Current best internal CV score: -521.4672503948844

Generation 6 - Current best internal CV score: -521.4672503948844

Generation 7 - Current best internal CV score: -521.4672503948844

Generation 8 - Current best internal CV score: -521.4672503948844

Generation 9 - Current best internal CV score: -521.4672503948844

Generation 10 - Current best internal CV score: -521.4672503948844

Best pipeline: RidgeCV(MaxAbsScaler(input_matrix))
Métricas de regresión para Peso (kg):
MAE: 19.022464171885275
MSE: 609.4717999886715
RMSE: 24.68748265799232
R²: 0.0541095480995879
Pipeline exportado para Peso (kg)


--- Resultados para el grupo de edad 60-80 ---


Optimization Progress:   0%|          | 0/220 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: -229.73731874608347

Generation 2 - Current best internal CV score: -229.73731874608347

Generation 3 - Current best internal CV score: -229.73731874608347

Generation 4 - Current best internal CV score: -229.55004824095258

Generation 5 - Current best internal CV score: -229.55004824095258

Generation 6 - Current best internal CV score: -229.53492904022605

Generation 7 - Current best internal CV score: -229.53492904022605

Generation 8 - Current best internal CV score: -229.53492904022605

Generation 9 - Current best internal CV score: -229.29971248707017

Generation 10 - Current best internal CV score: -229.29971248707017

Best pipeline: RidgeCV(PolynomialFeatures(input_matrix, degree=2, include_bias=False, interaction_only=False))
Métricas de regresión para Circunferencia de la cintura (cm):
MAE: 12.234976352727253
MSE: 234.0997789733107
RMSE: 15.30031957095376
R²: 0.024464971941223324
Pipeline exportado para Circunferencia de la cintura (cm)

Optimization Progress:   0%|          | 0/220 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: -46.13554806137646

Generation 2 - Current best internal CV score: -46.135530214349004

Generation 3 - Current best internal CV score: -46.13425258990351

Generation 4 - Current best internal CV score: -46.13425258990351

Generation 5 - Current best internal CV score: -46.129267451341946

Generation 6 - Current best internal CV score: -46.129267451341946

Generation 7 - Current best internal CV score: -46.129267451341946

Generation 8 - Current best internal CV score: -46.129267451341946

Generation 9 - Current best internal CV score: -46.129267451341946

Generation 10 - Current best internal CV score: -46.129267451341946

Best pipeline: ElasticNetCV(RidgeCV(ElasticNetCV(StandardScaler(input_matrix), l1_ratio=0.8500000000000001, tol=0.01)), l1_ratio=0.55, tol=0.01)
Métricas de regresión para Índice de masa corporal (kg/m²):
MAE: 5.273354011789653
MSE: 50.11859558344914
RMSE: 7.079448819184241
R²: 0.006525089632310688
Pipeline exportado para Índice de masa corporal (kg/m²)

Optimization Progress:   0%|          | 0/220 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: -409.9734762021984

Generation 2 - Current best internal CV score: -409.9734762021984

Generation 3 - Current best internal CV score: -409.9734762021984

Generation 4 - Current best internal CV score: -409.9734760800766

Generation 5 - Current best internal CV score: -409.9665997754758

Generation 6 - Current best internal CV score: -409.96488454986854

Generation 7 - Current best internal CV score: -409.9625051274344

Generation 8 - Current best internal CV score: -409.9625051274344

Generation 9 - Current best internal CV score: -409.9624805926542

Generation 10 - Current best internal CV score: -409.9624805926542

Best pipeline: ElasticNetCV(input_matrix, l1_ratio=0.65, tol=0.1)
Métricas de regresión para Peso (kg):
MAE: 15.309914583964607
MSE: 418.82598715041723
RMSE: 20.465238507049392
R²: 0.07291783458239665
Pipeline exportado para Peso (kg)



#### Age Group: 0-20
**Model for Waist Circumference (cm):**
- **Model**: XGBRegressor with direct input matrix
  - Uses a low max_depth of 1 and subsampling of 0.5 for generalization, focusing on minimal overfitting.
- **Performance**:
  - **MAE**: 9.31
  - **MSE**: 163.20
  - **RMSE**: 12.78
  - **R²**: 0.52
- **Pipeline Exported**: best_pipeline_Circunferencia_Cintura_0-20_regression.py

**Model for Body Mass Index (kg/m²):**
- **Model**: KNeighborsRegressor with MaxAbsScaler pre-processing
  - Averages over 91 neighbors with uniform weights, which smooths the BMI predictions rather than following fine local detail.
- **Performance**:
  - **MAE**: 3.63
  - **MSE**: 25.97
  - **RMSE**: 5.10
  - **R²**: 0.33
- **Pipeline Exported**: best_pipeline_IMC_0-20_regression.py

**Model for Weight (kg):**
- **Model**: KNeighborsRegressor with direct input matrix
  - Employs 95 neighbors to increase robustness in weight prediction.
  - This is the best model so far, showing the highest R² value among all groups.
- **Performance**:
  - **MAE**: 8.23
  - **MSE**: 178.35
  - **RMSE**: 13.35
  - **R²**: 0.77
- **Pipeline Exported**: best_pipeline_Peso_0-20_regression.py

#### Age Group: 21-30
**Model for Waist Circumference (cm):**
- **Model**: ElasticNetCV stacked on two RidgeCV layers, with ZeroCount pre-processing
  - Uses a high l1_ratio (0.95), so the penalty is close to pure L1 (lasso-like) and favors sparse coefficients.
- **Performance**:
  - **MAE**: 14.76
  - **MSE**: 338.59
  - **RMSE**: 18.40
  - **R²**: 0.03
- **Pipeline Exported**: best_pipeline_Circunferencia_Cintura_21-30_regression.py

**Model for Body Mass Index (kg/m²):**
- **Model**: ElasticNetCV with direct input matrix
  - Uses a low l1_ratio (0.05), so the penalty is close to pure L2 (ridge-like).
- **Performance**:
  - **MAE**: 6.29
  - **MSE**: 63.32
  - **RMSE**: 7.96
  - **R²**: 0.01
- **Pipeline Exported**: best_pipeline_IMC_21-30_regression.py

**Model for Weight (kg):**
- **Model**: XGBRegressor with direct input matrix
  - Uses a low max_depth of 1 and a very low subsample of 0.1 for aggressive generalization.
- **Performance**:
  - **MAE**: 19.56
  - **MSE**: 631.27
  - **RMSE**: 25.13
  - **R²**: 0.03
- **Pipeline Exported**: best_pipeline_Peso_21-30_regression.py

#### Age Group: 31-60
**Model for Waist Circumference (cm):**
- **Model**: DecisionTreeRegressor with direct input matrix
  - A single split (max_depth=1, a decision stump), giving a very simple, high-bias model.
- **Performance**:
  - **MAE**: 13.09
  - **MSE**: 282.76
  - **RMSE**: 16.82
  - **R²**: 0.01
- **Pipeline Exported**: best_pipeline_Circunferencia_Cintura_31-60_regression.py

**Model for Body Mass Index (kg/m²):**
- **Model**: ElasticNetCV with RidgeCV pre-processing
  - Uses a high l1_ratio (0.85), which weights the penalty toward L1 (lasso-like); l1_ratio sets the L1/L2 mix, not the regularization strength.
- **Performance**:
  - **MAE**: 5.90
  - **MSE**: 62.26
  - **RMSE**: 7.89
  - **R²**: 0.01
- **Pipeline Exported**: best_pipeline_IMC_31-60_regression.py

**Model for Weight (kg):**
- **Model**: RidgeCV with MaxAbsScaler pre-processing
  - Applies L2-regularized linear regression after MaxAbsScaler scaling.
- **Performance**:
  - **MAE**: 19.02
  - **MSE**: 609.47
  - **RMSE**: 24.69
  - **R²**: 0.05
- **Pipeline Exported**: best_pipeline_Peso_31-60_regression.py

#### Age Group: 60-80
**Model for Waist Circumference (cm):**
- **Model**: RidgeCV with PolynomialFeatures pre-processing
  - Leverages polynomial expansion to capture non-linear relationships.
- **Performance**:
  - **MAE**: 12.23
  - **MSE**: 234.10
  - **RMSE**: 15.30
  - **R²**: 0.02
- **Pipeline Exported**: best_pipeline_Circunferencia_Cintura_60-80_regression.py

**Model for Body Mass Index (kg/m²):**
- **Model**: ElasticNetCV stacked on RidgeCV and a first ElasticNetCV, with StandardScaler pre-processing
  - Chains three regularized linear models; the final ElasticNetCV uses l1_ratio=0.55, a near-even mix of L1 and L2 penalties.
- **Performance**:
  - **MAE**: 5.27
  - **MSE**: 50.12
  - **RMSE**: 7.08
  - **R²**: 0.01
- **Pipeline Exported**: best_pipeline_IMC_60-80_regression.py

**Model for Weight (kg):**
- **Model**: ElasticNetCV with direct input matrix
  - Uses l1_ratio=0.65, a penalty weighted moderately toward L1.
- **Performance**:
  - **MAE**: 15.31
  - **MSE**: 418.83
  - **RMSE**: 20.47
  - **R²**: 0.07
- **Pipeline Exported**: best_pipeline_Peso_60-80_regression.py


Performance varied sharply across age groups. The highest R² (0.77) was obtained for weight prediction in the 0-20 group with KNeighborsRegressor, the best result so far; the same group also reached R² values of 0.52 for waist circumference and 0.33 for BMI. In the three adult groups, R² stayed between 0.01 and 0.07, so age and gender alone explain very little of the variation in these measures among adults. More informative predictors or further feature engineering may improve these results.
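The age groups above are built with `pd.cut` and `right=False`, so each interval includes its left edge and excludes its right edge. A minimal sketch on a few hypothetical ages (not NHANES records) shows which label each age receives under the same bins and labels. An age of exactly 80 falls outside the last interval, which is relevant because NHANES records ages of 80 and over as 80.

```python
import pandas as pd

# Hypothetical ages chosen to sit on the bin edges (illustrative only)
ages = pd.Series([0, 19, 20, 39, 40, 59, 60, 79, 80], name="age")

# Same bins and labels as in the analysis cell
bins = [0, 20, 40, 60, 80]
labels = ['0-20', '21-30', '31-60', '60-80']

# right=False makes the intervals [0, 20), [20, 40), [40, 60), [60, 80)
groups = pd.cut(ages, bins=bins, labels=labels, right=False)

mapping = pd.DataFrame({"age": ages, "group": groups})
print(mapping)
```

Under these bins the group labelled `21-30` covers ages 20 to 39, the group labelled `31-60` covers ages 40 to 59, and age 80 is assigned no group.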




Since weight prediction has given the best results so far, we will try to improve them further with additional models and grid search.

In [95]:
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Models to try
models = {
    'DecisionTree': DecisionTreeRegressor(random_state=42),
    'RandomForest': RandomForestRegressor(random_state=42),
    'GradientBoosting': GradientBoostingRegressor(random_state=42)
}

# Hyperparameter grids
param_grids = {
    'DecisionTree': {
        'max_depth': [3, 5, 7, 9],
        'min_samples_split': [10, 14, 20],
        'min_samples_leaf': [5, 10, 14, 20]
    },
    'RandomForest': {
        'n_estimators': [50, 100, 200],
        'max_depth': [4, 6, 10],
        'min_samples_split': [2, 5, 10, 15],
        'min_samples_leaf': [1, 2, 4]
    },
    'GradientBoosting': {
        'n_estimators': [50, 100, 200],
        'learning_rate': [0.01, 0.05, 0.1],
        'max_depth': [2, 4, 5, 6],
        'min_samples_split': [2, 5],
        'min_samples_leaf': [1, 2, 4]
    }
}

# Iterate over each age group
for grupo in labels:
    print(f"\n--- Resultados para el grupo de edad {grupo} ---")

    grupo_df = question3[question3['Grupo de Edad'] == grupo]

    # Define predictors and target (the target here is weight)
    X = grupo_df[['Edad en años al momento del examen', 'Género']]
    y = grupo_df['Peso (kg)']

    # Impute missing values in X with the column mean
    imputer = SimpleImputer(strategy='mean')
    X_imputed = imputer.fit_transform(X)

    # Keep only rows with a non-missing target
    mask = ~y.isna()
    X_filtered = X_imputed[mask]
    y_filtered = y[mask]

    # Train/test split
    X_train, X_test, y_train, y_test = train_test_split(X_filtered, y_filtered, train_size=0.75, test_size=0.25, random_state=42)

    # Try each model
    for model_name, model in models.items():
        print(f"\nEntrenando modelo: {model_name} para grupo de edad {grupo}...")

        # Configure GridSearchCV with 5-fold cross-validation
        grid_search = GridSearchCV(
            estimator=model,
            param_grid=param_grids[model_name],
            scoring='neg_mean_squared_error',
            cv=5,
            n_jobs=-1,
            verbose=2
        )

        # Fit the grid search
        grid_search.fit(X_train, y_train)

        # Best estimator found
        best_model = grid_search.best_estimator_

        # Evaluate the best model on the test set
        y_pred = best_model.predict(X_test)
        mae = mean_absolute_error(y_test, y_pred)
        mse = mean_squared_error(y_test, y_pred)
        rmse = np.sqrt(mse)
        r2 = r2_score(y_test, y_pred)

        # Print results. The label says "Circunferencia de la cintura",
        # but the fitted target is 'Peso (kg)'.
        print(f"\nMejor modelo para {grupo} - {model_name} - Circunferencia de la cintura:")
        print(f"Mejores hiperparámetros: {grid_search.best_params_}")
        print(f"MAE: {mae}")
        print(f"MSE: {mse}")
        print(f"RMSE: {rmse}")
        print(f"R²: {r2}")



--- Resultados para el grupo de edad 0-20 ---

Entrenando modelo: DecisionTree para grupo de edad 0-20...
Fitting 5 folds for each of 48 candidates, totalling 240 fits

Mejor modelo para 0-20 - DecisionTree - Circunferencia de la cintura:
Mejores hiperparámetros: {'max_depth': 5, 'min_samples_leaf': 5, 'min_samples_split': 10}
MAE: 8.233338898174964
MSE: 178.1294545647934
RMSE: 13.346514697283085
R²: 0.7654718784530459

Entrenando modelo: RandomForest para grupo de edad 0-20...
Fitting 5 folds for each of 108 candidates, totalling 540 fits

Mejor modelo para 0-20 - RandomForest - Circunferencia de la cintura:
Mejores hiperparámetros: {'max_depth': 4, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 50}
MAE: 8.297717799104433
MSE: 178.8898121702944
RMSE: 13.374969613808265
R²: 0.7644707793290533

Entrenando modelo: GradientBoosting para grupo de edad 0-20...
Fitting 5 folds for each of 216 candidates, totalling 1080 fits

Mejor modelo para 0-20 - GradientBoosting - Circun


Mejor modelo para 31-60 - GradientBoosting - Circunferencia de la cintura:
Mejores hiperparámetros: {'learning_rate': 0.01, 'max_depth': 2, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 200}
MAE: 19.080817560212914
MSE: 612.4255279961254
RMSE: 24.747232734108383
R²: 0.04952540963770424

--- Resultados para el grupo de edad 60-80 ---

Entrenando modelo: DecisionTree para grupo de edad 60-80...
Fitting 5 folds for each of 48 candidates, totalling 240 fits

Mejor modelo para 60-80 - DecisionTree - Circunferencia de la cintura:
Mejores hiperparámetros: {'max_depth': 3, 'min_samples_leaf': 5, 'min_samples_split': 10}
MAE: 15.313635440684667
MSE: 418.9057222339232
RMSE: 20.46718647576953
R²: 0.07274133891081858

Entrenando modelo: RandomForest para grupo de edad 60-80...
Fitting 5 folds for each of 108 candidates, totalling 540 fits

Mejor modelo para 60-80 - RandomForest - Circunferencia de la cintura:
Mejores hiperparámetros: {'max_depth': 4, 'min_samples_leaf': 1, 'min_s


Mejor modelo para 60-80 - GradientBoosting - Circunferencia de la cintura:
Mejores hiperparámetros: {'learning_rate': 0.05, 'max_depth': 2, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 50}
MAE: 15.342557920199816
MSE: 418.812091902471
RMSE: 20.464899020089764
R²: 0.0729485920734555


#### Age Group: 0-20
**Model for Weight (kg) - Best Model: DecisionTree**
- **Target**: `Peso (kg)`; the printed label "Circunferencia de la cintura" does not match the target fitted in the grid-search cell.
- **Model**: DecisionTreeRegressor
  - **Best Hyperparameters**: max_depth=5, min_samples_leaf=5, min_samples_split=10
  - Achieved the highest R² among the three models for this age group, by a narrow margin (0.7655 vs. 0.7645 for RandomForest).
- **Performance**:
  - **MAE**: 8.23
  - **MSE**: 178.13
  - **RMSE**: 13.35
  - **R²**: 0.77

**Alternative Models for Weight (kg):**
- **RandomForest**
  - **Best Hyperparameters**: max_depth=4, min_samples_leaf=1, min_samples_split=2, n_estimators=50
  - **Performance**:
    - **MAE**: 8.30
    - **MSE**: 178.89
    - **RMSE**: 13.37
    - **R²**: 0.76
- **GradientBoosting**
  - **Best Hyperparameters**: learning_rate=0.05, max_depth=2, min_samples_leaf=1, min_samples_split=2, n_estimators=100
  - **Performance**:
    - **MAE**: 8.25
    - **MSE**: 179.57
    - **RMSE**: 13.40
    - **R²**: 0.76

#### Age Group: 21-30
**Model for Weight (kg) - Best Model: GradientBoosting**
- **Model**: GradientBoostingRegressor
  - **Best Hyperparameters**: learning_rate=0.01, max_depth=2, min_samples_leaf=1, min_samples_split=2, n_estimators=200
  - Achieved the best R² among models for this age group.
- **Performance**:
  - **MAE**: 19.51
  - **MSE**: 626.47
  - **RMSE**: 25.03
  - **R²**: 0.04

**Alternative Models for Weight (kg):**
- **DecisionTree**
  - **Best Hyperparameters**: max_depth=3, min_samples_leaf=5, min_samples_split=10
  - **Performance**:
    - **MAE**: 19.64
    - **MSE**: 631.16
    - **RMSE**: 25.12
    - **R²**: 0.03
- **RandomForest**
  - **Best Hyperparameters**: max_depth=4, min_samples_leaf=1, min_samples_split=2, n_estimators=200
  - **Performance**:
    - **MAE**: 19.59
    - **MSE**: 628.46
    - **RMSE**: 25.07
    - **R²**: 0.03

#### Age Group: 31-60
**Model for Weight (kg) - Best Model: DecisionTree**
- **Model**: DecisionTreeRegressor
  - **Best Hyperparameters**: max_depth=3, min_samples_leaf=5, min_samples_split=10
  - Performed slightly better than the other models for this age group.
- **Performance**:
  - **MAE**: 19.01
  - **MSE**: 611.31
  - **RMSE**: 24.72
  - **R²**: 0.05

**Alternative Models for Weight (kg):**
- **RandomForest**
  - **Best Hyperparameters**: max_depth=4, min_samples_leaf=1, min_samples_split=2, n_estimators=100
  - **Performance**:
    - **MAE**: 19.04
    - **MSE**: 613.05
    - **RMSE**: 24.76
    - **R²**: 0.05
- **GradientBoosting**
  - **Best Hyperparameters**: learning_rate=0.01, max_depth=2, min_samples_leaf=1, min_samples_split=2, n_estimators=200
  - **Performance**:
    - **MAE**: 19.08
    - **MSE**: 612.43
    - **RMSE**: 24.75
    - **R²**: 0.05

#### Age Group: 60-80
**Model for Weight (kg) - Best Model: GradientBoosting**
- **Model**: GradientBoostingRegressor
  - **Best Hyperparameters**: learning_rate=0.05, max_depth=2, min_samples_leaf=1, min_samples_split=2, n_estimators=50
  - Achieved the highest R² among models for this age group, by a very narrow margin (0.0729 vs. 0.0727 for DecisionTree).
- **Performance**:
  - **MAE**: 15.34
  - **MSE**: 418.81
  - **RMSE**: 20.46
  - **R²**: 0.07

**Alternative Models for Weight (kg):**
- **DecisionTree**
  - **Best Hyperparameters**: max_depth=3, min_samples_leaf=5, min_samples_split=10
  - **Performance**:
    - **MAE**: 15.31
    - **MSE**: 418.91
    - **RMSE**: 20.47
    - **R²**: 0.07
- **RandomForest**
  - **Best Hyperparameters**: max_depth=4, min_samples_leaf=1, min_samples_split=15, n_estimators=100
  - **Performance**:
    - **MAE**: 15.34
    - **MSE**: 419.88
    - **RMSE**: 20.49
    - **R²**: 0.07

Performance differed sharply between the youngest group and the adult groups. These grid-search models predict weight (kg): the cell sets `y = grupo_df['Peso (kg)']`, and the "Circunferencia de la cintura" text in its printed output is a stale label. The best result was the DecisionTree model in the 0-20 group (R² = 0.77); in the adult groups R² stayed between 0.03 and 0.07. Within each group the three model families were nearly indistinguishable, with R² differing by at most 0.01: DecisionTree ranked first in the 0-20 and 31-60 groups, and GradientBoosting in the 21-30 and 60-80 groups.
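Because the R² values round to the same figure within most groups, the ranking is easier to read from the MSE values. A minimal sketch follows, assuming the MSE figures reported in the per-group summaries are collected by hand into a DataFrame (the `results` name and its column names are illustrative). It picks the lowest-MSE model per age group and measures how far apart the models are.

```python
import pandas as pd

# MSE values copied from the per-group summaries above
results = pd.DataFrame(
    [
        ("0-20", "DecisionTree", 178.13),
        ("0-20", "RandomForest", 178.89),
        ("0-20", "GradientBoosting", 179.57),
        ("21-30", "GradientBoosting", 626.47),
        ("21-30", "DecisionTree", 631.16),
        ("21-30", "RandomForest", 628.46),
        ("31-60", "DecisionTree", 611.31),
        ("31-60", "RandomForest", 613.05),
        ("31-60", "GradientBoosting", 612.43),
        ("60-80", "GradientBoosting", 418.81),
        ("60-80", "DecisionTree", 418.91),
        ("60-80", "RandomForest", 419.88),
    ],
    columns=["age_group", "model", "mse"],
)

# Row index of the lowest MSE within each age group
best_idx = results.groupby("age_group")["mse"].idxmin()
best = results.loc[best_idx].set_index("age_group")

# Relative gap between the best and worst model in each group
spread = results.groupby("age_group")["mse"].agg(["min", "max"])
spread["relative_gap"] = (spread["max"] - spread["min"]) / spread["min"]

print(best)
print(spread["relative_gap"].round(4))
```

With these figures the gap between the best and worst model stays below 1% of the MSE in every group, so the choice of model family matters far less than the choice of predictors.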


In [100]:
from sklearn.ensemble import StackingRegressor, HistGradientBoostingRegressor
from sklearn.linear_model import ElasticNetCV, RidgeCV
from sklearn.neighbors import KNeighborsRegressor
from xgboost import XGBRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

# X_train, y_train, X_test, y_test are reused from the previous cell: they hold the split
# from the last loop iteration (age group 60-80, target 'Peso (kg)')

# 1. Tune XGBRegressor on its own
xgb = XGBRegressor(random_state=42)
xgb_param_grid = {
    'learning_rate': [0.01, 0.05, 0.1],
    'max_depth': [1, 2, 3],
    'n_estimators': [50, 100, 200],
    'subsample': [0.5, 0.7, 0.9],
    'colsample_bytree': [0.5, 0.7, 1.0],
    'reg_alpha': [0, 1, 5],
    'reg_lambda': [1, 5, 10]
}
grid_xgb = GridSearchCV(xgb, xgb_param_grid, cv=5, scoring='neg_mean_squared_error', n_jobs=-1, verbose=1)
grid_xgb.fit(X_train, y_train)
best_xgb = grid_xgb.best_estimator_

# 2. Tune KNeighborsRegressor on its own
knn = KNeighborsRegressor()
knn_param_grid = {
    'n_neighbors': [10, 20, 30, 40],
    'weights': ['uniform', 'distance'],
    'p': [1, 2]
}
grid_knn = GridSearchCV(knn, knn_param_grid, cv=5, scoring='neg_mean_squared_error', n_jobs=-1, verbose=1)
grid_knn.fit(X_train, y_train)
best_knn = grid_knn.best_estimator_

# 3. Tune ElasticNetCV on its own
elasticnet = ElasticNetCV(cv=5)
elasticnet_param_grid = {
    'l1_ratio': [0.1, 0.5, 0.9],
    'max_iter': [1000],
    'tol': [0.0001, 0.001, 0.01]
}
grid_elasticnet = GridSearchCV(elasticnet, elasticnet_param_grid, cv=5, scoring='neg_mean_squared_error', n_jobs=-1, verbose=1)
grid_elasticnet.fit(X_train, y_train)
best_elasticnet = grid_elasticnet.best_estimator_

# 4. Build the StackingRegressor from the tuned models
stacking_model = StackingRegressor(
    estimators=[
        ('xgb', best_xgb),
        ('elasticnet', best_elasticnet),
        ('knn', best_knn)
    ],
    final_estimator=RidgeCV(),
    cv=5
)

# Fit the StackingRegressor
stacking_model.fit(X_train, y_train)

# 5. Evaluate the model on the test set
y_pred = stacking_model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print(f"\nResultados del modelo de stacking:")
print(f"MAE: {mae}")
print(f"MSE: {mse}")
print(f"RMSE: {rmse}")
print(f"R²: {r2}")


Fitting 5 folds for each of 2187 candidates, totalling 10935 fits
Fitting 5 folds for each of 16 candidates, totalling 80 fits
Fitting 5 folds for each of 9 candidates, totalling 45 fits

Resultados del modelo de stacking:
MAE: 15.274331479498121
MSE: 416.759315960418
RMSE: 20.414683831997447
R²: 0.07749246476488958


#### Stacking Model Results
**Model for Waist Circumference (cm) - Stacking Regressor**
- **Model**: StackingRegressor combining XGBRegressor, ElasticNetCV, and KNeighborsRegressor
  - **Final Estimator**: RidgeCV
  - This model combines multiple individual models to leverage the strengths of each. The XGBRegressor captures non-linear relationships, ElasticNetCV provides regularization to reduce overfitting, and KNeighborsRegressor accounts for local patterns.
- **Performance**:
  - **MAE**: 15.27
  - **MSE**: 416.76
  - **RMSE**: 20.41
  - **R²**: 0.08
- **Pipeline Exported**: best_pipeline_stacking_Circunferencia.py
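
As a quick consistency check on the figures above: RMSE is the square root of MSE, and because `r2_score` computes R² = 1 − MSE / Var(y_test), the two reported numbers also imply the error of a predictor that always returns the test-set mean. The sketch below uses only the printed metrics; the variable names are illustrative.

```python
import math

# Metrics printed by the stacking run above
mse = 416.759315960418
r2 = 0.07749246476488958

# RMSE is the square root of MSE
rmse = math.sqrt(mse)

# R² = 1 - MSE / Var(y_test), so the variance of the test target
# can be recovered from the two reported numbers
target_variance = mse / (1 - r2)

# RMSE of a baseline that always predicts the test-set mean
baseline_rmse = math.sqrt(target_variance)

print(f"RMSE: {rmse:.2f}")
print(f"Mean-only baseline RMSE: {baseline_rmse:.2f}")
print(f"Error reduction vs. baseline: {1 - rmse / baseline_rmse:.1%}")
```

Read this way, the stacked model's RMSE is only about 4% lower than that of a constant mean prediction, which is another way of stating what an R² of 0.08 means.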

Among the models it was compared against, the StackingRegressor performed best, with an R² of 0.08, a slight improvement over the individual models. In absolute terms the fit is weak: the model explains roughly 8% of the variance in waist circumference. Combining XGBRegressor, ElasticNetCV, and KNeighborsRegressor under a RidgeCV final estimator can therefore help on this dataset, but the gain from ensembling is small.
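
One way to check a claim like this is to score the stacked model and each base model with the same folds and the same metric. The following is a minimal sketch under stated assumptions: it runs on synthetic data from `make_regression` rather than the NHANES features, and it uses scikit-learn's `GradientBoostingRegressor` as a stand-in for `XGBRegressor` so that it has no dependency beyond scikit-learn. The hyperparameters are illustrative, not the tuned values from the grid searches.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import ElasticNetCV, RidgeCV
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the real features and target (assumption)
X, y = make_regression(n_samples=300, n_features=8, noise=25.0, random_state=0)

base_models = {
    # Stand-in for XGBRegressor (assumption, to avoid the xgboost dependency)
    "gbr": GradientBoostingRegressor(n_estimators=50, random_state=0),
    "elasticnet": ElasticNetCV(cv=3),
    "knn": KNeighborsRegressor(n_neighbors=5),
}

stack = StackingRegressor(
    estimators=list(base_models.items()),
    final_estimator=RidgeCV(),
    cv=3,
)

# Score every candidate with the same number of folds and the same metric
scores = {}
for name, model in {**base_models, "stacking": stack}.items():
    cv_r2 = cross_val_score(model, X, y, cv=3, scoring="r2")
    scores[name] = float(np.mean(cv_r2))

# Print candidates from best to worst mean cross-validated R²
for name, value in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:>10}: mean CV R² = {value:.3f}")
```

On real data, the size of the gap between the stacked model and its best base model indicates whether the extra complexity is worth it.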


### Model Selection for Regression

#### Best Regression Model
**Age Group: 0-20**

- **Model**: `DecisionTreeRegressor`
  - **Reason for Selection**: Among the regression models evaluated across age groups, the `DecisionTreeRegressor` fitted on the 0-20 age group achieved the highest R² (**0.77**) for waist circumference prediction. Its tuned hyperparameters (`max_depth=5`, `min_samples_leaf=5`, `min_samples_split=10`) cap the depth of the tree and the minimum size of its leaves, which helps limit overfitting. This makes it the strongest regression result in the study, although it applies to this age group only.
  - **Performance Metrics**:
    - **MAE**: 8.23
    - **MSE**: 178.13
    - **RMSE**: 13.35
    - **R²**: 0.77
  - **Exported Pipeline**: `best_pipeline_Circunferencia_0-20_regression.py`

#### Best Overall Regression Performance
- The **DecisionTreeRegressor** stands out for its balance of interpretability and predictive accuracy in the youngest age group. Its R² of 0.77 indicates that it captures most of the variance in waist circumference for that group, making it the top choice for regression.
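
A simple way to gauge overfitting for a tree with these settings is to compare R² on the training and test splits. The sketch below uses the hyperparameters reported above, but the data is synthetic (`make_regression`), so the numbers it prints are not the study's results; the split ratio and random seeds are assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the 0-20 age-group data (assumption)
X, y = make_regression(n_samples=500, n_features=6, noise=20.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Hyperparameters as reported for the selected model
tree = DecisionTreeRegressor(
    max_depth=5, min_samples_leaf=5, min_samples_split=10, random_state=42
)
tree.fit(X_train, y_train)

# A large gap between train and test R² would point to overfitting
r2_train = r2_score(y_train, tree.predict(X_train))
r2_test = r2_score(y_test, tree.predict(X_test))
gap = r2_train - r2_test

print(f"Depth: {tree.get_depth()}, leaves: {tree.get_n_leaves()}")
print(f"Train R²: {r2_train:.3f}")
print(f"Test R²:  {r2_test:.3f}")
print(f"Gap:      {gap:.3f}")
```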


### Model Selection for Classification

#### Best Classification Model
**Model**: `XGBClassifier`
- **Reason for Selection**: Among the classification models evaluated for predicting depression risk, the `XGBClassifier` reached the highest overall accuracy, **60%**. It detected the 'nulo' category (class 0), the most prevalent class, with 100% recall, but its recall on the minority classes was near zero. This pattern means the model predicts the majority class almost exclusively, so its accuracy is close to that class's share of the samples. It is still the top choice among the classifiers tested, with that limitation in mind.
- **Performance Metrics**:
  - **Accuracy**: 60%
  - **Macro Avg F1-Score**: 0.13
  - **Weighted Avg F1-Score**: 0.45
  - **Recall for Class 0**: 100%
  - **Recall for Other Classes**: Near zero, indicating challenges with minority class prediction.
- **Description**: The XGBoost model was optimized for log loss, which favors overall accuracy. It identifies the majority class well but needs explicit handling of class imbalance and further tuning to become sensitive to the other categories.
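
The reported metrics are close to what a classifier that always predicts the majority class would produce. A minimal sketch in plain Python, assuming a hypothetical six-class label distribution in which class 0 holds 60% of the samples. The number of classes and the minority shares are illustrative assumptions, not values taken from the data.

```python
def constant_predictor_metrics(class_shares, predicted=0):
    """Accuracy, macro F1 and weighted F1 for a model that always predicts one class."""
    p = class_shares[predicted]

    # Every sample of the predicted class is hit, and nothing else is
    accuracy = p
    precision = p   # share of all predictions that are correct
    recall = 1.0    # the predicted class is never missed
    f1_predicted = 2 * precision * recall / (precision + recall)

    # All other classes are never predicted, so their F1 is 0
    per_class_f1 = [
        f1_predicted if i == predicted else 0.0 for i in range(len(class_shares))
    ]
    macro_f1 = sum(per_class_f1) / len(per_class_f1)
    weighted_f1 = sum(s * f for s, f in zip(class_shares, per_class_f1))
    return accuracy, macro_f1, weighted_f1


# Hypothetical class distribution (assumption): class 0 is the majority
shares = [0.60, 0.15, 0.10, 0.08, 0.04, 0.03]
accuracy, macro_f1, weighted_f1 = constant_predictor_metrics(shares)

print(f"Accuracy:    {accuracy:.3f}")
print(f"Macro F1:    {macro_f1:.3f}")
print(f"Weighted F1: {weighted_f1:.3f}")
```

Under these assumptions the constant predictor reaches 60% accuracy, a weighted F1 of 0.45, and a macro F1 of about 0.125, which is why accuracy alone is a poor yardstick here and macro F1 or per-class recall is more informative.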

#### Best Overall Classification Performance
- The **XGBClassifier** is the most accurate of the classifiers tested, but its accuracy comes almost entirely from recognizing the most common class. Techniques such as oversampling (for example SMOTE) or class weighting are needed to improve performance on the remaining classes. The model is a starting point for predicting depression risk in a more balanced way.


## Phase 5: Evaluation

## Phase 6: Deployment