### **Tecnológico de Monterrey**

#### **Master's in Applied Artificial Intelligence**
#### **Course**: Machine Learning Operations
#### **Instructors**: Dr. Gerardo Rodríguez Hernández | Mtro. Ricardo Valdez Hernández | Mtro. Carlos Alberto Vences Sánchez

##### **Activity**: Project: Progress Report (Phase 1) **Notebook**: Machine Learning Model
##### **Team 25**:
| Name | Student ID |
|------|------------|
| Rafael Becerra García | A01796211 |
| Andrea Xcaret Gómez Alfaro | A01796384 |
| David Hernández Castellanos | A01795964 |
| Juan Pablo López Sánchez | A01313663 |
| Osiris Xcaret Saavedra Solís | A01795992 |

### Objectives:

**Requirements Analysis**
**Task**: Analyze the problem to be solved by following the link to the description of the assigned dataset.

**Data Manipulation and Preparation**
**Task**: Perform Exploratory Data Analysis (EDA) and data cleaning using specific tools and libraries (Python, Pandas, DVC, scikit-learn, etc.).

**Data Exploration and Preprocessing**
**Task**: Explore and preprocess the data to identify significant patterns, trends, and relationships.

**Data Versioning**
**Task**: Apply data versioning techniques to ensure reproducibility and traceability.

**Building, Tuning, and Evaluating Machine Learning Models**
**Task**: Build, tune, and evaluate Machine Learning models using techniques and algorithms appropriate to the problem.

In [1]:
# --- Initialization --- #

# Libraries
from mlflow.models import infer_signature
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    log_loss, matthews_corrcoef, roc_auc_score, confusion_matrix
)
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

import joblib
import mlflow
import numpy as np
import os
import pandas as pd
import warnings


# Initial configuration
DATA_DIR = "../../data/prepared/a01313663"
MLFLOW_TRACKING_URI = "mlruns"
EXPERIMENT_NAME = "Obesity_Classification"

mlflow.set_tracking_uri(MLFLOW_TRACKING_URI)
mlflow.set_experiment(EXPERIMENT_NAME)

# Suppress known warnings
os.environ["MLFLOW_RECORD_ENV_VARS_IN_MODEL_LOGGING"] = "false"
warnings.filterwarnings("ignore", message="l1_ratio parameter is only used when penalty is 'elasticnet'", category=UserWarning)
warnings.filterwarnings("ignore", message="The max_iter was reached which means the coef_ did not converge")

# Load the prepared train and test splits
train_df = pd.read_csv(os.path.join(DATA_DIR, "train_prepared.csv"))
test_df = pd.read_csv(os.path.join(DATA_DIR, "test_prepared.csv"))

X_train = train_df.drop(columns=['NObeyesdad'])
y_train = train_df['NObeyesdad']
X_test = test_df.drop(columns=['NObeyesdad'])
y_test = test_df['NObeyesdad']

print("Datos cargados correctamente:")
print(f"Train: {X_train.shape}, Test: {X_test.shape}")

2025/10/08 11:37:58 INFO mlflow.tracking.fluent: Experiment with name 'Obesity_Classification' does not exist. Creating a new experiment.


Datos cargados correctamente:
Train: (1472, 23), Test: (632, 23)


In [2]:
# --- Utility code --- #

# Compute evaluation metrics for a classifier's predictions
def evaluate_model(y_true, y_pred, y_proba=None, average="macro"):
    """Return a dict of metrics; log_loss and roc_auc_ovr are added only when y_proba is given."""

    metrics = {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision_macro": precision_score(y_true, y_pred, average=average, zero_division=0),
        "recall_macro": recall_score(y_true, y_pred, average=average, zero_division=0),
        "f1_macro": f1_score(y_true, y_pred, average=average, zero_division=0),
        "f1_weighted": f1_score(y_true, y_pred, average="weighted", zero_division=0),
        "mcc": matthews_corrcoef(y_true, y_pred)
    }

    if y_proba is not None:
        try:
            metrics["log_loss"] = log_loss(y_true, y_proba)
        except ValueError:
            metrics["log_loss"] = np.nan

        try:
            metrics["roc_auc_ovr"] = roc_auc_score(y_true, y_proba, multi_class="ovr")
        except Exception:
            metrics["roc_auc_ovr"] = np.nan

    return metrics


# Helper that trains, evaluates, and logs one model as an MLflow run
def run_experiment(model, model_name, X_train, X_test, y_train, y_test, params=None):
    """Fit `model`, evaluate it on the test split, and log params, metrics, and the model to MLflow."""

    with mlflow.start_run(run_name=model_name):
        # Log parameters
        mlflow.log_param("model_name", model_name)
        if params:
            mlflow.log_params(params)

        # Training
        model.fit(X_train, y_train)

        # Predictions
        y_pred = model.predict(X_test)
        y_proba = model.predict_proba(X_test) if hasattr(model, "predict_proba") else None

        # Evaluation
        metrics = evaluate_model(y_test, y_pred, y_proba)
        mlflow.log_metrics(metrics)

        # Model logging
        input_example = X_train.head(1)
        signature = infer_signature(X_train, model.predict(X_train))

        mlflow.sklearn.log_model(
            model,
            name=model_name,
            input_example=input_example,
            signature=signature
        )

        print(f"\nResultados de {model_name}:")
        for k, v in metrics.items():
            print(f"{k:20s}: {v:.4f}")

    return metrics
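
`evaluate_model` reports both `f1_macro` and `f1_weighted`. The sketch below is a dependency-free illustration of how the two averages differ, assuming the usual definitions: macro is the unweighted mean of per-class F1, weighted is the support-weighted mean, and an undefined F1 counts as 0 (as with `zero_division=0`). The helper names and the toy labels are hypothetical and are not part of the project code.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1} computed one-vs-rest; F1 is 0.0 when undefined."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_and_weighted_f1(y_true, y_pred):
    """Return (macro F1, support-weighted F1) for the given labels."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)  # number of true samples per class
    macro = sum(scores.values()) / len(scores)
    weighted = sum(scores[c] * support[c] for c in scores) / len(y_true)
    return macro, weighted


# Hypothetical imbalanced toy labels: class "a" dominates
y_true_toy = ["a", "a", "a", "a", "b", "b"]
y_pred_toy = ["a", "a", "a", "a", "a", "b"]

macro_f1, weighted_f1 = macro_and_weighted_f1(y_true_toy, y_pred_toy)
print(f"macro F1    : {macro_f1:.4f}")
print(f"weighted F1 : {weighted_f1:.4f}")
```

On this toy input the weighted average comes out higher than the macro average because the majority class is predicted better. When classes are roughly balanced, the two values stay close to each other.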

### Modeling

In [3]:
# --- Baseline models and initial run --- #

# Models to evaluate
models = {
    "RandomForest": RandomForestClassifier(random_state=42),
    "GradientBoosting": GradientBoostingClassifier(random_state=42),
    "LogisticRegression": LogisticRegression(max_iter=500, solver="saga", random_state=42),
    "SVC": SVC(probability=True, random_state=42)
}

results = []
for name, model in models.items():
    metrics = run_experiment(model, name, X_train, X_test, y_train, y_test)
    results.append({"Modelo": name, **metrics})

results_df = pd.DataFrame(results).sort_values(by="accuracy", ascending=False)
display(results_df)



Resultados de RandomForest:
accuracy            : 0.9256
precision_macro     : 0.9253
recall_macro        : 0.9259
f1_macro            : 0.9249
f1_weighted         : 0.9258
mcc                 : 0.9133
log_loss            : 0.5945
roc_auc_ovr         : 0.9877

Resultados de GradientBoosting:
accuracy            : 0.9272
precision_macro     : 0.9263
recall_macro        : 0.9276
f1_macro            : 0.9265
f1_weighted         : 0.9268
mcc                 : 0.9151
log_loss            : 0.3069
roc_auc_ovr         : 0.9883

Resultados de LogisticRegression:
accuracy            : 0.8418
precision_macro     : 0.8406
recall_macro        : 0.8387
f1_macro            : 0.8345
f1_weighted         : 0.8382
mcc                 : 0.8164
log_loss            : 0.5417
roc_auc_ovr         : 0.9731

Resultados de SVC:
accuracy            : 0.8782
precision_macro     : 0.8749
recall_macro        : 0.8768
f1_macro            : 0.8753
f1_weighted         : 0.8785
mcc                 : 0.8579
log_loss            : 0.3977
roc_auc_ovr         : 0.9802

| | Modelo | accuracy | precision_macro | recall_macro | f1_macro | f1_weighted | mcc | log_loss | roc_auc_ovr |
|---|--------|----------|-----------------|--------------|----------|-------------|-----|----------|-------------|
| 1 | GradientBoosting | 0.927215 | 0.926256 | 0.92761 | 0.926513 | 0.926763 | 0.915061 | 0.306879 | 0.988289 |
| 0 | RandomForest | 0.925633 | 0.925342 | 0.925852 | 0.924916 | 0.925776 | 0.913349 | 0.594515 | 0.987733 |
| 3 | SVC | 0.878165 | 0.874942 | 0.876837 | 0.875251 | 0.878544 | 0.857939 | 0.397699 | 0.980196 |
| 2 | LogisticRegression | 0.841772 | 0.840577 | 0.838738 | 0.834468 | 0.838214 | 0.816359 | 0.54168 | 0.973108 |
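
In the baseline results, RandomForest and GradientBoosting have nearly the same accuracy but clearly different `log_loss` values. One plausible reading is that the two models differ in how confident their probabilities are, not in which class they pick. The sketch below illustrates that effect with a plain-Python multiclass log loss, assuming the standard definition (mean negative log-probability of the true class, with clipping). The function name and the probability rows are hypothetical toy values, not model outputs.

```python
import math


def mean_log_loss(true_indices, probabilities, eps=1e-15):
    """Average negative log-probability assigned to each sample's true class."""
    total = 0.0
    for idx, row in zip(true_indices, probabilities):
        p = min(max(row[idx], eps), 1 - eps)  # clip to avoid log(0)
        total += -math.log(p)
    return total / len(true_indices)


true_idx = [0, 1, 2]

# Both probability sets pick the correct class (same accuracy),
# but the first is far more confident than the second.
sharp_proba = [[0.90, 0.05, 0.05], [0.05, 0.90, 0.05], [0.05, 0.05, 0.90]]
soft_proba = [[0.50, 0.25, 0.25], [0.25, 0.50, 0.25], [0.25, 0.25, 0.50]]

sharp_loss = mean_log_loss(true_idx, sharp_proba)
soft_loss = mean_log_loss(true_idx, soft_proba)
print(f"confident probabilities: {sharp_loss:.4f}")
print(f"hesitant probabilities : {soft_loss:.4f}")
```

This is only an illustration of the metric. Confirming the explanation for the actual models would require inspecting their predicted probabilities.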


In [4]:
# --- Hyperparameter tuning --- #

# Parameter grids to search
param_grids = {
    "RandomForest": {
        "n_estimators": [100, 300],
        "max_depth": [None, 10, 20],
        "min_samples_split": [2, 5],
        "max_features": ["sqrt", "log2"]
    },
    "GradientBoosting": {
        "n_estimators": [100, 200],
        "learning_rate": [0.05, 0.1],
        "max_depth": [3, 5]
    },
    "LogisticRegression": {
        "C": [0.01, 0.1, 1, 10],
        "penalty": ["l2", "elasticnet"],
        "l1_ratio": [0, 0.5, 1]
    },
    "SVC": {
        "C": [0.1, 1, 10],
        "kernel": ["linear", "rbf"],
        "gamma": ["scale", "auto"]
    }
}

best_models = {}
scoring_metric = "f1_macro"

for name, model in models.items():
    print(f"\nAjustando {name}...")
    grid = GridSearchCV(model, param_grids[name], cv=5, scoring=scoring_metric, n_jobs=-1, verbose=1)
    grid.fit(X_train, y_train)

    best_models[name] = grid.best_estimator_
    print(f"Mejores parámetros: {grid.best_params_}")
    print(f"Mejor {scoring_metric}: {grid.best_score_:.4f}")

    # Infer the signature from the tuned estimator, not the baseline `model`
    input_example = X_train.head(1)
    signature = infer_signature(X_train, grid.best_estimator_.predict(X_train))

    # Log the tuned model to MLflow
    with mlflow.start_run(run_name=f"{name}_tuned"):
        mlflow.log_params(grid.best_params_)
        mlflow.log_metric(f"best_cv_{scoring_metric}", grid.best_score_)
        mlflow.sklearn.log_model(grid.best_estimator_, name="model", input_example=input_example, signature=signature)



Ajustando RandomForest...
Fitting 5 folds for each of 24 candidates, totalling 120 fits
Mejores parámetros: {'max_depth': 20, 'max_features': 'sqrt', 'min_samples_split': 2, 'n_estimators': 300}
Mejor f1_macro: 0.9216

Ajustando GradientBoosting...
Fitting 5 folds for each of 8 candidates, totalling 40 fits
Mejores parámetros: {'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 200}
Mejor f1_macro: 0.9261

Ajustando LogisticRegression...
Fitting 5 folds for each of 24 candidates, totalling 120 fits
Mejores parámetros: {'C': 10, 'l1_ratio': 1, 'penalty': 'elasticnet'}
Mejor f1_macro: 0.8853

Ajustando SVC...
Fitting 5 folds for each of 12 candidates, totalling 60 fits
Mejores parámetros: {'C': 10, 'gamma': 'scale', 'kernel': 'linear'}
Mejor f1_macro: 0.9276
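
The "candidates" and "fits" figures in the log above follow from the grid sizes: the number of candidates is the product of the list lengths in each grid, and the number of fits is that product times the number of folds. The sketch below reproduces the arithmetic with the standard library only. The grids are copied from the tuning cell; `count_candidates` and `logreg_split_grid` are hypothetical names introduced for illustration.

```python
from math import prod

# Grids copied from the tuning cell above
param_grids_copy = {
    "RandomForest": {
        "n_estimators": [100, 300],
        "max_depth": [None, 10, 20],
        "min_samples_split": [2, 5],
        "max_features": ["sqrt", "log2"],
    },
    "GradientBoosting": {
        "n_estimators": [100, 200],
        "learning_rate": [0.05, 0.1],
        "max_depth": [3, 5],
    },
    "LogisticRegression": {
        "C": [0.01, 0.1, 1, 10],
        "penalty": ["l2", "elasticnet"],
        "l1_ratio": [0, 0.5, 1],
    },
    "SVC": {
        "C": [0.1, 1, 10],
        "kernel": ["linear", "rbf"],
        "gamma": ["scale", "auto"],
    },
}
CV_FOLDS = 5  # matches cv=5 in the GridSearchCV call


def count_candidates(grid):
    """Count combinations in one dict grid, or the total over a list of dict grids."""
    grids = grid if isinstance(grid, list) else [grid]
    return sum(prod(len(values) for values in g.values()) for g in grids)


fit_counts = {}
for name, grid in param_grids_copy.items():
    candidates = count_candidates(grid)
    fit_counts[name] = (candidates, candidates * CV_FOLDS)
    print(f"{name:20s}: {candidates} candidates, {candidates * CV_FOLDS} fits")

# Hypothetical alternative for LogisticRegression: a list of grids that only
# pairs l1_ratio with penalty="elasticnet", where the parameter is actually used.
logreg_split_grid = [
    {"penalty": ["l2"], "C": [0.01, 0.1, 1, 10]},
    {"penalty": ["elasticnet"], "C": [0.01, 0.1, 1, 10], "l1_ratio": [0, 0.5, 1]},
]
split_candidates = count_candidates(logreg_split_grid)
print(f"LogisticRegression with split grid: {split_candidates} candidates")
```

`GridSearchCV` accepts a list of dicts as `param_grid`, so a split grid like this is one possible way to avoid evaluating `l1_ratio` values under `penalty="l2"`, where the parameter is ignored. It was not used for the results reported here.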


In [None]:
# --- Evaluate the tuned models and save the best one --- #

final_results = []

for name, model in best_models.items():
    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test) if hasattr(model, "predict_proba") else None
    metrics = evaluate_model(y_test, y_pred, y_proba)
    final_results.append({"Modelo": name, **metrics})

final_df = pd.DataFrame(final_results).sort_values(by="f1_macro", ascending=False)
display(final_df)

best_model_name = final_df.iloc[0]["Modelo"]
best_model = best_models[best_model_name]
print(f"\nMejor modelo final: {best_model_name}")

model_dir = "../../models/a01313663"
model_path = os.path.join(model_dir, "best_model.pkl")
os.makedirs(model_dir, exist_ok=True)
joblib.dump(best_model, model_path)
print(f"Modelo guardado en: {model_path}")

# Confusion matrix
cm = confusion_matrix(y_test, best_model.predict(X_test))
cm_df = pd.DataFrame(cm, index=best_model.classes_, columns=best_model.classes_)
print("\nMatriz de Confusión:")
display(cm_df)

| | Modelo | accuracy | precision_macro | recall_macro | f1_macro | f1_weighted | mcc | log_loss | roc_auc_ovr |
|---|--------|----------|-----------------|--------------|----------|-------------|-----|----------|-------------|
| 1 | GradientBoosting | 0.935127 | 0.935252 | 0.935905 | 0.935209 | 0.935044 | 0.924303 | 0.533501 | 0.989059 |
| 0 | RandomForest | 0.93038 | 0.929499 | 0.930338 | 0.929452 | 0.930583 | 0.918827 | 0.595692 | 0.987649 |
| 3 | SVC | 0.908228 | 0.907522 | 0.907282 | 0.905207 | 0.907262 | 0.893375 | 0.362204 | 0.985614 |
| 2 | LogisticRegression | 0.887658 | 0.888303 | 0.886752 | 0.88503 | 0.886123 | 0.869353 | 0.470277 | 0.979846 |



Mejor modelo final: GradientBoosting
Modelo guardado en: /models/best_model.pkl

Matriz de Confusión:


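| Actual (rows) / Predicted (columns) | insufficient_weight | normal_weight | obesity_type_i | obesity_type_ii | obesity_type_iii | overweight_level_i | overweight_level_ii |
|-------------------------------------|---------------------|---------------|----------------|-----------------|------------------|--------------------|---------------------|
| insufficient_weight | 77 | 3 | 0 | 0 | 0 | 0 | 0 |
| normal_weight | 5 | 75 | 0 | 0 | 0 | 3 | 1 |
| obesity_type_i | 1 | 1 | 100 | 2 | 2 | 2 | 3 |
| obesity_type_ii | 0 | 0 | 5 | 84 | 0 | 0 | 0 |
| obesity_type_iii | 0 | 0 | 1 | 1 | 94 | 0 | 1 |
| overweight_level_i | 0 | 0 | 1 | 1 | 0 | 77 | 5 |
| overweight_level_ii | 0 | 1 | 1 | 1 | 0 | 0 | 84 |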
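
The confusion matrix can be cross-checked against the reported metrics for GradientBoosting. The sketch below copies the matrix from the output above (rows are true classes, columns are predicted classes, as returned by `confusion_matrix`) and recomputes accuracy, per-class recall and precision, and their macro averages using plain Python. The variable names are hypothetical.

```python
# Matrix values copied from the notebook output (rows: true, columns: predicted)
class_names = [
    "insufficient_weight", "normal_weight", "obesity_type_i", "obesity_type_ii",
    "obesity_type_iii", "overweight_level_i", "overweight_level_ii",
]
cm_rows = [
    [77, 3, 0, 0, 0, 0, 0],
    [5, 75, 0, 0, 0, 3, 1],
    [1, 1, 100, 2, 2, 2, 3],
    [0, 0, 5, 84, 0, 0, 0],
    [0, 0, 1, 1, 94, 0, 1],
    [0, 0, 1, 1, 0, 77, 5],
    [0, 1, 1, 1, 0, 0, 84],
]
n = len(class_names)

total_samples = sum(sum(row) for row in cm_rows)
correct = sum(cm_rows[i][i] for i in range(n))
accuracy_from_cm = correct / total_samples

# Recall divides the diagonal by the row sum; precision divides it by the column sum
col_sums = [sum(row[j] for row in cm_rows) for j in range(n)]
recall_by_class = {class_names[i]: cm_rows[i][i] / sum(cm_rows[i]) for i in range(n)}
precision_by_class = {class_names[j]: cm_rows[j][j] / col_sums[j] for j in range(n)}

macro_recall = sum(recall_by_class.values()) / n
macro_precision = sum(precision_by_class.values()) / n

print(f"samples: {total_samples}, accuracy: {accuracy_from_cm:.6f}")
print(f"macro recall: {macro_recall:.6f}, macro precision: {macro_precision:.6f}")
for name in sorted(recall_by_class, key=recall_by_class.get):
    print(f"{name:20s} recall={recall_by_class[name]:.4f} precision={precision_by_class[name]:.4f}")
```

The recomputed totals agree with the 632 test samples and with the `accuracy`, `recall_macro`, and `precision_macro` values in the final results table. Sorting by recall shows which classes the model misses most often.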
