# Part 1: Model Lifecycle - Training and Tracking
*   **Author:** Carolina Torres Zapata
*   **Date:** 2025-11-24
*   **Purpose:** Train supervised classification models (Logistic Regression and Random Forest) on the processed data. **MLflow** is used to log experiments and metrics and to version the model artifacts.
*   **Workflow:**
     1.  Load the transformed data from the Silver layer.
     2.  Split the data into training and test sets.
     3.  Train iteratively while logging each experiment (tracking).
     4.  Evaluate performance (metrics).
     

## 1. Import Libraries

In [0]:
import pandas as pd
import mlflow
import mlflow.sklearn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score

## 2. Data Loading (Silver Layer)
We read the `dev.silver.churn_data` table produced in the previous stage. This table already contains the encoded and scaled features.

In [0]:
df = spark.table("dev.silver.churn_data").toPandas()
print(f"Dimensiones del dataset: {df.shape}")
display(df.head())

Dimensiones del dataset: (7043, 32)


customerID,gender_Male,SeniorCitizen,Partner,Dependents,PhoneService,MultipleLines_No_phone_service,MultipleLines,InternetService_Fiber_optic,InternetService_No,OnlineSecurity_No_internet_service,OnlineSecurity,OnlineBackup_No_internet_service,OnlineBackup,DeviceProtection_No_internet_service,DeviceProtection,TechSupport_No_internet_service,TechSupport,StreamingTV_No_internet_service,StreamingTV,StreamingMovies_No_internet_service,StreamingMovies,Contract_One_year,Contract_Two_year,PaperlessBilling,PaymentMethod_Credit_card_automatic,PaymentMethod_Electronic_check,PaymentMethod_Mailed_check,tenure,MonthlyCharges,TotalCharges,Churn
8008-ESFLK,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.8400143297628127,1.5201556452330371,1.5684111969587773,0
7537-CBQUZ,1.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.247217966963116,1.4835961112293827,2.0964618708481115,0
1555-DJEQW,0.0,0.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,1.5322605130033284,1.6431286232453304,2.402200981291416,1
5649-TJHOV,1.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.2187151269579764,-0.9393039150128252,-0.5524896868803615,1
0519-XUZJU,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,-1.2774445836787656,0.1990270301009692,-0.9769127064682396,1


## 3. Data Split (Train/Test Split)
We prepare the datasets for modeling:
1.  **Feature definition (X):** We drop the target variable (`Churn`) and the identifier (`customerID`), which is unique per customer and would only let the model memorize individual rows.
2.  **Stratification:** We pass `stratify=y` when splitting the data (80% training / 20% test). This matters in imbalanced classification because it keeps the proportion of churn cases (Churn=1) essentially the same in both sets.

In [0]:
# Define X and y
target_col = "Churn"
id_col = "customerID"

X = df.drop(columns=[target_col, id_col])
y = df[target_col]

# 80/20 split with stratification (important because Churn is usually imbalanced)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Registros de Entrenamiento: {X_train.shape[0]}")
print(f"Registros de Prueba: {X_test.shape[0]}")

Registros de Entrenamiento: 5634
Registros de Prueba: 1409
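
As a quick check that stratification behaves as described, the following sketch compares the positive-class rate in each split. It is a self-contained illustration on synthetic labels (7043 rows with roughly 26% positives, mirroring the figures quoted in this notebook), not the actual churn table; `X_demo` and `y_demo` are hypothetical names.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn labels (assumption: ~26% positives)
n_rows = 7043
n_pos = int(0.26 * n_rows)
y_demo = np.array([1] * n_pos + [0] * (n_rows - n_pos))
X_demo = np.arange(n_rows).reshape(-1, 1)  # dummy single feature

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratify=y the two rates should agree up to rounding
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"Positive rate (train): {train_rate:.4f}")
print(f"Positive rate (test):  {test_rate:.4f}")
```

Running the same comparison on `y_train` and `y_test` from the cell above would confirm the property on the real data.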


## 4. MLflow Experiment Setup
We define and configure the experiment under which all runs will be logged. Setting an explicit path with `set_experiment` keeps logs, metrics, and artifacts in one place, which makes models easier to trace and compare.

In [0]:
experiment_path = "/Users/carolina.torresz@udea.edu.co/churn_experiment_ops"
mlflow.set_experiment(experiment_path)

print(f"Experimento configurado en: {experiment_path}")

2025/11/23 02:10:02 INFO mlflow.tracking.fluent: Experiment with name '/Users/carolina.torresz@udea.edu.co/churn_experiment_ops' does not exist. Creating a new experiment.


Experimento configurado en: /Users/carolina.torresz@udea.edu.co/churn_experiment_ops


## 5. Standardized Training Function
We implement a reusable function that encapsulates the training lifecycle so that every experiment follows the same steps.

**Function flow:**
1.  **Training:** Fit the model on the training data.
2.  **Evaluation:** Compute key metrics (*Accuracy, F1-Score, AUC-ROC*) on the test set.
3.  **Traceability (MLflow):** Automatic logging of:
    *   **Hyperparameters:** The model configuration.
    *   **Metrics:** The performance results.
    *   **Model artifact:** The serialized model, saved together with an `input_example`. The example matters because it lets MLflow infer and record the model's "signature" (the schema of the input data) automatically.
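
To see why the function tracks more than accuracy, consider a hypothetical baseline that always predicts "no churn". The sketch below computes accuracy and F1 by hand on synthetic labels with 26% positives (an assumption matching the imbalance reported in this notebook); the helper names are illustrative only.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_binary(y_true, y_pred):
    """F1 for the positive class (label 1); 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Synthetic labels: 26 churners out of 100 customers (assumption)
y_true = [1] * 26 + [0] * 74
y_pred = [0] * 100  # baseline that never predicts churn

baseline_acc = accuracy(y_true, y_pred)
baseline_f1 = f1_binary(y_true, y_pred)
print(f"Accuracy: {baseline_acc:.2f}, F1: {baseline_f1:.2f}")
```

The baseline reaches 74% accuracy while detecting no churners at all, so accuracy on its own would be a misleading yardstick here.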

In [0]:
def entrenar_y_registrar(modelo, nombre_run, params):
    """
    Train a model, compute metrics, and log the run to MLflow.
    """
    with mlflow.start_run(run_name=nombre_run) as run:
        print(f"🚀 Iniciando entrenamiento: {nombre_run}...")

        # 1. Train
        modelo.fit(X_train, y_train)
        
        # 2. Predict (classes and scores)
        y_pred = modelo.predict(X_test)
        # AUC needs a continuous score: prefer probabilities, then decision scores
        if hasattr(modelo, "predict_proba"):
            y_prob = modelo.predict_proba(X_test)[:, 1]
        elif hasattr(modelo, "decision_function"):
            y_prob = modelo.decision_function(X_test)
        else:
            y_prob = y_pred  # Last resort: hard labels give only a coarse AUC
            
        # 3. Compute metrics
        acc = accuracy_score(y_test, y_pred)
        f1 = f1_score(y_test, y_pred)
        auc = roc_auc_score(y_test, y_prob)
        
        print(f"   📊 Accuracy: {acc:.4f}")
        print(f"   📊 AUC:      {auc:.4f}")
        
        # 4. Logging to MLflow
        # A. Parameters
        mlflow.log_params(params)
        
        # B. Metrics
        mlflow.log_metrics({"accuracy": acc, "f1_score": f1, "auc": auc})
        
        # C. Tags for operations
        mlflow.set_tag("env", "dev")
        # Tag with the estimator class name (splitting the run name only kept "Logistic" / "Random")
        mlflow.set_tag("algorithm", type(modelo).__name__)
        
        # D. Save the model (artifact)
        input_example = X_train.iloc[:5]
        
        mlflow.sklearn.log_model(
            sk_model=modelo, 
            artifact_path="model",
            input_example=input_example
        )
        
        print(f"✅ Run ID: {run.info.run_id}")
        return run.info.run_id
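
The function prefers continuous scores over hard labels for AUC because ROC AUC is a ranking metric. As a rough illustration, the sketch below computes AUC as the share of (positive, negative) pairs that are ranked correctly, counting ties as half. The toy data and the `pairwise_auc` helper are hypothetical and not part of the notebook.

```python
def pairwise_auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    total = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                total += 1.0
            elif p == n:
                total += 0.5
    return total / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
probs = [0.1, 0.6, 0.35, 0.8]                   # toy predicted probabilities
labels = [1 if p >= 0.5 else 0 for p in probs]  # hard labels at a 0.5 threshold

auc_from_probs = pairwise_auc(y_true, probs)
auc_from_labels = pairwise_auc(y_true, labels)
print(f"AUC from probabilities: {auc_from_probs:.2f}")
print(f"AUC from hard labels:   {auc_from_labels:.2f}")
```

Thresholding discards the ordering within each predicted class, so the label-based value can understate how well the model ranks customers.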

## 6. Running the Experiments and Handling Class Imbalance
We train two types of algorithms to compare their performance and establish a baseline.

1.  **Logistic Regression:** A simple, interpretable linear model.
2.  **Random Forest:** A nonlinear ensemble model that is robust to noise.

**Technical decision (`class_weight='balanced'`):**
Given the imbalance found in the EDA (~26% Churn), both algorithms are configured with balanced class weights. This weights each sample inversely to its class frequency, so errors on the minority class are penalized more heavily. The aim is to improve the model's ability to detect real churners without adding complexity to the data pipeline (such as SMOTE).
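
scikit-learn documents the `'balanced'` mode as weighting each class by `n_samples / (n_classes * np.bincount(y))`. The sketch below reproduces that formula in plain Python on synthetic labels with 26% positives (an assumption, not the real dataset); `balanced_class_weights` is an illustrative helper, not a library function.

```python
from collections import Counter


def balanced_class_weights(y):
    """Per-class weight: n_samples / (n_classes * count of that class)."""
    counts = Counter(y)
    n_samples = len(y)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Synthetic labels: 26 churners out of 100 customers (assumption)
y_demo = [1] * 26 + [0] * 74
weights = balanced_class_weights(y_demo)
for label in sorted(weights):
    print(f"class {label}: weight {weights[label]:.4f}")
```

With these weights each class contributes the same total weight (50 here), which is what makes mistakes on a churner count roughly three times as much as mistakes on a non-churner.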

In [0]:
# --- MODEL A: Logistic Regression ---
params_lr = {
    "C": 1.0, 
    "solver": "liblinear", 
    "class_weight": "balanced",  
    "random_state": 42
}
model_lr = LogisticRegression(**params_lr)

run_id_lr = entrenar_y_registrar(model_lr, "Logistic_Regression_Balanced", params_lr)

print("-" * 30)

# --- MODEL B: Random Forest ---
params_rf = {
    "n_estimators": 100, 
    "max_depth": 10, 
    "min_samples_split": 5,
    "class_weight": "balanced", 
    "random_state": 42
}
model_rf = RandomForestClassifier(**params_rf)

run_id_rf = entrenar_y_registrar(model_rf, "Random_Forest_Balanced", params_rf)

🚀 Iniciando entrenamiento: Logistic_Regression_Balanced...
   📊 Accuracy: 0.7303
   📊 AUC:      0.8181




✅ Run ID: 5a51654be731460f902fb1cd80356e55
------------------------------
🚀 Iniciando entrenamiento: Random_Forest_Balanced...
   📊 Accuracy: 0.7679
   📊 AUC:      0.8210




✅ Run ID: 93e7886de06745f1b6dc428763c38d40


In [0]:
# List the latest runs programmatically to confirm they succeeded
runs = mlflow.search_runs(experiment_ids=[mlflow.get_experiment_by_name(experiment_path).experiment_id])
display(runs[["run_id", "tags.mlflow.runName", "metrics.auc","metrics.f1_score" ,"metrics.accuracy", "status"]].head(2))

run_id,tags.mlflow.runName,metrics.auc,metrics.f1_score,metrics.accuracy,status
93e7886de06745f1b6dc428763c38d40,Random_Forest_Balanced,0.8209731586969438,0.6148409893992933,0.7679205110007097,FINISHED
5a51654be731460f902fb1cd80356e55,Logistic_Regression_Balanced,0.818056524322509,0.6016771488469602,0.730305180979418,FINISHED
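
A natural next step is to pick the best run from this table. The sketch below rebuilds the two rows shown above as a small pandas DataFrame (`runs_demo` is a hypothetical stand-in for the `runs` object) and selects the run with the highest AUC; choosing AUC as the criterion is an assumption made for illustration.

```python
import pandas as pd

# Values copied from the run table above
runs_demo = pd.DataFrame({
    "run_id": [
        "93e7886de06745f1b6dc428763c38d40",
        "5a51654be731460f902fb1cd80356e55",
    ],
    "tags.mlflow.runName": [
        "Random_Forest_Balanced",
        "Logistic_Regression_Balanced",
    ],
    "metrics.auc": [0.8209731586969438, 0.818056524322509],
    "metrics.f1_score": [0.6148409893992933, 0.6016771488469602],
    "metrics.accuracy": [0.7679205110007097, 0.730305180979418],
})

# Sort by AUC, highest first, and keep the top row
best = runs_demo.sort_values("metrics.auc", ascending=False).iloc[0]
best_name = best["tags.mlflow.runName"]
best_run_id = best["run_id"]
print(f"Best run by AUC: {best_name} ({best_run_id})")
```

Against a live tracking server, `mlflow.search_runs` also accepts an `order_by` argument (for example `order_by=["metrics.auc DESC"]`), which would return the runs already sorted. Note that the AUC gap between the two models is small, so F1 and accuracy are worth weighing as well.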
