# Evaluation 4 – Machine Learning
## MLOps applied to the Evaluation 3 model

This evaluation applies MLOps practices to the model developed in Evaluation 3,
using the same dataset (FIFA World Cup 2022). The objectives are:

1. Define which elements of the process will be monitored and versioned.
2. Apply versioning and traceability to data, model, metrics, and environment.
3. Determine which parts of the pipeline should be automated.
4. Implement a basic automation (reproducible pipeline).


In [9]:
import os
import json
import hashlib
import datetime
import platform
import sys

import numpy as np
import pandas as pd

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Fix random seeds for reproducibility
SEED = 42
np.random.seed(SEED)
tf.random.set_seed(SEED)


In [10]:
DATASET_PATH = "/content/Fifa_world_cup_matches.csv"

df = pd.read_csv(DATASET_PATH)
df.head()


,team1,team2,possession team1,possession team2,possession in contest,number of goals team1,number of goals team2,date,hour,category,...,penalties scored team1,penalties scored team2,goal preventions team1,goal preventions team2,own goals team1,own goals team2,forced turnovers team1,forced turnovers team2,defensive pressures applied team1,defensive pressures applied team2
0,QATAR,ECUADOR,42%,50%,8%,0,2,20 NOV 2022,17 : 00,Group A,...,0,1,6,5,0,0,52,72,256,279
1,ENGLAND,IRAN,72%,19%,9%,6,2,21 NOV 2022,14 : 00,Group B,...,0,1,8,13,0,0,63,72,139,416
2,SENEGAL,NETHERLANDS,44%,45%,11%,0,2,21 NOV 2022,17 : 00,Group A,...,0,0,9,15,0,0,63,73,263,251
3,UNITED STATES,WALES,51%,39%,10%,1,1,21 NOV 2022,20 : 00,Group B,...,0,1,7,7,0,0,81,72,242,292
4,ARGENTINA,SAUDI ARABIA,64%,24%,12%,1,2,22 NOV 2022,11 : 00,Group C,...,1,0,4,14,0,0,65,80,163,361


# Point 1 – Determine which elements to monitor and version

## 1. Elements to monitor and version

The following elements are versioned and monitored in this project:

1. **Dataset**
   - Metadata: path, number of rows and columns.
   - Schema: column names and data types.
   - Quality: null count per column.
   - Size on disk.
   - SHA-256 hash of the file, to detect changes.

2. **Model**
   - Architecture: model type (autoencoder) and dimensions.
   - Hyperparameters: `latent_dim`, `epochs`, `batch_size`, `optimizer`, `loss`.
   - Model version identifier (for example `ae_v1`).

3. **Metrics**
   - Reconstruction MAE on train and test.
   - Final `loss` and `val_loss` from training.
   - A possible error threshold, if the model is used for anomaly detection.

4. **Dependencies and environment**
   - Python version.
   - Versions of TensorFlow, scikit-learn, pandas, and numpy.
   - Operating system / platform.

5. **Preprocessing results**
   - Variables used as features.
   - `StandardScaler` parameters (mean, standard deviation).
   - Sizes of the train/test sets.

6. **Training logs**
   - History of `loss` and `val_loss` per epoch.
   - Saved to a CSV file for later analysis.
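
The SHA-256 entry in the dataset item is what makes a changed file detectable between runs. A minimal sketch of that check follows, assuming a previously stored metadata dictionary that contains a `sha256` key. The helper names `file_sha256` and `dataset_changed` are hypothetical and are not part of the notebook.

```python
import hashlib


def file_sha256(path: str, chunk_size: int = 4096) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dataset_changed(path: str, stored_meta: dict) -> bool:
    """True when the file's current hash differs from the recorded one."""
    # Assumption: stored_meta was produced earlier and has a "sha256" key.
    return file_sha256(path) != stored_meta["sha256"]
```

A check like this could run at the start of each pipeline execution, so that a new dataset version is recorded whenever the hash differs.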


In [11]:
# Code to capture dataset metadata
def get_dataset_metadata(df: pd.DataFrame, path: str, version: str = "dataset_v1"):
    meta = {}
    meta["version"] = version
    meta["path"] = path
    meta["n_rows"], meta["n_cols"] = df.shape
    meta["columns"] = df.columns.tolist()
    meta["dtypes"] = df.dtypes.astype(str).to_dict()
    meta["null_counts"] = df.isnull().sum().to_dict()
    meta["size_bytes"] = os.path.getsize(path)
    meta["created_at"] = datetime.datetime.now().isoformat()

    # SHA-256 hash of the file
    sha256 = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            sha256.update(chunk)
    meta["sha256"] = sha256.hexdigest()

    return meta

dataset_metadata = get_dataset_metadata(df, DATASET_PATH)
dataset_metadata


{'version': 'dataset_v1',
 'path': '/content/Fifa_world_cup_matches.csv',
 'n_rows': 64,
 'n_cols': 88,
 'columns': ['team1',
  'team2',
  'possession team1',
  'possession team2',
  'possession in contest',
  'number of goals team1',
  'number of goals team2',
  'date',
  'hour',
  'category',
  'total attempts team1',
  'total attempts team2',
  'conceded team1',
  'conceded team2',
  'goal inside the penalty area team1',
  'goal inside the penalty area team2',
  'goal outside the penalty area team1',
  'goal outside the penalty area team2',
  'assists team1',
  'assists team2',
  'on target attempts team1',
  'on target attempts team2',
  'off target attempts team1',
  'off target attempts team2',
  'attempts inside the penalty area team1',
  'attempts inside the penalty area  team2',
  'attempts outside the penalty area  team1',
  'attempts outside the penalty area  team2',
  'left channel team1',
  'left channel team2',
  'left inside channel team1',
  'left inside channel team2',

In [12]:
# Preprocessing and preprocessing metadata
# Keep only numeric columns (adjust if your Evaluation 3 used a different selection)
num_cols = df.select_dtypes(include=["float64", "int64"]).columns
df_num = df[num_cols].copy()

# Note: the scaler is fit on the full dataset before the split, so the test
# rows also contribute to the stored mean and scale.
scaler = StandardScaler()
X = scaler.fit_transform(df_num)

X_train, X_test = train_test_split(
    X, test_size=0.2, random_state=SEED
)

preprocess_metadata = {
    "version": "pre_v1",
    "features": num_cols.tolist(),
    "n_train": int(X_train.shape[0]),
    "n_test": int(X_test.shape[0]),
    "scaler_mean": scaler.mean_.tolist(),
    "scaler_scale": scaler.scale_.tolist(),
}
preprocess_metadata


{'version': 'pre_v1',
 'features': ['number of goals team1',
  'number of goals team2',
  'total attempts team1',
  'total attempts team2',
  'conceded team1',
  'conceded team2',
  'goal inside the penalty area team1',
  'goal inside the penalty area team2',
  'goal outside the penalty area team1',
  'goal outside the penalty area team2',
  'assists team1',
  'assists team2',
  'on target attempts team1',
  'on target attempts team2',
  'off target attempts team1',
  'off target attempts team2',
  'attempts inside the penalty area team1',
  'attempts inside the penalty area  team2',
  'attempts outside the penalty area  team1',
  'attempts outside the penalty area  team2',
  'left channel team1',
  'left channel team2',
  'left inside channel team1',
  'left inside channel team2',
  'central channel team1',
  'central channel team2',
  'right inside channel team1',
  'right inside channel team2',
  'right channel team1',
  'right channel team2',
  'total offers to receive team1',
  't

In [13]:
# Model definition and model metadata
INPUT_DIM = X.shape[1]

def build_autoencoder(input_dim, latent_dim=8):
    inputs = keras.Input(shape=(input_dim,))
    x = layers.Dense(32, activation="relu")(inputs)
    x = layers.Dense(16, activation="relu")(x)
    latent = layers.Dense(latent_dim, activation="relu", name="latent")(x)
    x = layers.Dense(16, activation="relu")(latent)
    x = layers.Dense(32, activation="relu")(x)
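    # Note: sigmoid bounds outputs to (0, 1), while StandardScaler features can be
    # negative or exceed 1, so those values cannot be reconstructed exactly.
    outputs = layers.Dense(input_dim, activation="sigmoid")(x)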

    model = keras.Model(inputs, outputs, name="autoencoder_eva3")
    model.compile(optimizer="adam", loss="mse")
    return model

model = build_autoencoder(INPUT_DIM, latent_dim=8)
model.summary()


In [14]:
# Model metadata
model_metadata = {
    "name": model.name,
    "version": "ae_v1",
    "input_dim": INPUT_DIM,
    "latent_dim": 8,
    "epochs": 50,
    "batch_size": 64,
    "optimizer": "adam",
    "loss": "mse",
    "random_state": SEED,
}
model_metadata


{'name': 'autoencoder_eva3',
 'version': 'ae_v1',
 'input_dim': 80,
 'latent_dim': 8,
 'epochs': 50,
 'batch_size': 64,
 'optimizer': 'adam',
 'loss': 'mse',
 'random_state': 42}

In [15]:
# Training + metrics + logs
history = model.fit(
    X_train, X_train,
    validation_data=(X_test, X_test),
    epochs=model_metadata["epochs"],
    batch_size=model_metadata["batch_size"],
    verbose=0
)

recon_train = model.predict(X_train)
recon_test = model.predict(X_test)

train_mae = float(mean_absolute_error(X_train, recon_train))
test_mae = float(mean_absolute_error(X_test, recon_test))

metrics = {
    "version": "metrics_v1",
    "train_mae": train_mae,
    "test_mae": test_mae,
    "train_loss_final": float(history.history["loss"][-1]),
    "val_loss_final": float(history.history["val_loss"][-1]),
}
metrics


2/2 ━━━━━━━━━━━━━━━━━━━━ 0s 68ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 33ms/step


{'version': 'metrics_v1',
 'train_mae': 0.7722628332089213,
 'test_mae': 0.7588190596543481,
 'train_loss_final': 1.0003303289413452,
 'val_loss_final': 0.8892987966537476}
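
Point 1 lists a possible error threshold for anomaly detection, which the notebook does not define. One common heuristic is to take the mean of the per-sample reconstruction errors on the training set plus `k` standard deviations. The sketch below illustrates that rule in plain Python. The helper names and the default `k = 3` are assumptions chosen for illustration, not values taken from the notebook.

```python
import statistics


def per_sample_mae(x_rows, recon_rows):
    """Mean absolute error of each row against its reconstruction."""
    return [
        sum(abs(a - b) for a, b in zip(x, r)) / len(x)
        for x, r in zip(x_rows, recon_rows)
    ]


def anomaly_threshold(errors, k=3.0):
    """Threshold = mean + k * population standard deviation of the errors."""
    # Assumption: k = 3 is only an illustrative default.
    return statistics.fmean(errors) + k * statistics.pstdev(errors)


def flag_anomalies(errors, threshold):
    """One boolean per sample: True when its error exceeds the threshold."""
    return [e > threshold for e in errors]
```

In the notebook's terms, `x_rows` would correspond to `X_train` and `recon_rows` to `recon_train`. The resulting threshold could then be stored alongside the other metrics.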

In [16]:
# Training history
logs_df = pd.DataFrame(history.history)
logs_df.head()


,loss,val_loss
0,1.291543,1.094473
1,1.287789,1.09212
2,1.284409,1.089957
3,1.281444,1.087962
4,1.278753,1.086237


In [17]:
# Dependencies and environment
from importlib.metadata import version as package_version

env_info = {
    "python_version": sys.version,
    "platform": platform.platform(),
    "tensorflow_version": tf.__version__,
    "numpy_version": np.__version__,
    "pandas_version": pd.__version__,
    # importlib.metadata replaces the deprecated pkg_resources lookup
    "sklearn_version": package_version("scikit-learn"),
}
env_info



{'python_version': '3.12.12 (main, Oct 10 2025, 08:52:57) [GCC 11.4.0]',
 'platform': 'Linux-6.6.105+-x86_64-with-glibc2.35',
 'tensorflow_version': '2.19.0',
 'numpy_version': '2.0.2',
 'pandas_version': '2.2.2',
 'sklearn_version': '1.6.1'}
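
A recorded environment is most useful when it can be compared with the environment of a later run. The sketch below uses only the standard library (`importlib.metadata`) to collect installed versions and to report which entries differ. The helper names `capture_versions` and `env_diff` are hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def capture_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


def env_diff(recorded: dict, current: dict) -> dict:
    """Return {key: (recorded, current)} for every key whose value differs."""
    keys = set(recorded) | set(current)
    return {
        k: (recorded.get(k), current.get(k))
        for k in sorted(keys)
        if recorded.get(k) != current.get(k)
    }
```

An empty result from `env_diff` means the two environments match on every recorded key. Any other result names the packages to investigate before comparing metrics across runs.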

# Point 2 – Apply versioning and traceability
In this section we will:

- Organize artifacts into folders.
- Save all metadata and models under a run identifier.
- (Optional) Register the experiment in MLflow to demonstrate traceability.

In [18]:
# Folder structure and helper for saving JSON
BASE_DIR = "mlops_eva4"
ARTIFACTS_DIR = os.path.join(BASE_DIR, "artifacts")
MODELS_DIR = os.path.join(BASE_DIR, "models")
METADATA_DIR = os.path.join(BASE_DIR, "metadata")
LOGS_DIR = os.path.join(BASE_DIR, "logs")

for d in [BASE_DIR, ARTIFACTS_DIR, MODELS_DIR, METADATA_DIR, LOGS_DIR]:
    os.makedirs(d, exist_ok=True)

def save_json(obj, path):
    with open(path, "w") as f:
        json.dump(obj, f, indent=2)


In [19]:
# Create a run_id and save versioned artifacts
run_id = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
run_name = f"run_{run_id}_ae_v1"

# Save metadata
save_json(dataset_metadata,    os.path.join(METADATA_DIR, f"dataset_{run_id}.json"))
save_json(preprocess_metadata, os.path.join(METADATA_DIR, f"preprocess_{run_id}.json"))
save_json(model_metadata,      os.path.join(METADATA_DIR, f"model_{run_id}.json"))
save_json(metrics,             os.path.join(METADATA_DIR, f"metrics_{run_id}.json"))
save_json(env_info,            os.path.join(METADATA_DIR, f"env_{run_id}.json"))

# Save training logs
history_path = os.path.join(LOGS_DIR, f"history_{run_id}.csv")
logs_df.to_csv(history_path, index=False)

# Save the model
model_path = os.path.join(MODELS_DIR, f"autoencoder_{run_id}.keras")
model.save(model_path)

run_name, model_path, history_path


('run_20251206-194845_ae_v1',
 'mlops_eva4/models/autoencoder_20251206-194845.keras',
 'mlops_eva4/logs/history_20251206-194845.csv')
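
Per-run files give traceability for each execution, but finding a past run still means browsing folders. One possible complement is a single append-only index with one JSON line per run. The sketch below assumes a hypothetical registry file (for example `runs.jsonl`). The helper names are illustrative and are not part of the notebook.

```python
import json
import os


def append_run(registry_path: str, record: dict) -> None:
    """Append one run record to the registry as a single JSON line."""
    parent = os.path.dirname(registry_path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    with open(registry_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")


def load_runs(registry_path: str) -> list:
    """Read all run records back, skipping blank lines."""
    with open(registry_path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

A record could hold the `run_id`, the model path, and the dataset hash, so that a saved model can be traced back to the exact data it was trained on.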

# Point 3 – Determine which parts of the pipeline to automate

## 3. Pipeline tasks to automate (pruning manual steps)

In Evaluation 3, several tasks were performed manually inside the notebook:

- Loading the dataset from CSV.
- Basic validation of data quality.
- Preprocessing (column selection, scaling, train/test split).
- Model definition and training.
- Computing metrics and reconstruction error.
- Manually saving the model and results.

To apply MLOps, the following parts will be automated:

1. **Dataset validation**
   - Check the proportion of nulls, duplicates, and minimum dimensions.

2. **Reproducible preprocessing**
   - Always apply the same column selection and scaling with the same parameters.

3. **Repeatable training**
   - Train the autoencoder with configurable hyperparameters (`epochs`, `batch_size`, `latent_dim`, etc.).

4. **Automatic generation and saving of metrics and logs**
   - Compute and save MAE, `loss`, `val_loss`, and the training history.

5. **Export and registration of the model and artifacts**
   - Save the model, metadata, and logs in a folder structure, ideally registering the experiment in MLflow.
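
Reproducible preprocessing depends on reapplying exactly the scaling parameters that were recorded. With its default settings, `StandardScaler` standardizes each feature as `(x - mean) / scale`, so stored mean and scale lists are enough to repeat the transformation on new rows. The following is a minimal pure-Python sketch of that idea. The helper names are hypothetical.

```python
import json


def scaler_params_to_json(mean, scale) -> str:
    """Serialize per-feature mean and scale so they can be stored with a run."""
    return json.dumps({"mean": list(mean), "scale": list(scale)})


def apply_saved_scaler(rows, params_json):
    """Standardize rows as (x - mean) / scale using the stored parameters."""
    params = json.loads(params_json)
    mean, scale = params["mean"], params["scale"]
    return [
        [(x - m) / s for x, m, s in zip(row, mean, scale)]
        for row in rows
    ]
```

In the notebook, the inputs would correspond to `scaler.mean_` and `scaler.scale_`. Keeping them next to the model lets new data be scaled the same way without refitting.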


# Point 4 – Apply automation (basic pipeline)
We now move from analysis to code: a reproducible pipeline that runs every step from a single function.

In [20]:
# Automatic dataset validation
def validate_dataset(df: pd.DataFrame, max_null_ratio: float = 0.3):
    report = {}
    report["n_rows"], report["n_cols"] = df.shape
    null_ratio = df.isnull().mean()
    report["null_ratio"] = null_ratio.to_dict()
    report["max_null_ratio_ok"] = bool((null_ratio <= max_null_ratio).all())
    report["has_duplicates"] = bool(df.duplicated().any())

    if not report["max_null_ratio_ok"]:
        raise ValueError("Invalid dataset: columns with too many null values.")
    if report["has_duplicates"]:
        print("Warning: the dataset contains duplicate rows.")

    return report

validation_report = validate_dataset(df)
validation_report


{'n_rows': 64,
 'n_cols': 88,
 'null_ratio': {'team1': 0.0,
  'team2': 0.0,
  'possession team1': 0.0,
  'possession team2': 0.0,
  'possession in contest': 0.0,
  'number of goals team1': 0.0,
  'number of goals team2': 0.0,
  'date': 0.0,
  'hour': 0.0,
  'category': 0.0,
  'total attempts team1': 0.0,
  'total attempts team2': 0.0,
  'conceded team1': 0.0,
  'conceded team2': 0.0,
  'goal inside the penalty area team1': 0.0,
  'goal inside the penalty area team2': 0.0,
  'goal outside the penalty area team1': 0.0,
  'goal outside the penalty area team2': 0.0,
  'assists team1': 0.0,
  'assists team2': 0.0,
  'on target attempts team1': 0.0,
  'on target attempts team2': 0.0,
  'off target attempts team1': 0.0,
  'off target attempts team2': 0.0,
  'attempts inside the penalty area team1': 0.0,
  'attempts inside the penalty area  team2': 0.0,
  'attempts outside the penalty area  team1': 0.0,
  'attempts outside the penalty area  team2': 0.0,
  'left channel team1': 0.0,
  'left cha

In [21]:
# Central experiment configuration
config = {
    "dataset_path": DATASET_PATH,
    "test_size": 0.2,
    "random_state": SEED,
    "latent_dim": 8,
    "epochs": 50,
    "batch_size": 64,
    "run_prefix": "ae_eva4",
}
config


{'dataset_path': '/content/Fifa_world_cup_matches.csv',
 'test_size': 0.2,
 'random_state': 42,
 'latent_dim': 8,
 'epochs': 50,
 'batch_size': 64,
 'run_prefix': 'ae_eva4'}
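
Because every run is driven by this dictionary, a short fingerprint of it can make it easy to tell whether two runs used identical settings. The sketch below assumes the configuration is JSON-serializable. The function name `config_fingerprint` and the 12-character length are hypothetical choices.

```python
import hashlib
import json


def config_fingerprint(cfg: dict, length: int = 12) -> str:
    """Short SHA-256 fingerprint of a config, independent of key order."""
    # Sorting keys and fixing separators gives one canonical text per config.
    canonical = json.dumps(cfg, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:length]
```

Two runs with the same fingerprint but different metrics would point to a change outside the configuration, such as the data, the environment, or random initialization.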

In [22]:
# Automated pipeline
def run_pipeline(cfg: dict):
    # 1. Load and validate the dataset
    df = pd.read_csv(cfg["dataset_path"])
    validate_dataset(df)

    # 2. Preprocessing
    num_cols = df.select_dtypes(include=["float64", "int64"]).columns
    df_num = df[num_cols].copy()

    scaler = StandardScaler()
    X = scaler.fit_transform(df_num)

    X_train, X_test = train_test_split(
        X,
        test_size=cfg["test_size"],
        random_state=cfg["random_state"]
    )

    # 3. Build the model
    input_dim = X.shape[1]
    model = build_autoencoder(input_dim, latent_dim=cfg["latent_dim"])

    # 4. Train
    history = model.fit(
        X_train, X_train,
        validation_data=(X_test, X_test),
        epochs=cfg["epochs"],
        batch_size=cfg["batch_size"],
        verbose=0
    )

    # 5. Metrics
    recon_test = model.predict(X_test)
    test_mae = float(mean_absolute_error(X_test, recon_test))
    val_loss_final = float(history.history["val_loss"][-1])

    metrics = {
        "test_mae": test_mae,
        "val_loss_final": val_loss_final,
    }

    # 6. Automatic versioning of artifacts
    timestamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    run_id = f"{cfg['run_prefix']}_{timestamp}"
    run_dir = os.path.join(ARTIFACTS_DIR, run_id)
    os.makedirs(run_dir, exist_ok=True)

    # Save the training history
    history_path = os.path.join(run_dir, "history.csv")
    pd.DataFrame(history.history).to_csv(history_path, index=False)

    # Save the model
    model_path = os.path.join(run_dir, "model.keras")
    model.save(model_path)

    # Save the run summary (config + metrics)
    run_summary = {
        "run_id": run_id,
        "config": cfg,
        "metrics": metrics,
        "features": num_cols.tolist(),
    }
    save_json(run_summary, os.path.join(run_dir, "run_summary.json"))

    print(f"Run {run_id} completed. Test MAE = {test_mae:.4f}")
    return run_id, metrics, run_dir

run_id_auto, metrics_auto, run_dir_auto = run_pipeline(config)
metrics_auto


1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 268ms/step
Run ae_eva4_20251206-194853 completed. Test MAE = 0.7662


{'test_mae': 0.7662375733472266, 'val_loss_final': 0.9053087830543518}
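
Since each run writes a `run_summary.json` into its own folder under the artifacts directory, past runs can be compared by scanning those files. The sketch below assumes that layout and a `metrics` dictionary inside each summary, as produced by `run_pipeline`. The helper names `load_run_summaries` and `best_run` are hypothetical.

```python
import json
import os


def load_run_summaries(artifacts_dir: str) -> list:
    """Load run_summary.json from every run folder that contains one."""
    summaries = []
    for entry in sorted(os.listdir(artifacts_dir)):
        path = os.path.join(artifacts_dir, entry, "run_summary.json")
        if os.path.isfile(path):
            with open(path, encoding="utf-8") as f:
                summaries.append(json.load(f))
    return summaries


def best_run(summaries: list, metric: str = "test_mae"):
    """Run with the lowest value of `metric`, or None if there are no runs."""
    if not summaries:
        return None
    return min(summaries, key=lambda s: s["metrics"][metric])
```

Selecting by lowest `test_mae` is only one possible criterion. With a test split this small, differences between runs may also reflect random initialization.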

# Conclusions

- The elements to version and monitor were defined explicitly:
  dataset, model, metrics, environment, preprocessing, and training logs.
- A local versioning scheme was implemented, based on:
  - Artifact folders (`mlops_eva4/*`).
  - Run identifiers (`run_id`) with a timestamp.
  - JSON files with metadata and metrics.
  - Models and logs saved for each run.
- The manual tasks of the original pipeline (Evaluation 3) were identified, and
  dataset validation, preprocessing, training, and the recording of results
  were selected for automation.
- A reproducible pipeline (`run_pipeline`) was implemented that runs all stages
  automatically, in line with the MLOps approach requested in the evaluation.
  Registering runs in MLflow remains an optional extension.
