# Phase 2 – MLOps: Data Scientist Role
**Author:** Ricardo Miguel Aguilar Rosas  
**Student ID:** A01171223  
**Team:** 8 – Energy Efficiency Prediction  
**Date:** October 2025

**1) Environment setup**

In [5]:
# Stable scientific stack + MLflow
!pip install -Uq "numpy==1.26.4" "scipy==1.11.4" "scikit-learn==1.4.2" "pandas==2.2.2" "mlflow==2.15.0"

import numpy as np, pandas as pd, sklearn, scipy, mlflow
print("OK ->",
      "numpy", np.__version__,
      "| pandas", pd.__version__,
      "| sklearn", sklearn.__version__,
      "| scipy", scipy.__version__,
      "| mlflow", mlflow.__version__)


OK -> numpy 1.26.4 | pandas 2.2.2 | sklearn 1.4.2 | scipy 1.11.4 | mlflow 2.15.0
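
The printed versions can also be checked programmatically, so that a drifting environment fails loudly instead of being noticed by eye. The following is a minimal sketch using only the standard library; the helper name `find_version_mismatches` and the `PINNED` dictionary are illustrative assumptions, with the pins copied from the `pip install` command above.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the pip command in this step.
PINNED = {
    "numpy": "1.26.4",
    "scipy": "1.11.4",
    "scikit-learn": "1.4.2",
    "pandas": "2.2.2",
    "mlflow": "2.15.0",
}

def find_version_mismatches(pinned, get_version=version):
    """Return {package: (expected, installed_or_None)} for every unmet pin."""
    mismatches = {}
    for package, expected in pinned.items():
        try:
            installed = get_version(package)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            mismatches[package] = (expected, installed)
    return mismatches

# An empty dict means the environment matches the pins.
print(find_version_mismatches(PINNED) or "All pinned versions match.")
```

The `get_version` parameter exists only so the check can be exercised without touching the real environment.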


**2) Repository & Data**

In [6]:
from google.colab import drive
drive.mount('/content/drive')

%cd /content
!git clone https://github.com/PosgradoMNA/TC5044.10-Equipo-8.git || true
%cd TC5044.10-Equipo-8
!git checkout ricardo-datascientist

!ls -l src/data | sed -n '1,100p'



Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content
fatal: destination path 'TC5044.10-Equipo-8' already exists and is not an empty directory.
/content/TC5044.10-Equipo-8
Already on 'ricardo-datascientist'
Your branch is up to date with 'origin/ricardo-datascientist'.
total 52
drwxr-xr-x 2 root root  4096 Oct 29 23:56 cleansed
-rw------- 1 root root 43747 Oct 30 00:25 energy_efficiency_modified.csv
-rw-r--r-- 1 root root   111 Oct 29 23:56 energy_efficiency_modified.csv.dvc


**3) Import Paths**

In [7]:
import sys
repo = '/content/TC5044.10-Equipo-8'
srcp = f'{repo}/src'
if repo not in sys.path: sys.path.append(repo)
if srcp not in sys.path: sys.path.append(srcp)
print("Paths ready:", repo in sys.path, srcp in sys.path)



Paths ready: True True


**4) Data Loading**

Objective: load the dataset with the `DataLoader` class and verify its basic
structure before preprocessing.

In [11]:
import sys

# Repository path setup (done before importing from src)
repo = '/content/TC5044.10-Equipo-8'
srcp = f'{repo}/src'
if repo not in sys.path: sys.path.append(repo)
if srcp not in sys.path: sys.path.append(srcp)

from src.handlers.data_loader import DataLoader

# Load the CSV file
loader = DataLoader()
df = loader.getDataFrameFromFile("data/energy_efficiency_modified.csv")

# Initial checks
print("Data loaded successfully.")
print(f"Dimensions: {df.shape[0]} rows × {df.shape[1]} columns\n")
print("Columns:", list(df.columns), "\n")

print("General dataset information:")
df.info()

print("\nDescriptive statistics:")
display(df.describe().T.head(10))

print("\nMissing values per column:")
display(df.isnull().sum().to_frame("Missing Values").T)

print(f"\nDuplicate rows: {df.duplicated().sum()}")



Succesfully loaded DF from /content/TC5044.10-Equipo-8/src/data/energy_efficiency_modified.csv... 

Data loaded successfully.
Dimensions: 783 rows × 11 columns

Columns: ['relative_compactness', 'surface_area', 'wall_area', 'roof_area', 'overall_height', 'orientation', 'glazing_area', 'glazing_area_distribution', 'heating_load', 'cooling_load', 'mixed_type_col'] 

General dataset information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 783 entries, 0 to 782
Data columns (total 11 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   relative_compactness       776 non-null    object 
 1   surface_area               774 non-null    object 
 2   wall_area                  776 non-null    object 
 3   roof_area                  776 non-null    object 
 4   overall_height             767 non-null    object 
 5   orientation                772 non-null    object 
 6   glazing_area           

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| glazing_area_distribution | 772.0 | 5.200777 | 40.220112 | 0.0 | 1.0 | 3.0 | 4.0 | 971.0 |


Missing values per column:

| | relative_compactness | surface_area | wall_area | roof_area | overall_height | orientation | glazing_area | glazing_area_distribution | heating_load | cooling_load | mixed_type_col |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Missing Values | 7 | 9 | 7 | 7 | 16 | 11 | 12 | 11 | 6 | 6 | 87 |


Duplicate rows: 0
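
The `df.info()` output lists the feature columns with dtype `object`, which usually means some cells contain text that is not a valid number. Before cleaning, it can help to count how many such cells each column has. Below is a minimal sketch on a small made-up frame (not the project's CSV); the helper name `count_unparseable` is hypothetical.

```python
import pandas as pd

def count_unparseable(df: pd.DataFrame) -> pd.Series:
    """Count, per column, cells that are present but cannot be parsed as numbers."""
    coerced = df.apply(pd.to_numeric, errors="coerce")
    # A cell is unparseable if it was not missing originally
    # but turned into NaN after coercion.
    return (coerced.isna() & df.notna()).sum()

# Toy data with one bad token per column and one genuinely missing cell.
toy = pd.DataFrame({
    "surface_area": ["514.5", "563.5", "error", None],
    "orientation": ["2", "3", "?", "4"],
})
print(count_unparseable(toy))
```

Separating unparseable cells from cells that were already missing shows how much of the later cleaning is due to bad tokens rather than gaps in the data.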


**5) Data Preprocessing**

Objective:
Ensure dataset quality and comparability through:
(i) conversion to numeric types,
(ii) median imputation,
(iii) outlier removal via the IQR rule, and
(iv) standardization (z-score).
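
The four steps listed above can be sketched in plain pandas as follows. This is an illustrative approximation under stated assumptions, not the repository's `DataPreprocessor` implementation: the function name `preprocess_sketch`, the `iqr_factor` of 1.5, and the toy values are all hypothetical.

```python
import numpy as np
import pandas as pd

def preprocess_sketch(df: pd.DataFrame, iqr_factor: float = 1.5) -> pd.DataFrame:
    # (i) Coerce every column to numeric; unparseable cells become NaN.
    out = df.apply(pd.to_numeric, errors="coerce")
    # (ii) Fill missing values with each column's median.
    out = out.fillna(out.median())
    # (iii) Keep only rows whose values all lie within
    #       [Q1 - k*IQR, Q3 + k*IQR] for every column.
    q1, q3 = out.quantile(0.25), out.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - iqr_factor * iqr, q3 + iqr_factor * iqr
    inside = ((out >= lower) & (out <= upper)).all(axis=1)
    out = out.loc[inside]
    # (iv) z-score using the population standard deviation (ddof=0),
    #      the same convention StandardScaler uses.
    return (out - out.mean()) / out.std(ddof=0)

# Toy frame: one bad token, one missing value, one extreme value.
toy = pd.DataFrame({
    "wall_area": ["294", "318.5", "343", "bad", "294", "9999"],
    "heating_load": [15.5, 20.8, None, 22.3, 16.0, 21.0],
})
result = preprocess_sketch(toy)
print(result.round(3))
```

In this toy frame only the row holding the extreme `9999` value is removed, and the remaining columns end up with mean 0 and standard deviation 1.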

In [13]:
from src.handlers.data_preprocessor import DataPreprocessor
import pandas as pd

# --- Run the preprocessing steps (as defined by the repo's handlers) ---
pre = DataPreprocessor(df)     # requires the original DataFrame

pre.convert_numeric()          # converts the target columns to float
pre.impute_missing()           # imputes missing values with the median
pre.detect_outliers()          # removes outliers using IQR (records indices in pre.outliers)
pre.standardize()              # standardizes numeric columns (mean≈0, std≈1)

# Result
df_pre = pre.df.copy()

# --- Quick post-processing quality checks ---
print("=== Post-processing summary ===")
print(f"Rows removed as outliers (IQR): {len(pre.outliers)}")
# Rows may also be dropped in earlier steps, so compare the final shape with the 783 rows loaded in step 4
print(f"Final dimensions: {df_pre.shape[0]} rows × {df_pre.shape[1]} columns")
print("\nRemaining missing values per column (should be 0 for numeric columns):")
display(df_pre.isnull().sum().to_frame("missing").T)

# Check standardization on the numeric columns
num_cols = [
    "relative_compactness","surface_area","wall_area","roof_area","overall_height",
    "orientation","glazing_area","glazing_area_distribution","heating_load","cooling_load","mixed_type_col"
]
check = pd.DataFrame({
    "mean": df_pre[num_cols].mean().round(3),
    "std":  df_pre[num_cols].std(ddof=0).round(3)
})
print("\nStandardization check (mean≈0, std≈1):")
display(check)

# (Optional) Persist a clean intermediate file for traceability/DVC
# from src.handlers.data_loader import DataLoader
# DataLoader().saveDataFrameAsFileWithDVC(df_pre, route="data/cleansed", file_name="energy_efficiency_preprocessed.csv")



Converting numeric values... 

Initializing imputation of values... 

Missing values before imputation: 0 

Missing values after imputation: 0 

Initializing outlier analysis... 

Rows detected as outliers: 0 

Outliers removed 

Standardized numeric columns using StandardScaler for the following columns: ['relative_compactness', 'surface_area', 'wall_area', 'roof_area', 'overall_height', 'orientation', 'glazing_area', 'glazing_area_distribution', 'heating_load', 'cooling_load', 'mixed_type_col'] 

=== Post-processing summary ===
Rows removed as outliers (IQR): 0
Final dimensions: 707 rows × 11 columns

Remaining missing values per column (should be 0 for numeric columns):

| | relative_compactness | surface_area | wall_area | roof_area | overall_height | orientation | glazing_area | glazing_area_distribution | heating_load | cooling_load | mixed_type_col |
|---|---|---|---|---|---|---|---|---|---|---|---|
| missing | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


Standardization check (mean≈0, std≈1):

| | mean | std |
|---|---|---|
| relative_compactness | -0.0 | 1.0 |
| surface_area | 0.0 | 1.0 |
| wall_area | 0.0 | 1.0 |
| roof_area | -0.0 | 1.0 |
| overall_height | -0.0 | 1.0 |
| orientation | -0.0 | 1.0 |
| glazing_area | -0.0 | 1.0 |
| glazing_area_distribution | 0.0 | 1.0 |
| heating_load | -0.0 | 1.0 |
| cooling_load | -0.0 | 1.0 |


**6) Train/Test Split and Experiment Setup**

In [14]:
from sklearn.model_selection import train_test_split
import mlflow

# --- Target and feature definition ---
TARGET = "heating_load"                 # switch to "cooling_load" if needed
# X keeps every other column, including cooling_load (the second target)
X = df_pre.drop(columns=[TARGET])
y = df_pre[TARGET]

# --- Reproducible split ---
RANDOM_STATE = 42
TEST_SIZE = 0.20

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=TEST_SIZE, random_state=RANDOM_STATE
)

print("=== Data split ===")
print(f"Train: {X_train.shape[0]} rows, Test: {X_test.shape[0]} rows")
print(f"Features: {X_train.shape[1]} columns, Target: {TARGET}")

# --- MLflow setup (local ./mlruns folder) ---
mlflow.set_experiment("Energy Efficiency – Ricardo Aguilar")
print("\nMLflow experiment configured: 'Energy Efficiency – Ricardo Aguilar'")


2025/10/30 00:56:58 INFO mlflow.tracking.fluent: Experiment with name 'Energy Efficiency – Ricardo Aguilar' does not exist. Creating a new experiment.


=== Data split ===
Train: 565 rows, Test: 142 rows
Features: 10 columns, Target: heating_load

MLflow experiment configured: 'Energy Efficiency – Ricardo Aguilar'
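
In this notebook the table is standardized in step 5, before the split, and the feature matrix has 10 columns, so it still contains `cooling_load`. A common alternative is to drop both targets from the features and fit the scaler on the training rows only, so that no test-set statistics reach the model. The sketch below illustrates that pattern on synthetic data; the generated values and the `TARGETS` list are assumptions for demonstration and do not reproduce the project's results.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical values, not the project's dataset).
rng = np.random.default_rng(42)
n = 200
demo = pd.DataFrame({
    "surface_area": rng.uniform(500, 800, n),
    "glazing_area": rng.uniform(0.0, 0.4, n),
})
demo["heating_load"] = (
    0.05 * demo["surface_area"] + 20 * demo["glazing_area"] + rng.normal(0, 1, n)
)
demo["cooling_load"] = demo["heating_load"] + rng.normal(0, 0.5, n)

TARGETS = ["heating_load", "cooling_load"]
X = demo.drop(columns=TARGETS)   # drop both targets so neither leaks into the features
y = demo["heating_load"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

# The scaler lives inside the pipeline, so it is fit on the training rows only.
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_train, y_train)
print(f"Test R2 on synthetic data: {pipe.score(X_test, y_test):.3f}")
```

Because the target is left unscaled here, error metrics from this pipeline would be in the target's original units.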


**7) Model Training and MLflow Logging**

In [15]:
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import r2_score, mean_absolute_error
try:
    # scikit-learn ≥ 1.4
    from sklearn.metrics import root_mean_squared_error as rmse_fn
except ImportError:
    # fallback for earlier versions
    from sklearn.metrics import mean_squared_error

    def rmse_fn(y_true, y_pred):
        return mean_squared_error(y_true, y_pred, squared=False)

import mlflow
import mlflow.sklearn
import pandas as pd

# --- Baseline model definitions ---
model_specs = {
    "LinearRegression": LinearRegression(),
    "RandomForest": RandomForestRegressor(
        n_estimators=600, max_depth=12, min_samples_split=4,
        random_state=RANDOM_STATE, n_jobs=-1
    ),
    # Optional (comment out to exclude it from this phase)
    "GradientBoosting": GradientBoostingRegressor(
        n_estimators=300, learning_rate=0.08, max_depth=4, random_state=RANDOM_STATE
    ),
}

print("\n=== Training and logging ===")
summary_rows = []

for name, model in model_specs.items():
    # Train
    model.fit(X_train, y_train)
    preds = model.predict(X_test)

    # Metrics (heating_load was standardized in step 5, so MAE and RMSE are in standard-deviation units)
    r2   = r2_score(y_test, preds)
    mae  = mean_absolute_error(y_test, preds)
    rmse = rmse_fn(y_test, preds)

    # Log to MLflow
    with mlflow.start_run(run_name=f"{name}_{TARGET}"):
        mlflow.log_param("algorithm", name)
        mlflow.log_param("target", TARGET)
        mlflow.log_param("test_size", TEST_SIZE)
        mlflow.log_param("random_state", RANDOM_STATE)
        mlflow.log_param("n_features", X_train.shape[1])

        # Relevant hyperparameters, when the model defines them
        for hp in ("n_estimators", "max_depth", "learning_rate"):
            value = getattr(model, hp, None)
            if value is not None:
                mlflow.log_param(hp, value)

        # Metrics
        mlflow.log_metric("r2", float(r2))
        mlflow.log_metric("mae", float(mae))
        mlflow.log_metric("rmse", float(rmse))

        # Model
        mlflow.sklearn.log_model(model, artifact_path="model")

    print(f"{name:<17} | R2={r2:0.3f} | MAE={mae:0.3f} | RMSE={rmse:0.3f}")
    summary_rows.append((name, r2, mae, rmse))

# Summary sorted by R2 (descending)
results_df = pd.DataFrame(summary_rows, columns=["Model", "R2", "MAE", "RMSE"]).sort_values("R2", ascending=False)
display(results_df)



=== Training and logging ===

LinearRegression  | R2=0.972 | MAE=0.105 | RMSE=0.154

RandomForest      | R2=0.988 | MAE=0.052 | RMSE=0.098

GradientBoosting  | R2=0.991 | MAE=0.054 | RMSE=0.089


| | Model | R2 | MAE | RMSE |
|---|---|---|---|---|
| 2 | GradientBoosting | 0.990538 | 0.054077 | 0.088769 |
| 1 | RandomForest | 0.988415 | 0.051604 | 0.098223 |
| 0 | LinearRegression | 0.971513 | 0.105429 | 0.154028 |
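
Since `heating_load` was standardized in step 5, the MAE and RMSE values above are expressed in standard deviations of the target rather than in its original units. R² does not depend on scale, while MAE and RMSE scale linearly with the target's standard deviation. The sketch below computes the three metrics from their definitions with NumPy and shows the conversion. The values of `mu_y` and `sigma_y` are hypothetical placeholders, since the notebook does not print the pre-scaling statistics of the target.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return R2, MAE and RMSE computed directly from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    ss_res = np.sum(residuals ** 2)                  # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mae": np.mean(np.abs(residuals)),
        "rmse": np.sqrt(np.mean(residuals ** 2)),
    }

# Hypothetical target statistics in original units (placeholders, not from the dataset).
mu_y, sigma_y = 22.0, 10.0

# Toy values in original units, then standardized like the target column.
y_true_raw = np.array([12.0, 18.0, 25.0, 33.0])
y_pred_raw = np.array([13.0, 17.0, 27.0, 31.0])
y_true_std = (y_true_raw - mu_y) / sigma_y
y_pred_std = (y_pred_raw - mu_y) / sigma_y

raw = regression_metrics(y_true_raw, y_pred_raw)
std = regression_metrics(y_true_std, y_pred_std)

print("original units:", raw)
print("standardized:  ", std)
print("MAE back in original units:", sigma_y * std["mae"])
```

With the real standard deviation of `heating_load` (taken before scaling) in place of `sigma_y`, the same multiplication would express the reported MAE and RMSE in the target's original units.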
