
# TP3: Regression Pipeline (Real Estate Prices) · Scikit-learn

**Goal:** predict the price in USD (`prp_pre_dol`) from property attributes (numeric, categorical, and geographic).  
**Problem type:** regression.  
**Primary metric:** RMSE (on the original USD scale); MAE and R² are also reported.
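
As a quick illustration of how the three metrics relate, here is a minimal sketch on made-up prices (the numbers are invented for the example and are not results from this dataset). RMSE penalizes the single large miss more heavily than MAE does.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy values in USD, invented purely for illustration.
y_true = np.array([100_000.0, 150_000.0, 200_000.0, 250_000.0])
y_hat = np.array([110_000.0, 140_000.0, 210_000.0, 230_000.0])

rmse = float(np.sqrt(mean_squared_error(y_true, y_hat)))  # root of the mean squared error
mae = float(mean_absolute_error(y_true, y_hat))           # mean absolute error
r2 = float(r2_score(y_true, y_hat))                       # share of variance explained

print(f"RMSE={rmse:,.2f}  MAE={mae:,.2f}  R2={r2:.3f}")
```

The errors are 10k, 10k, 10k and 20k, so MAE is 12,500 while RMSE is slightly higher because the 20k error is squared before averaging.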

**Assignment checklist:**
- ✅ Predictive goal defined (regression on `prp_pre_dol`).
- ✅ Dataset split (80/20 hold-out with a temporal option + appropriate CV).
- ✅ Complete pipeline (preprocessing, modeling, evaluation) with `ColumnTransformer` and `Pipeline`.
- ✅ Comparison of at least two models (RandomForest and HistGradientBoosting).
- ✅ Hyperparameter tuning of the best model with **RandomizedSearchCV** on a reduced budget (small `n_iter` and `cv`).
- ✅ Suitable metrics: RMSE, MAE, R²; over/underfitting and residual diagnostics.
- ✅ Clear explanation of each decision (justifications below each block).



## 1) Configuration and imports

> Set `DATA_PATH` to the CSV you are using. This notebook assumes the target column is `prp_pre_dol`.  
> If your dataset uses a different target column, change `TARGET_COL`.


In [None]:
# --- Main configuration ---
from pathlib import Path

ROOT_DIR = Path.cwd()
DATA_PATH = ROOT_DIR / "include" / "data" / "processed" / "propiedades_clean.csv"
TARGET_COL = "prp_pre_dol"   # Target column

# If your dataset has a date/time column for the temporal split, adjust it here
CANDIDATE_DATE_COLS = ["fecha_publicacion", "fecha", "alta", "publicado", "created_at"]

# Search parameters (kept small for speed)
RANDOM_STATE = 42
N_JOBS = -1
CV_FOLDS_DEFAULT = 3         # 3 folds to reduce runtime
N_ITER_SEARCH = 10           # random trials per model

import os, math, warnings, csv
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, TimeSeriesSplit, cross_validate
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler, FunctionTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score, make_scorer
# HistGradientBoostingRegressor is stable in recent scikit-learn releases, so the
# sklearn.experimental enabling import is no longer needed.
from sklearn.ensemble import HistGradientBoostingRegressor, RandomForestRegressor
from sklearn.compose import TransformedTargetRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans




## 2) Data loading and basic cleaning

- Separator detection with `csv.Sniffer`, falling back to a `;` vs `,` count; the encoding is fixed to `utf-8-sig`.
- Conversion of numbers stored as text (decimal commas, thousands dots) where needed.
- Target sanitization (finite and `> 0`).
- **Optional business filter** (enabled by default through `APLICAR_FILTRO_400K`): restrict to ≤ 400k USD to focus on the mid-range segment.
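
The text-to-number rule can be checked on a handful of strings before running it on the whole dataset. The sketch below re-implements the same rule in a standalone helper; the name `parse_decimal_comma` and the sample values are hypothetical, chosen only for this illustration.

```python
import pandas as pd

def parse_decimal_comma(series: pd.Series) -> pd.Series:
    """Hypothetical standalone helper: treat ',' as the decimal mark when present."""
    s = series.astype(str).str.strip()
    has_comma = s.str.contains(",", regex=False)
    # Only when a comma is present are dots assumed to be thousands separators.
    s = s.where(~has_comma, s.str.replace(".", "", regex=False))
    s = s.str.replace(",", ".", regex=False)
    return pd.to_numeric(s, errors="coerce")

raw = pd.Series(["1.234,56", "980,5", "1200.75", "1.234", "n/a"])
parsed = parse_decimal_comma(raw)
print(parsed.tolist())
```

Note the ambiguous case: `"1.234"` has no comma, so it is read as 1.234 rather than 1234. If the source data uses dots as thousands separators without decimals, that column would need a dedicated rule.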


In [None]:

NUMERIC_PREFIXES = (
    "sup",
    "m2",
    "metros",
    "amb",
    "dorm",
    "ban",
    "coch",
    "fotos",
    "lat",
    "lng",
    "long",
    "precio",
    "pre_",
    "prp_pre",
    "prp_lat",
    "prp_lng",
)

def coerce_numeric(series: pd.Series) -> pd.Series:
    """Convert strings such as '1.234,56' to float; leave NaN when conversion fails."""
    if series.dtype.kind in "biufc":
        return series
    s = series.astype(str).str.strip()
    mask_comma = s.str.contains(",", regex=False)
    s = s.where(~mask_comma, s.str.replace(".", "", regex=False))
    s = s.str.replace(",", ".", regex=False)
    return pd.to_numeric(s, errors="coerce")


def detect_separator(path: str, default: str = ",", sample_rows: int = 5) -> str:
    """Try to infer the main separator of the CSV."""
    sample = ""
    try:
        with open(path, "r", encoding="utf-8-sig", errors="ignore") as fh:
            sample = "".join([fh.readline() for _ in range(sample_rows)])
    except FileNotFoundError:
        return default
    except Exception:
        sample = ""
    if sample:
        try:
            return csv.Sniffer().sniff(sample).delimiter
        except Exception:
            pass
        if sample.count(";") > sample.count(","):
            return ";"
    return default

# Load
assert os.path.exists(DATA_PATH), f"File not found: {DATA_PATH}"
separator = detect_separator(DATA_PATH)
print(f"Loading dataset from {DATA_PATH}")
read_csv_kwargs = dict(
    sep=separator,
    encoding="utf-8-sig",
    engine="python",
    on_bad_lines="skip",
)
df = pd.read_csv(DATA_PATH, **read_csv_kwargs)
print(f"Detected separator: {separator!r}")

# Normalize common numeric columns that sometimes arrive as strings
for c in df.columns:
    c_low = c.lower()
    if any(c_low.startswith(prefix) for prefix in NUMERIC_PREFIXES):
        try:
            df[c] = coerce_numeric(df[c])
        except Exception:
            pass

# Sanitize the target
assert TARGET_COL in df.columns, f"Target column {TARGET_COL} does not exist in the dataset."
y = df[TARGET_COL].copy()
mask_valid = np.isfinite(y) & (y > 0)
df = df.loc[mask_valid].copy()
y = df[TARGET_COL].copy()

# Business filter (optional): focus on listings up to 400k USD
APLICAR_FILTRO_400K = True
if APLICAR_FILTRO_400K:
    mask_400k = y <= 400_000
    df, y = df.loc[mask_400k].copy(), y.loc[mask_400k].copy()

print("Final shape:", df.shape)
display(df.head(3))





## 3) Train/Test split (temporal when possible)

- If a valid date column exists → **temporal 80/20 split** (train = past, test = most recent).  
- Otherwise → random 80/20 `train_test_split`.
- The later **cross-validation** respects time order with `TimeSeriesSplit` when a date column was detected.
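
To make the time-ordered folds concrete, here is a minimal sketch on eight toy rows that are assumed to be already sorted by date. Each validation block comes strictly after its training block, which is what prevents the model from seeing the future during CV.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Eight toy rows, assumed to be sorted from oldest to newest.
X_toy = np.arange(8).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
folds = [(tr.tolist(), va.tolist()) for tr, va in tscv.split(X_toy)]

for i, (tr, va) in enumerate(folds, start=1):
    print(f"fold {i}: train={tr} validate={va}")
```

The training window grows with each fold, so early folds are fit on less data than later ones; this is worth keeping in mind when comparing per-fold scores.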


In [None]:

# Detect a usable date column
date_col = None
for c in CANDIDATE_DATE_COLS:
    if c in df.columns:
        try:
            parsed = pd.to_datetime(df[c], errors="coerce")
        except Exception:
            continue
        if parsed.notna().sum() == 0:
            continue
        df[c] = parsed
        date_col = c
        break

# Define X, y
X = df.drop(columns=[TARGET_COL])

def temporal_split(X: pd.DataFrame, y: pd.Series, date_col: str, test_size: float = 0.2):
    # Stable sort by date; rows with unparseable dates (NaT) sort last and land in test.
    ordered = X[date_col].sort_values(kind="mergesort")
    ordered_idx = ordered.index
    n = len(ordered_idx)
    if n < 2:
        raise ValueError("Not enough rows to perform a temporal split.")
    cut = int(np.floor((1 - test_size) * n))
    cut = min(max(1, cut), n - 1)
    idx_train = ordered_idx[:cut]
    idx_test = ordered_idx[cut:]
    return (
        X.loc[idx_train],
        X.loc[idx_test],
        y.loc[idx_train],
        y.loc[idx_test],
    )

if date_col is not None:
    print(f"Using temporal split on column: {date_col}")
    X_train, X_test, y_train, y_test = temporal_split(X, y, date_col, test_size=0.2)
    X_train = X_train.drop(columns=[date_col])
    X_test = X_test.drop(columns=[date_col])
    cv = TimeSeriesSplit(n_splits=CV_FOLDS_DEFAULT)
else:
    print("Using random 80/20 split (no valid date column).")
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=RANDOM_STATE
    )
    cv = CV_FOLDS_DEFAULT
print(X_train.shape, X_test.shape)




## 4) Feature engineering + preprocessing

**Numeric**  
- **Median** imputation (robust to outliers).  
- **Winsorization** (clipping to the 1st and 99th percentiles learned during `fit`) to stabilize extremes.  
- **Standardization**.

**Categorical**  
- **Most-frequent** imputation.  
- **One-Hot Encoding** with `handle_unknown="ignore"`.

**'Safe' engineering**  
- Logs (`log_sup_cubierta`, `log_sup_total`, `log_fotos`), ratios (`ratio_cubierta`, `ambientes_m2`), and **geo-clusters (KMeans)** when `lat/lng` are present.
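
The key property of winsorization inside a pipeline is that the clipping bounds come from the training data only and are then reused on new rows. A minimal NumPy sketch of that idea follows; it uses the 10th and 90th percentiles on ten toy values so the effect is visible (the notebook itself uses 1 and 99), and all values are invented for illustration.

```python
import numpy as np

# Toy training column with one extreme value.
train = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 1000.0])
# New rows seen at prediction time.
new = np.array([-50.0, 5.0, 2000.0])

# Bounds are learned on train only ...
lo, hi = np.percentile(train, [10, 90])
# ... and applied unchanged to unseen data.
clipped = np.clip(new, lo, hi)

print(lo, hi, clipped)
```

Values inside the bounds pass through untouched, while both tails are pulled to the learned limits (about 1.9 and 108.1 here).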


In [None]:
from sklearn.utils.validation import check_is_fitted

class WinsorizeNumeric(BaseEstimator, TransformerMixin):
    def __init__(self, lower=0.01, upper=0.99):
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        X = pd.DataFrame(X).copy()
        self.perc_ = {}
        for c in X.columns:
            col = pd.to_numeric(X[c], errors='coerce')
            lo = np.nanpercentile(col, self.lower * 100)
            hi = np.nanpercentile(col, self.upper * 100)
            self.perc_[c] = (lo, hi)
        return self

    def transform(self, X):
        check_is_fitted(self, 'perc_')
        X = pd.DataFrame(X).copy()
        for i, c in enumerate(X.columns):
            lo, hi = self.perc_[c]
            col = pd.to_numeric(X[c], errors='coerce')
            col = np.where(col < lo, lo, np.where(col > hi, hi, col))
            X[c] = col
        return X.values

class SafeFeatureEngineer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.kmeans_ = None
        self.geo_cols_ = None
        self.kmeans_cols_ = None

    def fit(self, X, y=None):
        X = X.copy()
        lat_col = next((c for c in ["prp_lat", "lat", "latitude"] if c in X.columns), None)
        lng_col = next((c for c in ["prp_lng", "lng", "long", "longitude"] if c in X.columns), None)
        self.geo_cols_ = (lat_col, lng_col)
        self.kmeans_ = None
        self.kmeans_cols_ = None
        if lat_col and lng_col:
            geo = (
                X[[lat_col, lng_col]]
                .apply(pd.to_numeric, errors='coerce')
                .dropna()
            )
            if len(geo) >= 100:
                self.kmeans_ = KMeans(n_clusters=8, random_state=42)
                self.kmeans_.fit(geo.to_numpy())
                self.kmeans_cols_ = [lat_col, lng_col]
        return self

    def transform(self, X):
        X = X.copy()
        for col in ["sup_cubierta", "sup_total", "fotos"]:
            if col in X.columns:
                X[f"log_{col}"] = np.log1p(pd.to_numeric(X[col], errors='coerce'))

        if "sup_cubierta" in X.columns and "sup_total" in X.columns:
            num = pd.to_numeric(X["sup_cubierta"], errors='coerce')
            den = pd.to_numeric(X["sup_total"], errors='coerce').replace(0, np.nan)
            X["ratio_cubierta"] = (num / den).fillna(0)

        amb_cols = [c for c in X.columns if c.lower().startswith(("amb", "dorm", "habit", "ba"))]
        if "sup_total" in X.columns and len(amb_cols) > 0:
            den = pd.to_numeric(X["sup_total"], errors='coerce').replace(0, np.nan)
            X["ambientes_m2"] = (
                pd.DataFrame({c: pd.to_numeric(X[c], errors='coerce') for c in amb_cols}).sum(axis=1) / den
            ).fillna(0)

        if self.kmeans_ is not None and self.kmeans_cols_:
            missing_cols = [c for c in self.kmeans_cols_ if c not in X.columns]
            if not missing_cols:
                geo = pd.DataFrame({c: pd.to_numeric(X[c], errors='coerce') for c in self.kmeans_cols_})
                cluster = np.full(len(X), -1, dtype=int)
                lat_col, lng_col = self.kmeans_cols_
                valid = geo[lat_col].notna() & geo[lng_col].notna()
                if valid.any():
                    cluster[valid] = self.kmeans_.predict(geo.loc[valid, self.kmeans_cols_].to_numpy())
                X["geo_cluster"] = cluster

        return X

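# Column selectors are callables, so ColumnTransformer resolves them at fit time on the
# output of SafeFeatureEngineer. Fixed lists built from X_train would silently drop every
# engineered column (log_*, ratios, geo_cluster) because of remainder="drop".
def select_num_cols(df):
    return [c for c in df.columns if pd.api.types.is_numeric_dtype(df[c])]

def select_cat_cols(df):
    return [c for c in df.columns if not pd.api.types.is_numeric_dtype(df[c])]

num_pipe = Pipeline(steps=[
    ("imp", SimpleImputer(strategy="median")),
    ("winsor", WinsorizeNumeric(0.01, 0.99)),
    ("scaler", StandardScaler())
])

cat_pipe = Pipeline(steps=[
    ("imp", SimpleImputer(strategy="most_frequent")),
    ("ohe", OneHotEncoder(handle_unknown="ignore", sparse_output=False))
])

pre = ColumnTransformer(
    transformers=[
        ("num", num_pipe, select_num_cols),
        ("cat", cat_pipe, select_cat_cols),
    ],
    remainder="drop",
    verbose_feature_names_out=False
)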

feat = Pipeline(steps=[
    ("feateng", SafeFeatureEngineer()),
    ("pre", pre)
])




## 5) Baseline models


In [None]:

rf = RandomForestRegressor(
    n_estimators=300,
    random_state=42,
    n_jobs=-1
)
hgb = HistGradientBoostingRegressor(random_state=42)

ttr_rf = TransformedTargetRegressor(regressor=rf, func=np.log1p, inverse_func=np.expm1)
ttr_hgb = TransformedTargetRegressor(regressor=hgb, func=np.log1p, inverse_func=np.expm1)

pipe_rf = Pipeline([("features", feat), ("model", ttr_rf)])
pipe_hgb = Pipeline([("features", feat), ("model", ttr_hgb)])
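
To show what the `TransformedTargetRegressor` wrapper does, here is a minimal sketch on synthetic data. It swaps in `LinearRegression` as the inner model purely to keep the example small and exact; the data and coefficients are invented. The inner model is fit on `log1p(y)`, and predictions come back through `expm1` on the original scale.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 1, size=(200, 1))
# Synthetic target that is exactly linear in log1p space: log1p(y) = 10 + 2 * x.
y_toy = np.expm1(10 + 2 * X_toy[:, 0])

ttr = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log1p, inverse_func=np.expm1
)
ttr.fit(X_toy, y_toy)

pred = ttr.predict(X_toy)            # already back on the original scale
slope = ttr.regressor_.coef_[0]      # slope learned in log space
print(round(float(slope), 6), float(pred[0]), float(y_toy[0]))
```

Because predictions are returned on the original scale, metrics such as RMSE computed on them are expressed in USD, not in log units.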



## 6) Cross-validation benchmark


In [None]:

import math
from sklearn.metrics import make_scorer, mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import cross_validate, TimeSeriesSplit

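# With greater_is_better=False, make_scorer already negates the metric, so the function
# must return the plain (positive) RMSE. Negating it here as well would flip the sign
# twice and make the ascending sort below pick the worst model.
scorers = {
    "rmse": make_scorer(lambda yt, yp: math.sqrt(mean_squared_error(yt, yp)), greater_is_better=False),
    "mae": make_scorer(mean_absolute_error, greater_is_better=False),
    "r2": make_scorer(r2_score),
}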

def run_cv(name, pipe, X=X_train, y=y_train, cv=cv):
    cvres = cross_validate(
        pipe, X, y, scoring=scorers, cv=cv, n_jobs=-1, return_train_score=True
    )
    out = {
        "model": name,
        "rmse_cv_mean": -cvres["test_rmse"].mean(),
        "rmse_cv_std": cvres["test_rmse"].std(),
        "mae_cv_mean": -cvres["test_mae"].mean(),
        "r2_cv_mean": cvres["test_r2"].mean(),
        "rmse_train_mean": -cvres["train_rmse"].mean(),
        "mae_train_mean": -cvres["train_mae"].mean(),
        "r2_train_mean": cvres["train_r2"].mean(),
        "fit_time_mean": cvres["fit_time"].mean(),
    }
    return out

res = []
for name, pipe in [("RandomForestRegressor", pipe_rf), ("HistGradientBoostingRegressor", pipe_hgb)]:
    print(f"CV -> {name}")
    res.append(run_cv(name, pipe))

cv_df = pd.DataFrame(res).sort_values("rmse_cv_mean")
display(cv_df)
best_name = cv_df.iloc[0]["model"]
print("Best by CV RMSE:", best_name)
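
The sign convention of `make_scorer` is easy to get wrong, so here is a minimal check on toy data. It uses a `DummyRegressor` that always predicts the mean, chosen only so the expected RMSE can be worked out by hand; the helper name `rmse` and the values are illustrative.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import make_scorer, mean_squared_error

def rmse(y_true, y_pred):
    """Plain, positive RMSE."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

# greater_is_better=False makes the scorer return -rmse, so "higher is better" still holds.
neg_rmse_scorer = make_scorer(rmse, greater_is_better=False)

X_toy = np.zeros((4, 1))
y_toy = np.array([1.0, 2.0, 3.0, 6.0])
model = DummyRegressor(strategy="mean").fit(X_toy, y_toy)

score = neg_rmse_scorer(model, X_toy, y_toy)
print(score, -score)
```

The mean is 3, the squared errors are 4, 1, 0 and 9, so RMSE is sqrt(3.5) and the scorer returns its negative. Reporting code therefore flips the sign once, as `run_cv` does.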



## 7) Efficient tuning with RandomizedSearchCV


In [None]:

from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV

if best_name == "RandomForestRegressor":
    base = pipe_rf
    param_distributions = {
        'model__regressor__n_estimators': randint(150, 401),
        'model__regressor__max_depth': [10, 12, 15, None],
        'model__regressor__min_samples_split': randint(2, 11),
        'model__regressor__min_samples_leaf': randint(1, 11),
        'model__regressor__max_features': uniform(0.3, 0.7),
    }
else:
    base = pipe_hgb
    param_distributions = {
        'model__regressor__learning_rate': uniform(0.03, 0.20),
        'model__regressor__max_depth': [None, 6, 10],
        'model__regressor__min_samples_leaf': randint(5, 31),
        'model__regressor__l2_regularization': uniform(0.0, 1.0),
        'model__regressor__max_bins': [255],
    }

search = RandomizedSearchCV(
    base,
    param_distributions=param_distributions,
    n_iter=N_ITER_SEARCH,
    cv=cv,
    scoring='neg_root_mean_squared_error',
    n_jobs=-1,
    verbose=2,
    random_state=42,
    refit=True
)
search.fit(X_train, y_train)

def summarize_search(search):
    cols = ["rank_test_score","mean_test_score","std_test_score","mean_fit_time"]
    df = pd.DataFrame(search.cv_results_)[cols + ["params"]].sort_values("rank_test_score")
    df["rmse_cv"] = -df["mean_test_score"]
    return df[["rank_test_score","rmse_cv","std_test_score","mean_fit_time","params"]]

top = summarize_search(search).head(10)
display(top)

best_model = search.best_estimator_
print("Best params:", search.best_params_)
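
A detail worth checking when reading the search spaces: `scipy.stats.uniform(loc, scale)` samples from `[loc, loc + scale]`, not from `[loc, scale]`, and `randint(low, high)` excludes `high`. The sketch below inspects two of the distributions used above.

```python
from scipy.stats import randint, uniform

# uniform(0.3, 0.7) covers [0.3, 1.0]: loc=0.3, scale=0.7.
max_features_dist = uniform(0.3, 0.7)
lo, hi = max_features_dist.support()

# randint(150, 401) draws integers from 150 to 400 inclusive.
n_estimators_dist = randint(150, 401)
samples = n_estimators_dist.rvs(size=1000, random_state=0)

print(lo, hi, samples.min(), samples.max())
```

By the same rule, `uniform(0.03, 0.20)` for the learning rate spans 0.03 to 0.23.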





## 8) Final evaluation on the test set


In [None]:

y_pred = best_model.predict(X_test)

rmse = math.sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
rel_rmse = rmse / y_test.mean()

print(f"RMSE test: {rmse:,.2f}")
print(f"MAE  test: {mae:,.2f}")
print(f"R²   test: {r2:,.4f}")
print(f"Relative RMSE: {100*rel_rmse:,.2f}% of the mean price")



## 9) Diagnostic plots


In [None]:

plt.figure()
plt.scatter(y_test, y_pred, s=12, alpha=0.6)
mn, mx = np.min([y_test.min(), y_pred.min()]), np.max([y_test.max(), y_pred.max()])
plt.plot([mn, mx], [mn, mx])
plt.xlabel("Actual price (USD)")
plt.ylabel("Prediction (USD)")
plt.title("Actual vs. Prediction")
plt.show()

# Separate name so the CV results list `res` from the benchmark is not overwritten.
residuals = y_test - y_pred
plt.figure()
plt.scatter(y_pred, residuals, s=12, alpha=0.6)
plt.axhline(0)
plt.xlabel("Prediction (USD)")
plt.ylabel("Residual (USD)")
plt.title("Residuals vs. Prediction")
plt.show()
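
A numeric companion to the residual plot can help when the scatter is too dense to read. The sketch below bins predictions into quintiles and summarizes residuals per bin. It runs on synthetic data whose noise grows with price by construction, so the widening spread is an artifact of the example, not a finding about this dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
pred = rng.uniform(50_000, 400_000, size=1_000)
# Synthetic "truth": noise standard deviation is 10% of the predicted price.
true = pred + rng.normal(0, 0.1 * pred)

diag = pd.DataFrame({"pred": pred, "resid": true - pred})
diag["bin"] = pd.qcut(diag["pred"], q=5, labels=False)  # 0 = cheapest quintile

by_bin = diag.groupby("bin")["resid"].agg(["mean", "std"])
print(by_bin.round(0))
```

A per-bin mean far from zero would suggest bias in that price range, and a standard deviation that grows across bins would suggest heteroscedastic errors.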



## 10) Conclusions (outline)
- Goal, metric, and problem type.
- Temporal/random split and CV.
- Pipeline (imputation, winsorization, OHE, scaling, feature engineering; TTR with log).
- RF vs HGB comparison; selection by CV RMSE.
- Tuning with RandomizedSearchCV (n_iter=10, cv=3, n_jobs=-1).
- Test results (RMSE/MAE/R² and relative RMSE).
- Over/underfitting diagnosis and next steps.
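
For the over/underfitting point, one simple reading is the gap between training and cross-validated RMSE. The sketch below illustrates it on synthetic data with two decision trees, swapped in here only because they make the contrast obvious; the helper `train_cv_rmse` is hypothetical and the numbers are not from this dataset.

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 1, size=(300, 3))
y_toy = 100 * X_toy[:, 0] + rng.normal(0, 10, size=300)  # signal plus noise

def train_cv_rmse(model):
    """Return (mean train RMSE, mean validation RMSE) over 3 folds."""
    res = cross_validate(
        model, X_toy, y_toy, cv=3,
        scoring="neg_root_mean_squared_error", return_train_score=True,
    )
    return -res["train_score"].mean(), -res["test_score"].mean()

deep_train, deep_cv = train_cv_rmse(DecisionTreeRegressor(random_state=0))
shallow_train, shallow_cv = train_cv_rmse(DecisionTreeRegressor(max_depth=2, random_state=0))

print(f"deep:    train={deep_train:.2f}  cv={deep_cv:.2f}")
print(f"shallow: train={shallow_train:.2f}  cv={shallow_cv:.2f}")
```

The unrestricted tree memorizes the training folds (train RMSE near zero) yet validates much worse, which is the overfitting signature. The same comparison can be read from the `rmse_train_mean` and `rmse_cv_mean` columns produced in the benchmark.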
