Waterpumps Project: Report and Training (dense HGB), Best Attempt

Goal: predict the status of each waterpoint (functional / non functional / functional needs repair).
Constraints: no APIs or external data.
Metric: Macro-F1 (the unweighted mean of the per-class F1 scores).
Approach summary:

Temporal and geographic feature engineering (includes KMeans over longitude/latitude/height).

Dense preprocessing: numeric columns get imputation + scaling; categorical columns get TopN + One-Hot.

Model: HistGradientBoostingClassifier with early stopping.

Validation: stratified 80/20 holdout.

Generates submission_hgb_dense.csv.

0) Setup

In [2]:
# =========================
# 0) Setup and paths
# =========================

# --- Basic setup ---
import os
os.environ["LOKY_MAX_CPU_COUNT"] = "1"  # caps the CPU count used by loky (joblib); avoids odd loky behaviour on Windows

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.cluster import KMeans
from sklearn.metrics import f1_score, confusion_matrix, classification_report
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.class_weight import compute_class_weight

# --- Config ---
SEED = 42
ID_COL = "id"
TARGET_COL = "status_group"

# Paths
TRAIN_CSV  = r"C:\Users\judit\waterpumps\train_features.csv"
LABELS_CSV = r"C:\Users\judit\waterpumps\labels.csv"           # change if your file name differs
TEST_CSV   = r"C:\Users\judit\waterpumps\test_features.csv"

SUBM_DIR = r"C:\Users\judit\waterpumps\submission"
SUBM_OUT = os.path.join(SUBM_DIR, "submission_hgb_dense.csv")
os.makedirs(SUBM_DIR, exist_ok=True)

print("Files found:")
for p in [TRAIN_CSV, LABELS_CSV, TEST_CSV]:
    print(os.path.exists(p), "->", p)

Files found:
True -> C:\Users\judit\waterpumps\train_features.csv
True -> C:\Users\judit\waterpumps\labels.csv
True -> C:\Users\judit\waterpumps\test_features.csv


1) Loading, joining with labels, and quick check

In [3]:
# =========================
# 1) LOAD AND MERGE
# =========================
# Load
trainX = pd.read_csv(TRAIN_CSV)
labels = pd.read_csv(LABELS_CSV)
testX  = pd.read_csv(TEST_CSV)

# Keep only the columns needed from labels
labels = labels[[c for c in labels.columns if c in {ID_COL, TARGET_COL}]]

# Merge train + labels (inner join on the id column)
df = trainX.merge(labels, on=ID_COL, how="inner")
X = df.drop(columns=[TARGET_COL])
y = df[TARGET_COL]

print(f"Train: {X.shape} | Test: {testX.shape}")
print("Class distribution (train):", y.value_counts(normalize=True).round(3).to_dict())

Train: (59400, 40) | Test: (14850, 40)
Class distribution (train): {'functional': 0.543, 'non functional': 0.384, 'functional needs repair': 0.073}
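
An inner join drops any feature row whose id has no label and duplicates rows if an id repeats, and neither case raises an error by default. The following is a minimal sketch of how the merge could be guarded. The frames are toy data; only the column names `id` and `status_group` come from the notebook.

```python
import pandas as pd

# Toy stand-ins for train_features.csv and labels.csv (values are illustrative only)
features = pd.DataFrame({"id": [1, 2, 3], "gps_height": [10, 0, 250]})
labels = pd.DataFrame({
    "id": [1, 2, 3],
    "status_group": ["functional", "non functional", "functional"],
})

# validate="one_to_one" raises pandas.errors.MergeError if an id repeats on either side
merged = features.merge(labels, on="id", how="inner", validate="one_to_one")

# If the inner join kept every feature row, every id had a label
rows_dropped = len(features) - len(merged)
print("rows dropped by the join:", rows_dropped)
```

In the notebook the train shape after the merge is (59400, 40), so comparing it with `trainX.shape` would give the same kind of check on the real data.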


2) Feature engineering (temporal, age, geographic, logs/ratios)

In [4]:
# =========================
# 2) FEATURE ENGINEERING
# =========================
def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()

    # --- Temporal features from date_recorded ---
    dr = pd.to_datetime(out.get("date_recorded"), errors="coerce")
    out["rec_year"] = dr.dt.year
    out["rec_month"] = dr.dt.month
    out["rec_dayofweek"] = dr.dt.dayofweek

    # --- Well age and bins ---
    # A construction_year of 0 is treated as missing; the fallback Series is float NaN
    cy = out.get("construction_year", pd.Series(np.nan, index=out.index)).replace({0: np.nan})
    out["well_age"] = out["rec_year"] - cy
    out.loc[out["well_age"] < 0, "well_age"] = np.nan
    # NOTE: the first bin (-1, 0] holds wells recorded in their construction year (age 0),
    # despite its "unknown" label; rows with a missing age stay NaN here.
    out["well_age_bin"] = pd.cut(
        out["well_age"],
        bins=[-1, 0, 5, 10, 20, 40, 70, 120],
        labels=["unknown","0-5","5-10","10-20","20-40","40-70","70+"]
    )

    # --- Minimal geo columns and flags ---
    for col in ["longitude", "latitude", "gps_height"]:
        if col not in out.columns:
            out[col] = np.nan

    # 1 if any of the three geo fields is zero or missing
    out["gps_is_zero"] = (
        (out["longitude"].fillna(0)==0) |
        (out["latitude"].fillna(0)==0) |
        (out["gps_height"].fillna(0)==0)
    ).astype(int)

    # --- Logs and ratio ---
    for c in ["amount_tsh","population"]:
        if c not in out.columns:
            out[c] = np.nan
    out["log_population"] = np.log1p(out["population"].clip(lower=0))
    out["log_amount_tsh"] = np.log1p(out["amount_tsh"].clip(lower=0))
    # +1 in the denominator avoids division by zero when population is 0 or missing
    out["tsh_per_capita"] = out["amount_tsh"] / (out["population"].fillna(0) + 1)

    # --- Coarse geo bins (0.1-degree cells) ---
    out["lat_bin"] = (out["latitude"] * 10).round(0)
    out["lon_bin"] = (out["longitude"] * 10).round(0)

    out["height_is_zero"] = (out["gps_height"].fillna(0) == 0).astype(int)
    return out
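
The well-age bins are right-closed intervals, which decides where boundary ages land. The sketch below applies the same edges and labels to a few toy ages (not taken from the dataset) to show that an age of 0 receives the label "unknown" while a missing age stays missing.

```python
import numpy as np
import pandas as pd

# Same bin edges and labels as well_age_bin in add_engineered_features
bins = [-1, 0, 5, 10, 20, 40, 70, 120]
names = ["unknown", "0-5", "5-10", "10-20", "20-40", "40-70", "70+"]

ages = pd.Series([0, 3, 5, 6, np.nan])  # toy ages for illustration
binned = pd.cut(ages, bins=bins, labels=names)

# Intervals are right-closed: (-1, 0], (0, 5], (5, 10], ...
result = [None if pd.isna(v) else v for v in binned]
print(result)
```

Because missing ages remain NaN after binning, they are filled later by the categorical imputer rather than kept as a category of their own.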


3) Geographic KMeans (train+test) for regional patterns

In [5]:
# Concatenate only the geo columns for KMeans
_geo_concat = pd.concat([
    X[["longitude","latitude","gps_height"]],
    testX[["longitude","latitude","gps_height"]]
], axis=0, ignore_index=True)

# Simple median imputation for KMeans
_geo = _geo_concat.copy()
for c in ["longitude","latitude","gps_height"]:
    _geo[c] = _geo[c].fillna(_geo[c].median())

K = 35  # values in the 25-40 range tend to work well
kmeans = KMeans(n_clusters=K, random_state=SEED, n_init=10)
_geo_labels = kmeans.fit_predict(_geo)

# Write the cluster label into train and test (train rows come first in the concat)
X = X.copy()
testX = testX.copy()
X["_geo_cluster"] = _geo_labels[:len(X)]
testX["_geo_cluster"] = _geo_labels[len(X):]

# Apply the main feature engineering
X_fe = add_engineered_features(X)
test_fe = add_engineered_features(testX)
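
The clustering above runs on raw longitude, latitude, and gps_height, and it is fitted on train and test together. Because gps_height is likely on a much larger numeric scale than coordinates in degrees, it can dominate the Euclidean distances KMeans relies on. The sketch below shows one possible variant, not the notebook's method: standardize the columns first, fit on train only, and assign test rows to the learned centroids. The data and value ranges are synthetic assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

def fake_geo(n):
    # Synthetic longitude, latitude, gps_height columns (illustrative ranges only)
    return np.column_stack([
        rng.uniform(30, 40, n),
        rng.uniform(-11, -1, n),
        rng.uniform(0, 2500, n),
    ])

geo_train, geo_test = fake_geo(500), fake_geo(100)

# Scale first so that no single column dominates the distance computation
geo_clusterer = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=5, random_state=42, n_init=10),
)
train_clusters = geo_clusterer.fit_predict(geo_train)  # fit on train only
test_clusters = geo_clusterer.predict(geo_test)        # assign test rows to train centroids
```

Whether this variant helps the Macro-F1 would have to be measured on the holdout; it is shown only to make the scaling and fitting choices explicit.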

4) Column inference (num/cat) and noise cleanup

In [6]:
concat = pd.concat([
    X_fe.drop(columns=[ID_COL], errors="ignore"),
    test_fe.drop(columns=[ID_COL], errors="ignore")
], axis=0, ignore_index=True)

num_cols_all = [c for c in concat.columns if pd.api.types.is_numeric_dtype(concat[c])]
cat_cols_all = [c for c in concat.columns if not pd.api.types.is_numeric_dtype(concat[c])]

# Exclude very noisy free-text columns
drop_high_noise = {"wpt_name","scheme_name","recorded_by","subvillage"}
cat_cols_all = [c for c in cat_cols_all if c not in drop_high_noise]

# Cardinality split: up to 50 distinct values is "low", more is "high"
card = concat[cat_cols_all].nunique(dropna=True)
low_cat_cols  = [c for c in cat_cols_all if card[c] <= 50]
high_cat_cols = [c for c in cat_cols_all if card[c] >  50]

num_cols = [c for c in num_cols_all if c != ID_COL]

print(f"Numeric: {len(num_cols)} | Cats_low: {len(low_cat_cols)} | Cats_high: {len(high_cat_cols)}")

Numeric: 21 | Cats_low: 22 | Cats_high: 5
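
The split above is purely dtype- and cardinality-based, so any non-numeric column left in the frame (for example a raw date string) is routed to one of the categorical groups. Here is a small sketch of the same rule on a toy frame; the values and the threshold of 3 are hypothetical, while the notebook uses 50.

```python
import pandas as pd

toy = pd.DataFrame({
    "basin": ["a", "b", "a", "c"],          # 3 distinct values
    "funder": ["w", "x", "y", "z"],         # 4 distinct values
    "amount_tsh": [0.0, 50.0, 0.0, 20.0],   # numeric, ignored by the split
})
THRESHOLD = 3  # hypothetical cutoff for this toy frame

cat_cols = [c for c in toy.columns if not pd.api.types.is_numeric_dtype(toy[c])]
card = toy[cat_cols].nunique(dropna=True)
low = [c for c in cat_cols if card[c] <= THRESHOLD]
high = [c for c in cat_cols if card[c] > THRESHOLD]
print(low, high)
```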


5) TopN reducer for high-cardinality categoricals + dense preprocessor

In [7]:
class TopNCategoryReducer(BaseEstimator, TransformerMixin):
    """
    Keeps the Top-N most frequent categories in each column; everything else -> 'other'.
    Returns a DataFrame so it plays well with OneHotEncoder.
    """
    def __init__(self, top_n=40, min_count=30):
        # Only hyperparameters are set here; fitted state is created in fit()
        self.top_n = top_n
        self.min_count = min_count

    @staticmethod
    def _as_key(col):
        # Normalize values to strings, mapping missing values to "NA",
        # so fit() and transform() compare the same representation
        col = col.astype("object")
        return col.where(col.notna(), "NA").astype(str)

    def fit(self, X, y=None):
        X = pd.DataFrame(X)
        self.columns_ = list(X.columns)
        self.keepers_ = {}
        for c in self.columns_:
            counts = self._as_key(X[c]).value_counts()
            # value_counts sorts by frequency, so the first top_n entries are the most common
            keep = counts[counts >= self.min_count].index[: self.top_n]
            self.keepers_[c] = set(keep)
        return self

    def transform(self, X):
        X = pd.DataFrame(X, columns=self.columns_).copy()
        for c in self.columns_:
            vals = self._as_key(X[c])
            X[c] = vals.where(vals.isin(self.keepers_[c]), other="other")
        return X

# Per-type pipelines
num_pipe = Pipeline(steps=[
    ("imp", SimpleImputer(strategy="median")),
    ("scale", StandardScaler())  # dense output
])

low_cat_pipe = Pipeline(steps=[
    ("imp", SimpleImputer(strategy="most_frequent")),
    ("ohe", OneHotEncoder(handle_unknown="ignore", sparse_output=False))
])

high_cat_pipe = Pipeline(steps=[
    ("imp", SimpleImputer(strategy="most_frequent")),
    ("topn", TopNCategoryReducer(top_n=40, min_count=30)),
    ("ohe", OneHotEncoder(handle_unknown="ignore", sparse_output=False))
])

transformers = []
if num_cols:      transformers.append(("num",     num_pipe,      num_cols))
if low_cat_cols:  transformers.append(("lowcat",  low_cat_pipe,  low_cat_cols))
if high_cat_cols: transformers.append(("highcat", high_cat_pipe, high_cat_cols))

pre_dense = ColumnTransformer(transformers=transformers, remainder="drop", sparse_threshold=0)
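
The Top-N reduction caps how many one-hot columns a high-cardinality feature can produce. The sketch below reproduces the core rule with plain pandas on a toy column. The values and the tiny settings (top 2, minimum count 2) are hypothetical; the notebook uses 40 and 30.

```python
import pandas as pd

funder = pd.Series(["gov", "gov", "gov", "ngo", "ngo", "church", "private"])  # toy values

TOP_N, MIN_COUNT = 2, 2  # tiny hypothetical settings for the example
counts = funder.value_counts()  # sorted from most to least frequent
keepers = set(counts[counts >= MIN_COUNT].index[:TOP_N])

# Anything outside the kept set collapses into a single "other" category
reduced = funder.where(funder.isin(keepers), other="other")
print(reduced.tolist())
```

With the notebook's settings, each high-cardinality column contributes at most 41 one-hot columns (40 kept categories plus "other").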

6) Model: HistGradientBoostingClassifier (with early stopping)

In [8]:
clf_hgb = HistGradientBoostingClassifier(
    learning_rate=0.12,
    max_depth=12,
    max_iter=400,
    early_stopping=True,
    validation_fraction=0.1,
    n_iter_no_change=30,
    random_state=SEED,
    l2_regularization=1e-3,
    max_bins=255
)

model = Pipeline(steps=[
    ("pre", pre_dense),
    ("clf", clf_hgb)
])

7) Holdout validation (stratified 80/20) and class weights

In [9]:
X_tr, X_va, y_tr, y_va = train_test_split(
    X_fe, y, test_size=0.2, stratify=y, random_state=SEED
)

classes_ = np.unique(y_tr)
cw = compute_class_weight(class_weight="balanced", classes=classes_, y=y_tr)
cw_map = {cls: w for cls, w in zip(classes_, cw)}
sw_tr = y_tr.map(cw_map).values

# Fit with early stopping (HGB holds out validation_fraction of the training data internally)
model.fit(X_tr, y_tr, clf__sample_weight=sw_tr)

pred_va = model.predict(X_va)
macro_f1 = f1_score(y_va, pred_va, average="macro")
print(f"Macro-F1 (holdout): {macro_f1:.4f}")

print("\nClassification report (holdout):")
print(classification_report(y_va, pred_va, digits=4))

labels_sorted = sorted(y.unique())
cm = confusion_matrix(y_va, pred_va, labels=labels_sorted)
pd.DataFrame(cm, index=labels_sorted, columns=labels_sorted)

Macro-F1 (holdout): 0.6560

Classification report (holdout):
                         precision    recall  f1-score   support

             functional     0.8338    0.7316    0.7793      6452
functional needs repair     0.2836    0.7022    0.4040       863
         non functional     0.8310    0.7430    0.7845      4565

               accuracy                         0.7338     11880
              macro avg     0.6494    0.7256    0.6560     11880
           weighted avg     0.7927    0.7338    0.7541     11880



                         functional  functional needs repair  non functional
functional                     4720                     1123             609
functional needs repair         176                      606              81
non functional                  765                      408            3392
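
The reported Macro-F1 can be recomputed directly from the confusion matrix, which makes the definition of the metric concrete. The sketch below uses the holdout counts printed above, reading rows as true classes and columns as predicted classes (the scikit-learn convention for `confusion_matrix`).

```python
import numpy as np

# Holdout confusion matrix from the notebook (rows: true class, columns: predicted class)
cm = np.array([
    [4720, 1123,  609],   # functional
    [ 176,  606,   81],   # functional needs repair
    [ 765,  408, 3392],   # non functional
])

tp = np.diag(cm)
precision = tp / cm.sum(axis=0)   # column sums = predicted counts per class
recall = tp / cm.sum(axis=1)      # row sums = true counts per class
f1 = 2 * precision * recall / (precision + recall)
macro_f1 = f1.mean()              # unweighted mean over the three classes

print(np.round(f1, 4), round(float(macro_f1), 4))
```

The result matches the 0.6560 in the classification report. The minority class pulls the average down because its F1 of about 0.40 counts as much as the two larger classes.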


8) Final training on the full train set and submission generation

In [10]:
# Recompute sample_weight on the full train set
classes_all = np.unique(y)
cw_all = compute_class_weight(class_weight="balanced", classes=classes_all, y=y)
cw_map_all = {cls: w for cls, w in zip(classes_all, cw_all)}
sw_all = y.map(cw_map_all).values

# Final fit
model.fit(X_fe, y, clf__sample_weight=sw_all)

# Predict on test and save the submission
preds = model.predict(test_fe)
submission = pd.DataFrame({ID_COL: testX[ID_COL], TARGET_COL: preds})
submission.to_csv(SUBM_OUT, index=False, encoding="utf-8")

print("✅ Submission saved to:", SUBM_OUT)
submission.head()

✅ Submission saved to: C:\Users\judit\waterpumps\submission\submission_hgb_dense.csv


      id             status_group
0  50785               functional
1  51630  functional needs repair
2  17168               functional
3  45559           non functional
4  49871               functional
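
Before uploading, it can be worth checking that the file has the expected columns, one row per test id, and only valid labels. The helper below is a hypothetical sketch and not part of the notebook; it is exercised on the five rows shown above.

```python
import pandas as pd

VALID = {"functional", "non functional", "functional needs repair"}

def check_submission(sub: pd.DataFrame, expected_ids) -> None:
    # Hypothetical helper: raises AssertionError on the first problem found
    assert list(sub.columns) == ["id", "status_group"], "unexpected columns"
    assert sub["id"].is_unique, "duplicate ids"
    assert set(sub["id"]) == set(expected_ids), "ids do not match the test set"
    assert set(sub["status_group"]) <= VALID, "unknown label"

# First rows of the submission shown above
demo = pd.DataFrame({
    "id": [50785, 51630, 17168, 45559, 49871],
    "status_group": ["functional", "functional needs repair", "functional",
                     "non functional", "functional"],
})
check_submission(demo, expected_ids=demo["id"])
print("submission looks well-formed")
```

On the real data the call would presumably be `check_submission(submission, testX[ID_COL])`.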


9) Notes for the report

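Metric: Macro-F1, which weights all three classes equally and is therefore appropriate for imbalanced classes.

Imbalance: per-row sample_weight derived from compute_class_weight(class_weight='balanced'), i.e. inversely proportional to class frequency. On the holdout, 'functional needs repair' reaches a recall of 0.7022 but a precision of only 0.2836.

FE: temporal variables, well age, GPS flags, geographic bins, log1p transforms, the tsh_per_capita ratio, and a geographic KMeans cluster (K=35).

Preprocessing: numeric (median + scaling); low-cardinality categorical (OHE); high-cardinality categorical (TopN→OHE). Very noisy free-text columns are excluded (wpt_name, scheme_name, recorded_by, subvillage).

Model: HGB with early stopping; it takes the dense feature matrix directly and captures non-linearities. Holdout Macro-F1: 0.6560.

Reproducibility: SEED = 42 is passed as random_state to KMeans, train_test_split, and the classifier. Paths are hard-coded to C:\Users\judit\waterpumps\... and must be edited to run elsewhere.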
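
The class weighting mentioned in the notes follows a simple formula: each class gets n_samples / (n_classes * class_count). The sketch below checks that formula against `compute_class_weight` on toy labels (a 6 / 3 / 1 split, not the real class counts) and then maps the weights to rows the same way the notebook builds `sw_tr` and `sw_all`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels with a 6 / 3 / 1 split (not the real class counts)
y_demo = np.array(["a"] * 6 + ["b"] * 3 + ["c"] * 1)
classes = np.unique(y_demo)

weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)

# "balanced" means n_samples / (n_classes * count_per_class)
_, counts = np.unique(y_demo, return_counts=True)
manual = len(y_demo) / (len(classes) * counts)

# Map each row to the weight of its class
weight_by_class = dict(zip(classes, weights))
sample_weight = np.array([weight_by_class[label] for label in y_demo])

# After weighting, every class carries the same total weight
totals = {c: float(sample_weight[y_demo == c].sum()) for c in classes}
print(totals)
```

Equal total weight per class is what pushes the model toward the minority class, which is consistent with the high recall and low precision observed for 'functional needs repair'.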