# Accident Risk Prediction with a Stacking Ensemble (run in a Kaggle notebook)

### Process Summary

This notebook tackles the problem of predicting traffic accident risk. The main strategy combines **data augmentation with a stacking ensemble** to minimize prediction error (RMSE).

The workflow has five key steps:

1.  **Data Augmentation**: To make training more robust, we double the size of the training set. Because the generator that produced the base data for the competition is known, we create a **synthetic** copy of the original rows and compute a new `accident_risk` for each one from the generator's predefined formula, which encodes the logical relationships between the variables. The real and synthetic rows are then combined.

2.  **Feature Engineering**: New informative variables are derived from the existing ones to improve the models' predictive power. These include:
    * `base_risk`: The noise-free risk computed by the data generator's formula, which makes it a very important feature.
    * `speed_per_lane`: The speed limit divided by the number of lanes.
    * `adverse_conditions`: A flag set to 1 when the weather is not clear or the lighting is not daylight.

3.  **Preprocessing and Pipelines**: Categorical features are encoded with `OneHotEncoder` and numerical features are scaled with `StandardScaler`. The whole process is wrapped in Scikit-learn `Pipeline` objects so that preprocessing is refit inside every cross-validation fold.

4.  **Modeling and Ensembling**:
    * **Base Models**: Three gradient boosting models are trained and evaluated: **XGBoost, LightGBM, and CatBoost**.
    * **Stacking Ensemble**: For the final prediction, a `StackingRegressor` is built. This "model of models" takes the predictions of the three base models as input to a final meta-model (`Ridge`), which learns how best to combine them.

5.  **Evaluation and Final Prediction**: All models, both the base models and the final ensemble, are evaluated with **5-fold cross-validation** to estimate their out-of-sample RMSE. Finally, the stacking ensemble is trained on all the data and used to generate the `submission_ensemble.csv` file for the competition. A report (`rendimiento_modelos.txt`) with the comparative results is also generated.
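
As a quick illustration of the engineered features in step 2, the following minimal sketch computes them for a single hand-made row. The helper name `engineer_row` and the example values are hypothetical; the coefficients are the ones used in the notebook's `feature_engineer` function.

```python
def engineer_row(row):
    """Compute the three engineered features for one road segment (a dict)."""
    base_risk = (
        0.3 * row["curvature"]
        + 0.2 * (row["lighting"] == "night")
        + 0.1 * (row["weather"] != "clear")
        + 0.2 * (row["speed_limit"] >= 60)
        + 0.1 * (row["num_reported_accidents"] > 2)
    )
    return {
        # Rounded only to hide floating-point noise in this illustration.
        "base_risk": round(base_risk, 4),
        "speed_per_lane": row["speed_limit"] / row["num_lanes"],
        "adverse_conditions": int(
            row["weather"] != "clear" or row["lighting"] != "daylight"
        ),
    }


# Hypothetical example: a curvy two-lane road at night, in the rain.
example_row = {
    "curvature": 0.5,
    "lighting": "night",
    "weather": "rainy",
    "speed_limit": 60,
    "num_lanes": 2,
    "num_reported_accidents": 3,
}
features = engineer_row(example_row)
print(features)
# base_risk = 0.15 + 0.2 + 0.1 + 0.2 + 0.1 = 0.75
```

Every term of the formula is active for this row, so `base_risk` reaches 0.75, while `speed_per_lane` is 30.0 and `adverse_conditions` is 1.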

In [None]:
# =========================================================================
# CELL 2: SYNTHETIC DATA GENERATION AND COMBINED DATASET
# =========================================================================
import numpy as np
import pandas as pd

# train_df_raw and test_df_raw are assumed to be loaded in an earlier cell (not shown here).

# --- 1. Original Function for Generating Synthetic Data ---
# The original generator function is kept unchanged.
def generate_synthetic_data(num_rows=10000, seed=42,
                            road_type=None, num_lanes=None, curvature=None,
                            speed_limit=None, lighting=None, weather=None,
                            road_signs_present=None, public_road=None,
                            time_of_day=None, holiday=None, school_season=None,
                            num_reported_accidents=None):
    np.random.seed(seed)
    data = {
        "road_type": road_type if road_type is not None else np.random.choice(["highway", "urban", "rural"], num_rows),
        "num_lanes": num_lanes if num_lanes is not None else np.random.randint(1, 5, num_rows),
        "curvature": curvature if curvature is not None else np.round(np.random.uniform(0.0, 1.0, num_rows), 2),
        "speed_limit": speed_limit if speed_limit is not None else np.random.choice([25, 35, 45, 60, 70], num_rows),
        "lighting": lighting if lighting is not None else np.random.choice(["daylight", "night", "dim"], num_rows),
        "weather": weather if weather is not None else np.random.choice(["clear", "rainy", "foggy"], num_rows),
        "road_signs_present": road_signs_present if road_signs_present is not None else np.random.choice([True, False], num_rows),
        "public_road": public_road if public_road is not None else np.random.choice([True, False], num_rows),
        "time_of_day": time_of_day if time_of_day is not None else np.random.choice(["morning", "evening", "afternoon"], num_rows),
        "holiday": holiday if holiday is not None else np.random.choice([True, False], num_rows),
        "school_season": school_season if school_season is not None else np.random.choice([True, False], num_rows),
        "num_reported_accidents": num_reported_accidents if num_reported_accidents is not None else np.random.poisson(lam=1.5, size=num_rows)
    }
    base_risk = (
        0.3 * np.array(data["curvature"]) +
        0.2 * (np.array(data["lighting"]) == "night").astype(int) +
        0.1 * (np.array(data["weather"]) != "clear").astype(int) +
        0.2 * (np.array(data["speed_limit"]) >= 60).astype(int) +
        0.1 * (np.array(data["num_reported_accidents"]) > 2).astype(int)
    )
    noise = np.random.normal(0, 0.05, num_rows)
    risk_score = np.clip(base_risk + noise, 0, 1)
    data["accident_risk"] = np.round(risk_score, 2)
    return pd.DataFrame(data)

# --- 2. Building the Combined Dataset (Original Logic) ---
# Real feature columns are passed in, so only the noise and the risk label are generated.
print("Generating synthetic data for the training set...")
synthetic_train_df = generate_synthetic_data(
    num_rows=train_df_raw.shape[0], seed=42, road_type=train_df_raw["road_type"],
    num_lanes=train_df_raw["num_lanes"], curvature=train_df_raw["curvature"],
    speed_limit=train_df_raw["speed_limit"], lighting=train_df_raw["lighting"],
    weather=train_df_raw["weather"], road_signs_present=train_df_raw["road_signs_present"],
    public_road=train_df_raw["public_road"], time_of_day=train_df_raw["time_of_day"],
    holiday=train_df_raw["holiday"], school_season=train_df_raw["school_season"],
    num_reported_accidents=train_df_raw["num_reported_accidents"])

train_df_raw['synthetic_risk'] = synthetic_train_df['accident_risk']

train_real_df = train_df_raw.copy()
train_real_df['is_synthetic'] = 0

train_synthetic_df = train_df_raw.drop(columns=['accident_risk']).copy()
train_synthetic_df.rename(columns={'synthetic_risk': 'accident_risk'}, inplace=True)
train_synthetic_df['is_synthetic'] = 1

combined_train_df = pd.concat([train_real_df.drop(columns=['synthetic_risk']), train_synthetic_df], ignore_index=True)
combined_train_df = combined_train_df.sample(frac=1, random_state=42).reset_index(drop=True) 

print(f"✅ Combined training dataset created: {combined_train_df.shape}")

# --- 3. Adding Synthetic Features to the Test Dataset ---
print("\nGenerating synthetic data for the test set...")
synthetic_test_df = generate_synthetic_data(
    num_rows=test_df_raw.shape[0], seed=42, road_type=test_df_raw["road_type"],
    num_lanes=test_df_raw["num_lanes"], curvature=test_df_raw["curvature"],
    speed_limit=test_df_raw["speed_limit"], lighting=test_df_raw["lighting"],
    weather=test_df_raw["weather"], road_signs_present=test_df_raw["road_signs_present"],
    public_road=test_df_raw["public_road"], time_of_day=test_df_raw["time_of_day"],
    holiday=test_df_raw["holiday"], school_season=test_df_raw["school_season"],
    num_reported_accidents=test_df_raw["num_reported_accidents"])

test_df = test_df_raw.copy()
test_df['synthetic_risk'] = synthetic_test_df['accident_risk']
test_df['is_synthetic'] = 0  # Test rows are always "real"

print(f"✅ Synthetic features added to the test dataset: {test_df.shape}")
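
The augmentation logic of this cell can be summarized in a dependency-free sketch: every real row is kept with `is_synthetic = 0`, and a copy with identical features but a formula-generated, noisy, clipped label is added with `is_synthetic = 1`. The `augment` and `toy_risk` helpers and the three example rows are hypothetical, and `toy_risk` uses only two terms of the generator formula to keep the example short.

```python
import random


def augment(rows, risk_fn, noise_sd=0.05, seed=42):
    """Return real rows plus one synthetic copy per row, shuffled together."""
    rng = random.Random(seed)
    real = [{**row, "is_synthetic": 0} for row in rows]
    synthetic = []
    for row in rows:
        # Same features, new label: formula value + Gaussian noise, clipped to [0, 1].
        noisy = risk_fn(row) + rng.gauss(0, noise_sd)
        label = round(min(max(noisy, 0.0), 1.0), 2)
        synthetic.append({**row, "accident_risk": label, "is_synthetic": 1})
    combined = real + synthetic
    rng.shuffle(combined)
    return combined


def toy_risk(row):
    """Reduced, illustrative version of the generator formula (two terms only)."""
    return 0.3 * row["curvature"] + 0.2 * (row["lighting"] == "night")


real_rows = [
    {"curvature": 0.10, "lighting": "daylight", "accident_risk": 0.08},
    {"curvature": 0.80, "lighting": "night", "accident_risk": 0.47},
    {"curvature": 0.50, "lighting": "dim", "accident_risk": 0.21},
]
combined = augment(real_rows, toy_risk)
print(len(real_rows), "->", len(combined))  # the dataset doubles in size
```

Because each feature vector now appears twice, the `is_synthetic` flag is the only input that lets a model tell the two label sources apart.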

In [None]:
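# =========================================================================
# CELL 3: FEATURE ENGINEERING AND BASE PIPELINE DEFINITIONS
# =========================================================================
import xgboost as xgb
import lightgbm as lgb
import catboost as cb
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# --- 1. Function for Creating New Features ---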
def feature_engineer(df):
    df_eng = df.copy()
    df_eng['base_risk'] = (
        0.3 * df_eng["curvature"] +
        0.2 * (df_eng["lighting"] == "night").astype(int) +
        0.1 * (df_eng["weather"] != "clear").astype(int) +
        0.2 * (df_eng["speed_limit"] >= 60).astype(int) +
        0.1 * (df_eng["num_reported_accidents"] > 2).astype(int)
    )
    df_eng['speed_per_lane'] = df_eng['speed_limit'] / df_eng['num_lanes']
    df_eng['adverse_conditions'] = ((df_eng['weather'] != 'clear') | (df_eng['lighting'] != 'daylight')).astype(int)
    return df_eng

print("Applying feature engineering...")
X_full = feature_engineer(combined_train_df.drop(columns=['id', 'accident_risk']))
y_full = combined_train_df['accident_risk']
# 'synthetic_risk' is not a column of the training features, so it is dropped from the test set too.
X_test_full = feature_engineer(test_df.drop(columns=['id', 'synthetic_risk']))
print("✅ Feature engineering completed.")


# --- 2. Final Definition of Feature Types ---
cat_features = X_full.select_dtypes(include=["object", "bool"]).columns.tolist()
num_features = X_full.select_dtypes(include=["int64", "float64", "int32"]).columns.tolist()
print(f"\nCategorical features ({len(cat_features)}): {cat_features}")
print(f"Numerical features ({len(num_features)}): {num_features}")


# --- 3. BASE PIPELINE DEFINITIONS ---
preprocessor = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_features),
        ("num", StandardScaler(), num_features)
    ],
    remainder='passthrough'
)

# Model 1: XGBoost
xgb_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor), 
    ('regressor', xgb.XGBRegressor(
        n_estimators=2000, learning_rate=0.02, max_depth=7, 
        subsample=0.8, colsample_bytree=0.8,
        objective='reg:squarederror', tree_method='hist', device='gpu', random_state=42
    ))
])

# Model 2: LightGBM (fast and general)
# Shallower trees capture broad signal and avoid overfitting to details.
# Note: objective='regression_l1' optimizes absolute error, while the CV metric below is RMSE.
# Note: in LightGBM, subsample only takes effect when subsample_freq > 0 (the default is 0).
lgbm_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', lgb.LGBMRegressor(
        n_estimators=1500, learning_rate=0.03, num_leaves=25, max_depth=5,
        subsample=0.7, colsample_bytree=0.7,
        objective='regression_l1', device='gpu', random_state=42, n_jobs=1, verbose=-1
    ))
])

# Model 3: CatBoost (deep and detailed)
# Deeper trees with a lower learning rate to find complex patterns.
catboost_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', cb.CatBoostRegressor(
        n_estimators=2500, learning_rate=0.015, depth=9, l2_leaf_reg=4,
        task_type='GPU', random_seed=42, verbose=0
    ))
])

print("\n✅ 3 base pipelines defined and ready.")

# --- 4. EVALUATION OF ALL BASE MODELS ---
print("\n--- 📈 Evaluating base models ---")
scores_report = {}

# Evaluate XGBoost (the scorer returns negated RMSE, so the mean is negated back)
xgb_base_scores = cross_val_score(xgb_pipeline, X_full, y_full, cv=5, scoring='neg_root_mean_squared_error', n_jobs=1)
scores_report['xgb_base_rmse'] = -xgb_base_scores.mean()
print(f"✅ XGBoost Base CV Score (RMSE): {scores_report.get('xgb_base_rmse'):.5f}")

# Evaluate LightGBM
lgbm_base_scores = cross_val_score(lgbm_pipeline, X_full, y_full, cv=5, scoring='neg_root_mean_squared_error', n_jobs=1)
scores_report['lgbm_base_rmse'] = -lgbm_base_scores.mean()
print(f"✅ LightGBM Base CV Score (RMSE): {scores_report.get('lgbm_base_rmse'):.5f}")

# Evaluate CatBoost
catboost_base_scores = cross_val_score(catboost_pipeline, X_full, y_full, cv=5, scoring='neg_root_mean_squared_error', n_jobs=1)
scores_report['catboost_base_rmse'] = -catboost_base_scores.mean()
print(f"✅ CatBoost Base CV Score (RMSE): {scores_report.get('catboost_base_rmse'):.5f}")
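
The evaluation above relies on scikit-learn's `neg_root_mean_squared_error` scorer, which reports the negated RMSE of each fold so that larger values are better; that is why the mean is negated before being stored. The sketch below reproduces the idea by hand under simplifying assumptions: contiguous folds (as `cv=5` gives for a regressor, without shuffling) and a trivial model that predicts the training mean. The helper names and target values are hypothetical.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def kfold_rmse(y, k=5):
    """Per-fold RMSE of a mean predictor, using k contiguous folds."""
    n = len(y)
    bounds = [round(i * n / k) for i in range(k + 1)]
    scores = []
    for start, stop in zip(bounds, bounds[1:]):
        train = y[:start] + y[stop:]      # everything outside the fold
        held_out = y[start:stop]          # the fold being scored
        mean_pred = sum(train) / len(train)
        scores.append(rmse(held_out, [mean_pred] * len(held_out)))
    return scores


# Hypothetical risk values, for illustration only.
y = [0.10, 0.35, 0.20, 0.55, 0.40, 0.30, 0.75, 0.25, 0.45, 0.60]
fold_scores = kfold_rmse(y, k=5)

neg_scores = [-s for s in fold_scores]           # what a "neg_*" scorer reports
cv_rmse = -sum(neg_scores) / len(neg_scores)     # negate the mean to recover RMSE
print(f"CV RMSE: {cv_rmse:.5f}")
```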

In [None]:
# =========================================================================
# CELL 4: BUILDING THE STACKING ENSEMBLE
# =========================================================================
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import Ridge

# --- Building the Final Ensemble with 3 base models ---
meta_model = Ridge(random_state=42)
stacking_ensemble = StackingRegressor(
    estimators=[
        ('xgb', xgb_pipeline),
        ('lgbm', lgbm_pipeline),
        ('catboost', catboost_pipeline)
    ],
    final_estimator=meta_model,  # Ridge is the meta-model
    cv=5,  # the meta-model is trained on 5-fold out-of-fold predictions
    n_jobs=1
)
print("\n✅ Stacking ensemble with 3 base models defined and ready.")
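
To make the stacking mechanics concrete, here is a small pure-Python sketch under heavy simplifications. It is not how `StackingRegressor` is implemented: the two base models (a mean predictor and a least-squares line), the interleaved folds, and the one-parameter blend used as the meta-model are all hypothetical stand-ins for the boosted models and `Ridge` above. What it does share with the real ensemble is the key idea that the meta-model is fit on out-of-fold predictions, while the base models used at prediction time are refit on all the data.

```python
import random


def fit_mean(xs, ys):
    """Base model 1: always predicts the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Base model 2: ordinary least-squares line through the data."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)


def out_of_fold(fit, xs, ys, k=5):
    """Predict every row with a model that never saw that row during fitting."""
    preds = [0.0] * len(xs)
    for fold in range(k):
        train_idx = [i for i in range(len(xs)) if i % k != fold]
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        for i in range(fold, len(xs), k):
            preds[i] = model(xs[i])
    return preds


def fit_blend_weight(p1, p2, ys):
    """Toy meta-model: least-squares w in  y ~= w * p1 + (1 - w) * p2."""
    num = sum((a - b) * (y - b) for a, b, y in zip(p1, p2, ys))
    den = sum((a - b) ** 2 for a, b in zip(p1, p2))
    return num / den


# Hypothetical data: risk grows linearly with one feature, plus noise.
rng = random.Random(0)
xs = [rng.random() for _ in range(200)]
ys = [0.3 * x + rng.gauss(0, 0.05) for x in xs]

# Level 1: out-of-fold predictions from each base model.
oof_mean = out_of_fold(fit_mean, xs, ys)
oof_line = out_of_fold(fit_line, xs, ys)

# Level 2: the meta-model learns how much to trust each base model.
w_line = fit_blend_weight(oof_line, oof_mean, ys)

# Final predictor: base models refit on all data, combined with the learned weight.
final_mean, final_line = fit_mean(xs, ys), fit_line(xs, ys)


def stacked_predict(x):
    return w_line * final_line(x) + (1 - w_line) * final_mean(x)


print(f"weight on the line model: {w_line:.3f}")
```

Since the data really is linear, the learned weight lands close to 1, meaning the blend trusts the line model almost entirely.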

In [None]:
# =========================================================================
# CELL 5: TRAINING, PREDICTION, AND FINAL REPORT
# =========================================================================

# --- EVALUATION OF THE FINAL ENSEMBLE ---
print("\n--- 📈 Evaluating the final stacking ensemble ---")
stacking_scores = cross_val_score(stacking_ensemble, X_full, y_full, cv=5, scoring='neg_root_mean_squared_error', n_jobs=1)
scores_report['stacking_final_rmse'] = -stacking_scores.mean()
print(f"✅ Stacking Ensemble CV Score (RMSE): {scores_report.get('stacking_final_rmse'):.5f}")


# --- FINAL TRAINING OF THE ENSEMBLE ---
print("\n--- ⚙️  Training the final ensemble on the FULL dataset... ---")
stacking_ensemble.fit(X_full, y_full)
print("✅ Ensemble trained successfully.")


# --- PREDICTION AND CREATION OF THE SUBMISSION FILE ---
# test_ids is assumed to hold the test set's 'id' column, defined in an earlier cell (not shown here).
print("\n--- 🧠 Predicting on the test set... ---")
test_predictions = stacking_ensemble.predict(X_test_full)
submission_df = pd.DataFrame({'id': test_ids, 'accident_risk': test_predictions})
submission_df['accident_risk'] = submission_df['accident_risk'].clip(0, 1)  # risk is bounded to [0, 1]
submission_path = 'submission_ensemble.csv'
submission_df.to_csv(submission_path, index=False)
print(f"\n🎉 Success! Submission file saved to: {submission_path}")


# --- WRITING THE FINAL REPORT TO A .TXT FILE ---
print("\n--- 💾 Saving performance report ---")
report_path = 'rendimiento_modelos.txt'

def format_score(key):
    """Format a stored RMSE with 5 decimals, or 'N/A' if it was never computed."""
    score = scores_report.get(key)
    if isinstance(score, float): return f"{score:.5f}"
    return "N/A"

with open(report_path, 'w') as f:
    f.write("="*50 + "\n")
    f.write("      MODEL PERFORMANCE REPORT (RMSE)\n")
    f.write("="*50 + "\n\n")

    # Ridge is only the meta-model and is never scored on its own, so it has no line here.
    f.write("--- Base Models ---\n")
    f.write(f"XGBoost:      {format_score('xgb_base_rmse')}\n")
    f.write(f"LightGBM:     {format_score('lgbm_base_rmse')}\n")
    f.write(f"CatBoost:     {format_score('catboost_base_rmse')}\n\n")

    f.write("--- Final Model ---\n")
    f.write(f"Stacking Ensemble: {format_score('stacking_final_rmse')}\n\n")

    f.write("="*50 + "\n")
    f.write(f"Report generated on {pd.Timestamp.now(tz='America/Bogota').strftime('%Y-%m-%d %H:%M:%S')}\n")

print(f"✅ Report saved to: {report_path}")

# Training Results

MODEL PERFORMANCE REPORT (RMSE)

--- Base Models --- RMSE (5-fold CV)

- XGBoost:                0.05309
- LightGBM:               0.05326
- CatBoost:               0.05310

--- Final Model ---
- Stacking Ensemble:      0.05306

