# Flood forecasting project — Hanoi (Synthetic dataset)
This notebook contains:
1. Instructions to replace synthetic data with real downloads
2. Preprocessing and feature engineering
3. Baseline XGBoost training and evaluation (TimeSeriesSplit)
4. Saving model and inference example

Files generated with this notebook:
- `/mnt/data/hanoi_synthetic_2024.csv` (synthetic)


## Download real data (optional)
If you have access to real rainfall and water level data for Hanoi, replace the synthetic CSV with your real dataset.

Useful sources to fetch real data (examples):

- GPM IMERG (satellite precipitation) — https://gpm.nasa.gov
- CHIRPS gridded rainfall — https://www.chc.ucsb.edu/data/chirps
- Vietnam NCHMF / local provincial hydrology portals

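If you have station CSVs, ensure they have the columns `timestamp`, `station_id`, `rain_mm`, `water_level_cm`, `lat`, `lon`. The training cell below also reads a binary `label` column (1 = flood), so real data must include or derive that target as well.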
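
A quick schema check can catch a mismatched file before it reaches the feature code. The sketch below is one possible way to do it; `check_station_csv` is a hypothetical helper, not part of this notebook, and the two sample rows (including the station id `HN01`) are made up for illustration.

```python
import io
import pandas as pd

# Columns named in the text above. Add 'label' here if your file already
# carries the flood label that the training cell reads.
REQUIRED_COLUMNS = ['timestamp', 'station_id', 'rain_mm',
                    'water_level_cm', 'lat', 'lon']

def check_station_csv(source, required=REQUIRED_COLUMNS):
    """Load a station CSV and fail early if expected columns are missing.

    Hypothetical helper: `source` is a path or file-like object.
    """
    df = pd.read_csv(source)
    missing = [c for c in required if c not in df.columns]
    if missing:
        raise ValueError(f'Missing required columns: {missing}')
    # Parse timestamps only after we know the column exists.
    df['timestamp'] = pd.to_datetime(df['timestamp'])
    # Same ordering the preprocessing cell expects.
    return df.sort_values(['station_id', 'timestamp']).reset_index(drop=True)

# Tiny in-memory CSV standing in for a real station file (made-up values).
sample_csv = io.StringIO(
    'timestamp,station_id,rain_mm,water_level_cm,lat,lon\n'
    '2024-07-01 01:00,HN01,2.5,310,21.03,105.85\n'
    '2024-07-01 00:00,HN01,0.0,305,21.03,105.85\n'
)
checked = check_station_csv(sample_csv)
print(checked.dtypes)
```

Passing a real file path instead of `sample_csv` should work the same way, assuming the file is a plain comma-separated CSV with a header row.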


In [None]:
import pandas as pd

# Load synthetic data
csv_path = '/mnt/data/hanoi_synthetic_2024.csv'
df = pd.read_csv(csv_path, parse_dates=['timestamp'])
print('Rows:', len(df))
df.head()

## Preprocessing & Feature Engineering
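We create three groups of rainfall features per station: lagged rainfall (1, 3, 6, 12 and 24 hours back), rolling sums over 3, 6 and 24 hours, and an API-like feature, i.e. an exponentially weighted moving average of rainfall in the spirit of an antecedent precipitation index. Lags and windows are counted in rows, so this assumes one row per station per hour with no gaps.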
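
To make the three feature types concrete, here is a minimal sketch on a toy frame. The station id `S1` and the rainfall values are made up; only one lag and one window are shown.

```python
import pandas as pd

# Made-up hourly rainfall for a single hypothetical station.
toy = pd.DataFrame({
    'station_id': ['S1'] * 6,
    'rain_mm': [0.0, 2.0, 4.0, 0.0, 1.0, 3.0],
})
g = toy.groupby('station_id')['rain_mm']

# Lag: rainfall one row (one hour) earlier; the first row has no history.
toy['rain_lag_1'] = g.shift(1)

# Rolling sum over 3 rows, current row included; the first two rows are NaN.
toy['rain_sum_3h'] = g.rolling(3).sum().reset_index(level=0, drop=True)

# EWMA with alpha=0.3: recent hours weigh more, older hours decay.
toy['API'] = g.transform(lambda x: x.ewm(alpha=0.3).mean())

print(toy)
```

The NaN rows at the start of each station are the ones that `dropna()` removes in the cell that follows.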

In [None]:
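# Sort so that shift/rolling run in chronological order within each station
df = df.sort_values(['station_id','timestamp']).reset_index(drop=True)
# Lagged rainfall: shift() moves by rows, so one row must equal one hour
for lag in [1,3,6,12,24]:
    df[f'rain_lag_{lag}'] = df.groupby('station_id')['rain_mm'].shift(lag)
# Rolling rainfall totals (the window includes the current hour)
for window in [3,6,24]:
    df[f'rain_sum_{window}h'] = df.groupby('station_id')['rain_mm'].rolling(window).sum().reset_index(level=0, drop=True)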
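# API-like (EWMA); transform keeps the result aligned with df's index,
# whereas groupby.apply can return a MultiIndex that breaks the assignment
df['API'] = df.groupby('station_id')['rain_mm'].transform(lambda x: x.ewm(alpha=0.3).mean())
# Drop rows with missing values (mainly the warm-up rows of each station)
df_feat = df.dropna().reset_index(drop=True)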
print('After creating features, rows:', len(df_feat))
df_feat.head()

## Baseline: XGBoost (time-series split)
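We'll use TimeSeriesSplit for evaluation: each fold trains on an initial block of rows and tests on the block that follows it. The split works on row position only, so the rows must be in chronological order for the folds to respect time.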
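
The fold layout is easiest to see on a toy array. This sketch uses 12 dummy rows standing in for time-ordered samples; it only illustrates how `TimeSeriesSplit` assigns indices and does not touch the flood data.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 12 dummy samples, assumed to be already in chronological order.
X_toy = np.arange(12).reshape(-1, 1)

folds = []
for fold, (train_idx, test_idx) in enumerate(TimeSeriesSplit(n_splits=5).split(X_toy)):
    folds.append((train_idx.tolist(), test_idx.tolist()))
    # Training rows always precede test rows; the training block grows each fold.
    print(f'fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}')
```

Because training indices always come before test indices, no fold is evaluated on rows earlier than the ones it was trained on, provided the input really is sorted by time.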

In [None]:
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import f1_score, recall_score, precision_score, roc_auc_score
import xgboost as xgb

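# TimeSeriesSplit cuts by row position, so order rows by time (not by station) first
df_feat = df_feat.sort_values(['timestamp','station_id']).reset_index(drop=True)
features = [c for c in df_feat.columns if c.startswith('rain_lag_') or c.startswith('rain_sum_') or c in ['API']]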
X = df_feat[features]
y = df_feat['label']

tscv = TimeSeriesSplit(n_splits=5)
metrics = []
for train_idx, test_idx in tscv.split(X):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
    # use_label_encoder is omitted: it is deprecated in recent xgboost releases
    model = xgb.XGBClassifier(eval_metric='logloss', n_estimators=100, max_depth=5)
    model.fit(X_train, y_train)
    prob = model.predict_proba(X_test)[:,1]
    pred = (prob >= 0.5).astype(int)
    metrics.append({
        'auc': roc_auc_score(y_test, prob),
        'f1': f1_score(y_test, pred),
        'recall': recall_score(y_test, pred),
        'precision': precision_score(y_test, pred)
    })
# Average each metric over the folds (pandas is already imported as pd above)
print(pd.DataFrame(metrics).mean())


In [None]:
import pickle
# train on all data
final_model = xgb.XGBClassifier(eval_metric='logloss', n_estimators=150, max_depth=5)
final_model.fit(X, y)
with open('/mnt/data/xgb_flood_model.pkl','wb') as f:
    pickle.dump(final_model, f)
print('Saved model to /mnt/data/xgb_flood_model.pkl')

In [None]:
# inference example: take last row features
last = X.iloc[[-1]]
prob = final_model.predict_proba(last)[:,1][0]
print('Predicted probability of flood (last timestamp):', prob)
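
For a new observation that is not already a row of `X`, the same features have to be rebuilt from recent rainfall. The sketch below shows one way to do that for a single station; `build_feature_row` is a hypothetical helper, and it assumes hourly values with no gaps, oldest first, with the current hour last. The EWMA is computed over the supplied window only, so it can differ slightly from the value computed over a station's full history.

```python
import pandas as pd

def build_feature_row(recent_rain_mm):
    """Build one feature row from recent hourly rainfall (hypothetical helper).

    recent_rain_mm: hourly rainfall in mm, oldest first, current hour last.
    At least 25 values are needed (current hour plus a 24-hour lag).
    """
    s = pd.Series(recent_rain_mm, dtype=float)
    if len(s) < 25:
        raise ValueError('need at least 25 hourly values')
    # Same lags, windows and alpha as the preprocessing cell, in the same order.
    row = {f'rain_lag_{lag}': s.iloc[-1 - lag] for lag in [1, 3, 6, 12, 24]}
    row.update({f'rain_sum_{w}h': s.iloc[-w:].sum() for w in [3, 6, 24]})
    row['API'] = s.ewm(alpha=0.3).mean().iloc[-1]
    return pd.DataFrame([row])

# Made-up rainfall: 20 dry hours followed by a short storm.
example = build_feature_row([0.0] * 20 + [1.0, 4.0, 8.0, 2.0, 0.5])
print(example.T)

# With the trained model from the cells above, inference would then be:
# prob = final_model.predict_proba(example)[:, 1][0]
```

The column order matters when the model was fitted on a DataFrame, which is why the helper mirrors the order in which the preprocessing cell creates the features.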
