# Precipitation Forecasting using XGBoost

In this notebook, we investigate XGBoost (eXtreme Gradient Boosting) as a model for forecasting precipitation. XGBoost is an efficient implementation of gradient boosting that builds an ensemble of decision trees sequentially, with each new tree fitted to correct the errors of the trees before it. Unlike RNNs, which maintain an internal state across time steps, XGBoost has no built-in notion of temporal order. We therefore treat the time series as a standard supervised regression problem and flatten each window of past observations into a single feature vector.
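
To make the flattening step concrete, here is a minimal sketch on a made-up window of 3 time steps and 2 features (the values and sizes are illustrative assumptions, not taken from the dataset). It shows how a row-major flatten lays out the time steps and where a given (time step, feature) pair ends up in the resulting vector.

```python
import numpy as np

# Toy window: 3 time steps (rows) x 2 features (columns); values are made up.
window = np.array([
    [1.0, 10.0],  # t0: feature_a, feature_b
    [2.0, 20.0],  # t1
    [3.0, 30.0],  # t2
])

# Row-major flattening keeps all features of one time step next to each other.
flat = window.flatten()
print(flat.tolist())  # [1.0, 10.0, 2.0, 20.0, 3.0, 30.0]

# The value for (time step t, feature j) sits at index t * n_features + j.
n_features = window.shape[1]
t, j = 2, 1
assert flat[t * n_features + j] == window[t, j]
```

After flattening, the model sees each (time step, feature) pair as an independent column, so any temporal structure has to be learned from the column values alone.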

## 0. Imports and Basic Setup

In [4]:
import os
import numpy as np
import pandas as pd
import xgboost as xgb

from sklearn.metrics import mean_squared_error, mean_absolute_error
import matplotlib.pyplot as plt

## 1. Load Preprocessed Train, Validation, and Test Splits


In [5]:
train_data = pd.read_csv("../data/processed/train_data.csv")
validation_data = pd.read_csv("../data/processed/validation_data.csv")
test_data = pd.read_csv("../data/processed/test_data.csv")

# Sort splits by location -> YYYY -> DOY to ensure correct time ordering
train_data.sort_values(by=["location", "YYYY", "DOY"], inplace=True)
validation_data.sort_values(by=["location", "YYYY", "DOY"], inplace=True)
test_data.sort_values(by=["location", "YYYY", "DOY"], inplace=True)
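
Sorting guarantees order but not continuity: if a location is missing days, any window built by row position would silently span the gap. The following is a rough sketch of a gap check, assuming `YYYY` and `DOY` are integer columns and the frame is already sorted as above. The helper name `count_date_gaps` and the toy frame are hypothetical and not part of the original notebook.

```python
import pandas as pd

def count_date_gaps(df):
    """Count within-location steps that are not exactly one day apart.

    Hypothetical helper: assumes integer YYYY and DOY columns and rows
    already sorted by location, YYYY, DOY.
    """
    # Build a real date from year + day-of-year, e.g. 2021 and 1 -> "2021001".
    dates = pd.to_datetime(
        df["YYYY"].astype(str) + df["DOY"].astype(str).str.zfill(3),
        format="%Y%j",
    )
    # Difference between consecutive rows, computed separately per location.
    step = dates.groupby(df["location"]).diff().dropna()
    return int((step != pd.Timedelta(days=1)).sum())

# Toy data: location A is contiguous across a year boundary, B skips a day.
toy = pd.DataFrame({
    "location": ["A", "A", "A", "B", "B"],
    "YYYY": [2020, 2020, 2021, 2021, 2021],
    "DOY": [365, 366, 1, 1, 3],
})
n_gaps = count_date_gaps(toy)
print(n_gaps)  # 1
```

A result of zero on each split would indicate that consecutive rows really are consecutive days within every location.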

## 2. Create Sequences for XGBoost


In [2]:
def create_sequences(df, feature_cols, target_col, seq_length=30):
    """
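    Create flattened sequences for time series prediction, grouped by location.

    Each sample is a window of `seq_length` consecutive rows, flattened time
    step by time step into a single feature vector; its label is the target
    value at the row immediately after the window. Unlike RNN inputs, the
    temporal dimension is not kept. Rows are assumed to be sorted
    chronologically within each location.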
    
    Parameters:
    df (pandas.DataFrame): DataFrame with time series data
    feature_cols (list): List of feature column names
    target_col (str): Name of the target column
    seq_length (int): Length of sequence to use for prediction
    
    Returns:
    tuple: (X array of shape (n_samples, seq_length * n_features),
            y array of shape (n_samples,))
    """
    X_list, y_list = [], []
    grouped = df.groupby("location", group_keys=True)
    
    for loc, loc_df in grouped:
        loc_df = loc_df.reset_index(drop=True)
        loc_features = loc_df[feature_cols].values
        loc_target = loc_df[target_col].values
        
        for i in range(len(loc_df) - seq_length):
            # Flatten the sequence into a single feature vector
            sequence = loc_features[i:i + seq_length].flatten()
            X_list.append(sequence)
            y_list.append(loc_target[i + seq_length])
    
    return np.array(X_list), np.array(y_list)
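
The inner Python loop above can be slow for long series. As a possible alternative, the sketch below builds the same flattened windows for a single location with NumPy's `sliding_window_view`. The function name `flatten_windows` is a hypothetical helper, and the toy arrays are made up for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def flatten_windows(features, target, seq_length):
    """Vectorized windowing for one location (hypothetical helper).

    features: array of shape (n_rows, n_features)
    target:   array of shape (n_rows,)
    """
    n_rows, n_features = features.shape
    if n_rows <= seq_length:
        # Not enough rows for even one (window, next-step target) pair.
        return np.empty((0, seq_length * n_features)), np.empty((0,))
    # Shape: (n_rows - seq_length + 1, n_features, seq_length)
    windows = sliding_window_view(features, seq_length, axis=0)
    # Drop the last window (no target follows it) and put time before features.
    windows = windows[:-1].transpose(0, 2, 1)
    X = windows.reshape(len(windows), seq_length * n_features)
    y = target[seq_length:]
    return X, y

# Toy check against the explicit loop used in create_sequences.
features = np.arange(12, dtype=float).reshape(6, 2)
target = np.arange(6, dtype=float) * 10
X_fast, y_fast = flatten_windows(features, target, seq_length=3)
X_loop = np.array([features[i:i + 3].flatten() for i in range(6 - 3)])
assert np.array_equal(X_fast, X_loop)
print(X_fast.shape, y_fast.tolist())  # (3, 6) [30.0, 40.0, 50.0]
```

The view itself is memory-free, but the final `reshape` copies the data, so the output is as large as the loop-based result.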

## 3. Define Features and Target


In [6]:
exclude_cols = ["location", "YYYY", "DOY", "MM", "DD", "prec"]
feature_cols = [col for col in train_data.columns if col not in exclude_cols]
target_col = "prec"

print("Feature columns:", feature_cols)
print("Target column:", target_col)

Feature columns: ['2m_temp_max', '2m_temp_mean', '2m_temp_min', '2m_dp_temp_max', '2m_dp_temp_mean', '2m_dp_temp_min', '10m_wind_u', '10m_wind_v', 'fcst_alb', 'lai_high_veg', 'lai_low_veg', 'swe', 'surf_net_solar_rad_max', 'surf_net_solar_rad_mean', 'surf_net_therm_rad_max', 'surf_net_therm_rad_mean', 'surf_press', 'total_et', 'volsw_123', 'volsw_4']
Target column: prec


## 4. Generate Sequences

In [7]:
SEQ_LENGTH = 30
X_train, y_train = create_sequences(train_data, feature_cols, target_col, seq_length=SEQ_LENGTH)
X_val, y_val = create_sequences(validation_data, feature_cols, target_col, seq_length=SEQ_LENGTH)
X_test, y_test = create_sequences(test_data, feature_cols, target_col, seq_length=SEQ_LENGTH)

print("Train sequence shape:", X_train.shape, y_train.shape)
print("Validation sequence shape:", X_val.shape, y_val.shape)
print("Test sequence shape:", X_test.shape, y_test.shape)

Train sequence shape: (1202420, 600) (1202420,)
Validation sequence shape: (106720, 600) (106720,)
Test sequence shape: (106620, 600) (106620,)
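
The width of 600 follows directly from 30 time steps times 20 features. The short calculation below also estimates the size of the training matrix, assuming the array is stored as `float64` (NumPy's default for floating-point data; the actual dtype is not shown in the notebook output).

```python
n_features = 20        # number of feature columns printed above
seq_length = 30        # SEQ_LENGTH
n_train = 1_202_420    # training samples reported above

flat_width = seq_length * n_features
print(flat_width)  # 600

# Assumption: 8 bytes per value (float64); float32 would halve this.
bytes_float64 = n_train * flat_width * 8
print(round(bytes_float64 / 1e9, 2))  # 5.77 (GB)
```

At this size, casting the feature array to `float32` before building the windows may be worth considering on memory-constrained machines.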


## 5. Build and Train XGBoost Model


In [8]:
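# Create feature names for interpretation. The order matches .flatten() in
# create_sequences (time step outer, feature inner); t0 is the oldest step
# in the window and t{SEQ_LENGTH - 1} the most recent.
feature_names = [f"{col}_t{i}" for i in range(SEQ_LENGTH) for col in feature_cols]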

# Convert to DMatrix format for XGBoost
dtrain = xgb.DMatrix(X_train, label=y_train, feature_names=feature_names)
dval = xgb.DMatrix(X_val, label=y_val, feature_names=feature_names)
dtest = xgb.DMatrix(X_test, label=y_test, feature_names=feature_names)

# Set XGBoost parameters
params = {
    'objective': 'reg:squarederror',  # regression task
    'eval_metric': ['rmse', 'mae'],   # metrics to evaluate
    'max_depth': 6,                   # maximum depth of trees
    'learning_rate': 0.01,            # learning rate
    'subsample': 0.8,                 # fraction of samples used for tree building
    'colsample_bytree': 0.8,          # fraction of features used for tree building
    'min_child_weight': 1,            # minimum sum of instance weight in a child
}

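# Train model with early stopping. XGBoost monitors the last entry in `evals`
# ('val') and the last metric in `eval_metric` ('mae') to decide when to stop.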
model = xgb.train(
    params,
    dtrain,
    num_boost_round=1000,
    evals=[(dtrain, 'train'), (dval, 'val')],
    early_stopping_rounds=50,
    verbose_eval=100
)

## 6. Evaluate the Model