# CIS 5200 – XGBoost Wind Speed Forecasting

This notebook trains an **XGBoost regression model** to forecast **1-hour-ahead wind speed** at the Tehachapi wind farm using ERA5 reanalysis data.

We assume your DataFrame already contains:
- `datetime`
- ERA5 physical vars: `u10`, `v10`, `t2m`, `sp`
- time encodings: `hour_sin`, `hour_cos`, `month_sin`, `month_cos`, `doy_sin`, `doy_cos`
- `wind_speed`
- `target_next_hour`

The goal is to build 24-hour lagged features, train XGBoost, and evaluate performance.
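
For readers starting from raw ERA5 fields, the derived columns listed above could be produced roughly as follows. This is a sketch, not the project's actual preprocessing: the helper name `add_derived_columns` is hypothetical, and the formulas (wind speed as the magnitude of `u10`/`v10`, a 365-day period for day-of-year) are assumptions that agree with the sample rows printed in Section 2.

```python
import numpy as np
import pandas as pd

def add_derived_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: add wind speed, cyclical encodings, and the 1-hour-ahead target."""
    out = df.sort_values("datetime").reset_index(drop=True).copy()
    dt = pd.to_datetime(out["datetime"])

    # Wind speed as the magnitude of the 10 m wind components (assumed definition).
    out["wind_speed"] = np.hypot(out["u10"], out["v10"])

    # Cyclical encodings: map each periodic quantity onto the unit circle.
    for name, values, period in [
        ("hour", dt.dt.hour, 24),
        ("month", dt.dt.month, 12),
        ("doy", dt.dt.dayofyear, 365),  # assumed period
    ]:
        angle = 2 * np.pi * values / period
        out[f"{name}_sin"] = np.sin(angle)
        out[f"{name}_cos"] = np.cos(angle)

    # Target: wind speed one row (one hour) later; the last row has no target.
    out["target_next_hour"] = out["wind_speed"].shift(-1)
    return out

# Toy input with made-up wind components.
demo = pd.DataFrame({
    "datetime": pd.to_datetime(
        ["2018-01-01 00:00", "2018-01-01 01:00", "2018-01-01 02:00"]
    ),
    "u10": [3.0, 0.0, -6.0],
    "v10": [4.0, 2.0, 8.0],
})
demo = add_derived_columns(demo)
print(demo[["wind_speed", "target_next_hour", "hour_sin", "month_sin"]])
```

The sin/cos pair keeps hour 23 close to hour 0, which a raw integer hour would not.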

## 1. Setup and imports
Install dependencies if needed and import required libraries.

In [4]:
# If xgboost is not installed, uncomment the next line:
# !pip install xgboost

import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from xgboost import XGBRegressor
import matplotlib.pyplot as plt
%matplotlib inline

## 2. Load your preprocessed dataset
Load the preprocessed data with the project helper `get_dataframe`, imported from `data/data_helpers.py` one directory above this notebook.

If you already have `df` in memory, skip this cell.

In [5]:
import sys
import os
sys.path.append(os.path.abspath(".."))
from data.data_helpers import get_dataframe

df = get_dataframe()
df.head()

Skipping existing file /Users/aboulmich/Projects/cis5200-project/data/era5_tehachapi_2018_H1.nc
Skipping existing file /Users/aboulmich/Projects/cis5200-project/data/era5_tehachapi_2018_H2.nc
Skipping existing file /Users/aboulmich/Projects/cis5200-project/data/era5_tehachapi_2019_H1.nc
Skipping existing file /Users/aboulmich/Projects/cis5200-project/data/era5_tehachapi_2019_H2.nc
Skipping existing file /Users/aboulmich/Projects/cis5200-project/data/era5_tehachapi_2020_H1.nc
Skipping existing file /Users/aboulmich/Projects/cis5200-project/data/era5_tehachapi_2020_H2.nc




Unnamed: 0,datetime,u10,v10,t2m,sp,hour_sin,hour_cos,month_sin,month_cos,doy_sin,doy_cos,wind_speed,target_next_hour
0,2018-01-01 00:00:00,-0.808231,-0.069685,291.890625,89925.25,0.0,1.0,0.5,0.866025,0.017213,0.999852,0.811229,1.524293
1,2018-01-01 01:00:00,-1.340298,-0.725997,290.692627,89972.25,0.258819,0.965926,0.5,0.866025,0.017213,0.999852,1.524293,1.245654
2,2018-01-01 02:00:00,-0.462882,-1.156458,288.544922,90036.1875,0.5,0.866025,0.5,0.866025,0.017213,0.999852,1.245654,1.053555
3,2018-01-01 03:00:00,0.300817,-1.009697,285.121826,90091.3125,0.707107,0.707107,0.5,0.866025,0.017213,0.999852,1.053555,1.122294
4,2018-01-01 04:00:00,0.736362,-0.846944,283.71582,90125.125,0.866025,0.5,0.5,0.866025,0.017213,0.999852,1.122294,1.363163


## 3. Sort by time and create lagged features

We model 1-hour-ahead wind speed using **the last 24 hours of history** for each variable:
- u10, v10
- t2m
- surface pressure (sp)
- wind_speed

Lag features are named `var_lag0` (the current hour), `var_lag1`, ..., `var_lag23` (23 hours earlier).

We also keep the cyclical time features: hour, month, and day-of-year sin/cos.
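
To make the naming concrete, here is a small illustration on a toy series with made-up values. `shift(k)` moves each value `k` rows down, so `x_lag0` is the current value and `x_lag2` is the value two rows (hours) earlier; the first `k` rows of each lagged column are NaN.

```python
import pandas as pd

toy = pd.DataFrame({"x": [10.0, 11.0, 12.0, 13.0, 14.0]})

# Build three lags at once and join them in a single concat.
lags = {f"x_lag{k}": toy["x"].shift(k) for k in range(3)}
toy = pd.concat([toy, pd.DataFrame(lags)], axis=1)

# Rows whose deepest lag is missing cannot be used for training.
toy_ml = toy.dropna(subset=["x_lag2"])
print(toy_ml)
```

Dropping on the deepest lag is enough here because every shallower lag is already populated on those rows.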

In [6]:
# Ensure sorted by datetime
df = df.sort_values("datetime").reset_index(drop=True)

LAG_HOURS = 24

lag_vars = ["u10", "v10", "t2m", "sp", "wind_speed"]
time_features = ["hour_sin", "hour_cos", "month_sin", "month_cos", "doy_sin", "doy_cos"]

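# Construct all lagged features, then join them in one concat.
# Adding 120 columns one at a time fragments the DataFrame and triggers a pandas PerformanceWarning.
lagged = {
    f"{var}_lag{lag}": df[var].shift(lag)
    for var in lag_vars
    for lag in range(LAG_HOURS)
}
df = pd.concat([df, pd.DataFrame(lagged)], axis=1)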

# Drop rows with NA due to shifting
df_ml = df.dropna(subset=["target_next_hour"] + [f"{v}_lag{LAG_HOURS-1}" for v in lag_vars]).copy()

print("Dataset after lagging:", df_ml.shape)
df_ml.head()

Dataset after lagging: (26280, 133)



Unnamed: 0,datetime,u10,v10,t2m,sp,hour_sin,hour_cos,month_sin,month_cos,doy_sin,...,wind_speed_lag14,wind_speed_lag15,wind_speed_lag16,wind_speed_lag17,wind_speed_lag18,wind_speed_lag19,wind_speed_lag20,wind_speed_lag21,wind_speed_lag22,wind_speed_lag23
23,2018-01-01 23:00:00,-2.007515,-0.398896,288.983398,90245.1875,-0.258819,0.965926,0.5,0.866025,0.017213,...,1.689377,1.627172,1.577095,1.499427,1.363163,1.122294,1.053555,1.245654,1.524293,0.811229
24,2018-01-02 00:00:00,-1.891289,-0.214457,288.743408,90263.5625,0.0,1.0,0.5,0.866025,0.034422,...,1.442469,1.689377,1.627172,1.577095,1.499427,1.363163,1.122294,1.053555,1.245654,1.524293
25,2018-01-02 01:00:00,-1.858377,-0.074957,287.475342,90278.1875,0.258819,0.965926,0.5,0.866025,0.034422,...,1.411922,1.442469,1.689377,1.627172,1.577095,1.499427,1.363163,1.122294,1.053555,1.245654
26,2018-01-02 02:00:00,-1.165116,-0.878426,287.569824,90315.625,0.5,0.866025,0.5,0.866025,0.034422,...,1.461147,1.411922,1.442469,1.689377,1.627172,1.577095,1.499427,1.363163,1.122294,1.053555
27,2018-01-02 03:00:00,-0.596005,-1.192101,284.457031,90319.4375,0.707107,0.707107,0.5,0.866025,0.034422,...,1.460693,1.461147,1.411922,1.442469,1.689377,1.627172,1.577095,1.499427,1.363163,1.122294


## 4. Build train/val/test splits (chronological)

We use 70% for training, 15% validation, 15% testing.

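The split is chronological and **NOT shuffled**: validation and test rows always come after the training rows, so no future information leaks into training.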
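
The same idea can be written as a small reusable function. This is only a sketch: the name `chronological_split` and its parameters are hypothetical, and it assumes the rows are already sorted by time.

```python
import numpy as np

def chronological_split(n_rows, train_frac=0.70, val_end_frac=0.85):
    """Return (train, val, test) index arrays in time order, without shuffling."""
    train_end = int(train_frac * n_rows)
    val_end = int(val_end_frac * n_rows)
    idx = np.arange(n_rows)
    return idx[:train_end], idx[train_end:val_end], idx[val_end:]

train_idx, val_idx, test_idx = chronological_split(1000)
print(len(train_idx), len(val_idx), len(test_idx))

# Every training index precedes every validation index, which precedes every test index.
assert train_idx.max() < val_idx.min() < test_idx.min()
```

Passing the cumulative boundary (`val_end_frac`) mirrors the `0.7` / `0.85` cut points used in the cell below.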

In [7]:
lag_feature_cols = [c for c in df_ml.columns if any(c.startswith(v + "_lag") for v in lag_vars)]
feature_cols = lag_feature_cols + time_features

X = df_ml[feature_cols].to_numpy()
y = df_ml["target_next_hour"].to_numpy()

N = len(df_ml)
train_end = int(0.7 * N)
val_end = int(0.85 * N)

X_train, y_train = X[:train_end], y[:train_end]
X_val, y_val = X[train_end:val_end], y[train_end:val_end]
X_test, y_test = X[val_end:], y[val_end:]

print("Train:", X_train.shape)
print("Val:  ", X_val.shape)
print("Test: ", X_test.shape)

Train: (18396, 126)
Val:   (3942, 126)
Test:  (3942, 126)


## 5. Train XGBoost model

This configuration (300 trees, depth 6, learning rate 0.05) is a lightweight baseline. Increase `n_estimators` or `max_depth` later if the validation RMSE suggests the model is underfitting.

In [8]:
xgb_model = XGBRegressor(
    n_estimators=300,
    learning_rate=0.05,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
    objective="reg:squarederror",
    eval_metric="rmse",  # set here: XGBoost >= 2.0 raises TypeError if eval_metric is passed to fit()
    random_state=42,
    n_jobs=-1,
)

# Track RMSE on both the training and validation sets during boosting
xgb_model.fit(
    X_train, y_train,
    eval_set=[(X_train, y_train), (X_val, y_val)],
    verbose=False,
)

print("Model training complete.")

## 6. Evaluate model
Compute MAE, RMSE, and R² on the **test set**.
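
For reference, the three metrics reduce to a few lines of NumPy. The sketch below uses made-up numbers and a hypothetical helper name, `regression_metrics`; it is meant to show the definitions, not to replace the scikit-learn functions used in this notebook.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) computed directly from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred

    mae = np.mean(np.abs(err))          # average absolute error
    rmse = np.sqrt(np.mean(err ** 2))   # penalises large errors more than MAE

    ss_res = np.sum(err ** 2)                        # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # variance around the mean
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2

mae_demo, rmse_demo, r2_demo = regression_metrics(
    [1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0]
)
print(f"MAE={mae_demo:.4f}  RMSE={rmse_demo:.4f}  R2={r2_demo:.4f}")
```

MAE and RMSE are in the same units as the target (wind speed), while R² is unitless and compares the model against always predicting the mean.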

In [None]:
y_pred = xgb_model.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # avoids the `squared` argument, removed in recent scikit-learn
r2 = r2_score(y_test, y_pred)

print(f"Test MAE:  {mae:.4f}")
print(f"Test RMSE: {rmse:.4f}")
print(f"Test R²:   {r2:.4f}")

## 7. Feature importance
Visualize the 20 most important features used by XGBoost. The ranking covers both the lagged variables and the time encodings.
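
With 24 lags per variable, individual bars can be hard to interpret. One optional way to summarise them is to sum the importances over all lags of the same variable. The sketch below uses made-up importances and a hypothetical helper, `group_importance`; with the real model you would pass `feature_cols` and `xgb_model.feature_importances_`.

```python
from collections import defaultdict

def group_importance(names, importances):
    """Sum importances over all lags of the same base variable, largest first."""
    totals = defaultdict(float)
    for name, value in zip(names, importances):
        # "wind_speed_lag3" -> "wind_speed"; names without "_lag" are kept as-is.
        base = name.split("_lag")[0]
        totals[base] += float(value)
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))

# Made-up importances for illustration only.
demo_names = ["wind_speed_lag0", "wind_speed_lag1", "u10_lag0", "hour_sin"]
demo_values = [0.5, 0.2, 0.2, 0.1]
grouped = group_importance(demo_names, demo_values)
print(grouped)
```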

In [None]:
importances = xgb_model.feature_importances_
indices = np.argsort(importances)[::-1]

top_k = 20

plt.figure(figsize=(8, 6))
plt.barh([feature_cols[i] for i in indices[:top_k]][::-1],
         importances[indices[:top_k]][::-1])
plt.title("XGBoost Feature Importance – Top 20")
plt.tight_layout()
plt.show()