## WAVE ENERGY FORECASTING: A RIGOROUS MACHINE LEARNING APPROACH

Project: Advanced Wave Energy Flux Forecasting Using Ensemble Methods and 
         Deep Learning with Uncertainty Quantification
Target Station: NDBC Buoy 41025 (Diamond Shoals, NC)

Repository: https://github.com/RichelCode/Wave-Energy-Forecasting-Methodological-Enhancement-and-Advanced-Modeling

Paper Reference: Building upon "A Hybrid Machine Learning Approach to Wave 
                 Energy Forecasting" (TIefu et al., 2021, NAPS)

## Executive Summary

This notebook implements a wave energy forecasting system that addresses methodological limitations in the existing literature. It builds on the methodology of TIefu et al. (2021) and extends it in the ways listed below.

Key Improvements:
- Rigorous time series cross-validation (eliminating data leakage)
- Spatial imputation using neighbor station data (preserving temporal continuity)
- Comprehensive feature engineering with temporal lags and domain knowledge
- Systematic hyperparameter optimization using Bayesian methods
- Statistical significance testing for model comparisons
- Neural network baselines (LSTM, GRU) alongside tree-based methods
- Probabilistic forecasting with uncertainty quantification
- Multi-horizon evaluation (1h, 3h, 6h, 12h, 24h ahead)

## Problem Context and Motivation

### The Wave Energy Challenge

Wave energy is a **largely untapped renewable resource**:
- **1,170 TWh/year** of recoverable resource along US coastlines (EPRI, 2011)
- **2-3 TW global theoretical capacity** (Gunn & Stock-Williams, 2012)
- **More predictable and more energy-dense** than wind or solar

However, **grid integration of wave energy** requires accurate forecasting for:

1. **Power System Operations:**
   - Unit commitment and economic dispatch optimization
   - Spinning reserve allocation
   - Transmission congestion management
   - Ancillary service procurement

2. **Wave Energy Converter (WEC) Management:**
   - Preventive maintenance scheduling during calm periods
   - Survival mode activation before extreme events
   - Optimal power extraction strategies
   - Component lifetime optimization

3. **Market Participation:**
   - Day-ahead and real-time market bidding
   - Financial risk management
   - Revenue optimization under uncertainty

### Critical Forecasting Parameters

Wave energy flux (kW/m) depends primarily on:

**Significant Wave Height (H_s):** 
- The mean height of the highest 1/3 of waves in a time period
- Directly measured by buoy accelerometers
- Energy ∝ H_s² (quadratic relationship amplifies forecast errors)

**Wave Period (T):**
- Dominant Period (DPD): Period of peak energy in wave spectrum
- Average Period (APD): Mean period of all waves
- Energy ∝ T (linear relationship but crucial for power capture)

**Wave Energy Flux Formula:**
$$E = \frac{\rho g^2 H_s^2 T}{64\pi} \approx 0.49 H_s^2 T \text{ (kW/m in deep water)}$$

Where:
- ρ = seawater density ≈ 1025 kg/m³
- g = gravitational acceleration = 9.81 m/s²
- H_s = significant wave height (m)
- T = wave period (s)

**Implication:** Because flux scales with H_s², a 10% error in H_s translates to a roughly 20% error in energy flux (21% exactly, since 1.1² = 1.21). Accurate H_s forecasts are therefore essential.
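
As a quick numerical check of the formula and of the error amplification, the sketch below evaluates the deep-water flux with the constants listed above and compares it with a case where H_s is overestimated by 10%. The function name `wave_energy_flux_kw_per_m` and the sample sea state (2 m, 8 s) are illustrative assumptions, not values taken from the buoy record.

```python
import math

RHO = 1025.0  # seawater density (kg/m^3), as listed above
G = 9.81      # gravitational acceleration (m/s^2)


def wave_energy_flux_kw_per_m(hs_m: float, period_s: float) -> float:
    """Deep-water wave energy flux in kW per metre of wave crest."""
    watts_per_m = RHO * G**2 * hs_m**2 * period_s / (64.0 * math.pi)
    return watts_per_m / 1000.0


# Illustrative sea state (assumed values, not taken from the dataset)
hs, period = 2.0, 8.0
flux_true = wave_energy_flux_kw_per_m(hs, period)
flux_biased = wave_energy_flux_kw_per_m(1.10 * hs, period)  # H_s overestimated by 10%
relative_error = flux_biased / flux_true - 1.0

print(f"Coefficient: {wave_energy_flux_kw_per_m(1.0, 1.0):.3f} kW/m per (m^2 s)")
print(f"Flux at Hs={hs} m, T={period} s: {flux_true:.2f} kW/m")
print(f"Relative flux error for +10% Hs: {relative_error:.1%}")
```

The coefficient printed for H_s = 1 m and T = 1 s should match the 0.49 factor in the formula, and the relative error is independent of the chosen sea state because T cancels in the ratio.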


## Literature Review and Research Gaps

### Existing Approaches

**Physics-Based Models:**
- WAVEWATCH III, SWAN (Simulating WAves Nearshore)
- **Pros:** Based on first principles, global applicability
- **Cons:** Computationally expensive (hours for 24h forecast), require accurate wind forcing, cumulative error from physics approximations

**Statistical/ML Models:**
- ARIMA, SARIMA for time series
- Neural Networks (ANN, LSTM, GRU)
- Tree-based methods (Random Forest, XGBoost)
- **Pros:** Fast inference, learn from data patterns
- **Cons:** Site-specific training, require substantial historical data

**Hybrid Models:**
- Decomposition + ML (VMD-LSTM, EMD-XGBoost)
- Stacking ensembles combining multiple algorithms
- Physics-informed neural networks
- **Pros:** Leverage complementary strengths
- **Cons:** Increased complexity, hyperparameter tuning challenges

### Identified Research Gaps (TIefu et al., 2021)

After thorough review of the foundational paper, we identified **critical methodological deficiencies:**

| Issue | Impact | Our Solution |
|-------|--------|--------------|
| Single train-test split | Overfitting risk, no robustness assessment | Time series cross-validation |
| No hyperparameter tuning | Unfair model comparison | Bayesian optimization (Optuna) |
| Missing neural network baselines | Incomplete model comparison | LSTM, GRU, TCN (Temporal Convolutional Network) implementations |
| Simple outlier deletion | Data loss, temporal gaps | Spatial imputation from neighbor station |
| No temporal lags | Limited forecasting capability | Lag features [1, 3, 6, 12, 24, 48h] |
| No statistical testing | Uncertain significance of improvements | Paired t-tests, DM tests, bootstrap CI |
| Point predictions only | No uncertainty quantification | Quantile regression, conformal prediction |
| Single horizon evaluation | Limited operational utility | Multi-horizon [1, 3, 6, 12, 24h] |

**Our work systematically addresses each of these gaps.**
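
To make the lag-feature and multi-horizon rows of the table concrete, here is a minimal sketch of how such columns could be built with pandas. It assumes a regular hourly `DatetimeIndex`, so that shifting by k rows equals shifting by k hours; the helper name `add_lags_and_targets` and the synthetic series are hypothetical and only illustrate the column layout.

```python
import numpy as np
import pandas as pd

LAGS_H = [1, 3, 6, 12, 24, 48]   # lag offsets from the table above (hours)
HORIZONS_H = [1, 3, 6, 12, 24]   # forecast horizons from the table above (hours)


def add_lags_and_targets(df: pd.DataFrame, col: str) -> pd.DataFrame:
    """Add past-value features and future-value targets for one column.

    Assumes `df` has a regular hourly index, so a shift of k rows is k hours.
    """
    out = df[[col]].copy()
    for lag in LAGS_H:
        out[f"{col}_lag{lag}h"] = df[col].shift(lag)     # value `lag` hours ago
    for h in HORIZONS_H:
        out[f"{col}_target_{h}h"] = df[col].shift(-h)    # value `h` hours ahead
    return out


# Synthetic hourly series, used only to illustrate the resulting shape
idx = pd.date_range("2014-01-01", periods=100, freq="h")
demo = pd.DataFrame({"WVHT": np.linspace(0.5, 3.0, 100)}, index=idx)
features = add_lags_and_targets(demo, "WVHT").dropna()
print(features.shape)  # (28, 12): 100 rows minus 48 (longest lag) minus 24 (longest horizon)
```

Lag columns only look backwards and target columns only look forwards, so no feature in a row carries information from that row's target time.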


## Research Objectives and Hypotheses

### Primary Objectives

**Objective 1:** Develop a methodologically rigorous wave energy forecasting framework that eliminates common pitfalls in applied ML research (data leakage, inadequate validation, missing baselines).

**Objective 2:** Systematically compare tree-based ensemble methods (XGBoost, Random Forest) and deep learning architectures (LSTM, GRU, TCN) under fair experimental conditions with proper hyperparameter tuning.

**Objective 3:** Design and validate a heterogeneous ensemble approach that combines complementary model strengths while avoiding redundancy.

**Objective 4:** Implement probabilistic forecasting to quantify prediction uncertainty for operational decision-making.

**Objective 5:** Evaluate performance across multiple forecast horizons (1-24 hours) and wave energy regimes (low, moderate, high, extreme).
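
Objectives 1 and 5 both depend on evaluation splits that respect time order. The sketch below shows one way to generate expanding-window folds with a gap between training and test blocks, so that lagged features and multi-hour targets cannot straddle the boundary. The function `expanding_window_splits` and the sizes in the example are assumptions for illustration; scikit-learn's `TimeSeriesSplit` offers comparable behaviour through its `gap` and `test_size` parameters.

```python
from typing import Iterator, List, Tuple


def expanding_window_splits(
    n_samples: int, n_folds: int, test_size: int, gap: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, test_idx) pairs in chronological order.

    Each fold trains on everything before its test block, minus `gap`
    samples that are dropped just before the test block starts.
    """
    first_test_start = n_samples - n_folds * test_size
    if first_test_start - gap <= 0:
        raise ValueError("Not enough samples for the requested folds and gap")
    for k in range(n_folds):
        test_start = first_test_start + k * test_size
        train_idx = list(range(0, test_start - gap))
        test_idx = list(range(test_start, test_start + test_size))
        yield train_idx, test_idx


# Illustrative sizes: 1000 hourly samples, 3 folds of 100 h, 24 h gap
for train_idx, test_idx in expanding_window_splits(1000, 3, 100, gap=24):
    print(f"train 0..{train_idx[-1]}  |  test {test_idx[0]}..{test_idx[-1]}")
```

A gap at least as long as the longest forecast horizon is a conservative choice: the last training target then ends before the first test observation begins.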


## Data Sources and Study Site

### Target Location: Diamond Shoals, NC (NDBC Station 41025)

**Geographic Characteristics:**
- **Coordinates:** 35.006°N, 75.402°W
- **Location:** 15 nautical miles southeast of Cape Hatteras, North Carolina
- **Water Depth:** 48.8 meters (deep water conditions)
- **Exposure:** Open Atlantic Ocean, high wave energy environment
- **Wave Climate:** Mixed sea and swell, strong seasonal variability

**Strategic Importance:**
- Located in the **U.S. Mid-Atlantic wave energy hotspot**
- Representative of conditions for proposed WEC deployments
- Long historical record (1990s-present) for robust modeling
- Well-maintained instrument with high data quality

**Why This Location?**
- Cape Hatteras region has **average wave power >30 kW/m** (top 10% of US sites)
- Proximity to population centers (Virginia, North Carolina coast)
- Overlaps with proposed offshore wind farms (potential co-location)
- Existing grid infrastructure for renewable integration

### Data Description

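- **Source:** National Data Buoy Center (NDBC), NOAA
- **Temporal Coverage:** January 2014 - December 2019 (6 years)
- **Temporal Resolution:** Hourly measurements (nominal; the raw files also contain 10-minute records)
- **Total Potential Records:** 52,584 hours (2,191 days × 24 h; 2016 is the only leap year in the range)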

**Variables Available:**

| Variable | Symbol | Unit | Description | Role in Analysis |
|----------|--------|------|-------------|------------------|
| Significant Wave Height | WVHT | meters | Mean height of highest 1/3 waves | **Primary Target** |
| Dominant Wave Period | DPD | seconds | Period at spectral peak | **Primary Target** |
| Average Wave Period | APD | seconds | Mean wave period | **Secondary Target/Feature** |
| Mean Wave Direction | MWD | degrees | Direction the waves are coming from | Feature |
| Wind Direction | WDIR | degrees | Direction the wind is coming from | Feature |
| Wind Speed | WSPD | m/s | Wind speed averaged over an 8-minute period (NDBC buoy convention) | Feature |
| Wind Gust | GST | m/s | Peak 5- or 8-second gust within the averaging period | Feature |
| Atmospheric Pressure | PRES | hPa | Sea-level pressure | Feature |
| Air Temperature | ATMP | °C | Air temperature | Feature |
| Sea Surface Temperature | WTMP | °C | Water temperature at 1m depth | Feature |

**Derived Targets:**
- **Wave Energy Flux** (kW/m): Calculated from WVHT and DPD using deep water approximation
- **Wave Power Production** (kW): Theoretical power capture assuming 30% WEC efficiency

### Data Quality Considerations

**Known Issues (from NDBC documentation):**
- Missing data codes: 999, 9999, 99.0 (varies by variable)
- Sensor failures during extreme weather events (non-random missingness)
- Occasional calibration gaps during maintenance periods
- Wave direction uncertainty during low wave conditions
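
The gaps described above are what the neighbor-station imputation listed among the key improvements is meant to fill. The notebook's actual imputation method is not shown in this section, so the following is only a sketch of one simple variant: fit a linear relation between the target and neighbor series where both are observed, then use it to estimate the target where it is missing. The function `impute_from_neighbor` and the synthetic data are hypothetical.

```python
import numpy as np
import pandas as pd


def impute_from_neighbor(target: pd.Series, neighbor: pd.Series) -> pd.Series:
    """Fill gaps in `target` using a linear fit to `neighbor`.

    The fit uses only timestamps where both series are observed; gaps where
    the neighbor is also missing stay NaN.
    """
    neighbor = neighbor.reindex(target.index)
    both = target.notna() & neighbor.notna()
    slope, intercept = np.polyfit(neighbor[both], target[both], deg=1)
    estimate = slope * neighbor + intercept
    return target.fillna(estimate)


# Synthetic example: the target equals 0.8 * neighbor + 0.1, with gaps
idx = pd.date_range("2014-01-01", periods=8, freq="h")
nbr = pd.Series([1.0, 1.5, 2.0, 2.5, 3.0, 2.0, 1.0, np.nan], index=idx)
tgt = 0.8 * nbr + 0.1          # last value is NaN because the neighbor is NaN
tgt.iloc[2] = np.nan           # a gap the neighbor can fill
filled = impute_from_neighbor(tgt, nbr)
print(filled.round(2).tolist())
```

Because sensor failures cluster around extreme weather, a relation fitted mostly on calmer conditions may extrapolate poorly to exactly the periods being filled, so imputed values in storms deserve separate validation.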

# DATA LOADING AND INITIAL ASSESSMENT

## Understanding NDBC Data Format

In [1]:
"""
NDBC Standard Meteorological Data Format (from NDBC documentation):

The data files contain hourly observations with the following characteristics:

FILE STRUCTURE:
- Line 1: Column headers (variable names)
- Line 2: Unit specifications  
- Line 3+: Hourly data records

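MISSING DATA CODES (a run of 9s whose width depends on the field):
- 99 / 99.0 / 99.00 → Missing wind speed, gust, wave height, wave period, visibility, tide
- 999 / 999.0       → Missing wind or wave direction, air/water/dew-point temperature
- 9999 / 9999.0     → Missing pressure data
- MM                → Missing value marker used in real-time files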

COMMON VARIABLES:
#YY  = Year (2014-2019 in our case)
MM   = Month (1-12)
DD   = Day (1-31)
hh   = Hour (0-23)
mm   = Minute (usually 00 or 50)
WDIR = Wind direction (degrees, meteorological convention)
WSPD = Wind speed (m/s)
GST  = Wind gust (m/s)
WVHT = Significant wave height (meters)
DPD  = Dominant wave period (seconds)
APD  = Average wave period (seconds)  
MWD  = Mean wave direction (degrees)
PRES = Sea level pressure (hPa)
ATMP = Air temperature (°C)
WTMP = Sea surface temperature (°C)
"""

print("NDBC Data Format Documentation")
print("=" * 80)
print("✓ Yearly text files with space-separated values")
print("✓ First two rows contain headers and units")
print("✓ Missing data coded as 999, 9999, 99, or MM")
print("✓ Hourly temporal resolution (typically at :00 or :50 minutes)")
print("=" * 80)


NDBC Data Format Documentation
✓ Yearly text files with space-separated values
✓ First two rows contain headers and units
✓ Missing data coded as 999, 9999, 99, or MM
✓ Hourly temporal resolution (typically at :00 or :50 minutes)


In [10]:
import pandas as pd
import glob
import os
import numpy as np

In [6]:
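# NDBC missing-value codes, applied per column. A single global list would
# also discard valid readings such as WDIR = 99 degrees or PRES = 999.0 hPa.
_NINES = {
    99: ["WSPD", "GST", "WVHT", "DPD", "APD", "VIS", "TIDE"],
    999: ["WDIR", "MWD", "ATMP", "WTMP", "DEWP"],
    9999: ["PRES"],
}
na_vals = {col: [code, "MM"] for code, cols in _NINES.items() for col in cols}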


def read_ndbc_stdmet(path: str) -> pd.DataFrame:
    """
    Read a single NDBC standard meteorological file.

    - Finds the '#YY MM DD hh mm ...' header line
    - Uses that as column names
    - Treats NDBC missing codes as NaN
    """
    with open(path, "r", encoding="utf-8", errors="ignore") as f:
        header = None
        for line in f:
            if line.startswith("#YY") or line.startswith("YY"):
                header = line.lstrip("#").strip().split()
                break

    if header is None:
        raise ValueError(f"Couldn't find #YY header in {path}")

    df = pd.read_csv(
        path,
        sep=r"\s+",
        comment="#",
        header=None,
        names=header,
        na_values=na_vals,
        engine="python",
    )
    return df


In [7]:
def build_station_df(station_prefix: str) -> pd.DataFrame:
    """
    Build a single DataFrame for a station from yearly NDBC files.

    Example:
      station_prefix = '41025h'  -> matches 41025h2014.txt, 41025h2015.txt, ...
    """
    files = sorted(glob.glob(f"{station_prefix}*.txt"))
    if not files:
        raise FileNotFoundError(f"No files found matching pattern {station_prefix}*.txt")

    print("Found files:")
    for f in files:
        print("  ", os.path.basename(f))

    dfs = [read_ndbc_stdmet(f) for f in files]
    df = pd.concat(dfs, ignore_index=True)

    # Standardize time columns
    col_map = {
        "#YY": "year",
        "YY": "year",
        "MM": "month",
        "DD": "day",
        "hh": "hour",
        "mm": "minute",
    }
    df = df.rename(columns={k: v for k, v in col_map.items() if k in df.columns})

    # Build datetime index
    time_cols = ["year", "month", "day", "hour", "minute"]
    df["datetime"] = pd.to_datetime(df[time_cols], errors="coerce")
    df = df.dropna(subset=["datetime"]).set_index("datetime").sort_index()
    df = df.drop(columns=[c for c in time_cols if c in df.columns])

    return df

In [9]:
print("Current working directory:")
print(os.getcwd())
print("\nFiles in this directory:")
for f in sorted(os.listdir()):
    print("  ", f)

# Pattern for your files: 41025h2014.txt, 41025h2015.txt, ...
file_pattern = "41025h*.txt"

data_files = sorted(glob.glob(file_pattern))

print("\nSearching for files with pattern:", file_pattern)
print(f"Found {len(data_files)} file(s):")
for f in data_files:
    size_kb = os.path.getsize(f) / 1024
    print(f"  • {os.path.basename(f):20s}  ({size_kb:7.1f} KB)")

if len(data_files) == 0:
    raise FileNotFoundError(
        f"No files found matching {file_pattern}. "
        "If needed, change the working directory or file_pattern."
    )

# -------------------------------------------------------------------------
# Build the combined DataFrame for Station 41025
# -------------------------------------------------------------------------

df_41025 = build_station_df("41025h")

print("\nCombined Station 41025 DataFrame:")
print("-" * 60)
print("Index range:", df_41025.index.min(), "→", df_41025.index.max())
print("Total rows:", len(df_41025))
print("Columns:", df_41025.columns.tolist())

# Show a quick preview
display(df_41025.head())
display(df_41025.tail())

Current working directory:
C:\Users\attafuro\Desktop\Wave Analysis

Files in this directory:
   .ipynb_checkpoints
   41025h2014.txt
   41025h2015.txt
   41025h2016.txt
   41025h2017.txt
   41025h2018.txt
   41025h2019.txt
   44095h2014.txt
   44095h2015.txt
   44095h2016.txt
   44095h2017.txt
   44095h2018.txt
   44095h2019.txt
   APD_LSTM_predictions.xlsx
   APD_predictions.xlsx
   APD_results.xlsx
   Conference Presentation.ipynb
   Imputing Neighbor Stations .ipynb
   Imputing with Neighbor Stations.ipynb
   Imputing with traditional method.ipynb
   WVHT_LSTM_predictions.xlsx
   WVHT_T_Predictions_with_Neighbor.xlsx
   WVHT_predictions.xlsx
   Wave Analysis.ipynb
   Wave_Energy_Flux.xlsx
   Wave_Energy_Flux_Metrics.xlsx
   Wave_Energy_Flux_TimeSeries.xlsx
   Wave_analysis_Considering_Two_Stations.ipynb

Searching for files with pattern: 41025h*.txt
Found 6 file(s):
  • 41025h2014.txt        (  597.5 KB)
  • 41025h2015.txt        (  616.6 KB)
  • 41025h2016.txt        (  755.9 KB)
 

| datetime | WDIR | WSPD | GST | WVHT | DPD | APD | MWD | PRES | ATMP | WTMP | DEWP | VIS | TIDE |
|----------|------|------|-----|------|-----|-----|-----|------|------|------|------|-----|------|
| 2013-12-31 23:50:00 | 305.0 | 7.0 | 10.0 | 0.96 | 5.56 | 4.41 | 2.0 | 1025.5 | 11.8 | 23.3 | 1.8 | NaN | NaN |
| 2014-01-01 00:50:00 | 301.0 | 6.8 | 9.5 | 0.96 | 6.25 | 4.4 | 360.0 | 1026.6 | 11.8 | 23.3 | 1.4 | NaN | NaN |
| 2014-01-01 01:50:00 | 305.0 | 7.5 | 9.5 | 0.89 | 5.88 | 4.23 | 2.0 | 1027.6 | 11.9 | 23.3 | 0.5 | NaN | NaN |
| 2014-01-01 02:50:00 | 308.0 | 6.6 | 9.4 | 0.89 | 5.88 | 4.15 | 3.0 | 1027.9 | 11.7 | 23.3 | -0.1 | NaN | NaN |
| 2014-01-01 03:50:00 | 322.0 | 7.0 | 10.1 | 0.92 | 6.25 | 4.38 | 26.0 | 1027.9 | 12.0 | 23.2 | -0.3 | NaN | NaN |

| datetime | WDIR | WSPD | GST | WVHT | DPD | APD | MWD | PRES | ATMP | WTMP | DEWP | VIS | TIDE |
|----------|------|------|-----|------|-----|-----|-----|------|------|------|------|-----|------|
| 2019-12-31 23:10:00 | 266.0 | 8.3 | 10.7 | NaN | NaN | NaN | NaN | 1008.9 | 15.5 | 23.1 | 7.0 | NaN | NaN |
| 2019-12-31 23:20:00 | 263.0 | 8.9 | 11.2 | NaN | NaN | NaN | NaN | 1009.0 | 15.5 | 23.1 | 7.0 | NaN | NaN |
| 2019-12-31 23:30:00 | 271.0 | 8.6 | 11.5 | NaN | NaN | NaN | NaN | 1009.2 | 15.7 | 22.9 | 7.6 | NaN | NaN |
| 2019-12-31 23:40:00 | 277.0 | 8.8 | 12.1 | 1.25 | 10.81 | 5.44 | 20.0 | 1009.1 | 15.7 | 23.0 | 6.6 | NaN | NaN |
| 2019-12-31 23:50:00 | 275.0 | 9.4 | 12.6 | NaN | NaN | NaN | NaN | 1009.2 | 15.7 | 23.0 | 6.3 | NaN | NaN |


In [11]:
# ============================================================================
# QUICK SANITY CHECKS ON STATION 41025 DATA
# ============================================================================
print("Datetime index summary")
print("-" * 40)
print("Start:", df_41025.index.min())
print("End  :", df_41025.index.max())
print("Rows :", len(df_41025))

# Check spacing between timestamps (should be mostly 1 hour)
dt_diff = df_41025.index.to_series().diff().dropna()
dt_hours = dt_diff.dt.total_seconds() / 3600

print("\nTime step statistics (hours):")
print(dt_hours.describe())

# Core variables we care about
core_vars = [v for v in ["WVHT", "DPD", "APD", "WSPD", "GST"] if v in df_41025.columns]
print("\nCore variables present:", core_vars)

missing_summary = (
    df_41025[core_vars]
    .isna()
    .mean()
    .mul(100)
    .rename("missing_%")
    .to_frame()
)

print("\n% missing per core variable:")
display(missing_summary.T)


Datetime index summary
----------------------------------------
Start: 2013-12-31 23:50:00
End  : 2019-12-31 23:50:00
Rows : 117882

Time step statistics (hours):
count    117881.000000
mean          0.446077
std          12.281906
min           0.166667
25%           0.166667
50%           0.166667
75%           1.000000
max        3881.166667
Name: datetime, dtype: float64

Core variables present: ['WVHT', 'DPD', 'APD', 'WSPD', 'GST']

% missing per core variable:


| | WVHT | DPD | APD | WSPD | GST |
|---|------|-----|-----|------|-----|
| missing_% | 62.441255 | 62.441255 | 62.441255 | 0.92126 | 0.923805 |
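
The summary above reports a median time step of 10 minutes (0.1667 h) and 117,882 rows, far more than six years of hourly records, while the wave variables are missing in about 62% of rows. One reading, consistent with the tail preview where only the 23:40 row carries wave values, is that the files mix 10-minute meteorological records with roughly hourly wave records. Under that assumption, a sketch for collapsing the data onto an hourly grid could look like the following. `to_hourly` and `circular_mean_deg` are hypothetical helpers, and binning by the floor of each timestamp is a choice made here, not something the source specifies.

```python
import numpy as np
import pandas as pd

DIRECTION_COLS = ["WDIR", "MWD"]  # angles in degrees: averaged on the circle


def circular_mean_deg(angles: pd.Series) -> float:
    """Mean direction in degrees within [0, 360), ignoring NaN."""
    rad = np.deg2rad(angles.dropna().to_numpy())
    if rad.size == 0:
        return np.nan
    mean = np.arctan2(np.sin(rad).mean(), np.cos(rad).mean())
    return float(np.rad2deg(mean) % 360.0)


def to_hourly(df: pd.DataFrame) -> pd.DataFrame:
    """Aggregate sub-hourly records onto a regular hourly grid."""
    scalar_cols = [c for c in df.columns if c not in DIRECTION_COLS]
    angle_cols = [c for c in df.columns if c in DIRECTION_COLS]
    hourly = df[scalar_cols].resample("1h").mean()   # NaN-aware mean per hour
    for col in angle_cols:
        hourly[col] = df[col].resample("1h").apply(circular_mean_deg)
    return hourly


# Synthetic 10-minute data with one wave reading per hour (illustration only)
idx = pd.date_range("2019-12-31 22:00", periods=12, freq="10min")
demo = pd.DataFrame({"WSPD": np.arange(12.0), "WVHT": np.nan, "WDIR": 350.0}, index=idx)
demo.loc[idx[4], "WVHT"] = 1.25
demo.loc[idx[6:], "WDIR"] = 10.0
print(to_hourly(demo))
```

Directions are averaged on the circle because an arithmetic mean of 350° and 10° would give 180° instead of 0°. After this step, the missing-value percentages for WVHT, DPD, and APD are worth recomputing, since much of the 62% may be structural and not a sign of sensor outages.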
