# Tutorial 2.3 Preprocessing NOAA Meteorological Data (Wind)

## 2.3.1 Introduction

This notebook is **Step 3** of the *Predict Future DO Tutorial Series*.

We retrieve and preprocess wind data from NOAA’s National Data Buoy Center (NDBC), focusing on hourly wind speed and direction.  
The data are aggregated by directional sector and over a trailing time window (here, the last 14 days) to create features for downstream prediction.

- **Raw data:** Downloaded manually from [NOAA NDBC station TPLM2](https://www.ndbc.noaa.gov/station_history.php?station=tplm2).
- **Files:** Saved as text files in the `NOAA_Raw_Meterological` folder.


## 2.3.2 Preview Raw Data Files

In [3]:
import os

def preview_tplm2_files(folder_path="NOAA_Raw_Meterological", num_lines=3):
    """
    Print the first few lines of each tplm2h*.txt file for inspection.
    """
    for fname in sorted(os.listdir(folder_path)):
        if fname.startswith("tplm2h") and fname.endswith(".txt"):
            print(f"\n📄 Preview from {fname}:")
            with open(os.path.join(folder_path, fname)) as f:
                for _ in range(num_lines):
                    print("  " + f.readline().strip())

# Run preview
preview_tplm2_files("NOAA_Raw_Meterological")



📄 Preview from tplm2h1985.txt:
  YY MM DD hh WD   WSPD GST  WVHT  DPD   APD  MWD  BAR    ATMP  WTMP  DEWP  VIS
  85 10 25 14 280 02.6 03.1 99.00 99.00 99.00 999 1019.6  17.0  17.9 999.0 99.0
  85 10 25 15 340 04.6 05.1 99.00 99.00 99.00 999 1019.9  19.1  17.9 999.0 99.0

📄 Preview from tplm2h1986.txt:
  YY MM DD hh WD   WSPD GST  WVHT  DPD   APD  MWD  BAR    ATMP  WTMP  DEWP  VIS
  86 01 01 00 200 07.7 09.3 99.00 99.00 99.00 999 1011.0  05.4  03.4 999.0 99.0
  86 01 01 01 210 08.3 09.8 99.00 99.00 99.00 999 1010.8  05.7  03.2 999.0 99.0

📄 Preview from tplm2h1987.txt:
  YY MM DD hh WD   WSPD GST  WVHT  DPD   APD  MWD  BAR    ATMP  WTMP  DEWP  VIS
  87 01 01 00 110 01.5 02.1 99.00 99.00 99.00 999 1027.4  04.6  05.0 999.0 99.0
  87 01 01 01 050 02.1 02.6 99.00 99.00 99.00 999 1027.1  04.2  04.9 999.0 99.0

📄 Preview from tplm2h1988.txt:
  YY MM DD hh WD   WSPD GST  WVHT  DPD   APD  MWD  BAR    ATMP  WTMP  DEWP  VIS
  88 01 01 01 200 10.3 11.9 99.00 99.00 99.00 999 1023.0  08.3  04.7 999
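
The previews show values such as `99.00`, `999`, and `999.0` in columns the station evidently did not record. Below is a minimal sketch of treating them as missing, assuming they are placeholders rather than measurements. `SENTINELS` and `parse_row` are illustrative names, and the per-column codes are read off the 1985 preview only.

```python
# Header and first data row copied from the tplm2h1985.txt preview.
header = "YY MM DD hh WD   WSPD GST  WVHT  DPD   APD  MWD  BAR    ATMP  WTMP  DEWP  VIS"
row = "85 10 25 14 280 02.6 03.1 99.00 99.00 99.00 999 1019.6  17.0  17.9 999.0 99.0"

# Assumption: these column/value pairs mark missing data (taken from the preview).
SENTINELS = {
    "WVHT": 99.0, "DPD": 99.0, "APD": 99.0,
    "MWD": 999.0, "DEWP": 999.0, "VIS": 99.0,
}

def parse_row(header_line, data_line):
    """Split one whitespace-delimited row and map placeholder values to None."""
    record = {}
    for name, text in zip(header_line.split(), data_line.split()):
        value = float(text)
        record[name] = None if SENTINELS.get(name) == value else value
    return record

record = parse_row(header, row)
print(record["WD"], record["WSPD"], record["WVHT"], record["DEWP"])
```

Keeping the placeholder codes per column avoids discarding a legitimate reading that happens to equal a code used elsewhere.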

## 2.3.3 Load and Clean NOAA Meteorological Data

We extract wind direction (WDIR, named WD in older files) and wind speed (WSPD), combine the separate year, month, day, and hour columns into a datetime index, and drop rows with missing values.
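
The older files store the year as two digits (`85`, `86`, ...), so the loader has to expand it before building timestamps. The rule it applies can be sketched on its own as follows. `expand_year` and its `pivot` parameter are illustrative names, not part of the notebook.

```python
from datetime import datetime

def expand_year(year, pivot=50):
    """
    Expand a 2-digit year to 4 digits; 4-digit years pass through unchanged.
    Years below the pivot are read as 20xx, the rest as 19xx.
    """
    if year >= 100:
        return year
    return year + 2000 if year < pivot else year + 1900

# First observation in the tplm2h1985.txt preview: "85 10 25 14".
first_timestamp = datetime(expand_year(85), 10, 25, 14)
print(first_timestamp)
```

The pivot of 50 matches the loader's behavior. It is safe for this station because the record starts in 1985.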


In [6]:
import pandas as pd
import os

def load_clean_wind_data(folder_path="NOAA_Raw_Meterological"):
    """
    Load and clean tplm2h*.txt NOAA meteorological files.
    Extract wind direction (WDIR) and wind speed (WSPD), with correct datetime indexing.
    """
    all_data = []

    for fname in sorted(os.listdir(folder_path)):
        if fname.startswith("tplm2h") and fname.endswith(".txt"):
            fpath = os.path.join(folder_path, fname)
            print(f"\n📄 Loading {fname}")

            try:
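                # Note: skiprows=[1] assumes the second line is a units row;
                # in files without one, it drops the first observation.
                df = pd.read_csv(
                    fpath,
                    sep=r"\s+",  # replaces delim_whitespace=True, deprecated in pandas 2.2
                    skiprows=[1],
                    na_values=["MM", "99.0", "999.0", "9999.0"]
                )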

                # === Detect datetime columns ===
                year_col = next((c for c in df.columns if c.upper() in ["YY", "#YY", "YEAR", "YYYY"]), None)
                month_col = next((c for c in df.columns if c.upper() in ["MM", "MONTH"]), None)
                day_col   = next((c for c in df.columns if c.upper() in ["DD", "DAY"]), None)
                hour_col  = next((c for c in df.columns if c.upper() in ["HH", "HR", "HOUR"]), None)

                if not (year_col and month_col and day_col and hour_col):
                    print(f"⚠️ Skipping {fname}: Incomplete datetime columns")
                    continue

                # === Handle 2-digit and 4-digit years ===
                years = df[year_col].astype(int)
                if years.max() < 100:
                    years = years.where(years >= 50, years + 2000)
                    years = years.where(years >= 100, years + 1900)

                # === Build datetime index ===
                df["datetime"] = pd.to_datetime(dict(
                    year=years,
                    month=df[month_col],
                    day=df[day_col],
                    hour=df[hour_col]
                ), errors="coerce")
                df.set_index("datetime", inplace=True)

                # === Detect wind columns ===
                wd_col = "WDIR" if "WDIR" in df.columns else "WD" if "WD" in df.columns else None
                wspd_col = "WSPD" if "WSPD" in df.columns else None

                if not (wd_col and wspd_col):
                    print(f"⚠️ Skipping {fname}: Missing WDIR or WSPD")
                    continue

                df = df[[wd_col, wspd_col]].copy()
                df.columns = ["WDIR", "WSPD"]
                df.dropna(inplace=True)

                all_data.append(df)

            except Exception as e:
                print(f"⚠️ Skipping {fname}: {e}")

    if not all_data:
        raise ValueError("❌ No valid meteorological files loaded.")

    combined = pd.concat(all_data)
    print(f"\n✅ Cleaned meteorological data. Shape: {combined.shape}")
    return combined

# Run it
met_data = load_clean_wind_data("NOAA_Raw_Meterological")
met_data.head()





📄 Loading tplm2h1985.txt

📄 Loading tplm2h1986.txt

📄 Loading tplm2h1987.txt

📄 Loading tplm2h1988.txt

📄 Loading tplm2h1989.txt

📄 Loading tplm2h1990.txt

📄 Loading tplm2h1991.txt

📄 Loading tplm2h1992.txt

📄 Loading tplm2h1993.txt

📄 Loading tplm2h1994.txt

📄 Loading tplm2h1995.txt

📄 Loading tplm2h1996.txt

📄 Loading tplm2h1997.txt

📄 Loading tplm2h1998.txt

📄 Loading tplm2h1999.txt

📄 Loading tplm2h2000.txt

📄 Loading tplm2h2001.txt

📄 Loading tplm2h2002.txt

📄 Loading tplm2h2003.txt

📄 Loading tplm2h2004.txt

📄 Loading tplm2h2005.txt

📄 Loading tplm2h2006.txt

📄 Loading tplm2h2007.txt

📄 Loading tplm2h2008.txt

📄 Loading tplm2h2009.txt

📄 Loading tplm2h2010.txt

📄 Loading tplm2h2011.txt

📄 Loading tplm2h2012.txt

📄 Loading tplm2h2013.txt

📄 Loading tplm2h2014.txt

📄 Loading tplm2h2015.txt

📄 Loading tplm2h2016.txt

📄 Loading tplm2h2017.txt

📄 Loading tplm2h2018.txt

📄 Loading tplm2h2019.txt

📄 Loading tplm2h2020.txt

📄 Loading tplm2h2021.txt

📄 Loading tplm2h2022.txt

📄 Loading t

| datetime | WDIR | WSPD |
|---|---|---|
| 1985-10-25 15:00:00 | 340.0 | 4.6 |
| 1985-10-25 16:00:00 | 330.0 | 6.2 |
| 1985-10-25 17:00:00 | 340.0 | 6.7 |
| 1985-10-25 18:00:00 | 320.0 | 5.7 |
| 1985-10-25 19:00:00 | 330.0 | 6.2 |
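
One caveat in the loader: `skiprows=[1]` always drops the line after the header. In the 1985 preview that line is the 14:00 observation, which appears to be why the cleaned output starts at 15:00. Below is a minimal sketch of a check, assuming that files with a units row mark it with a leading `#` (the `#YY` lookup in the loader suggests such files exist). `rows_to_skip` is a hypothetical helper, and the sample lines for the newer layout are illustrative, not copied from a real file.

```python
def rows_to_skip(lines):
    """Return [1] if the second line is a '#'-prefixed units row, else []."""
    if len(lines) > 1 and lines[1].lstrip().startswith("#"):
        return [1]
    return []

# Older layout, copied from the tplm2h1985.txt preview: data starts on line 2.
old_style = [
    "YY MM DD hh WD   WSPD GST  WVHT  DPD   APD  MWD  BAR    ATMP  WTMP  DEWP  VIS",
    "85 10 25 14 280 02.6 03.1 99.00 99.00 99.00 999 1019.6  17.0  17.9 999.0 99.0",
]

# Newer layout (illustrative lines): a units row follows the header.
new_style = [
    "#YY  MM DD hh mm WDIR WSPD",
    "#yr  mo dy hr mn degT m/s",
    "2007 01 01 00 00 200  5.1",
]

print(rows_to_skip(old_style), rows_to_skip(new_style))
```

The result could be passed as the `skiprows` argument after reading the first two lines of each file, so that older files keep their first observation.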


## 2.3.4 Convert Wind Direction into Cardinal Sectors

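Wind directions are binned into eight 45° compass sectors (N, NE, E, SE, S, SW, W, NW) for directional feature engineering. The bins start at 0°, so N covers [0°, 45°), NE covers [45°, 90°), and so on; the sectors are not centered on the compass points.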


In [12]:
def get_wind_sector(degrees):
    """
    Convert wind direction (°) into compass sectors (N, NE, E, ... NW).
    """
    dirs = ['N', 'NE', 'E', 'SE', 'S', 'SW', 'W', 'NW']
    bins = [0, 45, 90, 135, 180, 225, 270, 315, 360]
    return pd.cut(degrees % 360, bins=bins, labels=dirs, right=False, include_lowest=True)

# Apply sector binning
met_data["sector"] = get_wind_sector(met_data["WDIR"])
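
Because the bins above start at 0°, a wind from 350° is labeled NW even though it lies closer to due north. The sketch below compares that binning with an alternative in which each sector is centered on its compass bearing. The centered variant is not what this notebook uses, and both function names are illustrative.

```python
SECTORS = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]

def edge_aligned_sector(degrees):
    """Binning used in this notebook: N is [0, 45), NE is [45, 90), ..."""
    return SECTORS[int((degrees % 360) // 45)]

def centered_sector(degrees):
    """Alternative: N is [337.5, 360) plus [0, 22.5), NE is [22.5, 67.5), ..."""
    return SECTORS[int(((degrees % 360) + 22.5) // 45) % 8]

for direction in (0, 30, 350):
    print(direction, edge_aligned_sector(direction), centered_sector(direction))
```

Switching schemes would change every downstream feature, so the choice should stay consistent across the tutorial series.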


## 2.3.5 Aggregate Wind by Direction and Time Window

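For each sector, we compute a 14-day rolling mean of wind speed using only the hours when the wind blew from that sector.  
The hourly results are then averaged to daily values, and the first 14 days are dropped so that every remaining value is backed by a full window. A sector shows NaN on days with no observations from that direction.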
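
To make the window semantics concrete, here is a plain-Python sketch of a trailing time-window mean. It assumes the window is open on the left and closed on the right, `(t - 14 days, t]`, which is the pandas default for offset windows such as `"14D"`. The observations are made up for illustration, and `trailing_mean` is a hypothetical helper.

```python
from datetime import datetime, timedelta

def trailing_mean(observations, window=timedelta(days=14)):
    """
    For each (timestamp, value) pair, average every value whose
    timestamp s satisfies t - window < s <= t.
    """
    results = []
    for t, _ in observations:
        in_window = [v for (s, v) in observations if t - window < s <= t]
        results.append((t, sum(in_window) / len(in_window)))
    return results

# Hypothetical wind speeds (m/s) for hours when the wind came from one sector.
south_obs = [
    (datetime(2000, 1, 1), 4.0),
    (datetime(2000, 1, 10), 6.0),
    (datetime(2000, 1, 20), 8.0),
]

means = [value for _, value in trailing_mean(south_obs)]
print(means)
```

The third mean uses only the last two readings because the first one is more than 14 days old by then. This is the behavior the per-sector rolling mean relies on.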



In [17]:
# === Sort and remove duplicate timestamps ===
met_data = met_data.sort_index()
met_data = met_data[~met_data.index.duplicated(keep="first")]

# === Initialize output DataFrame ===
wind_agg = pd.DataFrame(index=met_data.index)

# === Compute 14-day rolling mean wind speed per sector ===
for sector in met_data["sector"].dropna().unique():
    sector_mask = met_data["sector"] == sector
    rolling = met_data.loc[sector_mask, "WSPD"].rolling("14D", min_periods=1).mean()
    wind_agg[f"{sector}14"] = rolling.reindex(met_data.index)

# Drop rows where all directional values are NaN
wind_agg.dropna(how="all", inplace=True)

# === Resample to daily average wind speed ===
wind_daily = wind_agg.resample("D").mean().dropna(how="all")

# Drop first 14 days to ensure full rolling window coverage
start_date = wind_daily.index.min() + pd.Timedelta(days=14)
wind_daily = wind_daily[wind_daily.index >= start_date]

# Save clean output (create the output folder if it does not exist)
os.makedirs("CleanedData", exist_ok=True)
wind_daily.to_csv("CleanedData/wind.csv")
print("✅ Saved average wind features to wind.csv (first 14 days removed)")

# Preview
wind_daily.head()


✅ Saved average wind features to wind.csv (first 14 days removed)


| datetime | NW14 | W14 | N14 | SE14 | E14 | S14 | SW14 | NE14 |
|---|---|---|---|---|---|---|---|---|
| 1985-11-08 | NaN | 5.607049 | NaN | 3.6 | NaN | 4.581683 | 3.511067 | NaN |
| 1985-11-09 | NaN | NaN | NaN | 5.016622 | NaN | 4.703702 | 3.887662 | NaN |
| 1985-11-10 | NaN | NaN | NaN | 5.983407 | NaN | 5.639175 | NaN | NaN |
| 1985-11-11 | NaN | NaN | 6.5393 | 5.84791 | 10.295455 | 6.156014 | 3.581818 | NaN |
| 1985-11-12 | NaN | NaN | 5.873218 | NaN | 9.87956 | NaN | NaN | 7.657221 |
