# Retrieving and Aggregating NWM Data for TCSO Incident Dates

**Author(s):** 

<ul style="line-height:1.5;">
<li>Nana Oye Djan <a href="mailto:ndjan@andrew.cmu.edu">(ndjan@andrew.cmu.edu)</a></li>
</ul>

**Last Updated:** 
17th July 2025

**Purpose:**

This notebook provides code to retrieve NOAA National Water Model (NWM) Retrospective data, stored in Zarr format, from Amazon Web Services (AWS).

**Description:**

This notebook downloads NWM retrospective data for the target dates and specified reaches, then merges the downloaded data with user-specified flowlines for use in our <a href="https://github.com/Kaysharp-cloud/Flo_NAVSAFE">FIM generation code</a>.

**Data Description:**

This notebook uses data developed and published by NOAA on Amazon Web Services (AWS), described in detail in <a href="https://registry.opendata.aws/nwm-archive/">this Registry of Open Data entry</a>. The NOAA National Water Model Retrospective dataset contains input and output from multi-decade CONUS retrospective simulations. These simulations used meteorological input fields from meteorological retrospective datasets. The output frequency and fields available in this historical NWM dataset differ from those contained in the real-time operational NWM forecast model. Additionally, note that no streamflow or other data assimilation is performed within any of the NWM retrospective simulations. This notebook uses the Zarr format version of this data.

The notebook takes three inputs: a CSV file listing the dates on which flooded roadways and high water occurred in Travis County, a CSV file of COMIDs, and a geodatabase layer containing the geometries of the reaches (COMIDs). It then retrieves the data from AWS for Travis County and saves it as shapefiles. The necessary data can be found <a href="https://www.hydroshare.org/resource/41b23520d92c4d6e8d7934f106e54fd3/">here</a>.

**Software Requirements:**

This notebook requires the following specific Python libraries: 

> pandas: 2.3.0 \
   geopandas: 0.14.4 \
   tqdm: 4.67.1 \
   xarray: 2025.4.0 \
   s3fs: 2025.5.1 \
   matplotlib: 3.10.0 \
   shapely: 2.1.1 \
   os: Python 3.10.14 (stdlib) \
   pathlib: Python 3.10.14 (stdlib)  
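
As a quick sanity check, the pinned versions above can be compared with what is installed in the active environment. The sketch below is not part of the original notebook: the helper name `report_versions` and the `PINNED` dictionary are illustrative, and it relies only on the standard-library `importlib.metadata`.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions listed in the Software Requirements section above
PINNED = {
    "pandas": "2.3.0",
    "geopandas": "0.14.4",
    "tqdm": "4.67.1",
    "xarray": "2025.4.0",
    "s3fs": "2025.5.1",
    "matplotlib": "3.10.0",
    "shapely": "2.1.1",
}


def report_versions(pinned):
    """Return {package: (pinned, installed or None)} (hypothetical helper)."""
    report = {}
    for name, wanted in pinned.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[name] = (wanted, installed)
    return report


for name, (wanted, installed) in report_versions(PINNED).items():
    if installed is None:
        status = "missing"
    elif installed == wanted:
        status = "ok"
    else:
        status = "differs"
    print(f"{name:<12} pinned {wanted:<10} installed {installed} ({status})")
```

A differing version does not necessarily break the notebook; the report only shows where the environment departs from the one the notebook was last run in.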
 


### 1. Import Python Libraries Needed to Run this Jupyter Notebook

In [57]:
import geopandas as gpd
import pandas as pd
import os
import xarray as xr
import s3fs
from shapely.geometry import Point
import matplotlib.pyplot as plt
from tqdm import tqdm
from pathlib import Path

### 2. Read In Inputs

In [58]:
# Read incident data from csv
incidents_csv = Path('/Users/nanaoye/Library/CloudStorage/Box-Box/My Research/CUAHSI/SI_2025/all_flooded_incidents.csv')
incidents_df  = pd.read_csv(incidents_csv, parse_dates=["response_date"])
target_dates = (incidents_df["response_date"].dt.normalize().drop_duplicates().dt.strftime("%Y-%m-%d").tolist())

# Read the list of Travis County (TC) COMIDs
comid_csv = '/Users/nanaoye/Documents/ArcGIS/Projects/Theme4DataRevised/Travis_Feature_IDs.csv'
comids = pd.read_csv(comid_csv, dtype={"IDs": str})["IDs"].astype(int).unique().tolist()

# Read flowlines (despite the variable name, this path is a file geodatabase, not a GeoPackage)
flowlines_gpkg = '/Users/nanaoye/Documents/ArcGIS/Projects/Theme4DataRevised/Theme4Data.gdb'
flowline_layer = 'P2FFlowlines'
gdf_lines = gpd.read_file(flowlines_gpkg, layer=flowline_layer)
gdf_lines["feature_id"] = gdf_lines["feature_id"].astype(int)

# Specify the output folder
out_dir = Path("/Users/nanaoye/Library/CloudStorage/Box-Box/My Research/CUAHSI/SI_2025/tcso_incidents_shps")
out_dir.mkdir(parents=True, exist_ok=True)

# NWM variable to read from the retrospective Zarr store
variable = "streamflow"
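
To make the `target_dates` step concrete, the sketch below runs the same normalize, de-duplicate, and format chain on a small made-up table. The timestamps are invented for illustration; only the `response_date` column name comes from the notebook, and the `toy_` names are placeholders.

```python
import io

import pandas as pd

# Made-up incident records; two of them fall on the same calendar day
csv_text = """response_date
2015-05-23 14:05:00
2015-05-23 22:40:00
2015-05-24 01:15:00
"""

toy_df = pd.read_csv(io.StringIO(csv_text), parse_dates=["response_date"])

toy_dates = (
    toy_df["response_date"]
    .dt.normalize()            # set the time of day to 00:00:00
    .drop_duplicates()         # keep one entry per calendar day
    .dt.strftime("%Y-%m-%d")   # format as YYYY-MM-DD strings
    .tolist()
)
print(toy_dates)  # ['2015-05-23', '2015-05-24']
```

Because each entry is a date-only string, passing it to `sel(time=...)` on a datetime-indexed dataset selects every time step that falls on that day.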

### 3. Get Retrospective Data

In [59]:
# Create an anonymous S3 filesystem object (public bucket)
fs = s3fs.S3FileSystem(anon=True)

# Point to the Zarr store
mapper = fs.get_mapper('noaa-nwm-retrospective-3-0-pds/CONUS/zarr/chrtout.zarr')

# Open the Zarr dataset using xarray
ds = xr.open_zarr(mapper, consolidated=True)

# Subset to the Travis County (TC) reaches
ds_small = ds.sel(feature_id=comids)
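
`ds.sel(feature_id=comids)` raises a `KeyError` when any requested ID is absent from the dataset's `feature_id` index. One possible guard, shown here on a tiny synthetic dataset rather than the real Zarr store, is to intersect the requested IDs with the index first. The IDs and flow values are made up, and `toy_ds`, `requested`, `found`, and `missing` are illustrative names.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Tiny synthetic stand-in for the NWM store (IDs and flows are made up)
toy_ds = xr.Dataset(
    {"streamflow": (("time", "feature_id"), np.arange(6.0).reshape(2, 3))},
    coords={
        "time": pd.to_datetime(["2015-05-23 00:00", "2015-05-23 01:00"]),
        "feature_id": [101, 102, 103],
    },
)

requested = [102, 103, 999]  # 999 is not in the dataset

# Split the requested IDs into those present in the index and those absent
available = set(toy_ds["feature_id"].values.tolist())
found = [fid for fid in requested if fid in available]
missing = [fid for fid in requested if fid not in available]

# Selecting only the IDs that exist avoids the KeyError
toy_small = toy_ds.sel(feature_id=found)
print(f"selected {toy_small.sizes['feature_id']} reaches; missing IDs: {missing}")
```

Reporting the missing IDs, rather than silently dropping them, makes it easier to spot a mismatch between the COMID list and the NWM version in use.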


### 4. Write a Peak-Flow Shapefile for Each Incident Date
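
The loop in this section reduces each reach's time steps for a date to a single peak value using `groupby(...).idxmax()`, then joins the result to the flowline attributes. The sketch below shows both steps on made-up numbers with plain pandas (no geometry); the `toy_` names and all values are illustrative only. It also contrasts the inner join with a left join, which is the case where `fillna(0)` would actually fill something.

```python
import pandas as pd

# Made-up hourly flows for two reaches on one day
hours = ["2015-05-23 00:00", "2015-05-23 01:00", "2015-05-23 02:00"]
toy_day = pd.DataFrame({
    "feature_id": [101, 101, 101, 102, 102, 102],
    "time": pd.to_datetime(hours * 2),
    "streamflow": [1.0, 5.0, 2.0, 0.3, 0.2, 0.9],
})

# idxmax returns, per reach, the row label of the largest streamflow
toy_idx = toy_day.groupby("feature_id")["streamflow"].idxmax()
toy_peak = toy_day.loc[toy_idx].rename(
    columns={"streamflow": "peak_cms", "time": "peak_time"}
)

# Stand-in for the flowline attribute table; reach 103 has no NWM rows
toy_lines = pd.DataFrame({"feature_id": [101, 102, 103], "order_": [1, 2, 1]})

# Inner join keeps only reaches present in both tables
inner = toy_lines.merge(toy_peak, on="feature_id", how="inner")

# Left join keeps every flowline; unmatched peaks are NaN until filled
left = toy_lines.merge(toy_peak, on="feature_id", how="left")
left["peak_cms"] = left["peak_cms"].fillna(0)

print(inner[["feature_id", "peak_cms"]].to_dict("records"))
print(left[["feature_id", "peak_cms"]].to_dict("records"))
```

With the inner join, reach 103 is dropped; with the left join it is kept and its missing peak is filled with zero.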

In [61]:
for date_str in target_dates:
    try:
        # Slice the COMID-subsetted dataset for the current day
        ds_day = ds_small.sel(time=date_str)

        # Lazy → in-memory as a tidy DataFrame
        df_day = (
            ds_day["streamflow"]
              .to_dataframe()
              .reset_index()
              .dropna(subset=["streamflow"])
        )

        if df_day.empty:
            print(f"⚠️ No NWM data on {date_str}")
            continue

        # Extract the peak row for every reach (idxmax gives the row label of each reach's maximum)
        idxpeak = df_day.groupby("feature_id")["streamflow"].idxmax()
        df_peak = df_day.loc[idxpeak].copy()

        # Rename columns; shapefile field names are limited to 10 characters
        df_peak = df_peak.rename(
            columns={
                "streamflow": "peak_cms",
                "time": "peak_time",
            }
        )

        # Ensure feature_id is int so it matches the flowline key
        df_peak["feature_id"] = df_peak["feature_id"].astype(int)

        # Flowline attributes to keep in the output
        keep = [
            "LakeID", "HydroID", "From_Node", "To_Node",
            "NextDownID", "feature_id", "order_", "areasqkm", "geometry"
        ]

        # Inner join: keep only flowlines that have an NWM peak for this date
        gdf_out = gdf_lines[keep].merge(df_peak, on="feature_id", how="inner")

        # Safeguard only: NaN peaks were dropped above and the join is inner, so this
        # fills nothing; use how="left" if reaches without NWM data should get a zero peak
        gdf_out["peak_cms"] = gdf_out["peak_cms"].fillna(0)

        
        if gdf_out.empty:
            print(f"⚠️ Join produced zero rows on {date_str}")
            continue

        # Drop peak_time; only the peak flow is written out
        gdf_out = gdf_out.drop(columns=["peak_time"])

        # Decode any bytes values in object-type columns (needed for SHP export)
        for col in gdf_out.columns:
            if gdf_out[col].dtype == object:
                gdf_out[col] = gdf_out[col].apply(lambda x: x.decode('utf-8') if isinstance(x, bytes) else x)

        # Write Shapefile (will inherit flowline CRS)
        shp_name = out_dir / f"nwm_peak_{date_str.replace('-','')}.shp"
        gdf_out.to_file(shp_name)
        print(
            f"✅ {date_str} → {len(gdf_out):,} reaches saved → {shp_name.name}"
        )

    except Exception as e:
        print(f"❌ Error on {date_str}: {e}")

✅ 2015-03-09 → 2,245 reaches saved → nwm_peak_20150309.shp
✅ 2015-03-21 → 2,245 reaches saved → nwm_peak_20150321.shp
✅ 2015-05-05 → 2,245 reaches saved → nwm_peak_20150505.shp
✅ 2015-05-06 → 2,245 reaches saved → nwm_peak_20150506.shp
✅ 2015-05-13 → 2,245 reaches saved → nwm_peak_20150513.shp
✅ 2015-05-17 → 2,245 reaches saved → nwm_peak_20150517.shp
✅ 2015-05-23 → 2,245 reaches saved → nwm_peak_20150523.shp
✅ 2015-05-24 → 2,245 reaches saved → nwm_peak_20150524.shp
✅ 2015-05-29 → 2,245 reaches saved → nwm_peak_20150529.shp
✅ 2015-06-18 → 2,245 reaches saved → nwm_peak_20150618.shp
✅ 2015-06-27 → 2,245 reaches saved → nwm_peak_20150627.shp
✅ 2015-10-24 → 2,245 reaches saved → nwm_peak_20151024.shp
✅ 2015-10-26 → 2,245 reaches saved → nwm_peak_20151026.shp
✅ 2015-10-30 → 2,245 reaches saved → nwm_peak_20151030.shp
✅ 2015-10-31 → 2,245 reaches saved → nwm_peak_20151031.shp
✅ 2015-11-28 → 2,245 reaches saved → nwm_peak_20151128.shp
✅ 2015-12-13 → 2,245 reaches saved → nwm_peak_20151213.s