# Fetching and Summarizing NWM Gridded Data

## Overview
In this guide we'll demonstrate fetching National Water Model (NWM) operational and retrospective gridded data from cloud storage. This example makes use of a pre-generated Evaluation dataset stored in TEEHR's examples data module.

**Note**: For demonstration purposes several cells below are shown in markdown form. If you want to download this notebook and run them yourself, you will need to convert them to code cells.

For a refresher on loading location and location crosswalk data into a new Evaluation refer back to the [Loading Local Data](https://rtiinternational.github.io/teehr/user_guide/notebooks/02_loading_local_data.html) and [Setting-up a Simple Example](https://rtiinternational.github.io/teehr/user_guide/notebooks/04_setup_simple_example.html) user guide pages.

### Set up the example Evaluation

In [1]:
from datetime import datetime
from pathlib import Path
import shutil
import hvplot.pandas  # noqa
import pandas as pd

from teehr.examples.setup_nwm_streamflow_example import setup_nwm_example
import teehr
from teehr.evaluation.utils import print_tree

# Tell Bokeh to output plots in the notebook
from bokeh.io import output_notebook
output_notebook()

We'll start with the Evaluation from the previous User Guide, which contains observed (USGS) and simulated (NWM) streamflow for a gage at Radford, VA during Hurricane Helene.

In [2]:
# Define the directory where the Evaluation will be created.
test_eval_dir = Path(Path().home(), "temp", "11_fetch_gridded_nwm_data")

# # Setup the example evaluation using data from the TEEHR repository.
# shutil.rmtree(test_eval_dir, ignore_errors=True)
# setup_nwm_example(tmpdir=test_eval_dir)

# Initialize the evaluation.
ev = teehr.Evaluation(dir_path=test_eval_dir)



In [3]:
ev.locations.to_geopandas()

                                                                                

Unnamed: 0,id,name,geometry
0,wbd-0505000115,Peak Creek-New River,"MULTIPOLYGON (((-80.81589 37.13150, -80.81594 ..."
1,wbd-0505000117,Lower Little River,"MULTIPOLYGON (((-80.43397 37.11944, -80.43478 ..."
2,wbd-0505000118,Back Creek-New River,"MULTIPOLYGON (((-80.40439 37.30049, -80.40460 ..."
3,usgs-03171000,"NEW RIVER AT RADFORD, VA",POINT (-80.56922 37.14179)


Next we'll add polygons over which to summarize rainfall, so we can see how the forcing data affects the NWM streamflow prediction.

In [4]:
# location_data_path = Path(test_eval_dir, "two_locations.parquet")
# two_locations_data.fetch_file("two_locations.parquet", location_data_path)

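# Path to a local copy of the HUC10 polygons (adjust this to your own checkout of the TEEHR repository).
location_data_path = Path("/home/sam/git/teehr/tests/data/fetch_nwm_streamflow/three_huc10s_radford.parquet")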

In [5]:
ev.locations.load_spatial(
    location_data_path,
    field_mapping={
        "huc10": "id"
    },
    location_id_prefix="wbd",
    write_mode="append"  # this is the default
)

In [6]:
gdf = ev.locations.to_geopandas()
gdf

Unnamed: 0,id,name,geometry
0,wbd-0505000115,Peak Creek-New River,"MULTIPOLYGON (((-80.81589 37.13150, -80.81594 ..."
1,wbd-0505000117,Lower Little River,"MULTIPOLYGON (((-80.43397 37.11944, -80.43478 ..."
2,wbd-0505000118,Back Creek-New River,"MULTIPOLYGON (((-80.40439 37.30049, -80.40460 ..."
3,usgs-03171000,"NEW RIVER AT RADFORD, VA",POINT (-80.56922 37.14179)


In [7]:
gdf[gdf.id != "usgs-03171000"].hvplot(geo=True, tiles="OSM", alpha=0.5) * \
    gdf[gdf.id == "usgs-03171000"].hvplot(geo=True, color="black", size=50)

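Now we'll fetch the NWM gridded analysis and assimilation (AnA) forcing data and calculate the mean areal precipitation (MAP) for each polygon. We'll consider this "observed" rainfall since it's MRMS. The `location_id_prefix` argument filters the locations so that only our polygons (the `wbd` IDs) are included. Because append is the default write mode, the rainfall is added to the primary timeseries table alongside the streamflow already there, so the table will contain both variables.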

In [None]:
ev.fetch.nwm_operational_grids(
    nwm_configuration="forcing_analysis_assim",
    output_type="forcing",
    variable_name="RAINRATE",
    start_date=datetime(2024, 9, 26),
    ingest_days=1,
    nwm_version="nwm30",
    prioritize_analysis_valid_time=True,
    t_minus_hours=[0],
    calculate_zonal_weights=True,
    location_id_prefix="wbd",
    timeseries_type="primary"  # Considered "observations". This is the default.
)



In [8]:
primary_df = ev.primary_timeseries.to_pandas()
primary_df

Unnamed: 0,value_time,value,unit_name,location_id,configuration_name,variable_name,reference_time
0,2024-09-26 00:00:00,286.000153,m^3/s,usgs-03171000,usgs_observations,streamflow_hourly_inst,NaT
1,2024-09-26 01:00:00,274.956573,m^3/s,usgs-03171000,usgs_observations,streamflow_hourly_inst,NaT
2,2024-09-26 02:00:00,270.425873,m^3/s,usgs-03171000,usgs_observations,streamflow_hourly_inst,NaT
3,2024-09-26 03:00:00,273.257568,m^3/s,usgs-03171000,usgs_observations,streamflow_hourly_inst,NaT
4,2024-09-26 04:00:00,286.000153,m^3/s,usgs-03171000,usgs_observations,streamflow_hourly_inst,NaT
...,...,...,...,...,...,...,...
188,2024-09-26 20:00:00,0.000005,mm/s,wbd-0505000117,nwm30_forcing_analysis_assim,rainfall_hourly_rate,2024-09-26 20:00:00
189,2024-09-26 20:00:00,0.000113,mm/s,wbd-0505000118,nwm30_forcing_analysis_assim,rainfall_hourly_rate,2024-09-26 20:00:00
190,2024-09-26 12:00:00,0.000479,mm/s,wbd-0505000115,nwm30_forcing_analysis_assim,rainfall_hourly_rate,2024-09-26 12:00:00
191,2024-09-26 12:00:00,0.000048,mm/s,wbd-0505000117,nwm30_forcing_analysis_assim,rainfall_hourly_rate,2024-09-26 12:00:00
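The rainfall rows above are stored in mm/s, which makes them look tiny next to the streamflow values. A minimal sketch of converting a rate to an hourly depth, assuming each value is the mean rate over a one-hour step (the helper `rate_to_hourly_depth` is hypothetical and not part of TEEHR):

```python
SECONDS_PER_HOUR = 3600


def rate_to_hourly_depth(rate_mm_per_s: float) -> float:
    """Convert a mean rainfall rate (mm/s) to the depth (mm) over one hour."""
    return rate_mm_per_s * SECONDS_PER_HOUR


# Rates copied from the table above for 2024-09-26 12:00.
rates = {
    "wbd-0505000115": 0.000479,
    "wbd-0505000117": 0.000048,
}

# Assumption: each value is an hourly mean rate, so depth = rate * 3600 s.
depths = {loc: rate_to_hourly_depth(rate) for loc, rate in rates.items()}

for loc, depth in depths.items():
    print(f"{loc}: {depth:.2f} mm in the hour")
```

Working in depth per hour makes it easier to compare the basin rainfall against familiar storm totals.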


In [9]:
primary_df[primary_df["location_id"] == "wbd-0505000117"].value_time.min()

Timestamp('2024-09-26 00:00:00')

In [10]:

primary_df[primary_df["location_id"] == "wbd-0505000117"].hvplot() * \
    primary_df[primary_df["location_id"] == "wbd-0505000118"].hvplot() * \
    primary_df[primary_df["location_id"] == "wbd-0505000115"].hvplot()

Let's take a look at the `cache` directory. After fetching the NWM analysis forcing, the `fetching` directory contains a `weights` directory for the pixel weights file and a `kerchunk` directory holding one JSON reference file per remote NWM NetCDF file.

In [11]:
print_tree(ev.cache_dir, max_depth=2)

├── readme.md
├── fetching
│   ├── weights
│   │   └── nwm30_forcing_analysis_assim
│   ├── kerchunk
│   │   ├── nwm.20240926.nwm.t02z.analysis_assim.forcing.tm00.conus.nc.json
│   │   ├── nwm.20240926.nwm.t20z.analysis_assim.forcing.tm00.conus.nc.json
│   │   ├── nwm.20240926.nwm.t21z.analysis_assim.forcing.tm00.conus.nc.json
│   │   ├── nwm.20240926.nwm.t22z.analysis_assim.forcing.tm00.conus.nc.json
│   │   ├── nwm.20240926.nwm.t08z.analysis_assim.forcing.tm00.conus.nc.json
│   │   ├── nwm.20240926.nwm.t04z.analysis_assim.forcing.tm00.conus.nc.json
│   │   ├── nwm.20240926.nwm.t12z.analysis_assim.forcing.tm00.conus.nc.json
│   │   ├── nwm.20240926.nwm.t13z.analysis_assim.forcing.tm00.conus.nc.json
│   │   ├── nwm.20240926.nwm.t18z.analysis_assim.forcing.tm00.conus.nc.json
│   │   ├── nwm.20240926.nwm.t10z.analysis_assim.forcing.tm00.conus.nc.json
│   │   ├── nwm.20240926.nwm.t05z.analysis_assim.forcing.tm00.conus.nc.json
│   │   ├── nwm.20240926.nwm.t01z.analysis_assim.forcing.tm00.c
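The file names in the `kerchunk` directory encode the date, the cycle hour (`tNNz`) and the look-back (`tmNN`). A rough sketch of decoding one, assuming the pattern inferred from the listing above and assuming `tmNN` is a look-back in hours from the cycle time (the function `parse_reference_name` is hypothetical, not a TEEHR helper):

```python
import re
from datetime import datetime, timedelta

# Pattern inferred from the file names listed above; treat it as an assumption.
PATTERN = re.compile(
    r"nwm\.(?P<date>\d{8})\.nwm\.t(?P<cycle>\d{2})z\."
    r"analysis_assim\.forcing\.tm(?P<tm>\d{2})\.conus\.nc\.json"
)


def parse_reference_name(name: str) -> dict:
    """Return the cycle time and valid time encoded in a reference file name."""
    match = PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"Unrecognized file name: {name}")
    cycle_time = datetime.strptime(match["date"] + match["cycle"], "%Y%m%d%H")
    # Assumption: tmNN means "NN hours before the cycle time".
    lookback = timedelta(hours=int(match["tm"]))
    return {"cycle_time": cycle_time, "valid_time": cycle_time - lookback}


info = parse_reference_name(
    "nwm.20240926.nwm.t02z.analysis_assim.forcing.tm00.conus.nc.json"
)
print(info["cycle_time"], info["valid_time"])
```

With `t_minus_hours=[0]` only the `tm00` files are fetched, so under this reading the valid time equals the cycle time, which matches the identical `value_time` and `reference_time` in the rainfall rows of the primary timeseries table.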

In [12]:
print_tree(Path(ev.cache_dir, "fetching", "weights"))

└── nwm30_forcing_analysis_assim
    └── nwm30_forcing_analysis_assim_pixel_weights.parquet


The weights file contains the fractional coverage for each pixel (row/col pair) that intersects with each polygon location (WBD HUC10 watershed).

In [17]:
cached_weights_filepath = Path(
    ev.cache_dir,
    "fetching",
    "weights",
    "nwm30_forcing_analysis_assim",
    "nwm30_forcing_analysis_assim_pixel_weights.parquet"
)
pd.read_parquet(cached_weights_filepath)

Unnamed: 0,row,col,weight,location_id
0,1730,3704,0.045272,wbd-0505000115
1,1730,3705,0.246893,wbd-0505000115
2,1730,3706,0.090052,wbd-0505000115
3,1730,3725,0.012040,wbd-0505000117
4,1730,3726,0.001101,wbd-0505000117
...,...,...,...,...
1599,1774,3726,0.801455,wbd-0505000118
1600,1774,3727,0.028692,wbd-0505000118
1601,1775,3724,0.377646,wbd-0505000118
1602,1775,3725,0.784341,wbd-0505000118
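To illustrate how a table like this can turn a grid into one value per polygon, here is a minimal sketch of a coverage-weighted mean. It is written in plain Python so the arithmetic is visible; the function `zonal_means`, the tiny grid, and the location IDs are made up for illustration, and whether TEEHR normalizes the weights in exactly this way is an assumption.

```python
from collections import defaultdict


def zonal_means(grid, weights):
    """Coverage-weighted mean of grid values for each location.

    grid: 2-D sequence indexed as grid[row][col]
    weights: iterable of (row, col, weight, location_id) tuples
    """
    weighted_sum = defaultdict(float)
    weight_total = defaultdict(float)
    for row, col, weight, location_id in weights:
        weighted_sum[location_id] += grid[row][col] * weight
        weight_total[location_id] += weight
    # Normalize by the total weight so partially covered pixels count less.
    return {loc: weighted_sum[loc] / weight_total[loc] for loc in weighted_sum}


# Tiny made-up grid and weights, not values from the NWM files.
grid = [
    [1.0, 2.0],
    [3.0, 4.0],
]
weights = [
    (0, 0, 0.5, "wbd-A"),   # half of this pixel lies inside polygon A
    (0, 1, 1.0, "wbd-A"),   # this pixel lies fully inside polygon A
    (1, 1, 0.25, "wbd-B"),  # a quarter of this pixel lies inside polygon B
]

means = zonal_means(grid, weights)
print(means)
```

Because the weights depend only on the grid and the polygons, they can be computed once and re-used for any other fetch on the same grid.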


Now we can fetch the NWM forecast rainfall to compare it to the "observed" rainfall. First we need to create and import the `location_crosswalks` table, which maps the secondary locations ("model") to the primary ("observations").

In [14]:
ev.locations.to_pandas()["id"].unique()

array(['wbd-0505000115', 'wbd-0505000117', 'wbd-0505000118',
       'usgs-03171000'], dtype=object)

In [15]:
xwalk_df = pd.DataFrame(
    {
        "primary_location_id": [
            "wbd-0505000115",
            "wbd-0505000117",
            "wbd-0505000118"
        ],
        "secondary_location_id": [
            "forecast-0505000115",
            "forecast-0505000117",
            "forecast-0505000118"
        ]
    }
)
temp_xwalk_path = Path(ev.cache_dir, "loading", "forcing_xwalk.parquet")
xwalk_df.to_parquet(temp_xwalk_path)
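With only three polygons the crosswalk is easy to type by hand. For a larger set of locations, the secondary IDs could be derived from the primary IDs instead. A small sketch, assuming every ID has the form `prefix-suffix` (the helper `to_secondary_id` is hypothetical, not a TEEHR function):

```python
def to_secondary_id(primary_id: str, new_prefix: str = "forecast") -> str:
    """Swap the prefix before the first hyphen, e.g. 'wbd-X' -> 'forecast-X'."""
    _, _, suffix = primary_id.partition("-")
    if not suffix:
        raise ValueError(f"Expected an ID like 'prefix-suffix', got: {primary_id}")
    return f"{new_prefix}-{suffix}"


location_ids = [
    "wbd-0505000115",
    "wbd-0505000117",
    "wbd-0505000118",
    "usgs-03171000",
]

# Keep only the polygon locations, then build the two crosswalk columns.
primary_ids = [loc for loc in location_ids if loc.startswith("wbd-")]
xwalk_columns = {
    "primary_location_id": primary_ids,
    "secondary_location_id": [to_secondary_id(loc) for loc in primary_ids],
}
print(xwalk_columns)
```

The resulting dictionary has the same two columns as the one passed to `pd.DataFrame` above.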

Now we can load it into the Evaluation, appending to the existing table.

In [16]:
# Take a look at the existing crosswalk, which maps the point locations.
ev.location_crosswalks.to_pandas()

Unnamed: 0,primary_location_id,secondary_location_id
0,usgs-03171000,nwm30-6884666
1,wbd-0505000115,forecast-0505000115
2,wbd-0505000117,forecast-0505000117
3,wbd-0505000118,forecast-0505000118


In [19]:
ev.location_crosswalks.load_parquet(temp_xwalk_path)

                                                                                

In [21]:
ev.location_crosswalks.to_pandas()

Unnamed: 0,primary_location_id,secondary_location_id
0,usgs-03171000,nwm30-6884666
1,wbd-0505000115,forecast-0505000115
2,wbd-0505000117,forecast-0505000117
3,wbd-0505000118,forecast-0505000118


Now we can fetch the NWM forecast rainfall. Let's grab a single short range forecast.

In [19]:
ev.fetch.nwm_operational_grids(
    nwm_configuration="forcing_short_range",
    output_type="forcing",
    variable_name="RAINRATE",
    start_date=datetime(2024, 9, 26),
    ingest_days=1,
    nwm_version="nwm30",
    location_id_prefix="forecast",
    calculate_zonal_weights=False,  # re-use the weights file in the cache
    zonal_weights_filepath=cached_weights_filepath,
    starting_z_hour=1,
    ending_z_hour=1,
    timeseries_type="secondary"  # now considered forecast
)

AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `member` cannot be resolved. Did you mean one of the following? [`value`, `unit_name`, `value_time`, `location_id`, `variable_name`].;
'Project [reference_time#773, value_time#772, value#769, variable_name#771, configuration_name#774, unit_name#770, location_id#768, 'member]
+- Relation [location_id#768,value#769,unit_name#770,variable_name#771,value_time#772,reference_time#773,configuration_name#774,__index_level_0__#775L] parquet
