# Collect and Manipulate AORC Meteorological Data

**Authors:**  
   - Tony Castronova <acastronova@cuahsi.org>    
   - Irene Garousi-Nejad <igarousi@cuahsi.org>  
    
**Last Updated:** 

**Description**:  

This notebook explores methods for working with large cloud data stores using tools such as `xarray`, `dask`, and `geopandas`. It uses the Analysis of Record for Calibration (AORC) meteorological dataset, which provides the forcing data for the NOAA National Water Model. The notebook provides examples of how to access, slice, and visualize a large cloud-hosted dataset, as well as an approach for aligning these data with watershed vector boundaries.

The data used in this notebook can be found at https://registry.opendata.aws/nwm-archive/.


**Software Requirements**:  

The software and operating system versions used to develop this notebook are listed below. To avoid version conflicts among Python packages, we recommend creating a new environment and installing the required packages specifically for this notebook.

> fsspec    : 2024.6.0  
geopandas : 0.14.4  
numpy     : 1.26.4  
matplotlib: 3.9.0  
sys       : 3.12.3   
dask      : 2024.5.2  
xarray    : 2024.5.0  
rioxarray : 0.15.5  
geocube   : 0.5.2  
s3fs      : 2024.6.0  
zarr      : 2.18.2  
---

In [None]:
import sys
import dask
import numpy
import xarray as xr
import fsspec
import rioxarray
import geopandas as gpd
import matplotlib.pyplot as plt
from dask.distributed import Client
from geocube.api.core import make_geocube

import warnings
warnings.filterwarnings("ignore")

We'll use `dask` to parallelize our code. This powerful library has been integrated into libraries such as `xarray`, which lets us use its capabilities without writing any parallel code ourselves. Writing parallel code directly with `dask` is also straightforward and well documented; for more information, see their website [here](https://www.dask.org/).

In this notebook, we'll be using `dask` to speed up our access of the AORC dataset. To visualize the progress of long running jobs, we'll first need to create a "cluster." The cluster defines the number of workers and their respective computing resources. This should be scaled to the hardware that you have access to.

In [None]:
# use a try/except block so we only instantiate the client
# if it doesn't already exist.
try:
    print(client.dashboard_link)
except NameError:
    # The client should be customized to your workstation resources.
    # This is configured for a "Large" instance on ciroh.awi.2i2c.cloud
    # client = Client()
    client = Client(n_workers=8, memory_limit='10GB') # Large Machine
    print(client.dashboard_link)
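
As a rough guide for scaling the cluster to your own hardware, the sketch below derives a worker count and a per-worker memory limit from the machine's CPU count and a total memory figure that you supply. The helper `suggest_cluster_size` and its `reserve_fraction` parameter are hypothetical names introduced here for illustration; they are not part of `dask`.

```python
import os


def suggest_cluster_size(total_memory_gb, reserve_fraction=0.2, max_workers=None):
    """Suggest (n_workers, memory_limit) for a local cluster.

    Hypothetical helper: one worker per CPU (optionally capped), with the
    memory left after reserving `reserve_fraction` split evenly among workers.
    """
    n_cpus = os.cpu_count() or 1
    n_workers = n_cpus if max_workers is None else min(n_cpus, max_workers)
    usable_gb = total_memory_gb * (1.0 - reserve_fraction)
    per_worker_gb = usable_gb / n_workers
    return n_workers, f"{per_worker_gb:.1f}GB"


# Example: a machine with 16 GB of RAM, using at most 4 workers (assumed values)
n_workers, memory_limit = suggest_cluster_size(total_memory_gb=16, max_workers=4)
print(n_workers, memory_limit)
```

The resulting values could then be passed along as `Client(n_workers=n_workers, memory_limit=memory_limit)`.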

---

## Access the AORC Forcing Data using Xarray

In this notebook we'll be working with AORC v1.1 meteorological forcing. These data are publicly available as part of the NOAA National Water Model v3.0 Retrospective archive on the AWS Registry of Open Data. They are stored in the `Zarr` format, which offers a convenient and efficient means for slicing and subsetting very large datasets using libraries such as `xarray`. The following links point to the data, which can be helpful for understanding what data are available and how they are structured:

Homepage  : https://registry.opendata.aws/nwm-archive/  
Zarr Store: https://noaa-nwm-retrospective-3-0-pds.s3.amazonaws.com/index.html#CONUS/zarr/ 


Define a few parameters for accessing the specific variable that we're interested in.

In [None]:
bucket_url = 's3://noaa-nwm-retrospective-3-0-pds'
region = 'CONUS'
variable = 'precip'

We'll use the `fsspec` library to load these data. The `fsspec` library provides a filesystem interface for accessing remote data such as the AORC Zarr store on AWS. To learn more about `fsspec`, see their documentation [here](https://filesystem-spec.readthedocs.io/en/latest/). Since these data are stored in an S3 bucket, `fsspec` will leverage the `s3fs` package to provide a filesystem interface to S3.

In [None]:
# build a path to the zarr store that we want
s3path = f"{bucket_url}/{region}/zarr/forcing/{variable}.zarr"

# load these data using xarray
ds = xr.open_zarr(fsspec.get_mapper(s3path, anon=True), consolidated=True)

ds

Notice that this loaded very fast. That's because it performed a "lazy" load of the data, i.e. only the metadata was loaded. Data values will not be accessed until computations are performed.

In [None]:
print(f'Size of ds={sys.getsizeof(ds)} bytes')
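
Note that `sys.getsizeof` reports the size of the Python object that wraps the dataset, not the size of the data it points to. For the uncompressed data size, xarray exposes `ds.nbytes`. The arithmetic behind that number can be sketched as follows; the grid dimensions used here are assumed values for illustration only and should be replaced with those reported by `ds.sizes`.

```python
def uncompressed_size_gb(n_time, n_y, n_x, bytes_per_value=8):
    """Estimate the in-memory size (in GB) of one dense variable."""
    return n_time * n_y * n_x * bytes_per_value / 1e9


# Assumed dimensions for illustration: one year of hourly steps on a
# 3840 x 4608 grid of 8-byte floats. Read the real values from ds.sizes.
size_gb = uncompressed_size_gb(n_time=8760, n_y=3840, n_x=4608)
print(f"Estimated uncompressed size: {size_gb:.0f} GB")
```

Even a single year on a grid of this size is far larger than typical workstation memory, which is why lazy loading and subsetting matter.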

## Slicing and Visualizing the AORC Data

Since this is a lot of data, let's reduce the size that we're looking at to a single timestep. We can do this in a number of ways, but in this case we'll simply select the first index along the time dimension. For more information on slicing data, see the xarray documentation [here](https://docs.xarray.dev/en/v2023.09.0/user-guide/indexing.html).

#### Slice these Data using Xarray Indexing

In [None]:
# Slice data using index locators
ds_sel = ds.isel(time=0)
ds_sel

In [None]:
# select using multiple locators
ds_sel = ds.isel(time=0, x=1000, y=1000)
ds_sel

In [None]:
# Query the RAINRATE value associated with this slice of the AORC 
ds_sel.RAINRATE.values
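
To make the difference between positional and label-based selection concrete, here is a minimal sketch on a small synthetic dataset. The dataset `toy`, its coordinate values, and its data values are made up for illustration and only mimic the layout of the AORC data (`time`, `y`, `x` dimensions with a `RAINRATE` variable).

```python
import numpy as np
import xarray as xr

# Small synthetic grid standing in for the AORC dataset (values are made up)
toy = xr.Dataset(
    {"RAINRATE": (("time", "y", "x"), np.arange(24, dtype=float).reshape(2, 3, 4))},
    coords={
        "time": np.array(["2020-01-01T00", "2020-01-01T01"], dtype="datetime64[ns]"),
        "y": [0.0, 1000.0, 2000.0],
        "x": [0.0, 1000.0, 2000.0, 3000.0],
    },
)

# isel selects by integer position along each dimension
by_position = toy.isel(time=0, y=1, x=2).RAINRATE.item()

# sel selects by coordinate label; these labels point at the same cell
by_label = toy.sel(time="2020-01-01T00", y=1000.0, x=2000.0).RAINRATE.item()

# method="nearest" snaps arbitrary coordinates to the closest grid cell
nearest = toy.sel(x=1900.0, y=1100.0, method="nearest").isel(time=0).RAINRATE.item()

print(by_position, by_label, nearest)
```

All three selections return the value of the same grid cell, reached by position, by exact label, and by nearest label respectively.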

Selecting a specific data point in time is not very useful. Often we're interested in a time series of data. Let's extend our previous example to collect data through a range of time. This can be done using indexing as before:

In [None]:
ds_sel = ds.isel(time=range(0, 100), # use range to select multiple indices
                 x=1000,
                 y=1000)
ds_sel

In [None]:
# Plot the timeseries of data associated with the RAINRATE variable
ds_sel.RAINRATE.plot();

We now have a timeseries at a single grid cell, but we defined the time range using array indexing. This makes it difficult to select a specific time range of interest. We can use datetime slicing instead of array indexing to acquire a more precise time range. First, let's figure out the date range for which data are available by returning the minimum and maximum dates in the dataset.

In [None]:
dt_min = ds.time.min().values
dt_max = ds.time.max().values
print(f'The daterange of our data is {dt_min} - {dt_max}')

Next we can slice our data for a time span within this range.

In [None]:
# select the spatial area of interest using array indexing
ds_sel = ds.isel(x=1000,
                 y=1000)

# select the time span of interest using date range slicing
ds_sel = ds_sel.sel(time=slice('2020-01-01', '2021-01-01'))

ds_sel
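
One detail worth knowing is that label-based slices include both endpoints, and a date given without a time of day covers that entire day. The sketch below checks this on a synthetic hourly time axis; the array `series` is made up for illustration and assumes an hourly time step.

```python
import numpy as np
import xarray as xr

# Synthetic hourly time axis that extends beyond the range we will select
time = np.arange(
    np.datetime64("2019-12-01T00"),
    np.datetime64("2021-02-01T00"),
    np.timedelta64(1, "h"),
)
series = xr.DataArray(np.zeros(time.size), coords={"time": time}, dims="time")

# Both endpoints are included, and '2021-01-01' covers that whole day
subset = series.sel(time=slice("2020-01-01", "2021-01-01"))

first = subset.time.values[0]
last = subset.time.values[-1]
n_steps = subset.sizes["time"]
print(first, last, n_steps)
```

With hourly data, this slice therefore spans all of 2020 plus the 24 hours of 2021-01-01.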

In [None]:
# Plot the timeseries of data associated with the RAINRATE variable.
ds_sel.RAINRATE.plot();

Similarly we can modify our example to select a spatial range rather than a single grid cell. This can easily be done using array indexing:

In [None]:
# select the spatial area of interest using array indexing
ds_sel = ds.isel(x=range(1000, 2000),
                 y=range(1000,2000))

# select the time span of interest using date range slicing
ds_sel = ds_sel.sel(time=slice('2020-01-01', '2021-01-01'))

ds_sel

We now have 1000x1000 arrays of data stacked through time. Plotting becomes a bit more tricky here, but we can preview our data by plotting at a single time step.

In [None]:
# select a single time within our data cube
rainrate = ds_sel.isel(time=5006).RAINRATE

# plot the values where rainrate is greater than 0.0
rainrate.where(rainrate > 0.0).plot();

We can extend this example to select a spatial region using coordinate values instead of array indices.

In [None]:
ymin = -846500.312
ymax = -786500.312
xmin = -1274499.125
xmax = -426499.1875

# select the time span and spatial area of interest using slicing
ds_sel = ds_sel.sel(time=slice('2020-01-01', '2021-01-01'),
                    y=slice(ymin, ymax),
                    x=slice(xmin, xmax))
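
Conceptually, slicing by coordinate values amounts to finding the range of array indices whose coordinates fall within the requested bounds. The sketch below shows that lookup with `numpy`, assuming the coordinate array is sorted in ascending order. The helper `bounds_to_index_range` is a hypothetical name introduced here and is not how xarray implements `sel`.

```python
import numpy as np


def bounds_to_index_range(coords, lower, upper):
    """Return (start, stop) so that coords[start:stop] lies within [lower, upper].

    Assumes `coords` is sorted in ascending order.
    """
    start = int(np.searchsorted(coords, lower, side="left"))
    stop = int(np.searchsorted(coords, upper, side="right"))
    return start, stop


# Made-up x coordinates on a 1000 m grid
x = np.arange(-2000.0, 3000.0, 1000.0)
start, stop = bounds_to_index_range(x, -1500.0, 1000.0)
print(start, stop, x[start:stop])
```

The returned index range could then be used with `isel`, for example `ds.isel(x=slice(start, stop))`.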

In [None]:
# plot a single time step within our data cube
ds_sel.isel(time=5006).RAINRATE.plot();

## Aligning Gridded AORC with Watershed Vectors

Often we are interested in gridded data that aligns with a vector area such as a watershed boundary. We can align the AORC gridded data with vector features using the `geocube` library. First, let's load a watershed Shapefile that defines our area of interest. `GeoPandas` makes working with Shapefiles in Python very easy and intuitive:

In [None]:
# load the watershed shapefile
gdf = gpd.read_file('sample-data/watershed.shp')

# preview the watershed
gdf.plot()

We can also preview the attributes of this shapefile.

In [None]:
gdf

To align the gridded AORC data on these vector boundaries, we first need to set the coordinate reference system (CRS) within the xarray dataset. Its CRS is defined in the metadata, but it isn't set in a way that we can leverage. Let's change that by using the rasterio extension to xarray, called `rioxarray`.

In [None]:
# set the crs in the dataset
ds.rio.set_crs(ds.crs.attrs['esri_pe_string'])
ds.rio.write_crs(inplace=True)

Next we need to make sure that the AORC CRS matches that of our watershed. If these don't align, we'll need to perform geospatial transformations before moving on.

In [None]:
print(f'AORC CRS:\n-----\n{ds.rio.crs.to_proj4()}')
print(f'\nShapefile CRS:\n-----\n{gdf.crs.to_proj4()}')

Since these coordinate systems differ, we'll need to convert one of them so that they align.

In [None]:
# convert the shapefile into the coordinate system of the xarray dataset
gdf = gdf.to_crs(ds.rio.crs)
print(f'\nShapefile CRS:\n-----\n{gdf.crs.to_proj4()}')

Let's clip the AORC data to the extent of this watershed. This can be done using `rioxarray`'s "clip" method.

In [None]:
# clip the data
ds_sel = ds.rio.clip(
         gdf.geometry.values,
         gdf.crs,
         all_touched=True,   # select all grid cells that touch the vector boundary
         drop=True,          # drop anything that is outside the clipped region
         invert=False,
         from_disk=True)
ds_sel

Preview our data at a single point in time. We'll use some `matplotlib` features to make a more interesting plot that contains both our gridded data as well as our vector data.

In [None]:
fig, ax = plt.subplots()

# add RAINRATE at a single time to the plot
ds_sel.isel(time=5006).RAINRATE.plot(ax=ax)

# add our watershed to the plot
gdf.plot(ax=ax, facecolor='none', edgecolor='k')

We've clipped the AORC dataset to the extent of our watershed boundary, but it still has no relation to the individual subcatchments. To better connect these two datasets, we can create a new dataset variable that represents a mask of grid cells that are associated with each subcatchment. We'll use the `geocube` library to accomplish this task.

Note that the method we're using will associate each grid cell with the watershed that it overlaps the most with. There are more advanced ways to create a mapping using various interpolation methods that distribute cell values across all watershed boundaries that they intersect. This is left as a future exercise.
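
To illustrate what a one-catchment-per-cell mask looks like, the sketch below uses a simplified stand-in rule: each cell is assigned to the polygon that contains its center. This is not the largest-overlap rule described above and it is not `geocube`'s implementation. The functions `point_in_polygon` and `rasterize_by_cell_center`, along with the two square "catchments", are hypothetical and exist only for this example.

```python
import numpy as np


def point_in_polygon(px, py, polygon):
    """Ray-casting test: is the point (px, py) inside the polygon?"""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge cross the horizontal line through py?
        if (y1 > py) != (y2 > py):
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if px < x_cross:
                inside = not inside
    return inside


def rasterize_by_cell_center(polygons, x_centers, y_centers):
    """Build a (y, x) mask of polygon ids; NaN where no polygon contains the center."""
    mask = np.full((len(y_centers), len(x_centers)), np.nan)
    for cat_id, polygon in polygons.items():
        for row, cy in enumerate(y_centers):
            for col, cx in enumerate(x_centers):
                if point_in_polygon(cx, cy, polygon):
                    mask[row, col] = cat_id
    return mask


# Two made-up square catchments side by side
polygons = {
    1: [(0, 0), (2, 0), (2, 2), (0, 2)],
    2: [(2, 0), (4, 0), (4, 2), (2, 2)],
}
x_centers = np.array([0.5, 1.5, 2.5, 3.5])
y_centers = np.array([0.5, 1.5])

mask = rasterize_by_cell_center(polygons, x_centers, y_centers)
print(mask)
```

The resulting array has the same role as the `cat` variable created in the next cell: one identifier per grid cell, with NaN where no catchment applies.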

In [None]:
# create zonal id column
gdf['cat'] = gdf.id.str.split('-').str[-1].astype(int)

# select a single array of data to use as a template
rainrate_data = ds_sel.isel(time=0).RAINRATE

# create a grid for the geocube
out_grid = make_geocube(
    vector_data=gdf,
    measurements=["cat"],
    like=ds_sel # ensure the data are on the same grid
)

# add the catchment variable to the original dataset
ds_sel = ds_sel.assign_coords(cat = (['y','x'], out_grid.cat.data))

# compute the unique catchment IDs which will be used to compute zonal statistics
catchment_ids = numpy.unique(ds_sel.cat.data[~numpy.isnan(ds_sel.cat.data)])

print(f'The dataset contains {len(catchment_ids)} catchments')
ds_sel

We can now select and plot data for spatial areas that correspond with our catchment identifiers.

In [None]:
fig, ax = plt.subplots()

# plot the catchment identifier assigned to each grid cell
ds_sel.isel(time=5006).cat.plot(ax=ax, levels=35, cmap='gist_ncar');

# add our watershed to the plot
gdf.plot(ax=ax, facecolor='none', edgecolor='k');

# adjust the x and y limits of the plot so we can see the entire watershed.
ax.set_xlim(ds_sel.x.min(), ds_sel.x.max())
ax.set_ylim(ds_sel.y.min(), ds_sel.y.max())

We can plot data for a single catchment by filtering on its catchment identifier. These identifiers are defined by the geopandas dataframe:

In [None]:
gdf.cat.unique()

In [None]:
fig, ax = plt.subplots()

# plot RAINRATE for a single catchment
cat_id=2853621
ds_sel.isel(time=5006).where(ds_sel.cat==cat_id, drop=True).RAINRATE.plot(ax=ax);

# add our watershed to the plot
gdf.plot(ax=ax, facecolor='none', edgecolor='k');

# adjust the x and y limits of the plot so we can see the entire watershed.
ax.set_xlim(ds_sel.x.min(), ds_sel.x.max())
ax.set_ylim(ds_sel.y.min(), ds_sel.y.max())

We can now perform computations on AORC data that aligns with subcatchments. For example, let's plot the average precipitation rate for a single catchment through time.

In [None]:
# perform spatial selection using the catchment id defined in the cell above.
dat = ds_sel.where(ds_sel.cat==cat_id, drop=True)

# compute mean rainrate across dimensions x and y.
dat = dat.RAINRATE.mean(dim=['x','y'])

# slice our dataset to a reasonable time range
dat = dat.sel(time=slice('2020-01-01', '2020-06-01')).compute()  # triggers the computation

In [None]:
fig, ax = plt.subplots()

dat.plot(ax=ax)
ax.set_title(f'Mean Rainrate for CAT-{cat_id}')
ax.set_xlabel('Time')
plt.grid()
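
The same idea extends to computing a mean time series for every catchment at once. The sketch below does this with plain `numpy` on small made-up arrays: `values` stands in for `RAINRATE` with dimensions (time, y, x), and `zones` stands in for the `cat` mask. The function `zonal_means` is a hypothetical helper, and it assumes the data contain no missing values inside the zones.

```python
import numpy as np


def zonal_means(values, zones):
    """Mean of `values` (time, y, x) over each zone in `zones` (y, x).

    Cells where `zones` is NaN belong to no zone and are ignored.
    Returns a dict mapping zone id to a 1-D array of per-time means.
    """
    zone_ids = np.unique(zones[~np.isnan(zones)])
    results = {}
    for zone_id in zone_ids:
        in_zone = zones == zone_id
        # Boolean indexing flattens the selected cells: shape (time, n_cells)
        results[int(zone_id)] = values[:, in_zone].mean(axis=1)
    return results


# Made-up data: 2 time steps on a 2 x 4 grid, with two zones
values = np.arange(16, dtype=float).reshape(2, 2, 4)
zones = np.array([[1.0, 1.0, 2.0, np.nan],
                  [1.0, 2.0, 2.0, np.nan]])

means = zonal_means(values, zones)
for zone_id, series in means.items():
    print(zone_id, series)
```

With the real dataset, a comparable result could be obtained by looping over `catchment_ids` and repeating the `where(...)` and `mean(dim=['x', 'y'])` steps shown earlier.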