## Query AORC Forcing Data via HydroShare THREDDS

**Authors**: Tony Castronova <acastronova@cuahsi.org>, Irene Garousi-Nejad <igarousi@cuahsi.org>  
**Last Updated**: 03.28.2023

**Description**:  

This example demonstrates how to collect and visualize AORC forcing data from HydroShare's THREDDS server. The same process applies to accessing these data on other THREDDS instances.

> The Analysis of Record for Calibration (AORC) is a gridded record of near-surface weather conditions covering the continental United States and Alaska and their hydrologically contributing areas. It is defined on a latitude/longitude spatial grid with a mesh length of ~800 m (30 arc seconds), and a temporal resolution of one hour. Elements include hourly total precipitation, temperature, specific humidity, terrain-level pressure, downward longwave and shortwave radiation, and west-east and south-north wind components. It spans the period from 1979 at Continental U.S. (CONUS) locations / 1981 in Alaska, to the near-present (at all locations). This suite of eight variables is sufficient to drive most land-surface and hydrologic models and is used to force the calibration run of the National Water Model (NWM).


**Software Requirements**

This notebook was developed using the following software and operating system versions.

- OS: MacOS Ventura 13.0.1
- Python: 3.10.0
- Zarr: 2.13.2
- NetCDF4: 1.6.1
- xarray: 0.17.0
- fsspec: 0.8.7
- dask: 2021.3.0
- hvplot: 0.7.1
- holoviews: 1.14.2
- pynhd: 0.10.1
- nest-asyncio: 1.5.6


The following commands should help you set up these dependencies:
```
$ conda create -n nwm-env python=3.10.0

$ conda activate nwm-env

$ conda install -y -c pyviz -c conda-forge pynhd folium s3fs hvplot dask distributed zarr
```


---

## 1. Search AORC Forcing on HydroShare THREDDS

The AORC forcing data used in this notebook covers the Great Basin watershed from 2010 to 2019 and is stored in HydroShare's THREDDS catalog:

https://thredds.hydroshare.org/thredds/catalog/aorc/data/16/catalog.html




In [None]:
import os
import re
import numpy
import xarray
import requests
import xml.etree.ElementTree as ET

In [None]:
from dask.distributed import Client
client = Client()
client

To discover the available data, we read an XML file and parse its content. Using this information, we create a list of URLs that can be loaded with `xarray` in later steps. This process is optional and can be bypassed by manually collecting these URLs from the THREDDS interface in a web browser.

Identify the files in the catalog via the catalog.xml document. The following link lists all available data for the Great Basin HUC-2 region (`HUC=16`):
https://thredds.hydroshare.org/thredds/catalog/aorc/data/16/catalog.xml

In [None]:
catalog_base_url = 'https://thredds.hydroshare.org/thredds/catalog'
dods_base_url = 'https://thredds.hydroshare.org/thredds/dodsC'

Read the catalog.xml document and extract all `urlPath` attributes. We'll use the `urlPath` attribute to build the complete path to each file we want to access.

In [None]:
url = f'{catalog_base_url}/aorc/data/16/catalog.xml'
root = ET.fromstring(requests.get(url).text)
ns = '{http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0}'

In [None]:
# use xpath to select all "dataset" elements.
elems = root.findall(f'.//{ns}dataset')

In [None]:
# loop through results and extract the "urlPath" attribute values
paths = []
for elem in elems:
    atts = elem.attrib
    if 'urlPath' in atts.keys():
        paths.append(f"{dods_base_url}/{atts['urlPath']}")

In [None]:
# use regex to isolate only files that end with ".nc"
# (raw string avoids an invalid-escape warning for "\.")
paths = list(filter(re.compile(r"^.*\.nc$").match, paths))

Print out some information about the files that we've found. For example, the total number of files as well as the names for the first and last files. The names of the first and last files will give us the temporal range of data that's available.

In [None]:
print(f'Found {len(paths)} individual files')
print(f'The first file is named: {os.path.basename(paths[0])}')
print(f'The last file is named: {os.path.basename(paths[-1])}')
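
The catalog-parsing steps above can also be collected into one small function and tried offline. The following is a minimal sketch rather than part of the original workflow: the function name `opendap_urls` and the trimmed-down `SAMPLE_CATALOG` document are hypothetical and only imitate the structure of a THREDDS catalog. The sketch also sorts the result, so that the first and last entries reflect the temporal range regardless of the order in which the catalog lists them.

```python
import xml.etree.ElementTree as ET

# THREDDS InvCatalog v1.0 namespace, the same one used in the cells above
NS = '{http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0}'

# Hypothetical, trimmed-down catalog document used for illustration only
SAMPLE_CATALOG = """
<catalog xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0">
  <dataset name="16" ID="aorc/data/16">
    <dataset name="201002.nc" urlPath="aorc/data/16/201002.nc"/>
    <dataset name="201001.nc" urlPath="aorc/data/16/201001.nc"/>
    <dataset name="readme.txt" urlPath="aorc/data/16/readme.txt"/>
  </dataset>
</catalog>
"""


def opendap_urls(catalog_xml, dods_base_url):
    """Return sorted OPeNDAP URLs for every ".nc" dataset in a catalog document."""
    root = ET.fromstring(catalog_xml)
    urls = []
    for elem in root.iter(f'{NS}dataset'):
        url_path = elem.get('urlPath')  # None for container datasets
        if url_path and url_path.endswith('.nc'):
            urls.append(f'{dods_base_url}/{url_path}')
    return sorted(urls)


sample_paths = opendap_urls(SAMPLE_CATALOG,
                            'https://thredds.hydroshare.org/thredds/dodsC')
print(sample_paths)
```

Passing the text of the real catalog.xml document in place of `SAMPLE_CATALOG` should give the same kind of list as `paths`, assuming the server's catalog follows the same layout.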

## 2. Preview Data From a Single File

Connect to the THREDDS server and load a single month of data to explore the variables and spatial extent that are available to us. Using the URLs stored in the `paths` variable, we can load the first file via index 0 (i.e. `paths[0]`):


In [None]:
ds = xarray.open_dataset(paths[0])

Display the variables that are contained in this file.

In [None]:
ds

Plot `RAINRATE` for the first timestep across the entire grid domain. There are numerous ways to select data within an xarray Dataset; for more information, see: https://docs.xarray.dev/en/stable/user-guide/indexing.html

In [None]:
# plot RAINRATE for the entire spatial grid at Time=0
ds.isel(Time=0).RAINRATE.plot()

Plot RAINRATE through time (50 timesteps) at a single grid cell in the domain:

In [None]:
ds.isel(Time=range(0, 50), south_north=1, west_east=1).RAINRATE.plot()

## 3. Access Data in Multiple Files

To access data through time ranges longer than 1 month we'll need to access multiple files. For example, for the time range 01/01/2010 - 03/23/2010 we'll need to load three files:

- https://thredds.hydroshare.org/thredds/dodsC/aorc/data/16/201001.nc
- https://thredds.hydroshare.org/thredds/dodsC/aorc/data/16/201002.nc
- https://thredds.hydroshare.org/thredds/dodsC/aorc/data/16/201003.nc
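
The list of files for a given time range can also be generated programmatically. This is a minimal sketch that assumes every file follows the `YYYYMM.nc` naming pattern seen in the URLs above; the helper name `monthly_file_urls` is hypothetical and not part of any library.

```python
from datetime import date

DODS_BASE_URL = 'https://thredds.hydroshare.org/thredds/dodsC'


def monthly_file_urls(start, end, huc='16'):
    """List the monthly file URLs that overlap the inclusive range start..end.

    Assumes one file per calendar month, named YYYYMM.nc.
    """
    if end < start:
        raise ValueError('end must not precede start')
    urls = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        urls.append(f'{DODS_BASE_URL}/aorc/data/{huc}/{year}{month:02d}.nc')
        # advance one calendar month, rolling over into the next year
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return urls


needed = monthly_file_urls(date(2010, 1, 1), date(2010, 3, 23))
for url in needed:
    print(url)
```

The resulting list can be compared against `paths` to confirm that each file actually exists on the server before trying to open it.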

When loading large amounts of data (e.g. each of these files is 16 GiB), Dask chunking becomes extremely important. Each of the files that we're accessing contains ~744 timesteps and ~700,000 grid cells (820 rows, 855 columns), which is approximately 500 million elements per variable. The xarray documentation on Dask chunking suggests:

   > A good rule of thumb is to create arrays with a minimum chunksize of at least one million elements (e.g., a 1000x1000 matrix). With large arrays (10+ GB), the cost of queueing up Dask operations can be noticeable, and you may need even larger chunksizes.
   > https://docs.xarray.dev/en/stable/user-guide/dask.html
   
We've found that the following chunking scheme provides adequate performance for many applications:

|Dimension|Chunks|
|---|---|
|Time| 10 |
|west_east|285|
|south_north|275|

This results in chunks that contain approximately 800,000 elements.
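
The chunk-size arithmetic can be checked with a few lines of Python. This sketch uses the approximate per-file dimensions quoted above rather than values read from the server, assumes that the 820 rows correspond to `south_north` and the 855 columns to `west_east`, and assumes 4-byte (float32) values when estimating memory per chunk.

```python
import math

# Approximate per-file dimensions quoted in the text (assumed, not read from the server)
dims = {'Time': 744, 'south_north': 820, 'west_east': 855}
chunks = {'Time': 10, 'south_north': 275, 'west_east': 285}

# number of array elements held by one full-size chunk
elements_per_chunk = math.prod(chunks.values())

# number of chunks needed to cover one variable in one monthly file
chunks_per_variable = math.prod(math.ceil(dims[d] / chunks[d]) for d in dims)

# float32 storage is an assumption; change the itemsize for other dtypes
mib_per_chunk = elements_per_chunk * 4 / 2**20

print(f'{elements_per_chunk:,} elements per chunk')
print(f'{chunks_per_variable} chunks per variable per file')
print(f'{mib_per_chunk:.2f} MiB per chunk')
```

Larger chunks reduce the number of tasks Dask has to schedule, while smaller chunks reduce the amount of data transferred per request, so these numbers are a starting point to tune rather than a fixed recommendation.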

In [None]:
# load multiple files using open_mfdataset
ds = xarray.open_mfdataset(paths[0:5],
                           concat_dim='Time',
                           combine='nested',
                           parallel=True,
                           chunks={'Time': 10, 'west_east': 285, 'south_north':275})

Preview the data. Notice that there are 3624 timesteps, the number of hours in the five monthly files loaded above.


In [None]:
ds
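
The 3624 timesteps can be cross-checked against the calendar. This sketch assumes that the five files loaded above cover January through May 2010 and that every hour is present; the helper `hourly_steps` is hypothetical.

```python
import calendar


def hourly_steps(year, first_month, n_months):
    """Count hourly timesteps in n_months consecutive months of one calendar year."""
    return sum(
        24 * calendar.monthrange(year, month)[1]  # [1] is the number of days in the month
        for month in range(first_month, first_month + n_months)
    )


expected_steps = hourly_steps(2010, 1, 5)
print(expected_steps)
```

If the size of the `Time` dimension differs from this count, some hours are likely missing or duplicated in the source files.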

Plot `LWDOWN` for a time range that spans more than one month.

In [None]:
ds.isel(Time=range(600, 1000), south_north=1, west_east=1).LWDOWN.plot()

Create a coordinate containing datetime values so that we can perform queries using human readable datetimes.

In [None]:
# sort data by valid_time
ds = ds.sortby('valid_time')

# create coordinate to allow loc searches
ds = ds.assign_coords(Time=('Time', ds.valid_time.data))

Slice the dataset using a human-readable time range.

In [None]:
ds.loc[dict(Time=slice('2010-01-01','2010-01-03'), west_east=1, south_north=1)].LWDOWN.plot()
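
The sort, assign and slice pattern can be tried without a server connection on a small synthetic dataset. This is a sketch under stated assumptions: the dataset `demo` only imitates the layout used above (a `Time` dimension without coordinate values plus a separate `valid_time` variable), and its values are made up.

```python
import numpy as np
import xarray as xr

# 96 hourly timestamps starting 2010-01-01, stored in reverse order on purpose
times = np.datetime64('2010-01-01T00:00') + np.arange(96) * np.timedelta64(1, 'h')
values = np.arange(96 * 3 * 4, dtype='float32').reshape(96, 3, 4)

# synthetic stand-in for the AORC layout (assumed, values are made up)
demo = xr.Dataset(
    {
        'LWDOWN': (('Time', 'south_north', 'west_east'), values),
        'valid_time': (('Time',), times[::-1]),
    }
)

# sort by timestamp, then attach the timestamps as the Time coordinate
demo = demo.sortby('valid_time')
demo = demo.assign_coords(Time=('Time', demo.valid_time.data))

# label-based slicing on Time; the spatial dimensions have no coordinates,
# so they are selected by integer position
subset = demo.loc[dict(Time=slice('2010-01-01', '2010-01-03'),
                       west_east=1, south_north=1)]
print(subset.LWDOWN.sizes)
```

Date strings without a time component are inclusive of the whole day, so this slice returns all hours of January 1 through January 3.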

## 4. Advanced Usage

The following demonstrates some of the more advanced computations and visualizations that you can perform. Start by loading the entire 10-year dataset.

In [None]:
import hvplot.xarray

# load multiple files using open_mfdataset
ds = xarray.open_mfdataset(paths,
                           concat_dim='Time',
                           combine='nested',
                           parallel=True,
                           chunks={'Time': 10, 'west_east': 285, 'south_north':275})

In [None]:
# sort data by valid_time
ds = ds.sortby('valid_time')

# create coordinate to allow loc searches
ds = ds.assign_coords(Time=('Time', ds.valid_time.data))

In [None]:
ds

---

Plot the daily average `LWDOWN` for a period of time.

In [None]:
dat = ds.loc[dict(Time=slice('2010-01-01','2010-01-05'), west_east=range(100, 150), south_north=range(100, 150))]

In [None]:
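# average over the spatial dimensions first, then aggregate the hourly series to daily means
dat.LWDOWN.mean(['west_east', 'south_north']).resample(Time='1D').mean().plot()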

---

Plot an animation of `LWDOWN` through time.

In [None]:
# pyplot needed to plot the dataset, but animation only needed much further down.
from matplotlib import pyplot as plt, animation
%matplotlib inline

# This is needed to display graphics calculated outside of jupyter notebook
from IPython.display import HTML, display

from datetime import datetime, timedelta

In [None]:
st = '2010-01-01'
et = '2010-01-05'

dat = ds.loc[dict(Time=slice(st, et), west_east=range(100, 150), south_north=range(100, 150))].LWDOWN.compute()

In [None]:
# Get a handle on the figure and the axes
fig, ax = plt.subplots(figsize=(12,6))

# Plot the initial frame. 
cax = dat[0].plot(
    add_colorbar=True,
    cmap='coolwarm',
    vmin=0, vmax=500,
    cbar_kwargs={
        'extend':'neither'
    }
)

# create a list of datetimes to update the figure title
start = datetime.strptime(st, '%Y-%m-%d') 
datetimes = [start + timedelta(hours=i) for i in range(0, len(dat))]

# Next we need to create a function that updates the values for the colormesh, as well as the title.
def animate(frame):
    cax.set_array(
        dat[frame].values.flatten()
    )
    ax.set_title(f"Time = {datetimes[frame].strftime('%m-%d-%Y %H:%M')}")

# Finally, we use the animation module to create the animation.
ani = animation.FuncAnimation(
    fig,             
    animate,         
    frames=len(dat),
    interval=200     
)

In [None]:
HTML(ani.to_jshtml())