# Spatial and Temporal Subsetting of AORC Forcing Data for a Designated Watershed and Timeframe

**Authors**: 
- Irene Garousi-Nejad <igarousi@cuahsi.org>, 
- Tony Castronova <acastronova@cuahsi.org>

**Last Updated**: 04.05.2023

**Description**:  

The objective of this Jupyter Notebook is to query AORC v1.0 forcing data stored on HydroShare's Thredds server and create a subset of this dataset for a designated watershed and timeframe. The user defines a temporal frame of interest, given as start and end dates, and a spatial frame of interest, which can be a bounding box or a shapefile, to subset the data in time and space.

Before the subsetting is performed, data is queried, and geospatial metadata is added to ensure that the data is correctly aligned with its corresponding location on the Earth's surface. To achieve this, two separate notebooks were created - [this notebook](https://github.com/CUAHSI/notebook-examples/blob/main/thredds/query-aorc-thredds.ipynb) and [this notebook](https://github.com/CUAHSI/notebook-examples/blob/main/thredds/aorc-adding-spatial-metadata.ipynb) - which explain how to query the dataset and add geospatial metadata to AORC v1.0 data in detail, respectively. In this notebook, we call functions from the AORC.py script to perform these preprocessing steps, resulting in a cleaner notebook that focuses solely on the subsetting process.

Finally, the subsetted data is plotted to visually inspect the results, which can be used to identify any potential issues with the subsetting process and ensure that the data subset is correctly aligned with the user's interests.

**Software Requirements**:

The software and operating system versions used to develop this notebook are listed below. To avoid version conflicts among Python packages, we recommend creating a new environment (for example, a conda environment) and installing the required packages specifically for this notebook.


OS: Microsoft Windows 11 Pro version 10.0.22621
> Conda: 22.9.0  \
> Python: 3.9.16  \
> re: 2.2.1  \
> wget: 3.2  \
> xarray: 2022.11.0  \
> pyproj: 3.5.0  \
> rioxarray: 0.14.0  \
> numpy: 1.24.2  \
> pandas: 2.0.0  \
> geopandas: 0.12.2  \
> netCDF4: 1.6.3  \
> cartopy: 0.21.1  \
> matplotlib: 3.7.1 \
> owslib: 0.24.1

---

In [None]:
import os
import re
import wget
import xarray
import pyproj
import requests
import rioxarray
import numpy as np
import pandas as pd
import geopandas as gpd
import cartopy.crs as ccrs
from pyproj import Transformer
import matplotlib.pyplot as plt
from owslib.wms import WebMapService

import AORC

## Define the temporal and spatial parameters

Specify the start and end dates and select a monthly frequency to generate a list of datetime elements that fall within the desired time period. Monthly frequency is chosen because the AORC data is organized by month on HydroShare's Thredds server. To ensure that the datetime range covers the entire period of interest, the `MS` (month start) frequency is used. The `M` (month end) frequency yields only month-end timestamps, so it omits the final month whenever the end date falls before that month's last day.

In [None]:
start_date = '2010-12-01'   # yyyy-mm-dd
end_date = '2011-03-15'     # yyyy-mm-dd

subset_times=pd.date_range(pd.to_datetime(start_date, format="%Y-%m-%d"), 
             pd.to_datetime(end_date, format="%Y-%m-%d"), 
             freq='MS')  # MS is month start frequency vs M that is month end frequency

print(subset_times)
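
The difference between the two frequencies can be checked with a small standalone sketch. It reuses the dates from the cell above; the month-end range is built from a `pd.offsets.MonthEnd()` offset object, which is an assumption made here for portability across pandas versions rather than something this notebook uses.

```python
import pandas as pd

start_date = "2010-12-01"
end_date = "2011-03-15"

# Month-start stamps: one entry for every month touched by the period
month_starts = pd.date_range(start_date, end_date, freq="MS")

# Month-end stamps, built from an offset object instead of an alias string
month_ends = pd.date_range(start_date, end_date, freq=pd.offsets.MonthEnd())

print(list(month_starts.strftime("%Y-%m")))  # includes 2011-03
print(list(month_ends.strftime("%Y-%m")))    # stops at 2011-02
```

Because 2011-03-31 falls after the end date, the month-end range has no entry for March, so the March file would not be requested.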

Define the path to the area of interest, which can be a watershed, and read the shapefile using the geopandas package. Here, we are using a shapefile of the Logan River head watershed in Utah that is available in a [HydroShare resource](https://www.hydroshare.org/resource/8974522ddcd84440a02cf6e7124261b2/data/contents/watershed.shp).  

In [None]:
# Define the path and filenames for the watershed of interest that you want to download
data_path = 'https://www.hydroshare.org/resource/8974522ddcd84440a02cf6e7124261b2/data/contents/'
files_to_download = ['watershed.shp', 'watershed.shx', 'watershed.prj', 'watershed.dbf']

# Download each file
for file in files_to_download:
    filename = wget.download(data_path+file)

In [None]:
# Read the shapefile into a geopandas dataframe
watershed_gpd = gpd.read_file('./watershed.shp')

fig, ax = plt.subplots(figsize=(5, 8))
watershed_gpd.plot(ax=ax, edgecolor='blue', facecolor='white')

plt.show()

To view the fields, you can print the first few rows of the dataframe using the `head()` method.

In [None]:
watershed_gpd.head()

Use the `total_bounds` attribute of the `watershed_gpd` GeoDataFrame to get the bounding box of the whole watershed as a NumPy array ordered `(minx, miny, maxx, maxy)`, where `x` is the longitude and `y` is the latitude. (The related `bounds` attribute instead returns a DataFrame with one row of bounds per feature.) We will use this bounding box later in the notebook.

In [None]:
bbox = watershed_gpd.total_bounds
bbox
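
As a rough illustration of how a layer-wide bounding box relates to per-feature bounds, the sketch below reduces a small table of bounds with NumPy. The coordinate values are hypothetical and are not taken from the Logan River shapefile. The sketch only mimics the `(minx, miny, maxx, maxy)` ordering and does not call geopandas.

```python
import numpy as np

# Hypothetical per-feature bounds, one row per feature: (minx, miny, maxx, maxy)
feature_bounds = np.array([
    [-111.8, 41.7, -111.5, 41.9],
    [-111.6, 41.8, -111.4, 42.0],
])

# Overall box: smallest minimums and largest maximums across all features
total = np.concatenate([
    feature_bounds[:, :2].min(axis=0),
    feature_bounds[:, 2:].max(axis=0),
])

minx, miny, maxx, maxy = total
print(total)
```

Unpacking in this order is what the later cells rely on when they index `bbox[0]` through `bbox[3]`.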

## Query AORC data for the period of interest

Call the `get_paths` function from the `AORC.py` script to query the portion of the AORC v1.0 data on HydroShare's Thredds server that covers the months listed in `subset_times`.

In [None]:
paths = AORC.get_paths(subset_times)

print(f'Found {len(paths)} individual files')
print(f'The first file is named: {os.path.basename(paths[0])}')
print(f'The last file is named: {os.path.basename(paths[-1])}')

Load the selected original AORC data using `open_mfdataset`.

In [None]:
# load multiple files using open_mfdataset
ds = xarray.open_mfdataset(paths,
                           concat_dim='Time',
                           combine='nested',
                           parallel=True,
                           chunks={'Time': 10, 'west_east': 285, 'south_north':275})

# sort data by valid_time
ds = ds.sortby('valid_time')

# create coordinate to allow loc searches
ds = ds.assign_coords(Time=('Time', ds.valid_time.data))

# check the first and last timestamps
print(ds.Time[0].values)
print(ds.Time[-1].values)
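
Because the files are organized by month, the loaded data can extend past the requested end date. One possible way to trim it is a label-based slice on the `Time` coordinate. The sketch below demonstrates this on a small synthetic dataset; the `toy` dataset, its shape, and its values are hypothetical stand-ins for the AORC data, not the real files.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Toy stand-in for the AORC dataset: six hourly steps stored out of order
hours = pd.date_range("2011-03-15 21:00", periods=6, freq=pd.Timedelta(hours=1))
shuffled = hours[[3, 0, 5, 1, 4, 2]]
toy = xr.Dataset(
    {"LWDOWN": (("Time", "y", "x"), np.zeros((6, 2, 2)))},
    coords={"valid_time": ("Time", shuffled)},
)

# Same preparation as in the cell above: sort, then index the Time dimension
toy = toy.sortby("valid_time")
toy = toy.assign_coords(Time=("Time", toy.valid_time.data))

# Keep only the timestamps up to and including the last day of interest
end_date = "2011-03-15"
trimmed = toy.sel(Time=slice(None, end_date))
print(trimmed.Time.values)
```

A date-only string in the slice covers the whole day, so the three timestamps on 2011-03-15 are kept and those on 2011-03-16 are dropped.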

## Add spatial metadata to the AORC data and watershed to prepare it for spatial subsetting

First, call the `add_spatial_metadata` function from the `AORC.py` script to add spatial metadata to the AORC dataset that was selected and loaded for the desired timeframe.

In [None]:
ds_geo = AORC.add_spatial_metadata(ds)

ds_geo

To ensure that the shapefile coordinates match those of the AORC NetCDF file, you need to retrieve the coordinate reference system (CRS) of the AORC dataset, and then use this information to convert the CRS of the shapefile to match the CRS of the AORC file. This will ensure that the two files are using the same coordinate system and can be overlaid properly.

In [None]:
watershed_gpd = watershed_gpd.to_crs(ds_geo.spatial_ref.crs_wkt)

The following code shows that the bounding box includes the projected coordinates that match the CRS of the AORC data.

In [None]:
watershed_gpd.total_bounds
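
To sanity-check a reprojection for a single point, `pyproj.Transformer` can be used directly. In the sketch below, the Lambert Conformal Conic definition and the sample longitude/latitude are assumptions chosen for illustration. The CRS that `add_spatial_metadata` attaches to the AORC data may differ, so substitute `ds_geo.spatial_ref.crs_wkt` when working with the real dataset.

```python
from pyproj import Transformer

# ASSUMPTION: an illustrative Lambert Conformal Conic definition, not
# necessarily the CRS attached to the AORC data in this notebook.
lcc = ("+proj=lcc +lat_1=30 +lat_2=60 +lat_0=40 +lon_0=-97 "
       "+x_0=0 +y_0=0 +a=6370000 +b=6370000 +units=m +no_defs")

# always_xy=True keeps the (lon, lat) / (x, y) axis order in both directions
to_lcc = Transformer.from_crs("EPSG:4326", lcc, always_xy=True)
to_geo = Transformer.from_crs(lcc, "EPSG:4326", always_xy=True)

# Hypothetical point in northern Utah
lon, lat = -111.6, 41.9
x, y = to_lcc.transform(lon, lat)
lon_back, lat_back = to_geo.transform(x, y)

print(f"x={x:.0f} m, y={y:.0f} m")
print(f"round trip: lon={lon_back:.6f}, lat={lat_back:.6f}")
```

A point west of the central meridian should map to a negative `x`, and the round trip should reproduce the original longitude and latitude.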

## Subset the AORC forcing data for the area of interest

Use the bounding box of the watershed to find the indices of the netCDF grid cells that fall within it. We will use these indices to extract the relevant subset of the AORC dataset for the watershed of interest.

In [None]:
# Extract the bounding box of the shapefile
xmin, ymin, xmax, ymax = watershed_gpd.total_bounds

# Find the indices of the netCDF grid cells 
x_values = ds_geo.x.values
y_values = ds_geo.y.values
x_idx = np.where((x_values >= xmin) & (x_values <= xmax))[0]
y_idx = np.where((y_values >= ymin) & (y_values <= ymax))[0]

# Subset the netCDF file using the indices
subset_ds = ds_geo.isel(x=x_idx, y=y_idx)
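
The selection above keeps a cell only when its coordinate value lies inside the bounding box. If the coordinates are cell centers, cells that only partly overlap the box are dropped. A possible variation, sketched below on a hypothetical 1 km grid, pads the box by half a cell. The grid spacing, coordinate values, and bounding box are assumptions for illustration only.

```python
import numpy as np

# Hypothetical 1 km grid-cell centers in projected meters
x_values = np.arange(0.0, 10000.0, 1000.0)      # ascending, west to east
y_values = np.arange(9000.0, -1000.0, -1000.0)  # descending, north to south

# Hypothetical watershed bounding box in the same projected units
xmin, ymin, xmax, ymax = 2300.0, 3200.0, 6200.0, 5100.0

# Pad the box by half a cell so partly overlapping edge cells are kept
half_cell = 500.0
x_idx = np.where((x_values >= xmin - half_cell) & (x_values <= xmax + half_cell))[0]
y_idx = np.where((y_values >= ymin - half_cell) & (y_values <= ymax + half_cell))[0]

print(x_values[x_idx])
print(y_values[y_idx])
```

The boolean masks work regardless of whether a coordinate is ascending or descending, which is one reason to prefer them over a label-based slice when the `y` axis runs north to south.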

Check the size of the subsetted dataset.

In [None]:
subset_ds

## Visualization

In [None]:
fig, axes = plt.subplots(ncols=2, figsize=(14,5), subplot_kw={'projection': ccrs.PlateCarree()})

# set global extent and add gridlines for both subplots
for ax in axes.flat:
    ax.set_global()
    gl = ax.gridlines(crs=ccrs.PlateCarree(), draw_labels=True,
                  linewidth=2, color='gray', alpha=0.5, linestyle='--')
    gl.top_labels = False    # display only the tick labels on one side of the plot
    gl.right_labels = False  # display only the tick labels on one side of the plot

    
# add WMS layer to left subplot
gb_wms = 'https://geoserver.hydroshare.org/geoserver/HS-965eab1801c342a58a463f386c9f3e9b/wms'
axes[0].add_wms(wms=gb_wms,
               layers=['GB_shapefile'],
               zorder=10)


# plot LWDOWN at the second timestep (index 1)
ds_geo.isel(time=1).LWDOWN.plot(
               ax=axes[0], transform=ccrs.PlateCarree(), x="lon", y="lat",
               zorder=2,
               cmap='Reds')


# add WMS for the watershed shapefile to right subplot
# (use axes[1] explicitly instead of the leftover loop variable `ax`)
gb_wms = 'https://geoserver.hydroshare.org/geoserver/HS-8974522ddcd84440a02cf6e7124261b2/wms'
axes[1].add_wms(wms=gb_wms,
               layers=['watershed'],
               zorder=10)

# another option to add the shapefile to the map
# here we directly read the shapefile in the working directory not from geoserver.hydroshare.org
from cartopy.io.shapereader import Reader
from cartopy.feature import ShapelyFeature
shape_feature = ShapelyFeature(Reader('./watershed.shp').geometries(),
                                ccrs.PlateCarree(), facecolor='none')
axes[1].add_feature(shape_feature)


# plot the subset of LWDOWN at the second timestep (index 1) on the right subplot
subset_ds.isel(time=1).LWDOWN.plot(
               ax=axes[1], transform=ccrs.PlateCarree(), x="lon", y="lat",
               zorder=2,
               cmap='Reds', alpha=0.5)


# zoom to the map on the left panel
axes[0].set_ylim([30, 45])
axes[0].set_xlim([-125, -105])
axes[0].set_aspect('equal')
axes[0].coastlines()


# zoom to the map on the right panel
buffer = 0.15  # degrees
axes[1].set_ylim([bbox[1]-buffer, bbox[3]+buffer])
# bbox is ordered (minx, miny, maxx, maxy), so the x-limits run from bbox[0] to bbox[2]
axes[1].set_xlim([bbox[0]-buffer, bbox[2]+buffer])
axes[1].set_aspect('equal')
axes[1].coastlines()


# add titles
axes[0].set_title('AORC v1.0 encompassing the Great Basin')
axes[1].set_title('Subset of AORC v1.0 encompassing the Watershed of Interest')

plt.show()