# Retrieving and Aggregating AORC Data at a Point

**Authors:** 

<ul style="line-height:1.5;">
<li>Ayman Nassar <a href="mailto:ayman.nassar@usu.edu">(ayman.nassar@usu.edu)</a></li>
<li>Pabitra Dash <a href="mailto:pabitra.dash@usu.edu">(pabitra.dash@usu.edu)</a></li>
<li>Homa Salehabadi <a href="mailto:homa.salehabadi@usu.edu">(homa.salehabadi@usu.edu)</a></li>
<li>David Tarboton <a href="mailto:david.tarboton@usu.edu">(david.tarboton@usu.edu)</a></li>
<li>Anthony Castronova <a href="mailto:acastronova@cuahsi.org">(acastronova@cuahsi.org)</a></li>

</ul>

**Last Updated:** 1/20/2025

**Purpose:**

This notebook provides code examples for retrieving NOAA Analysis of Record for Calibration (AORC) data from Amazon Web Services (AWS). It is intended to make it easy for researchers to access data for a single point, specified by latitude and longitude or by coordinates in another known coordinate system. It also supports aggregating the data to time intervals other than the native interval of the underlying NOAA data.

**Audience:**

Researchers who are familiar with Jupyter Notebooks, basic Python and basic hydrologic data analysis.

**Description:**

This notebook takes as inputs the coordinates (e.g., latitude and longitude) of a study location in any coordinate system, start and end dates for the desired study period, a variable name, and a preferred time aggregation interval. It then retrieves data from Amazon Web Services (AWS), aggregates it over the specified time interval, displays the data as a plot, and saves it as a comma-separated values (CSV) file.

**Data Description:**

This notebook uses AORC data developed and published by NOAA on Amazon Web Services (AWS), as described in detail in this Registry of Open Data on AWS entry: <https://registry.opendata.aws/noaa-nws-aorc/>. The AORC dataset is a gridded record of near-surface weather conditions covering the continental United States and Alaska and their hydrologically contributing areas. It is defined on a latitude/longitude spatial grid with a mesh length of 30 arc seconds (~800 m) and a temporal resolution of one hour. This notebook uses the Zarr format files of version 1.1 of the AORC data. Zarr is a format for storage of chunked, compressed, N-dimensional arrays, designed to support storage using distributed systems such as cloud object stores (<https://zarr.dev/>).


**Software Requirements:**

This notebook has been tested on the CIROH JupyterHub, CyberGIS Jupyter for Water, and CUAHSI JupyterHub deployments. It relies on a general-purpose Jupyter computing environment with the following specific Python libraries:

 > numpy: 1.26.4     
   geopandas: 0.14.3  
   pandas: 2.2.1  
   matplotlib: 3.8.3   
   contextily: 1.6.2    
   shapely: 2.0.3

It also uses code from aorc_utils.py that accompanies this notebook.

### 1. Install and Import Python Libraries Needed to Run this Jupyter Notebook

The `contextily` library is used in this notebook for mapping. It may not be installed in your Python environment by default, so install it before running the rest of the notebook. Use the following command to install it:

In [None]:
!pip install contextily

Import the libraries needed to run this notebook:

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import geopandas as gpd
import pandas as pd
import contextily as ctx  # For adding a basemap
from shapely.geometry import Point
from aorc_utils import get_conus_bucket_url, load_dataset, reproject_coordinates, get_variable_code, get_aggregation_code, get_time_code

### 2. Set Inputs

Use the cells in this section of the notebook to set the input values that specify the data to retrieve. The coordinate system of the point's latitude (or y) and longitude (or x) coordinates must be given as a European Petroleum Survey Group (EPSG) code. EPSG codes are a widely used encoding of coordinate systems and can be looked up at [https://spatialreference.org/](https://spatialreference.org/). The EPSG value of 4326 in the cell below refers to the World Geodetic System of 1984 (WGS84) geographic latitude and longitude coordinate system, so the input `lon` and `lat` values are interpreted as being in this coordinate system. If your data is in a different coordinate system, look up its EPSG code at the website above and use it in the cell below. To learn more about coordinate systems, see, for example, the [UCGIS Body of Knowledge section on Coordinate Systems](https://gistbok-topics.ucgis.org/DM-05-047).
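To make the role of the EPSG code concrete, the following is a minimal sketch of converting a WGS84 longitude/latitude pair to a projected coordinate system and back. It assumes the `pyproj` package is available (it is a dependency of `geopandas` but is not in the requirements list above), and the target code `EPSG:32612` (WGS 84 / UTM zone 12N) is chosen only for illustration. The notebook itself relies on `reproject_coordinates` from `aorc_utils.py`, whose implementation may differ.

```python
from pyproj import Transformer

# always_xy=True fixes the axis order to (x, y), i.e. (lon, lat) or (easting, northing)
to_utm = Transformer.from_crs('EPSG:4326', 'EPSG:32612', always_xy=True)
to_wgs84 = Transformer.from_crs('EPSG:32612', 'EPSG:4326', always_xy=True)

# Example point in geographic coordinates (same values as the input cell)
lon, lat = -111.96503, 40.77069

# Forward transform: degrees -> metres in UTM zone 12N (illustrative target only)
easting, northing = to_utm.transform(lon, lat)
print(f"UTM zone 12N: x = {easting:.1f} m, y = {northing:.1f} m")

# Round trip back to geographic coordinates as a sanity check
lon_back, lat_back = to_wgs84.transform(easting, northing)
print(f"Round trip: lon = {lon_back:.5f}, lat = {lat_back:.5f}")
```

If the point were supplied in UTM coordinates instead, `input_crs` would be set to `'EPSG:32612'` and `lon`, `lat` would hold the easting and northing.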

In [None]:
# Start date - In Year-Month-Day format, the earliest start date can be '1979-02-01'
start_datetime = '1990-01-01'

# End date - In Year-Month-Day format, the latest end date can be '2023-01-31'
end_datetime = '1990-12-31'

# Coordinate system EPSG code (from https://spatialreference.org/).
input_crs = 'EPSG:4326'

# Point location
# lon, lat are used as names, even for projected coordinate systems where lon = x and lat = y
lon = -111.96503  # Longitude or X
lat = 40.77069  # Latitude or Y
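Before retrieving data, it can help to confirm that the requested period falls inside the available record. The following is a minimal sketch that uses the limits quoted in the comments above (`'1979-02-01'` and `'2023-01-31'`). The helper name `check_period` is hypothetical and is not part of `aorc_utils.py`.

```python
from datetime import date

# Record limits taken from the comments in the input cell above
EARLIEST = date(1979, 2, 1)
LATEST = date(2023, 1, 31)


def check_period(start, end):
    """Hypothetical helper: return (start, end) as dates or raise ValueError."""
    start_d = date.fromisoformat(start)
    end_d = date.fromisoformat(end)
    if start_d > end_d:
        raise ValueError(f"Start date {start} is after end date {end}")
    if start_d < EARLIEST or end_d > LATEST:
        raise ValueError(f"Period must lie within {EARLIEST} to {LATEST}")
    return start_d, end_d


period = check_period('1990-01-01', '1990-12-31')
print(period)
```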

**The following are valid variable names for data retrieval:**

- `Total Precipitation` (APCP_surface): Hourly total precipitation (kg m-2 or mm)
- `Air Temperature` (TMP_2maboveground): Air temperature at 2 m above ground level (AGL) (K)
- `Specific Humidity` (SPFH_2maboveground): Specific humidity at 2 m AGL (g g-1)
- `Downward Long-Wave Radiation Flux` (DLWRF_surface): Downward longwave (infrared) radiation flux at the surface (W m-2)
- `Downward Short-Wave Radiation Flux` (DSWRF_surface): Downward shortwave (solar) radiation flux at the surface (W m-2)
- `Pressure` (PRES_surface): Air pressure at the surface (Pa)
- `U-Component of Wind` (UGRD_10maboveground): West-to-east component of the wind at 10 m AGL (m s-1)
- `V-Component of Wind` (VGRD_10maboveground): South-to-north component of the wind at 10 m AGL (m s-1)
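The list above can be expressed as a lookup table that rejects misspelled names early. This is only an illustrative sketch: the dictionary `AORC_VARIABLES` and the helper `lookup_variable` are hypothetical, and the codes returned by `get_variable_code` in `aorc_utils.py` may differ from the ones shown in parentheses above.

```python
# Display name -> (AORC variable code, units), copied from the list above
AORC_VARIABLES = {
    'Total Precipitation': ('APCP_surface', 'kg m-2'),
    'Air Temperature': ('TMP_2maboveground', 'K'),
    'Specific Humidity': ('SPFH_2maboveground', 'g g-1'),
    'Downward Long-Wave Radiation Flux': ('DLWRF_surface', 'W m-2'),
    'Downward Short-Wave Radiation Flux': ('DSWRF_surface', 'W m-2'),
    'Pressure': ('PRES_surface', 'Pa'),
    'U-Component of Wind': ('UGRD_10maboveground', 'm s-1'),
    'V-Component of Wind': ('VGRD_10maboveground', 'm s-1'),
}


def lookup_variable(name):
    """Hypothetical helper: return (code, units) or fail listing the valid names."""
    try:
        return AORC_VARIABLES[name]
    except KeyError:
        valid = ', '.join(AORC_VARIABLES)
        raise ValueError(f"Unknown variable '{name}'. Valid names: {valid}") from None


code, units = lookup_variable('Air Temperature')
print(code, units)
```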

In [None]:
# User-defined variable - see above for a list of valid variable names
variable_name = 'Total Precipitation'

# User-defined aggregation interval - valid values are 'hour','day','month','year'  
agg_interval = 'day'
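An aggregation interval name such as `'day'` eventually has to become a frequency string that a resampling routine understands. The sketch below shows one plausible mapping to pandas offset aliases. The mapping `FREQ_ALIASES` and the helper `to_freq_alias` are assumptions for illustration; the codes actually returned by `get_aggregation_code` in `aorc_utils.py` may differ.

```python
import pandas as pd

# Hypothetical mapping from interval names to pandas offset aliases
FREQ_ALIASES = {'hour': 'h', 'day': 'D', 'month': 'MS', 'year': 'YS'}


def to_freq_alias(interval):
    """Translate an interval name into a pandas offset alias."""
    if interval not in FREQ_ALIASES:
        raise ValueError(f"agg_interval must be one of {sorted(FREQ_ALIASES)}")
    return FREQ_ALIASES[interval]


# One week of synthetic hourly values, grouped into daily bins
hourly = pd.Series(1.0, index=pd.date_range('1990-01-01', periods=7 * 24, freq='h'))
daily = hourly.resample(to_freq_alias('day')).sum()
print(daily)
```

The `'MS'` and `'YS'` aliases label each bin by the start of the month or year, which is one reasonable convention for time series output.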

### 3. Display Map with Point Location

In [None]:
# Point location
point = Point(lon, lat)

# Create a GeoDataFrame with the point
gdf_point = gpd.GeoDataFrame(geometry=[point], crs=input_crs)

# Create a layout for the plot
fig, ax = plt.subplots(figsize=(10, 8))

# Display the point location
gdf_point.plot(ax=ax, color='red', marker='o', markersize=100, label='Point Location')

# Add a topographic basemap using contextily 
ctx.add_basemap(ax, source=ctx.providers.Esri.NatGeoWorldMap, crs=gdf_point.crs.to_string(), alpha=1)

# Set the title and the x and y axis labels
ax.set_title("Map with Point Location", fontsize=14)
ax.set_xlabel("Longitude", fontsize=12)
ax.set_ylabel("Latitude", fontsize=12)

# Show the plot
plt.show()

### 4. Virtually Load the Data Array 
This block of code maps the input variable name onto the variable encoding used in the Zarr bucket storage. It then loads the virtual xarray dataset for the variable of interest.

In [None]:
# Get the variable code
variable_code = get_variable_code(variable_name)

# Get the S3 bucket data file URL
url = get_conus_bucket_url(variable_code)
ds = load_dataset(url)

# Print the dataset (ds) of selected variable
print(ds)

# Print the units of the selected variable in AORC dataset
var_key = list(ds.data_vars)[0]
print(f"The unit of {var_key} is {ds[var_key].attrs.get('units', 'No units specified')}")

### 5. Subset and Aggregate the Data

This block of code first projects the input location, given by its lon (x) and lat (y) coordinates and their coordinate system, into the coordinate system used by the AORC data. The AORC data coordinate system is a Lambert Conformal Conic projection used by the National Water Model. Curious users can examine `ds.crs.esri_pe_string` to see the details.

Data from the AORC grid cell with center nearest to the point of interest for the variable of interest is then retrieved and aggregated to the time interval specified, using sum aggregation for precipitation and mean aggregation for other variables.  The time step of the input AORC data is hourly.
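Selecting the grid cell whose center is nearest to a point can be illustrated without any remote data. The sketch below uses a small hypothetical grid (not the AORC grid) and plain NumPy. On a regular grid this is conceptually what a nearest-neighbour lookup such as `.sel(x=x, y=y, method='nearest')` does along each axis.

```python
import numpy as np

# Hypothetical grid cell centers (metres) in a projected coordinate system
x_centers = np.arange(-2000.0, 2001.0, 1000.0)   # -2000, -1000, 0, 1000, 2000
y_centers = np.arange(0.0, 3001.0, 1000.0)       # 0, 1000, 2000, 3000


def nearest_index(centers, value):
    """Index of the cell center closest to value along one axis."""
    return int(np.abs(centers - value).argmin())


# Target point, already expressed in the grid's coordinate system
x_pt, y_pt = 640.0, 1380.0
ix = nearest_index(x_centers, x_pt)
iy = nearest_index(y_centers, y_pt)
print(f"Nearest cell center: x = {x_centers[ix]}, y = {y_centers[iy]}")
```

Because the x and y axes are searched independently, the point must be in the same coordinate system as the grid before the lookup, which is why the reprojection step comes first.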

The result is saved in a data frame, `ds_subset_df`, which is indexed by time (date) and holds as columns the x and y coordinates of the AORC grid cell center nearest to the input point (in the Lambert Conformal Conic coordinate system used by AORC) and the variable of interest.

In [None]:
# Reproject coordinates
x, y = reproject_coordinates(ds, lon, lat, input_crs)

# Get aggregation code
agg_code = get_aggregation_code(agg_interval)

# Subsetting and aggregating the user-defined variable
variable_code_cap = variable_code.upper()

if variable_code == 'precip':
    # Multiply by 3600 to convert the rate to an hourly depth (assumes RAINRATE is stored in mm/s)
    ds_subset = ds['RAINRATE'].loc[dict(time=slice(start_datetime, end_datetime))].sel(y=y, x=x, method='nearest').compute() * 3600
    ds_subset_df = ds_subset.resample(time=agg_code).sum().to_dataframe()
    unit = f"mm/{agg_interval}"
else:
    ds_subset = ds[variable_code_cap].loc[dict(time=slice(start_datetime, end_datetime))].sel(y=y, x=x, method='nearest').compute()
    ds_subset_df = ds_subset.resample(time=agg_code).mean().to_dataframe()
    unit = ds[variable_code_cap].attrs.get('units', 'No units specified')

# Identify the column name in the resulting DataFrame
new_column_name = ds_subset_df.columns[2]

# Rename the last column to include the unit
variable_name_with_unit = f"{new_column_name} ({unit})"
ds_subset_df.rename(columns={new_column_name: variable_name_with_unit}, inplace=True)

print(ds_subset_df)
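The difference between sum and mean aggregation can be checked on synthetic data. The following sketch uses made-up hourly values (not AORC data) and assumes, as the `* 3600` factor in the cell above suggests, that precipitation is stored as a rate in mm/s.

```python
import numpy as np
import pandas as pd

# Two days of synthetic hourly data (not AORC values)
time = pd.date_range('1990-01-01', periods=48, freq='h')
rate_mm_per_s = pd.Series(np.full(48, 0.001), index=time)      # precipitation rate
temp_k = pd.Series(np.linspace(270.0, 280.0, 48), index=time)  # air temperature

# Rate -> depth per hour, then sum the hourly depths within each day
depth_mm = rate_mm_per_s * 3600
daily_precip = depth_mm.resample('D').sum()

# State variables such as temperature are averaged instead
daily_temp = temp_k.resample('D').mean()

print(daily_precip)
print(daily_temp)
```

Summing is appropriate for precipitation because hourly depths accumulate, whereas averaging a temperature or pressure keeps the result in the variable's original units.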

### 6. Plot the Data and Trend

This block of code plots the variable column of the data frame against its time index.

It also adds a trend line as an illustration of working with the data.

In [None]:
# Extracting the time index and column to plot
time_list = pd.to_datetime(ds_subset_df.index)
data_list = ds_subset_df.iloc[:,2]

# Setup the plot
plt.figure(figsize=(14, 8))  # Adjusting the size to provide more space for x-labels
plt.plot(time_list, data_list, color='blue', linewidth=2, marker='o', markersize=6, markerfacecolor='red', markeredgewidth=1)
plt.title(f'{variable_name} ({start_datetime} - {end_datetime})', fontsize=16)
plt.xlabel('Date/time', fontsize=14)
plt.ylabel(f'{variable_name} ({unit})', fontsize=14)
plt.grid(True, linestyle='--', alpha=0.7)

# Handling overlapping x-labels
plt.xticks(rotation=45, ha='right')  # Rotate labels and align them to the right
plt.gca().xaxis.set_major_locator(plt.MaxNLocator(nbins=10))  # Show fewer labels to avoid overlap
plt.gcf().autofmt_xdate()  # Automatically adjust x-label formatting for better spacing

# Adding a trend line
z = np.polyfit(range(len(time_list)), data_list, 1)
p = np.poly1d(z)
plt.plot(time_list, p(range(len(time_list))), color='black', linestyle='--', alpha=0.7)

# Adjust layout for better spacing
plt.tight_layout()

# Saving the plot
plt.savefig(f'{agg_interval}_{variable_code}_plot_for_point.png', dpi=800)

# Displaying the plot
plt.show()
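The trend line above is a degree-1 least-squares fit against the position of each value in the series. A minimal sketch on synthetic data with a known slope shows how to read the fitted coefficients. The numbers here are made up for illustration.

```python
import numpy as np

# Synthetic series with a known slope of 0.5 per time step plus small noise
rng = np.random.default_rng(0)
steps = np.arange(100)
values = 2.0 + 0.5 * steps + rng.normal(0.0, 0.1, size=steps.size)

# A degree-1 fit returns coefficients from highest power down: [slope, intercept]
slope, intercept = np.polyfit(steps, values, 1)

# Evaluate the fitted line at every step, as done for the dashed line in the plot
trend = np.poly1d([slope, intercept])(steps)

print(f"slope = {slope:.3f} per step, intercept = {intercept:.3f}")
```

Because the fit uses the step number rather than calendar time, the slope is expressed per aggregation step (for example, per day when `agg_interval = 'day'`).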


### 7. Save the Data as a CSV File

In [None]:
# Specify the file path where you want to save the CSV file
file_path = f"{variable_name}_at_a_point.csv"

# Save the DataFrame to a CSV file
ds_subset_df.to_csv(file_path, index=True)
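A saved file can be read back with its datetime index restored. The sketch below uses a small stand-in data frame (the column name and values are made up) and an in-memory buffer rather than a file on disk. It also shows one optional way to avoid spaces in the file name, which is a suggestion rather than something the notebook does.

```python
import io
import pandas as pd

# Small stand-in for ds_subset_df: a time index plus one value column
df = pd.DataFrame(
    {'RAINRATE (mm/day)': [0.0, 2.5, 1.2]},
    index=pd.date_range('1990-01-01', periods=3, freq='D', name='time'),
)

# Spaces in file names can be awkward on the command line, so replace them
variable_name = 'Total Precipitation'
file_name = f"{variable_name.replace(' ', '_')}_at_a_point.csv"

# Write to an in-memory buffer here; pass file_name instead to write to disk
buffer = io.StringIO()
df.to_csv(buffer, index=True)
buffer.seek(0)

# Read back, restoring the datetime index
df_back = pd.read_csv(buffer, index_col='time', parse_dates=True)
print(file_name)
print(df_back)
```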