# Accessing Historic Observation Data Platform data

You can access the finalized, cloud-optimized data with the Python library `intake`, which interfaces with our data catalog. This notebook shows how to query the catalog and download data, and includes some simple plotting code to generate figures of the data.

In [None]:
import intake

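First, open the catalog using `intake`. The `open_esm_datastore` function is provided by the `intake-esm` plugin, so it must be installed alongside `intake`.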

In [None]:
cat = intake.open_esm_datastore("https://cadcat.s3.amazonaws.com/histwxstns/era-hdp-collection.json")

Next, view the catalog in table format. You can inspect the first few rows by calling `.head()` on the table.

In [None]:
# Access catalog as dataframe and inspect the first few rows
cat_df = cat.df
cat_df.head()

List all the weather station networks in the catalog with the following code.

In [None]:
# See all network options 
cat_df["network_id"].unique()

You can also filter the catalog to see all stations within a network.

In [None]:
my_network = "ASOSAWOS"
cat_df[cat_df["network_id"] == my_network]

You can subset the catalog and read in the cloud-optimized data as `xarray.Dataset` objects using the method shown below. To change the data downloaded, simply modify the inputs in the dictionary `query`. These inputs must correspond to valid options in the catalog. 

In [None]:
# Set your query here
query = {
    "network_id": "ASOSAWOS",  # Name of the network
    "station_id": [  # List of stations to get data for
        "ASOSAWOS_A0002694297",
        "ASOSAWOS_A0704900320",
        "ASOSAWOS_72020200118",
    ],
}

# Subset catalog
cat_subset = cat.search(**query)

# View the data you've selected before downloading
cat_subset.df

In [None]:
cat_subset
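
To build intuition for how the `query` dictionary narrows the catalog, here is a minimal sketch that mimics the filtering with plain `pandas`. It assumes that keys are combined with AND and that a list of values for one key is treated as OR, which is our understanding of the search behavior rather than a description of how `intake-esm` implements it. The table, the IDs, and the helper `filter_like_search` are all hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the catalog table (not real station IDs)
demo_catalog = pd.DataFrame(
    {
        "network_id": ["NET_A", "NET_A", "NET_A", "NET_B"],
        "station_id": ["NET_A_001", "NET_A_002", "NET_A_003", "NET_B_001"],
    }
)

# Same shape as the `query` dictionary used with the real catalog
demo_query = {
    "network_id": "NET_A",
    "station_id": ["NET_A_001", "NET_A_003", "NET_B_001"],
}


def filter_like_search(df, query):
    """Keep rows that match every key; a list of values matches any of them."""
    mask = pd.Series(True, index=df.index)
    for column, wanted in query.items():
        values = wanted if isinstance(wanted, list) else [wanted]
        mask &= df[column].isin(values)
    return df[mask]


demo_subset = filter_like_search(demo_catalog, demo_query)
print(demo_subset)
```

In this sketch, `NET_B_001` is dropped even though it appears in the station list, because its network does not match `NET_A`.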

Then, you can download all the files. The data are returned as a dictionary in which each key is a string describing the data and each value is the corresponding data object.

In [None]:
# Get dataset dictionary 
dsets = cat_subset.to_dataset_dict(
    xarray_open_kwargs={'consolidated':False},
    storage_options={'anon':True}
)

To see all the string IDs for the Datasets in the dictionary, you can print them with the following code: 

In [None]:
list(dsets.keys())
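
Because the result is a dictionary, you can also loop over its keys and values to get a quick overview of what you downloaded. The sketch below uses small synthetic `xarray.Dataset` objects in place of the real data so that it runs without network access. The keys, the variable `tas`, and the helper `make_demo_dataset` are placeholders chosen for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr


def make_demo_dataset(n_days):
    """Build a tiny synthetic dataset with one variable along a time axis."""
    time = pd.date_range("2020-01-01", periods=n_days, freq="D")
    values = np.linspace(280.0, 290.0, n_days)
    return xr.Dataset({"tas": ("time", values)}, coords={"time": time})


# Hypothetical stand-in for the dictionary of datasets
demo_dsets = {
    "NET_A.NET_A_001": make_demo_dataset(24),
    "NET_A.NET_A_002": make_demo_dataset(48),
}

# Summarize each dataset: number of timesteps and available variables
summary = {}
for key, demo_ds in demo_dsets.items():
    summary[key] = {
        "n_times": demo_ds.sizes["time"],
        "variables": list(demo_ds.data_vars),
    }
    print(key, summary[key])
```

The same loop should work on the real dictionary, as long as each dataset has a `time` dimension.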

You can easily access the files in the dictionary using the following format: 
```
dsets[<string ID of data>]
```
The string ID of the data is constructed using both the network ID and the station ID for each individual weather station. 

In [None]:
# Retrieve a single file
ds = dsets["ASOSAWOS.ASOSAWOS_72020200118"]
ds
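
If you prefer to build the string ID programmatically rather than typing it out, a small helper like the one below can do it. This is a sketch that assumes the `<network ID>.<station ID>` pattern seen in the key above; the function name `dataset_key` is hypothetical.

```python
def dataset_key(network_id, station_id, sep="."):
    """Join a network ID and a station ID into a dictionary key."""
    return f"{network_id}{sep}{station_id}"


key = dataset_key("ASOSAWOS", "ASOSAWOS_72020200118")
print(key)  # ASOSAWOS.ASOSAWOS_72020200118
```

You could then retrieve the data with `dsets[key]`.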

## Make a quick plot of the data 
`xarray` has built-in plotting features that let you quickly generate a time series plot for a single variable. This lets you get a sense of the data you read in.

In [None]:
variable_to_plot = "tas"
ds.squeeze()[variable_to_plot].plot(x="time");

# Subset the historical weather stations for a region

If you're interested in historical weather observation stations in a specific area, you can also subset the full archive of stations to identify those that you are interested in. We will read in the Historical Data Platform station list, which provides the coordinates, dates of coverage, source network, and total number of observations for each station. We'll soon be adding additional information like the state and common station identifiers!

For this, we'll use an area shapefile. Use your own to see which stations are in your area! You can either pass a **web link to an open access shapefile** or **upload your own shapefile** to your Hub instance.

In [None]:
# Imports
import geopandas as gpd
import pandas as pd
import sys
import os

# Import a shapefile clipping function from figure_utils
sys.path.append(os.path.abspath("../notebooks"))
from figure_utils import clip_gpd_to_shapefile

In [None]:
# Read in the Historical Data Platform station list
# Note: this will be replaced with the cadcat station list in AWS
hdp_stns = pd.read_csv("s3://wecc-historical-wx/4_merge_wx/all_network_stationlist_merge.csv")

In [None]:
# Region of Interest shapefile, and set coordinate reference system
roi = gpd.read_file("...")  # Replace this with your own shapefile!
roi = roi.to_crs("EPSG:3857")

# If the shapefile contains multiple polygons and you only want some of them,
# filter the rows by attribute first. Example:
# roi = roi[roi["NAME"] == "SPECIFIC_AREA"]

In [None]:
# Clip the stationlist to subset within your area of interest
stns_within_area = clip_gpd_to_shapefile(hdp_stns, roi)
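
The helper `clip_gpd_to_shapefile` comes from `figure_utils`, and its implementation is not shown in this notebook. As a rough illustration of what this kind of clip can look like, the sketch below converts a station table to points, reprojects it to the region's coordinate reference system, and keeps the points inside the polygon using `geopandas.clip`. It is not the helper's actual code. The column names `longitude` and `latitude`, the station IDs, and the rectangular region are all assumptions made for the example.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import box

# Hypothetical station table; the real station list may use other column names
demo_stns = pd.DataFrame(
    {
        "station_id": ["STN_1", "STN_2", "STN_3"],
        "longitude": [-121.5, -118.2, -100.0],
        "latitude": [38.5, 34.0, 40.0],
    }
)

# Turn the table into point geometries in geographic coordinates
demo_gdf = gpd.GeoDataFrame(
    demo_stns,
    geometry=gpd.points_from_xy(demo_stns["longitude"], demo_stns["latitude"]),
    crs="EPSG:4326",
)

# Hypothetical region of interest: a simple rectangle, reprojected as in the notebook
demo_roi = gpd.GeoDataFrame(
    {"NAME": ["demo_region"]},
    geometry=[box(-125.0, 32.0, -114.0, 42.0)],
    crs="EPSG:4326",
).to_crs("EPSG:3857")

# Reproject the stations to match the region, then keep only points inside it
demo_in_region = gpd.clip(demo_gdf.to_crs(demo_roi.crs), demo_roi)
print(sorted(demo_in_region["station_id"]))
```

Matching the coordinate reference systems before clipping matters here, since points and polygons in different systems will not line up.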

This subset contains the stations within the area defined by your shapefile. You can now export the list for your own records and use it to look up specific stations.

In [None]:
# Export the subset portion of station list
stns_within_area.to_csv("subset_station_list.csv")

Want a more interactive way to zoom in on an area? Use `stns_within_area.explore()`! Note that it may take some time to load.

In [None]:
stns_within_area.explore()