<center>
<h1>Accessing THREDDS using Siphon</h1>
<br>
<h3>25 July 2017
<br>
<br>
Ryan May (@dopplershift)
<br><br>
UCAR/Unidata<br>
</h3>
</center>


# What is Siphon?

* Python library for remote data access
* Focus on atmospheric and oceanic data sources
* Bulk of features focused on THREDDS

## Installing on Azure

In [None]:
!conda install --name root siphon -y -c conda-forge

## Functionality
* THREDDS catalog parser
* NetCDF Subset Service (NCSS) client
* CDM Remote client
* Radar Query Service client

# THREDDS?
* Server for data collections in various formats
* Powered by netCDF-Java
* Provides catalogs of data with metadata information
* Programmatic access to data with various services

* Metadata services
  - ISO
  - UDDC
  - NCML

* Download service (HTTPServer)

* Subsetting services
  - WMS/WCS
  - OPeNDAP and CDMRemote
  - NetCDF Subset Service (NCSS)

## THREDDS Demo
http://thredds.ucar.edu

# Siphon for THREDDS
- Let's start by parsing a THREDDS catalog

In [None]:
from siphon.catalog import TDSCatalog
top_cat = TDSCatalog('http://thredds.ucar.edu/thredds/catalog.xml')

That takes care of downloading the catalog, parsing the XML, and exposing its contents as Python objects. From here we can do things like look at all the catalog references:

In [None]:
for ref in top_cat.catalog_refs:
    print(ref)

So we can see what's available at the top level. We can also extract exactly what we're looking for using the name of the item:

In [None]:
ref = top_cat.catalog_refs['Forecast Model Data']
ref.href

Or we can just access by position:

In [None]:
ref = top_cat.catalog_refs[0]
ref.href

and then resolve that catalog reference to get a new catalog.

In [None]:
new_cat = ref.follow()
list(new_cat.catalog_refs)
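
To make the parsing and resolving steps less opaque, here is a minimal sketch of how catalog references could be read out of THREDDS catalog XML and turned into absolute URLs with only the standard library. This is not Siphon's actual implementation: the inline `SAMPLE_CATALOG` document, its titles and hrefs, and the helper `parse_catalog_refs` are all made up for illustration.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# XML namespaces used by THREDDS catalog documents
NS = {'cat': 'http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0',
      'xlink': 'http://www.w3.org/1999/xlink'}

# Hypothetical, heavily trimmed catalog; a real one carries far more metadata
SAMPLE_CATALOG = """
<catalog xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"
         xmlns:xlink="http://www.w3.org/1999/xlink">
  <catalogRef xlink:title="Forecast Model Data" xlink:href="idd/forecastModels.xml"/>
  <catalogRef xlink:title="Radar Data" xlink:href="idd/radars.xml"/>
</catalog>
""".strip()


def parse_catalog_refs(xml_text, base_url):
    """Map each catalog reference title to its absolute URL."""
    root = ET.fromstring(xml_text)
    refs = {}
    for elem in root.findall('cat:catalogRef', NS):
        title = elem.get('{%s}title' % NS['xlink'])
        href = elem.get('{%s}href' % NS['xlink'])
        # Relative hrefs are resolved against the URL of the parent catalog
        refs[title] = urljoin(base_url, href)
    return refs


refs = parse_catalog_refs(SAMPLE_CATALOG,
                          'http://thredds.ucar.edu/thredds/catalog.xml')
for title, url in refs.items():
    print(title, '->', url)
```

Siphon's `follow()` hides this bookkeeping and hands back a ready-to-use catalog for the referenced URL.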

We can do this one more time, but instead of `catalog_refs`, we look at the `datasets` attribute to see the list of datasets available.

In [None]:
gfs_cat = new_cat.catalog_refs[4].follow()
list(gfs_cat.datasets)

`datasets` works just like `catalog_refs` in providing both name- and position-based access. Here we can access the first dataset in the catalog:

In [None]:
ds = gfs_cat.datasets[0]
ds.name
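
As a rough illustration of how a single container can offer both kinds of lookup, the sketch below treats integer keys as positions and everything else as names. The class `NameOrPositionLookup` and the dataset names are hypothetical; Siphon's own container may be built quite differently.

```python
class NameOrPositionLookup:
    """Hypothetical container supporting lookup by name or by position."""

    def __init__(self, items):
        # items: iterable of (name, value) pairs; dicts keep insertion order
        self._items = dict(items)

    def __getitem__(self, key):
        # Integers are treated as positions, anything else as a name
        if isinstance(key, int):
            return list(self._items.values())[key]
        return self._items[key]

    def __iter__(self):
        return iter(self._items)

    def __len__(self):
        return len(self._items)


# Made-up names and placeholder values standing in for dataset handles
datasets = NameOrPositionLookup([
    ('Full Collection Dataset', 'handle-0'),
    ('Best Time Series', 'handle-1'),
    ('Latest Collection', 'handle-2'),
])

print(datasets[0])
print(datasets['Best Time Series'])
print(list(datasets))
```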

For catalogs that provide an automatically updated "latest" dataset, the `latest` attribute is available:

In [None]:
ds = gfs_cat.latest
ds.name

Let's get a new catalog directly to some satellite data:
http://thredds.ucar.edu/thredds/idd/satellite.html

In [None]:
sat_cat = TDSCatalog('http://thredds.ucar.edu/thredds/catalog/'
                     'satellite/3.9/WEST-CONUS_4km/current/catalog.xml')
list(sat_cat.datasets)

Instead of accessing the dataset by name or position, we can also ask the collection of datasets to parse the filenames as datetimes and find:
- those within a range
- those closest to a time

In [None]:
from datetime import datetime, timedelta

In [None]:
# Look for all data within the last hour
now = datetime.utcnow()
recent = sat_cat.datasets.filter_time_range(start=now - timedelta(hours=1),
                                            end=now)
[ds.name for ds in recent]

In this case, the filter resulted in a list of `Dataset` handles. If we instead look for the dataset nearest to a time, we get a single `Dataset` handle:

In [None]:
# Look for data from an hour ago
dt = datetime.utcnow() - timedelta(hours=1)
ds = sat_cat.datasets.filter_time_nearest(dt)
ds.name
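
The idea behind both filters can be sketched with the standard library alone: pull a timestamp out of each name, then compare datetimes. The file names, the `YYYYMMDD_HHMM` timestamp layout, and the helper functions below are assumptions for illustration and do not reproduce Siphon's internals.

```python
import re
from datetime import datetime, timedelta

# Hypothetical file names; the timestamp layout is an assumption
NAMES = ['WEST-CONUS_4km_3.9_20170725_1200.gini',
         'WEST-CONUS_4km_3.9_20170725_1230.gini',
         'WEST-CONUS_4km_3.9_20170725_1300.gini']

TIME_PATTERN = re.compile(r'(?P<stamp>\d{8}_\d{4})')


def name_to_time(name):
    """Extract the timestamp embedded in a file name."""
    match = TIME_PATTERN.search(name)
    return datetime.strptime(match.group('stamp'), '%Y%m%d_%H%M')


def in_time_range(names, start, end):
    """Keep the names whose timestamp falls within [start, end]."""
    return [n for n in names if start <= name_to_time(n) <= end]


def nearest_in_time(names, when):
    """Return the single name whose timestamp is closest to `when`."""
    return min(names, key=lambda n: abs(name_to_time(n) - when))


target = datetime(2017, 7, 25, 12, 40)
recent_names = in_time_range(NAMES, target - timedelta(hours=1), target)
closest_name = nearest_in_time(NAMES, target)
print(recent_names)
print(closest_name)
```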

We can use the dataset handle to look at the available access methods:

In [None]:
ds.access_urls

## Putting it together

How would we use this? Let's say we wanted to write a script to download the latest global run of the Wave Watch 3 model (WW3), and plot the output. So far, we have enough to get to the proper dataset:

In [None]:
top_cat = TDSCatalog('http://thredds.ucar.edu/thredds/catalog.xml')
models_cat = top_cat.catalog_refs[0].follow()
ww3_cat = models_cat.catalog_refs['Wave Watch III Global'].follow()
latest_ww3 = ww3_cat.latest
print(latest_ww3.name)
print(latest_ww3.access_urls)
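
A script would next have to choose one of the access methods. The sketch below assumes `access_urls` behaves like a dictionary from service name to URL; the example URLs, the service names used as keys, and the helper `pick_access_url` are hypothetical.

```python
# Hypothetical mapping shaped like a dataset's access_urls: service -> URL
access_urls = {
    'OPENDAP': 'http://example.com/thredds/dodsC/some/dataset',
    'HTTPServer': 'http://example.com/thredds/fileServer/some/dataset',
    'NetcdfSubset': 'http://example.com/thredds/ncss/some/dataset',
}


def pick_access_url(urls, preferred):
    """Return (service, url) for the first preferred service on offer."""
    for service in preferred:
        if service in urls:
            return service, urls[service]
    raise ValueError('None of the requested services are available: '
                     + ', '.join(preferred))


service, url = pick_access_url(access_urls,
                               ['CdmRemote', 'OPENDAP', 'HTTPServer'])
print(service, url)
```

Listing services in order of preference lets the same script work against servers that expose different subsets of services.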

## Exercise #1
1. Using Siphon, navigate from the top-level THREDDS catalog at https://nomads.ncdc.noaa.gov/thredds/catalog.xml to the 3-hour NARR-A data from January 5th, 2014 (or another product or time of interest)
1. Using Siphon, compare the available access methods (on http://thredds.ucar.edu) for:
  - The "Best GFS Quarter Degree Forecast Time Series" (under "Forecast Model Data")
  - A data file of "NEXRAD Level II Radar WSR-88D" (under "Radar Data")

In [None]:
# Start here
top_cat = TDSCatalog('https://nomads.ncdc.noaa.gov/thredds/catalog.xml')

# Accessing data using Siphon
Accessing catalogs is only part of the story; Siphon is much more useful if you're trying to access/download datasets.

For instance, going back to our satellite data from earlier:

In [None]:
# Same as before
cat = TDSCatalog('http://thredds.ucar.edu/thredds/catalog/'
                'satellite/3.9/WEST-CONUS_4km/current/'
                'catalog.xml')
ds = cat.datasets.filter_time_nearest(datetime.utcnow()
                                      - timedelta(hours=1))

We can ask Siphon to download the file locally:

In [None]:
ds.download('data.gini')

Or better yet, get a file-like object that lets us `read` from the file as if it were local:

In [None]:
fobj = ds.remote_open()
data = fobj.read()

This is handy if you have Python code to read a particular format.
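
As a small sketch of that pattern, the reader below accepts any file-like object, so the same code could serve local files and remote ones alike. The binary layout (a `DEMO` magic string followed by two 16-bit integers) is invented for this example, and `io.BytesIO` merely stands in for the object returned by `remote_open()`.

```python
import io
import struct


def read_header(fobj):
    """Parse the header of a made-up binary format from a file-like object."""
    magic = fobj.read(4)
    if magic != b'DEMO':
        raise ValueError('Unexpected magic bytes: %r' % magic)
    # Two big-endian unsigned 16-bit integers: image width and height
    width, height = struct.unpack('>HH', fobj.read(4))
    return {'width': width, 'height': height}


# io.BytesIO stands in for the object returned by remote_open()
fake_remote = io.BytesIO(b'DEMO' + struct.pack('>HH', 640, 480) + b'payload')
header = read_header(fake_remote)
print(header)
```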

It's also possible to open the remote file through services that provide a netCDF4-like interface. This allows downloading only the variables of interest, or (index-based) subsets of that data:

In [None]:
nc = ds.remote_access()

By default this uses CDMRemote (if available), but it's also possible to ask for OPeNDAP (using netCDF4-python).

From here we can see what variables are available:

In [None]:
list(nc.variables)

Or get a subset of the values:

In [None]:
# Plot small sample image
%matplotlib inline
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1, 1, figsize=(6, 6))
ax.imshow(nc.variables['IR'][0, ::10, ::10], cmap='Greys', interpolation='none')
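
To get a feel for how much smaller such a strided request is, the following sketch computes the shape that an index expression selects. The full array shape is a made-up example and `subset_shape` is a hypothetical helper, not part of Siphon.

```python
def subset_shape(shape, index):
    """Shape of the result of indexing an array of `shape` with `index`.

    `index` is a tuple of integers and slices, one entry per dimension.
    """
    result = []
    for size, idx in zip(shape, index):
        if isinstance(idx, slice):
            # slice.indices clips the slice to the dimension length
            result.append(len(range(*idx.indices(size))))
        # An integer index selects one element and drops the dimension
    return tuple(result)


full_shape = (1, 1000, 1200)  # hypothetical (time, y, x) dimensions
index = (0, slice(None, None, 10), slice(None, None, 10))
small_shape = subset_shape(full_shape, index)
print(small_shape)  # (100, 120)
```

With these assumed dimensions, taking every tenth point in both directions requests 12,000 values instead of 1,200,000.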

## Exercise #2
Using `remote_access`, plot a subset of data from the High Resolution Rapid Refresh (http://thredds.ucar.edu/thredds/catalog/grib/NCEP/HRRR/CONUS_2p5km/catalog.html). Pick any of the available collections or individual model runs.

For some datasets, subset support is available:
- Defaults to netCDF Subset Service (NCSS)
- Allows specifying latitude, longitude, time, and variables
- NCSS downloads a netCDF file

To use NCSS, we can call `subset` and get a client.

In [None]:
ds = TDSCatalog('http://thredds.ucar.edu/thredds/catalog/'
                'grib/NCEP/GFS/Global_0p25deg/catalog.xml').datasets[1]
ncss = ds.subset()
ncss.variables

With this client we can set up a query for the data we want. In this case we request the next 24 hours of forecast:

In [None]:
query = ncss.query()
query.lonlat_point(lon=-105, lat=40)
now = datetime.utcnow()
query.time_range(now, now + timedelta(days=1))
query.variables('Temperature_surface')
query.accept('netcdf4')
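
Behind the scenes, a query like this ends up as parameters on an HTTP request. The sketch below builds such a query string by hand; the parameter names (`var`, `longitude`, `latitude`, `time_start`, `time_end`, `accept`) follow the NCSS REST interface as commonly documented, while the helper `point_query` and the exact value formatting are assumptions that may differ from what Siphon sends.

```python
from datetime import datetime, timedelta
from urllib.parse import urlencode


def point_query(variable, lon, lat, start, end, fmt='netcdf4'):
    """Build an NCSS-style query string for a time series at one point."""
    time_format = '%Y-%m-%dT%H:%M:%SZ'
    params = [
        ('var', variable),
        ('longitude', lon),
        ('latitude', lat),
        ('time_start', start.strftime(time_format)),
        ('time_end', end.strftime(time_format)),
        ('accept', fmt),
    ]
    # urlencode percent-escapes characters such as ':' in the timestamps
    return urlencode(params)


start = datetime(2017, 7, 25, 12)
query_string = point_query('Temperature_surface', -105, 40,
                           start, start + timedelta(days=1))
print(query_string)
```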

From here, we pass the query to `get_data`, which returns the result as an already-opened netCDF4 dataset.

In [None]:
nc = ncss.get_data(query)
temp_data = nc.variables['Temperature_surface'][:]
times = nc.variables['time'][:]
fig, ax = plt.subplots(1, 1, figsize=(6, 6))
ax.plot(times, temp_data)
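
The time values plotted above are numeric offsets from a reference time rather than dates. A minimal sketch of decoding them is shown below, assuming the units attribute looks like `Hour since 2017-07-25T00:00:00Z`; the helper `decode_times` and the sample values are hypothetical, and only a few unit names are handled.

```python
from datetime import datetime, timedelta

SECONDS_PER_STEP = {'second': 1, 'minute': 60, 'hour': 3600, 'day': 86400}


def decode_times(values, units):
    """Convert offsets such as 'hours since <reference>' to datetimes."""
    step, reference = units.split(' since ')
    start = datetime.strptime(reference, '%Y-%m-%dT%H:%M:%SZ')
    # Accept both singular and plural unit names, in any letter case
    scale = SECONDS_PER_STEP[step.lower().rstrip('s')]
    return [start + timedelta(seconds=float(v) * scale) for v in values]


decoded = decode_times([0, 3, 6], 'hours since 2017-07-25T00:00:00Z')
for when in decoded:
    print(when.isoformat())
```

With the netCDF4 package installed, `netCDF4.num2date` covers the same ground far more completely.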

We can also request the data for a particular time for a region of interest:

In [None]:
query = ncss.query()
query.lonlat_box(east=-80, west=-90, south=35, north=45)
query.time(now + timedelta(days=1))
query.variables('Temperature_surface')
query.accept('netcdf4')
nc = ncss.get_data(query)

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(6, 6))
ax.imshow(nc.variables['Temperature_surface'][0], cmap='RdBu')

## Exercise #3
- Use `subset` to download a subset of data from one of:
  - http://thredds.ucar.edu/thredds/catalog/grib/NCEP/WW3/Global/catalog.html
  - http://thredds.ucar.edu/thredds/catalog/grib/NCEP/HRRR/CONUS_2p5km/catalog.html
- Pick either a time-series or a 2D subset
- Plot using either `plot` or `imshow`

## A full example

In [None]:
# Get the dataset handle
top_cat = TDSCatalog('http://thredds.ucar.edu/thredds/catalog.xml')
models_cat = top_cat.catalog_refs[0].follow()
gfs_cat = models_cat.catalog_refs['GFS Quarter Degree Forecast'].follow()
latest_gfs = gfs_cat.latest

# Download a subset using NCSS
now = datetime.utcnow()
ncss = latest_gfs.subset()
query = ncss.query().lonlat_point(lon=-86.50, lat=39.17)
query.time_range(now, now + timedelta(days=3)).accept('netcdf4')
query.variables('Temperature_surface')
nc = ncss.get_data(query)

# Plot
fig, ax = plt.subplots(1, 1, figsize=(10, 10))
temp_f = 1.8 * (nc.variables['Temperature_surface'][:] - 273.15) + 32
ax.plot(temp_f, color='r')
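
The unit conversion in the plotting step takes the surface temperature from Kelvin to degrees Fahrenheit. Written out as a standalone function (the name `kelvin_to_fahrenheit` is just for illustration), the same arithmetic looks like this:

```python
def kelvin_to_fahrenheit(kelvin):
    """Convert a temperature from Kelvin to degrees Fahrenheit."""
    celsius = kelvin - 273.15
    return 1.8 * celsius + 32


for kelvin in (273.15, 300.0):
    print(kelvin, 'K =', round(kelvin_to_fahrenheit(kelvin), 2), 'F')
```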

# Future plans for Siphon
- Add curated list of servers
- Support for access to meteorological upper-air archives
- Support for TDS 5.0 CDM Remote Feature service
- Search catalogs using CSW

## Resources
- Siphon docs: https://unidata.github.io/siphon
- Unidata Python Workshop: https://unidata.github.com/unidata-python-workshop
- Unidata Python Gallery: https://unidata.github.com/python-gallery