A generic THREDDS crawler for R

THREDDS Crawler for R

Requirements

R >= 3.0

httr

XML

Installation

It is easy to install with devtools:

library(devtools)
install_github("btupper/threddscrawler")

Classes

TopCatalogRefClass represents catalogs that are containers of CatalogRefClass pointers. This is like a listing of files and subdirectories in a directory, but here the files and subdirectories are all CatalogRefClass pointers.

CatalogRefClass is a pointer to a TopCatalogRefClass.

THREDDS datasets come in two flavors: collections of datasets and direct datasets. I split these into DatasetsRefClass (collections) and DatasetRefClass (direct); the latter has an 'access' child node, which the former does not. A collection is a listing of one or more datasets (either direct or catalogs). A direct dataset is a pointer to an actual resource like a NetCDF file.
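To make the collection versus direct distinction concrete, here is a minimal sketch that uses the XML package to test a dataset node for an 'access' child. The catalog fragment is made up for illustration (it is not copied from a real server), and `dataset_flavor` is a hypothetical helper, not a function exported by threddscrawler.

```r
library(XML)

# A tiny, made-up catalog fragment: one collection holding two direct datasets
catalog_text <- paste0(
  '<catalog>',
  '<dataset name="2010">',
  '<dataset name="a.nc"><access urlPath="x/a.nc"/></dataset>',
  '<dataset name="b.nc"><access urlPath="x/b.nc"/></dataset>',
  '</dataset>',
  '</catalog>')

doc <- xmlParse(catalog_text, asText = TRUE)
top <- xmlRoot(doc)

# Hypothetical helper: a dataset node with an 'access' child is treated as
# direct, anything else as a collection
dataset_flavor <- function(node) {
  kids <- names(xmlChildren(node))
  if ("access" %in% kids) "direct" else "collection"
}

collection <- xmlChildren(top)[["dataset"]]
members <- xmlChildren(collection)

outer_flavor <- dataset_flavor(collection)
inner_flavor <- vapply(members, dataset_flavor, character(1))
```

Here `outer_flavor` describes the enclosing dataset and `inner_flavor` describes each of its members. A real catalog may carry extra nodes and namespaces, so treat this as an illustration of the rule rather than a full parser.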

Example from NERACOOS

NERACOOS exposes data using a THREDDS server. This is an example that draws upon the MUR SST data subset prepared in 2015.

We start by examining the catalog. Note that programmatically we access the companion XML file (catalog.xml).
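If you copy a catalog address from the browser, it may need converting to its XML companion first. The sketch below assumes the server follows the common THREDDS layout, where the browsable page ends in catalog.html and the XML version ends in catalog.xml. The browser URL shown is an assumption built from the XML URL used in this example, and `html_to_xml` is a hypothetical helper, not part of threddscrawler.

```r
# Hypothetical helper: swap a trailing catalog.html for catalog.xml.
# URLs that do not end in catalog.html are returned unchanged.
html_to_xml <- function(url) {
  sub("catalog\\.html$", "catalog.xml", url)
}

# Assumed browser address for the catalog used in this example
browser_url <- "http://www.neracoos.org/thredds/catalog/GMRI/SST/TESTS/NASA_MUR_SST/catalog.html"
xml_url <- html_to_xml(browser_url)
```

The resulting `xml_url` is the kind of address passed to `get_catalog()`.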

We'll crawl these pages in succession...

Top

NorthEastShelf

DailyFiles

2010

library(threddscrawler)

# start by getting the TopCatalog - picture a TopCatalog as a web page that lists one or more catalogs.
Top <- get_catalog('http://www.neracoos.org/thredds/catalog/GMRI/SST/TESTS/NASA_MUR_SST/catalog.xml')
Top
# Reference Class: "TopCatalogRef"
#   verbose_mode: FALSE
#   url: http://www.neracoos.org/thredds/catalog/GMRI/SST/TESTS/NASA_MUR_SST/catalog.xml
#   children: service dataset
#   catalogs: NorthEastShelf GulfOfMaine

# now get the catalogs embedded in the page.  Note that these point to other TopCatalogs.
A <- Top$get_catalogs()
A
# $NorthEastShelf
# Reference Class: "CatalogRefClass"
#   verbose_mode: FALSE
#   url: http://www.neracoos.org/thredds/catalog/GMRI/SST/TESTS/NASA_MUR_SST/NorthEastShelf/catalog.xml
#   children: 
#   name:NorthEastShelf
#   href:NorthEastShelf/catalog.xml
#   title:NorthEastShelf
#   type:
#   ID:GMRI_TESTS/NASA_MUR_SST/NorthEastShelf
# 
# $GulfOfMaine
# Reference Class: "CatalogRefClass"
#   verbose_mode: FALSE
#   url: http://www.neracoos.org/thredds/catalog/GMRI/SST/TESTS/NASA_MUR_SST/GulfOfMaine/catalog.xml
#   children: 
#   name:GulfOfMaine
#   href:GulfOfMaine/catalog.xml
#   title:GulfOfMaine
#   type:
#   ID:GMRI_TESTS/NASA_MUR_SST/GulfOfMaine
  
# now we get the catalog for NorthEastShelf
NES <- A[['NorthEastShelf']]$get_catalog()
NES
# Reference Class: "TopCatalogRef"
#   verbose_mode: FALSE
#   url: http://www.neracoos.org/thredds/catalog/GMRI/SST/TESTS/NASA_MUR_SST/NorthEastShelf/catalog.xml
#   children: service dataset
#   catalogs: MonthlyMeans MonthlyFiles DailyFiles AggregatedMeans
  
# let's get the catalogs.  I won't show them, but we'll get the TopCatalog for the 'DailyFiles'
B <- NES$get_catalogs()
DAYS <- B[['DailyFiles']]$get_catalog()

# now get 2010
C <- DAYS$get_catalogs()
Y2010 <- C[['2010']]$get_catalog()
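
The repeated pattern above (list the catalogs, pick one by name, fetch it) can be wrapped in a loop. This is a minimal sketch that relies only on the `get_catalogs()` and `get_catalog()` methods shown above; `descend_catalog` is a hypothetical helper name, not a function provided by threddscrawler.

```r
# Hypothetical helper: walk from a top catalog down a sequence of catalog names.
# 'top' is any object with a get_catalogs() method, and each returned reference
# is expected to have a get_catalog() method, as in the steps above.
descend_catalog <- function(top, path) {
  node <- top
  for (name in path) {
    refs <- node$get_catalogs()
    if (!(name %in% names(refs))) {
      stop("catalog not found: ", name)
    }
    node <- refs[[name]]$get_catalog()
  }
  node
}

# With the objects from the example (requires network access):
# Y2010 <- descend_catalog(Top, c("NorthEastShelf", "DailyFiles", "2010"))
```

Stopping with an error on a missing name keeps a typo in the path from surfacing later as a confusing NULL.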

Now we are at "the bottom" of the search path and we find only a collection of datasets. Instead of requesting subsequent catalogs we can now request datasets.

days <- Y2010$get_datasets()
head(days, n = 2)
# $`20101231-JPL-L4UHfnd-GLOB-v01-fv04-MUR_subset.nc`
# Reference Class: "DatasetsRefClass"
#   verbose_mode: FALSE
#   url: http://www.neracoos.org/thredds/catalog/GMRI/SST/TESTS/NASA_MUR_SST/NorthEastShelf/DailyFiles/2010/20101231-JPL-L4UHfnd-GLOB-v01-fv04-MUR_subset.nc
#   children: dataSize date
#   datasets: NA
# 
# $`20101230-JPL-L4UHfnd-GLOB-v01-fv04-MUR_subset.nc`
# Reference Class: "DatasetsRefClass"
#   verbose_mode: FALSE
#   url: http://www.neracoos.org/thredds/catalog/GMRI/SST/TESTS/NASA_MUR_SST/NorthEastShelf/DailyFiles/2010/20101230-JPL-L4UHfnd-GLOB-v01-fv04-MUR_subset.nc
#   children: dataSize date
#   datasets: NA

Note that the 'datasets' attribute is NA - that tells us that we have a real data source, not a catalog of data sources.
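Once you have the list of datasets, the names are often enough to pick out the days you want. The sketch below assumes every name in the folder starts with an eight-digit date (YYYYMMDD), as the two names shown above do. `dataset_dates` is a hypothetical helper, and the names are typed in by hand so the snippet runs without network access.

```r
# Dataset names copied from the listing above
dataset_names <- c(
  "20101231-JPL-L4UHfnd-GLOB-v01-fv04-MUR_subset.nc",
  "20101230-JPL-L4UHfnd-GLOB-v01-fv04-MUR_subset.nc")

# Hypothetical helper: read the leading YYYYMMDD of each name as a Date
dataset_dates <- function(x) {
  as.Date(substr(x, 1, 8), format = "%Y%m%d")
}

dates <- dataset_dates(dataset_names)

# keep only the names dated on or after a chosen day
keep <- dataset_names[dates >= as.Date("2010-12-31")]
```

With the objects from the example, the same idea would be `dataset_dates(names(days))`.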