This repository has been archived by the owner on Sep 26, 2023. It is now read-only.

A modern interface to the NetCDF library #3

Open
mdsumner opened this issue Jul 28, 2016 · 4 comments

Comments

@mdsumner

mdsumner commented Jul 28, 2016

A new package for the NetCDF library would be helpful, using modern techniques and Rcpp ideally.

NetCDF files store variables (arrays), built on dimensions (the axes used by the variables, with their own metadata), and attributes (variable-level metadata and global metadata).
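The data model described above can be sketched in a few lines. This is only an illustration in plain Python: the class names `Dimension`, `Variable` and `Dataset` and the example values are hypothetical, not the API of ncdf4, RNetCDF or the NetCDF library itself.

```python
from dataclasses import dataclass, field


@dataclass
class Dimension:
    """A named axis with a length (hypothetical class, for illustration)."""
    name: str
    length: int


@dataclass
class Variable:
    """An array defined over named dimensions, with its own attributes."""
    name: str
    dimensions: tuple  # dimension names, in storage order
    attributes: dict = field(default_factory=dict)


@dataclass
class Dataset:
    """A file: dimensions, variables, and global attributes."""
    dimensions: dict = field(default_factory=dict)
    variables: dict = field(default_factory=dict)
    attributes: dict = field(default_factory=dict)

    def shape(self, varname):
        # A variable's shape is derived from the dimensions it is built on.
        var = self.variables[varname]
        return tuple(self.dimensions[d].length for d in var.dimensions)


# Made-up example content: one 3D variable on three dimensions.
ds = Dataset(
    dimensions={
        "time": Dimension("time", 12),
        "lat": Dimension("lat", 180),
        "lon": Dimension("lon", 360),
    },
    variables={"sst": Variable("sst", ("time", "lat", "lon"), {"units": "degC"})},
    attributes={"title": "example file"},
)
print(ds.shape("sst"))
```

The point of the sketch is that variables do not carry their own shape: they refer to shared dimensions, which is why a reader has to interrogate the file before it can request data.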

http://www.unidata.ucar.edu/software/netcdf/

Existing R packages on CRAN:

https://cloud.r-project.org/web/packages/ncdf4/index.html

https://cloud.r-project.org/web/packages/RNetCDF/index.html

There is an enormous volume of data available in NetCDF; it's the predominant format for global climate studies, ocean modelling, and many remote sensing streams. Its use in R is relatively limited (IMO), restricted to domain experts already familiar with the API model or to users of the higher-level wrappers. Much of the data is available via THREDDS (or OPeNDAP) servers, and this is easily leveraged using the ncdf4 or rgdal R packages (though that support is not turned on in the Windows binaries on CRAN).

Typically, you create a connection to a file and use that connection to read in a variable (or a slice of one) after interrogating the connection for the variable's dimensions and attributes. Sometimes these are mapped grids with a simple 1D axis variable for each dimension (affine or rectilinear referencing); others have "coordinate arrays" where the positioning is stored explicitly (curvilinear referencing).
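To make the three kinds of referencing concrete, here is a minimal sketch of how a cell index (i, j) maps to a position in each case. The function names and the numbers are made up for illustration; plain Python lists stand in for arrays read from a file.

```python
def affine_coord(i, j, x0, dx, y0, dy):
    """Regular grid: position is computed from an origin and a constant step."""
    return (x0 + i * dx, y0 + j * dy)


def rectilinear_coord(i, j, x_axis, y_axis):
    """Irregular spacing: position is looked up in one 1D axis per dimension."""
    return (x_axis[i], y_axis[j])


def curvilinear_coord(i, j, lon2d, lat2d):
    """Explicit coordinate arrays: every cell stores its own position."""
    return (lon2d[j][i], lat2d[j][i])


# Affine: origin (100.0, -50.0), steps of 0.5 and 0.25.
print(affine_coord(2, 4, 100.0, 0.5, -50.0, 0.25))

# Rectilinear: axes with uneven spacing.
x_axis = [100.0, 100.5, 102.0]
y_axis = [-50.0, -49.0, -47.5]
print(rectilinear_coord(2, 1, x_axis, y_axis))

# Curvilinear: 2D longitude/latitude arrays (rows are j, columns are i).
lon2d = [[100.0, 100.6], [100.1, 100.7]]
lat2d = [[-50.0, -50.1], [-49.0, -49.2]]
print(curvilinear_coord(1, 1, lon2d, lat2d))
```

The affine case needs only four numbers, the rectilinear case needs one vector per dimension, and the curvilinear case needs a full array per coordinate. That difference is a large part of why higher-level wrappers tend to handle the first two and not the third.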

Currently there is good support in R for NetCDF via the ncdf4 and RNetCDF packages, and indirectly via the raster package (leveraging ncdf4) and rgdal (leveraging GDAL). It's impressive how much abstraction raster and GDAL provide, but they cover only a relatively small range of the possible file configurations. This level of abstraction is rare for NetCDF use from what I've seen, though; another example is Ferret: http://ferret.pmel.noaa.gov/Ferret

Apart from the domain-specific higher-level functions in rgdal and raster for dealing with 2D or 3D grids with affine georeferencing, there is little abstraction over the standard API use.

A modern I/O package for the format would allow domain-specific packages to be more easily written for specific sources. This is possible now, but it's limited and quite challenging for many users. It would be awesome (and readily achievable with a new wrapper, I believe):

  1. to be able to have a virtual R array that just pages in data from a NetCDF file collection as it is needed

  2. to write a DBI front-end on top of a new modern wrapper to NetCDF

  3. to have support for composite (compound) types, which are not provided by any R package at the moment

  4. to provide consistent support for the OPeNDAP/THREDDS sources across operating systems.
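As a rough sketch of what item 1 could look like, the class below presents several files as one array along the first axis and only reads the parts that an index actually touches. Everything here is hypothetical: `LazyArray` and `make_reader` are illustrative names, and the "files" are in-memory lists standing in for a real reader that would take a start and a count.

```python
class LazyArray:
    """Virtual array over a collection of sources, concatenated on axis 0.

    Each reader is a callable reader(start, count) returning `count` rows
    beginning at `start` within its own source. Nothing is read until the
    array is indexed. Only contiguous slices are supported in this sketch.
    """

    def __init__(self, readers, lengths):
        self.readers = readers
        self.lengths = lengths
        self.reads = 0  # number of reader calls, to show what was touched

    def __len__(self):
        return sum(self.lengths)

    def __getitem__(self, key):
        if not isinstance(key, slice):
            raise TypeError("this sketch only supports slices")
        start, stop, step = key.indices(len(self))
        if step != 1:
            raise ValueError("only contiguous slices are supported")
        rows = []
        offset = 0  # global index of the first row of the current source
        for reader, n in zip(self.readers, self.lengths):
            lo = max(start, offset)
            hi = min(stop, offset + n)
            if lo < hi:  # the request overlaps this source
                rows.extend(reader(lo - offset, hi - lo))
                self.reads += 1
            offset += n
        return rows


def make_reader(data):
    """Stand-in for a file reader: returns a (start, count) slab."""
    return lambda start, count: data[start:start + count]


# Two made-up "files" holding 3 and 4 rows.
arr = LazyArray([make_reader([0, 1, 2]), make_reader([3, 4, 5, 6])], [3, 4])
print(arr[2:5])   # spans both sources
print(arr.reads)  # two reader calls so far
```

A real version would need multi-dimensional indexing, caching, and a reader backed by the NetCDF library, but the bookkeeping (mapping a global index range onto per-file start and count) stays the same.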

@mdsumner
Author

Since I first wrote this, there's now the rhdf5 package on Bioconductor, which provides support for both groups and compound types (which neither ncdf4 nor RNetCDF does):

http://bioconductor.org/packages/devel/bioc/html/rhdf5.html

I will revisit this topic with that new package.

@jmp75

jmp75 commented Dec 19, 2016

Michael,

I have a wishlist that is more than half similar to yours. Over the past few years I have needed to write R packages with domain-specific abstractions on top of netCDF. Usually some spatial abstractions are available out there, but others (notably for time series) are not as readily available.

I am currently working in a domain with ensemble forecast time series (time series of ensembles of time series). I've built a package (not yet open source) with abstractions, on top of ncdf4, using 'xts' for R series handling. I think your ncload package has some overlap.

Independently of that R package, I have a C++ library with time series abstractions and netCDF I/O. The rationale is to have it accessible with a consistent user experience from R, Python, Matlab and so on.

I believe there is a business case for most of what I outlined to be open source, and this is a process I am initiating. I'd welcome a collaboration with you and other interested parties to support this case.

Background code I can point to for information:
I have forked ncdf4 so that it compiles on Windows with Visual Studio (I have not checked with the latest Rtools gcc). I cannot remember off the top of my head what limitation was preventing a CRAN binary, but there was one.

https://github.com/jmp75/ncdf4/tree/devel

@edzer

edzer commented Feb 16, 2017

Several of the ideas here overlap with those written in the stars proposal.

@mdsumner
Author

mdsumner commented Jul 12, 2017

An update to my original post, as quite a bit has occurred that is relevant to this.

libhdf5 is now available on Bioconductor (via the Rhdf5lib package); its goal is to let other packages like rhdf5 be built on a common foundation:

http://bioconductor.org/packages/devel/bioc/vignettes/Rhdf5lib/inst/doc/Rhdf5lib.html#motivation

A foundation like that for classic NetCDF 3 is what I'd like to see. It's likely that libhdf5 already covers the needs of this wishlist item for NetCDF 4.

Since starting this, I've rewritten an approach for extracting file metadata here: https://github.com/hypertidy/ncmeta. It uses both RNetCDF, which is slightly more consistent and easier to use for harvesting, and ncdf4, which is still required for server sources to "get coord or data values". This wishlist item, or a major upgrade or replacement of ncdf4 or RNetCDF, would help improve this further.

Items 1) and 2) above can be done at the R level and I'm interested to see if that can work (but would still benefit from a modernized core wrapper).

Item 3) is now covered by rhdf5.

Item 4) is still pending. It's pretty much not possible on Windows without compiling yourself or getting the Unidata binaries in ncdf4 or RNetCDF, but on Linux it's pretty trivial (easiest is to get the Ubuntu unstable GIS binaries, and it's likely that rocker/geospatial covers this). The usability of server-based sources is now much better: they are well handled by rOpenSci's rerddap, and still available via the lower-level packages.
