A repo for building, serving, and reducing time-chunked ERA5 data on GCP.

H2Oxford/h2ox-data

Wave2Web Hack

H2Ox is a team of Oxford University PhD students and researchers who won first prize in the Wave2Web Hackathon, September 2021, organised by the World Resources Institute and sponsored by Microsoft and BlackRock. In the Wave2Web hackathon, teams competed to predict reservoir levels in four reservoirs in the Kaveri basin west of Bengaluru: Kabini, Krishnaraja Sagar, Harangi, and Hemavathy. H2Ox used sequence-to-sequence models with meteorological and forecast forcing data to predict reservoir levels up to 90 days in the future.

The H2Ox dashboard can be found at https://h2ox.org. The data API can be accessed at https://api.h2ox.org. All code and repos can be found at https://github.com/H2Oxford. Our Prototype Submission Slides are here. The H2Ox team is Lucas Kruitwagen, Chris Arderne, Tommy Lees, and Lisa Thalheimer.

H2Ox - Data

This repo is for a dockerised service to ingest ECMWF ERA5-Land data into a Zarr archive. The Zarr data is rechunked in the time domain in blocks of four years. This ensures efficient access to moderately-sized chunks of data, facilitating timeseries research. Two variables are ingested: two-meter temperature (t2m) and total precipitation (tp).
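To illustrate what time-domain chunking means for a reader of the archive, here is a minimal sketch that maps a timestamp to the index of the time chunk containing it. It is not taken from the repo: the function names are hypothetical, and it assumes hourly timesteps, an archive that starts at ZERO_DT, and a "four-year" block of exactly 1461 days (three ordinary years plus one leap year). The chunk length actually used by the ingestor may differ.

```python
from datetime import datetime

# Assumption: hourly data and a four-year block of 1461 days.
HOURS_PER_CHUNK = 1461 * 24


def time_chunk_index(timestamp: datetime, zero_dt: datetime) -> int:
    """Return the index of the time chunk that contains `timestamp`."""
    elapsed = timestamp - zero_dt
    hours = elapsed.days * 24 + elapsed.seconds // 3600
    if hours < 0:
        raise ValueError("timestamp precedes the archive start date")
    return hours // HOURS_PER_CHUNK


def chunks_for_range(start: datetime, end: datetime, zero_dt: datetime) -> range:
    """Return the chunk indices touched by the inclusive range [start, end]."""
    first = time_chunk_index(start, zero_dt)
    last = time_chunk_index(end, zero_dt)
    return range(first, last + 1)
```

Under these assumptions, a 90-day window touches at most two time chunks, which is why long time chunks suit timeseries reads at a single location.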

Installation - development

For development, the repo can be pip-installed in editable mode (the -e flag) with the [pre-commit] extras:

git clone https://github.com/H2Oxford/h2ox-data.git
cd h2ox-data
pip install -e ".[pre-commit]"

Usage

For containerised deployment, a docker container can be built from this repo. This repo supports the creation of three different dockerised services:

  • the enqueuer queues up data requests to the Copernicus Climate Data Store (CDS).
  • the downloader periodically pings the CDS to determine if the data is ready for download. When it is ready, it downloads the data and stores it in cloud storage.
  • the ingestor ingests the downloaded data into a Zarr archive, rechunking it in the time dimension.

The apps use three sequential cloud storage buckets: the first two store JSON tokens that track download status, and the third stores the raw .nc files, each a large chunk of contiguous ECMWF data. The apps behave as follows: enqueuer stores a queue token in CLOUD_STAGING_QUEUE and periodically checks the CDS API to see if the data is ready for download. When the data is ready, enqueuer places a queue token in CLOUD_STAGING_SCHEDULE, indicating the data is ready for download, and returns a 'success' message. ecwmf_downloader then downloads the data and stores it in the CLOUD_STAGING_RAW directory.
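The token hand-off described above can be sketched as follows. This is an illustration only, not the repo's code: local directories stand in for the three cloud storage paths, and the function names, the token fields, and the is_ready and fetch callables (stand-ins for the CDS status check and the CDS download) are all hypothetical. Removing a token once it has been handled is also an assumption.

```python
import json
from pathlib import Path
from typing import Callable, List


def enqueue(request_id: str, queue_dir: Path) -> Path:
    """Write a queue token for a new data request (CLOUD_STAGING_QUEUE)."""
    token = queue_dir / f"{request_id}.json"
    token.write_text(json.dumps({"request_id": request_id, "status": "queued"}))
    return token


def promote_ready(
    queue_dir: Path, schedule_dir: Path, is_ready: Callable[[str], bool]
) -> List[str]:
    """Move tokens whose data is ready into CLOUD_STAGING_SCHEDULE."""
    promoted = []
    for token in sorted(queue_dir.glob("*.json")):
        payload = json.loads(token.read_text())
        if is_ready(payload["request_id"]):
            payload["status"] = "ready"
            (schedule_dir / token.name).write_text(json.dumps(payload))
            token.unlink()  # assumption: a handled token leaves the queue
            promoted.append(payload["request_id"])
    return promoted


def download_scheduled(
    schedule_dir: Path, raw_dir: Path, fetch: Callable[[str], bytes]
) -> List[Path]:
    """Fetch each scheduled request and store it as .nc (CLOUD_STAGING_RAW)."""
    written = []
    for token in sorted(schedule_dir.glob("*.json")):
        request_id = json.loads(token.read_text())["request_id"]
        target = raw_dir / f"{request_id}.nc"
        target.write_bytes(fetch(request_id))
        token.unlink()
        written.append(target)
    return written
```

Keeping the state in storage tokens rather than in memory means each service can run as a short-lived container and pick up where the previous run left off.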

This repo also allows the user to specify a PROVIDER environment variable, making the docker container flexible to different cloud service ecosystems. The code at h2ox/provider allows utilities specific to different cloud services providers to be imported in a flexible way. Google Cloud Platform (GCP) is provided as a full implementation, but other cloud service providers could be added.
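One way such provider switching could work is a small registry keyed by the PROVIDER value. The sketch below is an assumption about the pattern, not the contents of h2ox/provider: the class name, its attributes, and load_provider are hypothetical.

```python
import os
from typing import Dict, Mapping, Optional, Type


class GCPStorage:
    """Hypothetical stand-in for the GCP-specific utilities."""

    scheme = "gs://"

    def is_cloud_path(self, path: str) -> bool:
        return path.startswith(self.scheme)


# Other providers (e.g. AWS) would be registered here once implemented.
PROVIDERS: Dict[str, Type] = {"GCP": GCPStorage}


def load_provider(env: Optional[Mapping[str, str]] = None):
    """Instantiate the utilities selected by the PROVIDER variable."""
    env = os.environ if env is None else env
    name = env["PROVIDER"].upper()
    if name not in PROVIDERS:
        raise NotImplementedError(f"No utilities registered for PROVIDER={name!r}")
    return PROVIDERS[name]()
```

With this layout, adding a provider means writing one class with the same methods and adding one entry to the registry; the apps themselves stay unchanged.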

Credentials

The Copernicus Climate Data Store (CDS) serves ERA5-Land data using the CDS API library. To protect the CDS API and to schedule repeated updates, this app schedules requests to the CDS queue. Users of this app will need to request credentialed access to the CDS API. Then, to use this app and the CDS API, the user needs to specify the URL and API key in ~/.cdsapirc.
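The ~/.cdsapirc file is a two-line text file with a url: entry and a key: entry. As a convenience, it can be generated from the CDSAPI_URL and CDSAPI_KEY environment variables used by this repo. The helper below is a sketch with hypothetical function names, not part of the repo.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def render_cdsapirc(url: str, key: str) -> str:
    """Return the contents of a .cdsapirc file for the given credentials."""
    return f"url: {url}\nkey: {key}\n"


def write_cdsapirc(
    env: Optional[Mapping[str, str]] = None, path: Optional[Path] = None
) -> Path:
    """Write ~/.cdsapirc (or `path`) from CDSAPI_URL and CDSAPI_KEY."""
    env = os.environ if env is None else env
    path = Path.home() / ".cdsapirc" if path is None else Path(path)
    path.write_text(render_cdsapirc(env["CDSAPI_URL"], env["CDSAPI_KEY"]))
    path.chmod(0o600)  # the file holds a secret, so restrict it to the owner
    return path
```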

A slackbot messenger is also implemented to post updates to a Slack workspace. Follow these instructions to set up a slackbot user, and then set the SLACKBOT_TOKEN and SLACKBOT_TARGET environment variables.
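For orientation, the sketch below shows how a status update could be posted with these two variables through Slack's chat.postMessage Web API method, using only the standard library. The repo's own messenger may use a different client; build_slack_request is a hypothetical helper that only constructs the request, so it can be inspected without network access.

```python
import json
import urllib.request

SLACK_POST_MESSAGE_URL = "https://slack.com/api/chat.postMessage"


def build_slack_request(token: str, channel: str, text: str) -> urllib.request.Request:
    """Build (but do not send) a chat.postMessage request."""
    body = json.dumps({"channel": channel, "text": text}).encode("utf-8")
    return urllib.request.Request(
        SLACK_POST_MESSAGE_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json; charset=utf-8",
        },
        method="POST",
    )
```

To send the message, pass the result to urllib.request.urlopen, with token and channel read from SLACKBOT_TOKEN and SLACKBOT_TARGET.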

Environment Variables

The three different services require environment variables to target the various cloud and ECMWF resources.

PROVIDER=<GCP|e.g. AWS>                                        # a string to tell the app which utilities to use in src/h2ox/provider
CLOUD_STAGING_QUEUE=<gs://path/to/queue/tokens/>               # path to the tokens for enqueued data requests
CLOUD_STAGING_SCHEDULE=<gs://path/to/download/staging/tokens/> # path to the tokens for data which has been staged
CLOUD_STAGING_RAW=<gs://path/to/raw/ncdata/files/>             # path to the raw staged .nc files
SLACKBOT_TOKEN=<my-slackbot-token>                             # a token for a slack-bot messenger
SLACKBOT_TARGET=<my-slackbot-target>                           # target channel to issue ingestion updates
CDSAPI_URL=<url-included-in-cds-credentials>                   # the url used to access the CDS api
CDSAPI_KEY=<key-included-in-cds-credentials>                   # the key to access the CDS api
TARGET=<gs://my/era5/zarr/archive>                             # the cloud path for the zarr archive
ZERO_DT=<YYYY-mm-dd>                                           # the initial date offset of the zarr archive
N_WORKERS=<int>                                                # the number of workers the cloud machine should use for data ingestion.
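A service can fail fast on a misconfigured environment by validating these variables at startup. The following is a minimal sketch, not the repo's code: load_settings is a hypothetical helper, and it treats every variable as required, although each of the three services may only need a subset.

```python
import os
from datetime import datetime
from typing import Any, Dict, Mapping, Optional

REQUIRED = (
    "PROVIDER", "CLOUD_STAGING_QUEUE", "CLOUD_STAGING_SCHEDULE",
    "CLOUD_STAGING_RAW", "SLACKBOT_TOKEN", "SLACKBOT_TARGET",
    "CDSAPI_URL", "CDSAPI_KEY", "TARGET", "ZERO_DT", "N_WORKERS",
)


def load_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, Any]:
    """Read and validate the environment, parsing ZERO_DT and N_WORKERS."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    settings: Dict[str, Any] = {name: env[name] for name in REQUIRED}
    # ZERO_DT uses the YYYY-mm-dd format; N_WORKERS is an integer.
    settings["ZERO_DT"] = datetime.strptime(env["ZERO_DT"], "%Y-%m-%d").date()
    settings["N_WORKERS"] = int(env["N_WORKERS"])
    return settings
```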

Docker

To select an app, the Docker build needs to be given the MAIN build argument, which is the filepath to the app's root directory, e.g. for the enqueuer:

docker build -t <my-tag> --build-arg MAIN=apps/enqueuer .

Cloud Build container registry services can also be targeted at forks of this repository. The Cloud Build service will need to provide the MAIN build argument.

To run the docker container, the environment variables can be passed as a .env file:

docker run --env-file=.env -t <my-tag>

Accessing ingested data

xarray can be used with a zarr backend to lazily access very large zarr archives.

Zarr Xarray
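As a sketch of the lazy access described above, the helpers below open the archive and pull a timeseries at the grid cell nearest to a point. They are illustrative only: the function names are hypothetical, the dimension names latitude, longitude, and time are assumptions about the archive layout, and reading a gs:// path additionally requires the gcsfs package.

```python
def open_archive(target: str):
    """Lazily open the Zarr archive at `target` (e.g. the TARGET path)."""
    import xarray as xr  # imported here so the module loads without xarray

    # Only metadata is read at this point; data loads on demand.
    return xr.open_zarr(target)


def point_timeseries(ds, variable: str, lat: float, lon: float, start: str, end: str):
    """Select one variable at the nearest grid cell over a time window."""
    # Assumption: the dimensions are named latitude, longitude, and time.
    nearest = ds[variable].sel(latitude=lat, longitude=lon, method="nearest")
    # The time slice is a separate call because slices cannot be combined
    # with method="nearest" in a single .sel().
    return nearest.sel(time=slice(start, end))
```

For example, point_timeseries(ds, "t2m", 12.0, 76.4, "2020-01-01", "2020-03-31") returns a lazy DataArray; calling .load() or .values on it triggers the actual read.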

Citation

ERA5-Land can be cited as:

Muñoz Sabater, J., (2019): ERA5-Land hourly data from 1981 to present. Copernicus Climate Change Service (C3S) Climate Data Store (CDS). 10.24381/cds.e2161bac

Our Wave2Web submission can be cited as:

Kruitwagen, L., Arderne, C., Lees, T., Thalheimer, L., Kuzma, S., & Basak, S. (2022): Wave2Web: Near-real-time reservoir availability prediction for water security in India. Preprint submitted to EarthArXiv, doi: 10.31223/X5V06F. Available at https://eartharxiv.org/repository/view/3381/
