Enable download of large (spatial extent) cutouts from ERA5 via cdsapi. #236

euronion · 2022-05-16T12:58:14Z

Closes #221 .

Change proposed in this Pull Request

Split download of ERA5 into monthly downloads (currently: annual downloads) to prevent too-large downloads from ERA5 CDSAPI.
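A minimal sketch of the idea, assuming the request is a plain dict in the style passed to cdsapi. The helper name `split_into_monthly_requests` and the example keys and values are illustrative assumptions, not the code in this PR:

```python
from copy import deepcopy


def split_into_monthly_requests(request):
    """Return one request per (year, month) pair of the input request.

    Hypothetical helper: assumes ``request["year"]`` and ``request["month"]``
    are lists of strings, as in a typical CDS API request dict.
    """
    monthly = []
    for year in request["year"]:
        for month in request["month"]:
            single = deepcopy(request)  # leave the original request untouched
            single["year"] = [year]
            single["month"] = [month]
            monthly.append(single)
    return monthly


# Illustrative annual request (values are assumptions for the example).
annual = {
    "variable": ["runoff"],
    "year": ["2013"],
    "month": [f"{m:02d}" for m in range(1, 13)],
    "area": [90, -180, -90, 180],
}
monthly_requests = split_into_monthly_requests(annual)
```

Each of the twelve resulting requests covers a single month, so every individual download stays much smaller than the annual one.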

TODO

Add month indicator to progress prompts.

Description

Motivation and Context

See #221 .

How Has This Been Tested?

Locally by downloading a large cutout.

Type of change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
[n/a] Breaking change (fix or feature that would cause existing functionality to change)

Checklist

I tested my contribution locally and it seems to work fine.
I locally ran pytest inside the repository and no unexpected problems came up.
I have adjusted the docstrings in the code appropriately.
I have documented the effects of my code changes in the documentation doc/.
[n/a] I have added newly introduced dependencies to environment.yaml file.
I have added a note to release notes doc/release_notes.rst.
I have used pre-commit run --all to lint/format/check my contribution


fneum · 2022-05-17T20:37:08Z

How does that interact with queuing at CDSAPI? Does that increase the chances of getting stuck in the request in month 9 or so?

euronion · 2022-05-17T20:46:33Z

I don't know.

The downloads for the larger cutouts worked relatively smoothly (1-2 hours), but the number of requests is 12x higher for a normal year, so the chances might be higher. On the other hand, since the downloaded slices are smaller I would not expect major performance changes. Probably acceptable, since you're not downloading cutouts on an everyday basis.

I don't know enough about the internals of the ERA5 climate store and I don't think we should optimise our retrieval routines for it as long as we haven't received any complaints for bad performance.

euronion · 2022-05-31T12:08:38Z

Alright. I did not encounter any issues downloading large datasets. Seems to work nicely @FabianHofmann .

What would be helpful is a message indicating which month/year combination is currently being downloaded. Do you have an idea how to implement this easily, @FabianHofmann ?

Then I'd suggest @davide-f tries to download his cutout as well and if that works without issues then we can merge.

davide-f · 2022-05-31T14:51:00Z

@euronion Super! Thank you very much. I am currently busy with other work and, unfortunately, cannot keep a machine waiting on Copernicus for the long download. As soon as I have free resources, I'll test it.
Thank you!

FabianHofmann · 2022-05-31T20:12:12Z

Great. For the logging I would suggest to go with e.g. "2013-01", instead of "2013" only.
See

atlite/atlite/datasets/era5.py

Line 309 in 3c7b4b8

yearstr = ", ".join(atleast_1d(request["year"]))

which could be changed into

timestr = f"{request['year']}-{request['month']}"

and replaced accordingly in

atlite/atlite/datasets/era5.py

Line 311 in 3c7b4b8

varstr = "".join(["\t * " + v + f" ({yearstr})\n" for v in variables])
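Put together, the suggestion could look roughly like the sketch below. It assumes each request holds exactly one year and one month; the function name `format_download_message` is a hypothetical stand-in, not the code that was merged into atlite:

```python
def format_download_message(request, variables):
    """Build the info text listing each variable with its year-month tag.

    Assumes ``request["year"]`` and ``request["month"]`` each hold a single
    value (one month per request). Illustrative only.
    """
    # Zero-pad the month so that "1" is shown as "01", e.g. "2013-01".
    timestr = f"{request['year']}-{int(request['month']):02d}"
    varstr = "".join(f"\t * {v} ({timestr})\n" for v in variables)
    return f"CDS: Downloading variables\n{varstr}"


message = format_download_message({"year": "2013", "month": "1"}, ["runoff"])
```

Zero-padding the month keeps the tag in the "2013-01" form suggested above, regardless of how the month is stored in the request.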

davide-f · 2022-06-15T22:19:17Z

As discussed with @euronion, I'll wait for his latest updates by the end of the week (estimate), and I'll run the model for the entire world.

As a comment, the "number of slices", currently one per month, could be a parameter as well.
In any case, we could keep the current implementation and see if it works for the world, fingers crossed.

euronion · 2022-06-17T07:33:06Z

@davide-f You're good to give it a try!

Regarding your comment:
I had a look at the code. If I understand the intention behind your comment correctly (optimising the retrieval), it might be easier to implement a heuristic than to add a parameter: calculate the number of points being retrieved (np.prod([len(v) for k,v in request.items()])) and adjust the request automatically so that it safely stays below the size at which CDSAPI breaks.
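A rough sketch of such a size estimate, using only the standard library. The key names and the helper `estimate_request_items` are assumptions for illustration; the size at which CDSAPI fails is not stated in this thread, so no threshold is included:

```python
from math import prod


def estimate_request_items(
    request, keys=("variable", "year", "month", "day", "time")
):
    """Rough count of retrieved fields: product of the list lengths in ``request``.

    Scalars count as length 1, so a string such as "2013" is not measured
    by its number of characters. The spatial extent is ignored here.
    """

    def n_values(value):
        return len(value) if isinstance(value, (list, tuple)) else 1

    return prod(n_values(request[k]) for k in keys if k in request)


# Illustrative one-month request (values are assumptions for the example).
request = {
    "variable": ["runoff", "2m_temperature"],
    "year": "2013",
    "month": ["01"],
    "day": [f"{d:02d}" for d in range(1, 32)],
    "time": [f"{h:02d}:00" for h in range(24)],
}
n_items = estimate_request_items(request)  # 2 * 1 * 1 * 31 * 24
```

Note that a plain `len(v)` over all request entries would count the characters of string values, which is why scalars are treated as a single value here.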

If it works for you @davide-f and the time it takes is acceptable (please report it as well if you can) then I'd stay away from overoptimising this aspect and just keep the monthly retrieval.

davide-f · 2022-06-18T06:15:03Z

@euronion the branch is running :) I'll track it and update you as soon as I have news.
Just as a comment, I had to do a few tests that were interrupted. Since Copernicus lowers the priority of a user's requests the more that user uses the service, this may lead to a slight overestimation of the total expected time, though I don't think it is an issue.

I totally agree with checking whether the monthly retrieval works fine and how long it takes. I fear that it may take a very long time, though. I'll notify you as soon as I have news :)

davide-f · 2022-06-18T09:27:30Z

I confirm that the first 1-month chunk has been downloaded. I'll be waiting for the entire procedure to end and let you know :)

davide-f · 2022-06-20T21:27:33Z

@euronion The procedure for the world (±180° lat/lon) completed successfully in 5 to 12 hours (I ran it twice) and produced an output file of 380 GB (large, but we are talking about a lot of data); see the settings below.

atlite:
  nprocesses: 4
  cutouts:
    # geographical bounds automatically determined from countries input
    world-2013-era5:
      module: era5
      dx: 0.3  # cutout resolution
      dy: 0.3  # cutout resolution
      # Below customization options are dealt in an automated way depending on
      # the snapshots and the selected countries. See 'build_cutout.py'
      time: ["2013-01-01", "2014-01-01"]  # specify different weather year (~40 years available)
      x: [-180., 180.]  # manual set cutout range
      y: [-180., 180.]    # manual set cutout range

As a side note, the following warning was raised, in case you are interested in silencing it:

To avoid creating the large chunks, set the option
    >>> with dask.config.set(**{'array.slicing.split_large_chunks': True}):
    ...     array[indexer]
  return self.array[key]
/home/davidef/miniconda3/envs/pypsa-africa/lib/python3.10/site-packages/xarray/core/indexing.py:1228: PerformanceWarning: Slicing is producing a large chunk. To accept the large
chunk and silence this warning, set the option
    >>> with dask.config.set(**{'array.slicing.split_large_chunks': False}):
    ...     array[indexer]

The output also makes sense. However, it has some weird white bands; I don't think this is related to this PR. What do you think?

davide-f · 2022-06-20T21:31:57Z

As discussed, for efficiency it may be interesting to let the user choose the number of chunks into which the download is divided.
Since the world-scale download worked, we could specify the number of chunks as a number between 1 and 12 and divide the blocks by months, e.g. 4 chunks: months 1-3, 4-6, 7-9 and 10-12.
For small amounts of data, it may be more efficient to download everything in one go; for Africa or Europe, for example, there is no need to split the data. Still, this is a detail as long as it works.
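The month grouping described above could be sketched as follows. The function `month_chunks` is hypothetical and only illustrates the proposal; it is not part of atlite:

```python
def month_chunks(n_chunks):
    """Split months 1..12 into ``n_chunks`` contiguous groups of near-equal size."""
    if not 1 <= n_chunks <= 12:
        raise ValueError("n_chunks must be between 1 and 12")
    months = list(range(1, 13))
    # Distribute the remainder over the first groups so sizes differ by at most one.
    base, extra = divmod(12, n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        size = base + (1 if i < extra else 0)
        chunks.append(months[start:start + size])
        start += size
    return chunks


quarters = month_chunks(4)  # [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
```

With `n_chunks=1` the whole year is requested in one go, and with `n_chunks=12` the behaviour matches the monthly retrieval of this PR.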

euronion · 2022-07-15T11:58:47Z

Think about heuristic to download in smaller/larger chunks depending on data geographical scope to download
Add note to documentation on how to compress cutouts

I attempted to compress cutouts during/after creation, but without much success. Using the zlib integration of xarray, the compressed cutouts unfortunately always increased in size (rather than decreasing). Using native netCDF tools, compressing cutouts to 30-50% of their original size is possible without impact on atlite performance. I want to add notes on this to the documentation with this PR, as this allows for larger cutouts.

I would have preferred a solution where compression is done by atlite directly, but it seems like that does not work well using xarray.

codecov-commenter · 2022-09-06T10:35:24Z

Codecov Report

Patch coverage: 91.66% and project coverage change: -0.09 ⚠️

Comparison is base (f9bd7fd) 72.83% compared to head (d9f3bff) 72.74%.

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #236      +/-   ##
==========================================
- Coverage   72.83%   72.74%   -0.09%     
==========================================
  Files          19       19              
  Lines        1590     1596       +6     
  Branches      227      270      +43     
==========================================
+ Hits         1158     1161       +3     
- Misses        362      363       +1     
- Partials       70       72       +2

Impacted Files	Coverage Δ
atlite/datasets/era5.py	`88.23% <88.88%> (-1.70%)`	⬇️
atlite/data.py	`86.36% <100.00%> (+0.31%)`	⬆️


euronion · 2022-09-06T11:43:16Z

@davide-f If you wish to reduce the file size you can follow the instructions in the updated doc:

https://github.com/PyPSA/atlite/blob/230aa8a5b1b21bff8f03d23631f01e6ebf5d83b3/examples/create_cutout.ipynb

Should save ~50% :)

euronion · 2022-09-06T12:15:55Z

Month indicator has been added. For example, the info prompt during cutout creation now looks like this, indicating the month currently being retrieved:

2022-09-06 14:14:27,779 INFO CDS: Downloading variables
         * runoff (2012-12)

euronion · 2022-09-06T12:19:58Z

I suggest we offload the heuristic into a separate issue and tackle it if necessary. ATM I think it would be a nice but unnecessary feature.

euronion · 2022-09-06T12:20:11Z

RTR @FabianHofmann would you?

fneum

Tested by @nworbmot

euronion · 2023-04-04T09:07:16Z

No idea why the CI keeps failing (no issues locally) and why it keeps using the old CI.yaml with Python 3.8 instead of 3.11.

euronion and others added 3 commits May 11, 2022 18:16

Update era5.py

61c1008

Address memory errors for writing large cutouts.

529d841

[pre-commit.ci] auto fixes from pre-commit.com hooks

eb32d51

for more information, see https://pre-commit.ci

euronion assigned FabianHofmann and euronion May 31, 2022

euronion marked this pull request as ready for review May 31, 2022 12:08

Add month being retrieved to informative output during cutout.prepare().

19c213d

euronion added type: bug status: in progress priority: medium labels Jun 23, 2022

Add note on how to compress cutouts from terminal

4a2c15d

euronion added 3 commits September 6, 2022 13:31

Update doc-string for monthly retrieval

2a695b4

Merge branch 'master' into feat/era5-monthly-retrieveal

058af13

Update RELEASE_NOTES.rst

230aa8a

euronion mentioned this pull request Sep 6, 2022

Add heuristic for ERA5 download chunk sizes #252

Open

1 task

FabianHofmann added 3 commits October 11, 2022 09:32

Merge branch 'master' into feat/era5-monthly-retrieveal

0d67ea0

update release notes

793b73f

era5: ensure correct info print out format for data retrieval

6ea7023

fneum approved these changes Mar 9, 2023


euronion added 7 commits March 13, 2023 10:10

Merge branch 'master' into feat/era5-monthly-retrieveal

1dbf8a4

Merge branch 'master' into feat/era5-monthly-retrieveal

301aedd

Update doc to refer to new compression feature.

24e8401

Fix nan issue with encoding in xarray/netcdf

39e7f0e

Use nicer Python syntax

ad3d01a

Increase default compression level

3033d49

Address bug in xarray encoding for ERA5 data

eadd583

Merge branch 'master' into feat/era5-monthly-retrieveal

d9f3bff

euronion merged commit 3a6f543 into master Apr 5, 2023

fneum mentioned this pull request Aug 13, 2024

Dramatically different download speeds between versions #371

Closed

2 tasks

coroa mentioned this pull request Nov 1, 2024

fix: Skip previous encoding workaround for fixed xarray versions #401

Merged

5 tasks
