Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Streamline observations download #1657

Merged
merged 222 commits into from Feb 25, 2022
Merged

Streamline observations download #1657

merged 222 commits into from Feb 25, 2022

Conversation

jvegreg
Copy link
Contributor

@jvegreg jvegreg commented May 13, 2020

The aim of this pull request is to streamline the procedure to download and reformat observational datasets for ESMValTool. I do not aim to have automated downloads for every dataset we are currently supporting, but to provide the framework and a good number of examples of implementation so it becomes easy to implement in the near future.

In particular, I have created three basic downloader classes for HTPPS (using wget), FTP (using the python library) and Climate Data Store (using cdsapi) downloads along with some example implementations for some datasets. I also provide some derived classes to simplify getting dataset from specific sources, like the NASA storage or the ESA-CCI CEDA anonymous FTP.

Regarding the interface, I plan to use the reworked CLI interface in ESMValGroup/ESMValCore#605 to refactor it so we get something similar to:

esmvaltool data prepare DATASET1,DATASET2 --start 1990 --end 2015
esmvaltool data download DATASET1,DATASET2 --start 1990 --end 2015
esmvaltool data format DATASET1,DATASET2 --start 1990 --end 2015

where prepare will download and format the data in one go. Hopefully, this will make life easier for those user that do not have access to JASMIN nor DKRZ.

Tasks

  • Create an issue to discuss what you are going to do, if you haven't done so already (and add the link at the bottom)
  • Give this pull request a descriptive title that can be used as a one line summary in a changelog
  • Add documentation
  • Make sure your code is composed of functions of no more than 50 lines and uses meaningful names for variables
  • Circle/CI tests pass. Status can be seen below your pull request. If the tests are failing, click the link to find out why.
  • Preferably Codacy code quality checks pass, however a few remaining hard to solve Codacy issues are still acceptable. Status can be seen below your pull request. If there is an error, click the link to find out why. If you suspect Codacy may be wrong, please ask by commenting.
  • Please use yamllint to check that your YAML files do not contain mistakes
  • (Only if really necessary) Add any additional dependencies needed for the diagnostic script to setup.py, esmvaltool/install/R/r_requirements.txt or esmvaltool/install/Julia/julia_requirements.txt (depending on the language of your script) and also to package/meta.yaml for conda dependencies (includes Python and others, but not R/Julia)
  • If new dependencies are introduced, check that the license is compatible with Apache2.0

@bouweandela
Copy link
Member

Cool idea @jvegasbsc! Maybe we could choose a different name than obs or observations, because not everything we cmorize are observations? See also #1407.

I'm not too sure about inclusion as a subcommand in the esmvaltool command, because that is part of ESMValCore while the CMORizers live in ESMValTool.

@jvegreg
Copy link
Contributor Author

jvegreg commented May 13, 2020

I'm not too sure about inclusion as a subcommand in the esmvaltool command, because that is part of ESMValCore while the CMORizers live in ESMValTool.

I think the discoverability we will get with a single entry point is worth it. But my idea will be to find a way to allow ESMValTool to add commands to the interface without ESMValCore knowing anything about them, like a plug-in or similar

@valeriupredoi
Copy link
Contributor

@jvegasbsc this is very cool! But I have an important question (maybe @mattiarighi can also shed more light on it too): what are the licensing troubles we may get into if some random user downloads Tier 3 data via ESMValTool and starts redistributing it against the rules?

@jvegreg
Copy link
Contributor Author

jvegreg commented May 13, 2020

what are the licensing troubles we may get into if some random user downloads Tier 3 data via ESMValTool and starts redistributing it against the rules?

In that case, the illegal part is to redistribute, not to download. We are not supporting that part, just the downloading procedure, which is completely fine

@valeriupredoi
Copy link
Contributor

what are the licensing troubles we may get into if some random user downloads Tier 3 data via ESMValTool and starts redistributing it against the rules?

In that case, the illegal part is to redistribute, not to download. We are not supporting that part, just the downloading procedure, which is completely fine

there is a fine line between download and redistribution - we should make sure that the tool doesn't write any freshly downloaded Tier 3 files to any location before CMORizing them but rather downloads and CMORizes them on the fly; there is also the issue of clogging the disk if the tool keeps the originally downloaded files 🍺

@jvegreg
Copy link
Contributor Author

jvegreg commented May 14, 2020

I'm not too sure about inclusion as a subcommand in the esmvaltool command, because that is part of ESMValCore while the CMORizers live in ESMValTool.

I found a way to allow adding subcommands in esmvaltool without ESMValCore knowing anything about it. It is one of the ways that pytest uses to discover plugins.

Step 1: define a esmvaltool_commands entry point in setup.py:

'esmvaltool_commands': [
    'data = esmvaltool.cmorizers.obs.cmorize_obs:DataCommand'
]

In ESMValCore, scan for 'esmvaltool_commands' entry points and add the classes to the Fire interface:

for ep in iter_entry_points('esmvaltool_commands'):
    if hasattr(self, ep.name):
        logger.error('Registered command %s already exists', ep.name)
        continue
    self.__setattr__(ep.name, ep.load())

so we can decouple completely ESMValCore from this.

@bouweandela
Copy link
Member

Would it be an idea to add a command like

esmvaltool download recipe_example.yml

that would download and reformat all data needed to run recipe_example.yml? That way users would not need to bother with figuring out what data is needed exactly to run a recipe.

Maybe that could be simplified even further to

esmvaltool run --download recipe_example.yml

to combine all steps of obtaining the data, formatting it and running into a single command.

@zklaus
Copy link
Contributor

zklaus commented Feb 23, 2022

@remi-kazeroni, nice work! I'll have very few minor comments, but regarding the reverting of the linter configs: Absolutely, do revert them. In this branch, they were only intended to help filter the barrage of codacy problems so that we could see what really needs addressing.

Medium priority issues automation moved this from Review in progress to Reviewer approved Feb 24, 2022
@zklaus
Copy link
Contributor

zklaus commented Feb 24, 2022

Stellar work, @remi-kazeroni !

@remi-kazeroni
Copy link
Contributor

Thanks a lot @zklaus! All credits should go to @jvegreg for doing all the coding! 🍻

But I still wanted to address your final comment on the timerange selection. I agree the tool should return an error if the time range selection is wrong (wrong argument in the cli). The current assumption that a wrong timerange leads to the downloading/formatting of the whole datasets is not ideal in my opinion.

Please nobody merge now 😆

Medium priority issues automation moved this from Reviewer approved to Review in progress Feb 24, 2022
Copy link
Contributor

@zklaus zklaus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Time range issue as described by @remi-kazeroni

@remi-kazeroni
Copy link
Contributor

I have done some further tests on possible misuse of the time ranges arguments (--start, --end). For example, what happens with --start=1250? The download does not start but the error message is not uniform across datasets. This depends on the data source, the file naming (year included in the file name, folder name,...). Or what happens if a timerange is specified for a dataset where this is not needed (e.g. just one global file and no yearly files)? In these cases, the global files are downloaded.

In short: there may be room for further enhancements but I would leave that for future PRs since I have not found any problematic cases with the current handling of the time range.

@remi-kazeroni
Copy link
Contributor

Not sure how to display the codacy report of the second to last commit, but the number of issue was 7 without any problematic issue left. The number grew to 310 after reverting the changes on .pylintrc and .prospector.yml made by @zklaus to help with the handling of Codacy issues.

To be able to merge this, I guess we just need a final approval from @zklaus and coordinate the merge button hitting with @schlunma (I'd need a bit of time to draft a message to open a Github discussion about the changes related to the new cmorizer interface).

Thanks to all who contributed, reviewed, tested!! 🥳 🍻

@schlunma
Copy link
Contributor

Sounds great, @remi-kazeroni 👍 I will merge this PR shortly after you opened the discussion.

Thanks to everyone who contributed to this, this is an awesome addition 🎉

esmvaltool/cmorizers/data/cmorizer.py Outdated Show resolved Hide resolved
@zklaus
Copy link
Contributor

zklaus commented Feb 25, 2022

Not sure how to display the codacy report of the second to last commit, [...]

I'm not sure it's possible with Codacy. In principle, you can click on the little icon in front of the commit hash in the listing above (a green check mark, a red x, or a yellow dot for successful, failed, and ongoing checks, respectively). This will pop up a little dialog where you can click on details to see the specifics of a particular check. Have a look at the following screenshot.
commit-details

The problem with Codacy is that it seems to show only the details for the last run, no matter which one you select. That's a bug in Codacy, I think.

Medium priority issues automation moved this from Review in progress to Reviewer approved Feb 25, 2022
@zklaus
Copy link
Contributor

zklaus commented Feb 25, 2022

Great, thanks to everyone, particularly @remi-kazeroni for pushing this over the finish line! 🎉

There certainly is room for improvement, but as you said, let's take that up later.

@remi-kazeroni
Copy link
Contributor

Thanks! @schlunma: could you please merge this now? It would help me to get the right links in the documentation to write the discussion.

@schlunma schlunma merged commit 5aef74c into main Feb 25, 2022
Medium priority issues automation moved this from Reviewer approved to Done Feb 25, 2022
@schlunma schlunma deleted the refactor_downloads branch February 25, 2022 10:34
November 2020 automation moved this from Needs Review to Merged Feb 25, 2022
@valeriupredoi
Copy link
Contributor

@remi-kazeroni and the Streamliners band, legends! 🍺

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Development

Successfully merging this pull request may close these issues.

None yet

7 participants