Support per dataset DRS #494

Closed
bouweandela opened this issue Feb 14, 2020 · 19 comments
@bouweandela
Member

Now that we have support for reading the ERA5 dataset in its native format through the native6 project (#447), it would be good to add support for a per-dataset DRS, so that other datasets (which will typically come with their own DRS) can also be supported using the native6 project.

@bouweandela bouweandela added enhancement New feature or request EMAC labels Feb 14, 2020
@valeriupredoi
Contributor

In config-developer.yml, or are you thinking of something fancier so we don't overcrowd the file? 🍺

@bouweandela
Member Author

Yes, just in config-developer.yml

@bouweandela
Member Author

Related to #641

@bsolino
Contributor

bsolino commented Apr 23, 2021

I made this suggestion for how the config-developer file could look in #970 (comment), and I think it is relevant to this discussion as well:

native6:
  default:
    input_dir:
      default: 'Tier{tier}/{dataset}/{latestversion}/{frequency}/{short_name}'
      BSC: '{project}/{dataset}_{version}/{area}_{frequency}_{grid}'
      RCAST: '...'
    input_file:
      default: '*.nc'
      BSC: '*.nc'
      RCAST: '*.nc'
  ICON:
    input_dir:
      default: '{model_version}_{model_component}_{experiment}_{grid}_{id}'
    input_file:
      default: '{model_version}_{model_component}_{experiment}_{grid}_{id}_{var_type}*.nc'
  EMAC:
    ...
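
A minimal sketch of how such a nested layout might be looked up, assuming the YAML above has been loaded into a plain dictionary. The function name `resolve_template`, the `site` argument, and the fallback order (dataset entry first, then the project-wide `default` entry; site label first, then `default`) are assumptions for illustration, not existing ESMValCore behaviour. The `RCAST` and `EMAC` entries are left out because their values are elided above.

```python
# Hypothetical in-memory form of the proposed config-developer layout.
CONFIG = {
    "native6": {
        "default": {
            "input_dir": {
                "default": "Tier{tier}/{dataset}/{latestversion}/{frequency}/{short_name}",
                "BSC": "{project}/{dataset}_{version}/{area}_{frequency}_{grid}",
            },
            "input_file": {"default": "*.nc", "BSC": "*.nc"},
        },
        "ICON": {
            "input_dir": {
                "default": "{model_version}_{model_component}_{experiment}_{grid}_{id}",
            },
            "input_file": {
                "default": "{model_version}_{model_component}_{experiment}_{grid}_{id}_{var_type}*.nc",
            },
        },
    },
}


def resolve_template(config, project, dataset, site, key):
    """Return the template for `key` ('input_dir' or 'input_file').

    Assumed lookup order: a dataset-specific entry wins over the
    project-wide 'default' entry, and a site-specific template wins
    over that entry's 'default' template.
    """
    project_cfg = config[project]
    dataset_cfg = project_cfg.get(dataset, project_cfg["default"])
    templates = dataset_cfg[key]
    return templates.get(site, templates["default"])


# ICON has its own entry, so the BSC layout of the generic entry is not used.
icon_dir = resolve_template(CONFIG, "native6", "ICON", "BSC", "input_dir")
# ERA5 has no entry of its own, so it falls back to the generic BSC layout.
era5_dir = resolve_template(CONFIG, "native6", "ERA5", "BSC", "input_dir")
```

The two keys (dataset and site) are kept at separate nesting levels here, which is the property this layout is meant to provide.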

@zklaus
Contributor

zklaus commented May 18, 2021

@Peter9192 has withdrawn his initial attempt in #970. @bsolino could you pick this up? Perhaps @senesis could help?

@senesis
Contributor

senesis commented May 19, 2021

There is an ambiguity in the comments above about what "a dataset" means, as this can be:

  • either the initial meaning, such as the ERA5 dataset or a given model (that is, the value under the 'dataset' key of an entry of the datasets section of a recipe)
  • or an entry of the datasets section of a recipe

My feeling is that one should be able to redefine both the DRS label and the rootpath in each datasets section entry (i.e. change the default values that are derived from the config-user file).

This would avoid changing the structure of the project entry in the config-developer file: we would just have to list all possible DRS without grouping by model. Using the model name in DRS labels could be helpful.

@zklaus zklaus added this to the v2.3.0 milestone May 20, 2021
@senesis
Contributor

senesis commented May 21, 2021

@Peter9192 has withdrawn his initial attempt in #970. @bsolino could you pick this up? Perhaps @senesis could help?

I can help. My proposal would be as follows.
In the config-developer file, e.g.:

native6:
  cmor_strict: false
  input_dir:
    default: 'Tier{tier}/{dataset}/{latestversion}/{frequency}/{short_name}'
    BSC: '{project}/{dataset}_{version}/{area}_{frequency}_{grid}'
    ICON_DRS: '{model_version}_{model_component}_{experiment}_{grid}_{id}'
    IPSL_A: '{account}/{model}/{status}/{exp}/{simulation}/{igcm_dir}/Analyse/{freq}'
    IPSL_O: '{account}/{model}/{status}/{exp}/{simulation}/{igcm_dir}/Output/{freq}'
  input_file:
    default: '*.nc'
    BSC: '*.nc'
    ICON_DRS: '{model_version}_{model_component}_{experiment}_{grid}_{id}_{var_type}*.nc'
    IPSL_A: '{simulation}_*_{file_var_name}.nc'
    IPSL_O: '{simulation}_*_{varname_in_filename}.nc'
  output_file: '{project}_{dataset}_{type}_{version}_{mip}_{short_name}'
  cmor_type: 'CMIP6'
  cmor_default_table_prefix: 'CMIP6_'

and in the config-user file, e.g. :

drs:
  native6: ICON_DRS

and in any recipe, e.g. :

datasets:
  - {dataset: ICON,  project: native6, exp: historical,  ...}
  - {dataset: IPSL,  project: native6, exp: historical,  ..., DRS: IPSL_A}
  - {dataset: IPSL,  project: native6, exp: hist_alt  ,  ..., DRS: IPSL_O}
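
The precedence implied by this proposal can be sketched as follows. This is only an illustration under stated assumptions: the function names `select_drs` and `input_dir_template` are hypothetical, and falling back to the `default` label when config-user sets nothing for the project is an assumption, not something stated above.

```python
# Subset of the proposed config-developer entry, as a plain dictionary.
CONFIG_DEVELOPER = {
    "native6": {
        "input_dir": {
            "default": "Tier{tier}/{dataset}/{latestversion}/{frequency}/{short_name}",
            "BSC": "{project}/{dataset}_{version}/{area}_{frequency}_{grid}",
            "ICON_DRS": "{model_version}_{model_component}_{experiment}_{grid}_{id}",
            "IPSL_A": "{account}/{model}/{status}/{exp}/{simulation}/{igcm_dir}/Analyse/{freq}",
            "IPSL_O": "{account}/{model}/{status}/{exp}/{simulation}/{igcm_dir}/Output/{freq}",
        },
    },
}

# The `drs` section of the proposed config-user file.
CONFIG_USER_DRS = {"native6": "ICON_DRS"}


def select_drs(entry, user_drs):
    """Pick the DRS label for one recipe dataset entry.

    Assumed precedence: a 'DRS' key in the recipe entry overrides the
    per-project label from config-user, which overrides 'default'.
    """
    if "DRS" in entry:
        return entry["DRS"]
    return user_drs.get(entry["project"], "default")


def input_dir_template(entry, developer_cfg, user_drs):
    """Return the input_dir template selected for a recipe entry."""
    label = select_drs(entry, user_drs)
    return developer_cfg[entry["project"]]["input_dir"][label]


icon = {"dataset": "ICON", "project": "native6", "exp": "historical"}
ipsl = {"dataset": "IPSL", "project": "native6", "exp": "historical", "DRS": "IPSL_A"}

icon_label = select_drs(icon, CONFIG_USER_DRS)  # label comes from config-user
ipsl_label = select_drs(ipsl, CONFIG_USER_DRS)  # label comes from the recipe entry
```

Note that the recipe-level override is exactly the part that ties a recipe to one machine's directory layout.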

@bsolino
Contributor

bsolino commented May 25, 2021

@senesis Your suggestion has the same issue that I was trying to address in my proposal, in that it mixes, at the same level, datasets (e.g. ICON, IPSL) and machine-specific directory structures (e.g. BSC, DKRZ). Each machine could have each project stored in a different way, and I believe the purpose of the DRS is to make the data easily accessible to users of those machines by letting them set a DRS for each project: just by setting the DRS in config-user.yml, a user can locate all the project data stored on the machine they are using.

Note that I'm having a bit of a terminology issue here, because I am still unsure about the difference between the concepts of a "dataset" and a "project".

Furthermore, your suggestion seems to be coupling recipes with the directory structures in the machines. That would mean that it could be necessary to modify each recipe every time you want to execute it on a different machine.

The way I see it, we need two keys to identify the directory structure: one that refers to the machine and one that refers to the dataset. So, the configuration file should be set up in a way that makes it possible to combine those keys.

@bsolino
Contributor

bsolino commented May 25, 2021

There is an ambiguity in the comments above about what "a dataset" means, as this can be:

* either the initial meaning, such as the ERA5 dataset or a given model (that is, the value under the 'dataset' key of an entry of the datasets section of a recipe)

* or an entry of the datasets section of a recipe

I understand that this refers to the {dataset} key in the default route, is that correct?

I think that key exists because the original intent of native6 was to set up the data in a way that the directory structure of each dataset would be identical, and so datasets could be differentiated by the folder in which they were located. If we move forward with this idea of allowing different directory structures for each dataset, perhaps that key could become unnecessary.

I believe the original idea was from @bouweandela. Bouwe, do you have any specific opinion on this?

@Peter9192
Contributor

and in the config-user file, e.g. :

drs:
  native6: ICON_DRS

So how would that work if I want to use both ERA5 and ICON data through the native6 project?

@senesis
Contributor

senesis commented May 26, 2021

So how would that work if I want to use both ERA5 and ICON data through the native6 project?

That way

config-user.yml

drs:
  native6: ICON_DRS

recipe :

datasets:
  - {dataset: ICON,  project: native6, exp: historical,  ...}
  - {dataset: ERA5,  project: native6, exp: historical,  ..., DRS: ERA5_DRS}

That feature is useful as soon as a given model may come with two DRS (for instance, one experiment is in a shared CMIP6 data store, another one is stored in a private location). My proposal is that the rootpath can also be specified in datasets entries.

@bouweandela
Member Author

That feature is useful as soon as a given model may come with two DRS (for instance, one experiment is in a shared CMIP6 data store, another one is stored in a private location)

That is a different issue, let's discuss it here: #129

There is an ambiguity in the comments above about what "a dataset" means, as this can be:

  • either the initial meaning, such as the ERA5 dataset or a given model (that is, the value under the 'dataset' key of an entry of the datasets section of a recipe)
  • or an entry of the datasets section of a recipe

I believe the original idea was from @bouweandela. Bouwe, do you have any specific opinion on this?

With dataset I meant the first option mentioned here.

My feeling is that one should be able to redefine both the DRS label and the rootpath in each datasets section entry (i.e. change the default values that are derived from the config-user file).

As pointed out by @bsolino, the idea behind using a DRS is that recipes are machine-independent, so specifying this in the recipe defeats the purpose of using a DRS.

@bsolino
Contributor

bsolino commented Jun 1, 2021

@bouweandela: I meant to ask something else, but I think I didn't make my question very clear. I was wondering how to deal with the ambiguity of the {dataset} key, given that we are going to support different DRS for the datasets. It seems to me that we will have to rethink the original idea of distinguishing the datasets by their folder.

The main options I can see right now (there may be more that I'm missing) are:

  1. Keep the default as is, only use it if the dataset doesn't have an explicit DRS. New datasets could then be added by the current method. However, "dataset" would take a double meaning, both a key in the config file and a part of the folder structure. I'm not sure that having that double meaning is a good option.
  2. Remove "{dataset}" from the default path. Unfortunately, I'm not sure if it would be possible to do this without rendering the idea of the default path useless.

In summary, I'm not sure how to move forward and avoid the ambiguity without clashing with what has already been done.

@senesis
Contributor

senesis commented Jun 2, 2021

In my opinion, we are complicating things needlessly here by trying to put multiple datasets (e.g. ERA5 and ICON) in one single project (native6).
This is a very different situation from the basic design of the config files, i.e. for projects CMIP5 and CMIP6, in which all datasets share the same DRS on a given machine, this DRS being configured at the config-user level.
Here, we want to handle data sources (e.g. one model or one observational dataset) which each have their own set of attributes/facets and their own DRS.
This means that these data sources look very much like projects; as a matter of fact, defining a project for each allows handling it without any code change. The only drawback is that it populates both the config-developer file and the _fixes directory (rather than populating only the _fixes/native6 directory). I think that this is a very light price to pay compared with complicating the config file structure and the code.
Put otherwise: as soon as a dataset needs a specific DRS (or rather, as soon as the need for a specific DRS is not linked to a change of machine), such a dataset is really what I call a 'data source', which is best handled in ESMValTool as a project.

@zklaus
Contributor

zklaus commented Jun 7, 2021

I am not sure if we still want this; we seem to be able to achieve most goals by using projects instead, and @Peter9192's PR on the topic (#970) is now abandoned. I'll move this to 2.4.0, but we may well just close it in the end.

@zklaus zklaus modified the milestones: v2.3.0, v2.4.0 Jun 7, 2021
@senesis
Contributor

senesis commented Jun 7, 2021

We seem to be able to achieve most goals by using projects instead ... we may well just close it in the end

I agree, provided that #129 is implemented, as @bouweandela suggested.

@bsolino
Contributor

bsolino commented Jun 9, 2021

The only drawback is that it populates both the config-developer file and the _fixes directory (rather than populating only the _fixes/native6 directory). I think that this is a very light price to pay compared with complicating the config file structure and the code.

That's an excellent point. I agree that it would be too much for too little gain. And there are other avenues to solve the issues.

For starters, I'm not even sure it is a drawback in the config-developer file, as the total number of lines added would be quite similar, the difference being mostly organizational. That organization could be achieved by other means, starting with something as simple as comments in the configuration file.

There could also be issues with the config-user file getting crowded, but I think there might be a better way to approach that. I think I had better open a new issue for it.

EDIT: Opened in #1165

@bouweandela
Member Author

Indeed, let's just create a new project per model that uses a custom DRS, until we have too many projects (e.g. > 10) and we need to come up with something smarter.
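
A rough sketch of what "one project per model" means for the lookup, assuming the config files are available as plain dictionaries. The project name `ICON`, the function `input_dir`, and all facet values below are hypothetical and only serve to show that no dataset-level nesting is needed: each project carries its own DRS templates, and the existing per-project `drs` setting selects one.

```python
# Hypothetical config-developer content: one entry per project,
# each with its own input_dir templates keyed by DRS label.
PROJECTS = {
    "native6": {
        "input_dir": {
            "default": "Tier{tier}/{dataset}/{latestversion}/{frequency}/{short_name}",
        },
    },
    "ICON": {
        "input_dir": {
            "default": "{model_version}_{model_component}_{experiment}_{grid}_{id}",
        },
    },
}

# Hypothetical config-user `drs` section: one label per project.
USER_DRS = {"native6": "default", "ICON": "default"}


def input_dir(entry):
    """Fill in the input_dir template of the entry's project.

    The recipe entry itself carries no DRS information; the label
    comes only from the per-project user setting.
    """
    project = entry["project"]
    label = USER_DRS.get(project, "default")
    template = PROJECTS[project]["input_dir"][label]
    # Extra facets that the template does not use are ignored by format().
    return template.format(**entry)


# Illustrative facet values only.
icon_entry = {
    "project": "ICON",
    "dataset": "ICON",
    "model_version": "v1",
    "model_component": "atm",
    "experiment": "historical",
    "grid": "R2B5",
    "id": "0001",
}
icon_path = input_dir(icon_entry)
```

With this arrangement the recipe stays machine-independent, at the cost of one config-developer entry per model with a custom DRS.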

@zklaus
Contributor

zklaus commented Jun 17, 2021

Ok, seems we are agreed. Closing this for now, please, anybody, feel free to reopen if you want to discuss this some more.
