What about double-counting variables #224

Closed
goord opened this issue Aug 31, 2018 · 41 comments
Labels
question release 1.0 release which is ready for starting CMIP6 runs

Comments

@goord
Collaborator

goord commented Aug 31, 2018

It may happen that variables can be produced by more than one component (especially in the case of tm5-ifs or lpjg-ifs). We should come up with a mechanism to give precedence to certain models for certain variables.

@goord goord added the question label Aug 31, 2018
@goord
Collaborator Author

goord commented Aug 31, 2018

Currently, a prioritization is made based upon the realms (see output of taskloader)

@tommibergman
Collaborator

tommibergman commented Oct 31, 2018

For the TM5-IFS part, we would mostly like TM5 to take precedence for tables AER* and IFS for tables A*. I am sure there are a few exceptions, but this would be a first-order suggestion.

Precedence of IFS over TM5 is true especially for the meteorological variables (these are mainly in Amon), since anyone using the data can always regrid to lower resolution.

@tommibergman
Collaborator

We decided to produce a file listing the double-counted variables, with rules on which component should produce each variable in which case. The format is: variable name, table, and a list of components in order of preference. So, for example, a line
pfull AERmon [tm5,ifs]
would mean that the pfull variable for the AERmon table is produced by tm5 if tm5 is present, and otherwise by ifs.
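
A minimal sketch of how such a preference line could be read and applied, assuming the whitespace-separated layout of the example line. The names `parse_preference_line` and `pick_component` are hypothetical and not part of ece2cmor3.

```python
import re

# Assumed layout of one line: "<variable> <table> [comp1,comp2,...]"
LINE_RE = re.compile(r"^(\S+)\s+(\S+)\s+\[([^\]]*)\]$")


def parse_preference_line(line):
    """Return (variable, table, ordered component list) for one preference line."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError("Cannot parse preference line: %r" % line)
    variable, table, components = match.groups()
    return variable, table, [c.strip() for c in components.split(",") if c.strip()]


def pick_component(preferred, active):
    """Return the first preferred component that is active, or None if none is."""
    for component in preferred:
        if component in active:
            return component
    return None


variable, table, preferred = parse_preference_line("pfull AERmon [tm5,ifs]")
with_tm5 = pick_component(preferred, {"ifs", "tm5", "nemo"})  # tm5 is present
without_tm5 = pick_component(preferred, {"ifs", "nemo"})      # falls back to ifs
```

Returning `None` when no listed component is active leaves it to the caller to decide whether that is an error or simply a variable to skip.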

Actually, the table column could also be a list, since several tables (but not all) can share the same preference. Or what do others think?

Attached is a list for TM5
double-counting.txt

@goord
Collaborator Author

goord commented Jan 31, 2019

It should also be noted that the user will have to give the 'model configuration' (i.e. the list of components) that has produced the data, even though one is only cmorizing variables for one component at a time...

@goord
Collaborator Author

goord commented Feb 7, 2019

Hi @tommibergman and @treerink, after some thought I came to the following conclusion: it may be more appropriate to write a separate script that splits the input data request into json variable list files according to EC-Earth component. In this way, it becomes more traceable and transparent which variables are produced by which component; the lists can even be archived or put under version control with the model configuration files. This script will of course make use of the preferences file proposed above.

@treerink
Collaborator

treerink commented Feb 7, 2019

@goord would the same idea be possible, but with these component json files merged again into one json file at the end, for each MIP experiment and a given ece model configuration? This makes the archiving more compact, and also the cmorisation more straightforward, because otherwise one has to specify several jsons and pick the right ones when cmorising. Or does this break your idea?

@goord
Collaborator Author

goord commented Feb 7, 2019

@treerink we can also make a single json file with an extra level denoting the components, e.g.

{
 "ifs": {"Amon": ["ua", "va", "tos"]},
 "nemo": {"Omon": ["sos", "tos"], "3hr": ["tos"]}
}

If one specifies such a json, it can be crystal clear for the task loader and the user which variables will be omitted when processing for a single component.
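
A minimal sketch of how a loader could use this extra component level, assuming the components and tables are JSON objects (component -> table -> variable list). The helper `split_for_component` is hypothetical and only illustrates reporting what is selected and what is omitted for one component.

```python
import json

# Assumed layout: component -> table -> list of variable names.
VARLIST = """
{
    "ifs": {"Amon": ["ua", "va", "tos"]},
    "nemo": {"Omon": ["sos", "tos"], "3hr": ["tos"]}
}
"""


def split_for_component(varlist, component):
    """Return (selected, omitted) sets of (table, variable) pairs for one component."""
    selected, omitted = set(), set()
    for comp, tables in varlist.items():
        target = selected if comp == component else omitted
        for table, variables in tables.items():
            target.update((table, var) for var in variables)
    return selected, omitted


varlist = json.loads(VARLIST)
selected, omitted = split_for_component(varlist, "ifs")
```

Keeping the omitted set explicit is what would let the loader tell the user exactly which requested variables are skipped in a single-component run.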

@treerink
Collaborator

treerink commented Feb 7, 2019

Yes, sounds like a plan. So let's try this for #253.

@treerink treerink added the release 1.0 release which is ready for starting CMIP6 runs label Feb 7, 2019
@zklaus
Contributor

zklaus commented Feb 8, 2019

This plan of having one json file for each job sounds good!

But I'd like to comment a bit on what a job is:
First, the use of MIP together with Experiment seems to me to be quite misplaced.
The two are separate entities and there is not a particularly strong connection.
Indeed, I think we can and should essentially ignore MIP now that the experiments are designed.

Wrt "model configuration", in common parlance this does not refer merely to a collection of components, but to what are separate models from the point of view of cmip6, eg EC-Earth-Veg, EC-Earth-CC, etc.

These two things are the only two that we should consider for organizing the json files.
This would give us a directory structure like

<model configuration>/<Experiment>/data-request.json

eg

EC-Earth-Veg/piControl/data-request.json

The data-request.json file should contain all variables requested by all mips, ie it should be based on the file
cmvme_ae.c4.cd.cf.cm.co.da.dc.dy.fa.ge.gm.hi.is.ls.lu.om.pa.pm.rf.sc.si.vi.vo_<Experiment>_3_3.xlsx produced with the -m _all_ switch to drq.

Do you agree?

@treerink
Collaborator

@zklaus actually I was not after setting up any new directory infrastructure for these data request json files; the idea is just to add them to the existing control output subdirectories, so they form a set with the control output files for each experiment.

@zklaus
Contributor

zklaus commented Feb 11, 2019

@treerink fair enough, that should work!
What do you think about the -m all thing?

@treerink
Collaborator

Well, each experiment has its own data request. In some cases (the Core MIP cases) this is a joined data request, because we want to be efficient and run the experiment only once for all the MIPs run by a certain model configuration (EC-Earth3-AOGCM, EC-Earth3-Veg etc.). But genecec accounts for all of this (all these data request files are already generated to produce the control output files, but I did not share them because they are xlsx files), and as soon as we have added the automatic creation of the json data request files this will all be ready for the end user. Note that the only content-wise difference between the xlsx data request files and the json data request files will be that the json ones do not contain variables which are requested but cannot be produced by EC-Earth, i.e. the ones in the ece2cmor3 ignored list (#253).

As a cmorizer you don't need to do anything with drq -m _all_. In fact, in the identification steps in the background in ece2cmor3 I do use such things, but I noted that "all" in the python dreq package is different from "all" in drq, so I actually prefer to explicitly list the MIPs which I need to include.

In fact, it would also be useful to generate a metadata template file for each experiment and add those to the control output subdirectories as well. Several MIP- and experiment-dependent variables could be set by genecec, but the cmorizer always has to modify some things, like the ensemble member label; that is why they will always be called metadata file templates (#214). I am just not sure whether this will be ready in time for the Core MIP cmorization.

@treerink
Collaborator

By the way, an example of how to create an xlsx data request file in the current situation (as long as the json data request files are not there) is given in the step-by-step wiki.

@klauswyser
Collaborator

Please have a look at the newly created issue 615 on the EC-Earth dev portal.

The problem is not only the varlists for the different MIPs that are run with the same model configuration, but also the activity_id that is given by the MIP. Most experiments belong uniquely to one MIP, so this is not a problem, but what to do with the "historical" experiment? How do we make sure that the variables are saved correctly for each MIP?

@treerink
Collaborator

Ok, concerning joined Core MIP experiments, your point is that at the time of cmorising you actually don't want to provide a joined cmorised set of variables, but instead want to split out for each MIP the list of variables requested by that MIP for the experiment, and then provide the correct activity_id, which then becomes obvious (though it is one of the things the cmorizer has to adjust in the metadata template). You might be right that this is the way we have to provide the cmorized data; it will be a painful amount of identical data with just slightly different meta data. Anyway, it means that I then have to produce with genecec for each joined Core MIP experiment a set of json data request files for each MIP one (which is in itself not too difficult I think).

@klauswyser
Collaborator

klauswyser commented Feb 12, 2019

it will be a painful amount of identical data with just slightly different meta data.

Are you sure about that? Do you think the same variable is in the drq for say SIMIP and CMIP? It would be nice if only variables that are exclusively in SIMIP are processed when running ece2cmor with activity_id=SIMIP, but I don't know if this is the case. Otherwise you are right, and the amount of duplicates would be prohibitive. In that case it would be better to process everything with activity_id=CMIP and then just hope that data users find the data that were produced for SIMIP.

it means that I then have to produce with genecec for each joined Core MIP experiment a set of json data request files for each MIP one (which is in itself not too difficult I think).

That could be a reason to not produce json files but stick with the xls files that are produced by drq, or?

@zklaus
Contributor

zklaus commented Feb 12, 2019

I was also pondering these issues, but I have come to the conclusion that the activity_id in the metadata-template.json is always the mip that "owns" the experiment, not the one that requested the variable. This is not clearly spelled out in the CMIP6 documents, the evidence is circumstantial, but substantial. A lot of it comes from [1]:

  • In [1, Table 3] it is specified that the global attribute activity_id comes from CMIP6_experiment_id.json. In there, activity_id only lists the owners of the experiment.
  • Looking at the data that is already published on the ESGF, spot checks suggest that this is the reading of the standard by other groups.
  • In [1] the filename template does not contain the mip, suggesting that files from the same experiment will not be assigned to different mips. This is also supported by the directory template where the mip appears only above the experiment, never below.

There could be a few more of these hints; I didn't find anything supporting the reading that the files should carry the mip that requested the variable.
If you don't find this convincing, let me know and I will hunt for some more evidence. Otherwise we can seek clarification from Karl Taylor or maybe any of the other authors of [1].

In the case that actually multiple mips are relevant, [1, Table 1, activity_id row] and [1, Table 1, footnote 3] specify that the mips should be listed together, separated by a single space.

This seems to be applicable only in the case of jointly owned experiments; the complete list of these is:

  • piClim-aer
  • piClim-control
  • ssp370
  • land-hist
  • dcppC-forecast-addPinatubo

@zklaus
Contributor

zklaus commented Feb 12, 2019

@treerink wrt the drq -m _all_ business, you write that as a cmorizer I don't need to deal with that, but the step-by-step process that you link seems to suggest that I do need to do something like

drq -m CMIP,DCPP,LS3MIP,PAMIP,RFMIP,ScenarioMIP,VolMIP,CORDEX,DynVar,SIMIP,VIACSAB -e piControl -t 1 -p 1 --xls --xlsDir ece2cmor3/scripts/cmip6-data-request/cmip6-data-request-m=CMIP.DCPP.LS3MIP.PAMIP.RFMIP.ScenarioMIP.VolMIP.CORDEX.DynVar.SIMIP.VIACSAB-e=piControl-t=1-p=1

Is this correct? In other words, you don't mean that I don't have to do the drq, but just that I can list the applicable mips instead of _all_, right?

@goord
Collaborator Author

goord commented Feb 12, 2019

Ok there are 2 discussions going on here:

  • The problem with variables being generated by multiple components should be resolved by a 'drq2varlist' script that calls the 'complex' (current) taskloader and also classifies the variables according to the preferred submodel. The ece2cmor3 taskloader will become straightforward, without hidden decisions: what you see is what you get, and if you request something that doesn't exist it will report an error or maybe even abort the entire cmorization process...

  • The problem with the activity_id for experiments serving multiple MIPs is something that should have been decided on by the CMIP6 data request or CMOR people. Maybe we can even give multiple activity ids in the metadata, Klaus? We should definitely raise this question with the WCRP people.

@zklaus
Contributor

zklaus commented Feb 12, 2019

@goord you are right that we kind of derailed the original discussion which was about the same variable being available from different ec-earth components.

But wrt your second point, I think the situation is clear enough: the activity_id has to be the mip owning the experiment, not the one requesting the variable. In the five experiments that share custody between two mips, both must be listed in the metadata, separated by a single space, and @ufladrich informs me that the directory component should be the first mip listed in CMIP6_experiment_id.json.

@ufladrich

[...] In the case that actually multiple mips are relevant, [1, Table 1, activity_id row] and [1, Table 1, footnote 3] specify that the mips should be listed together, separated by a single space.

And in that case the same reference details on page 17 for the Directory structure template:

If multiple activities are listed in the global attribute, the first one is used in the directory structure.
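
The two rules quoted in this thread (owning MIPs joined by a single space, first one used in the directory structure) can be sketched as follows. The function name `activity_attributes` and the placeholder MIP names are assumptions for illustration; actual ownership has to be taken from CMIP6_experiment_id.json.

```python
def activity_attributes(owning_mips):
    """Build the activity_id attribute and the directory component.

    owning_mips is assumed to be the list of MIPs owning the experiment,
    in the order in which CMIP6_experiment_id.json lists them.
    """
    if not owning_mips:
        raise ValueError("An experiment needs at least one owning MIP")
    activity_id = " ".join(owning_mips)   # single-space separated
    directory_component = owning_mips[0]  # first listed activity
    return activity_id, directory_component


# Placeholder names, not real MIP ownership data.
joint = activity_attributes(["FirstMIP", "SecondMIP"])
single = activity_attributes(["CMIP"])
```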

@treerink
Collaborator

Is this correct? In other words, you don't mean that I don't have to do the drq, but just that I can list the applicable mips instead of _all_, right?

Yes, correct: you need to run drq to create the data request file for the cmorization as long as it isn't provided by us.

@zklaus
Contributor

zklaus commented Feb 13, 2019

Ok, in that case it seems to be a good idea to go with -m _all_. Advantages include

  • No need to figure out the applicable mips as intersection of mips requesting variables for the given experiment and mips EC-Earth is participating in
  • No risk to accidentally add or omit the wrong mip
  • The same filename for all cmorizations, simplifying scripting

So all in all makes the job of the cmorizer much easier.
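
For comparison, the selection step that -m _all_ makes unnecessary amounts to a set intersection. A minimal sketch with made-up MIP sets and a hypothetical helper name `applicable_mips`; the comma-joined string mirrors the form of the -m argument in the drq call quoted earlier in this thread.

```python
def applicable_mips(requesting_mips, participating_mips):
    """MIPs that both request variables for the experiment and are joined by the model."""
    return sorted(set(requesting_mips) & set(participating_mips))


# Made-up example sets, for illustration only.
requesting = {"CMIP", "DCPP", "HighResMIP", "SIMIP"}
participating = {"CMIP", "DCPP", "LS3MIP", "SIMIP"}

mips = applicable_mips(requesting, participating)
m_argument = ",".join(mips)  # value one would pass after -m
```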

Are there any disadvantages that I am overlooking?

@treerink
Collaborator

Using one data request file including everything is indeed a pragmatic option, but it will cause a lot of error messages because you are asking to cmorise many variables which are not in your data set (and this will differ per experiment). So you lose a bit of control: if a variable which should have been produced is for whatever reason not in your data set, this error message is hard to distinguish. On the other hand, yes, it is quite a shortcut.

@treerink
Collaborator

As described in #253, we aim for json data request files which are based on the xlsx file as created by drq, but with the ignored list applied on top and with the EC-Earth3 model configuration dependent preferences applied directly.

@treerink
Collaborator

treerink commented Feb 15, 2019

Note that the preference file might contain a key such as "omit". For instance, the chemical tracers CFC12, cfc13, sf6 and c14 as discussed in the ece portal issue 609-26 will only be cmorised for the EC-Earth3-CC configuration.
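
A minimal sketch of how the ignored list and such a configuration-dependent omission could be applied to a request. The layout of the preference file is not fixed in this thread, so the `ONLY_FOR` table, the function `filter_request` and the table/variable pairs below are all hypothetical.

```python
# Hypothetical rule table: variable -> configurations allowed to produce it.
ONLY_FOR = {"cfc12": {"EC-Earth3-CC"}, "sf6": {"EC-Earth3-CC"}}


def filter_request(requested, ignored, configuration):
    """Drop ignored variables and variables omitted for this configuration."""
    kept = []
    for table, variable in requested:
        if (table, variable) in ignored:
            continue  # on the ignored list: cannot be produced at all
        allowed = ONLY_FOR.get(variable)
        if allowed is not None and configuration not in allowed:
            continue  # omitted for this model configuration
        kept.append((table, variable))
    return kept


# Illustrative request; the table names are assumptions.
requested = [("Amon", "tas"), ("Omon", "cfc12"), ("Omon", "sf6"), ("Amon", "dummy")]
ignored = {("Amon", "dummy")}

for_aogcm = filter_request(requested, ignored, "EC-EARTH-AOGCM")
for_cc = filter_request(requested, ignored, "EC-Earth3-CC")
```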

@EC-Earth EC-Earth deleted a comment from goord Feb 20, 2019
@aearamos

aearamos commented Mar 6, 2019

Hi @treerink
I'm trying to get zg500 daily from cmorisation, but I can only find it in table AERday, with a "modeling_realm" = aerosol. So, I'm assuming this is from tm5, right?

Is there a way to cmorise zg500 assuming it comes from ifs, or it falls in this "desired" list we're creating in this issue? I think this is a good candidate for double-counting variables.

Thanks!

@tommibergman
Collaborator

Yes, the aerosol realm is mainly from TM5, but there are exceptions. I also agree that this is one for the desired list.

@treerink
Collaborator

treerink commented Mar 6, 2019

@aearamos So if TM5 is not active in the model configuration used (for instance EC-EARTH3-AOGCM), you want zg500 from IFS? Do you already have a grib code (or expression) for it? If so, I can add it to the preference file in the dedicated branch we have for it now.

@aearamos

aearamos commented Mar 6, 2019

Yes, I'd want zg500 daily from IFS.
I ran a test using a modified version of the AERday table, where I changed the modeling_realm variable from aerosol to atmo. It worked for me, but I know it's not ideal because it's not the right table. Would it be possible to add this variable to some other table?
It is requested by DCPP and we'll start the runs pretty soon.

@aearamos

Hi @treerink
When testing this new branch, I used a simple varlist.json that was in resources and got the following error for all the tables:
ERROR:ece2cmor3.taskloader: Cannot interpret day as an EC-Earth model component

Can you provide a varlist in the model that we can use to test? Or how can we generate the varlists now?

Thanks

@goord
Collaborator Author

goord commented Mar 15, 2019

Hi Arthur, there is a script, drq2varlist, that does that.

@aearamos

I'm just checking that.
When I do:
./drq2varlist.py --drq cmvmm_DCPP_TOTAL_1_1.xlsx

I get
2019-03-15 19:49:14 INFO:ece2cmor3.cmor_target: CMOR tables data_specs_version : 01.00.29
2019-03-15 19:49:14 INFO:ece2cmor3.cmor_target: CMOR tables cmor_version : 3.4
2019-03-15 19:49:14 INFO:ece2cmor3.cmor_target: CMOR tables Conventions : CF-1.7 CMIP-6.2
2019-03-15 19:49:14 INFO:ece2cmor3.cmor_target: CMOR tables table_date : 08 March 2019
Traceback (most recent call last):
  File "./drq2varlist.py", line 38, in <module>
    main()
  File "./drq2varlist.py", line 34, in main
    json.dump(result, ofile, indent=4, separators=(',', ': '), sort_keys=True)
  File "/shared/earth/software/Python/2.7.9-foss-2015a/lib/python2.7/json/__init__.py", line 189, in dump
    for chunk in iterable:
  File "/shared/earth/software/Python/2.7.9-foss-2015a/lib/python2.7/json/encoder.py", line 431, in _iterencode
    for chunk in _iterencode_list(o, _current_indent_level):
  File "/shared/earth/software/Python/2.7.9-foss-2015a/lib/python2.7/json/encoder.py", line 332, in _iterencode_list
    for chunk in chunks:
  File "/shared/earth/software/Python/2.7.9-foss-2015a/lib/python2.7/json/encoder.py", line 408, in _iterencode_dict
    for chunk in chunks:
  File "/shared/earth/software/Python/2.7.9-foss-2015a/lib/python2.7/json/encoder.py", line 332, in _iterencode_list
    for chunk in chunks:
  File "/shared/earth/software/Python/2.7.9-foss-2015a/lib/python2.7/json/encoder.py", line 442, in _iterencode
    o = _default(o)
  File "/shared/earth/software/Python/2.7.9-foss-2015a/lib/python2.7/json/encoder.py", line 184, in default
    raise TypeError(repr(o) + " is not JSON serializable")
TypeError: <ece2cmor3.cmor_target.cmor_target object at 0x7fea5f48e910> is not JSON serializable

Do I have to add some more flags?
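
For context, this kind of TypeError is raised when json.dump meets an object it has no encoding for. The sketch below reproduces the situation with a stand-in class and shows the generic `default=` hook of the json module; it is an illustration only, not the fix that went into ece2cmor3, and the `Target` class and `encode_target` function are hypothetical.

```python
import json


class Target(object):
    """Stand-in for an object such as cmor_target (hypothetical, illustration only)."""

    def __init__(self, table, variable):
        self.table = table
        self.variable = variable


def encode_target(obj):
    """Fallback for json.dumps: turn a Target into a plain dict."""
    if isinstance(obj, Target):
        return {"table": obj.table, "variable": obj.variable}
    raise TypeError(repr(obj) + " is not JSON serializable")


result = {"ifs": [Target("Amon", "ua")]}

try:
    json.dumps(result)
    failed = False
except TypeError:
    failed = True  # same class of error as in the traceback above

# With a default hook, the custom objects are converted before encoding.
text = json.dumps(result, default=encode_target, sort_keys=True)
```

An alternative is to convert the objects to plain strings or dicts before calling json.dump, so that only built-in types reach the encoder.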

@goord
Collaborator Author

goord commented Mar 15, 2019

Hmm, not sure; that seems like a bug in the branch. You could use the data request Excel file with ece2cmor, but you have to pass it with the --drq option.

@aearamos

Could this be because of the version of CMOR? I'm using CMOR/3.3.3 now.

@goord
Collaborator Author

goord commented Mar 15, 2019

No, I think drq2varlist is broken in the branch.

@treerink
Collaborator

Yes, drq2varlist.py is broken, also in the master where it has been merged in. Once it is working again, an example of calling it is:

./drq2varlist.py --drq cmip6-data-request/cmip6-data-request-m=CMIP-e=CMIP-t=1-p=1/cmvme_CMIP_piControl_1_1.xlsx --ececonf EC-EARTH-AOGCM

And indeed the cmorisation itself is also broken in the master; we aim to have it all fixed next Wednesday. Once it is working again, an example of calling it is:

ece2cmor cmip6-ec-earth-output/t306/001/ --exp t306 --nemo --conf ece2cmor3/resources/metadata-templates/cmip6-CMIP-piControl-metadata-template.json --drq ece2cmor3/scripts/cmip6-data-request/cmip6-data-request-m=CMIP.DCPP.LS3MIP.PAMIP.RFMIP.ScenarioMIP.VolMIP.CORDEX.DynVar.SIMIP.VIACSAB-e=piControl-t=1-p=1/cmvme_cm.co.dc.dy.ls.pa.rf.sc.si.vi.vo_piControl_1_1.xlsx --ececonf EC-EARTH-AOGCM --odir cmor-nemo-CMIP-piControl-AOGCM-306 >& log-cmip6-cmorizing-nemo-cmor-CMIP-piControl-AOGCM-t306 &

@aearamos

Hi @treerink
I'm trying to get zg500 daily from cmorisation, but I can only find it in table AERday, with a "modeling_realm" = aerosol. So, I'm assuming this is from tm5, right?

Is there a way to cmorise zg500 assuming it comes from ifs, or it falls in this "desired" list we're creating in this issue? I think this is a good candidate for double-counting variables.

Thanks!

So, regarding this variable (zg500) from table AERday: by using the new varlist files, I would keep the modeling_realm as "aerosol" and the variable would be cmorised as an ifs variable? Will ece2cmor be able to cmorise it even though the realm doesn't match one of the realms expected for ifs? In this case I'd be using ifs only.

@goord
Collaborator Author

goord commented Mar 20, 2019

Hi @aearamos we can add it to ifspar.json. After speaking to @tommibergman , it looks like we will let TM5 generate the model-level meteorological variables (u, v, t, zg, w) in the AER* tables and IFS the rest (such as zg500, which is on pressure levels).

@goord
Collaborator Author

goord commented Mar 21, 2019

Yes, zg500 will be cmorized regardless of the realms; they no longer play a role in the new task loading strategy.

@treerink
Collaborator

Closing this issue.
