What about double-counting variables #224

goord · 2018-08-31T08:41:49Z

It may happen that variables can be produced by more than one component (especially in the case of tm5-ifs or lpjg-ifs). We should come up with a mechanism to give precedence to certain models for certain variables.

goord · 2018-08-31T14:06:09Z

Currently, a prioritization is made based upon the realms (see output of taskloader)

tommibergman · 2018-10-31T14:58:26Z

For the TM5-IFS part, we would mostly like TM5 to have precedence for tables AER* and IFS for tables A*. I am sure there are a few exceptions, but this would be a first-order suggestion.

Precedence of IFS over TM5 is true especially for the meteorological variables (these are mainly in Amon), since anyone using the data can always regrid to lower resolution.

tommibergman · 2019-01-30T15:03:46Z

We decided to produce a file listing the double-counted variables, with rules for which component should produce each variable in which case. The format is: variable name, table, list of components in order of preference. So for example a line
pfull AERmon [tm5,ifs]
would mean that the pfull variable for the AERmon table is produced from tm5 if tm5 is present, and otherwise from ifs.

Actually the table column could also be a list, since several tables, but not all, can share the same preference. Or what do others think?
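To make the proposed format concrete, here is a minimal sketch of how such a line could be parsed and resolved. It is not the ece2cmor3 implementation: the function names are hypothetical, and the comma-separated table list is only an assumption about how the "table column as a list" idea might look.

```python
import re


def parse_preference_line(line):
    """Parse 'variable table[,table...] [comp1,comp2,...]' into a tuple.

    Assumption: several tables are given comma-separated without spaces.
    """
    match = re.match(r"^\s*(\S+)\s+(\S+)\s+\[([^\]]*)\]\s*$", line)
    if match is None:
        raise ValueError("Cannot parse preference line: %r" % line)
    variable, tables, components = match.groups()
    return (variable,
            [t for t in tables.split(",") if t],
            [c.strip() for c in components.split(",") if c.strip()])


def choose_component(preferred, active):
    """Return the first preferred component that is active, or None."""
    for component in preferred:
        if component in active:
            return component
    return None


variable, tables, preferred = parse_preference_line("pfull AERmon [tm5,ifs]")
with_tm5 = choose_component(preferred, {"ifs", "nemo", "tm5"})
without_tm5 = choose_component(preferred, {"ifs", "nemo"})
print(variable, tables, with_tm5, without_tm5)
```

With tm5 in the active set the variable is taken from tm5; without it the lookup falls through to ifs, and a variable none of whose components is active resolves to None.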

Attached is a list for TM5
double-counting.txt

goord · 2019-01-31T09:15:03Z

It should also be noted that the user will have to give the 'model configuration' (i.e. the list of components) that has produced the data, even though one is only cmorizing variables for one component at a time...

goord · 2019-02-07T13:47:23Z

Hi @tommibergman and @treerink, after some thought I came to the following conclusion: it may be more appropriate to write a separate script that splits the input data request into json variable list files, one per EC-Earth component. This makes it more traceable and transparent which variables are produced by which component, and the files can even be archived or put under version control with the model configuration files. This script will of course make use of the preferences file proposed above.

treerink · 2019-02-07T14:50:37Z

@goord would the same idea be possible, but with these component json files merged again into one json file at the end, for each MIP experiment for a given ece model configuration? This makes the archiving more compact and also the cmorisation more straightforward, because otherwise one has to specify several jsons and pick the right ones when cmorising. Or does this break your idea?

goord · 2019-02-07T15:18:21Z

@treerink we can also make a single json file with an extra level denoting the components, e.g.

{
 "ifs": {"Amon": ["ua", "va", "tos"]},
 "nemo": {"Omon": ["sos", "tos"], "3hr": ["tos"]}
}

If one specifies such a json, it can be crystal clear for the task loader and the user which variables will be omitted when processing for a single component.
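As an illustration of that idea, the sketch below loads a component-keyed variable list and reports which targets would be processed and which omitted for one component. It is a sketch under assumptions, not the task loader itself: `KNOWN_COMPONENTS`, `load_varlist` and `split_for_component` are hypothetical names, and the component set is only the one mentioned in this thread.

```python
import json

# Hypothetical set of component names, taken from this discussion only.
KNOWN_COMPONENTS = {"ifs", "nemo", "lpjg", "tm5"}

VARLIST_TEXT = """
{
    "ifs": {"Amon": ["ua", "va", "tos"]},
    "nemo": {"Omon": ["sos", "tos"], "3hr": ["tos"]}
}
"""


def load_varlist(text):
    """Load a component-keyed variable list and reject unknown top-level keys."""
    varlist = json.loads(text)
    unknown = sorted(set(varlist) - KNOWN_COMPONENTS)
    if unknown:
        raise ValueError("Not an EC-Earth model component: %s" % ", ".join(unknown))
    return varlist


def split_for_component(varlist, component):
    """Return (targets processed for component, targets omitted)."""
    selected, omitted = [], []
    for comp, tables in sorted(varlist.items()):
        for table, variables in sorted(tables.items()):
            for variable in variables:
                target = (table, variable)
                (selected if comp == component else omitted).append(target)
    return selected, omitted


varlist = load_varlist(VARLIST_TEXT)
selected, omitted = split_for_component(varlist, "nemo")
print("processed:", selected)
print("omitted:", omitted)
```

A flat, table-keyed list such as `{"day": ["zg500"]}` is rejected by `load_varlist`, because its top-level key is a table name rather than a component.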

treerink · 2019-02-07T20:07:30Z

Yes, sounds like a plan. So let's try this for #253.

zklaus · 2019-02-08T09:16:33Z

This plan of having one json file for each job sounds good!

But I'd like to comment a bit on what a job is:
First, the use of MIP together with Experiment seems to me to be quite misplaced.
The two are separate entities and there is not a particularly strong connection.
Indeed, I think we can and should essentially ignore MIP now that the experiments are designed.

Wrt "model configuration", in common parlance this does not refer merely to a collection of components, but to what are separate models from the point of view of cmip6, eg EC-Earth-Veg, EC-Earth-CC, etc.

These two things are the only two that we should consider for organizing the json files.
This would give us a directory structure like

<model configuration>/<Experiment>/data-request.json

eg

EC-Earth-Veg/piControl/data-request.json

The data-request.json file should contain all variables requested by all mips, ie it should be based on the file
cmvme_ae.c4.cd.cf.cm.co.da.dc.dy.fa.ge.gm.hi.is.ls.lu.om.pa.pm.rf.sc.si.vi.vo_<Experiment>_3_3.xlsx produced with the -m _all_ switch to drq.

Do you agree?

treerink · 2019-02-11T16:46:52Z

@zklaus actually I was not after setting up any new directory infrastructure for these data request json files; the idea is just to add them to the existing control output subdirectories, so they form a set with the control output files for each experiment.

zklaus · 2019-02-11T16:51:15Z

@treerink fair enough, that should work!
What do you think about the -m _all_ thing?

treerink · 2019-02-11T20:16:05Z

Well, each experiment has its own data request. In some cases (the Core MIP cases) this is a joint data request, because we want to be efficient and run the experiment only once for all the MIPs run by a certain model configuration (EC-Earth3-AOGCM, EC-Earth3-Veg etc.). But genecec accounts for all of this (all these data request files are already generated to produce the control output files, but I did not share them because they are xlsx files), and as soon as we have added the automatic creation of the json data request files this will all be ready for the end user. Note that the only content-wise difference between the xlsx data request files and the json data request files will be that the json ones do not contain variables which are requested but cannot be produced by EC-Earth, i.e. the ones in the ece2cmor3 ignored list (#253).

As a cmorizer you don't need anything with drq -m _all_. In fact, in the identification steps in the background in ece2cmor3 I do use such things, but I noticed that "all" in the python dreq package is different from "all" in drq, so I actually prefer to explicitly list the MIPs which I need to include.

In fact it would also be useful to generate a metadata template file for each experiment and add those to the control output subdirectories as well. Several MIP- and experiment-dependent variables could be set by genecec, but the cmorizer always has to modify some things, like the ensemble member label; that is why they will always be called metadata file templates (#214). I am just not sure whether this will be ready in time for the Core MIP cmorization.

treerink · 2019-02-11T20:22:17Z

By the way, an example of how to create an xlsx data request file in the current situation (as long as the json data request files are not there) is given in the step-by-step wiki.

klauswyser · 2019-02-11T20:24:54Z

Please have a look at the newly created issue 615 on the EC-Earth dev portal.

The problem is not only the varlists for the different MIPs that are run with the same model configuration, but also the activity_id that is given by the MIP. Most experiments belong uniquely to one MIP, so this is not a problem, but what to do with the "historical" experiment? How do we make sure that the variables are saved correctly for each MIP?

treerink · 2019-02-11T21:01:14Z

Ok, concerning joint Core MIP experiments, your point is that at the time of cmorising you actually don't want to provide one joint cmorised set of variables, but instead want to split out, for each MIP, the list of variables requested by that MIP for the experiment, and then provide the correct activity_id, which then becomes obvious (though it is one of the things the cmorizer has to adjust in the metadata template). You might be right that this is the way we have to provide the cmorized data; it will be a painfully large amount of identical data with just slightly different metadata. Anyway, it means that I then have to produce with genecec, for each joint Core MIP experiment, a set of json data request files, one for each MIP (which in itself is not too difficult, I think).

klauswyser · 2019-02-12T06:35:44Z

> it will be a painfully large amount of identical data with just slightly different metadata.

Are you sure about that? Do you think the same variable is in the drq for say SIMIP and CMIP? It would be nice if only variables that are exclusively in SIMIP are processed when running ece2cmor with activity_id=SIMIP, but I don't know if this is the case. Otherwise you are right, and the amount of duplicates would be prohibitive. In that case it would be better to process everything with activity_id=CMIP and then just hope that data users find the data that were produced for SIMIP.

> it means that I then have to produce with genecec, for each joint Core MIP experiment, a set of json data request files, one for each MIP (which in itself is not too difficult, I think).

That could be a reason to not produce json files but stick with the xls files that are produced by drq, or?

zklaus · 2019-02-12T09:29:09Z

I was also pondering these issues, but I have come to the conclusion that the activity_id in the metadata-template.json is always the mip that "owns" the experiment, not the one that requested the variable. This is not clearly spelled out in the CMIP6 documents, the evidence is circumstantial, but substantial. A lot of it comes from [1]:

In [1, Table 3] it is specified that the global attribute activity_id comes from CMIP6_experiment_id.json. In there, activity_id only lists the owners of the experiment.
Looking at the data that is already published on the ESGF, spot checks suggest that this is the reading of the standard by other groups.
In [1] the filename template does not contain the mip, suggesting that files from the same experiment will not be assigned to different mips. This is also supported by the directory template where the mip appears only above the experiment, never below.

There could be a few more of these hints; I didn't find anything supporting the reading that the files should carry the mip that requested the variable.
If you don't find this convincing, let me know and I will hunt for some more evidence. Otherwise we can seek clarification from Karl Taylor or maybe any of the other authors of [1].

In the case that actually multiple mips are relevant, [1, Table 1, activity_id row] and [1, Table 1, footnote 3] specify that the mips should be listed together, separated by a single space.

This seems to be applicable only in the case of jointly owned experiments; the complete list of these is:

piClim-aer
piClim-control
ssp370
land-hist
dcppC-forecast-addPinatubo

zklaus · 2019-02-12T09:32:37Z

@treerink wrt the drq -m _all_ business, you write that as a cmorizer I don't need to deal with that, but the step-by-step process that you link seems to suggest that I do need to do something like

drq -m CMIP,DCPP,LS3MIP,PAMIP,RFMIP,ScenarioMIP,VolMIP,CORDEX,DynVar,SIMIP,VIACSAB -e piControl -t 1 -p 1 --xls --xlsDir ece2cmor3/scripts/cmip6-data-request/cmip6-data-request-m=CMIP.DCPP.LS3MIP.PAMIP.RFMIP.ScenarioMIP.VolMIP.CORDEX.DynVar.SIMIP.VIACSAB-e=piControl-t=1-p=1

Is this correct? In other words, you don't mean that I don't have to do the drq, but just that I can list the applicable mips instead of _all_, right?
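For scripting, a call like the one above can be assembled from a list of MIPs. The following is only a sketch: `drq_command` is a hypothetical helper, it uses just the flags and the directory naming pattern that appear in the command above, and it builds the argument list without running drq.

```python
def drq_command(mips, experiment, tier=1, priority=1,
                base_dir="ece2cmor3/scripts/cmip6-data-request"):
    """Build the drq argument list for an explicit list of MIPs."""
    # The output directory name repeats the selection, with dots between MIPs.
    dir_name = "cmip6-data-request-m=%s-e=%s-t=%d-p=%d" % (
        ".".join(mips), experiment, tier, priority)
    return ["drq", "-m", ",".join(mips), "-e", experiment,
            "-t", str(tier), "-p", str(priority),
            "--xls", "--xlsDir", "%s/%s" % (base_dir, dir_name)]


MIPS = ["CMIP", "DCPP", "LS3MIP", "PAMIP", "RFMIP", "ScenarioMIP",
        "VolMIP", "CORDEX", "DynVar", "SIMIP", "VIACSAB"]
command = drq_command(MIPS, "piControl")
print(" ".join(command))
```

The resulting list can be handed to `subprocess.call` once the MIP list for the experiment is known.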

goord · 2019-02-12T10:12:50Z

Ok there are 2 discussions going on here:

1. The problem with variables being generated by multiple components should be resolved by a 'drq2varlist' script that calls the 'complex' (current) taskloader and also classifies the variables according to the preferred submodel. The ece2cmor3 taskloader will become straightforward without hidden decisions: what you see is what you get, and if you request something that doesn't exist it will report an error or maybe even abort the entire cmorization process...
2. The problem with the activity_id for experiments serving multiple MIPs is something that should have been decided on by the CMIP6 data request or CMOR people. Maybe we can even give multiple activity ids in the metadata, Klaus? We should definitely raise this question with the WCRP people.

zklaus · 2019-02-12T10:40:35Z

@goord you are right that we kind of derailed the original discussion which was about the same variable being available from different ec-earth components.

But wrt your second point, I think the situation is clear enough: the activity_id has to be the mip owning the experiment, not the one requesting the variable. In the five experiments that share custody between two mips, both must be listed in the metadata, separated by a single space, and @ufladrich informs me that the directory component should be the first mip listed in CMIP6_experiment_id.json.

ufladrich · 2019-02-12T10:42:49Z

> [...] In the case that actually multiple mips are relevant, [1, Table 1, activity_id row] and [1, Table 1, footnote 3] specify that the mips should be listed together, separated by a single space.

And for that case, the same reference specifies on page 17, for the directory structure template:

> If multiple activities are listed in the global attribute, the first one is used in the directory structure.
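The two rules quoted above can be sketched as follows. This is an illustration only: `activity_metadata` and the key `directory_activity` are hypothetical names, and the MIP pair used in the example is an assumed input; the real owners and their order should be read from CMIP6_experiment_id.json.

```python
def activity_metadata(owning_mips):
    """Derive the activity_id attribute and the directory component.

    owning_mips: the MIPs owning the experiment, in the order in which
    they are listed in CMIP6_experiment_id.json.
    """
    if not owning_mips:
        raise ValueError("An experiment needs at least one owning MIP")
    return {
        # All owners, separated by a single space.
        "activity_id": " ".join(owning_mips),
        # Only the first owner appears in the directory structure.
        "directory_activity": owning_mips[0],
    }


# Assumed example input for a jointly owned experiment.
meta = activity_metadata(["ScenarioMIP", "AerChemMIP"])
print(meta["activity_id"])
print(meta["directory_activity"])
```

For the common case of a single owner, both values are simply that MIP.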

treerink · 2019-02-12T20:12:04Z

> Is this correct? In other words, you don't mean that I don't have to do the drq, but just that I can list the applicable mips instead of _all_, right?

Yes, correct: you need to run drq to create the data request file for the cmorization as long as it isn't provided by us.

zklaus · 2019-02-13T08:49:03Z

Ok, in that case it seems to be a good idea to go with -m _all_. Advantages include

No need to figure out the applicable mips as intersection of mips requesting variables for the given experiment and mips EC-Earth is participating in
No risk to accidentally add or omit the wrong mip
The same filename for all cmorizations, simplifying scripting

So all in all it makes the job of the cmorizer much easier.

Are there any disadvantages that I am overlooking?

treerink · 2019-02-13T13:59:56Z

Using one data request file that includes everything is indeed a pragmatic option, but it will cause a lot of error messages, because you are asking to cmorise many variables which are not in your data set (and this will differ per experiment). So you lose a bit of control: if a variable which should have been produced is for whatever reason not in your data set, its error message is hard to distinguish from the rest. On the other hand, yes, it is quite a shortcut.
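The loss of control can be illustrated with a small sketch that separates the missing variables that matter from the ones that were never expected. All names are hypothetical and the (table, variable) pairs are made up for the example; ece2cmor3 does not necessarily report things this way.

```python
def classify_missing(requested, expected, available):
    """Split requested-but-missing targets into real problems and noise.

    requested: everything in the (all-inclusive) data request
    expected:  what this experiment and configuration should produce
    available: what is actually present in the model output
    """
    missing = set(requested) - set(available)
    problems = sorted(missing & set(expected))   # should exist, but do not
    ignorable = sorted(missing - set(expected))  # never meant to be produced
    return problems, ignorable


requested = [("Amon", "tas"), ("Amon", "ua"), ("Omon", "tos"), ("AERmon", "pfull")]
expected = [("Amon", "tas"), ("Amon", "ua"), ("Omon", "tos")]
available = [("Amon", "tas"), ("Omon", "tos")]

problems, ignorable = classify_missing(requested, expected, available)
print("problems:", problems)
print("ignorable:", ignorable)
```

With an experiment-specific data request the `ignorable` group is empty by construction, so every reported error is a real problem.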

treerink · 2019-02-15T15:18:05Z

As described in #253 we aim for json data request files which are based on the xlsx file as created by drq, but with the ignored list applied on top and with directly applying the EC-Earth3 model configuration dependent preferences.

treerink · 2019-02-15T15:20:41Z

Note that the preference file might contain a key such as "omit". For instance, the chemical tracers CFC12, cfc13, sf6 and c14, as discussed in the ece portal issue 609-26, will only be cmorised for the EC-Earth3-CC configuration.

aearamos · 2019-03-06T10:49:49Z

Hi @treerink
I'm trying to get daily zg500 from the cmorisation, but I can only find it in table AERday, with a "modeling_realm" = aerosol. So, I'm assuming this is from tm5, right?

Is there a way to cmorise zg500 assuming it comes from ifs, or does it fall in this "desired" list we're creating in this issue? I think this is a good candidate for double-counting variables.

Thanks!

tommibergman · 2019-03-06T10:52:45Z

Yes aerosol realm is mainly from TM5 but there are exceptions. I agree also that this is one for the desired list.

treerink · 2019-03-06T12:44:29Z

@aearamos So if TM5 is not active in the model configuration used (for instance for EC-EARTH3-AOGCM), you want zg500 from IFS? Do you already have a grib code (or expression) for it? If so, I can add it to the preference file in the dedicated branch we have for it now.

aearamos · 2019-03-06T13:52:34Z

Yes, I'd want zg500 daily from IFS.
I ran a test using a modified version of the AERday table, where I changed the modeling_realm variable from aerosol to atmos. It worked for me, but I know it's not ideal because it's not the right table. Would it be possible to add this variable to some other table?
It is requested by DCPP and we'll start the runs pretty soon.

aearamos · 2019-03-15T18:31:49Z

Hi @treerink
When testing this new branch, I used a simple varlist.json that was in resources and got the following error for all the tables:
ERROR:ece2cmor3.taskloader: Cannot interpret day as an EC-Earth model component

Can you provide a varlist in the model that we can use to test? Or how can we generate the varlists now?

Thanks

goord · 2019-03-15T18:47:35Z

Hi Arthur, there is a script drq2varlist.py that does that.

aearamos · 2019-03-15T18:51:39Z

I'm just checking that.
When I do:
./drq2varlist.py --drq cmvmm_DCPP_TOTAL_1_1.xlsx

I get
2019-03-15 19:49:14 INFO:ece2cmor3.cmor_target: CMOR tables data_specs_version : 01.00.29
2019-03-15 19:49:14 INFO:ece2cmor3.cmor_target: CMOR tables cmor_version : 3.4
2019-03-15 19:49:14 INFO:ece2cmor3.cmor_target: CMOR tables Conventions : CF-1.7 CMIP-6.2
2019-03-15 19:49:14 INFO:ece2cmor3.cmor_target: CMOR tables table_date : 08 March 2019
Traceback (most recent call last):
  File "./drq2varlist.py", line 38, in <module>
    main()
  File "./drq2varlist.py", line 34, in main
    json.dump(result, ofile, indent=4, separators=(',', ': '), sort_keys=True)
  File "/shared/earth/software/Python/2.7.9-foss-2015a/lib/python2.7/json/__init__.py", line 189, in dump
    for chunk in iterable:
  File "/shared/earth/software/Python/2.7.9-foss-2015a/lib/python2.7/json/encoder.py", line 431, in _iterencode
    for chunk in _iterencode_list(o, _current_indent_level):
  File "/shared/earth/software/Python/2.7.9-foss-2015a/lib/python2.7/json/encoder.py", line 332, in _iterencode_list
    for chunk in chunks:
  File "/shared/earth/software/Python/2.7.9-foss-2015a/lib/python2.7/json/encoder.py", line 408, in _iterencode_dict
    for chunk in chunks:
  File "/shared/earth/software/Python/2.7.9-foss-2015a/lib/python2.7/json/encoder.py", line 332, in _iterencode_list
    for chunk in chunks:
  File "/shared/earth/software/Python/2.7.9-foss-2015a/lib/python2.7/json/encoder.py", line 442, in _iterencode
    o = _default(o)
  File "/shared/earth/software/Python/2.7.9-foss-2015a/lib/python2.7/json/encoder.py", line 184, in default
    raise TypeError(repr(o) + " is not JSON serializable")
TypeError: <ece2cmor3.cmor_target.cmor_target object at 0x7fea5f48e910> is not JSON serializable

Do I have to add some more flags?

goord · 2019-03-15T19:07:38Z

Hmm, not sure; that seems like a bug in the branch. You could use the data request Excel file with ece2cmor directly, but then you have to pass it with the --drq option.

aearamos · 2019-03-15T19:10:28Z

Could this be because of the version of CMOR? I'm using CMOR/3.3.3 now.

goord · 2019-03-15T20:40:28Z

No, I think drq2varlist.py is broken in the branch.
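For context, this kind of TypeError appears whenever json.dump is handed objects it does not know how to encode. The sketch below reproduces it with a hypothetical stand-in class and shows one possible way around it, namely reducing the objects to plain table and variable names before dumping. It is not the actual fix applied in ece2cmor3, and the `table` and `variable` attributes are assumptions.

```python
import json


class Target(object):
    """Hypothetical stand-in for an object such as a cmor_target."""

    def __init__(self, table, variable):
        self.table = table
        self.variable = variable


result = {"ifs": [Target("Amon", "ua"), Target("Amon", "va")]}

# Dumping arbitrary objects fails, as in the traceback above.
failure = None
try:
    json.dumps(result)
except TypeError as error:
    failure = str(error)


def to_varlist(component_targets):
    """Reduce target objects to {component: {table: [variable, ...]}}."""
    varlist = {}
    for component, targets in component_targets.items():
        tables = varlist.setdefault(component, {})
        for target in targets:
            tables.setdefault(target.table, []).append(target.variable)
    return varlist


text = json.dumps(to_varlist(result), indent=4,
                  separators=(',', ': '), sort_keys=True)
print(text)
```

An alternative would be to pass a `default=` callable to json.dump, but converting up front keeps the written file in the plain component/table/variable layout.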

treerink · 2019-03-17T13:19:12Z

Yes, drq2varlist.py is broken, also in the master where it has been merged in. Once it works again, an example of calling it is:

./drq2varlist.py --drq cmip6-data-request/cmip6-data-request-m=CMIP-e=CMIP-t=1-p=1/cmvme_CMIP_piControl_1_1.xlsx --ececonf EC-EARTH-AOGCM

And indeed the cmorisation itself is also broken in the master; we aim to have it all fixed next Wednesday. Once it works again, an example of calling it is:

ece2cmor cmip6-ec-earth-output/t306/001/ --exp t306 --nemo --conf ece2cmor3/resources/metadata-templates/cmip6-CMIP-piControl-metadata-template.json --drq ece2cmor3/scripts/cmip6-data-request/cmip6-data-request-m=CMIP.DCPP.LS3MIP.PAMIP.RFMIP.ScenarioMIP.VolMIP.CORDEX.DynVar.SIMIP.VIACSAB-e=piControl-t=1-p=1/cmvme_cm.co.dc.dy.ls.pa.rf.sc.si.vi.vo_piControl_1_1.xlsx --ececonf EC-EARTH-AOGCM --odir cmor-nemo-CMIP-piControl-AOGCM-306 >& log-cmip6-cmorizing-nemo-cmor-CMIP-piControl-AOGCM-t306 &

aearamos · 2019-03-20T22:23:31Z

> Hi @treerink
> I'm trying to get daily zg500 from the cmorisation, but I can only find it in table AERday, with a "modeling_realm" = aerosol. So, I'm assuming this is from tm5, right?
>
> Is there a way to cmorise zg500 assuming it comes from ifs, or does it fall in this "desired" list we're creating in this issue? I think this is a good candidate for double-counting variables.
>
> Thanks!

So, regarding this variable (zg500) from table AERday: with the new varlist files, the modeling_realm stays "aerosol" and the variable will be cmorised as an ifs variable? Will ece2cmor be able to cmorise it even though the realm doesn't match one of the realms expected for ifs? In this case I'd only be using ifs.

goord · 2019-03-20T22:31:45Z

Hi @aearamos we can add it to ifspar.json. After speaking to @tommibergman , it looks like we will let TM5 generate the model-level meteorological variables (u, v, t, zg, w) in the AER* tables and IFS the rest (such as zg500, which is on pressure levels).

goord · 2019-03-21T06:41:19Z

Yes, zg500 will be cmorized regardless of the realms; they no longer play a role in the new task loading strategy.

treerink · 2019-03-25T15:58:15Z

Closing this issue.

goord added the question label Aug 31, 2018

goord mentioned this issue Sep 4, 2018

tos in 3hr file not found #231

Closed

treerink mentioned this issue Sep 13, 2018

Large amount of wrong "missing" messages by drq2file_def-nemo.py #235

Closed

goord mentioned this issue Oct 5, 2018

cmorizing lpjg output #257

Closed

tommibergman mentioned this issue Jan 30, 2019

Surface pressure variable ps #390

Closed

treerink mentioned this issue Feb 7, 2019

Create an 'EC-Earth CMIP6 data request' json for each MIP experiment #253

Closed

treerink added the release 1.0 release which is ready for starting CMIP6 runs label Feb 7, 2019

EC-Earth deleted a comment from goord Feb 20, 2019

treerink mentioned this issue Mar 15, 2019

Task load prefs #406

Merged

treerink added a commit that referenced this issue Mar 16, 2019

Add all ece model configuration options #224 #253.

27f70fb

treerink mentioned this issue Mar 19, 2019

Log the merge of the task load merge branch #411

Closed

goord mentioned this issue Mar 22, 2019

Duplicate requested variables were found, dismissing all cmorization tasks #417

Closed

treerink closed this as completed Mar 25, 2019
