Using mappings with multiple input datasets. #14

Open
captainceramic opened this issue Nov 5, 2014 · 9 comments

@captainceramic
Contributor

When you have multiple input datasets (which comes up when calculating the change between two time periods), the mapping of input constraint names to output constraint names does not work.

For example, if both input datasets have a constraint 'date', you might want the 'date' from the first input to become 'start_date' in the output and the 'date' from the second input to become 'end_date'.
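A minimal sketch of the renaming described here, written as plain Python rather than against the plugin's own classes. The function name `rename_constraints`, the dict-based representation of constraints, and the sample values are all assumptions made for illustration.

```python
def rename_constraints(inputs, mappings):
    """Build output constraints from several inputs (hypothetical helper).

    inputs:   one dict of constraint name -> value per input dataset
    mappings: one dict per input giving old name -> new name
    """
    output = {}
    for constraints, mapping in zip(inputs, mappings):
        for name, value in constraints.items():
            new_name = mapping.get(name, name)
            # Unmapped constraints must agree across inputs.
            if new_name in output and output[new_name] != value:
                raise ValueError(f"conflicting values for {new_name!r}")
            output[new_name] = value
    return output


# Illustrative values only.
first = {"variable": "tas", "date": "1990"}
second = {"variable": "tas", "date": "2090"}

merged = rename_constraints(
    [first, second],
    [{"date": "start_date"}, {"date": "end_date"}],
)
print(merged)
```

Without the two mappings, the call raises `ValueError`, because both inputs would contribute a different value for the same 'date' constraint.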

@DamienIrving
Contributor

With respect to the handling of multiple input datasets, in addition to constraint mapping we also need to think about metadata handling (see #31).

@DamienIrving
Contributor

@captainceramic I've posted two new workflows: arithmetic_trail_1dataset.vt and arithmetic_trail_2datasets.vt. They both illustrate different problems:

  • The single dataset in arithmetic_trail_1dataset.vt includes 3 models: ACCESS1-0, MRI-CGCM3 and inmcm4. The workflow produces three output files: cdo sub ACCESS1-0 ACCESS1-0 outfile.nc, cdo sub MRI-CGCM3 MRI-CGCM3 outfile.nc and cdo sub inmcm4 inmcm4 outfile.nc. In other words, there are no outputs of one model versus another.
  • The two-dataset approach in arithmetic_trail_2datasets.vt has just a single model in each dataset and still produces the confused institution and model output. If you add an extra model to either of the input datasets, the workflow fails altogether.

I think the fundamental problem is the handling of the %model% constraint. A process like cdo sub ACCESS1-0 inmcm4 outfile.nc should produce a new model constraint that is ACCESS1-0-inmcm4.

@DamienIrving
Contributor

@captainceramic Thinking about this more, there are essentially two different use cases:

  • inter-model modules: i.e. where you want all the models to be compared against one another (e.g. this would apply to a module that calculates a temporal or spatial correlation between two different files)
  • intra-model modules: i.e. where you want the system to collect up all the different elements from the same model (e.g. this would apply to a module that calculates the wind speed from the u and v wind from the same model)

Perhaps in the wrapper for any particular module the user should have to specify whether it is an intra-model or inter-model module?

(At the moment the test workflows I produced are inter-model modules: I wanted to subtract one model from another to get the difference. We probably need an example of an intra-model module as well.)
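To make the two use cases concrete, here is a rough sketch of how the input files might be grouped in each case. The function names are hypothetical, the variable names ua, va and tas are only illustrative, and treating inter-model as every unordered pair of distinct models is an assumption (a subtraction module might want ordered pairs instead).

```python
from itertools import combinations

models = ["ACCESS1-0", "MRI-CGCM3", "inmcm4"]


def intra_model_jobs(models, variables=("ua", "va")):
    """One job per model, collecting each variable from that same model."""
    return [[(model, var) for var in variables] for model in models]


def inter_model_jobs(models, variable="tas"):
    """One job per unordered pair of distinct models."""
    return [[(a, variable), (b, variable)] for a, b in combinations(models, 2)]


intra = intra_model_jobs(models)
inter = inter_model_jobs(models)

for job in intra:
    print("intra-model:", job)
for job in inter:
    print("inter-model:", job)
```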

@captainceramic
Contributor Author

@DamienIrving I've had a look at these, and can reproduce your problems.

As written, your first workflow is not doing what you want, but it is behaving as I would expect. The ArgumentCreator examines the list of two FileCreators that it receives as input. It finds that both inputs and the output have a Constraint for model and institute, and that there are three valid combinations: ACCESS1-0 with CSIRO-BOM, MRI-CGCM3 with MRI, and inmcm4 with INM.

There is nothing to tell the system that you want to combine institute and model, so the system treats them the same as it would rcp85 and rcp45: it runs one command for each valid combination and doesn't mix them up.
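The matching behaviour described above can be illustrated roughly as follows. This is not the ArgumentCreator code, just a sketch assuming each file is a dict of constraint values plus a hypothetical "path" entry. Because inputs are paired only when their shared constraints agree, each model ends up paired with itself.

```python
def matched_pairs(first_input, second_input, shared=("model", "institute")):
    """Pair files whose shared constraint values are identical."""
    pairs = []
    for a in first_input:
        for b in second_input:
            if all(a[name] == b[name] for name in shared):
                pairs.append((a["path"], b["path"]))
    return pairs


# Hypothetical file listing for the three-model dataset.
dataset = [
    {"model": "ACCESS1-0", "institute": "CSIRO-BOM", "path": "ACCESS1-0.nc"},
    {"model": "MRI-CGCM3", "institute": "MRI", "path": "MRI-CGCM3.nc"},
    {"model": "inmcm4", "institute": "INM", "path": "inmcm4.nc"},
]

# The same dataset is used for both inputs, as in the first workflow.
pairs = matched_pairs(dataset, dataset)
for left, right in pairs:
    print(f"cdo sub {left} {right} outfile.nc")
```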

The second workflow is showing a bug, but it is more subtle. When running the workflow I get output that looks like:

SCRIPT sub ACCESS_INPUT_FILE MRI_INPUT_FILE MRI_OUTPUT

etc...

The system seems to recognize that there are two inputs for each output, but then it doesn't know whether the value of model and institute from the first or second input should be passed to the output.

Earlier versions of the software dealt with this by 'mapping' or 'renaming' constraints from the input DataSets to different names in the output, i.e. the model in the first input is renamed from model to model_1, and so on. This means that you cannot use the same pattern for input and output: the output pattern has to have model_1 and model_2 tags.
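As a sketch of what that renaming implies for the output pattern, assuming patterns use the %name% placeholder style seen elsewhere in this thread. Both helper functions and the example pattern are hypothetical, not taken from the earlier versions of the software.

```python
import re


def number_constraint(inputs, name):
    """Rename `name` from the i-th input to `name_<i>` (1-based)."""
    return {
        f"{name}_{index}": constraints[name]
        for index, constraints in enumerate(inputs, start=1)
    }


def fill_pattern(pattern, values):
    """Replace each %name% placeholder with the matching value."""
    return re.sub(r"%(\w+)%", lambda match: values[match.group(1)], pattern)


inputs = [{"model": "ACCESS1-0"}, {"model": "inmcm4"}]
values = {"variable": "tas", **number_constraint(inputs, "model")}

# The output pattern needs both numbered tags; a plain %model% tag
# would have no value to fill it.
filename = fill_pattern("%variable%_%model_1%_minus_%model_2%.nc", values)
print(filename)
```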

Your other idea is quite promising - combining constraints on the input into a single constraint on the output. I think that is worth following up.

In both of these cases, any VT module wrapper that combines multiple models needs to overwrite the institute constraint on the output with something like ensemble or combination.

@DamienIrving
Copy link
Contributor

@captainceramic I think combining constraints from the input into a single constraint on the output is definitely the way to go. This kind of relates to my other suggestion about designating modules that take more than one input file as intra-model, inter-model or all-model (I've added all-model because I just added a script called cdo_ensemble_statistics.sh which takes a variable number of input datasets).

  • For intra-model the %model% constraint would remain unchanged
  • For inter-model the %model% constraint would combine the constraints on the input to a single output constraint
  • For all-model the %model% constraint could be ensemble, however that gets problematic if you're doing lots of different ensembles (e.g. the best models, worst models, all models). We might just have to do the same as we'd do for inter-model and combine all the model names into a single output constraint, even if that means a constraint that is 30 model names long.
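The three rules above could be sketched as a single function. The name `output_model`, the `kind` strings and the `ensemble_label` parameter are assumptions for illustration, and joining names with a hyphen follows the ACCESS1-0-inmcm4 example from earlier in the thread.

```python
def output_model(kind, input_models, ensemble_label=None):
    """Choose the output %model% value for a module of the given kind."""
    if kind == "intra-model":
        # All inputs come from one model, so the value is unchanged.
        if len(set(input_models)) != 1:
            raise ValueError("intra-model inputs must share one model")
        return input_models[0]
    if kind == "inter-model":
        return "-".join(input_models)
    if kind == "all-model":
        # Use a label such as "ensemble" if given, else join every name.
        if ensemble_label is not None:
            return ensemble_label
        return "-".join(input_models)
    raise ValueError(f"unknown module kind: {kind!r}")


print(output_model("intra-model", ["inmcm4", "inmcm4"]))
print(output_model("inter-model", ["ACCESS1-0", "inmcm4"]))
print(output_model("all-model", ["ACCESS1-0", "MRI-CGCM3", "inmcm4"]))
```

One thing to keep in mind is that model names such as ACCESS1-0 already contain hyphens, so a hyphen-joined value cannot be split back into its parts unambiguously.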

@captainceramic
Contributor Author

As written at the moment, the default is that any ProcessUnit is intra-model.

If the model constraint is not present or has been overwritten by a new value, then we have the all-model case. For this all-model case the required new constraint values can be created by reading in the complete list of Constraint values for the input DataSets and mashing them together (or replacing them with a word like ensemble).

As for the inter-model case, I have some concerns with your idea of designating modules to be a particular type. My main concern is that one of the core concepts in the plugin is that input and output files are grouped according to their Constraints. I have also had working implementations of a mapping approach to this problem in the past, so I have a bit of an idea of where to start.

At the moment, my preference would be to add a new keyword argument to the ProcessUnit constructor. In the case of the models, we could do something like:

ProcessUnit([INPUTDS1, INPUTDS2], OUTPUT_PATTERN, SHELL_COMMAND,
                    merge_output=['model', 'institute'])

This means that the output value of model would be constructed from a combination of the values of 'model' from the input datasets, and likewise for institute.
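A standalone sketch of what merge_output might do to the constraint values, under the assumption that each input contributes exactly one value per constraint. The function `merge_constraints` is hypothetical and is not the ProcessUnit implementation; the hyphen separator is also an assumption.

```python
def merge_constraints(input_constraints, merge_output=()):
    """Combine per-input constraint dicts into one output dict.

    Names listed in merge_output are joined across inputs; every
    other name must have the same value in all inputs that carry it.
    """
    # Ordered union of constraint names across all inputs.
    names = list(dict.fromkeys(
        name for constraints in input_constraints for name in constraints
    ))
    output = {}
    for name in names:
        values = [c[name] for c in input_constraints if name in c]
        if name in merge_output:
            output[name] = "-".join(values)
        elif len(set(values)) == 1:
            output[name] = values[0]
        else:
            raise ValueError(f"{name!r} differs between inputs: {values}")
    return output


inputs = [
    {"model": "ACCESS1-0", "institute": "CSIRO-BOM", "variable": "tas"},
    {"model": "inmcm4", "institute": "INM", "variable": "tas"},
]
merged = merge_constraints(inputs, merge_output=["model", "institute"])
print(merged)
```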

I will start writing up some tests, but I think I might need some help with this.

@DamienIrving
Contributor

@captainceramic I like the look of that solution with the new keyword argument to the ProcessUnit!

If I can be of any assistance with this just say the word. I'd be happy to drop by your office again if need be (tomorrow or Wed or Fri next week would work), or we could chat over the phone or gitter?

@captainceramic
Contributor Author

I have pushed some new code to the devel branch to start fixing this problem - at the moment, the merge_output keyword works for situations where the Constraint that you want to merge only has one value per DataSet. This works for the case in which you are comparing one time period with another.

I have also committed a test for the model, institute case mentioned here, but it isn't working properly yet.

captainceramic self-assigned this May 28, 2015
@DamienIrving
Contributor

vt_fldcor.py and vt_timcor.py now have a merge_constraint input port so the user can specify which Constraints need to be merged.

captainceramic removed their assignment Jun 19, 2015