INV: Handling multiple MIP projects in activity_id #41

Open
Zeitsperre opened this issue May 12, 2022 · 10 comments

@Zeitsperre
Collaborator

Zeitsperre commented May 12, 2022

The decoder currently treats the entire string of attrs.activity_id for CMIP6-endorsed MIPs as the activity. However, I ran into this today in our database:

:activity_id = "ScenarioMIP AerChemMIP" ;

This field is used in creating the filetree, and while it is technically valid to have spaces in a path, the idea of creating POSIX paths with escaped spaces runs counter to all known ethics and reason.

Proposal - hyphening:
ScenarioMIP AerChemMIP -> ScenarioMIP-AerChemMIP
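
For illustration, a minimal sketch of the hyphenation proposal. `hyphenate_activity_id` is a hypothetical helper name, not an existing function in the decoder, and it assumes the MIP names in the attribute are separated by whitespace only:

```python
def hyphenate_activity_id(activity_id: str) -> str:
    """Join whitespace-separated MIP names with hyphens (hypothetical helper)."""
    # str.split() with no argument also collapses repeated or stray whitespace.
    return "-".join(activity_id.split())


print(hyphenate_activity_id("ScenarioMIP AerChemMIP"))  # ScenarioMIP-AerChemMIP
```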

Thoughts?

@Zeitsperre added the "invalid" label (This doesn't seem right) on May 12, 2022
@aulemahal
Collaborator

aulemahal commented May 12, 2022

the idea of creating POSIX paths with escaped spaces runs counter to all known ethics and reason

True story. The great Aristotle once wrote "Δεν μου αρέσουν τα κενά στα ονόματα αρχείων" ("I don't like spaces in file names").

If there is a good reason to keep "AerChemMIP", I'd agree with hyphens. If we are only downloading "ScenarioMIP" (and their historical counterparts), I would say we drop the other names into /dev/null, never to be seen again.

@RondeauG
Collaborator

Is AerChemMIP something that we plan on ever using? If not, is it really an issue to remove it?

If we want to keep both, I'd say that for both the folder structure and the catalog, using a hyphen would quickly turn into a nightmare due to all the possible combinations.

@juliettelavoie
Collaborator

As we specifically downloaded that data for ScenarioMIP, I would only keep that. In the catalog, people might search for all ScenarioMIP data and would not find whatever is in ScenarioMIP-AerChemMIP.

Also, I think there are other experiments that are part of more than 2 MIPs. It would get complicated quickly...
I suggest that we only ever keep one. We can have an ordered list of our preferences.
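
A rough sketch of how such an ordered preference could work. The names `PREFERRED_ACTIVITIES` and `pick_activity` are hypothetical, and the order shown is only an example; falling back to the first listed MIP when nothing matches is also an assumption:

```python
from typing import Sequence

# Hypothetical preference order, most preferred first.
PREFERRED_ACTIVITIES = ("ScenarioMIP", "AerChemMIP")


def pick_activity(
    activity_id: str, preferred: Sequence[str] = PREFERRED_ACTIVITIES
) -> str:
    """Reduce a multi-MIP activity_id to the single most preferred MIP."""
    candidates = activity_id.split()
    if not candidates:
        raise ValueError("activity_id is empty")
    for name in preferred:
        if name in candidates:
            return name
    # Assumption: if no preferred MIP is present, keep the first one listed.
    return candidates[0]


print(pick_activity("ScenarioMIP AerChemMIP"))  # ScenarioMIP
```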

@Zeitsperre
Collaborator Author

I would be OK with the option of dropping (setting a preferred order would work well) if that behaviour could be configured for multiple use cases. Having some kind of option in the restructure_datasets function would be best, but how should this be specified?

When decoding, the validation step demands a string that is a member of the CMIP6 controlled vocabulary. I can change this to allow for a list of allowed values, then check that they are all members of the controlled vocabulary. This would better handle cases of files being shared between 3 or more MIPs (do those exist?).
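
As a sketch of what list-based validation might look like: `KNOWN_ACTIVITIES` below is only a small illustrative subset standing in for the real CMIP6 controlled vocabulary, and `decode_activity_ids` is a hypothetical name rather than the decoder's actual validation step:

```python
from typing import List

# Illustrative subset only; the real check would use the full CMIP6 CV.
KNOWN_ACTIVITIES = frozenset(
    {"AerChemMIP", "CMIP", "LS3MIP", "LUMIP", "ScenarioMIP"}
)


def decode_activity_ids(activity_id: str) -> List[str]:
    """Split activity_id into MIP names and validate each one."""
    names = activity_id.split()
    if not names:
        raise ValueError("activity_id is empty")
    unknown = [name for name in names if name not in KNOWN_ACTIVITIES]
    if unknown:
        raise ValueError(f"Not in the controlled vocabulary: {unknown}")
    return names


print(decode_activity_ids("ScenarioMIP AerChemMIP"))
```

Because the result is a list, the same check would cover a file shared between three or more MIPs without any further change.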

Another option would be to create two entries for the file, one according to each MIP, and hard-link those files so that they can be found in either filetree (ScenarioMip/this/that/file.nc and AerChemMip/this/that/file.nc). This solves the catalogue issue by creating two entries while not increasing the disk space used. This approach is a bit overkill, but would be surprisingly easy to implement.
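
To give an idea of the hard-link approach, here is a minimal sketch using `os.link` from the standard library. The function name `link_into_filetrees` and its parameters are hypothetical, and it assumes the source file and all target trees live on the same filesystem (hard links cannot cross filesystems):

```python
import os
from pathlib import Path
from typing import List, Sequence


def link_into_filetrees(
    source: Path, root: Path, activities: Sequence[str], relative: Path
) -> List[Path]:
    """Hard-link `source` to root/<activity>/<relative> for each activity."""
    targets = []
    for activity in activities:
        target = root / activity / relative
        target.parent.mkdir(parents=True, exist_ok=True)
        # Skip targets that already exist rather than overwriting them.
        if not target.exists():
            os.link(source, target)
        targets.append(target)
    return targets
```

Each target is another name for the same data on disk, so the file is stored only once no matter how many filetrees point to it.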

I feel like we all have opinions on this.

@aulemahal
Collaborator

I like the magical symlink solution! If it is easy to implement!

@juliettelavoie
Collaborator

I thought there might be experiments with more than 2 MIPs, because I had seen a well-populated column called "synergies with other MIPs" in the description of LS3MIP experiments (Van Den Hurk et al, 2016). But, looking at the list here (https://wcrp-cmip.github.io/CMIP6_CVs/docs/CMIP6_experiment_id.html), I only see duos.

@juliettelavoie
Collaborator

If it is easy to have it in both ScenarioMIP and AerChemMIP without taking too much space, that is great!!

@RondeauG
Collaborator

My understanding is that the 'real' file would only be at one location, but both filetrees would see it. So it takes the same space as only having it once in ScenarioMIP.

@Zeitsperre
Collaborator Author

There are two major issues with hard links. First, certain operations (like copying hard-linked files to another host) will break the links unless you specify that hard links be preserved, i.e. you will end up with two separate files. Second, if you modify one file in place, the other is modified as well. It's something that needs to be taken into consideration.
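
Both caveats can be reproduced in a temporary directory. This is only a small demonstration with made-up file names, using `os.link`, `os.path.samefile` and `shutil.copy2` from the standard library:

```python
import os
import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "scenariomip_file.nc"
    linked = Path(tmp) / "aerchemmip_file.nc"
    copied = Path(tmp) / "copied_file.nc"

    original.write_text("v1")
    os.link(original, linked)

    # Caveat 2: an in-place edit through one name shows up under the other.
    with open(linked, "a") as handle:
        handle.write("+edit")
    shared_content = original.read_text()

    # Both names point at the same underlying file.
    same_file = os.path.samefile(original, linked)
    link_count = original.stat().st_nlink

    # Caveat 1: a plain copy creates an independent file, not a third link.
    shutil.copy2(linked, copied)
    copy_is_linked = os.path.samefile(original, copied)

print(shared_content, same_file, link_count, copy_is_linked)
```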

I can open a PR to address this in the coming weeks.

@juliettelavoie
Collaborator

Just a reminder that we still have ScenarioMIP-AerChemMIP in the path. I think the conclusion here is to have separate ScenarioMIP and AerChemMIP directories, with everything currently in ScenarioMIP-AerChemMIP hard-linked into both.

Not crucial as my catalog sees everything as ScenarioMIP. But this is a reminder that for the final form of /datasets, this needs to be addressed.
