20240710 ZN proposal #39
Conversation
FYI, the output of this API scrape is in https://github.com/PCMDI/input4MIPs_CVs/blob/main/src/240701_2137_comp.json.7z - and I can easily update that as required
@znichollscr I pulled your branch into this repo so we can see it side by side - https://github.com/PCMDI/input4MIPs_CVs/tree/20240710-zn-proposal
As an FYI, you can compare across forks too: main...znichollscr:input4MIPs_CVs:20240710-zn-proposal
Feel free to update your scripts too. I wanted to see what would happen, so I've also updated the HTML so it now includes all the entries you had previously plus the ones I generated based on this proposal (i.e. what I've processed for CMIP6Plus).
As I was keen to eyeball the
My scrape intentionally targeted the source_id entries only, so it toggles through all files but returns only a single file entry per source_id - and some of these are not representative of the "whole" dataset (~100s of files). As an FYI, https://github.com/PCMDI/input4MIPs_CVs/blob/main/src/getInput4MIPsESGF.ipynb is doing that work.
Ah ok, got it. I'll see if I can work on the branch in this repo then (I think I have permission...). (As an FYI, you can also fetch locally from someone else's fork.)
Ok. It seems to have retrieved multiple files for some source IDs (e.g. there are lots of different variables for the GHG concs). My compilation does things on a variable basis, but we can easily create something at the source ID level from that if we want. (I did it this way because you get both: if you only target the source ID level, you can never get back down to the variable level unambiguously.)
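The roll-up described above - going from a variable-level compilation to a source-ID-level view - can be sketched as below. This is a hypothetical illustration: the field names (`"source_id"`, `"variable_id"`) and the example values follow input4MIPs naming conventions but are not the actual schema of the compiled JSON.

```python
# Hypothetical sketch: aggregate a variable-level compilation up to the
# source_id level. Field names are assumptions, not the real JSON schema.
from collections import defaultdict


def aggregate_to_source_id(variable_entries):
    """Group variable-level entries by source_id, collecting their variables."""
    by_source_id = defaultdict(set)
    for entry in variable_entries:
        by_source_id[entry["source_id"]].add(entry["variable_id"])
    # Sort variables so the output is reproducible
    return {sid: sorted(variables) for sid, variables in by_source_id.items()}


# Illustrative entries only - not real records from the scrape
entries = [
    {"source_id": "CR-CMIP-0-2-0", "variable_id": "mole_fraction_of_carbon_dioxide_in_air"},
    {"source_id": "CR-CMIP-0-2-0", "variable_id": "mole_fraction_of_methane_in_air"},
    {"source_id": "SOLARIS-HEPPA-4-1", "variable_id": "solar_irradiance"},
]
print(aggregate_to_source_id(entries))
```

Going in the other direction (source ID back down to variables) is not possible from the aggregated form alone, which is the point being made above.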
Ok, that should be fixed - you are now a collaborator/admin, well, at least an invited one!
Yep, you're right, it's pulled 5897 entries (dictionary keys). The ESGF input4MIPs project (here) lists 10434 entries. I use the "dataset search" option, which returns a single entry for each unique source_id-variable combo, so, for example, if a single variable is split across multiple files, I get a single "dataset" entry, which might contain multiple files - see the "get SOLR source_id entries" section in the notebook if you're interested.
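The "dataset search" vs file-level distinction above comes down to the `type` parameter in the ESGF search API. A minimal sketch of building such a query is below; the endpoint and parameter names follow the public ESGF search API, but the specific node URL and the `limit` value are illustrative assumptions, not what the notebook necessarily uses.

```python
# Sketch: build an ESGF search URL that returns one record per dataset
# (source_id-variable combo) rather than one per file. Endpoint and limit
# are illustrative assumptions.
from urllib.parse import urlencode


def build_search_url(record_type, project="input4MIPs", limit=10000):
    base = "https://esgf-node.llnl.gov/esg-search/search"
    params = {
        "project": project,
        "type": record_type,  # "Dataset" -> one entry per source_id-variable combo
        "format": "application/solr+json",
        "limit": limit,
    }
    return f"{base}?{urlencode(params)}"


dataset_url = build_search_url("Dataset")  # dataset-level records (5897 in this discussion)
file_url = build_search_url("File")        # file-level records (10434 in this discussion)
print(dataset_url)
```

With `type=Dataset`, a variable split across multiple files still comes back as a single record, which is why the dataset count is lower than the file count.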
Nice, thanks |
Ok, so there are 10434 files, but only 5897 datasets (where a dataset is defined by its source_id-variable combination)?
That is pretty close. Some differences creep in when datasets/files are deprecated (i.e.
@znichollscr I'd forgotten to mention that the info in
Hi @durack1, sorry, I'm not following - what is
Figured it out now - you mean src/240701_2137_comp.json.7z. All good. I'm not sure exactly which way around you want this to go (I'm guessing the latest source_id.json file is the source of truth, but I also just saw that you added a new comp.json in main, so I'm not 100% sure), so I'll just make sure I keep a copy of the files and we can figure it out from there.
913aa59 to d7924d3
Alrighty @durack1, my proposal is here.
I've changed a bunch of stuff, but I think it is in general pretty straightforward.
In terms of reviewing this, I would suggest starting with the README.
That explains the proposed structure best.
Then have a look at the newly generated HTML file in docs/input4mips_datasets.html. I think this is a much more helpful database to search for a bunch of reasons; most notably, you can now see which variables are provided by each dataset (which is handy if you don't need every variable provided by a data provider - for example, no one uses all 43 of our greenhouse gases).
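The per-dataset variables column described above could be rendered along the lines of the sketch below. This is a simplified illustration in the spirit of src/jsonToHtml.py, not its real interface: the function name, input structure, and cell layout are all assumptions.

```python
# Hypothetical sketch: render one HTML table row per dataset, with a
# comma-separated variables column. Not the actual src/jsonToHtml.py API.
from html import escape


def dataset_row(source_id, variables):
    """Return an HTML <tr> for one dataset, listing its variables sorted."""
    cells = [
        escape(source_id),
        ", ".join(escape(v) for v in sorted(variables)),
    ]
    return "<tr>" + "".join(f"<td>{cell}</td>" for cell in cells) + "</tr>"


print(dataset_row("CR-CMIP-0-2-0", ["ch4", "co2"]))
```

Exposing the variable list per row is what makes the table searchable by variable as well as by source ID.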
As part of this check, you should also be able to verify that the script in src/jsonToHtml.py has been updated to work with the proposed structure. The changes to the CVs themselves are generally trivial.
The exception is clearly source ID, which I have split into:
You will also notice that I do not have all of the datasets that you previously had in source ID.
Unfortunately, I don't have access to the APIs you are using, so I can't re-create the list that you had.
Having said that, I hope that updating the entries to meet the new format is trivial.
If it's not, I'm happy to help out/see if there are ways to get the same information without hitting the same API.