20240710 ZN proposal #39

Merged: 19 commits into PCMDI:main on Jul 18, 2024

znichollscr
Collaborator

Alrighty @durack1, my proposal is here.

I've changed a bunch of stuff, but I think it is, in general, pretty straightforward.

In terms of reviewing this, I would suggest starting with the README.
That explains the proposed structure best.

Then have a look at the newly generated HTML file in docs/input4mips_datasets.html.
I think this is a much more helpful database to search, for a bunch of reasons. Most notably, you can now see which variables each dataset provides, which is handy if you don't need every variable from a data provider (for example, no one uses all 43 of our greenhouse gases).
As part of this check, you should also be able to verify that the script in src/jsonToHtml.py has been updated to work with the proposed structure.
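
For readers who have not opened src/jsonToHtml.py, the general idea of turning per-dataset JSON entries into an HTML table can be sketched roughly as below. This is a minimal illustration and not the repository's script: the function name `render_table`, the example entries, and the field names (`source_id`, `variable_id`, `grid_label`) are all assumptions made for the example.

```python
import html
import json

# Hypothetical dataset entries; the real database uses its own keys.
DATASETS_JSON = """
[
  {"source_id": "EXAMPLE-ghg-1-0", "variable_id": "co2", "grid_label": "gm"},
  {"source_id": "EXAMPLE-ghg-1-0", "variable_id": "ch4", "grid_label": "gm"}
]
"""

COLUMNS = ["source_id", "variable_id", "grid_label"]


def render_table(entries, columns):
    """Render a list of dicts as a plain HTML table, escaping every cell."""
    header = "".join(f"<th>{html.escape(c)}</th>" for c in columns)
    rows = []
    for entry in entries:
        cells = "".join(
            f"<td>{html.escape(str(entry.get(c, '')))}</td>" for c in columns
        )
        rows.append(f"<tr>{cells}</tr>")
    return f"<table><tr>{header}</tr>{''.join(rows)}</table>"


table_html = render_table(json.loads(DATASETS_JSON), COLUMNS)
print(table_html)
```

Keeping one row per dataset (rather than one per source ID) is what makes the variable column possible in such a table.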

The changes to the CVs themselves are generally trivial.
The exception is clearly source ID, which I have split into:

  • source ID information that we need from the data producer
    • data producers would be responsible for this
  • the database of datasets, i.e. entries for each dataset that are complete (and include information like variable ID, grid, etc.)
    • you/I would be responsible for this, because all of the ESGF keys require access to the ESGF to handle correctly
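
One way to picture this split is as two record types: a small one that the data producer fills in, and a complete per-dataset one that is generated centrally. The sketch below is illustrative only; the class names and every field name are assumptions, not the keys used in the CV files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceIdInfo:
    """Information supplied by the data producer (hypothetical fields)."""
    source_id: str
    contact: str
    further_info_url: str


@dataclass(frozen=True)
class DatasetEntry:
    """Complete per-dataset record maintained centrally (hypothetical fields)."""
    source_id: str
    variable_id: str
    grid_label: str
    esgf_published: bool  # stand-in for keys that require ESGF access


producer_info = SourceIdInfo(
    source_id="EXAMPLE-ghg-1-0",
    contact="producer@example.org",
    further_info_url="https://example.org/ghg",
)

datasets = [
    DatasetEntry(producer_info.source_id, "co2", "gm", True),
    DatasetEntry(producer_info.source_id, "ch4", "gm", True),
]

# Every dataset entry points back to exactly one producer-supplied record.
assert all(d.source_id == producer_info.source_id for d in datasets)
```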

You will also notice that I do not have all of the datasets that you previously had in source ID.
My lack of access to the APIs which you are using means that, unfortunately, I can't re-create the list that you had.
Having said that, I hope that updating the entries to meet the new format is trivial.
If it's not, I'm happy to help out/see if there are ways to get the same information without hitting the same API.

@durack1
Contributor

durack1 commented Jul 11, 2024

> You will also notice that I do not have all of the datasets that you previously had in source ID.
> My lack of access to the APIs which you are using means that I can't re-create the list that you had unfortunately.

FYI, the output of this API scrape is in https://github.com/PCMDI/input4MIPs_CVs/blob/main/src/240701_2137_comp.json.7z - and I can easily update that as required

@durack1
Contributor

durack1 commented Jul 11, 2024

@znichollscr pulled your branch into this repo so we can see side-by-side - https://github.com/PCMDI/input4MIPs_CVs/tree/20240710-zn-proposal

@znicholls

As an FYI, you can compare across forks too: main...znichollscr:input4MIPs_CVs:20240710-zn-proposal

@znicholls

> FYI, the output of this API scrape is in https://github.com/PCMDI/input4MIPs_CVs/blob/main/src/240701_2137_comp.json.7z - and I can easily update that as required

Feel free to update your scripts too. I wanted to see what would happen, so I've also updated the HTML so it now includes all the entries you had previously plus the ones I generated based on this proposal (i.e. what I've processed for CMIP6Plus).

@durack1
Contributor

durack1 commented Jul 11, 2024

> As an FYI, you can compare across forks too main...znichollscr:input4MIPs_CVs:20240710-zn-proposal

As I was keen to eyeball the input4mips_datasets.html file, it's easier if I have a local branch, and I can just keep fetching the latest

@durack1
Contributor

durack1 commented Jul 11, 2024

> FYI, the output of this API scrape is in https://github.com/PCMDI/input4MIPs_CVs/blob/main/src/240701_2137_comp.json.7z - and I can easily update that as required
>
> Feel free to update your scripts too. I wanted to see what would happen, so I've also updated the HTML so it now includes all the entries you had previously plus the ones I generated based on this proposal (i.e. what I've processed for CMIP6Plus).

My scrape intentionally targeted the source_id entries only, so it toggles through all files but only returns a single file entry per source_id, and some of these are not representative of the "whole" dataset (~100s of files). As an FYI, https://github.com/PCMDI/input4MIPs_CVs/blob/main/src/getInput4MIPsESGF.ipynb is doing that work

@znichollscr
Collaborator Author

> As I was keen to eyeball the input4mips_datasets.html file, easier if I have a branch local, and I can just keep fetching the latest

Ah ok got it. I'll see if I can work on the branch in this repo then (I think I have permission...). (As an FYI, you can also fetch locally from someone else's fork.)

@znichollscr
Collaborator Author

> My scrape intentionally targeted the source_id entries only, so toggles through all files, but only returned a single file entry per source_id - and some of these are not representative for the "whole" dataset (~100s of files) - as an FYI https://github.com/PCMDI/input4MIPs_CVs/blob/main/src/getInput4MIPsESGF.ipynb is doing that work

Ok. It seems to have retrieved multiple files for some source IDs (e.g. there are lots of different variables for the GHG concs). My compilation does things on a variable basis, but we can easily create something at the source ID level from that if we want. I did it this way because you get both: if you only target the source ID level, you can never get back down to the variable level unambiguously.
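
The asymmetry described here can be shown in a few lines: variable-level entries roll up to a source ID summary with a simple group-by, whereas a source-ID-only listing carries no information from which the per-variable entries could be rebuilt. The entries, field names, and the `roll_up` helper below are made up for illustration.

```python
from collections import defaultdict

# Hypothetical variable-level entries (one per source_id/variable pair).
variable_level = [
    {"source_id": "EXAMPLE-ghg-1-0", "variable_id": "co2"},
    {"source_id": "EXAMPLE-ghg-1-0", "variable_id": "ch4"},
    {"source_id": "EXAMPLE-solar-1-0", "variable_id": "tsi"},
]


def roll_up(entries):
    """Collapse variable-level entries into {source_id: sorted variable list}."""
    by_source = defaultdict(set)
    for entry in entries:
        by_source[entry["source_id"]].add(entry["variable_id"])
    return {source: sorted(variables) for source, variables in by_source.items()}


source_level = roll_up(variable_level)
print(source_level)
```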

@znichollscr
Collaborator Author

znichollscr commented Jul 12, 2024

Hmmm, it seems there is something funny about my role in this project. I can't push to this repository or make pull requests off branches from within this repository. I get the error below

[image: screenshot of the error]

@durack1
Contributor

durack1 commented Jul 12, 2024

> Hmmm, it seems there is something funny about my role in this project. I can't push to this repository or make pull requests off branches from within this repository. I get the error below

Ok, that should be fixed. You are now a collaborator/admin, well, at least an invited one!

@durack1
Contributor

durack1 commented Jul 12, 2024

> My scrape intentionally targeted the source_id entries only, so toggles through all files, but only returned a single file entry per source_id - and some of these are not representative for the "whole" dataset (~100s of files) - as an FYI https://github.com/PCMDI/input4MIPs_CVs/blob/main/src/getInput4MIPsESGF.ipynb is doing that work
>
> Ok. It seems to have retrieved multiple files for some source IDs (e.g. there are lots of different variables for the GHG concs).

Yep, you're right, it's pulled 5897 entries (dictionary keys). The ESGF input4MIPs project (here) lists 10434 entries. I use the "dataset search" option, which returns a single entry for each unique source_id/variable combination. For example, if a single variable is split across multiple files, I get a single "dataset" entry, which might have multiple files. See the "get SOLR source_id entries" section in the notebook if you're interested

@znichollscr
Collaborator Author

> Ok, that should be fixed you are now a collaborator/admin, well, at least an invited one!

Nice, thanks

@znichollscr
Collaborator Author

> Yep, you're right, it's pulled 5897 entries (dictionary keys). The ESGF input4MIPs project (here) lists 10434 entries. I use the "dataset search" option, which returns a single entry for each unique source_id variable combo, so for e.g. if a single variable is split across multiple files, I get a single entry "dataset", which might have multiple files - see the "get SOLR source_id entries" section in the notebook if you're interested

Ok, so there are 10434 files, but only 5897 datasets (where a dataset is defined by its source_id/variable combination)?

@durack1
Contributor

durack1 commented Jul 12, 2024

> Ok so there's 10434 files, but only 5897 datasets (where a dataset is defined by its source_id - variable combination?

That is pretty close. Some differences creep in when datasets/files are deprecated (i.e. "latest" = false), but for a first pass you have it right.
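
As a rough illustration of why file and dataset counts differ, the sketch below groups file records by (source_id, variable_id) and optionally drops records flagged as not latest. The records, the `latest` key layout, and the `count_datasets` helper are toy assumptions modelled on the description above; the code does not query ESGF and is not the logic in the notebook.

```python
def count_datasets(file_records, latest_only=True):
    """Count unique (source_id, variable_id) pairs among the file records."""
    keys = {
        (rec["source_id"], rec["variable_id"])
        for rec in file_records
        if rec["latest"] or not latest_only
    }
    return len(keys)


# Toy records: one variable split across two files, plus a deprecated dataset.
files = [
    {"source_id": "EXAMPLE-1-0", "variable_id": "co2", "latest": True},
    {"source_id": "EXAMPLE-1-0", "variable_id": "co2", "latest": True},
    {"source_id": "EXAMPLE-1-0", "variable_id": "ch4", "latest": True},
    {"source_id": "EXAMPLE-0-9", "variable_id": "co2", "latest": False},
]

print(len(files), count_datasets(files), count_datasets(files, latest_only=False))
```

In this toy case four files collapse to two current datasets, or three if deprecated records are counted as well.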

Three review threads on README.md (outdated, resolved)

@durack1
Contributor

durack1 commented Jul 18, 2024

@znichollscr I'd forgotten to mention that the info in comp.json has been superseded by a bunch of updated info in the current main input4MIPs_CVs/input4MIPs_source_id.json, so when you are updating content, please make sure we're not losing the new info

@znichollscr
Collaborator Author

> I'd forgotten to mention that the info in comp.json has been deprecated by a bunch of updated info in the current main input4MIPs_CVs/input4MIPs_source_id.json so when you are updating content, please make sure we're not losing the new info

Hi @durack1, sorry I'm not following, what is comp.json?

@znicholls

> Hi @durack1, sorry I'm not following, what is comp.json?

Figured it out now, you mean src/240701_2137_comp.json.7z

All good. I'm not sure exactly which way around you want this to go (I'm guessing the latest source_id.json file is the source of truth, but I also just saw that you added a new comp.json in main so I'm not 100% sure) so I'll just make sure I keep a copy of the files and we can figure it out from there.

@znichollscr
Collaborator Author

Alrighty @durack1 I've resolved all the discussions etc. I wasn't sure exactly what you meant by keeping the latest information, so I've made #43 to ensure we capture that conversation.

I will merge now. Thanks for your help! 🚀

@znichollscr znichollscr merged commit cc7a671 into PCMDI:main Jul 18, 2024
@znichollscr znichollscr deleted the 20240710-zn-proposal branch July 18, 2024 10:13
3 participants