20240710 ZN proposal #39
Conversation
FYI, the output of this API scrape is in https://github.com/PCMDI/input4MIPs_CVs/blob/main/src/240701_2137_comp.json.7z - and I can easily update that as required
@znichollscr I pulled your branch into this repo so we can see it side by side - https://github.com/PCMDI/input4MIPs_CVs/tree/20240710-zn-proposal
As an FYI, you can compare across forks too: main...znichollscr:input4MIPs_CVs:20240710-zn-proposal
Feel free to update your scripts too. I wanted to see what would happen, so I've also updated the HTML so it now includes all the entries you had previously plus the ones I generated based on this proposal (i.e. what I've processed for CMIP6Plus).
As I was keen to eyeball the
My scrape intentionally targeted the source_id entries only, so it toggles through all files but returns only a single file entry per source_id - and some of these are not representative of the "whole" dataset (~100s of files). As an FYI, https://github.com/PCMDI/input4MIPs_CVs/blob/main/src/getInput4MIPsESGF.ipynb is doing that work.
Ah ok, got it. I'll see if I can work on the branch in this repo then (I think I have permission...). (As an FYI, you can also fetch locally from someone else's fork.)
Ok. It seems to have retrieved multiple files for some source IDs (e.g. there are lots of different variables for the GHG concs). My compilation does things on a variable basis, but we can easily create something at the source ID level from that if we want. (I did it this way because you get both: if you only target the source ID level, you can never get back down to the variable level unambiguously.)
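The roll-up described above - going from a variable-level compilation to a source-ID-level view - can be sketched as below. This is a hypothetical illustration: the field names (`"source_id"`, `"variable_id"`) and the example values follow input4MIPs naming conventions but are not the actual schema of the compiled JSON.

```python
# Hypothetical sketch: aggregate a variable-level compilation up to the
# source_id level. Field names are assumptions, not the real JSON schema.
from collections import defaultdict


def aggregate_to_source_id(variable_entries):
    """Group variable-level entries by source_id, collecting their variables."""
    by_source_id = defaultdict(set)
    for entry in variable_entries:
        by_source_id[entry["source_id"]].add(entry["variable_id"])
    # Sort variables so the output is reproducible
    return {sid: sorted(variables) for sid, variables in by_source_id.items()}


# Illustrative entries only - not real records from the scrape
entries = [
    {"source_id": "CR-CMIP-0-2-0", "variable_id": "mole_fraction_of_carbon_dioxide_in_air"},
    {"source_id": "CR-CMIP-0-2-0", "variable_id": "mole_fraction_of_methane_in_air"},
    {"source_id": "SOLARIS-HEPPA-4-1", "variable_id": "solar_irradiance"},
]
print(aggregate_to_source_id(entries))
```

Going in the other direction (source ID back down to variables) is not possible from the aggregated form alone, which is the point being made above.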
Ok, that should be fixed - you are now a collaborator/admin, well, at least an invited one!
Yep, you're right, it's pulled 5897 entries (dictionary keys). The ESGF input4MIPs project (here) lists 10434 entries. I use the "dataset search" option, which returns a single entry for each unique source_id-variable combo, so, for example, if a single variable is split across multiple files, I get a single "dataset" entry, which might contain multiple files - see the "get SOLR source_id entries" section in the notebook if you're interested.
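The "dataset search" vs file-level distinction above comes down to the `type` parameter in the ESGF search API. A minimal sketch of building such a query is below; the endpoint and parameter names follow the public ESGF search API, but the specific node URL and the `limit` value are illustrative assumptions, not what the notebook necessarily uses.

```python
# Sketch: build an ESGF search URL that returns one record per dataset
# (source_id-variable combo) rather than one per file. Endpoint and limit
# are illustrative assumptions.
from urllib.parse import urlencode


def build_search_url(record_type, project="input4MIPs", limit=10000):
    base = "https://esgf-node.llnl.gov/esg-search/search"
    params = {
        "project": project,
        "type": record_type,  # "Dataset" -> one entry per source_id-variable combo
        "format": "application/solr+json",
        "limit": limit,
    }
    return f"{base}?{urlencode(params)}"


dataset_url = build_search_url("Dataset")  # dataset-level records (5897 in this discussion)
file_url = build_search_url("File")        # file-level records (10434 in this discussion)
print(dataset_url)
```

With `type=Dataset`, a variable split across multiple files still comes back as a single record, which is why the dataset count is lower than the file count.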
Nice, thanks |
Ok, so there are 10434 files, but only 5897 datasets (where a dataset is defined by its source_id-variable combination)?
That is pretty close. Some differences creep in when datasets/files are deprecated (i.e.
@znichollscr I'd forgotten to mention that the info in
Hi @durack1, sorry, I'm not following - what is
Figured it out now - you mean src/240701_2137_comp.json.7z. All good. I'm not sure exactly which way around you want this to go (I'm guessing the latest source_id.json file is the source of truth, but I also just saw that you added a new comp.json in main, so I'm not 100% sure), so I'll just make sure I keep a copy of the files and we can figure it out from there.
913aa59 to d7924d3
Alrighty @durack1, my proposal is here.
I've changed a bunch of stuff, but I think it is in general pretty straightforward.
In terms of reviewing this, I would suggest starting with the README.
That explains the proposed structure best.
Then have a look at the newly generated HTML file in docs/input4mips_datasets.html. I think this is a much more helpful database to search for a bunch of reasons; most notably, you can now see which variables are provided by each dataset (which is handy if you don't need every variable provided by a data provider - for example, no one uses all 43 of our greenhouse gases).
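The per-dataset variables column described above could be rendered along the lines of the sketch below. This is a simplified illustration in the spirit of src/jsonToHtml.py, not its real interface: the function name, input structure, and cell layout are all assumptions.

```python
# Hypothetical sketch: render one HTML table row per dataset, with a
# comma-separated variables column. Not the actual src/jsonToHtml.py API.
from html import escape


def dataset_row(source_id, variables):
    """Return an HTML <tr> for one dataset, listing its variables sorted."""
    cells = [
        escape(source_id),
        ", ".join(escape(v) for v in sorted(variables)),
    ]
    return "<tr>" + "".join(f"<td>{cell}</td>" for cell in cells) + "</tr>"


print(dataset_row("CR-CMIP-0-2-0", ["ch4", "co2"]))
```

Exposing the variable list per row is what makes the table searchable by variable as well as by source ID.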
As part of this check, you should also be able to verify that the script in src/jsonToHtml.py has been updated to work with the proposed structure. The changes to the CVs themselves are generally trivial.
The exception is clearly source ID, which I have split into:
You will also notice that I do not have all of the datasets that you previously had in source ID.
Unfortunately, I don't have access to the APIs you are using, so I can't re-create the list that you had.
Having said that, I hope that updating the entries to meet the new format is trivial.
If it's not, I'm happy to help out/see if there are ways to get the same information without hitting the same API.