Extend Cube Metadata NetCDF zlib compression #6552

Merged: 5 commits merged into SciTools:main on Jul 14, 2025

Conversation

@bjlittle (Member) commented Jul 8, 2025

🚀 Pull Request

Description

Extend NetCDF zlib compression serialization to also include auxiliary coordinates and ancillary variables that have the same shape as the associated parent cube data payload.

Does not apply to metadata with string dtypes.

The complevel, fletcher32, shuffle and zlib kwargs applied to the cube data payload will also be applied to the above applicable metadata.
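
For illustration, a minimal sketch of a save call that exercises this behaviour (the cube construction and names here are invented; the keywords are the existing Iris netCDF saver kwargs):

```python
import numpy as np

import iris
from iris.coords import AncillaryVariable
from iris.cube import Cube

# A hypothetical cube with a same-shape, non-string ancillary variable.
data = np.random.random((180, 360)).astype(np.float32)
cube = Cube(data, long_name="air_temperature", units="K")
flags = AncillaryVariable(
    np.zeros(data.shape, dtype=np.int8), long_name="status_flag"
)
cube.add_ancillary_variable(flags, data_dims=(0, 1))

# These kwargs compress the cube data payload; with this change they are
# now also applied to same-shape, non-string auxiliary coordinates and
# ancillary variables, such as "status_flag" above.
iris.save(
    cube, "compressed.nc", zlib=True, complevel=4, shuffle=True, fletcher32=True
)
```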

Closes #6539

Ping @pasc85


Consult Iris pull request check list


Add any of the below labels to trigger actions on this PR:

  • benchmark_this Request that this pull request be benchmarked to check if it introduces performance shifts

@bjlittle bjlittle marked this pull request as draft July 8, 2025 08:25
codecov bot commented Jul 8, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 89.90%. Comparing base (e15c079) to head (86499ce).
Report is 1 commit behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #6552   +/-   ##
=======================================
  Coverage   89.89%   89.90%           
=======================================
  Files          90       90           
  Lines       24146    24153    +7     
  Branches     4493     4496    +3     
=======================================
+ Hits        21707    21714    +7     
  Misses       1679     1679           
  Partials      760      760           


@bjlittle bjlittle marked this pull request as ready for review July 8, 2025 10:33
@HGWright HGWright requested a review from pp-mo July 9, 2025 09:07
@pasc85 commented Jul 9, 2025

I tested the new feature, and that went really well! The notebook I used works as follows:

  • load a diagnostic on pressure levels that has status flag information
  • three samples: one from a big model (182M entries) and two from a smaller model (7.8M)
  • cf. the __original files below
  • remove the status flag data and add it again, either as an ancillary variable or
    as an auxiliary coordinate, and either as an int8 or as a float32
  • there's also an option where the status flag information isn't added back, for comparison
  • so, 5 modes in total: data_only, anc_float, anc_int, aux_float, aux_int (sketched below)
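
A rough sketch of what that workflow might look like (the file names, the `status_flag` identifier, and the compression settings are my assumptions, not the actual notebook):

```python
import numpy as np

import iris
from iris.coords import AncillaryVariable, AuxCoord

cube = iris.load_cube("sample__original.nc")  # hypothetical input file

# Detach the status flag payload so it can be re-added in different forms.
flags = cube.ancillary_variable("status_flag")  # assumed variable name
flag_data = flags.data
cube.remove_ancillary_variable(flags)
dims = tuple(range(cube.ndim))

modes = [
    ("anc_int", np.int8), ("anc_float", np.float32),
    ("aux_int", np.int8), ("aux_float", np.float32),
]
for mode, dtype in modes:
    out = cube.copy()
    payload = flag_data.astype(dtype)
    if mode.startswith("anc"):
        out.add_ancillary_variable(
            AncillaryVariable(payload, long_name="status_flag"), data_dims=dims
        )
    else:
        out.add_aux_coord(AuxCoord(payload, long_name="status_flag"), data_dims=dims)
    iris.save(out, f"{mode}_{out.data.size}.nc", zlib=True, complevel=4)

# The "data_only" mode: save without re-attaching the flags at all.
iris.save(cube, f"data_only_{cube.data.size}.nc", zlib=True, complevel=4)
```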

The output below shows the size of the produced files; there are two sections:

  • "before" the changes: has been run with my current stage development environment
  • "after" the changes: run with minimal environment with @bjlittle's branch pip-installed in it

The numbers in the file names below are the number of entries in the cube, which is useful for
checking the change in file size when the data type is changed (only in the "before" files).

[screenshot: listing of the produced file names and sizes for the "before" and "after" runs]

The results are very good:

  • the "after" cubes including status_flag are not much larger than the data_only cubes
  • ... and they are much smaller than the "before" versions
  • int8 is preferable to float32
  • no difference between using an ancillary variable or an auxiliary coordinate
  • neither the cube data nor the status flag data have changed

So, this is great -- thank you very much!

@bjlittle (Member, Author) commented Jul 9, 2025

@pasc85 Brilliant!

Thanks for taking the time and effort to test this and also share the comprehensive results 🥇

Looks like this change is a good win for the community, and your testing gives us confidence that it's fit for purpose 👍

@pp-mo (Member) commented Jul 11, 2025

> I tested the new feature, and that went really well! The notebook I used works as follows:

This is really useful.
Just for completeness though, can you explain what compression settings you applied?
If the script is short, it would be great to just share it.

Apart from general helpfulness, my reason was ...
I was wondering whether to raise the possible criticism that the same settings might not apply equally well to ancils/aux-coords of different dtypes as to the main data.
And I noted your point that the float32 status flag did not compress as well as the byte one.


The thing is, this PR is a great quick win, but I have a bit of a hidden agenda!
I think it would ultimately be much neater and more appropriate to replace our keyword controls for compression, packing and fill-value with a context-manager approach, as we have done with loading controls -- notably the load chunking control, which allows us to apply different chunking to different file variables.
That type of separate management makes the primary method API tidier, and it is easier to apply more detailed context.
Whereas, as we have it, the settings apply per cube being saved, so they must be used for all components of the cube.
(BTW, another great benefit is that it avoids passing settings down through all intermediate routines, as seen in the changes here!)
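
Purely to illustrate the shape of that idea, a hypothetical sketch modelled on the existing CHUNK_CONTROL loading context manager (none of these saving names exist in Iris today):

```python
import iris

# Hypothetical API, named here only for illustration: per-variable settings
# would travel via context state instead of being threaded through every
# intermediate saver routine as keyword arguments.
from iris.fileformats.netcdf.saver import COMPRESSION_CONTROL  # does not exist

cube = iris.load_cube("input.nc")  # assumed input file

with COMPRESSION_CONTROL.set("air_temperature", zlib=True, complevel=4), \
     COMPRESSION_CONTROL.set("status_flag", zlib=True, complevel=9, shuffle=True):
    # Each file variable gets its own compression settings, rather than one
    # set of kwargs applied to the cube and all of its components alike.
    iris.save(cube, "output.nc")
```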

@bjlittle what do you think of this?
I don't think it can fairly be raised as a criticism of this PR, as that would be far too much of a breaking change to introduce at this point. But it does show up the limitations. Perhaps we can bank the idea somewhere else.

@bjlittle (Member, Author) commented Jul 11, 2025

@pp-mo Great idea, but there's no major benefit here given the effort required for this particular use case. I'd be keen to bank this pull-request "as is" without straying off the path too far.

> Apart from general helpfulness, my reason was ... I was wondering whether to raise the possible criticism that the same settings might not apply equally well to ancils/aux-coords of different dtypes as to the main data. And I noted your point that the float32 status flag did not compress as well as the byte one.

As it stands we're getting a roughly 95%-or-better improvement in compression footprint ... further optimisations would be marginal.

Also, this type of compression is only (easily) applicable to NetCDF and not to other data formats; we're getting a lot for free from the netCDF4 package.

As a compromise, why not raise a separate dedicated issue capturing your thoughts, @pp-mo, and we can start a conversation there? Perhaps we can agree on a way forward as a separate concern and target Iris 3.14?

@pp-mo (Member) requested changes Jul 11, 2025

The code looks really neat, but I'm not sure about the testing yet

@bjlittle (Member, Author)

@pp-mo Tweaked the non-compression tests to do as I intended, nice spot 👀

Also added some comments to clarify 👍

@pp-mo (Member) commented Jul 11, 2025

> @pp-mo Great idea, but there's no major benefit here given the effort required for this particular use case. I'd be keen to bank this pull-request "as is" without straying off the path too far.

Absolutely

> Also, this type of compression is only (easily) applicable to NetCDF and not to other data formats; we're getting a lot for free from the netCDF4 package.

Yes, and in fact the criticism applies more directly to the 'packing' arguments, which aren't covered here.

> As a compromise, why not raise a separate dedicated issue capturing your thoughts, @pp-mo, and we can start a conversation there? Perhaps we can agree on a way forward as a separate concern and target Iris 3.14?

If anything, given the breaking change nature, perhaps more of an Iris 4 thing?

@bjlittle (Member, Author) commented Jul 11, 2025

> If anything, given the breaking change nature, perhaps more of an Iris 4 thing?

You're proposing not to merge this and to leave @pasc85 blocked for at least another 5 months waiting for Iris 3.14? Really? 😬 Or have I misunderstood?

I wouldn't see what you're proposing beyond this PR as a breaking change, certainly not in the API. Additionally, we could still support the compression behaviour proposed here as a minimal default moving forwards.

Also, I'm talking about netCDF compression here and not straying into the minefield of packing, netCDF or otherwise. Anyway, criticism accepted, and overdue, but out of scope for this PR 😜

@pp-mo (Member) commented Jul 14, 2025

> I'm talking about netCDF compression here and not straying into the minefield of packing, netCDF or otherwise

Ah OK, that wasn't really a clear distinction to me, but point taken.


> I wouldn't see what you're proposing beyond this PR as a breaking change, certainly not in the API. Additionally, we could still support the compression behaviour proposed here as a minimal default moving forwards.

Well, I was suggesting we might replace the save keywords with a context manager, which I think is pretty breaking. I'm not sure we could combine the two effectively, since part of the point would be the ability to compress cube components differently from the cube itself. So, I thought of it as a bit of a future wishlist item really.

@pp-mo pp-mo dismissed their stale review July 14, 2025 11:19

All good now

@pp-mo pp-mo enabled auto-merge (squash) July 14, 2025 11:19
@pp-mo (Member) left a review

All good now.

@pp-mo pp-mo merged commit f8030d0 into SciTools:main Jul 14, 2025
21 checks passed
Successfully merging this pull request may close these issues.

Extend Cube.data compression to coordinates (#6539)