
Conversation

@valeriupredoi
Collaborator

So this is a somewhat changed setup of the test that @bnlawrence wrote, but this PR achieves the following:

  • allows for functionality to load a netCDF4 file into a Zarr metadata store object that can then be passed around inside Zarr (note: no data payload is loaded and given to Zarr); this is done with kerchunk, and is currently impressively ugly;
  • sets up a workflow to obtain the slice_coords: (offset, size) of each needed slice directly from Zarr, while fooling Zarr into thinking it doesn't need data: all it needs is data structure (chunks, data shape, data type, etc.) and love;
  • added a UnitTest for that - @bnlawrence will be sad I changed his 😁
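The (offset, size) workflow in the second bullet can be sketched in pure Python, assuming the simplest case of a 1-D contiguous array; the function name `slice_coords` and the hard-coded itemsize below are illustrative assumptions, not the PR's actual code:

```python
# Minimal sketch: map an index slice of a 1-D contiguous array to a
# (byte offset, byte size) pair, which is the shape of information
# the PR extracts from Zarr's internals. The real code derives the
# dtype and chunking from the kerchunk-generated Zarr metadata.

def slice_coords(sl: slice, itemsize: int) -> tuple[int, int]:
    """Return (byte offset, byte size) for a contiguous slice."""
    start = sl.start or 0
    stop = sl.stop
    return (start * itemsize, (stop - start) * itemsize)

# e.g. elements 10..20 of a float64 (8-byte) array
offset, size = slice_coords(slice(10, 20), itemsize=8)  # (80, 80)
```

For multi-dimensional or chunked arrays the mapping is more involved, which is exactly why the PR leans on Zarr's own machinery rather than arithmetic like this.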

@valeriupredoi valeriupredoi added the enhancement New feature or request label Sep 1, 2022
Collaborator

@bnlawrence bnlawrence left a comment


I would prefer that, rather than changing the test, you add a new test for the new functionality. (At some point we actually want to run these tests.)

Also, it would be good eventually if the "business code" was in methods of Active rather than in the tests themselves. The business code following your changes here could skip the actual operation; the first desired result would be to return the numbers (version=1 in Active) ... although the code here is sort of version -1 (return the addresses and shapes which we need to use to load the numbers).

@bnlawrence
Collaborator

(We also need to ensure that somewhere one of our methods inspects the variable to find out what type it is, so we know how to interpret the bytes coming back from the offsets, range, size -> shape steps.)
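A minimal sketch of that byte-interpretation step, using only the standard library; the dtype-to-format map and the function name are assumptions for illustration, not the project's API:

```python
import struct

# Once the variable's dtype is known from the metadata, raw bytes
# returned by an (offset, size) read can be unpacked into numbers.
# This map covers a few common little-endian dtypes; it is illustrative.
DTYPE_FORMATS = {"<f8": "d", "<f4": "f", "<i4": "i", "<i8": "q"}

def bytes_to_values(raw: bytes, dtype: str) -> list:
    """Interpret a raw byte payload according to a dtype string."""
    fmt = DTYPE_FORMATS[dtype]
    itemsize = struct.calcsize("<" + fmt)
    count = len(raw) // itemsize
    return list(struct.unpack(f"<{count}{fmt}", raw))

payload = struct.pack("<3d", 1.0, 2.0, 3.0)
values = bytes_to_values(payload, "<f8")  # [1.0, 2.0, 3.0]
```

In the real stack this information lives in the reference dict that kerchunk produces, so nothing here should need to be hardcoded.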

@valeriupredoi
Collaborator Author

@bnlawrence very good points indeed! I'll be tweaking this today. The info on the variable should be encoded in the reference dict that kerchunk spits out - at the moment parts of it are either ignored or hardcoded in my implementation, which is not good, but will change. One thing I am still struggling with - and have to inspect kerchunk to understand - is how we read the file in kerchunk without needing the file to be on the client disk.

@valeriupredoi
Collaborator Author

valeriupredoi commented Sep 5, 2022

@bnlawrence @davidhassell I think we've finally gone to town with this; here's the stack explained:

  • I have added the magic module that does the work we need, netcdf_to_zarr, which:
    • grabs a netCDF4 file and puts it into a reference file system generator as a JSON file constructed by kerchunk; this is written to disk inside the local FS at the moment
    • this JSON file is then opened as a normal Zarr Array (again, via a reference file system) and all its metadata is now available to be used (including chunks, data type, compression type, etc.); note that this Array has no valid data - it holds a lot of junk (NaNs or numpy empties) for data - but the data structuring is intact as per the original netCDF4 as read with a Zarr engine;
    • the metadata of this file is then used to get information about one or more desired selections (slices), via a number of utility functions: PartialChunkIterator for the slices' offsets and sizes, and my own Zarr chunks-info utility that returns the chunks mapping: a dictionary keyed by the chunks' index coordinates (x.x.x), with each chunk's size as the value; this is then used to select the chunks where the slices live and return those chunks' sizes. NOTE: I am struggling to get the offsets for these chunks, I would be very happy if you could think of a method 😁
  • I have added a test in test_harness.py that runs this whole show and checks for what we need from Bryan's bizarre netCDF file 😁
  • Zarr's core.py is slightly hacked now to allow us to get the PCI info at any stage Zarr would go through; I have not destroyed the module, so functionality should be as before - it's just a bit of a Trojan that sneaks in and returns the PCI
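The chunks-mapping utility described above can be sketched as follows; the function name and signature are hypothetical, but the keying convention (i.j.k, matching Zarr chunk keys) and the handling of smaller edge chunks follow the description:

```python
from itertools import product

# Sketch of a chunks-info utility: given an array shape and chunk
# shape, build the mapping from Zarr-style chunk keys ("i.j") to each
# chunk's size in elements. Edge chunks may be smaller than the
# nominal chunk shape, which the min() below accounts for.
def chunk_sizes(shape, chunks):
    counts = [-(-s // c) for s, c in zip(shape, chunks)]  # ceil division
    mapping = {}
    for idx in product(*(range(n) for n in counts)):
        extent = [min(c, s - i * c) for i, s, c in zip(idx, shape, chunks)]
        size = 1
        for e in extent:
            size *= e
        mapping[".".join(map(str, idx))] = size
    return mapping

# e.g. a (5, 4) array chunked (3, 3): the corner chunk "1.1" holds 2*1 elements
sizes = chunk_sizes((5, 4), (3, 3))
```

Note this only gives sizes in elements, not byte offsets: the offsets live in the kerchunk reference dict (each chunk key maps to a (file, offset, length) triple there), which is presumably where the missing offset information would come from.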

This basically sets up the streamlined stack for what we need to do, and I am rather happy with it 😆 Would you have some time to look at it, so we can discuss it by voice?

@valeriupredoi valeriupredoi merged commit 5d7b17a into main Sep 8, 2022
@valeriupredoi valeriupredoi deleted the test_integration branch September 8, 2022 15:03
