Think about how to aggregate and/or scatter chunks when copying #24
Yes, that's right. The final ICON-EU size is actually 20ish GB, and ICON-Global is 50 GB. Processed examples are here: https://huggingface.co/datasets/openclimatefix/dwd-icon-eu/tree/main/data and the raw data is here: https://opendata.dwd.de/weather/nwp/icon-eu/grib/06/
Thinking this through: we could think about this in terms of "scatter" and "gather" operations. Or maybe that's too restrictive, as illustrated by the example below. Copying chunks from many GRIB2 files into a single Zarr chunk would be a "gather" op. To make life complicated, let's assume the GRIB2 files are compressed, and that we want to put half of each GRIB2 file into ZarrA and the other half into ZarrB. It would go something like this:
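To make the shape of that operation concrete, here is a minimal, self-contained sketch. Everything in it is hypothetical: the decompress and compress helpers are placeholders for a real GRIB2 decoder and Zarr codec, and the chunk paths are made up.

```rust
use std::fs;
use std::io;

// Placeholder codecs, standing in for a real GRIB2 decoder and Zarr compressor.
fn decompress(bytes: &[u8]) -> Vec<u8> { bytes.to_vec() }
fn compress(bytes: &[u8]) -> Vec<u8> { bytes.to_vec() }

fn gather_and_scatter(grib_paths: &[&str]) -> io::Result<()> {
    let mut zarr_a_chunk: Vec<u8> = Vec::new();
    let mut zarr_b_chunk: Vec<u8> = Vec::new();

    // "Gather": read and decompress many GRIB2 files...
    for path in grib_paths {
        let compressed = fs::read(path)?;
        let decompressed = decompress(&compressed);

        // ..."scatter": the first half of each file goes to ZarrA, the second half to ZarrB.
        let (first_half, second_half) = decompressed.split_at(decompressed.len() / 2);
        zarr_a_chunk.extend_from_slice(first_half);
        zarr_b_chunk.extend_from_slice(second_half);
    }

    // Re-compress and write one chunk per output Zarr (paths are made up).
    fs::write("zarr_a/chunk.0.0", compress(&zarr_a_chunk))?;
    fs::write("zarr_b/chunk.0.0", compress(&zarr_b_chunk))?;
    Ok(())
}
```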
TODO: Think about this more! This is a very early draft!
We could think of this as a read → map → reduce → map → write pipeline.
So maybe there is a general-purpose data structure to specify this. Something like:

```rust
struct ReadMapReduceMapWrite { // Horrible name!
    // As soon as LSIO finishes reading these source_file_chunks, LSIO will start
    // reading the source_file_chunks for the next ReadMapReduceMapWrite, so LSIO
    // can overlap reading from IO whilst running reduce_func and dst_map_func.
    source_file_chunks: Vec<(PathBuf, ByteRanges)>,

    /// Applied to each byte range? Or applied to each (PathBuf, ByteRanges)?
    source_map_func: Fn(&[u8]) -> Vec<u8>,

    /// Takes as input all the outputs of source_map_func, along with their source
    /// locations, and outputs 1 or more buffers, each with its destination.
    reduce_func: Fn(ArrayQueue<(Vec<u8>, PathBuf, ByteRange)>) -> Vec<(Vec<u8>, PathBuf, ByteRange)>,

    /// Applied in parallel to each output of reduce_func.
    /// Data is written after reduce_func completes.
    dst_map_func: Fn(&[u8]) -> Vec<u8>,
}
```
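(An aside on the Rust mechanics, separate from the design question itself: bare Fn types can't be used as struct fields, so in real code those three function fields would have to be boxed closures or generic parameters. A compiling version of the same shape might look like the sketch below, with stand-in ByteRange/ByteRanges type aliases, and the ArrayQueue swapped for a plain Vec just to keep the sketch dependency-free.)

```rust
use std::path::PathBuf;

// Stand-in type aliases, just to make the field signatures concrete.
type ByteRange = std::ops::Range<u64>;
type ByteRanges = Vec<ByteRange>;

struct ReadMapReduceMapWrite {
    source_file_chunks: Vec<(PathBuf, ByteRanges)>,
    source_map_func: Box<dyn Fn(&[u8]) -> Vec<u8> + Send + Sync>,
    reduce_func: Box<
        dyn Fn(Vec<(Vec<u8>, PathBuf, ByteRange)>) -> Vec<(Vec<u8>, PathBuf, ByteRange)>
            + Send
            + Sync,
    >,
    dst_map_func: Box<dyn Fn(&[u8]) -> Vec<u8> + Send + Sync>,
}
```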
Or, put more responsibility on the user's code. The readers and writers would run in their own threads, separate from the main thread. We'd receive groups of decompressed GRIBs from a channel, and we'd send uncompressed output buffers to another channel. Something like this:

```rust
let source_receiver = read_and_map(
    [gribs_for_zarrA_and_zarrB, gribs_for_zarrC_and_zarrD],
    decompress_func,
);
for decompressed_gribs in source_receiver {
    let outputs_for_task = merge_into_zarrs(decompressed_gribs);
    for output_buf in outputs_for_task {
        writer_sender.send(MapAndWrite {
            map: compression,
            buf: output_buf,
            path: "foo",
            byte_ranges: vec![...],
        });
    }
}
```

I think this second approach might be more flexible. And, unlike the
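None of read_and_map, merge_into_zarrs, or MapAndWrite exist yet, so, as a sanity check that the channel plumbing hangs together, here is a self-contained sketch of the same pattern using std::sync::mpsc and plain threads, with byte vectors standing in for decompressed GRIB messages and a println standing in for the actual write:

```rust
use std::sync::mpsc;
use std::thread;

// Stand-in for the hypothetical MapAndWrite message sent to the writer thread.
struct WriteRequest {
    path: String,
    buf: Vec<u8>,
}

fn main() {
    // "Reader" thread: stands in for read_and_map(). In LSIO this would read and
    // decompress groups of GRIB2 files, sending each group down the channel.
    let (source_sender, source_receiver) = mpsc::channel::<Vec<Vec<u8>>>();
    let reader = thread::spawn(move || {
        for group in 0..2u8 {
            let decompressed_gribs = vec![vec![group; 8], vec![group + 10; 8]];
            source_sender.send(decompressed_gribs).unwrap();
        }
        // Dropping source_sender here closes the channel, ending the loop below.
    });

    // "Writer" thread: stands in for the thread that compresses and writes buffers.
    let (writer_sender, writer_receiver) = mpsc::channel::<WriteRequest>();
    let writer = thread::spawn(move || {
        for request in writer_receiver {
            println!("would write {} bytes to {}", request.buf.len(), request.path);
        }
    });

    // Main thread: the user's code. Merge each group of decompressed GRIBs into
    // output buffers and hand them to the writer.
    for decompressed_gribs in source_receiver {
        // Stand-in for merge_into_zarrs(): just concatenate the group.
        let merged: Vec<u8> = decompressed_gribs.concat();
        writer_sender
            .send(WriteRequest { path: "zarr_a/chunk.0.0".to_string(), buf: merged })
            .unwrap();
    }
    drop(writer_sender); // close the writer channel so the writer thread exits

    reader.join().unwrap();
    writer.join().unwrap();
}
```

The appealing property is that the reader and writer threads can keep doing IO while the main thread runs the user's merge logic.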
The thought process which started with this issue has ended in me totally changing the API design! So this specific issue is now out-of-date. For my latest plans, see https://github.com/JackKelly/light-speed-io/milestone/1
For example, for the ICON NWP data: each init time consists of 146,000 GRIB2 files! One file for each combination of vertical level, variable, and timestep.
Jacob has a script which converts ICON NWP GRIB2 files to Zarr. This conversion takes 0.5 to 1 hour per init time, reading from a local SSD on one of Open Climate Fix's on-prem computers, and the output compressed Zarr is about 50 GBytes per init time. The Zarr is compressed using blosc2, with no sub-selection.
(Did I get the details right, @jacobbieker?!)
How to merge multiple input files into a single output file in LSIO?
Maybe we should pass the data back to Python (async)? But that would have a large performance impact.
Is there a way for users to tell LSIO to merge or split input files, in arbitrary, user-configurable ways?
TODO: `read_then_write` and `read_map_write` (first, don't worry about merging).
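Purely as a strawman to pin down the shapes of those two entry points (nothing below is implemented, and all names, types, and parameters are guesses):

```rust
use std::io;
use std::path::PathBuf;

// Stand-in type aliases.
type ByteRange = std::ops::Range<u64>;
type ByteRanges = Vec<ByteRange>;

/// Hypothetical sketch only: neither method exists in LSIO yet.
trait ChunkCopier {
    /// Copy byte ranges from source files into destination files, unchanged.
    fn read_then_write(
        &self,
        src: &[(PathBuf, ByteRanges)],
        dst: &[(PathBuf, ByteRanges)],
    ) -> io::Result<()>;

    /// Like read_then_write, but apply a user-supplied transform to each chunk
    /// in flight (e.g. decompress a GRIB2 message, re-compress for a Zarr chunk).
    fn read_map_write(
        &self,
        src: &[(PathBuf, ByteRanges)],
        dst: &[(PathBuf, ByteRanges)],
        map_func: &dyn Fn(&[u8]) -> Vec<u8>,
    ) -> io::Result<()>;
}
```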