Initial direct I/O support in FilesystemStore #58

Merged: 8 commits merged into LDeakin:main on Aug 30, 2024
Conversation

sk1p
Contributor

@sk1p sk1p commented Aug 26, 2024

Refs #53. I've updated the example in https://github.com/LiberTEM/zarr-dio-proto/ with a case that uses this branch for writing. Let me know if a discussion directly on the PR works for you, otherwise I can move this to the issue. Performance already looks better than the buffered case:

Method                  Time (s)
Direct I/O (manual)      6.14
Direct I/O (this PR)    17.71
Buffered                38.91

(performance in the buffered case also depends on whether we are overwriting an existing array; in that case it takes ~51s)

Open questions

  • Should this rather be its own FilesystemDIOStore, or should it stay an option, as implemented now?
  • What's the preferred API for keeping new_with_options forwards-compatible? Should it use a builder or is that overkill for now?
  • How much of the implementation/optimization should happen in this PR, vs. incremental improvements?
  • Thoughts on error handling - should there be a fallback to buffered I/O in case the Layout can't be constructed? Is panic! in case of allocation failure OK?
  • I have not yet looked into the read-path of things - as I understand, seek also has to be page-size aligned, which is a bit annoying. One could maybe return the buffers with padding and deref to the actually requested Bytes somehow?

Profile discussion

An annotated flamegraph of running the example code:

[Flamegraph: 2024-08-26 zarrs direct I/O profile, round 1]

From left to right:

  • The leftmost memcpy comes from Bytes::from(chunk_encoded.into_owned()), as far as I can see
  • For PageAlignedBuffer::drop, see below
  • Then we have a copy into the PageAlignedBuffer and its initialization to zero beforehand
  • Only the rightmost part of the profile is performing actual I/O (mkdir, stat, open, write etc.)

There are two parts to the remaining overhead: 1) the fact that we have to touch the memory three times (initialization and two copies), and 2) that the allocations aren't re-used and a significant amount of time is spent in page faults and freeing memory back to the operating system.

The page faults and the cost of PageAlignedBuffer::drop are caused by the default system-wide malloc not re-using large allocations by default; re-running with MALLOC_TRIM_THRESHOLD_=18388608 MALLOC_MMAP_THRESHOLD_=18388608 (which happen to be thresholds larger than the per-chunk buffers) reduces the total run time to ~11.7s. This is great, that's already within 2x of the "ideal". It has the following profile:

[Flamegraph: direct I/O profile with raised malloc thresholds]
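For an application that controls its own startup (as opposed to a library), the same thresholds can in principle also be raised programmatically. A minimal sketch, assuming glibc and the libc crate's mallopt bindings (not part of this PR):

// Raise glibc's trim/mmap thresholds from within the process, mirroring the
// MALLOC_TRIM_THRESHOLD_/MALLOC_MMAP_THRESHOLD_ environment variables above.
fn tune_malloc_for_large_chunk_buffers() {
    // Larger than the per-chunk buffers used in the benchmark above.
    const THRESHOLD: libc::c_int = 18_388_608;
    unsafe {
        // Don't return freed memory at the top of the heap to the OS until
        // more than THRESHOLD bytes are free there.
        libc::mallopt(libc::M_TRIM_THRESHOLD, THRESHOLD);
        // Serve allocations up to THRESHOLD from the heap instead of mmap,
        // so freed chunk buffers can be reused without new page faults.
        libc::mallopt(libc::M_MMAP_THRESHOLD, THRESHOLD);
    }
}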

As it is hard to control these thresholds from a library, an alternative would be to explicitly re-use buffers across multiple chunk writes, but I'm not yet sure how to do this in a non-hacky way (a hacky way: just keep the buffer around in a thread-local RefCell... works, but also never frees). Maybe one of the arena crates can be used?
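For illustration, the thread-local variant is only a few lines; a sketch using a plain Vec<u8> in place of the PR's PageAlignedBuffer:

use std::cell::RefCell;

thread_local! {
    // One reusable scratch buffer per writer thread; it grows to the largest
    // chunk seen and is only freed when the thread exits.
    static SCRATCH: RefCell<Vec<u8>> = RefCell::new(Vec::new());
}

fn with_scratch<R>(len: usize, f: impl FnOnce(&mut [u8]) -> R) -> R {
    SCRATCH.with(|cell| {
        let mut buf = cell.borrow_mut();
        if buf.len() < len {
            buf.resize(len, 0);
        }
        f(&mut buf[..len])
    })
}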

As for the memory copies and initialization:

  • Zero-initialization of PageAlignedBuffer can be amortized by re-using the initialized buffer over the life of, for example, the FilesystemStore, and/or initializing directly from the given slice of bytes
  • Instead of passing slices of already allocated buffers down the "codec chain", and having ad-hoc allocations in the individual codec steps, can we invert this somehow so that buffers are managed by the lower levels of the codec chain? Then the previous-to-last codec step would ask the FilesystemStore for a suitable buffer, which would already be page-size aligned (removing the last copy)
  • Instead of passing some user-allocated memory into store_chunk_elements, maybe the user could instead be presented with a mutable slice (or ArrayViewMut<T> or similar) where they write the data, which is backed by a suitable buffer. I think this would get rid of the copy by into_owned, as the buffer would be managed by the FilesystemStore.

For the last point, a sketch of an API could be:

// like `store_chunk_elements`, but provides a buffer to write to
// (might need to also have a `num_elements` parameter, if this is a border chunk?)
array.store_chunk_bikeshed(&chunk_indices, |buf: &mut [T]| {
    // just as an example - some code that fills up `buf`
    my_decoder.decode(&input, &mut buf);
})

In the uncompressed case, buf would then be handed down without copying all the way to the FilesystemStore, where it can be directly used for I/O. Thoughts?

For comparison, a before-profile using buffered I/O:

[Flamegraph: buffered I/O (before this PR), for comparison]

(all profiles taken on Linux 6.1 / Debian stable with perf record --call-graph=dwarf -F 1234 <bin>, converted using perf script -F +pid and visualized using https://profiler.firefox.com/)

@sk1p sk1p requested a review from LDeakin as a code owner August 26, 2024 16:15

codecov bot commented Aug 26, 2024

Codecov Report

Attention: Patch coverage is 98.03922% with 1 line in your changes missing coverage. Please review.

Project coverage is 82.28%. Comparing base (ce75205) to head (cc5c131).
Report is 13 commits behind head on main.

Files with missing lines                            Patch %   Lines
src/storage/store/store_sync/filesystem_store.rs    98.03%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #58      +/-   ##
==========================================
+ Coverage   82.26%   82.28%   +0.02%     
==========================================
  Files         160      160              
  Lines       22183    22228      +45     
==========================================
+ Hits        18249    18291      +42     
- Misses       3934     3937       +3     


@LDeakin
Owner

LDeakin commented Aug 27, 2024

This is great stuff, thanks for this. I've not had time to look at this thoroughly or test it, but here are my first thoughts.

Open questions

  • Should this rather be its own FilesystemDIOStore, or should it stay an option, as implemented now?

This has so much overlap with the standard filesystem store, so just keep it as an option.

  • What's the preferred API for keeping new_with_options forwards-compatible? Should it use a builder or is that overkill for now?

What about a #[non_exhaustive] FilesystemStoreOptions struct implementing Default? I can see more options being added to FilesystemStore in the future.
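For concreteness, that pattern might look roughly like this (field and method names are placeholders, not necessarily what the PR ends up with):

// A #[non_exhaustive] options struct can grow new fields without breaking
// downstream code, since struct literals can't be written outside the
// defining crate; callers start from Default and use setters.
#[non_exhaustive]
#[derive(Debug, Clone, Default)]
pub struct FilesystemStoreOptions {
    pub direct_io: bool,
}

impl FilesystemStoreOptions {
    pub fn direct_io(mut self, enabled: bool) -> Self {
        self.direct_io = enabled;
        self
    }
}

// Hypothetical usage:
// let store = FilesystemStore::new_with_options(path, FilesystemStoreOptions::default().direct_io(true))?;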

  • How much of the implementation/optimization should happen in this PR, vs. incremental improvements?

Let's aim for incremental improvements and get this in soon.

  • Thoughts on error handling - should there be a fallback to buffered I/O in case the Layout can't be constructed? Is panic! in case of allocation failure OK?

Yep, the rest of the crate panics on allocation failure

The page faults and the cost of PageAlignedBuffer::drop are caused by the default system-wide malloc not re-using large allocations by default

I wonder how jemalloc might perform as the global allocator in this situation?

I have not yet looked into the read-path of things - as I understand, seek also has to be page-size aligned, which is a bit annoying. One could maybe return the buffers with padding and deref to the actually requested Bytes somehow?

Fortunately it is feasible to allocate aligned bytes::Bytes (tokio-rs/bytes#600 (comment)) and then cheaply create slice references.

Reducing allocations

Would you benefit from a function like store_encoded_chunk below? You could then allocate the chunk bytes::Bytes however you like, but encoding them is your responsibility (no problem in your case):

// unsafe because the responsibility is on the caller to ensure the chunk is encoded correctly
// impl<TStorage: WritableStorageTraits + ..> Array<TStorage>
pub unsafe fn store_encoded_chunk(
    &self,
    chunk_indices: &[u64],
    encoded_chunk_bytes: bytes::Bytes,
) -> Result<(), ArrayError> {
    // part of the store_chunk_opt() implementation
    ...
    crate::storage::store_chunk(
        &*storage_transformer,
        self.path(),
        chunk_indices,
        self.chunk_key_encoding(),
        encoded_chunk_bytes,
    )?;
}

FilesystemStore with direct IO could check if the bytes are page-size aligned and only allocate/copy if required. You can then do things like reuse aligned buffers manually via store_encoded_chunk. What do you think about something like that as a stopgap solution? Could this achieve performance parity with your manual direct I/O implementation with (slightly) less boilerplate?
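The alignment check itself is cheap; a sketch (page size hard-coded for brevity, a real implementation would query it):

// Whether `bytes` can be passed to an O_DIRECT write as-is; otherwise the
// store falls back to copying into its own page-aligned buffer.
fn is_direct_io_compatible(bytes: &[u8]) -> bool {
    const ALIGN: usize = 4096; // page / logical block size, assumed here
    (bytes.as_ptr() as usize) % ALIGN == 0 && bytes.len() % ALIGN == 0
}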

Longer term

I need to think on this for a while... but one potential path that might benefit all zarrs users without substantial API changes¹ is to make the codec encode implementations alignment aware:

  • Add an alignment: Option<usize> to CodecOptions
  • Change codec RawBytes from Cow<'a, [u8]> to bytes::BytesMut
  • The last non pass-through codec in the chain can do an aligned allocation for its output

Footnotes

  1. API changes in the name of performance are welcomed, but best to be done incrementally and with usability in mind.

@sk1p
Contributor Author

sk1p commented Aug 27, 2024

I wonder how jemalloc might perform as the global allocator in this situation?

From a quick test, without any tuning, quite bad: it took ~45s even with direct I/O.

[Flamegraph: direct I/O with jemalloc as the global allocator]

@sk1p
Contributor Author

sk1p commented Aug 27, 2024

here are my first thoughts [...]

Thanks, I'll push an update tomorrow.

@sk1p
Contributor Author

sk1p commented Aug 28, 2024

What about a #[non_exhaustive] FilesystemStoreOptions struct implementing Default? I can see more options being added to FilesystemStore in the future.

Done.

Yep, the rest of the crate panics on allocation failure

👍

I have not yet looked into the read-path of things - as I understand, seek also has to be page-size aligned, which is a bit annoying. One could maybe return the buffers with padding and deref to the actually requested Bytes somehow?

Fortunately it is feasible to allocate aligned bytes::Bytes (tokio-rs/bytes#600 (comment)) and then cheaply create slice references.

I'll need to look a bit more into how the bytes crate operates, but this is a good pointer. Using bytes might be a good way to get rid of the unsafe in the PageAlignedBuffer, too.
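As a rough illustration of that direction (not this PR's code, and assuming a bytes release that provides Bytes::from_owner): the aligned allocation can be done manually, ownership handed to Bytes, and cheap slicing used afterwards to strip any alignment padding:

use std::alloc::{alloc_zeroed, dealloc, Layout};
use bytes::Bytes;

// Hypothetical page-aligned owner type; Bytes keeps it alive and it frees
// the memory with the correct layout when the last reference is dropped.
struct AlignedBuf {
    ptr: *mut u8,
    layout: Layout,
}

unsafe impl Send for AlignedBuf {}
unsafe impl Sync for AlignedBuf {}

impl AsRef<[u8]> for AlignedBuf {
    fn as_ref(&self) -> &[u8] {
        unsafe { std::slice::from_raw_parts(self.ptr, self.layout.size()) }
    }
}

impl Drop for AlignedBuf {
    fn drop(&mut self) {
        unsafe { dealloc(self.ptr, self.layout) }
    }
}

fn aligned_bytes(len: usize, align: usize) -> Bytes {
    assert!(len > 0);
    let layout = Layout::from_size_align(len, align).expect("invalid layout");
    let ptr = unsafe { alloc_zeroed(layout) };
    assert!(!ptr.is_null(), "allocation failure");
    Bytes::from_owner(AlignedBuf { ptr, layout })
}

// A padded, aligned read could then return only the requested range without
// copying, e.g. aligned.slice(pad..pad + requested_len)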

Reducing allocations

Would you benefit from a function like store_encoded_chunk below? You could then allocate the chunk bytes::Bytes however you like, but encoding them is your responsibility (no problem in your case):

// unsafe because the responsibility is on the caller to ensure the chunk is encoded correctly
// impl<TStorage: WritableStorageTraits + ..> Array<TStorage>
pub unsafe fn store_encoded_chunk(
    &self,
    chunk_indices: &[u64],
    encoded_chunk_bytes: bytes::Bytes,
) -> Result<(), ArrayError> {
    // part of the store_chunk_opt() implementation
    ...
    crate::storage::store_chunk(
        &*storage_transformer,
        self.path(),
        chunk_indices,
        self.chunk_key_encoding(),
        encoded_chunk_bytes,
    )?;
}

FilesystemStore with direct IO could check if the bytes are page-size aligned and only allocate/copy if required. You can then do things like reuse aligned buffers manually via store_encoded_chunk. What do you think about something like that as a stopgap solution? Could this achieve performance parity with your manual direct I/O implementation with (slightly) less boilerplate?

I think it would, yes. I do realize that we have a use-case, like in the aligned_source branch, where the source is page-aligned, already in a plain integer encoding, and doesn't need the decode step, but is not yet in a Bytes - supporting that may be taking things a bit far, and I would be happy to use this stopgap solution for now.
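As a caller-side sketch of that stopgap (store_encoded_chunk as proposed above is not an existing API, and my_encode_into_aligned is a hypothetical helper):

// The caller owns encoding and alignment; the direct-I/O store can then skip
// its copy whenever the alignment check passes.
let encoded: bytes::Bytes = my_encode_into_aligned(&chunk_data);
unsafe {
    // unsafe: the caller promises the bytes are a correctly encoded chunk
    array.store_encoded_chunk(&chunk_indices, encoded)?;
}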

Longer term

I need to think on this for a while... but one potential path that might benefit all zarrs users without substantial API changes¹ is to make the codec encode implementations alignment aware:

I think it's a good idea to sit on this a bit, and to first play around with other local I/O solutions like io_uring - one aspect is that alignment is not the only issue. With io_uring, you have the possibility to reduce per-I/O overheads that are still present in the direct I/O case, namely mapping user-space pages into kernel space and un-mapping them after I/O completion. Using io_uring_register(2) you can perform this mapping once and basically get back a handle/index for each buffer. Leaking this into the codecs would be a bit tricky, I think.

  • The last non pass-through codec in the chain can do an aligned allocation for its output

What if the codecs (or at least the last codec in the chain) instead ask the storage for a buffer? I haven't thought this to completion, but basically if the storage could provide (or even "loan" for re-use) buffers out to the upper layers, it could hide implementation details like alignment, or any other special treatment that is needed for the specific storage backend.

Thinking about the "pass-through" case:

  1. The user requests (via some Array method) to write a chunk for a specific chunk index
  2. The array asks the codec chain for a buffer
  3. The request is passed down through the chain, and the last codec forwards it to the storage
  4. The storage provides a "raw" buffer (can this be a BytesMut?)
  5. In the array to bytes codec, the raw buffer is byte-mucked into something like &mut [T] where T: Element
  6. The user fills this buffer, either as elements of a slice, or with an ndarray mutable view sitting on top
  7. Control flows back to the storage and performs the I/O

In the non-pass-through case, this chain would be interrupted wherever the data needs to change - let's imagine a single compression codec, which writes into a buffer provided by the storage, and provides a different buffer for the upper codec layers to write their data into.

I hope I have the right mental model of how this codec chain works...
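A very rough sketch of the "loaned buffer" idea, with all names hypothetical:

use bytes::BytesMut;

// Hypothetical trait implemented by a store: it loans out buffers that already
// satisfy its constraints (alignment, io_uring registration, ...) and takes
// them back to perform the actual write.
trait BufferProvider {
    /// Loan a writable buffer of at least `len` bytes.
    fn loan(&self, len: usize) -> BytesMut;
    /// Take the filled buffer back and write it under `key`.
    fn commit(&self, key: &str, buf: BytesMut) -> std::io::Result<()>;
}

// In the pass-through case every codec forwards loan()/commit() unchanged; a
// compressing codec would loan the store buffer for its *output* and hand a
// separate scratch buffer up the chain for its input.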

  1. API changes in the name of performance are welcomed, but best to be done incrementally and with usability in mind.

Agreed! IMHO they can also be opt-in, if the added API surface is not too large, so the user can decide where they want to land on the performance vs usability scale.

Check for zero-size writes and disable `direct_io` for these.
@sk1p
Contributor Author

sk1p commented Aug 28, 2024

FilesystemStore with direct IO could check if the bytes are page-size aligned and only allocate/copy if required. You can then do things like reuse aligned buffers manually via store_encoded_chunk. What do you think about something like that as a stopgap solution? Could this achieve performance parity with your manual direct I/O implementation with (slightly) less boilerplate?

See #64 for a first try at implementing this. The answer is, yes, it performs 🥳

Owner

@LDeakin LDeakin left a comment


Looks good! Just needs a few #[cfg(unix)] to pass CI

@LDeakin LDeakin merged commit 489f357 into LDeakin:main Aug 30, 2024
23 checks passed
@LDeakin
Owner

LDeakin commented Aug 30, 2024

Thanks!

I think it's a good idea to sit on this a bit, and to first play around with other local I/O solutions like io_uring - one aspect is that alignment is not the only issue

👍

What if the codecs (or at least the last codec in the chain) instead ask the storage for a buffer? I haven't thought this to completion, but basically if the storage could provide (or even "loan" for re-use) buffers out to the upper layers, it could hide implementation details like alignment, or any other special treatment that is needed for the specific storage backend.

Thinking about the "pass-through" case:
...

Yep, that could probably work. The standard encode/decode methods of codecs do not provide any access to storage. But what you are after is not so different from the partial decoders and partial encoders (#45). They pass input/output handles through the codec chain and the last codec ends up calling storage methods via StoragePartialDecoder / StoragePartialEncoder. The same ideas could be applied to non-partial encoding.

@sk1p sk1p deleted the dio branch August 30, 2024 10:33
@JackKelly

This is great stuff!

@sk1p wrote:

I think it's a good idea to sit on this a bit, and to first play around with other local I/O solutions like io_uring - one aspect is that alignment is not the only issue. With io_uring, you have the possibility to reduce per-I/O overheads that are still present in the direct I/O case, namely mapping user-space pages into kernel space and un-mapping them after I/O completion. Using io_uring_register(2) you can perform this mapping once and basically get back a handle/index for each buffer.

I'm happy to help think through using io_uring for zarrs if you'd like. I spent a few months tinkering with io_uring in my Rust project, light speed IO. I've basically stopped work on light speed IO for a number of reasons, but I'd be happy to bounce ideas around.

I did some quick benchmarking on a PCIe 5 SSD on an AMD Epyc workstation. The headline is that io_uring could achieve much faster throughput than object_store when reading many small chunks of files. io_uring achieved a throughput of about 11.2 GiBytes/sec when reading 256 KiB chunks from 500 files, where each file is 39 MiBytes. This is as fast as fio using the io_uring engine. In contrast, object_store could only achieve a throughput of about 900 MiB/s.

However, when reading the entire files at once, object_store got to 6 GiB/sec.

So, when reading many small chunks (256 KiB), io_uring is more than 10x faster than object_store. But, when reading entire 39 MiB files, the gap closes significantly, and io_uring is "only" 2x faster than object_store.

However, it's a lot of effort to achieve this throughput! My first io_uring implementation wasn't fast. I re-wrote LSIO to use multiple threads (each with their own io_uring instance) and a custom work-stealing scheduler.

@sk1p
Contributor Author

sk1p commented Sep 4, 2024

I'm happy to help think through using io_uring for zarrs if you'd like. I spent a few months tinkering with io_uring in my Rust project, light speed IO. I've basically stopped work on light speed IO for a number of reasons, but I'd be happy to bounce ideas around.

Thank you for the offer! I already came across your project while researching io_uring and efficient I/O in rust. I'm currently working towards integrating the Zarr writing code into our software, but afterwards might come back to this topic, especially once I tackle efficient reading again (I did a first prototype some 3 years back, using the Python zarr library...)

I did some quick benchmarking on a PCIe 5 SSD on an AMD Epyc workstation. The headline is that io_uring could achieve much faster throughput than object_store when reading many small chunks of files. io_uring achieved a throughput of about 11.2 GiBytes/sec when reading 256 KiB chunks from 500 files, where each file is 39 MiBytes. This is as fast as fio using the io_uring engine. In contrast, object_store could only achieve a throughput of about 900 MiB/s.

However, when reading the entire files at once, object_store got to 6 GiB/sec.

So, when reading many small chunks (256 KiB), io_uring is more than 10x faster than object_store. But, when reading entire 39 MiB files, the gap closes significantly, and io_uring is "only" 2x faster than object_store.

These are impressive numbers! I'm fortunate that we can often work in more comfortable block sizes - 1MiB+ is common for us, and for writing data, often even larger (something like 8MiB for (16, 512, 512) u16 chunks). So for writing, classical O_DIRECT already hits the mark from a performance point of view, and gives me a number to beat. However, having efficient access to small chunks (and efficient storage for many, many small chunks (w/ shards?)) would also be of interest for me.

However, it's a lot of effort to achieve this throughput! My first io_uring implementation wasn't fast. I re-wrote LSIO to use multiple threads (each with their own io_uring instance) and a custom work-stealing scheduler.

I can imagine, and I guess you learned a lot with that project!

I haven't yet looked into your benchmarks in detail - does io_uring also put requirements on buffer alignment/sizes? IIRC Jens Axboe mentioned these alignment constraints in "the io_uring.pdf" as a downside of the existing async APIs. So I wonder, is it possible to hit these performance numbers without using O_DIRECT?

From your README:

My first use-case for light-speed-io is to help to speed up reading Zarr. After that, I'm interested in helping to create fast readers for "native" geospatial file formats like GRIB2 and EUMETSAT native files. And, even further than that, I'm interested in efficient & fast computation on out-of-core, chunked, labelled, multi-dimensional data.

It's interesting, I'm kind of going in the opposite direction - in LiberTEM, we first worked on doing efficient out-of-core, thread-per-core streaming computation, mostly on vendor-specific file formats, and are now moving towards writing data directly into zarr files from the detector systems (while also doing some "live" stream processing, think previews/immediate feedback etc.). Someone recently used a nice term for this - having the data "born FAIR".

I'm definitely interested in staying in contact, let me know if you'd like to chat some time!

@JackKelly

does io_uring also put requirements on buffer alignment/sizes? IIRC Jens Axboe mentioned these alignment constraints in "the io_uring.pdf" as a downside of the existing async APIs.

My understanding is that the requirements for buffer alignment are defined by the filesystem. So, for example, if your filesystem requires that the start and end of memory buffers are aligned to 512-byte boundaries, then you'll have to align your buffers to 512-byte boundaries, no matter whether you're using io_uring, read()/write(), or aio_read()/aio_write().

IIRC, one of the benefits of io_uring in comparison to AIO is that AIO only allows you to perform async IO if (and only if) you use O_DIRECT. In contrast, io_uring is more flexible: io_uring allows you to use the page cache, or to use O_DIRECT if you prefer.

So I wonder, is it possible to hit these performance numbers without using O_DIRECT?

My understanding is that it depends on your read pattern. The page cache is the default because, for most read patterns, you do get better performance if you use the Linux page cache. But, for example, if you want to read lots of small chunks which are randomly distributed, and if you'll never re-read those same chunks (or if you implement your own cache), then O_DIRECT will be faster, because you can read directly into user-space buffers, and because the operating system won't read ahead.
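For reference, opening a file with O_DIRECT from Rust is just an extra flag (Linux-only sketch; the buffers and offsets used with it then have to satisfy the alignment rules discussed above):

use std::fs::{File, OpenOptions};
use std::io;
use std::os::unix::fs::OpenOptionsExt;

fn open_direct(path: &str) -> io::Result<File> {
    // O_DIRECT bypasses the page cache; buffer address, length and file offset
    // must then meet the filesystem's alignment requirements (typically the
    // 512-byte or 4096-byte logical block size).
    OpenOptions::new()
        .read(true)
        .custom_flags(libc::O_DIRECT)
        .open(path)
}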

I'm definitely interested in staying in contact, let me know if you'd like to chat some time!

Absolutely! Sounds good!
