
Gracefully handle OOM during decoding of pack entries #1275

Merged
Byron merged 3 commits into GitoxideLabs:main from kornelski:entryoom
Feb 6, 2024

Conversation

@kornelski
Contributor

gix_pack::data::file::decode::entry::<impl gix_pack::data::File>::decode_entry is causing an out-of-memory error for me.

I'm scanning github.com/rust-lang/crates.io-index using crates_io_index.crates_parallel().
Unfortunately I can't reliably reproduce it, and the stack trace ends with rayon worker.

I've added a bunch of try_reserve() to make the error less crashy. I realize it may be treating only the symptom.

Details
thread '<unnamed>' panicked at library/std/src/alloc.rs:354:9:
memory allocation of 24781832 bytes failed
stack backtrace:
   0: _rust_begin_unwind
   1: core::panicking::panic_fmt
   2: std::alloc::default_alloc_error_hook
   3: std::alloc::rust_oom
   4: ___rg_oom
   5: alloc::alloc::handle_alloc_error
   6: alloc::raw_vec::RawVec<T,A>::reserve::do_reserve_and_handle
   7: alloc::vec::Vec<T,A>::resize
   8: gix_pack::data::file::decode::entry::<impl gix_pack::data::File>::decode_entry
   9: gix_odb::store_impls::dynamic::find::<impl gix_odb::store_impls::dynamic::Handle<S>>::try_find_cached_inner
  10: gix_odb::store_impls::dynamic::find::<impl gix_pack::find_traits::Find for gix_odb::store_impls::dynamic::Handle<S>>::try_find_cached
  11: gix_odb::cache::impls::<impl gix_pack::find_traits::Find for gix_odb::Cache<S>>::try_find_cached
  12: gix_odb::cache::impls::<impl gix_object::traits::find::Find for gix_odb::Cache<S>>::try_find
  13: gix::repository::object::<impl gix::types::Repository>::_find_object_inner_generated_by_gix_macro_momo
  14: gix::repository::object::<impl gix::types::Repository>::find_object
  15: <crates_index::git::impl_::CratesTreesToBlobs as core::iter::traits::iterator::Iterator>::next
  16: <crates_index::git::impl_::Crates as core::iter::traits::iterator::Iterator>::next
  17: alloc::vec::Vec<T,A>::extend_desugared
  18: <alloc::vec::Vec<T> as alloc::vec::spec_from_iter_nested::SpecFromIterNested<T,I>>::from_iter
  19: core::ops::function::impls::<impl core::ops::function::Fn<A> for &F>::call
  20: rayon::iter::plumbing::Folder::consume_iter
  21: rayon::iter::plumbing::bridge_producer_consumer::helper
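The change essentially boils down to this pattern (a simplified sketch rather than the exact diff; prepare_output and the sizes are illustrative only):

```rust
use std::collections::TryReserveError;

/// Grow `out` to `size` bytes, reporting allocation failure instead of panicking.
/// A sketch of the pattern this PR applies; names are illustrative.
fn prepare_output(out: &mut Vec<u8>, size: usize) -> Result<(), TryReserveError> {
    out.clear();
    // try_reserve() asks for capacity for at least `size` more elements
    // (len is 0 after clear()) and returns Err on allocation failure.
    out.try_reserve(size)?;
    // Growing within the reserved capacity cannot fail anymore.
    out.resize(size, 0);
    Ok(())
}

fn main() {
    let mut buf = Vec::new();
    match prepare_output(&mut buf, 24_781_832) {
        Ok(()) => println!("buffer ready, capacity = {}", buf.capacity()),
        Err(e) => eprintln!("could not allocate decode buffer: {e}"),
    }
}
```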

@Byron Byron left a comment (Member)

Thanks for bringing this up.

I hope you can profile the memory consumption of the program to see where memory gets stuck. On the bright side, gitoxide is made to reuse buffer space and avoid frequent allocations and deallocations, so decoding a whole pack will end up with buffers of just the right size to decode any object without allocating/reallocating. At least that's the theory, with an actual profile pending.

This also means that depending on usage, calling shrink_to_fit() might reduce the overall memory consumption as buffers could become larger than they have to be. It's certainly worth investigating and comparing capacity() with len().
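Something along these lines could be used to compare them in the application (a diagnostic sketch, not a gitoxide API):

```rust
/// Print how much of a reused buffer is actually in use, and optionally trim it.
/// Purely a diagnostic sketch for application code, not part of gitoxide.
fn report_buffer_usage(name: &str, buf: &mut Vec<u8>, trim: bool) {
    let unused = buf.capacity() - buf.len();
    println!(
        "{name}: len = {}, capacity = {}, unused = {unused} bytes",
        buf.len(),
        buf.capacity()
    );
    if trim && unused > 0 {
        // Gives memory back to the allocator at the cost of reallocating later.
        buf.shrink_to_fit();
    }
}
```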

As for this PR, generally try_reserve(additional) seems to be used like try_reserve(target_len) (which is also how I would love to use this method), which is allocating much more than it needs.

@kornelski kornelski force-pushed the entryoom branch 2 times, most recently from 511a22a to 98a9b19 on February 5, 2024 12:51
@kornelski
Contributor Author

I've also noticed a pattern of resize() + copy_from_slice(); replacing it with clear() + extend_from_slice() avoids writing the buffer twice, so I've added OOM checks there too.
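Roughly this difference, as a sketch with placeholder names:

```rust
fn copy_into(buffer: &mut Vec<u8>, data: &[u8]) {
    // Pattern found in a few places: writes the buffer twice,
    // once with zeros and once with the actual data.
    buffer.resize(data.len(), 0);
    buffer.copy_from_slice(data);
}

fn copy_into_once(buffer: &mut Vec<u8>, data: &[u8]) {
    // Equivalent result, but every byte is written only once.
    buffer.clear();
    buffer.extend_from_slice(data);
}
```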

I'm not sure if you're OK with expect("OOM"), or do you prefer a proper Result?

Also this could be less repetitive with some helper functions. Shall I add some? In gix-utils?

@Byron
Member

Byron commented Feb 5, 2024

As for this PR, generally try_reserve(additional) seems to be used like try_reserve(target_len) (which is also how I would love to use this method), which is allocating much more than it needs.

Have you read this note? try_reserve() seems to be used incorrectly.

I'm not sure if you're OK with expect("OOM"), or do you prefer a proper Result?

Yes, please integrate it into the Error type if available.

Also this could be less repetitive with some helper functions. Shall I add some? In gix-utils?

I think that would be a good place, but let's not get ahead of ourselves, see below.


I don't have a good feeling about this PR, to be honest. It's a bit like running around with a band-aid in a fight that can't be won. There are so many places where strings or small buffers are allocated, and they will trigger panics just like before. I don't think that any program that runs into OOM will be helped consistently by this patch series.

If a server doesn't want to crash due to OOM, other mechanisms need to be found, probably related to outsourcing the code into a separate binary, or catching panics. Maybe there are other ways as well, like having a custom allocator that interacts with the application, somehow.

Right now all code in gitoxide behaves as if allocations can't fail, panicking if they do. Since panics don't have to be fatal, I think there should be a more general way to deal with OOM.

Of course, with a profile run one might be able to optimize memory usage and prevent OOMs that way, which of course I am very open to if you find the culprit.

@kornelski
Contributor Author

try_reserve() seems to be used incorrectly.

Oops, fixed.

integrate it into the Error type if available.

For the cache, I opted to return None, instead of complicating the public interface. Since it's a cache, I assume it's allowed not to return entries 100% of the time.
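Schematically, it behaves like this (illustrative only; the real cache types differ):

```rust
/// Look up `key`; on allocation failure while copying the entry out,
/// behave as a cache miss instead of surfacing an error.
/// Schematic only, not the actual cache code.
fn lookup(cache: &std::collections::HashMap<u64, Vec<u8>>, key: u64) -> Option<Vec<u8>> {
    let data = cache.get(&key)?;
    let mut copy = Vec::new();
    copy.try_reserve(data.len()).ok()?; // treat OOM as "not cached"
    copy.extend_from_slice(data);
    Some(copy)
}
```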

I don't think that any program that runs into OOM will be helped consistently by this patch series.

That is true. However, in my experience a few checks on larger buffers work surprisingly well. Programs rarely operate juuust at the edge of running out of memory, and the likelihood of hitting the limit is roughly proportional to the allocation size. Also, allocators tend to keep free lists or preallocated buckets that fit small strings, so handling just the large allocations goes a long way.

I know OOM handling is a tough sell, especially in Rust. But I think in handling of pack files it makes sense, since this is data coming from the network, with size mostly out of control of the gix user (I do fetch arbitrary untrusted git URLs).

A server can never guarantee it won't crash, but severity and frequency of the crashes makes a difference. A server shedding some load due to OOM can still make progress, but a crash aborts work in progress, adds more work to restart and reload, and potentially crashes again from the backlog of aborted requests being retried.

I'll investigate in more detail where memory usage comes from in my case. So far I don't think it's a bug. The crates.io repo is huge, plus I'm scanning it in parallel (which creates multiple thread-local repos), so I may just be actually running out of memory.

@Byron
Member

Byron commented Feb 6, 2024

For the cache, I opted to return None, instead of complicating the public interface. Since it's a cache, I assume it's allowed not to return entries 100% of the time.

A good call, thank you.

That is true. However, in my experience a few checks on larger buffers work surprisingly well. Programs rarely operate juuust at the edge of running out of memory, and the likelihood of hitting the limit is roughly proportional to the allocation size. Also, allocators tend to keep free lists or preallocated buckets that fit small strings, so handling just the large allocations goes a long way.

I know OOM handling is a tough sell, especially in Rust. But I think in handling of pack files it makes sense, since this is data coming from the network, with size mostly out of control of the gix user (I do fetch arbitrary untrusted git URLs).

A server can never guarantee it won't crash, but severity and frequency of the crashes makes a difference. A server shedding some load due to OOM can still make progress, but a crash aborts work in progress, adds more work to restart and reload, and potentially crashes again from the backlog of aborted requests being retried.

Thanks for elaborating, I understand your position better now and share it. Another argument for keeping this PR is that on 32-bit systems it will no longer just panic but report OOM instead (when converting integer sizes), clearly an improvement.

I'll investigate in more detail where memory usage comes from in my case. So far I don't think it's a bug. The crates.io repo is huge, plus I'm scanning it in parallel (which creates multiple thread-local repos), so I may just be actually running out of memory.

That would also be my expectation - gitoxide practices quite strict hygiene around allocations and their reuse, which all leads up to Repository holding a free-list of buffers for use when decoding objects. As long as the application doesn't keep these objects around, there should be no build-up. Further, all these buffers go away along with their thread-local storage once the Repository is dropped. I am still curious what you find out, though.
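The free-list idea is roughly this (a toy illustration, not gitoxide's actual implementation):

```rust
/// A toy version of the buffer free-list described above;
/// not gitoxide's actual data structure.
struct BufferPool {
    free: Vec<Vec<u8>>,
}

impl BufferPool {
    /// Hand out a previously used buffer if one is available, else a fresh one.
    fn acquire(&mut self) -> Vec<u8> {
        self.free.pop().unwrap_or_default()
    }

    /// Return a buffer so its capacity can be reused by the next decode.
    fn release(&mut self, mut buf: Vec<u8>) {
        buf.clear(); // drop contents, keep capacity
        self.free.push(buf);
    }
}
```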

Will review later to try and get this merged.

@Byron Byron left a comment (Member)

When using try_reserve(additional), in many places it's still reserving too much by treating buf_len as additional. I don't know how often I should write this, and it's a big reason I don't feel good about this PR.
Please review every single usage of try_reserve().

Maybe it would help if the scope of the PR didn't span multiple crates?

@Byron
Member

Byron commented Feb 6, 2024

Apologies, I realise my mistake! I kept seeing try_reserve(additional) as performing self.cap() + additional, but what it really does is self.len() + additional. It's not the first time I've failed to deal properly with try_reserve(), maybe this is actually where the strange feeling is coming from 😅.
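For the record, the actual behaviour with plain std Vec, nothing gitoxide-specific:

```rust
fn main() {
    let mut buf: Vec<u8> = Vec::with_capacity(64);
    buf.extend_from_slice(&[1, 2, 3, 4]);

    // `try_reserve(additional)` ensures capacity for len() + additional,
    // not capacity() + additional.
    buf.try_reserve(100).expect("allocation failed");
    assert!(buf.capacity() >= buf.len() + 100);

    // After clear() the length is 0, so passing the full target size
    // reserves exactly what is needed.
    buf.clear();
    buf.try_reserve(100).expect("allocation failed");
    assert!(buf.capacity() >= 100);
}
```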

Thus, I will review every single use of try_reserve() to remember this, and flag (or fix) anything I find to finally get this merged.

Byron added 2 commits February 6, 2024 11:00
- keep errors simple (size isn't affected)
- fix 'unused' warning and move `set_vec_to_slice` closer to consumers; let it return Option for ease of use
- make filter-code fallible if buffer can't be obtained
- keep semantics of `out` in `decode_entry()`
@Byron Byron left a comment (Member)

Thanks a lot, and for bearing with me!
And I really hope this will make a difference for you!

This PR will be merged once CI is green.

I am very curious about the results of the additional profiling you might do, and hope you will share them.
A quick question I hope you could answer: what is the amount of available memory of the machine that keeps running into OOM?

Some(data) => {
buffer.resize(data.len(), 0);
buffer.copy_from_slice(data);
buffer.clear();

I will keep in mind that clear + extend is faster than resize + copy, as it definitely won't write any zeros.

self.debug.put();
let (prev_cap, cur_cap);
let mut v = std::mem::take(&mut self.last_evicted);
self.mem_used -= v.capacity();

Thank you! The memory computation is much easier to follow now.

OutOfMemory,
}

impl From<TryReserveError> for Error {

I removed this as keeping the TryReserveError makes no difference to the already huge size of the error type (272 bytes). Note that for now I don't worry about huge error types as there are general problems around thiserror that need solving, and when solved the size issue would go away automatically.

let size: usize = entry.decompressed_size.try_into().map_err(|_| Error::OutOfMemory)?;
out.clear();
out.try_reserve(size)?;
out.resize(size, 0);

Having a clear() followed by a resize will definitely cause a lot of zeros to be written compared to the previous implementation. I will change it back to the original semantics, with try_reserve().
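Presumably something of this shape (a sketch; the actual follow-up commit may differ):

```rust
use std::collections::TryReserveError;

/// Grow `out` to `size` without clearing it first, so `resize` only
/// zero-fills the newly added tail; existing bytes stay untouched.
/// A sketch, not the actual commit.
fn grow_output(out: &mut Vec<u8>, size: usize) -> Result<(), TryReserveError> {
    out.try_reserve(size.saturating_sub(out.len()))?;
    out.resize(size, 0);
    Ok(())
}
```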

target_buf[..last_result_size].copy_from_slice(&source_buf[..last_result_size]);
}
out.resize(last_result_size, 0);
debug_assert!(out.len() >= last_result_size);

Thank you, this assertion expresses the intent much better than a blank resize, which should always truncate the vec anyway.

@Byron
Member

Byron commented Feb 6, 2024

Oh, and it looks like writing fewer zeros really pays off! ~1% performance improvement when decoding packs.

❯ hyperfine -w1 -N "gix free pack verify $PWD/tests/fixtures/repos/linux.git/objects/pack/pack-b73c0dd365b8ab881a163a1d342dac7416f52591.idx" "./target/release/gix free pack verify $PWD/tests/fixtures/repos/linux.git/objects/pack/pack-b73c0dd365b8ab881a163a1d342dac7416f52591.idx" -r4
Benchmark 1: gix free pack verify /Users/byron/dev/github.com/Byron/gitoxide/tests/fixtures/repos/linux.git/objects/pack/pack-b73c0dd365b8ab881a163a1d342dac7416f52591.idx
  Time (mean ± σ):      9.279 s ±  0.235 s    [User: 68.318 s, System: 2.532 s]
  Range (min … max):    9.077 s …  9.607 s    4 runs

Benchmark 2: ./target/release/gix free pack verify /Users/byron/dev/github.com/Byron/gitoxide/tests/fixtures/repos/linux.git/objects/pack/pack-b73c0dd365b8ab881a163a1d342dac7416f52591.idx
  Time (mean ± σ):      9.177 s ±  0.153 s    [User: 68.082 s, System: 2.528 s]
  Range (min … max):    9.081 s …  9.406 s    4 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  ./target/release/gix free pack verify /Users/byron/dev/github.com/Byron/gitoxide/tests/fixtures/repos/linux.git/objects/pack/pack-b73c0dd365b8ab881a163a1d342dac7416f52591.idx ran
    1.01 ± 0.03 times faster than gix free pack verify /Users/byron/dev/github.com/Byron/gitoxide/tests/fixtures/repos/linux.git/objects/pack/pack-b73c0dd365b8ab881a163a1d342dac7416f52591.idx

It's worth noting that I initially measured ~5% improvement, but that was with an older version of gix - it turns out that zlib performance has (probably) increased measurably over time.

@Byron Byron merged commit 684fa5c into GitoxideLabs:main Feb 6, 2024
@kornelski
Contributor Author

Thank you for the review.
