Gracefully handle OOM during decoding of pack entries #1275
Byron merged 3 commits into GitoxideLabs:main
Conversation
Byron left a comment
Thanks for bringing this up.
I hope you can profile the memory consumption of the program to see where memory gets stuck. On the bright side, gitoxide is made to reuse buffer space and avoid frequent allocations and deallocations, so decoding a whole pack will end up with buffers of just the right size to decode any object without allocating/reallocating. At least that's the theory, with an actual profile pending.
This also means that depending on usage, calling shrink_to_fit() might reduce the overall memory consumption as buffers could become larger than they have to be. It's certainly worth investigating and comparing capacity() with len().
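A minimal sketch of such an investigation (the function name and threshold are hypothetical, not gitoxide API): compare how much of a reused buffer is actually in use (`len`) versus kept allocated (`capacity`), and give memory back when the gap grows large.

```rust
// Hypothetical check on a reused decode buffer: report len vs. capacity and
// shrink when most of the allocation is unused.
fn report_and_maybe_shrink(buf: &mut Vec<u8>) {
    let (len, cap) = (buf.len(), buf.capacity());
    eprintln!("buffer uses {len} of {cap} allocated bytes");
    // Arbitrary policy: shrink once less than a quarter of the allocation is in use.
    if cap > 4096 && len < cap / 4 {
        buf.shrink_to_fit();
    }
}
```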
As for this PR, generally try_reserve(additional) seems to be used like try_reserve(target_len) (which is also how I would love to use this method), which is allocating much more than it needs.
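To illustrate the distinction (a sketch, not code from this PR): `Vec::try_reserve(additional)` asks for capacity *on top of* the current length, so passing the target length over-allocates whenever the buffer already holds data.

```rust
use std::collections::TryReserveError;

fn grow_to(buf: &mut Vec<u8>, target_len: usize) -> Result<(), TryReserveError> {
    // Too much: requests `target_len` extra elements beyond `buf.len()`.
    // buf.try_reserve(target_len)?;

    // Just enough: request only the missing capacity.
    buf.try_reserve(target_len.saturating_sub(buf.len()))?;
    Ok(())
}
```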
Force-pushed from 511a22a to 98a9b19.
I've also noticed a pattern of …. I'm not sure if you're OK with …. Also, this could be less repetitive with some helper functions. Shall I add some? In gix-utils?
Have you read this note?
Yes, please integrate it into the ….
I think that would be a good place, but let's not get ahead of ourselves, see below.

I don't have a good feeling about this PR, to be honest. It's a bit like running around with band-aid in a fight that can't be won. There are so many places where strings or small buffers are allocated, and they will trigger panics just like before. I don't think that any program that runs into OOM will be helped consistently by this patch series. If a server doesn't want to crash due to OOM, other mechanisms need to be found, probably related to outsourcing the code into a separate binary, or catching panics. Maybe there are other ways as well, like having a custom allocator that interacts with the application, somehow. Right now all code in ….

Of course, with a profile run one might be able to optimize memory usage and prevent OOMs that way, which of course I am very open to if you find the culprit.
Oops, fixed.
For the cache, I opted to return ….
That is true. However, in my experience a few checks on larger buffers work surprisingly well. Programs rarely operate juuust at the edge of running out of memory. Likelihood of hitting the limit is roughly proportional to the allocation size. Also, allocators tend to have caches of freelists or preallocated buckets that fit small strings, so handling just the large allocations goes a long way.

I know OOM handling is a tough sell, especially in Rust. But I think in handling of pack files it makes sense, since this is data coming from the network, with size mostly out of the control of the gix user (I do fetch arbitrary untrusted git URLs). A server can never guarantee it won't crash, but the severity and frequency of the crashes make a difference. A server shedding some load due to OOM can still make progress, but a crash aborts work in progress, adds more work to restart and reload, and potentially crashes again from the backlog of aborted requests being retried.

I'll investigate in more detail where memory usage comes from in my case. So far I don't think it's a bug. The crates.io repo is huge, plus I'm scanning it in parallel (which creates multiple thread-local repos), so I may just be actually running out of memory.
A good call, thank you.
Thanks for elaborating, I understand your position better now and share it. Another argument for keeping this PR is that on 32-bit systems it will no longer just panic, but report OOM instead (when converting integers), which is clearly an improvement.
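For context, the conversion in question has this shape (compare the diff hunk further down); the error type here is reduced to the one variant that matters, so this is a sketch rather than the actual gix-pack code.

```rust
#[derive(Debug)]
enum Error {
    OutOfMemory,
}

// `decompressed_size` is the u64 from the pack entry header. On 64-bit targets
// the conversion always succeeds; on 32-bit targets a value above usize::MAX
// now maps to Error::OutOfMemory instead of panicking.
fn entry_size(decompressed_size: u64) -> Result<usize, Error> {
    usize::try_from(decompressed_size).map_err(|_| Error::OutOfMemory)
}
```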
That would also be my expectation. I will review later to try and get this merged.
Byron left a comment
When using try_reserve(additional), in many places it's still reserving too much by treating buf_len as additional. I don't know how often I should write this, and it's a big reason I don't feel good about this PR.
Please review every single usage of try_reserve().
Maybe it would help if the scope of the PR didn't span multiple crates?
Apologies, I realise my mistake! I kept seeing …. Thus, I will review every single use of try_reserve().
Byron left a comment
Thanks a lot, and for bearing with me!
And I really hope this will make a difference for you!
This PR will be merged once CI is green.
I am very curious about the results of the additional profiling you might do, and hope you will share them.
A quick question I hope you could answer: what is the amount of available memory of the machine that keeps running into OOM?
Some(data) => {
buffer.resize(data.len(), 0);
buffer.copy_from_slice(data);
buffer.clear();
I will keep in mind that clear + extend is faster than resize + extend as it definitely won't write any zeros.
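A sketch of the two variants for filling a reused buffer from a cached slice (the function name is illustrative):

```rust
fn copy_into(buffer: &mut Vec<u8>, data: &[u8]) {
    // resize + copy: any newly grown region is zero-filled first, only to be
    // overwritten immediately by copy_from_slice.
    // buffer.resize(data.len(), 0);
    // buffer.copy_from_slice(data);

    // clear + extend: each byte is written exactly once, with no intermediate zeroing.
    buffer.clear();
    buffer.extend_from_slice(data);
}
```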
self.debug.put();
let (prev_cap, cur_cap);
let mut v = std::mem::take(&mut self.last_evicted);
self.mem_used -= v.capacity();
Thank you! The memory computation is much easier to follow now.
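A hedged sketch of the accounting idea only (field and method names are simplified, not the actual gix-pack cache API): `mem_used` tracks the sum of buffer capacities the cache owns, so a buffer's capacity is subtracted when it is taken out for reuse and added back when it is stored again.

```rust
struct Cache {
    last_evicted: Vec<u8>,
    mem_used: usize,
}

impl Cache {
    // Take the previously evicted buffer out for reuse; it no longer counts
    // against the cache's memory budget.
    fn take_evicted(&mut self) -> Vec<u8> {
        let v = std::mem::take(&mut self.last_evicted);
        self.mem_used -= v.capacity();
        v
    }

    // Put a buffer back; account for whatever capacity is actually allocated,
    // which may differ from its length.
    fn store_evicted(&mut self, v: Vec<u8>) {
        self.mem_used += v.capacity();
        self.last_evicted = v;
    }
}
```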
gix-diff/src/blob/pipeline.rs (outdated)
OutOfMemory,
}

impl From<TryReserveError> for Error {
I removed this as keeping the TryReserveError makes no difference to the already huge size of the error type (272 bytes). Note that for now I don't worry about huge error types as there are general problems around thiserror that need solving, and when solved the size issue would go away automatically.
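A sketch of what carrying the source error directly could look like with thiserror (the variant shape is illustrative, not the exact gix-diff type):

```rust
use std::collections::TryReserveError;

#[derive(Debug, thiserror::Error)]
pub enum Error {
    // Embedding the TryReserveError adds nothing to the size of an error type
    // that is already large, and `#[from]` lets `?` convert it implicitly.
    #[error("could not allocate memory")]
    OutOfMemory(#[from] TryReserveError),
}

fn reserve(out: &mut Vec<u8>, additional: usize) -> Result<(), Error> {
    out.try_reserve(additional)?;
    Ok(())
}
```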
let size: usize = entry.decompressed_size.try_into().map_err(|_| Error::OutOfMemory)?;
out.clear();
out.try_reserve(size)?;
out.resize(size, 0);
Having a clear() followed by a resize will definitely cause a lot of zeros to be written compared to the previous implementation. I will change it back to the original semantics, with try_reserve().
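A sketch of the adjusted pattern: keep the original resize semantics, which only zero-fill bytes beyond the current length, and put a fallible reservation in front instead of clearing first.

```rust
use std::collections::TryReserveError;

fn prepare(out: &mut Vec<u8>, size: usize) -> Result<(), TryReserveError> {
    // Only the missing capacity is requested, and resize zero-fills at most
    // `size - out.len()` bytes instead of the whole buffer after a clear().
    out.try_reserve(size.saturating_sub(out.len()))?;
    out.resize(size, 0);
    Ok(())
}
```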
target_buf[..last_result_size].copy_from_slice(&source_buf[..last_result_size]);
}
out.resize(last_result_size, 0);
debug_assert!(out.len() >= last_result_size);
Thank you, this assertion expresses the intent much better than a blank resize which should always truncate the vec anyway.
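For context, a sketch (not the exact diff): when the buffer is guaranteed to already be at least `last_result_size` long, `resize(new_len, 0)` degenerates to a truncation and never writes the fill value, and the assertion makes that precondition explicit.

```rust
fn finish(out: &mut Vec<u8>, last_result_size: usize) {
    // The caller guarantees `out` is long enough; the assertion documents this
    // in debug builds, and truncate never writes any fill bytes.
    debug_assert!(out.len() >= last_result_size);
    out.truncate(last_result_size);
}
```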
Oh, and it looks like writing fewer zeros really pays off! ~1% performance improvement when decoding packs. It's worth noting that I initially measured ~5% improvement, which was with an older version of ….
Thank you for the review.
gix_pack::data::file::decode::entry::<impl gix_pack::data::File>::decode_entry is causing an out-of-memory error for me. I'm scanning github.com/rust-lang/crates.io-index using crates_io_index.crates_parallel().
Unfortunately I can't reliably reproduce it, and the stack trace ends in a rayon worker.
I've added a bunch of try_reserve() to make the error less crashy. I realize it may be treating only the symptom.