
[Q&A] InflateState reset has measurable costs when decompressing millions of objects #89

Closed
Byron opened this issue Aug 4, 2020 · 3 comments · Fixed by #91

Comments

Byron commented Aug 4, 2020

When decompressing the objects in the linux kernel pack with 7.5 million objects/decompressions, I am using a boxed InflateState which is reset for each object to get ready for the next decompression. Note that this is the common case: it happens whenever a pack is received during a clone or fetch, as an index has to be built from the pack. gitoxide is already faster than git when resolving these objects (i.e. computing the SHA1), but is quite a bit slower during indexing, which is what I am trying to fix :D.

Since it comes with its own output buffer, this one is reset as well.
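To make the cost concrete, here is a minimal toy sketch of the reuse pattern described above. `DecompressState`, `reset`, and `OUT_BUF_SIZE` are hypothetical stand-ins, not the actual miniz_oxide API; the point is only that a full `reset` re-zeroes the internal output buffer on every one of millions of iterations:

```rust
const OUT_BUF_SIZE: usize = 32 * 1024;

// Toy stand-in for a boxed inflate state with its own output buffer.
struct DecompressState {
    out_buf: Box<[u8; OUT_BUF_SIZE]>, // internal output buffer
    out_pos: usize,                   // write position within the buffer
}

impl DecompressState {
    fn new() -> DecompressState {
        DecompressState {
            out_buf: Box::new([0u8; OUT_BUF_SIZE]),
            out_pos: 0,
        }
    }

    // Resetting zeroes the whole 32KB buffer. Done once this is cheap;
    // done 7.5 million times it shows up prominently in a profile.
    fn reset(&mut self) {
        self.out_buf.fill(0);
        self.out_pos = 0;
    }
}

// One state, reused across many small streams, reset before each one.
fn decompress_all(objects: usize) -> usize {
    let mut state = DecompressState::new();
    let mut total = 0;
    for _ in 0..objects {
        state.reset();      // the per-object zeroing this issue is about
        state.out_pos = 42; // stand-in for decompressing one small object
        total += state.out_pos;
    }
    total
}
```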

This shows up when profiling decompressing/indexing the entire kernel pack:

[Screenshot 2020-08-04 at 11:25:03]

Commenting out the line that resets the buffer has the desired effect, removing this cost entirely.

[Screenshot 2020-08-04 at 11:30:10]

The respective field has a TODO associated with it, and I wonder if you think my issue could be fixed as part of addressing that TODO.

Thanks for sharing your thoughts.

oyvindln (Collaborator) commented Aug 5, 2020

Hm, re-using the buffer without zeroing it could in theory pose a security risk if there were a bug somewhere in the decompression code (not that we have found anything like that through fuzzing). It's probably possible to use a structure or something similar to statically guarantee that old content is never accidentally read, though.

A simple temporary solution would be to add an extra reset function that doesn't zero out the buffer for uses like this, with an accompanying warning. In any case the streaming API needs a serious rework.
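A sketch of what that extra reset function might look like, using toy names (`InflateState`, `reset_keep_buffer`) that are assumptions here, not the names miniz_oxide actually adopted:

```rust
// Toy model of an inflate state with an internal output buffer.
struct InflateState {
    out_buf: Box<[u8]>,
    out_pos: usize,
}

impl InflateState {
    fn new() -> InflateState {
        InflateState {
            out_buf: vec![0u8; 32 * 1024].into_boxed_slice(),
            out_pos: 0,
        }
    }

    /// Full reset: zeroes the buffer so stale bytes can never leak
    /// into the output of the next stream, even if a decompression
    /// bug reads past what was written.
    fn reset(&mut self) {
        self.out_buf.fill(0);
        self.out_pos = 0;
    }

    /// Faster reset that keeps the buffer contents.
    ///
    /// Warning (as discussed above): if a decompression bug ever
    /// reads bytes that the current stream did not write, it could
    /// expose data left over from a previous stream.
    fn reset_keep_buffer(&mut self) {
        self.out_pos = 0;
    }
}
```

The trade-off is purely cost vs. defense in depth: both variants leave the state ready for the next stream, but only the full reset wipes the window.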

It is also possible that the compiler isn't optimizing the zeroing of the array well and is calling bzero more often than it needs to, but I haven't looked into what code that line generates.

Byron (Author) commented Aug 6, 2020

> A simple temporary solution would be to add an extra reset function that doesn't zero out the buffer for uses like this, with an accompanying warning. In any case the streaming API needs a serious rework.

I agree, that seems like the easiest course of action, and I am happy to submit a PR for it if you give me the go-ahead along with a name for that method :D.

On another note, can you think of any other overhead/cost, off the top of your head, associated with starting to decompress a stream? I had the feeling that even with this issue removed, gitoxide could only maintain about 25MB/s when decompressing many, many small streams, whereas git keeps up at ~50MB/s.

In any case, thanks a lot for your help!

Byron (Author) commented Aug 11, 2020

I took a closer look at this issue and did my best to ensure that the slowdown at the end of the pack (where there are many, many small streams) is not caused by anything outside of actually decompressing the streams.

To achieve that, I split the producing code that decompresses small objects into its own thread and avoided making any allocations there. The outcome (an entry) is sent to a consumer thread in chunks big enough to minimize synchronization overhead; the consumer integrates the entries into a tree. That way, it's easy to see if, for instance, the consumer is the bottleneck. Fortunately, one core can integrate millions of entries, and I have seen it receive peaks of close to 400k entries per second.
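The producer/consumer split with chunked sends can be sketched as follows. This is an illustrative std-only version, not gitoxide's actual code; `Entry`, `CHUNK_SIZE`, and the channel choice are assumptions:

```rust
use std::sync::mpsc;
use std::thread;

// Stand-in for one decoded pack entry.
struct Entry {
    id: u64,
}

// Batching entries into chunks means one channel send per CHUNK_SIZE
// entries instead of one per entry, cutting synchronization overhead.
const CHUNK_SIZE: usize = 1024;

fn produce_and_consume(total: u64) -> u64 {
    let (tx, rx) = mpsc::channel::<Vec<Entry>>();

    // Producer thread: "decompresses" entries and ships them in chunks.
    let producer = thread::spawn(move || {
        let mut chunk = Vec::with_capacity(CHUNK_SIZE);
        for id in 0..total {
            chunk.push(Entry { id }); // stand-in for decompressing one object
            if chunk.len() == CHUNK_SIZE {
                let full = std::mem::replace(&mut chunk, Vec::with_capacity(CHUNK_SIZE));
                tx.send(full).expect("consumer alive");
            }
        }
        if !chunk.is_empty() {
            tx.send(chunk).expect("consumer alive"); // flush the final partial chunk
        }
        // dropping tx closes the channel, ending the consumer loop
    });

    // Consumer: integrate entries (here, just count them).
    let mut integrated = 0u64;
    for chunk in rx {
        integrated += chunk.len() as u64;
    }
    producer.join().expect("producer finished cleanly");
    integrated
}
```

Because the consumer only synchronizes once per chunk, it is easy to tell from a profile whether it (rather than decompression) is the bottleneck.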

I have also applied the patch in the linked PR to miniz_oxide to shave off several seconds.

Below is how this looks after profiling the decompression of the linux kernel pack.

[Screenshot 2020-08-11 at 15:08:56]

Most time is spent in the producer thread decompressing entries, some time is spent on IO as well, which is expected.

Now I get these times, which are quite respectable: 52.1MB/s, vs 53.7MB/s for git itself.

[asciicast]

[asciicast]

Once we get the PR to a mergeable state and a new patch release out, I think this issue can be closed as well - it's good enough :).
