Move "loose object header" serialization and deserialisation to git-object #250

xmo-odoo · 2021-11-03T17:41:29Z

The loose header manipulation currently lives in git-pack but it depends on nothing of git-pack.

Move it to git-object (::encode and ::decode), as well as update the way it's generated: a header should be no more than 28 bytes: 6 bytes for the kind, 20 for the size (log10(2**64) = 19.3), 1 for the space and 1 for the NUL, so return a SmallVec<[u8, 28]> instead of writing it out directly (this should slightly reduce the amount of IO when writing out the header to an unbuffered stream), this also avoids the need for header_buf in the traversal state.

In order to generate the header without having to write out the entire content, add WriteTo::size(), the size is (relatively) easy to compute for all kinds. Ultimately this should avoid having to buffer or move around the object data before generating the header (though that's a bit TBD, I don't remember making those changes in git-pack).

This also requires adding size computations to git_actor::Signature and git_actor::Time. For the latter the result should be reasonably efficient¹. If the time part gets moved to 64b, this should probably be updated to use a lookup table² or even better u64::log10 as hopefully it'll be stable by then³.

Also add direct shortcuts to WriteTo (to generate a loose header) and ObjectRef (to directly parse a loose object).

Others:

Lift conversion and header encoding at the start of Sink::write_stream as they don't seem to depend on the hash kind.
Alter loose::Store to inlin write_header but extract the creation of the output stream, this reveals that the dispatch on hash kind wasn't useful anymore as only the creation of the stream needs it (aside from finalize_object).

One concern I found when making this change is things are kinda broken wrt 32/64b: the existing code is a bit mixed between usize and u64, but tends to store data in buffers which work by usize anyway.

But of course using usize means broken / corrupted files > 4GB on 32b platforms, which is not great either.

Then again git itself has the same issue except even worse: it uses unsigned long internally, which is not only 32b on 32b platforms (ILP32) but also on 64 bits windows (LLP64)...

Final note: the cargo fmt for commits and tags is really bad as it puts one sub-expression per line, a better alternative might be to provide size-only and [inline] versions of the encoding helpers?

it's considered a good solution when the input is uniformly distributed (http://graphics.stanford.edu/~seander/bithacks.html#IntegerLog10Obvious), and for us the input is biased toward efficiency: timestamp 1e9 is September 2001, and timestamp 1e8 is March 1973, so the vast majority of legit commits will take the first branch, and only joke commits will fail the second really. ↩
https://commaok.xyz/post/lookup_tables/ however these solve the problem for u32 ↩
https://github.com/rust-lang/rust/issues/70887 and efficiency improvements could be contributed to the stdlib directly. ↩

The loose header manipulation currently lives in git-pack but it depends on nothing *of* git-pack. Move it to git-object (`::encode` and `::decode`), as well as update the way it's generated: a header should be no more than 28 bytes: 6 bytes for the kind, 20 for the size (log10(2**64) = 19.3), 1 for the space and 1 for the NUL, so return a `SmallVec<[u8, 28]>` instead of writing it out directly (this should slightly reduce the amount of IO when writing out the header to an unbuffered stream), this also avoids the need for `header_buf` in the traversal state. In order to generate the header without having to write out the entire content, add `WriteTo::size()`, the size is (relatively) easy to compute for all kinds. Ultimately this should avoid having to buffer or move around the object data before generating the header (though that's a bit TBD, I don't remember making those changes in git-pack). This also requires adding size computations to `git_actor::Signature` and `git_actor::Time`. For the latter the result should be reasonably efficient[^bithack]. If the time part gets moved to 64b, this should probably be updated to use a lookup table[^lookup] or even better `u64::log10` as hopefully it'll be stable by then[^70887]. Also add direct shortcuts to `WriteTo` (to generate a loose header) and `ObjectRef` (to directly parse a loose object). Others: * Lift conversion and header encoding at the start of `Sink::write_stream` as they don't seem to depend on the hash kind. * Alter `loose::Store` to inlin `write_header` but extract the creation of the output stream, this reveals that the dispatch on hash kind wasn't useful anymore as only the creation of the stream needs it (aside from `finalize_object`). One concern I found when making this change is things are kinda broken wrt 32/64b: the existing code is a bit mixed between usize and u64, but tends to store data in buffers which work by usize anyway. But of course using usize means broken / corrupted files > 4GB on 32b platforms, which is not great either. Then again git itself has the same issue except even worse: it uses `unsigned long` internally, which is not only 32b on 32b platforms (ILP32) but also on 64 bits windows (LLP64)... Final note: the `cargo fmt` for commits and tags is *really bad* as it puts one sub-expression per line, a better alternative might be to provide size-only and `[inline]` versions of the encoding helpers? [^bithack]: it's considered a good solution when the input is uniformly distributed (http://graphics.stanford.edu/~seander/bithacks.html#IntegerLog10Obvious), and for us the input is biased toward efficiency: timestamp 1e9 is September 2001, and timestamp 1e8 is March 1973, so the vast majority of legit commits will take the first branch, and only joke commits will fail the second really. [^lookup]: https://commaok.xyz/post/lookup_tables/ however these solve the problem for u32 [^70887]: rust-lang/rust#70887 and efficiency improvements could be contributed to the stdlib directly.

xmo-odoo · 2021-11-03T18:57:51Z

Not too sure what the test failure is, and I can't run this locally (machine has a longstanding problem with libssh2 / ssh2-rs).

Byron

Thanks a million for this wonderful PR! It's very much appreciated and I can't wait to merge.

It's all better with it, particularly the cleanup that allows loose headers to be returned directly.

One concern I found when making this change is things are kinda broken wrt 32/64b: the existing code is a bit mixed between usize and u64, but tends to store data in buffers which work by usize anyway.

I value the change, as using usize more consistently is probably the way to go instead of using some form of 'pretend-u64' which breaks down sooner or later anyway. The README already contains this limitation, so it's probably OK for now and can be fixed if there is actual demand for 32 bit support with increased size limits.

Something I'd love to see generally is some /* description */ of sorts for all the magic numbers when computing the size. That way, it would resemble a description of the serialization format and is easier to understand for everyone perusing.

This also requires adding size computations to git_actor::Signature and git_actor::Time. For the latter the result should be reasonably efficient1. If the time part gets moved to 64b, this should probably be updated to use a lookup table2 or even better u64::log10 as hopefully it'll be stable by then3.

A bit over my head, and I am trusting you there.

One more kind request: I am using cargo smart-release for changelog generation and it picks up conventional commits to put into the changelog and derive a new version number. git-pack has breaking changes, so I'd hope you can split the changes into two commits, one for new features/non-breaking changes as something like feat: <short description>, probably everything in this commit fits the body well. And another commit for git-pack changes that does something like remove!: <what was removed>.

Byron · 2021-11-04T05:23:55Z

git-object/src/commit/write.rs


+    fn size(&self) -> usize {
+        let hashsize = self.tree.kind().len_in_hex();
+        b"tree".len()


#[rustfmt::skip] would help to have more specialized formatting without cargo fmt exploding it into newlines.

However, I'd really love to see what the number literals are there for, to help maintenance and be a learning experience at the same time as it's basically a description of the serialized format then :).

Yep, the numeric constants are really just the literal characters (spaces and newlines mostly) inserted by the trusted_* functions so e.g. trusted_header_id(key, value) is key <space> hex <newline> hence key.len() + 1 + len_in_hex + 1.

But I think I'll update this to add e.g. #[inline] encode::trusted_header_id_size(key, value) -> usize that way the non-trivial size implementations of Commit and Tag should be much easier to relate to the write_to implementation. I waffled a bit on doing it before pushing, but after your comments and the mess rustfmt makes of it it seems like a cleaner solution than rustfmt::skip.

Byron · 2021-11-04T05:36:30Z

git-object/src/object/mod.rs

+    /// Deserialize an object from a loose serialisation
+    pub fn from_loose(data: &'a [u8]) -> Result<ObjectRef<'a>, LooseDecodeError> {
+        let (kind, size, offset) = loose_header(data)?;
+        Ok(Self::from_bytes(kind, &data[offset..offset + size])?)


Did you know that one can write &data[offset..][..size]) here? So much easier on the eye and definitely optimized away :).

Good point, I intellectually know about the pattern but I always have a hard time actually remembering to use it.

It feels like I only learned it a month ago, but when I did it was added everywhere :D - so amazing.

Byron · 2021-11-04T05:49:06Z

git-pack/src/data/object.rs

 use crate::data::Object;

+// FIXME: this entire thing should be in git-object! (probably) with
+// the possible exception of checksum checking


I tentatively agree, let's please do this in a separate PR.

OK, and I'll have to remove that comment as well, apparently I forgot to fix/remove it during my last cleanup pass. Sorry about that.

Byron · 2021-11-04T06:00:17Z

Not too sure what the test failure is, and I can't run this locally (machine has a longstanding problem with libssh2 / ssh2-rs).

Merging main will fix the issue, curl needed an update.

Byron · 2021-11-04T06:18:35Z

git-odb/src/store/loose/write.rs


+    fn write(&self, object: impl WriteTo, hash: git_hash::Kind) -> Result<git_hash::ObjectId, Self::Error> {
+        let mut to = self.dest(hash)?;
+        to.write_all(&object.loose_header()).map_err(|err| Error::Io {


One more thing: I do feel uneasy with trusting the size() implementation just yet, so I'd like to see it run against a moderately sized pack

Fortunately, there is already a program to exercise this code path, it's gix pack-explode. This stress test could be amended by another run that writes actual objects instead of using the sink. It will automatically verify that the written hash matches the one from the pack.

The stress test could be run against the Rustc repository for instance. Maybe there are other ways too though, maybe I am just paranoid. By the time of this writing, 20% of the linux kernel pack objects were exploded to disk without issue.

The linux kernel pack could successfully exercise the new code path using computed sizes in the header. It would be nice to see how long exploding the Rust pack to disk takes on CI to evaluate if it's prohibitive or not to keep active.

➜ gitoxide git:(xmo-odoo-git-loose-objects) ./target/release/gix --verbose pack-explode tests/fixtures/repos/linux.git/objects/pack/pack-3ee05b0f4e4c2cb59757c95c68e2d13c0a491289.idx out 06:14:01 SHA1 of index done 212.8MB in 0.13s (1.6GB/s) 06:14:02 SHA1 of pack done 1.4GB in 0.92s (1.5GB/s) 06:14:02 collecting sorted index done 7.6M entries in 1.38s (5.5M entries/s) 06:40:38 Traversing of 7600359 objects done in 1595.78s (4762 objects/s, ~59.9 MB/s)

maybe I am just paranoid. By the time of this writing, 20% of the linux kernel pack objects were exploded to disk without issue.

Seems to me you're merely cautious which seems like a good idea.

Testing on linux kernel packs will probably have to wait until next week though: apparently I didn't have a local clone of the linux kernel repository, and the holidays DSL is unable to check it out.

(I managed to find a workaround for the ssh2-rs issue so I could run the test suite, I've an odd failure in various_termination_signals_remove_tempfiles_unconditionally but that might be because this machine runs a really old macos because... reasons).

xmo-odoo · 2021-11-04T12:04:03Z

Something I'd love to see generally is some /* description */ of sorts for all the magic numbers when computing the size. That way, it would resemble a description of the serialization format and is easier to understand for everyone perusing.

Yeah that's actually why I was disheartened by the rustfmt change: I tried to match the order of operations in the serialization (hence all the + 1, that's basically where the spaces and newlines are in the normal serialization and I figured the compiler's constant-folding would take care of it — same with computing the size of string/bytes literals, since that's a const function on const values the compiler seems to just handle everything).

I think for clarity and maintainability adding encode::*_size versions of the utility encodes will probably be for the best, that way it should much more closely match write_to, and thus be easier to maintain.

One more kind request: I am using cargo smart-release for changelog generation and it picks up conventional commits to put into the changelog and derive a new version number. git-pack has breaking changes, so I'd hope you can split the changes into two commits, one for new features/non-breaking changes as something like feat: <short description>, probably everything in this commit fits the body well. And another commit for git-pack changes that does something like remove!: <what was removed>.

Sure thing, I'll have to read up on that as I've never used that standard but it seems like a completely fair request.

Byron · 2021-11-05T06:02:35Z

Sure thing, I'll have to read up on that as I've never used that standard but it seems like a completely fair request

Thanks - it's really only for the changelog and auto-versioning. So no need to use it in any commit either, but only in those that affect the API.

Byron · 2021-11-18T09:09:33Z

Thanks so much for the initiative, I have taken the liberty of addressing the requested changes myself and will definitely look into moving the data::Object into git-object. There was a reason for this not being done already, I am sure, but if I can't come up with it again I am happy to move it to git-object.

…#250)

xmo-odoo · 2021-11-18T09:33:29Z

Hey yeah, excellent call, sorry about that I got a bit busy after coming back from holidays and am not done setting up my new personal machine either.

It's correct to have it in git-pack due to the pack-location field. The latter is required for additional metadata to be available when creating packs. One might be able to re-arrange parts of the implementation so that the non-packlocation fields live in git-object. However, the latter wouldn't be very useful as of now as the `Find` trait does have to return objects with pack location information.

xmo-odoo marked this pull request as draft November 3, 2021 17:54

Byron reviewed Nov 4, 2021

View reviewed changes

Byron added a commit that referenced this pull request Nov 18, 2021

refactor!: move loose header manipulation from git-pack to git-object (…

598698b

…#250)

Byron merged commit ee737cd into GitoxideLabs:main Nov 18, 2021

xmo-odoo deleted the git-loose-objects branch November 18, 2021 09:31

yoshuawuyts mentioned this pull request Aug 10, 2022

Tracking Issue for Integer::{ilog,ilog2,ilog10} rust-lang/rust#70887

Closed

8 tasks

Uh oh!

Move "loose object header" serialization and deserialisation to git-object #250

Move "loose object header" serialization and deserialisation to git-object #250

Uh oh!

Conversation

xmo-odoo commented Nov 3, 2021

Footnotes

Uh oh!

xmo-odoo commented Nov 3, 2021

Uh oh!

Byron left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Byron commented Nov 4, 2021

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

xmo-odoo commented Nov 4, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Byron commented Nov 5, 2021

Uh oh!

Byron commented Nov 18, 2021

Uh oh!

xmo-odoo commented Nov 18, 2021

Uh oh!

Uh oh!

xmo-odoo commented Nov 4, 2021 •

edited

Loading