Move "loose object header" serialization and deserialisation to git-object #250

Conversation
The loose header manipulation currently lives in git-pack but depends on nothing *of* git-pack. Move it to git-object (`::encode` and `::decode`), and update the way the header is generated: a header should be no more than 28 bytes (6 bytes for the kind, 20 for the size since log10(2**64) ≈ 19.3, 1 for the space, and 1 for the NUL), so return a `SmallVec<[u8; 28]>` instead of writing it out directly. This should slightly reduce the amount of IO when writing the header to an unbuffered stream, and it also avoids the need for `header_buf` in the traversal state.

In order to generate the header without having to write out the entire content, add `WriteTo::size()`; the size is (relatively) easy to compute for all kinds. Ultimately this should avoid having to buffer or move around the object data before generating the header (though that's a bit TBD, I don't remember making those changes in git-pack).

This also requires adding size computations to `git_actor::Signature` and `git_actor::Time`. For the latter the result should be reasonably efficient[^bithack]. If the time part gets moved to 64b, this should probably be updated to use a lookup table[^lookup], or even better `u64::log10`, as hopefully it'll be stable by then[^70887].

Also add direct shortcuts to `WriteTo` (to generate a loose header) and `ObjectRef` (to directly parse a loose object).

Others:

* Lift conversion and header encoding to the start of `Sink::write_stream`, as they don't seem to depend on the hash kind.
* Alter `loose::Store` to inline `write_header` but extract the creation of the output stream. This reveals that the dispatch on hash kind wasn't useful anymore, as only the creation of the stream needs it (aside from `finalize_object`).

One concern I found when making this change is that things are kinda broken wrt 32/64b: the existing code is a bit mixed between usize and u64, but tends to store data in buffers, which work by usize anyway.
But of course using usize means broken / corrupted files > 4GB on 32b platforms, which is not great either. Then again, git itself has the same issue, except worse: it uses `unsigned long` internally, which is 32b not only on 32b platforms (ILP32) but also on 64-bit Windows (LLP64)...

Final note: the `cargo fmt` result for commits and tags is *really bad*, as it puts one sub-expression per line. A better alternative might be to provide size-only and `#[inline]` versions of the encoding helpers?

[^bithack]: it's considered a good solution when the input is uniformly distributed (http://graphics.stanford.edu/~seander/bithacks.html#IntegerLog10Obvious), and for us the input is biased toward efficiency: timestamp 1e9 is September 2001 and timestamp 1e8 is March 1973, so the vast majority of legit commits will take the first branch, and really only joke commits will fail the second.

[^lookup]: https://commaok.xyz/post/lookup_tables/ — however these solve the problem for u32.

[^70887]: rust-lang/rust#70887 — and efficiency improvements could be contributed to the stdlib directly.
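The `"<kind> <decimal size>\0"` layout and the 28-byte bound described above can be sketched as follows. This is a minimal illustration, not the actual git-object API: the helper names are made up, and a plain `Vec` stands in for the `SmallVec<[u8; 28]>` the PR proposes.

```rust
/// Encode `"<kind> <decimal size>\0"`. Worst case: 6 bytes for "commit",
/// 1 space, 20 digits (u64::MAX has 20 decimal digits), 1 NUL = 28 bytes.
fn loose_header(kind: &str, size: u64) -> Vec<u8> {
    let mut out = Vec::with_capacity(28);
    out.extend_from_slice(kind.as_bytes());
    out.push(b' ');
    out.extend_from_slice(size.to_string().as_bytes());
    out.push(0);
    out
}

/// Decode a header, returning (kind, size, offset just past the NUL).
fn parse_loose_header(data: &[u8]) -> Option<(&str, u64, usize)> {
    let space = data.iter().position(|&b| b == b' ')?;
    let nul = data.iter().position(|&b| b == 0)?;
    let kind = std::str::from_utf8(&data[..space]).ok()?;
    let size = std::str::from_utf8(&data[space + 1..nul]).ok()?.parse().ok()?;
    Some((kind, size, nul + 1))
}

fn main() {
    // The longest possible header: "commit" plus the 20 digits of u64::MAX.
    let h = loose_header("commit", u64::MAX);
    assert_eq!(h.len(), 28);
    let (kind, size, offset) = parse_loose_header(&h).unwrap();
    assert_eq!((kind, size, offset), ("commit", u64::MAX, h.len()));
}
```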
Not too sure what the test failure is, and I can't run this locally (machine has a longstanding problem with libssh2 / ssh2-rs).
Thanks a million for this wonderful PR! It's very much appreciated and I can't wait to merge.
It's all better with it, particularly the cleanup that allows loose headers to be returned directly.
> One concern I found when making this change is things are kinda broken wrt 32/64b: the existing code is a bit mixed between usize and u64, but tends to store data in buffers which work by usize anyway.
I value the change, as using usize more consistently is probably the way to go instead of using some form of 'pretend-u64', which breaks down sooner or later anyway. The README already contains this limitation, so it's probably OK for now and can be fixed if there is actual demand for 32-bit support with increased size limits.
Something I'd love to see generally is some `/* description */` of sorts for all the magic numbers when computing the size. That way it would resemble a description of the serialization format and be easier to understand for everyone perusing the code.
> This also requires adding size computations to `git_actor::Signature` and `git_actor::Time`. For the latter the result should be reasonably efficient[^bithack]. If the time part gets moved to 64b, this should probably be updated to use a lookup table[^lookup] or even better `u64::log10` as hopefully it'll be stable by then[^70887].
A bit over my head, and I am trusting you there.
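For reference, the bithack approach from the footnote amounts to a comparison cascade ordered so that the common case (modern timestamps, at least 1e9) exits on the first branch. A minimal sketch under that assumption; the function name is illustrative, not the git-actor API:

```rust
/// Count the decimal digits of a Unix timestamp, biased so that any
/// timestamp from September 2001 onward (>= 1e9) takes the first branch.
fn decimal_digits(t: u32) -> usize {
    if t >= 1_000_000_000 { return 10; } // the overwhelmingly common case
    if t >= 100_000_000 { return 9; }    // March 1973 .. September 2001
    if t >= 10_000_000 { return 8; }
    if t >= 1_000_000 { return 7; }
    if t >= 100_000 { return 6; }
    if t >= 10_000 { return 5; }
    if t >= 1_000 { return 4; }
    if t >= 100 { return 3; }
    if t >= 10 { return 2; }
    1
}

fn main() {
    // Cross-check against the string length for a spread of values.
    for t in [0u32, 9, 10, 99, 12_345, 999_999_999, 1_640_995_200, u32::MAX] {
        assert_eq!(decimal_digits(t), t.to_string().len());
    }
}
```

A lookup table or `u64::log10` would replace this cascade if the timestamp moves to 64 bits, as the PR description notes.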
One more kind request: I am using cargo smart-release for changelog generation, and it picks up conventional commits to put into the changelog and derive a new version number. git-pack has breaking changes, so I'd hope you can split the changes into two commits: one for new features/non-breaking changes as something like `feat: <short description>` (probably everything in this commit fits the body well), and another commit for the git-pack changes that does something like `remove!: <what was removed>`.
```rust
fn size(&self) -> usize {
    let hashsize = self.tree.kind().len_in_hex();
    b"tree".len()
```
`#[rustfmt::skip]` would help to have more specialized formatting without `cargo fmt` exploding it into newlines.

However, I'd really love to see what the number literals are there for, to help maintenance and be a learning experience at the same time, as it's basically a description of the serialized format then :).
Yep, the numeric constants are really just the literal characters (spaces and newlines mostly) inserted by the `trusted_*` functions, so e.g. `trusted_header_id(key, value)` is `key <space> hex <newline>`, hence `key.len() + 1 + len_in_hex + 1`.

But I think I'll update this to add e.g. `#[inline] encode::trusted_header_id_size(key, value) -> usize`; that way the non-trivial `size` implementations of Commit and Tag should be much easier to relate to the `write_to` implementation. I waffled a bit on doing it before pushing, but after your comments and the mess rustfmt makes of it, it seems like a cleaner solution than `rustfmt::skip`.
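The size-only counterparts discussed here are pure arithmetic over the `key <space> value <newline>` layout. A hypothetical sketch (the helper names mirror the discussion but are not necessarily the real git-object signatures):

```rust
/// Size of a `key <space> value <newline>` header line.
fn trusted_header_field_size(key: &[u8], value: &[u8]) -> usize {
    key.len() + 1 /* space */ + value.len() + 1 /* newline */
}

/// Size of a `key <space> hex-id <newline>` line, given the hash's hex length.
fn trusted_header_id_size(key: &[u8], len_in_hex: usize) -> usize {
    key.len() + 1 /* space */ + len_in_hex + 1 /* newline */
}

fn main() {
    // A `tree <40 hex digits>\n` line in a commit: 4 + 1 + 40 + 1 = 46 bytes.
    assert_eq!(trusted_header_id_size(b"tree", 40), 46);
    let line = b"tree aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\n";
    assert_eq!(line.len(), trusted_header_id_size(b"tree", 40));
}
```

Keeping each size helper next to its write helper makes the `size` and `write_to` implementations line up term by term, which is the maintainability point raised above.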
```rust
/// Deserialize an object from a loose serialisation
pub fn from_loose(data: &'a [u8]) -> Result<ObjectRef<'a>, LooseDecodeError> {
    let (kind, size, offset) = loose_header(data)?;
    Ok(Self::from_bytes(kind, &data[offset..offset + size])?)
```
Did you know that one can write `&data[offset..][..size]` here? So much easier on the eye, and definitely optimized away :).
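A tiny demonstration, on made-up data, that the chained-slice spelling is equivalent to the offset arithmetic it replaces:

```rust
fn main() {
    // "blob 11\0" is an 8-byte loose header, followed by an 11-byte payload.
    let data = b"blob 11\0hello world";
    let (offset, size) = (8usize, 11usize);
    // Both expressions denote the same subslice of `data`.
    assert_eq!(&data[offset..offset + size], &data[offset..][..size]);
    assert_eq!(&data[offset..][..size], b"hello world");
}
```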
Good point, I intellectually know about the pattern but I always have a hard time actually remembering to use it.
It feels like I only learned it a month ago, but when I did it was added everywhere :D - so amazing.
```rust
use crate::data::Object;

// FIXME: this entire thing should be in git-object! (probably) with
// the possible exception of checksum checking
```
I tentatively agree, let's please do this in a separate PR.
OK, and I'll have to remove that comment as well, apparently I forgot to fix/remove it during my last cleanup pass. Sorry about that.
Merging.

```rust
fn write(&self, object: impl WriteTo, hash: git_hash::Kind) -> Result<git_hash::ObjectId, Self::Error> {
    let mut to = self.dest(hash)?;
    to.write_all(&object.loose_header()).map_err(|err| Error::Io {
```
One more thing: I do feel uneasy with trusting the `size()` implementation just yet, so I'd like to see it run against a moderately sized pack. Fortunately, there is already a program to exercise this code path: `gix pack-explode`. This stress test could be amended by another run that writes actual objects instead of using the sink; it will automatically verify that the written hash matches the one from the pack. The stress test could be run against the rustc repository, for instance. Maybe there are other ways too, maybe I am just paranoid. At the time of this writing, 20% of the linux kernel pack objects were exploded to disk without issue.
The linux kernel pack could successfully exercise the new code path using computed sizes in the header. It would be nice to see how long exploding the Rust pack to disk takes on CI to evaluate if it's prohibitive or not to keep active.
```
➜  gitoxide git:(xmo-odoo-git-loose-objects) ./target/release/gix --verbose pack-explode tests/fixtures/repos/linux.git/objects/pack/pack-3ee05b0f4e4c2cb59757c95c68e2d13c0a491289.idx out
06:14:01 SHA1 of index done 212.8MB in 0.13s (1.6GB/s)
06:14:02 SHA1 of pack done 1.4GB in 0.92s (1.5GB/s)
06:14:02 collecting sorted index done 7.6M entries in 1.38s (5.5M entries/s)
06:40:38 Traversing of 7600359 objects done in 1595.78s (4762 objects/s, ~59.9 MB/s)
```
> maybe I am just paranoid. By the time of this writing, 20% of the linux kernel pack objects were exploded to disk without issue.
Seems to me you're merely cautious, which seems like a good idea.
Testing on linux kernel packs will probably have to wait until next week though: apparently I didn't have a local clone of the linux kernel repository, and the holiday DSL is unable to check it out. (I managed to find a workaround for the ssh2-rs issue so I could run the test suite; I have an odd failure in `various_termination_signals_remove_tempfiles_unconditionally`, but that might be because this machine runs a really old macOS because... reasons.)
Yeah, that's actually why I was disheartened by the rustfmt change: I tried to match the order of operations in the serialization (hence all the magic numbers inline).
Sure thing, I'll have to read up on that as I've never used that standard, but it seems like a completely fair request.
Thanks - it's really only for the changelog and auto-versioning. So no need to use it in every commit, only in those that affect the API.
Thanks so much for the initiative! I have taken the liberty of addressing the requested changes myself, and will definitely look into moving the data::Object into git-object. There was a reason for this not being done already, I am sure, but if I can't come up with it again I am happy to move it.
Hey yeah, excellent call, sorry about that. I got a bit busy after coming back from holidays and am not done setting up my new personal machine either.
It's correct to have it in git-pack due to the pack-location field, which is required for additional metadata to be available when creating packs. One might be able to re-arrange parts of the implementation so that the non-pack-location fields live in git-object. However, that wouldn't be very useful as of now, as the `Find` trait does have to return objects with pack-location information.