Skip to content

Conversation

xmo-odoo
Copy link
Contributor

@xmo-odoo xmo-odoo commented Nov 3, 2021

The loose header manipulation currently lives in git-pack but it depends on nothing of git-pack.

Move it to git-object (::encode and ::decode), as well as update the way it's generated: a header should be no more than 28 bytes: 6 bytes for the kind, 20 for the size (log10(2**64) = 19.3), 1 for the space and 1 for the NUL, so return a SmallVec<[u8, 28]> instead of writing it out directly (this should slightly reduce the amount of IO when writing out the header to an unbuffered stream), this also avoids the need for header_buf in the traversal state.

In order to generate the header without having to write out the entire content, add WriteTo::size(), the size is (relatively) easy to compute for all kinds. Ultimately this should avoid having to buffer or move around the object data before generating the header (though that's a bit TBD, I don't remember making those changes in git-pack).

This also requires adding size computations to git_actor::Signature and git_actor::Time. For the latter the result should be reasonably efficient1. If the time part gets moved to 64b, this should probably be updated to use a lookup table2 or even better u64::log10 as hopefully it'll be stable by then3.

Also add direct shortcuts to WriteTo (to generate a loose header) and ObjectRef (to directly parse a loose object).

Others:

  • Lift conversion and header encoding at the start of Sink::write_stream as they don't seem to depend on the hash kind.
  • Alter loose::Store to inlin write_header but extract the creation of the output stream, this reveals that the dispatch on hash kind wasn't useful anymore as only the creation of the stream needs it (aside from finalize_object).

One concern I found when making this change is things are kinda broken wrt 32/64b: the existing code is a bit mixed between usize and u64, but tends to store data in buffers which work by usize anyway.

But of course using usize means broken / corrupted files > 4GB on 32b platforms, which is not great either.

Then again git itself has the same issue except even worse: it uses unsigned long internally, which is not only 32b on 32b platforms (ILP32) but also on 64 bits windows (LLP64)...

Final note: the cargo fmt for commits and tags is really bad as it puts one sub-expression per line, a better alternative might be to provide size-only and [inline] versions of the encoding helpers?

Footnotes

  1. it's considered a good solution when the input is uniformly distributed (http://graphics.stanford.edu/~seander/bithacks.html#IntegerLog10Obvious), and for us the input is biased toward efficiency: timestamp 1e9 is September 2001, and timestamp 1e8 is March 1973, so the vast majority of legit commits will take the first branch, and only joke commits will fail the second really.

  2. https://commaok.xyz/post/lookup_tables/ however these solve the problem for u32

  3. https://github.com/rust-lang/rust/issues/70887 and efficiency improvements could be contributed to the stdlib directly.

@xmo-odoo xmo-odoo marked this pull request as draft November 3, 2021 17:54
The loose header manipulation currently lives in git-pack but it
depends on nothing *of* git-pack.

Move it to git-object (`::encode` and `::decode`), as well as update
the way it's generated: a header should be no more than 28 bytes: 6
bytes for the kind, 20 for the size (log10(2**64) = 19.3), 1 for the
space and 1 for the NUL, so return a `SmallVec<[u8, 28]>` instead of
writing it out directly (this should slightly reduce the amount of IO
when writing out the header to an unbuffered stream), this also avoids
the need for `header_buf` in the traversal state.

In order to generate the header without having to write out the entire
content, add `WriteTo::size()`, the size is (relatively) easy to
compute for all kinds. Ultimately this should avoid having to buffer
or move around the object data before generating the header (though
that's a bit TBD, I don't remember making those changes in git-pack).

This also requires adding size computations to `git_actor::Signature`
and `git_actor::Time`. For the latter the result should be reasonably
efficient[^bithack]. If the time part gets moved to 64b, this should
probably be updated to use a lookup table[^lookup] or even better
`u64::log10` as hopefully it'll be stable by then[^70887].

Also add direct shortcuts to `WriteTo` (to generate a loose header)
and `ObjectRef` (to directly parse a loose object).

Others:

* Lift conversion and header encoding at the start of
  `Sink::write_stream` as they don't seem to depend on the hash kind.
* Alter `loose::Store` to inlin `write_header` but extract the
  creation of the output stream, this reveals that the dispatch on
  hash kind wasn't useful anymore as only the creation of the stream
  needs it (aside from `finalize_object`).

One concern I found when making this change is things are kinda broken
wrt 32/64b: the existing code is a bit mixed between usize and u64,
but tends to store data in buffers which work by usize anyway.

But of course using usize means broken / corrupted files > 4GB on 32b
platforms, which is not great either.

Then again git itself has the same issue except even worse: it uses
`unsigned long` internally, which is not only 32b on 32b platforms
(ILP32) but also on 64 bits windows (LLP64)...

Final note: the `cargo fmt` for commits and tags is *really bad* as it
puts one sub-expression per line, a better alternative might be to
provide size-only and `[inline]` versions of the encoding helpers?

[^bithack]: it's considered a good solution when the input is
    uniformly distributed
    (http://graphics.stanford.edu/~seander/bithacks.html#IntegerLog10Obvious),
    and for us the input is biased toward efficiency: timestamp 1e9 is
    September 2001, and timestamp 1e8 is March 1973, so the vast
    majority of legit commits will take the first branch, and only
    joke commits will fail the second really.
[^lookup]: https://commaok.xyz/post/lookup_tables/ however these solve
    the problem for u32
[^70887]: rust-lang/rust#70887 and
    efficiency improvements could be contributed to the stdlib
    directly.
@xmo-odoo
Copy link
Contributor Author

xmo-odoo commented Nov 3, 2021

Not too sure what the test failure is, and I can't run this locally (machine has a longstanding problem with libssh2 / ssh2-rs).

Copy link
Member

@Byron Byron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a million for this wonderful PR! It's very much appreciated and I can't wait to merge.

It's all better with it, particularly the cleanup that allows loose headers to be returned directly.

One concern I found when making this change is things are kinda broken wrt 32/64b: the existing code is a bit mixed between usize and u64, but tends to store data in buffers which work by usize anyway.

I value the change, as using usize more consistently is probably the way to go instead of using some form of 'pretend-u64' which breaks down sooner or later anyway. The README already contains this limitation, so it's probably OK for now and can be fixed if there is actual demand for 32 bit support with increased size limits.

Something I'd love to see generally is some /* description */ of sorts for all the magic numbers when computing the size. That way, it would resemble a description of the serialization format and is easier to understand for everyone perusing.

This also requires adding size computations to git_actor::Signature and git_actor::Time. For the latter the result should be reasonably efficient1. If the time part gets moved to 64b, this should probably be updated to use a lookup table2 or even better u64::log10 as hopefully it'll be stable by then3.

A bit over my head, and I am trusting you there.


One more kind request: I am using cargo smart-release for changelog generation and it picks up conventional commits to put into the changelog and derive a new version number. git-pack has breaking changes, so I'd hope you can split the changes into two commits, one for new features/non-breaking changes as something like feat: <short description>, probably everything in this commit fits the body well. And another commit for git-pack changes that does something like remove!: <what was removed>.


fn size(&self) -> usize {
let hashsize = self.tree.kind().len_in_hex();
b"tree".len()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#[rustfmt::skip] would help to have more specialized formatting without cargo fmt exploding it into newlines.

However, I'd really love to see what the number literals are there for, to help maintenance and be a learning experience at the same time as it's basically a description of the serialized format then :).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, the numeric constants are really just the literal characters (spaces and newlines mostly) inserted by the trusted_* functions so e.g. trusted_header_id(key, value) is key <space> hex <newline> hence key.len() + 1 + len_in_hex + 1.

But I think I'll update this to add e.g. #[inline] encode::trusted_header_id_size(key, value) -> usize that way the non-trivial size implementations of Commit and Tag should be much easier to relate to the write_to implementation. I waffled a bit on doing it before pushing, but after your comments and the mess rustfmt makes of it it seems like a cleaner solution than rustfmt::skip.

/// Deserialize an object from a loose serialisation
pub fn from_loose(data: &'a [u8]) -> Result<ObjectRef<'a>, LooseDecodeError> {
let (kind, size, offset) = loose_header(data)?;
Ok(Self::from_bytes(kind, &data[offset..offset + size])?)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you know that one can write &data[offset..][..size]) here? So much easier on the eye and definitely optimized away :).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I intellectually know about the pattern but I always have a hard time actually remembering to use it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels like I only learned it a month ago, but when I did it was added everywhere :D - so amazing.

use crate::data::Object;

// FIXME: this entire thing should be in git-object! (probably) with
// the possible exception of checksum checking
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tentatively agree, let's please do this in a separate PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, and I'll have to remove that comment as well, apparently I forgot to fix/remove it during my last cleanup pass. Sorry about that.

@Byron
Copy link
Member

Byron commented Nov 4, 2021

Not too sure what the test failure is, and I can't run this locally (machine has a longstanding problem with libssh2 / ssh2-rs).

Merging main will fix the issue, curl needed an update.


fn write(&self, object: impl WriteTo, hash: git_hash::Kind) -> Result<git_hash::ObjectId, Self::Error> {
let mut to = self.dest(hash)?;
to.write_all(&object.loose_header()).map_err(|err| Error::Io {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more thing: I do feel uneasy with trusting the size() implementation just yet, so I'd like to see it run against a moderately sized pack

Fortunately, there is already a program to exercise this code path, it's gix pack-explode. This stress test could be amended by another run that writes actual objects instead of using the sink. It will automatically verify that the written hash matches the one from the pack.

The stress test could be run against the Rustc repository for instance. Maybe there are other ways too though, maybe I am just paranoid. By the time of this writing, 20% of the linux kernel pack objects were exploded to disk without issue.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The linux kernel pack could successfully exercise the new code path using computed sizes in the header. It would be nice to see how long exploding the Rust pack to disk takes on CI to evaluate if it's prohibitive or not to keep active.

➜  gitoxide git:(xmo-odoo-git-loose-objects) ./target/release/gix --verbose pack-explode tests/fixtures/repos/linux.git/objects/pack/pack-3ee05b0f4e4c2cb59757c95c68e2d13c0a491289.idx out
 06:14:01 SHA1 of index done 212.8MB in 0.13s (1.6GB/s)
 06:14:02  SHA1 of pack done 1.4GB in 0.92s (1.5GB/s)
 06:14:02 collecting sorted index done 7.6M entries in 1.38s (5.5M entries/s)
 06:40:38              Traversing of 7600359 objects done in 1595.78s (4762 objects/s, ~59.9 MB/s)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe I am just paranoid. By the time of this writing, 20% of the linux kernel pack objects were exploded to disk without issue.

Seems to me you're merely cautious which seems like a good idea.

Testing on linux kernel packs will probably have to wait until next week though: apparently I didn't have a local clone of the linux kernel repository, and the holidays DSL is unable to check it out.

(I managed to find a workaround for the ssh2-rs issue so I could run the test suite, I've an odd failure in various_termination_signals_remove_tempfiles_unconditionally but that might be because this machine runs a really old macos because... reasons).

@xmo-odoo
Copy link
Contributor Author

xmo-odoo commented Nov 4, 2021

Something I'd love to see generally is some /* description */ of sorts for all the magic numbers when computing the size. That way, it would resemble a description of the serialization format and is easier to understand for everyone perusing.

Yeah that's actually why I was disheartened by the rustfmt change: I tried to match the order of operations in the serialization (hence all the + 1, that's basically where the spaces and newlines are in the normal serialization and I figured the compiler's constant-folding would take care of it — same with computing the size of string/bytes literals, since that's a const function on const values the compiler seems to just handle everything).

I think for clarity and maintainability adding encode::*_size versions of the utility encodes will probably be for the best, that way it should much more closely match write_to, and thus be easier to maintain.

One more kind request: I am using cargo smart-release for changelog generation and it picks up conventional commits to put into the changelog and derive a new version number. git-pack has breaking changes, so I'd hope you can split the changes into two commits, one for new features/non-breaking changes as something like feat: <short description>, probably everything in this commit fits the body well. And another commit for git-pack changes that does something like remove!: <what was removed>.

Sure thing, I'll have to read up on that as I've never used that standard but it seems like a completely fair request.

@Byron
Copy link
Member

Byron commented Nov 5, 2021

Sure thing, I'll have to read up on that as I've never used that standard but it seems like a completely fair request

Thanks - it's really only for the changelog and auto-versioning. So no need to use it in any commit either, but only in those that affect the API.

@Byron
Copy link
Member

Byron commented Nov 18, 2021

Thanks so much for the initiative, I have taken the liberty of addressing the requested changes myself and will definitely look into moving the data::Object into git-object. There was a reason for this not being done already, I am sure, but if I can't come up with it again I am happy to move it to git-object.

@Byron Byron merged commit ee737cd into GitoxideLabs:main Nov 18, 2021
@xmo-odoo xmo-odoo deleted the git-loose-objects branch November 18, 2021 09:31
@xmo-odoo
Copy link
Contributor Author

Hey yeah, excellent call, sorry about that I got a bit busy after coming back from holidays and am not done setting up my new personal machine either.

Byron added a commit that referenced this pull request Nov 19, 2021
It's correct to have it in git-pack due to the pack-location field.
The latter is required for additional metadata to be available when
creating packs.

One might be able to re-arrange parts of the implementation so that
the non-packlocation fields live in git-object. However, the latter
wouldn't be very useful as of now as the `Find` trait does have to
return objects with pack location information.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants