Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[commitgraph] Implement commit-graph-verify plumbing command. #26

Merged
merged 9 commits into from
Oct 13, 2020

Conversation

avoidscorn
Copy link
Contributor

Commit graphs are optional, so commit-graph plumbing is guarded by a
commitgraph feature that is enabled by default.

@Byron
Copy link
Owner

Byron commented Oct 2, 2020

Thanks a lot! This is indeed a nice increment, and I will take a closer look soon.
Even though commit-graph is optional within a repository, I don't think it's worth feature gating it as it is not driving up compile times with its ~1000 lines of Rust code. Rather I prefer to provide the capability similar to how git does, hoping that one day we can even generate it, maybe fast enough to do that by default and transparently.

edit: just saw it's a draft. Will take a look when ready. Happy coding!

Commit graphs are optional, so commit-graph plumbing is guarded by a
`commitgraph` feature that is enabled by default.
It doesn't look like `quick_error` supports type parameters on error
enums, and I want to use type parameters for verification errors.
This constructor tries to accept as many commit-graph-looking paths as
possible, such as `.git/objects/info`,
`.git/objects/info/commit-graph`, or `.git/objects/info/commit-graphs`.
* It doesn't use evil `as`.
* It's a bit less code.
* It should be a tiny bit faster.
* It is slightly more strict in that it will panic if the base graphs
  list is not evenly divisible into hashes. The file parser's error
  checking should ensure that this panic never happens, but maybe maybe
  maybe...
@avoidscorn avoidscorn changed the title [commitgraph] Stub out commit-graph-verify plumbing command. [commitgraph] Implement commit-graph-verify plumbing command. Oct 11, 2020
@avoidscorn avoidscorn marked this pull request as ready for review October 11, 2020 02:59
@avoidscorn
Copy link
Contributor Author

Completed the non-parallel, no-progress implementation of commit-graph-verify.

I think the most iffy part of this is File::traverse's two error types: Error and EitherError. I wanted graph::verify::Error to be able to contain file::verify::Error values, but for convenience I also wanted Graph::verify_integrity to use the same error type inside and outside of its processor closure.

I'm also unsure how closely you want to match git in terms of verification error messages and test cases.

@Byron
Copy link
Owner

Byron commented Oct 11, 2020

Completed the non-parallel, no-progress implementation of commit-graph-verify.

That sounds like great progress! I wonder how fast verification already is. After all, I wouldn't expect the graph files to be too large, to the point where single-threaded operation might already be fast enough for most. After all, using git-features::parallel will drive the code complexity up a notch.
Of course it would be interesting to see how it fares against git commit-graph verify, I could imagine it's competitive already.

The same applies to adding progress, to some extend, even though it's usually much easier to accomplish. It might be a waste though if the average verify operation takes less than 1s, the time the 'line renderer' will wait until it shows up at all.

Edit: I just noticed that git also looks up the actual commit object for additional validation. This is something the processor will be used for, and with that it certainly drives up the processing time. The current plumbing implementation could implement it, as git-odb::compound::Db supports full git object databases. This would also be its first use, so that could be very interesting.

I think the most iffy part of this is File::traverse's two error types: Error and EitherError. I wanted graph::verify::Error to be able to contain file::verify::Error values, but for convenience I also wanted Graph::verify_integrity to use the same error type inside and outside of its processor closure.

I took a look and understand why that is necessary. Something leaving a bitter taste is the asymmetry this causes in the API. One could (probably) argue that Graph::verify_integrity could also use EitherError. But would that be practical or will it ever be useful?

A somewhat less complex way of restoring symmetry could be to put the Processor Error into file::verify::Error similarly to how it's done in graph::verify::Error, and that would require a separate ProcessorError for all error cases that can happen within the Processor itself to prevent graph::verify::Error from being recursive. This would get rid of EitherError and add a new specific error for everything that can happen within the processor.

To me that doesn't sound too bad, at least if a symmetric API is considered valuable, and is my preferred solution. Please let me know if I am missing something though, as what's there is definitely doing the job well enough already.

I'm also unsure how closely you want to match git in terms of verification error messages and test cases.

This is entirely up to you. gitoxide is not trying to mirror git in every way, but wants to improve it where it matters. This could be performance, but probably is ease of use in the most cases. That said, verification messages could be whatever is most descriptive for the error at hand - and if that's what git is using, it should be fair to use that too.

When it comes to test-cases, I tend to be very pragmatic and go with a happy-path test for every feature (or command line flag), along with at least one sad-path test to see that verification actually works while producing acceptable error messages, without the need to test for every variant in the error enum at all. Thus git's tests are certainly more exhaustive, but there is value to keep it simple and lean at the beginning to help moving fast, at the cost of being reactive in case bugs inevitably come in with tests for their fixes only.

Great work thus far, thanks so much! I am super interested in seeing this in a stress test, and of course to compare its performance with git, which I think is single-threaded too.

@avoidscorn
Copy link
Contributor Author

avoidscorn commented Oct 11, 2020

Here are some quick measurements on the most pathological repo I could find (https://github.com/cirosantilli/test-many-commits-1m):

  • git commit-graph verify takes 14.5s and reports progress. Single threaded.
  • git -c core.commitGraph=false merge-base --is-ancestor <oldest commit> <newest commit> takes 8.3 seconds. Single threaded.
  • git -c core.commitGraph=true merge-base --is-ancestor <oldest commit> <newest commit> takes 1.3 seconds. Single threaded.

I'll revisit the EitherError stuff. I'm actually thinking of moving all commit-level stuff from File::traverse to Graph::verify_integrity, as that would eliminate having multiple layers of processors, which I think side-steps this whole issue. If not, I'll just create a private graph::verify::ProcessorError and map its variants to graph::verify::Error.

Edit: If we go with the graph::verify::ProcessorError approach, do you want those errors mapped to graph::verify::Errors before Graph::verify_integrity returns, or do you want them exposed as graph::verify::Error::File(file::verify::Error::Processor(graph::verify::ProcessorError))? The later seems super-gross, as it implies graph::verify::Error::File(file::verify::Error::Processor(graph::verify::ProcessorError::Processor(<caller-supplied error>))) is also a thing.

Can git-commitgraph depend on git-odb? If not, I'm not sure where you'd want to put full commit-graph + object database verification and graph-algorithm code.

@Byron
Copy link
Owner

Byron commented Oct 11, 2020

git commit-graph verify takes 14.5s and reports progress. Single threaded.

This looks very much like progress should be implemented then. It's quite swift given that it probably makes 1 million object lookups… if that's true, it would do 71k lookups per second on a single thread, which seems like a lot judging from my experience with git's performance with pack lookups. Alternatively your CPU is ~2.5x of what I am using (also not too unlikely :D).

The other timings will also be interesting especially once we have a similar command.

[…] The later seems super-gross, as it implies graph::verify::Error::File(file::verify::Error::Processor(graph::verify::ProcessorError::Processor()))

That's true, it looks like the nesting level gets a little deeper than I thought. I hoped it would look like this when matched:

match err {
  graph::verify::Error::File(file::verify::Error::Processor(err)) => {case 1: I thought it would be this…},
  graph::verify::Error::File(file::verify::Error::Processor(graph::verify::ProcessorError::Processor(err))) => {case 2: …and agree this seems a bit unwieldy, despite being lossless and correct},
  _ => {}
}

Case 1 seems quite alright still, even though case 2 might be a bit too much, especially if you see a way to sidestep the issue.

Can git-commitgraph depend on git-odb? If not, I'm not sure where you'd want to put full commit-graph + object database verification and graph-algorithm code.

That's an interesting one! Thus far gitoxide-core was able to play a glue-role, bringing everything together, without having to know too much. By the looks of the canonical implementation, the processor looking up the commit would have to know an aweful lot about the commit-graph, which I'd certainly want to avoid putting into gitoxide-core.

What about putting the 'detailed verification' behind a feature flag in git-commitgraph? That way it could implement a processor bound to a compound::Db, which gitoxide-core could just use after instantiating one from an object directory, something like this:

let db = git_odb::compound::Db::at(object_dir)?;
graph.verify_integrity(, |commit| verify_detailed(commit, &db),);

Thinking about it, there might be another way. Looking up an object could feasibly be put behind an impl of a function that does the lookup, completely avoiding having to know git-odb.

let db = git_odb::compound::Db::at(object_dir)?;
let mut buf = Vec::new();
graph.verify_integrity(, |commit| verify_detailed(commit, |id| db.locate(id, &mut buf).transpose().ok().flatten().and_then(borrowed::Object::as_commit).map(borrowed::Commit::to_owned),);

The above would not work in a multi-threaded environment, but could be made to work by making a closure returning a closure for lookups per thread (this is common practice in git-odb already, but complicates things as you can imagine). It's certainly alright to aim for single-threadedness first to get first performance results.

I would certainly prefer the second one as it is general, but can live with the first one as it's simpler to get started with.
Maybe I am missing something, what do you think about these options?

@avoidscorn
Copy link
Contributor Author

graph.verify_integrity(…, |commit| verify_detailed(commit, |id| db.locate(id, &mut buf).transpose().ok().flatten().and_then(borrowed::Object::as_commit).map(borrowed::Commit::to_owned), …);

Not gonna lie, that line keeps me up at night. Almost as much as EitherError.

I'd like to follow this up with implementing cargo bench support using the Criterion library. I'd reuse tests/fixtures/repos and add tests/fixtures/commit-graphs. For ease of use, I'd like to make the benchmarks populate those fixtures as needed. Maybe via lazy_static.

Missing features:
1. It operates on commit-graph files only, so it doesn't verify that
   commit-graph data matches `git-odb` data.
2. No progress reporting or parallelization. This shouldn't be needed
   until until we need to check against `git-odb` data.

Example output for Linux repo:
```
$ time ./target/release/gixp commit-graph-verify -s ~/src/linux/.git/objects/info
number of commits with the given number of parents
	 0: 4
	 1: 878988
	 2: 67800
	 3: 652
	 4: 408
	 5: 382
	 6: 454
	 7: 95
	 8: 65
	 9: 47
	10: 25
	11: 26
	12: 14
	13: 4
	14: 3
	18: 1
	19: 1
	20: 1
	21: 1
	24: 1
	27: 1
	30: 1
	32: 1
	66: 1
	->: 948976

longest path length between two commits: 160521

real	0m0.196s
user	0m0.180s
sys	0m0.016s
```
Get rid of `file::verify::EitherError` in favor of having
`graph::verify::Error` manually copy file errors into the graph error
object.

So:
1. `graph::verify::Error`'s `File` variant uses a dummy type parameter
   for its nested `file::verify::Error` value. This dummy type parameter
   is essentially a never type, as
   `graph::verify::Error::File(file::verify::Error::Processor(...))` is
   not valid.
2. `Graph::verify_integrity` calls `File::traverse` with its normal
   error type (`graph::verify::Error`) also serving as the processor's
   error type.
3. `Graph::traverse` propagates processor errors to its caller via
   `Error::Processor(err)`.
4. `Graph::verify_integrity` manually transcribes
   `file::verify::Error<T>` variants into
   `file::verify::Error<NeverType>` variants before embedding the file
    error into `graph::verify::Error::File` variants.
@Byron
Copy link
Owner

Byron commented Oct 12, 2020

Not gonna lie, that line keeps me up at night. Almost as much as EitherError.

That sounds like it's still (a tad) better then :). Actually there is already a precedence for doing this, as a similar problem exists when resolving objects foreign to a thin pack. Technically the code could prescribe the use of compound::Db, but I found it unnecessarily restrictive despite a Repository probably doing exactly that.

I'd like to follow this up with implementing cargo bench support using the Criterion library.

That sounds good, thanks for all the work you put into this! I will hopefully find some more time tomorrow to do a thorough review.

@Byron Byron merged commit fa22cab into Byron:main Oct 13, 2020
@Byron
Copy link
Owner

Byron commented Oct 13, 2020

Fantastic work, thank you!

I just applied tiny refactorings, with two ones worth mentioning (or at least in recent memory :D):

  • The special Error type that wanted to be ! was replaced with std::convert::Infallible, which also wants to be ! when ready.
  • git_odb::hash::bytes_of_flie is now git_features::hash::bytes_of_file, but is not yet used for file::verify_checksum() as it's slightly more involved and probably not worth it. However, you might implement it anyway when progress support is added.

Please share your thoughts in case you would want things to look differently, or make the changes in the next PR.

In any case, I am very much looking forward to your next PR :)!

@Byron
Copy link
Owner

Byron commented Oct 13, 2020

Oh, and before I forget: Now would be a good time to add, with minimal effort, a stress test for commit-graph-verify. I think it could be two lines around here to run the verification on a big commit-graph like the one of the linux kernel repo.

@avoidscorn avoidscorn deleted the commit-graph-verify branch October 13, 2020 23:44
@avoidscorn
Copy link
Contributor Author

Both of your changes look good!

Byron added a commit that referenced this pull request Oct 15, 2023
The new URL should trigger an overflow check but it only
happens when `url::Url::parse()` is called directly as our
code doesn't let it through anymore.

Here is the log from the fuzzer run as reported:
```
	[Environment] ASAN_OPTIONS=handle_abort=2
+----------------------------------------Release Build Stacktrace----------------------------------------+
Command: /mnt/scratch0/clusterfuzz/resources/platform/linux/unshare -c -n /mnt/scratch0/clusterfuzz/bot/builds/clusterfuzz-builds_gitoxide_9a561c2a19701ceb3cded247e9ae8f349711bbca/revisions/gix-url-parse -rss_limit_mb=2560 -timeout=60 -runs=100 /mnt/scratch0/clusterfuzz/bot/inputs/fuzzer-testcases/f508abd59698de9914f2b8894cc135f55208e494873d456c6c19828509103805
Time ran: 0.12717413902282715
INFO: Running with entropic power schedule (0xFF, 100).
INFO: Seed: 1001182178
INFO: Loaded 1 modules   (90683 inline 8-bit counters): 90683 [0x5d0c0597cce0, 0x5d0c05992f1b),
INFO: Loaded 1 PC tables (90683 PCs): 90683 [0x5d0c05992f20,0x5d0c05af52d0),
/mnt/scratch0/clusterfuzz/bot/builds/clusterfuzz-builds_gitoxide_9a561c2a19701ceb3cded247e9ae8f349711bbca/revisions/gix-url-parse: Running 1 inputs 100 time(s) each.
Running: /mnt/scratch0/clusterfuzz/bot/inputs/fuzzer-testcases/f508abd59698de9914f2b8894cc135f55208e494873d456c6c19828509103805
thread '<unnamed>' panicked at /rust/registry/src/index.crates.io-6f17d22bba15001f/idna-0.4.0/src/punycode.rs:272:17:
attempt to add with overflow
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
AddressSanitizer:DEADLYSIGNAL
=================================================================
==1200==ERROR: AddressSanitizer: ABRT on unknown address 0x0539000004b0 (pc 0x7c51742fb00b bp 0x7ffcd1eadd80 sp 0x7ffcd1eadaf0 T0)
    #0 0x7c51742fb00b in raise /build/glibc-SzIz7B/glibc-2.31/sysdeps/unix/sysv/linux/raise.c:51:1
    #1 0x7c51742da858 in abort /build/glibc-SzIz7B/glibc-2.31/stdlib/abort.c:79:7
    #2 0x5d0c0572a1e6 in std::sys::unix::abort_internal::he854d2f74b119e66 /rustc/e20cb7702117f1ad8127a16406ba9edd230c4f65/library/std/src/sys/unix/mod.rs:375:14
    #3 0x5d0c0518cda6 in std::process::abort::h68c27a968dc7c74f /rustc/e20cb7702117f1ad8127a16406ba9edd230c4f65/library/std/src/process.rs:2271:5
    #4 0x5d0c0564d3d4 in libfuzzer_sys::initialize::_$u7b$$u7b$closure$u7d$$u7d$::h1e76e422e0c48db0 /rust/registry/src/index.crates.io-6f17d22bba15001f/libfuzzer-sys-0.4.3/src/lib.rs:57:9
    #5 0x5d0c0571dcf7 in _$LT$alloc..boxed..Box$LT$F$C$A$GT$$u20$as$u20$core..ops..function..Fn$LT$Args$GT$$GT$::call::h0c028c5af3475e03 /rustc/e20cb7702117f1ad8127a16406ba9edd230c4f65/library/alloc/src/boxed.rs:2021:9
    #6 0x5d0c0571dcf7 in std::panicking::rust_panic_with_hook::hd26c5407fbf20d71 /rustc/e20cb7702117f1ad8127a16406ba9edd230c4f65/library/std/src/panicking.rs:735:13
    #7 0x5d0c0571da05 in std::panicking::begin_panic_handler::_$u7b$$u7b$closure$u7d$$u7d$::h944e23ea90982f5a /rustc/e20cb7702117f1ad8127a16406ba9edd230c4f65/library/std/src/panicking.rs:601:13
    #8 0x5d0c0571aee5 in std::sys_common::backtrace::__rust_end_short_backtrace::h8a3632d339dd3313 /rustc/e20cb7702117f1ad8127a16406ba9edd230c4f65/library/std/src/sys_common/backtrace.rs:170:18
    #9 0x5d0c0571d781 in rust_begin_unwind /rustc/e20cb7702117f1ad8127a16406ba9edd230c4f65/library/std/src/panicking.rs:597:5
    #10 0x5d0c05190634 in core::panicking::panic_fmt::h85c36fc727234039 /rustc/e20cb7702117f1ad8127a16406ba9edd230c4f65/library/core/src/panicking.rs:72:14
    #11 0x5d0c051906d2 in core::panicking::panic::h6a47ed7881a36f4d /rustc/e20cb7702117f1ad8127a16406ba9edd230c4f65/library/core/src/panicking.rs:127:5
    #12 0x5d0c053e09c6 in idna::punycode::encode_into::hd674630fb161bf5b /rust/registry/src/index.crates.io-6f17d22bba15001f/idna-0.4.0/src/punycode.rs:0
    #13 0x5d0c053eacbc in idna::uts46::Idna::to_ascii_inner::h69c52eb69ae48276 /rust/registry/src/index.crates.io-6f17d22bba15001f/idna-0.4.0/src/uts46.rs:469:34
    #14 0x5d0c053eb793 in idna::uts46::Idna::to_ascii::h76237795045112f3 /rust/registry/src/index.crates.io-6f17d22bba15001f/idna-0.4.0/src/uts46.rs:481:26
    #15 0x5d0c053eda7a in idna::uts46::Config::to_ascii::h423c722ab2fa9813 /rust/registry/src/index.crates.io-6f17d22bba15001f/idna-0.4.0/src/uts46.rs:572:9
    #16 0x5d0c053f0070 in idna::domain_to_ascii::h93e94e995d03e9ef /rust/registry/src/index.crates.io-6f17d22bba15001f/idna-0.4.0/src/lib.rs:64:5
    #17 0x5d0c0530374a in url::host::Host::domain_to_ascii::h6cb1ae8fe42a1e42 /rust/registry/src/index.crates.io-6f17d22bba15001f/url-2.4.1/src/host.rs:166:9
    #18 0x5d0c0530374a in url::host::Host::parse::h962d3990e0ff5091 /rust/registry/src/index.crates.io-6f17d22bba15001f/url-2.4.1/src/host.rs:86:22
    #19 0x5d0c0532e87d in url::parser::Parser::parse_host::h89faea9182ce2512 /rust/registry/src/index.crates.io-6f17d22bba15001f/url-2.4.1/src/parser.rs:1024:20
    #20 0x5d0c0532ba8d in url::parser::Parser::parse_host_and_port::heb44bd7ebd2593f6 /rust/registry/src/index.crates.io-6f17d22bba15001f/url-2.4.1/src/parser.rs:944:33
    #21 0x5d0c0532896f in url::parser::Parser::after_double_slash::hbb313f562f0978a2 /rust/registry/src/index.crates.io-6f17d22bba15001f/url-2.4.1/src/parser.rs:843:13
    #22 0x5d0c0531a129 in url::parser::Parser::parse_with_scheme::h54a417e4650ea024 /rust/registry/src/index.crates.io-6f17d22bba15001f/url-2.4.1/src/parser.rs:453:17
    #23 0x5d0c05317824 in url::parser::Parser::parse_url::hfa6b21c53cd0ac1c /rust/registry/src/index.crates.io-6f17d22bba15001f/url-2.4.1/src/parser.rs:366:20
    #24 0x5d0c05350ba0 in url::ParseOptions::parse::hb8b3309b3b920457 /rust/registry/src/index.crates.io-6f17d22bba15001f/url-2.4.1/src/lib.rs:257:9
    #25 0x5d0c052b1ffc in url::Url::parse::h82a965c69df59bba /rust/registry/src/index.crates.io-6f17d22bba15001f/url-2.4.1/src/lib.rs:292:9
    #26 0x5d0c052b1ffc in gix_url::parse::input_to_utf8_and_url::h14b70a32a8884316 gitoxide/gix-url/src/parse.rs:252:5
    #27 0x5d0c052a9b9d in gix_url::parse::url::h6f7f7b0bddf4b8d7 gitoxide/gix-url/src/parse.rs:99:24
    #28 0x5d0c052b34cc in gix_url::parse::hfaab74909f01c9cc gitoxide/gix-url/src/lib.rs:38:46
    #29 0x5d0c05270b4a in rust_fuzzer_test_input gitoxide/gix-url/fuzz/fuzz_targets/parse.rs:5:14
    #30 0x5d0c0564d537 in __rust_try libfuzzer_sys.f28e88650cadb2d4-cgu.0:0
    #31 0x5d0c0564c79f in std::panicking::try::h90783eeef7e35925 /rustc/e20cb7702117f1ad8127a16406ba9edd230c4f65/library/std/src/panicking.rs:468:19
    #32 0x5d0c0564c79f in std::panic::catch_unwind::h041b281a0e92d580 /rustc/e20cb7702117f1ad8127a16406ba9edd230c4f65/library/std/src/panic.rs:142:14
    #33 0x5d0c0564c79f in LLVMFuzzerTestOneInput /rust/registry/src/index.crates.io-6f17d22bba15001f/libfuzzer-sys-0.4.3/src/lib.rs:28:22
    #34 0x5d0c0566bb83 in fuzzer::Fuzzer::ExecuteCallback(unsigned char const*, unsigned long) /src/llvm-project/compiler-rt/lib/fuzzer/FuzzerLoop.cpp:611:15
    #35 0x5d0c056572e2 in fuzzer::RunOneTest(fuzzer::Fuzzer*, char const*, unsigned long) /src/llvm-project/compiler-rt/lib/fuzzer/FuzzerDriver.cpp:324:6
    #36 0x5d0c0565cb8c in fuzzer::FuzzerDriver(int*, char***, int (*)(unsigned char const*, unsigned long)) /src/llvm-project/compiler-rt/lib/fuzzer/FuzzerDriver.cpp:860:9
    #37 0x5d0c056860c2 in main /src/llvm-project/compiler-rt/lib/fuzzer/FuzzerMain.cpp:20:10
    #38 0x7c51742dc082 in __libc_start_main /build/glibc-SzIz7B/glibc-2.31/csu/libc-start.c:308:16
    #39 0x5d0c05191c4d in _start
```
`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants