Use chunks_exact where possible by kornelski · Pull Request #1277 · GitoxideLabs/gitoxide

kornelski · 2024-02-05T15:50:00Z

I've noticed a few functions use chunks(len), but assume they will actually get len-long chunks, even though there could be a shorter remainder chunk.

I suspect there should be more strict length checks for the inputs to chunks_(exact), and zip iterators. For example read_fan would tolerate inputs shorter than the number of bytes read it returns.

Byron

Thanks a lot for contributing!

I will definitely keep chunks_exact() in mind as it's certainly less of a footgun than chunks(), even though both have their strength.

Regarding these, I think there are occasions where this is an improvement. In the other cases, I think it's better to go back to chunks() - the idea is that it should panic if there is unexpected input, which typically is checked beforehand.

Byron · 2024-02-05T17:32:54Z

gix-commitgraph/src/file/init.rs

 fn read_fan(d: &[u8]) -> ([u32; FAN_LEN], usize) {
    let mut fan = [0; FAN_LEN];
-    for (c, f) in d.chunks(4).zip(fan.iter_mut()) {
+    for (c, f) in d.chunks_exact(4).zip(fan.iter_mut()) {


Let's keep using chunks here, as that will cause a panic which is better than silently working with a fan that isn't fully initialised. Please apply this to all other instances of this function as well.

.chunks() would not catch cases of broken input when when it's multiple of 4, e.g. 8-byte long d initializes 2 bytes, and keeps 254 bytes zeroed. I've added an assert instead, which catches all length mismatches.

That's a good point! The assert is doing a much better job. I also checked the implementation and think that chunks_exact has the potential to be faster, so probably good to use it even if it otherwise wouldn't make a difference. (It does use more state on the stack as well, in case that ever matters).

gix-index/src/decode/mod.rs

Byron · 2024-02-05T17:36:19Z

gix-pack/src/index/access.rs

        let pack64_offset = self.offset_pack_offset64_v2();
        match self.version {
            index::Version::V2 => izip!(
-                self.data[V2_HEADER_SIZE..].chunks(self.hash_len),


Let's leave this at chunks as well, it should panic if the count is off (which it will).

This could truncate izip! too, so maybe an explicit check of all 3 lengths would be better?.

I've added length check, which helps with zip too.

Thank you, very readable and a much stronger assertion.

Byron · 2024-02-05T17:37:33Z

gix-pack/src/index/access.rs

                let pack_offset_64_start = self.offset_pack_offset64_v2();
                offset32_start
-                    .chunks(N32_SIZE)
+                    .chunks_exact(N32_SIZE)


Let's leave this at chunks as well, it should panic if the count is off (which it will).

I've added an explicit length check instead. I think that makes intention clearer. It lowers risk of performing some work with side effects before panicking, and may be easier to diagnose bad inputs than panicking somewhere down the line.

src/plumbing/main.rs

kornelski · 2024-02-06T02:44:13Z

gix-pack uses offsets into self.data, such as &self.data[self.offset_pack_offset_v2()..].

Is this necessary?

Looking at implementation of offset_crc32_v2, offset_pack_offset_v2, offset_pack_offset64_v2 I can see they're one after another, but this relationship is unclear elsewhere that just uses the offset.

The offsets are also used with an open range [offset..], rather than being slices of their own fragment of data only. This makes it tricky to check if usage of chunks or chunks_exact is correct, because the open-range data slice definitely doesn't have the right length, and it's not clear if it's meant to have, or whether the pack format allows arbitrary data at the end or the different parts to overlap.

Instead of providing offsets, could these functions return right-sized slices of data?

Byron

Great! Thanks a lot! I will definitely be more careful about the usage of chunks and chunks_exact() from now on.

Byron · 2024-02-06T08:55:28Z

gix-commitgraph/src/file/init.rs

 fn read_fan(d: &[u8]) -> ([u32; FAN_LEN], usize) {
    let mut fan = [0; FAN_LEN];
-    for (c, f) in d.chunks(4).zip(fan.iter_mut()) {
+    for (c, f) in d.chunks_exact(4).zip(fan.iter_mut()) {


That's a good point! The assert is doing a much better job. I also checked the implementation and think that chunks_exact has the potential to be faster, so probably good to use it even if it otherwise wouldn't make a difference. (It does use more state on the stack as well, in case that ever matters).

Byron · 2024-02-06T08:57:55Z

gix-index/src/decode/mod.rs

                            let chunk_size = (entry_offsets.len() as f32 / num_threads as f32).ceil() as usize;
-                            let num_chunks = entry_offsets.chunks(chunk_size).count();
+                            let entry_offsets_chunked = entry_offsets.chunks(chunk_size);
+                            let num_chunks = entry_offsets_chunked.len();


Using .len() is much better than count(), thanks for catching this!

Byron · 2024-02-06T09:00:46Z

gix-pack/src/index/access.rs

        let pack64_offset = self.offset_pack_offset64_v2();
        match self.version {
            index::Version::V2 => izip!(
-                self.data[V2_HEADER_SIZE..].chunks(self.hash_len),


Thank you, very readable and a much stronger assertion.

Byron · 2024-02-06T09:15:23Z

Apologies, I forgot to reply.

gix-pack uses offsets into self.data, such as &self.data[self.offset_pack_offset_v2()..].

Is this necessary?

I don't now what you are alluding to. Please feel free to submit a PR with improvements if this can clear this up.

Looking at implementation of offset_crc32_v2, offset_pack_offset_v2, offset_pack_offset64_v2 I can see they're one after another, but this relationship is unclear elsewhere that just uses the offset.

The offsets are also used with an open range [offset..], rather than being slices of their own fragment of data only. This makes it tricky to check if usage of chunks or chunks_exact is correct, because the open-range data slice definitely doesn't have the right length, and it's not clear if it's meant to have, or whether the pack format allows arbitrary data at the end or the different parts to overlap.

Instead of providing offsets, could these functions return right-sized slices of data?

In the index, locations of all portions of data can be known in advance, I think, which should make it possible to determine the exact range a slice should occupy. A PR for this would definitely be welcome, maybe this could help avoid running into 'traps' laid by malicious actors. Note though that such an actor would need to be able to distribute full repos, with packs and indices, as typically all indices are build from scratch when cloning, a vital part of validating the received pack. From that point of view, there might be little benefit to further restricting the regions. Definitely your call though, as I wouldn't expect right-sizing to greatly increase the code's complexity.

kornelski · 2024-02-06T11:52:27Z

What I've meant was about Rust API/design pattern, instead of:

let slice_of_crcs_and_more = &self.data[self.offset_crc32_v2()..];

refactoring it to make the offset an implementation detail, and have:

let slice_of_crcs_only = self.crc32_v2();

where crc32_v2 would return &self.data[start..end] knowing the right start offset and num_elements, so that callers of this function wouldn't need to usetake(num_elements).

Byron · 2024-02-06T18:06:40Z

Thanks for clarifying. I think there shouldn't be any issue with refactoring, particularly if it allows to deduplicate some of the logic to make it less error prone.
If that sounds like what you are suggesting, please do feel free to submit a PR.

Byron requested changes Feb 5, 2024

View reviewed changes

Use chunks_exact where possible

2482023

kornelski force-pushed the chunks_exact branch from 4842419 to 2482023 Compare February 6, 2024 00:59

Byron approved these changes Feb 6, 2024

View reviewed changes

Byron merged commit d4d478b into GitoxideLabs:main Feb 6, 2024

kornelski deleted the chunks_exact branch February 6, 2024 11:52

Uh oh!

Comments

Conversation

kornelski commented Feb 5, 2024

Uh oh!

Byron left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

kornelski commented Feb 6, 2024

Uh oh!

Byron left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Byron commented Feb 6, 2024

Uh oh!

kornelski commented Feb 6, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Byron commented Feb 6, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

kornelski commented Feb 6, 2024 •

edited

Loading