Drop some `unsafe`s - the compiler now optimizes equivalent safe code #43

Merged
merged 18 commits into PistonDevelopers:master from Shnatsel:master on Jun 26, 2018

Conversation

@Shnatsel
Contributor

Shnatsel commented Jun 23, 2018

This PR drops some unsafe code that was introduced in place of safe code as an optimization. It is no longer needed: on Rust 1.27 (current stable) performance degradation from this change is within measurement noise even when taking 10x the normal number of samples using Criterion.

cargo bench has about 2% variance by itself, so it can't measure anything below that; I had to switch to Criterion and then jack up the number of samples to make sure there is no degradation at all. You can find the benchmarking setup here. It's a bit messy, but I can clean it up and open a PR if you're interested.
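The PR used Criterion for this; as a dependency-free sketch of the underlying idea (take many samples and compare medians, since single runs vary by a couple of percent), assuming a stand-in workload rather than the crate's real decompressor:

```rust
use std::time::Instant;

// Take `samples` timing measurements of `work` and return the median,
// which is far less noise-sensitive than a single run.
fn median_nanos<F: FnMut()>(mut work: F, samples: usize) -> u128 {
    let mut times: Vec<u128> = (0..samples)
        .map(|_| {
            let start = Instant::now();
            work();
            start.elapsed().as_nanos()
        })
        .collect();
    times.sort_unstable();
    times[times.len() / 2]
}

fn main() {
    // Stand-in workload; the real benchmark decompresses whole files.
    let buf = vec![1u8; 1 << 16];
    let median = median_nanos(
        || {
            let sum: u64 = buf.iter().map(|&b| b as u64).sum();
            std::hint::black_box(sum);
        },
        100,
    );
    println!("median: {} ns", median);
}
```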

Drop unsafe loops. On Rust 1.27 performance degradation is within measurement noise even when taking 10x the normal number of samples using Criterion.
@Shnatsel

Contributor

Shnatsel commented Jun 23, 2018

Travis check failed because rustup is currently broken for nightly, it fails to download the compiler. I have this on my local machine too.

Shnatsel added some commits Jun 24, 2018

Exit earlier on invalid run length, without leaving behind a buffer with potentially uninitialized contents. I have tried and failed to actually exploit this to get an information leak.
src/lib.rs Outdated
}

for i in self.pos as usize..pos_end as usize {
self.buffer[i] = self.buffer[i + forward as usize]

@eddyb

eddyb Jun 24, 2018

Contributor

Indentation here is still too far to the left. Also, the semicolon is missing.

@Shnatsel

Shnatsel Jun 24, 2018

Contributor

Yup, fixed in the very next commit. I didn't want to put unrelated things in the same commit. Never mind, it's a different line that looked similar. Thanks for the catch, and thanks for the review!

src/lib.rs Outdated
if self.pos < dist && pos_end > self.pos {
return Err("invalid run length in stream".to_owned());
}

if self.buffer.len() < pos_end as usize {
unsafe {
self.buffer.set_len(pos_end as usize);

@eddyb

eddyb Jun 24, 2018

Contributor

I wonder if push is efficient enough, compared to assigning.
The capacity should be fixed, maybe just needs an assert!(pos_end as usize <= self.buffer.capacity()) to hint to LLVM.

@eddyb

eddyb Jun 24, 2018

Contributor

That is, replacing if self.buffer.len() < pos_end as usize {...} (and the original loop) with:

        for i in self.pos as usize..self.buffer.len().min(pos_end as usize) {
            self.buffer[i] = self.buffer[i - dist as usize];
        }
        assert!(pos_end as usize <= self.buffer.capacity());
        while self.buffer.len() < pos_end as usize {
            let x = self.buffer[self.buffer.len() - dist as usize];
            self.buffer.push(x);
        }

One interesting question would be whether we are even validating that dist != 0.
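A minimal illustration of what goes wrong without that validation (a hypothetical sketch, not the crate's actual code): in the safe push-based loop, dist == 0 turns the back-reference read into an out-of-bounds index, so it panics rather than duplicating or leaking data.

```rust
use std::panic;

// Copy one byte from `dist` positions back, as the push loop does.
// With dist == 0 this reads buf[buf.len()], one past the end, so
// safe code panics; the old set_len-based code had no such backstop.
fn copy_one_back(buf: &mut Vec<u8>, dist: usize) {
    let x = buf[buf.len() - dist];
    buf.push(x);
}

fn main() {
    let result = panic::catch_unwind(|| {
        let mut buf = vec![1u8, 2, 3];
        copy_one_back(&mut buf, 0); // out-of-bounds read attempt
    });
    assert!(result.is_err()); // bounds checking catches dist == 0
}
```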

@Shnatsel

Shnatsel Jun 24, 2018

Contributor

I am still iterating on this part, so I didn't include it in this PR. I will try this and report the performance impact. Thanks!

I'm also going to try extend_from_slice just to see how that works. Probably poorly, because the slice belongs to the same buffer, and since the vector might be reallocated, the slice could be invalidated, so I'd have to clone it. Now, if I had a fixed-size vector that was guaranteed not to reallocate, that might have been faster than push().

@eddyb

eddyb Jun 24, 2018

Contributor

extend_from_slice is not really relevant, because this is RLE, not just "copy from the past": if len > dist, you start repeating by reading bytes you've just written.
Also note that the length of the vector is only increased once; after that the sliding window stays constantly sized, so the push loop is less performance-sensitive.

You'd need a different loop structure with something like memcpy inside, skipping dist-long runs, if you wanted to actually take advantage of distances larger than a few bytes, but I have my doubts that it would be significantly faster.
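The repeating behaviour described above can be seen with a toy back-reference copy (an illustrative sketch, not the crate's code):

```rust
// Toy LZ77/RLE back-reference: append `len` bytes copied from `dist`
// positions back. When len > dist, the copy re-reads bytes it has
// just written, which is what makes extend_from_slice unsuitable.
fn copy_back(buf: &mut Vec<u8>, dist: usize, len: usize) {
    let start = buf.len();
    for i in start..start + len {
        let x = buf[i - dist];
        buf.push(x);
    }
}

fn main() {
    let mut buf = b"ab".to_vec();
    copy_back(&mut buf, 1, 4); // dist = 1 repeats the last byte
    assert_eq!(buf, b"abbbbb".to_vec());
}
```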

@Shnatsel

Shnatsel Jun 24, 2018

Contributor

Sadly, the code you've suggested incurs a 9-11% performance penalty on decompressing entire files, depending on the file. I have tweaked it a bit and got the penalty down to 8-10%; here's the code:

        let upper_bound = self.buffer.len().min(pos_end as usize);
        for i in self.pos as usize..upper_bound {
            self.buffer[i] = self.buffer[i - dist as usize];
        }
        assert!(pos_end as usize <= self.buffer.capacity());
        let initial_buffer_len = self.buffer.len();
        for i in initial_buffer_len..pos_end as usize {
            let x = self.buffer[i - dist as usize];
            self.buffer.push(x);
        }

The presence or absence of the assert!() has no effect (in this code; I haven't tested the while variant without the assert).

I also got 10% performance overhead simply by replacing unsafe { self.buffer.set_len(pos_end as usize); } with self.buffer.resize(pos_end as usize, 0u8); in the original code, which results in slightly more concise code than your solution.
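For reference, the semantic difference between the two growth calls (a minimal sketch): resize zero-fills the new elements, while set_len only changes the length and leaves them uninitialized, which is what made the original code unsafe.

```rust
// resize is the safe replacement for set_len: it writes a fill value
// into every new element instead of exposing whatever bytes happen
// to already be in the allocation.
fn main() {
    let mut v: Vec<u8> = Vec::with_capacity(8);
    v.resize(8, 0u8);
    assert_eq!(v, vec![0u8; 8]);
    // The unsafe equivalent would be:
    //   unsafe { v.set_len(8); } // contents of new elements undefined
}
```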

@Shnatsel

Shnatsel Jun 25, 2018

Contributor

Thank you for investigating! This is probably the single best response to me filing a security issue I've ever seen, and I didn't even have a proof of concept this time.

As for the commit message - function run_len_dist() when taken in isolation is vulnerable; the rest of the code just so happens to never call it with dist set to 0, so the crate as a whole is not vulnerable. I would prefer to note that in the commit history, and update only the "It is unclear..." part to reflect that the crate as a whole is not vulnerable. But I will drop the mention of the vulnerability if you insist.

Also, here's my memcpy-based prototype.

        if self.buffer.len() < pos_end as usize {
            self.buffer.resize(pos_end as usize, 0u8);
        }

        fill_slice_with_subslice(&mut self.buffer, (self.pos as usize - dist as usize, self.pos as usize), (self.pos as usize, pos_end as usize));
        fn fill_slice_with_subslice(slice: &mut[u8], (source_from, source_to): (usize, usize), (dest_from, dest_to): (usize, usize)) {
            let (source, destination) = if dest_from >= source_from {slice.split_at_mut(dest_from)} else {slice.split_at_mut(source_from)};
            let source = &mut source[source_from..source_to];
            let destination = &mut destination[..dest_to-dest_from];
            for i in (0..( (destination.len()) / source.len() )).map(|x| x * source.len()) {
                destination[i..source.len()+i].copy_from_slice(&source);
            }
        }

It fails some tests and I have been trying to understand why to no avail, so I'm afraid I won't be able to complete it.

I'm afraid I'm not familiar with assembler or LLVM IR, so I will not be able to inspect it in any meaningful way. Sorry. I will benchmark your suggested changes with panic=abort on nightly and report the results.

@eddyb

eddyb Jun 25, 2018

Contributor

Note that I don't mean panic=abort, there's a "nightly" feature which changes what abort() does. Although panic=abort might itself cause further improvements, so there's probably a few combinations to test.

I wouldn't have done the copy_from_slice with a for loop, or at least not with division/multiplication - there's a way to step ranges, although I'd also try a while loop that updates state instead.

But anyway, dist is not a multiple of len, you're missing something to handle len % dist (again, it should ideally be done without using %, unless you can find a way to use division that is a performance boost, but I doubt that's possible here).

@Shnatsel

Shnatsel Jun 25, 2018

Contributor

I've conjured up a function based on copy_from_slice() that is more efficient than a per-element loop. Using while made it a lot more readable, so thanks for the tip!

        if self.buffer.len() < pos_end as usize {
            self.buffer.resize(pos_end as usize, 0u8);
        }
        fill_slice_with_subslice(&mut self.buffer, (self.pos as usize - dist as usize, self.pos as usize), (self.pos as usize, pos_end as usize));
        fn fill_slice_with_subslice(slice: &mut[u8], (source_from, source_to): (usize, usize), (dest_from, dest_to): (usize, usize)) {
            let (source, destination) = slice.split_at_mut(dest_from); //TODO: allow destination to be lower than source
            let source = &source[source_from..source_to];
            let destination = &mut destination[..(dest_to - dest_from)];
            let mut offset = 0;
            while offset + source.len() < destination.len() {
                destination[offset..source.len()+offset].copy_from_slice(&source);
                offset += source.len();
            }
            let remaining_chunk = destination.len()-offset;
            &mut destination[offset..].copy_from_slice(&source[..remaining_chunk]);
        }

It offsets some of the cost of safe memory initialization, so switching to safe initialization would only create 5% overhead instead of 10%. If we switch the loop above to something like this too, then we'd have an entirely safe crate with the same performance as before.

I have a branch with all changes from this PR plus the optimized loop: https://github.com/Shnatsel/inflate/tree/safe-with-optimized-loop

Sadly, this function still fails one test: the line with its invocation causes an integer overflow in the "issue_30_realworld" test, i.e. self.pos is actually less than dist. How it worked in a simple loop that did self.buffer[i - dist as usize] with i initialized to self.pos, but doesn't work here, is beyond me.

@eddyb

eddyb Jun 25, 2018

Contributor

I still think this code could be much simpler if it was duplicated between forward and backward, just like the old code - in fact, it would be very similar to the old code, just doing more than one element at a time.

Also, is there a benefit to always using the same source subslice, or is it just as efficient / more efficient to always copy the last dist bytes? There may be some cache effects here.

@Shnatsel

Shnatsel Jun 26, 2018

Contributor

Always copying the last dist bytes seems to be exactly as efficient as always copying the same slice.

The code I ended up with looks like this:

        let (source, destination) = (&mut self.buffer).split_at_mut(self.pos as usize);
        let source = &source[source.len() - dist as usize..];
        let mut offset = 0;
        while offset + source.len() < destination.len() {
            destination[offset..source.len()+offset].copy_from_slice(&source);
            offset += source.len();
        }
        let remaining_chunk = destination.len()-offset;
        &mut destination[offset..].copy_from_slice(&source[..remaining_chunk]);

Which is a bit more readable. This nets 3% to 7% performance improvement.

Surprisingly, putting the same code in the other copying loop in this function actually hurts performance by 1% on my samples.

I've also tried an iterator-based version, which is concise but as slow as copying byte-by-byte:

        let (source, destination) = (&mut self.buffer).split_at_mut(self.pos as usize);
        let source = &source[source.len() - dist as usize..];
        for (d,s) in destination.chunks_mut(dist as usize).zip(source.chunks(dist as usize).cycle()) {
            let d_len = d.len(); // last chunk has a size lower than we've specified
            d.copy_from_slice(&s[..d_len]);
        }

However, I've realized that this function can return pretty much any garbage and the tests won't fail. Since this creates regression potential, optimizing the copying falls out of scope for this PR.

src/lib.rs Outdated
}

for i in self.pos as usize..pos_end as usize {
self.buffer[i] = self.buffer[i + forward as usize]

@eddyb

eddyb Jun 24, 2018

Contributor

Semicolon still missing.

@eddyb

Contributor

eddyb commented Jun 25, 2018

Btw, there is still the use of the abort intrinsic, could that be removed perhaps?
Or does unwinding still affect benchmarks?

Also, I hope both #35 and this were tested with both the "nightly" feature and without. It's possible that if unwinding prevents optimizations, changes that don't affect benchmarks with unwinding actually do affect benchmarks without. cc @RazrFalcon

run_len_dist: validate that dist is not 0. Fixes information disclosure vulnerability in this function. This is not exploitable in practice, because run_len_dist() is not exposed in the public API and is never called with dist=0 from this crate, even given deliberately malformed inputs.

@Shnatsel Shnatsel force-pushed the Shnatsel:master branch from c250e50 to 11c73f2 Jun 25, 2018

src/lib.rs Outdated
@@ -65,7 +65,7 @@
//! }
//! ```

#![cfg_attr(feature = "unstable", feature(core))]
#![cfg_attr(feature = "unstable", feature(core_intrinsics))]

@eddyb

eddyb Jun 25, 2018

Contributor

Wow, for how long has this been broken?

@Shnatsel

Shnatsel Jun 25, 2018

Contributor

TL;DR: dunno.
The only relevant changelog entry I could find was "Declaration of lang items and intrinsics are now feature-gated by default." in 0.11.0 (2014-07-02), but that's probably about feature(core) not feature(core_intrinsics). I could not find anything about core_intrinsics on Rust bug tracker.

@Shnatsel

Contributor

Shnatsel commented Jun 25, 2018

Current code with unsafe:

  • rustc 1.27 stable: baseline
  • rustc 1.28.0-nightly (60efbdead 2018-06-23): +3% regression
  • "unstable" feature, rustc 1.28.0-nightly (60efbdead 2018-06-23): +3% regression

Safe code with while suggested by @eddyb:

  • rustc 1.27 stable: +11% regression
  • #[inline(never)], rustc 1.27 stable: +12% regression
  • rustc 1.28.0-nightly (60efbdead 2018-06-23): +13% regression
  • #[inline(never)], rustc 1.28.0-nightly (60efbdead 2018-06-23): +13% regression
  • "unstable" feature, assert() rewritten as if .. {abort()} rustc 1.28.0-nightly (60efbdead 2018-06-23): +13% regression

I could not benchmark panic=abort because Criterion dependency pest_derive fails to build with "error: the linked panic runtime panic_unwind is not compiled with this crate's panic strategy abort", and 3% differences are not observable with simpler tools.
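For reference, panic=abort is configured per build profile in Cargo.toml (standard Cargo syntax); a complication worth noting is that Cargo ignores a panic setting in the test and bench profiles, which is part of why benchmarking with it is awkward.

```toml
# Abort-on-panic for release builds; Cargo ignores `panic` in the
# `test` and `bench` profiles, so benchmark harnesses still link
# the unwinding runtime.
[profile.release]
panic = "abort"
```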

Apparently the abort intrinsic is no longer needed. Shall I drop it in this PR?

@eddyb

Contributor

eddyb commented Jun 25, 2018

Apparently the abort intrinsic is no longer needed. Shall I drop it in this PR?

Sounds good!

Shnatsel added some commits Jun 25, 2018

Remove use of the 'abort' intrinsic now that panic!() is optimized by the compiler. Replace its use with more idiomatic assertions.
@Shnatsel

Contributor

Shnatsel commented Jun 25, 2018

Done. I hope removing a feature from Cargo.toml is not a breaking change?

src/lib.rs Outdated
@@ -404,7 +386,7 @@ impl CodeLengthReader {
self.result.push(0);
}
}
_ => abort(),
_ => panic!(),

@hauleth

hauleth Jun 25, 2018

Wouldn't it be better to use the unreachable!() macro? It would provide a more meaningful name.

Cargo.toml Outdated
@@ -9,7 +9,6 @@ keywords = ["deflate", "decompression", "compression", "piston"]

[features]
default = []

@eddyb

eddyb Jun 25, 2018

Contributor

Is this needed at all? I think all it did was disable unstable by default.

@eddyb

Contributor

eddyb commented Jun 25, 2018

@Shnatsel I don't know about breaking change. cc @bvssvni

@bvssvni

Member

bvssvni commented Jun 25, 2018

I think removing a feature is a breaking change. Bump it to 0.5.0.

@eddyb

Contributor

eddyb commented Jun 25, 2018

@bvssvni IMO we should avoid breaking changes, to ensure that any improvements are used in the ecosystem. But YMMV.

@bvssvni

Member

bvssvni commented Jun 25, 2018

@eddyb It's always better to release a new version than messing up existing projects.

Shnatsel added some commits Jun 26, 2018

@Shnatsel

Contributor

Shnatsel commented Jun 26, 2018

I would like the safety improvements done so far not to warrant a version bump so that they would naturally propagate through the ecosystem. I'm on board with reverting the switch from abort() to assertions if that's what's needed to make it happen.

On the other hand, the "unstable" feature was only available on nightly, where breakage is kind of expected, and it was already broken to boot, so all people would notice is that code with that feature enabled has actually started compiling.

IDK. Your call.

@bvssvni

Member

bvssvni commented Jun 26, 2018

OK, if it was broken then we can bump it to 0.4.3.

@Shnatsel

Contributor

Shnatsel commented Jun 26, 2018

I haven't managed to get rid of the unsafe block without losing performance, so instead I've covered it in asserts. The asserts actually improve performance by roughly 2%, depending on the workload.
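A sketch of the "covered it in asserts" pattern (hypothetical shape, not the exact code merged here): stating the invariants once up front both guards the indexing and gives the optimizer the facts it needs to hoist bounds checks out of the hot loop.

```rust
// Hypothetical sketch of guarding a back-reference copy with
// up-front asserts. They establish `dist != 0` and `dist <= pos`
// once, so the indexing below cannot go out of bounds and the
// per-iteration checks can be optimized away.
fn run_len_dist_sketch(buf: &mut Vec<u8>, dist: usize, len: usize) {
    let pos = buf.len();
    assert!(dist != 0, "invalid distance in stream");
    assert!(dist <= pos, "distance reaches before start of buffer");
    buf.reserve(len);
    for i in pos..pos + len {
        let x = buf[i - dist];
        buf.push(x);
    }
}
```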

I am actually in the market for an entirely safe inflate, perhaps even one with #![forbid(unsafe_code)], but coercing LLVM into optimizing safe code to the same extent has proven to be beyond my abilities.

This PR is now as good as I can get it. Please conduct a final review and merge it, or request changes.

@bvssvni

Member

bvssvni commented Jun 26, 2018

Merging.

@bvssvni bvssvni merged commit 728bd49 into PistonDevelopers:master Jun 26, 2018

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
@bvssvni

Member

bvssvni commented Jun 26, 2018

@eddyb Do you want to test your access to publish? Or do you want me to do it?

@Shnatsel

Contributor

Shnatsel commented Jun 27, 2018

Thank you!

I'll probably work on the png crate next. My fuzzers have found a panic on malformed input in it; I'll fix that to unblock further fuzzing. Plus it's got unsafes, so perhaps I can do something about that.

@Shnatsel Shnatsel referenced this pull request Jun 27, 2018

Closed

Missing documentation #25
