Drop some `unsafe`s - the compiler now optimizes equivalent safe code #43
Conversation
Travis check failed because
Shnatsel added some commits on Jun 24, 2018
eddyb reviewed on Jun 24, 2018
}

for i in self.pos as usize..pos_end as usize {
    self.buffer[i] = self.buffer[i + forward as usize]
eddyb
Jun 24, 2018
Contributor
Indentation here is still too far to the left. Also, the semicolon is missing.
Shnatsel
Jun 24, 2018
Contributor
Yup, fixed in the very next commit. I didn't want to put unrelated things in the same commit. Never mind, it's a different line that looked similar. Thanks for the catch, and thanks for the review!
eddyb reviewed on Jun 24, 2018
if self.pos < dist && pos_end > self.pos {
    return Err("invalid run length in stream".to_owned());
}

if self.buffer.len() < pos_end as usize {
    unsafe {
        self.buffer.set_len(pos_end as usize);
eddyb
Jun 24, 2018
Contributor
I wonder if push is efficient enough, compared to assigning.
The capacity should be fixed; maybe it just needs an assert!(pos_end as usize <= self.buffer.capacity()) to hint that to LLVM.
eddyb
Jun 24, 2018
Contributor
That is, replacing if self.buffer.len() < pos_end as usize {...} (and the original loop) with:
for i in self.pos as usize..self.buffer.len().min(pos_end as usize) {
    self.buffer[i] = self.buffer[i - dist as usize];
}
assert!(pos_end as usize <= self.buffer.capacity());
while self.buffer.len() < pos_end as usize {
    let x = self.buffer[self.buffer.len() - dist as usize];
    self.buffer.push(x);
}
One interesting question would be whether we are even validating that dist != 0.
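For illustration, a guard along these lines would cover that case; the placement and error message are hypothetical, merely mirroring the existing error style in run_len_dist shown above:
if dist == 0 {
    // With dist == 0 the copy would read the very byte it is about to write,
    // which is uninitialized memory in the set_len version, so reject it early.
    return Err("invalid zero distance in stream".to_owned());
}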
Shnatsel
Jun 24, 2018
Contributor
I am still iterating on this part, so I didn't include it in this PR. I will try this and report the performance impact. Thanks!
I'm also going to try extend_from_slice just to see how that works. It will probably work poorly, because the slice belongs to the same buffer, and since the vector might be reallocated the slice could be invalidated, so I'd have to clone it. Now, if I had a fixed-size vector that was guaranteed not to reallocate, that might have been faster than push().
eddyb
Jun 24, 2018
Contributor
extend_from_slice is not really relevant, because this is RLE, not just "copy from the past": if len > dist, you start repeating by reading bytes you've just written.
Also note that the length of the vector is only increased once; after that the sliding window stays at a constant size, so the push loop is less performance-sensitive.
You'd need a different loop structure with something like memcpy inside, skipping over dist-long runs, if you wanted to actually take advantage of distances larger than a few bytes, but I have my suspicions that it will be significantly faster.
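A rough sketch of that chunked-copy idea, written as a hypothetical free function rather than this PR's actual code (it assumes the buffer has already been resized to cover pos_end and that 1 <= dist <= pos):
fn copy_run_chunked(buffer: &mut [u8], mut pos: usize, pos_end: usize, dist: usize) {
    while pos < pos_end {
        // Copy at most dist bytes per step so each memcpy reads only
        // bytes that have already been written, preserving RLE semantics.
        let chunk = dist.min(pos_end - pos);
        let (written, rest) = buffer.split_at_mut(pos);
        rest[..chunk].copy_from_slice(&written[pos - dist..pos - dist + chunk]);
        pos += chunk;
    }
}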
Shnatsel
Jun 24, 2018
Contributor
Sadly, the code you've suggested incurs a 9-11% performance penalty on decompressing entire files, depending on the file. I have tweaked it a bit and got that down to an 8-10% penalty; here's the code:
let upper_bound = self.buffer.len().min(pos_end as usize);
for i in self.pos as usize..upper_bound {
    self.buffer[i] = self.buffer[i - dist as usize];
}
assert!(pos_end as usize <= self.buffer.capacity());
let initial_buffer_len = self.buffer.len();
for i in initial_buffer_len..pos_end as usize {
    let x = self.buffer[i - dist as usize];
    self.buffer.push(x);
}
The presence or absence of the assert!() has no effect (in this code; I haven't tested the while variant without the assert).
I also got a 10% performance overhead simply by replacing unsafe {self.buffer.set_len(pos_end as usize);} with self.buffer.resize(pos_end as usize, 0u8); in the original code, which results in slightly more concise code than your solution.
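For reference, the swap described above is just the following two variants, both quoted from this discussion:
// Unsafe version: extends the length without initializing the new bytes.
unsafe { self.buffer.set_len(pos_end as usize); }

// Safe version: zero-fills the new bytes, at roughly 10% overhead here.
self.buffer.resize(pos_end as usize, 0u8);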
Shnatsel
Jun 25, 2018
Contributor
Thank you for investigating! This is probably the single best response to me filing a security issue I've ever seen, and I didn't even have a proof of concept this time.
As for the commit message - function run_len_dist() when taken in isolation is vulnerable; the rest of the code just so happens to never call it with dist set to 0, so the crate as a whole is not vulnerable. I would prefer to note that in the commit history, and update only the "It is unclear..." part to reflect that the crate as a whole is not vulnerable. But I will drop the mention of the vulnerability if you insist.
Also, here's my memcpy-based prototype.
if self.buffer.len() < pos_end as usize {
    self.buffer.resize(pos_end as usize, 0u8);
}
fill_slice_with_subslice(&mut self.buffer, (self.pos as usize - dist as usize, self.pos as usize), (self.pos as usize, pos_end as usize));

fn fill_slice_with_subslice(slice: &mut [u8], (source_from, source_to): (usize, usize), (dest_from, dest_to): (usize, usize)) {
    let (source, destination) = if dest_from >= source_from { slice.split_at_mut(dest_from) } else { slice.split_at_mut(source_from) };
    let source = &mut source[source_from..source_to];
    let destination = &mut destination[..dest_to - dest_from];
    for i in (0..(destination.len() / source.len())).map(|x| x * source.len()) {
        destination[i..source.len() + i].copy_from_slice(&source);
    }
}
It fails some tests and I have been trying to understand why, to no avail, so I'm afraid I won't be able to complete it.
I'm afraid I'm not familiar with assembler or LLVM IR, so I will not be able to inspect it in any meaningful way. Sorry. I will benchmark your suggested changes with panic=abort on nightly and report the results.
eddyb
Jun 25, 2018
Contributor
Note that I don't mean panic=abort; there's a "nightly" feature which changes what abort() does. Although panic=abort might itself cause further improvements, so there's probably a few combinations to test.
I wouldn't have done the copy_from_slice with a for loop, or at least not with division/multiplication - there's a way to step ranges, although I'd also try a while loop that updates state instead.
But anyway, len is not necessarily a multiple of dist; you're missing something to handle len % dist (again, it should ideally be done without using %, unless you can find a way to use division that is a performance boost, but I doubt that's possible here).
Shnatsel
Jun 25, 2018
Contributor
I've conjured up a function based on copy_from_slice() that is more efficient than a per-element loop. Using while made it a lot more readable, so thanks for the tip!
if self.buffer.len() < pos_end as usize {
    self.buffer.resize(pos_end as usize, 0u8);
}
fill_slice_with_subslice(&mut self.buffer, (self.pos as usize - dist as usize, self.pos as usize), (self.pos as usize, pos_end as usize));

fn fill_slice_with_subslice(slice: &mut [u8], (source_from, source_to): (usize, usize), (dest_from, dest_to): (usize, usize)) {
    let (source, destination) = slice.split_at_mut(dest_from); // TODO: allow destination to be lower than source
    let source = &source[source_from..source_to];
    let destination = &mut destination[..(dest_to - dest_from)];
    let mut offset = 0;
    while offset + source.len() < destination.len() {
        destination[offset..source.len() + offset].copy_from_slice(&source);
        offset += source.len();
    }
    let remaining_chunk = destination.len() - offset;
    destination[offset..].copy_from_slice(&source[..remaining_chunk]);
}
It offsets some of the costs of safe memory initialization, so that switching to safe initialization would only create 5% overhead instead of 10%. If we switch the loop above to something like this too, then we'd have an entirely safe crate with the same performance as before.
I have a branch with all changes from this PR plus the optimized loop: https://github.com/Shnatsel/inflate/tree/safe-with-optimized-loop
Sadly, this function still fails one test: the line with its invocation causes an integer overflow on the test "issue_30_realworld", i.e. self.pos is actually less than dist. How it worked in a simple loop that did self.buffer[i - dist as usize] with i initialized to self.pos, but doesn't work here, is beyond me.
eddyb
Jun 25, 2018
Contributor
I still think this code could be much simpler if it was duplicated between forward and backward, just like the old code - in fact, it would be very similar to the old code, just doing more than one element at a time.
Also, is there a benefit to always using the same source subslice, or is it just as efficient / more efficient to always copy the last dist bytes? There may be some cache effects here.
This comment has been minimized.
This comment has been minimized.
Shnatsel
Jun 26, 2018
Contributor
Always copying the last dist bytes seems to be exactly as efficient as always copying the same slice.
The code I ended up with looks like this:
let (source, destination) = (&mut self.buffer).split_at_mut(self.pos as usize);
let source = &source[source.len() - dist as usize..];
let mut offset = 0;
while offset + source.len() < destination.len() {
    destination[offset..source.len() + offset].copy_from_slice(&source);
    offset += source.len();
}
let remaining_chunk = destination.len() - offset;
destination[offset..].copy_from_slice(&source[..remaining_chunk]);
This is a bit more readable, and it nets a 3% to 7% performance improvement.
Surprisingly, putting the same code in the other copying loop in this function actually hurts performance by 1% on my samples.
I've also tried an iterator-based version, which is concise but as slow as copying byte-by-byte:
let (source, destination) = (&mut self.buffer).split_at_mut(self.pos as usize);
let source = &source[source.len() - dist as usize..];
for (d, s) in destination.chunks_mut(dist as usize).zip(source.chunks(dist as usize).cycle()) {
    let d_len = d.len(); // the last chunk has a size lower than we've specified
    d.copy_from_slice(&s[..d_len]);
}
However, I've realized that this function can return pretty much any garbage and the tests won't fail. Since this creates regression potential, optimizing the copying falls outside the scope of this PR.
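To illustrate that gap, a regression test along these lines (hypothetical name and data, not part of the crate's existing test suite) would pin down the expected back-reference output for the case where the run length exceeds the distance, regardless of which copy strategy is used; the naive push loop here only serves as the reference behaviour:
#[test]
fn back_reference_repeats_when_run_exceeds_distance() {
    // History is [1, 2, 3]; copying 7 more bytes at dist = 3 must cycle it.
    let mut buffer = vec![1u8, 2, 3];
    let (dist, pos_end) = (3usize, 10usize);
    while buffer.len() < pos_end {
        let byte = buffer[buffer.len() - dist];
        buffer.push(byte);
    }
    assert_eq!(buffer, [1, 2, 3, 1, 2, 3, 1, 2, 3, 1]);
}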
eddyb reviewed on Jun 24, 2018
Btw, there is still the use of the abort intrinsic. Also, I hope both #35 and this were tested with both the default and the "unstable" feature.
Shnatsel force-pushed the Shnatsel:master branch from c250e50 to 11c73f2 on Jun 25, 2018
eddyb reviewed on Jun 25, 2018
@@ -65,7 +65,7 @@
//! }
//! ```

#![cfg_attr(feature = "unstable", feature(core))]
#![cfg_attr(feature = "unstable", feature(core_intrinsics))]
Shnatsel
Jun 25, 2018
Contributor
TL;DR: dunno.
The only relevant changelog entry I could find was "Declaration of lang items and intrinsics are now feature-gated by default." in 0.11.0 (2014-07-02), but that's probably about feature(core), not feature(core_intrinsics). I could not find anything about core_intrinsics on the Rust bug tracker.
Current code with unsafe:
Safe code with
I could not benchmark
Apparently the abort intrinsic is no longer needed. Shall I drop it in this PR?
Sounds good!
Shnatsel added some commits on Jun 25, 2018
Done. I hope removing a feature from Cargo.toml is not a breaking change?
hauleth reviewed on Jun 25, 2018
@@ -404,7 +386,7 @@ impl CodeLengthReader {
        self.result.push(0);
    }
}
_ => abort(),
_ => panic!(),
hauleth
Jun 25, 2018
Wouldn't it be better to use the unreachable!() macro? It would provide a more meaningful name.
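For illustration, the arm above would then read as follows (the behaviour is the same as panic!(), only the intent is spelled out):
_ => unreachable!(),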
eddyb reviewed on Jun 25, 2018
@@ -9,7 +9,6 @@ keywords = ["deflate", "decompression", "compression", "piston"]

[features]
default = []
eddyb
Jun 25, 2018
Contributor
Is this needed at all? I think all it did was disable unstable by default.
I think removing a feature is a breaking change. Bump it to 0.5.0.
@bvssvni IMO we should avoid breaking changes, to ensure that any improvements are used in the ecosystem. But YMMV.
@eddyb It's always better to release a new version than to mess up existing projects.
Shnatsel added some commits on Jun 26, 2018
I would like the safety improvements done so far not to warrant a version bump, so that they would naturally propagate through the ecosystem. I'm on board with reverting the switch from
On the other hand, the "unstable" feature was only available on nightly, where breakage is kind of expected, and it was already broken to boot, so all people would notice is that code with that feature enabled has actually started compiling. IDK. Your call.
OK, if it was broken then we can bump it to 0.4.3.
Shnatsel added some commits on Jun 26, 2018
I haven't managed to get rid of the unsafe block without losing performance, so instead I've covered it in asserts. The asserts actually improve performance by roughly 2%, depending on the workload. I am actually in the market for an entirely safe
This PR is now as good as I can get it. Please conduct a final review and merge it or request changes.
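One plausible shape for that, based on the assert suggested earlier in this review (the actual merged code may differ), is to keep the set_len but spell out the precondition it relies on:
if self.buffer.len() < pos_end as usize {
    // The assert documents the invariant set_len depends on and gives LLVM
    // a bound it can use to elide checks in the copy loops that follow.
    assert!(pos_end as usize <= self.buffer.capacity());
    unsafe {
        self.buffer.set_len(pos_end as usize);
    }
}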
Merging.
bvssvni merged commit 728bd49 into PistonDevelopers:master on Jun 26, 2018. 1 check passed.
@eddyb Do you want to test your access to publish? Or do you want me to do it?
Thank you! I'll probably work on
Shnatsel commented on Jun 23, 2018
This PR drops some unsafe code that was introduced in place of safe code as an optimization. It is no longer needed: on Rust 1.27 (the current stable), the performance degradation from this change is within measurement noise, even when taking 10x the normal number of samples using Criterion.
cargo bench has 2% variance by itself, so it couldn't measure anything below that; I had to switch to Criterion and then jack up the number of samples to make sure there is no degradation at all. You can find the benchmarking setup here. It's a bit messy, but I can clean it up and open a PR if you're interested.
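For context, a minimal Criterion benchmark in this spirit looks roughly like the sketch below; the file layout and sample data are placeholders rather than the actual setup from the linked repository, and the sample count can be raised via Criterion::default().sample_size(...):
// benches/decompress.rs (hypothetical path); requires criterion as a dev-dependency.
#[macro_use]
extern crate criterion;
extern crate inflate;

use criterion::Criterion;

fn decompress_benchmark(c: &mut Criterion) {
    // Placeholder sample; any zlib-compressed file works here.
    let compressed: &'static [u8] = include_bytes!("sample.zlib");
    c.bench_function("inflate_bytes_zlib", move |b| {
        b.iter(|| inflate::inflate_bytes_zlib(compressed).unwrap())
    });
}

criterion_group!(benches, decompress_benchmark);
criterion_main!(benches);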