New mandelbrot.rs #2
Conversation
Thanks for contributing! LLVM is not autovectorising here if you are 2.5 times slower; SIMD can only double the performance. We can be only 2 times slower than C, and I hope your implementation will be that fast :)
//
// contributed by the Rust Project Developers
// contributed by TeXitoi
// contributed by Matt Watson
If you do not use the original version, you can remove the Rust Project Developers line.
My contribution must be the last one, as, according to the rules of the benchmark game, the last contributor must be the person who submits the benchmark. Or you can keep this order, but then you must submit the benchmark yourself.
Thanks for the tips; as I said, my understanding of Rust is nearly nil (and I wouldn't consider myself a very experienced programmer in general). When I said this (and n_body) were being vectorized, it was because I was looking at the emitted asm. LLVM is often making decent use of SSE3 instructions to save itself time (although it appears to be quite fragile and broke at some point, as it's now only storing one variable in each xmm). I'm pretty sure most of the 2.5x slowdown is coming from other overhead, such as not caching the xy values, not exploiting symmetry, and generally writing code that is a bit noobish.
std::io::stdout().write(&rows[i]).ok().expect("Could not write to stdout");
    }
}
io::stdout().flush().ok().expect("Could not flush stdout");
stdout() will be called and the handle locked several times. Even if it's probably not the bottleneck, we can avoid that.
Calling unwrap() on a Result is most of the time as good as calling ok().unwrap().
And this loop can be written to avoid array indexing.
let stdout_unlocked = std::io::stdout();
let mut stdout = stdout_unlocked.lock();
for row in handles.into_iter().flat_map(|h| h.join().unwrap().into_iter()) {
stdout.write_all(&row).unwrap();
}
stdout.flush().unwrap();
Sure, LLVM does a good job here. The benchmark rules disallow exploiting the symmetry of the image on this benchmark. Unrolling the loop to compute the Mandelbrot sequence two elements at a time using SIMD has yielded a speedup of about 2 in the past.
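The two-elements-at-a-time unrolling described above can be sketched like this (a minimal illustration, not code from the actual submission; mbrot2, the bitmask convention, and the 50-iteration limit are made up for the example):

```rust
// Iterate z = z^2 + c for two points in lock step, so LLVM can pack both
// lanes into one SIMD register. Returns a 2-bit mask: bit i is set if
// lane i is still inside the set after `iterations` steps.
fn mbrot2(cr: [f64; 2], ci: f64, iterations: u32) -> u8 {
    let mut zr = [0.0f64; 2];
    let mut zi = [0.0f64; 2];
    let mut bits: u8 = 0b11; // one bit per lane, cleared once that point escapes
    for _ in 0..iterations {
        for i in 0..2 {
            let (r, im) = (zr[i], zi[i]);
            zr[i] = r * r - im * im + cr[i];
            zi[i] = 2.0 * r * im + ci;
            if zr[i] * zr[i] + zi[i] * zi[i] > 4.0 {
                bits &= !(1 << i); // lane i has escaped
            }
        }
        if bits == 0 {
            break; // both lanes escaped, nothing left to do
        }
    }
    bits
}

fn main() {
    // lane 0 = point (0, 0), which is in the set; lane 1 = point (2, 0), which escapes
    let bits = mbrot2([0.0, 2.0], 0.0, 50);
    println!("{:02b}", bits); // prints "01"
}
```

The bitmask shape mirrors how the real benchmark packs 8 pixels into one output byte, just narrowed to 2 lanes.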
Ah, that would make a lot more sense. I was trying to convince it to use the vector instructions on the individual complex numbers (it was working more like this when I last looked at the asm), but as you can see it doesn't really save much time, and LLVM happily ignores the possibility when it sees ways to optimize elsewhere.
let yloc = &yvals;

let handles: Vec<_> = (0..THREADS).map(|e| {
    let xloc = xloc.to_vec();
I still don't really understand the whole borrowing/ownership thing, but this is still only day 3 of actually having written any Rust.
What happens is that thread::spawn wants to take ownership of xloc and yloc, because it can't take references: it does not know whether the references will live long enough. Your solution is to give a copy of the vector to each thread. This extra allocation should not be a bottleneck.
The other solution is to put xloc and yloc in an Arc (Atomic Reference Counter); then you can clone() the Arc (which increments the ref count but does not copy the inner object). That way, you avoid the allocation and copy of xloc and yloc, and just increment a ref count.
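A minimal sketch of the Arc pattern just described (the data, thread count, and sum_in_threads helper are placeholders for illustration, not code from the PR):

```rust
use std::sync::Arc;
use std::thread;

// Share read-only data across threads without a per-thread copy of the Vec.
fn sum_in_threads(data: Vec<f64>, nthreads: usize) -> Vec<f64> {
    let data = Arc::new(data); // one allocation, shared by all threads
    let handles: Vec<_> = (0..nthreads)
        .map(|t| {
            // clone() only bumps the reference count; the Vec is not copied
            let data = Arc::clone(&data);
            thread::spawn(move || data.iter().sum::<f64>() + t as f64)
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}

fn main() {
    let sums = sum_in_threads((0..8).map(|i| i as f64).collect(), 4);
    println!("{:?}", sums); // [28.0, 29.0, 30.0, 31.0]
}
```

Arc works here because the threads only read the data; mutation would additionally need a Mutex or similar.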
Good! That's what I intended to do. Although I didn't really understand why I wasn't allowed to use the copies I had.
xloc and yloc are small; given that I'm copying around size*size data, 2*size*THREADS didn't seem worth worrying about.
Then you can remove the xloc and yloc definitions outside of the loop and do let xloc = xvals.clone(); inside it.
Another pattern that can be used is to declare let xvals = xvals; outside of the loop; it makes xvals immutable. clone() must still be used.
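The rebinding-plus-clone() pattern can be sketched as follows (lengths_per_thread is a made-up helper for illustration, not code from the PR):

```rust
use std::thread;

// Per-thread clone() pattern: rebind to freeze the data, then clone inside
// the closure so each thread owns its own copy.
fn lengths_per_thread(mut xvals: Vec<f64>, nthreads: usize) -> Vec<usize> {
    xvals.push(0.0); // last mutation before sharing
    let xvals = xvals; // rebinding: xvals is immutable from here on
    let handles: Vec<_> = (0..nthreads)
        .map(|_| {
            // one copy per thread, defined inside the loop body
            let xvals = xvals.clone();
            thread::spawn(move || xvals.len())
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}

fn main() {
    println!("{:?}", lengths_per_thread(vec![0.0; 3], 2)); // [4, 4]
}
```

Compared to the Arc version, this pays one allocation per thread, which is fine while the vectors stay small.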
Wrote it again with vectorising across multiple points in mind. Strangely, enabling AVX doesn't seem to improve things notably over SSE3, despite the hot loops being noticeably shorter.
It looks great!

fn mul2 (x: Vecf64, y: Vecf64) -> Vecf64 {
    let mut res = ZEROS;
    for i in 0..VLEN {res[i] = x[i] * y[i];}
Idiomatic indentation is:
for i in 0..VLEN { res[i] = x[i] * y[i]; }
(with spaces just inside the braces).
There is a downside here in that it uses a lot of memory.
thread::spawn(move || {
    let mut rows = vec![vec![0 as u8; size / 8]; size / THREADS];
    for y in 0..size / THREADS {
        for x in 0..size / 8 {
It's not idiomatic to loop on an index and then use []; better to do:
for (y, row) in rows.iter_mut().enumerate() {
for (x, byte) in row.iter_mut().enumerate() {
...
*byte = mbrot8(cr, ci);
}
}
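The suggestion above, filled out into a self-contained sketch (mbrot8 is stubbed here and the coordinate mapping is elided; only the iterator structure matters):

```rust
const SIZE: usize = 16;
const THREADS: usize = 4;

// Stub standing in for the real computation that packs 8 pixels into a byte.
fn mbrot8(cr: f64, ci: f64) -> u8 {
    ((cr + ci) as u8).wrapping_mul(2) // placeholder, not the real formula
}

fn render_rows() -> Vec<Vec<u8>> {
    let mut rows = vec![vec![0u8; SIZE / 8]; SIZE / THREADS];
    // enumerate() yields the index alongside a mutable reference,
    // so no [] indexing is needed in the loop body.
    for (y, row) in rows.iter_mut().enumerate() {
        for (x, byte) in row.iter_mut().enumerate() {
            let (cr, ci) = (x as f64, y as f64); // real coordinate mapping elided
            *byte = mbrot8(cr, ci);
        }
    }
    rows
}

fn main() {
    let rows = render_rows();
    println!("{} rows of {} bytes", rows.len(), rows[0].len()); // 4 rows of 2 bytes
}
```

Besides being idiomatic, this shape lets the compiler skip bounds checks that the indexed version may still pay for.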
I think I'll go away for a bit and do a bit of reading -- there's still something I'm not getting about & and * properly. You're most welcome to make the changes you're suggesting (and any others) and submit.
I really like your version :) I couldn't find any useless allocation (except xloc and yloc, but that seems negligible).
time -v indicates the maximum resident set size is almost exactly double what is needed (to simply store the pbm). Does our use of stdout create a copy before it gets flushed?
According to http://doc.rust-lang.org/stable/std/io/struct.Stdout.html, stdout is buffered.
I'll submit the bench a little bit later, so if you have any idea, don't hesitate to submit a new PR. Thanks again!
I was thinking I'd have a look at n_body and then maybe the spectrum one.
I'm looking at spectralnorm right now.
Still completely green at Rust; I wound up creating a new version rather than fixing the old one because it was easier to understand this way. LLVM seems to vectorise it reasonably well, and it runs in about 2.5x the C time on my system.