
New mandelbrot.rs #2

Merged: 3 commits, May 19, 2015

Conversation

Schroedingers-Hat
Contributor

Still completely green to Rust; I wound up creating a new version rather than fixing the old one because it was easier to understand this way. LLVM seems to vectorise it reasonably well, and it runs in about 2.5x the C time on my system.

@TeXitoi
Owner

TeXitoi commented May 19, 2015

Thanks for contributing!

LLVM is not autovectorising if you are 2.5 times slower: SIMD can only double the performance. We should be at most 2 times slower, and I hope your implementation will be that fast :)

//
// contributed by the Rust Project Developers
// contributed by TeXitoi
// contributed by Matt Watson
Owner

If you do not use the original version, you can remove the rust project line.

My contribution must be the last one as, according to the rules of the benchmark game, the last contributor must be the person that submit the benchmark. Or you can keep this order, but then you must propose yourself the benchmark.

@Schroedingers-Hat
Contributor Author

Thanks for the tips; as I said, my understanding of Rust is nearly nil (and I wouldn't consider myself a very experienced programmer in general).

When I said this (and n_body) were being vectorized, it was because I was looking at the emitted ASM. LLVM often makes decent use of SSE3 instructions to save itself time (although it appears quite fragile; something broke it at some point, and it's now only storing one variable in each xmm register).
Here's an example from godbolt. It saves some instructions in mbrotpt (may not be optimal), and I added that shift function just to play around.

I'm pretty sure most of the 2.5x slowdown comes from other overhead, such as not caching the xy values, not exploiting symmetry, and generally writing code that is a bit noobish.

std::io::stdout().write(&rows[i]).ok().expect("Could not write to stdout");
}
}
io::stdout().flush().ok().expect("Could not flush stdout");
Owner

Stdout will be acquired and locked several times. Even if it's probably not the bottleneck, we can avoid it.

Calling unwrap() on a Result is most of the time as good as calling ok().unwrap().

And this loop can be written to avoid array indexing.

Owner

    let stdout_unlocked = std::io::stdout();
    let mut stdout = stdout_unlocked.lock();
    for row in handles.into_iter().flat_map(|h| h.join().unwrap().into_iter()) {
        stdout.write_all(&row).unwrap();
    }
    stdout.flush().unwrap();

@TeXitoi
Owner

TeXitoi commented May 19, 2015

Sure, LLVM does a good job here.

The benchmark rules disallow using the symmetry of the image on this benchmark.

Unrolling the loop to compute the Mandelbrot sequence two elements at a time using SIMD has in the past yielded a speedup of about 2.
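A minimal sketch of that 2-wide unrolling (hypothetical mbrot2 with no early exit; the real benchmark packs 8 pixels into one byte): iterating two points in lockstep through fixed-size arrays lets LLVM keep both lanes in one SIMD register.

```rust
const ITER: u32 = 50;
const LIMIT: f64 = 2.0;

// Iterate two Mandelbrot points in lockstep.
// Returns a 2-bit mask: bit i is set if point i stays bounded after ITER steps.
fn mbrot2(cr: [f64; 2], ci: [f64; 2]) -> u8 {
    let mut zr = [0.0f64; 2];
    let mut zi = [0.0f64; 2];
    for _ in 0..ITER {
        // Both lanes advance together: z = z^2 + c, elementwise.
        for i in 0..2 {
            let r = zr[i];
            zr[i] = r * r - zi[i] * zi[i] + cr[i];
            zi[i] = 2.0 * r * zi[i] + ci[i];
        }
    }
    let mut mask = 0u8;
    for i in 0..2 {
        // Diverged lanes end up at inf/NaN, so the comparison fails for them.
        if zr[i] * zr[i] + zi[i] * zi[i] <= LIMIT * LIMIT {
            mask |= 1 << i;
        }
    }
    mask
}

fn main() {
    // Lane 0 is the origin (in the set); lane 1 is (2, 2) (escapes).
    println!("{:02b}", mbrot2([0.0, 2.0], [0.0, 2.0]));
}
```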

@Schroedingers-Hat
Contributor Author

Unrolling the loop to compute the Mandelbrot sequence two elements at a time using SIMD has in the past yielded a speedup of about 2.

Ah, that would make a lot more sense. I was trying to convince it to use vector instructions on the individual complex numbers (it was working more like this when I last looked at the asm), but as you can see it doesn't really save much time, and LLVM happily ignores the possibility when it sees ways to optimize elsewhere.

let yloc = &yvals;

let handles: Vec<_> = (0..THREADS).map(|e| {
let xloc = xloc.to_vec();
Contributor Author

I still don't really understand the whole borrowing/ownership thing, but this is still only day 3 of actually having written any Rust.

Owner

What happens is that thread::spawn wants to take ownership of xloc and yloc, because it can't take references: it doesn't know whether the references will live long enough. Your solution makes a copy of the vector for each thread; this extra allocation shouldn't be a bottleneck.

The other solution is to put xloc and yloc in an Arc (atomic reference counter); then you can clone() the Arc (which increments the ref count but does not copy the inner object). That way you avoid the allocation and copy of xloc and yloc, and just increment a ref count.
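A minimal sketch of the Arc pattern (parallel_sum is a hypothetical helper, not the benchmark code): every worker gets a cheap Arc clone of the same vector instead of its own copy.

```rust
use std::sync::Arc;
use std::thread;

// Each worker reads the same shared vector through an Arc clone
// instead of copying the whole buffer per thread.
fn parallel_sum(vals: Vec<f64>, threads: usize) -> f64 {
    let vals = Arc::new(vals);
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            // Arc::clone only bumps the reference count; the Vec is not copied.
            let vals = Arc::clone(&vals);
            thread::spawn(move || vals.iter().sum::<f64>())
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}

fn main() {
    // 4 threads each sum the same [1, 2, 3] buffer: 4 * 6 = 24.
    println!("{}", parallel_sum(vec![1.0, 2.0, 3.0], 4));
}
```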

Contributor Author

Good! That's what I intended to do, although I didn't really understand why I wasn't allowed to use the copies I had.

xloc and yloc are small; given that I'm moving around size*size data, 2*size*THREADS didn't seem worth worrying about.

Owner

Then you can remove the xloc and yloc definitions outside of the loop and just do let xloc = xvals.clone();

Another pattern that can be used is to declare let xvals = xvals; outside of the loop; it makes xvals immutable. clone() must still be used.
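Both patterns together might look like this (spawn_workers is a hypothetical illustration, not the PR's code): re-bind to freeze mutability, then clone directly in the closure-building loop with no outer xloc binding.

```rust
use std::thread;

// Re-bind to make the data immutable, then clone per thread inside the loop.
fn spawn_workers(mut xvals: Vec<f64>) -> Vec<f64> {
    xvals.push(1.0);
    let xvals = xvals; // re-binding: xvals is immutable from here on
    let handles: Vec<_> = (0..2)
        .map(|_| {
            let xvals = xvals.clone(); // per-thread copy, cloned right here
            thread::spawn(move || xvals.iter().sum::<f64>())
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}

fn main() {
    // Two threads each sum [2, 3, 1]: both report 6.
    println!("{:?}", spawn_workers(vec![2.0, 3.0]));
}
```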

@Schroedingers-Hat
Contributor Author

Wrote it again with vectorising across multiple points in mind.
This version edges out the fastest C++ version on my system by about 10% if I set THREADS to 20; I don't know how that will hold up on other CPUs.

Strangely, enabling AVX doesn't seem to improve things notably over SSE3, despite the hot loops being noticeably shorter.

@TeXitoi
Owner

TeXitoi commented May 19, 2015

It looks great!


fn mul2 (x: Vecf64, y: Vecf64) -> Vecf64 {
let mut res = ZEROS;
for i in 0..VLEN {res[i] = x[i] * y[i];}
Owner

idiomatic indentation is:

for i in 0..VLEN { res[i] = x[i] * y[i]; }

(with spaces inside the braces).

@Schroedingers-Hat
Contributor Author

There is a downside here in that it uses a lot of memory.
If you could explain why I appear to have allocated double what is necessary (am I copying something I don't need to?), that'd be great.

thread::spawn(move || {
let mut rows = vec![vec![0 as u8; size / 8]; size / THREADS];
for y in 0..size / THREADS {
for x in 0..size / 8 {
Owner

It's not idiomatic to loop on an index and then use []; better to do:

for (y, row) in rows.iter_mut().enumerate() {
    for (x, byte) in row.iter_mut().enumerate() {
        ...
        *byte = mbrot8(cr, ci);
    }
}

Contributor Author

I think I'll go away for a bit and do some reading -- there's still something about & and * that I'm not properly getting. You're most welcome to make the changes you're suggesting (and any others) and submit.

@TeXitoi
Owner

TeXitoi commented May 19, 2015

I really like your version :) I couldn't find any useless allocation (except xloc and yloc, but that seems negligible).

@Schroedingers-Hat
Contributor Author

time -v indicates the maximum resident set size is almost exactly double what is needed (to simply store the pbm). Does our use of stdout create a copy before it gets flushed?

@TeXitoi
Owner

TeXitoi commented May 19, 2015

According to http://doc.rust-lang.org/stable/std/io/struct.Stdout.html, stdout is buffered.

TeXitoi merged commit 01bdd34 into TeXitoi:master May 19, 2015
@TeXitoi
Owner

TeXitoi commented May 19, 2015

I'll submit the bench a little bit later, so if you have any idea, don't hesitate to submit a new PR. Thanks again!

@Schroedingers-Hat
Contributor Author

I was thinking I'd have a look at n_body and then maybe the spectrum one.

@TeXitoi
Owner

TeXitoi commented May 19, 2015

I'm looking at spectralnorm right now

mbrubeck added a commit to mbrubeck/benchmarksgame-rs that referenced this pull request Mar 14, 2017
TeXitoi pushed a commit that referenced this pull request Mar 15, 2017