Rewrote chameneos_redux.rs #24

Closed

Veedrac commented Sep 28, 2015

Rewrote chameneos_redux.

Well, actually I dropped the first rewrite 'cause it was too slow.

...And the second rewrite, since that was still too slow and used a ton of unsafe.

I guess this is the third rewrite plus a bit, since it took a lot of mauling to get it into shape.

The problem is that the competition is really good. The C++ uses a thread pool, you see.
But to make it fast, they used a really biased selection algorithm... and then just spawned enough threads to hide the bias. The C version doesn't use a thread pool, but it deals with the meeting count in the same atomic instruction as the mall exchange. And both of them do unsynchronized cross-thread writes...

But, I'm glad to say, a lot of elbow grease later and I'm pretty sure I've soundly beaten both.

Advantages

It uses a Java-like thread pool. Except mine is atomic and can only actually hold 7 threads - the other 3 must be active or in the mall lest they just disappear. A form of juggling, perhaps. And when I say atomic I mean it - in the best case it can atomically dequeue a task, add it to an empty mall and switch into a fresh thread in a single exchange. After a meeting, it'll enqueue both tasks and dequeue the next in another single exchange.
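
The "single exchange" idea can be shown with a much smaller sketch than the pool described above: pack a mall slot and the remaining-meetings counter into one atomic word, so that one compare-exchange updates both. This is closer to what the C version is said to do than to this PR's executor, and the bit layout plus the names `pack`, `unpack`, `arrive` and `Arrival` are hypothetical, not code from the PR.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Low 8 bits of the word: 0 means the mall is empty, otherwise creature id + 1.
const EMPTY: u64 = 0;

#[derive(Debug, PartialEq)]
enum Arrival {
    Waiting,  // we are now the creature waiting in the mall
    Met(u8),  // we met the creature with this id
    Closed,   // no meetings left
}

/// Hypothetical layout: meetings left in the high bits, occupant in the low 8 bits.
fn pack(meetings_left: u64, occupant: u64) -> u64 {
    (meetings_left << 8) | occupant
}

fn unpack(word: u64) -> (u64, u64) {
    (word >> 8, word & 0xff)
}

/// Either parks `id` in an empty mall or meets the occupant and decrements
/// the meeting counter, each as a single compare-exchange on one word.
fn arrive(mall: &AtomicU64, id: u8) -> Arrival {
    debug_assert!(id < 255, "id + 1 must fit in the low 8 bits");
    let mut current = mall.load(Ordering::Acquire);
    loop {
        let (left, occupant) = unpack(current);
        if left == 0 {
            return Arrival::Closed;
        }
        let next = if occupant == EMPTY {
            pack(left, u64::from(id) + 1)
        } else {
            pack(left - 1, EMPTY)
        };
        match mall.compare_exchange_weak(current, next, Ordering::AcqRel, Ordering::Acquire) {
            Ok(_) if occupant == EMPTY => return Arrival::Waiting,
            Ok(_) => return Arrival::Met((occupant - 1) as u8),
            // Another thread changed the word (or the weak CAS failed spuriously): retry.
            Err(seen) => current = seen,
        }
    }
}
```

Starting from `AtomicU64::new(pack(2, EMPTY))`, two arrivals form one meeting and leave the counter at 1 with the mall empty again. No arrival ever observes a state where the occupant has been taken but the count has not yet been updated.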

I use no unsafe at all, relying instead on gratuitously tuned uncontested atomics. A lot of assembly has been read to make sure of that status.

My timings put this roughly 2-3x the speed of the next-best competitor. I have no doubt other people's timings will be slightly different, but I'm pretty confident in the lead. I can actually make it a bit faster by dropping the number of executor threads - unlike C++ I don't need one per virtual thread to hide bias, and unlike the Java I don't have contention problems with fewer threads. However, I thought that might get me into trouble, so I didn't. I might mention it when I submit it, though.

Disadvantages

let x = || Default::default();
let mut states: [ChameneosState; 16] = [
    x(), x(), x(), x(), x(), x(), x(), x(),
    x(), x(), x(), x(), x(), x(), x(), x(),
];
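
For readers on a newer toolchain: `std::array::from_fn` (stable since Rust 1.63, so well after this PR) builds such an array without spelling out each element. A minimal sketch, assuming a stand-in `ChameneosState` whose fields are hypothetical, since the real struct is not shown in this thread:

```rust
use std::array;
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical stand-in for the PR's `ChameneosState`; the real fields may differ.
#[derive(Default)]
struct ChameneosState {
    meet_count: AtomicUsize,
    meet_same_count: AtomicUsize,
}

/// Builds 16 default states. The closure runs once per index,
/// so the element type needs neither `Copy` nor `Clone`.
fn fresh_states() -> [ChameneosState; 16] {
    array::from_fn(|_| ChameneosState::default())
}

/// Sums the meeting counters, just to show the array is usable.
fn total_meetings(states: &[ChameneosState]) -> usize {
    states
        .iter()
        .map(|s| s.meet_count.load(Ordering::Relaxed) + s.meet_same_count.load(Ordering::Relaxed))
        .sum()
}
```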

PS: help me i can't stop

llogiq Sep 29, 2015

Contributor

It's ok. We're all in the same boat. 😏

llogiq Sep 29, 2015

Contributor

By the way, my benchmark results:

$ time bin/chameneos_redux 6000000
[output omitted]
real    2m0.756s
user    1m25.856s
sys     1m49.006s

$ time bin/chameneos_redux_veedrac 6000000
[output omitted]
real    0m2.254s
user    0m8.837s
sys     0m0.009s

@Veedrac you call that confident in the lead? That's probably the understatement of the millennium. 😆

llogiq Sep 29, 2015

Contributor

@Veedrac Do you want me to submit this to rust-lang/rust?

Veedrac Sep 29, 2015

My backup laptop just died on me (the horror), so that might be best, thanks.

llogiq Sep 29, 2015

Contributor

Will do.

Veedrac Sep 30, 2015

By "confident in the lead", I meant compared to the C/++ versions which are also way faster than the old Rust code. If that's a quad-core benchmark, it actually puts Rust behind C (which irks me greatly).

Can you time C++ #2 and C #5 under the same conditions (and also under taskset 1)? I get

threads        Rust     C++      C
1 (taskset)    1.692    2.857    13.027
2 (native)     1.399    2.194    3.370

I tested on a quad-core I borrowed at one point, and IIRC results were similar for Rust whereas C and C++ traded places.

FWIW, I fixed my computer so I'm fine if you'd rather I submit it. It does electrocute me now, though, and is half made of tape, but it only needs to last 2 days for its replacement.

llogiq Sep 30, 2015

Contributor

It's ok, I'll submit it. And do the C/C++ benchmarks. Holy [insert expletive here], someone should get this man an actual computer!

llogiq Sep 30, 2015

Contributor

Here's C on four cores:

real    0m0.562s
user    0m2.111s
sys     0m0.116s

And here C++:

real    0m0.676s
user    0m2.636s
sys     0m0.007s

Veedrac Sep 30, 2015

Ack, not good. I went back to test on the four-core machine and I have indeed regressed somewhere. I'll investigate.

Veedrac Sep 30, 2015

OK, I know what happened. I was mistiming the four-core case. However, it's not all bad. On the four-core machine,

threads        Rust     C++      C
1 (taskset)    0.438    1.537    4.672
4 (native)     1.535    2.042    0.724

Or, normalizing each row,

threads        Rust     C++      C
1 (taskset)    1.0      3.5      10.7
4 (native)     2.1      2.8      1.0

Thinking about this more, it even makes sense. Oh well; I'll tell you if I think of things.
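
The row normalization used above is just each time divided by the fastest time in its row, rounded to one decimal. A small sketch that reproduces the normalized figures from the raw ones (`normalize_row` is a hypothetical helper, not part of the PR):

```rust
/// Divides every time in a row by the row's smallest time and rounds to one decimal,
/// so the fastest implementation reads 1.0 and the others read as multiples of it.
fn normalize_row(times: &[f64]) -> Vec<f64> {
    let fastest = times.iter().cloned().fold(f64::INFINITY, f64::min);
    times
        .iter()
        .map(|t| (t / fastest * 10.0).round() / 10.0)
        .collect()
}
```

For the taskset row, `normalize_row(&[0.438, 1.537, 4.672])` gives `[1.0, 3.5, 10.7]`, and for the four-thread row `normalize_row(&[1.535, 2.042, 0.724])` gives `[2.1, 2.8, 1.0]`, matching the table.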

llogiq Oct 1, 2015

Contributor

@Veedrac Can I help, e.g. by profiling?

Veedrac Oct 1, 2015

Doubt it, sorry, but I have thought of a few things which should actually make a significant difference. I'll see if I have the time to try them out fully, but initial impressions look good.

Veedrac Oct 1, 2015

Here's the improvement I was thinking about:

threads        Rust     C++      C
1 (taskset)    0.300    1.514    4.604
4 (native)     0.991    2.049    0.685

Normalized:

threads        Rust     C++      C
1 (taskset)    1.0      5.0      15.3
4 (native)     1.4      3.0      1.0

In other words, I'm close: 5x as fast as C++ (and 15x as fast as C) in the single-threaded benchmark, and only about 40% slower than C in the multithreaded one.

llogiq Oct 1, 2015

Contributor

Please check in your current solution, so that I can double-check your benchmarks.

Veedrac Oct 1, 2015

@llogiq I've pushed them. I added a little extra on top, but it didn't really help.

I'm also inclined to believe my own four-core benchmarks a little more, since they agree more closely with the official results (e.g. C++ taking much longer than C).

llogiq Oct 2, 2015

Contributor

I do have one small improvement in ChameneosState::meet:

    fn meet(&self, same: bool, color: Color) {
        self.meet_count.fetch_add(1, Ordering::AcqRel);
        if same {
            self.meet_same_count.fetch_add(1, Ordering::AcqRel);
        }
        self.color.store(color, Ordering::Release);
    }

This is shorter and also appears to be a bit faster on my machine:

Before:
real    0m2.257s
user    0m8.820s
sys     0m0.012s

After:
real    0m2.028s
user    0m7.914s
sys     0m0.013s

I don't know if the speed difference also appears on Isaac's server, though.

Veedrac Oct 2, 2015

It used to be that but I changed it for performance! I'm still getting the opposite result. I'll time on the four-core soon.

The difference is just

incq    16(%rdi)
testb   %sil, %sil
je  .LBB3_4
incq    24(%rdi)

vs

lock        incq    16(%rdi)
testb   %sil, %sil
je  .LBB3_4
lock        incq    24(%rdi)

Perhaps it's a difference due to contention or architecture, but I'm certainly surprised you get the latter being faster.
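
For reference, here is a sketch of the two source-level shapes that would plausibly lead to those two instruction sequences. It assumes the unlocked version is a separate relaxed load and store (the PR's actual `meet` body is not quoted in this thread), and `Counters`, `meet_rmw` and `meet_single_writer` are hypothetical names.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical stand-in holding only the two counters discussed above.
#[derive(Default)]
struct Counters {
    meet_count: AtomicUsize,
    meet_same_count: AtomicUsize,
}

impl Counters {
    /// Atomic read-modify-write. Correct with any number of concurrent writers;
    /// on x86-64 this compiles to a `lock`-prefixed instruction.
    fn meet_rmw(&self, same: bool) {
        self.meet_count.fetch_add(1, Ordering::AcqRel);
        if same {
            self.meet_same_count.fetch_add(1, Ordering::AcqRel);
        }
    }

    /// Separate load and store. Only correct if at most one thread updates
    /// these counters at a time; concurrent callers could lose increments.
    fn meet_single_writer(&self, same: bool) {
        let n = self.meet_count.load(Ordering::Relaxed);
        self.meet_count.store(n + 1, Ordering::Relaxed);
        if same {
            let m = self.meet_same_count.load(Ordering::Relaxed);
            self.meet_same_count.store(m + 1, Ordering::Relaxed);
        }
    }

    /// Reads both counters back as (meetings, same-colour meetings).
    fn totals(&self) -> (usize, usize) {
        (
            self.meet_count.load(Ordering::Relaxed),
            self.meet_same_count.load(Ordering::Relaxed),
        )
    }
}
```

With a single writer the two variants produce the same counts. The difference is only in what happens under concurrent writers and in what the hardware has to do per increment.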

llogiq Oct 2, 2015

Contributor

Strange. The actual assembly is a bit reordered, but I doubt it matters too much. Please re-benchmark on four cores.

Oh, and with current Intel CPUs, locks are mostly free.

Veedrac Oct 2, 2015

The fetch_add version takes about 10% longer on the four-core machine here.

llogiq Oct 2, 2015

Contributor

OK then. Strange that LLVM emits suboptimal code for this. Or maybe rustc isn't using it correctly.

I'd have expected a lock only on SeqCst.
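
On x86 the `lock` prefix is what makes a read-modify-write atomic at all, so it appears for every ordering, `Relaxed` included. The ordering argument mainly changes how plain loads and stores are compiled. A minimal sketch of the distinction: a `Relaxed` `fetch_add` shared between threads still never loses an increment (`count_with_relaxed_rmw` is a hypothetical helper, not code from the PR).

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

/// Increments one shared counter from several threads using only `Relaxed`
/// read-modify-write operations and returns the final value.
fn count_with_relaxed_rmw(threads: usize, per_thread: usize) -> usize {
    let counter = Arc::new(AtomicUsize::new(0));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..per_thread {
                    // Atomicity comes from the operation itself, not from the ordering.
                    counter.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    // `join` synchronizes with each thread's exit, so this load sees every increment.
    counter.load(Ordering::Relaxed)
}
```

`count_with_relaxed_rmw(4, 10_000)` always returns 40000. Replacing the `fetch_add` with a separate load and store would make the result depend on thread interleaving.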

Veedrac Oct 5, 2015

I've submitted chameneos-redux, thread-ring and k-nucleotide to the Benchmarks Game.

I don't want to submit to Rust at the moment, since my replacement backup computer is even slower than my old backup computer (despite having twice the clock speed!).

llogiq Oct 5, 2015

Contributor

I'll see if I can find the time to whip up a PR.

Veedrac Nov 25, 2015

This was submitted with a bugfix. I'll get that pushed later.

Not quite the performance I was wishing for, but still much faster than C++, and even C in the geometric mean.

http://benchmarksgame.alioth.debian.org/u64q/performance.php?test=chameneosredux
http://benchmarksgame.alioth.debian.org/u32/performance.php?test=chameneosredux

TeXitoi Jan 30, 2017

Owner

As this version is submitted to the benchmarksgame, I close the PR.

TeXitoi closed this Jan 30, 2017