
Multithreaded optimisation #6

Closed

Smattr opened this issue Feb 14, 2018 · 5 comments

Comments

Smattr (Owner) commented Feb 14, 2018

The current multithreaded checker performs pretty poorly. Checking tests/long-running2/model.m on a sample platform with 4 threads:

  • single threaded, unoptimised: 64 seconds
  • 4 threads, unoptimised: 348 seconds
  • single threaded, optimised: 8 seconds
  • 4 threads, optimised: 314 seconds

This is not surprising as I expected the initial implementation to be pretty terrible. The multithreaded checker uses a single mutex-protected queue and a single mutex-protected seen set. Execution is completely dominated by contention on these two mutexes.

For the queue, one possibility would be to switch to per-thread, mutex-protected queues with work stealing. A second option would be a multi-headed, lock-free queue. The multithreaded checker does not need to pop the queued states in any particular order, which gives us a lot of freedom.
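For illustration, a minimal sketch of the per-thread-queues-with-stealing option (the names and structure here are hypothetical, not from this code base):

```cpp
// Hypothetical sketch: one mutex-protected queue per thread, with stealing.
// Threads normally touch only their own queue, so contention is confined to
// the (rare) steal path.
#include <cstddef>
#include <mutex>
#include <optional>
#include <queue>
#include <vector>

template <typename State>
class PerThreadQueues {
public:
  explicit PerThreadQueues(std::size_t threads)
      : queues(threads), locks(threads) {}

  void push(std::size_t me, State s) {
    std::lock_guard<std::mutex> guard(locks[me]);
    queues[me].push(std::move(s));
  }

  // Pop from our own queue first; if it is empty, try to steal from the
  // others. Pop order is irrelevant to the checker, so any victim will do.
  std::optional<State> pop(std::size_t me) {
    for (std::size_t i = 0; i < queues.size(); ++i) {
      std::size_t victim = (me + i) % queues.size();
      std::lock_guard<std::mutex> guard(locks[victim]);
      if (!queues[victim].empty()) {
        State s = std::move(queues[victim].front());
        queues[victim].pop();
        return s;
      }
    }
    return std::nullopt; // every queue was (momentarily) empty
  }

private:
  std::vector<std::queue<State>> queues;
  std::vector<std::mutex> locks; // one mutex per queue
};
```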

For the seen set, it should be possible to implement a thread-safe, lock-free set.
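Roughly this shape, sketched under the assumption that a state can be reduced to a non-zero 64-bit fingerprint (the real checker stores full states):

```cpp
// Hypothetical sketch: a fixed-capacity, open-addressed table where insertion
// claims an empty slot via compare-and-swap. No mutex anywhere.
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

class LockFreeSeenSet {
public:
  // Slots are zero-initialized; 0 is reserved to mean "empty".
  explicit LockFreeSeenSet(std::size_t capacity) : slots(capacity) {}

  // Returns true if `h` was newly inserted, false if it was already present.
  // Callers must never insert 0, and the table is assumed to never fill.
  bool insert(std::uint64_t h) {
    std::size_t i = h % slots.size();
    for (;;) {
      std::uint64_t expected = 0;
      if (slots[i].compare_exchange_strong(expected, h))
        return true;  // we claimed an empty slot
      if (expected == h)
        return false; // another thread already inserted this state
      i = (i + 1) % slots.size(); // linear probing
    }
  }

private:
  std::vector<std::atomic<std::uint64_t>> slots;
};
```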

Smattr added a commit that referenced this issue Feb 16, 2018
This is part of movement towards better multithreaded performance.

Related to GitHub #6 "Multithreaded optimisation"
Smattr added a commit that referenced this issue Feb 16, 2018
Surprisingly, this doesn't have too much of an effect on execution time in my
current example. We drop from 314 seconds to 264 seconds. I haven't profiled to
see whether execution time is dominated by set insertion, but perhaps this is
the case.

Related to GitHub #6 "Multithreaded optimisation"
Smattr added a commit that referenced this issue Feb 16, 2018
Using the weak version is recommended in cases where you need to loop anyway.
It also seems to yield slightly better performance at first glance.

Related to GitHub #6 "Multithreaded optimisation"
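For context, this presumably refers to std::atomic's compare_exchange_weak, which is allowed to fail spuriously and is therefore the recommended variant when the CAS already sits inside a retry loop:

```cpp
#include <atomic>
#include <cstdint>

// The standard retry-loop pattern. On failure (spurious or genuine),
// compare_exchange_weak refreshes `old` with the current value, so the
// loop simply retries with up-to-date state.
void increment(std::atomic<std::uint64_t> &counter) {
  std::uint64_t old = counter.load();
  while (!counter.compare_exchange_weak(old, old + 1)) {
    // retry
  }
}
```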
Smattr added a commit that referenced this issue Feb 18, 2018
Related to GitHub #6 "Multithreaded optimisation"
Smattr added a commit that referenced this issue Feb 18, 2018
This is another data point for this implementation, which still performs terribly.

Related to GitHub #6 "Multithreaded optimisation"
Smattr added a commit that referenced this issue Feb 18, 2018
This is more performant right now, though still not particularly impressive.
This reverts 7c27a33 and 20b96ef.

Related to GitHub #6 "Multithreaded optimisation"
Smattr (Owner) commented Feb 18, 2018

In the current tip (90d1347) on the same 4 thread platform we get 1164 seconds for checking long-running2, with pretty high utilisation across all threads. Using a single mutex-protected set (96703a3) gives us 361 seconds with lower utilisation (typically only 2 threads). I suspect the latter is due to a superior hash set implementation; i.e. doing less work.

Obviously both of these results are pretty poor. We need to come up with a better algorithm here.

Smattr added a commit that referenced this issue Feb 19, 2018
This commit implements a wrapper around new/delete as a struct, Allocator.
This struct caches the last thing that was deleted and uses it for allocation
the next time it is called upon. In our domain, this works because (1) we do all
our new/deletes in pseudo lock step (we never have two 'deletes' without a 'new'
in-between) and (2) our allocations are all statically localised to threads, so
we can get away without any locking in the allocator.

This seems both trivial and over-engineered. However, it makes a significant
practical difference. In my current workhorse, long-running2, this takes the
single-threaded checker from 90 seconds down to 51 and the 4-threaded checker
from 117 seconds down to 85.

Related to GitHub #6 "Multithreaded optimisation"
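A sketch of the one-slot caching allocator this commit message describes (details are illustrative, not the actual implementation; it assumes fixed-size allocations and thread-local use, per the constraints above):

```cpp
#include <cstddef>
#include <new>

// Because new/delete happen in pseudo lock step, a single cached block
// suffices, and because each instance is local to one thread, no locking
// is needed.
struct Allocator {
  void *cached = nullptr; // the most recently freed block, if any

  void *alloc(std::size_t size) {
    if (cached != nullptr) {
      void *p = cached; // reuse the last freed block, skipping the heap
      cached = nullptr;
      return p;
    }
    return ::operator new(size);
  }

  void free(void *p) {
    // The lock-step pattern guarantees the cache slot is empty here.
    cached = p;
  }

  ~Allocator() {
    if (cached != nullptr)
      ::operator delete(cached);
  }
};
```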
Smattr added a commit that referenced this issue Feb 20, 2018
This was previously introduced when the multithreaded queue was built on
atomics. It's now built on std::queue, so we have no need for the manual
linked-list.

Though it was not the purpose of this commit, this change also noticeably
improves performance. Our current workhorse, long-running2, is
down from 51 seconds single-threaded to 46 seconds. With 4 threads, it drops
from 85 seconds to 68 seconds.

Related to GitHub #6 "Multithreaded optimisation"
Smattr added a commit that referenced this issue Feb 27, 2018
Instead of declaring a large array when the user has asked for a fixed state
set, we now call calloc on start up to allocate this array. This has a few
advantages:

  1. Large callocs are usually implemented as mmaps that get zeroed pages
     directly from the OS. In other words, using calloc makes it clearer to the
     OS what we are doing and it should be more able to optimise its operations.
  2. The previous implementation resulted in an extremely large BSS section when
     using a large hash set size. This does not inflate the binary but causes
     problems when references into this section require more than 32-bit
     addressing. On Linux this necessitated moving to one of the larger memory
     models for linking, which is undesirable.
  3. The previous implementation constructed all the StateBases in the hash set
     on start up. Although StateBase recently gained a trivial default
     constructor, this operation was still a waste of time as we would later
     overwrite all of this. Compounding the fact that it wasted runtime, this
     also could (in fact, did) result in touching numerous pages on start up,
     causing the OS to immediately give us a lot of memory.

The changes in this commit have one significant wart noted in the FIXME comment.
We should address this in the near future.

Closes GitHub #7 "Fixed set capacity requires 64-bit addressing"
Related to GitHub #6 "Multithreaded optimisation"
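A sketch of the calloc-based start-up allocation described above (StateBase here is a stand-in for the real type, and the function name is illustrative):

```cpp
#include <cstdio>
#include <cstdlib>

struct StateBase { unsigned char data[16]; }; // stand-in for the real type

static StateBase *seen_set = nullptr;

static void allocate_seen_set(std::size_t set_capacity) {
  // A large calloc is typically serviced by mmap, handing us zero-on-demand
  // pages: nothing is constructed or touched until the checker writes to it,
  // and nothing lands in the binary's BSS section.
  seen_set = static_cast<StateBase *>(
      std::calloc(set_capacity, sizeof(StateBase)));
  if (seen_set == nullptr) {
    std::fprintf(stderr, "failed to allocate state set\n");
    std::exit(EXIT_FAILURE);
  }
}
```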
Smattr (Owner) commented Mar 3, 2018

Another data point... as of fc6f707 we can comfortably beat CMurphi on long-running2 with a 4GB state set:

  • CMurphi -m 4096: 38 seconds
  • Rumur --threads=1 --set-capacity=4294967296: 22 seconds
  • Rumur --threads=4 --set-capacity=4294967296: 39 seconds

It's not strictly an apples-to-apples comparison here:

  1. We haven't implemented the "undefined" semantics yet.
  2. CMurphi's using a fixed size queue, while we're using a dynamic one.

Our multithreaded implementation is still pretty naive. I believe we could go significantly faster here. In particular, for this model we could probably use something like double-checked locking with cmpxchg on the state set entries, given that they are <128 bits.
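A sketch of that idea, assuming a state fits in one 128-bit slot and the all-zero bit pattern can be reserved to mean "empty" (the 128-bit CAS compiles to cmpxchg16b on x86-64 and may need -mcx16 or libatomic):

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

// A state squeezed into 128 bits; all-zero is reserved for "empty".
struct Entry { std::uint64_t lo, hi; };

static bool is_empty(const Entry &e) { return e.lo == 0 && e.hi == 0; }
static bool same(const Entry &a, const Entry &b) {
  return a.lo == b.lo && a.hi == b.hi;
}

// Returns true if `state` was newly inserted into the open-addressed table.
bool insert(std::vector<std::atomic<Entry>> &slots, Entry state) {
  std::size_t i = (state.lo ^ state.hi) % slots.size();
  for (;;) {
    // First check: an ordinary load, no locked instruction on the bus.
    Entry cur = slots[i].load(std::memory_order_relaxed);
    if (same(cur, state))
      return false; // already seen
    if (is_empty(cur)) {
      // Second check: claim the empty slot with a 128-bit CAS.
      Entry empty = {0, 0};
      if (slots[i].compare_exchange_strong(empty, state))
        return true;
      if (same(empty, state))
        return false; // a racing thread inserted this same state first
      continue;       // the slot now holds some other state; re-examine it
    }
    i = (i + 1) % slots.size(); // linear probe past an occupied slot
  }
}
```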

Smattr (Owner) commented May 15, 2018

Interestingly, the latest results from the C checker in branch "c-checker2" (commit f67cf8c) are more in line with the results prior to implementing static allocation (the 1164 seconds and 361 seconds quoted above). The figure for the C checker is around 453 seconds. This is with a single thread and none of the optimisation tricks we've played in the C++ checker. My hope is that once we implement multithreading, the C checker will beat the C++ checker on this trivial example and easily trounce it on more complicated models.

Smattr (Owner) commented May 24, 2018

As noted in 5637524, we hit 79 seconds in "c-checker2" now. Although this is still not beating the C++ checker, this looks like a promising approach for multithreaded scalability.

Smattr (Owner) commented Feb 7, 2019

We comfortably outperform CMurphi and have done so for some time. Closing as done.

Smattr closed this as completed Feb 7, 2019