
Multithreaded optimisation #6

Closed

Smattr opened this issue Feb 14, 2018 · 5 comments

Comments

Smattr (Owner) commented Feb 14, 2018

The current multithreaded checker performs pretty poorly. Checking tests/long-running2/model.m on a sample platform with 4 threads:

  • single threaded, unoptimised: 64 seconds
  • 4 threads, unoptimised: 348 seconds
  • single threaded, optimised: 8 seconds
  • 4 threads, optimised: 314 seconds

This is not surprising as I expected the initial implementation to be pretty terrible. The multithreaded checker uses a single mutex-protected queue and a single mutex-protected seen set. Execution is completely dominated by contention on these two mutexes.

For the queue, one possibility would be to switch to per-thread, mutex-protected queues with work stealing. A second option would be a multi-headed, lock-free queue. The multithreaded checker does not need to pop the queued states in any particular order, which gives us a lot of freedom.
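For illustration, a minimal sketch of the per-thread-queues-with-stealing option (the names and structure here are hypothetical, not from this code base):

```cpp
// Hypothetical sketch: one mutex-protected queue per thread, with stealing.
// Threads normally touch only their own queue, so contention is confined to
// the (rare) steal path.
#include <cstddef>
#include <mutex>
#include <optional>
#include <queue>
#include <vector>

template <typename State>
class PerThreadQueues {
public:
  explicit PerThreadQueues(std::size_t threads)
      : queues(threads), locks(threads) {}

  void push(std::size_t me, State s) {
    std::lock_guard<std::mutex> guard(locks[me]);
    queues[me].push(std::move(s));
  }

  // Pop from our own queue first; if it is empty, try to steal from the
  // others. Pop order is irrelevant to the checker, so any victim will do.
  std::optional<State> pop(std::size_t me) {
    for (std::size_t i = 0; i < queues.size(); ++i) {
      std::size_t victim = (me + i) % queues.size();
      std::lock_guard<std::mutex> guard(locks[victim]);
      if (!queues[victim].empty()) {
        State s = std::move(queues[victim].front());
        queues[victim].pop();
        return s;
      }
    }
    return std::nullopt; // every queue was (momentarily) empty
  }

private:
  std::vector<std::queue<State>> queues;
  std::vector<std::mutex> locks; // one mutex per queue
};
```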

For the seen set, it should be possible to implement a thread-safe, lock-free set.
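Roughly this shape, sketched under the assumption that a state can be reduced to a non-zero 64-bit fingerprint (the real checker stores full states):

```cpp
// Hypothetical sketch: a fixed-capacity, open-addressed table where insertion
// claims an empty slot via compare-and-swap. No mutex anywhere.
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

class LockFreeSeenSet {
public:
  // Slots are zero-initialized; 0 is reserved to mean "empty".
  explicit LockFreeSeenSet(std::size_t capacity) : slots(capacity) {}

  // Returns true if `h` was newly inserted, false if it was already present.
  // Callers must never insert 0, and the table is assumed to never fill.
  bool insert(std::uint64_t h) {
    std::size_t i = h % slots.size();
    for (;;) {
      std::uint64_t expected = 0;
      if (slots[i].compare_exchange_strong(expected, h))
        return true;  // we claimed an empty slot
      if (expected == h)
        return false; // another thread already inserted this state
      i = (i + 1) % slots.size(); // linear probing
    }
  }

private:
  std::vector<std::atomic<std::uint64_t>> slots;
};
```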

Smattr added a commit that referenced this issue Feb 16, 2018
This is part of movement towards better multithreaded performance.

Related to GitHub #6 "Multithreaded optimisation"
Smattr added a commit that referenced this issue Feb 16, 2018
Surprisingly, this doesn't have too much of an effect on execution time in my
current example. We drop from 314 seconds to 264 seconds. I haven't profiled to
see whether execution time is dominated by set insertion, but perhaps this is
the case.

Related to GitHub #6 "Multithreaded optimisation"
Smattr added a commit that referenced this issue Feb 16, 2018
Using the weak version is recommended in cases where you need to loop anyway.
It also seems to yield slightly better performance at first glance.

Related to GitHub #6 "Multithreaded optimisation"
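For context, this presumably refers to std::atomic's compare_exchange_weak, which is allowed to fail spuriously and is therefore the recommended variant when the CAS already sits inside a retry loop:

```cpp
#include <atomic>
#include <cstdint>

// The standard retry-loop pattern. On failure (spurious or genuine),
// compare_exchange_weak refreshes `old` with the current value, so the
// loop simply retries with up-to-date state.
void increment(std::atomic<std::uint64_t> &counter) {
  std::uint64_t old = counter.load();
  while (!counter.compare_exchange_weak(old, old + 1)) {
    // retry
  }
}
```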
Smattr added a commit that referenced this issue Feb 18, 2018
Related to GitHub #6 "Multithreaded optimisation"
Smattr added a commit that referenced this issue Feb 18, 2018
This is another data point for this implementation, which still performs terribly.

Related to GitHub #6 "Multithreaded optimisation"
Smattr added a commit that referenced this issue Feb 18, 2018
This is more performant right now, though still not particularly impressive.
This reverts 7c27a33 and 20b96ef.

Related to GitHub #6 "Multithreaded optimisation"
Smattr (Owner) commented Feb 18, 2018

In the current tip (90d1347) on the same 4 thread platform we get 1164 seconds for checking long-running2, with pretty high utilisation across all threads. Using a single mutex-protected set (96703a3) gives us 361 seconds with lower utilisation (typically only 2 threads). I suspect the latter is due to a superior hash set implementation; i.e. doing less work.

Obviously both of these results are pretty poor. We need to come up with a better algorithm here.

Smattr added a commit that referenced this issue Feb 19, 2018
This commit implements a wrapper around new/delete as a struct, Allocator.
This struct caches the last thing that was deleted and uses it for allocation
the next time it is called upon. In our domain, this works because (1) we do all
our new/deletes in pseudo lock step (we never have two 'deletes' without a 'new'
in-between) and (2) our allocations are all statically localised to threads, so
we can get away without any locking in the allocator.

This seems both trivial and over-engineered. However, it makes a significant
practical difference. In my current workhorse, long-running2, this takes the
single-threaded checker from 90 seconds down to 51 and the 4-threaded checker
from 117 seconds down to 85.

Related to GitHub #6 "Multithreaded optimisation"
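A sketch of the one-slot caching allocator this commit message describes (details are illustrative, not the actual implementation; it assumes fixed-size allocations and thread-local use, per the constraints above):

```cpp
#include <cstddef>
#include <new>

// Because new/delete happen in pseudo lock step, a single cached block
// suffices, and because each instance is local to one thread, no locking
// is needed.
struct Allocator {
  void *cached = nullptr; // the most recently freed block, if any

  void *alloc(std::size_t size) {
    if (cached != nullptr) {
      void *p = cached; // reuse the last freed block, skipping the heap
      cached = nullptr;
      return p;
    }
    return ::operator new(size);
  }

  void free(void *p) {
    // The lock-step pattern guarantees the cache slot is empty here.
    cached = p;
  }

  ~Allocator() {
    if (cached != nullptr)
      ::operator delete(cached);
  }
};
```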
Smattr added a commit that referenced this issue Feb 20, 2018
This was previously introduced when the multithreaded queue was built on
atomics. It's now built on std::queue, so we have no need for the manual
linked-list.

Though it was not the purpose of this commit, this change also noticeably
improves performance. Our current workhorse, long-running2, is
down from 51 seconds single-threaded to 46 seconds. With 4 threads, it drops
from 85 seconds to 68 seconds.

Related to GitHub #6 "Multithreaded optimisation"
Smattr added a commit that referenced this issue Feb 27, 2018
Instead of declaring a large array when the user has asked for a fixed state
set, we now call calloc on start up to allocate this array. This has a few
advantages:

  1. Large callocs are usually implemented as mmaps that get zeroed pages
     directly from the OS. In other words, using calloc makes it clearer to the
     OS what we are doing and it should be more able to optimise its operations.
  2. The previous implementation resulted in an extremely large BSS section when
     using a large hash set size. This does not inflate the binary but causes
     problems when references into this section require more than 32-bit
     addressing. On Linux this necessitated moving to one of the larger memory
     models for linking, which is undesirable.
  3. The previous implementation constructed all the StateBases in the hash set
     on start up. Although StateBase recently gained a trivial default
     constructor, this operation was still a waste of time as we would later
     overwrite all of this. Compounding the fact that it wasted runtime, this
     also could (in fact, did) result in touching numerous pages on start up,
     causing the OS to immediately give us a lot of memory.

The changes in this commit have one significant wart noted in the FIXME comment.
We should address this in the near future.

Closes GitHub #7 "Fixed set capacity requires 64-bit addressing"
Related to GitHub #6 "Multithreaded optimisation"
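A sketch of the calloc-based start-up allocation described above (StateBase here is a stand-in for the real type, and the function name is illustrative):

```cpp
#include <cstdio>
#include <cstdlib>

struct StateBase { unsigned char data[16]; }; // stand-in for the real type

static StateBase *seen_set = nullptr;

static void allocate_seen_set(std::size_t set_capacity) {
  // A large calloc is typically serviced by mmap, handing us zero-on-demand
  // pages: nothing is constructed or touched until the checker writes to it,
  // and nothing lands in the binary's BSS section.
  seen_set = static_cast<StateBase *>(
      std::calloc(set_capacity, sizeof(StateBase)));
  if (seen_set == nullptr) {
    std::fprintf(stderr, "failed to allocate state set\n");
    std::exit(EXIT_FAILURE);
  }
}
```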
Smattr (Owner) commented Mar 3, 2018

Another data point... as of fc6f707 we can comfortably beat CMurphi on long-running2 with a 4GB state set:

  • CMurphi -m 4096: 38 seconds
  • Rumur --threads=1 --set-capacity=4294967296: 22 seconds
  • Rumur --threads=4 --set-capacity=4294967296: 39 seconds

It's not strictly an apples-to-apples comparison here:

  1. We haven't implemented the "undefined" semantics yet.
  2. CMurphi's using a fixed size queue, while we're using a dynamic one.

Our multithreaded implementation is still pretty naive. I believe we could go significantly faster here. In particular, for this model we could probably use something like double-checked locking with cmpxchg on the state set entries, given that they are <128 bits.
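A sketch of that idea, assuming a state fits in one 128-bit slot and the all-zero bit pattern can be reserved to mean "empty" (the 128-bit CAS compiles to cmpxchg16b on x86-64 and may need -mcx16 or libatomic):

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

// A state squeezed into 128 bits; all-zero is reserved for "empty".
struct Entry { std::uint64_t lo, hi; };

static bool is_empty(const Entry &e) { return e.lo == 0 && e.hi == 0; }
static bool same(const Entry &a, const Entry &b) {
  return a.lo == b.lo && a.hi == b.hi;
}

// Returns true if `state` was newly inserted into the open-addressed table.
bool insert(std::vector<std::atomic<Entry>> &slots, Entry state) {
  std::size_t i = (state.lo ^ state.hi) % slots.size();
  for (;;) {
    // First check: an ordinary load, no locked instruction on the bus.
    Entry cur = slots[i].load(std::memory_order_relaxed);
    if (same(cur, state))
      return false; // already seen
    if (is_empty(cur)) {
      // Second check: claim the empty slot with a 128-bit CAS.
      Entry empty = {0, 0};
      if (slots[i].compare_exchange_strong(empty, state))
        return true;
      if (same(empty, state))
        return false; // a racing thread inserted this same state first
      continue;       // the slot now holds some other state; re-examine it
    }
    i = (i + 1) % slots.size(); // linear probe past an occupied slot
  }
}
```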

Smattr (Owner) commented May 15, 2018

Interestingly, the latest results from the C checker in branch "c-checker2" (commit f67cf8c) are more in line with the results prior to implementing static allocation (the 1164 seconds and 361 seconds quoted above). The figure for the C checker is around 453 seconds. This is with a single thread and none of the optimisation tricks we've played in the C++ checker. My hope is that once we implement multithreading, the C checker will beat the C++ checker on this trivial example and easily trounce it on more complicated models.

Smattr (Owner) commented May 24, 2018

As noted in 5637524, we hit 79 seconds in "c-checker2" now. Although this is still not beating the C++ checker, this looks like a promising approach for multithreaded scalability.

Smattr (Owner) commented Feb 7, 2019

We comfortably outperform CMurphi and have done so for some time. Closing as done.

Smattr closed this as completed Feb 7, 2019