Multithreaded optimisation #6
This is part of a movement towards better multithreaded performance. Related to GitHub #6 "Multithreaded optimisation"
Surprisingly, this doesn't have too much of an effect on execution time in my current example. We drop from 314 seconds to 264 seconds. I haven't profiled to see whether execution time is dominated by set insertion, but perhaps this is the case. Related to GitHub #6 "Multithreaded optimisation"
This is recommended in cases where you need to loop when using the weak version. It also seems to yield slightly better performance at first glance. Related to GitHub #6 "Multithreaded optimisation"
Related to GitHub #6 "Multithreaded optimisation"
This is another data point for this implementation, but it still performs terribly. Related to GitHub #6 "Multithreaded optimisation"
In the current tip (90d1347) on the same 4-thread platform we get 1164 seconds for checking long-running2, with pretty high utilisation across all threads. Using a single mutex-protected set (96703a3) gives us 361 seconds with lower utilisation (typically only 2 threads). I suspect the latter is due to a superior hash set implementation, i.e. doing less work. Obviously both of these results are pretty poor. We need to come up with a better algorithm here.
This commit implements a wrapper around new/delete as the struct Allocator. This struct caches the last thing that was deleted and reuses it the next time an allocation is requested. In our domain, this works because (1) we do all our news and deletes in pseudo lock step (we never have two deletes without a new in between) and (2) our allocations are all statically localised to threads, so we can get away without any locking in the allocator. This seems both trivial and over-engineered. However, it makes a significant practical difference. On my current workhorse, long-running2, this takes the single-threaded checker from 90 seconds down to 51 and the 4-threaded checker from 117 seconds down to 85. Related to GitHub #6 "Multithreaded optimisation"
This was previously introduced when the multithreaded queue was built on atomics. It is now built on std::queue, so we have no need for the manual linked list. Though it was not the purpose of this commit, this change yields a noticeable performance improvement as well. Our current workhorse, long-running2, drops from 51 seconds single-threaded to 46 seconds. With 4 threads, it drops from 85 seconds to 68 seconds. Related to GitHub #6 "Multithreaded optimisation"
Instead of declaring a large array when the user has asked for a fixed state set, we now call calloc on start up to allocate this array. This has a few advantages:

1. Large callocs are usually implemented as mmaps that get zeroed pages directly from the OS. In other words, using calloc makes it clearer to the OS what we are doing, and it should be better able to optimise its operations.
2. The previous implementation resulted in an extremely large BSS section when using a large hash set size. This does not inflate the binary, but it causes problems when references into this section require more than 32-bit addressing. On Linux this necessitated moving to one of the larger memory models for linking, which is undesirable.
3. The previous implementation constructed all the StateBases in the hash set on start up. Although StateBase recently gained a trivial default constructor, this operation was still a waste of time as we would later overwrite all of it. Compounding the wasted runtime, this could (and in fact did) result in touching numerous pages on start up, causing the OS to immediately give us a lot of memory.

The changes in this commit have one significant wart, noted in the FIXME comment. We should address this in the near future. Closes GitHub #7 "Fixed set capacity requires 64-bit addressing". Related to GitHub #6 "Multithreaded optimisation"
Another data point... as of fc6f707 we can comfortably beat CMurphi on long-running2 with a 4GB state set:
It's not strictly an apples-to-apples comparison here:
Our multithreaded implementation is still pretty naive. I believe we could go significantly faster here. In particular, for this model we could probably use something like double-checked locking with cmpxchg on the state set entries, given they are <128 bits.
Interestingly, the latest results from the C checker in branch "c-checker2" (commit f67cf8c) are more in line with the results prior to implementing static allocation (the 1164 seconds and 361 seconds quoted above). The figure for the C checker is around 453 seconds. This is with a single thread and none of the optimisation tricks we've played in the C++ checker. My hope is that once we implement multithreading, the C checker will beat the C++ checker on this trivial example and easily trounce it on more complicated models.
As noted in 5637524, we hit 79 seconds in "c-checker2" now. Although this is still not beating the C++ checker, this looks like a promising approach for multithreaded scalability.
We comfortably outperform CMurphi and have done so for some time. Closing as done.
The current multithreaded checker performs pretty poorly. Checking tests/long-running2/model.m on a sample platform with 4 threads:
This is not surprising as I expected the initial implementation to be pretty terrible. The multithreaded checker uses a single mutex-protected queue and a single mutex-protected seen set. Execution is completely dominated by contention on these two mutexes.
For the queue, one possibility would be to switch to per-thread, mutex-protected queues with work stealing. A second option would be a multi-headed, lock-free queue. The multithreaded checker does not need to pop the queued states in any particular order, which gives us a lot of freedom.
For the seen set, it should be possible to implement a thread-safe, lock-free set.