Optimize proof performance #625

senier · 2021-03-29T14:50:01Z

When using message types, where inner messages gets unfolded into the outer message, the number of paths in a message raise quickly which leads to impractical verification times (e.g. after converting the SPDM spec from refinements to message types, verification took ~6-7 minutes).

The reason for the poor performance is that we make no attempt to parallelize verification even though proofs over different paths are independent.

Implement parallel Z3 proofs
Implement parallel checking of messages

The following improvement is being pursued:

Parallelize Z3 proofs

We'll conduct a couple of experiments to identify a setup that allows for parallelization of Z3 proofs. Multiprocessing cannot be used as Z3 objects cannot be pickled. Activating Z3's own parallelization (parallel.enable) will make RecordFlux hang indefinitely. Even if it worked, parallelization a single proof has much less potential than parallelizing multiple proofs over different paths (also, I never observed a situation where Z3's parallelization sped up anything in previous experiments - it probably only helps when using specific tactics.).

This leaves us with concurrent.futures.ThreadPoolExecutor, i.e. real multithreading. A quick experiment using __prove_field_positions, which consumes most time during verification, was unsuccessful. When parallelizing using ThreadPoolExecutor RecordFlux crashes (with Segfault or in other strange ways). The reason is probably in using only default contexts in Z3 (source):

Z3Py uses a default global context. For most applications this is sufficient. An application may use multiple Z3 contexts. Objects created in one context cannot be used in another one. However, several objects may be "translated" from one context to another. It is not safe to access Z3 objects from multiple threads. The only exception is the method interrupt() that can be used to interrupt() a long computation.

In a separate experiment using only the Z3 Python API we'll need to confirm whether parallelization using multiple contexts is possible at all. If so, our Proof class needs to be changed to create a new context per proof session. For this to work, the z3expr needs to be extended with a parameter for the context to create the object in (recursively).

Parallelize checking of messages

An additional cheap improvement is to parallelize the verification of messages itself.

Previous topics

~~1. Move from unbounded integers (z3.Int) to bit vectors (z3.BitVec).~~ (dropped as this is very complicated and requires explicit overflow handling)
~~2. Fix the logic used to QF_BV (#612)~~ (not pursued anymore, see last comment in #612)

Use SMTlib instead of the Z3 binding (which should allow us to use specialized provers like boolector) (Unclear how much this would help - we may investigate this later)

The text was updated successfully, but these errors were encountered:

Closes #625 Closes #612

senier · 2021-08-13T14:16:55Z

Z3 can indeed be parallelized using ThreadPoolExecutor. The crashes go away when a dedicated context is used per solver:

def solve(e):
    context = z3.Context()
    left, right = e
    s = z3.SolverFor("QF_LIA", ctx=context)
    s.append (z3.IntVal(left, ctx=context) > z3.IntVal(right, ctx=context))
    return (left, right, s.check())

The bad new is that I get a speedup of 2 at most, no matter how many cores are used (tests with 2..16). I'll try to create the context and/or solver only once per thread. Maybe the context or solver creation accounts for most of the proof time...

treiher · 2021-08-13T14:55:42Z

Could this be caused by Python's Global Interpreter Lock? You could try to use ProcessPoolExecutor instead.

treiher · 2021-08-13T15:07:36Z

Never mind, I just read the explanation above that Z3 contexts are not picklable. If Z3 doesn't provide a way to spawn separate processes on its own, I don't see how we could improve performance by parallelizing proofs.

senier · 2021-08-16T11:23:08Z

Here are the conclusions of my experiment:

ThreadPoolExecutor worked, but expectedly gave a constant speedup of ~2 due to the GIL.
ProcessPoolExecutor also worked and had no problems serializing z3 expressions. This is unexpected as multiprocessing failed and ProcessPoolExecutor is just a frontend to it.
With ProcessPoolExecutor I achieve a speedup of ~8, but more than 13 cores give no improvement
As @jklmnn assumed, we don't need specific contexts when using processes, which will make the changes in this ticket much more lightweight.

These results look promising enough to try them in RecordFlux now.

Ref. #625