
Performance of Silicon on long running examples (e.g. Nagini)


Background

The benchmarks were motivated by Silicon's poor performance on examples generated by Nagini. They compare several versions of Silicon that try different optimisations/heuristics, in different combinations and with different settings, e.g. timeout values.

Document outline

  1. Overview
  2. Results
  3. Observations and Interpretation

Overview

  • Test files were taken from the Viper test suite, from examples provided by Marco (Nagini, SCION) and from examples provided by Vytautas (Chalice obligations, Rust)
  • Only tests for which the current Silicon configuration (i.e. the baseline configuration) needs at least one second to verify were considered; 213 files in total. The test files were selected based on a single run of Silicon on my office computer; the benchmarks themselves were afterwards performed on a different computer (Vytautas prepared a "benchmarking laptop" for me).
  • Each file was verified five times and the average runtime was recorded
  • Silicon was run in sequential mode, i.e. without any kind of internal parallelisation
  • All Silicon versions were run via Nailgun
  • All benchmarks were performed using the Viper Runner tool

The overall benchmark consists of three groups:

  • In group A I experimented with the path feasibility checks that Silicon makes, i.e.: before branching over a condition b, Silicon queries Z3 to find out whether b is known to be true or false; if so, the corresponding infeasible branch isn't explored (a sketch of such a check follows the note below). Group A contains the following five Silicon configurations (in the terminology of the Viper Runner tool):

    | Configuration | Description |
    | --- | --- |
    | base_cT250 | The baseline configuration of Silicon. Path feasibility checks are always made, with a timeout of 250ms. |
    | base_cT10 | Baseline configuration, but with a path feasibility check timeout of 10ms |
    | skip_1 | Don't check path feasibility if the branching is due to short-circuiting evaluation of conjunctions (b1 && b2 is treated as if it were b1 && (b1 ==> b2)) |
    | skip_2 | Don't check path feasibility if the condition contains a quantified variable (affects, e.g., forall x :: x in xs ==> ...) |
    | skip_3 | Combination of skip_1 and skip_2 |

Note: I observed early on that reducing the path feasibility check timeout to 10ms (base_cT10) consistently performs better than a timeout of 250ms (base_cT250). Thus, all non-base configurations, i.e. skip_1/2/3 and those of groups B and C, used a path feasibility check timeout of 10ms.
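
The following sketch illustrates the idea behind such a check. It uses Z3's Python bindings and made-up names purely for illustration; Silicon itself talks to Z3 through its own Scala/SMT-LIB layer:

```python
from z3 import Solver, Not, Int, unsat

def feasible_branches(solver, cond, timeout_ms=10):
    """Return (then_feasible, else_feasible) for branching over cond."""
    solver.set("timeout", timeout_ms)

    # The then-branch is infeasible iff path-condition && cond is unsat.
    solver.push()
    solver.add(cond)
    then_feasible = solver.check() != unsat
    solver.pop()

    # The else-branch is infeasible iff path-condition && !cond is unsat.
    solver.push()
    solver.add(Not(cond))
    else_feasible = solver.check() != unsat
    solver.pop()

    return then_feasible, else_feasible

s = Solver()
x = Int("x")
s.add(x > 0)                        # path condition collected so far
print(feasible_branches(s, x < 0))  # (False, True): then-branch is pruned
```

Note that a check that times out returns unknown, which is treated like sat: the branch is explored anyway. A short timeout therefore only risks exploring an infeasible path, which presumably is why base_cT10 beats base_cT250 so consistently.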

  • In group B I experimented with performing additional Z3 check-sats in order to saturate Z3's state, i.e. to get Z3 to learn additional lemmas. The underlying idea is that Z3 could learn certain lemmas early, e.g. because they follow from domain definitions and are independent of any Viper state, which would avoid situations in which Z3 relearns the same lemmas again and again in subsequent push-pop scopes (a sketch of such a saturating check-sat follows the group C table below). Group B contains the following 10 Silicon configurations:

    | Configuration | Description |
    | --- | --- |
    | base_cT250 | As before |
    | base_cT10 | As before |
    | pre500 | Perform a saturating check-sat after the program preamble (Silicon's internal preamble plus domain definitions from the program to verify) has been emitted to Z3. The check-sat is made with a timeout of 500ms. |
    | pre250 | Analogous, but with a timeout of 250ms |
    | pre100 | Analogous, but with a timeout of 100ms |
    | pre10 | Analogous, but with a timeout of 10ms |
    | many500 | Perform saturating check-sats at the following places: (1) after the preamble, (2) after inhaling a contract, (3) after an unfold, (4) after an inhale, (5) before Silicon repeatedly queries Z3 in an iteration, e.g. as done in the QP consume algorithm or as part of a state consolidation. Each check is made with a different timeout: the base timeout of 500ms is divided by a per-place reducer (1, 2, 2.5, 5 and 50, respectively), and an effective minimal timeout of 10ms is enforced. Examples: after the preamble, reducer 1, effective timeout 500ms; after an unfold, reducer 2.5, effective timeout 200ms. |
    | many250 | Analogous, but with a base timeout of 250ms |
    | many100 | Analogous, but with a base timeout of 100ms |
    | many10 | Analogous, but with a base timeout of 10ms |
  • In group C I combined the best configurations from groups A and B. Group C thus represents the final run of the benchmark and its results determine which configuration will be the next default configuration of Silicon. Group C contains the following 9 Silicon configurations:

    | Configuration | Description |
    | --- | --- |
    | base_cT250 | As before |
    | base_cT10 | As before |
    | skip3 | As before |
    | pre10 | As before |
    | many100 | As before |
    | many10 | As before |
    | skip3+pre10 | Combination of skip3 and pre10 |
    | skip3+many100 | Combination of skip3 and many100 |
    | skip3+many10 | Combination of skip3 and many10 |
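
The sketch announced in the group B description: a saturating check-sat, together with the effective-timeout computation of the many* configurations. Again, this uses Z3's Python bindings and made-up names purely for illustration; the reducer values are those listed above, everything else is an assumption:

```python
from z3 import Solver, Function, ForAll, Int, IntSort

def saturate(solver, timeout_ms):
    # The result of the check-sat is ignored; the call is made only so that
    # Z3 instantiates quantifiers and learns lemmas from the asserted axioms.
    solver.set("timeout", timeout_ms)
    solver.check()

# Reducers of the many* configurations, one per saturation location.
REDUCERS = {"preamble": 1, "contract": 2, "unfold": 2.5, "inhale": 5,
            "repeated-query": 50}

def effective_timeout(base_ms, location):
    # The base timeout divided by the location's reducer, floored at 10ms.
    return max(10, int(base_ms / REDUCERS[location]))

s = Solver()
x = Int("x")
f = Function("f", IntSort(), IntSort())   # stand-in for a domain function
s.add(ForAll([x], f(x) >= 0))             # stand-in for a domain axiom

saturate(s, effective_timeout(500, "preamble"))        # many500: 500ms
saturate(s, effective_timeout(500, "unfold"))          # many500: 200ms
saturate(s, effective_timeout(100, "repeated-query"))  # many100: floored to 10ms
```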

Results

Group A

The results can be found in this Excel file.

The first block of data in the Excel file contains the average runtime per Silicon configuration and test file, ordered by the runtime of the baseline configuration:

group-A-01.png

The second block of data, columns Min [s] and Config, lists the minimum verification time per test file and the Silicon configuration that yielded that minimum, respectively:

group-A-02.png

The third block of data compares "new" configurations (e.g. optimised ones) to "old" configurations (typically the baseline). Each comparison consists of two columns: the absolute difference (e.g. cT10-cT250 [s]) and the relative difference (e.g. cT10/cT250 [%diff]). Negative values (green) mean that the new configuration is faster; positive (red) values mean the opposite.
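
To illustrate with made-up numbers (not taken from the Excel file), assuming the columns are computed in the obvious way: if cT250 needs 4.0s on a file and cT10 needs 3.0s, then cT10-cT250 [s] = 3.0 - 4.0 = -1.0 and cT10/cT250 [%diff] = 100 * (3.0 - 4.0) / 4.0 = -25%.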

group-A-03.png

The Excel file also contains a strip of summaries below the main data (scroll down to row 213) whose structure is analogous to the one described above, i.e. it is also divided into three blocks. The first block comprises, per configuration, the sum, minimum, maximum, median and average of the recorded runtimes. The second block lists how often (column Wins [#]) a particular configuration (column Config) was the fastest one; the sum of all wins thus equals the number of test files, modulo tests that timed out. The third block is similar to the first, but computes sum, minimum, etc. over the absolute and relative differences between configurations.

group-A-04.png

Group B

The results can be found in this Excel file. The file has the same structure as that of group A.

Group C

The results can be found in this Excel file.

The file has the same structure as that of group A, except that the block listing which configuration was fastest for how many files has been extended to provide the overview needed to decide which configuration should become the next default Silicon configuration.

The block has been extended as follows: in addition to the total number of wins per configuration (column Total [#]), it also contains wins per duration bin, e.g. how often a configuration was fastest on files that verified in between two and five seconds (column Wins 2-5s [#]). Moreover, the block contains sums, averages, maxima, etc. per configuration across all test files.

group-C-01.png

Observations and Interpretation

  • Any new configuration is better than the current Silicon configuration (cT250)
  • skip3+many100 is overall the fastest configuration: 610s (cT250: 850s)
  • skip3 wins for "fast" tests (up to five seconds)
  • skip3+many100 and skip3+many10 win for "slow" tests (of which we don't have that many, though)
  • The lead of skip3 over skip3+many100/skip3+many10 is negligible (always less than 500ms), whereas the lead in the opposite direction can be very substantial

In conclusion, it seems that

  • skip3+many100 should be implemented as the next default Silicon configuration
  • the default path feasibility check timeout should be 10ms
  • it is worthwhile to allow users (e.g. via command-line options) to adjust the timeouts used for the various Z3 saturation calls
  • it is not necessary to allow users to select when to skip path feasibility checks, i.e. to choose between skip1/2/3