Multi-Tool Triangulation for Formal Specification Validation
Apart Research Hackathon — Secure Program Synthesis — May 2026 — Track 2: Specification Validation
Formal verification has a triage problem: engineers face a bewildering array of tools (Tamarin, ProVerif, TLA+, Lean, Dafny, Kani) with no systematic guidance on which to use — and no way to tell if their specification actually captures what they intended.
Spec Triangulator addresses this with two components:
-
Classifier — Routes natural-language system descriptions to the right formal tool pair. 93.8% accuracy, 79.8% multi-tool recall on 48 real-world examples. Zero fine-tuning.
-
Triangulation Engine — Runs TLC (explicit-state model checking) and Z3 (SMT inductive invariant proof) on the same TLA+ specification and compares verdicts. Disagreement surfaces specification gaps.
Two tools with different mathematical foundations checking the same property:
| Verdict | Meaning |
|---|---|
| Both VERIFIED | High confidence — reachability + inductive proof agree |
| TLC ✅ Z3 ❌ | Invariant not inductive — spec needs strengthening |
| TLC ❌ Z3 ✅ | State space too large for TLC — inductive proof scales |
| Both ❌ | Genuine transition bug — fix before proceeding |
We validate this with 40 controlled experiments covering all four quadrants.
| Metric | Value |
|---|---|
| Classifier accuracy | 93.8% (45/48) |
| Multi-tool recall | 79.8% |
| Specs triangulated | 10 real TLA+ specs |
| TLC states checked | 117,278 |
| Z3 inductive proofs | 10/10 UNSAT in <50ms |
| Confusion matrix cases | 40/40 in correct quadrant |
spec-triangulator/
├── src/
│ ├── tool_runner.py # TLC + Z3 triangulation engine (self-contained)
│ ├── confusion_matrix.py # 40-case confusion matrix experiments
│ ├── eval.py # Classifier evaluation pipeline
│ └── build_dataset.py # Dataset construction
├── data/
│ ├── dataset.json # 48 labeled examples with spec file paths
│ └── specs_manifest.json # Spec URL/path mapping for all 48 examples
├── results/
│ ├── triangulation_results.json # 10-spec triangulation output
│ └── confusion_matrix_results.json # 40-case confusion matrix output
├── docs/
│ └── confusion_matrix.md # Detailed explanation of all 4 quadrants
├── notebooks/
│ └── spec_triangulator.ipynb # Kaggle-ready notebook
├── requirements.txt
└── README.md
pip install z3-solver groqTLC (tla2tools.jar) is downloaded automatically by tool_runner.py on first run.
cd src
python tool_runner.py
# Output: results/triangulation_results.jsoncd src
python confusion_matrix.py
# Output: results/confusion_matrix_results.jsonexport GROQ_API_KEY=your_key_here
cd src
python eval.pyUpload notebooks/spec_triangulator.ipynb to Kaggle. Add your GROQ_API_KEY as a Kaggle Secret. The notebook installs all dependencies and runs the full pipeline.
TLC and Z3 verify the same property by completely different means:
TLC performs breadth-first search over all reachable states from Init, checking TypeOK in every state:
∀ s reachable from Init: TypeOK(s)
Z3 proves the invariant is inductive — preserved by every possible transition:
Check: TypeOK(s) ∧ Next(s,s') → TypeOK(s')
i.e., ¬(TypeOK(s) ∧ Next(s,s') ∧ ¬TypeOK(s')) must be UNSAT
TLC only considers reachable states. Z3 considers all mathematically possible pre-states. This difference is what makes disagreement informative.
TLC checks 1,146 reachable states — all satisfy TPTypeOK. ✅
Z3 checks inductiveness of TypeOK ∧ TCConsistent ∧ ProtocolInv:
- Without
ProtocolInv(rm_committed → tm_committed): Z3 finds spurious pre-staterm=committed, tm=init— unreachable fromTPInitbut not ruled out inductively. ❌ → Q2: spec needs strengthening - With full invariant: UNSAT — invariant is inductive. ✅ → Q1: high confidence
48 real formally-verified systems across 4 categories, sourced from published papers:
| Category | Count | Tools | Example Systems |
|---|---|---|---|
| Protocol Security | 12 | Tamarin + ProVerif | TLS 1.3, 5G AKA, Needham-Schroeder |
| State-based / Concurrent | 11 | TLC + Z3 | Etcd Raft, Paxos, Two-Phase Commit |
| Language-specific | 12 | Verus + Kani | Rust Vec::push, Binary Search (Dafny) |
| Foundational ITP | 13 | Lean + Rocq | Merge Sort (Mathlib), Prime Number Theorem |
27/48 examples have local spec file paths. All labels sourced from original papers — not generated by LLMs.
Sources: SysMoBench, TLA+ Examples repository, TLS13Tamarin, AWS Rust Std verification, Mathlib4, FVAPPS, Vericoding
| Z3 = VERIFIED | Z3 = COUNTEREXAMPLE | |
|---|---|---|
| TLC = VERIFIED | Q1: 10/10 — both agree ✅ | Q2: 10/10 — invariant not inductive |
| TLC = TIMEOUT/CE | Q3: 10/10 — Z3 scales ✅ | Q4: 10/10 — genuine bugs |
Q2 pattern: Correct invariant on reachable states, but Z3 finds a spurious pre-state the invariant doesn't inductively exclude. Fix: strengthen the invariant with protocol-level clauses.
Q3 pattern: Readers-Writers at N≥6 (3^N states) times out TLC in 8s. Z3 inductive proof: O(N²) clauses, <100ms. Dining Philosophers N≥14 shows the same pattern.
Q4 pattern: Real transition bugs — missing preconditions. Both TLC (finds trace) and Z3 (finds non-inductive step) independently detect the fault.
All TLA+ specs from github.com/tlaplus/Examples (Apache 2.0):
| Spec | File | States |
|---|---|---|
| Two-Phase Commit | transaction_commit/TwoPhase.tla |
1,146 |
| Paxos | SimplifiedFastPaxos/Paxos.tla |
55,079 |
| Dining Philosophers | DiningPhilosophers/DiningPhilosophers.tla |
58 |
| Die Hard | DieHard/DieHard.tla |
97 |
| Game of Life | GameOfLife/GameOfLife.tla |
1,024 |
| Coffee Can | CoffeeCan/CoffeeCan.tla |
74 |
| Readers-Writers | ReadersWriters/ReadersWriters.tla |
59,674 |
| Transaction Commit | transaction_commit/TCommit.tla |
94 |
| Async Channel | SpecifyingSystems/AsynchronousInterface/Channel.tla |
16 |
| Async Interface | SpecifyingSystems/AsynchronousInterface/AsynchInterface.tla |
16 |
- Basin et al. (2018). A Formal Analysis of 5G Authentication. CCS 2018.
- Cremers et al. (2017). A Comprehensive Symbolic Analysis of TLS 1.3. CCS 2017.
- Dong et al. (2024). SysMoBench. arXiv 2024.
- Lamport (2002). Specifying Systems. Addison-Wesley.
- de Moura & Bjørner (2008). Z3: An Efficient SMT Solver. TACAS 2008.
- Newcombe et al. (2015). How AWS Uses Formal Methods. CACM 58(4).
MIT License. TLA+ spec files retain their original Apache 2.0 license from the TLA+ Examples repository.