feat: ECCVM witness generation optimisation #5211

zac-williamson · 2024-03-14T13:04:54Z

This PR modifies the witness generation code for the ECCVM circuit builder.

In our IVC benchmarks, the share of total runtime spent in ECCVMComposer::create_prover has dropped from 10% to less than 1%.

The key changes are multithreading witness generation and removing a substantial number of unnecessary field inversions. The remaining inversions are now batched through field_t::batch_invert.
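For readers unfamiliar with batch inversion, here is a minimal sketch of the standard technique (Montgomery's trick), which trades n field inversions for one inversion plus roughly 3n multiplications. It is not the barretenberg implementation: the toy modulus `P` and the helpers `mul_mod`, `invert`, and this standalone `batch_invert` are assumptions made for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy prime modulus (assumption: chosen so products fit in 64 bits; real field types are much larger).
constexpr uint64_t P = 1'000'000'007ULL;

// Modular multiplication; inputs must already be reduced below P.
uint64_t mul_mod(uint64_t a, uint64_t b)
{
    return (a * b) % P;
}

// Single inversion via Fermat's little theorem: a^(P-2) mod P.
// This is the expensive operation that batching avoids repeating.
uint64_t invert(uint64_t a)
{
    uint64_t result = 1;
    uint64_t base = a % P;
    uint64_t exp = P - 2;
    while (exp > 0) {
        if (exp & 1) {
            result = mul_mod(result, base);
        }
        base = mul_mod(base, base);
        exp >>= 1;
    }
    return result;
}

// Montgomery's trick: inverts every element in place using exactly one call to invert().
// All inputs must be nonzero and reduced below P.
void batch_invert(std::vector<uint64_t>& values)
{
    const size_t n = values.size();
    if (n == 0) {
        return;
    }
    // prefix[i] holds the product values[0] * ... * values[i-1] (empty product = 1).
    std::vector<uint64_t> prefix(n);
    uint64_t acc = 1;
    for (size_t i = 0; i < n; ++i) {
        assert(values[i] != 0 && values[i] < P);
        prefix[i] = acc;
        acc = mul_mod(acc, values[i]);
    }
    // acc is now the product of all elements; invert it once.
    uint64_t inv = invert(acc);
    // Walk backwards, peeling one factor off the running inverse at each step.
    for (size_t i = n; i-- > 0;) {
        const uint64_t original = values[i];
        values[i] = mul_mod(inv, prefix[i]); // (v0..vi)^-1 * (v0..v(i-1)) = vi^-1
        inv = mul_mod(inv, original);        // becomes (v0..v(i-1))^-1
    }
}
```

Calling `batch_invert` on a vector such as `{2, 3, 5}` leaves each slot holding the modular inverse of its original value. The saving comes from multiplications being much cheaper than inversions, so the benefit grows with the number of elements inverted together.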

Benchmarking lock created at ~/BENCHMARK_IN_PROGRESS.
client_ivc_bench                                                                                                                                                                  100%   15MB  47.2MB/s   00:00    
2024-03-18T10:50:07+00:00
Running ./client_ivc_bench
Run on (16 X 3631.57 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 1024 KiB (x8)
  L3 Unified 36608 KiB (x1)
Load Average: 1.16, 0.82, 0.33
--------------------------------------------------------------------------------
Benchmark                      Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------
ClientIVCBench/Full/6      23697 ms        18934 ms            1 Decider::construct_proof=1 Decider::construct_proof(t)=755.044M ECCVMComposer::compute_commitment_key=1 ECCVMComposer::compute_commitment_key(t)=3.77177M ECCVMComposer::compute_witness=1 ECCVMComposer::compute_witness(t)=129.434M ECCVMComposer::create_prover=1 ECCVMComposer::create_prover(t)=149.26M ECCVMComposer::create_proving_key=1 ECCVMComposer::create_proving_key(t)=15.833M ECCVMProver::construct_proof=1 ECCVMProver::construct_proof(t)=1.78177G Goblin::merge=11 Goblin::merge(t)=128.554M GoblinTranslatorCircuitBuilder::constructor=1 GoblinTranslatorCircuitBuilder::constructor(t)=58.2017M GoblinTranslatorComposer::create_prover=1 GoblinTranslatorComposer::create_prover(t)=121.617M GoblinTranslatorProver::construct_proof=1 GoblinTranslatorProver::construct_proof(t)=928.122M ProtoGalaxyProver_::accumulator_update_round=10 ProtoGalaxyProver_::accumulator_update_round(t)=727.574M ProtoGalaxyProver_::combiner_quotient_round=10 ProtoGalaxyProver_::combiner_quotient_round(t)=7.29332G ProtoGalaxyProver_::perturbator_round=10 ProtoGalaxyProver_::perturbator_round(t)=1.32753G ProtoGalaxyProver_::preparation_round=10 ProtoGalaxyProver_::preparation_round(t)=4.16456G ProtogalaxyProver::fold_instances=10 ProtogalaxyProver::fold_instances(t)=13.513G ProverInstance(Circuit&)=11 ProverInstance(Circuit&)(t)=1.96494G batch_mul_with_endomorphism=30 batch_mul_with_endomorphism(t)=567.025M commit=425 commit(t)=4.03553G compute_combiner=10 compute_combiner(t)=7.29114G compute_perturbator=9 compute_perturbator(t)=1.32717G compute_univariate=48 compute_univariate(t)=1.43152G construct_circuits=6 construct_circuits(t)=4.27911G
Benchmarking lock deleted.
client_ivc_bench.json                                                                                                                                                             100% 4027   130.8KB/s   00:00    
function                                        ms     % sum
construct_circuits(t)                         4279    18.12%
ProverInstance(Circuit&)(t)                   1965     8.32%
ProtogalaxyProver::fold_instances(t)         13513    57.21%
Decider::construct_proof(t)                    755     3.20%
ECCVMComposer::create_prover(t)                149     0.63%
GoblinTranslatorComposer::create_prover(t)     122     0.51%
ECCVMProver::construct_proof(t)               1782     7.54%
GoblinTranslatorProver::construct_proof(t)     928     3.93%
Goblin::merge(t)                               129     0.54%

Total time accounted for: 23621ms/23697ms = 99.68%

Major contributors:
function                                        ms    % sum
commit(t)                                     4036   17.08%
compute_combiner(t)                           7291   30.87%
compute_perturbator(t)                        1327    5.62%
compute_univariate(t)                         1432    6.06%

Breakdown of ECCVMProver::create_prover:
ECCVMComposer::compute_witness(t)              129    86.72%
ECCVMComposer::create_proving_key(t)            16    10.61%

Breakdown of ProtogalaxyProver::fold_instances:
ProtoGalaxyProver_::preparation_round(t)           4165    30.82%
ProtoGalaxyProver_::perturbator_round(t)           1328     9.82%
ProtoGalaxyProver_::combiner_quotient_round(t)     7293    53.97%
ProtoGalaxyProver_::accumulator_update_round(t)     728     5.38%
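The percentages above can be re-derived from the raw counters. The sketch below assumes the `(t)` counters are in nanoseconds (so `149.26M` is about 149 ms) and that the "% sum" column is relative to the sum of the listed top-level entries rather than the wall-clock total. Both are inferences from the numbers, and the helpers `percent_of` and `total_ms` are hypothetical names, not part of the benchmark tooling.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Share of `part` in `whole`, expressed as a percentage.
double percent_of(double part, double whole)
{
    assert(whole > 0.0);
    return 100.0 * part / whole;
}

// Sum of the millisecond values in a list of (name, ms) entries.
double total_ms(const std::vector<std::pair<std::string, double>>& entries)
{
    double sum = 0.0;
    for (const auto& entry : entries) {
        sum += entry.second;
    }
    return sum;
}

// Top-level entries from the table, converted from the raw counters to milliseconds.
const std::vector<std::pair<std::string, double>> top_level = {
    { "construct_circuits", 4279.11 },
    { "ProverInstance(Circuit&)", 1964.94 },
    { "ProtogalaxyProver::fold_instances", 13513.0 },
    { "Decider::construct_proof", 755.044 },
    { "ECCVMComposer::create_prover", 149.26 },
    { "GoblinTranslatorComposer::create_prover", 121.617 },
    { "ECCVMProver::construct_proof", 1781.77 },
    { "GoblinTranslatorProver::construct_proof", 928.122 },
    { "Goblin::merge", 128.554 },
};
```

Under those assumptions, `percent_of(149.26, total_ms(top_level))` gives the create_prover share of the accounted time, and `percent_of(129.434, 149.26)` gives the compute_witness share of create_prover.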

AztecBot · 2024-03-14T13:35:06Z

Benchmark results

Metrics with a significant change:

l2_block_processing_time_in_ms (32): 5,736 (+19%)
note_successful_decrypting_time_in_ms (32): 832 (+60%)
note_successful_decrypting_time_in_ms (64): 1,142 (+17%)

Detailed results

All benchmarks are run on txs on the Benchmarking contract on the repository. Each tx consists of a batch call to create_note and increment_balance, which guarantees that each tx has a private call, a nested private call, a public call, and a nested public call, as well as an emitted private note, an unencrypted log, and public storage read and write.

This benchmark source data is available in JSON format on S3 here.

Values are compared against data from master at commit 4d04a7e8 and shown if the difference exceeds 1%.

L2 block published to L1

Each column represents the number of txs on an L2 block published to L1.

Metric	8 txs	32 txs	64 txs
l1_rollup_calldata_size_in_bytes	5,668	18,820	36,356
l1_rollup_calldata_gas	66,364	239,152	469,844
l1_rollup_execution_gas	659,687	941,736	1,318,251
l2_block_processing_time_in_ms	1,261 (-4%)	⚠️ 5,736 (+19%)	8,877 (-2%)
note_successful_decrypting_time_in_ms	180 (+2%)	⚠️ 832 (+60%)	⚠️ 1,142 (+17%)
note_trial_decrypting_time_in_ms	86.1 (+10%)	51.8 (+46%)	59.1 (-46%)
l2_block_building_time_in_ms	18,292 (+1%)	69,348 (+1%)	137,003 (+1%)
l2_block_rollup_simulation_time_in_ms	8,258 (+1%)	29,388 (+2%)	57,218 (+2%)
l2_block_public_tx_process_time_in_ms	10,012 (+1%)	39,898 (+1%)	79,685 (+1%)

L2 chain processing

Each column represents the number of blocks on the L2 chain where each block has 16 txs.

Metric	5 blocks	10 blocks
node_history_sync_time_in_ms	13,983 (-3%)	26,935 (-1%)
note_history_successful_decrypting_time_in_ms	1,279 (+5%)	2,494 (+3%)
note_history_trial_decrypting_time_in_ms	103 (+69%)	179 (+25%)
node_database_size_in_bytes	19,071,056	35,741,776
pxe_database_size_in_bytes	29,859	59,414

Circuits stats

Stats on running time and I/O sizes collected for every circuit run across all benchmarks.

Circuit	circuit_simulation_time_in_ms	circuit_input_size_in_bytes	circuit_output_size_in_bytes
private-kernel-init	281 (+2%)	44,366	28,244
private-kernel-ordering	215	52,868	14,326
base-parity	1,798	128	311
base-rollup	726 (+1%)	165,787	925
root-parity	1,708 (+10%)	1,244	311
root-rollup	68.2	4,487	789
private-kernel-inner	646 (+1%)	73,771	28,244
public-kernel-app-logic	444	35,260	28,215
public-kernel-tail	172 (+1%)	40,926	28,215
merge-rollup	8.70 (+6%)	2,696	925

Tree insertion stats

The duration to insert a fixed batch of leaves into each tree type.

Metric	1 leaves	16 leaves	64 leaves	128 leaves	512 leaves	1024 leaves	2048 leaves	4096 leaves	32 leaves
batch_insert_into_append_only_tree_16_depth_ms	10.0 (+1%)	16.0	N/A	N/A	N/A	N/A	N/A	N/A	N/A
batch_insert_into_append_only_tree_16_depth_hash_count	16.8	31.6	N/A	N/A	N/A	N/A	N/A	N/A	N/A
batch_insert_into_append_only_tree_16_depth_hash_ms	0.585 (+1%)	0.495	N/A	N/A	N/A	N/A	N/A	N/A	N/A
batch_insert_into_append_only_tree_32_depth_ms	N/A	N/A	45.6	72.3	230	444 (+1%)	880 (-1%)	1,720 (-1%)	N/A
batch_insert_into_append_only_tree_32_depth_hash_count	N/A	N/A	96.0	159	543	1,055	2,079	4,127	N/A
batch_insert_into_append_only_tree_32_depth_hash_ms	N/A	N/A	0.469	0.446	0.419	0.416 (+1%)	0.418 (-1%)	0.412 (-1%)	N/A
batch_insert_into_indexed_tree_20_depth_ms	N/A	N/A	53.6 (-2%)	106	334 (-1%)	658 (+1%)	1,312 (-2%)	2,594 (-1%)	N/A
batch_insert_into_indexed_tree_20_depth_hash_count	N/A	N/A	104	207	691	1,363	2,707	5,395	N/A
batch_insert_into_indexed_tree_20_depth_hash_ms	N/A	N/A	0.477 (-2%)	0.480	0.456 (-1%)	0.454 (+1%)	0.455 (-2%)	0.452 (-1%)	N/A
batch_insert_into_indexed_tree_40_depth_ms	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	61.1 (+1%)
batch_insert_into_indexed_tree_40_depth_hash_count	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	109
batch_insert_into_indexed_tree_40_depth_hash_ms	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	0.535 (+1%)

Miscellaneous

Transaction sizes based on how many contract classes are registered in the tx.

Metric	0 registered classes
tx_size_in_bytes	22,012

Transaction processing duration by data writes.

Metric	0 new note hashes	1 new note hashes
tx_pxe_processing_time_ms	3,237 (-2%)	1,750 (+1%)

Metric	0 public data writes	1 public data writes
tx_sequencer_processing_time_ms	12.2 (+6%)	1,240 (+1%)

codygunton · 2024-03-18T21:48:57Z

barretenberg/cpp/src/barretenberg/proof_system/circuit_builder/eccvm/msm_builder.hpp

-                    row.pc = pc;
-                    msm_state.push_back(row);
-                } else {
+                if (j == num_rounds - 1) {


This deep nesting is pretty ick / far from readable.

codygunton

This is a large and complex PR that should be 2-3 PRs with more documentation, but the entire ECCVM needs to be read from scratch anyway, so I'll approve and merge after having sanity checked for a while.

zac-williamson added 2 commits March 14, 2024 11:53

multithreaded witness generation and removed redundant field inversions

21b30c4

removed more redundant eccvm inverses, multithreaded eccvm table precomputation

c345b09

codygunton self-requested a review March 14, 2024 13:25

codygunton assigned zac-williamson Mar 14, 2024

zac-williamson added 2 commits March 15, 2024 17:47

fixed ecc op queue test

5996672

Merge branch 'master' into zw/eccvm-witgen-optimisations

d3a2e60

codygunton changed the title ~~[feat] ECCVM witness generation optimisation~~ feat: ECCVM witness generation optimisation Mar 18, 2024

codygunton added 2 commits March 18, 2024 05:58

Merge branch 'master' into zw/eccvm-witgen-optimisations

47f8ef5

Analysis no longer needed

25d8a3c

codygunton requested changes Mar 18, 2024

codygunton approved these changes Mar 18, 2024

codygunton merged commit 85ac726 into master Mar 18, 2024
97 of 98 checks passed

codygunton deleted the zw/eccvm-witgen-optimisations branch March 18, 2024 21:52

AztecBot mentioned this pull request Mar 18, 2024

chore(master): Release 0.30.0 #5296

Merged

codygunton mentioned this pull request May 1, 2024

Gate count jumps after finalization AztecProtocol/barretenberg#875

Open
