Fix async scheduler starvation #9416
Conversation
Is there evidence that these changes have an effect where we were seeing indications of starvation? If so, is that evidence attributable to particular changes, or only in aggregate?
Trace logs @deepthiskumar and I have collected indicate there are long async scheduler jobs in the sections of code that this PR addresses. Specifically, there are long async jobs during staged ledger diff application and during snark pool frontier diff processing, which are the areas of code this PR splits up and optimizes, respectively. We are planning to deploy this to our seed nodes to confirm whether these changes have enough impact, or whether more work is needed to break up additional async jobs (these were the most egregious long async jobs, but there are others we may need to address as well).
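The job-splitting described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `apply_diffs`, `process_diff`, and the batch size of 100 are all hypothetical names and values chosen for the example.

```ocaml
(* Sketch: break one long async job into many short ones by handing
   control back to the Async scheduler between batches of work.
   Without the periodic yield, the whole traversal runs as a single
   scheduler job and starves other jobs, such as the RPC server. *)
open Async

let apply_diffs diffs ~process_diff =
  Deferred.List.iteri ~how:`Sequential diffs ~f:(fun i diff ->
      (* Hypothetical batch size: yield every 100 items. *)
      let%bind () =
        if i mod 100 = 0 then Scheduler.yield () else return ()
      in
      process_diff diff)
```

The yield cadence is a trade-off: yielding too often adds scheduling overhead, while yielding too rarely leaves other jobs starved for the duration of a batch.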
Looks good, added a few comments. Could you also annotate Parallel_scan.update_metrics with trace?
let%bind () = yield_result () in
Call yield after the verify_scan_state_after_apply below? That is also taking a significant amount of time. We should probably also add a yield call in Transaction_snark_scan_state.scan_statement in the next iteration.
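The suggestion above could be acted on with a small wrapper along these lines. This is a hedged sketch: `with_yields` is an invented name, and `check` stands in for the expensive call such as verify_scan_state_after_apply.

```ocaml
open Async

(* Yield before and after an expensive check so that neither the work
   leading up to it nor the check itself monopolizes the scheduler. *)
let with_yields check =
  let%bind () = Scheduler.yield () in
  let%bind result = check () in
  let%map () = Scheduler.yield () in
  result
```

Yielding on both sides of the call ensures other pending jobs get a turn even when the check is invoked at the end of an already long job.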
...
verify_scan_state_after_apply (skip=false) took $time_elapsed 3237.1082305908203
verify_scan_state_after_apply (skip=false) took $time_elapsed 3296.386480331421
verify_scan_state_after_apply (skip=false) took $time_elapsed 3375.446557998657
verify_scan_state_after_apply (skip=false) took $time_elapsed 2469.630241394043
verify_scan_state_after_apply (skip=false) took $time_elapsed 2484.107494354248
...
I added some yields to Statement_scanner.check_invariants that should help break this up.
src/lib/transaction_snark_scan_state/transaction_snark_scan_state.ml
Force-pushed from a8eb60c to dffb978
This PR has proven effective at fixing the rest server responsiveness issues (we have tested builds of this on mainnet and devnet via our seeds, as well as through a partner). I also ran a test to measure whether this causes an increase in block validation acceptance time, which would in turn negatively impact block gossip consistency. I implemented a metric to measure this (in a separate PR) and deployed two builds to our seed nodes: one with this change and one without. The results show that, on average, the node running these changes validates blocks faster than the node without them. The Grafana dashboard below shows the metric results from these nodes. Seed 2, shown in green, was running the build with these changes, and the other seed, shown in yellow, was running the build without them.
This PR contains changes intended to address the async scheduler starvation issues that we believe are the root cause of the unresponsive daemon servers.
I recommend reviewing commit-by-commit, as the commits are broken up to represent the pieces of work that were done in this PR.