Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix async scheduler starvation #9416

Merged
merged 15 commits into from Oct 7, 2021
Merged

Conversation

nholland94
Copy link
Member

Some changes intended to address some of the async scheduler starvation issues that we believe are the root cause of the unresponsive daemon servers.

I recommend reviewing commit-by-commit, as the commits are broken up to represent the pieces of work that were done in this PR.

@nholland94 nholland94 added the ci-build-me Add this label to trigger a circle+buildkite build for this branch label Aug 30, 2021
@psteckler
Copy link
Member

psteckler commented Aug 30, 2021

Is there evidence that these changes have an effect where we were seeing indications of starvation?

If so, is that evidence attributable to particular changes, or only in aggregation?

@nholland94
Copy link
Member Author

Is there evidence that these changes have an effect where we were seeing indications of starvation?

If so, is that evidence attributable to particular changes, or only in aggregation?

Trace logs @deepthiskumar and I have collected indicate there are long async scheduler jobs in the sections of code that this PR addresses. Specifically, there are long async jobs during staged ledger diff application as well as snark pool frontier diff processing, which are the areas of code this PR intends to split up and optimize (respectively). Planning on deploying this to our seed nodes to confirm if these changes have enough impact, or if more work needs to be done to breakup more async jobs (these were the most egregious long async jobs, but there are others we may need to address as well).

Copy link
Member

@deepthiskumar deepthiskumar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, added a few comments.
Could you also annotate Parallel_scan.update_metrics with trace?

src/lib/network_pool/snark_pool.ml Outdated Show resolved Hide resolved
src/lib/network_pool/snark_pool.ml Outdated Show resolved Hide resolved
in
let%bind () = yield_result () in
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

call yield after the verify_scan_state_after_apply below? That is also taking a significant time. We should probably add a yield call in Transaction_snark_scan_state.scan_statement in the next iteration

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...
verify_scan_state_after_apply (skip=false) took $time_elapsed  3237.1082305908203
verify_scan_state_after_apply (skip=false) took $time_elapsed  3296.386480331421
verify_scan_state_after_apply (skip=false) took $time_elapsed  3375.446557998657
verify_scan_state_after_apply (skip=false) took $time_elapsed  2469.630241394043
verify_scan_state_after_apply (skip=false) took $time_elapsed  2484.107494354248
...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added some yields to Statement_scanner.check_invariants that should help break this up.

src/lib/network_pool/snark_pool.ml Outdated Show resolved Hide resolved
@nholland94
Copy link
Member Author

This PR has proven effective at fixing the rest server responsiveness issues (we have tested builds of this on mainnet and devnet via our seeds as well as through a partner). I also ran a test to measure if this will cause an increase on block validation acceptance time, which would in turn negatively impact block gossip consistency. I implemented a metric to measure this (in a separate PR), and I deployed 2 builds to our seed nodes: one with this change, and one without. The results show that, on average, the node running these changes validates blocks faster than the node that was not running these changes.

The grafana dashboard below shows the metric results from these nodes. Seed 2, shown in green, was running the build with these changes, and seed 2, shown in yellow, was running the build without these changes.

Screen Shot 2021-10-06 at 12 09 33 PM

seed-2-6845fb5f99-rz59k: "gcr.io/o1labs-192920/mina-daemon:1.2.0beta7-feature-block-validation-acceptance-time-metric-async-scheduler-starvation-mainnet-89d6634"
seed-3-b569ffd4f-24crd: "gcr.io/o1labs-192920/mina-daemon:1.2.0beta7-feature-block-validation-acceptance-time-metric-mainnet-4f55c74"

@nholland94 nholland94 mentioned this pull request Oct 6, 2021
@lk86 lk86 merged commit 31e23f7 into release/1.2.0 Oct 7, 2021
@lk86 lk86 deleted the fix/async-scheduler-starvation branch October 7, 2021 06:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ci-build-me Add this label to trigger a circle+buildkite build for this branch
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants