Fix async scheduler starvation #9416
Conversation
Is there evidence that these changes have an effect where we were seeing indications of starvation? If so, is that evidence attributable to particular changes, or only in aggregate?
Trace logs @deepthiskumar and I have collected indicate there are long async scheduler jobs in the sections of code that this PR addresses. Specifically, there are long async jobs during staged ledger diff application and during snark pool frontier diff processing, which are the areas of code this PR splits up and optimizes, respectively. We are planning to deploy this to our seed nodes to confirm whether these changes have enough impact, or whether more work is needed to break up additional async jobs (these were the most egregious long async jobs, but there are others we may need to address as well).
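The job-splitting described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `apply_diffs`, `process_diff`, and the batch size of 100 are all hypothetical names and values chosen for the example.

```ocaml
(* Sketch: break one long async job into many short ones by handing
   control back to the Async scheduler between batches of work.
   Without the periodic yield, the whole traversal runs as a single
   scheduler job and starves other jobs, such as the RPC server. *)
open Async

let apply_diffs diffs ~process_diff =
  Deferred.List.iteri ~how:`Sequential diffs ~f:(fun i diff ->
      (* Hypothetical batch size: yield every 100 items. *)
      let%bind () =
        if i mod 100 = 0 then Scheduler.yield () else return ()
      in
      process_diff diff)
```

The yield cadence is a trade-off: yielding too often adds scheduling overhead, while yielding too rarely leaves other jobs starved for the duration of a batch.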
Looks good, added a few comments. Could you also annotate Parallel_scan.update_metrics with trace?
let%bind () = yield_result () in
Call yield after the verify_scan_state_after_apply below? That is also taking a significant amount of time. We should probably also add a yield call in Transaction_snark_scan_state.scan_statement in the next iteration.
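The suggestion above could be acted on with a small wrapper along these lines. This is a hedged sketch: `with_yields` is an invented name, and `check` stands in for the expensive call such as verify_scan_state_after_apply.

```ocaml
open Async

(* Yield before and after an expensive check so that neither the work
   leading up to it nor the check itself monopolizes the scheduler. *)
let with_yields check =
  let%bind () = Scheduler.yield () in
  let%bind result = check () in
  let%map () = Scheduler.yield () in
  result
```

Yielding on both sides of the call ensures other pending jobs get a turn even when the check is invoked at the end of an already long job.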
...
verify_scan_state_after_apply (skip=false) took $time_elapsed 3237.1082305908203
verify_scan_state_after_apply (skip=false) took $time_elapsed 3296.386480331421
verify_scan_state_after_apply (skip=false) took $time_elapsed 3375.446557998657
verify_scan_state_after_apply (skip=false) took $time_elapsed 2469.630241394043
verify_scan_state_after_apply (skip=false) took $time_elapsed 2484.107494354248
...
I added some yields to Statement_scanner.check_invariants that should help break this up.
src/lib/transaction_snark_scan_state/transaction_snark_scan_state.ml
Force-pushed from a8eb60c to dffb978
This PR has proven effective at fixing the rest server responsiveness issues (we have tested builds of this on mainnet and devnet via our seeds, as well as through a partner). I also ran a test to measure whether this causes an increase in block validation acceptance time, which would in turn negatively impact block gossip consistency. I implemented a metric to measure this (in a separate PR) and deployed two builds to our seed nodes: one with this change and one without. The results show that, on average, the node running these changes validates blocks faster than the node without them. The Grafana dashboard below shows the metric results from these nodes. Seed 2, shown in green, was running the build with these changes, and the other seed, shown in yellow, was running the build without them.
This PR contains changes intended to address the async scheduler starvation issues that we believe are the root cause of the unresponsive daemon servers.
I recommend reviewing commit-by-commit, as the commits are broken up to represent the pieces of work that were done in this PR.