Test block production under load #10091
src/app/cli/src/tests/coda_worker.ml
Outdated
@@ -456,7 +456,8 @@ module T = struct
 in
 let monitor = Async.Monitor.create ~name:"coda" () in
 let with_monitor f input =
-Async.Scheduler.within' ~monitor (fun () -> f input)
+Async.Scheduler.within' ~monitor ~priority:Priority.low (fun () ->
This change is probably unnecessary; this code is only used by the old integration tests.
3c081f7 to 7b1546f
9095a05 to d89a4b2
105aada to 2f37252
5a43fd8 to b893cf4
@@ -5,6 +5,11 @@ TEST_NAME="$1"
 MINA_IMAGE="gcr.io/o1labs-192920/mina-daemon:$MINA_DOCKER_TAG-devnet"
 ARCHIVE_IMAGE="gcr.io/o1labs-192920/mina-archive:$MINA_DOCKER_TAG"

+if [[ "${TEST_NAME:0:4}" == "opt-" ]] && [[ "$RUN_OPT_TESTS" == "" ]]; then
What is this block of code for? Is this your way of temporarily removing your test from CI? It seems like this code should be removed now.
This block skips opt-XXX integration tests unless RUN_OPT_TESTS is provided.
The test from this PR fails with around 25% probability (i.e. it isn't reliable, but it is informative when launched many times). It is also expensive (30 instances and 1.5 hours), so it's switched off by default but can be launched manually through Buildkite.
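The opt-in behaviour described in this reply can be sketched roughly as follows. This is only an illustration, not the script from the PR: the function names should_skip_test and run_or_skip are hypothetical, the printed messages are made up, and the check is written with a POSIX case statement instead of the [[ ... ]] substring test from the diff so that it also runs under plain sh.

```shell
# Hypothetical helper: returns 0 (skip) for tests named opt-* when
# RUN_OPT_TESTS is unset or empty, and 1 (run) in every other case.
should_skip_test() {
  case "$1" in
    opt-*) [ -z "${RUN_OPT_TESTS:-}" ] ;;  # opt-in test: skip unless the variable is set
    *) return 1 ;;                          # regular test: never skipped
  esac
}

# Hypothetical wrapper that only reports the decision instead of launching a test.
run_or_skip() {
  if should_skip_test "$1"; then
    echo "skip $1"
  else
    echo "run $1"
  fi
}
```

With RUN_OPT_TESTS left empty, run_or_skip opt-block-prod reports a skip while run_or_skip payment reports a run; setting RUN_OPT_TESTS=1 makes the opt- test run as well.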
b893cf4 to d412af3
@@ -115,7 +115,7 @@ module Error_accumulator = struct
 let contexts_by_time =
 contextualized_errors |> String.Map.to_alist
 |> List.map ~f:(fun (ctx, errors) -> (errors.introduction_time, ctx))
-|> Time.Map.of_alist_exn
+|> Time.Map.of_alist_reduce ~f:(Printf.sprintf "%s, %s")
Could you change this to be a string list Time.Map.t instead? We ultimately want to iterate over each error individually, and the time indexing here is only used as a way to sort the errors by time. This change should be pretty simple to make and would improve how errors are printed when there is a time conflict.
LGTM overall. Left one comment I would like to see addressed to keep the error logging the same.
d412af3 to f99217a
888a721 to d744abc
@@ -29,6 +29,7 @@ in Pipeline.build Pipeline.Config::{
 TestExecutive.execute "payment" dependsOn,
 TestExecutive.execute "delegation" dependsOn,
 TestExecutive.execute "gossip-consis" dependsOn,
+TestExecutive.execute "opt-block-prod" dependsOn,
I think this name is too long by 1 character (which is why some of the other names are truncated).
Yeah, I know the issue, but as evidenced by green CI, I'm not surpassing the limit
@@ -48,6 +48,8 @@ let tests : test list =
 ; ("delegation", (module Delegation_test.Make : Intf.Test.Functor_intf))
 ; ("archive-node", (module Archive_node_test.Make : Intf.Test.Functor_intf))
 ; ("gossip-consis", (module Gossip_consistency.Make : Intf.Test.Functor_intf))
+; ( "opt-block-prod"
see earlier comment about name length
Yeah, I know the issue, but as evidenced by green CI, I'm not surpassing the limit
@@ -6,7 +6,7 @@ open Core_kernel
 open Signature_lib

 let keypairs =
-let n = 120 in
+let n = 1200 in
wow, we need that many?
We need more than 120 to ensure enough time between subsequent transactions from the same address
This is needed to prevent two transactions from the same address from co-existing in the network. Transactions tend to get re-ordered, which is problematic because the higher-nonce transaction might get discarded.
I think, as a separate PR, it might be a good idea to generate the secret key files for testing exactly once and keep them. That way 1200 keys will be less problematic.
1b2da3d to 430d06d
Test that block production delay is negligible under transaction load.
This is needed to make sure most blocks have 100% tx occupation (125 transactions per block).
Problem: Block production test is failing now and takes more than an hour to execute. Solution: make the block production test run on demand when RUN_OPT_TESTS=1 env variable is set.
430d06d to eb7573a
Problem: both the chain reliability and peer reliability tests rely on blocks being produced within an expected time interval. However, when a node is stopped, this assumption no longer holds. When a significant portion of stake is offline, blocks may not be produced in time, and the test then fails for legitimate reasons. Solution: remove stake from the node that is being stopped.
eb7573a to 3a99ec0
On some block-producing nodes, blocks are created with a significant lag, apparently because the nodes are overwhelmed with transactions.
Problem: there is no good way to reproduce this condition in a test environment.
Solution: implement an integration test that reproduces this condition with high probability.
The implemented test fails on current develop with ~25% probability. By default this test is turned off; provide the RUN_OPT_TESTS=1 environment variable via a Buildkite custom build to see the test executed.
The test launches:
Result of the test is expected to be:
Checklist: