Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test block production under load #10091

Merged
merged 7 commits into from
Mar 24, 2022
Merged

Conversation

georgeee
Copy link
Member

@georgeee georgeee commented Jan 25, 2022

On some block producing nodes blocks are created with a significant lag, apparently due to being overwhelmed with transactions.

Problem: there is no good way to reproduce the condition in test environment

Solution: implement an integration test that tests this condition with a high probability.

The test implemented fails on current develop with ~25% probability. By default this test is turned off, provide RUN_OPT_TESTS=1 environment variable via Buildkite custom build to see the test executed.

The test launches:

  • 4 transaction producers, each generating a transaction each 2 seconds
  • a block producer
  • a snark coordinator with 25 snark workers

Result of the test is expected to be:

  1. Median of transaction count in blocks is ~125 (first few empty blocks are ignored)
  2. All blocks are produced within 60s after slot start

Checklist:

  • Modified the current draft of release notes with details on what is completed or incomplete within this project
  • Document code purpose, how to use it
    • Mention expected invariants, implicit constraints
  • Tests were added for the new behavior
    • Document test purpose, significance of failures
    • Test names should reflect their purpose
  • All tests pass (CI will check this if you didn't)
  • Serialized types are in stable-versioned modules
  • Does this close issues? None

@georgeee georgeee requested a review from a team as a code owner January 25, 2022 11:09
@@ -456,7 +456,8 @@ module T = struct
in
let monitor = Async.Monitor.create ~name:"coda" () in
let with_monitor f input =
Async.Scheduler.within' ~monitor (fun () -> f input)
Async.Scheduler.within' ~monitor ~priority:Priority.low (fun () ->
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is probably unnecessary, this is only for the old integration tests.

@georgeee georgeee added the ci-build-me Add this label to trigger a circle+buildkite build for this branch label Jan 25, 2022
@georgeee georgeee force-pushed the georgeee/block-production-prio branch 5 times, most recently from 3c081f7 to 7b1546f Compare January 27, 2022 04:51
@georgeee georgeee requested a review from a team as a code owner January 27, 2022 04:51
@georgeee georgeee force-pushed the georgeee/block-production-prio branch 5 times, most recently from 9095a05 to d89a4b2 Compare January 28, 2022 16:00
@georgeee georgeee linked an issue Jan 28, 2022 that may be closed by this pull request
2 tasks
@georgeee georgeee force-pushed the georgeee/block-production-prio branch 11 times, most recently from 105aada to 2f37252 Compare February 2, 2022 18:11
@georgeee georgeee force-pushed the georgeee/block-production-prio branch from 5a43fd8 to b893cf4 Compare March 9, 2022 22:09
@@ -5,6 +5,11 @@ TEST_NAME="$1"
MINA_IMAGE="gcr.io/o1labs-192920/mina-daemon:$MINA_DOCKER_TAG-devnet"
ARCHIVE_IMAGE="gcr.io/o1labs-192920/mina-archive:$MINA_DOCKER_TAG"

if [[ "${TEST_NAME:0:4}" == "opt-" ]] && [[ "$RUN_OPT_TESTS" == "" ]]; then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is this block of code for? is this your way of temporarily removing your test from CI? seems like this code should be removed now

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This block doesn't run opt-XXX integration tests unless RUN_OPT_TESTS is provided.

Test from this PR fails with around 25% probability (i.e. isn't reliable, but is informative when launched many times). Also it's expensive (30 instances and 1.5 hour), hence it's switched off by default but can be laucnhed manually through Buildkite

@georgeee georgeee changed the title WIP: Prioritize block production over other tasks Test block production under load Mar 12, 2022
@georgeee georgeee force-pushed the georgeee/block-production-prio branch from b893cf4 to d412af3 Compare March 14, 2022 18:07
@@ -115,7 +115,7 @@ module Error_accumulator = struct
let contexts_by_time =
contextualized_errors |> String.Map.to_alist
|> List.map ~f:(fun (ctx, errors) -> (errors.introduction_time, ctx))
|> Time.Map.of_alist_exn
|> Time.Map.of_alist_reduce ~f:(Printf.sprintf "%s, %s")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you change this so to be a string list Time.Map.t instead? We ultimately want to iterate over each error individually, and the time indexing here is only used as a way to sort the errors by time. This change should be pretty simple to make and would improve how errors are printed when there is a time conflict.

Copy link
Member

@nholland94 nholland94 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall. Left one comment I would like to see addressed to keep the error logging the same.

@georgeee georgeee force-pushed the georgeee/block-production-prio branch from d412af3 to f99217a Compare March 14, 2022 22:43
@georgeee georgeee force-pushed the georgeee/block-production-prio branch 3 times, most recently from 888a721 to d744abc Compare March 15, 2022 20:12
@@ -29,6 +29,7 @@ in Pipeline.build Pipeline.Config::{
TestExecutive.execute "payment" dependsOn,
TestExecutive.execute "delegation" dependsOn,
TestExecutive.execute "gossip-consis" dependsOn,
TestExecutive.execute "opt-block-prod" dependsOn,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this name is too long by 1 character (which is why some of the other names are truncated).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I know the issue, but as evidenced by green CI, I'm not surpassing the limit

@@ -48,6 +48,8 @@ let tests : test list =
; ("delegation", (module Delegation_test.Make : Intf.Test.Functor_intf))
; ("archive-node", (module Archive_node_test.Make : Intf.Test.Functor_intf))
; ("gossip-consis", (module Gossip_consistency.Make : Intf.Test.Functor_intf))
; ( "opt-block-prod"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see earlier comment about name length

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I know the issue, but as evidenced by green CI, I'm not surpassing the limit

@@ -6,7 +6,7 @@ open Core_kernel
open Signature_lib

let keypairs =
let n = 120 in
let n = 1200 in
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wow, we need that many?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need more than 120 to ensure enough time between subsequent transactions from the same address

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is needed to avoid two transactions from the same address to co-exist in network (as transactions tend to get re-ordered and this is problematic because higher nonce transaction might get discarded)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think as another PR it might be a good idea to just keep secret key files for testing generated exactly once. This way 1200 keys will be less problematic

@georgeee georgeee force-pushed the georgeee/block-production-prio branch from 1b2da3d to 430d06d Compare March 23, 2022 23:18
Test that block production delay is neglibile
under transaction load.
This is needed to make sure most blocks have 100% tx occupation (125
transactions per block).
Problem: Block production test is failing now and takes more than an
hour to execute.

Solution: make the block production test run on demand when
RUN_OPT_TESTS=1 env variable is set.
@georgeee georgeee force-pushed the georgeee/block-production-prio branch from 430d06d to eb7573a Compare March 23, 2022 23:35
Problem: both chain reliability and peer reliability tests rely on
blocks to be produced within an expected time interval. However when a
node is stopped, this remains a valid assumption no longer. When a
significant portion of stake is offline, block creation may not happen
naturally and make the test fail with legitimate reasons for it.

Solution: remove stake from the node that is being stopped.
@georgeee georgeee force-pushed the georgeee/block-production-prio branch from eb7573a to 3a99ec0 Compare March 23, 2022 23:54
@georgeee georgeee merged commit 939134b into compatible Mar 24, 2022
@georgeee georgeee deleted the georgeee/block-production-prio branch March 24, 2022 22:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ci-build-me Add this label to trigger a circle+buildkite build for this branch proj-network-stability
Projects
None yet
Development

Successfully merging this pull request may close these issues.

fix integration test snark coordinator init timing out bug Blocks are not produced at the start of a slot
6 participants