Test block production under load #10091

georgeee · 2022-01-25T11:09:37Z

On some block-producing nodes, blocks are created with significant lag, apparently because the node is overwhelmed with transactions.

Problem: there is no good way to reproduce this condition in a test environment.

Solution: implement an integration test that reproduces this condition with high probability.

The implemented test fails on current develop with ~25% probability. It is turned off by default; set the RUN_OPT_TESTS=1 environment variable in a Buildkite custom build to run it.

The test launches:

- 4 transaction producers, each generating one transaction every 2 seconds
- a block producer
- a snark coordinator with 25 snark workers

The expected result of the test is:

- The median transaction count per block is ~125 (the first few empty blocks are ignored)
- All blocks are produced within 60 s of slot start
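
The two expectations above can be sketched as a small check. This is only an illustration, not the code from this PR: the `block_stat` record, its field names and the `check` function are hypothetical, and the sketch returns the median to the caller rather than hard-coding a tolerance around 125.

```ocaml
(* Hypothetical per-block measurement; not a type from the test executive. *)
type block_stat =
  { tx_count : int  (* transactions included in the block *)
  ; delay_s : float  (* seconds between slot start and block creation *)
  }

(* The description ignores the first few empty blocks when taking the median. *)
let rec drop_leading_empty = function
  | { tx_count = 0; _ } :: rest -> drop_leading_empty rest
  | blocks -> blocks

(* Median of a non-empty int list (upper median for even lengths). *)
let median xs =
  let sorted = List.sort compare xs in
  List.nth sorted (List.length sorted / 2)

(* Limit taken from the description: 60 s after slot start. *)
let max_delay_s = 60.0

(* Returns [Ok median] if every block met the delay limit, [Error _] otherwise. *)
let check blocks =
  match List.filter (fun b -> b.delay_s > max_delay_s) blocks with
  | _ :: _ as late ->
      Error
        (Printf.sprintf "%d block(s) produced more than %.0fs after slot start"
           (List.length late) max_delay_s)
  | [] -> (
      match drop_leading_empty blocks with
      | [] -> Error "no non-empty blocks observed"
      | counted -> Ok (median (List.map (fun b -> b.tx_count) counted)) )
```

A caller would then compare the returned median against whatever tolerance around 125 it considers acceptable.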

Checklist:

Modified the current draft of release notes with details on what is completed or incomplete within this project
Document code purpose, how to use it
- Mention expected invariants, implicit constraints
Tests were added for the new behavior
- Document test purpose, significance of failures
- Test names should reflect their purpose
All tests pass (CI will check this if you didn't)
Serialized types are in stable-versioned modules
Does this close issues? None

mrmr1993 · 2022-01-25T13:28:30Z

src/app/cli/src/tests/coda_worker.ml

@@ -456,7 +456,8 @@ module T = struct
          in
          let monitor = Async.Monitor.create ~name:"coda" () in
          let with_monitor f input =
-            Async.Scheduler.within' ~monitor (fun () -> f input)
+            Async.Scheduler.within' ~monitor ~priority:Priority.low (fun () ->


This change is probably unnecessary; this code is only used by the old integration tests.

QuiteStochastic · 2022-03-11T23:52:04Z

buildkite/scripts/run-test-executive.sh

@@ -5,6 +5,11 @@ TEST_NAME="$1"
 MINA_IMAGE="gcr.io/o1labs-192920/mina-daemon:$MINA_DOCKER_TAG-devnet"
 ARCHIVE_IMAGE="gcr.io/o1labs-192920/mina-archive:$MINA_DOCKER_TAG"

+if [[ "${TEST_NAME:0:4}" == "opt-" ]] && [[ "$RUN_OPT_TESTS" == "" ]]; then


What is this block of code for? Is this your way of temporarily removing your test from CI? It seems like this code should be removed now.

georgeee · 2022-03-12

This block skips the opt-XXX integration tests unless RUN_OPT_TESTS is set.

The test from this PR fails with around 25% probability (i.e. it isn't reliable, but it is informative when launched many times). It is also expensive (30 instances and 1.5 hours), hence it is switched off by default but can be launched manually through Buildkite.

nholland94 · 2022-03-14T21:41:35Z

src/lib/integration_test_lib/test_error.ml

@@ -115,7 +115,7 @@ module Error_accumulator = struct
    let contexts_by_time =
      contextualized_errors |> String.Map.to_alist
      |> List.map ~f:(fun (ctx, errors) -> (errors.introduction_time, ctx))
-      |> Time.Map.of_alist_exn
+      |> Time.Map.of_alist_reduce ~f:(Printf.sprintf "%s, %s")


Could you change this to be a string list Time.Map.t instead? We ultimately want to iterate over each error individually, and the time indexing here is only used as a way to sort the errors by time. This change should be pretty simple to make and would improve how errors are printed when there is a time conflict.
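
One way to picture the suggested shape is a map from time to a list of contexts. The sketch below is an assumption-laden stand-in, not the code in test_error.ml: it uses the OCaml standard library with float timestamps instead of Core's `Time.Map`, and `Time_map`, `contexts_by_time` and `contexts_in_order` are hypothetical names. With Core, `Time.Map.of_alist_multi` should build the same kind of `string list Time.Map.t` directly.

```ocaml
(* Stand-in for Time.Map: a stdlib map keyed by float timestamps. *)
module Time_map = Map.Make (Float)

(* Group contexts by introduction time, keeping every context in a list
   instead of merging colliding ones into a single string. *)
let contexts_by_time (pairs : (float * string) list) : string list Time_map.t =
  List.fold_left
    (fun acc (time, ctx) ->
      Time_map.update time
        (function None -> Some [ ctx ] | Some ctxs -> Some (ctx :: ctxs))
        acc)
    Time_map.empty pairs
  |> Time_map.map List.rev

(* Walk the map in time order; contexts that share a timestamp keep
   their input order and are still visited one by one. *)
let contexts_in_order pairs =
  Time_map.bindings (contexts_by_time pairs) |> List.concat_map snd
```

The time key is then used purely for ordering, and each error context can still be printed on its own when two of them share a timestamp.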

nholland94

LGTM overall. Left one comment I would like to see addressed to keep the error logging the same.

psteckler · 2022-03-15T23:15:49Z

buildkite/src/Jobs/Test/TestnetIntegrationTests.dhall

@@ -29,6 +29,7 @@ in Pipeline.build Pipeline.Config::{
    TestExecutive.execute "payment" dependsOn,
    TestExecutive.execute "delegation" dependsOn,
    TestExecutive.execute "gossip-consis" dependsOn,
+    TestExecutive.execute "opt-block-prod" dependsOn,


I think this name is too long by 1 character (which is why some of the other names are truncated).

georgeee · 2022-03-15

Yeah, I know about the issue, but as evidenced by the green CI, I'm not exceeding the limit.

psteckler · 2022-03-15T23:16:29Z

src/app/test_executive/test_executive.ml

@@ -48,6 +48,8 @@ let tests : test list =
  ; ("delegation", (module Delegation_test.Make : Intf.Test.Functor_intf))
  ; ("archive-node", (module Archive_node_test.Make : Intf.Test.Functor_intf))
  ; ("gossip-consis", (module Gossip_consistency.Make : Intf.Test.Functor_intf))
+  ; ( "opt-block-prod"


see earlier comment about name length

georgeee · 2022-03-15

Yeah, I know about the issue, but as evidenced by the green CI, I'm not exceeding the limit.

psteckler · 2022-03-15T23:18:04Z

src/lib/mina_base/gen/gen.ml

@@ -6,7 +6,7 @@ open Core_kernel
 open Signature_lib

 let keypairs =
-  let n = 120 in
+  let n = 1200 in


wow, we need that many?

georgeee · 2022-03-15

We need more than 120 to ensure enough time between subsequent transactions from the same address.

This is needed to prevent two transactions from the same address from coexisting in the network (transactions tend to get reordered, which is problematic because the higher-nonce transaction might get discarded).

I think, as a separate PR, it might be a good idea to generate the secret key files for testing exactly once and keep them. That way 1200 keys would be less problematic.
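
A rough back-of-the-envelope sketch of why the pool size matters. It rests on an assumption that is not stated in this thread: sender keys are used round-robin across the whole pool. The producer count and rate come from the PR description; `reuse_interval_s` is a hypothetical helper.

```ocaml
(* From the PR description: 4 producers, one transaction every 2 s each. *)
let producers = 4
let seconds_per_tx = 2.0

(* Combined send rate across all producers, in transactions per second. *)
let network_rate = float_of_int producers /. seconds_per_tx

(* ASSUMPTION: keys are used round-robin, so a key is reused only after
   every other key in the pool has sent once. *)
let reuse_interval_s ~keys = float_of_int keys /. network_rate

let () =
  Printf.printf "120 keys: reuse every %.0f s\n" (reuse_interval_s ~keys:120);
  Printf.printf "1200 keys: reuse every %.0f s\n" (reuse_interval_s ~keys:1200)
```

Under that assumption, 120 keys give each address about a minute between transactions, while 1200 keys stretch it to about ten minutes, leaving more time for the earlier transaction to land in a block before the next nonce is sent.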

Problem: both the chain reliability and peer reliability tests rely on blocks being produced within an expected time interval. However, once a node is stopped, this assumption no longer holds: when a significant portion of stake is offline, blocks may not be produced on schedule, and the test fails for legitimate reasons. Solution: remove stake from the node that is being stopped.

georgeee requested a review from a team as a code owner January 25, 2022 11:09

mrmr1993 reviewed Jan 25, 2022

georgeee requested review from bkase, imeckler, psteckler and a team as code owners January 25, 2022 18:02

georgeee added the ci-build-me label Jan 25, 2022

georgeee force-pushed the georgeee/block-production-prio branch 5 times, most recently from 3c081f7 to 7b1546f Compare January 27, 2022 04:51

georgeee requested a review from a team as a code owner January 27, 2022 04:51

georgeee force-pushed the georgeee/block-production-prio branch 5 times, most recently from 9095a05 to d89a4b2 Compare January 28, 2022 16:00

georgeee linked an issue Jan 28, 2022 that may be closed by this pull request

Blocks are not produced at the start of a slot #10108

Closed

2 tasks

georgeee force-pushed the georgeee/block-production-prio branch 11 times, most recently from 105aada to 2f37252 Compare February 2, 2022 18:11

georgeee force-pushed the georgeee/block-production-prio branch from 5a43fd8 to b893cf4 Compare March 9, 2022 22:09

QuiteStochastic reviewed Mar 11, 2022

georgeee changed the title ~~WIP: Prioritize block production over other tasks~~ Test block production under load Mar 12, 2022

georgeee added the proj-network-stability label Mar 14, 2022

georgeee force-pushed the georgeee/block-production-prio branch from b893cf4 to d412af3 Compare March 14, 2022 18:07

nholland94 reviewed Mar 14, 2022

nholland94 approved these changes Mar 14, 2022

georgeee force-pushed the georgeee/block-production-prio branch from d412af3 to f99217a Compare March 14, 2022 22:43

QuiteStochastic approved these changes Mar 15, 2022

georgeee force-pushed the georgeee/block-production-prio branch 3 times, most recently from 888a721 to d744abc Compare March 15, 2022 20:12

georgeee mentioned this pull request Mar 15, 2022

Merge georgeee/block-production-prio to develop #10475

Merged

6 tasks

psteckler reviewed Mar 15, 2022

psteckler approved these changes Mar 15, 2022

lk86 approved these changes Mar 16, 2022

georgeee force-pushed the georgeee/block-production-prio branch from 1b2da3d to 430d06d Compare March 23, 2022 23:18

georgeee added 6 commits March 24, 2022 00:35

Test block production delay

44ef6d0

Test that block production delay is negligible under transaction load.

Use 25 snark workers

23a9ba3

This is needed to make sure most blocks have 100% tx occupation (125 transactions per block).

Make block production test run only optionally

82ccf81

Problem: Block production test is failing now and takes more than an hour to execute. Solution: make the block production test run on demand when RUN_OPT_TESTS=1 env variable is set.

Handle duplicate timestamps for soft errors

9557271

Use sections in test to improve readability

3556e79

Allow manual restart of successful jobs

843ada6

georgeee force-pushed the georgeee/block-production-prio branch from 430d06d to eb7573a Compare March 23, 2022 23:35

georgeee force-pushed the georgeee/block-production-prio branch from eb7573a to 3a99ec0 Compare March 23, 2022 23:54

georgeee merged commit 939134b into compatible Mar 24, 2022

georgeee deleted the georgeee/block-production-prio branch March 24, 2022 22:53
