roachtest: update wal-failover/among-stores/with-progress to use process monitor #148604

xinhaoz · 2025-06-20T16:13:59Z

Epic: none

cockroach-teamcity · 2025-06-20T16:14:13Z

This change is

srosenberg · 2025-06-22T16:22:35Z

pkg/cmd/roachtest/roachtestutil/disk_stall.go

+
+	err := s.c.RunE(ctx, option.WithNodes(nodes), "sudo", "/bin/bash", "-c", cmd)
+	if err != nil {
+		s.f.L().PrintfCtx(ctx, "error in StallForDuration: %v", err)


The above command can fail in more than one way. As before, it can fail due to a (transient) network issue, resulting in an SSH flake. It can also fail if the script errors out for any reason, e.g., the output file is not available. In either case, I think you want to fail the test rather than only log the error. Note that an SSH error would have already been detected (in remoteSession.Run) and classified as a flake.

DarrylWong · 2025-06-23T15:47:01Z

pkg/cmd/roachtest/tests/disk_stall.go

-						}()
+						// Use a single call to StallForDuration to reduce SSH connections.
+						t.Status("short disk stall on n1 for " + shortStallDur.String())
+						s.StallForDuration(ctx, c.Node(1), shortStallDur)


While this does fix the immediate worst case of an Unstall command flaking and the node fataling, I think this test is still sensitive to SSH flakes.

For example:

workloadCmd := `./cockroach workload run kv --read-percent 0 ` + fmt.Sprintf(`--duration %s --concurrency 4096 --max-rate=2048 --tolerate-errors `, operationDur.String()) + `--min-block-bytes=4096 --max-block-bytes=4096 --timeout 1s {pgurl:1-3}`

Here you run the workload for 3 minutes before attempting to induce a disk stall. However, as you've seen, one SSH flake could hang for several minutes. So it's possible that you call StallForDuration, but by the time it actually runs remotely, the workload has already finished.

Maybe StallForDuration could handle stalling and unstalling for the entire 3-minute interval? Of course this could still flake, but you'd go from creating 100 SSH connections in a short period to only two, so a flake seems much less likely?

DarrylWong · 2025-06-23T15:52:33Z

pkg/cmd/roachtest/tests/disk_stall.go

@@ -484,7 +486,7 @@ func runDiskStalledWALFailoverWithProgress(ctx context.Context, t test.Test, c c

 	t.Status("starting oscillating workload and disk stall pattern")
 	testStartedAt := timeutil.Now()
-	m := c.NewMonitor(ctx, c.CRDBNodes())
+	m := t.NewGroup(task.WithContext(ctx))


nit: g for error group instead of m?

DarrylWong · 2025-06-23T15:53:21Z

pkg/cmd/roachtest/tests/disk_stall.go

@@ -511,7 +513,7 @@ func runDiskStalledWALFailoverWithProgress(ctx context.Context, t test.Test, c c
 	for timeutil.Since(testStartedAt) < testDuration {
 		if t.Failed() {
 			t.Fatalf("test failed, stopping further iterations")
-			return
+			break


nit: t.Fatalf already calls panic

xinhaoz

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan)

a discussion (no related file):
Spoke with @DarrylWong offline and he's going to add the StallForDuration functionality to the disk stall utils. Dropped the commit adding it to the roachtestutil interface so this is just swapping to use the new monitor.

xinhaoz · 2025-06-23T20:06:24Z

pkg/cmd/roachtest/tests/disk_stall.go

@@ -484,7 +486,7 @@ func runDiskStalledWALFailoverWithProgress(ctx context.Context, t test.Test, c c

 	t.Status("starting oscillating workload and disk stall pattern")
 	testStartedAt := timeutil.Now()
-	m := c.NewMonitor(ctx, c.CRDBNodes())
+	m := t.NewGroup(task.WithContext(ctx))


xinhaoz · 2025-06-23T20:06:25Z

pkg/cmd/roachtest/tests/disk_stall.go

@@ -511,7 +513,7 @@ func runDiskStalledWALFailoverWithProgress(ctx context.Context, t test.Test, c c
 	for timeutil.Since(testStartedAt) < testDuration {
 		if t.Failed() {
 			t.Fatalf("test failed, stopping further iterations")
-			return
+			break


xinhaoz · 2025-06-24T14:02:09Z

tftr!
bors r+

craig · 2025-06-24T14:45:17Z

Build succeeded:

xinhaoz force-pushed the roachtest-disk-stall branch 3 times, most recently from e65a61a to adef307 Compare June 20, 2025 19:24

xinhaoz marked this pull request as ready for review June 20, 2025 19:35

xinhaoz requested a review from a team as a code owner June 20, 2025 19:35

xinhaoz requested review from herkolategan and DarrylWong and removed request for a team June 20, 2025 19:35

srosenberg reviewed Jun 22, 2025

DarrylWong reviewed Jun 23, 2025

DarrylWong mentioned this pull request Jun 23, 2025

roachtest: use failure injection library for disk stalls #147349

Merged

roachtest: update wal-failover/among-stores/with-progress to use proc…

f594ee2

…ess monitor Epic: none

xinhaoz force-pushed the roachtest-disk-stall branch from adef307 to f594ee2 Compare June 23, 2025 20:02

xinhaoz commented Jun 23, 2025

xinhaoz changed the title ~~roachtest: add StallForDuration for short disk stalls~~ roachtest: update wal-failover/among-stores/with-progress to use process monitor Jun 23, 2025

xinhaoz requested review from a team and annrpom June 23, 2025 21:57

DarrylWong approved these changes Jun 23, 2025

annrpom approved these changes Jun 23, 2025

craig bot merged commit 7ca9db8 into cockroachdb:master Jun 24, 2025
21 of 22 checks passed

celeste-cockroachdb bot added the target-release-25.3.0 label Jun 24, 2025

xinhaoz deleted the roachtest-disk-stall branch June 24, 2025 18:50

celeste-cockroachdb bot added v25.3.0-prerelease and removed target-release-25.3.0 labels Jul 2, 2025
