roachtest: update wal-failover/among-stores/with-progress to use process monitor #148604

Merged
merged 1 commit into from
Jun 24, 2025

Conversation

xinhaoz
Member

@xinhaoz xinhaoz commented Jun 20, 2025

Epic: none

@cockroach-teamcity
Member

This change is Reviewable

@xinhaoz xinhaoz force-pushed the roachtest-disk-stall branch 3 times, most recently from e65a61a to adef307 Compare June 20, 2025 19:24
@xinhaoz xinhaoz marked this pull request as ready for review June 20, 2025 19:35
@xinhaoz xinhaoz requested a review from a team as a code owner June 20, 2025 19:35
@xinhaoz xinhaoz requested review from herkolategan and DarrylWong and removed request for a team June 20, 2025 19:35

err := s.c.RunE(ctx, option.WithNodes(nodes), "sudo", "/bin/bash", "-c", cmd)
if err != nil {
	s.f.L().PrintfCtx(ctx, "error in StallForDuration: %v", err)
Member

The command above can fail in more than one way. As before, it can fail due to a (transient) network issue, resulting in an SSH flake. It can also fail if the script itself errors out for any reason, e.g., the output file is not available. In either case, I think you want to fail the test rather than only log the error. Note that an SSH error would already have been detected (in remoteSession.Run) and classified as a flake.
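
A minimal sketch of that suggestion, under stated assumptions: the helper returns the error so the caller can fail the test, instead of logging it and moving on. The `runner` type and the `failingRun`, `okRun`, and `stallFails` helpers are hypothetical stand-ins for illustration, not the roachtest API.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// runner is a hypothetical stand-in for the cluster's RunE call.
type runner func(ctx context.Context, args ...string) error

// errScript simulates the stall script exiting with a non-zero status.
var errScript = errors.New("script exited non-zero")

// failingRun and okRun are fake runners used only for this demonstration.
func failingRun(context.Context, ...string) error { return errScript }
func okRun(context.Context, ...string) error      { return nil }

// stallForDuration runs the stall script and returns any failure to the
// caller, wrapped with context, rather than only logging it.
func stallForDuration(ctx context.Context, run runner, cmd string) error {
	if err := run(ctx, "sudo", "/bin/bash", "-c", cmd); err != nil {
		return fmt.Errorf("StallForDuration: %w", err)
	}
	return nil
}

// stallFails reports whether stallForDuration surfaces an error for run.
func stallFails(run runner) bool {
	return stallForDuration(context.Background(), run, "true") != nil
}

func main() {
	// In a real test, a non-nil error here is where the test would be failed.
	fmt.Println("failing runner surfaces error:", stallFails(failingRun))
	fmt.Println("ok runner surfaces error:", stallFails(okRun))
}
```

Wrapping with `%w` keeps the underlying cause available to `errors.Is`, so a caller could still tell a script failure apart from an error that was already classified as a flake.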

}()
// Use a single call to StallForDuration to reduce SSH connections.
t.Status("short disk stall on n1 for " + shortStallDur.String())
s.StallForDuration(ctx, c.Node(1), shortStallDur)
Contributor

While this does fix the immediate worst case of an Unstall command flaking and the node fataling, I think this test is still sensitive to SSH flakes.

For example:

workloadCmd := `./cockroach workload run kv --read-percent 0 ` +
	fmt.Sprintf(`--duration %s --concurrency 4096 --max-rate=2048 --tolerate-errors `, operationDur.String()) +
	`--min-block-bytes=4096 --max-block-bytes=4096 --timeout 1s {pgurl:1-3}`

Here you run the workload for 3 minutes before attempting to induce a disk stall. However, as you've seen, one SSH flake can hang for several minutes, so it's possible that you call StallForDuration but, by the time it actually runs remotely, the workload has already finished.

Maybe StallForDuration could handle stalling and unstalling for the entire 3-minute interval? Of course, this could still flake, but you'd go from creating 100 SSH connections in a short period to only two, so it seems much less likely.
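
A rough sketch of the single-session idea, assuming bash on the remote host: the whole stall/unstall oscillation is packed into one script, so only one remote command is issued for the interval. `buildStallLoop`, `exampleScript`, the `echo` placeholder commands, and the durations are all hypothetical and chosen only for illustration; the real stall and unstall commands are not shown in this thread.

```go
package main

import (
	"fmt"
	"time"
)

// buildStallLoop returns a bash script that alternates stallCmd and
// unstallCmd on the remote host until total has elapsed. It relies on
// bash's built-in SECONDS counter, so no clock sync with the caller is needed.
func buildStallLoop(stallCmd, unstallCmd string, stallDur, pause, total time.Duration) string {
	return fmt.Sprintf(
		"end=$((SECONDS+%d)); while [ $SECONDS -lt $end ]; do %s; sleep %g; %s; sleep %g; done",
		int(total.Seconds()), stallCmd, stallDur.Seconds(), unstallCmd, pause.Seconds())
}

// exampleScript builds a script with placeholder commands and durations.
func exampleScript() string {
	return buildStallLoop("echo stall", "echo unstall",
		500*time.Millisecond, 5*time.Second, 3*time.Minute)
}

func main() {
	fmt.Println(exampleScript())
}
```

This sketch does not handle the case where the session dies while the disk is stalled; a real helper would need some way to guarantee the unstall still happens.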

@@ -484,7 +486,7 @@ func runDiskStalledWALFailoverWithProgress(ctx context.Context, t test.Test, c c

t.Status("starting oscillating workload and disk stall pattern")
testStartedAt := timeutil.Now()
-m := c.NewMonitor(ctx, c.CRDBNodes())
+m := t.NewGroup(task.WithContext(ctx))
Contributor

nit: g for error group instead of m?

Member Author

done

@@ -511,7 +513,7 @@ func runDiskStalledWALFailoverWithProgress(ctx context.Context, t test.Test, c c
 for timeutil.Since(testStartedAt) < testDuration {
 	if t.Failed() {
 		t.Fatalf("test failed, stopping further iterations")
-		return
+		break
Contributor

nit: t.Fatalf already calls panic
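
To illustrate the nit, here is a small sketch assuming a test handle whose Fatalf panics, as the comment describes. `fakeT` and `reachedAfterFatalf` are hypothetical stand-ins, not the roachtest types; the point is only that any statement placed after such a call never runs.

```go
package main

import "fmt"

// fakeT is a hypothetical test handle whose Fatalf panics.
type fakeT struct{ failed bool }

// Fatalf records the failure and then panics, so it never returns normally.
func (t *fakeT) Fatalf(format string, args ...any) {
	t.failed = true
	panic(fmt.Sprintf(format, args...))
}

// reachedAfterFatalf reports whether the statement following Fatalf ran.
func reachedAfterFatalf() (reached bool) {
	t := &fakeT{}
	// Recover so the demonstration can report the result instead of crashing.
	defer func() { _ = recover() }()
	for i := 0; i < 3; i++ {
		t.Fatalf("test failed, stopping further iterations")
		reached = true // never runs: Fatalf panicked above
	}
	return reached
}

func main() {
	fmt.Println("statement after Fatalf reached:", reachedAfterFatalf())
}
```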

Member Author

done

@xinhaoz xinhaoz force-pushed the roachtest-disk-stall branch from adef307 to f594ee2 Compare June 23, 2025 20:02
Member Author

@xinhaoz xinhaoz left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan)


a discussion (no related file):
Spoke with @DarrylWong offline and he's going to add the StallForDuration functionality to the disk stall utils. Dropped the commit adding it to the roachtestutil interface so this is just swapping to use the new monitor.

@xinhaoz xinhaoz changed the title roachtest: add StallForDuration for short disk stalls roachtest: update wal-failover/among-stores/with-progress to use process monitor Jun 23, 2025
@xinhaoz xinhaoz requested review from a team and annrpom June 23, 2025 21:57
@xinhaoz
Member Author

xinhaoz commented Jun 24, 2025

tftr!
bors r+

@craig
Contributor

craig bot commented Jun 24, 2025

@craig craig bot merged commit 7ca9db8 into cockroachdb:master Jun 24, 2025
21 of 22 checks passed
@xinhaoz xinhaoz deleted the roachtest-disk-stall branch June 24, 2025 18:50
5 participants