roachtest: update wal-failover/among-stores/with-progress to use process monitor #148604
Conversation
Force-pushed from e65a61a to adef307
err := s.c.RunE(ctx, option.WithNodes(nodes), "sudo", "/bin/bash", "-c", cmd)
if err != nil {
	s.f.L().PrintfCtx(ctx, "error in StallForDuration: %v", err)
The above command can fail under different failure modes. As before, it can fail due to a (transient) network issue, resulting in an SSH flake. It can also fail if the script errors out for any reason, e.g., the output file is not available. In either case, I think you want to fail the test. Note that an SSH error would have already been detected (in remoteSession.Run) and classified as a flake.
}()
// Use a single call to StallForDuration to reduce SSH connections.
t.Status("short disk stall on n1 for " + shortStallDur.String())
s.StallForDuration(ctx, c.Node(1), shortStallDur)
While this does fix the immediate worst case of an Unstall command flaking and the node fataling, I think this test is still sensitive to SSH flakes.
For example:
workloadCmd := `./cockroach workload run kv --read-percent 0 ` +
fmt.Sprintf(`--duration %s --concurrency 4096 --max-rate=2048 --tolerate-errors `, operationDur.String()) +
`--min-block-bytes=4096 --max-block-bytes=4096 --timeout 1s {pgurl:1-3}`
Here you run the workload for 3 minutes before attempting to induce a disk stall. However, as you've seen, one SSH flake could hang for several minutes, so it's possible that you call StallForDuration, but by the time it is actually run remotely, the workload has already finished.
Maybe StallForDuration could handle stalling and unstalling for the entire 3-minute interval? Of course this could still flake, but you'd go from creating ~100 SSH connections in a short period to only two, so it seems much less likely.
@@ -484,7 +486,7 @@ func runDiskStalledWALFailoverWithProgress(ctx context.Context, t test.Test, c c

 	t.Status("starting oscillating workload and disk stall pattern")
 	testStartedAt := timeutil.Now()
-	m := c.NewMonitor(ctx, c.CRDBNodes())
+	m := t.NewGroup(task.WithContext(ctx))
nit: g for error group instead of m?
done
@@ -511,7 +513,7 @@ func runDiskStalledWALFailoverWithProgress(ctx context.Context, t test.Test, c c
 	for timeutil.Since(testStartedAt) < testDuration {
 		if t.Failed() {
 			t.Fatalf("test failed, stopping further iterations")
-			return
+			break
nit: t.Fatalf already calls panic
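The nit can be demonstrated with a small self-contained example: a Fatalf that panics never returns to its caller, so a break (or return) placed after it is dead code. fakeTest below mimics that behavior and is not the real test framework's t.Fatalf.

```go
// Illustration: code after a panicking Fatalf never executes.
package main

import "fmt"

type fakeTest struct{}

// Fatalf mimics a test framework Fatalf that panics instead of returning.
func (fakeTest) Fatalf(format string, args ...any) {
	panic(fmt.Sprintf(format, args...))
}

// loop counts iterations; when failed is set, Fatalf fires on the first
// pass and the break after it is unreachable.
func loop(failed bool) (iterations int) {
	t := fakeTest{}
	defer func() {
		// Recover so we can observe how far the loop got before the panic.
		recover()
	}()
	for i := 0; i < 3; i++ {
		if failed {
			t.Fatalf("test failed, stopping further iterations")
			break // dead code: Fatalf panicked above
		}
		iterations++
	}
	return iterations
}

func main() {
	fmt.Println(loop(false)) // 3
	fmt.Println(loop(true))  // 0
}
```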
done
…ess monitor (Epic: none)
Force-pushed from adef307 to f594ee2
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan)
a discussion (no related file):
Spoke with @DarrylWong offline and he's going to add the StallForDuration functionality to the disk stall utils. Dropped the commit adding it to the roachtestutil interface so this is just swapping to use the new monitor.
@@ -484,7 +486,7 @@ func runDiskStalledWALFailoverWithProgress(ctx context.Context, t test.Test, c c

 	t.Status("starting oscillating workload and disk stall pattern")
 	testStartedAt := timeutil.Now()
-	m := c.NewMonitor(ctx, c.CRDBNodes())
+	m := t.NewGroup(task.WithContext(ctx))
done
@@ -511,7 +513,7 @@ func runDiskStalledWALFailoverWithProgress(ctx context.Context, t test.Test, c c
 	for timeutil.Since(testStartedAt) < testDuration {
 		if t.Failed() {
 			t.Fatalf("test failed, stopping further iterations")
-			return
+			break
done
tftr!
Epic: none