Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix: Network Manager race condition #1376

Merged
merged 20 commits into from
Aug 30, 2023
Merged
Show file tree
Hide file tree
Changes from 17 commits
Commits
Show all changes
20 commits
Select commit Hold shift + click to select a range
86f0d75
Create restart VM test
avilevy18 Aug 15, 2023
9782b46
Make sure the ops agent runs after the `NetworkManager.service` starts
avilevy18 Aug 16, 2023
e526c37
Merge branch 'master' into avilevy-network-manager-fix
avilevy18 Aug 16, 2023
a6434a9
Make sure the ops agent runs after the `wicked.service` starts
avilevy18 Aug 16, 2023
6558d0e
Merge remote-tracking branch 'origin/avilevy-network-manager-fix' int…
avilevy18 Aug 16, 2023
c22b92a
Refactor services to use `network-online`
avilevy18 Aug 17, 2023
657dbee
Merge branch 'master' into avilevy-network-manager-fix
avilevy18 Aug 24, 2023
e54778c
Do an initial test of healthchecks before VM restart
avilevy18 Aug 24, 2023
e648ca5
Merge remote-tracking branch 'origin/avilevy-network-manager-fix' int…
avilevy18 Aug 24, 2023
0a0bec9
Added retries for network healthchecks
avilevy18 Aug 24, 2023
f17be31
Testing fix
avilevy18 Aug 24, 2023
fae85ec
Query service output from start timestamp.
franciscovalentecastro Aug 24, 2023
907ff36
Check for main service start timestamp.
franciscovalentecastro Aug 24, 2023
5a7db21
Merge branch 'master' into avilevy-network-manager-fix
avilevy18 Aug 24, 2023
cfedc5f
Query 30s earlier considering service start time.
franciscovalentecastro Aug 24, 2023
e3dd018
Addressed comments
avilevy18 Aug 25, 2023
51fad8f
Merge branch 'master' into avilevy-network-manager-fix
avilevy18 Aug 25, 2023
60da58a
Merge branch 'master' into avilevy-network-manager-fix
avilevy18 Aug 28, 2023
724121b
Merge branch 'master' into avilevy-network-manager-fix
avilevy18 Aug 28, 2023
ea8691f
Refactored to use exponential backoff
avilevy18 Aug 29, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
40 changes: 38 additions & 2 deletions integration_test/ops_agent_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -3953,8 +3953,9 @@ func checkExpectedHealthCheckResult(t *testing.T, output string, name string, ex
func getRecentServiceOutputForPlatform(platform string) string {
if gce.IsWindows(platform) {
cmd := strings.Join([]string{
"$Past = (Get-Date) - (New-TimeSpan -Minute 1)",
"Get-WinEvent -MaxEvents 10 -FilterHashtable @{ Logname='Application'; ProviderName='google-cloud-ops-agent'; StartTime=$Past } | select -ExpandProperty Message",
"$ServiceStart = (Get-EventLog -LogName 'System' -Source 'Service Control Manager' -EntryType 'Information' -Message '*Google Cloud Ops Agent service entered the running state*' -Newest 1).TimeGenerated",
"$QueryStart = $ServiceStart - (New-TimeSpan -Seconds 30)",
"Get-WinEvent -MaxEvents 10 -FilterHashtable @{ Logname='Application'; ProviderName='google-cloud-ops-agent'; StartTime=$QueryStart } | select -ExpandProperty Message",
}, ";")
return cmd
}
Expand Down Expand Up @@ -4233,6 +4234,41 @@ func TestNoNvmlOtelReceiverWithoutGpu(t *testing.T) {
})
}

func TestRestartVM(t *testing.T) {
t.Parallel()
gce.RunForEachPlatform(t, func(t *testing.T, platform string) {
t.Parallel()

ctx, logger, vm := agents.CommonSetup(t, platform)
if err := agents.SetupOpsAgent(ctx, logger.ToMainLog(), vm, ""); err != nil {
t.Fatal(err)
}

cmdOut, err := gce.RunRemotely(ctx, logger.ToMainLog(), vm, "", getRecentServiceOutputForPlatform(vm.Platform))
if err != nil {
t.Fatal(err)
}

// Ensure sure all healthchecks pass before the restart
checkExpectedHealthCheckResult(t, cmdOut.Stdout, "Network", "PASS", "")
checkExpectedHealthCheckResult(t, cmdOut.Stdout, "Ports", "PASS", "")
checkExpectedHealthCheckResult(t, cmdOut.Stdout, "API", "PASS", "")

if err := gce.RestartInstance(ctx, logger, vm); err != nil {
t.Fatal(err)
}

cmdOut, err = gce.RunRemotely(ctx, logger.ToMainLog(), vm, "", getRecentServiceOutputForPlatform(vm.Platform))
if err != nil {
t.Fatal(err)
}

checkExpectedHealthCheckResult(t, cmdOut.Stdout, "Network", "PASS", "")
avilevy18 marked this conversation as resolved.
Show resolved Hide resolved
checkExpectedHealthCheckResult(t, cmdOut.Stdout, "Ports", "PASS", "")
checkExpectedHealthCheckResult(t, cmdOut.Stdout, "API", "PASS", "")
})
}

func TestMain(m *testing.M) {
code := m.Run()
gce.CleanupKeysOrDie()
Expand Down
11 changes: 10 additions & 1 deletion internal/healthchecks/network_check.go
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,7 @@ package healthchecks
import (
"errors"
"net/http"
"time"

"github.com/GoogleCloudPlatform/ops-agent/internal/logs"
)
Expand All @@ -29,7 +30,15 @@ type networkRequest struct {
}

func (r networkRequest) SendRequest(logger logs.StructuredLogger) error {
response, err := http.Get(r.url)
var response *http.Response
var err error
for i := 0; i < 5; i++ {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did we need to do this 5 times if we're using systemd to make network-online a prerequisite target?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume it's for error 5xx flakes but in that case a comment in the code is probably warranted

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a machine has multiple network interfaces, network-online indicates that only one of them has come up so far.
Adding network-online.target is a best effort not a full solution.

response, err = http.Get(r.url)
if err == nil && response.StatusCode == http.StatusOK {
break
}
time.Sleep(1 * time.Second)
avilevy18 marked this conversation as resolved.
Show resolved Hide resolved
}
if err != nil {
if isTimeoutError(err) || isConnectionRefusedError(err) {
return r.healthCheckError
Expand Down
3 changes: 2 additions & 1 deletion systemd/google-cloud-ops-agent.service
Original file line number Diff line number Diff line change
Expand Up @@ -14,7 +14,8 @@

[Unit]
Description=Google Cloud Ops Agent
Wants=google-cloud-ops-agent-fluent-bit.service google-cloud-ops-agent-opentelemetry-collector.service google-cloud-ops-agent-diagnostics.service
Wants=google-cloud-ops-agent-fluent-bit.service google-cloud-ops-agent-opentelemetry-collector.service google-cloud-ops-agent-diagnostics.service network-online.target
After=network-online.target
avilevy18 marked this conversation as resolved.
Show resolved Hide resolved

[Service]
Type=oneshot
Expand Down
Loading