Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix: Network Manager race condition #1376

Merged
merged 20 commits into from
Aug 30, 2023
Merged

Conversation

avilevy18
Copy link
Contributor

@avilevy18 avilevy18 commented Aug 15, 2023

Description

Health check were failing incorrectly on VM restart for some distros, this was caused by the Ops Agent main service starting up before the network service.

Related issue

b/295772265

How has this been tested?

Integration tests

Checklist:

  • Unit tests
    • Unit tests do not apply.
    • Unit tests have been added/modified and passed for this PR.
  • Integration tests
    • Integration tests do not apply.
    • Integration tests have been added/modified and passed for this PR.
  • Documentation
    • This PR introduces no user visible changes.
    • This PR introduces user visible changes and the corresponding documentation change has been made.
  • Minor version bump
    • This PR introduces no new features.
    • This PR introduces new features, and there is a separate PR to bump the minor version since the last release already.
    • This PR bumps the version.

@avilevy18 avilevy18 marked this pull request as ready for review August 25, 2023 14:41
@avilevy18 avilevy18 requested review from a team, sophieyfang and ridwanmsharif and removed request for a team and sophieyfang August 25, 2023 14:41
Copy link
Contributor

@ridwanmsharif ridwanmsharif left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there something we need to do in Windows to also avoid this race condition?

response, err := http.Get(r.url)
var response *http.Response
var err error
for i := 0; i < 5; i++ {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did we need to do this 5 times if we're using systemd to make network-online a prerequisite target?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume it's for error 5xx flakes but in that case a comment in the code is probably warranted

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a machine has multiple network interfaces, network-online indicates that only one of them has come up so far.
Adding network-online.target is a best effort not a full solution.

systemd/google-cloud-ops-agent.service Show resolved Hide resolved
@rafaelwestphal
Copy link
Contributor

This should refer to b/295772265 instead.

@avilevy18 avilevy18 added the kokoro:force-run Forces kokoro to run integration tests on a CL label Aug 28, 2023
@stackdriver-instrumentation-release stackdriver-instrumentation-release removed the kokoro:force-run Forces kokoro to run integration tests on a CL label Aug 28, 2023
internal/healthchecks/network_check.go Outdated Show resolved Hide resolved
@avilevy18 avilevy18 added the kokoro:force-run Forces kokoro to run integration tests on a CL label Aug 29, 2023
@stackdriver-instrumentation-release stackdriver-instrumentation-release removed the kokoro:force-run Forces kokoro to run integration tests on a CL label Aug 29, 2023
@avilevy18 avilevy18 added the kokoro:force-run Forces kokoro to run integration tests on a CL label Aug 29, 2023
@stackdriver-instrumentation-release stackdriver-instrumentation-release removed the kokoro:force-run Forces kokoro to run integration tests on a CL label Aug 29, 2023
@avilevy18 avilevy18 merged commit c06b141 into master Aug 30, 2023
51 of 53 checks passed
@avilevy18 avilevy18 deleted the avilevy-network-manager-fix branch August 30, 2023 14:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants