Increase the missing log time period #8752

Merged
merged 2 commits into from May 3, 2021

deepthiskumar
Member

The WatchdogNoNewLogs alert is firing quite frequently on devnet because of low activity (see #8727). On mainnet, this alert was triggered mainly by nodes restarting, which was fixed by adding the for clause.

Changes:

  1. Increased the time period over which we look for logs from 10 minutes to an hour. (This will effectively be 1 hr 10 min because of the 12 mins in the for clause.) There was a suggestion to move this to a warning, but I increased the time period to 1 hour instead. It seemed OK to miss a few minutes of logs and take a backup only if it became critical? (See the investigation section in the runbook.)
  2. Updated the alert description
  3. Added a runbook (please take a look at that as well)
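
For readers who have not seen the rule itself, here is a rough sketch of how a one-hour lookback window combines with a for clause, assuming a Prometheus-style alerting rule. The metric name `watchdog_new_log_count`, the group name, the severity label, and the summary text are hypothetical placeholders, not the values in this PR; the actual expression is in the diff.

```yaml
groups:
  - name: watchdog-example            # hypothetical group name
    rules:
      - alert: WatchdogNoNewLogs
        # Hypothetical metric: assumed to count new logs seen by the watchdog.
        # The [1h] range is the lookback window (previously 10 minutes).
        expr: max_over_time(watchdog_new_log_count[1h]) <= 0
        # The condition must hold continuously for this long before the alert fires,
        # so the time to fire is the lookback window plus this duration.
        for: 12m
        labels:
          severity: critical          # assumption: the alert stays critical, not a warning
        annotations:
          summary: "No new logs seen in the last hour"   # placeholder text
```

The for clause is what keeps short gaps, such as a node restart, from firing the alert, while the longer window keeps quiet periods on a low-activity network from firing it.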

Tested by running the watchdog script locally

Closes #8727

@deepthiskumar deepthiskumar added the ci-build-me Add this label to trigger a circle+buildkite build for this branch label May 3, 2021
@mrmr1993 mrmr1993 merged commit e737439 into compatible May 3, 2021
@mrmr1993 mrmr1993 deleted the fix/watchdog-log-alert branch May 3, 2021 19:42
3 participants