Reduce number of regions uptime-checks run in #2330

Merged
pnasrat merged 1 commit into master from uptime-check-limit-regions on Mar 10, 2023

Conversation

pnasrat
Contributor

@pnasrat pnasrat commented Mar 9, 2023

See #2320

Tested applying to 2i2c staging and works
@pnasrat pnasrat self-assigned this Mar 9, 2023
@pnasrat
Contributor Author

pnasrat commented Mar 9, 2023

Tested on 2i2c staging via

terraform apply -target 'google_monitoring_uptime_check_config.hub_simple_uptime_check["staging.2i2c.cloud"]'

Member

@consideRatio consideRatio left a comment

This LGTM!

As part of reviewing this, I tried to catch up on how these uptime checks work. Do you think this understanding is correct, @pnasrat?

  • We have "uptime checks" defined in the GCP project two-eye-two-see that make HTTP(S) requests to check whether something is up and running.
  • So far, each uptime check has in practice been a set of checks, with the service being accessed from multiple locations around the world.
  • This change makes us check service availability only from the USA.
  • This change only affects the two-eye-two-see GCP project, not the individual cloud accounts with clusters.
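
For readers following along, the region restriction described above might look roughly like the Terraform fragment below. This is only a sketch under assumptions, not the actual diff in this PR: the resource address comes from the `terraform apply -target` command earlier in the thread and `selected_regions` is an argument of the Google provider's `google_monitoring_uptime_check_config` resource, but the `hub_hostnames` variable, the display name, the timeout, and the `/hub/health` path are hypothetical placeholders.

```hcl
# Hypothetical input: the set of hub hostnames to probe (assumption, not from the PR).
variable "hub_hostnames" {
  type    = set(string)
  default = ["staging.2i2c.cloud"]
}

resource "google_monitoring_uptime_check_config" "hub_simple_uptime_check" {
  for_each = var.hub_hostnames

  project      = "two-eye-two-see"           # central GCP project named in the review
  display_name = "${each.key} uptime check"  # assumed naming
  timeout      = "30s"                       # assumed value

  # Restrict probes to the USA region group instead of all regions.
  # When this argument is omitted, checks run from every region.
  selected_regions = ["USA"]

  http_check {
    path         = "/hub/health"  # assumed health endpoint
    port         = 443
    use_ssl      = true
    validate_ssl = true
  }

  monitored_resource {
    type = "uptime_url"
    labels = {
      project_id = "two-eye-two-see"
      host       = each.key
    }
  }
}
```

As far as I understand the provider documentation, the selected regions must together cover at least three checker locations, which is presumably why a whole region group such as `USA` is used rather than a single location.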

@pnasrat
Contributor Author

pnasrat commented Mar 10, 2023

@consideRatio yes. See the docs about the central project at https://infrastructure.2i2c.org/en/latest/topic/monitoring-alerting/uptime-checks.html. The main issue #2320 contains the detailed analysis of why (we are blowing through our free quota limit), and this is an attempt to reduce costs until we can revisit the monitoring and alerting design in general.

This is split off from the change to the check period, because changing the period apparently requires a destroy followed by an apply. I'll update the docs when I change the period, and I'll add a note on region selection there once that change is out for review, after testing whether my assumption about the INVALID ARGUMENT error and the period is correct.

Longer term, I think these probes will be supplemented with detailed monitoring and alerting that runs in-cluster.

@pnasrat pnasrat merged commit 1acfd72 into master Mar 10, 2023
@pnasrat pnasrat deleted the uptime-check-limit-regions branch March 10, 2023 12:41