Skip to content
This repository has been archived by the owner on Jan 11, 2023. It is now read-only.

health: Create package to determine health #15

Merged
merged 12 commits into from
Jul 27, 2020
Merged

health: Create package to determine health #15

merged 12 commits into from
Jul 27, 2020

Conversation

gvso
Copy link
Contributor

@gvso gvso commented Jul 21, 2020

This creates a new health package that has a main IsHealthy function to check if the metrics for the candidate revision meet the configuration criteria.

// PercentileToAlignReduce takes a percentile value maps it to a AlignReduce
// value.
//
// TODO: once we start supporting any percentile value, this should not be
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are we not able to support floats right away?

Copy link
Contributor Author

@gvso gvso Jul 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we decided to use the REST API instead of MQL first, we don't need (and can't) support floats yet. The REST API only allows specific percentiles (99, 95, 50)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense. let's have validation that if values like 97 is specified it should return error instead of defaulting to some unwanted value.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. I'm using fmt.Errorf since pkg/errors doesn't have a constructor with formatting.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't use fmt.Errorf. Try errors.Errorf.

@gvso gvso requested a review from ahmetb July 21, 2020 19:53
pkg/health/health.go Outdated Show resolved Hide resolved
pkg/health/health.go Outdated Show resolved Hide resolved
}

// IsHealthy checks if a revision is healthy based on the metrics configuration.
func (h *Health) IsHealthy(ctx context.Context, query metrics.Query, healthCheckOffset time.Duration, metricsChecks []config.Metric) (bool, error) {
Copy link
Contributor

@ahmetb ahmetb Jul 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we need to do this in another PR:

If you just return bool for fail/pass, we cannot capture details like "what metrics did we see", "why did we make this decision".

As part of our design we said we'll provide these insights back to the users. So probably we need to export something like []CheckResults that explains result of each criteria checked and if it was "met" or "not".

And then the rollout machinery can take alook at []CheckResults to decide whether to halt or roll forward|back.

//
// Otherwise, all metrics criteria are checked to determine if the revision is
// healthy.
func Diagnose(ctx context.Context, provider metrics.Metrics, query metrics.Query,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not check request count yet since it depends on #13 . I will do it in a following PR

Comment on lines 23 to 27
MetricsType config.MetricsCheck
MinOrMaxValue interface{}
ActualValue interface{}
IsCriteriaMet bool
Reason string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems like MinOrMaxValue (rename: Threshold or ExpectedValue) and ActualValue are what we need in this type.

IsCriteriaMet is ok to.

but the type etc can be inferred from the []config.Metric we passed in.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also just use float64

isHealthy := true
var results, failedResults []*CheckResult
for i, criteria := range healthCriteria {
result := determineResult(criteria.Type, criteria.Min, criteria.Max, metricsValues[i])
Copy link
Contributor

@ahmetb ahmetb Jul 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update config.Metric to have Threshold (not Min / Max) and here in this method we know if it should be used as ≤ or ≥ based on the .Type

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done at #20

pkg/health/health.go Outdated Show resolved Hide resolved
pkg/health/health.go Outdated Show resolved Hide resolved
Comment on lines 75 to 79
case config.LatencyMetricsCheck:
value, err = latency(ctx, provider, query, offset, criteria.Percentile)
break
case config.ErrorRateMetricsCheck:
value, err = errorRate(ctx, provider, query, offset)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add logs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be done in a following PR after we integrate this package with the rollout

Comment on lines 17 to 18
FailedCheckResults []*CheckResult
IsHealthy bool
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ideally this structure should also represent that no rollout operation should be done (halt/noop) based on minrequests.

similar to IsHealthy but different.

this is probably an enum DiagnosisResult { UNKNOWN=0 , INCONCLUSIVE, HEALTHY= , UNHEALTHY }

}

// errorRate returns the percentage of errors during the given offset.
func errorRate(ctx context.Context, provider metrics.Metrics, query metrics.Query, offset time.Duration) (float64, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

errorRatePercent

@gvso gvso requested a review from ahmetb July 24, 2020 17:00

// Possible diagnosis results.
const (
Unknown DiagnosisResult = 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use iota syntax.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! I didn't know about this.

)

// DiagnosisResult is a possible result after a diagnosis.
type DiagnosisResult int32
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why int32? can't this just be int

// Diagnosis is the information about the health of the revision.
type Diagnosis struct {
OverallResult DiagnosisResult
CheckResults []*CheckResult
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this pointer? use pointers very sparingly

return nil, errors.Wrap(err, "could not collect metrics")
}

overallResult := Healthy
Copy link
Contributor

@ahmetb ahmetb Jul 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is potentially a problem if no healthCriteria is specified (somehow) and you return Healthy. that's why we define Inconclusive and Unknown for.

if you say var overallResult DiagnosisResult it will default to Unknown.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to be clear my concern is: It's not defensive enough for a critical task such as rollout. user is much better off if we don't roll out or rollback, rather than blindly rolling forward due to a bug in our tool.

I need you to approach to the problem with this mindset.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! My initial thought was that Inconclusive is the value returned when there are no enough requests, but that's still not implemented.

I was not sure if the best approach to empty criteria was roll forward (maybe the user didn't want health checks) or do nothing (Unknown/Inconclusive).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was not sure if the best approach to empty criteria was roll forward

Right, it's not a good approach.

Inconclusive or Unknown is better. But best might be returning error if no criteria is specified (terminal error).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's address this at #30

Comment on lines 69 to 70
if criteria.Type == config.LatencyMetricsCheck {
logger = logger.WithField("percentile", criteria.Percentile)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just log it always, it'll be a percentile=0 field that doesn't mean anything. it's cleaner that way (+log stms align in human-readable output)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed!

result := determineResult(criteria.Type, criteria.Threshold, metricsValues[i])
results = append(results, result)

if !result.IsCriteriaMet {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I probably explained "avoid if-nesting" if you can more than a handful times now. :)

What if you said:

for healthCriteria {

  if result.IsCriteriaMet {
    continue
  }

  // things you do if unhealthy
  // and you don't need to indent it
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any time you write a lengthy if { ... } block I need you to keep this in mind: can you move the block outside the if somehow or not

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad. This was a simple block before the last change. I will take this into account when making changes too

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed!

if criteria.Type == config.LatencyMetricsCheck {
logger = logger.WithField("percentile", criteria.Percentile)
}
logger.Debug("criteria was not met")
Copy link
Contributor

@ahmetb ahmetb Jul 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why don't you initialize logger outside the if and also log criteria met?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@ahmetb
Copy link
Contributor

ahmetb commented Jul 24, 2020

cc: @grayside this is probably the most important PR of the project. If you have time, PTAL.

In this PR we create some core machinery to look at a set of metrics thresholds and look at Stackdriver Monitoring API and make a decision (roll forward/back, noop).

// to change to a switch statement when criteria with a minimum threshold is
// added.
if actualValue <= threshold {
result.IsCriteriaMet = true

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the CheckResult values are not immutable and the calculation is a simple comparison, I think this would be better as a method rather than a property.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@grayside so it's actually hard to look at CheckResult object and tell if it's a positive result or negative.

e.g. for latency metrics, threshold=100 actual=200 → requirement is failed
however, for "min request" metrics threshold=100 actual=200 → requirement is met

the method that generates this CheckResult decides if those numbers mean something good or bad.

in case of "min request" metrics is not met → we simply do noop (wait more requests to happen)

however in case of "latency" metrics not met → means "rollback".

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, that makes sense, and reading again I see the comment about the forthcoming switch statement. That seems like it will be a bit complex. I can't help but go to the idea that there should be an interface such as the following (reusing the same verbs being discussed above, but more locally this time):

type Metric interface {
  Assess(context.Context, metrics.Metrics, time.Duration, []config.Metrics[]) (*CheckResult, error)
  Diagnose() bool
}

Then the individual Metrics can be implemented with a bit more isolation and testability.
The current structure seems like it will work okay for 2-3 metrics but start to become less manageable thereafter.

Copy link
Contributor Author

@gvso gvso Jul 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @grayside! I think I understand your concern, but I am a little bit confused about the expected interface for the caller of this interface.

The idea is that you get a CheckResult per metric criteria since the CheckResult contains the expected threshold, the actual value, and whether the criteria is met based on those previous two values.

  1. Do you want to pass a single config.Metric and get a single CheckResult?
for criteria := range healthCriteria {
    results = append(results, health.Assess(ctx, metricsProvider, ...)) // ignore error
}
  1. Do you want to pass an array of metrics criteria an get an array of CheckResult?

  2. Something else?

Also, I don't see how this would prevent the switch when determining results since comparing actualValue <= threshold or threshold <= actualValue depends on the health criteria type (unless we pass a comparison function, which sounds like an overkill here).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit +1 on this being an overkill right now.
Realistically we'll only have a≤b or b≤a comparisons as far as metrics go.
So creating a LatencyMetric or RequestCountMetric types probably are overkill.

Though I agree, Adam’s approach long-term is definitely appealing. I'd say no need to refactor this right now.

//
// Otherwise, all metrics criteria are checked to determine if the revision is
// healthy.
func Diagnose(ctx context.Context, provider metrics.Metrics, query metrics.Query,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see where this is called, so it's hard to tell, but this list of parameters feels off to me. My blind expectation in looking at this:

  1. Not clear to me why there's a provider and a query here
  2. The offset and minRequests seems awkward next to parameters such as a config and query. Why are they distinct properties?

This looks like we're exposing some implementation details on the primary method of a public package.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’ll try to explain the context here:

  1. provider is a metrics backend that lets you query certain type of metrics. However, we specify "filters" such as "service_name=X,revision=Y,project=Z", and those (under the covers) might be implemented differently.

    though I agree, actually we could make that as part of initialization process to the MetricsProvider (and initialize a different provider for each svc/revision/project/...)

  2. offset is for being able to specify "metrics for last X minutes" (for example Stackdriver actually doesn't support last X minutes, rather, we need to compute startDate/endDate inside its metric provider impl).

  3. minRequests is something we will (over time) transition to an actual config.Metric criteria (just like latency) so ideally this will go away from this method signature.

}

return &Diagnosis{
OverallResult: overallResult,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I appreciate the efficiency of the loop above, it's doing two things: collecting metrics and evaluating the aggregation of those metrics.

I prefer to separate out the data collection from the data analysis, and minimize the amount of property passing and storing of values derived from mutable data.

In that case, instead of a control function such as Diagnose, you might follow the metaphor deeper and call an Assess() method which collects all the stats (with potential future enhancements such as concurrency), then from the resulting object produce the diagnosis at will.

// Assess has all the external calls
status := health.Assess(ctx, provider, query...)
// Diagnose provides a directive on what to do based on the signals.
status.Diagnose()

CollectMetrics() goes away in this case.

Copy link
Contributor Author

@gvso gvso Jul 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this approach, what about doing this:

results := health.Assess(ctx, provider, query, offset, healthCriteria) // returns []health.CheckResult
diagnosis := health.Diagnose(healthCriteria, results)

The above approach works too. It just means an extra struct containing []CheckResult and []config.Metric

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, SGTM. Please capture this in an issue and let's do it another PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keeping track of this at #30

// metrics criteria.
func CollectMetrics(ctx context.Context, provider metrics.Metrics, query metrics.Query, offset time.Duration, healthCriteria []config.Metric) ([]float64, error) {
logger := util.LoggerFromContext(ctx)
logger.Debug("start collecting metrics")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: OpenTelemetry instrumentation?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know how OT would play into here. Are you talking about "querying OT" or "using OT to trace the operator tool itself"?

We would look metrics storages (Stackdriver API, Prometheus) and those don't know anything about OT when it comes to querying.

pkg/health/health.go Outdated Show resolved Hide resolved
return 0.01, nil
}

tests := []struct {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be good to test metrics and thresholds set to 0. Even if that is working fine now, that seems likely to raise unexpected behavior in any future refactoring.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keeping track at #31

@ahmetb
Copy link
Contributor

ahmetb commented Jul 27, 2020

Let’s address the quick actionable feedback and open issues for outstanding long-term issues and send follow-up PRs. This CL is getting pretty beefy.

Getulio Valentin Sánchez and others added 3 commits July 27, 2020 13:59
Signed-off-by: Getulio Valentin Sánchez <gvso@google.com>
Signed-off-by: Getulio Valentin Sánchez <gvso@google.com>
This creates a new health package that has a main IsHealthy function to check if
the metrics for the candidate revision meet the configuration criteria.

Signed-off-by: Getulio Valentin Sánchez <valentin2507@gmail.com>
gvso and others added 9 commits July 27, 2020 13:59
Signed-off-by: Getulio Valentin Sánchez <valentin2507@gmail.com>
Signed-off-by: Getulio Valentin Sánchez <valentin2507@gmail.com>
Signed-off-by: Getulio Valentin Sánchez <gvso@google.com>
Signed-off-by: Getulio Valentin Sánchez <gvso@google.com>
Signed-off-by: Getulio Valentin Sánchez <gvso@google.com>
Signed-off-by: Getulio Valentin Sánchez <gvso@google.com>
Signed-off-by: Getulio Valentin Sánchez <gvso@google.com>
Signed-off-by: Getulio Valentin Sánchez <gvso@google.com>
Signed-off-by: Getulio Valentin Sánchez <gvso@google.com>
@gvso
Copy link
Contributor Author

gvso commented Jul 27, 2020

I have addressed the quick actionable feedback, and I am keeping track of the other issues at #30 and #31

@ahmetb ahmetb merged commit 736e698 into GoogleCloudPlatform:main Jul 27, 2020
@gvso gvso deleted the health-package branch July 27, 2020 19:28
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants