health: Create package to determine health #15

gvso · 2020-07-21T18:11:59Z

This creates a new health package that has a main IsHealthy function to check if the metrics for the candidate revision meet the configuration criteria.

ahmetb · 2020-07-21T18:13:28Z

internal/metrics/metrics.go

+// PercentileToAlignReduce takes a percentile value maps it to a AlignReduce
+// value.
+//
+// TODO: once we start supporting any percentile value, this should not be


why are we not able to support floats right away?

Since we decided to use the REST API instead of MQL first, we don't need (and can't) support floats yet. The REST API only allows specific percentiles (99, 95, 50)

makes sense. let's have validation that if values like 97 is specified it should return error instead of defaulting to some unwanted value.

Done. I'm using fmt.Errorf since pkg/errors doesn't have a constructor with formatting.

don't use fmt.Errorf. Try errors.Errorf.

pkg/health/health.go

ahmetb · 2020-07-21T21:17:58Z

pkg/health/health.go

+}
+
+// IsHealthy checks if a revision is healthy based on the metrics configuration.
+func (h *Health) IsHealthy(ctx context.Context, query metrics.Query, healthCheckOffset time.Duration, metricsChecks []config.Metric) (bool, error) {


Maybe we need to do this in another PR:

If you just return bool for fail/pass, we cannot capture details like "what metrics did we see", "why did we make this decision".

As part of our design we said we'll provide these insights back to the users. So probably we need to export something like []CheckResults that explains result of each criteria checked and if it was "met" or "not".

And then the rollout machinery can take alook at []CheckResults to decide whether to halt or roll forward|back.

gvso · 2020-07-23T15:42:25Z

pkg/health/health.go

+//
+// Otherwise, all metrics criteria are checked to determine if the revision is
+// healthy.
+func Diagnose(ctx context.Context, provider metrics.Metrics, query metrics.Query,


This does not check request count yet since it depends on #13 . I will do it in a following PR

ahmetb · 2020-07-23T22:07:24Z

pkg/health/health.go

+	MetricsType   config.MetricsCheck
+	MinOrMaxValue interface{}
+	ActualValue   interface{}
+	IsCriteriaMet bool
+	Reason        string


it seems like MinOrMaxValue (rename: Threshold or ExpectedValue) and ActualValue are what we need in this type.

IsCriteriaMet is ok to.

but the type etc can be inferred from the []config.Metric we passed in.

also just use float64

ahmetb · 2020-07-23T22:13:39Z

pkg/health/health.go

+	isHealthy := true
+	var results, failedResults []*CheckResult
+	for i, criteria := range healthCriteria {
+		result := determineResult(criteria.Type, criteria.Min, criteria.Max, metricsValues[i])


update config.Metric to have Threshold (not Min / Max) and here in this method we know if it should be used as ≤ or ≥ based on the .Type

Done at #20

pkg/health/health.go

ahmetb · 2020-07-23T22:21:03Z

pkg/health/health.go

+		case config.LatencyMetricsCheck:
+			value, err = latency(ctx, provider, query, offset, criteria.Percentile)
+			break
+		case config.ErrorRateMetricsCheck:
+			value, err = errorRate(ctx, provider, query, offset)


This will be done in a following PR after we integrate this package with the rollout

ahmetb · 2020-07-23T22:22:21Z

pkg/health/health.go

+	FailedCheckResults []*CheckResult
+	IsHealthy          bool


ideally this structure should also represent that no rollout operation should be done (halt/noop) based on minrequests.

similar to IsHealthy but different.

this is probably an enum DiagnosisResult { UNKNOWN=0 , INCONCLUSIVE, HEALTHY= , UNHEALTHY }

ahmetb · 2020-07-23T22:23:01Z

pkg/health/health.go

+}
+
+// errorRate returns the percentage of errors during the given offset.
+func errorRate(ctx context.Context, provider metrics.Metrics, query metrics.Query, offset time.Duration) (float64, error) {


errorRatePercent

ahmetb · 2020-07-24T22:31:07Z

pkg/health/health.go

+
+// Possible diagnosis results.
+const (
+	Unknown      DiagnosisResult = 0


use iota syntax.

Nice! I didn't know about this.

ahmetb · 2020-07-24T22:31:31Z

pkg/health/health.go

+)
+
+// DiagnosisResult is a possible result after a diagnosis.
+type DiagnosisResult int32


why int32? can't this just be int

ahmetb · 2020-07-24T22:32:26Z

pkg/health/health.go

+// Diagnosis is the information about the health of the revision.
+type Diagnosis struct {
+	OverallResult DiagnosisResult
+	CheckResults  []*CheckResult


why is this pointer? use pointers very sparingly

ahmetb · 2020-07-24T22:33:36Z

pkg/health/health.go

+		return nil, errors.Wrap(err, "could not collect metrics")
+	}
+
+	overallResult := Healthy


this is potentially a problem if no healthCriteria is specified (somehow) and you return Healthy. that's why we define Inconclusive and Unknown for.

if you say var overallResult DiagnosisResult it will default to Unknown.

to be clear my concern is: It's not defensive enough for a critical task such as rollout. user is much better off if we don't roll out or rollback, rather than blindly rolling forward due to a bug in our tool.

I need you to approach to the problem with this mindset.

Thanks! My initial thought was that Inconclusive is the value returned when there are no enough requests, but that's still not implemented.

I was not sure if the best approach to empty criteria was roll forward (maybe the user didn't want health checks) or do nothing (Unknown/Inconclusive).

I was not sure if the best approach to empty criteria was roll forward

Right, it's not a good approach.

Inconclusive or Unknown is better. But best might be returning error if no criteria is specified (terminal error).

Let's address this at #30

ahmetb · 2020-07-24T22:35:35Z

pkg/health/health.go

+			if criteria.Type == config.LatencyMetricsCheck {
+				logger = logger.WithField("percentile", criteria.Percentile)


just log it always, it'll be a percentile=0 field that doesn't mean anything. it's cleaner that way (+log stms align in human-readable output)

ahmetb · 2020-07-24T22:37:24Z

pkg/health/health.go

+		result := determineResult(criteria.Type, criteria.Threshold, metricsValues[i])
+		results = append(results, result)
+
+		if !result.IsCriteriaMet {


I probably explained "avoid if-nesting" if you can more than a handful times now. :)

What if you said:

for healthCriteria { if result.IsCriteriaMet { continue } // things you do if unhealthy // and you don't need to indent it }

any time you write a lengthy if { ... } block I need you to keep this in mind: can you move the block outside the if somehow or not

My bad. This was a simple block before the last change. I will take this into account when making changes too

ahmetb · 2020-07-24T22:38:44Z

pkg/health/health.go

+			if criteria.Type == config.LatencyMetricsCheck {
+				logger = logger.WithField("percentile", criteria.Percentile)
+			}
+			logger.Debug("criteria was not met")


why don't you initialize logger outside the if and also log criteria met?

ahmetb · 2020-07-24T22:40:23Z

cc: @grayside this is probably the most important PR of the project. If you have time, PTAL.

In this PR we create some core machinery to look at a set of metrics thresholds and look at Stackdriver Monitoring API and make a decision (roll forward/back, noop).

grayside · 2020-07-24T23:23:32Z

pkg/health/health.go

+	// to change to a switch statement when criteria with a minimum threshold is
+	// added.
+	if actualValue <= threshold {
+		result.IsCriteriaMet = true


Given the CheckResult values are not immutable and the calculation is a simple comparison, I think this would be better as a method rather than a property.

@grayside so it's actually hard to look at CheckResult object and tell if it's a positive result or negative.

e.g. for latency metrics, threshold=100 actual=200 → requirement is failed
however, for "min request" metrics threshold=100 actual=200 → requirement is met

the method that generates this CheckResult decides if those numbers mean something good or bad.

in case of "min request" metrics is not met → we simply do noop (wait more requests to happen)

however in case of "latency" metrics not met → means "rollback".

Ah, that makes sense, and reading again I see the comment about the forthcoming switch statement. That seems like it will be a bit complex. I can't help but go to the idea that there should be an interface such as the following (reusing the same verbs being discussed above, but more locally this time):

type Metric interface { Assess(context.Context, metrics.Metrics, time.Duration, []config.Metrics[]) (*CheckResult, error) Diagnose() bool }

Then the individual Metrics can be implemented with a bit more isolation and testability.
The current structure seems like it will work okay for 2-3 metrics but start to become less manageable thereafter.

Thanks @grayside! I think I understand your concern, but I am a little bit confused about the expected interface for the caller of this interface.

The idea is that you get a CheckResult per metric criteria since the CheckResult contains the expected threshold, the actual value, and whether the criteria is met based on those previous two values.

Do you want to pass a single config.Metric and get a single CheckResult?

for criteria := range healthCriteria { results = append(results, health.Assess(ctx, metricsProvider, ...)) // ignore error }

Do you want to pass an array of metrics criteria an get an array of CheckResult?

Something else?

Also, I don't see how this would prevent the switch when determining results since comparing actualValue <= threshold or threshold <= actualValue depends on the health criteria type (unless we pass a comparison function, which sounds like an overkill here).

I'm a bit +1 on this being an overkill right now.
Realistically we'll only have a≤b or b≤a comparisons as far as metrics go.
So creating a LatencyMetric or RequestCountMetric types probably are overkill.

Though I agree, Adam’s approach long-term is definitely appealing. I'd say no need to refactor this right now.

grayside · 2020-07-24T23:30:48Z

pkg/health/health.go

+//
+// Otherwise, all metrics criteria are checked to determine if the revision is
+// healthy.
+func Diagnose(ctx context.Context, provider metrics.Metrics, query metrics.Query,


I don't see where this is called, so it's hard to tell, but this list of parameters feels off to me. My blind expectation in looking at this:

Not clear to me why there's a provider and a query here

The offset and minRequests seems awkward next to parameters such as a config and query. Why are they distinct properties?

This looks like we're exposing some implementation details on the primary method of a public package.

I’ll try to explain the context here:

provider is a metrics backend that lets you query certain type of metrics. However, we specify "filters" such as "service_name=X,revision=Y,project=Z", and those (under the covers) might be implemented differently.

though I agree, actually we could make that as part of initialization process to the MetricsProvider (and initialize a different provider for each svc/revision/project/...)

offset is for being able to specify "metrics for last X minutes" (for example Stackdriver actually doesn't support last X minutes, rather, we need to compute startDate/endDate inside its metric provider impl).

minRequests is something we will (over time) transition to an actual config.Metric criteria (just like latency) so ideally this will go away from this method signature.

grayside · 2020-07-24T23:50:08Z

pkg/health/health.go

+	}
+
+	return &Diagnosis{
+		OverallResult: overallResult,


While I appreciate the efficiency of the loop above, it's doing two things: collecting metrics and evaluating the aggregation of those metrics.

I prefer to separate out the data collection from the data analysis, and minimize the amount of property passing and storing of values derived from mutable data.

In that case, instead of a control function such as Diagnose, you might follow the metaphor deeper and call an Assess() method which collects all the stats (with potential future enhancements such as concurrency), then from the resulting object produce the diagnosis at will.

// Assess has all the external calls status := health.Assess(ctx, provider, query...) // Diagnose provides a directive on what to do based on the signals. status.Diagnose()

CollectMetrics() goes away in this case.

I like this approach, what about doing this:

results := health.Assess(ctx, provider, query, offset, healthCriteria) // returns []health.CheckResult diagnosis := health.Diagnose(healthCriteria, results)

The above approach works too. It just means an extra struct containing []CheckResult and []config.Metric

Yeah, SGTM. Please capture this in an issue and let's do it another PR.

Keeping track of this at #30

grayside · 2020-07-24T23:55:24Z

pkg/health/health.go

+// metrics criteria.
+func CollectMetrics(ctx context.Context, provider metrics.Metrics, query metrics.Query, offset time.Duration, healthCriteria []config.Metric) ([]float64, error) {
+	logger := util.LoggerFromContext(ctx)
+	logger.Debug("start collecting metrics")


TODO: OpenTelemetry instrumentation?

I don't know how OT would play into here. Are you talking about "querying OT" or "using OT to trace the operator tool itself"?

We would look metrics storages (Stackdriver API, Prometheus) and those don't know anything about OT when it comes to querying.

pkg/health/health.go

grayside · 2020-07-25T00:03:23Z

pkg/health/health_test.go

+		return 0.01, nil
+	}
+
+	tests := []struct {


It might be good to test metrics and thresholds set to 0. Even if that is working fine now, that seems likely to raise unexpected behavior in any future refactoring.

Keeping track at #31

ahmetb · 2020-07-27T17:55:41Z

Let’s address the quick actionable feedback and open issues for outstanding long-term issues and send follow-up PRs. This CL is getting pretty beefy.

Signed-off-by: Getulio Valentin Sánchez <gvso@google.com>

This creates a new health package that has a main IsHealthy function to check if the metrics for the candidate revision meet the configuration criteria. Signed-off-by: Getulio Valentin Sánchez <valentin2507@gmail.com>

Signed-off-by: Getulio Valentin Sánchez <valentin2507@gmail.com>

Signed-off-by: Getulio Valentin Sánchez <gvso@google.com>

gvso · 2020-07-27T18:16:01Z

I have addressed the quick actionable feedback, and I am keeping track of the other issues at #30 and #31