Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for untyped metrics from Ops Agent #668

Merged
merged 8 commits into from Jul 14, 2023

Conversation

damemi
Copy link
Member

@damemi damemi commented Jul 10, 2023

This adds functionality to automatically parse the attribute set by the Ops Agent's prometheus receiver to indicate untyped prometheus metrics (GoogleCloudPlatform/opentelemetry-operations-collector#170). When this attribute is found in a Gauge metric, the time series will be double-exported as a Gauge and a Cumulative. The special attribute will be dropped from metric labels.

This behavior is on by default behind a featuregate gcp.untyped_double_export and is only meant for internal use until an upstream approach to untyped metrics is implemented.

@damemi damemi requested a review from a team as a code owner July 10, 2023 19:51
@dashpole
Copy link
Contributor

This behavior is on by default and is only meant for internal use until an upstream approach to untyped metrics is implemented.

Can we put this behind an alpha feature gate? Otherwise, I would considering it a breaking change when we remove the behavior

@codecov
Copy link

codecov bot commented Jul 10, 2023

Codecov Report

Merging #668 (505a1cf) into main (a19d9ea) will increase coverage by 0.42%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #668      +/-   ##
==========================================
+ Coverage   68.56%   68.99%   +0.42%     
==========================================
  Files          36       36              
  Lines        4559     4647      +88     
==========================================
+ Hits         3126     3206      +80     
- Misses       1280     1288       +8     
  Partials      153      153              
Impacted Files Coverage Δ
...er/collector/integrationtest/testcases/testcase.go 81.51% <ø> (ø)
...collector/googlemanagedprometheus/extra_metrics.go 78.72% <100.00%> (+7.09%) ⬆️
...porter/collector/googlemanagedprometheus/naming.go 100.00% <100.00%> (ø)
...tor/integrationtest/testcases/testcases_metrics.go 100.00% <100.00%> (ø)

... and 3 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@damemi
Copy link
Member Author

damemi commented Jul 10, 2023

This behavior is on by default and is only meant for internal use until an upstream approach to untyped metrics is implemented.

Can we put this behind an alpha feature gate? Otherwise, I would considering it a breaking change when we remove the behavior

We can, but the mechanism (the attribute) isn't user-facing, so I think changing it wouldn't be breaking. Of course there's nothing to stop users from adding this attribute themselves anyway and hacking it.

@dashpole
Copy link
Contributor

Should this be specific to the GMP Exporter? I'm OK with this version being in the googlecloud exporter, but I think we want it in GMP parts eventually.

exporter/collector/metrics.go Outdated Show resolved Hide resolved
Value: value,
}},
Metric: &metricpb.Metric{
Type: t,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is the GMP suffixing handled?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, it looks like GMP uses a special unknown suffix when it duplicates the metric. Maybe we should change our metric name function to check for this label

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that it has different suffixes for the gauge portion of the unknown metric (unknown and the counter portion of the unknown metric (unknown:counter).

Copy link
Member Author

@damemi damemi Jul 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'm going to refactor this, a better approach would be to copy the untyped Gauge pmetric entry into an extra Sum metric, and let the existing path after that handle everything. Then GetMetricName() can handle both.

I think using an actual Sum from pdata will also make it easier to handle @ridwanmsharif's question #668 (comment). @dashpole wdyt?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Works for me, but I'm not convinced the logic should live in the base exporter. If someone used the prometheus receiver with our base exporter, they would get duplicate timeseries errors. We need the GMP logic of adding suffixes to prevent duplicate timeseries problems.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can at least move the gate to the GMP exporter so the logic is only active there. Not sure if we could move the logic itself. Maybe through the "extra metrics" approach we did for target and scope info metrics actually

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was able to move everything using the extra metrics approach. Ptal, thanks!

ValueType: valueType,
Points: []*monitoringpb.Point{{
Interval: &monitoringpb.TimeInterval{
EndTime: timestamppb.New(point.Timestamp().AsTime()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wanted to confirm this, but not setting the start time will make this normalize the same way GMP does with these counters?

That is, when the value goes down, treat it as a reset?

Copy link
Member Author

@damemi damemi Jul 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed this, but it's still doing basically the same thing in the new code so to answer your question: yes it should. This is the normalization function:

func (s *standardNormalizer) NormalizeNumberDataPoint(point pmetric.NumberDataPoint, identifier string) (pmetric.NumberDataPoint, bool) {
(called from here on a Sum, which the updates I just pushed use for the Cumulative instead of a Gauge)

You can see this now in the fixtures, where I had to add 2 new untyped input data points just to get one output unknown cumulative. This is because the first data point gets normalized as a reset point and dropped by our exporter.

As a side note, I had to update our fixture generators to test this because they currently normalize all timestamps to be rightNow() (in order to work with testing against the actual GCM api). But I needed to bypass that for this test to get the cumulative normalization to show up.

@damemi
Copy link
Member Author

damemi commented Jul 13, 2023

Collector tests are failing from something in the "skip timestamp update" hack I put in... maybe I can't do that Edit: I wasn't passing the skip flag to the collector tests. Updating this seems to have fixed it

},
{
"metric": {
"type": "prometheus.googleapis.com/fake_untyped_metric_total/unknown:counter",
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here is the expected counter... of note:

  • _total is appended (I think this comes from the upstream prom normalization that's currently on be default)
  • 2 data points for fake_untyped_metric as a gauge, because there are 2 input points (first Sum is dropped for normalization)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this comes from the upstream prom normalization that's currently on be default

Should we update our collector dependency? I don't think that is supposed to be the case anymore.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or we can leave it how it is, and update later

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a test bump on this branch and re-ran the fixture and the _total/units suffixes are dropped in the new version. I'll open a separate bump PR after I merge this

"type": "workload.googleapis.com/fake_untyped_metric",
"labels": {
"ex_com_lemons_untyped": "13",
"prometheus_untyped_metric": "true",
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the fixture for testing the featuregate. this label would be dropped in the gmp exporter if the gate was enabled

@damemi
Copy link
Member Author

damemi commented Jul 14, 2023

Final update now that all tests are green: using 2 gauge entries to get the normalized cumulative didn't work with the actual GCM integration test, because the 2nd gauge caused a duplicate timeseries error (see failure here). Added 7830343 to disable cumulative normalization, use 1 gauge entry, and remove the "skip timestamp update" option I added earlier.

However the lack of startTimestamp on the input gauge caused it to be initialized to the 0 time by this line, which caused an invalid timeseries error from GCM (see that failure here). To fix that I pushed 505a1cf to add a startTimestamp to the input gauge and more importantly copy that startTimestamp to the new point. I think this is valid for use cases where untyped metrics might want to manually register a reset point.

Now that this is green, I'm merging. Will open another issue to check on our other cumulative tests to make sure we're not dropping points in those.

@damemi damemi merged commit 0c97ee5 into GoogleCloudPlatform:main Jul 14, 2023
25 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants