
feat(metrics): add experimental otel metrics collector #2988

Merged
merged 2 commits into main from feat/otel-prometheus-replacement on May 17, 2023

Conversation

@swiatekm-sumo (Contributor) commented Apr 13, 2023

Experimental otel metrics collector to replace Prometheus. This is done by using the opentelemetry operator with the target allocator and Prometheus CR options enabled. This lets the target allocator take on the role of the Prometheus Operator: tracking Service and Pod Monitors and translating them into raw Prometheus configuration. The target allocator also allocates the scrape targets across the otel Pods, so this setup is scalable.

This collector replaces the metrics metadata StatefulSet - it's essentially the same pipeline, but with a different receiver.
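To illustrate, here is a rough sketch of the kind of OpenTelemetryCollector resource the chart renders (the name, the exporter, and the config body are illustrative placeholders, not the exact template output):

apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: metrics-collector            # illustrative name
spec:
  mode: statefulset
  replicas: 3
  targetAllocator:
    enabled: true
    prometheusCR:
      enabled: true                  # watch Service/Pod Monitors, like Prometheus Operator does
  config: |
    receivers:
      prometheus:
        config:
          scrape_configs: []         # scrape targets are assigned at runtime by the target allocator
    exporters:
      logging: {}                    # placeholder; the chart actually sends metrics on to the metadata pipeline
    service:
      pipelines:
        metrics:
          receivers: [prometheus]
          exporters: [logging]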

Intentionally missing in this PR:

  • Node metric aggregations via recording rules (we should add tests for these)
  • Filtering: there are no remote writes, and we export all collected metrics
  • Partial histogram timeseries are not yet supported, due to a difference in behaviour between telegrafreceiver and prometheusreceiver
  • Autoscaling
  • Various QoL options present for metadata

TODO:

  • more comments
  • some labels are missing
  • template tests

Checklist

  • Changelog updated or skip changelog label added
  • Template tests added for new features
  • Integration tests added or modified for major features

@swiatekm-sumo swiatekm-sumo changed the title from "Feat/otel prometheus replacement" to "feat(metrics): add experimental otel metrics collector" Apr 13, 2023
@swiatekm-sumo swiatekm-sumo force-pushed the feat/otel-prometheus-replacement branch from 5e9c4cf to 6c69ca2 on April 13, 2023 17:45
@github-actions github-actions bot added the documentation label Apr 13, 2023
@swiatekm-sumo swiatekm-sumo force-pushed the feat/otel-prometheus-replacement branch 5 times, most recently from 721b864 to 66798b4 on April 15, 2023 13:21
@swiatekm-sumo swiatekm-sumo force-pushed the feat/otel-prometheus-replacement branch 5 times, most recently from 60b9c23 to 0b835f2 on April 26, 2023 11:34
@swiatekm-sumo swiatekm-sumo marked this pull request as ready for review April 26, 2023 12:35
@swiatekm-sumo swiatekm-sumo requested a review from a team as a code owner April 26, 2023 12:35
@swiatekm-sumo swiatekm-sumo changed the base branch from main to chore/upgrade-operator-chart April 26, 2023 12:36
Base automatically changed from chore/upgrade-operator-chart to main April 26, 2023 12:52
deploy/helm/sumologic/Chart.yaml (outdated review thread, resolved)
Comment on lines 27 to +32
receivers:
{{- if not .Values.sumologic.metrics.collector.otelcol.enabled }}
- telegraf
{{- else }}
- prometheus
{{- end }}
Collaborator:

Are we going to enable users to add other receivers soon?

Contributor Author:

Do we have a use case for that? It's assumed right now that everything exposes Prometheus metrics either directly or via a sidecar. Either way, I wouldn't want to add that in this PR.

Collaborator:

Yes, I know it requires some more work for now, but I was just wondering what the future of this collector is going to be.

@swiatekm-sumo swiatekm-sumo force-pushed the feat/otel-prometheus-replacement branch 5 times, most recently from 12ea9ba to b6c364a on April 27, 2023 15:16
prometheus:
enabled: false
prometheusOperator:
enabled: false
Contributor:

Could both the Prometheus Operator and the OpenTelemetry Operator be enabled by the user?

Contributor Author:

Yes, they are. Unless I'm misunderstanding the question.

@@ -405,6 +405,20 @@ sumologic:
## Defines metrics metadata enrichment provider - `otelcol` or `fluentd`. `otelcol` is the default and is recommended. `fluentd` is deprecated.
Contributor:

Shouldn't we change this comment to say Prometheus is the default for metrics collection?

Contributor Author:

No, because it's not the default yet.

@@ -0,0 +1,57 @@
{{- if and (eq (include "metrics.otelcol.enabled" .) "true") .Values.sumologic.metrics.collector.otelcol.enabled }}
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
Contributor:

opentelemetry CR! I guess we included the opentelemetry operator in the chart in an earlier PR

Contributor Author:

It's been included for a while actually, though it's disabled by default. It's the only way to do auto-instrumentation.

Contributor:

Could you add a warning informing the user that the OpenTelemetry Operator must be enabled to use the otelcol metrics collector? Something similar to the warnings in https://github.com/SumoLogic/sumologic-kubernetes-collection/blob/main/deploy/helm/sumologic/templates/NOTES.txt

Contributor Author:

I'd rather not add anything user-facing here yet. This is clearly marked as experimental, and it's currently more for our internal testing than anything else. And there is a comment in values.yaml saying that opentelemetry-operator needs to be enabled.
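For reference, opting into this for internal testing would look roughly like the following values override (a sketch; the opentelemetry-operator subchart key is assumed from the usual subchart-enable convention):

opentelemetry-operator:
  enabled: true                  # the operator must be enabled to reconcile the collector CR

sumologic:
  metrics:
    collector:
      otelcol:
        enabled: true            # opt in to the experimental otel metrics collector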

requests:
storage: {{ .Values.metadata.persistence.size }}
{{- end }}
config: |
Contributor:

From the SIG meeting, I guess the way this configuration is specified could change...

Contributor Author:

Rather than change, there are probably going to be other ways to define it.

# regex: (id|name)
relabel_configs: *relabel_configs # partially copied from what operator generates
target_allocator:
endpoint: http://{{ template "sumologic.metadata.name.metrics.targetallocator.name" . }}
Contributor:

So does the collector use this endpoint to discover the targets to be scraped?

Does this get converted to http_sd_configs by the operator under the hood...

Contributor Author:

This is actually a feature of the prometheus receiver, and it doesn't use http_sd_configs as far as I know. The operator actually gives us the target allocator Service.
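For context, the receiver-side wiring looks roughly like this sketch (based on the template above; the endpoint hostname is illustrative and stands in for whatever Service the operator creates for the target allocator):

receivers:
  prometheus:
    config:
      scrape_configs: []                         # filled in with targets assigned by the allocator
    target_allocator:
      endpoint: http://metrics-targetallocator   # target allocator Service (name illustrative)
      interval: 30s                              # how often to poll for target assignments
      collector_id: ${POD_NAME}                  # identifies this replica to the allocator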

mode: statefulset
replicas: 3
serviceAccount: {{ template "sumologic.metadata.name.metrics.collector.serviceaccount" . }}
targetAllocator:
Contributor:

Does the operator create a target allocator deployment for us?

Contributor Author:

Yes.

targetAllocator:
serviceAccount: {{ template "sumologic.metadata.name.metrics.targetallocator.serviceaccount" . }}
enabled: true
prometheusCR:
Contributor:

What does enabling the Prometheus CR under the target allocator do?

Contributor Author:

It makes the target allocator watch for Service and Pod Monitors and turn them into Prometheus scrape target configuration. Without this option, it only acts on the discovery config defined explicitly in the prometheus receiver config.
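For reference, a minimal ServiceMonitor of the kind the target allocator picks up when prometheusCR is enabled (purely illustrative, not part of this chart):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app              # illustrative
spec:
  selector:
    matchLabels:
      app: example-app           # match the Service exposing metrics
  endpoints:
    - port: metrics              # named port on the Service to scrape
      interval: 30s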

{{- include "sumologic.labels.common" . | nindent 4 }}
spec:
mode: statefulset
replicas: 3
Contributor:

Maybe it's worth making this configurable, WDYT?

Contributor Author:

There are a lot of settings that should be made configurable before we move this to a user-facing beta. For this PR, I wanted to limit the complexity, since it's big and complex as-is. So I only exposed the configuration necessary to run the integration tests successfully.

@kkujawa-sumo (Contributor):

It can be done in another pull request, but it would be worthwhile to add a diagram and documentation showing the data flow for metrics collection with the otelcol metrics collector. Without documentation, a lot of people could have trouble understanding this, even if it is an experimental feature.

@swiatekm-sumo swiatekm-sumo force-pushed the feat/otel-prometheus-replacement branch 2 times, most recently from 71e56e2 to 042b761 on May 11, 2023 14:01
endpoint: http://{{ template "sumologic.metadata.name.metrics.targetallocator.name" . }}
interval: 30s
collector_id: ${POD_NAME}
extensions:
Contributor:

We should perhaps extract the extension definitions to another file, to prevent duplication. This can be done in another PR.

Contributor Author:

The extensions aren't the same: the new metrics collector is managed by the operator, which injects the health check extension automatically and takes care of opening the right port and adding the health checks in K8s. They're mostly the same otherwise, so there's room to improve this.

@andrzej-stencel (Contributor) left a comment:

Awesome! 👏 Let's run this 🚀

@swiatekm-sumo swiatekm-sumo force-pushed the feat/otel-prometheus-replacement branch from 042b761 to c66e812 on May 17, 2023 13:37
@swiatekm-sumo swiatekm-sumo enabled auto-merge (rebase) May 17, 2023 13:39
@swiatekm-sumo swiatekm-sumo merged commit a737107 into main May 17, 2023
57 checks passed
@swiatekm-sumo swiatekm-sumo deleted the feat/otel-prometheus-replacement branch May 17, 2023 13:45