Clickhouse writers don't support buffered storage #182

Open · jdnurmi opened this issue Oct 5, 2023 · 5 comments

@jdnurmi commented Oct 5, 2023

While storage extensions can be enabled, none of the ClickHouse exporters in the SigNoz distribution support using them, so every agent that emits to ClickHouse is constrained by memory, even during a ClickHouse outage.

It would be helpful if the config structures supported the {sending_queue: {storage: _extension_}} option so that the exporters could buffer to disk, allowing maintenance that doesn't immediately balloon memory usage and create back-pressure on emitters.
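
For illustration, the requested shape would look roughly like this (a minimal sketch; `clickhousetraces` is just one exporter name from the SigNoz distribution, the connection settings are omitted, and the final key names may differ):

```yaml
extensions:
  file_storage:
    directory: /var/lib/otelcol/buffer   # hypothetical path; must exist and be writable by the collector

exporters:
  clickhousetraces:                      # illustrative SigNoz exporter; connection settings omitted
    sending_queue:
      enabled: true
      queue_size: 10000
      storage: file_storage              # the requested option: back the queue with the storage extension

service:
  extensions: [file_storage]
```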

@srikanthccv (Member)

I imagine you are referring to the filestorage persistent queue, which is still very much alpha. See #176 and #181; this will be addressed as part of them.

@jdnurmi (Author) commented Oct 6, 2023

Indeed, I'm currently looking at filestorage, but really the "want" is anything that lets me build a not-in-memory buffer before shipping. The file_storage extension seems the most viable at the moment, but since I'm in AWS I'd also be perfectly fine with any of the 'unbounded' resources (SQS, S3, Kinesis, Kafka, etc.). As it stands, my agents either end up OOM'ing or creating back-pressure on my services, which then further clutters the logs with services complaining about back-pressure :)

file_storage would let me allocate multiple GBs to agent buffers without requiring me to scale up instances just to accommodate logging agents.
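
To make that concrete, a sketch of the file_storage extension pointed at a generously sized on-disk directory (path and timeout are illustrative):

```yaml
extensions:
  file_storage:
    directory: /var/otel/buffer   # e.g. a dedicated volume with several GB free for buffered data
    timeout: 10s                  # timeout for file operations against the buffer
```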

From the other end of the 'solution' space: if the agent could optionally be configured to drop data (either by age or randomly) rather than block, that would at least avoid the back-pressure problem, though it would be less ideal than building up buffers that get consumed once contention clears.
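
For comparison, the closest thing to "drop rather than block" with the standard exporter helper options today is a bounded in-memory queue with retries disabled, so data is refused once the queue fills instead of backing up (a sketch; the exporter name is illustrative and these keys may not yet apply to the SigNoz exporters):

```yaml
exporters:
  clickhousetraces:              # illustrative exporter name
    sending_queue:
      enabled: true
      num_consumers: 10
      queue_size: 5000           # once the queue is full, new batches are dropped rather than queued
    retry_on_failure:
      enabled: false             # fail fast instead of holding data in memory for retries
```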

@ankitnayan (Contributor)

This seems like a good idea and should be part of recommended OSS settings too IMO.

Will this make the otel-collectors stateful in k8s, since we would need to attach a PV to the otel-collector, or should it be ephemeral storage? The problem with ephemeral storage is that data written to file is gone after a collector restart.
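
For reference, the two options would look roughly like this on the collector pod (a sketch; volume and claim names are hypothetical):

```yaml
# Option A: ephemeral buffer - survives a ClickHouse outage, but not a pod restart
volumes:
  - name: otel-buffer
    emptyDir: {}

# Option B: persistent buffer - needs a PVC, which effectively makes the collector stateful
volumes:
  - name: otel-buffer
    persistentVolumeClaim:
      claimName: otel-collector-buffer   # hypothetical claim name
```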

@srikanthccv (Member)

I do not think we should recommend it, but if users want to use it, they should be free to do so. Please see open-telemetry/opentelemetry-collector#5902 and several other issues related to the persistent queue. Overall, if your main concern is an outage of the final backend destination, please set up a proper queue that guarantees delivery of the data without loss.
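
For illustration, one common shape for such a queue is to have agents export to Kafka and a separate gateway collector consume from it and write to ClickHouse (a sketch using the contrib kafka exporter/receiver; broker address and topic are placeholders):

```yaml
# Agent side: ship to a durable queue instead of directly to ClickHouse
exporters:
  kafka:
    brokers: ["kafka:9092"]      # placeholder broker address
    topic: otlp_spans

# Gateway side: drain the queue into ClickHouse at its own pace
receivers:
  kafka:
    brokers: ["kafka:9092"]
    topic: otlp_spans
```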

@jdnurmi (Author) commented Oct 9, 2023

In my mind, persistence would be similarly optional and controllable via the helm chart: PVs if desired, emptyDir: {} if not.

I agree it should be optional; right now we can't really enable it at all because the SigNoz exporter doesn't support it. I acknowledge that unbounded memory with perfect flushes would solve the problem, but without some sort of persistence layer before data hits ClickHouse, what happens far more often is that SigNoz, as a participant in the cluster, blows up its RAM and either OOMs or creates back-pressure that causes actual service faults or a chain of OOMs in poorly written collectors.

That's not to say it's SigNoz's fault at all; it seems to be trying to do the right thing and not lose data. But, perfect being the enemy of good and all, I would rather my installation be able to buffer to disk, because it's far more likely that either a bad spike occurred and will get worked through shortly, or I'm rolling the cluster and it will get sorted when things restart. Going from a nominal 1-2 GB of memory to 16 GB is a complete non-starter for my agents, and I will almost certainly hit cluster limits and lose data before I run out of disk.
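
For context, the usual way to keep that memory growth bounded today is the memory_limiter processor, which caps RAM but does so by refusing data, i.e. it produces exactly the back-pressure described above rather than buffering through it (a sketch with illustrative limits):

```yaml
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 1500          # roughly the nominal 1-2 GB footprint mentioned above
    spike_limit_mib: 512
```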
