Clickhouse writers don't support buffered storage #182
Indeed, currently looking at file_storage, but really the "want" is anything that lets me build a not-in-memory buffer before shipping. The file_storage plugin seems the most viable at the moment, but since I'm in AWS, I'd also be perfectly fine with any of the 'unbounded' resources (SQS, S3, Kinesis, Kafka, etc). Right now my agents either end up OOM'ing or creating backpressure to my services, which then further clutters up the logs with services complaining about backpressure :) file_storage would let me allocate multiple GBs to agent buffers without requiring me to scale up instances just to handle logging agents. From the other end of the 'solution' space: if the agent could optionally be configured to drop (either by age or randomly) rather than block, that would at least avoid the back-pressure problem, but I think that would be less ideal than being able to build up buffers that get consumed after contention clears.
This seems like a good idea and should be part of the recommended OSS settings too, IMO. Will this make the otel-collectors stateful in k8s, since we'd need to attach a PV to the otel-collector, or should it be ephemeral storage? The problem with ephemeral storage is that data written to file is gone after a collector restart.
I do not think we should recommend it, but if users want to use it, they should be free to do so. Please see open-telemetry/opentelemetry-collector#5902 and several other issues related to the persistent queue. Overall, if your main concern is an outage of the final backend destination, please set up a proper queue that guarantees delivery of data without loss.
In my mind, persistence would be similarly optional/controllable by the helm chart: PVs if desired, emptyDir: {} if not. I agree it should be optional; right now we can't enable it at all because the signoz exporter doesn't support it. And I acknowledge that unbounded memory with perfect flushes would solve the problem. But without some sort of persistence layer before data hits clickhouse, what is far more common is that signoz, as a participant in the cluster, blows up its RAM and either OOM's or creates back-pressure causing actual service faults, or a chain of OOM's into poorly written collectors. That's not to say it's signoz's fault at all; it seems to be trying to do the right thing to not lose data. But perfect being the enemy of good and all, I would rather that (my) installation be able to buffer to disk, because it's far more likely that either a bad spike occurred and will get worked through shortly, or I'm rolling the cluster and things will get sorted on restart. Going from a nominal 1-2GB of memory to 16GB is a complete non-starter for my agents; I will almost certainly hit cluster limits and lose data long before I run out of disk.
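As a sketch of the Kubernetes side of the PV-vs-ephemeral question above (volume and mount names here are hypothetical, and the exact knobs would depend on the helm chart), a collector pod could expose a buffer directory either way:

```yaml
# Hypothetical pod spec fragment for an otel-collector buffer directory.
# Choose ONE of the two volume sources below.
volumes:
  - name: otelcol-buffer
    # Persistent option: survives collector restarts.
    persistentVolumeClaim:
      claimName: otelcol-buffer
    # Ephemeral option: buffered data is lost on pod restart.
    # emptyDir: {}
containers:
  - name: otel-collector
    volumeMounts:
      - name: otelcol-buffer
        mountPath: /var/lib/otelcol/buffer
```

With emptyDir, data still survives in-place collector crashes within the pod, but not pod rescheduling, which matches the trade-off discussed above.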
Although storage extensions are enabled, none of the clickhouse exporters in the signoz distribution support using them, so all agents emitting to clickhouse are constrained to in-memory buffering, even during a clickhouse outage.
It would be helpful if the config structures supported the
{sending_queue: {storage: _extension_}}
option, so that exporters could buffer to disk, allowing maintenance that doesn't immediately balloon memory usage and create back-pressure on emitters.
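The requested configuration could look like the following sketch, based on the standard exporterhelper-style sending_queue and the file_storage extension from opentelemetry-collector-contrib (directory path and queue size are illustrative, and the clickhouse exporter shown does not currently accept the `storage` key, which is the point of this issue):

```yaml
extensions:
  file_storage:
    directory: /var/lib/otelcol/buffer
    timeout: 10s

exporters:
  clickhouse:
    sending_queue:
      enabled: true
      # Requested feature: spill the queue to the file_storage
      # extension instead of holding everything in memory.
      storage: file_storage
      queue_size: 5000

service:
  extensions: [file_storage]
  pipelines:
    logs:
      exporters: [clickhouse]
```

With this in place, a clickhouse outage would fill the on-disk queue rather than growing the collector's heap until it OOMs or back-pressures emitters.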