The 'upstream connect error or disconnect/reset before headers. reset reason: connection failure' error when using collector in OpenShift infrastructure #5091
Unanswered · art-iva-cente asked this question in Q&A
Replies: 1 comment 1 reply
-
Doesn't sound like a Jaeger issue to me; I bet if you placed an OTEL Collector between your SDK and Jaeger you'd get the same behavior. My guess is that it has something to do with how networking or service-mesh routing works, e.g. the mesh may not detect a downed server in time.
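The suggestion above — routing traces through an OTEL Collector on the way to Jaeger — can be sketched with a minimal collector configuration. This is an illustrative fragment, not from the original post; the Jaeger endpoint address is an assumption:

```yaml
# Minimal sketch: OTLP in from the SDK, OTLP out to Jaeger.
# Jaeger 1.48 accepts OTLP natively on port 4317.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  otlp:
    # Hypothetical Jaeger collector address; replace with your service name.
    endpoint: jaeger-collector-headless.qa-app-monitoring.svc:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp]
```

If the same reconnect failure reproduces with this setup, that would point away from Jaeger and toward the network path.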
-
We are running a Jaeger collector (1.48.1) in OpenShift and send telemetry from our Java Spring applications using the OpenTelemetry javaagent, version 1.28.0. We use OpenShift service endpoints such as http://jaeger-collector-headless.qa-app-monitoring.svc:4317. The issue is that when the collector pod restarts, the javaagent cannot reconnect; instead, it continuously reports the "upstream connect error or disconnect/reset before headers. reset reason: connection failure" error from the title.
What could be the cause of this error, in terms of how it maps to networking/infrastructure problems?
If I simulate the situation locally, i.e. run the local jaeger-all-in-one container, stop it, and restart it, the behavior is different: while the container is down the javaagent reports "Failed to connect to localhost/0:0:0:0:0:0:0:1:4317", and it restores the connection successfully as soon as I restart the container.
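The local simulation described above can be reproduced with something like the following; the image tag and exposed port are assumptions based on the versions mentioned in the question:

```shell
# Start a local Jaeger all-in-one with the OTLP gRPC port exposed
docker run -d --name jaeger -p 4317:4317 jaegertracing/all-in-one:1.48

# Stop it to provoke the "Failed to connect to localhost/...:4317" error
docker stop jaeger

# Restart it; the javaagent re-establishes the connection on its own
docker start jaeger
```

Locally the agent talks to the container directly, with no proxy in between, which may be why the failure mode differs from the in-cluster one.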
What is different about the "upstream connect error or disconnect/reset before headers. reset reason: connection failure" error compared to the "Failed to connect to localhost/0:0:0:0:0:0:0:1:4317" error?
PS: We pass the following settings to the javaagent: OTEL_TRACES_EXPORTER=otlp, OTEL_METRICS_EXPORTER=none, OTEL_EXPORTER_OTLP_ENDPOINT=http://our-openshift.svc:4317, OTEL_EXPORTER_OTLP_PROTOCOL=grpc
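The settings above map to a launch command along these lines; the agent jar path and application jar name are placeholders, not taken from the original post:

```shell
# Hypothetical launch command; only the OTEL_* values come from the question
OTEL_TRACES_EXPORTER=otlp \
OTEL_METRICS_EXPORTER=none \
OTEL_EXPORTER_OTLP_ENDPOINT=http://our-openshift.svc:4317 \
OTEL_EXPORTER_OTLP_PROTOCOL=grpc \
java -javaagent:/path/to/opentelemetry-javaagent.jar -jar app.jar
```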