You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Recently, we started having problems with the Consul Connect service mesh, as the connectivity started failing for all the services. Although we have not been able to attribute the failure to a specific event, we recently performed the following changes:
Updated Consul servers (from v1.12.4 -> v1.16.6)
Update Nomad servers (from v1.3.5 -> v1.7.7)
Added a new DC in Consul for another region (through WAN federation).
We are receiving the following error when trying to connect to an upstream service:
Note: we are using Hashicorp's count API & dashboard test applications for all the tests.
Troubleshooting steps
We started troubleshooting this problem by using the consul troubleshoot command (as suggested here), but we are getting mixed results. As we get errors for about 30 seconds or so:
/opt # export ENVOY_ADMIN_ENDPOINT="127.0.0.2:19001"
/opt # consul troubleshoot proxy -envoy-admin-endpoint "$ENVOY_ADMIN_ENDPOINT" -upstream-envoy-id sre-test-count-api
==> Validation
✓ Certificates are valid
✓ Envoy has 0 rejected configurations
✓ Envoy has detected 19 connection failure(s)
✓ Listener for upstream "sre-test-count-api" found
✓ Cluster "sre-test-count-api.default.{DatacenterHere}.internal.b3b7e58f-a453-b673-1b66-9ae778eebf8a.consul" for upstream "sre-test-count-api" found
! No healthy endpoints for cluster "sre-test-count-api.default.{DatacenterHere}.internal.b3b7e58f-a453-b673-1b66-9ae778eebf8a.consul" for upstream "sre-test-count-api"
-> Check that your upstream service is healthy and running
-> Check that your upstream service is registered with Consul
-> Check that the upstream proxy is healthy and running
-> If you are explicitly configuring upstreams, ensure the name of the upstream is correct
/opt #
But then, it starts reporting no errors for a similar time:
/opt # export ENVOY_ADMIN_ENDPOINT="127.0.0.2:19001"
/opt # consul troubleshoot proxy -envoy-admin-endpoint "$ENVOY_ADMIN_ENDPOINT" -upstream-envoy-id sre-test-count-api
==> Validation
✓ Certificates are valid
✓ Envoy has 0 rejected configurations
✓ Envoy has detected 19 connection failure(s)
✓ Listener for upstream "sre-test-count-api" found
✓ Cluster "sre-test-count-api.default.{DatacenterHere}.internal.b3b7e58f-a453-b673-1b66-9ae778eebf8a.consul" for upstream "sre-test-count-api" found
✓ Healthy endpoints for cluster "sre-test-count-api.default.{DatacenterHere}.internal.b3b7e58f-a453-b673-1b66-9ae778eebf8a.consul" for upstream "sre-test-count-api" found
✓ Upstream resources are valid
If you are still experiencing issues, you can:
-> Check intentions to ensure the upstream allows traffic from this source
-> If using transparent proxy, ensure DNS resolution is to the same IP you have verified here
/opt #
None of the possible causes for the "No healthy endpoints for cluster" exist, as the health check status for the service in Consul is reported as healthy, and the intention exists.
Then, we used consul connect proxy in debug mode, and we are getting the following error:
$ consul connect proxy -log-level DEBUG -service local-development -upstream sre-test-count-api:8000
==> Consul Connect proxy starting...
Configuration mode: Flags
Service: local-development
Upstream: sre-test-count-api => :8000
Public listener: Disabled
==> Log data will now stream in as it occurs:
2024-05-09T17:37:26.610Z [DEBUG] proxy: got new config
2024-05-09T17:37:26.611Z [INFO] proxy: Starting listener: listener=127.0.0.1:8000->service:default/default/sre-test-count-api bind_addr=127.0.0.1:8000
2024-05-09T17:37:26.613Z [INFO] proxy: Proxy loaded config and ready to serve
2024-05-09T17:37:26.613Z [INFO] proxy: Parsed TLS identity: uri=spiffe://b3b7e58f-a453-b673-1b66-9ae778eebf8a.consul/ns/default/dc/{DatacenterHere}/svc/local-development
2024-05-09T17:37:41.039Z [DEBUG] proxy.connect: resolved service instance: service=local-development address=10.35.0.146:25187 identity=spiffe:///ns/default/dc/{DatacenterHere}/svc/sre-test-count-api
2024-05-09T17:37:41.040Z [ERROR] proxy.upstream: failed to dial: error="x509: certificate signed by unknown authority"
Note: In this case, the Intention also exists (local-development -> sre-test-count-api).
We used the ENVOY_ADMIN_ENDPOINT to get the certificate chain, and it looks strangely large (see attached file). We have tried removing the configurations by removing the root & intermediate CA in Vault, and then re-configuring from scratch with no luck.
Any suggestions on how to further troubleshoot the "x509: certificate signed by unknown authority" error?
Reproduction Steps
We have not been able to reproduce this problem, as it is affecting some of our environments, but we have some environments where the exact configuration is working with no problems.
Overview of the Issue
Recently, we started having problems with the Consul Connect service mesh, as the connectivity started failing for all the services. Although we have not been able to attribute the failure to a specific event, we recently performed the following changes:
We are receiving the following error when trying to connect to an upstream service:
Note: we are using Hashicorp's count API & dashboard test applications for all the tests.
Troubleshooting steps
We started troubleshooting this problem by using the
consul troubleshoot command
(as suggested here), but we are getting mixed results. As we get errors for about 30 seconds or so:But then, it starts reporting no errors for a similar time:
None of the possible causes for the "No healthy endpoints for cluster" exist, as the health check status for the service in Consul is reported as healthy, and the intention exists.
Then, we used
consul connect proxy
in debug mode, and we are getting the following error:Note: In this case, the Intention also exists (local-development -> sre-test-count-api).
We used the ENVOY_ADMIN_ENDPOINT to get the certificate chain, and it looks strangely large (see attached file). We have tried removing the configurations by removing the root & intermediate CA in Vault, and then re-configuring from scratch with no luck.
Any suggestions on how to further troubleshoot the "x509: certificate signed by unknown authority" error?
Reproduction Steps
We have not been able to reproduce this problem, as it is affecting some of our environments, but we have some environments where the exact configuration is working with no problems.
Consul info for both Client and Server
Client info
Server info
Other versions:
Servers:
• Consul v1.16.6
• Nomad v1.7.7
• Vault v1.11.4+ent
Client
• Consul v1.12.4
• Nomad v1.3.5
Operating system and Environment details
The text was updated successfully, but these errors were encountered: