Consul Connect problem with Certificate Authority: Getting error="x509: certificate signed by unknown authority" #21080

ovelascoh · 2024-05-10T01:48:51Z

Overview of the Issue

Connectivity through our Consul Connect service mesh recently started failing for all services. We have not been able to attribute the failure to a specific event, but we recently made the following changes:

Updated Consul servers (from v1.12.4 to v1.16.6)
Updated Nomad servers (from v1.3.5 to v1.7.7)
Added a new DC in Consul for another region (through WAN federation).

We are receiving the following error when trying to connect to an upstream service:

/opt # env | grep UPSTREAM
NOMAD_UPSTREAM_IP_sre_test_count_api=127.0.0.1
NOMAD_UPSTREAM_IP_sre-test-count-api=127.0.0.1
NOMAD_UPSTREAM_ADDR_sre_test_count_api=127.0.0.1:8080
NOMAD_UPSTREAM_PORT_sre_test_count_api=8080
NOMAD_UPSTREAM_ADDR_sre-test-count-api=127.0.0.1:8080
NOMAD_UPSTREAM_PORT_sre-test-count-api=8080
/opt # curl $NOMAD_UPSTREAM_ADDR_sre_test_count_api
curl: (56) Recv failure: Connection reset by peer
/opt #

Note: we are using HashiCorp's count API and dashboard test applications for all the tests.

Troubleshooting steps

We started troubleshooting this problem with the consul troubleshoot command (as suggested here), but the results are inconsistent. For about 30 seconds, it reports errors:

/opt # export ENVOY_ADMIN_ENDPOINT="127.0.0.2:19001"
/opt # consul troubleshoot proxy -envoy-admin-endpoint "$ENVOY_ADMIN_ENDPOINT" -upstream-envoy-id sre-test-count-api
==> Validation
 ✓ Certificates are valid
 ✓ Envoy has 0 rejected configurations
 ✓ Envoy has detected 19 connection failure(s)
 ✓ Listener for upstream "sre-test-count-api" found
 ✓ Cluster "sre-test-count-api.default.{DatacenterHere}.internal.b3b7e58f-a453-b673-1b66-9ae778eebf8a.consul" for upstream "sre-test-count-api" found
 ! No healthy endpoints for cluster "sre-test-count-api.default.{DatacenterHere}.internal.b3b7e58f-a453-b673-1b66-9ae778eebf8a.consul" for upstream "sre-test-count-api"
  -> Check that your upstream service is healthy and running
  -> Check that your upstream service is registered with Consul
  -> Check that the upstream proxy is healthy and running
  -> If you are explicitly configuring upstreams, ensure the name of the upstream is correct
/opt #

Then, for a similar period, it reports no errors:

/opt # export ENVOY_ADMIN_ENDPOINT="127.0.0.2:19001"
/opt # consul troubleshoot proxy -envoy-admin-endpoint "$ENVOY_ADMIN_ENDPOINT" -upstream-envoy-id sre-test-count-api
==> Validation
 ✓ Certificates are valid
 ✓ Envoy has 0 rejected configurations
 ✓ Envoy has detected 19 connection failure(s)
 ✓ Listener for upstream "sre-test-count-api" found
 ✓ Cluster "sre-test-count-api.default.{DatacenterHere}.internal.b3b7e58f-a453-b673-1b66-9ae778eebf8a.consul" for upstream "sre-test-count-api" found
 ✓ Healthy endpoints for cluster "sre-test-count-api.default.{DatacenterHere}.internal.b3b7e58f-a453-b673-1b66-9ae778eebf8a.consul" for upstream "sre-test-count-api" found
 ✓ Upstream resources are valid
  If you are still experiencing issues, you can:
  -> Check intentions to ensure the upstream allows traffic from this source
  -> If using transparent proxy, ensure DNS resolution is to the same IP you have verified here
/opt #

None of the suggested causes for the "No healthy endpoints for cluster" warning applies: the service's health check is reported as healthy in Consul, and the intention exists.
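
To time the flapping more precisely than repeated consul troubleshoot runs allow, one option is to poll the Envoy admin /clusters endpoint and log the per-endpoint health flags. The following is a minimal Python sketch, not a definitive tool: it assumes the admin listener at 127.0.0.2:19001 used above, the plain-text /clusters line format (cluster::address::health_flags::value), and IPv4 endpoint addresses. The function names parse_health_flags and poll are hypothetical.

```python
import time
import urllib.request


def parse_health_flags(clusters_text, cluster_prefix):
    """Return {endpoint_address: health_flags} for clusters whose name starts with cluster_prefix."""
    flags = {}
    for line in clusters_text.splitlines():
        parts = line.split("::")
        # Per-endpoint lines look like: <cluster>::<ip:port>::health_flags::<value>
        # (assumption: IPv4 addresses, so "::" only appears as the field separator)
        if (
            len(parts) == 4
            and parts[2] == "health_flags"
            and parts[0].startswith(cluster_prefix)
        ):
            flags[parts[1]] = parts[3]
    return flags


def poll(admin_addr, cluster_prefix, interval_s=2.0, rounds=60):
    """Print the health flags every interval_s seconds so flapping shows up with timestamps."""
    url = f"http://{admin_addr}/clusters"
    for _ in range(rounds):
        with urllib.request.urlopen(url, timeout=5) as resp:
            text = resp.read().decode("utf-8", errors="replace")
        print(time.strftime("%H:%M:%S"), parse_health_flags(text, cluster_prefix))
        time.sleep(interval_s)
```

Calling poll("127.0.0.2:19001", "sre-test-count-api.") from inside the task's network namespace would print one timestamped line per interval, so the healthy and unhealthy windows can be lined up against the Consul server logs.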

We then ran consul connect proxy with debug logging and got the following error:

$ consul connect proxy -log-level DEBUG -service local-development -upstream sre-test-count-api:8000
==> Consul Connect proxy starting...
    Configuration mode: Flags
               Service: local-development
              Upstream: sre-test-count-api => :8000
       Public listener: Disabled

==> Log data will now stream in as it occurs:

    2024-05-09T17:37:26.610Z [DEBUG] proxy: got new config
    2024-05-09T17:37:26.611Z [INFO]  proxy: Starting listener: listener=127.0.0.1:8000->service:default/default/sre-test-count-api bind_addr=127.0.0.1:8000
    2024-05-09T17:37:26.613Z [INFO]  proxy: Proxy loaded config and ready to serve
    2024-05-09T17:37:26.613Z [INFO]  proxy: Parsed TLS identity: uri=spiffe://b3b7e58f-a453-b673-1b66-9ae778eebf8a.consul/ns/default/dc/{DatacenterHere}/svc/local-development
    2024-05-09T17:37:41.039Z [DEBUG] proxy.connect: resolved service instance: service=local-development address=10.35.0.146:25187 identity=spiffe:///ns/default/dc/{DatacenterHere}/svc/sre-test-count-api
    2024-05-09T17:37:41.040Z [ERROR] proxy.upstream: failed to dial: error="x509: certificate signed by unknown authority"

Note: In this case, the Intention also exists (local-development -> sre-test-count-api).

We used the Envoy admin endpoint (ENVOY_ADMIN_ENDPOINT) to retrieve the certificate chain, and it looks unusually large (see the attached file). We also tried removing the root and intermediate CAs in Vault and reconfiguring from scratch, with no luck.
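
A quick way to size up the chain is to count its certificates and check for repeats. The sketch below assumes the chain has been saved as plain PEM text (for example the attached cert-chain.txt, if it contains PEM blocks). It does not validate signatures or parse X.509 fields; it only fingerprints each PEM block. The helper names are hypothetical.

```python
import hashlib
import re
import ssl
from collections import Counter

# Matches each PEM-encoded certificate block in a bundle.
PEM_BLOCK = re.compile(
    r"-----BEGIN CERTIFICATE-----.*?-----END CERTIFICATE-----", re.DOTALL
)


def fingerprint_chain(pem_text):
    """Return the SHA-256 fingerprint of every certificate block in pem_text, in order."""
    fingerprints = []
    for block in PEM_BLOCK.findall(pem_text):
        der = ssl.PEM_cert_to_DER_cert(block)  # base64-decodes the block body
        fingerprints.append(hashlib.sha256(der).hexdigest())
    return fingerprints


def duplicate_fingerprints(pem_text):
    """Return {fingerprint: count} for certificates that appear more than once."""
    counts = Counter(fingerprint_chain(pem_text))
    return {fp: n for fp, n in counts.items() if n > 1}
```

If duplicate_fingerprints returns a non-empty mapping, the same certificate appears more than once in the bundle, which might account for part of its size.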

Any suggestions on how to further troubleshoot the "x509: certificate signed by unknown authority" error?
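
One further check, given the Vault CA reconfiguration and the new federated datacenter, is to compare what agents in a working and a failing environment report at /v1/connect/ca/roots. The sketch below is only an illustration: the HTTP address and token are placeholders, and summarize_roots and fetch_roots are hypothetical helper names. It condenses the response into the trust domain, the active root ID, and the number of intermediates per root, so the two outputs can be compared side by side.

```python
import json
import urllib.request


def summarize_roots(roots_json):
    """Condense a parsed /v1/connect/ca/roots response into a small comparable summary."""
    roots = roots_json.get("Roots") or []
    return {
        "trust_domain": roots_json.get("TrustDomain"),
        "active_root_id": roots_json.get("ActiveRootID"),
        "roots": [
            {
                "id": r.get("ID"),
                "active": bool(r.get("Active")),
                "intermediates": len(r.get("IntermediateCerts") or []),
            }
            for r in roots
        ],
    }


def fetch_roots(http_addr, token):
    """Fetch the CA roots from a Consul agent's HTTP API (token is sent because ACLs default to deny)."""
    req = urllib.request.Request(
        f"http://{http_addr}/v1/connect/ca/roots",
        headers={"X-Consul-Token": token},
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.load(resp)
```

For example, summarize_roots(fetch_roots("127.0.0.1:8500", token)) could be run against a client agent in each environment, assuming the default HTTP port. The reported trust domain would be expected to match the one in the proxy's TLS identity (b3b7e58f-a453-b673-1b66-9ae778eebf8a.consul in the log above).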

Reproduction Steps

We have not been able to reproduce this problem on demand. It affects only some of our environments; others with the exact same configuration work without problems.

Consul info for both Client and Server

Client info

$ consul info
agent:
        check_monitors = 0
        check_ttls = 1
        checks = 68
        services = 69
build:
        prerelease =
        revision = 94542765
        version = 1.12.4
        version_metadata =
consul:
        acl = enabled
        known_servers = 5
        server = false
runtime:
        arch = amd64
        cpu_count = 32
        goroutines = 3975
        max_procs = 32
        os = linux
        version = go1.18.1
serf_lan:
        coordinate_resets = 0
        encrypted = true
        event_queue = 0
        event_time = 432
        failed = 3
        health_score = 0
        intent_queue = 0
        left = 7
        member_time = 893611
        members = 83
        query_queue = 0
        query_time = 78
$

{
  "acl": {
    "default_policy": "deny",
    "down_policy": "extend-cache",
    "enabled": true,
    "tokens": {
      "default": "{TokenHere}"
    }
  },
  "addresses": {
    "http": "0.0.0.0"
  },
  "bind_addr": "{{ GetDefaultInterfaces | include \"type\" \"ipv4\" | limit 1 | attr \"address\" }}",
  "connect": {
    "enabled": true
  },
  "data_dir": "/var/consul",
  "datacenter": "{DatacenterHere}",
  "dns_config": {
    "allow_stale": true,
    "enable_truncate": true
  },
  "enable_central_service_config": true,
  "encrypt": "{KeyHere}",
  "leave_on_terminate": true,
  "log_level": "info",
  "ports": {
    "grpc": 8502
  },
  "recursors": [
    "127.0.0.1"
  ],
  "retry_join": [
    "provider={ProviderHere} zone_pattern={ZoneHere}.* tag_value=consul-server credentials_file=/etc/consul/consul-godiscover.json"
  ],
  "server": false
}

Server info

$ consul info
agent:
        check_monitors = 0
        check_ttls = 0
        checks = 0
        services = 0
build:
        prerelease =
        revision = 895390c6
        version = 1.16.6
        version_metadata =
consul:
        acl = enabled
        bootstrap = false
        known_datacenters = 2
        leader = true
        leader_addr = {IPHere}:8300
        server = true
raft:
        applied_index = 454370875
        commit_index = 454370875
        fsm_pending = 0
        last_contact = 0
        last_log_index = 454370875
        last_log_term = 1504
        last_snapshot_index = 454357484
        last_snapshot_term = 1504
        latest_configuration = [{Suffrage:Voter ID:{IDHere} Address:{IPHere}:8300} {Suffrage:Voter ID:{IDHere} Address:{IPHere}:8300} {Suffrage:Voter ID:{IDHere} Address:{IPHere}:8300} {Suffrage:Voter ID:{IDHere} Address:{IPHere}:8300} {Suffrage:Voter ID:{IDHere} Address:{IPHere}:8300}]
        latest_configuration_index = 0
        num_peers = 4
        protocol_version = 3
        protocol_version_max = 3
        protocol_version_min = 0
        snapshot_version_max = 1
        snapshot_version_min = 0
        state = Leader
        term = 1504
runtime:
        arch = amd64
        cpu_count = 2
        goroutines = 2265
        max_procs = 2
        os = linux
        version = go1.21.7
serf_lan:
        coordinate_resets = 0
        encrypted = true
        event_queue = 0
        event_time = 432
        failed = 3
        health_score = 0
        intent_queue = 0
        left = 7
        member_time = 893634
        members = 84
        query_queue = 0
        query_time = 78
serf_wan:
        coordinate_resets = 0
        encrypted = true
        event_queue = 0
        event_time = 1
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 41228
        members = 10
        query_queue = 0
        query_time = 1
$

{
  "acl": {
    "default_policy": "deny",
    "down_policy": "async-cache",
    "enabled": true,
    "tokens": {
      "default": "{TokenHere}",
      "master": "{TokenHere}"
    }
  },
  "bind_addr": "0.0.0.0",
  "bootstrap_expect": 5,
  "client_addr": "0.0.0.0",
  "connect": {
    "ca_config": {
      "address": "{VaultURLHere}",
      "intermediate_pki_path": "pki-service-mesh-intermediate",
      "root_pki_path": "pki-root",
      "token": "{TokenHere}"
    },
    "ca_provider": "vault",
    "enabled": true
  },
  "data_dir": "/var/consul-server",
  "datacenter": "{DatacenterHere}",
  "dns_config": {
    "allow_stale": true,
    "enable_truncate": true
  },
  "encrypt": "{KeyHere}",
  "leave_on_terminate": true,
  "log_level": "info",
  "ports": {
    "grpc": 8502
  },
  "primary_datacenter": "{DatacenterHere}",
  "retry_interval": "15s",
  "retry_join": [
    "provider={ProviderHere} tag_value=consul-server zone_pattern={ZoneHere}.*"
  ],
  "server": true,
  "telemetry": {
    "disable_hostname": true,
    "prometheus_retention_time": "360h"
  },
  "ui": true
}

Other versions:
Servers:
• Consul v1.16.6
• Nomad v1.7.7
• Vault v1.11.4+ent

Clients:
• Consul v1.12.4
• Nomad v1.3.5

Operating system and Environment details

$ cat /etc/*release*
PRETTY_NAME="Debian GNU/Linux 10 (buster)"
NAME="Debian GNU/Linux"
VERSION_ID="10"
VERSION="10 (buster)"
VERSION_CODENAME=buster
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
$

ovelascoh · 2024-05-10T01:52:44Z

cert-chain.txt
