Hello, and thanks for the release.
After updating to v15, the nodes in my two-node Podman cluster become unreachable from one another. I've tried many things, including a reinstallation, and found the following reproduction steps that show the issue in detail:
- Set up two nodes from scratch with the official Docker image. For each node, I set up proper Let's Encrypt certs + domains, and HTTPS on port :53443.
- Initialize the cluster on the primary, and join from the secondary with certificate validation ticked. The initial join works fine and syncs all data as intended. For simplicity, I only configured one IPv4 address per node.
- However, after a while, the remote nodes become marked as Unreachable from one another, and the cluster fails to function. Triggering a Resync also surfaces the errors more quickly. The symptoms include:
- The primary node shows the secondary as "Unreachable" right after the first sync. The secondary node shows the primary as "Unreachable" after a while.
- For the Cluster Catalog Zone and its member zones, manually resyncing them works fine. They use a different, non-HTTPS mechanism, so I suppose that part is still intact. There also seems to be a problem with IPv6, but I'll look into that later.
- Most importantly, it is impossible to switch to another node using the context menu. When switching to the primary from the secondary, "No active session exists. Please login and try again." is shown as an error. When switching to the secondary from the primary, I am returned to the login screen and get stuck there after multiple logins, probably due to a failing loop. So switching context bricks the entire browser session for the primary node's domain.
- On both nodes, the log fills with large numbers of repeated entries of the following nature:
```
[timestamp UTC] Heartbeat failed for Secondary node 'secondary.example.com (192.168.53.2)'. DnsServerCore.HttpApi.InvalidTokenHttpApiClientException: Invalid token or session expired.
   at DnsServerCore.HttpApi.HttpApiClient.CheckResponseStatus(JsonElement rootElement) in Z:\Technitium\Projects\DnsServer\DnsServerCore.HttpApi\HttpApiClient.cs:line 147
   at DnsServerCore.HttpApi.HttpApiClient.GetClusterStateAsync(Boolean includeServerIpAddresses, Boolean includeNodeCertificates, CancellationToken cancellationToken) in Z:\Technitium\Projects\DnsServer\DnsServerCore.HttpApi\HttpApiClient.cs:line 394
   at DnsServerCore.Cluster.ClusterNode.GetClusterStateAsync(CancellationToken cancellationToken) in Z:\Technitium\Projects\DnsServer\DnsServerCore\Cluster\ClusterNode.cs:line 517
   at DnsServerCore.Cluster.ClusterNode.HeartbeatTimerCallbackAsync(Object state) in Z:\Technitium\Projects\DnsServer\DnsServerCore\Cluster\ClusterNode.cs:line 224
```
They fail at the exact same lines, the only difference being the node addresses/IPs.
- Lastly, gracefully leaving/removing a node from the cluster is not possible either, as it also shows an "Invalid token or session expired." error. The only way out is to force leave/force remove the node itself.
This issue seems to have been reported via Reddit here too. The problem appears specific to the HTTPS API and to whatever tokens it uses between nodes. For now, I'll keep running the cluster in "desynced mode".
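In case it helps with reproduction, below is roughly the kind of check I'd use to poke at the token behaviour directly against the HTTPS API, outside the browser. It is only a sketch: the hostnames and credentials are placeholders, and the endpoint names are taken from the HTTP API docs as I understand them, so treat them as assumptions rather than verified v15 calls.

```python
# Rough diagnostic sketch -- hostnames/credentials are placeholders, and the
# endpoints are from my reading of the HTTP API docs (assumptions, not verified
# against v15). Requires the "requests" package.
import requests

PRIMARY = "https://primary.example.com:53443"
SECONDARY = "https://secondary.example.com:53443"
USER, PASS = "admin", "changeme"  # placeholder credentials


def login(base: str) -> str:
    """Log in directly on a node and return a fresh session token."""
    r = requests.get(f"{base}/api/user/login",
                     params={"user": USER, "pass": PASS}, timeout=10)
    r.raise_for_status()
    return r.json()["token"]


def check(base: str, token: str) -> str:
    """Make a cheap read-only call with the given token and report the API status."""
    r = requests.get(f"{base}/api/dashboard/stats/get",
                     params={"token": token, "type": "LastHour"}, timeout=10)
    return r.json().get("status", "unknown")


primary_token = login(PRIMARY)
secondary_token = login(SECONDARY)

# Each node with its own freshly issued token: sanity check for the web
# service + TLS setup in isolation.
print("primary / own token:    ", check(PRIMARY, primary_token))
print("secondary / own token:  ", check(SECONDARY, secondary_token))

# A token from one node used against the other, which seems to be roughly
# where the heartbeat and the node-switching UI fall over for me.
print("secondary / primary token:", check(SECONDARY, primary_token))
```

The idea is simply to see whether each node accepts its own token and whether a token issued by one node is accepted by the other, without the browser session handling in the way.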
Is there anything extra that needs to be configured for Clustering after the v15 upgrade? If so, please let me know, as I couldn't find anything of note in the changelog or the blog post. Also, let me know if you can reproduce this outside of Docker.