Document wsAtomicTransaction externalURLPrefix limitations in Kubernetes #7223

Open
kgibm opened this issue Feb 15, 2024 · 1 comment · May be fixed by #7307

kgibm commented Feb 15, 2024

When using wsAtomicTransaction in a Kubernetes environment without peer recovery, the following combination of conditions can cause undesired latency: externalURLPrefix is set to use ${env.POD_IP}; the client pod communicates through a Kubernetes service; the service communication is NATted to the pod IP (e.g. OpenShift SDN); under heavy load, a socket opened to the service IP of the target pod is closed and enters TIME_WAIT; and WS-AT then opens a new socket to the pod IP that happens to use the same ephemeral local port. In that case, SYN packets are dropped and retransmitted for the duration of the TIME_WAIT period, causing undesired latency. If this duration exceeds 30 seconds, the WS-AT call fails with a java.net.SocketTimeoutException, visible with diagnostic trace *=info:org.apache.cxf.*=all:

java.net.SocketTimeoutException: connect timed out
        [...]
        at com.ibm.ws.wsat.service.impl.WebClientImpl$3.call(WebClientImpl.java:120)

This manifests as retransmitted SYN packets in a network trace on the client; for example:

$ TZ=UTC tshark -t ud -T fields -e frame.number -e _ws.col.Time -e ip.src -e tcp.srcport -e ip.dst -e tcp.dstport -e tcp.stream -e frame.len -e _ws.col.Protocol -e _ws.col.Info -r *pcap* -Y "tcp.flags.syn == 1 && tcp.analysis.retransmission"
409708	2024-02-05 16:59:08.192371	10.1.2.3	5502	10.1.2.4	9443	3946	76	TCP	[TCP Retransmission] 5502 → 9443 [SYN] Seq=0 Win=26733 Len=0 MSS=8911 SACK_PERM=1 TSval=123 TSecr=0 WS=128
411507	2024-02-05 16:59:10.241377	10.1.2.3	5502	10.1.2.4	9443	3946	76	TCP	[TCP Retransmission] 5502 → 9443 [SYN] Seq=0 Win=26733 Len=0 MSS=8911 SACK_PERM=1 TSval=123 TSecr=0 WS=128
412178	2024-02-05 16:59:14.272379	10.1.2.3	5502	10.1.2.4	9443	3946	76	TCP	[TCP Retransmission] 5502 → 9443 [SYN] Seq=0 Win=26733 Len=0 MSS=8911 SACK_PERM=1 TSval=123 TSecr=0 WS=128
419229	2024-02-05 16:59:22.784377	10.1.2.3	5502	10.1.2.4	9443	3946	76	TCP	[TCP Retransmission] 5502 → 9443 [SYN] Seq=0 Win=26733 Len=0 MSS=8911 SACK_PERM=1 TSval=123 TSecr=0 WS=128

In other words, because of the NATting, the client sends to the service IP but the pod receives the traffic on the pod IP: the client's 4-tuple is (client IP, client ephemeral port, service IP, service port), while the pod's 4-tuple is (client IP, client ephemeral port, pod IP, pod port). Example tcpdumps demonstrating the differing destination IP (the 4-tuple is columns 3-6):

Client pod:

389156	2024-02-05 16:58:30.554889	10.1.2.3	5502	172.1.2.3	9443	3709	68	TCP	5502 → 9443 [ACK] Seq=4697 Ack=635 Win=123 Len=0 TSval=123 TSecr=123

Target pod:

554314	2024-02-05 16:58:30.554905	10.1.2.3	5502	10.1.2.4	9443	11791	68	TCP	5502 → 9443 [ACK] Seq=4697 Ack=635 Win=123 Len=0 TSval=123 TSecr=123

So when the socket to the service is closed and enters TIME_WAIT, if WS-AT then opens a new socket directly to the pod IP with the 4-tuple (client IP, client ephemeral port, pod IP, pod port), the pod's 4-tuple is the same as before and cannot be re-used. It is believed this is due to the default conntrack TIME_WAIT timeout of 120 seconds. Using net.ipv4.tcp_tw_reuse=1 does not help. It is possible that net.netfilter.nf_conntrack_tcp_timeout_time_wait=60 may resolve the issue.
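
For reference, the hypothesized conntrack tuning above would be a node-level sysctl change like the following sketch (standard Linux netfilter sysctl name; requires the nf_conntrack module, and on OpenShift would typically be applied through node tuning rather than edited directly):

```
# Hypothesis only: shorten the conntrack TIME_WAIT timeout from the
# default 120 seconds so NATted 4-tuples free up sooner.
net.netfilter.nf_conntrack_tcp_timeout_time_wait = 60
```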

One workaround is to use a different port for externalURLPrefix than the port used for the service, which avoids the 4-tuple TCP conflict. This requires adding an additional httpEndpoint and adding the new port's hostAlias to the virtualHost. If a virtualHost was not explicitly defined previously and the default was used, take care to include all incoming virtual hosts: in a Kubernetes environment, although traffic might come into a particular port (e.g. 9443), the request Host header might specify a service port (e.g. 443), and a missing hostAlias will result in 404s. For example:

<?xml version="1.0" encoding="UTF-8"?>
<server>
  <httpEndpoint id="wsatHttpEndpoint" host="*" httpsPort="9444" />
  <wsAtomicTransaction SSLEnabled="true" SSLRef="cssSSLSettings" externalURLPrefix="https://${env.POD_IP}:9444" />
  <virtualHost id="default_host">
    <hostAlias>*:443</hostAlias>
    <hostAlias>*:9443</hostAlias>
    <hostAlias>*:9444</hostAlias>
  </virtualHost>
</server>

In addition, ensure that the new port is accessible to the client. For example, although a ports.containerPort entry on the pod specification is not needed because "Any port which is listening on the default '0.0.0.0' address inside a container will be accessible from the network", there may be a NetworkPolicy that restricts available ports, in which case an additional NetworkPolicy may be required.
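
A minimal sketch of such a NetworkPolicy (the policy name, pod labels, and namespace selection here are hypothetical; adapt to the existing policies in the namespace):

```yaml
# Illustrative only: allow ingress to the additional WS-AT port (9444)
# on pods labeled app: liberty-app. Combine with existing policies as needed.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-wsat-port
spec:
  podSelector:
    matchLabels:
      app: liberty-app
  ingress:
    - ports:
        - protocol: TCP
          port: 9444
```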

While this issue affects OpenShift SDN networking, it does not appear to affect OVN-Kubernetes networking. OVN-Kubernetes replaced OpenShift SDN as the default networking plugin as of OpenShift 4.12, so it is believed that switching to OVN-Kubernetes may also resolve this issue.

Diagnostic notes:

  1. Gather tcpdump on both client and server pods using nsenter: https://access.redhat.com/solutions/4569211 and https://access.redhat.com/solutions/1611883

  2. Quickly find retransmitted SYNs on the socket connect:

    TZ=UTC tshark -t ud -T fields -e frame.number -e _ws.col.Time -e ip.src -e tcp.srcport -e ip.dst -e tcp.dstport -e tcp.stream -e frame.len -e _ws.col.Protocol -e _ws.col.Info -r *pcap* -Y "tcp.flags.syn == 1 && tcp.analysis.retransmission"

    1. Then, search for that same 4-tuple before the first SYN and check whether any packets (usually FIN or ACK, but possibly RST) fall within 60 seconds of the first SYN.
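
The 4-tuple search in the last step can be done on the tshark field output with a simple awk filter. A minimal sketch: the two sample rows below are made up to mimic the tab-separated tshark output above (in practice, redirect the real tshark output to a file), and the filter keeps only packets whose 4-tuple (columns 3-6) matches the retransmitted SYN.

```shell
# Illustrative sample of tshark field output (tab-separated); packets.tsv
# is a hypothetical file name. The first row is a prior packet on the same
# 4-tuple within 60 seconds of the SYN; the second row is a different port.
cat > packets.tsv <<'EOF'
389156	2024-02-05 16:58:30.554889	10.1.2.3	5502	10.1.2.4	9443	FIN
409708	2024-02-05 16:59:08.192371	10.1.2.3	5503	10.1.2.4	9443	SYN
EOF
# Keep only packets whose 4-tuple (columns 3-6) matches the retransmitted SYN.
awk -F'\t' '$3=="10.1.2.3" && $4=="5502" && $5=="10.1.2.4" && $6=="9443"' packets.tsv
```

Here only the first sample row survives the filter, showing a FIN on the same 4-tuple about 38 seconds before the retransmitted SYN.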
@kgibm kgibm self-assigned this Feb 21, 2024

kgibm commented Apr 17, 2024

This has since been observed with a different symptom: SYNs being dropped while another conversation is still active. Thus this issue is not limited to TIME_WAIT, and the hypothesis about net.netfilter.nf_conntrack_tcp_timeout_time_wait=60 is incorrect or incomplete. Investigating further.

kgibm added a commit to kgibm/docs that referenced this issue Apr 29, 2024
Fixes OpenLiberty#7223

Signed-off-by: Kevin Grigorenko <kevin.grigorenko@us.ibm.com>
@kgibm kgibm linked a pull request Apr 29, 2024 that will close this issue