Regression 2.14.0 breaks TransportServer targeting services on control plane nodes #3079

Closed
braunsonm opened this issue Sep 7, 2023 · 7 comments · Fixed by #3084

@braunsonm

Setup Details

CIS Version : 2.14.0
Build: f5networks/k8s-bigip-ctlr:latest
BIGIP Version: BIG-IP 15.1.8.1
Agent Mode: AS3
Orchestration: K8S
Orchestration Version: 1.26.6
Pool Mode: Nodeport
Additional Setup details: 3 control plane, 2 agents

Description

In previous versions of the BIG-IP CTLR, we had a TransportServer that load balances our kube-api across the control plane ("master") nodes. This has worked fine for over a year. Starting in 2.14.0 it breaks: no members are added to the pool. I'm not sure whether there was a change to node selection, but we do not specify a node label filter.

Steps To Reproduce

  1. Deploy the following, targeting a service on your server nodes:
apiVersion: v1
kind: Service
metadata:
  name: kubernetes-vip
  namespace: kube-system
spec:
  ipFamilies:
    - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - name: https
    port: 6443
    protocol: TCP
    targetPort: 6443
  selector:
    component: kube-apiserver
  sessionAffinity: None
  type: NodePort
---
apiVersion: cis.f5.com/v1
kind: TransportServer
metadata:
  labels:
    f5cr: "true"
  name: kubernetes-vip
  namespace: kube-system
spec:
  mode: standard
  pool:
    monitor:
      interval: 10
      timeout: 31
      type: tcp
    service: kubernetes-vip
    servicePort: 6443
  snat: auto
  type: tcp
  virtualServerAddress: 123.123.123.123
  virtualServerPort: 443
  2. Notice the pool is created with no members

Expected Result

The pool should be created with the 3 control plane nodes running the Kubernetes API server. There are 3 endpoints for the service, and hitting the NodePort directly works.

Actual Result

There are no members in the pool.

Observations

This works on 2.13.1

@braunsonm added the "bug", "untriaged", and "no JIRA created" labels on Sep 7, 2023
@braunsonm
Author

braunsonm commented Sep 7, 2023

I think I see part of the difference here.
In version 2.13.1, our "worker" nodes were the only 2 nodes in the pool. They aren't actually running the pod targeted by the service, but since this is a NodePort service they were still added to the pool with their health checks passing, relying on kube-proxy to forward the traffic.

In version 2.14.0 this changed, and those worker nodes are no longer added to the pool, maybe because CTLR got smarter and noticed they aren't actually running the pod targeted by the service.

But in neither version are the control-plane nodes that run the pods added to the pool; they seem to be getting filtered out by some process in CTLR.

@braunsonm
Author

braunsonm commented Sep 7, 2023

The removal of control plane nodes from the pool comes from this commit: b7d33d5 by @lavanya-f5

That commit treats a NoExecute taint as the node not being "healthy". That isn't necessarily the case; it just means that workloads on those nodes need to accept (tolerate) the taint, which things like the kube-api do when node-role.kubernetes.io/control-plane=true. I think this logic is flawed.
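To illustrate what "accepting the taint" means: the workload carries a toleration in its pod spec, e.g. a blanket one (illustrative snippet):

spec:
  tolerations:
  - operator: Exists   # empty key + Exists matches every taint, including NoExecute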

I'm not sure where the change came from in 2.14.0 that stops the worker nodes (which aren't running the pod in question) from being added to the pool. I can understand that decision a little more, but if the service's externalTrafficPolicy is Cluster (which this one's is), then all nodes should be receiving the traffic. So I think this is a separate but related bug.
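For reference, the Service above doesn't set externalTrafficPolicy, so it gets the default. Spelled out explicitly it would look roughly like this (illustrative snippet; Cluster is the default when the field is unset):

spec:
  type: NodePort
  externalTrafficPolicy: Cluster   # default when unset; kube-proxy on every node forwards the NodePort
  ports:
  - name: https
    port: 6443
    protocol: TCP
    targetPort: 6443

With Cluster, any node can proxy NodePort traffic to the endpoints, so every node should be a valid pool member.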

@trinaths added the "In Review" label and removed the "untriaged" and "no JIRA created" labels on Sep 8, 2023
@lavanya-f5
Contributor

lavanya-f5 commented Sep 14, 2023

@braunsonm For the first issue: even with node-role.kubernetes.io/control-plane=true, only a NoSchedule taint is added. NoExecute should only be added when the node is unreachable or goes to a NotReady state because of memory or disk pressure. Will discuss this with the team and get back.
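For reference, a stock kubeadm control plane node typically carries only something like this (illustrative):

spec:
  taints:
  - effect: NoSchedule
    key: node-role.kubernetes.io/control-plane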

For the second issue on 2.14, where no pool members are added (i.e. even the worker nodes are not added): we need more information to debug further.
Can you share the following information from the cluster for all nodes:
kubectl get nodes -o yaml

Also, please share the CIS logs.

@braunsonm
Author

@lavanya-f5 That isn't true for all distributions. On Rancher RKE2, NoExecute is added for the etcd role.

Server:

spec:
  taints:
  - effect: NoExecute
    key: node-role.kubernetes.io/etcd
    value: "true"
  - effect: NoSchedule
    key: node-role.kubernetes.io/control-plane
    value: "true"

Worker:

<No Taints>

Also why would you care if a node has memory or disk pressure if it's still running workloads that F5 should proxy to and is passing the health check? Just looking for NotReady should be sufficient here.
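To illustrate, a node can carry those taints and still report perfectly healthy conditions, e.g. (abridged, illustrative values):

status:
  conditions:
  - type: MemoryPressure
    status: "False"
    reason: KubeletHasSufficientMemory
  - type: DiskPressure
    status: "False"
    reason: KubeletHasNoDiskPressure
  - type: Ready
    status: "True"
    reason: KubeletReady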

RE: Workers, is there anything specific that you want me to share? I'm not comfortable posting my entire node YAMLs, including IPs, here. For the logs, CIS simply logs:

Processing Key: &{kube-system Endpoints kubernetes-vip 0xc000318740 Create }
Processing Key: &{kube-system Service kubernetes-vip 0xc000993d40 Create }
Processing Key: &{kube-system TransportServer kubernetes-vip 0xc002ee04f8 Create }
Processing Transport Server kubernetes-vip for port 443
Adding service '{kubernetes-vip kube-systm}' in CIS cache
Finished syncing transport servers <List of servers>
Updating VirtualServer Status with {IP Ok} for resource name:kubernetes-vip, namespace:kube-system

The service itself has 3 endpoints (all on the server nodes).
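For reference, the Endpoints object looks roughly like this (addresses are placeholders, not our real node IPs):

apiVersion: v1
kind: Endpoints
metadata:
  name: kubernetes-vip
  namespace: kube-system
subsets:
- addresses:
  - ip: 10.0.0.1   # control-plane node 1 (placeholder)
  - ip: 10.0.0.2   # control-plane node 2 (placeholder)
  - ip: 10.0.0.3   # control-plane node 3 (placeholder)
  ports:
  - name: https
    port: 6443
    protocol: TCP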

@lavanya-f5
Contributor

@braunsonm Thanks for the info. Will fix the logic for the NotReady check and share the build.

@lavanya-f5
Contributor

@braunsonm Dev build with the fix: quay.io/f5networks/k8s-bigip-ctlr-devel:aed60df8b8b25242f7f0ab1f41eed1b5ef7e7672. Please use this image for testing and provide your feedback if the issue still exists.

lavanya-f5 linked a pull request on Sep 20, 2023 that will close this issue
@braunsonm
Author

Tested and this seems to work! @lavanya-f5
