Regression 2.14.0 breaks TransportServer targeting services on control plane nodes #3079

Closed
braunsonm opened this issue Sep 7, 2023 · 7 comments · Fixed by #3084

@braunsonm

Setup Details

CIS Version : 2.14.0
Build: f5networks/k8s-bigip-ctlr:latest
BIGIP Version: BIG-IP 15.1.8.1
Agent Mode: AS3
Orchestration: K8S
Orchestration Version: 1.26.6
Pool Mode: Nodeport
Additional Setup details: 3 control plane, 2 agents

Description

In previous versions of the BIG-IP CTLR, we had a TransportServer that load balances our kube-api across the control plane ("master") nodes. This has worked fine for over a year. Starting in 2.14.0 it breaks: no members are added to the pool. I'm not sure whether there was a change to node selection, but we do not specify a node label filter.

Steps To Reproduce

  1. Deploy the following, targeting a service on your server nodes:
apiVersion: v1
kind: Service
metadata:
  name: kubernetes-vip
  namespace: kube-system
spec:
  ipFamilies:
    - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - name: https
    port: 6443
    protocol: TCP
    targetPort: 6443
  selector:
    component: kube-apiserver
  sessionAffinity: None
  type: NodePort
---
apiVersion: cis.f5.com/v1
kind: TransportServer
metadata:
  labels:
    f5cr: "true"
  name: kubernetes-vip
  namespace: kube-system
spec:
  mode: standard
  pool:
    monitor:
      interval: 10
      timeout: 31
      type: tcp
    service: kubernetes-vip
    servicePort: 6443
  snat: auto
  type: tcp
  virtualServerAddress: 123.123.123.123
  virtualServerPort: 443
  2. Notice the pool is created with no members

Expected Result

The pool should be created with the 3 control plane nodes running the Kubernetes API server. There are 3 endpoints for the service, and hitting the NodePort directly works.

Actual Result

There are no members in the pool.

Observations

This works on 2.13.1

@braunsonm added the "bug", "untriaged", and "no JIRA created" labels on Sep 7, 2023
@braunsonm
Author

braunsonm commented Sep 7, 2023

I think I see part of the difference here.
In version 2.13.1, our "worker" nodes were the only 2 nodes in the pool. They aren't actually running the pod targeted by the service, but since this is a NodePort service they were still added to the pool with their health checks passing, relying on kube-proxy to forward the traffic.

In version 2.14.0 this changed, and those worker nodes are no longer added to the pool, maybe because CTLR got smarter and noticed they aren't actually running the pod targeted by the service.

But in neither version are the control-plane nodes that run the pods added to the pool; they seem to be getting filtered out by some process in CTLR.

@braunsonm
Author

braunsonm commented Sep 7, 2023

The removal of control plane nodes from the pool comes from this commit: b7d33d5 by @lavanya-f5

That commit treats a NoExecute taint as the node not being "healthy". That isn't necessarily the case; it just means that workloads on those nodes need to accept (tolerate) the taint, which things like the kube-api do when node-role.kubernetes.io/control-plane=true. I think this logic is flawed.
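To illustrate what "accepting the taint" means: the workload carries a toleration in its pod spec, e.g. a blanket one (illustrative snippet):

spec:
  tolerations:
  - operator: Exists   # empty key + Exists matches every taint, including NoExecute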

I'm not sure where the change came from in 2.14.0 that stops the worker nodes (which aren't running the pod in question) from being added to the pool. I can understand that decision a little more, but if the service's externalTrafficPolicy is Cluster (which this one's is), then all nodes should be receiving the traffic. So I think this is a separate but related bug.
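For reference, the Service above doesn't set externalTrafficPolicy, so it gets the default. Spelled out explicitly it would look roughly like this (illustrative snippet; Cluster is the default when the field is unset):

spec:
  type: NodePort
  externalTrafficPolicy: Cluster   # default when unset; kube-proxy on every node forwards the NodePort
  ports:
  - name: https
    port: 6443
    protocol: TCP
    targetPort: 6443

With Cluster, any node can proxy NodePort traffic to the endpoints, so every node should be a valid pool member.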

@trinaths added the "In Review" label and removed the "untriaged" and "no JIRA created" labels on Sep 8, 2023
@lavanya-f5
Contributor

lavanya-f5 commented Sep 14, 2023

@braunsonm For the first issue: even with node-role.kubernetes.io/control-plane=true, only a NoSchedule taint is added. NoExecute should only be added when the node is unreachable or goes to a NotReady state because of memory or disk pressure. Will discuss this with the team and get back.
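For reference, a stock kubeadm control plane node typically carries only something like this (illustrative):

spec:
  taints:
  - effect: NoSchedule
    key: node-role.kubernetes.io/control-plane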

For the second issue on 2.14, where no pool members are added (i.e. even the worker nodes are not added): we need more information to debug further.
Can you share the following information from the cluster for all nodes:
kubectl get nodes -o yaml

Also, please share the CIS logs.

@braunsonm
Author

@lavanya-f5 That isn't true for all distributions. On Rancher RKE2, NoExecute is added for the etcd role.

Server:

spec:
  taints:
  - effect: NoExecute
    key: node-role.kubernetes.io/etcd
    value: "true"
  - effect: NoSchedule
    key: node-role.kubernetes.io/control-plane
    value: "true"

Worker:

<No Taints>

Also why would you care if a node has memory or disk pressure if it's still running workloads that F5 should proxy to and is passing the health check? Just looking for NotReady should be sufficient here.
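To illustrate, a node can carry those taints and still report perfectly healthy conditions, e.g. (abridged, illustrative values):

status:
  conditions:
  - type: MemoryPressure
    status: "False"
    reason: KubeletHasSufficientMemory
  - type: DiskPressure
    status: "False"
    reason: KubeletHasNoDiskPressure
  - type: Ready
    status: "True"
    reason: KubeletReady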

RE: Workers, is there anything specific that you want me to share? I'm not comfortable posting my entire node YAMLs, including IPs, here. For the logs, CIS simply logs:

Processing Key: &{kube-system Endpoints kubernetes-vip 0xc000318740 Create }
Processing Key: &{kube-system Service kubernetes-vip 0xc000993d40 Create }
Processing Key: &{kube-system TransportServer kubernetes-vip 0xc002ee04f8 Create }
Processing Transport Server kubernetes-vip for port 443
Adding service '{kubernetes-vip kube-systm}' in CIS cache
Finished syncing transport servers <List of servers>
Updating VirtualServer Status with {IP Ok} for resource name:kubernetes-vip, namespace:kube-system

The service itself has 3 endpoints (all on the server nodes).
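For reference, the Endpoints object looks roughly like this (addresses are placeholders, not our real node IPs):

apiVersion: v1
kind: Endpoints
metadata:
  name: kubernetes-vip
  namespace: kube-system
subsets:
- addresses:
  - ip: 10.0.0.1   # control-plane node 1 (placeholder)
  - ip: 10.0.0.2   # control-plane node 2 (placeholder)
  - ip: 10.0.0.3   # control-plane node 3 (placeholder)
  ports:
  - name: https
    port: 6443
    protocol: TCP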

@lavanya-f5
Contributor

@braunsonm Thanks for the info. Will fix the logic for the NotReady check and share the build.

@lavanya-f5
Contributor

@braunsonm Dev build with the fix: quay.io/f5networks/k8s-bigip-ctlr-devel:aed60df8b8b25242f7f0ab1f41eed1b5ef7e7672. Please use this image for testing and provide your feedback if the issue still exists.

lavanya-f5 linked a pull request on Sep 20, 2023 that will close this issue
@braunsonm
Author

Tested and this seems to work! @lavanya-f5
