
Service type LoadBalancer with externalTrafficPolicy: Local needs to work out of the box #212

Closed
3 of 7 tasks
garloff opened this issue Nov 7, 2022 · 7 comments

garloff commented Nov 7, 2022

As an SCS container user, I need to be sure that deploying a standard nginx-ingress controller creates the needed L4 load balancer and that it works without me having to tweak things with provider-dependent (infrastructure-specific) annotations.

The trouble currently is that various cloud infrastructures require tweaked deployment YAMLs to pass on custom annotations, e.g. to enable a health monitor on OpenStack (with our current k8s-cluster-api-provider reference implementation, because of the interaction between a NodePort-based load balancer and the externalTrafficPolicy: Local set by nginx-ingress).
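
For illustration, a sketch of the kind of provider-specific tweak that is needed today on OpenStack; the annotation name comes from cloud-provider-openstack, while the namespace and service name assume a standard ingress-nginx deployment:

#!/bin/bash
# Illustrative workaround (not the desired end state): annotate the
# ingress service so the OpenStack CCM creates an Octavia health-monitor.
# Without it, externalTrafficPolicy: Local can blackhole traffic on nodes
# that run no ingress pod.
kubectl -n ingress-nginx annotate svc ingress-nginx-controller \
  loadbalancer.openstack.org/enable-health-monitor="true"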

Definition of Ready:

  • User Story is small enough to be finished within one sprint
  • User Story is clear and understood by the whole team
  • Acceptance criteria are defined
  • Acceptance criteria are clear and understood by the whole team

Definition of Done:

  • ADR is written
  • Conformance test exists
  • SCS Reference implementation is tweaked such that it passes
@garloff garloff added the Container Issues or pull requests relevant for Team 2: Container Infra and Tooling label Nov 7, 2022
@garloff garloff changed the title Standardize requirement for working load-balancers Servicetype: Loadbalancer needs to work out of the box Nov 7, 2022
@garloff garloff changed the title Servicetype: Loadbalancer needs to work out of the box Service type: Loadbalancer needs to work out of the box Nov 7, 2022
@horazont horazont changed the title Service type: Loadbalancer needs to work out of the box Service type: LoadBalancer needs to work out of the box Nov 7, 2022
@horazont horazont changed the title Service type: LoadBalancer needs to work out of the box Service type LoadBalancer needs to work out of the box Nov 7, 2022
@garloff garloff changed the title Service type LoadBalancer needs to work out of the box Service type LoadBalancer with externalTrafficPolicy: Local needs to work out of the box Nov 7, 2022
joshmue commented Nov 8, 2022

Simple test scripts covering that:

With externalTrafficPolicy set to Cluster (default)

#!/bin/bash

set -e

# Create a test pod and a LoadBalancer service for it (idempotent).
kubectl get pods nginx > /dev/null || kubectl run nginx --restart=Never --image=nginx --port 80
kubectl get svc nginx-svc > /dev/null || kubectl expose pod nginx --port 80 --name nginx-svc --type=LoadBalancer

# Poll until the CCM has reported an external IP for the service.
while [[ -z "$IP" ]]; do
  sleep 1
  IP=$(kubectl get svc nginx-svc '--output=go-template={{range .status.loadBalancer.ingress}}{{.ip}}{{end}}')
done

echo "Testing access to $IP.
externalTrafficPolicy: Cluster
"
set -x
for i in {1..10}; do
  curl --max-time 30 -sS "$IP" > /dev/null
done

kubectl delete pod nginx
kubectl delete svc nginx-svc

With externalTrafficPolicy set to Local

#!/bin/bash

set -e

# Create a test pod and a LoadBalancer service with externalTrafficPolicy: Local (idempotent).
kubectl get pods nginx > /dev/null || kubectl run nginx --restart=Never --image=nginx --port 80
kubectl get svc nginx-svc > /dev/null || kubectl expose pod nginx --port 80 --name nginx-svc --type=LoadBalancer --overrides='{"metadata": {"apiVersion": "v1"}, "spec": {"externalTrafficPolicy": "Local"}}'

# Poll until the CCM has reported an external IP for the service.
while [[ -z "$IP" ]]; do
  sleep 1
  IP=$(kubectl get svc nginx-svc '--output=go-template={{range .status.loadBalancer.ingress}}{{.ip}}{{end}}')
done

echo "Testing access to $IP.
externalTrafficPolicy: Local
"
for i in {1..50}; do
  curl --max-time 30 -sS "$IP" > /dev/null
done

kubectl delete pod nginx
kubectl delete svc nginx-svc

@ajfriesen

@joshmue

There is kubectl wait which might be handy here:
https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#wait

When we apply a wait condition to the test nginx pod, we know that the pod is running, so any remaining error must be coming from the load balancer.
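
A minimal sketch of that check, using the pod name from the scripts above:

# Block (up to 2 minutes) until the test pod reports Ready; any later
# failure then points at the load balancer, not the pod.
kubectl wait --for=condition=Ready pod/nginx --timeout=120s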

I am not sure if you can use kubectl wait for the load balancer service. Maybe worth a try.
It looks like some work was done on that, but I have not checked further yet:

joshmue commented Nov 14, 2022

Yes, explicitly checking whether the pod is already running would be a good idea; it rules out this potential error source.

> I am not sure if you can use kubectl wait for the load balancer service. Maybe worth a try. It looks like some work was done on that, but I have not checked further yet.

Ah, I did not know that kubectl wait can now handle non-condition fields as well, via jsonpath. Interesting.
But it does not enable querying more information; we would still have to rely on .status.loadBalancer.ingress[0].ip.
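
For reference, roughly what that wait would look like; note that waiting on a jsonpath without a comparison value requires a fairly recent kubectl:

# Wait until the service has been assigned an ingress IP at all.
kubectl wait --for=jsonpath='{.status.loadBalancer.ingress[0].ip}' svc/nginx-svc --timeout=120s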

In the testing setup, this field is set earlier than traffic actually reaches the running pod. Another idea might be to look into K8s events, but I do not think they are meant for automated use cases.

So, this is either a fixable bug in the current implementation, or we could adjust the script to treat the first unsuccessful requests as non-fatal and only consider the case failed if (a) a testcase timeout occurs or (b) a request fails after a prior request succeeded.

@ajfriesen

@joshmue
I just checked the gridscale CCM implementation for the load balancer.
Our internal cloud stack will only hand out the IP once the load balancer is fully provisioned and ready to accept traffic, so the CCM actually gets the object in a usable state.

I could not find any spec from k8s on this yet; I think we just did it for our own sanity.
It makes sense anyway to report the load balancer's IP only once the load balancer is ready. This might be a biased viewpoint, though.
I guess we are forced to use a loop and/or a time condition for this test.
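
A sketch of such a bounded wait loop, reusing the polling from the scripts above (the 120s deadline is an arbitrary choice):

#!/bin/bash
# Poll for the LB IP, but give up after a deadline instead of looping forever.
deadline=$(($(date +%s) + 120))
while [[ -z "$IP" ]]; do
  if [[ $(date +%s) -ge $deadline ]]; then
    echo "timed out waiting for LB IP" >&2
    exit 1
  fi
  sleep 2
  IP=$(kubectl get svc nginx-svc '--output=go-template={{range .status.loadBalancer.ingress}}{{.ip}}{{end}}')
done
echo "LB IP: $IP"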

garloff commented Nov 21, 2022

Decision:
We can't find rules that would mandate immediate availability after the LB reports being there.

  • Add a reasonable timeout (~1 min) within which the LB has to actually start load-balancing traffic
  • Also fail early if a request succeeded and a subsequent one fails

Additionally, look into the CCM to see whether we can delay reporting the LB service IP until the LB is ACTIVE.
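
For the OpenStack case, the signal the CCM would have to wait for is Octavia's provisioning status; out of band it can be checked roughly like this (the LB name follows the CCM's kube_service_<cluster>_<namespace>_<service> convention and is only an example):

# The LB is only fully set up once Octavia reports ACTIVE here.
openstack loadbalancer show kube_service_mycluster_default_nginx-svc \
  -f value -c provisioning_status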

joshmue commented Dec 5, 2022

Updated simple test scripts:

With externalTrafficPolicy set to Cluster (default)

#!/bin/bash

set -e

# Random suffix so repeated runs do not collide with leftover resources.
suffix=$(openssl rand -hex 3)

kubectl get pods nginx-$suffix > /dev/null || kubectl run nginx-$suffix --restart=Never --image=nginx --port 80
kubectl get svc nginx-$suffix-svc > /dev/null || kubectl expose pod nginx-$suffix --port 80 --name nginx-$suffix-svc --type=LoadBalancer

# Poll until the CCM has reported an external IP for the service.
while [[ -z "$IP" ]]; do
  sleep 1
  IP=$(kubectl get svc nginx-$suffix-svc '--output=go-template={{range .status.loadBalancer.ingress}}{{.ip}}{{end}}')
done

echo "Testing access to $IP.
externalTrafficPolicy: Cluster
"
set -x
# Pretest: give the LB some time to actually start forwarding traffic.
for i in {1..30}; do
  curl --max-time 5 -sS "$IP" > /dev/null && break || echo "pretest $i: curl failed, keep waiting until one call succeeds or the pretest budget is exhausted"
done
echo "Do actual testing, now"
for i in {1..50}; do
  curl --max-time 2 -sS "$IP" > /dev/null
  echo "test $i: succeeded"
done

kubectl delete pod nginx-$suffix
kubectl delete svc nginx-$suffix-svc

With externalTrafficPolicy set to Local

#!/bin/bash

set -e

# Random suffix so repeated runs do not collide with leftover resources.
suffix=$(openssl rand -hex 3)

kubectl get pods nginx-$suffix > /dev/null || kubectl run nginx-$suffix --restart=Never --image=nginx --port 80
kubectl get svc nginx-$suffix-svc > /dev/null || kubectl expose pod nginx-$suffix --port 80 --name nginx-$suffix-svc --type=LoadBalancer --overrides='{"metadata": {"apiVersion": "v1"}, "spec": {"externalTrafficPolicy": "Local"}}'

# Poll until the CCM has reported an external IP for the service.
while [[ -z "$IP" ]]; do
  sleep 1
  IP=$(kubectl get svc nginx-$suffix-svc '--output=go-template={{range .status.loadBalancer.ingress}}{{.ip}}{{end}}')
done

echo "Testing access to $IP.
externalTrafficPolicy: Local
"
set -x
# Pretest: give the LB some time to actually start forwarding traffic.
for i in {1..30}; do
  curl --max-time 5 -sS "$IP" > /dev/null && break || echo "pretest $i: curl failed, keep waiting until one call succeeds or the pretest budget is exhausted"
done
echo "Do actual testing, now"
for i in {1..50}; do
  curl --max-time 2 -sS "$IP" > /dev/null
  echo "test $i: succeeded"
done

kubectl delete pod nginx-$suffix
kubectl delete svc nginx-$suffix-svc

garloff commented Dec 5, 2022

The test fails, as it should, against a load balancer on our reference implementation unless we enable health monitors by default, which is exactly what we expect.
