
Service type LoadBalancer with externalTrafficPolicy: Local needs to work out of the box #212

Closed
3 of 7 tasks
garloff opened this issue Nov 7, 2022 · 7 comments

garloff commented Nov 7, 2022

As an SCS container user, I need to be sure that deploying a standard nginx-ingress controller creates the needed L4 load balancer and that it works without me having to tweak things with provider-dependent (infrastructure-specific) annotations.

The trouble currently is that various cloud infrastructures require tweaked deployment YAMLs to pass on custom annotations, e.g. to enable a health monitor on OpenStack (with our current k8s-cluster-api-provider reference implementation, because of the interaction between a NodePort-based load balancer and the externalTrafficPolicy: Local set by nginx-ingress).
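
For illustration, a sketch of the kind of provider-specific tweak that is needed today on OpenStack; the annotation name comes from cloud-provider-openstack, while the namespace and service name assume a standard ingress-nginx deployment:

#!/bin/bash
# Illustrative workaround (not the desired end state): annotate the
# ingress service so the OpenStack CCM creates an Octavia health-monitor.
# Without it, externalTrafficPolicy: Local can blackhole traffic on nodes
# that run no ingress pod.
kubectl -n ingress-nginx annotate svc ingress-nginx-controller \
  loadbalancer.openstack.org/enable-health-monitor="true"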

Definition of Ready:

  • User Story is small enough to be finished within one sprint
  • User Story is clear and understood by the whole team
  • Acceptance criteria are defined
  • Acceptance criteria are clear and understood by the whole team

Definition of Done:

  • ADR is written
  • Conformance test exists
  • SCS Reference implementation is tweaked such that it passes
@garloff garloff added the Container Issues or pull requests relevant for Team 2: Container Infra and Tooling label Nov 7, 2022
@garloff garloff changed the title Standardize requirement for working load-balancers Servicetype: Loadbalancer needs to work out of the box Nov 7, 2022
@garloff garloff changed the title Servicetype: Loadbalancer needs to work out of the box Service type: Loadbalancer needs to work out of the box Nov 7, 2022
@horazont horazont changed the title Service type: Loadbalancer needs to work out of the box Service type: LoadBalancer needs to work out of the box Nov 7, 2022
@horazont horazont changed the title Service type: LoadBalancer needs to work out of the box Service type LoadBalancer needs to work out of the box Nov 7, 2022
@garloff garloff changed the title Service type LoadBalancer needs to work out of the box Service type LoadBalancer with externalTrafficPolicy: Local needs to work out of the box Nov 7, 2022
joshmue commented Nov 8, 2022

Simple test scripts covering that:

With externalTrafficPolicy set to Cluster (default)

#!/bin/bash

set -e

# Create a test pod and a LoadBalancer service for it (idempotent).
kubectl get pods nginx > /dev/null || kubectl run nginx --restart=Never --image=nginx --port 80
kubectl get svc nginx-svc > /dev/null || kubectl expose pod nginx --port 80 --name nginx-svc --type=LoadBalancer

# Poll until the CCM has reported an external IP for the service.
while [[ -z "$IP" ]]; do
  sleep 1
  IP=$(kubectl get svc nginx-svc '--output=go-template={{range .status.loadBalancer.ingress}}{{.ip}}{{end}}')
done

echo "Testing access to $IP.
externalTrafficPolicy: Cluster
"
set -x
for i in {1..10}; do
  curl --max-time 30 -sS "$IP" > /dev/null
done

kubectl delete pod nginx
kubectl delete svc nginx-svc

With externalTrafficPolicy set to Local

#!/bin/bash

set -e

# Create a test pod and a LoadBalancer service with externalTrafficPolicy: Local (idempotent).
kubectl get pods nginx > /dev/null || kubectl run nginx --restart=Never --image=nginx --port 80
kubectl get svc nginx-svc > /dev/null || kubectl expose pod nginx --port 80 --name nginx-svc --type=LoadBalancer --overrides='{"metadata": {"apiVersion": "v1"}, "spec": {"externalTrafficPolicy": "Local"}}'

# Poll until the CCM has reported an external IP for the service.
while [[ -z "$IP" ]]; do
  sleep 1
  IP=$(kubectl get svc nginx-svc '--output=go-template={{range .status.loadBalancer.ingress}}{{.ip}}{{end}}')
done

echo "Testing access to $IP.
externalTrafficPolicy: Local
"
for i in {1..50}; do
  curl --max-time 30 -sS "$IP" > /dev/null
done

kubectl delete pod nginx
kubectl delete svc nginx-svc

@ajfriesen

@joshmue

There is kubectl wait which might be handy here:
https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#wait

When we apply a wait condition to the test nginx pod, we know that the pod is running, so any remaining error must be coming from the load balancer.
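
A minimal sketch of that check, using the pod name from the scripts above:

# Block (up to 2 minutes) until the test pod reports Ready; any later
# failure then points at the load balancer, not the pod.
kubectl wait --for=condition=Ready pod/nginx --timeout=120s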

I am not sure if you can use kubectl wait for the load balancer service. Maybe worth a try.
It looks like some work was done on that, but I have not checked further yet:

joshmue commented Nov 14, 2022

Yes, explicitly checking whether the pod is already running would be a good idea; it rules out this potential error source.

> I am not sure if you can use kubectl wait for the load balancer service. Maybe worth a try. It looks like some work was done on that, but I have not checked further yet.

Ah, I did not know that kubectl wait can now handle non-condition fields as well, via jsonpath. Interesting.
But it does not enable querying more information; we would still have to rely on .status.loadBalancer.ingress[0].ip.
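
For reference, roughly what that wait would look like; note that waiting on a jsonpath without a comparison value requires a fairly recent kubectl:

# Wait until the service has been assigned an ingress IP at all.
kubectl wait --for=jsonpath='{.status.loadBalancer.ingress[0].ip}' svc/nginx-svc --timeout=120s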

In the testing setup, this field is set earlier than traffic actually reaches the running pod. Another idea might be to look into K8s events, but I do not think they are meant for automated use cases.

So, this is either a fixable bug in the current implementation, or we could adjust the script to treat the first unsuccessful requests as non-fatal and only consider the case failed if (a) a testcase timeout occurs or (b) a request fails after a prior request succeeded.

@ajfriesen

@joshmue
I just checked the gridscale CCM implementation for the load balancer.
Our internal cloud stack will only hand out the IP once the load balancer is fully provisioned and ready to accept traffic, so the CCM actually gets the object in a usable state.

I could not find any spec from k8s on this yet; I think we just did it for our own sanity.
It makes sense anyway to report the load balancer's IP only once the load balancer is ready. This might be a biased viewpoint, though.
I guess we are forced to use a loop and/or a time condition for this test.
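
A sketch of such a bounded wait loop, reusing the polling from the scripts above (the 120s deadline is an arbitrary choice):

#!/bin/bash
# Poll for the LB IP, but give up after a deadline instead of looping forever.
deadline=$(($(date +%s) + 120))
while [[ -z "$IP" ]]; do
  if [[ $(date +%s) -ge $deadline ]]; then
    echo "timed out waiting for LB IP" >&2
    exit 1
  fi
  sleep 2
  IP=$(kubectl get svc nginx-svc '--output=go-template={{range .status.loadBalancer.ingress}}{{.ip}}{{end}}')
done
echo "LB IP: $IP"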

garloff commented Nov 21, 2022

Decision:
We can't find rules that would mandate immediate availability after the LB reports being there.

  • Add a reasonable timeout (~1 min) within which the LB has to actually start load-balancing traffic
  • Also fail early if a request succeeded and a subsequent one fails

Additionally, look into the CCM to see whether we can delay reporting the LB service IP until the LB is ACTIVE.
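
For the OpenStack case, the signal the CCM would have to wait for is Octavia's provisioning status; out of band it can be checked roughly like this (the LB name follows the CCM's kube_service_<cluster>_<namespace>_<service> convention and is only an example):

# The LB is only fully set up once Octavia reports ACTIVE here.
openstack loadbalancer show kube_service_mycluster_default_nginx-svc \
  -f value -c provisioning_status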

joshmue commented Dec 5, 2022

Updated simple test scripts:

With externalTrafficPolicy set to Cluster (default)

#!/bin/bash

set -e

# Random suffix so repeated runs do not collide with leftover resources.
suffix=$(openssl rand -hex 3)

kubectl get pods nginx-$suffix > /dev/null || kubectl run nginx-$suffix --restart=Never --image=nginx --port 80
kubectl get svc nginx-$suffix-svc > /dev/null || kubectl expose pod nginx-$suffix --port 80 --name nginx-$suffix-svc --type=LoadBalancer

# Poll until the CCM has reported an external IP for the service.
while [[ -z "$IP" ]]; do
  sleep 1
  IP=$(kubectl get svc nginx-$suffix-svc '--output=go-template={{range .status.loadBalancer.ingress}}{{.ip}}{{end}}')
done

echo "Testing access to $IP.
externalTrafficPolicy: Cluster
"
set -x
# Pretest: give the LB some time to actually start forwarding traffic.
for i in {1..30}; do
  curl --max-time 5 -sS "$IP" > /dev/null && break || echo "pretest $i: curl failed, keep waiting until one call succeeds or the pretest budget is exhausted"
done
echo "Do actual testing, now"
for i in {1..50}; do
  curl --max-time 2 -sS "$IP" > /dev/null
  echo "test $i: succeeded"
done

kubectl delete pod nginx-$suffix
kubectl delete svc nginx-$suffix-svc

With externalTrafficPolicy set to Local

#!/bin/bash

set -e

# Random suffix so repeated runs do not collide with leftover resources.
suffix=$(openssl rand -hex 3)

kubectl get pods nginx-$suffix > /dev/null || kubectl run nginx-$suffix --restart=Never --image=nginx --port 80
kubectl get svc nginx-$suffix-svc > /dev/null || kubectl expose pod nginx-$suffix --port 80 --name nginx-$suffix-svc --type=LoadBalancer --overrides='{"metadata": {"apiVersion": "v1"}, "spec": {"externalTrafficPolicy": "Local"}}'

# Poll until the CCM has reported an external IP for the service.
while [[ -z "$IP" ]]; do
  sleep 1
  IP=$(kubectl get svc nginx-$suffix-svc '--output=go-template={{range .status.loadBalancer.ingress}}{{.ip}}{{end}}')
done

echo "Testing access to $IP.
externalTrafficPolicy: Local
"
set -x
# Pretest: give the LB some time to actually start forwarding traffic.
for i in {1..30}; do
  curl --max-time 5 -sS "$IP" > /dev/null && break || echo "pretest $i: curl failed, keep waiting until one call succeeds or the pretest budget is exhausted"
done
echo "Do actual testing, now"
for i in {1..50}; do
  curl --max-time 2 -sS "$IP" > /dev/null
  echo "test $i: succeeded"
done

kubectl delete pod nginx-$suffix
kubectl delete svc nginx-$suffix-svc

garloff commented Dec 5, 2022

The test fails, as it should, against a load balancer on our reference implementation unless we enable health monitors by default, which is exactly what we expect.
