
endless port+floating-ip allocation (LB for kube-api) seems to happen again #480

Closed
mxmxchere opened this issue Aug 1, 2023 · 11 comments
Labels
Container (Issues or pull requests relevant for Team 2: Container Infra and Tooling), on hold, Sprint Izmir (2023, cwk 32+33), Sprint Jena (2023, cwk 34+35)

Comments

@mxmxchere
Contributor

This issue was originally described here #179.

We are currently using the OpenStack CAPO controller version 0.7.3 with the regiocloud OpenStack infrastructure, and this seems to happen again. The theory that this has something to do with the CA being supplied to the capo-controller-manager does not hold (it happened now with the CA being there right from the beginning).

@chess-knight
Member

Could #283 be related?

@mxmxchere
Contributor Author

Maybe; I have not checked whether we accidentally had two unrelated subnets.

It might well be that this is a distinct issue, as the error message was something like "could not associate floating ip to loadbalancer because the loadbalancer already has a floating ip attached". We could also reproduce via the Horizon dashboard that attaching a FIP to a load balancer fails if it already has one.

@mxmxchere
Contributor Author

Warning  Failedassociatefloatingip   28m  openstack-controller  Failed to associate floating IP 81.163.192.187 with port 88fafc84-c141-4e01-9340-efc576038eca: Expected HTTP response code [200] when accessing [PUT https://neutron.services.a.regiocloud.tech/v2.0/floatingips/3773b8b2-e643-443e-b2b3-ef5e3d9a6e2b], but got 409 instead
{"NeutronError": {"type": "FloatingIPPortAlreadyAssociated", "message": "Cannot associate floating IP 81.163.192.187 (3773b8b2-e643-443e-b2b3-ef5e3d9a6e2b) with port 88fafc84-c141-4e01-9340-efc576038eca using fixed IP 10.8.0.62, as that fixed IP already has a floating IP on external network e6be7364-bfd8-4de7-8120-8f41c69a139a.", "detail": ""}}
Warning  Failedassociatefloatingip  28m (x3 over 28m)  openstack-controller  (combined from similar events): Failed to associate floating IP 81.163.193.138 with port 88fafc84-c141-4e01-9340-efc576038eca: Expected HTTP response code [200] when accessing [PUT https://neutron.services.a.regiocloud.tech/v2.0/floatingips/607f553c-c8d8-4922-99c2-2cfb76e8a4da], but got 409 instead

This is the complete error message.
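
For context, the 409 that Neutron returns here can be checked for ahead of time by asking whether the port's fixed IP already has a floating IP attached. Below is a minimal sketch using the openstacksdk Python client; "my-cloud" is a placeholder clouds.yaml entry, and the port ID is only copied from the log above for illustration:

```python
# Hedged sketch using openstacksdk: before associating a floating IP, check
# whether the port's fixed IP already has one (otherwise Neutron answers with
# 409 FloatingIPPortAlreadyAssociated, as seen in the events above).
import openstack

conn = openstack.connect(cloud="my-cloud")  # placeholder clouds.yaml entry

port = conn.network.get_port("88fafc84-c141-4e01-9340-efc576038eca")
for fixed in port.fixed_ips:
    # list floating IPs that already reference this port and fixed IP
    existing = list(conn.network.ips(port_id=port.id,
                                     fixed_ip_address=fixed["ip_address"]))
    if existing:
        print(f"fixed IP {fixed['ip_address']} already has floating IP(s): "
              f"{[fip.floating_ip_address for fip in existing]}")
```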

@DEiselt
Contributor

DEiselt commented Aug 2, 2023

The FloatingIPPortAlreadyAssociated error is to be expected from the OpenStack side as far as I know. I would also consider it a mere symptom of what is actually happening.

While trying to reproduce this to investigate the cause, I made multiple observations. Generally, I think it is related to #283, because it worked fine when starting from a clean project (my attempt) but did not work for @mxmxchere, who tried to create a cluster in the same project shortly after.

Follow-up on the current state:

  • Should it be possible to create multiple independent clusters in the same project?
    • This is relevant because the environments file contains multiple options to "uniquify" a cluster.
  • make clean and make fullclean do not seem to work correctly, but I don't know whether that is a general problem or due to inconsistencies caused by multiple clusters.

@chess-knight
Member

chess-knight commented Aug 2, 2023

  • Should it be possible to create multiple independent clusters in the same project?
    • This is relevant because the environments file contains multiple options to "uniquify" a cluster.

As far as I know, it is possible to deploy multiple clusters in the same OpenStack project. @matofeder and I successfully share one OpenStack project. See also #343, where the last missing piece was implemented to fully support this use case.
The options which need to be different for us are prefix and testcluster_name.

make clean and make fullclean do not seem to work correctly, but I don't know whether that is a general problem or due to inconsistencies caused by multiple clusters.

I think it is related to inconsistencies with multiple clusters. Personally, most of the time I prefer make fullclean, or delete_cluster.sh on the management server when I want to delete only the workload cluster.

@DEiselt
Contributor

DEiselt commented Aug 10, 2023

After further investigation, this issue seems to revolve around the testcluster_name.

Reproduction:

I was able to reproduce it by using two separate application credentials and environment files, setting a different prefix but using the same testcluster_name.

Problem:

This looks like a "race condition" where two CAPO controllers are trying to reconcile the same LB with different FIPs. One of them "wins" by being first, leaving the second to constantly try to attach its FIP to the LB.
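
To illustrate the collision, here is a hedged sketch of what happens when two deployments share the default testcluster_name: both derive the same load balancer name and therefore resolve to the same Octavia object. The cloud name is a placeholder, and the naming pattern follows the k8s-clusterapi-cluster-$CLUSTER_NAME-$CLUSTER_NAME-kubeapi scheme quoted later in this issue:

```python
# Hedged sketch: two environments sharing the same cluster name construct the
# same load balancer name, so a lookup by name resolves to the same Octavia
# load balancer for both controllers. The cloud name is a placeholder.
import openstack

def kubeapi_lb_name(cluster_name: str) -> str:
    # Without a per-deployment prefix, the name depends only on the cluster name.
    return f"k8s-clusterapi-cluster-{cluster_name}-{cluster_name}-kubeapi"

name_a = kubeapi_lb_name("testcluster")  # deployment A
name_b = kubeapi_lb_name("testcluster")  # deployment B, same default name

conn = openstack.connect(cloud="my-cloud")
lb = conn.load_balancer.find_load_balancer(name_a)
print(name_a == name_b, lb.id if lb else None)  # both reconcile the same LB
```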

Options to fix:

  1. add a random / unique identifier to the (default) testcluster_name (quick fix)
  2. remove the default name to force it to be specified in the environment file (quick fix)
  3. collect and use the OpenStack UUIDs of resources for managing them (more involved, might also affect other components like the CAPO controller itself)

As a bonus for the CAPO controller, there should be a limit on how many attempts are made before aborting, because public (and limited) IPv4 addresses are being used.
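
As an illustration of that safeguard (not the actual CAPO code), here is a hedged sketch that caps the number of association attempts and releases the floating IP again on a 409, so no public IPv4 address is leaked; the function name and the limit are made up for this example:

```python
# Hedged sketch of a bounded retry: allocate a FIP, try to associate it with
# the port, and release it again on a 409 conflict instead of retrying
# (and allocating) endlessly. associate_with_limit() is a hypothetical helper.
import openstack
from openstack import exceptions

MAX_ATTEMPTS = 3  # illustrative limit

def associate_with_limit(conn, port_id, floating_network_id):
    for attempt in range(1, MAX_ATTEMPTS + 1):
        fip = conn.network.create_ip(floating_network_id=floating_network_id)
        try:
            return conn.network.update_ip(fip, port_id=port_id)
        except exceptions.ConflictException:
            # 409 FloatingIPPortAlreadyAssociated: release the unused FIP
            # so a public IPv4 address is not leaked.
            conn.network.delete_ip(fip)
    raise RuntimeError(f"giving up after {MAX_ATTEMPTS} attempts")
```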

@mxmxchere
Contributor Author

We can combine options 1 and 2:

  • By default there is no testcluster_name setting and a random name is used.
  • If testcluster_name is set, it is used (maybe with an optional warning: "Static cluster name set, this could lead to resource conflicts when deployed multiple times").

That way the behaviour of existing installations does not change, while helping to avoid the resource conflicts; a sketch of this behaviour follows below.
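
A minimal sketch of the proposal, with illustrative names only (the real logic would live in the repository's environment handling):

```python
# Hedged sketch of the proposal above: use testcluster_name when it is set
# (with a warning), otherwise fall back to a randomly generated name.
import secrets
import warnings

def effective_cluster_name(testcluster_name: str | None) -> str:
    if testcluster_name:
        warnings.warn("Static cluster name set, this could lead to resource "
                      "conflicts when deployed multiple times")
        return testcluster_name
    # nothing configured: generate a collision-resistant default
    return f"testcluster-{secrets.token_hex(3)}"
```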

@chess-knight
Member

#495 is probably also related

@chess-knight chess-knight added the Container Issues or pull requests relevant for Team 2: Container Infra and Tooling label Aug 15, 2023
@DEiselt DEiselt linked a pull request Aug 16, 2023 that will close this issue
@DEiselt
Contributor

DEiselt commented Aug 17, 2023

Approach to solve this, as discussed in the container meeting on 17 August:

  1. make sure that $prefix is used in the load balancer name (Prefix all resources with PREFIX and CLUSTER_NAME, #495)
  2. make sure that the prefix is unique (Detect prefix conflict and output intelligible error message, #506); a sketch of such a check follows below
  3. optional, for convenience: use either a user-defined or a randomly generated value for $prefix to avoid forcing users to come up with names
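
For item 2, a hedged sketch of what such a uniqueness check could look like with openstacksdk; the resource types checked and the cloud name are assumptions, and the actual implementation in #506 may differ:

```python
# Hedged sketch of a prefix-uniqueness check in the spirit of #506: refuse to
# deploy if OpenStack resources whose names start with the chosen prefix
# already exist in the project.
import sys
import openstack

def prefix_in_use(conn, prefix: str) -> bool:
    for router in conn.network.routers():
        if router.name and router.name.startswith(prefix):
            return True
    for network in conn.network.networks():
        if network.name and network.name.startswith(prefix):
            return True
    return False

conn = openstack.connect(cloud="my-cloud")  # placeholder clouds.yaml entry
if prefix_in_use(conn, "testcluster-"):
    sys.exit("prefix already in use in this project, choose a unique one")
```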

@jschoone jschoone added the Sprint Izmir Sprint Izmir (2023, cwk 32+33) label Aug 23, 2023
@jschoone jschoone added the Sprint Jena Sprint Jena (2023, cwk 34+35) label Sep 5, 2023
@DEiselt
Contributor

DEiselt commented Sep 5, 2023

I closed the draft PR for the randomized portion of the testcluster_name in favor of #495 and #506 solving this issue. This way we can avoid relying on randomly generated values being part of the cluster name.

@DEiselt
Contributor

DEiselt commented Sep 7, 2023

Loadbalancers (kubeapi): k8s-clusterapi-cluster-$CLUSTER_NAME-$CLUSTER_NAME-kubeapi (not good: Lacks $PREFIX)

This is related to / blocked by the above part of #495. Once the load balancer name includes the $PREFIX, this issue should be fixed, because #506 was implemented to make sure that the prefix is unique. See the comment above for background on why the prefix is necessary / should fix the issue.

@jschoone jschoone added the on hold Is on hold label Oct 10, 2023