
endless port+floating-ip allocation (LB for kube-api) seems to happen again #480

Closed
mxmxchere opened this issue Aug 1, 2023 · 11 comments
Labels
Container (Issues or pull requests relevant for Team 2: Container Infra and Tooling), on hold, Sprint Izmir (2023, cwk 32+33), Sprint Jena (2023, cwk 34+35)

Comments

@mxmxchere
Contributor

This issue was originally described here #179.

We are currently using the OpenStack CAPO controller version 0.7.3 with the regiocloud OpenStack infrastructure, and this seems to happen again. The theory that this has something to do with the CA being supplied to the capo-controller-manager does not hold (it happened now with the CA being there right from the beginning).

@chess-knight
Member

Could #283 be related?

@mxmxchere
Contributor Author

Maybe; I have not checked whether we accidentally had two unrelated subnets.

It might well be that this is a distinct issue, as the error message was something like "could not associate floating ip to loadbalancer because the loadbalancer already has a floating ip attached". We could also reproduce via the Horizon dashboard that attaching a FIP to a load balancer fails if it already has one.

@mxmxchere
Contributor Author

Warning  Failedassociatefloatingip   28m  openstack-controller  Failed to associate floating IP 81.163.192.187 with port 88fafc84-c141-4e01-9340-efc576038eca: Expected HTTP response code [200] when accessing [PUT https://neutron.services.a.regiocloud.tech/v2.0/floatingips/3773b8b2-e643-443e-b2b3-ef5e3d9a6e2b], but got 409 instead
{"NeutronError": {"type": "FloatingIPPortAlreadyAssociated", "message": "Cannot associate floating IP 81.163.192.187 (3773b8b2-e643-443e-b2b3-ef5e3d9a6e2b) with port 88fafc84-c141-4e01-9340-efc576038eca using fixed IP 10.8.0.62, as that fixed IP already has a floating IP on external network e6be7364-bfd8-4de7-8120-8f41c69a139a.", "detail": ""}}
Warning  Failedassociatefloatingip  28m (x3 over 28m)  openstack-controller  (combined from similar events): Failed to associate floating IP 81.163.193.138 with port 88fafc84-c141-4e01-9340-efc576038eca: Expected HTTP response code [200] when accessing [PUT https://neutron.services.a.regiocloud.tech/v2.0/floatingips/607f553c-c8d8-4922-99c2-2cfb76e8a4da], but got 409 instead

This is the complete error message.
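
For context, the 409 that Neutron returns here can be checked for ahead of time by asking whether the port's fixed IP already has a floating IP attached. Below is a minimal sketch using the openstacksdk Python client; "my-cloud" is a placeholder clouds.yaml entry, and the port ID is only copied from the log above for illustration:

```python
# Hedged sketch using openstacksdk: before associating a floating IP, check
# whether the port's fixed IP already has one (otherwise Neutron answers with
# 409 FloatingIPPortAlreadyAssociated, as seen in the events above).
import openstack

conn = openstack.connect(cloud="my-cloud")  # placeholder clouds.yaml entry

port = conn.network.get_port("88fafc84-c141-4e01-9340-efc576038eca")
for fixed in port.fixed_ips:
    # list floating IPs that already reference this port and fixed IP
    existing = list(conn.network.ips(port_id=port.id,
                                     fixed_ip_address=fixed["ip_address"]))
    if existing:
        print(f"fixed IP {fixed['ip_address']} already has floating IP(s): "
              f"{[fip.floating_ip_address for fip in existing]}")
```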

@DEiselt
Contributor

DEiselt commented Aug 2, 2023

The FloatingIPPortAlreadyAssociated error is to be expected from the OpenStack side as far as I know. I would also consider it a mere symptom of what is actually happening.

While trying to reproduce this to investigate the cause, I made multiple observations. Generally, I think it is related to #283, because it worked fine when starting from a clean project (my attempt) but did not work for @mxmxchere, who tried to create a cluster in the same project shortly after.

Follow-up on the current state:

  • Should it be possible to create multiple independent clusters in the same project?
    • This is relevant because the environments file contains multiple options to "uniquify" a cluster.
  • make clean and make fullclean do not seem to work correctly, but I don't know whether that is a general problem or due to inconsistencies caused by multiple clusters.

@chess-knight
Member

chess-knight commented Aug 2, 2023

  • Should it be possible to create multiple independent clusters in the same project?
    • This is relevant because the environments file contains multiple options to "uniquify" a cluster.

As far as I know, it is possible to deploy multiple clusters in the same OpenStack project. @matofeder and I successfully share one OpenStack project. See also #343, where the last missing piece was implemented to fully support this use case.
The options which need to be different for us are prefix and testcluster_name.

make clean and make fullclean do not seem to work correctly, but I don't know whether that is a general problem or due to inconsistencies caused by multiple clusters.

I think it is related to inconsistencies with multiple clusters. Personally, most of the time I prefer make fullclean, or delete_cluster.sh on the management server when I want to delete only the workload cluster.

@DEiselt
Contributor

DEiselt commented Aug 10, 2023

After further investigation, this issue seems to revolve around the testcluster_name.

Reproduction:

I was able to reproduce it by using two separate application credentials and environment files, setting a different prefix but using the same testcluster_name.

Problem:

This looks like a "race condition" where two CAPO controllers are trying to reconcile the same LB with different FIPs. One of them "wins" by being first, leaving the second to constantly try to attach its FIP to the LB.
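
To illustrate the collision, here is a hedged sketch of what happens when two deployments share the default testcluster_name: both derive the same load balancer name and therefore resolve to the same Octavia object. The cloud name is a placeholder, and the naming pattern follows the k8s-clusterapi-cluster-$CLUSTER_NAME-$CLUSTER_NAME-kubeapi scheme quoted later in this issue:

```python
# Hedged sketch: two environments sharing the same cluster name construct the
# same load balancer name, so a lookup by name resolves to the same Octavia
# load balancer for both controllers. The cloud name is a placeholder.
import openstack

def kubeapi_lb_name(cluster_name: str) -> str:
    # Without a per-deployment prefix, the name depends only on the cluster name.
    return f"k8s-clusterapi-cluster-{cluster_name}-{cluster_name}-kubeapi"

name_a = kubeapi_lb_name("testcluster")  # deployment A
name_b = kubeapi_lb_name("testcluster")  # deployment B, same default name

conn = openstack.connect(cloud="my-cloud")
lb = conn.load_balancer.find_load_balancer(name_a)
print(name_a == name_b, lb.id if lb else None)  # both reconcile the same LB
```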

Options to fix:

  1. add a random / unique identifier to the (default) testcluster_name (quick fix)
  2. remove the default name to force it to be specified in the environment file (quick fix)
  3. collect and use the OpenStack UUIDs of resources for managing them (more involved, might also affect other components like the CAPO controller itself)

As a bonus for the CAPO controller, there should be a limit on how many attempts are made before aborting, because public (and limited) IPv4 addresses are being used.
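
As an illustration of that safeguard (not the actual CAPO code), here is a hedged sketch that caps the number of association attempts and releases the floating IP again on a 409, so no public IPv4 address is leaked; the function name and the limit are made up for this example:

```python
# Hedged sketch of a bounded retry: allocate a FIP, try to associate it with
# the port, and release it again on a 409 conflict instead of retrying
# (and allocating) endlessly. associate_with_limit() is a hypothetical helper.
import openstack
from openstack import exceptions

MAX_ATTEMPTS = 3  # illustrative limit

def associate_with_limit(conn, port_id, floating_network_id):
    for attempt in range(1, MAX_ATTEMPTS + 1):
        fip = conn.network.create_ip(floating_network_id=floating_network_id)
        try:
            return conn.network.update_ip(fip, port_id=port_id)
        except exceptions.ConflictException:
            # 409 FloatingIPPortAlreadyAssociated: release the unused FIP
            # so a public IPv4 address is not leaked.
            conn.network.delete_ip(fip)
    raise RuntimeError(f"giving up after {MAX_ATTEMPTS} attempts")
```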

@mxmxchere
Contributor Author

We can combine options 1 and 2:

  • By default there is no testcluster_name setting and a random name is used.
  • If testcluster_name is set, it is used (maybe with an optional warning: "Static cluster name set, this could lead to resource conflicts when deployed multiple times").

That way the behaviour of existing installations does not change, while helping to avoid the resource conflicts; a sketch of this behaviour follows below.
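
A minimal sketch of the proposal, with illustrative names only (the real logic would live in the repository's environment handling):

```python
# Hedged sketch of the proposal above: use testcluster_name when it is set
# (with a warning), otherwise fall back to a randomly generated name.
import secrets
import warnings

def effective_cluster_name(testcluster_name: str | None) -> str:
    if testcluster_name:
        warnings.warn("Static cluster name set, this could lead to resource "
                      "conflicts when deployed multiple times")
        return testcluster_name
    # nothing configured: generate a collision-resistant default
    return f"testcluster-{secrets.token_hex(3)}"
```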

@chess-knight
Member

#495 is probably also related

@chess-knight chess-knight added the Container Issues or pull requests relevant for Team 2: Container Infra and Tooling label Aug 15, 2023
@DEiselt DEiselt linked a pull request Aug 16, 2023 that will close this issue
@DEiselt
Contributor

DEiselt commented Aug 17, 2023

Approach to solve this, as discussed in the container meeting on 17 August:

  1. make sure that $prefix is used in the load balancer name (Prefix all resources with PREFIX and CLUSTER_NAME, #495)
  2. make sure that the prefix is unique (Detect prefix conflict and output intelligible error message, #506); a sketch of such a check follows below
  3. optional, for convenience: use either a user-defined or a randomly generated value for $prefix to avoid forcing users to come up with names
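
For item 2, a hedged sketch of what such a uniqueness check could look like with openstacksdk; the resource types checked and the cloud name are assumptions, and the actual implementation in #506 may differ:

```python
# Hedged sketch of a prefix-uniqueness check in the spirit of #506: refuse to
# deploy if OpenStack resources whose names start with the chosen prefix
# already exist in the project.
import sys
import openstack

def prefix_in_use(conn, prefix: str) -> bool:
    for router in conn.network.routers():
        if router.name and router.name.startswith(prefix):
            return True
    for network in conn.network.networks():
        if network.name and network.name.startswith(prefix):
            return True
    return False

conn = openstack.connect(cloud="my-cloud")  # placeholder clouds.yaml entry
if prefix_in_use(conn, "testcluster-"):
    sys.exit("prefix already in use in this project, choose a unique one")
```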

@jschoone jschoone added the Sprint Izmir Sprint Izmir (2023, cwk 32+33) label Aug 23, 2023
@jschoone jschoone added the Sprint Jena Sprint Jena (2023, cwk 34+35) label Sep 5, 2023
@DEiselt
Contributor

DEiselt commented Sep 5, 2023

I closed the draft PR for the randomized portion of the testcluster_name in favor of #495 and #506 solving this issue. This way we can avoid relying on randomly generated values being part of the cluster name.

@DEiselt
Contributor

DEiselt commented Sep 7, 2023

Loadbalancers (kubeapi): k8s-clusterapi-cluster-$CLUSTER_NAME-$CLUSTER_NAME-kubeapi (not good: Lacks $PREFIX)

This is related to / blocked by the above part of #495. Once the load balancer name includes the $PREFIX, this issue should be fixed, because #506 was implemented to make sure that the prefix is unique. See the comment above for background on why the prefix is necessary / should fix the issue.

@jschoone jschoone added the on hold Is on hold label Oct 10, 2023