
feat(service): horizontal scaling #3178

Merged
merged 15 commits into develop from feat-scale-core-service on Jun 20, 2023

Conversation

@olevski (Member) commented Oct 19, 2022

/deploy renku=core-svc-horizontal-scaling renku-gateway=core-sticky-sessions #persist

@olevski olevski requested a review from a team as a code owner October 19, 2022 20:08
@olevski olevski marked this pull request as draft October 19, 2022 20:21
@olevski olevski temporarily deployed to renku-ci-rp-3178 October 19, 2022 21:26 Inactive
@RenkuBot (Contributor)

You can access the deployment of this PR at https://renku-ci-rp-3178.dev.renku.ch

@olevski olevski temporarily deployed to renku-ci-rp-3178 October 19, 2022 22:33 Inactive
@olevski olevski temporarily deployed to renku-ci-rp-3178 October 19, 2022 22:48 Inactive
@olevski (Member Author) commented Oct 20, 2022

We have to set resource requests for all containers in the core service so that the pod autoscaler works.

I propose the following resource requests (an illustrative values sketch follows the list):

  • core: 4Gi
  • core-datasets-workers: 2Gi
  • core-management-workers: 100Mi
  • core-scheduler: 100Mi
  • traefik: 100Mi (no historical data for this container specifically; this is based on data from the gateway traefik instance)
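
As a rough illustration, the requests above might look something like this in the chart values (a minimal sketch; the key names are illustrative, not necessarily the actual schema of the renku-core Helm chart):

```yaml
# Hypothetical values.yaml excerpt -- key names are illustrative only,
# not necessarily the actual keys used by the renku-core chart.
resources:
  core:
    requests:
      memory: 4Gi
  datasetsWorkers:
    requests:
      memory: 2Gi
  managementWorkers:
    requests:
      memory: 100Mi
  scheduler:
    requests:
      memory: 100Mi
  proxy:  # the traefik sidecar; based on gateway traefik data
    requests:
      memory: 100Mi
```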

I looked at the historical memory consumption in Gi over the last 90 days from renkulab.io.

core: [90-day memory usage chart]

core-datasets-workers: [90-day memory usage chart]

core-management-workers: [90-day memory usage chart]

core-scheduler: [90-day memory usage chart]

@olevski (Member Author) commented Oct 20, 2022

CPU usage is fairly low, and IMO we should not really bother including it in the HPA.

[90-day CPU usage chart]

@olevski (Member Author) commented Oct 20, 2022

What these changes mean (an illustrative sketch of the PDB, HPA, and update strategy follows the list):

  • Routing is more complicated to maintain in order to get the sticky session cookies.
  • The pod disruption budget prevents an admin from draining nodes (or doing similar operations) and accidentally removing all instances of the core service. With this PDB, for example, an admin (or similar user) will be prevented from draining a node if that would bring the number of replicas below 1.
  • The horizontal pod autoscaler will aim to keep the memory utilization of the core service pods at 50%. The utilization is calculated as the sum of all container memory usage divided by the sum of all container memory requests in the pod. The HPA will scale up and down accordingly, but never below 2 replicas. Having 2 replicas also works together with the pod disruption budget: setting both the minimum replicas and the PDB to 1 would mean that an admin could never evict your service and would need to come talk to you before doing so.
  • The update strategy is now "RollingUpdate" with maxUnavailable set to 0 and maxSurge set to 1. This means that during an update the number of available replicas is maintained and the update is done one replica at a time. So if you have 2 replicas with maxUnavailable 0 and maxSurge 1, the update process adds one extra new replica, waits for it to become available, then kills an old replica, adds another new replica, waits for that one to become available, and finally kills the last old replica. However, during this time you will have replicas running different versions of the code.
  • Added tini in the Docker container; it properly handles the termination signal that k8s sends to the pod and its containers and forwards it to all processes. Otherwise k8s sends the signal and after 30 seconds forcefully kills whatever is left. Tini just makes sure the signal reaches all processes running in a container.
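
A minimal sketch of the objects described above, assuming the standard policy/v1, autoscaling/v2, and apps/v1 APIs; the names, labels, and the maxReplicas value are placeholders rather than the chart's actual templates:

```yaml
# Hypothetical manifests illustrating the PDB, HPA, and update strategy.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: renku-core          # placeholder name
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: renku-core       # placeholder label
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: renku-core
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: renku-core
  minReplicas: 2            # never scale below 2 replicas
  maxReplicas: 5            # placeholder upper bound
  metrics:
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 50   # keep pod memory utilization around 50%
---
# Deployment excerpt: keep all replicas available during an update and
# roll out one extra pod at a time.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: renku-core
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
```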

@olevski (Member Author) commented Oct 20, 2022

This is how the routing changes:

Current:

flowchart LR
        Browser
    subgraph Ingress [Ingress]
        IngressRenku[http://renkulab.io/ui-server/api/renku]
    end
    subgraph k8s[k8s cluster]
        UI[UI-server]
        subgraph Gateway
            GatewayTraefik[Gateway traefik]
            GatewayAuth[Gateway-auth]
        end
        subgraph CoreSvc[Core Service Pod]
            Core
        end
    end
    Browser -- 1 --> IngressRenku
    IngressRenku -- 2 --> UI
    UI -- 3 --> GatewayTraefik
    GatewayTraefik -- 4 --> GatewayAuth
    GatewayAuth -- 5 --> GatewayTraefik
    GatewayTraefik -- 6 --> Core

New:

flowchart LR
        Browser
    subgraph Ingress [Ingress]
        IngressRenku[http://renkulab.io/ui-server/api/renku]
        IngressCore[http://renkulab.io/api/renku]
    end
    subgraph k8s[k8s cluster]
        UI[UI-server]
        subgraph Gateway
            GatewayTraefik[Gateway traefik]
            GatewayAuth[Gateway-auth]
        end
        subgraph CoreSvc[Core Service Pod]
            Core
            Traefik
        end
    end
    Browser -- 1 --> IngressRenku
    IngressRenku -- 2 --> UI
    UI -- 3 --> GatewayTraefik
    GatewayTraefik -- 4 --> IngressCore
    IngressCore -- 5 --> Traefik
    Traefik -- 6 --> GatewayAuth
    GatewayAuth -- 7 --> Traefik
    Traefik -- 8 --> Core

The gateway uses traefik to do the routing, and traefik cannot assign sticky session cookies. It only sees the address of the k8s service, and the round-robin load balancing the k8s service does is not visible to traefik. But the k8s ingress does know which actual replica the k8s service will route to and can assign the sticky session cookie. That is why we now need to go through the ingress to get sticky sessions to work.
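
As a rough sketch of the sticky-session piece, assuming the nginx ingress controller terminates the new /api/renku ingress (the cookie name and service details below are placeholders):

```yaml
# Hypothetical ingress excerpt: cookie-based session affinity so that a
# client keeps hitting the same core service replica.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: renku-core
  annotations:
    nginx.ingress.kubernetes.io/affinity: "cookie"
    nginx.ingress.kubernetes.io/session-cookie-name: "renku-core-session"
spec:
  rules:
    - host: renkulab.io
      http:
        paths:
          - path: /api/renku
            pathType: Prefix
            backend:
              service:
                name: renku-core    # placeholder service name
                port:
                  number: 80
```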

And in the new version of the routing, the core service's traefik container has to go to the gateway to authenticate and exchange the JWT for any other token it needs.
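
A minimal sketch of what the sidecar's routing and forward-auth could look like with Traefik's file provider; the middleware address, header names, and ports are assumptions, not the actual chart configuration:

```yaml
# Hypothetical Traefik dynamic configuration for the sidecar: send each
# request to the gateway auth endpoint first, then on to the core container.
http:
  middlewares:
    gateway-auth:
      forwardAuth:
        address: "http://renku-gateway-auth/api/auth/renku"   # placeholder URL
        authResponseHeaders:
          - "Authorization"
  routers:
    core:
      rule: "PathPrefix(`/api/renku`)"
      middlewares:
        - gateway-auth
      service: core
  services:
    core:
      loadBalancer:
        servers:
          - url: "http://localhost:8080"   # the core container in the same pod
```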

@olevski olevski temporarily deployed to renku-ci-rp-3178 October 20, 2022 13:16 Inactive
@olevski olevski temporarily deployed to renku-ci-rp-3178 October 20, 2022 13:38 Inactive
@olevski olevski temporarily deployed to renku-ci-rp-3178 October 20, 2022 14:05 Inactive
@olevski (Member Author) commented Oct 25, 2022

Results from load testing:

Migrations

  • I ran this with a 1-year-old project (that does not have a lot of commits)
  • I could not find a better or more complicated project. I tried https://dev.renku.ch/gitlab/mohammad.alisafaee/old-datasets-v0.6.0-with-submodules but it said it requires manual migration
  • 30 concurrent migrations of my 1-year-old simple project ran with these response times: avg=1.35s min=4.59ms med=365.41ms max=20.69s p(90)=3.58s p(95)=5.33s; all 30 migrations completed successfully
  • running the same thing on dev.renku.ch produced the following response times: avg=3.05s min=4.48ms med=1.04s max=18.81s p(90)=9.17s p(95)=12.68s; all migrations completed successfully - gitlab had some instability when forking but the migrations had no issues

File uploads

  • 10 concurrent uploads, each uploading a 100MB file
  • most often only half of the uploads succeed
  • sometimes, when you get lucky, all of them succeed
  • this also hits the known memory leak, but that was not what crashed the uploads - something else was
  • on dev the requests took: avg=2.18s min=0s med=2.34s max=59.99s p(90)=2.94s p(95)=3.14s
  • on this CI deployment the requests took: avg=390.62ms min=3.73ms med=384.41ms max=2.79s p(90)=591.64ms p(95)=671.37ms
  • the tests definitely took considerably longer to finish on dev.renku.ch than on the CI deployment, and the response times confirm this

@Panaetius Panaetius temporarily deployed to renku-ci-rp-3178 May 9, 2023 07:53 — with GitHub Actions Inactive
@Panaetius Panaetius temporarily deployed to renku-ci-rp-3178 May 9, 2023 08:44 — with GitHub Actions Inactive
@Panaetius Panaetius temporarily deployed to renku-ci-rp-3178 May 9, 2023 09:13 — with GitHub Actions Inactive
@Panaetius Panaetius temporarily deployed to renku-ci-rp-3178 May 11, 2023 15:55 — with GitHub Actions Inactive
@Panaetius Panaetius temporarily deployed to renku-ci-rp-3178 May 17, 2023 08:12 — with GitHub Actions Inactive
@olevski (Member Author) commented Jun 1, 2023

@Panaetius this is good to go. I cannot approve because I opened the PR in the first place.

@olevski olevski temporarily deployed to renku-ci-rp-3178 June 1, 2023 08:16 — with GitHub Actions Inactive
@olevski olevski deployed to renku-ci-rp-3178 June 19, 2023 14:54 — with GitHub Actions Active
@lorenzo-cavazzi (Member)

Does this require refreshing SwissDataScienceCenter/renku-ui#2134 ?

@Panaetius (Member)

> Does this require refreshing SwissDataScienceCenter/renku-ui#2134 ?

No. While the versions list is not served by nginx anymore but by the individual core svc, the content/URL shouldn't have changed (/api/renku/versions will just redirect to /api/renku/v10/2.0/versions).

@Panaetius Panaetius enabled auto-merge (squash) June 20, 2023 09:57
@Panaetius Panaetius merged commit fab2b58 into develop Jun 20, 2023
28 of 29 checks passed
@Panaetius Panaetius deleted the feat-scale-core-service branch June 20, 2023 09:59