
feat(service): horizontal scaling #3178

Merged
merged 15 commits into develop from feat-scale-core-service on Jun 20, 2023

Conversation

@olevski (Member) commented Oct 19, 2022

/deploy renku=core-svc-horizontal-scaling renku-gateway=core-sticky-sessions #persist

@olevski olevski requested a review from a team as a code owner October 19, 2022 20:08
@olevski olevski marked this pull request as draft October 19, 2022 20:21
@olevski olevski temporarily deployed to renku-ci-rp-3178 October 19, 2022 21:26 Inactive
@RenkuBot (Contributor)

You can access the deployment of this PR at https://renku-ci-rp-3178.dev.renku.ch

@olevski olevski temporarily deployed to renku-ci-rp-3178 October 19, 2022 22:33 Inactive
@olevski olevski temporarily deployed to renku-ci-rp-3178 October 19, 2022 22:48 Inactive
@olevski (Member Author) commented Oct 20, 2022

We have to set resource requests for all containers in the core service so that the pod autoscaler works.

I propose the following resource requests (an illustrative values sketch follows the list):

  • core: 4Gi
  • core-datasets-workers: 2Gi
  • core-management-workers: 100Mi
  • core-scheduler: 100Mi
  • traefik: 100Mi (no historical data for this container specifically; this is based on data from the gateway traefik instance)
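
As a rough illustration, the requests above might look something like this in the chart values (a minimal sketch; the key names are illustrative, not necessarily the actual schema of the renku-core Helm chart):

```yaml
# Hypothetical values.yaml excerpt -- key names are illustrative only,
# not necessarily the actual keys used by the renku-core chart.
resources:
  core:
    requests:
      memory: 4Gi
  datasetsWorkers:
    requests:
      memory: 2Gi
  managementWorkers:
    requests:
      memory: 100Mi
  scheduler:
    requests:
      memory: 100Mi
  proxy:  # the traefik sidecar; based on gateway traefik data
    requests:
      memory: 100Mi
```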

I looked at the historical memory consumption in Gi over the last 90 days from renkulab.io.

core: [90-day memory usage chart]

core-datasets-workers: [90-day memory usage chart]

core-management-workers: [90-day memory usage chart]

core-scheduler: [90-day memory usage chart]

@olevski (Member Author) commented Oct 20, 2022

CPU usage is fairly low, and IMO we should not really bother including it in the HPA.

[90-day CPU usage chart]

@olevski (Member Author) commented Oct 20, 2022

What these changes mean (an illustrative sketch of the PDB, HPA, and update strategy follows the list):

  • Routing is more complicated to maintain in order to get the sticky session cookies.
  • The pod disruption budget prevents an admin from draining nodes (or doing similar operations) and accidentally removing all instances of the core service. With this PDB, for example, an admin (or similar user) will be prevented from draining a node if that would bring the number of replicas below 1.
  • The horizontal pod autoscaler will aim to keep the memory utilization of the core service pods at 50%. The utilization is calculated as the sum of all container memory usage divided by the sum of all container memory requests in the pod. The HPA will scale up and down accordingly, but never below 2 replicas. Having 2 replicas also works together with the pod disruption budget: setting both the minimum replicas and the PDB to 1 would mean that an admin could never evict your service and would need to come talk to you before doing so.
  • The update strategy is now "RollingUpdate" with maxUnavailable set to 0 and maxSurge set to 1. This means that during an update the number of available replicas is maintained and the update is done one replica at a time. So if you have 2 replicas with maxUnavailable 0 and maxSurge 1, the update process adds one extra new replica, waits for it to become available, then kills an old replica, adds another new replica, waits for that one to become available, and finally kills the last old replica. However, during this time you will have replicas running different versions of the code.
  • Added tini in the Docker container; it properly handles the termination signal that k8s sends to the pod and its containers and forwards it to all processes. Otherwise k8s sends the signal and after 30 seconds forcefully kills whatever is left. Tini just makes sure the signal reaches all processes running in a container.
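
A minimal sketch of the objects described above, assuming the standard policy/v1, autoscaling/v2, and apps/v1 APIs; the names, labels, and the maxReplicas value are placeholders rather than the chart's actual templates:

```yaml
# Hypothetical manifests illustrating the PDB, HPA, and update strategy.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: renku-core          # placeholder name
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: renku-core       # placeholder label
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: renku-core
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: renku-core
  minReplicas: 2            # never scale below 2 replicas
  maxReplicas: 5            # placeholder upper bound
  metrics:
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 50   # keep pod memory utilization around 50%
---
# Deployment excerpt: keep all replicas available during an update and
# roll out one extra pod at a time.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: renku-core
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
```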

@olevski (Member Author) commented Oct 20, 2022

This is how the routing changes:

Current:

flowchart LR
        Browser
    subgraph Ingress [Ingress]
        IngressRenku[http://renkulab.io/ui-server/api/renku]
    end
    subgraph k8s[k8s cluster]
        UI[UI-server]
        subgraph Gateway
            GatewayTraefik[Gateway traefik]
            GatewayAuth[Gateway-auth]
        end
        subgraph CoreSvc[Core Service Pod]
            Core
        end
    end
    Browser -- 1 --> IngressRenku
    IngressRenku -- 2 --> UI
    UI -- 3 --> GatewayTraefik
    GatewayTraefik -- 4 --> GatewayAuth
    GatewayAuth -- 5 --> GatewayTraefik
    GatewayTraefik -- 6 --> Core

New:

flowchart LR
        Browser
    subgraph Ingress [Ingress]
        IngressRenku[http://renkulab.io/ui-server/api/renku]
        IngressCore[http://renkulab.io/api/renku]
    end
    subgraph k8s[k8s cluster]
        UI[UI-server]
        subgraph Gateway
            GatewayTraefik[Gateway traefik]
            GatewayAuth[Gateway-auth]
        end
        subgraph CoreSvc[Core Service Pod]
            Core
            Traefik
        end
    end
    Browser -- 1 --> IngressRenku
    IngressRenku -- 2 --> UI
    UI -- 3 --> GatewayTraefik
    GatewayTraefik -- 4 --> IngressCore
    IngressCore -- 5 --> Traefik
    Traefik -- 6 --> GatewayAuth
    GatewayAuth -- 7 --> Traefik
    Traefik -- 8 --> Core

The gateway uses traefik to do the routing, and traefik cannot assign sticky session cookies. It only sees the address of the k8s service, and the round-robin load balancing the k8s service does is not visible to traefik. But the k8s ingress does know which actual replica the k8s service will route to and can assign the sticky session cookie. That is why we now need to go through the ingress to get sticky sessions to work.
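
As a rough sketch of the sticky-session piece, assuming the nginx ingress controller terminates the new /api/renku ingress (the cookie name and service details below are placeholders):

```yaml
# Hypothetical ingress excerpt: cookie-based session affinity so that a
# client keeps hitting the same core service replica.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: renku-core
  annotations:
    nginx.ingress.kubernetes.io/affinity: "cookie"
    nginx.ingress.kubernetes.io/session-cookie-name: "renku-core-session"
spec:
  rules:
    - host: renkulab.io
      http:
        paths:
          - path: /api/renku
            pathType: Prefix
            backend:
              service:
                name: renku-core    # placeholder service name
                port:
                  number: 80
```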

And in the new version of the routing, the core service's traefik container has to go to the gateway to authenticate and exchange the JWT for any other token it needs.
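
A minimal sketch of what the sidecar's routing and forward-auth could look like with Traefik's file provider; the middleware address, header names, and ports are assumptions, not the actual chart configuration:

```yaml
# Hypothetical Traefik dynamic configuration for the sidecar: send each
# request to the gateway auth endpoint first, then on to the core container.
http:
  middlewares:
    gateway-auth:
      forwardAuth:
        address: "http://renku-gateway-auth/api/auth/renku"   # placeholder URL
        authResponseHeaders:
          - "Authorization"
  routers:
    core:
      rule: "PathPrefix(`/api/renku`)"
      middlewares:
        - gateway-auth
      service: core
  services:
    core:
      loadBalancer:
        servers:
          - url: "http://localhost:8080"   # the core container in the same pod
```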

@olevski olevski temporarily deployed to renku-ci-rp-3178 October 20, 2022 13:16 Inactive
@olevski olevski temporarily deployed to renku-ci-rp-3178 October 20, 2022 13:38 Inactive
@olevski olevski temporarily deployed to renku-ci-rp-3178 October 20, 2022 14:05 Inactive
@olevski (Member Author) commented Oct 25, 2022

Results from load testing:

Migrations

  • I ran this with a 1-year-old project (that does not have a lot of commits)
  • I could not find a better or more complicated project. I tried https://dev.renku.ch/gitlab/mohammad.alisafaee/old-datasets-v0.6.0-with-submodules but it said it requires manual migration
  • 30 concurrent migrations of my 1-year-old simple project ran with these response times: avg=1.35s min=4.59ms med=365.41ms max=20.69s p(90)=3.58s p(95)=5.33s; all 30 migrations completed successfully
  • running the same thing on dev.renku.ch produced the following response times: avg=3.05s min=4.48ms med=1.04s max=18.81s p(90)=9.17s p(95)=12.68s; all migrations completed successfully - gitlab had some instability when forking but the migrations had no issues

File uploads

  • 10 concurrent uploads, each uploading a 100MB file
  • most often only half of the uploads succeed
  • sometimes, when you get lucky, all of them succeed
  • this also hits the known memory leak, but that was not what crashed the uploads - something else was
  • on dev the requests took: avg=2.18s min=0s med=2.34s max=59.99s p(90)=2.94s p(95)=3.14s
  • on this CI deployment the requests took: avg=390.62ms min=3.73ms med=384.41ms max=2.79s p(90)=591.64ms p(95)=671.37ms
  • the tests definitely took considerably longer to finish on dev.renku.ch than on the CI deployment, and the response times confirm this

@Panaetius Panaetius temporarily deployed to renku-ci-rp-3178 May 9, 2023 07:53 — with GitHub Actions Inactive
@Panaetius Panaetius temporarily deployed to renku-ci-rp-3178 May 9, 2023 08:44 — with GitHub Actions Inactive
@Panaetius Panaetius temporarily deployed to renku-ci-rp-3178 May 9, 2023 09:13 — with GitHub Actions Inactive
@Panaetius Panaetius temporarily deployed to renku-ci-rp-3178 May 11, 2023 15:55 — with GitHub Actions Inactive
@Panaetius Panaetius temporarily deployed to renku-ci-rp-3178 May 17, 2023 08:12 — with GitHub Actions Inactive
@olevski (Member Author) commented Jun 1, 2023

@Panaetius this is good to go. I cannot approve because I opened the PR in the first place.

@olevski olevski temporarily deployed to renku-ci-rp-3178 June 1, 2023 08:16 — with GitHub Actions Inactive
@olevski olevski deployed to renku-ci-rp-3178 June 19, 2023 14:54 — with GitHub Actions Active
@lorenzo-cavazzi (Member)

Does this require refreshing SwissDataScienceCenter/renku-ui#2134 ?

@Panaetius (Member)

> Does this require refreshing SwissDataScienceCenter/renku-ui#2134 ?

No. While the versions list is not served by nginx anymore but by the individual core svc, the content/URL shouldn't have changed (/api/renku/versions will just redirect to /api/renku/v10/2.0/versions).

@Panaetius Panaetius enabled auto-merge (squash) June 20, 2023 09:57
@Panaetius Panaetius merged commit fab2b58 into develop Jun 20, 2023
28 of 29 checks passed
@Panaetius Panaetius deleted the feat-scale-core-service branch June 20, 2023 09:59