How to improve run time on larger desired state files? #22

Closed
bgeesaman opened this issue Apr 20, 2018 · 12 comments

@bgeesaman
Contributor

First, thank you for sharing helmsman with the community. When I kicked the tires on it using a smallish toml configuration with 2 apps, a full run took a handful of seconds. Everything was good. But now, with ~30 helm charts defined and installed in a 10-node cluster, a CI run takes well over 8 minutes. For bumping, say, a container version in a values.yaml for one app, this greatly slows iteration speed. It appears to be an ever-growing problem: the number of calls to helm grows with the number of apps, and each helm call itself takes longer as more releases are installed. The logs show quite a few helm calls per chart, and each one takes about 4 seconds to return.

$ time helm ls
...snip...
30 helm charts installed
...snip...
real	0m3.897s
user	0m0.173s
sys	0m0.070s

Also, with so many long runs, there is now a much higher likelihood of a race condition between multiple overlapping CI runs, even when they start 2+ minutes apart. helmsman seems to gather the current state at the beginning and then apply the changes, so the second CI job will notice a chart is missing and only later go to install it, erroring out because the first CI run installed it in the meantime.

Any ideas/strategies for improving overall performance of each run? I'd love to see things get under a minute, if possible. Thanks!

@sami-alajrami
Contributor

sami-alajrami commented Apr 23, 2018

Thank you @bgeesaman for reporting this. I have been facing this situation recently as well and I was planning to focus on performance improvement in the next release.
There are a couple of possible performance improvements:

  1. Some of the checks done against helm to build the current state are repeated unnecessarily. Rewriting those in a more performance-optimized way should prove helpful.
  2. I have also thought of running the decision-making check for each app (chart), where helmsman decides what to do about it, in parallel (currently it runs sequentially).

In the meantime, the way I currently deal with this situation is by splitting my desired state into multiple files and configuring how they are run through the CI pipeline.

For example, the desired state for production apps lives on a release branch, the desired state for dev apps on another branch, and so on. I sometimes also use scripting in the pipeline to check whether a desired state file has changed in the last commit before running helmsman on it.
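
A minimal sketch of that kind of check (the file name dsf-staging.toml and the -f/--apply flags are assumptions; adjust to your setup and check helmsman --help for your version):

#!/usr/bin/env bash
# Only run helmsman if the desired state file changed in the last commit.
DSF="dsf-staging.toml"   # hypothetical desired state file name

if git diff --quiet HEAD~1 HEAD -- "$DSF"; then
  echo "No changes in $DSF, skipping helmsman."
else
  helmsman -f "$DSF" --apply
fi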

@bgeesaman
Contributor Author

bgeesaman commented Apr 23, 2018

@sami-alajrami Thanks for that. It's pretty much what I was thinking, too. Reducing the number of calls to helm where possible is great, but the real pain is how long helm takes to run each command. If I were to break up my deploys, say, per namespace, so that I'd have 5-6 apps instead of 30, it'd be a good bit faster, but I think I'd still have 4+ second calls to helm because of the shared Tiller deployment. What about supporting the --tiller-namespace helm option through a TOML setting? That way, I could deploy a Tiller in each namespace and have each one handle and look up only what's in that namespace.

That would also have the benefit of making Helmsman be available to "soft multi-tenant" clusters.

@sami-alajrami
Contributor

That's a good idea @bgeesaman .. how does the format below sound?

[settings]
kubeContext = "minikube"
installSharedTiller = false # default is true
...
[namespaces]
  [namespaces.staging]
  protected = false
  installTiller = true
  [namespaces.production]
  protected = true
  installTiller = false # default is false
...

Then, when running helm commands, helmsman would implicitly either use the shared Tiller (if the desired namespace does not have its own) or the namespace's own Tiller (using --tiller-namespace).
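
For illustration, the underlying helm calls would then look roughly like this (release names are just examples, not helmsman's actual output):

# app in a namespace that has its own Tiller:
helm upgrade --install jenkins stable/jenkins --namespace staging --tiller-namespace staging

# app in a namespace without its own Tiller falls back to the shared one:
helm upgrade --install nginx stable/nginx --namespace production --tiller-namespace kube-system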

@bgeesaman
Contributor Author

bgeesaman commented Apr 26, 2018

I was thinking it would be an [apps.appname] level configuration item like:

[apps]

    [apps.jenkins]
    name = "jenkins" # should be unique across all apps
    description = "jenkins"
    namespace = "staging" # maps to the namespace as defined in environments above
    tillerNamespace = "staging"
    enabled = true # change to false if you want to delete this app release [empty = false]
    chart = "stable/jenkins" # changing the chart name means delete and recreate this chart
    version = "0.14.3" # chart version
    valuesFile = "" # leaving it empty uses the default chart values
    purge = false # will only be considered when there is a delete operation
    test = false # run the tests when this release is installed for the first time only
    protected = true

    [apps.nginx]
    name = "nginx" # should be unique across all apps
    description = "nginx"
    namespace = "staging" # maps to the namespace as defined in environments above
    enabled = true # change to false if you want to delete this app release [empty = false]
    chart = "stable/nginx" # changing the chart name means delete and recreate this chart
    version = "0.14.3" # chart version
    valuesFile = "" # leaving it empty uses the default chart values
    purge = false # will only be considered when there is a delete operation
    test = false # run the tests when this release is installed for the first time only
    protected = true
    wait = true

Where jenkins is sent to the Tiller in staging and nginx is sent to the Tiller in kube-system. But I might not be fully aware of the architectural constraints needed to say whether that's doable.

@bgeesaman
Contributor Author

Also, I'm curious what it would take to provide the TLS configuration items needed to communicate with Tiller securely. Thanks!

@sami-alajrami
Contributor

Before using the --tiller-namespace flag, Tiller must be deployed into that namespace (using helm init --tiller-namespace). Hence, I suggested it in the namespaces section:

[namespaces]
  [namespaces.staging]
  protected = false
  installTiller = true

And then, the tillerNamespace setting for each app becomes optional:

[apps.jenkins]
    name = "jenkins" 
    description = "jenkins"
    namespace = "staging" # maps to the namespace as defined in namespaces above
    tillerNamespace = "staging" # optional and can point to another namespace, e.g. dev

That is:

  1. If explicitly defined, use it as defined (with the risk of failure if Tiller is not deployed in the specified namespace).
  2. If not specified, check whether the staging namespace is supposed to have Tiller deployed into it (using the namespace definition). If yes, use the flag --tiller-namespace staging.
  3. If the staging namespace is not configured to deploy Tiller, use the shared Tiller in kube-system (see the sketch after this list).
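
In shell terms, the lookup for a single app would behave roughly like the sketch below (illustrative pseudologic with made-up variable names, not the actual implementation):

# hypothetical resolution of which Tiller to talk to for one app
if [ -n "$APP_TILLER_NAMESPACE" ]; then
  TILLER_NS="$APP_TILLER_NAMESPACE"         # 1. explicit tillerNamespace on the app
elif [ "$NS_INSTALL_TILLER" = "true" ]; then
  TILLER_NS="$APP_NAMESPACE"                # 2. the app's namespace deploys its own Tiller
else
  TILLER_NS="kube-system"                   # 3. fall back to the shared Tiller
fi
helm upgrade --install "$RELEASE" "$CHART" --namespace "$APP_NAMESPACE" --tiller-namespace "$TILLER_NS"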

does that make sense? @bgeesaman

@sami-alajrami
Contributor

sami-alajrami commented Apr 26, 2018

Eliminating the installTiller definition from the namespaces section would mean that Tiller gets initialized in each namespace while each app is being checked. That makes the code more complicated and may result in unnecessary helm init calls when several apps in the same namespace each define a tillerNamespace.

@bgeesaman
Contributor Author

That seems reasonable as it would skip over an already deployed tiller in that namespace and not re-init it unnecessarily.

I deploy tiller via my cluster provisioning process and then trigger the CI pipeline with helmsman in it. I have one tiller now, but as mentioned above, I'd like to break those out and add TLS configurations to each. Since I will probably want to configure the tiller TLS settings differently per namespace (separating along team/project/CI pipeline needs), as long as I can specify those extra options somehow, that should work well!

sami-alajrami added this to the v1.2.0 milestone Apr 27, 2018
sami-alajrami pushed a commit that referenced this issue Apr 28, 2018
@sami-alajrami
Contributor

sami-alajrami commented Apr 28, 2018

@bgeesaman could you please try v1.2.0-rc with a desired state similar to:

[settings]
  # other options
serviceAccount = "foo" # the service account used to deploy tiller into namespaces if they do not have a specific service account defined for them in the namespaces section below. If this one is not defined, the namespace default service account is used 
storageBackend = "secret" # default is configMap

[namespaces]
  [namespaces.production]
  protected = true
  installTiller = true # the foo service account above is used
  [namespaces.staging]
  protected = false
  installTiller = true
  tillerServiceAccount = "tiller-staging" # should already exist in the staging namespace
  caCert = "secrets/ca.cert.pem" # or an env var, e.g. "$CA_CERT_PATH"
  tillerCert = "secrets/tiller.cert.pem" # or S3 bucket s3://mybucket/tiller.crt
  tillerKey = "secrets/tiller.key.pem" # or GCS bucket gs://mybucket/tiller.key
  clientCert = "secrets/helm.cert.pem"
  clientKey = "secrets/helm.key.pem"

[apps]

    # jenkins will be deployed using the Tiller deployed in the staging namespace
    # if the staging namespace wasn't configured to deploy Tiller, the kube-system Tiller will be used
    # Tiller will always be deployed into kube-system regardless of whether you define it in the namespaces section or not. You can, however, configure it to use a specific service account and TLS
    [apps.jenkins]
    name = "jenkins" 
    description = "jenkins"
    namespace = "staging" 
    enabled = true 
    chart = "stable/jenkins" 
    version = "0.14.3" 
    valuesFile = "" 
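
For reference, deploying a TLS-secured Tiller into the staging namespace with plain helm would look roughly like this (standard helm 2 init flags, using the cert paths from the example above):

helm init --tiller-namespace staging \
  --service-account tiller-staging \
  --tiller-tls --tiller-tls-verify \
  --tiller-tls-cert secrets/tiller.cert.pem \
  --tiller-tls-key secrets/tiller.key.pem \
  --tls-ca-cert secrets/ca.cert.pem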

There is no tillerNamespace option supported for apps. That proved to introduce some complexities if a user changes it (it would leave multiple deployments of the same release managed by different Tillers).

Regarding performance:

  • The new release replaces many helm calls with a single one for building the current state.
  • Using multiple Tillers introduces some latency due to deploying Tillers and waiting for them to be ready.
  • You can use --skip-validation if you are confident that your desired state is valid. This can save roughly 10 seconds (see the example invocation after this list).
  • Avoid using the wait flag unless necessary.
  • Use the protected flag to skip unnecessary upgrades for apps that are not meant to change.
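
For example, a CI invocation with validation skipped could look like the line below (the -f and --apply flags are assumptions based on the helmsman README; check helmsman --help for your version):

helmsman -f desired_state.toml --apply --skip-validation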

It would be great if you could test this release in your context and let me know how the performance improves when using multiple Tillers. Thanks!

@bgeesaman
Contributor Author

@sami-alajrami This is great to hear! I'll try this out today and let you know shortly.

@bgeesaman
Contributor Author

I have 28 applications deployed via a single Tiller in kube-system in an existing cluster. A helm ls takes just under 4s, so each and every helm call is pretty "expensive". Using helmsman version 1.1.0, a "do nothing" run (all apps already installed, no configuration changes to make and nothing to wait on) takes nearly 12 minutes to complete. If I leave the applications installed in Tiller but comment out all but 5 of them in my toml file, it takes just over 2 minutes. Clearly, a large number of applications managed by a given Tiller slows down every call.

Version 1.1.0:

  • time helm ls - 3.87s
  • Single Tiller, 28/28 apps, 10 Namespaces - 11:55
  • Single Tiller, 5/28 apps, 10 Namespaces - 2:05

Using version v1.2.0-rc on the same cluster, I left the same configuration (5 of the 28 apps uncommented) and reran. The time to call helm stayed the same (same tiller with all the same apps running), but the time to complete went way down because it only had to run through 5 apps. From 2:05 to 0:46. I then simply uncommented the remaining 23 apps and reran. 11:55 down to 2:34... nice!

Version v1.2.0-rc:

  • time helm ls - 3.87s
  • Single Tiller, 5/28 apps, 10 Namespaces - 0:46
  • Single Tiller, 28/28 apps, 10 Namespaces - 2:34

Next, I created 5 new namespaces: test1, test2, ...test5. I modified my toml file to deploy 5 apps in test1. Now, a helm ls against the "blank" Tiller in the test1 namespace takes a quarter of a second to return. Running off a single Tiller in test1, a "do nothing" run takes 18 seconds. Adding 5 apps in test2 adds 5 seconds. Adding 15 more apps across 3 more namespaces adds roughly 18 seconds (about 6 seconds per namespace).

Version v1.2.0-rc:

  • time helm ls --tiller-namespace test1 - 0.27s
  • 1 Tiller, 5/5 apps in test1, 3 Namespaces - 0:18
  • 2 Tillers, 5/5 apps in test1, 5/5 apps in test2, 3 Namespaces - 0:23
  • 5 Tillers, 5/5 per testN namespace (25 apps total), 5 Namespaces - 0:41

Using separate Tillers per namespace and limiting them to about 5 applications each, deploying 28 applications went from 11:55 down to 2:34 down to 0:47, saving over 11 mins for every CI run. These are fantastic gains.

Kudos and THANK YOU @sami-alajrami

@sami-alajrami
Contributor

Awesome! Glad to hear that .. and thanks for the great tests/stats 👍

mkubaczyk pushed a commit that referenced this issue Aug 18, 2023