
Batched fetching #251

Merged: 8 commits merged into master from batched_fetching on Apr 6, 2018
Conversation

@KnVerey (Contributor) commented Mar 14, 2018

Problem

Every polling loop, including the one to check the initial status, makes an API call for every single resource in the set. When you've got over a thousand resources, that takes a long time, even with concurrency!

Solution

Make one request per resource type in the set instead, with a fallback if there's a cache miss for a type a resource instance needs.
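
A rough sketch of the shape of this change, based on the hunks quoted in the review below (details simplified; error handling and labels omitted):

```ruby
require 'json'

# Sketch: fetch each kind once in bulk, then serve individual lookups from
# the cache, falling back to a single-resource fetch on a cache miss.
class SyncMediator
  def initialize(kubectl:)
    @kubectl = kubectl
    @cache = {} # kind => list of resource hashes from one batched call
  end

  def get_all(kind)
    fetch_by_kind(kind) unless @cache.key?(kind)
    @cache[kind] || []
  end

  def get_instance(kind, resource_name)
    if @cache.key?(kind)
      @cache[kind].find { |r| r.dig("metadata", "name") == resource_name } || {}
    else
      fetch_instance(kind, resource_name) # cache-miss fallback
    end
  end

  private

  def fetch_by_kind(kind)
    raw_json, _err, st = @kubectl.run("get", kind, "--output=json")
    @cache[kind] = JSON.parse(raw_json)["items"] if st.success?
  end

  def fetch_instance(kind, resource_name)
    raw_json, _err, st = @kubectl.run("get", kind, resource_name, "--output=json")
    st.success? ? JSON.parse(raw_json) : {}
  end
end
```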

We tested this in our canary environment and saw the "checking initial status" step decrease from 24s to 6s. The apply + initial polling loop step also decreased from 44s to 24s (probably because apply --prune is still slow). I would expect this time to grow somewhat with larger resource sets (raw kubectl get is slower on larger clusters), but no longer linearly with the number of resources.

A nice side-effect of this design is that we can remove the kubectl access from the instances entirely, preventing people from using it when they shouldn't be (i.e. outside polling loops).

Reviewers, please especially look for any concurrency gotchas we might have here. This new cache is effectively shared state in a multi-threaded environment. Should we freeze some of it, for example?

Still to do

  • Add unit tests for the new class

@dturn (Contributor) left a comment

Overall this looks solid.


def get_instance(kind, resource_name)
  if @cache.key?(kind)
    @cache[kind].find { |r| r.dig("metadata", "name") == resource_name } || {}
Contributor:

What do you think about making the cache two-level, keyed by kind and then name? (e.g. @cache[kind][resource_name] || {})

Contributor Author:

I think it's a good idea; preprocessing it into that form up front should in theory be more efficient overall.
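
For illustration, the indexed form might look like this (a sketch; the single-resource fallback from the original get_instance is omitted for brevity):

```ruby
# Index each kind's items by name once, at fetch time...
def fetch_by_kind(kind)
  raw_json, _err, st = kubectl.run("get", kind, "-a", "--output=json")
  return unless st.success?
  @cache[kind] = JSON.parse(raw_json)["items"].each_with_object({}) do |r, index|
    index[r.dig("metadata", "name")] = r
  end
end

# ...so each lookup becomes a hash access instead of a linear scan.
def get_instance(kind, resource_name)
  @cache.key?(kind) ? @cache[kind].fetch(resource_name, {}) : {}
end
```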

@KnVerey changed the title from "[WIP] Batched fetching" to "Batched fetching" on Mar 21, 2018
@KnVerey (Contributor Author) commented Mar 29, 2018

I changed the SyncMediator tests to be unit tests that use a set of fake resource classes and stub kubectl. The details of the cache behaviour are what we really need to cover IMO (nearly all the existing integration tests cover the critical path of the class with real resources), and I was having a hard time making the assertions I wanted about when API calls are/aren't made. The unit tests are obviously also a lot faster. Let me know what you think.

I fixed a couple bugs I found in the process, and I think this is now ready for a detailed look. 🙏

def sync(mediator)
  super
  @latest_rs = exists? ? find_latest_rs(mediator) : nil
  @server_version = mediator.kubectl.server_version
Contributor:

It's ugly, but we could store this as a class-level var to prevent every deployment from having to make the API call. Might not be worth the optimization.

Contributor Author:

Sorry, I must have missed this comment. It wouldn't be safe to store it at the class level, since someone could be using this as a proper gem and running separate DeployTask instances in parallel against different servers. However, we could at least cache it here and only make one call per deployment instance.
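
Per-instance caching could be as simple as memoizing on the Kubectl object (a sketch; the actual method and parsing details may differ):

```ruby
# Memoized per Kubectl instance: each DeployTask makes at most one version
# call, and parallel tasks pointed at different servers never share a value
# (unlike a class-level variable would).
def server_version
  @server_version ||= begin
    raw_json, _err, st = run("version", "--output=json")
    if st.success?
      git_version = JSON.parse(raw_json).dig("serverVersion", "gitVersion")
      Gem::Version.new(git_version.sub(/\Av/, ""))
    end
  end # a failed call leaves @server_version nil, so the next call retries
end
```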

@dturn (Contributor) left a comment

Couple of minor things, but looks good.


def fetch_by_kind(kind)
  raw_json, _, st = kubectl.run("get", kind, "-a", "--output=json")
  return unless st.success?
Contributor:

Should we emit a warning if the call to kubectl fails?

Contributor Author:

Since both this cache and the polling loop overall are built to handle transient errors gracefully, I think that'd just add noise (that's the reasoning for it not being enabled in the equivalent place today anyway). If you turn on debug-level logging, it will get recorded by kubectl.run itself.

  assert_equal({}, missing)
end

def test_get_instance_does_not_populate_the_cache
Contributor:

This test doesn't look finished; it also looks like it's covered by the first test.

Contributor Author:

You're right, it pretty much is. I'll merge them and move the note.

@klautcomputing (Contributor) left a comment

I have found a couple of nits, but nothing major.

I was wondering why the code sometimes uses dig(a, b, c) and in other places [a][b][c]. Any particular reason?

Are we okay with only having tests for deployments and pods?

stub_kubectl_response('get', 'FakeConfigMap', *@params, success: false, resp: { "items" => [] }, err: 'no').times(2)
stub_kubectl_response('get', 'FakeConfigMap', @fake_cm.name, *@params, resp: @fake_cm.kubectl_response, times: 1)
assert_equal [], mediator.get_all('FakeConfigMap')
assert_equal [], mediator.get_all('FakeConfigMap', "fake" => "false", "type" => "fakeconfigmap")
Contributor:

Does this test more than the line before? That both with a selector and without they don't get cached?

Contributor Author:

Yeah, admittedly it's unlikely we'd introduce a behavioural difference there, but I thought it couldn't hurt. I can remove it if you want.

Contributor:

Nope, totally fine; it just took me a second to figure out why you added the test. Maybe just add a comment to it so that no one else has to wonder why.

@@ -1,33 +0,0 @@
# frozen_string_literal: true
Contributor:

Out of curiosity: why was this deleted?

Contributor Author:

This is a Shopify custom resource that we 🔥 in production months ago. It's effectively dead code, so I deleted rather than updated it.


@status = if @deployment_exists && @service_exists
def status
if deployment_ready? && service_ready?
Contributor:

Could use deploy_succeeded? here as well.

  @proxy_service = mediator.get_instance(Service.kind, "cloudsql-#{@name}")
end

def status
Contributor:

This is the same as deploy_succeeded?, so it could be refactored into a ternary using that method.
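
i.e. something like (assuming the same status strings as the hunk below):

```ruby
def status
  deploy_succeeded? ? "Provisioned" : "Unknown"
end
```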

end

def status
  if deployment_ready? && service_ready? && configmap_ready?
    "Provisioned"
  else
    "Unknown"
Contributor: [link]

Contributor Author:

That link isn't working for me, but that seems like a mistake. Good catch!

@KnVerey (Contributor Author) commented Apr 4, 2018

> I was wondering why the code sometimes uses dig(a, b, c) and in other places [a][b][c]. Any particular reason?

We were originally using an older version of ruby on this project, and I have a vague memory of updating cases like a[b][c] if a.key?(b) to use dig when we upgraded, but not bothering beyond that. So probably a combination of legacy and inconsistency on my part. Feel free to call them out.
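
For reference, the two styles being compared (dig exists since Ruby 2.3):

```ruby
h = { "metadata" => { "labels" => { "name" => "web" } } }

# Older guard-then-chain style; still raises NoMethodError if a deeper key is missing
h["metadata"]["labels"]["name"] if h.key?("metadata")

# dig returns nil safely at any missing level
h.dig("metadata", "labels", "name")
```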

> Are we okay with only having tests for deployments and pods?

Since we're so reliant on correct interactions with multiple versions of k8s, this project has for the most part focussed on maintaining excellent integration coverage. We typically add unit tests when there are edge cases that are difficult/impossible to get integration coverage for. If you see any code that isn't covered, please call it out and I'll add coverage one way or another.

One exception is the classes for Shopify's custom resources, which are virtually uncovered. The work to make them dynamic will give us generic coverage for them, but until then it's difficult, since our test clusters don't run the backing controllers. Typically I deploy my test app locally as a 🎩 before merging any PRs that touch those classes. That definitely sucks. Maybe now's the time to change it... we could have the test itself inject dummy resources into the namespace to at least cover the basics. See also: codecov

@KnVerey (Contributor Author) commented Apr 4, 2018

Updated and rebased.

@sirupsen commented Apr 5, 2018

(not reviewing but pretty excited about this)

@klautcomputing (Contributor) left a comment

🍠

@KnVerey (Contributor Author) left a comment

I did a self-review to re-familiarize myself with the code before shipping and found a couple of minor things and a bug:

  • the fetch_events/logs methods previously used a kubectl instance that did not log failures by default, and I forgot to switch them to specify log_failure: false at the command level now that they are passed a variable kubectl instance (see the sketch after this list)
  • Fix in response to Danny's comment about caching the server version
  • The service refactor had two errors in it that weren't caught by tests, so I fixed them and added a unit test suite for that class. Incidentally, that class's logic is incomplete and flawed when you consider the full array of service types possible; we should revisit it.
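
A sketch of the first fix (the surrounding method body is assumed for illustration; the log_failure: false option is the point):

```ruby
# Failures while fetching debug info are expected and tolerable, so silence
# them per command now that the passed-in kubectl instance logs failures by
# default.
def fetch_events(kubectl)
  raw_json, _err, st = kubectl.run("get", "events", "--output=json", log_failure: false)
  return [] unless st.success?
  JSON.parse(raw_json)["items"]
end
```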

@KnVerey (Contributor Author) commented Apr 6, 2018

🎩 with a test app with Shopify CRs: successful

@KnVerey merged commit abb873a into master on Apr 6, 2018
@KnVerey deleted the batched_fetching branch on April 6, 2018