
Batched fetching #251

Merged: 8 commits merged into master from batched_fetching on Apr 6, 2018
Conversation

@KnVerey (Contributor) commented Mar 14, 2018

Problem

Every polling loop, including the one to check the initial status, makes an API call for every single resource in the set. When you've got over a thousand resources, that takes a long time, even with concurrency!

Solution

Make one request per resource type in the set instead, with a fallback if there's a cache miss for a type a resource instance needs.
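
A rough sketch of the shape of this change, based on the hunks quoted in the review below (details simplified; error handling and labels omitted):

```ruby
require 'json'

# Sketch: fetch each kind once in bulk, then serve individual lookups from
# the cache, falling back to a single-resource fetch on a cache miss.
class SyncMediator
  def initialize(kubectl:)
    @kubectl = kubectl
    @cache = {} # kind => list of resource hashes from one batched call
  end

  def get_all(kind)
    fetch_by_kind(kind) unless @cache.key?(kind)
    @cache[kind] || []
  end

  def get_instance(kind, resource_name)
    if @cache.key?(kind)
      @cache[kind].find { |r| r.dig("metadata", "name") == resource_name } || {}
    else
      fetch_instance(kind, resource_name) # cache-miss fallback
    end
  end

  private

  def fetch_by_kind(kind)
    raw_json, _err, st = @kubectl.run("get", kind, "--output=json")
    @cache[kind] = JSON.parse(raw_json)["items"] if st.success?
  end

  def fetch_instance(kind, resource_name)
    raw_json, _err, st = @kubectl.run("get", kind, resource_name, "--output=json")
    st.success? ? JSON.parse(raw_json) : {}
  end
end
```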

We tested this in our canary environment and saw the "checking initial status" step decrease from 24s to 6s. The apply + initial polling loop step also decreased from 44s to 24s (probably because apply --prune is still slow). I would expect this time to grow somewhat with larger resource sets (raw kubectl get is slower on larger clusters), but no longer linearly with the number of resources.

A nice side-effect of this design is that we can remove the kubectl access from the instances entirely, preventing people from using it when they shouldn't be (i.e. outside polling loops).

Reviewers, please especially look for any concurrency gotchas we might have here. This new cache is effectively shared state in a multi-threaded environment. Should we freeze some of it, for example?

Still to do

  • Add unit tests for the new class

@dturn (Contributor) left a comment

Overall this looks solid.


def get_instance(kind, resource_name)
  if @cache.key?(kind)
    @cache[kind].find { |r| r.dig("metadata", "name") == resource_name } || {}
Contributor:

What do you think about making the cache two-level, keyed by kind and then name? (e.g. @cache[kind][resource_name] || {})

Contributor Author:

I think it's a good idea; preprocessing it into that form up front should in theory be more efficient overall.
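
For illustration, the indexed form might look like this (a sketch; the single-resource fallback from the original get_instance is omitted for brevity):

```ruby
# Index each kind's items by name once, at fetch time...
def fetch_by_kind(kind)
  raw_json, _err, st = kubectl.run("get", kind, "-a", "--output=json")
  return unless st.success?
  @cache[kind] = JSON.parse(raw_json)["items"].each_with_object({}) do |r, index|
    index[r.dig("metadata", "name")] = r
  end
end

# ...so each lookup becomes a hash access instead of a linear scan.
def get_instance(kind, resource_name)
  @cache.key?(kind) ? @cache[kind].fetch(resource_name, {}) : {}
end
```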

@KnVerey changed the title from "[WIP] Batched fetching" to "Batched fetching" on Mar 21, 2018
@KnVerey (Contributor Author) commented Mar 29, 2018

I changed the SyncMediator tests to be unit tests that use a set of fake resource classes and stub kubectl. The details of the cache behaviour are what we really need to cover IMO (nearly all the existing integration tests cover the critical path of the class with real resources), and I was having a hard time making the assertions I wanted about when API calls are/aren't made. The unit tests are obviously also a lot faster. Let me know what you think.

I fixed a couple bugs I found in the process, and I think this is now ready for a detailed look. 🙏

def sync(mediator)
  super
  @latest_rs = exists? ? find_latest_rs(mediator) : nil
  @server_version = mediator.kubectl.server_version
Contributor:

It's ugly, but we could store this as a class-level var to prevent every deployment from having to make the API call. Might not be worth the optimization.

Contributor Author:

Sorry, I must have missed this comment. It wouldn't be safe to store it at the class level, since someone could be using this as a proper gem and running separate DeployTask instances in parallel against different servers. However, we could at least cache it here and only make one call per deployment instance.
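
Per-instance caching could be as simple as memoizing on the Kubectl object (a sketch; the actual method and parsing details may differ):

```ruby
# Memoized per Kubectl instance: each DeployTask makes at most one version
# call, and parallel tasks pointed at different servers never share a value
# (unlike a class-level variable would).
def server_version
  @server_version ||= begin
    raw_json, _err, st = run("version", "--output=json")
    if st.success?
      git_version = JSON.parse(raw_json).dig("serverVersion", "gitVersion")
      Gem::Version.new(git_version.sub(/\Av/, ""))
    end
  end # a failed call leaves @server_version nil, so the next call retries
end
```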

@dturn (Contributor) left a comment

Couple of minor things, but looks good.


def fetch_by_kind(kind)
  raw_json, _, st = kubectl.run("get", kind, "-a", "--output=json")
  return unless st.success?
Contributor:

Should we emit a warning if the call to kubectl fails?

Contributor Author:

Since both this cache and the polling loop overall are built to handle transient errors gracefully, I think that'd just add noise (that's the reasoning for it not being enabled in the equivalent place today anyway). If you turn on debug-level logging, it will get recorded by kubectl.run itself.

  assert_equal({}, missing)
end

def test_get_instance_does_not_populate_the_cache
Contributor:

This test doesn't look finished; it also looks like it's covered by the first test.

Contributor Author:

You're right, it pretty much is. I'll merge them and move the note.

@klautcomputing (Contributor) left a comment

I have found a couple of nits, but nothing major.

I was wondering why the code sometimes uses dig(a, b, c) and in other places [a][b][c]. Any particular reason?

Are we okay with only having tests for deployments and pods?

stub_kubectl_response('get', 'FakeConfigMap', *@params, success: false, resp: { "items" => [] }, err: 'no').times(2)
stub_kubectl_response('get', 'FakeConfigMap', @fake_cm.name, *@params, resp: @fake_cm.kubectl_response, times: 1)
assert_equal [], mediator.get_all('FakeConfigMap')
assert_equal [], mediator.get_all('FakeConfigMap', "fake" => "false", "type" => "fakeconfigmap")
Contributor:

Does this test more than the line before? That both with a selector and without they don't get cached?

Contributor Author:

Yeah, admittedly it's unlikely we'd introduce a behavioural difference there, but I thought it couldn't hurt. I can remove it if you want.

Contributor:

Nope, totally fine; it just took me a second to figure out why you added the test. Maybe just add a comment to it so that no one else has to wonder why.

@@ -1,33 +0,0 @@
# frozen_string_literal: true
Contributor:

Out of curiosity: why was this deleted?

Contributor Author:

This is a Shopify custom resource that we 🔥 in production months ago. It's effectively dead code, so I deleted rather than updated it.


@status = if @deployment_exists && @service_exists
def status
if deployment_ready? && service_ready?
Contributor:

Could use deploy_succeeded? here as well.

  @proxy_service = mediator.get_instance(Service.kind, "cloudsql-#{@name}")
end

def status
Contributor:

This is the same as deploy_succeeded?, so it could be refactored into a ternary using that method.
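
i.e. something like (assuming the same status strings as the hunk below):

```ruby
def status
  deploy_succeeded? ? "Provisioned" : "Unknown"
end
```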

end

def status
  if deployment_ready? && service_ready? && configmap_ready?
    "Provisioned"
  else
    "Unknown"
Contributor: [link]

Contributor Author:

That link isn't working for me, but that seems like a mistake. Good catch!

@KnVerey (Contributor Author) commented Apr 4, 2018

> I was wondering why the code sometimes uses dig(a, b, c) and in other places [a][b][c]. Any particular reason?

We were originally using an older version of ruby on this project, and I have a vague memory of updating cases like a[b][c] if a.key?(b) to use dig when we upgraded, but not bothering beyond that. So probably a combination of legacy and inconsistency on my part. Feel free to call them out.
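
For reference, the two styles being compared (dig exists since Ruby 2.3):

```ruby
h = { "metadata" => { "labels" => { "name" => "web" } } }

# Older guard-then-chain style; still raises NoMethodError if a deeper key is missing
h["metadata"]["labels"]["name"] if h.key?("metadata")

# dig returns nil safely at any missing level
h.dig("metadata", "labels", "name")
```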

> Are we okay with only having tests for deployments and pods?

Since we're so reliant on correct interactions with multiple versions of k8s, this project has for the most part focussed on maintaining excellent integration coverage. We typically add unit tests when there are edge cases that are difficult/impossible to get integration coverage for. If you see any code that isn't covered, please call it out and I'll add coverage one way or another.

One exception is the classes for Shopify's custom resources, which are virtually uncovered. The work to make them dynamic will give us generic coverage for them, but until then it's difficult, since our test clusters don't run the backing controllers. Typically I deploy my test app locally as a 🎩 before merging any PRs that touch those classes. That definitely sucks. Maybe now's the time to change it... we could have the test itself inject dummy resources into the namespace to at least cover the basics. See also: codecov

@KnVerey (Contributor Author) commented Apr 4, 2018

Updated and rebased.

@sirupsen commented Apr 5, 2018

(not reviewing but pretty excited about this)

@klautcomputing (Contributor) left a comment

🍠

@KnVerey (Contributor Author) left a comment

I did a self-review to re-familiarize myself with the code before shipping and found a couple of minor things and a bug:

  • the fetch_events/logs methods previously used a kubectl instance that did not log failures by default, and I forgot to switch them to specify log_failure: false at the command level now that they are passed a variable kubectl instance (see the sketch after this list)
  • Fix in response to Danny's comment about caching the server version
  • The service refactor had two errors in it that weren't caught by tests, so I fixed them and added a unit test suite for that class. Incidentally, that class's logic is incomplete and flawed when you consider the full array of service types possible; we should revisit it.
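
A sketch of the first fix (the surrounding method body is assumed for illustration; the log_failure: false option is the point):

```ruby
# Failures while fetching debug info are expected and tolerable, so silence
# them per command now that the passed-in kubectl instance logs failures by
# default.
def fetch_events(kubectl)
  raw_json, _err, st = kubectl.run("get", "events", "--output=json", log_failure: false)
  return [] unless st.success?
  JSON.parse(raw_json)["items"]
end
```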

@KnVerey (Contributor Author) commented Apr 6, 2018

🎩 with a test app with Shopify CRs: successful

@KnVerey merged commit abb873a into master on Apr 6, 2018
@KnVerey deleted the batched_fetching branch on April 6, 2018