
Batch dry run but like for real this time #943

Closed

Conversation


@timothysmith0609 (Contributor) commented Jan 19, 2024

Part of https://github.com/Shopify/infrastructure-tooling/issues/686

Related: #797

What are you trying to accomplish with this PR?
#781 implemented batched dry-run kubectl apply as part of the resource validation phase. For what I'm sure was a compelling reason at the time, if the batched dry-run fails, we fall back to doing a resource-by-resource check. Unfortunately, for larger resource sets, this means we end up making N kubectl calls as we individually check each resource: the very issue #781 was meant to solve.

It turns out batch_dry_run fails pathologically because we were attempting to apply non-applyable resources (e.g. Pods with generateName). The solution is to partition resources into applyable/non-applyable and batch dry-run only the former. A quick tophat confirms this works.

With that out of the way, the next thing we should do is just trust the batch results and raise a FatalDeploymentError if it doesn't succeed.
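The partitioning idea can be sketched in plain Ruby. The `Resource` struct below is a stand-in, not krane's actual resource class; only `Enumerable#partition` and the `deploy_method` predicate mirror the real change:

```ruby
# Hypothetical stand-in for krane's resource objects, which expose #deploy_method.
Resource = Struct.new(:name, :deploy_method)

resources = [
  Resource.new("web-deployment", :apply),
  Resource.new("unmanaged-pod", :create), # e.g. a Pod using generateName; not applyable
  Resource.new("app-configmap", :apply),
]

# Only :apply resources are safe to include in a single batched dry-run apply;
# everything else must be validated individually.
applyables, individuals = resources.partition { |r| r.deploy_method == :apply }

applyables.map(&:name)  # => ["web-deployment", "app-configmap"]
individuals.map(&:name) # => ["unmanaged-pod"]
```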

@timothysmith0609 force-pushed the tsmith/batch-dry-run-but-like-for-real-this-time branch from 03703cc to e5fe76b on January 22, 2024 16:08
Comment on lines 288 to 293
validate_globals(resources)
applyables, individuals = resources.partition { |r| r.deploy_method == :apply }
batch_dry_run_success = validate_dry_run(applyables)
resources_to_validate = batch_dry_run_success ? individuals : resources

Krane::Concurrency.split_across_threads(resources_to_validate) do |r|

Originally I wanted to just trust the output of the batched dry-run test and raise a FatalDeploymentError. However, that made it very difficult to provide a consistent UX with regard to giving useful feedback: the main issue being that it became necessary to suppress all feedback if a sensitive document/secret was contained in the apply set.

To that end, I've kept things pretty similar, just removing resources from the per-resource checks if batched dry-run succeeds. If it fails, since that's the less common case, we default to per-resource checking in order to provide more granular feedback and highlight what the actual issue is. This is the difference between "Something somewhere went wrong" and "THIS particular file and THIS particular line are wrong".

Comment on lines +23 to +27
test_suite:
- "unit_test"
- "cli_test"
- "serial_integration_test"
- "integration_test"

Integration tests are flaky and take a long time to run. Adding this extra dimension means we don't have to sit through the unit/cli/serial tests each time integration fails.

Comment on lines 294 to +296
Krane::Concurrency.split_across_threads(resources) do |r|
# No need to pass in kubectl (and do per-resource dry run apply) if batch dry run succeeded
if batch_dry_run_success
r.validate_definition(kubectl: nil, selector: @selector, dry_run: false)
else
r.validate_definition(kubectl: kubectl, selector: @selector, dry_run: true)
end
# No need to pass in kubectl as we batch dry run server-side apply above
r.validate_definition(kubectl: nil, selector: @selector, dry_run: false)

We still validate ALL resources, because there are additional krane-specific checks. However, we never pass in kubectl, so we never end up running a dry-run apply (neither client-side nor server-side).
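A minimal sketch of that control flow, under the assumption that it helps to see it in isolation (the class and its checks here are hypothetical; the real `validate_definition` lives on krane's resource classes): local checks always run, and the dry-run apply is skipped whenever no kubectl client is passed.

```ruby
# Hypothetical illustration of validation that only shells out to kubectl
# when a client is provided; krane-specific local checks always run.
class ResourceValidator
  def initialize(definition)
    @definition = definition
    @errors = []
  end

  def validate_definition(kubectl: nil, dry_run: false)
    run_local_checks # krane-specific checks, independent of the cluster
    # The per-resource dry-run apply only happens when a client is passed in.
    run_dry_run_apply(kubectl) if kubectl && dry_run
    @errors.empty?
  end

  private

  def run_local_checks
    @errors << "metadata.name is required" unless @definition.dig("metadata", "name")
  end

  def run_dry_run_apply(_kubectl)
    # placeholder: the real implementation shells out to `kubectl apply --dry-run`
  end
end

validator = ResourceValidator.new("metadata" => { "name" => "web" })
validator.validate_definition(kubectl: nil, dry_run: false) # => true
```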

Comment on lines +320 to +324
message = "This command is namespaced and cannot be used to deploy global resources. "\
"Use GlobalDeployTask instead."
@logger.summary.add_paragraph(ColorizedString.new(message).yellow)
@logger.summary.add_paragraph(ColorizedString.new("Global resources:\n#{global_names}").yellow)
raise FatalDeploymentError, "This command is namespaced and cannot be used to deploy global resources. "\
"Use GlobalDeployTask instead."
raise FatalDeploymentError

Moves the log lines around to make the FatalDeploymentError handler simpler while preserving log output

tags = statsd_tags + (dry_run ? ['dry_run:true'] : ['dry_run:false'])
Krane::StatsD.client.distribution('apply_all.duration', Krane::StatsD.duration(start), tags: tags)
if st.success?
log_pruning(out) if prune
elsif dry_run
record_dry_run_apply_failure(err, resources: resources)

The logging is just different enough to require separate methods for the dry-run and actual-run failures
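Roughly, the split looks like the sketch below (the method bodies and messages are hypothetical, not krane's real implementation; only the `record_dry_run_apply_failure` name comes from the diff): dry-run failures map kubectl's stderr back onto individual resources, while real apply failures are fatal and may need sensitive output suppressed.

```ruby
# Hypothetical sketch: dry-run failures attribute errors to resources...
def record_dry_run_apply_failure(err, resources:)
  offending = resources.select { |r| err.include?(r) }
  "Dry-run validation failed for: #{offending.join(', ')}"
end

# ...while actual-run failures handle the sensitive-output case.
def record_apply_failure(err, sensitive: false)
  sensitive ? "Command failed: apply -f (output suppressed)" : "Command failed: apply -f: #{err}"
end

record_dry_run_apply_failure("error validating pod-a: unknown field", resources: ["pod-a", "pod-b"])
# => "Dry-run validation failed for: pod-a"
```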

@@ -75,10 +75,9 @@ def test_apply_failure_with_sensitive_resources_hides_template_content
refute_logs_match(%r{Kubectl err:.*something/invalid})

assert_logs_match_all([
"Command failed: apply -f",
"Template validation failed",

Previously these errors would occur during the deploy step (since we discarded the batch dry-run result and fell back to client-side validations). Now that we trust batch dry-run, the failure occurs earlier.

Comment on lines +59 to +79
# def test_service_account_predeployed_before_unmanaged_pod
# # Add a valid service account in unmanaged pod
# service_account_name = "build-robot"
# result = deploy_fixtures("hello-cloud",
# subset: ["configmap-data.yml", "unmanaged-pod-1.yml.erb", "service-account.yml"],
# render_erb: true) do |fixtures|
# pod = fixtures["unmanaged-pod-1.yml.erb"]["Pod"].first
# pod["spec"]["serviceAccountName"] = service_account_name
# pod["spec"]["automountServiceAccountToken"] = false
# end
# # Expect the service account is deployed before the unmanaged pod
# assert_deploy_success(result)
# hello_cloud = FixtureSetAssertions::HelloCloud.new(@namespace)
# hello_cloud.assert_configmap_data_present
# hello_cloud.assert_all_service_accounts_up
# hello_cloud.assert_unmanaged_pod_statuses("Succeeded", 1)
# assert_logs_match_all([
# %r{Successfully deployed in \d.\ds: ServiceAccount/build-robot},
# %r{Successfully deployed in \d+.\ds: Pod/unmanaged-pod-.*},
# ], in_order: true)
# end

This was the last error I went to fix, and it ends up being the most important. If we rely on the output of batched dry-run, we cannot deploy resources that are codependent on one another, even if such an apply is valid and would work in practice.

For example, in this test the Pod resource references a ServiceAccount that will be deployed alongside it. This fails dry-run because the ServiceAccount does not actually exist on the server. Krane gets around this by predeploying certain resources (like ServiceAccounts...).

I'm not sure what the path forward here is. In reality, it points to a fundamental, irreconcilable issue: krane's built-in pacing makes it impossible to server-side dry-run validate sets of resources that include co-dependencies. That really only leaves:

  • resource-by-resource server-side dry run
  • batch client-side dry run (this handles the case above but is less robust overall)

I'm tempted to just remove the batch dry-run and keep the resource-by-resource implementation. For core we can rely on bypassing this entirely with the work in #944 and Production Platform tooling.
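The co-dependency problem can be illustrated with a toy simulation (everything here is hypothetical; a real server-side dry run is `kubectl apply --dry-run=server`, which validates references against what already exists on the cluster):

```ruby
# Nothing has been predeployed to our toy "cluster".
existing = []

batch = [
  { kind: "ServiceAccount", name: "build-robot" },
  { kind: "Pod", name: "unmanaged-pod", service_account: "build-robot" },
]

# A server-side dry run can only see resources that already exist on the
# server, not other members of the same batch.
def server_dry_run(resource, existing)
  return true unless resource[:service_account]
  existing.any? { |r| r[:kind] == "ServiceAccount" && r[:name] == resource[:service_account] }
end

results = batch.map { |r| server_dry_run(r, existing) }
# => [true, false] : the Pod is rejected because its ServiceAccount does not
# exist on the server yet, even though it is part of the same batch and an
# ordered deploy (ServiceAccount first) would succeed.
```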

@timothysmith0609

Replaced by #946

@timothysmith0609 timothysmith0609 deleted the tsmith/batch-dry-run-but-like-for-real-this-time branch January 30, 2024 15:16