
NetworkPolicy resources are not predeployed when using calico as the network provider #773

Open
trueskawka opened this issue Dec 15, 2020 · 11 comments · May be fixed by #774
Comments

@trueskawka

trueskawka commented Dec 15, 2020

We've been using your tooling for a while, thanks for making it open-source <3 Appreciate all the work you put into it!

We recently went through the full upgrade path from kubernetes-deploy 0.25.0 and wanted to update to the newest krane, but this bug is preventing us from updating beyond major version 1.

Bug report

When using calico as the network provider, NetworkPolicy resources are no longer pre-deployed.

Expected behavior: NetworkPolicy resources are pre-deployed as priority resources.

Actual behavior: NetworkPolicy resources are deployed alongside non-priority resources.

Version(s) affected:
This was introduced when bumping to major version 2, in #735.

Steps to Reproduce

You can reproduce it in minikube using your test suite, after starting minikube with calico as the network provider:

$ minikube start --network-plugin=cni --cni=calico
$ bundle exec rake test TEST=test/integration/krane_deploy_test.rb TESTOPTS="--name=test_network_policies_are_deployed_first -v"
Preparing test cluster... Done.

# Running tests with run options --name=test_network_policies_are_deployed_first -v --seed 3651:

F

Finished tests in 8.788226s, 0.1138 tests/s, 0.1138 assertions/s.


Failure:
KraneDeployTest#test_network_policies_are_deployed_first [/Users/alicjaraszkowska/Documents/Repos/krane/test/integration/krane_deploy_test.rb:1385]
Minitest::Assertion: 'Predeploying priority resources' not found in the expected sequence in the following logs:
[INFO][2020-12-14 19:02:12 -0800]
[INFO][2020-12-14 19:02:12 -0800]	------------------------------------Phase 1: Initializing deploy------------------------------------
[INFO][2020-12-14 19:02:13 -0800]	All required parameters and files are present
[INFO][2020-12-14 19:02:13 -0800]	Discovering resources:
[INFO][2020-12-14 19:02:15 -0800]	  - NetworkPolicy/allow-all-network-policy
[INFO][2020-12-14 19:02:16 -0800]
[INFO][2020-12-14 19:02:16 -0800]	----------------------------Phase 2: Checking initial resource statuses-----------------------------
[INFO][2020-12-14 19:02:16 -0800]	NetworkPolicy/allow-all-network-policy            Not Found
[INFO][2020-12-14 19:02:16 -0800]
[INFO][2020-12-14 19:02:16 -0800]	----------------------------------Phase 3: Deploying all resources----------------------------------
[INFO][2020-12-14 19:02:17 -0800]	Deploying NetworkPolicy/allow-all-network-policy (timeout: 30s)
[INFO][2020-12-14 19:02:21 -0800]	Successfully deployed in 3.7s: NetworkPolicy/allow-all-network-policy
[INFO][2020-12-14 19:02:21 -0800]
[INFO][2020-12-14 19:02:21 -0800]	------------------------------------------Result: SUCCESS-------------------------------------------
[INFO][2020-12-14 19:02:21 -0800]	Successfully deployed 1 resource
[INFO][2020-12-14 19:02:21 -0800]
[INFO][2020-12-14 19:02:21 -0800]	Successful resources
[INFO][2020-12-14 19:02:21 -0800]	NetworkPolicy/allow-all-network-policy            Created


Slowest tests:

8.719095s test_network_policies_are_deployed_first#KraneDeployTest

Feature request

  • If the maintainers agree with the feature as described here, I intend to submit a Pull Request myself. [1]

Proposal:
Change the way resources are prioritized. I'd prefer approach 1, since it is more flexible, but I may be missing context, and the stronger assumption about NetworkPolicy resources proposed in approach 2 may be sufficient.

  1. Make the predeploy sequence configurable
    Since the change that introduced this issue was made to address a request for a reversed priority order (i.e. custom resources over core ones, see "Service deployed in priority resources phase" #728), hard-coding a strong assumption about which resources should be prioritized may not fit all use cases.

    Giving users a configurable way to define which resources should be predeployed would cover whatever use cases they need.

  2. Ensure NetworkPolicy resources are always predeployed
    Filter out NetworkPolicy resources from this hash:

    crs = cluster_resource_discoverer.crds.select(&:predeployed?).map { |cr| [cr.kind, { group: cr.group }] }
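
    A rough sketch of that filter (illustrative only; it simply drops CRD-derived entries whose kind collides with the hard-coded NetworkPolicy priority entry):

    crs = cluster_resource_discoverer.crds
      .select(&:predeployed?)
      .reject { |cr| cr.kind == "NetworkPolicy" } # hypothetical filter for this bug
      .map { |cr| [cr.kind, { group: cr.group }] }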
    

[1] This is the quickest way to get a new feature! We reserve the right to close feature requests, even ones we like, if the proposer does not intend to contribute to the feature and it doesn't fit in our current roadmap.

@dturn
Contributor

dturn commented Dec 15, 2020

I'm surprised this is happening. By default, CRs are pre-deployed. This behavior can be turned off with an annotation on the CRD (krane.shopify.io/predeployed: false; see https://github.com/Shopify/krane#customizing-behaviour-with-annotations).

Can you add the example CR and CRD to a comment in this issue?

@trueskawka
Author

trueskawka commented Dec 15, 2020

We're setting up calico as a separate workload in every cluster. Whenever we deploy a different workload to a cluster, calico is already there, so krane's resource discovery finds the NetworkPolicy CRD that comes with it. That overrides predeployment of the NetworkPolicy resources in whatever workload we're deploying.

That happens for every workload we deploy to the cluster, no matter what the NetworkPolicy is, because the entries discovered from CRDs (crs) come after the hard-coded entries (before_crs) when the hashes are concatenated in DeployTask#predeploy_sequence (https://github.com/Shopify/krane/blob/master/lib/krane/deploy_task.rb#L59-L78), specifically:

Hash[before_crs + crs + after_crs]
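
To make the override concrete, here is a toy illustration (the group strings and variable contents are illustrative, not krane's actual values; Ruby's Hash keeps the last value for a duplicate key):

before_crs = [["NetworkPolicy", { group: "networking.k8s.io" }]]     # hard-coded priority entry
crs        = [["NetworkPolicy", { group: "crd.projectcalico.org" }]] # built from calico's CRD
after_crs  = []
Hash[before_crs + crs + after_crs]
# => {"NetworkPolicy"=>{:group=>"crd.projectcalico.org"}}
# The group for the core NetworkPolicy no longer matches, so it is skipped during predeploy.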

Examples

  1. CRD NetworkPolicy
    We're using calico CRDs, e.g.:
---
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: globalnetworkpolicies.crd.projectcalico.org
  labels:
    workload: calico
spec:
  scope: Cluster
  group: crd.projectcalico.org
  version: v1
  names:
    kind: GlobalNetworkPolicy
    plural: globalnetworkpolicies
    singular: globalnetworkpolicy
  2. CR NetworkPolicy
    Any NetworkPolicy CR will hit this issue; a minimal example:
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: example-network-policy
spec:
  podSelector: {}
  ingress:
    - {}
  policyTypes:
    - Ingress

@dturn
Contributor

dturn commented Dec 16, 2020

Oh, I see: NetworkPolicy exists in the extensions or networking.k8s.io API group, depending on the k8s version.

When you use Calico it also adds a NetworkPolicy CRD, and because of the way we construct the hash (https://github.com/Shopify/krane/blob/master/lib/krane/deploy_task.rb#L77) we only consider Calico's NetworkPolicy as something to predeploy. Yep, this is 100% a bug.

So fixing this is a bit tricky. My two stabs at it are in #774, but neither is quite right. I've got a third idea (we have to track the groups to pre-deploy and the groups not to pre-deploy), but I might not finish it today.
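
A rough sketch of that third idea (illustrative only, not the actual change in #774): key the predeploy entries by both group and kind, so two resources that share a kind but live in different API groups get independent entries.

# Illustrative sketch only, not krane's actual code or the #774 implementation.
# Key predeploy entries by [group, kind] instead of kind alone.
crs = cluster_resource_discoverer.crds.select(&:predeployed?).map do |cr|
  [[cr.group, cr.kind], { group: cr.group }]
end
# Core entries would be keyed the same way, e.g. ["networking.k8s.io", "NetworkPolicy"],
# so calico's ["crd.projectcalico.org", "NetworkPolicy"] could no longer override them.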

@trueskawka
Author

Yay, glad it makes sense with more context 🎉

I couldn't think of an elegant way to solve it, because in some cases you may want to predeploy CRDs and not CRs (it seemed like that was the case for the feature request the grouping was addressing, i.e. #728).

I was wondering whether there is a definite set of cases for which resources you may or may not want to predeploy, or whether it depends on the workload and how the cluster is set up?

@dturn
Contributor

dturn commented Dec 17, 2020

I was wondering whether there is a definite set of cases for which resources you may or may not want to predeploy, or whether it depends on the workload and how the cluster is set up?

It's up to each operator whether they want to pre-deploy CRs or not. As an example: if you use a CR to do canary deploys, you wouldn't want it to be pre-deployed.

@trueskawka
Author

Hi @dturn, happy new year 🎉

Did you end up making any progress on a different approach? Anything I could do to help?

Happy to provide more info and brainstorm :)

@trueskawka
Author

Following up @dturn :) Anything I could do to help?

@dturn
Contributor

dturn commented Jan 11, 2021

I'm slowly making progress on #774, but when I went to write the test I realized this is going to take longer than expected. It turns out there are at least three places where we assume Kinds are unique: predeploy, KubernetesResource#build, and ResourceCache.

The PR was originally trying to solve the first issue. Solving the second is done locally, but I still haven't had enough time to finish up the third. Assuming no more surprises, I'm optimistic I'll get it done this week.

@dturn
Contributor

dturn commented Feb 2, 2021

@trueskawka you've probably figured this out, but I haven't had as much time to work on this as needed. I can't commit to getting this fixed in a reasonable time frame. If you're interested in taking over, let me know and I'll do my best to support you. Otherwise, you can follow this issue and I'll keep it updated.

@trueskawka
Author

Hi @dturn!

We were able to work around it by doing our networking a bit differently and not depending on calico as much. I'm happy to try to PR a change to fix it for others who might still be running into this issue.

Gonna do a PR sometime in the next two weeks and @-mention you :)

@benlangfeld
Contributor

@dturn @trueskawka Has anything changed since the last comments here? I'm running into this same problem.

@dturn I know you stated there are three bugs to fix, but I wonder if it's necessary to fix them all together. If you have a solution for the specific issue described here, is it not possible to ship that on its own, acknowledging that other similar issues will remain?
