This repository has been archived by the owner on Oct 24, 2023. It is now read-only.

feat: Antrea plugin support in AKS Engine #2407

Merged
merged 5 commits into from
Dec 20, 2019

Conversation

@reachjainrahul (Contributor)

Reason for Change:

Add the Antrea networking and network policy plugin to AKS Engine. https://github.com/vmware-tanzu/antrea

Issue Fixed:

Requirements:

Notes:

@welcome

welcome bot commented Dec 6, 2019

💖 Thanks for opening your first pull request! 💖 We use semantic commit messages to streamline the release process. Before your pull request can be merged, you should make sure your first commit and PR title start with a semantic prefix. Examples of commit messages with semantic prefixes:

- fix: change azure disk cachingMode to ReadOnly
- feat: make maximumLoadBalancerRuleCount configurable
- docs: add note on AKS Engine and AKS relationship
Make sure to check out the developer guide for guidance on testing your change.

@msftclas

msftclas commented Dec 6, 2019

CLA assistant check
All CLA requirements met.

@reachjainrahul (Contributor Author)

/assign @jackfrancis

@@ -141,6 +141,7 @@ aks-engine generate --set agentPoolProfiles[0].count=5,agentPoolProfiles[1].name

* To enable the optional network policy enforcement using calico, you have to set the parameter during this step according to this [guide](../topics/features.md#optional-enable-network-policy-enforcement-using-calico)
* To enable the optional network policy enforcement using cilium, you have to set the parameter during this step according to this [guide](../topics/features.md#optional-enable-network-policy-enforcement-using-cilium)
* To enable the optional network policy enforcement using antrea, you have to set the parameter during this step according to this [guide](../topics/features.md#optional-enable-network-policy-enforcement-using-antrea)


This sounds more like a policy-only mode. However, the aim of this patch is the Antrea network plugin, not just policy, no?

Contributor Author:

This section is for policy mode only. For networking there is a separate doc.


Correct. What I was looking for was a way to ensure that we do not convey the message that Antrea can be used in a policy-only mode with another networking CNI, and that there should be some validation. Jack answered that in his comment regarding validation, so we should be good.

Contributor Author:

Actually, 3 combinations were added in the validation:

  1. Just NetworkPlugin, without any policy. If any policy other than antrea is specified, it's a validation error.
  2. Just NetworkPolicy, without any NetworkPlugin. If a network plugin other than antrea is specified, it's a validation error.
  3. Both the network and policy plugin set to antrea.

We will remove option 1. Hope it makes sense.

Member:

I removed option 1 as a valid option following the recent rationalization of cilium, which follows a similar pattern.

examples/networkpolicy/kubernetes-antrea.json (resolved)
examples/networkpolicy/README.md (outdated, resolved)
@jackfrancis jackfrancis added this to Under Review in backlog Dec 18, 2019
@jackfrancis jackfrancis moved this from Under Review to In progress in backlog Dec 18, 2019
@jackfrancis (Member)

Hello again. I've been enumerating through the existing addons, and in my refactor of this one it appears we're in a similar functional space as antrea:

#2480

Specifically, it appears that like the existing cilium implementation, antrea delivers both IPAM and NetworkPolicy functionality, and those will be represented by "networkPlugin": "antrea" and "networkPolicy": "antrea", respectively. Also, like cilium, it appears that the antrea IPAM implementation depends upon artifacts delivered via a daemonset in the addon spec.

Is that correct?

If so, I would advocate that we deliver the antrea specification as follows:

  • if the user specifies "networkPolicy": "antrea" in the api model, we automatically set the "networkPlugin": "antrea" configuration
  • if the user specifies only "networkPlugin": "antrea", we throw a validation error at ARM template generation time

In other words, we want to always deliver a cluster configuration that looks like:

...
    "networkPolicy": "antrea",
    "networkPlugin": "antrea"
...

And we permit the user to specify only the networkPolicy part, automatically filling in the antrea networkPlugin configuration for them. But we don't permit a networkPlugin-only antrea cluster configuration, because that may mislead the user into thinking that we support an antrea CNI implementation for IPAM without NetworkPolicy, which is in fact not true.

Does that all sound right?

@jianjuns

@jackfrancis your suggestion makes sense to me. We do not support a model to use Antrea for IPAM and connectivity only, but use other CNI for NetworkPolicy.
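The agreed behavior (default the plugin when only the policy is set, and reject mismatched or plugin-only combinations) can be sketched as follows. This is a simplified illustration with hypothetical helper names and a trimmed stand-in struct, not the actual AKS Engine defaulting and validation code:

```go
package main

import "fmt"

const antrea = "antrea"

// KubernetesConfig is a trimmed stand-in for the api model's networking fields.
type KubernetesConfig struct {
	NetworkPlugin string
	NetworkPolicy string
}

// setAntreaDefaults mirrors the proposal: if the user asks for the antrea
// NetworkPolicy and leaves NetworkPlugin empty, fill in the plugin for them.
func setAntreaDefaults(c *KubernetesConfig) {
	if c.NetworkPolicy == antrea && c.NetworkPlugin == "" {
		c.NetworkPlugin = antrea
	}
}

// validateAntrea mirrors the proposal: antrea must provide both IPAM and
// NetworkPolicy together; a plugin-only or mismatched config is rejected.
func validateAntrea(c KubernetesConfig) error {
	if c.NetworkPlugin == antrea && c.NetworkPolicy != antrea {
		return fmt.Errorf("networkPlugin 'antrea' requires networkPolicy 'antrea'")
	}
	if c.NetworkPolicy == antrea && c.NetworkPlugin != antrea {
		return fmt.Errorf("networkPolicy 'antrea' requires networkPlugin 'antrea'")
	}
	return nil
}

func main() {
	// Policy-only input: defaulting fills in the plugin, then validation passes.
	c := KubernetesConfig{NetworkPolicy: antrea}
	setAntreaDefaults(&c)
	fmt.Println(c.NetworkPlugin, validateAntrea(c) == nil) // antrea true

	// Plugin-only input: rejected at validation time.
	bad := KubernetesConfig{NetworkPlugin: antrea}
	fmt.Println(validateAntrea(bad) != nil) // true
}
```

Defaulting runs before validation, so the only user-visible error is the plugin-only case the maintainers wanted to forbid.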

backlog automation moved this from In progress to Done Dec 19, 2019
@reachjainrahul (Contributor Author)

> (quoting @jackfrancis's design proposal above)

@jackfrancis Would you update the patch for the validation error? Your suggestion sounds good to me too.

backlog automation moved this from Done to Under Review Dec 19, 2019
@jackfrancis (Member)

Hi @reachjainrahul and folks,

I've rebased this PR and done a bit of cleanup so it resembles other addons (specifically cilium). I have a passing E2E test, so I think this is ready to merge. However, I am seeing this on my cluster:

$ kubectl get pods --all-namespaces
NAMESPACE     NAME                                            READY   STATUS             RESTARTS   AGE
...
kube-system   antrea-agent-m526s                              2/2     Running            1          45m
kube-system   antrea-agent-qjrk8                              2/2     Running            0          45m
kube-system   antrea-agent-zg5c4                              2/2     Running            0          45m
kube-system   antrea-controller-6795f5bbfd-4b6qr              1/1     Running            0          45m
...
kube-system   kube-controller-manager-k8s-master-38231306-0   0/1     CrashLoopBackOff   8          45m
...
kube-system   kube-scheduler-k8s-master-38231306-0            0/1     CrashLoopBackOff   8          45m

Note the 8 restarts for both the controller-manager and scheduler pods. Both seem to be exhibiting a similar failure symptom:

$ kubectl logs kube-controller-manager-k8s-master-38231306-0 -n kube-system
...
I1219 23:28:02.459750       1 leaderelection.go:281] failed to renew lease kube-system/kube-controller-manager: failed to tryAcquireOrRenew context deadline exceeded
F1219 23:28:02.459832       1 controllermanager.go:281] leaderelection lost
$ kubectl logs kube-scheduler-k8s-master-38231306-0 -n kube-system
...
I1219 23:28:04.120899       1 leaderelection.go:281] failed to renew lease kube-system/kube-scheduler: failed to tryAcquireOrRenew context deadline exceeded
E1219 23:28:04.121324       1 server.go:254] lost master
lost lease

Have you seen this in any of your tests? As you can imagine, pod scheduling/reconciliation is significantly de-optimized on a cluster like this, as the key componentry responsible for that is regularly offline.

@jackfrancis (Member)

/azp run pr-e2e

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@@ -312,6 +312,11 @@ ensureKubelet() {
sleep 3
done
{{end}}
{{if HasAntreaNetworkPolicy}}
while [ ! -f /etc/cni/net.d/10-antrea.conf ]; do
Member:

Note: if the daemonset implementation ever changes the name of this CNI config file, we'll have to update this file-wait implementation.

@jackfrancis previously approved these changes Dec 19, 2019

@jackfrancis (Member) left a comment:

/lgtm

@codecov

codecov bot commented Dec 20, 2019

Codecov Report

Merging #2407 into master will increase coverage by 0.06%.
The diff coverage is 89.69%.

@@            Coverage Diff             @@
##           master    #2407      +/-   ##
==========================================
+ Coverage   72.63%   72.69%   +0.06%     
==========================================
  Files         130      130              
  Lines       24005    24096      +91     
==========================================
+ Hits        17435    17517      +82     
- Misses       5544     5553       +9     
  Partials     1026     1026

readOnly: true
- command:
- start_ovs
image: antrea/antrea-ubuntu:latest


We should perhaps pin this to a version and not latest? antrea/antrea-ubuntu:v0.2.0

Member:

Agreed, an immutable reference would be preferable. It doesn't have to block merge, though, at this phase.

Also, the references to the images are here:

https://github.com/Azure/aks-engine/pull/2407/files#diff-bb9b6cb6f08a63800be959ba49eaf714R42

@abhiraut previously approved these changes Dec 20, 2019
"env": {
},
"options": {
"allowedOrchestratorVersions": ["1.13", "1.14", "1.15", "1.16"]
Member:

@reachjainrahul / @abhiraut do we know for sure that antrea doesn't work w/ k8s 1.17?


In theory, yes, it should work. However, let's update this only after we run some CI tests against 1.17? @jianjuns any opinion?

@jackfrancis (Member)

/azp run pr-e2e

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@jackfrancis (Member)

/azp run pr-e2e

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@@ -122,6 +121,7 @@ func assignKubernetesParameters(properties *api.Properties, parametersMap params
// Kubernetes node binaries as packaged by upstream kubernetes
// example at https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.11.md#node-binaries-1
addValue(parametersMap, "windowsKubeBinariesURL", kubernetesConfig.WindowsNodeBinariesURL)
addValue(parametersMap, "kubeServiceCidr", kubernetesConfig.ServiceCIDR)
Contributor Author:

Why this revert? Antrea requires ServiceCIDR. Adding it here will set it for Windows VMs only, right?

"HasAntreaNetworkPlugin": func() bool {
return cs.Properties.OrchestratorProfile.KubernetesConfig.NetworkPlugin == NetworkPluginAntrea
"HasAntreaNetworkPolicy": func() bool {
return cs.Properties.OrchestratorProfile.KubernetesConfig.NetworkPlugin == NetworkPolicyAntrea
Contributor Author:

Shouldn't this be a NetworkPolicy check?
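The reviewer is pointing out that the second template function in the diff compares the NetworkPlugin field against the policy constant. A corrected pair would compare each field against its own constant, roughly as below; this is a simplified sketch with stand-in types, not the engine's actual template-function wiring:

```go
package main

import "fmt"

const (
	NetworkPluginAntrea = "antrea"
	NetworkPolicyAntrea = "antrea"
)

// KubernetesConfig is a trimmed stand-in for the api model's config struct.
type KubernetesConfig struct {
	NetworkPlugin string
	NetworkPolicy string
}

// hasAntreaNetworkPlugin checks the plugin field against the plugin constant.
func hasAntreaNetworkPlugin(c KubernetesConfig) bool {
	return c.NetworkPlugin == NetworkPluginAntrea
}

// hasAntreaNetworkPolicy checks the *policy* field against the policy
// constant; the diff above mistakenly checked NetworkPlugin here.
func hasAntreaNetworkPolicy(c KubernetesConfig) bool {
	return c.NetworkPolicy == NetworkPolicyAntrea
}

func main() {
	// A config with a different plugin but antrea policy exercises the bug:
	// the buggy version would have returned false for the policy check.
	c := KubernetesConfig{NetworkPlugin: "azure", NetworkPolicy: "antrea"}
	fmt.Println(hasAntreaNetworkPlugin(c), hasAntreaNetworkPolicy(c)) // false true
}
```

Because both constants happen to equal "antrea", the bug is invisible whenever plugin and policy are set together, which is exactly why a field-level review catch like this one matters.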

@@ -1166,7 +1159,7 @@
"kubeConfigCertificate": "[parameters('kubeConfigCertificate')]",
"kubeConfigPrivateKey": "[parameters('kubeConfigPrivateKey')]",
"kubeDnsServiceIp": "10.0.0.10",
"kubeServiceCidr": "[parameters('kubeServiceCidr')]",
"kubeServiceCidr": "10.0.0.0/16",
Contributor Author:

Is this revert intentional?

@@ -39,6 +39,10 @@ const (
ciliumCleanStateImageReference string = "docker.io/cilium/cilium-init:2018-10-16"
ciliumOperatorImageReference string = "docker.io/cilium/operator:v1.4"
ciliumEtcdOperatorImageReference string = "docker.io/cilium/cilium-etcd-operator:v2.0.5"
antreaControllerImageReference string = "antrea/antrea-ubuntu:latest"
Contributor Author:

Antrea 0.2.0 was released today. Could we please update this to antrea/antrea-ubuntu:v0.2.0?

@reachjainrahul (Contributor Author)

I didn't see this in my test, which I ran 2 weeks ago. @jianjuns any known issue with the latest antrea? My antrea yaml referred to master.

@reachjainrahul (Contributor Author)

/azp run pr-e2e

@azure-pipelines

Commenter does not have sufficient privileges for PR 2407 in repo Azure/aks-engine

@reachjainrahul (Contributor Author)

> (quoting @jackfrancis's comment above about the crash-looping kube-controller-manager and kube-scheduler)

I ran the E2E test with Antrea 0.2.0 and didn't see the issue.

azureuser@k8s-master-41238993-0:~$ kubectl get pods -o wide -n kube-system 
NAME                                            READY   STATUS    RESTARTS   AGE   IP             NODE                                 NOMINATED NODE   READINESS GATES
antrea-agent-ct7fp                              2/2     Running   0          26m   10.240.255.5   k8s-master-41238993-0                <none>           <none>
antrea-agent-fclvn                              2/2     Running   1          26m   10.240.0.5     k8s-agentpool1-41238993-vmss000001   <none>           <none>
antrea-agent-zv74b                              2/2     Running   0          26m   10.240.0.4     k8s-agentpool1-41238993-vmss000000   <none>           <none>
antrea-controller-5b9b8d5bb9-pcllg              1/1     Running   0          26m   10.240.0.5     k8s-agentpool1-41238993-vmss000001   <none>           <none>
kube-addon-manager-k8s-master-41238993-0        1/1     Running   0          25m   10.240.255.5   k8s-master-41238993-0                <none>           <none>
kube-apiserver-k8s-master-41238993-0            1/1     Running   0          25m   10.240.255.5   k8s-master-41238993-0                <none>           <none>
kube-controller-manager-k8s-master-41238993-0   1/1     Running   0          25m   10.240.255.5   k8s-master-41238993-0                <none>           <none>
kube-scheduler-k8s-master-41238993-0            1/1     Running   0          25m   10.240.255.5   k8s-master-41238993-0                <none>           <none>
kubernetes-dashboard-7947fffdf5-q57cc           1/1     Running   0          26m   10.244.0.5     k8s-agentpool1-41238993-vmss000001   <none>           <none>
metrics-server-69b44566d5-fwftw                 1/1     Running   0          26m   10.244.0.4     k8s-agentpool1-41238993-vmss000001   <none>           <none>

@jackfrancis (Member)

/azp run pr-e2e

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@jackfrancis (Member) left a comment:

/lgtm

@acs-bot acs-bot added the lgtm label Dec 20, 2019
@acs-bot acs-bot merged commit 02e3076 into Azure:master Dec 20, 2019
backlog automation moved this from Under Review to Done Dec 20, 2019
@welcome

welcome bot commented Dec 20, 2019

Congrats on merging your first pull request! 🎉🎉🎉

@acs-bot

acs-bot commented Dec 20, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jackfrancis, reachjainrahul

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

acs-bot pushed a commit that referenced this pull request Jan 9, 2020
* Rashmi/ciprod12042019 (#3)

improvement: Update Containermonitoring addon for december release (#1)

* Fixing yaml indentation issues for omsagent (#4)

Fixing yaml indentation issues for omsagent

* Fixing more indentation issues in omsagent yaml (#5)

Fixing more indentation issues in omsagent yaml

* updating templates_generated file

* Fix: Updating generated file to fix unit tests (#6)

fix: Updating generated file to fix unit tests

* Adding test coverage for 1.16 and 1.17 (#7)

Adding test coverage for 1.16 and 1.17

* Merging changes for omsagent latest release - ciprod01072020 after syncing from remote master(#8)

* chore: use go template comments for generate proxy certs script (#2336)

* fix: fix ARM dependency issues with vm user-specified extensions on node pools (#2398)

* fix: fix ARM dependency issues if many extensions are specified for a node profile

* fix scale up case for windows vhd case. (#2483)

* refactor: make cilium addon user-configurable (#2480)

* refactor: make cilium addon user-configurable

* chore: clarify that cilium doesn't work w/ 1.16 and above, add validation

* test: addons UT

* test: go template UT

* ci: use Standard_D8_v3 for cilium test, only run NetworkPolicy tests

* fix: error message language

* chore: remove debug fmt.Println

* ci: revert back to Standard_D2_v3

* chore: upgrade cni-plugins to v0.7.6 (#2484)

* fix: hard-coding hyper-v generation when using VHD URls as a quick unblock (#2487)

* feat: Configuring docker log rotation for Windows nodes (#2478)

* feat: Antrea plugin support in AKS Engine (#2407)

* Antrea plugin support in AKS Engine

* chore: clean up

* chore: use ContainerImage

* chore: generated code

* refactor: Updating antrea yaml to 0.2.0

Co-authored-by: Jack Francis <jackfrancis@gmail.com>

* chore: lint (#2493)

* test: revert change to default kubernetes.json api model example (#2494)

* chore: update cloud-provider-azure components to v0.4.0 (#2473)

* chore: update cloud-provider-azure components to v0.4.0

See https://github.com/kubernetes-sigs/cloud-provider-azure/releases/tag/v0.4.0

* refactor: strip MCR constant to base hostname of URL

* fix: fetch Azure cloud-manager images from /oss/kubernetes/

* refactor: make audit-policy and azure-cloud-provider addons user-configurable (#2496)

* chore: pre-pull k8s v1.15.7-azs (#2490)

* fix: Fix some path handling in collect-windows-logs script (#2488)

* docs: remove mentions of old orchestrators (#2501)

* chore: Targeting dec patches for windows VHD (#2505)

* refactor: move StorageClass into azure-cloud-provider addon (#2497)

* add "Standard_DS3_v2" to "AcceleratedNetworking" supported list (#2509)

* ci: collect logs during E2E runs (#2520)

* refactor: user-configurable flannel and scheduled maintenance addons (#2517)

* chore: update Azure NPM to v1.0.31 (#2521)

* feat: add support for Kubernetes 1.18.0-alpha.1 (#2503)

* feat: add support for Kubernetes 1.18.0-alpha.1

See https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.18.md#changelog-since-v1170

* test: add 1.18 to Jenkinsfile

* ci: disable kms for 1.18

* chore: move flannel 1.18 spec to containeraddons

* chore: generated code

* fix: use new cloudprovider implementation for 1.18

Co-authored-by: Jack Francis <jackfrancis@gmail.com>

* test: don't test non-working >= 1.16 flannel + docker (#2524)

* fix: apply new master node labels for k8s v1.18+ compatibility (#2467)

* fix: apply new master node labels for k8s v1.18+ compatibility

* test: check master labels in the future for back-compat

* feat: cleaning up old kubelet/kubeproxy logs for Windows nodes (#2504)

* feat: cleaning up old kubelet/kubeproxy logs for Windows nodes

* Fixing path to look for logs

* generated files

* refactor: standardize to "addons", deprecate "containeraddons" (#2525)

* fix: configure addons before setting kubelet config (#2513)

* chore: update addon-resizer (#2527)

See https://github.com/kubernetes/autoscaler/releases/tag/addon-resizer-1.8.7

* fix: aci-connector region is ignored (#2535)

* test: use LOCATION env var for api model in E2E tests (#2542)

* fix: promote system addons to system-cluster-critical (#2533)

* test: use northeurope for byok testing (#2536)

* Changes for omsagent-version-ciprod01072020

* Committing generated file

Co-authored-by: Jack Francis <jackfrancis@gmail.com>
Co-authored-by: Mark Rossetti <marosset@microsoft.com>
Co-authored-by: Rohit <rjaini@microsoft.com>
Co-authored-by: Rahul Jain <58573065+reachjainrahul@users.noreply.github.com>
Co-authored-by: Matt Boersma <Matt.Boersma@microsoft.com>
Co-authored-by: Javier Darsie <44655727+jadarsie@users.noreply.github.com>
Co-authored-by: Patrick Lang <PatrickLang@users.noreply.github.com>
Co-authored-by: Wenjun Wu <wenjun.wu@live.com>
Co-authored-by: Jaeryn <13284103+jaer-tsun@users.noreply.github.com>
Co-authored-by: Anish Ramasekar <anish.ramasekar@gmail.com>

* deleting github merge auto-generated files

* Adding back 1.17 omsagent yaml changes

* Updating generated file to address build failures

Co-authored-by: Jack Francis <jackfrancis@gmail.com>
Co-authored-by: Mark Rossetti <marosset@microsoft.com>
Co-authored-by: Rohit <rjaini@microsoft.com>
Co-authored-by: Rahul Jain <58573065+reachjainrahul@users.noreply.github.com>
Co-authored-by: Matt Boersma <Matt.Boersma@microsoft.com>
Co-authored-by: Javier Darsie <44655727+jadarsie@users.noreply.github.com>
Co-authored-by: Patrick Lang <PatrickLang@users.noreply.github.com>
Co-authored-by: Wenjun Wu <wenjun.wu@live.com>
Co-authored-by: Jaeryn <13284103+jaer-tsun@users.noreply.github.com>
Co-authored-by: Anish Ramasekar <anish.ramasekar@gmail.com>
Projects: backlog (Done)

6 participants