
[Feature] AKS allows creation of NodePools in different Subnets (Azure CNI) #1338

Closed
rhummelmose opened this issue Nov 28, 2019 · 66 comments

@rhummelmose

rhummelmose commented Nov 28, 2019

As stated under limitations in the documentation, all node pools have to reside within the same subnet.

The ask is to support assignment of a unique subnet per node pool in a cluster, but all shared from the same VNET.
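
For illustration only, a minimal Azure CLI sketch of what the ask would look like, assuming each node pool could simply point at its own subnet of the shared VNet via the existing --vnet-subnet-id parameter of az aks nodepool add (resource names below are hypothetical):

# Hypothetical: a second subnet in the cluster's VNet, then a node pool placed in it.
az network vnet subnet create -g myRG --vnet-name myVnet -n pool2-subnet --address-prefixes 10.240.2.0/24
az aks nodepool add -g myRG --cluster-name myCluster -n pool2 \
  --vnet-subnet-id /subscriptions/<sub>/resourceGroups/myRG/providers/Microsoft.Network/virtualNetworks/myVnet/subnets/pool2-subnet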

@ferantivero

If I am not wrong, this issue is more about starting to validate requests, since users can specify multiple subnets even though it isn't currently supported.

But please let me mention that I would love it if this actually worked instead of being a limitation. I already voted it up in UserVoice 😸

@palma21 palma21 changed the title from "AKS allows creation of multi-subnet clusters but they do not work" to "AKS allows creation of NodePools in different Subnets" Dec 16, 2019
@jluk jluk added this to Backlog (Committed Items) in Azure Kubernetes Service Roadmap (Public) Dec 16, 2019
@jluk jluk self-assigned this Dec 16, 2019
@palma21
Member

palma21 commented Jan 8, 2020

Do you have any details around how networking breaks?

What plugin are you using?

@palma21 palma21 moved this from Backlog (Committed Items) to In Progress (Development) in Azure Kubernetes Service Roadmap (Public) Jan 8, 2020
@jluk jluk removed the feature-request Requested Features label Jan 14, 2020
@jluk
Contributor

jluk commented Jan 14, 2020

@rhummelmose any details you can share on Jorge's question above?

@bhicks329

I've experimented with this today and was able to create a 1.17 (SLB, Azure CNI) AKS cluster with multiple node pools in different subnets, with Istio over the top utilising Egress Gateways to control the flow of traffic. I haven't attempted to place any UDRs on the subnets yet, but will do soon. I'm interested to know whether this is going to be a supported scenario in the future, and also what @rhummelmose has experienced in terms of breaking things.

@rhummelmose
Author

Sorry guys I will have to stand up a cluster again and see if I can make it break. I will close the issue if I can't reproduce it.

@jluk
Contributor

jluk commented Jan 15, 2020

I've set this issue to track this open feature ask, so let's keep this one open, but please update the previous comments to remove the issue mentioned if it's been resolved and can't be reproduced.

@bhicks329 this is actively being worked on to be a supported scenario.

@marcostrullato

Hi @jluk I was with @rhummelmose when we experienced the issue.
If Rasmus can't help, I will: I'm sure I can reproduce the issue.

@jluk
Contributor

jluk commented Jan 16, 2020

That would be great @marcostrullato, if you have any repro steps we can investigate.

@marcostrullato

Ok I will restore the cluster with the IaC I had, and I'll check if I can make it available to you. I'll come back early next week.

@marcostrullato

Apologies, I'm late. It's on our task list and will be done soon.

@marcostrullato

Hi everyone, so we have replicated the situation once again.
You can find the IaC code here: https://github.com/marcostrullato/aksiac
Of course, I've cleaned it of any references, so be aware it might be hard to run.

The source of this IaC is the akscommander code from @rhummelmose.

Is there anything else I can do?

Cheers

@jluk
Contributor

jluk commented Jan 31, 2020

Thanks @marcostrullato for the pointer to the repro. To help us diagnose, can you explain what exactly is not working? Is it a failure of communication between agent pools, a failure during setup/provisioning, or something else?

@marcostrullato

Hi @jluk, what is happening is a failure in communications. It's possible to reach one of the agent pools, while the other is inaccessible.

Let me walk you through the IaC code.

This is the code that defines the VNet and subnets (https://github.com/marcostrullato/aksiac/blob/master/aks-commander/terraform/base/main.tf):

resource "azurerm_virtual_network" "vnet" {
  name                = "${var.prefix}-vnet-${terraform.workspace}"
  location            = "${var.region}"
  resource_group_name = data.terraform_remote_state.remote_state_core.outputs.resource_group_name
  address_space       = ["${var.aks_vnet_address_space}"]
}

resource "azurerm_subnet" "subnet1" {
  name                 = "subnet1"
  resource_group_name  = "${azurerm_virtual_network.vnet.resource_group_name}"
  address_prefix       = "${var.aks_subnet_gpu_address_prefix}"
  virtual_network_name = "${azurerm_virtual_network.vnet.name}"
  service_endpoints    = ["Microsoft.ContainerRegistry","Microsoft.AzureCosmosDB","Microsoft.Storage","Microsoft.KeyVault"]
}

resource "azurerm_subnet" "subnet2" {
  name                 = "subnet2"
  resource_group_name  = "${azurerm_virtual_network.vnet.resource_group_name}"
  address_prefix       = "${var.aks_subnet_address_prefix}"
  virtual_network_name = "${azurerm_virtual_network.vnet.name}"
  service_endpoints    = ["Microsoft.ContainerRegistry","Microsoft.AzureCosmosDB","Microsoft.Storage","Microsoft.KeyVault"]
}

This is the code for the agent pools (https://github.com/marcostrullato/aksiac/blob/master/aks-commander/terraform/aks/main.tf). Bear in mind I've masked/hidden names and references.

  agent_pool_profile {
    name                = "cpupool"
    count               = 1
    min_count           = 1
    max_count           = 4
    vm_size             = "Standard_DS3_v2"
    os_type             = "Linux"
    os_disk_size_gb     = 30
    type                = "VirtualMachineScaleSets"
    availability_zones  = [ "1", "2", "3"]
    enable_auto_scaling = true
    vnet_subnet_id      = data.terraform_remote_state.remote_state_base.outputs.aks_cluster_subnet_id
  }

  agent_pool_profile {
    name                = "gpupool"
    count               = 1
    min_count           = 1
    max_count           = 10
    vm_size             = "Standard_NC6s_v2"
    os_type             = "Linux"
    os_disk_size_gb     = 30
    type                = "VirtualMachineScaleSets"
    availability_zones  = [ "1", "2", "3"]
    enable_auto_scaling = true
    vnet_subnet_id      = ""
  }

With this configuration, networking does not completely work in our environment.
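
To narrow down where the communication fails, a quick cross-pool reachability check along these lines could help (a sketch only, assuming the standard AKS agentpool node label and the pool names used above):

# Start a test pod on each pool, then try to reach one from the other.
kubectl run probe-cpu --image=busybox --restart=Never \
  --overrides='{"apiVersion":"v1","spec":{"nodeSelector":{"agentpool":"cpupool"}}}' -- sleep 3600
kubectl run probe-gpu --image=busybox --restart=Never \
  --overrides='{"apiVersion":"v1","spec":{"nodeSelector":{"agentpool":"gpupool"}}}' -- sleep 3600
kubectl get pods -o wide                      # note each pod's IP and node
kubectl exec probe-cpu -- ping -c 3 <IP of probe-gpu>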

@rhummelmose anything else to add?

Regards

@ams0

ams0 commented Feb 13, 2020

Able to reproduce: adding a second pool in another subnet works:

$> kg node
NAME                                STATUS   ROLES   AGE   VERSION
aks-base-15062426-vmss000000        Ready    agent   9h    v1.17.0
aks-base-15062426-vmss000001        Ready    agent   9h    v1.17.0
aks-extrapool-15062426-vmss000000   Ready    agent   8h    v1.17.0
aks-extrapool-15062426-vmss000001   Ready    agent   8h    v1.17.0

However:

Azure CNI:
Pods can talk to each other across node pools, and services work with pods in both pools.

Kubenet:
Services can route traffic to pods sitting on any pool; however, pods on the non-default pool have no connectivity to the internet, and no connectivity to pods in other pools or to service IPs. Manually associating the route table (RT) in the MC_ resource group with the subnet of the extra node pool makes it work, though.

So I think this just needs to be either documented, or the az aks nodepool command should associate the RT automatically.
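
For reference, the manual step described above can be scripted with the Azure CLI roughly like this (a sketch; the resource group, VNet, and subnet names are placeholders, and the route table is whatever AKS created in the MC_ node resource group):

# Find the route table AKS created in the node resource group ...
az network route-table list -g MC_myRG_myCluster_westeurope -o table
# ... and associate it with the extra node pool's subnet.
az network vnet subnet update -g myVnetRG --vnet-name myVnet -n extrapool-subnet \
  --route-table /subscriptions/<sub>/resourceGroups/MC_myRG_myCluster_westeurope/providers/Microsoft.Network/routeTables/<aks-route-table>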

@ams0

ams0 commented Feb 13, 2020

Also, the RT association survives scaling operations and adding another node pool (but the latter requires an additional association, of course).

@jluk
Contributor

jluk commented Feb 13, 2020

Thanks for the input @ams0, this is in line with what we're working on. The likely rollout is Azure CNI coverage first, with Kubenet following a bit later.

The general improvement for BYO RT for Kubenet is being worked on right now.

@scivm

scivm commented Feb 17, 2020

@jluk will the improvements work with current GA 1.15.7 or only 1.17?

@djsly

djsly commented Feb 17, 2020

@scivm hopefully this is an AKS release that won't need to be tied to a K8s version, since it seems only Azure-specific resources need to be properly configured.

@jluk
Contributor

jluk commented Feb 18, 2020

Azure CNI subnet per pool should work for all the supported AKS k8s versions.

@jyee021

jyee021 commented Mar 9, 2021

Is there an anticipated GA date for this feature?

@ondrejhlavacek

I have found out that Kubernetes Network Policies do not work between node pools in different subnets. We're running AKS with Azure CNI and the Azure network policy engine. Is this related to this issue, or should I create a support ticket?

@paulgmiller
Member

Our guess is that this is related to kube-proxy's --cluster-cidr:
"When configured, traffic sent to a Service cluster IP from outside this range will be masqueraded"

We're still looking to this KEP to fix this and might try to get more involved: https://github.com/kubernetes/enhancements/tree/master/keps/sig-network/2450-Remove-knowledge-of-pod-cluster-CIDR-from-iptables-rules
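
For anyone who wants to see this on their own cluster, the configured range can be checked on the kube-proxy daemonset (a sketch, assuming the usual kube-proxy daemonset in kube-system, which is how AKS runs it):

# Traffic from pod IPs outside this --cluster-cidr value is treated as
# non-local and gets masqueraded.
kubectl -n kube-system get ds kube-proxy -o yaml | grep -i cluster-cidr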

@jack4it

jack4it commented May 3, 2021

Pods in a pool with a unique subnet (outside --cluster-cidr) can't reach the internet, but can reach other pods in other pools. Pods with hostNetwork: true work fine. Does anyone else see this behavior? It's consistent across multiple clusters in my tests.

@palma21 palma21 removed the feature label May 12, 2021
@noahbirrer-8451

> Dedicating a unique VNET per node pool is not on the roadmap today, but your input is valuable. For others with this need please share the scenarios!

@jluk We are in need of this feature. Our scenario is described below.

This feature would allow us to better allocate and plan IP space that is shared with our on-prem network via VNET peering. We use Azure CNI and our current cluster strategy is to use a shared cluster for all workloads (those that require on-prem connectivity and those that do not). We would like to move the distinction of whether or not a workload can reach on-prem from a peered VNET from the cluster level to the node pool level.

Today, we are limited to using one VNET per cluster since nodepools cannot be created on separate VNETs. One app that requires on-prem connectivity dictates that all workloads then unnecessarily consume peered IP space, which can be costly with CNI. Ideally, workloads that do not require on-prem connectivity could be run separately on subnets on private/unpeered nodepools, which would reduce the amount of on-prem network space that is consumed.

@plaformsre

We need to be able to scale out AKS deployments with the ability to add additional VNETs and subnets after the initial deployment. Many organisations are just not 'buying' that migrating to a larger VNET/subnet is the only option.
Could you please let us know when it will become GA?

@ArgonQQ

ArgonQQ commented Jul 5, 2021

I totally agree with @mapdegree. From the start of a project you have already defined your limitations in terms of pod count per subnet. By default it's relatively easy to scale a VNET (or add a CIDR range), but it's currently not possible to extend a subnet if any resources are attached to it.

Therefore, if you run out of IP addresses, you are required to completely remove the subnet and create a bigger one (if the VNET is sized appropriately).

From my POV this is not how Microsoft describes Azure, because there is currently no "infinite scale" available: you always have CIDR limitations, unless you have reserved all private CIDR ranges and you're not peering anywhere.

@paulgmiller
Member

To give an update here: kube-proxy currently expects one cluster CIDR, and it NATs and does other things if you're not inside it.

This is a known issue
https://github.com/kubernetes/enhancements/tree/master/keps/sig-network/2450-Remove-knowledge-of-pod-cluster-CIDR-from-iptables-rules

But there is already support to use the node's pod CIDR instead of the cluster CIDR:
kubernetes/kubernetes#88935

So our current thought is to update the node's podCIDR with a daemonset or controller. I have a prototype, but we're still a month or more from delivering it. So this is all just a proposal for now and may change.

You will all still have to be in the same VNet.
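
For context, the upstream support mentioned above surfaces as kube-proxy's --detect-local-mode flag; a sketch of the idea, not an AKS-specific setting:

# Treat traffic from the node's own podCIDR as local, instead of comparing
# against a single cluster-wide --cluster-cidr.
kube-proxy --detect-local-mode=NodeCIDR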

@ElkRom

ElkRom commented Nov 9, 2021

Hi guys, do you have any updates regarding this issue?

@asubmani

Wouldn't this introduce more problems?

CURRENT SCENARIO:

  1. Multiple user node pools (no taints/tolerations); AKS/k8s schedules the pods.
  2. It should be easy to lock down pod-to-pod communication by namespace using Calico/Cilium etc.
  3. Control all intra-AKS traffic using network policies, and ingress/egress from the cluster using 3rd-party NVAs.

FUTURE SCENARIO (if each node pool has its own subnet):

  1. K8s reschedules a pod from a node in NP-A to another node in NP-B. The pod now gets an IP from a different range, which might make k8s network policies hard to manage. Each time this happens the NVA needs updating (assuming you control which subnets egress out). If that is true, then each node pool/subnet may need some mechanism to guarantee that only certain pods get scheduled on a specific node pool, to guarantee network access (see the sketch below).
  2. Overall, k8s's ability to schedule pods across node pools will be restricted.
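
A minimal sketch of such a pinning mechanism, using the AKS agentpool node label (workload and pool names are hypothetical):

kubectl get nodes -L agentpool        # shows which pool each node belongs to
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: onprem-api                    # hypothetical workload that needs the peered subnet
spec:
  replicas: 2
  selector:
    matchLabels:
      app: onprem-api
  template:
    metadata:
      labels:
        app: onprem-api
    spec:
      nodeSelector:
        agentpool: peeredpool         # hypothetical pool name; keeps these pods on that pool's subnet
      containers:
      - name: app
        image: nginx
EOF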

@swgriffith
Member

Any update on this? It seems that the kube-proxy pod CIDR issue has been addressed in upstream Kubernetes.

kubernetes/enhancements#2450

@andriktr

andriktr commented Mar 2, 2022

Hey,
When will this feature become GA?
Thanks.

@IvanJosipovic

My company is also interested in this feature. Will Azure Network Policy be supported when this goes GA?

@danielalvesleandro

Hi guys, any news regarding GA for this feature? Will it support Azure Network Policies when GA? Thanks.

@wedaly

wedaly commented May 16, 2022

> To give an update here: kube-proxy currently expects one cluster CIDR, and it NATs and does other things if you're not inside it.
>
> This is a known issue https://github.com/kubernetes/enhancements/tree/master/keps/sig-network/2450-Remove-knowledge-of-pod-cluster-CIDR-from-iptables-rules

> Any update on this? It seems that the kube-proxy pod CIDR issue has been addressed in upstream Kubernetes.
>
> kubernetes/enhancements#2450

Hi all, wanted to give a quick update on the kube-proxy cluster CIDR issue. As of AKS release 2022-05-01, AKS now configures kube-proxy to detect local traffic using the network interface name prefix instead of cluster CIDR. This should preserve the IP addresses of traffic from secondary subnets so that network policies are applied correctly. Please note that the change applies only to Kubernetes versions 1.23.3 and later.
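
For those curious what that looks like, upstream kube-proxy exposes this via --detect-local-mode=InterfaceNamePrefix together with --pod-interface-name-prefix; a sketch of the idea (the "azv" prefix for Azure CNI pod interfaces is an assumption here, not confirmed in this thread):

# Classify traffic as local when it arrives on an interface whose name starts
# with the given prefix, instead of comparing the source IP to a cluster CIDR.
kube-proxy --detect-local-mode=InterfaceNamePrefix --pod-interface-name-prefix=azv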

@ondrejhlavacek

Does this mean that Network Policies will start working across node pools in different subnets?

@wedaly

wedaly commented May 17, 2022

> Does this mean that Network Policies will start working across node pools in different subnets?

Yes, network policies will work across different subnets.
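
In practical terms, a plain pod/namespace-selector policy should now match correctly even when the client pod runs on a node pool in another subnet; a minimal sketch (namespaces, labels, and names are hypothetical):

kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend
  namespace: backend
spec:
  podSelector:
    matchLabels:
      app: api
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: frontend              # frontend pods may sit on a node pool in a different subnet
EOF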

@ghost

ghost commented Jun 3, 2022

Thank you for the feature request. I'm closing this issue as this feature has shipped and it hasn't had activity for 7 days.

@ghost ghost closed this as completed Jun 3, 2022
@palma21 palma21 moved this from Public Preview (Shipped & Improving) to Generally Available (Done) in Azure Kubernetes Service Roadmap (Public) Jun 30, 2022
@Azure Azure locked as resolved and limited conversation to collaborators Jul 3, 2022
@wangyira wangyira moved this from Generally Available (Done) to Archive (GA older than 1 month) in Azure Kubernetes Service Roadmap (Public) Sep 20, 2022