
[Feature] AKS allows creation of NodePools in different Subnets (Azure CNI) #1338

Closed
rhummelmose opened this issue Nov 28, 2019 · 66 comments

@rhummelmose

rhummelmose commented Nov 28, 2019

As stated under limitations in the documentation, all node pools have to reside within the same subnet.

The ask is to support assignment of a unique subnet per node pool in a cluster, but all shared from the same VNET.
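
For illustration only, a minimal Azure CLI sketch of what the ask would look like, assuming each node pool could simply point at its own subnet of the shared VNet via the existing --vnet-subnet-id parameter of az aks nodepool add (resource names below are hypothetical):

# Hypothetical: a second subnet in the cluster's VNet, then a node pool placed in it.
az network vnet subnet create -g myRG --vnet-name myVnet -n pool2-subnet --address-prefixes 10.240.2.0/24
az aks nodepool add -g myRG --cluster-name myCluster -n pool2 \
  --vnet-subnet-id /subscriptions/<sub>/resourceGroups/myRG/providers/Microsoft.Network/virtualNetworks/myVnet/subnets/pool2-subnet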

@ferantivero

If I am not wrong, this issue is more about starting to validate requests, since users can specify multiple subnets even though it isn't currently supported.

But please let me mention that I would love it if this actually worked instead of being a limitation. I already voted it up in UserVoice 😸

@palma21 palma21 changed the title from "AKS allows creation of multi-subnet clusters but they do not work" to "AKS allows creation of NodePools in different Subnets" Dec 16, 2019
@jluk jluk added this to Backlog (Committed Items) in Azure Kubernetes Service Roadmap (Public) Dec 16, 2019
@jluk jluk self-assigned this Dec 16, 2019
@palma21
Member

palma21 commented Jan 8, 2020

Do you have any details around how networking breaks?

What plugin are you using?

@palma21 palma21 moved this from Backlog (Committed Items) to In Progress (Development) in Azure Kubernetes Service Roadmap (Public) Jan 8, 2020
@jluk jluk removed the feature-request Requested Features label Jan 14, 2020
@jluk
Contributor

jluk commented Jan 14, 2020

@rhummelmose any details you can share on Jorge's question above?

@bhicks329

I've experimented with this today and was able to create a 1.17 (SLB, Azure CNI) AKS cluster with multiple node pools in different subnets, with Istio over the top utilising Egress Gateways to control the flow of traffic. I haven't attempted to place any UDRs on the subnets yet, but will do soon. I'm interested to know whether this is going to be a supported scenario in the future, and also what @rhummelmose has experienced in terms of breaking things.

@rhummelmose
Author

Sorry guys I will have to stand up a cluster again and see if I can make it break. I will close the issue if I can't reproduce it.

@jluk
Contributor

jluk commented Jan 15, 2020

I've set this issue to track this open feature ask, so let's keep this one open, but please update the previous comments to remove the issue mentioned if it's been resolved and can't be reproduced.

@bhicks329 this is actively being worked on to be a supported scenario.

@marcostrullato

Hi @jluk I was with @rhummelmose when we experienced the issue.
If Rasmus can't help, I will: I'm sure I can reproduce the issue.

@jluk
Contributor

jluk commented Jan 16, 2020

That would be great @marcostrullato, if you have any repro steps we can investigate.

@marcostrullato

Ok I will restore the cluster with the IaC I had, and I'll check if I can make it available to you. I'll come back early next week.

@marcostrullato

Apologies, I'm late. It's on our task list and will be done soon.

@marcostrullato

Hi everyone, so we have replicated the situation once again.
You can find the IaC code here: https://github.com/marcostrullato/aksiac
Of course, I've cleaned it of any references, so be aware it might be hard to run.

The source of this IaC is the akscommander code from @rhummelmose.

Is there anything else I can do?

Cheers

@jluk
Contributor

jluk commented Jan 31, 2020

Thanks @marcostrullato for the pointer to the repro. To help us diagnose, can you explain what exactly is not working? Is it a failure of communication between agent pools, a failure during setup/provisioning, or something else?

@marcostrullato

Hi @jluk, what is happening is a failure in communications. It's possible to reach one of the agent pools, while the other is inaccessible.

Let me walk you through the IaC code.

This is the code that defines the VNet and subnets (https://github.com/marcostrullato/aksiac/blob/master/aks-commander/terraform/base/main.tf):

resource "azurerm_virtual_network" "vnet" {
  name                = "${var.prefix}-vnet-${terraform.workspace}"
  location            = "${var.region}"
  resource_group_name = data.terraform_remote_state.remote_state_core.outputs.resource_group_name
  address_space       = ["${var.aks_vnet_address_space}"]
}

resource "azurerm_subnet" "subnet1" {
  name                 = "subnet1"
  resource_group_name  = "${azurerm_virtual_network.vnet.resource_group_name}"
  address_prefix       = "${var.aks_subnet_gpu_address_prefix}"
  virtual_network_name = "${azurerm_virtual_network.vnet.name}"
  service_endpoints    = ["Microsoft.ContainerRegistry","Microsoft.AzureCosmosDB","Microsoft.Storage","Microsoft.KeyVault"]
}

resource "azurerm_subnet" "subnet2" {
  name                 = "subnet2"
  resource_group_name  = "${azurerm_virtual_network.vnet.resource_group_name}"
  address_prefix       = "${var.aks_subnet_address_prefix}"
  virtual_network_name = "${azurerm_virtual_network.vnet.name}"
  service_endpoints    = ["Microsoft.ContainerRegistry","Microsoft.AzureCosmosDB","Microsoft.Storage","Microsoft.KeyVault"]
}

This is the code for the agent pools (https://github.com/marcostrullato/aksiac/blob/master/aks-commander/terraform/aks/main.tf). Bear in mind I've masked/hidden names and references.

  agent_pool_profile {
    name                = "cpupool"
    count               = 1
    min_count           = 1
    max_count           = 4
    vm_size             = "Standard_DS3_v2"
    os_type             = "Linux"
    os_disk_size_gb     = 30
    type                = "VirtualMachineScaleSets"
    availability_zones  = [ "1", "2", "3"]
    enable_auto_scaling = true
    vnet_subnet_id      = data.terraform_remote_state.remote_state_base.outputs.aks_cluster_subnet_id
  }

  agent_pool_profile {
    name                = "gpupool"
    count               = 1
    min_count           = 1
    max_count           = 10
    vm_size             = "Standard_NC6s_v2"
    os_type             = "Linux"
    os_disk_size_gb     = 30
    type                = "VirtualMachineScaleSets"
    availability_zones  = [ "1", "2", "3"]
    enable_auto_scaling = true
    vnet_subnet_id      = ""
  }

With this configuration, networking does not completely work in our environment.
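
To narrow down where the communication fails, a quick cross-pool reachability check along these lines could help (a sketch only, assuming the standard AKS agentpool node label and the pool names used above):

# Start a test pod on each pool, then try to reach one from the other.
kubectl run probe-cpu --image=busybox --restart=Never \
  --overrides='{"apiVersion":"v1","spec":{"nodeSelector":{"agentpool":"cpupool"}}}' -- sleep 3600
kubectl run probe-gpu --image=busybox --restart=Never \
  --overrides='{"apiVersion":"v1","spec":{"nodeSelector":{"agentpool":"gpupool"}}}' -- sleep 3600
kubectl get pods -o wide                      # note each pod's IP and node
kubectl exec probe-cpu -- ping -c 3 <IP of probe-gpu>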

@rhummelmose anything else to add?

Regards

@ams0

ams0 commented Feb 13, 2020

Able to reproduce: adding a second pool in another subnet works:

$> kg node
NAME                                STATUS   ROLES   AGE   VERSION
aks-base-15062426-vmss000000        Ready    agent   9h    v1.17.0
aks-base-15062426-vmss000001        Ready    agent   9h    v1.17.0
aks-extrapool-15062426-vmss000000   Ready    agent   8h    v1.17.0
aks-extrapool-15062426-vmss000001   Ready    agent   8h    v1.17.0

However:

Azure CNI:
Pods can talk to each other across node pools, and services work with pods in both pools.

Kubenet:
Services can route traffic to pods sitting on any pool; however, pods on the non-default pool have no connectivity to the internet, and no connectivity to pods in other pools or to service IPs. Manually associating the route table (RT) in the MC_ resource group with the subnet of the extra node pool makes it work, though.

So I think this just needs to be either documented, or the az aks nodepool command should associate the RT automatically.
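
For reference, the manual step described above can be scripted with the Azure CLI roughly like this (a sketch; the resource group, VNet, and subnet names are placeholders, and the route table is whatever AKS created in the MC_ node resource group):

# Find the route table AKS created in the node resource group ...
az network route-table list -g MC_myRG_myCluster_westeurope -o table
# ... and associate it with the extra node pool's subnet.
az network vnet subnet update -g myVnetRG --vnet-name myVnet -n extrapool-subnet \
  --route-table /subscriptions/<sub>/resourceGroups/MC_myRG_myCluster_westeurope/providers/Microsoft.Network/routeTables/<aks-route-table>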

@ams0

ams0 commented Feb 13, 2020

Also, the RT association survives scaling operations and adding another node pool (but the latter requires an additional association, of course).

@jluk
Contributor

jluk commented Feb 13, 2020

Thanks for the input @ams0, this is in line with what we're working on. The likely rollout is Azure CNI coverage first, with Kubenet following a bit later.

The general improvement for BYO RT for Kubenet is being worked on right now.

@scivm

scivm commented Feb 17, 2020

@jluk will the improvements work with current GA 1.15.7 or only 1.17?

@djsly

djsly commented Feb 17, 2020

@scivm hopefully this is an AKS release that won't need to be tied to a K8s version, since it seems only Azure-specific resources need to be properly configured.

@jluk
Contributor

jluk commented Feb 18, 2020

Azure CNI subnet per pool should work for all the supported AKS k8s versions.

@jyee021

jyee021 commented Mar 9, 2021

Is there an anticipated GA date for this feature?

@ondrejhlavacek

I have found out that Kubernetes Network Policies do not work between node pools in different subnets. We're running AKS with Azure CNI and the Azure network policy engine. Is this related to this issue, or should I create a support ticket?

@paulgmiller
Member

Our guess is that this is related to kube-proxy's --cluster-cidr:
"When configured, traffic sent to a Service cluster IP from outside this range will be masqueraded"

We're still looking to this KEP to fix this and might try to get more involved: https://github.com/kubernetes/enhancements/tree/master/keps/sig-network/2450-Remove-knowledge-of-pod-cluster-CIDR-from-iptables-rules
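
For anyone who wants to see this on their own cluster, the configured range can be checked on the kube-proxy daemonset (a sketch, assuming the usual kube-proxy daemonset in kube-system, which is how AKS runs it):

# Traffic from pod IPs outside this --cluster-cidr value is treated as
# non-local and gets masqueraded.
kubectl -n kube-system get ds kube-proxy -o yaml | grep -i cluster-cidr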

@jack4it

jack4it commented May 3, 2021

Pods in a pool with a unique subnet (outside --cluster-cidr) can't reach the internet, but can reach other pods in other pools. Pods with hostNetwork: true work fine. Does anyone else see this behavior? It's consistent across multiple clusters in my tests.

@palma21 palma21 removed the feature label May 12, 2021
@noahbirrer-8451

> Dedicating a unique VNET per node pool is not on the roadmap today, but your input is valuable. For others with this need please share the scenarios!

@jluk We are in need of this feature. Our scenario is described below.

This feature would allow us to better allocate and plan IP space that is shared with our on-prem network via VNET peering. We use Azure CNI and our current cluster strategy is to use a shared cluster for all workloads (those that require on-prem connectivity and those that do not). We would like to move the distinction of whether or not a workload can reach on-prem from a peered VNET from the cluster level to the node pool level.

Today, we are limited to using one VNET per cluster since nodepools cannot be created on separate VNETs. One app that requires on-prem connectivity dictates that all workloads then unnecessarily consume peered IP space, which can be costly with CNI. Ideally, workloads that do not require on-prem connectivity could be run separately on subnets on private/unpeered nodepools, which would reduce the amount of on-prem network space that is consumed.

@plaformsre

We need to be able to scale out AKS deployments with the ability to add additional VNETs and subnets after the initial deployment. Many organisations are just not 'buying' that migrating to a larger VNET/subnet is the only option.
Could you please let us know when it will become GA?

@ArgonQQ

ArgonQQ commented Jul 5, 2021

I totally agree with @mapdegree. From the start of a project you have already defined your limitations in terms of pod count per subnet. By default it's relatively easy to scale a VNET (or add a CIDR range), but it's currently not possible to extend a subnet if any resources are attached to it.

Therefore, if you run out of IP addresses, you are required to completely remove the subnet and create a bigger one (if the VNET is sized appropriately).

From my POV this is not how Microsoft describes Azure, because there is currently no "infinite scale" available: you always have CIDR limitations, unless you have reserved all private CIDR ranges and you're not peering anywhere.

@paulgmiller
Member

To give an update here: kube-proxy currently expects one cluster CIDR, and it NATs and does other things if you're not inside it.

This is a known issue
https://github.com/kubernetes/enhancements/tree/master/keps/sig-network/2450-Remove-knowledge-of-pod-cluster-CIDR-from-iptables-rules

But there is already support to use the node's pod CIDR instead of the cluster CIDR:
kubernetes/kubernetes#88935

So our current thought is to update the node's podCIDR with a daemonset or controller. I have a prototype, but we're still a month or more from delivering it. So this is all just a proposal for now and may change.

You will all still have to be in the same VNet.
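
For context, the upstream support mentioned above surfaces as kube-proxy's --detect-local-mode flag; a sketch of the idea, not an AKS-specific setting:

# Treat traffic from the node's own podCIDR as local, instead of comparing
# against a single cluster-wide --cluster-cidr.
kube-proxy --detect-local-mode=NodeCIDR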

@ElkRom

ElkRom commented Nov 9, 2021

Hi guys, do you have any updates regarding this issue?

@asubmani

Wouldn't this introduce more problems?

CURRENT SCENARIO:

  1. Multiple user node pools (no taints/tolerations); AKS/k8s schedules the pods.
  2. It should be easy to lock down pod-to-pod communication by namespace using Calico/Cilium etc.
  3. Control all intra-AKS traffic using network policies, and ingress/egress from the cluster using 3rd-party NVAs.

FUTURE SCENARIO (if each node pool has its own subnet):

  1. K8s reschedules a pod from a node in NP-A to another node in NP-B. The pod now gets an IP from a different range, which might make k8s network policies hard to manage. Each time this happens the NVA needs updating (assuming you control which subnets egress out). If that is true, then each node pool/subnet may need some mechanism to guarantee that only certain pods get scheduled on a specific node pool, to guarantee network access (see the sketch below).
  2. Overall, k8s's ability to schedule pods across node pools will be restricted.
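
A minimal sketch of such a pinning mechanism, using the AKS agentpool node label (workload and pool names are hypothetical):

kubectl get nodes -L agentpool        # shows which pool each node belongs to
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: onprem-api                    # hypothetical workload that needs the peered subnet
spec:
  replicas: 2
  selector:
    matchLabels:
      app: onprem-api
  template:
    metadata:
      labels:
        app: onprem-api
    spec:
      nodeSelector:
        agentpool: peeredpool         # hypothetical pool name; keeps these pods on that pool's subnet
      containers:
      - name: app
        image: nginx
EOF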

@swgriffith
Member

Any update on this? It seems that the kube-proxy pod CIDR issue has been addressed in upstream Kubernetes.

kubernetes/enhancements#2450

@andriktr

andriktr commented Mar 2, 2022

Hey,
When will this feature become GA?
Thanks.

@IvanJosipovic

My company is also interested in this feature. Will Azure Network Policy be supported when this goes GA?

@danielalvesleandro

Hi guys, any news regarding GA for this feature? Will it support Azure Network Policies when GA? Thanks.

@wedaly

wedaly commented May 16, 2022

> To give an update here: kube-proxy currently expects one cluster CIDR, and it NATs and does other things if you're not inside it.
>
> This is a known issue https://github.com/kubernetes/enhancements/tree/master/keps/sig-network/2450-Remove-knowledge-of-pod-cluster-CIDR-from-iptables-rules

> Any update on this? It seems that the kube-proxy pod CIDR issue has been addressed in upstream Kubernetes.
>
> kubernetes/enhancements#2450

Hi all, wanted to give a quick update on the kube-proxy cluster CIDR issue. As of AKS release 2022-05-01, AKS now configures kube-proxy to detect local traffic using the network interface name prefix instead of cluster CIDR. This should preserve the IP addresses of traffic from secondary subnets so that network policies are applied correctly. Please note that the change applies only to Kubernetes versions 1.23.3 and later.
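
For those curious what that looks like, upstream kube-proxy exposes this via --detect-local-mode=InterfaceNamePrefix together with --pod-interface-name-prefix; a sketch of the idea (the "azv" prefix for Azure CNI pod interfaces is an assumption here, not confirmed in this thread):

# Classify traffic as local when it arrives on an interface whose name starts
# with the given prefix, instead of comparing the source IP to a cluster CIDR.
kube-proxy --detect-local-mode=InterfaceNamePrefix --pod-interface-name-prefix=azv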

@ondrejhlavacek

Does this mean that Network Policies will start working across node pools in different subnets?

@wedaly

wedaly commented May 17, 2022

> Does this mean that Network Policies will start working across node pools in different subnets?

Yes, network policies will work across different subnets.
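
In practical terms, a plain pod/namespace-selector policy should now match correctly even when the client pod runs on a node pool in another subnet; a minimal sketch (namespaces, labels, and names are hypothetical):

kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend
  namespace: backend
spec:
  podSelector:
    matchLabels:
      app: api
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: frontend              # frontend pods may sit on a node pool in a different subnet
EOF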

@ghost

ghost commented Jun 3, 2022

Thank you for the feature request. I'm closing this issue as this feature has shipped and it hasn't had activity for 7 days.

@ghost ghost closed this as completed Jun 3, 2022
@palma21 palma21 moved this from Public Preview (Shipped & Improving) to Generally Available (Done) in Azure Kubernetes Service Roadmap (Public) Jun 30, 2022
@Azure Azure locked as resolved and limited conversation to collaborators Jul 3, 2022
@wangyira wangyira moved this from Generally Available (Done) to Archive (GA older than 1 month) in Azure Kubernetes Service Roadmap (Public) Sep 20, 2022