This repository has been archived by the owner on Oct 24, 2023. It is now read-only.

About known issue of custom vnet for Windows in custom-vnet.md #371

Closed
zhiweiv opened this issue Jan 24, 2019 · 19 comments

@zhiweiv
Contributor

zhiweiv commented Jan 24, 2019

Is this a request for help?:
No

Is this an ISSUE or FEATURE REQUEST?
ISSUE

What version of aks-engine?:
0.29.1

Kubernetes version:
1.13.2

What happened:
In https://github.com/Azure/aks-engine/blob/master/docs/tutorials/custom-vnet.md, it says that custom VNET for Kubernetes Windows clusters has a known issue and links to Azure/acs-engine#1767. That link is a long conversation with a dozen related issues, and I could not find what the known issue actually is.

With the workaround in #210, several of us have run Windows nodes with a custom VNET successfully. I am wondering whether the known issue still exists. If it does, what exactly is it? If it no longer exists, could we kindly update the description in custom-vnet.md?
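(For reference, a custom VNET is specified in the API model by pointing the master and agent pool profiles at an existing subnet, roughly like the sketch below. The subnet IDs, CIDRs and IP are placeholders, and firstConsecutiveStaticIP must be an address inside the master subnet.)

"masterProfile": {
    "count": 3,
    "dnsPrefix": "",
    "vmSize": "Standard_B2s",
    "vnetSubnetId": "/subscriptions/SUB_ID/resourceGroups/RG_NAME/providers/Microsoft.Network/virtualNetworks/VNET_NAME/subnets/MASTER_SUBNET_NAME",
    "firstConsecutiveStaticIP": "10.20.127.239",
    "vnetCidr": "10.20.0.0/16"
},
"agentPoolProfiles": [{
    "name": "windowspool1",
    "count": 2,
    "vmSize": "Standard_B2s",
    "osType": "Windows",
    "availabilityProfile": "AvailabilitySet",
    "vnetSubnetId": "/subscriptions/SUB_ID/resourceGroups/RG_NAME/providers/Microsoft.Network/virtualNetworks/VNET_NAME/subnets/AGENT_SUBNET_NAME"
}]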

@CecileRobertMichon
Contributor

@PatrickLang what's the status of custom vnet with Windows?

@PatrickLang
Contributor

@CecileRobertMichon I haven't been able to look at it yet, ETA is after Kubernetes 1.14 is done in March

@zhiweiv
Contributor Author

zhiweiv commented Jan 25, 2019

Custom VNET is the blocking feature for us to deploy Windows nodes to production. It looks good so far in our test environment; I will update here if something goes wrong.

@JoaquimFreitas

@zhiweiv,
Custom VNET in aks-engine 0.29.1 with k8s 1.13.2 is an even messier deployment, a complete no-go for production.
Windows nodes don't even show up in the cluster after the deployment; they are all missing in action due to a SHA checksum failure while installing Docker on the node VMs.

Worse, that problem has now propagated to default network deployments: only a random number of Windows nodes appear in the cluster, and the rest are missing.

@zhiweiv
Contributor Author

zhiweiv commented Jan 26, 2019

We are still evaluating it in a testing environment. You can try the template I pasted in #341; it should work.

@JoaquimFreitas

JoaquimFreitas commented Jan 26, 2019

@zhiweiv

Thanks for the tip.
Unfortunately, I already tried a template similar to yours in #341 and... it doesn't work.

That is, I'm unable to execute any kubectl commands to check the cluster, such as kubectl get nodes; it just takes a long time and then fails with the error:

Unable to connect to the server: dial tcp 10.20.255.249:443: connectex: no response from the server

That's the IP of the k8s-master-lb.
Am I missing something because this is a private cluster?
(OK, it seems a jumpbox is needed.)

I even added the following to azuredeploy.parameters.json before deploying:

"masterSubnet": {
    "value": "10.20.0.0/16"
 },

But that is a parameter that is not even needed; it is not used/referenced anywhere in azuredeploy.json.
masterVnetSubnetID is the one used.

@zhiweiv
Contributor Author

zhiweiv commented Jan 27, 2019

If you just want to test the VNET, you can remove the private cluster setting from the template. In my generated azuredeploy.json, masterSubnet is referenced in the customData of the Windows osProfile, where it is passed as a PowerShell parameter. If you can't find it, something must have gone wrong in the template generation phase.

"osProfile": {
    "adminPassword": "[parameters('windowsAdminPassword')]",
    "adminUsername": "[parameters('windowsAdminUsername')]",
    "computername": "[concat(variables('windowspool1VMNamePrefix'), copyIndex(variables('windowspool1Offset')))]",
    "customData": "It is too long to share"
},

Here are my steps to deploy the cluster:
1. aks-engine generate --api-model kubernetes.json
2. Add masterSubnet to azuredeploy.parameters.json under the _output folder (see the sketch below).
3. az group deployment create --name "" --resource-group "<RESOURCE_GROUP_NAME>" --template-file "./_output/xxx/azuredeploy.json" --parameters "./_output/xxx/azuredeploy.parameters.json"
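(The entry added in step 2 looks like the following; the CIDR here is only an example and must match the address range of the existing master subnet.)

    "masterSubnet": {
        "value": "10.20.0.0/16"
    },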

@JoaquimFreitas

@zhiweiv

Thanks for the tip.
Yes, I found masterSubnet inside the customData field; I never thought it was used inside that long, long data, so my first searches didn't find it.

I added it to azuredeploy.parameters.json (I had made a typo on my first tries, which is why it didn't do anything) and ended up with these VNET/subnet-related parameters:

    "masterVnetSubnetID": {
      "value": "/subscriptions/SUB_ID/resourceGroups/RG_NAME/providers/Microsoft.Network/virtualNetworks/VNET_NAME/subnets/SUBNET_NAME"
    },
    "masterSubnet": {
      "value": "10.20.0.0/17"
    },
    "vnetCidr": {
      "value": "10.20.0.0/16"
    },
    "firstConsecutiveStaticIP": {
      "value": "10.20.127.239"
    },
    "kubeClusterCidr": {
      "value": "10.20.128.0/17"
    },

The deployment ran without errors.
I can already see Windows nodes in the cluster... but NOT ALL of them.

That means I'm now able to deploy a custom VNET, BUT I ended up with the aks-engine ERROR that I reported two days ago in #385.

This is a nasty problem plaguing aks-engine 0.28.1/0.29.1 hybrid deployments; I'm not even able to reproduce hybrid deployments I've done in the past. Try making a new test deployment identical to the one you have working and check whether you can do so again.

BTW, what Windows SKU are you using in your deployments?

Thanks again.

@zhiweiv
Contributor Author

zhiweiv commented Jan 28, 2019

I didn't specify the SKU in the template, so it is the default value Datacenter-Core-1809-with-Containers-smalldisk. You are using Datacenter-Core-1803-with-Containers-smalldisk in #385, which will be deprecated soon.

From the error you pasted in #385, aks-engine failed to install Docker 18.09.0. Windows 1809 ships with Docker 18.09.0 by default, so you can give that a try.
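(If you do want to pin a specific image, the API model's windowsProfile accepts image fields; the snippet below is a minimal sketch assuming the vlabs field name windowsSku — check the aks-engine docs for your version for the exact field names and defaults.)

"windowsProfile": {
    "adminUsername": "",
    "adminPassword": "",
    "windowsSku": "Datacenter-Core-1809-with-Containers-smalldisk"
}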

@JoaquimFreitas

@zhiweiv

Good tip on the Windows SKU; I suspected something related to that. I'm going to make more attempts with that default SKU.

Thanks

@JoaquimFreitas

@zhiweiv

Bad news...
Even using the default Datacenter-Core-1809-with-Containers-smalldisk, the "Docker SHA256 install failure" reported in #385 STILL occurs on some of the VMs.

I've checked the VMs not attached to the cluster, where the error occurred, and Docker is really not available.

@zhiweiv
Contributor Author

zhiweiv commented Jan 29, 2019

I couldn't reproduce your problem with a template similar to yours in #385, and it seems your problem also occurs in a regular cluster without a custom VNET. This issue is mainly about custom VNET, so let's follow up on your problem in #385. You can collect all the logs under c:\k and c:\azuredata on the missing Windows nodes and paste them to #385; you can also check whether it is the same issue as #351.

4645k8s010                 Ready     agent     52m       v1.13.2
4645k8s011                 Ready     agent     52m       v1.13.2
4645k8s020                 Ready     agent     53m       v1.13.2
4645k8s021                 Ready     agent     52m       v1.13.2
k8s-linuxpool-46451102-0   Ready     agent     56m       v1.13.2
k8s-linuxpool-46451102-1   Ready     agent     56m       v1.13.2
k8s-master-46451102-0      Ready     master    56m       v1.13.2
k8s-master-46451102-1      Ready     master    56m       v1.13.2
k8s-master-46451102-2      Ready     master    56m       v1.13.2
{
	"apiVersion": "vlabs",
	"properties": {
		"orchestratorProfile": {
			"orchestratorType": "Kubernetes",
			"orchestratorRelease": "1.13",
			"kubernetesConfig": {
				"etcdDiskSizeGB": "60"
			}
		},
		"masterProfile": {
			"count": 3,
			"dnsPrefix": "",
			"vmSize": "Standard_B2s",
			"OSDiskSizeGB": 60
		},
		"agentPoolProfiles": [{
			"name": "linuxpool",
			"count": 2,
			"vmSize": "Standard_B2s",
			"OSDiskSizeGB": 60,
			"availabilityProfile": "AvailabilitySet"
		},
		{
			"name": "windowspoo1l",
			"count": 2,
			"vmSize": "Standard_B2s",
			"OSDiskSizeGB": 60,
			"osType": "Windows",
			"availabilityProfile": "AvailabilitySet"
		},
		{
			"name": "windowspool2",
			"count": 2,
			"vmSize": "Standard_B2s",
			"OSDiskSizeGB": 60,
			"osType": "Windows",
			"availabilityProfile": "AvailabilitySet"
		}],
		"windowsProfile": {
			"adminUsername": "",
			"adminPassword": ""
		},
		"linuxProfile": {
			"adminUsername": "",
			"ssh": {
				"publicKeys": [{
					"keyData": ""
				}]
			}
		},
		"servicePrincipalProfile": {
			"clientId": "",
			"secret": ""
		}
	}
}

@JoaquimFreitas

JoaquimFreitas commented Jan 29, 2019

@zhiweiv ,

Many thanks for your feedback and effort.

Regarding #385, it seems that it IS the enabling of acceleratedNetworkWindows that's causing the problem reported there, go figure! I'll continue that discussion over there.

Regarding the custom VNET, following your magnificent tip about the missing masterSubnet parameter, I've already devised some additional code in my Terraform project that adds the JSON definition for the missing parameter, with the appropriate subnet CIDR value, into the azuredeploy.parameters.json file used in the deployment. With the deployment coded and automated, I'll resume the custom VNET tests.

@zhiweiv
Contributor Author

zhiweiv commented Jan 29, 2019

Since your first acceleratedNetworkWindows is false, I didn't notice that the following two acceleratedNetworkWindows are true 🤣; I just removed them for simplicity while testing.

I am interested in your idea of Terraform deployment automation; currently I am using PowerShell to assist. Would you mind sharing some of the code for this part when you are done, if possible?

Maybe this is not the right place; you can send it to my email: zhiweiv@outlook.com.

@JoaquimFreitas

And with the masterSubnet correction/addition to azuredeploy.parameters.json, the deployment of a hybrid cluster with a custom VNET seems to work AS expected:

PS > kubectl get nodes
NAME                       STATUS   ROLES    AGE     VERSION
2389k8s010                 Ready    agent    4m44s   v1.13.2
2389k8s011                 Ready    agent    4m53s   v1.13.2
2389k8s020                 Ready    agent    4m39s   v1.13.2
2389k8s021                 Ready    agent    4m51s   v1.13.2
k8s-linuxpool-23890794-0   Ready    agent    6m41s   v1.13.2
k8s-linuxpool-23890794-1   Ready    agent    6m42s   v1.13.2
k8s-master-23890794-0      Ready    master   6m39s   v1.13.2
k8s-master-23890794-1      Ready    master   6m41s   v1.13.2
k8s-master-23890794-2      Ready    master   6m39s   v1.13.2
PS > kubectl get nodes -o wide
NAME                       STATUS   ROLES    AGE     VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE                    KERNEL-VERSION      CONTAINER-RUNTIME
2389k8s010                 Ready    agent    4m51s   v1.13.2   10.20.0.128     <none>        Windows Server Datacenter   10.0.17763.253      docker://18.9.0
2389k8s011                 Ready    agent    5m      v1.13.2   10.20.0.159     <none>        Windows Server Datacenter   10.0.17763.253      docker://18.9.0
2389k8s020                 Ready    agent    4m46s   v1.13.2   10.20.0.97      <none>        Windows Server Datacenter   10.0.17763.253      docker://18.9.0
2389k8s021                 Ready    agent    4m58s   v1.13.2   10.20.0.4       <none>        Windows Server Datacenter   10.0.17763.253      docker://18.9.0
k8s-linuxpool-23890794-0   Ready    agent    6m48s   v1.13.2   10.20.0.66      <none>        Ubuntu 16.04.5 LTS          4.15.0-1036-azure   docker://3.0.1
k8s-linuxpool-23890794-1   Ready    agent    6m49s   v1.13.2   10.20.0.35      <none>        Ubuntu 16.04.5 LTS          4.15.0-1036-azure   docker://3.0.1
k8s-master-23890794-0      Ready    master   6m46s   v1.13.2   10.20.127.239   <none>        Ubuntu 16.04.5 LTS          4.15.0-1036-azure   docker://3.0.1
k8s-master-23890794-1      Ready    master   6m48s   v1.13.2   10.20.127.240   <none>        Ubuntu 16.04.5 LTS          4.15.0-1036-azure   docker://3.0.1
k8s-master-23890794-2      Ready    master   6m46s   v1.13.2   10.20.127.241   <none>        Ubuntu 16.04.5 LTS          4.15.0-1036-azure   docker://3.0.1

Finally!

@stale

stale bot commented Mar 30, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Mar 30, 2019
@zhiweiv
Contributor Author

zhiweiv commented Apr 2, 2019

The issue for Windows with a custom VNET so far is described in #431. I will open another issue if I find something else; closing this.

@stale stale bot removed the stale label Apr 2, 2019
@zhiweiv zhiweiv closed this as completed Apr 2, 2019
@vyta
Contributor

vyta commented Apr 11, 2019

Still seems like an issue. @zhiweiv why did you close?

@zhiweiv
Contributor Author

zhiweiv commented Apr 11, 2019

All the related issues and PRs were closed by the bot automatically, so it seems there is no plan to fix this in the near future. The main issues we found are described in the linked issues; this one is more of a question about something we may not know, as described in the md. Since it seems there won't be an answer here either, I closed this.
