
Disk error when pods are mounting a certain amount of volumes on a node #201

Closed
gonzochic opened this issue Feb 21, 2018 · 13 comments

@gonzochic

commented Feb 21, 2018

We are currently running a 5-node cluster in AKS with 10 vCPUs and 35 GB RAM in total. We noticed the following behavior: we have a couple of StatefulSets, each claiming an Azure Disk with some storage. At runtime, a pod goes into a CrashLoop because its volume suddenly becomes inaccessible (I/O error). This crashes the application running in the pod; the health probe detects the failure, restarts the pod, and it crashes again. We managed to keep a container running and confirmed that the volume was no longer accessible (although it was still mounted in the OS).

The usual workaround was to delete the pod manually; after rescheduling, it suddenly worked again.

In the past this happened only a few times, until yesterday. This time, as soon as we deleted a failing pod and it was rescheduled into the Running state, another pod started crashing. We always had exactly 4 pods failing with I/O errors, which made us wonder whether it has something to do with the total number of mounted Azure Disks.

We have the following assumption:
If a new pod is scheduled on a node that already has 4 mounted Azure Disks, one of the running pods (which claims one of those volumes) "loses" access to its volume and therefore crashes. Additionally, we found the following link, which restricts the number of Azure Disks that can be mounted on a VM (Link)

What we would expect:
If our assumption is correct, I would expect the following behaviour:

  • Pods with a PVC bound to an Azure Disk PV should not be scheduled onto a physical node that already has the maximum number of volumes mounted
  • If this is not possible: the newly scheduled pod should fail to schedule on that node and report an error (instead of making an already running pod crash)
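The first expectation above is essentially a scheduler predicate. A minimal Python sketch of that logic (illustrative only: `fits_volume_limit` is a hypothetical helper, not actual scheduler code, and the limit of 8 data disks for a Standard_D2_v2 VM is an assumption taken from the Azure VM sizing docs):

```python
# Hypothetical sketch of the expected scheduling check: a pod whose PVCs would
# attach more Azure Disks than the node can still hold should not be scheduled there.
MAX_DATA_DISKS = 8  # assumed limit for a Standard_D2_v2 VM

def fits_volume_limit(attached_disks: int, new_disks: int,
                      limit: int = MAX_DATA_DISKS) -> bool:
    """True if the pod's new disks still fit under the node's disk limit."""
    return attached_disks + new_disks <= limit

print(fits_volume_limit(4, 1))  # True: room left, safe to schedule
print(fits_volume_limit(8, 1))  # False: pod should stay Pending with an error
```

Under this logic a pod that does not fit would simply stay Pending instead of corrupting a volume already attached to another pod.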

Have you observed something similar in the past?

Here is some information about our system (private information redacted):

```
Name:               aks-agentpool-(redacted)
Roles:              agent
Labels:             agentpool=agentpool
                    beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=Standard_D2_v2
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=westeurope
                    failure-domain.beta.kubernetes.io/zone=0
                    kubernetes.azure.com/cluster=(redacted)
                    kubernetes.io/hostname=aks-agentpool-(redacted)
                    kubernetes.io/role=agent
                    storageprofile=managed
                    storagetier=Standard_LRS
Annotations:        node.alpha.kubernetes.io/ttl=0
                    volumes.kubernetes.io/controller-managed-attach-detach=true
Taints:             <none>
CreationTimestamp:  Tue, 20 Feb 2018 17:07:16 +0100
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Tue, 20 Feb 2018 17:07:42 +0100   Tue, 20 Feb 2018 17:07:42 +0100   RouteCreated                 RouteController created a route
  OutOfDisk            False   Wed, 21 Feb 2018 09:49:47 +0100   Tue, 20 Feb 2018 17:07:16 +0100   KubeletHasSufficientDisk     kubelet has sufficient disk space available
  MemoryPressure       False   Wed, 21 Feb 2018 09:49:47 +0100   Tue, 20 Feb 2018 17:07:16 +0100   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Wed, 21 Feb 2018 09:49:47 +0100   Tue, 20 Feb 2018 17:07:16 +0100   KubeletHasNoDiskPressure     kubelet has no disk pressure
  Ready                True    Wed, 21 Feb 2018 09:49:47 +0100   Tue, 20 Feb 2018 17:07:36 +0100   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:  (redacted)
  Hostname:    (redacted)
Capacity:
 alpha.kubernetes.io/nvidia-gpu:  0
 cpu:                             2
 memory:                          7114304Ki
 pods:                            110
Allocatable:
 alpha.kubernetes.io/nvidia-gpu:  0
 cpu:                             2
 memory:                          7011904Ki
 pods:                            110
System Info:
 Machine ID:                          (redacted)
 System UUID:                         (redacted)
 Boot ID:                             (redacted)
 Kernel Version:                      4.13.0-1007-azure
 OS Image:                            Debian GNU/Linux 8 (jessie)
 Operating System:                    linux
 Architecture:                        amd64
 Container Runtime Version:           docker://1.12.6
 Kubelet Version:                     v1.8.7
 Kube-Proxy Version:                  v1.8.7
PodCIDR:                              10.244.4.0/24
ExternalID:                           (redacted)
```
@gonzochic


commented Feb 22, 2018

Steps to reproduce
Provision an AKS cluster in the Azure portal with a single node (we used a D3_v2 as reference, because it should be able to handle 16 volumes, which could be the Kubernetes default). Then apply the following YAML on your cluster, which provisions a StatefulSet for you. We started with a replica count of 8; with 8 replicas, 2 pods were already failing with I/O errors. Here is the YAML:

```yaml
apiVersion: apps/v1beta2
kind: StatefulSet
metadata:
  name: myvm
spec:
  selector:
    matchLabels:
      app: myvm # has to match .spec.template.metadata.labels
  serviceName: "myvm"
  replicas: 8
  template:
    metadata:
      labels:
        app: myvm
    spec:
      terminationGracePeriodSeconds: 10
      containers:
      - name: ubuntu
        image: ubuntu:xenial
        command: [ "/bin/bash", "-c", "--" ]
        args: [ "while true; do sleep 30; done;" ]
        livenessProbe:
          exec:
            command:
            - ls
            - /mnt/data
          initialDelaySeconds: 5
          periodSeconds: 5
        resources:
          requests:
            cpu: 100m
            memory: 250Mi
        volumeMounts:
        - name: mydata
          mountPath: /mnt/data
  volumeClaimTemplates:
  - metadata:
      name: mydata
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: default
      resources:
        requests:
          storage: 1Gi
```
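For context, the liveness probe in the manifest above amounts to running `ls` on the mount path and treating a nonzero exit code as failure, which is what drives the CrashLoop once the disk becomes inaccessible. A rough Python equivalent (illustrative only, not how the kubelet actually implements exec probes):

```python
import subprocess

def probe(mount_path: str) -> bool:
    """Mimic the exec liveness probe: healthy iff `ls <path>` exits with 0."""
    result = subprocess.run(["ls", mount_path], capture_output=True)
    return result.returncode == 0

print(probe("/tmp"))           # True on a readable directory
print(probe("/no/such/path"))  # False, as on a volume hit by I/O errors
```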
@LeDominik


commented Feb 22, 2018

Testing this, we can also see in dmesg that when adding a new replica, the next device fails (here sde) as the incoming disk attaches as sdl:

```
[50023.841688] EXT4-fs warning (device sdc): htree_dirblock_to_tree:962: inode #2: lblock 0: comm updatedb.mlocat: error -5 reading directory block
[50023.986177] EXT4-fs warning (device sdd): htree_dirblock_to_tree:962: inode #2: lblock 0: comm updatedb.mlocat: error -5 reading directory block
[50024.371780] EXT4-fs warning (device sdc): htree_dirblock_to_tree:962: inode #2: lblock 0: comm updatedb.mlocat: error -5 reading directory block
[50024.373185] EXT4-fs warning (device sdd): htree_dirblock_to_tree:962: inode #2: lblock 0: comm updatedb.mlocat: error -5 reading directory block
[78545.257391] EXT4-fs warning (device sde): ext4_end_bio:313: I/O error -5 writing to inode 12 (offset 0 size 0 starting block 32834)
[78545.257394] Buffer I/O error on device sde, logical block 32833
[78545.261633] Aborting journal on device sde-8.
[78545.265041] JBD2: Error -5 detected when updating journal superblock for sde-8.
[78545.270385] sd 3:0:0:2: [sde] Synchronizing SCSI cache
[78547.314632] scsi 3:0:0:7: Direct-Access     Msft     Virtual Disk     1.0  PQ: 0 ANSI: 4
[78547.341351] sd 3:0:0:7: Attached scsi generic sg5 type 0
[78547.341768] sd 3:0:0:7: [sdl] 2097152 512-byte logical blocks: (1.07 GB/1.00 GiB)
[78547.341809] sd 3:0:0:7: [sdl] Write Protect is off
[78547.341811] sd 3:0:0:7: [sdl] Mode Sense: 0f 00 10 00
[78547.341977] sd 3:0:0:7: [sdl] Write cache: enabled, read cache: enabled, supports DPO and FUA
[78547.359641] sd 3:0:0:7: [sdl] Attached SCSI disk
```

We went straight to 7 replicas first, then up to 8, and now the layout is:

  • OS Disk
  • Disk for myvm-0 on /dev/sdc (broken)
  • Disk for myvm-1 on /dev/sdd (broken)
  • Disk for myvm-2 on /dev/sde (broken after going for 8 replicas see above)
  • Disk for myvm-3 on /dev/sdf (still ok)
  • and so on up to myvm-7
@andyzhangx


commented Feb 23, 2018

@gonzochic @LeDominik azure disk only supports ReadWriteOnce, which means it can be used on only one node; you could not assign 8 pod replicas to one disk, since those pods could be on different nodes.
I would suggest using azure file if you have more than one replica.

@gonzochic


commented Feb 23, 2018

@andyzhangx I know that the volume is ReadWriteOnce. If you look closely you will see that we are creating a separate PersistentVolume and PersistentVolumeClaim for each replica.
In the Azure Portal you can also see that 8 disks are attached to the VM, but three of them are not accessible, as you can see from @LeDominik's log.

(screenshot from 2018-02-23 14:11 showing the disks attached to the VM in the Azure Portal)

Here is the quote from the docs:

Note, however, that while scaling up creates new PersistentVolumeClaims automatically, scaling down does not automatically delete these PVCs. This gives you the choice to keep those initialized PVCs around to make scaling back up quicker, or to extract data before deleting them.

@andyzhangx


commented Feb 23, 2018

@gonzochic you are right. I have tried 3 times in my testing env; the root cause is that the device name (/dev/sd*) changes after attaching the 6th data disk (it is always the 6th) on a D2_V2 VM, which allows 8 data disks at maximum. That is to say, 5 data disks are safe. Below is the evidence:

```
azureuser@k8s-agentpool-87187153-0:/tmp$ tree /dev/disk/azure/
/dev/disk/azure/
├── resource -> ../../sdb
├── resource-part1 -> ../../sdb1
├── root -> ../../sda
├── root-part1 -> ../../sda1
└── scsi1
    ├── lun0 -> ../../../sdc
    ├── lun1 -> ../../../sdd
    ├── lun2 -> ../../../sde
    ├── lun3 -> ../../../sdf
    └── lun4 -> ../../../sdg

1 directory, 9 files
azureuser@k8s-agentpool-87187153-0:/tmp$ tree /dev/disk/azure/
/dev/disk/azure/
├── resource -> ../../sdb
├── resource-part1 -> ../../sdb1
├── root -> ../../sda
├── root-part1 -> ../../sda1
└── scsi1
    ├── lun0 -> ../../../sdi
    ├── lun1 -> ../../../sdd
    ├── lun2 -> ../../../sde
    ├── lun3 -> ../../../sdf
    ├── lun4 -> ../../../sdg
    └── lun5 -> ../../../sdh
```

I will contact the Azure Linux VM team to check whether there is any solution for this device name change issue after attaching new data disks.
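The two `tree` snapshots above can be diffed mechanically. A small Python sketch (with the LUN-to-device mappings transcribed by hand from the output, so purely illustrative) confirms that only lun0 was remapped when the 6th disk arrived:

```python
# LUN -> device mappings transcribed from the two tree snapshots above.
before = {"lun0": "sdc", "lun1": "sdd", "lun2": "sde", "lun3": "sdf", "lun4": "sdg"}
after = {"lun0": "sdi", "lun1": "sdd", "lun2": "sde", "lun3": "sdf",
         "lun4": "sdg", "lun5": "sdh"}

# Devices whose name changed after the 6th data disk (lun5) was attached.
changed = {lun: (before[lun], after[lun])
           for lun in before if before[lun] != after[lun]}
print(changed)  # {'lun0': ('sdc', 'sdi')}
```

A pod whose filesystem was mounted via the old /dev/sdc name would now be reading a different disk, which matches the I/O errors observed above.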

@andyzhangx


commented Feb 24, 2018

Update:
The current solution is to add `cachingmode: None` in the azure disk storage class; that solves this device name change issue. I have proposed a PR to change the default cachingmode to None.

Here is an example:
https://github.com/andyzhangx/Demo/blob/master/pv/storageclass-azuredisk.yaml

@gonzochic


commented Feb 24, 2018

Thanks @andyzhangx. Alternatively, I saw the following PR, which makes it configurable how many disks a node can mount. I think it is planned for 1.10 and would make it possible to "catch" this error at the Kubernetes scheduler level. Additionally, this seems necessary in any case: what currently happens if I try to mount more volumes than the Azure VM size allows?

@andyzhangx


commented Feb 24, 2018

@gonzochic if a node has reached the maximum disk number, a new pod with an azure disk mount would fail. In your case, there is a "disk error" after mounting another pod with an azure disk; this issue is different. It is not due to the maximum number of disks a node can mount, it is due to the device name change caused by `cachingmode: ReadWrite`.
Making the disk number configurable per cloud provider is a new feature, and I don't think it will make 1.10 since next week is code freeze. The feature could be in v1.11; I will let you know when it's available.

So to fix your issue, you could use my proposed solution; I have verified that it works well:
add `cachingmode: None` in the azure disk storage class, which solves this device name change issue. I have proposed a PR to change the default cachingmode to None.

Here is an example:
https://github.com/andyzhangx/Demo/blob/master/pv/storageclass-azuredisk.yaml

@andyzhangx


commented Feb 24, 2018

@gonzochic Just realized you are using AKS, so my proposed solution is

  1. Create a new azure disk storage class:

```yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: hdd
provisioner: kubernetes.io/azure-disk
parameters:
  skuname: Standard_LRS
  kind: Managed
  cachingmode: None
```

  2. Change `storageClassName: default` to `storageClassName: hdd` in the StatefulSet config.

I have already verified this in my test env: all 8 replicas are running, with no more crashes.

@LeDominik


commented Feb 24, 2018

Hey @andyzhangx, thanks for the tip! We're going to test that out and post feedback!
(It will take till Monday; I was "encouraged" not to take my laptop with me for the weekend 😄 )

k8s-github-robot pushed a commit to kubernetes/kubernetes that referenced this issue Feb 25, 2018

Kubernetes Submit Queue
Merge pull request #60346 from andyzhangx/fix-devname-change
Automatic merge from submit-queue (batch tested with PRs 60346, 60135, 60289, 59643, 52640). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

fix device name change issue for azure disk

**What this PR does / why we need it**:
fix the device name change issue for azure disk: the default host cache setting changed from None to ReadWrite in v1.7, while the default host cache setting in the azure portal is `None`

**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #60344, #57444
also fixes following issues:
Azure/acs-engine#1918
Azure/AKS#201

**Special notes for your reviewer**:
From v1.7, the default host cache setting changed from None to ReadWrite. This can lead to device names changing after multiple disks are attached to an azure VM, finally leaving a disk inaccessible from its pod.
For example:
a statefulset with 8 replicas (each with an azure disk) on one node will always fail; according to my observation, adding the 6th data disk always makes the dev name change, and some pods cannot access their data disk after that.

I have verified this fix on v1.8.4
Without this PR on one node (dev name changes):
```
azureuser@k8s-agentpool2-40588258-0:~$ tree /dev/disk/azure
...
└── scsi1
    ├── lun0 -> ../../../sdk
    ├── lun1 -> ../../../sdj
    ├── lun2 -> ../../../sde
    ├── lun3 -> ../../../sdf
    ├── lun4 -> ../../../sdg
    ├── lun5 -> ../../../sdh
    └── lun6 -> ../../../sdi
```

With this PR on one node (no dev name change):
```
azureuser@k8s-agentpool2-40588258-1:~$ tree /dev/disk/azure
...
└── scsi1
    ├── lun0 -> ../../../sdc
    ├── lun1 -> ../../../sdd
    ├── lun2 -> ../../../sde
    ├── lun3 -> ../../../sdf
    ├── lun5 -> ../../../sdh
    └── lun6 -> ../../../sdi
```

In the run below, `myvm-0` and `myvm-1` are crashing due to the dev name change; after replacing the controller manager, the `myvm2-x` pods work well.

```
Every 2.0s: kubectl get po                                                                                                                                                   Sat Feb 24 04:16:26 2018

NAME      READY     STATUS             RESTARTS   AGE
myvm-0    0/1       CrashLoopBackOff   13         41m
myvm-1    0/1       CrashLoopBackOff   11         38m
myvm-2    1/1       Running            0          35m
myvm-3    1/1       Running            0          33m
myvm-4    1/1       Running            0          31m
myvm-5    1/1       Running            0          29m
myvm-6    1/1       Running            0          26m

myvm2-0   1/1       Running            0          17m
myvm2-1   1/1       Running            0          14m
myvm2-2   1/1       Running            0          12m
myvm2-3   1/1       Running            0          10m
myvm2-4   1/1       Running            0          8m
myvm2-5   1/1       Running            0          5m
myvm2-6   1/1       Running            0          3m
```

**Release note**:

```
fix device name change issue for azure disk
```
/assign @karataliu 
/sig azure
@feiskyer  could you mark it as v1.10 milestone?
@brendandburns @khenidak @rootfs @jdumars FYI

Since it's a critical bug, I will cherry-pick this fix to v1.7-v1.9; note that v1.6 does not have this issue since its default cachingmode is `None`.
@LeDominik


commented Feb 26, 2018

@andyzhangx: Thanks, it works like a charm. On the 1-node AKS test cluster we were able to go up to the node's advertised maximum of 16 supported disks with a stateful set; the 17th pod then had to wait due to `No nodes are available that match all of the predicates: MaxVolumeCount (1).`, so just what we expected.

All disks are working fine, thanks 👍

@andyzhangx


commented Feb 26, 2018

My pleasure. Would you close this issue? Thanks.

@gonzochic


commented Mar 1, 2018

Until now we have had no more issues! Thanks for that :)
