
Input/output error when accessing PV #297

Closed
p8952 opened this issue Apr 11, 2018 · 8 comments

@p8952 commented Apr 11, 2018

Since around 11:45 UTC, AKS in West Europe appears to have been having issues accessing PVs from inside pods.

I've deleted the STS/Pod/PVC/PV as well as the underlying Azure Disk and re-created them, but I experience the same behaviour.

PVs appear to work at first but stop responding after a short while.

StatefulSet:

[peter@localhost ~]$ kubectl describe statefulset couchdb
Name:               couchdb
Namespace:          default
CreationTimestamp:  Wed, 11 Apr 2018 12:15:03 +0100
Selector:           app=couchdb
Labels:             app=couchdb
Annotations:        kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"apps/v1beta2","kind":"StatefulSet","metadata":{"annotations":{},"labels":{"app":"couchdb"},"name":"couchdb","namespace":"default"},"spec...
Replicas:           1 desired | 1 total
Pods Status:        1 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:  app=couchdb
  Containers:
   couchdb:
    Image:        registry.hub.docker.com/library/couchdb:2.1.1
    Port:         5984/TCP
    Environment:  <none>
    Mounts:
      /opt/couchdb/data from couchdb (rw)
  Volumes:
   couchdb:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  couchdb
    ReadOnly:   false
Volume Claims:
  Name:          couchdb
  StorageClass:  
  Labels:        app=couchdb
  Annotations:   <none>
  Capacity:      5Gi
  Access Modes:  [ReadWriteOnce]
Events:
  Type    Reason            Age                From         Message
  ----    ------            ----               ----         -------
  Normal  SuccessfulCreate  48m                statefulset  create Claim couchdb-couchdb-0 Pod couchdb-0 in StatefulSet couchdb success
  Normal  SuccessfulCreate  33m (x2 over 48m)  statefulset  create Pod couchdb-0 in StatefulSet couchdb successful

PersistentVolumeClaim:

[peter@localhost ~]$ kubectl describe pvc couchdb
Name:          couchdb-couchdb-0
Namespace:     default
StorageClass:  default
Status:        Bound
Volume:        pvc-9536cef0-3d79-11e8-97bd-0a58ac1f14ba
Labels:        app=couchdb
Annotations:   pv.kubernetes.io/bind-completed=yes
               pv.kubernetes.io/bound-by-controller=yes
               volume.beta.kubernetes.io/storage-provisioner=kubernetes.io/azure-disk
Finalizers:    []
Capacity:      5Gi
Access Modes:  RWO
Events:
  Type    Reason                 Age   From                         Message
  ----    ------                 ----  ----                         -------
  Normal  ProvisioningSucceeded  48m   persistentvolume-controller  Successfully provisioned volume pvc-9536cef0-3d79-11e8-97bd-0a58ac1f14ba using kubernetes.io/azure-disk

PersistentVolume:

Name:            pvc-9536cef0-3d79-11e8-97bd-0a58ac1f14ba
Labels:          <none>
Annotations:     pv.kubernetes.io/bound-by-controller=yes
                 pv.kubernetes.io/provisioned-by=kubernetes.io/azure-disk
                 volumehelper.VolumeDynamicallyCreatedByKey=azure-disk-dynamic-provisioner
StorageClass:    default
Status:          Bound
Claim:           default/couchdb-couchdb-0
Reclaim Policy:  Delete
Access Modes:    RWO
Capacity:        5Gi
Message:         
Source:
    Type:         AzureDisk (an Azure Data Disk mount on the host and bind mount to the pod)
    DiskName:     kubernetes-dynamic-pvc-9536cef0-3d79-11e8-97bd-0a58ac1f14ba
    DiskURI:      /subscriptions/e6f9d9e5-8fe9-4698-967e-2365e43755b1/resourceGroups/MC_AKS_AKS_westeurope/providers/Microsoft.Compute/disks/kubernetes-dynamic-pvc-9536cef0-3d79-11e8-97bd-0a58ac1f14ba
    Kind:         Managed
    FSType:       ext4
    CachingMode:  ReadWrite
    ReadOnly:     false
Events:           <none>

Pod:

Name:           couchdb-0
Namespace:      default
Node:           aks-agentpool-36019550-0/10.240.0.4
Start Time:     Wed, 11 Apr 2018 12:30:11 +0100
Labels:         app=couchdb
                controller-revision-hash=couchdb-79b8b79b6
Annotations:    kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"StatefulSet","namespace":"default","name":"couchdb","uid":"952f4a28-3d79-11e8-97bd-0a58ac1f14ba","apiVersi...
Status:         Running
IP:             10.244.0.195
Controlled By:  StatefulSet/couchdb
Containers:
  couchdb:
    Container ID:   docker://82ee7c513e668254e9e5f18696b29737b2a6b49aac7fd84b255d209ea61b220f
    Image:          registry.hub.docker.com/library/couchdb:2.1.1
    Image ID:       docker-pullable://registry.hub.docker.com/library/couchdb@sha256:91d0e1fcd8ee367af230d2c5e2c2d98ac9082c54ef979c4c48952b4313c8d9b0
    Port:           5984/TCP
    State:          Running
      Started:      Wed, 11 Apr 2018 12:30:32 +0100
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /opt/couchdb/data from couchdb (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-cd94g (ro)
Conditions:
  Type           Status
  Initialized    True 
  Ready          True 
  PodScheduled   True 
Volumes:
  couchdb:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  couchdb-couchdb-0
    ReadOnly:   false
  default-token-cd94g:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-cd94g
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.alpha.kubernetes.io/notReady:NoExecute for 300s
                 node.alpha.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type    Reason                 Age   From                               Message
  ----    ------                 ----  ----                               -------
  Normal  Scheduled              37m   default-scheduler                  Successfully assigned couchdb-0 to aks-agentpool-36019550-0
  Normal  SuccessfulMountVolume  37m   kubelet, aks-agentpool-36019550-0  MountVolume.SetUp succeeded for volume "default-token-cd94g"
  Normal  SuccessfulMountVolume  37m   kubelet, aks-agentpool-36019550-0  MountVolume.SetUp succeeded for volume "pvc-9536cef0-3d79-11e8-97bd-0a58ac1f14ba"
  Normal  Pulled                 37m   kubelet, aks-agentpool-36019550-0  Container image "registry.hub.docker.com/library/couchdb:2.1.1" already present on machine
  Normal  Created                37m   kubelet, aks-agentpool-36019550-0  Created container
  Normal  Started                37m   kubelet, aks-agentpool-36019550-0  Started container

Input/Output Error:

[peter@localhost ~]$ kubectl exec -it couchdb-0 bash

root@couchdb-0:/opt/couchdb# mount                                                                                                                                                                              
overlay on / type overlay (rw,relatime,lowerdir=l/CMQH3DNO4TKTEOUEDCP5JR7G4W:l/3G3OTMQYEHSGAZCDIJM5CSKYXF:l/WSJ3OYMJYUVLQO57LMQEHKVURM:l/CADCXIR4G2HVKOKW2EG6PEUOMM:l/ARS7W2WGNR3ZTBJUORRQNHEZYI:l/KCBDKQGOWRPN75L46TPITHIKSV:l/EZMXEVF3SVMLLLELSAWT4PMDGN:l/ETIIRCCLGDHFEPODXIOM46J66S:l/EPM4H3ZF6U5ITLU5DFNLDO7JT3:l/AWTHZ34EPC43OYUHSAD66LPWIO:l/EDWWYAIVV2PYGRWAYBGKBNVIVM,upperdir=bff439e8c86654950fb81d2bee73795d0bfbcabb6ca8465adf174183805bca39/diff,workdir=bff439e8c86654950fb81d2bee73795d0bfbcabb6ca8465adf174183805bca39/work)
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
tmpfs on /dev type tmpfs (rw,nosuid,mode=755)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=666)
sysfs on /sys type sysfs (ro,nosuid,nodev,noexec,relatime)
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,relatime,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup (ro,nosuid,nodev,noexec,relatime,xattr,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd)
cgroup on /sys/fs/cgroup/cpuset type cgroup (ro,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (ro,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /sys/fs/cgroup/perf_event type cgroup (ro,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (ro,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/freezer type cgroup (ro,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/rdma type cgroup (ro,nosuid,nodev,noexec,relatime,rdma)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (ro,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/memory type cgroup (ro,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/blkio type cgroup (ro,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/pids type cgroup (ro,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/devices type cgroup (ro,nosuid,nodev,noexec,relatime,devices)
mqueue on /dev/mqueue type mqueue (rw,nosuid,nodev,noexec,relatime)
/dev/sda1 on /dev/termination-log type ext4 (rw,relatime,discard,data=ordered)
/dev/sda1 on /etc/resolv.conf type ext4 (rw,relatime,discard,data=ordered)
/dev/sda1 on /etc/hostname type ext4 (rw,relatime,discard,data=ordered)
/dev/sda1 on /etc/hosts type ext4 (rw,relatime,discard,data=ordered)
shm on /dev/shm type tmpfs (rw,nosuid,nodev,noexec,relatime,size=65536k)
/dev/sde on /opt/couchdb/data type ext4 (rw,relatime,data=ordered)
tmpfs on /run/secrets/kubernetes.io/serviceaccount type tmpfs (ro,relatime)
proc on /proc/bus type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/fs type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/irq type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/sys type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/sysrq-trigger type proc (ro,nosuid,nodev,noexec,relatime)
tmpfs on /proc/kcore type tmpfs (rw,nosuid,mode=755)
tmpfs on /proc/timer_list type tmpfs (rw,nosuid,mode=755)
tmpfs on /proc/sched_debug type tmpfs (rw,nosuid,mode=755)

root@couchdb-0:/opt/couchdb# ls /opt/couchdb/data
ls: reading directory /opt/couchdb/data: Input/output error

@andyzhangx (Collaborator) commented Apr 11, 2018

I am quite sure this issue is related to cachingmode being set to ReadWrite; please change it to None.
Details below:

Disk unavailable after attaching/detaching a data disk on a node

Issue details:

Since k8s v1.7, the default host cache setting changed from None to ReadWrite. This change can cause device names to change after attaching multiple disks to a node, eventually making the disk unavailable from the pod. When accessing the data disk inside a pod, you will get the following error:

[root@admin-0 /]# ls /datadisk
ls: reading directory .: Input/output error

In my testing on an Ubuntu 16.04 D2_V2 VM, attaching the 6th data disk caused a device name change on the agent node, e.g. the following lun0 disk should be sdc rather than sdk.

azureuser@k8s-agentpool2-40588258-0:~$ tree /dev/disk/azure
...
└── scsi1
    ├── lun0 -> ../../../sdk
    ├── lun1 -> ../../../sdj
    ├── lun2 -> ../../../sde
    ├── lun3 -> ../../../sdf
    ├── lun4 -> ../../../sdg
    ├── lun5 -> ../../../sdh
    └── lun6 -> ../../../sdi
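The /dev/disk/azure symlinks shown above keep pointing at the right device even when its /dev/sdX name changes, which is why tooling should resolve a LUN through them. A self-contained sketch of that resolution, using a temporary sandbox in place of real device nodes (all paths are illustrative):

```shell
# Mimic the /dev/disk/azure/scsi1 layout in a temp dir: a "device" file named
# sdk and a lun0 symlink pointing at it, as in the tree output above.
tmp=$(mktemp -d)
mkdir -p "$tmp/disk/azure/scsi1"
touch "$tmp/sdk"                               # stand-in for the block device
ln -s ../../../sdk "$tmp/disk/azure/scsi1/lun0"

# readlink -f follows the relative symlink to its current target, so consumers
# that go through the lunN path are insulated from /dev/sdX renames.
readlink -f "$tmp/disk/azure/scsi1/lun0"
```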

Workaround:

  • add cachingmode: None in the azure disk storage class (the default is ReadWrite), e.g.:
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: hdd
provisioner: kubernetes.io/azure-disk
parameters:
  skuname: Standard_LRS
  kind: Managed
  cachingmode: None

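A workload then has to request that class explicitly. A minimal sketch of a volumeClaimTemplates entry pointing at the hdd class above (values illustrative, using the same apps/v1beta2 API as the StatefulSet in this issue):

```yaml
# Illustrative volume claim template referencing the "hdd" class defined above;
# storageClassName selects that class instead of the cluster default.
volumeClaimTemplates:
- metadata:
    name: couchdb
    labels:
      app: couchdb
  spec:
    storageClassName: hdd
    accessModes: ["ReadWriteOnce"]
    resources:
      requests:
        storage: 5Gi
```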
Fix

k8s version | fixed version
------------|--------------------------------------------------------
v1.6        | no such issue, as cachingmode is already None by default
v1.7        | 1.7.14
v1.8        | 1.8.11
v1.9        | 1.9.4
v1.10       | 1.10.0
@p8952 (Author) commented Apr 11, 2018

Thanks @andyzhangx, I'll give that a go.

Just one thing to note: I am not currently defining a StorageClass, but rather relying on the default storage class. If there are issues with cachingmode being set to ReadWrite, should the default storage class be set to None?


[peter@localhost ~]$ kubectl get storageclasses
NAME                PROVISIONER                AGE
default (default)   kubernetes.io/azure-disk   3m
managed-premium     kubernetes.io/azure-disk   2m

[peter@localhost ~]$ kubectl describe storageclass default
Name:            default
IsDefaultClass:  Yes
Annotations:     kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"storage.k8s.io/v1beta1","kind":"StorageClass","metadata":{"annotations":{"storageclass.beta.kubernetes.io/is-default-class":"true"},"labels":{"kubernetes.io/cluster-service":"true"},"name":"default","namespace":""},"parameters":{"kind":"Managed","storageaccounttype":"Standard_LRS"},"provisioner":"kubernetes.io/azure-disk"}
,storageclass.beta.kubernetes.io/is-default-class=true
Provisioner:    kubernetes.io/azure-disk
Parameters:     kind=Managed,storageaccounttype=Standard_LRS
ReclaimPolicy:  Delete
Events:         <none>
@andyzhangx (Collaborator) commented Apr 11, 2018

Currently in AKS the default storage class cannot be changed; I would suggest creating a new storage class with cachingmode: None.
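A sketch of such a class, mirroring the parameters of the default class shown earlier (kind: Managed, Standard_LRS) but with host caching disabled; the name managed-standard-nocache is made up for illustration:

```yaml
# Hypothetical replacement for the unmodifiable AKS "default" class:
# same Managed/Standard_LRS parameters, but with host caching turned off.
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: managed-standard-nocache
provisioner: kubernetes.io/azure-disk
parameters:
  kind: Managed
  storageaccounttype: Standard_LRS
  cachingmode: None
```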

@p8952 (Author) commented Jun 21, 2018

Now that AKS is GA, do we still need to manually create a StorageClass with cachingmode: None? Or does the default StorageClass now work correctly?

@andyzhangx (Collaborator) commented Jun 22, 2018

@p8952 AKS does not set the cachingmode value, so it depends on your k8s version; I have made the default value of cachingmode None in the following fixed versions.

Fix

k8s version | fixed version
------------|--------------------------------------------------------
v1.6        | no such issue, as cachingmode is already None by default
v1.7        | 1.7.14
v1.8        | 1.8.11
v1.9        | 1.9.4
v1.10       | 1.10.0
@andyzhangx (Collaborator) commented Jun 22, 2018

FYI @jackfrancis, there is a fix in acs-engine for this issue. I will leave it to you to decide whether to set the default value of cachingmode for the azure disk storage class at the AKS level. @p8952, thanks for pointing this out.

@jackfrancis (Member) commented Jun 22, 2018

@ultimateboy @amanohar @weinong FYI

Updating this in AKS will be a control plane concern. You can use the acs-engine implementation as a reference.

@andyzhangx (Collaborator) commented Dec 4, 2018

Closing the issue now since it's already resolved; let me know if you have any questions.
