
ENOLCK (no locks available) errors because rpc-statd is not running or GlusterFS is installed #175

Closed
ceojinhak opened this issue Sep 16, 2018 · 16 comments

Comments

@ceojinhak

I tried to install Trident v18.07 on a K8s cluster with FAS (ontap-nas).
During the installation, I got the following error logs in debug mode.

[xadmop01@devrepo1 trident-installer]$ ./tridentctl install -n trident
INFO Trident pod started. namespace=trident pod=trident-797f547579-d572m
INFO Waiting for Trident REST interface.
ERRO Trident REST interface was not available after 180.00 seconds.
FATA Install failed; exit status 1; Error: Get http://127.0.0.1:8000/trident/v1/version: dial tcp 127.0.0.1:8000: connect: connection refused
command terminated with exit code 1; use 'tridentctl logs' to learn more. Resolve the issue; use 'tridentctl uninstall' to clean up; and try again.

There is no high latency between the k8s nodes and the FAS storage, and several re-installations did not help.
What could be the cause of this issue?

@kangarlou
Contributor

Please follow https://netapp-trident.readthedocs.io/en/master/kubernetes/troubleshooting.html. I would pay close attention to the etcd logs.

@ceojinhak
Author

ceojinhak commented Sep 16, 2018

I've attached more detailed logs below. Can you give me some advice on how to resolve the issue?

[xadmop01@devrepo1 trident-installer]$ ./tridentctl logs -l all -n trident
trident log:
time="2018-09-10T05:47:10Z" level=info msg="Running Trident storage orchestrator." binary=/usr/local/bin/trident_orchestrator build_time="Mon Jul 30 21:46:22 UTC 2 018" version=18.07.0

etcd log:
2018-09-10 05:46:18.434421 I | etcdmain: etcd Version: 3.2.19
2018-09-10 05:46:18.434502 I | etcdmain: Git SHA: 8a9b3d538
2018-09-10 05:46:18.434512 I | etcdmain: Go Version: go1.8.7
2018-09-10 05:46:18.434517 I | etcdmain: Go OS/Arch: linux/amd64
2018-09-10 05:46:18.434521 I | etcdmain: setting maximum number of CPUs to 32, total number of available CPUs is 32
2018-09-10 05:46:18.436238 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2018-09-10 05:46:18.436389 I | embed: listening for peers on http://127.0.0.1:8002
2018-09-10 05:46:18.436439 I | embed: listening for client requests on 127.0.0.1:8001
2018-09-10 05:46:19.440214 W | etcdserver: another etcd process is using "/var/etcd/data/member/snap/db" and holds the file lock.
2018-09-10 05:46:19.440239 W | etcdserver: waiting for it to exit before starting...
2018-09-10 05:46:24.458489 C | mvcc/backend: cannot open database at /var/etcd/data/member/snap/db (no locks available)
panic: cannot open database at /var/etcd/data/member/snap/db (no locks available)

goroutine 75 [running]:
github.com/coreos/etcd/cmd/vendor/github.com/coreos/pkg/capnslog.(*PackageLogger).Panicf(0xc420169f20, 0xf8c075, 0x1f, 0xc42005ee68, 0x2, 0x2)
/tmp/etcd-release-3.2.19/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/pkg/capnslog/pkg_logger.go:75 +0x15c
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/backend.newBackend(0xc420276aa0, 0x1d, 0x5f5e100, 0x2710, 0x280000000, 0x2)
/tmp/etcd-release-3.2.19/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/backend/backend.go:131 +0x1a7
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/backend.New(0xc420276aa0, 0x1d, 0x5f5e100, 0x2710, 0x280000000, 0x0, 0x0)
/tmp/etcd-release-3.2.19/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/backend/backend.go:113 +0x48
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.newBackend(0xc42027c000, 0x8c35d9, 0x4325a8)
/tmp/etcd-release-3.2.19/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/backend.go:36 +0x1b1
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.openBackend.func1(0xc4201cae40, 0xc42027c000)
/tmp/etcd-release-3.2.19/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/backend.go:56 +0x2b
created by github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.openBackend
/tmp/etcd-release-3.2.19/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/backend.go:57 +0xa4

[xadmop01@devrepo1 trident-installer]$ ./tridentctl install -n trident -d
DEBU Initialized logging. logLevel=debug
DEBU Running outside a pod, creating CLI-based client.
DEBU Initialized Kubernetes CLI client. cli=kubectl flavor=k8s namespace=openstack version=1.10.4
DEBU Validated installation environment. installationNamespace=trident kubernetesVersion=
DEBU Parsed requested volume size. quantity=2Gi
DEBU Dumping RBAC fields. ucpBearerToken= ucpHost= useKubernetesRBAC=true
DEBU Namespace exists. namespace=trident
DEBU PVC does not exist. pvc=trident
DEBU PV does not exist. pv=trident
INFO Starting storage driver. backend=/home/xadmop01/trident-installer/setup/backend.json
DEBU config: {"backendName":"ontapnas_xxx.xxx.xxx.xxx","dataLIF":"xxx.xxx.xxx.xxx","managementLIF":"xxx.xxx.xxx.xxx","password":"xxxxx","storageDriverName":"ontap-nas","svm":"portal","username":"admin","version":1}
DEBU Storage prefix is absent, will use default prefix.
DEBU Parsed commonConfig: {Version:1 StorageDriverName:ontap-nas BackendName:ontapnas_192.168.10.200 Debug:false DebugTraceFlags:map[] DisableDelete:false StoragePrefixRaw:[] StoragePrefix: SerialNumbers:[] DriverContext:}
DEBU Initializing storage driver. driver=ontap-nas
DEBU Addresses found from ManagementLIF lookup. addresses="[xxx.xxx.xxx.xxx]" hostname=xxx.xxx.xxx.xxx
DEBU Using specified SVM. SVM=portal
DEBU ONTAP API version. Ontapi=1.140
DEBU Read serial numbers. Count=2 SerialNumbers="451436000030,451436000031"
INFO Controller serial numbers. serialNumbers="451436000030,451436000031"
DEBU Configuration defaults Encryption=false ExportPolicy=default FileSystemType=ext4 NfsMountOptions="-o nfsvers=3" SecurityStyle=unix Size=1G SnapshotDir=false SnapshotPolicy=none SpaceReserve=none SplitOnClone=false StoragePrefix=trident_ UnixPermissions=---rwxrwxrwx
DEBU Data LIFs dataLIFs="[xxx.xxx.xxx.xxx]"
DEBU Found NAS LIFs. dataLIFs="[xxx.xxx.xxx.xxx]"
DEBU Addresses found from hostname lookup. addresses="[xxx.xxx.xxx.xxx]" hostname=xxx.xxx.xxx.xxx
DEBU Found matching Data LIF. hostNameAddress=xxx.xxx.xxx.xxx
DEBU Configured EMS heartbeat. intervalHours=24
DEBU Read storage pools assigned to SVM. pools="[n1_aggr1]" svm=portal
DEBU Read aggregate attributes. aggregate=n1_aggr1 mediaType=ssd
DEBU Storage driver initialized. driver=ontap-nas
INFO Storage driver loaded. driver=ontap-nas
INFO Starting Trident installation. namespace=trident
DEBU Deleted Kubernetes object by YAML.
DEBU Deleted cluster role binding.
DEBU Deleted Kubernetes object by YAML.
DEBU Deleted cluster role.
DEBU Deleted Kubernetes object by YAML.
DEBU Deleted service account.
DEBU Created Kubernetes object by YAML.
INFO Created service account.
DEBU Created Kubernetes object by YAML.
INFO Created cluster role.
DEBU Created Kubernetes object by YAML.
INFO Created cluster role binding.
DEBU Created Kubernetes object by YAML.
INFO Created PVC.
DEBU Attempting volume create. size=2147483648 storagePool=n1_aggr1 volConfig.StorageClass=
DEBU Creating Flexvol. aggregate=n1_aggr1 encryption=false exportPolicy=default name=trident_trident securityStyle=unix size=2147483648 snapshotDir=false snapshotPolicy=none snapshotReserve=0 spaceReserve=none unixPermissions=---rwxrwxrwx
DEBU SVM root volume has no load-sharing mirrors. rootVolume=portal_root
DEBU Created Kubernetes object by YAML.
INFO Created PV. pv=trident
INFO Waiting for PVC to be bound. pvc=trident
DEBU PVC not yet bound, waiting. increment=437.894321ms pvc=trident
DEBU PVC not yet bound, waiting. increment=744.425557ms pvc=trident
DEBU PVC not yet bound, waiting. increment=693.54601ms pvc=trident
DEBU PVC not yet bound, waiting. increment=1.517944036s pvc=trident
DEBU PVC not yet bound, waiting. increment=2.286463684s pvc=trident
DEBU PVC not yet bound, waiting. increment=2.656384976s pvc=trident
DEBU Logged EMS message. driver=ontap-nas
DEBU Created Kubernetes object by YAML.
INFO Created Trident deployment.
INFO Waiting for Trident pod to start.
DEBU Trident pod not yet running, waiting. increment=556.890976ms
DEBU Trident pod not yet running, waiting. increment=826.778894ms
DEBU Trident pod not yet running, waiting. increment=1.540550114s
DEBU Trident pod not yet running, waiting. increment=1.856453902s
DEBU Trident pod not yet running, waiting. increment=2.484890356s
DEBU Trident pod not yet running, waiting. increment=2.818583702s
DEBU Trident pod not yet running, waiting. increment=3.26384762s
DEBU Trident pod not yet running, waiting. increment=12.061166391s
INFO Trident pod started. namespace=trident pod=trident-678586db49-fsdpm
INFO Waiting for Trident REST interface.
DEBU Invoking tunneled command: kubectl exec trident-678586db49-fsdpm -n trident -c trident-main -- tridentctl -s 127.0.0.1:8000 version -o json
DEBU REST interface not yet up, waiting. increment=734.140902ms
DEBU Invoking tunneled command: kubectl exec trident-678586db49-fsdpm -n trident -c trident-main -- tridentctl -s 127.0.0.1:8000 version -o json
DEBU REST interface not yet up, waiting. increment=442.993972ms
DEBU Invoking tunneled command: kubectl exec trident-678586db49-fsdpm -n trident -c trident-main -- tridentctl -s 127.0.0.1:8000 version -o json
DEBU REST interface not yet up, waiting. increment=929.671236ms
DEBU Invoking tunneled command: kubectl exec trident-678586db49-fsdpm -n trident -c trident-main -- tridentctl -s 127.0.0.1:8000 version -o json
DEBU REST interface not yet up, waiting. increment=2.345275362s
DEBU Invoking tunneled command: kubectl exec trident-678586db49-fsdpm -n trident -c trident-main -- tridentctl -s 127.0.0.1:8000 version -o json
DEBU REST interface not yet up, waiting. increment=2.031313228s
DEBU Invoking tunneled command: kubectl exec trident-678586db49-fsdpm -n trident -c trident-main -- tridentctl -s 127.0.0.1:8000 version -o json
DEBU REST interface not yet up, waiting. increment=4.000982326s
DEBU Invoking tunneled command: kubectl exec trident-678586db49-fsdpm -n trident -c trident-main -- tridentctl -s 127.0.0.1:8000 version -o json
DEBU REST interface not yet up, waiting. increment=8.222916418s
DEBU Invoking tunneled command: kubectl exec trident-678586db49-fsdpm -n trident -c trident-main -- tridentctl -s 127.0.0.1:8000 version -o json
DEBU REST interface not yet up, waiting. increment=4.755251834s
DEBU Invoking tunneled command: kubectl exec trident-678586db49-fsdpm -n trident -c trident-main -- tridentctl -s 127.0.0.1:8000 version -o json
DEBU REST interface not yet up, waiting. increment=19.045514652s
DEBU Invoking tunneled command: kubectl exec trident-678586db49-fsdpm -n trident -c trident-main -- tridentctl -s 127.0.0.1:8000 version -o json
DEBU REST interface not yet up, waiting. increment=11.815640826s
DEBU Invoking tunneled command: kubectl exec trident-678586db49-fsdpm -n trident -c trident-main -- tridentctl -s 127.0.0.1:8000 version -o json
DEBU REST interface not yet up, waiting. increment=28.062204043s
DEBU Invoking tunneled command: kubectl exec trident-678586db49-fsdpm -n trident -c trident-main -- tridentctl -s 127.0.0.1:8000 version -o json
DEBU REST interface not yet up, waiting. increment=1m3.739392425s
DEBU Invoking tunneled command: kubectl exec trident-678586db49-fsdpm -n trident -c trident-main -- tridentctl -s 127.0.0.1:8000 version -o json
DEBU REST interface not yet up, waiting. increment=1m16.513543682s
DEBU Invoking tunneled command: kubectl exec trident-678586db49-fsdpm -n trident -c trident-main -- tridentctl -s 127.0.0.1:8000 version -o json
ERRO Trident REST interface was not available after 180.00 seconds.
FATA Install failed; exit status 1; Error: Get http://127.0.0.1:8000/trident/v1/version: dial tcp 127.0.0.1:8000: connect: connection refused
command terminated with exit code 1; use 'tridentctl logs' to learn more. Resolve the issue; use 'tridentctl uninstall' to clean up; and try again.

@kangarlou
Contributor

We're aware of this issue, and if I'm not mistaken, I've already helped you or your account team determine that etcd is the problem when this question was asked on our internal mailing list. My understanding is that there is a support case open, so please be patient while that process runs its course. In the meantime, we'll update this issue once we have a solution. Thanks!

@kangarlou
Contributor

kangarlou commented Sep 18, 2018

We have validated that the problem (cannot open database at /var/etcd/data/member/snap/db (no locks available)) isn't related to Trident or Kubernetes, as the etcd panic also occurs when etcd is run as a standalone binary against a manually mounted NFS share.

The problem appears to be that obtaining a lock on a file over NFS fails with ENOLCK (no locks available).

Any application that issues the flock system call may experience the same problem:
https://linux.die.net/man/2/flock
https://golang.org/pkg/syscall/#Flock

Our NFS experts are investigating this problem.
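For anyone who wants to reproduce this outside of etcd, here is a minimal Go sketch (not Trident or etcd code; the file path is a placeholder) that takes the same kind of non-blocking flock(2) advisory lock that etcd's bolt storage backend relies on. On an affected NFSv3 mount it should fail with "no locks available" (ENOLCK):

package main

import (
	"fmt"
	"os"
	"syscall"
)

func main() {
	// Placeholder path: point this at a file on the NFS mount under test.
	path := "/mnt/nfs/myfile"

	f, err := os.OpenFile(path, os.O_RDWR|os.O_CREATE, 0600)
	if err != nil {
		fmt.Fprintln(os.Stderr, "open:", err)
		os.Exit(1)
	}
	defer f.Close()

	// Non-blocking exclusive advisory lock, similar to what bolt/etcd requests.
	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX|syscall.LOCK_NB); err != nil {
		// Expect ENOLCK ("no locks available") if NFS locking is broken.
		fmt.Fprintln(os.Stderr, "flock failed:", err)
		os.Exit(1)
	}
	fmt.Println("flock succeeded; NFS locking appears to be working")
}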

@kangarlou
Contributor

kangarlou commented Sep 18, 2018

@ceojinhak To confirm NFS locking is the issue, you can follow these steps:

  1. Manually create an NFS share.
  2. Manually mount the NFS share to the host.
  3. Create a file on this share.
  4. Use flock (man 1 flock) to try to obtain a lock on this file (e.g., flock -n --verbose /mnt/nfs/myfile -c cat).

If flock reports an error, that confirms NFS locking is the issue.

@ceojinhak
Author

ceojinhak commented Sep 18, 2018 via email

@nlowe

nlowe commented Sep 18, 2018

I just ran into this while trying to deploy Trident to a different k8s cluster. If you have multiple installations of Trident on the same NetApp instance, you will probably want to change igroupName and storagePrefix in addition to backendName. Changing these allowed Trident to be deployed successfully.

@kangarlou
Contributor

Thanks @nlowe. As the first comment indicates, this issue is with our ontap-nas driver on a fresh install, so the etcd volume doesn't already exist. However, you're right that the majority of etcd problems occur because users inadvertently share the same etcd volume between different instances of Trident. You can also run tridentctl install --volume-name to specify a different name for the etcd volume, but it's good practice to use different storage prefixes for different instances of Trident (currently you should only deploy one Trident instance per k8s cluster).
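To illustrate (all values below are placeholders, not taken from this issue), a second cluster's setup/backend.json for the ontap-nas driver might set its own backendName and storagePrefix so that nothing collides with the first installation:

{
    "version": 1,
    "storageDriverName": "ontap-nas",
    "backendName": "ontapnas_cluster2",
    "managementLIF": "xxx.xxx.xxx.xxx",
    "dataLIF": "xxx.xxx.xxx.xxx",
    "svm": "portal",
    "username": "admin",
    "password": "xxxxx",
    "storagePrefix": "trident_cluster2_"
}

and, if you also want a dedicated etcd volume, something along the lines of ./tridentctl install -n trident --volume-name trident-cluster2 (check your tridentctl version's help output for the exact flag usage).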

@ceojinhak
Author

ceojinhak commented Sep 19, 2018 via email

@kangarlou
Contributor

That's exactly what it means. Something in NFS isn't configured properly in your customer's environment. For NFS locking to work, rpc-statd must be running on your client hosts:

sudo systemctl enable rpc-statd  # Enable statd on boot
sudo systemctl start rpc-statd   # Start statd for the current session

@ceojinhak
Author

The customer verified that rpc-statd is active on all nodes. Incidentally, he succeeded in installing Trident after enabling NFSv4 on the FAS.
He also installed Trident on another k8s cluster, separate from the problematic one, and confirmed that it installed fine with no changes on the storage side. So, as you said, this issue is most likely a problem with the NFS client (rpc-statd). The installation succeeded in the problematic environment once NFSv4 was enabled because rpc-statd only applies to NFSv2/v3 locking; NFSv4 handles locking differently.
Finally, the customer and I decided to close this case so as not to waste any more time.

Many thanks for your help so far.

@kangarlou
Contributor

Glad to hear you figured it out! I'm adding some notes for future reference for anyone who may encounter this problem.

Source: https://www.netapp.com/us/media/tr-4067.pdf

File locking mechanisms were created to prevent a file from being accessed for write operations by more than one user or application at a time. NFS leverages file locking either using the NLM process in NFSv3 or by leasing and locking, which is built in to the NFSv4.x protocols. Not all applications leverage file locking, however; for example, the application “vi” does not lock files. Instead, it uses a file swap method to save changes to a file.
When an NFS client requests a lock, the client interacts with the clustered Data ONTAP system to save the lock state. Where the lock state is stored depends on the NFS version being used. In NFSv3, the lock state is stored at the data layer. In NFSv4.x, the lock states are stored in the NAS protocol stack.
Use file locking using the NLM protocol when possible with NFSv3.
Use NFSv4.x (4.1 if possible) when appropriate to take advantage of stateful connections, integrated locking, and session functionality.

Source: http://people.redhat.com/steved/Netapp_NFS_BestPractice.pdf
Section 5.3: Network Lock Manager

Source: https://www.centos.org/docs/5/html/Deployment_Guide-en-US/s1-nfs-client-config-options.html

Common NFS Mount Options: nolock — Disables file locking. This setting is occasionally required when connecting to older NFS servers.

Source: https://www.centos.org/docs/5/html/Deployment_Guide-en-US/ch-nfs.html

NFSv4 has no interaction with portmapper, rpc.mountd, rpc.lockd, and rpc.statd, since protocol support has been incorporated into the v4 protocol. NFSv4 listens on the well known TCP port (2049), which eliminates the need for the portmapper interaction. The mounting and locking protocols have been incorporated into the V4 protocol, which eliminates the need for interaction with rpc.mountd and rpc.lockd.
rpc.lockd — allows NFS clients to lock files on the server. If rpc.lockd is not started, file locking will fail. rpc.lockd implements the Network Lock Manager (NLM) protocol. This process corresponds to the nfslock service. This is not used with NFSv4.
rpc.statd — This process implements the Network Status Monitor (NSM) RPC protocol which notifies NFS clients when an NFS server is restarted without being gracefully brought down. This process is started automatically by the nfslock service and does not require user configuration. This is not used with NFSv4.

Source: https://wiki.wireshark.org/Network_Lock_Manager

The purpose of the NLM protocol is to provide something similar to POSIX advisory file locking semantics to NFS version 2 and 3.
The lock manager is typically implemented completely inside user space in a lock manager daemon; that daemon will receive messages from the NFS client when a lock is requested, and will send NLM requests to the NLM server on the NFS server machine, and will receive NLM requests from NFS clients of the machine on which it's running and will make local locking calls on behalf of those clients. You need to run this lock manager daemon on BOTH the client and the server for lock management to work.
Lock manager peers rely on the NSM protocol to notify each other of service restarts/reboots so that locks can be resynchronized after a reboot.

@innergy changed the title from "Trident v18.07 installation failure" to "ENOLCK (no locks available) errors because rpc-statd is not running or GlusterFS is installed" on Nov 21, 2018
@innergy
Contributor

innergy commented Nov 21, 2018

We should write this up in the troubleshooting section.

@roycec

roycec commented Dec 17, 2018

> I just ran into this while trying to deploy Trident to a different k8s cluster. If you have multiple installations of Trident on the same NetApp instance, you will probably want to change igroupName and storagePrefix in addition to backendName. Changing these allowed Trident to be deployed successfully.

Hello,

I'm running into a similar problem right now. I have tried installing and uninstalling the latest Trident driver several times. We previously used the same NetApp SVM for tests with OpenShift and now want to reuse it for Kubernetes/Rancher. Is it possible to use the same SVM for more than one cluster? And how or where can the attributes you mentioned be changed?

@roycec

roycec commented Dec 17, 2018

OK, I stumbled across a discussion of this issue on a NetApp Slack channel. The first installation used the standard volume name for etcd storage on the SVM; when a second installation tries to create or use that volume, an error is raised. In this case a customized installation is necessary.

@korenaren
Contributor

Trident 19.07 no longer uses the etcd volume, so this is no longer an issue.
