The ceph cluster resource ocs-external-storagecluster-cephcluster is getting stuck in the test cluster #170

Closed
larsks opened this issue Jul 21, 2023 · 20 comments
Labels: openshift This issue pertains to NERC OpenShift

larsks commented Jul 21, 2023

The CephCluster resource on the test cluster is unhealthy. kubectl -n openshift-storage get cephcluster shows:

Progressing failed to configure external cluster monitoring: failed to configure external metrics endpoint: failed to create or update mgr endpoint: failed to create endpoint "rook-ceph-mgr-external". Endpoints "rook-ceph-mgr-external" is invalid: [subsets[0].addresses[0].ip: Invalid value: "": must be a valid IP address, (e.g. 10.9.8.7 or 2001:db8::ffff), subsets[0].addresses[0].ip: Invalid value: "": must be a valid IP address] true


larsks commented Jul 21, 2023

@dystewart reports:

Sounds good. Also, this issue looks to be unique to the test cluster; the CephCluster resources on prod and infra are happy.
Also of note, the ODF versions:

  • infra: 4.10.14
  • test: 4.10.10
  • prod: 4.10.10

So it's odd that we're seeing this only on the test cluster, where we're using the same version and the same configuration as elsewhere.

larsks self-assigned this Jul 21, 2023

larsks commented Jul 21, 2023

Looking at the logs from rook-ceph-operator pod, we see the following error:

2023-07-21 14:43:53.367644 E | ceph-spec: failed to get mgr map. failed to get mgr dump. . Error EACCES: access denied: exit status 13

It looks like the principal with which the nerc-ocp-test cluster is authenticating to the Ceph cluster doesn't have the appropriate privileges. We (and by "we" I mean "the NESE administrators", because most of us don't have the necessary access) would have to compare the permissions for the principals created for the test cluster with those in use for the other clusters.

This goes directly to an issue that I've brought up several times recently -- specifically, that ODF requires privileged access to the ceph cluster, and that this level of access makes the NESE administrators uncomfortable (for good reason). Looking at the code I think this may only be necessary for the monitoring configuration, which we're not using anyway, but it's not clear there's a way to disable that via the operator. That's what I'm looking at right now.
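
For what it's worth, a minimal sketch of what disabling it might look like, assuming the spec.monitoring.enabled field on the CephCluster is what gates this for external clusters and that the ODF operator doesn't simply reconcile the change away:

# Sketch only: ocs-operator owns this external CephCluster and may revert the patch.
oc -n openshift-storage patch cephcluster ocs-external-storagecluster-cephcluster \
  --type merge -p '{"spec":{"monitoring":{"enabled":false}}}'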

@dystewart

@larsks so it sounds like the privileges associated with the test cluster on the Ceph side changed at some point? Since we had access relatively recently to storage (sometime last week).


larsks commented Jul 21, 2023

@dystewart ...maybe? The error doesn't appear to impact basic ODF functionality (I mean, I can create new PVCs and they bind as expected). Is there any chance this error was presenting earlier and we just didn't see it?

@dystewart

I suppose that could be true, but I think there's something else going on with storage in the test cluster. While PVCs can be created and bind to PVs, I keep seeing:

MountVolume.MountDevice failed for volume "pvc-c51ad2b1-0061-4c14-b8e9-acfa57f6e004" : rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000024-2cb1a8f3-267c-11ee-8761-0a580a800207 already exists

@Zongshun96 also reports this behavior, though I'm not sure whether it's related to the operator issue we're seeing, since this one appears to be on the OpenShift side.
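
One way to see which CSI plugin pod is reporting the stuck operation (a sketch; the volume ID below is the one from the error above):

oc -n openshift-storage logs -l app=csi-rbdplugin -c csi-rbdplugin --tail=500 \
  | grep 0001-0011-openshift-storage-0000000000000024-2cb1a8f3-267c-11ee-8761-0a580a800207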


larsks commented Jul 21, 2023

But I think there's something else going on

I think you're right. Spinning up some test resources, I see:

3s          Warning   FailedMount              pod/mariadb-54964c5695-9xkdv           Unable to attach or mount volumes: unmounted volumes=[mariadb-data], unattached volumes=[mariadb-data kube-api-access-6h79v]: timed out waiting for the condition
1s          Warning   FailedMount              pod/mariadb-54964c5695-9xkdv           MountVolume.MountDevice failed for volume "pvc-d62b6f0d-e9e8-44ae-8e35-daf7777ec6ea" : rpc error: code = DeadlineExceeded desc = context deadline exceeded
0s          Warning   FailedMount              pod/mariadb-54964c5695-9xkdv           MountVolume.MountDevice failed for volume "pvc-d62b6f0d-e9e8-44ae-8e35-daf7777ec6ea" : rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000024-e6a6321c-27ff-11ee-8761-0a580a800207 already exists

And in fact, inspecting the underlying host, the /dev/rbd* devices don't exist.

...but they have been created on the ceph cluster.
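
Roughly the checks being described (a sketch; the node name, Ceph client ID, and pool name are placeholders):

# On the OpenShift side: the mapped device is missing on the host
oc debug node/<node> -- chroot /host ls -l /dev/ | grep rbd
# On the Ceph side: the backing RBD image does exist
rbd ls --id healthchecker-nerc-ocp-test-1-rbd -p <pool>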


larsks commented Jul 21, 2023

More interesting findings! I can manually rbd map and rbd unmap volumes: the rbd map command makes the appropriate /dev/rbdX device available, and I can access it/mkfs it/mount it etc, but the rbd map command never exits.

I'm not sure what that means. I'm going to see what sort of ceph forums I can find where people might know more.
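
The manual test described above would look roughly like this (a sketch; the pool/image names are placeholders and the client ID is an assumption):

rbd map <pool>/<image> --id healthchecker-nerc-ocp-test-1-rbd &   # /dev/rbd0 shows up, but the command never returns
mkfs.xfs /dev/rbd0
mount /dev/rbd0 /mnt
umount /mnt
rbd unmap <pool>/<image> --id healthchecker-nerc-ocp-test-1-rbd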

@Zongshun96

Since we had access relatively recently to storage (sometime last week)

I think I can add some context about when we accessed the storage.
I had set up several RHODS pipeline servers for testing over the past weeks, and this "volume already exists" error seems to have shown up within the past 10 days.
In the screenshot below, the mariadb-pipelines-definition pod in namespace test-pv raised this error when I created the server on Wednesday (July 19th).
(screenshot: Screen Shot 2023-07-21 at 4 14 21 PM)

But the volume for another test server in namespace testing, which I set up on July 11, worked just fine.
(screenshot: Screen Shot 2023-07-21 at 4 03 26 PM)


larsks commented Jul 27, 2023

It looks like we're seeing two specific errors. From the csi-rbdplugin-provisioner pods:

csi-rbdplugin-fkps9 csi-rbdplugin E0727 17:17:09.129973 8902 utils.go:200] ID: 72323 Req-ID: 0001-0011-openshift-storage-0000000000000024-e6a6321c-27ff-11ee-8761-0a580a800207 GRPC error: rpc error: code = Internal desc = failed to establish the connection: failed to get connection:

And from the rook-ceph-operator pod(s):

rook-ceph-operator-6886456b4b-wzpn6 rook-ceph-operator 2023-07-27 11:26:13.645453 E | ceph-spec: failed to get mgr map. failed to get mgr dump. . Error EACCES: access denied: exit status 13

That second error looks very much like a Ceph permissions error; I'm trying to reproduce that with a ceph command line. I'm less certain about the "failed to get connection" error, since I'm successfully able to interact with the Ceph cluster.


larsks commented Jul 27, 2023

The "access denied" error messages comes from this code, which leads to here, where we can see rook trying to run a ceph mgr dump command.

On nerc-ocp-prod:

  • Running ceph --id healthchecker-nerc-ocp-prod-1-rbd mgr dump succeeds

  • Running ceph --id provisioner-nerc-ocp-prod-1-rbd mgr dump results in:

    Error EACCES: access denied
    

    (And exits with code 13)

On nerc-ocp-test:

  • Running ceph --id healthchecker-nerc-ocp-test-1-rbd mgr dump fails with an access denied error.
  • Running ceph --id provisioner-nerc-ocp-test-1-rbd mgr dump fails with an access denied error.

So, it looks like the healthchecker user on the test cluster doesn't have the same permissions as on the prod cluster. I'm emailing help@nese to have them check the permissions for the healthchecker user between the two clusters.
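
For the comparison itself, something along these lines on the NESE side should show the difference (ceph auth get requires an admin keyring, which is why this lands with the NESE administrators):

ceph auth get client.healthchecker-nerc-ocp-prod-1-rbd   # works on prod
ceph auth get client.healthchecker-nerc-ocp-test-1-rbd   # compare the caps (mon/mgr/osd) with the prod user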


dystewart commented Jul 27, 2023

@larsks some more data: it looks like when pods are scheduled on nodes wrk-0 through wrk-9, PVCs bind and attach successfully. When you deploy targeting a GPU node (wrk-10 or wrk-11), you see the following:

  Normal   Scheduled               17m                  default-scheduler        Successfully assigned test/gpu-ceph-test to wrk-10
  Normal   SuccessfulAttachVolume  17m                  attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-5a76ccd9-f048-4be9-ba4e-fd589a64d48e"
  Warning  FailedMount             14m                  kubelet                  MountVolume.MountDevice failed for volume "pvc-5a76ccd9-f048-4be9-ba4e-fd589a64d48e" : rpc error: code = DeadlineExceeded desc = context deadline exceeded
  Warning  FailedMount             3m31s (x2 over 12m)  kubelet                  Unable to attach or mount volumes: unmounted volumes=[gpu-test], unattached volumes=[kube-api-access-5pvwk gpu-test]: timed out waiting for the condition
  Warning  FailedMount             73s (x5 over 14m)    kubelet                  Unable to attach or mount volumes: unmounted volumes=[gpu-test], unattached volumes=[gpu-test kube-api-access-5pvwk]: timed out waiting for the condition
  Warning  FailedMount             32s (x14 over 14m)   kubelet                  MountVolume.MountDevice failed for volume "pvc-5a76ccd9-f048-4be9-ba4e-fd589a64d48e" : rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000024-6393337f-2cb0-11ee-8761-0a580a800207 already exists

And in fact, inspecting the underlying host, the /dev/rbd* devices don't exist.

So this is probably an issue with the GPU node config.


larsks commented Jul 28, 2023

I got tired of always having to create Ceph credentials when diagnosing this sort of thing; here you go.


larsks commented Jul 28, 2023

So this is probably an issue with the GPU node config

Good spotting! It looks like those nodes have a connectivity problem with NESE; running this test:

# Check TCP connectivity from each worker node to the NESE endpoint at 10.255.116.12
# (port 6789 is the Ceph mon v1 port)
oc get node -o name |
grep wrk |
cut -f2 -d/ |
while read node; do
  if ssh -l core $node.nerc-ocp-test.rc.fas.harvard.edu nc -z 10.255.116.12 6789 < /dev/null; then
    echo "$node: OKAY"
  else
    echo "$node: FAILED"
  fi
done | tee results.txt

Results in:

wrk-0: OKAY
wrk-1: OKAY
wrk-10: FAILED
wrk-11: FAILED
wrk-2: OKAY
wrk-3: OKAY
wrk-4: OKAY
wrk-5: OKAY
wrk-6: OKAY
wrk-7: OKAY
wrk-8: OKAY
wrk-9: OKAY


larsks commented Jul 28, 2023

@dystewart the GPU nodes (wrk-10 and wrk-11) are missing the bond0.2175 interface that is required for access to NESE.
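
A quick way to confirm that from the cluster side, assuming kubernetes-nmstate is what applies the interface config here (a sketch; node names are assumptions):

# Look for failed NodeNetworkConfigurationEnactments on the GPU nodes
oc get nnce | grep -E 'wrk-1[01]'
# And confirm the VLAN interface really is absent on the node
oc debug node/wrk-10 -- chroot /host ip -d link show bond0.2175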


larsks commented Jul 28, 2023

It looks like the problem is that the bond0 interface on these nodes is misconfigured:

error reconciling NodeNetworkConfigurationPolicy at desired state apply: ,
 failed to execute nmstatectl set --no-commit --timeout 480: 'exit status 1' '' '/usr/lib/python3.6/site-packages/nmstatectl/nmstatectl.py:325: UserWarning: Using 'set' is deprecated, use 'apply' instead.
  warnings.warn("Using 'set' is deprecated, use 'apply' instead.")
2023-07-28 18:35:07,095 root         DEBUG    Nmstate version: 1.0.2
2023-07-28 18:35:07,096 root         DEBUG    Applying desire state: {'interfaces': [{'ipv4': {'auto-routes': True, 'dhcp': True, 'enabled': True}, 'mtu': 9000, 'name': 'bond0.2175', 'state': 'up', 'type': 'vlan', 'vlan': {'base-iface': 'bond0', 'id': 2175}}]}
2023-07-28 18:35:07,147 root         DEBUG    NetworkManager version 1.30.0
2023-07-28 18:35:07,152 root         DEBUG    Async action: Retrieve applied config: ethernet eth1 started
2023-07-28 18:35:07,152 root         DEBUG    Async action: Retrieve applied config: bond bond0 started
2023-07-28 18:35:07,153 root         DEBUG    Async action: Retrieve applied config: ethernet eth1 finished
2023-07-28 18:35:07,154 root         DEBUG    Async action: Retrieve applied config: bond bond0 finished
Traceback (most recent call last):
  File "/usr/bin/nmstatectl", line 11, in <module>
    load_entry_point('nmstate==1.0.2', 'console_scripts', 'nmstatectl')()
  File "/usr/lib/python3.6/site-packages/nmstatectl/nmstatectl.py", line 73, in main
    return args.func(args)
  File "/usr/lib/python3.6/site-packages/nmstatectl/nmstatectl.py", line 326, in set
    return apply(args)
  File "/usr/lib/python3.6/site-packages/nmstatectl/nmstatectl.py", line 354, in apply
    args.save_to_disk,
  File "/usr/lib/python3.6/site-packages/nmstatectl/nmstatectl.py", line 407, in apply_state
    save_to_disk=save_to_disk,
  File "/usr/lib/python3.6/site-packages/libnmstate/netapplier.py", line 78, in apply
    desired_state, ignored_ifnames, current_state, save_to_disk
  File "/usr/lib/python3.6/site-packages/libnmstate/net_state.py", line 51, in __init__
    gen_conf_mode,
  File "/usr/lib/python3.6/site-packages/libnmstate/ifaces/ifaces.py", line 166, in __init__
    self._pre_edit_validation_and_cleanup()
  File "/usr/lib/python3.6/site-packages/libnmstate/ifaces/ifaces.py", line 248, in _pre_edit_validation_and_cleanup
    self._validate_vlan_mtu()
  File "/usr/lib/python3.6/site-packages/libnmstate/ifaces/ifaces.py", line 337, in _validate_vlan_mtu
    f"Interface {iface.name} has bigger "
libnmstate.error.NmstateValueError: Interface bond0.2175 has bigger MTU(9000) than its base interface: bond0 MTU(1500)

And indeed, on wrk-3 we see:

10: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 90:e2:ba:d1:e1:e4 brd ff:ff:ff:ff:ff:ff

And on wrk-10:

6: bond0: <NO-CARRIER,BROADCAST,MULTICAST,MASTER,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 52:0e:4d:21:49:ac brd ff:ff:ff:ff:ff:ff

larsks added a commit to larsks/nerc-ocp-config that referenced this issue Jul 28, 2023
We configure the bond0 interface to span interfaces named "nic1" and
"nic2", and we rely on udev rules to assign these names to the appropriate
interfaces. There were no such rules for the GPU nodes (wrk-{10,11}) on the
test cluster.

This commit adds the necessary udev rules.

Part of: nerc-project/operations#170

larsks commented Jul 28, 2023

Our network configuration creates interface bond0 spanning interfaces named nic1 and nic2. We rely on udev rules to rename the appropriate interfaces, and no such rules were created for the GPU nodes.

This means that the higher level configuration -- which sets up both the bond0 interface and the VLAN interfaces associated with bond0 -- failed.

I've just pushed OCP-on-NERC/nerc-ocp-config#266, which should take care of the device names.
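
For illustration only, the missing piece is a pair of udev rules of roughly this shape (a sketch: the rule file name is an assumption, and the MAC addresses are the permanent hardware addresses of wrk-10's two NICs shown elsewhere in this thread; the real rules are delivered via a MachineConfig in nerc-ocp-config):

cat <<'EOF' > /etc/udev/rules.d/80-nerc-ifnames.rules
# Rename the two bonded NICs so the bond0 NodeNetworkConfigurationPolicy can find them
SUBSYSTEM=="net", ACTION=="add", ATTR{address}=="b0:26:28:1a:56:dc", NAME="nic1"
SUBSYSTEM=="net", ACTION=="add", ATTR{address}=="b0:26:28:1a:56:dd", NAME="nic2"
EOF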

larsks added a commit to larsks/nerc-ocp-config that referenced this issue Jul 28, 2023
larsks added a commit to larsks/nerc-ocp-config that referenced this issue Jul 31, 2023

larsks commented Jul 31, 2023

(Ignore that last -- deleted -- message; I goofed.)


larsks commented Jul 31, 2023

@jtriley on wrk-10 we have a link on nic2 but not on nic1, which I think is expected. bond0 appears to be up, but we have no network connectivity:

[root@wrk-10 net]# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v4.18.0-305.88.1.el8_4.x86_64

Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer2 (0)
MII Status: up
MII Polling Interval (ms): 140
Up Delay (ms): 0
Down Delay (ms): 0
Peer Notification Delay (ms): 0

802.3ad info
LACP rate: slow
Min links: 0
Aggregator selection policy (ad_select): stable
System priority: 65535
System MAC address: b0:26:28:1a:56:dc
Active Aggregator Info:
        Aggregator ID: 2
        Number of ports: 1
        Actor Key: 15
        Partner Key: 32885
        Partner Mac Address: 00:23:04:ee:c1:2d

Slave Interface: nic1
MII Status: down
Speed: Unknown
Duplex: Unknown
Link Failure Count: 2
Permanent HW addr: b0:26:28:1a:56:dc
Slave queue ID: 0
Aggregator ID: 2
Actor Churn State: none
Partner Churn State: none
Actor Churned Count: 0
Partner Churned Count: 0
details actor lacp pdu:
    system priority: 65535
    system mac address: b0:26:28:1a:56:dc
    port key: 0
    port priority: 255
    port number: 1
    port state: 61
details partner lacp pdu:
    system priority: 1
    system mac address: 00:23:04:ee:c1:2d
    oper key: 32885
    port priority: 32768
    port number: 16657
    port state: 61

Slave Interface: nic2
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: b0:26:28:1a:56:dd
Slave queue ID: 0
Aggregator ID: 2
Actor Churn State: none
Partner Churn State: none
Actor Churned Count: 0
Partner Churned Count: 0
details actor lacp pdu:
    system priority: 65535
    system mac address: b0:26:28:1a:56:dc
    port key: 15
    port priority: 255
    port number: 2
    port state: 61
details partner lacp pdu:
    system priority: 1
    system mac address: 00:23:04:ee:c1:2d
    oper key: 32885
    port priority: 32768
    port number: 273
    port state: 61

On the nics, we see:

[root@wrk-10 net]# ip addr show nic1
3: nic1: <NO-CARRIER,BROADCAST,MULTICAST,SLAVE,UP> mtu 9000 qdisc mq master bond0 state DOWN group default qlen 1000
    link/ether b0:26:28:1a:56:dc brd ff:ff:ff:ff:ff:ff
[root@wrk-10 net]# ip addr show nic2
5: nic2: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond0 state UP group default qlen 1000
    link/ether b0:26:28:1a:56:dc brd ff:ff:ff:ff:ff:ff permaddr b0:26:28:1a:56:dd
[root@wrk-10 net]# ip addr show bond0
10: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
    link/ether b0:26:28:1a:56:dc brd ff:ff:ff:ff:ff:ff
    inet 10.30.8.20/24 brd 10.30.8.255 scope global dynamic noprefixroute bond0
       valid_lft 604231sec preferred_lft 604231sec
    inet6 fe80::ba13:70cf:7655:4922/64 scope link noprefixroute
       valid_lft forever preferred_lft forever

But I'm unable to reach the default gateway:

[root@wrk-10 net]# ip route
default via 10.30.8.1 dev bond0 proto dhcp metric 300
10.30.8.0/24 dev bond0 proto kernel scope link src 10.30.8.20 metric 300
[root@wrk-10 net]# ping -c2 10.30.8.1
PING 10.30.8.1 (10.30.8.1) 56(84) bytes of data.
From 10.30.8.20 icmp_seq=1 Destination Host Unreachable
From 10.30.8.20 icmp_seq=2 Destination Host Unreachable

--- 10.30.8.1 ping statistics ---
2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 1032ms
pipe 2
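
Since the bond reports an active LACP partner but nothing at layer 2 answers, one crude check is to look at what the switch is actually delivering to the host (a sketch):

# If ARP replies from the gateway never arrive, or only unexpected VLAN tags show up,
# the switch-side port configuration is the likely culprit.
tcpdump -nei bond0 -c 20 'arp or vlan'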


jtriley commented Aug 1, 2023

This should be resolved now. @aabaris found an issue where one of the data ports was getting traffic on the wrong set of VLANs. Network ops found that both of the GPU hosts' data ports on the switch were misconfigured, with one link getting the ocp-test VLANs and the other getting OpenStack VLANs. Fixing this and rebooting the two GPU hosts appears to have fixed the problem. The MachineConfigPool has now updated successfully, and both hosts have the storage network device.

$ oc get -n openshift-storage cephcluster --as=system:admin
NAME                                      DATADIRHOSTPATH   MONCOUNT   AGE    PHASE       MESSAGE                          HEALTH        EXTERNAL
ocs-external-storagecluster-cephcluster                                136d   Connected   Cluster connected successfully   HEALTH_WARN   true

jtriley closed this as completed Aug 1, 2023

larsks commented Aug 1, 2023

I've confirmed that I can successfully deploy pods with PVCs on wrk-10.
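
A test along these lines is enough to confirm it (a sketch; the PVC/pod names are made up, the node name may need to be the full node name, and the storage class assumes the usual external-mode default):

oc apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: rbd-test
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 1Gi
  storageClassName: ocs-external-storagecluster-ceph-rbd
---
apiVersion: v1
kind: Pod
metadata:
  name: rbd-test
spec:
  nodeName: wrk-10
  containers:
  - name: test
    image: registry.access.redhat.com/ubi8/ubi-minimal
    command: ["sleep", "infinity"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: rbd-test
EOF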

larsks added a commit to dystewart/nerc-ocp-config that referenced this issue Aug 3, 2023
joachimweyl added the openshift This issue pertains to NERC OpenShift label Aug 16, 2023