The ceph cluster resource ocs-external-storagecluster-cephcluster is getting stuck in the test cluster #170

Closed
larsks opened this issue Jul 21, 2023 · 20 comments
Labels: openshift This issue pertains to NERC OpenShift

larsks commented Jul 21, 2023

The CephCluster resource on the test cluster is unhealthy. kubectl -n openshift-storage get cephcluster shows:

Progressing failed to configure external cluster monitoring: failed to configure external metrics endpoint: failed to create or update mgr endpoint: failed to create endpoint "rook-ceph-mgr-external". Endpoints "rook-ceph-mgr-external" is invalid: [subsets[0].addresses[0].ip: Invalid value: "": must be a valid IP address, (e.g. 10.9.8.7 or 2001:db8::ffff), subsets[0].addresses[0].ip: Invalid value: "": must be a valid IP address] true


larsks commented Jul 21, 2023

@dystewart reports:

Sounds good. Also, this issue looks to be unique to the test cluster; the CephCluster resources on prod and infra are happy.
Also of note, the ODF versions:

  • infra: 4.10.14
  • test: 4.10.10
  • prod: 4.10.10

So it's odd that we're seeing this only on the test cluster, where we're using the same version and the same configuration as elsewhere.

larsks self-assigned this Jul 21, 2023

larsks commented Jul 21, 2023

Looking at the logs from rook-ceph-operator pod, we see the following error:

2023-07-21 14:43:53.367644 E | ceph-spec: failed to get mgr map. failed to get mgr dump. . Error EACCES: access denied: exit status 13

It looks like the principal with which the nerc-ocp-test cluster is authenticating to the Ceph cluster doesn't have the appropriate privileges. We (and by "we" I mean "the NESE administrators", because most of us don't have the necessary access) would have to compare the permissions for the principals created for the test cluster with those in use for the other clusters.

This goes directly to an issue that I've brought up several times recently -- specifically, that ODF requires privileged access to the ceph cluster, and that this level of access makes the NESE administrators uncomfortable (for good reason). Looking at the code I think this may only be necessary for the monitoring configuration, which we're not using anyway, but it's not clear there's a way to disable that via the operator. That's what I'm looking at right now.
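
For what it's worth, a minimal sketch of what disabling it might look like, assuming the spec.monitoring.enabled field on the CephCluster is what gates this for external clusters and that the ODF operator doesn't simply reconcile the change away:

# Sketch only: ocs-operator owns this external CephCluster and may revert the patch.
oc -n openshift-storage patch cephcluster ocs-external-storagecluster-cephcluster \
  --type merge -p '{"spec":{"monitoring":{"enabled":false}}}'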

@dystewart

@larsks so it sounds like the privileges associated with the test cluster on the Ceph side changed at some point? Since we had access relatively recently to storage (sometime last week).


larsks commented Jul 21, 2023

@dystewart ...maybe? The error doesn't appear to impact basic ODF functionality (I mean, I can create new PVCs and they bind as expected). Is there any chance this error was presenting earlier and we just didn't see it?

@dystewart

I suppose that could be true, but I think there's something else going on with storage in the test cluster. While PVCs can be created and bind to PVs, I keep seeing:

MountVolume.MountDevice failed for volume "pvc-c51ad2b1-0061-4c14-b8e9-acfa57f6e004" : rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000024-2cb1a8f3-267c-11ee-8761-0a580a800207 already exists

@Zongshun96 also reports this behavior, though I'm not sure whether it's related to the operator issue we're seeing, since this one appears to be on the OpenShift side.
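
One way to see which CSI plugin pod is reporting the stuck operation (a sketch; the volume ID below is the one from the error above):

oc -n openshift-storage logs -l app=csi-rbdplugin -c csi-rbdplugin --tail=500 \
  | grep 0001-0011-openshift-storage-0000000000000024-2cb1a8f3-267c-11ee-8761-0a580a800207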


larsks commented Jul 21, 2023

But I think there's something else going on

I think you're right. Spinning up some test resources, I see:

3s          Warning   FailedMount              pod/mariadb-54964c5695-9xkdv           Unable to attach or mount volumes: unmounted volumes=[mariadb-data], unattached volumes=[mariadb-data kube-api-access-6h79v]: timed out waiting for the condition
1s          Warning   FailedMount              pod/mariadb-54964c5695-9xkdv           MountVolume.MountDevice failed for volume "pvc-d62b6f0d-e9e8-44ae-8e35-daf7777ec6ea" : rpc error: code = DeadlineExceeded desc = context deadline exceeded
0s          Warning   FailedMount              pod/mariadb-54964c5695-9xkdv           MountVolume.MountDevice failed for volume "pvc-d62b6f0d-e9e8-44ae-8e35-daf7777ec6ea" : rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000024-e6a6321c-27ff-11ee-8761-0a580a800207 already exists

And in fact, inspecting the underlying host, the /dev/rbd* devices don't exist.

...but they have been created on the ceph cluster.
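
Roughly the checks being described (a sketch; the node name, Ceph client ID, and pool name are placeholders):

# On the OpenShift side: the mapped device is missing on the host
oc debug node/<node> -- chroot /host ls -l /dev/ | grep rbd
# On the Ceph side: the backing RBD image does exist
rbd ls --id healthchecker-nerc-ocp-test-1-rbd -p <pool>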


larsks commented Jul 21, 2023

More interesting findings! I can manually rbd map and rbd unmap volumes: the rbd map command makes the appropriate /dev/rbdX device available, and I can access it/mkfs it/mount it etc, but the rbd map command never exits.

I'm not sure what that means. I'm going to see what sort of ceph forums I can find where people might know more.
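
The manual test described above would look roughly like this (a sketch; the pool/image names are placeholders and the client ID is an assumption):

rbd map <pool>/<image> --id healthchecker-nerc-ocp-test-1-rbd &   # /dev/rbd0 shows up, but the command never returns
mkfs.xfs /dev/rbd0
mount /dev/rbd0 /mnt
umount /mnt
rbd unmap <pool>/<image> --id healthchecker-nerc-ocp-test-1-rbd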

@Zongshun96

Since we had access relatively recently to storage (sometime last week)

I think I can add some context about when we accessed the storage.
I had set up several RHODS pipeline servers for testing over the past weeks, and this "volume already exists" error seems to have shown up within the past 10 days.
In the screenshot below, the mariadb-pipelines-definition pod in namespace test-pv raised this error when I created the server on Wednesday (July 19th).
(screenshot: Screen Shot 2023-07-21 at 4 14 21 PM)

But the volume for another test server in namespace testing, which I set up on July 11, worked just fine.
(screenshot: Screen Shot 2023-07-21 at 4 03 26 PM)


larsks commented Jul 27, 2023

It looks like we're seeing two specific errors. From the csi-rbdplugin-provisioner pods:

csi-rbdplugin-fkps9 csi-rbdplugin E0727 17:17:09.129973 8902 utils.go:200] ID: 72323 Req-ID: 0001-0011-openshift-storage-0000000000000024-e6a6321c-27ff-11ee-8761-0a580a800207 GRPC error: rpc error: code = Internal desc = failed to establish the connection: failed to get connection:

And from the rook-ceph-operator pod(s):

rook-ceph-operator-6886456b4b-wzpn6 rook-ceph-operator 2023-07-27 11:26:13.645453 E | ceph-spec: failed to get mgr map. failed to get mgr dump. . Error EACCES: access denied: exit status 13

That second error looks very much like a Ceph permissions error; I'm trying to reproduce that with a ceph command line. I'm less certain about the "failed to get connection" error, since I'm successfully able to interact with the Ceph cluster.


larsks commented Jul 27, 2023

The "access denied" error messages comes from this code, which leads to here, where we can see rook trying to run a ceph mgr dump command.

On nerc-ocp-prod:

  • Running ceph --id healthchecker-nerc-ocp-prod-1-rbd mgr dump succeeds

  • Running ceph --id provisioner-nerc-ocp-prod-1-rbd mgr dump results in:

    Error EACCES: access denied
    

    (And exits with code 13)

On nerc-ocp-test:

  • Running ceph --id healthchecker-nerc-ocp-test-1-rbd mgr dump fails with an access denied error.
  • Running ceph --id provisioner-nerc-ocp-test-1-rbd mgr dump fails with an access denied error.

So, it looks like the healthchecker user on the test cluster doesn't have the same permissions as on the prod cluster. I'm emailing help@nese to have them check the permissions for the healthchecker user between the two clusters.
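
For the comparison itself, something along these lines on the NESE side should show the difference (ceph auth get requires an admin keyring, which is why this lands with the NESE administrators):

ceph auth get client.healthchecker-nerc-ocp-prod-1-rbd   # works on prod
ceph auth get client.healthchecker-nerc-ocp-test-1-rbd   # compare the caps (mon/mgr/osd) with the prod user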


dystewart commented Jul 27, 2023

@larsks some more data: it looks like when pods are scheduled on nodes wrk-0 through wrk-9, PVCs bind and attach successfully. When you deploy targeting a GPU node (wrk-10 or wrk-11), you see the following:

  Normal   Scheduled               17m                  default-scheduler        Successfully assigned test/gpu-ceph-test to wrk-10
  Normal   SuccessfulAttachVolume  17m                  attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-5a76ccd9-f048-4be9-ba4e-fd589a64d48e"
  Warning  FailedMount             14m                  kubelet                  MountVolume.MountDevice failed for volume "pvc-5a76ccd9-f048-4be9-ba4e-fd589a64d48e" : rpc error: code = DeadlineExceeded desc = context deadline exceeded
  Warning  FailedMount             3m31s (x2 over 12m)  kubelet                  Unable to attach or mount volumes: unmounted volumes=[gpu-test], unattached volumes=[kube-api-access-5pvwk gpu-test]: timed out waiting for the condition
  Warning  FailedMount             73s (x5 over 14m)    kubelet                  Unable to attach or mount volumes: unmounted volumes=[gpu-test], unattached volumes=[gpu-test kube-api-access-5pvwk]: timed out waiting for the condition
  Warning  FailedMount             32s (x14 over 14m)   kubelet                  MountVolume.MountDevice failed for volume "pvc-5a76ccd9-f048-4be9-ba4e-fd589a64d48e" : rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000024-6393337f-2cb0-11ee-8761-0a580a800207 already exists

And in fact, inspecting the underlying host, the /dev/rbd* devices don't exist.

So this is probably an issue with the GPU node config.


larsks commented Jul 28, 2023

I got tired of always having to create Ceph credentials when diagnosing this sort of thing; here you go.


larsks commented Jul 28, 2023

So this is probably an issue with the GPU node config

Good spotting! It looks like those nodes have a connectivity problem with NESE; running this test:

# Check TCP connectivity from each worker node to the NESE endpoint at 10.255.116.12
# (port 6789 is the Ceph mon v1 port)
oc get node -o name |
grep wrk |
cut -f2 -d/ |
while read node; do
  if ssh -l core $node.nerc-ocp-test.rc.fas.harvard.edu nc -z 10.255.116.12 6789 < /dev/null; then
    echo "$node: OKAY"
  else
    echo "$node: FAILED"
  fi
done | tee results.txt

Results in:

wrk-0: OKAY
wrk-1: OKAY
wrk-10: FAILED
wrk-11: FAILED
wrk-2: OKAY
wrk-3: OKAY
wrk-4: OKAY
wrk-5: OKAY
wrk-6: OKAY
wrk-7: OKAY
wrk-8: OKAY
wrk-9: OKAY


larsks commented Jul 28, 2023

@dystewart the GPU nodes (wrk-10 and wrk-11) are missing the bond0.2175 interface that is required for access to NESE.
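
A quick way to confirm that from the cluster side, assuming kubernetes-nmstate is what applies the interface config here (a sketch; node names are assumptions):

# Look for failed NodeNetworkConfigurationEnactments on the GPU nodes
oc get nnce | grep -E 'wrk-1[01]'
# And confirm the VLAN interface really is absent on the node
oc debug node/wrk-10 -- chroot /host ip -d link show bond0.2175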


larsks commented Jul 28, 2023

It looks like the problem is that the bond0 interface on these nodes is misconfigured:

error reconciling NodeNetworkConfigurationPolicy at desired state apply: ,
 failed to execute nmstatectl set --no-commit --timeout 480: 'exit status 1' '' '/usr/lib/python3.6/site-packages/nmstatectl/nmstatectl.py:325: UserWarning: Using 'set' is deprecated, use 'apply' instead.
  warnings.warn("Using 'set' is deprecated, use 'apply' instead.")
2023-07-28 18:35:07,095 root         DEBUG    Nmstate version: 1.0.2
2023-07-28 18:35:07,096 root         DEBUG    Applying desire state: {'interfaces': [{'ipv4': {'auto-routes': True, 'dhcp': True, 'enabled': True}, 'mtu': 9000, 'name': 'bond0.2175', 'state': 'up', 'type': 'vlan', 'vlan': {'base-iface': 'bond0', 'id': 2175}}]}
2023-07-28 18:35:07,147 root         DEBUG    NetworkManager version 1.30.0
2023-07-28 18:35:07,152 root         DEBUG    Async action: Retrieve applied config: ethernet eth1 started
2023-07-28 18:35:07,152 root         DEBUG    Async action: Retrieve applied config: bond bond0 started
2023-07-28 18:35:07,153 root         DEBUG    Async action: Retrieve applied config: ethernet eth1 finished
2023-07-28 18:35:07,154 root         DEBUG    Async action: Retrieve applied config: bond bond0 finished
Traceback (most recent call last):
  File "/usr/bin/nmstatectl", line 11, in <module>
    load_entry_point('nmstate==1.0.2', 'console_scripts', 'nmstatectl')()
  File "/usr/lib/python3.6/site-packages/nmstatectl/nmstatectl.py", line 73, in main
    return args.func(args)
  File "/usr/lib/python3.6/site-packages/nmstatectl/nmstatectl.py", line 326, in set
    return apply(args)
  File "/usr/lib/python3.6/site-packages/nmstatectl/nmstatectl.py", line 354, in apply
    args.save_to_disk,
  File "/usr/lib/python3.6/site-packages/nmstatectl/nmstatectl.py", line 407, in apply_state
    save_to_disk=save_to_disk,
  File "/usr/lib/python3.6/site-packages/libnmstate/netapplier.py", line 78, in apply
    desired_state, ignored_ifnames, current_state, save_to_disk
  File "/usr/lib/python3.6/site-packages/libnmstate/net_state.py", line 51, in __init__
    gen_conf_mode,
  File "/usr/lib/python3.6/site-packages/libnmstate/ifaces/ifaces.py", line 166, in __init__
    self._pre_edit_validation_and_cleanup()
  File "/usr/lib/python3.6/site-packages/libnmstate/ifaces/ifaces.py", line 248, in _pre_edit_validation_and_cleanup
    self._validate_vlan_mtu()
  File "/usr/lib/python3.6/site-packages/libnmstate/ifaces/ifaces.py", line 337, in _validate_vlan_mtu
    f"Interface {iface.name} has bigger "
libnmstate.error.NmstateValueError: Interface bond0.2175 has bigger MTU(9000) than its base interface: bond0 MTU(1500)

And indeed, on wrk-3 we see:

10: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 90:e2:ba:d1:e1:e4 brd ff:ff:ff:ff:ff:ff

And on wrk-10:

6: bond0: <NO-CARRIER,BROADCAST,MULTICAST,MASTER,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 52:0e:4d:21:49:ac brd ff:ff:ff:ff:ff:ff

larsks added a commit to larsks/nerc-ocp-config that referenced this issue Jul 28, 2023
We configure the bond0 interface to span interfaces named "nic1" and
"nic2", and we rely on udev rules to assign these names to the appropriate
interfaces. There were no such rules for the GPU nodes (wrk-{10,11}) on the
test cluster.

This commit adds the necessary udev rules.

Part of: nerc-project/operations#170

larsks commented Jul 28, 2023

Our network configuration creates interface bond0 spanning interfaces named nic1 and nic2. We rely on udev rules to rename the appropriate interfaces, and no such rules were created for the GPU nodes.

This means that the higher level configuration -- which sets up both the bond0 interface and the VLAN interfaces associated with bond0 -- failed.

I've just pushed OCP-on-NERC/nerc-ocp-config#266, which should take care of the device names.
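
For illustration only, the missing piece is a pair of udev rules of roughly this shape (a sketch: the rule file name is an assumption, and the MAC addresses are the permanent hardware addresses of wrk-10's two NICs shown elsewhere in this thread; the real rules are delivered via a MachineConfig in nerc-ocp-config):

cat <<'EOF' > /etc/udev/rules.d/80-nerc-ifnames.rules
# Rename the two bonded NICs so the bond0 NodeNetworkConfigurationPolicy can find them
SUBSYSTEM=="net", ACTION=="add", ATTR{address}=="b0:26:28:1a:56:dc", NAME="nic1"
SUBSYSTEM=="net", ACTION=="add", ATTR{address}=="b0:26:28:1a:56:dd", NAME="nic2"
EOF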

larsks added a commit to larsks/nerc-ocp-config that referenced this issue Jul 28, 2023
larsks added a commit to larsks/nerc-ocp-config that referenced this issue Jul 31, 2023

larsks commented Jul 31, 2023

(Ignore that last -- deleted -- message; I goofed.)


larsks commented Jul 31, 2023

@jtriley on wrk-10 we have a link on nic2 but not on nic1, which I think is expected. bond0 appears to be up, but we have no network connectivity:

[root@wrk-10 net]# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v4.18.0-305.88.1.el8_4.x86_64

Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer2 (0)
MII Status: up
MII Polling Interval (ms): 140
Up Delay (ms): 0
Down Delay (ms): 0
Peer Notification Delay (ms): 0

802.3ad info
LACP rate: slow
Min links: 0
Aggregator selection policy (ad_select): stable
System priority: 65535
System MAC address: b0:26:28:1a:56:dc
Active Aggregator Info:
        Aggregator ID: 2
        Number of ports: 1
        Actor Key: 15
        Partner Key: 32885
        Partner Mac Address: 00:23:04:ee:c1:2d

Slave Interface: nic1
MII Status: down
Speed: Unknown
Duplex: Unknown
Link Failure Count: 2
Permanent HW addr: b0:26:28:1a:56:dc
Slave queue ID: 0
Aggregator ID: 2
Actor Churn State: none
Partner Churn State: none
Actor Churned Count: 0
Partner Churned Count: 0
details actor lacp pdu:
    system priority: 65535
    system mac address: b0:26:28:1a:56:dc
    port key: 0
    port priority: 255
    port number: 1
    port state: 61
details partner lacp pdu:
    system priority: 1
    system mac address: 00:23:04:ee:c1:2d
    oper key: 32885
    port priority: 32768
    port number: 16657
    port state: 61

Slave Interface: nic2
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: b0:26:28:1a:56:dd
Slave queue ID: 0
Aggregator ID: 2
Actor Churn State: none
Partner Churn State: none
Actor Churned Count: 0
Partner Churned Count: 0
details actor lacp pdu:
    system priority: 65535
    system mac address: b0:26:28:1a:56:dc
    port key: 15
    port priority: 255
    port number: 2
    port state: 61
details partner lacp pdu:
    system priority: 1
    system mac address: 00:23:04:ee:c1:2d
    oper key: 32885
    port priority: 32768
    port number: 273
    port state: 61

On the nics, we see:

[root@wrk-10 net]# ip addr show nic1
3: nic1: <NO-CARRIER,BROADCAST,MULTICAST,SLAVE,UP> mtu 9000 qdisc mq master bond0 state DOWN group default qlen 1000
    link/ether b0:26:28:1a:56:dc brd ff:ff:ff:ff:ff:ff
[root@wrk-10 net]# ip addr show nic2
5: nic2: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond0 state UP group default qlen 1000
    link/ether b0:26:28:1a:56:dc brd ff:ff:ff:ff:ff:ff permaddr b0:26:28:1a:56:dd
[root@wrk-10 net]# ip addr show bond0
10: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
    link/ether b0:26:28:1a:56:dc brd ff:ff:ff:ff:ff:ff
    inet 10.30.8.20/24 brd 10.30.8.255 scope global dynamic noprefixroute bond0
       valid_lft 604231sec preferred_lft 604231sec
    inet6 fe80::ba13:70cf:7655:4922/64 scope link noprefixroute
       valid_lft forever preferred_lft forever

But I'm unable to reach the default gateway:

[root@wrk-10 net]# ip route
default via 10.30.8.1 dev bond0 proto dhcp metric 300
10.30.8.0/24 dev bond0 proto kernel scope link src 10.30.8.20 metric 300
[root@wrk-10 net]# ping -c2 10.30.8.1
PING 10.30.8.1 (10.30.8.1) 56(84) bytes of data.
From 10.30.8.20 icmp_seq=1 Destination Host Unreachable
From 10.30.8.20 icmp_seq=2 Destination Host Unreachable

--- 10.30.8.1 ping statistics ---
2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 1032ms
pipe 2
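
Since the bond reports an active LACP partner but nothing at layer 2 answers, one crude check is to look at what the switch is actually delivering to the host (a sketch):

# If ARP replies from the gateway never arrive, or only unexpected VLAN tags show up,
# the switch-side port configuration is the likely culprit.
tcpdump -nei bond0 -c 20 'arp or vlan'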


jtriley commented Aug 1, 2023

This should be resolved now. @aabaris found an issue where one of the data ports was getting traffic on the wrong set of VLANs. Network ops found that both of the GPU hosts' data ports on the switch were misconfigured, with one link getting the ocp-test VLANs and the other getting OpenStack VLANs. Fixing this and rebooting the two GPU hosts appears to have fixed the problem. The MachineConfigPool has now updated successfully, and both hosts have the storage network device.

$ oc get -n openshift-storage cephcluster --as=system:admin
NAME                                      DATADIRHOSTPATH   MONCOUNT   AGE    PHASE       MESSAGE                          HEALTH        EXTERNAL
ocs-external-storagecluster-cephcluster                                136d   Connected   Cluster connected successfully   HEALTH_WARN   true

jtriley closed this as completed Aug 1, 2023

larsks commented Aug 1, 2023

I've confirmed that I can successfully deploy pods with PVCs on wrk-10.
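
A test along these lines is enough to confirm it (a sketch; the PVC/pod names are made up, the node name may need to be the full node name, and the storage class assumes the usual external-mode default):

oc apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: rbd-test
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 1Gi
  storageClassName: ocs-external-storagecluster-ceph-rbd
---
apiVersion: v1
kind: Pod
metadata:
  name: rbd-test
spec:
  nodeName: wrk-10
  containers:
  - name: test
    image: registry.access.redhat.com/ubi8/ubi-minimal
    command: ["sleep", "infinity"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: rbd-test
EOF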

larsks added a commit to dystewart/nerc-ocp-config that referenced this issue Aug 3, 2023
joachimweyl added the openshift This issue pertains to NERC OpenShift label Aug 16, 2023