The ceph cluster resource ocs-external-storagecluster-cephcluster is getting stuck in the test cluster #170
Comments
@dystewart reports:
So it's odd that we're seeing this on the test cluster, where we're using the same version and same config as elsewhere.
Looking at the logs from
It looks like the principal with which the nerc-ocp-test cluster is authenticating to the ceph cluster doesn't have the appropriate privileges. We (and by "we" I mean "the NESE administrators", because most of us don't have the appropriate access) would have to compare the permissions for the principals created for the test cluster with those in use for the other clusters.

This goes directly to an issue that I've brought up several times recently: ODF requires privileged access to the ceph cluster, and that level of access makes the NESE administrators uncomfortable (for good reason). Looking at the code, I think this may only be necessary for the monitoring configuration, which we're not using anyway, but it's not clear there's a way to disable that via the operator. That's what I'm looking at right now.
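Since most of us can't query NESE directly, the comparison itself is simple once an administrator exports the capability sets. A minimal sketch, with illustrative stand-in contents in place of the real `ceph auth get client.<principal>` output (the capability strings below are examples, not actual NESE data):

```shell
# Stand-in capability dumps; on the real clusters these would come from
# "ceph auth get client.<principal>" run by a NESE administrator.
cat > /tmp/prod-caps.txt <<'EOF'
caps mgr = "allow command config"
caps mon = "allow r, allow command quorum_status"
EOF
cat > /tmp/test-caps.txt <<'EOF'
caps mon = "allow r"
EOF
# Any line present on prod but missing on test is a candidate culprit.
diff -u /tmp/prod-caps.txt /tmp/test-caps.txt || true
```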
@larsks so it sounds like the privileges associated with the test cluster on the ceph side changed at some point? We had access to storage relatively recently (sometime last week).
@dystewart ...maybe? The error doesn't appear to impact basic ODF functionality (I mean, I can create new PVCs and they bind as expected). Is there any chance this error was presenting earlier and we just didn't see it?
I suppose that could be true. But I think there's something else going on with storage in the test cluster. While the PVCs can be created and bind to PVs, I keep seeing:
@Zongshun96 also reports this behavior, though I'm not sure whether it's related to the operator issue we're seeing, since this seems to be on the OpenShift side.
I think you're right. Spinning up some test resources, I see:
And in fact, inspecting the underlying host, the ... but they have been created on the ceph cluster.
More interesting findings! I can manually ... I'm not sure what that means. I'm going to see what sort of ceph forums I can find where people might know more.
It looks like we're seeing two specific errors. From the
And from the
That second error looks very much like a Ceph permissions error; I'm trying to reproduce that with a
The "access denied" error message comes from this code, which leads to here, where we can see rook trying to run a ... On
On
So, it looks like the healthchecker user on the test cluster doesn't have the same permissions as on the prod cluster. I'm emailing
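For anyone with admin access, a quick way to check is to dump the healthchecker principal's capabilities on both clusters. The client name and the expected caps below are my assumptions based on how ODF external mode typically provisions this user; verify against the actual deployment:

```shell
# Run against each cluster (admin keyring required); client name assumed.
ceph auth get client.healthchecker
# On a working external-mode deployment the caps usually include lines like:
#   caps mon = "allow r, allow command quorum_status"
#   caps mgr = "allow command config"
# A test-cluster principal missing the mon/mgr command caps would explain
# the "access denied" health-check errors.
```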
@larsks some more data: it looks like PVCs bind and attach successfully when targeting nodes wrk-0 through wrk-9 for scheduling. When you deploy targeting a GPU node (wrk-10 or wrk-11), you see the following:
So this is probably an issue with the GPU node config.
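A reproduction along these lines pins a pod with an RBD-backed PVC to one of the GPU nodes. The resource names, image, and storage class here are my assumptions (the storage class follows the usual ODF external-mode naming), not taken from the thread:

```shell
# Create a 1Gi RBD PVC and a pod pinned to wrk-10 that mounts it.
cat <<'EOF' | oc apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pvc
spec:
  accessModes: [ReadWriteOnce]
  resources: {requests: {storage: 1Gi}}
  storageClassName: ocs-external-storagecluster-ceph-rbd
---
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  nodeSelector:
    kubernetes.io/hostname: wrk-10
  containers:
  - name: test
    image: registry.access.redhat.com/ubi9/ubi
    command: [sleep, infinity]
    volumeMounts: [{name: data, mountPath: /data}]
  volumes:
  - name: data
    persistentVolumeClaim: {claimName: test-pvc}
EOF
# Watch for the pod sticking in ContainerCreating with attach errors.
oc get pod test-pod -o wide
oc describe pod test-pod | tail -20
```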
I got tired of always having to create ceph credentials when diagnosing this sort of thing; here you go.
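For context, a sketch of what such a helper might do: pull ceph CLI credentials out of the secrets rook keeps in the cluster so you can talk to the external ceph cluster directly. The secret and configmap names below follow rook conventions and may differ in an ODF external-mode install; treat every name here as an assumption:

```shell
# Assemble mon endpoint and admin key from rook's secrets (names assumed).
NS=openshift-storage
MON=$(oc -n "$NS" get configmap rook-ceph-mon-endpoints \
      -o jsonpath='{.data.data}' | cut -d= -f2 | cut -d, -f1)
KEY=$(oc -n "$NS" get secret rook-ceph-mon \
      -o jsonpath='{.data.ceph-secret}' | base64 -d)
# Write a throwaway keyring and query cluster status.
cat > /tmp/keyring <<EOF
[client.admin]
    key = $KEY
EOF
ceph --mon-host "$MON" --id admin --keyring /tmp/keyring status
```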
Good spotting! It looks like those nodes have a connectivity problem with NESE; running this test:
Results in:
@dystewart the gpu nodes (wrk-10 and wrk-11) are missing the
It looks like the problem is that the
And indeed, on wrk-3 we see:
And on wrk-10:
We configure the bond0 interface to span interfaces named "nic1" and "nic2", and we rely on udev rules to assign these names to the appropriate interfaces. There were no such rules for the GPU nodes (wrk-{10,11}) on the test cluster. This commit adds the necessary udev rules. Part of: nerc-project/operations#170
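For reference, a rule of the sort this commit adds might look like the following. The MAC addresses are placeholders; the real rules match each node's actual hardware addresses:

```shell
# Pin predictable names "nic1"/"nic2" to specific interfaces by MAC address,
# so the bond0 configuration can find its member devices at boot.
cat > /etc/udev/rules.d/70-nic-names.rules <<'EOF'
SUBSYSTEM=="net", ACTION=="add", ATTR{address}=="aa:bb:cc:dd:ee:01", NAME="nic1"
SUBSYSTEM=="net", ACTION=="add", ATTR{address}=="aa:bb:cc:dd:ee:02", NAME="nic2"
EOF
```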
Our network configuration creates interface ... This means that the higher-level configuration, which sets up both the ... I've just pushed OCP-on-NERC/nerc-ocp-config#266, which should take care of the device names.
(Ignore that last, now-deleted, message; I goofed.)
@jtriley on wrk-10 we have a link on
On the nics, we see:
But I'm unable to reach the default gateway:
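The checks above can be sketched from the node like this (the interface name and gateway lookup are generic; adjust to the actual storage VLAN interface):

```shell
# Carrier state on the bond members and the bond itself.
ip -br link show
# Is the storage address actually assigned to bond0?
ip -br addr show bond0
# Look up the default gateway and test reachability.
GW=$(ip route | awk '/^default/ {print $3; exit}')
ping -c 3 -W 2 "$GW"
```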
This should be resolved now. @aabaris found an issue where one of the data ports was getting traffic on the wrong set of vlans: network ops found that both of the GPU host's data ports on the switch were misconfigured, with one link getting the ocp-test vlans and the other getting openstack vlans. Fixing this and rebooting the two GPU hosts appears to have fixed the problem. The machineconfigpool has successfully updated, and both hosts now have the storage network device.
I've confirmed that I can successfully deploy pods with PVCs on wrk-10.
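The verification can be repeated with something like the following (pod and PVC names are placeholders for whatever test resources you deployed):

```shell
# NODE column should show wrk-10.
oc get pod test-pod -o wide
# STATUS should be Bound.
oc get pvc test-pvc
# Prove the volume is actually writable from the pod.
oc exec test-pod -- sh -c 'echo ok > /data/probe && cat /data/probe'
```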
The CephCluster resource on the test cluster is unhealthy.
kubectl -n openshift-storage get cephcluster
shows:
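To pull the health detail behind that status, something like this should work (the resource name is from this issue; the jsonpath fields follow the rook CephCluster CRD and may vary by version):

```shell
# Print the ceph health state and the operator's status message.
oc -n openshift-storage get cephcluster ocs-external-storagecluster-cephcluster \
  -o jsonpath='{.status.ceph.health}{"\n"}{.status.message}{"\n"}'
```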