osd stuck after heartbeat failure #14200
Comments
Was restarting the OSD successful? If the OSD process exits, the pod will restart. So if the OSD did not restart after that error, the ceph-osd process must not have exited.
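For reference, a quick way to confirm whether the ceph-osd container actually exited is to look at the pod restart counts. This is a minimal sketch assuming the default rook-ceph namespace and OSD pod labels; adjust names to your cluster:

```sh
# List OSD pods with their container restart counts; a non-zero count means
# the ceph-osd process exited at some point and kubelet restarted it.
kubectl -n rook-ceph get pods -l app=rook-ceph-osd \
  -o custom-columns=NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount

# Recent events for one OSD pod (the pod name here is a placeholder).
kubectl -n rook-ceph describe pod rook-ceph-osd-3-<hash>
```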
@travisn Yes, after restarting the OSD pod it was showing up. In between there were logs. Lastly it was showing, but the OSD pod wasn't restarted.
@akash123-eng Was there any active client IO in the cluster? If the OSD's device was closed, the OSD may not notice until it tries to commit the IO. At that point it should fail and restart.
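If you want to rule out a device-level problem, this is a rough sketch of the checks I would run (the OSD id below is a placeholder):

```sh
# On the node hosting the OSD: look for block-device errors in the kernel log
# around the time of the heartbeat failure.
dmesg -T | grep -iE 'i/o error|blk_update_request|reset'

# From the Rook toolbox: how the cluster currently sees the OSD.
ceph osd tree down
ceph osd find 3      # "3" is a placeholder OSD id
```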
@travisn Yes, there was active client IO in the cluster.
OK, then I'm not sure. Did this happen just once, or has it happened multiple times?
@travisn It has happened only once so far, but we wanted to get to the bottom of its root cause.
Hi,
We are using Rook-Ceph with operator 1.10.8 and Ceph 17.2.5.
Yesterday one of the OSDs had a heartbeat failure and was marked down by the monitor.
The strange thing was that the pod for that OSD was not restarted, which would have been the expected behavior.
In the logs we can see the error "set_numa_affinity unable to identify public interface".
We would like to know what the root cause of this might be, and how to fix it to avoid a recurrence of the issue.
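For what it's worth, the "set_numa_affinity unable to identify public interface" message comes from the OSD's NUMA auto-detection of the public network interface and is, as far as I know, usually harmless on its own. If you want to silence it, one possible mitigation (a sketch only, not a confirmed fix for the heartbeat failure) is to disable the auto-affinity option, assuming your Ceph version supports it:

```sh
# From the Rook toolbox. osd_numa_auto_affinity drives the interface
# detection that logs "unable to identify public interface".
ceph config set osd osd_numa_auto_affinity false

# Heartbeat settings that may be worth reviewing while chasing the root cause.
ceph config get osd osd_heartbeat_grace
ceph config get osd osd_heartbeat_interval
```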
Environment:
- Rook version (`rook version` inside of a Rook Pod): rook operator 1.10.8
- Ceph version (`ceph -v`): ceph 17.2.5
- Kubernetes version (`kubectl version`): 1.25.9
- Cluster health (`ceph health` in the Rook Ceph toolbox): ceph status was OK; all OSDs were up except the above OSD, and all PGs were active+clean