
OSD stuck after heartbeat failure #14200

Open
akash123-eng opened this issue May 14, 2024 · 6 comments

@akash123-eng

Hi,

We are using rook-ceph with operator 1.10.8 and ceph 17.2.5.
Yesterday one of the OSDs had a heartbeat failure and was marked down by the monitor.
But the strange thing was that the pod for that OSD was not restarted, even though a restart is the expected behavior.
In the logs we can see the error "set_numa_affinity unable to identify public interface".
We wanted to know what the root cause of this might be, and how to fix it to avoid a recurrence of the issue.

Environment:

  • OS (e.g. from /etc/os-release): centos 7.9
  • Rook version (use rook version inside of a Rook Pod): rook operator 1.10.8
  • Storage backend version (e.g. for ceph do ceph -v): ceph 17.2.5
  • Kubernetes version (use kubectl version): 1.25.9
  • Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): RKE
  • Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox): ceph status was OK, all OSDs were up except the one above, and all PGs were active+clean
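
Not an answer, but a possible starting point for the set_numa_affinity warning mentioned above: compare the interfaces the OSD registered with the cluster against the NUMA-affinity setting. A minimal sketch, assuming the standard Rook toolbox deployment and a placeholder OSD id (both are assumptions, adjust to your cluster):

```python
# Sketch only: the namespace, toolbox deployment name, and OSD id are assumptions.
import json
import subprocess

NAMESPACE = "rook-ceph"             # assumed Rook namespace
TOOLBOX = "deploy/rook-ceph-tools"  # assumed toolbox deployment
OSD_ID = "0"                        # placeholder: the affected OSD

def ceph(*args):
    """Run a ceph CLI command inside the Rook toolbox and return its stdout."""
    cmd = ["kubectl", "-n", NAMESPACE, "exec", TOOLBOX, "--", "ceph", *args]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

# The metadata an OSD registers with the monitors includes the public (front)
# and cluster (back) interfaces it detected on its host.
metadata = json.loads(ceph("osd", "metadata", OSD_ID))
print("front_iface:", metadata.get("front_iface"))
print("back_iface:", metadata.get("back_iface"))

# Whether automatic NUMA pinning is enabled for this OSD.
print(ceph("config", "get", f"osd.{OSD_ID}", "osd_numa_auto_affinity"))
```

If the registered front interface is empty or does not match the host's public network, that would be consistent with the OSD being unable to identify its public interface for NUMA pinning.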

travisn commented May 14, 2024

Was restarting the OSD successful?

If the OSD process exits, the pod will restart. So if the OSD did not restart after that error, the ceph-osd process must not have exited.
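
A quick way to verify that from the Kubernetes side is sketched below; the pod name is a placeholder and the OSD container name is assumed to be "osd":

```python
# Sketch only: namespace, pod name, and container name are assumptions.
import subprocess

NAMESPACE = "rook-ceph"
POD = "rook-ceph-osd-0-xxxxxxxxxx-xxxxx"  # placeholder: the stuck OSD pod

def kubectl(*args):
    return subprocess.run(["kubectl", "-n", NAMESPACE, *args],
                          capture_output=True, text=True, check=True).stdout

# If ceph-osd had exited, Kubernetes would have restarted the container and
# this count would be greater than zero.
print(kubectl("get", "pod", POD,
              "-o", "jsonpath={.status.containerStatuses[0].restartCount}"))

# What PID 1 of the (assumed) 'osd' container is running; for a live OSD this
# should be the ceph-osd process.
print(kubectl("exec", POD, "-c", "osd", "--", "cat", "/proc/1/cmdline"))
```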

@akash123-eng

@travisn Yes, after restarting the OSD pod it was showing as up.
Before that it was showing as out in ceph status, but the pod was still running, just stuck.
In the OSD pod we can see the logs below:
"
handle_connect_message_2 accept replacing existing(lossy) channel (new one lossy = 1)
no message from osd.x
osd not healthy; waiting to boot
osd is healthy faluse - only 0/12 up peers(less than 33%)
set_numa_affinity unable to identify public interface"

In between there were logs such as:
"feature acting upacting
transitioning to stray"

Lastly it was showing:
/var/lib/ceph/osd/osd-x/block close
fbmap shutdown

but the OSD pod wasn't restarted.
@Rakshith-R can you please help with the above to find the root cause?
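
For what it's worth, the "osd not healthy; waiting to boot" and "only 0/12 up peers" lines read as if the daemon was still alive but never finished booting. One hedged way to confirm that while the OSD is stuck is to ask the daemon itself over its admin socket; in the sketch below the pod name, container name, and OSD id are placeholders:

```python
# Sketch only: namespace, pod name, container name, and OSD id are assumptions.
import subprocess

NAMESPACE = "rook-ceph"
POD = "rook-ceph-osd-12-xxxxxxxxxx-xxxxx"  # placeholder: the stuck OSD pod
OSD_ID = "12"                              # placeholder

# 'ceph daemon osd.N status' talks to the daemon's local admin socket and
# reports its own view of its state ("booting", "active", ...) and the osdmap
# epochs it has seen.
out = subprocess.run(
    ["kubectl", "-n", NAMESPACE, "exec", POD, "-c", "osd", "--",
     "ceph", "daemon", f"osd.{OSD_ID}", "status"],
    capture_output=True, text=True, check=True)
print(out.stdout)
```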


travisn commented May 15, 2024

@akash123-eng Was there any active client IO in the cluster? If the OSD's device was closed, the OSD may not notice until it tries to commit IO. At that point it should fail and restart.
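
If it happens again, one hedged way to test that is to push a short burst of client IO at a pool whose PGs map to the suspect OSD, so a closed device actually has to be touched. A minimal sketch, with the pool name and toolbox deployment as placeholders:

```python
# Sketch only: namespace, toolbox deployment, and pool name are assumptions.
import subprocess

NAMESPACE = "rook-ceph"
TOOLBOX = "deploy/rook-ceph-tools"
POOL = "testpool"  # placeholder: a pool whose PGs include the suspect OSD

# rados bench writes objects for 10 seconds and cleans them up afterwards; if
# the OSD's block device really is closed, the commit should fail, which is
# what would make the ceph-osd process exit and the pod restart.
subprocess.run(
    ["kubectl", "-n", NAMESPACE, "exec", TOOLBOX, "--",
     "rados", "-p", POOL, "bench", "10", "write"],
    check=True)
```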

@akash123-eng

@travisn Yes, there was active client IO in the cluster.
The other OSDs were working fine.


travisn commented May 15, 2024


OK, then I'm not sure. Has this just happened once, or has it happened multiple times?

@akash123-eng

@travisn Yes, it has only happened once so far, but we wanted to get to the root cause so we can fix it.
