
OSD stuck after heartbeat failure #14200

Open
akash123-eng opened this issue May 14, 2024 · 6 comments

@akash123-eng

Hi,

We are using rook-ceph with operator 1.10.8 and ceph 17.2.5.
Yesterday one of the OSDs had a heartbeat failure and was marked down by the monitor.
But the strange thing was that the pod for that OSD was not restarted, even though a restart is the expected behavior.
In the logs we can see the error "set_numa_affinity unable to identify public interface".
We wanted to know what the root cause of this might be, and how to fix it to avoid a recurrence of the issue.

Environment:

  • OS (e.g. from /etc/os-release): centos 7.9
  • Rook version (use rook version inside of a Rook Pod): rook operator 1.10.8
  • Storage backend version (e.g. for ceph do ceph -v): ceph 17.2.5
  • Kubernetes version (use kubectl version): 1.25.9
  • Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): RKE
  • Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox): ceph status was OK, all OSDs were up except the one above, and all PGs were active+clean
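
Not an answer, but a possible starting point for the set_numa_affinity warning mentioned above: compare the interfaces the OSD registered with the cluster against the NUMA-affinity setting. A minimal sketch, assuming the standard Rook toolbox deployment and a placeholder OSD id (both are assumptions, adjust to your cluster):

```python
# Sketch only: the namespace, toolbox deployment name, and OSD id are assumptions.
import json
import subprocess

NAMESPACE = "rook-ceph"             # assumed Rook namespace
TOOLBOX = "deploy/rook-ceph-tools"  # assumed toolbox deployment
OSD_ID = "0"                        # placeholder: the affected OSD

def ceph(*args):
    """Run a ceph CLI command inside the Rook toolbox and return its stdout."""
    cmd = ["kubectl", "-n", NAMESPACE, "exec", TOOLBOX, "--", "ceph", *args]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

# The metadata an OSD registers with the monitors includes the public (front)
# and cluster (back) interfaces it detected on its host.
metadata = json.loads(ceph("osd", "metadata", OSD_ID))
print("front_iface:", metadata.get("front_iface"))
print("back_iface:", metadata.get("back_iface"))

# Whether automatic NUMA pinning is enabled for this OSD.
print(ceph("config", "get", f"osd.{OSD_ID}", "osd_numa_auto_affinity"))
```

If the registered front interface is empty or does not match the host's public network, that would be consistent with the OSD being unable to identify its public interface for NUMA pinning.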

travisn commented May 14, 2024

Was restarting the OSD successful?

If the OSD process exits, the pod will restart. So if the OSD did not restart after that error, the ceph-osd process must not have exited.
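
A quick way to verify that from the Kubernetes side is sketched below; the pod name is a placeholder and the OSD container name is assumed to be "osd":

```python
# Sketch only: namespace, pod name, and container name are assumptions.
import subprocess

NAMESPACE = "rook-ceph"
POD = "rook-ceph-osd-0-xxxxxxxxxx-xxxxx"  # placeholder: the stuck OSD pod

def kubectl(*args):
    return subprocess.run(["kubectl", "-n", NAMESPACE, *args],
                          capture_output=True, text=True, check=True).stdout

# If ceph-osd had exited, Kubernetes would have restarted the container and
# this count would be greater than zero.
print(kubectl("get", "pod", POD,
              "-o", "jsonpath={.status.containerStatuses[0].restartCount}"))

# What PID 1 of the (assumed) 'osd' container is running; for a live OSD this
# should be the ceph-osd process.
print(kubectl("exec", POD, "-c", "osd", "--", "cat", "/proc/1/cmdline"))
```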

@akash123-eng

@travisn Yes, after restarting the OSD pod it was showing as up.
Before that it was showing as out in ceph status, but the pod was still running, just stuck.
In the OSD pod we can see the logs below:
"
handle_connect_message_2 accept replacing existing(lossy) channel (new one lossy = 1)
no message from osd.x
osd not healthy; waiting to boot
osd is healthy faluse - only 0/12 up peers(less than 33%)
set_numa_affinity unable to identify public interface"

In between there were logs such as:
"feature acting upacting
transitioning to stray"

Lastly it was showing:
/var/lib/ceph/osd/osd-x/block close
fbmap shutdown

but the OSD pod wasn't restarted.
@Rakshith-R can you please help with the above to find the root cause?
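
For what it's worth, the "osd not healthy; waiting to boot" and "only 0/12 up peers" lines read as if the daemon was still alive but never finished booting. One hedged way to confirm that while the OSD is stuck is to ask the daemon itself over its admin socket; in the sketch below the pod name, container name, and OSD id are placeholders:

```python
# Sketch only: namespace, pod name, container name, and OSD id are assumptions.
import subprocess

NAMESPACE = "rook-ceph"
POD = "rook-ceph-osd-12-xxxxxxxxxx-xxxxx"  # placeholder: the stuck OSD pod
OSD_ID = "12"                              # placeholder

# 'ceph daemon osd.N status' talks to the daemon's local admin socket and
# reports its own view of its state ("booting", "active", ...) and the osdmap
# epochs it has seen.
out = subprocess.run(
    ["kubectl", "-n", NAMESPACE, "exec", POD, "-c", "osd", "--",
     "ceph", "daemon", f"osd.{OSD_ID}", "status"],
    capture_output=True, text=True, check=True)
print(out.stdout)
```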


travisn commented May 15, 2024

@akash123-eng Was there any active client IO in the cluster? If the OSD's device was closed, the OSD may not notice until it tries to commit IO. At that point it should fail and restart.
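
If it happens again, one hedged way to test that is to push a short burst of client IO at a pool whose PGs map to the suspect OSD, so a closed device actually has to be touched. A minimal sketch, with the pool name and toolbox deployment as placeholders:

```python
# Sketch only: namespace, toolbox deployment, and pool name are assumptions.
import subprocess

NAMESPACE = "rook-ceph"
TOOLBOX = "deploy/rook-ceph-tools"
POOL = "testpool"  # placeholder: a pool whose PGs include the suspect OSD

# rados bench writes objects for 10 seconds and cleans them up afterwards; if
# the OSD's block device really is closed, the commit should fail, which is
# what would make the ceph-osd process exit and the pod restart.
subprocess.run(
    ["kubectl", "-n", NAMESPACE, "exec", TOOLBOX, "--",
     "rados", "-p", POOL, "bench", "10", "write"],
    check=True)
```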

@akash123-eng

@travisn Yes, there was active client IO in the cluster.
The other OSDs were working fine.


travisn commented May 15, 2024


OK, then I'm not sure. Has this just happened once, or has it happened multiple times?

@akash123-eng

@travisn Yes, it has only happened once so far, but we wanted to get to the root cause so we can fix it.
