-
It's expected that the OSD purge will also delete the OSD auth. That said, I also saw recently, while testing OSD purge, that the auth did not get purged as expected and I had to delete it before I could create a new OSD. I believe what happened is that when the OSD was purged, the operator immediately tried to re-create it before I had a chance to wipe the disk, so the same OSD started up again and re-created the auth. If that's the case, the operator should be stopped until the disk is wiped and ready to be re-created.
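As a rough sketch of that workflow, assuming the default `rook-ceph` namespace and operator deployment name:

```sh
# Stop the operator so it cannot re-create the OSD before the disk is wiped.
kubectl -n rook-ceph scale deployment rook-ceph-operator --replicas=0

# ... purge the OSD and wipe the disk here ...

# Once the disk is clean, let the operator re-create the OSD.
kubectl -n rook-ceph scale deployment rook-ceph-operator --replicas=1
```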
-
Hi,

Is the `ceph auth del osd.<ID>` command necessary when replacing an OSD? I ask because in my case the OSD was not recreated and I got the following error in the `rook-ceph-operator` log: […]

After running the `ceph auth del osd.0` command and deleting the `rook-ceph-operator` pod, the OSD was recreated as expected.
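Concretely, the two steps that unblocked it looked like this (the `ceph` command was run from the toolbox pod; the pod label assumes a standard Rook install):

```sh
# Remove the stale auth entry left behind for the old OSD.
ceph auth del osd.0

# Restart the operator by deleting its pod so it reconciles again.
kubectl -n rook-ceph delete pod -l app=rook-ceph-operator
```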
Context:

- The OSD pod was in `CrashLoopBackoff`. This led me to vMotion the disk to a new storage and follow https://rook.io/docs/rook/v1.11/Storage-Configuration/Advanced/ceph-osd-mgmt/#remove-an-osd to remove the OSD, wipe the disk, and add it back to the cluster.
- `osd.0` is located at `/dev/sdb` on node `worker01.k8s.redacted`.
- With `osd.0` down, many PGs were `degraded+undersized`.
- I tried to purge `osd.0` with a job, but it failed, so I had to purge it manually (sketched just after this list).
- After the `ceph osd out osd.0` command, backfilling did not happen, so I had to proceed without waiting for all of the PGs to be `active+clean`. (Perhaps because the PGs were undersized?)
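The manual purge was roughly the following, per the linked doc (run from the toolbox; `0` is my OSD ID):

```sh
# Take the OSD out and purge it; purge should also remove the OSD's
# CRUSH entry and its auth entry in one step.
ceph osd out osd.0
ceph osd purge 0 --yes-i-really-mean-it

# Check whether the auth entry is actually gone (in my case it was not,
# until I ran `ceph auth del osd.0` explicitly).
ceph auth get osd.0
```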
Is this only specific to my situation? Would it be helpful if I open a PR to edit the document?
Thanks.