-
First of all: this isn't urgent. Any potential data loss at this point is tolerable. I have many copies of personal things (e.g. family photos). Everything that's at risk, if lost, really just means I'll have to re-rip a lot of DVDs. Not fun, but not the end of the world.

Ok! A couple of years ago, I got a Kubernetes cluster up with 3 worker nodes and a bunch of disks, and I installed Rook/Ceph. Good so far. Used it for a while, pretty cool stuff. Then 2 OSDs went down at the same time on 1 worker node. I ordered new disks and made sure I had backups of everything important. It's a good thing I did, because before the new disks arrived, a different worker node's power supply died.

State so far: 1 worker node completely down, 1 worker node up but with only 3/5 OSDs up, and 1 worker node fully up (6/6 OSDs up).

I figured I should start with the down worker node. I got a new power supply, but it still wouldn't turn on. I could go on, but it's just computer-building details. The important thing is that I kept buying different replacement parts while experiencing failure after failure, and I grew very stressed about the whole project. I abandoned it and turned everything off.

Two years pass.

At this point, my goal is to get it working enough to make a backup, shut everything back down, then repair or replace the broken node, then start over fresh with a restore from backup. (And then I'll have backups going forward!) I created a new 2-node Kubernetes cluster and followed the directions for running an existing Ceph cluster in a new Kubernetes cluster, and that went great! But I still have 2 OSDs down. I tried replacing one of them, but the Operator logs keep repeating this message, twice every 30 seconds:
As I said at the top, it's not the end of the world if I have to give up and start over completely. That said, in theory I have one worker node up with 6/6 OSDs up and happy, so, again in theory, the data does exist, even if it's basically RAID-0 at this point. Any ideas for how I can get Ceph to accept that the down OSDs are never coming back, and that these new empty disks are their replacements? Or, if that's a lost cause, any ideas for how I can focus on the one node which should (again, in theory) hold one complete copy of everything, and whether there's a way for me to access and back up what's there?

Here's ceph status:
The following commands hang indefinitely with no output:
Is this a lost cause, or do you have some ideas? Either way, thank you so much for reading. Definitely please do point out any of the very stupid decisions I made, as I'm sure there were more than a few.
-
That's quite an interesting journey! Some questions for clarity...
-
Great to hear that helped. Someday if you're at KubeCon, do stop by the Rook booth and say hi. :)
Agreed, upgrading at the same time as disaster recovery isn't generally expected; I was just curious about the versions in case follow-up was still needed.
-
I said I'd try to restore health to the existing cluster before starting over and restoring from backup, as an exercise. As suggested, I purged all the no-longer-in-use OSDs. Then, over the next week or so, all the remaining recovery finished on its own and the cluster came back to health. Anticlimactic, but in the good way!
Next step I'd suggest is to reduce the pool replication to 2 and see if you can get healthy PGs. You can update the CephBlockPool CR with the new replication size. Since the ceph status is HEALTH_WARN, Rook should be able to reconcile the pool changes successfully.
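For illustration, a minimal sketch of what that CR change might look like, assuming the pool is named `replicapool` and lives in the `rook-ceph` namespace (both names are assumptions; use whatever `kubectl -n rook-ceph get cephblockpool` shows for your cluster):

```yaml
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: replicapool      # assumed pool name; substitute your actual CephBlockPool
  namespace: rook-ceph   # assumed operator namespace
spec:
  failureDomain: host
  replicated:
    size: 2              # reduced from 3 so PGs can become healthy with only two hosts up
```

Applying it with `kubectl apply -f pool.yaml` (or editing the existing CR in place) should let the operator reconcile the smaller replication size.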
Then you could purge the OSDs on the third host, since you don't expect to bring that host back up; a sketch of that step follows.
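A minimal sketch of the purge step, assuming the Rook toolbox deployment is running and the dead OSDs have IDs 7 and 8 (hypothetical IDs; take the real ones from `ceph osd tree`):

```bash
# Open a shell in the Rook toolbox pod (assumes the rook-ceph-tools deployment exists)
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- bash

# Confirm which OSDs are down/out before touching anything
ceph osd tree

# Mark the dead OSDs out, then purge them (removes them from the CRUSH map and auth)
ceph osd out osd.7 osd.8
ceph osd purge 7 --yes-i-really-mean-it
ceph osd purge 8 --yes-i-really-mean-it

# Back on the Kubernetes side, remove the corresponding OSD deployments so the
# operator doesn't keep trying to restart them (deployment names are assumed to
# follow the usual rook-ceph-osd-<id> pattern)
kubectl -n rook-ceph delete deployment rook-ceph-osd-7 rook-ceph-osd-8
```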
Also, what versions of Rook and Ceph are you running? The latest (or near-latest)?