[rook-ceph] cannot connect OSDs after disaster-recovery #7557
-
https://github.com/rook/rook/blob/master/Documentation/ceph-disaster-recovery.md

I am having a very hard day because of this. I have been trying to recover an old rook-ceph cluster from a new cluster. I thought it would be easy because I had prepared properly beforehand, but it has caused me a lot of pain. I am a beginner with Ceph, so here are some clues to go on. My cluster consists of 3 nodes with 11 OSDs per node. `ceph -s`:
I have run the disaster recovery more than 3 times, but every attempt ends with the same result: a SLOW_OPS warning on the mon. In fact, nothing is actually in progress; the mon's uptime and the elapsed time of the SLOW_OPS are exactly the same. In addition, the IP in the kubernetes logs of the osd.1 pod:
It seems that the mon and the OSDs are not connected, and this seems related to the fact that the pods' actual IPs and the IPs recorded in the cluster are not the same. How can I solve this?
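One way to confirm that suspicion is to compare the address each OSD registered with the mons against the pod IP Kubernetes currently assigns it. A hedged sketch (the `ceph`/`kubectl` commands are standard rook-ceph tooling, but the label selector and the exact `osd dump` line format may differ per Rook/Ceph version; the sample line below is illustrative):

```shell
# Address the mons believe osd.1 lives at (run inside the rook-ceph-tools pod):
#   ceph osd dump | grep '^osd.1 '
# Pod IP as Kubernetes currently sees it (label name per Rook defaults):
#   kubectl -n rook-ceph get pod -l ceph-osd-id=1 -o wide
#
# A sample `ceph osd dump` line, to illustrate pulling the registered IP out:
sample='osd.1 up in weight 1 up_from 100 up_thru 120 down_at 99 last_clean_interval [90,98) 10.244.1.15:6801/12345 10.244.1.15:6802/12345 exists,up abcdef'
echo "$sample" | grep -oE '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' | head -n1   # -> 10.244.1.15
```

If the two addresses disagree, that matches the symptom described above: the OSD is registered under a stale address and the mons cannot reach it.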
Replies: 2 comments 3 replies
-
`osd: 33 osds: 33 up` -> After starting, the OSDs show as `up` for a while, but soon change to `down`.
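An OSD that briefly reports `up` and then flips to `down` is typically one that managed to register with the mons but then stopped heartbeating, which fits an address mismatch. A small sketch for watching the up count (the `ceph osd stat` one-line format is an assumption and may vary slightly between Ceph releases):

```shell
# Poll the summary inside the rook-ceph-tools pod:
#   watch -n 5 'ceph osd stat'
# The up count can be extracted from the one-line summary, e.g. from this sample:
stat='33 osds: 33 up (since 2m), 33 in (since 5m); epoch: e123'
echo "$stat" | sed -E 's/.*: ([0-9]+) up.*/\1/'   # -> 33
```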
-
It's good to see that the mon is healthy. The OSD should see the pod IP and report it to the mons when it starts, so I'm not sure why the OSDs would be stuck with the old IP address.

Which section of the disaster recovery guide did you go through? This one? This is certainly a difficult process, unfortunately.
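One step of that guide worth double-checking is the mon endpoints: after recovery, the address the daemons use to reach the mons comes from the `rook-ceph-mon-endpoints` ConfigMap, and a stale entry there would match these symptoms. A hedged sketch (ConfigMap name, `data` key, and toolbox deployment name are Rook defaults; the sample value is illustrative):

```shell
# What the Rook operator hands to the daemons:
#   kubectl -n rook-ceph get configmap rook-ceph-mon-endpoints -o jsonpath='{.data.data}'
# What the mons themselves advertise:
#   kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph mon dump
#
# The ConfigMap's data key looks like "a=IP:PORT,b=IP:PORT,..."; the mon IPs
# can be listed from a sample value like so:
eps='a=10.96.12.34:6789,b=10.96.56.78:6789,c=10.96.90.12:6789'
echo "$eps" | tr ',' '\n' | cut -d= -f2 | cut -d: -f1
```

If the two listings disagree, updating the ConfigMap (and restarting the daemons) so they converge is the direction the disaster-recovery guide's monmap steps point at.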