jewel: tools: add a tool to rebuild mon store from OSD #11126
[root@lab8106 mon]# /usr/bin/ceph-mon -f --cluster ceph --id lab8106 --setuser ceph --setgroup ceph
starting mon.lab8106 rank 0 at 192.168.8.106:6789/0 mon_data /var/lib/ceph/mon/ceph-lab8106 fsid fa7ec1a1-662a-4ba3-b478-7cb570482b62
/root/rpmbuild/BUILD/ceph-11.0.0-2460-g22053d0/src/mon/PGMonitor.cc: In function 'void PGMonitor::check_osd_map(epoch_t)' thread 7f1f995f2500 time 2016-09-20 15:58:28.203619
/root/rpmbuild/BUILD/ceph-11.0.0-2460-g22053d0/src/mon/PGMonitor.cc: 911: FAILED assert(err == 0)
ceph version v11.0.0-2460-g22053d0 (22053d057fb03e9c932da2771d7c90556567d1e4)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x55bb3cb61d55]
2: (PGMonitor::check_osd_map(unsigned int)+0x150d) [0x55bb3cab811d]
3: (PGMonitor::on_active()+0xf6) [0x55bb3cab8566]
4: (PaxosService::_active()+0x195) [0x55bb3c9d5655]
5: (PaxosService::election_finished()+0x7a) [0x55bb3c9d5cda]
6: (Monitor::win_election(unsigned int, std::set<int, std::less<int>, std::allocator<int> >&, unsigned long, MonCommand const*, int, std::set<int, std::less<int>, std::allocator<int> > const*)+0x246) [0x55bb3c993a16]
7: (Monitor::win_standalone_election()+0x17f) [0x55bb3c993e5f]
8: (Monitor::bootstrap()+0xa1b) [0x55bb3c9949cb]
9: (Monitor::init()+0xea) [0x55bb3c994c2a]
10: (main()+0x2628) [0x55bb3c8e4958]
11: (__libc_start_main()+0xf5) [0x7f1f94eccb15]
12: (()+0x298445) [0x55bb3c958445]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
*** Caught signal (Aborted) **
in thread 7f1f995f2500 thread_name:ceph-mon
ceph version v11.0.0-2460-g22053d0 (22053d057fb03e9c932da2771d7c90556567d1e4)
1: (()+0x68fdea) [0x55bb3cd4fdea]
2: (()+0xf100) [0x7f1f95a9b100]
3: (gsignal()+0x37) [0x7f1f94ee05f7]
4: (abort()+0x148) [0x7f1f94ee1ce8]
5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x267) [0x55bb3cb61f37]
6: (PGMonitor::check_osd_map(unsigned int)+0x150d) [0x55bb3cab811d]
7: (PGMonitor::on_active()+0xf6) [0x55bb3cab8566]
8: (PaxosService::_active()+0x195) [0x55bb3c9d5655]
9: (PaxosService::election_finished()+0x7a) [0x55bb3c9d5cda]
10: (Monitor::win_election(unsigned int, std::set<int, std::less<int>, std::allocator<int> >&, unsigned long, MonCommand const*, int, std::set<int, std::less<int>, std::allocator<int> > const*)+0x246) [0x55bb3c993a16]
11: (Monitor::win_standalone_election()+0x17f) [0x55bb3c993e5f]
12: (Monitor::bootstrap()+0xa1b) [0x55bb3c9949cb]
13: (Monitor::init()+0xea) [0x55bb3c994c2a]
14: (main()+0x2628) [0x55bb3c8e4958]
15: (__libc_start_main()+0xf5) [0x7f1f94eccb15]
16: (()+0x298445) [0x55bb3c958445]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
I tested your tool and found a problem that I want to report. I have a cluster with 1 mon and 2 OSDs, and
when I clean everything, the first rebuild works; but when I rebuild once again on top of the rebuilt environment, the mon will not start. |
@zphj1987 thanks for the testing, i will try to repeat your steps tomorrow. |
@tchaikov 2016-09-22 16:49:59.900190 7fab37d8b700 5 mon.node173@0(leader).paxos(paxos active c 1..3) is_readable = 1 - now=2016-09-22 16:49:59.900190 lease_expire=2016-09-22 16:50:04.900174 has v0 lc 3 ceph version 10.2.2.8 (4cf7ed7423032cffc3768f1a091251d3733b26d0) |
@renhwztetecs i am not able to reproduce your issue with the following steps. i repeated them 3 times, no luck.
could you please paste your script so i can repeat it? thanks! |
@tchaikov cluster:
steps info
|
@renhwztetecs i don't see anything obvious other than
can you file a bug on tracker? maybe we can continue the investigation there instead of reusing this PR? and could you upload your restored store.db and the output of
|
yeah! |
fix posted at #11276 |
ae0277a to 651b906
@@ -215,7 +215,10 @@ int update_osdmap(ObjectStore& fs, OSDSuperblock& sb, MonitorDBStore& ms)

   // trim stale maps
   unsigned ntrimmed = 0;
   for (auto e = first_committed; e < sb.oldest_map; e++) {
This commit says "doc: ...", but then it's also touching rebuild_mondb.cc? Should that change be split out?
@ktdreyer, yes. the code change spilled out into the doc change in the original commit. let me fix it.
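The hunk under review trims osdmap epochs older than the OSD superblock's `oldest_map`. A minimal sketch of that loop follows, assuming a plain `std::map` keyed by epoch stands in for the monitor store; the real code goes through `MonitorDBStore`, which is not reproduced here. `trim_stale_maps` and the `Store` alias are hypothetical names used only for illustration.

```cpp
#include <cstdint>
#include <map>
#include <string>

using epoch_t = uint32_t;
// Hypothetical stand-in for the monitor store: epoch -> encoded osdmap.
using Store = std::map<epoch_t, std::string>;

// Erase every epoch in [first_committed, oldest_map) and return how many
// entries were actually removed. Epochs at or after oldest_map are kept.
unsigned trim_stale_maps(Store& store, epoch_t first_committed, epoch_t oldest_map) {
  unsigned ntrimmed = 0;
  for (auto e = first_committed; e < oldest_map; e++) {
    // std::map::erase(key) returns the number of elements removed (0 or 1).
    ntrimmed += static_cast<unsigned>(store.erase(e));
  }
  return ntrimmed;
}
```

Counting only the entries that were really erased keeps `ntrimmed` accurate when some epochs in the range are already missing from the store.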
651b906 to e9f323d
No core dump, and the test passes. 👍 |
@dachary sure, will do! |
e9f323d to 787577a
i will update the cross references in this changeset after the commits (#11276) are merged in master |
…m OSD Reviewed-by: Loic Dachary <ldachary@redhat.com>
so ceph-objectstore-tool is able to use it when rebuilding monitor db. Fixes: http://tracker.ceph.com/issues/17179 Signed-off-by: Kefu Chai <kchai@redhat.com> (cherry picked from commit 19ef4f1)
Fixes: http://tracker.ceph.com/issues/17179 Signed-off-by: Kefu Chai <kchai@redhat.com> Conflicts: src/tools/CMakeLists.txt: this file was added in master, so update src/CMakeLists.txt instead src/tools/Makefile-server.am: jewel is still using autotools, so update this file also. src/tools/rebuild_mondb.cc: move the code spilled into doc/rados/troubleshooting/troubleshooting-mon.rst by accident back to this commit. (cherry picked from commit 24faea7)
Fixes: http://tracker.ceph.com/issues/17179 Signed-off-by: Kefu Chai <kchai@redhat.com> (cherry picked from commit d909fa0)
document the process to recover from leveldb corruption. Fixes: http://tracker.ceph.com/issues/17179 Signed-off-by: Kefu Chai <kchai@redhat.com> (cherry picked from commit 79a9f29) Conflicts: src/tools/rebuild_mondb.cc: remove the code change in this file from this commit; the removed code is added in another commit.
In general we return negative codes for error cases, so there is no need to perform the cast here. Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn> (cherry picked from commit 6a1c01d)
As follows:
[ 72%] Building CXX object src/tools/CMakeFiles/ceph-objectstore-tool.dir/RadosDump.cc.o
/home/jenkins-build/build/workspace/ceph-pull-requests/src/tools/rebuild_mondb.cc: In function ‘int update_mon_db(ObjectStore&, OSDSuperblock&, const string&, const string&)’:
/home/jenkins-build/build/workspace/ceph-pull-requests/src/tools/rebuild_mondb.cc:289:22: warning: ‘crc’ may be used uninitialized in this function [-Wmaybe-uninitialized]
if (have_crc && osdmap.get_crc() != crc) {
^
/home/jenkins-build/build/workspace/ceph-pull-requests/src/tools/rebuild_mondb.cc:238:14: note: ‘crc’ was declared here
uint32_t crc;
Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn> (cherry picked from commit f16a314)
…e.db we should rebuild pgmap_meta table from the collected osdmaps Fixes: http://tracker.ceph.com/issues/17400 Signed-off-by: Kefu Chai <kchai@redhat.com> (cherry picked from commit cdfa7a6)
we take it as an error if no caps is granted to an entity in the specified keyring file when rebuilding the monitor db. Signed-off-by: Kefu Chai <kchai@redhat.com> (cherry picked from commit b4bd400)
to make sure the recovered monitor store is ready for use. Signed-off-by: Kefu Chai <kchai@redhat.com> (cherry picked from commit af8e211)
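The `-Wmaybe-uninitialized` warning in the commit list above comes from pairing a `have_crc` flag with a `crc` variable that is only assigned on some paths. One way to make that state explicit is sketched below with `std::optional`, which assumes a C++17 compiler. This is an illustration only, not the fix that was cherry-picked, and `crc_matches` is a hypothetical helper.

```cpp
#include <cstdint>
#include <optional>

// Returns true when no CRC was recorded (nothing to verify) or when the
// recorded CRC equals the one computed from the decoded map.
bool crc_matches(std::optional<uint32_t> recorded_crc, uint32_t computed_crc) {
  if (!recorded_crc) {
    return true;  // no CRC available, so the check is skipped
  }
  return *recorded_crc == computed_crc;
}
```

Because the value and its presence travel together, there is no separate flag to fall out of sync with the variable, and the compiler has no uninitialized read to warn about.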
787577a to 25a35d4
changelog
|
jenkins test this please |
…m OSD Reviewed-by: Loic Dachary <ldachary@redhat.com>
It passed the rados (http://tracker.ceph.com/issues/17487#note-19) suite except for two jobs that are, I believe, unrelated. It also passed the upgrade/jewel-x and upgrade/hammer-x (http://tracker.ceph.com/issues/17487#note-22) suites. |
@tchaikov I want to ask why fsmap cannot be restored. |
because i don't understand cephfs enough to do this, and that was not my first priority by then. but please feel free to send a PR to enable this feature. i will be more than happy to test and review it. |
@tchaikov, thank you very much for your answer. I have another question: why can't pgmap be fully restored? You can see the status: |
@Alanwalker3 it would be great if we can move this discussion to ceph-devel. |
@tchaikov How can I join ceph-devel? |
On Tue, Jul 18, 2017 at 2:47 PM, Alanwalker3 ***@***.***> wrote:
@tchaikov <https://github.com/tchaikov> How can i get into ceph-devel.
search http://ceph.com/irc/ for ceph-devel. if you believe it's a bug,
please file a ticket on tracker: http://tracker.ceph.com/projects/ceph with
reproducing steps.
--
Regards
Kefu Chai
|
@tchaikov From the file src/tools/rebuild_mondb.cc, we can see that only auth, monitor, osdmap, and pgmap_pg are updated, and nothing is done for pgmap. Is that the reason we cannot fully restore pgmap? |
@Alanwalker3 yes, we don't restore pgmap. as it will be reconstructed anyway after the cluster is back online.
i don't follow you. could you please rephrase this ? |
@tchaikov I was busy with work, so I could not reply sooner. I agree with you that the pgmap will be reconstructed anyway after the cluster is back online, so there is no need to restore pgmap. |
http://tracker.ceph.com/issues/17292
http://tracker.ceph.com/issues/17603