Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wip rebuild monstore #10933

Merged
merged 4 commits into from Sep 17, 2016
Merged

Wip rebuild monstore #10933

merged 4 commits into from Sep 17, 2016

Conversation

tchaikov
Copy link
Contributor

@@ -33,6 +33,7 @@ install(PROGRAMS

add_executable(ceph-objectstore-tool
ceph_objectstore_tool.cc
rebuild_mondb.cc
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably also needs to go in automake?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do. thought we were ditching off automake =)

@athanatos
Copy link
Contributor

@gregsfortytwo Can you take a look?

Recovery using OSDs
-------------------

But what if all monitors fails at the same time? Since users are encouraged to
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/fails/fail/

@gregsfortytwo
Copy link
Member

Paging @jecluis as well. ;)

Symptoms of store corruption
----------------------------

Ceph monitor stores the `cluster map`_ in a key/value store. If a monitor in a
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might want to say something like "a key/value store such as LevelDB".

I realize it's configurable and we might switch to RocksDB, but I think in the medium term anybody looking at this is going to know that LevelDB got messed up and not seeing that this is a repair tool for that case will confuse them. (Avoiding the name is especially weird since we are including LevelDB error messages below.)

@gregsfortytwo
Copy link
Member

Still reading code, but I don't see any tests for this. I just know it's going to break every time we change storage stuff (like with ceph-mgr or something, for instance) and we aren't going to notice if the nightlies don't tell us.

So a basic scenario where we append a "rebuild" task at the end of one (or several) normal RADOS tests to start with. At least one where we have an OSD we deliberately keep out-of-date like Sam mentioned above. One we'll write when the FS repair stuff is done. Perhaps others but that's all I've got off the top of my head.

const OSDSuperblock& sb,
MonitorDBStore& ms)
{
// stolen from AuthMonitor::prepare_command(), where prefix is "auth add"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It sounds like you actually copied the functions? If so we should definitely pull them out into a shared cc file or something, rather than having two copies that can diverge with different behavior or bugs!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not really, i adapted them.

Copy link
Contributor Author

@tchaikov tchaikov Sep 1, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gregsfortytwo can we leave the refactor for another PR? as i need to restructure monitor carefully to cater for the needs on both sides.

@gregsfortytwo
Copy link
Member

Otherwise looks fine to me (although I only skimmed the pgmap updates).

@tchaikov
Copy link
Contributor Author

tchaikov commented Sep 1, 2016

all comments addressed except for #10933 (comment) and #10933 (comment).

i am working on a ceph-qa-suite task to

  1. nuke all store.db
  2. rebuild one from OSDs
  3. restore it to the first mon, mkfs on other mons
  4. revive them

@tchaikov
Copy link
Contributor Author

tchaikov commented Sep 7, 2016

jenkins, retest this please.

@tchaikov tchaikov force-pushed the wip-rebuild-monstore branch 2 times, most recently from b401aea to 2b5e2d2 Compare September 8, 2016 11:54
@tchaikov
Copy link
Contributor Author

tchaikov commented Sep 8, 2016

changelog

  • rebased against master
  • ceph-monstore-tool: instead of getting the client.admin key and add an almighty keyring, import all keyrings from the --keyring file, to ready the mon.

@tchaikov
Copy link
Contributor Author

tchaikov commented Sep 8, 2016

test added at ceph/ceph-qa-suite#1169

@tchaikov
Copy link
Contributor Author

changelog

  • rebase against master to resolve conflicts.

@athanatos could you take a look again? thanks!

unsigned ntrimmed = 0;
{
auto t = make_shared<MonitorDBStore::Transaction>();
for (auto e = first_committed; e && e < sb.oldest_map; e++) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why e && e < oldest_map? What does first_committed = 0 mean?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if there is not such a k/v pair in store.db, ms.get(prefix, key) will simply return 0. this creates an illusion for ceph-objectstore-tool that there is always an osdmap#0 in store.db even there is nothing in it. and in real world, the osdmap starts at 1. i will accompany the code with some comments.

@tchaikov tchaikov force-pushed the wip-rebuild-monstore branch 2 times, most recently from b7cd123 to 8ad055e Compare September 12, 2016 08:53
@tchaikov
Copy link
Contributor Author

changelog

  • add comment explaining why we don't want to trim osdmap#0.
  • do not apply transaction if it's empty.
  • make sure the store.db is consistent after each transaction adding osdmaps.

@athanatos mind taking a look again? thanks as always.

for (auto e = first_committed; first_committed && e < sb.oldest_map; e++) {
t->erase(prefix, e);
t->erase(prefix, ms.combine_strings("full", e));
t->put(prefix, first_committed_name, e + 1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably do this once at the end?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do.

@tchaikov
Copy link
Contributor Author

changelog

  • update the "first_committed" once in the end.

@tchaikov
Copy link
Contributor Author

changelog

  • remove trailing spaces.

@athanatos

@tchaikov
Copy link
Contributor Author

tchaikov commented Sep 12, 2016

@athanatos
Copy link
Contributor

LGTM, @gregsfortytwo One last look?

@tchaikov
Copy link
Contributor Author

tchaikov commented Sep 14, 2016

per the discussion with @jecluis

  • ~~import from OSD's keyring also (already implemented, and documented). the path is hardwired to ${data_path}/keyring.
  • check the crc of osdmaps
  • consider rsync'ing and using the meta directory only.

@jecluis i reconsidered you suggestion on pulling only the meta directory, i think it would be easier to just run this tool on the OSD side:

  • from the developer's point of view: we can reuse the interface exposed by ObjectStore. so it works even with bluestore.
  • from the user's point of view: yes, it's unnecessary to copy the store.db back and forth. but the process is scriptable. so i guess it's less painful than it sounds when performing disaster recovery. if it takes 5 seconds to collects an OSD, if there are 1000 OSD instances, it would take less than 1.5 hours to rebuild the mon db.

so ceph-objectstore-tool is able to use it when rebuilding monitor
db.

Fixes: http://tracker.ceph.com/issues/17179
Signed-off-by: Kefu Chai <kchai@redhat.com>
document the process to recover from leveldb corruption.

Fixes: http://tracker.ceph.com/issues/17179
Signed-off-by: Kefu Chai <kchai@redhat.com>
@tchaikov
Copy link
Contributor Author

tchaikov commented Sep 15, 2016

changelog

  • rebase against master.
  • check the crc of osdmaps before putting them into store.db.

@tchaikov
Copy link
Contributor Author

test passed at http://pulpito.ceph.com/kchai-2016-09-15_15:48:11-rados:singleton-wip-kefu-testing---basic-mira/ .

@jecluis @athanatos @gregsfortytwo, thanks for your reviews.

is this changeset good to merge?

@athanatos
Copy link
Contributor

It looks good to me, it would probably be good to get a final ack from @gregsfortytwo .

Copy link
Member

@gregsfortytwo gregsfortytwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all looks great to me. Reviewed-by:

Thanks @tchaikov! :)

@tchaikov tchaikov merged commit 19acda7 into ceph:master Sep 17, 2016
@tchaikov tchaikov deleted the wip-rebuild-monstore branch September 17, 2016 13:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants