os/bluestore: shard extent map #10963

Merged
merged 14 commits into from Sep 7, 2016

@liewegas liewegas commented Sep 2, 2016

Rewrote much of the persistence of onode metadata. The
highlights:

 - extents and blobs stored together (the blob with the
   first referencing extent).
 - extents sharded across multiple k/v keys
 - if a blob is referenced from multiple shards, it's
   stored in the onode key (called a "spanning blob").
 - when we clone a blob we copy the metadata, but mark
   it shared and put (just) the ref_map on the underlying
   blocks in a shared_blob key.  at this point we also
   assign a globally unique id (sbid = shared blob id)
   so the key has a unique name.
 - we instantiate a SharedBlob in memory regardless of
   whether we need to load the ref_map (which is only
   needed for deallocations!).  the BufferSpace is
   attached to this SharedBlob so we get unified caching
   across clones.
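
The sharding and spanning-blob rules above can be illustrated with a minimal sketch. It is not BlueStore code: `Extent`, `shard_extents`, and `find_spanning_blobs` are hypothetical names, shards are assumed to be fixed-size ranges of logical offset (the real shard boundaries may be chosen differently), and a "spanning" blob is assumed to mean one referenced by extents in more than one shard.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <vector>

// Hypothetical, simplified extent: a logical range backed by one blob.
struct Extent {
  uint64_t logical_offset;
  uint64_t length;
  int blob_id;
};

// Group extents into shards keyed by shard index.
// Assumption: fixed-size shards by logical offset.
inline std::map<uint64_t, std::vector<Extent>>
shard_extents(const std::vector<Extent>& extents, uint64_t shard_size) {
  std::map<uint64_t, std::vector<Extent>> shards;
  for (const Extent& e : extents)
    shards[e.logical_offset / shard_size].push_back(e);
  return shards;
}

// A blob referenced from more than one shard cannot live inside a single
// shard key, so it would be stored with the onode instead.
inline std::set<int>
find_spanning_blobs(const std::map<uint64_t, std::vector<Extent>>& shards) {
  std::map<int, std::set<uint64_t>> shards_per_blob;
  for (const auto& kv : shards)
    for (const Extent& e : kv.second)
      shards_per_blob[e.blob_id].insert(kv.first);

  std::set<int> spanning;
  for (const auto& kv : shards_per_blob)
    if (kv.second.size() > 1)
      spanning.insert(kv.first);
  return spanning;
}
```

With a 64 KiB shard size, a blob backing extents at offsets 4096 and 65536 lands in two shards and is reported as spanning, while a blob used only within one shard stays with that shard.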

@liewegas liewegas force-pushed the wip-bluestore-sharded-extent-map branch 3 times, most recently from 4351945 to 605063e Compare September 2, 2016 20:26
while (in_flight)
  cond.Wait(lock);
store->umount();
store->fsck();

We have fsck_on_(u)mount set to true for this test suite (see main func) hence there is no need to call fsck directly. That's just a waste of time...

@liewegas liewegas force-pushed the wip-bluestore-sharded-extent-map branch 3 times, most recently from a9a83dc to 0160c6a Compare September 6, 2016 19:20

liewegas commented Sep 6, 2016

This passes tests, except for the bitmap granularity issue in #10999 that is worked around in that PR. I suggest we merge that one too (with an interim fix) until we do something more clever with min_alloc_size.


liewegas commented Sep 6, 2016

@chhabaramesh

{
const char *p = key.c_str();
if (key.length() < 2 + 8 + 4)
if (key.length() < 8)

8 --- ugh. Magic number.
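
One way to address the magic-number complaint is to derive the minimum key length from the type being decoded. The sketch below is illustrative only: `decode_u64_key` is a hypothetical helper, and the assumption that the key starts with a big-endian 64-bit id is not taken from the diff.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: decode a 64-bit id from the front of a key.
// Assumption: the id is stored big-endian in the first 8 bytes.
// Returns 0 on success, -1 if the key is too short.
inline int decode_u64_key(const std::string& key, uint64_t* out) {
  if (key.length() < sizeof(uint64_t))  // named size instead of a bare 8
    return -1;
  uint64_t v = 0;
  for (std::size_t i = 0; i < sizeof(uint64_t); ++i)
    v = (v << 8) | static_cast<unsigned char>(key[i]);
  *out = v;
  return 0;
}
```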

Signed-off-by: Sage Weil <sage@redhat.com>

liewegas commented Sep 6, 2016

Thanks @allensamuels, fixed that. Also rolled the allocator granularity bug workaround into this PR for now so that the tests pass with all the new fsck's. It's passing the full test suite for me; I think it's ready to merge.

[==========] 61 tests from 1 test case ran. (2841968 ms total)
[ PASSED ] 61 tests.

lock("BlueStore::Collection::lock", true, false),
exists(true),
bnode_set(MAX(16, g_conf->bluestore_onode_cache_size / 128)),
onode_map(cs)
shared_blob_set(MAX(16, g_conf->bluestore_onode_cache_size / 4)),

More magic numbers -- that are different from the old magic numbers :)

We could bump the _max value for a TransContext in its
prepare state, have it wait for a long time on IO, and
let another txc allocate and commit something with
an id higher than the previous max.

Fix this first by pushing the max ids into the
TransContext where we can deal with them at commit time,
and then making _kv_sync_thread bump the committed
max in a safe way.

Note that this will need to change if/when we do
these commits in parallel.

Signed-off-by: Sage Weil <sage@redhat.com>
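
A rough sketch of the fix described in this commit message, under stated assumptions: `TxcSketch` and `IdAllocatorSketch` are hypothetical stand-ins rather than the actual BlueStore classes, a single commit thread is assumed (as the message notes), and the preallocation step `prealloc` is an invented parameter. Each transaction records the highest id it allocated, and the persisted maximum is raised only at commit time.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>

// Hypothetical stand-in for a transaction context: it remembers the
// highest id allocated on its behalf.
struct TxcSketch {
  uint64_t last_id = 0;
};

class IdAllocatorSketch {
 public:
  explicit IdAllocatorSketch(uint64_t prealloc) : prealloc_(prealloc) {}

  // Prepare side: hand out an id and record it on the transaction.
  // The persisted max is deliberately NOT touched here.
  uint64_t allocate(TxcSketch& txc) {
    std::lock_guard<std::mutex> l(lock_);
    txc.last_id = ++last_;
    return txc.last_id;
  }

  // Commit side (single commit thread assumed). Returns true when the
  // persisted max must be rewritten in the same batch as this txc.
  bool commit(const TxcSketch& txc) {
    std::lock_guard<std::mutex> l(lock_);
    if (txc.last_id <= persisted_max_)
      return false;
    persisted_max_ = txc.last_id + prealloc_;
    return true;
  }

  uint64_t persisted_max() const {
    std::lock_guard<std::mutex> l(lock_);
    return persisted_max_;
  }

 private:
  mutable std::mutex lock_;
  uint64_t prealloc_;
  uint64_t last_ = 0;
  uint64_t persisted_max_ = 0;
};
```

Because the maximum is bumped by whichever transaction commits first, a transaction that stalls on IO after allocating can no longer leave a committed id above the persisted maximum.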
Only examine the range we just wrote to (and to the left
and right).

Signed-off-by: Sage Weil <sage@redhat.com>
This has to be block_size bits because min_alloc_size
can vary over mounts.

Signed-off-by: Sage Weil <sage@redhat.com>
We need to handle objects written during previous mounts
that may have had a smaller min_alloc_size.  Use
block_size, which is a safe lower bound.

Signed-off-by: Sage Weil <sage@redhat.com>
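
To see why block_size is the safe granularity, consider a bitmap with one bit per unit of space. The sketch below uses a hypothetical helper `mark_used` (not a BlueStore function) and made-up sizes: an extent written under an earlier, smaller min_alloc_size cannot be represented at a coarser granularity, but always fits a block_size bitmap.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: mark [offset, offset + length) as used in a bitmap
// that has one bit per `granularity` bytes. Returns false, leaving the
// bitmap unchanged, when the range is unaligned or out of bounds.
inline bool mark_used(std::vector<bool>& bitmap, uint64_t granularity,
                      uint64_t offset, uint64_t length) {
  if (granularity == 0 || offset % granularity || length % granularity)
    return false;
  uint64_t first = offset / granularity;
  uint64_t count = length / granularity;
  if (first + count > bitmap.size())
    return false;
  for (uint64_t i = 0; i < count; ++i)
    bitmap[first + i] = true;
  return true;
}
```

For example, a 4 KiB extent at offset 4 KiB is rejected by a bitmap with 16 KiB bits but is recorded as a single bit when the granularity is a 4 KiB block size.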
These were taking min_alloc_size, but this can change
across mounts; better to use the logical blob length
instead (that's what we want anyway!).

Signed-off-by: Sage Weil <sage@redhat.com>
@liewegas liewegas force-pushed the wip-bluestore-sharded-extent-map branch from 79b79e8 to fad3d99 Compare September 7, 2016 15:35
@liewegas liewegas merged commit 68cf9d8 into ceph:master Sep 7, 2016
@liewegas liewegas deleted the wip-bluestore-sharded-extent-map branch September 7, 2016 15:36