
os/bluestore: prevent extent merging across shard boundaries #11216

Merged
merged 2 commits into ceph:master Sep 27, 2016

Conversation

liewegas
Member

We cannot have a single extent span a shard boundary. If we reach
a shard boundary, stop merging extents.

Note that it might be better to move the shard boundary, but it is
awkward to force that to happen at this point.

Signed-off-by: Sage Weil <sage@redhat.com>
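
As a rough illustration of the rule described above, here is a minimal sketch in C++ (the Shard/Extent structs and the function name are assumptions for illustration, not the actual BlueStore code): refuse to coalesce two adjacent extents whenever a shard begins strictly inside the would-be merged range.

// Hypothetical sketch, not the actual BlueStore code: stop merging two
// adjacent extents if any shard boundary falls strictly inside the
// merged range.
#include <cstdint>
#include <vector>

struct Shard  { uint64_t offset; };                  // logical start of a shard
struct Extent { uint64_t logical_offset, length; };  // one extent in the map

bool merge_would_span_shard(const Extent& a, const Extent& b,
                            const std::vector<Shard>& shards) {
  uint64_t start = a.logical_offset;
  uint64_t end = b.logical_offset + b.length;  // end of the merged extent
  for (const auto& s : shards) {
    if (s.offset > start && s.offset < end)
      return true;   // merged extent would span this shard boundary
  }
  return false;      // safe to merge
}

A boundary exactly at the start or end of the merged range is fine; only a boundary strictly inside it forces merging to stop.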

@liewegas
Member Author

retest this please


@somnathr

@liewegas I am seeing the following rocksdb tx after a 1-hour run of 4K RW.

2016-09-26 14:25:00.624709 7fb8ddbaa700 30 submit_transaction Rocksdb transaction:
Put( Prefix = M key = 0x000000000000083d'.0000000018.00000000000000077820' Value size = 182)
Put( Prefix = M key = 0x000000000000083d'._info' Value size = 855)
Put( Prefix = O key = 0x7f80000000000000016b93fad7217262'd_data.10156b8b4567.000000000001df15!='0xfffffffffffffffeffffffffffffffff001b1000'x' Value size = 583)
Put( Prefix = O key = 0x7f80000000000000016b93fad7217262'd_data.10156b8b4567.000000000001df15!='0xfffffffffffffffeffffffffffffffff Value size = 4689)
Merge( Prefix = b key = 0x00000048de880000 Value size = 16)
Merge( Prefix = b key = 0x00000054a7a00000 Value size = 16)

Out of that, the spanning blob part is the 4689-byte value.
If I understood this PR correctly, we don't want an extent to span two shards because we don't want to write both shards in the future if an IO touches only one (?).
This PR will not solve the problem of writing out the entire, ever-growing spanning blobs.
If so, is it possible to write only the portion of the spanning blob that changed during this IO?

@liewegas
Member Author

On Mon, 26 Sep 2016, somnathr wrote:

> @liewegas I am seeing the following rocksdb tx after a 1-hour run of 4K RW.
>
> 2016-09-26 14:25:00.624709 7fb8ddbaa700 30 submit_transaction Rocksdb transaction:
> Put( Prefix = M key = 0x000000000000083d'.0000000018.00000000000000077820' Value size = 182)
> Put( Prefix = M key = 0x000000000000083d'._info' Value size = 855)
> Put( Prefix = O key = 0x7f80000000000000016b93fad7217262'd_data.10156b8b4567.000000000001df15!='0xfffffffffffffffeffffffffffffffff001b1000'x' Value size = 583)
> Put( Prefix = O key = 0x7f80000000000000016b93fad7217262'd_data.10156b8b4567.000000000001df15!='0xfffffffffffffffeffffffffffffffff Value size = 4689)
> Merge( Prefix = b key = 0x00000048de880000 Value size = 16)
> Merge( Prefix = b key = 0x00000054a7a00000 Value size = 16)
>
> Out of that, the spanning blob part is the 4689-byte value.
> If I understood this PR correctly, we don't want an extent to span two shards because we don't want to write both shards in the future if an IO touches only one (?)

Right.

> This PR will not solve the problem of writing out the entire, ever-growing spanning blobs.
> If so, is it possible to write only the portion of the spanning blob that changed during this IO?

The spanning blobs should normally be small. The current exception seems
to be when you write the whole object sequentially and get one big blob and
extent; with 4KB checksum blocks and crc32c, that works out to about 4KB of
checksum metadata in the blob. I think the "fix" is to set a max blob size,
which we currently don't do unless compression is enabled.

But it will take a bit of tuning: we need to make sure the max blob size
is small enough that the target shard size * (1 + target shard size slop)
will land on the big blobs' boundaries. Otherwise, one or more big blobs can
end up being spanning blobs, and we don't want any of them to be
spanning blobs if we can help it. I'm guessing a max blob size of 512KB (about
512 bytes of metadata) would be okay.
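
A quick back-of-the-envelope check of both metadata figures (a sketch; the 4MB object size is an assumption, while the 4KB blocks and 4-byte crc32c values come from the discussion above):

// Back-of-the-envelope check of the two estimates above (assumed sizes).
#include <cstdint>
#include <cstdio>

int main() {
  const uint64_t csum_block = 4ull << 10;    // 4KB checksum granularity
  const uint64_t csum_size = 4;              // one crc32c value is 4 bytes

  const uint64_t whole_object = 4ull << 20;  // assumed 4MB object, one big blob
  const uint64_t max_blob = 512ull << 10;    // proposed 512KB max blob size

  // 1024 blocks * 4 bytes = 4096 bytes of csum metadata
  printf("whole-object blob: %llu bytes of csum metadata\n",
         (unsigned long long)(whole_object / csum_block * csum_size));
  // 128 blocks * 4 bytes = 512 bytes of csum metadata
  printf("512KB blob: %llu bytes of csum metadata\n",
         (unsigned long long)(max_blob / csum_block * csum_size));
  return 0;
}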

Or, we could make resharding smart enough to cleave an uncompressed
blob in half when it picks a bound. That is probably better, actually.
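
A hypothetical sketch of that alternative (the types and the function are assumed stand-ins, not the actual resharding code): when the chosen bound falls inside an uncompressed blob, cleave the blob at the bound so neither half spans it.

// Hypothetical sketch, not BlueStore's resharding code: split an
// uncompressed blob at the chosen shard bound instead of letting it span.
#include <cstdint>

struct Blob {
  uint64_t offset, length;  // logical range the blob covers
  bool compressed;
};

// Returns the bound to use; shrinks *b in place when it can be cleaved.
// The caller is expected to create a second blob for the remainder.
uint64_t pick_bound(uint64_t wanted, Blob* b) {
  if (wanted <= b->offset || wanted >= b->offset + b->length)
    return wanted;                  // bound does not cut this blob
  if (b->compressed)
    return b->offset + b->length;   // cannot split: push bound past the blob
  b->length = wanted - b->offset;   // cleave at the bound
  return wanted;
}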

@liewegas liewegas merged commit 5f5880d into ceph:master Sep 27, 2016

// identify the *next* shard
auto pshard = shards.begin();
while (pshard != shards.end() &&
Contributor


Really should assert that p != end(). Not sure if seek_lextent() really guarantees it, but if it doesn't, things will really go south when you do n = p + 1 later :(


seek_lextent() doesn't guarantee that, and that's why every caller of seek_lextent() has an end check except this one.
I agree we should add that check or an assert here.
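
A minimal sketch of the guard being discussed (the map type and names are assumed stand-ins, not BlueStore's actual extent map):

// Minimal sketch of the suggested check: bail out (or assert) when the
// seek returns end() before forming n = p + 1.
#include <cassert>
#include <cstdint>
#include <iterator>
#include <map>

using ExtentMap = std::map<uint64_t, uint64_t>;  // offset -> length (stand-in)

void merge_near(ExtentMap& m, uint64_t offset) {
  auto p = m.lower_bound(offset);  // stand-in for seek_lextent()
  assert(p != m.end());            // the end check the reviewers are asking for
  auto n = std::next(p);           // only safe now that p != end()
  (void)n;                         // ...merging logic would continue here
}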
