
os/bluestore: prevent extent merging across shard boundaries #11216

Merged
merged 2 commits into ceph:master Sep 27, 2016

Conversation

liewegas
Member

We cannot have a single extent span a shard boundary. If we reach
a shard boundary, stop merging extents.

Note that it might be better to move the shard boundary, but it is
awkward to force that to happen at this point.

Signed-off-by: Sage Weil <sage@redhat.com>
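
As a rough illustration of the rule described above, here is a minimal sketch in C++ (the Shard/Extent structs and the function name are assumptions for illustration, not the actual BlueStore code): refuse to coalesce two adjacent extents whenever a shard begins strictly inside the would-be merged range.

// Hypothetical sketch, not the actual BlueStore code: stop merging two
// adjacent extents if any shard boundary falls strictly inside the
// merged range.
#include <cstdint>
#include <vector>

struct Shard  { uint64_t offset; };                  // logical start of a shard
struct Extent { uint64_t logical_offset, length; };  // one extent in the map

bool merge_would_span_shard(const Extent& a, const Extent& b,
                            const std::vector<Shard>& shards) {
  uint64_t start = a.logical_offset;
  uint64_t end = b.logical_offset + b.length;  // end of the merged extent
  for (const auto& s : shards) {
    if (s.offset > start && s.offset < end)
      return true;   // merged extent would span this shard boundary
  }
  return false;      // safe to merge
}

A boundary exactly at the start or end of the merged range is fine; only a boundary strictly inside it forces merging to stop.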

@liewegas
Member Author

retest this please


@somnathr

@liewegas I am seeing the following rocksdb tx after a 1-hour run of 4K RW.

2016-09-26 14:25:00.624709 7fb8ddbaa700 30 submit_transaction Rocksdb transaction:
Put( Prefix = M key = 0x000000000000083d'.0000000018.00000000000000077820' Value size = 182)
Put( Prefix = M key = 0x000000000000083d'._info' Value size = 855)
Put( Prefix = O key = 0x7f80000000000000016b93fad7217262'd_data.10156b8b4567.000000000001df15!='0xfffffffffffffffeffffffffffffffff001b1000'x' Value size = 583)
Put( Prefix = O key = 0x7f80000000000000016b93fad7217262'd_data.10156b8b4567.000000000001df15!='0xfffffffffffffffeffffffffffffffff Value size = 4689)
Merge( Prefix = b key = 0x00000048de880000 Value size = 16)
Merge( Prefix = b key = 0x00000054a7a00000 Value size = 16)

Out of that, the spanning blob part is the 4689-byte value.
If I understood this PR correctly, we don't want an extent to span two shards because we don't want to write both shards in the future if an IO touches only one (?).
This PR will not solve the problem of writing out the entire, ever-growing spanning blobs.
If so, is it possible to write only the portion of the spanning blob that changed during this IO?

@liewegas
Member Author

On Mon, 26 Sep 2016, somnathr wrote:

> @liewegas I am seeing the following rocksdb tx after a 1-hour run of 4K RW.
>
> 2016-09-26 14:25:00.624709 7fb8ddbaa700 30 submit_transaction Rocksdb transaction:
> Put( Prefix = M key = 0x000000000000083d'.0000000018.00000000000000077820' Value size = 182)
> Put( Prefix = M key = 0x000000000000083d'._info' Value size = 855)
> Put( Prefix = O key = 0x7f80000000000000016b93fad7217262'd_data.10156b8b4567.000000000001df15!='0xfffffffffffffffeffffffffffffffff001b1000'x' Value size = 583)
> Put( Prefix = O key = 0x7f80000000000000016b93fad7217262'd_data.10156b8b4567.000000000001df15!='0xfffffffffffffffeffffffffffffffff Value size = 4689)
> Merge( Prefix = b key = 0x00000048de880000 Value size = 16)
> Merge( Prefix = b key = 0x00000054a7a00000 Value size = 16)
>
> Out of that, the spanning blob part is the 4689-byte value.
> If I understood this PR correctly, we don't want an extent to span two shards because we don't want to write both shards in the future if an IO touches only one (?)

Right.

> This PR will not solve the problem of writing out the entire, ever-growing spanning blobs.
> If so, is it possible to write only the portion of the spanning blob that changed during this IO?

The spanning blobs should normally be small. The current exception seems
to be when you write the whole object sequentially and get one big blob and
extent; with 4KB checksum blocks and crc32c, that works out to about 4KB of
checksum metadata in the blob. I think the "fix" is to set a max blob size,
which we currently don't do unless compression is enabled.

But it will take a bit of tuning: we need to make sure the max blob size
is small enough that the target shard size * (1 + target shard size slop)
will land on the big blobs' boundaries. Otherwise, one or more big blobs can
end up being spanning blobs, and we don't want any of them to be
spanning blobs if we can help it. I'm guessing a max blob size of 512KB (about
512 bytes of metadata) would be okay.
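
A quick back-of-the-envelope check of both metadata figures (a sketch; the 4MB object size is an assumption, while the 4KB blocks and 4-byte crc32c values come from the discussion above):

// Back-of-the-envelope check of the two estimates above (assumed sizes).
#include <cstdint>
#include <cstdio>

int main() {
  const uint64_t csum_block = 4ull << 10;    // 4KB checksum granularity
  const uint64_t csum_size = 4;              // one crc32c value is 4 bytes

  const uint64_t whole_object = 4ull << 20;  // assumed 4MB object, one big blob
  const uint64_t max_blob = 512ull << 10;    // proposed 512KB max blob size

  // 1024 blocks * 4 bytes = 4096 bytes of csum metadata
  printf("whole-object blob: %llu bytes of csum metadata\n",
         (unsigned long long)(whole_object / csum_block * csum_size));
  // 128 blocks * 4 bytes = 512 bytes of csum metadata
  printf("512KB blob: %llu bytes of csum metadata\n",
         (unsigned long long)(max_blob / csum_block * csum_size));
  return 0;
}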

Or, we could make resharding smart enough to cleave an uncompressed
blob in half when it picks a bound. That is probably better, actually.
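
A hypothetical sketch of that alternative (the types and the function are assumed stand-ins, not the actual resharding code): when the chosen bound falls inside an uncompressed blob, cleave the blob at the bound so neither half spans it.

// Hypothetical sketch, not BlueStore's resharding code: split an
// uncompressed blob at the chosen shard bound instead of letting it span.
#include <cstdint>

struct Blob {
  uint64_t offset, length;  // logical range the blob covers
  bool compressed;
};

// Returns the bound to use; shrinks *b in place when it can be cleaved.
// The caller is expected to create a second blob for the remainder.
uint64_t pick_bound(uint64_t wanted, Blob* b) {
  if (wanted <= b->offset || wanted >= b->offset + b->length)
    return wanted;                  // bound does not cut this blob
  if (b->compressed)
    return b->offset + b->length;   // cannot split: push bound past the blob
  b->length = wanted - b->offset;   // cleave at the bound
  return wanted;
}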

@liewegas liewegas merged commit 5f5880d into ceph:master Sep 27, 2016

// identify the *next* shard
auto pshard = shards.begin();
while (pshard != shards.end() &&
Contributor


Really should assert that p != end(). Not sure if seek_lextent() really guarantees it, but if it doesn't, things will really go south when you do n = p + 1 later :(


seek_lextent() doesn't guarantee that, and that's why every caller of seek_lextent() has an end check except this one.
I agree we should add that check or an assert here.
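
A minimal sketch of the guard being discussed (the map type and names are assumed stand-ins, not BlueStore's actual extent map):

// Minimal sketch of the suggested check: bail out (or assert) when the
// seek returns end() before forming n = p + 1.
#include <cassert>
#include <cstdint>
#include <iterator>
#include <map>

using ExtentMap = std::map<uint64_t, uint64_t>;  // offset -> length (stand-in)

void merge_near(ExtentMap& m, uint64_t offset) {
  auto p = m.lower_bound(offset);  // stand-in for seek_lextent()
  assert(p != m.end());            // the end check the reviewers are asking for
  auto n = std::next(p);           // only safe now that p != end()
  (void)n;                         // ...merging logic would continue here
}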
