os/bluestore: replace Blob ref_map with reference counting #12904
Conversation
Force-pushed from 899e8b0 to a8731fe.
retest this please
Force-pushed from ff7a587 to 2608dbb.
```cpp
  const bluestore_blob_use_tracker_t& other) const;

  void bound_encode(size_t& p) const {
    denc(refs_count, p);
```
denc_varint here throughout.. the varint bound is larger than the uint32_t bound
Does that mean that we have a bug in bluestore_extent_ref_map_t::bound_encode?
```cpp
void bound_encode(size_t& p) const {
  denc((uint32_t)0, p);
  ...
```
```cpp
union {
  uint32_t* ref_by_au;
  uint32_t ref_total;
};
```
Having a hard time following what these fields mean.
I think refs_count is more like num_au?
```cpp
  ref_size = 0;
}

uint32_t get_referenced() const {
```
get_referenced_bytes() ?
```cpp
}

uint32_t get_referenced() const {
  uint32_t total = ref_total;
```
Can this go in an else so it doesn't get reassigned (and assigned an initial value that is a pointer)?
```cpp
  }
  return false;
}
return ref_total != 0;
```
fwiw for clarity i'd make all these have else blocks for the !refs_count cases
Force-pushed from 2608dbb to 803502a.
@liewegas - resolved.
yeah
Force-pushed from 803502a to 42342a8.
```cpp
}
uint32_t _au_size = 0, _num_au = 0;
fallback_cb(&_num_au, &_au_size);
if (_num_au && _au_size) { // we might be forbidden to fallback to partial
```
I think we want to also check whether _au_size is greater than the blob size, and only do the fallback then. Because there is no point in doing per-au counting when the blob is a single au (or, in the weird case where the min_alloc_size has changed, smaller).
(Thanks for the variable rename--this makes sense now :)
The key property that determines whether the fallback is allowed is whether the blob can (or ever could) get split. Perhaps we should make that condition explicit; right now it's only sort of implicit in the get_release_size(). Or rather, get_release_size() should clearly return 0 in any cases where a split will never be possible (size <= csum_chunk or compressed).
I'm afraid the use of total counting is more restrictive here: au_size <= blob size isn't a proper precondition to omit per-au counting. Consider the following case:
- MAS = 16K;
- write 8K~8K;
- then change MAS to 4K;
- and write 0K~4K.

One has to be able to switch to per-au counting at this point, but that's impossible if per-au counting wasn't enabled at the first write.
Hence the only valid preconditions for total counting are compression or a SINGLE 'get' (i.e. write) at OFFSET 0.
```cpp
{
  assert(num_au == 0 && au_size == 0);
  assert(_num_au != 0 && _au_size != 0);
  assert(_num_au * _au_size >= total_bytes);
```
I think if we stay in the num_au==0 mode for small blobs (see comment below) then this assert will no longer be true
Disagree. This assert verifies that one has enough space to split the current total_bytes value over per-au counters. IMO this precondition always holds. E.g. one is always unable to migrate from total_bytes=1000 to num_au(=2) * au_size(=128) counters. Have I missed something?
```cpp
}
bool empty = maybe_empty ? !is_not_empty() : false;
if (empty && release_units) {
  release_units->clear();
```
can you document the semantics of the empty return value vs release_units in the header?
```cpp
  }
} else {
  assert(au_size);
  assert((blob_offset % au_size) == 0 );
```
whitespace
```cpp
static void generate_test_instances(list<bluestore_blob_use_tracker_t*>& o);
private:
void allocate();
void fall_back_to_partial(uint32_t _num_au, uint32_t _au_size);
```
naming nit: fall_back_to_per_au() is a bit more clear IMO
```cpp
    }
  }
  return my_referenced == referenced;
} else if (other.num_au){
```
could avoid this block by doing return other.equal(*this)
```cpp
    return false;
  }
}
return true;
```
I don't understand this block. Won't we return false if au_size doesn't match? (The first bytes_per_au values won't be the same.) And what would it mean to compare equality when num_au doesn't match?
It looks like this is only used from fsck, in which case the size and au_size and num_au will match, right?
```cpp
if (!num_au) {
  f->dump_unsigned("total_bytes", total_bytes);
} else {
  f->open_array_section("use_tracker");
```
bytes_per_au ?
```cpp
  *cnt = ROUND_UP_TO(l, min_release_size) / min_release_size;
  *rsz = min_release_size;
} else {
  *cnt = *rsz = 0;
```
since both callers initialize these to 0 we could drop these blocks and make it part of the calling convention.
```cpp
uint64_t end;
if (p == ref_map.ref_map.end()) {
  end = b.get_ondisk_length();
// remove form pextents according to logical release list
```
*from
```cpp
offset += (phase ? au_size - phase : au_size);
if (bytes_per_au[pos] == 0) {
  if (release_units) {
    release_units->insert(pos * au_size);
```
Hmm, this will build up a huge set of contiguous blocks for a large blob, right?
Instead, we could add a bit more smarts here to identify contiguous AUs getting freed and return a logical bluestore_pextent_t, as we did before. This would also avoid the rewrite of Blob::put_ref() (which was a pain to get right the first time around).
Reworked to return logical bluestore_pextent_t, but modified put_ref totally anyway. The rationale:
- reduced complexity: the original implementation was O(p*r) vs. O(p+r) for the new one, where p is the blob pextent vector size and r is the release vector size;
- the original implementation still needed an update, since it uses ref_map.
The other patches look good! Overall I think this PR is clearly the way to go. My main hope is that we can avoid the Blob::put_ref() rewrite and avoid building a big set<> of individual AUs... We could, instead, just accept that if MAS is reduced, old blobs may not be able to be freed with the new, smaller granularity...
Force-pushed from 5a60de8 to 89aa273.
Force-pushed from 89aa273 to c88d53f.
```diff
@@ -8144,6 +8181,7 @@ int BlueStore::_do_alloc_write(
       dblob.add_unused(b_end, wi.blob_length - b_end);
     }
   }
+  b->get_ref(coll.get(), wi.b_off0, wi.length0);
```
it looks like this pairs with the special case in get_ref()... can we add a comment for both locations to reference the other? (e.g., // see _wctx_write_finish())
Force-pushed from e930feb to ddace0e.
```
MismatchedFree: Mismatched free() / delete / delete []
  0x4C2918D  operator delete(void*)         /builddir/build/BUILD/valgrind-3.11.0/coregrind/m_replacemalloc/vg_replace_malloc.c:576  (vgpreload_memcheck-amd64-linux.so)
  0x805FB0   clear                          /usr/src/debug/ceph-11.1.0-7031-g7e25bbc/src/os/bluestore/bluestore_types.h:304  (ceph-osd)
  0x805FB0   ~bluestore_blob_use_tracker_t  /usr/src/debug/ceph-11.1.0-7031-g7e25bbc/src/os/bluestore/bluestore_types.h:299  (ceph-osd)
  0x805FB0   ~Blob                          /usr/src/debug/ceph-11.1.0-7031-g7e25bbc/src/os/bluestore/BlueStore.h:422  (ceph-osd)
```
from /a/sage-2017-01-25_23:09:16-rados-wip-sage-testing---basic-smithi/748438
Force-pushed from ddace0e to bdd0736.
Signed-off-by: Igor Fedotov <ifedotov@mirantis.com>
…on from ref_map to ref counting properly. Signed-off-by: Igor Fedotov <ifedotov@mirantis.com>
…zeof' unittest. Signed-off-by: Igor Fedotov <ifedotov@mirantis.com>
…rint for non-shared blobs. Signed-off-by: Igor Fedotov <ifedotor@mirantis.com>
…resence case. Signed-off-by: Igor Fedotov <ifedotov@mirantis.com>
Signed-off-by: Igor Fedotov <ifedotov@mirantis.com>
Signed-off-by: Igor Fedotov <ifedotov@mirantis.com>
…_encode Signed-off-by: Igor Fedotov <ifedotov@mirantis.com>
Force-pushed from bdd0736 to c7c7540.
@liewegas - fixed.
Hit this from ObjectStore/StoreTest.SyntheticMatrixCsumVsCompression/2 (teuthology run on smithi180, 2017-01-27):

```
-2> 2017-01-27 18:58:04.424198 7f6b6de38700 -1 freelist _verify_range key 0xfffffffffff80000 not present, expected 0x1
-1> 2017-01-27 18:58:04.424227 7f6b6de38700 -1 freelist _verify_range saw 1 errors
 0> 2017-01-27 18:58:04.426027 7f6b6de38700 -1 /build/ceph-11.1.0-7072-g80df3d9/src/os/bluestore/BitmapFreelistManager.cc: In function 'void BitmapFreelistManager::_verify_range(uint64_t, uint64_t, int)' thread 7f6b6de38700 time 2017-01-27 18:58:04.424250
/build/ceph-11.1.0-7072-g80df3d9/src/os/bluestore/BitmapFreelistManager.cc: 454: FAILED assert(0 == "bitmap freelist errors")

 ceph version 11.1.0-7072-g80df3d9 (80df3d90ce6633aae8e985e346cc3a07f0ee8568)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x10e) [0x7f6b7adc657e]
 2: (BitmapFreelistManager::_verify_range(unsigned long, unsigned long, int)+0x832) [0x7f6b83be4b82]
 3: (BitmapFreelistManager::release(unsigned long, unsigned long, std::shared_ptr)+0x2ab) [0x7f6b83be738b]
 4: (BlueStore::_txc_finalize_kv(BlueStore::TransContext*, std::shared_ptr)+0x23e) [0x7f6b83b4ed6e]
 5: (BlueStore::_kv_sync_thread()+0x1665) [0x7f6b83b58db5]
 6: (BlueStore::KVSyncThread::entry()+0xd) [0x7f6b83b85e0d]
 7: (()+0x8184) [0x7f6b7a935184]
 8: (clone()+0x6d) [0x7f6b7903f37d]
```

which looks pretty suspicious. Maybe try running it in a loop?
Passed 32 iterations without any issues.
This PR is based on some commits from PR #12700
This effectively reduces the in-memory metadata size for a 4MB onode from 418K to 254K.
Signed-off-by: Igor Fedotov ifedotov@mirantis.com