Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

os/bluestore: replace Blob ref_map with reference counting #12904

Merged
merged 8 commits into from Feb 2, 2017

Conversation

ifed01
Copy link
Contributor

@ifed01 ifed01 commented Jan 12, 2017

This PR is based on some commits from PR #12700
This effectively reduces 4Mb Onode metadata in-memory size from 418K to 254K

Signed-off-by: Igor Fedotov ifedotov@mirantis.com

@xiexingguo
Copy link
Member

retest this please

@ifed01 ifed01 force-pushed the wip-bluestore-onode-refmap3 branch from ff7a587 to 2608dbb Compare January 13, 2017 14:29
const bluestore_blob_use_tracker_t& other) const;

void bound_encode(size_t& p) const {
denc(refs_count, p);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

denc_varint here throughout.. the varint bound is larger than the uint32_t bound

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does that mean that we have a bug in bluestore_extent_ref_map_t::bound_encode?
void bound_encode(size_t& p) const {
denc((uint32_t)0, p);
...

union {
uint32_t* ref_by_au;
uint32_t ref_total;
};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having a hard time following what these fields mean.

I think refs_count is more like num_au?

ref_size = 0;
}

uint32_t get_referenced() const {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_referenced_bytes() ?

}

uint32_t get_referenced() const {
uint32_t total = ref_total;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this go in an else so it doesn't get reassigned (and assigned an initial value that is a pointer)?

}
return false;
}
return ref_total != 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fwiw for clarity i'd make all these have else blocks for the !refs_count cases

@ifed01 ifed01 force-pushed the wip-bluestore-onode-refmap3 branch from 2608dbb to 803502a Compare January 16, 2017 13:25
@ifed01
Copy link
Contributor Author

ifed01 commented Jan 16, 2017

@liewegas - resolved.

@liewegas
Copy link
Member

liewegas commented Jan 16, 2017 via email

@ifed01 ifed01 force-pushed the wip-bluestore-onode-refmap3 branch from 803502a to 42342a8 Compare January 16, 2017 14:40
}
uint32_t _au_size = 0, _num_au = 0;
fallback_cb(&_num_au, &_au_size);
if (_num_au && _au_size) { // we might be forbidden to fallback to partial
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we want to also check whether the _au_size is > than the blob size, and only do the fallback then. Because there is no point in doing per-au counting when the blob is a single au (or, in the weird case where the min_alloc_size has changed, smaller).

(Thanks for the variable rename--this makes sense now :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The key property that determines whether the fallback is allowed is whether the blob can (or ever could) get split. Perhaps we should make that condition explicit; right now it's only sort of implicit in the get_release_size(). Or rather, get_release_size() should clearly return 0 in any cases where a split will never be possible (size <= csum_chunk or compressed).

Copy link
Contributor Author

@ifed01 ifed01 Jan 17, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm afraid the use of total counting is more restrictive here: au_size <= blob size isn't a proper precondition to omit per_au counting. Consider the following case.
MAS = 16K.
write 8K~8K.
then change MAS to 4K
and write 0K~4K
One has to be able to switch to per-au counting at this point but that's impossible if per-au wasn't enabled at the first write.
Hence the only valid preconditions for total counting are compression or SINGLE 'get' (i.e. write) at OFFSET 0.

{
assert(num_au == 0 && au_size == 0);
assert(_num_au != 0 && _au_size != 0);
assert(_num_au * _au_size >= total_bytes);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if we stay in the num_au==0 mode for small blobs (see comment below) then this assert will no longer be true

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Disagree. This assert verifies that one has enough space to split current total_bytes value over per-au counters. IMO this precondition is always present. E.g. one is always un able to migrate from total_bytes=1000 to num_au(=2) * au_size(=128) counters. Have I missed something?

}
bool empty = maybe_empty ? !is_not_empty() : false;
if (empty && release_units) {
release_units->clear();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you document the semantics of the empty return value vs release_units in the header?

}
} else {
assert(au_size);
assert((blob_offset % au_size) == 0 );
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whitespace

static void generate_test_instances(list<bluestore_blob_use_tracker_t*>& o);
private:
void allocate();
void fall_back_to_partial(uint32_t _num_au, uint32_t _au_size);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

naming nit: fall_back_to_per_au() is a bit more clear IMO

}
}
return my_referenced == referenced;
} else if (other.num_au){
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could avoid this block by doing return other.equal(*this)

return false;
}
}
return true;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this block. Won't we return false if au_size doesn't match? (The first bytes_per_au values won't be the same.) And what would it mean to compare equality when num_au doesn't match?

It looks like this is only used from fsck, in which case the size and au_size and num_au will match, right?

if (!num_au) {
f->dump_unsigned("total_bytes", total_bytes);
} else {
f->open_array_section("use_tracker");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bytes_per_au ?

*cnt = ROUND_UP_TO(l, min_release_size) / min_release_size;
*rsz = min_release_size;
} else {
*cnt = *rsz = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since both callers initialize these to 0 we could drop these blocks and make it part of the calling convention.

uint64_t end;
if (p == ref_map.ref_map.end()) {
end = b.get_ondisk_length();
// remove form pextents according to logical release list
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

*from

offset += (phase ? au_size - phase : au_size);
if (bytes_per_au[pos] == 0) {
if (release_units) {
release_units->insert(pos * au_size);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, this will build of a huge set of continguos blocks for a large blob, right?

Instead, we could add a bit more smarts here to identify contiguous au's getting freed and return a logical bluestore_pextent_t, as we did before. This would also avoid the rewrite of Blob::put_ref() (which was a pain to get right the first time around)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reworked to return logical bluestore_pextent_t but modified put_ref totally anyway. The rationales are:

  1. reduced complexity - original implementation had O(p*r) vs. O(p+r) for the new one. Where p - blob pextent vector size, r - release vector size.
  2. original implementation still needed update since it uses ref_map.

@liewegas
Copy link
Member

The other patches look good!

Overall I think this PR is clearly the way to go. My main hope is that we can avoid the Blob::put_ref() rewrite and avoid building a bit set<> of individual au's ...

@liewegas
Copy link
Member

liewegas commented Jan 17, 2017 via email

@ifed01 ifed01 force-pushed the wip-bluestore-onode-refmap3 branch 2 times, most recently from 5a60de8 to 89aa273 Compare January 20, 2017 21:34
@ifed01 ifed01 force-pushed the wip-bluestore-onode-refmap3 branch from 89aa273 to c88d53f Compare January 23, 2017 20:59
@@ -8144,6 +8181,7 @@ int BlueStore::_do_alloc_write(
dblob.add_unused(b_end, wi.blob_length - b_end);
}
}
b->get_ref(coll.get(), wi.b_off0, wi.length0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it looks like this pairs with the special case in get_ref()... can we add a comment for both locations to reference the other? (e.g., // see _wctx_write_finish())

@ifed01 ifed01 force-pushed the wip-bluestore-onode-refmap3 branch 2 times, most recently from e930feb to ddace0e Compare January 24, 2017 09:38
@liewegas
Copy link
Member

  MismatchedFree
  Mismatched free() / delete / delete []
  
    
      0x4C2918D
      /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so
      operator delete(void*)
      /builddir/build/BUILD/valgrind-3.11.0/coregrind/m_replacemalloc
      vg_replace_malloc.c
      576
    
    
      0x805FB0
      /usr/bin/ceph-osd
      clear
      /usr/src/debug/ceph-11.1.0-7031-g7e25bbc/src/os/bluestore
      bluestore_types.h
      304
    
    
      0x805FB0
      /usr/bin/ceph-osd
      ~bluestore_blob_use_tracker_t
      /usr/src/debug/ceph-11.1.0-7031-g7e25bbc/src/os/bluestore
      bluestore_types.h
      299
    
    
      0x805FB0
      /usr/bin/ceph-osd
      ~Blob
      /usr/src/debug/ceph-11.1.0-7031-g7e25bbc/src/os/bluestore
      BlueStore.h
      422
    

from /a/sage-2017-01-25_23:09:16-rados-wip-sage-testing---basic-smithi/748438

Igor Fedotov added 8 commits January 26, 2017 23:37
Signed-off-by: Igor Fedotov <ifedotov@mirantis.com>
…on from ref_map to ref counting properly.

Signed-off-by: Igor Fedotov <ifedotov@mirantis.com>
…zeof' unittest.

Signed-off-by: Igor Fedotov <ifedotov@mirantis.com>
…rint for non-shared blobs.

Signed-off-by: Igor Fedotov <ifedotor@mirantis.com>
…resence case.

Signed-off-by: Igor Fedotov <ifedotov@mirantis.com>
Signed-off-by: Igor Fedotov <ifedotov@mirantis.com>
Signed-off-by: Igor Fedotov <ifedotov@mirantis.com>
…_encode

Signed-off-by: Igor Fedotov <ifedotov@mirantis.com>
@ifed01 ifed01 force-pushed the wip-bluestore-onode-refmap3 branch from bdd0736 to c7c7540 Compare January 26, 2017 23:38
@ifed01
Copy link
Contributor Author

ifed01 commented Jan 26, 2017

@liewegas - fixed.

@liewegas
Copy link
Member

Hit this from ObjectStore/StoreTest.SyntheticMatrixCsumVsCompression/2

2017-01-27T18:58:04.542 INFO:teuthology.orchestra.run.smithi180.stderr:    -2> 2017-01-27 18:58:04.424198 7f6b6de38700 -1 freelist _verify_range key 0xfffffffffff80000 not present, expected 0x1
2017-01-27T18:58:04.542 INFO:teuthology.orchestra.run.smithi180.stderr:    -1> 2017-01-27 18:58:04.424227 7f6b6de38700 -1 freelist _verify_range saw 1 errors
2017-01-27T18:58:04.542 INFO:teuthology.orchestra.run.smithi180.stderr:     0> 2017-01-27 18:58:04.426027 7f6b6de38700 -1 /build/ceph-11.1.0-7072-g80df3d9/src/os/bluestore/BitmapFreelistManager.cc: In function 'void BitmapFreelistManager::_verify_range(uint64_t, uint64_t, int)' thread 7f6b6de38700 time 2017-01-27 1
8:58:04.424250
2017-01-27T18:58:04.542 INFO:teuthology.orchestra.run.smithi180.stderr:/build/ceph-11.1.0-7072-g80df3d9/src/os/bluestore/BitmapFreelistManager.cc: 454: FAILED assert(0 == "bitmap freelist errors")
2017-01-27T18:58:04.542 INFO:teuthology.orchestra.run.smithi180.stderr:
2017-01-27T18:58:04.542 INFO:teuthology.orchestra.run.smithi180.stderr: ceph version 11.1.0-7072-g80df3d9 (80df3d90ce6633aae8e985e346cc3a07f0ee8568)
2017-01-27T18:58:04.542 INFO:teuthology.orchestra.run.smithi180.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x10e) [0x7f6b7adc657e]
2017-01-27T18:58:04.543 INFO:teuthology.orchestra.run.smithi180.stderr: 2: (BitmapFreelistManager::_verify_range(unsigned long, unsigned long, int)+0x832) [0x7f6b83be4b82]
2017-01-27T18:58:04.543 INFO:teuthology.orchestra.run.smithi180.stderr: 3: (BitmapFreelistManager::release(unsigned long, unsigned long, std::shared_ptr)+0x2ab) [0x7f6b83be738b]
2017-01-27T18:58:04.543 INFO:teuthology.orchestra.run.smithi180.stderr: 4: (BlueStore::_txc_finalize_kv(BlueStore::TransContext*, std::shared_ptr)+0x23e) [0x7f6b83b4ed6e]
2017-01-27T18:58:04.543 INFO:teuthology.orchestra.run.smithi180.stderr: 5: (BlueStore::_kv_sync_thread()+0x1665) [0x7f6b83b58db5]
2017-01-27T18:58:04.543 INFO:teuthology.orchestra.run.smithi180.stderr: 6: (BlueStore::KVSyncThread::entry()+0xd) [0x7f6b83b85e0d]
2017-01-27T18:58:04.543 INFO:teuthology.orchestra.run.smithi180.stderr: 7: (()+0x8184) [0x7f6b7a935184]
2017-01-27T18:58:04.543 INFO:teuthology.orchestra.run.smithi180.stderr: 8: (clone()+0x6d) [0x7f6b7903f37d]

which looks pretty suspicious. Maybe try running it in a loop?

@ifed01
Copy link
Contributor Author

ifed01 commented Feb 2, 2017

Passed 32 iterations without any issues..

@liewegas liewegas merged commit 013b96e into ceph:master Feb 2, 2017
@liewegas
Copy link
Member

liewegas commented Feb 2, 2017

@ifed01 ifed01 deleted the wip-bluestore-onode-refmap3 branch February 2, 2017 14:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants