
MDEV-33588 buf::Block_hint is a performance hog#3098

Merged
dr-m merged 1 commit into 10.6 from 10.6-MDEV-33588
Apr 9, 2024

Conversation

Contributor

@dr-m dr-m commented Mar 4, 2024

  • The Jira issue number for this PR is: MDEV-33588

Description

The class buf::Block_hint that was added in order to fix MDEV-23693 is an unnecessary performance hog that causes unnecessary lookups on buf_pool.page_hash.

Originally, I was thinking that this function could be removed as part of rewriting the buffer pool resizing in MDEV-29445, but it turns out that we can do it independently of that. We only need to add a field btr_pcur_t::old_page_id to remember the page identifier where the persistent cursor position was stored, and pass this identifier to buf_page_optimistic_get(), which will have to perform a buf_pool.page_hash lookup anyway.

buf_page_optimistic_fix(): Prepare for buf_page_optimistic_get(). This basically is a simpler version of buf::Block_hint.

buf_page_optimistic_get(): Assume that buf_page_optimistic_fix() has been called and the page identifier validated. Should the block be evicted, the block->modify_clock will be invalidated; we do not need to check the block->page.id() again. It suffices to check the block->modify_clock after acquiring the page latch.

btr_pcur_optimistic_latch_leaves(): Remove redundant parameters. First, invoke buf_page_optimistic_get() on the requested page. If needed, acquire a latch on the left page. Finally, acquire a latch on the target page and recheck the block->modify_clock, and that the left page (if any) points to the current page.

btr_latch_prev(): Always append the previous page to mtr_t::m_memo after the current page, even if we need to wait for the page latch. If the FIL_PAGE_PREV of the current page was changed while we were not holding a latch, fetch the previous page again. (The block->modify_clock is not incremented during page merges!) Add some consistency checks.

btr_pcur_move_backward_from_page(): Simplify the logic, now that the left block will be at the top of mtr_t::m_memo.

Release Notes

Nothing needed; this is a rather small performance improvement.

How can this PR be tested?

The function btr_pcur_move_backward_from_page() is not covered by the regression test suite. It is covered by RQG, as we learned back when working on MDEV-30400 and MDEV-29835.

Basing the PR against the correct MariaDB version

  • This is a new feature and the PR is based against the latest MariaDB development branch.
  • This is a bug fix and the PR is based against the earliest maintained branch in which the bug can be reproduced.

This is a performance bug fix that depends on earlier substantial refactoring that is only available starting with 10.6.

PR quality check

  • I checked the CODING_STANDARDS.md file and my PR conforms to this where appropriate.
  • For any trivial modifications to the PR, I am ok with the reviewer making the changes themselves.

@dr-m dr-m self-assigned this Mar 4, 2024

release_left_block:
mtr->release_last_page();
return false;
mtr->lock_register(savepoint + 1, mtr_memo_type_t(mode << 5));
Contributor Author

This code path is obviously not covered by our regression test suite, but it causes massive amounts of failures in @mleich1’s RQG test runs. It should be savepoint and not savepoint + 1. Typically we would have savepoint==1 here, corresponding to a freshly started mini-transaction that holds block at mtr->m_memo slot 0 and left at mtr->m_memo slot 1.

Contributor Author

b56d2eb addresses this (in another branch).

Contributor Author

There is still something wrong with btr_pcur_move_backward_from_page(). I hope to get an rr replay trace for that soon.

Contributor Author

9b169a6 hopefully addresses this by invoking btr_latch_prev() and by improving that function so that the preceding block is always mtr->at_savepoint(1).

@dr-m dr-m force-pushed the 10.6-MDEV-33588 branch from 1696d88 to 9b169a6 Compare March 7, 2024 11:18
Comment on lines +1012 to 1034
const page_t *const p= prev->page.frame;
if (memcmp_aligned<4>(FIL_PAGE_NEXT + p, FIL_PAGE_OFFSET + page, 4) ||
memcmp_aligned<2>(FIL_PAGE_TYPE + p, FIL_PAGE_TYPE + page, 2) ||
memcmp_aligned<2>(PAGE_HEADER + PAGE_INDEX_ID + p,
PAGE_HEADER + PAGE_INDEX_ID + page, 8) ||
page_is_comp(p) != page_is_comp(page))
{
ut_ad("corrupted" == 0); // FIXME: remove this
Contributor Author

Matthias reported an assertion failure here, due to the FIL_PAGE_NEXT check. In the rr replay trace that I analyzed, the cause is that while btr_latch_prev() was waiting for the previous page latch in buf_page_get_gen(), there was a btr_page_split_and_insert() executed in another thread.

I think that we must implement a retry logic in case the FIL_PAGE_PREV of our current page changed while we were not holding a page latch.

Contributor Author

a4389ec should address this.

Contributor

Indeed, latching right to left is not straightforward. AFAIK, we don't need such a latching order in the general case. It is needed for backward scans and perhaps some other special cases. In 8.0, for pessimistic operations, the order is allowed to reverse for non-leaf pages under certain strict latching rules (Yasufumi's SX latching improvement).

Can you tell me the scenario in which we have this flow? I have yet to see the details, but I hope it is not in the general DML/query path, if the design is not significantly different from 5.7/8.0.

Contributor Author

The SX latch mode was implemented in MySQL 5.7 already. I renamed it to U in the sux_lock. It is not free from hangs; see f209647 for a prominent example.

This code is mostly being used in btr_pcur_move_backward_from_page() but also in the change buffer (BTR_MODIFY_PREV).

Contributor

@mariadb-DebarunBanerjee mariadb-DebarunBanerjee left a comment

Hi Marko,

The patch is doing way more than removing buf::Block_hint. In fact it is making a major design change.

I checked the design, and it doesn't look very promising. The current design is to buffer-fix the page and take the locks in previous-current-next order. This is used for the usual scan and for repositioning. During repositioning, the only possible conflict is a concurrent pessimistic operation.

The modified design does a try-latch in reverse order, which is bound to conflict with any concurrent query and cause repeated retries (which we need to do). Historically, the reverse scan is also known to have performance issues.

I don't agree with the design change, and I think it will cause issues for us in the future. Maybe if you can explain the reason for the performance hog, we can improve it without changing the latch order.

Regards,
Deb


Comment on lines 3007 to 3010
{
transactional_shared_lock_guard<page_hash_latch> g
{buf_pool.page_hash.lock_get(chain)};
if (UNIV_UNLIKELY(id != block->page.id() || !block->page.frame))
if (UNIV_UNLIKELY(!buf_pool.is_uncompressed(block) ||
id != block->page.id() || !block->page.frame))
return false;
Contributor Author

I tested this with 79e3a65 (#3107) to see if the transactional_lock_guard works well:

perf record -e tx-abort ./mtr innodb.innodb_buffer_pool_resize
perf report

In that branch, it is OK; we have many more xabort() events in the buffer pool lookup functions:

Samples: 5K of event 'tx-abort', Event count (approx.): 101716
Overhead  Command   Shared Object  Symbol
  99,98%  mariadbd  mariadbd       [.] buf_page_get_low(page_id_t, unsigned long, unsigned long, buf_block_t*, unsigned long, mtr_t*, dberr_t*, bool)
   0,01%  mariadbd  mariadbd       [.] buf_page_create_low(page_id_t, unsigned long, mtr_t*, buf_block_t*)
   0,00%  mariadbd  mariadbd       [.] buf_page_optimistic_get(rw_lock_type_t, buf_block_t*, page_id_t, unsigned long, mtr_t*)

In a longer sysbench benchmark, there were more xabort() in the optimistic function:

Samples: 64K of event 'tx-abort', Event count (approx.): 476326
Overhead  Command   Shared Object  Symbol
  83,71%  mariadbd  mariadbd       [.] buf_page_get_low(page_id_t, unsigned long, unsigned long, buf_block_t*, unsigned long, mtr_t*, dberr_t*, bool)
  15,73%  mariadbd  mariadbd       [.] buf_page_optimistic_get(rw_lock_type_t, buf_block_t*, page_id_t, unsigned long, mtr_t*)
   0,33%  mariadbd  mariadbd       [.] TMLockGuard::TMLockGuard(lock_sys_t::hash_table&, page_id_t)
   0,10%  mariadbd  mariadbd       [.] buf_page_create_low(page_id_t, unsigned long, mtr_t*, buf_block_t*)
   0,08%  mariadbd  mariadbd       [.] buf_page_free(fil_space_t*, unsigned int, mtr_t*)
   0,05%  mariadbd  mariadbd       [.] trx_t::commit_low(mtr_t*)

For this branch, the traversal of buf_pool.chunks could make the memory transaction too big. I reran the same workload on a merge of 4ac8c4c and a4389ec, and got this:

Samples: 18K of event 'tx-abort', Event count (approx.): 278138
Overhead  Command   Shared Object  Symbol
  99,55%  mariadbd  mariadbd       [.] buf_page_get_low(page_id_t, unsigned long, unsigned long, buf_block_t*, unsigned long, mtr_t*, dberr_t*, bool)
   0,20%  mariadbd  mariadbd       [.] buf_page_init_for_read(unsigned long, page_id_t, unsigned long, bool)
   0,13%  mariadbd  mariadbd       [.] buf_page_create_low(page_id_t, unsigned long, mtr_t*, buf_block_t*)
   0,06%  mariadbd  mariadbd       [.] buf_page_optimistic_get(rw_lock_type_t, buf_block_t*, page_id_t, unsigned long, mtr_t*)
   0,04%  mariadbd  mariadbd       [.] trx_t::commit_low(mtr_t*)
   0,02%  mariadbd  mariadbd       [.] TMLockGuard::TMLockGuard(lock_sys_t::hash_table&, page_id_t)

As we can see, there does not appear to be any issue with buf_page_optimistic_get() here either. There are fewer samples, though: the server did not complete the innodb_fast_shutdown=0 shutdown. I had to kill the process because it hung after running out of buffer pool. With #3107 we would abort the buffer pool resize in that scenario.

Contributor Author

The removed Block_hint::buffer_fix_block_if_still_valid() was also essentially performing a buf_pool.page_hash lookup, for acquiring a latch on the hash table slice.

One thing that we can do better is to only buffer-fix the block while holding buf_pool.page_hash, and try to acquire the buf_page_t::lock after we have released the buf_pool.page_hash. The buffer-fix will prevent eviction.

Contributor Author

With the ccd4b96 simplification of buf_page_optimistic_get(), the throughput numbers for my 60-second 64-thread 64×100k row Sysbench oltp_update_index were virtually unchanged between that commit and its parent.

@dr-m dr-m force-pushed the 10.6-MDEV-33588 branch 2 times, most recently from ccd4b96 to c10fd5d Compare March 15, 2024 09:07
Contributor Author

dr-m commented Mar 15, 2024

@mariadb-DebarunBanerjee, thank you for the review. c10fd5d addresses some of your concerns. Please re-review.

The retry logic in case FIL_PAGE_PREV was changed is addressing something for which I got an rr replay trace. Because buf_block_t::modify_clock is not being incremented when FIL_PAGE_PREV is changed, we’d better validate that field. I originally had 91a5e4f included in the revised fix, and in my 60-second, 64-thread Sysbench oltp_update_index test the throughput was some 0.5% better than 10.6.

Contributor Author

@dr-m dr-m left a comment

As discussed today, I must try to create an mtr test case that covers the btr_pcur_move_backward_from_page() code path.

Contributor

@mariadb-DebarunBanerjee mariadb-DebarunBanerjee left a comment

@dr-m The latest changes look good to me. It changes the design, and hence there is some risk, but I think it is for the good. Let's go ahead.

Added some minor comments.

Comment on lines -2864 to -2870
ut_ad(!block->page.is_read_fixed());
if (UNIV_UNLIKELY(block->page.id() != page_id)) {
block->page.lock.s_unlock();
block->page.lock.x_lock();
goto page_id_mismatch;
}
get_latch_valid:
Contributor

For my edification, why is the page ID mismatch check now not needed in the "else" part of the "if" condition?

 if (state >= buf_page_t::UNFIXED
            && allow_ibuf_merge
            && fil_page_get_type(block->page.frame) == FIL_PAGE_INDEX
            && page_is_leaf(block->page.frame)) {

Contributor Author

The function buf_pool_t::corrupted_evict() does two things before releasing the exclusive page latch: Invalidate the page ID, and set the state to FREED while preserving the buffer-fix count. In the else branch we detect the latter:

		if (UNIV_UNLIKELY(state < buf_page_t::UNFIXED)) {
			mtr->release_last_page();
			goto ignore_unfixed;
		}

In the if branch the condition for page ID mismatch could in fact be redundant; the goto ignore_block or goto ignore_unfixed for the state mismatch should do the same. It is the responsibility of buf_page_t::read_complete() to invoke buf_pool_t::corrupted_evict() when needed. I will revise the logic and remove the redundant check for page ID mismatch.

@dr-m dr-m force-pushed the 10.6-MDEV-33588 branch from 8d8ebbc to 5c450f7 Compare April 4, 2024 14:19
@dr-m dr-m force-pushed the 10.6-MDEV-33588 branch from acc64aa to ac9b938 Compare April 9, 2024 09:24
In so-called optimistic buffer pool lookups, we must not
dereference a block descriptor before we have made sure that
it is accessible. While buf_pool_t::resize() is running,
block descriptors could become invalid.

The buf::Block_hint class was essentially duplicating
a buf_pool.page_hash lookup that was executed in
buf_page_optimistic_get() anyway. For better locality of
reference, we had better execute that lookup only once.

buf_page_optimistic_fix(): Prepare for buf_page_optimistic_get().
This basically is a simpler version of buf::Block_hint.

buf_page_optimistic_get(): Assume that buf_page_optimistic_fix()
has been called and the page identifier verified. Should the block
be evicted, the block->modify_clock will be invalidated; we do not
need to check the block->page.id() again. It suffices to check
the block->modify_clock after acquiring the page latch.

btr_pcur_t::old_page_id: Store the expected page identifier
for buf_page_optimistic_fix().

btr_pcur_t::block_when_stored: Remove. This was duplicating
page_cur_t::block.

btr_pcur_optimistic_latch_leaves(): Remove redundant parameters.
First, invoke buf_page_optimistic_fix() on the requested page.
If needed, acquire a latch on the left page. Finally, acquire
a latch on the target page and recheck the block->modify_clock.
If the page had been freed while we were not holding a page latch,
fall back to the slow path. Validate the FIL_PAGE_PREV after
acquiring a latch on the current page. The block->modify_clock
is only being incremented when records are deleted or pages
reorganized or evicted; it does not guard against concurrent
page splits.

Reviewed by: Debarun Banerjee
@dr-m dr-m force-pushed the 10.6-MDEV-33588 branch from ac9b938 to a4cda66 Compare April 9, 2024 09:48
@dr-m dr-m merged commit a4cda66 into 10.6 Apr 9, 2024
@dr-m dr-m deleted the 10.6-MDEV-33588 branch April 9, 2024 10:25