
MDEV-33588 buf::Block_hint is a performance hog#3098

Merged
dr-m merged 1 commit into 10.6 from 10.6-MDEV-33588
Apr 9, 2024

Conversation

Contributor

@dr-m dr-m commented Mar 4, 2024

  • The Jira issue number for this PR is: MDEV-33588

Description

The class buf::Block_hint that was added in order to fix MDEV-23693 is an unnecessary performance hog that causes unnecessary lookups on buf_pool.page_hash.

Originally, I was thinking that this function could be removed as part of rewriting the buffer pool resizing in MDEV-29445, but it turns out that we can do it independently of that. We only need to add a field btr_pcur_t::old_page_id to remember the page identifier where the persistent cursor position was stored, and pass this identifier to buf_page_optimistic_get(), which will have to perform a buf_pool.page_hash lookup anyway.

buf_page_optimistic_fix(): Prepare for buf_page_optimistic_get(). This basically is a simpler version of buf::Block_hint.

buf_page_optimistic_get(): Assume that buf_page_optimistic_fix() has been called and the page identifier validated. Should the block be evicted, the block->modify_clock will be invalidated; we do not need to check the block->page.id() again. It suffices to check the block->modify_clock after acquiring the page latch.

btr_pcur_optimistic_latch_leaves(): Remove redundant parameters. First, invoke buf_page_optimistic_get() on the requested page. If needed, acquire a latch on the left page. Finally, acquire a latch on the target page and recheck the block->modify_clock, and that the left page (if any) points to the current page.

btr_latch_prev(): Always append the previous page to mtr_t::m_memo after the current page, even if we need to wait for the page latch. If the FIL_PAGE_PREV of the current page was changed while we were not holding a latch, fetch the previous page again. (The block->modify_clock is not incremented during page merges!) Add some consistency checks.

btr_pcur_move_backward_from_page(): Simplify the logic, now that the left block will be at the top of mtr_t::m_memo.

Release Notes

Nothing needed; this is a rather small performance improvement.

How can this PR be tested?

The function btr_pcur_move_backward_from_page() is not covered by the regression test suite. It is covered by RQG, as we learned back when working on MDEV-30400 and MDEV-29835.

Basing the PR against the correct MariaDB version

  • This is a new feature and the PR is based against the latest MariaDB development branch.
  • This is a bug fix and the PR is based against the earliest maintained branch in which the bug can be reproduced.

This is a performance bug fix that depends on earlier substantial refactoring that is only available starting with 10.6.

PR quality check

  • I checked the CODING_STANDARDS.md file and my PR conforms to this where appropriate.
  • For any trivial modifications to the PR, I am ok with the reviewer making the changes themselves.

@dr-m dr-m self-assigned this Mar 4, 2024

release_left_block:
mtr->release_last_page();
return false;
mtr->lock_register(savepoint + 1, mtr_memo_type_t(mode << 5));
Contributor Author

This code path is obviously not covered by our regression test suite, but it causes massive amounts of failures in @mleich1’s RQG test runs. It should be savepoint and not savepoint + 1. Typically we would have savepoint==1 here, corresponding to a freshly started mini-transaction that holds block at mtr->m_memo slot 0 and left at mtr->m_memo slot 1.

Contributor Author

b56d2eb addresses this (in another branch).

Contributor Author

There is still something wrong with btr_pcur_move_backward_from_page(). I hope to get an rr replay trace for that soon.

Contributor Author

9b169a6 hopefully addresses this by invoking btr_latch_prev() and by improving that function so that the preceding block is always mtr->at_savepoint(1).

@dr-m dr-m force-pushed the 10.6-MDEV-33588 branch from 1696d88 to 9b169a6 Compare March 7, 2024 11:18
Comment on lines +1012 to 1034
const page_t *const p= prev->page.frame;
if (memcmp_aligned<4>(FIL_PAGE_NEXT + p, FIL_PAGE_OFFSET + page, 4) ||
memcmp_aligned<2>(FIL_PAGE_TYPE + p, FIL_PAGE_TYPE + page, 2) ||
memcmp_aligned<2>(PAGE_HEADER + PAGE_INDEX_ID + p,
PAGE_HEADER + PAGE_INDEX_ID + page, 8) ||
page_is_comp(p) != page_is_comp(page))
{
ut_ad("corrupted" == 0); // FIXME: remove this
Contributor Author

Matthias reported an assertion failure here, due to the FIL_PAGE_NEXT check. In the rr replay trace that I analyzed, the cause is that while btr_latch_prev() was waiting for the previous page latch in buf_page_get_gen(), there was a btr_page_split_and_insert() executed in another thread.

I think that we must implement a retry logic in case the FIL_PAGE_PREV of our current page changed while we were not holding a page latch.

Contributor Author

a4389ec should address this.

Contributor

Indeed, latching right to left is not straightforward. AFAIK, we don't need such a latching order in the general case. It is needed for backward scans and perhaps some other special cases. In 8.0, for pessimistic operations, the order is allowed to reverse for non-leaf pages under certain strict latching rules (Yasufumi's SX latching improvement).

Can you tell me the scenario in which we have this flow? I have yet to see the details, but I hope it is not in the general DML/query path, if the design is not significantly different from 5.7/8.0.

Contributor Author

The SX latch mode was implemented in MySQL 5.7 already. I renamed it to U in the sux_lock. It is not free from hangs; see f209647 for a prominent example.

This code is mostly being used in btr_pcur_move_backward_from_page() but also in the change buffer (BTR_MODIFY_PREV).

Contributor

@mariadb-DebarunBanerjee mariadb-DebarunBanerjee left a comment

Hi Marko,

The patch is doing way more than removing buf::Block_hint. In fact it is making a major design change.

I checked the design, and it doesn't look very promising. The current design is to buffer-fix the page and take the locks in previous-current-next order. This is used for the usual scan and for repositioning. During repositioning, the only possible conflict is a concurrent pessimistic operation.

The modified design does a try-latch in reverse order, which is bound to conflict with any concurrent query and cause repeated retries (which we need to do). Historically, the reverse scan is also known to have performance issues.

I don't agree with the design change, and I think it will cause issues for us in the future. Maybe if you can explain the reason for the performance hog, we can improve it without changing the latch order.

Regards,
Deb


Comment on lines 3007 to 3010
{
transactional_shared_lock_guard<page_hash_latch> g
{buf_pool.page_hash.lock_get(chain)};
if (UNIV_UNLIKELY(id != block->page.id() || !block->page.frame))
if (UNIV_UNLIKELY(!buf_pool.is_uncompressed(block) ||
id != block->page.id() || !block->page.frame))
return false;
Contributor Author

I tested this with 79e3a65 (#3107) to see if the transactional_lock_guard works well:

perf record -e tx-abort ./mtr innodb.innodb_buffer_pool_resize
perf report

In that branch, it is OK; we have many more xabort() events in the buffer pool lookup functions:

Samples: 5K of event 'tx-abort', Event count (approx.): 101716
Overhead  Command   Shared Object  Symbol
  99,98%  mariadbd  mariadbd       [.] buf_page_get_low(page_id_t, unsigned long, unsigned long, buf_block_t*, unsigned long, mtr_t*, dberr_t*, bool)
   0,01%  mariadbd  mariadbd       [.] buf_page_create_low(page_id_t, unsigned long, mtr_t*, buf_block_t*)
   0,00%  mariadbd  mariadbd       [.] buf_page_optimistic_get(rw_lock_type_t, buf_block_t*, page_id_t, unsigned long, mtr_t*)

In a longer sysbench benchmark, there were more xabort() in the optimistic function:

Samples: 64K of event 'tx-abort', Event count (approx.): 476326
Overhead  Command   Shared Object  Symbol
  83,71%  mariadbd  mariadbd       [.] buf_page_get_low(page_id_t, unsigned long, unsigned long, buf_block_t*, unsigned long, mtr_t*, dberr_t*, bool)
  15,73%  mariadbd  mariadbd       [.] buf_page_optimistic_get(rw_lock_type_t, buf_block_t*, page_id_t, unsigned long, mtr_t*)
   0,33%  mariadbd  mariadbd       [.] TMLockGuard::TMLockGuard(lock_sys_t::hash_table&, page_id_t)
   0,10%  mariadbd  mariadbd       [.] buf_page_create_low(page_id_t, unsigned long, mtr_t*, buf_block_t*)
   0,08%  mariadbd  mariadbd       [.] buf_page_free(fil_space_t*, unsigned int, mtr_t*)
   0,05%  mariadbd  mariadbd       [.] trx_t::commit_low(mtr_t*)

For this branch, the traversal of buf_pool.chunks could make the memory transaction too big. I reran the same workload on a merge of 4ac8c4c and a4389ec, and got this:

Samples: 18K of event 'tx-abort', Event count (approx.): 278138
Overhead  Command   Shared Object  Symbol
  99,55%  mariadbd  mariadbd       [.] buf_page_get_low(page_id_t, unsigned long, unsigned long, buf_block_t*, unsigned long, mtr_t*, dberr_t*, bool)
   0,20%  mariadbd  mariadbd       [.] buf_page_init_for_read(unsigned long, page_id_t, unsigned long, bool)
   0,13%  mariadbd  mariadbd       [.] buf_page_create_low(page_id_t, unsigned long, mtr_t*, buf_block_t*)
   0,06%  mariadbd  mariadbd       [.] buf_page_optimistic_get(rw_lock_type_t, buf_block_t*, page_id_t, unsigned long, mtr_t*)
   0,04%  mariadbd  mariadbd       [.] trx_t::commit_low(mtr_t*)
   0,02%  mariadbd  mariadbd       [.] TMLockGuard::TMLockGuard(lock_sys_t::hash_table&, page_id_t)

As we can see, there does not appear to be any issue with buf_page_optimistic_get() here either. There are fewer samples, though: the server did not complete the innodb_fast_shutdown=0 shutdown. I had to kill the process because it hung after running out of buffer pool. With #3107 we would abort the buffer pool resize in that scenario.

Contributor Author

The removed Block_hint::buffer_fix_block_if_still_valid() was also essentially performing a buf_pool.page_hash lookup, for acquiring a latch on the hash table slice.

One thing that we can do better is to only buffer-fix the block while holding buf_pool.page_hash, and try to acquire the buf_page_t::lock after we have released the buf_pool.page_hash. The buffer-fix will prevent eviction.

Contributor Author

With the ccd4b96 simplification of buf_page_optimistic_get(), the throughput numbers for my 60-second 64-thread 64×100k row Sysbench oltp_update_index were virtually unchanged between that commit and its parent.

@dr-m dr-m force-pushed the 10.6-MDEV-33588 branch 2 times, most recently from ccd4b96 to c10fd5d Compare March 15, 2024 09:07
Contributor Author

dr-m commented Mar 15, 2024

@mariadb-DebarunBanerjee, thank you for the review. c10fd5d addresses some of your concerns. Please re-review.

The retry logic in case FIL_PAGE_PREV was changed is addressing something for which I got an rr replay trace. Because buf_block_t::modify_clock is not being incremented when FIL_PAGE_PREV is changed, we’d better validate that field. I originally had 91a5e4f included in the revised fix, and in my 60-second, 64-thread Sysbench oltp_update_index test the throughput was some 0.5% better than 10.6.

Contributor Author

@dr-m dr-m left a comment

As discussed today, I must try to create an mtr test case that covers the btr_pcur_move_backward_from_page() code path.

Contributor

@mariadb-DebarunBanerjee mariadb-DebarunBanerjee left a comment

@dr-m The latest changes look good to me. It changes the design, and hence there is some risk, but I think it is for the good. Let's go ahead.

Added some minor comments.

Comment on lines -2864 to -2870
ut_ad(!block->page.is_read_fixed());
if (UNIV_UNLIKELY(block->page.id() != page_id)) {
block->page.lock.s_unlock();
block->page.lock.x_lock();
goto page_id_mismatch;
}
get_latch_valid:
Contributor

For my edification, why is the page ID mismatch check now not needed in the "else" part of the "if" condition?

 if (state >= buf_page_t::UNFIXED
            && allow_ibuf_merge
            && fil_page_get_type(block->page.frame) == FIL_PAGE_INDEX
            && page_is_leaf(block->page.frame)) {

Contributor Author

The function buf_pool_t::corrupted_evict() does two things before releasing the exclusive page latch: Invalidate the page ID, and set the state to FREED while preserving the buffer-fix count. In the else branch we detect the latter:

		if (UNIV_UNLIKELY(state < buf_page_t::UNFIXED)) {
			mtr->release_last_page();
			goto ignore_unfixed;
		}

In the if branch the condition for page ID mismatch could in fact be redundant; the goto ignore_block or goto ignore_unfixed for the state mismatch should do the same. It is the responsibility of buf_page_t::read_complete() to invoke buf_pool_t::corrupted_evict() when needed. I will revise the logic and remove the redundant check for page ID mismatch.

@dr-m dr-m force-pushed the 10.6-MDEV-33588 branch from 8d8ebbc to 5c450f7 Compare April 4, 2024 14:19
@dr-m dr-m force-pushed the 10.6-MDEV-33588 branch from acc64aa to ac9b938 Compare April 9, 2024 09:24
In so-called optimistic buffer pool lookups, we must not
dereference a block descriptor before we have made sure that
it is accessible. While buf_pool_t::resize() is running,
block descriptors could become invalid.

The buf::Block_hint class was essentially duplicating
a buf_pool.page_hash lookup that was executed in
buf_page_optimistic_get() anyway. For better locality of
reference, we had better execute that lookup only once.

buf_page_optimistic_fix(): Prepare for buf_page_optimistic_get().
This basically is a simpler version of buf::Block_hint.

buf_page_optimistic_get(): Assume that buf_page_optimistic_fix()
has been called and the page identifier verified. Should the block
be evicted, the block->modify_clock will be invalidated; we do not
need to check the block->page.id() again. It suffices to check
the block->modify_clock after acquiring the page latch.

btr_pcur_t::old_page_id: Store the expected page identifier
for buf_page_optimistic_fix().

btr_pcur_t::block_when_stored: Remove. This was duplicating
page_cur_t::block.

btr_pcur_optimistic_latch_leaves(): Remove redundant parameters.
First, invoke buf_page_optimistic_fix() on the requested page.
If needed, acquire a latch on the left page. Finally, acquire
a latch on the target page and recheck the block->modify_clock.
If the page had been freed while we were not holding a page latch,
fall back to the slow path. Validate the FIL_PAGE_PREV after
acquiring a latch on the current page. The block->modify_clock
is only being incremented when records are deleted or pages
reorganized or evicted; it does not guard against concurrent
page splits.

Reviewed by: Debarun Banerjee
@dr-m dr-m force-pushed the 10.6-MDEV-33588 branch from ac9b938 to a4cda66 Compare April 9, 2024 09:48
@dr-m dr-m merged commit a4cda66 into 10.6 Apr 9, 2024
@dr-m dr-m deleted the 10.6-MDEV-33588 branch April 9, 2024 10:25