Commit 2046089

MDEV-14637: Fix hang due to DDL with FOREIGN KEY
When MySQL 5.7.1 introduced WL#6326 to reduce contention on the non-leaf levels of B-trees, it introduced a new rw-lock mode SX (not conflicting with S, but conflicting with SX and X) and new rules to go with it.

A thread that is holding a dict_index_t::lock aka index->lock in SX mode is permitted to acquire non-leaf buf_block_t::lock aka block->lock in X or SX mode, in monotonically descending order. That is, once the thread has acquired a block->lock, it is not allowed to acquire a lock on its parent or grandparent pages. Such arbitrary-order access is only allowed when the thread acquired the index->lock in X mode upfront.

A customer encountered a repeatable hang when loading a dump into InnoDB while using multiple innodb_purge_threads (default: 4). The dump makes very heavy use of FOREIGN KEY constraints.

By luck, it happened so that two purge worker threads (srv_worker_thread) deadlocked with each other. Both were operating on the index FOR_REF of the InnoDB internal table SYS_FOREIGN.

One of them was legitimately holding index->lock S-latch and the root block->lock S-latch.

The other had acquired index->lock SX-latch, root block->lock SX-latch, and a bunch of other latches, including the fil_space_t::latch for freeing some blocks and some leaf page latches. This other thread was inside 2 nested calls to btr_compress() and it was trying to reacquire the root block->lock in X mode, violating the WL#6326 protocol.

This violation led to a deadlock, because while S is compatible with SX and a thread can upgrade an SX lock to X when there are no conflicting requests, in this case there was a conflicting S lock held by the other purge worker thread.

During this deadlock, both threads are holding dict_operation_lock S-latch, which would block any subsequent DDL statements, such as CREATE TABLE.

The tables SYS_FOREIGN and SYS_FOREIGN_COLS are special in that they define key columns of the type VARCHAR(0), created using the InnoDB internal SQL parser. Because InnoDB does not internally enforce the maximum length of columns, it would happily write more than 0 bytes to these columns. This caused a miscalculation of node_ptr_max_size.

btr_cur_will_modify_tree(): Clean up some code. (No functional change.)

btr_node_ptr_max_size(): Renamed from dict_index_node_ptr_max_size(). Use a more realistic maximum size for SYS_FOREIGN and SYS_FOREIGN_COLS.

btr_cur_pessimistic_delete(): Refrain from merging pages if it is not safe.

This work is based on the following MySQL 5.7.23 fix:

commit 58dcf0b4a4165ed59de94a9a1e7d8c954f733726
Author: Aakanksha Verma <aakanksha.verma@oracle.com>
Date: Wed May 9 18:54:03 2018 +0530

    BUG#26225783 MYSQL CRASH ON CREATE TABLE (REPRODUCEABLE) -> INNODB: A LONG SEMAPHORE WAIT
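
As a rough illustration of the latch rules described in the commit message, here is a minimal C++ sketch of the S/SX/X compatibility relation and of why the SX-to-X upgrade could not proceed. The names LatchMode, compatible() and can_upgrade_sx_to_x() are hypothetical and are not InnoDB identifiers; the model encodes only what the message states (S is compatible with SX, SX conflicts with SX and X) plus the usual assumption that X is exclusive.

```cpp
#include <cassert>
#include <vector>

// Hypothetical model of the three rw-lock modes named in the commit message.
enum class LatchMode { S, SX, X };

// Returns true if two different threads may hold the latch in modes a and b
// at the same time. Assumption: X conflicts with every mode; SX conflicts
// with SX and X but not with S, as stated in the message.
bool compatible(LatchMode a, LatchMode b)
{
	if (a == LatchMode::X || b == LatchMode::X) {
		return false;
	}
	return !(a == LatchMode::SX && b == LatchMode::SX);
}

// A thread holding SX can be upgraded to X only if no latch held by another
// thread conflicts with X.
bool can_upgrade_sx_to_x(const std::vector<LatchMode>& held_by_others)
{
	for (LatchMode m : held_by_others) {
		if (!compatible(LatchMode::X, m)) {
			return false;
		}
	}
	return true;
}
```

In this model, can_upgrade_sx_to_x({LatchMode::S}) is false, which corresponds to the situation in the hang: the second purge worker held the root block->lock in S mode while the first one, already holding SX, waited for X.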
1 parent f70a318 commit 2046089

3 files changed: +136, -119 lines changed

storage/innobase/btr/btr0cur.cc

Lines changed: 136 additions & 24 deletions
@@ -1,6 +1,6 @@
 /*****************************************************************************
 
-Copyright (c) 1994, 2016, Oracle and/or its affiliates. All Rights Reserved.
+Copyright (c) 1994, 2018, Oracle and/or its affiliates. All Rights Reserved.
 Copyright (c) 2008, Google Inc.
 Copyright (c) 2012, Facebook Inc.
 Copyright (c) 2015, 2018, MariaDB Corporation.
@@ -586,13 +586,15 @@ btr_cur_will_modify_tree(
 	first record and following compress might delete the record and causes
 	the uppper level node_ptr modification. */
 
+	const ulint n_recs = page_get_n_recs(page);
+
 	if (lock_intention <= BTR_INTENTION_BOTH) {
 		ulint margin;
 
 		/* check delete will cause. (BTR_INTENTION_BOTH
 		or BTR_INTENTION_DELETE) */
 		/* first, 2nd, 2nd-last and last records are 4 records */
-		if (page_get_n_recs(page) < 5) {
+		if (n_recs < 5) {
 			return(true);
 		}
 
@@ -638,8 +640,7 @@ btr_cur_will_modify_tree(
 		/* Once we invoke the btr_cur_limit_optimistic_insert_debug,
 		we should check it here in advance, since the max allowable
 		records in a page is limited. */
-		LIMIT_OPTIMISTIC_INSERT_DEBUG(page_get_n_recs(page),
-					      return(true));
+		LIMIT_OPTIMISTIC_INSERT_DEBUG(n_recs, return true);
 
 		/* needs 2 records' space for the case the single split and
 		insert cannot fit.
@@ -652,18 +653,16 @@ btr_cur_will_modify_tree(
 		    || max_size < rec_size * 2) {
 			return(true);
 		}
-		/* TODO: optimize this condition for compressed page.
-		this is based on the worst compress rate.
-		currently looking only uncompressed page, but we can look
-		also compressed page page_zip_available() if already in the
-		buffer pool */
+
+		/* TODO: optimize this condition for ROW_FORMAT=COMPRESSED.
+		This is based on the worst case, and we could invoke
+		page_zip_available() on the block->page.zip. */
 		/* needs 2 records' space also for worst compress rate. */
 		if (page_size.is_compressed()
 		    && page_zip_empty_size(index->n_fields,
 					   page_size.physical())
-		    < rec_size * 2 + page_get_data_size(page)
-		    + page_dir_calc_reserved_space(
-			    page_get_n_recs(page) + 2) + 1) {
+		    <= rec_size * 2 + page_get_data_size(page)
+		    + page_dir_calc_reserved_space(n_recs + 2)) {
 			return(true);
 		}
 	}
@@ -701,6 +700,98 @@ btr_cur_need_opposite_intention(
 	return(false);
 }
 
+/**
+@param[in] index b-tree
+@return maximum size of a node pointer record in bytes */
+static ulint btr_node_ptr_max_size(const dict_index_t* index)
+{
+	if (dict_index_is_ibuf(index)) {
+		/* cannot estimate accurately */
+		/* This is universal index for change buffer.
+		The max size of the entry is about max key length * 2.
+		(index key + primary key to be inserted to the index)
+		(The max key length is UNIV_PAGE_SIZE / 16 * 3 at
+		 ha_innobase::max_supported_key_length(),
+		 considering MAX_KEY_LENGTH = 3072 at MySQL imposes
+		 the 3500 historical InnoDB value for 16K page size case.)
+		For the universal index, node_ptr contains most of the entry.
+		And 512 is enough to contain ibuf columns and meta-data */
+		return srv_page_size / 8 * 3 + 512;
+	}
+
+	/* Each record has page_no, length of page_no and header. */
+	ulint comp = dict_table_is_comp(index->table);
+	ulint rec_max_size = comp
+		? REC_NODE_PTR_SIZE + 1 + REC_N_NEW_EXTRA_BYTES
+		+ UT_BITS_IN_BYTES(index->n_nullable)
+		: REC_NODE_PTR_SIZE + 2 + REC_N_OLD_EXTRA_BYTES
+		+ 2 * index->n_fields;
+
+	/* Compute the maximum possible record size. */
+	for (ulint i = 0; i < dict_index_get_n_unique_in_tree(index); i++) {
+		const dict_field_t* field
+			= dict_index_get_nth_field(index, i);
+		const dict_col_t* col
+			= dict_field_get_col(field);
+		ulint field_max_size;
+		ulint field_ext_max_size;
+
+		/* Determine the maximum length of the index field. */
+
+		field_max_size = dict_col_get_fixed_size(col, comp);
+		if (field_max_size) {
+			/* dict_index_add_col() should guarantee this */
+			ut_ad(!field->prefix_len
+			      || field->fixed_len == field->prefix_len);
+			/* Fixed lengths are not encoded
+			in ROW_FORMAT=COMPACT. */
+			rec_max_size += field_max_size;
+			continue;
+		}
+
+		field_max_size = dict_col_get_max_size(col);
+		if (UNIV_UNLIKELY(!field_max_size)) {
+			/* SYS_FOREIGN.ID is defined as CHAR in the
+			InnoDB internal SQL parser, which translates
+			into the incorrect VARCHAR(0). InnoDB does
+			not enforce maximum lengths of columns, so
+			that is why any data can be inserted in the
+			first place.
+
+			Likewise, SYS_FOREIGN.FOR_NAME,
+			SYS_FOREIGN.REF_NAME, SYS_FOREIGN_COLS.ID, are
+			defined as CHAR, and also they are part of a key. */
+
+			ut_ad(!strcmp(index->table->name.m_name,
+				      "SYS_FOREIGN")
+			      || !strcmp(index->table->name.m_name,
+					 "SYS_FOREIGN_COLS"));
+			ut_ad(!comp);
+
+			rec_max_size += (srv_page_size == UNIV_PAGE_SIZE_MAX)
+				? REDUNDANT_REC_MAX_DATA_SIZE
+				: page_get_free_space_of_empty(FALSE) / 2;
+		}
+		field_ext_max_size = field_max_size < 256 ? 1 : 2;
+
+		if (field->prefix_len
+		    && field->prefix_len < field_max_size) {
+			field_max_size = field->prefix_len;
+		}
+
+		if (comp) {
+			/* Add the extra size for ROW_FORMAT=COMPACT.
+			For ROW_FORMAT=REDUNDANT, these bytes were
+			added to rec_max_size before this loop. */
+			rec_max_size += field_ext_max_size;
+		}
+
+		rec_max_size += field_max_size;
+	}
+
+	return rec_max_size;
+}
+
 /********************************************************************//**
 Searches an index tree and positions a tree cursor on a given level.
 NOTE: n_fields_cmp in tuple must be set so that it cannot be compared
@@ -1028,7 +1119,7 @@ btr_cur_search_to_nth_level(
 	page_id_t page_id(space, dict_index_get_page(index));
 
 	if (root_leaf_rw_latch == RW_X_LATCH) {
-		node_ptr_max_size = dict_index_node_ptr_max_size(index);
+		node_ptr_max_size = btr_node_ptr_max_size(index);
 	}
 
 	up_match = 0;
@@ -2128,7 +2219,7 @@ btr_cur_open_at_index_side_func(
 	const page_size_t& page_size = dict_table_page_size(index->table);
 
 	if (root_leaf_rw_latch == RW_X_LATCH) {
-		node_ptr_max_size = dict_index_node_ptr_max_size(index);
+		node_ptr_max_size = btr_node_ptr_max_size(index);
 	}
 
 	height = ULINT_UNDEFINED;
@@ -2487,7 +2578,7 @@ btr_cur_open_at_rnd_pos_func(
 	dberr_t err = DB_SUCCESS;
 
 	if (root_leaf_rw_latch == RW_X_LATCH) {
-		node_ptr_max_size = dict_index_node_ptr_max_size(index);
+		node_ptr_max_size = btr_node_ptr_max_size(index);
 	}
 
 	height = ULINT_UNDEFINED;
@@ -5162,7 +5253,6 @@ btr_cur_pessimistic_delete(
 		btr_discard_page(cursor, mtr);
 
 		ret = TRUE;
-
 		goto return_after_reservations;
 	}
 
@@ -5236,23 +5326,45 @@ btr_cur_pessimistic_delete(
 		}
 	}
 
-	page_cur_delete_rec(btr_cur_get_page_cur(cursor), index, offsets, mtr);
+	/* SPATIAL INDEX never use SX locks; we can allow page merges
+	while holding X lock on the spatial index tree.
+	Do not allow merges of non-leaf B-tree pages unless it is
+	safe to do so. */
+	{
+		const bool allow_merge = page_is_leaf(page)
+			|| dict_index_is_spatial(index)
+			|| btr_cur_will_modify_tree(
+				index, page, BTR_INTENTION_DELETE, rec,
+				btr_node_ptr_max_size(index),
+				block->page.size, mtr);
+		page_cur_delete_rec(btr_cur_get_page_cur(cursor), index,
+				    offsets, mtr);
 #ifdef UNIV_ZIP_DEBUG
-	ut_a(!page_zip || page_zip_validate(page_zip, page, index));
+		ut_a(!page_zip || page_zip_validate(page_zip, page, index));
 #endif /* UNIV_ZIP_DEBUG */
 
-	/* btr_check_node_ptr() needs parent block latched */
-	ut_ad(!parent_latched || btr_check_node_ptr(index, block, mtr));
+		ut_ad(!parent_latched
+		      || btr_check_node_ptr(index, block, mtr));
+
+		if (!ret && btr_cur_compress_recommendation(cursor, mtr)) {
+			if (UNIV_LIKELY(allow_merge)) {
+				ret = btr_cur_compress_if_useful(
+					cursor, FALSE, mtr);
+			} else {
+				ib::warn() << "Not merging page "
+					   << block->page.id
+					   << " in index " << index->name
+					   << " of " << index->table->name;
+				ut_ad(!"MDEV-14637");
+			}
+		}
+	}
 
 return_after_reservations:
 	*err = DB_SUCCESS;
 
 	mem_heap_free(heap);
 
-	if (ret == FALSE) {
-		ret = btr_cur_compress_if_useful(cursor, FALSE, mtr);
-	}
-
 	if (!srv_read_only_mode
 	    && page_is_leaf(page)
 	    && !dict_index_is_online_ddl(index)) {
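
The size estimate performed by btr_node_ptr_max_size() in this file can be approximated with a simplified, standalone model. The sketch below is not InnoDB code: KeyField, node_ptr_max_size_model() and the header and fallback parameters are hypothetical stand-ins for the dictionary structures and constants used in the patch. It only mirrors the control flow: fixed-length fields add their size, variable-length fields add their declared maximum (capped by a prefix length), and a declared maximum of 0 adds a caller-supplied fallback instead of contributing nothing.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified description of one key field.
struct KeyField {
	std::size_t fixed_size;  // nonzero for fixed-length columns
	std::size_t max_size;    // declared maximum; 0 models the VARCHAR(0) case
	std::size_t prefix_len;  // 0 if the whole column is indexed
};

// Worst-case size of a node pointer record in this model.
// header:   bytes accounted for before the loop (pointer, record header, ...)
// compact:  true to add per-field length bytes, as the patch does for
//           ROW_FORMAT=COMPACT
// fallback: bytes assumed for a variable-length field whose declared maximum
//           is 0 (an assumption standing in for the value chosen in the patch)
std::size_t node_ptr_max_size_model(const std::vector<KeyField>& fields,
				    std::size_t header, bool compact,
				    std::size_t fallback)
{
	std::size_t total = header;

	for (const KeyField& f : fields) {
		if (f.fixed_size) {
			total += f.fixed_size;
			continue;
		}

		std::size_t max_size = f.max_size;
		if (max_size == 0) {
			// The declared length cannot be trusted.
			total += fallback;
		}

		// Length bytes are derived from the declared maximum,
		// before the prefix length is applied.
		const std::size_t ext = max_size < 256 ? 1 : 2;

		if (f.prefix_len && f.prefix_len < max_size) {
			max_size = f.prefix_len;
		}
		if (compact) {
			total += ext;
		}
		total += max_size;
	}

	return total;
}
```

Calling the model with fallback set to 0 reproduces the underestimate described in the commit message: a key made only of zero-length variable columns then contributes nothing beyond the header.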

storage/innobase/dict/dict0dict.cc

Lines changed: 0 additions & 87 deletions
@@ -2139,93 +2139,6 @@ dict_col_name_is_reserved(
 	return(FALSE);
 }
 
-/****************************************************************//**
-Return maximum size of the node pointer record.
-@return maximum size of the record in bytes */
-ulint
-dict_index_node_ptr_max_size(
-/*=========================*/
-	const dict_index_t* index) /*!< in: index */
-{
-	ulint comp;
-	ulint i;
-	/* maximum possible storage size of a record */
-	ulint rec_max_size;
-
-	if (dict_index_is_ibuf(index)) {
-		/* cannot estimate accurately */
-		/* This is universal index for change buffer.
-		The max size of the entry is about max key length * 2.
-		(index key + primary key to be inserted to the index)
-		(The max key length is UNIV_PAGE_SIZE / 16 * 3 at
-		 ha_innobase::max_supported_key_length(),
-		 considering MAX_KEY_LENGTH = 3072 at MySQL imposes
-		 the 3500 historical InnoDB value for 16K page size case.)
-		For the universal index, node_ptr contains most of the entry.
-		And 512 is enough to contain ibuf columns and meta-data */
-		return(UNIV_PAGE_SIZE / 8 * 3 + 512);
-	}
-
-	comp = dict_table_is_comp(index->table);
-
-	/* Each record has page_no, length of page_no and header. */
-	rec_max_size = comp
-		? REC_NODE_PTR_SIZE + 1 + REC_N_NEW_EXTRA_BYTES
-		: REC_NODE_PTR_SIZE + 2 + REC_N_OLD_EXTRA_BYTES;
-
-	if (comp) {
-		/* Include the "null" flags in the
-		maximum possible record size. */
-		rec_max_size += UT_BITS_IN_BYTES(index->n_nullable);
-	} else {
-		/* For each column, include a 2-byte offset and a
-		"null" flag. */
-		rec_max_size += 2 * index->n_fields;
-	}
-
-	/* Compute the maximum possible record size. */
-	for (i = 0; i < dict_index_get_n_unique_in_tree(index); i++) {
-		const dict_field_t* field
-			= dict_index_get_nth_field(index, i);
-		const dict_col_t* col
-			= dict_field_get_col(field);
-		ulint field_max_size;
-		ulint field_ext_max_size;
-
-		/* Determine the maximum length of the index field. */
-
-		field_max_size = dict_col_get_fixed_size(col, comp);
-		if (field_max_size) {
-			/* dict_index_add_col() should guarantee this */
-			ut_ad(!field->prefix_len
-			      || field->fixed_len == field->prefix_len);
-			/* Fixed lengths are not encoded
-			in ROW_FORMAT=COMPACT. */
-			rec_max_size += field_max_size;
-			continue;
-		}
-
-		field_max_size = dict_col_get_max_size(col);
-		field_ext_max_size = field_max_size < 256 ? 1 : 2;
-
-		if (field->prefix_len
-		    && field->prefix_len < field_max_size) {
-			field_max_size = field->prefix_len;
-		}
-
-		if (comp) {
-			/* Add the extra size for ROW_FORMAT=COMPACT.
-			For ROW_FORMAT=REDUNDANT, these bytes were
-			added to rec_max_size before this loop. */
-			rec_max_size += field_ext_max_size;
-		}
-
-		rec_max_size += field_max_size;
-	}
-
-	return(rec_max_size);
-}
-
 /****************************************************************//**
 If a record of this index might not fit on a single B-tree page,
 return TRUE.

storage/innobase/include/dict0dict.h

Lines changed: 0 additions & 8 deletions
@@ -1952,14 +1952,6 @@ const char*
 dict_tf_to_row_format_string(
 /*=========================*/
 	ulint table_flag); /*!< in: row format setting */
-/****************************************************************//**
-Return maximum size of the node pointer record.
-@return maximum size of the record in bytes */
-ulint
-dict_index_node_ptr_max_size(
-/*=========================*/
-	const dict_index_t* index) /*!< in: index */
-	MY_ATTRIBUTE((warn_unused_result));
 
 #define dict_col_is_virtual(col) (col)->is_virtual()
 
