MDEV-15132 Avoid accessing the TRX_SYS page
InnoDB maintains an internal persistent sequence of transaction
identifiers. This sequence is used for assigning both transaction
start identifiers (DB_TRX_ID=trx->id) and end identifiers (trx->no)
as well as end identifiers for the mysql.transaction_registry table
that was introduced in MDEV-12894.

TRX_SYS_TRX_ID_WRITE_MARGIN: Remove. After this many updates of
the sequence we used to update the TRX_SYS page. We can avoid accessing
the TRX_SYS page if we modify the InnoDB startup so that it resurrects
the sequence from other pages of the transaction system.
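
For context, the scheme being removed can be sketched as a small standalone
model. This is an illustration only, not InnoDB code: OldTrxSys, persisted,
page_writes and run_then_restart() are hypothetical names, and the constant
mirrors TRX_SYS_TRX_ID_WRITE_MARGIN. It shows why every 256th assignment
touched the TRX_SYS page and why startup had to skip ahead.

```cpp
#include <cstdint>

using trx_id_t = std::uint64_t;

// Mirrors the removed TRX_SYS_TRX_ID_WRITE_MARGIN (a power of two).
constexpr trx_id_t WRITE_MARGIN = 256;

// Round n up to a multiple of a power-of-two alignment.
constexpr trx_id_t align_up(trx_id_t n, trx_id_t align)
{
  return (n + align - 1) & ~(align - 1);
}

// Hypothetical model of the old behaviour; not the InnoDB data structure.
struct OldTrxSys {
  trx_id_t max_trx_id = 0;   // next identifier to hand out (in memory)
  trx_id_t persisted = 0;    // stand-in for TRX_SYS_TRX_ID_STORE
  unsigned page_writes = 0;  // number of TRX_SYS page modifications

  constexpr trx_id_t get_new_trx_id()
  {
    const trx_id_t id = max_trx_id++;
    if (id % WRITE_MARGIN == 0) {
      // Every 256th assignment wrote to the shared TRX_SYS page.
      persisted = max_trx_id;
      ++page_writes;
    }
    return id;
  }

  // Startup: skip far enough ahead that no identifier can be reused.
  constexpr void restart()
  {
    max_trx_id = 2 * WRITE_MARGIN + align_up(persisted, WRITE_MARGIN);
  }
};

// Assign n identifiers, then simulate a restart.
constexpr OldTrxSys run_then_restart(unsigned n)
{
  OldTrxSys sys{};
  for (unsigned i = 0; i < n; i++) {
    sys.get_new_trx_id();
  }
  sys.restart();
  return sys;
}
```

In this model, 300 assignments cause two writes of the shared page, and the
restart resumes at 1024 even though only identifiers 0 to 299 were used.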

TRX_SYS_TRX_ID_STORE: Deprecate. The field only exists for the purpose
of upgrading from an earlier version of MySQL or MariaDB.

Starting with this fix, MariaDB will rely on the fields
TRX_UNDO_TRX_ID, TRX_UNDO_TRX_NO in the undo log header page of
each non-committed transaction, and on the new field
TRX_RSEG_MAX_TRX_ID in rollback segment header pages.

Because of this change, setting innodb_force_recovery=5 or 6 may cause
the system to recover with trx_sys.get_max_trx_id()==0. We must adjust
checks for invalid DB_TRX_ID and PAGE_MAX_TRX_ID accordingly.
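
The adjusted checks can be summarized with two small predicates. This is a
sketch under the assumption that a system-wide maximum of 0 means "unknown";
the function names are hypothetical and do not exist in InnoDB.

```cpp
#include <cstdint>

using trx_id_t = std::uint64_t;

// A record's DB_TRX_ID is plausible if it is below the system maximum.
// When the maximum is 0 (not recovered, e.g. innodb_force_recovery=5 or 6),
// nothing can be said, so the check is skipped.
constexpr bool db_trx_id_is_sane(trx_id_t trx_id, trx_id_t sys_max_trx_id)
{
  return sys_max_trx_id == 0 || trx_id < sys_max_trx_id;
}

// PAGE_MAX_TRX_ID on a non-empty secondary index leaf page must be nonzero
// and must not exceed the system maximum, again only when that is known.
constexpr bool page_max_trx_id_is_sane(trx_id_t page_max_trx_id,
                                       trx_id_t sys_max_trx_id)
{
  return sys_max_trx_id == 0
      || (page_max_trx_id != 0 && page_max_trx_id <= sys_max_trx_id);
}
```

For example, db_trx_id_is_sane(100, 0) holds because the maximum is unknown,
while db_trx_id_is_sane(100, 50) does not.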

We will change the startup and shutdown messages to display the
trx_sys.get_max_trx_id() in addition to the log sequence number.

trx_sys_t::flush_max_trx_id(): Remove.

trx_undo_mem_create_at_db_start(), trx_undo_lists_init():
Add an output parameter max_trx_id, to be updated from
TRX_UNDO_TRX_ID, TRX_UNDO_TRX_NO.

TRX_RSEG_MAX_TRX_ID: New field, for persisting
trx_sys.get_max_trx_id() at the time of the latest transaction commit.
Startup is not reading the undo log pages of committed transactions.
We want to avoid additional page accesses on startup, as well as
trouble when all undo logs have been emptied.
On startup, we will simply determine the maximum value from all pages
that are being read anyway.

TRX_RSEG_FORMAT: Redefined from TRX_RSEG_MAX_SIZE.

Old versions of InnoDB wrote uninitialized garbage to unused data fields.
Because of this, we cannot simply introduce a new field in the
rollback segment pages and expect it to be always zero, as it would be
if the database had been created by a recent enough InnoDB version.

Luckily, it looks like the field TRX_RSEG_MAX_SIZE was always written
as 0xfffffffe. We will indicate a new subformat of the page by writing
0 to this field. This has the nice side effect that after a downgrade
to older versions of InnoDB, transactions should fail to allocate any
undo log, that is, writes will be blocked. So, there is no problem of
getting corrupted transaction identifiers after downgrading.
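
As a rough sketch of how a reader of the page could use the format marker:
the helpers below are hypothetical (read_be32() and read_be64() only imitate
InnoDB's big-endian mach_read_from_4() and mach_read_from_8()), and the
offset of TRX_RSEG_MAX_TRX_ID is passed in by the caller because it depends
on the page layout.

```cpp
#include <cstddef>
#include <cstdint>

// The old TRX_RSEG_MAX_SIZE value 0xfffffffe is -2 as a 32-bit integer.
static_assert(static_cast<std::uint32_t>(-2) == 0xfffffffeU,
              "old marker value");

// Read a 32-bit big-endian integer.
constexpr std::uint32_t read_be32(const unsigned char* b)
{
  return (std::uint32_t(b[0]) << 24) | (std::uint32_t(b[1]) << 16)
       | (std::uint32_t(b[2]) << 8) | std::uint32_t(b[3]);
}

// Read a 64-bit big-endian integer.
constexpr std::uint64_t read_be64(const unsigned char* b)
{
  return (std::uint64_t(read_be32(b)) << 32) | read_be32(b + 4);
}

// Return the persisted maximum transaction ID, or 0 if the header is in the
// old format, where the bytes at that offset may be uninitialized garbage.
// The format field is assumed to be at offset 0 of the header.
constexpr std::uint64_t rseg_max_trx_id(const unsigned char* rseg_header,
                                        std::size_t max_trx_id_offset)
{
  return read_be32(rseg_header) == 0
      ? read_be64(rseg_header + max_trx_id_offset)
      : 0;
}
```

Any nonzero format value is treated as the old format here, so garbage in
the new field is never interpreted as a transaction identifier.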

trx_rseg_t::max_size: Remove.

trx_rseg_header_create(): Remove the parameter max_size=ULINT_MAX.

trx_purge_add_undo_to_history(): Update TRX_RSEG_MAX_TRX_ID
(and TRX_RSEG_FORMAT if needed). This is invoked on transaction commit.

trx_rseg_mem_restore(): If TRX_RSEG_FORMAT contains 0,
read TRX_RSEG_MAX_TRX_ID.

trx_rseg_array_init(): Invoke trx_sys.init_max_trx_id(max_trx_id + 1)
where max_trx_id was the maximum that was encountered in the rollback
segment pages and the undo log pages of recovered active, XA PREPARE,
or some committed transactions. (See trx_purge_add_undo_to_history()
which invokes trx_rsegf_set_nth_undo(..., FIL_NULL, ...);
not all committed transactions will be immediately detached from the
rollback segment header.)
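
The startup logic described above amounts to taking a maximum over values
that are read anyway. A minimal sketch, assuming simplified in-memory
stand-ins for the pages (RsegHeader, UndoHeader and recover_next_trx_id()
are hypothetical names, not InnoDB types or functions):

```cpp
#include <cstddef>
#include <cstdint>

using trx_id_t = std::uint64_t;

struct RsegHeader {
  std::uint32_t format;  // stand-in for TRX_RSEG_FORMAT
  trx_id_t max_trx_id;   // stand-in for TRX_RSEG_MAX_TRX_ID
};

struct UndoHeader {
  trx_id_t trx_id;  // stand-in for TRX_UNDO_TRX_ID
  trx_id_t trx_no;  // stand-in for TRX_UNDO_TRX_NO
};

constexpr trx_id_t max_of(trx_id_t a, trx_id_t b) { return a > b ? a : b; }

// Start from the legacy TRX_SYS_TRX_ID_STORE value (relevant after an
// upgrade), fold in every rollback segment header in the new format and
// every undo log header that startup reads, and resume one past the maximum.
template <std::size_t R, std::size_t U>
constexpr trx_id_t recover_next_trx_id(trx_id_t legacy_store,
                                       const RsegHeader (&rsegs)[R],
                                       const UndoHeader (&undos)[U])
{
  trx_id_t max_trx_id = legacy_store;
  for (std::size_t i = 0; i < R; i++) {
    if (rsegs[i].format == 0) {
      // Only the new format guarantees that the field is not garbage.
      max_trx_id = max_of(max_trx_id, rsegs[i].max_trx_id);
    }
  }
  for (std::size_t i = 0; i < U; i++) {
    max_trx_id = max_of(max_trx_id, undos[i].trx_id);
    max_trx_id = max_of(max_trx_id, undos[i].trx_no);
  }
  return max_trx_id + 1;
}
```

The "+ 1" corresponds to trx_sys.init_max_trx_id(max_trx_id + 1): the largest
identifier seen on disk may already be in use, so assignment resumes after it.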
dr-m committed Jan 31, 2018
1 parent bb441ca commit c7d0448
Showing 11 changed files with 104 additions and 166 deletions.
12 changes: 5 additions & 7 deletions storage/innobase/include/trx0rseg.h
@@ -77,15 +77,13 @@ trx_rsegf_undo_find_free(const trx_rsegf_t* rsegf);
This function is called only when a new rollback segment is created in
the database.
@param[in] space space id
@param[in] max_size max size in pages
@param[in] rseg_id rollback segment identifier
@param[in,out] sys_header the TRX_SYS page (NULL for temporary rseg)
@param[in,out] mtr mini-transaction
@return page number of the created segment, FIL_NULL if fail */
ulint
trx_rseg_header_create(
ulint space,
ulint max_size,
ulint rseg_id,
buf_block_t* sys_header,
mtr_t* mtr);
@@ -144,9 +142,6 @@ struct trx_rseg_t {
/** page number of the rollback segment header */
ulint page_no;

/** maximum allowed size in pages */
ulint max_size;

/** current size in pages */
ulint curr_size;

@@ -217,8 +212,8 @@ struct trx_rseg_t {

/* Transaction rollback segment header */
/*-------------------------------------------------------------*/
#define TRX_RSEG_MAX_SIZE 0 /* Maximum allowed size for rollback
segment in pages */
#define TRX_RSEG_FORMAT 0 /* -2 = pre-MariaDB 10.3.5 format;
0=MariaDB 10.3.5 or later */
#define TRX_RSEG_HISTORY_SIZE 4 /* Number of file pages occupied
by the logs in the history list */
#define TRX_RSEG_HISTORY 8 /* The update undo logs for committed
@@ -228,6 +223,9 @@ struct trx_rseg_t {
this page is placed */
#define TRX_RSEG_UNDO_SLOTS (8 + FLST_BASE_NODE_SIZE + FSEG_HEADER_SIZE)
/* Undo log segment slots */
/** Maximum transaction ID (valid only if TRX_RSEG_FORMAT is 0) */
#define TRX_RSEG_MAX_TRX_ID (TRX_RSEG_UNDO_SLOTS + TRX_RSEG_N_SLOTS \
* TRX_RSEG_SLOT_SIZE)
/*-------------------------------------------------------------*/

/** Read the page number of an undo log slot.
45 changes: 8 additions & 37 deletions storage/innobase/include/trx0sys.h
@@ -175,18 +175,13 @@ trx_sys_create_rsegs();

/** Transaction system header */
/*------------------------------------------------------------- @{ */
#define TRX_SYS_TRX_ID_STORE 0 /*!< the maximum trx id or trx
number modulo
TRX_SYS_TRX_ID_UPDATE_MARGIN
written to a file page by any
transaction; the assignment of
transaction ids continues from
this number rounded up by
TRX_SYS_TRX_ID_UPDATE_MARGIN
plus
TRX_SYS_TRX_ID_UPDATE_MARGIN
when the database is
started */
/** In old versions of InnoDB, this persisted the value of
trx_sys.get_max_trx_id(). Starting with MariaDB 10.3.5,
the field TRX_RSEG_MAX_TRX_ID in rollback segment header pages
and the fields TRX_UNDO_TRX_ID, TRX_UNDO_TRX_NO in undo log pages
are used instead. The field only exists for the purpose of upgrading
from older MySQL or MariaDB versions. */
#define TRX_SYS_TRX_ID_STORE 0
#define TRX_SYS_FSEG_HEADER 8 /*!< segment header for the
tablespace segment the trx
system is created into */
@@ -379,11 +374,6 @@ FIL_PAGE_ARCH_LOG_NO_OR_SPACE_ID. */

/** Size of the doublewrite block in pages */
#define TRX_SYS_DOUBLEWRITE_BLOCK_SIZE FSP_EXTENT_SIZE

/** When a trx id which is zero modulo this number (which must be a power of
two) is assigned, the field TRX_SYS_TRX_ID_STORE on the transaction system
page is updated */
#define TRX_SYS_TRX_ID_WRITE_MARGIN ((trx_id_t) 256)
/* @} */

trx_t* current_trx();
@@ -925,26 +915,14 @@ struct trx_sys_t {

/**
Allocates a new transaction id.
VERY important: after the database is started, m_max_trx_id value is
divisible by TRX_SYS_TRX_ID_WRITE_MARGIN, and the following if
will evaluate to TRUE when this function is first time called,
and the value for trx id will be written to disk-based header!
Thus trx id values will not overlap when the database is
repeatedly started!
@return new, allocated trx id
*/

trx_id_t get_new_trx_id()
{
ut_ad(mutex_own(&mutex));
trx_id_t id= static_cast<trx_id_t>(my_atomic_add64_explicit(
return static_cast<trx_id_t>(my_atomic_add64_explicit(
reinterpret_cast<int64*>(&m_max_trx_id), 1, MY_MEMORY_ORDER_RELAXED));

if (UNIV_UNLIKELY(!(id % TRX_SYS_TRX_ID_WRITE_MARGIN)))
flush_max_trx_id();
return(id);
}


@@ -1004,13 +982,6 @@ struct trx_sys_t {
}
return 0;
}


/**
Writes the value of m_max_trx_id to the file based trx system header.
*/

void flush_max_trx_id();
};


10 changes: 6 additions & 4 deletions storage/innobase/include/trx0undo.h
@@ -318,12 +318,14 @@ trx_undo_parse_page_header(
page_t* page,
mtr_t* mtr);
/** Read an undo log when starting up the database.
@param[in,out] rseg rollback segment
@param[in] id rollback segment slot
@param[in] page_no undo log segment page number
@param[in,out] rseg rollback segment
@param[in] id rollback segment slot
@param[in] page_no undo log segment page number
@param[in,out] max_trx_id the largest observed transaction ID
@return size of the undo log in pages */
ulint
trx_undo_mem_create_at_db_start(trx_rseg_t* rseg, ulint id, ulint page_no);
trx_undo_mem_create_at_db_start(trx_rseg_t* rseg, ulint id, ulint page_no,
trx_id_t& max_trx_id);
/************************************************************************
Frees an undo log memory copy. */
void
3 changes: 2 additions & 1 deletion storage/innobase/lock/lock0lock.cc
@@ -383,8 +383,9 @@ lock_check_trx_id_sanity(
ut_ad(!rec_is_default_row(rec, index));

trx_id_t max_trx_id = trx_sys.get_max_trx_id();
ut_ad(max_trx_id || srv_force_recovery >= SRV_FORCE_NO_UNDO_LOG_SCAN);

if (trx_id >= max_trx_id) {
if (max_trx_id && trx_id >= max_trx_id) {
lock_report_trx_id_insanity(
trx_id, rec, index, offsets, max_trx_id);
return false;
13 changes: 7 additions & 6 deletions storage/innobase/page/page0page.cc
@@ -2,7 +2,7 @@
Copyright (c) 1994, 2016, Oracle and/or its affiliates. All Rights Reserved.
Copyright (c) 2012, Facebook Inc.
Copyright (c) 2017, MariaDB Corporation.
Copyright (c) 2017, 2018, MariaDB Corporation.
This program is free software; you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free Software
@@ -2427,18 +2427,19 @@ page_validate(
same temp-table in parallel.
max_trx_id is ignored for temp tables because it not required
for MVCC. */
if (dict_index_is_sec_or_ibuf(index)
&& !dict_table_is_temporary(index->table)
&& page_is_leaf(page)
&& !page_is_empty(page)) {
if (!page_is_leaf(page) || page_is_empty(page)
|| !dict_index_is_sec_or_ibuf(index)
|| index->table->is_temporary()) {
} else if (trx_id_t sys_max_trx_id = trx_sys.get_max_trx_id()) {
trx_id_t max_trx_id = page_get_max_trx_id(page);
trx_id_t sys_max_trx_id = trx_sys.get_max_trx_id();

if (max_trx_id == 0 || max_trx_id > sys_max_trx_id) {
ib::error() << "PAGE_MAX_TRX_ID out of bounds: "
<< max_trx_id << ", " << sys_max_trx_id;
goto func_exit2;
}
} else {
ut_ad(srv_force_recovery >= SRV_FORCE_NO_UNDO_LOG_SCAN);
}

heap = mem_heap_create(UNIV_PAGE_SIZE + 200);
11 changes: 6 additions & 5 deletions storage/innobase/srv/srv0start.cc
@@ -1088,8 +1088,7 @@ srv_undo_tablespaces_init(bool create_new_db)
if (trx_sysf_rseg_get_space(sys_header, i)
== *it) {
trx_rseg_header_create(
*it, ULINT_MAX, i,
sys_header, &mtr);
*it, i, sys_header, &mtr);
}
}

@@ -2679,8 +2678,9 @@ innobase_start_or_create_for_mysql()

if (srv_print_verbose_log) {
ib::info() << INNODB_VERSION_STR
<< " started; log sequence number "
<< srv_start_lsn;
<< " started; log sequence number "
<< srv_start_lsn
<< "; transaction id " << trx_sys.get_max_trx_id();
}

if (srv_force_recovery > 0) {
@@ -2929,7 +2929,8 @@ innodb_shutdown()

if (srv_was_started && srv_print_verbose_log) {
ib::info() << "Shutdown completed; log sequence number "
<< srv_shutdown_lsn;
<< srv_shutdown_lsn
<< "; transaction id " << trx_sys.get_max_trx_id();
}

srv_start_state = SRV_START_STATE_NONE;
18 changes: 18 additions & 0 deletions storage/innobase/trx/trx0purge.cc
@@ -277,10 +277,28 @@ trx_purge_add_undo_to_history(const trx_t* trx, trx_undo_t*& undo, mtr_t* mtr)

ut_ad(undo->size == flst_get_len(
seg_header + TRX_UNDO_PAGE_LIST));
byte* rseg_format = rseg_header + TRX_RSEG_FORMAT;
if (UNIV_UNLIKELY(mach_read_from_4(rseg_format))) {
/* This database must have been upgraded from
before MariaDB 10.3.5. */
mlog_write_ulint(rseg_format, 0, MLOG_4BYTES, mtr);
/* Clear also possible garbage at the end of
the page. Old InnoDB versions did not initialize
unused parts of pages. */
ut_ad(page_offset(rseg_header) == TRX_RSEG);
byte* b = rseg_header + TRX_RSEG_MAX_TRX_ID + 8;
ulint len = UNIV_PAGE_SIZE
- (FIL_PAGE_DATA_END
+ TRX_RSEG + TRX_RSEG_MAX_TRX_ID + 8);
memset(b, 0, len);
mlog_log_string(b, len, mtr);
}

mlog_write_ulint(
rseg_header + TRX_RSEG_HISTORY_SIZE,
hist_size + undo->size, MLOG_4BYTES, mtr);
mlog_write_ull(rseg_header + TRX_RSEG_MAX_TRX_ID,
trx_sys.get_max_trx_id(), mtr);
}

/* Before any transaction-generating background threads or the
52 changes: 20 additions & 32 deletions storage/innobase/trx/trx0rseg.cc
@@ -38,15 +38,13 @@ Created 3/26/1996 Heikki Tuuri
This function is called only when a new rollback segment is created in
the database.
@param[in] space space id
@param[in] max_size max size in pages
@param[in] rseg_id rollback segment identifier
@param[in,out] sys_header the TRX_SYS page (NULL for temporary rseg)
@param[in,out] mtr mini-transaction
@return page number of the created segment, FIL_NULL if fail */
ulint
trx_rseg_header_create(
ulint space,
ulint max_size,
ulint rseg_id,
buf_block_t* sys_header,
mtr_t* mtr)
@@ -76,9 +74,7 @@ trx_rseg_header_create(
/* Get the rollback segment file page */
rsegf = trx_rsegf_get_new(space, page_no, mtr);

/* Initialize max size field */
mlog_write_ulint(rsegf + TRX_RSEG_MAX_SIZE, max_size,
MLOG_4BYTES, mtr);
mlog_write_ulint(rsegf + TRX_RSEG_FORMAT, 0, MLOG_4BYTES, mtr);

/* Initialize the history list */

@@ -155,7 +151,6 @@ trx_rseg_mem_create(ulint id, ulint space, ulint page_no)
rseg->page_no = page_no;
rseg->last_page_no = FIL_NULL;
rseg->curr_size = 1;
rseg->max_size = ULINT_UNDEFINED;

mutex_create(rseg->is_persistent()
? LATCH_ID_REDO_RSEG : LATCH_ID_NOREDO_RSEG,
@@ -170,13 +165,13 @@ trx_rseg_mem_create(ulint id, ulint space, ulint page_no)

/** Read the undo log lists.
@param[in,out] rseg rollback segment
@param[in,out] max_trx_id maximum observed transaction identifier
@param[in] rseg_header rollback segment header
@param[in,out] mtr mini-transaction
@return the combined size of undo log segments in pages */
static
ulint
trx_undo_lists_init(trx_rseg_t* rseg, const trx_rsegf_t* rseg_header,
mtr_t* mtr)
trx_undo_lists_init(trx_rseg_t* rseg, trx_id_t& max_trx_id,
const trx_rsegf_t* rseg_header)
{
ut_ad(srv_force_recovery < SRV_FORCE_NO_UNDO_LOG_SCAN);

@@ -186,7 +181,7 @@ trx_undo_lists_init(trx_rseg_t* rseg, const trx_rsegf_t* rseg_header,
ulint page_no = trx_rsegf_get_nth_undo(rseg_header, i);
if (page_no != FIL_NULL) {
size += trx_undo_mem_create_at_db_start(
rseg, i, page_no);
rseg, i, page_no, max_trx_id);
MONITOR_INC(MONITOR_NUM_UNDO_SLOT_USED);
}
}
@@ -204,12 +199,18 @@ trx_rseg_mem_restore(trx_rseg_t* rseg, trx_id_t& max_trx_id, mtr_t* mtr)
{
const trx_rsegf_t* rseg_header = trx_rsegf_get_new(
rseg->space, rseg->page_no, mtr);
rseg->max_size = mach_read_from_4(rseg_header + TRX_RSEG_MAX_SIZE);
if (mach_read_from_4(rseg_header + TRX_RSEG_FORMAT) == 0) {
trx_id_t id = mach_read_from_8(rseg_header
+ TRX_RSEG_MAX_TRX_ID);
if (id > max_trx_id) {
max_trx_id = id;
}
}

/* Initialize the undo log lists according to the rseg header */

rseg->curr_size = mach_read_from_4(rseg_header + TRX_RSEG_HISTORY_SIZE)
+ 1 + trx_undo_lists_init(rseg, rseg_header, mtr);
+ 1 + trx_undo_lists_init(rseg, max_trx_id, rseg_header);

if (ulint len = flst_get_len(rseg_header + TRX_RSEG_HISTORY)) {
my_atomic_addlint(&trx_sys.rseg_history_len, len);
@@ -262,22 +263,9 @@ trx_rseg_array_init()
mtr.start();
if (const buf_block_t* sys = trx_sysf_get(&mtr, false)) {
if (rseg_id == 0) {
/* VERY important: after the database
is started, max_trx_id value is
divisible by TRX_SYS_TRX_ID_WRITE_MARGIN,
and the first call of
trx_sys.get_new_trx_id() will invoke
flush_max_trx_id()! Thus trx id values
will not overlap when the database is
repeatedly started! */

max_trx_id = 2 * TRX_SYS_TRX_ID_WRITE_MARGIN
+ ut_uint64_align_up(
mach_read_from_8(
TRX_SYS
+ TRX_SYS_TRX_ID_STORE
+ sys->frame),
TRX_SYS_TRX_ID_WRITE_MARGIN);
max_trx_id = mach_read_from_8(
TRX_SYS + TRX_SYS_TRX_ID_STORE
+ sys->frame);
}
const uint32_t page_no = trx_sysf_rseg_get_page_no(
sys, rseg_id);
@@ -297,7 +285,7 @@ trx_rseg_array_init()
mtr.commit();
}

trx_sys.init_max_trx_id(max_trx_id);
trx_sys.init_max_trx_id(max_trx_id + 1);
}

/** Create a persistent rollback segment.
@@ -324,8 +312,8 @@ trx_rseg_create(ulint space_id)
ulint rseg_id = trx_sys_rseg_find_free(sys_header);
ulint page_no = rseg_id == ULINT_UNDEFINED
? FIL_NULL
: trx_rseg_header_create(space_id, ULINT_MAX,
rseg_id, sys_header, &mtr);
: trx_rseg_header_create(space_id, rseg_id, sys_header,
&mtr);
if (page_no != FIL_NULL) {
ut_ad(trx_sysf_rseg_get_space(sys_header, rseg_id)
== space_id);
@@ -358,7 +346,7 @@ trx_temp_rseg_create()
ut_ad(space->purpose == FIL_TYPE_TEMPORARY);

ulint page_no = trx_rseg_header_create(
SRV_TMP_SPACE_ID, ULINT_MAX, i, NULL, &mtr);
SRV_TMP_SPACE_ID, i, NULL, &mtr);
trx_rseg_t* rseg = trx_rseg_mem_create(
i, SRV_TMP_SPACE_ID, page_no);
ut_ad(!rseg->is_persistent());

3 comments on commit c7d0448

@twelve1995

Hi, dr-m,
I don't understand how you use the TRX_UNDO_TRX_ID and TRX_UNDO_TRX_NO fields instead of TRX_SYS_TRX_ID_STORE to assign transaction numbers. Could you explain?

@dr-m
Contributor Author

@dr-m dr-m commented on c7d0448 Aug 20, 2019


@1100hk, it turned out that the field TRX_SYS_TRX_ID_STORE is almost redundant. We are basically splitting it into existing fields and the new field TRX_RSEG_MAX_TRX_ID in rollback segment header pages.

What we do is similar to executing the equivalent of SELECT MAX(autoinc_col) FROM t for recovering an auto-increment value, with the exception that we are not introducing any extra page reads at startup, and that any gaps in the sequence of transaction identifiers will be preserved (thanks to the addition of TRX_RSEG_MAX_TRX_ID). The main motivation why we introduced that field is that it is distributed on rollback segment header pages, and transaction commit was already modifying those pages anyway. By splitting the field, we avoid the bottleneck of accessing the TRX_SYS page during normal operation. The access to TRX_SYS_TRX_ID_STORE was causing a potential issue for @svoj’s trx_sys scalability fixes (probably MDEV-15104, because MDEV-15059 was pushed before my change). I felt that it was easiest to eliminate the redundant accesses to the TRX_SYS page.

On startup, InnoDB always reads the undo log records of any incomplete (recovered) transactions, and it also reads all rollback segment header pages to find information about not-yet-purged committed transactions.

For incomplete transactions that have persistently generated any undo log records, the start identifier (DB_TRX_ID or trx->id) will be available in the undo logs. For committed transactions, the end identifier (trx->no) will be written in TRX_UNDO_TRX_NO, but such records will be removed by purge later. That is why we need the TRX_RSEG_MAX_TRX_ID.
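
To illustrate the last point, here is a deliberately tiny model of a single rollback segment. It is only a sketch: `RsegModel` and its members are hypothetical names, and real purge and commit processing is far more involved.

```cpp
#include <cstdint>

using trx_id_t = std::uint64_t;

// Hypothetical, heavily simplified rollback segment: one header field plus
// the newest commit number still present in its history list.
struct RsegModel {
  trx_id_t header_max_trx_id = 0;   // stand-in for TRX_RSEG_MAX_TRX_ID
  trx_id_t newest_undo_trx_no = 0;  // largest TRX_UNDO_TRX_NO not yet purged

  // Commit: the undo log joins the history list, and the header page, which
  // is modified anyway, also records the current maximum identifier.
  constexpr void commit(trx_id_t trx_no, trx_id_t sys_max_trx_id)
  {
    newest_undo_trx_no = trx_no;
    header_max_trx_id = sys_max_trx_id;
  }

  // Purge eventually frees the undo logs of committed transactions.
  constexpr void purge_all() { newest_undo_trx_no = 0; }

  // What startup could learn from the undo logs alone.
  constexpr trx_id_t max_from_undo_only() const { return newest_undo_trx_no; }

  // What startup learns when the header field is consulted as well.
  constexpr trx_id_t max_with_header() const
  {
    return header_max_trx_id > newest_undo_trx_no
        ? header_max_trx_id : newest_undo_trx_no;
  }
};

// One transaction commits with trx->no 41 while the system maximum is 42,
// then purge removes its undo log.
constexpr RsegModel committed_then_purged()
{
  RsegModel rseg{};
  rseg.commit(41, 42);
  rseg.purge_all();
  return rseg;
}
```

After the purge, the undo logs alone yield 0, so identifiers could be reused after a restart; the header field still yields 42.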

This area is partly covered by my talk Deep Dive: InnoDB Transactions and Write Paths (in particular, the slide "A Page View of the InnoDB Transaction Layer").

Also, note that there was a follow-up fix in dcc09af, which deals with pre-existing garbage on undo log pages.

@twelve1995

Thank you very much for your generous sharing. After listening to your explanation and reading the source code, I have understood the idea. Thank you very much for sharing a great learning platform, and I will share it with my partner. Thank you.
