Skip to content

MDEV-33545: Map innodb_flush_method=O_DIRECT_NO_FSYNC to a new innodb_doublewrite setting#3091

Merged
dr-m merged 1 commit into11.0from
11.0-MDEV-33545
Apr 4, 2024
Merged

MDEV-33545: Map innodb_flush_method=O_DIRECT_NO_FSYNC to a new innodb_doublewrite setting#3091
dr-m merged 1 commit into11.0from
11.0-MDEV-33545

Conversation

@dr-m
Copy link
Copy Markdown
Contributor

@dr-m dr-m commented Feb 29, 2024

  • The Jira issue number for this PR is: MDEV-33545

Description

In commit 2464876 (MDEV-30136) the parameter innodb_flush_method was deprecated, with no direct replacement for innodb_flush_method=O_DIRECT_NO_FSYNC.

Let us change innodb_doublewrite from Boolean to ENUM, with 3 new values:

  • OFF: Assume that writes of innodb_page_size are atomic
  • ON: Prevent torn writes (the default)
  • fast: Like ON, but avoid fsync() or fdatasync() on data files

The deprecated start-up parameter innodb_flush_method=NO_FSYNC will cause innodb_doublewrite=ON to be changed to innodb_doublewrite=fast, which will prevent InnoDB from making any durable writes to data files other than the doublewrite buffer. This would normally be done right before the log checkpoint LSN is updated. Depending on the file systems being used and their configuration, this may or may not be safe.

The value innodb_doublewrite=fast differs from the previous combination of innodb_doublewrite=ON and innodb_flush_method=O_DIRECT_NO_FSYNC by always invoking os_file_flush_func() on the doublewrite buffer itself in buf_dblwr_t::flush_buffered_writes_completed(). This should be safer when there are multiple doublewrite batches between checkpoints. Typically, once per second, buf_flush_page_cleaner() would write out up to innodb_io_capacity pages and advance the log checkpoint. Also typically, innodb_io_capacity>128, which is the size of the doublewrite buffer in pages. Should os_file_flush_func() not be invoked between doublewrite batches, writes could be reordered in an unsafe way.

The setting innodb_doublewrite=fast could be safe when the doublewrite buffer (the first file of the system tablespace) and the data files reside in the same file system.

Note: Enabling the setting debug_no_sync (in the code, my_disable_sync) would also disable durable writes to the log file, which would be much less safe.

Thanks to @mdcallag for reporting this.

How can this PR be tested?

./mtr --rr innodb.alter_kill
rr replay var/log/mysqld.1.rr/mariadbd-0
rr replay var/log/mysqld.1.rr/mariadbd-1

and awatch fil_system.write_through to check that no data files are being created with O_DSYNC and no fsync or fdatasync is being invoked on them.

On the first server startup of innodb.alter_kill, with innodb_doublewrite=fast, os_file_flush_func() would only be invoked on the ibdata1 and possibly ib_logfile0 (unless using a memory-mapped log). On subsequent startups with innodb_doublewrite=OFF, os_file_flush_func() will be invoked on data files during log_checkpoint().

Side note: innodb_data_file_write_through=ON will apply also to the InnoDB temporary tablespace (ibtmp1) even though this is not necessary.

IORequest::Type: Introduce special values WRITE_DBL and PUNCH_DBL for asynchronous writes that are submitted via the doublewrite buffer. In this way, fil_space_t::use_doublewrite() or buf_dblwr.in_use() will only be consulted during buf_page_t::flush() and the doublewrite buffer can be enabled or disabled without any fear of inconsistency.

buf_dblwr_t::block_size: Replaces block_size().

buf_dblwr_t::flush_buffered_writes(): If !in_use() and the doublewrite buffer is empty, just invoke fil_flush_file_spaces() and return. The doublewrite buffer could have been disabled while a batch was in progress.

Basing the PR against the correct MariaDB version

  • This is a new feature and the PR is based against the latest MariaDB development branch.
  • This is a bug fix and the PR is based against the earliest maintained branch in which the bug can be reproduced.

PR quality check

  • I checked the CODING_STANDARDS.md file and my PR conforms to this where appropriate.
  • For any trivial modifications to the PR, I am ok with the reviewer making the changes themselves.

@dr-m dr-m requested a review from vaintroub February 29, 2024 09:56
@dr-m dr-m self-assigned this Feb 29, 2024
@CLAassistant
Copy link
Copy Markdown

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

Copy link
Copy Markdown
Member

@vaintroub vaintroub left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • I would prefer innodb_data_write_through to remain exactly what it was (O_SYNC/O_DSYNC/FILE_FLAG_WRITE_THROUGH) , and remain boolean.
    It is better to add a new parameter innodb_data_sync (or innodb_data_flush, we do not want to sound Linux-ish), so we do not mix up concepts.
    I know there is an overlap. If innodb_data_write_through is used, innodb_data_sync should be ignored. In fact, innodb_data_write_through should extremely seldomly, it is not the same as the log file where each write is typically also flushed.

  • If new setting is crash-safe, why don't we use it in all cases, without introducing new parameter. If this is not crash-safe, can we document it under which circumstances it is or it is not crash safe, like "battery-backed write cache is necessary".

  • MySQL documentation says about O_DIRECT_NO_FSYNC : "InnoDB uses O_DIRECT during flushing I/O, but skips the fsync() system call after each write operation." Is this accurate ? "skips fsync after each operations" sounds suspiciously like it is about redo log.

@dr-m
Copy link
Copy Markdown
Contributor Author

dr-m commented Mar 11, 2024

  • I would prefer innodb_data_write_through to remain exactly what it was (O_SYNC/O_DSYNC/FILE_FLAG_WRITE_THROUGH) , and remain boolean.
    It is better to add a new parameter innodb_data_sync (or innodb_data_flush, we do not want to sound Linux-ish), so we do not mix up concepts.
    I know there is an overlap. If innodb_data_write_through is used, innodb_data_sync should be ignored. In fact, innodb_data_write_through should extremely seldomly, it is not the same as the log file where each write is typically also flushed.

I think that it is a bad practice to have "don’t care" combinations of parameters. The combinatorial explosion on parameters is already too bad.

* If new setting is crash-safe, why don't we use it in all cases, without introducing new parameter. If this is not crash-safe,  can we document it under which circumstances it is or it is not crash safe, like "battery-backed write cache is necessary".

I do not think it is safe to skip fdatasync() on the ext4 file system after there was a write that extended a file or put some previously fallocate()’d space into actual use. In those cases, we could read incorrect data from such files.

* MySQL documentation says about O_DIRECT_NO_FSYNC : "InnoDB uses O_DIRECT during flushing I/O, but skips the fsync() system call after each write operation." Is this accurate ? "skips fsync after each operations" sounds suspiciously like it is about redo log.

The function log_io_complete() in MySQL 8.0 invokes fil_flush(group->space_id) in all other settings except innodb_flush_method='O_DSYNC' or innodb_flush_method='nosync'. As far as I can tell, this setting always was about writes to data files, not any log file writes.

@vaintroub
Copy link
Copy Markdown
Member

  • I would prefer innodb_data_write_through to remain exactly what it was (O_SYNC/O_DSYNC/FILE_FLAG_WRITE_THROUGH) , and remain boolean.
    It is better to add a new parameter innodb_data_sync (or innodb_data_flush, we do not want to sound Linux-ish), so we do not mix up concepts.
    I know there is an overlap. If innodb_data_write_through is used, innodb_data_sync should be ignored. In fact, innodb_data_write_through should extremely seldomly, it is not the same as the log file where each write is typically also flushed.

I think that it is a bad practice to have "don’t care" combinations of parameters. The combinatorial explosion on parameters is already too bad.
Frankenstein parameters are not better either. There is nothing like "yes/ON, no/OFF, NO_FSYNC". We can go back and
"innodb_data_file_persistence=open_writethrough,forced_sync,none", and the same for "innodb_log_file_persistence". Or maybe we can just live without Frankenstein, i.e we already have innodb_flush_method=O_DIRECT_NO_FSYNC for compatibility purposes.

@vaintroub
Copy link
Copy Markdown
Member

I do not think it is safe to skip fdatasync() on the ext4 file system after there was a write that extended a file or put some previously fallocate()’d space into actual use. In those cases, we could read incorrect data from such files.

MySQL documentation says "As of MySQL 8.0.14, fsync() is called after creating a new file, after increasing file size, and after closing a file, to ensure that file system metadata changes are synchronized. " Is this also inaccurate? Would this use be safe, in some circumstances, for example when redo log and data file are on the same disk?

@dr-m
Copy link
Copy Markdown
Contributor Author

dr-m commented Mar 11, 2024

I do not think it is safe to skip fdatasync() on the ext4 file system after there was a write that extended a file or put some previously fallocate()’d space into actual use. In those cases, we could read incorrect data from such files.

MySQL documentation says "As of MySQL 8.0.14, fsync() is called after creating a new file, after increasing file size, and after closing a file, to ensure that file system metadata changes are synchronized. " Is this also inaccurate? Would this use be safe, in some circumstances, for example when redo log and data file are on the same disk?

I think that we would be on a slippery slope if we start to assume that everything is safe, as long as everyone uses ext4fs and the same file system journal covers both ib_logfile0 and the data files. Who knows, maybe there already is an optimization that will cause only the changes to the requested file handle to be persisted. Technically, that should be possible for fdatasync() except when using the non-default mount option data=journal.

On Microsoft Windows, it may be reasonable to assume that everyone uses NTFS. On Linux, there is a wide range of choice and too many details that may affect such assumptions.

@dr-m dr-m force-pushed the 11.0-MDEV-33545 branch from eb18ba6 to 9c4f333 Compare March 25, 2024 13:40
Comment thread storage/innobase/handler/ha_innodb.cc
@dr-m dr-m changed the title MDEV-33545: Add innodb_data_file_write_through=NO_FSYNC MDEV-33545: Map innodb_flush_method=O_DIRECT_NO_FSYNC to a new innodb_doublewrite setting Mar 26, 2024
@dr-m dr-m force-pushed the 11.0-MDEV-33545 branch 2 times, most recently from 22252bb to 68e3f16 Compare April 3, 2024 13:20
@vaintroub vaintroub marked this pull request as draft April 3, 2024 13:30
Comment thread storage/innobase/handler/ha_innodb.cc Outdated
"Whether and how to use the doublewrite buffer. "
"OFF=Assume that writes of innodb_page_size are atomic; "
"ON=Prevent torn writes (the default); "
"fast=Like ON, but avoid fsync() on data files",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unnecessary Linuxism-Posixism with fsync(). maybe "flushing" is better.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Flushing" is somewhat overloaded; for buf_flush_page_cleaner or fflush it actually means "writing". Maybe, "Like ON, but do not synchronize writes to data files"?

/** Asynchronous write */
WRITE_ASYNC= WRITE_SYNC | 1,
/** Asynchronous doublewritten page */
WRITE_DBL= WRITE_ASYNC | 4,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a define DBLWR 4 or something,so you do not have to use magic number 4 in 3 different places - WRITE_DBL, PUNCH_DBL and is_doublewritten?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not want to expose that constant outside IORequest. The enum is public; there is no concept of private values. Sure, I could add a private static constexpr for that value, but I’d say that it would obfuscate things more.

void set_use_doublewrite(ulong use)
{
buf_dblwr.set_use(use);
need_unflushed_spaces= !write_through && buf_dblwr.need_fsync();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should "buffered" be checked for "need_unflushed_spaces"? buffered implies need_unflushed_spaces = true, regardless buf_dblwr.need_fsync() ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don’t think that fil_system.buffered (a.k.a. innodb_data_file_buffering) should be checked here. I believe that also writes via O_DIRECT may require fdatasync or fsync to be invoked, in case the file had been previously extended with a special system call, be it fallocate to allocate physical blocks or ftruncate to set the logical file size. In both cases, some recent writes may have impacted the file system metadata, that is, associate the written block with the inode so that a read like system call will return data from that block, instead of pretending that the region is filled with NUL bytes. If we always extended files via pwrite or WriteFile and invoked fdatasync or similar after such file-extending calls, then yes, we might safely skip fdatasync for subsequent overwrite-in-place. But even then, I suppose that invoking fdatasync would have the benefit that it would force a write-back of any non-durable cache inside the storage layer.

In commit 2464876 (MDEV-30136)
the parameter innodb_flush_method was deprecated, with no direct
replacement for innodb_flush_method=O_DIRECT_NO_FSYNC.

Let us change innodb_doublewrite from Boolean to ENUM that can
be changed while the server is running:

OFF: Assume that writes of innodb_page_size are atomic
ON: Prevent torn writes (the default)
fast: Like ON, but avoid synchronizing writes to data files

The deprecated start-up parameter innodb_flush_method=NO_FSYNC will cause
innodb_doublewrite=ON to be changed to innodb_doublewrite=fast,
which will prevent InnoDB from making any durable writes to data files.
This would normally be done right before the log checkpoint LSN is updated.
Depending on the file systems being used and their configuration,
this may or may not be safe.

The value innodb_doublewrite=fast differs from the previous combination of
innodb_doublewrite=ON and innodb_flush_method=O_DIRECT_NO_FSYNC by always
invoking os_file_flush() on the doublewrite buffer itself
in buf_dblwr_t::flush_buffered_writes_completed(). This should be safer
when there are multiple doublewrite batches between checkpoints.
Typically, once per second, buf_flush_page_cleaner() would write out
up to innodb_io_capacity pages and advance the log checkpoint.
Also typically, innodb_io_capacity>128, which is the size of the
doublewrite buffer in pages. Should os_file_flush_func() not be invoked
between doublewrite batches, writes could be reordered in an unsafe way.

The setting innodb_doublewrite=fast could be safe when the doublewrite
buffer (the first file of the system tablespace) and the data files
reside in the same file system.

This was tested by running "./mtr --rr innodb.alter_kill". On the first
server startup, with innodb_doublewrite=fast, os_file_flush_func()
would only be invoked on the ibdata1 file and possibly ib_logfile0.
On subsequent startups with innodb_doublewrite=OFF, os_file_flush_func()
will be invoked on the individual data files during log_checkpoint().

Note: The setting debug_no_sync (in the code, my_disable_sync) would
disable all durable writes to InnoDB files, which would be much less safe.

IORequest::Type: Introduce special values WRITE_DBL and PUNCH_DBL
for asynchronous writes that are submitted via the doublewrite buffer.
In this way, fil_space_t::use_doublewrite() or buf_dblwr.in_use()
will only be consulted during buf_page_t::flush() and the doublewrite
buffer can be enabled or disabled without any fear of inconsistency.

buf_dblwr_t::block_size: Replaces block_size().

buf_dblwr_t::flush_buffered_writes(): If !in_use() and the doublewrite
buffer is empty, just invoke fil_flush_file_spaces() and return. The
doublewrite buffer could have been disabled while a batch was in
progress.

innodb_init_params(): If innodb_flush_method=O_DIRECT_NO_FSYNC,
set innodb_doublewrite=fast or innodb_doublewrite=fearless.

Thanks to Mark Callaghan for reporting this, and Vladislav Vaintroub
for feedback.
@dr-m dr-m force-pushed the 11.0-MDEV-33545 branch from 68e3f16 to 1122ac9 Compare April 4, 2024 05:14
@dr-m dr-m merged commit 1122ac9 into 11.0 Apr 4, 2024
@dr-m dr-m deleted the 11.0-MDEV-33545 branch April 4, 2024 06:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

3 participants