MDEV-33545: Map innodb_flush_method=O_DIRECT_NO_FSYNC to a new innodb_doublewrite setting#3091
MDEV-33545: Map innodb_flush_method=O_DIRECT_NO_FSYNC to a new innodb_doublewrite setting#3091
Conversation
|
|
There was a problem hiding this comment.
-
I would prefer innodb_data_write_through to remain exactly what it was (O_SYNC/O_DSYNC/FILE_FLAG_WRITE_THROUGH) , and remain boolean.
It is better to add a new parameter innodb_data_sync (or innodb_data_flush, we do not want to sound Linux-ish), so we do not mix up concepts.
I know there is an overlap. If innodb_data_write_through is used, innodb_data_sync should be ignored. In fact, innodb_data_write_through should extremely seldomly, it is not the same as the log file where each write is typically also flushed. -
If new setting is crash-safe, why don't we use it in all cases, without introducing new parameter. If this is not crash-safe, can we document it under which circumstances it is or it is not crash safe, like "battery-backed write cache is necessary".
-
MySQL documentation says about O_DIRECT_NO_FSYNC : "InnoDB uses O_DIRECT during flushing I/O, but skips the fsync() system call after each write operation." Is this accurate ? "skips fsync after each operations" sounds suspiciously like it is about redo log.
I think that it is a bad practice to have "don’t care" combinations of parameters. The combinatorial explosion on parameters is already too bad.
I do not think it is safe to skip
The function |
|
MySQL documentation says "As of MySQL 8.0.14, fsync() is called after creating a new file, after increasing file size, and after closing a file, to ensure that file system metadata changes are synchronized. " Is this also inaccurate? Would this use be safe, in some circumstances, for example when redo log and data file are on the same disk? |
I think that we would be on a slippery slope if we start to assume that everything is safe, as long as everyone uses ext4fs and the same file system journal covers both On Microsoft Windows, it may be reasonable to assume that everyone uses NTFS. On Linux, there is a wide range of choice and too many details that may affect such assumptions. |
22252bb to
68e3f16
Compare
| "Whether and how to use the doublewrite buffer. " | ||
| "OFF=Assume that writes of innodb_page_size are atomic; " | ||
| "ON=Prevent torn writes (the default); " | ||
| "fast=Like ON, but avoid fsync() on data files", |
There was a problem hiding this comment.
Unnecessary Linuxism-Posixism with fsync(). maybe "flushing" is better.
There was a problem hiding this comment.
"Flushing" is somewhat overloaded; for buf_flush_page_cleaner or fflush it actually means "writing". Maybe, "Like ON, but do not synchronize writes to data files"?
| /** Asynchronous write */ | ||
| WRITE_ASYNC= WRITE_SYNC | 1, | ||
| /** Asynchronous doublewritten page */ | ||
| WRITE_DBL= WRITE_ASYNC | 4, |
There was a problem hiding this comment.
Maybe a define DBLWR 4 or something,so you do not have to use magic number 4 in 3 different places - WRITE_DBL, PUNCH_DBL and is_doublewritten?
There was a problem hiding this comment.
I do not want to expose that constant outside IORequest. The enum is public; there is no concept of private values. Sure, I could add a private static constexpr for that value, but I’d say that it would obfuscate things more.
| void set_use_doublewrite(ulong use) | ||
| { | ||
| buf_dblwr.set_use(use); | ||
| need_unflushed_spaces= !write_through && buf_dblwr.need_fsync(); |
There was a problem hiding this comment.
should "buffered" be checked for "need_unflushed_spaces"? buffered implies need_unflushed_spaces = true, regardless buf_dblwr.need_fsync() ?
There was a problem hiding this comment.
I don’t think that fil_system.buffered (a.k.a. innodb_data_file_buffering) should be checked here. I believe that also writes via O_DIRECT may require fdatasync or fsync to be invoked, in case the file had been previously extended with a special system call, be it fallocate to allocate physical blocks or ftruncate to set the logical file size. In both cases, some recent writes may have impacted the file system metadata, that is, associate the written block with the inode so that a read like system call will return data from that block, instead of pretending that the region is filled with NUL bytes. If we always extended files via pwrite or WriteFile and invoked fdatasync or similar after such file-extending calls, then yes, we might safely skip fdatasync for subsequent overwrite-in-place. But even then, I suppose that invoking fdatasync would have the benefit that it would force a write-back of any non-durable cache inside the storage layer.
In commit 2464876 (MDEV-30136) the parameter innodb_flush_method was deprecated, with no direct replacement for innodb_flush_method=O_DIRECT_NO_FSYNC. Let us change innodb_doublewrite from Boolean to ENUM that can be changed while the server is running: OFF: Assume that writes of innodb_page_size are atomic ON: Prevent torn writes (the default) fast: Like ON, but avoid synchronizing writes to data files The deprecated start-up parameter innodb_flush_method=NO_FSYNC will cause innodb_doublewrite=ON to be changed to innodb_doublewrite=fast, which will prevent InnoDB from making any durable writes to data files. This would normally be done right before the log checkpoint LSN is updated. Depending on the file systems being used and their configuration, this may or may not be safe. The value innodb_doublewrite=fast differs from the previous combination of innodb_doublewrite=ON and innodb_flush_method=O_DIRECT_NO_FSYNC by always invoking os_file_flush() on the doublewrite buffer itself in buf_dblwr_t::flush_buffered_writes_completed(). This should be safer when there are multiple doublewrite batches between checkpoints. Typically, once per second, buf_flush_page_cleaner() would write out up to innodb_io_capacity pages and advance the log checkpoint. Also typically, innodb_io_capacity>128, which is the size of the doublewrite buffer in pages. Should os_file_flush_func() not be invoked between doublewrite batches, writes could be reordered in an unsafe way. The setting innodb_doublewrite=fast could be safe when the doublewrite buffer (the first file of the system tablespace) and the data files reside in the same file system. This was tested by running "./mtr --rr innodb.alter_kill". On the first server startup, with innodb_doublewrite=fast, os_file_flush_func() would only be invoked on the ibdata1 file and possibly ib_logfile0. On subsequent startups with innodb_doublewrite=OFF, os_file_flush_func() will be invoked on the individual data files during log_checkpoint(). Note: The setting debug_no_sync (in the code, my_disable_sync) would disable all durable writes to InnoDB files, which would be much less safe. IORequest::Type: Introduce special values WRITE_DBL and PUNCH_DBL for asynchronous writes that are submitted via the doublewrite buffer. In this way, fil_space_t::use_doublewrite() or buf_dblwr.in_use() will only be consulted during buf_page_t::flush() and the doublewrite buffer can be enabled or disabled without any fear of inconsistency. buf_dblwr_t::block_size: Replaces block_size(). buf_dblwr_t::flush_buffered_writes(): If !in_use() and the doublewrite buffer is empty, just invoke fil_flush_file_spaces() and return. The doublewrite buffer could have been disabled while a batch was in progress. innodb_init_params(): If innodb_flush_method=O_DIRECT_NO_FSYNC, set innodb_doublewrite=fast or innodb_doublewrite=fearless. Thanks to Mark Callaghan for reporting this, and Vladislav Vaintroub for feedback.
Description
In commit 2464876 (MDEV-30136) the parameter
innodb_flush_methodwas deprecated, with no direct replacement forinnodb_flush_method=O_DIRECT_NO_FSYNC.Let us change
innodb_doublewritefrom Boolean toENUM, with 3 new values:OFF: Assume that writes of innodb_page_size are atomicON: Prevent torn writes (the default)fast: LikeON, but avoidfsync()orfdatasync()on data filesThe deprecated start-up parameter
innodb_flush_method=NO_FSYNCwill causeinnodb_doublewrite=ONto be changed toinnodb_doublewrite=fast, which will prevent InnoDB from making any durable writes to data files other than the doublewrite buffer. This would normally be done right before the log checkpoint LSN is updated. Depending on the file systems being used and their configuration, this may or may not be safe.The value
innodb_doublewrite=fastdiffers from the previous combination ofinnodb_doublewrite=ONandinnodb_flush_method=O_DIRECT_NO_FSYNCby always invokingos_file_flush_func()on the doublewrite buffer itself inbuf_dblwr_t::flush_buffered_writes_completed(). This should be safer when there are multiple doublewrite batches between checkpoints. Typically, once per second,buf_flush_page_cleaner()would write out up toinnodb_io_capacitypages and advance the log checkpoint. Also typically,innodb_io_capacity>128, which is the size of the doublewrite buffer in pages. Shouldos_file_flush_func()not be invoked between doublewrite batches, writes could be reordered in an unsafe way.The setting
innodb_doublewrite=fastcould be safe when the doublewrite buffer (the first file of the system tablespace) and the data files reside in the same file system.Note: Enabling the setting
debug_no_sync(in the code,my_disable_sync) would also disable durable writes to the log file, which would be much less safe.Thanks to @mdcallag for reporting this.
How can this PR be tested?
and
awatch fil_system.write_throughto check that no data files are being created withO_DSYNCand nofsyncorfdatasyncis being invoked on them.On the first server startup of
innodb.alter_kill, withinnodb_doublewrite=fast,os_file_flush_func()would only be invoked on theibdata1and possiblyib_logfile0(unless using a memory-mapped log). On subsequent startups withinnodb_doublewrite=OFF,os_file_flush_func()will be invoked on data files duringlog_checkpoint().Side note:
innodb_data_file_write_through=ONwill apply also to the InnoDB temporary tablespace (ibtmp1) even though this is not necessary.IORequest::Type: Introduce special valuesWRITE_DBLandPUNCH_DBLfor asynchronous writes that are submitted via the doublewrite buffer. In this way,fil_space_t::use_doublewrite()orbuf_dblwr.in_use()will only be consulted duringbuf_page_t::flush()and the doublewrite buffer can be enabled or disabled without any fear of inconsistency.buf_dblwr_t::block_size: Replacesblock_size().buf_dblwr_t::flush_buffered_writes(): If!in_use()and the doublewrite buffer is empty, just invokefil_flush_file_spaces()and return. The doublewrite buffer could have been disabled while a batch was in progress.Basing the PR against the correct MariaDB version
PR quality check