Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wasabi 215p11 #1

Merged
merged 32 commits into from
May 28, 2024
Merged

Conversation

allanjude
Copy link

Integrate changes into the Wasabi 2.1.5p11 branch

robn and others added 30 commits April 3, 2024 09:58
Entries in the dbuf cache contribute only the size of the dbuf data to
the cache size. Attached "user" data is not counted. This can lead to
the data currently "owned" by the cache consuming more memory accounting
appears to show. In some cases (eg a metadnode data block with all child
dnode_t slots allocated), the actual size can be as much as 3x as what
the cache believes it to be.

This is arguably correct behaviour, as the cache is only tracking the
size of the dbuf data, not even the overhead of the dbuf_t. On the other
hand, in the above case of dnodes, evicting cached metadnode dbufs is
the only current way to reclaim the dnode objects, and can lead to the
situation where the dbuf cache appears to be comfortably within its
target memory window and yet is holding enormous amounts of slab memory
that cannot be reclaimed.

This commit adds a facility for a dbuf user to artificially inflate the
apparent size of the dbuf for caching purposes. This at least allows for
cache tuning to be adjusted to match something closer to the real memory
overhead.

metadnode dbufs carry a >1KiB allocation per dnode in their user data.
This informs the dbuf cache machinery of that fact, allowing it to make
better decisions when evicting dbufs.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes openzfs#15511
(cherry picked from commit 92dc4ad)
Before 5.4 we have to do a little math.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
(cherry picked from commit 93fb150)
The regular ABD iterators yield data buffers, so they have to map and
unmap pages into kernel memory. If the caller only wants to count
chunks, or can use page pointers directly, then the map/unmap is just
unnecessary overhead.

This adds adb_iterate_page_func, which yields unmapped struct page
instead.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
(cherry picked from commit 930b785)
This is just renaming the existing functions we're about to replace and
grouping them together to make the next commits easier to follow.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
(cherry picked from commit 9bf6a7c)
Light reshuffle to make it a bit more linear to read and get rid of a
bunch of args that aren't needed in all cases.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
(cherry picked from commit ad847ff)
This is just setting up for the next couple of commits, which will add a
new IO function and a parameter to select it.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
(cherry picked from commit 7ee8369)
This commit tackles a number of issues in the way BIOs (`struct bio`)
are constructed for submission to the Linux block layer.

### BIO segment limits are set incorrectly

The kernel has a hard upper limit on the number of pages/segments that
can be added to a BIO, as well as a separate limit for each device
(related to its queue depth and other scheduling characteristics).

ZFS counts the number of memory pages in the request ABD
(`abd_nr_pages_off()`, and then uses that as the number of segments to
put into the BIO, up to the hard upper limit. If it requires more than
the limit, it will create multiple BIOs.

Leaving aside the fact that page count method is wrong (see below), not
limiting to the device segment max means that the device driver will
need to split the BIO in half. This is alone is not necessarily a
problem, but it interacts with another issue to cause a much larger
problem.

### BIOs are filled inefficiently

The kernel function to add a segment to a BIO (`bio_add_page()`) takes a
`struct page` pointer, and offset+len within it. `struct page` can
represent a run of contiguous memory pages (known as a "compound page").
In can be of arbitrary length.

The ZFS functions that count ABD pages and load them into the BIO
(`abd_nr_pages_off()`, `bio_map()` and `abd_bio_map_off()`) will never
consider a page to be more than `PAGE_SIZE` (4K), even if the `struct
page` is for multiple pages. In this case, it will load the same `struct
page` into the BIO multiple times, with the offset adjusted each time.

With a sufficiently large ABD, this can easily lead to the BIO being
entirely filled much earlier than it could have been. This is also
further contributes to the problem caused by the incorrect segment limit
calculation, as its much easier to go past the device limit, and so
require a split.

Again, this is not a problem on its own.

### Incomplete pages are submitted to BIOs

The logic for "never submit more than `PAGE_SIZE`" is actually a little
more subtle. It will actually never submit a buffer that crosses a 4K
page boundary.

In practice, this is fine, as most ABDs are scattered, that is a list of
complete 4K pages, and so are loaded in as such.

Linear ABDs are typically allocated from slabs, and for small sizes they
are frequently not aligned to page boundaries. For example, a 12K
allocation can span four pages, eg:

     -- 4K -- -- 4K -- -- 4K -- -- 4K --
    |        |        |        |        |
          :## ######## ######## ######:    [1K, 4K, 4K, 3K]

Such an allocation would be loaded into a BIO as you see:

    [1K, 4K, 4K, 3K]

This tends not to be a problem in practice, because even if the BIO were
filled and needed to be split, each half would still have either a start
or end aligned to the logical block size of the device (assuming 4K at
least).

---

In ideal circumstances, these shortcomings don't cause any particular
problems. Its when they start to interact with other ZFS features that
things get interesting.

### Aggregation

Aggregation will create a "gang" ABD, which is simply a list of other
ABDs. Iterating over a gang ABD is just iterating over each ABD within
it in turn.

Because the segments are simply loaded in order, we can end up with
uneven segments either side of the "gap" between the two ABDs. For
example, two 12K ABDs might be aggregated and then loaded as:

    [1K, 4K, 4K, 3K, 2K, 4K, 4K, 2K]

Should a split occur, each individual BIO can end up either having an
start or end offset that is not aligned to the logical block size, which
some drivers (eg SCSI) will reject. However, this tends not to happen
because the default aggregation limit usually keeps the BIO small enough
to not require more than one split, and most pages are actually full 4K
pages, so hitting an uneven gap is very rare anyway.

### Gang blocks

If the pool is under particular memory pressure, then an IO can be
broken down into a "gang block", a 512-byte block composed of a header
and up to three block pointers. Each points to a fragment of the
original write, or in turn, another gang block, breaking the original
data up over and over until space can be found in the pool for each of
them.

Each gang header is a separate 512-byte memory allocation from a slab,
that needs to be written down to disk. When the gang header is added to
the BIO, its a single 512-byte segment.

### Aggregation with gang blocks

Pulling all this together, consider a large aggregated write of gang
blocks. This results a BIO containing lots of 512-byte segments. Given
our tendency to overfill the BIO, a split is likely, and most possible
split points will yield a pair of BIOs that are misaligned. Drivers that
care, like the SCSI driver, will reject them.

---

This commit is a substantial refactor and rewrite of much of `vdev_disk`
to sort all this out.

### Configure maximum segment size for device

`vdev_bio_max_segs()` now returns the ideal maximum size for the device,
if available. There's also a tuneable `zfs_vdev_disk_max_segs` to
override this, to assist with testing.

### ABDs checked up front for page count and alignment

We scan the ABD up front to count the number of pages within it, and to
confirm that if we submitted all those pages to one or more BIOs, it
could be split at any point with creating a misaligned BIO. Along the
way, we determine how many BIO segments we'll need to handle the entire
ABD, accounting for BIO fill limits (including segment and byte limits).

If the pages in the BIO are not usable (as in any of the above
situations), the ABD is linearised, and then checked again. This is the
same technique used in `vdev_geom` on FreeBSD, adjusted for Linux's
variable page size and allocator quirks.

In the end, a count of segments is produced, which is then used to
determine how many BIOs will be allocated.

### Virtual block IO object

`vbio_t` is a cleanup and enhancement of the old `dio_request_t`. The
idea is simply that it can hold all the state needed to create, submit
and return multiple BIOs, including all the refcounts, the ABD copy if
it was needed, and so on. Apart from what I hope is a clearer interface,
the major difference is that because we know how many BIOs we'll need up
front, we don't need the old overflow logic that would grow the BIO
array, throw away all the old work and restart. We can get it right from
the start.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
(cherry picked from commit 588a6a2)
This makes the submission method selectable at module load time via the
`zfs_vdev_disk_classic` parameter, allowing this change to be backported
to 2.2 safely, and disabled in favour of the "classic" submission method
if new problems come up.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
(cherry picked from commit 2382fdb)
Simplifies our code a lot, so we don't have to wait for each and
reassemble them.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes openzfs#15533
Closes openzfs#15588
(cherry picked from commit 72fd834)
Before 4.5 (specifically, torvalds/linux@ddc58f2), head and tail pages
in a compound page were refcounted separately. This means that using the
head page without taking a reference to it could see it cleaned up later
before we're finished with it. Specifically, bio_add_page() would take a
reference, and drop its reference after the bio completion callback
returns.

If the zio is executed immediately from the completion callback, this is
usually ok, as any data is referenced through the tail page referenced
by the ABD, and so becomes "live" that way. If there's a delay in zio
execution (high load, error injection), then the head page can be freed,
along with any dirty flags or other indicators that the underlying
memory is used. Later, when the zio completes and that memory is
accessed, its either unmapped and an unhandled fault takes down the
entire system, or it is mapped and we end up messing around in someone
else's memory. Both of these are very bad.

The solution on these older kernels is to take a reference to the head
page when we use it, and release it when we're done. There's not really
a sensible way under our current structure to do this; the "best" would
be to keep a list of head page references in the ABD, and release them
when the ABD is freed.

Since this additional overhead is totally unnecessary on 4.5+, where
head and tail pages share refcounts, I've opted to simply not use the
compound head in ABD page iteration there. This is theoretically less
efficient (though cleaning up head page references would add overhead),
but its safe, and we still get the other benefits of not mapping pages
before adding them to a bio and not mis-splitting pages.

There doesn't appear to be an obvious symbol name or config option we
can match on to discover this behaviour in configure (and the mm/page
APIs have changed a lot since then anyway), so I've gone with a simple
version check.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes openzfs#15533
Closes openzfs#15588
(cherry picked from commit c6be6ce)
After IO is unplugged, it may complete immediately and vbio_completion
be called on interrupt context. That may interrupt or deschedule our
task. If its the last bio, the vbio will be freed. Then, we get
rescheduled, and try to write to freed memory through vbio->.

This patch just removes the the cleanup, and the corresponding assert.
These were leftovers from a previous iteration of vbio_submit() and were
always "belt and suspenders" ops anyway, never strictly required.

Reported-by: Rich Ercolani <rincebrain@gmail.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
(cherry picked from commit 34f662a)
We are finding that as customers get larger and faster machines
(hundreds of cores, large NVMe-backed pools) they keep hitting
relatively low performance ceilings. Our profiling work almost always
finds that they're running into bottlenecks on the SPA IO taskqs.
Unfortunately there's often little we can advise at that point, because
there's very few ways to change behaviour without patching.

This commit adds two load-time parameters `zio_taskq_read` and
`zio_taskq_write` that can configure the READ and WRITE IO taskqs
directly.

This achieves two goals: it gives operators (and those that support
them) a way to tune things without requiring a custom build of OpenZFS,
which is often not possible, and it lets us easily try different config
variations in a variety of environments to inform the development of
better defaults for these kind of systems.

Because tuning the IO taskqs really requires a fairly deep understanding
of how IO in ZFS works, and generally isn't needed without a pretty
serious workload and an ability to identify bottlenecks, only minimal
documentation is provided. Its expected that anyone using this is going
to have the source code there as well.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
(cherry picked from commit 12a031a)
Missed in openzfs#15696, backporting openzfs#15675.

Signed-off-by: Rob Norris <robn@despairlabs.com>
(cherry picked from commit 437d598)
sbuf_cpy() resets the sbuf state, which is wrong for sbufs allocated by
sbuf_new_for_sysctl().  In particular, this code triggers an assertion
failure in sbuf_clear().

Simplify by just using sysctl_handle_string() for both reading and
setting the tunable.

Fixes: 6930ecb ("spa: make read/write queues configurable")
Reported-by: Peter Holm <pho@FreeBSD.org>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
(cherry picked from commit e5230fb)
For FreeBSD sysctls, we don't want the extra newline, since the
sysctl(8) utility will format strings appropriately.

Signed-off-by: Mark Johnston <markj@FreeBSD.org>
(cherry picked from commit 4cf9437)
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Closes openzfs#15036
(cherry picked from commit d78d49c)
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Non-blocking form of taskq_dispatch_ent(), returns false if it can't
acquire the taskq lock.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Turns out decompression never worked at all; likely an oversight
converting the original "transparent" versions to a true compression
option. This makes it work.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Just an extra explicit opt-in, since it changes the compression method
isn't upstreamed yet.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes openzfs#15290
(cherry picked from commit 8f5aa8c)
Previously, if a cachefile is passed to zpool import, the cached config
is mostly offered as-is to ZFS_IOC_POOL_TRYIMPORT->spa_tryimport(), and
the results are taken as the canonical pool config and handed back to
ZFS_IOC_POOL_IMPORT.

In the course of its operation, spa_load() will inspect the pool and
build a new config from what it finds on disk. However, it then
regenerates a new config ready to import, and so rightly sets the hostid
and hostname for the local host in the config it returns.

Because of this, the "require force" checks always decide the pool is
exported and last touched by the local host, even if this is not true,
which is possible in a HA environment when MMP is not enabled. The pool
may be imported on another head, but the import checks still pass here,
so the pool ends up imported on both.

(This doesn't happen when a cachefile isn't used, because the pool
config is discovered in userspace in zpool_find_import(), and that does
find the on-disk hostid and hostname correctly).

Since the systemd zfs-import-cache.service unit uses cachefile imports,
this can lead to a system returning after a crash with a "valid"
cachefile on disk and automatically, quietly, importing a pool that has
already been taken up by a secondary head.

This commit causes the on-disk hostid and hostname to be included in the
ZPOOL_CONFIG_LOAD_INFO item in the returned config, and then changes the
"force" checks for zpool import to use them if present.

This method should give no change in behaviour for old userspace on new
kernels (they won't know to look for the new config items) and for new
userspace on old kernels (the won't find the new config items).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes openzfs#15290
(cherry picked from commit 54b1b1d)
Detail the import progress of log spacemaps as they can take a very
long time.  Also grab the spa_note() messages to, as they provide
insight into what is happening

Sponsored-By: OpenDrives Inc.
Sponsored-By: Klara Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@klarasystems.com>
Co-authored-by: Allan Jude <allan@klarasystems.com>
Closes openzfs#15539
This commit allow spa_load() to drop the spa_namespace_lock so
that imports can happen concurrently. Prior to dropping the
spa_namespace_lock, the import logic will set the spa_load_thread
value to track the thread which is doing the import.

Consumers of spa_lookup() retain the same behavior by blocking
when either a thread is holding the spa_namespace_lock or the
spa_load_thread value is set. This will ensure that critical
concurrent operations cannot take place while a pool is being
imported.

The zpool command is also enhanced to provide multi-threaded support
when invoking zpool import -a.

Lastly, zinject provides a mechanism to insert artificial delays
when importing a pool and new zfs tests are added to verify parallel
import functionality.

Contributions-by: Don Brady <don.brady@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Wilson <gwilson@delphix.com>
Closes openzfs#16093
Simplify vdev probes in the zio_vdev_io_done context to
avoid holding the spa config lock for a long duration.

Also allow zpool clear if no evidence of another host
is using the pool.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@klarasystems.com>
Closes openzfs#15839
When failmode=continue is set and the pool suspends, both 'zpool status'
and the 'zfs/pool/state' kstat ignore it and report the normal vdev tree
state. There's no clear indicator that the pool is suspended. This is
unlike suspend in failmode=wait, or suspend due to MMP check failure,
which both report "SUSPENDED" explicitly.

This commit changes it so SUSPENDED is reported for failmode=continue
the same as for other modes.

Rationale:

The historical behaviour of failmode=continue is roughly, "press on as
though all is well". To this end, the fact that the pool had suspended
was not shown, to maintain the façade that all is well.

Its unclear why hiding this information was considered appropriate. One
possibility is that it was expected that a true pool fault would always
be reported as DEGRADED or FAULTED, and that the pool could not suspend
without these happening.

That is not necessarily true, as vdev health and suspend state are only
loosely connected, such that a pool in (apparent) good health can be
suspended for good reasons, and of course a degraded pool does not lead
to suspension. Even if that expectation were true, there's still a
difference in urgency - a degraded pool may not need to be attended to
for hours, while a suspended pool is most often unusable until an
operator intervenes.

An operator that has set failmode=continue has presumably done so
because their workload is one that can continue to operate in a useful
way when the pool suspends. In this case the operator still needs a
clear indicator that there is a problem that needs attending to.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes openzfs#15297
If a linear buffer spans multiple pages, and the first page has a
non-zero starting offset, the checker would not include the offset, and
so would think there was an alignment gap at the end of the first page,
rather than at the start.

That is, for a 16K buffer spread across five pages with an initial 512B
offset:

    [.XXXXXXX][XXXXXXXX][XXXXXXXX][XXXXXXXX][XXXXXXX.]

It would be interpreted as:

    [XXXXXXX.][XXXXXXXX]...

And be rejected as misaligned.

Since it's already a linear ABD, the "linearising" copy would just reuse
the buffer as-is, and the second check would failing, tripping the
VERIFY in vdev_disk_io_rw().

This commit fixes all this by including the offset in the check for
end-of-page alignment.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
(cherry picked from commit 52f60c2)
zil_suspend() forces a full ZIL commit before suspending the ZIL, so has
to be allowed to call it!
0mp and others added 2 commits May 7, 2024 02:15
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Changed spa_export_common() such that it no longer holds the
spa_namespace_lock for the entire duration and instead sets
spa_export_thread to indicate an import is in progress on the
spa.  This allows for an export to a diffent pool to proceed
in parallel while an export is still processing potentially
long operations like spa_unload_log_sm_flush_all().

Calls like spa_lookup() and spa_vdev_enter() that rely on
the spa_namespace_lock to serialize them against a concurrent
export, now wait for any in-progress export thread to complete
before proceeding.

The 'zpool import -a' sub-command also provides multi-threaded
support, using a thread pool to submit the exports in parallel.

Sponsored-By: Klara Inc.
Sponsored-by: Wasabi Technology, Inc.

Signed-off-by: Don Brady <don.brady@klarasystems.com>
@geoffamey geoffamey merged commit 483b712 into BlueArchive:zfs-2.1.5-v4 May 28, 2024
0 of 5 checks passed
@robn robn deleted the wasabi-215p11 branch June 14, 2024 01:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
8 participants