Backfill metadnode more intelligently
Only attempt to backfill lower metadnode object numbers if at least 4096
objects have been freed since the last rescan, and at most once per
transaction group. This avoids a pathology in dmu_object_alloc() that
caused O(N^2) behavior for create-heavy workloads and substantially
improves object creation rates.

"Normally, the object allocator simply checks to see if the next
object is available. The slow calls happened when dmu_object_alloc()
checks to see if it can backfill lower object numbers. This happens
every time we move on to a new L1 indirect block (i.e. every 32 *
128 = 4096 objects).  When re-checking lower object numbers, we use
the on-disk fill count (blkptr_t:blk_fill) to quickly skip over
indirect blocks that don’t have enough free dnodes (defined as an L2
with at least 393,216 of 524,288 dnodes free). Therefore, we may
find that a block of dnodes has a low (or zero) fill count, and yet
we can’t allocate any of its dnodes, because they've been allocated
in memory but not yet written to disk. In this case we have to hold
each of the dnodes and then notice that it has been allocated in
memory.

The end result is that allocating N objects in the same TXG can
require CPU usage proportional to N^2."

Signed-off-by: Ned Bass <bass6@llnl.gov>
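
To make the quoted arithmetic concrete, here is a small stand-alone sketch (not part of the patch) that reproduces those numbers, assuming the usual 512-byte dnodes, 16K dnode blocks, 16K metadnode indirect blocks, and 128-byte block pointers:

#include <stdio.h>

int
main(void)
{
	long dnodes_per_block = 16384 / 512;	/* 32 dnodes per 16K dnode block */
	long bps_per_indirect = 16384 / 128;	/* 128 blkptrs per 16K indirect block */

	long per_l1 = dnodes_per_block * bps_per_indirect;	/* 32 * 128 = 4096 */
	long per_l2 = per_l1 * bps_per_indirect;		/* 524288 */
	long min_free = per_l2 * 3 / 4;				/* 393216 */

	printf("dnodes covered by one L1 indirect: %ld\n", per_l1);
	printf("dnodes covered by one L2 indirect: %ld\n", per_l2);
	printf("free dnodes needed before an L2 is backfilled: %ld\n", min_free);
	return (0);
}

The backfill check therefore fires once every 4096 allocations, and the OBJSET_RESCAN_DNODE_THRESHOLD added below reuses that same 4096 figure as the minimum number of frees before another rescan is worth attempting.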
nedbass authored and behlendorf committed May 19, 2016
1 parent 4825ed3 commit 050b0e6
Showing 4 changed files with 33 additions and 11 deletions.
5 changes: 5 additions & 0 deletions include/sys/dmu_objset.h
@@ -57,6 +57,9 @@ struct dmu_tx;
 
 #define OBJSET_FLAG_USERACCOUNTING_COMPLETE (1ULL<<0)
 
+/* Backfill lower metadnode objects after this many have been freed. */
+#define OBJSET_RESCAN_DNODE_THRESHOLD 4096
+
@adilger commented May 20, 2016:

This should probably default to be at least:

DNODES_PER_BLOCK << (DMU_META_DNODE(os)->dn_indblkshift - SPA_BLKPTRSHIFT)

but more likely 4-8x that amount. The reason is that there is no point going back to rescan blocks unless at least 3/4 of a block's worth of dnodes are free; otherwise the fill count heuristic will just skip the partially-free blocks anyway. Likewise, if the freed dnodes are not concentrated in a single block, the fill count heuristic will skip those partially-empty blocks too, so we may as well not rescan until there are likely to be fully free blocks to fill. Since the rescan always starts from the beginning of the metadnode (which may reference billions of objects on a large Lustre filesystem), it makes sense to avoid it as much as possible.
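
A rough sketch of that suggestion (illustrative only; the helper name and the 4x multiplier are assumptions, not something proposed in this patch) would derive the threshold from the metadnode geometry rather than hard-coding it:

/* Sketch: derive the rescan threshold from the metadnode geometry. */
static uint64_t
objset_rescan_threshold(objset_t *os)
{
	uint64_t l1_dnodes = DNODES_PER_BLOCK <<
	    (DMU_META_DNODE(os)->dn_indblkshift - SPA_BLKPTRSHIFT);

	/* "4-8x that amount": wait for several L1 blocks worth of frees. */
	return (l1_dnodes * 4);
}

With the default 16K metadnode indirect blocks this evaluates to 4 * 4096 = 16384 freed dnodes between rescans.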


 typedef struct objset_phys {
 	dnode_phys_t os_meta_dnode;
 	zil_header_t os_zil_header;
@@ -106,6 +109,8 @@ struct objset {
 	zil_header_t os_zil_header;
 	list_t os_synced_dnodes;
 	uint64_t os_flags;
+	uint64_t os_freed_dnodes;
+	boolean_t os_rescan_dnodes;
 
 	/* Protected by os_obj_lock */
 	kmutex_t os_obj_lock;
31 changes: 20 additions & 11 deletions module/zfs/dmu_object.c
@@ -36,33 +36,42 @@ dmu_object_alloc(objset_t *os, dmu_object_type_t ot, int blocksize,
     dmu_object_type_t bonustype, int bonuslen, dmu_tx_t *tx)
 {
 	uint64_t object;
-	uint64_t L2_dnode_count = DNODES_PER_BLOCK <<
+	uint64_t L1_dnode_count = DNODES_PER_BLOCK <<
 	    (DMU_META_DNODE(os)->dn_indblkshift - SPA_BLKPTRSHIFT);
 	dnode_t *dn = NULL;
-	int restarted = B_FALSE;
 
 	mutex_enter(&os->os_obj_lock);
 	for (;;) {
 		object = os->os_obj_next;
 		/*
-		 * Each time we polish off an L2 bp worth of dnodes
-		 * (2^13 objects), move to another L2 bp that's still
-		 * reasonably sparse (at most 1/4 full). Look from the
-		 * beginning once, but after that keep looking from here.
-		 * If we can't find one, just keep going from here.
+		 * Each time we polish off a L1 bp worth of dnodes (2^12
+		 * objects), move to another L1 bp that's still reasonably
+		 * sparse (at most 1/4 full). Look from the beginning at most
+		 * once per txg, but after that keep looking from here.
+		 * os_scan_dnodes is set during txg sync if enough objects
+		 * have been freed since the previous rescan to justify
+		 * backfilling again. If we can't find a suitable block, just
+		 * keep going from here.
 		 *
 		 * Note that dmu_traverse depends on the behavior that we use
 		 * multiple blocks of the dnode object before going back to
 		 * reuse objects. Any change to this algorithm should preserve
 		 * that property or find another solution to the issues
 		 * described in traverse_visitbp.
 		 */
-		if (P2PHASE(object, L2_dnode_count) == 0) {
-			uint64_t offset = restarted ? object << DNODE_SHIFT : 0;
-			int error = dnode_next_offset(DMU_META_DNODE(os),
+
+		if (P2PHASE(object, L1_dnode_count) == 0) {
+			uint64_t offset;
+			int error;
+			if (os->os_rescan_dnodes) {
+				offset = 0;
+				os->os_rescan_dnodes = B_FALSE;
+			} else {
+				offset = object << DNODE_SHIFT;
+			}
+			error = dnode_next_offset(DMU_META_DNODE(os),
 			    DNODE_FIND_HOLE,
 			    &offset, 2, DNODES_PER_BLOCK >> 2, 0);
-			restarted = B_TRUE;
 			if (error == 0)
 				object = offset >> DNODE_SHIFT;
 		}
7 changes: 7 additions & 0 deletions module/zfs/dmu_objset.c
@@ -1152,6 +1152,13 @@ dmu_objset_sync(objset_t *os, zio_t *pio, dmu_tx_t *tx)
 		if (dr->dr_zio)
 			zio_nowait(dr->dr_zio);
 	}
+
+	/* Enable dnode backfill if enough objects have been freed. */
+	if (os->os_freed_dnodes >= OBJSET_RESCAN_DNODE_THRESHOLD) {
+		os->os_rescan_dnodes = B_TRUE;
+		os->os_freed_dnodes = 0;
+	}
+
 	/*
 	 * Free intent log blocks up to this tx.
 	 */
1 change: 1 addition & 0 deletions module/zfs/dnode_sync.c
@@ -688,6 +688,7 @@ dnode_sync(dnode_t *dn, dmu_tx_t *tx)
 	}
 
 	if (freeing_dnode) {
+		dn->dn_objset->os_freed_dnodes++;

@adilger commented May 20, 2016:

Is there a lock protecting this increment? Strictly speaking it doesn't need one, since it is just a heuristic and at worst this might delay the rescanning by some amount (I think the worst case would be a factor of nr_active_cpus(), but probably much less).

That said, it would be helpful to have a comment here explaining that this is potentially racy but harmless (and the reason why), since it seems possible that static code analysis or observant coders might try to fix this by introducing locking here.
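
For illustration, the kind of annotation being asked for might look something like this (a sketch only, not part of the patch; atomic_add_64() is mentioned merely as the option if an exact count were ever required):

	if (freeing_dnode) {
		/*
		 * This increment is intentionally unlocked. os_freed_dnodes
		 * only gates a heuristic rescan, so a lost update merely
		 * delays the next backfill until a few more dnodes are
		 * freed; do not "fix" this by adding locking. If an exact
		 * count ever matters, atomic_add_64() would be the cheaper
		 * alternative to a mutex.
		 */
		dn->dn_objset->os_freed_dnodes++;
		dnode_sync_free(dn, tx);
		return;
	}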

@adilger commented May 20, 2016:

The only other suggestion would be to track the lowest dnode number that was freed (or better, the lowest dnode in an L2 block that exceeds the fill count threshold, if the fill count is easily accessible). In a huge MDT filesystem (on the order of a billion files, i.e. 2000 or more L2 blocks) scanning from the beginning each time might take a non-trivial amount of time, especially if any of those blocks have to be read from disk. Under a heavy create+delete workload the same dnodes are likely to be reused each time, so the rescanning could be kept to a minimal number of blocks.

Again, this could be updated without locking, since "perfect" behaviour only needs the minimum to within the 512K dnodes covered by an L2 block, and at worst some L2 block(s) would retain unused dnodes until another dnode with a lower number is unlinked or the dataset is next mounted.
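
A minimal sketch of that idea, assuming a hypothetical os_lowest_freed field in struct objset and reusing names from this patch (illustrative only, not proposed code):

	/* dnode_sync(): also remember the lowest object number freed. */
	if (freeing_dnode) {
		objset_t *os = dn->dn_objset;

		os->os_freed_dnodes++;
		if (dn->dn_object < os->os_lowest_freed)
			os->os_lowest_freed = dn->dn_object;	/* racy but harmless */
		dnode_sync_free(dn, tx);
		return;
	}

	/*
	 * dmu_object_alloc(): start the rescan near the lowest freed dnode
	 * rather than at object 0, so a metadnode that references billions
	 * of objects is not walked from the beginning every time.
	 */
	if (os->os_rescan_dnodes) {
		offset = P2ALIGN(os->os_lowest_freed, L1_dnode_count) <<
		    DNODE_SHIFT;
		os->os_rescan_dnodes = B_FALSE;
		os->os_lowest_freed = UINT64_MAX;
	} else {
		offset = object << DNODE_SHIFT;
	}

os_lowest_freed would also need to be initialized to UINT64_MAX when the objset is opened, and reset after each rescan as shown.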

 		dnode_sync_free(dn, tx);
 		return;
 	}

1 comment on commit 050b0e6

@adilger commented:

Patch looks very simple and straightforward, and I have only minor suggestions for improvement.

I know @bzzz77 was also working on this issue, and has a more complex solution that tracks specific freed dnodes. Alex's might be faster in a heavy create/unlink workload since it can find the empty dnode slots more easily. It might be worthwhile to compare the two, but other things being equal this patch has the benefit of being less complex and easier to understand, and less likely to have any bugs.
