Commit
Only attempt to backfill lower metadnode object numbers if at least 4096 objects have been freed since the last rescan, and at most once per transaction group. This avoids a pathology in dmu_object_alloc() that caused O(N^2) behavior for create-heavy workloads and substantially improves object creation rates.

"Normally, the object allocator simply checks to see if the next object is available. The slow calls happened when dmu_object_alloc() checks to see if it can backfill lower object numbers. This happens every time we move on to a new L1 indirect block (i.e. every 32 * 128 = 4096 objects). When re-checking lower object numbers, we use the on-disk fill count (blkptr_t:blk_fill) to quickly skip over indirect blocks that don't have enough free dnodes (defined as an L2 with at least 393,216 of 524,288 dnodes free). Therefore, we may find that a block of dnodes has a low (or zero) fill count, and yet we can't allocate any of its dnodes, because they've been allocated in memory but not yet written to disk. In this case we have to hold each of the dnodes and then notice that it has been allocated in memory. The end result is that allocating N objects in the same TXG can require CPU usage proportional to N^2."

Signed-off-by: Ned Bass <bass6@llnl.gov>
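The hunk below only adds the freed-dnode counter; the allocation-side gate that consumes it is not shown in this commit page. The following is a minimal sketch of what such a gate could look like. The helper name dmu_object_backfill_allowed(), the os_rescan_dnode_txg field, and the threshold constant are assumptions made for illustration; only os_freed_dnodes and dmu_object_alloc() come from the patch and commit message themselves.

/*
 * Hypothetical sketch (not the actual patch) of the backfill gate described
 * in the commit message: rescan lower object numbers at most once per txg,
 * and only after enough dnodes have been freed to make a rescan worthwhile.
 * Assumes the usual ZFS internal headers (sys/dmu_objset.h, sys/dmu_tx.h).
 */
#define	DMU_RESCAN_DNODE_THRESHOLD	4096	/* freed dnodes before rescanning */

static boolean_t
dmu_object_backfill_allowed(objset_t *os, dmu_tx_t *tx)
{
	/* At most one rescan per transaction group. */
	if (os->os_rescan_dnode_txg == tx->tx_txg)
		return (B_FALSE);

	/* Skip the rescan until enough dnodes have been freed. */
	if (os->os_freed_dnodes < DMU_RESCAN_DNODE_THRESHOLD)
		return (B_FALSE);

	/* Charge this rescan to the current txg and reset the counter. */
	os->os_rescan_dnode_txg = tx->tx_txg;
	os->os_freed_dnodes = 0;
	return (B_TRUE);
}

When this returns B_FALSE, dmu_object_alloc() would simply keep allocating forward from its current cursor instead of restarting its scan at low object numbers.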
@@ -688,6 +688,7 @@ dnode_sync(dnode_t *dn, dmu_tx_t *tx)
 	}

 	if (freeing_dnode) {
+		dn->dn_objset->os_freed_dnodes++;
 		dnode_sync_free(dn, tx);
 		return;
 	}

(An inline comment from adilger on the added line has been minimized.)
1 comment on commit 050b0e6
Patch looks very simple and straightforward, and I have only minor suggestions for improvement.
I know @bzzz77 was also working on this issue, and has a more complex solution that tracks specific freed dnodes. Alex's might be faster in a heavy create/unlink workload since it can find the empty dnode slots more easily. It might be worthwhile to compare the two, but other things being equal this patch has the benefit of being less complex and easier to understand, and less likely to have any bugs.
This should probably default to at least:

DNODES_PER_BLOCK << (DMU_META_DNODE(os)->dn_indblkshift - SPA_BLKPTRSHIFT)

but more likely 4-8x that amount. The reason is that there is no point going back to rescan blocks if there isn't at least 3/4 of one block of free dnodes; otherwise the fill count heuristic will just skip the partially-free blocks anyway. Also, if the freed dnodes are not all in a single block, the fill count heuristic will skip those partially-empty blocks too, so we may as well not rescan until there are likely to be completely free blocks to fill. Since the rescan always starts from the beginning of the metadnode (which may be referencing billions of objects on a large Lustre filesystem), it makes sense to avoid the rescanning as much as possible.
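As a rough worked check of that formula, here is a small sketch using assumed typical values rather than the real ZFS constants: 512-byte dnodes packed 32 per 16K data block, and a 16K indirect block holding 128 block pointers, which is what the commit message's "32 * 128 = 4096" figure implies.

/*
 * Back-of-the-envelope check of the suggested floor, with assumed values:
 *   DNODES_PER_BLOCK = 32  (16K data block / 512-byte dnode)
 *   dn_indblkshift   = 14  (16K indirect block)
 *   SPA_BLKPTRSHIFT  =  7  (128-byte block pointer)
 */
#include <stdio.h>

int
main(void)
{
	int dnodes_per_block = 32;
	int indblkshift = 14;
	int spa_blkptrshift = 7;

	/* One L1 indirect block's worth of dnodes: 32 << 7 = 4096. */
	int min_free = dnodes_per_block << (indblkshift - spa_blkptrshift);

	printf("suggested minimum: %d dnodes\n", min_free);
	/* The 4-8x multiplier suggested above: 16384 to 32768. */
	printf("4-8x that: %d to %d dnodes\n", 4 * min_free, 8 * min_free);
	return (0);
}

With those assumed values the formula's floor works out to 4096 dnodes, the same per-L1 figure the commit message cites, and the suggested 4-8x multiplier would put the default in the 16,384 to 32,768 range.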