Qu-Wenruo/Btrf…
Commits on Jul 25, 2016
-
btrfs: dedupe: Introduce new reconfigure ioctl
Introduce a new reconfigure ioctl and a new FORCE flag for the in-band dedupe ioctls.

The dedupe enable and reconfigure ioctls are now stateful:

 --------------------------------------------
 | Current state |   Ioctl    | Next state  |
 --------------------------------------------
 | Disabled      |  enable    | Enabled     |
 | Enabled       |  enable    | Not allowed |
 | Enabled       |  reconf    | Enabled     |
 | Enabled       |  disable   | Disabled    |
 | Disabled      |  disable   | Disabled    |
 | Disabled      |  reconf    | Not allowed |
 --------------------------------------------
 (Disable is always stateless.)

For those who prefer stateless ioctls (myself, for example), a new FORCE flag is introduced. In FORCE mode, enable/disable is completely stateless:

 --------------------------------------------
 | Current state |   Ioctl    | Next state  |
 --------------------------------------------
 | Disabled      |  enable    | Enabled     |
 | Enabled       |  enable    | Enabled     |
 | Enabled       |  disable   | Disabled    |
 | Disabled      |  disable   | Disabled    |
 --------------------------------------------

Also, the reconfigure ioctl only modifies the specified fields, unlike enable, which fills unspecified fields with their default values.

For example:
 # btrfs dedupe enable --block-size 64k /mnt
 # btrfs dedupe reconfigure --limit-hash 1m /mnt
leads to:
 dedupe blocksize: 64K
 dedupe hash limit nr: 1m

While for enable:
 # btrfs dedupe enable --force --block-size 64k /mnt
 # btrfs dedupe enable --force --limit-hash 1m /mnt
the block size is reset to its default value:
 dedupe blocksize: 128K    << reset
 dedupe hash limit nr: 1m

Suggested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
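The two state tables in this commit message can be read as a single transition function. The following is a minimal sketch under stated assumptions, not the kernel implementation: the enum and function names are hypothetical, and since the FORCE table does not list reconfigure, the sketch assumes FORCE leaves the reconfigure rule unchanged.

```c
#include <stdbool.h>

/* Hypothetical names for illustration; not the kernel's definitions. */
enum dedupe_state { DEDUPE_DISABLED, DEDUPE_ENABLED };
enum dedupe_op { OP_ENABLE, OP_RECONF, OP_DISABLE };

/*
 * Compute the next state for one ioctl call.
 * Returns true and stores the next state in *next when the transition is
 * allowed; returns false ("Not allowed") and leaves *next untouched otherwise.
 */
static bool dedupe_next_state(enum dedupe_state cur, enum dedupe_op op,
                              bool force, enum dedupe_state *next)
{
    switch (op) {
    case OP_ENABLE:
        /* Without FORCE, enabling an already enabled fs is rejected. */
        if (cur == DEDUPE_ENABLED && !force)
            return false;
        *next = DEDUPE_ENABLED;
        return true;
    case OP_RECONF:
        /* Assumption: reconfigure needs dedupe enabled, with or without FORCE. */
        if (cur != DEDUPE_ENABLED)
            return false;
        *next = DEDUPE_ENABLED;
        return true;
    case OP_DISABLE:
        /* Disable is always stateless. */
        *next = DEDUPE_DISABLED;
        return true;
    }
    return false;
}
```

Calling `dedupe_next_state(DEDUPE_ENABLED, OP_ENABLE, false, &next)` returns false, while the same call with `force` set to true succeeds, which is the only difference between the two tables.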
-
btrfs: dedupe: fix false ENOSPC
When testing in-band dedupe, we sometimes got an ENOSPC error even though the fs still had plenty of free space. After some debugging, we found that btrfs_delalloc_reserve_metadata() sometimes tries to reserve a large amount of metadata space, even for a very small data range.

In btrfs_delalloc_reserve_metadata(), the number of metadata bytes to reserve is calculated from the difference between outstanding_extents and reserved_extents. The case below shows how ENOSPC occurs:

 1, Buffered write 128MB of data in units of 1MB, so finally the inode's outstanding_extents is 1 and reserved_extents is 128. Note that it is btrfs_merge_extent_hook() that merges these 1MB units into one big outstanding extent, but it does not change reserved_extents.

 2, When writing dirty pages for in-band dedupe, cow_file_range() splits the big extent above in units of 16KB (assuming the in-band dedupe blocksize is 16KB). When the first split operation finishes, we have 2 outstanding extents and 128 reserved extents. If the ordered extent just generated happens to be dispatched and to complete right then, btrfs_delalloc_release_metadata() (see btrfs_finish_ordered_io()) is called to release metadata, after which we have 1 outstanding extent and 1 reserved extent (also see the logic in drop_outstanding_extent()). Later cow_file_range() continues to handle the remaining data range [16KB, 128MB), and if no other ordered extent was dispatched to run, there will be 8191 outstanding extents and 1 reserved extent.

 3, If another buffered write to this file now enters, btrfs_delalloc_reserve_metadata() will try to reserve metadata for at least 8191 outstanding extents. For a 64K node size that is 8191*65536*16, about 8GB of metadata, so it obviously returns ENOSPC.

In fact, when a file goes through in-band dedupe, its max extent size is no longer BTRFS_MAX_EXTENT_SIZE (128MB); it is limited by the in-band dedupe blocksize. The current metadata reservation method in btrfs is therefore not appropriate. Here we introduce btrfs_max_extent_size(), which returns the max extent size for files that go through in-band dedupe. Using this value for metadata reservation and for the extent_io merge, split, and clear operations ensures that the difference between outstanding_extents and reserved_extents does not grow so large.

Currently only buffered writes go through in-band dedupe if in-band dedupe is enabled.

Reported-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
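The arithmetic behind the 8GB figure in this commit message can be checked with a small sketch. The helper names below are hypothetical, and the factor 16 is taken directly from the message's own expression (8191*65536*16) rather than from any kernel header.

```c
#include <stdint.h>

/* Multiplier taken from the commit message's arithmetic (8191 * 65536 * 16). */
#define RESERVE_FACTOR 16ULL

/* Number of extents needed to cover len bytes, rounding up. */
static uint64_t count_extents(uint64_t len, uint64_t max_extent_size)
{
    return (len + max_extent_size - 1) / max_extent_size;
}

/* Metadata bytes requested for nr_extents extents that lack a reservation. */
static uint64_t metadata_bytes(uint64_t nr_extents, uint64_t nodesize)
{
    return nr_extents * nodesize * RESERVE_FACTOR;
}
```

With a 16KB dedupe blocksize, `count_extents(128 MiB, 16 KiB)` gives 8192 pieces; one of them is already covered by the remaining reservation, and `metadata_bytes(8191, 65536)` yields 8588886016 bytes, just under 8GiB.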
-
btrfs: improve inode's outstanding_extents computation
This issue was revealed by modifying BTRFS_MAX_EXTENT_SIZE (128MB) to 64KB. With that change, the fsstress test often gets these warnings from btrfs_destroy_inode():
 WARN_ON(BTRFS_I(inode)->outstanding_extents);
 WARN_ON(BTRFS_I(inode)->reserved_extents);

The simple test program below reproduces this issue reliably. Note: you need to modify BTRFS_MAX_EXTENT_SIZE to 64KB to run the test, otherwise there won't be such a WARNING.

 #include <string.h>
 #include <unistd.h>
 #include <sys/types.h>
 #include <sys/stat.h>
 #include <fcntl.h>

 int main(void)
 {
         int fd;
         char buf[68 * 1024];

         memset(buf, 0, 68 * 1024);
         /* O_CREAT requires a mode argument */
         fd = open("testfile", O_CREAT | O_EXCL | O_RDWR, 0644);
         if (fd < 0)
                 return 1;
         /* 68KB buffered write at offset 64KB: range [64KB, 132KB) */
         pwrite(fd, buf, 68 * 1024, 64 * 1024);
         close(fd);
         return 0;
 }

When BTRFS_MAX_EXTENT_SIZE is 64KB, and the buffered data range is:

 64KB                                            128KB           132KB
 |-----------------------------------------------|---------------|
                          64 + 4KB

1) For the above data range, btrfs_delalloc_reserve_metadata() will reserve metadata and set BTRFS_I(inode)->outstanding_extents to 2.
 (68KB + 64KB - 1) / 64KB == 2

 Outstanding_extents: 2

2) Then btrfs_dirty_pages() will be called to dirty pages and set the EXTENT_DELALLOC flag. In this case, btrfs_set_bit_hook() will be called twice. The 1st set_bit_hook() call will set the DELALLOC flag for the first 64KB.

 64KB                                            128KB
 |-----------------------------------------------|
                  64KB DELALLOC
 Outstanding_extents: 2

 set_bit_hook() uses the FIRST_DELALLOC flag to avoid increasing the outstanding_extents counter again. So the 1st set_bit_hook() call won't modify outstanding_extents; it's still 2. Then the FIRST_DELALLOC flag is *CLEARED*.

3) 2nd btrfs_set_bit_hook() call. Because FIRST_DELALLOC has been cleared by the previous set_bit_hook(), btrfs_set_bit_hook() will increase BTRFS_I(inode)->outstanding_extents by one, so now BTRFS_I(inode)->outstanding_extents is 3.

 64KB                                            128KB           132KB
 |-----------------------------------------------|---------------|
                  64KB DELALLOC                     4KB DELALLOC
 Outstanding_extents: 3

But the correct outstanding_extents number should be 2, not 3. The 2nd btrfs_set_bit_hook() call screwed this up, and leads to the WARN_ON().

Normally, we could solve it by only increasing outstanding_extents in set_bit_hook(). But the problem is that for delalloc_reserve/release_metadata(), we only have a 'length' parameter, and calculate an inaccurate outstanding_extents. If we only rely on set_bit_hook(), release_metadata() will screw things up as it will decrease by an inaccurate number.

So the fix we use is:
1) Increase *INACCURATE* outstanding_extents at delalloc_reserve_meta
   Just as a place holder.
2) Increase *accurate* outstanding_extents at set_bit_hook()
   This is the real increaser.
3) Decrease *INACCURATE* outstanding_extents before returning
   This brings outstanding_extents to the correct value.

For a 128MB BTRFS_MAX_EXTENT_SIZE, due to a limitation of __btrfs_buffered_write(), each iteration will only handle about 2MB of data. So btrfs_dirty_pages() won't need to handle cases that cross 2 extents.

Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Josef Bacik <jbacik@fb.com>
Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
-
btrfs: relocation: Enhance error handling to avoid BUG_ON
Since the introduction of the btrfs dedupe tree, it's possible for balance to race with dedupe disabling. When this happens, dedupe_enabled will make btrfs_get_fs_root() return ERR_PTR(-ENOENT). But due to a bug in the error handling branch, when this happens backref_cache->nr_nodes is increased, yet the node is neither added to backref_cache nor is nr_nodes decreased, causing a BUG_ON() in backref_cache_cleanup():

[ 2611.668810] ------------[ cut here ]------------
[ 2611.669946] kernel BUG at /home/sat/ktest/linux/fs/btrfs/relocation.c:243!
[ 2611.670572] invalid opcode: 0000 [#1] SMP
[ 2611.686797] Call Trace:
[ 2611.687034] [<ffffffffa01f71d3>] btrfs_relocate_block_group+0x1b3/0x290 [btrfs]
[ 2611.687706] [<ffffffffa01cc177>] btrfs_relocate_chunk.isra.40+0x47/0xd0 [btrfs]
[ 2611.688385] [<ffffffffa01cdb12>] btrfs_balance+0xb22/0x11e0 [btrfs]
[ 2611.688966] [<ffffffffa01d9611>] btrfs_ioctl_balance+0x391/0x3a0 [btrfs]
[ 2611.689587] [<ffffffffa01ddaf0>] btrfs_ioctl+0x1650/0x2290 [btrfs]
[ 2611.690145] [<ffffffff81171cda>] ? lru_cache_add+0x3a/0x80
[ 2611.690647] [<ffffffff81171e4c>] ? lru_cache_add_active_or_unevictable+0x4c/0xc0
[ 2611.691310] [<ffffffff81193f04>] ? handle_mm_fault+0xcd4/0x17f0
[ 2611.691842] [<ffffffff811da423>] ? cp_new_stat+0x153/0x180
[ 2611.692342] [<ffffffff8119913d>] ? __vma_link_rb+0xfd/0x110
[ 2611.692842] [<ffffffff81199209>] ? vma_link+0xb9/0xc0
[ 2611.693303] [<ffffffff811e7e81>] do_vfs_ioctl+0xa1/0x5a0
[ 2611.693781] [<ffffffff8104e024>] ? __do_page_fault+0x1b4/0x400
[ 2611.694310] [<ffffffff811e83c1>] SyS_ioctl+0x41/0x70
[ 2611.694758] [<ffffffff816dfc6e>] entry_SYSCALL_64_fastpath+0x12/0x71
[ 2611.695331] Code: ff 48 8b 45 bf 49 83 af a8 05 00 00 01 49 89 87 a0 05 00 00 e9 2e fd ff ff b8 f4 ff ff ff e9 e4 fb ff ff 0f 0b 0f 0b 0f 0b 0f 0b <0f> 0b 0f 0b 41 89 c6 e9 b8 fb ff ff e8 9e a6 e8 e0 4c 89 e7 44
[ 2611.697870] RIP [<ffffffffa01f6fc1>] relocate_block_group+0x741/0x7a0 [btrfs]
[ 2611.698818] RSP <ffff88002a81fb30>

This patch calls remove_backref_node() in the error handling branch, catches the returned -ENOENT in relocate_tree_block(), and continues balancing.

Reported-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
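The class of bug described in this commit message (a counter bumped at allocation but not undone on an error path) can be illustrated with a toy userspace cache. This is only a sketch with hypothetical names and structures; it does not mirror the real backref_cache code.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins; not the kernel's structures. */
struct toy_node { struct toy_node *next; };
struct toy_cache { struct toy_node *head; int nr_nodes; };

/*
 * Allocate a node, count it, then link it into the cache.
 * lookup_fails simulates the root lookup failing with -ENOENT.
 */
static int toy_add(struct toy_cache *cache, int lookup_fails)
{
    struct toy_node *node = calloc(1, sizeof(*node));

    if (!node)
        return -ENOMEM;
    cache->nr_nodes++;              /* counted from allocation onward */

    if (lookup_fails) {
        /* The error path must undo the accounting and free the node. */
        cache->nr_nodes--;
        free(node);
        return -ENOENT;
    }
    node->next = cache->head;
    cache->head = node;
    return 0;
}

/* Free every linked node; a nonzero result is the condition a BUG_ON would catch. */
static int toy_cleanup(struct toy_cache *cache)
{
    while (cache->head) {
        struct toy_node *node = cache->head;

        cache->head = node->next;
        cache->nr_nodes--;
        free(node);
    }
    return cache->nr_nodes;
}
```

If the two lines in the `lookup_fails` branch that decrement the counter and free the node were missing, `toy_cleanup()` would return 1 after a single failed add, which corresponds to the leftover count that trips the BUG_ON in the report.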
-
btrfs: dedupe: Add ioctl for inband dedupelication
Add the ioctl interface for in-band deduplication, which includes:
1) enable
2) disable
3) status

Also add a pseudo RO compat flag to indicate that btrfs now supports in-band dedupe. There is no on-disk format change; it's just a pseudo RO compat flag.

All these ioctl interfaces are stateless, which means the caller doesn't need to care about the previous dedupe state before calling them, only about the final desired state.

For example, if the user wants to enable dedupe with a specified block size and limit, just fill the ioctl structure and call the enable ioctl. There is no need to check whether dedupe is already running. These ioctls handle cases like reconfigure or disable on their own.

Also, for invalid parameters, the enable ioctl sets the field of the first invalid parameter encountered to (-1) to inform the caller. For limit_nr/limit_mem, the value is set to (0) instead.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
-
btrfs: dedupe: Inband in-memory only de-duplication implement
Core implementation of in-band de-duplication. It reuses the async_cow_start() facility to calculate the dedupe hash, and uses the dedupe hash to do in-band de-duplication at extent level.

The workflow is as below:
1) Run delalloc range for an inode
2) Calculate hash for the delalloc range in units of dedupe_bs
3) For the hash match (duplicated) case, just increase the source extent ref and insert the file extent. For the hash mismatch case, go through the normal cow_file_range() fallback, and add the hash into dedupe_tree. Compression for the hash miss case is not supported yet.

The current implementation stores all dedupe hashes in an in-memory rb-tree, with LRU behavior to enforce the limit.

Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
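The workflow above (split the range in dedupe_bs units, hash each unit, take a reference on a hit, remember the hash on a miss, evict by LRU at the limit) can be sketched in userspace. Everything here is an assumption made for illustration: the names are hypothetical, FNV-1a stands in for SHA-256, and a small fixed array with a linear scan stands in for the rb-tree.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_LIMIT 4  /* hypothetical cap on the number of hashes kept in memory */

struct toy_entry { uint64_t hash; uint64_t last_used; unsigned refs; int used; };
struct toy_dedupe { struct toy_entry e[TOY_LIMIT]; uint64_t tick; };

/* FNV-1a, a stand-in for SHA-256 to keep the sketch short. */
static uint64_t toy_hash(const unsigned char *p, size_t len)
{
    uint64_t h = 14695981039346656037ULL;

    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;
    }
    return h;
}

/* Handle one block: returns 1 on a hash hit (ref taken), 0 on a miss (hash added). */
static int toy_write_block(struct toy_dedupe *d, const unsigned char *blk, size_t bs)
{
    uint64_t h = toy_hash(blk, bs);
    struct toy_entry *victim = &d->e[0];

    d->tick++;
    for (int i = 0; i < TOY_LIMIT; i++) {
        struct toy_entry *e = &d->e[i];

        if (e->used && e->hash == h) {
            e->refs++;                 /* duplicate: reference the existing extent */
            e->last_used = d->tick;
            return 1;
        }
        /* Remember where to store a miss: first free slot, else the LRU entry. */
        if (!e->used) {
            if (victim->used)
                victim = e;
        } else if (victim->used && e->last_used < victim->last_used) {
            victim = e;
        }
    }
    victim->used = 1;
    victim->hash = h;
    victim->refs = 1;
    victim->last_used = d->tick;
    return 0;
}

/* Walk a buffer in dedupe_bs units; returns how many blocks were deduplicated. */
static size_t toy_run_delalloc(struct toy_dedupe *d, const unsigned char *buf,
                               size_t len, size_t bs)
{
    size_t hits = 0;

    for (size_t off = 0; off + bs <= len; off += bs)
        hits += (size_t)toy_write_block(d, buf + off, bs);
    return hits;
}
```

Feeding a zero-initialized `struct toy_dedupe` a buffer whose blocks are A, B, A, A reports two hits: the first A and the B are misses that populate the table, and the remaining two A blocks match the stored hash.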
-
btrfs: ordered-extent: Add support for dedupe
Add ordered-extent support for dedupe. Note, current ordered-extent support only supports non-compressed source extent. Support for compressed source extent will be added later. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com> Reviewed-by: Josef Bacik <jbacik@fb.com>
-
btrfs: dedupe: Implement btrfs_dedupe_calc_hash interface
Unlike the dedupe method, which can be in-memory or on-disk, only the SHA256 hash method is supported so far, so implement the btrfs_dedupe_calc_hash() interface using SHA256. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com> Reviewed-by: Josef Bacik <jbacik@fb.com>
-
btrfs: dedupe: Introduce function to search for an existing hash
Introduce static function inmem_search() to handle the job for the in-memory hash tree. The trick is that we must ensure the delayed ref head is not being run at the time we search for the hash. With inmem_search(), we can implement the btrfs_dedupe_search() interface. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com> Reviewed-by: Josef Bacik <jbacik@fb.com>
-
btrfs: delayed-ref: Add support for increasing data ref under spinlock
For in-band dedupe, btrfs needs to increase data ref with delayed_ref locked, so add a new function btrfs_add_delayed_data_ref_lock() to increase extent ref with delayed_refs already locked. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: Josef Bacik <jbacik@fb.com>
-
btrfs: dedupe: Introduce function to remove hash from in-memory tree
Introduce static function inmem_del() to remove a hash from the in-memory dedupe tree, and implement the btrfs_dedupe_del() and btrfs_dedupe_disable() interfaces. Also, for btrfs_dedupe_disable(), add new functions to wait for existing writers and block incoming writers to eliminate all possible races. Cc: Mark Fasheh <mfasheh@suse.de> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
-
btrfs: dedupe: Introduce function to add hash into in-memory tree
Introduce static function inmem_add() to add hash into in-memory tree. And now we can implement the btrfs_dedupe_add() interface. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com> Reviewed-by: Josef Bacik <jbacik@fb.com>
-
btrfs: dedupe: Introduce function to initialize dedupe info
Add generic function to initialize dedupe info. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com> Reviewed-by: Josef Bacik <jbacik@fb.com>
-
btrfs: dedupe: Introduce dedupe framework and its header
Introduce the btrfs in-band (write time) de-duplication framework and the header it needs. The new de-duplication framework is going to support 2 different dedupe methods and 1 dedupe hash. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
-
btrfs: expand btrfs_set_extent_delalloc() and its friends to support in-band dedupe and subpage size patchset
Extract btrfs_set_extent_delalloc() and extent_clear_unlock_delalloc() parameters for both the in-band dedupe and subpage sector size patchsets. This should reduce conflicts between the two patchsets and the effort to rebase them. Cc: Chandan Rajendra <chandan@linux.vnet.ibm.com> Cc: David Sterba <dsterba@suse.cz> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Commits on Jul 24, 2016
-
Add linux-next specific files for 20160724
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
sfrothwell committed Jul 24, 2016
-
dma-mapping: document the DMA attributes next to the declaration
Copy documentation abstract about each DMA attribute from Documentation/DMA-attributes.txt to the place with declaration. Suggested-by: Christoph Hellwig <hch@infradead.org> Link: http://lkml.kernel.org/r/1468399300-5399-46-git-send-email-k.kozlowski@samsung.com Signed-off-by: Krzysztof Kozlowski <k.kozlowski@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
dma-mapping: remove dma_get_attr
After switching DMA attributes to unsigned long it is easier to just compare the bits. Link: http://lkml.kernel.org/r/1468399300-5399-45-git-send-email-k.kozlowski@samsung.com Signed-off-by: Krzysztof Kozlowski <k.kozlowski@samsung.com> Acked-by: Hans-Christian Noren Egtvedt <egtvedt@samfundet.no> [avr32] Acked-by: Vineet Gupta <vgupta@synopsys.com> [arc] Acked-by: Robin Murphy <robin.murphy@arm.com> [arm64 and dma-iommu] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
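As a rough illustration of what "just compare the bits" means once the attributes are a plain unsigned long, a caller can test a flag with a bitwise AND instead of going through an accessor. The flag names and values below are hypothetical and chosen only for this sketch; they are not the kernel's definitions.

```c
#include <stdbool.h>

/* Hypothetical attribute bits for illustration; not the kernel's values. */
#define EX_ATTR_WRITE_COMBINE      (1UL << 0)
#define EX_ATTR_NO_KERNEL_MAPPING  (1UL << 1)

/* Decide whether a kernel mapping is needed by testing the bit directly. */
static bool ex_needs_kernel_mapping(unsigned long attrs)
{
    return !(attrs & EX_ATTR_NO_KERNEL_MAPPING);
}

/* Several attributes combine with a plain bitwise OR. */
static unsigned long ex_default_attrs(void)
{
    return EX_ATTR_WRITE_COMBINE | EX_ATTR_NO_KERNEL_MAPPING;
}
```

With this representation, `ex_needs_kernel_mapping(0)` is true and `ex_needs_kernel_mapping(ex_default_attrs())` is false, with no helper call involved in the check itself.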
-
remoteproc: qcom: use unsigned long for dma_attrs
Link: http://lkml.kernel.org/r/1468399300-5399-44-git-send-email-k.kozlowski@samsung.com Signed-off-by: Krzysztof Kozlowski <k.kozlowski@samsung.com> Acked-by: Bjorn Andersson <bjorn.andersson@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
xtensa: dma-mapping: use unsigned long for dma_attrs
Split out subsystem specific changes for easier reviews. This will be squashed with main commit. Link: http://lkml.kernel.org/r/1468399300-5399-43-git-send-email-k.kozlowski@samsung.com Signed-off-by: Krzysztof Kozlowski <k.kozlowski@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
unicore32: dma-mapping: use unsigned long for dma_attrs
Split out subsystem specific changes for easier reviews. This will be squashed with main commit. Link: http://lkml.kernel.org/r/1468399300-5399-42-git-send-email-k.kozlowski@samsung.com Signed-off-by: Krzysztof Kozlowski <k.kozlowski@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
tile: dma-mapping: use unsigned long for dma_attrs
Split out subsystem specific changes for easier reviews. This will be squashed with main commit. Link: http://lkml.kernel.org/r/1468399300-5399-41-git-send-email-k.kozlowski@samsung.com Signed-off-by: Krzysztof Kozlowski <k.kozlowski@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
sparc: dma-mapping: use unsigned long for dma_attrs
Split out subsystem specific changes for easier reviews. This will be squashed with main commit. Link: http://lkml.kernel.org/r/1468399300-5399-40-git-send-email-k.kozlowski@samsung.com Signed-off-by: Krzysztof Kozlowski <k.kozlowski@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
sh: dma-mapping: use unsigned long for dma_attrs
Split out subsystem specific changes for easier reviews. This will be squashed with main commit. Link: http://lkml.kernel.org/r/1468399300-5399-39-git-send-email-k.kozlowski@samsung.com Signed-off-by: Krzysztof Kozlowski <k.kozlowski@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
s390: dma-mapping: use unsigned long for dma_attrs
Split out subsystem specific changes for easier reviews. This will be squashed with main commit. Link: http://lkml.kernel.org/r/1468399300-5399-38-git-send-email-k.kozlowski@samsung.com Signed-off-by: Krzysztof Kozlowski <k.kozlowski@samsung.com> Acked-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> [s390] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
misc: mic: dma-mapping: use unsigned long for dma_attrs
Split out subsystem specific changes for easier reviews. This will be squashed with main commit. Link: http://lkml.kernel.org/r/1468399300-5399-37-git-send-email-k.kozlowski@samsung.com Signed-off-by: Krzysztof Kozlowski <k.kozlowski@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
parisc: dma-mapping: use unsigned long for dma_attrs
Split out subsystem specific changes for easier reviews. This will be squashed with main commit. Link: http://lkml.kernel.org/r/1468399300-5399-36-git-send-email-k.kozlowski@samsung.com Signed-off-by: Krzysztof Kozlowski <k.kozlowski@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
openrisc: dma-mapping: use unsigned long for dma_attrs
Split out subsystem specific changes for easier reviews. This will be squashed with main commit. Link: http://lkml.kernel.org/r/1468399300-5399-35-git-send-email-k.kozlowski@samsung.com Signed-off-by: Krzysztof Kozlowski <k.kozlowski@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
nios2: dma-mapping: use unsigned long for dma_attrs
Split out subsystem specific changes for easier reviews. This will be squashed with main commit. Link: http://lkml.kernel.org/r/1468399300-5399-34-git-send-email-k.kozlowski@samsung.com Signed-off-by: Krzysztof Kozlowski <k.kozlowski@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
mn10300: dma-mapping: use unsigned long for dma_attrs
Split out subsystem specific changes for easier reviews. This will be squashed with main commit. Link: http://lkml.kernel.org/r/1468399300-5399-33-git-send-email-k.kozlowski@samsung.com Signed-off-by: Krzysztof Kozlowski <k.kozlowski@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
mips: dma-mapping: use unsigned long for dma_attrs
Split out subsystem specific changes for easier reviews. This will be squashed with main commit. Link: http://lkml.kernel.org/r/1468399300-5399-32-git-send-email-k.kozlowski@samsung.com Signed-off-by: Krzysztof Kozlowski <k.kozlowski@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
microblaze: dma-mapping: use unsigned long for dma_attrs
Split out subsystem specific changes for easier reviews. This will be squashed with main commit. Link: http://lkml.kernel.org/r/1468399300-5399-31-git-send-email-k.kozlowski@samsung.com Signed-off-by: Krzysztof Kozlowski <k.kozlowski@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
metag: dma-mapping: use unsigned long for dma_attrs
Split out subsystem specific changes for easier reviews. This will be squashed with main commit. Link: http://lkml.kernel.org/r/1468399300-5399-30-git-send-email-k.kozlowski@samsung.com Signed-off-by: Krzysztof Kozlowski <k.kozlowski@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
m68k: dma-mapping: use unsigned long for dma_attrs
Split out subsystem specific changes for easier reviews. This will be squashed with main commit. Link: http://lkml.kernel.org/r/1468399300-5399-29-git-send-email-k.kozlowski@samsung.com Signed-off-by: Krzysztof Kozlowski <k.kozlowski@samsung.com> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> [m68k] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>