Development Tasks

Reasonably developed ideas for Duperemove. If you're interested in taking one of these on, let me know.

If a task has been checkmarked, it is at least feature complete and in the master branch.

Small / Medium Tasks

  • Test/benchmark the following possible enhancements for csum_whole_file() (a readahead sketch appears after this list)

    • posix_fadvise with POSIX_FADV_SEQUENTIAL
    • readahead(2)
    • mmap (with madvise)
    • ioctl BLKRASET and BLKRAGET: set readahead for the whole fs
    • io_uring
  • csum_whole_file() should count pre-deduped shared bytes for each file while fiemapping for extent flags. Then we won't have to do it in the dedupe stage, reducing the total number of fiemap calls we make (see the fiemap sketch after this list).

  • Move our file backend to an sqlite db (a schema sketch appears after this list).

    • This is particularly important because the next version of duperemove will depend more heavily on the file backend and I would like to get the sqlite backend in there before we push that out.
    • Right now our backend is essentially a streaming write of memory state to disk. This works great for unit testing the various dedupe stages, but as soon as we want to do anything fancy with our data, things will need major updating. Before that happens, I'd like to move to a backend that can handle a few essentials such as caching, random access, and the flexibility to grow portions of a dataset.
  • Add a mode (or new binary called duperemove-fdupes?) where we take a file list from fdupes and do whole-file dedupe on said list (see the parsing sketch after this list).

    • For this we skip most of the file scan and all of the csum stages
    • It does not interact with most options (i.e., we're not writing hashfiles from this)
    • Might require small enhancements to the dedupe core but otherwise it can handle this pretty easily.
    • Related Discussion
  • csum_whole_file() still reads and checksums holes and unwritten extents (even though we detect and mark them now). If we calculate and store (in memory) the checksum of a zeroed block, we can skip the read and copy our known value directly into the block digest member (see the sketch after this list).

  • Import sha256 module from polarssl

    • it looks like their sha256.c and sha256.h files are pretty nicely self contained
    • doing this would allow us to drop the dependency on libgcrypt, which in turn makes it easier for us to choose a new default hash.
  • Multi-threaded dedupe stage

    • dedupe_extent_list() is a great candidate for putting on a worker thread. A quick glance shows that only filerec->fd would be written by multiple threads, and this can be handled easily by letting each thread store the fd locally (see the sketch after this list).
  • (not needed now that we have the bloom filter code) Improve memory usage by freeing non-duplicated hashes after the csum step

    • This is a tiny bit tricky because the extent search assumes file block hashes are logically contiguous - it needs to know that if the next block isn't contiguous, the search can end.
  • Use capabilities of SQL for efficiency

    • dbfile_load_hashes_bloom: query for digests that occur more than once and iterate over those (e.g. select digest from hashes group by digest having count(*) > 1;). See the sketch after this list.
  • Improve duperemove --help output

  • Implement incremental deduplication. In this mode, we would run the deduplication phase every n new files. This would reduce cpu and memory consumption for the deduplication phase, especially on very large datasets.

  • Allow the user to process a single file/directory while keeping previously scanned files in the hashfile. This would help people with very large datasets, where searching for new files might take a lot of time, if they know which files have changed.

    • For example: a directory with data split by year and month. The user knows that new files are located in the most recent months, yet wants to deduplicate those files against the previous periods.
  • Improve partial lookup. Currently, the code loops over each filerec. Maybe rebase it on top of sqlite, or exclude unscanned filerecs?

  • Always compute a hash for the entire file. Block deduplication is expensive, but extent deduplication misses some obvious opportunities: a per-file global hash would be a cheap way to catch the easy cases.

  • Reimplement the run_filerec_stats() function with extent support

  • Update the memory and CPU usage samples in the documentation, as well as the hardware recommendations, etc.

  • Check if we need to open files in read-write mode. If not, then make readonly the default and remove the option.
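
For the csum_whole_file() readahead benchmarks above, here is a minimal sketch of the first two candidates (posix_fadvise with POSIX_FADV_SEQUENTIAL plus an explicit readahead(2) call). The function name, chunk size, and error handling are illustrative only; the point is where the hints would sit relative to the read loop.

```c
/* Illustrative only: possible read hints to benchmark ahead of the
 * csum_whole_file() read loop. */
#define _GNU_SOURCE		/* for readahead(2) */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

static void hint_sequential_read(int fd, off_t len)
{
    int ret;

    /* Tell the kernel we intend to read the whole file sequentially so
     * it can use a larger readahead window. */
    ret = posix_fadvise(fd, 0, len, POSIX_FADV_SEQUENTIAL);
    if (ret)
        fprintf(stderr, "posix_fadvise: %s\n", strerror(ret));

    /* Optionally pre-populate the first chunk of the page cache before
     * checksumming starts; 1 MiB here is an arbitrary choice. */
    if (readahead(fd, 0, 1024 * 1024) < 0)
        perror("readahead");
}
```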
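
A sketch of the "count pre-deduped shared bytes while fiemapping" idea: while walking a file's extents for flags, also sum the lengths of extents marked FIEMAP_EXTENT_SHARED. This is not duperemove's fiemap code; the batch size and error handling are just for illustration.

```c
/* Sketch: total up bytes in shared extents while fiemapping for flags,
 * so the dedupe stage does not have to fiemap the file again. */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

static int count_shared_bytes(int fd, uint64_t file_len, uint64_t *shared)
{
    unsigned int count = 512;	/* extents fetched per ioctl, arbitrary */
    struct fiemap *fm;
    uint64_t start = 0;
    int last = 0;

    fm = calloc(1, sizeof(*fm) + count * sizeof(struct fiemap_extent));
    if (!fm)
        return -1;

    *shared = 0;
    while (!last && start < file_len) {
        memset(fm, 0, sizeof(*fm));
        fm->fm_start = start;
        fm->fm_length = file_len - start;
        fm->fm_extent_count = count;

        if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
            free(fm);
            return -1;
        }
        if (fm->fm_mapped_extents == 0)
            break;

        for (unsigned int i = 0; i < fm->fm_mapped_extents; i++) {
            struct fiemap_extent *ext = &fm->fm_extents[i];

            /* The extent flags we already inspect also tell us whether
             * an extent is shared (i.e. pre-deduped). */
            if (ext->fe_flags & FIEMAP_EXTENT_SHARED)
                *shared += ext->fe_length;
            if (ext->fe_flags & FIEMAP_EXTENT_LAST)
                last = 1;
            start = ext->fe_logical + ext->fe_length;
        }
    }

    free(fm);
    return 0;
}
```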
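
For the sqlite backend task, a hypothetical sketch of what creating the database could look like. The table and column names here are guesses for illustration only, not an actual or proposed schema.

```c
/* Hypothetical schema sketch for an sqlite-backed hashfile. */
#include <sqlite3.h>
#include <stdio.h>

static int create_hashfile(const char *path)
{
    sqlite3 *db;
    char *err = NULL;
    const char *schema =
        "CREATE TABLE IF NOT EXISTS files("
        "  ino INTEGER, subvol INTEGER, filename TEXT, mtime INTEGER);"
        "CREATE TABLE IF NOT EXISTS hashes("
        "  digest BLOB, ino INTEGER, loff INTEGER, flags INTEGER);"
        "CREATE INDEX IF NOT EXISTS idx_digest ON hashes(digest);";

    if (sqlite3_open(path, &db) != SQLITE_OK)
        return -1;

    if (sqlite3_exec(db, schema, NULL, NULL, &err) != SQLITE_OK) {
        fprintf(stderr, "sqlite error: %s\n", err);
        sqlite3_free(err);
        sqlite3_close(db);
        return -1;
    }

    sqlite3_close(db);
    return 0;
}
```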
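
For the fdupes-list mode, a sketch of the input handling, assuming fdupes' default output format (one filename per line, duplicate groups separated by blank lines). dedupe_group() is a hypothetical stand-in for whole-file dedupe via the dedupe core.

```c
/* Sketch: read an fdupes-style list from stdin and hand each group of
 * identical files to a (hypothetical) whole-file dedupe helper. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_GROUP 1024

static void dedupe_group(char **files, int count)
{
    /* Placeholder: whole-file dedupe of files[1..count-1] against files[0]. */
    printf("group of %d files, source: %s\n", count, files[0]);
}

int main(void)
{
    char *files[MAX_GROUP];
    char line[4096];
    int count = 0;

    while (fgets(line, sizeof(line), stdin)) {
        line[strcspn(line, "\n")] = '\0';
        if (line[0] == '\0') {		/* blank line ends a group */
            if (count > 1)
                dedupe_group(files, count);
            for (int i = 0; i < count; i++)
                free(files[i]);
            count = 0;
            continue;
        }
        if (count < MAX_GROUP)
            files[count++] = strdup(line);
    }
    if (count > 1)
        dedupe_group(files, count);
    for (int i = 0; i < count; i++)
        free(files[i]);
    return 0;
}
```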
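
For the hole/unwritten-extent item, a sketch of caching the digest of an all-zero block so holes never have to be read or hashed. It uses libgcrypt's one-shot gcry_md_hash_buffer() since that is the current hashing dependency; the block size, digest length, and the csum_block() wrapper are assumptions for illustration.

```c
/* Sketch: compute the digest of an all-zero block once and reuse it for
 * holes and unwritten extents instead of reading and hashing them. */
#include <gcrypt.h>
#include <stdbool.h>
#include <string.h>

#define BLOCK_SIZE	(128 * 1024)	/* illustrative block size */
#define DIGEST_LEN	32		/* SHA-256 */

static unsigned char zero_digest[DIGEST_LEN];
static bool zero_digest_ready;

static void csum_block(const unsigned char *buf, bool is_hole,
                       unsigned char *digest)
{
    if (is_hole) {
        if (!zero_digest_ready) {
            static const unsigned char zeroes[BLOCK_SIZE];

            gcry_md_hash_buffer(GCRY_MD_SHA256, zero_digest,
                                zeroes, BLOCK_SIZE);
            zero_digest_ready = true;
        }
        /* No read needed: copy the precomputed digest into place. */
        memcpy(digest, zero_digest, DIGEST_LEN);
        return;
    }
    gcry_md_hash_buffer(GCRY_MD_SHA256, digest, buf, BLOCK_SIZE);
}
```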
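
For the multi-threaded dedupe stage, a sketch of the "let each thread store the fd locally" idea: the worker opens its own descriptor instead of writing to a shared filerec->fd, so no shared state is mutated. The dedupe_work structure and worker body are placeholders, not duperemove code.

```c
/* Sketch: per-thread file descriptor for a threaded dedupe stage. */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

struct dedupe_work {
    const char *filename;	/* stand-in for a filerec / extent list */
};

static void *dedupe_worker(void *arg)
{
    struct dedupe_work *work = arg;
    int fd;

    /* Local descriptor: nothing shared is written, so no locking is
     * needed for the fd itself. */
    fd = open(work->filename, O_RDONLY);
    if (fd < 0) {
        perror("open");
        return NULL;
    }

    /* ... dedupe_extent_list()-style work against fd ... */

    close(fd);
    return NULL;
}

static int queue_dedupe(struct dedupe_work *work, pthread_t *thread)
{
    /* One thread per extent list; a real version would use a pool. */
    return pthread_create(thread, NULL, dedupe_worker, work);
}
```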
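
For the "use SQL capabilities" item, a sketch of driving the suggested GROUP BY / HAVING query through the sqlite3 API. It assumes a hashes(digest, ...) table along the lines of the hypothetical schema sketched earlier.

```c
/* Sketch: only iterate over digests that occur more than once. */
#include <sqlite3.h>

static int walk_duplicate_digests(sqlite3 *db)
{
    sqlite3_stmt *stmt;
    const char *sql =
        "SELECT digest FROM hashes GROUP BY digest HAVING count(*) > 1;";
    int ret;

    if (sqlite3_prepare_v2(db, sql, -1, &stmt, NULL) != SQLITE_OK)
        return -1;

    while ((ret = sqlite3_step(stmt)) == SQLITE_ROW) {
        const void *digest = sqlite3_column_blob(stmt, 0);
        int len = sqlite3_column_bytes(stmt, 0);

        /* ... feed this digest into duplicate processing ... */
        (void)digest;
        (void)len;
    }

    sqlite3_finalize(stmt);
    return ret == SQLITE_DONE ? 0 : -1;
}
```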

Large Tasks

  • Lookup extent owners during checksum phase

    • This is waiting on implementation of a new ioctl to reverse map extents to inodes
    • We can use this later for a more accurate duplicate extent search.
    • Can also be used to skip reading extents that have already been read. For example, if files A and B share an extent, we can just copy the hashes from the first read instead of checksumming it twice.
  • (done except now we use mtime so as to be portable) Store results of our search to speed up subsequent runs (see the mtime sketch after this list).

    • When writing hashes, store latest btrfs transaction id in the hash file (import find-new from btrfsprogs for this)
    • When reading hashes and we have a transaction id, do a scan of btrfs objects to see which inodes have changed.
      • Changed inodes get re-checksummed
      • Deleted inodes get their hashes removed
      • New inodes get checksummed
    • To tie this all together, we need to add an option (maybe simply called --hash-file?) which acts as a create-or-update mode for hash file usage. Now the user can run duperemove, with one command, on a regular basis and we'll automatically keep the hash database updated.
  • Multi-threaded extent search.

    • This needs locking and atomic updates to some fields, so it's not quite as straightforward as threading the hash or dedupe stage. A start would be to have find_all_dups queue up appropriate dup_blocks_lists to worker threads. The threads have to properly lock the filerec compared trees and results trees. b_seen also needs to be read/written atomically.
  • (feature-complete now, thanks to Jack Kenrick) Largely decrease memory footprint by using a bloom filter (thanks to darkling on #btrfs for this idea). Currently we store all hashes in memory in an index (hash tree) and search that index to find duplicates. With this algorithm, our working set in memory is all hashes. Instead, we can use a bloom filter to record when a hash has been seen, allowing us to reduce the working set to only those hashes with dupes (see the sketch after this list). From http://en.wikipedia.org/wiki/Bloom_filter:

    A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set.

    • The general idea is that the bloom filter can tell us (with a certain amount of confidence) whether a hash has been seen already. Memory overhead for a bloom filter is tiny, far less than what we would save by being able to discard non-duped blocks:

      fewer than 10 bits per element are required for a 1% false positive probability

    • However, we can't get the exact hash values back from the bloom filter, so we would have to do multiple read/csum passes of the data:

    PASS 1: Read+csum file blocks, for each checksum:

    • add the csum to the bloom filter
    • if the csum already exists in the bloom filter, add the block to our index of duplicate blocks
    • if the csum already exists in the index of duplicated blocks, dedupe it against the block we found there

    PASS 2: Read+hash file blocks, for each hash:

    • if the hash exists in the bloom filter and is not in the index, we dedupe it against what is in the index.

    The 2nd pass is basically re-finding that 1st block of any set of dupes which was put in the bloom filter, but never stored in the index. In theory the user could skip that pass if they didn't care about missing blocks with only one duplicate.

    There are two potential drawbacks:

    1. The 2nd file scan is expensive. We can work around this by streaming our hashes to a temporary file during the 1st pass, and streaming them back from the file during our 2nd pass.

    2. We don't get the chance to make extents for dedupe, but instead pass individual blocks, which might result in less efficient space layout. This too can be worked around, by pushing the dedupe portion of our algorithm to the 2nd pass and updating the extent search code to handle discontiguous regions (which happens to be a TODO item above).
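
A sketch of the pass-1 bookkeeping described above: a tiny bloom filter plus the "seen before" test that decides whether a block's digest goes into the duplicate index. The filter size, the number of probes, and the assumption of a 32-byte digest are all illustrative; insert_into_dupe_index() is a hypothetical placeholder.

```c
/* Sketch: bloom filter test-and-add for the pass-1 logic above. */
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define BLOOM_BITS	(1UL << 24)	/* 2 MiB of bits, arbitrary */
#define BLOOM_HASHES	7

static unsigned char bloom[BLOOM_BITS / 8];

/* Derive k bit positions by slicing the block digest, which is fine
 * because it is already a cryptographic hash. Assumes a digest of at
 * least 32 bytes. */
static uint64_t bloom_pos(const unsigned char *digest, int i)
{
    uint64_t v;

    memcpy(&v, digest + i * 4, sizeof(v));
    return v % BLOOM_BITS;
}

/* Returns true if the digest was (probably) seen before, and records it
 * either way. */
static bool bloom_test_and_add(const unsigned char *digest)
{
    bool present = true;

    for (int i = 0; i < BLOOM_HASHES; i++) {
        uint64_t pos = bloom_pos(digest, i);

        if (!(bloom[pos / 8] & (1 << (pos % 8))))
            present = false;
        bloom[pos / 8] |= 1 << (pos % 8);
    }
    return present;
}

/* Pass 1, per checksummed block:
 *
 *	if (bloom_test_and_add(digest))
 *		insert_into_dupe_index(digest, block);	// hypothetical
 *	// else: probably unique, keep only the bloom bits
 *
 * Pass 2 re-reads and re-checksums; any digest that hits the bloom
 * filter but is missing from the index is the "first" block of a dupe
 * set found during pass 1.
 */
```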
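
For the "store results to speed up subsequent runs" task above (now mtime-based), a sketch of the rescan decision: compare the mtime recorded alongside a file's hashes against its current mtime. The stored_mtime parameter stands in for whatever the hashfile actually records.

```c
/* Sketch: decide whether a file needs re-checksumming on a later run. */
#include <stdbool.h>
#include <stdint.h>
#include <sys/stat.h>

static bool needs_rescan(const char *path, int64_t stored_mtime)
{
    struct stat st;

    if (stat(path, &st) < 0)
        return true;	/* can't tell, be safe and rescan */

    /* Re-checksum if the file changed since the hashfile was written. */
    return (int64_t)st.st_mtime != stored_mtime;
}
```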

Kernel Work

We have a kernel component which needs maintenance; this list describes some of those tasks. Actual discussion / development needs to happen on the btrfs mailing list, but it's nice to have the dedupe tasks all in one place.

  • Relax the permission check in the extent-same ioctl. This seemed like a good idea when I implemented the ioctl - we were worried about the implications of one user being able to dedupe another user's files. The check is too coarse though, and as a result we have many complaints about not being able to dedupe readonly files as a non-root user. The most obvious fix here would be to simply delete the check. Users could then use -A (and we need to update the documentation) to force read-only opens. Once the fix is in the wild long enough, we can revert to always opening readonly.

  • Allow dedupe of extents with non-aligned length. Basically we want to dedupe the tail of files and extent-same was rejecting unaligned len. Fix is in the btrfs tree.

  • Fix readpage / btrfs-extent-same deadlock. We're taking page lock and extent lock in opposite orders, btrfs-extent-same code needs to be fixed.

  • (done except better because no flag is needed) Optional mtime updates (need to discuss this on list). Clone is updating mtime, which blows up backup programs for some users. We want a 'dont-set-mtime' flag that btrfs-extent-same can pass through to clone when userspace requests it.

  • (done but this exposes some extremely poor performance in btrfs fiemap, disabled for now) Allow for dedupe of same file. Clone supports this now so we want to remove our checks and do sanity testing of various scenarios.