Development Tasks

Reasonably developed ideas for Duperemove. If you're interested in taking one of these on, let me know.

If a task has been checkmarked, it is at least feature complete and in the master branch.

Small / Medium Tasks

  • Test/benchmark the following possible enhancements for csum_whole_file() (a readahead sketch appears after this list)

    • posix_fadvise with POSIX_FADV_SEQUENTIAL
    • readahead(2)
    • mmap (with madvise)
    • ioctl BLKRASET and BLKRAGET: set readahead for the whole fs
    • io_uring
  • csum_whole_file() should count pre-deduped shared bytes for each file while fiemapping for extent flags. Then we won't have to do it in the dedupe stage, reducing the total number of fiemap calls we make (see the fiemap sketch after this list).

  • Move our file backend to an sqlite db (a schema sketch appears after this list).

    • This is particularly important because the next version of duperemove will depend more heavily on the file backend and I would like to get the sqlite backend in there before we push that out.
    • Right now our backend is essentially a streaming write of memory state to disk. This works great for unit testing the various dedupe stages, but as soon as we want to do anything fancy with our data, things will need major updating. Before that happens, I'd like to move to a backend that can handle a few essentials such as caching, random access, and the flexibility to grow portions of a dataset.
  • Add a mode (or new binary called duperemove-fdupes?) where we take a file list from fdupes and do whole-file dedupe on said list (see the parsing sketch after this list).

    • For this we skip most of the file scan and all of the csum stages
    • It does not interact with most options (i.e., we're not writing hashfiles from this)
    • Might require small enhancements to the dedupe core but otherwise it can handle this pretty easily.
    • Related Discussion
  • csum_whole_file() still reads and checksums holes and unwritten extents (even though we detect and mark them now). If we calculate and store (in memory) the checksum of a zeroed block, we can skip the read and copy our known value directly into the block digest member (see the sketch after this list).

  • Import sha256 module from polarssl

    • it looks like their sha256.c and sha256.h files are pretty nicely self contained
    • doing this would allow us to drop the dependency on libgcrypt, which in turn makes it easier for us to choose a new default hash.
  • Multi-threaded dedupe stage

    • dedupe_extent_list() is a great candidate for putting on a worker thread. A quick glance shows that only filerec->fd would be written by multiple threads, and this can be handled easily by letting each thread store the fd locally (see the sketch after this list).
  • (not needed now that we have the bloom filter code) Improve memory usage by freeing non-duplicated hashes after the csum step

    • This is a tiny bit tricky because the extent search assumes file block hashes are logically contiguous - it needs to know that if the next block isn't contiguous, the search can end.
  • Use capabilities of SQL for efficiency

    • dbfile_load_hashes_bloom: query for digests that occur more than once and iterate over those (e.g. select digest from hashes group by digest having count(*) > 1;). See the sketch after this list.
  • Improve duperemove --help output

  • Implement incremental deduplication. In this mode, we would run the deduplication phase every n new files. This would reduce cpu and memory consumption for the deduplication phase, especially on very large datasets.

  • Allow the user to process a single file/directory while keeping previously scanned files in the hashfile. This would help people with very large datasets, where searching for new files might take a lot of time, if they know which files have changed.

    • For example: a directory with data split by year and month. The user knows that new files are located in the most recent months, yet wants to deduplicate those files against the previous periods.
  • Improve partial lookup. Currently, the code loops over each filerec. Maybe rebase it on top of sqlite, or exclude unscanned filerecs?

  • Always compute a hash for the entire file. Block deduplication is expensive, but extent deduplication misses some obvious opportunities: a per-file global hash would be a cheap way to catch the easy cases.

  • Reimplement the run_filerec_stats() function with extent support

  • Update the memory and CPU usage samples in the documentation, as well as the hardware recommendations, etc.

  • Check if we need to open files in read-write mode. If not, then make readonly the default and remove the option.
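
For the csum_whole_file() readahead benchmarks above, here is a minimal sketch of the first two candidates (posix_fadvise with POSIX_FADV_SEQUENTIAL plus an explicit readahead(2) call). The function name, chunk size, and error handling are illustrative only; the point is where the hints would sit relative to the read loop.

```c
/* Illustrative only: possible read hints to benchmark ahead of the
 * csum_whole_file() read loop. */
#define _GNU_SOURCE		/* for readahead(2) */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

static void hint_sequential_read(int fd, off_t len)
{
    int ret;

    /* Tell the kernel we intend to read the whole file sequentially so
     * it can use a larger readahead window. */
    ret = posix_fadvise(fd, 0, len, POSIX_FADV_SEQUENTIAL);
    if (ret)
        fprintf(stderr, "posix_fadvise: %s\n", strerror(ret));

    /* Optionally pre-populate the first chunk of the page cache before
     * checksumming starts; 1 MiB here is an arbitrary choice. */
    if (readahead(fd, 0, 1024 * 1024) < 0)
        perror("readahead");
}
```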
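
A sketch of the "count pre-deduped shared bytes while fiemapping" idea: while walking a file's extents for flags, also sum the lengths of extents marked FIEMAP_EXTENT_SHARED. This is not duperemove's fiemap code; the batch size and error handling are just for illustration.

```c
/* Sketch: total up bytes in shared extents while fiemapping for flags,
 * so the dedupe stage does not have to fiemap the file again. */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

static int count_shared_bytes(int fd, uint64_t file_len, uint64_t *shared)
{
    unsigned int count = 512;	/* extents fetched per ioctl, arbitrary */
    struct fiemap *fm;
    uint64_t start = 0;
    int last = 0;

    fm = calloc(1, sizeof(*fm) + count * sizeof(struct fiemap_extent));
    if (!fm)
        return -1;

    *shared = 0;
    while (!last && start < file_len) {
        memset(fm, 0, sizeof(*fm));
        fm->fm_start = start;
        fm->fm_length = file_len - start;
        fm->fm_extent_count = count;

        if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
            free(fm);
            return -1;
        }
        if (fm->fm_mapped_extents == 0)
            break;

        for (unsigned int i = 0; i < fm->fm_mapped_extents; i++) {
            struct fiemap_extent *ext = &fm->fm_extents[i];

            /* The extent flags we already inspect also tell us whether
             * an extent is shared (i.e. pre-deduped). */
            if (ext->fe_flags & FIEMAP_EXTENT_SHARED)
                *shared += ext->fe_length;
            if (ext->fe_flags & FIEMAP_EXTENT_LAST)
                last = 1;
            start = ext->fe_logical + ext->fe_length;
        }
    }

    free(fm);
    return 0;
}
```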
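
For the sqlite backend task, a hypothetical sketch of what creating the database could look like. The table and column names here are guesses for illustration only, not an actual or proposed schema.

```c
/* Hypothetical schema sketch for an sqlite-backed hashfile. */
#include <sqlite3.h>
#include <stdio.h>

static int create_hashfile(const char *path)
{
    sqlite3 *db;
    char *err = NULL;
    const char *schema =
        "CREATE TABLE IF NOT EXISTS files("
        "  ino INTEGER, subvol INTEGER, filename TEXT, mtime INTEGER);"
        "CREATE TABLE IF NOT EXISTS hashes("
        "  digest BLOB, ino INTEGER, loff INTEGER, flags INTEGER);"
        "CREATE INDEX IF NOT EXISTS idx_digest ON hashes(digest);";

    if (sqlite3_open(path, &db) != SQLITE_OK)
        return -1;

    if (sqlite3_exec(db, schema, NULL, NULL, &err) != SQLITE_OK) {
        fprintf(stderr, "sqlite error: %s\n", err);
        sqlite3_free(err);
        sqlite3_close(db);
        return -1;
    }

    sqlite3_close(db);
    return 0;
}
```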
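
For the fdupes-list mode, a sketch of the input handling, assuming fdupes' default output format (one filename per line, duplicate groups separated by blank lines). dedupe_group() is a hypothetical stand-in for whole-file dedupe via the dedupe core.

```c
/* Sketch: read an fdupes-style list from stdin and hand each group of
 * identical files to a (hypothetical) whole-file dedupe helper. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_GROUP 1024

static void dedupe_group(char **files, int count)
{
    /* Placeholder: whole-file dedupe of files[1..count-1] against files[0]. */
    printf("group of %d files, source: %s\n", count, files[0]);
}

int main(void)
{
    char *files[MAX_GROUP];
    char line[4096];
    int count = 0;

    while (fgets(line, sizeof(line), stdin)) {
        line[strcspn(line, "\n")] = '\0';
        if (line[0] == '\0') {		/* blank line ends a group */
            if (count > 1)
                dedupe_group(files, count);
            for (int i = 0; i < count; i++)
                free(files[i]);
            count = 0;
            continue;
        }
        if (count < MAX_GROUP)
            files[count++] = strdup(line);
    }
    if (count > 1)
        dedupe_group(files, count);
    for (int i = 0; i < count; i++)
        free(files[i]);
    return 0;
}
```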
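
For the hole/unwritten-extent item, a sketch of caching the digest of an all-zero block so holes never have to be read or hashed. It uses libgcrypt's one-shot gcry_md_hash_buffer() since that is the current hashing dependency; the block size, digest length, and the csum_block() wrapper are assumptions for illustration.

```c
/* Sketch: compute the digest of an all-zero block once and reuse it for
 * holes and unwritten extents instead of reading and hashing them. */
#include <gcrypt.h>
#include <stdbool.h>
#include <string.h>

#define BLOCK_SIZE	(128 * 1024)	/* illustrative block size */
#define DIGEST_LEN	32		/* SHA-256 */

static unsigned char zero_digest[DIGEST_LEN];
static bool zero_digest_ready;

static void csum_block(const unsigned char *buf, bool is_hole,
                       unsigned char *digest)
{
    if (is_hole) {
        if (!zero_digest_ready) {
            static const unsigned char zeroes[BLOCK_SIZE];

            gcry_md_hash_buffer(GCRY_MD_SHA256, zero_digest,
                                zeroes, BLOCK_SIZE);
            zero_digest_ready = true;
        }
        /* No read needed: copy the precomputed digest into place. */
        memcpy(digest, zero_digest, DIGEST_LEN);
        return;
    }
    gcry_md_hash_buffer(GCRY_MD_SHA256, digest, buf, BLOCK_SIZE);
}
```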
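
For the multi-threaded dedupe stage, a sketch of the "let each thread store the fd locally" idea: the worker opens its own descriptor instead of writing to a shared filerec->fd, so no shared state is mutated. The dedupe_work structure and worker body are placeholders, not duperemove code.

```c
/* Sketch: per-thread file descriptor for a threaded dedupe stage. */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

struct dedupe_work {
    const char *filename;	/* stand-in for a filerec / extent list */
};

static void *dedupe_worker(void *arg)
{
    struct dedupe_work *work = arg;
    int fd;

    /* Local descriptor: nothing shared is written, so no locking is
     * needed for the fd itself. */
    fd = open(work->filename, O_RDONLY);
    if (fd < 0) {
        perror("open");
        return NULL;
    }

    /* ... dedupe_extent_list()-style work against fd ... */

    close(fd);
    return NULL;
}

static int queue_dedupe(struct dedupe_work *work, pthread_t *thread)
{
    /* One thread per extent list; a real version would use a pool. */
    return pthread_create(thread, NULL, dedupe_worker, work);
}
```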
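
For the "use SQL capabilities" item, a sketch of driving the suggested GROUP BY / HAVING query through the sqlite3 API. It assumes a hashes(digest, ...) table along the lines of the hypothetical schema sketched earlier.

```c
/* Sketch: only iterate over digests that occur more than once. */
#include <sqlite3.h>

static int walk_duplicate_digests(sqlite3 *db)
{
    sqlite3_stmt *stmt;
    const char *sql =
        "SELECT digest FROM hashes GROUP BY digest HAVING count(*) > 1;";
    int ret;

    if (sqlite3_prepare_v2(db, sql, -1, &stmt, NULL) != SQLITE_OK)
        return -1;

    while ((ret = sqlite3_step(stmt)) == SQLITE_ROW) {
        const void *digest = sqlite3_column_blob(stmt, 0);
        int len = sqlite3_column_bytes(stmt, 0);

        /* ... feed this digest into duplicate processing ... */
        (void)digest;
        (void)len;
    }

    sqlite3_finalize(stmt);
    return ret == SQLITE_DONE ? 0 : -1;
}
```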

Large Tasks

  • Lookup extent owners during checksum phase

    • This is waiting on implementation of a new ioctl to reverse map extents to inodes
    • We can use this later for a more accurate duplicate extent search.
    • Can also be used to skip reading extents that have already been read. For example, if files A and B share an extent, we can just copy the hashes from the first read instead of checksumming it twice.
  • (done except now we use mtime so as to be portable) Store results of our search to speed up subsequent runs (see the mtime sketch after this list).

    • When writing hashes, store latest btrfs transaction id in the hash file (import find-new from btrfsprogs for this)
    • When reading hashes and we have a transaction id, do a scan of btrfs objects to see which inodes have changed.
      • Changed inodes get re-checksummed
      • Deleted inodes get their hashes removed
      • New inodes get checksummed
    • To tie this all together, we need to add an option (maybe simply called --hash-file?) which acts as a create-or-update mode for hash file usage. Now the user can run duperemove, with one command, on a regular basis and we'll automatically keep the hash database updated.
  • Multi-threaded extent search.

    • This needs locking and atomic updates to some fields, so it's not quite as straightforward as threading the hash or dedupe stage. A start would be to have find_all_dups queue up appropriate dup_blocks_lists to worker threads. The threads have to properly lock the filerec compared trees and results trees. b_seen also needs to be read/written atomically.
  • (feature-complete now, thanks to Jack Kenrick) Largely decrease memory footprint by using a bloom filter (thanks to darkling on #btrfs for this idea). Currently we store all hashes in memory in an index (hash tree) and search that index to find duplicates. With this algorithm, our working set in memory is all hashes. Instead, we can use a bloom filter to record when a hash has been seen, allowing us to reduce the working set to only those hashes with dupes (see the sketch after this list). From http://en.wikipedia.org/wiki/Bloom_filter:

    A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set.

    • The general idea is that the bloom filter can tell us (with a certain amount of confidence) whether a hash has been seen already. Memory overhead for a bloom filter is tiny, far less than what we would save by being able to discard non-duped blocks:

      fewer than 10 bits per element are required for a 1% false positive probability

    • However, we can't get the exact hash values back from the bloom filter, so we would have to do multiple read/csum passes of the data:

    PASS 1: Read+csum file blocks, for each checksum:

    • add the csum to the bloom filter
    • if the csum already exists in the bloom filter, add the block to our index of duplicate blocks
    • if the csum already exists in the index of duplicated blocks, dedupe it against the block we found there

    PASS 2: Read+hash file blocks, for each hash:

    • if the hash exists in the bloom filter and is not in the index, we dedupe it against what is in the index.

    The 2nd pass is basically re-finding that 1st block of any set of dupes which was put in the bloom filter, but never stored in the index. In theory the user could skip that pass if they didn't care about missing blocks with only one duplicate.

    There are two potential drawbacks:

    1. The 2nd file scan is expensive. We can work around this by streaming our hashes to a temporary file during the 1st pass, and streaming them back from the file during our 2nd pass.

    2. We don't get the chance to make extents for dedupe, but instead pass individual blocks, which might result in less efficient space layout. This too can be worked around, by pushing the dedupe portion of our algorithm to the 2nd pass and updating the extent search code to handle discontiguous regions (which happens to be a TODO item above).
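
A sketch of the pass-1 bookkeeping described above: a tiny bloom filter plus the "seen before" test that decides whether a block's digest goes into the duplicate index. The filter size, the number of probes, and the assumption of a 32-byte digest are all illustrative; insert_into_dupe_index() is a hypothetical placeholder.

```c
/* Sketch: bloom filter test-and-add for the pass-1 logic above. */
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define BLOOM_BITS	(1UL << 24)	/* 2 MiB of bits, arbitrary */
#define BLOOM_HASHES	7

static unsigned char bloom[BLOOM_BITS / 8];

/* Derive k bit positions by slicing the block digest, which is fine
 * because it is already a cryptographic hash. Assumes a digest of at
 * least 32 bytes. */
static uint64_t bloom_pos(const unsigned char *digest, int i)
{
    uint64_t v;

    memcpy(&v, digest + i * 4, sizeof(v));
    return v % BLOOM_BITS;
}

/* Returns true if the digest was (probably) seen before, and records it
 * either way. */
static bool bloom_test_and_add(const unsigned char *digest)
{
    bool present = true;

    for (int i = 0; i < BLOOM_HASHES; i++) {
        uint64_t pos = bloom_pos(digest, i);

        if (!(bloom[pos / 8] & (1 << (pos % 8))))
            present = false;
        bloom[pos / 8] |= 1 << (pos % 8);
    }
    return present;
}

/* Pass 1, per checksummed block:
 *
 *	if (bloom_test_and_add(digest))
 *		insert_into_dupe_index(digest, block);	// hypothetical
 *	// else: probably unique, keep only the bloom bits
 *
 * Pass 2 re-reads and re-checksums; any digest that hits the bloom
 * filter but is missing from the index is the "first" block of a dupe
 * set found during pass 1.
 */
```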
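
For the "store results to speed up subsequent runs" task above (now mtime-based), a sketch of the rescan decision: compare the mtime recorded alongside a file's hashes against its current mtime. The stored_mtime parameter stands in for whatever the hashfile actually records.

```c
/* Sketch: decide whether a file needs re-checksumming on a later run. */
#include <stdbool.h>
#include <stdint.h>
#include <sys/stat.h>

static bool needs_rescan(const char *path, int64_t stored_mtime)
{
    struct stat st;

    if (stat(path, &st) < 0)
        return true;	/* can't tell, be safe and rescan */

    /* Re-checksum if the file changed since the hashfile was written. */
    return (int64_t)st.st_mtime != stored_mtime;
}
```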

Kernel Work

We have a kernel component which needs maintenance; this list describes some of those tasks. Actual discussion / development needs to happen on the btrfs mailing list, but it's nice to have the dedupe tasks all in one place.

  • Relax the permission check in the extent-same ioctl. This seemed like a good idea when I implemented the ioctl - we were worried about the implications of one user being able to dedupe another user's files. The check is too coarse though, and as a result we have many complaints about not being able to dedupe readonly files as a non-root user. The most obvious fix here would be to simply delete the check. Users could then use -A (and we need to update the documentation) to force read-only opens. Once the fix is in the wild long enough, we can revert to always opening readonly.

  • Allow dedupe of extents with non-aligned length. Basically we want to dedupe the tail of files and extent-same was rejecting unaligned len. Fix is in the btrfs tree.

  • Fix readpage / btrfs-extent-same deadlock. We're taking page lock and extent lock in opposite orders, btrfs-extent-same code needs to be fixed.

  • (done except better because no flag is needed) Optional mtime updates (need to discuss this on list). Clone is updating mtime, which blows up backup programs for some users. We want a 'dont-set-mtime' flag that btrfs-extent-same can pass through to clone when userspace requests it.

  • (done but this exposes some extremely poor performance in btrfs fiemap, disabled for now) Allow for dedupe of same file. Clone supports this now so we want to remove our checks and do sanity testing of various scenarios.