
insane fast is not very fast at all #53

Open
SmartLayer opened this issue Mar 5, 2021 · 14 comments

@SmartLayer commented Mar 5, 2021

When recursing into a directory of 6.2GB with 767 files in it, I thought the "insane fast" mode would:

  1. compute a summary for each file from the csums of the blocks that make up the file;
  2. do a sort / uniq over these summaries.

Since the csums are already computed, this shouldn't take more than a minute on a modern computer. Instead, the process has been running for 30 minutes now and the output already shows 2020 non-matching results.
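For illustration, here is a minimal sketch (my own, not dduper code) of the flow I expected. The per-block checksums are recomputed here by reading the file in 4KiB blocks, purely as a stand-in for the checksums btrfs already keeps in its csum-tree:

import hashlib
from collections import defaultdict

def file_fingerprint(path, block_size=4096):
    # One digest per file, derived from per-block checksums. In dduper these
    # would come from the btrfs csum-tree instead of re-reading the file.
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(block_size), b''):
            h.update(hashlib.md5(block).digest())
    return h.hexdigest()

def group_duplicates(paths):
    # Equivalent of sort | uniq -D: bucket files by fingerprint and keep
    # only the buckets that contain two or more files.
    groups = defaultdict(list)
    for p in paths:
        groups[file_fingerprint(p)].append(p)
    return [files for files in groups.values() if len(files) > 1]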

@SmartLayer SmartLayer changed the title take out bin and put btrfs.static into RELEASE tab insane fast is not very fast at all Mar 7, 2021
@SmartLayer (Author)

My guess is that instead of using a sorting algorithm to find groups of duplicate files, your code constructed an array of n² − n file pairs (where n is the number of files under the recursed directory) and checked every pair in the array, right?
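For a sense of scale (my arithmetic, not measured from the tool): with the 767 files above, a pairwise scheme performs on the order of n²/2 comparisons, versus one fingerprint per file plus a single sort:

# Back-of-the-envelope count for the reported directory.
n = 767
pairwise = n * (n - 1) // 2   # every unordered file pair: 293,761 comparisons
per_file = n                  # vs. 767 fingerprint computations plus one sort
print(pairwise, per_file)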

@Lakshmipathi (Owner)

Yes, right. It compares two files at a time, since it doesn't store previously computed results on disk. "Insane" mode is fast for real dedupe between two files (say, comparing two 100GB files), but it's slow with a large number of files. I'll update the tool to store previously computed results in a database (sqlite). That should improve overall performance.

@Lakshmipathi Lakshmipathi self-assigned this Mar 8, 2021
@Lakshmipathi Lakshmipathi added the enhancement New feature or request label Mar 8, 2021
@SmartLayer (Author) commented Mar 9, 2021

Hi, how about this quick way to solve the problem?

First, bear with me while I explain how people deduplicated files before dedup tools were popular.

Deduplicating with existing Linux commands

Back in the 90s, when Bojack Horseman was horsing around, people used to do this:

$ find dir/ -type f -exec md5sum '{}' ';' | sort | uniq -w 32 -D

This computes the md5 checksum of every file under the directory, sorts the output (so that duplicate files are clustered together), then runs uniq to find all lines whose first 32 characters are duplicated. The first 32 characters are the md5sum value, while the rest is the file name.

Assumption

Let's assume there is a quick way to get a fingerprint of some sort for a file by using btrfs.

New method

Now, given that sorting and duplicate detection are already part of the standard Linux command set, you only need to add a parameter such as --fingerprint, which outputs each file's fingerprint followed by the file's name. The fingerprint can be a 32-character value computed as a checksum over all the csums of the file's blocks. A user can then use a pipe to get the duplicated items:

$ dduper --fingerprint --recursive --dir /folder | sort | uniq -w 32 -D

And you can put that info in the EXAMPLE section of your man page.
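To illustrate the output format only (not a proposal for the implementation), a hypothetical --fingerprint mode could print a fixed-width 32-character digest followed by the file name, so the sort | uniq -w 32 -D pipeline above works unchanged. Whole-file md5 stands in here for a csum-tree-derived value:

import hashlib
import os
import sys

def emit_fingerprints(root):
    # Print "<32-char digest> <path>" for every regular file under root.
    for path, dirs, files in os.walk(root):
        for name in files:
            fn = os.path.join(path, name)
            h = hashlib.md5()
            with open(fn, 'rb') as f:
                for block in iter(lambda: f.read(1 << 20), b''):
                    h.update(block)
            print("{} {}".format(h.hexdigest(), fn))

if __name__ == "__main__":
    emit_fingerprints(sys.argv[1])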

The insanely fast but inaccurate method that already works today

The following method doesn't take advantage of btrfs, meaning btrfs should be able to outperform it.

There is a shortcut to get potential duplicates insanely fast: simply compare file sizes (for video mp4 files larger than 1GB, files of equal size are usually duplicates).

$ du -ab /videos/ | sort -n | awk -F $'\t' '{printf("%16s\t%s\n", $1, $2)}' | uniq -w 16 -D

The additional awk in the command above merely pads the file size to a fixed width of 16 characters (16 characters is more than enough for any possible web video's size).

But there can be video files of equal size with different content, so a hash should still be computed to be sure.
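Combining the two stages in plain Python (no btrfs involved, just to spell out the idea): group by size first, then hash only the files whose sizes collide:

import hashlib
import os
from collections import defaultdict

def duplicates_by_size_then_hash(paths):
    # Stage 1: cheap size grouping.
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)

    # Stage 2: hash only files whose size collides with another file's.
    dupes = defaultdict(list)
    for size, group in by_size.items():
        if len(group) < 2:
            continue
        for p in group:
            h = hashlib.md5()
            with open(p, 'rb') as f:
                for block in iter(lambda: f.read(1 << 20), b''):
                    h.update(block)
            dupes[(size, h.hexdigest())].append(p)
    return [g for g in dupes.values() if len(g) > 1]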

@Lakshmipathi (Owner)

Off-topic: the server of one of my other projects, with more than 225000 users, burned down 😭 http://community.webminal.org/t/webminal-org-down-status-update-thread/1481 I will get back to this issue soon.

@ytrezq commented Mar 11, 2021

Would it be possible to perform the block scanning and hashing in parallel, using a process pool or even Numba?
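For what it's worth, the process-pool half of that suggestion could look like the sketch below (hash_one and hash_all are illustrative names, not dduper functions); whether Numba would help depends on how much time is actually spent in pure-Python loops rather than in I/O:

import hashlib
from multiprocessing import Pool

def hash_one(path):
    # Hash a single file; returns (path, digest) so results can be collected.
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(1 << 20), b''):
            h.update(block)
    return path, h.hexdigest()

def hash_all(paths, workers=4):
    # Fan the per-file work out over a pool of worker processes.
    with Pool(processes=workers) as pool:
        return dict(pool.map(hash_one, paths))

if __name__ == "__main__":
    import sys
    for path, digest in hash_all(sys.argv[1:]).items():
        print(digest, path)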

@jedie commented Apr 23, 2021

When recursing into a directory of 6.2GB with 767 files in it, I thought the "insane fast" mode would:

1. compute a summary for each file from the csums of the blocks that make up the file;
2. do a sort / uniq over these summaries.

A missing point is the os.walk()... If you have many more files than 767, you will run into problems with the current implementation here:

dduper/dduper

Lines 427 to 443 in 8745078

def dedupe_dir(dir_path, dry_run, recurse):
    file_list = []
    if recurse is True:
        for dirname in dir_path:
            for path, dirs, files in os.walk(dirname):
                for filename in files:
                    fn = os.path.join(path, filename)
                    if validate_file(fn) is True:
                        file_list.append(fn)
    else:
        for dirname in dir_path:
            for fi in os.listdir(dirname):
                if os.path.isfile(os.path.join(dirname, fi)):
                    fn = os.path.join(dirname, fi)
                    if validate_file(fn) is True:
                        file_list.append(fn)
    dedupe_files(file_list, dry_run)

  1. Collecting all the files may take a really long time before the real dedupe part can even start.
  2. RAM usage will become a problem at some point, because the file list simply gets too big (a lazy variant is sketched below).
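A generator-based variant of that collection loop would let deduplication start as soon as the first files are found and keep memory flat, assuming dedupe_files() could be taught to consume an iterator (or fixed-size batches) instead of one big list:

import os

def iter_files(dir_path, recurse):
    # Generator version of the collection loop above: yield each validated
    # file as soon as it is found instead of appending to one large list.
    # validate_file() is dduper's existing helper from the snippet above.
    for dirname in dir_path:
        if recurse:
            for path, dirs, files in os.walk(dirname):
                for filename in files:
                    fn = os.path.join(path, filename)
                    if validate_file(fn):
                        yield fn
        else:
            for fi in os.listdir(dirname):
                fn = os.path.join(dirname, fi)
                if os.path.isfile(fn) and validate_file(fn):
                    yield fn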

@Lakshmipathi (Owner)

Added an sqlite db to keep track of the file csums fetched from the btrfs csum-tree. It should improve performance. Could you please check the latest code from this repo? Thanks!
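For reference, a minimal sketch of what such a cache can look like; the actual schema in dduper may differ (the thread below only shows that a table named filehash exists):

import sqlite3

conn = sqlite3.connect("dduper.db")
conn.execute("""CREATE TABLE IF NOT EXISTS filehash (
                    filename TEXT PRIMARY KEY,
                    hash     TEXT NOT NULL)""")

def cached_hash(path, compute):
    # Reuse a previously computed result if one is stored; otherwise compute
    # it once, store it, and return it.
    row = conn.execute("SELECT hash FROM filehash WHERE filename = ?",
                       (path,)).fetchone()
    if row:
        return row[0]
    value = compute(path)
    conn.execute("INSERT OR REPLACE INTO filehash VALUES (?, ?)", (path, value))
    conn.commit()
    return value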

@adamryczkowski
I believe this still did not solve the problem. After 18h24m8s of waiting I quit the process:

sudo ./dduper --device /dev/mapper/adama-docs --dir /home/Adama-docs --recurse

bedup scans this filesystem in a few minutes (after purging its cache).

$ sudo btrfs fi usage /home/Adama-docs/
Overall:
    Device size:		 744.01GiB
    Device allocated:		 744.01GiB
    Device unallocated:		   1.00MiB
    Device missing:		     0.00B
    Used:			 624.19GiB
    Free (estimated):		 118.17GiB	(min: 118.17GiB)
    Data ratio:			      1.00
    Metadata ratio:		      1.00
    Global reserve:		 512.00MiB	(used: 0.00B)

Data,single: Size:740.00GiB, Used:621.83GiB (84.03%)
   /dev/mapper/adama-docs	 740.00GiB

Metadata,single: Size:4.01GiB, Used:2.36GiB (58.95%)
   /dev/mapper/adama-docs	   4.01GiB

System,single: Size:4.00MiB, Used:96.00KiB (2.34%)
   /dev/mapper/adama-docs	   4.00MiB

Unallocated:
   /dev/mapper/adama-docs	   1.00MiB
$ sudo btrfs fi du -s /home/Adama-docs/ 
     Total   Exclusive  Set shared  Filename
 660.36GiB   646.95GiB     6.70GiB  /home/Adama-docs/

@Lakshmipathi (Owner)

@adamryczkowski could you please share the dduper.log file? It should be in the directory where you ran the command.

@adamryczkowski
It is very short. Here you go:

DEBUG:root:Phase-1: Validating files and creating DB
DEBUG:root:Phase-1.1: Populate records using threads
DEBUG:root:Phase-1: Validating files and creating DB
DEBUG:root:Phase-1.1: Populate records using threads
DEBUG:root:Phase-1: Validating files and creating DB
DEBUG:root:Phase-1.1: Populate records using threads

dduper.db is 986MB. That may be a problem. Do you want me to upload it?

@Lakshmipathi (Owner)

Ah, okay. Looks like it never completed the DB-populating phase. The dduper.db file doesn't contain much info other than your filenames and sha256sums, so uploading it is not necessary. Do you have a large number of small files under /home/Adama-docs/ ?

Can you share the output of the command below:

sqlite3 dduper.db 'select count(*) from filehash;'

It should display the number of records (file entries) in the DB.

@adamryczkowski
$ sqlite3 dduper.db 'select count(*) from filehash;'
341865

@adamryczkowski
$ sudo find /home/Adama-docs -type f | wc -l
1501608

@adamryczkowski
$ df -h /home/Adama-docs
Filesystem              Size  Used Avail Use% Mounted on
/dev/mapper/adama-docs  745G  626G  118G  85% /home/Adama-docs
