
insane fast is not very fast at all #53

Open
SmartLayer opened this issue Mar 5, 2021 · 14 comments

@SmartLayer commented Mar 5, 2021

When recursing into a directory of 6.2GB with 767 files in it, I thought the "insane fast" mode would:

  1. compute a summary for each file from the csums of the blocks that make up the file;
  2. do a sort / uniq over these summaries.

Since the csums are already computed, this shouldn't take more than a minute on a modern computer. Instead, the process has been running for 30 minutes now and the output already shows 2020 non-matching results.
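For illustration, here is a minimal sketch (my own, not dduper code) of the flow I expected. The per-block checksums are recomputed here by reading the file in 4KiB blocks, purely as a stand-in for the checksums btrfs already keeps in its csum-tree:

import hashlib
from collections import defaultdict

def file_fingerprint(path, block_size=4096):
    # One digest per file, derived from per-block checksums. In dduper these
    # would come from the btrfs csum-tree instead of re-reading the file.
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(block_size), b''):
            h.update(hashlib.md5(block).digest())
    return h.hexdigest()

def group_duplicates(paths):
    # Equivalent of sort | uniq -D: bucket files by fingerprint and keep
    # only the buckets that contain two or more files.
    groups = defaultdict(list)
    for p in paths:
        groups[file_fingerprint(p)].append(p)
    return [files for files in groups.values() if len(files) > 1]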

@SmartLayer SmartLayer changed the title take out bin and put btrfs.static into RELEASE tab insane fast is not very fast at all Mar 7, 2021
@SmartLayer (Author)

My guess is that instead of using a sorting algorithm to find groups of duplicate files, your code constructed an array of n² − n file pairs (where n is the number of files under the recursed directory) and checked every pair in the array, right?
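For a sense of scale (my arithmetic, not measured from the tool): with the 767 files above, a pairwise scheme performs on the order of n²/2 comparisons, versus one fingerprint per file plus a single sort:

# Back-of-the-envelope count for the reported directory.
n = 767
pairwise = n * (n - 1) // 2   # every unordered file pair: 293,761 comparisons
per_file = n                  # vs. 767 fingerprint computations plus one sort
print(pairwise, per_file)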

@Lakshmipathi (Owner)

Yes, right. It compares two files at a time, since it doesn't store previously computed results on disk. "Insane" mode is fast for real dedupe between two files (say, comparing two 100GB files), but it's slow with a large number of files. I'll update the tool to store previously computed results in a database (sqlite). That should improve overall performance.

@Lakshmipathi Lakshmipathi self-assigned this Mar 8, 2021
@Lakshmipathi Lakshmipathi added the enhancement New feature or request label Mar 8, 2021
@SmartLayer (Author) commented Mar 9, 2021

Hi, how about this quick way to solve the problem?

First, bear with me while I explain how people deduplicated files before dedup tools were popular.

Deduplicating with existing Linux commands

Back in the 90s, when Bojack Horseman was horsing around, people used to do this:

$ find dir/ -type f -exec md5sum '{}' ';' | sort | uniq -w 32 -D

This computes the md5 checksum of every file under the directory, sorts the output (so that duplicate files are clustered together), then runs uniq to find all lines whose first 32 characters are duplicated. The first 32 characters are the md5sum value, while the rest is the file name.

Assumption

Let's assume there is a quick way to get a fingerprint of some sort for a file by using btrfs.

New method

Now, given that sorting and duplicate detection are already part of the standard Linux command set, you only need to add a parameter such as --fingerprint, which outputs each file's fingerprint followed by the file's name. The fingerprint can be a 32-character value computed as a checksum over all the csums of the file's blocks. A user can then use a pipe to get the duplicated items:

$ dduper --fingerprint --recursive --dir /folder | sort | uniq -w 32 -D

And you can put that info in the EXAMPLE section of your man page.
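To illustrate the output format only (not a proposal for the implementation), a hypothetical --fingerprint mode could print a fixed-width 32-character digest followed by the file name, so the sort | uniq -w 32 -D pipeline above works unchanged. Whole-file md5 stands in here for a csum-tree-derived value:

import hashlib
import os
import sys

def emit_fingerprints(root):
    # Print "<32-char digest> <path>" for every regular file under root.
    for path, dirs, files in os.walk(root):
        for name in files:
            fn = os.path.join(path, name)
            h = hashlib.md5()
            with open(fn, 'rb') as f:
                for block in iter(lambda: f.read(1 << 20), b''):
                    h.update(block)
            print("{} {}".format(h.hexdigest(), fn))

if __name__ == "__main__":
    emit_fingerprints(sys.argv[1])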

The insanely fast but inaccurate method that already works today

The following method doesn't take advantage of btrfs, meaning btrfs should be able to outperform it.

There is a shortcut to get potential duplicates insanely fast: simply compare file sizes (for video mp4 files larger than 1GB, files of equal size are usually duplicates).

$ du -ab /videos/ | sort -n | awk -F $'\t' '{printf("%16s\t%s\n", $1, $2)}' | uniq -w 16 -D

The additional awk in the command above merely pads the file size to a fixed width of 16 characters (16 characters is more than enough for any possible web video's size).

But there can be video files of equal size with different content, so a hash should still be computed to be sure.
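Combining the two stages in plain Python (no btrfs involved, just to spell out the idea): group by size first, then hash only the files whose sizes collide:

import hashlib
import os
from collections import defaultdict

def duplicates_by_size_then_hash(paths):
    # Stage 1: cheap size grouping.
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)

    # Stage 2: hash only files whose size collides with another file's.
    dupes = defaultdict(list)
    for size, group in by_size.items():
        if len(group) < 2:
            continue
        for p in group:
            h = hashlib.md5()
            with open(p, 'rb') as f:
                for block in iter(lambda: f.read(1 << 20), b''):
                    h.update(block)
            dupes[(size, h.hexdigest())].append(p)
    return [g for g in dupes.values() if len(g) > 1]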

@Lakshmipathi (Owner)

Off-topic: the server of one of my other projects, with more than 225000 users, burned down 😭 http://community.webminal.org/t/webminal-org-down-status-update-thread/1481 I will get back to this issue soon.

@ytrezq commented Mar 11, 2021

Would it be possible to perform the block scanning and hashing in parallel, using a process pool or even Numba?
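For what it's worth, the process-pool half of that suggestion could look like the sketch below (hash_one and hash_all are illustrative names, not dduper functions); whether Numba would help depends on how much time is actually spent in pure-Python loops rather than in I/O:

import hashlib
from multiprocessing import Pool

def hash_one(path):
    # Hash a single file; returns (path, digest) so results can be collected.
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(1 << 20), b''):
            h.update(block)
    return path, h.hexdigest()

def hash_all(paths, workers=4):
    # Fan the per-file work out over a pool of worker processes.
    with Pool(processes=workers) as pool:
        return dict(pool.map(hash_one, paths))

if __name__ == "__main__":
    import sys
    for path, digest in hash_all(sys.argv[1:]).items():
        print(digest, path)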

@jedie commented Apr 23, 2021

When recursing into a directory of 6.2GB with 767 files in it, I thought the "insane fast" mode would:

1. compute a summary for each file from the csums of the blocks that make up the file;
2. do a sort / uniq over these summaries.

A missing point is the os.walk()... If you have many more files than 767, you will run into problems with the current implementation here:

dduper/dduper

Lines 427 to 443 in 8745078

def dedupe_dir(dir_path, dry_run, recurse):
    file_list = []
    if recurse is True:
        for dirname in dir_path:
            for path, dirs, files in os.walk(dirname):
                for filename in files:
                    fn = os.path.join(path, filename)
                    if validate_file(fn) is True:
                        file_list.append(fn)
    else:
        for dirname in dir_path:
            for fi in os.listdir(dirname):
                if os.path.isfile(os.path.join(dirname, fi)):
                    fn = os.path.join(dirname, fi)
                    if validate_file(fn) is True:
                        file_list.append(fn)
    dedupe_files(file_list, dry_run)

  1. Collecting all the files may take a really long time before the real dedupe part can even start.
  2. RAM usage will become a problem at some point, because the file list simply gets too big (a lazy variant is sketched below).
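A generator-based variant of that collection loop would let deduplication start as soon as the first files are found and keep memory flat, assuming dedupe_files() could be taught to consume an iterator (or fixed-size batches) instead of one big list:

import os

def iter_files(dir_path, recurse):
    # Generator version of the collection loop above: yield each validated
    # file as soon as it is found instead of appending to one large list.
    # validate_file() is dduper's existing helper from the snippet above.
    for dirname in dir_path:
        if recurse:
            for path, dirs, files in os.walk(dirname):
                for filename in files:
                    fn = os.path.join(path, filename)
                    if validate_file(fn):
                        yield fn
        else:
            for fi in os.listdir(dirname):
                fn = os.path.join(dirname, fi)
                if os.path.isfile(fn) and validate_file(fn):
                    yield fn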

@Lakshmipathi (Owner)

Added an sqlite db to keep track of the file csums fetched from the btrfs csum-tree. It should improve performance. Could you please check the latest code from this repo? Thanks!
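For reference, a minimal sketch of what such a cache can look like; the actual schema in dduper may differ (the thread below only shows that a table named filehash exists):

import sqlite3

conn = sqlite3.connect("dduper.db")
conn.execute("""CREATE TABLE IF NOT EXISTS filehash (
                    filename TEXT PRIMARY KEY,
                    hash     TEXT NOT NULL)""")

def cached_hash(path, compute):
    # Reuse a previously computed result if one is stored; otherwise compute
    # it once, store it, and return it.
    row = conn.execute("SELECT hash FROM filehash WHERE filename = ?",
                       (path,)).fetchone()
    if row:
        return row[0]
    value = compute(path)
    conn.execute("INSERT OR REPLACE INTO filehash VALUES (?, ?)", (path, value))
    conn.commit()
    return value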

@adamryczkowski
I believe this still did not solve the problem. After 18h24m8s of waiting I quit the process:

sudo ./dduper --device /dev/mapper/adama-docs --dir /home/Adama-docs --recurse

bedup scans this filesystem in a few minutes (after purging its cache).

$ sudo btrfs fi usage /home/Adama-docs/
Overall:
    Device size:		 744.01GiB
    Device allocated:		 744.01GiB
    Device unallocated:		   1.00MiB
    Device missing:		     0.00B
    Used:			 624.19GiB
    Free (estimated):		 118.17GiB	(min: 118.17GiB)
    Data ratio:			      1.00
    Metadata ratio:		      1.00
    Global reserve:		 512.00MiB	(used: 0.00B)

Data,single: Size:740.00GiB, Used:621.83GiB (84.03%)
   /dev/mapper/adama-docs	 740.00GiB

Metadata,single: Size:4.01GiB, Used:2.36GiB (58.95%)
   /dev/mapper/adama-docs	   4.01GiB

System,single: Size:4.00MiB, Used:96.00KiB (2.34%)
   /dev/mapper/adama-docs	   4.00MiB

Unallocated:
   /dev/mapper/adama-docs	   1.00MiB
$ sudo btrfs fi du -s /home/Adama-docs/ 
     Total   Exclusive  Set shared  Filename
 660.36GiB   646.95GiB     6.70GiB  /home/Adama-docs/

@Lakshmipathi (Owner)

@adamryczkowski could you please share the dduper.log file? It should be in the directory where you ran the command.

@adamryczkowski
It is very short. Here you go:

DEBUG:root:Phase-1: Validating files and creating DB
DEBUG:root:Phase-1.1: Populate records using threads
DEBUG:root:Phase-1: Validating files and creating DB
DEBUG:root:Phase-1.1: Populate records using threads
DEBUG:root:Phase-1: Validating files and creating DB
DEBUG:root:Phase-1.1: Populate records using threads

dduper.db is 986MB. That may be a problem. Do you want me to upload it?

@Lakshmipathi (Owner)

Ah, okay. Looks like it never completed the DB-populating phase. The dduper.db file doesn't contain much info other than your filenames and sha256sums, so uploading it is not necessary. Do you have a large number of small files under /home/Adama-docs/ ?

Can you share the output of the command below:

sqlite3 dduper.db 'select count(*) from filehash;'

It should display the number of records (file entries) in the DB.

@adamryczkowski
$ sqlite3 dduper.db 'select count(*) from filehash;'
341865

@adamryczkowski
$ sudo find /home/Adama-docs -type f | wc -l
1501608

@adamryczkowski
$ df -h /home/Adama-docs
Filesystem              Size  Used Avail Use% Mounted on
/dev/mapper/adama-docs  745G  626G  118G  85% /home/Adama-docs
