# Deduplication Comparison

## Data

In this experiment, we use the English (`en`) `train` split of `wiki40b` to compare different deduplication methods by running time and by the number of documents they remove.

In [14]:
repo_path = "/Users/chenghao/Downloads/Projects/text-dedup/deduplicate-text-datasets"
dataset = "wiki40b"

All runs were performed on a MacBook Pro with 10 cores and 64 GB of RAM.

## Suffix Array Deduplication

In [15]:
print(f"""python -m text_dedup.suffix_array \\
    --path {dataset} \\
    --name "en" \\
    --split "train" \\
    --cache_dir "./cache" \\
    --output "output/suffix_array/dedup" \\
    --column "text" \\
    --google_repo_path {repo_path}""")

python -m text_dedup.suffix_array \
    --path wiki40b \
    --name "en" \
    --split "train" \
    --cache_dir "./cache" \
    --output "output/suffix_array/dedup" \
    --column "text" \
    --google_repo_path /Users/chenghao/Downloads/Projects/text-dedup/deduplicate-text-datasets

This run ran out of memory (OOM) on the machine described above, so no timings or document counts are reported for it.


## Exact Deduplication

In [17]:
print(f"""python -m text_dedup.exact_hash \\
    --path {dataset} \\
    --name en \\
    --split "train" \\
    --cache_dir "./cache" \\
    --output "output/exact_hash/dedup" \\
    --column "text" \\
    --batch_size 1000""")

python -m text_dedup.exact_hash \
    --path wiki40b \
    --name en \
    --split "train" \
    --cache_dir "./cache" \
    --output "output/exact_hash/dedup" \
    --column "text" \
    --batch_size 1000


```
Loading                         : 4.10s  
Processing                      : 37.39s 
Filtering                       : 4.81s  
Saving                          : 166.47s
Total                           : 212.76s
Before                          : 2926536
After                           : 2926536
```
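
The exact-hash run above leaves the document count unchanged (2926536 before and after). As a rough illustration of the general idea only, and not the `text_dedup.exact_hash` implementation, the sketch below keeps the first occurrence of each distinct text by comparing digests. The helper name `exact_dedup` and the choice of SHA-256 are assumptions made for this example.

```python
import hashlib


def exact_dedup(texts):
    """Keep the first occurrence of each distinct text (hypothetical helper)."""
    seen = set()
    kept = []
    for text in texts:
        # Any difference in the text, even whitespace, changes the digest.
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            kept.append(text)
    return kept


docs = ["a b c", "a b c", "a b  c"]
print(exact_dedup(docs))
```

Because only byte-identical texts collide, a corpus with no verbatim duplicates passes through untouched, which matches the zero reduction reported above.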

In [19]:
print(f"""python -m text_dedup.bloom_filter \\
    --path {dataset} \\
    --name en \\
    --split "train" \\
    --cache_dir "./cache" \\
    --output "output/bloom_filter/dedup" \\
    --column "text" \\
    --batch_size 1000""")

python -m text_dedup.bloom_filter \
    --path wiki40b \
    --name en \
    --split "train" \
    --cache_dir "./cache" \
    --output "output/bloom_filter/dedup" \
    --column "text" \
    --batch_size 1000


```
Loading                         : 2.52s  
Processing                      : 104.85s
Filtering                       : 3.72s  
Saving                          : 170.84s
Total                           : 281.92s
Before                          : 2926536
After                           : 2926521
```
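
The Bloom filter run removes 15 documents where exact hashing removes none. One plausible explanation, which the notebook does not confirm, is false positives: a Bloom filter can report an unseen item as already seen. The sketch below is a minimal pure-Python Bloom filter written for illustration, not the `text_dedup.bloom_filter` implementation; the class, the helper `bloom_dedup`, and the parameters `num_bits` and `num_hashes` are all assumptions.

```python
import hashlib


class BloomFilter:
    """Tiny illustrative Bloom filter (hypothetical, not text_dedup's)."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions by salting one hash with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def bloom_dedup(texts, num_bits=1 << 20, num_hashes=5):
    """Drop any text the filter claims to have seen (may be a false positive)."""
    seen = BloomFilter(num_bits, num_hashes)
    kept = []
    for text in texts:
        if text in seen:
            continue
        seen.add(text)
        kept.append(text)
    return kept


print(bloom_dedup(["x", "y", "x"]))
# A deliberately undersized filter drops distinct items as false positives.
print(len(bloom_dedup([str(i) for i in range(100)], num_bits=8, num_hashes=1)))
```

The trade-off is memory for accuracy: the filter never misses a true duplicate, but a filter that is too small for the corpus discards some unique documents.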

## MinHash + LSH Deduplication

In [20]:
print(f"""python -m text_dedup.minhash \\
  --path {dataset} \\
  --name "en" \\
  --split "train" \\
  --cache_dir "./cache" \\
  --output "output/minhash/dedup" \\
  --column "text" \\
  --batch_size 10000""")

python -m text_dedup.minhash \
  --path wiki40b \
  --name "en" \
  --split "train" \
  --cache_dir "./cache" \
  --output "output/minhash/dedup" \
  --column "text" \
  --batch_size 10000


```
Loading                         : 2.43s  
MinHashing                      : 386.02s
Clustering                      : 90.41s 
Filtering                       : 12.17s 
Saving                          : 190.42s
Total                           : 681.46s
Before                          : 2926536
After                           : 2905488
```
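
In the timings above, computing the MinHash signatures (386.02s) costs more than clustering (90.41s). The sketch below illustrates the two stages in plain Python: signatures that approximate Jaccard similarity, then banded LSH to find candidate pairs without comparing every pair. It is a simplified illustration, not the `text_dedup.minhash` implementation; the function names, word 3-gram shingles, 64 hash functions, and 16 bands are assumptions chosen for the example.

```python
import hashlib
from collections import defaultdict


def shingles(text, n=3):
    """Word n-grams of a text (assumed tokenisation: lowercase, whitespace)."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(max(1, len(tokens) - n + 1))}


def minhash_signature(items, num_perm=64):
    """For each salted hash function, keep the minimum hash over all items."""
    signature = []
    for seed in range(num_perm):
        signature.append(min(
            int.from_bytes(
                hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()[:8], "big"
            )
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def lsh_candidates(signatures, bands=16):
    """Pairs of indices whose signatures agree on at least one whole band."""
    rows = len(signatures[0]) // bands
    buckets = defaultdict(list)
    for idx, sig in enumerate(signatures):
        for band in range(bands):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].append(idx)
    pairs = set()
    for members in buckets.values():
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                pairs.add((members[i], members[j]))
    return pairs


docs = [
    "the quick brown fox jumps over the lazy dog near the river bank",
    "the quick brown fox jumps over the lazy dog near the river shore",
    "completely different words appear in this other sentence",
]
signatures = [minhash_signature(shingles(doc)) for doc in docs]
print(estimated_jaccard(signatures[0], signatures[1]))
print(sorted(lsh_candidates(signatures)))
```

More bands with fewer rows each make the candidate search more permissive; fewer, wider bands make it stricter.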

## SimHash Deduplication

In [21]:
print(f"""python -m text_dedup.simhash \\
  --path {dataset} \\
  --name "en" \\
  --split "train" \\
  --cache_dir "./cache" \\
  --output "output/simhash/dedup" \\
  --column "text" \\
  --batch_size 10000""")

python -m text_dedup.simhash \
  --path wiki40b \
  --name "en" \
  --split "train" \
  --cache_dir "./cache" \
  --output "output/simhash/dedup" \
  --column "text" \
  --batch_size 10000


```
Loading                         : 2.42s  
SimHashing                      : 157.54s
Clustering                      : 460.78s
Filtering                       : 11.07s 
Saving                          : 199.22s
Total                           : 831.03s
Before                          : 2926536
After                           : 2918229
```
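
For SimHash the cost profile is reversed: fingerprinting takes 157.54s, while clustering takes 460.78s. The sketch below shows the fingerprinting idea and the Hamming-distance comparison that clustering relies on. It is a minimal illustration, not the `text_dedup.simhash` implementation; the function names, whitespace tokenisation, and 64-bit fingerprint size are assumptions.

```python
import hashlib


def simhash(text, bits=64):
    """Fingerprint where each bit is a majority vote over token hashes."""
    counts = [0] * bits
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            # A set bit votes +1 for that position, a clear bit votes -1.
            counts[i] += 1 if (token_hash >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if counts[i] > 0)


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


original = "the quick brown fox jumps over the lazy dog near the river bank"
edited = "the quick brown fox jumps over the lazy dog near the river shore"
print(hamming_distance(simhash(original), simhash(edited)))
print(hamming_distance(simhash(original), simhash("an unrelated sentence entirely")))
```

Two documents are treated as near-duplicates when their fingerprints differ in only a few bits; the threshold is a tuning choice and is not specified in this notebook.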

## Results

| Method | Total time (s) | Documents before | Documents after | Reduction |
|--------|----------------|------------------|-----------------|-----------|
| Exact Hash | 212.76 | 2926536 | 2926536 | 0.00% |
| Bloom Filter | 281.92 | 2926536 | 2926521 | <0.01% |
| MinHash | 681.46 | 2926536 | 2905488 | 0.72% |
| SimHash | 831.03 | 2926536 | 2918229 | 0.28% |
| Suffix Array | OOM | OOM | OOM | OOM |
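
The reduction column follows directly from the before and after counts. The short script below recomputes it from the numbers reported in the run logs; the dictionary layout and the helper name `reduction` are just conveniences for this example. The suffix array run is omitted because it produced no counts.

```python
# (before, after) document counts copied from the run logs above.
results = {
    "Exact Hash": (2926536, 2926536),
    "Bloom Filter": (2926536, 2926521),
    "MinHash": (2926536, 2905488),
    "SimHash": (2926536, 2918229),
}


def reduction(before, after):
    """Percentage of documents removed by deduplication."""
    return 100.0 * (before - after) / before


for method, (before, after) in results.items():
    removed = before - after
    print(f"{method:<12} removed {removed:>6} ({reduction(before, after):.4f}%)")
```

Printing four decimal places separates the Bloom filter run, which removes 15 documents, from the exact-hash run, which removes none.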