## This notebook can be used to filter duplicate SVGs based on their content

It should be run using the environment created from the requirements.txt file in the SVGRepresentation folder.

The following code expects the folder 'SVG_Data' to be in the same directory as the cloned project repository.

```
parent
    ├── SVG_Data
    └── SVG_LogoGenerator
```

If your layout differs, set the correct path to the SVG_Data folder in the `datafolder` variable in the first cell of this notebook.

The notebook processes the directory specified in the variable `input_folder_path`. By default, this is `SVG_Data/raw/SVG_Logo/SVG_Logo`.

Outputs are written to the path specified in the variable `output_folder_path`. By default, this is `SVG_Data/raw/SVG_Logo/SVG_Logo_filtered`.

In [13]:
datafolder = "../../../SVG_Data/"
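
Because `datafolder` is a relative path, it resolves against the notebook's working directory. The following is a minimal sketch for checking where it points before running the remaining cells. The helper name `resolve_data_paths` is hypothetical and not part of the notebook; the sub-paths are the defaults described above.

```python
import os
from typing import Tuple

# Same relative default as the first cell; adjust if your layout differs.
datafolder = "../../../SVG_Data/"

def resolve_data_paths(base: str) -> Tuple[str, str]:
    """Return absolute input and output paths below `base` (hypothetical helper)."""
    input_path = os.path.abspath(os.path.join(base, "raw/SVG_Logo/SVG_Logo"))
    output_path = os.path.abspath(os.path.join(base, "raw/SVG_Logo/SVG_Logo_filtered"))
    return input_path, output_path

input_path, output_path = resolve_data_paths(datafolder)
# Prints False if the relative path does not match your directory layout.
print("input folder found:", os.path.isdir(input_path), "->", input_path)
```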

### Helper Functions:

In [18]:
import os
from typing import List, Dict

from concurrent import futures
import glob
from tqdm import tqdm

from collections import defaultdict

In [19]:
NUM_WORKERS = 16

In [20]:
def create_dir(directory: str) -> str:
    os.makedirs(directory, exist_ok=True)
    return directory

In [21]:
input_folder_path = os.path.abspath(os.path.join(datafolder, 'raw/SVG_Logo/SVG_Logo'))
output_folder_path = create_dir(os.path.abspath(os.path.join(datafolder, 'raw/SVG_Logo/SVG_Logo_filtered')))

In [22]:
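def read_file_content(file_path: str) -> str:
    # Explicit encoding so that results do not depend on the platform default.
    with open(file_path, "r", encoding="utf-8") as f:
        return f.read()

In [23]:
def write_file_content(file_path: str, content: str):
    # Use the same encoding as read_file_content.
    with open(file_path, "w", encoding="utf-8") as f:
        f.write(content)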

In [24]:
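def filter_equal_svgs(input_files: List[str], output_folder: str):
    """Write one copy of each distinct file content to output_folder."""
    all_files_with_contents = [(file, read_file_content(file)) for file in input_files]

    # Sorting by content places identical files next to each other.
    all_files_with_contents.sort(key=lambda file_with_content: file_with_content[1])

    # None (rather than "") so that an empty file is not mistaken for a duplicate.
    last_content = None

    for file, content in all_files_with_contents:
        # Skip files whose content equals the previously written one.
        if content == last_content:
            continue

        last_content = content

        filename = os.path.basename(file)
        output_file = os.path.join(output_folder, filename)
        write_file_content(output_file, content)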
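
The sort-and-skip idea used by `filter_equal_svgs` can be tried without touching the file system. The sketch below works on an in-memory mapping of file names to contents; `unique_by_content` and the sample data are made up for illustration. Unlike the notebook function, it sorts by name as a tie-breaker so that the surviving file is deterministic.

```python
from typing import Dict, List

def unique_by_content(contents_by_name: Dict[str, str]) -> List[str]:
    """Return one file name per distinct content, using sort-and-skip."""
    # Sort by (content, name): equal contents become adjacent, and the
    # alphabetically first name of each run is the one that is kept.
    pairs = sorted(contents_by_name.items(), key=lambda item: (item[1], item[0]))

    kept = []
    previous = None  # sentinel meaning "no content seen yet"
    for name, content in pairs:
        if content == previous:
            continue  # same content as the entry just kept: a duplicate
        previous = content
        kept.append(name)
    return kept

example = {
    "a.svg": "<svg><circle/></svg>",
    "b.svg": "<svg><rect/></svg>",
    "c.svg": "<svg><circle/></svg>",
}
print(unique_by_content(example))  # ['a.svg', 'b.svg']
```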

In [25]:
def group_files_by_size(files: List[str]) -> Dict[int, List[str]]:
    grouped = defaultdict(list)

    for file in files:
        file_size = os.stat(file).st_size
        grouped[file_size].append(file)

    return grouped
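
Grouping by size acts as a cheap pre-filter: two files with different sizes cannot be byte-identical, so only files within the same group need their contents compared. Here is a small self-contained sketch of that idea on temporary files; the helper name `group_by_size` and the sample contents are illustrative assumptions.

```python
import os
import tempfile
from collections import defaultdict
from typing import Dict, List

def group_by_size(paths: List[str]) -> Dict[int, List[str]]:
    """Map each file size in bytes to the paths that have that size."""
    grouped = defaultdict(list)
    for path in paths:
        grouped[os.path.getsize(path)].append(path)
    return dict(grouped)

with tempfile.TemporaryDirectory() as tmp:
    # Two identical 6-byte files and one 11-byte file.
    samples = {"a.svg": b"<svg/>", "b.svg": b"<svg/>", "c.svg": b"<svg></svg>"}
    paths = []
    for name, data in samples.items():
        path = os.path.join(tmp, name)
        with open(path, "wb") as f:
            f.write(data)
        paths.append(path)

    groups = group_by_size(paths)
    # Reduce to base names so the result is independent of the temp folder.
    sizes = {size: sorted(os.path.basename(p) for p in group)
             for size, group in groups.items()}
    print(sizes)  # {6: ['a.svg', 'b.svg'], 11: ['c.svg']}
```

Only the 6-byte group contains more than one file, so it is the only group where a content comparison can find a duplicate.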

In [26]:
def process_directory(input_folder, output_folder):
    # "*.svg" matches the SVG files directly inside input_folder (no recursion).
    svg_files = glob.glob(os.path.join(input_folder, "*.svg"))

    print(f"Grouping {len(svg_files)} input files by size...")
    svgs_by_size = group_files_by_size(svg_files)
    print(f"Finished grouping. Got {len(svgs_by_size)} groups, start processing...")

    with tqdm(total=len(svgs_by_size)) as pbar:

        with futures.ThreadPoolExecutor(max_workers=NUM_WORKERS) as executor:
            all_futures = []

            # Files of different sizes cannot be byte-identical, so each size group is deduplicated on its own.
            for same_size_files in svgs_by_size.values():
                all_futures.append(executor.submit(filter_equal_svgs, same_size_files, output_folder))

            print("added all futures...")

            for _ in futures.as_completed(all_futures):
                pbar.update(1)
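
Note that `futures.as_completed` only yields finished futures. An exception raised inside a worker is re-raised only when `result()` is called on its future. If you want to know whether any group failed (for example because a file could not be read), a pattern like the following can be used. This is a sketch with hypothetical names (`run_all`, `ok`, `broken`), not the notebook's implementation.

```python
from concurrent import futures

def run_all(tasks, max_workers=4):
    """Run callables in a thread pool and return (results, errors)."""
    results, errors = [], []
    with futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        pending = [executor.submit(task) for task in tasks]
        for done in futures.as_completed(pending):
            try:
                # result() re-raises any exception from the worker thread.
                results.append(done.result())
            except Exception as exc:
                errors.append(exc)
    return results, errors

def ok():
    return "ok"

def broken():
    raise ValueError("unreadable file")

results, errors = run_all([ok, broken, ok])
print(len(results), "succeeded;", len(errors), "failed")  # 2 succeeded; 1 failed
```

Collecting errors instead of raising immediately lets the remaining groups finish, at the cost of having to inspect the error list afterwards.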


# Execution

## Run single file (for testing)

The same file is passed twice, so exactly one copy should be written to the output folder.

In [27]:
filter_equal_svgs([
    os.path.join(input_folder_path, "postman-icon.svg"),
    os.path.join(input_folder_path, "postman-icon.svg")
], output_folder_path)

## Run directory

In [10]:
process_directory(input_folder_path, output_folder_path)

Grouping 67680 input files by size...


  0%|          | 1/8731 [00:00<24:36,  5.91it/s]

Finished grouping. Got 8731 groups, start processing...
added all futures...


100%|██████████| 8731/8731 [14:15<00:00, 10.20it/s]  
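
To see how many files were dropped, the number of SVGs in the input and output folders can be compared after the run. The sketch below assumes the output folder was empty before the run (the notebook does not clear it); `count_svgs` and `report_removed` are hypothetical helpers, demonstrated here on temporary folders instead of the real data set.

```python
import glob
import os
import tempfile

def count_svgs(folder: str) -> int:
    """Count the .svg files directly inside `folder`."""
    return len(glob.glob(os.path.join(folder, "*.svg")))

def report_removed(input_folder: str, output_folder: str) -> int:
    """Print and return how many input files have no copy in the output."""
    before = count_svgs(input_folder)
    after = count_svgs(output_folder)
    print(f"{before} input files, {after} kept, {before - after} dropped")
    return before - after

# Demonstration: three input files, one of which was kept.
with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
    for name in ("a.svg", "b.svg", "c.svg"):
        with open(os.path.join(src, name), "w") as f:
            f.write("<svg/>")
    with open(os.path.join(dst, "a.svg"), "w") as f:
        f.write("<svg/>")
    removed = report_removed(src, dst)
```

With the notebook's variables, the call would be `report_removed(input_folder_path, output_folder_path)`.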
