# Interacting with Hugging Face (HF)

### HF Limitations for our datasets
To upload our datasets to HF for easy distribution, we use a non-trivial approach, because most of our datasets are very large and HF imposes limits:
- No more than 10k files per folder
- No more than 100k files in total
- Loading datasets in COCO format is not really supported
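
To see whether a local dataset would hit the limits listed above, a small check like the following can be run before uploading. This is only a sketch: the helper names `count_files` and `check_limits` are made up for illustration, and the two constants simply restate the numbers quoted in the list.

```python
import os
from collections import Counter

MAX_FILES_PER_FOLDER = 10_000   # per-folder limit as quoted above
MAX_FILES_TOTAL = 100_000       # total limit as quoted above


def count_files(root):
    """Return (files per folder, total number of files) for everything under root."""
    per_folder = Counter()
    for folder, _dirs, files in os.walk(root):
        per_folder[folder] = len(files)
    return per_folder, sum(per_folder.values())


def check_limits(root):
    """Return a list of human-readable limit violations (empty list means OK)."""
    per_folder, total = count_files(root)
    problems = [
        f"{folder}: {n} files (limit {MAX_FILES_PER_FOLDER})"
        for folder, n in per_folder.items()
        if n > MAX_FILES_PER_FOLDER
    ]
    if total > MAX_FILES_TOTAL:
        problems.append(f"total: {total} files (limit {MAX_FILES_TOTAL})")
    return problems
```

Calling `check_limits` on an unzipped dataset folder returns the folders that would need zipping; after zipping, each such folder counts as a single file.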
  
### Our approach
Because of these limits, we zip all image and label files in their respective folders when uploading and unzip them when downloading. Follow this notebook, as it guides you through the process.

Note: Normal Git usage with HF is not really supported, because change tracking doesn't work once we zip/unzip.
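
The zip/unzip round trip described above can be tried on throwaway data first. The sketch below uses only the standard library; `zip_dir`, `unzip_to_sibling`, and `list_files` are hypothetical helper names, not functions from this notebook, and the demo folder is a temporary directory rather than a real dataset.

```python
import os
import shutil
import tempfile
import zipfile


def zip_dir(src_dir, zip_path):
    """Write every file under src_dir into zip_path, using relative paths."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(src_dir):
            for name in files:
                full = os.path.join(root, name)
                zf.write(full, os.path.relpath(full, src_dir))


def unzip_to_sibling(zip_path):
    """Extract zip_path into a folder named after the archive; return that folder."""
    target = os.path.splitext(zip_path)[0]
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(target)
    return target


def list_files(root):
    """Sorted relative paths of all files under root."""
    return sorted(
        os.path.relpath(os.path.join(folder, name), root)
        for folder, _dirs, files in os.walk(root)
        for name in files
    )


# Demo: several small files become one archive and come back intact.
with tempfile.TemporaryDirectory() as tmp:
    train = os.path.join(tmp, "train")
    os.makedirs(train)
    for i in range(5):
        with open(os.path.join(train, f"img_{i}.txt"), "w") as fh:
            fh.write(str(i))
    before = list_files(train)
    zip_dir(train, train + ".zip")
    shutil.rmtree(train)  # simulate a fresh machine that only has the archive
    restored = unzip_to_sibling(train + ".zip")
    roundtrip_ok = list_files(restored) == before
```

Note that empty directories are not stored in the archive by this approach, since only files are written.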

## Imports

In [1]:
import os
import zipfile
from huggingface_hub import HfApi, HfFolder
import shutil



## Login to HF
The cache is not used directly by our method, but we set it just in case.

In [2]:
# HF Cache
os.environ["HF_HOME"] = "../../.cache"
!echo $HF_HOME
!huggingface-cli whoami

../../.cache


TorgeSchwark
orgs: Basket-AEye


## Settings

In [5]:
DATASET_DIR = "ai_shelf/sd/10classes" # Relative to the local huggingface/ folder in the repo
HF_REPO = "Basket-AEye/ai_shelf10"

# Upload to HF

We only zip subdirectories, so any .txt or .json files at the first level are not zipped and can be used for HF settings and similar metadata.
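
Before uploading anything, it can help to preview which folders would be zipped and which files would be uploaded as-is. The following dry-run sketch mirrors the folder rule used by the upload cell (a folder is zipped as soon as it directly contains an image, .txt, or .json file) but never contacts the Hub. `plan_uploads` and `ZIP_TRIGGER_EXTS` are illustrative names, not part of the notebook.

```python
import os

# Same extensions the upload cell uses to decide that a folder gets zipped.
ZIP_TRIGGER_EXTS = (".png", ".jpg", ".jpeg", ".txt", ".json")


def plan_uploads(folder_path, repo_prefix=""):
    """Return (action, local_path, repo_path) tuples without touching the Hub."""
    name = os.path.basename(folder_path)
    entries = sorted(os.listdir(folder_path))
    if any(entry.endswith(ZIP_TRIGGER_EXTS) for entry in entries):
        # The whole folder would become a single archive in the repo.
        return [("zip", folder_path, f"{repo_prefix}/{name}.zip")]
    plan = []
    for entry in entries:
        path = os.path.join(folder_path, entry)
        if os.path.isdir(path):
            plan.extend(plan_uploads(path, f"{repo_prefix}/{name}"))
        elif os.path.isfile(path):
            plan.append(("file", path, f"{repo_prefix}/{name}/{entry}"))
    return plan
```

Printing the returned tuples for each first-level folder gives a quick overview of the resulting repo layout.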

In [4]:
created_zips = []

# ----- Functions -----
def zip_folder(folder_path, output_zip):
    """Zips the contents of a folder."""
    with zipfile.ZipFile(output_zip, 'w', zipfile.ZIP_DEFLATED) as zipf:
        for root, dirs, files in os.walk(folder_path):
            for file in files:
                file_path = os.path.join(root, file)
                arcname = os.path.relpath(file_path, folder_path)
                zipf.write(file_path, arcname)
    print(f"Folder zipped to {output_zip}")
    created_zips.append(output_zip)

def upload_to_hf(path, repo_path):
    """Uploads a file to Hugging Face Hub."""
    api = HfApi()
    api.upload_file(
        path_or_fileobj=path,
        path_in_repo=repo_path,
        repo_id=HF_REPO,
        repo_type="dataset"
    )
    print(f"File {repo_path} uploaded to {HF_REPO}")

complete_dir = os.path.join("../../huggingface/", DATASET_DIR)

# Go through each folder in the dataset directory
def process_folder(folder_path, repo_prefix):
    """Processes a folder, zipping and uploading its contents recursively."""
    contains_image_or_text = any(
        file.endswith(('.png', '.jpg', '.jpeg', '.txt', '.json'))
        for file in os.listdir(folder_path)
    )
    if contains_image_or_text:
        zip_path = f"{folder_path}.zip"
        zip_folder(folder_path, zip_path)
        upload_to_hf(zip_path, repo_prefix + f"/{os.path.basename(folder_path)}.zip")
    else:
        for item in os.listdir(folder_path):
            item_path = os.path.join(folder_path, item)
            if os.path.isdir(item_path):
                process_folder(item_path, repo_prefix + f"/{os.path.basename(folder_path)}")
            elif os.path.isfile(item_path):
                upload_to_hf(item_path, repo_prefix + f"/{os.path.basename(folder_path)}/{item}")

# ----- Main -----
if not os.path.exists(complete_dir):
    # Raise an error rather than calling exit() inside a notebook
    raise FileNotFoundError(f"Directory {complete_dir} does not exist.")

# Get a list of all folders in the complete directory
folder_paths = [os.path.join(complete_dir, item) for item in os.listdir(complete_dir) if os.path.isdir(os.path.join(complete_dir, item))]

# Execute process_folder for each folder in the list
for folder_path in folder_paths:
    process_folder(folder_path, "")


Folder zipped to ../../huggingface/ai_shelf/artificial_created_dataset/images/train.zip


train.zip: 100%|██████████| 137M/137M [00:09<00:00, 15.0MB/s]   


File /images/train.zip uploaded to Basket-AEye/first_artificial
Folder zipped to ../../huggingface/ai_shelf/artificial_created_dataset/images/val.zip


val.zip: 100%|██████████| 41.1M/41.1M [00:01<00:00, 35.1MB/s]


File /images/val.zip uploaded to Basket-AEye/first_artificial
Folder zipped to ../../huggingface/ai_shelf/artificial_created_dataset/images/test.zip


test.zip: 100%|██████████| 13.4M/13.4M [00:00<00:00, 37.3MB/s]


File /images/test.zip uploaded to Basket-AEye/first_artificial
Folder zipped to ../../huggingface/ai_shelf/artificial_created_dataset/labels/train.zip


train.zip: 100%|██████████| 195k/195k [00:00<00:00, 753kB/s]


File /labels/train.zip uploaded to Basket-AEye/first_artificial
Folder zipped to ../../huggingface/ai_shelf/artificial_created_dataset/labels/val.zip


val.zip: 100%|██████████| 58.6k/58.6k [00:00<00:00, 290kB/s]


File /labels/val.zip uploaded to Basket-AEye/first_artificial
Folder zipped to ../../huggingface/ai_shelf/artificial_created_dataset/labels/test.zip


test.zip: 100%|██████████| 18.8k/18.8k [00:00<00:00, 114kB/s]


File /labels/test.zip uploaded to Basket-AEye/first_artificial


### Remove zip files

In [5]:
for zip_file in created_zips:
    if os.path.exists(zip_file):
        os.remove(zip_file)
        print(f"Removed {zip_file}")
    else:
        print(f"{zip_file} does not exist.")
created_zips.clear()

Removed ../../huggingface/ai_shelf/artificial_created_dataset/images/train.zip
Removed ../../huggingface/ai_shelf/artificial_created_dataset/images/val.zip
Removed ../../huggingface/ai_shelf/artificial_created_dataset/images/test.zip
Removed ../../huggingface/ai_shelf/artificial_created_dataset/labels/train.zip
Removed ../../huggingface/ai_shelf/artificial_created_dataset/labels/val.zip
Removed ../../huggingface/ai_shelf/artificial_created_dataset/labels/test.zip


# Download from HF

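This process downloads the entire repo into the local directory and unzips every archive. To start from a clean state, uncomment the `shutil.rmtree` line so the existing local directory is removed first.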
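
Once the download and unzip step has run, a quick sanity check can confirm that every archive was fully extracted. The sketch below assumes the layout produced by this notebook, where each `name.zip` is extracted into a sibling folder `name`; the function name `missing_after_unzip` is hypothetical.

```python
import os
import zipfile


def missing_after_unzip(root):
    """Map each zip under root to the archive members missing from its sibling folder."""
    missing = {}
    for folder, _dirs, files in os.walk(root):
        for name in files:
            if not name.endswith(".zip"):
                continue
            zip_path = os.path.join(folder, name)
            extract_dir = os.path.splitext(zip_path)[0]
            with zipfile.ZipFile(zip_path) as zf:
                # Skip directory entries; only files are compared.
                members = [m for m in zf.namelist() if not m.endswith("/")]
            absent = [
                m for m in members
                if not os.path.isfile(os.path.join(extract_dir, *m.split("/")))
            ]
            if absent:
                missing[zip_path] = absent
    return missing
```

An empty dictionary means every file listed in every archive exists on disk. The check only works while the zip files are still present.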

In [6]:
from huggingface_hub import snapshot_download

complete_dir = os.path.join("../../huggingface/", DATASET_DIR)

# Check if the path exists (uncomment rmtree to delete it before downloading)
if os.path.exists(complete_dir):
    #shutil.rmtree(complete_dir)
    print(f"Existing directory: {complete_dir}")

# Download the repo from HF using snapshot_download
snapshot_path = snapshot_download(
    repo_id=HF_REPO,
    repo_type="dataset",
    local_dir=complete_dir
)
print(f"Downloaded snapshot of repo {HF_REPO} to {snapshot_path}")
# Unzip all zip files recursively and put files into a folder with the name of the zip
for root, dirs, files in os.walk(snapshot_path):
    for file in files:
        if file.endswith('.zip'):
            zip_path = os.path.join(root, file)
            folder_name = os.path.splitext(file)[0]  # Get the name of the zip file without extension
            extract_path = os.path.join(root, folder_name)  # Create a folder with the name of the zip
            os.makedirs(extract_path, exist_ok=True)  # Ensure the folder exists
            with zipfile.ZipFile(zip_path, 'r') as zip_ref:
                zip_ref.extractall(extract_path)  # Extract files into the folder
            # os.remove(zip_path)  # Uncomment to save space (the zips are then no longer kept for caching)
            print(f"Unzipped {zip_path} into {extract_path}")


Existing directory: ../../huggingface/ai_shelf/sd/10classes


Fetching 12 files: 100%|██████████| 12/12 [00:03<00:00,  3.12it/s]


Downloaded snapshot of repo Basket-AEye/ai_shelf10 to /data22/stu236894/GitRepos/TinyML-MT/huggingface/ai_shelf/sd/10classes
Unzipped /data22/stu236894/GitRepos/TinyML-MT/huggingface/ai_shelf/sd/10classes/coffee.zip into /data22/stu236894/GitRepos/TinyML-MT/huggingface/ai_shelf/sd/10classes/coffee
Unzipped /data22/stu236894/GitRepos/TinyML-MT/huggingface/ai_shelf/sd/10classes/apple.zip into /data22/stu236894/GitRepos/TinyML-MT/huggingface/ai_shelf/sd/10classes/apple
Unzipped /data22/stu236894/GitRepos/TinyML-MT/huggingface/ai_shelf/sd/10classes/banana.zip into /data22/stu236894/GitRepos/TinyML-MT/huggingface/ai_shelf/sd/10classes/banana
Unzipped /data22/stu236894/GitRepos/TinyML-MT/huggingface/ai_shelf/sd/10classes/oatmeal.zip into /data22/stu236894/GitRepos/TinyML-MT/huggingface/ai_shelf/sd/10classes/oatmeal
Unzipped /data22/stu236894/GitRepos/TinyML-MT/huggingface/ai_shelf/sd/10classes/lemon.zip into /data22/stu236894/GitRepos/TinyML-MT/huggingface/ai_shelf/sd/10classes/lemon
Unzippe

## Remove zip files (No caching)
When downloading, we keep the zip files as a cache: the local zip is compared with the remote one and is not re-downloaded if they are identical (this takes more storage, though).
If you don't want this, run the removal cell (`In [7]`) below.
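
Before running the removal cell (`In [7]`), it may be useful to know how much storage the kept archives actually occupy. A minimal sketch, with `zip_storage_bytes` as an illustrative helper name:

```python
import os


def zip_storage_bytes(root):
    """Sum the on-disk size (in bytes) of all .zip files under root."""
    total = 0
    for folder, _dirs, files in os.walk(root):
        for name in files:
            if name.endswith(".zip"):
                total += os.path.getsize(os.path.join(folder, name))
    return total
```

For example, `zip_storage_bytes(complete_dir) / 1e6` gives the size in megabytes, which can be weighed against the time a full re-download would take.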

In [7]:
complete_dir = os.path.join("../../huggingface/", DATASET_DIR)

for root, dirs, files in os.walk(complete_dir):
    for file in files:
        if file.endswith('.zip'):
            zip_path = os.path.join(root, file)
            os.remove(zip_path)
            print(f"Removed {zip_path}")

Removed ../../huggingface/ai_shelf/sd/10classes/coffee.zip
Removed ../../huggingface/ai_shelf/sd/10classes/apple.zip
Removed ../../huggingface/ai_shelf/sd/10classes/banana.zip
Removed ../../huggingface/ai_shelf/sd/10classes/oatmeal.zip
Removed ../../huggingface/ai_shelf/sd/10classes/lemon.zip
Removed ../../huggingface/ai_shelf/sd/10classes/pasta.zip
Removed ../../huggingface/ai_shelf/sd/10classes/fruit tea.zip
Removed ../../huggingface/ai_shelf/sd/10classes/tomato sauce.zip
Removed ../../huggingface/ai_shelf/sd/10classes/avocado.zip
Removed ../../huggingface/ai_shelf/sd/10classes/cucumber.zip
