## Download a dataset from Hugging Face - Rémi Ançay - 2025

This notebook downloads an audio dataset from Hugging Face and saves it to disk. It uses the `datasets` library to load the dataset, then copies each audio file into a specified folder, with one subfolder per split and per label.

Configure the dataset name and the save folder path below, then run the cells to download and save the dataset.

In [1]:
import os
import shutil
from datasets import load_dataset
from tqdm import tqdm

In [None]:
# Choose the output directory
OUT_DIR = "./Datasets/RawDownload/barkopedia_individual_datasets"
os.makedirs(OUT_DIR, exist_ok=True)

In [None]:
# Load the dataset from Hugging Face.
ds = load_dataset(
    "ArlingtonCL2/Barkopedia_Individual_Dog_Recognition_Dataset"
)

# This call may fail with HTTP error 429 (too many requests). If it does, retry a few times.
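
Rather than re-running the cell by hand, the retries could be automated. The following is a minimal sketch, not part of the original notebook: `retry_with_backoff` is a hypothetical helper, and the attempt count and delays are arbitrary assumptions. It retries on any exception, which is broader than HTTP 429 alone.

```python
import time

def retry_with_backoff(fn, max_attempts=5, base_delay=2.0, sleep=time.sleep):
    """Call fn(); on failure, wait base_delay * 2**attempt seconds and try again."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:  # assumption: any exception is worth retrying
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * (2 ** attempt)
            print(f"Attempt {attempt + 1} failed ({exc}); retrying in {delay:.0f}s")
            sleep(delay)

# Possible usage in this notebook (not executed here):
# ds = retry_with_backoff(
#     lambda: load_dataset("ArlingtonCL2/Barkopedia_Individual_Dog_Recognition_Dataset")
# )
```

The `sleep` parameter exists only so the waiting can be replaced in tests; in normal use the default `time.sleep` applies.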

In [16]:
ds

DatasetDict({
    train: Dataset({
        features: ['audio', 'label'],
        num_rows: 7137
    })
    validation: Dataset({
        features: ['audio', 'label'],
        num_rows: 709
    })
})

In [19]:
label_names = ds["train"].features["label"].names  # list of label names
print(f"Labels: {label_names}")

Labels: ['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '2', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '3', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '4', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '5', '50', '51', '52', '53', '54', '55', '56', '57', '58', '59', '6', '60', '7', '8', '9']
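
Note that the label names are ordered as strings, not as numbers, so class index 1 maps to the name '10' rather than '2'. The sketch below rebuilds the same list by hand instead of loading the dataset, and shows the mapping in both directions. `index_to_dog_id` and `dog_id_to_index` are hypothetical helper names, and reading each label name as a numeric dog ID is an assumption.

```python
# Rebuild the label list printed above: the strings '1'..'60' in lexicographic order.
label_names = sorted(str(i) for i in range(1, 61))

def index_to_dog_id(label_id):
    """Map a class index (0..59) to the number encoded in its label name."""
    return int(label_names[label_id])

def dog_id_to_index(dog_id):
    """Map a number (1..60) back to the class index of its label name."""
    return label_names.index(str(dog_id))

print(index_to_dog_id(1))   # class index 1 carries the label name '10'
print(dog_id_to_index(2))   # the label name '2' sits at class index 11
```

This matters when comparing predictions with the folder names written later in the notebook, since the folders are named after `label_names[label_id]` and not after the raw class index.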


In [21]:
for split in ds.keys():
    split_ds = ds[split]
    print(f"Processing split '{split}' with {len(split_ds)} examples.")

    directory = os.path.join(OUT_DIR, split)
    os.makedirs(directory, exist_ok=True)

    for idx, example in tqdm(enumerate(split_ds), total=len(split_ds), desc="Downloading"):
        # Accessing the audio field triggers the download into the cache
        audio_info = example["audio"]
        src_path = audio_info["path"]

        label_id = example["label"]
        label_name = label_names[label_id]

        # Create one subfolder per label
        label_dir = os.path.join(directory, label_name)
        os.makedirs(label_dir, exist_ok=True)

        # Generate a unique filename: {label}_{idx:04d}.wav
        dst_filename = f"{label_name}_{idx:04d}.wav"
        dst_path = os.path.join(label_dir, dst_filename)

        # Copy the file from the HF cache to the output folder
        shutil.copyfile(src_path, dst_path)

print(f"All files have been copied to: {OUT_DIR}")

Processing split 'train' with 7137 examples.

Downloading: 100%|██████████| 7137/7137 [00:20<00:00, 347.13it/s]

Processing split 'validation' with 709 examples.

Downloading: 100%|██████████| 709/709 [00:01<00:00, 561.35it/s]

All files have been copied to: ./Datasets/RawDownload/barkopedia_individual_datasets
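
Once the copy has finished, it can be useful to check that each label folder received the expected number of files. The following is a minimal sketch, not part of the original notebook: `count_wavs_per_label` is a hypothetical helper, and the demo runs on a throwaway directory with empty placeholder files that only mimic the `{split}/{label}/{label}_{idx:04d}.wav` layout.

```python
import tempfile
from collections import Counter
from pathlib import Path

def count_wavs_per_label(split_dir):
    """Return {label_folder_name: number_of_wav_files} for one split directory."""
    counts = Counter()
    for wav_path in Path(split_dir).glob("*/*.wav"):
        counts[wav_path.parent.name] += 1
    return dict(counts)

# Demo on a temporary directory with made-up, empty files (not real audio).
with tempfile.TemporaryDirectory() as tmp:
    for label, n_files in [("1", 2), ("10", 1)]:
        label_dir = Path(tmp) / "train" / label
        label_dir.mkdir(parents=True)
        for idx in range(n_files):
            (label_dir / f"{label}_{idx:04d}.wav").touch()
    demo_counts = count_wavs_per_label(Path(tmp) / "train")

print(demo_counts)
```

In the notebook itself, calling `count_wavs_per_label(os.path.join(OUT_DIR, "train"))` and summing the values should give the number of rows reported for that split, assuming the output folder was empty before the run.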



