# Upload the BEANS esc50 recordings to HF

#### 1. Download the recordings the way BEANS does it on their [GitHub](https://github.com/earthspecies/beans)

First we navigate into the mounted `data_birdset` folder, where the temporary files from their repo will be downloaded, and install `wget` and `unzip`, as they are not available in the university's bash environment.

In [1]:
%cd '../../../../../data_birdset/beans'
!pwd
!sudo apt install wget
!sudo apt install unzip

/workspace/data_birdset/beans
/workspace/data_birdset/beans


Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
wget is already the newest version (1.21.2-2ubuntu1).
0 upgraded, 0 newly installed, 0 to remove and 49 not upgraded.
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
unzip is already the newest version (6.0-26ubuntu3.2).
0 upgraded, 0 newly installed, 0 to remove and 49 not upgraded.


Then we run their script to download the metadata and recordings, so that we end up with the same splits as BEANS.

In [2]:
# Their script:
import pandas as pd
from plumbum import local

target_dir = "data/esc50"

git = local["git"]
git["clone", "https://github.com/karolpiczak/ESC-50.git", target_dir]()

df = pd.read_csv(f"{target_dir}/meta/esc50.csv")


def convert(row):
    new_row = pd.Series(
        {
            "path": f"data/esc50/audio/{row['filename']}",
            "label": row["target"],
            "fold": row["fold"],
        }
    )

    return new_row


df = df.apply(convert, axis=1)


def _get_fold(row):
    return int(row["fold"])
    # return int(row['filename'][0])


df_train = df[df.apply(lambda r: _get_fold(r) <= 3, axis=1)]
df_train_low = df[df.apply(lambda r: _get_fold(r) == 1, axis=1)]
df_valid = df[df.apply(lambda r: _get_fold(r) == 4, axis=1)]
df_test = df[df.apply(lambda r: _get_fold(r) == 5, axis=1)]

df_train.to_csv(f"{target_dir}/meta/esc50.train.csv")
df_train_low.to_csv(f"{target_dir}/meta/esc50.train-low.csv")
df_valid.to_csv(f"{target_dir}/meta/esc50.valid.csv")
df_test.to_csv(f"{target_dir}/meta/esc50.test.csv")
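
The fold-to-split rule used by the script above can be sketched in plain Python. This is only an illustration and not part of the BEANS script: the helper names `fold_of` and `splits_for` are hypothetical, and it assumes the ESC-50 file naming `{fold}-{clip_id}-{take}-{target}.wav`, which is also what the commented-out line in `_get_fold` relies on.

```python
def fold_of(filename):
    """Return the fold number encoded as the first field of the file name."""
    return int(filename.split("-")[0])


def splits_for(fold):
    """Map a fold number to the split names used above (hypothetical helper)."""
    names = []
    if fold <= 3:
        names.append("train")
    if fold == 1:
        names.append("train-low")
    if fold == 4:
        names.append("valid")
    if fold == 5:
        names.append("test")
    return names


# File names taken from the printed examples in this notebook.
for name in ["1-100032-A-0.wav", "4-102844-A-49.wav", "5-103415-A-2.wav"]:
    print(name, splits_for(fold_of(name)))
```

Note that `train-low` contains only fold 1 and is therefore a subset of `train`, so the two splits overlap by design.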

#### 2. Convert to HF format

In [3]:
from datasets import Dataset, Audio
import pandas as pd


def load_split(split_name):
    """Load one split's metadata CSV and decode the `path` column as audio."""
    df = pd.read_csv(f"data/esc50/meta/esc50.{split_name}.csv")
    dataset = Dataset.from_pandas(df)
    dataset = dataset.cast_column("path", Audio())
    return dataset


splits = ["train", "train-low", "valid", "test"]
datasets = {split: load_split(split) for split in splits}
# Rename the split from train-low to train_low, as HF does not accept "-" in split names
datasets["train_low"] = datasets.pop("train-low")
# Print the first example of each split as a quick sanity check
for split, dataset in datasets.items():
    print(dataset[0])
{'Unnamed: 0': 0, 'path': {'path': 'data/esc50/audio/1-100032-A-0.wav', 'array': array([0., 0., 0., ..., 0., 0., 0.]), 'sample_rate': 44100}, 'label': 0, 'fold': 1}
{'Unnamed: 0': 1200, 'path': {'path': 'data/esc50/audio/4-102844-A-49.wav', 'array': array([ 0.0302124 ,  0.0458374 ,  0.05664062, ..., -0.04403687,
       -0.05728149, -0.05892944]), 'sample_rate': 44100}, 'label': 49, 'fold': 4}
{'Unnamed: 0': 1600, 'path': {'path': 'data/esc50/audio/5-103415-A-2.wav', 'array': array([0.16473389, 0.17315674, 0.17971802, ..., 0.26345825, 0.1300354 ,
       0.03866577]), 'sample_rate': 44100}, 'label': 2, 'fold': 5}
{'Unnamed: 0': 0, 'path': {'path': 'data/esc50/audio/1-100032-A-0.wav', 'array': array([0., 0., 0., ..., 0., 0., 0.]), 'sample_rate': 44100}, 'label': 0, 'fold': 1}
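
The rename of `train-low` can be generalised with a small helper. This is a minimal sketch, assuming that valid split names consist only of letters, digits, and underscores, optionally separated by dots; the pattern below is an assumption about the rule HF enforces, and `to_hf_split_name` is a hypothetical name.

```python
import re

# Assumed pattern for valid split names: word characters, optionally dot-separated.
VALID_SPLIT = re.compile(r"^\w+(\.\w+)*$")


def to_hf_split_name(name):
    """Replace hyphens with underscores and check the result (hypothetical helper)."""
    candidate = name.replace("-", "_")
    if not VALID_SPLIT.match(candidate):
        raise ValueError(f"Cannot turn {name!r} into a valid split name")
    return candidate


split_names = ["train", "train-low", "valid", "test"]
renamed = {name: to_hf_split_name(name) for name in split_names}
print(renamed)
```

Names that are already valid pass through unchanged, so the helper can be applied to every split before uploading.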


#### 3. Upload the datasets to HF

In [4]:
for split, dataset in datasets.items():
    dataset.push_to_hub("DBD-research-group/beans_esc50", split=split)


---
# Download the dataset from HF

Download all splits from the Hub. Even when a single split is requested, all splits are still downloaded! Use `streaming=True` and `cache_dir='...'` for shorter loading times.
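
To read a single split without downloading everything first, streaming can be used. A minimal sketch, assuming the repository is reachable and the `datasets` package is installed; `stream_split` is a hypothetical helper, and the import is kept inside the function so that defining it has no side effects.

```python
def stream_split(split="train_low", repo_id="DBD-research-group/beans_esc50", cache_dir=None):
    """Return a lazily loaded split; examples are fetched while iterating."""
    # Imported lazily; requires the `datasets` package.
    from datasets import load_dataset

    return load_dataset(repo_id, split=split, streaming=True, cache_dir=cache_dir)


# Usage (needs network access), left commented out:
# first_example = next(iter(stream_split("train_low")))
# print(first_example["label"])
```

With `streaming=True` the call returns an iterable dataset rather than a `DatasetDict`, so `len(...)` and random access by index are not available; iterate over it instead.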

In [5]:
from datasets import load_dataset, DatasetDict

# dataset = load_dataset(path='DBD-research-group/beans_esc50', split='train_low')
dataset: DatasetDict = load_dataset(
    name="default", path="DBD-research-group/beans_esc50"
)

Downloading readme:   0%|          | 0.00/1.24k [00:00<?, ?B/s]

Downloading data: 100%|██████████| 399M/399M [00:39<00:00, 10.0MB/s] 
Downloading data: 100%|██████████| 399M/399M [00:37<00:00, 10.7MB/s] 
Downloading data: 100%|██████████| 402M/402M [00:35<00:00, 11.3MB/s] 
Downloading data: 100%|██████████| 396M/396M [00:45<00:00, 8.66MB/s] 
Downloading data: 100%|██████████| 406M/406M [00:45<00:00, 8.84MB/s] 
Downloading data: 100%|██████████| 397M/397M [00:40<00:00, 9.72MB/s] 
Downloading data: 100%|██████████| 287M/287M [00:25<00:00, 11.4MB/s] 
Downloading data: 100%|██████████| 313M/313M [00:29<00:00, 10.5MB/s] 
Downloading data: 100%|██████████| 24.7M/24.7M [00:02<00:00, 9.70MB/s]


Generating train split:   0%|          | 0/84843 [00:00<?, ? examples/s]

Generating valid split:   0%|          | 0/9981 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/11005 [00:00<?, ? examples/s]

Generating train_low split:   0%|          | 0/849 [00:00<?, ? examples/s]

In [6]:
# print number of samples and number of distinct classes
print(f"Number of samples: {len(dataset['train'])}")
print(f"Number of distinct classes: {len(dataset['train'].unique('label'))}")
dataset

Number of samples: 84843
Number of distinct classes: 35


DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'path', 'label'],
        num_rows: 84843
    })
    valid: Dataset({
        features: ['Unnamed: 0', 'path', 'label'],
        num_rows: 9981
    })
    test: Dataset({
        features: ['Unnamed: 0', 'path', 'label'],
        num_rows: 11005
    })
    train_low: Dataset({
        features: ['Unnamed: 0', 'path', 'label'],
        num_rows: 849
    })
})