# Upload the BEANS cbi recordings to HF

#### 1. Download the recordings the way BEANS does it on their [GitHub](https://github.com/earthspecies/beans)

First, we navigate into the mounted `data_birdset` folder, where the temporary files from their repo will be downloaded, and install `wget` and `unzip`, since they may not be preinstalled in the university's bash environment.

In [1]:
%cd '../../../../data_birdset/beans'
!pwd
!sudo apt install wget
!sudo apt install unzip

/workspace/data_birdset/beans
/workspace/data_birdset/beans


Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
wget is already the newest version (1.21.2-2ubuntu1).
0 upgraded, 0 newly installed, 0 to remove and 49 not upgraded.
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
unzip is already the newest version (6.0-26ubuntu3.2).
0 upgraded, 0 newly installed, 0 to remove and 49 not upgraded.
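
Before running the download script, it can help to confirm that the command-line tools it relies on are actually on the `PATH`. The following is a minimal sketch using only the standard library; the helper name `missing_tools` and the tool list are assumptions read off the cells in this notebook (`wget` and `unzip` from the install cell, `sox` and `kaggle` from the script below), not part of the BEANS code.

```python
import shutil

# Assumed tool list: wget/unzip from the install cell above,
# sox/kaggle because the download script shells out to them.
REQUIRED_TOOLS = ["wget", "unzip", "sox", "kaggle"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print("Missing command-line tools:", ", ".join(missing))
else:
    print("All required command-line tools are available.")
```

If `kaggle` is reported as missing, the download step in the next cell would fail before any audio is fetched.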


Then we run their script to download the metadata and recordings, so that we end up with the same train/valid/test splits as BEANS.

In [2]:
# Their script:
import sys
import pandas as pd
from plumbum import local, FG
from pathlib import Path
import random

sox = local['sox']
local['mkdir']['-p', 'data/cbi/wav']()
local['kaggle']['competitions', 'download', '-p', 'data/cbi', 'birdsong-recognition'] & FG
local['unzip']['data/cbi/birdsong-recognition.zip', '-d', 'data/cbi/'] & FG

random.seed(42)
df = pd.read_csv('data/cbi/train.csv')
all_recordist = df.recordist.unique()

random.shuffle(all_recordist)
train_recordist = set(all_recordist[:int(.6 * len(all_recordist))])
valid_recordist = set(all_recordist[int(.6 * len(all_recordist)):int(.8 * len(all_recordist))])
test_recordist = set(all_recordist[int(.8 * len(all_recordist)):])

def convert(row):
    if row['recordist'] in train_recordist:
        split = 'train'
    elif row['recordist'] in valid_recordist:
        split = 'valid'
    else:
        split = 'test'

    src_file = Path('data/cbi/train_audio') / row['ebird_code'] / row['filename']
    tgt_file = Path('data/cbi/wav') / (Path(row['filename']).stem + '.wav')
    print(f'Converting {src_file} ...', file=sys.stderr)

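    # Resample to 44.1 kHz (-r 44100), run sox in repeatable mode (-R),
    # mix all channels down to mono (remix -) and keep the first 10 s (trim 0 10).
    sox[src_file, '-r', '44100', '-R', tgt_file, 'remix', '-', 'trim', '0', '10']()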

    new_row = pd.Series({
        'path': tgt_file,
        'label': row['ebird_code'],
        'split': split
    })

    return new_row

df = df.apply(convert, axis=1)
df[df.split == 'train'].to_csv('data/cbi/annotations.train.csv')
df[df.split == 'valid'].to_csv('data/cbi/annotations.valid.csv')
df[df.split == 'test'].to_csv('data/cbi/annotations.test.csv')

Downloading birdsong-recognition.zip to data/cbi


100%|██████████| 22.1G/22.1G [12:57<00:00, 30.5MB/s]  



Archive:  data/cbi/birdsong-recognition.zip
  inflating: data/cbi/example_test_audio/BLKFR-10-CPL_20190611_093000.pt540.mp3  
  inflating: data/cbi/example_test_audio/ORANGE-7-CAP_20190606_093000.pt623.mp3  
  inflating: data/cbi/example_test_audio_metadata.csv  
  inflating: data/cbi/example_test_audio_summary.csv  
  inflating: data/cbi/sample_submission.csv  
  inflating: data/cbi/test.csv       
  inflating: data/cbi/train.csv      
  inflating: data/cbi/train_audio/aldfly/XC134874.mp3  
  inflating: data/cbi/train_audio/aldfly/XC135454.mp3  
  inflating: data/cbi/train_audio/aldfly/XC135455.mp3  
  inflating: data/cbi/train_audio/aldfly/XC135456.mp3  
  inflating: data/cbi/train_audio/aldfly/XC135457.mp3  
  inflating: data/cbi/train_audio/aldfly/XC135459.mp3  
  inflating: data/cbi/train_audio/aldfly/XC135460.mp3  
  inflating: data/cbi/train_audio/aldfly/XC135883.mp3  
  inflating: data/cbi/train_audio/aldfly/XC137570.mp3  
  inflating: data/cbi/train_audio/aldfly/XC138639.mp3 

Converting data/cbi/train_audio/aldfly/XC134874.mp3 ...
Converting data/cbi/train_audio/aldfly/XC135454.mp3 ...
Converting data/cbi/train_audio/aldfly/XC135455.mp3 ...
Converting data/cbi/train_audio/aldfly/XC135456.mp3 ...
Converting data/cbi/train_audio/aldfly/XC135457.mp3 ...
Converting data/cbi/train_audio/aldfly/XC135459.mp3 ...
Converting data/cbi/train_audio/aldfly/XC135460.mp3 ...
Converting data/cbi/train_audio/aldfly/XC135883.mp3 ...
Converting data/cbi/train_audio/aldfly/XC137570.mp3 ...
Converting data/cbi/train_audio/aldfly/XC138639.mp3 ...
Converting data/cbi/train_audio/aldfly/XC139577.mp3 ...
Converting data/cbi/train_audio/aldfly/XC140298.mp3 ...
Converting data/cbi/train_audio/aldfly/XC142065.mp3 ...
Converting data/cbi/train_audio/aldfly/XC142066.mp3 ...
Converting data/cbi/train_audio/aldfly/XC142067.mp3 ...
Converting data/cbi/train_audio/aldfly/XC142068.mp3 ...
Converting data/cbi/train_audio/aldfly/XC142329.mp3 ...
Converting data/cbi/train_audio/aldfly/XC144672.
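
The script above assigns whole recordists, not individual recordings, to a split, so no recordist contributes to more than one of train, valid and test. The sketch below illustrates that idea on hypothetical toy data using only the standard library. The helper name `split_by_recordist` and the recordist names are made up for illustration, and because it sorts the names and uses its own `random.Random` instance, it does not reproduce the actual BEANS splits.

```python
import random


def split_by_recordist(recordists, seed=42):
    """Assign each unique recordist to train/valid/test in a 60/20/20 ratio."""
    # Sorting makes the toy example deterministic; the BEANS script instead
    # shuffles the order returned by df.recordist.unique().
    unique = sorted(set(recordists))
    rng = random.Random(seed)
    rng.shuffle(unique)

    n = len(unique)
    train = set(unique[:int(.6 * n)])
    valid = set(unique[int(.6 * n):int(.8 * n)])
    test = set(unique[int(.8 * n):])
    return train, valid, test


# Hypothetical toy data: 50 recordings made by 10 recordists.
recordists = [f"recordist_{i % 10}" for i in range(50)]
train, valid, test = split_by_recordist(recordists)

# Every recordist lands in exactly one split.
assert train.isdisjoint(valid)
assert train.isdisjoint(test)
assert valid.isdisjoint(test)
print(len(train), len(valid), len(test))
```

Splitting by recordist rather than by file means the split sizes in terms of recordings are only approximately 60/20/20, since recordists contribute different numbers of files.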

#### 2. Convert to HF format

In [5]:
from datasets import Dataset, Audio
import pandas as pd

def load_dataset(split_name):
    df = pd.read_csv(f'data/cbi/annotations.{split_name}.csv')
    dataset = Dataset.from_pandas(df)
    dataset = dataset.cast_column('path', Audio())
    return dataset

splits = ['train', 'valid', 'test']
datasets = {split: load_dataset(split) for split in splits}
for split, dataset in datasets.items():
    print(dataset[0])



{'Unnamed: 0': 1, 'path': {'path': 'data/cbi/wav/XC135454.wav', 'array': array([ 0.        ,  0.        ,  0.        , ..., -0.00674438,
       -0.00839233, -0.00601196]), 'sampling_rate': 44100}, 'label': 'aldfly', 'split': 'train'}
{'Unnamed: 0': 0, 'path': {'path': 'data/cbi/wav/XC134874.wav', 'array': array([ 0.        ,  0.        ,  0.        , ..., -0.04196167,
       -0.02774048,  0.01846313]), 'sampling_rate': 44100}, 'label': 'aldfly', 'split': 'valid'}
{'Unnamed: 0': 8, 'path': {'path': 'data/cbi/wav/XC137570.wav', 'array': array([0.        , 0.        , 0.        , ..., 0.0015564 , 0.00189209,
       0.00119019]), 'sampling_rate': 44100}, 'label': 'aldfly', 'split': 'test'}
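
The `'Unnamed: 0'` key in the examples above most likely comes from `to_csv` writing the DataFrame index as a nameless first column, which `read_csv` then reads back as a regular column. The sketch below reproduces this on a small in-memory frame with hypothetical file names and shows how `index_col=0` would avoid it; it is an illustration, not what was done for the uploaded dataset.

```python
import io

import pandas as pd

# Hypothetical rows with a non-contiguous index, as left behind after
# filtering a larger frame by split.
df = pd.DataFrame(
    {
        "path": ["data/cbi/wav/a.wav", "data/cbi/wav/b.wav"],
        "label": ["aldfly", "aldfly"],
        "split": ["train", "train"],
    },
    index=[1, 5],
)

# to_csv writes the index as a first column with an empty header.
buffer = io.StringIO()
df.to_csv(buffer)
csv_text = buffer.getvalue()

# Default read: the empty header becomes the column 'Unnamed: 0'.
with_index_col = pd.read_csv(io.StringIO(csv_text))

# index_col=0 restores that column as the index; resetting it afterwards
# should keep a later Dataset.from_pandas from carrying it along as a column.
without_index_col = pd.read_csv(io.StringIO(csv_text), index_col=0).reset_index(drop=True)

print(list(with_index_col.columns))
print(list(without_index_col.columns))
```

The column is harmless for training, so whether to drop it is mostly a matter of keeping the published schema tidy.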


#### 3. Upload the datasets to HF

In [6]:
for split, dataset in datasets.items():
    dataset.push_to_hub('DBD-research-group/beans_cbi', split=split)

Uploading the dataset shards:   0%|          | 0/24 [00:00<?, ?it/s]

Map:   0%|          | 0/592 [00:00<?, ? examples/s]

Creating parquet from Arrow format:   0%|          | 0/6 [00:00<?, ?ba/s]

Map:   0%|          | 0/591 [00:00<?, ? examples/s]

Creating parquet from Arrow format:   0%|          | 0/6 [00:00<?, ?ba/s]

Uploading the dataset shards:   0%|          | 0/6 [00:00<?, ?it/s]

Map:   0%|          | 0/592 [00:00<?, ? examples/s]

Creating parquet from Arrow format:   0%|          | 0/6 [00:00<?, ?ba/s]

Map:   0%|          | 0/591 [00:00<?, ? examples/s]

Creating parquet from Arrow format:   0%|          | 0/6 [00:00<?, ?ba/s]

README.md:   0%|          | 0.00/402 [00:00<?, ?B/s]

Uploading the dataset shards:   0%|          | 0/7 [00:00<?, ?it/s]

Map:   0%|          | 0/518 [00:00<?, ? examples/s]

Creating parquet from Arrow format:   0%|          | 0/6 [00:00<?, ?ba/s]

Map:   0%|          | 0/517 [00:00<?, ? examples/s]

Creating parquet from Arrow format:   0%|          | 0/6 [00:00<?, ?ba/s]

README.md:   0%|          | 0.00/510 [00:00<?, ?B/s]

---
# Download the dataset from HF

Download all splits from the Hub. Even when a single split is requested, `load_dataset` still downloads everything! Passing `streaming=True`, or a `cache_dir='...'` that already holds the data, shortens loading times.

In [8]:
from datasets import load_dataset, DatasetDict

#dataset = load_dataset(path='DBD-research-group/beans_cbi', split='train_low')
dataset: DatasetDict = load_dataset(name='default', path='DBD-research-group/beans_cbi')

Downloading readme:   0%|          | 0.00/615 [00:00<?, ?B/s]

KeyboardInterrupt: 
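
Since the full download above was interrupted, a streamed read is one way to inspect a few examples without waiting for every split. The following is a minimal sketch under assumptions: the helper name `stream_first_examples` is hypothetical, the split name `train` is taken from the upload step, and the function is only defined here, not called, because calling it needs network access and an audio decoding backend.

```python
def stream_first_examples(repo_id="DBD-research-group/beans_cbi", split="train", n=2):
    """Return the first `n` examples of one split using streaming mode."""
    # Imported lazily so that defining this helper does not require the library.
    from datasets import load_dataset

    # streaming=True returns an iterable dataset that fetches data on demand
    # instead of downloading all files up front.
    stream = load_dataset(repo_id, split=split, streaming=True)
    return list(stream.take(n))
```

A call such as `stream_first_examples(n=1)` would then return a list with a single example dictionary; for repeated full loads, passing `cache_dir='...'` to `load_dataset` instead lets later runs reuse the files downloaded by the first one.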

In [42]:
# print number of samples and number of distinct classes
print(f"Number of samples: {len(dataset['train'])}")
print(f"Number of distinct classes: {len(dataset['train'].unique('label'))}")
dataset

Number of samples: 1200
Number of distinct classes: 10
