### Readme
This notebook takes the recordings CSV of the HumBug DB (the metadata) and a directory containing all of the recordings. After you select the classes you want (and the task), it copies every recording listed in the filtered CSV into a directory named after its class.

For example, if you are interested in Ae. aegypti, it copies all Ae. aegypti files from the HumBug DB directory into a separate directory that holds only Ae. aegypti. This makes future work much easier and enables easier parallelism.

In [1]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path
import shutil
from tqdm import tqdm
import librosa

import scipy.signal
import soundfile as sf

### source/destination paths

In [2]:
#create a path object
source = Path('drive/MyDrive/train') #source of all HumBug DB recordings

destin = Path('drive/MyDrive/HumBug') #destination where you want to save the directories
classes = ['an arabiensis', 'culex pipiens complex', 'an funestus ss', 'ae aegypti', 'background']
#create a path object for each class. each class will have its own folder
paths  = {}
for c in classes:
  paths[c] = destin/c

#create the new directories
for c,path in paths.items():
  path.mkdir(parents=True, exist_ok=True)
  if not path.exists():
    print(f"Warning: {path} was not created.")

### Loading csv and selecting relevant classes

In [3]:
#load the df
df = pd.read_csv('humbugdb_zenodo_0_0_2.csv')
#only take the parts made for classification purposes
#(assumes the task is classification, just negate this if you want MED instead)
df = df[(df["country"] == "Tanzania") & (df['location_type'] == 'cup')]
#only take mosquitoes of our selected species, plus all background recordings
#(assumes you want background as well)
df = df[((df['species'].isin(classes)) & (df['sound_type'] == 'mosquito')) | (df['sound_type'] == 'background')]
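Before copying anything, it can help to check how many rows each class contributes after this filter. The following is a minimal sketch on a made-up toy DataFrame, not part of the original notebook: `label_rows` is a hypothetical helper, and only the column names `sound_type` and `species` are taken from the metadata csv.

```python
import pandas as pd

def label_rows(df, classes):
    """Apply the same species/background filter as above and add a 'cls' column.

    Hypothetical helper for illustration only.
    """
    is_background = df["sound_type"] == "background"
    is_wanted = (df["sound_type"] == "mosquito") & df["species"].isin(classes)
    out = df[is_wanted | is_background].copy()
    # background rows have no species, so label them explicitly
    out["cls"] = out["species"].where(out["sound_type"] != "background", "background")
    return out

# toy rows, invented for this example
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "sound_type": ["mosquito", "mosquito", "background", "mosquito"],
    "species": ["an arabiensis", "an gambiae", None, "ae aegypti"],
})
toy_classes = ["an arabiensis", "ae aegypti", "background"]

labelled = label_rows(toy, toy_classes)
counts = labelled["cls"].value_counts().to_dict()
print(counts)
```

On the real metadata, the same call would be `label_rows(df, classes)`; a class with a count of zero usually means its name does not match the spelling used in the `species` column.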

In [4]:
df.head(3)

Unnamed: 0,id,length,upload_time,name,sample_rate,sound_type,species,gender,fed,plurality,age,method,mic_type,device_type,country,district,province,place,location_type
1878,219949,65.097143,2021-05-23 21:44:00,IFA_17_24_664_background.wav,44100,background,,,,,,HBN,telinga,tascam,Tanzania,Kilombero District,Morogoro,Ifakara,cup
1882,221149,2.56,2021-05-23 21:44:00,IFA_17_26_666.wav,44100,mosquito,an arabiensis,Female,f,Single,,HBN,telinga,tascam,Tanzania,Kilombero District,Morogoro,Ifakara,cup
1883,221150,2.56,2021-05-23 21:44:00,IFA_17_26_666.wav,44100,mosquito,an arabiensis,Female,f,Single,,HBN,telinga,tascam,Tanzania,Kilombero District,Morogoro,Ifakara,cup


In [5]:
#loop over our df: for every file, find its designated folder and copy the original file there
for _, row in tqdm(df.iterrows(),
                   total=len(df),
                   desc='Copying audio files',
                   unit='file',
                   bar_format='{l_bar}{bar:40}{r_bar}'):
  #background rows have no species, so they go to the 'background' folder
  cls = 'background' if row['sound_type'] == 'background' else row['species']

  destination = paths[cls]
  filename    = str(row['id']) + '.wav'

  source_file      = source/filename
  destination_file = destination/filename

  if not source_file.exists():
    print(f"Missing source file: {filename}")
    continue

  #before copying, check if it exists first in the destination folder
  if not destination_file.exists():
    shutil.copy2(source_file, destination_file)

print("All files in df are now in the HumBug files.\n- Frost.")

Copying audio files: 100%|████████████████████████████████████████| 1937/1937 [11:11<00:00,  2.89file/s]
All files in df are now in the HumBug files.
- Frost.
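
Since missing source files are only printed and skipped, a quick count of the files in each class directory is one way to confirm the copy. The sketch below is an assumption-laden illustration rather than part of the original notebook: `count_wavs` is a hypothetical helper, and the demo runs in a temporary directory with empty placeholder files.

```python
import tempfile
from pathlib import Path

def count_wavs(class_dirs):
    """Count the .wav files in each class directory (hypothetical helper)."""
    return {name: sum(1 for _ in path.glob("*.wav"))
            for name, path in class_dirs.items()}

# toy demonstration with empty placeholder files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    demo_paths = {c: Path(tmp) / c for c in ["ae aegypti", "background"]}
    for p in demo_paths.values():
        p.mkdir(parents=True)
    (demo_paths["ae aegypti"] / "1.wav").touch()
    (demo_paths["ae aegypti"] / "2.wav").touch()
    (demo_paths["background"] / "3.wav").touch()
    demo_counts = count_wavs(demo_paths)

print(demo_counts)
```

In the notebook, `count_wavs(paths)` can be compared against the number of rows per class in `df`. A lower count points to missing source files; a higher count may simply mean the destination already held files from an earlier run.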
