## Preprocessing the data

This notebook shows how to create a CSV file of 5-fold cross-validation splits that can be used to train a network to automatically classify bird calls in an audio recording.

We use pandas, numpy and sklearn to create the file.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedKFold

We load the label CSV file with pandas.

In [2]:
df = pd.read_csv('../labels_OM_david.csv')
df.head()

Unnamed: 0,Label,File,Event_ID,X_min,X_max,Y_min,Y_max,filepaths,wavepath,Group,Species,duration
0,alobel,20190226_B69T11_2018-06-29_00-12-30.wav,58,7250.45,7285.3,0.302,0.69,Spectros_OM/20190226_B69T11_2018-06-29_00-12-3...,Wavs_OM/20190226_B69T11_2018-06-29_00-12-30.wav,mammal,Howler monkey,34.85
1,alobel,20190222_B261T8_2018-06-13_03-37-00.wav,137,8954.2,8960.0,1.12,1.248,Spectros_OM/20190222_B261T8_2018-06-13_03-37-0...,Wavs_OM/20190222_B261T8_2018-06-13_03-37-00.wav,mammal,Howler monkey,5.8
2,alobel,20190226_B69T11_2018-06-29_00-20-30.wav,132,11583.85,11615.8,0.172,0.388,Spectros_OM/20190226_B69T11_2018-06-29_00-20-3...,Wavs_OM/20190226_B69T11_2018-06-29_00-20-30.wav,mammal,Howler monkey,31.95
3,alobel,20190226_B69T11_2018-06-29_00-22-45.wav,30,1645.7,1706.65,1.336,1.722,Spectros_OM/20190226_B69T11_2018-06-29_00-22-4...,Wavs_OM/20190226_B69T11_2018-06-29_00-22-45.wav,mammal,Howler monkey,60.95
4,alobel,20190226_B69T11_2018-06-29_00-13-00.wav,37,3462.7,3526.55,0.344,0.776,Spectros_OM/20190226_B69T11_2018-06-29_00-13-0...,Wavs_OM/20190226_B69T11_2018-06-29_00-13-00.wav,mammal,Howler monkey,63.85
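The Unnamed: 0 column in the output above is what pandas typically produces when a CSV was saved together with its index. As a minimal sketch, assuming that is the case here, the column can be read back as the index instead. The inline CSV below is a made-up stand-in for the real file.

```python
import io
import pandas as pd

# Made-up stand-in for the label file; the empty first header is the saved index
csv_text = """,Label,File
0,alobel,a.wav
1,cicada,b.wav
"""

# index_col=0 tells pandas to use the first column as the index
toy = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(toy.columns.tolist())
```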


We put this into a form that can be used by our network, converting X_min and X_max to seconds and the frequency bounds Y_min and Y_max to Hz.

In [3]:
df.X_min /= 1000
df.X_max /= 1000
df.Y_min *= 1000
df.Y_max *= 1000
df.head()

Unnamed: 0,Label,File,Event_ID,X_min,X_max,Y_min,Y_max,filepaths,wavepath,Group,Species,duration
0,alobel,20190226_B69T11_2018-06-29_00-12-30.wav,58,7.25045,7.2853,302.0,690.0,Spectros_OM/20190226_B69T11_2018-06-29_00-12-3...,Wavs_OM/20190226_B69T11_2018-06-29_00-12-30.wav,mammal,Howler monkey,34.85
1,alobel,20190222_B261T8_2018-06-13_03-37-00.wav,137,8.9542,8.96,1120.0,1248.0,Spectros_OM/20190222_B261T8_2018-06-13_03-37-0...,Wavs_OM/20190222_B261T8_2018-06-13_03-37-00.wav,mammal,Howler monkey,5.8
2,alobel,20190226_B69T11_2018-06-29_00-20-30.wav,132,11.58385,11.6158,172.0,388.0,Spectros_OM/20190226_B69T11_2018-06-29_00-20-3...,Wavs_OM/20190226_B69T11_2018-06-29_00-20-30.wav,mammal,Howler monkey,31.95
3,alobel,20190226_B69T11_2018-06-29_00-22-45.wav,30,1.6457,1.70665,1336.0,1722.0,Spectros_OM/20190226_B69T11_2018-06-29_00-22-4...,Wavs_OM/20190226_B69T11_2018-06-29_00-22-45.wav,mammal,Howler monkey,60.95
4,alobel,20190226_B69T11_2018-06-29_00-13-00.wav,37,3.4627,3.52655,344.0,776.0,Spectros_OM/20190226_B69T11_2018-06-29_00-13-0...,Wavs_OM/20190226_B69T11_2018-06-29_00-13-00.wav,mammal,Howler monkey,63.85
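Because the cell above modifies the columns in place, running it twice scales the values twice. A minimal sketch of a variant that returns a converted copy instead is shown here. It assumes the raw X values are in milliseconds and the raw Y values in kHz, as the conversion implies. The helper name to_seconds_hz is hypothetical, and the toy values are taken from the first two rows of the table.

```python
import pandas as pd

def to_seconds_hz(frame):
    """Return a copy with X_min/X_max in seconds and Y_min/Y_max in Hz."""
    out = frame.copy()
    out[["X_min", "X_max"]] = out[["X_min", "X_max"]] / 1000  # ms -> s (assumed units)
    out[["Y_min", "Y_max"]] = out[["Y_min", "Y_max"]] * 1000  # kHz -> Hz (assumed units)
    return out

raw = pd.DataFrame({
    "X_min": [7250.45, 8954.2],
    "X_max": [7285.3, 8960.0],
    "Y_min": [0.302, 1.12],
    "Y_max": [0.69, 1.248],
})

converted = to_seconds_hz(raw)  # raw itself is left unchanged
print(converted)
```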


We can use pandas to count the number of rows for each value in a column, and to find the number of unique classes in that column.
Here is an example for the Label column.

In [4]:
df.Label.value_counts(), df.Label.nunique()

(alobel           7387
 cicada           7028
 NO               6856
 megwat_social    5177
 leppen           3845
                  ... 
 mam5                2
 mam11               2
 sf30                1
 phyvai              1
 bird13              1
 Name: Label, Length: 146, dtype: int64,
 146)

We can remove labels using pandas if they are below a certain count.
Here we remove labels that have 10 or fewer entries in the CSV file.

In [5]:
df = df.groupby("Label").filter(lambda x: len(x)>10)

In [6]:
df.Label.value_counts(), df.Label.nunique()

(alobel           7387
 cicada           7028
 NO               6856
 megwat_social    5177
 leppen           3845
                  ... 
 adesp              14
 bird6              14
 nycaet_call        12
 bird11             11
 bird24             11
 Name: Label, Length: 120, dtype: int64,
 120)
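The same filter can be written with value_counts, which counts each label once rather than calling a Python function per group. This is only a sketch on made-up data, and min_count is a hypothetical name for the threshold.

```python
import pandas as pd

# Made-up labels: "a" has 12 rows, "b" has 10, "c" has 3
toy = pd.DataFrame({"Label": ["a"] * 12 + ["b"] * 10 + ["c"] * 3})

min_count = 10  # a label must have MORE than this many rows to be kept

# Count the rows per label once, then keep rows whose label is frequent enough
counts = toy["Label"].value_counts()
kept = toy[toy["Label"].map(counts) > min_count]

print(kept["Label"].value_counts())
```

Only the label with more than 10 rows survives, which matches the behaviour of the len(x)>10 condition.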

## Deciding on species

We can also filter by the type of animal whose calls appear in the file.

In [7]:
df.Group.value_counts()

bird           27190
no-call        14009
frog           12675
mammal         11981
insect         11426
frog/insect     3444
noise           1351
reptile         1272
Mammal           768
?                354
Name: Group, dtype: int64

In [8]:
df = df[df.Group.isin(['bird'])]; df.Label.nunique()

44
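The counts above list mammal and Mammal as separate groups, so an exact-match filter on one spelling would miss the other. That does not affect the bird filter used here, but a sketch of normalising the group names before filtering might look like this. The data and the group_norm column are made up for illustration.

```python
import pandas as pd

# Made-up rows with inconsistent spelling of the group name
toy = pd.DataFrame({
    "Group": ["bird", "mammal", "Mammal", " bird", "frog"],
    "Label": ["x1", "x2", "x2", "x3", "x4"],
})

# Normalise case and surrounding whitespace into a separate (hypothetical) column
toy["group_norm"] = toy["Group"].str.strip().str.lower()

wanted = ["mammal"]
subset = toy[toy["group_norm"].isin(wanted)]
print(len(subset), subset["Label"].nunique())
```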

We need to save a list of the classes so that when we come to evaluate a test set or predict on an audio file, we know which class is represented by each position in the final prediction vector.

In [9]:
classes = df.Label.unique(); classes

array(['antser', 'cff', 'crysou', 'cryvar', 'cryvar_call2',
       'glahar_social', 'hercac', 'hercac_call', 'lopcri_social',
       'lursem_social', 'megcho', 'megcho3', 'megwat_social', 'nightsp',
       'nycaet_cic', 'nycalb', 'nycalb2', 'nycgra_social', 'nycgri',
       'nycleu', 'nycoce', 'odoguj', 'odoguj_call', 'ortmot', 'owl6',
       'pulper_social', 'rhysim_alarm', 'rhysim_call', 'rhysim_social',
       'rooster', 'strhuh_call2', 'strhuh_small', 'strhuh_social',
       'tintao_song', 'wd', 'wd2', 'tingut_social', 'nycgra_call',
       'bird3', 'nycaet_call', 'strhuh_call', 'megwat_alarm', 'pipcuj',
       'nycaet_song'], dtype=object)

In [10]:
np.savetxt('classes.txt', classes, fmt='%s')  # save the class names as text, one per line

In [11]:
classes = list(np.loadtxt('classes.txt', dtype=str))  # one name per line; newer NumPy rejects '\n' as a delimiter

In [12]:
classes

['antser',
 'cff',
 'crysou',
 'cryvar',
 'cryvar_call2',
 'glahar_social',
 'hercac',
 'hercac_call',
 'lopcri_social',
 'lursem_social',
 'megcho',
 'megcho3',
 'megwat_social',
 'nightsp',
 'nycaet_cic',
 'nycalb',
 'nycalb2',
 'nycgra_social',
 'nycgri',
 'nycleu',
 'nycoce',
 'odoguj',
 'odoguj_call',
 'ortmot',
 'owl6',
 'pulper_social',
 'rhysim_alarm',
 'rhysim_call',
 'rhysim_social',
 'rooster',
 'strhuh_call2',
 'strhuh_small',
 'strhuh_social',
 'tintao_song',
 'wd',
 'wd2',
 'tingut_social',
 'nycgra_call',
 'bird3',
 'nycaet_call',
 'strhuh_call',
 'megwat_alarm',
 'pipcuj',
 'nycaet_song']
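The order of this list is what ties a position in the prediction vector to a class name. Since unique() returns labels in order of first appearance, the saved file should be reused rather than regenerated from a differently ordered CSV. A minimal sketch of the lookup follows, with made-up scores standing in for real model output.

```python
import numpy as np

classes = ["antser", "cff", "crysou"]  # first three entries of the list above

# Made-up scores standing in for one prediction vector
scores = np.array([0.1, 0.7, 0.2])

# The index of the highest score selects the class name
predicted = classes[int(np.argmax(scores))]
print(predicted)
```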

We can now add a filepath for the audio file and a species id to the CSV file, using list comprehensions in Python.
The filepath is used to load the audio files; the species id is the class converted to a number (its position in the classes list).

In [13]:
df['species_id'] = [classes.index(l) for l in df.Label] #we are adding the id for the classes we are training on.

In [14]:
df['filepath'] = [f'Wavs_OM/{f}' for f in df.File]
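classes.index(l) scans the list for every row and raises a ValueError if a label is missing. One alternative, sketched below on made-up data, is to build a dictionary once and use map, which leaves unknown labels as NaN so they can be inspected. The name class_to_id is hypothetical.

```python
import pandas as pd

classes = ["antser", "cff", "crysou"]
toy = pd.DataFrame({"Label": ["cff", "antser", "cff", "unknown"]})

# Build the label -> index lookup once
class_to_id = {name: i for i, name in enumerate(classes)}

# Labels missing from `classes` become NaN instead of raising ValueError
toy["species_id"] = toy["Label"].map(class_to_id)
missing = toy[toy["species_id"].isna()]
print(missing)
```

When every label is present the column holds integers; the presence of NaN turns it into floats, which is itself a useful signal that something is missing.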

We also filter the rows of the CSV file to make sure we do not have any bird calls with a duration of 0 seconds.
The threshold can be changed, for example to drop calls shorter than 30 ms.

In [15]:
df['t_diff'] = df['X_max'] - df['X_min']
df = df[df.t_diff != 0.0]
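As a sketch of the 30 ms variant mentioned above, assuming X_min and X_max are already in seconds (the constant name MIN_DURATION_S and the toy rows are made up):

```python
import pandas as pd

MIN_DURATION_S = 0.030  # 30 ms, assuming X_min / X_max are in seconds

# Made-up events lasting 0 s, 10 ms and 500 ms
toy = pd.DataFrame({
    "X_min": [1.000, 2.000, 3.000],
    "X_max": [1.000, 2.010, 3.500],
})
toy["t_diff"] = toy["X_max"] - toy["X_min"]

# Keep only events that last at least the minimum duration
toy = toy[toy["t_diff"] >= MIN_DURATION_S]
print(toy)
```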

We can also filter out audio files that are shorter than the expected clip length, or that could otherwise cause issues during training, for example because they have the wrong sample rate.

In [16]:
import librosa

ModuleNotFoundError: No module named 'librosa'

In [None]:
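fns = []  # files to exclude
for i, fn in enumerate(df.filepath.unique()):
    if i % 100 == 0:
        print(i)  # progress
    y, sr = librosa.load(f'../{fn}', sr=None)  # sr=None keeps the native sample rate
    # flag files with the wrong sample rate or a duration below 14.9 s
    if sr != 44100 or librosa.get_duration(y=y, sr=sr) < 14.9:
        fns.append(fn)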

In [None]:
len(fns), df.filepath.nunique()

In [17]:
df =df[~df.filepath.isin(fns)];

NameError: name 'fns' is not defined
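If librosa is not installed, as in the error above, the same two checks can be made with the standard-library wave module. This is a swapped-in technique rather than the notebook's method, and it only reads uncompressed PCM WAV files. The helper names wav_info and is_bad are hypothetical, and the demo file is generated on the spot.

```python
import os
import tempfile
import wave

def wav_info(path):
    """Return (sample_rate, duration_in_seconds) read from the WAV header."""
    with wave.open(path, "rb") as wf:
        sr = wf.getframerate()
        return sr, wf.getnframes() / sr

def is_bad(path, expected_sr=44100, min_duration=14.9):
    """True if the file has the wrong sample rate or is too short."""
    sr, duration = wav_info(path)
    return sr != expected_sr or duration < min_duration

# Write a 1-second silent mono file to demonstrate the check
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.wav")
with wave.open(demo_path, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)        # 16-bit samples
    wf.setframerate(44100)
    wf.writeframes(b"\x00\x00" * 44100)

print(wav_info(demo_path), is_bad(demo_path))
```

The demo file has the expected sample rate but is far shorter than 14.9 s, so it is flagged.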

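We also need to filter on X_max relative to the 14.9 s threshold. As written, the cell below keeps only the rows whose X_max is greater than 14.9 s; if the aim is instead to drop events that end after that time, the comparison should be reversed to <=.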

In [37]:
df = df[df.X_max > 14.9];

We now create the k-folds and store them in the CSV file.

In [41]:
len(df)

342

In [38]:
FOLDS = 5  # number of folds; this can be changed
SEED = 42  # random seed, so that the same shuffle is generated again

# We use one row per unique filepath, with the species id of its first row,
# so that all rows from the same recording end up in the same fold
train_gby = df.groupby("filepath")[["species_id"]].first().reset_index()
train_gby = train_gby.sample(frac=1, random_state=SEED).reset_index(drop=True)
train_gby.loc[:, 'kfold'] = -1

X = train_gby["filepath"].values
y = train_gby["species_id"].values

kfold = StratifiedKFold(n_splits=FOLDS)
for fold, (t_idx, v_idx) in enumerate(kfold.split(X, y)):
    train_gby.loc[v_idx, "kfold"] = fold

train = df.merge(train_gby[['filepath', 'kfold']], on="filepath", how="left")
print(train.kfold.value_counts())
train.to_csv("train_folds.csv", index=False)

4    74
1    70
2    67
3    66
0    65
Name: kfold, dtype: int64
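The counts above are rows per fold. A quick sanity check, sketched here on a made-up table with the same column names, is to confirm that no recording is split across folds, and to look at how the classes are spread over them. The last lines show how one fold would be held out for validation.

```python
import pandas as pd

# Made-up stand-in for train_folds.csv
toy = pd.DataFrame({
    "filepath":   ["a.wav", "a.wav", "b.wav", "c.wav", "c.wav", "d.wav"],
    "species_id": [0, 0, 1, 0, 0, 1],
    "kfold":      [0, 0, 0, 1, 1, 1],
})

# Each recording should map to exactly one fold
folds_per_file = toy.groupby("filepath")["kfold"].nunique()
assert (folds_per_file == 1).all(), "a recording is split across folds"

# Rows per class in each fold (classes as rows, folds as columns)
class_by_fold = pd.crosstab(toy["species_id"], toy["kfold"])
print(class_by_fold)

# Selecting the training and validation rows for one fold
fold = 0
train_part = toy[toy["kfold"] != fold]
valid_part = toy[toy["kfold"] == fold]
```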




The CSV file has been saved as train_folds.csv.

StratifiedKFold may warn that the least populated class has fewer members than n_splits. This means that at least one class has fewer recordings than there are folds, so it appears in some folds but not in others.
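To see which classes are affected, one can count the recordings per class before splitting. The sketch below uses a made-up per-recording table shaped like train_gby. The name too_rare is hypothetical.

```python
import pandas as pd

FOLDS = 5

# Made-up per-recording table: one row per recording with its species id
toy = pd.DataFrame({
    "filepath": [f"rec{i}.wav" for i in range(8)],
    "species_id": [0, 0, 0, 0, 0, 0, 1, 2],
})

# Number of recordings per class
per_class = toy["species_id"].value_counts()

# Classes with fewer recordings than folds cannot appear in every fold
too_rare = per_class[per_class < FOLDS].index.tolist()
print(sorted(too_rare))
```

Such classes could be dropped, merged, or accepted as missing from some validation folds, depending on what the evaluation needs.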

### Fin

In [40]:
len(train)

342