## Preprocessing the data

This notebook shows how to create a CSV file of 5-fold cross-validation splits that can be used to train a network to automatically classify bird calls in an audio recording.

We use pandas, NumPy, and scikit-learn to create the file.

In [15]:
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedKFold

We load a CSV file with pandas.

In [16]:
df = pd.read_csv('../labels_CD_20210309.csv')
df.head()

Unnamed: 0,Label,File,Event_ID,X_min,X_max,Y_min,Y_max,Species,EngName,Group,Date,recID,wave,duration
0,AshDro,extr_0117_pt1_B08_20190525_060000_All_Day_3h.txt,AshDro_1,2256.809,3151.751,2.422658,3.538126,Dicrurus leucophaeus,Ashy Drongo,Birds,25/05/2019,B08,extr_0117_pt1_B08_20190525_060000_All_Day_3h.wav,894.942
1,AshDro,extr_0195_pt1_B03_20190410_060000_All_Day_3h.txt,AshDro_2,4970.817,5963.035,2.232967,3.621978,Dicrurus leucophaeus,Ashy Drongo,Birds,10/04/2019,B03,extr_0195_pt1_B03_20190410_060000_All_Day_3h.wav,992.218
2,AshDro,extr_0043_pt2_B10_20190903_060000_All_Day_3h.txt,AshDro_3,9394.572,10615.866,2.133333,3.183333,Dicrurus leucophaeus,Ashy Drongo,Birds,03/09/2019,B10,extr_0043_pt2_B10_20190903_060000_All_Day_3h.wav,1221.294
3,AshDro,extr_1185_B02_20191002_060013_All_Day_3h.txt,AshDro_4,1481.481,3379.63,1.634409,3.56129,Dicrurus leucophaeus,Ashy Drongo,Birds,02/10/2019,B02,extr_1185_B02_20191002_060013_All_Day_3h.wav,1898.149
4,AshDro,extr_1185_B02_20191002_060013_All_Day_3h.txt,AshDro_5,8128.205,10487.179,1.608392,3.426573,Dicrurus leucophaeus,Ashy Drongo,Birds,02/10/2019,B02,extr_1185_B02_20191002_060013_All_Day_3h.wav,2358.974


We convert the data into the form our network expects: X_min and X_max go from milliseconds to seconds, and Y_min and Y_max go from kHz to Hz.

In [17]:
# Time bounds: milliseconds -> seconds
df.X_min /= 1000
df.X_max /= 1000
# Frequency bounds: kHz -> Hz
df.Y_min *= 1000
df.Y_max *= 1000
df.head()

Unnamed: 0,Label,File,Event_ID,X_min,X_max,Y_min,Y_max,Species,EngName,Group,Date,recID,wave,duration
0,AshDro,extr_0117_pt1_B08_20190525_060000_All_Day_3h.txt,AshDro_1,2.256809,3.151751,2422.657959,3538.126465,Dicrurus leucophaeus,Ashy Drongo,Birds,25/05/2019,B08,extr_0117_pt1_B08_20190525_060000_All_Day_3h.wav,894.942
1,AshDro,extr_0195_pt1_B03_20190410_060000_All_Day_3h.txt,AshDro_2,4.970817,5.963035,2232.967041,3621.978027,Dicrurus leucophaeus,Ashy Drongo,Birds,10/04/2019,B03,extr_0195_pt1_B03_20190410_060000_All_Day_3h.wav,992.218
2,AshDro,extr_0043_pt2_B10_20190903_060000_All_Day_3h.txt,AshDro_3,9.394572,10.615866,2133.333496,3183.333496,Dicrurus leucophaeus,Ashy Drongo,Birds,03/09/2019,B10,extr_0043_pt2_B10_20190903_060000_All_Day_3h.wav,1221.294
3,AshDro,extr_1185_B02_20191002_060013_All_Day_3h.txt,AshDro_4,1.481481,3.37963,1634.408569,3561.290283,Dicrurus leucophaeus,Ashy Drongo,Birds,02/10/2019,B02,extr_1185_B02_20191002_060013_All_Day_3h.wav,1898.149
4,AshDro,extr_1185_B02_20191002_060013_All_Day_3h.txt,AshDro_5,8.128205,10.487179,1608.391602,3426.573486,Dicrurus leucophaeus,Ashy Drongo,Birds,02/10/2019,B02,extr_1185_B02_20191002_060013_All_Day_3h.wav,2358.974
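
Because the conversion above modifies df in place, running the cell twice divides and multiplies twice. The following is a minimal sketch of a conversion that returns a copy instead, assuming the same column names and input units (milliseconds and kHz). The function name `to_seconds_and_hz` and the toy values are made up for illustration.

```python
import pandas as pd


def to_seconds_and_hz(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with X_* in seconds and Y_* in Hz (inputs assumed ms and kHz)."""
    out = frame.copy()
    out[["X_min", "X_max"]] = out[["X_min", "X_max"]] / 1000.0
    out[["Y_min", "Y_max"]] = out[["Y_min", "Y_max"]] * 1000.0
    return out


# Toy row, not taken from the real CSV file
toy = pd.DataFrame({
    "X_min": [2256.809], "X_max": [3151.751],
    "Y_min": [2.5], "Y_max": [3.5],
})
converted = to_seconds_and_hz(toy)
print(converted)
```

The original frame is left untouched, so the cell can be rerun safely.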


We can use pandas to count the rows for each value of a column and to count how many unique values that column has.
Here is an example for the Label column:

In [18]:
df.Label.value_counts(), df.Label.nunique()

(PieShrVir     151
 HorWreBab1    127
 SunCuc1       123
 PygCup1       112
 JavTes1       104
              ... 
 JavHel1         1
 JavSho1         1
 JavKin          1
 JavHawEag       1
 CheThrYel3      1
 Name: Label, Length: 133, dtype: int64,
 133)

We can use pandas to remove labels whose count falls below a threshold.
Here we remove labels that have 10 or fewer entries in the CSV file:

In [19]:
df = df.groupby("Label").filter(lambda x: len(x)>10)

In [20]:
df.Label.value_counts(), df.Label.nunique()

(PieShrVir     151
 HorWreBab1    127
 SunCuc1       123
 PygCup1       112
 JavTes1       104
 EyeWreBab1    100
 MouLeaWar1     92
 MouLea         91
 SunWar         90
 CreCheBab1     87
 GreHeaCan      62
 JavTes3        56
 SunBusWar      49
 LitPieFly1     48
 FlaFroBar1     39
 SunBruCuc1     34
 JavFul1        31
 SunBruCuc2     29
 JavHel2        27
 LitPieFly3     27
 LitPieFly2     26
 BroThrBar1     26
 CheBelPar      25
 TriShrVir1     20
 LesSho         20
 BarCucDov      19
 PygCup2        16
 SunCucShr1     14
 JavLau1        12
 JavCoc1        12
 LitSpi2        11
 SnoBroFly2     11
 MouLeaWar2     11
 BroThrBar3     11
 SnoBroFly1     11
 PygCup3        11
 HorWreBab3     11
 Name: Label, dtype: int64,
 37)
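
The same filter can also be written with `value_counts` and `isin`, which computes the counts once instead of calling a lambda per group. This is only a sketch on toy data. `MIN_COUNT` and the labels are made up, and the comparison is strict, so a label with exactly `MIN_COUNT` rows is dropped.

```python
import pandas as pd

MIN_COUNT = 3  # hypothetical threshold for the toy data

# Toy labels: A appears 4 times, B 3 times, C once
toy = pd.DataFrame({"Label": ["A"] * 4 + ["B"] * 3 + ["C"] * 1})

counts = toy["Label"].value_counts()
keep = counts[counts > MIN_COUNT].index  # labels with strictly more than MIN_COUNT rows
filtered = toy[toy["Label"].isin(keep)]
print(filtered["Label"].value_counts())
```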

We need to save a list of the classes so that, when we evaluate a test set or predict on an audio file, we know which class is represented by each position in the final prediction vector.

In [21]:
classes = df.Label.unique(); classes

array(['BarCucDov', 'BroThrBar1', 'BroThrBar3', 'CheBelPar', 'CreCheBab1',
       'EyeWreBab1', 'FlaFroBar1', 'GreHeaCan', 'HorWreBab1',
       'HorWreBab3', 'JavCoc1', 'JavFul1', 'JavHel2', 'JavLau1',
       'JavTes1', 'JavTes3', 'LesSho', 'LitPieFly1', 'LitPieFly2',
       'LitPieFly3', 'LitSpi2', 'MouLea', 'MouLeaWar1', 'MouLeaWar2',
       'PieShrVir', 'PygCup1', 'PygCup2', 'PygCup3', 'SnoBroFly1',
       'SnoBroFly2', 'SunBruCuc1', 'SunBruCuc2', 'SunBusWar', 'SunCuc1',
       'SunCucShr1', 'SunWar', 'TriShrVir1'], dtype=object)

In [22]:
np.savetxt('classes.txt', classes, fmt='%s')  # write one class name per line

In [23]:
# Each line holds one name with no spaces, so the default whitespace splitting is enough
classes = list(np.loadtxt('classes.txt', dtype=str))

In [24]:
classes

['BarCucDov',
 'BroThrBar1',
 'BroThrBar3',
 'CheBelPar',
 'CreCheBab1',
 'EyeWreBab1',
 'FlaFroBar1',
 'GreHeaCan',
 'HorWreBab1',
 'HorWreBab3',
 'JavCoc1',
 'JavFul1',
 'JavHel2',
 'JavLau1',
 'JavTes1',
 'JavTes3',
 'LesSho',
 'LitPieFly1',
 'LitPieFly2',
 'LitPieFly3',
 'LitSpi2',
 'MouLea',
 'MouLeaWar1',
 'MouLeaWar2',
 'PieShrVir',
 'PygCup1',
 'PygCup2',
 'PygCup3',
 'SnoBroFly1',
 'SnoBroFly2',
 'SunBruCuc1',
 'SunBruCuc2',
 'SunBusWar',
 'SunCuc1',
 'SunCucShr1',
 'SunWar',
 'TriShrVir1']
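
The class list can also be saved and loaded without NumPy. The sketch below uses plain file I/O and builds a name-to-index dictionary. The temporary directory and the `class_to_idx` name are assumptions made so the example is self-contained.

```python
from pathlib import Path
import tempfile

# First three labels from the list above, used as toy data
classes = ["BarCucDov", "BroThrBar1", "BroThrBar3"]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "classes.txt"
    path.write_text("\n".join(classes) + "\n")  # one class name per line
    loaded = path.read_text().splitlines()

# Position in the prediction vector for each class name
class_to_idx = {name: i for i, name in enumerate(loaded)}
print(class_to_idx)
```

The order of the lines in the file is what defines each class index, so the file should not be re-sorted or edited after training.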

We can now add a recording_id for the audio file and a species_id to the CSV file, using list comprehensions in Python.
The recording id is used to load audio files; the species id is the class converted to its integer position in the class list.

In [30]:
df['species_id'] = [classes.index(l) for l in df.Label]
df['recording_id'] = [f[:-4] for f in df.File]
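
As a possible alternative, the sketch below maps labels through a dictionary and strips the extension with `os.path.splitext`. It assumes toy class names and file names. Unlike `classes.index`, which raises an error on an unknown label, `Series.map` leaves NaN, so unknown labels can be inspected afterwards (the column becomes float when NaN is present).

```python
import os
import pandas as pd

classes = ["AshDro", "PieShrVir"]  # hypothetical class list
class_to_idx = {name: i for i, name in enumerate(classes)}

# Toy rows; "Unknown" is deliberately not in the class list
toy = pd.DataFrame({
    "Label": ["PieShrVir", "AshDro", "Unknown"],
    "File": ["rec_a.txt", "rec_b.txt", "rec_c.txt"],
})
toy["species_id"] = toy["Label"].map(class_to_idx)  # NaN for labels not in the list
toy["recording_id"] = [os.path.splitext(f)[0] for f in toy["File"]]  # drop the extension

missing = toy[toy["species_id"].isna()]
print(toy)
```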

We also filter the rows of the CSV file to make sure no bird call has a duration of 0 seconds.
The condition can be changed to apply another threshold, for example to drop calls shorter than 30 ms.

In [35]:
df['t_diff'] = df['X_max'] - df['X_min']  # call duration in seconds
df = df[df.t_diff != 0.0]  # drop zero-length calls
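
A minimum-duration filter such as the 30 ms case could look like the following sketch. The constant `MIN_DURATION_S` and the toy rows are assumptions, and the times are taken to be in seconds already.

```python
import pandas as pd

MIN_DURATION_S = 0.030  # hypothetical threshold: 30 ms expressed in seconds

# Toy calls with durations of 0 s, about 0.02 s, and 0.5 s
toy = pd.DataFrame({
    "X_min": [1.000, 2.000, 3.000],
    "X_max": [1.000, 2.020, 3.500],
})
toy["t_diff"] = toy["X_max"] - toy["X_min"]
kept = toy[toy["t_diff"] >= MIN_DURATION_S].reset_index(drop=True)
print(kept)
```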

We now create the k folds and store them in the CSV file.

In [43]:
FOLDS = 5  # Number of folds; this can be changed
SEED = 42  # Random seed for the shuffle below, so the same folds are generated on every run

# We assign folds per unique recording id (labelled with the species id of its first row),
# so that all rows from one recording end up in the same fold
train_gby = df.groupby("recording_id")[["species_id"]].first().reset_index()
train_gby = train_gby.sample(frac=1, random_state=SEED).reset_index(drop=True)
train_gby.loc[:, 'kfold'] = -1

X = train_gby["recording_id"].values
y = train_gby["species_id"].values

kfold = StratifiedKFold(n_splits=FOLDS)
for fold, (t_idx, v_idx) in enumerate(kfold.split(X, y)):
    train_gby.loc[v_idx, "kfold"] = fold

train = df.merge(train_gby[['recording_id', 'kfold']], on="recording_id", how="left")
print(train.kfold.value_counts())
train.to_csv("train_folds.csv", index=False)

4    373
0    356
1    348
3    339
2    330
Name: kfold, dtype: int64
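
Two quick checks are useful on a fold file like this one: that no recording is spread over more than one fold, and that a training and validation split can be selected for a given fold. The sketch below runs both on a toy frame with the same column names. The values and the `val_fold` variable are made up.

```python
import pandas as pd

# Toy fold assignment: every row of a recording shares one fold
toy = pd.DataFrame({
    "recording_id": ["r1", "r1", "r2", "r3", "r3", "r4"],
    "species_id":   [0, 0, 1, 0, 0, 1],
    "kfold":        [0, 0, 0, 1, 1, 1],
})

# Recordings that appear in more than one fold would leak between train and validation
folds_per_recording = toy.groupby("recording_id")["kfold"].nunique()
leaky = folds_per_recording[folds_per_recording > 1]
print("recordings in more than one fold:", list(leaky.index))

# Split for one fold: validate on val_fold, train on all the others
val_fold = 0
train_part = toy[toy["kfold"] != val_fold]
val_part = toy[toy["kfold"] == val_fold]
print(len(train_part), len(val_part))
```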




The CSV file has been saved as train_folds.csv.

The warning raised by the cell above means that at least one class appears in only some of the folds and is missing from the others.
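
To see which classes are affected, one option is to cross-tabulate class against fold and look for zero counts. This is a sketch on toy data that assumes the same `species_id` and `kfold` column names.

```python
import pandas as pd

# Toy assignment: class 0 is in every fold, classes 1 and 2 are not
toy = pd.DataFrame({
    "species_id": [0, 0, 0, 1, 1, 2],
    "kfold":      [0, 1, 2, 0, 1, 2],
})

table = pd.crosstab(toy["species_id"], toy["kfold"])  # rows: class, columns: fold
missing = table[(table == 0).any(axis=1)]  # classes absent from at least one fold
print(missing)
```

Classes listed in `missing` have no examples in at least one fold, so metrics for those classes on that fold are undefined or unreliable.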

### Fin