# Data Preprocessing

The original dataset has been shared as part of ["Barking in domestic dogs: context specificity and individual identification"](https://www.sciencedirect.com/science/article/abs/pii/S000334720400123X), a paper by Sophia Yin and Brenda McCowan.

You can download the unprocessed dataset from [Internet Archive](https://archive.org/details/dog-barks-raw).

This is the version of the dataset we will begin our preprocessing with. Let's first download and extract the data.

In [4]:
!mkdir data

In [5]:
!cd data && wget https://archive.org/download/dog-barks-raw/Dog%20Bark%20Data.zip

--2021-02-14 19:30:17--  https://archive.org/download/dog-barks-raw/Dog%20Bark%20Data.zip
Resolving archive.org (archive.org)... 207.241.224.2
Connecting to archive.org (archive.org)|207.241.224.2|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://ia801409.us.archive.org/11/items/dog-barks-raw/Dog%20Bark%20Data.zip [following]
--2021-02-14 19:30:17--  https://ia801409.us.archive.org/11/items/dog-barks-raw/Dog%20Bark%20Data.zip
Resolving ia801409.us.archive.org (ia801409.us.archive.org)... 207.241.228.149
Connecting to ia801409.us.archive.org (ia801409.us.archive.org)|207.241.228.149|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1943878068 (1.8G) [application/zip]
Saving to: ‘Dog Bark Data.zip’


2021-02-14 19:50:15 (1.55 MB/s) - ‘Dog Bark Data.zip’ saved [1943878068/1943878068]



In [None]:
!cd data && unzip -q Dog\ Bark\ Data.zip

The data is contained in several directories.

In [9]:
ls data/Dog\ Bark\ Data/Dog

Dog Bark Raw Data I - 09.24.00/    Dogs 2001 - Siggy Sessions 1-4/
Dog Bark Raw Data II - 09.24.00/   Sophia's Dog Barks - Louie Re-cued/
Dog Bark Raw Data III - 09.24.00/  VMTRC - UC Davis Dogs/


There is also a key provided that allows us to decipher the data.


```
Filename: Mac-1-A-1a.aif
Annotation: NameofDog-Session#-SessionSequenceNumber.aif

A = aggression
C = contact
P = play

Refer to Yin& McCowan 2004 for data collection methods.
```
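
The example filename in the key can be split into its parts with a small helper. This is only a sketch: the function name `parse_bark_filename` is hypothetical, and it assumes the `Name-Session-Context-Sequence` layout of the example filename, which not every file in the dataset follows.

```python
import re

CONTEXTS = {'A': 'aggression', 'C': 'contact', 'P': 'play'}

def parse_bark_filename(filename):
    """Split a name like 'Mac-1-A-1a.aif' into its parts (hypothetical helper)."""
    # assumed layout: <dog name>-<session>-<context letter>-<sequence>.<extension>
    match = re.fullmatch(r'(\w+)-(\w+)-([ACP])-(\w+)\.(?:aif|wav)', filename)
    if match is None:
        return None  # the name does not follow the assumed layout
    dog, session, context, sequence = match.groups()
    return {'name': dog, 'session': session,
            'context': CONTEXTS[context], 'sequence': sequence}

print(parse_bark_filename('Mac-1-A-1a.aif'))
```

Returning `None` for names that do not match keeps the irregular files visible instead of silently mislabeling them.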

Walking the directory structure by hand would be involved, so let's instead enumerate all the audio files.

In [38]:
import librosa
import glob

In [39]:
paths = glob.glob('data/**/**/**/**/*.aif') + glob.glob('data/**/**/**/**/*.wav')

There are 720 recordings in the dataset.

In [41]:
len(paths)

720
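
Without `recursive=True`, `glob` treats `**` like `*`, so the pattern above only matches files nested exactly four directories below `data`. A depth-independent alternative is sketched below using `pathlib`. The helper name `find_audio_files` is hypothetical, it also accepts upper-case extensions, and on this dataset it may return a different count if audio files sit at other depths.

```python
from pathlib import Path

def find_audio_files(root, extensions=('.aif', '.wav')):
    """Return every file under root whose suffix is in extensions, at any depth."""
    root = Path(root)
    return sorted(
        str(p) for p in root.rglob('*')
        if p.is_file() and p.suffix.lower() in extensions
    )
```

It would be called as `paths = find_audio_files('data')`, before `data/audio` is created, so that the copied files are not picked up a second time.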

All of them were recorded at a sampling rate of 44100 Hz.

In [42]:
# load every file at its native rate (sr=None) and collect the distinct sampling rates
set(librosa.load(path, sr=None)[1] for path in paths)

{44100}

Let us now extract annotations from the file names and copy all the audio files into a flat directory for ease of access.

In [97]:
mkdir data/audio

In [133]:
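import re
import shutil

filename = []
name = []
context = []

for path in paths:
    basename = path.split('/')[-1]
    # split the file name into alphanumeric tokens, e.g. ['Mac', '1', 'A', '1a', 'aif']
    tokens = re.findall(r'(\w+)', basename)
    try:
        # expected layout: name, session, context, sequence number, ...
        n, session, c, session_sequence, *_ = tokens
        if c not in ['A', 'C', 'P']:
            # second layout: the context directly follows the name
            n, c, *_ = tokens
            if c not in ['A', 'C', 'P']:
                # third layout: three tokens, the last one starting with the context letter
                n, _, c = tokens
                c = c[0]
                if c not in ['A', 'C', 'P']: continue
    except ValueError:
        # the tokens could not be unpacked into any of the layouts, skip this file
        continue
    shutil.copy(path, 'data/audio')
    filename.append(basename)
    name.append(n)
    context.append(c)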

In [157]:
# map misspelled or suffixed variants to one canonical name per dog
fix_names = {
    'Freid3': 'Freid',
    'Freid4': 'Freid',
    'Fried': 'Freid',
    'Keri3': 'Keri',
    'Kerik': 'Keri',
    'Zzoe': 'Zoe',
    'luke': 'Luke',
    'Louis': 'Louie',
}

name = [fix_names.get(n, n) for n in name]

Let's add additional labels from the paper.

In [128]:
age_map = {
    'Farley': 3,
    'Freid': 5,
    'Keri': 4,
    'Louie': 2,
    'Luke': 5,
    'Mac': 5,
    'Roodie': 12,
    'Rudy': 11,
    'Siggy': 11,
    'Zoe': 7
}

weight_map = {
    'Farley': 25,
    'Freid': 6,
    'Keri': 34,
    'Louie': 19,
    'Luke': 25,
    'Mac': 34,
    'Roodie': 18,
    'Rudy': 32,
    'Siggy': 36,
    'Zoe': 16
}

sex_map = {
    'Farley': 'male',
    'Freid': 'male',
    'Keri': 'female',
    'Louie': 'male',
    'Luke': 'male',
    'Mac': 'male',
    'Roodie': 'male',
    'Rudy': 'male',
    'Siggy': 'male',
    'Zoe': 'female'
}

breed_map = {
    'Farley': 'Australian shepherd',
    'Freid': 'Dachshund',
    'Keri': 'Labrador mix',
    'Louie': 'Springer spaniel',
    'Luke': 'Australian shepherd',
    'Mac': 'German shorthair pointer',
    'Roodie': 'Australian cattle dog',
    'Rudy': 'German shorthair pointer',
    'Siggy': 'German shorthair pointer',
    'Zoe': 'Australian cattle dog'
}

age, weight, sex, breed = zip(*[(age_map[n], weight_map[n], sex_map[n], breed_map[n]) for n in name])
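
The comprehension above raises a `KeyError` as soon as a name is missing from one of the maps, which happens if a spelling variant slipped past `fix_names`. A minimal sketch of a check that can be run first is shown below; `missing_metadata` is a hypothetical helper, not part of the original notebook, and the toy inputs are for illustration only.

```python
def missing_metadata(names, *maps):
    """Return the sorted names that are absent from at least one of the given dicts."""
    return sorted({n for n in names if any(n not in m for m in maps)})

# toy inputs for illustration only
toy_names = ['Mac', 'Zoe', 'Zzoe']
toy_age = {'Mac': 5, 'Zoe': 7}
toy_sex = {'Mac': 'male', 'Zoe': 'female'}
print(missing_metadata(toy_names, toy_age, toy_sex))
```

In the notebook, `missing_metadata(name, age_map, weight_map, sex_map, breed_map)` should return an empty list before the maps are applied.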

Let's construct our annotations CSV file.

In [159]:
import pandas as pd

anno = pd.DataFrame(data={
    'filename': filename,
    'name': name,
    'context': context,
    'age': age,
    'weight': weight,
    'sex': sex,
    'breed': breed
})

In [160]:
context_map = {
    'A': 'aggression',
    'C': 'contact',
    'P': 'play'
}

anno.context = anno.context.apply(lambda c: context_map[c]) 

In [161]:
anno.head()

| | filename | name | context | age | weight | sex | breed |
|---|---|---|---|---|---|---|---|
| 0 | Mac-3-A-3.aif | Mac | aggression | 5 | 34 | male | German shorthair pointer |
| 1 | Mac-3-P-3.aif | Mac | play | 5 | 34 | male | German shorthair pointer |
| 2 | Mac-2-P-2d.aif | Mac | play | 5 | 34 | male | German shorthair pointer |
| 3 | Mac-2-P-2b.aif | Mac | play | 5 | 34 | male | German shorthair pointer |
| 4 | Mac-2-A-2a..aif | Mac | aggression | 5 | 34 | male | German shorthair pointer |


In [167]:
# we can drop this row as the file is empty
anno = anno[anno.filename != 'Siggy-4-A-4.aif']
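
Rather than hard-coding the file name, candidates for removal could be detected programmatically. The sketch below assumes that "empty" means a zero-byte file on disk, which the notebook does not confirm; a file with a valid header but no samples would need a check on the decoded audio instead. `zero_byte_files` is a hypothetical helper.

```python
import os

def zero_byte_files(directory):
    """Return the sorted names of files in directory that have a size of 0 bytes."""
    return sorted(
        entry.name for entry in os.scandir(directory)
        if entry.is_file() and entry.stat().st_size == 0
    )
```

The result could then be used as a filter, for example `anno = anno[~anno.filename.isin(zero_byte_files('data/audio'))]`.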

In [171]:
anno.to_csv('data/annotations.csv', index=False)

In [172]:
!cd data && zip -qr dog_barks.zip annotations.csv audio
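
As a final sanity check, the archive contents can be compared with the annotations. The sketch below uses only the standard library. The function name `check_archive` is hypothetical, and it assumes the layout produced by the `zip` command above: `annotations.csv` at the top level and the audio files under `audio/`.

```python
import csv
import io
import zipfile

def check_archive(zip_path):
    """Return annotated file names that are missing from the audio/ folder of the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        members = set(archive.namelist())
        with archive.open('annotations.csv') as handle:
            # wrap the binary stream so the csv module can read it as text
            text = io.TextIOWrapper(handle, encoding='utf-8', newline='')
            annotated = [row['filename'] for row in csv.DictReader(text)]
    return sorted(f for f in annotated if f'audio/{f}' not in members)
```

An empty list from `check_archive('data/dog_barks.zip')` would indicate that every annotated recording made it into the archive. The check is one-directional: audio files without an annotation row are not reported.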