# Data Processing

The first step in data processing is obtaining and preparing the data. We used the Animal Sound Archive, which is available in the folder `animal-sound`.

That folder contains many files. The relevant ones are `occurrence.txt` and `multimedia.txt`: the first holds information about each occurrence, and the second holds the links to the audio files.

## Imports

In [1]:
import os
import warnings

import librosa
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import requests
import torchaudio
import torchaudio.transforms as T
from tqdm.notebook import tqdm

## Merging dataframes

Before anything else, we have to merge the relevant data in the files.

In [2]:
ocurrences = pd.read_csv("../animal-sound/occurrence.txt",
                         delimiter="\t")
ocurrences.head()


Unnamed: 0,gbifID,abstract,accessRights,accrualMethod,accrualPeriodicity,accrualPolicy,alternative,audience,available,bibliographicCitation,...,identifiedByID,level0Gid,level0Name,level1Gid,level1Name,level2Gid,level2Name,level3Gid,level3Name,iucnRedListCategory
0,1572324720,,,,,,,,,,...,,DEU,Germany,DEU.4_1,Brandenburg,DEU.4.18_1,Uckermark,DEU.4.18.11_1,Schwedt/Oder,LC
1,1572324719,,,,,,,,,,...,,DEU,Germany,DEU.4_1,Brandenburg,DEU.4.18_1,Uckermark,DEU.4.18.11_1,Schwedt/Oder,LC
2,1572324718,,,,,,,,,,...,,DEU,Germany,DEU.4_1,Brandenburg,DEU.4.18_1,Uckermark,DEU.4.18.11_1,Schwedt/Oder,LC
3,1572324717,,,,,,,,,,...,,DEU,Germany,DEU.4_1,Brandenburg,DEU.4.18_1,Uckermark,DEU.4.18.1_1,Angermünde,LC
4,1572324716,,,,,,,,,,...,,DEU,Germany,DEU.4_1,Brandenburg,DEU.4.18_1,Uckermark,DEU.4.18.11_1,Schwedt/Oder,LC


We can see that the `occurrence.txt` file has many columns, most of them filled with `NaN`. We keep only the columns that are relevant to the problem:
 - **gbifID**
 - **species**
 - **genus**
 - **family**
 - **class**
 - **phylum**

In [3]:
ocurrences = ocurrences[["gbifID", "species", "genus", "family", "class", "phylum"]]
ocurrences.head()

Unnamed: 0,gbifID,species,genus,family,class,phylum
0,1572324720,Crex crex,Crex,Rallidae,Aves,Chordata
1,1572324719,Crex crex,Crex,Rallidae,Aves,Chordata
2,1572324718,Crex crex,Crex,Rallidae,Aves,Chordata
3,1572324717,Crex crex,Crex,Rallidae,Aves,Chordata
4,1572324716,Crex crex,Crex,Rallidae,Aves,Chordata


In [4]:
multimedia = pd.read_csv("../animal-sound/multimedia.txt",
                         delimiter="\t")
multimedia.head()

Unnamed: 0,gbifID,type,format,identifier,references,title,description,source,audience,created,creator,contributor,publisher,license,rightsHolder
0,1572324720,Sound,audio/mpeg,http://www.tierstimmenarchiv.de/recordings/Cre...,http://www.tierstimmenarchiv.de/webinterface/c...,,,,,,,,,http://creativecommons.org/licenses/by-nc-sa/4.0/,
1,1572324719,Sound,audio/mpeg,http://www.tierstimmenarchiv.de/recordings/Cre...,http://www.tierstimmenarchiv.de/webinterface/c...,,,,,,,,,http://creativecommons.org/licenses/by-nc-sa/4.0/,
2,1572324718,Sound,audio/mpeg,http://www.tierstimmenarchiv.de/recordings/Cre...,http://www.tierstimmenarchiv.de/webinterface/c...,,,,,,,,,http://creativecommons.org/licenses/by-nc-sa/4.0/,
3,1572324717,Sound,audio/mpeg,http://www.tierstimmenarchiv.de/recordings/Cre...,http://www.tierstimmenarchiv.de/webinterface/c...,,,,,,,,,http://creativecommons.org/licenses/by-nc-sa/4.0/,
4,1572324716,Sound,audio/mpeg,http://www.tierstimmenarchiv.de/recordings/Cre...,http://www.tierstimmenarchiv.de/webinterface/c...,,,,,,,,,http://creativecommons.org/licenses/by-nc-sa/4.0/,


Most columns in `multimedia.txt` are not needed here. We only need **gbifID**, the key shared with the occurrence table, and **identifier**, which holds the URL of the audio file.

In [5]:
multimedia = multimedia[["gbifID", "identifier"]]
multimedia.head()

Unnamed: 0,gbifID,identifier
0,1572324720,http://www.tierstimmenarchiv.de/recordings/Cre...
1,1572324719,http://www.tierstimmenarchiv.de/recordings/Cre...
2,1572324718,http://www.tierstimmenarchiv.de/recordings/Cre...
3,1572324717,http://www.tierstimmenarchiv.de/recordings/Cre...
4,1572324716,http://www.tierstimmenarchiv.de/recordings/Cre...


Now we combine the two datasets by merging them on **gbifID**.

In [6]:
df = multimedia.merge(ocurrences, on="gbifID", how="inner")
df.head()

Unnamed: 0,gbifID,identifier,species,genus,family,class,phylum
0,1572324720,http://www.tierstimmenarchiv.de/recordings/Cre...,Crex crex,Crex,Rallidae,Aves,Chordata
1,1572324719,http://www.tierstimmenarchiv.de/recordings/Cre...,Crex crex,Crex,Rallidae,Aves,Chordata
2,1572324718,http://www.tierstimmenarchiv.de/recordings/Cre...,Crex crex,Crex,Rallidae,Aves,Chordata
3,1572324717,http://www.tierstimmenarchiv.de/recordings/Cre...,Crex crex,Crex,Rallidae,Aves,Chordata
4,1572324716,http://www.tierstimmenarchiv.de/recordings/Cre...,Crex crex,Crex,Rallidae,Aves,Chordata
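
Before relying on an inner merge, it can help to check how many keys match on each side. The sketch below uses two hypothetical miniature tables (`multimedia_demo` and `occurrences_demo`); the `validate="one_to_one"` argument encodes an assumption that each **gbifID** appears once per table, which may not hold if an occurrence has several recordings.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables.
multimedia_demo = pd.DataFrame(
    {"gbifID": [1, 2, 3], "identifier": ["a.mp3", "b.mp3", "c.mp3"]}
)
occurrences_demo = pd.DataFrame(
    {"gbifID": [1, 2, 4], "species": ["Crex crex", "Upupa epops", "Accipiter nisus"]}
)

# An outer merge with indicator=True adds a "_merge" column that says whether
# each gbifID was found in the left table, the right table, or both.
# validate="one_to_one" raises MergeError if a gbifID repeats on either side.
check = multimedia_demo.merge(
    occurrences_demo,
    on="gbifID",
    how="outer",
    indicator=True,
    validate="one_to_one",
)
counts = check["_merge"].value_counts()
print(counts)

# An inner merge keeps exactly the rows marked "both".
merged = check[check["_merge"] == "both"].drop(columns="_merge")
```

Rows marked `left_only` or `right_only` are the ones an inner merge silently discards, so their counts show how much data the join loses.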


In [7]:
df.shape

(16385, 7)

The next step is to remove the samples with `NaN` in any column.

In [8]:
df = df.dropna()
df.shape

(16270, 7)
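
To see which columns are responsible for the dropped rows, the missing values can be counted per column first. This is a small sketch on hypothetical rows (`demo` is not the real dataframe), showing `isna().sum()` next to `dropna()`.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the real dataframe has the same kind of taxonomy columns.
demo = pd.DataFrame(
    {
        "gbifID": [1, 2, 3],
        "species": ["Crex crex", np.nan, "Upupa epops"],
        "genus": ["Crex", np.nan, "Upupa"],
        "family": ["Rallidae", "Rallidae", "Upupidae"],
    }
)

# Number of missing values in each column.
missing_per_column = demo.isna().sum()
print(missing_per_column)

# dropna() removes every row that has at least one missing value.
cleaned = demo.dropna()
print(len(demo) - len(cleaned), "rows dropped")
```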

## Downloading the audio files

Now that we have the links to all the audio files, we have to download and save them.

In [9]:
tqdm.pandas(desc="Downloading files")

In [10]:
def createAudioFile(row):
  """Download the recording of one row (if missing) and return its relative path."""
  ocurr_id = str(row["gbifID"])
  url = row["identifier"]

  # Folder layout follows the taxonomy: phylum/class/family/genus/species/
  folder = '{}/{}/{}/{}/{}/'.format(
    row["phylum"], row["class"], row["family"], row["genus"], row["species"]
  ).replace(" ", "_")
  file_name = folder + ocurr_id + ".mp3"

  # Create the target folder; exist_ok replaces the separate existence check
  os.makedirs('../data/' + folder, exist_ok=True)

  path = "../data/" + file_name

  # Skip files that were already downloaded in a previous run
  if not os.path.exists(path):
    with open(path, "wb") as f:
      f.write(requests.get(url).content)

  return file_name
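
A download step like this writes whatever the server returns, including error pages. The following is a hedged sketch of a more defensive variant, not the function used in this notebook: `download_file`, its injectable `get` parameter, and the 30-second timeout are all hypothetical choices made for illustration.

```python
import os
import tempfile

import requests


def download_file(url, path, get=requests.get, timeout=30):
    """Download url to path; return True on success, False on a request error.

    `get` is injectable so the function can be exercised without network access.
    """
    if os.path.exists(path):
        return True  # already downloaded, same skip rule as createAudioFile

    target_dir = os.path.dirname(path) or "."
    os.makedirs(target_dir, exist_ok=True)
    try:
        response = get(url, timeout=timeout)
        response.raise_for_status()  # turn HTTP 4xx/5xx into an exception
    except requests.RequestException:
        return False  # nothing is written, so no broken file is left behind

    # Write to a temporary file first, then rename it, so an interrupted run
    # never leaves a half-written .mp3 at the final path.
    fd, tmp_path = tempfile.mkstemp(dir=target_dir)
    with os.fdopen(fd, "wb") as f:
        f.write(response.content)
    os.replace(tmp_path, path)
    return True
```

Returning a boolean instead of raising keeps the function usable inside `progress_apply`, where one failed request should not stop the whole run.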

In [11]:
df["file_name"] = df.progress_apply(createAudioFile, axis=1)
df.head()

Downloading files:   0%|          | 0/16270 [00:00<?, ?it/s]

Unnamed: 0,gbifID,identifier,species,genus,family,class,phylum,file_name
0,1572324720,http://www.tierstimmenarchiv.de/recordings/Cre...,Crex crex,Crex,Rallidae,Aves,Chordata,Chordata/Aves/Rallidae/Crex/Crex_crex/15723247...
1,1572324719,http://www.tierstimmenarchiv.de/recordings/Cre...,Crex crex,Crex,Rallidae,Aves,Chordata,Chordata/Aves/Rallidae/Crex/Crex_crex/15723247...
2,1572324718,http://www.tierstimmenarchiv.de/recordings/Cre...,Crex crex,Crex,Rallidae,Aves,Chordata,Chordata/Aves/Rallidae/Crex/Crex_crex/15723247...
3,1572324717,http://www.tierstimmenarchiv.de/recordings/Cre...,Crex crex,Crex,Rallidae,Aves,Chordata,Chordata/Aves/Rallidae/Crex/Crex_crex/15723247...
4,1572324716,http://www.tierstimmenarchiv.de/recordings/Cre...,Crex crex,Crex,Rallidae,Aves,Chordata,Chordata/Aves/Rallidae/Crex/Crex_crex/15723247...


Some of the downloaded files cannot be opened as audio, so we have to find and remove them.

In [12]:
tqdm.pandas(desc="Dropping corrupted files")

In [13]:
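def isCorrupted(row):
  """Return True if the audio file of this row cannot be loaded by torchaudio."""
  try:
    with warnings.catch_warnings():
      # Silence decoder warnings; only loading failures matter here
      warnings.simplefilter("ignore")
      torchaudio.load("../data/" + row.file_name)
      return False
  except Exception:
    # Any loading error marks the file as corrupted
    return True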

In [14]:
df["is_corrupted"] = df.progress_apply(isCorrupted, axis=1)

Dropping corrupted files:   0%|          | 0/16270 [00:00<?, ?it/s]

formats: can't open input file `../data/Chordata/Aves/Picidae/Mulleripicus/Mulleripicus_pulverulentus/1571067681.mp3': 
formats: can't open input file `../data/Chordata/Aves/Picidae/Mulleripicus/Mulleripicus_pulverulentus/1571067680.mp3': 
formats: can't open input file `../data/Chordata/Aves/Phylloscopidae/Phylloscopus/Phylloscopus_trochilus/1291660145.mp3': 
formats: can't open input file `../data/Chordata/Aves/Upupidae/Upupa/Upupa_epops/1229954459.mp3': 
formats: can't open input file `../data/Chordata/Aves/Upupidae/Upupa/Upupa_epops/1229954458.mp3': 
formats: can't open input file `../data/Chordata/Aves/Upupidae/Upupa/Upupa_epops/1229954457.mp3': 
formats: can't open input file `../data/Chordata/Aves/Bucerotidae/Tockus/Tockus_nasutus/1229954456.mp3': 
formats: can't open input file `../data/Chordata/Aves/Bucerotidae/Tockus/Tockus_deckeni/1229954455.mp3': 
formats: can't open input file `../data/Chordata/Aves/Bucerotidae/Tockus/Tockus_nasutus/1229954450.mp3': 
formats: can't open in

formats: can't open input file `../data/Chordata/Aves/Bucerotidae/Buceros/Buceros_rhinoceros/1229950867.mp3': 
formats: can't open input file `../data/Chordata/Aves/Bucerotidae/Berenicornis/Berenicornis_comatus/1229950854.mp3': 
formats: can't open input file `../data/Chordata/Aves/Bucerotidae/Berenicornis/Berenicornis_comatus/1229950852.mp3': 
formats: can't open input file `../data/Chordata/Aves/Bucerotidae/Berenicornis/Berenicornis_comatus/1229950851.mp3': 
formats: can't open input file `../data/Chordata/Aves/Bucerotidae/Berenicornis/Berenicornis_comatus/1229950848.mp3': 
formats: can't open input file `../data/Chordata/Aves/Bucerotidae/Anthracoceros/Anthracoceros_malayanus/1229950793.mp3': 
formats: can't open input file `../data/Chordata/Aves/Bucerotidae/Anthracoceros/Anthracoceros_coronatus/1229950792.mp3': 
formats: can't open input file `../data/Chordata/Aves/Bucerotidae/Anthracoceros/Anthracoceros_coronatus/1229950791.mp3': 
formats: can't open input file `../data/Chordata/Av

In [15]:
df.head(10)

Unnamed: 0,gbifID,identifier,species,genus,family,class,phylum,file_name,is_corrupted
0,1572324720,http://www.tierstimmenarchiv.de/recordings/Cre...,Crex crex,Crex,Rallidae,Aves,Chordata,Chordata/Aves/Rallidae/Crex/Crex_crex/15723247...,False
1,1572324719,http://www.tierstimmenarchiv.de/recordings/Cre...,Crex crex,Crex,Rallidae,Aves,Chordata,Chordata/Aves/Rallidae/Crex/Crex_crex/15723247...,False
2,1572324718,http://www.tierstimmenarchiv.de/recordings/Cre...,Crex crex,Crex,Rallidae,Aves,Chordata,Chordata/Aves/Rallidae/Crex/Crex_crex/15723247...,False
3,1572324717,http://www.tierstimmenarchiv.de/recordings/Cre...,Crex crex,Crex,Rallidae,Aves,Chordata,Chordata/Aves/Rallidae/Crex/Crex_crex/15723247...,False
4,1572324716,http://www.tierstimmenarchiv.de/recordings/Cre...,Crex crex,Crex,Rallidae,Aves,Chordata,Chordata/Aves/Rallidae/Crex/Crex_crex/15723247...,False
5,1572324715,http://www.tierstimmenarchiv.de/recordings/Cre...,Crex crex,Crex,Rallidae,Aves,Chordata,Chordata/Aves/Rallidae/Crex/Crex_crex/15723247...,False
6,1572324714,http://www.tierstimmenarchiv.de/recordings/Cre...,Crex crex,Crex,Rallidae,Aves,Chordata,Chordata/Aves/Rallidae/Crex/Crex_crex/15723247...,False
7,1571202165,http://www.tierstimmenarchiv.de/recordings/Phy...,Phylloscopus ibericus,Phylloscopus,Phylloscopidae,Aves,Chordata,Chordata/Aves/Phylloscopidae/Phylloscopus/Phyl...,False
8,1571067681,http://www.tierstimmenarchiv.de/recordings/Mül...,Mulleripicus pulverulentus,Mulleripicus,Picidae,Aves,Chordata,Chordata/Aves/Picidae/Mulleripicus/Mulleripicu...,True
9,1571067680,http://www.tierstimmenarchiv.de/recordings/Mül...,Mulleripicus pulverulentus,Mulleripicus,Picidae,Aves,Chordata,Chordata/Aves/Picidae/Mulleripicus/Mulleripicu...,True
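
Before dropping the flagged rows, it can be useful to see how many there are and where they concentrate. This is a minimal sketch on hypothetical rows (`demo` only mirrors the `family` and `is_corrupted` columns of the dataframe above).

```python
import pandas as pd

# Hypothetical rows mirroring two columns of the dataframe above.
demo = pd.DataFrame(
    {
        "family": ["Rallidae", "Rallidae", "Picidae", "Picidae", "Upupidae"],
        "is_corrupted": [False, False, True, True, False],
    }
)

# Total number of files that failed to load.
n_corrupted = int(demo["is_corrupted"].sum())

# Share of unreadable files per family, highest first.
corrupted_share = (
    demo.groupby("family")["is_corrupted"].mean().sort_values(ascending=False)
)
print(n_corrupted)
print(corrupted_share)
```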


In [16]:
df = df.drop(df[df.is_corrupted].index).drop(columns="is_corrupted")

In [17]:
df.shape

(16172, 8)

Finally, we save the cleaned dataframe to a CSV file.

In [18]:
df = df.reset_index(drop=True)

In [19]:
df.to_csv("../datasets/AnimalSoundFull.csv", index=False)

In [20]:
df

Unnamed: 0,gbifID,identifier,species,genus,family,class,phylum,file_name
0,1572324720,http://www.tierstimmenarchiv.de/recordings/Cre...,Crex crex,Crex,Rallidae,Aves,Chordata,Chordata/Aves/Rallidae/Crex/Crex_crex/15723247...
1,1572324719,http://www.tierstimmenarchiv.de/recordings/Cre...,Crex crex,Crex,Rallidae,Aves,Chordata,Chordata/Aves/Rallidae/Crex/Crex_crex/15723247...
2,1572324718,http://www.tierstimmenarchiv.de/recordings/Cre...,Crex crex,Crex,Rallidae,Aves,Chordata,Chordata/Aves/Rallidae/Crex/Crex_crex/15723247...
3,1572324717,http://www.tierstimmenarchiv.de/recordings/Cre...,Crex crex,Crex,Rallidae,Aves,Chordata,Chordata/Aves/Rallidae/Crex/Crex_crex/15723247...
4,1572324716,http://www.tierstimmenarchiv.de/recordings/Cre...,Crex crex,Crex,Rallidae,Aves,Chordata,Chordata/Aves/Rallidae/Crex/Crex_crex/15723247...
...,...,...,...,...,...,...,...,...
16167,779844260,http://www.tierstimmenarchiv.de/recordings/Acc...,Accipiter gentilis,Accipiter,Accipitridae,Aves,Chordata,Chordata/Aves/Accipitridae/Accipiter/Accipiter...
16168,779844259,http://www.tierstimmenarchiv.de/recordings/Acc...,Accipiter gentilis,Accipiter,Accipitridae,Aves,Chordata,Chordata/Aves/Accipitridae/Accipiter/Accipiter...
16169,779844258,http://www.tierstimmenarchiv.de/recordings/Acc...,Accipiter gentilis,Accipiter,Accipitridae,Aves,Chordata,Chordata/Aves/Accipitridae/Accipiter/Accipiter...
16170,779844257,http://www.tierstimmenarchiv.de/recordings/Acc...,Accipiter nisus,Accipiter,Accipitridae,Aves,Chordata,Chordata/Aves/Accipitridae/Accipiter/Accipiter...
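
A quick way to confirm that the saved file can be read back unchanged is a round trip through `to_csv` and `read_csv`. The sketch below uses a hypothetical two-row dataframe and a temporary directory rather than the real `../datasets/` path.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row dataframe with the same kind of columns.
demo = pd.DataFrame(
    {
        "gbifID": [1572324720, 1572324719],
        "species": ["Crex crex", "Crex crex"],
        "file_name": ["Chordata/a.mp3", "Chordata/b.mp3"],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "demo.csv")
    # index=False keeps the row index out of the file, so reading it back
    # does not produce an extra "Unnamed: 0" column.
    demo.to_csv(csv_path, index=False)
    reloaded = pd.read_csv(csv_path)

print(reloaded.columns.tolist())
```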
