# Data Processing

The first step in data processing is obtaining and preparing the data. We used the Animal Sound Archive, which is available in the `animal-sound` folder.

That folder contains many files, but only two are relevant here: `occurrence.txt` and `multimedia.txt`. The first holds information about each occurrence, and the second holds the links to the audio files.

## Imports

In [None]:
import pandas as pd
import dask.dataframe as dd
from dask.multiprocessing import get

import numpy as np
from tqdm.notebook import tqdm

import requests

import os
import shutil

import librosa

import warnings

import torchaudio
import torchaudio.transforms as T

import matplotlib.pyplot as plt


## Merging dataframes

Before anything else, we have to merge the relevant data from the two files.

In [None]:
ocurrences = pd.read_csv("../animal-sound/occurrence.txt",
                         delimiter="\t")
ocurrences.head()

We can see that the `occurrence.txt` file has many columns, most of them filled with `NaN`. We keep only the columns that are relevant to the problem:
 - **gbifID**
 - **species**
 - **genus**
 - **family**
 - **class**
 - **phylum**

In [None]:
ocurrences = ocurrences[["gbifID", "species", "genus", "family", "class", "phylum"]]
ocurrences.head()
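
To back the claim that most columns are mostly `NaN`, one option is to measure the share of missing values per column before choosing which ones to keep. The sketch below uses a small made-up table rather than the real file; the column `recordedBy` and all values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the occurrence table (values are made up for illustration).
toy = pd.DataFrame({
    "gbifID": [1, 2, 3, 4],
    "species": ["Parus major", "Parus major", np.nan, "Bufo bufo"],
    "recordedBy": [np.nan, np.nan, np.nan, np.nan],  # hypothetical empty column
})

# Fraction of missing values per column, from emptiest to fullest.
nan_share = toy.isna().mean().sort_values(ascending=False)
print(nan_share)

# Columns that are entirely empty carry no information.
empty_columns = nan_share[nan_share == 1.0].index.tolist()
print(empty_columns)
```

Running the same two lines on the real dataframe would show which columns are worth keeping.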

In [None]:
multimedia = pd.read_csv("../animal-sound/multimedia.txt",
                         delimiter="\t")
multimedia.head()

The `multimedia.txt` file also has columns we do not need. We keep only the **gbifID** and **identifier** columns.

In [None]:
multimedia = multimedia[["gbifID", "identifier"]]
multimedia.head()

Now we combine the two datasets by merging them on **gbifID**.

In [None]:
df = multimedia.merge(ocurrences, on="gbifID", how="inner")
df.head()
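
The inner merge keeps only the rows whose **gbifID** appears in both tables, and one occurrence may have several audio links. A minimal sketch on made-up data illustrates this; the `validate="many_to_one"` argument is an optional extra check (not used in the cell above) that raises an error if a **gbifID** is duplicated on the occurrence side.

```python
import pandas as pd

# Toy tables; the IDs, URLs and species are made up for illustration.
toy_multimedia = pd.DataFrame({
    "gbifID": [1, 1, 2, 9],
    "identifier": [
        "http://example.org/a.mp3",
        "http://example.org/b.mp3",
        "http://example.org/c.mp3",
        "http://example.org/d.mp3",
    ],
})
toy_occurrences = pd.DataFrame({
    "gbifID": [1, 2, 3],
    "species": ["Parus major", "Bufo bufo", "Rana temporaria"],
})

# Inner merge: gbifID 9 (no occurrence) and gbifID 3 (no audio) are dropped.
merged = toy_multimedia.merge(
    toy_occurrences,
    on="gbifID",
    how="inner",
    validate="many_to_one",  # each gbifID must be unique in toy_occurrences
)
print(merged)
```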

In [None]:
df.shape

The next step is to remove the samples with `NaN` in any column.

In [None]:
df = df.dropna()
df.shape
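
With its default arguments, `dropna` removes every row that has at least one missing value. It also leaves gaps in the index, which is why the index is reset before the final file is saved. The sketch below shows both effects on a made-up table.

```python
import numpy as np
import pandas as pd

# Toy merged table; values are made up for illustration.
toy = pd.DataFrame({
    "gbifID": [1, 2, 3, 4],
    "identifier": ["a.mp3", "b.mp3", "c.mp3", "d.mp3"],
    "species": ["Parus major", np.nan, "Bufo bufo", "Rana temporaria"],
})

# Number of missing values per column, useful to see what causes the drops.
missing_per_column = toy.isna().sum()

cleaned = toy.dropna()
rows_dropped = len(toy) - len(cleaned)

print(missing_per_column)
print(rows_dropped)
print(cleaned.index.tolist())  # the original labels are kept, with a gap
```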

## Downloading the audio files

Now that we have the links to all the audio files, we have to download and save them.

In [None]:
tqdm.pandas(desc="Downloading files")

In [None]:
def createAudioFile(row):
  """Download the audio of one row and return its path relative to ../data/."""
  ocurr_id = str(row["gbifID"])
  url = row["identifier"]

  # The folder layout mirrors the taxonomy: phylum/class/family/genus/species/
  folder = "{}/{}/{}/{}/{}/".format(
      row["phylum"], row["class"], row["family"], row["genus"], row["species"]
  ).replace(" ", "_")

  file_name = folder + ocurr_id + ".mp3"

  # exist_ok=True makes a separate existence check unnecessary
  os.makedirs("../data/" + folder, exist_ok=True)

  path = "../data/" + file_name

  # Skip files that were already downloaded in a previous run
  if not os.path.exists(path):
    # Fetch before opening the file, so a failed request leaves no empty file;
    # the 60-second timeout is an arbitrary choice that prevents hanging forever
    content = requests.get(url, timeout=60).content
    with open(path, "wb") as f:
      f.write(content)

  return file_name

df.head()

In [None]:
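# Creating dask dataframe
ddf = dd.from_pandas(df, npartitions=16)

# Sequential alternative:
# df["file_name"] = df.progress_apply(createAudioFile, axis=1)

# Download in parallel, partition by partition; `meta` declares the output
# so that dask does not call createAudioFile on dummy data to infer it
res = ddf.map_partitions(
    lambda part: part.apply(createAudioFile, axis=1),
    meta=("file_name", "object"),
).compute(scheduler=get)

# Store the relative paths; the corruption check below reads row.file_name
df["file_name"] = res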
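
An interrupted download can leave a partial file on disk, and a later run would then skip it because the path already exists. One way to avoid this, sketched below, is to write to a temporary file and rename it only after the write succeeds. The function name `download_atomically`, the `.part` suffix, and the injected `fetch` callable are assumptions made for this illustration, and the demo uses fake fetchers instead of real network calls.

```python
import os
import tempfile


def download_atomically(url, path, fetch):
    """Write fetch(url) to path through a temporary file.

    Returns True if the file was written, False if it already existed.
    `fetch` is any callable that takes a URL and returns bytes.
    """
    if os.path.exists(path):
        return False  # already downloaded in a previous run

    directory = os.path.dirname(path)
    if directory:
        os.makedirs(directory, exist_ok=True)

    tmp_path = path + ".part"  # hypothetical suffix for unfinished downloads
    try:
        with open(tmp_path, "wb") as f:
            f.write(fetch(url))
        os.replace(tmp_path, path)  # rename only after a complete write
    except Exception:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)  # leave nothing behind on failure
        raise
    return True


def fake_fetch(url):
    return b"fake audio bytes"


def failing_fetch(url):
    raise ConnectionError("simulated network failure")


with tempfile.TemporaryDirectory() as root:
    good = os.path.join(root, "Chordata", "Aves", "1.mp3")
    first_call = download_atomically("http://example.org/1.mp3", good, fake_fetch)
    second_call = download_atomically("http://example.org/1.mp3", good, fake_fetch)

    bad = os.path.join(root, "Chordata", "Aves", "2.mp3")
    try:
        download_atomically("http://example.org/2.mp3", bad, failing_fetch)
        failure_left_file = True  # not reached: the call above raises
    except ConnectionError:
        failure_left_file = os.path.exists(bad) or os.path.exists(bad + ".part")

print(first_call, second_call, failure_left_file)
```

In the notebook, `fetch` could be a small wrapper around `requests.get` that returns the response content.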


Next, we have to remove the corrupted files, that is, the ones that cannot be loaded as audio.

In [None]:
tqdm.pandas(desc="Dropping corrupted files")

In [None]:
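def isCorrupted(row):
  """Return True if the audio file of this row cannot be loaded."""
  try:
    # Silence decoder warnings; only hard failures matter here
    with warnings.catch_warnings():
      warnings.simplefilter("ignore")
      torchaudio.load("../data/" + row.file_name)
      return False
  except Exception:
    # Any loading error marks the file as corrupted
    return True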

In [None]:
df["is_corrupted"] = df.progress_apply(isCorrupted, axis=1)
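
Decoding every file is slow, so a cheap pre-check can flag the most obvious failures first: files that are missing or have zero bytes. This is only a sketch of a possible complement to the decoding check above, not a replacement for it; the helper name `is_missing_or_empty` is an assumption made for this illustration.

```python
import os
import tempfile


def is_missing_or_empty(path):
    """Return True if path is not a regular file or has zero bytes."""
    return not os.path.isfile(path) or os.path.getsize(path) == 0


with tempfile.TemporaryDirectory() as root:
    empty_path = os.path.join(root, "empty.mp3")
    filled_path = os.path.join(root, "filled.mp3")
    absent_path = os.path.join(root, "absent.mp3")

    # Create one empty file and one file with some bytes in it.
    open(empty_path, "wb").close()
    with open(filled_path, "wb") as f:
        f.write(b"some bytes")

    empty_flag = is_missing_or_empty(empty_path)
    filled_flag = is_missing_or_empty(filled_path)
    absent_flag = is_missing_or_empty(absent_path)

print(empty_flag, filled_flag, absent_flag)
```

A file that passes this check can still fail to decode, so the `torchaudio.load` test remains the deciding one.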

In [None]:
df.head(10)

In [None]:
# Keep only the rows whose audio file loaded correctly
df = df[~df.is_corrupted].drop(columns="is_corrupted")

In [None]:
df.shape

Finally, we reset the index and save the dataframe to a CSV file.

In [None]:
df = df.reset_index(drop=True)

In [None]:
df.to_csv("../datasets/AnimalSoundFull.csv", index=False)