# Data Preparation from Macaulay Library of Cornell Lab and Xeno-Canto
> *If you cannot see the HTML rendering (e.g. colors), view this notebook [here](https://nbviewer.jupyter.org/github/Mipanox/Bird_cocktail/blob/master/notebooks/data_preparation.ipynb)*

_(Dated: 02/21/18)_

In the current implementation, we are interested in classifying ~300 species of birds in California, U.S.A. 
To download recordings of only the birds found in this region, we first acquire species information 
(name, number of recordings, catalog number, etc.) from each website's search results. In this notebook, we show from scratch how to download the audio recordings and convert them to `.wav` files, ready for pre-processing.

The task requires the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) HTML-parsing package, which is used to scrape the search-result pages.

In [2]:
import sys
import numpy as np  # used below by np.save / np.load
sys.path.append('../codes/')
from data_util import *

## 1. Search the region
The following URLs point to the search results for the region (California):
- XC: [search](https://www.xeno-canto.org/explore?query=box%3A32.25%2C-125.771%2C42.294%2C-113.423+)
- ML: [search](http://macaulaylibrary.org/search?&asset_format_id=1000&collection_type_id=1&layout=1&quality1_id_min=3&quality1_id_max=5&taxon=Birds&taxon_id=11994031&taxon_rank_id=21&country_name=United%20States&country_id=211&state_name=California%20--%20US&state_id=3619&sort=1)
_(Note: For ML, only recordings with catalog numbers below 1,000,000 are used, as per the [terms](https://www.macaulaylibrary.org/about/request-media/))_
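
The XC link above encodes a bounding-box query. As a minimal sketch, assuming the `box:` filter takes its coordinates in the order `lat_min,lon_min,lat_max,lon_max`, the same URL could be rebuilt for another region as follows. The helper name `xc_box_url` is hypothetical and is not part of `data_util`.

```python
from urllib.parse import quote_plus

XC_EXPLORE = "https://www.xeno-canto.org/explore?query="

def xc_box_url(lat_min, lon_min, lat_max, lon_max):
    """Build an XC explore URL for a bounding box (hypothetical helper)."""
    # ASSUMPTION: coordinate order is lat_min, lon_min, lat_max, lon_max.
    # The trailing space mirrors the '+' at the end of the URL used in this notebook.
    query = f"box:{lat_min},{lon_min},{lat_max},{lon_max} "
    return XC_EXPLORE + quote_plus(query)

california_url = xc_box_url(32.25, -125.771, 42.294, -113.423)
print(california_url)
```

With the California coordinates, this should reproduce the XC search URL listed above.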

### (1) XC
#### Get the species names
To maximize the number of independent data points (i.e. recordings), the top ~300 species 
(sorted by frequency of occurrence in the search results) are chosen:
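
The ranking idea can be sketched as follows. This is only an illustration, assuming each search result contributes one species name to a flat list; `top_species` and the sample names are hypothetical, and the actual logic lives in `data_util`.

```python
from collections import Counter

def top_species(names, num_spe):
    """Return the `num_spe` most frequent species names, most frequent first."""
    counts = Counter(names)
    # most_common() sorts by count in descending order
    return [name for name, _ in counts.most_common(num_spe)]

# Hypothetical sample: one entry per recording found in the search results
sample = ["Song Sparrow", "Wrentit", "Song Sparrow", "Bewick's Wren",
          "Wrentit", "Song Sparrow"]
print(top_species(sample, 2))
```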

_(The following cell may take a long time to run)_

In [None]:
xcpreurl = 'https://www.xeno-canto.org/explore?query=box%3A32.25%2C-125.771%2C42.294%2C-113.423+'
XC_CA = XC_dn(preurl=xcpreurl,pglim=210,num_spe=400) # the XC search reports 372 species

__Note__: The `pglim` parameter limits the search to the first `pglim` result pages. Normally the last few pages should be excluded, since they contain 'identity unknown' and 'soundscape' results.
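
To make the role of a page limit concrete, here is a rough sketch of enumerating result pages. It assumes pages are addressed with a `pg` query parameter, which is an assumption about the site rather than something this notebook confirms; `page_urls` is a hypothetical helper and not the `XC_dn` implementation.

```python
def page_urls(preurl, pglim):
    """Yield the URLs of result pages 1..pglim (inclusive)."""
    for page in range(1, pglim + 1):
        # ASSUMPTION: result pages are selected with '&pg=<number>'
        yield f"{preurl}&pg={page}"

demo_url = ("https://www.xeno-canto.org/explore"
            "?query=box%3A32.25%2C-125.771%2C42.294%2C-113.423+")
urls = list(page_urls(demo_url, 3))
print(urls[-1])
```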

#### Store sorted species names
After running the cell above, the `specnmlist` attribute of the `XC_CA` object holds all the species names that occur in the search results. We sort them and store the sorted result in an `.npy` file, which saves us from running through the searches again should anything go wrong later, such as a failed download.

In [None]:
_ = XC_CA._spenm_one()
np.save('XC_CA_spenmlist_one.npy',XC_CA.spenmlist_one)
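
As a small sanity check of the storage format, the sketch below saves a list of names to an `.npy` file and reads it back. The sample names and the file name are hypothetical; only `np.save` and `np.load` are taken from the cells in this notebook.

```python
import os
import tempfile

import numpy as np

names = ["Song Sparrow", "Wrentit", "Bewick's Wren"]  # hypothetical sample

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "spenmlist_demo.npy")
    np.save(path, np.array(names))
    # A plain string array loads without allow_pickle=True
    restored = np.load(path).tolist()

print(restored)
```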

### (2) ML
The ML website turns out to be less convenient for downloading the recordings directly. Instead, we obtain the catalog numbers for each species from the downloadable `.csv` files.
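
Reading catalog numbers out of such a file might look like the sketch below. The header name `Catalog Number` and the sample rows are assumptions made for illustration; check the header of an actual ML export before relying on them. `catalog_numbers` is a hypothetical helper, not part of `data_util`.

```python
import csv
import io

def catalog_numbers(csv_text, column="Catalog Number"):
    """Extract catalog numbers from CSV text; `column` is an ASSUMED header name."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Skip rows where the column is missing or empty
    return [row[column].strip() for row in reader if row.get(column)]

# Hypothetical sample content, not real ML data
sample_csv = "Catalog Number,Common Name\n12345,Wrentit\n67890,Wrentit\n"
print(catalog_numbers(sample_csv))
```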

#### Get the species names

In [None]:
mlcapreurl = 'http://macaulaylibrary.org/search?&asset_format_id=1000&collection_type_id=1&layout=1&quality1_id_min=3&quality1_id_max=5&taxon=Birds&taxon_id=11994031&taxon_rank_id=21&country_name=United%20States&country_id=211&state_name=California%20--%20US&state_id=3619&sort=1'
ML_ca_star3 = ML_dn(preurl=mlcapreurl,pglim=36,num_spe=350)

#### Store sorted species names
Calling the following function sorts the species names, just as in the XC case.

__Note:__ The ML website restricts the number of entries in each `.csv` file for non-members. To work around this, one may log in to the ML website in Firefox and pass the path of the Firefox profile to the `profile_path` argument of the `get_specsv` function (see the commented code below).

In [None]:
ML_ca_star3.get_specsv()
## ML_ca_star3.get_specsv(profile_path='/Users/jasonhc/Library/Application Support/Firefox/Profiles/q20kmjze.default')

Again, save the species names (one might as well use the same species as in XC):

In [None]:
np.save('ML_ca_spenmlist_one.npy',ML_ca_star3.spenmlist_one)

## 2. Download mp3
Given a species name list stored in an `.npy` file, one can skip the steps outlined above and download the recordings directly:

*Note that the species names must follow the convention used in each database (for handling whitespace, quotation marks, etc.). See the `.npy` [files](https://github.com/Mipanox/Bird_cocktail/tree/master/datasets/species_name_lists) for examples.*

In [None]:
## optional, if reading a pre-stored npy
### ML
ML_ca_star3 = ML_dn(preurl='https://',read_in=False)
ML_ca_star3.spenmlist_one = np.load('../datasets/species_name_lists/ML_ca_spenmlist_one.npy')
ML_ca_star3.get_specsv()

### XC
XC_CA.spenmlist_one = np.load('../datasets/species_name_lists/XC_CA_spenmlist_one.npy')

Then download:

In [None]:
### ML
ML_ca_star3.dn_mp3(path_csv='../datasets/csv_ca/',path_mp3='../datasets/mp3_ca')

### XC
XC_CA.dn_mp3(pglim=10,path_mp3='../datasets/mp3_ca')
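
Because large downloads tend to get interrupted, it helps if a rerun skips files that are already on disk. The sketch below illustrates that idea only; the folder layout `<path_mp3>/<species>/<id>.mp3` and both helper names are assumptions, and this is not how `dn_mp3` is known to work.

```python
import os
import urllib.request

def mp3_target(path_mp3, species, rec_id):
    """Destination path, ASSUMING a <path_mp3>/<species>/<rec_id>.mp3 layout."""
    return os.path.join(path_mp3, species, f"{rec_id}.mp3")

def fetch_mp3(url, dest):
    """Download `url` to `dest` unless it already exists; return True if downloaded."""
    if os.path.exists(dest):
        return False  # already downloaded in an earlier run
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    urllib.request.urlretrieve(url, dest)
    return True

print(mp3_target("../datasets/mp3_ca", "Wrentit", 12345))
```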

## 3. Convert to .wav (optional)
Unfortunately, some Python packages for audio processing require `.wav` files instead of `.mp3`. Below, we outline how to convert the downloaded `.mp3` files in bulk with a command-line script.

As a prerequisite, make sure we have `ffmpeg` installed on the machine:

In [None]:
! which ffmpeg
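
The same check can be done from Python, which is handy outside a notebook. This is a minimal sketch using only the standard library:

```python
import shutil

# shutil.which returns the full path of the executable, or None if it is not on PATH
ffmpeg_path = shutil.which("ffmpeg")
if ffmpeg_path is None:
    print("ffmpeg not found on PATH; install it before converting")
else:
    print("ffmpeg found at", ffmpeg_path)
```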

The following example shell [script](https://github.com/Mipanox/Bird_cocktail/blob/master/codes/to_wav.sh) converts the downloaded ML `.mp3` files to `.wav` and writes them to a separate folder, also structured by species.

```bash

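## root folder containing the mp3/ subfolder (adjust to your setup)
ppath='/scratch/users/jasonhc/bird_cocktail/datasets/ML/'

## clean up log (-f: no error if the log does not exist yet)
rm -f ffmpeg.log

## -p: no error if the folder already exists
mkdir -p "${ppath}wav/"

for d in "${ppath}"mp3/*; do
  ## navigate all species folders
  if [ -d "$d" ]; then
    ## mkdir in wav folder, named after the species folder
    species=$(basename "$d")
    wav_path="${ppath}wav/${species}"
    mkdir -p "${wav_path}"

    echo "Now converting for species: ${species}"

    for i in "$d"/*; do
      ## skip the unexpanded pattern when the folder is empty
      [ -e "$i" ] || continue
      ## quote paths so that names with spaces survive
      ffmpeg -i "$i" "${wav_path}/$(basename "$i" .mp3).wav" &>> ffmpeg.log
    done
  fi
done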
```
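
For those who prefer to stay in Python, the same conversion can be sketched with `subprocess` and `pathlib`. This is an illustrative equivalent rather than the project's script: the function names are hypothetical, it assumes the `<root>/<species>/<file>.mp3` layout used above, and it skips targets that already exist.

```python
import subprocess
from pathlib import Path

def wav_target(mp3_file, mp3_root, wav_root):
    """Map <mp3_root>/<species>/<name>.mp3 to <wav_root>/<species>/<name>.wav."""
    relative = Path(mp3_file).relative_to(mp3_root)
    return Path(wav_root) / relative.with_suffix(".wav")

def convert_all(mp3_root, wav_root):
    """Convert every .mp3 in the species folders with ffmpeg, skipping existing .wav files."""
    for mp3_file in sorted(Path(mp3_root).glob("*/*.mp3")):
        target = wav_target(mp3_file, mp3_root, wav_root)
        if target.exists():
            continue
        target.parent.mkdir(parents=True, exist_ok=True)
        # -nostdin keeps ffmpeg from waiting on terminal input in batch runs
        subprocess.run(["ffmpeg", "-nostdin", "-i", str(mp3_file), str(target)],
                       check=False, capture_output=True)

print(wav_target("mp3/Wrentit/123.mp3", "mp3", "wav"))
```

Calling `convert_all('<ppath>/mp3', '<ppath>/wav')` would then mirror the loop in the shell script; `check=False` lets the run continue past files that `ffmpeg` cannot decode.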