# Dataset Generator

The models we design are only as good as the data fed into them. No matter how well a model or network is designed, the efficacy of its predictions is limited by the integrity of its data: "garbage in, garbage out". Since no open-access dataset large enough for our needs exists, we created our own. The code in this notebook lets you interactively create a dataset using the same process we used. To generate an identical dataset, leave all parameters at their default settings.

## Clone Repository

In order to run the code in this notebook you will need to clone [this git repository](https://github.com/JerameyATyler/cepbimo.git). The following steps will walk you through cloning the repo.

### Running in Google Colab

If running this notebook from Google Colab you will need to perform a few additional steps first.
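
Whether the Colab-specific cells apply can be checked programmatically. The following is a minimal sketch, assuming that the presence of the `google.colab` package is a good enough signal; the helper name `running_in_colab` is hypothetical and not part of the repository.

```python
import importlib.util


def running_in_colab() -> bool:
    """Return True when the `google.colab` package can be found."""
    try:
        return importlib.util.find_spec("google.colab") is not None
    except ModuleNotFoundError:
        # Raised when the parent `google` package itself is not installed.
        return False


if running_in_colab():
    print("Colab detected: run the Google Drive cells below.")
else:
    print("Colab not detected: the Google Drive cells can be skipped.")
```

Outside Colab, the Google Drive cells can be skipped and the repository cloned into any working directory.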

#### Mount Google Drive

In [None]:
# @title Mounting Google Drive
# @markdown We need to mount a Google Drive in order to download the code we need.
# @markdown > Import `drive` from `google.colab`
from google.colab import drive
# @markdown > Mount Google Drive
drive.mount('/content/gdrive')

#### Make Directory for Repository

In [None]:
# @markdown > Create a directory for the code.
%mkdir -p "gdrive/MyDrive/Colab Notebooks/"

#### Change Directories

In [None]:
# @markdown > Change directories to the new one
%cd "gdrive/MyDrive/Colab Notebooks/"

#### Clone the Git Repository

In [None]:
# @title Cloning Git Repository
# @markdown Next we need to clone the Git repository
# @markdown > Clone the repository
! git clone https://github.com/JerameyATyler/cepbimo.git

In [None]:
# @markdown > Change directory to the repository
%cd cepbimo/cepbimo

#### Install Dependencies

In [None]:
!pip install pydub
!pip install --upgrade matplotlib

In [None]:
!apt install -y ffmpeg
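
pydub relies on an external ffmpeg (or libav) binary to decode compressed formats such as MP3, which is the format of the recordings downloaded later. A quick way to confirm that the installs above succeeded is sketched here; the helper `missing_dependencies` is a hypothetical name introduced for illustration, and the default lists simply mirror the packages installed in the cells above.

```python
import importlib.util
import shutil


def missing_dependencies(modules=("pydub", "matplotlib"), executables=("ffmpeg",)):
    """Return the names of Python modules and executables that cannot be found."""
    # A top-level module that is not installed yields a spec of None.
    missing = [m for m in modules if importlib.util.find_spec(m) is None]
    # shutil.which returns None when the executable is not on the PATH.
    missing += [e for e in executables if shutil.which(e) is None]
    return missing


missing = missing_dependencies()
print("Missing:", ", ".join(missing) if missing else "nothing")
```

If anything is reported as missing, rerun the corresponding install cell before continuing.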

## Data sources

In [None]:
# Create the data directories; `-p` makes the cell safe to rerun
%mkdir -p data/anechoic/ data/hrtf/ data/reflections/
%cd data

### Anechoic Data

To generate a dataset of reflections, we need to start with anechoic data. A dataset of anechoic orchestral recordings is available thanks to [(Pätynen et al., 2008)](#patynen-lokki-08).
***
> 1. [Mozart's An aria of Donna Elvira](https://mediatech.aalto.fi/images/research/virtualacoustics/recordings/mozart_mp3.zip)
>   - Flute, clarinet, bassoon, french horns (1-2), violin (I), violin (II), viola, cello, contrabass, soprano (soloist)
> 2. [Beethoven's Symphony no. 7, I movement, bars 1-53](https://mediatech.aalto.fi/images/research/virtualacoustics/recordings/beethoven_mp3.zip)
>   - Flutes (1-2), oboes (1-2), clarinets (1-2), bassoon (1-2), french horns (1-2),
>     trumpets (1-3), timpani, violin (I), violin (II), viola, cello, contrabass
> 3. [Bruckner's Symphony no. 8, II movement, bars 1-61](https://mediatech.aalto.fi/images/research/virtualacoustics/recordings/bruckner_mp3.zip)
>   - Flutes (1-3), oboes (1-3), clarinets (1-3), bassoon (1-3),
>     french horns (1-8), trumpets (1-3), trombones (1-3), tuba, timpani,
>     violin (I) (two divisi), violin (II) (two divisi), viola (two divisi),
>     cello (two divisi), contrabass (two divisi)
> 4. [Mahler's Symphony no. 1, IV movement, bars 1-85](https://mediatech.aalto.fi/images/research/virtualacoustics/recordings/mahler_mp3.zip)
>   - Piccolo (1-2) (fl1), flutes (1-2) (fl3), oboes (1-4), clarinets (1-4),
>     bassoon (1-3), french horns (1-7), trumpets (1-4), tuba, timpani,
>     percussions (1-2), violin I (two divisi), violin II (two divisi),
>     viola, cello, contrabass
***
For more information on the anechoic data we use, see [this page](https://research.cs.aalto.fi//acoustics/virtual-acoustics/research/acoustic-measurement-and-analysis/85-anechoic-recordings.html).

In [None]:
# @title Anechoic Data
# @markdown Anechoic data recorded by [(Pätynen et al., 2008)](#patynen-lokki-08)
# @markdown > Links to anechoic recordings
file_links = [
    'https://mediatech.aalto.fi/images/research/virtualacoustics/recordings/beethoven_mp3.zip',
    'https://mediatech.aalto.fi/images/research/virtualacoustics/recordings/mozart_mp3.zip',
    'https://mediatech.aalto.fi/images/research/virtualacoustics/recordings/bruckner_mp3.zip',
    'https://mediatech.aalto.fi/images/research/virtualacoustics/recordings/mahler_mp3.zip']
# @markdown > Download files if they don't exist
for link in file_links:
    !wget -nc {link}
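
Outside Colab, `wget` may not be available. The skip-if-present behaviour of `wget -nc` can be approximated with the standard library. This is a rough sketch, not the notebook's own method: `download_if_missing` and its `fetch` parameter are hypothetical names introduced here, and the file name is assumed to be the last component of the URL path.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlretrieve


def download_if_missing(url, dest_dir=".", fetch=urlretrieve):
    """Download `url` into `dest_dir` unless a file with that name already exists.

    Returns a `(path, downloaded)` pair, where `downloaded` is False when the
    existing file was kept.
    """
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    # Assumption: the archive name is the final component of the URL path.
    target = dest_dir / Path(urlparse(url).path).name
    if target.exists():
        return target, False  # Keep the existing file, like `wget -nc`.
    fetch(url, str(target))
    return target, True
```

Calling `download_if_missing(link)` for each entry of `file_links` would fetch only the archives that are not yet in the current directory. The `fetch` argument exists so the network call can be swapped out, for example when testing.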

### Head-related Transfer Functions

Next we need a set of head-related transfer functions (HRTFs). One is available thanks to [(Gardner & Martin, 1994)](#gardner-martin-94) and the MIT Media Lab.
***
For more information, see [KEMAR HRTF measurements](https://sound.media.mit.edu/resources/KEMAR.html).

In [None]:
# @title HRTF Data
# @markdown HRTF data recorded by [(Gardner & Martin, 1994)](#gardner-martin-94)
!wget -nc "https://sound.media.mit.edu/resources/KEMAR/full.zip"

### Extracting Data

Extract the archived data to the `data/` directory.
***
**Warning: Overwrites existing files**

In [None]:
# @title Extracting Data
# @markdown Extract zipped data to `data/` directory.
# @markdown > List of files to unzip
zip_files = [
    ['beethoven_mp3.zip', 'anechoic/beethoven'],
    ['mozart_mp3.zip', 'anechoic/mozart'],
    ['bruckner_mp3.zip', 'anechoic/bruckner'],
    ['mahler_mp3.zip', 'anechoic/mahler'],
    ['full.zip', 'hrtf']]
# @markdown > Unzip each archive into its target directory, overwriting existing files
for archive, target in zip_files:
    !unzip -o {archive} -d {target}
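
Where the `unzip` command is unavailable, the same extraction can be sketched with Python's `zipfile` module. The helper `extract_archive` is a hypothetical name, not part of the repository. Like `unzip -o`, `ZipFile.extractall` overwrites files that already exist in the target directory.

```python
import zipfile
from pathlib import Path


def extract_archive(archive, dest_dir):
    """Extract every member of `archive` into `dest_dir`, overwriting existing files.

    Returns the list of member names stored in the archive.
    """
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest_dir)
        return zf.namelist()
```

With the list defined in the cell above, this would be called as `extract_archive(archive, target)` for each `[archive, target]` pair in `zip_files`.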

In [None]:
%cd ..

## Dataset Parameters

In [None]:
from ipywidgets import widgets
from data_picker import DataPicker

dp = DataPicker()
dp.ui()

## References

* Gardner, B., & Martin, K. (1994). HRTF measurements of a KEMAR dummy-head microphone. MIT Media Lab Perceptual Computing Technical Report, 280, 1-7.
<a name='gardner-martin-94'></a>

* Pätynen, J., Pulkki, V., & Lokki, T. (2008). Anechoic recording system for symphony orchestra. Acta Acustica united with Acustica, 94(6), 856-865.
<a name='patynen-lokki-08'></a>