# Deep Learning for Audio Part 2a: pre-process UrbanSound Dataset

## Introduction

In this Jupyter notebook, we will process the audio files and extract the features that will be fed into a Convolutional Neural Network.



We will train and predict on the [UrbanSound8K](https://serv.cusp.nyu.edu/projects/urbansounddataset/download-urbansound8k.html) dataset. A few published benchmarks are reported in the papers below:

- [Environmental sound classification with convolutional neural networks](http://karol.piczak.com/papers/Piczak2015-ESC-ConvNet.pdf) by Karol J Piczak
- [Deep convolutional neural networks and data augmentation for environmental sound classification](https://arxiv.org/abs/1608.04363) by Justin Salamon and Juan Pablo Bello
- [Learning from Between-class Examples for Deep Sound Recognition](https://arxiv.org/abs/1711.10282) by Yuji Tokozume, Yoshitaka Ushiku and Tatsuya Harada


The state-of-the-art result is from the last paper, by Tokozume et al., where the best error rate achieved is 21.7%. In this tutorial we will show you how to build a neural network that can achieve state-of-the-art performance using Azure.


This Jupyter notebook borrows some of the pre-processing code from http://aqibsaeed.github.io/2016-09-24-urban-sound-classification-part-2/, but with a lot of modifications. It is tested with **Python 3.5**, **Keras 2.1.2** and **TensorFlow 1.4.0**.

## Setup

We will use librosa as our audio processing library. For more details on librosa, please refer to the librosa documentation [here](https://librosa.github.io/librosa/tutorial.html). We also need to install a number of other libraries. Most of them are Python packages, but you may still need to install a few audio processing libraries using apt-get:

`sudo apt-get install -y --no-install-recommends \
        openmpi-bin \
        build-essential \
        autoconf \
        libtool \
        libav-tools \
        pkg-config`
        
        
We also need to install librosa and a few deep learning libraries with pip:

`pip install librosa pydot graphviz keras tensorflow-gpu`


## Download dataset

Due to licensing issues, we cannot download the data directly. Please go to the [UrbanSound8K Download](https://serv.cusp.nyu.edu/projects/urbansounddataset/download-urbansound8k.html) site, fill in the required information, download the dataset from there, and put it in the right place. You need to update `parent_dir` and `save_dir` below. In this particular case, we don't need the label file, as the labels are already encoded in the file names. We will parse the labels directly from the file names.

PS: on a Data Science Virtual Machine, you may want to run

`
sudo chown -R <your username> /mnt
sudo chgrp -R <your username> /mnt
`
 
then move the downloaded UrbanSound8K.tar.gz dataset to `/mnt` and run

`
cd /mnt; tar xvzf UrbanSound8K.tar.gz
`
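
The file-name convention mentioned above can be sketched as follows. UrbanSound8K clips are named `[fsID]-[classID]-[occurrenceID]-[sliceID].wav`, so the class label is the second dash-separated field. `parse_class_id` is a hypothetical helper written for illustration (the notebook itself uses a one-line `split` inside `extract_features`), and the path below is only an illustrative example.

```python
import os


def parse_class_id(path):
    """Return the integer class ID encoded in an UrbanSound8K-style file name.

    Hypothetical helper: assumes the base name has exactly four
    dash-separated fields, [fsID]-[classID]-[occurrenceID]-[sliceID].
    """
    # Work on the base name only, so dashes or the word "fold" elsewhere
    # in the directory path cannot interfere with the parsing.
    stem = os.path.splitext(os.path.basename(path))[0]
    fs_id, class_id, occurrence_id, slice_id = stem.split("-")
    return int(class_id)


# Illustrative path; adjust to wherever you extracted the dataset.
example = "/mnt/UrbanSound8K/audio/fold1/100032-3-0-0.wav"
label = parse_class_id(example)
```

Parsing the base name rather than the full path keeps the result independent of where the dataset is stored.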



## Import libraries and initialize global variables

In [1]:
import glob
import os

import librosa
import numpy as np
from joblib import Parallel, delayed
# used to featurize the dataset
from scipy import signal

# number of classes; used for one-hot encoding and as the number of parallel jobs
num_total_classes = 10

# Where you have saved the UrbanSound8K data set. Must be an absolute path.
parent_dir = "/mnt/UrbanSound8K/audio"

# specify bands that you want to use. This is also the "height" of the spectrogram image
n_bands = 150
# specify frames that you want to use. This is also the "width" of the spectrogram image
n_frames = 150


# sample rate of the target files
sample_rate = 22050
# update this part to produce different images
save_dir = "/mnt/us8k-" + str(n_bands) + "bands-" + str(n_frames) + "frames-3channel"


## Preprocessing the Data

The length of the sliding window used to turn the audio into a mel spectrogram is chosen empirically: according to the [Environmental sound classification with convolutional neural networks](http://karol.piczak.com/papers/Piczak2015-ESC-ConvNet.pdf) paper by Piczak, longer windows seem to perform better than shorter ones. In this notebook, we read up to 4 seconds of each clip and slide the window with 50% overlap; together with `n_frames`, this determines the width of our spectrogram.


In [2]:
# Read wav helper method to force audio resampling
# duration is set for a 4 second clip
def read_audio(audio_path, target_fs=None, duration=4):
    (audio, fs) = librosa.load(audio_path, sr=None, duration=duration)
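    # librosa.load returns mono audio by default; this guard only matters if
    # mono=False is used, in which case the shape is (channels, samples)
    if audio.ndim > 1:
        audio = np.mean(audio, axis=0)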
    if target_fs is not None and fs != target_fs:
        audio = librosa.resample(audio, orig_sr=fs, target_sr=target_fs)
        fs = target_fs
    return audio, fs

def pad_trunc_seq_rewrite(x, max_len):
    """Pad or truncate a sequence data to a fixed length.

    Args:
      x: ndarray, input sequence data.
      max_len: integer, length of sequence to be padded or truncated.

    Returns:
      ndarray, Padded or truncated input sequence data.
    """

    if x.shape[1] < max_len:
        pad_shape = (x.shape[0], max_len - x.shape[1])
        pad = np.ones(pad_shape) * np.log(1e-8)
        #x_new = np.concatenate((x, pad), axis=1)
        x_new = np.hstack((x, pad))
    # no pad necessary - truncate
    else:
        x_new = x[:, 0:max_len]
    return x_new
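
As a quick check of the padding behaviour described in the docstring above, here is a minimal sketch that does the same job with `np.pad`. It is an equivalent formulation rather than the notebook's own code: `pad_or_truncate` and `LOG_FLOOR` are hypothetical names, and the input is assumed to be a 2-D array shaped `(bands, frames)`.

```python
import numpy as np

# Same floor value the notebook uses: log of the small constant that is
# added to the spectrogram before taking the logarithm.
LOG_FLOOR = np.log(1e-8)


def pad_or_truncate(x, max_len, pad_value=LOG_FLOOR):
    """Pad (on the right) or truncate a (bands, frames) array to max_len frames."""
    n_frames = x.shape[1]
    if n_frames < max_len:
        # Pad only the frame axis, filling with the log-domain "silence" value.
        return np.pad(
            x,
            ((0, 0), (0, max_len - n_frames)),
            mode="constant",
            constant_values=pad_value,
        )
    # Nothing to pad: keep the first max_len frames.
    return x[:, :max_len]


short_clip = np.zeros((3, 2))   # fewer frames than requested
long_clip = np.zeros((3, 6))    # more frames than requested
padded = pad_or_truncate(short_clip, 4)
truncated = pad_or_truncate(long_clip, 4)
```

Padding with `np.log(1e-8)` instead of zero keeps the padded frames consistent with what a silent frame looks like after the log transform.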


In [3]:
def extract_features(parent_dir, sub_dirs, bands, frames, file_ext="*.wav"):
    # 4 second clip with 50% window overlap with small offset to guarantee frames
    n_window = int(sample_rate * 4. / frames * 2) - 4 * 2
    # 50% overlap
    n_overlap = int(n_window / 2.)
    # Mel filter bank
    melW = librosa.filters.mel(sr=sample_rate, n_fft=n_window, n_mels=bands, fmin=0., fmax=8000.)
    # Hamming window
    ham_win = np.hamming(n_window)
    log_specgrams_list = []
    labels = []
    for l, sub_dir in enumerate(sub_dirs):
        for fn in glob.glob(os.path.join(parent_dir, sub_dir, file_ext)):
            # print("processing", fn)
            sound_clip, fn_fs = read_audio(fn, target_fs=sample_rate)
            assert (int(fn_fs) == sample_rate)

            # Skip empty (corrupted) wavs; check this first, otherwise the
            # length check below would catch them with a misleading message
            if sound_clip.shape[0] == 0:
                print("File %s is corrupted!" % fn)
                continue

            if sound_clip.shape[0] < n_window:
                print("File %s is shorter than window size - DISCARDING - look into making the window larger." % fn)
                continue

            # the class ID is the second dash-separated field of the file name
            label = fn.split('fold')[1].split('-')[1]

            # Compute spectrogram (use the public scipy.signal.spectrogram entry point)
            [f, t, x] = signal.spectrogram(
                x=sound_clip,
                window=ham_win,
                nperseg=n_window,
                noverlap=n_overlap,
                detrend=False,
                return_onesided=True,
                mode='magnitude')
            x = np.dot(x.T, melW.T)
            x = np.log(x + 1e-8)
            x = x.astype(np.float32).T
            x = pad_trunc_seq_rewrite(x, frames)

            log_specgrams_list.append(x)
            labels.append(label)

    log_specgrams = np.asarray(log_specgrams_list).reshape(len(log_specgrams_list), bands, frames, 1)
    features = np.concatenate((log_specgrams, np.zeros(np.shape(log_specgrams))), axis=3)
    features = np.concatenate((features, np.zeros(np.shape(log_specgrams))), axis=3)
    for i in range(len(features)):
        # first-order difference (delta), computed over a 9-step window
        features[i, :, :, 1] = librosa.feature.delta(features[i, :, :, 0])
        # second-order difference (delta of the delta); the third channel gives a
        # 3-channel "image" that can be fed to ResNet and similar architectures
        features[i, :, :, 2] = librosa.feature.delta(features[i, :, :, 1])

    # np.int was removed from recent NumPy releases; the builtin int is equivalent
    return np.array(features), np.array(labels, dtype=int)

# convert labels to one-hot encoding
def one_hot_encode(labels):
    n_labels = len(labels)
    n_unique_labels = num_total_classes
    # avoid shadowing the function name with the result array
    encoded = np.zeros((n_labels, n_unique_labels))
    encoded[np.arange(n_labels), labels] = 1
    return encoded
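
To see why `extract_features` subtracts a small offset from the window length, the arithmetic can be reproduced without any audio. The sketch below assumes a clip of exactly 4 seconds at 22050 Hz, and assumes the spectrogram routine counts segments as `(n_samples - n_overlap) // (n_window - n_overlap)` with no padding, which is our reading of `scipy.signal.spectrogram`. `count_frames` is a hypothetical helper, not part of the notebook.

```python
sample_rate = 22050
clip_seconds = 4
n_frames = 150

# Same formulas as in extract_features
n_window = int(sample_rate * clip_seconds / n_frames * 2) - 4 * 2
n_overlap = int(n_window / 2.)


def count_frames(n_samples, n_window, n_overlap):
    """Number of full windows that fit, assuming no padding at the edges."""
    step = n_window - n_overlap
    return (n_samples - n_overlap) // step


full_clip = sample_rate * clip_seconds

# With the 8-sample offset, as in the notebook
frames_with_offset = count_frames(full_clip, n_window, n_overlap)

# Without the offset, for comparison
raw_window = n_window + 4 * 2
frames_without_offset = count_frames(full_clip, raw_window, raw_window // 2)

window_ms = 1000.0 * n_window / sample_rate
print(n_window, n_overlap, round(window_ms, 1))
print(frames_with_offset, frames_without_offset)
```

Under these assumptions the window is 1168 samples (about 53 ms) with a hop of 584 samples. A full 4-second clip then yields 150 frames, whereas without the offset it would yield 149 and every clip would need one frame of padding.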


## Saving Extracted Features

The cell below converts the raw audio files into features, using multiple processes to make full use of the CPU. The processed data are stored as NumPy arrays and will be loaded at training time.

It takes around 10 minutes to complete; the exact time depends on your CPU.

In [4]:
%%time
# use this to process the audio files into numpy arrays
def save_folds(data_dir, k, bands, frames):
    fold_name = 'fold' + str(k)
    print("Saving " + fold_name)

    features, labels = extract_features(parent_dir, [fold_name], bands=bands, frames=frames)
    labels = one_hot_encode(labels)

    print("Features of", fold_name, " = ", features.shape)
    print("Labels of", fold_name, " = ", labels.shape)

    feature_file = os.path.join(data_dir, fold_name + '_x.npy')
    labels_file = os.path.join(data_dir, fold_name + '_y.npy')
    np.save(feature_file, features)
    print("Saved " + feature_file)
    np.save(labels_file, labels)
    print("Saved " + labels_file)


def assure_path_exists(path):
    mydir = os.path.join(os.getcwd(), path)
    if not os.path.exists(mydir):
        os.makedirs(mydir)


assure_path_exists(save_dir)
# one job per fold; UrbanSound8K has 10 folds, which happens to equal num_total_classes
Parallel(n_jobs=num_total_classes)(delayed(save_folds)(save_dir, k, bands=n_bands, frames=n_frames) for k in range(1, 11))


CPU times: user 28.6 ms, sys: 52.8 ms, total: 81.5 ms
Wall time: 9min 10s
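
Once the folds are saved, they can be read back at training time. The following is a minimal sketch, assuming the `fold<k>_x.npy` / `fold<k>_y.npy` naming used by `save_folds` above. `load_folds` is a hypothetical helper, and the tiny arrays written to a temporary directory merely stand in for the real features so that the example runs on its own.

```python
import os
import tempfile

import numpy as np


def load_folds(data_dir, folds):
    """Load and concatenate the saved features and labels for the given folds."""
    xs, ys = [], []
    for k in folds:
        xs.append(np.load(os.path.join(data_dir, "fold%d_x.npy" % k)))
        ys.append(np.load(os.path.join(data_dir, "fold%d_y.npy" % k)))
    return np.concatenate(xs, axis=0), np.concatenate(ys, axis=0)


# Stand-in data: two small "folds" shaped like (clips, bands, frames, channels)
demo_dir = tempfile.mkdtemp()
for k, n_clips in [(1, 2), (2, 3)]:
    np.save(os.path.join(demo_dir, "fold%d_x.npy" % k), np.zeros((n_clips, 4, 5, 3)))
    np.save(os.path.join(demo_dir, "fold%d_y.npy" % k), np.zeros((n_clips, 10)))

demo_x, demo_y = load_folds(demo_dir, [1, 2])
print(demo_x.shape, demo_y.shape)

# With the real data, one possible split is nine folds for training and one
# for testing, for example:
# train_x, train_y = load_folds(save_dir, range(1, 10))
# test_x, test_y = load_folds(save_dir, [10])
```

Keeping the folds in separate files makes it straightforward to rotate which fold is held out.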
