# **Code for the Paper "End-to-End Sound Recognition using Temporal Convolutional Networks"** 
*Author: Eric Schölzel, TU Dresden*


---


A quick-and-dirty implementation of the proposed pipeline, using mel spectrograms, a CNN-TCN architecture and live training data augmentation with SpecAugment.

The UrbanSound8K Dataset (non-Kaggle Version!) is used here as an example.
It contains 8732 samples in 10 (mutually exclusive) classes. All Samples have a length of <=4s (most are 4s) - so keep in mind that this is the type of dataset where a variety of approaches like fully-fledged State-of-the-Art Image Recognition networks with millions of parameters or non-neural-network approaches like Decision Trees usually have their domain.
It's just one example of the pipeline that can easily adapted to tasks with other requirements such as Speech Recognition or Segmentation of audio files.

*Disclaimer: This is run on a custom train test split. For actuall scientific comparisons to other approaches 10-fold cross validation is strongly recommended by the authers of the dataset: https://urbansounddataset.weebly.com/urbansound8k.html*





**Install Additional Dependencies**

In [0]:
!pip install librosa specaugment wget keras-tcn

!apt-get -y install ffmpeg

Collecting keras-tcn
  Downloading https://files.pythonhosted.org/packages/ea/71/a23ddfcee18342a4c3ce464f99c44e5dad1c637be13c73638d8551d57906/keras_tcn-2.8.2-py2.py3-none-any.whl
Installing collected packages: keras-tcn
Successfully installed keras-tcn-2.8.2
Reading package lists... Done
Building dependency tree       
Reading state information... Done
ffmpeg is already the newest version (7:3.4.6-0ubuntu0.18.04.1).
The following package was automatically installed and is no longer required:
  libnvidia-common-410
Use 'apt autoremove' to remove it.
0 upgraded, 0 newly installed, 0 to remove and 4 not upgraded.


**Download and extract the dataset**
(non-Kaggle version)

*This may take a while. It's around ~6GB. If you already have the spectrograms calculated and saved on Google Drive, you can skip that.*


In [0]:
import wget
import tarfile

local_file = "urbansound8k.tar.gz"
print("Downloading dataset...")
url = "https://zenodo.org/record/1203745/files/UrbanSound8K.tar.gz"
wget.download(url, local_file)

print("Extracting dataset...")
with tarfile.open(local_file, "r:gz") as tar:
    tar.extractall()
    tar.close()

print("Done.")

**Import Python libraries and set paths**

In [0]:
import numpy as np # linear algebra
import librosa
# see https://github.com/librosa/librosa/issues/477
# import soundfile
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt 
import tcn
import os
import zipfile
import pickle
import keras
import random
import tensorflow as tf
import time
import sys
import gc
from sklearn.model_selection import train_test_split
from math import ceil
from tensorflow.keras import backend as K
from tcn import TCN

script_path = "./"
dataset_path = "./UrbanSound8K/"

Using TensorFlow backend.


**Calculate features/Spectrograms.**

You only need to run this once (unless the data get deleted - which probably will if you're on Google Colab and close the session).

This may take a while... In the meantime, you can drink a coffee. 
(It will take some time. It's slow. Really. That's one of the reasons why real time data augmentation on the spectrograms is so great! And yeah, it's long enough that I've spent the time to include a timer...)

This could probably be faster with some parameter tuning... our outside of Google Colab with a better CPU.

(I tried to save them to Google Drive once calculated and use the API to download them here, but the API was bugged and just gave errors... Maybe that'll get fixed at some point. Uploading them myself was extremely slow so that didn't work out either. Idk why.)

*Note: we load the entire dataset into RAM here since RAM in Colab is sufficient for that. When there's not as much RAM, you'll have to change that.*

***Do NOT close Google Colab while calculating. This may reset your session and you'll have to start over again.***

**If you saved the spectrograms to Google Drive earlier, you can skip this block.**

In [0]:

def load_sample_and_calc_features(filename):
    audio, sample_rate = librosa.load(filename, res_type="kaiser_fast")
    features = librosa.feature.melspectrogram(y=audio, sr=sample_rate, n_fft=2048, n_mels=256, hop_length=512, power=2.0)
    return features

def load_dataset(label_file, sample_folder, dataset_labels=set()):
    dataset_x = []
    dataset_y = []
    processed = 0
    with open(label_file, "r") as train_labels_file:
        all_content = train_labels_file.readlines()
        
        for line in all_content[1:]:
            line = line.replace("\n", "")
            slice_file_name, fsID, start, end, salience, fold, classID, classname = line.split(",")
            
            dataset_labels.add(classname)
            
            filename = sample_folder + "fold" + fold + "/" + slice_file_name
            if processed % 100 == 0:
                print(processed, end= " -> ")
            x = load_sample_and_calc_features(filename)
            dataset_x.append(x)
            dataset_y.append(int(classID))
            processed += 1

    dataset_labels = list(dataset_labels)

    dataset_y = keras.utils.to_categorical(dataset_y)

    dataset = list(zip(dataset_x, dataset_y))
    return dataset, dataset_labels

time_load_start = time.time()
dataset_and_classes = load_dataset(dataset_path + "/metadata/UrbanSound8K.csv", dataset_path + "/audio/")
time_load = time.time() - time_load_start

dataset_full, classes = dataset_and_classes
pickle_path = script_path + "/dataset_and_classes.pickle"

print("\nDone. Loading and calculating features took " + str(time_load) + " seconds")

0 -> 100 -> 200 -> 300 -> 400 -> 500 -> 600 -> 700 -> 800 -> 900 -> 1000 -> 1100 -> 1200 -> 1300 -> 1400 -> 1500 -> 1600 -> 1700 -> 1800 -> 1900 -> 2000 -> 2100 -> 2200 -> 2300 -> 2400 -> 2500 -> 2600 -> 2700 -> 2800 -> 2900 -> 3000 -> 3100 -> 3200 -> 3300 -> 3400 -> 3500 -> 3600 -> 3700 -> 3800 -> 3900 -> 4000 -> 4100 -> 4200 -> 4300 -> 4400 -> 4500 -> 4600 -> 4700 -> 4800 -> 4900 -> 5000 -> 5100 -> 5200 -> 5300 -> 5400 -> 5500 -> 5600 -> 5700 -> 5800 -> 5900 -> 6000 -> 6100 -> 6200 -> 6300 -> 6400 -> 6500 -> 6600 -> 6700 -> 6800 -> 6900 -> 7000 -> 7100 -> 7200 -> 7300 -> 7400 -> 7500 -> 7600 -> 7700 -> 7800 -> 7900 -> 8000 -> 8100 -> 8200 -> 8300 -> 8400 -> 8500 -> 8600 -> 8700 -> 
Done. Loading and calculating features took 4805.445519685745 seconds


**Optional**:
Connect to Google Drive (You have to enter the Authorization Code and press enter).

Pickle your calculated spectrograms to Google Drive instead for not having to calculate it again later.

*OR*
load it if you did that already and skipped the previous section**

In [0]:
from google.colab import drive
drive.mount('/content/drive')
pickle_path = script_path + "drive/My Drive/dataset_and_classes.pickle"

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=email%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdocs.test%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive.photos.readonly%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fpeopleapi.readonly&response_type=code

Enter your authorization code:
··········
Mounted at /content/drive


***Pickle it... (If not done already)***

In [0]:
if dataset_and_classes is not None:
  with open(pickle_path, "wb+") as pfile:
      pickle.dump(dataset_and_classes, pfile)

**Loading pickled Spectrograms**

You can load previously the previously saved spectrograms here.

In [0]:
with open(pickle_path, "rb") as pfile:
    dataset_full, classes = pickle.load(pfile)
pfile.close()


Since the "full" SpecAugment implementation (from https://github.com/shelling203/SpecAugment) still seems to be buggy (Tensorflow implementation throws errors and is extremely brutaly slow and PyTorch implementation isn't completed yet and still seems to be slow - could maybe be Google Colab related), instead skip the next block and use the small one. This will be without Time Warping, tho. All of those implementations are "inofficial" btw.
 
The PyTorch version of that works. Since Time Warping isn't completely implemented yet, this code doesn't depend on PyTorch yet. It's not throwing errors and being unusable slow like the tensorflow one, but it's still slow. This could be Google Colab related. *TODO for people with Nvidia GPU: Test the Tensorflow Implementation outside of Google Colab ;)*

*This has been copyied here because we need to use matplotlib.use('Agg') (see import section) to avoid errors. That doesn't seem to work when just using the imported module.*

**TL;DR:  since it doesn't work properly in Google Colab atm, you can just skip the next 2 code blocks**

In [0]:
# SpecAugment PyTorch Implementation

# Copyright 2019 RnD at Spoon Radio
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""SpecAugment Implementation for Tensorflow.
Related paper : https://arxiv.org/pdf/1904.08779.pdf

In this paper, show summarized parameters by each open datasets in Tabel 1.
-----------------------------------------
Policy | W  | F  | m_F |  T  |  p  | m_T
-----------------------------------------
None   |  0 |  0 |  -  |  0  |  -  |  -
-----------------------------------------
LB     | 80 | 27 |  1  | 100 | 1.0 | 1
-----------------------------------------
LD     | 80 | 27 |  2  | 100 | 1.0 | 2
-----------------------------------------
SM     | 40 | 15 |  2  |  70 | 0.2 | 2
-----------------------------------------
SS     | 40 | 27 |  2  |  70 | 0.2 | 2
-----------------------------------------
LB : LibriSpeech basic
LD : LibriSpeech double
SM : Switchboard mild
SS : Switchboard strong
"""

import librosa
import librosa.display
import math
import numpy as np
import random
import matplotlib
# matplotlib.use('TkAgg')
import matplotlib.pyplot as plt


def spec_augment_pytorch(mel_spectrogram, time_warping_para=80, frequency_masking_para=27,
                 time_masking_para=100, frequency_mask_num=1, time_mask_num=1):
    """Spec augmentation Calculation Function.

    'SpecAugment' have 3 steps for audio data augmentation.
    first step is time warping using Tensorflow's image_sparse_warp function.
    Second step is frequency masking, last step is time masking.

    # Arguments:
      mel_spectrogram(numpy array): audio file path of you want to warping and masking.
      time_warping_para(float): Augmentation parameter, "time warp parameter W".
        If none, default = 80 for LibriSpeech.
      frequency_masking_para(float): Augmentation parameter, "frequency mask parameter F"
        If none, default = 100 for LibriSpeech.
      time_masking_para(float): Augmentation parameter, "time mask parameter T"
        If none, default = 27 for LibriSpeech.
      frequency_mask_num(float): number of frequency masking lines, "m_F".
        If none, default = 1 for LibriSpeech.
      time_mask_num(float): number of time masking lines, "m_T".
        If none, default = 1 for LibriSpeech.

    # Returns
      mel_spectrogram(numpy array): warped and masked mel spectrogram.
    """
    v = mel_spectrogram.shape[0]
    tau = mel_spectrogram.shape[1]

    # Step 1 : Time warping (TO DO...)
    warped_mel_spectrogram = np.zeros(mel_spectrogram.shape,
                                      dtype=mel_spectrogram.dtype)

    for i in range(v):
        for j in range(tau):
            offset_x = 0
            offset_y = 0
            if i + offset_y < v:
                warped_mel_spectrogram[i, j] = mel_spectrogram[(i + offset_y) % v, j]
            else:
                warped_mel_spectrogram[i, j] = mel_spectrogram[i, j]

    # Step 2 : Frequency masking
    for i in range(frequency_mask_num):
        f = np.random.uniform(low=0.0, high=frequency_masking_para)
        f = int(f)
        f0 = random.randint(0, v - f)
        warped_mel_spectrogram[f0:f0 + f, :] = 0

    # Step 3 : Time masking
    for i in range(time_mask_num):
        t = np.random.uniform(low=0.0, high=time_masking_para)
        t = int(t)
        t0 = random.randint(0, tau - t)
        warped_mel_spectrogram[:, t0:t0 + t] = 0

    return warped_mel_spectrogram


def visualization_spectrogram(mel_spectrogram, title):
    """visualizing result of SpecAugment

    # Arguments:
      mel_spectrogram(ndarray): mel_spectrogram to visualize.
      title(String): plot figure's title
    """
    # Show mel-spectrogram using librosa's specshow.
    plt.figure(figsize=(10, 4))
    librosa.display.specshow(librosa.power_to_db(mel_spectrogram, ref=np.max), y_axis='mel', fmax=8000, x_axis='time')
    # plt.colorbar(format='%+2.0f dB')
    plt.title(title)
    plt.tight_layout()
    plt.show()

In [0]:
# SpecAugment Tensorflow Implementation


# Copyright 2019 RnD at Spoon Radio
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""SpecAugment Implementation for Tensorflow.
Related paper : https://arxiv.org/pdf/1904.08779.pdf

In this paper, show summarized parameters by each open datasets in Tabel 1.
-----------------------------------------
Policy | W  | F  | m_F |  T  |  p  | m_T
-----------------------------------------
None   |  0 |  0 |  -  |  0  |  -  |  -
-----------------------------------------
LB     | 80 | 27 |  1  | 100 | 1.0 | 1
-----------------------------------------
LD     | 80 | 27 |  2  | 100 | 1.0 | 2
-----------------------------------------
SM     | 40 | 15 |  2  |  70 | 0.2 | 2
-----------------------------------------
SS     | 40 | 27 |  2  |  70 | 0.2 | 2
-----------------------------------------
LB : LibriSpeech basic
LD : LibriSpeech double
SM : Switchboard mild
SS : Switchboard strong
"""

import librosa
import librosa.display
import tensorflow as tf
from tensorflow.contrib.image import sparse_image_warp
import numpy as np
import random
import matplotlib
# matplotlib.use('TkAgg')
import matplotlib.pyplot as plt


def spec_augment_tensorflow(mel_spectrogram, time_warping_para=80, frequency_masking_para=27,
                 time_masking_para=100, frequency_mask_num=1, time_mask_num=1):
    """Spec augmentation Calculation Function.

    'SpecAugment' have 3 steps for audio data augmentation.
    first step is time warping using Tensorflow's image_sparse_warp function.
    Second step is frequency masking, last step is time masking.

    # Arguments:
      mel_spectrogram(numpy array): audio file path of you want to warping and masking.
      time_warping_para(float): Augmentation parameter, "time warp parameter W".
        If none, default = 80 for LibriSpeech.
      frequency_masking_para(float): Augmentation parameter, "frequency mask parameter F"
        If none, default = 100 for LibriSpeech.
      time_masking_para(float): Augmentation parameter, "time mask parameter T"
        If none, default = 27 for LibriSpeech.
      frequency_mask_num(float): number of frequency masking lines, "m_F".
        If none, default = 1 for LibriSpeech.
      time_mask_num(float): number of time masking lines, "m_T".
        If none, default = 1 for LibriSpeech.

    # Returns
      mel_spectrogram(numpy array): warped and masked mel spectrogram.
    """
    v = mel_spectrogram.shape[0]
    tau = mel_spectrogram.shape[1]

    # Step 1 : Time warping
    # Image warping control point setting.
    mel_spectrogram_holder = tf.placeholder(tf.float32, shape=[1, v, tau, 1])
    location_holder = tf.placeholder(tf.float32, shape=[1, 1, 2])
    destination_holder = tf.placeholder(tf.float32, shape=[1, 1, 2])

    center_position = v/2
    random_point = np.random.randint(low=time_warping_para, high=tau - time_warping_para)
    # warping distance chose.
    w = np.random.uniform(low=0, high=time_warping_para)

    control_point_locations = [[center_position, random_point]]
    control_point_locations = np.float32(np.expand_dims(control_point_locations, 0))

    control_point_destination = [[center_position, random_point + w]]
    control_point_destination = np.float32(np.expand_dims(control_point_destination, 0))

    # mel spectrogram data type convert to tensor constant for sparse_image_warp.
    mel_spectrogram = mel_spectrogram.reshape([1, mel_spectrogram.shape[0], mel_spectrogram.shape[1], 1])
    mel_spectrogram = np.float32(mel_spectrogram)

    warped_mel_spectrogram_op, _ = sparse_image_warp(mel_spectrogram_holder,
                                                     source_control_point_locations=location_holder,
                                                     dest_control_point_locations=destination_holder,
                                                     interpolation_order=2,
                                                     regularization_weight=0,
                                                     num_boundary_points=1
                                                     )

    # Change warp result's data type to numpy array for masking step.
    feed_dict = {mel_spectrogram_holder:mel_spectrogram,
                 location_holder:control_point_locations,
                 destination_holder:control_point_destination}

    with tf.Session() as sess:
        warped_mel_spectrogram = sess.run(warped_mel_spectrogram_op, feed_dict=feed_dict)

    warped_mel_spectrogram = warped_mel_spectrogram.reshape([warped_mel_spectrogram.shape[1],
                                                             warped_mel_spectrogram.shape[2]])

    # Step 2 : Frequency masking
    for i in range(frequency_mask_num):
        f = np.random.uniform(low=0.0, high=frequency_masking_para)
        f = int(f)
        f0 = random.randint(0, v - f)
        warped_mel_spectrogram[f0:f0 + f, :] = 0

    # Step 3 : Time masking
    for i in range(time_mask_num):
        t = np.random.uniform(low=0.0, high=time_masking_para)
        t = int(t)
        t0 = random.randint(0, tau - t)
        warped_mel_spectrogram[:, t0:t0 + t] = 0

    return warped_mel_spectrogram


def visualization_spectrogram(mel_spectrogram, title):
    """visualizing result of SpecAugment

    # Arguments:
      mel_spectrogram(ndarray): mel_spectrogram to visualize.
      title(String): plot figure's title
    """
    # Show mel-spectrogram using librosa's specshow.
    plt.figure(figsize=(10, 4))
    librosa.display.specshow(librosa.power_to_db(mel_spectrogram, ref=np.max), y_axis='mel', fmax=8000, x_axis='time')
    # plt.colorbar(format='%+2.0f dB')
    plt.title(title)
    plt.tight_layout()
    plt.show()


**Simple SpecAugment (currently used!)**


Due to the problems described above (could be a Google Colab problem, I couldn't test this properly at home since I don't have any Nvidia GPU),
we use this simple implementation without Time Warping for now. It's taken from https://www.kaggle.com/davids1992/specaugment-quick-implementation
It doesn't contain Time Warping, but it's fast.

In [0]:
# from https://www.kaggle.com/davids1992/specaugment-quick-implementation
# without time warping
def spec_augment_simple(spec: np.ndarray, num_mask=2, 
                 freq_masking_max_percentage=0.15, time_masking_max_percentage=0.3):

    spec = spec.copy()
    for i in range(num_mask):
        all_frames_num, all_freqs_num = spec.shape
        freq_percentage = random.uniform(0.0, freq_masking_max_percentage)
        
        num_freqs_to_mask = int(freq_percentage * all_freqs_num)
        f0 = np.random.uniform(low=0.0, high=all_freqs_num - num_freqs_to_mask)
        f0 = int(f0)
        spec[:, f0:f0 + num_freqs_to_mask] = 0

        time_percentage = random.uniform(0.0, time_masking_max_percentage)
        
        num_frames_to_mask = int(time_percentage * all_frames_num)
        t0 = np.random.uniform(low=0.0, high=all_frames_num - num_frames_to_mask)
        t0 = int(t0)
        spec[t0:t0 + num_frames_to_mask, :] = 0
    
    return spec

**Model Building**

Build the network model. It contains an Input Layer, two Convolutional Layers for additional feature extraction (see Paper) and a TCN unit.

*This architecture has only ~560k Weights (for comparison: State-of-the-Art image recognition networks often have tens of millions of weights!). Weight file size is ~7.28mb which is definitely suitable for mobile apps, for example. (Who wants 500mb apps just for sound detection? :P)*

Adam (with default parameters) is used for optimization. SGD+Momentum can lead to better results, when hyperparameters are set good enough.


Due to make the generator simpler, currently a fixed input size is used. The length of the samples varies (128, n) and if n < 174, it gets zero-padded to that size (174 is max length of the spectrogram which is 4s in the raw audio).
However, if this is adapted to datasets where samples can have arbitrary length it makes sense to change that to a varying size (which would save memory/computation time for smaller samples).

With TCN it's possible to set a variable time length - but it takes some adjustments, e.g. the generator would have to look for the longest sample in the batch first, because in a batch every sample must have the same length.

To make things simpler, here a fixed maximum size is used. Samples are NOT simply scaled to match that size here (would result in stretching/squashing smaller/longer samples) how it is done in many other approaches. Zero-Padding is used instead.

First, determine maximum length and class count.

In [0]:
maxlen = 0
for item in dataset_full:
  arr = np.asarray(item[0])
  itemlen = arr.shape[1]
  if itemlen > maxlen:
    maxlen = itemlen

item0 = np.asarray(dataset_full[0][0])
input_shape = (item0.shape[0], maxlen)
print("Shape is " + str(input_shape))

n_classes = len(classes)
print(str(n_classes) + " target classes")



Shape is (256, 174)
10 target classes


Now build the model.

In [0]:
dataset_full = list(dataset_full)

dataset_train, dataset_test = train_test_split(dataset_full, train_size=0.9, test_size=0.1)

def build_model():
    # Use he_normal as initializer for CNNs whenever possible. Here's why -> https://towardsdatascience.com/why-default-cnn-are-broken-in-keras-and-how-to-fix-them-ce295e5e5f2
    # Unfortunately, keras-tcn doesn't use that yet. However, that shouldn't be a big deal here.
  
    # See https://towardsdatascience.com/get-started-with-using-cnn-lstm-for-forecasting-6f0f4dde5826 how CNN-LSTM works.
    # That can be adopted for CNN-TCN. Except here we're not using 1D-Conv Layers, but 2D feature extractors instead.
    # (So 3x3 Kernels instead of slicing 1x3 Kernels)
  
    # Input_shape is (Features, TimeSteps)
    # For our network, it has to be (TimeSteps, Features), so we change it.
    input_layer = keras.layers.Input(shape=(input_shape[1], input_shape[0], 1))
    
    conv1 = keras.layers.Conv2D(filters=32, kernel_size=3, activation="relu", kernel_initializer="he_normal", name="conv1", padding="same")(input_layer)
    # Stride of (1, 2) -> stride of 2 in feature dimension, reducing feature dimensionality per timestep
    conv2 = keras.layers.Conv2D(filters=32, kernel_size=3, strides=(1,2), activation="relu", kernel_initializer="he_normal", padding="same")(conv1)
    
    # Slicing 1D Flattening!
    feature_distributor = keras.layers.TimeDistributed(keras.layers.Flatten())(conv2)
    
    # TCN Unit
    tcn1 = tcn.TCN(return_sequences=False, kernel_size=(2), nb_filters=64, dilations=[1, 2, 4, 8, 16, 32, 64], nb_stacks=2,
                   dropout_rate=0.00, name="tcn1", padding="same")(feature_distributor)
    tcn1 = keras.layers.BatchNormalization()(tcn1)
 
    output_layer = keras.layers.Dense(n_classes, activation="softmax")(tcn1)

    model = keras.Model(inputs=input_layer, outputs=output_layer)

    optimizer = keras.optimizers.Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0, amsgrad=False)
    model.compile(optimizer, loss=keras.losses.categorical_crossentropy, metrics=["categorical_accuracy"])
    return model

model = build_model()

model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_2 (InputLayer)            (None, 174, 256, 1)  0                                            
__________________________________________________________________________________________________
conv1 (Conv2D)                  (None, 174, 256, 32) 320         input_2[0][0]                    
__________________________________________________________________________________________________
conv2d_2 (Conv2D)               (None, 174, 128, 32) 9248        conv1[0][0]                      
__________________________________________________________________________________________________
time_distributed_2 (TimeDistrib (None, 174, 4096)    0           conv2d_2[0][0]                   
__________________________________________________________________________________________________
conv1d_44 

**Generator Functions and Training**

An own generator function is needed since augmented batches should be created. Since all the samples have to be the same length inside a batch, zero padding is used there.
Data Augmentation is performed live when creating the batch.

Batches of *Test* data won't be augmented (wouldn't make sense). A *Model Checkpoint* is used to save model at the best epoch (and only that).

To switch the SpecAugment implementation, just switch the lines "x = spec_augment(_simple)(x)"

(Results can vary by some percentage, depending on train test split, random weight initialization and therefore maybe star constellation.)

In [0]:
gc.collect()

def generator(dataset, augment=False, debug=False):
    n_batches = 0
    current_batch_size = min(batch_size, len(dataset))
    batch_x = np.zeros((current_batch_size, input_shape[0], input_shape[1], 1))
    batch_y = np.zeros((current_batch_size, n_classes))
    samples_in_batch = 0
    while True:
        samples_in_epoch = 0
        epoch_order = list(np.random.permutation(len(dataset)))
        current_batch_size = min(batch_size, len(epoch_order))
          
        for sample_id in epoch_order:
            sample = dataset[sample_id]
            x = sample[0]
            x_zeropad = np.zeros((input_shape[0], input_shape[1]))
            x_zeropad[:, :sample[0].shape[1]] = x
            x = x_zeropad
            if augment:
                x = spec_augment_simple(x)  # simple, no time warping, but faster
                # x = spec_augment_pytorch(mel_spectrogram=x)
                # x = spec_augment_tensorflow(mel_spectrogram=x)  # seems broken + extreeemely slow (no gpu usage?), but includes Time Warping... :/
            sample = (x, sample[1])
            if debug:
              print(current_batch_size)
              print(samples_in_batch)
              print(samples_in_epoch)
            batch_x[samples_in_batch, :, :sample[0].shape[1], :] = np.expand_dims(sample[0], 2)
            batch_y[samples_in_batch, :] = sample[1]
            samples_in_batch += 1
            if samples_in_batch == current_batch_size:
                batch = (np.swapaxes(batch_x[0:current_batch_size, :, :], 1, 2), batch_y[0:current_batch_size, :])
                yield batch
                samples_in_epoch += samples_in_batch
                samples_in_batch = 0
                current_batch_size=min(batch_size, len(dataset) - epoch_order.index(sample_id))
                batch_x = np.zeros((batch_size, input_shape[0], input_shape[1], 1))
                batch_y = np.zeros((batch_size, n_classes))
                
                n_batches += 1
                
n_epochs = 300
batch_size = 64
train_gen = generator(dataset_train, augment=True, debug=False)
test_gen = generator(dataset_test, debug=False)

steps_per_epoch = int(len(dataset_train) / batch_size)
validation_steps = ceil(len(dataset_test) / batch_size)

max_lr = 0.15
num_samples = len(dataset_train)

callbacks = []
model_checkpoint = keras.callbacks.ModelCheckpoint("model_best.h5", monitor='val_categorical_accuracy', verbose=1, save_best_only=True, save_weights_only=False, mode='auto', period=1)
callbacks.append(model_checkpoint)


model.fit_generator(train_gen, steps_per_epoch=steps_per_epoch, epochs=n_epochs, \
                   validation_data=test_gen, validation_steps=validation_steps, \
                    callbacks=callbacks, max_queue_size=64)

Epoch 1/300

Epoch 00001: val_categorical_accuracy improved from -inf to 0.22545, saving model to model_best.h5
Epoch 2/300

Epoch 00002: val_categorical_accuracy improved from 0.22545 to 0.29129, saving model to model_best.h5
Epoch 3/300

Epoch 00003: val_categorical_accuracy improved from 0.29129 to 0.34710, saving model to model_best.h5
Epoch 4/300

Epoch 00004: val_categorical_accuracy did not improve from 0.34710
Epoch 5/300

Epoch 00005: val_categorical_accuracy improved from 0.34710 to 0.49888, saving model to model_best.h5
Epoch 6/300

Epoch 00006: val_categorical_accuracy improved from 0.49888 to 0.53013, saving model to model_best.h5
Epoch 7/300

Epoch 00007: val_categorical_accuracy improved from 0.53013 to 0.58929, saving model to model_best.h5
Epoch 8/300

Epoch 00008: val_categorical_accuracy improved from 0.58929 to 0.62277, saving model to model_best.h5
Epoch 9/300

Epoch 00009: val_categorical_accuracy did not improve from 0.62277
Epoch 10/300

Epoch 00010: val_categor

<keras.callbacks.History at 0x7f4c5f0f5cf8>