## Feature extraction
This notebook extracts the features from the development and evaluation datasets that are needed to train the neural network proposed as the system for the DCASE Challenge 2020, Task 3. 
The script extracts, from an audio file, the log-mel spectrogram using a 64-band mel filter bank, together with the acoustic intensity vector (for the Ambisonic format) or the generalized cross-correlation (for the Microphone Array format). These features are the input to the convolutional recurrent neural network, which makes predictions for sound event detection and sound event localization.

To further improve the SELD score and reduce overfitting, the training dataset is enlarged using data augmentation based on channel rotations and reflections with respect to the xy plane in the FOA domain. In particular, we implemented the 16-pattern technique first proposed by Mazzon et al. in [1]. The data manipulations correspond to azimuth offsets of 0°, -90°, +90°, and +180°, applied to both the azimuth angle and its negation, which gives 8 rotations about the z axis. Each of them is combined with 2 reflections with respect to the xy plane (keeping or negating the elevation angle), for a total of 15 new patterns plus the original one. The user can choose how many augmented files to create from each original file, which increases the dataset size. 
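
As a rough illustration of how the 16 patterns arise, the sketch below enumerates them in the angle domain as combinations of azimuth sign, azimuth offset, and elevation sign. The helper names `enumerate_patterns` and `transform_direction` are hypothetical and not part of this notebook, and angles are assumed to be in degrees.

```python
import itertools


def enumerate_patterns():
    """Return every (azimuth_sign, azimuth_offset_deg, elevation_sign) combination."""
    return list(itertools.product((1, -1), (-90, 0, 90, 180), (1, -1)))


def transform_direction(azimuth, elevation, pattern):
    """Apply one pattern to a direction given in degrees (hypothetical helper)."""
    az_sign, az_offset, el_sign = pattern
    new_azimuth = az_sign * azimuth + az_offset
    # Wrap the azimuth back into [-180, 180).
    new_azimuth = (new_azimuth + 180) % 360 - 180
    return new_azimuth, el_sign * elevation


patterns = enumerate_patterns()
print(len(patterns))  # 16, including the identity pattern (1, 0, 1)
print(transform_direction(30, 10, (1, 90, -1)))  # (120, -10)
```

Dropping the identity pattern `(1, 0, 1)` leaves the 15 augmentations used in this notebook.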

The user is invited to follow the instructions, change the file and folder paths, and configure the system as needed. 

[1]: Mazzon, L., et al. "Sound event localization and detection using FOA domain spatial augmentation." Proc. of the 4th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE). 2019.

Please follow the instructions and replace the folder paths with the ones on your machine or drive. 
The paths that need to be changed are marked with a TODO comment. 


In [None]:
# All the data are saved on Google Drive, so the first step in Google Colab is to mount the drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [1]:
# Import all the modules needed to run the script
import os 
import numpy as np
import librosa
import matplotlib.pyplot as plt
import random
import scipy.io.wavfile as wav

## Parameters definition

The next cell defines all the parameters that need to be configured to run the algorithm.  

Please replace the folder paths with the ones on your machine and/or drive. The paths to be changed are marked with a TODO comment. 

In [2]:
# DATASET LOADING PARAMETERS
case = 1
# 'dev' - development or 'eval' - evaluation dataset, or both. 
process_str = 'dev, eval'

# 'dev' - development or 'eval' - evaluation dataset
mode='eval'
# 'foa' - ambisonic or 'mic' - microphone signals                                               
dataset='foa'    

# If the dataset is foa, data augmentation is used; no data augmentation is applied to the mic dataset.
dataset_aug = False if dataset == 'mic' else True

#BASE PATH
#base dir path. 
#TODO: Change it to the proper path on your machine
base_dir = '/content/drive/My Drive/Dataset-FP/'
#defining cross-validation split
if mode == 'dev':
  train_splits = [3, 4, 5, 6]
elif mode == 'eval':
  train_splits = [2, 3, 4, 5, 6]

#how many new files we want to create from the original one.
data_augmentation_nb = 2

# INPUT PATH
# Base folder containing the foa/mic and metadata folders. 
#TODO: Change it to the proper path on your machine 
dataset_dir = base_dir + 'dataset-eval'
# Directory where the extracted features and labels are stored
feat_label_dir= os.path.join(dataset_dir, 'feat_label-AR/')       

# Number of rotation patterns to consider for data augmentation. 
# Of the 16 patterns listed in the paper mentioned above, we use all the patterns except the original one. 
rotation_pattern = 15
                                                                                                                                                                                      
#FEATURE PARAMS
fs=24000
hop_len_s=0.02
label_hop_len_s=0.1
max_audio_len_s=60
nb_mel_bins=64            

unique_classes = {
            'alarm': 0,
            'baby': 1,
            'crash': 2,
            'dog': 3,
            'engine': 4,
            'female_scream': 5,
            'female_speech': 6,
            'fire': 7,
            'footsteps': 8,
            'knock': 9,
            'male_scream': 10,
            'male_speech': 11,
            'phone': 12,
            'piano': 13
        }


# ########### User defined parameters ##############
# Different user parameters to set dev or eval mode, foa or mic dataset, or a quick test
if case == 1:
    print("USING DEFAULT PARAMETERS\n")

elif case == 2:
    mode = 'dev'
    dataset = 'mic'

elif case == 3:
    mode = 'eval'
    dataset = 'mic'

elif case == 4:
    mode = 'dev'
    dataset = 'foa'

elif case == 5:
    mode = 'eval'
    dataset = 'foa'

elif case == 999:
    print("QUICK TEST MODE\n")
    quick_test = True
    epochs_per_fit = 1

else:
    print('ERROR: unknown argument {}'.format(case))
    exit()

# Recompute the settings derived from mode and dataset, since a case above may have overridden them
dataset_aug = dataset != 'mic'
train_splits = [3, 4, 5, 6] if mode == 'dev' else [2, 3, 4, 5, 6]


USING DEFAULT PARAMETERS
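
The sketch below works out the frame counts and feature widths implied by the parameter values above, mirroring the arithmetic used later in the feature class. It is only an illustration: the variable names `foa_feat_dim` and `mic_feat_dim` are hypothetical, and four input channels are assumed for both formats.

```python
import math

fs = 24000             # sampling rate in Hz
hop_len_s = 0.02       # feature hop length in seconds
label_hop_len_s = 0.1  # label hop length in seconds
max_audio_len_s = 60   # maximum audio length in seconds
nb_mel_bins = 64
nb_channels = 4        # assumption: four channels for both foa and mic

hop_len = int(fs * hop_len_s)               # samples per feature hop
win_len = 2 * hop_len                       # analysis window length in samples
nfft = 2 ** (win_len - 1).bit_length()      # next power of two >= win_len
max_feat_frames = math.ceil(max_audio_len_s * fs / hop_len)
max_label_frames = math.ceil(max_audio_len_s * fs / int(fs * label_hop_len_s))

# Log-mel for every channel plus 3 intensity-vector channels (foa)
foa_feat_dim = nb_mel_bins * nb_channels + nb_mel_bins * 3
# Log-mel for every channel plus one GCC block per channel pair (mic)
mic_feat_dim = nb_mel_bins * nb_channels + nb_mel_bins * math.comb(nb_channels, 2)

print(hop_len, win_len, nfft)             # 480 960 1024
print(max_feat_frames, max_label_frames)  # 3000 600
print(foa_feat_dim, mic_feat_dim)         # 448 640
```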



In [4]:
#function to create a new folder in case it does not exist yet. 
def create_folder(folder_name):
    """
    The function creates a new folder in case it does not exist yet.
    Parameters
    -------------
      :folder_name: name of the folder to be created (folder path)
    """
    if not os.path.exists(folder_name):
        print('{} folder does not exist, creating it.'.format(folder_name))
        os.makedirs(folder_name)

## Data augmentation definition 

The next cell defines and implements the 15 channel rotations and reflections in the FOA domain. 
Once a rotation pattern has been randomly selected, the same pattern is applied to the audio used for input feature extraction and to its corresponding labels. 
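
To see why a channel swap is equivalent to a rotation, the sketch below encodes an ideal plane wave into FOA gains and checks that the swap used for the `Φ - pi/2` pattern matches encoding the rotated direction directly. It assumes the channel order W, Y, Z, X and ignores normalization constants; `encode_foa` and `rotate_minus_90` are hypothetical helpers, not part of this notebook.

```python
import numpy as np


def encode_foa(azimuth_deg, elevation_deg):
    """Ideal FOA gains of a plane wave, channel order W, Y, Z, X (assumed)."""
    azi = np.deg2rad(azimuth_deg)
    ele = np.deg2rad(elevation_deg)
    w = 1.0
    y = np.sin(azi) * np.cos(ele)
    z = np.sin(ele)
    x = np.cos(azi) * np.cos(ele)
    return np.array([w, y, z, x])


def rotate_minus_90(foa):
    """Channel swap for the azimuth - 90 degrees pattern: Y' = -X, X' = Y."""
    w, y, z, x = foa
    return np.array([w, -x, z, y])


source = encode_foa(40.0, 20.0)
rotated = rotate_minus_90(source)
expected = encode_foa(40.0 - 90.0, 20.0)
print(np.allclose(rotated, expected))  # True
```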

In [13]:
import numpy as np
import librosa
import matplotlib.pyplot as plt
import os
import random
import scipy.io.wavfile as wav


def apply_data_augmentation(audio):
    """
    The function applies one of the 15 patterns in order to augment the data. The 15 patterns are based on 90, -90 and 180 degree channel rotations
    Parameters
    -------------
      :audio: audio to be augmented
    Return
    -------------
      :audio_aug: augmented audio
      :pattern: pattern applied for the data augmentation of this particular audio
    """
    
    # Select a random pattern among the 15 implemented
    pattern = random.randrange(rotation_pattern) # original case not considered for augmentation
    print("Data augmentation pattern {}".format(pattern))
        
    audio_aug = np.zeros(audio.shape)

    w = audio[:, 0]
    y = audio[:, 1]
    z = audio[:, 2]
    x = audio[:, 3]
    
    # w channel never change 
    audio_aug[:, 0] = w

    # the 15 pattern rotations and reflection
    if (pattern == 0):
        #print('Φ - pi/2')
        audio_aug[:, 1] = -x
        audio_aug[:, 3] = y
        audio_aug[:, 2] = z  
    elif (pattern == 1):
        #print('Φ - pi/2, -θ')
        audio_aug[:, 1] = -x
        audio_aug[:, 3] = y
        audio_aug[:, 2] = -z
    
    elif (pattern == 2):
        #print('Φ, -θ')
        audio_aug[:, 1] = y
        audio_aug[:, 3] = x
        audio_aug[:, 2] = -z
    
    elif (pattern == 3):
        #print('Φ + pi/2')
        audio_aug[:, 1] = x
        audio_aug[:, 3] = -y
        audio_aug[:, 2] = z    
    elif (pattern == 4):    
        #print('Φ + pi/2, -θ')
        audio_aug[:, 1] = x
        audio_aug[:, 3] = -y
        audio_aug[:, 2] = -z
        
    elif (pattern == 5):
        #print('-Φ - pi/2')
        audio_aug[:, 1] = -x
        audio_aug[:, 3] = -y
        audio_aug[:, 2] = z
    elif (pattern == 6):
        #print('-Φ - pi/2, -θ')
        audio_aug[:, 1] = -x
        audio_aug[:, 3] = -y
        audio_aug[:, 2] = -z
    
    elif (pattern == 7):
        #print('-Φ')
        audio_aug[:, 1] = -y
        audio_aug[:, 3] = x
        audio_aug[:, 2] = z  
    elif (pattern == 8):
        #print('-Φ, -θ')
        audio_aug[:, 1] = -y
        audio_aug[:, 3] = x
        audio_aug[:, 2] = -z
    
    elif (pattern == 9):
        #print('-Φ + pi/2')
        audio_aug[:, 1] = x
        audio_aug[:, 3] = y
        audio_aug[:, 2] = z  
    elif (pattern == 10):
        #print('-Φ + pi/2, -θ')
        audio_aug[:, 1] = x
        audio_aug[:, 3] = y
        audio_aug[:, 2] = -z
    
    elif (pattern == 11):
        #print('-Φ + pi')
        audio_aug[:, 1] = y
        audio_aug[:, 3] = -x
        audio_aug[:, 2] = z      
    elif (pattern == 12):
        #print('-Φ + pi, -θ')
        audio_aug[:, 1] = y
        audio_aug[:, 3] = -x
        audio_aug[:, 2] = -z
        
    elif (pattern == 13):
        #print('Φ + pi')
        audio_aug[:, 1] = -y
        audio_aug[:, 3] = -x
        audio_aug[:, 2] = z    
    elif (pattern == 14):
        #print('Φ + pi, -θ')
        audio_aug[:, 1] = -y
        audio_aug[:, 3] = -x
        audio_aug[:, 2] = -z  
    
    else:
        print("Wrong pattern selection")
        
    return audio_aug, pattern


def label_rotation(label, pattern):

  """
    The function uses the pattern received as input to augment the label of the corresponding audio.
    The rotation is applied to a single label entry [class, x, y, z] received as input. 
    Parameters
    -------------
      :label: label entry [class, x, y, z] to be augmented
      :pattern: pattern applied
    Return
    -------------
      :label_aug: augmented label entry
    """

  label_aug = np.zeros(len(label))
    # the class index never changes 
  label_aug[0] = label[0]

  if (pattern == 0):
        #print('Φ - pi/2')
        label_aug[1] = label[2]
        label_aug[2] = -label[1]
        label_aug[3] = label[3]
  elif (pattern == 1):
        #print('Φ - pi/2, -θ')
        label_aug[1] = label[2]
        label_aug[2] = -label[1]
        label_aug[3] = -label[3]
    
  elif (pattern == 2):
        #print('Φ, -θ')
        label_aug[1] = label[1]
        label_aug[2] = label[2]
        label_aug[3] = -label[3]
    
  elif (pattern == 3):
        #print('Φ + pi/2')
        label_aug[1] = -label[2]
        label_aug[2] = label[1]
        label_aug[3] = label[3]
  elif (pattern == 4):    
        #print('Φ + pi/2, -θ')
        label_aug[1] = -label[2]
        label_aug[2] = label[1]
        label_aug[3] = -label[3]
        
  elif (pattern == 5):
        #print('-Φ - pi/2')
        label_aug[1] = -label[2]
        label_aug[2] = -label[1]
        label_aug[3] = label[3]
  elif (pattern == 6):
        #print('-Φ - pi/2, -θ')
        label_aug[1] = -label[2]
        label_aug[2] = -label[1]
        label_aug[3] = -label[3]
    
  elif (pattern == 7):
        #print('-Φ')
        label_aug[1] = label[1]
        label_aug[2] = -label[2]
        label_aug[3] = label[3]
  elif (pattern == 8):
        #print('-Φ, -θ')
        label_aug[1] = label[1]
        label_aug[2] = -label[2]
        label_aug[3] = -label[3]
    
  elif (pattern == 9):
        #print('-Φ + pi/2')
        label_aug[1] = label[2]
        label_aug[2] = label[1]
        label_aug[3] = label[3]
  elif (pattern == 10):
        #print('-Φ + pi/2, -θ')
        label_aug[1] = label[2]
        label_aug[2] = label[1]
        label_aug[3] = -label[3]
    
  elif (pattern == 11):
        #print('-Φ + pi')
        label_aug[1] = -label[1]
        label_aug[2] = label[2]
        label_aug[3] = label[3]     
  elif (pattern == 12):
        #print('-Φ + pi, -θ')
        label_aug[1] = -label[1]
        label_aug[2] = label[2]
        label_aug[3] = -label[3]
        
  elif (pattern == 13):
        #print('Φ + pi')
        label_aug[1] = -label[1]
        label_aug[2] = -label[2]
        label_aug[3] = label[3]
  elif (pattern == 14):
        #print('Φ + pi, -θ')
        label_aug[1] = -label[1]
        label_aug[2] = -label[2]
        label_aug[3] = -label[3]
    
  else:
        print("Wrong pattern selection")
        
  return label_aug

def label_augmentation(label_dir, pattern):

    #print("Label augmentation pattern {}".format(pattern))
    
    """
    The function applies the received pattern in order to augment the label of the corresponding audio.
    The function receives a label dictionary as input and augments the label frame by frame. 
    Parameters
    -------------
      :label_dir: label dictionary containing all the frames that need to be augmented. 
      :pattern: pattern to apply
    Return
    -------------
      :label_aug_dict: augmented label dictionary
    """

    label_aug_dict = {}
    for frame_cnt in label_dir.keys():
        if frame_cnt not in label_aug_dict:
            label_aug_dict[frame_cnt] = []
            for tmp_val in label_dir[frame_cnt]:
                aug_tmp_value = label_rotation(tmp_val, pattern)
                label_aug_dict[frame_cnt].append(aug_tmp_value)
    return label_aug_dict
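
The label path (polar to Cartesian, swap, back to polar) can be checked on a single direction. The sketch below is a minimal illustration for the `Φ + pi/2, -θ` pattern, assuming labels in degrees and unit-norm Cartesian vectors; `polar_to_cartesian` and `cartesian_to_polar` are hypothetical stand-ins for the conversion methods of the feature class.

```python
import numpy as np


def polar_to_cartesian(azimuth_deg, elevation_deg):
    """Unit vector (x, y, z) for a direction given in degrees."""
    azi, ele = np.deg2rad(azimuth_deg), np.deg2rad(elevation_deg)
    return np.array([np.cos(azi) * np.cos(ele),
                     np.sin(azi) * np.cos(ele),
                     np.sin(ele)])


def cartesian_to_polar(xyz):
    """Azimuth and elevation in degrees for a Cartesian vector."""
    x, y, z = xyz
    azimuth = np.rad2deg(np.arctan2(y, x))
    elevation = np.rad2deg(np.arctan2(z, np.sqrt(x ** 2 + y ** 2)))
    return azimuth, elevation


# Pattern 'Φ + pi/2, -θ' in Cartesian form: x' = -y, y' = x, z' = -z
x, y, z = polar_to_cartesian(30.0, 10.0)
azi_aug, ele_aug = cartesian_to_polar(np.array([-y, x, -z]))
print(round(azi_aug, 6), round(ele_aug, 6))  # 120.0 -10.0
```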

## Feature class 

The next cell defines and implements the class used for feature extraction and its related functions: extraction of the spectrogram, the log-mel spectrogram, and either the acoustic intensity vector or the generalized cross-correlation, depending on the domain. 
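
As a minimal sketch of the intensity-vector idea, the block below computes normalized intensity vectors from a random complex spectrogram. It omits the mel projection applied in the class, uses synthetic data, and the function name `intensity_vectors` is hypothetical.

```python
import numpy as np


def intensity_vectors(spectra, eps=1e-8):
    """Normalized intensity vectors from complex spectra [frames, bins, 4].

    Channel 0 is taken as the omnidirectional channel; channels 1..3 are the
    directional ones. The mel projection used in the notebook is omitted here.
    """
    w = spectra[:, :, 0]
    # Real part of conj(W) times each directional channel
    iv = np.real(np.conj(w)[:, :, None] * spectra[:, :, 1:4])
    norm = np.sqrt(np.sum(iv ** 2, axis=-1, keepdims=True)) + eps
    return iv / norm


rng = np.random.default_rng(0)
shape = (5, 8, 4)  # synthetic example: 5 frames, 8 frequency bins, 4 channels
spectra = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

iv = intensity_vectors(spectra)
print(iv.shape)  # (5, 8, 3)
```

After normalization each time-frequency bin carries a direction of (at most) unit length, so the feature encodes where energy comes from rather than how much there is.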

The next cell also implements the functions used in the feature extraction process: the feature extraction itself, the pre-processing of the features (normalization of the dataset), and the extraction of the corresponding labels. 
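
The normalization step fits a scaler incrementally, one feature file at a time, and then transforms every file. The NumPy-only sketch below illustrates that idea with a running mean and variance; it is not scikit-learn's implementation, the class `RunningScaler` is hypothetical, and the feature files are replaced by synthetic arrays.

```python
import numpy as np


class RunningScaler:
    """Accumulates per-feature mean and variance over successive batches."""

    def __init__(self):
        self.count = 0
        self.mean = None
        self.m2 = None  # sum of squared deviations from the running mean

    def partial_fit(self, batch):
        batch = np.asarray(batch, dtype=float)
        n = batch.shape[0]
        batch_mean = batch.mean(axis=0)
        batch_m2 = ((batch - batch_mean) ** 2).sum(axis=0)
        if self.count == 0:
            self.count, self.mean, self.m2 = n, batch_mean, batch_m2
            return self
        # Merge the statistics of the new batch with the running ones
        delta = batch_mean - self.mean
        total = self.count + n
        self.m2 = self.m2 + batch_m2 + delta ** 2 * self.count * n / total
        self.mean = self.mean + delta * n / total
        self.count = total
        return self

    def transform(self, batch):
        std = np.sqrt(self.m2 / self.count)
        std[std == 0] = 1.0  # avoid dividing by zero for constant features
        return (np.asarray(batch, dtype=float) - self.mean) / std


rng = np.random.default_rng(0)
files = [rng.normal(3.0, 2.0, size=(100, 4)) for _ in range(3)]  # synthetic "feature files"

scaler = RunningScaler()
for feat in files:
    scaler.partial_fit(feat)
normalized = np.vstack([scaler.transform(feat) for feat in files])
```

Fitting file by file keeps only one feature matrix in memory at a time, which matters when the whole dataset does not fit in RAM.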


In [24]:
# Contains routines for labels creation, features extraction and normalization
from sklearn import preprocessing
import joblib  # sklearn.externals.joblib was removed from recent scikit-learn releases
import math


def nCr(n, r):
    return math.factorial(n) // math.factorial(r) // math.factorial(n-r)


class FeatureClass:
    def __init__(self, is_eval=False):
        """
        Parameters
        --------------
        :param is_eval: if True, does not load dataset labels.
        All other settings are read from the module-level variables defined in the parameters cell.
        """

        # Input directories
        self._feat_label_dir = feat_label_dir
        self._dataset_dir = dataset_dir
        self._dataset_combination = '{}_{}'.format(dataset, 'eval' if is_eval else 'dev')
        self._aud_dir = os.path.join(self._dataset_dir, self._dataset_combination)
        self._desc_dir = None if is_eval else os.path.join(self._dataset_dir, 'metadata_dev')

        # Output directories
        self._label_dir = None
        self._feat_dir = None
        self._feat_dir_norm = None

        # Local parameters
        self._is_eval = is_eval

        self._fs = fs
        self._hop_len_s = hop_len_s
        self._hop_len = int(self._fs * self._hop_len_s)

        self._label_hop_len_s = label_hop_len_s
        self._label_hop_len = int(self._fs * self._label_hop_len_s)
        self._label_frame_res = self._fs / float(self._label_hop_len)
        self._nb_label_frames_1s = int(self._label_frame_res)

        self._win_len = 2 * self._hop_len
        self._nfft = self._next_greater_power_of_2(self._win_len)
        self._nb_mel_bins = nb_mel_bins
        self._mel_wts = librosa.filters.mel(sr=self._fs, n_fft=self._nfft, n_mels=self._nb_mel_bins).T

        self._dataset = dataset
        self._eps = 1e-8
        self._nb_channels = 4

        # Sound event classes dictionary
        self._unique_classes = unique_classes
        self._audio_max_len_samples = max_audio_len_s * self._fs 

        self._max_feat_frames = int(np.ceil(self._audio_max_len_samples / float(self._hop_len)))
        self._max_label_frames = int(np.ceil(self._audio_max_len_samples / float(self._label_hop_len)))

    def _load_audio(self, audio_path):

        fs, audio = wav.read(audio_path)
        audio = audio[:, :self._nb_channels] / 32768.0 + self._eps
        if audio.shape[0] < self._audio_max_len_samples:
            zero_pad = np.random.rand(self._audio_max_len_samples - audio.shape[0], audio.shape[1])*self._eps
            audio = np.vstack((audio, zero_pad))
        elif audio.shape[0] > self._audio_max_len_samples:
            audio = audio[:self._audio_max_len_samples, :]
        return audio, fs

    # INPUT FEATURES
    @staticmethod
    def _next_greater_power_of_2(x):
        return 2 ** (x - 1).bit_length()

    def _spectrogram(self, audio_input):
      """
        The function generates the spectrogram of an audio signal received as input
        Parameters
        -------------
          :audio_input: audio received as input
        Return
        -------------
          :spectra: spectrogram of the audio received as input
      """
      _nb_ch = audio_input.shape[1]
      nb_bins = self._nfft // 2
      spectra = np.zeros((self._max_feat_frames, nb_bins + 1, _nb_ch), dtype=complex)
      for ch_cnt in range(_nb_ch):
          stft_ch = librosa.core.stft(np.asfortranarray(audio_input[:, ch_cnt]), n_fft=self._nfft, hop_length=self._hop_len,
                                        win_length=self._win_len, window='hann')
          spectra[:, :, ch_cnt] = stft_ch[:, :self._max_feat_frames].T
      return spectra

    def _get_mel_spectrogram(self, linear_spectra_list):
        """
        The function generates the log-mel spectrograms for a list of linear spectrograms (the original audio plus any augmented versions)
        Parameters
        -------------
          :linear_spectra_list: list of linear spectrograms received as input
        Return
        -------------
          :mel_feat_list: list of log-mel spectrograms, one per input spectrogram
        """
        
        mel_feat_list = []
        for spec in range(len(linear_spectra_list)):
            linear_spectra = linear_spectra_list[spec]
            mel_feat = np.zeros((linear_spectra.shape[0], self._nb_mel_bins, linear_spectra.shape[-1]))
            for ch_cnt in range(linear_spectra.shape[-1]):
                mag_spectra = np.abs(linear_spectra[:, :, ch_cnt])**2
                mel_spectra = np.dot(mag_spectra, self._mel_wts)
                log_mel_spectra = librosa.power_to_db(mel_spectra)
                mel_feat[:, :, ch_cnt] = log_mel_spectra
            mel_feat = mel_feat.reshape((linear_spectra.shape[0], self._nb_mel_bins * linear_spectra.shape[-1]))
            mel_feat_list.append(mel_feat)

        return mel_feat_list

    def _get_foa_intensity_vectors(self, linear_spectra_list):
        """
        Function to generate the list of acoustic intensity vectors from the list of linear spectrograms received as input
        Parameters
        -------------
          :linear_spectra_list: list of linear spectrograms received as input
        Return
        -------------
          :foa_iv_list: list of acoustic intensity vectors 
        """
        foa_iv_list = []
        for spec in range(len(linear_spectra_list)):
            linear_spectra = linear_spectra_list[spec]
            IVx = np.real(np.conj(linear_spectra[:, :, 0]) * linear_spectra[:, :, 1])
            IVy = np.real(np.conj(linear_spectra[:, :, 0]) * linear_spectra[:, :, 2])
            IVz = np.real(np.conj(linear_spectra[:, :, 0]) * linear_spectra[:, :, 3])

            normal = np.sqrt(IVx ** 2 + IVy ** 2 + IVz ** 2) + self._eps
            IVx = np.dot(IVx / normal, self._mel_wts)
            IVy = np.dot(IVy / normal, self._mel_wts)
            IVz = np.dot(IVz / normal, self._mel_wts)

            # we are doing the following instead of simply concatenating to keep the processing similar to mel_spec and gcc
            foa_iv = np.dstack((IVx, IVy, IVz))
            foa_iv = foa_iv.reshape((linear_spectra.shape[0], self._nb_mel_bins * 3))
            if np.isnan(foa_iv).any():
                print('Feature extraction is generating nan outputs')
                exit()
            foa_iv_list.append(foa_iv)

        return foa_iv_list



    def _get_gcc(self, linear_spectra_list):
        """
        The function generates the list of generalized cross-correlation vectors from the list of linear spectrograms received as input
        Parameters
        -------------
          :linear_spectra_list: list of linear spectrograms received as input
        Return
        -------------
          :gcc_feat_list: list of generalized cross-correlation vectors  
        """
        gcc_feat_list = []
        for linear_spectra in linear_spectra_list:
            gcc_channels = nCr(linear_spectra.shape[-1], 2)
            gcc_feat = np.zeros((linear_spectra.shape[0], self._nb_mel_bins, gcc_channels))
            cnt = 0
            for m in range(linear_spectra.shape[-1]):
                for n in range(m+1, linear_spectra.shape[-1]):
                    R = np.conj(linear_spectra[:, :, m]) * linear_spectra[:, :, n]
                    cc = np.fft.irfft(np.exp(1.j*np.angle(R)))
                    cc = np.concatenate((cc[:, -self._nb_mel_bins//2:], cc[:, :self._nb_mel_bins//2]), axis=-1)
                    gcc_feat[:, :, cnt] = cc
                    cnt += 1
            gcc_feat = gcc_feat.reshape((linear_spectra.shape[0], self._nb_mel_bins*gcc_channels))
            gcc_feat_list.append(gcc_feat)

        return gcc_feat_list

    def _get_spectrogram_for_file(self, audio_filename):
        """
        The function generates the list of spectrograms for a file. The list contains the original spectrogram of the audio and the augmented ones
        Parameters
        -------------
          :audio_filename: filename of the audio file
        Return
        -------------
          :list_spec: list of spectrogram (original + augmented)
        """
        list_spec = []
        
        if dataset == 'foa':
          pattern = np.zeros(data_augmentation_nb, dtype=int)

        audio_in, fs = self._load_audio(os.path.join(self._aud_dir, audio_filename))
        audio_spec = self._spectrogram(audio_in)

        list_spec.append(audio_spec)

        if int(audio_filename[4]) in train_splits and dataset == 'foa' and not self._is_eval:
            for i in range(data_augmentation_nb):
                audio_aug, pattern[i] = apply_data_augmentation(audio_in)
                audio_spec_aug = self._spectrogram(audio_aug)
                list_spec.append(audio_spec_aug)

        if dataset == 'foa':
          return list_spec, pattern
        else:
          return list_spec


    # OUTPUT LABELS
    def get_labels_for_file(self, _desc_file):
        """
        The function reads description file and returns classification based SED labels and regression based DOA labels
        Parameters:
        ----------------
          :param _desc_file: metadata description file
        Returns:
        ----------------------------
        :return: label_mat: labels of the format [sed_label, doa_label],
              where sed_label is of dimension [nb_frames, nb_classes] which is 1 for active sound event else zero
              where doa_labels is of dimension [nb_frames, 3*nb_classes], nb_classes each for x, y, z axis,
        """
        se_label = np.zeros((self._max_label_frames, len(self._unique_classes)))
        x_label = np.zeros((self._max_label_frames, len(self._unique_classes)))
        y_label = np.zeros((self._max_label_frames, len(self._unique_classes)))
        z_label = np.zeros((self._max_label_frames, len(self._unique_classes)))

        for frame_ind, active_event_list in _desc_file.items():
            if frame_ind < self._max_label_frames:
                for active_event in active_event_list:
                    se_label[frame_ind, active_event[0]] = 1
                    x_label[frame_ind, active_event[0]] = active_event[1]
                    y_label[frame_ind, active_event[0]] = active_event[2]
                    z_label[frame_ind, active_event[0]] = active_event[3]

        label_mat = np.concatenate((se_label, x_label, y_label, z_label), axis=1)
        return label_mat

    # ------------------------------- EXTRACT FEATURE AND PREPROCESS IT -------------------------------
    def extract_all_feature(self):
        """
          The function extracts all the features from audio data file
        """
        # setting up folders
        self._feat_dir = self.get_unnormalized_feat_dir()
        create_folder(self._feat_dir)

        # extraction starts
        print('Extracting spectrogram:')
        print('\t\taud_dir {}\n\t\tdesc_dir {}\n\t\tfeat_dir {}'.format(
            self._aud_dir, self._desc_dir, self._feat_dir))

        for file_cnt, file_name in enumerate(os.listdir(self._aud_dir)):
            #save the pattern implemented for each file
            if file_name != '.DS_Store':
                wav_filename = '{}.wav'.format(file_name.split('.')[0])
                if dataset == 'foa':
                  # implementation of data augmentation based on channel rotations
                  spect, pattern = self._get_spectrogram_for_file(wav_filename)
                else:
                  spect = self._get_spectrogram_for_file(wav_filename)

                #extract mel
                mel_spect = self._get_mel_spectrogram(spect)

                feat_list = []
                if self._dataset == 'foa':
                    # extract intensity vectors
                    foa_iv = self._get_foa_intensity_vectors(spect)
                    for foa_index in range(len(foa_iv)):

                        #plot figures
                        #print("Plotting")
                        #plot.figure(figsize=(22, 5))

                        #plot.subplot(211, xlabel='Time', ylabel='Frequency', title='mel-spect %s' % (foa_index)), \
                        #plot.imshow(mel_spect[foa_index].T)


                        #plot.subplot(212, xlabel='Time', ylabel='Frequency', title='intensity vector'), \
                        #plot.imshow(foa_iv[foa_index].T)
                        #plot.savefig(fname=image_dir + '%s_plot_%s.png' % (file_name.split('.')[0], foa_index), format='png')
                        #plot.close()

                        feat = np.concatenate((mel_spect[foa_index], foa_iv[foa_index]), axis=-1)
                        feat_list.append(feat)

                elif self._dataset == 'mic':
                    # extract gcc
                    gcc = self._get_gcc(spect)
                    for gcc_index in range(len(gcc)):

                        # plot figures
                        # print("Plotting")
                        #plot.figure(figsize=(22, 5))

                        #plot.subplot(211, xlabel='Time', ylabel='Frequency', title='mel-spect %s' % (gcc_index))
                        #plot.imshow(mel_spect[gcc_index].T)

                        #plot.subplot(212, xlabel='Time', ylabel='Frequency', title='generilized cross corrrelation')
                        #plot.imshow(gcc[gcc_index].T)
                        #plot.savefig(fname=image_dir + '%s_plot_%s.png' % (file_name.split('.')[0], gcc_index),
                         #            format='png')

                        feat = np.concatenate((mel_spect[gcc_index], gcc[gcc_index]), axis=-1)
                        feat_list.append(feat)
                else:
                    print('ERROR: Unknown dataset format {}'.format(self._dataset))
                    exit()

                if len(feat_list) != 0:
                    for element in range(len(feat_list)):
                        print('{}: {}, {}, {}'.format(file_cnt, file_name, feat_list[element].shape, element))
                        if element == 0:
                          #no need to change the label name 
                          np.save(os.path.join(self._feat_dir, '{}-{}.npy'.format(wav_filename.split('.')[0], element)), feat_list[element]) 
                        else:
                          #augmented data
                          np.save(os.path.join(self._feat_dir, '{}-{}-{}.npy'.format(wav_filename.split('.')[0], element, pattern[element-1])), feat_list[element])
                          #creation of augmented data labels
                          label_folder = self._desc_dir
                          label_path = os.path.join(label_folder, wav_filename.replace('.wav', '.csv'))
                          label_dir_polar = self.load_output_format_file(label_path)
                          label_dir = self.convert_output_format_polar_to_cartesian(label_dir_polar)
                          #make the same rotation for the label and save with the same name on the metadata folder 
                          label_aug_dict = label_augmentation(label_dir, pattern[element-1])
                          label_aug_dict_pol = self.convert_output_format_cartesian_to_polar(label_aug_dict)
                          output_file_path = os.path.join(label_folder, '{}-{}-{}.csv'.format(wav_filename.split('.')[0], element, pattern[element-1]))
                          self.write_polar_file(output_file_path, label_aug_dict_pol)
                          
                      


    def preprocess_features(self):
        """
          The function pre-processes and normalizes all the features already extracted using StandardScaler preprocessing
        """
        # Setting up folders and filenames
        self._feat_dir = self.get_unnormalized_feat_dir()
        self._feat_dir_norm = self.get_normalized_feat_dir()
        create_folder(self._feat_dir_norm)
        normalized_features_wts_file = self.get_normalized_wts_file()
        spec_scaler = None

        # pre-processing starts
        if self._is_eval:
            spec_scaler = joblib.load(normalized_features_wts_file)
            print('Normalized_features_wts_file: {}. Loaded.'.format(normalized_features_wts_file))

        else:
            #normalization using partial fit only on training dataset
            print('Estimating weights for normalizing feature files:')
            print('\t\tfeat_dir: {}'.format(self._feat_dir))

            spec_scaler = preprocessing.StandardScaler()
            for file_cnt, file_name in enumerate(os.listdir(self._feat_dir)):
                print('{}: {}'.format(file_cnt, file_name))
                feat_file = np.load(os.path.join(self._feat_dir, file_name))
                spec_scaler.partial_fit(feat_file)
                del feat_file
            joblib.dump(
                spec_scaler,
                normalized_features_wts_file
            )
            print('Normalized_features_wts_file: {}. Saved.'.format(normalized_features_wts_file))

        # transforming the whole dataset
        print('Normalizing feature files:')
        print('\t\tfeat_dir_norm {}'.format(self._feat_dir_norm))
        for file_cnt, file_name in enumerate(os.listdir(self._feat_dir)):
            print('{}: {}'.format(file_cnt, file_name))
            feat_file = np.load(os.path.join(self._feat_dir, file_name))
            feat_file = spec_scaler.transform(feat_file)
            np.save(
                os.path.join(self._feat_dir_norm, file_name),
                feat_file
            )
            del feat_file

        print('normalized files written to {}'.format(self._feat_dir_norm))

    # ------------------------------- EXTRACT LABELS AND PREPROCESS IT -------------------------------
    def extract_all_labels(self):
        """
         The function properly extracts all the labels
        """
        self._label_dir = self.get_label_dir()

        print('Extracting labels:')
        print('\t\taud_dir {}\n\t\tdesc_dir {}\n\t\tlabel_dir {}'.format(
            self._aud_dir, self._desc_dir, self._label_dir))
        create_folder(self._label_dir)

        for file_cnt, file_name in enumerate(os.listdir(self._desc_dir)):
            wav_filename = '{}.wav'.format(file_name.split('.')[0])
            desc_file_polar = self.load_output_format_file(os.path.join(self._desc_dir, file_name))
            desc_file = self.convert_output_format_polar_to_cartesian(desc_file_polar)
            label_mat = self.get_labels_for_file(desc_file)
            print('{}: {}, {}'.format(file_cnt, file_name, label_mat.shape))
            np.save(os.path.join(self._label_dir, '{}.npy'.format(wav_filename.split('.')[0])), label_mat)

    # -------------------------------  DCASE OUTPUT  FORMAT FUNCTIONS -------------------------------
    def load_output_format_file(self, _output_format_file):
        """
        The function loads a DCASE output format CSV file and returns it as a dictionary
        Parameters:
        -------------------
          :param _output_format_file: path of the DCASE output format CSV file
        Returns:
        -------------
          :return: _output_dict: dictionary mapping each frame index to a list of events
        """
        _output_dict = {}
        # The context manager closes the file even if parsing raises an exception
        with open(_output_format_file, 'r') as _fid:
            for _line in _fid:
                _words = _line.strip().split(',')
                _frame_ind = int(_words[0])
                if _frame_ind not in _output_dict:
                    _output_dict[_frame_ind] = []
                if len(_words) == 5:  # polar coordinates format; the track count is ignored
                    _output_dict[_frame_ind].append([int(_words[1]), float(_words[3]), float(_words[4])])
                elif len(_words) == 6:  # Cartesian coordinates format; the track count is ignored
                    _output_dict[_frame_ind].append([int(_words[1]), float(_words[3]), float(_words[4]), float(_words[5])])
        return _output_dict

    def write_output_format_file(self, _output_format_file, _output_format_dict):
        """
        The function writes a DCASE output format CSV file in Cartesian coordinates, given an output format dictionary
        Parameters:
        -----------------------
        :param _output_format_file: file name to write
        :param _output_format_dict: output dictionary
        """
        # The context manager closes the file even if writing raises an exception
        with open(_output_format_file, 'w') as _fid:
            # _fid.write('{},{},{},{}\n'.format('frame number with 20ms hop (int)', 'class index (int)', 'azimuth angle (int)', 'elevation angle (int)'))
            for _frame_ind in _output_format_dict.keys():
                for _value in _output_format_dict[_frame_ind]:
                    # Write Cartesian format output. Since the baseline does not estimate the track count, a fixed value (0) is used.
                    _fid.write('{},{},{},{},{},{}\n'.format(int(_frame_ind), int(_value[0]), 0, float(_value[1]), float(_value[2]), float(_value[3])))

    def write_polar_file(self, output_file, output_format_dict):
        """
        The function writes a DCASE output format CSV file in polar coordinates, given an output format dictionary
        Parameters:
        -----------------------
        :param output_file: file name to write
        :param output_format_dict: output dictionary
        """
        # The context manager closes the file even if writing raises an exception
        with open(output_file, 'w') as _fid:
            # _fid.write('{},{},{},{}\n'.format('frame number with 20ms hop (int)', 'class index (int)', 'azimuth angle (int)', 'elevation angle (int)'))
            for _frame_ind in output_format_dict.keys():
                for _value in output_format_dict[_frame_ind]:
                    # Write polar coordinates format output. Since the baseline does not estimate the track count, a fixed value (0) is used.
                    _fid.write('{},{},{},{},{}\n'.format(int(_frame_ind), int(_value[0]), 0, int(_value[1]), int(_value[2])))

    def segment_labels(self, _pred_dict, _max_frames):
        '''
        The function collects class-wise sound event location information in segments of length 1 s from the reference dataset
        Parameters:
        -----------------
        :param _pred_dict: Dictionary containing frame-wise sound event time and location information. Output of the SELD method
        :param _max_frames: Total number of frames in the recording
        Return:
        -----------------
        :return: Dictionary containing class-wise sound event location information in each segment of audio
                dictionary_name[segment-index][class-index] = list(frame-cnt-within-segment, azimuth, elevation)
        '''
        nb_blocks = int(np.ceil(_max_frames/float(self._nb_label_frames_1s)))
        output_dict = {x: {} for x in range(nb_blocks)}
        for frame_cnt in range(0, _max_frames, self._nb_label_frames_1s):

            # Collect class-wise information for each block
            # [class][frame] = <list of doa values>
            # Data structure supports multi-instance occurrence of the same class
            block_cnt = frame_cnt // self._nb_label_frames_1s
            loc_dict = {}
            for audio_frame in range(frame_cnt, frame_cnt+self._nb_label_frames_1s):
                if audio_frame not in _pred_dict:
                    continue
                for value in _pred_dict[audio_frame]:
                    if value[0] not in loc_dict:
                        loc_dict[value[0]] = {}

                    block_frame = audio_frame - frame_cnt
                    if block_frame not in loc_dict[value[0]]:
                        loc_dict[value[0]][block_frame] = []
                    loc_dict[value[0]][block_frame].append(value[1:])

            # Update the block wise details collected above in a global structure
            for class_cnt in loc_dict:
                if class_cnt not in output_dict[block_cnt]:
                    output_dict[block_cnt][class_cnt] = []

                keys = [k for k in loc_dict[class_cnt]]
                values = [loc_dict[class_cnt][k] for k in loc_dict[class_cnt]]

                output_dict[block_cnt][class_cnt].append([keys, values])

        return output_dict

    def regression_label_format_to_output_format(self, _sed_labels, _doa_labels):
        """
        The function converts the SED (classification) and DOA labels predicted in regression format to the DCASE output format.
        Parameters:
        -----------------
        :param _sed_labels: SED labels matrix [nb_frames, nb_classes]
        :param _doa_labels: DOA labels matrix [nb_frames, 2*nb_classes] or [nb_frames, 3*nb_classes]
        Return:
        ------------------
        :return: _output_dict: returns a dict containing the DCASE output format
        """
        _nb_classes = len(self._unique_classes)
        _is_polar = _doa_labels.shape[-1] == 2*_nb_classes
        _azi_labels, _ele_labels = None, None
        _x, _y, _z = None, None, None
        if _is_polar:
            _azi_labels = _doa_labels[:, :_nb_classes]
            _ele_labels = _doa_labels[:, _nb_classes:]
        else:
            _x = _doa_labels[:, :_nb_classes]
            _y = _doa_labels[:, _nb_classes:2*_nb_classes]
            _z = _doa_labels[:, 2*_nb_classes:]

        _output_dict = {}
        for _frame_ind in range(_sed_labels.shape[0]):
            _tmp_ind = np.where(_sed_labels[_frame_ind, :])
            if len(_tmp_ind[0]):
                _output_dict[_frame_ind] = []
                for _tmp_class in _tmp_ind[0]:
                    if _is_polar:
                        _output_dict[_frame_ind].append([_tmp_class, _azi_labels[_frame_ind, _tmp_class], _ele_labels[_frame_ind, _tmp_class]])
                    else:
                        _output_dict[_frame_ind].append([_tmp_class, _x[_frame_ind, _tmp_class], _y[_frame_ind, _tmp_class], _z[_frame_ind, _tmp_class]])
        return _output_dict

    def convert_output_format_polar_to_cartesian(self, in_dict):
        """
        The function converts the output format from polar coordinates to Cartesian coordinates
        Parameters:
        -----------------
        :param in_dict: output format dictionary in polar coordinates (degrees) to be converted to Cartesian coordinates
        Return:
        ------------------
        :return: out_dict: returns a dict containing the in_dict data converted from polar to Cartesian coordinates
        """
        out_dict = {}
        for frame_cnt in in_dict.keys():
            if frame_cnt not in out_dict:
                out_dict[frame_cnt] = []
                for tmp_val in in_dict[frame_cnt]:

                    ele_rad = tmp_val[2]*np.pi/180.
                    azi_rad = tmp_val[1]*np.pi/180

                    tmp_label = np.cos(ele_rad)
                    x = np.cos(azi_rad) * tmp_label
                    y = np.sin(azi_rad) * tmp_label
                    z = np.sin(ele_rad)
                    out_dict[frame_cnt].append([tmp_val[0], x, y, z])
        return out_dict

    def convert_output_format_cartesian_to_polar(self, in_dict):
        """
        The function converts the output format from Cartesian coordinates to polar coordinates
        Parameters:
        -----------------
        :param in_dict: output format dictionary in Cartesian coordinates to be converted to polar coordinates
        Return:
        ------------------
        :return: out_dict: returns a dict containing the in_dict data converted from Cartesian to polar coordinates (degrees)
        """
        out_dict = {}
        for frame_cnt in in_dict.keys():
            if frame_cnt not in out_dict:
                out_dict[frame_cnt] = []
                for tmp_val in in_dict[frame_cnt]:
                    x, y, z = tmp_val[1], tmp_val[2], tmp_val[3]

                    # in degrees
                    azimuth = np.arctan2(y, x) * 180 / np.pi
                    elevation = np.arctan2(z, np.sqrt(x**2 + y**2)) * 180 / np.pi
                    # The radius is not needed: only the direction (azimuth, elevation) is stored
                    out_dict[frame_cnt].append([tmp_val[0], azimuth, elevation])
        return out_dict

    # ------------------------------- Misc public functions -------------------------------
    def get_classes(self):
        """
        The function returns the unique classes dictionary
        """
        return self._unique_classes

    def get_normalized_feat_dir(self):
        """
        The function returns the normalized feature folder path
        """
        return os.path.join(
            self._feat_label_dir,
            '{}_norm'.format(self._dataset_combination)
        )

    def get_unnormalized_feat_dir(self):
        """
        The function returns the unnormalized feature folder path
        """
        return os.path.join(
            self._feat_label_dir,
            '{}'.format(self._dataset_combination)
        )

    def get_label_dir(self):
        """
        The function returns the label folder path, or None for the evaluation dataset (which has no labels)
        """
        if self._is_eval:
            return None
        else:
            return os.path.join(
                self._feat_label_dir, '{}_label'.format(self._dataset_combination)
            )

    def get_normalized_wts_file(self):
        """
        The function returns the path of the file that stores the fitted normalization weights (the scaler)
        """
        return os.path.join(
            self._feat_label_dir,
            '{}_wts'.format(self._dataset)
        )

    def get_nb_channels(self):
        """
        The function returns the number of channels considered
        """
        return self._nb_channels

    def get_nb_classes(self):
        """
        The function returns the number of classes considered
        """
        return len(self._unique_classes)

    def nb_frames_1s(self):
        """
        The function returns the number of label frames contained in 1 s
        """
        return self._nb_label_frames_1s

    def get_hop_len_sec(self):
        """
        The function returns the hop length in seconds
        """
        return self._hop_len_s

    def get_nb_frames(self):
        """
        The function returns the maximum number of frames considered
        """
        return self._max_label_frames

    def get_nb_mel_bins(self):
        """
        The function returns the number of mel bands considered
        """
        return self._nb_mel_bins
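
The coordinate conversions performed by `convert_output_format_polar_to_cartesian` and `convert_output_format_cartesian_to_polar` can be checked in isolation. The following is a minimal sketch using only the standard library; the function names `polar_to_cartesian` and `cartesian_to_polar` are hypothetical helpers introduced for illustration and are not part of `FeatureClass`. It assumes, as the class does, that angles are given in degrees and that directions lie on the unit sphere.

```python
import math


def polar_to_cartesian(azimuth_deg, elevation_deg):
    """Convert a direction in degrees to a unit vector (x, y, z)."""
    azi = math.radians(azimuth_deg)
    ele = math.radians(elevation_deg)
    # The horizontal components are scaled by cos(elevation)
    x = math.cos(azi) * math.cos(ele)
    y = math.sin(azi) * math.cos(ele)
    z = math.sin(ele)
    return x, y, z


def cartesian_to_polar(x, y, z):
    """Convert a Cartesian direction back to (azimuth, elevation) in degrees."""
    azimuth = math.degrees(math.atan2(y, x))
    elevation = math.degrees(math.atan2(z, math.hypot(x, y)))
    return azimuth, elevation


# Round trip for a source at 90 degrees azimuth and 30 degrees elevation
x, y, z = polar_to_cartesian(90.0, 30.0)
azimuth, elevation = cartesian_to_polar(x, y, z)
print(x, y, z)
print(azimuth, elevation)
```

The round trip should return the original angles up to floating point error, and the intermediate vector should have unit norm.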





## Running the algorithm 

The next cell runs feature extraction on the development dataset. It first extracts all the features, then normalizes them, and finally extracts the corresponding labels.
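
The normalization step fits the scaler one feature file at a time (via `partial_fit`) and only afterwards transforms every file, so the whole dataset never has to be held in memory. The idea can be sketched in plain Python as follows. This is an illustration of incremental mean and variance estimation under stated assumptions, not the scaler's actual implementation: `RunningStats` is a hypothetical helper, and the two small lists stand in for feature files.

```python
import math


class RunningStats:
    """Accumulates per-feature mean and variance over batches (hypothetical helper)."""

    def __init__(self, n_features):
        self.count = 0
        self.mean = [0.0] * n_features
        self.m2 = [0.0] * n_features  # sum of squared deviations from the mean

    def partial_fit(self, rows):
        # Welford's update, applied one row at a time
        for row in rows:
            self.count += 1
            for j, value in enumerate(row):
                delta = value - self.mean[j]
                self.mean[j] += delta / self.count
                self.m2[j] += delta * (value - self.mean[j])

    def transform(self, rows):
        # Population standard deviation; a zero deviation is replaced by 1.0
        std = [math.sqrt(m / self.count) or 1.0 for m in self.m2]
        return [[(v - mu) / s for v, mu, s in zip(row, self.mean, std)]
                for row in rows]


# Two toy "feature files", each with two frames and two features
file_a = [[1.0, 10.0], [2.0, 20.0]]
file_b = [[3.0, 30.0], [4.0, 40.0]]

# First pass: accumulate statistics file by file
stats = RunningStats(n_features=2)
for batch in (file_a, file_b):
    stats.partial_fit(batch)

# Second pass: normalize with the statistics of the whole dataset
normalized = stats.transform(file_a + file_b)
print(stats.mean)
print(normalized)
```

After the second pass each feature column has zero mean and unit variance over the whole dataset, which is the property the normalized feature files are expected to have.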



In [25]:
# Extract the features, normalize them, and extract the labels for the development split.
if 'dev' in process_str:
    # -------------- Extract features and labels for development set -----------------------------
    dev_feat_cls = FeatureClass(is_eval=False)

    # Extract features and normalize them
    dev_feat_cls.extract_all_feature()
    dev_feat_cls.preprocess_features()

    # Extract labels in regression mode
    dev_feat_cls.extract_all_labels()

    print("Development dataset extraction finished")

Development dataset extraction finished


The next cell runs feature extraction on the evaluation dataset. It first extracts all the features and then normalizes them.
Labels are not extracted in this case because no ground truth is available for the evaluation set.


In [26]:
if 'eval' in process_str:
    # -----------------------------Extract ONLY features for evaluation set-----------------------------
    eval_feat_cls = FeatureClass(is_eval=True)

    # Extract features and normalize them
    eval_feat_cls.extract_all_feature()
    eval_feat_cls.preprocess_features()
    print("Evaluation dataset extraction finished")

Evaluation dataset extraction finished
