## Python audio load

There are several ways to read audio files and extract audio features in Python. This notebook compares the advantages and disadvantages of different packages.


### Librosa (librosa.load)

#### Advantages:
1. Librosa automatically resamples audio recorded at any sample rate to a single target rate (22050 Hz by default).
2. Librosa supports audio with different bit depths.
3. Librosa converts the audio signal to floating point in the range [-1, 1].

#### Disadvantage:
1. Speed is the biggest disadvantage of librosa.load. Over 1000 reads of the same file, the average loading time was about 0.218 s (std 0.018 s).

In [2]:
import librosa
import time
import numpy as np

time_lib = []
audio_sample = '/efs/kevin/audio/Train/0.wav'

for i in range(1000):
    start_time = time.time()
    signal, sample_rate = librosa.load(audio_sample)  
    #print("--- %s seconds ---" % (time.time() - start_time))
    time_lib.append(time.time() - start_time)

print(np.mean(time_lib))
print(np.std(time_lib))

0.21801227474212648
0.01819368315483547
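The timing loop above can be factored into a small helper so that every loader is measured the same way. The following is a minimal sketch, not part of the original notebook: `time_loader` and `read_wav_stdlib` are hypothetical names, the helper uses `time.perf_counter` instead of `time.time`, and the demonstration runs on a short synthetic WAV file written with the standard-library `wave` module rather than on the notebook's data.

```python
import os
import statistics
import tempfile
import time
import wave


def time_loader(loader, path, repeats=100):
    """Call loader(path) `repeats` times; return (mean, std) of wall time in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        loader(path)
        durations.append(time.perf_counter() - start)
    # pstdev is the population standard deviation, like np.std with default arguments.
    return statistics.mean(durations), statistics.pstdev(durations)


def read_wav_stdlib(path):
    """Stand-in loader (assumption): read raw PCM bytes with the stdlib wave module."""
    with wave.open(path, "rb") as wav:
        return wav.readframes(wav.getnframes()), wav.getframerate()


# Write 0.1 s of 16-bit mono silence at 44.1 kHz as a synthetic test file.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)
    wav.setframerate(44100)
    wav.writeframes(b"\x00\x00" * 4410)

mean_s, std_s = time_loader(read_wav_stdlib, demo_path, repeats=50)
print(f"mean {mean_s:.6f} s, std {std_s:.6f} s")
```

To benchmark the loaders compared in this notebook, pass `librosa.load`, `sf.read`, or a small wrapper around `scipy.io.wavfile.read` as `loader`.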



### PySoundFile

#### Advantages
1. PySoundFile also returns the audio signal as floating point in the range [-1, 1].
2. It is much faster than Librosa. Over 1000 reads of the same file, the average loading time was about 0.0077 s (std 0.00075 s), including the stereo-to-mono downmix.

#### Disadvantage:
1. It does not resample, so it cannot automatically convert files with different sample rates to a common rate.
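Because PySoundFile does not resample, a separate resampling step is needed when files come at different rates. One possible approach, sketched here under assumptions, is polyphase resampling with `scipy.signal.resample_poly`. The helper name `resample_to` is hypothetical, the input is a synthetic tone standing in for the output of `sf.read`, and the result will not be bit-identical to what Librosa produces, since the resampling filters differ.

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly


def resample_to(signal, orig_sr, target_sr):
    """Resample a 1-D signal from orig_sr to target_sr with a polyphase filter."""
    if orig_sr == target_sr:
        return signal
    # Reduce the rate ratio to the smallest integer up/down factors.
    divisor = gcd(orig_sr, target_sr)
    return resample_poly(signal, target_sr // divisor, orig_sr // divisor)


# Synthetic stand-in for the output of sf.read: 1 s of a 440 Hz tone at 44.1 kHz.
orig_sr = 44100
t = np.arange(orig_sr) / orig_sr
tone = 0.5 * np.sin(2 * np.pi * 440 * t)

resampled = resample_to(tone, orig_sr, 22050)
print(len(tone), len(resampled))
```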




In [5]:
import soundfile as sf
import time
time_sf = []
audio_sample = '/efs/kevin/audio/Train/0.wav'

for i in range(1000):
    start_time = time.time()
    signal, sample_rate = sf.read(audio_sample)
    signal = signal.sum(axis=1) / 2
    #print("--- %s seconds ---" % (time.time() - start_time))
    time_sf.append(time.time() - start_time)

print(np.mean(time_sf))
print(np.std(time_sf))

0.007678573608398438
0.0007520200370818408



### SciPy (not recommended) 

#### Advantages 
1. It is faster than Librosa but slower than PySoundFile. Over 1000 reads of the same file, the average loading time was about 0.014 s (std 0.00065 s), including the downmix and normalization.

#### Disadvantages
1. Many formats are not supported.
2. Certain metadata fields in a WAV file can cause read errors.
3. The audio is not converted to the float range [-1, 1]; integer PCM samples are returned as-is.
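The missing float conversion can be added by hand. The sketch below shows one common convention for scaling integer PCM samples to floats; `pcm_to_float` is a hypothetical helper name, and the choice of dividing by the magnitude of the most negative integer value (32768 for 16-bit audio) is an assumption about the desired scaling, not something the original notebook does.

```python
import numpy as np


def pcm_to_float(samples):
    """Scale integer PCM samples to float32 in [-1, 1); pass float input through."""
    if samples.dtype.kind == "f":
        return samples.astype(np.float32)
    if samples.dtype == np.uint8:
        # 8-bit WAV data is unsigned with the zero level at 128.
        return (samples.astype(np.float32) - 128.0) / 128.0
    # Signed PCM: divide by the magnitude of the most negative value (32768 for int16).
    full_scale = float(-np.iinfo(samples.dtype).min)
    return (samples.astype(np.float64) / full_scale).astype(np.float32)


pcm = np.array([-32768, -16384, 0, 16384, 32767], dtype=np.int16)
print(pcm_to_float(pcm))
```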


In [13]:
import scipy.io.wavfile
import time

time_wav = []

for i in range(1000):
    audio_sample = '/efs/kevin/audio/Train/0.wav'
    start_time = time.time()
    sample_rate, signal = scipy.io.wavfile.read(audio_sample)  # returns (sample rate, raw integer samples)
    signal = signal.sum(axis=1) / 2
    norm = np.linalg.norm(signal)
    signal = signal/norm
    #print("--- %s seconds ---" % (time.time() - start_time))
    time_wav.append(time.time() - start_time)
    
print(np.mean(time_wav))
print(np.std(time_wav))


0.014380881547927856
0.0006542380005182429



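### Numerical consistency of the three methods

We also compare the sample values produced by the three packages. PySoundFile produces the same values as Librosa when Librosa loads the file at its native sample rate. The SciPy values differ slightly because wavfile.read returns raw integer samples, which the cell below rescales by the peak amplitude and mean-centers instead of dividing by the integer full scale.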


In [18]:
signal, sample_rate = sf.read(audio_sample)
signal = signal.sum(axis=1) / 2
print(signal)
signal, sample_rate = librosa.load(audio_sample, sr=sample_rate)  # pass sr by keyword to keep the native rate
print(signal)
sample_rate, signal = scipy.io.wavfile.read(audio_sample)  # raw integer samples, shape (frames, channels)
signal = signal.sum(axis=1) / 2
signal = signal.astype(np.float32)
signal = (signal / np.max(np.abs(signal)))
signal -= np.mean(signal)
signal = signal/2
print(signal)

[-0.01499939 -0.02107239 -0.02680969 ...  0.01426697 -0.00772095
 -0.01293945]
[-0.01499939 -0.02107239 -0.02680969 ...  0.01426697 -0.00772095
 -0.01293945]
[-0.0142446  -0.02005143 -0.02553728 ...  0.01373906 -0.00728516
 -0.01227495]
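The small difference in the SciPy output comes from the scaling applied in the cell, not from the file reader. As a check, and assuming the file is 16-bit PCM (as the int16 dtype in the raw `wavfile.read` output later in this notebook suggests), averaging the two channels and dividing by 32768 reproduces the PySoundFile and Librosa values. The three stereo frames below are copied from that raw output.

```python
import numpy as np

# First three stereo frames as returned by scipy.io.wavfile.read
# (values copied from the raw int16 output shown later in this notebook).
stereo_int16 = np.array([[-1093, 110], [-1262, -119], [-1476, -281]], dtype=np.int16)

# Downmix to mono by averaging the channels, then divide by the int16 full scale.
mono = stereo_int16.astype(np.float64).mean(axis=1)
scaled = mono / 32768.0
print(scaled)

# First three values printed above for PySoundFile and Librosa.
reference = np.array([-0.01499939, -0.02107239, -0.02680969])
print(np.allclose(scaled, reference, atol=1e-7))
```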


#### Conclusion
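Comparing the three packages, PySoundFile is recommended: it is the fastest loader and returns the same values as Librosa. Its one gap is resampling, which has to be handled separately when files use different sample rates.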

## Python feature extraction

Librosa can extract audio features and perform audio augmentation, but it is known to be slow. This section looks at other ways of extracting audio features.

The main sources are listed here:
1. https://haythamfayek.com/2016/04/21/speech-processing-for-machine-learning.html
2. https://python-speech-features.readthedocs.io/en/latest/

In [2]:
import librosa
import scipy.io.wavfile
import numpy
import time 
from scipy.fftpack import dct
from python_speech_features import mfcc

In [5]:
import soundfile as sf
audio_sample = '/efs/kevin/audio/Train/0.wav'
start = time.time()
x, Fs = sf.read(audio_sample)
print(time.time() - start)

0.012367010116577148


In [7]:
Fs

44100

In [14]:
!pip install python_speech_features

Collecting python_speech_features
  Downloading https://files.pythonhosted.org/packages/ff/d1/94c59e20a2631985fbd2124c45177abaa9e0a4eee8ba8a305aa26fc02a8e/python_speech_features-0.6.tar.gz
Building wheels for collected packages: python-speech-features
  Building wheel for python-speech-features (setup.py) ... done
  Created wheel for python-speech-features: filename=python_speech_features-0.6-cp36-none-any.whl size=5887 sha256=4816dff088453065da8932d5caa56b12c120d23a2d55d38e6b3a319c1381ac2e
  Stored in directory: /home/ubuntu/.cache/pip/wheels/3c/42/7c/f60e9d1b40015cd69b213ad90f7c18a9264cd745b9888134be
Successfully built python-speech-features
Installing collected packages: python-speech-features
Successfully installed python-speech-features-0.6


In [25]:
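# wavfile.read never resamples; its second argument is mmap, so passing a
# sample rate such as 16000 there only memory-maps the file.
sample_rate, signal = scipy.io.wavfile.read(audio_sample, mmap=True)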

In [26]:
signal

memmap([[-1093,   110],
        [-1262,  -119],
        [-1476,  -281],
        ...,
        [  553,   382],
        [ -533,    27],
        [ -994,   146]], dtype=int16)

In [18]:
def speech_mfcc(audio):
    start_time = time.time()
    
    sample_rate, signal = scipy.io.wavfile.read(audio)  # read the file passed in, not the global audio_sample
    signal = signal.sum(axis=1) / 2  # stereo to mono
    #norm = np.linalg.norm(signal)
    #signal = signal/norm
    pre_emphasis = 0.97
    # Note: mfcc() applies its own pre-emphasis by default (preemph=0.97),
    # so the signal is pre-emphasized twice here.
    emphasized_signal = numpy.append(signal[0], signal[1:] - pre_emphasis * signal[:-1])
    res = mfcc(emphasized_signal, sample_rate)
    print("--- %s seconds ---" % (time.time() - start_time))
    return res
    
audio_sample = '/efs/kevin/audio/Train/0.wav'
speech_mfcc(audio_sample)  



--- 0.03383207321166992 seconds ---


array([[ 16.54031732, -17.50755119, -40.50939113, ...,   6.34731066,
          5.92619134,   7.28971799],
       [ 16.60084093,  -7.59244194, -35.61287637, ...,  -3.76493038,
          7.07757956,  -2.97408499],
       [ 16.73923831, -17.80143536, -45.96136945, ...,  16.08051821,
         10.50033786,  10.34469117],
       ...,
       [ 16.27960321, -17.85235779, -37.84628609, ...,  -1.32503418,
          2.146214  ,  10.96781671],
       [ 16.30563504, -19.54822195, -37.36319627, ...,   1.19234113,
          0.48887082,   5.64367358],
       [ 16.43376117, -22.6739382 , -47.26553261, ...,   9.84395669,
         -4.61113531,   8.39586759]])
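The pre-emphasis step used above is a first-order high-pass filter, y[t] = x[t] - 0.97 * x[t-1], with the first sample passed through unchanged. The following sketch isolates it to show its effect on two extreme inputs; `pre_emphasize` is a hypothetical helper name, and the test signals are synthetic.

```python
import numpy as np


def pre_emphasize(signal, coeff=0.97):
    """Return y[0] = x[0] and y[t] = x[t] - coeff * x[t-1] for t >= 1."""
    signal = np.asarray(signal, dtype=np.float64)
    return np.append(signal[0], signal[1:] - coeff * signal[:-1])


# A constant (zero-frequency) signal is almost cancelled after the first sample.
constant_out = pre_emphasize(np.ones(5))
print(constant_out)

# A sign-alternating (highest-frequency) signal is almost doubled in amplitude.
alternating_out = pre_emphasize(np.array([1.0, -1.0, 1.0, -1.0, 1.0]))
print(alternating_out)
```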

In [11]:
def fast_mfcc(audio, n_mfcc):
    start_time = time.time()
    sample_rate, signal = scipy.io.wavfile.read(audio)  # read the file passed in, not the global audio_sample
    signal = signal.sum(axis=1) / 2
    #norm = np.linalg.norm(signal)
    #signal = signal/norm
    pre_emphasis = 0.97
    emphasized_signal = numpy.append(signal[0], signal[1:] - pre_emphasis * signal[:-1])
    
    frame_size = 0.025
    frame_stride = 0.01
    
    frame_length, frame_step = frame_size * sample_rate, frame_stride * sample_rate  # Convert from seconds to samples
    signal_length = len(emphasized_signal)
    frame_length = int(round(frame_length))
    frame_step = int(round(frame_step))
    num_frames = int(numpy.ceil(float(numpy.abs(signal_length - frame_length)) / frame_step))  # Make sure that we have at least 1 frame

    pad_signal_length = num_frames * frame_step + frame_length
    z = numpy.zeros((pad_signal_length - signal_length))
    pad_signal = numpy.append(emphasized_signal, z) # Pad Signal to make sure that all frames have equal number of samples without truncating any samples from the original signal

    indices = numpy.tile(numpy.arange(0, frame_length), (num_frames, 1)) + numpy.tile(numpy.arange(0, num_frames * frame_step, frame_step), (frame_length, 1)).T
    frames = pad_signal[indices.astype(numpy.int32, copy=False)]
    
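    # Note: at 44.1 kHz a 25 ms frame is about 1102 samples, so rfft below crops
    # each frame to NFFT samples; choose NFFT >= frame_length to use the full frame.
    NFFT = 512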
    frames *= numpy.hamming(frame_length)
    mag_frames = numpy.absolute(numpy.fft.rfft(frames, NFFT))  # Magnitude of the FFT
    pow_frames = ((1.0 / NFFT) * ((mag_frames) ** 2)) 
    
    low_freq_mel = 0
    num_ceps = 12  # cepstral coefficients kept; n_mfcc is used below as the number of mel filters
    high_freq_mel = (2595 * numpy.log10(1 + (sample_rate / 2) / 700))  # Convert Hz to Mel
    mel_points = numpy.linspace(low_freq_mel, high_freq_mel, n_mfcc + 2)  # Equally spaced in Mel scale
    hz_points = (700 * (10**(mel_points / 2595) - 1))  # Convert Mel to Hz
    bin = numpy.floor((NFFT + 1) * hz_points / sample_rate)

    fbank = numpy.zeros((n_mfcc, int(numpy.floor(NFFT / 2 + 1))))
    for m in range(1, n_mfcc + 1):
        f_m_minus = int(bin[m - 1])   # left
        f_m = int(bin[m])             # center
        f_m_plus = int(bin[m + 1])    # right

        for k in range(f_m_minus, f_m):
            fbank[m - 1, k] = (k - bin[m - 1]) / (bin[m] - bin[m - 1])
        for k in range(f_m, f_m_plus):
            fbank[m - 1, k] = (bin[m + 1] - k) / (bin[m + 1] - bin[m])
    filter_banks = numpy.dot(pow_frames, fbank.T)
    filter_banks = numpy.where(filter_banks == 0, numpy.finfo(float).eps, filter_banks)  # Numerical Stability
    filter_banks = 20 * numpy.log10(filter_banks)  # dB
    
    # Named mfcc_feat so the result does not shadow the imported mfcc function.
    mfcc_feat = dct(filter_banks, type=2, axis=1, norm='ortho')[:, 1 : (num_ceps + 1)] # Keep 2-13
    filter_banks -= (numpy.mean(filter_banks, axis=0) + 1e-8)
    mfcc_feat -= (numpy.mean(mfcc_feat, axis=0) + 1e-8)
    print("--- %s seconds ---" % (time.time() - start_time))
    return mfcc_feat



In [12]:
audio_sample = '/efs/kevin/audio/Train/0.wav'
n_mfcc = 40
features = fast_mfcc(audio_sample, n_mfcc)

--- 0.029557228088378906 seconds ---
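The framing parameters in `fast_mfcc` interact with the FFT size: `numpy.fft.rfft(frames, NFFT)` crops its input when `NFFT` is smaller than the frame length. The arithmetic below, a sketch using the 44100 Hz sample rate reported for this file, shows the frame length and hop in samples and one possible choice of FFT size (the next power of two, which is an assumption rather than the notebook's setting).

```python
sample_rate = 44100   # sample rate reported for the notebook's file
frame_size = 0.025    # 25 ms window, as in fast_mfcc
frame_stride = 0.01   # 10 ms hop, as in fast_mfcc

# Convert from seconds to samples, as fast_mfcc does.
frame_length = int(round(frame_size * sample_rate))
frame_step = int(round(frame_stride * sample_rate))

# Smallest power of two that is >= frame_length.
nfft = 1 << (frame_length - 1).bit_length()

print(frame_length, frame_step, nfft)
print("frame fits in a 512-point FFT:", frame_length <= 512)
```

With a 512-point FFT, less than half of each 25 ms frame contributes to the spectrum at this sample rate, so the timing of `fast_mfcc` is not directly comparable to an implementation that transforms the full frame.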


In [4]:
time_wav = []
time_lib = []

for i in range(1000):
    audio_sample = '/efs/kevin/audio/Train/0.wav'
    start_time = time.time()
    sample_rate_wav, signal_wav = scipy.io.wavfile.read(audio_sample)  # raw integer samples at the native rate
    signal_wav = signal_wav.sum(axis=1) / 2
    norm = np.linalg.norm(signal_wav)
    signal_wav = signal_wav/norm
    #print("--- %s seconds ---" % (time.time() - start_time))
    time_wav.append(time.time() - start_time)
    
    start_time = time.time()
    signal, sample_rate = librosa.load(audio_sample)  # resamples to 22050 Hz and downmixes to mono by default
    #print("--- %s seconds ---" % (time.time() - start_time))
    time_lib.append(time.time() - start_time)

In [10]:
print(np.mean(time_wav))
print(np.mean(time_lib))

0.012060447931289672
0.24378951597213744


In [12]:
signal

array([-0.01212928, -0.02760112, -0.02535508, ...,  0.09790608,
        0.04330474, -0.00681015], dtype=float32)

In [9]:
signal_wav

array([-491.5, -690.5, -878.5, ...,  467.5, -253. , -424. ])