In this notebook we try out the various data augmentation functions so that we can decide which combinations work best.

# Import libraries

In [1]:
import data_augmentation
from scipy.io import wavfile as wav
import IPython.display
import librosa
import os


# Test data augmentation functions
## Random noise
### Original audio:

In [2]:
fn_us = "./preprocessed_recs/0_khaled_0.wav"
fn_spoken_dataset = "./recordings/0_yweweler_0.wav"
rate = 8000
original_signal_us, _ = librosa.load(fn_us, sr=rate)
original_signal_spoken_dataset, _ = librosa.load(fn_spoken_dataset, sr=rate)

IPython.display.Audio(original_signal_us,rate=rate)

In [3]:
IPython.display.Audio(original_signal_spoken_dataset,rate=rate)

### Upper bound

In [4]:
noise_audio_us = data_augmentation.add_random_noise(original_signal_us, mu=0, stdev=0.05)
noise_audio_spoken_dataset = data_augmentation.add_random_noise(original_signal_spoken_dataset, mu=0, stdev=0.025)

In [5]:
IPython.display.Audio(noise_audio_us,rate=rate)

In [6]:
IPython.display.Audio(noise_audio_spoken_dataset,rate=rate)

Our recordings are clearer than those of the spoken dataset, but in both the noise is quite dominant. Let's use 0.05 as the upper bound on stdev for our recordings and 0.025 for the Free Spoken Digit Dataset.
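
The source of `data_augmentation.add_random_noise` is not shown in this notebook. As a rough sketch only, assuming it adds zero-mean Gaussian noise with the given `mu` and `stdev`, it could look like the hypothetical `add_gaussian_noise` below. The synthetic tone stands in for a real recording.

```python
import numpy as np


def add_gaussian_noise(signal, mu=0.0, stdev=0.05, seed=None):
    """Hypothetical stand-in for data_augmentation.add_random_noise."""
    rng = np.random.default_rng(seed)
    # Draw one noise sample per audio sample.
    noise = rng.normal(loc=mu, scale=stdev, size=signal.shape)
    # Keep the dtype of the input (librosa loads audio as float32).
    return (signal + noise).astype(signal.dtype)


# Synthetic one-second 440 Hz tone at 8 kHz, used instead of a real recording.
rate = 8000
t = np.arange(rate) / rate
tone = (0.1 * np.sin(2 * np.pi * 440 * t)).astype(np.float32)

noisy = add_gaussian_noise(tone, mu=0.0, stdev=0.05, seed=0)
# The standard deviation of the added noise should be close to stdev.
residual_std = float(np.std(noisy - tone))
```

Fixing `seed` makes the augmentation reproducible, which helps when comparing bounds by ear.
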
### Lower bound

In [7]:
noise_audio_us = data_augmentation.add_random_noise(original_signal_us, mu=0, stdev=0.002)
noise_audio_spoken_dataset = data_augmentation.add_random_noise(original_signal_spoken_dataset, mu=0, stdev=0.001)

In [8]:
IPython.display.Audio(noise_audio_us,rate=rate)

In [9]:
IPython.display.Audio(noise_audio_spoken_dataset,rate=rate)

In this case too, we had to use a different stdev for each dataset to get "similar" results.
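
One plausible explanation, not verified here, is that the two sources are recorded at different levels, so the same stdev gives a different signal-to-noise ratio. The sketch below uses a hypothetical helper `snr_db` and two synthetic tones with illustrative amplitudes (not measured from the recordings) to show that halving both the signal amplitude and the stdev leaves the SNR unchanged.

```python
import numpy as np


def snr_db(signal, noise_stdev):
    """SNR in dB for zero-mean noise with the given standard deviation."""
    signal_power = np.mean(np.asarray(signal, dtype=np.float64) ** 2)
    noise_power = float(noise_stdev) ** 2
    return float(10.0 * np.log10(signal_power / noise_power))


rate = 8000
t = np.arange(rate) / rate
# Illustrative amplitudes only: a louder and a quieter 440 Hz tone.
loud = (0.1 * np.sin(2 * np.pi * 440 * t)).astype(np.float32)
quiet = (0.05 * np.sin(2 * np.pi * 440 * t)).astype(np.float32)

snr_loud = snr_db(loud, noise_stdev=0.05)
snr_quiet = snr_db(quiet, noise_stdev=0.025)
print(f"loud: {snr_loud:.2f} dB, quiet: {snr_quiet:.2f} dB")
```

Calling `snr_db` on the real signals with each candidate stdev would be one way to pick bounds that sound comparable across the two datasets.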

## Pitch shift

### Original audio:

In [10]:
IPython.display.Audio(original_signal_us,rate=rate)

In [11]:
IPython.display.Audio(original_signal_spoken_dataset,rate=rate)

### Lower bound

In [12]:
original_signal_us

array([-0.00271606, -0.00195312, -0.00109863, ...,  0.01226807,
        0.01376343,  0.0133667 ], dtype=float32)

In [13]:
original_signal_spoken_dataset

array([ 3.0517578e-04,  3.0517578e-05,  3.9672852e-04, ...,
       -2.7465820e-04, -3.9672852e-04, -3.9672852e-04], dtype=float32)

In [14]:
pitch_shift_audio_us = data_augmentation.change_pitch(original_signal_us, sampling_rate=rate, pitch_step=-5)
pitch_shift_audio_spoken_dataset = data_augmentation.change_pitch(original_signal_spoken_dataset, sampling_rate=rate, pitch_step=-5)
IPython.display.Audio(pitch_shift_audio_us, rate=rate)

In [15]:
IPython.display.Audio(pitch_shift_audio_spoken_dataset,rate=rate)

Nice caveman voices :) Let's try with -10:

In [16]:
pitch_shift_audio_us = data_augmentation.change_pitch(original_signal_us, sampling_rate=rate, pitch_step=-10)
pitch_shift_audio_spoken_dataset = data_augmentation.change_pitch(original_signal_spoken_dataset, sampling_rate=rate, pitch_step=-10)
IPython.display.Audio(pitch_shift_audio_us, rate=rate)

In [17]:
IPython.display.Audio(pitch_shift_audio_spoken_dataset,rate=rate)

The voices are nearly incomprehensible. Let's try a pitch step of -6:

In [18]:
pitch_shift_audio_us = data_augmentation.change_pitch(original_signal_us, sampling_rate=rate, pitch_step=-6)
pitch_shift_audio_spoken_dataset = data_augmentation.change_pitch(original_signal_spoken_dataset, sampling_rate=rate, pitch_step=-6)
IPython.display.Audio(pitch_shift_audio_us, rate=rate)

In [19]:
IPython.display.Audio(pitch_shift_audio_spoken_dataset,rate=rate)

### Upper bound

In [20]:
pitch_shift_audio_us = data_augmentation.change_pitch(original_signal_us, sampling_rate=rate, pitch_step=5)
pitch_shift_audio_spoken_dataset = data_augmentation.change_pitch(original_signal_spoken_dataset, sampling_rate=rate, pitch_step=5)
IPython.display.Audio(pitch_shift_audio_us, rate=rate)

In [21]:
IPython.display.Audio(pitch_shift_audio_spoken_dataset,rate=rate)

Let's try a pitch step of 4:

In [22]:
pitch_shift_audio_us = data_augmentation.change_pitch(original_signal_us, sampling_rate=rate, pitch_step=4)
pitch_shift_audio_spoken_dataset = data_augmentation.change_pitch(original_signal_spoken_dataset, sampling_rate=rate, pitch_step=4)
IPython.display.Audio(pitch_shift_audio_us, rate=rate)

In [23]:
IPython.display.Audio(pitch_shift_audio_spoken_dataset,rate=rate)

A pitch step of 5 seems a good compromise! Pitch, however, is a personal characteristic of the speaker, so we will apply pitch shifting only for digit recognition.
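
To get a feel for how large these steps are, the sketch below converts a step into a frequency ratio. It assumes that `pitch_step` counts semitones, as the `n_steps` argument of `librosa.effects.pitch_shift` does with its default of 12 bins per octave. Whether `change_pitch` forwards the value that way is not shown in this notebook, and `semitones_to_ratio` is a hypothetical helper.

```python
def semitones_to_ratio(n_steps, bins_per_octave=12):
    """Frequency ratio for a shift of n_steps (equal temperament)."""
    return 2.0 ** (n_steps / bins_per_octave)


# The pitch steps tried in this notebook.
steps = (-10, -6, -5, 4, 5)
ratios = {step: semitones_to_ratio(step) for step in steps}

for step, ratio in ratios.items():
    print(f"{step:+d} semitones -> frequencies scaled by {ratio:.3f}")
```

Under that assumption, a step of -10 scales every frequency to a little over half its original value, which is consistent with the voices becoming hard to understand.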

In [24]:
my_string = "0_khaled_0.wav"

In [27]:
any(ext in my_string for ext in ['01', 'khadled', 'e1d'])

False