Clone the rg_speech_to_text repository to get access to the audio files.

In [8]:
!git clone https://github.com/TheSoundOfAIOSR/rg_speech_to_text

Cloning into 'rg_speech_to_text'...
remote: Enumerating objects: 783, done.
remote: Counting objects: 100% (315/315), done.
remote: Compressing objects: 100% (223/223), done.
remote: Total 783 (delta 150), reused 237 (delta 86), pack-reused 468
Receiving objects: 100% (783/783), 13.55 MiB | 13.10 MiB/s, done.
Resolving deltas: 100% (395/395), done.


In [10]:
!ls rg_speech_to_text/data/finetuning-dataset/audiofiles

TA-0.wav   TA-5.wav   TK-20.wav  TK-36.wav  TK-9.wav   TM-24.wav  TM-3.wav
TA-10.wav  TA-6.wav   TK-21.wav  TK-37.wav  TM-0.wav   TM-25.wav  TM-40.wav
TA-11.wav  TA-7.wav   TK-22.wav  TK-38.wav  TM-10.wav  TM-26.wav  TM-41.wav
TA-12.wav  TA-8.wav   TK-23.wav  TK-39.wav  TM-11.wav  TM-27.wav  TM-42.wav
TA-13.wav  TA-9.wav   TK-24.wav  TK-3.wav   TM-12.wav  TM-28.wav  TM-43.wav
TA-14.wav  TK-0.wav   TK-25.wav  TK-40.wav  TM-13.wav  TM-29.wav  TM-44.wav
TA-15.wav  TK-10.wav  TK-26.wav  TK-41.wav  TM-14.wav  TM-2.wav   TM-45.wav
TA-16.wav  TK-11.wav  TK-27.wav  TK-42.wav  TM-15.wav  TM-30.wav  TM-46.wav
TA-17.wav  TK-12.wav  TK-28.wav  TK-43.wav  TM-16.wav  TM-31.wav  TM-4.wav
TA-18.wav  TK-13.wav  TK-29.wav  TK-44.wav  TM-17.wav  TM-32.wav  TM-5.wav
TA-19.wav  TK-14.wav  TK-2.wav   TK-45.wav  TM-18.wav  TM-33.wav  TM-6.wav
TA-1.wav   TK-15.wav  TK-30.wav  TK-46.wav  TM-19.wav  TM-34.wav  TM-7.wav
TA-20.wav  TK-16.wav  TK-31.wav  TK-4.wav   TM-1.wav   TM-35.wav  TM-8.wav
TA-21.wav  TK-17.wa

## Prerequisites
Google Colab does not ship with sox pre-installed, so we install it first, followed by torchaudio and WavAugment.

In [11]:
!apt-get install libsox-fmt-all libsox-dev sox > /dev/null
! python -m pip install torchaudio > /dev/null
! python -m pip install git+https://github.com/facebookresearch/WavAugment.git > /dev/null

  Running command git clone -q https://github.com/facebookresearch/WavAugment.git /tmp/pip-req-build-2r03q9z7


In [12]:
import torchaudio

In [13]:
# load one audio file as a tensor; sr is its sample rate
x, sr = torchaudio.load('rg_speech_to_text/data/finetuning-dataset/audiofiles/TA-0.wav')

## Applying augmentation

In [14]:
import torch
import augment
import numpy as np

import IPython.display as ipd

In [15]:
print(f'We loaded a speech example; sample rate: {sr}, number of channels: {x.size(0)}, its length is {x.size(1)} frames or about {x.size(1) // sr} seconds.')
ipd.Audio(x, rate=sr)

We loaded a speech example; sample rate: 16000, number of channels: 2, its length is 62450 frames or about 3 seconds.
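
The "about 3 seconds" in the output comes from the floor division `//` in the print statement. A small worked example, using the frame count and sample rate printed above, shows the exact duration:

```python
# Values taken from the printed output above.
num_frames = 62450
sample_rate = 16000

whole_seconds = num_frames // sample_rate  # floor division, as in the print statement
exact_seconds = num_frames / sample_rate   # true division keeps the fractional part

print(whole_seconds)  # 3
print(exact_seconds)  # 3.903125
```

So the clip is closer to 4 seconds than to 3.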


WavAugment exposes its effects through `augment.EffectChain`, which applies them in sequence. Chaining effects is conceptually similar to how `sox` does it.

In [21]:
# empty effect
empty_chain = augment.EffectChain()
y = empty_chain.apply(x, src_info={'rate': sr})

In [19]:
# clip effect
clip_chain = augment.EffectChain().clip(0.25)

In [18]:
y = clip_chain.apply(x, src_info={'rate': sr})
ipd.Audio(y, rate=sr)

At the moment, `WavAugment`'s `pitch` is a fairly thin wrapper around the corresponding `libsox` effect. Internally, `libsox` represents a pitch change as a combination of tempo and rate effects, so for the time being we need to change the rate back manually.
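
The sox `pitch` effect takes its shift in cents (hundredths of a semitone). Assuming WavAugment passes the value through unchanged, the shifts of -200 and +200 used in the next cells correspond to two semitones down and up. A small sketch of the conversion to a frequency ratio; the helper `cents_to_ratio` is ours, not part of any library:

```python
def cents_to_ratio(cents: float) -> float:
    """Frequency ratio for a pitch shift in cents (1200 cents = one octave)."""
    return 2.0 ** (cents / 1200.0)

down = cents_to_ratio(-200)  # two semitones down, roughly 0.891
up = cents_to_ratio(200)     # two semitones up, roughly 1.122

print(f"{down:.3f} {up:.3f}")
```

The two ratios are reciprocals of each other, so shifting by -200 and then by +200 would, in principle, restore the original pitch.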

In [22]:
# lowering the pitch
y = augment.EffectChain().pitch(-200).rate(sr) \
  .apply(x, src_info={'rate': sr})
ipd.Audio(y, rate=sr)

In [23]:
# raising the pitch
y = augment.EffectChain().pitch(200).rate(sr) \
  .apply(x, src_info={'rate': sr})
ipd.Audio(y, rate=sr)

In [25]:
# reverb effect
y = augment.EffectChain().reverb(50, 50, 50).channels(1).apply(x, src_info={'rate': sr})
ipd.Audio(y, rate=sr)

In [26]:
# dropout
y = augment.EffectChain().time_dropout(max_seconds=0.5).apply(x, src_info={'rate': sr})
ipd.Audio(y, rate=sr)
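
To make the dropout effect concrete, here is a minimal sketch in plain Python of the general idea: one random contiguous span, at most `max_seconds` long, is set to zero. The helper `time_dropout_sketch` is hypothetical, and WavAugment's own implementation may differ in its details.

```python
import random

def time_dropout_sketch(samples, sample_rate, max_seconds, rng=random):
    """Return a copy of `samples` with one random contiguous span zeroed."""
    # Longest span we are allowed to silence, in frames.
    max_frames = min(int(sample_rate * max_seconds), len(samples))
    length = rng.randint(0, max_frames)            # inclusive bounds
    start = rng.randint(0, len(samples) - length)  # span must fit in the signal
    out = list(samples)
    out[start:start + length] = [0.0] * length
    return out, start, length

# A constant one-second "signal" at 16 kHz, so zeroed frames are easy to count.
samples = [1.0] * 16000
out, start, length = time_dropout_sketch(samples, 16000, 0.5, random.Random(0))
print(f"silenced {length} frames starting at frame {start}")
```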

In [28]:
# noise
noise_generator = lambda: torch.zeros_like(x).uniform_()
y = augment.EffectChain().additive_noise(noise_generator, snr=15).apply(x, src_info={'rate': sr})
ipd.Audio(y, rate=sr)
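
The `snr` argument is a signal-to-noise ratio in decibels. The sketch below shows the textbook way to mix noise at a target SNR by scaling the noise power relative to the signal power. It is an illustration under that assumption only; `mean_power` and `mix_at_snr` are hypothetical helpers, and WavAugment's `additive_noise` may combine signal and noise differently internally.

```python
import math
import random

def mean_power(values):
    """Average squared amplitude of a sequence of samples."""
    return sum(v * v for v in values) / len(values)

def mix_at_snr(signal, noise, snr_db):
    """Scale `noise` so that signal power / noise power equals `snr_db`, then add."""
    target_noise_power = mean_power(signal) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / mean_power(noise))
    mixed = [s + gain * n for s, n in zip(signal, noise)]
    return mixed, gain

rng = random.Random(0)
signal = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noise = [rng.uniform(-1.0, 1.0) for _ in range(16000)]

mixed, gain = mix_at_snr(signal, noise, snr_db=15)
scaled_noise = [gain * n for n in noise]
achieved = 10 * math.log10(mean_power(signal) / mean_power(scaled_noise))
print(f"achieved SNR: {achieved:.2f} dB")
```

Note that the toy noise here is zero-mean, whereas `torch.zeros_like(x).uniform_()` in the cell above draws values from [0, 1).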

In [30]:
# band-reject filter via sox sinc (120 dB attenuation, rejecting roughly 100-500 Hz)
y = augment.EffectChain().sinc('-a', '120', '500-100').apply(x, src_info={'rate': sr})
ipd.Audio(y, rate=sr)

In [32]:
# room scale is drawn at random from 0 to 100 (the upper bound 101 is exclusive)
random_room_size = lambda: np.random.randint(0, 101)
random_reverb = augment.EffectChain().reverb(50, 50, random_room_size).channels(1)

y = random_reverb.apply(x, src_info={'rate': sr})
ipd.Audio(y, rate=sr)

In [35]:
random_pitch_shift = lambda: np.random.randint(-400, +400)
# the pitch shift is drawn as an integer number of cents from [-400, 400)
random_pitch_shift_effect = augment.EffectChain().pitch("-q", random_pitch_shift).rate(sr)
# -q flag enables faster, but lower quality processing

In [37]:
y = random_pitch_shift_effect.apply(x, src_info={'rate': sr})
ipd.Audio(y, rate=sr)
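
Passing a callable instead of a number lets the parameter be chosen when the chain is applied rather than when it is built. The toy class below sketches that pattern in plain Python. `ToyChain` and its `gain` and `offset` effects are hypothetical stand-ins for illustration and are not WavAugment's implementation.

```python
import itertools

class ToyChain:
    """Hypothetical effect chain that resolves callable parameters at apply time."""

    def __init__(self):
        self._steps = []

    def gain(self, factor):
        self._steps.append(("gain", factor))
        return self  # returning self is what allows chaining

    def offset(self, amount):
        self._steps.append(("offset", amount))
        return self

    def apply(self, samples):
        out = list(samples)
        for name, param in self._steps:
            # Callables are evaluated now, so each apply() can see a new value.
            value = param() if callable(param) else param
            if name == "gain":
                out = [s * value for s in out]
            else:
                out = [s + value for s in out]
        return out

# A counter stands in for a random generator so the result is reproducible.
counter = itertools.count(1)
chain = ToyChain().gain(lambda: next(counter)).offset(0.5)

first = chain.apply([1.0, 2.0])   # gain is 1 on the first call
second = chain.apply([1.0, 2.0])  # gain is 2 on the second call
print(first, second)
```

With a random generator in place of the counter, every call to `apply` would produce a differently augmented copy of the same input.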

In [38]:
random_room_size = lambda: np.random.randint(0, 101)
random_reverb = augment.EffectChain().reverb(50, 50, random_room_size).channels(1)

y = random_reverb.apply(x, src_info={'rate': sr})
ipd.Audio(y, rate=sr)

In [39]:
y = random_reverb.apply(x, src_info={'rate': sr})
ipd.Audio(y, rate=sr)