# Example usage of the Python library [AudioAugmentor](https://pypi.org/project/AudioAugmentor/)

Python 3.10 or newer is required.

Note: AudioAugmentor was tested mainly with Python 3.11.8 on Fedora 38.


In [1]:
!python --version

Python 3.11.8
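
To fail early on an unsupported interpreter, the version requirement can also be checked in code. This is a minimal sketch using only the standard library; the helper name `python_is_supported` is illustrative and not part of AudioAugmentor.

```python
import sys

# Minimum version stated at the top of this notebook.
MIN_PYTHON = (3, 10)


def python_is_supported(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= minimum


if not python_is_supported():
    found = sys.version.split()[0]
    print(f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ is required, found {found}")
```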


### Install the AudioAugmentor package from PyPI

In [2]:
!pip install AudioAugmentor

Defaulting to user installation because normal site-packages is not writeable


### You also need to install the `sox`, `libsox-dev` (Ubuntu) or `sox-devel` (Fedora), and `ffmpeg` packages.

In [3]:
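# Installing system packages needs superuser privileges (for example via sudo);
# without them the install commands fail, as in the output below.
# !apt-get install -y sox          # UBUNTU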
# !apt-get install -y libsox-dev   # UBUNTU
# !apt-get install -y ffmpeg       # UBUNTU

!dnf install -y sox             # FEDORA
!dnf install -y sox-devel       # FEDORA
!dnf install -y ffmpeg          # FEDORA

!echo '@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@'
!which sox
!ffmpeg -version

Error: This command has to be run with superuser privileges (under the root user on most systems).
Error: This command has to be run with superuser privileges (under the root user on most systems).
Error: This command has to be run with superuser privileges (under the root user on most systems).
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
/usr/bin/sox
ffmpeg version 6.0.1 Copyright (c) 2000-2023 the FFmpeg developers
built with gcc 13 (GCC)
configuration: --prefix=/usr --bindir=/usr/bin --datadir=/usr/share/ffmpeg --docdir=/usr/share/doc/ffmpeg --incdir=/usr/include/ffmpeg --libdir=/usr/lib64 --mandir=/usr/share/man --arch=x86_64 --optflags='-O2 -flto=auto -ffat-lto-objects -fexceptions -g -grecord-gcc-switches -pipe -Wall -Werror=format-security -Wp,-U_FORTIFY_SOURCE,-D_FORTIFY_SOURCE=3 -Wp,-D_GLIBCXX_ASSERTIONS -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -fstack-protector-strong -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -mtune=generic -fasynchronous-unwind-tables -f
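
The same check can be done from Python instead of shell commands. The sketch below only looks for the `sox` and `ffmpeg` executables on `PATH` with `shutil.which`; it does not verify the development packages (`libsox-dev` / `sox-devel`). The helper name `find_tools` is illustrative.

```python
import shutil

REQUIRED_TOOLS = ("sox", "ffmpeg")


def find_tools(names=REQUIRED_TOOLS):
    """Map each tool name to its executable path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


for name, path in find_tools().items():
    print(f"{name}: {path or 'NOT FOUND'}")
```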

## Import necessary libraries

In [4]:
import torch
import torchaudio
import numpy as np
import audiomentations as AA
from IPython.display import Audio, display

from AudioAugmentor import transf_gen
from AudioAugmentor import sox_parser
from AudioAugmentor import core
from AudioAugmentor import rir_setup
from AudioAugmentor import torchaudio_transf_wrapper as TTW

In [5]:
torch.manual_seed(0)
torch.cuda.manual_seed(0)

In [6]:
# Load the LibriSpeech dataset (download=False assumes it is already present under ../data/LibriSpeechSmall)
dataset = torchaudio.datasets.LIBRISPEECH(root='../data/', url="train-clean-100", download=False, folder_in_archive='LibriSpeechSmall')
sampling_rate = 16000

orig_dataloader = torch.utils.data.DataLoader(
    dataset,
    batch_size=1,
    num_workers=0,
)

## Creating a file with pseudo SoX commands
In the following cell, we create a file of pseudo [SoX](https://www.wikiwand.com/en/SoX) commands. The file is used later to specify which augmentations to apply to the audio data.

The file must contain exactly one SoX command per line.

Each SoX command must use one of these formats:
  * **--sox="norm gain 0 highpass 1000 phaser 0.5 0.6 1 0.45 0.6 -s"**
  
  (Use this form when no codec should be applied after the SoX effects.)

  OR

  * **--sox="norm gain 20 highpass 300 phaser 0.5 0.6 1 0.45 0.6 -s" amr audio_bitrate 4.75k**
  
  (Use this form to apply a codec after the SoX effects. The codec is given as `codec_name` `codec_parameter_name` `codec_parameter_value` directly after the SoX effects command. Codecs that take no parameter, such as `pcm_mulaw` or `gsm` in the cell below, are given by name only.)
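
To make the line format concrete, here is a minimal sketch that splits one pseudo SoX line into its effects string, codec name, and codec parameters. It is not the parser used by `sox_parser`; the function name `split_pseudo_sox_line` and the returned tuple layout are assumptions made for illustration.

```python
import shlex


def split_pseudo_sox_line(line):
    """Split one pseudo SoX line into (effects, codec_name, codec_params)."""
    # shlex removes the double quotes and keeps the quoted effects as one token.
    tokens = shlex.split(line.strip())
    prefix = "--sox="
    if not tokens or not tokens[0].startswith(prefix):
        raise ValueError(f"line must start with {prefix}: {line!r}")

    effects = tokens[0][len(prefix):]
    codec_tokens = tokens[1:]
    codec_name = codec_tokens[0] if codec_tokens else None

    # Everything after the codec name is read as name/value pairs.
    params = codec_tokens[1:]
    if len(params) % 2:
        raise ValueError(f"codec parameters must come in name/value pairs: {line!r}")
    codec_params = dict(zip(params[0::2], params[1::2]))
    return effects, codec_name, codec_params


print(split_pseudo_sox_line('--sox="norm gain 20 highpass 300" amr audio_bitrate 4.75k'))
print(split_pseudo_sox_line('--sox="norm gain 0 highpass 1000"'))
```

Parameter values are kept as strings (for example `'4.75k'`), since the format above does not say how they should be interpreted.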

In [7]:
sox_file_content_to_write = '''--sox="norm gain 0 highpass 1000 phaser 0.5 0.6 1 0.45 0.6 -s"
--sox="norm gain 0 highpass 1000 phaser 0.5 0.6 1 0.45 0.5 -s"
--sox="norm gain 20 highpass 300 phaser 0.5 0.6 1 0.45 0.6 -s"
--sox="norm gain 10 highpass 1500 phaser 0.5 0.6 1 0.45 0.6 -s"
--sox="norm gain 15 highpass 1000 phaser 0.5 0.6 1 0.45 0.6 -s"
--sox="norm gain 20 highpass 300 phaser 0.5 0.6 1 0.4 0.6 -s"
--sox="norm gain 20 highpass 300 phaser 0.5 0.6 1 0.45 0.6 -s"
--sox="norm gain 20 highpass 300 phaser 0.5 0.6 1 0.45 0.6 -s" mp3 bitrate 8
--sox="norm gain 20 highpass 300 phaser 0.5 0.6 1 0.45 0.6 -s" pcm_mulaw
--sox="norm gain 20 highpass 300 phaser 0.5 0.6 1 0.45 0.6 -s" g726 audio_bitrate 40k
--sox="norm gain 20 highpass 300 phaser 0.5 0.6 1 0.45 0.6 -s" gsm
--sox="norm gain 20 highpass 300 phaser 0.5 0.6 1 0.45 0.6 -s" amr audio_bitrate 4.75k
'''

with open('sox_file_example.txt', 'w') as f:
    f.write(sox_file_content_to_write)

The file we have just created must be loaded with the `load_sox_file` function from the `sox_parser` module of AudioAugmentor.

In [8]:
sox_file_content = sox_parser.load_sox_file('sox_file_example.txt')
print('SOX FILE LOADED:', sox_file_content, type(sox_file_content))

SOX FILE LOADED: ['--sox="norm gain 0 highpass 1000 phaser 0.5 0.6 1 0.45 0.6 -s"\n', '--sox="norm gain 0 highpass 1000 phaser 0.5 0.6 1 0.45 0.5 -s"\n', '--sox="norm gain 20 highpass 300 phaser 0.5 0.6 1 0.45 0.6 -s"\n', '--sox="norm gain 10 highpass 1500 phaser 0.5 0.6 1 0.45 0.6 -s"\n', '--sox="norm gain 15 highpass 1000 phaser 0.5 0.6 1 0.45 0.6 -s"\n', '--sox="norm gain 20 highpass 300 phaser 0.5 0.6 1 0.4 0.6 -s"\n', '--sox="norm gain 20 highpass 300 phaser 0.5 0.6 1 0.45 0.6 -s"\n', '--sox="norm gain 20 highpass 300 phaser 0.5 0.6 1 0.45 0.6 -s" mp3 bitrate 8\n', '--sox="norm gain 20 highpass 300 phaser 0.5 0.6 1 0.45 0.6 -s" pcm_mulaw\n', '--sox="norm gain 20 highpass 300 phaser 0.5 0.6 1 0.45 0.6 -s" g726 audio_bitrate 40k\n', '--sox="norm gain 20 highpass 300 phaser 0.5 0.6 1 0.45 0.6 -s" gsm\n', '--sox="norm gain 20 highpass 300 phaser 0.5 0.6 1 0.45 0.6 -s" amr audio_bitrate 4.75k\n'] <class 'list'>
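
Each entry in the printed list still ends with `\n`. If you want to inspect or validate the commands yourself before passing them on, a plain-Python reader that strips newlines and skips blank lines could look like the sketch below. It is independent of `load_sox_file`, and the helper name `read_command_lines` is an assumption.

```python
import os
import tempfile


def read_command_lines(path):
    """Return the non-empty lines of a text file without trailing newlines."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f if line.strip()]


# Demo on a throwaway file so the sketch does not touch sox_file_example.txt.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_sox_file.txt")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write('--sox="norm gain 0 highpass 1000"\n\n--sox="norm gain 20 highpass 300" gsm\n')
    demo_commands = read_command_lines(demo_path)

print(demo_commands)
```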


You can also choose a single SoX command to use for augmenting the data.

In [9]:
# example_sox = '--sox="norm gain 20 highpass 300 phaser 0.5 0.6 1 0.45 0.6 -s" amr audio_bitrate 4.75k'
example_sox = '--sox="norm gain 20 highpass 300 phaser 0.5 0.6 1 0.45 0.6 -s"'

In [10]:
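# This is how you specify parameters for room generation.
# Both variants are shown for illustration; the second assignment overwrites
# the first, so keep only the one you need.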

# 1. Randomized room parameters
rir_kwargs = {
    'audio_sample_rate': sampling_rate,
    'x_range': (0, 100),
    'y_range': (0, 100),
    'num_vertices_range': (3, 6),
    'mic_height': 1.5,
    'source_height': 1.5,
    'walls_mat': 'curtains_cotton_0.5',
    'room_height': 2.0,
    'max_order': 3,
    'floor_mat': 'carpet_cotton',
    'ceiling_mat': 'hard_surface',
    'ray_tracing': True,
    'air_absorption': True,
}
# OR
# 2. Directly specified room parameters
rir_kwargs = {
    'audio_sample_rate': sampling_rate,
    'corners_coord': [[0, 0], [0, 3], [5, 3], [5, 1], [3, 1], [3, 0]],
    'walls_mat': 'curtains_cotton_0.5',
    'room_height': 2.0,
    'max_order': 3,
    'floor_mat': 'carpet_cotton',
    'ceiling_mat': 'hard_surface',
    'ray_tracing': True,
    'air_absorption': True,
    'source_coord': [[1.0], [1.0], [0.5]],
    'microphones_coord': [[3.5], [2.0], [0.5]],
}
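
Before generating a room from directly specified parameters, it can help to sanity-check the geometry. The sketch below computes the floor area of the polygon from `corners_coord` and checks that the source and microphone lie inside the room. It assumes the coordinates are given as `[[x], [y], [z]]` in the same units as the corners; the helper names are illustrative and not part of `rir_setup`.

```python
def polygon_area(corners):
    """Area of a simple polygon via the shoelace formula."""
    total = 0.0
    for i, (x1, y1) in enumerate(corners):
        x2, y2 = corners[(i + 1) % len(corners)]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def point_in_polygon(x, y, corners):
    """Ray-casting test: count edge crossings of a ray going in +x direction."""
    inside = False
    for i, (x1, y1) in enumerate(corners):
        x2, y2 = corners[(i + 1) % len(corners)]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def position_is_valid(coord, corners, room_height):
    """coord is [[x], [y], [z]], as in the rir_kwargs example above."""
    (x,), (y,), (z,) = coord
    return point_in_polygon(x, y, corners) and 0.0 <= z <= room_height


corners = [[0, 0], [0, 3], [5, 3], [5, 1], [3, 1], [3, 0]]
print("floor area:", polygon_area(corners))
print("source ok:", position_is_valid([[1.0], [1.0], [0.5]], corners, 2.0))
print("microphone ok:", position_is_valid([[3.5], [2.0], [0.5]], corners, 2.0))
```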

In [11]:
# Definition of the augmentations
transformations = transf_gen.transf_gen(verbose=True,
                                        # ApplyRIR=rir_kwargs,
                                        # PitchShift={'sample_rate': sampling_rate, 'n_steps': [1, 1.5, 0.1], 'p': 1.0},
                                        # Speed={'orig_freq': sampling_rate, 'factor': [0.9, 1.5, 0.1], 'p': 1},
                                        # Vol={'gain': [2.5, 3, 0.1], 'p': 1.0},
                                        # AddColoredNoise=f'min_snr_in_db=9, max_snr_in_db=10, p=1, sample_rate={sampling_rate}',
                                        AddBackgroundNoise=f'background_paths="../data/musan/musan", min_snr_in_db=10, max_snr_in_db=20, p=1, sample_rate={sampling_rate}',
                                        # BandPassFilter=f'''min_center_frequency=200, 
                                        # max_center_frequency=4000,
                                        #   min_bandwidth_fraction=0.5,
                                        #     max_bandwidth_fraction=1.99,
                                        #       sample_rate={sampling_rate},
                                        #         p=1
                                        #         ''',
                                        # BandStopFilter=f'min_center_frequency=200, max_center_frequency=4000, min_bandwidth_fraction=0.5, max_bandwidth_fraction=1.99, sample_rate={sampling_rate}, p=1',
                                        # HighPassFilter=f'min_cutoff_freq=700, max_cutoff_freq=800, sample_rate={sampling_rate}, p=1',
                                        ApplyImpulseResponse=f'ir_paths="../data/RIRS/RealRIRs", p=1, sample_rate={sampling_rate}',
                                        LowPassFilter={'min_cutoff_freq': 700, 'max_cutoff_freq': 800, 'sample_rate': sampling_rate, 'p': 1},
                                        # PeakNormalization={'p': 1, 'sample_rate': sampling_rate},
                                        # PolarityInversion={'p': 1, 'sample_rate': sampling_rate},
                                        # Shift={'min_shift': 1, 'max_shift': 2, 'p': 1, 'sample_rate': sampling_rate},
                                        # TimeInversion={'p': 1, 'sample_rate': sampling_rate},
                                        # AddGaussianNoise={'min_amplitude': 0.001, 'max_amplitude': 0.015, 'p': 1},
                                        # AddShortNoises={'sounds_path': "../data/musan/musan",
                                        #                 'min_snr_in_db': 3.0,
                                        #                 'max_snr_in_db': 30.0,
                                        #                 'noise_rms': "relative_to_whole_input",
                                        #                 'min_time_between_sounds': 2.0,
                                        #                 'max_time_between_sounds': 8.0,
                                        #                 'noise_transform': AA.PolarityInversion(),
                                        #                 'p': 1.0},
                                        # AdjustDuration={'duration_seconds': 3.5, 'padding_mode': 'silence', 'p': 1},
                                        AirAbsorption={'min_distance': 10.0, 'max_distance': 50.0, 'min_humidity': 80.0, 'max_humidity': 90.0, 'min_temperature': 10.0, 'max_temperature': 20.0, 'p': 1.0},
                                        # ClippingDistortion={'min_percentile_threshold': 10, 'max_percentile_threshold': 30, 'p': 1},
                                        # Gain={'min_gain_db': -5, 'max_gain_db': 20, 'p': 1},
                                        # GainTransition={'min_gain_db': 30, 'max_gain_db': 40, 'min_duration': 5, 'max_duration': 16, 'duration_unit': 'seconds', 'p': 1},
                                        # HighShelfFilter={'min_center_freq': 2000, 'max_center_freq': 5000, 'min_gain_db': 10.0, 'max_gain_db': 16.0, 'min_q': 0.5, 'max_q': 1.0, 'p': 1},
                                        # Limiter='min_threshold_db= -24, max_threshold_db = -2, min_attack = 0.0005,max_attack = 0.025, min_release= 0.05, max_release = 0.7, threshold_mode = "relative_to_signal_peak", p=1',
                                        # LoudnessNormalization={'min_lufs': -31, 'max_lufs': -13, 'p': 1},
                                        # LowShelfFilter={'min_center_freq': 20, 'max_center_freq': 600, 'min_gain_db': -16.0, 'max_gain_db': 16.0, 'min_q': 0.5, 'max_q': 1.0, 'p': 1},
                                        Mp3Compression={'min_bitrate': 8, 'max_bitrate': 8,'backend': 'pydub', 'p': 1},
                                        # Normalize={'p': 1},
                                        # Padding={'mode': 'silence', 'min_fraction': 0.02, 'max_fraction': 0.8, 'pad_section': 'start', 'p': 1},
                                        # PeakingFilter={'min_center_freq': 51, 'max_center_freq': 7400, 'min_gain_db': -22, 'max_gain_db': 22, 'min_q': 0.5, 'max_q': 1.0, 'p': 1},
                                        # SevenBandParametricEQ={'min_gain_db': -10, 'max_gain_db': 10, 'p': 1},
                                        # TimeStretch='min_rate=0.5, max_rate=0.6, p=0.2, leave_length_unchanged=False',
                                        # TanhDistortion={'min_distortion': 0.1, 'max_distortion': 0.8, 'p': 1},
                                        # g726={'audio_bitrate': '40k'},
                                        # gsm=True,
                                        # amr={'audio_bitrate': '4.75k'},
                                        
                                        # MelSpectrogram={'sample_rate': 16000},                              
                                        # TimeMasking={'time_mask_param': 80},
                                        # FrequencyMasking={'freq_mask_param': 80},
                                        )


ADDED: AddBackgroundNoise, 
		{'background_paths': '../data/musan/musan', 'min_snr_in_db': 10, 'max_snr_in_db': 20, 'p': 1, 'sample_rate': 16000}

ADDED: ApplyImpulseResponse, 
		{'ir_paths': '../data/RIRS/RealRIRs', 'p': 1, 'sample_rate': 16000}

ADDED: LowPassFilter, 
		{'min_cutoff_freq': 700, 'max_cutoff_freq': 800, 'sample_rate': 16000, 'p': 1}

ADDED: AirAbsorption, 
		{'min_distance': 10.0, 'max_distance': 50.0, 'min_humidity': 80.0, 'max_humidity': 90.0, 'min_temperature': 10.0, 'max_temperature': 20.0, 'p': 1.0}

ADDED: Mp3Compression, 
		{'min_bitrate': 8, 'max_bitrate': 8, 'backend': 'pydub', 'p': 1}



FINAL TRANSFORMATIONS LIST:
AddBackgroundNoise()
ApplyImpulseResponse()
LowPassFilter()
<audiomentations.augmentations.air_absorption.AirAbsorption object at 0x7ff77246c210>
<audiomentations.augmentations.mp3_compression.Mp3Compression object at 0x7ff772464050>
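
The cell above passes parameters either as a dict or as a `name=value, name=value` string, and the verbose output shows both ending up as dicts. How `transf_gen` converts the string form is not shown here; the sketch below is one safe way to do such a conversion with the standard `ast` module (no `eval`). The function name `parse_kwargs_string` is an assumption.

```python
import ast


def parse_kwargs_string(text):
    """Parse 'name=value, name=value' into a dict of Python literals."""
    # Wrap the text in a dummy call so the parser sees keyword arguments.
    call = ast.parse(f"_({text})", mode="eval").body
    if not isinstance(call, ast.Call) or call.args:
        raise ValueError("expected only name=value pairs")

    result = {}
    for keyword in call.keywords:
        if keyword.arg is None:
            raise ValueError("'**' unpacking is not supported")
        # literal_eval accepts only literals, so no code is executed.
        result[keyword.arg] = ast.literal_eval(keyword.value)
    return result


print(parse_kwargs_string('ir_paths="../data/RIRS/RealRIRs", p=1, sample_rate=16000'))
```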


## Using the core.Collator class together with PyTorch's DataLoader class

In [12]:
collate_fn = core.Collator(
    transformations=transformations, device='cpu', sox_effects=None, sample_rate=sampling_rate, verbose=True,
    #transformations=None, device='cpu', sox_effects=example_sox, sample_rate=sampling_rate, verbose=False,
    #transformations=None, device='cpu', sox_effects=sox_file_content, sample_rate=sampling_rate, verbose=False,
)

In [13]:
aug_dataloader = torch.utils.data.DataLoader(
    dataset,
    collate_fn=collate_fn,
)
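
With automatic batching, `DataLoader` calls `collate_fn` with a list of dataset items and returns whatever the callable produces. The sketch below shows that callable pattern in plain Python, with lists standing in for tensors. It is not the `core.Collator` implementation; `SketchCollator`, `halve`, and `invert` are hypothetical names, and each item is assumed to have the form `(waveform, *metadata)`.

```python
class SketchCollator:
    """Callable that applies a chain of transforms to every sample in a batch."""

    def __init__(self, transformations):
        self.transformations = list(transformations)

    def __call__(self, batch):
        # batch is a list of dataset items, each assumed to be (waveform, *metadata).
        waveforms, metadata = [], []
        for waveform, *meta in batch:
            for transform in self.transformations:
                waveform = transform(waveform)
            waveforms.append(waveform)
            metadata.append(tuple(meta))
        return waveforms, metadata


def halve(waveform):
    """Toy transform: scale every sample by 0.5."""
    return [0.5 * s for s in waveform]


def invert(waveform):
    """Toy transform: flip the polarity of every sample."""
    return [-s for s in waveform]


collate = SketchCollator([halve, invert])
batch = [([1.0, -2.0], 16000, "HELLO"), ([4.0], 16000, "WORLD")]
print(collate(batch))
```

Keeping the transform chain inside a callable object means the same configuration travels with the `DataLoader` and is applied on the fly to each batch.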

In [14]:
import matplotlib.pyplot as plt

# If SpecAugment was used, you can see the spectrogram with the augmentation applied
check_transformations = (
    torchaudio.transforms.MelSpectrogram,
    torchaudio.transforms.Spectrogram,
    torchaudio.transforms.TimeMasking,
    torchaudio.transforms.FrequencyMasking,
)
if any(isinstance(t, check_transformations) for t in transformations):
    a = next(iter(aug_dataloader))
    plt.figure()
    plt.imshow((a[0][0] + 1e-9).log2()[0, :, :].detach().cpu().numpy(), cmap='viridis')
    plt.show()
else:
    NUM_EXAMPLES_TO_SHOW = 3

    print('AUGMENTED\n')
    for i, (data, *rest) in enumerate(aug_dataloader):
        display(Audio(data[0].squeeze(0).cpu(), rate=sampling_rate))
        if i == NUM_EXAMPLES_TO_SHOW - 1:
            break

    print('ORIGINALS\n')
    for j, (orig_data, *orig_rest) in enumerate(orig_dataloader):
        display(Audio(orig_data[0].squeeze(0).cpu(), rate=sampling_rate))
        if j == NUM_EXAMPLES_TO_SHOW - 1:
            break

AUGMENTED

CURRENT TRANSFORM: AddBackgroundNoise()
CURRENT TRANSFORM: ApplyImpulseResponse()
CURRENT TRANSFORM: LowPassFilter()
CURRENT TRANSFORM: <audiomentations.augmentations.air_absorption.AirAbsorption object at 0x7ff77246c210>
CURRENT TRANSFORM: <audiomentations.augmentations.mp3_compression.Mp3Compression object at 0x7ff772464050>


CURRENT TRANSFORM: AddBackgroundNoise()
CURRENT TRANSFORM: ApplyImpulseResponse()
CURRENT TRANSFORM: LowPassFilter()
CURRENT TRANSFORM: <audiomentations.augmentations.air_absorption.AirAbsorption object at 0x7ff77246c210>
CURRENT TRANSFORM: <audiomentations.augmentations.mp3_compression.Mp3Compression object at 0x7ff772464050>


CURRENT TRANSFORM: AddBackgroundNoise()
CURRENT TRANSFORM: ApplyImpulseResponse()
CURRENT TRANSFORM: LowPassFilter()
CURRENT TRANSFORM: <audiomentations.augmentations.air_absorption.AirAbsorption object at 0x7ff77246c210>
CURRENT TRANSFORM: <audiomentations.augmentations.mp3_compression.Mp3Compression object at 0x7ff772464050>


ORIGINALS



## Using the core.AugmentWaveform class to augment a single waveform

In [15]:
wave_path = '../data/LibriSpeechSmall/train-clean-100/103/1240/103-1240-0000.flac'
signal, fs = torchaudio.load(wave_path)
augment = core.AugmentWaveform(
    transformations=transformations, device='cpu', sox_effects=None, sample_rate=sampling_rate, verbose=True,
    #transformations=None, device='cpu', sox_effects=example_sox, sample_rate=sampling_rate, verbose=True,
    #transformations=None, device='cpu', sox_effects=sox_file_content, sample_rate=sampling_rate, verbose=True,
)

waveform = augment(signal.cpu().numpy()[0])
display(Audio(waveform, rate=fs))

CURRENT TRANSFORM: AddBackgroundNoise()
CURRENT TRANSFORM: ApplyImpulseResponse()
CURRENT TRANSFORM: LowPassFilter()
CURRENT TRANSFORM: <audiomentations.augmentations.air_absorption.AirAbsorption object at 0x7ff77246c210>
CURRENT TRANSFORM: <audiomentations.augmentations.mp3_compression.Mp3Compression object at 0x7ff772464050>
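
The chain applied above starts with `AddBackgroundNoise`, configured with `min_snr_in_db=10` and `max_snr_in_db=20`. To build intuition for what such values mean, the sketch below scales a noise signal so that the RMS-based signal-to-noise ratio hits a target in dB. It uses the textbook definition and pure-Python lists; it is not necessarily how the library computes SNR, and the sine "noise" is only a stand-in.

```python
import math


def rms(samples):
    """Root-mean-square level of a sequence of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def snr_db(signal, noise):
    """Signal-to-noise ratio in dB, based on RMS levels."""
    return 20.0 * math.log10(rms(signal) / rms(noise))


def mix_at_snr(signal, noise, target_snr_db):
    """Scale noise to the target SNR and add it to the signal."""
    gain = rms(signal) / (rms(noise) * 10.0 ** (target_snr_db / 20.0))
    scaled_noise = [gain * n for n in noise]
    mixed = [s + n for s, n in zip(signal, scaled_noise)]
    return mixed, scaled_noise


rate = 16000
signal = [math.sin(2 * math.pi * 440 * t / rate) for t in range(rate)]
noise = [math.sin(2 * math.pi * 1234.5 * t / rate + 1.0) for t in range(rate)]
mixed, scaled_noise = mix_at_snr(signal, noise, 10.0)
print(round(snr_db(signal, scaled_noise), 3))
```

Each 20 dB of SNR corresponds to a factor of 10 in RMS amplitude, so at 10 dB the noise RMS is roughly 0.32 times the signal RMS.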


## Using the core.AugmentLocalAudioDataset class to augment local audio files

In [16]:
augment = core.AugmentLocalAudioDataset(
    #transformations=transformations, device='cpu', sox_effects=None, sample_rate=sampling_rate, verbose=True,
    #transformations=None, device='cpu', sox_effects=example_sox, sample_rate=sampling_rate, verbose=True,
    transformations=None, device='cpu', sox_effects=sox_file_content, sample_rate=sampling_rate, verbose=True,
)
augment(input_dir='../data/test-input-folder', output_dir='../data/test-output-folder')

Processed file: 103-1240-0000.flac
Processed file: 103-1240-0001.flac
Processed file: 103-1240-0002.flac
Processed file: 103-1240-0003.flac
Processed file: 103-1240-0004.flac
Processed file: 103-1240-0005.flac
Processed file: 103-1240-0006.flac
Processed file: 103-1240-0007.flac
Processed file: 103-1240-0008.flac
Processed file: 103-1240-0009.flac
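
The directory-to-directory pattern used above can be sketched with the standard library: walk the input folder, run a per-file function, and write each result under the same file name in the output folder. This is not the `AugmentLocalAudioDataset` implementation; `process_directory`, the `process_file(src, dst)` callable, and the set of file extensions are assumptions, and `shutil.copyfile` stands in for the actual augmentation.

```python
import shutil
import tempfile
from pathlib import Path

AUDIO_SUFFIXES = {".flac", ".wav", ".mp3"}  # assumed set of audio extensions


def process_directory(input_dir, output_dir, process_file):
    """Apply process_file(src, dst) to every audio file in input_dir."""
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    processed = []
    for src in sorted(input_dir.iterdir()):
        if src.is_file() and src.suffix.lower() in AUDIO_SUFFIXES:
            process_file(src, output_dir / src.name)
            processed.append(src.name)
    return processed


# Demo with placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    in_dir = Path(tmp) / "in"
    in_dir.mkdir()
    for name in ("b.flac", "a.flac", "notes.txt"):
        (in_dir / name).write_bytes(b"placeholder")
    processed_names = process_directory(in_dir, Path(tmp) / "out", shutil.copyfile)
    output_names = sorted(p.name for p in (Path(tmp) / "out").iterdir())

print(processed_names)
```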
