# Audio spectrogram in rocAL 

This example presents a simple rocAL pipeline that loads and decodes audio data and computes a spectrogram. We use MIVisionX-data, which contains audio samples in WAV format. The sections below show how to create a pipeline, set its outputs, build it, run it, and iterate over the results.

## Reference implementation

To verify the correctness of rocAL's implementation, we will compare it against librosa.  

In [None]:
import matplotlib.pyplot as plt
import os
%matplotlib inline

import librosa
import librosa.display

import numpy as np
import torch
torch.set_printoptions(threshold=10_000)

import amd.rocal.types as types
import amd.rocal.fn as fn
from amd.rocal.pipeline import Pipeline, pipeline_def
from amd.rocal.plugin.pytorch import ROCALAudioIterator

In [None]:
def show_spectrogram(spec, title, sr, hop_length, y_axis='log', x_axis='time'):
    # Use the sampling rate passed by the caller instead of a hard-coded value
    librosa.display.specshow(
        spec, sr=sr, y_axis=y_axis, x_axis=x_axis, hop_length=hop_length)
    plt.title(title)
    plt.colorbar(format='%+2.0f dB')
    plt.tight_layout()
    plt.show()

## Librosa implementation
Librosa provides an API to calculate the STFT, which produces a complex-valued output. The power spectrum is then obtained by squaring the magnitude of the complex STFT.
Here we load and decode the audio file and compute its spectrogram with librosa.
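
To make the relationship between the complex STFT and the power spectrum concrete, here is a minimal NumPy-only sketch. It assumes a centered STFT with reflect padding and a periodic Hann window, which is how the librosa call in this notebook is configured; the function name `power_spectrogram` and the 1 kHz test tone are illustrative and not part of librosa or rocAL.

```python
import numpy as np


def power_spectrogram(y, n_fft=2048, hop_length=512):
    """Power spectrogram via a centered, Hann-windowed STFT (illustrative sketch)."""
    # Reflect-pad so that frame t is centered on sample t * hop_length
    padded = np.pad(y, n_fft // 2, mode="reflect")
    # Periodic Hann window of length n_fft
    window = 0.5 - 0.5 * np.cos(2.0 * np.pi * np.arange(n_fft) / n_fft)
    n_frames = 1 + (len(padded) - n_fft) // hop_length
    frames = np.stack([
        padded[t * hop_length: t * hop_length + n_fft] * window
        for t in range(n_frames)
    ])
    # One-sided FFT of each frame: complex STFT of shape (n_frames, 1 + n_fft // 2)
    stft = np.fft.rfft(frames, axis=1)
    # Power = squared magnitude = real**2 + imag**2; transpose to (bins, frames)
    return (stft.real ** 2 + stft.imag ** 2).T


sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2.0 * np.pi * 1000.0 * t)  # 1 second, 1 kHz test tone
spec = power_spectrogram(tone)
print(spec.shape)
```

For the one-second tone this yields 1 + n_fft // 2 = 1025 frequency bins, and the energy concentrates in bin 1000 * n_fft / sr = 128.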


In [None]:

# Set the ROCAL_DATA_PATH environment variable before running the notebook
rocal_audio_data_path = os.path.join(os.environ['ROCAL_DATA_PATH'], "rocal_data", "audio")
data_path = f"{rocal_audio_data_path}/wav/19-198-0000.wav"

y, sr = librosa.load(data_path, sr=16000)

# Size of the FFT, which will also be used as the window length
n_fft = 2048

# Step or stride between windows. If the step is smaller than the window length, the windows will overlap
hop_length = 512

# Calculate the spectrogram as the square of the complex magnitude of the STFT
spectrogram_librosa = np.abs(librosa.stft(
    y, n_fft=n_fft, hop_length=hop_length, win_length=n_fft, window='hann', pad_mode='reflect')) ** 2

# We can now transform the spectrogram output to a logarithmic scale by converting power to decibels.
spectrogram_librosa_db = librosa.power_to_db(spectrogram_librosa, ref=np.max)

# The last step is to display the spectrogram
show_spectrogram(spectrogram_librosa_db,
                 'Reference power spectrogram', sr, hop_length)
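
The decibel conversion used above can be sketched in plain NumPy. This is an approximation of what `librosa.power_to_db(..., ref=np.max)` computes, assuming the documented defaults `amin=1e-10` and `top_db=80.0`; the function name `power_to_db_sketch` is hypothetical.

```python
import numpy as np


def power_to_db_sketch(power, amin=1e-10, top_db=80.0):
    """Convert a power spectrogram to dB relative to its maximum value."""
    power = np.asarray(power, dtype=np.float64)
    ref = power.max()
    # 10 * log10(power / ref), with a floor of amin to avoid log(0)
    log_spec = 10.0 * np.log10(np.maximum(amin, power))
    log_spec -= 10.0 * np.log10(np.maximum(amin, ref))
    # Limit the dynamic range to top_db below the peak
    return np.maximum(log_spec, log_spec.max() - top_db)


example = np.array([[1.0, 0.1], [0.01, 0.0]])
example_db = power_to_db_sketch(example)
print(example_db)
```

Because the reference is the maximum, the loudest bin maps to 0 dB and everything else is negative, which is why the colorbar in the plot runs from 0 dB downwards.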

## Configuring rocAL pipeline
Configure the pipeline parameters as needed.
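
Before choosing values, it can help to sanity-check the STFT settings. The helper below is a hypothetical sketch of generic STFT constraints, not a check that rocAL itself performs: it assumes the window must fit inside the FFT and that a one-sided spectrum with nfft // 2 + 1 bins is produced for real-valued audio.

```python
def check_stft_params(nfft, window_length, window_step):
    """Hypothetical helper: sanity-check STFT settings before building a pipeline."""
    if window_length > nfft:
        raise ValueError("window_length must not exceed nfft")
    if window_step <= 0:
        raise ValueError("window_step must be positive")
    # Fraction of each window shared with the next one
    overlap = max(0.0, 1.0 - window_step / window_length)
    # Number of one-sided frequency bins produced per frame
    n_bins = nfft // 2 + 1
    return overlap, n_bins


overlap, n_bins = check_stft_params(nfft=2048, window_length=2048, window_step=512)
print(f"overlap={overlap:.0%}, bins={n_bins}")
```

With the values used in this notebook, consecutive windows overlap by 75% and each frame has 1025 bins.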

In [None]:
file_list = f"{rocal_audio_data_path}/wav_file_list.txt"
seed = 1000
nfft = 2048
window_length = 2048
window_step = 512
num_shards = 1
rocal_cpu = True

audio_pipeline = Pipeline(
    batch_size=1, num_threads=8, rocal_cpu=rocal_cpu)

## Audio pipeline 
Here we use the file reader followed by the audio decoder. The decoded audio data is then passed to the spectrogram operator. We enable the spectrogram output using set_outputs.

In [None]:
with audio_pipeline:
    audio, labels = fn.readers.file(file_root=rocal_audio_data_path, file_list=file_list)
    decoded_audio = fn.decoders.audio(
        audio,
        file_root=rocal_audio_data_path,
        file_list_path=file_list,
        downmix=False,
        shard_id=0,
        num_shards=num_shards,
        stick_to_shard=False)
    spec = fn.spectrogram(
        decoded_audio,
        nfft=nfft,
        window_length=window_length,
        window_step=window_step,
        output_dtype=types.FLOAT)
    audio_pipeline.set_outputs(spec)

## Building the Pipeline
To use the pipeline, we first need to build it by calling its build function. An iterator object is then created with ROCALAudioIterator(audio_pipeline).

In [None]:
audio_pipeline.build()
audioIteratorPipeline = ROCALAudioIterator(audio_pipeline)

In [None]:
for i, output_list in enumerate(audioIteratorPipeline):
    for x in range(len(output_list[0])):
        for audio_tensor, label, roi in zip(output_list[0][x], output_list[1], output_list[2]):
            print("Audio shape", audio_tensor.shape)
            print("Label", label)
            print("Roi", roi)
audioIteratorPipeline.reset()

## Visualizing outputs

We plot the rocAL spectrogram output to compare it visually with the librosa output.

In [None]:
for i, it in enumerate(audioIteratorPipeline):
    output = it[0]
    # Augmentation outputs are stored in list[(batch_size, output_shape)] so we index to get each output
    spec_output = output[0][0].numpy()
    roi = it[2][0].numpy()
    # We slice the padded output using the ROI dimensions
    spec_roi_output = spec_output[:roi[0], :roi[1]]
    spectrogram_db = librosa.power_to_db(spec_roi_output, ref=np.max)
    show_spectrogram(spectrogram_db, 'rocAL spectrogram', sr, hop_length)
audioIteratorPipeline.reset()
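
The ROI slicing in the cell above can be illustrated with a small NumPy sketch. It assumes, as the slicing in this notebook suggests, that each output is padded at the end of each axis to a common shape and that the ROI holds the valid (rows, cols) extent; `crop_to_roi` and the sample shapes are hypothetical.

```python
import numpy as np


def crop_to_roi(padded, roi):
    """Return the valid region of a padded 2-D output given its (rows, cols) ROI."""
    rows, cols = int(roi[0]), int(roi[1])
    return padded[:rows, :cols]


# Two hypothetical samples of different lengths, padded to a common shape
shapes = [(5, 3), (5, 4)]
max_rows = max(s[0] for s in shapes)
max_cols = max(s[1] for s in shapes)
batch = np.zeros((len(shapes), max_rows, max_cols), dtype=np.float32)
for i, (r, c) in enumerate(shapes):
    batch[i, :r, :c] = 1.0  # valid data; the remainder stays as zero padding

cropped = [crop_to_roi(batch[i], shapes[i]) for i in range(len(shapes))]
print([c.shape for c in cropped])
```

Cropping before the decibel conversion matters because padded zeros would otherwise appear as extra frames at the dynamic-range floor and break the shape match with the reference.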

As a last check, we verify that the numerical difference between the reference implementation and rocAL's is insignificant:
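
A comparison of this kind can be wrapped in a small helper that reports more than a single pass/fail result. The sketch below is hypothetical (the name `compare_db` and the sample arrays are illustrative); it uses `np.allclose`, which also applies its default relative tolerance on top of `atol`.

```python
import numpy as np


def compare_db(test_db, ref_db, atol=2.0):
    """Hypothetical helper: summarize the difference between two dB spectrograms."""
    if test_db.shape != ref_db.shape:
        raise ValueError(f"shape mismatch: {test_db.shape} vs {ref_db.shape}")
    diff = np.abs(test_db - ref_db)
    return {
        "mean_abs_error_db": float(diff.mean()),
        "max_abs_error_db": float(diff.max()),
        # np.allclose checks |test - ref| <= atol + rtol * |ref| element-wise
        "within_tolerance": bool(np.allclose(test_db, ref_db, atol=atol)),
    }


ref = np.array([[0.0, -10.0], [-20.0, -80.0]])
noisy = ref + np.array([[0.5, -0.5], [1.0, 0.0]])
report = compare_db(noisy, ref)
print(report)
```

Reporting the maximum error alongside the mean makes it easier to see whether a failure comes from a few outlier bins or from a systematic offset.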

In [None]:
output, label, roi_tensor = next(audioIteratorPipeline)
# Augmentation outputs are stored in list[(batch_size, output_shape)] so we index to get each output
spec_output = output[0][0].numpy()
roi = roi_tensor[0].numpy()
# We slice the padded output using the ROI dimensions
spec_roi_output = spec_output[:roi[0], :roi[1]]
spectrogram_db = librosa.power_to_db(spec_roi_output, ref=np.max)
print("Average error: {0:.5f} dB".format(
    np.mean(np.abs(spectrogram_db - spectrogram_librosa_db))))
assert np.allclose(spectrogram_db, spectrogram_librosa_db, atol=2)