
# auditus

> Simple Audio Embeddings

`auditus` gives you simple access to state-of-the-art audio embeddings, like [SentenceTransformers](https://sbert.net/) for audio.

```sh
$ pip install auditus
```


[repo]: https://github.com/CarloLepelaars/auditus
[docs]: https://CarloLepelaars.github.io/auditus/
[pypi]: https://pypi.org/project/auditus/

## Quickstart

The highest-level object in `auditus` is `AudioPipeline`, which takes the path to an audio file and returns a pooled embedding.

In [None]:
from auditus.transform import AudioPipeline

pipe = AudioPipeline(
    model_name="MIT/ast-finetuned-audioset-10-10-0.4593", # Default AST model
    return_tensors="pt", # PyTorch output
    target_sr=16000, # Resample to 16 kHz
    num_mel_bins=64, # Number of mel bins; gives an embedding of size 64
    pooling="max", # Max pooling
)

output = pipe("../test_files/XC119042.ogg").squeeze(0)
print(output.shape)
output[:5]

torch.Size([64])


tensor([ 0.3470,  0.2991,  0.1366, -0.0023, -0.1394])
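
Pooled embeddings like this one are commonly compared with cosine similarity, much as sentence embeddings are. A minimal NumPy sketch of that comparison follows. The vectors are random stand-ins rather than real `auditus` output, and `cosine_similarity` is a hypothetical helper, not part of the library.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two 1-D vectors (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
clip_a = rng.standard_normal(64)                   # stand-in for one pooled embedding
clip_b = clip_a + 0.1 * rng.standard_normal(64)    # a slightly perturbed copy of clip_a
clip_c = rng.standard_normal(64)                   # an unrelated vector

# The perturbed copy should score higher than the unrelated vector.
print(cosine_similarity(clip_a, clip_b) > cosine_similarity(clip_a, clip_c))
```

With real data, `clip_a` and `clip_b` would be the 64-dimensional tensors returned by the pipeline for two audio files, converted to NumPy arrays.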

## Individual steps

`auditus` offers a range of transforms to process audio for downstream tasks.
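
As a rough orientation, the four transforms described in the sections below can be chained by hand. This sketch only reuses the calls and parameter values shown in this README. The helper name `embed_file` is hypothetical, and whether `AudioPipeline` chains the steps in exactly this way internally is an assumption.

```python
def embed_file(path, load_sr=32000, target_sr=16000, num_mel_bins=64, pooling="max"):
    """Hypothetical helper: load -> resample -> embed -> pool."""
    # Imported lazily so the function can be defined without auditus installed.
    from auditus.transform import AudioEmbedding, AudioLoader, Pooling, Resampling

    audio = AudioLoader(sr=load_sr)(path)                # AudioArray at load_sr
    resampled = Resampling(target_sr=target_sr)(audio)   # AudioArray at target_sr
    emb = AudioEmbedding(
        return_tensors="pt",
        num_mel_bins=num_mel_bins,
        sampling_rate=target_sr,
    )(resampled)                                         # frame-level tensor
    return Pooling(pooling=pooling)(emb)                 # one pooled vector per clip
```

With the defaults above, `embed_file("../test_files/XC119042.ogg")` should return a tensor of shape `[1, 64]`, the same shape the `Pooling` step reports later in this README.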

### Loading

`AudioLoader` loads an audio file at a given sampling rate and returns an `AudioArray`.

In [None]:
from auditus.transform import AudioLoader

audio = AudioLoader(sr=32000)("../test_files/XC119042.ogg")
audio

auditus.core.AudioArray(a=array([-2.64216160e-05, -2.54259703e-05,  5.56615578e-06, ...,
       -2.03555092e-01, -2.03390077e-01, -2.45199591e-01]), sr=32000)

The `AudioArray` object offers a convenient interface for inspecting the audio data. For example, `audio.audio()` lets you listen to the audio in a Jupyter Notebook.

In [None]:
audio.a[:5], audio.sr, len(audio)

(array([-2.64216160e-05, -2.54259703e-05,  5.56615578e-06, -5.17481631e-08,
        -1.35020821e-06]),
 32000,
 632790)

### Resampling

Many audio Transformer models only work at a specific sampling rate. With `Resampling` you can resample the audio to the rate a model expects. Here we go from 32 kHz to 16 kHz.

In [None]:
from auditus.transform import Resampling

resampled = Resampling(target_sr=16000)(audio)
resampled

auditus.core.AudioArray(a=array([-2.64216160e-05,  5.56613802e-06, -1.35020873e-06, ...,
       -2.39605007e-01, -2.03555112e-01, -2.45199591e-01]), sr=16000)
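
Resampling keeps the clip duration and changes the number of samples. The quick check below uses the 632,790 samples reported in the Loading section. The expected 16 kHz length is plain arithmetic, not an output of `auditus`, and the length a resampler actually returns may differ by a sample depending on rounding.

```python
n_samples = 632790   # len(audio) at 32 kHz, from the Loading section
orig_sr = 32000
target_sr = 16000

duration_s = n_samples / orig_sr                # clip length in seconds
expected_len = round(duration_s * target_sr)    # sample count after resampling

print(f"{duration_s:.2f} s, about {expected_len} samples at {target_sr} Hz")
# prints "19.77 s, about 316395 samples at 16000 Hz"
```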

### Embedding

The main transform in `auditus` is the `AudioEmbedding` transform. It takes an `AudioArray` and returns a tensor. Check out the [HuggingFace docs](https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer#transformers.ASTFeatureExtractor) for more information on the available parameters.

In [None]:
from auditus.transform import AudioEmbedding

emb = AudioEmbedding(return_tensors="pt", num_mel_bins=64, sampling_rate=16000)(resampled)
print(emb.shape)
emb[0][0][:5]


torch.Size([1, 1024, 64])


tensor([-0.8148, -0.9460, -0.9955, -0.9856, -1.0303])

### Pooling

After generating the embeddings, you often want to pool them into a single vector. `Pooling` supports `mean` and `max` pooling.

In [None]:
from auditus.transform import Pooling

pooled = Pooling(pooling="max")(emb)
print(pooled.shape)
pooled[0][:5]

torch.Size([1, 64])


tensor([ 0.3470,  0.2991,  0.1366, -0.0023, -0.1394])
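
For intuition, pooling collapses the frame axis of the `[1, 1024, 64]` tensor into a single 64-dimensional vector per clip. The NumPy sketch below illustrates the idea on random stand-in data. Reducing over axis 1 is an assumption inferred from the shapes above, not taken from the `auditus` source.

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.standard_normal((1, 1024, 64))   # stand-in for (batch, frames, features)

max_pooled = emb.max(axis=1)     # largest value per feature across all frames
mean_pooled = emb.mean(axis=1)   # average value per feature across all frames

print(max_pooled.shape, mean_pooled.shape)  # (1, 64) (1, 64)
```

Max pooling keeps the strongest activation of each feature anywhere in the clip, while mean pooling summarizes the clip as a whole, so the better choice depends on the downstream task.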