# Audio Machine Learning - Workshop Week 8
Pre-Trained Audio Feature Extraction Models

This worksheet demonstrates feature extraction using a pre-trained VGGish audio classification model.

This is intended as a simple demonstration. You should apply feature extraction to your own datasets, and also explore using other pre-trained models.

## 0 - Import Libraries

In [146]:
%pip install resampy
%pip install soundfile
%pip install datasets

Note: you may need to restart the kernel to use updated packages.

In [149]:
import numpy as np
import transformers
import datasets
import torch
import resampy
import soundfile
from transformers import ClapConfig, ClapModel
import librosa
from glob import glob
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

## 1 - Load VGGish Model

VGGish is a convolutional neural network for large-scale audio classification. You can read more about it here:

https://arxiv.org/abs/1609.09430

The model is trained on the AudioSet dataset, a large-scale audio dataset taken from YouTube videos:

https://research.google.com/audioset/

Pre-trained models are available from a variety of sources. Torch Hub is one way of loading them:

https://pytorch.org/docs/stable/hub.html#loading-models-from-hub

The example below loads a PyTorch VGGish model from this GitHub repo:

https://github.com/harritaylor/torchvggish

In [125]:
# Arguments: the GitHub repo that hosts the model, then the name of the model to load
model = torch.hub.load('harritaylor/torchvggish', 'vggish')
# Turn off the model's post-processing step and remove the output ReLU layer from the model
model.postprocess = False
model.embeddings = torch.nn.Sequential(*list(model.embeddings.children())[:-1])
model.eval()

Using cache found in /Users/awrigh2/.cache/torch/hub/harritaylor_torchvggish_master


VGGish(
  (features): Sequential(
    (0): Conv2d(1, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): ReLU(inplace=True)
    (2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (3): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (4): ReLU(inplace=True)
    (5): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (6): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (7): ReLU(inplace=True)
    (8): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (9): ReLU(inplace=True)
    (10): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (11): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (12): ReLU(inplace=True)
    (13): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (14): ReLU(inplace=True)
    (15): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False

## 2 - Explore Model

Now you can look at the model code:

https://github.com/harritaylor/torchvggish/blob/master/torchvggish/vggish.py

Under the VGGish class you can see what happens when the model's 'forward' method is called. You can also use debug mode to step through the model while it processes inputs. In this case, the VGGish model accepts inputs as 1-D NumPy arrays and automatically pre-processes them into log mel spectrograms in the 'waveform_to_examples' function.

I HIGHLY recommend that you look through these functions and classes on the GitHub page.
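To get a feel for what the pre-processing produces, here is a rough sketch of the framing arithmetic only. It assumes the parameter values published with the original VGGish release (16 kHz audio, 25 ms windows, 10 ms hop, 64 mel bands, 0.96 s non-overlapping examples); the function names are made up for illustration, so check 'waveform_to_examples' in the repo to confirm what this port actually does.

```python
# Framing arithmetic only. All constants are assumptions taken from the
# published VGGish parameters, not read from the torchvggish code.
SAMPLE_RATE = 16000          # Hz, audio is resampled to this rate
STFT_WINDOW_SECONDS = 0.025  # 25 ms analysis window
STFT_HOP_SECONDS = 0.010     # 10 ms hop between windows
EXAMPLE_FRAMES = 96          # spectrogram frames per example (0.96 s)
NUM_MEL_BANDS = 64           # mel bands per frame


def count_frames(num_samples, window, hop):
    """Number of full windows that fit into num_samples (no padding)."""
    if num_samples < window:
        return 0
    return 1 + (num_samples - window) // hop


def vggish_example_shape(num_samples):
    """Shape of the log mel batch for a 16 kHz clip, ignoring the channel axis."""
    window = int(round(SAMPLE_RATE * STFT_WINDOW_SECONDS))  # 400 samples
    hop = int(round(SAMPLE_RATE * STFT_HOP_SECONDS))        # 160 samples
    stft_frames = count_frames(num_samples, window, hop)
    # Examples are non-overlapping blocks of 96 spectrogram frames
    num_examples = count_frames(stft_frames, EXAMPLE_FRAMES, EXAMPLE_FRAMES)
    return (num_examples, EXAMPLE_FRAMES, NUM_MEL_BANDS)


print(vggish_example_shape(16000))  # 1 second of audio
print(vggish_example_shape(32000))  # 2 seconds of audio
```

Under these assumptions a 1-second clip yields a single 96 x 64 example, and therefore a single embedding, while longer clips yield one embedding per 0.96 s block.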

In [132]:
dummy_audio = np.random.randn(16000) # Make 1 second of random noise at a 16 kHz sample rate

dummy_feats = model.forward(x=dummy_audio, fs=16000) # Call the model on the dummy audio

Look at the features the model extracted - what dimensions do they have? These are the features extracted from the output layer of the VGGish model.

torch.Size([128])

## 3 - Feature Extraction

Here I have used the VGGish model to extract features from the digit classification dataset.


In [126]:
train_audio = []
train_labels = []
train_embeddings = []
test_audio = []
test_labels = []
test_embeddings = []
for n in glob('DigitData/*.wav'):
    file_name = n.split('/')[-1]
    label = int(file_name[0])
    file_id = int(file_name.split('_')[-1].split('.')[0]) # Recording index, avoids shadowing the built-in id()
    train = file_id < 40 # Train/Test Split 80/20
    
    # Load Audio and append to train/test list
    audio, fs = librosa.load(n)
    if train:
        train_audio.append(audio)
        train_labels.append(label)
    else:
        test_audio.append(audio)
        test_labels.append(label)
    
    # Make audio the same length - 1 second long
    if audio.shape[0] < fs:
        audio = np.concatenate((audio, np.zeros((fs - audio.shape[0]))))
    elif audio.shape[0] > fs:
        audio = audio[0:fs]
        
    with torch.inference_mode(): # Inference mode saves computation as it disables gradient tracking
        feats = model.forward(audio, fs)

    if train:
        train_embeddings.append(feats)
    else:
        test_embeddings.append(feats)
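The pad-or-truncate step in the loop above can also be pulled out into a small helper, which makes it easier to reuse on your own datasets. This is a minimal sketch, not part of the original worksheet code; `pad_or_trim` is a hypothetical name, and the clips below are placeholder arrays rather than real recordings.

```python
import numpy as np


def pad_or_trim(audio, target_len):
    """Return a copy of a 1-D signal that is exactly target_len samples long."""
    if audio.shape[0] < target_len:
        # Zero-pad at the end, keeping the dtype of the input
        padding = np.zeros(target_len - audio.shape[0], dtype=audio.dtype)
        return np.concatenate((audio, padding))
    # Already long enough: keep only the first target_len samples
    return audio[:target_len].copy()


fs = 22050  # librosa.load resamples to 22050 Hz unless you pass sr=...
short_clip = np.ones(10000, dtype=np.float32)  # placeholder for a short recording
long_clip = np.ones(30000, dtype=np.float32)   # placeholder for a long recording
print(pad_or_trim(short_clip, fs).shape, pad_or_trim(long_clip, fs).shape)
```

Note that in the loop above `fs` is whatever librosa.load returns, so "1 second" there means `fs` samples at librosa's sample rate, and the model resamples internally.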

In [127]:
train_emb_np = torch.stack(train_embeddings, dim=0)
train_emb_np = train_emb_np.squeeze().numpy()
train_labels_np = np.stack(train_labels)

test_emb_np = torch.stack(test_embeddings, dim=0)
test_emb_np = test_emb_np.squeeze().numpy()
test_labels_np = np.stack(test_labels)

## 4 - Model Fitting

The code below fits a k-nearest neighbours classifier to the embeddings extracted above.

In [141]:
nn = KNeighborsClassifier(n_neighbors=5)
nn.fit(train_emb_np, train_labels_np)

In [142]:
preds = nn.predict(test_emb_np)
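If you are curious what the classifier is doing, the idea can be sketched in a few lines of NumPy: measure the Euclidean distance from each test point to every training point, take the k closest, and let them vote. This is an illustrative sketch on toy data, not scikit-learn's implementation; `knn_predict` and the toy arrays are made up for the example.

```python
import numpy as np


def knn_predict(train_x, train_y, test_x, k=5):
    """Predict integer labels by majority vote among the k nearest training points."""
    # Pairwise Euclidean distances, shape (n_test, n_train)
    diffs = test_x[:, None, :] - train_x[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # Indices of the k closest training points for each test point
    nearest = np.argsort(dists, axis=1)[:, :k]
    # Majority vote; argmax on the counts breaks ties towards the smaller label
    return np.array([np.bincount(train_y[idx]).argmax() for idx in nearest])


# Toy data: two well-separated clusters standing in for embeddings
rng = np.random.default_rng(0)
toy_train_x = np.concatenate((rng.normal(0.0, 0.1, (20, 4)),
                              rng.normal(5.0, 0.1, (20, 4))))
toy_train_y = np.array([0] * 20 + [1] * 20)
toy_test_x = np.array([[0.0, 0.0, 0.0, 0.0], [5.0, 5.0, 5.0, 5.0]])
print(knn_predict(toy_train_x, toy_train_y, toy_test_x))
```

Because k-NN relies entirely on distances, it is a quick way to check whether an embedding space already groups similar sounds together.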

In [145]:
print(f'Accuracy is {100*sum(test_labels_np == preds)/test_labels_np.shape[0]} %!')

Accuracy is 79.0 %!


You should get an accuracy of around 80%. Whilst not perfect, this demonstrates that the features extracted by VGGish are useful for downstream audio classification tasks, especially considering that VGGish wasn't trained on the task of speech recognition at all!
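A single accuracy figure hides which digits get confused with each other. As a possible next step, here is a small sketch of a confusion matrix and per-class accuracy in NumPy. The labels below are made up for illustration and are not results from the digit dataset; `confusion_matrix` is a hand-written helper for this example.

```python
import numpy as np


def confusion_matrix(true_labels, pred_labels, num_classes):
    """Count matrix where entry [t, p] is the number of class-t items predicted as p."""
    matrix = np.zeros((num_classes, num_classes), dtype=int)
    # np.add.at accumulates correctly even when index pairs repeat
    np.add.at(matrix, (true_labels, pred_labels), 1)
    return matrix


# Made-up labels for illustration only
toy_true = np.array([0, 0, 1, 1, 2, 2])
toy_pred = np.array([0, 1, 1, 1, 2, 0])
cm = confusion_matrix(toy_true, toy_pred, num_classes=3)
accuracy = np.trace(cm) / cm.sum()        # correct predictions sit on the diagonal
per_class = np.diag(cm) / cm.sum(axis=1)  # fraction of each true class predicted correctly
print(cm)
print(accuracy, per_class)
```

To try it on the worksheet data you would pass `test_labels_np` and `preds` with `num_classes=10`, assuming every digit appears at least once in the test set.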

## 5 - Further Work

This worksheet was mostly a demonstration of using an audio embedding model for classification. You should try to modify your own feature extraction code so that you have the option of using VGGish embeddings as a feature.

Many other pre-trained models are also available, both from Torch Hub and from other sources such as Hugging Face:

https://huggingface.co/models

One example is CLAP, a multi-modal model that creates embeddings from text or audio. You can use it for supervised audio classification, and you can also use it for text-to-audio retrieval.
https://github.com/LAION-AI/CLAP
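
Text-to-audio retrieval with a shared embedding space usually comes down to ranking audio embeddings by their similarity to a text embedding. The sketch below shows that ranking step with cosine similarity. The vectors are hand-made placeholders, not real CLAP embeddings, and `rank_by_cosine` is a hypothetical helper; in practice you would obtain the embeddings from the model itself.

```python
import numpy as np


def rank_by_cosine(query_emb, audio_embs):
    """Return audio indices sorted from most to least similar, plus the scores."""
    # Normalise to unit length so the dot product equals cosine similarity
    query = query_emb / np.linalg.norm(query_emb)
    audio = audio_embs / np.linalg.norm(audio_embs, axis=1, keepdims=True)
    scores = audio @ query
    return np.argsort(-scores), scores


# Hand-made placeholder vectors, not real CLAP embeddings
audio_embs = np.array([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0],
                       [0.7, 0.7, 0.0]])
text_emb = np.array([0.9, 0.1, 0.0])
order, scores = rank_by_cosine(text_emb, audio_embs)
print(order)  # audio clips from best to worst match for the text query
```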

An example of using a Hugging Face CLAP model is found on this page:

https://huggingface.co/laion/larger_clap_music

I encourage you to look through some of the audio models available on Hugging Face:

https://huggingface.co/docs/transformers/index