#VGGish Audio Embedding Colab

This colab demonstrates how to extract AudioSet embeddings from WAV files using the VGGish deep neural network (DNN).

It's an updated version of [malcolmslaney's original](https://colab.research.google.com/drive/1TbX92UL9sYWbdwdGE0rJ9owmezB-Rl1C#scrollTo=2qiXIggxzusy), modified to work with the updated tensorflow/models VGGish distribution, as well as TensorFlow 2.

#Importing and Testing the VGGish System

Based on the directions at: https://github.com/tensorflow/models/tree/master/research/audioset/vggish

In [None]:
!pip install numpy scipy
!pip install resampy tensorflow soundfile
!pip install tf_slim

Collecting resampy
  Downloading resampy-0.4.2-py3-none-any.whl (3.1 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m3.1/3.1 MB[0m [31m12.2 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: resampy
Successfully installed resampy-0.4.2


In [None]:
!rm -rf models

In [None]:
!git clone https://github.com/tensorflow/models.git

Cloning into 'models'...
remote: Enumerating objects: 93002, done.
remote: Counting objects: 100% (2892/2892), done.
remote: Compressing objects: 100% (1675/1675), done.
remote: Total 93002 (delta 1280), reused 2776 (delta 1193), pack-reused 90110
Receiving objects: 100% (93002/93002), 616.94 MiB | 25.80 MiB/s, done.
Resolving deltas: 100% (66217/66217), done.


In [None]:
# Check where we are in the kernel's file system.
!pwd

/content


In [None]:
# Grab the VGGish model
!curl -O https://storage.googleapis.com/audioset/vggish_model.ckpt
!curl -O https://storage.googleapis.com/audioset/vggish_pca_params.npz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  277M  100  277M    0     0  87.5M      0  0:00:03  0:00:03 --:--:-- 87.6M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 73020  100 73020    0     0   519k      0 --:--:-- --:--:-- --:--:--  520k


In [None]:
# Make sure we got the model data.
!ls

models	sample_data  vggish_model.ckpt	vggish_pca_params.npz
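
The listing shows that the files are present, but not that the downloads completed. A minimal Python sketch of the same check follows. The helper name `find_missing_or_empty` is hypothetical; the file names are the ones fetched with curl above.

```python
import os

def find_missing_or_empty(paths):
    """Return the paths that do not exist or have zero size."""
    problems = []
    for path in paths:
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            problems.append(path)
    return problems

# File names downloaded earlier in this notebook.
required = ["vggish_model.ckpt", "vggish_pca_params.npz"]
print("Missing or empty:", find_missing_or_empty(required))
```

An empty list means both files are in place. For a stricter check, the sizes can be compared with the curl output above (about 277M for the checkpoint and 73020 bytes for the PCA parameters).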


In [None]:
# Verify the location of the AudioSet source files
!ls models/research/audioset/vggish

mel_features.py   vggish_export_tfhub.py    vggish_params.py	   vggish_smoke_test.py
README.md	  vggish_inference_demo.py  vggish_postprocess.py  vggish_train_demo.py
requirements.txt  vggish_input.py	    vggish_slim.py


In [None]:
# Copy the source files to the current directory.
!cp models/research/audioset/vggish/* .

In [None]:
# Make sure the source files got copied correctly.
!ls

mel_features.py   sample_data		    vggish_model.ckpt	   vggish_slim.py
models		  vggish_export_tfhub.py    vggish_params.py	   vggish_smoke_test.py
README.md	  vggish_inference_demo.py  vggish_pca_params.npz  vggish_train_demo.py
requirements.txt  vggish_input.py	    vggish_postprocess.py


In [None]:
# Run the test, which also loads all the necessary functions.
from vggish_smoke_test import *

In [None]:
# path of wav files
audio1 = "/content/1-aircraft1.wav"
audio2 = "/content/8-clap.wav"
long_audio = "/content/bird.wav"

In [None]:
# VGGish demo
!python vggish_inference_demo.py --wav_file "/content/1-aircraft1.wav"

In [None]:
import numpy as np
import six
import soundfile as sf
import tensorflow.compat.v1 as tf

import vggish_input
import vggish_params
import vggish_postprocess
import vggish_slim

In [None]:
# dimensions of computed input features
test = vggish_input.wavfile_to_examples(audio1)
print(test.shape)

(910, 96, 64)
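
The shape reads as (number of examples, 96 frames per example, 64 mel bands). The sketch below estimates the example count from a clip's duration. It assumes the defaults in `vggish_params.py` are a 16 kHz sample rate, a 25 ms STFT window with a 10 ms hop, and non-overlapping 0.96 s examples, and that framing keeps only complete windows as in `mel_features.py`. `expected_num_examples` and `count_frames` are hypothetical helpers, not part of VGGish.

```python
# Assumed defaults from vggish_params.py; check the file if results differ.
SAMPLE_RATE = 16000
STFT_WINDOW_SECONDS = 0.025
STFT_HOP_SECONDS = 0.010
EXAMPLE_WINDOW_SECONDS = 0.96
EXAMPLE_HOP_SECONDS = 0.96

def count_frames(num_items, window, hop):
    """Number of complete windows of length `window` taken every `hop` items."""
    if num_items < window:
        return 0
    return 1 + (num_items - window) // hop

def expected_num_examples(duration_seconds):
    """Estimate how many 96-frame examples a clip of this duration yields."""
    num_samples = int(round(duration_seconds * SAMPLE_RATE))
    stft_window = int(round(STFT_WINDOW_SECONDS * SAMPLE_RATE))  # 400 samples
    stft_hop = int(round(STFT_HOP_SECONDS * SAMPLE_RATE))        # 160 samples
    num_stft_frames = count_frames(num_samples, stft_window, stft_hop)

    frames_per_second = 1.0 / STFT_HOP_SECONDS                   # 100 frames/s
    example_window = int(round(EXAMPLE_WINDOW_SECONDS * frames_per_second))  # 96
    example_hop = int(round(EXAMPLE_HOP_SECONDS * frames_per_second))        # 96
    return count_frames(num_stft_frames, example_window, example_hop)

print(expected_num_examples(10.0))  # 10
```

Under these assumptions, audio shorter than about one second produces no examples at all, because no complete 96-frame window fits.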


In [None]:
# restore PCA parameters
pproc = vggish_postprocess.Postprocessor('vggish_pca_params.npz')
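
To make the integer values in the final embeddings easier to interpret, here is a rough NumPy sketch of what the postprocessor appears to do: apply a PCA projection, clip to a fixed range, and quantize to 8 bits. The clipping range of [-2, 2] and the exact arithmetic are assumptions based on a reading of `vggish_postprocess.py`; `postprocess_sketch` is a hypothetical name, and the identity matrix below merely stands in for the real PCA parameters stored in `vggish_pca_params.npz`.

```python
import numpy as np

# Assumed clipping range applied before 8-bit quantization.
QUANTIZE_MIN_VAL = -2.0
QUANTIZE_MAX_VAL = 2.0

def postprocess_sketch(embeddings, pca_matrix, pca_means):
    """PCA-project, clip, and quantize a [batch, dim] array to uint8."""
    # Subtract the per-dimension means, then project; result is [batch, dim].
    projected = np.dot(pca_matrix, (embeddings.T - pca_means)).T
    clipped = np.clip(projected, QUANTIZE_MIN_VAL, QUANTIZE_MAX_VAL)
    # Map [min, max] linearly onto [0, 255] and truncate to integers.
    scaled = (clipped - QUANTIZE_MIN_VAL) * (255.0 / (QUANTIZE_MAX_VAL - QUANTIZE_MIN_VAL))
    return scaled.astype(np.uint8)

# Stand-in data: random embeddings and an identity "PCA" so the sketch runs alone.
rng = np.random.default_rng(0)
raw = rng.normal(size=(3, 128))
quantized = postprocess_sketch(raw, np.eye(128), np.zeros((128, 1)))
print(quantized.shape, quantized.dtype, quantized.min(), quantized.max())
```

This is why postprocessed embeddings contain only integers between 0 and 255, with many values saturating at the two extremes.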

In [None]:
# Generate embeddings from a wav file
def generate_embeddings(wav_file):
  example_batch = vggish_input.wavfile_to_examples(wav_file)
  with tf.Graph().as_default(), tf.Session() as sess:
    vggish_slim.define_vggish_slim(training=False)
    vggish_slim.load_vggish_slim_checkpoint(sess, 'vggish_model.ckpt')
    features_tensor = sess.graph.get_tensor_by_name(vggish_params.INPUT_TENSOR_NAME)
    embedding_tensor = sess.graph.get_tensor_by_name(vggish_params.OUTPUT_TENSOR_NAME)
    # Run inference and postprocessing.
    [embedding] = sess.run([embedding_tensor],feed_dict={features_tensor: example_batch})
    postprocessed = pproc.postprocess(embedding)
    print(postprocessed)
    print(f'{wav_file} embedding shape: {postprocessed.shape}')

In [None]:
generate_embeddings(audio1)

[[ 68 176  44 ... 248 173 191]
 [ 51 152  82 ... 140 141 140]
 [ 83 140  50 ... 196 106 204]
 ...
 [100 181  71 ... 201  94 138]
 [  2 246  12 ...  63 108  87]
 [ 66 195  66 ... 255 223   0]]
/content/1-aircraft1.wav embedding shape: (10, 128)


In [None]:
generate_embeddings(long_audio)

[[ 78  62 167 ... 211 250 201]
 [ 12 119 147 ... 110 255 160]
 [  0 239  97 ... 255 255  42]
 ...
 [  0 132 146 ... 255 255   0]
 [  0 133 157 ... 255 255   0]
 [ 69 115 125 ...  72 255 222]]
/content/bird.wav embedding shape: (58, 128)
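
Each row is one 128-dimensional embedding, so clips of different lengths produce arrays with different numbers of rows. One simple way to compare two clips, sketched here as an illustration and not as part of the VGGish distribution, is to average the rows into a single vector per clip and take the cosine similarity. The helper names are hypothetical, and the arrays are random stand-ins with the shapes printed above; to use real embeddings, `generate_embeddings` would need to return `postprocessed` instead of only printing it.

```python
import numpy as np

def clip_level_embedding(frame_embeddings):
    """Average per-example embeddings into one vector for the whole clip."""
    return np.asarray(frame_embeddings, dtype=np.float32).mean(axis=0)

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom > 0 else 0.0

# Stand-in uint8 arrays with the shapes printed above; not real VGGish output.
rng = np.random.default_rng(0)
emb_a = rng.integers(0, 256, size=(10, 128), dtype=np.uint8)
emb_b = rng.integers(0, 256, size=(58, 128), dtype=np.uint8)

similarity = cosine_similarity(clip_level_embedding(emb_a), clip_level_embedding(emb_b))
print(f"cosine similarity: {similarity:.3f}")
```

Averaging discards temporal order, so this is only a coarse comparison; it treats a clip as a bag of roughly one-second segments.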
