# VGGish - Speech Commands - Embedding Generation

**Pipeline Step 2: Feature Extraction with VGGish**

This notebook takes the spectrograms generated in the previous step and passes each one through the pre-trained VGGish network to produce a 128-dimensional embedding vector. These embeddings serve as compact, semantically meaningful feature representations of each audio clip.

**Why embeddings?** Rather than training a classifier directly on the raw 96x64 spectrograms (6,144 values), we can use VGGish's pre-trained knowledge to compress each spectrogram into a 128-dimensional vector. This is the "feature extraction" approach to transfer learning -- using a pre-trained network as a fixed feature extractor.

**Pipeline context:**
1. Generate spectrograms (previous notebook)
2. **Generate embeddings** (this notebook) -- Pass spectrograms through VGGish
3. Train classifier (next notebook) -- Use embeddings as input features

## Setup

Import TensorFlow and the VGGish library. This notebook was run on an AWS instance with 16 GPUs (p2.16xlarge), though the embedding generation loop runs in a single TensorFlow session.

In [1]:
import sys
sys.path.append("/home/ubuntu/odsc/vggish/lib/models/research/audioset/vggish")

In [2]:
import pandas as pd
import os

In [3]:
import vggish_params


In [4]:
vggish_params.EXAMPLE_HOP_SECONDS, vggish_params.EXAMPLE_WINDOW_SECONDS

(0.96, 0.96)
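
These two values account for the spectrogram shape used throughout this pipeline. As a rough check, assuming the stock `vggish_params` defaults of a 10 ms STFT hop (`STFT_HOP_LENGTH_SECONDS`) and 64 mel bands (`NUM_MEL_BINS`), neither of which is printed in this notebook, a 0.96 s example window works out to 96 frames:

```python
# Values printed above by vggish_params.
EXAMPLE_WINDOW_SECONDS = 0.96
EXAMPLE_HOP_SECONDS = 0.96

# Assumed stock defaults from vggish_params (not shown in this notebook).
STFT_HOP_LENGTH_SECONDS = 0.010
NUM_MEL_BINS = 64

# Number of STFT frames that fit in one example window.
num_frames = int(round(EXAMPLE_WINDOW_SECONDS / STFT_HOP_LENGTH_SECONDS))

# A hop equal to the window means consecutive examples do not overlap.
examples_overlap = EXAMPLE_HOP_SECONDS < EXAMPLE_WINDOW_SECONDS

print(num_frames, NUM_MEL_BINS, num_frames * NUM_MEL_BINS, examples_overlap)
```

Under those assumptions each example is a 96x64 patch (6,144 values), which matches the spectrogram shape loaded later in this notebook.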

In [5]:
import numpy as np
import six
import soundfile
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))


Instructions for updating:
non-resource variables are not supported in the long term
Num GPUs Available:  16


In [6]:
from tensorflow.python.client import device_lib

def get_available_gpus():
    local_device_protos = device_lib.list_local_devices()
    return [x.name for x in local_device_protos if x.device_type == 'GPU']

get_available_gpus()

['/device:GPU:0',
 '/device:GPU:1',
 '/device:GPU:2',
 '/device:GPU:3',
 '/device:GPU:4',
 '/device:GPU:5',
 '/device:GPU:6',
 '/device:GPU:7',
 '/device:GPU:8',
 '/device:GPU:9',
 '/device:GPU:10',
 '/device:GPU:11',
 '/device:GPU:12',
 '/device:GPU:13',
 '/device:GPU:14',
 '/device:GPU:15']

In [7]:

import vggish_input
import vggish_postprocess
import vggish_slim

pca_params = '/home/ubuntu/odsc/vggish/lib/vggish_pca_params.npz'
ckpt = '/home/ubuntu/odsc/vggish/lib/vggish_model.ckpt'

## Loading Pre-Generated Spectrograms

Load the spectrogram data saved in the previous notebook. The binary file contains the raw float64 values of all 105,835 spectrograms, which we reshape into `(105835, 96, 64)` -- one 96x64 spectrogram per audio file.

In [8]:
pwd

'/home/ubuntu/odsc/vggish'

In [9]:
df = pd.read_csv('wavfile_df.csv', index_col=0)
df.shape

(105835, 3)

In [10]:
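with open('wavfile_spec.dat', 'rb') as f:
    # The file holds raw float64 values with no header, so state the dtype
    # explicitly (float64 is also the np.fromfile default).
    audio_data = np.fromfile(f, dtype=np.float64)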

In [11]:
audio_array = audio_data.reshape((-1, 96, 64))
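
`np.fromfile` returns a flat array and the file carries no shape or dtype metadata, so a mismatch in either can go unnoticed until the reshape fails or yields nonsense. A minimal sketch of a sanity check, using a small synthetic file in a temporary directory rather than the real `wavfile_spec.dat` (the helper name `load_spectrograms` is hypothetical):

```python
import os
import tempfile

import numpy as np


def load_spectrograms(path, n_frames=96, n_bands=64, dtype=np.float64):
    """Load a raw binary file and reshape it to (n, n_frames, n_bands)."""
    values_per_example = n_frames * n_bands
    itemsize = np.dtype(dtype).itemsize
    n_bytes = os.path.getsize(path)
    # The file size must be a whole number of spectrograms.
    if n_bytes % (values_per_example * itemsize) != 0:
        raise ValueError(
            "{} bytes is not a multiple of one {}x{} {} spectrogram".format(
                n_bytes, n_frames, n_bands, np.dtype(dtype).name))
    return np.fromfile(path, dtype=dtype).reshape((-1, n_frames, n_bands))


# Round-trip three synthetic spectrograms through a temporary file.
rng = np.random.default_rng(0)
fake_specs = rng.random((3, 96, 64))
with tempfile.TemporaryDirectory() as tmp_dir:
    tmp_path = os.path.join(tmp_dir, "fake_spec.dat")
    fake_specs.tofile(tmp_path)
    loaded = load_spectrograms(tmp_path)

print(loaded.shape, np.array_equal(loaded, fake_specs))
```

With the real data, the first dimension of the loaded array should equal `df.shape[0]`.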

## Defining the VGGish Inference Pipeline

Define the VGGish model in inference mode (`training=False`) and load the pre-trained checkpoint. The model takes spectrograms as input (`features_tensor`) and produces 128-dim embeddings as output (`embedding_tensor`). No weights are updated during this step -- VGGish acts as a fixed feature extractor.

## Generating Embeddings in Batches

Feeding all 105,835 spectrograms through the network in a single call could exceed GPU memory, so we split the row indices into 20 batches with `np.array_split` (5,291 or 5,292 spectrograms each). Each batch is fed through VGGish, and the resulting embeddings are written into the matching rows of a pre-allocated output array.

The output is a matrix of shape `(105835, 128)` -- one 128-dimensional embedding vector per audio file.
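
The batching pattern can be sketched without TensorFlow. In the example below, `fake_embed` is a hypothetical stand-in for the VGGish forward pass, and the fill loop runs on a small synthetic array; only the index arithmetic for 105,835 rows mirrors this notebook.

```python
import numpy as np

n_examples = 105835
n_batches = 20

# Split the row indices, not the data, so each batch can index both the
# input array and the pre-allocated output array.
batches = np.array_split(np.arange(n_examples), n_batches)
batch_sizes = [len(b) for b in batches]
print(min(batch_sizes), max(batch_sizes), sum(batch_sizes))


def fake_embed(x):
    """Hypothetical stand-in for VGGish: maps (n, 96, 64) to (n, 128)."""
    return np.maximum(x.reshape(len(x), -1)[:, :128], 0.0)


# Demonstrate the fill pattern on a small synthetic array.
small_input = np.random.default_rng(0).standard_normal((10, 96, 64))
small_output = np.zeros((small_input.shape[0], 128))
for b in np.array_split(np.arange(small_input.shape[0]), 3):
    small_output[b] = fake_embed(small_input[b])

print(small_output.shape)
```

Because `np.array_split` tolerates uneven splits, no padding or special handling of the last batch is needed.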

In [12]:
audio_array.shape

(105835, 96, 64)

In [13]:
df.shape

(105835, 3)

### Examining the Raw Embedding

Each embedding is a 128-dimensional vector of non-negative floating-point values (due to the ReLU activation). Many values are zero, while non-zero values capture different learned audio features. The sparse, non-negative structure is typical of deep network activations.
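
One way to quantify that sparsity is the fraction of exactly-zero entries. A small sketch using only the first ten values of the embedding printed in this notebook (on the full array, the same expressions would be applied to `embedding_output`):

```python
import numpy as np

# First ten values of embedding_output[0, :] as printed in this notebook.
first_values = np.array([0.0, 0.0, 0.57563651, 0.0, 0.0,
                         0.0, 0.0, 0.0, 0.80524492, 0.0])

# Fraction of entries that are exactly zero.
zero_fraction = np.mean(first_values == 0)
# ReLU outputs should never be negative.
all_non_negative = bool(np.all(first_values >= 0))

print(zero_fraction, all_non_negative)
```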

### Post-Processing (Optional)

The VGGish authors provide a post-processing step that applies PCA, whitening, and quantization to convert the 128-dim float32 embeddings into 128-dim uint8 vectors. This was designed for compatibility with the YouTube-8M pipeline. Below, we demonstrate this step but use the **raw float32 embeddings** for our downstream classification tasks.
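
The sketch below illustrates the three operations on synthetic data. It is not the `vggish_postprocess` implementation: the PCA matrix and means are random stand-ins for the contents of `vggish_pca_params.npz`, and the clipping range of [-2, 2] is an assumption based on the stock `vggish_params` quantization constants, so check it against your copy of the library.

```python
import numpy as np

EMBEDDING_SIZE = 128
# Assumed clipping range (believed to match vggish_params.QUANTIZE_MIN_VAL
# and QUANTIZE_MAX_VAL; verify against your copy of the library).
QUANTIZE_MIN_VAL, QUANTIZE_MAX_VAL = -2.0, 2.0

rng = np.random.default_rng(0)
# Random stand-ins for the real PCA/whitening parameters.
pca_matrix = rng.standard_normal((EMBEDDING_SIZE, EMBEDDING_SIZE)) * 0.1
pca_means = rng.random((EMBEDDING_SIZE, 1))


def postprocess_sketch(embeddings):
    """Project, clip, and quantize (n, 128) float embeddings to uint8."""
    # Subtract the means and apply the (whitening) PCA projection.
    projected = (pca_matrix @ (embeddings.T - pca_means)).T
    # Clip to a fixed range, then map that range linearly onto 0..255.
    clipped = np.clip(projected, QUANTIZE_MIN_VAL, QUANTIZE_MAX_VAL)
    scaled = (clipped - QUANTIZE_MIN_VAL) * (
        255.0 / (QUANTIZE_MAX_VAL - QUANTIZE_MIN_VAL))
    return scaled.astype(np.uint8)


fake_embeddings = np.maximum(rng.standard_normal((4, EMBEDDING_SIZE)), 0.0)
quantized = postprocess_sketch(fake_embeddings)
print(quantized.shape, quantized.dtype)
```

If the real post-processor clips in this way, components outside the range saturate, which would account for the many 0 and 255 entries in the post-processed vector shown in this notebook.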

In [23]:
def define_and_init_vggish():
    """Build VGGish in the default graph and return its input/output tensors.

    Relies on a module-level `sess` (the active tf.Session), so call it
    inside the `with tf.Session() as sess:` block.
    """
    # Define the model in inference mode, load the checkpoint, and
    # locate input and output tensors.
    vggish_slim.define_vggish_slim(training=False)
    vggish_slim.load_vggish_slim_checkpoint(sess, ckpt)
    features_tensor = sess.graph.get_tensor_by_name(
        vggish_params.INPUT_TENSOR_NAME)
    embedding_tensor = sess.graph.get_tensor_by_name(
        vggish_params.OUTPUT_TENSOR_NAME)

    return features_tensor, embedding_tensor

## Saving Embeddings

Save the raw embeddings both as a binary file (`wavfile_embed.dat`) and as additional columns in the metadata DataFrame (`wavfile_embed.csv`). The CSV file contains the original metadata (file_name, label, valid) plus 128 embedding columns (e0 through e127), making it convenient for downstream analysis and modeling.

In [29]:
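# Split the row indices into 20 roughly equal batches.
batches = np.array_split(np.arange(audio_array.shape[0]), 20)

# Pre-allocate the output array. The cell that originally created
# `embedding_output` is not shown, so this allocation is an assumption
# (float64 zeros, one 128-dim row per spectrogram).
embedding_output = np.zeros((audio_array.shape[0], 128))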

In [32]:
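with tf.Graph().as_default(), tf.Session() as sess:
    features_tensor, embedding_tensor = define_and_init_vggish()

    for b in batches:
        # Run one batch through VGGish and write the result into the
        # matching rows of the output array.
        embedding_output[b] = sess.run(
            embedding_tensor, feed_dict={features_tensor: audio_array[b]})
        # b.max() is the last row index in the batch, not a running count.
        print('Processed {}'.format(b.max()))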


INFO:tensorflow:Restoring parameters from /home/ubuntu/odsc/vggish/lib/vggish_model.ckpt
Processed 5291
Processed 10583
Processed 15875
Processed 21167
Processed 26459
Processed 31751
Processed 37043
Processed 42335
Processed 47627
Processed 52919
Processed 58211
Processed 63503
Processed 68795
Processed 74087
Processed 79379
Processed 84670
Processed 89961
Processed 95252
Processed 100543
Processed 105834


In [33]:
embedding_output.shape

(105835, 128)

In [34]:
embedding_output[0,:]

array([0.        , 0.        , 0.57563651, 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.80524492, 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.09647748,
       0.        , 0.2215701 , 0.36531198, 0.        , 0.        ,
       0.10503449, 0.        , 0.43128461, 0.        , 0.        ,
       0.        , 0.        , 0.09738244, 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.71650451, 0.45736679,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.07548475, 0.        , 0.04537981, 0.        , 0.        ,
       0.28914386, 0.        , 0.42669204, 0.        , 0.72561944,
       0.00455567, 0.        , 0.        , 0.        , 0.        ,
       0.        , 1.11860538, 0.        , 0.19068947, 0.01635753,
       0.        , 0.        , 0.        , 0.        , 0.0788549 ,
       0.        , 0.        , 0.15197024, 0.        , 0.02502114,
       0.        , 0.        , 0.        , 0.        , 0.     

In [35]:
pproc = vggish_postprocess.Postprocessor(pca_params)
postprocessed = pproc.postprocess(embedding_output)

In [36]:
postprocessed[0,:]

array([158,  14, 154, 100, 205,  72, 121,  65, 132, 249,  96,  86, 101,
       154,  70, 161, 100, 100, 163, 121,  16, 255, 134,  67,  66, 131,
       168, 210,  64, 186, 228, 102,  32,  75,   0, 219,  46,   0, 148,
       152,   0, 197,  96,  92, 187, 111, 255, 193,  93, 225, 160,  82,
        91,  76, 115, 106, 255,  42, 149, 137, 117,  93,  45, 220,  83,
        90, 144,   4, 129, 190, 136, 140, 172,  64, 108, 132,   0, 255,
        15,  48,  16,  92, 161, 101,  82, 158, 127, 145, 255,  32, 255,
       129,  52,   6, 149, 255, 218,  98, 253, 218,  47, 135, 255, 173,
         0,   0,  50,  45, 255,  78, 140,  85,  84,  41, 255,   0,  76,
       247,   0, 167, 123, 116,  13,   0, 168,   0, 178, 255], dtype=uint8)

In [37]:
with open('wavfile_embed.dat', 'wb') as f:
    embedding_output.tofile(f)

In [38]:
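# Add the 128 embedding columns (e0 ... e127) in a single concat rather
# than inserting them one at a time.
embed_cols = [f'e{i}' for i in range(embedding_output.shape[1])]
df = pd.concat(
    [df, pd.DataFrame(embedding_output, columns=embed_cols, index=df.index)],
    axis=1)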

In [39]:
df.head()

| | file_name | label | valid | e0 | e1 | e2 | e3 | e4 | e5 | e6 | ... | e118 | e119 | e120 | e121 | e122 | e123 | e124 | e125 | e126 | e127 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | /home/ubuntu/audio/speech_commands/zero/8a90cf... | zero | True | 0.0 | 0.0 | 0.575637 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.098917 | 0.3974 | 0.0 | 0.0 | 0.0 | 0.080007 | 0.0 |
| 1 | /home/ubuntu/audio/speech_commands/zero/173ae7... | zero | True | 0.813162 | 0.0 | 0.280367 | 0.0 | 0.006822 | 0.0 | 0.0 | ... | 0.0 | 0.07224 | 0.0 | 0.0 | 0.0 | 1.272675 | 0.463936 | 0.0 | 0.018412 | 0.0 |
| 2 | /home/ubuntu/audio/speech_commands/zero/eb76bc... | zero | True | 0.701961 | 0.0 | 0.114244 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.084316 | 0.0 | 0.0 | 0.0 | 0.571525 | 0.0 | 0.838637 | 0.160843 | 0.0 |
| 3 | /home/ubuntu/audio/speech_commands/zero/978240... | zero | True | 0.751647 | 0.0 | 0.163232 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.623571 | 0.0 | 0.0 | 0.0 | 0.692807 | 0.924771 | 0.304728 | 0.0 | 0.0 |
| 4 | /home/ubuntu/audio/speech_commands/zero/246328... | zero | True | 1.11538 | 0.0 | 0.111188 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.141433 | 0.0 | 0.0 | 0.0 | 0.760137 | 0.021478 | 0.095431 | 0.0 | 0.0 |


In [41]:
df.to_csv('wavfile_embed.csv')
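
For the next notebook, the embedding columns can be pulled back out of the CSV as a feature matrix. A minimal sketch on a two-row synthetic frame with the same column layout (the file names and values here are made up, and only four embedding columns are used to keep it short):

```python
import io

import numpy as np
import pandas as pd

n_dims = 4  # The real file has 128 embedding columns, e0 through e127.
embed_cols = [f"e{i}" for i in range(n_dims)]
values = np.array([[0.0, 0.5, 0.0, 1.2],
                   [0.8, 0.0, 0.3, 0.0]])

# Synthetic stand-in for the contents of wavfile_embed.csv.
fake_df = pd.DataFrame({
    "file_name": ["a.wav", "b.wav"],
    "label": ["zero", "one"],
    "valid": [True, False],
})
for i, col in enumerate(embed_cols):
    fake_df[col] = values[:, i]

# Write to an in-memory buffer and read it back the way the CSV was saved.
buffer = io.StringIO()
fake_df.to_csv(buffer)
buffer.seek(0)
loaded_df = pd.read_csv(buffer, index_col=0)

X = loaded_df[embed_cols].to_numpy()  # feature matrix, shape (n, n_dims)
y = loaded_df["label"].to_numpy()     # class labels
print(X.shape, list(y))
```

Selecting columns by name rather than by position keeps the feature matrix correct even if metadata columns are added or reordered later.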