# VGGish - Speech Commands - Generate Spectrograms

**Pipeline Step 1: Audio Preprocessing**

This is the first step in the audio classification pipeline. It processes all ~105,000 audio files from the Speech Commands dataset and converts each one into a log-mel spectrogram suitable for the VGGish network.

**Pipeline overview:**
1. **Generate spectrograms** (this notebook) -- Convert WAV files to 96x64 spectrograms
2. Generate embeddings -- Pass spectrograms through VGGish to produce 128-dim vectors
3. Train classifier -- Use embeddings or spectrograms to classify spoken words

**What this notebook does:**
- Walks the Speech Commands dataset directory to catalog all WAV files with their labels
- Converts each audio file to a VGGish-format spectrogram (96 time frames x 64 mel bands)
- Flags invalid entries (files that don't produce exactly one 96x64 spectrogram)
- Saves the spectrogram array and metadata to disk for use in subsequent notebooks

## Setup

Import the VGGish library and its spectrogram generation utilities. The VGGish pipeline expects audio at 16 kHz and cuts the log-mel spectrogram into examples using a window and a hop of 0.96 seconds each, so consecutive examples do not overlap and each ~1-second audio clip yields a single 96x64 example.
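The one-example-per-clip behavior follows from the framing arithmetic. The sketch below reproduces that arithmetic in plain Python. It assumes the upstream VGGish defaults of a 25 ms STFT window and a 10 ms STFT hop, which are not shown in this notebook, together with the 0.96-second example window and hop. `count_examples` is a hypothetical helper written for illustration, not part of the VGGish library.

```python
def count_examples(num_samples,
                   sample_rate=16000,
                   stft_window_s=0.025,   # assumption: upstream default
                   stft_hop_s=0.010,      # assumption: upstream default
                   example_window_s=0.96,
                   example_hop_s=0.96):
    """Estimate how many 96x64 examples a clip of num_samples would yield."""
    win = int(round(sample_rate * stft_window_s))   # STFT window in samples
    hop = int(round(sample_rate * stft_hop_s))      # STFT hop in samples
    if num_samples < win:
        return 0
    # Number of complete STFT frames that fit in the clip.
    stft_frames = 1 + (num_samples - win) // hop
    # Example window and hop, measured in STFT frames.
    ex_win = int(round(example_window_s / stft_hop_s))
    ex_hop = int(round(example_hop_s / stft_hop_s))
    if stft_frames < ex_win:
        return 0
    return 1 + (stft_frames - ex_win) // ex_hop


for seconds in (0.5, 1.0, 2.0):
    print(seconds, count_examples(int(seconds * 16000)))
```

Under these assumptions a 1-second clip yields one example, a 0.5-second clip yields none, and a 2-second clip yields two, which is why only clips close to 1 second pass the validity check used later.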

In [1]:
import sys
sys.path.append("/home/ubuntu/odsc/vggish/lib/models/research/audioset/vggish")

In [2]:
import pandas as pd
import os

In [3]:
import vggish_params


In [4]:
vggish_params.EXAMPLE_HOP_SECONDS, vggish_params.EXAMPLE_WINDOW_SECONDS

(0.96, 0.96)

In [5]:
import numpy as np
import six
import soundfile
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))


Instructions for updating:
non-resource variables are not supported in the long term
Num GPUs Available:  0


In [6]:
from tensorflow.python.client import device_lib

def get_available_gpus():
    local_device_protos = device_lib.list_local_devices()
    return [x.name for x in local_device_protos if x.device_type == 'GPU']

get_available_gpus()

[]

## Building the Dataset Catalog

Walk the Speech Commands directory tree to build a DataFrame of all WAV files. Each subdirectory name is the spoken word label (e.g., "zero", "yes", "stop"). The catalog contains 105,835 files: 105,829 across 35 word categories, plus 6 in the `_background_noise_` directory, which is included here but filtered out later.

The class distribution is imbalanced: the 20 "core" words have roughly 3,700-4,050 samples each, while the 15 "auxiliary" words have roughly 1,550-2,150 samples each.
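As a rough illustration of the cataloging idea, the sketch below walks a directory tree with the standard library only and counts files per label. It runs on a throwaway directory with made-up file names rather than the real dataset, and `catalog_wavs` is a hypothetical helper. Unlike this notebook, which keeps the `_background_noise_` files and filters them later, the sketch skips them during the walk.

```python
import tempfile
from collections import Counter
from pathlib import Path


def catalog_wavs(root, skip_dirs=("_background_noise_",)):
    """Return (path, label) pairs; the label is the parent directory name."""
    rows = []
    for path in sorted(Path(root).rglob("*.wav")):
        label = path.parent.name
        if label in skip_dirs:
            continue  # alternative to filtering these rows later
        rows.append((str(path), label))
    return rows


# Toy directory tree standing in for the real dataset (hypothetical names).
layout = {
    "zero": ["a.wav", "b.wav"],
    "yes": ["c.wav"],
    "_background_noise_": ["noise.wav"],
}
with tempfile.TemporaryDirectory() as root:
    for label, names in layout.items():
        (Path(root) / label).mkdir()
        for name in names:
            (Path(root) / label / name).touch()
    rows = catalog_wavs(root)

# Per-label counts, the stdlib analogue of value_counts().
counts = Counter(label for _, label in rows)
print(counts.most_common())
```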

In [7]:

import vggish_input
import vggish_postprocess
import vggish_slim

pca_params = '/home/ubuntu/odsc/vggish/lib/vggish_pca_params.npz'
ckpt = '/home/ubuntu/odsc/vggish/lib/vggish_model.ckpt'

In [6]:
wav_files = {
    'file_name' : [],
    'label': []
}

rootDir = '/home/ubuntu/audio/speech_commands'
for dirName, subdirList, fileList in os.walk(rootDir):
    for fname in fileList:
        if fname.endswith('.wav'):
            wav_files['label'].append(os.path.basename(dirName))
            wav_files['file_name'].append(os.path.join(dirName, fname))

df = pd.DataFrame(data=wav_files)

In [7]:
df.head()

| | file_name | label |
|---|---|---|
| 0 | /home/ubuntu/audio/speech_commands/zero/8a90cf... | zero |
| 1 | /home/ubuntu/audio/speech_commands/zero/173ae7... | zero |
| 2 | /home/ubuntu/audio/speech_commands/zero/eb76bc... | zero |
| 3 | /home/ubuntu/audio/speech_commands/zero/978240... | zero |
| 4 | /home/ubuntu/audio/speech_commands/zero/246328... | zero |


## Verifying Single-File Spectrogram Generation

Before processing the full dataset, we verify that a single WAV file produces the expected 96x64 spectrogram. The `wavfile_to_examples()` function returns a batch of spectrograms -- for 1-second clips, this should be exactly one spectrogram with shape `(1, 96, 64)`.

In [37]:
df.shape

(105835, 2)

In [8]:
df['label'].value_counts()

five                  4052
zero                  4052
yes                   4044
seven                 3998
no                    3941
nine                  3934
down                  3917
one                   3890
go                    3880
two                   3880
stop                  3872
six                   3860
on                    3845
left                  3801
eight                 3787
right                 3778
off                   3745
four                  3728
three                 3727
up                    3723
dog                   2128
wow                   2123
house                 2113
marvin                2100
bird                  2064
happy                 2054
cat                   2031
sheila                2022
bed                   2014
tree                  1759
backward              1664
visual                1592
follow                1579
learn                 1575
forward               1557
_background_noise_       6
Name: label, dtype: int64

## Batch Spectrogram Generation

Process all 105,835 audio files. For each file:
1. Generate the VGGish spectrogram using `wavfile_to_examples()`
2. If the result has exactly 1 frame (shape `(1, 96, 64)`), store it and mark as valid
3. If the result has 0 or >1 frames (due to audio being too short or too long), mark as invalid

Some files do not yield exactly one spectrogram. The `_background_noise_` recordings are longer than 1 second and yield several frames, while any clip too short to fill a full 0.96-second window yields none. These entries are filtered out before training.

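The spectrograms are stored in a pre-allocated NumPy array of shape `(105835, 96, 64)`. The array is created with `np.empty`, so rows belonging to invalid files are never written and hold arbitrary values. The loop prints the index at every multiple of 1,000, but only when that file is valid: an invalid file reaches `continue` before the print, which is why some multiples of 1,000 are missing from the output.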
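The steps above can also be sketched as a small function that is testable without the dataset. This is a variation on the notebook's loop, not a copy of it. It zero-initializes the array so invalid rows have defined contents, collects the validity flags in a list, and reports progress before the validity check. `build_spectrogram_array` and `fake_to_examples` are hypothetical names, and the stand-in converter only mimics the output shape of `wavfile_to_examples()`.

```python
import numpy as np


def build_spectrogram_array(paths, to_examples, frames=96, bands=64):
    """Convert each path and return (spectrograms, valid_flags).

    to_examples is any callable returning an array of shape (n, frames, bands).
    """
    # Zeros instead of np.empty, so rows for invalid files are well defined.
    specs = np.zeros((len(paths), frames, bands))
    valid = []
    for i, path in enumerate(paths):
        if i % 1000 == 0:
            print(i)  # progress, reported whether or not the file is valid
        examples = to_examples(path)
        ok = examples.shape[0] == 1
        valid.append(ok)
        if ok:
            specs[i] = examples[0]
    return specs, np.array(valid, dtype=bool)


def fake_to_examples(path):
    """Stand-in converter: the file name prefix encodes the example count."""
    n = int(path.split("_")[0])
    return np.ones((n, 96, 64))


specs, valid = build_spectrogram_array(
    ["1_a.wav", "0_b.wav", "3_c.wav"], fake_to_examples
)
print(valid)
```

With the real library, `vggish_input.wavfile_to_examples` would be passed as `to_examples`, and the flags could be assigned to the DataFrame in one step with `df['valid'] = valid`.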

In [9]:
fname = df['file_name'][0]
fname

'/home/ubuntu/audio/speech_commands/zero/8a90cf67_nohash_0.wav'

## Saving Results

Save the DataFrame (with file paths, labels, and validity flags) to CSV and the spectrogram array to a binary file. These outputs are used by the subsequent notebooks:
- `wavfile_df.csv` -- metadata for all 105,835 audio files
- `wavfile_spec.dat` -- raw binary dump of the spectrogram NumPy array (105,835 x 96 x 64 float64 values, about 5.2 GB), written without any shape or dtype header
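Because a raw binary dump carries no shape or dtype information, whoever reads the file must supply both. The sketch below shows the round trip on a small stand-in array written to a temporary directory. The `float64` dtype and the trailing `(96, 64)` dimensions are taken from the description above, and the 4-row array is purely illustrative.

```python
import os
import tempfile

import numpy as np

shape = (4, 96, 64)  # toy stand-in for (105835, 96, 64)
original = np.random.default_rng(0).random(shape)  # float64 by default

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "wavfile_spec.dat")
    original.tofile(path)
    # tofile() stores only the raw values, so dtype and shape are given again.
    restored = np.fromfile(path, dtype=np.float64).reshape(-1, 96, 64)
    size_bytes = os.path.getsize(path)

print(restored.shape, size_bytes)
```

For the full-size file, `np.memmap` with the same dtype and shape may be preferable to `np.fromfile`, since it avoids loading the whole array into memory at once.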

In [10]:
ex = vggish_input.wavfile_to_examples(fname)

In [11]:
ex.shape

(1, 96, 64)

In [12]:
audio_data = np.empty((df.shape[0], 96, 64))

In [44]:
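for ind, row in df.iterrows():
    # Returns an array of shape (num_examples, 96, 64)
    data = vggish_input.wavfile_to_examples(row.file_name)
    if data.shape[0] != 1:
        # Too short (0 examples) or too long (>1): flag and skip.
        # The matching row of audio_data is left uninitialized.
        df.loc[ind, 'valid'] = False
        continue
    audio_data[ind, :, :] = data[0]
    df.loc[ind, 'valid'] = True
    # Progress marker; only reached for valid files
    if ind % 1000 == 0:
        print(ind)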

0
1000
3000
4000
5000
6000
7000
8000
9000
10000
11000
12000
13000
14000
15000
16000
17000
18000
19000
20000
21000
22000
23000
24000
25000
26000
27000
28000
29000
30000
31000
32000
33000
34000
36000
38000
39000
40000
41000
42000
43000
44000
45000
46000
47000
48000
49000
50000
51000
52000
53000
54000
55000
56000
57000
58000
59000
61000
63000
65000
67000
68000
69000
70000
71000
72000
73000
74000
75000
76000
77000
78000
79000
81000
82000
83000
84000
85000
87000
88000
89000
90000
91000
94000
95000
96000
97000
98000
99000
100000
101000
102000
103000
104000
105000


In [46]:
df.to_csv('wavfile_df.csv')

In [47]:
with open('wavfile_spec.dat', 'wb') as f:
    audio_data.tofile(f)