To implement a solution that works for your exact case, we would need more detailed information about your exact setup. However, I'll provide a more comprehensive code snippet below that you can adapt to your case. This assumes you have a directory structure where each subdirectory's name is the class label, and each subdirectory contains the corresponding audio files. The structure would look something like this:


- main_directory
    - class1
        - file1.wav
        - file2.wav
        ...
    - class2
        - file1.wav
        - file2.wav
        ...
    ...


In this notebook and video, we are going to do three things:

1. Load and use the YAMNet model for inference.

2. Build a new model using the YAMNet embeddings to classify cat and dog sounds.

3. Evaluate and export your model

In [3]:
import os

from IPython import display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import tensorflow as tf
import tensorflow_hub as hub


# YAMNet

YAMNet is a pre-trained deep neural network that can predict audio events from 521 classes, such as laughter, barking, or a siren.

## In detail

YAMNet is a pre-trained neural network that employs the MobileNetV1 depthwise-separable convolution architecture. It can use an audio waveform as input and make independent predictions for each of the 521 audio events from the AudioSet corpus.

Internally, the model extracts "frames" from the audio signal and processes batches of these frames. This version of the model uses frames that are 0.96 second long and extracts one frame every 0.48 seconds .

The model accepts a 1-D float32 Tensor or NumPy array containing a waveform of arbitrary length, represented as single-channel (mono) 16 kHz samples in the range [-1.0, +1.0]. This tutorial contains code to help you convert WAV files into the supported format.

The model returns 3 outputs, including the class scores, embeddings (which you will use for transfer learning), and the log mel spectrogram. You can find more details here.

One specific use of YAMNet is as a high-level feature extractor - the 1,024-dimensional embedding output. You will use the base (YAMNet) model's input features and feed them into your shallower model consisting of one hidden tf.keras.layers.Dense layer. Then, you will train the network on a small amount of data for audio classification without requiring a lot of labeled data and training end-to-end. (This is similar to transfer learning for image classification with TensorFlow Hub for more information.)

First, we will test the model and see the results of classifying audio. You will then construct the data pre-processing pipeline.

## Loading YAMNet from TensorFlow Hub

You are going to use a pre-trained YAMNet from Tensorflow Hub to extract the embeddings from the sound files.

Loading a model from TensorFlow Hub is straightforward: choose the model, copy its URL, and use the load function

In [15]:
import tensorflow_io as tfio

In [141]:
import os
import tensorflow as tf
import tensorflow_hub as hub
from scipy.signal import resample_poly
import matplotlib.pyplot as plt
from IPython.display import display
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model


# Load YAMNet model
yamnet_model_handle = 'https://tfhub.dev/google/yamnet/1'
yamnet_model = hub.load(yamnet_model_handle)

# Constants
main_directory = '/Users/ankush/Downloads/deakin-units/data/b3'
class_names = sorted(os.listdir(main_directory))



2023-08-02 14:25:26.537850: W tensorflow/core/kernels/data/cache_dataset_ops.cc:854] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset  will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.
2023-08-02 14:25:26.565143: W tensorflow/core/kernels/data/cache_dataset_ops.cc:854] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset  will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.
2023-08-02 14:25:26.574543: W tensorflow/core/kernels/data/cache_dataset_ops.cc:854] The calling iterator did not fully read the dataset being cached. I

In [142]:


# Function to get file paths and labels
def get_file_paths_and_labels(main_directory, class_names):
    filenames = []
    labels = []
    for label, class_name in enumerate(class_names):
        class_directory = os.path.join(main_directory, class_name)
        if os.path.isdir(class_directory): # Ensure it's a directory
            for file_name in os.listdir(class_directory):
                if file_name.endswith('.wav'):
                    filenames.append(os.path.join(class_directory, file_name))
                    labels.append(label)
    return filenames, labels
from scipy.signal import resample

def resample_audio(wav, num_samples):
    return resample(wav, num_samples)

def load_wav_16k_mono(filename, target_length=16000):
    """ Load a WAV file, convert it to a float tensor, resample to 16 kHz single-channel audio, and pad to target length. """
    file_contents = tf.io.read_file(filename)
    wav, sample_rate = tf.audio.decode_wav(
          file_contents,
          desired_channels=1)
    wav = tf.squeeze(wav, axis=-1)
    sample_rate = tf.cast(sample_rate, dtype=tf.float32)
    target_sample_rate = 16000.0

    # Compute the number of samples for the target sample rate
    num_samples = tf.cast(tf.shape(wav)[0], dtype=tf.float32) * target_sample_rate / sample_rate
    num_samples = tf.cast(num_samples, tf.int32)

    # Resample the wav using scipy resample
    resampled_wav = tf.numpy_function(resample_audio, [wav, num_samples], tf.float32)

    # Pad or truncate to target length
    resampled_wav = tf.cond(tf.shape(resampled_wav)[0] < target_length,
                            lambda: tf.pad(resampled_wav, [[0, target_length - tf.shape(resampled_wav)[0]]]),
                            lambda: resampled_wav[:target_length])

    return resampled_wav



def load_and_preprocess_data(filename, label):
    wav_data = load_wav_16k_mono(filename)
    scores, embeddings, _ = yamnet_model(wav_data)
    embeddings = tf.reduce_mean(embeddings, axis=0)  # Average across frames
    return embeddings, label




In [143]:

# Get file paths and labels
filenames, labels = get_file_paths_and_labels(main_directory, class_names)

filenames_ds = tf.data.Dataset.from_tensor_slices(filenames)
labels_ds = tf.data.Dataset.from_tensor_slices(labels)

# Zipping the datasets to create pairs of (filename, label)
main_ds = tf.data.Dataset.zip((filenames_ds, labels_ds))

# Apply loading and preprocessing
main_ds = main_ds.map(load_and_preprocess_data)

# Splitting the dataset
train_ds = main_ds.take(int(len(filenames) * 0.7))
test_ds = main_ds.skip(int(len(filenames) * 0.7))

train_ds = train_ds.cache().shuffle(1000).batch(32).prefetch(tf.data.AUTOTUNE)
test_ds = test_ds.cache().batch(32).prefetch(tf.data.AUTOTUNE)


In [147]:
from tensorflow.keras.layers import Input, Dense, Flatten
from tensorflow.keras.models import Sequential

# Model definition
from tensorflow.keras.layers import Dropout

my_model = Sequential([
    Input(shape=(1024,), dtype=tf.float32, name='input_embedding'),
    Dense(512, activation='relu'),
    Dropout(0.5), # Adding dropout
    Dense(256, activation='relu'), # Additional hidden layer
    Dense(len(class_names))
], name='my_model')







my_model.summary()

my_model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                 optimizer="adam",
                 metrics=['accuracy'])


Model: "my_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_42 (Dense)            (None, 512)               524800    
                                                                 
 dropout_5 (Dropout)         (None, 512)               0         
                                                                 
 dense_43 (Dense)            (None, 256)               131328    
                                                                 
 dense_44 (Dense)            (None, 16)                4112      
                                                                 
Total params: 660240 (2.52 MB)
Trainable params: 660240 (2.52 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


In [150]:

callback = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=3, restore_best_weights=True)

history = my_model.fit(train_ds, epochs=50, callbacks=callback)



Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50


In [151]:
loss, accuracy = my_model.evaluate(test_ds)

print("Loss: ", loss)
print("Accuracy: ", accuracy)


Loss:  18.87421989440918
Accuracy:  0.027426160871982574


In [79]:

# Test on a specific file (replace 'testing_wav_file_name' with a path to a WAV file)
testing_wav_data = load_wav_16k_mono('testing_wav_file_name')
scores, embeddings, _ = yamnet_model(testing_wav_data)
result = my_model(embeddings).numpy()

inferred_class = class_names[result.mean(axis=0).argmax()]
print(f'The main sound is: {inferred_class}')

2023-08-02 12:46:43.486298: W tensorflow/core/framework/op_kernel.cc:1828] OP_REQUIRES failed at whole_file_read_ops.cc:116 : NOT_FOUND: testing_wav_file_name; No such file or directory


NotFoundError: {{function_node __wrapped__ReadFile_device_/job:localhost/replica:0/task:0/device:CPU:0}} testing_wav_file_name; No such file or directory [Op:ReadFile]

In [73]:
for x, y in train_ds.take(1):
    print(x.shape, y.shape)  # x should be (32, 1024) and y should be (32,)


(32, 2, 1024) (32,)


2023-08-02 12:42:02.887027: W tensorflow/core/kernels/data/cache_dataset_ops.cc:854] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset  will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.
