### Humpback whale acoustic detector (by NOAA & Google)

#### Song classifier using CNN
- classify audio segments as containing or not containing humpback whale sounds
- "intended to be applied as a detector by scoring every context window (3.92 sec) in a set of underwater passive acoustic monitoring data."
- PCEN-normalized spectrogram -> ResNet-50 -> single logistic output unit
- The original study used sounds collected on humpback winter breeding grounds in the Hawaiian archipelago.
- Metric: score
    - "scores batches of waveforms at once, framing each waveform in the batch into multiple context windows before outputting per-window scores."
- HARPs (High-frequency Acoustic Recording Packages, https://ieeexplore.ieee.org/document/4231090) collected the original data used to train the model. They were deployed hundreds of meters under water, which reduces noise disturbance.

#### Use cases
This model is suitable for:
- Predicting the presence of a humpback whale call in a given audio sample.
- Analyzing acoustic data collected by deep-water deployments.

This model is NOT suitable for:
- Detecting species of whales other than humpback whales.
- Counting how many whales are present.
- Localizing whales.
- Analyzing acoustic data with high levels of surface or platform noise.

Dataset: https://data.noaa.gov/metaview/page?xml=NOAA/NESDIS/NGDC/MGG/passive_acoustic//iso/xml/PIFSC_HARP_10kHzDecimated.xml&view=getDataView

https://console.cloud.google.com/storage/browser/noaa-passive-bioacoustic/pifsc;tab=objects?prefix=&forceOnObjectsSortingFiltering=false

A. Allen et al., "A convolutional neural network for automated detection of humpback whale song in a diverse, long-term passive acoustic dataset", Front. Mar. Sci., 2021, doi: 10.3389/fmars.2021.607321.

M. Harvey, "Acoustic Detection of Humpback Whales Using a Convolutional Neural Network," Google AI Blog, Oct. 29, 2018.

##### Code Synopsis
This code imports TensorFlow and TensorFlow Hub, then loads version 1 of the pre-trained Humpback Whale model from TensorFlow Hub. It reads a WAV audio file from a Google Cloud Storage bucket and decodes it into a waveform tensor and its sample rate. The waveform tensor is expanded into a batch of size 1, and the sample rate is cast to an int64 tensor that is used as the step (hop length) between context windows, which gives one score per second of audio. The model's 'score' signature is then called on the waveform with that step, and the resulting per-window scores are printed.

#### Inputs
- waveform, a float32 Tensor of shape [batch_size, num_samples, num_channels], where it is required that num_channels = 1, but where batch_size and num_samples may take the caller's preferred values on each call.
    - Each audio channel (slice [batch_index, :, 0]) should contain 10kHz PCM float32 audio.
        - The training data left plenty of headroom; the level of clips with humpback present was typically 0.003 RMS, 0.02 peak, much "quieter" than consumer digital audio.
        - Although the model is relatively insensitive to input gain variations as wide as +/-20 dB, users may wish to apply linear scaling to match the levels the model saw in training.
- context_step_samples, an int64 Tensor of shape [], the hop length (in samples) at which to slide the scoring context window over waveform.
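
The gain note above suggests linear scaling toward the levels seen in training. Below is a minimal sketch of that idea in plain Python. The helper names (`rms`, `scale_to_training_level`) and the synthetic sine clip are hypothetical, and the 0.003 RMS target is taken from the note above; this is not part of the model's API.

```python
import math

TRAINING_RMS = 0.003  # typical RMS of clips with humpback present, per the note above


def rms(samples):
    """Root-mean-square level of a sequence of float samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def scale_to_training_level(samples, target_rms=TRAINING_RMS):
    """Linearly scale samples so their RMS matches target_rms.

    Returns the scaled samples and the applied gain in dB.
    """
    current = rms(samples)
    if current == 0.0:
        return list(samples), 0.0  # silent input: nothing to scale
    gain = target_rms / current
    return [s * gain for s in samples], 20.0 * math.log10(gain)


# Hypothetical "consumer-level" clip: one second of a 100 Hz sine at 0.3 peak, 10 kHz sample rate.
loud = [0.3 * math.sin(2 * math.pi * 100 * n / 10_000) for n in range(10_000)]
scaled, gain_db = scale_to_training_level(loud)
print(f"gain applied: {gain_db:.1f} dB, new RMS: {rms(scaled):.4f}")
```

For this synthetic clip the required attenuation is well beyond the +/-20 dB range the model is described as tolerating, which is the kind of case where rescaling before scoring may matter.
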
#### Outputs
- scores, a float32 Tensor of shape [batch_size, num_windows, num_classes], where it will always be true that num_classes = 1, where batch_size will equal the one from the input, and where num_windows is determined by num_samples and context_step_samples.
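
How num_windows follows from num_samples and context_step_samples can be sketched as below. This assumes the model frames the waveform without padding (only full windows are scored), which the source does not state; the context width of 39124 samples comes from the metadata signature output later in these notes, and the 750,000-sample clip length is a hypothetical value.

```python
def num_windows(num_samples, context_width_samples, context_step_samples):
    """Count full context windows when sliding with the given step (no padding assumed)."""
    if num_samples < context_width_samples:
        return 0  # clip is shorter than one context window
    return (num_samples - context_width_samples) // context_step_samples + 1


# Hypothetical 75-second clip at 10 kHz, scored with a one-second step.
print(num_windows(num_samples=750_000,
                  context_width_samples=39_124,
                  context_step_samples=10_000))
```

Under this assumption, a clip of about 75 seconds yields 72 windows, which is consistent with the (1, 72, 1) score tensor printed in the cell below.
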

In [2]:
#pip install protobuf==3.20

In [1]:
import tensorflow.compat.v1 as tf #gives access to the TensorFlow 1.x API while running on TensorFlow 2.x (a plain 'import tensorflow as tf' before it would just be shadowed)
import tensorflow_hub as hub #contains reusable/pre-trained models

In [14]:
FILENAME = 'gs://bioacoustics-www1/sounds/Cross_02_060203_071428.d20_7.wav' #file as string
model = hub.load('https://tfhub.dev/google/humpback_whale/1') #load model

#decode WAV audio file
#reads WAV file 'FILENAME' with 'tf.io.read_file()', then passes contents to 'tf.audio.decode_wav()' to decode
#output is a float32 tensor (multidimensional array) of shape [num_samples, num_channels] with the audio data, plus the sample rate
waveform, sample_rate = tf.audio.decode_wav(tf.io.read_file(FILENAME))
waveform = tf.expand_dims(waveform, 0) #makes a batch of size 1
context_step_samples = tf.cast(sample_rate, tf.int64) #'tf.cast' converts 'sample_rate' from int32 to int64; used as the step, it gives a hop of one second

#access 'score' function by using key 'score' in the 'signatures' dictionary, from 'model' object in pre-trained model
#can run 'score' function on inputs (audio data) to return score/prediction (specific to pre-trained model)
score_fn = model.signatures['score']

#'waveform' = tensor that represents audio data
#'context_step_samples' = integer hop length (in samples) at which the context window slides over the waveform
scores = score_fn(waveform=waveform, context_step_samples=context_step_samples)
print(scores)

{'scores': <tf.Tensor: shape=(1, 72, 1), dtype=float32, numpy=
array([[[0.90622807],
        [0.95901173],
        [0.8324897 ],
        [0.9507026 ],
        [0.8979412 ],
        [0.86508614],
        [0.9822989 ],
        [0.8624318 ],
        [0.7249464 ],
        [0.37265933],
        [0.990205  ],
        [0.9814316 ],
        [0.98871076],
        [0.99518096],
        [0.97144973],
        [0.7024987 ],
        [0.99146575],
        [0.9982121 ],
        [0.9950129 ],
        [0.9127229 ],
        [0.9984227 ],
        [0.995862  ],
        [0.9872175 ],
        [0.95618635],
        [0.98510885],
        [0.9986504 ],
        [0.9936776 ],
        [0.99768335],
        [0.9981987 ],
        [0.9946393 ],
        [0.9704775 ],
        [0.9052821 ],
        [0.96297425],
        [0.92693096],
        [0.9818676 ],
        [0.9975908 ],
        [0.9129576 ],
        [0.9471698 ],
        [0.74529874],
        [0.93864006],
        [0.88221884],
        [0.986338  ],
        [0.99

In [34]:
import tensorflow.compat.v1 as tf
import tensorflow_hub as hub

#create tensorflow graph 
graph = tf.Graph()
with graph.as_default():
    model = hub.load('https://tfhub.dev/google/humpback_whale/1')
    
    filename = tf.placeholder(tf.string)
    waveform, sample_rate = tf.audio.decode_wav(tf.io.read_file(filename))
    
    waveform = tf.expand_dims(waveform, 0)
    context_step_samples = tf.cast(sample_rate, tf.int64)
    score_fn = model.signatures['score']
    scores = score_fn(waveform=waveform, context_step_samples=context_step_samples)

with tf.Session(graph=graph) as sess:
    sess.run(tf.global_variables_initializer())
    feed_dict = {filename: FILENAME}
    print(sess.run(scores, feed_dict))


{'scores': array([[[0.90622807],
        [0.95901173],
        [0.8324897 ],
        [0.9507026 ],
        [0.8979412 ],
        [0.86508614],
        [0.9822989 ],
        [0.8624318 ],
        [0.7249464 ],
        [0.37265933],
        [0.990205  ],
        [0.9814316 ],
        [0.98871076],
        [0.99518096],
        [0.97144973],
        [0.7024987 ],
        [0.99146575],
        [0.9982121 ],
        [0.9950129 ],
        [0.9127229 ],
        [0.9984227 ],
        [0.995862  ],
        [0.9872175 ],
        [0.95618635],
        [0.98510885],
        [0.9986504 ],
        [0.9936776 ],
        [0.99768335],
        [0.9981987 ],
        [0.9946393 ],
        [0.9704775 ],
        [0.9052821 ],
        [0.96297425],
        [0.92693096],
        [0.9818676 ],
        [0.9975908 ],
        [0.9129576 ],
        [0.9471698 ],
        [0.74529874],
        [0.93864006],
        [0.88221884],
        [0.986338  ],
        [0.99153847],
        [0.981916  ],
        [0.99406624],
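
One way to turn per-window scores like these into time intervals is sketched below. The 0.5 threshold and the `detections` helper are assumptions for illustration, not something the model card prescribes; the scores are the first twelve values printed above, flattened to a plain list, with the one-second step used in the cell and the 3.9124 s context width reported by the metadata signature.

```python
def detections(scores, step_seconds, context_seconds, threshold=0.5):
    """Merge above-threshold windows that touch or overlap into (start, end) intervals in seconds."""
    intervals = []
    for i, score in enumerate(scores):
        if score < threshold:
            continue
        start = i * step_seconds
        end = start + context_seconds
        if intervals and start <= intervals[-1][1]:
            # This window overlaps the previous interval: extend it.
            intervals[-1] = (intervals[-1][0], end)
        else:
            intervals.append((start, end))
    return intervals


first_scores = [0.90622807, 0.95901173, 0.8324897, 0.9507026, 0.8979412, 0.86508614,
                0.9822989, 0.8624318, 0.7249464, 0.37265933, 0.990205, 0.9814316]
print(detections(first_scores, step_seconds=1.0, context_seconds=3.9124))
```

Because consecutive windows overlap by almost three seconds, the single low score (0.37) does not split the interval here; a gap appears only when enough consecutive windows fall below the threshold.
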

#### Advanced Usage
Model attributes allow isolated reuse of parts of the model, in accord with the Reusable SavedModels interface. The callable attributes exposed are:

- front_end, which can be called on a waveform Tensor as described in the score signature inputs to produce a PCEN-normalized spectrogram of shape [batch_size, num_stft_bins, num_channels], where num_channels = 64 is fixed and where num_stft_bins depends on the number of input samples.
- features, which when called on a PCEN spectrogram slice of shape [batch_size, 128, 64] produces feature vectors of shape [batch_size, 2048]. (These might be useful for detecting other audio event types in the HARP data or similar underwater passive acoustic monitoring datasets, but the model developers have not yet validated this through experiment.)
- logits, which, when called on the same type of input as features, outputs the log odds of the input spectrogram containing humpback vocalization.

This code uses TensorFlow and TensorFlow Hub to classify an audio file of a whale call. The audio file is located in a Google Cloud Storage bucket and is specified by the FILENAME constant.

The code first loads a pre-trained model from TensorFlow Hub by using the hub.load() function and passing in the URL of the model. Then, it decodes the WAV audio file into a tensor using tf.audio.decode_wav() and expands its dimension with tf.expand_dims() so it has a batch size of 1.

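The code then passes the audio tensor through the model's front_end function to produce a PCEN-normalized spectrogram and slices out the first 128 frames as one context window. The features and logits functions are each called on that window: features returns a 2048-dimensional feature vector, and logits returns the log odds that the window contains humpback vocalization. The logits are passed through the sigmoid function to obtain probabilities between 0 and 1.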

Finally, the code prints out a dictionary containing the intermediate results of the computations, including the pcen spectrogram, features, logits, and probabilities.
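
The logit-to-probability step can be checked by hand without TensorFlow. This is a small sketch using the logit value printed in the output below; the `sigmoid` and `log_odds` helpers are hypothetical names standing in for what `tf.nn.sigmoid` computes and its inverse.

```python
import math


def sigmoid(logit):
    """Map log odds to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))


def log_odds(probability):
    """Inverse of sigmoid: map a probability back to log odds."""
    return math.log(probability / (1.0 - probability))


logit = 2.268426  # 'logits' value printed for the first context window
probability = sigmoid(logit)
print(f"probability: {probability:.4f}")
print(f"round trip back to logit: {log_odds(probability):.4f}")
```

The result agrees with the 'probabilities' value in the cleaned-up output and with the first score returned by the 'score' signature, which suggests that the score for a window is the sigmoid of its logit.
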

In [27]:
import tensorflow.compat.v1 as tf
import tensorflow_hub as hub

FILENAME = 'gs://bioacoustics-www1/sounds/Cross_02_060203_071428.d20_7.wav'

model = hub.load('https://tfhub.dev/google/humpback_whale/1')

waveform, _ = tf.audio.decode_wav(tf.io.read_file(FILENAME))
waveform = tf.expand_dims(waveform, 0)# makes a batch of size 1

pcen_spectrogram = model.front_end(waveform)
context_window = pcen_spectrogram[:, :128, :]
features = model.features(context_window)
logits = model.logits(context_window)
probabilities = tf.nn.sigmoid(logits)

print({
    'pcen_spectrogram': pcen_spectrogram,
    'features': features,
    'logits': logits,
    'probabilities': probabilities,
})

{'pcen_spectrogram': <tf.Tensor: shape=(1, 2497, 64), dtype=float32, numpy=
array([[[0.34337044, 0.35289788, 0.3364389 , ..., 0.3155048 ,
         0.30421436, 0.3076836 ],
        [0.01286328, 0.11843169, 0.3996929 , ..., 0.3654033 ,
         0.38193524, 0.3762566 ],
        [0.02342975, 0.06558263, 0.18811762, ..., 0.48939443,
         0.47875965, 0.29559004],
        ...,
        [0.14307141, 0.222906  , 0.5149369 , ..., 0.16156292,
         0.25384068, 0.39627314],
        [0.1394161 , 0.0664804 , 0.13821816, ..., 0.22090197,
         0.30796683, 0.3320781 ],
        [0.1844101 , 0.17328942, 0.07394493, ..., 0.21915352,
         0.3080541 , 0.18283725]]], dtype=float32)>, 'features': <tf.Tensor: shape=(1, 2048), dtype=float32, numpy=
array([[1.6633592, 0.7753173, 1.2360735, ..., 2.1375275, 1.1309164,
        2.763502 ]], dtype=float32)>, 'logits': <tf.Tensor: shape=(1, 1), dtype=float32, numpy=array([[2.268426]], dtype=float32)>, 'probabilities': <tf.Tensor: shape=(1, 1), dtype=floa

In [31]:
#cleaned up print out

import tensorflow.compat.v1 as tf
import tensorflow_hub as hub

FILENAME = 'gs://bioacoustics-www1/sounds/Cross_02_060203_071428.d20_7.wav'

graph = tf.Graph()
with graph.as_default():
    model = hub.load('https://tfhub.dev/google/humpback_whale/1')
    filename = tf.placeholder(tf.string)
    waveform, _ = tf.audio.decode_wav(tf.io.read_file(filename))
    waveform = tf.expand_dims(waveform, 0)# makes a batch of size 1
    
    pcen_spectrogram = model.front_end(waveform)
    context_window = pcen_spectrogram[:, :128, :]
    features = model.features(context_window)
    logits = model.logits(context_window)
    probabilities = tf.nn.sigmoid(logits)
    
with tf.Session(graph=graph) as sess:
    sess.run(tf.global_variables_initializer())
    feed_dict = {filename: FILENAME}
    print(
        sess.run(
            {
                'pcen_spectrogram': pcen_spectrogram,
                'features': features,
                'logits': logits,
                'probabilities': probabilities,
            }, feed_dict))


{'pcen_spectrogram': array([[[0.34337044, 0.35289788, 0.3364389 , ..., 0.3155048 ,
         0.30421436, 0.3076836 ],
        [0.01286328, 0.11843169, 0.3996929 , ..., 0.3654033 ,
         0.38193524, 0.3762566 ],
        [0.02342975, 0.06558263, 0.18811762, ..., 0.48939443,
         0.47875965, 0.29559004],
        ...,
        [0.14307141, 0.222906  , 0.5149369 , ..., 0.16156292,
         0.25384068, 0.39627314],
        [0.1394161 , 0.0664804 , 0.13821816, ..., 0.22090197,
         0.30796683, 0.3320781 ],
        [0.1844101 , 0.17328942, 0.07394493, ..., 0.21915352,
         0.3080541 , 0.18283725]]], dtype=float32), 'features': array([[1.6633592, 0.7753173, 1.2360735, ..., 2.1375275, 1.1309164,
        2.763502 ]], dtype=float32), 'logits': array([[2.268426]], dtype=float32), 'probabilities': array([[0.9062281]], dtype=float32)}


The metadata signature returns the sample rate of the audio the model expects to see as input and the duration of the context window to which each score applies. This signature is a bit of future proofing so that batch inference systems can support models where these values may differ.



In [32]:
import tensorflow_hub as hub

model = hub.load('https://tfhub.dev/google/humpback_whale/1')
metadata_fn = model.signatures['metadata']
metadata = metadata_fn()
print(metadata)

{'input_sample_rate': <tf.Tensor: shape=(), dtype=int64, numpy=10000>, 'context_width_samples': <tf.Tensor: shape=(), dtype=int64, numpy=39124>, 'class_names': <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'Mn'], dtype=object)>}


In [33]:
import tensorflow.compat.v1 as tf
import tensorflow_hub as hub

graph = tf.Graph()
with graph.as_default():
    model = hub.load('https://tfhub.dev/google/humpback_whale/1')
    metadata_fn = model.signatures['metadata']
    metadata = metadata_fn()

with tf.Session(graph=graph) as sess:
    print(sess.run(metadata))

{'context_width_samples': 39124, 'input_sample_rate': 10000, 'class_names': array([b'Mn'], dtype=object)}
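
A batch inference script could use these metadata values along the following lines. This is a sketch under assumptions: the dictionary copies the values printed above as plain Python objects, and `context_seconds` and `check_sample_rate` are hypothetical helpers, not functions provided by the model.

```python
# Values copied from the metadata output above (as plain Python objects).
metadata = {
    'input_sample_rate': 10000,
    'context_width_samples': 39124,
    'class_names': [b'Mn'],
}


def context_seconds(metadata):
    """Duration of one scoring context window in seconds."""
    return metadata['context_width_samples'] / metadata['input_sample_rate']


def check_sample_rate(file_sample_rate, metadata):
    """Raise if a file's sample rate differs from what the model expects."""
    expected = metadata['input_sample_rate']
    if file_sample_rate != expected:
        raise ValueError(
            f"file is {file_sample_rate} Hz but the model expects {expected} Hz; "
            "resample before scoring")


print(f"context window: {context_seconds(metadata)} s")
check_sample_rate(10000, metadata)  # passes silently for 10 kHz audio
```

The computed window length of 3.9124 s is the value that the model description rounds to 3.92 sec.
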
