# 🐸 AnuraSet Lab 2 — Embeddings & Linear Models 

In this lab you will:

1. **Extract audio embeddings** from frog recordings using two pretrained networks  
   * [YAMNet](https://tfhub.dev/google/yamnet) (TensorFlow Hub)
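   * [Perch](https://www.kaggle.com/models/google/bird-vocalization-classifier/tensorFlow2/bird-vocalization-classifier/4?tfhub-redirect=true) (Kaggle Models)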
2. **Save embeddings** to a NumPy `.npy` file and a companion *index* `.csv`  
3. **Analyze AnuraSet annotations** to label the species in each 5-second window 
4. **Train a linear classifier** (logistic regression or a linear SVM) with scikit-learn to classify species from the embeddings
5. Explore **k-nearest-neighbors** as a non-parametric baseline
6. Practice **OOP** by wrapping each model behind a common `Embedder` interface

Complete the gaps marked **📝 Exercise** yourself.  

### (Potentially) useful references:

[Sound classification with YAMNet: Colab notebook](https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/hub/tutorials/yamnet.ipynb#scrollTo=Wo9KJb-5zuz1)

[Transfer learning with YAMNet for environmental sound classification: TensorFlow tutorial](https://www.tensorflow.org/tutorials/audio/transfer_learning_audio)

[Perch documentation](https://www.kaggle.com/models/google/bird-vocalization-classifier/tensorFlow2/bird-vocalization-classifier/4?tfhub-redirect=true)


In [None]:
# ------------------------------------------------------------
# SETUP
# ------------------------------------------------------------
import pandas as pd
import numpy as np
import librosa
import tensorflow_hub as hub
from pathlib import Path
from IPython.display import Audio, display
from tqdm.auto import tqdm




In [2]:
# ------------------------------------------------------------
# Directory structure
# ------------------------------------------------------------
DATA_DIR   = Path('/mnt/class_data/anuraset')
RAW_DIR    = DATA_DIR / 'raw_data'         # long recordings (.wav)
LABEL_DIR  = DATA_DIR / 'strong_labels'    # txt annotation files
OUT_DIR    = Path('./embeddings_raw_out')
OUT_DIR.mkdir(exist_ok=True)
print('Raw WAV dir  :', RAW_DIR)
print('Label txt dir:', LABEL_DIR)


Raw WAV dir  : /mnt/class_data/anuraset/raw_data
Label txt dir: /mnt/class_data/anuraset/strong_labels


In [3]:
# 📝 Exercise 0. Data exploration. 
# 
# We are now working with the raw version of AnuraSet. Let's take a minute to see what we've got.
# 
# 1. Use `os.listdir`, `Path.glob` (https://docs.python.org/3/library/pathlib.html#pathlib.Path.glob), and/or navigate via the command line 
#       to investigate the contents of RAW_DIR and LABEL_DIR. What do you see?
# 2. Load an audio file with librosa.load and display it in this notebook with display(Audio(...)) as in Lab 1.
# 3. Load the corresponding label file with pandas and display its head().
#       a. Things might look weird. Check out the `sep` keyword of pd.read_csv: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
#       b. What do you think the different columns correspond to? Feel free to reference the official documentation of the AnuraSet dataset.
#       c. Use header=None, and use the `names` parameter to label the columns appropriately.
# 4. Find a labeled frog call whose duration is shorter than the total length of the clip.
#       a. Extract and display the corresponding sub-clip from the longer clip you loaded in (2)
#           Hint: Audio files can be indexed by sample. Use the sample rate (from librosa) and the timestamps (in seconds) from the annotation file to get to the correct indices.
#       b. Load and display the corresponding sub-clip using librosa.load. 
#           This is a convenient way to load a short clip without reading the entire WAV file into memory, and may be useful later.
#           Hint: Look at the `offset` and `duration` parameters: https://librosa.org/doc/main/generated/librosa.load.html
#

# Example solution:
# ------------------------------------------------------------
# 1. Take the first WAV file found anywhere under RAW_DIR.
sample_file = next(RAW_DIR.glob("**/*.wav"))

# 2.
audio, sr = librosa.load(sample_file, sr=None)
display(Audio(audio, rate=sr))

# 3.
# Mirror the WAV file's relative path under LABEL_DIR and swap the extension.
label_file = (LABEL_DIR / sample_file.relative_to(RAW_DIR)).with_suffix(".txt")
df = pd.read_csv(label_file, sep='\t', header=None, names=['start_s', 'end_s', 'species'])
display(df.head())

# 4a.
sample_row = df.iloc[3]
display(sample_row)
short_audio = audio[int(sample_row.start_s*sr):int(sample_row.end_s*sr)]
display(Audio(short_audio, rate=sr))

# 4b. Load only the sub-clip from disk via `offset` and `duration` (both in seconds).
clip, clip_sr = librosa.load(
    sample_file,
    sr=None,
    offset=sample_row.start_s,
    duration=sample_row.end_s - sample_row.start_s,
)
display(Audio(clip, rate=clip_sr))

Unnamed: 0,start_s,end_s,species
0,0.0,59.988753,BOABIS_M
1,0.0,59.138519,LEPLAT_L
2,0.425117,16.76851,SCIPER_L
3,0.732146,2.090159,SPHSUR_L
4,3.76701,3.979569,SPHSUR_L


start_s    0.732146
end_s      2.090159
species    SPHSUR_L
Name: 3, dtype: object

In [4]:
# 📝 Exercise 1. Chunked inference on a large file with YAMNet.
# TODO: Turn this into an assignment

CHUNK_LEN_S = 5     # use 5s chunks for consistency with Perch
TARGET_SR = 16_000  # YAMNet expects 16 kHz mono audio

yamnet_model = hub.load('https://tfhub.dev/google/yamnet/1')
sample_file = '/mnt/class_data/anuraset/raw_data/INCT20955/INCT20955_20191031_030000.wav' # example

# load and resample audio file
wav, sr = librosa.load(sample_file, sr=None, mono=True)
if sr != TARGET_SR:
    print("Resampling from", str(sr), "to", str(TARGET_SR))
    wav = librosa.resample(wav, orig_sr=sr, target_sr=TARGET_SR)
    sr = TARGET_SR
wav_len = len(wav)
print("WAV len", str(wav_len), "samples;", str(wav_len // sr), "seconds")

embeddings = []
# iterate through long file in CHUNK_LEN_S chunks
for start_sample in tqdm(range(0, wav_len, CHUNK_LEN_S*TARGET_SR)):
    end_sample = start_sample + CHUNK_LEN_S*TARGET_SR
    end_sample = min(end_sample, wav_len)
    chunk = wav[start_sample:end_sample]
    
    # get embeddings from YAMNet
    _, frames_embeddings, _ = yamnet_model(chunk)
    
    # YAMNet internally splits each chunk into 0.96 s frames with a 0.48 s hop and returns one embedding per frame.
    # We average those frame embeddings into a single vector per chunk, following this approach:
    # "When a model’s window size is shorter than a target example, we frame the audio according to the model’s window size, 
    # create an embedding for each frame, and then average the results" https://www.nature.com/articles/s41598-023-49989-z
    embeddings.append(frames_embeddings.numpy().mean(axis=0))

embeddings = np.stack(embeddings)
embeddings.shape

Resampling from 22050 to 16000
WAV len 959821 samples; 59 seconds



(12, 1024)
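
The `(12, 1024)` shape above follows from the chunking arithmetic: 959,821 samples at 16 kHz, cut into 80,000-sample chunks, give 12 chunks, the last one slightly short. The sketch below reproduces that count and mean-pools a stand-in array of per-frame embeddings. The helper name `count_chunks` and the random frame array are illustrative assumptions, not part of YAMNet's API.

```python
import math

import numpy as np


def count_chunks(n_samples: int, sr: int, chunk_len_s: int) -> int:
    """Number of chunks produced by range(0, n_samples, chunk_len_s * sr)."""
    return math.ceil(n_samples / (chunk_len_s * sr))


n_chunks = count_chunks(959_821, sr=16_000, chunk_len_s=5)
last_chunk_samples = 959_821 - (n_chunks - 1) * 5 * 16_000

# Stand-in for YAMNet's per-frame output for one chunk: (n_frames, 1024).
# The number of frames used here (10) is an arbitrary placeholder.
rng = np.random.default_rng(0)
frame_embeddings = rng.normal(size=(10, 1024)).astype(np.float32)

# Averaging over the frame axis gives one vector per chunk.
chunk_embedding = frame_embeddings.mean(axis=0)

print(n_chunks, last_chunk_samples, chunk_embedding.shape)
```

Because the final chunk is shorter than 5 s, its average is taken over fewer frames than the others.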

In [5]:
# 📝 Exercise 2. Chunked inference on a large file with Perch.
# TODO Turn this into an assignment.

# Example code from https://www.kaggle.com/models/google/bird-vocalization-classifier/tensorFlow2/bird-vocalization-classifier/4?tfhub-redirect=true
# model = hub.load('https://www.kaggle.com/models/google/bird-vocalization-classifier/TensorFlow2/bird-vocalization-classifier/4')
# waveform = np.zeros(5 * 32000, dtype=np.float32)
# logits, embeddings = model.infer_tf(waveform[np.newaxis, :])

def zeropad(A, size):
    t = size - len(A)
    return np.pad(A, pad_width=(0, t), mode='constant')

CHUNK_LEN_S = 5
TARGET_SR = 32_000  # Perch expects 32 kHz audio

perch_model = hub.load('https://www.kaggle.com/models/google/bird-vocalization-classifier/TensorFlow2/bird-vocalization-classifier/4')
sample_file = '/mnt/class_data/anuraset/raw_data/INCT20955/INCT20955_20191031_030000.wav' # example

# load and resample audio file
wav, sr = librosa.load(sample_file, sr=None, mono=True)
if sr != TARGET_SR:
    print("Resampling from", str(sr), "to", str(TARGET_SR))
    wav = librosa.resample(wav, orig_sr=sr, target_sr=TARGET_SR)
    sr = TARGET_SR
wav_len = len(wav)
print("WAV len", str(wav_len), "samples;", str(wav_len // sr), "seconds")

embeddings = []
# iterate through long file in CHUNK_LEN_S chunks
# TODO: add batching to speed it up
# TODO: possibly better padding
for start_sample in tqdm(range(0, wav_len, CHUNK_LEN_S*TARGET_SR)):
    end_sample = start_sample + CHUNK_LEN_S*TARGET_SR
    end_sample = min(end_sample, wav_len)

    # Perch expects input of shape (batch_size, samples); each example must be exactly 5 s long
    chunk = wav[start_sample:end_sample]
    chunk = zeropad(chunk, CHUNK_LEN_S*TARGET_SR)
    
    # get embeddings
    _, chunk_embedding = perch_model.infer_tf(chunk[np.newaxis, :])
    embeddings.append(chunk_embedding.numpy().squeeze())

embeddings = np.stack(embeddings)
embeddings.shape

Resampling from 22050 to 32000
WAV len 1919641 samples; 59 seconds





(12, 1280)
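
The loop above calls the model once per chunk. One possible way to approach the batching TODO is to zero-pad the waveform to a whole number of chunks and reshape it into a 2-D array, so that several chunks can be passed in one call. The sketch below shows only the framing step on a small stand-in signal. `frame_into_batch` is a hypothetical helper, and whether `infer_tf` accepts a batch larger than one is an assumption suggested by its `(batch_size, samples)` input shape, not something verified here.

```python
import numpy as np


def frame_into_batch(wav: np.ndarray, chunk_samples: int) -> np.ndarray:
    """Zero-pad `wav` to a multiple of `chunk_samples` and reshape to 2-D."""
    n_chunks = -(-len(wav) // chunk_samples)  # ceiling division
    padded = np.zeros(n_chunks * chunk_samples, dtype=wav.dtype)
    padded[: len(wav)] = wav
    return padded.reshape(n_chunks, chunk_samples)


# Small stand-in signal: 23 samples framed into chunks of 10 samples.
demo = np.arange(1, 24, dtype=np.float32)
batch = frame_into_batch(demo, chunk_samples=10)
print(batch.shape)
```

With the variables from the cell above, `frame_into_batch(wav, CHUNK_LEN_S * TARGET_SR)` would have shape `(12, 160000)`, and slices of it could then be passed to the model if batched input works as assumed.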

In [None]:
# 📝 Exercise 3. OOP practice
# TODO: Turn this into an assignment

class Embedder:
    def __init__(self):
        """
        For subclasses, set:
            self.sr -> sample rate
            self.chunk_len -> chunk length in seconds
            self.pad_chunks -> boolean, whether to pad chunks to self.chunk_len before inference
            self.model -> the actual model that will be used to run inference
        """
        raise NotImplementedError

    def resample(self, wav, orig_sr):
        if orig_sr != self.sr:
            print("Resampling from", str(orig_sr), "to", str(self.sr))
            wav = librosa.resample(wav, orig_sr=orig_sr, target_sr=self.sr)
        return wav
    
    def embed_large_file(self, wav_filepath):
        """Return: np.array of shape [num_chunks, embedding_size]"""
        wav, orig_sr = librosa.load(wav_filepath, sr=None, mono=True)
        wav = self.resample(wav, orig_sr)
        wav_len = len(wav)
        print("WAV len", str(wav_len), "samples;", str(wav_len // self.sr), "seconds")

        embeddings = []
        for start_sample in tqdm(range(0, wav_len, self.chunk_len*self.sr)):
            end_sample = start_sample + self.chunk_len*self.sr
            end_sample = min(end_sample, wav_len)

            chunk = wav[start_sample:end_sample]
            if self.pad_chunks:
                chunk = zeropad(chunk, self.chunk_len*self.sr)
            
            embeddings.append(self.get_embedding(chunk))
    
        return np.stack(embeddings)
    
    def get_embedding(self, chunk):
        """
        Implement this for each subclass.
        Returns:
            np.array of shape (embedding_dim,)
        """
        raise NotImplementedError
    
class YAMNetEmbedder(Embedder):
    def __init__(self):
        self.sr = 16_000
        self.chunk_len = 5
        self.pad_chunks = False
        self.model = hub.load('https://tfhub.dev/google/yamnet/1')

    def get_embedding(self, chunk):
        _, frames_embeddings, _ = self.model(chunk)
        return frames_embeddings.numpy().mean(axis=0)
    
class PerchEmbedder(Embedder):
    def __init__(self):
        self.sr = 32_000
        self.chunk_len = 5
        self.pad_chunks = True
        self.model = hub.load('https://www.kaggle.com/models/google/bird-vocalization-classifier/TensorFlow2/bird-vocalization-classifier/4')

    def get_embedding(self, chunk):
        _, chunk_embedding = self.model.infer_tf(chunk[np.newaxis, :])
        return chunk_embedding.numpy().squeeze()
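
Another way to express the "subclasses must implement this" contract is Python's `abc` module, which refuses to instantiate a class whose abstract methods are missing. The sketch below is a simplified variant that works on in-memory arrays. `BaseEmbedder`, `embed_array`, and the toy `MeanAbsEmbedder` are hypothetical names standing in for real models, so the example runs without downloading anything.

```python
from abc import ABC, abstractmethod

import numpy as np


class BaseEmbedder(ABC):
    """Simplified, in-memory variant of the Embedder idea (illustrative only)."""

    def __init__(self, sr: int, chunk_len: int):
        self.sr = sr                # sample rate in Hz
        self.chunk_len = chunk_len  # chunk length in seconds

    @abstractmethod
    def get_embedding(self, chunk: np.ndarray) -> np.ndarray:
        """Return a 1-D embedding for one chunk."""

    def embed_array(self, wav: np.ndarray) -> np.ndarray:
        """Return an array of shape (num_chunks, embedding_dim)."""
        step = self.chunk_len * self.sr
        chunks = [wav[i:i + step] for i in range(0, len(wav), step)]
        return np.stack([self.get_embedding(c) for c in chunks])


class MeanAbsEmbedder(BaseEmbedder):
    """Toy stand-in for a real model: a 2-D 'embedding' of simple statistics."""

    def get_embedding(self, chunk: np.ndarray) -> np.ndarray:
        return np.array([np.abs(chunk).mean(), chunk.std()])


toy = MeanAbsEmbedder(sr=100, chunk_len=5)
toy_embeddings = toy.embed_array(np.ones(1_250))  # 2.5 chunks of 500 samples
print(toy_embeddings.shape)

try:
    BaseEmbedder(sr=100, chunk_len=5)
except TypeError as err:
    print("Cannot instantiate the abstract base class:", err)
```

The trade-off is that a missing `get_embedding` is reported when the object is created, not later when the method is first called.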

In [None]:
# 📝 Exercise 4. Write all embeddings to disk, plus a file that describes where they came from
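
As a starting point for this exercise, here is a minimal sketch of one possible on-disk layout: a `.npy` array plus an index `.csv` whose row `i` describes embedding `i`. The function name `save_embeddings`, the file naming, and the index columns are assumptions made for illustration, not a required format.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd


def save_embeddings(embeddings, wav_path, chunk_len_s, out_dir, stem):
    """Write `<stem>.npy` plus `<stem>_index.csv` with one row per embedding."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    rows = np.arange(len(embeddings))
    index = pd.DataFrame({
        "row": rows,                        # position in the .npy array
        "wav_path": str(wav_path),          # source recording
        "start_s": rows * chunk_len_s,      # nominal window start
        "end_s": (rows + 1) * chunk_len_s,  # nominal window end
    })

    npy_path = out_dir / f"{stem}.npy"
    csv_path = out_dir / f"{stem}_index.csv"
    np.save(npy_path, np.asarray(embeddings))
    index.to_csv(csv_path, index=False)
    return npy_path, csv_path


# Round trip with a stand-in array shaped like the YAMNet output above.
with tempfile.TemporaryDirectory() as tmp:
    demo = np.zeros((12, 1024), dtype=np.float32)
    npy_path, csv_path = save_embeddings(demo, "example.wav", 5, tmp, "yamnet")
    loaded = np.load(npy_path)
    loaded_index = pd.read_csv(csv_path)

print(loaded.shape, len(loaded_index))
```

The `end_s` of the last window is nominal: the final chunk of a recording may be shorter than `chunk_len_s`.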

In [None]:
# 📝 Exercise 5. Extract labels
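
One way to approach this exercise is to mark a window as containing a label whenever an annotated interval overlaps it. The sketch below builds a multi-hot table from the five annotation rows displayed in Exercise 0 (only the head of that file, so the result is partial). `window_labels` is a hypothetical helper, and using the full annotation code (for example `SPHSUR_L`) as the label is an assumption; you may prefer to split or group the codes differently.

```python
import numpy as np
import pandas as pd


def window_labels(annotations: pd.DataFrame, n_windows: int, chunk_len_s: float) -> pd.DataFrame:
    """Multi-hot table: rows are windows, columns are label codes."""
    codes = sorted(annotations["species"].unique())
    table = pd.DataFrame(0, index=np.arange(n_windows), columns=codes)
    for row in annotations.itertuples(index=False):
        for w in range(n_windows):
            win_start, win_end = w * chunk_len_s, (w + 1) * chunk_len_s
            # Two intervals overlap when each starts before the other ends.
            if row.start_s < win_end and row.end_s > win_start:
                table.loc[w, row.species] = 1
    return table


annotations = pd.DataFrame(
    [
        (0.0, 59.988753, "BOABIS_M"),
        (0.0, 59.138519, "LEPLAT_L"),
        (0.425117, 16.76851, "SCIPER_L"),
        (0.732146, 2.090159, "SPHSUR_L"),
        (3.76701, 3.979569, "SPHSUR_L"),
    ],
    columns=["start_s", "end_s", "species"],
)
labels = window_labels(annotations, n_windows=12, chunk_len_s=5)
print(labels.head())
```

Any overlap counts here, however brief. Requiring a minimum overlap duration is a design choice worth considering.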

In [None]:
# 📝 Exercise 6. Train simple logistic regression with default train/test split (we will explore better splitting next time)
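
The sketch below shows the general shape of this exercise with scikit-learn, using synthetic, well-separated clusters in place of real embeddings so that it runs on its own. It assumes a single label per window, whereas AnuraSet windows can contain several species, and the `stratify` argument is an addition that the exercise does not ask for.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for embeddings: 3 classes, 100 points each, 32 dimensions.
rng = np.random.default_rng(0)
centers = rng.normal(scale=5.0, size=(3, 32))
X = np.concatenate([c + rng.normal(size=(100, 32)) for c in centers])
y = np.repeat(np.arange(3), 100)

# Random split; stratify keeps class proportions equal in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Standardize features, then fit a linear classifier.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
accuracy = accuracy_score(y_test, clf.predict(X_test))
print(f"test accuracy: {accuracy:.3f}")
```

To use real data, replace `X` with the saved embedding array and `y` with one label per row.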

In [None]:
# 📝 Exercise 7. Train KNN?
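
A k-nearest-neighbors baseline has the same fit/predict shape as the linear model. The sketch below uses synthetic clusters in place of real embeddings and the default Euclidean distance. The choice of `n_neighbors=5` is arbitrary, and other settings such as `metric="cosine"` may be worth comparing.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for embeddings: 3 classes, 100 points each, 32 dimensions.
rng = np.random.default_rng(1)
centers = rng.normal(scale=5.0, size=(3, 32))
X = np.concatenate([c + rng.normal(size=(100, 32)) for c in centers])
y = np.repeat(np.arange(3), 100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# KNN stores the training set and votes among the 5 closest points.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
knn_accuracy = knn.score(X_test, y_test)
print(f"KNN test accuracy: {knn_accuracy:.3f}")
```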