<a target="_blank" href="https://colab.research.google.com/github/SidSaxena01/sound-classification/blob/main/AMPLAB%20Module%204%20-%20Sound%20classification.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# AMPLAB Module 4 Machine Listening - Embeddings Extractor

This notebook contains code to extract audio embeddings that you can then use in your other machine listening tasks. It does not extract embeddings for the BSD10k audio files; you'll have to provide your own audio files. If you want to re-analyze BSD10k, you can download it; details are in the [BSD10k repository](https://github.com/allholy/BSD10k). You should be able to run this notebook locally or in Google Colab without problems.

To run this notebook locally, create a Python virtual environment and install the requirements (`pip install -r requirements.txt`). You'll also need to download the file `amplab_machine_listening_module_data.zip` from [this shared folder](https://drive.google.com/drive/folders/1FHEmzEXgBV1CCAWo_F3KDpw9QM5ecuZf?usp=sharing) and place it, uncompressed, next to this notebook (the uncompressed folder should be named `amplab_machine_listening_module_data`).

If running in Google Colab, make a copy of this notebook somewhere in your Google Drive and add a shortcut to the `amplab_data` shared folder next to it (the shortcut must have the same name as the folder, `amplab_data`). Then run the cells normally. Before running the first cell, update the `%cd ...` path so that the working directory is the folder in your Google Drive that holds the notebook (and the shortcut). In Colab, the first cell takes a few minutes to run because it has to copy and unzip the data.

This work is similar to that of a paper we published at DCASE 2024:
[Anastasopoulou, Panagiota, et al. "Heterogeneous sound classification with the Broad Sound Taxonomy and Dataset." DCASE Workshop (2024)](https://dcase.community/documents/workshop2024/proceedings/DCASE2024Workshop_Anastasopoulou_39.pdf).


In [1]:
try:
  from google.colab import drive
  # If this does not fail, it means we're running in a Colab environment

  # First mount google drive
  drive.mount('/content/drive')

  # Set the working directory to the directory where this notebook has been placed.
  # This directory should have a Google Drive shortcut to the "amplab_data" shared folder.
  # Edit the below to point to the Google Drive directory where this notebook is located.
  %cd '/content/drive/MyDrive/SMC/AMPLab2425/AMPLAB 2025 Module 4 - Machine Listening'

  # Now copy data files to the colab runtime local storage and uncompress the .zip file.
  # By placing data files in the notebook runtime local storage, we will make data loading much faster in the cells below.
  !cp "amplab_data/amplab_machine_listening_module_data.zip" /content/amplab_machine_listening_module_data.zip
  !unzip  -u /content/amplab_machine_listening_module_data.zip -d /content/
  DATA_FOLDER = '/content/amplab_machine_listening_module_data'

  # Install dependencies (if not running in Colab, this will need to be installed manually)
  !pip install laion_clap essentia-tensorflow

except ImportError:
  # Not running in Colab
  DATA_FOLDER = 'amplab_machine_listening_module_data'

import os
import json

import numpy as np
import pandas as pd
import laion_clap
import librosa
import essentia.standard as estd
import freesound
from IPython.display import display
from dotenv import load_dotenv


In [2]:
load_dotenv(".env")

# load freesound api key
FREESOUND_API_KEY = os.getenv("FREESOUND_API_KEY")
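
If `FREESOUND_API_KEY` is missing from `.env`, `os.getenv` returns `None`, and the problem would likely only surface later, when the Freesound client is first used. A minimal sketch of an early check follows; the helper name `require_env` and the variable `EXAMPLE_API_KEY` are hypothetical and not part of the original notebook.

```python
import os

def require_env(name):
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set; add it to your .env file")
    return value

# Demonstration with a hypothetical variable, set here only for the example
os.environ["EXAMPLE_API_KEY"] = "dummy-token"
print(require_env("EXAMPLE_API_KEY"))
```

In the notebook, the same idea would amount to calling the helper with `"FREESOUND_API_KEY"` right after `load_dotenv`.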

# MFCC "EMBEDDINGS"

In [3]:
mfcc_algo = estd.MFCC()
w_algo = estd.Windowing(type='blackmanharris62')
spectrum_algo = estd.Spectrum()

def get_mfcc_embeddings(audio_path):
  loader = estd.MonoLoader(filename=audio_path, sampleRate=48000)
  audio = loader()
  mfcc_frames = []
  for frame in estd.FrameGenerator(audio, frameSize=2048, hopSize=1024):
    spec = spectrum_algo(w_algo(frame))
    _, mfcc_coeffs = mfcc_algo(spec)
    mfcc_frames.append(mfcc_coeffs)
  mfcc_frames = np.array(mfcc_frames)
  mfcc_average = np.mean(mfcc_frames, axis=0)
  return mfcc_average

# FREESOUND SIMILARITY EMBEDDINGS

In [4]:
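# Load the history of transformations of the legacy Gaia dataset (normalization + PCA)
with open(os.path.join(DATA_FOLDER, 'gaia_pca_dataset_history.json')) as f:
    gaia_pca_dataset_history = json.load(f)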
normalization_coefficients = [transform_info for transform_info in gaia_pca_dataset_history if transform_info["Analyzer name"] == "normalize"][0]["Applier parameters"]["coeffs"]
normalization_additional_info = [transform_info for transform_info in gaia_pca_dataset_history if transform_info["Analyzer name"] == "normalize"][0]["Additional info"]
dimensions_per_descriptor = {descriptor_name: len(descriptor_stats['mean']) for descriptor_name, descriptor_stats in normalization_additional_info.items()}
pca_descriptor_names = [transform_info for transform_info in gaia_pca_dataset_history if transform_info["Analyzer name"] == "pca"][0]["Applier parameters"]["descriptorNames"]
pca_matrix_raw = [transform_info for transform_info in gaia_pca_dataset_history if transform_info["Analyzer name"] == "pca"][0]["Applier parameters"]["matrix"]
pca_matrix_raw = pca_matrix_raw[2:]
pca_matrix = []
for i in range(0, len(pca_matrix_raw), len(pca_matrix_raw)//100):
    pca_matrix.append(pca_matrix_raw[i:i+len(pca_matrix_raw)//100])
pca_matrix = np.matrix(pca_matrix).transpose()

def project_sound_to_legacy_similarity_space(features):
    # Normalize
    normed_descriptors = {}
    for descriptor_name in pca_descriptor_names:
        value = features.get(descriptor_name[1:]
                             .replace('spectral_contrast.', 'spectral_contrast_coeffs.')
                             .replace('scvalleys.', 'spectral_contrast_valleys.')
                             .replace('erb_bands.', 'erbbands.')
                             .replace('frequency_bands.', 'barkbands.'))  # descriptor names have '.' at the beginning, and some have changed
        if isinstance(value, np.ndarray):
            value = list(value)  # Make sure this is not ndarray at this point
        if 'frequency_bands.' in descriptor_name and value is not None:
            value += [value[-2]]  # The frequency_bands descriptor (which is "same" as barkbands?) has one more dimension in the legacy extractor (and that missing dimension seems to be usually similar to the penultimate one)

        if type(value) == list:
            value_dimensionality = len(value)
        else:
            value_dimensionality = 1
        have_same_dimension = value_dimensionality == dimensions_per_descriptor[descriptor_name]
        if value is not None and have_same_dimension:
            coeffs = normalization_coefficients[descriptor_name]
            if type(value) != list:
                norm_value = value * coeffs['a'][0] + coeffs['b'][0]
            else:
                norm_value = [v * coeffs['a'][i] + coeffs['b'][i] for i, v in enumerate(value)]
            normed_descriptors[descriptor_name] = norm_value
        else:
            # If a descriptor is missing, we set it to 0
            # This might (will) happen if some sounds don't have values for all descriptors
            #print('Unaligned descriptor', descriptor_name)
            if dimensions_per_descriptor[descriptor_name] > 1:
                normed_descriptors[descriptor_name] = [0.0 for i in range(0, dimensions_per_descriptor[descriptor_name])]
            else:
                normed_descriptors[descriptor_name] = 0.0

    # Project to pca space
    # First concatenate all values into one flat list
    vector = []
    for descriptor_name in pca_descriptor_names:
        normed_value = normed_descriptors[descriptor_name]
        if type(normed_value) == list:
            vector += normed_value
        else:
            vector.append(normed_value)
    # Then multiply by pca matrix
    pca_vector = list(np.squeeze(np.asarray(np.matmul(np.matrix(vector), pca_matrix))))
    return pca_vector

def get_freesound_similarity_embeddings(audio_path):
  fs_pool, _ = estd.FreesoundExtractor()(audio_path)
  features = dict()
  for descriptor in fs_pool.descriptorNames():
      features[descriptor] = fs_pool[descriptor]
  sim_vector = project_sound_to_legacy_similarity_space(features)
  return np.array(sim_vector)
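
`project_sound_to_legacy_similarity_space` does two things: it rescales every descriptor dimension linearly (`v * a + b`) and then multiplies the concatenated vector by the PCA matrix. The toy sketch below reproduces both steps with NumPy on made-up numbers; the sizes, coefficients and matrix are random stand-ins, not values from `gaia_pca_dataset_history.json`, and the function name `project` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the normalization coefficients and PCA matrix (all hypothetical)
num_input_dims, num_output_dims = 6, 2
a = rng.uniform(0.5, 2.0, size=num_input_dims)   # per-dimension scale
b = rng.uniform(-1.0, 1.0, size=num_input_dims)  # per-dimension offset
pca_matrix = rng.normal(size=(num_input_dims, num_output_dims))

def project(raw_vector, a, b, pca_matrix):
    """Normalize each dimension as v * a + b, then project with the PCA matrix."""
    normed = np.asarray(raw_vector, dtype=float) * a + b
    return normed @ pca_matrix

raw = rng.normal(size=num_input_dims)
projected = project(raw, a, b, pca_matrix)
print(projected.shape)
```

Note that in this scheme a descriptor replaced by zeros after normalization contributes nothing to the projection, which is how the function above treats missing descriptors.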

# FSD-SINET EMBEDDINGS

In [5]:
model_embeddings = estd.TensorflowPredictFSDSINet(graphFilename=os.path.join(DATA_FOLDER, "fsd-sinet-vgg42-tlpf_aps-1.pb"), output="model/global_max_pooling1d/Max")

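def add_silence(audio, sr, silence_duration=0.5):
    # Prepend `silence_duration` seconds of silence to the audio.
    # The silence uses the input dtype so the result stays float32 like MonoLoader's output.
    silence = np.zeros(int(silence_duration * sr), dtype=audio.dtype)
    padded_audio = np.concatenate((silence, audio))
    return padded_audio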

def get_fsdsinet_embeddings(audio_path):
  loader = estd.MonoLoader(filename=audio_path, sampleRate=44100)
  audio = loader()
  if len(audio)/44100 < 0.5:
    audio = add_silence(audio, 44100)
  embeddings = model_embeddings(audio).mean(axis=0)  # Take mean of frame embeddings
  return embeddings

[   INFO   ] TensorflowPredict: Successfully loaded graph file: `amplab_machine_listening_module_data/fsd-sinet-vgg42-tlpf_aps-1.pb`


# CLAP EMBEDDINGS

In [6]:
model = laion_clap.CLAP_Module(enable_fusion=True)
model.load_ckpt(model_id=3) # download the default pretrained checkpoint, this might take some time...

def get_clap_embeddings_from_audio(audio_path):
    audio, _ = librosa.load(audio_path, sr=48000)
    np.random.seed(0)  # Make CLAP's random slice selection for >10s sounds deterministic so we get consistent results when re-run
    audio_embed = model.get_audio_embedding_from_data(x=[audio], use_tensor=False)
    audio_embed = audio_embed[0, :]
    return audio_embed

def get_clap_embeddings_from_text(text):
    text_embed = model.get_text_embedding([text])
    text_embed = text_embed[0, :]
    return text_embed

  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  checkpoint = torch.load(checkpoint_path, map_location=map_location)


Load our best checkpoint in the paper.
The checkpoint is already downloaded
Load Checkpoint...
logit_scale_a 	 Loaded
logit_scale_t 	 Loaded
audio_branch.spectrogram_extractor.stft.conv_real.weight 	 Loaded
audio_branch.spectrogram_extractor.stft.conv_imag.weight 	 Loaded
audio_branch.logmel_extractor.melW 	 Loaded
audio_branch.bn0.weight 	 Loaded
audio_branch.bn0.bias 	 Loaded
audio_branch.patch_embed.proj.weight 	 Loaded
audio_branch.patch_embed.proj.bias 	 Loaded
audio_branch.patch_embed.norm.weight 	 Loaded
audio_branch.patch_embed.norm.bias 	 Loaded
audio_branch.patch_embed.mel_conv2d.weight 	 Loaded
audio_branch.patch_embed.mel_conv2d.bias 	 Loaded
audio_branch.patch_embed.fusion_model.local_att.0.weight 	 Loaded
audio_branch.patch_embed.fusion_model.local_att.0.bias 	 Loaded
audio_branch.patch_embed.fusion_model.local_att.1.weight 	 Loaded
audio_branch.patch_embed.fusion_model.local_att.1.bias 	 Loaded
audio_branch.patch_embed.fusion_model.local_att.3.weight 	 Loaded
audio_branc

# RUN EMBEDDING EXTRACTORS

In [7]:
# FREESOUND_API_KEY is read from the .env file above; put your own Freesound API key there
FILES_DIR = 'files'  # Place where to store the downloaded files. Will be relative to the current folder.
DATAFRAME_FILENAME = 'dataframe.csv'  # File where we'll store the metadata of our sounds collection
FREESOUND_STORE_METADATA_FIELDS = ['id', 'name', 'username', 'previews', 'license', 'tags']  # Freesound metadata properties to store

freesound_client = freesound.FreesoundClient()
freesound_client.set_token(FREESOUND_API_KEY)
os.makedirs(FILES_DIR, exist_ok=True)

In [8]:
# Define some util functions

def query_freesound(query, filter=None, num_results=10):
    """Queries Freesound with the given query and filter values.
    If no filter is given, a default filter is added to only get sounds shorter than 30 seconds.
    """
    if filter is None:
        filter = 'duration:[0 TO 30]'  # Set default filter
    pager = freesound_client.text_search(
        query = query,
        filter = filter,
        fields = ','.join(FREESOUND_STORE_METADATA_FIELDS),
        group_by_pack = 1,
        page_size = num_results
    )
    return [sound for sound in pager]

def retrieve_sound_preview(sound, directory):
    """Download the high-quality OGG sound preview of a given Freesound sound object to the given directory.
    """
    return freesound.FSRequest.retrieve(
        sound.previews.preview_hq_ogg,
        freesound_client,
        os.path.join(directory, sound.previews.preview_hq_ogg.split('/')[-1])
    )

def make_pandas_record(sound):
    """Create a dictionary with the metadata that we want to store for each sound.
    """
    record = {key: sound.as_dict()[key] for key in FREESOUND_STORE_METADATA_FIELDS}
    del record['previews']  # Don't store previews dict in record
    record['freesound_id'] = record['id']  # Rename 'id' to 'freesound_id'
    del record['id']
    record['path'] = os.path.join(FILES_DIR, sound.previews.preview_hq_ogg.split("/")[-1])  # Store path of downloaded file
    return record
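
Both `retrieve_sound_preview` and `make_pandas_record` derive the local file name from the last segment of the preview URL. As a small illustration of that step on its own, here is a sketch using only the standard library; the helper name `local_preview_path` and the example URL are hypothetical (the URL is only shaped like a preview link, it is not a real Freesound address).

```python
import os
from urllib.parse import urlparse

def local_preview_path(preview_url, directory):
    """Build the local path for a preview from the last segment of its URL path."""
    filename = os.path.basename(urlparse(preview_url).path)
    return os.path.join(directory, filename)

# Hypothetical URL whose file name follows the "<id>_<user id>-hq.ogg" pattern
example_url = "https://example.org/previews/170/170101_1648170-hq.ogg"
print(local_preview_path(example_url, "files"))
```

Parsing the URL first means a query string, if one were ever present, would not end up in the file name, which a plain `split('/')` would not guarantee.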

In [9]:
# Build our collection of sounds

# Make sure the download directory exists
os.makedirs(FILES_DIR, exist_ok=True)
# Our collection of sounds is made by appending the results of a number of different queries to freesound
# The query terms, query filters and the number of results per query are all defined here.
# Information about how to define filters can be found in the Freesound API documentation: https://freesound.org/docs/api/resources_apiv2.html#request-parameters-text-search-parameters
freesound_queries = [
    {
        'query': 'music box',
        'filter': 'duration:[0 TO 40]',
        'num_results': 10,
    },
    {
        'query': 'videogame',
        'filter': 'duration:[0 TO 40]',
        'num_results': 10,
    },
    {
        'query': '8 bit',
        'filter': 'duration:[0 TO 40]',
        'num_results': 10,
    },

]
 

# Do all queries and concatenate the results in a single list of sounds
sounds = sum([query_freesound(query['query'], query['filter'], query['num_results']) for query in freesound_queries],[])

# Download the sounds and save them to FILES_DIR folder
for count, sound in enumerate(sounds):
    print('Downloading sound with id {0} [{1}/{2}]'.format(sound.id, count + 1, len(sounds)))
    retrieve_sound_preview(sound, FILES_DIR)

# Make a Pandas DataFrame with the metadata of our sound collection and save it
df = pd.DataFrame([make_pandas_record(s) for s in sounds])
df.to_csv(DATAFRAME_FILENAME)
print('Saved DataFrame with {0} entries! {1}'.format(len(df), DATAFRAME_FILENAME))

# Show the contents of our DataFrame (the metadata of our source collection)
display(df)

Downloading sound with id 170101 [1/30]
Downloading sound with id 386927 [2/30]
Downloading sound with id 116402 [3/30]
Downloading sound with id 72653 [4/30]
Downloading sound with id 336609 [5/30]
Downloading sound with id 613089 [6/30]
Downloading sound with id 335967 [7/30]
Downloading sound with id 737611 [8/30]
Downloading sound with id 343373 [9/30]
Downloading sound with id 157261 [10/30]
Downloading sound with id 691653 [11/30]
Downloading sound with id 727924 [12/30]
Downloading sound with id 581359 [13/30]
Downloading sound with id 727926 [14/30]
Downloading sound with id 752080 [15/30]
Downloading sound with id 703250 [16/30]
Downloading sound with id 332629 [17/30]
Downloading sound with id 264828 [18/30]
Downloading sound with id 333038 [19/30]
Downloading sound with id 781093 [20/30]
Downloading sound with id 670975 [21/30]
Downloading sound with id 664082 [22/30]
Downloading sound with id 660356 [23/30]
Downloading sound with id 663402 [24/30]
Downloading sound with id 

| | name | username | license | tags | freesound_id | path |
|---|---|---|---|---|---|---|
| 0 | music box loop 35.wav | klankbeeld | https://creativecommons.org/licenses/by/4.0/ | [antique, wind-up, musicbox, sequence, mechani... | 170101 | files/170101_1648170-hq.ogg |
| 1 | music box | gumballworld | https://creativecommons.org/licenses/by-nc/4.0/ | [box, cute, music, baby] | 386927 | files/386927_7227448-hq.ogg |
| 2 | Let it Be - Music Box.wav | Puniho | http://creativecommons.org/licenses/by/3.0/ | [music, box, chimes, chiming, tinkling, tinkle... | 116402 | files/116402_1956076-hq.ogg |
| 3 | Music box 04.wav | LG | https://creativecommons.org/licenses/by/4.0/ | [music-box, music, box, old, squeak, squeaky, ... | 72653 | files/72653_36188-hq.ogg |
| 4 | music box - wind up 06.wav | Anthousai | http://creativecommons.org/publicdomain/zero/1.0/ | [box, wind-up, music-box, winding, music, tick... | 336609 | files/336609_5923045-hq.ogg |
| 5 | Dancing Ducks Music Box | f-r-a-g-i-l-e | http://creativecommons.org/publicdomain/zero/1.0/ | [lullaby, wind-up, old, rustling, dance, sampl... | 613089 | files/613089_8466114-hq.ogg |
| 6 | Music Box: "Fur Elise" | newagesoup | https://creativecommons.org/licenses/by/4.0/ | [box, mechanical-instrument, wind-up, music-bo... | 335967 | files/335967_4067257-hq.ogg |
| 7 | Music Box of Terror | Gustavo_Alivera | https://creativecommons.org/licenses/by/4.0/ | [possessed, scary, demented, box, child, demon... | 737611 | files/737611_16024318-hq.ogg |
| 8 | Drop The Music Box Bass by Rob.wav | Tapepusher | https://creativecommons.org/licenses/by-nc/4.0/ | [box, turn, lovely, music] | 343373 | files/343373_651383-hq.ogg |
| 9 | spieluhr12.wav | baujahr66 | http://creativecommons.org/publicdomain/zero/1.0/ | [box, music, tones] | 157261 | files/157261_2461172-hq.ogg |


In [10]:
import random
# required imports for audio display
import IPython.display as ipd
from IPython.display import Audio

# Get a random file from our collection
random_index = random.randint(0, len(df) - 1)
test_sound_path = df.iloc[random_index]['path']
print(f"Selected file: {df.iloc[random_index]['name']} (index: {random_index})")

# play the sound
ipd.display(Audio(test_sound_path))


print('\nMFCC')
mfcc_embed = get_mfcc_embeddings(test_sound_path)
print(mfcc_embed.shape)
print(mfcc_embed)

print('\nFreesound similarity')
fssim_embed = get_freesound_similarity_embeddings(test_sound_path)
print(fssim_embed.shape)
print(fssim_embed[0:20])

print('\nFSD-SINET')
fsdsinet_embed = get_fsdsinet_embeddings(test_sound_path)
print(fsdsinet_embed.shape)
print(fsdsinet_embed[0:20])

print('\nCLAP')
audio_embed = get_clap_embeddings_from_audio(test_sound_path)
print(audio_embed.shape)
print(audio_embed[0:20])

text_embed = get_clap_embeddings_from_text("Video Game")
print(text_embed.shape)
print(text_embed[0:20])

# NOTE: if you want to do nearest-neighbour (NN) search in the CLAP space now, you could use NearestNeighbors from scikit-learn, similarly to what we did in the audio mosaicing notebooks

Selected file: 8 bit arpeggio 001 minor 120 bpm triangle 016 Dis2.wav (index: 25)



MFCC
(13,)
[-1008.89014      46.079193     23.089693     12.1100025     1.1797501
     3.8329391    -1.5277956    -6.245073     -4.2980933   -10.417763
   -12.946444    -11.879673    -15.421695 ]

Freesound similarity
(100,)
[12.58567091 -0.76026066  5.00254313 -0.55291605 -4.30472116 -0.65118871
 -0.60038989  0.03666678 -2.16746531 -1.27247259  0.0201243  -0.48667295
  0.67070021  0.28855362  0.20799315 -1.01811045  0.21175726  0.21582351
  1.54769671  0.4242891 ]

FSD-SINET
(512,)
[0.8001462  1.6971749  1.456046   0.09415981 0.9695438  2.8602765
 1.3105874  0.40244913 0.64532346 1.1427574  1.5274771  0.49339193
 0.78523195 1.6282692  1.8348593  0.4158755  1.0681199  2.1892197
 2.1517825  1.8086166 ]

CLAP


[   INFO   ] FreesoundExtractor: Read metadata
[   INFO   ] FreesoundExtractor: Compute md5 audio hash, codec, length, and EBU 128 loudness
[   INFO   ] FreesoundExtractor: Compute audio features
[   INFO   ] On connection Flux::flux → IIR::signal:
[   INFO   ] BUFFER SIZE MISMATCH: max=0 - asked for read size 4096
[   INFO   ] resizing buffer to 36040/4505
[   INFO   ] FrameCutter: dropping incomplete frame
[   INFO   ] FreesoundExtractor: Compute aggregation
[   INFO   ] All done
2025-03-08 12:38:49.054584: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:388] MLIR V1 optimization pass is not enabled


(512,)
[-0.01323269  0.0536596   0.00124436 -0.01663688 -0.01685316  0.0527622
  0.0039786  -0.00719106 -0.01543173  0.05764656  0.03563081  0.03215203
  0.01697233  0.02058556  0.09207574  0.0013378  -0.04961937 -0.01317688
 -0.03551325 -0.0409841 ]
(512,)
[ 0.03628378  0.01537836 -0.00308793 -0.00941453 -0.10184305  0.00029932
 -0.0414701   0.02251457 -0.0287109  -0.03378308 -0.01515904  0.01394989
 -0.0165832   0.05608677  0.01364529 -0.02325141  0.02437724 -0.03079577
 -0.04963004 -0.03121673]
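
The note at the end of the cell above suggests nearest-neighbour search in the CLAP space. For a collection of a few dozen sounds, a brute-force search written directly in NumPy is enough. The sketch below is one possible way to do it, using cosine similarity; the function name `nearest_neighbours` is hypothetical, and the embeddings are random toy data standing in for real CLAP vectors.

```python
import numpy as np

def nearest_neighbours(query, embeddings, k=5):
    """Return indices and cosine similarities of the k embeddings closest to `query`."""
    query = np.asarray(query, dtype=float)
    embeddings = np.asarray(embeddings, dtype=float)
    # Normalize so that the dot product equals the cosine similarity
    query = query / np.linalg.norm(query)
    embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    similarities = embeddings @ query
    top = np.argsort(similarities)[::-1][:k]  # highest similarity first
    return top, similarities[top]

# Toy data: 30 random 512-dimensional vectors, and a query that is a noisy copy of item 7
rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(30, 512))
toy_query = toy_embeddings[7] + 0.01 * rng.normal(size=512)
indices, sims = nearest_neighbours(toy_query, toy_embeddings, k=3)
print(indices, sims)
```

With real data, `toy_embeddings` would be replaced by a stacked array of `get_clap_embeddings_from_audio` outputs and `toy_query` by the output of `get_clap_embeddings_from_text`.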




# LANGUAGE-BASED AUDIO RETRIEVAL EXAMPLE USING CLAP EMBEDDING SPACE

In [11]:
from sklearn.neighbors import NearestNeighbors
import pandas as pd
import sys
from IPython.display import display, IFrame

def load_embeddings_for_dataset(df, embeddings_folder):
  # Returns a numpy array of shape (n, d) where "n" is the number of sounds in the dataset and "d" is the number of dimensions of the embeddings
  # Available embedding types: "clap", "fs_similarity", "fsdsinet", "mfcc", "fsdsinet_frames", "mfcc_frames"
  # NOTE: if you are loading embeddings which have been stored frame by frame (i.e. those ending with "_frames"), you'll need to add some code
  # here to summarize them into a one-dimensional vectors before adding them to the returned numpy array.

  base_dir = os.path.join(DATA_FOLDER, 'embeddings', embeddings_folder)
  filenames = [os.path.join(base_dir, f'{df.iloc[i]["sound_id"]}.npy') for i in range(len(df))]
  example_embedding_vector = np.load(filenames[0])
  num_dimensions = len(example_embedding_vector)

  print(f'Will load {len(filenames)} points of data with {num_dimensions} dimensions each')
  X = np.zeros((len(filenames), num_dimensions))
  for i, fn in enumerate(filenames):
    if (i + 1) % 100 == 0:
      sys.stdout.write(f'\r{i + 1}/{len(filenames)}')
      sys.stdout.flush()
    X[i, :] = np.load(fn)
  sys.stdout.write(f'\rLoaded {len(filenames)} embeddings from "{embeddings_folder}"!')
  print()
  return X

def show_sound_player(sound_id):
  display(IFrame(f'https://freesound.org/embed/sound/iframe/{sound_id}/simple/medium/', width=696, height=100))

dataset_df = pd.read_csv(os.path.join(DATA_FOLDER, 'BSD10k_metadata.csv'))
X = load_embeddings_for_dataset(dataset_df, embeddings_folder="clap")

nbrs = NearestNeighbors(n_neighbors=5, algorithm='ball_tree').fit(X)

Will load 10309 points of data with 512 dimensions each
Loaded 10309 embeddings from "clap"!
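
The comment in `load_embeddings_for_dataset` says that embeddings stored frame by frame (the `_frames` types) must be summarized into a single vector before they are added to `X`. A minimal sketch of such a summary is mean pooling over frames, the same summary that `get_mfcc_embeddings` and `get_fsdsinet_embeddings` apply earlier in this notebook. It assumes that each `_frames` file holds a 2-D array of shape `(num_frames, d)`, which is not confirmed here; the helper name `summarize_frames` is hypothetical.

```python
import numpy as np

def summarize_frames(embedding):
    """Collapse a (num_frames, d) array into a (d,) vector by averaging over frames.

    One-dimensional inputs are returned unchanged.
    """
    embedding = np.asarray(embedding)
    if embedding.ndim == 1:
        return embedding
    return embedding.mean(axis=0)

# Toy example: 4 frames of a 3-dimensional feature
frames = np.arange(12, dtype=float).reshape(4, 3)
print(summarize_frames(frames))  # [4.5 5.5 6.5]
```

Inside the loader, this would wrap the `np.load(fn)` calls, including the one used to determine `num_dimensions`. Other summaries (for example, concatenating mean and standard deviation) are equally possible and change the dimensionality accordingly.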


In [19]:
# Alternative targets: to use one, uncomment it and comment out the `target = ...` line further down
# target = get_clap_embeddings_from_audio(os.path.join(DATA_FOLDER, 'test_sounds', '93100__cgeffex__whip-crack-01.wav'))
# target = get_clap_embeddings_from_audio(os.path.join(DATA_FOLDER, 'test_sounds', '15.wav'))
# target = get_clap_embeddings_from_text('Videogame')

# get random file from files folder
random_index = random.randint(0, len(df) - 1)
test_sound_path = df.iloc[random_index]['path']
print(f"Selected file: {df.iloc[random_index]['name']} (index: {random_index})")
ipd.display(Audio(test_sound_path))

target = get_clap_embeddings_from_audio(test_sound_path)
distances, indices = nbrs.kneighbors([target])


for count, (distance, idx) in enumerate(zip(distances[0], indices[0])):
  fs_id = dataset_df.iloc[idx]["sound_id"]
  print(count + 1, '!', distance, fs_id)
  show_sound_player(fs_id)

Selected file: videogame sci-fi metalish damage efect (index: 19)


1 ! 0.8646804631983444 2994



Argument 'onesided' has been deprecated and has no influence on the behavior of this module.



2 ! 0.9022101127561737 54279


3 ! 0.9184073117705528 88579


4 ! 0.9303321375520076 127162


5 ! 0.938649585716979 420263


In [20]:
# Create audio embeddings for each sound in the collection
print("Computing audio embeddings for sounds in collection...")
audio_embeddings = []

for i in range(len(df)):
    path = df.iloc[i]['path']
    print(f"Processing {i+1}/{len(df)}: {df.iloc[i]['name']}")
    embedding = get_clap_embeddings_from_audio(path)
    audio_embeddings.append(embedding)

# Convert to numpy array for KNN search
audio_embeddings = np.array(audio_embeddings)

# Create a model to find matches between our sounds and text queries
audio_nbrs = NearestNeighbors(n_neighbors=2, algorithm='ball_tree').fit(audio_embeddings)

# Display results for each text query
text_queries = ["Music Box", "Videogame", "8 bit"]

# Compute text embeddings for each query
text_embeddings = []
for query in text_queries:
    text_embedding = get_clap_embeddings_from_text(query)
    text_embeddings.append(text_embedding)

for query_idx, query in enumerate(text_queries):
    print(f"\n\nResults for query: '{query}'")
    print("-" * 50)
    
    # Get distances and indices of nearest matches
    distances, indices = audio_nbrs.kneighbors([text_embeddings[query_idx]])
    
    # Display the top matches
    for i, (distance, idx) in enumerate(zip(distances[0], indices[0])):
        print(f"Match #{i+1}: {df.iloc[idx]['name']} (distance: {distance:.4f})")
        print(f"Freesound ID: {df.iloc[idx]['freesound_id']}")
        
        # Play the sound
        ipd.display(Audio(df.iloc[idx]['path']))
        print()


Computing audio embeddings for sounds in collection...
Processing 1/30: music box loop 35.wav
Processing 2/30: music box



Argument 'onesided' has been deprecated and has no influence on the behavior of this module.



Processing 3/30: Let it Be - Music Box.wav
Processing 4/30: Music box 04.wav
Processing 5/30: music box - wind up 06.wav
Processing 6/30: Dancing Ducks Music Box
Processing 7/30: Music Box: "Fur Elise"
Processing 8/30: Music Box of Terror
Processing 9/30: Drop The Music Box Bass by Rob.wav
Processing 10/30: spieluhr12.wav
Processing 11/30: Videogame Starting Combat Jingle
Processing 12/30: Enemy Spotting
Processing 13/30: Another melody for a indie videogame or something (Give me a second chance)
Processing 14/30: quest finish
Processing 15/30: videogame_controllers_micking_scifi_weapon_movment_sounds
Processing 16/30: Water attack (small)
Processing 17/30: Item Pickup
Processing 18/30: Text-Message or Videogame-Jump
Processing 19/30: Videogame Menu Button Clicking Sound 13 
Processing 20/30: videogame sci-fi metalish damage efect
Processing 21/30: 8 bit arpeggio 001 minor 120 bpm pulse2 024 B2.wav
Processing 22/30: 8 bit arpeggio 001 minor 120 bpm pulse1 024 B2.wav
Processing 23/30: 8


Match #2: Music Box: "Fur Elise" (distance: 1.0951)
Freesound ID: 335967





Results for query: 'Videogame'
--------------------------------------------------
Match #1: 8 bit arpeggio 001 minor 120 bpm pulse1 024 B2.wav (distance: 1.1766)
Freesound ID: 664082



Match #2: 8 bit arpeggio 001 major 120 bpm square 048 B4.wav (distance: 1.1993)
Freesound ID: 660356





Results for query: '8 bit'
--------------------------------------------------
Match #1: 8 bit bass figure 001 115 bpm note 22 A.wav (distance: 1.1938)
Freesound ID: 659777



Match #2: 8 bit arpeggio 001 minor 120 bpm square 002 Cis1.wav (distance: 1.2032)
Freesound ID: 660630





In [21]:
from sklearn.decomposition import PCA

# Analyze the text-to-audio retrieval results

# Create a more formal analysis of the retrieval results
print("# Analysis of Text-to-Audio Retrieval Results\n")

# Compare similarity between text queries and audio embeddings
print("## Similarity Analysis")

# For each text query, calculate and display the average similarity to its top matches
for query_idx, query in enumerate(text_queries):
    # Get text embedding for this query
    query_embedding = text_embeddings[query_idx]

    # Calculate the cosine similarity between this query and every audio embedding
    similarities = []
    for audio_embed in audio_embeddings:
        # Cosine similarity (dot product divided by the product of the norms)
        similarity = np.dot(query_embedding, audio_embed) / (np.linalg.norm(query_embedding) * np.linalg.norm(audio_embed))
        similarities.append(similarity)

    # Find top 3 matches (highest similarity first)
    top_indices = np.argsort(similarities)[-3:][::-1]
    top_similarities = [similarities[i] for i in top_indices]

    print(f"\nQuery: '{query}'")
    print(f"Top 3 matches (indices): {top_indices}")
    print(f"Top 3 similarity scores: {[f'{s:.4f}' for s in top_similarities]}")
    print(f"Average similarity of top 3 matches: {sum(top_similarities)/len(top_similarities):.4f}")
    print(f"Best matching sound: {df.iloc[top_indices[0]]['name']}")
    
# Compare text queries to each other
print("\n## Cross-Query Analysis")
for i in range(len(text_queries)):
    for j in range(i+1, len(text_queries)):
        query1 = text_queries[i]
        query2 = text_queries[j]
        similarity = np.dot(text_embeddings[i], text_embeddings[j]) / (np.linalg.norm(text_embeddings[i]) * np.linalg.norm(text_embeddings[j]))
        print(f"Similarity between '{query1}' and '{query2}': {similarity:.4f}")


# Analysis of Text-to-Audio Retrieval Results

## Similarity Analysis

Query: 'Music Box'
Top 3 matches (indices): [5 6 0]
Top 3 similarity scores: ['0.4314', '0.4004', '0.3906']
Average similarity of top 3 matches: 0.4075
Best matching sound: Dancing Ducks Music Box

Query: 'Videogame'
Top 3 matches (indices): [21 22 20]
Top 3 similarity scores: ['0.3078', '0.2809', '0.2699']
Average similarity of top 3 matches: 0.2862
Best matching sound: 8 bit arpeggio 001 minor 120 bpm pulse1 024 B2.wav

Query: '8 bit'
Top 3 matches (indices): [28 27 20]
Top 3 similarity scores: ['0.2874', '0.2762', '0.2760']
Average similarity of top 3 matches: 0.2799
Best matching sound: 8 bit bass figure 001 115 bpm note 22 A.wav

## Cross-Query Analysis
Similarity between 'Music Box' and 'Videogame': 0.1845
Similarity between 'Music Box' and '8 bit': -0.1323
Similarity between 'Videogame' and '8 bit': 0.3945
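
The nearest-neighbour cell reports Euclidean distances (the default metric of scikit-learn's `NearestNeighbors`), whereas this cell reports cosine similarities. For unit-norm vectors the two are linked by `d**2 = 2 - 2 * cos`. The printed values appear consistent with that relation (for 'Videogame', distance 1.1766 and similarity 0.3078 for the same sound), which suggests the CLAP embeddings used here are approximately unit-norm; that is an inference from the outputs, not something stated in the notebook. The helper names below are hypothetical.

```python
import numpy as np

def cosine_from_euclidean(distance):
    """Cosine similarity implied by a Euclidean distance between two unit-norm vectors."""
    return 1.0 - distance ** 2 / 2.0

def euclidean_from_cosine(similarity):
    """Euclidean distance implied by a cosine similarity between two unit-norm vectors."""
    return np.sqrt(2.0 - 2.0 * similarity)

# Distances printed by the nearest-neighbour search for the three text queries
for d in (1.0951, 1.1766, 1.1938):
    print(f"distance {d:.4f} -> cosine similarity {cosine_from_euclidean(d):.4f}")
```

Under this assumption, ranking by smallest Euclidean distance and ranking by largest cosine similarity give the same order, so the two cells are expected to agree on the best matches.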


In [22]:
from sklearn.metrics.pairwise import cosine_similarity
import plotly.express as px
import plotly.graph_objects as go

# Visualize the embedding space (simplified 2D projection)
print("\n## Visualizing the Embedding Space")

# Combine text and audio embeddings for visualization
all_embeddings = np.vstack([np.array(text_embeddings), audio_embeddings])
labels = text_queries + [df.iloc[i]['name'] for i in range(len(audio_embeddings))]
types = ["Text Query"] * len(text_queries) + ["Audio File"] * len(audio_embeddings)

# Create 2D projection
pca = PCA(n_components=2)
embeddings_2d = pca.fit_transform(all_embeddings)

# Create a DataFrame for plotting
plot_df = pd.DataFrame({
    'x': embeddings_2d[:, 0],
    'y': embeddings_2d[:, 1],
    'label': labels,
    'type': types
})

# Find nearest audio embedding for each text query
nearest_audio_indices = []
similarities = []
for i in range(len(text_queries)):
    sim = cosine_similarity([text_embeddings[i]], audio_embeddings)[0]
    nearest_idx = np.argmax(sim)
    nearest_audio_indices.append(nearest_idx)
    similarities.append(sim[nearest_idx])

# Create the interactive figure
fig = go.Figure()

# Add audio points
audio_df = plot_df[plot_df['type'] == 'Audio File']
fig.add_trace(go.Scatter(
    x=audio_df['x'],
    y=audio_df['y'],
    mode='markers',
    name='Audio Files',
    marker=dict(color='gray', size=8, opacity=0.6),
    text=audio_df['label'],
    hoverinfo='text'
))

# Add text query points with different colors
colors = ['red', 'blue', 'green']
for i, query in enumerate(text_queries):
    # Add the text query point
    fig.add_trace(go.Scatter(
        x=[plot_df.iloc[i]['x']],
        y=[plot_df.iloc[i]['y']],
        mode='markers+text',
        marker=dict(color=colors[i], size=15, symbol='star'),
        text=query,
        textposition="top center",
        name=f"Query: {query}",
        hoverinfo='text'
    ))
    
    # Highlight the nearest audio point
    nearest_idx = nearest_audio_indices[i]
    fig.add_trace(go.Scatter(
        x=[plot_df.iloc[len(text_queries) + nearest_idx]['x']],
        y=[plot_df.iloc[len(text_queries) + nearest_idx]['y']],
        mode='markers',
        marker=dict(color=colors[i], size=12, symbol='circle', line=dict(width=2, color='black')),
        name=f"Best match for '{query}' (sim={similarities[i]:.2f})",
        text=f"Best match: {df.iloc[nearest_idx]['name']}",
        hoverinfo='text'
    ))
    
    # Draw a line between the text query and its nearest audio point
    fig.add_trace(go.Scatter(
        x=[plot_df.iloc[i]['x'], plot_df.iloc[len(text_queries) + nearest_idx]['x']],
        y=[plot_df.iloc[i]['y'], plot_df.iloc[len(text_queries) + nearest_idx]['y']],
        mode='lines',
        line=dict(color=colors[i], width=2, dash='dot'),
        showlegend=False
    ))

# Update layout
fig.update_layout(
    title='Interactive Visualization of CLAP Embedding Space (2D PCA Projection)',
    xaxis_title="Principal Component 1",
    yaxis_title="Principal Component 2",
    template='plotly_white',
    legend=dict(x=1.05, y=1),
    height=700,
    width=1100
)

# Add variance explanation
explained_variance_ratio = pca.explained_variance_ratio_
fig.add_annotation(
    xref="paper", yref="paper",
    x=0.5, y=1.05,
    text=f"Explained variance: PC1 {explained_variance_ratio[0]:.2f}, PC2 {explained_variance_ratio[1]:.2f}",
    showarrow=False,
    font=dict(size=12)
)

# Show the plot
fig.show()


## Visualizing the Embedding Space
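
The plot reduces 512-dimensional embeddings to two principal components, so distances on screen only reflect the part of the variance that those two components capture (shown in the annotation). To make that step less opaque, here is a sketch of a 2-D PCA computed directly with an SVD in NumPy. It is an illustration on random toy data, not the scikit-learn implementation; the function name `pca_2d` is hypothetical, and component signs may differ from those returned by `sklearn.decomposition.PCA`.

```python
import numpy as np

def pca_2d(X):
    """Project the rows of X onto their first two principal components using an SVD."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions; squared singular values are proportional to variances
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:2].T
    explained_variance_ratio = (s ** 2) / np.sum(s ** 2)
    return projected, explained_variance_ratio[:2]

# Toy data with the same shape as in the plot: 3 text queries + 30 sounds, 512 dimensions
rng = np.random.default_rng(0)
toy = rng.normal(size=(33, 512))
points, ratio = pca_2d(toy)
print(points.shape, ratio)
```

When the two ratios add up to a small fraction of the total, a text query and its best match can look far apart in the plot even though they are close in the full embedding space.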
