# Visualise CLAP embeddings with t-SNE

###  Goal of the notebook
Visualise CLAP embeddings with t-SNE.

### CLAP
<img src="https://raw.githubusercontent.com/MichielBontenbal/run_CLAP/main/images/CLAP.jpg" width="600"/>

- Modelcard: https://huggingface.co/laion/larger_clap_general
- CLAP paper: https://arxiv.org/abs/2211.06687
- Reference: https://dataloop.ai/library/model/laion_larger_clap_music_and_speech/

In this notebook we will use two CLAP models:
1. larger_clap_music_and_speech
2. larger_clap_general

### How CLAP works

<img src="https://raw.githubusercontent.com/MichielBontenbal/run_CLAP/main/images/create_embeddings.jpg" width="600"/>

<img src= "https://raw.githubusercontent.com/MichielBontenbal/run_CLAP/main/images/CLAP_cos_sim.jpg" width='600'/>


### Use 🤗 datasets and 🤗transformers
The dataset is hosted on the Huggingface Hub at: https://huggingface.co/datasets/MichielBontenbal/UrbanSoundsII

(This is a new version of the same dataset as the old dataset got corrupted.)

This dataset contains nine classes of audio events in an urban environment. 

In this notebook we will use 🤗  ```dataset``` library to load this dataset. 

And we'll use the 🤗 ```transformers``` library to run the CLAP model. Please find more info: https://huggingface.co/docs/transformers/model_doc/clap 


### Contents
0. Install packages & check versions
1. Inspection of dataset
2. Get audio embeddings
3. Cosine similarity for audio embeddings
4. Cluster audio embeddings (work in progress)
5. Get the labels 
6. PCA and t-SNE on the dataset


[def]: ttps://github.com/MichielBontenbal/run_CLAP/blob/main/images/CLAP.jp

## 0. Install packages

In [43]:
#!pip install datasets

In [44]:
#!pip install soundfile

In [45]:
#!pip install librosa

In [46]:
#%pip install datasets\[audio\]

In [47]:
#!pip install transformers

In [48]:
pip install numpy==1.26

Note: you may need to restart the kernel to use updated packages.


In [49]:
#check python version
import platform
print(platform.python_version())


3.12.7


In [50]:
import numpy as np
print(f'numpy version: {np.__version__}')
import soundfile
print(f'soundfile version: {soundfile.__version__}')
import librosa
print(f'librosa version: {librosa.__version__}')
import IPython
print(f'IPython version: {IPython.__version__}')


numpy version: 1.26.0
soundfile version: 0.12.1
librosa version: 0.10.2.post1
IPython version: 8.27.0


In [51]:
import datasets
print(f'datasets version: {datasets.__version__}')
import transformers
print(f'transformers version: {transformers.__version__}')
import torch
print(f'torch version: {torch.__version__}')

datasets version: 3.1.0
transformers version: 4.46.3
torch version: 2.2.2


## 1. Inspection of the dataset

In [52]:
from datasets import load_dataset

dataset =load_dataset("MichielBontenbal/UrbanSoundsII", split="train")

Resolving data files:   0%|          | 0/223 [00:00<?, ?it/s]

In [53]:
#The ESC50 dataset is one of the very few other datasets on Environmental Sound classification
#You could try this as an alternative
#dataset = load_dataset("confit/esc50-demo", "fold1")

In [54]:
#inspect the dataset
#dataset = ds
dataset

Dataset({
    features: ['audio', 'label'],
    num_rows: 223
})

In [55]:
#Inspect one sample from 
example = dataset['audio'][0]
label = dataset['label'][0]

You may notice that the audio column contains several features. Here’s what they are:

- path: the path to the downloaded (and converted) audio file
- array: The decoded audio data, represented as a 1-dimensional NumPy array.
- sampling_rate. The sampling rate of the audio file.

In [56]:
example

{'path': '/Users/michielbontenbal/.cache/huggingface/datasets/downloads/c6de4f4db3f3eb3f95189e05acc6bbd4db6e00537183441353e0477e89b96e45',
 'array': array([-0.00015259, -0.00012207, -0.00021362, ...,  0.00015259,
         0.00018311,  0.        ]),
 'sampling_rate': 44100}

In [57]:
#inspecting the audio array
array = dataset["audio"][0]["array"]
sampling_rate = example["sampling_rate"]
print(array.shape)
print(array)
print(type(array))
print(sampling_rate)

(441000,)
[-0.00015259 -0.00012207 -0.00021362 ...  0.00015259  0.00018311
  0.        ]
<class 'numpy.ndarray'>
44100


## 2. CLAP embeddings

In [58]:
from datasets import load_dataset
from transformers import ClapModel, ClapProcessor

# Load the model and processor
model = ClapModel.from_pretrained("laion/larger_clap_music_and_speech")
processor = ClapProcessor.from_pretrained("laion/larger_clap_music_and_speech")

dataset =load_dataset("MichielBontenbal/UrbanSoundsII", split="train")

Resolving data files:   0%|          | 0/223 [00:00<?, ?it/s]

In [59]:
# Load an audio sample
audio_sample_1 = dataset[0]

# Preprocess the audio sample
inputs_1 = processor(audios=audio_sample_1["audio"]["array"], return_tensors="pt", sampling_rate=48000)

# Run the model
audio_embedding_1 = model.get_audio_features(**inputs_1)

In [60]:
print(audio_embedding_1.ndim)
print(audio_embedding_1.shape)
print(audio_embedding_1.dtype)
print(type(audio_embedding_1))
#print(audio_embedding_1)

2
torch.Size([1, 512])
torch.float32
<class 'torch.Tensor'>


In [61]:
# Now do the same for another audio sample
audio_sample_2 = dataset[0]
inputs_2 = processor(audios=audio_sample_2["audio"]["array"], return_tensors="pt", sampling_rate=48000)
audio_embedding_2 = model.get_audio_features(**inputs_2)

In [62]:
#A function to get all the audio embeddings and store them as individual .pt files
def get_audio_embeddings(i):
    global embedding
    # Preprocess and encode the first image
    sample = dataset[i]
    inputs = processor(audios=sample["audio"]["array"], return_tensors="pt", sampling_rate=48000)
    embedding = model.get_audio_features(**inputs)
    torch.save(embedding, 'embedding'+str(i)+'.pt')
    return embedding

for i in range(len(dataset)):
    get_audio_embeddings(i) 

In [63]:
#Load all audio embeddings into a python dict
import glob
embeddings_files = glob.glob('*.pt')

#load all files as embeddings
embeddings_list=[]
for i in range(len(embeddings_files)):
    embeddings_list.append(torch.load('embedding'+str(1)+'.pt'))

embeddings_dict = dict(enumerate(embeddings_list))
#embeddings_dict

In [64]:
#printing an example to check it
embeddings_dict[5]

tensor([[ 7.0078e-02,  3.4412e-02, -2.7923e-02, -4.3773e-03,  2.4402e-02,
         -4.9248e-03,  9.0602e-03,  8.5705e-02,  2.4642e-02, -7.1067e-03,
         -4.3704e-02, -6.1173e-02, -3.5182e-03,  2.6701e-03,  1.0201e-01,
          6.2395e-02, -5.0307e-02,  7.1411e-03,  6.1110e-02,  2.7588e-02,
         -6.8656e-02, -4.5197e-02, -1.9453e-02,  5.9399e-02, -1.0194e-01,
          1.6907e-02, -5.9350e-02,  3.8986e-02,  1.8429e-02, -1.8185e-02,
         -5.7925e-02,  2.5480e-03,  1.6057e-02, -5.9625e-02,  4.0433e-02,
         -6.2270e-02, -3.3921e-03,  7.1809e-02,  2.9825e-03, -8.3548e-02,
          2.6152e-02, -3.0779e-02, -6.4825e-02, -5.8739e-02, -1.4845e-02,
          4.6709e-02, -5.6650e-02,  1.9142e-02,  4.4384e-02,  1.6380e-02,
         -3.4536e-02,  1.7028e-02, -6.9841e-02,  3.5079e-02,  1.4325e-02,
         -9.6263e-04,  5.4077e-02,  1.6110e-02, -3.7469e-02,  5.0608e-02,
          3.2914e-02, -1.0023e-02, -3.8195e-02, -1.9179e-02,  1.2755e-02,
         -3.1353e-02,  5.2495e-03,  2.

## 3. Calculate cosine similarity

In [65]:
audio_embeddding_1 = torch.load('embedding1.pt')
#audio_embedding_1

In [66]:
audio_embeddding_2 = torch.load('embedding2.pt')
#audio_embedding_2

In [67]:
import torch
import torch.nn.functional

# Calculate cosine similarity
cosine_similarity = torch.nn.functional.cosine_similarity(audio_embedding_1, audio_embedding_2, dim=1)

print(f"Cosine Similarity: {cosine_similarity.item()}")

Cosine Similarity: 1.0000001192092896


In [68]:
#Inspect the result by listening to it
import IPython
print(f'Audio sample 1:')
IPython.display.Audio(audio_sample_1['audio']['array'], rate=example['sampling_rate'])

Audio sample 1:


In [69]:
print(f'Audio sample 2:')
IPython.display.Audio(audio_sample_2['audio']['array'], rate=example['sampling_rate'])

Audio sample 2:


In [70]:
#create a random number to select from dataset
import random

random_number_1 = random.randint(0, len(dataset['audio']))
random_number_2 = random.randint(0, len(dataset['audio']))
print(f'First example: {random_number_1}')
print(f'Second example: {random_number_2}')

First example: 75
Second example: 11


## 4. Cluster the embeddings based on cosine similarity IN PROGRESS

In [71]:
import torch

def calculate_cosine_similarity(embedding_1, embedding_2):
    global similarity
    """
    Calculate cosine similarity between two embeddings.
    """
    cosine_similarity = torch.nn.functional.cosine_similarity(embedding_1, embedding_2, dim=1)
    similarity = cosine_similarity.item()
    return similarity

calculate_cosine_similarity(audio_embedding_1, audio_embedding_2)

1.0000001192092896

In [72]:
#load the dictionary with audio_embeddings to do 
for key, value in embeddings_dict.items():
    print(f"Key: {key}, Value: {value}")


Key: 0, Value: tensor([[ 7.0078e-02,  3.4412e-02, -2.7923e-02, -4.3773e-03,  2.4402e-02,
         -4.9248e-03,  9.0602e-03,  8.5705e-02,  2.4642e-02, -7.1067e-03,
         -4.3704e-02, -6.1173e-02, -3.5182e-03,  2.6701e-03,  1.0201e-01,
          6.2395e-02, -5.0307e-02,  7.1411e-03,  6.1110e-02,  2.7588e-02,
         -6.8656e-02, -4.5197e-02, -1.9453e-02,  5.9399e-02, -1.0194e-01,
          1.6907e-02, -5.9350e-02,  3.8986e-02,  1.8429e-02, -1.8185e-02,
         -5.7925e-02,  2.5480e-03,  1.6057e-02, -5.9625e-02,  4.0433e-02,
         -6.2270e-02, -3.3921e-03,  7.1809e-02,  2.9825e-03, -8.3548e-02,
          2.6152e-02, -3.0779e-02, -6.4825e-02, -5.8739e-02, -1.4845e-02,
          4.6709e-02, -5.6650e-02,  1.9142e-02,  4.4384e-02,  1.6380e-02,
         -3.4536e-02,  1.7028e-02, -6.9841e-02,  3.5079e-02,  1.4325e-02,
         -9.6263e-04,  5.4077e-02,  1.6110e-02, -3.7469e-02,  5.0608e-02,
          3.2914e-02, -1.0023e-02, -3.8195e-02, -1.9179e-02,  1.2755e-02,
         -3.1353e-02,  

In [73]:
len(embeddings_dict)

223

In [74]:
print(embeddings_dict[0])

tensor([[ 7.0078e-02,  3.4412e-02, -2.7923e-02, -4.3773e-03,  2.4402e-02,
         -4.9248e-03,  9.0602e-03,  8.5705e-02,  2.4642e-02, -7.1067e-03,
         -4.3704e-02, -6.1173e-02, -3.5182e-03,  2.6701e-03,  1.0201e-01,
          6.2395e-02, -5.0307e-02,  7.1411e-03,  6.1110e-02,  2.7588e-02,
         -6.8656e-02, -4.5197e-02, -1.9453e-02,  5.9399e-02, -1.0194e-01,
          1.6907e-02, -5.9350e-02,  3.8986e-02,  1.8429e-02, -1.8185e-02,
         -5.7925e-02,  2.5480e-03,  1.6057e-02, -5.9625e-02,  4.0433e-02,
         -6.2270e-02, -3.3921e-03,  7.1809e-02,  2.9825e-03, -8.3548e-02,
          2.6152e-02, -3.0779e-02, -6.4825e-02, -5.8739e-02, -1.4845e-02,
          4.6709e-02, -5.6650e-02,  1.9142e-02,  4.4384e-02,  1.6380e-02,
         -3.4536e-02,  1.7028e-02, -6.9841e-02,  3.5079e-02,  1.4325e-02,
         -9.6263e-04,  5.4077e-02,  1.6110e-02, -3.7469e-02,  5.0608e-02,
          3.2914e-02, -1.0023e-02, -3.8195e-02, -1.9179e-02,  1.2755e-02,
         -3.1353e-02,  5.2495e-03,  2.

In [75]:
#calculate the cosine similarity to each image in the list. This may take some time .
cos_sim_list =[]
for i in range(len(embeddings_dict)):
    sample1 =  embeddings_dict[i]
    for j in range(len(embeddings_dict)):
        sample2 = embeddings_dict[j]
        #get_audio_embeddings(sample1, sample2)
        calculate_cosine_similarity(sample1, sample2)
        cos_sim_list.append(round(similarity,8))

print(cos_sim_list)

[0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.9

In [76]:
max(cos_sim_list)

0.99999988

In [77]:

import numpy as np
num_rows = len(cos_sim_list)
cosine_similarity_matrix = np.array(cos_sim_list).reshape(num_rows, -1)
cosine_similarity_matrix.shape

(49729, 1)

## 5. Get the labels
Huggingface does not give you the option to name the labels in the dataset.
So we will run some code to get the right label names.

In [78]:
#print the label data
labels_list = dataset['label']
print(labels_list)
print(len(labels_list))

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8]
223


In [79]:
# This is code to convert the given labels (0,1,2,3,4,5,6,7,8) to a real string. 
# create a dictionary the converts the class folders to real names
label_dict ={0:'Gunshot', 1:'Moped alarm', 2:'Moped', 3:'Claxon', 4:'Slamming door', 5:'Screaming', 6:'Motorcycle', 7:'Talking', 8:'Music'}
print('The given labels are: ')
for i in range(0,9):
    print(label_dict[i])

The given labels are: 
Gunshot
Moped alarm
Moped
Claxon
Slamming door
Screaming
Motorcycle
Talking
Music


In [80]:
#now iterate trough the dict to get the labelnames based on the numbers
labelnames_list = [label_dict[key] for key in labels_list]

print(labelnames_list)

['Gunshot', 'Gunshot', 'Gunshot', 'Gunshot', 'Gunshot', 'Gunshot', 'Gunshot', 'Gunshot', 'Gunshot', 'Gunshot', 'Gunshot', 'Gunshot', 'Gunshot', 'Gunshot', 'Gunshot', 'Gunshot', 'Gunshot', 'Gunshot', 'Gunshot', 'Gunshot', 'Gunshot', 'Gunshot', 'Gunshot', 'Gunshot', 'Gunshot', 'Moped alarm', 'Moped alarm', 'Moped alarm', 'Moped alarm', 'Moped alarm', 'Moped alarm', 'Moped alarm', 'Moped alarm', 'Moped alarm', 'Moped alarm', 'Moped alarm', 'Moped alarm', 'Moped alarm', 'Moped alarm', 'Moped alarm', 'Moped alarm', 'Moped alarm', 'Moped alarm', 'Moped alarm', 'Moped alarm', 'Moped alarm', 'Moped alarm', 'Moped alarm', 'Moped', 'Moped', 'Moped', 'Moped', 'Moped', 'Moped', 'Moped', 'Moped', 'Moped', 'Moped', 'Moped', 'Moped', 'Moped', 'Moped', 'Moped', 'Moped', 'Moped', 'Moped', 'Moped', 'Moped', 'Moped', 'Moped', 'Moped', 'Moped', 'Moped', 'Claxon', 'Claxon', 'Claxon', 'Claxon', 'Claxon', 'Claxon', 'Claxon', 'Claxon', 'Claxon', 'Claxon', 'Claxon', 'Claxon', 'Claxon', 'Claxon', 'Claxon', 'Cla

## 6. Visualise the embeddings with PCA and t-SNE

In [81]:
#A function to get all the audio embeddings in numpy format
import numpy as np

def get_audio_embeddings_np(i):
    global embedding_np
    # Preprocess and encode the first image
    sample = dataset[i]
    inputs = processor(audios=sample["audio"]["array"], return_tensors="pt", sampling_rate=48000)
    embedding = model.get_audio_features(**inputs)
    
    embedding_np = np.array(embedding.detach().cpu().numpy())
    return embedding_np



In [82]:
#Create an array with all the audio embeddings (array of arrays)
combined_array = np.empty((0, 512))

for i in range(len(dataset)):
   get_audio_embeddings_np(i) 
   combined_array = np.vstack((combined_array, embedding_np))
   #print(combined_array.shape)

#Check the shape of the array and items in the array
print(combined_array.shape)
print(combined_array[0].shape)


(223, 512)
(512,)


In [83]:
# Normalize the embeddings
embeddings_np = combined_array / np.linalg.norm(combined_array, axis=1, keepdims=True)

from sklearn.decomposition import PCA
# Apply PCA to reduce dimensions (optional)
pca = PCA(n_components=2)  # Reduce to 2 dimensions for faster t-SNE
embeddings_pca = pca.fit_transform(embeddings_np)

# Apply t-SNE to reduce to 2D for visualization
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, random_state=42)
embeddings_2d = tsne.fit_transform(embeddings_pca)

In [84]:
import plotly.express as px
import numpy as np

# Assuming you have a list of class labels corresponding to each point
# If you don't have this, you'll need to create it based on your data
class_labels = labelnames_list  # Your list of class labels here, should be same length as embeddings_2d

# Create a color map for the 9 classes
color_map = px.colors.qualitative.Set1[:9]

fig = px.scatter(
    x=embeddings_2d[:, 0],
    y=embeddings_2d[:, 1],
    color=class_labels,
    color_discrete_sequence=color_map,
    opacity=0.7,
    title='Visualization of Urban Sounds dataset using CLAP embeddings and t-SNE',
    labels={'color': 'Class'}
)

fig.update_traces(marker=dict(size=8))

fig.show()
