# Audio classification on the Urban SoundsII dataset with CLAP

###  Goal of the notebook
In this notebook you can do audio classification with CLAP.

### CLAP
CLAP (Contrastive Language-Audio Pretraining) is a neural network trained on a variety of (audio, text) pairs. It can be instructed in to predict the most relevant text snippet, given an audio, without directly optimizing for the task.

- Modelcard: https://huggingface.co/laion/larger_clap_general
- CLAP paper: https://arxiv.org/abs/2211.06687
- Reference: https://dataloop.ai/library/model/laion_larger_clap_music_and_speech/

In this notebook we will use two CLAP models:
1. larger_clap_music_and_speech
2. larger_clap_general

### Using 🤗 datasets and 🤗transformers
The dataset is hosted on the Huggingface Hub at: https://huggingface.co/datasets/MichielBontenbal/UrbanSoundsII

(This is a new version of the same dataset as the old dataset got corrupted.)

This dataset contains nine classes of audio events in an urban environment. 

In this notebook we will use 🤗  ```dataset``` library to load this dataset. 

And we'll use the 🤗 ```transformers``` library to run the CLAP model. Please find more info: https://huggingface.co/docs/transformers/model_doc/clap 


### Contents
0. Install packages & check versions
1. Inspection of dataset
2. Get audio embeddings
3. Cosine similarity for audio embeddings
4. Cluster audio embeddings (to do)
5. Get text embeddings (to do)


## 0. Install packages

In [2]:
#!pip install datasets

In [3]:
#!pip install soundfile

In [4]:
#!pip install librosa

In [5]:
#%pip install datasets\[audio\]

In [None]:
#!pip install transformers



In [7]:
pip install numpy==1.26

Note: you may need to restart the kernel to use updated packages.


In [8]:
#check python version
import platform
print(platform.python_version())


3.12.7


In [1]:
import numpy as np
print(f'numpy version: {np.__version__}')
import soundfile
print(f'soundfile version: {soundfile.__version__}')
import librosa
print(f'librosa version: {librosa.__version__}')
import IPython
print(f'IPython version: {IPython.__version__}')


numpy version: 1.26.0
soundfile version: 0.12.1
librosa version: 0.10.2.post1
IPython version: 8.27.0


In [2]:
import datasets
print(f'datasets version: {datasets.__version__}')
import transformers
print(f'transformers version: {transformers.__version__}')
import torch
print(f'torch version: {torch.__version__}')

datasets version: 3.1.0
transformers version: 4.46.3
torch version: 2.2.2


## 1. Inspection of the dataset

In [44]:
from datasets import load_dataset

dataset = load_dataset("MichielBontenbal/UrbanSoundsII")


Resolving data files:   0%|          | 0/223 [00:00<?, ?it/s]

In [12]:
#The ESC50 dataset is one of the very few other datasets on Environmental Sound classification
#You could try this as an alternative
#dataset = load_dataset("confit/esc50-demo", "fold1")

In [45]:
#inspect the dataset
#dataset = ds
dataset

DatasetDict({
    train: Dataset({
        features: ['audio', 'label'],
        num_rows: 223
    })
})

In [46]:
#Inspect one sample from 
example = dataset['train']['audio'][0]
label = dataset['train']['label'][0]

You may notice that the audio column contains several features. Here’s what they are:

- path: the path to the downloaded (and converted) audio file
- array: The decoded audio data, represented as a 1-dimensional NumPy array.
- sampling_rate. The sampling rate of the audio file.

In [47]:
example

{'path': '/Users/michielbontenbal/.cache/huggingface/datasets/downloads/c6de4f4db3f3eb3f95189e05acc6bbd4db6e00537183441353e0477e89b96e45',
 'array': array([-0.00015259, -0.00012207, -0.00021362, ...,  0.00015259,
         0.00018311,  0.        ]),
 'sampling_rate': 44100}

In [52]:
#inspecting the audio array
array = dataset["train"]["audio"][0]["array"]
sampling_rate = example["sampling_rate"]
print(array.shape)
print(array)
print(type(array))
print(sampling_rate)

(441000,)
[-0.00015259 -0.00012207 -0.00021362 ...  0.00015259  0.00018311
  0.        ]
<class 'numpy.ndarray'>
44100


In [48]:
#print the label data
print(dataset['train']['label'])
print(len(dataset['train']['label']))

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8]
223


## 2. CLAP embeddings

In [53]:
from datasets import load_dataset
from transformers import ClapModel, ClapProcessor

# Load the model and processor
model = ClapModel.from_pretrained("laion/larger_clap_music_and_speech")
processor = ClapProcessor.from_pretrained("laion/larger_clap_music_and_speech")

dataset =load_dataset("MichielBontenbal/UrbanSoundsII", split="train")

Resolving data files:   0%|          | 0/223 [00:00<?, ?it/s]

In [54]:
# Load an audio sample
audio_sample_1 = dataset[0]

# Preprocess the audio sample
inputs_1 = processor(audios=audio_sample_1["audio"]["array"], return_tensors="pt", sampling_rate=48000)

# Run the model
audio_embedding_1 = model.get_audio_features(**inputs_1)

In [58]:
print(audio_embedding_1.ndim)
print(audio_embedding_1.shape)
print(audio_embedding_1.dtype)
print(type(audio_embedding_1))
print(audio_embedding_1)

2
torch.Size([1, 512])
torch.float32
<class 'torch.Tensor'>
tensor([[ 6.9978e-02,  4.9438e-02, -3.2051e-02, -3.1273e-02,  1.8951e-02,
         -1.2967e-02,  4.1404e-02,  1.0382e-01,  9.0262e-03, -2.6426e-02,
         -3.5315e-02, -4.3891e-02,  2.5547e-02,  6.2200e-04,  9.4766e-02,
          6.6196e-02, -6.1133e-02,  1.4466e-02,  4.4434e-02,  1.6863e-02,
         -8.5666e-02, -2.1581e-02, -2.3259e-02,  5.2551e-02, -1.1534e-01,
          6.2922e-03, -3.3362e-02,  2.8030e-02, -5.6451e-03, -3.5905e-02,
         -5.2743e-02, -6.9354e-03,  1.2081e-02, -6.1858e-02,  1.8697e-02,
         -4.3829e-02, -3.1626e-02,  7.7767e-02,  8.0904e-03, -9.2447e-02,
          4.3738e-02, -1.8235e-02, -8.6115e-02, -5.5174e-02, -7.1279e-03,
          3.9049e-02, -8.4117e-02,  3.5187e-02,  3.3596e-02,  2.5555e-02,
         -3.1600e-02,  3.2825e-02, -5.9688e-02,  1.8258e-02,  2.8436e-02,
          1.6404e-02,  5.2465e-02,  2.7770e-02, -7.3164e-02,  4.1530e-02,
          5.2760e-03, -1.7303e-02, -4.9916e-02, -1.5

In [12]:
# Now do the same for another audio sample
audio_sample_2 = dataset[0]
inputs_2 = processor(audios=audio_sample_2["audio"]["array"], return_tensors="pt", sampling_rate=48000)
audio_embedding_2 = model.get_audio_features(**inputs_2)

In [13]:
#TO DO CREATE A 
#function to get all the audio embeddings and store them
def get_audio_embeddings(i):
    global embedding
    # Preprocess and encode the first image
    sample = dataset[i]
    inputs = processor(audios=sample["audio"]["array"], return_tensors="pt", sampling_rate=48000)
    embedding = model.get_audio_features(**inputs)
    torch.save(embedding, 'embedding'+str(i)+'.pt')
    return embedding

for i in range(len(dataset)):
    get_audio_embeddings(i) 

In [14]:
#Load all audio embeddings into a python dict
import glob
embeddings_files = glob.glob('*.pt')

#load all files as embeddings
embeddings_list=[]
for i in range(len(embeddings_files)):
    embeddings_list.append(torch.load('embedding'+str(1)+'.pt'))

embeddings_dict = dict(enumerate(embeddings_list))
embeddings_dict

{0: tensor([[ 7.0078e-02,  3.4412e-02, -2.7923e-02, -4.3773e-03,  2.4402e-02,
          -4.9248e-03,  9.0602e-03,  8.5705e-02,  2.4642e-02, -7.1067e-03,
          -4.3704e-02, -6.1173e-02, -3.5182e-03,  2.6701e-03,  1.0201e-01,
           6.2395e-02, -5.0307e-02,  7.1411e-03,  6.1110e-02,  2.7588e-02,
          -6.8656e-02, -4.5197e-02, -1.9453e-02,  5.9399e-02, -1.0194e-01,
           1.6907e-02, -5.9350e-02,  3.8986e-02,  1.8429e-02, -1.8185e-02,
          -5.7925e-02,  2.5480e-03,  1.6057e-02, -5.9625e-02,  4.0433e-02,
          -6.2270e-02, -3.3921e-03,  7.1809e-02,  2.9825e-03, -8.3548e-02,
           2.6152e-02, -3.0779e-02, -6.4825e-02, -5.8739e-02, -1.4845e-02,
           4.6709e-02, -5.6650e-02,  1.9142e-02,  4.4384e-02,  1.6380e-02,
          -3.4536e-02,  1.7028e-02, -6.9841e-02,  3.5079e-02,  1.4325e-02,
          -9.6263e-04,  5.4077e-02,  1.6110e-02, -3.7469e-02,  5.0608e-02,
           3.2914e-02, -1.0023e-02, -3.8195e-02, -1.9179e-02,  1.2755e-02,
          -3.1353e-02,

In [15]:
#printing an example to check it
embeddings_dict[5]

tensor([[ 7.0078e-02,  3.4412e-02, -2.7923e-02, -4.3773e-03,  2.4402e-02,
         -4.9248e-03,  9.0602e-03,  8.5705e-02,  2.4642e-02, -7.1067e-03,
         -4.3704e-02, -6.1173e-02, -3.5182e-03,  2.6701e-03,  1.0201e-01,
          6.2395e-02, -5.0307e-02,  7.1411e-03,  6.1110e-02,  2.7588e-02,
         -6.8656e-02, -4.5197e-02, -1.9453e-02,  5.9399e-02, -1.0194e-01,
          1.6907e-02, -5.9350e-02,  3.8986e-02,  1.8429e-02, -1.8185e-02,
         -5.7925e-02,  2.5480e-03,  1.6057e-02, -5.9625e-02,  4.0433e-02,
         -6.2270e-02, -3.3921e-03,  7.1809e-02,  2.9825e-03, -8.3548e-02,
          2.6152e-02, -3.0779e-02, -6.4825e-02, -5.8739e-02, -1.4845e-02,
          4.6709e-02, -5.6650e-02,  1.9142e-02,  4.4384e-02,  1.6380e-02,
         -3.4536e-02,  1.7028e-02, -6.9841e-02,  3.5079e-02,  1.4325e-02,
         -9.6263e-04,  5.4077e-02,  1.6110e-02, -3.7469e-02,  5.0608e-02,
          3.2914e-02, -1.0023e-02, -3.8195e-02, -1.9179e-02,  1.2755e-02,
         -3.1353e-02,  5.2495e-03,  2.

## 3. Calculate cosine similarity

In [43]:
audio_embeddding_1 = torch.load('embedding1.pt')
audio_embedding_1

NameError: name 'audio_embedding_1' is not defined

In [None]:
audio_embeddding_2 = torch.load('embedding2.pt')
audio_embedding_2

In [24]:
import torch
import torch.nn.functional

# Calculate cosine similarity
cosine_similarity = torch.nn.functional.cosine_similarity(audio_embedding_1, audio_embedding_2, dim=1)

print(f"Cosine Similarity: {cosine_similarity.item()}")

Cosine Similarity: 1.0000001192092896


In [25]:
#Inspect the result by listening to it
import IPython
print(f'Audio sample 1:')
IPython.display.Audio(audio_sample_1['audio']['array'], rate=example['sampling_rate'])

Audio sample 1:


In [26]:
print(f'Audio sample 2:')
IPython.display.Audio(audio_sample_2['audio']['array'], rate=example['sampling_rate'])

Audio sample 2:


In [27]:
#create a random number to select from dataset
import random

random_number_1 = random.randint(0, len(dataset['audio']))
random_number_2 = random.randint(0, len(dataset['audio']))
print(f'First example: {random_number_1}')
print(f'Second example: {random_number_2}')

First example: 157
Second example: 137


## 4. Cluster the embeddings based on cosine similarity IN PROGRESS

In [30]:
import torch

def calculate_cosine_similarity(embedding_1, embedding_2):
    global similarity
    """
    Calculate cosine similarity between two embeddings.
    """
    cosine_similarity = torch.nn.functional.cosine_similarity(embedding_1, embedding_2, dim=1)
    similarity = cosine_similarity.item()
    return similarity

calculate_cosine_similarity(audio_embedding_1, audio_embedding_2)

1.0000001192092896

In [None]:
#load the dictionary with audio_embeddings to do 
for key, value in embeddings_dict.items():
    print(f"Key: {key}, Value: {value}")


In [50]:
len(embeddings_dict)

223

In [43]:
print(embeddings_dict[0])

audio_embedding_0


In [60]:
#calculate the cosine similarity to each image in the list. This may take some time .
cos_sim_list =[]
for i in range(len(embeddings_dict)):
    sample1 =  embeddings_dict[i]
    for j in range(len(embeddings_dict)):
        sample2 = embeddings_dict[j]
        #get_audio_embeddings(sample1, sample2)
        calculate_cosine_similarity(sample1, sample2)
        cos_sim_list.append(round(similarity,8))

print(cos_sim_list)

[0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.99999988, 0.9

In [61]:
max(cos_sim_list)

0.99999988

In [62]:

import numpy as np
num_rows = len(cos_sim_list)
cosine_similarity_matrix = np.array(cos_sim_list).reshape(num_rows, -1)
cosine_similarity_matrix.shape

(49729, 1)

## TO DO CLUSTERING - THIS IS TOO LARGE

In [None]:
# Do the clustering with the K-Means algo and show it using Matplotlib. 
# Warning: this is a little different because we do clustering based on 1 Dimension (instead of 2 or more)
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Convert cosine similarity to distance (1 - similarity)
distance_matrix = 1 - cosine_similarity_matrix

# Choose the number of clusters (k)
num_clusters = 9

# Perform K-Means clustering
kmeans = KMeans(n_clusters=num_clusters)
kmeans.fit(distance_matrix)

# Get cluster labels
labels = kmeans.labels_

# Print the cluster assignments
for i, label in enumerate(labels):
    print(f"Image {i} is in cluster {label}")

# Optional: Visualize the clustering result
# Here we assume you have 2D data, for visualization purposes only
plt.scatter(distance_matrix[:, 0], distance_matrix[:, 1], c=labels)
plt.title('Clustering of Images based on Cosine Similarity')
plt.xlabel('Dimension 1')
plt.ylabel('Dimension 2')
plt.show()

## 5. Getting text embeddings TO DO

In [None]:
# This is code to convert the given labels (0,1,2,3,4,5,6,7,8,9) to a real string. 
# create a dictionary the converts the class folders to real names
label_dict ={0:'Gunshot', 1:'Moped alarm', 2:'Moped', 3:'Claxon', 4:'Slamming door', 5:'Screaming', 6:'Motorcycle', 7:'Talking', 8:'Music'}
print('The given labels are: ')
for i in range(0,9):
    print(label_dict[i])

The given labels are: 
Gunshot
Moped alarm
Moped
Claxon
Slamming door
Screaming
Motorcycle
Talking
Music


## 6. t-SNE (in progress, check datatypes)

We should have a vector of shape {2,512} for t-SNE. 

In [24]:
import glob
my_pts = glob.glob('*.pt')
my_pts

['embedding99.pt',
 'embedding209.pt',
 'embedding198.pt',
 'embedding188.pt',
 'embedding219.pt',
 'embedding89.pt',
 'embedding8.pt',
 'embedding98.pt',
 'embedding199.pt',
 'embedding208.pt',
 'embedding218.pt',
 'embedding189.pt',
 'embedding88.pt',
 'embedding9.pt',
 'embedding155.pt',
 'embedding60.pt',
 'embedding104.pt',
 'embedding31.pt',
 'embedding93.pt',
 'embedding130.pt',
 'embedding54.pt',
 'embedding161.pt',
 'embedding192.pt',
 'embedding203.pt',
 'embedding2.pt',
 'embedding213.pt',
 'embedding182.pt',
 'embedding171.pt',
 'embedding44.pt',
 'embedding120.pt',
 'embedding15.pt',
 'embedding83.pt',
 'embedding21.pt',
 'embedding114.pt',
 'embedding70.pt',
 'embedding145.pt',
 'embedding50.pt',
 'embedding165.pt',
 'embedding134.pt',
 'embedding207.pt',
 'embedding196.pt',
 'embedding100.pt',
 'embedding35.pt',
 'embedding151.pt',
 'embedding64.pt',
 'embedding97.pt',
 'embedding6.pt',
 'embedding87.pt',
 'embedding74.pt',
 'embedding141.pt',
 'embedding25.pt',
 'embedd

In [25]:
#inspect one embedding
embedding0 = torch.load('embedding0.pt')
print(embedding0.shape)
print(embedding0.ndim)
print(embedding0.size)
print(type(embedding0))

torch.Size([1, 512])
2
<built-in method size of Tensor object at 0x12d4fd9a0>
<class 'torch.Tensor'>


In [42]:
new_emb = torch.squeeze(embedding0)
new_emb.shape

torch.Size([512])

In [None]:
import torch
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Assuming you have a list of embeddings, each of shape torch.Size([1, 512])
embeddings_list = my_pts  # Your list of embeddings goes here

# Step 1: Convert the list of PyTorch tensors to a single NumPy array
embeddings_loaded =[]
for embedding in embeddings_list:
    embeddings_loaded.append(torch.load(embedding))
    #embeddings_array = np.array([embedding1.cpu().detach().numpy()])

embeddings_loaded


[tensor([[-0.0688, -0.0042,  0.0452,  0.0324, -0.0221, -0.0275, -0.0299,  0.0352,
           0.0375, -0.1065,  0.0047,  0.0513, -0.0500,  0.0061,  0.0124,  0.0149,
          -0.0389,  0.0109, -0.0150,  0.0295, -0.0528, -0.0045, -0.0284,  0.0510,
           0.0158,  0.0360, -0.0409,  0.0519,  0.0655,  0.0363, -0.0091,  0.0230,
          -0.0131,  0.0318,  0.0093, -0.0067,  0.0175, -0.0472,  0.0015, -0.0677,
           0.0778,  0.0151,  0.0735, -0.0074, -0.0229, -0.0020, -0.0455, -0.0601,
          -0.0050,  0.0489, -0.0440,  0.0385, -0.0035,  0.0784,  0.0914, -0.0622,
           0.1223,  0.0182, -0.0353,  0.0401,  0.0672, -0.0473,  0.0204, -0.0461,
          -0.0380, -0.0046,  0.0050, -0.0025,  0.0249, -0.0287,  0.0688,  0.0114,
          -0.0050,  0.0334,  0.0107,  0.0417, -0.0476,  0.0425, -0.0133,  0.0092,
          -0.0327, -0.0362,  0.0189,  0.0599, -0.0408,  0.0131,  0.0203,  0.0349,
           0.0883, -0.0569, -0.0273, -0.0741, -0.0104, -0.0206, -0.0305, -0.0090,
          -0.035

In [37]:
print(type(embeddings_loaded))
print(embeddings_loaded[0].ndim)
print(embeddings_loaded[0].shape)
print(embeddings_loaded[0].size)

<class 'list'>
2
torch.Size([1, 512])
<built-in method size of Tensor object at 0x12d4e0640>


In [38]:
#create a numpy array with shape

import numpy as np
import glob

file_list = glob.glob('*.pt')
arrays = [np.fromfile(file, dtype=float) for file in file_list]
combined_array = np.concatenate(arrays)

In [40]:
print(combined_array.shape)

(90528,)


In [34]:
concatenated_tensor = torch.cat(embeddings_loaded, dim=0)
concatenated_tensor.shape

torch.Size([223, 512])

In [36]:
reshaped_tensor = embeddings_loaded.reshape(2, 512)
reshaped_tensor.shape

AttributeError: 'list' object has no attribute 'reshape'

In [18]:
#print(embeddings_array)
print(embeddings_array.ndim)
print(embeddings_array.dtype)
print(type(embeddings_array))
print(embeddings_array.shape)

3
float32
<class 'numpy.ndarray'>
(1, 1, 512)


In [16]:
print(type(embeddings_array[0]))
print(embeddings_array[0].ndim)
print(embeddings_array[0].size)
print(embeddings_array[0].shape)

<class 'numpy.ndarray'>
2
512
(1, 512)


In [8]:
print(embeddings_array.shape[0])
n_samples = embeddings_array.shape[0]
perplexity = min(30, n_samples - 1) 
perplexity = float(perplexity)
perplexity

1


0.0

In [13]:
# Step 2: Perform t-SNE
for_tsne = embeddings_array[0]

tsne = TSNE(n_components=2, random_state=42, perplexity=0.1) 
embeddings_tsne = tsne.fit_transform(for_tsne)


ValueError: Found array with 1 sample(s) (shape=(1, 512)) while a minimum of 2 is required by TSNE.