# Visualising UrbanSoundsSamples with CLAP embeddings
# WORK IN PROGRESS

###  Goal of the notebook
To visualise the UrbansSoundsSamples dataset with CLAP and PCA/t-SNE.

### CLAP
CLAP (Contrastive Language-Audio Pretraining) is a neural network trained on a variety of (audio, text) pairs. It can be instructed in to predict the most relevant text snippet, given an audio, without directly optimizing for the task.

- Modelcard: https://huggingface.co/laion/larger_clap_general
- CLAP paper: https://arxiv.org/abs/2211.06687
- Reference: https://dataloop.ai/library/model/laion_larger_clap_music_and_speech/

In this notebook we will use CLAP model: larger_clap_general

### Using 🤗 datasets and 🤗transformers
The dataset is hosted on the Huggingface Hub at: https://huggingface.co/datasets/MichielBontenbal/UrbanSoundsII

(This is a new version of the same dataset as the old dataset got corrupted.)

This dataset contains nine classes of audio events in an urban environment. 

In this notebook we will use 🤗  ```dataset``` library to load this dataset. 

And we'll use the 🤗 ```transformers``` library to run the CLAP model. Please find more info: https://huggingface.co/docs/transformers/model_doc/clap 


### Contents
0. Install packages & check versions
1. Inspection of dataset
2. Get audio embeddings
5. Get the labels 
6. PCA and t-SNE on the dataset


## 0. Install packages

In [1]:
#!pip install datasets

In [2]:
#!pip install soundfile

In [3]:
#!pip install librosa

In [4]:
#%pip install datasets\[audio\]

In [5]:
#!pip install transformers

In [None]:
pip install numpy==1.26

In [None]:
#check python version
import platform
print(platform.python_version())


In [None]:
import numpy as np
print(f'numpy version: {np.__version__}')
import soundfile
print(f'soundfile version: {soundfile.__version__}')
import librosa
print(f'librosa version: {librosa.__version__}')
import IPython
print(f'IPython version: {IPython.__version__}')


In [None]:
import datasets
print(f'datasets version: {datasets.__version__}')
import transformers
print(f'transformers version: {transformers.__version__}')
import torch
print(f'torch version: {torch.__version__}')

## 1. Inspection of the dataset

In [None]:
from datasets import load_dataset

dataset =load_dataset("UrbanSounds/UrbanSoundsSamples", split="train")

In [None]:
#inspect the dataset
#dataset = ds
dataset

In [None]:
#Inspect one sample from 
example = dataset['audio'][0]
example

You may notice that the audio column contains several features. Here’s what they are:

- path: the path to the downloaded (and converted) audio file
- array: The decoded audio data, represented as a 1-dimensional NumPy array.
- sampling_rate. The sampling rate of the audio file.

In [None]:
#inspecting the audio array
array = dataset["audio"][0]["array"]
sampling_rate = example["sampling_rate"]
print(array.shape)
print(array)
print(type(array))
print(sampling_rate)

## 2. CLAP embeddings

In [1]:
#Import the dataset and define model, processor
from datasets import load_dataset
from transformers import ClapModel, ClapProcessor

# Load the model and processor
model = ClapModel.from_pretrained("laion/larger_clap_music_and_speech")
processor = ClapProcessor.from_pretrained("laion/larger_clap_music_and_speech")

dataset =load_dataset("UrbanSounds/UrbanSoundsSamples", split="train")

Resolving data files:   0%|          | 0/50 [00:00<?, ?it/s]

In [3]:
# Load an audio sample
audio_sample_1 = dataset[0]

# Preprocess the audio sample
inputs_1 = processor(audios=audio_sample_1["audio"]["array"], return_tensors="pt", sampling_rate=48000)

# Run the model
audio_embedding_1 = model.get_audio_features(**inputs_1)

In [None]:
print(audio_embedding_1.ndim)
print(audio_embedding_1.shape)
print(audio_embedding_1.dtype)
print(type(audio_embedding_1))
#print(audio_embedding_1)

In [5]:
# Now do the same for another audio sample
audio_sample_2 = dataset[0]
inputs_2 = processor(audios=audio_sample_2["audio"]["array"], return_tensors="pt", sampling_rate=48000)
audio_embedding_2 = model.get_audio_features(**inputs_2)

In [None]:
#A function to get all the audio embeddings and store them as individual .pt files
def get_audio_embeddings(i):
    global embedding
    # Preprocess and encode the first image
    sample = dataset[i]
    inputs = processor(audios=sample["audio"]["array"], return_tensors="pt", sampling_rate=48000)
    embedding = model.get_audio_features(**inputs)
    torch.save(embedding, 'embedding'+str(i)+'.pt')
    return embedding

for i in range(len(dataset)):
    get_audio_embeddings(i) 

In [11]:
#Load all audio embeddings into a python dict
import glob
embeddings_files = glob.glob('*.pt')

#load all files as embeddings
embeddings_list=[]
for i in range(len(embeddings_files)):
    embeddings_list.append(torch.load('embedding'+str(1)+'.pt'))

embeddings_dict = dict(enumerate(embeddings_list))
#embeddings_dict

In [None]:
#printing an example to check it
embeddings_dict[5]

## 3. Calculate cosine similarity

In [13]:
audio_embeddding_1 = torch.load('embedding1.pt')
#audio_embedding_1

In [14]:
audio_embeddding_2 = torch.load('embedding2.pt')
#audio_embedding_2

In [None]:
import torch
import torch.nn.functional

# Calculate cosine similarity
cosine_similarity = torch.nn.functional.cosine_similarity(audio_embedding_1, audio_embedding_2, dim=1)

print(f"Cosine Similarity: {cosine_similarity.item()}")

In [None]:
#Inspect the result by listening to it
import IPython
print(f'Audio sample 1:')
IPython.display.Audio(audio_sample_1['audio']['array'], rate=example['sampling_rate'])

In [None]:
print(f'Audio sample 2:')
IPython.display.Audio(audio_sample_2['audio']['array'], rate=example['sampling_rate'])

In [None]:
#create a random number to select from dataset
import random

random_number_1 = random.randint(0, len(dataset['audio']))
random_number_2 = random.randint(0, len(dataset['audio']))
print(f'First example: {random_number_1}')
print(f'Second example: {random_number_2}')

## 5. Get the labels
Huggingface does not give you the option to name the labels in the dataset.
So we will run some code to get the right label names.

In [None]:
# This is code to convert the given labels (0,1,2,3,4,5,6,7,8) to a real string. 
# create a dictionary the converts the class folders to real names
label_dict ={0:'Gunshot', 1:'Moped alarm', 2:'Moped', 3:'Claxon', 4:'Slamming door', 5:'Screaming', 6:'Motorcycle', 7:'Talking', 8:'Music'}
print('The given labels are: ')
for i in range(0,9):
    print(label_dict[i])

## 6. Visualise the embeddings with PCA and t-SNE

In [2]:
#A function to get all the audio embeddings in numpy format
import torch
import numpy as np

def get_audio_embeddings_np(i):
    global embedding_np
    # Preprocess and encode the first image
    sample = dataset[i]
    inputs = processor(audios=sample["audio"]["array"], return_tensors="pt", sampling_rate=48000)
    embedding = model.get_audio_features(**inputs)
    
    embedding_np = np.array(embedding.detach().cpu().numpy())
    return embedding_np



In [3]:
#Create an array with all the audio embeddings (array of arrays)
combined_array = np.empty((0, 512))

for i in range(len(dataset)-1):
   get_audio_embeddings_np(i) 
   combined_array = np.vstack((combined_array, embedding_np))
   #print(combined_array.shape)

#Check the shape of the array and items in the array
print(combined_array.shape)
print(combined_array[0].shape)


(49, 512)
(512,)


In [4]:
# Normalize the embeddings
embeddings_np = combined_array / np.linalg.norm(combined_array, axis=1, keepdims=True)


In [5]:
from sklearn.decomposition import PCA
# Apply PCA to reduce dimensions (optional)
pca = PCA(n_components=2)  # Reduce to 2 dimensions for faster t-SNE
embeddings_pca = pca.fit_transform(embeddings_np)



In [6]:
# Apply t-SNE to reduce to 2D for visualization
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, random_state=42)
embeddings_2d = tsne.fit_transform(embeddings_pca)

Found Intel OpenMP ('libiomp') and LLVM OpenMP ('libomp') loaded at
the same time. Both libraries are known to be incompatible and this
can cause random crashes or deadlocks on Linux when loaded in the
same Python program.
Using threadpoolctl may cause crashes or deadlocks. For more
information and possible workarounds, please see
    https://github.com/joblib/threadpoolctl/blob/master/multiple_openmp.md



In [8]:
import plotly.express as px
import numpy as np

# Assuming you have a list of class labels corresponding to each point
# If you don't have this, you'll need to create it based on your data
#class_labels = labelnames_list  # Your list of class labels here, should be same length as embeddings_2d

# Create a color map for the 9 classes
color_map = px.colors.qualitative.Set1[:9]

fig = px.scatter(
    x=embeddings_2d[:, 0],
    y=embeddings_2d[:, 1],
    #color=class_labels,
    color_discrete_sequence=color_map,
    opacity=0.7,
    title='Visualization of Urban Sounds dataset using CLAP embeddings and t-SNE',
    #labels={'color': 'Class'}
)

fig.update_traces(marker=dict(size=8))

fig.show()


##  Manually labeling the samples

In [17]:
#Code to label one sample
#To do: create a neat function and loop through the set of .wavs
import ipywidgets as widgets
from IPython.display import display, Audio
import numpy as np

# Load your audio file (replace 'your_audio_file.wav' with your actual file path)
#audio_file = "shot556.70.ch01.180718.162941.68..wav"
audio_file = "shot556.203.ch01.180718.165052.67..wav"
# Create an Audio widget
audio = Audio(audio_file)

# Create an input widget
text_input = widgets.Text(
    value='',
    placeholder='Enter your label here',
    description='Label:',
    disabled=False
)

# Create a button widget
button = widgets.Button(description="Submit")
output = widgets.Output()

# Dictionary to store the sample and input
#sample_dict = {'audio': audio_file, 'label': ''}

# Define button click event
def on_button_clicked(b):
    with output:

        output.clear_output()
        sample_dict['label'] = text_input.value
        print(f"Label '{text_input.value}' has been added to the dictionary.")

# Attach the function to the button
button.on_click(on_button_clicked)

# Display everything
display(audio)
display(text_input)
display(button)
display(output)


Text(value='', description='Label:', placeholder='Enter your label here')

Button(description='Submit', style=ButtonStyle())

Output()

In [18]:
sample_dict

{'audio': 'shot556.70.ch01.180718.162941.68..wav', 'label': 'gunshot'}