# Audio classification on the UrbanSoundsII dataset with CLAP

### Goal of the notebook
This notebook performs zero-shot audio classification of urban sounds with CLAP (Contrastive Language-Audio Pretraining).

### CLAP
<img src="https://raw.githubusercontent.com/MichielBontenbal/run_CLAP/main/images/CLAP.jpg" width="600"/>

- Modelcard: https://huggingface.co/laion/larger_clap_general
- CLAP paper: https://arxiv.org/abs/2211.06687
- Reference: https://dataloop.ai/library/model/laion_larger_clap_music_and_speech/

In this notebook we will use two CLAP models:
1. larger_clap_music_and_speech
2. larger_clap_general

### How CLAP works

<img src="https://raw.githubusercontent.com/MichielBontenbal/run_CLAP/main/images/create_embeddings.jpg" width="600"/>

<img src= "https://raw.githubusercontent.com/MichielBontenbal/run_CLAP/main/images/CLAP_cos_sim.jpg" width='600'/>
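
The two diagrams above can be summarised in a few lines of code: the audio clip and every candidate label are embedded into the same vector space, and the label whose text embedding has the highest cosine similarity with the audio embedding wins. The sketch below is only an illustration of that idea. It uses made-up 3-dimensional vectors rather than real CLAP embeddings, and `scale` is a hypothetical stand-in for the model's learned temperature.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def softmax(values):
    """Turn a list of scores into probabilities that sum to 1."""
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Toy embeddings (hypothetical values, NOT produced by CLAP).
audio_embedding = [0.9, 0.1, 0.2]
text_embeddings = {
    "Motorcycle": [0.8, 0.2, 0.1],
    "Music": [0.1, 0.9, 0.3],
    "Talking": [0.2, 0.3, 0.9],
}

scale = 10.0  # hypothetical temperature, for illustration only

# One similarity per candidate label, then softmax over the scaled similarities.
similarities = {
    label: cosine_similarity(audio_embedding, embedding)
    for label, embedding in text_embeddings.items()
}
probabilities = softmax([scale * s for s in similarities.values()])
scores = dict(zip(similarities, probabilities))

best_label = max(scores, key=scores.get)
print(best_label)
```

The softmax step is why the scores reported later in the notebook add up to 1 across the candidate labels.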


### Use 🤗 datasets and 🤗 transformers
The dataset is hosted on the Hugging Face Hub at:

- https://huggingface.co/datasets/UrbanSounds/UrbanSoundsNew

This dataset contains nine classes of audio events in an urban environment. 

In this notebook we will use the 🤗 ```datasets``` library to load this dataset.

And we'll use the 🤗 ```transformers``` library to run the CLAP model. More information: https://huggingface.co/docs/transformers/model_doc/clap


### Contents
0. Install packages & check versions
1. Inspection of the dataset
2. CLAP on the Urban Sounds Amsterdam dataset


## 0. Install packages

In [1]:
#!pip install datasets

In [2]:
#!pip install soundfile

In [3]:
#!pip install librosa

In [4]:
#%pip install datasets\[audio\]

In [5]:
%pip install numpy==1.26

Note: you may need to restart the kernel to use updated packages.


In [6]:
#check python version
import platform
print(platform.python_version())

3.12.7


In [7]:
import numpy as np
print(f'numpy={np.__version__}')
import soundfile
print(f'soundfile={soundfile.__version__}')
import librosa
print(f'librosa={librosa.__version__}')
import IPython
print(f'ipython={IPython.__version__}')


numpy=1.26.0
soundfile=0.9.0
librosa=0.10.2.post1
ipython=8.27.0


In [8]:
import datasets
print(f'datasets={datasets.__version__}')
import transformers
print(f'transformers={transformers.__version__}')

datasets=3.1.0
transformers=4.46.3


## 1. Inspection of the dataset

In [9]:
from datasets import load_dataset
# Download the dataset from the Hugging Face Hub (cached after the first run)
dataset = load_dataset("MichielBontenbal/UrbanSoundsII")


README.md:   0%|          | 0.00/306 [00:00<?, ?B/s]

Resolving data files:   0%|          | 0/223 [00:00<?, ?it/s]

Downloading data:   0%|          | 0/223 [00:00<?, ?files/s]

Generating train split: 0 examples [00:00, ? examples/s]

In [10]:
# ESC-50 is one of the few other datasets for environmental sound classification.
# You could try it as an alternative:
#dataset = load_dataset("confit/esc50-demo", "fold1")

In [11]:
# Inspect the dataset
dataset

DatasetDict({
    train: Dataset({
        features: ['audio', 'label'],
        num_rows: 223
    })
})

In [12]:
print(dir(dataset))

['__class__', '__class_getitem__', '__contains__', '__delattr__', '__delitem__', '__dict__', '__dir__', '__doc__', '__enter__', '__eq__', '__exit__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__ior__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__or__', '__reduce__', '__reduce_ex__', '__repr__', '__reversed__', '__ror__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_check_values_features', '_check_values_type', 'align_labels_with_mapping', 'cache_files', 'cast', 'cast_column', 'class_encode_column', 'cleanup_cache_files', 'clear', 'column_names', 'copy', 'data', 'filter', 'flatten', 'flatten_indices', 'formatted_as', 'from_csv', 'from_json', 'from_parquet', 'from_text', 'fromkeys', 'get', 'items', 'keys', 'load_from_disk', 'map', 'num_columns', 'num_rows', 'pop', 'popitem', 'push_to_hub', 'remove_columns', 'rename_co

In [14]:
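# Inspect one sample from the train split.
# Indexing the row first decodes only this one audio file instead of the whole column.
example = dataset['train'][0]['audio']
label = dataset['train'][0]['label']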

You may notice that the audio column contains several features. Here’s what they are:

- path: the path to the downloaded (and converted) audio file.
- array: the decoded audio data, represented as a 1-dimensional NumPy array.
- sampling_rate: the sampling rate of the audio file in Hz.

In [15]:
example

{'path': '/Users/michielbontenbal/.cache/huggingface/hub/datasets--MichielBontenbal--UrbanSoundsII/snapshots/76bfaa6c9dcc58084f9109a20dd5a56c091ddcfb/01_gunshot/shot556_141_ch01_180718_163938_63_.wav',
 'array': array([-0.00015259, -0.00012207, -0.00021362, ...,  0.00015259,
         0.00018311,  0.        ]),
 'sampling_rate': 44100}

In [16]:
#print the label data
print(dataset['train']['label'])
print(len(dataset['train']['label']))

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8]
223
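
Because the labels are plain integers, a quick way to see how many clips each class has is `collections.Counter`. The sketch below uses a short made-up label list so that it runs on its own; in the notebook you would pass `dataset['train']['label']` instead.

```python
from collections import Counter

# Hypothetical stand-in for dataset['train']['label'] (illustrative values only).
labels = [0, 0, 0, 1, 1, 2, 2, 2, 2]

# Count how often each class index occurs.
class_counts = Counter(labels)
for class_id, count in sorted(class_counts.items()):
    print(f"class {class_id}: {count} clips")
```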


In [17]:
# Inspect the audio array of the sample loaded above
array = example["array"]
sampling_rate = example["sampling_rate"]
print(array.shape)
print(array)
print(type(array))
print(sampling_rate)

(441000,)
[-0.00015259 -0.00012207 -0.00021362 ...  0.00015259  0.00018311
  0.        ]
<class 'numpy.ndarray'>
44100
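
The shape and the sampling rate together give the clip length: the number of samples divided by the number of samples per second. Using the two values printed above:

```python
num_samples = 441_000    # array.shape[0] from the output above
sampling_rate = 44_100   # samples per second (Hz)

# Duration in seconds = samples / (samples per second)
duration_seconds = num_samples / sampling_rate
print(f"{duration_seconds:.1f} s")  # 10.0 s
```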


## 2. CLAP on the Urban Sounds Amsterdam dataset


In [28]:
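# Pick a random index into the dataset
import random

# randrange excludes the upper bound, so the index is always valid
# (randint(0, n) can return n itself, which is out of range)
random_number = random.randrange(len(dataset['train']))
random_number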

28
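
A detail worth knowing when drawing a random index: `random.randint(a, b)` includes both endpoints, whereas `random.randrange(n)` returns values from `0` to `n - 1` and therefore always yields a valid list index. A small sketch with a hypothetical dataset size shows the difference:

```python
import random

n_items = 5              # hypothetical dataset size, for illustration only
rng = random.Random(0)   # seeded generator so the run is reproducible

# Draw many indices with each function and collect the distinct values.
randint_values = {rng.randint(0, n_items) for _ in range(1000)}
randrange_values = {rng.randrange(n_items) for _ in range(1000)}

print(max(randint_values))    # can reach n_items, which is out of range as an index
print(max(randrange_values))  # never exceeds n_items - 1
```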

### Running it with the "Larger CLAP music and speech" model

In [19]:
# Classify the randomly selected sample with the zero-shot audio classification pipeline
from transformers import pipeline
from IPython.display import Audio

# Decode the selected sample once and reuse it
example = dataset['train'][random_number]['audio']
audio = example['array']

candidate_labels = ['Motorcycle', 'Moped', 'Claxon', 'Alarm', 'Silence', 'Loud people',
                    'Talking', 'Gunshot', 'Slamming door', 'Music']

audio_classifier = pipeline(task="zero-shot-audio-classification", model="laion/larger_clap_music_and_speech")
output = audio_classifier(audio, candidate_labels=candidate_labels)

# Show the two best-scoring labels, the sample index, and an audio player
print(output[0],'\n',output[1])
print(random_number)
Audio(example['array'], rate=example['sampling_rate'])

{'score': 0.8347903490066528, 'label': 'Motorcycle'} 
 {'score': 0.10060350596904755, 'label': 'Loud people'}
143
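
One caveat, which is an assumption about the pipeline's behaviour rather than something this notebook verifies: a raw NumPy array carries no sampling rate, so the pipeline presumably treats the samples as if they were already at the feature extractor's rate. That rate is assumed here to be 48 kHz, the default of `ClapFeatureExtractor`; `audio_classifier.feature_extractor.sampling_rate` shows the actual value. The clips in this dataset are 44.1 kHz, so under that assumption the model would hear them slightly sped up. The arithmetic below illustrates the size of the effect.

```python
num_samples = 441_000         # length of the example clip inspected earlier
dataset_rate = 44_100         # Hz, as reported by the dataset
assumed_model_rate = 48_000   # Hz, ASSUMED default of the CLAP feature extractor

true_duration = num_samples / dataset_rate
perceived_duration = num_samples / assumed_model_rate
speed_up = assumed_model_rate / dataset_rate

print(f"true duration:      {true_duration:.2f} s")
print(f"perceived duration: {perceived_duration:.2f} s")
print(f"speed-up factor:    {speed_up:.3f}")

# One way to avoid the mismatch (not run here) is to let the datasets library resample:
#   from datasets import Audio
#   dataset = dataset.cast_column("audio", Audio(sampling_rate=48_000))
```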


### Running it with the "Larger CLAP general" model

In [20]:
# Convert the integer labels (0 to 8) to readable class names.
# The dictionary maps each class index (one per class folder) to its name.
label_dict = {0: 'Gunshot', 1: 'Moped alarm', 2: 'Moped', 3: 'Claxon', 4: 'Slamming door',
              5: 'Screaming', 6: 'Motorcycle', 7: 'Talking', 8: 'Music'}
print('The given labels are: ')
for i in range(len(label_dict)):
    print(label_dict[i])

The given labels are: 
Gunshot
Moped alarm
Moped
Claxon
Slamming door
Screaming
Motorcycle
Talking
Music


In [30]:
# larger_clap_general
from transformers import pipeline
from IPython.display import Audio

# Decode the selected sample once and reuse it
example = dataset['train'][random_number]['audio']
audio = example['array']

candidate_labels = ['Gunshot', 'Moped', 'Moped alarm', 'Claxon', 'Screaming', 'Motorcycle',
                    'Talking', 'Slamming door', 'Music', 'Silence']

audio_classifier = pipeline(task="zero-shot-audio-classification", model="laion/larger_clap_general")
output = audio_classifier(audio, candidate_labels=candidate_labels)

predicted_label = output[0]['label']
print(f'Prediction: {predicted_label}')

# Look up the name of the given label for this sample and compare
label_name = label_dict[dataset['train']['label'][random_number]]
print(f'The given label: {label_name}')
if label_name == predicted_label:
    print("This is correct")
else:
    print('This is false')
print(f'Probability: {round(output[0]["score"],3)}')

Audio(example['array'], rate=example['sampling_rate'])

Prediction: Moped alarm
The given label: Moped alarm
This is correct
Probability: 0.999
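
The check above covers a single clip. To judge the model on more than one clip, the same comparison can be repeated and averaged. The sketch below assumes the pipeline output format seen earlier (a list of dicts with `score` and `label` keys, one list per clip). The `outputs` and `given_labels` values are made up for illustration, and `top1_accuracy` is a hypothetical helper, not part of `transformers`.

```python
def top1_accuracy(outputs, given_labels):
    """Fraction of clips whose highest-scoring label equals the given label."""
    correct = sum(
        max(output, key=lambda item: item["score"])["label"] == given
        for output, given in zip(outputs, given_labels)
    )
    return correct / len(given_labels)

# Made-up pipeline outputs for three clips (illustrative values only).
outputs = [
    [{"score": 0.90, "label": "Moped alarm"}, {"score": 0.10, "label": "Claxon"}],
    [{"score": 0.60, "label": "Talking"}, {"score": 0.40, "label": "Screaming"}],
    [{"score": 0.70, "label": "Music"}, {"score": 0.30, "label": "Silence"}],
]
given_labels = ["Moped alarm", "Screaming", "Music"]

accuracy = top1_accuracy(outputs, given_labels)
print(f"Top-1 accuracy: {accuracy:.2f}")
```

In the notebook, `outputs` would come from calling `audio_classifier` on each clip and `given_labels` from mapping `dataset['train']['label']` through `label_dict`.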
