# Real World UrbanSoundsSamples dataset with CLAP

### Goal of the notebook
In this notebook we perform zero-shot audio classification with CLAP.
We use real-world samples collected in the fall of 2024 and analyse 50 of them.

### CLAP
CLAP (Contrastive Language-Audio Pretraining) is a neural network trained on a variety of (audio, text) pairs. It can be instructed in natural language to predict the most relevant text snippet for a given audio clip, without being directly optimized for that task.

Source: https://huggingface.co/laion/larger_clap_general
CLAP paper: https://arxiv.org/abs/2211.06687
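
To make the idea concrete, here is a minimal sketch of how a contrastive model can turn embeddings into class scores: compute the cosine similarity between one audio embedding and each text embedding, then apply a softmax. The 3-dimensional embeddings and the ```logit_scale``` value are made up for illustration; real CLAP embeddings come from the model's audio and text encoders.

```python
import numpy as np

def zero_shot_scores(audio_emb, text_embs, logit_scale=1.0):
    """Return one probability per text prompt for a single audio embedding."""
    # L2-normalise so that the dot product equals the cosine similarity
    audio_emb = audio_emb / np.linalg.norm(audio_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = logit_scale * (text_embs @ audio_emb)
    # Numerically stable softmax over the candidate prompts
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

# Toy embeddings, invented for illustration only
audio_emb = np.array([0.9, 0.1, 0.0])
text_embs = np.array([
    [1.0, 0.0, 0.0],  # e.g. "Moped"
    [0.0, 1.0, 0.0],  # e.g. "Music"
    [0.0, 0.0, 1.0],  # e.g. "Talking"
])
scores = zero_shot_scores(audio_emb, text_embs, logit_scale=10.0)
print(scores.round(3))
```

The prompt whose embedding points in the most similar direction to the audio embedding receives the highest score.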

In this notebook we will use two CLAP models:
1. Larger CLAP music and speech (```laion/larger_clap_music_and_speech```)
2. Larger CLAP general (```laion/larger_clap_general```)

In general, I believe the Larger CLAP general model gives better results.

### Using 🤗 datasets and 🤗 transformers
The dataset is hosted on the Hugging Face Hub at: https://huggingface.co/datasets/MichielBontenbal/UrbanSoundsII

(This is a new version of the dataset; the old version got corrupted.)

This dataset contains nine classes of audio events in an urban environment. 

In this notebook we will use the 🤗 ```datasets``` library to load this dataset.

And we'll use the 🤗 ```transformers``` library to run the CLAP model. Please find more info: https://huggingface.co/docs/transformers/model_doc/clap 


### Contents
0. Install packages & check versions
1. Inspection of the dataset
2. Testing one sample of the UrbanSoundsSamples dataset
3. Generating results for the whole dataset


## 0. Install packages

In [5]:
#!pip install datasets

In [6]:
#!pip install soundfile

In [7]:
#!pip install librosa

In [8]:
#%pip install datasets\[audio\]

In [None]:
%pip install numpy==1.26

In [None]:
#check python version
import platform
print(platform.python_version())

In [None]:
import numpy as np
print(f'numpy={np.__version__}')
import soundfile
print(f'soundfile={soundfile.__version__}')
import librosa
print(f'librosa={librosa.__version__}')
import IPython
print(f'ipython={IPython.__version__}')


In [None]:
import datasets
print(f'datasets={datasets.__version__}')
import transformers
print(f'transformers={transformers.__version__}')

## 1. Inspection of the dataset

In [None]:
from datasets import load_dataset
#Load the train split from the Hugging Face Hub
dataset = load_dataset("UrbanSounds/UrbanSoundsSamples", split='train')


In [14]:
#The ESC50 dataset is one of the very few other datasets on Environmental Sound classification
#You could try this as an alternative
#dataset = load_dataset("confit/esc50-demo", "fold1")

In [None]:
#inspect the dataset
#dataset = ds
dataset

In [None]:
len(dataset)

In [None]:
#Inspect one sample from the dataset
example = dataset['audio'][0]
example

You may notice that the audio column contains several features. Here’s what they are:

- path: the path to the downloaded (and converted) audio file.
- array: the decoded audio data, represented as a 1-dimensional NumPy array.
- sampling_rate: the sampling rate of the audio file.
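
As a small illustration of how these fields relate, the duration of a clip is the number of samples in ```array``` divided by ```sampling_rate```. The sketch below uses a hypothetical stand-in dictionary (two seconds of silence at 16 kHz) rather than a real sample from the dataset.

```python
import numpy as np

# Hypothetical stand-in for one entry of the audio column
fake_example = {
    "path": None,
    "array": np.zeros(32000, dtype=np.float32),
    "sampling_rate": 16000,
}

num_samples = len(fake_example["array"])
duration_s = num_samples / fake_example["sampling_rate"]
print(f"{num_samples} samples at {fake_example['sampling_rate']} Hz = {duration_s:.2f} s")
```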

In [None]:
#inspecting the audio array
array = dataset["audio"][0]["array"]
sampling_rate = example["sampling_rate"]
print(array.shape)
print(array)
print(type(array))
print(sampling_rate)
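
One thing to check here is the sampling rate. As far as I know, the zero-shot audio pipeline in 🤗 ```transformers``` treats a raw NumPy array as if it were already at the feature extractor's sampling rate, which I assume to be 48 kHz for CLAP. If the printed rate differs, a resampling step along the following lines may help. The helper name ```to_model_rate``` and the constant are hypothetical, not part of any library.

```python
import numpy as np

CLAP_SAMPLING_RATE = 48_000  # assumed rate expected by the CLAP feature extractor

def to_model_rate(array, sampling_rate, target_rate=CLAP_SAMPLING_RATE):
    """Return the audio at target_rate; the input is returned unchanged if it already matches."""
    if sampling_rate == target_rate:
        return array
    import librosa  # imported lazily, only needed when resampling is required
    return librosa.resample(
        np.asarray(array, dtype=np.float32),
        orig_sr=sampling_rate,
        target_sr=target_rate,
    )

# Possible usage with the variables from the cell above:
# audio_48k = to_model_rate(array, sampling_rate)
```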

## 2. Testing one sample of the UrbanSoundsSamples dataset

Instruction: select a random sample from the dataset and listen to it.


In [None]:
#Pick a random index into the dataset
import random

#randrange excludes the upper bound, so the index is always valid
random_number = random.randrange(len(dataset))
random_number
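
If you want to listen to the same sample again on a later run, a seeded generator is one option. This is a small sketch: the seed value 42 is arbitrary and ```num_samples``` is an assumed stand-in for ```len(dataset)```.

```python
import random

num_samples = 50  # assumed dataset size; in the notebook use len(dataset)

rng = random.Random(42)  # fixed seed, so the same index is drawn on every run
random_number = rng.randrange(num_samples)  # valid indices are 0 .. num_samples - 1
print(random_number)
```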

### Running it with "Larger CLAP music and speech" model

In [None]:
#Classify the randomly selected sample and play it back
from transformers import pipeline
import IPython

#Decode the selected sample once and reuse it
example = dataset[random_number]["audio"]
audio = example["array"]

audio_classifier = pipeline(task="zero-shot-audio-classification", model="laion/larger_clap_music_and_speech")
output = audio_classifier(audio, candidate_labels=["Motorcycle", "Moped", "Claxon", "Alarm", "Loud people", "Talking", "Gunshot", "Slamming door", "Music", "Machine"])
print(f'Sample number: {random_number}')
#Print the three highest-scoring labels
for prediction in output[:3]:
    print(f'{prediction["label"]} {round(prediction["score"],3)}')
IPython.display.Audio(example["array"], rate=example["sampling_rate"])
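
A note on the candidate labels: to the best of my knowledge, the pipeline does not feed the bare labels to the text encoder but first wraps each one in a hypothesis template, which defaults to "This is a sound of {}.". The sketch below only shows that string expansion and does not call the model.

```python
candidate_labels = ["Motorcycle", "Moped", "Claxon"]

# Default template of the zero-shot audio pipeline, as far as I know
hypothesis_template = "This is a sound of {}."

# Each label becomes one full-sentence prompt for the text encoder
prompts = [hypothesis_template.format(label) for label in candidate_labels]
for prompt in prompts:
    print(prompt)
```

You can experiment with other phrasings by passing ```hypothesis_template=...``` in the call to ```audio_classifier```.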

### Running it with "Larger CLAP general" model

In [None]:
#larger_clap_general
from transformers import pipeline
import IPython

#Decode the selected sample once and reuse it
example = dataset[random_number]["audio"]
audio = example["array"]

audio_classifier = pipeline(task="zero-shot-audio-classification", model="laion/larger_clap_general")
output = audio_classifier(audio, candidate_labels=["Gunshot", "Moped", "Moped alarm", "Claxon", "Screaming", "Motorcycle", "Talking", "Slamming door", "Music", "Machine"])

print(f'Sample number: {random_number}')
#Print the three highest-scoring labels
for prediction in output[:3]:
    print(f'{prediction["label"]} {round(prediction["score"],3)}')

IPython.display.Audio(example["array"], rate=example["sampling_rate"])
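
Both cells print the top three predictions in the same way, so a small helper can keep that in one place. This is a sketch: ```format_top_k``` is a hypothetical name and ```fake_output``` is invented data in the list-of-dicts shape that the pipeline returns.

```python
def format_top_k(output, k=3):
    """Return the k highest-scoring predictions as 'label score' strings."""
    ranked = sorted(output, key=lambda item: item["score"], reverse=True)
    return [f'{item["label"]} {item["score"]:.3f}' for item in ranked[:k]]

# Invented predictions, for illustration only
fake_output = [
    {"score": 0.71, "label": "Moped"},
    {"score": 0.18, "label": "Motorcycle"},
    {"score": 0.06, "label": "Machine"},
    {"score": 0.05, "label": "Music"},
]
print("\n".join(format_top_k(fake_output)))
```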

## 3. Generating results for the whole dataset


In [None]:
#Creating a neat function to call the model

from functools import lru_cache
from transformers import pipeline
import IPython

CANDIDATE_LABELS = ["Gunshot", "Moped", "Moped alarm", "Claxon", "Screaming",
                    "Motorcycle", "Talking", "Slamming door", "Music", "Machine"]


@lru_cache(maxsize=None)
def get_classifier(model):
    #Load each model only once and reuse it for every sample
    return pipeline(task="zero-shot-audio-classification", model=model)


def call_clap(sample_no, model):
    #Classify one sample and return the predictions, sorted by descending score
    audio_sample = dataset[sample_no]["audio"]["array"]
    return get_classifier(model)(audio_sample, candidate_labels=CANDIDATE_LABELS)

sample_no = 0
output = call_clap(sample_no, "laion/larger_clap_general")

print(f'Sample number: {sample_no}')
for prediction in output[:3]:
    print(f'{prediction["label"]} {round(prediction["score"],3)}')
sample = dataset[sample_no]["audio"]
IPython.display.Audio(sample["array"], rate=sample["sampling_rate"])


In [None]:
#Classify every sample and keep the top prediction for each
result_list = []
for i in range(len(dataset)):
    output = call_clap(i, "laion/larger_clap_general")
    print(f'Sample {i+1}: {output[0]["label"]} - {round(output[0]["score"],3)}')
    result_list.append(f'{output[0]["label"]} - {round(output[0]["score"],3)}')
    

In [None]:
from IPython.display import display, Audio

#Show each prediction next to a player for the matching sample
for i, result in enumerate(result_list):
    print(result)
    sample = dataset[i]["audio"]
    display(Audio(data=sample["array"], rate=sample["sampling_rate"]))
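
To get a quick overview of the results, you could count how often each label came out on top. The sketch below uses a few invented entries in the same "label - score" format as ```result_list```; in the notebook you would pass ```result_list``` itself.

```python
from collections import Counter

# Invented entries in the same "label - score" format as result_list
sample_results = ["Moped - 0.812", "Talking - 0.455", "Moped - 0.639", "Music - 0.901"]

# Split on the last " - " so that labels containing a hyphen stay intact
labels = [entry.rsplit(" - ", 1)[0] for entry in sample_results]
label_counts = Counter(labels)

for label, count in label_counts.most_common():
    print(f"{label}: {count}")
```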