# Audio classification on the Urban Sounds dataset with CLAP and AST

###  Goal of the notebook
In this notebook you can do audio classification. 

Two AI models are used in this notebook:
- CLAP 
- Audio Spectrum Transformer (AST)

### Urban Sounds dataset using 🤗 datasets
The dataset is hosted on the Huggingface Hub at: https://huggingface.co/datasets/UrbanSounds/urban_sounds_small

This dataset contains nine classes of audio events in an urban environment. See for more information the model card at Huggingface.

In this notebook we will use Huggingface's ```dataset``` library to load this dataset. 

More information at https://pypi.org/project/datasets/

### Contents
0. Install packages
1. Inspection of dataset
2. CLAP on the Urban Sounds Amsterdam dataset
3. AST on the Urban Sounds Amsterdam dataset

## 0. Install packages

In [1]:
#!pip install datasets

In [4]:
#!pip install soundfile

In [3]:
#%pip install datasets\[audio\]

In [4]:
#check numpy version is <1.24
import numpy as np
np.__version__

'1.23.5'

In [5]:
!pip3 install numpy==1.23.5



## 1. Inspection of the dataset

In [7]:
from datasets import load_dataset

dataset = load_dataset("UrbanSounds/urban_sounds_small")

Resolving data files:   0%|          | 0/224 [00:00<?, ?it/s]

In [8]:
#inspect the dataset
dataset

DatasetDict({
    train: Dataset({
        features: ['audio', 'label'],
        num_rows: 223
    })
})

In [9]:
#Inspect one sample from 
example = dataset['train']['audio'][0]
label = dataset['train']['label'][0]

You may notice that the audio column contains several features. Here’s what they are:

- path: the path to the downloaded (and converted) audio file
- array: The decoded audio data, represented as a 1-dimensional NumPy array.
- sampling_rate. The sampling rate of the audio file.

In [10]:
example

{'path': '/Users/michielbontenbal/.cache/huggingface/datasets/downloads/dbdb87c19e3ee981f58e3f0dd8645e8c7df1d4ea52c5ad90f52877dc954f77fe',
 'array': array([-0.00015259, -0.00012207, -0.00021362, ...,  0.00015259,
         0.00018311,  0.        ]),
 'sampling_rate': 44100}

In [11]:
#print the label data
print(dataset['train']['label'])
print(len(dataset['train']['label']))

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8]
223


In [12]:
#inspecting the audio array
array = dataset["train"]["audio"][0]["array"]
sampling_rate = example["sampling_rate"]
print(array.shape)
print(array)
print(type(array))
print(sampling_rate)

(441000,)
[-0.00015259 -0.00012207 -0.00021362 ...  0.00015259  0.00018311
  0.        ]
<class 'numpy.ndarray'>
44100


## 2. CLAP on the Urban Sounds Amsterdam dataset

Source: https://huggingface.co/laion/larger_clap_general

Paper: https://arxiv.org/abs/2211.06687

CLAP (Contrastive Language-Audio Pretraining) is a neural network trained on a variety of (audio, text) pairs. It can be instructed in to predict the most relevant text snippet, given an audio, without directly optimizing for the task.

In this notebook we will use two CLAP models:
1. larger_clap_music_and_speech
2. larger_clap_general


In [17]:
#create a random number to select from dataset
import random

random_number = random.randint(0, len(dataset['train']['audio']))
random_number

134

### Runnning it with "Larger CLAP music and speech" model

In [13]:
#Script to load a random number out of the dataset
from transformers import ClapModel, ClapProcessor
from datasets import load_dataset
from transformers import pipeline
import IPython

example=dataset['train']['audio'][random_number]
audio = dataset["train"]["audio"][random_number]["array"]

audio_classifier = pipeline(task="zero-shot-audio-classification", model="laion/larger_clap_music_and_speech")
output = audio_classifier(audio, candidate_labels=["Motorcycle", "Moped", 'Claxon','Alarm', 'Silence','Loud people','Talking','Gunshot', 'Slamming door','Music'])
print(output[0],'\n',output[1])
print(random_number)
IPython.display.Audio(example["array"], rate=example['sampling_rate'])

{'score': 0.8583150506019592, 'label': 'Claxon'} 
 {'score': 0.06773696839809418, 'label': 'Motorcycle'}
184


### Runnning it with "Larger CLAP general" model

In [14]:
#create a dictionary the converts the class folders to real names
label_dict ={0:'Gunshot', 1:'Moped alarm', 2:'Moped', 3:'Claxon', 4:'Slamming door', 5:'Screaming', 6:'Motorcycle', 7:'Talking', 8:'Music'}
print('The given labels are: ')
for i in range(0,9):
    print(label_dict[i])

The given labels are: 
Gunshot
Moped alarm
Moped
Claxon
Slamming door
Screaming
Motorcycle
Talking
Music


In [16]:
#larger_clap_general
from transformers import ClapModel, ClapProcessor
from transformers import pipeline
from datasets import load_dataset
import IPython

example=dataset['train']['audio'][random_number]
audio = dataset["train"]["audio"][random_number]['array']

audio_classifier = pipeline(task="zero-shot-audio-classification", model="laion/larger_clap_general")
output = audio_classifier(audio, candidate_labels=["Gunshot", "Moped", 'Moped alarm','Claxon','Screaming', 'Motorcycle','Talking', 'Slamming door','Music', 'Silence'])

predicted_label = output[0]['label']
print(f'Predicted label: {predicted_label}')

label_name =label_dict[dataset['train']['label'][i]]
print(f'The given label: {label_name}')

if label_name == output[0]['label']:
    print("This is correct")
else:
    print('This is false')
print(f'Probability: {round(output[0]["score"],3)}')

IPython.display.Audio(example['array'], rate=example['sampling_rate'])

Predicted label: Claxon
The given label: Gunshot
This is false
Probability: 0.829


# 2. Audio Spectrum Transfomer with Urban Sounds Amsterdam Dataset

Model: https://huggingface.co/MIT/ast-finetuned-audioset-10-10-0.4593
Paper: https://arxiv.org/abs/2104.01778


In [1]:
from transformers import AutoFeatureExtractor, ASTForAudioClassification
from datasets import load_dataset
import torch

dataset = load_dataset("UrbanSounds/urban_sounds_small")

feature_extractor = AutoFeatureExtractor.from_pretrained("MIT/ast-finetuned-audioset-10-10-0.4593")
model = ASTForAudioClassification.from_pretrained("MIT/ast-finetuned-audioset-10-10-0.4593")

# audio file is decoded on the fly
inputs = feature_extractor(dataset['train']["audio"][1]["array"], sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

predicted_class_ids = torch.argmax(logits, dim=-1).item()
predicted_label = model.config.id2label[predicted_class_ids]
print(predicted_label)

# compute loss - target_label is e.g. "down"
target_label = model.config.id2label[0]
inputs["labels"] = torch.tensor([model.config.label2id[target_label]])
loss = model(**inputs).loss
print(round(loss.item(), 2))

Resolving data files:   0%|          | 0/224 [00:00<?, ?it/s]

Silence
4.13


In [2]:
import IPython
example=dataset['train']["audio"][1]
IPython.display.Audio(example["array"], rate=example['sampling_rate'])