# Hugging Face audio pipelines
source: https://huggingface.co/learn/audio-course/chapter2/audio_classification_pipeline?fw=pt


### Contents
0. Install packages
1. Audio classification with a pipeline
2. CLAP on the Urban Sounds Amsterdam dataset
3. AST on the Urban Sounds Amsterdam dataset

## Introduction 

### Using datasets
To classify sounds, I've uploaded a sub-sample of the Amsterdam Sounds Database to the Hugging Face Hub.

In this notebook we use Hugging Face's `datasets` library to load this dataset.

links:
- https://pypi.org/project/datasets/
- https://huggingface.co/docs/datasets/audio_process
- https://huggingface.co/docs/datasets/v2.14.5/en/package_reference/main_classes#datasets.Audio

## 0. Install packages

In [20]:
#!pip install datasets

In [1]:
%pip install soundfile

Note: you may need to restart the kernel to use updated packages.


In [2]:
%pip install datasets\[audio\]

Note: you may need to restart the kernel to use updated packages.


In [11]:
#check numpy version is 1.24
import numpy as np
np.__version__

'1.24.0'

## 1. Audio classification with a pipeline

source: https://huggingface.co/learn/audio-course/chapter2/audio_classification_pipeline?fw=pt

In [117]:
from datasets import load_dataset
from datasets import Audio

minds = load_dataset("PolyAI/minds14", name="en-AU", split="train")
minds = minds.cast_column("audio", Audio(sampling_rate=16_000))

In [118]:
from transformers import pipeline

classifier = pipeline(
    "audio-classification",
    model="anton-l/xtreme_s_xlsr_300m_minds14",
)

In [119]:
example = minds[0]
classifier(example["audio"]["array"])

[{'score': 0.9625310301780701, 'label': 'pay_bill'},
 {'score': 0.028672754764556885, 'label': 'freeze'},
 {'score': 0.0033497940748929977, 'label': 'card_issues'},
 {'score': 0.0020058027002960443, 'label': 'abroad'},
 {'score': 0.000848432769998908, 'label': 'high_value_payment'}]
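
The pipeline returns a list of dictionaries, each holding a `score` and a `label`; in the output above they appear in descending order of score. A minimal sketch of reading the top prediction from such a list follows. The scores are copied (rounded) from the output above, and `top_prediction` is a hypothetical helper introduced here, not part of `transformers`.

```python
# Pipeline-style output, copied from the result above (scores rounded).
results = [
    {"score": 0.9625, "label": "pay_bill"},
    {"score": 0.0287, "label": "freeze"},
    {"score": 0.0033, "label": "card_issues"},
    {"score": 0.0020, "label": "abroad"},
    {"score": 0.0008, "label": "high_value_payment"},
]


def top_prediction(results):
    """Return the entry with the highest score (hypothetical helper)."""
    return max(results, key=lambda r: r["score"])


best = top_prediction(results)
print(f"{best['label']}: {best['score']:.4f}")
```

Using `max` rather than `results[0]` avoids relying on the ordering of the list.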

In [120]:
example

{'path': '/Users/michielbontenbal/.cache/huggingface/datasets/downloads/extracted/91523c3a7f67fb2017e19a89742e7c63e1f07d178e1bb911001f7434d12237ee/en-AU~PAY_BILL/response_4.wav',
 'audio': {'path': '/Users/michielbontenbal/.cache/huggingface/datasets/downloads/extracted/91523c3a7f67fb2017e19a89742e7c63e1f07d178e1bb911001f7434d12237ee/en-AU~PAY_BILL/response_4.wav',
  'array': array([2.36120541e-05, 1.92325111e-04, 2.19285139e-04, ...,
         9.40908212e-04, 1.16613181e-03, 7.20883720e-04]),
  'sampling_rate': 16000},
 'transcription': 'I would like to pay my electricity bill using my card can you please assist',
 'english_transcription': 'I would like to pay my electricity bill using my card can you please assist',
 'intent_class': 13,
 'lang_id': 2}

In [22]:
#look up the label
id2label = minds.features["intent_class"].int2str
id2label(example["intent_class"])

'pay_bill'

## 2. CLAP on the Urban Sounds Amsterdam dataset

Source: https://huggingface.co/laion/larger_clap_general

Paper: https://arxiv.org/abs/2211.06687

CLAP (Contrastive Language-Audio Pretraining) is a neural network trained on a variety of (audio, text) pairs. It can be instructed to predict the most relevant text snippet for a given audio clip without being directly optimized for that task.
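
To make the idea of zero-shot classification concrete, here is a toy sketch of contrastive scoring: the audio clip and each candidate text are mapped to vectors, the similarities are computed, and a softmax turns them into scores. The three-dimensional vectors and the `scale` value are made up for illustration; real CLAP embeddings come from its audio and text encoders and are much larger, and the exact scaling the model applies is not shown here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(values):
    """Numerically stable softmax."""
    highest = max(values)
    exps = [math.exp(v - highest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_scores(audio_emb, text_embs, scale=10.0):
    """Rank labels by similarity to the audio embedding (scale is an assumed temperature)."""
    labels = list(text_embs)
    sims = [scale * cosine(audio_emb, text_embs[label]) for label in labels]
    probs = softmax(sims)
    ranked = [{"score": p, "label": label} for p, label in zip(probs, labels)]
    return sorted(ranked, key=lambda r: r["score"], reverse=True)


# Made-up embeddings, for illustration only.
audio_embedding = [0.9, 0.1, 0.2]
text_embeddings = {
    "Gunshot": [1.0, 0.0, 0.1],
    "Music": [0.0, 1.0, 0.3],
    "Talking": [0.2, 0.3, 1.0],
}

ranking = zero_shot_scores(audio_embedding, text_embeddings)
print(ranking[0]["label"])
```

Because the labels only enter through their text embeddings, the set of candidate labels can be changed at inference time without retraining.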



In [1]:
from datasets import load_dataset

dataset = load_dataset("MichielBontenbal/UrbanSounds")

In [2]:
#inspect the dataset
dataset

DatasetDict({
    train: Dataset({
        features: ['audio'],
        num_rows: 238
    })
})

In [19]:
dataset['train']['audio'][1]

'M-AS-roos-001.200120.141022.02.wav'

In [3]:
from transformers import pipeline
import IPython

# Use the same clip (the last one in the train split) for classification and playback
example = dataset['train']['audio'][-1]
audio = example["array"]

audio_classifier = pipeline(task="zero-shot-audio-classification", model="laion/larger_clap_music_and_speech")
candidate_labels = ["Motorcycle", "Moped", "Claxon", "Alarm", "Silence", "Loud people", "Talking", "Gunshot", "Slamming door", "Music"]
# A raw array carries no sampling rate, so it should already be at the rate the model expects
output = audio_classifier(audio, candidate_labels=candidate_labels)
# Show the first three entries of the result list
print(output[0], output[1], output[2])
IPython.display.Audio(example["array"], rate=example['sampling_rate'])

FileNotFoundError: [Errno 2] No such file or directory: '/Users/michielbontenbal/.cache/huggingface/datasets/downloads/1ca59e846ac0d8e3d557473077e27c63b62dcc90e1ae6a3c5dbb62f14ef78c93'

You may notice that the audio column contains several features. Here’s what they are:

- path: the path to the audio file.
- array: the decoded audio data, represented as a 1-dimensional NumPy array.
- sampling_rate: the sampling rate of the audio file.

In [4]:
#print the example
example

NameError: name 'example' is not defined

In [25]:
example['sampling_rate']

44100

In [26]:
print(type(audio))
print(audio.shape)
print(audio.dtype)
print(audio.nbytes)

<class 'numpy.ndarray'>
(441000,)
float64
3528000
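
The numbers printed above can be checked with a little arithmetic. The sketch below assumes this clip has the same 44,100 Hz sampling rate that `example['sampling_rate']` reported earlier; the variable names are introduced here for illustration.

```python
num_samples = 441_000   # audio.shape[0], as printed above
sampling_rate = 44_100  # assumption: same rate as example['sampling_rate'] above
bytes_per_sample = 8    # float64 uses 8 bytes per value

# Duration is the number of samples divided by samples per second.
duration_seconds = num_samples / sampling_rate
# Memory footprint is the number of samples times bytes per sample.
size_bytes = num_samples * bytes_per_sample

print(duration_seconds, size_bytes)  # prints: 10.0 3528000
```

Under that assumption the clip is 10 seconds long, and the computed size matches the `nbytes` value printed above.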


In [27]:
#check length of dataset

print(len(dataset['train']['audio']))
print(type(dataset['train']['audio']))

238
<class 'list'>


In [28]:
#the display script
from IPython.display import Audio

Audio(example["array"], rate=example['sampling_rate'])

In [22]:
# Pick a random row index from the dataset
import random

# randrange(n) returns an integer in [0, n), so no helper list is needed
random_number = random.randrange(len(dataset['train']))
print(random_number)

221


In [23]:
# Classify and play the randomly chosen clip from the dataset
from transformers import pipeline
from datasets import load_dataset
import IPython

#dataset = load_dataset("MichielBontenbal/UrbanSounds")

# Use the same clip for classification and playback
example = dataset['train']['audio'][random_number]
audio = example["array"]

audio_classifier = pipeline(task="zero-shot-audio-classification", model="laion/larger_clap_music_and_speech")
candidate_labels = ["Motorcycle", "Moped", "Claxon", "Alarm", "Silence", "Loud people", "Talking", "Gunshot", "Slamming door", "Music"]
output = audio_classifier(audio, candidate_labels=candidate_labels)
print(output[0],'\n',output[1])
print(random_number)
IPython.display.Audio(example["array"], rate=example['sampling_rate'])

{'score': 0.9190650582313538, 'label': 'Gunshot'} 
 {'score': 0.07606956362724304, 'label': 'Loud people'}
221


## CLAP on the urban_sounds_small dataset

In [22]:
from datasets import load_dataset

dataset = load_dataset("UrbanSounds/urban_sounds_small2")

In [33]:
dataset

DatasetDict({
    train: Dataset({
        features: ['audio', 'label'],
        num_rows: 223
    })
})

In [88]:
print(dataset['train']['label'])
print(len(dataset['train']['label']))

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8]
223


In [69]:
#print the data from one example
example=dataset['train']['audio'][57]
label = dataset['train']['label'][57]
print(example)
print(label)

{'path': '/Users/michielbontenbal/.cache/huggingface/datasets/downloads/537e9a572e42a7a70bedd66a044544c33cba10384a129b3b91dfa161a9e5360e', 'array': array([-0.00088501,  0.0019989 ,  0.00263977, ..., -0.02125549,
       -0.02056885, -0.02049255]), 'sampling_rate': 48000}
2


In [60]:
label_dict ={0:'Gunshot', 1:'Moped', 2:'Moped alarm', 3:'Claxon', 4:'Slamming door (car)', 5:'Loud people', 6:'Motorcycle', 7:'Talking (terrace noise)', 8:'Music'}
label_dict[8]

'Music'
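
Two names in `label_dict` carry a parenthetical ('Slamming door (car)' and 'Talking (terrace noise)'), while the candidate labels passed to the classifier in this section are the shorter 'Slamming door' and 'Talking'. A plain string comparison between the two would therefore never match for those classes. One possible way to align them, sketched below, is to strip the trailing parenthetical before comparing; `normalize_label` is a hypothetical helper introduced here.

```python
import re

label_dict = {0: 'Gunshot', 1: 'Moped', 2: 'Moped alarm', 3: 'Claxon',
              4: 'Slamming door (car)', 5: 'Loud people', 6: 'Motorcycle',
              7: 'Talking (terrace noise)', 8: 'Music'}

# Candidate labels as used for the classifier in this section.
candidate_labels = ["Gunshot", "Moped", "Moped alarm", "Claxon", "Loud people",
                    "Motorcycle", "Talking", "Slamming door", "Music", "Silence"]


def normalize_label(name):
    """Drop a trailing parenthetical, e.g. 'Slamming door (car)' -> 'Slamming door'."""
    return re.sub(r"\s*\(.*\)\s*$", "", name)


normalized = {idx: normalize_label(name) for idx, name in label_dict.items()}

# Any dataset label that still has no counterpart among the candidate labels.
missing = [name for name in normalized.values() if name not in candidate_labels]
print(normalized[4], "|", normalized[7], "|", missing)
```

An alternative would be to pass the full names from `label_dict` as candidate labels, at the cost of changing the text the model is prompted with.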

In [103]:
# Select an item index from the dataset (must be < 223)
i = 20

In [109]:
# Zero-shot classification with larger_clap_general, compared against the dataset label
from transformers import pipeline
from datasets import load_dataset
import IPython

#dataset = load_dataset("UrbanSounds/urban_sounds_small2")

# Use the same clip for classification and playback
example = dataset['train']['audio'][i]
audio = example['array']

audio_classifier = pipeline(task="zero-shot-audio-classification", model="laion/larger_clap_general")
candidate_labels = ["Gunshot", "Moped", "Moped alarm", "Claxon", "Loud people", "Motorcycle", "Talking", "Slamming door", "Music", "Silence"]
output = audio_classifier(audio, candidate_labels=candidate_labels)

predicted_label = output[0]['label']
print(f'Predicted label: {predicted_label}')

label_name = label_dict[dataset['train']['label'][i]]
print(f'The given label: {label_name}')

# Note: label_dict uses 'Slamming door (car)' and 'Talking (terrace noise)', which never equal
# the candidate labels 'Slamming door' and 'Talking', so those classes always print 'This is false'.
if label_name == predicted_label:
    print("This is correct")
else:
    print('This is false')
print(f'Probability: {round(output[0]["score"],3)}')

IPython.display.Audio(example['array'], rate=example['sampling_rate'])

Predicted label: Gunshot
The given label: Gunshot
This is correct
Probability: 0.983
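
The cell above checks a single clip. To get an overall impression, the same comparison can be repeated over many clips and averaged. The sketch below shows the bookkeeping only: `fake_classifier` and the four samples are stand-ins invented so the code runs without downloading a model, and `accuracy` is a hypothetical helper. With the real pipeline, `audio_classifier` and pairs of (audio array, label name) would take their place.

```python
def accuracy(classify, samples, candidate_labels):
    """Fraction of samples whose first returned label equals the reference label."""
    correct = 0
    for audio, reference in samples:
        output = classify(audio, candidate_labels=candidate_labels)
        if output[0]["label"] == reference:
            correct += 1
    return correct / len(samples)


def fake_classifier(audio, candidate_labels):
    """Stand-in for the pipeline: picks a label from an integer 'audio' value."""
    label = candidate_labels[audio % len(candidate_labels)]
    return [{"score": 1.0, "label": label}]


# Invented (audio, reference label) pairs, for illustration only.
labels = ["Gunshot", "Moped", "Claxon", "Music"]
samples = [(0, "Gunshot"), (1, "Moped"), (2, "Music"), (3, "Music")]

score = accuracy(fake_classifier, samples, labels)
print(score)
```

The stand-in returns the same list-of-dictionaries shape as the pipeline output shown above, so the helper does not need to change when the real classifier is substituted.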


## Audio Spectrogram Transformer with Urban Sounds Amsterdam Dataset

Model: https://huggingface.co/MIT/ast-finetuned-audioset-10-10-0.4593

Paper: https://arxiv.org/abs/2104.01778


In [107]:
from transformers import AutoFeatureExtractor, ASTForAudioClassification
from datasets import load_dataset
import torch

#dataset = load_dataset("MichielBontenbal/UrbanSounds")

feature_extractor = AutoFeatureExtractor.from_pretrained("MIT/ast-finetuned-audioset-10-10-0.4593")
model = ASTForAudioClassification.from_pretrained("MIT/ast-finetuned-audioset-10-10-0.4593")

# audio file is decoded on the fly
# Caution: sampling_rate=16000 only declares the rate, it does not resample. The clips shown
# earlier in this notebook are stored at 44.1 kHz or 48 kHz, so they should be resampled to
# 16 kHz first, for instance with cast_column("audio", Audio(sampling_rate=16_000)) as in section 1.
inputs = feature_extractor(dataset['train']["audio"][1]["array"], sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Pick the class with the highest logit and map it to its name
predicted_class_ids = torch.argmax(logits, dim=-1).item()
predicted_label = model.config.id2label[predicted_class_ids]
print(predicted_label)

# Compute the loss against an arbitrary target: class 0 of the model's label set
target_label = model.config.id2label[0]
inputs["labels"] = torch.tensor([model.config.label2id[target_label]])
loss = model(**inputs).loss
print(round(loss.item(), 2))

Silence
4.13
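
The two printed values are related: the label is the class with the highest logit, and the loss measures how little probability the model assigns to the chosen target class. The sketch below illustrates this with made-up logits, assuming the loss is a standard softmax cross-entropy (the usual choice for single-label classification; this has not been checked against the model's configuration here). `cross_entropy` is a hypothetical helper.

```python
import math


def cross_entropy(logits, target_index):
    """Softmax cross-entropy for one example: -log(softmax(logits)[target_index])."""
    highest = max(logits)
    log_sum_exp = highest + math.log(sum(math.exp(x - highest) for x in logits))
    return log_sum_exp - logits[target_index]


# Made-up logits for three classes, for illustration only.
logits = [0.5, 2.0, -1.0]

# The predicted class is the index of the largest logit (argmax).
predicted = max(range(len(logits)), key=logits.__getitem__)

loss_for_predicted = cross_entropy(logits, predicted)
loss_for_class0 = cross_entropy(logits, 0)
print(predicted, round(loss_for_predicted, 2), round(loss_for_class0, 2))
```

Under that assumption, a loss of 4.13 against class 0 would correspond to a probability of roughly exp(-4.13), about 1.6%, for that class, which is consistent with the model predicting a different label.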


In [108]:
import IPython
example=dataset['train']["audio"][1]
IPython.display.Audio(example["array"], rate=example['sampling_rate'])