# Huggingface audio pipelines
source: https://huggingface.co/learn/audio-course/chapter2/audio_classification_pipeline?fw=pt


### Contents
0. Install packages
1. First script
2. CLAP on the Urban Sounds Amsterdam dataset
3. AST on the Urban Sounds Amsterdam dataset

## Introduction 

### Using datasets
To classify sounds, I've uploaded a sub-sample of the Amsterdam Sounds Database at Huggingface. 

In this notebook we will use HUggingface's ```dataset``` library to load this dataset. 

links:
- https://pypi.org/project/datasets/
- https://huggingface.co/docs/datasets/audio_process
- https://huggingface.co/docs/datasets/v2.14.5/en/package_reference/main_classes#datasets.Audio

# 0. Install packages

In [20]:
#!pip install datasets

In [1]:
pip install soundfile

Note: you may need to restart the kernel to use updated packages.


In [2]:
%pip install datasets\[audio\]

Note: you may need to restart the kernel to use updated packages.


In [11]:
#check numpy version is 1.24
import numpy as np
np.__version__

'1.24.0'

## 1. Audio classification with a pipeline

source: https://huggingface.co/learn/audio-course/chapter2/audio_classification_pipeline?fw=pt

In [117]:
from datasets import load_dataset
from datasets import Audio

minds = load_dataset("PolyAI/minds14", name="en-AU", split="train")
minds = minds.cast_column("audio", Audio(sampling_rate=16_000))

In [118]:
from transformers import pipeline

classifier = pipeline(
    "audio-classification",
    model="anton-l/xtreme_s_xlsr_300m_minds14",
)

In [119]:
example = minds[0]
classifier(example["audio"]["array"])

[{'score': 0.9625310301780701, 'label': 'pay_bill'},
 {'score': 0.028672754764556885, 'label': 'freeze'},
 {'score': 0.0033497940748929977, 'label': 'card_issues'},
 {'score': 0.0020058027002960443, 'label': 'abroad'},
 {'score': 0.000848432769998908, 'label': 'high_value_payment'}]

In [120]:
example

{'path': '/Users/michielbontenbal/.cache/huggingface/datasets/downloads/extracted/91523c3a7f67fb2017e19a89742e7c63e1f07d178e1bb911001f7434d12237ee/en-AU~PAY_BILL/response_4.wav',
 'audio': {'path': '/Users/michielbontenbal/.cache/huggingface/datasets/downloads/extracted/91523c3a7f67fb2017e19a89742e7c63e1f07d178e1bb911001f7434d12237ee/en-AU~PAY_BILL/response_4.wav',
  'array': array([2.36120541e-05, 1.92325111e-04, 2.19285139e-04, ...,
         9.40908212e-04, 1.16613181e-03, 7.20883720e-04]),
  'sampling_rate': 16000},
 'transcription': 'I would like to pay my electricity bill using my card can you please assist',
 'english_transcription': 'I would like to pay my electricity bill using my card can you please assist',
 'intent_class': 13,
 'lang_id': 2}

In [22]:
#look up the label
id2label = minds.features["intent_class"].int2str
id2label(example["intent_class"])

'pay_bill'

## 2. CLAP on the Urban Sounds Amsterdam dataset

Source: https://huggingface.co/laion/larger_clap_general

Paper: https://arxiv.org/abs/2211.06687

CLAP (Contrastive Language-Audio Pretraining) is a neural network trained on a variety of (audio, text) pairs. It can be instructed in to predict the most relevant text snippet, given an audio, without directly optimizing for the task.



In [2]:
from datasets import load_dataset

dataset = load_dataset("MichielBontenbal/UrbanSounds")

Resolving data files:   0%|          | 0/239 [00:00<?, ?it/s]

In [4]:
#inspect the dataset
dataset

DatasetDict({
    train: Dataset({
        features: ['audio'],
        num_rows: 238
    })
})

In [13]:
from transformers import ClapModel, ClapProcessor
from transformers import pipeline
import IPython

example = dataset['train']['audio'][1]
audio = dataset["train"]["audio"][-1]["array"]

audio_classifier = pipeline(task="zero-shot-audio-classification", model="laion/larger_clap_music_and_speech")
output = audio_classifier(audio, candidate_labels=["Motorcycle", "Moped", 'Claxon','Alarm', 'Silence','Loud people','Talking','Gunshot', 'Slamming door','Music'])
print(output[0],output[1], output[2])
IPython.display.Audio(example["array"], rate=example['sampling_rate'])

{'score': 0.9225516319274902, 'label': 'Gunshot'} {'score': 0.061576180160045624, 'label': 'Loud people'} {'score': 0.013743710704147816, 'label': 'Slamming door'}


You may notice that the audio column contains several features. Here’s what they are:

- path: the path to the audio file
- array: The decoded audio data, represented as a 1-dimensional NumPy array.
- sampling_rate. The sampling rate of the audio file.

In [9]:
#print the example
example

{'path': '/Users/michielbontenbal/.cache/huggingface/datasets/downloads/261be9491a75f8b854776ced14c604eba19e34b4a4cb4244de6afe4e615d0413',
 'array': array([-0.02656555, -0.03103638, -0.03013611, ..., -0.00975037,
        -0.00650024,  0.        ]),
 'sampling_rate': 48000}

In [25]:
example['sampling_rate']

44100

In [26]:
print(type(audio))
print(audio.shape)
print(audio.dtype)
print(audio.nbytes)

<class 'numpy.ndarray'>
(441000,)
float64
3528000


In [27]:
#check length of dataset

print(len(dataset['train']['audio']))
print(type(dataset['train']['audio']))

238
<class 'list'>


In [28]:
#the display script
from IPython.display import Audio

Audio(example["array"], rate=example['sampling_rate'])

In [9]:
#a short script to get a number from the list
import random
my_list = [i for i in range(0, len(dataset['train']['audio']))]

# Getting a random number from the list
random_number = random.choice(my_list)
print(random_number)

184


In [16]:
%%time
#Script to load a random number out of the dataset
from transformers import ClapModel, ClapProcessor
from transformers import pipeline
from datasets import load_dataset
import IPython

#dataset = load_dataset("MichielBontenbal/UrbanSounds") #uncomment if you want to load it

#creat an example file with the accomp
example=dataset['train']['audio'][random_number]
audio = dataset["train"]["audio"][random_number]["array"]

audio_classifier = pipeline(task="zero-shot-audio-classification", model="laion/larger_clap_general")
output = audio_classifier(audio, candidate_labels=["Motorcycle", "Moped", 'Claxon','Alarm', 'Silence','Loud people','Talking','Gunshot', 'Slamming door','Music'])
print(output[0],'\n',output[1])
print(random_number)
IPython.display.Audio(example["array"], rate=example['sampling_rate'])

{'score': 0.9141817092895508, 'label': 'Music'} 
 {'score': 0.02761041186749935, 'label': 'Motorcycle'}
184
CPU times: user 7.93 s, sys: 2 s, total: 9.94 s
Wall time: 7.63 s


## Audio Spectrum Transfomer with Urban Sounds Amsterdam Dataset

Model: https://huggingface.co/MIT/ast-finetuned-audioset-10-10-0.4593
Paper: https://arxiv.org/abs/2104.01778


In [17]:
from transformers import AutoFeatureExtractor, ASTForAudioClassification
from datasets import load_dataset
import torch

dataset = load_dataset("MichielBontenbal/UrbanSounds")

feature_extractor = AutoFeatureExtractor.from_pretrained("MIT/ast-finetuned-audioset-10-10-0.4593")
model = ASTForAudioClassification.from_pretrained("MIT/ast-finetuned-audioset-10-10-0.4593")

# audio file is decoded on the fly
inputs = feature_extractor(dataset['train']["audio"][random_number]["array"], sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

predicted_class_ids = torch.argmax(logits, dim=-1).item()
predicted_label = model.config.id2label[predicted_class_ids]
print(predicted_label)

# compute loss - target_label is e.g. "down"
target_label = model.config.id2label[0]
inputs["labels"] = torch.tensor([model.config.label2id[target_label]])
loss = model(**inputs).loss
print(round(loss.item(), 2))

Resolving data files:   0%|          | 0/239 [00:00<?, ?it/s]

Music
2.23


In [18]:
import IPython
example=dataset['train']["audio"][random_number]
IPython.display.Audio(example["array"], rate=example['sampling_rate'])