# Hugging Face audio pipelines
source: https://huggingface.co/learn/audio-course/chapter2/audio_classification_pipeline?fw=pt

## Introduction 

### Using datasets
To classify sounds, I've uploaded a sub-sample of the Amsterdam Sounds Database to Hugging Face.

In this notebook we will use Hugging Face's `datasets` library to load this dataset.


links:
- https://pypi.org/project/datasets/
- https://huggingface.co/docs/datasets/audio_process
- https://huggingface.co/docs/datasets/v2.14.5/en/package_reference/main_classes#datasets.Audio

## 0. Install packages

In [20]:
#!pip install datasets

In [1]:
%pip install soundfile

Note: you may need to restart the kernel to use updated packages.


In [2]:
%pip install datasets\[audio\]

Note: you may need to restart the kernel to use updated packages.


## 1. Audio classification with a pipeline

In [117]:
from datasets import load_dataset
from datasets import Audio

minds = load_dataset("PolyAI/minds14", name="en-AU", split="train")
minds = minds.cast_column("audio", Audio(sampling_rate=16_000))

In [118]:
from transformers import pipeline

classifier = pipeline(
    "audio-classification",
    model="anton-l/xtreme_s_xlsr_300m_minds14",
)

In [119]:
example = minds[0]
classifier(example["audio"]["array"])

[{'score': 0.9625310301780701, 'label': 'pay_bill'},
 {'score': 0.028672754764556885, 'label': 'freeze'},
 {'score': 0.0033497940748929977, 'label': 'card_issues'},
 {'score': 0.0020058027002960443, 'label': 'abroad'},
 {'score': 0.000848432769998908, 'label': 'high_value_payment'}]

In [120]:
example

{'path': '/Users/michielbontenbal/.cache/huggingface/datasets/downloads/extracted/91523c3a7f67fb2017e19a89742e7c63e1f07d178e1bb911001f7434d12237ee/en-AU~PAY_BILL/response_4.wav',
 'audio': {'path': '/Users/michielbontenbal/.cache/huggingface/datasets/downloads/extracted/91523c3a7f67fb2017e19a89742e7c63e1f07d178e1bb911001f7434d12237ee/en-AU~PAY_BILL/response_4.wav',
  'array': array([2.36120541e-05, 1.92325111e-04, 2.19285139e-04, ...,
         9.40908212e-04, 1.16613181e-03, 7.20883720e-04]),
  'sampling_rate': 16000},
 'transcription': 'I would like to pay my electricity bill using my card can you please assist',
 'english_transcription': 'I would like to pay my electricity bill using my card can you please assist',
 'intent_class': 13,
 'lang_id': 2}

In [22]:
# look up the label name for the integer class ID
id2label = minds.features["intent_class"].int2str
id2label(example["intent_class"])

'pay_bill'
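
To check the prediction against the ground truth, compare the top-scoring label with the intent that was looked up. A minimal sketch: the pipeline output and the true label are copied from the cells above rather than recomputed, and `top_prediction` is a hypothetical helper, not part of `transformers`.

```python
# Pipeline output copied from the classifier cell above (not recomputed here).
predictions = [
    {"score": 0.9625310301780701, "label": "pay_bill"},
    {"score": 0.028672754764556885, "label": "freeze"},
    {"score": 0.0033497940748929977, "label": "card_issues"},
    {"score": 0.0020058027002960443, "label": "abroad"},
    {"score": 0.000848432769998908, "label": "high_value_payment"},
]
true_label = "pay_bill"  # result of id2label(example["intent_class"]) above


def top_prediction(preds):
    """Return the entry with the highest score (hypothetical helper)."""
    return max(preds, key=lambda p: p["score"])


best = top_prediction(predictions)
is_correct = best["label"] == true_label
print(f"predicted={best['label']} score={best['score']:.3f} correct={is_correct}")
```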

## 2. CLAP

In [5]:
from datasets import load_dataset
from transformers import pipeline

dataset = load_dataset("ashraq/esc50")
audio = dataset["train"]["audio"][-1]["array"]

audio_classifier = pipeline(task="zero-shot-audio-classification", model="laion/larger_clap_music_and_speech")
output = audio_classifier(audio, candidate_labels=["Sound of a dog", "Sound of vaccum cleaner"])
print(output)


[{'score': 0.9999446868896484, 'label': 'Sound of a dog'}, {'score': 5.5327400332316756e-05, 'label': 'Sound of vaccum cleaner'}]
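
The two scores add up to 1. As far as I can tell, the zero-shot pipeline applies a softmax over the similarity between the audio and each candidate label, so a score is always relative to the other labels offered and changes when labels are added or removed. A sketch in plain Python; the logit values are made up for illustration and are not taken from the model.

```python
import math


def softmax(logits):
    """Turn raw similarity logits into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical audio-text similarity logits, one per candidate label.
labels = ["Sound of a dog", "Sound of vaccum cleaner"]
two_way = softmax([12.0, 2.0])

# Adding a third, competing label lowers the first score
# even though the first two logits are unchanged.
three_way = softmax([12.0, 2.0, 11.0])

print(dict(zip(labels, two_way)))
print(three_way)
```

This is why a high score in a two-label comparison says little on its own: the model only had to prefer one label over the other.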


In [7]:
dataset

DatasetDict({
    train: Dataset({
        features: ['filename', 'fold', 'target', 'category', 'esc10', 'src_file', 'take', 'audio'],
        num_rows: 2000
    })
})

In [18]:
# print the 'audio' column; unfortunately 'path' is always None, so we can't locate the source file
dataset['train']['audio']

[{'path': None,
  'array': array([0., 0., 0., ..., 0., 0., 0.]),
  'sampling_rate': 44100},
 {'path': None,
  'array': array([-0.01184082, -0.10336304, -0.14141846, ...,  0.06985474,
          0.04049683,  0.00274658]),
  'sampling_rate': 44100},
 {'path': None,
  'array': array([-0.00695801, -0.01251221, -0.01126099, ...,  0.215271  ,
         -0.00875854, -0.28903198]),
  'sampling_rate': 44100},
 {'path': None,
  'array': array([0.53897095, 0.39627075, 0.26739502, ..., 0.09729004, 0.11227417,
         0.07983398]),
  'sampling_rate': 44100},
 {'path': None,
  'array': array([-0.00036621, -0.0007019 , -0.00079346, ...,  0.00317383,
          0.00222778,  0.00158691]),
  'sampling_rate': 44100},
 {'path': None,
  'array': array([-9.46044922e-04, -6.71386719e-04, -6.10351562e-05, ...,
         -2.13623047e-03, -2.62451172e-03, -3.17382812e-03]),
  'sampling_rate': 44100},
 {'path': None,
  'array': array([0.00012207, 0.00018311, 0.00012207, ..., 0.        , 0.        ,
         0.     

In [19]:
#print the category of the last item in the list
dataset['train']['category'][-1]

'dog'

In [12]:
ls

A_quick_start_guide_for_soundscape_IR.ipynb
Audio Yamnet_visualisation.ipynb
Audio with OpenL3.ipynb
ESC-50-master/
Gradio Audio Demo waveform.ipynb
Huggingface audio pipelines.ipynb
Inference_with_the_Audio_Spectogram_Transformer_to_classify_audio.ipynb
Noisereduce full tutorial.ipynb
OpenEars - scripts to rename the files in Amsterdam Sound Dataset.ipynb
README.md
Torchopenl3 (not working locally).ipynb
Transfer_learning_audio YAMNet (works in Colab not local).ipynb
fft_frequency.ipynb
flagged/
testme.flac


In [13]:
cd ESC-50-master

/Users/michielbontenbal/Documents/Documents - Michiel’s MacBook Pro/GitHub/Audio_advanced/ESC-50-master


In [15]:
cd audio

/Users/michielbontenbal/Documents/Documents - Michiel’s MacBook Pro/GitHub/Audio_advanced/ESC-50-master/audio


Check the last file in the list against the class shown in the Hugging Face dataset viewer: open https://huggingface.co/datasets/ashraq/esc50/viewer/default/train?q=dog, select the last file in the list (5-9032-A.wav) and listen to it there.

In [16]:
ls

1-100032-A-0.wav   2-118624-A-30.wav  3-151212-A-24.wav  4-196671-A-8.wav
1-100038-A-14.wav  2-118625-A-30.wav  3-151213-A-24.wav  4-196671-B-8.wav
1-100210-A-36.wav  2-118817-A-32.wav  3-151255-A-28.wav  4-196672-A-8.wav
1-100210-B-36.wav  2-118964-A-0.wav   3-151269-A-35.wav  4-197103-A-6.wav
1-101296-A-19.wav  2-119102-A-21.wav  3-151273-A-35.wav  4-197454-A-28.wav
1-101296-B-19.wav  2-119139-A-31.wav  3-151557-A-28.wav  4-197454-B-28.wav
1-101336-A-30.wav  2-119161-A-8.wav   3-151557-B-28.wav  4-197871-A-15.wav
1-101404-A-34.wav  2-119161-B-8.wav   3-152007-A-20.wav  4-198025-A-23.wav
1-103298-A-9.wav   2-119161-C-8.wav   3-152007-B-20.wav  4-198360-A-49.wav
1-103995-A-30.wav  2-119748-A-38.wav 
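
The file names carry the same metadata that appears as columns (`fold`, `take`, `target`) in the Hugging Face dataset. According to the ESC-50 README the pattern is `{fold}-{clip_id}-{take}-{target}.wav`. A sketch of a parser under that assumption; `parse_esc50_filename` and `Esc50Name` are hypothetical names.

```python
from typing import NamedTuple


class Esc50Name(NamedTuple):
    fold: int     # cross-validation fold
    clip_id: str  # ID of the original source clip
    take: str     # letter distinguishing fragments cut from the same clip
    target: int   # numeric class ID


def parse_esc50_filename(filename: str) -> Esc50Name:
    """Split an ESC-50 file name into its four dash-separated fields."""
    stem = filename.rsplit(".", 1)[0]
    fold, clip_id, take, target = stem.split("-")
    return Esc50Name(int(fold), clip_id, take, int(target))


info = parse_esc50_filename("1-100032-A-0.wav")
print(info)
```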

## 3. Test on the Amsterdam Sounds dataset using Transformers (Michiel Bontenbal)

In [121]:
# load the dataset
from datasets import Audio, load_dataset

dataset = load_dataset("MichielBontenbal/AmsterdamUrbanSounds")
# resample to 48 kHz, the rate the CLAP model used below is assumed to expect
dataset = dataset.cast_column("audio", Audio(sampling_rate=48_000))

In [122]:
dataset

DatasetDict({
    train: Dataset({
        features: ['audio'],
        num_rows: 1
    })
})

In [125]:
dataset['train']['audio'][0]

{'path': '/Users/michielbontenbal/.cache/huggingface/datasets/downloads/a5928477ff76953cc2ca36b6025d527fe7c9e5cfe3cd716acf4a9f4bb446d9b1',
 'array': array([0.07316589, 0.07098389, 0.06954956, ..., 0.00628662, 0.00708008,
        0.00947571]),
 'sampling_rate': 48000}

In [124]:
from transformers import pipeline

# take the last clip in the train split
audio = dataset["train"]["audio"][-1]["array"]

audio_classifier = pipeline(task="zero-shot-audio-classification", model="laion/larger_clap_music_and_speech")
output = audio_classifier(audio, candidate_labels=["Moped", "Sound of dog"])
print(output)

[{'score': 0.9998893737792969, 'label': 'Moped'}, {'score': 0.00011060674296459183, 'label': 'Sound of dog'}]


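In [None]:
# first rows of the train split (at most 3, since the split may hold fewer clips)
dataset_head = dataset['train'].select(range(min(3, len(dataset['train']))))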

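In [None]:
from IPython.display import Audio as AudioPlayer  # alias avoids shadowing datasets.Audio

# play the MINDS-14 example from section 1 (resampled to 16 kHz)
AudioPlayer(example["audio"]["array"], rate=16000)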

## 4. Dataset Sensemakers


In [126]:
from datasets import load_dataset

dataset = load_dataset("UrbanSounds/AmsterdamSounds")

In [95]:
from transformers import pipeline

# take the first clip in the train split
audio = dataset["train"]["audio"][0]["array"]

audio_classifier = pipeline(task="zero-shot-audio-classification", model="laion/larger_clap_music_and_speech")
candidate_labels = ["Motorcycle", "Moped", "Claxon", "Alarm", "Silence",
                    "Loud people", "Talking", "Gunshot", "Slamming door", "Music"]
output = audio_classifier(audio, candidate_labels=candidate_labels)
# print only the top-scoring label
print(output[0])

{'score': 0.6788613200187683, 'label': 'Claxon'}


You may notice that the audio column contains several features. Here’s what they are:

- path: the path to the audio file.
- array: the decoded audio data, represented as a 1-dimensional NumPy array.
- sampling_rate: the sampling rate of the audio file.

In [102]:
dataset['train']

Dataset({
    features: ['audio'],
    num_rows: 900
})

In [96]:
dataset['train']['audio'][0]

{'path': '/Users/michielbontenbal/.cache/huggingface/datasets/downloads/3ebf510251fda452c1e641f7ec83a9ceea64bd398b33a93e4d6c2122f3d594cf',
 'array': array([-0.0118866 , -0.00939941, -0.00857544, ...,  0.00495911,
         0.00469971,  0.00242615]),
 'sampling_rate': 48000}

In [105]:
audio

array([-0.0118866 , -0.00939941, -0.00857544, ...,  0.00495911,
        0.00469971,  0.00242615])

In [106]:
type(audio)

numpy.ndarray

In [111]:
print(type(audio))
print(audio.shape)
print(audio.dtype)
print(audio.nbytes)

<class 'numpy.ndarray'>
(480000,)
float64
3840000
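
These numbers fit together: the clip length in seconds is the number of samples divided by the sampling rate, and the memory footprint is the number of samples times the bytes per sample (8 for `float64`). A quick check with the values printed above:

```python
num_samples = 480_000    # audio.shape[0] from the output above
sampling_rate = 48_000   # 'sampling_rate' reported for this dataset
bytes_per_sample = 8     # float64 uses 8 bytes per value

duration_seconds = num_samples / sampling_rate
size_bytes = num_samples * bytes_per_sample

print(duration_seconds)  # 10.0
print(size_bytes)        # 3840000
```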


In [115]:
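import numpy as np
from scipy.io.wavfile import write

rate = 48000  # sampling rate of the dataset audio

# write as 32-bit float; 64-bit float WAV files are not widely supported by players
write('test.wav', rate, audio.astype(np.float32))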
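
Floating-point WAV files are not accepted by every player; 16-bit PCM is the most widely supported encoding. Below is a sketch of that conversion using only the Python standard library, with a synthetic 440 Hz tone standing in for the dataset audio. The helper names `float_to_pcm16` and `write_pcm16_wav` are hypothetical.

```python
import math
import os
import sys
import tempfile
import wave
from array import array


def float_to_pcm16(samples):
    """Clip floats to [-1.0, 1.0] and scale them to signed 16-bit integers."""
    pcm = array("h", (int(max(-1.0, min(1.0, s)) * 32767) for s in samples))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV sample data is little-endian
    return pcm


def write_pcm16_wav(path, samples, rate):
    """Write mono float samples to `path` as a 16-bit PCM WAV file."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wav_file.setframerate(rate)
        wav_file.writeframes(float_to_pcm16(samples).tobytes())


rate = 48000
# Synthetic one-second 440 Hz tone standing in for the dataset audio.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / rate) for n in range(rate)]

wav_path = os.path.join(tempfile.mkdtemp(), "test_pcm16.wav")
write_pcm16_wav(wav_path, tone, rate)

# Read the header back to confirm what was written.
with wave.open(wav_path, "rb") as wav_file:
    frames_written = wav_file.getnframes()
    sample_width = wav_file.getsampwidth()
    frame_rate = wav_file.getframerate()
print(frames_written, sample_width, frame_rate)
```

With NumPy the same conversion is `(np.clip(audio, -1, 1) * 32767).astype(np.int16)`, and `scipy.io.wavfile.write` stores an `int16` array as 16-bit PCM.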

In [116]:
# simplest way to play the file
import IPython
IPython.display.Audio('test.wav')