<a href="https://colab.research.google.com/github/Ryukijano/DL_audio/blob/hf_audio_course/hf_audio_course_applications.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Audio classification with a pipeline

Introduction

Audio classification involves assigning one or more labels to an audio recording based on its content. The labels could correspond to different sound categories, such as music, speech, or noise, or more specific categories like bird song or car engine sounds.

Using a pre-trained model
Before diving into details on how the most popular audio transformers work, and before fine-tuning a custom model, let’s see how you can use an off-the-shelf pre-trained model for audio classification with only a few lines of code with 🤗 Transformers.

The MINDS-14 dataset
We'll use the same MINDS-14 dataset that we explored in the previous unit. This dataset contains recordings of people asking an e-banking system questions in several languages and dialects, and has the intent_class for each recording.

Loading the dataset
As before, we'll start by loading the en-AU subset of the data to try out the pipeline, and upsample it to a 16 kHz sampling rate, which is what most speech models require.
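To make the resampling step more concrete, here is a minimal sketch of what upsampling does to a signal, assuming an 8 kHz source. It uses plain linear interpolation purely for illustration; the `Audio` feature in 🤗 Datasets relies on a proper resampling routine rather than this, and the helper name `upsample_linear` is hypothetical.

```python
def upsample_linear(samples, orig_sr, target_sr):
    """Resample `samples` from orig_sr to target_sr by linear interpolation."""
    if not samples:
        return []
    n_out = int(round(len(samples) * target_sr / orig_sr))
    out = []
    for i in range(n_out):
        # Position of output sample i on the input sample axis.
        pos = i * orig_sr / target_sr
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        # Blend the two neighbouring input samples.
        out.append((1 - frac) * samples[left] + frac * samples[right])
    return out


# Toy signal standing in for a short 8 kHz recording (illustrative values).
signal_8k = [0.0, 1.0, 0.0, -1.0]
signal_16k = upsample_linear(signal_8k, orig_sr=8_000, target_sr=16_000)
print(len(signal_8k), "->", len(signal_16k))
print(signal_16k)
```

Doubling the sampling rate doubles the number of samples while the duration of the recording stays the same.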

In [8]:
!pip install transformers


In [1]:
!pip install datasets


In [6]:
from datasets import load_dataset
from datasets import Audio

minds = load_dataset("PolyAI/minds14", name="en-AU", split="train")
minds = minds.cast_column("audio", Audio(sampling_rate=16_000))



To classify an audio recording into a set of classes, we can use the audio-classification pipeline from 🤗 Transformers. We use a model from the Hub that has been fine-tuned for intent classification, specifically on the MINDS-14 dataset, and load it with the pipeline() function:

In [9]:
from transformers import pipeline

classifier = pipeline(
    "audio-classification",
    model="anton-l/xtreme_s_xlsr_300m_minds14",
)


Looking at an example

In [11]:
example = minds[0]

In [12]:
example

{'path': '/root/.cache/huggingface/datasets/downloads/extracted/a19fbc5032eacf25eab0097832db7b7f022b42104fbad6bd5765527704a428b9/en-AU~PAY_BILL/response_4.wav',
 'audio': {'path': '/root/.cache/huggingface/datasets/downloads/extracted/a19fbc5032eacf25eab0097832db7b7f022b42104fbad6bd5765527704a428b9/en-AU~PAY_BILL/response_4.wav',
  'array': array([2.36119668e-05, 1.92324660e-04, 2.19284790e-04, ...,
         9.40907281e-04, 1.16613181e-03, 7.20883254e-04]),
  'sampling_rate': 16000},
 'transcription': 'I would like to pay my electricity bill using my card can you please assist',
 'english_transcription': 'I would like to pay my electricity bill using my card can you please assist',
 'intent_class': 13,
 'lang_id': 2}

We pass the raw audio data, stored in the NumPy array under ["audio"]["array"], into the classifier:

In [13]:
classifier(example["audio"]["array"])

[{'score': 0.9631525278091431, 'label': 'pay_bill'},
 {'score': 0.02819698303937912, 'label': 'freeze'},
 {'score': 0.0032787492964416742, 'label': 'card_issues'},
 {'score': 0.0019414445850998163, 'label': 'abroad'},
 {'score': 0.0008378693601116538, 'label': 'high_value_payment'}]

The model is highly confident that the caller intended to learn about paying a bill. To check this against the actual label, we convert the integer intent_class to its name using int2str:


In [14]:
id2label = minds.features["intent_class"].int2str
id2label(example["intent_class"])

'pay_bill'

The prediction was indeed correct.
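The `score` values returned by the classifier are, as far as we know, softmax probabilities over the model's logits, with only the highest-scoring labels kept. The sketch below illustrates that post-processing in plain Python. The logits, the three-label mapping, and the `top_k_labels` helper are made up for illustration and are not taken from the model; the real pipeline does the equivalent on tensors.

```python
import math


def softmax(logits):
    """Turn raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_labels(logits, id2label, k=5):
    """Return the k most probable labels, highest score first."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [{"score": probs[i], "label": id2label[i]} for i in ranked[:k]]


# Hypothetical logits for three intents (not real model output).
id2label = {0: "abroad", 1: "freeze", 2: "pay_bill"}
logits = [0.5, 2.0, 5.0]
predictions = top_k_labels(logits, id2label, k=2)
print(predictions)
```

Because only the top entries are returned, the printed scores can sum to slightly less than 1.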

Automatic speech recognition with a pipeline

Automatic Speech Recognition (ASR) is the task of transcribing a recording of speech into text. It has numerous practical applications, from creating closed captions for videos to enabling voice commands for virtual assistants like Siri and Alexa.

Here we use the automatic-speech-recognition pipeline to transcribe a recording of a person asking a question about paying a bill, using the same MINDS-14 dataset as before. The audio was already resampled to 16 kHz when we loaded the dataset.

In [15]:
from transformers import pipeline
asr = pipeline("automatic-speech-recognition")

No model was supplied, defaulted to facebook/wav2vec2-base-960h and revision 55bb623 (https://huggingface.co/facebook/wav2vec2-base-960h).
Using a pipeline without specifying a model name and revision in production is not recommended.



Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



In [23]:
example = minds[10]
asr(example["audio"]["array"])


{'text': 'HI O I DON PAY MY BEL PLEASE AS I HAD A NOT STALING HIM OUTS THINKE'}

In [24]:
example["english_transcription"]

'I like to pay my bill please thank you'
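One way to quantify how far the transcription is from the reference is the word error rate (WER): the word-level edit distance divided by the number of reference words. The following is a minimal pure-Python sketch; the `word_error_rate` helper and its lowercase normalisation are our own assumptions, and a dedicated evaluation library would normally be used instead.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


reference = "I like to pay my bill please thank you"
hypothesis = "HI O I DON PAY MY BEL PLEASE AS I HAD A NOT STALING HIM OUTS THINKE"
wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.2f}")
```

A WER above 1.0 is possible when the hypothesis contains many more words than the reference, as is the case here.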

Next, we try the same task on the German (de-DE) subset of the dataset.

In [25]:
from datasets import load_dataset
from datasets import Audio

minds = load_dataset("PolyAI/minds14", name="de-DE", split="train")
minds = minds.cast_column("audio", Audio(sampling_rate=16_000))



In [26]:
example = minds[0]
example["transcription"]

'ich möchte gerne Geld auf mein Konto einzahlen'

In [27]:
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="maxidl/wav2vec2-large-xlsr-german")
asr(example["audio"]["array"])


{'text': 'ich möchte gerne geld auf mein konto einzallen'}
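The German transcription is close to the reference but not identical. A small sketch like the one below can pinpoint where the two differ. It compares the texts word by word after lowercasing, and it assumes both have the same number of words (true for this example, not in general); the `word_mismatches` helper is hypothetical.

```python
def word_mismatches(reference, prediction):
    """Return (reference_word, predicted_word) pairs that differ."""
    ref_words = reference.lower().split()
    pred_words = prediction.lower().split()
    if len(ref_words) != len(pred_words):
        raise ValueError("this simple comparison needs equal word counts")
    return [(r, p) for r, p in zip(ref_words, pred_words) if r != p]


reference = "ich möchte gerne Geld auf mein Konto einzahlen"
prediction = "ich möchte gerne geld auf mein konto einzallen"
print(word_mismatches(reference, prediction))
# → [('einzahlen', 'einzallen')]
```

Apart from casing, which the model does not predict, the only error is a single misspelled word.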

In [32]:
from datasets import load_dataset
from datasets import Audio

vox = load_dataset("facebook/voxpopuli", name="en", split="train")
vox = vox.cast_column("audio", Audio(sampling_rate=16_000))


KeyboardInterrupt: ignored
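The English VoxPopuli training split is very large, so downloading it in full can take a long time. A possible alternative, sketched below, is the streaming mode of 🤗 Datasets, which fetches examples on demand instead of downloading everything first. This assumes the installed `datasets` version supports streaming for this dataset; the helper names `stream_voxpopuli_en`, `first_n`, and `print_first_sampling_rates` are hypothetical.

```python
from itertools import islice


def stream_voxpopuli_en():
    """Return the English VoxPopuli train split as a streaming dataset."""
    # Imported lazily so that `first_n` below works without `datasets` installed.
    from datasets import Audio, load_dataset

    vox = load_dataset("facebook/voxpopuli", name="en", split="train", streaming=True)
    # With streaming, resampling is applied as each example is read.
    return vox.cast_column("audio", Audio(sampling_rate=16_000))


def first_n(iterable, n):
    """Collect the first n items of any iterable without reading further."""
    return list(islice(iterable, n))


def print_first_sampling_rates(n=2):
    """Fetch only the first n examples; requires network access."""
    for example in first_n(stream_voxpopuli_en(), n):
        print(example["audio"]["sampling_rate"])
```

Calling `print_first_sampling_rates()` should only fetch the data needed for the first couple of examples rather than the whole split.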