## Audio classification with a pipeline

Let’s start by loading the en-AU subset of the data to try out the pipeline, and upsample it to a 16 kHz sampling rate, which is what most speech models require.



In [3]:
from datasets import load_dataset
from datasets import Audio

minds = load_dataset("PolyAI/minds14", name="en-AU", split="train")
minds = minds.cast_column("audio", Audio(sampling_rate=16_000))
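To make the resampling step more concrete, here is a minimal, dependency-free sketch of what upsampling does to a signal, assuming an 8 kHz source. It uses plain linear interpolation and is an illustration only: `resample_linear` is a hypothetical helper, and the `Audio` feature in 🤗 Datasets relies on its own, higher-quality resampler rather than this code.

```python
def resample_linear(samples, orig_sr, target_sr):
    """Resample a mono signal by linear interpolation (illustration only)."""
    if not samples:
        return []
    n_out = round(len(samples) * target_sr / orig_sr)
    resampled = []
    for i in range(n_out):
        # Position of output sample i on the input sample axis.
        pos = i * orig_sr / target_sr
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        resampled.append(samples[left] * (1 - frac) + samples[right] * frac)
    return resampled


# Four samples at 8 kHz become eight samples at 16 kHz.
eight_khz = [0.0, 1.0, 0.0, -1.0]
sixteen_khz = resample_linear(eight_khz, 8_000, 16_000)
print(len(eight_khz), "->", len(sixteen_khz))
```

Doubling the sampling rate doubles the number of samples while the duration of the recording stays the same.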

To classify an audio recording into a set of classes, we can use the audio-classification pipeline from 🤗 Transformers. In our case, we need a model that’s been fine-tuned for intent classification, and specifically on the MINDS-14 dataset. Luckily for us, the Hub has a model that does just that! Let’s load it by using the pipeline() function:



In [4]:
pip install torch

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [5]:
from transformers import pipeline

classifier = pipeline(
    "audio-classification",
    model="anton-l/xtreme_s_xlsr_300m_minds14",
)

Device set to use cpu


This pipeline expects the audio data as a NumPy array. All the preprocessing of the raw audio data will be conveniently handled for us by the pipeline. Let’s pick an example to try it out:



In [6]:
example = minds[0]
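As one illustration of the kind of preprocessing the pipeline takes care of, wav2vec 2.0-style feature extractors commonly rescale each waveform to zero mean and unit variance. The sketch below shows that idea in plain Python. `normalize_waveform` and its `eps` value are hypothetical, and the steps actually applied depend on the checkpoint's preprocessor configuration.

```python
import math


def normalize_waveform(samples, eps=1e-7):
    """Rescale a waveform to zero mean and (approximately) unit variance."""
    mean = sum(samples) / len(samples)
    variance = sum((s - mean) ** 2 for s in samples) / len(samples)
    # eps avoids a division by zero for silent (constant) input.
    scale = math.sqrt(variance + eps)
    return [(s - mean) / scale for s in samples]


waveform = [0.1, 0.2, 0.3, 0.4]  # toy stand-in for example["audio"]["array"]
normalized = normalize_waveform(waveform)
print(normalized)
```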



If you recall the structure of the dataset, the raw audio data is stored in a NumPy array under ["audio"]["array"]. Let’s pass it straight to the classifier:



In [7]:
classifier(example["audio"]["array"])

[{'score': 0.9625311493873596, 'label': 'pay_bill'},
 {'score': 0.028672732412815094, 'label': 'freeze'},
 {'score': 0.003349794540554285, 'label': 'card_issues'},
 {'score': 0.0020058020018041134, 'label': 'abroad'},
 {'score': 0.0008484324789606035, 'label': 'high_value_payment'},
 {'score': 0.0007367952493950725, 'label': 'direct_debit'},
 {'score': 0.0004056991310790181, 'label': 'latest_transactions'},
 {'score': 0.0003397076216060668, 'label': 'joint_account'},
 {'score': 0.00033127894857898355, 'label': 'address'},
 {'score': 0.0003288650477770716, 'label': 'balance'},
 {'score': 0.00014877507055643946, 'label': 'app_error'},
 {'score': 0.00014772488793823868, 'label': 'atm_limit'},
 {'score': 8.815681940177456e-05, 'label': 'cash_deposit'},
 {'score': 6.512475374620408e-05, 'label': 'business_loan'}]
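The classifier returns a list of dictionaries, one per class, each with a `score` and a `label`. A minimal sketch of how one might pick the top prediction from such a list is shown below. The three entries are copied from the output above, while `confident_label` and its `threshold` are hypothetical names used only for illustration.

```python
predictions = [
    {"score": 0.9625311493873596, "label": "pay_bill"},
    {"score": 0.028672732412815094, "label": "freeze"},
    {"score": 0.003349794540554285, "label": "card_issues"},
]

# The entry with the highest score is the model's best guess.
top = max(predictions, key=lambda p: p["score"])


def confident_label(preds, threshold=0.5):
    """Return the top label, or None if the model is not confident enough."""
    best = max(preds, key=lambda p: p["score"])
    return best["label"] if best["score"] >= threshold else None


print(top["label"], round(top["score"], 3))
print(confident_label(predictions, threshold=0.99))
```

Returning `None` below a threshold is one simple way to route uncertain recordings to a fallback, such as a human agent.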

The model is very confident that the caller intended to learn about paying their bill. Let’s see what the actual label for this example is:



In [8]:
id2label = minds.features["intent_class"].int2str
id2label(example["intent_class"])

'pay_bill'

Hooray! The predicted label was correct! Here we were lucky to find a model that can classify the exact labels that we need. Often, when dealing with a classification task, a pre-trained model’s set of classes is not exactly the same as the classes you need the model to distinguish. In this case, you can fine-tune a pre-trained model to “calibrate” it to your exact set of class labels. We’ll learn how to do this in the upcoming units. Now, let’s take a look at another very common task in speech processing: automatic speech recognition.



## Speech Recognition with a Pipeline

To get started, load the dataset and upsample it to 16 kHz as described in “Audio classification with a pipeline”, if you haven’t done that yet.

To transcribe an audio recording, we can use the automatic-speech-recognition pipeline from Transformers. Let’s instantiate the pipeline:



In [9]:
from transformers import pipeline

asr = pipeline("automatic-speech-recognition")

No model was supplied, defaulted to facebook/wav2vec2-base-960h and revision 22aad52 (https://huggingface.co/facebook/wav2vec2-base-960h).
Using a pipeline without specifying a model name and revision in production is not recommended.



Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Device set to use cpu


Next, we’ll take an example from the dataset and pass its raw data to the pipeline:

In [10]:
example = minds[0]
asr(example["audio"]["array"])

{'text': 'I WOULD LIKE TO PAY MY ELECTRICITY BILL USING MY CAD CAN YOU PLEASE ASSIST'}

Let’s compare this output to what the actual transcription for this example is:

In [11]:
example["english_transcription"]

'I would like to pay my electricity bill using my card can you please assist'

The model seems to have done a pretty good job at transcribing the audio! Ignoring capitalisation, it only got one word wrong (“card” was transcribed as “CAD”) compared to the original transcription, which is pretty good considering the speaker has an Australian accent, where the letter “r” is often silent.
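A common way to quantify this kind of comparison is the word error rate (WER): the number of word-level substitutions, deletions, and insertions divided by the number of words in the reference. The sketch below is a minimal plain-Python version that lowercases both strings before comparing. `word_errors` is a hypothetical helper, not part of the pipeline.

```python
def word_errors(reference, hypothesis):
    """Return (word-level edit distance, number of reference words)."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # previous[j] is the distance between the ref words seen so far
    # and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1], len(ref)


reference = "I would like to pay my electricity bill using my card can you please assist"
hypothesis = "I WOULD LIKE TO PAY MY ELECTRICITY BILL USING MY CAD CAN YOU PLEASE ASSIST"

errors, n_words = word_errors(reference, hypothesis)
wer = errors / n_words
print(f"{errors} error(s) in {n_words} words, WER = {wer:.3f}")
```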



By default, this pipeline uses a model trained for automatic speech recognition in English, which is fine in this example. If you’d like to try transcribing other subsets of MINDS-14 in a different language, you can find a pre-trained ASR model on the Hub. You can filter the models list by task first, then by language. Once you have found a model you like, pass its name as the model argument to the pipeline.



Let’s try this for the German split of MINDS-14. Load the “de-DE” subset:

In [12]:
from datasets import load_dataset
from datasets import Audio

minds = load_dataset("PolyAI/minds14", name="de-DE", split="train")
minds = minds.cast_column("audio", Audio(sampling_rate=16_000))


In [13]:
example = minds[0]
example["transcription"]

'ich möchte gerne Geld auf mein Konto einzahlen'

Find a pre-trained ASR model for the German language on the Hub, instantiate a pipeline, and transcribe the example:



In [14]:
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="maxidl/wav2vec2-large-xlsr-german")
asr(example["audio"]["array"])


Device set to use cpu


{'text': 'ich möchte gerne geld auf mein konto einzallen'}
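The German output differs from the reference transcription in capitalisation and in one word. As a rough sketch, the standard-library `difflib` module can locate the mismatched words once both strings are lowercased. The two strings are copied from the cells above, and the variable names are chosen only for illustration.

```python
import difflib

reference = "ich möchte gerne Geld auf mein Konto einzahlen"
hypothesis = "ich möchte gerne geld auf mein konto einzallen"

ref_words = reference.lower().split()
hyp_words = hypothesis.lower().split()

# Compare the two word sequences and keep only the spans that differ.
matcher = difflib.SequenceMatcher(a=ref_words, b=hyp_words, autojunk=False)
mismatches = [
    (ref_words[i1:i2], hyp_words[j1:j2])
    for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    if tag != "equal"
]
print(mismatches)
```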

## Audio Generation with a Pipeline

In [15]:
pip install --upgrade transformers

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.



Let’s begin by exploring text-to-speech generation. First, just as with audio classification and automatic speech recognition, we’ll need to define the pipeline. We’ll define a text-to-speech pipeline since it best describes our task, and use the suno/bark-small checkpoint:



In [16]:
from transformers import pipeline

pipe = pipeline("text-to-speech", model="suno/bark-small")


Device set to use cpu


The next step is as simple as passing some text through the pipeline. All the preprocessing will be done for us under the hood:



In [21]:
text = "Ladybugs have had important roles in culture and religion, being associated with luck, love, fertility and prophecy. "
output = pipe(text)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.


In a notebook, we can use the following code snippet to listen to the result:

In [20]:
from IPython.display import Audio

Audio(output["audio"], rate=output["sampling_rate"])
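Outside a notebook, one option is to write the generated audio to a WAV file. The sketch below uses only the standard library and assumes mono floating-point samples in the range [-1, 1]. `save_wav` is a hypothetical helper, and a synthetic 440 Hz tone at a placeholder rate of 24 kHz stands in for `output["audio"]` and `output["sampling_rate"]`.

```python
import math
import struct
import tempfile
import wave
from pathlib import Path


def save_wav(path, samples, sampling_rate):
    """Write mono float samples in [-1, 1] as 16-bit PCM."""
    frames = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, float(sample)))
        frames += struct.pack("<h", int(clipped * 32767))
    with wave.open(str(path), "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav_file.setframerate(sampling_rate)
        wav_file.writeframes(bytes(frames))


# Stand-in for the pipeline output: one second of a quiet 440 Hz tone.
sampling_rate = 24_000  # placeholder; use output["sampling_rate"] in practice
tone = [0.3 * math.sin(2 * math.pi * 440 * n / sampling_rate) for n in range(sampling_rate)]

wav_path = Path(tempfile.gettempdir()) / "pipeline_output_demo.wav"
save_wav(wav_path, tone, sampling_rate)
print("wrote", wav_path)
```

With a real pipeline result, the call would look something like `save_wav("speech.wav", output["audio"].flatten(), output["sampling_rate"])`, where flattening guards against a leading channel dimension.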


The model that we’re using with the pipeline, Bark, is actually multilingual, so we can easily substitute the initial text with a text in, say, French, and use the pipeline in the exact same way. It will pick up on the language all by itself:



In [None]:
fr_text = "Contrairement à une idée répandue, le nombre de points sur les élytres d'une coccinelle ne correspond pas à son âge, ni en nombre d'années, ni en nombre de mois. "
output = pipe(fr_text)
Audio(output["audio"], rate=output["sampling_rate"])

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.


Not only is this model multilingual, it can also generate audio with non-verbal communications and singing. Here’s how you can make it sing:



In [None]:
song = "♪ In the jungle, the mighty jungle, the ladybug was seen. ♪ "
output = pipe(song)
Audio(output["audio"], rate=output["sampling_rate"])

We’ll dive deeper into Bark specifics in a later unit dedicated to text-to-speech, and will also show how you can use other models for this task. Now, let’s generate some music!



## Generating Music

For music generation, we’ll define a text-to-audio pipeline and initialise it with the pretrained checkpoint facebook/musicgen-small:



In [None]:
music_pipe = pipeline("text-to-audio", model="facebook/musicgen-small")

Let’s create a text description of the music we’d like to generate:

In [None]:
text = "90s rock song with electric guitar and heavy drums"

We can control the length of the generated output by passing an additional max_new_tokens parameter to the model.



In [None]:
forward_params = {"max_new_tokens": 512}

output = music_pipe(text, forward_params=forward_params)
Audio(output["audio"][0], rate=output["sampling_rate"])
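To get a feel for how `max_new_tokens` relates to clip length, the sketch below converts between tokens and seconds. It assumes that the model emits roughly 50 audio tokens per second, which matches the MusicGen documentation's note that 256 tokens correspond to about 5 seconds. `TOKENS_PER_SECOND` and both helper functions are hypothetical names, and the rate should be checked against the checkpoint you use.

```python
import math

TOKENS_PER_SECOND = 50  # assumption: approximate MusicGen audio token rate


def approx_duration_seconds(max_new_tokens, tokens_per_second=TOKENS_PER_SECOND):
    """Estimate the length of the generated clip in seconds."""
    return max_new_tokens / tokens_per_second


def tokens_for_duration(seconds, tokens_per_second=TOKENS_PER_SECOND):
    """Estimate the max_new_tokens needed for a clip of the given length."""
    return math.ceil(seconds * tokens_per_second)


print(approx_duration_seconds(512))  # tokens used in the cell above
print(tokens_for_duration(30))
```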