# Audio Classification with a Pipeline

In [1]:
from datasets import load_dataset
from datasets import Audio

minds = load_dataset("PolyAI/minds14", name="en-AU", split="train")
minds = minds.cast_column("audio", Audio(sampling_rate=16_000))
minds

Dataset({
    features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'],
    num_rows: 654
})

To classify an audio recording into a set of classes, we can use the audio-classification pipeline from 🤗 Transformers. In our case, we need a model that’s been fine-tuned for intent classification, and specifically on the MINDS-14 dataset. Luckily for us, the Hub has a model that does just that! Let’s load it by using the pipeline() function:

In [2]:
from transformers import pipeline

classifier = pipeline(
    "audio-classification",
    model="anton-l/xtreme_s_xlsr_300m_minds14",
)





In [17]:
example = minds[303]
example

{'path': 'C:\\Users\\furka\\.cache\\huggingface\\datasets\\downloads\\extracted\\0254525c29529d1916d49e22775fa58c67ef650389b27f8bba54af8a1aa9db37\\en-AU~JOINT_ACCOUNT\\response_50.wav',
 'audio': {'path': 'C:\\Users\\furka\\.cache\\huggingface\\datasets\\downloads\\extracted\\0254525c29529d1916d49e22775fa58c67ef650389b27f8bba54af8a1aa9db37\\en-AU~JOINT_ACCOUNT\\response_50.wav',
  'array': array([-3.04158311e-06,  1.48926978e-04,  2.47845892e-04, ...,
          7.30325701e-04,  4.14848852e-04,  3.27119837e-04]),
  'sampling_rate': 16000},
 'transcription': "hi I just wanted to find out how I set up the joint account I can't go for it thanks",
 'english_transcription': "hi I just wanted to find out how I set up the joint account I can't go for it thanks",
 'intent_class': 11,
 'lang_id': 2}

In [18]:
classifier(example["audio"]["array"])

[{'score': 0.9984550476074219, 'label': 'joint_account'},
 {'score': 0.0004052676085848361, 'label': 'business_loan'},
 {'score': 0.0003753997152671218, 'label': 'cash_deposit'},
 {'score': 0.000162547075888142, 'label': 'atm_limit'},
 {'score': 0.00013410499377641827, 'label': 'abroad'}]

In [19]:
id2label = minds.features["intent_class"].int2str
id2label(example["intent_class"])

'joint_account'
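The predicted label and the ground-truth label can also be compared in code instead of by eye. The sketch below is illustrative only: `top_prediction` is a hypothetical helper (not part of 🤗 Transformers), and the `predictions` list is copied from the classifier output shown above rather than recomputed.

```python
def top_prediction(predictions):
    """Return the (label, score) pair with the highest score."""
    best = max(predictions, key=lambda p: p["score"])
    return best["label"], best["score"]


# Copied from the classifier output above (first three entries only).
predictions = [
    {"score": 0.9984550476074219, "label": "joint_account"},
    {"score": 0.0004052676085848361, "label": "business_loan"},
    {"score": 0.0003753997152671218, "label": "cash_deposit"},
]

predicted_label, confidence = top_prediction(predictions)

# Value returned by id2label(example["intent_class"]) above.
true_label = "joint_account"

print(predicted_label == true_label)
```

In the notebook itself, `predictions` would be the list returned by `classifier(example["audio"]["array"])`.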

# Automatic Speech Recognition with a Pipeline

Automatic Speech Recognition (ASR) is the task of transcribing a speech recording into text. It has numerous practical applications, from creating closed captions for videos to enabling voice commands for virtual assistants like Siri and Alexa.

In this section, we’ll use the automatic-speech-recognition pipeline to transcribe an audio recording of a person reporting a problem with an app, taken from the same MINDS-14 dataset as before.

In [20]:
from transformers import pipeline

asr = pipeline("automatic-speech-recognition")

No model was supplied, defaulted to facebook/wav2vec2-base-960h and revision 22aad52 (https://huggingface.co/facebook/wav2vec2-base-960h).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [23]:
example = minds[100]
asr(example["audio"]["array"])

{'text': 'I AM TRYING TO USE THE TE APT BUT THE APT DOS NOT WATERED IT CAPES FRAZING'}

In [24]:
example["english_transcription"]

"I'm trying to use the new app but the app does not load it keeps freezing"

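The gap between the model output and the reference transcription can be quantified with a word error rate (WER): the word-level edit distance divided by the number of reference words. The following is a minimal pure-Python sketch, not the metric implementation of any particular library. `normalize` and `word_error_rate` are hypothetical helper names, and the normalization (lowercasing and stripping punctuation other than apostrophes) is a simplifying assumption, so contractions such as "I'm" versus "I AM" still count as errors.

```python
import re


def normalize(text):
    """Lowercase, drop punctuation except apostrophes, and split into words."""
    return re.sub(r"[^a-z' ]", " ", text.lower()).split()


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the edit distance between ref[:i-1] and hyp[:j].
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1] / len(ref)


# Both strings are copied from the cell outputs above.
reference = "I'm trying to use the new app but the app does not load it keeps freezing"
hypothesis = "I AM TRYING TO USE THE TE APT BUT THE APT DOS NOT WATERED IT CAPES FRAZING"

print(f"WER: {word_error_rate(reference, hypothesis):.2f}")
```

A WER of 0 means the two word sequences match exactly after normalization; higher values mean more substitutions, insertions, or deletions.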
# Audio Generation with a Pipeline

## Generating Speech

In [25]:
from transformers import pipeline

pipe = pipeline("text-to-speech", model="suno/bark-small")


We can now generate speech by passing some text through the pipeline. All the preprocessing is done for us under the hood:

In [28]:
text = "Furkan is trying to be the best, and Ladybugs have had important roles in culture and religion, being associated with luck, love, fertility and prophecy. "
output = pipe(text)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


We can use the following code snippet to listen to the result:

In [29]:
from IPython.display import Audio

Audio(output["audio"], rate=output["sampling_rate"])
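To keep the generated speech outside the notebook, the samples can be written to a WAV file. The sketch below uses only the Python standard library. `save_wav` is a hypothetical helper, and it assumes the audio is a flat, one-dimensional sequence of floats in the range [-1.0, 1.0]. Whether `output["audio"]` already has that layout is an assumption to verify; if it carries an extra leading dimension, it would need to be flattened first. A synthetic tone stands in for the model output here.

```python
import math
import struct
import wave


def save_wav(path, samples, sampling_rate):
    """Write mono float samples in [-1.0, 1.0] to a 16-bit PCM WAV file."""
    frames = bytearray()
    for sample in samples:
        # Clip to the valid range, then scale to a signed 16-bit integer.
        clipped = max(-1.0, min(1.0, float(sample)))
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)  # mono
        wav_file.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wav_file.setframerate(sampling_rate)
        wav_file.writeframes(bytes(frames))


# Stand-in signal: one second of a 440 Hz tone sampled at 16 kHz.
rate = 16_000
tone = [0.5 * math.sin(2 * math.pi * 440 * n / rate) for n in range(rate)]
save_wav("tone.wav", tone, rate)
```

In the notebook, the call would take the flattened `output["audio"]` samples and `output["sampling_rate"]` in place of `tone` and `rate`.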