# I. Audio classification with a pipeline

Audio classification involves **assigning one or more labels to an audio recording** based on its content.
The labels could correspond to different sound categories, such as music, speech, or noise, or more specific categories like bird song or car engine sounds.  

**Example:** The MINDS-14 dataset contains recordings of people asking an e-banking system questions in several languages and dialects, and each recording is labeled with an `intent_class`.



In [1]:
from datasets import load_dataset
from datasets import Audio

minds = load_dataset("PolyAI/minds14", name="en-AU", split="train")
minds = minds.cast_column("audio", Audio(sampling_rate=16_000))

The repository for PolyAI/minds14 contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/PolyAI/minds14.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N]  y


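With the dataset loaded, a recording can be passed to an `audio-classification` pipeline, which returns a list of dictionaries, each holding a `label` and a `score`. The sketch below only illustrates how such a result could be ranked. The `predictions` values are made-up placeholders rather than real model output, and `top_labels` is a hypothetical helper, not part of 🤗 Transformers.

```python
def top_labels(predictions, k=3):
    """Return the k highest-scoring (label, score) pairs."""
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [(p["label"], p["score"]) for p in ranked[:k]]

# Placeholder scores in the shape an audio-classification pipeline returns;
# the numbers are invented for illustration only.
predictions = [
    {"score": 0.03, "label": "balance"},
    {"score": 0.95, "label": "pay_bill"},
    {"score": 0.02, "label": "freeze"},
]

best = top_labels(predictions, k=2)
print(best)
```

With a real pipeline, `predictions` would come from a call such as `classifier(example["audio"]["array"])`, where `classifier` is built with `pipeline("audio-classification", model=...)` and a checkpoint fine-tuned on MINDS-14.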

# II. Automatic Speech Recognition (ASR) with a pipeline

ASR is the task of transcribing a recording of speech into text.

In this section, we’ll use the `automatic-speech-recognition` pipeline to transcribe a recording of a person asking about paying a bill, taken from the same MINDS-14 dataset as before.

In [2]:
from transformers import pipeline

# instantiate the pipeline
asr = pipeline("automatic-speech-recognition")

# take an example from the dataset and pass its raw audio array to the pipeline
example = minds[0]
asr(example["audio"]["array"])

No model was supplied, defaulted to facebook/wav2vec2-base-960h and revision 22aad52 (https://huggingface.co/facebook/wav2vec2-base-960h).
Using a pipeline without specifying a model name and revision in production is not recommended.


Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


{'text': 'I WOULD LIKE TO PAY MY ELECTRICITY BILL USING MY CAD CAN YOU PLEASE ASSIST'}
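The call above passes only the raw array, so the pipeline has to assume the audio is already at the sampling rate the model expects (16 kHz here, thanks to the earlier `cast_column`). The ASR pipeline also accepts a dictionary with `raw` and `sampling_rate` keys, which makes the rate explicit. A minimal sketch, where `to_pipeline_input` and `duration_seconds` are hypothetical helpers and the plain list stands in for a real NumPy waveform:

```python
def to_pipeline_input(audio):
    """Repackage a datasets-style audio dict as {"raw", "sampling_rate"}."""
    return {"raw": audio["array"], "sampling_rate": audio["sampling_rate"]}

def duration_seconds(audio):
    """Length of the clip in seconds: number of samples / samples per second."""
    return len(audio["array"]) / audio["sampling_rate"]

# Stand-in for example["audio"]: one second of silence at 16 kHz
audio = {"array": [0.0] * 16_000, "sampling_rate": 16_000}

inputs = to_pipeline_input(audio)
print(inputs["sampling_rate"], duration_seconds(audio))
```

In the notebook this would be used as `asr(to_pipeline_input(example["audio"]))`.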

# III. Audio generation with a pipeline

Audio generation covers a range of tasks that produce audio as output.
The two we will look at here are speech generation (also known as text-to-speech) and music generation.

- In text-to-speech, a model turns a piece of text into lifelike spoken audio, opening the door to applications such as virtual assistants, accessibility tools for the visually impaired, and personalized audiobooks.
- Music generation enables creative expression and is used mostly in the entertainment and game development industries.


In 🤗 Transformers, a single pipeline covers both tasks.
It is called `"text-to-audio"` and, for convenience, also has a `"text-to-speech"` alias.


### 1. Generating speech

In [4]:
from transformers import pipeline

pipe = pipeline("text-to-speech", model="suno/bark-small")

# passing some text through the pipeline; all the preprocessing will be done for us under the hood:
text = "Ladybugs have had important roles in culture and religion, being associated with luck, love, fertility and prophecy. "
output = pipe(text)

# In a notebook, we can use the following code snippet to listen to the result:
from IPython.display import Audio
Audio(output["audio"], rate=output["sampling_rate"])

TypeError: transformers.generation.utils.GenerationMixin.generate() got multiple values for keyword argument 'generation_config'

In [None]:
song = "♪ In the jungle, the mighty jungle, the ladybug was seen. ♪ "
output = pipe(song)
Audio(output["audio"], rate=output["sampling_rate"])
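Outside a notebook there is no `Audio` widget, so the generated waveform has to be written to a file to be heard. The following is a sketch using only the standard-library `wave` module, assuming the audio is a one-dimensional sequence of floats in [-1, 1] (an array with an extra leading dimension would need to be indexed or flattened first). `save_wav` is a hypothetical helper, and the sine tone, with its arbitrarily chosen 24 kHz rate, stands in for `output["audio"]` and `output["sampling_rate"]`.

```python
import math
import struct
import wave

def save_wav(path, samples, sampling_rate):
    """Write mono float samples in [-1, 1] as 16-bit PCM."""
    frames = bytearray()
    for s in samples:
        clipped = max(-1.0, min(1.0, float(s)))  # guard against overshoot
        frames += struct.pack("<h", int(clipped * 32767))
    with wave.open(path, "wb") as f:
        f.setnchannels(1)              # mono
        f.setsampwidth(2)              # 2 bytes = 16 bits per sample
        f.setframerate(sampling_rate)
        f.writeframes(bytes(frames))

# Stand-in for generated audio: half a second of a 440 Hz tone
rate = 24_000
tone = [0.5 * math.sin(2 * math.pi * 440 * n / rate) for n in range(rate // 2)]
save_wav("tone.wav", tone, rate)
```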

### 2. Generating music

In [5]:
# Re-import the notebook audio player: `from datasets import Audio` in the first
# cell binds the same name to a class that does not accept a `rate` argument.
from IPython.display import Audio

music_pipe = pipeline("text-to-audio", model="facebook/musicgen-small")

# create a text description of the music we’d like to generate
text = "90s rock song with electric guitar and heavy drums"

# control the length of the generated output by passing an additional max_new_tokens parameter to the model
forward_params = {"max_new_tokens": 512}

output = music_pipe(text, forward_params=forward_params)
Audio(output["audio"][0], rate=output["sampling_rate"])
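`max_new_tokens` sets the length of the clip indirectly: the model emits a fixed number of audio tokens per second, so the duration is roughly the token count divided by that frame rate. The sketch below assumes a frame rate of 50 tokens per second, the figure given in the MusicGen documentation. Treat it as an assumption and check `model.config.audio_encoder.frame_rate` for the checkpoint actually in use. `tokens_for_duration` and `duration_for_tokens` are hypothetical helpers.

```python
import math

FRAME_RATE_HZ = 50  # assumed audio tokens per second; verify for your checkpoint

def tokens_for_duration(seconds, frame_rate=FRAME_RATE_HZ):
    """Smallest max_new_tokens value that covers `seconds` of audio."""
    return math.ceil(seconds * frame_rate)

def duration_for_tokens(max_new_tokens, frame_rate=FRAME_RATE_HZ):
    """Approximate clip length in seconds for a given token budget."""
    return max_new_tokens / frame_rate

print(duration_for_tokens(512))  # about 10 seconds under the 50 Hz assumption
print(tokens_for_duration(5))
```

Under this assumption, the `max_new_tokens` value of 512 used above corresponds to a clip of roughly ten seconds.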