<a href="https://colab.research.google.com/github/TirendazAcademy/Audio-Data-with-HuggingFace/blob/main/2-Introduction-to-Audio-Applications.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Audio classification with a pipeline

First, let's install the datasets library.

In [None]:
!pip install -qU datasets

## Preprocessing data

Next, let's load the en-AU subset of the dataset and set its sampling rate to 16 kHz.

In [None]:
from datasets import load_dataset
from datasets import Audio

minds = load_dataset("PolyAI/minds14", name="en-AU", split="train")
minds = minds.cast_column("audio", Audio(sampling_rate=16_000))
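
To build some intuition for what changing the sampling rate does to the raw array, here is a minimal sketch of resampling by linear interpolation. This is only an illustration: `resample_linear` is a hypothetical helper, the 8 kHz source rate is an assumption, and the `datasets` library uses a proper resampling backend rather than this naive method.

```python
def resample_linear(samples, orig_sr, target_sr):
    """Naively resample a list of samples with linear interpolation.

    Illustrative only; real resamplers also filter the signal.
    """
    if not samples:
        return []
    n_out = round(len(samples) * target_sr / orig_sr)
    resampled = []
    for i in range(n_out):
        # Position of output sample i on the original time axis.
        pos = i * orig_sr / target_sr
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        resampled.append(samples[left] * (1 - frac) + samples[right] * frac)
    return resampled


# One second of (silent) audio at an assumed 8 kHz source rate.
one_second_8k = [0.0] * 8_000
one_second_16k = resample_linear(one_second_8k, orig_sr=8_000, target_sr=16_000)

# The duration stays the same, so the number of samples doubles.
print(len(one_second_8k), len(one_second_16k))
```

The point to take away is that the same recording is described by a different number of samples at each rate, which is why the data has to match the rate the model expects.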

Now let's set up the pipeline.

In [None]:
from transformers import pipeline

classifier = pipeline(
    "audio-classification",
    model="anton-l/xtreme_s_xlsr_300m_minds14",
)

To see how the pipeline works, let's take an example.

In [None]:
example = minds[0]

In [None]:
example

Note that we'll pass the example's raw audio array to the pipeline.

In [None]:
example["audio"]["array"]
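
The audio field is a dictionary that holds the raw `array` together with its `sampling_rate`, so the clip length follows directly from the two. Below is a small sketch that uses a synthetic stand-in with the same keys instead of the real example; `duration_seconds` and `fake_audio` are hypothetical names for illustration.

```python
import math


def duration_seconds(audio):
    """Return the clip length in seconds from the array and its sampling rate."""
    return len(audio["array"]) / audio["sampling_rate"]


# Synthetic stand-in for example["audio"]: a 440 Hz tone lasting two seconds.
sampling_rate = 16_000
fake_audio = {
    "path": None,
    "array": [
        math.sin(2 * math.pi * 440 * n / sampling_rate)
        for n in range(2 * sampling_rate)
    ],
    "sampling_rate": sampling_rate,
}

print(duration_seconds(fake_audio))  # 2.0
```

With the real dataset, the same function could be called as `duration_seconds(example["audio"])`.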

In [None]:
classifier(example["audio"]["array"])

The pipeline returns a list of candidate labels with their scores. Let's check whether the top label is correct.

In [None]:
id2label = minds.features["intent_class"].int2str
id2label(example["intent_class"])
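
The comparison can also be done in code. The audio-classification pipeline returns a list of dictionaries with `score` and `label` keys, so a minimal sketch looks like this. The scores and the true label below are made up for illustration and are not real model output; `top_prediction` is a hypothetical helper.

```python
def top_prediction(predictions):
    """Return the prediction dictionary with the highest score."""
    return max(predictions, key=lambda p: p["score"])


# Made-up values in the same shape as the pipeline output.
predictions = [
    {"score": 0.02, "label": "card_issues"},
    {"score": 0.96, "label": "pay_bill"},
    {"score": 0.01, "label": "abroad"},
]
true_label = "pay_bill"  # hypothetical ground truth

best = top_prediction(predictions)
print(best["label"], best["label"] == true_label)
```

With the real objects, `predictions` would be `classifier(example["audio"]["array"])` and `true_label` would be `id2label(example["intent_class"])`.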

# Automatic speech recognition with a pipeline

To transcribe an audio recording, we can use the automatic-speech-recognition pipeline from 🤗 Transformers.

In [None]:
from transformers import pipeline

asr = pipeline("automatic-speech-recognition")

Let's take an example from the dataset and pass its raw data to the pipeline:

In [None]:
example = minds[0]
asr(example["audio"]["array"])

Let’s compare this output with the actual transcription for this example:

In [None]:
example["english_transcription"]
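
Comparing by eye works for one example. To put a number on the difference, a common choice is the word error rate (WER), which this notebook does not otherwise use. Here is a minimal sketch, assuming simple lowercasing and whitespace tokenization; the two sentences are made up and are not real pipeline output.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Made-up strings: one of the eight reference words is misrecognized.
reference = "I would like to pay my electricity bill"
hypothesis = "I WOULD LIKE TO PAY MY ELECTRICITY BIL"
print(word_error_rate(reference, hypothesis))  # 0.125
```

With the real objects, the call would be `word_error_rate(example["english_transcription"], asr(example["audio"]["array"])["text"])`.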

Let’s try this on the German subset of MINDS-14. First, let's load the “de-DE” subset and set its sampling rate.

In [None]:
from datasets import load_dataset
from datasets import Audio

minds = load_dataset("PolyAI/minds14", name="de-DE", split="train")
minds = minds.cast_column("audio", Audio(sampling_rate=16_000))

Get an example and see what the transcription is supposed to be:

In [None]:
example = minds[0]
example["transcription"]

Now let's find a pre-trained ASR model for the German language on the 🤗 Hub, instantiate a pipeline, and transcribe the example:

In [None]:
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="maxidl/wav2vec2-large-xlsr-german")
asr(example["audio"]["array"])

# Audio generation with a pipeline

First, let's install the latest version of Transformers.

In [None]:
!pip install -qU transformers

In [None]:
import transformers
transformers.__version__

## Generating speech

Let's start by defining a text-to-speech pipeline.

In [None]:
from transformers import pipeline

pipe = pipeline("text-to-speech", model="suno/bark-small")

The next step is as simple as passing some text through the pipeline. All the preprocessing will be done for us under the hood:

In [None]:
text = "Ladybugs have had important roles in culture and religion, being associated with luck, love, fertility and prophecy. "
output = pipe(text)

In [None]:
output

Let's listen to the output.

In [None]:
from IPython.display import Audio

Audio(output["audio"], rate=output["sampling_rate"])
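
To keep the generated audio outside the notebook, the samples can be written to a WAV file. Below is a minimal sketch using only the standard library, assuming mono float samples in the range [-1, 1]. `save_wav` is a hypothetical helper, and a synthetic tone stands in for the pipeline output so that the cell runs without a model.

```python
import math
import os
import struct
import tempfile
import wave


def save_wav(path, samples, sampling_rate):
    """Write mono float samples in [-1, 1] as 16-bit PCM."""
    frames = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, float(sample)))
        frames += struct.pack("<h", int(clipped * 32767))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav_file.setframerate(sampling_rate)
        wav_file.writeframes(bytes(frames))


# Synthetic stand-in for the generated speech: one second of a 440 Hz tone.
sampling_rate = 24_000  # arbitrary rate chosen for this sketch
tone = [
    0.3 * math.sin(2 * math.pi * 440 * n / sampling_rate)
    for n in range(sampling_rate)
]

wav_path = os.path.join(tempfile.gettempdir(), "tone.wav")
save_wav(wav_path, tone, sampling_rate)
```

With a real pipeline output, the array may have an extra leading dimension, so it would likely need flattening first, for example `save_wav("speech.wav", output["audio"].squeeze(), output["sampling_rate"])`.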

Note that Bark is a multilingual model. Now let's take a look at another example with a text in French.

In [None]:
fr_text = "Contrairement à une idée répandue, le nombre de points sur les élytres d'une coccinelle ne correspond pas à son âge, ni en nombre d'années, ni en nombre de mois. "
output = pipe(fr_text)
Audio(output["audio"], rate=output["sampling_rate"])

## Generating music

For music generation, let's define a text-to-audio pipeline:

In [None]:
music_pipe = pipeline("text-to-audio", model="facebook/musicgen-small")

Let’s create a text description of the music we’d like to generate:

In [None]:
text = "90s rock song with electric guitar and heavy drums"

Notice that we can control the length of the generated output by passing an additional max_new_tokens parameter to the model.

In [None]:
forward_params = {"max_new_tokens": 512}

output = music_pipe(text, forward_params=forward_params)
Audio(output["audio"][0], rate=output["sampling_rate"])
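
To get a feel for how `max_new_tokens` relates to clip length, here is a rough sketch. It assumes the model's audio codec produces 50 frames per second, which is my understanding for MusicGen but is not stated in this notebook; for a loaded model the value should be readable from its configuration. `estimated_duration_seconds` is a hypothetical helper.

```python
def estimated_duration_seconds(max_new_tokens, frame_rate=50):
    """Approximate clip length, assuming one token per codec frame.

    frame_rate=50 is an assumption for this sketch, not a value taken
    from the notebook.
    """
    return max_new_tokens / frame_rate


print(estimated_duration_seconds(512))  # 10.24
```

Under that assumption, the 512 tokens used above correspond to roughly ten seconds of music.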