## Audio Quickstart


In this quickstart, you’ll prepare the MInDS-14 dataset for a model to train on and classify the banking issue a customer is having.

Load the MInDS-14 dataset by providing the load_dataset() function with the dataset name, dataset configuration (not all datasets will have a configuration), and a dataset split.

`Dataset` is backed by an Apache Arrow table.

In [10]:
from datasets import load_dataset, Audio

dataset = load_dataset("PolyAI/minds14", "en-US", split="train")

Found cached dataset minds14 (/home/vscode/.cache/huggingface/datasets/PolyAI___minds14/en-US/1.0.0/dbb7ed8d7a009916cc6561b16095b37bb4815461a20c26fb2c2d37a634bb9e37)


Next, load a pretrained Wav2Vec2 model and its corresponding feature extractor from the 🤗 Transformers library. It is normal to see a warning after you load the model about some weights not being initialized. This is expected because you are loading this checkpoint in order to train it on a different task.

In [2]:
from transformers import AutoModelForAudioClassification, AutoFeatureExtractor

model = AutoModelForAudioClassification.from_pretrained("facebook/wav2vec2-base")
feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")

Some weights of Wav2Vec2ForSequenceClassification were not initialized from the model checkpoint at facebook/wav2vec2-base and are newly initialized: ['classifier.weight', 'projector.weight', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'projector.bias', 'classifier.bias', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



The MInDS-14 dataset card indicates the sampling rate is 8 kHz, but the Wav2Vec2 model was pretrained at a sampling rate of 16 kHz. You’ll need to upsample the audio column with the cast_column() function and the Audio feature to match the model’s sampling rate.

In [12]:
dataset[0]

{'path': '/home/vscode/.cache/huggingface/datasets/downloads/extracted/ae597f7b595b19678bcc1c11dfe231bdd0e83da27da19ab6d0f8b5e8ad51dcc8/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav',
 'audio': {'path': '/home/vscode/.cache/huggingface/datasets/downloads/extracted/ae597f7b595b19678bcc1c11dfe231bdd0e83da27da19ab6d0f8b5e8ad51dcc8/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav',
  'array': array([ 0.        ,  0.00024414, -0.00024414, ..., -0.00024414,
          0.        ,  0.        ]),
  'sampling_rate': 8000},
 'transcription': 'I would like to set up a joint account with my partner',
 'english_transcription': 'I would like to set up a joint account with my partner',
 'intent_class': 11,
 'lang_id': 4}

In [11]:
dataset[0]["audio"]

{'path': '/home/vscode/.cache/huggingface/datasets/downloads/extracted/ae597f7b595b19678bcc1c11dfe231bdd0e83da27da19ab6d0f8b5e8ad51dcc8/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav',
 'array': array([ 0.        ,  0.00024414, -0.00024414, ..., -0.00024414,
         0.        ,  0.        ]),
 'sampling_rate': 8000}

In [16]:
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
dataset[0]["audio"]

{'path': '/home/vscode/.cache/huggingface/datasets/downloads/extracted/ae597f7b595b19678bcc1c11dfe231bdd0e83da27da19ab6d0f8b5e8ad51dcc8/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav',
 'array': array([ 1.70562416e-05,  2.18727451e-04,  2.28099874e-04, ...,
         3.43842403e-05, -5.96364771e-06, -1.76846661e-05]),
 'sampling_rate': 16000}
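
To make the effect of the cast concrete, the sketch below upsamples a short signal from 8 kHz to 16 kHz with plain linear interpolation. It only illustrates how the number of samples changes. The `upsample_linear` helper is hypothetical and dependency-free; the Audio feature relies on a dedicated resampling routine, so its output values will differ from this approximation.

```python
def upsample_linear(signal, orig_rate, target_rate):
    """Resample `signal` by linear interpolation (illustration only)."""
    if not signal:
        return []
    n_out = round(len(signal) * target_rate / orig_rate)
    step = orig_rate / target_rate  # distance between output samples, in input samples
    out = []
    for i in range(n_out):
        pos = i * step
        left = int(pos)
        right = min(left + 1, len(signal) - 1)  # clamp at the last input sample
        frac = pos - left
        out.append(signal[left] * (1 - frac) + signal[right] * frac)
    return out


one_second_8k = [0.0, 1.0] * 4000  # 8000 samples = one second at 8 kHz
one_second_16k = upsample_linear(one_second_8k, 8000, 16000)
print(len(one_second_8k), len(one_second_16k))  # 8000 16000
```

The duration of the recording is unchanged; only the number of samples per second doubles.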


Create a function to preprocess the audio array with the feature extractor, and truncate and pad the sequences into tidy rectangular tensors. The most important thing to remember is to pass the audio array to the feature extractor, since the array (the actual speech signal) is the model input.

In [15]:
def preprocess_function(examples):
    audio_arrays = [x["array"] for x in examples["audio"]]
    inputs = feature_extractor(
        audio_arrays,
        sampling_rate=16000,
        padding=True,
        max_length=100000,
        truncation=True,
    )
    return inputs

Apply the preprocess_function to the first few examples in the dataset:

In [90]:
processed_dataset = preprocess_function(dataset[:5])
processed_dataset["input_values"][0].shape

(100000,)

In [6]:
processed_dataset["input_values"][1].shape

(100000,)
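
The identical shapes above are what truncation followed by padding is meant to produce. As a rough illustration, the hypothetical `truncate_and_pad` helper below cuts every sequence to at most `max_length` samples and then zero-pads the shorter ones to the longest remaining length. It is a simplified stand-in written with plain lists, not the feature extractor's actual code, which may also normalize the signal.

```python
def truncate_and_pad(sequences, max_length, pad_value=0.0):
    """Cut each sequence to max_length, then pad all of them to the longest one."""
    truncated = [list(seq[:max_length]) for seq in sequences]
    target = max((len(seq) for seq in truncated), default=0)
    return [seq + [pad_value] * (target - len(seq)) for seq in truncated]


batch = [[0.1, 0.2, 0.3, 0.4, 0.5], [0.6, 0.7], [0.8, 0.9, 1.0]]
rectangular = truncate_and_pad(batch, max_length=4)
print([len(seq) for seq in rectangular])  # [4, 4, 4]
```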

Once you have a preprocessing function, use the map() function to speed up processing by applying the function to batches of examples in the dataset.

In [88]:
#dataset = dataset.map(preprocess_function, batched=True)
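
With batched processing, the function receives a dictionary that maps each column name to a list of values rather than a single example, which is why preprocess_function loops over `examples["audio"]`. The toy `batched_map` helper below is a hypothetical, dependency-free imitation of that calling convention, written only to show the shape of the data. It is not how the library implements map().

```python
def batched_map(columns, function, batch_size):
    """Apply `function` to column-wise batches and merge the returned columns."""
    n_rows = len(next(iter(columns.values()))) if columns else 0
    result = {name: list(values) for name, values in columns.items()}
    new_columns = {}
    for start in range(0, n_rows, batch_size):
        # Each batch is a dict: column name -> list of values for that slice.
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        for name, values in function(batch).items():
            new_columns.setdefault(name, []).extend(values)
    result.update(new_columns)
    return result


def add_length(examples):
    # `examples["audio"]` is a list with one entry per example in the batch.
    return {"num_samples": [len(x["array"]) for x in examples["audio"]]}


toy = {"audio": [{"array": [0.0] * n} for n in (3, 5, 2)]}
mapped = batched_map(toy, add_length, batch_size=2)
print(mapped["num_samples"])  # [3, 5, 2]
```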

Use the rename_column() function to rename the intent_class column to labels, which is the expected input name in Wav2Vec2ForSequenceClassification:

In [25]:
dataset = dataset.rename_column("intent_class", "labels")

In [27]:
dataset[0]

{'path': '/home/vscode/.cache/huggingface/datasets/downloads/extracted/ae597f7b595b19678bcc1c11dfe231bdd0e83da27da19ab6d0f8b5e8ad51dcc8/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav',
 'audio': {'path': '/home/vscode/.cache/huggingface/datasets/downloads/extracted/ae597f7b595b19678bcc1c11dfe231bdd0e83da27da19ab6d0f8b5e8ad51dcc8/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav',
  'array': array([ 1.70562416e-05,  2.18727451e-04,  2.28099874e-04, ...,
          3.43842403e-05, -5.96364771e-06, -1.76846661e-05]),
  'sampling_rate': 16000},
 'transcription': 'I would like to set up a joint account with my partner',
 'english_transcription': 'I would like to set up a joint account with my partner',
 'labels': 11,
 'lang_id': 4}

In [92]:
miniset = dataset[:5]

In [93]:
type(dataset)

datasets.arrow_dataset.Dataset

In [94]:
type(miniset)

dict

In [104]:
from datasets import Dataset

miniset['input_values'] = processed_dataset['input_values']
# Map-style dataset: the entire dataset must be stored on disk or in memory.
ds = Dataset.from_dict(miniset)

# Iterable dataset: only a small fraction of examples is loaded in memory,
# and nothing is written to disk.
iterable_dataset = ds.to_iterable_dataset()
iterable_dataset

<datasets.iterable_dataset.IterableDataset at 0x7f7f1baa1bb0>
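
The difference between the two dataset types can be pictured with plain Python objects: a list supports random access and has a known length, while a generator produces examples one at a time. This is an analogy under simplifying assumptions, not the library's implementation, and the names `map_style` and `iterable_style` are made up for the example.

```python
# Map-style: every example is materialized, so indexing and len() work.
map_style = [{"labels": 11, "id": i} for i in range(5)]
print(len(map_style), map_style[3]["id"])  # 5 3


# Iterable-style: examples are created lazily, one at a time, in order.
def iterable_style(num_examples):
    for i in range(num_examples):
        yield {"labels": 11, "id": i}


stream = iterable_style(5)
first = next(stream)
second = next(stream)
print(first["id"], second["id"])  # 0 1
```

Random access such as `stream[3]` is not available on the generator; reaching the fourth example means iterating past the first three.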

Use the set_format() function to set the dataset format to torch and specify the columns you want to format. This function applies formatting on-the-fly. After converting to PyTorch tensors, wrap the dataset in torch.utils.data.DataLoader:

In [105]:
from torch.utils.data import DataLoader

new_ds = ds
new_ds.set_format(type="torch", columns=["input_values", "labels"])
dataloader = DataLoader(new_ds, batch_size=4)

### Data loading

There are several ways to speed up data loading, which can save time, especially with large datasets. PyTorch offers parallelized data loading and retrieval of whole batches of indices rather than one index at a time, and streaming lets you iterate over a dataset without downloading it to disk.

* multiple workers - starts num_workers processes. Each process reloads the dataset passed to the DataLoader and is used to query examples. Reloading the dataset inside a worker doesn’t fill up your RAM, since it simply memory-maps the dataset again from your disk
* stream data - using an IterableDataset.  If the dataset is split in several shards (i.e. if the dataset consists of multiple data files), then you can stream in parallel using num_workers.  Learn more about which type of dataset is best for your use case in the [choosing between a regular dataset or an iterable dataset guide](https://huggingface.co/docs/datasets/v2.18.0/en/about_mapstyle_vs_iterable).
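* distributed - split the dataset across nodes so that each process loads only its own chunk of data (for example with `split_dataset_by_node` from `datasets.distributed`)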
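
For the streaming case, one way to picture parallel loading is that each worker is handed its own subset of shards. The sketch below assigns shards round-robin. The file names and the `assign_shards` helper are hypothetical, and the library's actual assignment strategy may differ.

```python
def assign_shards(shards, num_workers):
    """Give worker i every num_workers-th shard, starting at index i."""
    return {worker: shards[worker::num_workers] for worker in range(num_workers)}


shards = [f"data-{i:05d}.parquet" for i in range(6)]  # hypothetical file names
assignment = assign_shards(shards, num_workers=4)
for worker, files in assignment.items():
    print(worker, files)
```

With a single shard, only one worker would receive any data, which is why parallel streaming depends on the dataset being split into several files.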

In [102]:
#workers
ds = Dataset.from_dict(miniset)
dataloader = DataLoader(ds, batch_size=32, num_workers=4)

In [106]:
for batch in dataloader:
    print(batch)

{'labels': tensor([11, 11, 11, 11]), 'input_values': tensor([[ 3.7660e-04,  2.8342e-03,  2.9484e-03,  ..., -6.5770e-04,
          2.7497e-03,  4.6367e-03],
        [ 9.8650e-05,  3.7332e-03,  6.9454e-03,  ...,  1.3290e-02,
          1.8930e-02,  1.9547e-02],
        [ 2.2214e-04,  5.0336e-04,  2.9262e-04,  ..., -2.6319e+00,
         -2.1793e+00, -1.7696e+00],
        [ 2.8104e-03,  1.9075e-03,  2.7828e-04,  ..., -1.3702e-05,
         -1.3702e-05, -1.3702e-05]])}
{'labels': tensor([11]), 'input_values': tensor([[-2.2645e-03, -1.0360e-03,  4.2160e-06,  ...,  1.3076e-05,
          1.3076e-05,  1.3076e-05]])}
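
The two batches printed above, four examples and then one, are consistent with a five-example dataset read with `batch_size=4`, as in the earlier DataLoader cell. A minimal sketch of that index grouping, assuming the final incomplete batch is kept (the `batch_indices` helper is made up for illustration):

```python
def batch_indices(num_examples, batch_size, drop_last=False):
    """Group consecutive example indices into batches."""
    batches = [
        list(range(start, min(start + batch_size, num_examples)))
        for start in range(0, num_examples, batch_size)
    ]
    # Optionally discard a trailing batch that is smaller than batch_size.
    if drop_last and batches and len(batches[-1]) < batch_size:
        batches.pop()
    return batches


print(batch_indices(5, 4))                  # [[0, 1, 2, 3], [4]]
print(batch_indices(5, 4, drop_last=True))  # [[0, 1, 2, 3]]
```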


In [110]:
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
new_ds = ds.with_format("torch", device=device)

new_ds[0]['labels']         #properly converted to tensor

tensor(11)

In [82]:
#iterable and workers
ds = Dataset.from_dict(miniset)
iterable_dataset = ds.to_iterable_dataset()
dataloader = DataLoader(iterable_dataset, batch_size=32, num_workers=4)

In [85]:
next(iter(iterable_dataset))

{'path': '/home/vscode/.cache/huggingface/datasets/downloads/extracted/ae597f7b595b19678bcc1c11dfe231bdd0e83da27da19ab6d0f8b5e8ad51dcc8/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav',
 'audio': {'array': [1.7056241631507874e-05,
   0.0002187274512834847,
   0.00022809987422078848,
   ...

## Audio Classification