# Audio classification tutorial
Adapted from the [Hugging Face audio classification guide](https://huggingface.co/docs/transformers/tasks/audio_classification).

## Load MInDS-14 dataset

In [36]:
from datasets import load_dataset, Audio

In [37]:
import torch

In [38]:
torch.cuda.is_available()

True

In [39]:
device = "cuda:0" if torch.cuda.is_available() else "cpu"

In [40]:
minds = load_dataset("PolyAI/minds14", name="en-US", split="train")
minds = minds.train_test_split(test_size=0.2)
minds

Found cached dataset minds14 (/home/gahye/.cache/huggingface/datasets/PolyAI___minds14/en-US/1.0.0/aa40414f15e0f919231d617440192034af844835dc1e6a697f4b552e0551fd26)


DatasetDict({
    train: Dataset({
        features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'],
        num_rows: 450
    })
    test: Dataset({
        features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'],
        num_rows: 113
    })
})

In [41]:
# Keep only the "audio" and "intent_class" columns; remove the others

minds = minds.remove_columns(["path", "transcription", "english_transcription", "lang_id"])

In [42]:
minds["train"][0]

{'audio': {'path': '/home/gahye/.cache/huggingface/datasets/downloads/extracted/0a2a29f254dc05d0b2ecfa01c1f626eb57557516077f3fc1f1d5ca49b3999f35/en-US~ATM_LIMIT/602b9bb2bb1e6d0fbce91f6c.wav',
  'array': array([-0.00024414,  0.00024414,  0.        , ...,  0.        ,
         -0.00024414,  0.00024414], dtype=float32),
  'sampling_rate': 8000},
 'intent_class': 3}

In [43]:
# audio -> 1-D array of the speech signal (the column must be accessed to load and resample the audio file)
# intent_class -> integer class id of the speaker's intent
# These mappings let the model convert between label names and label ids

labels = minds["train"].features["intent_class"].names
label2id, id2label = dict(), dict()
for i, label in enumerate(labels):
    label2id[label] = str(i)
    id2label[str(i)] = label

In [44]:
id2label["2"]

'app_error'
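
The same two mappings can also be written as dictionary comprehensions. The sketch below is a minimal, self-contained version that uses a short stand-in list instead of `minds["train"].features["intent_class"].names`. Only `app_error` at index 2 is taken from the output above; the other names are placeholders. Ids are kept as strings, as in the loop above.

```python
# Stand-in for minds["train"].features["intent_class"].names.
# Only "app_error" at index 2 comes from the notebook output; the rest are placeholders.
labels = ["placeholder_0", "placeholder_1", "app_error", "placeholder_3"]

# Name -> id and id -> name, with ids stored as strings
label2id = {label: str(i) for i, label in enumerate(labels)}
id2label = {str(i): label for i, label in enumerate(labels)}

# Round trip: name -> id -> name
round_trip = id2label[label2id["app_error"]]
print(round_trip)
```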

## Preprocess

In [45]:
from transformers import AutoFeatureExtractor

In [46]:
# Load Wav2Vec2 feature extractor to process the audio signal

feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")

loading configuration file preprocessor_config.json from cache at /home/gahye/.cache/huggingface/hub/models--facebook--wav2vec2-base/snapshots/0b5b8e868dd84f03fd87d01f9c4ff0f080fecfe8/preprocessor_config.json
loading configuration file config.json from cache at /home/gahye/.cache/huggingface/hub/models--facebook--wav2vec2-base/snapshots/0b5b8e868dd84f03fd87d01f9c4ff0f080fecfe8/config.json
Model config Wav2Vec2Config {
  "_name_or_path": "facebook/wav2vec2-base",
  "activation_dropout": 0.0,
  "adapter_kernel_size": 3,
  "adapter_stride": 2,
  "add_adapter": false,
  "apply_spec_augment": true,
  "architectures": [
    "Wav2Vec2ForPreTraining"
  ],
  "attention_dropout": 0.1,
  "bos_token_id": 1,
  "classifier_proj_size": 256,
  "codevector_dim": 256,
  "contrastive_logits_temperature": 0.1,
  "conv_bias": false,
  "conv_dim": [
    512,
    512,
    512,
    512,
    512,
    512,
    512
  ],
  "conv_kernel": [
    10,
    3,
    3,
    3,
    3,
    2,
    2
  ],
  "conv_stride": [
 

In [47]:
# Resample the audio from 8 kHz to 16 kHz, the rate the pretrained Wav2Vec2 model expects

minds = minds.cast_column("audio", Audio(sampling_rate=16_000))
minds["train"][0]

{'audio': {'path': '/home/gahye/.cache/huggingface/datasets/downloads/extracted/0a2a29f254dc05d0b2ecfa01c1f626eb57557516077f3fc1f1d5ca49b3999f35/en-US~ATM_LIMIT/602b9bb2bb1e6d0fbce91f6c.wav',
  'array': array([-2.1219255e-04,  4.3378686e-06,  2.1558166e-04, ...,
         -1.3332080e-05,  1.8235171e-04,  2.0268247e-04], dtype=float32),
  'sampling_rate': 16000},
 'intent_class': 3}
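
As a rough illustration of what changes when the sampling rate goes from 8 kHz to 16 kHz: the clip keeps its duration but is represented by twice as many samples. The sketch below uses a naive linear-interpolation resampler, `resample_linear`, which is a hypothetical helper written for this example. It is only a stand-in and is not the resampling method that `datasets` applies internally.

```python
import numpy as np

def resample_linear(signal, orig_sr, target_sr):
    """Naive linear-interpolation resampler, for illustration only."""
    duration = len(signal) / orig_sr                 # seconds
    n_target = int(round(duration * target_sr))      # samples after resampling
    orig_t = np.arange(len(signal)) / orig_sr        # timestamps of original samples
    target_t = np.arange(n_target) / target_sr       # timestamps of new samples
    return np.interp(target_t, orig_t, signal).astype(np.float32)

# One second of a 440 Hz tone sampled at 8 kHz
one_second_8k = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000).astype(np.float32)
one_second_16k = resample_linear(one_second_8k, orig_sr=8000, target_sr=16000)

print(len(one_second_8k), len(one_second_16k))
```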

The preprocessing function needs to:

1. Access the `audio` column to load and, if necessary, resample the audio file.
2. Check that the **sampling rate of the audio file** matches the **sampling rate of the audio data the model was pretrained with**. You can find this information on the [Wav2Vec2 model card](https://huggingface.co/facebook/wav2vec2-base).
3. Set a **maximum input length** so that longer inputs are truncated to it (`truncation=True`), which keeps every batch to a bounded size.
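
With `truncation=True`, as in the preprocessing function in this section, clips longer than `max_length` are cut down to that many samples, while shorter clips are left as they are. The sketch below mimics only that length rule with plain NumPy slicing. `truncate` is a hypothetical helper and does not reproduce anything else the feature extractor does (such as normalization).

```python
import numpy as np

SAMPLING_RATE = 16_000  # Hz, matching the resampled dataset
MAX_LENGTH = 16_000     # samples, i.e. one second of audio at 16 kHz

def truncate(arrays, max_length):
    """Cut each 1-D array to at most max_length samples."""
    return [a[:max_length] for a in arrays]

# A half-second clip and a 2.5-second clip (silence, for illustration)
clips = [np.zeros(8_000, dtype=np.float32), np.zeros(40_000, dtype=np.float32)]

truncated = truncate(clips, MAX_LENGTH)
lengths = [len(a) for a in truncated]
seconds = [n / SAMPLING_RATE for n in lengths]
print(lengths)
print(seconds)
```

Under these settings, the model never sees more than the first second of any recording.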

In [48]:
def preprocess_function(examples):
    audio_arrays = [x["array"] for x in examples["audio"]]
    inputs = feature_extractor(audio_arrays,
                               sampling_rate=feature_extractor.sampling_rate,
                               max_length=16000,
                               truncation=True)
    return inputs

Use the Datasets [map](https://huggingface.co/docs/datasets/v2.6.1/en/package_reference/main_classes#datasets.Dataset.map) function to apply `preprocess_function` over the entire dataset.
Setting `batched=True` speeds this up by processing multiple elements of the dataset at once.
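
In batched mode, the function receives a dictionary that maps each column name to a list of values, which is why `preprocess_function` iterates over `examples["audio"]`. The pure-Python sketch below imitates that calling convention. `map_batched` and `preprocess_batch` are hypothetical stand-ins written for this example, not part of the `datasets` API.

```python
def preprocess_batch(examples):
    # `examples` maps each column name to a list of values, one entry per row.
    return {"num_samples": [len(a) for a in examples["audio"]]}

def map_batched(rows, function, batch_size=2):
    """Hypothetical stand-in for the batched calling convention of Dataset.map."""
    out = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Turn a list of row dicts into a dict of column lists
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        result = function(batch)
        # Merge the new columns back into each row
        for i, row in enumerate(chunk):
            new_row = dict(row)
            new_row.update({k: v[i] for k, v in result.items()})
            out.append(new_row)
    return out

rows = [{"audio": [0.0] * n, "intent_class": c} for n, c in [(3, 0), (5, 1), (2, 3)]]
mapped = map_batched(rows, preprocess_batch)
print([r["num_samples"] for r in mapped])
```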


In [49]:
# Drop the raw "audio" column once the features have been extracted
# Rename intent_class to label, the column name the model expects

encoded_minds = minds.map(preprocess_function, remove_columns="audio", batched=True)
encoded_minds = encoded_minds.rename_column("intent_class", "label")

  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

## Train

In [50]:
from transformers import AutoModelForAudioClassification, TrainingArguments, Trainer

In [51]:
num_labels = len(id2label)
model = AutoModelForAudioClassification.from_pretrained(
    "facebook/wav2vec2-base", num_labels=num_labels, label2id=label2id, id2label=id2label
).to(device)

loading configuration file config.json from cache at /home/gahye/.cache/huggingface/hub/models--facebook--wav2vec2-base/snapshots/0b5b8e868dd84f03fd87d01f9c4ff0f080fecfe8/config.json
Model config Wav2Vec2Config {
  "_name_or_path": "facebook/wav2vec2-base",
  "activation_dropout": 0.0,
  "adapter_kernel_size": 3,
  "adapter_stride": 2,
  "add_adapter": false,
  "apply_spec_augment": true,
  "architectures": [
    "Wav2Vec2ForPreTraining"
  ],
  "attention_dropout": 0.1,
  "bos_token_id": 1,
  "classifier_proj_size": 256,
  "codevector_dim": 256,
  "contrastive_logits_temperature": 0.1,
  "conv_bias": false,
  "conv_dim": [
    512,
    512,
    512,
    512,
    512,
    512,
    512
  ],
  "conv_kernel": [
    10,
    3,
    3,
    3,
    3,
    2,
    2
  ],
  "conv_stride": [
    5,
    2,
    2,
    2,
    2,
    2,
    2
  ],
  "ctc_loss_reduction": "sum",
  "ctc_zero_infinity": false,
  "diversity_loss_weight": 0.1,
  "do_stable_layer_norm": false,
  "eos_token_id": 2,
  "feat_ex

1. Define your training hyperparameters in [TrainingArguments](https://huggingface.co/docs/transformers/v4.24.0/en/main_classes/trainer#transformers.TrainingArguments).
2. Pass the training arguments to [Trainer](https://huggingface.co/docs/transformers/v4.24.0/en/main_classes/trainer#transformers.Trainer) along with the model, datasets, and feature extractor.
3. Call [train()](https://huggingface.co/docs/transformers/v4.24.0/en/main_classes/trainer#transformers.Trainer.train) to fine-tune your model.
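
The cells in this notebook report only the loss. One common addition, which is not used here, is a metric function passed to `Trainer` through its `compute_metrics` argument. The sketch below is a minimal accuracy function, assuming the evaluation object exposes `predictions` (logits) and `label_ids`. The `fake_eval` object is a made-up stand-in so that the example runs without a trained model.

```python
import numpy as np
from types import SimpleNamespace

def compute_metrics(eval_pred):
    """Accuracy from logits and integer labels."""
    predictions = np.argmax(eval_pred.predictions, axis=1)
    accuracy = float(np.mean(predictions == eval_pred.label_ids))
    return {"accuracy": accuracy}

# Made-up logits for three examples and two classes, for illustration only
fake_eval = SimpleNamespace(
    predictions=np.array([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]]),
    label_ids=np.array([1, 0, 0]),
)
metrics = compute_metrics(fake_eval)
print(metrics)
```

With a function like this, the per-epoch evaluation would report accuracy alongside the validation loss.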

In [52]:
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=3e-5,
    num_train_epochs=5
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [53]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded_minds["train"],
    eval_dataset=encoded_minds["test"],
    tokenizer=feature_extractor
)

In [54]:
trainer.train()

***** Running training *****
  Num examples = 450
  Num Epochs = 5
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 145
  Number of trainable parameters = 94572174


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1 | No log | 2.653276 |
| 2 | No log | 2.666083 |
| 3 | No log | 2.66234 |
| 4 | No log | 2.678468 |
| 5 | No log | 2.681089 |


***** Running Evaluation *****
  Num examples = 113
  Batch size = 16
Saving model checkpoint to ./results/checkpoint-29
Configuration saved in ./results/checkpoint-29/config.json
Model weights saved in ./results/checkpoint-29/pytorch_model.bin
Feature extractor saved in ./results/checkpoint-29/preprocessor_config.json
***** Running Evaluation *****
  Num examples = 113
  Batch size = 16
Saving model checkpoint to ./results/checkpoint-58
Configuration saved in ./results/checkpoint-58/config.json
Model weights saved in ./results/checkpoint-58/pytorch_model.bin
Feature extractor saved in ./results/checkpoint-58/preprocessor_config.json
***** Running Evaluation *****
  Num examples = 113
  Batch size = 16
Saving model checkpoint to ./results/checkpoint-87
Configuration saved in ./results/checkpoint-87/config.json
Model weights saved in ./results/checkpoint-87/pytorch_model.bin
Feature extractor saved in ./results/checkpoint-87/preprocessor_config.json
***** Running Evaluation *****
  Num 

TrainOutput(global_step=145, training_loss=2.6369523673221984, metrics={'train_runtime': 224.2957, 'train_samples_per_second': 10.031, 'train_steps_per_second': 0.646, 'total_flos': 2.0427589584e+16, 'train_loss': 2.6369523673221984, 'epoch': 5.0})
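
The numbers in the training log can be checked with a little arithmetic. The sketch below reproduces the step count from the logged example count, total batch size, and epochs. It also computes the cross-entropy loss of a uniform guess, assuming 14 intent classes as the dataset name MInDS-14 suggests (the notebook does not print `num_labels`). A loss close to that value would indicate the model is predicting at roughly chance level.

```python
import math

num_train_examples = 450   # "Num examples" in the training log
total_batch_size = 16      # "Total train batch size" in the training log
num_epochs = 5

# The last, partial batch of each epoch still counts as a step
steps_per_epoch = math.ceil(num_train_examples / total_batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)

# Assumption: 14 classes, as the dataset name suggests
num_classes = 14
chance_loss = math.log(num_classes)  # cross-entropy of a uniform prediction
print(round(chance_loss, 4))
```

The 29 steps per epoch match the checkpoint names (`checkpoint-29`, `checkpoint-58`, `checkpoint-87`). The "No log" entries in the Training Loss column are likely because the run finishes in 145 steps, fewer than the default logging interval of `TrainingArguments`. Setting a smaller `logging_steps` should make the training loss appear.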