<a href="https://colab.research.google.com/github/MarcAtanante/ai-for-fun/blob/main/02b%20-%20Audio_Classification_on_Speaker_Intent.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Fine-tuning for Audio Classification with 🤗 Transformers**
### [Credits to Hugging Face for the Documentation](https://huggingface.co/docs/transformers/v4.27.2/tasks/audio_classification)

This notebook shows how to fine-tune [Wav2Vec2](https://huggingface.co/facebook/wav2vec2-base) on the [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) dataset to classify speaker intent.

MInDS-14 is a training and evaluation resource for the intent detection task with spoken data. It covers 14 intents extracted from a commercial system in the e-banking domain, associated with spoken examples in 14 diverse language varieties.

The task illustrated is supported by the following model architectures:
Audio Spectrogram Transformer, Data2VecAudio, Hubert, SEW, SEW-D, UniSpeech, UniSpeechSat, Wav2Vec2, Wav2Vec2-Conformer, WavLM, Whisper

In [1]:
%%capture
!pip install transformers datasets evaluate
!pip install huggingface_hub==0.11

In [2]:
from huggingface_hub import notebook_login

notebook_login()

Token is valid.
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /root/.huggingface/token
Login successful


## Loading the MInDS-14 dataset

In [3]:
from datasets import load_dataset, Audio

minds = load_dataset("PolyAI/minds14", name="en-US", split="train")

Downloading and preparing dataset minds14/en-US to /root/.cache/huggingface/datasets/PolyAI___minds14/en-US/1.0.0/aa40414f15e0f919231d617440192034af844835dc1e6a697f4b552e0551fd26...


Dataset minds14 downloaded and prepared to /root/.cache/huggingface/datasets/PolyAI___minds14/en-US/1.0.0/aa40414f15e0f919231d617440192034af844835dc1e6a697f4b552e0551fd26. Subsequent calls will reuse this data.


Split the dataset’s train split into a smaller train and test set with the train_test_split method. This’ll give you a chance to experiment and make sure everything works before spending more time on the full dataset.

In [4]:
minds = minds.train_test_split(test_size=0.4)

Then take a look at the dataset:

In [5]:
minds

DatasetDict({
    train: Dataset({
        features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'],
        num_rows: 337
    })
    test: Dataset({
        features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'],
        num_rows: 226
    })
})
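
The row counts above follow from test_size=0.4 applied to the 563 examples in the en-US train split (337 + 226). As a rough sketch of the arithmetic, and not the library's actual implementation, you can picture the split as shuffling the row indices and cutting them once. The helper names `split_sizes` and `shuffled_split`, the seed value, and the round-up rule for the test set are assumptions chosen to reproduce the counts shown:

```python
import math
import random


def split_sizes(n_rows, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test set is rounded up and the train set gets the rest.
    """
    n_test = math.ceil(n_rows * test_size)
    return n_rows - n_test, n_test


def shuffled_split(n_rows, test_size, seed=0):
    """Shuffle row indices once, then cut them into train and test parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_train, _ = split_sizes(n_rows, test_size)
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = shuffled_split(563, 0.4)
print(len(train_idx), len(test_idx))  # 337 226
```

In the real train_test_split method, passing a seed argument makes the split reproducible across runs.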

While the dataset contains a lot of useful information, like lang_id and english_transcription, you’ll focus on the audio and intent_class in this guide. Remove the other columns with the remove_columns method:

In [6]:
minds = minds.remove_columns(["path", "transcription", "english_transcription", "lang_id"])

Take a look at an example now:

In [7]:
minds["train"][0]

{'audio': {'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~BALANCE/602b9be4bb1e6d0fbce91f70.wav',
  'array': array([ 0.        ,  0.        ,  0.        , ...,  0.        ,
         -0.00024414, -0.00024414], dtype=float32),
  'sampling_rate': 8000},
 'intent_class': 4}

There are two fields:

- audio: a dictionary holding the file path, the decoded 1-dimensional array of the speech signal, and the sampling_rate. Accessing this column loads and, if necessary, resamples the audio file.
- intent_class: represents the class id of the speaker’s intent.

To make it easier for the model to get the label name from the label id, create a dictionary that maps the label name to an integer and vice versa:

In [9]:
labels = minds["train"].features["intent_class"].names
label2id, id2label = dict(), dict()
for i, label in enumerate(labels):
    label2id[label] = str(i)
    id2label[str(i)] = label

Now you can convert the label id to a label name:

In [10]:
id2label[str(3)]

'atm_limit'
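
As a self-contained illustration of the two mappings, the sketch below rebuilds them from the 14 intent names that this notebook prints later from the dataset features, so it runs without downloading anything. The dictionary-comprehension form is just an equivalent way to write the loop above; note that the ids are stored as strings, so an integer key such as `id2label[3]` would not be found.

```python
# Intent names copied from the dataset's intent_class feature (printed later in this notebook).
labels = [
    "abroad", "address", "app_error", "atm_limit", "balance",
    "business_loan", "card_issues", "cash_deposit", "direct_debit",
    "freeze", "high_value_payment", "joint_account",
    "latest_transactions", "pay_bill",
]

# Same mappings as the loop above, with string ids.
label2id = {label: str(i) for i, label in enumerate(labels)}
id2label = {str(i): label for i, label in enumerate(labels)}

# Every name survives a round trip through both dictionaries.
round_trip_ok = all(id2label[label2id[name]] == name for name in labels)

print(id2label["3"])        # atm_limit
print(label2id["balance"])  # 4
print(round_trip_ok)        # True
```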

## Preprocessing
The next step is to load a Wav2Vec2 feature extractor to process the audio signal:

In [11]:
from transformers import AutoFeatureExtractor

feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")



The MInDS-14 dataset has a sampling rate of 8 kHz (8,000 Hz; you can find this information in its dataset card), which means you’ll need to resample the dataset to 16 kHz (16,000 Hz) to use the pretrained Wav2Vec2 model:

In [12]:
minds = minds.cast_column("audio", Audio(sampling_rate=16_000))
minds["train"][0]

{'audio': {'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~BALANCE/602b9be4bb1e6d0fbce91f70.wav',
  'array': array([-1.8589206e-05, -2.0205407e-05,  1.9787047e-05, ...,
         -3.2973092e-04, -2.6013408e-04, -9.2785835e-05], dtype=float32),
  'sampling_rate': 16000},
 'intent_class': 4}
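
To build some intuition for what resampling from 8,000 Hz to 16,000 Hz does, here is a deliberately simple sketch that doubles the number of samples with linear interpolation. It is only a stand-in: the Audio feature relies on a higher-quality resampler, so real values will differ from what this produces. The function name `resample_linear` is hypothetical.

```python
def resample_linear(samples, orig_sr, target_sr):
    """Resample a list of floats by linear interpolation (illustration only)."""
    if not samples:
        return []
    n_out = int(round(len(samples) * target_sr / orig_sr))
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = i * orig_sr / target_sr      # position in the original signal
        left = min(int(pos), last)         # neighbouring sample on the left
        right = min(left + 1, last)        # neighbouring sample on the right
        frac = pos - left                  # distance from the left neighbour
        out.append(samples[left] * (1 - frac) + samples[right] * frac)
    return out


one_second_8k = [0.0] * 8_000
one_second_16k = resample_linear(one_second_8k, 8_000, 16_000)
print(len(one_second_8k), len(one_second_16k))  # 8000 16000

print(resample_linear([0.0, 1.0, 0.0, -1.0], 8_000, 16_000))
# [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -1.0]
```

The clip still lasts one second. Only the number of samples that describe it changes, which is what the pretrained model expects.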

Now create a preprocessing function that:

- Calls the audio column to load and, if necessary, resample the audio file.
- Checks that the sampling rate of the audio file matches the sampling rate of the audio data the model was pretrained with. You can find this information in the Wav2Vec2 model card.
- Sets a maximum input length of 16,000 samples (one second at 16 kHz) and truncates longer inputs so that examples can be batched.

In [13]:
def preprocess_function(examples):
    audio_arrays = [x["array"] for x in examples["audio"]]
    inputs = feature_extractor(
        audio_arrays, sampling_rate=feature_extractor.sampling_rate, max_length=16000, truncation=True
    )
    return inputs
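
The effect of max_length=16000 together with truncation=True can be pictured with plain lists. This is a minimal sketch of the truncation step only, not of everything the feature extractor does; `truncate_batch` and the two toy clips are made up for illustration.

```python
SAMPLING_RATE = 16_000  # samples per second after resampling
MAX_LENGTH = 16_000     # value passed as max_length above


def truncate_batch(audio_arrays, max_length=MAX_LENGTH):
    """Keep at most max_length samples from the start of each clip."""
    return [list(arr[:max_length]) for arr in audio_arrays]


# Two toy clips: half a second and two and a half seconds of silence.
clips = [[0.0] * 8_000, [0.0] * 40_000]
truncated = truncate_batch(clips)

lengths = [len(clip) for clip in truncated]
max_seconds = MAX_LENGTH / SAMPLING_RATE

print(lengths)      # [8000, 16000]
print(max_seconds)  # 1.0
```

With these settings the model sees at most the first second of each recording, so raising max_length is one knob to try if the intent is usually spoken later in the clip.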

To apply the preprocessing function over the entire dataset, use the 🤗 Datasets map function. You can speed up map by setting batched=True to process multiple elements of the dataset at once. Remove the columns you don’t need, and rename intent_class to label because that’s the name the model expects:

In [14]:
encoded_minds = minds.map(preprocess_function, remove_columns="audio", batched=True)
encoded_minds = encoded_minds.rename_column("intent_class", "label")

## Evaluate
Including a metric during training is often helpful for evaluating your model’s performance. You can quickly load an evaluation method with the 🤗 Evaluate library. For this task, load the accuracy metric (see the 🤗 Evaluate quick tour to learn more about how to load and compute a metric):

In [15]:
import evaluate

accuracy = evaluate.load("accuracy")

Then create a function that passes your predictions and labels to compute to calculate the accuracy:

In [16]:
import numpy as np


def compute_metrics(eval_pred):
    predictions = np.argmax(eval_pred.predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=eval_pred.label_ids)

Your compute_metrics function is ready to go now, and you’ll return to it when you set up your training.
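
If you want to see what the metric boils down to, the following dependency-free sketch mirrors the same two steps: take the highest-scoring class per example, then count how often it equals the reference label. The names `argmax` and `accuracy_from_logits` and the toy numbers are illustrative assumptions, not part of the 🤗 Evaluate API.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy_from_logits(logits, label_ids):
    """Fraction of examples whose top-scoring class matches the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, label_ids))
    return {"accuracy": correct / len(label_ids)}


# Four toy examples with three classes each.
toy_logits = [
    [0.1, 2.0, -1.0],  # predicts class 1
    [1.5, 0.2, 0.3],   # predicts class 0
    [0.0, 0.1, 0.9],   # predicts class 2
    [0.4, 0.3, 0.2],   # predicts class 0
]
toy_labels = [1, 0, 1, 0]

result = accuracy_from_logits(toy_logits, toy_labels)
print(result)  # {'accuracy': 0.75}
```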

## Training
You’re ready to start training your model now! Load Wav2Vec2 with AutoModelForAudioClassification along with the number of expected labels, and the label mappings:

In [17]:
from transformers import AutoModelForAudioClassification, TrainingArguments, Trainer

num_labels = len(id2label)
model = AutoModelForAudioClassification.from_pretrained(
    "facebook/wav2vec2-base", num_labels=num_labels, label2id=label2id, id2label=id2label
)



Some weights of the model checkpoint at facebook/wav2vec2-base were not used when initializing Wav2Vec2ForSequenceClassification: ['project_q.weight', 'quantizer.weight_proj.weight', 'quantizer.weight_proj.bias', 'quantizer.codevectors', 'project_hid.weight', 'project_q.bias', 'project_hid.bias']
- This IS expected if you are initializing Wav2Vec2ForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForSequenceClassification were not initialized from the model checkpoint at facebook/wav2vec2-base and are newly initialized: ['projector.weight', 'classifier.bias', 'projector.

### At this point, only three steps remain:

- Define your training hyperparameters in TrainingArguments. The only required parameter is output_dir which specifies where to save your model. You’ll push this model to the Hub by setting push_to_hub=True (you need to be signed in to Hugging Face to upload your model). At the end of each epoch, the Trainer will evaluate the accuracy and save the training checkpoint.
- Pass the training arguments to Trainer along with the model, dataset, feature extractor (passed through the tokenizer argument), and compute_metrics function.
- Call train() to finetune your model.

In [18]:
training_args = TrainingArguments(
    output_dir="my_mind_model",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=3e-5,
    per_device_train_batch_size=32,
    gradient_accumulation_steps=4,
    per_device_eval_batch_size=32,
    num_train_epochs=10,
    warmup_ratio=0.1,
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    push_to_hub=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded_minds["train"],
    eval_dataset=encoded_minds["test"],
    tokenizer=feature_extractor,
    compute_metrics=compute_metrics,
)

trainer.train()

Cloning https://huggingface.co/marcatanante1/my_mind_model into local empty directory.





| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 0 | No log | 2.64076 | 0.075221 |
| 1 | No log | 2.644122 | 0.061947 |
| 2 | No log | 2.643489 | 0.097345 |
| 4 | 2.629300 | 2.64465 | 0.088496 |
| 4 | 2.629300 | 2.64489 | 0.084071 |
| 5 | 2.629300 | 2.646341 | 0.084071 |
| 6 | 2.629300 | 2.645164 | 0.088496 |
| 7 | 2.617700 | 2.645027 | 0.09292 |


TrainOutput(global_step=20, training_loss=2.6235186576843263, metrics={'train_runtime': 206.0888, 'train_samples_per_second': 16.352, 'train_steps_per_second': 0.097, 'total_flos': 2.228876996832e+16, 'train_loss': 2.6235186576843263, 'epoch': 7.27})
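
The logged global_step=20 and epoch 7.27 can look odd next to num_train_epochs=10. The back-of-the-envelope sketch below reproduces both numbers from the settings above, under two assumptions: training runs on a single device, and the Trainer counts optimizer updates per epoch as the number of batches floor-divided by gradient_accumulation_steps. That rule matches this run's log but is not guaranteed for every Trainer version, and `training_schedule` is a hypothetical helper.

```python
import math


def training_schedule(n_train, per_device_batch, grad_accum, epochs):
    """Estimate optimizer steps and the epoch count reported at the end."""
    batches_per_epoch = math.ceil(n_train / per_device_batch)
    # Assumption: leftover batches that do not fill an accumulation cycle are not counted.
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    total_updates = updates_per_epoch * epochs
    return {
        "effective_batch_size": per_device_batch * grad_accum,
        "batches_per_epoch": batches_per_epoch,
        "total_updates": total_updates,
        "epochs_seen": round(total_updates * grad_accum / batches_per_epoch, 2),
    }


schedule = training_schedule(n_train=337, per_device_batch=32, grad_accum=4, epochs=10)
print(schedule)
# {'effective_batch_size': 128, 'batches_per_epoch': 11, 'total_updates': 20, 'epochs_seen': 7.27}
```

With only 20 optimizer updates in total and logging_steps=10, the training loss is logged just twice, which is why the first rows of the log show "No log".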

Once training is completed, share your model to the Hub with the push_to_hub() method so everyone can use your model:

In [19]:
trainer.push_to_hub()

To https://huggingface.co/marcatanante1/my_mind_model
   3621dbd..a6cdb24  main -> main

To https://huggingface.co/marcatanante1/my_mind_model
   a6cdb24..95e7a6a  main -> main



'https://huggingface.co/marcatanante1/my_mind_model/commit/a6cdb24dafda8d16b06a702cc0dc013af7f0616e'

## Inference
Load an audio file you’d like to run inference on. Remember to resample the sampling rate of the audio file to match the sampling rate of the model if you need to!

In [20]:
from datasets import load_dataset, Audio

dataset = load_dataset("PolyAI/minds14", name="en-US", split="train")
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
sampling_rate = dataset.features["audio"].sampling_rate
audio_file = dataset[0]["audio"]["path"]



The simplest way to try out your finetuned model for inference is to use it in a pipeline(). Instantiate a pipeline for audio classification with your model, and pass your audio file to it:

In [21]:
from transformers import pipeline

classifier = pipeline("audio-classification", model="marcatanante1/my_mind_model")
classifier(audio_file)

[{'score': 0.07904768735170364, 'label': 'cash_deposit'},
 {'score': 0.07740778475999832, 'label': 'joint_account'},
 {'score': 0.07583969086408615, 'label': 'card_issues'},
 {'score': 0.07381478697061539, 'label': 'balance'},
 {'score': 0.07268241047859192, 'label': 'app_error'}]
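
The scores above are all very close to each other. As a reference point, the sketch below computes what a classifier with no preference among the 14 intents would produce: a probability of 1/14 per class and a cross-entropy loss of ln(14). The variable names are illustrative.

```python
import math

NUM_CLASSES = 14  # number of intents in MInDS-14

# A model that spreads probability evenly assigns 1/14 to every class.
chance_probability = 1 / NUM_CLASSES
# Cross-entropy of that uniform prediction, which is the loss of pure guessing.
chance_loss = math.log(NUM_CLASSES)

print(round(chance_probability, 4))  # 0.0714
print(round(chance_loss, 3))         # 2.639
```

Both the validation loss logged during training (about 2.64) and the accuracy (roughly 0.06 to 0.10) sit near these chance-level values, which suggests that this short run has not yet learned to separate the intents. Training for more optimizer steps would be a natural next thing to try.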

You can also manually replicate the results of the pipeline if you’d like:

In [22]:
from transformers import AutoFeatureExtractor

feature_extractor = AutoFeatureExtractor.from_pretrained("marcatanante1/my_mind_model")
inputs = feature_extractor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")

In [23]:
from transformers import AutoModelForAudioClassification
import torch

model = AutoModelForAudioClassification.from_pretrained("marcatanante1/my_mind_model")
with torch.no_grad():
    logits = model(**inputs).logits

In [24]:
logits

tensor([[-0.0065,  0.0068,  0.0228, -0.0628,  0.0389, -0.0454,  0.0652,  0.1117,
         -0.0500, -0.0107, -0.0237,  0.0811, -0.0937,  0.0051]])

In [25]:
minds["train"].features["intent_class"].names

['abroad',
 'address',
 'app_error',
 'atm_limit',
 'balance',
 'business_loan',
 'card_issues',
 'cash_deposit',
 'direct_debit',
 'freeze',
 'high_value_payment',
 'joint_account',
 'latest_transactions',
 'pay_bill']

In [26]:
predicted_class_ids = torch.argmax(logits).item()
predicted_label = model.config.id2label[predicted_class_ids]
predicted_label

'cash_deposit'
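
To connect the raw logits to the ranked list returned by the pipeline, here is a small sketch that applies a softmax to the logits printed above and keeps the five highest probabilities. It assumes the pipeline turns logits into scores with a softmax and returns the top five. Because the logits here are rounded to four decimals, the probabilities are not expected to match the pipeline scores digit for digit, but the ordering of the labels can be compared.

```python
import math

labels = [
    "abroad", "address", "app_error", "atm_limit", "balance",
    "business_loan", "card_issues", "cash_deposit", "direct_debit",
    "freeze", "high_value_payment", "joint_account",
    "latest_transactions", "pay_bill",
]

# Logits copied from the output above (rounded to four decimals).
logits = [
    -0.0065, 0.0068, 0.0228, -0.0628, 0.0389, -0.0454, 0.0652, 0.1117,
    -0.0500, -0.0107, -0.0237, 0.0811, -0.0937, 0.0051,
]


def softmax(values):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax(logits)
top5 = sorted(zip(probs, labels), reverse=True)[:5]
top5_labels = [label for _, label in top5]

print(top5_labels)
# ['cash_deposit', 'joint_account', 'card_issues', 'balance', 'app_error']
```

The top label agrees with the argmax result, and the five labels come out in the same order as in the pipeline output.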