<a href="https://www.kaggle.com/code/aisuko/audio-classification?scriptVersionId=163069290" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Overview

Audio classification works much like text classification: the model assigns a class label to the input data. The only difference is that the inputs are raw audio waveforms instead of text. Practical applications include identifying speaker intent, classifying the spoken language, and even recognizing animal species by their sounds.


Let's fine-tune `Wav2Vec2` on MInDS-14, a dataset of spoken utterances labeled with intents, to classify speaker intent.

In [1]:
%%capture
!pip install transformers==4.35.2
!pip install datasets==2.15.0
!pip install evaluate==0.4.1

In [2]:
import os
from huggingface_hub import login
from kaggle_secrets import UserSecretsClient

user_secrets = UserSecretsClient()

login(token=user_secrets.get_secret("HUGGINGFACE_TOKEN"))

os.environ["WANDB_API_KEY"]=user_secrets.get_secret("WANDB_API_KEY")
os.environ["WANDB_PROJECT"] = "Fine-tune-models"
os.environ["WANDB_NOTES"] = "Fine-tune wav2vec2-base on MInDS-14 for intent classification"
os.environ["WANDB_NAME"] = "ft-wav2vec2-with-minds"

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


# Load the dataset

In [3]:
from datasets import load_dataset, Audio

minds=load_dataset("PolyAI/minds14", name="en-US", split="train")

Split the dataset's `train` split into smaller train and test sets with the `train_test_split` method. This gives us a chance to experiment and make sure the preprocessing works before applying it to the entire dataset.

In [4]:
minds=minds.train_test_split(test_size=0.2)
minds

DatasetDict({
    train: Dataset({
        features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'],
        num_rows: 450
    })
    test: Dataset({
        features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'],
        num_rows: 113
    })
})
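
The 450/113 row counts follow from applying `test_size=0.2` to the 563 examples in this split. The sketch below reproduces that arithmetic with a plain-Python shuffled split. It is an illustration, not the `datasets` implementation: `split_indices` is a hypothetical helper, and the rounding rule (ceiling for the test set) is an assumption chosen because it matches the counts shown above.

```python
import math
import random

def split_indices(n_rows, test_size, seed=0):
    """Shuffle row indices and split them into (train, test) index lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)     # seeded, so the split is repeatable
    n_test = math.ceil(n_rows * test_size)   # assumed rounding rule
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(563, test_size=0.2, seed=42)
print(len(train_idx), len(test_idx))
```

`train_test_split` also accepts a `seed` argument, which is worth setting if you want the same split on every run.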

Let's focus on the `audio` and `intent_class` columns here and remove the others with the `remove_columns` method:

In [5]:
minds=minds.remove_columns(["path", "transcription", "english_transcription", "lang_id"])
minds["train"][0]

{'audio': {'path': '/root/.cache/huggingface/datasets/downloads/extracted/28aa727f91fee90575c34956bab09d1716cfaf460c6afcba86a10f04a7d58b83/en-US~ABROAD/602ba898bb1e6d0fbce92101.wav',
  'array': array([ 0.        , -0.00024414, -0.00024414, ...,  0.        ,
          0.00024414,  0.00024414]),
  'sampling_rate': 8000},
 'intent_class': 0}

There are two fields:

* `audio`: a dictionary holding the file `path`, the `sampling_rate`, and a 1-dimensional `array` of the speech signal. Accessing this column loads the audio file and, if necessary, resamples it.
* `intent_class`: represents the class id of the speaker's intent.


To make it easier for the model to get the label name from the label id, create a dictionary that maps the label name to an integer and vice versa:

In [6]:
labels=minds["train"].features["intent_class"].names
label2id, id2label=dict(), dict()

for i, label in enumerate(labels):
    label2id[label]=str(i)
    id2label[str(i)]=label

In [7]:
id2label[str(2)]

'app_error'
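
A quick way to check the two dictionaries is to confirm that they are inverses of each other. The sketch below uses a short, illustrative label list (not the full MInDS-14 label set) and follows the notebook's convention of storing ids as strings.

```python
# Illustrative label list; the real names come from the dataset's features.
labels = ["abroad", "address", "app_error"]

label2id = {label: str(i) for i, label in enumerate(labels)}
id2label = {str(i): label for i, label in enumerate(labels)}

# Every label should survive a round trip through both mappings.
for label in labels:
    assert id2label[label2id[label]] == label

print(id2label["2"])
```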

# Preprocess

Load a Wav2Vec2 feature extractor to process the audio signal:

In [8]:
from transformers import AutoFeatureExtractor

feature_extractor=AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")

The MInDS-14 dataset has a sampling rate of 8,000 Hz (8 kHz), so we need to resample it to 16,000 Hz (16 kHz), the rate the pretrained Wav2Vec2 model expects:

In [9]:
minds=minds.cast_column("audio", Audio(sampling_rate=16_000))
minds["train"][0]

{'audio': {'path': '/root/.cache/huggingface/datasets/downloads/extracted/28aa727f91fee90575c34956bab09d1716cfaf460c6afcba86a10f04a7d58b83/en-US~ABROAD/602ba898bb1e6d0fbce92101.wav',
  'array': array([-2.46873242e-07, -1.05235536e-04, -2.44102412e-04, ...,
          2.98825675e-04,  2.37689281e-04,  1.11417277e-04]),
  'sampling_rate': 16000},
 'intent_class': 0}
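
Resampling changes the number of samples in a clip but not its duration. The arithmetic below shows the relationship for the two rates used here. `resampled_length` is a hypothetical helper and the one-second clip is made up for illustration; the actual resampling is handled by `datasets` when the column is decoded, which this sketch does not attempt to reproduce.

```python
def resampled_length(n_samples, orig_rate, target_rate):
    """Approximate number of samples after resampling a clip."""
    return round(n_samples * target_rate / orig_rate)

orig_rate, target_rate = 8_000, 16_000
n_samples = 8_000                      # a made-up one-second clip at 8 kHz

new_length = resampled_length(n_samples, orig_rate, target_rate)
duration_before = n_samples / orig_rate     # seconds
duration_after = new_length / target_rate   # seconds
print(new_length, duration_before, duration_after)
```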

## Preprocessing function

Next, create a preprocessing function that:

* Calls the `audio` column to load and, if necessary, resample the audio file.
* Checks that the sampling rate of the audio matches the sampling rate of the audio data the model was pretrained on.
* Sets a maximum input length (`max_length=16000` samples, one second at 16 kHz) and truncates longer inputs so that examples can be batched.

In [10]:
def preprocess_function(examples):
    audio_arrays=[x["array"] for x in examples["audio"]]
    inputs=feature_extractor(
        audio_arrays, sampling_rate=feature_extractor.sampling_rate, max_length=16000, truncation=True
    )
    return inputs
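
To make the effect of `max_length` and `truncation=True` concrete, here is a plain-Python sketch of what happens to a single waveform: it is cut to at most `max_length` samples and then scaled to zero mean and unit variance. The normalization step is an assumption based on the feature extractor's default configuration, and `truncate_and_normalize` is a hypothetical helper, not part of `transformers`.

```python
import math

def truncate_and_normalize(waveform, max_length, eps=1e-7):
    """Cut a waveform to max_length samples, then standardize it."""
    clipped = waveform[:max_length]
    mean = sum(clipped) / len(clipped)
    var = sum((x - mean) ** 2 for x in clipped) / len(clipped)
    return [(x - mean) / math.sqrt(var + eps) for x in clipped]

# A made-up 2-second ramp at 16 kHz (32,000 samples).
waveform = [i / 32_000 for i in range(32_000)]
processed = truncate_and_normalize(waveform, max_length=16_000)
print(len(processed))
```

With `max_length=16000` at 16 kHz, everything after the first second of a recording is discarded, so a larger `max_length` may be worth trying if the intent is expressed later in the utterance.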

In [11]:
inputs_demo=preprocess_function(minds["train"])
print(inputs_demo["input_values"][:10])

[array([-4.5119724e-04, -1.9180886e-02, -4.3954354e-02, ...,
        2.6654488e-01,  1.1939685e+00,  1.4371129e+00], dtype=float32), array([ 3.2901776e-03,  9.0409674e-02, -1.2186392e-03, ...,
        1.2323352e+00,  1.3919932e+00,  1.0287923e+00], dtype=float32), array([ 9.4104398e-05,  1.4094682e-03,  1.6799134e-03, ...,
        1.5397958e-01, -4.7274339e-03,  6.5522909e-02], dtype=float32), array([ 1.1782044e-02,  1.3268645e-02,  1.0825084e-02, ...,
       -1.3078242e+01, -1.1467905e+01, -6.8332992e+00], dtype=float32), array([-7.6369458e-04, -2.5151190e-04,  1.2338213e-03, ...,
        1.2644606e+00,  1.4632099e+00,  1.4876673e+00], dtype=float32), array([ 3.7581448e-03, -6.2643498e-04,  5.2391035e-03, ...,
       -6.2701187e+00,  1.6558124e+00,  5.6201258e+00], dtype=float32), array([ 2.0641934e-03,  1.6626728e-03,  7.3201118e-05, ...,
        7.1904987e-02,  9.4751179e-02, -5.6778070e-02], dtype=float32), array([ 0.02840313,  0.04050473, -0.03754713, ..., -0.10273816,
        0.1

In [12]:
encoded_minds=minds.map(preprocess_function, remove_columns="audio", batched=True)
encoded_minds=encoded_minds.rename_column("intent_class", "label")

# Evaluate

In [13]:
import evaluate

accuracy=evaluate.load("accuracy")

In [14]:
import numpy as np

def compute_metrics(eval_pred):
    predictions=np.argmax(eval_pred.predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=eval_pred.label_ids)
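
`compute_metrics` turns the model's logits into class predictions with `argmax` and compares them with the reference labels. The same computation in plain Python, on made-up logits for three examples and three classes (`accuracy_from_logits` is a hypothetical helper):

```python
def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    predictions = [row.index(max(row)) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)

logits = [
    [2.0, 0.1, -1.0],   # highest score at index 0
    [0.3, 0.2, 1.5],    # highest score at index 2
    [0.0, 0.9, 0.4],    # highest score at index 1
]
labels = [0, 2, 2]
acc = accuracy_from_logits(logits, labels)
print(acc)
```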

# Training

In [15]:
from transformers import AutoModelForAudioClassification, TrainingArguments, Trainer

num_labels=len(id2label)
model=AutoModelForAudioClassification.from_pretrained(
    "facebook/wav2vec2-base", num_labels=num_labels, label2id=label2id, id2label=id2label
)



Some weights of Wav2Vec2ForSequenceClassification were not initialized from the model checkpoint at facebook/wav2vec2-base and are newly initialized: ['projector.weight', 'classifier.weight', 'projector.bias', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [16]:
training_args=TrainingArguments(
    output_dir=os.getenv("WANDB_NAME"),
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=3e-5,
    per_device_train_batch_size=32,
    gradient_accumulation_steps=4,
    per_device_eval_batch_size=32,
    num_train_epochs=10,
    fp16=True,
    warmup_ratio=0.1,
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    report_to="wandb",
    run_name=os.getenv("WANDB_NAME"),
    push_to_hub=False,
)

trainer=Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded_minds["train"],
    eval_dataset=encoded_minds["test"],
    tokenizer=feature_extractor,
    compute_metrics=compute_metrics,
)

trainer.train()

[34m[1mwandb[0m: Currently logged in as: [33murakiny[0m ([33mcausal_language_trainer[0m). Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: wandb version 0.16.3 is available!  To upgrade, please run:
[34m[1mwandb[0m:  $ pip install wandb --upgrade
[34m[1mwandb[0m: Tracking run with wandb version 0.16.1
[34m[1mwandb[0m: Run data is saved locally in [35m[1m/kaggle/working/wandb/run-20240216_104921-hi20ucaj[0m
[34m[1mwandb[0m: Run [1m`wandb offline`[0m to turn off syncing.
[34m[1mwandb[0m: Syncing run [33mft-wav2vec2-with-minds[0m
[34m[1mwandb[0m: ⭐️ View project at [34m[4mhttps://wandb.ai/causal_language_trainer/Fine-tune-models[0m
[34m[1mwandb[0m: 🚀 View run at [34m[4mhttps://wandb.ai/causal_language_trainer/Fine-tune-models/runs/hi20ucaj[0m


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | No log | 2.651032 | 0.00885 |
| 2 | No log | 2.653506 | 0.026549 |
| 3 | No log | 2.649577 | 0.044248 |
| 4 | No log | 2.646866 | 0.053097 |
| 5 | 2.632400 | 2.64459 | 0.061947 |
| 6 | 2.632400 | 2.650745 | 0.079646 |
| 7 | 2.632400 | 2.655138 | 0.061947 |
| 8 | 2.632400 | 2.652855 | 0.053097 |
| 9 | 2.632400 | 2.649713 | 0.061947 |
| 10 | 2.629900 | 2.650259 | 0.061947 |




TrainOutput(global_step=20, training_loss=2.6311530113220214, metrics={'train_runtime': 309.1094, 'train_samples_per_second': 14.558, 'train_steps_per_second': 0.065, 'total_flos': 4.0855179168e+16, 'train_loss': 2.6311530113220214, 'epoch': 10.0})
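
The run finishes after only 20 optimizer steps (`global_step=20`), which is worth unpacking. The arithmetic below is a rough reconstruction of how that count can come about. The number of GPUs is an assumption (two devices would be consistent with 20 steps), and the rounding rules reflect one reading of how `Trainer` counts steps rather than anything stated in the notebook.

```python
import math

n_train = 450            # rows in the train split
per_device_batch = 32    # per_device_train_batch_size
grad_accum = 4           # gradient_accumulation_steps
epochs = 10              # num_train_epochs
warmup_ratio = 0.1
n_gpus = 2               # assumption: not stated in the notebook

# Batches the dataloader yields per epoch across all devices.
batches_per_epoch = math.ceil(n_train / (per_device_batch * n_gpus))
# One optimizer update per `grad_accum` batches (assumed floor division).
updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
total_updates = updates_per_epoch * epochs
warmup_steps = math.ceil(total_updates * warmup_ratio)
effective_batch = per_device_batch * n_gpus * grad_accum

print(batches_per_epoch, updates_per_epoch, total_updates, warmup_steps, effective_batch)
```

Under these assumptions the model receives very few weight updates, which may explain why the accuracy barely moves. Lowering `gradient_accumulation_steps` or the batch size would give the optimizer more steps per epoch.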

In [17]:
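import math

eval_results = trainer.evaluate()
# exp(eval_loss). Perplexity is usually reported for language models, so treat
# it here only as a rough companion to the accuracy metric.
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")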



Perplexity: 14.16
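
For context, it helps to compare these numbers with a classifier that guesses uniformly at random. Assuming 14 intent classes (as the dataset name MInDS-14 suggests), the expected cross-entropy loss and accuracy of such a baseline are:

```python
import math

num_classes = 14                      # assumption: taken from the dataset name

chance_loss = math.log(num_classes)   # cross-entropy of a uniform prediction
chance_accuracy = 1 / num_classes
chance_perplexity = math.exp(chance_loss)

print(f"loss={chance_loss:.3f} accuracy={chance_accuracy:.3f} perplexity={chance_perplexity:.2f}")
```

The validation loss of about 2.65 and the accuracies between 0.01 and 0.08 reported above are close to these values, which suggests the model has not yet learned much beyond chance.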


In [18]:
feature_extractor.push_to_hub(os.getenv("WANDB_NAME"))
trainer.push_to_hub(os.getenv("WANDB_NAME"))

'https://huggingface.co/aisuko/ft-wav2vec2-with-minds/tree/main/'

# Inference

In [19]:
from datasets import load_dataset, Audio

dataset=load_dataset("PolyAI/minds14", name="en-US", split="train")
dataset=dataset.cast_column("audio", Audio(sampling_rate=16000))
sampling_rate=dataset.features["audio"].sampling_rate
audio_file=dataset[0]["audio"]["path"]

In [20]:
from transformers import pipeline

classifier=pipeline("audio-classification", model=os.getenv("WANDB_NAME"))
classifier(audio_file)

[{'score': 0.08343921601772308, 'label': 'high_value_payment'},
 {'score': 0.07882718741893768, 'label': 'cash_deposit'},
 {'score': 0.07754970341920853, 'label': 'joint_account'},
 {'score': 0.0769776925444603, 'label': 'pay_bill'},
 {'score': 0.07450034469366074, 'label': 'freeze'}]

## With PyTorch

In [21]:
feature_extractor=AutoFeatureExtractor.from_pretrained(os.getenv("WANDB_NAME"))
inputs=feature_extractor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")

In [22]:
import torch

model=AutoModelForAudioClassification.from_pretrained(os.getenv("WANDB_NAME"))
with torch.no_grad():
    logits=model(**inputs).logits

In [23]:
predicted_class_id=torch.argmax(logits, dim=-1).item()
predicted_label=model.config.id2label[predicted_class_id]
predicted_label

'high_value_payment'
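
The `score` values returned by the pipeline earlier are probabilities obtained by applying a softmax to these logits, which is why the top label matches. A minimal sketch with made-up logits and an illustrative three-entry `id2label` (the real mapping lives in `model.config.id2label`):

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    shift = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up values for illustration only.
id2label = {0: "abroad", 1: "address", 2: "app_error"}
logits = [0.2, -0.1, 0.5]

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)   # index of the largest probability
print(id2label[best], round(probs[best], 3))
```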