# MInDS-14 audio classification with the Wav2Vec2 model

This notebook performs audio classification using the Wav2Vec2 model fine-tuned on the Minds14 dataset.   
We will cover the steps from loading and preprocessing the audio data to training the model and making predictions.

**You can use a model such as distilhubert-finetuned-Minds14 to obtain higher accuracy.**  
The model may not run to completion in this notebook because of time and cloud storage limits, but I produced the results locally.

## Installation of Required Libraries
First, ensure all required libraries are installed.

In [1]:
! pip install transformers datasets evaluate

Defaulting to user installation because normal site-packages is not writeable


In [2]:
! pip install evaluate soundfile

Defaulting to user installation because normal site-packages is not writeable


In [3]:
! pip install librosa "accelerate>=0.26.0"

## Load MInDS-14 dataset

We will use the Minds14 dataset available from the 🤗 datasets library, focusing on the 'en-US' subset for audio classification.

In [4]:
from datasets import load_dataset, Audio

minds = load_dataset("PolyAI/minds14", name="en-US", split="train", trust_remote_code=True)

Split the dataset's `train` split into a smaller train and test set.

In [5]:
minds = minds.train_test_split(test_size=0.2)

Then take a look at the dataset:

In [6]:
minds

DatasetDict({
    train: Dataset({
        features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'],
        num_rows: 450
    })
    test: Dataset({
        features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'],
        num_rows: 113
    })
})

In [7]:
minds["train"][0]

{'path': 'C:\\Users\\Feixing\\.cache\\huggingface\\datasets\\downloads\\extracted\\97d01a83cccb2cf5e353762faba56f0f7c58b404b3594e5510f040b0870a5c0a\\en-US~BUSINESS_LOAN\\602bac4d963e11ccd901cda8.wav',
 'audio': {'path': 'C:\\Users\\Feixing\\.cache\\huggingface\\datasets\\downloads\\extracted\\97d01a83cccb2cf5e353762faba56f0f7c58b404b3594e5510f040b0870a5c0a\\en-US~BUSINESS_LOAN\\602bac4d963e11ccd901cda8.wav',
  'array': array([-0.00024414,  0.        , -0.00024414, ...,  0.        ,
          0.00024414,  0.00024414]),
  'sampling_rate': 8000},
 'transcription': "hi I'm an account holder have been for 18 years and I'd like to speak to somebody about taking out a business loan",
 'english_transcription': "hi I'm an account holder have been for 18 years and I'd like to speak to somebody about taking out a business loan",
 'intent_class': 5,
 'lang_id': 4}

We'll remove unnecessary columns and ensure the audio data is properly formatted for our model.

In [8]:
minds = minds.remove_columns(["path", "transcription", "english_transcription", "lang_id"])

Display an example from the dataset to understand the format:

In [9]:
minds["train"][0]

{'audio': {'path': 'C:\\Users\\Feixing\\.cache\\huggingface\\datasets\\downloads\\extracted\\97d01a83cccb2cf5e353762faba56f0f7c58b404b3594e5510f040b0870a5c0a\\en-US~BUSINESS_LOAN\\602bac4d963e11ccd901cda8.wav',
  'array': array([-0.00024414,  0.        , -0.00024414, ...,  0.        ,
          0.00024414,  0.00024414]),
  'sampling_rate': 8000},
 'intent_class': 5}

In [10]:
labels = minds["train"].features["intent_class"].names
label2id, id2label = dict(), dict()
for i, label in enumerate(labels):
    label2id[label] = str(i)
    id2label[str(i)] = label

Now you can convert the label id to a label name:

In [11]:
id2label[str(2)]

'app_error'
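The mapping can be checked in isolation. The sketch below uses a short list of label names as a stand-in for the real `intent_class` names; only `app_error` at index 2 is taken from the output above, the other two names are hypothetical placeholders.

```python
# Stand-in for minds["train"].features["intent_class"].names.
# Only "app_error" at index 2 comes from the notebook; the rest are placeholders.
example_labels = ["placeholder_a", "placeholder_b", "app_error"]

example_label2id, example_id2label = {}, {}
for i, label in enumerate(example_labels):
    example_label2id[label] = str(i)   # name -> id, stored as a string as in the cell above
    example_id2label[str(i)] = label   # id (string) -> name

# Round trip: name -> id -> name
round_trip = example_id2label[example_label2id["app_error"]]
print(example_label2id["app_error"], round_trip)
```

Because the ids are stored as strings, lookups need `str(i)` rather than a bare integer.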

## Preprocess and Feature Extraction

We'll load the Wav2Vec2 feature extractor and prepare our dataset by extracting features suitable for audio classification.

In [12]:
from transformers import AutoFeatureExtractor

feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")



Resample the dataset to 16,000 Hz (16 kHz), the sampling rate the pretrained Wav2Vec2 model expects:

In [13]:
minds = minds.cast_column("audio", Audio(sampling_rate=16_000))
minds["train"][0]

{'audio': {'path': 'C:\\Users\\Feixing\\.cache\\huggingface\\datasets\\downloads\\extracted\\97d01a83cccb2cf5e353762faba56f0f7c58b404b3594e5510f040b0870a5c0a\\en-US~BUSINESS_LOAN\\602bac4d963e11ccd901cda8.wav',
  'array': array([-1.99531321e-04, -4.72508618e-05, -4.45675541e-05, ...,
          2.77621846e-04,  2.34195439e-04,  1.25979845e-04]),
  'sampling_rate': 16000},
 'intent_class': 5}
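To see what resampling from 8 kHz to 16 kHz does to the array, here is a minimal sketch that uses plain linear interpolation on a synthetic tone. This is an illustration only: the helper `resample_linear` is hypothetical, and the resampler used by the library is more sophisticated than linear interpolation.

```python
import numpy as np

def resample_linear(signal, orig_sr, target_sr):
    """Resample a 1-D signal by linear interpolation (illustration only)."""
    duration = len(signal) / orig_sr                 # clip length in seconds
    n_target = int(round(duration * target_sr))      # number of output samples
    t_orig = np.arange(len(signal)) / orig_sr        # timestamps of input samples
    t_target = np.arange(n_target) / target_sr       # timestamps of output samples
    return np.interp(t_target, t_orig, signal)

# One second of a 440 Hz tone at 8 kHz (synthetic stand-in for a MInDS-14 clip)
sr_in, sr_out = 8_000, 16_000
tone = np.sin(2 * np.pi * 440 * np.arange(sr_in) / sr_in)
upsampled = resample_linear(tone, sr_in, sr_out)
print(len(tone), len(upsampled))
```

The duration stays the same while the number of samples doubles, which is why the array values in the output above differ from the 8 kHz version.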

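Create a preprocessing function that extracts the raw audio arrays, passes them to the feature extractor at its expected sampling rate, and truncates each clip to at most 16,000 samples (one second at 16 kHz):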

In [14]:
def preprocess_function(examples):
    audio_arrays = [x["array"] for x in examples["audio"]]
    inputs = feature_extractor(
        audio_arrays, sampling_rate=feature_extractor.sampling_rate, max_length=16000, truncation=True
    )
    return inputs
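The effect of `max_length=16000` with `truncation=True` can be sketched with NumPy alone. The helper below is hypothetical and assumes the feature extractor also applies zero-mean, unit-variance normalization; that is an assumption about its configuration, not a reproduction of its code.

```python
import numpy as np

def truncate_and_normalize(waveform, max_length=16_000, eps=1e-7):
    """Keep at most max_length samples, then scale to zero mean and unit variance."""
    clipped = np.asarray(waveform, dtype=np.float32)[:max_length]
    return (clipped - clipped.mean()) / np.sqrt(clipped.var() + eps)

# Synthetic 2.5-second clip at 16 kHz (random noise standing in for speech)
rng = np.random.default_rng(0)
long_clip = rng.normal(loc=0.1, scale=0.5, size=40_000)

processed = truncate_and_normalize(long_clip)
print(processed.shape)
```

Note that one second is short for a spoken banking request, so most of each utterance is discarded at this setting.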

To apply the preprocessing function over the entire dataset, use the 🤗 Datasets `map` function, then rename the `intent_class` column to `label`:

In [15]:
encoded_minds = minds.map(preprocess_function, remove_columns="audio", batched=True)
encoded_minds = encoded_minds.rename_column("intent_class", "label")
print(encoded_minds)

Map:   0%|          | 0/450 [00:00<?, ? examples/s]

Map:   0%|          | 0/113 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['label', 'input_values'],
        num_rows: 450
    })
    test: Dataset({
        features: ['label', 'input_values'],
        num_rows: 113
    })
})


## Evaluate

Load the accuracy metric from the 🤗 Evaluate library; it is used by the `compute_metrics` function defined below:

In [16]:
import evaluate

accuracy = evaluate.load("accuracy")

Then create a function that passes your predictions and labels to [compute](https://huggingface.co/docs/evaluate/main/en/package_reference/main_classes#evaluate.EvaluationModule.compute) to calculate the accuracy:

In [17]:
import numpy as np


def compute_metrics(eval_pred):
    predictions = np.argmax(eval_pred.predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=eval_pred.label_ids)
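As a sanity check, the same computation can be done by hand. The sketch below uses made-up logits and reference labels (not values from this notebook) to show what the argmax step and the accuracy ratio amount to.

```python
import numpy as np

# Made-up logits for 4 examples and 3 classes (illustrative values only)
logits = np.array([
    [0.1, 2.0, -1.0],
    [1.5, 0.2, 0.3],
    [0.0, 0.1, 3.0],
    [0.9, 0.8, 0.7],
])
references = np.array([1, 0, 2, 1])

# Same argmax over the class axis as in compute_metrics
predicted_ids = np.argmax(logits, axis=1)

# Accuracy is the fraction of predictions that match the references
manual_accuracy = float((predicted_ids == references).mean())
print(predicted_ids.tolist(), manual_accuracy)
```

Here three of the four predictions match, so the accuracy is 0.75.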

## Model Training

We will now load the pre-trained Wav2Vec2 model and fine-tune it on the Minds14 dataset for audio classification.

In [18]:
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification, TrainingArguments, Trainer

feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")
num_labels = len(id2label)
model = AutoModelForAudioClassification.from_pretrained(
    "facebook/wav2vec2-base", num_labels=num_labels, label2id=label2id, id2label=id2label
)
print(model)

Some weights of Wav2Vec2ForSequenceClassification were not initialized from the model checkpoint at facebook/wav2vec2-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'projector.bias', 'projector.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Wav2Vec2ForSequenceClassification(
  (wav2vec2): Wav2Vec2Model(
    (feature_extractor): Wav2Vec2FeatureEncoder(
      (conv_layers): ModuleList(
        (0): Wav2Vec2GroupNormConvLayer(
          (conv): Conv1d(1, 512, kernel_size=(10,), stride=(5,), bias=False)
          (activation): GELUActivation()
          (layer_norm): GroupNorm(512, 512, eps=1e-05, affine=True)
        )
        (1-4): 4 x Wav2Vec2NoLayerNormConvLayer(
          (conv): Conv1d(512, 512, kernel_size=(3,), stride=(2,), bias=False)
          (activation): GELUActivation()
        )
        (5-6): 2 x Wav2Vec2NoLayerNormConvLayer(
          (conv): Conv1d(512, 512, kernel_size=(2,), stride=(2,), bias=False)
          (activation): GELUActivation()
        )
      )
    )
    (feature_projection): Wav2Vec2FeatureProjection(
      (layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (projection): Linear(in_features=512, out_features=768, bias=True)
      (dropout): Dropout(p=0.1, inplace=False)


**You can use another model, such as distilhubert-finetuned-Minds14, to obtain higher accuracy.**

In [19]:
# Load distilhubert-finetuned-Minds14 model
# from transformers import AutoProcessor, AutoModelForAudioClassification

# processor = AutoProcessor.from_pretrained("Teapack1/distilhubert-finetuned-Minds14")
# model = AutoModelForAudioClassification.from_pretrained("Teapack1/distilhubert-finetuned-Minds14")

Start training the model:

In [20]:
# Training arguments
training_args = TrainingArguments(
    output_dir="minds14_wav2vec2",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=3e-5,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=4,
    per_device_eval_batch_size=16,
    num_train_epochs=10,
    warmup_ratio=0.1,
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    report_to="none"  # Disable wandb
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded_minds["train"],
    eval_dataset=encoded_minds["test"],
    tokenizer=feature_extractor,
    compute_metrics=compute_metrics,
)

trainer.train()


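| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 0 | No log | 2.641577 | 0.070796 |
| 1 | 2.638900 | 2.643246 | 0.070796 |
| 2 | 2.629200 | 2.64975 | 0.053097 |
| 4 | 2.619000 | 2.659642 | 0.053097 |
| 5 | 2.612300 | 2.681885 | 0.053097 |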


RuntimeError: [enforce fail at inline_container.cc:603] . unexpected pos 96001216 vs 96001104
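The batch settings in the training arguments combine as follows. This is a small arithmetic sketch that assumes training on a single device; `num_devices` is a name introduced here for illustration.

```python
import math

per_device_train_batch_size = 16
gradient_accumulation_steps = 4
train_rows = 450
num_devices = 1  # assumption: a single GPU or CPU

# Number of examples that contribute to each optimizer update
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices

# Number of forward/backward passes needed to see the training split once
micro_batches_per_epoch = math.ceil(train_rows / (per_device_train_batch_size * num_devices))

print(effective_batch_size, micro_batches_per_epoch)
```

With 64 examples per update and only 450 training rows, each epoch yields just a handful of optimizer steps.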

## Inference

After training, let's use our fine-tuned model to predict new audio samples.

In [46]:
import torch

# Predict the class of a single, already preprocessed test example
sample = encoded_minds["test"][12]
test_audio_tensor = torch.tensor(sample["input_values"]).unsqueeze(0)  # add a batch dimension

# Run the trained model in evaluation mode without tracking gradients
model.eval()
with torch.no_grad():
    logits = model(test_audio_tensor.to(model.device)).logits  # move input to the model's device
predicted_label = logits.argmax(-1).item()

print("Predicted class for the test audio:", predicted_label)
print("Real class for the test audio:", sample["label"])

Predicted class for the test audio: 3
Real class for the test audio: 7
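The integer ids printed above are easier to read as label names. The sketch below shows one way to turn a logits vector into a label name and a probability; the logits and the two placeholder label names are made up for illustration, and only `app_error` at id 2 comes from this notebook.

```python
import numpy as np

# Illustrative subset of an id -> name mapping with string keys, as in id2label
example_id2label = {"0": "placeholder_a", "1": "placeholder_b", "2": "app_error"}

# Made-up logits for one example over three classes
example_logits = np.array([0.2, -0.5, 1.7])

# Numerically stable softmax: subtract the maximum before exponentiating
shifted = example_logits - example_logits.max()
probs = np.exp(shifted) / np.exp(shifted).sum()

pred_id = int(np.argmax(probs))
pred_name = example_id2label[str(pred_id)]  # keys are strings, so convert the id
print(pred_name, round(float(probs[pred_id]), 3))
```

With the real model, the same lookup would use `id2label[str(predicted_label)]`.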


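The model is not accurate on the MInDS-14 test split: validation accuracy stayed between about 5% and 7% over the logged epochs, training ended with the `RuntimeError` shown above, and the example prediction (class 3) does not match the true class (7).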
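For context, the logged accuracy can be compared with random guessing. The short calculation below assumes 14 intent classes, as the dataset name suggests, and uses the test-set size and best logged accuracy from the outputs above.

```python
num_classes = 14                 # assumption: one class per MInDS-14 intent
chance_accuracy = 1 / num_classes

test_rows = 113                  # size of the test split shown earlier
best_logged_accuracy = 0.070796  # highest accuracy in the training log

# Approximate number of correctly classified test examples
approx_correct = round(best_logged_accuracy * test_rows)

print(f"{chance_accuracy:.4f}", approx_correct)
```

The best logged accuracy corresponds to about 8 correct predictions out of 113, which is roughly what uniform random guessing would give.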