# Speech Commands Classification with Wav2Vec2 Model


## Model Introduction
This model is fine-tuned from the wav2vec2-base pre-trained model, specifically tailored for the speech_commands-v0.02 dataset. Through fine-tuning, the model has been trained to better recognize and classify specific spoken commands.

**I used the Great Lakes setup with 20 cores, 100 GB, and 6 GPUs; the final training time was around 5 hours.**  
You can reduce the number of epochs if you want to train faster, but the accuracy might drop.

### Model Files Description
The model and its configuration files are stored in a directory named `wav2vec2-base-finetuned-speech_commands-v0.02`, which includes the following key files:
- `config.json`: The configuration file for the model that defines its architecture and parameters.
- `pytorch_model.bin`: Contains the model weights, crucial for the model to perform its tasks correctly.
- `preprocessor_config.json`: Preprocessing configuration file that specifies how to process input data to fit the model.
- `tokenizer_config.json`: Tokenizer configuration file that guides how text inputs are transformed into a format manageable by the model.
- `vocab.json`: The vocabulary file containing all the tokens used during the training of the model.
- `training_args.bin`: Stores the settings used during the model training.
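
Before loading, it can help to confirm that the directory really contains the files listed above. The following is a minimal sketch using only the standard library; the helper name `missing_model_files` is hypothetical, and the expected file list simply mirrors the description above. Newer versions of transformers may save the weights as `model.safetensors` instead of `pytorch_model.bin`, so adjust the list if your checkpoint differs.

```python
from pathlib import Path

# File names taken from the description above; adjust if your checkpoint differs.
EXPECTED_FILES = [
    "config.json",
    "pytorch_model.bin",
    "preprocessor_config.json",
    "tokenizer_config.json",
    "vocab.json",
    "training_args.bin",
]


def missing_model_files(model_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in model_dir."""
    root = Path(model_dir)
    return [name for name in expected if not (root / name).is_file()]


# Report anything that is missing from the model directory.
missing = missing_model_files("wav2vec2-base-finetuned-speech_commands-v0.02")
print("Missing files:", missing or "none")
```

An empty result means every listed file was found; a non-empty one names the files to look for before calling `from_pretrained`.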

### Loading the Model
The model can be loaded using the transformers library, with the following code:

In [1]:
from transformers import AutoModelForAudioClassification, AutoProcessor

# Model directory path
model_path = "wav2vec2-base-finetuned-speech_commands-v0.02"

# Load the model and processor
model = AutoModelForAudioClassification.from_pretrained(model_path)
processor = AutoProcessor.from_pretrained(model_path)


**Now, let's start our journey of audio classification!**  
The walkthrough is divided into the following parts:
1. Environment setup and library installation  
2. Dataset loading  
3. Data preprocessing
4. Model training
5. Inference

## Install and import the necessary libraries

In [1]:
!pip install huggingface_hub transformers datasets evaluate

Defaulting to user installation because normal site-packages is not writeable


In [2]:
!pip install torch torchvision torchaudio


In [3]:
!pip install datasets[audio]


In [4]:
!pip install transformers[torch]


In [5]:
!pip install soundfile


In [6]:
# Import used packages
import torch
import torchaudio
import matplotlib.pyplot as plt
import IPython.display as ipd
from tqdm import tqdm

Let’s check if a CUDA GPU is available and select our device. Running the network on a GPU will greatly decrease the training/testing runtime.

In [7]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cuda


## Loading the dataset
The Speech Commands dataset contains 35 commands spoken by different people; each clip is about 1 second long (16,000 samples at a 16 kHz sampling rate).  
We use Hugging Face's `datasets` library to load its `v0.02` version.  
This notebook uses two of its splits:  
- **Training set**: used for model training  
- **Test set**: used for model validation

In [8]:
from datasets import load_dataset

# Load the Speech Commands dataset
dataset = load_dataset("google/speech_commands", "v0.02")
train_dataset = dataset["train"]
test_dataset = dataset["test"]

print(f"First sample in training set: {train_dataset[0]}")

First sample in training set: {'file': 'backward/2356b88d_nohash_0.wav', 'audio': {'path': 'backward/2356b88d_nohash_0.wav', 'array': array([ 0.        ,  0.        ,  0.        , ..., -0.00012207,
       -0.00015259, -0.00012207]), 'sampling_rate': 16000}, 'label': 30, 'is_unknown': True, 'speaker_id': '2356b88d', 'utterance_id': 0}
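
The printed sample shows that each decoded clip carries an `array` of samples and a `sampling_rate`, so the duration follows from dividing one by the other. A small sketch of that calculation is shown here; the helper `clip_duration_seconds` and the `example_audio` dictionary are hypothetical stand-ins for a real entry such as `train_dataset[0]["audio"]`.

```python
def clip_duration_seconds(audio):
    """Duration of a decoded clip: number of samples divided by the sampling rate."""
    return len(audio["array"]) / audio["sampling_rate"]


# Hypothetical stand-in for train_dataset[0]["audio"]: one second of silence.
example_audio = {"array": [0.0] * 16000, "sampling_rate": 16000}
print(clip_duration_seconds(example_audio))  # 1.0
```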


In [9]:
# Map labels to indexes
labels = dataset["train"].features["label"].names
label2id, id2label = {}, {}
for i, label in enumerate(labels):
    label2id[label] = i
    id2label[i] = label

# Show example mappings
print(f"Label 'go' maps to ID: {label2id['go']}")
print(f"ID 2 maps back to Label: {id2label[2]}")

Label 'go' maps to ID: 9
ID 2 maps back to Label: up


## Model and processor loading
Use Hugging Face's `AutoModelForAudioClassification` and `AutoProcessor` to load my wav2vec2-base-finetuned-speech_commands-v0.02 model and its processor.

In [10]:
from transformers import AutoModelForAudioClassification, AutoProcessor

model_path = "wav2vec2-base-finetuned-speech_commands-v0.02"

# Load the model and processor
model = AutoModelForAudioClassification.from_pretrained(model_path)
processor = AutoProcessor.from_pretrained(model_path)

## Data preprocessing
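Use the processor to convert the audio data into a format the model accepts, including:
- Declaring the 16 kHz sampling rate (the clips are already sampled at 16 kHz; the processor checks this rate rather than resampling)
- Padding or truncating to a fixed length of 16,000 samples
- Converting to PyTorch tensor format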
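
The fixed-length step can be pictured with a small standalone sketch. This is not the processor's own implementation; it is a plain-Python illustration that assumes right-padding with zeros, and the helper name `pad_or_truncate` and the example clips are hypothetical.

```python
def pad_or_truncate(samples, target_length=16000, pad_value=0.0):
    """Return a copy of samples with exactly target_length entries."""
    if len(samples) >= target_length:
        # Too long: keep only the first target_length samples.
        return list(samples[:target_length])
    # Too short: append pad_value until the target length is reached.
    return list(samples) + [pad_value] * (target_length - len(samples))


short_clip = [0.1] * 12000  # hypothetical 0.75 s clip at 16 kHz
long_clip = [0.1] * 20000   # hypothetical 1.25 s clip at 16 kHz
print(len(pad_or_truncate(short_clip)), len(pad_or_truncate(long_clip)))
```

Giving every clip the same length is what lets a batch be stacked into a single tensor.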

In [11]:
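def preprocess_function(examples):
    # Collect the raw waveforms from the decoded audio column
    audio_arrays = [x["array"] for x in examples["audio"]]
    # Pad or truncate every clip to exactly 16,000 samples (1 second at 16 kHz);
    # padding=True would only pad to the longest clip in each batch
    inputs = processor(
        audio_arrays,
        sampling_rate=16000,
        return_tensors="pt",
        padding="max_length",
        truncation=True,
        max_length=16000
    )
    return {
        # Convert tensors to plain lists so that datasets can store them
        "input_values": inputs.input_values.numpy().tolist(),
        "labels": examples["label"]
    }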

In [12]:
train_dataset = train_dataset.map(
    preprocess_function, 
    batched=True, 
    batch_size=8, 
    remove_columns=train_dataset.column_names
)
test_dataset = test_dataset.map(
    preprocess_function, 
    batched=True, 
    batch_size=8, 
    remove_columns=test_dataset.column_names
)

Check the preprocessed results

In [13]:
train_dataset[10000]

{'input_values': [-0.0003110016114078462,
  -0.0020742989145219326,
  -0.0056008934043347836,
  -0.0056008934043347836,
  -0.003837596159428358,
  ...
(output truncated)

## Model training
Define the training process using the `Trainer` API with the following configuration:  
- Evaluate model performance after each epoch
- Save a checkpoint after each epoch and reload the best model (by accuracy) at the end
- Record accuracy and macro-averaged F1 score as evaluation metrics
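
To make the two metrics concrete, here is a plain-Python sketch of what accuracy and macro-averaged F1 measure. It is an illustration only, not the code behind the `evaluate` library; the function names and the toy labels are hypothetical.

```python
def accuracy(predictions, references):
    """Fraction of predictions that equal the reference label."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


def macro_f1(predictions, references):
    """Unweighted mean of per-class F1 over every class that appears."""
    classes = sorted(set(predictions) | set(references))
    f1_scores = []
    for c in classes:
        pairs = list(zip(predictions, references))
        tp = sum(p == c and r == c for p, r in pairs)
        fp = sum(p == c and r != c for p, r in pairs)
        fn = sum(p != c and r == c for p, r in pairs)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy labels for illustration only.
toy_references = [0, 0, 1, 1, 2]
toy_predictions = [0, 1, 1, 1, 2]
print(f"accuracy={accuracy(toy_predictions, toy_references):.3f}")
print(f"macro_f1={macro_f1(toy_predictions, toy_references):.3f}")
```

Because macro averaging gives every class the same weight, a few poorly recognized classes can pull the F1 score below the accuracy.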

In [17]:
from transformers import TrainingArguments, Trainer
import evaluate
import numpy as np

# load the metrics
accuracy_metric = evaluate.load("accuracy")
f1_score_metric = evaluate.load("f1")

# Define the metrics function
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    acc = accuracy_metric.compute(predictions=predictions, references=labels)
    f1 = f1_score_metric.compute(predictions=predictions, references=labels, average='macro')
    return {"accuracy": acc['accuracy'], "f1_score": f1['f1']}

# Train configuration parameters
training_args = TrainingArguments(
    output_dir="./speechcommandresults",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=3e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    warmup_ratio=0.1,
    weight_decay=0.01,
    logging_dir="./logs",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    gradient_accumulation_steps=4,
    lr_scheduler_type="linear",
    save_total_limit=1
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics,
)

# Model training
trainer.train()

Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


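| Epoch | Training Loss | Validation Loss | Accuracy | F1 Score |
|-------|---------------|-----------------|----------|----------|
| 1 | No log | 0.455451 | 0.899591 | 0.869608 |
| 2 | No log | 0.45341 | 0.901227 | 0.865497 |
| 3 | 0.228000 | 0.462737 | 0.901227 | 0.85753 |
| 4 | 0.228000 | 0.453496 | 0.899796 | 0.839169 |
| 5 | 0.195300 | 0.458421 | 0.901431 | 0.856639 |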




TrainOutput(global_step=1105, training_loss=0.20915486607616304, metrics={'train_runtime': 4131.5796, 'train_samples_per_second': 102.682, 'train_steps_per_second': 0.267, 'total_flos': 3.85187498406912e+18, 'train_loss': 0.20915486607616304, 'epoch': 5.0})
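
The `global_step=1105` value can be cross-checked against the training configuration. The sketch below is a back-of-the-envelope calculation under stated assumptions: the training set size of 84,848 clips is inferred from the log (`train_samples_per_second` × `train_runtime` ÷ 5 epochs ≈ 84,848), the 6 GPUs come from the hardware note in the introduction, and the rounding rule (round batches up, then floor-divide by the accumulation steps) is an assumption about how the Trainer counted steps in the version used here. The function name `optimizer_steps` is hypothetical.

```python
import math


def optimizer_steps(num_samples, per_device_batch, num_devices, grad_accum, epochs):
    """Estimate the total number of optimizer updates for a training run."""
    # Batches per epoch when every device processes per_device_batch clips at once.
    batches_per_epoch = math.ceil(num_samples / (per_device_batch * num_devices))
    # One optimizer update happens every grad_accum batches.
    steps_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return steps_per_epoch * epochs


# 16 clips per device, 6 devices, 4 accumulation steps, 5 epochs.
print(optimizer_steps(84_848, 16, 6, 4, 5))  # 1105
```

Under these assumptions each update sees an effective batch of 16 × 6 × 4 = 384 clips, which gives 221 updates per epoch. This also shows how fewer epochs shorten training proportionally.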

## Inference and model evaluation
Test the model's performance: pick a sample from the test set and check whether the predicted label matches the true label.

In [18]:
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Select a test sample (index 127 is a fixed, arbitrary choice)
sample = test_dataset[127]
input_values = torch.tensor(sample["input_values"]).unsqueeze(0).to(device)
true_label = sample["labels"]

# Model prediction
model.eval()
with torch.no_grad():
    logits = model(input_values).logits
    predicted_label = torch.argmax(logits, dim=-1).item()

# Print the result of prediction
predicted_label_name = id2label.get(predicted_label, "Unknown")
true_label_name = id2label.get(true_label, "Unknown")
print(f"Predicted Label: {predicted_label_name}")
print(f"True Label: {true_label_name}")

if predicted_label == true_label:
    print("Prediction is correct!")
else:
    print("Prediction is incorrect.")

Predicted Label: down
True Label: down
Prediction is correct!
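
Beyond the single best label, it can be informative to see how confident the model is. The sketch below turns a list of logits into softmax probabilities and returns the top-k labels. It is a standalone illustration in plain Python: the helper `top_k_predictions`, the example logits, and the four-entry label mapping are all hypothetical, not values produced by this model.

```python
import math


def top_k_predictions(logits, id2label, k=3):
    """Return the k most probable (label, probability) pairs for one clip."""
    max_logit = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Rank class indices from most to least probable.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label.get(i, "Unknown"), probs[i]) for i in ranked[:k]]


# Hypothetical logits and label mapping, for illustration only.
example_id2label = {0: "yes", 1: "no", 2: "up", 3: "down"}
example_logits = [0.2, -1.0, 0.5, 3.1]
for label, prob in top_k_predictions(example_logits, example_id2label, k=2):
    print(f"{label}: {prob:.3f}")
```

With the tensors from the cell above, `logits[0].tolist()` would provide the list of logits and `id2label` the mapping.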
