# Accent Classification Project: Train DistilHuBERT on L2-Arctic data

**Purpose:**

The goal of this project is to create an accent classifier for people who learned English as a second language. It fine-tunes a pretrained speech model to classify the accents of 24 English speakers whose first language is Hindi, Korean, Arabic, Vietnamese, Spanish, or Mandarin.

**Why?**

Existing accent classifiers focus on native English speakers from around the world but exclude people who learned English as a second language. This makes them inaccurate for many accents that are common in the US, such as those of people whose first language is Spanish or Chinese.

**Data source**

The [L2-Arctic](https://psi.engr.tamu.edu/l2-arctic-corpus/) data is ~8GB and is obtained via email. It includes approximately 24-30 hours of recordings in which 24 speakers read passages in English. The first languages of the speakers are Arabic, Hindi, Korean, Mandarin, Spanish, and Vietnamese. There are 2 women and 2 men in each language group.

**Foundation Model**

[DistilHuBERT](https://huggingface.co/ntu-spml/distilhubert) is a smaller, distilled version of HuBERT. HuBERT is a self-supervised, encoder-only speech model whose masked-prediction pretraining was inspired by BERT, a text language model. HuBERT can be fine-tuned for speech recognition with a CTC head. For this project, a classification layer was added instead.

**Resulting Model for Accent Classification**

DistilHuBERT was fine-tuned on 50% of the L2-Arctic data to classify the accents in the 6 language groups.

The following model was created and uploaded to Hugging Face:
[kaysrubio/accent-id-distilhubert-finetuned-l2-arctic2](https://huggingface.co/kaysrubio/accent-id-distilhubert-finetuned-l2-arctic2)

The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: Use OptimizerNames.ADAMW_TORCH with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 10
- mixed_precision_training: Native AMP
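
To illustrate what `lr_scheduler_type: linear` with `lr_scheduler_warmup_ratio: 0.1` means, here is a minimal sketch of a linear warmup followed by a linear decay. It assumes the 1,960 optimizer steps reported in the training output of this notebook. `linear_lr` is a hypothetical helper written for illustration, not the scheduler code that the Trainer runs.

```python
import math

PEAK_LR = 5e-5        # learning_rate from the list above
WARMUP_RATIO = 0.1    # lr_scheduler_warmup_ratio from the list above
TOTAL_STEPS = 1960    # assumption: global_step reported by trainer.train()


def linear_lr(step, total_steps=TOTAL_STEPS, peak_lr=PEAK_LR, warmup_ratio=WARMUP_RATIO):
    """Learning rate at a given optimizer step (hypothetical illustration)."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Ramp up linearly from 0 to peak_lr during warmup
        return peak_lr * step / max(1, warmup_steps)
    # Then decay linearly from peak_lr down to 0 at the final step
    remaining = (total_steps - step) / max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, remaining)


# Print the learning rate at a few points along the run
for step in (0, 98, 196, 1078, 1960):
    print(step, linear_lr(step))
```

Under these assumptions the rate starts at zero, reaches its peak once warmup ends, and returns to zero at the last step.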

**Limitations**

The model is very accurate on recordings from the original dataset that were held out of train/test. However, it is not accurate for voices from outside the dataset. With only 24 speakers represented, the model appears to have memorized speaker-specific characteristics other than accent, so it does not generalize well to real-world voices.

**Next Steps**

The code is good! If a new dataset becomes available that includes many more voices and clear accent categories, this code may be reused to train a better model.

**Data Preparation**

This file is 3 of 3 in the Accent Classification Project. The first 2 files reformat the data.

## Download data
Load the data, which was uploaded to Google Drive in directories of the form arctic/speaker/wav/*.wav

In [1]:
#!pip install datasets

In [2]:
import numpy as np
from datasets import Dataset
from datasets import load_from_disk
from google.colab import output

In [None]:
# Load data from google drive which is in format {['speaker': 'ABA', 'file_path': 'drive...wav']}
from google.colab import drive
drive.mount('/content/drive')
# Check the present working directory from Google Colab, list contents, and check inside the MyDrive folder
!ls drive/MyDrive

Mounted at /content/drive
 arctic   arctic_data_formatted  'Colab Notebooks'


In [6]:
data = load_from_disk('drive/MyDrive/arctic_data_formatted')

In [7]:
data

Dataset({
    features: ['label', 'audio'],
    num_rows: 1737
})

In [11]:
# Check data
data[500]['label']

3

In [12]:
len(data[500]['audio'])

461320

## Split into train/test

In [13]:
# shuffle and split off 10% for test data
data = data.train_test_split(seed=42, shuffle=True, test_size=0.1)
data

DatasetDict({
    train: Dataset({
        features: ['label', 'audio'],
        num_rows: 1563
    })
    test: Dataset({
        features: ['label', 'audio'],
        num_rows: 174
    })
})
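
Because `train_test_split` shuffles individual recordings, the same speakers can appear in both the train and test sets, so test accuracy may not reflect performance on unseen voices. One alternative is to hold out whole speakers. The sketch below assumes a per-recording list of speaker IDs, which the formatted dataset above does not include (its features are only `label` and `audio`). `speaker_disjoint_split` and the toy speaker IDs are hypothetical.

```python
import random
from collections import defaultdict


def speaker_disjoint_split(speakers, labels, n_test_per_label=1, seed=42):
    """Return (train_idx, test_idx) such that no speaker appears in both."""
    rng = random.Random(seed)
    # Group the distinct speakers under each accent label
    speakers_by_label = defaultdict(set)
    for speaker, label in zip(speakers, labels):
        speakers_by_label[label].add(speaker)
    # Hold out a few speakers per label, always leaving at least one for training
    held_out = set()
    for label in sorted(speakers_by_label):
        candidates = sorted(speakers_by_label[label])
        n_hold = min(n_test_per_label, len(candidates) - 1)
        held_out.update(rng.sample(candidates, n_hold))
    train_idx = [i for i, s in enumerate(speakers) if s not in held_out]
    test_idx = [i for i, s in enumerate(speakers) if s in held_out]
    return train_idx, test_idx


# Toy example: two accent labels, two made-up speakers each
speakers = ["spk_a1", "spk_a1", "spk_a2", "spk_a2", "spk_b1", "spk_b2"]
labels = [0, 0, 0, 0, 1, 1]
train_idx, test_idx = speaker_disjoint_split(speakers, labels)
print(train_idx, test_idx)
```

Applied to the unsplit dataset, the returned index lists could be passed to `Dataset.select`, for example `data.select(train_idx)`.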

In [15]:
# Choose pretrained model DistilHuBERT which is a smaller version of HuBERT
# Alternatively could try full HuBERT or Wav2Vec2 but these will take longer to train
# HuBERT and Wav2Vec2 models take in raw audio, not spectrograms
# https://huggingface.co/ntu-spml/distilhubert
model_id = "ntu-spml/distilhubert"

## Use AutoFeatureExtractor from model to prepare dataset if not done already

In [14]:
# DON'T RUN ON GPU IN FUTURE
# Instantiate the AutoFeatureExtractor for DistilHuBERT so we can format data in
# way that model expects
from transformers import AutoFeatureExtractor

In [16]:
# DON'T RUN ON GPU IN FUTURE
feature_extractor = AutoFeatureExtractor.from_pretrained(
    model_id, do_normalize=True, return_attention_mask=True
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


preprocessor_config.json:   0%|          | 0.00/214 [00:00<?, ?B/s]

In [20]:
# DON'T RUN ON GPU IN FUTURE
# Cap clips at 30 seconds. DistilHuBERT accepts variable-length audio, so this is a
# memory limit rather than a model requirement; longer clips are truncated
MAX_DURATION = 30.0
# define a function to apply the feature_extractor to all the data
def preprocess_function(examples):
    # The "audio" column already holds the raw signal for each example, so collect
    # those arrays into a list for the feature extractor
    audio_arrays = list(examples["audio"])
    # Apply the feature_extractor to all the audio arrays, declaring that their sampling
    # rate matches what it expects. max_length is given in samples; truncation=True cuts
    # longer clips to max_length, and the attention mask is returned with the inputs
    inputs = feature_extractor(
        audio_arrays,
        sampling_rate=feature_extractor.sampling_rate,
        max_length=int(feature_extractor.sampling_rate * MAX_DURATION),
        truncation=True,
        return_attention_mask=True,
    )
    return inputs

In [21]:
# DON'T RUN ON GPU IN FUTURE
# apply the function to truncate the audio to the dataset using map
# lower batch size to 100 if using the Google Colab free GPU
# took 2min with 10% dataset
# took 10min with 50% dataset
data_encoded = data.map(
    preprocess_function, # pass the preprocess_function defined above
    batched=True,
    batch_size=100,
    num_proc=1,
)
data_encoded
# - the map method of the Dataset class supports working with batches of examples, with a default batch size of 1000
# - depending on your GPU and RAM, try lower batch sizes of 500, 250, 100 or 50
# - the attention mask is a binary mask of 0/1 values that indicates where the audio input has been padded

Map:   0%|          | 0/1563 [00:00<?, ? examples/s]

Map:   0%|          | 0/174 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['label', 'audio', 'input_values', 'attention_mask'],
        num_rows: 1563
    })
    test: Dataset({
        features: ['label', 'audio', 'input_values', 'attention_mask'],
        num_rows: 174
    })
})
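
To make the `max_length` argument concrete, the sketch below works through the arithmetic with NumPy alone. It assumes a 16 kHz sampling rate and uses the 461,320-sample clip length printed earlier in this notebook. `truncate` is a hypothetical stand-in for the truncation step, not the feature extractor itself.

```python
import numpy as np

SAMPLING_RATE = 16_000   # assumption: the rate the feature extractor expects
MAX_DURATION = 30.0      # seconds, as in the preprocessing cell
max_length = int(SAMPLING_RATE * MAX_DURATION)  # cap expressed in samples


def truncate(clip, max_length):
    """Keep at most max_length samples (hypothetical stand-in for truncation=True)."""
    return clip[:max_length]


short_clip = np.zeros(461_320)  # same length as data[500]['audio'] above
long_clip = np.zeros(600_000)   # made-up clip longer than 30 seconds

print("max_length in samples:", max_length)
print("short clip seconds:", len(short_clip) / SAMPLING_RATE)
print("short clip after truncation:", len(truncate(short_clip, max_length)))
print("long clip after truncation:", len(truncate(long_clip, max_length)))
```

Clips shorter than the cap pass through unchanged; only longer clips lose samples.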

## Prepare labels

In [26]:
# use method to map labels feature to human-readable names
id2label_fn = data_encoded['train'].features["label"].int2str

In [27]:
# create id2label variable that uses the function above
id2label = {
    str(i): id2label_fn(i)
    for i in range(len(data_encoded["train"].features["label"].names))
}

In [28]:
# create label2id variable
label2id = {v: k for k, v in id2label.items()}

In [29]:
# Check one of them
id2label["0"]

'Arabic'
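
As a self-contained illustration of the two mappings, the sketch below builds them from a plain list of names. The alphabetical ordering is an assumption; only `id2label["0"] == 'Arabic'` is confirmed by the output above.

```python
# Assumed label order (hypothetical beyond index 0)
names = ["Arabic", "Hindi", "Korean", "Mandarin", "Spanish", "Vietnamese"]

# Map string ids to names, mirroring the notebook's use of str(i) as keys
id2label = {str(i): name for i, name in enumerate(names)}
# Invert the mapping so names point back to their string ids
label2id = {name: idx for idx, name in id2label.items()}

print(id2label["0"])
print(label2id["Arabic"])
```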

## Fine-tune DistilHuBERT

In [2]:
#!pip install evaluate

In [None]:
import evaluate
from transformers import AutoModelForAudioClassification
from transformers import TrainingArguments
from transformers import Trainer

In [33]:
num_labels = len(id2label)

In [34]:
# Use AutoModelForAudioClassification class and its from_pretrained method which
# automatically adds a classification head to the pretrained model
model = AutoModelForAudioClassification.from_pretrained(
    model_id,
    num_labels=num_labels,
    label2id=label2id,
    id2label=id2label,
)

config.json:   0%|          | 0.00/1.30k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/94.0M [00:00<?, ?B/s]

Some weights of HubertForSequenceClassification were not initialized from the model checkpoint at ntu-spml/distilhubert and are newly initialized: ['classifier.bias', 'classifier.weight', 'projector.bias', 'projector.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
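
The newly initialized `projector` and `classifier` weights named in the message above make up the classification head. As a rough sketch of that idea (not the library's implementation), the head projects each frame's hidden state, averages over time, and maps the pooled vector to one logit per accent. The sizes 768 and 256, the frame count, and the random weights are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, frames, hidden, proj, num_labels = 2, 50, 768, 256, 6  # assumed sizes

# Stand-in for the encoder output: one hidden vector per audio frame
hidden_states = rng.normal(size=(batch, frames, hidden))

# Randomly initialized head weights, analogous to the "newly initialized" layers
W_proj = rng.normal(scale=0.02, size=(hidden, proj))
b_proj = np.zeros(proj)
W_cls = rng.normal(scale=0.02, size=(proj, num_labels))
b_cls = np.zeros(num_labels)

projected = hidden_states @ W_proj + b_proj  # (batch, frames, proj)
pooled = projected.mean(axis=1)              # average over time: (batch, proj)
logits = pooled @ W_cls + b_cls              # (batch, num_labels)

print(logits.shape)
```

Because these weights start out random, the logits carry no information until fine-tuning, which is why the message recommends training on a downstream task.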


In [43]:
# Define training arguments
batch_size = 8
gradient_accumulation_steps = 1
num_train_epochs = 10

In [44]:
model_name = model_id.split("/")[-1]
model_name

'distilhubert'

In [45]:
# Link your jupyter notebook to the hugging face hub
# this will post your new model to hub and save it at certain checkpoints during training
from huggingface_hub import notebook_login
notebook_login()


In [None]:
# Install Weights & Biases, which tracks training metrics such as loss and accuracy
!pip install wandb

In [None]:
import wandb
wandb.login()

In [46]:
training_args = TrainingArguments(
    # run_name="my_custom_experiment"  # Set a unique name
    # report_to="none"  # Optionally disable W&B
    #f"{model_name}-finetuned-l2arctic2",
    "accent-id-distilhubert-finetuned-l2-arctic2",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_train_epochs,
    warmup_ratio=0.1,
    logging_steps=5,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    fp16=True,
    push_to_hub=True, # enable automatic upload of fine-tuned checkpoints to the Hugging Face Hub
)



In [40]:
# Define metrics, such as accuracy. Use the Evaluate library: https://huggingface.co/docs/evaluate/en/index
metric = evaluate.load("accuracy")

Downloading builder script:   0%|          | 0.00/4.20k [00:00<?, ?B/s]

In [47]:
def compute_metrics(eval_pred):
    """Computes accuracy on a batch of predictions"""
    predictions = np.argmax(eval_pred.predictions, axis=1)
    return metric.compute(predictions=predictions, references=eval_pred.label_ids)
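
As a sanity check on what `compute_metrics` calculates, the sketch below reproduces the accuracy computation with NumPy alone. The logits and labels are made-up toy values, not model output.

```python
import numpy as np

# Toy logits for 4 examples and 3 classes (made-up values)
logits = np.array([
    [2.0, 0.1, 0.3],
    [0.2, 1.5, 0.1],
    [0.3, 0.2, 0.9],
    [1.2, 0.8, 0.1],
])
labels = np.array([0, 1, 2, 1])  # toy reference labels

# Pick the highest-scoring class per example, as compute_metrics does
predictions = np.argmax(logits, axis=1)
# Accuracy is the fraction of predictions that match the references
accuracy = float((predictions == labels).mean())

print({"accuracy": accuracy})
```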

In [48]:
# Instantiate the trainer
trainer = Trainer(
    model,
    training_args,
    train_dataset=data_encoded["train"],
    eval_dataset=data_encoded["test"],
    tokenizer=feature_extractor,
    compute_metrics=compute_metrics,
)



In [50]:
# Train the model
trainer.train()
# - if you get CUDA 'out of memory' issue, reduce the batch_size by factors of 2
# and update gradient_accumulation_steps to compensate



| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.5216 | 0.438296 | 1.0 |
| 2 | 0.0106 | 0.006715 | 1.0 |
| 3 | 0.0038 | 0.002383 | 1.0 |
| 4 | 0.0021 | 0.001338 | 1.0 |
| 5 | 0.0014 | 0.000906 | 1.0 |
| 6 | 0.0011 | 0.00068 | 1.0 |
| 7 | 0.0009 | 0.000553 | 1.0 |
| 8 | 0.0008 | 0.000478 | 1.0 |
| 9 | 0.0007 | 0.000438 | 1.0 |
| 10 | 0.0007 | 0.000424 | 1.0 |


TrainOutput(global_step=1960, training_loss=0.15397932205042725, metrics={'train_runtime': 10294.2957, 'train_samples_per_second': 1.518, 'train_steps_per_second': 0.19, 'total_flos': 1.0520711201609905e+18, 'train_loss': 0.15397932205042725, 'epoch': 10.0})
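
The `global_step=1960` in this output is consistent with the split sizes and batch settings used above, and the same arithmetic shows how gradient accumulation compensates for a smaller batch. The helper names below are hypothetical, and the sketch assumes a single device.

```python
import math


def effective_batch_size(per_device_batch, grad_accum_steps, num_devices=1):
    """Number of examples that contribute to each optimizer update."""
    return per_device_batch * grad_accum_steps * num_devices


def batches_per_epoch(num_examples, per_device_batch):
    """Number of batches in one pass over the data (last batch may be partial)."""
    return math.ceil(num_examples / per_device_batch)


# Settings used in this notebook: 1563 training rows, batch size 8, 10 epochs
steps = batches_per_epoch(1563, 8) * 10
print("optimizer steps with no accumulation:", steps)

# Halving the batch size while doubling accumulation keeps the effective batch at 8
for batch, accum in [(8, 1), (4, 2), (2, 4)]:
    print(batch, accum, effective_batch_size(batch, accum))
```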

## Save trained model to Hugging Face Hub

In [51]:
# define some keyword arguments for pushing training results to the hugging face model hub.
kwargs = {
    "dataset_tags": "l2-arctic",
    "dataset": "l2-arctic",
    "model_name": "accent-id-distilhubert-finetuned-l2-arctic2",
    "finetuned_from": model_id,
    "tasks": "audio-classification",
    # "commit_message": "Updated model with new checkpoints",
    # "overwrite": True  # This will allow overwriting the existing model
}

In [52]:
# Push your training results to the hub to save the training logs and model weights under your username/model-name
# e.g., go to https://huggingface.co/[username]/[model-name]
# for another example, go to https://huggingface.co/sanchit-gandhi/distilhubert-finetuned-gtzan
# trainer.push_to_hub(**kwargs) # got error - "model-index[0].results[0].dataset.config" must be a string
trainer.push_to_hub("accent-id-distilhubert-finetuned-l2-arctic2")

events.out.tfevents.1741099212.fd98160263fb.711.0:   0%|          | 0.00/92.4k [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/kaysrubio/accent-id-distilhubert-finetuned-l2-arctic2/commit/e85ff7981c4cef7eb3dedb084483dc1e656631e9', commit_message='accent-id-distilhubert-finetuned-l2-arctic2', commit_description='', oid='e85ff7981c4cef7eb3dedb084483dc1e656631e9', pr_url=None, repo_url=RepoUrl('https://huggingface.co/kaysrubio/accent-id-distilhubert-finetuned-l2-arctic2', endpoint='https://huggingface.co', repo_type='model', repo_id='kaysrubio/accent-id-distilhubert-finetuned-l2-arctic2'), pr_revision=None, pr_num=None)

## Test Model
This section can be run on CPU.

In [None]:
# Link your jupyter notebook to the hugging face hub
from huggingface_hub import notebook_login
notebook_login()


In [2]:
# To use the model in the future you can use it in a pipeline:
from transformers import pipeline

pipe = pipeline(
    "audio-classification",
    model="kaysrubio/accent-id-distilhubert-finetuned-l2-arctic2"
)



config.json:   0%|          | 0.00/1.75k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/94.8M [00:00<?, ?B/s]

preprocessor_config.json:   0%|          | 0.00/212 [00:00<?, ?B/s]

Device set to use cpu


In [3]:
import torch
import torchaudio

In [None]:
first_langs = ['arabic', 'mandarin', 'hindi', 'korean', 'spanish', 'vietnamese']
files = ['arctic_b0303_ABA_arabic.wav', 'arctic_b0303_TXHC_mandarin.wav', 'arctic_b0303_TNI_hindi.wav', 'arctic_b0303_YKWK_korean.wav', 'arctic_b0303_ERMS_spanish.wav', 'arctic_b0303_THV_vietnamese.wav']

In [23]:
for lang, path in zip(first_langs, files):
    audio, sr = torchaudio.load(path)  # Load audio with shape (channels, samples)
    # Resample to the 16 kHz rate the model expects
    audio = torchaudio.transforms.Resample(orig_freq=sr, new_freq=16000)(audio)
    audio = audio.squeeze().numpy()  # assumes mono audio
    result = pipe(audio, top_k=6)
    print('For ' + lang + ' model results: ', result)

For arabic model results:  [{'score': 0.47213926911354065, 'label': 'Arabic'}, {'score': 0.3043089210987091, 'label': 'Hindi'}, {'score': 0.0836687833070755, 'label': 'Spanish'}, {'score': 0.05265321210026741, 'label': 'Korean'}, {'score': 0.05122620239853859, 'label': 'Vietnamese'}, {'score': 0.036003611981868744, 'label': 'Mandarin'}]
For mandarin model results:  [{'score': 0.6848005652427673, 'label': 'Mandarin'}, {'score': 0.13612370193004608, 'label': 'Vietnamese'}, {'score': 0.06616392731666565, 'label': 'Korean'}, {'score': 0.0565626285970211, 'label': 'Spanish'}, {'score': 0.03449594974517822, 'label': 'Hindi'}, {'score': 0.021853145211935043, 'label': 'Arabic'}]
For hindi model results:  [{'score': 0.38408488035202026, 'label': 'Hindi'}, {'score': 0.2965010106563568, 'label': 'Vietnamese'}, {'score': 0.11677749454975128, 'label': 'Arabic'}, {'score': 0.09038836508989334, 'label': 'Spanish'}, {'score': 0.06128711253404617, 'label': 'Mandarin'}, {'score': 0.05096109211444855, 'l

In [None]:
# Test on novel voices with various accents, some native English from outside the US, some not
# american: clip from Reservation Dogs, a show with Indigenous/Native American actors
# irish: clip from Derry Girls, an Irish TV show
# indian: Abdul Bari, an Indian professor, on YouTube teaching algorithms
# mexican: Jaime Camil, Mexican actor from Jane the Virgin
# south_african: Trevor Noah, US-based comedian born in South Africa
# chinese: Ronny Chieng, Chinese-American comedian
# nigerian: Daniel Etim Effiong and Tana Adelana in Dinner for Four, a Nigerian film
accents = ['american1', 'irish1', 'indian1', 'mexican1', 'south_african1', 'chinese1', 'nigerian1']
for accent in accents:
    audio, sr = torchaudio.load(accent + ".wav")  # Load audio
    # audio = torchaudio.transforms.Resample(orig_freq=sr, new_freq=16000)(audio) # already resampled these
    audio = audio.squeeze().numpy()  # assumes mono audio
    result = pipe(audio, top_k=6)
    print('For ' + accent + ' model results: ', result)

For american1 model results:  [{'score': 0.5479803681373596, 'label': 'Hindi'}, {'score': 0.18456389009952545, 'label': 'Vietnamese'}, {'score': 0.13126447796821594, 'label': 'Arabic'}, {'score': 0.050019942224025726, 'label': 'Korean'}, {'score': 0.04981285333633423, 'label': 'Spanish'}, {'score': 0.03635850176215172, 'label': 'Mandarin'}]
For irish1 model results:  [{'score': 0.5186117887496948, 'label': 'Hindi'}, {'score': 0.21003329753875732, 'label': 'Vietnamese'}, {'score': 0.1306159347295761, 'label': 'Arabic'}, {'score': 0.05273653194308281, 'label': 'Spanish'}, {'score': 0.05042556673288345, 'label': 'Korean'}, {'score': 0.0375768207013607, 'label': 'Mandarin'}]
For indian1 model results:  [{'score': 0.5400822162628174, 'label': 'Hindi'}, {'score': 0.1788196563720703, 'label': 'Arabic'}, {'score': 0.14255128800868988, 'label': 'Vietnamese'}, {'score': 0.053038764744997025, 'label': 'Spanish'}, {'score': 0.04660993069410324, 'label': 'Korean'}, {'score': 0.038898080587387085, '
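
A quick way to summarize outputs like these is to tally the top-scoring label per clip. The sketch below does this for the three clips whose output is visible above, using only their top two scores, rounded to four decimals. `top_prediction` is a hypothetical helper.

```python
from collections import Counter

# Top two entries per clip, rounded from the printed pipeline output above
results = {
    "american1": [{"score": 0.5480, "label": "Hindi"}, {"score": 0.1846, "label": "Vietnamese"}],
    "irish1": [{"score": 0.5186, "label": "Hindi"}, {"score": 0.2100, "label": "Vietnamese"}],
    "indian1": [{"score": 0.5401, "label": "Hindi"}, {"score": 0.1788, "label": "Arabic"}],
}


def top_prediction(result):
    """Return (label, score) for the highest-scoring entry (hypothetical helper)."""
    best = max(result, key=lambda entry: entry["score"])
    return best["label"], best["score"]


# Count how often each label is the top prediction across clips
top_labels = Counter(top_prediction(r)[0] for r in results.values())
print(top_labels)
```

For these three clips every top prediction is Hindi regardless of the speaker's actual accent, which is consistent with the limitation that the model does not generalize beyond the dataset's speakers.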