# Accent Classification Project: Data Prep Part 1
This file is 2 of 3 in an Accent Classification Project

**Purpose:**

The goal of this project is to build an accent classifier for people who learned English as a second language. A speech recognition model is fine-tuned to classify the accents of 24 English speakers whose first language is Hindi, Korean, Arabic, Vietnamese, Spanish, or Mandarin.

**Data source**

https://psi.engr.tamu.edu/l2-arctic-corpus/

The L2-Arctic dataset is delivered by email and includes approximately 24-30 hours of recordings in which 24 speakers read passages in English. The first languages of the speakers are Arabic, Hindi, Korean, Mandarin, Spanish, and Vietnamese. There are two women and two men in each language group.

The original dataset is around 8 GB and contains about 27,000 rows, each an audio file of 3-4 s. The files checked in this notebook are sampled at 44.1 kHz.

**Summary of this file**

This file reformats the original L2-Arctic data for the DistilHuBERT model by splitting it into six smaller pieces, one for each language group. The number of files per speaker is limited to 560, which uses approximately half of the original data, so each piece is about 0.66 GB with 2,240 rows.
For each language group, the WAV files are loaded and resampled to 16,000 Hz, and rows are then combined so that each audio clip is up to 30 s long, the clip length used for the DistilHuBERT model in this project. This reduces the number of rows to about 300 in each language group. The reformatted data is then wrapped in the Hugging Face `Dataset` class (the most memory-intensive step) and saved to disk.
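
The row counts in this summary can be sanity-checked with a little arithmetic. The sketch below assumes every clip has the same length, which is a simplification, and `estimate_merged_rows` is a hypothetical helper that is not part of the notebook.

```python
import math

SPEAKERS_PER_GROUP = 4
FILES_PER_SPEAKER = 560
MAX_DURATION = 30.0  # seconds per merged row


def estimate_merged_rows(num_clips, clip_seconds, max_seconds=MAX_DURATION):
    """Estimate the row count after greedy merging, assuming equal-length clips."""
    # Only whole clips fit in a row, so round the clips-per-row down
    clips_per_row = int(max_seconds // clip_seconds)
    return math.ceil(num_clips / clips_per_row)


rows_per_group = SPEAKERS_PER_GROUP * FILES_PER_SPEAKER
print(rows_per_group)                             # 2240
print(estimate_merged_rows(rows_per_group, 3.0))  # 224
print(estimate_merged_rows(rows_per_group, 4.0))  # 320
```

With clips of 3 to 4 seconds the estimate falls between 224 and 320 merged rows, which is consistent with the roughly 300 rows per language group reported here.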

**Result**

Six Hugging Face datasets (one for each language group), each with about 300 rows. Each row contains the label for the language group and an audio clip of 30 seconds or less at 16 kHz.

**Environment**

It runs best on a Mac CPU, which was faster than Google Colab's CPU or GPU. Even when the code is rewritten to process files in bulk on a GPU, the Mac CPU is still much faster, and splitting the dataset into smaller pieces avoids memory problems. Run the file on one language group, save to disk, then shut down and restart the Jupyter Notebook server to clear working memory before starting the next language group.
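
Because each language group is processed in a separate session, it can help to check which groups have already been saved before restarting. The following is a minimal sketch, assuming each group is saved to a folder named after its language under one output root (as the final cell of this notebook does); `pending_languages` is a hypothetical helper.

```python
import os

LANGUAGES = ["Arabic", "Hindi", "Korean", "Mandarin", "Spanish", "Vietnamese"]


def pending_languages(output_root, languages=LANGUAGES):
    """Return the language groups that have no saved dataset folder yet."""
    return [
        lang for lang in languages
        if not os.path.isdir(os.path.join(output_root, lang))
    ]


# Example call (path taken from the save cell of this notebook):
# pending_languages("/Volumes/LaCie/l2-arctic-data/")
```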

**Foundation Model**

[DistilHuBERT](https://huggingface.co/ntu-spml/distilhubert) is a smaller, distilled version of HuBERT. HuBERT is a self-supervised speech model with an encoder-only architecture that applies BERT-style masked prediction to audio and is typically fine-tuned for speech recognition with a CTC head. For this project, a classification layer was added.

In [None]:
# Check to make sure Jupyter Notebook can access my external drive where the data is saved
!ls /Volumes

Data         LaCie        Macintosh HD


In [2]:
#!pip install datasets

In [3]:
import os
import glob
import librosa
import pandas as pd
import numpy as np
import torch
import torchaudio
from datasets import Dataset



In [None]:
# Define the parent directory
parent_dir = "/Volumes/LaCie/l2-arctic-data/arctic/"

In [5]:
# Iterate over the wav files in each speaker folder and build rows in the format {'file_path': 'drive...wav', 'label': 'Vietnamese'}
# Limit the number of files per speaker; there are about 1,122 files per speaker in total, so 560 is about half
num_files_per_speaker = 560

### Change language here ###
language = 'Vietnamese'

data = []
# speakers are defined here: https://psi.engr.tamu.edu/l2-arctic-corpus/
arabic_speakers = ['ABA', 'SKA', 'YBAA', 'ZHAA']
mandarin_speakers = ['BWC', 'LXC', 'NCC', 'TXHC']
hindi_speakers = ['ASI', 'RRBI', 'SVBI', 'TNI']
korean_speakers = ['HJK', 'HKK', 'YDCK', 'YKWK']
spanish_speakers = ['EBVS', 'ERMS', 'MBMPS', 'NJS']
vietnamese_speakers = ['HQTV', 'PNV', 'THV', 'TLV']

### Change speakers here ###
for speaker in vietnamese_speakers:
  file_paths = glob.glob(os.path.join(parent_dir, speaker, "wav", "*.wav"))
  #for file_path in file_paths: # use if using whole dataset
  for file_path in file_paths[:num_files_per_speaker]:
    # Named `row` rather than `dict` to avoid shadowing the built-in type
    row = {'file_path': file_path, 'label': language}
    data.append(row)
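
Switching language groups in the cell above means editing both the `language` variable and the speaker list used in the loop. One possible way to keep the two in sync is to key the speaker lists by language. This is a sketch only: `SPEAKERS_BY_LANGUAGE` and `build_rows` are hypothetical names, and the `sorted()` call is an added assumption that makes the selected subset reproducible (the original cell uses whatever order `glob` returns).

```python
import glob
import os

# Speaker IDs as listed in the cell above
SPEAKERS_BY_LANGUAGE = {
    "Arabic": ["ABA", "SKA", "YBAA", "ZHAA"],
    "Mandarin": ["BWC", "LXC", "NCC", "TXHC"],
    "Hindi": ["ASI", "RRBI", "SVBI", "TNI"],
    "Korean": ["HJK", "HKK", "YDCK", "YKWK"],
    "Spanish": ["EBVS", "ERMS", "MBMPS", "NJS"],
    "Vietnamese": ["HQTV", "PNV", "THV", "TLV"],
}


def build_rows(parent_dir, language, max_files_per_speaker=560):
    """Collect {'file_path', 'label'} rows for every speaker of one language."""
    rows = []
    for speaker in SPEAKERS_BY_LANGUAGE[language]:
        pattern = os.path.join(parent_dir, speaker, "wav", "*.wav")
        # Sort so the same files are picked on every run
        for file_path in sorted(glob.glob(pattern))[:max_files_per_speaker]:
            rows.append({"file_path": file_path, "label": language})
    return rows
```

With this layout, only the `language` argument changes between runs, for example `build_rows(parent_dir, "Vietnamese")`.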

In [6]:
# Wrap in the Hugging Face Dataset class
data = Dataset.from_list(data)

In [7]:
data

Dataset({
    features: ['file_path', 'label'],
    num_rows: 2240
})

In [8]:
data[0]

{'file_path': '/Volumes/LaCie/DELETE_Apr2025_l2-arctic-data/arctic/HQTV/wav/arctic_b0126.wav',
 'label': 'Vietnamese'}

In [9]:
# Check to see features
data.features

{'file_path': Value(dtype='string', id=None),
 'label': Value(dtype='string', id=None)}

## Load and resample audio for model

In [10]:
# Choose pretrained model DistilHuBERT which is a smaller version of HuBERT
# Alternatively could try full HuBERT or Wav2Vec2 but these will take longer to train
# HuBERT and Wav2Vec2 models take in raw audio, not spectrograms
# https://huggingface.co/ntu-spml/distilhubert
model_id = "ntu-spml/distilhubert"

In [11]:
# Instantiate the AutoFeatureExtractor for DistilHuBERT so we can format data in way that model expects
from transformers import AutoFeatureExtractor

In [12]:
feature_extractor = AutoFeatureExtractor.from_pretrained(
    model_id, do_normalize=True, return_attention_mask=True
)

In [13]:
# check the sampling rate in the feature_extractor to see what SR the model expects
SR = feature_extractor.sampling_rate
SR

16000

In [14]:
# Check the sr used in dataset
waveform, original_sr = torchaudio.load(data[0]['file_path'], normalize=True)
original_sr

44100
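
The check above reads the sampling rate of the first file only, while the merge step later builds a single resampler from `original_sr`. A quick way to confirm that every file shares that rate is sketched below. It uses the standard-library `wave` module, which assumes the files are plain PCM WAV; `count_sample_rates` is a hypothetical helper.

```python
import wave
from collections import Counter


def count_sample_rates(file_paths):
    """Count how many files use each sampling rate (PCM WAV files only)."""
    rates = Counter()
    for path in file_paths:
        # Only the header is needed, so this is much cheaper than loading audio
        with wave.open(path, "rb") as wav_file:
            rates[wav_file.getframerate()] += 1
    return rates


# Example call on the notebook's dataset:
# count_sample_rates(data["file_path"])
```

If the result contains more than one key, the files were recorded at different rates and a single fixed-rate resampler would not be appropriate for all of them.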

In [15]:
# This project feeds DistilHuBERT audio clips of up to 30 seconds
# The clips in this dataset are only 3-4 seconds long
# Training is more efficient if clips from the same participant are appended to one another
# until they are 30 seconds or slightly less,
# rather than padding a 3-second clip with 27 seconds of silence
MAX_DURATION = 30.0

In [None]:
# Calculate the max number of samples
MAX_SAMPLES=int(SR * MAX_DURATION)
MAX_SAMPLES

480000

In [None]:
# Use for CPU only
# Load, resample, and merge clips into rows of up to 30 s in one pass for efficiency
def download_resample_and_merge_audio(dataset):
    tracker = 0
    grouped_data = []  # Stores the final merged rows
    # Reuse one resampler for files recorded at original_sr (checked per file below)
    resampler = torchaudio.transforms.Resample(orig_freq=original_sr, new_freq=SR)
    new_audio = []  # Samples accumulated for the row being built

    for row in dataset:
        file_path = row["file_path"]
        waveform, sr = torchaudio.load(file_path, normalize=True)
        if sr == original_sr:
            resampled_waveform = resampler(waveform)
        else:
            # Fall back for any file recorded at a different rate
            resampled_waveform = torchaudio.functional.resample(waveform, sr, SR)
        audio_samples = resampled_waveform.numpy()[0]  # first channel only

        # Start a new row if adding this clip would exceed the max limit
        if new_audio and len(new_audio) + len(audio_samples) > MAX_SAMPLES:
            grouped_data.append({"label": language, "audio": new_audio})
            new_audio = []
        new_audio.extend(audio_samples)

        # Print a tracker to watch progress
        tracker += 1
        if tracker % 500 == 0:
            print('completed ' + str(tracker) + ' rows')

    # Save the remaining samples as the last row
    if len(new_audio) > 0:
        print('saving last row with this many samples: ' + str(len(new_audio)))
        grouped_data.append({"label": language, "audio": new_audio})
    return grouped_data
    

data3 = download_resample_and_merge_audio(data)

completed 500 rows
completed 1000 rows
completed 1500 rows
completed 2000 rows
saving last row with this many samples: 405404
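
The merge function above packs clips greedily in file order and does not check whether consecutive clips come from the same speaker, so a row can span two speakers where one speaker's files end and the next begin. The sketch below shows the same greedy packing on sample counts alone, with an extra rule that starts a new row when the speaker changes. That rule is an assumption added for illustration, and `pack_clips` is a hypothetical helper rather than the notebook's method.

```python
def pack_clips(clips, max_samples):
    """Greedily pack (speaker, n_samples) pairs into rows.

    A new row starts when adding a clip would exceed max_samples
    or when the speaker differs from the row being built.
    Returns a list of (speaker, total_samples) pairs, one per row.
    """
    rows = []
    current_speaker, current_total = None, 0
    for speaker, n_samples in clips:
        speaker_changed = current_speaker is not None and speaker != current_speaker
        too_long = current_total + n_samples > max_samples
        if current_total > 0 and (speaker_changed or too_long):
            rows.append((current_speaker, current_total))
            current_total = 0
        current_speaker = speaker
        current_total += n_samples
    # Flush the final, possibly short, row
    if current_total > 0:
        rows.append((current_speaker, current_total))
    return rows
```

Working on sample counts keeps the example small; the same control flow would apply when the rows hold the audio samples themselves.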


In [18]:
# check data format
# data3
# data3[0]
len(data3) # Check length of dataset after rows were combined
# len(data3[0]['audio']) # check length of an individual audio - should be just less than 480,000
# len(data3[len(data3)-1]['audio']) # check the length of the last audio row in the dataset, should be a lot less

307
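
The commented-out checks in the cell above can be gathered into one summary. This is a minimal sketch, assuming each row is a dict with an `audio` list as produced by the merge function; `summarize_rows` is a hypothetical helper, and the defaults repeat the `SR` and `MAX_SAMPLES` values computed earlier.

```python
SR = 16000
MAX_SAMPLES = 480000


def summarize_rows(rows, sr=SR, max_samples=MAX_SAMPLES):
    """Summarize merged rows: count, durations in seconds, and limit check."""
    if not rows:
        raise ValueError("rows is empty")
    lengths = [len(row["audio"]) for row in rows]
    return {
        "num_rows": len(lengths),
        "longest_seconds": max(lengths) / sr,
        "last_row_seconds": lengths[-1] / sr,
        "all_within_limit": all(n <= max_samples for n in lengths),
    }


# Example call on the merged data from this notebook:
# summarize_rows(data3)
```

For reference, the last row reported above has 405,404 samples, which is about 25.3 seconds at 16 kHz.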

In [19]:
# Convert to Hugging Face Dataset
data4 = Dataset.from_list(data3)

In [None]:
# re-check data format
# data4
# data4[0]
len(data4)

307

In [None]:
# save hugging face dataset back to disk
data4.save_to_disk("/Volumes/LaCie/l2-arctic-data/"+language)

Saving the dataset (2/2 shards): 100%|█| 307/307 [00:05<00:00, 53.49 examples/s]
