# Accent Classification Project: Data Prep Part 2
This file is one of three in an Accent Classification Project.

**Purpose:**

The goal of this project is to build an accent classifier for people who learned English as a second language. It fine-tunes a speech recognition model to classify the accents of 24 English speakers whose first language is Hindi, Korean, Arabic, Vietnamese, Spanish, or Mandarin.

**Data source**
https://psi.engr.tamu.edu/l2-arctic-corpus/

The L2-Arctic dataset is delivered by email and includes approximately 24-30 hours of recordings in which 24 speakers read passages in English. The first languages of the speakers are Arabic, Hindi, Korean, Mandarin, Spanish, and Vietnamese. There are 2 women and 2 men in each language group.

The original dataset is around 8 GB and contains 27,000 rows of data, each with a 3-4 s audio file at a 48 kHz sampling rate.

**Summary of this file**
This file merges the reformatted L2-Arctic Hugging Face datasets from Data Prep Part 1 into one dataset. It then converts the labels to numeric class IDs and handles padding, truncation, and the attention mask for the DistilHuBERT model using its AutoFeatureExtractor.

**Result**
The final dataset has 1737 rows, each with a ~30 s audio clip at 16,000 Hz, and is ready for training DistilHuBERT.

**Environment**
It runs best on a Mac CPU, which was faster than Google Colab's CPU or GPU.
Note: even when the code is rewritten to process files in bulk on a GPU, a Mac CPU is still much faster. Splitting the dataset into smaller pieces and then re-merging them works around the memory problems.
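
The split-then-merge workaround can be sketched without any library. The helper below computes contiguous row ranges that together cover a dataset, so each range could be processed on its own (for instance with `Dataset.select(range(start, stop))`) and the pieces re-merged with `concatenate_datasets`. The name `chunk_bounds` and the choice of 8 chunks are assumptions for illustration, not part of the original notebook.

```python
def chunk_bounds(n_rows, n_chunks):
    """Return (start, stop) pairs that split n_rows into n_chunks contiguous pieces."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    base, extra = divmod(n_rows, n_chunks)
    bounds = []
    start = 0
    for i in range(n_chunks):
        # The first `extra` chunks each take one additional row
        stop = start + base + (1 if i < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


# 1737 is the row count of the merged dataset; 8 chunks is an arbitrary choice
bounds = chunk_bounds(1737, 8)
print(bounds[0], bounds[-1])
```

Because the ranges are contiguous and non-overlapping, re-merging the processed pieces in order should reproduce the original row order.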

**Foundation Model**

[DistilHuBERT](https://huggingface.co/ntu-spml/distilhubert) is a smaller, distilled version of HuBERT, a speech model whose masked-prediction pretraining was inspired by BERT. HuBERT has an encoder-only architecture and is commonly fine-tuned with CTC for speech recognition. For this project, a classification layer was added.

In [None]:
# Check to make sure Jupyter Notebook can access my external drive where the data is saved
!ls /Volumes

Data         LaCie        Macintosh HD


In [2]:
from datasets import Dataset
from datasets import load_from_disk


In [None]:
# Define the parent directory
parent_dir = "/Volumes/LaCie/l2-arctic-data/"

In [None]:
# Load the individual Hugging Face datasets created in part 1 of data preparation
arabic_data = load_from_disk(parent_dir+"Arabic")
mandarin_data = load_from_disk(parent_dir+"Mandarin")
hindi_data = load_from_disk(parent_dir+"Hindi")
korean_data = load_from_disk(parent_dir+"Korean")
spanish_data = load_from_disk(parent_dir+"Spanish")
vietnamese_data = load_from_disk(parent_dir+"Vietnamese")

In [13]:
# check datasets
# arabic_data
# len(arabic_data[0]['audio'])
# mandarin_data
# hindi_data
# korean_data
# spanish_data
# vietnamese_data
# len(vietnamese_data[len(vietnamese_data)-1]['audio'])
# vietnamese_data[0]

In [14]:
# stack datasets
from datasets import concatenate_datasets
data = concatenate_datasets([arabic_data, mandarin_data, hindi_data, korean_data, spanish_data, vietnamese_data])
print(data)

Dataset({
    features: ['label', 'audio'],
    num_rows: 1737
})


## Update labels

In [None]:
# Convert the label column from strings to integer class IDs (the names are kept as metadata)
data = data.class_encode_column('label')

In [None]:
# Get the method that maps integer label IDs back to human-readable names
id2label_fn = data.features["label"].int2str

In [None]:
# Check label now, should be numeric
data[0]["label"]

0

In [None]:
# check label on one of the rows
id2label_fn(data[0]["label"])

'Arabic'
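
For intuition, the mapping built by `class_encode_column` can be imitated in plain Python. As far as the `datasets` implementation goes, the distinct label strings are sorted and numbered from 0, which is consistent with 'Arabic' mapping to 0 above. This sketch assumes the six language names are the label strings; `str2int` and `int2str` here are plain dictionaries named after the `ClassLabel` methods, not the library objects themselves.

```python
# Assumed label strings: the six first languages named in this project
languages = ["Hindi", "Korean", "Arabic", "Vietnamese", "Spanish", "Mandarin"]

# Sort the distinct names, then number them from 0
class_names = sorted(set(languages))
str2int = {name: idx for idx, name in enumerate(class_names)}
int2str = {idx: name for name, idx in str2int.items()}

print(str2int)
print(int2str[0])
```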

In [None]:
# count the number of each label
from collections import Counter

In [None]:
def count_labels(dataset):
    """Print how many rows carry each label, plus totals as a sanity check."""
    label_counts = Counter(dataset["label"])
    for label, count in sorted(label_counts.items()):
        print(f"Label {id2label_fn(label)}: {count} occurrences")
    print(f"length of dataset: {len(dataset)}")
    print(f"number of labeled rows (should match length of dataset): {sum(label_counts.values())}")

In [None]:
count_labels(data)

## Use AutoFeatureExtractor from model to prepare dataset with truncation/attention mask

In [None]:
# Import AutoFeatureExtractor so we can format the data the way
# DistilHuBERT expects
from transformers import AutoFeatureExtractor

In [None]:
# Choose pretrained model DistilHuBERT which is a smaller version of HuBERT
# Alternatively could try full HuBERT or Wav2Vec2 but these will take longer to train
# HuBERT and Wav2Vec2 models take in raw audio, not spectrograms
# https://huggingface.co/ntu-spml/distilhubert
model_id = "ntu-spml/distilhubert"

In [None]:
feature_extractor = AutoFeatureExtractor.from_pretrained(
    model_id, do_normalize=True, return_attention_mask=True
)
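
`do_normalize=True` asks the extractor to rescale each clip to zero mean and unit variance. A library-free sketch of that operation follows, under the assumption that a small epsilon (here `1e-7`) guards against division by zero on silent clips. `zero_mean_unit_var` is a hypothetical helper name, not the extractor's API.

```python
import math


def zero_mean_unit_var(samples, eps=1e-7):
    """Rescale a list of audio samples to zero mean and (approximately) unit variance."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((s - mean) ** 2 for s in samples) / n  # population variance
    scale = math.sqrt(var + eps)  # eps is an assumed guard against division by zero
    return [(s - mean) / scale for s in samples]


normalized = zero_mean_unit_var([0.1, 0.3, -0.2, 0.0])
print(normalized)
```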

In [None]:
# Cap each audio clip at 30 seconds; anything longer is truncated to this length
MAX_DURATION = 30.0

In [None]:
# Define a function that applies the feature_extractor to the data
def preprocess_function(examples):
    # Collect the raw audio samples from the "audio" column. The column holds the
    # signal directly, so there is no nested "array" field to unpack.
    audio_arrays = [x for x in examples["audio"]]
    # Run the feature extractor on the audio:
    # - sampling_rate tells it the audio already matches the rate it expects
    # - max_length is given in samples (sampling rate * duration in seconds)
    # - truncation=True cuts anything longer than max_length
    # - return_attention_mask=True also returns the attention mask
    inputs = feature_extractor(
        audio_arrays,
        sampling_rate=feature_extractor.sampling_rate,
        max_length=int(feature_extractor.sampling_rate * MAX_DURATION),
        truncation=True,
        return_attention_mask=True,
    )
    return inputs

In [None]:
# Apply the preprocessing function to every row of the dataset using map.
# The resulting attention_mask is a binary 0/1 mask that indicates where the
# audio input has been padded.
data_encoded = data.map(
    preprocess_function,  # the preprocess_function defined above
    batched=False,
    num_proc=1,
)
data_encoded
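
The relationship between `max_length`, truncation, padding, and the attention mask can be illustrated in plain Python. This is only a sketch of the idea, not the extractor's implementation: clips longer than the limit are cut, shorter ones are zero-padded, and the mask marks real samples with 1 and padding with 0. `fit_to_length` is a hypothetical helper, and the 16,000 Hz rate is taken from the Result section above.

```python
SAMPLING_RATE = 16_000  # Hz, the rate stated for the final dataset
MAX_DURATION = 30.0     # seconds
MAX_LENGTH = int(SAMPLING_RATE * MAX_DURATION)  # 480,000 samples


def fit_to_length(samples, max_length=MAX_LENGTH, pad_value=0.0):
    """Truncate or zero-pad samples to max_length and build a matching 0/1 mask."""
    clipped = list(samples[:max_length])  # truncation
    n_real = len(clipped)
    n_pad = max_length - n_real
    values = clipped + [pad_value] * n_pad  # padding
    mask = [1] * n_real + [0] * n_pad       # 1 = real audio, 0 = padding
    return values, mask


# Tiny example with max_length=5 so the result is easy to read
values, mask = fit_to_length([0.5, -0.5, 0.25], max_length=5)
print(values)  # [0.5, -0.5, 0.25, 0.0, 0.0]
print(mask)    # [1, 1, 1, 0, 0]
```

The mask lets the model ignore padded positions, so clips of different original lengths can share one fixed input size.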

## Save data to disk

In [None]:
# Save the Hugging Face dataset back to disk
# (this writes `data`, the label-encoded dataset; `data_encoded` is not saved here)
data.save_to_disk("/Volumes/LaCie/l2-arctic-data/arctic_data_formatted")

Saving the dataset (7/7 shards): 100%|█| 1737/1737 [01:50<00:00, 15.75 examples/
