For many speech classification tasks, transcribing the audio and then classifying the text (audio -> text -> text classification) is a valid workflow that avoids much of the awkwardness of working with audio. In some cases, however, salient information is lost during transcription. One such case is when we want to study a particular manner of speaking.
In this project, I trained a wav2vec2 model to classify Spanish speakers by where they are from, based on their accents. The data covers 5 locales of Latin American Spanish, plus Basque, which is included to investigate how much easier language discrimination is than dialect discrimination.
The high-quality Spanish audio clips were introduced in a 2020 paper hosted on the ACL Anthology, available here.
For each locale, the dataset contains a variety of speakers with different speaker profiles. This is preferable to datasets with one or very few speakers, because the model is more likely to learn identifiers of dialects rather than individual voices or prosodies.
The Basque data comes from a similar campaign described here, which focused on minority Western European languages rather than Latin American Spanish.
For this task, I used wav2vec2 as the base model and fine-tuned it. The model is introduced here.
A quick rundown of the model architecture (image from the original paper).
At a high level, the model applies a CNN to the normalized waveform to extract features. A transformer network then learns a contextualized representation from those features. The model also learns a discretized representation that helps it identify distinct speech units. During pretraining, the discretized representations serve as the targets that the contextualized representations are trained to identify.
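To make the feature-extraction step more concrete, here is a minimal sketch of the waveform normalization and of how many feature vectors the CNN produces for a clip. The kernel widths and strides are the ones reported in the wav2vec 2.0 paper; the function names, the `eps` value, and the assumption of 16 kHz audio are mine and are not taken from this project's code.

```python
import math

# Kernel widths and strides of the seven convolutional layers in the
# wav2vec 2.0 feature encoder, as reported in the original paper.
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def normalize_waveform(samples, eps=1e-7):
    """Scale a waveform to zero mean and (approximately) unit variance.

    `eps` is an assumed small constant that avoids division by zero on
    silent clips.
    """
    mean = sum(samples) / len(samples)
    variance = sum((x - mean) ** 2 for x in samples) / len(samples)
    scale = math.sqrt(variance + eps)
    return [(x - mean) / scale for x in samples]


def num_feature_frames(num_samples):
    """Number of feature vectors the CNN emits for a clip of `num_samples`."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        # Output length of an unpadded 1-D convolution.
        length = (length - kernel) // stride + 1
    return length


if __name__ == "__main__":
    one_second = 16_000  # samples in one second of 16 kHz audio (assumed rate)
    print(num_feature_frames(one_second))  # 49
```

Under these assumptions, one second of audio becomes 49 feature vectors, roughly one every 20 ms, and that sequence is what the transformer contextualizes.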
The training process itself was relatively lightweight: it took only about half an hour, and the model started to overfit within that time.
Accuracy = 0.980
Even when splitting on speakers, the model achieves excellent accuracy on the test set. This is interesting because it indicates that accent classification, at least at this granularity, is an easier task than voice identification, which could just as easily have satisfied the training objective.
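For readers unfamiliar with splitting on speakers, the idea is that every clip from a given speaker lands entirely in either the training set or the test set. The sketch below is one way to do that, not necessarily how this project did it: the `(speaker_id, clip_path, label)` tuple layout, the function name, and the 20% test fraction are all hypothetical.

```python
import random
from collections import defaultdict


def split_by_speaker(clips, test_fraction=0.2, seed=0):
    """Split clips so that no speaker appears in both train and test.

    `clips` is a list of (speaker_id, clip_path, label) tuples. This layout
    is hypothetical and only serves the illustration.
    """
    # Group clips by the speaker who recorded them.
    by_speaker = defaultdict(list)
    for clip in clips:
        by_speaker[clip[0]].append(clip)

    # Shuffle speakers (not clips) with a fixed seed for reproducibility.
    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)

    # Hold out a fraction of the speakers, at least one.
    n_test = max(1, round(len(speakers) * test_fraction))
    test_speakers = set(speakers[:n_test])

    train = [c for s in speakers if s not in test_speakers for c in by_speaker[s]]
    test = [c for s in speakers if s in test_speakers for c in by_speaker[s]]
    return train, test
```

This version does not balance labels across the two sets; with several locales, one option is to apply it once per locale and concatenate the results. scikit-learn's `GroupShuffleSplit` offers a similar group-aware split.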
The confusion matrix shows that Basque is the most easily distinguished, which is expected since it is the only language that isn't Spanish. Puerto Rican was the hardest to identify in the test set, but I think this has more to do with PR having the least data than with anything about the accent itself.
I think that if a dataset of the same size were used for the same experiment, but with more speakers (and so less fitting on individual voices), we could expect near-perfect accuracy.
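As a reminder of how a confusion matrix relates to overall accuracy and to per-class difficulty, here is a small self-contained sketch. The class names and the toy labels are invented for illustration and are not this project's results.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Rows are true classes, columns are predicted classes."""
    index = {name: i for i, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for truth, pred in zip(true_labels, predicted_labels):
        matrix[index[truth]][index[pred]] += 1
    return matrix


def accuracy(matrix):
    """Fraction of all clips on the diagonal (correctly classified)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def per_class_recall(matrix):
    """For each true class, the fraction of its clips classified correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(matrix)]


if __name__ == "__main__":
    # Made-up labels: "locale_a" and "locale_b" are placeholder names.
    classes = ["basque", "locale_a", "locale_b"]
    truth = ["basque", "basque", "locale_a", "locale_a", "locale_b", "locale_b"]
    preds = ["basque", "basque", "locale_a", "locale_b", "locale_b", "locale_b"]
    matrix = confusion_matrix(truth, preds, classes)
    print(matrix)                    # [[2, 0, 0], [0, 1, 1], [0, 0, 2]]
    print(per_class_recall(matrix))  # [1.0, 0.5, 1.0]
```

Per-class recall is the number to look at when asking which class is hardest to identify, since overall accuracy can hide a weak class that has few test clips.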
Crowdsourcing Latin American Spanish for Low-Resource Text-to-Speech (Guevara-Rukoz et al., LREC 2020)
Open-Source High Quality Speech Datasets for Basque, Catalan and Galician (Kjartansson et al., SLTU 2020)
wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations (Baevski et al., Facebook AI 2020)
Huggingface Audio Classification Tutorial
Towards Data Science Blog about wav2vec2.0 This blog post offers some easy reading for understanding the wav2vec2.0 model, as well as how it relates to and improves upon earlier models.
Classifying Accents from Spectograms This blog post describes a computer-vision approach to classifying accents by using spectrograms rather than raw waveforms.
What's a Language, Anyway? Outside the machine-learning realm, this article from The Atlantic discusses the nuances of classifying languages, dialects, accents, etc. These nuances bear on what's achievable with a dialect identifier like the one I trained. Sidenote: the author, John McWhorter, has a great podcast about linguistics!