# Extract data from a Common Voice dataset by accent 

The purpose of this notebook is to extract all the Australian-accented speech samples from Common Voice English so that we can use the data to fine-tune a Whisper model on Australian-accented speech. 

The extraction covers _not just_ speech samples labelled with the accent `Australian English` but _also_ samples with additional contributor-specified accent descriptors such as `Sydney` or `Australian Eastern Seaboard`.



## Import libraries 

In [32]:
import pandas as pd 
import matplotlib as mpl 
import numpy as np
import tqdm
import re

## Import the Common Voice `.tsv` file into a `pandas` dataframe

Here, we extract data from the TSV file, and use `pandas` to perform manipulations on the dataset, such as removing rows that do not contain accent metadata.

In [21]:
# we only want to read from the validated data, not the invalidated
# change this to where you have downloaded the CV dataset
# the datasets can be downloaded from Mozilla Data Collective 
# https://datacollective.mozillafoundation.org/datasets?q=common+voice
filePath = '/media/kathyreid/Elements/cv-datasets/cv-corpus-23.0-2025-09-05/en/validated.tsv'

In [22]:
# put it into a DataFrame 
df = pd.read_csv(filePath, sep='\t', low_memory=False)

In [23]:
df.columns

Index(['client_id', 'path', 'sentence_id', 'sentence', 'sentence_domain',
       'up_votes', 'down_votes', 'age', 'gender', 'accents', 'variant',
       'locale', 'segment'],
      dtype='object')

In [24]:
# rows that have accent metadata 
len(df[df['accents'].notna()])

1026637

In [25]:
# what percentage of all rows have accent metadata
print(len(df[df['accents'].notna()]) / (len(df)) * 100)

55.14276659970566


In [26]:
# remove all the rows where accents are not given (NaN)
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html
# DataFrame.dropna(*, axis=0, how=_NoDefault.no_default, thresh=_NoDefault.no_default, subset=None, inplace=False)

df.dropna(axis='index', how='any', subset='accents', inplace=True)
len(df)

# this matches the above figure for rows that have accent metadata, so it's a good cross-check

1026637

In [27]:
# number of unique contributors to the dataset 
len(df['client_id'].unique())

18734

## Filter the dataset for accents that are Australian or indicate an Australian accent

In [30]:
all_accents = df['accents']

In [34]:
# The original accent entries are given comma-delimited, e.g. `Non native speaker, German English`
# We want to turn this into a list of specific accents, then we can take only the unique values 

accents_list = []

for accent_string in all_accents: 
    
    # this regex is from https://stackoverflow.com/questions/26633452/how-to-split-by-commas-that-are-not-within-parentheses
    accent_list = re.split(r',\s*(?![^()]*\))', accent_string)
    # build a new list, because we don't want to modify a list we're iterating over
    processed_accent_list = []

    for accent in accent_list: 
    
        # Trim any whitespace off the elements, because this makes matching on strings harder later on
        # Strings are immutable in Python, so strip() returns a new string
        stripped_accent = accent.strip() 
        
        # Check for any empty strings and skip them - likely regex artefacts
        if not stripped_accent: 
            continue
            
        # Check for any non-ASCII characters that we may want to investigate 
        # For example, if one of the accents is garbage or deliberate rubbish
        
        if not stripped_accent.isascii(): 
            print('flagging that accent: ', stripped_accent, ' is not ASCII encoded, may be in another language')
            
        processed_accent_list.append(stripped_accent)
            
    accents_list.append(processed_accent_list)

flagging that accent:  Выраженный украинский акцент  is not ASCII encoded, may be in another language
flagging that accent:  Выраженный украинский акцент  is not ASCII encoded, may be in another language
flagging that accent:  Выраженный украинский акцент  is not ASCII encoded, may be in another language
flagging that accent:  Русский  is not ASCII encoded, may be in another language
flagging that accent:  Русский  is not ASCII encoded, may be in another language
flagging that accent:  Русский  is not ASCII encoded, may be in another language
flagging that accent:  Русский  is not ASCII encoded, may be in another language
flagging that accent:  Немного русский акцент произношения  is not ASCII encoded, may be in another language
flagging that accent:  съедание слов  is not ASCII encoded, may be in another language
flagging that accent:  Немного русский акцент произношения  is not ASCII encoded, may be in another language
flagging that accent:  съедание слов  is not ASCII encoded, may b
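
To make the behaviour of the splitting regex easier to see, here is a minimal sketch that applies it to two strings. The helper name `split_accents` is hypothetical and not part of the notebook, and the second string is an invented entry, used only to show that a comma inside parentheses is not treated as a delimiter.

```python
import re

# Split on commas, but not on commas that sit inside parentheses
SPLIT_PATTERN = re.compile(r',\s*(?![^()]*\))')

def split_accents(accent_string):
    """Return the stripped, non-empty accent labels in a comma-delimited string."""
    parts = SPLIT_PATTERN.split(accent_string)
    return [part.strip() for part in parts if part.strip()]

# First string: the example from the comment in the cell above
print(split_accents('Non native speaker, German English'))

# Second string: an invented entry with a comma inside parentheses
print(split_accents('Australian English (Sydney, NSW), Transatlantic'))
```

The negative lookahead rejects any comma that is followed by a closing parenthesis with no other parenthesis in between, so `(Sydney, NSW)` stays in one piece.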

At this point, `accents_list` is a list of lists:

```
['Irish English']
['England English']
['Danish English', 'Transatlantic']
```
We want to flatten it into a single list, then convert that list into a set so that only unique elements remain. 

In [41]:
flattened = [item for sublist in accents_list for item in sublist]

In [42]:
print(len(flattened))

1118507


In [43]:
# turn into a set 
unique_accents = list(set(flattened))

In [44]:
len(unique_accents)

831

In [45]:
print(unique_accents)

['France English', 'Выраженный украинский акцент', 'Hunglish', 'Shropshire', 'Latin American English', 'Kenyan', 'bulgarian english', '30 year old adult male', 'England West County', 'Latin English', 'Delaware Valley', 'Turkish English', 'German English', 'twinge of Yorkshire', 'just practicing', 'Spanish influenced', 'iowa', 'Whybare you asking', 'Deep South US', 'Brazilian', 'Kannadiga', 'european 2nd language english', 'WA', 'Colombia', 'country', 'mid-west with speach and mind problems from MS', 'Korean', 'london', 'Pittsburgh PA', 'French', 'but with better than typical diction.', 'Cool', 'Northumbrian British English', 'Problema alla corte vocali', 'try to maintain originality', 'Midwestern United States - Chicago', 'I am German and speak English as learned at school', 'Third Culture Kid / South American English', 'vietnamese accent', 'very subtle german/non native accent', 'Virginia', 'Mid Atlantic United States - Philadelphia Suburbs', 'Dutch English', 'Midwest US... With some 

## Figure out which of the accents above are Australian

This is a manual exercise for now. The entries that mention Australia, or a place in Australia, are: 

* `Australian English`
* `General Australian`
* `lived with my grandfather for a time who has a mix between the australian and an english accent`
* `Hybrid Indian/Australian/American`
* `but spent a lot of time in australia and the US`
* `Hybrid United States and Australian`
* `South Australia`
* `Educated Australian Accent`
* `Sydney - middle eastern seaboard Australian`
* `Queenslandish`
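
The manual review could be narrowed with a case-insensitive keyword search over the unique accent labels. The sketch below is one possible approach, not the method used in this notebook: the keyword stems and the function name `shortlist_australian` are assumptions, and the sample list is a handful of labels from this notebook standing in for `unique_accents`. The shortlist still needs a human check, because a match such as `Hybrid United States and Australian` may not be wanted.

```python
# Hypothetical keyword stems; extend the tuple as new descriptors turn up
AUSTRALIAN_KEYWORDS = ('austral', 'aussie', 'sydney', 'melbourne', 'queensland')

def shortlist_australian(accents, keywords=AUSTRALIAN_KEYWORDS):
    """Return the accent labels that contain any keyword, ignoring case."""
    return sorted(
        accent for accent in accents
        if any(keyword in accent.lower() for keyword in keywords)
    )

# A small sample of labels from this notebook, standing in for unique_accents
sample_accents = [
    'Australian English', 'German English', 'Queenslandish',
    'South Australia', 'Hybrid United States and Australian', 'Kenyan',
]

for accent in shortlist_australian(sample_accents):
    print(accent)
```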

For now we will only select rows with: 

* `Australian English`
* `General Australian`
* `South Australia`
* `Educated Australian Accent`
* `Sydney - middle eastern seaboard Australian`
* `Queenslandish`

This keeps the selection as close to "General Australian" as possible. 
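
The selection rule can be sketched as a membership test: a row qualifies if any of its accent labels is one of the six selected above. This is an illustrative sketch on toy rows shaped like `accents_list`; the names `SELECTED_ACCENTS` and `has_selected_accent` are hypothetical.

```python
# The six labels selected above
SELECTED_ACCENTS = {
    'Australian English',
    'General Australian',
    'South Australia',
    'Educated Australian Accent',
    'Sydney - middle eastern seaboard Australian',
    'Queenslandish',
}

def has_selected_accent(accent_labels, selected=SELECTED_ACCENTS):
    """True if any label in a row's accent list is one of the selected accents."""
    return not selected.isdisjoint(accent_labels)

# Toy rows in the same shape as accents_list (one list of labels per row)
toy_rows = [
    ['Irish English'],
    ['Australian English', 'Transatlantic'],
    ['Queenslandish'],
]

# One boolean per row: True where the row carries a selected accent
mask = [has_selected_accent(row) for row in toy_rows]
print(mask)
```

Because `accents_list` holds one entry per row of `df`, a mask built this way should be usable as a boolean index on the dataframe, provided the two are still the same length and in the same order.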

## Filter the dataframe by the Australian accents 