# Extract data from a Common Voice dataset by accent 

The purpose of this notebook is to extract all the Australian-accented speech samples from Common Voice English so that we can use the data to fine-tune a Whisper model on Australian-accented speech. 

The extraction covers _not just_ speech samples with the accent `Australian English` but _also_ samples with contributor-specified accent descriptors such as `Sydney` or `Australian Eastern Seaboard`. 



## Import libraries 

In [24]:
import pandas as pd 
import matplotlib as mpl 
import numpy as np
from tqdm import tqdm
import re
import subprocess
import os
import shutil
import librosa

## Import the Common Voice `.tsv` file into a `pandas` dataframe

Here, we extract data from the TSV file, and use `pandas` to perform manipulations on the dataset, such as removing rows that do not contain accent metadata.

In [3]:
# we only want to read from the validated data, not the invalidated
# change this to where you have downloaded the CV dataset
# the datasets can be downloaded from Mozilla Data Collective 
# https://datacollective.mozillafoundation.org/datasets?q=common+voice
filePath = '/media/kathyreid/Elements/cv-datasets/cv-corpus-23.0-2025-09-05/en/validated.tsv'

In [4]:
# put it into a DataFrame 
df = pd.read_csv(filePath, sep='\t', low_memory=False)

In [5]:
df.columns

Index(['client_id', 'path', 'sentence_id', 'sentence', 'sentence_domain',
       'up_votes', 'down_votes', 'age', 'gender', 'accents', 'variant',
       'locale', 'segment'],
      dtype='object')

In [6]:
# rows that have accent metadata 
len(df[df['accents'].notna()])

1026637

In [7]:
# what proportion of rows have accent metadata compared with those that don't 
print(len(df[df['accents'].notna()]) / (len(df)) * 100)

55.14276659970566
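As a cross-check on the figure above, the same proportion can be computed from the mean of a boolean mask. This is a minimal sketch on an invented four-row TSV (the rows and the name `toy_tsv` are assumptions for illustration, not Common Voice data).

```python
import io
import pandas as pd

# Toy TSV standing in for validated.tsv; the rows are invented for illustration.
toy_tsv = (
    "client_id\tpath\taccents\n"
    "a1\tclip_1.mp3\tAustralian English\n"
    "a2\tclip_2.mp3\t\n"
    "a3\tclip_3.mp3\tEngland English\n"
    "a4\tclip_4.mp3\t\n"
)
toy_df = pd.read_csv(io.StringIO(toy_tsv), sep="\t")

# notna() gives a boolean Series; its mean is the fraction of True values.
pct_with_accent = toy_df["accents"].notna().mean() * 100
print(pct_with_accent)
```

Empty fields are read as `NaN` by `pd.read_csv`, which is why the two rows without an accent are counted as missing.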


In [8]:
# remove all the rows where accents are not given (NaN)
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html
# DataFrame.dropna(*, axis=0, how=_NoDefault.no_default, thresh=_NoDefault.no_default, subset=None, inplace=False)

df.dropna(axis='index', how='any', subset='accents', inplace=True)
len(df)

# this matches the above figure for rows that have accent metadata, so it's a good cross-check

1026637

In [9]:
# number of unique contributors to the dataset 
len(df['client_id'].unique())

18734

## Filter the dataset for accents that are Australian or that indicate an Australian accent 

In [10]:
all_accents = df['accents']

In [11]:
# The original accent entries are given comma-delimited, e.g. `Non native speaker, German English`
# We want to turn this into a list of specific accents, then we can take only the unique values 

accents_list = []

for accent_string in all_accents: 
    
    # this regex is from https://stackoverflow.com/questions/26633452/how-to-split-by-commas-that-are-not-within-parentheses
    raw_accent_list = re.split(r',\s*(?![^()]*\))', accent_string)

    # build a new list rather than modifying the one we're iterating over
    processed_accent_list = []

    for accent in raw_accent_list: 
    
        # Trim any whitespace off the elements, because this makes matching on strings harder later on
        # Strings are immutable in Python, so strip() returns a new string
        stripped_accent = accent.strip() 
        
        # Skip any empty strings - likely regex artefacts
        if not stripped_accent: 
            continue
            
        # Check for any non-ASCII characters that we may want to investigate 
        # For example, if one of the accents is garbage or deliberate rubbish
        if not stripped_accent.isascii(): 
            print('flagging that accent: ', stripped_accent, ' is not ASCII encoded, may be in another language')

        processed_accent_list.append(stripped_accent)
            
    accents_list.append(processed_accent_list)

flagging that accent:  Выраженный украинский акцент  is not ASCII encoded, may be in another language
flagging that accent:  Выраженный украинский акцент  is not ASCII encoded, may be in another language
flagging that accent:  Выраженный украинский акцент  is not ASCII encoded, may be in another language
flagging that accent:  Русский  is not ASCII encoded, may be in another language
flagging that accent:  Русский  is not ASCII encoded, may be in another language
flagging that accent:  Русский  is not ASCII encoded, may be in another language
flagging that accent:  Русский  is not ASCII encoded, may be in another language
flagging that accent:  Немного русский акцент произношения  is not ASCII encoded, may be in another language
flagging that accent:  съедание слов  is not ASCII encoded, may be in another language
flagging that accent:  Немного русский акцент произношения  is not ASCII encoded, may be in another language
flagging that accent:  съедание слов  is not ASCII encoded, may b
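To see what the split pattern does with commas inside parentheses, here is a small sketch on a made-up accent string. The sample string is an assumption for illustration; only the regex is taken from the cell above.

```python
import re

# Same pattern as in the cell above: split on commas that are not inside parentheses.
SPLIT_PATTERN = r',\s*(?![^()]*\))'

# Invented example string with a comma inside parentheses and stray whitespace.
sample = "Non native speaker, German English (Bavaria, rural),  Transatlantic "

# Strip each piece and drop any piece that is empty after stripping.
parts = [p.strip() for p in re.split(SPLIT_PATTERN, sample) if p.strip()]
print(parts)
```

The comma inside `(Bavaria, rural)` is left alone, so the parenthesised descriptor stays in one piece.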

At this point, `accents_list` is a list of lists:

```
['Irish English']
['England English']
['Danish English', 'Transatlantic']
```
We want to flatten it into a single list, then convert that list into a set so that each accent appears only once. 

In [12]:
flattened = [item for sublist in accents_list for item in sublist]

In [13]:
print(len(flattened))

1118507


In [14]:
# turn into a set 
unique_accents = list(set(flattened))

In [15]:
len(unique_accents)

831

In [16]:
#print(unique_accents)

## Figure out which of the accents above are Australian

This is a manual exercise for now. The following descriptors mention or imply an Australian accent: 

* `Australian English`
* `General Australian`
* `lived with my grandfather for a time who has a mix between the australian and an english accent`
* `Hybrid Indian/Australian/American`
* `but spent a lot of time in australia and the US`
* `Hybrid United States and Australian`
* `South Australia`
* `Educated Australian Accent`
* `Sydney - middle eastern seaboard Australian`
* `Queenslandish`

For now we will only select rows with: 

* `Australian English`
* `General Australian`
* `South Australia`
* `Educated Australian Accent`
* `Sydney - middle eastern seaboard Australian`
* `Queenslandish`

to make it as "General Australian" as possible 
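The manual review could be narrowed down first with a keyword shortlist. The sketch below is one possible approach, not the method used in this notebook; the keyword list and the helper name `shortlist_accents` are hypothetical.

```python
# Hypothetical keyword list; extend it with other place names or demonyms as needed.
KEYWORDS = ("austral", "aussie", "sydney", "queensland", "melbourne")

def shortlist_accents(accents, keywords=KEYWORDS):
    """Return the accent descriptors containing any keyword, case-insensitively."""
    return sorted(a for a in accents if any(k in a.lower() for k in keywords))

# A few descriptors from the list above plus two non-Australian ones for contrast.
sample_accents = [
    "Australian English",
    "Queenslandish",
    "but spent a lot of time in australia and the US",
    "Irish English",
    "England English",
]
candidates = shortlist_accents(sample_accents)
print(candidates)
```

A shortlist like this still needs a human pass, because descriptors such as the hybrid accents match the keywords but may not be wanted in the final selection.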

## Filter the dataframe by the Australian accents 

In [17]:
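# These are matched as substrings, so 'Sydney' and 'Queensland' also cover
# 'Sydney - middle eastern seaboard Australian' and 'Queenslandish'.
# Note that 'Educated Australian Accent' on its own is not matched by any of these patterns.
accents_to_match = ['Australian English', 'General Australian', 'South Australia', 'Sydney', 'Queensland']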

In [18]:
# Using regex with the '|' (OR) operator
filtered_df = df[df['accents'].str.contains('|'.join(accents_to_match), na=False)]

In [19]:
len(filtered_df)
# 55757 rows

55757
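Because `str.contains` matches anywhere in the cell, it is worth checking which descriptors the joined pattern actually picks up. The sketch below runs the same patterns over a toy DataFrame built from descriptors listed earlier; the DataFrame itself is an assumption for illustration.

```python
import pandas as pd

patterns = ['Australian English', 'General Australian', 'South Australia', 'Sydney', 'Queensland']

toy = pd.DataFrame({'accents': [
    'Australian English',
    'Sydney - middle eastern seaboard Australian',
    'Queenslandish',
    'Hybrid United States and Australian',
    'Educated Australian Accent',
    'England English',
]})

# str.contains treats the joined string as a regex and matches anywhere in the cell.
mask = toy['accents'].str.contains('|'.join(patterns), na=False)
matched = toy.loc[mask, 'accents'].tolist()
print(matched)
```

In this toy data the hybrid descriptor and `Educated Australian Accent` are not selected, because neither contains any of the five patterns as a substring.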

## Identify how many unique contributors there are 

We check this so that, for example, the 55k samples do not all come from a single speaker.

In [20]:
len(filtered_df['client_id'].unique())

834

In [21]:
#filtered_df['client_id'].value_counts()
# this shows that there are about ten key contributors who have thousands of samples
# but not one who has say 20k samples
# so this should be OK for fine-tuning. 

# This does have implications for how much data ends up in train and test 
#    as we don't want to train on the same speakers we're testing on
#    - this is called "data leakage" - or "learning to the test"
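One way to avoid that leakage is to split by speaker rather than by clip. The following is a minimal sketch using only the standard library; the function name `split_by_speaker`, the 20% test fraction and the seed are assumptions, and the rows are synthetic.

```python
import random

def split_by_speaker(rows, test_fraction=0.2, seed=42):
    """Split rows into train/test so that no client_id appears in both."""
    speakers = sorted({row["client_id"] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(speakers)
    # Hold out a fraction of the speakers (at least one) for testing.
    n_test = max(1, round(len(speakers) * test_fraction))
    test_speakers = set(speakers[:n_test])
    train = [r for r in rows if r["client_id"] not in test_speakers]
    test = [r for r in rows if r["client_id"] in test_speakers]
    return train, test

# Synthetic rows: 5 speakers with 4 clips each.
rows = [{"client_id": f"speaker_{i % 5}", "path": f"clip_{i}.mp3"} for i in range(20)]
train_rows, test_rows = split_by_speaker(rows)
train_ids = {r["client_id"] for r in train_rows}
test_ids = {r["client_id"] for r in test_rows}
print(len(train_rows), len(test_rows), train_ids & test_ids)
```

The fraction applies to speakers, not clips, so with a few very prolific contributors the clip-level split can end up far from 80/20.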

## Add a `duration_ms` column and calculate audio duration

In [30]:
# filtered_df is a slice of df, so take an explicit copy before adding a column;
# otherwise pandas raises a SettingWithCopyWarning
filtered_df = filtered_df.copy()
filtered_df['duration_ms'] = None

In [31]:
filtered_df.columns

Index(['client_id', 'path', 'sentence_id', 'sentence', 'sentence_domain',
       'up_votes', 'down_votes', 'age', 'gender', 'accents', 'variant',
       'locale', 'segment', 'duration_ms'],
      dtype='object')

In [32]:
filtered_df_copy = filtered_df.copy(deep=True)

In [35]:
audio_path = '/media/kathyreid/Elements/cv-datasets/cv-corpus-23.0-2025-09-05/en/clips/'

for index, row in tqdm(filtered_df_copy.iterrows(), total=len(filtered_df_copy), desc="Calculating audio duration..."):
    file_path = os.path.join(audio_path, row['path'])
    # librosa.get_duration returns seconds, so convert to milliseconds to match the column name
    duration_ms = librosa.get_duration(path=file_path) * 1000
    # insert into dataframe 
    filtered_df.loc[index, 'duration_ms'] = duration_ms

Calculating audio duration...: 100%|█████| 55757/55757 [02:05<00:00, 444.31it/s]
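The duration step can also be written without `iterrows`, building the column in one pass. This is a sketch under assumptions: the helper `add_duration_ms` is hypothetical, and a stand-in function replaces `librosa.get_duration` so that it runs without any audio files.

```python
import os
import pandas as pd

def add_duration_ms(df, clips_dir, duration_fn):
    """Return a copy of df with a duration_ms column (duration_fn returns seconds)."""
    out = df.copy()
    out["duration_ms"] = [
        duration_fn(os.path.join(clips_dir, p)) * 1000 for p in out["path"]
    ]
    return out

# Stand-in for librosa.get_duration so the sketch runs without audio files.
def fake_duration_seconds(file_path):
    return 2.5

toy = pd.DataFrame({"path": ["clip_1.mp3", "clip_2.mp3"]})
with_durations = add_duration_ms(toy, "clips", fake_duration_seconds)
print(with_durations["duration_ms"].tolist())
```

With real clips, the stand-in would be replaced by something like `lambda p: librosa.get_duration(path=p)`, which returns seconds.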


## Copy the matching audios across

I intend to use the [Mozilla.AI blueprint](https://mozilla-ai.github.io/speech-to-text-finetune/customization/) authored by [Kostis Saitas Zarkias](https://github.com/Kostis-S-Z). The format used in that blueprint is: 

```
datasets/
├── my_dataset/
│   ├── dataset.csv
│   └── audio_files/
│       ├── audio_1.wav
│       ├── audio_2.wav
│       ├── audio_3.wav
│       └── ...
```

and I want to match the format here. 
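The directory layout can be created from Python instead of the command line. The sketch below uses `pathlib` and a temporary directory in place of the real `datasets/` folder; the helper name `make_dataset_dirs` is hypothetical.

```python
import tempfile
from pathlib import Path

def make_dataset_dirs(root, dataset_name):
    """Create <root>/<dataset_name>/audio_files and return that path."""
    audio_dir = Path(root) / dataset_name / "audio_files"
    # parents=True creates intermediate folders; exist_ok=True makes reruns safe.
    audio_dir.mkdir(parents=True, exist_ok=True)
    return audio_dir

# A temporary directory stands in for the real datasets/ folder.
with tempfile.TemporaryDirectory() as tmp:
    audio_dir = make_dataset_dirs(Path(tmp) / "datasets", "commonvoice-v23_en-AU")
    created = audio_dir.is_dir()
    relative = audio_dir.relative_to(tmp).as_posix()

print(created, relative)
```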

In [43]:
# define the file path to save the audio files
audio_fp_src = '/media/kathyreid/Elements/cv-datasets/cv-corpus-23.0-2025-09-05/en/clips/'
audio_fp_destination = 'datasets/commonvoice-v23_en-AU/audio_files/'

In [44]:
# iterate over the dataframe and output the audio files to match the structure above.
# make sure the directory structure exists first  - easier to do this on the command line

for index, row in tqdm(filtered_df.iterrows(), total=len(filtered_df), desc="Copying audios across to new dataset"):

    source_file = audio_fp_src + row['path']
    destination_file = audio_fp_destination + row['path']

    try:
        shutil.copyfile(source_file, destination_file)
        # we don't want to send too much info to the screen,
        # so only report rows whose DataFrame index label is a multiple of 1000
        if index % 1000 == 0: 
            print(f"File copied from {source_file} to {destination_file}")
    except shutil.SameFileError:
        print("Source and destination represent the same file.")
    except IsADirectoryError:
        print("Destination is a directory. Use shutil.copy() instead if destination is a folder.")
    except FileNotFoundError:
        print("Source file not found.")

Copying audios across to new dataset:   1%| | 299/55757 [00:02<04:43, 195.56it/s

File copied from /media/kathyreid/Elements/cv-datasets/cv-corpus-23.0-2025-09-05/en/clips/common_voice_en_38162284.mp3 to datasets/commonvoice-v23_en-AU/audio_files/common_voice_en_38162284.mp3

...

Copying audios across to new dataset: 100%|█| 55757/55757 [01:26<00:00, 646.12it


In [46]:
# check how many files are in the destination dir - should match the # of rows in the dataframe 
count = 0
with os.scandir(audio_fp_destination) as entries:
    for entry in entries:
        if entry.is_file():
            count += 1

print(count)

55757
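A matching count does not say which files are missing if the numbers ever differ. As a complementary check, the sketch below compares expected file names against the directory contents; the helper `find_missing` and the file names are hypothetical, and a temporary directory stands in for the destination folder.

```python
import tempfile
from pathlib import Path

def find_missing(expected_names, directory):
    """Return the expected file names that are not present in directory."""
    present = {entry.name for entry in Path(directory).iterdir() if entry.is_file()}
    return sorted(set(expected_names) - present)

with tempfile.TemporaryDirectory() as tmp:
    # Create two of the three expected files.
    for name in ("clip_1.mp3", "clip_2.mp3"):
        (Path(tmp) / name).touch()
    missing = find_missing(["clip_1.mp3", "clip_2.mp3", "clip_3.mp3"], tmp)

print(missing)
```

In the notebook, the expected names would come from the `path` column of the filtered DataFrame.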


## Export the dataframe to dataset.csv

In [36]:
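# the blueprint layout above names this file dataset.csv; rename the output if the blueprint requires that exact name
fp = 'datasets/commonvoice-v23_en-AU/commonvoice-v23_en-AU.csv'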
filtered_df.to_csv(fp)
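By default `to_csv` also writes the DataFrame index as an unnamed first column, which after filtering holds the original row labels. Whether the blueprint tolerates that extra column is not confirmed here, so the sketch below only shows the difference that `index=False` makes, on an invented two-row DataFrame.

```python
import io
import pandas as pd

toy = pd.DataFrame(
    {"path": ["clip_1.mp3", "clip_2.mp3"], "sentence": ["Hello there.", "Good morning."]},
    index=[17, 42],  # non-contiguous labels, as left behind by filtering
)

# Write to in-memory buffers so nothing touches the disk.
with_index = io.StringIO()
toy.to_csv(with_index)
without_index = io.StringIO()
toy.to_csv(without_index, index=False)

header_with = with_index.getvalue().splitlines()[0]
header_without = without_index.getvalue().splitlines()[0]
print(header_with)
print(header_without)
```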