# Extract data from a Common Voice dataset by accent 

The purpose of this notebook is to extract all the Australian-accented speech samples from Common Voice English so that we can use the data to fine-tune a Whisper model on Australian-accented speech. 

The extraction covers _not just_ speech samples labelled with the accent `Australian English`, but also samples with _additional_ contributor-specified accent descriptors such as `Sydney` or `Australian Eastern Seaboard`.



## Import libraries 

In [10]:
import pandas as pd 
import matplotlib as mpl 
import numpy as np
from tqdm import tqdm
import re
import subprocess
import os
import shutil
import librosa

## Import the Common Voice `.tsv` file into a `pandas` dataframe

Here, we extract data from the TSV file, and use `pandas` to perform manipulations on the dataset, such as removing rows that do not contain accent metadata.

In [11]:
# we only want to read from the validated data, not the invalidated
# change this to where you have downloaded the CV dataset
# the datasets can be downloaded from Mozilla Data Collective 
# https://datacollective.mozillafoundation.org/datasets?q=common+voice
filePath = '/media/kathyreid/Elements/cv-datasets/cv-corpus-24.0-2025-12-05/en/validated.tsv'

In [12]:
# put it into a DataFrame 
df = pd.read_csv(filePath, sep='\t', low_memory=False)

In [13]:
df.columns

Index(['client_id', 'path', 'sentence_id', 'sentence', 'sentence_domain',
       'up_votes', 'down_votes', 'age', 'gender', 'accents', 'variant',
       'locale', 'segment'],
      dtype='object')

In [14]:
# rows that have accent metadata 
# 1026637 in v23 on CV English
# 1031295 in v24 on CV English
len(df[df['accents'].notna()])

1031295

In [15]:
# what proportion of rows have accent metadata compared with those that don't 
# 55.14276659970566 in v23 on CV English
# 55.16890711753501 in v24 on CV English

print(len(df[df['accents'].notna()]) / (len(df)) * 100)

55.16890711753501
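
As a cross-check on the percentage above, the same proportion can be read off the boolean mask directly, because the mean of a boolean Series is the fraction of `True` values. This is a minimal sketch on a made-up four-row frame (the `toy` name and its rows are assumptions for illustration, not part of the dataset):

```python
import pandas as pd

# Hypothetical toy frame: two rows with accent metadata, two without
toy = pd.DataFrame({'accents': ['Australian English', None, 'Irish English', None]})

# notna() gives a boolean mask; its mean is the share of rows with metadata
share_with_accent = toy['accents'].notna().mean() * 100
print(share_with_accent)
```

On the real dataframe the equivalent expression would be `df['accents'].notna().mean() * 100`.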


In [16]:
# remove all the rows where accents are not given (NaN)
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html
# DataFrame.dropna(*, axis=0, how=_NoDefault.no_default, thresh=_NoDefault.no_default, subset=None, inplace=False)

df.dropna(axis='index', how='any', subset='accents', inplace=True)
len(df)

# this matches the above figure for rows that have accent metadata, so it's a good cross-check

1031295

In [17]:
# number of unique contributors to the dataset
# 18734 in v23 on CV English
# 18953 in v24 on CV English
len(df['client_id'].unique())

18953

## Find the accent descriptors that are Australian, or that indicate an Australian accent

In [18]:
all_accents = df['accents']

In [19]:
# The original accent entries are comma-delimited, e.g. `Non native speaker, German English`
# We want to turn each entry into a list of specific accents, then we can take only the unique values 

accents_list = []

for accent_string in all_accents: 
    
    # this regex is from https://stackoverflow.com/questions/26633452/how-to-split-by-commas-that-are-not-within-parentheses
    accent_list = re.split(r',\s*(?![^()]*\))', accent_string)

    # build a new list, because we don't want to modify a list we're iterating over
    # (a plain assignment would only create a second name for the same list)
    processed_accent_list = []

    for accent in accent_list: 
    
        # Trim any whitespace off the elements, because this makes matching on strings harder later on
        # Strings are immutable in Python, so strip() returns a new string
        stripped_accent = accent.strip() 
        
        # Skip any empty strings - likely regex artefacts
        if not stripped_accent: 
            continue

        processed_accent_list.append(stripped_accent)
            
        # Check for any non-ASCII characters that we may want to investigate 
        # For example, if one of the accents is garbage or deliberate rubbish
        if not accent.isascii(): 
            print('flagging that accent: ', accent, ' is not ASCII encoded, may be in another language')
            
    accents_list.append(processed_accent_list)

flagging that accent:  Выраженный украинский акцент  is not ASCII encoded, may be in another language
flagging that accent:  Выраженный украинский акцент  is not ASCII encoded, may be in another language
flagging that accent:  Выраженный украинский акцент  is not ASCII encoded, may be in another language
flagging that accent:  Русский  is not ASCII encoded, may be in another language
flagging that accent:  Русский  is not ASCII encoded, may be in another language
flagging that accent:  Русский  is not ASCII encoded, may be in another language
flagging that accent:  Русский  is not ASCII encoded, may be in another language
flagging that accent:  Немного русский акцент произношения  is not ASCII encoded, may be in another language
flagging that accent:  съедание слов  is not ASCII encoded, may be in another language
flagging that accent:  Немного русский акцент произношения  is not ASCII encoded, may be in another language
flagging that accent:  съедание слов  is not ASCII encoded, may b

At this point, `accents_list` is a list of lists:

```
['Irish English']
['England English']
['Danish English', 'Transatlantic']
```
We want to flatten it into a single list of accents, then turn that into a set so that only the unique accents remain. 
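
To make the split, flatten and de-duplicate steps concrete, here is a small self-contained sketch on made-up accent strings. It reuses the same regex as the cell above; the helper name `split_accents` and the sample rows are assumptions for illustration only.

```python
import re
from itertools import chain

# Same pattern as above: split on commas that are not inside parentheses
SPLIT_PATTERN = re.compile(r',\s*(?![^()]*\))')

def split_accents(accent_string):
    """Return the stripped, non-empty accent descriptors in one metadata string."""
    parts = SPLIT_PATTERN.split(accent_string)
    return [part.strip() for part in parts if part.strip()]

# Hypothetical sample rows, not taken from the dataset
sample_rows = [
    'Irish English',
    'Danish English, Transatlantic',
    'Australian English, Educated (private school, Sydney)',
    'Irish English',
]

nested = [split_accents(row) for row in sample_rows]    # list of lists
flattened_sample = list(chain.from_iterable(nested))    # one flat list
unique_sample = sorted(set(flattened_sample))           # unique descriptors

print(nested)
print(unique_sample)
```

Note that the comma inside the parentheses is left alone, so `Educated (private school, Sydney)` stays a single descriptor.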

In [20]:
flattened = [item for sublist in accents_list for item in sublist]

In [21]:
print(len(flattened))
# 1118507 in v23 English
# 1124409 in v24 English

1124409


In [22]:
# turn into a set 
unique_accents = list(set(flattened))

In [23]:
len(unique_accents)
# 831 in v23 English
# 859 in v24 English

859

In [24]:
# uncomment this line to examine the unique accent list
# print(unique_accents)

## Figure out which of the accents above are Australian

This is a manual exercise for now. The descriptors that mention Australia or an Australian place are: 

* `Australian English`
* `General Australian`
* `lived with my grandfather for a time who has a mix between the australian and an english accent`
* `Hybrid Indian/Australian/American`
* `but spent a lot of time in australia and the US`
* `Hybrid United States and Australian`
* `South Australia`
* `Educated Australian Accent`
* `Sydney - middle eastern seaboard Australian`
* `Queenslandish`

For now we will only select rows with: 

* `Australian English`
* `General Australian`
* `South Australia`
* `Educated Australian Accent`
* `Sydney - middle eastern seaboard Australian`
* `Queenslandish`

to keep the selection as close to "General Australian" as possible. 

## Filter the dataframe by the Australian accents 

In [25]:
accents_to_match = ['Australian English', 'General Australian', 'South Australia', 'Sydney', 'Queensland']

In [26]:
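# Using regex with the '|' (OR) operator
# str.contains matches substrings, so 'Queensland' matches 'Queenslandish'
# and 'Sydney' matches 'spent time in Sydney'
# Note: none of these substrings matches 'Educated Australian Accent' on its own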
filtered_df = df[df['accents'].str.contains('|'.join(accents_to_match), na=False)]
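
Substring matching is deliberately loose. If a stricter selection were wanted, one alternative would be to split each accent string and keep a row only when one of its descriptors is exactly in the chosen list. The sketch below is not what this notebook does; `WANTED` and `has_wanted_accent` are hypothetical names introduced for illustration, and row counts from this approach would differ from the figures reported here.

```python
import re

# Same split pattern used earlier in the notebook
SPLIT_PATTERN = re.compile(r',\s*(?![^()]*\))')

# The six descriptors chosen in the manual review above
WANTED = {
    'Australian English',
    'General Australian',
    'South Australia',
    'Educated Australian Accent',
    'Sydney - middle eastern seaboard Australian',
    'Queenslandish',
}

def has_wanted_accent(accent_string, wanted=WANTED):
    """True if any descriptor in the string is exactly one of the wanted accents."""
    descriptors = {part.strip() for part in SPLIT_PATTERN.split(accent_string)}
    return not descriptors.isdisjoint(wanted)

print(has_wanted_accent('Australian English, Queenslandish'))
print(has_wanted_accent('Educated Australian Accent'))
print(has_wanted_accent('England English, spent time in Sydney'))
```

Applied to the dataframe, this could look like `df[df['accents'].apply(has_wanted_accent)]`.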

In [27]:
len(filtered_df)
# 55757 rows in v23 English 
# 55673 rows in v24 English -> not sure why the reduction, it could be that people have had their data deleted

55673

## Identify how many unique contributors there are 

We do this to check that, for example, the 55k samples don't all come from a single speaker.

In [28]:
len(filtered_df['client_id'].unique())
# 834 in v23 English 
# 842 in v24 English

842

In [29]:
#filtered_df['client_id'].value_counts()
# this shows that there are about ten key contributors who have thousands of samples
# but not one who has say 20k samples
# so this should be OK for fine-tuning. 

# This does have implications for how much data ends up into train and test 
#    as we don't want to train on the same speakers we're testing on
#    - this is called "data leakage" - or "learning to the test"
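
One way to keep speakers from appearing in both train and test is to assign every clip to a split based on its `client_id` rather than on the clip itself. The sketch below is an assumption about how that could be done, using a hash of the contributor ID so the assignment is deterministic; `assign_split`, the `test_fraction` default and the sample clips are all hypothetical.

```python
import hashlib

def assign_split(client_id, test_fraction=0.1):
    """Map a contributor ID to 'train' or 'test', always the same for that ID."""
    digest = hashlib.sha256(client_id.encode('utf-8')).hexdigest()
    # Use the first 8 hex digits as a number in [0, 1)
    bucket = int(digest[:8], 16) / 0x100000000
    return 'test' if bucket < test_fraction else 'train'

# Hypothetical (client_id, path) pairs, not taken from the dataset
sample_clips = [
    ('speaker-1', 'a.mp3'),
    ('speaker-2', 'b.mp3'),
    ('speaker-1', 'c.mp3'),
]

speakers_by_split = {'train': set(), 'test': set()}
for client_id, path in sample_clips:
    speakers_by_split[assign_split(client_id)].add(client_id)

train_speakers = speakers_by_split['train']
test_speakers = speakers_by_split['test']

# No speaker can end up in both splits
print(train_speakers.isdisjoint(test_speakers))
```

This splits by speaker, not by clip count, so with a few contributors holding thousands of samples each, the share of clips in the test split can drift well away from `test_fraction`.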

## Add a `duration_ms` column and calculate audio duration

In [30]:
filtered_df['duration_ms'] = None

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  filtered_df['duration_ms'] = None


In [31]:
# check the extra column has been added
filtered_df.columns

Index(['client_id', 'path', 'sentence_id', 'sentence', 'sentence_domain',
       'up_votes', 'down_votes', 'age', 'gender', 'accents', 'variant',
       'locale', 'segment', 'duration_ms'],
      dtype='object')

In [32]:
filtered_df_copy = filtered_df.copy(deep=True)

In [33]:
audio_path = r"/media/kathyreid/Elements/cv-datasets/cv-corpus-24.0-2025-12-05/en/clips/"

for index, row in tqdm(filtered_df_copy.iterrows(), total=len(filtered_df_copy), desc="Calculating audio duration..."):
    file_path = audio_path + row['path']
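    # NOTE: librosa.get_duration() returns the duration in seconds,
    # so despite the column name, the values stored in `duration_ms` are seconds
    duration = librosa.get_duration(path=file_path)
    # insert into dataframe 
    filtered_df.loc[index, 'duration_ms'] = duration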

Calculating audio duration...: 100%|█████| 55673/55673 [08:20<00:00, 111.19it/s]
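
Because `librosa.get_duration()` reports seconds, it helps to be explicit about units when totalling clip lengths. The helper below is a small sketch with made-up durations; `summarise_durations` is a hypothetical name, not part of librosa or pandas.

```python
def summarise_durations(durations_seconds):
    """Total a list of clip durations given in seconds, in several units."""
    total_seconds = sum(durations_seconds)
    return {
        'total_ms': total_seconds * 1000,      # seconds -> milliseconds
        'total_minutes': total_seconds / 60,   # seconds -> minutes
        'total_hours': total_seconds / 3600,   # seconds -> hours
    }

# Hypothetical durations in seconds (30, 20 and 10 minutes)
summary = summarise_durations([1800.0, 1200.0, 600.0])
print(summary)
```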


## Copy the matching audios across

I intend to use the [Mozilla.AI blueprint](https://mozilla-ai.github.io/speech-to-text-finetune/customization/) authored by [Kostis Saitas Zarkias](https://github.com/Kostis-S-Z). The format used in that blueprint is: 

```
datasets/
├── my_dataset/
│   ├── dataset.csv
│   └── audio_files/
│       ├── audio_1.wav
│       ├── audio_2.wav
│       ├── audio_3.wav
│       └── ...
```

and I want to match the format here. 
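
The directory layout can also be created from Python rather than on the command line. This is a minimal sketch using `pathlib`; the helper name `make_dataset_dirs` is an assumption, and the demo runs in a temporary directory so it does not touch any real dataset folder.

```python
from pathlib import Path
import tempfile

def make_dataset_dirs(root, dataset_name):
    """Create <root>/datasets/<dataset_name>/audio_files and return its path."""
    audio_dir = Path(root) / 'datasets' / dataset_name / 'audio_files'
    # parents=True creates intermediate folders; exist_ok=True makes re-runs safe
    audio_dir.mkdir(parents=True, exist_ok=True)
    return audio_dir

# Demo in a throwaway directory
with tempfile.TemporaryDirectory() as tmp_root:
    audio_dir = make_dataset_dirs(tmp_root, 'commonvoice-v24_en-AU')
    created = audio_dir.is_dir()

print(created)
```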

In [34]:
# define the file path to save the audio files
# use a raw string because I'm using an external drive
# make sure path has a trailing slash
audio_fp_src = r"/media/kathyreid/Elements/cv-datasets/cv-corpus-24.0-2025-12-05/en/clips/"
audio_fp_destination = 'datasets/commonvoice-v24_en-AU/audio_files/'

In [35]:
# iterate over the dataframe and output the audio files to match the structure above.
# make sure the directory structure exists first  - easier to do this on the command line

for index, row in tqdm(filtered_df.iterrows(), total=len(filtered_df), desc="Copying audios across to new dataset"):

    source_file = audio_fp_src + row['path']    
    destination_file = audio_fp_destination + row['path']

    try:
        shutil.copyfile(source_file, destination_file)
        # we don't want to send too much info to screen
        if index%1000 == 0: 
            print(f"File copied from {source_file} to {destination_file}")
    except shutil.SameFileError:
        print("Source and destination represent the same file.")
    except IsADirectoryError:
        print("Destination is a directory. Use shutil.copy() instead if destination is a folder.")
    except FileNotFoundError:
        print("Source file not found.")

Copying audios across to new dataset:   0%|  | 16/55673 [00:00<54:31, 17.01it/s]

File copied from /media/kathyreid/Elements/cv-datasets/cv-corpus-24.0-2025-12-05/en/clips/common_voice_en_44507167.mp3 to datasets/commonvoice-v24_en-AU/audio_files/common_voice_en_44507167.mp3


Copying audios across to new dataset: 100%|█| 55673/55673 [07:57<00:00, 116.62it


In [36]:
# check how many files are in the destination dir - should match the # of rows in the dataframe 
count = 0
with os.scandir(audio_fp_destination) as entries:
    for entry in entries:
        if entry.is_file():
            count += 1

print(count)

55673
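
A matching count is a good sign, but if the numbers ever differ it is useful to know which files are missing. The sketch below compares a list of expected file names against a directory; `find_missing_files` is a hypothetical helper, and the demo uses a temporary directory with made-up file names.

```python
import os
import tempfile

def find_missing_files(expected_names, directory):
    """Return the expected file names that are not present in the directory."""
    with os.scandir(directory) as entries:
        present = {entry.name for entry in entries if entry.is_file()}
    return sorted(set(expected_names) - present)

# Demo: one of two expected files exists
with tempfile.TemporaryDirectory() as tmp_dir:
    open(os.path.join(tmp_dir, 'clip_a.mp3'), 'wb').close()
    missing = find_missing_files(['clip_a.mp3', 'clip_b.mp3'], tmp_dir)

print(missing)
```

With the real data, the expected names would be `filtered_df['path']` and the directory would be `audio_fp_destination`.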


## Calculate how many hours of samples we have

In [37]:
# librosa.get_duration() returns seconds, so `duration_ms` holds seconds despite its name
round(filtered_df['duration_ms'].sum() / 60 / 60, 2) # convert from seconds to hours

77.93

In [38]:
round(filtered_df['duration_ms'].sum() / 60, 2)  # convert from seconds to minutes

4675.65

## Export the dataframe to dataset.csv

In [28]:
fp = 'datasets/commonvoice-v24_en-AU/commonvoice-v24_en-AU.csv'
filtered_df.to_csv(fp)