# Working with accent data in the Mozilla Common Voice dataset 

This Python Jupyter notebook provides some worked examples of how you might explore accent data in the Common Voice dataset. 

## Background information on demographic data in Common Voice 

Before you start working with accent data in Common Voice, there is background information you should know about the data structures in the Common Voice datasets, and how accents have been represented. 

### The ability to choose whether or not to specify demographic information 

Data contributors can contribute voice data to Common Voice with or without logging in to the platform. If a data contributor is not logged in, the utterances they record contain no demographic metadata, such as the gender, age range or accent of the speaker. If the data contributor _does_ log in, then they can choose whether to specify demographic information in their profile. This demographic information can include the accent(s) they speak with. 


Since mid-2021, data contributors to the Common Voice dataset have been able to self-specify descriptors for their accents. 

The purpose of this script is to extract demographic details from a downloaded Mozilla Common Voice (MCV) dataset. 
This informs decisions about, for example, how much of the data in a particular language has demographic details and, if so, what those details are. 
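
As a rough illustration of the question "how much of the data has demographic details", the sketch below computes the share of rows with a non-missing value in each demographic column. The `sample` DataFrame is invented for illustration; only the column names (`client_id`, `age`, `gender`, `accents`) come from the cells later in this notebook.

```python
import pandas as pd

# Hypothetical stand-in for a DataFrame loaded from `validated.tsv`.
# The values are made up; they are not real Common Voice data.
sample = pd.DataFrame({
    'client_id': ['a', 'a', 'b', 'c'],
    'age': ['twenties', 'twenties', None, 'fifties'],
    'gender': ['female', 'female', None, None],
    'accents': ['England English', 'England English', None, None],
})

demographic_columns = ['age', 'gender', 'accents']

# notna() gives True/False per cell, and the mean of booleans
# is the share of True values in each column.
coverage = sample[demographic_columns].notna().mean()
print(coverage)
```

On a real download, the same expression applied to `df` should give the per-column coverage for that language.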

In [2]:
# imports go here 

# io 
import io

# pandas 
import pandas as pd


In [4]:
# specify the path to the TSV file - this should be `validated.tsv` from the MCV download 
#filePath = '/media/kathyreid/Seagate Backup Plus Drive/cv-datasets/ru/validated.tsv'
#filePath = '/media/kathyreid/Seagate Backup Plus Drive/cv-datasets/fr/validated.tsv'
#filePath = '/media/kathyreid/Elements/de/validated.tsv'
#filePath = '/media/kathyreid/Elements/es/validated.tsv'
#filePath = '/media/kathyreid/Elements/en/validated.tsv'
#filePath = '/media/kathyreid/Seagate Backup Plus Drive/cv-datasets/en-v9/validated.tsv'
filePath = '/media/kathyreid/Elements/cv-corpus-11.0-2022-09-21/en/validated.tsv'

# put it into a DataFrame 
df = pd.read_csv(filePath, sep='\t')
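
If memory is a concern with the larger languages, one option is to load only the columns needed for demographic analysis. The sketch below tries this on a tiny in-memory sample via `io.StringIO`, so it runs without a download. The sample rows are invented, and the assumption that a real `validated.tsv` has exactly these column names should be checked against your release.

```python
import io

import pandas as pd

# Tiny made-up sample in a tab-separated layout; not real Common Voice data.
sample_tsv = (
    "client_id\tpath\tsentence\tage\tgender\taccents\n"
    "abc\tclip_1.mp3\tHello there\ttwenties\tfemale\tEngland English\n"
    "def\tclip_2.mp3\tGood morning\t\t\t\n"
)

# usecols keeps only the columns used for demographic analysis.
# dtype=str stops pandas from guessing a type for each column.
sample_df = pd.read_csv(
    io.StringIO(sample_tsv),
    sep='\t',
    usecols=['client_id', 'age', 'gender', 'accents'],
    dtype=str,
)

# Empty fields are read as NaN, which is how missing demographics appear.
print(sample_df)
```

For a real file, `io.StringIO(sample_tsv)` would be replaced by `filePath`.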

In [9]:
# summary data 
# note: `df.value_counts` without parentheses only returns the bound method,
# so print the DataFrame itself to get a summary view
print(df)

                                                 client_id  \
0        000abb3006b78ea4c1144e55d9d158f05a9db011016051...   
1        0013037a1d45cc33460806cc3f8ecee9d536c45639ba4c...   
2        0014c5a3e5715a54855257779b89c2bb498d470b225866...   
3        001509f4624a7dee75247f6a8b642c4a0d09f8be3eeea6...   
4        001519f234e04528a2b36158c205dbe61c8da45ab0242f...   
...                                                    ...   
1556249  372293e65cdab88771e028a4351651ab2eff64438ddafc...   
1556250  372293e65cdab88771e028a4351651ab2eff64438ddafc...   
1556251  372293e65cdab88771e028a4351651ab2eff64438ddafc...   
1556252  372293e65cdab88771e028a4351651ab2eff64438ddafc...   
1556253  372293e65cdab88771e028a4351651ab2eff64438ddafc...   

                                 path  \
0        common_voice_en_27710027.mp3   
1          common_voice_en_699711.mp3   
2        common_voice_en_21953345.mp3   
3        common_voice_en_18132047.mp3   
4        c

In [5]:
# unique contributors 
len(df['client_id'].unique())

72627

In [6]:
# rows that have age metadata (used here as a proxy for rows with demographic metadata) 
len(df[df['age'].notna()])


1040402

In [7]:
# get all the age ranges 
df['age'].unique()

array([nan, 'twenties', 'sixties', 'seventies', 'thirties', 'fourties',
       'teens', 'fifties', 'eighties', 'nineties'], dtype=object)

In [8]:
# age ranges

print('teens: ', len(df.loc[df['age'] == 'teens']))
print('twenties: ', len(df.loc[df['age'] == 'twenties']))
print('thirties: ', len(df.loc[df['age'] == 'thirties']))
print('fourties: ', len(df.loc[df['age'] == 'fourties']))
print('fifties: ', len(df.loc[df['age'] == 'fifties']))
print('sixties: ', len(df.loc[df['age'] == 'sixties']))
print('seventies: ', len(df.loc[df['age'] == 'seventies']))
print('eighties: ', len(df.loc[df['age'] == 'eighties']))
print('nineties: ', len(df.loc[df['age'] == 'nineties']))

print('NaN: ', len(df.loc[df['age'].isna()]))

teens:  94654
twenties:  393510
thirties:  221009
fourties:  156949
fifties:  83005
sixties:  76483
seventies:  12992
eighties:  1691
nineties:  109
NaN:  577475


In [9]:
# get the genders 

df['gender'].unique()

array([nan, 'male', 'female', 'other'], dtype=object)

In [10]:
# genders

print('female: ', len(df.loc[df['gender'] == 'female']))
print('male: ', len(df.loc[df['gender'] == 'male']))
print('other: ', len(df.loc[df['gender'] == 'other']))

print('NaN: ', len(df.loc[df['gender'].isna()]))

female:  257655
male:  754279
other:  30911
NaN:  575032
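
The per-value `print` statements in the age and gender cells above can also be written with `Series.value_counts`, which tallies every distinct value in one call. This is only a sketch on invented data; on the real `df` the numbers should match the printed totals above.

```python
import pandas as pd

# Made-up values for illustration only.
sample = pd.DataFrame({
    'age': ['twenties', 'twenties', 'teens', None],
    'gender': ['female', 'male', 'male', None],
})

for column in ['age', 'gender']:
    # dropna=False keeps missing values as their own entry in the tally,
    # which replaces the separate isna() count.
    counts = sample[column].value_counts(dropna=False)
    print(counts)
    print()
```

Passing `normalize=True` as well returns proportions instead of raw counts.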


In [11]:
# get the accents 

# df['accent'].unique()
# in CV9, this key changed to `accents`
df['accents'].unique()


array([nan, 'England English,United States English', 'Hong Kong English',
       'England English', 'United States English',
       'United States English,wolof', 'Australian English',
       'Latin America,United States English',
       'Southern African (South Africa, Zimbabwe, Namibia)',
       'India and South Asia (India, Pakistan, Sri Lanka)',
       'United States English,Colombia', 'Canadian English',
       'Non native speaker from France', 'Scottish English',
       'New York City', 'Filipino', 'French', 'Argentinian English',
       "A'lo", 'Finnish',
       'United States English,India and South Asia (India, Pakistan, Sri Lanka)',
       'Singaporean English',
       'West Indies and Bermuda (Bahamas, Bermuda, Jamaica, Trinidad)',
       'Hispanic/Latino',
       'India and South Asia (India, Pakistan, Sri Lanka),England English',
       'Non-native', 'Georgian English', 'Non native',
       'United States English,Canadian English,Indo-Canadian English',
       'New Zealand
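
Because the column name changed from `accent` to `accents` in CV9, code that runs across several releases needs to handle both. A minimal sketch follows; `find_accent_column` is a hypothetical helper name, not part of pandas or Common Voice, and the two small frames are invented.

```python
import pandas as pd


def find_accent_column(frame):
    """Return 'accents' or 'accent', whichever the DataFrame has.

    Hypothetical helper for illustration; raises KeyError if neither exists.
    """
    for name in ('accents', 'accent'):
        if name in frame.columns:
            return name
    raise KeyError("no 'accent' or 'accents' column found")


# Made-up frames standing in for a newer and an older release.
newer = pd.DataFrame({'accents': ['England English', None]})
older = pd.DataFrame({'accent': ['england', None]})

print(find_accent_column(newer))
print(find_accent_column(older))
```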

# TODO 

One thing I still need to do here is count only the unique speaker IDs for each accent. A single speaker may have contributed many _utterances_, so if we don't count _unique_ speakers, we get a skewed understanding of the data. 
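
The per-speaker count described above could be sketched as follows, assuming `client_id` identifies a speaker. The data is invented: speaker `a` contributes three utterances, which inflates the utterance count but not the speaker count.

```python
import pandas as pd

# Made-up data: three speakers, six utterances.
sample = pd.DataFrame({
    'client_id': ['a', 'a', 'a', 'b', 'c', 'c'],
    'accents': [
        'England English', 'England English', 'England English',
        'England English',
        'Scottish English', 'Scottish English',
    ],
})

# Counting rows counts utterances per accent string.
utterances_per_accent = sample['accents'].value_counts()

# Counting distinct client_id values counts speakers per accent string.
# groupby drops rows whose accent is missing by default.
speakers_per_accent = (
    sample.groupby('accents')['client_id']
    .nunique()
    .sort_values(ascending=False)
)

print(utterances_per_accent)
print(speakers_per_accent)
```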

In [35]:
# this gives us the unique LISTS
# but because each Speaker can specify MANY accents
# this is not granular enough for our needs. 

# This listing is also not sorted, so it's difficult to see which lists of accents
# occur at higher volumes.

# You can see in the output of this cell that the multiple accents are separated by commas
english_accents = df['accents'].unique()

for accent in english_accents:
    accent_total = len(df.loc[df['accents'] == accent])
    print(accent, ': ', accent_total)

print('NaN: ', len(df.loc[df['accents'].isna()]))



nan :  0
England English,United States English :  86
Hong Kong English :  4318
England English :  134595
United States English :  389397
United States English,wolof :  1
Australian English :  51593
Latin America,United States English :  1
Southern African (South Africa, Zimbabwe, Namibia) :  8485
India and South Asia (India, Pakistan, Sri Lanka) :  101067
United States English,Colombia :  1
Canadian English :  61132
Non native speaker from France :  1
Scottish English :  15820
New York City :  1
Filipino :  5173
French :  16
Argentinian English :  1
A'lo :  1
Finnish :  1
United States English,India and South Asia (India, Pakistan, Sri Lanka) :  6
Singaporean English :  3402
West Indies and Bermuda (Bahamas, Bermuda, Jamaica, Trinidad) :  715
Hispanic/Latino :  1
India and South Asia (India, Pakistan, Sri Lanka),England English :  12
Non-native :  1
Georgian English :  1
Non native :  1
United States English,Canadian English,Indo-Canadian English :  1
New Zealand English :  12381
Malay

England English,New Zealand English,Welsh English,Australian English,United States English,Mixed-Accent English :  67
England English,Northern English :  69
United States English,CARIBBEAN AND BRITISH MIXED WITH SOME NEW YORK ACCENTS :  69
Spanish,Foreign,Non-native :  69
England English,India and South Asia (India, Pakistan, Sri Lanka),Northern English :  72
United States English,Pacific Northwest  :  73
Bangladeshi English :  74
German,south-west German,South German accent,Alemannic German Accent :  74
European,German,Foreign,Non-native :  92
United States English,Southern Californian :  105
East African Khoja :  108
Dutch :  108
England English,Canadian English :  132
United States English,Southwestern United States English :  135
New Zealand English,England English :  145
England English,South London :  148
Northumbrian British English :  163
United States English,Unite States Midwest :  180
England English,Northern England :  188
United States English,southern United States,New Or

In [65]:
# Put the accent totals into a list so that we can order that list by total

english_accents = df['accents'].unique()

english_accents_dict = {}
english_accents_list_by_total = []

for accent in english_accents:
    accent_total = len(df.loc[df['accents'] == accent])
    english_accents_dict[accent] = accent_total

print('\n')
print('breakpoint 1')
print('\n')

# Sorting (accent, total) pairs directly would sort on the first element,
# which is the accent itself, but we want to sort on the number of occurrences.
# So flip each pair to (total, accent) before sorting in descending order.
# The NaN key has a total of 0 here, because NaN never compares equal to itself.
for accent, accent_total in english_accents_dict.items():
    english_accents_list_by_total.append((accent_total, accent))

english_accents_list_by_total.sort(reverse=True)

print('\n')
print('breakpoint 2')
print('\n')

# print the size of the list
print(len(english_accents_list_by_total))

# print the sorted list
print(english_accents_list_by_total)

# rows with no accent information are counted separately
print('NaN: ', len(df.loc[df['accents'].isna()]))




breakpoint 1




breakpoint 2


228
[(389397, 'United States English'), (134595, 'England English'), (101067, 'India and South Asia (India, Pakistan, Sri Lanka)'), (61132, 'Canadian English'), (51593, 'Australian English'), (42001, 'German English,Non native speaker'), (15820, 'Scottish English'), (12381, 'New Zealand English'), (9607, 'Irish English'), (8485, 'Southern African (South Africa, Zimbabwe, Namibia)'), (6897, 'Northern Irish'), (5173, 'Filipino'), (4318, 'Hong Kong English'), (3402, 'Singaporean English'), (2625, 'Liverpool English,Lancashire English,England English'), (2074, 'England English,New Zealand English'), (1810, 'Malaysian English'), (1759, 'Welsh English'), (715, 'West Indies and Bermuda (Bahamas, Bermuda, Jamaica, Trinidad)'), (391, 'southern United States,United States English'), (379, 'United States English,Midwestern,Low,Demure'), (276, 'United States English,Midwestern,Minnesotan'), (233, 'United States English,southern United States,New Orleans dialect'),
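
The ordering step in the cell above can also be done without flipping the pairs. The sketch below shows two alternatives on invented data: `sorted` with a `key` function on the dictionary items, and `Series.value_counts`, which returns counts in descending order by default.

```python
import pandas as pd

# Made-up accent strings for illustration only.
sample = pd.Series(
    [
        'United States English', 'England English', 'United States English',
        None, 'Scottish English', 'England English', 'United States English',
    ],
    name='accents',
)

# Option 1: sort (accent, total) pairs on the total using a key function.
totals = {'England English': 2, 'United States English': 3, 'Scottish English': 1}
by_total = sorted(totals.items(), key=lambda item: item[1], reverse=True)
print(by_total)

# Option 2: let pandas count and sort in one step.
# Missing values are left out unless dropna=False is passed.
counts = sample.value_counts()
print(counts)
```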

In [None]:
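# if a data contributor has specified multiple accents,
# they are given as a single comma-separated string,
# so we need to split out the individual accents to count their occurrences
# (this can also be a starting point for counting co-occurrences of accents in a single speaker)
# Caveat: some labels contain commas themselves, for example
# 'Southern African (South Africa, Zimbabwe, Namibia)'. As a heuristic based on the
# output above, split only on commas that are NOT followed by a space.
individual_accents = df['accents'].dropna().str.split(r',(?! )').explode().str.strip()
print(individual_accents.value_counts())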
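
The co-occurrence count mentioned in the comments above could be sketched like this. It is a sketch under two assumptions: that `client_id` identifies a speaker, and that accents are separated by commas with no following space (a heuristic read off the output earlier in this notebook, not a documented rule). The rows are invented, although the accent labels are taken from that output.

```python
import re
from collections import Counter
from itertools import combinations

import pandas as pd

# Made-up rows; speaker 'a' appears twice with the same accent string.
sample = pd.DataFrame({
    'client_id': ['a', 'a', 'b', 'c'],
    'accents': [
        'England English,United States English',
        'England English,United States English',
        'India and South Asia (India, Pakistan, Sri Lanka),England English',
        'Scottish English',
    ],
})

# Keep one row per speaker and accent string, so prolific speakers
# are not counted once per utterance.
per_speaker = (
    sample.dropna(subset=['accents'])
    .drop_duplicates(subset=['client_id', 'accents'])
)

pair_counts = Counter()
for accent_string in per_speaker['accents']:
    # Heuristic split: a comma followed by a space is treated as part of a label.
    parts = {part.strip() for part in re.split(r',(?! )', accent_string)}
    # Sort so that each pair is always recorded in the same order.
    pair_counts.update(combinations(sorted(parts), 2))

print(pair_counts.most_common())
```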
    
    
    
    
    

In [15]:
# french accents (run against a French download from before CV9, when the column was still `accent`) 

french_accents = df['accent'].unique()
for accent in french_accents:
    accent_total = len(df.loc[df['accent'] == accent])
    print(accent, ': ', accent_total)

print('NaN: ', len(df.loc[df['accent'].isna()]))

nan :  0
canada :  8682
france :  347436
belgium :  9679
cote_d_ivoire :  104
senegal :  41
algeria :  354
burundi :  1
united_kingdom :  259
cameroon :  56
united_states :  751
reunion :  1053
germany :  274
romania :  121
morocco :  59
tunisia :  16
switzerland :  3989
martinique :  68
other :  43
guadeloupe :  133
new_caledonia :  144
benin :  996
congo_brazzaville :  5
monaco :  109
gabon :  5
luxembourg :  14
st_pierre_et_miquelon :  6
mayotte :  6
italy :  57
congo_kinshasa :  10
ireland :  14
haiti :  34
madagascar :  162
portugal :  17
netherlands :  81
french_guiana :  115
NaN:  167394


In [23]:
# german accents (run against a German download from before CV9, when the column was still `accent`) 

german_accents = df['accent'].unique()
for accent in german_accents:
    accent_total = len(df.loc[df['accent'] == accent])
    print(accent, ': ', accent_total)

print('NaN: ', len(df.loc[df['accent'].isna()]))

nan :  0
russia :  940
germany :  437712
france :  1405
switzerland :  8602
austria :  20659
bulgaria :  1
netherlands :  75
denmark :  1
poland :  103
turkey :  24
united_kingdom :  148
czechia :  37
united_states :  268
greece :  120
hungary :  151
other :  184
belgium :  9
slovakia :  62
lithuania :  5
luxembourg :  57
canada :  98
liechtenstein :  62
slovenia :  10
brazil :  12
italy :  978
finland :  31
NaN:  213040


In [30]:
# spanish accents (run against a Spanish download from before CV9, when the column was still `accent`) 

spanish_accents = df['accent'].unique()
for accent in spanish_accents:
    accent_total = len(df.loc[df['accent'] == accent])
    print(accent, ': ', accent_total)

print('NaN: ', len(df.loc[df['accent'].isna()]))

mexicano :  16924
nan :  0
americacentral :  5532
andino :  13729
caribe :  8235
centrosurpeninsular :  8713
rioplatense :  12102
chileno :  5316
surpeninsular :  32622
nortepeninsular :  35345
canario :  952
filipinas :  342
NaN:  131198


In [39]:
# english accents (run against an English download from before CV9, when the column was still `accent`) 

english_accents = df['accent'].unique()
for accent in english_accents:
    accent_total = len(df.loc[df['accent'] == accent])
    print(accent, ': ', accent_total)

print('NaN: ', len(df.loc[df['accent'].isna()]))

nan :  0
hongkong :  2750
us :  351472
england :  118401
african :  8066
indian :  73030
other :  10505
australia :  46951
canada :  48453
scotland :  12676
philippines :  4158
singapore :  2967
bermuda :  643
newzealand :  11281
malaysia :  1685
ireland :  9233
wales :  1550
southatlandtic :  203
NaN:  721760
