# Working with accent data in the Mozilla Common Voice dataset 

The purpose of this Python Jupyter notebook is to provide some worked examples of how you might explore accent data in the Common Voice dataset. 



## Index of notebook contents 

To make this notebook easier to navigate, each section is indexed below. 

* [Background information on demographic data in Common Voice](#Background)
* [Preparation steps and importing modules](#PreparationSteps) - including the `requirements.txt` you should install from if using this notebook. 
* [The Accent and AccentDescriptor classes we will use in the notebook](#Classes)
* [Preparing data from the Common Voice TSV file](#PreparingData)
* [Extracting accent information for data visualisation](#AccentExtraction)
* [Determine which accents are predetermined for selection in the Common Voice profile screen](#PreDetermined)
* [Add Descriptors to each Accent](#Descriptors)

---
<a id='Background'></a>
## Background information on demographic data in Common Voice 

Before you start working with accent data in Common Voice, there is background information you should know about the data structures in the Common Voice datasets, and how accents have been represented. 

### The ability to choose whether or not to specify demographic information 

Data contributors can contribute voice data to Common Voice with or without logging in to the platform. If a data contributor is not logged in, the utterances they record contain no demographic metadata, such as the gender, age range or accent of the speaker. If the data contributor _does_ log in, then they can choose whether to specify demographic information in their profile. This demographic information can include which accent(s) they speak with. 


Since mid 2021, data contributors to the Common Voice dataset have been able to self-specify descriptors for their accents. 

This notebook extracts demographic details from a downloaded Mozilla Common Voice (MCV) dataset. 
These details inform decisions such as how much of the data for a particular language carries demographic metadata and, where it does, what that metadata says. 
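
As a rough illustration of that coverage question, the sketch below counts the share of rows that carry each demographic field. It uses only the standard library and a small invented TSV sample; the function name `demographic_coverage` and the sample rows are assumptions made for illustration, not part of the Common Voice tooling.

```python
import csv
import io

# Toy stand-in for validated.tsv; these rows are invented for illustration.
SAMPLE_TSV = (
    "client_id\tage\tgender\taccents\n"
    "a1\ttwenties\tfemale\tEngland English\n"
    "a1\ttwenties\tfemale\tEngland English\n"
    "b2\t\t\t\n"
    "c3\tthirties\t\tUnited States English\n"
)


def demographic_coverage(tsv_text, columns=("age", "gender", "accents")):
    """Return the share of rows with a non-empty value for each column."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    if not rows:
        return {column: 0.0 for column in columns}
    return {
        column: sum(1 for row in rows if (row[column] or "").strip()) / len(rows)
        for column in columns
    }


coverage = demographic_coverage(SAMPLE_TSV)
print(coverage)
```

On the real dataset the same question is answered with `pandas`, as in the data preparation section of this notebook.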

---
<a id='PreparationSteps'></a>

## Preparation steps and importing the modules we will use 

@TODO 

make a `requirements.txt` file to install all the dependencies. 

* pandas 


In [125]:
# imports go here 

# io 
import io

# pandas 
import pandas as pd

# regular expressions 
import re

# json 
import json

# pretty print 
import pprint
pp = pprint.PrettyPrinter(indent=4)

# reload = because I'm developing the CVaccents module as I go, I want to reload it each time so it doesn't cache
from importlib import reload


---
<a id='Classes'></a>
## Accent and AccentDescriptor classes used for manipulation

In [126]:
## Accent class and AccentDescriptor class 

# these are classes I defined for accent handling
import cvaccents as cva

# do an explicit reload as I'm still working on the classes 
reload(cva)

# prove that my DocStrings are useful
# they are good, so I am suppressing output while I work through the rest of the doc. 

#print('Module docstring is: \n', cva.__doc__)
#print('---')
#print('Accent docstring is: \n', cva.Accent.__doc__)
#print('---')
#print('AccentDescriptor docstring is: \n', cva.AccentDescriptor.__doc__)
#print('---')

<module 'cvaccents' from '/home/users/u6933485/cv-analysis-for-bias/cvaccents.py'>

---
<a id='PreparingData'></a>
## Preparing the data from the Common Voice dataset TSV file

Here, we extract data from the TSV file, and use `pandas` to perform manipulations on the dataset, such as removing rows that do not contain accent metadata. 

In [127]:
# specify the path to the TSV file - this should be `validated.tsv` from the MCV download 
#filePath = '/media/kathyreid/Seagate Backup Plus Drive/cv-datasets/ru/validated.tsv'
#filePath = '/media/kathyreid/Seagate Backup Plus Drive/cv-datasets/fr/validated.tsv'
#filePath = '/media/kathyreid/Elements/de/validated.tsv'
#filePath = '/media/kathyreid/Elements/es/validated.tsv'
#filePath = '/media/kathyreid/Elements/en/validated.tsv'
#filePath = '/media/kathyreid/Seagate Backup Plus Drive/cv-datasets/en-v9/validated.tsv'
filePath = '../cv-datasets/cv-corpus-11.0-2022-09-21/en/validated.tsv'

# put it into a DataFrame 
df = pd.read_csv(filePath, sep='\t')

In [128]:
df.columns

Index(['client_id', 'path', 'sentence', 'up_votes', 'down_votes', 'age',
       'gender', 'accents', 'locale', 'segment'],
      dtype='object')

In [129]:
# We don't want all the columns, as some of them are not useful for the accent analysis 
# Drop the columns we don't want 

df.drop(labels=['path', 'sentence', 'up_votes', 'down_votes', 'segment', 'locale'], axis='columns', inplace=True)
df.columns



Index(['client_id', 'age', 'gender', 'accents'], dtype='object')

In [130]:
len(df)

1617877

In [131]:
# rows that have accent metadata 
len(df[df['accents'].notna()])

861134

In [132]:
# remove all the rows where accents are not given (NaN)
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html
# DataFrame.dropna(*, axis=0, how=_NoDefault.no_default, thresh=_NoDefault.no_default, subset=None, inplace=False)

df.dropna(axis='index', how='any', subset='accents', inplace=True)
len(df)

# this matches the above figure for rows that have accent metadata, so it's a good cross-check

861134

In [133]:
# number of unique contributors to the dataset 
len(df['client_id'].unique())

14822

In [148]:
# Now that the rows without an accent value have been removed, 
# we want to deduplicate the client_id values - because one speaker can speak many utterances
# and we only want to record one accent entry per speaker, 
# so we should end up with the number of rows given in the cell above 


# One of the reasons we try and reduce the size of the dataframe 
# first is because this operation is more efficient on a smaller dataframe 
# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html
# DataFrame.drop_duplicates(subset=None, *, keep='first', inplace=False, ignore_index=False)

df.drop_duplicates(subset='client_id', keep='first', inplace=True)
len(df)

# This length should match the length above 

14822
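
The two steps above (drop rows without accent metadata, then keep one row per `client_id`) can be sketched without `pandas` on a toy table. This is a minimal sketch under stated assumptions: the rows are invented, and `first_accent_per_speaker` is a hypothetical helper name, not something from the notebook's modules.

```python
# Toy rows as (client_id, accents); None marks missing accent metadata.
rows = [
    ("a1", "England English"),
    ("a1", "England English"),
    ("b2", None),
    ("c3", "United States English"),
    ("c3", "United States English"),
]


def first_accent_per_speaker(rows):
    """Drop rows without accents, then keep the first row seen per client_id."""
    with_accents = [(client_id, accents) for client_id, accents in rows if accents]
    first_seen = {}
    for client_id, accents in with_accents:
        # setdefault only stores the value the first time a client_id appears
        first_seen.setdefault(client_id, accents)
    return first_seen


speakers = first_accent_per_speaker(rows)

# Cross-check: one entry per unique client_id that has accent metadata
unique_ids = {client_id for client_id, accents in rows if accents}
print(len(speakers), len(unique_ids))
```

The final comparison mirrors the cross-check in the cells above: the number of rows after deduplication should equal the number of unique contributors.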

---
<a id='AccentExtraction'></a>
## Extracting the accent data for visualisation 

We have already: 

* Removed any rows where accent data was not available = `NaN`
* De-duplicated based on the `client_id`

So now, we need to extract all the self-styled accents for analysis. 
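
Because some accent names contain commas inside parentheses, a plain `split(',')` would break them apart. The sketch below isolates the splitting step using the same regular expression this notebook uses, applied to a few sample strings. The helper name `split_accents` is an assumption for illustration, and the pattern is only expected to cope with simple, balanced parentheses.

```python
import re

# Split on commas that are NOT followed by a closing parenthesis
# without an intervening opening one, i.e. commas outside (...) groups.
ACCENT_SPLIT = re.compile(r",\s*(?![^()]*\))")


def split_accents(accent_string):
    """Split one accent string into a list of trimmed, non-empty descriptors."""
    parts = ACCENT_SPLIT.split(accent_string)
    return [part.strip() for part in parts if part.strip()]


examples = [
    "England English,United States English",
    "United States English,India and South Asia (India, Pakistan, Sri Lanka)",
    " England English , wolof,",
]

for example in examples:
    print(split_accents(example))
```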

In [165]:
"""

The list english_accents_list is a list that contains the ORIGINAL accent entries for each speaker. 
In this list, the accents for each speaker are represented as a SINGLE STRING, NOT as a list of strings. 

So, we want to turn this into a LIST of LISTS OF STRINGS, to make it easier for doing data cleaning. 
Each individual string represents one accent descriptor given by a speaker, 
and the list which contains those strings is the grouping of accent descriptors for that speaker. 

We need to preserve the association between accent descriptors - co-references - for later data visualisation. 

"""

# df now holds one row per client_id, so we don't need the `.unique` method
english_accents = df['accents']

english_accents_list = [] 

for idx, accent_string in enumerate(english_accents): 
    
    # this regex is from 
    # https://stackoverflow.com/questions/26633452/how-to-split-by-commas-that-are-not-within-parentheses
    # (written as a raw string so the backslashes reach the regex engine unchanged)
    accent_list = re.split(r',\s*(?![^()]*\))', accent_string)
    
    # build a cleaned copy, rather than removing items from accent_list while iterating over it
    cleaned_accent_list = []
    
    for accent in accent_list: 
        
        print('idx is: ', idx, ' and accent_list is: ', accent_list)
        
        # Trim any whitespace off the elements, because stray whitespace makes matching on strings harder later on 
        accent = accent.strip()
        
        # Skip any empty strings - likely regex artefacts
        if accent: 
            cleaned_accent_list.append(accent)
        
    english_accents_list.append(cleaned_accent_list)

idx is:  0  and accent_list is:  ['England English', 'United States English']
idx is:  0  and accent_list is:  ['England English', 'United States English']
idx is:  1  and accent_list is:  ['Hong Kong English']
idx is:  2  and accent_list is:  ['England English']
idx is:  3  and accent_list is:  ['United States English']
idx is:  4  and accent_list is:  ['United States English', 'wolof']
idx is:  4  and accent_list is:  ['United States English', 'wolof']
idx is:  5  and accent_list is:  ['England English']
idx is:  6  and accent_list is:  ['Australian English']
idx is:  7  and accent_list is:  ['United States English']
idx is:  8  and accent_list is:  ['Latin America', 'United States English']
idx is:  8  and accent_list is:  ['Latin America', 'United States English']
idx is:  9  and accent_list is:  ['United States English']
idx is:  10  and accent_list is:  ['United States English']
idx is:  11  and accent_list is:  ['Southern African (South Africa, Zimbabwe, Namibia)']
idx is:  12  

In [166]:
print(type(english_accents_list))
pp.pprint(english_accents_list)


<class 'list'>
[   ['England English', 'United States English'],
    ['Hong Kong English'],
    ['England English'],
    ['United States English'],
    ['United States English', 'wolof'],
    ['England English'],
    ['Australian English'],
    ['United States English'],
    ['Latin America', 'United States English'],
    ['United States English'],
    ['United States English'],
    ['Southern African (South Africa, Zimbabwe, Namibia)'],
    ['United States English'],
    ['United States English'],
    ['India and South Asia (India, Pakistan, Sri Lanka)'],
    ['United States English'],
    ['United States English'],
    ['England English'],
    ['India and South Asia (India, Pakistan, Sri Lanka)'],
    ['England English'],
    ['United States English'],
    ['Southern African (South Africa, Zimbabwe, Namibia)'],
    ['Australian English'],
    ['England English'],
    ['India and South Asia (India, Pakistan, Sri Lanka)'],
    ['United States English'],
    ['United States English', 'C

---
<a id='Descriptors'></a>
## Add descriptors to each accent

In this section, I apply a set of categories to the accent data. 

**I use a rule-based approach for reproducibility.** 
This could have been done in a spreadsheet, but I'm working in Python so I chose to do it that way. 
This also makes it easier for others applying this work to other languages or to other versions of the dataset. 



### Expand accents that have multiple descriptors in their .name element 

Here, we "break apart" accents that have multiple descriptors in the .name element into **multiple** accents. This is done _programmatically_ to aid reproducibility. 

Some examples of this that I found during analysis were: 

* 8233 - 'south German / Swiss accent' - needs to be separated into 'South German' and 'Swiss accent'
* 6967 - 'slight Brooklyn Accent' - needs to be separated into "slight" and 'Brooklyn' - as a strength indicator 
* 10142 - 'minor French Accent' - needs to be separated into "minor" as an accent strength indicator, and 'French'
* 9902 - 'little Latino' - needs to be separated into "little" as an accent strength indicator, and "Latino"
* 3422 - 'i have some pronunciation issues because of oral surgery and a hidden southern accent' - needs to be separated into 'hidden southern accent' and 'changes due to Oral surgery'
* 9337 - 'heavy Cantonese' - needs to be separated into accent strength and country
* 1721 - 'United States English combined with European English' - needs to be separated into two accents
* 5365 - 'Sydney - middle eastern seaboard Australian' - should be separated into two accents
* 6615 - 'Spoke Chinese when little' - should be separated into Chinese, and speaking a different language as a child
* 748 - 'Spanish bilingual' - should be separated into 'Spanish' and 'bilingual ' as an Ln marker. 
* 6942 - 'South London and Essex' - should be separated into 'South London' and 'Essex'
* 12055 - 'Some time spent in Scotland' - should be separated into 'Scottish English' and into an Ln marker of "time spent in regional location"
* 773 - 'Silicon Valley Native' - should be separated into 'Silicon Valley' and "native" marker
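* 3033 - "Porto des dels 3 anys aprenent anglès a l'escola i actualment m'estic preparant per a l'examen del B2." (Catalan for "I have been learning English at school since I was 3 years old and am currently preparing for the B2 exam.") - this should be split into Catalan, and academic register / L2 status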
* 4322 - 'Polish. Have lived in nine states.' - should be separated into Polish and "lived in nine states" as a regional variance indicator or mixed accent indicator
* 5046 - 'Pittsburgh PA' - should be separated into Pittsburgh - city descriptor - and Pennsylvania as a regional descriptor. 
* 58 - 'Non native speaker from France' - should be separated into 'French' and 'Non native speaker'
* 1054 - 'Mild Northern England English' - should be separated into 'Northern England' and 'mild' for the strength of accent
* 11587 - 'Midwestern States (Michigan)' - should be separated into Midwestern and Michigan
* 1317 - 'Midwest USA speech blended with South Texas USA speech' - should be separated into Midwest and South Texas
* 1491 - 'Midwest US... With some Canadian slang. ' - should be separated into Midwest and Canadian Slang as register
* 'International Indian Accent' - should be separated into "Indian" and into "International"
* 2703 - 'Indian with a tinge of an RP accent' - should be separated into "Indian" and "Received Pronunciation" as well as an accent strength marker
* 1011 - 'I am German and speak English as learned at school' - should be separated into "German" and a register marker "as learned at school"
* 4839 - 'French mid level' - should be separated into 'French' and 'mid-level' as a fluency marker
* 1055 - 'England non-native' - should be separated into 'England' and 'Non-native' as a fluency marker
* 5305 - 'Educated Australian Accent' - should be separated into 'Australian' and 'Educated' as a register marker
* 12958 - 'CARIBBEAN AND BRITISH MIXED WITH SOME NEW YORK ACCENTS' - should be separated into three accents 
* 1577 - 'British English with a little bit of Russian' - should be separated into two accents. It's also an accent strength marker.
* 7271 - 'A variety of Texan English with some German influence that has undergone the cot-caught merger' - should be split into Texan, German influence, and a phonetic descriptor
* 6616 - '90% Pennsylvanian accent' - should be separated into Pennsylvanian and an accent strength marker
* 4319 - '4 years in Spain and Germany' - should be separated into Spanish, German and a time or exposure marker
* 6617 - '10% Chinese accent' - should be separated into Chinese and an accent strength marker
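
Several of the examples above share one pattern: a leading strength word followed by the accent itself. A minimal rule for that single pattern could look like the sketch below. The marker list is drawn from the examples above ('slight', 'minor', 'little', 'heavy', 'mild'), while the function name `split_strength_marker` is hypothetical. Percentages such as '90%' and phrases such as 'with a tinge of' are deliberately not handled here.

```python
# Strength words seen at the start of descriptors in the examples above.
STRENGTH_MARKERS = ("slight", "minor", "little", "heavy", "mild")


def split_strength_marker(descriptor):
    """Split 'heavy Cantonese' into ['Cantonese', 'heavy'].

    Descriptors that do not start with a known strength word,
    or that consist of a single word, are returned unchanged.
    """
    words = descriptor.split()
    if len(words) > 1 and words[0].lower() in STRENGTH_MARKERS:
        return [" ".join(words[1:]), words[0].lower()]
    return [descriptor]


for sample in ["heavy Cantonese", "slight Brooklyn Accent", "England English"]:
    print(sample, "->", split_strength_marker(sample))
```

A rule like this only covers the simplest cases; most of the examples in the list still need an explicit, hand-written mapping.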




### Modify the accent list to expand descriptors while preserving accent co-references

Some of the given accent descriptors contain multiple descriptors in one string. Here, I expand them while maintaining co-references. 

For example: 

* `slight Brooklyn accent`  - contains both a City-based descriptor and an accent strength descriptor. 
* `United States English combined with European English` - contains both a national descriptor and a supranational descriptor. 



It's easier to do this before we create Accent and AccentDescriptor objects. 
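
One way to picture "expanding while preserving co-references" is a table-driven sketch: a dict maps each compound descriptor to its parts, and each speaker's list is rebuilt with the parts substituted in place. This is an alternative illustration, not the helper used in this notebook. The names `EXPANSIONS` and `expand_accents` and the speaker combinations are assumptions; the two mappings themselves come from the examples in this section.

```python
# Compound descriptor -> list of expanded descriptors.
EXPANSIONS = {
    "heavy Cantonese": ["Cantonese", "heavy"],
    "Spanish bilingual": ["Spanish", "Bilingual"],
}


def expand_accents(accent_lists, expansions):
    """Return new per-speaker lists with compound descriptors expanded.

    Descriptors that belong to one speaker stay together in one inner list,
    so the co-reference between them is preserved.
    """
    expanded = []
    for accent_list in accent_lists:
        new_list = []
        for accent in accent_list:
            # Unknown descriptors pass through unchanged
            new_list.extend(expansions.get(accent, [accent]))
        expanded.append(new_list)
    return expanded


# Invented speaker combinations, for illustration only
speakers = [["Hong Kong English", "heavy Cantonese"], ["United States English"]]
result = expand_accents(speakers, EXPANSIONS)
print(result)
```

Building new lists avoids modifying a list while iterating over it, and keeping the mappings in one dict makes the rule set easy to review or adapt for another language.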

In [169]:
# helper function for the below 

def update_list_coreference(list_to_be_updated, old_entry, new_entry):
    # the accent list is a list of lists so we need to check each inner list for the element to update. 
    # new_entry is itself a list of replacement descriptors - there may be more than one
    
    for accent_list in list_to_be_updated:
        
        # test for membership instead of removing items inside a 
        # `for accent in accent_list` loop, which can skip elements
        if old_entry in accent_list:
            
            accent_list.remove(old_entry)
            accent_list.extend(new_entry)
    
    return list_to_be_updated

In [176]:


#'south German / Swiss accent' - needs to be separated into 'South German' and 'Swiss'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'south German / Swiss accent', 
                                               ['South German', 'Swiss'])

# 'slight Brooklyn Accent' - needs to be separated into "slight" and 'Brooklyn' - as a strength indicator 
english_accents_list = update_list_coreference(english_accents_list, 
                                               'slight Brooklyn Accent', 
                                               ['Brooklyn Accent', 'slight'])

# 'minor French Accent' - needs to be separated into "minor" as an accent strength indicator, and 'French'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'minor French Accent', 
                                               ['French Accent', 'minor'])

# 'little Latino' - needs to be separated into "Latino" as an accent and 'little' as an accent strength indicator
english_accents_list = update_list_coreference(english_accents_list, 
                                               'little Latino', 
                                               ['Latino', 'little'])
                              
# 'heavy Cantonese' - needs to be separated into accent strength and country   
english_accents_list = update_list_coreference(english_accents_list, 
                                               'heavy Cantonese', 
                                               ['Cantonese', 'heavy'])

# 'United States English combined with European English' - needs to be separated into two accents
english_accents_list = update_list_coreference(english_accents_list, 
                                               'United States English combined with European English', 
                                               ['United States English', 'European English'])

# 'Sydney - middle eastern seaboard Australian' - should be separated into two accents
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Sydney - middle eastern seaboard Australian', 
                                               ['Sydney', 'Middle eastern seaboard Australian'])
    
    
# 'Spoke Chinese when little' - should be separated into Chinese, and speaking a different language as a child
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Spoke Chinese when little', 
                                               ['Chinese', 'Spoke language when a child'])    
    
    
    
# 'Spanish bilingual' - should be separated into 'Spanish' and 'bilingual ' as an Ln marker. 
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Spanish bilingual', 
                                               ['Spanish', 'Bilingual']) 










pp.pprint(english_accents_list)

[   ['England English', 'United States English'],
    ['Hong Kong English'],
    ['England English'],
    ['United States English'],
    ['United States English', 'wolof'],
    ['England English'],
    ['Australian English'],
    ['United States English'],
    ['Latin America', 'United States English'],
    ['United States English'],
    ['United States English'],
    ['Southern African (South Africa, Zimbabwe, Namibia)'],
    ['United States English'],
    ['United States English'],
    ['India and South Asia (India, Pakistan, Sri Lanka)'],
    ['United States English'],
    ['United States English'],
    ['England English'],
    ['India and South Asia (India, Pakistan, Sri Lanka)'],
    ['England English'],
    ['United States English'],
    ['Southern African (South Africa, Zimbabwe, Namibia)'],
    ['Australian English'],
    ['England English'],
    ['India and South Asia (India, Pakistan, Sri Lanka)'],
    ['United States English'],
    ['United States English', 'Colombia'],
    

In [139]:
# So what we want to do here is put them into another list
# So that we can then order that list 

english_accents = df['accents'].unique()

english_accents_dict = {} 
english_accents_list = []
english_accents_list_by_total =[]

for accent in english_accents:
    accent_total = len(df.loc[df['accents'] == accent])
    english_accents_dict[accent] = accent_total
    
# A dict has no sort() method
# So we copy its (accent, total) pairs into a list that we can sort 

for accent_total in english_accents_dict.items(): 
    english_accents_list.append(accent_total)

#print ('\n')
#print ('breakpoint 1')
#print ('\n')

# print the list 
#print(english_accents_list)
    
# Now, we want to order the list
# But if we use sort(), it will sort by the first element, which is the accent itself
# But we want to sort on the number of occurrences of that accent 

for accents_list in english_accents_list: 
    #print(accents_list)
    # they're tuples, reverse() won't work; named reversed_pair so we don't shadow the built-in reversed()
    reversed_pair = accents_list[::-1]
    #print(reversed_pair)
    english_accents_list_by_total.append(reversed_pair)
    
english_accents_list_by_total.sort(reverse=True)
    
#print ('\n')
#print ('breakpoint 2')
#print ('\n')

# print the size of the list     
print('Size of the list is: ')
print(len(english_accents_list_by_total))
print('---')

# print the sorted list
print('Sorted list is: ')
pp.pprint(english_accents_list_by_total)





Size of the list is: 
227
---
Sorted list is: 
[   (7432, 'United States English'),
    (2281, 'England English'),
    (1990, 'India and South Asia (India, Pakistan, Sri Lanka)'),
    (888, 'Canadian English'),
    (655, 'Australian English'),
    (260, 'Southern African (South Africa, Zimbabwe, Namibia)'),
    (186, 'Irish English'),
    (166, 'Scottish English'),
    (157, 'New Zealand English'),
    (129, 'Hong Kong English'),
    (124, 'Filipino'),
    (95, 'Malaysian English'),
    (75, 'Singaporean English'),
    (65, 'Welsh English'),
    (48, 'West Indies and Bermuda (Bahamas, Bermuda, Jamaica, Trinidad)'),
    (13, 'United States English,England English'),
    (9, 'England English,United States English'),
    (6, 'German English'),
    (5, 'South Atlantic (Falkland Islands, Saint Helena)'),
    (4, 'Russian'),
    (4, 'French'),
    (   3,
        'United States English,India and South Asia (India, Pakistan, Sri '
        'Lanka)'),
    (3, 'Slavic'),
    (3, 'Northern Irish')

In [140]:
# build a dict of each unique accent using an Accent object for each object. 

ratio_display = 120 # to stop the browser crashing 

all_accents = {}
i = 0; 

all_accent_lists = df['accents']

for accent_list in all_accent_lists:
    #print (accent_list)
    
    # the accent_list is a string, not a list
    # and we can't simply split it on every comma 
    # because there are entries that may have a comma in the accent description 
    # such as 
    # India and South Asia (India, Pakistan, Sri Lanka)
    # so instead we use a regex that only splits on a comma ','
    # if the comma is not inside open brackets 
    
    # this regex is from 
    # https://stackoverflow.com/questions/26633452/how-to-split-by-commas-that-are-not-within-parentheses
    # (written as a raw string so the backslashes reach the regex engine unchanged)
    accent_list=re.split(r',\s*(?![^()]*\))', accent_list)
    
    for accent in accent_list: 
        
        i +=1
        match = False 
        count = 0
        
        #if (i%ratio_display ==0): # only show the 100th 
            #print('')
            #print('---')
            #print('now processing: ', accent, ' - ', i)
            #print('---')
        
        # is this accent in our dict - if not, add it in 
        
        for item in all_accents.items() : # Each item should be an Accent object 
            
            # (self, id=0, name="Accent Name", count=0, locale=None, descriptors=None):
            #pp.pprint(item[1].__str__())
            
            #if (i%ratio_display ==0): # only show the 100th 
                #print('item is: ', item)
                #print(type(item))
                #print('now checking match for: item:', item[1], ' and accent: ', accent)
            
            if (item[1].name == accent) : # update the count
                
                #if (i%ratio_display ==0): # only show the 100th 
                    #print('---')
                    #print('match is True')
                    #print('---')
                    
                match = True 
                
                #if (i%ratio_display ==0): # only show the 100th 
                    #print('accent count was: ', item[1].count)
                
                # update the count of the accent 
                item[1].count+=1
                
                #if (i%ratio_display ==0): # only show the 100th 
                    #print('accent count is now: ', item[1].count)
                
                

        # this match loop has to be outside the for: loop above 
        # because if we add items to the dict inside the loop
        # then it will not run - because there are zero items in the dict to begin with 
        
        if (not match) :   
            
            # (self, id=0, name="Accent Name", count=0, locale=None, descriptors=None):
            all_accents[i] = cva.Accent(i, accent, 1, 'en', None, False) 
            
            if (i%ratio_display ==0): # only show the ratio'th element 
                print ('added ', all_accents[i], ' to dict')
                



added  id is 1920, name is southern United States, count is 1, locale is en, descriptors are None, predetermined is False.  to dict
added  id is 4320, name is speak some German, count is 1, locale is en, descriptors are None, predetermined is False.  to dict
added  id is 5760, name is Kiwi, count is 1, locale is en, descriptors are None, predetermined is False.  to dict


In [141]:
print(len(all_accents))

224
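
The nested loop above compares each descriptor with every accent already in the dict. As a point of comparison, the sketch below keys the counts by accent name, so each descriptor needs a single lookup. It is a minimal sketch that uses `collections.Counter` in place of the notebook's `Accent` objects; the sample strings and the helper name `count_descriptors` are assumptions for illustration.

```python
import re
from collections import Counter

# Same comma-outside-parentheses pattern as used in this notebook
ACCENT_SPLIT = re.compile(r",\s*(?![^()]*\))")

# Invented sample of per-speaker accent strings
accent_strings = [
    "United States English",
    "England English,United States English",
    "India and South Asia (India, Pakistan, Sri Lanka)",
    "United States English, wolof",
]


def count_descriptors(accent_strings):
    """Count how many speakers listed each individual accent descriptor."""
    counts = Counter()
    for accent_string in accent_strings:
        for part in ACCENT_SPLIT.split(accent_string):
            name = part.strip()
            if name:
                counts[name] += 1
    return counts


counts = count_descriptors(accent_strings)

# Sort by count descending, then by name ascending to break ties
ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
print(ranked)
```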


In [142]:

for item in all_accents.items(): 
    pp.pprint(item[1].__str__())



('id is 1, name is England English, count is 2342, locale is en, descriptors '
 'are None, predetermined is False.')
('id is 2, name is United States English, count is 7537, locale is en, '
 'descriptors are None, predetermined is False.')
('id is 3, name is Hong Kong English, count is 132, locale is en, descriptors '
 'are None, predetermined is False.')
('id is 7, name is wolof, count is 1, locale is en, descriptors are None, '
 'predetermined is False.')
('id is 9, name is Australian English, count is 665, locale is en, descriptors '
 'are None, predetermined is False.')
('id is 11, name is Latin America, count is 1, locale is en, descriptors are '
 'None, predetermined is False.')
('id is 15, name is Southern African (South Africa, Zimbabwe, Namibia), count '
 'is 260, locale is en, descriptors are None, predetermined is False.')
('id is 18, name is India and South Asia (India, Pakistan, Sri Lanka), count '
 'is 2009, locale is en, descriptors are None, predetermined is False.')
('

---
<a id='PreDetermined'></a>
## Label the accents that were pre-determined 

Since its inception, Mozilla Common Voice has enabled data contributors to enter demographic data such as age, gender and accent. These values are not validated in any way, and we have no indicator of how accurate they are. Accent _used_ to be represented as an a priori drop-down list from which the contributor could select. From Common Voice v10, the data contributor can **self-describe** their accent. However, the previous accent list is still presented, so those accents may be chosen more frequently. We need to be able to distinguish these accents visually to help with the exploration. 

```
"splits": {
        "accent": {
          "": 0.51,
          "canada": 0.03,
          "england": 0.08,
          "us": 0.23,
          "indian": 0.07,
          "australia": 0.03,
          "malaysia": 0,
          "newzealand": 0.01,
          "african": 0.01,
          "ireland": 0.01,
          "philippines": 0,
          "singapore": 0,
          "scotland": 0.02,
          "hongkong": 0,
          "bermuda": 0,
          "southatlandtic": 0,
          "wales": 0,
          "other": 0.01
        },

```

The `cv-datasets` splits above have labels for the accents that don't actually match the accent name in the data. So we need to specify the accents that are pre-determined. This is how they appear to the data contributor filling out their profile at: [https://commonvoice.mozilla.org/en/profile/info](https://commonvoice.mozilla.org/en/profile/info)


![Accents as specified on Mozilla Common Voice profile](cv-profile-specify-accent.png)
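
Labelling an accent as pre-determined is a membership test against the list of names shown on the profile screen. The sketch below shows that idea with a set lookup. `ToyAccent` is a hypothetical stand-in for the notebook's `Accent` class, which may have a different constructor, and only a subset of the pre-determined names is included.

```python
from dataclasses import dataclass

# Subset of the pre-determined accent names, as they appear in the dataset
PREDETERMINED = frozenset({
    "United States English",
    "England English",
    "Hong Kong English",
})


@dataclass
class ToyAccent:
    """Hypothetical stand-in for the notebook's Accent class."""
    name: str
    count: int = 0
    predetermined: bool = False


def label_predetermined(accents, predetermined_names):
    """Set .predetermined on each accent by exact name match."""
    for accent in accents.values():
        accent.predetermined = accent.name in predetermined_names
    return accents


accents = {1: ToyAccent("England English", 2342), 7: ToyAccent("wolof", 1)}
label_predetermined(accents, PREDETERMINED)

for key, accent in accents.items():
    print(key, accent.name, accent.predetermined)
```

The match is exact and case-sensitive, so a self-described variant such as 'Kiwi' is not treated as the pre-determined 'New Zealand English'.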


In [143]:
# create a list of the pre-existing accents 
# this is how they are given in the dataset. 

predetermined_accents_list = ['United States English', 
                         'England English', 
                         'India and South Asia (India, Pakistan, Sri Lanka)', 
                         'Canadian English', 
                         'Australian English', 
                         'Southern African (South Africa, Zimbabwe, Namibia)', 
                         'Irish English', 
                         'Scottish English', 
                         'New Zealand English', 
                         'Hong Kong English', 
                         'Filipino', 
                         'Malaysian English', 
                         'Singaporean English', 
                         'Welsh English', 
                         'West Indies and Bermuda (Bahamas, Bermuda, Jamaica, Trinidad)', 
                         'South Atlantic (Falkland Islands, Saint Helena)']



  

In [144]:
# now, we iterate through both lists and, if the accent name in the Dict matches an accent in the predetermined list, 
# we record this on the Accent object so that we can use it for visualisation. 


for item in all_accents.items(): 
    #print('item is: ', item)
    key = item[0] 
    #print(key)
    
    predetermined_status = False
    
    for predetermined_accent in predetermined_accents_list: 
        
        #print('predetermined_accent is: ', predetermined_accent)
        #print('item 1 accent is: ', item[1]['accent'])
        
        if (predetermined_accent == item[1].name) :
            #print('MATCH!')
            predetermined_status = True
            
    all_accents[key].predetermined = predetermined_status
    #all_accents[key]['accent_predetermined'] = predetermined_status
    #print(all_accents[key]['accent_predetermined'])



    



In [145]:

for item in all_accents.items(): 
    pp.pprint(item[1].__str__())


('id is 1, name is England English, count is 2342, locale is en, descriptors '
 'are None, predetermined is True.')
('id is 2, name is United States English, count is 7537, locale is en, '
 'descriptors are None, predetermined is True.')
('id is 3, name is Hong Kong English, count is 132, locale is en, descriptors '
 'are None, predetermined is True.')
('id is 7, name is wolof, count is 1, locale is en, descriptors are None, '
 'predetermined is False.')
('id is 9, name is Australian English, count is 665, locale is en, descriptors '
 'are None, predetermined is True.')
('id is 11, name is Latin America, count is 1, locale is en, descriptors are '
 'None, predetermined is False.')
('id is 15, name is Southern African (South Africa, Zimbabwe, Namibia), count '
 'is 260, locale is en, descriptors are None, predetermined is True.')
('id is 18, name is India and South Asia (India, Pakistan, Sri Lanka), count '
 'is 2009, locale is en, descriptors are None, predetermined is True.')
('id is 

In [146]:
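# sort the dict by count descending 
# I'm not using this actively but I want to know how to sort the Dict properly
# note: the loop below still iterates over all_accents, so the printed output is in the original (unsorted) order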

all_accents_sorted_by_count = dict(sorted(all_accents.items(), key=lambda t:(t[1].count, t[1].name), reverse=True))


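# note: this loop prints all_accents, which is still in its original order;
# iterate over all_accents_sorted_by_count instead to see the sorted order
for item in all_accents.items(): 
    pp.pprint(item[1].__str__())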


('id is 1, name is England English, count is 2342, locale is en, descriptors '
 'are None, predetermined is True.')
('id is 2, name is United States English, count is 7537, locale is en, '
 'descriptors are None, predetermined is True.')
('id is 3, name is Hong Kong English, count is 132, locale is en, descriptors '
 'are None, predetermined is True.')
('id is 7, name is wolof, count is 1, locale is en, descriptors are None, '
 'predetermined is False.')
('id is 9, name is Australian English, count is 665, locale is en, descriptors '
 'are None, predetermined is True.')
('id is 11, name is Latin America, count is 1, locale is en, descriptors are '
 'None, predetermined is False.')
('id is 15, name is Southern African (South Africa, Zimbabwe, Namibia), count '
 'is 260, locale is en, descriptors are None, predetermined is True.')
('id is 18, name is India and South Asia (India, Pakistan, Sri Lanka), count '
 'is 2009, locale is en, descriptors are None, predetermined is True.')
('id is 
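
The sort in the cell above can be tried out on a small stand-in. This is a minimal sketch only: `ToyAccent` is a hypothetical replacement for the notebook's `Accent` class, and the three entries reuse names and counts from the printed output.

```python
from dataclasses import dataclass


@dataclass
class ToyAccent:
    """Hypothetical stand-in for the notebook's Accent class."""
    name: str
    count: int


toy_accents = {
    1: ToyAccent('England English', 2342),
    2: ToyAccent('United States English', 7537),
    3: ToyAccent('Hong Kong English', 132),
}

# sorted() returns a list of (key, value) pairs. Wrapping it in dict()
# keeps that order, because dicts preserve insertion order (Python 3.7+).
toy_sorted = dict(
    sorted(toy_accents.items(),
           key=lambda t: (t[1].count, t[1].name),
           reverse=True)
)

for accent_id, accent in toy_sorted.items():
    print(accent_id, accent.name, accent.count)
```

The original dict is left untouched, so any later loop has to iterate over the sorted copy to see the new order.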

In [147]:
"""
predetermined_accents_list = ['United States English', 
                         'England English', 
                         'India and South Asia (India, Pakistan, Sri Lanka)', 
                         'Canadian English', 
                         'Australian English', 
                         'Southern African (South Africa, Zimbabwe, Namibia)', 
                         'Irish English', 
                         'Scottish English', 
                         'New Zealand English', 
                         'Hong Kong English', 
                         'Filipino', 
                         'Malaysian English', 
                         'Singaporean English', 
                         'Welsh English', 
                         'West Indies and Bermuda (Bahamas, Bermuda, Jamaica, Trinidad)', 
                         'South Atlantic (Falkland Islands, Saint Helena)']
"""

"""
List of accents that should be merged into one canonical one for the purpose of analysis 

667, 2456, 1337, 4320 - German and German English and 'German Accent', speak some German

2985, 5349, 2191, 7603, 14262, 11587, 2205, 1317, 1491, 8043, 6472 - Midwestern United States variants 
1978 - and others - southern united states variants 

10548, 1133 - Dutch English and Dutch

7846, 5213, 479 -  polish and 'polish accent' and 'Polish English'

9878 - 'england' and 'England English'

4049, 12004 - 'eastern european English' and 'eastern Europe'

5737, 708 - 'swedish' and 'swedish english'

12363 and others - 'Philadelphia' and variants

14305, 1054, 4096 and others - 'Northern English'

15055, 188 and others 'non-native speaker'

8750, 700 and others - 'Nigerian English' 

11039, 12362 and others - 'Mid-atlantic'

2040 and others - 'London' and variants

879, 15028 - 'Liverpool' and variants 

5417, 11 and others - 'Latin American' and variants

15029, 5426 - 'Lancashire'

2828, 7744 - 'Kenyan'

There are at least two Japanese accents in the dataset. 

1099, 7644 - Israeli 

There are many Hispanic / Latino / Latin accent descriptors in the dataset. 

Several German and Swiss German Accents definitions - South German appears to be similar to Swiss, so I wouldn't separate them 

7774 - 'European English' and 'European'

Second language and 'ESL' should be combined into 'English as a Second Language' - it's a different type of 'non-native descriptor'

2455 and 31 - Colombian 

11459 and 276 - 'California' and 'Californian Accent'

13084, 3085, 3086 and others - variants of 'Bangladeshi English'

9948 -  '2nd Language ' - and others - variants of "second language" or "ESL"

"""



"""
Modifications that should be made 

Countries should be uppercased - Polish 
Capital case for accents - 'southern United States'

"""

"""
Notation logs 

* Northern Irish is categorised as a Country descriptor because Northern Ireland is a country in the United Kingdom
* Serbian is categorised as a Country descriptor because Serbia is a country in its own right - it's no longer Yugoslavia
* Similarly, Scottish is categorised as a Country descriptor

Is it a country descriptor or is it a language descriptor?
Some countries have a language that tends to overlap its nation state borders - Germany, Austria, France, Poland
So it's both a country and a language descriptor 
Slavic is a language group, as is Latino too. 

So that needs to come out in the analysis. 

Strength of accent appears to be another descriptor - so 'slight Brooklyn accent'

Is fluency the same as an Ln descriptor? For these purposes I think we can say so. 

Cantonese has been classified as sub-national - primarily because Hong Kong is a part of China, however there is a large Cantonese diaspora around the world. 
Should it be classified as supra-national, as is Wolof? 
This might be a better classification, because of the many countries in which Cantonese is spoken (Singapore etc.).
A different way to approach this might be to have a category which is "Language group descriptor"

'English County Durham' is distinct even when within Northumbrian English - so the two were not merged. 

AMBIGUOUS ACCENT DESCRIPTORS 

The accent descriptor 'West Indian' is ambiguous - does it refer to West Indies, or West India? I have categorised it as West India because there is an existing West Indies pre-determined category. 

The accent descriptor 'Low' is ambiguous - does it refer to low Midwestern or tonal low range? I have categorised it as Low Midwestern

The accent 'East Indian' is ambiguous - it co-occurred with Indian rather than say Dutch East Indies, so categorising it as subnational region.

The accent "A'lo" is ambiguous - based on its Spanish translation I am categorising it as a fluency marker. The Spanish definition reads: a prefix element of Greek origin that is used in forming nouns and adjectives with the meaning 'other', 'different'.

'"Valley Girl" English' is ambiguous - is it a regional descriptor or is it a register? 
I have categorised it as a register because the features of the accent are also changes to words and pronunciation. 
It's not just Californian _accent_ - it's a register change, like, you know, awesome :D 

6615 - Spoke Chinese when little - is this another category for speaking a language in childhood and then having a different accent when you grow up? 
1237- Basic - this could be interpreted as a register or a fluency level. Choosing to categorise as a fluency level because it co-occurs with Indian, which culturally doesn't have a "Basic" register. 
https://en.wikipedia.org/wiki/Basic_(slang)


DISCARDED as regex artefacts 
12057: {'accent': 'I think', 'count': 1, 'accent_predetermined': False}

"""

"""
All accents coded. 

"""

accent_category_descriptors = ['Country descriptor', 
                               'Supranational region descriptor',
                               'Subnational region descriptor',
                               'City descriptor',
                               'Sub-city descriptor',  # used by the rules below
                               'Ln descriptor',
                               'Vocal quality descriptor',
                               'Accent strength descriptor',
                               'Rhoticity',
                               'Phoneme changes',
                               'Inflection changes',
                               'Register', 
                               'Accent effects due to physical changes',
                               'Unknown or other',
                              ]

accent_category_rules = [
    
    
    
                        ## COUNTRY DESCRIPTORS 
    
                         ('German English', 'Country descriptor' ),
                         ('German', 'Country descriptor' ),
                         ('German Accent', 'Country descriptor' ),
                         ('speak some German', 'Country descriptor' ),
                         ('I am German and speak English as learned at school', 'Country descriptor' ),
    
                        
                         
                         
                         
                         ('Russian', 'Country descriptor' ),
                            ('British English with a little bit of Russian', 'Country descriptor' ),
    
                         ('Ukrainian', 'Country descriptor' ),
                            ('Georgian English', 'Country descriptor' ),
                            ('Finnish', 'Country descriptor' ),
    
  
                         
    
                         ('Northern Irish', 'Country descriptor' ),
                         ('Scottish', 'Country descriptor' ),
                        ('England ', 'Country descriptor' ),
                            
    
   
    
    
                         ('polish', 'Country descriptor' ),
                         ('polish accent', 'Country descriptor' ),
                         ('Polish. Have lived in nine states.', 'Country descriptor' ),
                         ('Polish English', 'Country descriptor' ),
                            
    
    
                            ( 'Italian', 'Country descriptor' ),
    
    
                           
    
                         ('Slovak', 'Country descriptor' ),
                         
                         ('french accent', 'Country descriptor' ),
                         ('minor french accent', 'Country descriptor' ),
                         ('Non native speaker from France', 'Country descriptor' ),
                            
                         ('French mid level accent', 'Country descriptor' ),
    
                        
                         
                         
                         ('Spanish bilingual', 'Country descriptor' ),
                        ('4 years in Spain and Germany', 'Country descriptor' ),
                            
                
    
                         ('Greek', 'Country descriptor'), 
                         
                         
                         
                         ('Swedish accent', 'Country descriptor' ),
                         ('Swedish English', 'Country descriptor' ),
                         
                         
                         
                         
                         
                         ('england', 'Country descriptor' ),
                         
                         
                         ('Thai', 'Country descriptor' ),
                         ('Chinese', 'Country descriptor' ),
                         ('10% Chinese accent', 'Country descriptor' ),
                             
    
                         
    
                         ('Romanian', 'Country descriptor' ),
                         ('Spanish', 'Country descriptor' ),
                         ('Norwegian', 'Country descriptor' ),
                         
                         ('Japanese English', 'Country descriptor' ),
                         ('Japanese', 'Country descriptor' ),
                         
    
    
                         ('Spoke Chinese when little', 'Country descriptor' ),
                         
                         
                         ('Dutch English', 'Country descriptor' ),
                         ('Dutch', 'Country descriptor' ),
                         ('British', 'Country descriptor' ),
                         ('Austrian', 'Country descriptor' ),
                         ('Выраженный украинский акцент', 'Country descriptor' ),
                         ('serbian', 'Country descriptor' ),
    
    
                         ('nigeria english', 'Country descriptor' ),
                         ('Nigerian English', 'Country descriptor' ),
                         ('Nigerian ', 'Country descriptor' ),
                         
    
   
                         ('South African English', 'Country descriptor' ),
    
                         ('Kenyan English', 'Country descriptor' ),
                            ('Kenyan ', 'Country descriptor' ),
                        
                        
                        ("Israeli's accent ", 'Country descriptor' ),
                        ('Israeli accent', 'Country descriptor' ),
    
                        ('International Indian Accent', 'Country descriptor' ),
                        ('Indo-Canadian English', 'Country descriptor' ),
                            
    
                            
                        ('Indonesian English', 'Country descriptor' ),
                        
                            
    
                         
                         ('bangladesh', 'Country descriptor' ),
                            ('Bangladeshi English', 'Country descriptor' ),
                            ('Bangladeshi', 'Country descriptor' ),
                            ('Bangladesh English', 'Country descriptor' ),
                            
    
                        
    
                        ('Kiwi', 'Country descriptor' ),
                        
    ('Colombian Accent', 'Country descriptor' ),
    ('Colombia', 'Country descriptor' ),
    ('Argentinian English', 'Country descriptor' ),
    
                         
                         
                         
                         
                         ## SUBNATIONAL REGION DESCRIPTORS 
                         
                         ('Midwestern', 'Subnational region descriptor' ),
                         ('Midwestern United States English', 'Subnational region descriptor' ),
                         ('midwestern US', 'Subnational region descriptor' ),
                         ('midwest', 'Subnational region descriptor' ),
                         ('Upper Midwestern', 'Subnational region descriptor' ),
                         ('Unite States Midwest', 'Subnational region descriptor' ),
                         ('Minnesotan', 'Subnational region descriptor' ),
                            ('Midwestern US English (United States)', 'Subnational region descriptor' ),
                        ('Midwestern States (Michigan)', 'Subnational region descriptor' ),
                        ('Midwest United States', 'Subnational region descriptor' ),
                        ('Midwest USA speech blended with South Texas USA speech', 'Subnational region descriptor' ),
    
                         ('Mid-west United States English', 'Subnational region descriptor' ),
                        ('American Midwest', 'Subnational region descriptor' ),
    
    
    
                        ( 'Low', 'Subnational region descriptor', 'this co-occurred with mid-western, so I think it means Low MidWestern' ),
    
            
                        
                        
                        
                        
    
    
    
    
    
    
    
    
                         ('Southern Appalachian English', 'Subnational region descriptor' ),
                         ('Okie', 'Subnational region descriptor' ),
                         ('Northern', 'Subnational region descriptor', 'co-occurred with American English so grouped here'),
    
    
    
                         
                                                 
                         
                         ('southern United States', 'Subnational region descriptor' ),
                         ('Southern United States English', 'Subnational region descriptor' ),
                         ('Southwestern United States English', 'Subnational region descriptor' ),
                         ('Southern Texas Accent', 'Subnational region descriptor' ),
                         ('South Texas', 'Subnational region descriptor' ),
                            ('A variety of Texan English with some German influence that has undergone the cot-caught merger', 'Subnational region descriptor' ),
                        
    
    
                         
                
                         ('slighty Southern affected by decades in the Midwest', 'Subnational region descriptor' ),
                         ('northern cali', 'Subnational region descriptor' ),
                         ('Southern Californian', 'Subnational region descriptor' ),
                         ('Silicon Valley Native', 'Subnational region descriptor' ),                         
                        ('Californian Accent', 'Subnational region descriptor' ),
                        ('California', 'Subnational region descriptor' ),
    
    
                         
                         ('new england/east coast', 'Subnational region descriptor' ),
                         ('Philadelphia Style United States English', 'Subnational region descriptor' ),
                         ('Philadelphia', 'Subnational region descriptor' ),
                         ('Pennsylvania', 'Subnational region descriptor' ),
                        ('90% Pennsylvanian accent', 'Subnational region descriptor' ),
                         
    
    
    
                            
    
                         ('United States English Pacific Northwest', 'Subnational region descriptor' ),
                         ('Pacific Northwest ', 'Subnational region descriptor' ),
    
    
                         ('Midatlantic', 'Subnational region descriptor' ),
                        ('Mid-atlantic', 'Subnational region descriptor' ),
                        ('Mid-Atlantic United States English', 'Subnational region descriptor' ),
                        
    
                            
    
    
                         
                         
                         
                         
                         
                         ('yorkshire', 'Subnational region descriptor' ),
                         ('Northern English', 'Subnational region descriptor' ),
                         ('Northern English', 'Subnational region descriptor' ),
                          ('Northern England', 'Subnational region descriptor' ),
                          ('Mild Northern England English', 'Subnational region descriptor' ),
                            ('English north of England ', 'Subnational region descriptor' ),
                            
                         
                         
                            
                         ('Northumbrian British English', 'Subnational region descriptor' ),
                            ('English County Durham', 'Subnational region descriptor' ),
    
    
    
                        ('Midlands English', 'Subnational region descriptor' ),
                        ('Lancashire English', 'Subnational region descriptor' ),
                        ('Lancashire', 'Subnational region descriptor' ),
                        
                        
                        
    
                            
                         ('sussex', 'Subnational region descriptor' ),
                         ('southern english', 'Subnational region descriptor' ),
                         ('Southern England', 'Subnational region descriptor' ),
                         ('South London and Essex', 'Subnational region descriptor' ),
    
    
                         
                         
                         ('southern', 'Subnational region descriptor' ),
                         ('south-west German', 'Subnational region descriptor' ),
                         ('south German / Swiss accent', 'Subnational region descriptor' ),
                         ('English with Swiss german accent', 'Subnational region descriptor' ),
                        
                         ('South German accent', 'Subnational region descriptor' ),
                         ('Alemannic German Accent', 'Subnational region descriptor' ),
                        
    
                         
                         ('With heavy Cantonese accent', 'Subnational region descriptor'), 
    
                         ('Javanese', 'Subnational region descriptor'), 
    
                            
                         
                         ('West Indian', 'Subnational region descriptor' ),
                         ('East Indian', 'Subnational region descriptor' ),
    
    
                         ('East Ukrainian', 'Subnational region descriptor' ),
                            
    
                        ('Catalan', 'Subnational region descriptor' ),
                        
                         
                         
                         ## CITY DESCRIPTORS 
                         
                        
                         
                         ('Sydney - middle eastern seaboard Australian', 'City descriptor'), 
    
    
                      
                          ('London English', 'City descriptor'), 
                           
                            ('london', 'City descriptor'), 
                        
    
    
                            ('Liverpudlian English', 'City descriptor'), 
                            ('Liverpool English', 'City descriptor'), 
    
    
                            
                         ('New York City', 'City descriptor'),
                         ('New Orleans dialect', 'City descriptor'),

                        
                        ('Chicago ', 'City descriptor'),
                        
    
    
    
                        ## SUB-CITY DESCRIPTORS 
    
     ('slight Brooklyn Accent', 'Sub-city descriptor'), 
     ('East London ', 'Sub-city descriptor'), 
       ('South London', 'Sub-city descriptor'), 
                         
                         
                         ## SUPRANATIONAL DESCRIPTORS 
                                   
                         ('Slavic', 'Supranational region descriptor' ),
                         ('European', 'Supranational region descriptor' ),
                            ('European English', 'Supranational region descriptor' ),
    
    
                         ('Latino', 'Supranational region descriptor' ),
                         ('little latino', 'Supranational region descriptor' ),
                            ('Chicano English', 'Supranational region descriptor' ),
                         
    
    
                         ('Eastern European English', 'Supranational region descriptor' ),
                         ('eastern europe', 'Supranational region descriptor' ),
                        ('Eastern European', 'Supranational region descriptor' ),
                        ( 'East European', 'Supranational region descriptor' ),
                        
    
   
    
    
    
    
    
                        
                        
                         
                         
                         ('wolof', 'Supranational region descriptor' ),
                         ('West African ', 'Supranational region descriptor' ),
                         ('East African Khoja', 'Supranational region descriptor' ),
                            ('Afrikaans English', 'Supranational region descriptor' ),
                        
  
    
                         ('Hmong-American', 'Supranational region descriptor' ),
                         
    
                          ('Latin English', 'Supranational region descriptor' ),
                            ('Latin American accent', 'Supranational region descriptor' ),
                           ( 'Hispanic/Latino', 'Supranational region descriptor' ),
    
                        
                         
                         
                         ## L1 OR L2 NATIVE / FLUENCY DESCRIPTORS 
                         
                         ('Non-native', 'Ln descriptor' ),
                         ('Non native speaker', 'Ln descriptor' ),
                         ('Non native', 'Ln descriptor' ),
    
    
                            ('England non-native', 'Ln descriptor' ),
                            
    
                         ('Foreign', 'Ln descriptor' ),
                         ('second language', 'Ln descriptor' ),
                         ( '2nd Language ', 'Ln descriptor' ),
                       
                         
    
                        ('ESL', 'Ln descriptor'),
                        
    
                        ( "A'lo", 'Ln descriptor'),
    
                       
    
                        ## FLUENCY DESCRIPTOR
    
    
    ('fluent', 'Ln descriptor'),
     
    ('Conversational', 'Ln descriptor'),
    
     ('Basic', 'Ln descriptor'),
    
    
    
                         
                         
                         ## VOCAL QUALITY DESCRIPTORS 
                         ('sultry', 'Vocal quality descriptor'),
                         ('Slightly effeminate', 'Vocal quality descriptor'),
                        ('Demure', 'Vocal quality descriptor'),
    
    
    
    
    
                            
                         ## SPECIFIC PHONETIC CHANGES 
                          
                         ("pronounced r's",'Rhoticity'),
                          
                         ('pin/pen merger', 'Phoneme changes'),
                         ('heavy consonants', 'Phoneme changes'),
                    
                         ('Slight lisp', 'Phoneme changes'),
    
                         
                          
                         ('mostly affecting inflection', 'Inflection changes'),
                         
                         ('i have some pronunciation issues because of oral surgery and a hidden southern accent', 'Accent effects due to physical changes'),
                          
                          ## SPOKEN REGISTER
                          
                          ('formal', 'Register'),
                          ('academic', 'Register'),
                          ('Urban', 'Register'),
                          ('United States English. people say I sound like a surffer dude.', 'Register'),
                          ('Midwest US... With some Canadian slang.', 'Register'),
                          
                          ('Educated Australian Accent', 'Register'),
                          ('Educated', 'Register'),
                          ('Cool', 'Register'),
                          ('"Valley Girl" English', 'Register'),
                          
                          
                          
                            
                          
                         ## NOT SURE HOW TO GROUP THESE ONES YET 
                         
                         ('try to maintain originality' , 'Unknown or other' ),
                         ('non regional' , 'Unknown or other' ),
                         ('little bit classy little bit sassy and add some city.....thats me', 'Unknown or other'),
                         ('international', 'Unknown or other'),
                          
                          
                          ## MIX OF ACCENTS
                          
                        ('Mixed-Accent English', 'Unknown or other'),
                          ('Mix of voices', 'Unknown or other'),
                          
                          
                          ## ACCENT WHICH VARIES
                          
                         ('Variable', 'Unknown or other'),
                         ('Adjustable', 'Unknown or other'),
                          
                          
                          ('Transnational englishes blend', 'Unknown or other'),
                        ('International English', 'Unknown or other'),
                            ('CARIBBEAN AND BRITISH MIXED WITH SOME NEW YORK ACCENTS', 'Unknown or other'),
                          
                          
                          
                          
                          
                          ## DIALECT DESCRIPTOR
                          ('Patois', 'Unknown or other'),
                          
                          
                          
                          ## STRENGTH OF ACCENT 
                          ('Not bad', 'Unknown or other'),
                          
                        ]


SyntaxError: invalid character in identifier (2393702440.py, line 400)
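
One way a list of rules like `accent_category_rules` could be applied is to build a lookup keyed on a normalised accent name, so that variants differing only in case or trailing spaces land on the same entry. This is a sketch under assumptions: `toy_rules` is a three-entry excerpt of the rules, and `normalise` and `categorise` are hypothetical helpers that are not part of the notebook.

```python
toy_rules = [
    ('polish', 'Country descriptor'),
    ('Nigerian ', 'Country descriptor'),
    ('Low', 'Subnational region descriptor',
     'this co-occurred with mid-western, so I think it means Low MidWestern'),
]


def normalise(name):
    """Hypothetical helper: trim whitespace and ignore case when matching."""
    return name.strip().casefold()


# Each rule is (accent name, category), optionally followed by a note,
# so only the first two elements are used to build the lookup.
category_lookup = {normalise(rule[0]): rule[1] for rule in toy_rules}


def categorise(name, default='Unknown or other'):
    """Hypothetical helper: return the category coded for an accent name."""
    return category_lookup.get(normalise(name), default)


print(categorise('Polish'))
print(categorise('nigerian'))
print(categorise('Martian'))
```

Normalising before matching is a design choice rather than a requirement. It would fold 'polish' and 'Polish' together, which may or may not be wanted for every pair of variants.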

In [None]:
## Creating linkages between the individual accents and how they are represented in the data. 
## What I want to do here is create a data structure that has the ID of the accent and something to describe the edge: 
## 
## The data structure I think will work here is: 
## 
## { 99: (123, 456, 789)} 
## to represent each combination of accents 

## The data structures we are using are: 
## 
## all_accents_sorted_by_count - this is a Dict of all the *individual accents* with a count and predetermined status
## english_accents_list - each entry is a *string* of accents, comma separated, that requires splitting into individual accents
## 
## what we want to do is go through each list, 
## and find the ID number of the accent 
## from the Dict, 
## then build another data structure that represents the row, and the accents in it. 

## accent list is a String 

accent_nodes = {}
i = 0

for accent_list in english_accents_list:
    
    # split on commas that are not inside parentheses; this regex is from 
    # https://stackoverflow.com/questions/26633452/how-to-split-by-commas-that-are-not-within-parentheses
    accent_list = re.split(r',\s*(?![^()]*\))', accent_list[0])
    
    print('number of items in accent_list is: ', len(accent_list))
    
    # initialise the list first 
    accent_nodes[i] = []
    
    for accent_list_item in accent_list: 
        #print('now processing', accent_list_item)
        
        node_cnt = 0
        
        for accent in all_accents_sorted_by_count.items(): 
            
            if (i % ratio_display == 0): # only show the 100th 
                print('---')
                print('now looking at row: ', accent_list, 'and accent list item: ', accent_list_item, ' and accent: ', accent)
        
            # the Dict values are Accent objects, so compare against the name attribute
            if (accent_list_item == accent[1].name): ## match 
                
                if (i % ratio_display == 0): # only show the 100th 
                    print('---')
                    print('match!')
                    
                print(accent[0])
                print('i is: ', i, ' and node_cnt is: ', node_cnt)
                
                accent_nodes[i].append(accent[0]) # we want the accent ID number
                node_cnt += 1
                
    i += 1 
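
The regex used in the cell above can be checked in isolation. This is a minimal sketch: the `row` string is made up for illustration, using accent names that appear in the printed output earlier in the notebook.

```python
import re

# Split on commas, except commas that sit inside parentheses.
# The negative lookahead rejects a comma when a closing parenthesis
# follows it with no opening parenthesis in between.
SPLIT_PATTERN = re.compile(r',\s*(?![^()]*\))')

row = ('United States English,'
       'India and South Asia (India, Pakistan, Sri Lanka),'
       'wolof')

parts = SPLIT_PATTERN.split(row)

for part in parts:
    print(part)
```

The commas inside the parentheses are kept, so the predetermined accent name survives as a single item and can be matched against the accent names in the Dict.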
                


                


In [None]:
print(accent_nodes)
print(accent_nodes[0])
print(len(accent_nodes[0]))

In [None]:
print(english_accents_list[27])
print(all_accents[2])
print(all_accents[37])
print(all_accents[200])

print(english_accents_list[156])
print(all_accents[133])
print(all_accents[257])


## this is a cross-check
## the other cross check is to make sure that the ID numbers in the lists above match with all_accents. 
## I've done this for a couple just to make sure, e.g.

## 27: [2, 37, 200]
## 156: [133, 257]



## Now that we have node data, we can start to build a picture of the nodes of accents - which accents are more commonly linked to each other? 

In [None]:
## what we want to do here is loop through the accent_nodes Dict 
## and create another Dict that we can use to create Links between the Nodes (which are accents)

accent_links = {} 
accent_link_id = 0

print(len(accent_nodes)) 

for row_id, accent_ids in accent_nodes.items(): 
    print('accent list is: ', accent_ids)
    
    # work on a copy, as we'll be popping elements off, 
    # and we want to leave accent_nodes untouched
    remaining = list(accent_ids)
    
    # we only create links when there is more than one accent in the row
    while len(remaining) > 1: 
        
        popped_element = remaining.pop(0) # remove the first element 
        print('popped element is :', popped_element)
        
        # create one link between the popped element and each remaining element in the list 
        for accent_id in remaining: 
            
            accent_links[accent_link_id] = {
                'source': popped_element,
                'target': accent_id,
                'value': 1, # we can alter this later
            }
            
            print(accent_links[accent_link_id])
            
            # each link needs its own id, otherwise it overwrites the previous one
            accent_link_id += 1
       




In [None]:
print(accent_links)
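
Every link built above carries a `value` of 1. One possible way to "alter this later" is to count how often each pair of accents co-occurs and use that count as the value. This is a sketch only: `toy_nodes` is made-up data in the same shape as `accent_nodes` (the first two rows reuse the ids from the cross-check above), and `pair_counts` and `links` are hypothetical names.

```python
from collections import Counter
from itertools import combinations

# Made-up rows in the same shape as accent_nodes: row number -> accent ids
toy_nodes = {
    0: [2, 37, 200],
    1: [133, 257],
    2: [2, 37],
    3: [9],  # a single accent produces no links
}

pair_counts = Counter()
for accent_ids in toy_nodes.values():
    # sorted(set(...)) makes (2, 37) and (37, 2) count as the same pair
    for pair in combinations(sorted(set(accent_ids)), 2):
        pair_counts[pair] += 1

# One link per distinct pair, weighted by how often the pair co-occurs
links = [
    {'source': source, 'target': target, 'value': value}
    for (source, target), value in pair_counts.items()
]

for link in links:
    print(link)
```

Aggregating like this treats links as undirected, which seems reasonable here because the order in which a contributor lists their accents is not obviously meaningful.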

In [None]:
# Output the dict to JSON 

with open("all_accents.json", "w") as outfile: 
    json.dump(all_accents_sorted_by_count, outfile)
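
Note that `json.dump` can only serialise plain types such as dicts, lists, strings and numbers, so a Dict whose values are objects needs converting first. The following is a sketch under assumptions: `ToyAccent` is a hypothetical dataclass standing in for the notebook's `Accent` class, which may be structured differently.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class ToyAccent:
    """Hypothetical stand-in for the notebook's Accent class."""
    name: str
    count: int
    predetermined: bool


toy_accents = {1: ToyAccent('England English', 2342, True)}

# Convert each object to a plain dict before handing it to the json module
serialisable = {
    accent_id: asdict(accent) for accent_id, accent in toy_accents.items()
}

text = json.dumps(serialisable)
print(text)
```

For a class that is not a dataclass, `vars(accent)` or a hand-written method that returns a dict could play the same role. JSON object keys are always strings, so the integer ids come back as strings when the file is loaded again.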


## Define an Accent class

What I want to do here is define an Accent class, which should make the accent data easier to work with. 
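
As a rough illustration of the shape such a class might take, here is a minimal sketch. `SketchAccent` is a hypothetical name, and its fields and `__str__` format are inferred from the printed output earlier in the notebook; the notebook's own `Accent` class may differ.

```python
class SketchAccent:
    """Hypothetical sketch of an accent record; not the notebook's class."""

    def __init__(self, id, name, count, locale,
                 descriptors=None, predetermined=False):
        self.id = id
        self.name = name
        self.count = count
        self.locale = locale
        self.descriptors = descriptors
        self.predetermined = predetermined

    def __str__(self):
        # Mirrors the wording of the output printed earlier in the notebook
        return (f'id is {self.id}, name is {self.name}, '
                f'count is {self.count}, locale is {self.locale}, '
                f'descriptors are {self.descriptors}, '
                f'predetermined is {self.predetermined}.')


example = SketchAccent(1, 'England English', 2342, 'en', None, True)
print(str(example))
```

Keeping the fields as named attributes is what lets other cells read `accent.name` or `accent.count` instead of indexing into a nested dict.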

In [None]:
accentClassDict = {} # Dict of Accent objects

dummyDescriptor1 = AccentDescriptor(999999, 'dummy1', 'definition1')
dummyDescriptor2 = AccentDescriptor(888888, 'dummy2', 'definition2')

pp.pprint(str(dummyDescriptor1))
pp.pprint(str(dummyDescriptor2))

for accent in all_accents.items(): 
    #pp.pprint(accent)
    # the values in all_accents are Accent objects, so read their attributes
    accentClassDict[accent[0]] = Accent(accent[0], accent[1].name, accent[1].count, 'en', (dummyDescriptor1, dummyDescriptor2), accent[1].predetermined)
    

    
pp.pprint(accentClassDict[12958])
pp.pprint(str(accentClassDict[12958])) # reference the Dict by idx

for idx, item in accentClassDict.items(): 
    #print(idx, item)
    pp.pprint(str(item))