# Working with accent data in the Mozilla Common Voice dataset 

The purpose of this Python Jupyter notebook is to provide some worked examples of how you might explore accent data in the Common Voice dataset. 



## Index of notebook contents 

To make this notebook easier to navigate, each section is indexed below. 

* [Background information on demographic data in Common Voice](#Background)
* [Preparation steps and importing modules](#PreparationSteps) - including the `requirements.txt` you should run if using this notebook. 
* [The Accent and AccentDescriptor classes we will use in the notebook](#Classes)
* [Preparing data from the Common Voice TSV file](#PreparingData)
* [Extracting accent information for data visualisation](#AccentExtraction)
* [Determine which accents are predetermined for selection in the Common Voice profile screen](#PreDetermined)
* [Add Descriptors to each Accent](#Descriptors)

---
<a id='Background'></a>
## Background information on demographic data in Common Voice 

Before you start working with accent data in Common Voice, there is background information you should know about the data structures in the Common Voice datasets, and how accents have been represented. 

### The ability to choose whether or not to specify demographic information 

Data contributors can contribute voice data to Common Voice with our without logging in to the platform. If a data contributor is not logged in, the utterances they record contain no demographic metadata information, such as the gender, age range or accent of the speaker. If the data contributor _does_ log in, then they can choose whether to specify demographic information in their profile. Part of the demographic information can include specifying which accent(s) they speak with. 


Since mid 2021, data contributors to the Common Voice dataset have been able to self-specify descriptors for their accents. 

The purpose of this script is to get demographic details from an MCV downloaded dataset. 
This informs decision making around, for example, how much of the data in a particular language, has demographic details, and if so, what they are. 

---
<a id='PreparationSteps'></a>

## Preparation steps and importing the modules we will use 

@TODO 

make a `requirements.txt` file to install all the dependencies. 

* pandas 


In [87]:
# imports go here 

# io 
import io

# os for file handling 
import os 

# pandas 
import pandas as pd

# regular expressions 
import re

# json 
import json

# string handling for isascii
import string 

# pretty print 
import pprint
pp = pprint.PrettyPrinter(indent=4)

# reload = because I'm developing the CVaccents module as I go, I want to reload it each time so it doesn't cache
from importlib import reload


---
<a id='Classes'></a>
## Accent, AccentDescriptor and AccentCollection classes used for manipulation

In [88]:
## Accent class and AccentDescriptor class 

# these are classes I defined for accent handling
import cvaccents as cva

# do an explicit reload as I'm still working on the classes 
#reload(cva)

# prove that my DocStrings are useful
# they are good, so I am suppressing output while I work through the rest of the doc. 

#print('Module docstring is: \n', cva.__doc__)
#print('---')
#print('Accent docstring is: \n', cva.Accent.__doc__)
#print('---')
#print('AccentDescriptor docstring is: \n', cva.AccentDescriptor.__doc__)
#print('---')
#print('AccentCollection docstring is: \n', cva.AccentDescriptor.__doc__)
#print('---')

---
<a id='PreparingData'></a>
## Preparing the data from the Common Voice dataset TSV file

Here, we extract data from the TSV file, and use `pandas` to perform manipulations on the dataset, such as removing rows that do not contain accent metadata. 

In [89]:
# specify the path to the TSV file - this should be `validated.tsv` from the MCV download 
#filePath = '/media/kathyreid/Seagate Backup Plus Drive/cv-datasets/ru/validated.tsv'
#filePath = '/media/kathyreid/Seagate Backup Plus Drive/cv-datasets/fr/validated.tsv'
#filePath = '/media/kathyreid/Elements/de/validated.tsv'
#filePath = '/media/kathyreid/Elements/es/validated.tsv'
#filePath = '/media/kathyreid/Elements/en/validated.tsv'
#filePath = '/media/kathyreid/Seagate Backup Plus Drive/cv-datasets/en-v9/validated.tsv'
filePath = '../cv-datasets/cv-corpus-11.0-2022-09-21/en/validated.tsv'

# put it into a DataFrame 
df = pd.read_csv(filePath, sep='\t')

In [90]:
df.columns

Index(['client_id', 'path', 'sentence', 'up_votes', 'down_votes', 'age',
       'gender', 'accents', 'locale', 'segment'],
      dtype='object')

In [91]:
# We don't want all the columns, as some of them are not useful for the accent analysis 
# Drop the columns we don't want 

df.drop(labels=['path', 'sentence', 'up_votes', 'down_votes', 'segment', 'locale'], axis='columns', inplace=True)
df.columns



Index(['client_id', 'age', 'gender', 'accents'], dtype='object')

In [92]:
len(df)

1617877

In [93]:
# rows that have accent metadata 
len(df[df['accents'].notna()])

861134

In [94]:
# remove all the rows where accents are not given (NaN)
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html
# DataFrame.dropna(*, axis=0, how=_NoDefault.no_default, thresh=_NoDefault.no_default, subset=None, inplace=False)

df.dropna(axis='index', how='any', subset='accents', inplace=True)
len(df)

# this matches the above figure for rows that have accent metadata, so it's a good cross-check

861134

In [95]:
# number of unique contributors to the dataset 
len(df['client_id'].unique())

14822

In [96]:
# Now that the rows without an accent value have been removed, 
# we want to deduplicate the speaker_id values - because one speaker can speak many utterances
# and we only want to record one accent per speaker 
# and we should end up with the # of rows in the cell above 


# One of the reasons we try and reduce the size of the dataframe 
# first is because this operation is more efficient on a smaller dataframe 
# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html
# DataFrame.drop_duplicates(subset=None, *, keep='first', inplace=False, ignore_index=False)

df.drop_duplicates(subset='client_id', keep='first', inplace=True)
len(df)

# This length should match the length above 

14822

---
<a id='AccentExtraction'></a>
## Extracting the accent data for visualisation 

We have already: 

* Removed any rows where accent data was not available = `NaN`
* De-duplicated based on the `client_id`

So now, we need to extract all the self-styled accents for analysis. 

In [97]:
# They are already unique so we don't need the `.unique` method
english_accents = df['accents']

In [98]:
"""

The list english_accents_list is a list that contains the ORIGINAL accent entries for each speaker. 
In this list, the accents for each speaker are represented as a SINGLE STRING, NOT as a list of strings. 

So, we want to turn this into a LIST of LISTS OF STRINGS, to make it easier for doing data cleaning. 
Each individual string represents one accent descriptor given by a speaker, 
and the list which contains those strings is the grouping of accent descriptors for that speaker. 

We need to preserve the association between accent descriptors - co-references - for later data visualisation. 

"""


english_accents_list = [] 

for idx, accent_string in list(enumerate(english_accents)): 
    
    # this regex is from 
    # https://stackoverflow.com/questions/26633452/how-to-split-by-commas-that-are-not-within-parentheses
    accent_list=re.split(',\s*(?![^()]*\))', accent_string)
    processed_accent_list = accent_list # we don't want to modify a list we're iterating over
    
    #print ('accent_string is: ', accent_string, ' and accent_list is: ', accent_list)
    for idx_a, accent in list(enumerate(accent_list)): 
        
        #print ('idx_a is: ', idx_a, ' and accent is: ', accent, ' and type of accent is: ', type(accent))
        # Trim any whitespace off the elements, because this makes matching on strings harder later on
        # Strings are immutable in Python, so we have to create another string
        processed_accent_list.remove(accent)
        stripped_accent = accent.strip() 
        processed_accent_list.append(stripped_accent)
        
        # Check for any empty strings and remove them - likely regex artefacts
        if not accent: 
            processed_accent_list.remove(accent)
            
        # Check for any non-Latin characters that we may want to investigate 
        # For example, if one of the accents is garbage or deliberate rubbish
        
        if not accent.isascii(): 
            print('flagging that accent: ', accent, ' is not ASCII encoded, may be in another language')
            
    english_accents_list.append(processed_accent_list)

flagging that accent:  Выраженный украинский акцент  is not ASCII encoded, may be in another language
flagging that accent:  Porto des dels 3 anys aprenent anglès a l'escola i actualment m'estic preparant per a l'examen del B2.  is not ASCII encoded, may be in another language


In [99]:
print(len(english_accents_list))

14822


In [100]:
for idx, accent_list in list(enumerate(english_accents_list)): 
    print(idx, accent_list)


0 ['England English', 'United States English']
1 ['Hong Kong English']
2 ['England English']
3 ['United States English']
4 ['United States English', 'wolof']
5 ['England English']
6 ['Australian English']
7 ['United States English']
8 ['Latin America', 'United States English']
9 ['United States English']
10 ['United States English']
11 ['Southern African (South Africa, Zimbabwe, Namibia)']
12 ['United States English']
13 ['United States English']
14 ['India and South Asia (India, Pakistan, Sri Lanka)']
15 ['United States English']
16 ['United States English']
17 ['England English']
18 ['India and South Asia (India, Pakistan, Sri Lanka)']
19 ['England English']
20 ['United States English']
21 ['Southern African (South Africa, Zimbabwe, Namibia)']
22 ['Australian English']
23 ['England English']
24 ['India and South Asia (India, Pakistan, Sri Lanka)']
25 ['United States English']
26 ['United States English', 'Colombia']
27 ['India and South Asia (India, Pakistan, Sri Lanka)']
28 ['India 

---
<a id='Descriptors'></a>
## Add descriptors to each accent

In this section, I apply a set of categories to the accent data. 

**I use a rule-based approach for reproduceability.** 
This could have been done in a spreadsheet, but I'm working in Python so I chose to do it that way. 
This also makes it easier for others applying this work to other languages or to other versions of the dataset. 



### Expand accents that have multiple descriptors in their .name element 

Here, we "break apart" accents that have multiple descriptors in the .name element into **multiple** accents. This is done _programmatically_ to aid in reproduceability. 

Some examples of this that I found during analysis were; 

* 8233 - 'south German / Swiss accent' - needs to be separated into 'South German' and 'Swiss accent'
* 6967 - 'slight Brooklyn Accent' - needs to be separated into "slight" and 'Brooklyn' - as a strength indicator 
* 10142 - 'minor French Accent' - needs to be separated into "minor" as an accent strength indicator, and 'French'
* 9902 - 'little Latino' - needs to be separated into "little" as an accent strength indicator, and "Latino"
* 3422 - 'i have some pronunciation issues because of oral surgery and a hidden southern accent' - needs to be separated into 'hidden southern accent' and 'changes due to Oral surgery'
* 9337 - 'heavy Cantonese' - needs to be separated into accent strength and country
* 1721 - 'United States English combined with European English' = needs to be separated into two accents
* 5365 - 'Sydney - middle eastern seaboard Australian' - should be separated into two accents
* 6615 - 'Spoke Chinese when little' - should be separated into Chinese, and speaking a different language as a child
* 748 - 'Spanish bilingual' - should be separated into 'Spanish' and 'bilingual ' as an Ln marker. 
* 6942 - 'South London and Essex' - should be separated into 'South London' and 'Essex'
* 12055 - 'Some time spent in Scotland' - should be separated into 'Scottish English' and into an Ln marker of "time spent in regional location"
* 773 - 'Silicon Valley Native' - should be separated into 'Silicon Valley' and "native" marker
* 3033 - "Porto des dels 3 anys aprenent anglès a l'escola i actualment m'estic preparant per a l'examen del B2." - this should be split into French, and academic register / L2 status
* 4322 - 'Polish. Have lived in nine states.' - should be separated into Polish and "lived in nine states" as a regional variance indicator or mixed accent indicator
* 5046 - 'Pittsburgh PA' - should be separated into Pittsburgh - city descriptor - and Pennsylvania as a regional descriptor. 
* 58 - 'Non native speaker from France' - should be separated into 'French' and 'Non native speaker'
* 1054 - 'Mild Northern England English' - should be separated into 'Northern England' and 'mild' for the strength of accent
* 11587 - 'Midwestern States (Michigan)' - should be separated into Midwestern and Michigan
* 1317 - 'Midwest USA speech blended with South Texas USA speech' - should be separated into Midwest and South Texas
* 1491 - 'Midwest US... With some Canadian slang. ' - should be separated into Midwest and Canadian Slang as register
* 'International Indian Accent' - should be separated into "Indian" and into "International"
* 2703 - 'Indian with a tinge of an RP accent' - should be separated into "Indian" and "Received Pronunciation" as well as an accent strength marker
* 1011 - 'I am German and speak English as learned at school' - should be separated into "German" and a register marker "as learned at school"
* 4839 - 'French mid level' - should be separated into 'French' and 'mid-level' as a fluency marker
* 1055 - 'England non-native' - should be separated into 'England' and 'Non-native' as a fluency marker
* 5305 - 'Educated Australian Accent' - should be separated into 'Australian' and 'Educated' as a register marker
* 12958 - 'CARIBBEAN AND BRITISH MIXED WITH SOME NEW YORK ACCENTS' - should be separated into three accents 
* 1577 - 'British English with a little bit of Russian' - should be separated into two accents. It's also an accent strength marker.
* 7271 - 'A variety of Texan English with some German influence that has undergone the cot-caught merger' - should be split into Texan, German influence, and a phonetic descriptor
* 6616 - '90% Pennsylvanian accent' - should be separated into Pennsylvanian and an accent strength marker
* 4319 - '4 years in Spain and Germany' - should be separated into French, German and a time or exposure marker
* 6617 - '10% Chinese accent' - should be separated into Chinese and an accent strength marker




### Modify the accent list to expand descriptors while preserving accent co-references

Some of the given accent descriptors contain multiple descriptors in one string. Here, I expand them while maintain co-references. 

For example: 

* `slight Brooklyn accent`  - contains both a City-based descriptor and an accent strength descriptor. 
* `United States English combined with European English` - contains both a national descriptor and a supranational descriptor. 



It's easier to do this before we create Accent and AccentDescriptor objects. 

In [101]:
# helper function for the below 

def update_list_coreference(list_to_be_updated, old_entry, new_entry):
    # the accent list is a list of lists so we need to iterate through each one to find the element to update. 
    
    for idx, accent_list in list(enumerate(list_to_be_updated)):
        
        #print(idx, '- old accent_list is: ', accent_list)
        new_accent_list = accent_list
        
        for accent in accent_list:  
            match = False
            #print ('accent is: ', accent, ' and old_entry is: ', old_entry)
            
            if accent == old_entry:
                match = True 
                #print('accent is: ', accent, ' and old_entry is: ', old_entry, 'and new_entry is: ', new_entry)
                #print('accent is: ', accent, ' which matches ', old_entry)
                new_accent_list.remove(old_entry)
                
                for entry in new_entry: # there may be more than one
                    new_accent_list.append(entry)
                    #print('appending new entry: ', entry)
                
            if match: 
                print ('processed ', old_entry, ' to be ', new_entry, ' and the old accent list is: ', accent_list, ' and the new accent list is: ', new_accent_list)

        # recreate the list from keys, this removes duplicates
        # for example, a duplicate may be created due to normalisation or merger of accents 
        new_accent_list = list(dict.fromkeys(new_accent_list))
        
        #print('new_accent_list is: ', new_accent_list)
        #print('---')
        
        #print('removing: ', accent_list)
        list_to_be_updated.remove(accent_list)
        #print('appending: ', newlist)
        list_to_be_updated.append(new_accent_list)
        
    
    return(list_to_be_updated)

In [102]:
#'south German / Swiss accent' - needs to be separated into 'South German' and 'Swiss'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'south German / Swiss accent', 
                                               ['South German', 'Swiss'])

#'English with Swiss german accent - needs to be separated into 'Swiss German'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'English with Swiss german accent', 
                                               ['Swiss German'])

# 'slight Brooklyn Accent' - needs to be separated into "slight" and 'Brooklyn' - as a strength indicator  
english_accents_list = update_list_coreference(english_accents_list, 
                                               'slight Brooklyn Accent', 
                                               ['Brooklyn Accent', 'slight'])

# 'minor French Accent' - needs to be separated into "minor" as an accent strength indicator, and 'French'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'minor French Accent', 
                                               ['French', 'minor'])

# 'minor French Accent' - needs to be separated into "minor" as an accent strength indicator, and 'French'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'minor french accent', 
                                               ['French', 'minor'])

# French mid level accent - needs to be separated into "mid level" as an accent strength indicator, and 'French'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'French mid level accent', 
                                               ['French', 'mid level'])

# 'little Latino' - needs to be separated into "Latino" as an accent and 'little' as an accent strength indicator
english_accents_list = update_list_coreference(english_accents_list, 
                                               'little Latino', 
                                               ['Latino', 'little'])

# 'little Latino' - needs to be separated into "Latino" as an accent and 'little' as an accent strength indicator
english_accents_list = update_list_coreference(english_accents_list, 
                                               'little latino', 
                                               ['Latino', 'little'])

# 'heavy Cantonese' - needs to be separated into accent strength and country   
english_accents_list = update_list_coreference(english_accents_list, 
                                               'heavy Cantonese', 
                                               ['Cantonese', 'heavy'])

# 'United States English combined with European English' - needs to be separated into two accents
english_accents_list = update_list_coreference(english_accents_list, 
                                               'United States English combined with European English', 
                                               ['United States English', 'European English'])

# 'United States English Pacific Northwest' - needs to be separated into two accents
english_accents_list = update_list_coreference(english_accents_list, 
                                               'United States English Pacific Northwest', 
                                               ['United States English', 'Pacific Northwest'])
                                               
# 'Sydney - middle eastern seaboard Australian' - should be separated into two accent
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Sydney - middle eastern seaboard Australian', 
                                               ['Sydney', 'Middle eastern seaboard Australian'])
        
# 'Spoke Chinese when little' - should be separated into Chinese, and speaking a different language as a child
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Spoke Chinese when little', 
                                               ['Chinese', 'Spoke language when a child'])        
    
# 'Spanish bilingual' - should be separated into 'Spanish' and 'bilingual ' as an Ln marker. 
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Spanish bilingual', 
                                               ['Spanish', 'Bilingual']) 

# 'South London and Essex' - should be separated into 'South London' and 'Essex'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'South London and Essex' , 
                                               ['South London', 'Essex']) 

# 'Some time spent in Scotland' - should be separated into 'Scottish English' and into an Ln marker of "time spent in regional location"
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Some time spent in Scotland' , 
                                               ['Scottish English', 'some time spent in location']) 

# 'Silicon Valley Native' - should be separated into 'Silicon Valley' and "native" marker
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Silicon Valley Native' , 
                                               ['Silicon Valley', 'native']) 

# "Porto des dels 3 anys aprenent anglès a l'escola i actualment m'estic preparant per a l'examen del B2." - this should be split into French, and academic register / L2 status
english_accents_list = update_list_coreference(english_accents_list, 
                                               "Porto des dels 3 anys aprenent anglès a l'escola i actualment m'estic preparant per a l'examen del B2." , 
                                               ['French', 'academic', 'second language']) 


# Выраженный украинский акцент - this should be separated into 'Ukrainian' and 'pronounced'
# it literally means 'pronounced Ukrainian accent'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Выраженный украинский акцент' , 
                                               ['Ukrainian', 'pronounced']) 


# 'Polish. Have lived in nine states.' - should be separated into Polish and "lived in nine states" as a regional variance indicator or mixed accent indicator
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Polish. Have lived in nine states.',
                                               ['Polish', 'mixed']) 

# 'Pittsburgh PA' - should be separated into Pittsburgh - city descriptor - and Pennsylvania as a regional descriptor.
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Pittsburgh PA',
                                               ['Pittsburgh', 'Pennsylvania']) 

# 'Non native speaker from France' - should be separated into 'French' and 'Non native speaker'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Non native speaker from France',
                                               ['French', 'Non native']) 

# 'Mild Northern England English' - should be separated into 'Northern England' and 'mild' for the strength of accent
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Mild Northern England English',
                                               ['Northern English', 'Mild']) 

# 'Midwestern States (Michigan)' - should be separated into Midwestern and Michigan
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Midwestern States (Michigan)',
                                               ['Midwestern United States', 'Michigan']) 

# Midwest US... With some Canadian slang - should separated into Midwestern United States, Canadian English and 'slang' as a register
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Midwest US... With some Canadian slang.',
                                               ['Midwestern United States', 'Canadian English', 'slang']) 

# 'Midwest USA speech blended with South Texas USA speech' - should be separated into Midwest and South Texas
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Midwest USA speech blended with South Texas USA speech',
                                               ['Midwestern United States', 'South Texas']) 

# 'slighty Southern affected by decades in the Midwest' - should be separated into Southern United States, Midwestern United States and time spent in location
english_accents_list = update_list_coreference(english_accents_list, 
                                               'slighty Southern affected by decades in the Midwest',
                                               ['Midwestern United States', 'Southern United States', 'time spent in location']) 


# 'United States English. people say I sound like a surffer dude.' - should be separated into 'United States English' and 'surfer' as a register
english_accents_list = update_list_coreference(english_accents_list, 
                                               'United States English. people say I sound like a surffer dude.',
                                               ['United States English', 'surfer']) 


# 'new england/east coast' - should be separated into New England and East coast 
english_accents_list = update_list_coreference(english_accents_list, 
                                               'new england/east coast',
                                               ['New England', 'East coast']) 

# 'little bit classy little bit sassy and add some city.....thats me' - should be separated into 'United States English' and 'surfer' as a register
english_accents_list = update_list_coreference(english_accents_list, 
                                               'little bit classy little bit sassy and add some city.....thats me',
                                               ['little bit', 'classy', 'sassy', 'city']) 

# 'International Indian Accent' - should be separated into "Indian" and into "International"
english_accents_list = update_list_coreference(english_accents_list, 
                                               'International Indian Accent',
                                               ['India and South Asia (India, Pakistan, Sri Lanka)', 'International']) 


# 'Indian with a tinge of an RP accent' - should be separated into "Indian" and "Received Pronunciation" as well as an accent strength marker
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Indian with a tinge of an RP accent',
                                               ['India and South Asia (India, Pakistan, Sri Lanka)', 'Received Pronunciation', 'tinge']) 

# 'I am German and speak English as learned at school' - should be separated into "German" and a register marker "as learned at school"
english_accents_list = update_list_coreference(english_accents_list, 
                                               'I am German and speak English as learned at school',
                                               ['German', 'academic', 'second language']) 

# 'speak some German' - should be separated into 'German' and an accent 
english_accents_list = update_list_coreference(english_accents_list, 
                                               'speak some German',
                                               ['German', 'some'])

# 'French mid level' - should be separated into 'French' and 'mid-level' as a fluency marker
english_accents_list = update_list_coreference(english_accents_list, 
                                               'French mid level',
                                               ['French', 'Mid-level']) 


# 'England non-native' - should be separated into 'England' and 'Non-native' as a fluency marker
english_accents_list = update_list_coreference(english_accents_list, 
                                               'England non-native',
                                               ['England English', 'Non native'])


# 'Educated Australian Accent' - should be separated into 'Australian' and 'Educated' as a register marker
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Educated Australian Accent',
                                               ['Australian English', 'Educated'])

# 'CARIBBEAN AND BRITISH MIXED WITH SOME NEW YORK ACCENTS' - should be separated into three accents 
english_accents_list = update_list_coreference(english_accents_list, 
                                               'CARIBBEAN AND BRITISH MIXED WITH SOME NEW YORK ACCENTS',
                                               ['West Indies and Bermuda (Bahamas, Bermuda, Jamaica, Trinidad)', 'England English', 'New York'])

# 'British English with a little bit of Russian' - should be separated into two accents. 
# It's also an accent strength marker.
english_accents_list = update_list_coreference(english_accents_list, 
                                               'British English with a little bit of Russian',
                                               ['Russian', 'England English', 'little bit'])

# 'A variety of Texan English with some German influence that has undergone the cot-caught merger' 
#- should be split into Texan, German influence, and a phonetic descriptor
english_accents_list = update_list_coreference(english_accents_list, 
        'A variety of Texan English with some German influence that has undergone the cot-caught merger',
        ['Texas', 'German', 'some', 'cot-caught merger'])

# '90% Pennsylvanian accent' - should be separated into Pennsylvanian and an accent strength marker
english_accents_list = update_list_coreference(english_accents_list, 
                                               '90% Pennsylvanian accent',
                                               ['Pennsylvania', '90%'])

# '4 years in Spain and Germany' - should be separated into French, German and a time or exposure marker
english_accents_list = update_list_coreference(english_accents_list, 
                                               '4 years in Spain and Germany',
                                               ['Spanish', 'German', 'some time spent in location'])

# '10% Chinese accent' - should be separated into Chinese and an accent strength marker
english_accents_list = update_list_coreference(english_accents_list, 
                                               '10% Chinese accent',
                                               ['Chinese', '10%'])

# 'With heavy Cantonese accent' - should be separated into Cantonese and 'heavy' as an accent strength marker
english_accents_list = update_list_coreference(english_accents_list, 
                                               'With heavy Cantonese accent',
                                               ['heavy', 'Cantonese'])

# 'i have some pronunciation issues because of oral surgery and a hidden southern accent' - needs to be separated into 'hidden southern accent' and 'changes due to Oral surgery'
english_accents_list = update_list_coreference(english_accents_list, 
                'i have some pronunciation issues because of oral surgery and a hidden southern accent',
                ['Southern United States', 'changes due to oral surgery'])


# 'Indo-Canadian English' - needs to be separated into the canonical forms for Indian and Canadian
english_accents_list = update_list_coreference(english_accents_list, 
                'Indo-Canadian English',
                ['Canadian English', 'India and South Asia (India, Pakistan, Sri Lanka)'])


# Hmong-American - needs to be separated into United States and Hmong 
english_accents_list = update_list_coreference(english_accents_list, 
                'Hmong-American',
                ['United States English', 'Hmong'])



processed  south German / Swiss accent  to be  ['South German', 'Swiss']  and the old accent list is:  ['England English', 'South German', 'Swiss']  and the new accent list is:  ['England English', 'South German', 'Swiss']
processed  English with Swiss german accent  to be  ['Swiss German']  and the old accent list is:  ['Swiss German']  and the new accent list is:  ['Swiss German']
processed  slight Brooklyn Accent  to be  ['Brooklyn Accent', 'slight']  and the old accent list is:  ['United States English', 'Canadian English', 'Brooklyn Accent', 'slight']  and the new accent list is:  ['United States English', 'Canadian English', 'Brooklyn Accent', 'slight']
processed  minor french accent  to be  ['French', 'minor']  and the old accent list is:  ['French', 'minor']  and the new accent list is:  ['French', 'minor']
processed  French mid level accent  to be  ['French', 'mid level']  and the old accent list is:  ['french accent', 'French', 'mid level']  and the new accent list is:  ['fre

processed  speak some German  to be  ['German', 'some']  and the old accent list is:  ['4 years in Spain and Germany', 'Spanish', 'Polish', 'mixed', 'Midwestern United States', 'Southern United States', 'time spent in location', 'German', 'some']  and the new accent list is:  ['4 years in Spain and Germany', 'Spanish', 'Polish', 'mixed', 'Midwestern United States', 'Southern United States', 'time spent in location', 'German', 'some']
processed  England non-native  to be  ['England English', 'Non native']  and the old accent list is:  ['England English', 'Non native']  and the new accent list is:  ['England English', 'Non native']
processed  Educated Australian Accent  to be  ['Australian English', 'Educated']  and the old accent list is:  ['Australian English', 'Australian English', 'Educated']  and the new accent list is:  ['Australian English', 'Australian English', 'Educated']
processed  CARIBBEAN AND BRITISH MIXED WITH SOME NEW YORK ACCENTS  to be  ['West Indies and Bermuda (Bahama

### Normalize closely related accent descriptors - merge them 

There are several closely related accent descriptors, and here I merge them. 

The principles I use are: 
    
* Accents are merged where there are spelling variations 
* Accents are merged where the accent has a region descriptor with our without 'accent' - such as "French" and "French accent"
* Where a country or language descriptor and demonym are closely equivalent - "Germany" and "German"

Accents are not merged where: 

* One accent descriptor is more granular than another - "London" and "South London" are not merged. 

In [103]:
# I can just use the same function

## There will be others as we put them into objects / classes


# Midwestern - canonical is 'Midwestern United States'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Midwestern',
                                               ['Midwestern United States'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Midwestern United States English',
                                               ['Midwestern United States'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'midwestern US',
                                               ['Midwestern United States'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'midwest',
                                               ['Midwestern United States'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Unite States Midwest',
                                               ['Midwestern United States'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Midwestern US English (United States)',
                                               ['Midwestern United States'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Midwestern US English (United States)',
                                               ['Midwest United States'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Mid-west United States English',
                                               ['Midwest United States'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'American Midwest',
                                               ['Midwest United States'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Midwestern United States',
                                               ['Midwest United States'])



# Southern United States - canonical is 'Southern United States'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'southern United States',
                                               ['Southern United States'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'southern',
                                               ['Southern United States'])


# Mid-atlantic - canonical is 'Mid-atlantic United States'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Mid-atlantic',
                                               ['Mid-atlantic United States'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Mid-Atlantic United States English',
                                               ['Mid-atlantic United States'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Midatlantic',
                                               ['Mid-atlantic United States'])


# Philadelphia - canonical is 'Philadelphia'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Philadelphia Style United States English',
                                               ['Philadelphia'])



# California - canonical is 'California'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Californian Accent',
                                               ['California'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Cali',
                                               ['California'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Californian',
                                               ['California'])

# Northern California - as distinct from California - canonical is 'Northern California'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'northern cali',
                                               ['Northern California'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Northern Californian',
                                               ['Northern California'])

# Southern California - as distinct from California - canonical is 'Southern California'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Southern Californian',
                                               ['Southern California'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Southern Cali',
                                               ['Southern California'])


# British - canonical is 'British'



# England - canonical is 'England English'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'england',
                                               ['England English'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'England',
                                               ['England English'])

# Scottish - canonical is 'Scottish English'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Scottish',
                                               ['Scottish English'])

# Northern England - canonical is 'Northern England'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Northern English',
                                               ['Northern England'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'English north of England',
                                               ['Northern England'])

# Durham - as distinct from Northern England - canonical is 'Durham'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'English County Durham',
                                               ['Durham'])

# Yorkshire - as distinct from Northern England - canonical is 'Yorkshire'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'yorkshire',
                                               ['Yorkshire'])

# Southern England - canonical is 'Southern England'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'southern english',
                                               ['Southern England'])


# Sussex - as distinct from Southern England - canonical is 'Sussex'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'sussex',
                                               ['Sussex'])

# London - canonical is 'London'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'London English',
                                               ['London'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'london',
                                               ['London'])


# Liverpool - canonical is 'Liverpool'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Liverpool English',
                                               ['Liverpool'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Liverpudlian English',
                                               ['Liverpool'])

# Lancashire - canonical is 'Lancashire'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Lancashire English',
                                               ['Lancashire'])

# Dutch - canonical is 'Dutch'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Dutch English',
                                               ['Dutch'])

# Swedish - canonical is 'Swedish'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'swedish',
                                               ['Swedish'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'swedish english',
                                               ['Swedish'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'swedish',
                                               ['Swedish'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Swedish English',
                                               ['Swedish'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Swedish accent',
                                               ['Swedish'])


# German - canonical is 'German'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'German Accent',
                                               ['German'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'German English',
                                               ['German'])

# South German - as distinct from German - canonical is 'South German'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'South German accent',
                                               ['South German'])

# South West German - as distinct from German 
english_accents_list = update_list_coreference(english_accents_list, 
                                               'south-west German',
                                               ['South West German'])

# French - canonical is 'French'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'french accent',
                                               ['French'])

# European - as distinct from Eastern European - canonical is 'European
english_accents_list = update_list_coreference(english_accents_list, 
                                               'European English',
                                               ['European'])

# Eastern Europe - canonical is 'Eastern European'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'eastern european English',
                                               ['Eastern European'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'eastern Europe',
                                               ['Eastern European'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'eastern europe',
                                               ['Eastern European'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'East European',
                                               ['Eastern European'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Eastern European English',
                                               ['Eastern European'])


# Polish - canonical is 'Polish'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Polish English',
                                               ['Polish'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'polish',
                                               ['Polish'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'polish accent',
                                               ['Polish'])


# 'serbian' - canonical is 'Serbian'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'serbian',
                                               ['Serbian'])

# Israeli - canonical is 'Israeli' 
english_accents_list = update_list_coreference(english_accents_list, 
                                               "Israeli's accent",
                                               ['Israeli'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Israeli accent',
                                               ['Israeli'])

# Nigerian - canonical is 'Nigerian'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Nigerian English',
                                               ['Nigerian'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'nigeria english',
                                               ['Nigerian'])

# Kenyan - canonical is 'Kenyan'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Kenyan English',
                                               ['Kenyan'])
# Wolof - canonical is 'Wolof'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'wolof',
                                               ['Wolof'])

# 'South African English' - canonical is 'Southern African (South Africa, Zimbabwe, Namibia)'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'South African English',
                                               ['Southern African (South Africa, Zimbabwe, Namibia)'])


# Latin American - canonical is 'Latino'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Latin American',
                                               ['Latino'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Latin American accent',
                                               ['Latino'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Latin America',
                                               ['Latino'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Hispanic/Latino',
                                               ['Latino'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Latin English',
                                               ['Latino'])

# Colombian - canonical is 'Colombian'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Colombian Accent',
                                               ['Colombian'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Colombia',
                                               ['Colombian'])



# Japanese - canonical is 'Japanese'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Japanese English',
                                               ['Japanese'])


#Bangladeshi - canonical is 'Bangladeshi'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Bangladesh',
                                               ['Bangladeshi'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'bangladesh',
                                               ['Bangladeshi'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Bangladesh English',
                                               ['Bangladeshi'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Bangladeshi English',
                                               ['Bangladeshi'])



# Non-native speaker descriptions - canonical is 'Non-native speaker'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Non-native',
                                               ['Non-native speaker'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Non native speaker',
                                               ['Non-native speaker'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Non native',
                                               ['Non-native speaker'])


# Second language descriptions - canonical is 'Second language'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'ESL',
                                               ['Second language'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'second language',
                                               ['Second language'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               '2nd Language',
                                               ['Second language'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'English as a Second Language',
                                               ['Second language'])

# International accent descriptions - canonical is 'International English'

english_accents_list = update_list_coreference(english_accents_list, 
                                               'International',
                                               ['International English'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'international',
                                               ['International English'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Transnational englishes blend',
                                               ['International English'])

# Mix of accents - canonical is 'Mix of accents'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Mix of voices',
                                               ['Mix of accents'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Mixed-Accent English',
                                               ['Mix of accents'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'mixed',
                                               ['Mix of accents'])


# Time spent in location - canonical is 'time spent in location'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'some time spent in location',
                                               ['time spent in location'])


processed  Midwestern  to be  ['Midwestern United States']  and the old accent list is:  ['United States English', 'Midwestern United States']  and the new accent list is:  ['United States English', 'Midwestern United States']
processed  Midwestern  to be  ['Midwestern United States']  and the old accent list is:  ['United States English', 'Midwestern United States']  and the new accent list is:  ['United States English', 'Midwestern United States']
processed  Midwestern  to be  ['Midwestern United States']  and the old accent list is:  ['United States English', 'Minnesotan', 'Midwestern United States']  and the new accent list is:  ['United States English', 'Minnesotan', 'Midwestern United States']
processed  Midwestern  to be  ['Midwestern United States']  and the old accent list is:  ['United States English', 'Low', 'Demure', 'Midwestern United States']  and the new accent list is:  ['United States English', 'Low', 'Demure', 'Midwestern United States']
processed  Midwestern United S

processed  Californian Accent  to be  ['California']  and the old accent list is:  ['United States English', 'California']  and the new accent list is:  ['United States English', 'California']
processed  northern cali  to be  ['Northern California']  and the old accent list is:  ['Northern California']  and the new accent list is:  ['Northern California']
processed  Southern Californian  to be  ['Southern California']  and the old accent list is:  ['United States English', 'Southern California']  and the new accent list is:  ['United States English', 'Southern California']
processed  england  to be  ['England English']  and the old accent list is:  ['london', 'academic', 'England English']  and the new accent list is:  ['london', 'academic', 'England English']
processed  England  to be  ['England English']  and the old accent list is:  ['England English', 'Lancashire', 'England English']  and the new accent list is:  ['England English', 'Lancashire', 'England English']
processed  Scott

processed  Israeli accent  to be  ['Israeli']  and the old accent list is:  ['United States English', 'Israeli']  and the new accent list is:  ['United States English', 'Israeli']
processed  Nigerian English  to be  ['Nigerian']  and the old accent list is:  ['Nigerian']  and the new accent list is:  ['Nigerian']
processed  nigeria english  to be  ['Nigerian']  and the old accent list is:  ['Nigerian']  and the new accent list is:  ['Nigerian']
processed  Kenyan English  to be  ['Kenyan']  and the old accent list is:  ['United States English', 'Kenyan']  and the new accent list is:  ['United States English', 'Kenyan']
processed  wolof  to be  ['Wolof']  and the old accent list is:  ['United States English', 'Wolof']  and the new accent list is:  ['United States English', 'Wolof']
processed  South African English  to be  ['Southern African (South Africa, Zimbabwe, Namibia)']  and the old accent list is:  ['Southern African (South Africa, Zimbabwe, Namibia)']  and the new accent list is:

## Extract unique accents from the list of normalised accents into a Dict of Accent objects for easier manipulation

In [104]:
# build a dict of each unique accent using an Accent object for each object. 

ratio_display = 120 # to stop the browser crashing 

AccentDict = {}
i = 0; 

# the english_accents_list is now normalised, merged etc so this is straightforward 
for accent_list in english_accents_list:
    for accent in accent_list: 
        
        i +=1
        match = False 
        count = 0
        
        #if (i%ratio_display ==0): # only show the 100th 
            #print('')
            #print('---')
            #print('now processing: ', accent, ' - ', i)
            #print('---')
        
        # is this accent in our dict - if not, add it in 
        
        for item in AccentDict.items() : # Each item should be an Accent object 
            
            # (self, id=0, name="Accent Name", count=0, locale=None, descriptors=None):
            #pp.pprint(item[1].__str__())
            
            #if (i%ratio_display ==0): # only show the 100th 
                #print('item is: ', item)
                #print(type(item))
                #print('now checking match for: item:', item[1], ' and accent: ', accent)
            
            if (item[1].name == accent) : # update the count
                
                #if (i%ratio_display ==0): # only show the 100th 
                    #print('---')
                    #print('match is True')
                    #print('---')
                    
                match = True 
                
                #if (i%ratio_display ==0): # only show the 100th 
                    #print('accent count was: ', item[1].count)
                
                # update the count of the accent 
                item[1].count+=1
                
                #if (i%ratio_display ==0): # only show the 100th 
                    #print('accent count is now: ', item[1].count)
                
                
        # this match loop has to be outside the for: loop above 
        # because if we add items to the dict inside the loop
        # then it will not run - because there are zero items in the dict to begin with 
        
        if (not match) :   
            
            # (self, id=0, name="Accent Name", count=0, locale=None, descriptors=None):
            AccentDict[i] = cva.Accent(i, accent, 1, 'en', None, False) 
                


In [105]:
# do an explicit reload as I'm still working on the classes 
#reload(cva)

all_accents = cva.AccentCollection(AccentDict)

In [106]:
print(all_accents.total())

166


In [107]:
all_accents.__str__()

'id is 1, name is England English, count is 2346, locale is en, descriptors are None, predetermined is False. id is 2, name is United States English, count is 7537, locale is en, descriptors are None, predetermined is False. id is 3, name is Hong Kong English, count is 132, locale is en, descriptors are None, predetermined is False. id is 7, name is Wolof, count is 1, locale is en, descriptors are None, predetermined is False. id is 9, name is Australian English, count is 665, locale is en, descriptors are None, predetermined is False. id is 12, name is Latino, count is 7, locale is en, descriptors are None, predetermined is False. id is 15, name is Southern African (South Africa, Zimbabwe, Namibia), count is 261, locale is en, descriptors are None, predetermined is False. id is 18, name is India and South Asia (India, Pakistan, Sri Lanka), count is 2010, locale is en, descriptors are None, predetermined is False. id is 31, name is Colombian, count is 2, locale is en, descriptors are N

In [108]:
# do an explicit reload as I'm still working on the classes 
#reload(cva)

all_accents_sortedByCount = all_accents.sortByCount(reverse=False)

for accent in all_accents_sortedByCount.items(): 
    print(accent[1].__str__())
    
# now I am cross-checking to see if there are any other duplicates or accents that should be merged


id is 7056, name is "Valley Girl" English, count is 1, locale is en, descriptors are None, predetermined is False.
id is 6641, name is 10%, count is 1, locale is en, descriptors are None, predetermined is False.
id is 6640, name is 90%, count is 1, locale is en, descriptors are None, predetermined is False.
id is 104, name is A'lo, count is 1, locale is en, descriptors are None, predetermined is False.
id is 1142, name is Adjustable, count is 1, locale is en, descriptors are None, predetermined is False.
id is 2846, name is Afrikaans English, count is 1, locale is en, descriptors are None, predetermined is False.
id is 13125, name is Alemannic German Accent, count is 1, locale is en, descriptors are None, predetermined is False.
id is 90, name is Argentinian English, count is 1, locale is en, descriptors are None, predetermined is False.
id is 1244, name is Basic, count is 1, locale is en, descriptors are None, predetermined is False.
id is 751, name is Bilingual, count is 1, locale is

---
<a id='PreDetermined'></a>
## Label the accents that were pre-determined 

Since its inception, Mozilla Common Voice has enabled data contributors to enter demographic age such as age, gender and accent. These associations are not validated in any way, and we don't have any indicator of how accurate they are. Accent _used_ to be represented as an a priori drop-down list, which the contributor could select from. From Common Voice v10, the data contributor can **self-describe** their accent, however, the previous accent list is still presented (so may be more frequently chosen by the data contributor). We need to be able to distinguish these accents visually to help with the exploration. 

```
"splits": {
        "accent": {
          "": 0.51,
          "canada": 0.03,
          "england": 0.08,
          "us": 0.23,
          "indian": 0.07,
          "australia": 0.03,
          "malaysia": 0,
          "newzealand": 0.01,
          "african": 0.01,
          "ireland": 0.01,
          "philippines": 0,
          "singapore": 0,
          "scotland": 0.02,
          "hongkong": 0,
          "bermuda": 0,
          "southatlandtic": 0,
          "wales": 0,
          "other": 0.01
        },

```

The `cv-datasets` splits above have labels for the accents that don't actually match the accent name in the data. So we need to specify the accents that are pre-determined. This is how they appear to the data contributor filling out their profile at: [https://commonvoice.mozilla.org/en/profile/info](https://commonvoice.mozilla.org/en/profile/info)


![Accents as specified on Mozilla Common Voice profile](cv-profile-specify-accent.png)


In [109]:
# create a list of the pre-existing accents 
# this is how they are given in the dataset. 

# TODO: for better maintainability, move this to a list of accents for each language, 
# that can be updated in a separate file, rather than specified here in an adhoc way. 

predetermined_accents_list = ['United States English', 
                         'England English', 
                         'India and South Asia (India, Pakistan, Sri Lanka)', 
                         'Canadian English', 
                         'Australian English', 
                         'Southern African (South Africa, Zimbabwe, Namibia)', 
                         'Irish English', 
                         'Scottish English', 
                         'New Zealand English', 
                         'Hong Kong English', 
                         'Filipino', 
                         'Malaysian English', 
                         'Singaporean English', 
                         'Welsh English', 
                         'West Indies and Bermuda (Bahamas, Bermuda, Jamaica, Trinidad)', 
                         'South Atlantic (Falkland Islands, Saint Helena)']



  

In [110]:
# use the predetermined_accents_list to populate the 'predetermined_status' attribute of each Accent object 
# to do this we use a method on the AccentCollection class

import cvaccents as cva
#reload(cva)

all_accents.updatePredeterminedStatus(predetermined_accents_list, True)

print(all_accents)






    



(1, <cvaccents.Accent object at 0x7fcc1d177220>)
1
changed  id is 1, name is England English, count is 2346, locale is en, descriptors are None, predetermined is True. status to  True
(2, <cvaccents.Accent object at 0x7fcc1d1775b0>)
2
changed  id is 2, name is United States English, count is 7537, locale is en, descriptors are None, predetermined is True. status to  True
(3, <cvaccents.Accent object at 0x7fcc1d177280>)
3
changed  id is 3, name is Hong Kong English, count is 132, locale is en, descriptors are None, predetermined is True. status to  True
(7, <cvaccents.Accent object at 0x7fcc1d177610>)
7
(9, <cvaccents.Accent object at 0x7fcc1d177370>)
9
changed  id is 9, name is Australian English, count is 665, locale is en, descriptors are None, predetermined is True. status to  True
(12, <cvaccents.Accent object at 0x7fcc1d1770d0>)
12
(15, <cvaccents.Accent object at 0x7fcc1d1774f0>)
15
changed  id is 15, name is Southern African (South Africa, Zimbabwe, Namibia), count is 261, local

---
<a id='AccentDescriptors'></a>

## Create Accent Descriptors and add them to each Accent 

Each Accent can have multiple Accent Descriptors. 

For example the accent `Pronounced German` contains both a _national regional descriptor_ and an _accent strength descriptor_. 

I have used the following principles for Accent Descriptors for English. This should be considered a CodeBook. 

### Geographic Regional Descriptors 

Regional descriptors are where the accent has been specified with reference to a geographic region. 

* `Geographic Region` 

Within this Category, there are several sub-categories: 

* `Country Descriptor` - where the descriptor is a country or a nation-state. 
* `Supranational region descriptor` - where the descriptor is a geographic region that crosses or overlaps multiple countries. An example would be `Slavic`, which refers to an [ethno-linguistic group](https://en.wikipedia.org/wiki/Slavic_languages) that covers several countries in Eastern Europe. 
* `Subnational region descriptor` - where the descriptor is a geographic region that refers to a region within a country's national boundary. An example would be `Midwestern United States`. 
* `City descriptor` - where the descriptor is a geographic region that refers to a city, town or municipality. An example would be `New York City` or `London`. 

One choice I have made here is not to represent areas _within_ cities using a separate Accent Descriptor. Examples here would be `Brooklyn` or `East London` - they have been classified as cities. This is because there are so few of them that it doesn't change the analysis significantly. 

### First or other language descriptor

This refers to Accent Descriptors where the data contributor refers to their accent using a descriptor such as `non-native` or `native speaker`. This is sometimes referred to as `first language (L1)` or `second language (L2)`. Although this _may_ be used to refer to the data contributor's _level of fluency_ in a language, I've chosen not to refer to this as a _level of fluency_ - because even though someone speaks a language as a second or other language, this _does not imply_ their level of fluency specifically. One could speak Mandarin as a second language, but be highly fluent. One could speak French as a second language and be less than proficient. 

* `First or other language` 

### Accent strength descriptor 

This refers to Accent Descriptors where the data contributor refers to their accent using a marker of the strength of the accent. Examples included `pronounced`, `90%` or `slight`. 

* `Accent strength descriptor` 

### Vocal quality descriptor 

This refers to Accent Descriptors where the data contributor refers to their accent using words to describe aspects of their voice that are subjective and qualitative - such as `sultry` or `sassy`. 

* `Vocal quality descriptor` 

TODO: is `quality` the correct word here? 

## Phonetic changes 

This category refers to Accent Descriptors which describe a particular phonetic change. This is used as a parent category to group these Accent Descriptors. 

* `Phonetic changes`

### Specific phonetic changes 

There are several phonetic changes that are linguistic markers for accent difference. 

* `Specified phonetic change` is applied when the Accent Descriptor itself specifies the type of phonetic change. `cot-caught merger` is an example. 

* `Rhoticity` is applied when the Accent Descriptor is describing how `/r/` and related phonemes are pronounced.

* `Inflection` is applied when the Accent Descriptor is describing an inflection change. 

## Register 

Although the Mozilla Common Voice data used _elicited speech_ - utterances spoken from given text prompts, people can speak in a range of _registers_. A register is generally the level of formality of speech - such as `formal`, or `educated` or `slang`. It may indicate socio-economic heritage of the speaker. This category captures Accent Descriptors that describe an accent in this way. 

* `Register` 

## Named accent 

Some accents, such as `Geordie` or `Scouse` have a related geographical region descriptor - North East England, and Liverpool respectively, but ones such as `Received Pronunciation` do not. This category allows for having a Named Accent descriptor where no related geographic region descriptor exists, as well as being able to capture specifically named accents.

* `Named Accent`

## Accent effects due to physical changes 

Accent changes may occur due to physical changes in the speaker's vocal tract - for instance through surgery or disease. This Accent Descriptor is used to capture descriptions such as these. 

* `Accent effects due to physical changes`

## Mixed or variable accent 

Where the data contributor specifies that their accent is a mixture or amalgamation of accents, but does not provide further information (for example so the Accent Descriptors can be separated or merged), this Accent Descriptor is used to capture this description. 

* `Mixed or variable accent`






In [111]:
import cvaccents as cva
#reload(cva)

# Using the Accent Descriptor class to create Accent Descriptor accents for the above 

descriptorGeoRegion = cva.AccentDescriptor(
    id = 100, 
    name='Geographic region', 
    definition = 'Indicates a geographic region used as a descriptor.', 
    parent = None, 
)
descriptorGeoCountry = cva.AccentDescriptor(
    id = 200, 
    name='Country', 
    definition = 'Indicates a geographic region of a country or nation-state.', 
    parent = 100, 
)
descriptorGeoSupra = cva.AccentDescriptor(
    id = 300, 
    name='Subnational region', 
    definition = 'Indicates a geographic region which crosses or overlaps multiple countries.', 
    parent = 100, 
)
descriptorGeoSub = cva.AccentDescriptor(
    id = 400, 
    name='Subnational region', 
    definition = 'Indicates a geographic region within a national boundary.', 
    parent = 100, 
)
descriptorGeoCity = cva.AccentDescriptor(
    id = 500, 
    name='Subnational region', 
    definition = 'Indicates a geographic region referring to a city, town or municipality.', 
    parent = 100, 
)


descriptorFOL = cva.AccentDescriptor(
    id = 600, 
    name='First or other language', 
    definition = 'Indicates a descriptor related to whether this is the speaker\s first or other language.', 
    parent = None, 
)

descriptorAccStr = cva.AccentDescriptor(
    id = 700, 
    name='Accent strength descriptor', 
    definition = 'Indicates a marker of accent strength.', 
    parent = None, 
)

descriptorVocQual = cva.AccentDescriptor(
    id = 800, 
    name='Vocal quality descriptor', 
    definition = 'Indicates a subjective vocal quality.', 
    parent = None, 
)


descriptorPhonChanges = cva.AccentDescriptor(
    id = 1000, 
    name='Phonetic Changes', 
    definition = 'Indicates a phonetic change.', 
    parent = None, 
)
descriptorPhonSpecific = cva.AccentDescriptor(
    id = 1100, 
    name='Specific phonetic changes', 
    definition = 'Indicates a specific phonetic change.', 
    parent = 1000, 
)
descriptorPhonRhoticity = cva.AccentDescriptor(
    id = 1200, 
    name='Rhoticity', 
    definition = 'Indicates rhoticity or its absence.', 
    parent = 1000, 
)
descriptorPhonInflection = cva.AccentDescriptor(
    id = 1200, 
    name='Inflection', 
    definition = 'Indicates an inflection change.', 
    parent = 1000, 
)

descriptorRegister = cva.AccentDescriptor(
    id = 1300, 
    name='Register', 
    definition = 'Indicates which register the data contributor speaks in.', 
    parent = None, 
)

descriptorNamedAcc = cva.AccentDescriptor(
    id = 1400, 
    name='Register', 
    definition = 'Indicates a specifically named accent.', 
    parent = None, 
)

descriptorPhysChange = cva.AccentDescriptor(
    id = 1500, 
    name='Accent effects due to physical changes', 
    definition = 'Indicates accent changes due to physical changes of the data contributor.', 
    parent = None, 
)

descriptorAccMixed = cva.AccentDescriptor(
    id = 1600, 
    name='Mixed or variable accent', 
    definition = 'Indicates mixture or amalgamation of accents.', 
    parent = None, 
)


descriptorAccUncertainty = cva.AccentDescriptor(
    id = 2000, 
    name='Uncertainty marker', 
    definition = 'Indicates uncertainty of descriptor.', 
    parent = None, 
)


print(descriptorGeoRegion.__str__())
print(descriptorGeoCountry.__str__())
print(descriptorGeoSupra.__str__())
print(descriptorGeoSub.__str__())
print(descriptorGeoCity.__str__())

print(descriptorFOL.__str__())

print(descriptorAccStr.__str__())

print(descriptorVocQual.__str__())

print(descriptorPhonChanges.__str__())
print(descriptorPhonSpecific.__str__())
print(descriptorPhonRhoticity.__str__())
print(descriptorPhonInflection.__str__())

print(descriptorRegister.__str__())

print(descriptorNamedAcc.__str__())

print(descriptorPhysChange.__str__())

print(descriptorAccMixed.__str__())

import cvaccents as cva
#reload(cva)

# Using the Accent Descriptor class to create Accent Descriptor accents for the above 

descriptorGeoRegion = cva.AccentDescriptor(
    id = 100, 
    name='Geographic region', 
    definition = 'Indicates a geographic region used as a descriptor.', 
    parent = None, 
)
descriptorGeoCountry = cva.AccentDescriptor(
    id = 200, 
    name='Country', 
    definition = 'Indicates a geographic region of a country or nation-state.', 
    parent = 100, 
)
descriptorGeoSupra = cva.AccentDescriptor(
    id = 300, 
    name='Subnational region', 
    definition = 'Indicates a geographic region which crosses or overlaps multiple countries.', 
    parent = 100, 
)
descriptorGeoSub = cva.AccentDescriptor(
    id = 400, 
    name='Subnational region', 
    definition = 'Indicates a geographic region within a national boundary.', 
    parent = 100, 
)
descriptorGeoCity = cva.AccentDescriptor(
    id = 500, 
    name='Subnational region', 
    definition = 'Indicates a geographic region referring to a city, town or municipality.', 
    parent = 100, 
)


descriptorFOL = cva.AccentDescriptor(
    id = 600, 
    name='First or other language', 
    definition = 'Indicates a descriptor related to whether this is the speaker\s first or other language.', 
    parent = None, 
)

descriptorAccStr = cva.AccentDescriptor(
    id = 700, 
    name='Accent strength descriptor', 
    definition = 'Indicates a marker of accent strength.', 
    parent = None, 
)

descriptorVocQual = cva.AccentDescriptor(
    id = 800, 
    name='Vocal quality descriptor', 
    definition = 'Indicates a subjective vocal quality.', 
    parent = None, 
)


descriptorPhonChanges = cva.AccentDescriptor(
    id = 1000, 
    name='Phonetic Changes', 
    definition = 'Indicates a phonetic change.', 
    parent = None, 
)
descriptorPhonSpecific = cva.AccentDescriptor(
    id = 1100, 
    name='Specific phonetic changes', 
    definition = 'Indicates a specific phonetic change.', 
    parent = 1000, 
)
descriptorPhonRhoticity = cva.AccentDescriptor(
    id = 1200, 
    name='Rhoticity', 
    definition = 'Indicates rhoticity or its absence.', 
    parent = 1000, 
)
descriptorPhonInflection = cva.AccentDescriptor(
    id = 1200, 
    name='Inflection', 
    definition = 'Indicates an inflection change.', 
    parent = 1000, 
)

descriptorRegister = cva.AccentDescriptor(
    id = 1300, 
    name='Register', 
    definition = 'Indicates which register the data contributor speaks in.', 
    parent = None, 
)

descriptorNamedAcc = cva.AccentDescriptor(
    id = 1400, 
    name='Register', 
    definition = 'Indicates a specifically named accent.', 
    parent = None, 
)

descriptorPhysChange = cva.AccentDescriptor(
    id = 1500, 
    name='Accent effects due to physical changes', 
    definition = 'Indicates accent changes due to physical changes of the data contributor.', 
    parent = None, 
)

descriptorAccMixed = cva.AccentDescriptor(
    id = 1600, 
    name='Mixed or variable accent', 
    definition = 'Indicates mixture or amalgamation of accents.', 
    parent = None, 
)


descriptorAccUncertainty = cva.AccentDescriptor(
    id = 2000, 
    name='Uncertainty marker', 
    definition = 'Indicates uncertainty of descriptor.', 
    parent = None, 
)


print(descriptorGeoRegion.__str__())
print(descriptorGeoCountry.__str__())
print(descriptorGeoSupra.__str__())
print(descriptorGeoSub.__str__())
print(descriptorGeoCity.__str__())

print(descriptorFOL.__str__())

print(descriptorAccStr.__str__())

print(descriptorVocQual.__str__())

print(descriptorPhonChanges.__str__())
print(descriptorPhonSpecific.__str__())
print(descriptorPhonRhoticity.__str__())
print(descriptorPhonInflection.__str__())

print(descriptorRegister.__str__())

print(descriptorNamedAcc.__str__())

print(descriptorPhysChange.__str__())

print(descriptorAccMixed.__str__())

print(descriptorAccUncertainty.__str__())









id is 100, name is Geographic region, definition is Indicates a geographic region used as a descriptor., parent is None
id is 200, name is Country, definition is Indicates a geographic region of a country or nation-state., parent is 100
id is 300, name is Subnational region, definition is Indicates a geographic region which crosses or overlaps multiple countries., parent is 100
id is 400, name is Subnational region, definition is Indicates a geographic region within a national boundary., parent is 100
id is 500, name is Subnational region, definition is Indicates a geographic region referring to a city, town or municipality., parent is 100
id is 600, name is First or other language, definition is Indicates a descriptor related to whether this is the speaker\s first or other language., parent is None
id is 700, name is Accent strength descriptor, definition is Indicates a marker of accent strength., parent is None
id is 800, name is Vocal quality descriptor, definition is Indicates a su

### Now we have the Accent Descriptors defined, we can associate Accent Descriptors with each Accent 

In [112]:
# I could put them all in one list, 
# but it's easier to debug this way

# Generic region descriptors that don't fit into any other category 
region_descriptors = [
    ('non regional', descriptorGeoRegion),
    ('International English', descriptorFOL)
]

# Country descriptors 
country_descriptors = [ 
    ('England English', descriptorGeoCountry),
    ('United States English', descriptorGeoCountry),
    ('Hong Kong English', descriptorGeoCountry),
    ('Australian English', descriptorGeoCountry),
    ('French', descriptorGeoCountry),
    ('Colombian', descriptorGeoCountry),
    ('Canadian English', descriptorGeoCountry),
    ('Scottish English', descriptorGeoCountry),
    ('Filipino', descriptorGeoCountry),
    ('Argentinian English', descriptorGeoCountry),
    ('Finnish', descriptorGeoCountry),
    ('Singaporean English', descriptorGeoCountry),
    ('Georgian English', descriptorGeoCountry),
    ('New Zealand English', descriptorGeoCountry),
    ('Malaysian English', descriptorGeoCountry),
    ('Irish English', descriptorGeoCountry),
    ('Chinese', descriptorGeoCountry),
    ('Nigerian', descriptorGeoCountry),
    ('Ukrainian', descriptorGeoCountry),
    ('Polish', descriptorGeoCountry),
    ('Romanian', descriptorGeoCountry),
    ('Welsh English', descriptorGeoCountry),
    ('British', descriptorGeoCountry),
    ('German', descriptorGeoCountry),
    ('Swedish', descriptorGeoCountry),
    ('Spanish', descriptorGeoCountry),
    ('Japanese', descriptorGeoCountry),
    ('Israeli', descriptorGeoCountry),
    ('Dutch', descriptorGeoCountry),
    ('Russian', descriptorGeoCountry),
    ('Northern Irish', descriptorGeoCountry),
    ('Greek', descriptorGeoCountry),
    ('Kenyan', descriptorGeoCountry),
    ('Bangladeshi', descriptorGeoCountry),
    ('Norwegian', descriptorGeoCountry),
    ('Kiwi', descriptorGeoCountry),
    ('Swiss', descriptorGeoCountry),
    ('Serbian', descriptorGeoCountry),
    ('Thai', descriptorGeoCountry),
    ('Italian', descriptorGeoCountry),
    ('Indonesian English', descriptorGeoCountry),
    ('Austrian', descriptorGeoCountry)
]

# Subnational descriptors 
subnational_descriptors = [
    ('California', descriptorGeoSub),
    ('Midlands English', descriptorGeoSub),
    ('Northern California', descriptorGeoSub),
    ('Durham', descriptorGeoSub),
    ('Catalan', descriptorGeoSub),
    ('Silicon Valley', descriptorGeoSub),
    ('Swiss German', descriptorGeoSub),
    ('Northern England', descriptorGeoSub),
    ('Southern United States', descriptorGeoSub),
    ('South Texas', descriptorGeoSub),
    ('West Indian', descriptorGeoSub),
    ('Northern', descriptorGeoSub),
    ('Southern Appalachian English', descriptorGeoSub),
    ('Southern United States English', descriptorGeoSub),
    ('Pacific Northwest', descriptorGeoSub),
    ('Southern Texas Accent', descriptorGeoSub),
    ('Midwest United States', descriptorGeoSub),
    ('Afrikaans English', descriptorGeoSub),
    ('East Ukrainian', descriptorGeoSub),
    ('Southern England', descriptorGeoSub),
    ('Yorkshire', descriptorGeoSub),
    ('East Indian', descriptorGeoSub),
    ('clickme', descriptorGeoSub),
    ('Pennsylvania', descriptorGeoSub),
    ('Middle eastern seaboard Australian', descriptorGeoSub),
    ('Lancashire', descriptorGeoSub),
    ('Essex', descriptorGeoSub),
    ('"Valley Girl" English', descriptorGeoSub),
    ('Texas', descriptorGeoSub),
    ('New England', descriptorGeoSub),
    ('East coast', descriptorGeoSub),
    ('Okie', descriptorGeoSub),
    ('South German', descriptorGeoSub),
    ('Upper Midwestern', descriptorGeoSub),
    ('Mid-atlantic United States', descriptorGeoSub),
    ('Michigan', descriptorGeoSub),
    ('Javanese', descriptorGeoSub),
    ('Philadelphia', descriptorGeoSub),
    ('Sussex', descriptorGeoSub),
    ('New York', descriptorGeoSub),
    ('Alemannic German Accent', descriptorGeoSub),
    ('South West German', descriptorGeoSub),
    ('Southern California', descriptorGeoSub),
    ('Southwestern United States English', descriptorGeoSub),
    ('Northumbrian British English', descriptorGeoSub),
    ('Minnesotan', descriptorGeoSub)
]

# Supranational descriptors 
supranational_descriptors = [
    ('Wolof', descriptorGeoSupra),
    ('Latino', descriptorGeoSupra),
    ('Southern African (South Africa, Zimbabwe, Namibia)', descriptorGeoSupra),
    ('India and South Asia (India, Pakistan, Sri Lanka)', descriptorGeoSupra),
    ("A'lo", descriptorGeoSupra),
    ('West Indies and Bermuda (Bahamas, Bermuda, Jamaica, Trinidad)', descriptorGeoSupra),
    ('Indo-Canadian English', descriptorGeoSupra),
    ('European', descriptorGeoSupra),
    ('South Atlantic (Falkland Islands, Saint Helena)', descriptorGeoSupra),
    ('Slavic', descriptorGeoSupra),
    ('Eastern European', descriptorGeoSupra),
    ('Slovak', descriptorGeoSupra),
    ('West African', descriptorGeoSupra),
    ('Hmong', descriptorGeoSupra),
    ('Cantonese', descriptorGeoSupra),
    ('East African Khoja', descriptorGeoSupra),
    ('clickme', descriptorGeoSupra),
    ('clickme', descriptorGeoSupra)
]

# City descriptors 
city_descriptors = [
    ('New York City', descriptorGeoCity),
    ('Liverpool', descriptorGeoCity),
    ('East London', descriptorGeoCity),
    ('London', descriptorGeoCity),
    ('Pittsburgh', descriptorGeoCity),
    ('Sydney', descriptorGeoCity),
    ('Chicago', descriptorGeoCity),
    ('South London', descriptorGeoCity),
    ('Brooklyn Accent', descriptorGeoCity),
    ('New Orleans dialect', descriptorGeoCity)
]

# First or other language descriptors 
FOL_descriptors = [
    ('Non-native speaker', descriptorFOL),
    ('Bilingual', descriptorFOL),
    ('native', descriptorFOL),
    ('Second language', descriptorFOL),
    ('Basic', descriptorFOL),
    ('time spent in location', descriptorFOL),
    ('some', descriptorFOL),
    ('mid level', descriptorFOL),
    ('Spoke language when a child', descriptorFOL),
    ('fluent', descriptorFOL),
    ('Conversational', descriptorFOL),
    ('Foreign', descriptorFOL)
]


# Accent Strength descriptors
AccStr_descriptors = [
    ('pronounced', descriptorAccStr),
    ('slight', descriptorAccStr),
    ('Mild', descriptorAccStr),
    ('Not bad', descriptorAccStr),
    ('little bit', descriptorAccStr),
    ('tinge', descriptorAccStr),
    ('90%', descriptorAccStr),
    ('10%', descriptorAccStr),
    ('heavy', descriptorAccStr),
    ('little', descriptorAccStr),
    ('minor', descriptorAccStr)
]


# Vocal quality descriptors
VocQual_descriptors = [
    ('sultry', descriptorVocQual),
    ('classy', descriptorVocQual),
    ('sassy', descriptorVocQual),
    ('Slight lisp', descriptorVocQual),
    ('Slightly effeminate', descriptorVocQual),
    ('Low', descriptorVocQual),
    ('Demure', descriptorVocQual)
]


# Phonetic descriptors 
PhonSpecific_descriptors = [
    ('pin/pen merger', descriptorPhonSpecific),
    ('heavy consonants', descriptorPhonSpecific),
    ('cot-caught merger', descriptorPhonSpecific)
]
PhonRhoticity_descriptors = [
    ("pronounced r's", descriptorPhonRhoticity)
]
PhonInflection_descriptors = [
    ('mostly affecting inflection', descriptorPhonInflection)
]

# Register descriptors
Register_descriptors = [
    ('surfer', descriptorRegister),
    ('academic', descriptorRegister),
    ('Educated', descriptorRegister),
    ('formal', descriptorRegister),
    ('slang', descriptorRegister),
    ('Urban', descriptorRegister),
    ('classy', descriptorRegister),
    ('sassy', descriptorRegister),
    ('city', descriptorRegister),
    ('Cool', descriptorRegister),
    ('Conversational', descriptorRegister)
]

# Named accent descriptors
NamedAcc_descriptors = [
    ('Patois', descriptorNamedAcc),
    ('Received Pronunciation', descriptorNamedAcc),
    ('Kiwi', descriptorNamedAcc),
    ('Chicano English', descriptorNamedAcc),
    ('"Valley Girl" English', descriptorNamedAcc),
    ('Kiwi', descriptorNamedAcc)
]
        
# Physical change descriptors
PhysChange_descriptors = [
    ('changes due to oral surgery', descriptorPhysChange)
]

# Mixed accent descriptors
AccMixed_descriptors = [
    ('pronounced', descriptorAccMixed),
    ('Variable', descriptorAccMixed),
    ('Adjustable', descriptorAccMixed),
    ('Mix of accents', descriptorAccMixed),
    ('try to maintain originality', descriptorAccMixed)
]
    
# Uncertainty marker 
AccUncertainty_descriptors = [
    ('I think', descriptorAccUncertainty)
]    


In [113]:
# create one list from the above lists 

accent_descriptor_list = [
    region_descriptors,
    country_descriptors,
    subnational_descriptors,
    supranational_descriptors,
    city_descriptors,
    FOL_descriptors,
    AccStr_descriptors,
    VocQual_descriptors,
    PhonSpecific_descriptors,
    PhonRhoticity_descriptors,
    PhonInflection_descriptors,
    Register_descriptors,
    NamedAcc_descriptors,
    PhysChange_descriptors,
    AccMixed_descriptors,
    AccUncertainty_descriptors,
]



In [114]:
# Now we loop through all the accents 
# And if the accent name matches one of the descriptors in accent_descriptor_list 
# We add the relevant Accent Descriptor to the Accent's object representation 

for accent_descriptor_category in accent_descriptor_list: 
    for accent_descriptor in accent_descriptor_category: 
        for accent in all_accents.items(): 
            
            #print ('accent is: ', accent[1], ' and accent_descriptor is: ', accent_descriptor)
            
            if accent[1]._name == accent_descriptor[0]: 
                #print ('MATCH!')
                if accent[1]._descriptors is None: 
                    accent[1]._descriptors = [] # initialise list if None
                accent[1]._descriptors.append(accent_descriptor[1]) # append because there can be multiple 
                

In [115]:
# the accents should now have descriptors 

import cvaccents as cva
#reload(cva)

for accent in all_accents.items(): 
    print(accent[1].__str__())

id is 1, name is England English, count is 2346, locale is en, descriptors are [<cvaccents.AccentDescriptor object at 0x7fcc1b910880>], predetermined is True.
id is 2, name is United States English, count is 7537, locale is en, descriptors are [<cvaccents.AccentDescriptor object at 0x7fcc1b910880>], predetermined is True.
id is 3, name is Hong Kong English, count is 132, locale is en, descriptors are [<cvaccents.AccentDescriptor object at 0x7fcc1b910880>], predetermined is True.
id is 7, name is Wolof, count is 1, locale is en, descriptors are [<cvaccents.AccentDescriptor object at 0x7fcc1b910160>], predetermined is False.
id is 9, name is Australian English, count is 665, locale is en, descriptors are [<cvaccents.AccentDescriptor object at 0x7fcc1b910880>], predetermined is True.
id is 12, name is Latino, count is 7, locale is en, descriptors are [<cvaccents.AccentDescriptor object at 0x7fcc1b910160>], predetermined is False.
id is 15, name is Southern African (South Africa, Zimbabwe,

In [116]:
# Now do a cross-check to see if there are any accents for which the ._descriptor is None 
# this flags if I've missed an accent somewhere 

import cvaccents as cva
#reload(cva)

missing_descriptors = all_accents.reportNoneAccentDescriptors() 

for accent in missing_descriptors: 
    print (accent[1].__str__())
    

## Create relationships between Accents suitable for data visualisation 

Now, we want to create relationships _between_ accents so that we can visualise accents as nodes and their relationships as edges. 

In [117]:
## Creating linkages between the individual accents and how they are represented in the data. 
## What I want to do here is create a data structure that has the ID of the accent 
## and something to describe the edge: 
## 
## The data structure I think will work here is: 
## 
## { 99: (123, 456, 789)} 
## to represent each combination of accents 

## The data structures we are using are: 
## 
## all_accents - Accent Collection object of all Accents, merged and normalised
## english_accents_list - this is a list of list of strings,
##                        where each list represents the Accents that are related
##   
## what we want to do is go through each list, 
## and find the ID number of the accent 
## from the Dict, 
## then build another data structure that represents the Accent's relation to other Accents

accent_nodes = {}
i = 0;

for accent_list in english_accents_list:

    #print(accent_list)
    
    # initialise the list first 
    accent_nodes[i] = []
    
    for accent_list_item in accent_list: 
        #print('now processing', accent_list_item)
        
        node_cnt = 0; 
        
        for accent in all_accents.items(): 
            
            #if (i%ratio_display ==0): # only show the 100th 
                #print('---')
                #print ('now looking at row: ', accent_list, 'and accent list item: ', accent_list_item, ' and accent: ', accent)
        
            if (accent_list_item == accent[1]._name): ## match 
                
                #if (i%ratio_display ==0): # only show the 100th 
                    #print('---')
                    #print ('match!')
                    
                #print(accent[0])
                #print('i is: ', i, ' and node_cnt is: ', node_cnt)
                
                accent_nodes[i].append(accent[0]) # we want the accent ID number
                node_cnt +=1
                
     
    i +=1 
                

In [118]:
pp.pprint(accent_nodes)

{   0: [1, 2],
    1: [3],
    2: [1],
    3: [2],
    4: [2, 7],
    5: [1],
    6: [9],
    7: [2],
    8: [2, 12],
    9: [2],
    10: [2],
    11: [15],
    12: [2],
    13: [2],
    14: [18],
    15: [2],
    16: [2],
    17: [1],
    18: [18],
    19: [1],
    20: [2],
    21: [15],
    22: [9],
    23: [1],
    24: [18],
    25: [2],
    26: [2, 31],
    27: [18],
    28: [18],
    29: [2],
    30: [2],
    31: [18],
    32: [37],
    33: [2],
    34: [18],
    35: [18],
    36: [2],
    37: [18],
    38: [2],
    39: [3],
    40: [2],
    41: [1],
    42: [1],
    43: [2],
    44: [1],
    45: [2],
    46: [1],
    47: [15],
    48: [2],
    49: [1],
    50: [2],
    51: [1],
    52: [2],
    53: [58, 59],
    54: [2],
    55: [18],
    56: [18],
    57: [2],
    58: [18],
    59: [9],
    60: [18],
    61: [18],
    62: [2],
    63: [1],
    64: [70],
    65: [2],
    66: [18],
    67: [2],
    68: [18],
    69: [2],
    70: [2],
    71: [2],
    72: [1],
    73: [79],
    74:

In [119]:
# Now what I need to do is create a JSON format suitable for using in 
# a Trellis diagram in Observable 
# e.g. https://observablehq.com/@jameslaneconkling/trellis

# The format of the JSON looks like: 
#   const nodes = [
#    {id: 'Myriel', group: 1},
#    {id: 'Napoleon', group: 1},

# const edges = [
#    {source: 'Napoleon', target: 'Myriel'},
#    {source: 'Mlle.Baptistine', target: 'Myriel'},

# the nodes should be as easy as JSON dumping the all_accents AccentCollection 
# possibly using a method 

# the edges will be a bit more complex 

### Nodes

In [120]:
#reload(cva)
all_accents.exportJSON('all_accents.json')

True

### Edges

Here, we use the `accent_nodes` list to create a dict of edges

In [121]:
## what we want to do here is loop through the accent_nodes Dict 
## and create another Dict that we can use to create Links between the Nodes (which are accents)

accent_edges = {} 
accent_edges_id = 0

print(len(accent_nodes)) 

# we want to create a new structure, as we'll be popping elements off, and we want to leave accent_nodes untouched

for accent_list in accent_nodes.items(): 
    print('accent list is: ', accent_list)
    
    if len(accent_list[1]) > 1: 
        # we want to create links 
        
        while (len(accent_list[1]) > 1) : 
            print('\n')
            print('accent list stack size is greater than 1, now in while loop')
            
            popped_element = accent_list[1].pop(0) # remove the first element 
            print('popped element is :', popped_element)
            
            accent_edges[accent_edges_id] = {}
            
            # create links between the popped element and all the remaining elements in the list 
            for accent_id in accent_list[1]: 
                
                print('--- in accent list loop ---')
                print('accent_id is: ', accent_id)
                print('length of accent list is: ', len(accent_list[1]))
                
                accent_edges[accent_edges_id]['source'] = popped_element
                accent_edges[accent_edges_id]['target'] = accent_id
            
                print(accent_edges[accent_edges_id])
                
        accent_edges_id +=1
                
                
        print('end of while loop')
        print('\n')

14822
accent list is:  (0, [1, 2])


accent list stack size is greater than 1, now in while loop
popped element is : 1
--- in accent list loop ---
accent_id is:  2
length of accent list is:  1
{'source': 1, 'target': 2}
end of while loop


accent list is:  (1, [3])
accent list is:  (2, [1])
accent list is:  (3, [2])
accent list is:  (4, [2, 7])


accent list stack size is greater than 1, now in while loop
popped element is : 2
--- in accent list loop ---
accent_id is:  7
length of accent list is:  1
{'source': 2, 'target': 7}
end of while loop


accent list is:  (5, [1])
accent list is:  (6, [9])
accent list is:  (7, [2])
accent list is:  (8, [2, 12])


accent list stack size is greater than 1, now in while loop
popped element is : 2
--- in accent list loop ---
accent_id is:  12
length of accent list is:  1
{'source': 2, 'target': 12}
end of while loop


accent list is:  (9, [2])
accent list is:  (10, [2])
accent list is:  (11, [15])
accent list is:  (12, [2])
accent list is:  (13, [2]

In [122]:
print(len(accent_edges))

177


In [123]:
pp.pprint(accent_edges)

{   0: {'source': 1, 'target': 2},
    1: {'source': 2, 'target': 7},
    2: {'source': 2, 'target': 12},
    3: {'source': 2, 'target': 31},
    4: {'source': 58, 'target': 59},
    5: {'source': 2, 'target': 18},
    6: {'source': 18, 'target': 1},
    7: {'source': 37, 'target': 18},
    8: {'source': 2, 'target': 277},
    9: {'source': 1, 'target': 335},
    10: {'source': 18, 'target': 2},
    11: {'source': 2, 'target': 1},
    12: {'source': 471, 'target': 472},
    13: {'source': 489, 'target': 490},
    14: {'source': 9, 'target': 2},
    15: {'source': 1, 'target': 3},
    16: {'source': 1, 'target': 2},
    17: {'source': 750, 'target': 751},
    18: {'source': 2, 'target': 763},
    19: {'source': 2, 'target': 1},
    20: {'source': 776, 'target': 777},
    21: {'source': 2, 'target': 782},
    22: {'source': 209, 'target': 2},
    23: {'source': 1, 'target': 2},
    24: {'source': 302, 'target': 1},
    25: {'source': 1, 'target': 3},
    26: {'source': 1, 'target': 2},
 

Now, we de-duplicate the listusing the in-built Python function [set](https://docs.python.org/3.8/library/stdtypes.html?highlight=set#set), which de-duplicates a list. 

TODO: In future iterations this list of edges could include a `value` values, so that if the edges were duplicated, then a `value` counter could be incremented when they were de-duplicated. 

In [124]:
accent_edges = [*set(accent_edges)]

In [125]:
# check to see how many duplicates were removed - doesn't look like there were any duplicates
# so I will skip calculating the `
print(len(accent_edges))

177
