# Working with accent data in the Mozilla Common Voice dataset 

The purpose of this Python Jupyter notebook is to provide some worked examples of how you might explore accent data in the Common Voice dataset. 



## Index of notebook contents 

To make this notebook easier to navigate, each section is indexed below. 

* [Background information on demographic data in Common Voice](#Background)
* [Preparation steps and importing modules](#PreparationSteps) - including the `requirements.txt` you should run if using this notebook. 
* [The Accent and AccentDescriptor classes we will use in the notebook](#Classes)
* [Preparing data from the Common Voice TSV file](#PreparingData)
* [Extracting accent information for data visualisation](#AccentExtraction)
* [Determine which accents are predetermined for selection in the Common Voice profile screen](#PreDetermined)
* [Add Descriptors to each Accent](#Descriptors)

---
<a id='Background'></a>
## Background information on demographic data in Common Voice 

Before you start working with accent data in Common Voice, there is background information you should know about the data structures in the Common Voice datasets, and how accents have been represented. 

### The ability to choose whether or not to specify demographic information 

Data contributors can contribute voice data to Common Voice with our without logging in to the platform. If a data contributor is not logged in, the utterances they record contain no demographic metadata information, such as the gender, age range or accent of the speaker. If the data contributor _does_ log in, then they can choose whether to specify demographic information in their profile. Part of the demographic information can include specifying which accent(s) they speak with. 


Since mid 2021, data contributors to the Common Voice dataset have been able to self-specify descriptors for their accents. 

The purpose of this script is to get demographic details from an MCV downloaded dataset. 
This informs decision making around, for example, how much of the data in a particular language, has demographic details, and if so, what they are. 

---
<a id='PreparationSteps'></a>

## Preparation steps and importing the modules we will use 

@TODO 

make a `requirements.txt` file to install all the dependencies. 

* pandas 


In [90]:
# imports go here 

# io 
import io

# os for file handling 
import os 

# pandas 
import pandas as pd

# regular expressions 
import re

# json 
import json

# string handling for isascii
import string 

# pretty print 
import pprint
pp = pprint.PrettyPrinter(indent=4)

# reload = because I'm developing the CVaccents module as I go, I want to reload it each time so it doesn't cache
from importlib import reload

# copy = for using deepcopy()
import copy

In [91]:
# set the version number so that we can differentiate files, such as JSON, that are produced. 

dataset_release_version = 13
JSON_data_dir = 'JSON-data-files'
language = 'sw'

# specify the filenames for the JSON output

accents_filename = JSON_data_dir + '/' + 'all_accents' + '_' + str(dataset_release_version) + '_' + language + '.json'
links_filename = JSON_data_dir + '/' + 'accent_edges' + '_' +  str(dataset_release_version) + '_' + language + '.json'

---
<a id='Classes'></a>
## Accent, AccentDescriptor and AccentCollection classes used for manipulation

In [92]:
## Accent class and AccentDescriptor class 

# these are classes I defined for accent handling
import cvaccents as cva

# do an explicit reload as I'm still working on the classes 
#reload(cva)

# prove that my DocStrings are useful
# they are good, so I am suppressing output while I work through the rest of the doc. 

#print('Module docstring is: \n', cva.__doc__)
#print('---')
#print('Accent docstring is: \n', cva.Accent.__doc__)
#print('---')
#print('AccentDescriptor docstring is: \n', cva.AccentDescriptor.__doc__)
#print('---')
#print('AccentCollection docstring is: \n', cva.AccentDescriptor.__doc__)
#print('---')

---
<a id='PreparingData'></a>
## Preparing the data from the Common Voice dataset TSV file

Here, we extract data from the TSV file, and use `pandas` to perform manipulations on the dataset, such as removing rows that do not contain accent metadata. 

In [93]:
# You will need to download the CV corpus somewhere, or at least have the validated.tsv file available. 

# I have found that the aria2c downloader works very well for large downloads. 
# https://aria2.github.io/

filePath = '../cv-datasets/cv-corpus-13.0-2023-03-09/sw/validated.tsv'

# put it into a DataFrame 
df = pd.read_csv(filePath, sep='\t')

  df = pd.read_csv(filePath, sep='\t')


In [94]:
df.columns

Index(['client_id', 'path', 'sentence', 'up_votes', 'down_votes', 'age',
       'gender', 'accents', 'variant', 'locale', 'segment'],
      dtype='object')

In [95]:
# We don't want all the columns, as some of them are not useful for the accent analysis 
# Drop the columns we don't want 

df.drop(labels=['path', 'sentence', 'up_votes', 'down_votes', 'segment', 'locale'], axis='columns', inplace=True)
df.columns



Index(['client_id', 'age', 'gender', 'accents', 'variant'], dtype='object')

In [96]:
len(df)

231142

In [97]:
# rows that have accent metadata 
len(df[df['accents'].notna()])

1325

In [98]:
display(df)

Unnamed: 0,client_id,age,gender,accents,variant
0,0133d8ddf5c1a3c678fde017e0b07d2835bfd707d5b3ec...,twenties,female,,
1,01c95772efd3fbe4a1122206c7474c77ed6591c8c9fb00...,,,,
2,023711185d4404ff398c2697f2e72868d1ecf69a92b581...,twenties,male,,
3,0244639ffd7ec755a01b21ea204735ca3c44443e9cf46c...,,,,
4,04e78dc3038488a080fe3c76c28602d0db9e4eec2efbf0...,,,,
...,...,...,...,...,...
231137,457b3a2570720101c75d297cde767487e8f0a1a7f714cb...,thirties,male,,
231138,457b3a2570720101c75d297cde767487e8f0a1a7f714cb...,thirties,male,,
231139,457b3a2570720101c75d297cde767487e8f0a1a7f714cb...,thirties,male,,
231140,457b3a2570720101c75d297cde767487e8f0a1a7f714cb...,thirties,male,,


In [99]:
# remove all the rows where accents are not given (NaN)
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html
# DataFrame.dropna(*, axis=0, how=_NoDefault.no_default, thresh=_NoDefault.no_default, subset=None, inplace=False)

df.dropna(axis='index', how='any', subset='accents', inplace=True)
len(df)

# this matches the above figure for rows that have accent metadata, so it's a good cross-check

1325

In [100]:
# number of unique contributors to the dataset 
len(df['client_id'].unique())

26

In [101]:
# Now that the rows without an accent value have been removed, 
# we want to deduplicate the speaker_id values - because one speaker can speak many utterances
# and we only want to record one accent per speaker 
# and we should end up with the # of rows in the cell above 


# One of the reasons we try and reduce the size of the dataframe 
# first is because this operation is more efficient on a smaller dataframe 
# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html
# DataFrame.drop_duplicates(subset=None, *, keep='first', inplace=False, ignore_index=False)

df.drop_duplicates(subset='client_id', keep='first', inplace=True)
len(df)

# This length should match the length above 

26

---
<a id='AccentExtraction'></a>
## Extracting the accent data for visualisation 

We have already: 

* Removed any rows where accent data was not available = `NaN`
* De-duplicated based on the `client_id`

So now, we need to extract all the self-styled accents for analysis. 

In [102]:
# They are already unique so we don't need the `.unique` method
kiswahili_accents = df['accents']

print(len(kiswahili_accents))

26


In [103]:
"""

The list english_accents_list is a list that contains the ORIGINAL accent entries for each speaker. 
In this list, the accents for each speaker are represented as a SINGLE STRING, NOT as a list of strings. 

So, we want to turn this into a LIST of LISTS OF STRINGS, to make it easier for doing data cleaning. 
Each individual string represents one accent descriptor given by a speaker, 
and the list which contains those strings is the grouping of accent descriptors for that speaker. 

We need to preserve the association between accent descriptors - co-references - for later data visualisation. 

"""


kiswahili_accents_list = [] 

for idx, accent_string in list(enumerate(kiswahili_accents)): 
    
    # this regex is from 
    # https://stackoverflow.com/questions/26633452/how-to-split-by-commas-that-are-not-within-parentheses
    accent_list=re.split(',\s*(?![^()]*\))', accent_string)
    processed_accent_list = accent_list # we don't want to modify a list we're iterating over
    
    #print ('accent_string is: ', accent_string, ' and accent_list is: ', accent_list)
    for idx_a, accent in list(enumerate(accent_list)): 
        
        #print ('idx_a is: ', idx_a, ' and accent is: ', accent, ' and type of accent is: ', type(accent))
        # Trim any whitespace off the elements, because this makes matching on strings harder later on
        # Strings are immutable in Python, so we have to create another string
        processed_accent_list.remove(accent)
        stripped_accent = accent.strip() 
        processed_accent_list.append(stripped_accent)
        
        # Check for any empty strings and remove them - likely regex artefacts
        if not accent: 
            processed_accent_list.remove(accent)
            
        # Check for any non-Latin characters that we may want to investigate 
        # For example, if one of the accents is garbage or deliberate rubbish
        
        if not accent.isascii(): 
            print('flagging that accent: ', accent, ' is not ASCII encoded, may be in another language')
            
    kiswahili_accents_list.append(processed_accent_list)

In [104]:
print(len(kiswahili_accents_list))

26


In [105]:
for idx, accent_list in list(enumerate(kiswahili_accents_list)): 
    print(idx, accent_list)


0 ['Good']
1 ['swahili accent', 'coastal swahili accent']
2 ['Fluent', 'Mvita']
3 ['no sherapping']
4 ['Kiswahili']
5 ['Strong kiswahili accent']
6 ['Kimvita']
7 ['Typical Kenyan Accent']
8 ['Shaped by where i have lived']
9 ['native']
10 ['Lafudhi yangu ni kiswahili cha kawaida ambacho watanzania wengi wanakizungumza. Ni pamoja na kile ambacho kinafundishwa katika shule za msingi na sekondari ili kuleta maana katika mambo mbalimbali na namna ya kuzungumza kwa ujumla.']
11 ['Fluent Kiswahili']
12 ['Fluent in Swahili']
13 ['fluent in kiswahili']
14 ['Kenyan']
15 ['Eloquent and fluent.']
16 ['Fluent swahili']
17 ['kiswahili cha mkoa wa pwani']
18 ['Fluent Kiswahili']
19 ['Raisi wa Kenya alitoa hutoba yake juma tatu iliyopita']
20 ['Fluent swahili']
21 ['native']
22 ['Fluent swahili']
23 ['Fluent accent']
24 ['Fluent', 'Fluent swahili']
25 ['My accept is to recognize and record different sounds and voices', 'Especial in swahili', 'Normal Arusha accent']


---
<a id='Descriptors'></a>
## Add descriptors to each accent

In this section, I apply a set of categories to the accent data. 

**I use a rule-based approach for reproduceability.** 
This could have been done in a spreadsheet, but I'm working in Python so I chose to do it that way. 
This also makes it easier for others applying this work to other languages or to other versions of the dataset. 



### Expand accents that have multiple descriptors in their .name element 

Here, we "break apart" accents that have multiple descriptors in the .name element into **multiple** accents. This is done _programmatically_ to aid in reproduceability. 

Some examples of this that I found during analysis of Kiswahili were; 

* 'My accent is the common Swahili that many Tanzanians speak. It includes what is taught in primary and secondary schools to make sense in various aspects and how to talk in general.' - this contains a geographic descriptor, the 'common' descriptor and the 'taught in school' descriptor. 

* 



### Modify the accent list to expand descriptors while preserving accent co-references

Some of the given accent descriptors contain multiple descriptors in one string. Here, I expand them while maintain co-references. 

For example: 

* `slight Brooklyn accent`  - contains both a City-based descriptor and an accent strength descriptor. 
* `United States English combined with European English` - contains both a national descriptor and a supranational descriptor. 



It's easier to do this before we create Accent and AccentDescriptor objects. 

In [106]:
# helper function for the below 

def update_list_coreference(list_to_be_updated, old_entry, new_entry):
    # the accent list is a list of lists so we need to iterate through each one to find the element to update. 
    
    for idx, accent_list in list(enumerate(list_to_be_updated)):
        
        #print(idx, '- old accent_list is: ', accent_list)
        new_accent_list = accent_list
        
        for accent in accent_list:  
            match = False
            #print ('accent is: ', accent, ' and old_entry is: ', old_entry)
            
            if accent == old_entry:
                match = True 
                #print('accent is: ', accent, ' and old_entry is: ', old_entry, 'and new_entry is: ', new_entry)
                #print('accent is: ', accent, ' which matches ', old_entry)
                new_accent_list.remove(old_entry)
                
                for entry in new_entry: # there may be more than one
                    new_accent_list.append(entry)
                    #print('appending new entry: ', entry)
                
                
            if match: 
                print ('processed ', old_entry, ' to be ', new_entry, ' and the old accent list is: ', accent_list, ' and the new accent list is: ', new_accent_list)

        # recreate the list from keys, this removes duplicates
        # for example, a duplicate may be created due to normalisation or merger of accents 
        # run through a filter to remove empty list elements
        new_accent_list = list(dict.fromkeys(filter(None, new_accent_list)))
        
        #print('new_accent_list is: ', new_accent_list)
        #print('---')
        
        #print('removing: ', accent_list)
        list_to_be_updated.remove(accent_list)
        #print('appending: ', newlist)
        list_to_be_updated.append(new_accent_list)
        
    
    return(list_to_be_updated)

In [107]:
# NORMALISATION INTO MULTIPLE ACCENTS 

# 'Lafudhi yangu ni kiswahili cha kawaida ambacho watanzania wengi wanakizungumza. Ni pamoja na kile ambacho kinafundishwa katika shule za msingi na sekondari ili kuleta maana katika mambo mbalimbali na namna ya kuzungumza kwa ujumla.
# This translates to: 
# My accent is the common Swahili that many Tanzanians speak. It includes what is taught in primary and secondary schools to make sense in various aspects and how to talk in general. 
kiswahili_accents_list = update_list_coreference(kiswahili_accents_list, 
                                               'Lafudhi yangu ni kiswahili cha kawaida ambacho watanzania wengi wanakizungumza. Ni pamoja na kile ambacho kinafundishwa katika shule za msingi na sekondari ili kuleta maana katika mambo mbalimbali na namna ya kuzungumza kwa ujumla.', 
                                               ['Kiswahili accent', 'Tanzania', 'academic'])

                                                
# 'Strong kiswahili accent' - separate into 'strong' and 'Kiswahili accent'
kiswahili_accents_list = update_list_coreference(kiswahili_accents_list, 
                                               'Strong kiswahili accent', 
                                               ['Kiswahili accent', 'Strong'])

# Fluent Kiswahili - separate into 'Fluent' and 'Kiswahili accent'                                                
kiswahili_accents_list = update_list_coreference(kiswahili_accents_list, 
                                               'Fluent Kiswahili', 
                                               ['Kiswahili accent', 'Fluent'])
                                                
# 'Fluent in Swahili' - separate into 'Fluent' and 'Kiswahili accent'                                                
kiswahili_accents_list = update_list_coreference(kiswahili_accents_list, 
                                               'Fluent in Swahili', 
                                               ['Kiswahili accent', 'Fluent'])                                                

# 'Fluent in kiswahili' - separate into 'Fluent' and 'Kiswahili accent'                                                
kiswahili_accents_list = update_list_coreference(kiswahili_accents_list, 
                                               'fluent in kiswahili', 
                                               ['Kiswahili accent', 'Fluent'])
                                                
# 'Fluent swahili' - separate into 'Fluent' and 'Kiswahili accent'                                                
kiswahili_accents_list = update_list_coreference(kiswahili_accents_list, 
                                               'Fluent swahili', 
                                               ['Kiswahili accent', 'Fluent'])
                                                                                              

                                                
                                                
                                                
# 'Eloquent and fluent.' - separate into 'Eloquent' and 'Fluent'
kiswahili_accents_list = update_list_coreference(kiswahili_accents_list, 
                                               'Eloquent and fluent.', 
                                               ['Eloquent', 'Fluent'])  
                                                
                                                
                                                
                                                
                                                

processed  Lafudhi yangu ni kiswahili cha kawaida ambacho watanzania wengi wanakizungumza. Ni pamoja na kile ambacho kinafundishwa katika shule za msingi na sekondari ili kuleta maana katika mambo mbalimbali na namna ya kuzungumza kwa ujumla.  to be  ['Kiswahili accent', 'Tanzania', 'academic']  and the old accent list is:  ['Kiswahili accent', 'Tanzania', 'academic']  and the new accent list is:  ['Kiswahili accent', 'Tanzania', 'academic']
processed  Strong kiswahili accent  to be  ['Kiswahili accent', 'Strong']  and the old accent list is:  ['Kiswahili accent', 'Strong']  and the new accent list is:  ['Kiswahili accent', 'Strong']
processed  Fluent Kiswahili  to be  ['Kiswahili accent', 'Fluent']  and the old accent list is:  ['Kiswahili accent', 'Fluent']  and the new accent list is:  ['Kiswahili accent', 'Fluent']
processed  Fluent Kiswahili  to be  ['Kiswahili accent', 'Fluent']  and the old accent list is:  ['Kiswahili accent', 'Fluent']  and the new accent list is:  ['Kiswahili

### Normalize closely related accent descriptors - merge them 

There are several closely related accent descriptors, and here I merge them. 

The principles I use are: 
    
* Accents are merged where there are spelling variations 
* Accents are merged where the accent has a region descriptor with our without 'accent' - such as "French" and "French accent"
* Where a country or language descriptor and demonym are closely equivalent - "Germany" and "German"

Accents are not merged where: 

* One accent descriptor is more granular than another - "London" and "South London" are not merged. 

In [108]:
## There will be others as we put them into objects / classes

# Kiswahili accent - canonical is 'Kiswahili accent'
kiswahili_accents_list = update_list_coreference(kiswahili_accents_list, 
                                               'Kiswahili',
                                               ['Kiswahili accent'])

kiswahili_accents_list = update_list_coreference(kiswahili_accents_list, 
                                               'kiswahili',
                                               ['Kiswahili accent'])

kiswahili_accents_list = update_list_coreference(kiswahili_accents_list, 
                                               'Swahili',
                                               ['Kiswahili accent'])

kiswahili_accents_list = update_list_coreference(kiswahili_accents_list, 
                                               'swahili',
                                               ['Kiswahili accent'])

kiswahili_accents_list = update_list_coreference(kiswahili_accents_list, 
                                               'swahili accent',
                                               ['Kiswahili accent'])
kiswahili_accents_list = update_list_coreference(kiswahili_accents_list, 
                                               'Especial in swahili',
                                               ['Kiswahili accent'])





# Coastal Swahili - canonical is 'Coastal Swahili'
kiswahili_accents_list = update_list_coreference(kiswahili_accents_list, 
                                               'coastal swahili accent',
                                               ['Coastal Swahili'])
kiswahili_accents_list = update_list_coreference(kiswahili_accents_list, 
                                               'kiswahili cha mkoa wa pwani',
                                               ['Coastal Swahili'])



# kiMvita accent - canonical is 'kiMvita'

kiswahili_accents_list = update_list_coreference(kiswahili_accents_list, 
                                               'mvita',
                                               ['kiMvita'])
kiswahili_accents_list = update_list_coreference(kiswahili_accents_list, 
                                               'Mvita',
                                               ['kiMvita'])
kiswahili_accents_list = update_list_coreference(kiswahili_accents_list, 
                                               'kimvita',
                                               ['kiMvita'])
kiswahili_accents_list = update_list_coreference(kiswahili_accents_list, 
                                               'Kimvita',
                                               ['kiMvita'])


# Arusha accent - canonical is 'Arusha'
kiswahili_accents_list = update_list_coreference(kiswahili_accents_list, 
                                               'Normal Arusha accent',
                                               ['Arusha'])




# Kenyan accent - canonical is 'Kenyan'
kiswahili_accents_list = update_list_coreference(kiswahili_accents_list, 
                                               'Typical Kenyan Accent',
                                               ['Kenyan'])


# Fluent - canonical is 'Fluent'
kiswahili_accents_list = update_list_coreference(kiswahili_accents_list, 
                                               'Fluent accent',
                                               ['Fluent'])


# 'Shaped by where i have lived' - canonical is 'Lived in area'
kiswahili_accents_list = update_list_coreference(kiswahili_accents_list, 
                                               'Shaped by where i have lived',
                                               ['Lived in area'])

############### accents to disregard


# just practicing - canonical is 'DISREGARD'

kiswahili_accents_list = update_list_coreference(kiswahili_accents_list, 
                                               'no sherapping',
                                               '')
kiswahili_accents_list = update_list_coreference(kiswahili_accents_list, 
                                               'Raisi wa Kenya alitoa hutoba yake juma tatu iliyopita',
                                               '')
kiswahili_accents_list = update_list_coreference(kiswahili_accents_list, 
                                               'My accept is to recognize and record different sounds and voices',
                                               '')




processed  Kiswahili  to be  ['Kiswahili accent']  and the old accent list is:  ['Kiswahili accent']  and the new accent list is:  ['Kiswahili accent']
processed  swahili accent  to be  ['Kiswahili accent']  and the old accent list is:  ['coastal swahili accent', 'Kiswahili accent']  and the new accent list is:  ['coastal swahili accent', 'Kiswahili accent']
processed  Especial in swahili  to be  ['Kiswahili accent']  and the old accent list is:  ['My accept is to recognize and record different sounds and voices', 'Normal Arusha accent', 'Kiswahili accent']  and the new accent list is:  ['My accept is to recognize and record different sounds and voices', 'Normal Arusha accent', 'Kiswahili accent']
processed  coastal swahili accent  to be  ['Coastal Swahili']  and the old accent list is:  ['Kiswahili accent', 'Coastal Swahili']  and the new accent list is:  ['Kiswahili accent', 'Coastal Swahili']
processed  kiswahili cha mkoa wa pwani  to be  ['Coastal Swahili']  and the old accent list

In [109]:
pp.pprint(kiswahili_accents_list)


[   ['Good'],
    ['Kiswahili accent', 'Coastal Swahili'],
    ['Fluent', 'kiMvita'],
    [],
    ['Kiswahili accent'],
    ['Kiswahili accent', 'Strong'],
    ['kiMvita'],
    ['Kenyan'],
    ['Lived in area'],
    ['native'],
    ['Kiswahili accent', 'Tanzania', 'academic'],
    ['Kiswahili accent', 'Fluent'],
    ['Kiswahili accent', 'Fluent'],
    ['Kiswahili accent', 'Fluent'],
    ['Kenyan'],
    ['Eloquent', 'Fluent'],
    ['Kiswahili accent', 'Fluent'],
    ['Coastal Swahili'],
    ['Kiswahili accent', 'Fluent'],
    [],
    ['Kiswahili accent', 'Fluent'],
    ['native'],
    ['Kiswahili accent', 'Fluent'],
    ['Fluent'],
    ['Fluent', 'Kiswahili accent'],
    ['Kiswahili accent', 'Arusha']]


## Extract unique accents from the list of normalised accents into a Dict of Accent objects for easier manipulation

In [110]:
# build a dict of each unique accent using an Accent object for each object. 

ratio_display = 120 # to stop the browser crashing 

AccentDict = {}
i = 0; 

# the english_accents_list is now normalised, merged etc so this is straightforward 
for accent_list in kiswahili_accents_list:
    for accent in accent_list: 
        
        i +=1
        match = False 
        count = 0
        
        #if (i%ratio_display ==0): # only show the 100th 
            #print('')
            #print('---')
            #print('now processing: ', accent, ' - ', i)
            #print('---')
        
        # is this accent in our dict - if not, add it in 
        
        for item in AccentDict.items() : # Each item should be an Accent object 
            
            # (self, id=0, name="Accent Name", count=0, locale=None, descriptors=None):
            #pp.pprint(item[1].__str__())
            
            #if (i%ratio_display ==0): # only show the 100th 
                #print('item is: ', item)
                #print(type(item))
                #print('now checking match for: item:', item[1], ' and accent: ', accent)
            
            if (item[1].name == accent) : # update the count
                
                #if (i%ratio_display ==0): # only show the 100th 
                    #print('---')
                    #print('match is True')
                    #print('---')
                    
                match = True 
                
                #if (i%ratio_display ==0): # only show the 100th 
                    #print('accent count was: ', item[1].count)
                
                # update the count of the accent 
                item[1].count+=1
                
                #if (i%ratio_display ==0): # only show the 100th 
                    #print('accent count is now: ', item[1].count)
                
                
        # this match loop has to be outside the for: loop above 
        # because if we add items to the dict inside the loop
        # then it will not run - because there are zero items in the dict to begin with 
        
        if (not match) :   
            
            # (self, id=0, name="Accent Name", count=0, locale=None, descriptors=None):
            AccentDict[i] = cva.Accent(i, accent, 1, 'en', None, False) 
                


In [111]:
# do an explicit reload as I'm still working on the classes 
#reload(cva)

all_accents = cva.AccentCollection(AccentDict)

In [112]:
print(all_accents.total())

13


In [113]:
all_accents.__str__()

'id is 1, name is Good, count is 1, locale is en, descriptors are None, predetermined is False. id is 2, name is Kiswahili accent, count is 13, locale is en, descriptors are None, predetermined is False. id is 3, name is Coastal Swahili, count is 2, locale is en, descriptors are None, predetermined is False. id is 4, name is Fluent, count is 11, locale is en, descriptors are None, predetermined is False. id is 5, name is kiMvita, count is 2, locale is en, descriptors are None, predetermined is False. id is 8, name is Strong, count is 1, locale is en, descriptors are None, predetermined is False. id is 10, name is Kenyan, count is 2, locale is en, descriptors are None, predetermined is False. id is 11, name is Lived in area, count is 1, locale is en, descriptors are None, predetermined is False. id is 12, name is native, count is 2, locale is en, descriptors are None, predetermined is False. id is 14, name is Tanzania, count is 1, locale is en, descriptors are None, predetermined is Fal

In [114]:
# do an explicit reload as I'm still working on the classes 
#reload(cva)

all_accents_sortedByCount = all_accents.sortByCount(reverse=True)

for accent in all_accents_sortedByCount.items(): 
    print(accent[1].__str__())
    
# now I am cross-checking to see if there are any other duplicates or accents that should be merged


id is 2, name is Kiswahili accent, count is 13, locale is en, descriptors are None, predetermined is False.
id is 4, name is Fluent, count is 11, locale is en, descriptors are None, predetermined is False.
id is 12, name is native, count is 2, locale is en, descriptors are None, predetermined is False.
id is 5, name is kiMvita, count is 2, locale is en, descriptors are None, predetermined is False.
id is 10, name is Kenyan, count is 2, locale is en, descriptors are None, predetermined is False.
id is 3, name is Coastal Swahili, count is 2, locale is en, descriptors are None, predetermined is False.
id is 15, name is academic, count is 1, locale is en, descriptors are None, predetermined is False.
id is 14, name is Tanzania, count is 1, locale is en, descriptors are None, predetermined is False.
id is 8, name is Strong, count is 1, locale is en, descriptors are None, predetermined is False.
id is 11, name is Lived in area, count is 1, locale is en, descriptors are None, predetermined is

---
<a id='PreDetermined'></a>
## Label the accents that were pre-determined 

Since its inception, Mozilla Common Voice has enabled data contributors to enter demographic age such as age, gender and accent. These associations are not validated in any way, and we don't have any indicator of how accurate they are. Accent _used_ to be represented as an a priori drop-down list, which the contributor could select from. From Common Voice v10, the data contributor can **self-describe** their accent, however, the previous accent list is still presented (so may be more frequently chosen by the data contributor). We need to be able to distinguish these accents visually to help with the exploration. 

```
"splits": {
        "accent": {
          "": 0.51,
          "canada": 0.03,
          "england": 0.08,
          "us": 0.23,
          "indian": 0.07,
          "australia": 0.03,
          "malaysia": 0,
          "newzealand": 0.01,
          "african": 0.01,
          "ireland": 0.01,
          "philippines": 0,
          "singapore": 0,
          "scotland": 0.02,
          "hongkong": 0,
          "bermuda": 0,
          "southatlandtic": 0,
          "wales": 0,
          "other": 0.01
        },

```

The `cv-datasets` splits above have labels for the accents that don't actually match the accent name in the data. So we need to specify the accents that are pre-determined. This is how they appear to the data contributor filling out their profile at: [https://commonvoice.mozilla.org/en/profile/info](https://commonvoice.mozilla.org/en/profile/info)


![Accents as specified on Mozilla Common Voice profile](cv-profile-specify-accent.png)


In [115]:
# create a list of the pre-existing accents 
# this is how they are given in the dataset. 

# TODO: for better maintainability, move this to a list of accents for each language, 
# that can be updated in a separate file, rather than specified here in an adhoc way. 

predetermined_accents_list = ['United States English', 
                         'England English', 
                         'India and South Asia (India, Pakistan, Sri Lanka)', 
                         'Canadian English', 
                         'Australian English', 
                         'Southern African (South Africa, Zimbabwe, Namibia)', 
                         'Irish English', 
                         'Scottish English', 
                         'New Zealand English', 
                         'Hong Kong English', 
                         'Filipino', 
                         'Malaysian English', 
                         'Singaporean English', 
                         'Welsh English', 
                         'West Indies and Bermuda (Bahamas, Bermuda, Jamaica, Trinidad)', 
                         'South Atlantic (Falkland Islands, Saint Helena)']



  

In [116]:
# use the predetermined_accents_list to populate the 'predetermined_status' attribute of each Accent object 
# to do this we use a method on the AccentCollection class

import cvaccents as cva
#reload(cva)

all_accents.updatePredeterminedStatus(predetermined_accents_list, True)

print(all_accents)


(1, <cvaccents.Accent object at 0x7fa5c85d0340>)
1
(2, <cvaccents.Accent object at 0x7fa5c85d01f0>)
2
(3, <cvaccents.Accent object at 0x7fa5c85d0dc0>)
3
(4, <cvaccents.Accent object at 0x7fa5c85d0430>)
4
(5, <cvaccents.Accent object at 0x7fa5c85d04c0>)
5
(8, <cvaccents.Accent object at 0x7fa5c85d07f0>)
8
(10, <cvaccents.Accent object at 0x7fa5c85d0bb0>)
10
(11, <cvaccents.Accent object at 0x7fa5c85d0c40>)
11
(12, <cvaccents.Accent object at 0x7fa5c85d0670>)
12
(14, <cvaccents.Accent object at 0x7fa5c85d0af0>)
14
(15, <cvaccents.Accent object at 0x7fa5c44848b0>)
15
(23, <cvaccents.Accent object at 0x7fa5c4484a90>)
23
(39, <cvaccents.Accent object at 0x7fa5c4484bb0>)
39
id is 1, name is Good, count is 1, locale is en, descriptors are None, predetermined is False. id is 2, name is Kiswahili accent, count is 13, locale is en, descriptors are None, predetermined is False. id is 3, name is Coastal Swahili, count is 2, locale is en, descriptors are None, predetermined is False. id is 4, name 

---
<a id='AccentDescriptors'></a>

## Create Accent Descriptors and add them to each Accent 

Each Accent can have multiple Accent Descriptors. 

For example the accent `Pronounced German` contains both a _national regional descriptor_ and an _accent strength descriptor_. 

I have used the following principles for Accent Descriptors for English. This should be considered a CodeBook. 

### Geographic Regional Descriptors 

Regional descriptors are where the accent has been specified with reference to a geographic region. 

* `Geographic Region` 

Within this Category, there are several sub-categories: 

* `Country Descriptor` - where the descriptor is a country or a nation-state. 
* `Supranational region descriptor` - where the descriptor is a geographic region that crosses or overlaps multiple countries. An example would be `Slavic`, which refers to an [ethno-linguistic group](https://en.wikipedia.org/wiki/Slavic_languages) that covers several countries in Eastern Europe. 
* `Subnational region descriptor` - where the descriptor is a geographic region that refers to a region within a country's national boundary. An example would be `Midwestern United States`. 
* `City descriptor` - where the descriptor is a geographic region that refers to a city, town or municipality. An example would be `New York City` or `London`. 

One choice I have made here is not to represent areas _within_ cities using a separate Accent Descriptor. Examples here would be `Brooklyn` or `East London` - they have been classified as cities. This is because there are so few of them that it doesn't change the analysis significantly. 

### First or other language descriptor

This refers to Accent Descriptors where the data contributor refers to their accent using a descriptor such as `non-native` or `native speaker`. This is sometimes referred to as `first language (L1)` or `second language (L2)`. Although this _may_ be used to refer to the data contributor's _level of fluency_ in a language, I've chosen not to refer to this as a _level of fluency_ - because even though someone speaks a language as a second or other language, this _does not imply_ their level of fluency specifically. One could speak Mandarin as a second language, but be highly fluent. One could speak French as a second language and be less than proficient. 

* `First or other language` 

### Accent strength descriptor 

This refers to Accent Descriptors where the data contributor refers to their accent using a marker of the strength of the accent. Examples included `pronounced`, `90%` or `slight`. 

* `Accent strength descriptor` 

### Vocal quality descriptor 

This refers to Accent Descriptors where the data contributor refers to their accent using words to describe aspects of their voice that are subjective and qualitative - such as `sultry` or `sassy`. 

* `Vocal quality descriptor` 

TODO: is `quality` the correct word here? 

## Phonetic changes 

This category refers to Accent Descriptors which describe a particular phonetic change. This is used as a parent category to group these Accent Descriptors. 

* `Phonetic changes`

### Specific phonetic changes 

There are several phonetic changes that are linguistic markers for accent difference. 

* `Specified phonetic change` is applied when the Accent Descriptor itself specifies the type of phonetic change. `cot-caught merger` is an example. 

* `Rhoticity` is applied when the Accent Descriptor is describing how `/r/` and related phonemes are pronounced.

* `Inflection` is applied when the Accent Descriptor is describing an inflection change. 

## Register 

Although the Mozilla Common Voice data used _elicited speech_ - utterances spoken from given text prompts, people can speak in a range of _registers_. A register is generally the level of formality of speech - such as `formal`, or `educated` or `slang`. It may indicate socio-economic heritage of the speaker. This category captures Accent Descriptors that describe an accent in this way. 

* `Register` 

## Named accent 

Some accents, such as `Geordie` or `Scouse` have a related geographical region descriptor - North East England, and Liverpool respectively, but ones such as `Received Pronunciation` do not. This category allows for having a Named Accent descriptor where no related geographic region descriptor exists, as well as being able to capture specifically named accents.

* `Named Accent`

## Accent effects due to physical changes 

Accent changes may occur due to physical changes in the speaker's vocal tract - for instance through surgery or disease. This Accent Descriptor is used to capture descriptions such as these. 

* `Accent effects due to physical changes`

## Mixed or variable accent 

Where the data contributor specifies that their accent is a mixture or amalgamation of accents, but does not provide further information (for example so the Accent Descriptors can be separated or merged), this Accent Descriptor is used to capture this description. 

* `Mixed or variable accent`






In [117]:
import cvaccents as cva
#reload(cva)

# Using the Accent Descriptor class to create Accent Descriptor accents for the above 

descriptorGeoRegion = cva.AccentDescriptor(
    id = 100, 
    name='Geographic region', 
    definition = 'Indicates a geographic region used as a descriptor.', 
    parent = None, 
)
descriptorGeoCountry = cva.AccentDescriptor(
    id = 200, 
    name='Country', 
    definition = 'Indicates a geographic region of a country or nation-state.', 
    parent = 100, 
)
descriptorGeoSupra = cva.AccentDescriptor(
    id = 300, 
    name='Supranational region', 
    definition = 'Indicates a geographic region which crosses or overlaps multiple countries.', 
    parent = 100, 
)
descriptorGeoSub = cva.AccentDescriptor(
    id = 400, 
    name='Subnational region', 
    definition = 'Indicates a geographic region within a national boundary.', 
    parent = 100, 
)
descriptorGeoCity = cva.AccentDescriptor(
    id = 500, 
    name='City', 
    definition = 'Indicates a geographic region referring to a city, town or municipality.', 
    parent = 100, 
)


descriptorFOL = cva.AccentDescriptor(
    id = 600, 
    name='First or other language', 
    definition = 'Indicates a descriptor related to whether this is the speaker\s first or other language.', 
    parent = None, 
)

descriptorAccStr = cva.AccentDescriptor(
    id = 700, 
    name='Accent strength descriptor', 
    definition = 'Indicates a marker of accent strength.', 
    parent = None, 
)

descriptorVocQual = cva.AccentDescriptor(
    id = 800, 
    name='Vocal quality descriptor', 
    definition = 'Indicates a subjective vocal quality.', 
    parent = None, 
)


descriptorPhonChanges = cva.AccentDescriptor(
    id = 1000, 
    name='Phonetic Changes', 
    definition = 'Indicates a phonetic change.', 
    parent = None, 
)
descriptorPhonSpecific = cva.AccentDescriptor(
    id = 1100, 
    name='Specific phonetic changes', 
    definition = 'Indicates a specific phonetic change.', 
    parent = 1000, 
)
descriptorPhonRhoticity = cva.AccentDescriptor(
    id = 1200, 
    name='Rhoticity', 
    definition = 'Indicates rhoticity or its absence.', 
    parent = 1000, 
)
descriptorPhonInflection = cva.AccentDescriptor(
    id = 1200, 
    name='Inflection', 
    definition = 'Indicates an inflection change.', 
    parent = 1000, 
)

descriptorRegister = cva.AccentDescriptor(
    id = 1300, 
    name='Register', 
    definition = 'Indicates which register the data contributor speaks in.', 
    parent = None, 
)

descriptorNamedAcc = cva.AccentDescriptor(
    id = 1400, 
    name='Specifically named accent', 
    definition = 'Indicates a specifically named accent.', 
    parent = None, 
)

descriptorPhysChange = cva.AccentDescriptor(
    id = 1500, 
    name='Accent effects due to physical changes', 
    definition = 'Indicates accent changes due to physical changes of the data contributor.', 
    parent = None, 
)

descriptorAccMixed = cva.AccentDescriptor(
    id = 1600, 
    name='Mixed or variable accent', 
    definition = 'Indicates mixture or amalgamation of accents.', 
    parent = None, 
)


descriptorAccUncertainty = cva.AccentDescriptor(
    id = 2000, 
    name='Uncertainty marker', 
    definition = 'Indicates uncertainty of descriptor.', 
    parent = None, 
)


print(descriptorGeoRegion.__str__())
print(descriptorGeoCountry.__str__())
print(descriptorGeoSupra.__str__())
print(descriptorGeoSub.__str__())
print(descriptorGeoCity.__str__())

print(descriptorFOL.__str__())

print(descriptorAccStr.__str__())

print(descriptorVocQual.__str__())

print(descriptorPhonChanges.__str__())
print(descriptorPhonSpecific.__str__())
print(descriptorPhonRhoticity.__str__())
print(descriptorPhonInflection.__str__())

print(descriptorRegister.__str__())

print(descriptorNamedAcc.__str__())

print(descriptorPhysChange.__str__())

print(descriptorAccMixed.__str__())

import cvaccents as cva
#reload(cva)

# Using the Accent Descriptor class to create Accent Descriptor accents for the above 

descriptorGeoRegion = cva.AccentDescriptor(
    id = 100, 
    name='Geographic region', 
    definition = 'Indicates a geographic region used as a descriptor.', 
    parent = None, 
)
descriptorGeoCountry = cva.AccentDescriptor(
    id = 200, 
    name='Country', 
    definition = 'Indicates a geographic region of a country or nation-state.', 
    parent = 100, 
)
descriptorGeoSupra = cva.AccentDescriptor(
    id = 300, 
    name='Supranational region', 
    definition = 'Indicates a geographic region which crosses or overlaps multiple countries.', 
    parent = 100, 
)
descriptorGeoSub = cva.AccentDescriptor(
    id = 400, 
    name='Subnational region', 
    definition = 'Indicates a geographic region within a national boundary.', 
    parent = 100, 
)
descriptorGeoCity = cva.AccentDescriptor(
    id = 500, 
    name='City', 
    definition = 'Indicates a geographic region referring to a city, town or municipality.', 
    parent = 100, 
)


descriptorFOL = cva.AccentDescriptor(
    id = 600, 
    name='First or other language', 
    definition = 'Indicates a descriptor related to whether this is the speaker\s first or other language.', 
    parent = None, 
)

descriptorAccStr = cva.AccentDescriptor(
    id = 700, 
    name='Accent strength descriptor', 
    definition = 'Indicates a marker of accent strength.', 
    parent = None, 
)

descriptorVocQual = cva.AccentDescriptor(
    id = 800, 
    name='Vocal quality descriptor', 
    definition = 'Indicates a subjective vocal quality.', 
    parent = None, 
)


descriptorPhonChanges = cva.AccentDescriptor(
    id = 1000, 
    name='Phonetic Changes', 
    definition = 'Indicates a phonetic change.', 
    parent = None, 
)
descriptorPhonSpecific = cva.AccentDescriptor(
    id = 1100, 
    name='Specific phonetic changes', 
    definition = 'Indicates a specific phonetic change.', 
    parent = 1000, 
)
descriptorPhonRhoticity = cva.AccentDescriptor(
    id = 1200, 
    name='Rhoticity', 
    definition = 'Indicates rhoticity or its absence.', 
    parent = 1000, 
)
descriptorPhonInflection = cva.AccentDescriptor(
    id = 1200, 
    name='Inflection', 
    definition = 'Indicates an inflection change.', 
    parent = 1000, 
)

descriptorRegister = cva.AccentDescriptor(
    id = 1300, 
    name='Register', 
    definition = 'Indicates which register the data contributor speaks in.', 
    parent = None, 
)

descriptorNamedAcc = cva.AccentDescriptor(
    id = 1400, 
    name='Named Accent', 
    definition = 'Indicates a specifically named accent.', 
    parent = None, 
)

descriptorPhysChange = cva.AccentDescriptor(
    id = 1500, 
    name='Accent effects due to physical changes', 
    definition = 'Indicates accent changes due to physical changes of the data contributor.', 
    parent = None, 
)

descriptorAccMixed = cva.AccentDescriptor(
    id = 1600, 
    name='Mixed or variable accent', 
    definition = 'Indicates mixture or amalgamation of accents.', 
    parent = None, 
)


descriptorAccUncertainty = cva.AccentDescriptor(
    id = 2000, 
    name='Uncertainty marker', 
    definition = 'Indicates uncertainty of descriptor.', 
    parent = None, 
)

descriptorGeneration = cva.AccentDescriptor(
    id = 2100, 
    name='Generational marker', 
    definition = 'Indicates generational association of speaker.', 
    parent = None, 
)

descriptorSocioeconomic = cva.AccentDescriptor(
    id = 2200, 
    name='Socio-economic marker', 
    definition = 'Indicates the socio-economic status of speaker.', 
    parent = None, 
)


descriptorHybrid = cva.AccentDescriptor(
    id = 2300, 
    name='Hybrid dialect', 
    definition = 'Indicates that the speaker has an accent of a hybrid dialect of the language.', 
    parent = None, 
)

descriptorHeritage = cva.AccentDescriptor(
    id = 2400, 
    name='Linguistic heritage of speaker', 
    definition = 'Indicates something about the language acquisition or language immersion of the speaker', 
    parent = None, 
)


#####


print(descriptorGeoRegion.__str__())
print(descriptorGeoCountry.__str__())
print(descriptorGeoSupra.__str__())
print(descriptorGeoSub.__str__())
print(descriptorGeoCity.__str__())

print(descriptorFOL.__str__())

print(descriptorAccStr.__str__())

print(descriptorVocQual.__str__())

print(descriptorPhonChanges.__str__())
print(descriptorPhonSpecific.__str__())
print(descriptorPhonRhoticity.__str__())
print(descriptorPhonInflection.__str__())

print(descriptorRegister.__str__())

print(descriptorNamedAcc.__str__())

print(descriptorPhysChange.__str__())

print(descriptorAccMixed.__str__())

print(descriptorAccUncertainty.__str__())

print(descriptorGeneration.__str__())

print(descriptorSocioeconomic.__str__())

print(descriptorHybrid.__str__())

print(descriptorHeritage.__str__())








id is 100, name is Geographic region, definition is Indicates a geographic region used as a descriptor., parent is None
id is 200, name is Country, definition is Indicates a geographic region of a country or nation-state., parent is 100
id is 300, name is Supranational region, definition is Indicates a geographic region which crosses or overlaps multiple countries., parent is 100
id is 400, name is Subnational region, definition is Indicates a geographic region within a national boundary., parent is 100
id is 500, name is City, definition is Indicates a geographic region referring to a city, town or municipality., parent is 100
id is 600, name is First or other language, definition is Indicates a descriptor related to whether this is the speaker\s first or other language., parent is None
id is 700, name is Accent strength descriptor, definition is Indicates a marker of accent strength., parent is None
id is 800, name is Vocal quality descriptor, definition is Indicates a subjective voc

### Now we have the Accent Descriptors defined, we can associate Accent Descriptors with each Accent 

In [118]:
# I could put them all in one list, 
# but it's easier to debug this way

# Generic region descriptors that don't fit into any other category 
region_descriptors = [
    #('non regional', descriptorGeoRegion),
    #('International English', descriptorFOL)
]

# Country descriptors 
country_descriptors = [ 
    ('Kenyan', descriptorGeoCountry),
    ('Tanzania', descriptorGeoCountry)
]

# Subnational descriptors 
subnational_descriptors = [
    ('kiMvita', descriptorGeoSub),
    ('Arusha', descriptorGeoSub),
    ('Coastal Swahili', descriptorGeoSub),
]

# Supranational descriptors 
supranational_descriptors = [
    ('Kiswahili accent', descriptorGeoSupra)
]

# City descriptors 
city_descriptors = [
    #('New York City', descriptorGeoCity),
]

# First or other language descriptors 
FOL_descriptors = [
    ('Non-native speaker', descriptorFOL),
    ('Bilingual', descriptorFOL),
    ('native', descriptorFOL),
    ('Second language', descriptorFOL),
    ('Basic', descriptorFOL),
    ('time spent in location', descriptorFOL),
    ('some', descriptorFOL),
    ('mid level', descriptorFOL),
    ('Spoke language when a child', descriptorFOL),
    ('fluent', descriptorFOL),
    ('Conversational', descriptorFOL),
    ('Foreign', descriptorFOL),
    ('Native speaker', descriptorFOL),
    ('Good', descriptorFOL),
    ('Fluent', descriptorFOL)
]


# Accent Strength descriptors
AccStr_descriptors = [
    ('pronounced', descriptorAccStr),
    ('slight', descriptorAccStr),
    ('Mild', descriptorAccStr),
    ('Not bad', descriptorAccStr),
    ('little bit', descriptorAccStr),
    ('tinge', descriptorAccStr),
    ('90%', descriptorAccStr),
    ('10%', descriptorAccStr),
    ('heavy', descriptorAccStr),
    ('little', descriptorAccStr),
    ('minor', descriptorAccStr),
    ('plain', descriptorAccStr), 
    ('Neutral', descriptorAccStr),
    ('touch', descriptorAccStr), 
    ('mostly', descriptorAccStr), 
    ('Strong', descriptorAccStr)
]


# Vocal quality descriptors
VocQual_descriptors = [
    ('sultry', descriptorVocQual),
    ('classy', descriptorVocQual),
    ('sassy', descriptorVocQual),
    ('Slight lisp', descriptorVocQual),
    ('Slightly effeminate', descriptorVocQual),
    ('Low', descriptorVocQual),
    ('Demure', descriptorVocQual),
    ('Gay', descriptorVocQual),
    ('slow', descriptorVocQual),
    ('slurred', descriptorVocQual)
]


# Phonetic descriptors 
PhonSpecific_descriptors = [
    ('pin/pen merger', descriptorPhonSpecific),
    ('heavy consonants', descriptorPhonSpecific),
    ('cot-caught merger', descriptorPhonSpecific)
]
PhonRhoticity_descriptors = [
    ("pronounced r's", descriptorPhonRhoticity)
]
PhonInflection_descriptors = [
    ('mostly affecting inflection', descriptorPhonInflection)
]

# Register descriptors
Register_descriptors = [
    ('surfer', descriptorRegister),
    ('academic', descriptorRegister),
    ('Educated', descriptorRegister),
    ('formal', descriptorRegister),
    ('slang', descriptorRegister),
    ('Urban', descriptorRegister),
    ('classy', descriptorRegister),
    ('sassy', descriptorRegister),
    ('city', descriptorRegister),
    ('Cool', descriptorRegister),
    ('Conversational', descriptorRegister),
    ('Received Pronunciation', descriptorRegister), 
    ('Eloquent', descriptorRegister)
]

# Named accent descriptors
NamedAcc_descriptors = [
    ('Patois', descriptorNamedAcc),
    ('Received Pronunciation', descriptorNamedAcc),
    ('Kiwi', descriptorNamedAcc),
    ('Chicano English', descriptorNamedAcc),
    ('"Valley Girl" English', descriptorNamedAcc),
    ('Okie', descriptorNamedAcc),
    
    ('Southern drawl', descriptorNamedAcc),
    ('Transatlantic English', descriptorNamedAcc),
    ('Culchie', descriptorNamedAcc),
    ('African American Vernacular', descriptorNamedAcc),
    ('Standard American English', descriptorNamedAcc)
   
]
        
# Physical change descriptors
PhysChange_descriptors = [
    ('changes due to oral surgery', descriptorPhysChange)
]

# Mixed accent descriptors
AccMixed_descriptors = [
    ('Variable', descriptorAccMixed),
    ('Adjustable', descriptorAccMixed),
    ('Mix of accents', descriptorAccMixed),
    ('try to maintain originality', descriptorAccMixed)
]
    
# Uncertainty marker 
AccUncertainty_descriptors = [
    ('I think', descriptorAccUncertainty)
]    

# Generational associations 
Generation_descriptors = [
    ('Gen Z', descriptorGeneration)
]  

# Socio-economic status descriptors 
Socioeconomic_descriptors = [
    ('Middle class', descriptorSocioeconomic)
]

# Hybrid descriptors 
Hybrid_descriptors = [
    ('Hunglish', descriptorHybrid),
    ('Denglish', descriptorHybrid)
]

# Heritage descriptors 
Heritage_descriptors = [
    ('Born in area', descriptorHeritage),
    ('Lived in area', descriptorHeritage),
]


In [119]:
# create one list from the above lists 

accent_descriptor_list = [
    region_descriptors,
    country_descriptors,
    subnational_descriptors,
    supranational_descriptors,
    city_descriptors,
    FOL_descriptors,
    AccStr_descriptors,
    VocQual_descriptors,
    PhonSpecific_descriptors,
    PhonRhoticity_descriptors,
    PhonInflection_descriptors,
    Register_descriptors,
    NamedAcc_descriptors,
    PhysChange_descriptors,
    AccMixed_descriptors,
    AccUncertainty_descriptors,
    Generation_descriptors,
    Socioeconomic_descriptors,
    Hybrid_descriptors,
    Heritage_descriptors
]



In [120]:
# Now we loop through all the accents 
# And if the accent name matches one of the descriptors in accent_descriptor_list 
# We add the relevant Accent Descriptor to the Accent's object representation 

for accent_descriptor_category in accent_descriptor_list: 
    for accent_descriptor in accent_descriptor_category: 
        for accent in all_accents.items(): 
            
            #print ('accent is: ', accent[1], ' and accent_descriptor is: ', accent_descriptor)
            
            if accent[1]._name == accent_descriptor[0]: 
                #print ('MATCH!')
                if accent[1]._descriptors is None: 
                    accent[1]._descriptors = [] # initialise list if None
                accent[1]._descriptors.append(accent_descriptor[1]) # append because there can be multiple 
                

In [121]:
# the accents should now have descriptors 

import cvaccents as cva
#reload(cva)

for accent in all_accents.items(): 
    print(accent[1].__str__())

id is 1, name is Good, count is 1, locale is en, descriptors are [<cvaccents.AccentDescriptor object at 0x7fa5c85df880>], predetermined is False.
id is 2, name is Kiswahili accent, count is 13, locale is en, descriptors are [<cvaccents.AccentDescriptor object at 0x7fa5c85df790>], predetermined is False.
id is 3, name is Coastal Swahili, count is 2, locale is en, descriptors are [<cvaccents.AccentDescriptor object at 0x7fa5c85df7f0>], predetermined is False.
id is 4, name is Fluent, count is 11, locale is en, descriptors are [<cvaccents.AccentDescriptor object at 0x7fa5c85df880>], predetermined is False.
id is 5, name is kiMvita, count is 2, locale is en, descriptors are [<cvaccents.AccentDescriptor object at 0x7fa5c85df7f0>], predetermined is False.
id is 8, name is Strong, count is 1, locale is en, descriptors are [<cvaccents.AccentDescriptor object at 0x7fa5c3c24eb0>], predetermined is False.
id is 10, name is Kenyan, count is 2, locale is en, descriptors are [<cvaccents.AccentDescri

In [122]:
# Now do a cross-check to see if there are any accents for which the ._descriptor is None 
# this flags if I've missed an accent somewhere 



missing_descriptors = all_accents.reportNoneAccentDescriptors() 

for accent in missing_descriptors: 
    print (accent[1].__str__())
    

## Create relationships between Accents suitable for data visualisation 

Now, we want to create relationships _between_ accents so that we can visualise accents as **nodes** and their relationships as **edges**. 

In [124]:
pp.pprint(kiswahili_accents_list)

[   ['Good'],
    ['Kiswahili accent', 'Coastal Swahili'],
    ['Fluent', 'kiMvita'],
    [],
    ['Kiswahili accent'],
    ['Kiswahili accent', 'Strong'],
    ['kiMvita'],
    ['Kenyan'],
    ['Lived in area'],
    ['native'],
    ['Kiswahili accent', 'Tanzania', 'academic'],
    ['Kiswahili accent', 'Fluent'],
    ['Kiswahili accent', 'Fluent'],
    ['Kiswahili accent', 'Fluent'],
    ['Kenyan'],
    ['Eloquent', 'Fluent'],
    ['Kiswahili accent', 'Fluent'],
    ['Coastal Swahili'],
    ['Kiswahili accent', 'Fluent'],
    [],
    ['Kiswahili accent', 'Fluent'],
    ['native'],
    ['Kiswahili accent', 'Fluent'],
    ['Fluent'],
    ['Fluent', 'Kiswahili accent'],
    ['Kiswahili accent', 'Arusha']]


In [125]:
for idx, value in all_accents.items(): 
    print(idx, value)

1 id is 1, name is Good, count is 1, locale is en, descriptors are [<cvaccents.AccentDescriptor object at 0x7fa5c85df880>], predetermined is False.
2 id is 2, name is Kiswahili accent, count is 13, locale is en, descriptors are [<cvaccents.AccentDescriptor object at 0x7fa5c85df790>], predetermined is False.
3 id is 3, name is Coastal Swahili, count is 2, locale is en, descriptors are [<cvaccents.AccentDescriptor object at 0x7fa5c85df7f0>], predetermined is False.
4 id is 4, name is Fluent, count is 11, locale is en, descriptors are [<cvaccents.AccentDescriptor object at 0x7fa5c85df880>], predetermined is False.
5 id is 5, name is kiMvita, count is 2, locale is en, descriptors are [<cvaccents.AccentDescriptor object at 0x7fa5c85df7f0>], predetermined is False.
8 id is 8, name is Strong, count is 1, locale is en, descriptors are [<cvaccents.AccentDescriptor object at 0x7fa5c3c24eb0>], predetermined is False.
10 id is 10, name is Kenyan, count is 2, locale is en, descriptors are [<cvaccen

In [126]:
## Creating linkages between the individual accents and how they are represented in the data. 
## What I want to do here is create a data structure that has the ID of the accent 
## and something to describe the edge: 
## 
## The data structure I think will work here is: 
## 
## { 99: (123, 456)} 
## to represent an edge between accent ID 123 and accent 456
## 
## One thing to be aware of here is that the edges are NON-DIRECTIONAL
## {99: (123, 456)}
## is equivalent to 
## {99: (123, 456)}
## so we need a way to remove duplicates 

## The data structures we are using are: 
## 
## all_accents - Accent Collection object of all Accents, merged and normalised
## english_accents_list - this is a list of list of strings,
##                        where each list represents the Accents that are related
##   
## what we want to do is go through each list, 
## and find the ID number of the accent 
## from the Dict, 
## then build a Dict that represents the Accent's relation to other Accents
## this is accent_nodes

accent_nodes = {}
i = 0;

for accent_list in kiswahili_accents_list:

    #print(accent_list)
    
    # initialise the list first 
    accent_nodes[i] = []
    
    for accent_list_item in accent_list: 
        #print('now processing', accent_list_item)
        
        for accent in all_accents.items(): 
            
            #if (i%ratio_display ==0): # only show the 100th 
                #print('---')
                #print ('now looking at row: ', accent_list, 'and accent list item: ', accent_list_item, ' and accent: ', accent)
        
            if (accent_list_item == accent[1]._name): ## match 
                
                #if (i%ratio_display ==0): # only show the 100th 
                    #print('---')
                    #print ('match!')
                    
                #print(accent[0])
                #print('i is: ', i)
                
                
                accent_nodes[i].append(accent[0]) # we want the accent ID number
                
                if (len(accent_nodes[i]) > 1) : 
                    # double check nodes that have more than 1 element 
                    print(accent[0])
                    print('i is: ', i)
                    print('length of accent_nodes[i] is: ', len(accent_nodes[i]))
                    pp.pprint(accent_nodes[i])
                

    
    
     
    i +=1                 

3
i is:  1
length of accent_nodes[i] is:  2
[2, 3]
5
i is:  2
length of accent_nodes[i] is:  2
[4, 5]
8
i is:  5
length of accent_nodes[i] is:  2
[2, 8]
14
i is:  10
length of accent_nodes[i] is:  2
[2, 14]
15
i is:  10
length of accent_nodes[i] is:  3
[2, 14, 15]
4
i is:  11
length of accent_nodes[i] is:  2
[2, 4]
4
i is:  12
length of accent_nodes[i] is:  2
[2, 4]
4
i is:  13
length of accent_nodes[i] is:  2
[2, 4]
4
i is:  15
length of accent_nodes[i] is:  2
[23, 4]
4
i is:  16
length of accent_nodes[i] is:  2
[2, 4]
4
i is:  18
length of accent_nodes[i] is:  2
[2, 4]
4
i is:  20
length of accent_nodes[i] is:  2
[2, 4]
4
i is:  22
length of accent_nodes[i] is:  2
[2, 4]
2
i is:  24
length of accent_nodes[i] is:  2
[4, 2]
39
i is:  25
length of accent_nodes[i] is:  2
[2, 39]


In [127]:
pp.pprint(len(accent_nodes))

26


In [128]:
pp.pprint(accent_nodes)

{   0: [1],
    1: [2, 3],
    2: [4, 5],
    3: [],
    4: [2],
    5: [2, 8],
    6: [5],
    7: [10],
    8: [11],
    9: [12],
    10: [2, 14, 15],
    11: [2, 4],
    12: [2, 4],
    13: [2, 4],
    14: [10],
    15: [23, 4],
    16: [2, 4],
    17: [3],
    18: [2, 4],
    19: [],
    20: [2, 4],
    21: [12],
    22: [2, 4],
    23: [4],
    24: [4, 2],
    25: [2, 39]}


In [129]:
# this figure is a cross-check - 
# it should equal the original length of accent_nodes minus the size of deletion_list
pp.pprint(len(accent_nodes))

26


In [130]:
pp.pprint(accent_nodes)

{   0: [1],
    1: [2, 3],
    2: [4, 5],
    3: [],
    4: [2],
    5: [2, 8],
    6: [5],
    7: [10],
    8: [11],
    9: [12],
    10: [2, 14, 15],
    11: [2, 4],
    12: [2, 4],
    13: [2, 4],
    14: [10],
    15: [23, 4],
    16: [2, 4],
    17: [3],
    18: [2, 4],
    19: [],
    20: [2, 4],
    21: [12],
    22: [2, 4],
    23: [4],
    24: [4, 2],
    25: [2, 39]}


In [131]:
# Now what I need to do is create a JSON format suitable for using in 
# a Trellis diagram in Observable 
# e.g. https://observablehq.com/@jameslaneconkling/trellis

# The format of the JSON looks like: 
#   const nodes = [
#    {id: 'Myriel', group: 1},
#    {id: 'Napoleon', group: 1},

# const edges = [
#    {source: 'Napoleon', target: 'Myriel'},
#    {source: 'Mlle.Baptistine', target: 'Myriel'},

# the nodes should be as easy as JSON dumping the all_accents AccentCollection 
# possibly using a method 

# the edges will be a bit more complex 

### Nodes

In [132]:
print(all_accents.total())

13


In [133]:
#reload(cva)

all_accents.exportJSON(accents_filename)

True

In [134]:
print(len(accent_nodes))

26


In [135]:
pp.pprint(accent_nodes)

{   0: [1],
    1: [2, 3],
    2: [4, 5],
    3: [],
    4: [2],
    5: [2, 8],
    6: [5],
    7: [10],
    8: [11],
    9: [12],
    10: [2, 14, 15],
    11: [2, 4],
    12: [2, 4],
    13: [2, 4],
    14: [10],
    15: [23, 4],
    16: [2, 4],
    17: [3],
    18: [2, 4],
    19: [],
    20: [2, 4],
    21: [12],
    22: [2, 4],
    23: [4],
    24: [4, 2],
    25: [2, 39]}


In [136]:
# let's do some sanity checking to make sure these are correct 

# 0: {'source': 1, 'target': 2, 'weight': 23}
# this is equivalent to the occurrence [1, 2] in accent_nodes 
# plus the occurrences of [2, 1] because we have removed bidirectional edges
# and represents a relationship between 
# 'England English' - accent id 1 - and 'United States English' - accent id 2

print('--- checking 1, 2 ---')

count = 0
for idx, node in accent_nodes.items(): 
    if (node == list([1, 2])) :
        count +=1
print(count)
        
print('--- checking 2, 1 ---')

count = 0
for idx, node in accent_nodes.items(): 
    if (node == list([2, 1])) :
        count +=1
print(count)
        
# these together should sum to 23 
# the first one is 9 which appears correct 
# but the second is 13, not 14 as expected 
# double check the values above. Why do we have an off by one error? 



print ('---')


# 139: {'source': 13016, 'target': 59, 'weight': 2}
# this is equivalent to the occurrence [13016, 59] in accent_nodes
# and represents a relationship between 
# 'Foreign' - accent id 13016 - and 'Non-native speaker'
# this didn't show originally so I dug into it. 
# The original accent listing is: 
# (12775, [750, 13016, 59])

for idx, node in accent_nodes.items(): 
    if node == list([13016, 59]) :
        print ('match')
        
# 

print ('---')

--- checking 1, 2 ---
0
--- checking 2, 1 ---
0
---
---


In [137]:
# So why am I getting an NODE count here of 13, when below I am getting an edge count of 14? 

### Edges

Here, we use the `accent_nodes` dict to create a dict of edges

In [138]:
print(len(accent_nodes))

26


In [140]:
## what we want to do here is loop through the accent_nodes Dict 
## and create another Dict that we can use to create Links between the Nodes (which are accents)

accent_edges = {} 
accent_edges_id = 0


# make a deep copy of accent_nodes as we will be pop()ing elements off lists inside the dict 
# and if we don't make a deep copy this will affect accent_nodes as well 
# because Python uses [pass by assignment](https://docs.python.org/3/faq/programming.html#how-do-i-write-a-function-with-output-parameters-call-by-reference)

accent_nodes_for_manipulation = copy.deepcopy(accent_nodes)

for accent_list in accent_nodes_for_manipulation.items(): 
    print('\n')
    print('--- BEGIN accent list ---')
    print('accent list is: ', accent_list)
    
    if len(accent_list[1]) > 1: 
        print ('--- node has more than one accent, processing ... ---')       

        # we want to create edges, so only care where the list has two or more elements
        while (len(accent_list[1]) > 1) : 
            
            print('size of accent_list[1] BEFORE popping is: ', (len(accent_list[1])))
            popped_element = accent_list[1].pop(0) # remove the first element 
            print('popped element is :', popped_element)
            print('size of accent_list[1] AFTER popping is: ', (len(accent_list[1])))
    
            # create links between the popped element and all the remaining elements in the list 
            for accent_id in accent_list[1]: 
                
                print('\n')
                print('--- in accent list loop ---')
                print('accent_id is: ', accent_id)
                print('length of accent list is: ', len(accent_list[1]))
                
                print('popped_element inside the for loop is: ', popped_element)
                print('accent_edges_id is: ', accent_edges_id)
                
                accent_edges[accent_edges_id] = {}
                accent_edges[accent_edges_id]['source'] = popped_element
                accent_edges[accent_edges_id]['target'] = accent_id
                accent_edges[accent_edges_id]['weight'] = 1
            
                print(accent_edges[accent_edges_id])
                
                accent_edges_id +=1
                print('checking that accent_edges_id has incremented: ', accent_edges_id)
                
            print('--- END for loop ---')
        print('--- END while loop ---')       
    print('--- END accent list ---')




--- BEGIN accent list ---
accent list is:  (0, [1])
--- END accent list ---


--- BEGIN accent list ---
accent list is:  (1, [2, 3])
--- node has more than one accent, processing ... ---
size of accent_list[1] BEFORE popping is:  2
popped element is : 2
size of accent_list[1] AFTER popping is:  1


--- in accent list loop ---
accent_id is:  3
length of accent list is:  1
popped_element inside the for loop is:  2
accent_edges_id is:  0
{'source': 2, 'target': 3, 'weight': 1}
checking that accent_edges_id has incremented:  1
--- END for loop ---
--- END while loop ---
--- END accent list ---


--- BEGIN accent list ---
accent list is:  (2, [4, 5])
--- node has more than one accent, processing ... ---
size of accent_list[1] BEFORE popping is:  2
popped element is : 4
size of accent_list[1] AFTER popping is:  1


--- in accent list loop ---
accent_id is:  5
length of accent list is:  1
popped_element inside the for loop is:  4
accent_edges_id is:  1
{'source': 4, 'target': 5, 'weight': 1

In [141]:
# accent_nodes and accent_nodes_for_manipulation should NOT be equivalent 
# because list items have been pop()'d off the latter

print(len(accent_nodes))
print(len(accent_nodes_for_manipulation))
print (accent_nodes == accent_nodes_for_manipulation)

26
26
False


In [142]:
print(len(accent_edges))

16


In [143]:
pp.pprint(accent_edges)

{   0: {'source': 2, 'target': 3, 'weight': 1},
    1: {'source': 4, 'target': 5, 'weight': 1},
    2: {'source': 2, 'target': 8, 'weight': 1},
    3: {'source': 2, 'target': 14, 'weight': 1},
    4: {'source': 2, 'target': 15, 'weight': 1},
    5: {'source': 14, 'target': 15, 'weight': 1},
    6: {'source': 2, 'target': 4, 'weight': 1},
    7: {'source': 2, 'target': 4, 'weight': 1},
    8: {'source': 2, 'target': 4, 'weight': 1},
    9: {'source': 23, 'target': 4, 'weight': 1},
    10: {'source': 2, 'target': 4, 'weight': 1},
    11: {'source': 2, 'target': 4, 'weight': 1},
    12: {'source': 2, 'target': 4, 'weight': 1},
    13: {'source': 2, 'target': 4, 'weight': 1},
    14: {'source': 4, 'target': 2, 'weight': 1},
    15: {'source': 2, 'target': 39, 'weight': 1}}


In [144]:
print (len(accent_nodes))

26


In [145]:
for idx, node in accent_nodes.items(): 
    # each node is a list 
    if len(node) > 1:
        if (1 in node) and (2 in node): 
            print(node)

In [146]:
for idx, node in accent_nodes.items(): 
    # each node is a list 
    if len(node) > 2:
        if (1 in node) and (2 in node): 
            print(node)

In [147]:
for idx, node in accent_nodes.items(): 
    # each node is a list 
    if len(node) == 2:
        if (1 in node) and (2 in node): 
            print(node)

This actually returns 30 lines - so we may not have as many _edges_ being generated as there should be. 

There are **11** cases in `accent_nodes` where there are more than two nodes in the list. 

There are **22** cases in `accent_nodes` where there are exactly two nodes in the list. 

_Working hypothesis for this bug:_ I think what's happening here is that the `accent_edges` for nodes with more than 2 nodes are not being generated correctly. 




Now, we de-duplicate the `accent_edges` dict


In [148]:

def deduplicateDict (accent_edges):
    
    temp_edges = []
    temp_dict = {}
    temp_weights = {}

    for key, val in accent_edges.items(): 
        #print (key)
        #print (val)
        
        if val not in temp_edges: 
            
            temp_edges.append(val)
            temp_dict[key] = val 
            temp_weights[key] = 1
            
        else: # increment the weight of the edge
            # find the key to increment based on the val
            print('breakpoint 1')
            print ('finding the key to update')
            pp.pprint(temp_dict)
            print('breakpoint 2')
            
            pp.pprint(temp_dict.keys())
            pp.pprint(temp_dict.values())
            print('breakpoint 3')
            
            pp.pprint(key)
            pp.pprint(val)
            print('breakpoint 4')
            
            update_key_position = list(temp_dict.values()).index(val) 
            #this is the *position* in the dict that should be updated 
            
            print('update_key_position is: ', update_key_position)
            
            update_key = list(temp_dict.keys())[update_key_position]

            print('update_key is: ', update_key)
            
            print(list(temp_dict.keys()))
            print('breakpoint 5')
            
            pp.pprint(temp_dict[update_key])
            temp_weights[update_key] +=1 
            print('breakpoint 6')
            
    # update the weights - we can't update in the for loop above
    # otherwise the `val` won't match, because the weight: element would be compared
    for key, val in temp_dict.items():
        val['weight'] = temp_weights[key]
        
    print(type(temp_dict))
    # sort the dict by source because it's easier to do error checking 
    temp_dict = dict(sorted(temp_dict.items(), key=lambda x: x[1]['source'], reverse=False))
    print(type(temp_dict))
    return temp_dict

In [149]:
accent_edges = deduplicateDict(accent_edges)

breakpoint 1
finding the key to update
{   0: {'source': 2, 'target': 3, 'weight': 1},
    1: {'source': 4, 'target': 5, 'weight': 1},
    2: {'source': 2, 'target': 8, 'weight': 1},
    3: {'source': 2, 'target': 14, 'weight': 1},
    4: {'source': 2, 'target': 15, 'weight': 1},
    5: {'source': 14, 'target': 15, 'weight': 1},
    6: {'source': 2, 'target': 4, 'weight': 1}}
breakpoint 2
dict_keys([0, 1, 2, 3, 4, 5, 6])
dict_values([{'source': 2, 'target': 3, 'weight': 1}, {'source': 4, 'target': 5, 'weight': 1}, {'source': 2, 'target': 8, 'weight': 1}, {'source': 2, 'target': 14, 'weight': 1}, {'source': 2, 'target': 15, 'weight': 1}, {'source': 14, 'target': 15, 'weight': 1}, {'source': 2, 'target': 4, 'weight': 1}])
breakpoint 3
7
{'source': 2, 'target': 4, 'weight': 1}
breakpoint 4
update_key_position is:  6
update_key is:  6
[0, 1, 2, 3, 4, 5, 6]
breakpoint 5
{'source': 2, 'target': 4, 'weight': 1}
breakpoint 6
breakpoint 1
finding the key to update
{   0: {'source': 2, 'target':

In [150]:
print(type(accent_edges))
pp.pprint(accent_edges)

<class 'dict'>
{   0: {'source': 2, 'target': 3, 'weight': 1},
    1: {'source': 4, 'target': 5, 'weight': 1},
    2: {'source': 2, 'target': 8, 'weight': 1},
    3: {'source': 2, 'target': 14, 'weight': 1},
    4: {'source': 2, 'target': 15, 'weight': 1},
    5: {'source': 14, 'target': 15, 'weight': 1},
    6: {'source': 2, 'target': 4, 'weight': 7},
    9: {'source': 23, 'target': 4, 'weight': 1},
    14: {'source': 4, 'target': 2, 'weight': 1},
    15: {'source': 2, 'target': 39, 'weight': 1}}


In [151]:
# check to see how many duplicates were removed - looks like about 50, or about a quarter
# so it's worth calculating a 'value' for each edge to signify its weight
print(len(accent_edges))

10


### Remove bidirectional edges

Now, we want to deduplicate the **edges** because graph we want to draw is not a _directed graph_. 

That is, the direction of links between nodes is not relevant for the analysis. 

To do this, we compare the `source` and the `target` of each of the edges, and if the `source` and `target` match the `target` and `source` of the edge being compared, we flag that edge for deletion. We then delete those edges flagged for deletion. 

In [152]:

deletion_list = [] # list to keep track of the dict keys that should be deleted 

for edge in accent_edges.items(): 
    #print (node) 
        
    for inner_edge in accent_edges.items(): 
        #pp.pprint(edge)
        #pp.pprint(inner_edge)
            
        # create values to compare on 
        edge_source_target = (edge[1]['source'], edge[1]['target'])
        edge_target_source = (edge[1]['target'], edge[1]['source'])
        inner_edge_target_source = (inner_edge[1]['target'], inner_edge[1]['source'])
        inner_edge_source_target = (inner_edge[1]['source'], inner_edge[1]['target'])
        
        

        if edge_source_target == inner_edge_target_source: # match, remove it 
                
            print ('match')
            print(edge_source_target)
            print(inner_edge_target_source)
                
            # we need to check that the outer edge is not already on the deletion_list 
            # otherwise we end up removing *all* the edges, not just the duplicates 
            
            # we need to also check that the transverse of the outer edge is not already on the deletion_list
            # otherwise we will end up deleting *both* of the edges
            # not just one of them 
            
            if (([inner_edge[0], edge[0]]) not in deletion_list) \
            and (([edge[0], inner_edge[0]]) not in deletion_list) : 
                deletion_list.append([inner_edge[0], edge[0]])
                print ('added ', ([inner_edge[0], edge[0]]), ' to deletion_list')
                

    

            
            

match
(2, 4)
(2, 4)
added  [14, 6]  to deletion_list
match
(4, 2)
(4, 2)


In [153]:
print (deletion_list)

[[14, 6]]


In [154]:
print (len(deletion_list))



1


In [155]:
# delete the edges in the deletion list, but transfer their weights to the edge that was de-duplicated

for edge_pair in deletion_list: 
    print(accent_edges[edge_pair[1]])
    print(accent_edges[edge_pair[0]])
    
    print('now deleting: ', edge_pair[0])
    
    print(accent_edges[edge_pair[1]]['weight'])
    print(accent_edges[edge_pair[0]]['weight'])
    
    
    print ('now adding weights to: ', edge_pair[1])
    accent_edges[edge_pair[1]]['weight'] += accent_edges[edge_pair[0]]['weight']
    print(accent_edges[edge_pair[1]]['weight'])
    del accent_edges[edge_pair[0]]
    

    

{'source': 2, 'target': 4, 'weight': 7}
{'source': 4, 'target': 2, 'weight': 1}
now deleting:  14
7
1
now adding weights to:  6
8


In [156]:

print(len(accent_edges)) # this should equal the count before the de-duplication

9


In [157]:
pp.pprint(accent_edges)

{   0: {'source': 2, 'target': 3, 'weight': 1},
    1: {'source': 4, 'target': 5, 'weight': 1},
    2: {'source': 2, 'target': 8, 'weight': 1},
    3: {'source': 2, 'target': 14, 'weight': 1},
    4: {'source': 2, 'target': 15, 'weight': 1},
    5: {'source': 14, 'target': 15, 'weight': 1},
    6: {'source': 2, 'target': 4, 'weight': 8},
    9: {'source': 23, 'target': 4, 'weight': 1},
    15: {'source': 2, 'target': 39, 'weight': 1}}


In [158]:
pp.pprint(accent_nodes)

{   0: [1],
    1: [2, 3],
    2: [4, 5],
    3: [],
    4: [2],
    5: [2, 8],
    6: [5],
    7: [10],
    8: [11],
    9: [12],
    10: [2, 14, 15],
    11: [2, 4],
    12: [2, 4],
    13: [2, 4],
    14: [10],
    15: [23, 4],
    16: [2, 4],
    17: [3],
    18: [2, 4],
    19: [],
    20: [2, 4],
    21: [12],
    22: [2, 4],
    23: [4],
    24: [4, 2],
    25: [2, 39]}


In [159]:
# let's do some sanity checking to make sure these are correct 

# 0: {'source': 1, 'target': 2, 'weight': 32}
# this is equivalent to the occurrence [1, 2] in accent_nodes 
# plus the occurrences of [2, 1] because we have removed bidirectional edges
# plus any occurrences where 1 or 2 occur in a list, such as [1, 5, 17, 2]
# and represents a relationship between 
# 'England English' - accent id 1 - and 'United States English' - accent id 2

for idx, node in accent_nodes.items(): 
    if (1 in node and 2 in node) :
        print ('match')

print ('---')
print ('there are 32 lines, excellent')
print ('---')


# 5: {'source': 2, 'target': 18, 'weight': 10},

for idx, node in accent_nodes.items(): 
    if (18 in node and 2 in node) :
        print ('match')

print ('---')
print ('there are 10 lines, excellent')
print ('---')

# 64: {'source': 2, 'target': 1325, 'weight': 16}

for idx, node in accent_nodes.items(): 
    if (2 in node and 1325 in node) :
        print ('match')

print ('---')
print ('there are 16 lines, excellent')
print ('---')

---
there are 32 lines, excellent
---
---
there are 10 lines, excellent
---
---
there are 16 lines, excellent
---


In [160]:
# I am now confident that the edges are being represented correctly

In [161]:
# export the edges to a file 

filePath = links_filename

with open(filePath, "w") as outfile:
                json.dump(accent_edges, outfile)

## Some miscellaneous reporting for the paper

I want to get counts by the category of the accent descriptors and the predetermined accents. 

In [162]:
reload(cva)

predetermined = all_accents.reportPredeterminedAccents()
pp.pprint(predetermined)

[]


In [163]:
reload (cva) 
accent_category_counts = all_accents.reportAccentDescriptorCategories()
pp.pprint(accent_category_counts)

[   ['First or other language', 3],
    ['Subnational region', 3],
    ['Country', 2],
    ['Register', 2],
    ['Supranational region', 1],
    ['Accent strength descriptor', 1],
    ['Linguistic heritage of speaker', 1]]


In [164]:
total = 0
for accent_category_count in accent_category_counts: 
    total+=accent_category_count[1]
    
print (total)

13


In [165]:
reload (cva) 
accent_multi_descriptor_counts = all_accents.reportMultipleAccentDescriptors()
print(accent_multi_descriptor_counts)

[]


In [166]:
for accent in accent_multi_descriptor_counts: 
    print('\naccent is:', accent[1]._name)
    for descriptor in accent[1]._descriptors: 
        print(descriptor._name)

In [167]:
print(all_accents)

id is 1, name is Good, count is 1, locale is en, descriptors are [<cvaccents.AccentDescriptor object at 0x7fa5c85df880>], predetermined is False. id is 2, name is Kiswahili accent, count is 13, locale is en, descriptors are [<cvaccents.AccentDescriptor object at 0x7fa5c85df790>], predetermined is False. id is 3, name is Coastal Swahili, count is 2, locale is en, descriptors are [<cvaccents.AccentDescriptor object at 0x7fa5c85df7f0>], predetermined is False. id is 4, name is Fluent, count is 11, locale is en, descriptors are [<cvaccents.AccentDescriptor object at 0x7fa5c85df880>], predetermined is False. id is 5, name is kiMvita, count is 2, locale is en, descriptors are [<cvaccents.AccentDescriptor object at 0x7fa5c85df7f0>], predetermined is False. id is 8, name is Strong, count is 1, locale is en, descriptors are [<cvaccents.AccentDescriptor object at 0x7fa5c3c24eb0>], predetermined is False. id is 10, name is Kenyan, count is 2, locale is en, descriptors are [<cvaccents.AccentDescri