# Working with accent data in the Mozilla Common Voice dataset 

The purpose of this Python Jupyter notebook is to provide some worked examples of how you might explore accent data in the Common Voice dataset. 



## Index of notebook contents 

To make this notebook easier to navigate, each section is indexed below. 

* [Background information on demographic data in Common Voice](#Background)
* [Preparation steps and importing modules](#PreparationSteps) - including the `requirements.txt` you should run if using this notebook. 
* [The Accent and AccentDescriptor classes we will use in the notebook](#Classes)
* [Preparing data from the Common Voice TSV file](#PreparingData)
* [Extracting accent information for data visualisation](#AccentExtraction)
* [Determine which accents are predetermined for selection in the Common Voice profile screen](#PreDetermined)
* [Add Descriptors to each Accent](#Descriptors)

---
<a id='Background'></a>
## Background information on demographic data in Common Voice 

Before you start working with accent data in Common Voice, there is background information you should know about the data structures in the Common Voice datasets, and how accents have been represented. 

### The ability to choose whether or not to specify demographic information 

Data contributors can contribute voice data to Common Voice with our without logging in to the platform. If a data contributor is not logged in, the utterances they record contain no demographic metadata information, such as the gender, age range or accent of the speaker. If the data contributor _does_ log in, then they can choose whether to specify demographic information in their profile. Part of the demographic information can include specifying which accent(s) they speak with. 


Since mid 2021, data contributors to the Common Voice dataset have been able to self-specify descriptors for their accents. 

The purpose of this script is to get demographic details from an MCV downloaded dataset. 
This informs decision making around, for example, how much of the data in a particular language, has demographic details, and if so, what they are. 

---
<a id='PreparationSteps'></a>

## Preparation steps and importing the modules we will use 

@TODO 

make a `requirements.txt` file to install all the dependencies. 

* pandas 


In [358]:
# imports go here 

# io 
import io

# pandas 
import pandas as pd

# regular expressions 
import re

# json 
import json

# string handling for isascii
import string 

# pretty print 
import pprint
pp = pprint.PrettyPrinter(indent=4)

# reload = because I'm developing the CVaccents module as I go, I want to reload it each time so it doesn't cache
from importlib import reload


---
<a id='Classes'></a>
## Accent, AccentDescriptor and AccentCollection classes used for manipulation

In [359]:
## Accent class and AccentDescriptor class 

# these are classes I defined for accent handling
import cvaccents as cva

# do an explicit reload as I'm still working on the classes 
reload(cva)

# prove that my DocStrings are useful
# they are good, so I am suppressing output while I work through the rest of the doc. 

#print('Module docstring is: \n', cva.__doc__)
#print('---')
#print('Accent docstring is: \n', cva.Accent.__doc__)
#print('---')
#print('AccentDescriptor docstring is: \n', cva.AccentDescriptor.__doc__)
#print('---')
#print('AccentCollection docstring is: \n', cva.AccentDescriptor.__doc__)
#print('---')

<module 'cvaccents' from '/home/users/u6933485/cv-analysis-for-bias/cvaccents.py'>

---
<a id='PreparingData'></a>
## Preparing the data from the Common Voice dataset TSV file

Here, we extract data from the TSV file, and use `pandas` to perform manipulations on the dataset, such as removing rows that do not contain accent metadata. 

In [360]:
# specify the path to the TSV file - this should be `validated.tsv` from the MCV download 
#filePath = '/media/kathyreid/Seagate Backup Plus Drive/cv-datasets/ru/validated.tsv'
#filePath = '/media/kathyreid/Seagate Backup Plus Drive/cv-datasets/fr/validated.tsv'
#filePath = '/media/kathyreid/Elements/de/validated.tsv'
#filePath = '/media/kathyreid/Elements/es/validated.tsv'
#filePath = '/media/kathyreid/Elements/en/validated.tsv'
#filePath = '/media/kathyreid/Seagate Backup Plus Drive/cv-datasets/en-v9/validated.tsv'
filePath = '../cv-datasets/cv-corpus-11.0-2022-09-21/en/validated.tsv'

# put it into a DataFrame 
df = pd.read_csv(filePath, sep='\t')

In [361]:
df.columns

Index(['client_id', 'path', 'sentence', 'up_votes', 'down_votes', 'age',
       'gender', 'accents', 'locale', 'segment'],
      dtype='object')

In [362]:
# We don't want all the columns, as some of them are not useful for the accent analysis 
# Drop the columns we don't want 

df.drop(labels=['path', 'sentence', 'up_votes', 'down_votes', 'segment', 'locale'], axis='columns', inplace=True)
df.columns



Index(['client_id', 'age', 'gender', 'accents'], dtype='object')

In [363]:
len(df)

1617877

In [364]:
# rows that have accent metadata 
len(df[df['accents'].notna()])

861134

In [365]:
# remove all the rows where accents are not given (NaN)
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html
# DataFrame.dropna(*, axis=0, how=_NoDefault.no_default, thresh=_NoDefault.no_default, subset=None, inplace=False)

df.dropna(axis='index', how='any', subset='accents', inplace=True)
len(df)

# this matches the above figure for rows that have accent metadata, so it's a good cross-check

861134

In [366]:
# number of unique contributors to the dataset 
len(df['client_id'].unique())

14822

In [367]:
# Now that the rows without an accent value have been removed, 
# we want to deduplicate the speaker_id values - because one speaker can speak many utterances
# and we only want to record one accent per speaker 
# and we should end up with the # of rows in the cell above 


# One of the reasons we try and reduce the size of the dataframe 
# first is because this operation is more efficient on a smaller dataframe 
# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html
# DataFrame.drop_duplicates(subset=None, *, keep='first', inplace=False, ignore_index=False)

df.drop_duplicates(subset='client_id', keep='first', inplace=True)
len(df)

# This length should match the length above 

14822

---
<a id='AccentExtraction'></a>
## Extracting the accent data for visualisation 

We have already: 

* Removed any rows where accent data was not available = `NaN`
* De-duplicated based on the `client_id`

So now, we need to extract all the self-styled accents for analysis. 

In [368]:
"""

The list english_accents_list is a list that contains the ORIGINAL accent entries for each speaker. 
In this list, the accents for each speaker are represented as a SINGLE STRING, NOT as a list of strings. 

So, we want to turn this into a LIST of LISTS OF STRINGS, to make it easier for doing data cleaning. 
Each individual string represents one accent descriptor given by a speaker, 
and the list which contains those strings is the grouping of accent descriptors for that speaker. 

We need to preserve the association between accent descriptors - co-references - for later data visualisation. 

"""

# They are already unique so we don't need the `.unique` method
english_accents = df['accents']

english_accents_list = [] 

for idx, accent_string in list(enumerate(english_accents)): 
    
    # this regex is from 
    # https://stackoverflow.com/questions/26633452/how-to-split-by-commas-that-are-not-within-parentheses
    accent_list=re.split(',\s*(?![^()]*\))', accent_string)
    processed_accent_list = accent_list # we don't want to modify a list we're iterating over
    
    #print ('accent_string is: ', accent_string, ' and accent_list is: ', accent_list)
    for idx_a, accent in list(enumerate(accent_list)): 
        
        #print ('idx_a is: ', idx_a, ' and accent is: ', accent, ' and type of accent is: ', type(accent))
        # Trim any whitespace off the elements, because this makes matching on strings harder later on
        # Strings are immutable in Python, so we have to create another string
        processed_accent_list.remove(accent)
        stripped_accent = accent.strip() 
        processed_accent_list.append(stripped_accent)
        
        # Check for any empty strings and remove them - likely regex artefacts
        if not accent: 
            processed_accent_list.remove(accent)
            
        # Check for any non-Latin characters that we may want to investigate 
        # For example, if one of the accents is garbage or deliberate rubbish
        
        if not accent.isascii(): 
            print('flagging that accent: ', accent, ' is not ASCII encoded, may be in another language')
            
    english_accents_list.append(processed_accent_list)

flagging that accent:  Выраженный украинский акцент  is not ASCII encoded, may be in another language
flagging that accent:  Porto des dels 3 anys aprenent anglès a l'escola i actualment m'estic preparant per a l'examen del B2.  is not ASCII encoded, may be in another language


In [369]:
print(len(english_accents_list))

14822


In [370]:
for idx, accent_list in list(enumerate(english_accents_list)): 
    print(idx, accent_list)


0 ['England English', 'United States English']
1 ['Hong Kong English']
2 ['England English']
3 ['United States English']
4 ['United States English', 'wolof']
5 ['England English']
6 ['Australian English']
7 ['United States English']
8 ['Latin America', 'United States English']
9 ['United States English']
10 ['United States English']
11 ['Southern African (South Africa, Zimbabwe, Namibia)']
12 ['United States English']
13 ['United States English']
14 ['India and South Asia (India, Pakistan, Sri Lanka)']
15 ['United States English']
16 ['United States English']
17 ['England English']
18 ['India and South Asia (India, Pakistan, Sri Lanka)']
19 ['England English']
20 ['United States English']
21 ['Southern African (South Africa, Zimbabwe, Namibia)']
22 ['Australian English']
23 ['England English']
24 ['India and South Asia (India, Pakistan, Sri Lanka)']
25 ['United States English']
26 ['United States English', 'Colombia']
27 ['India and South Asia (India, Pakistan, Sri Lanka)']
28 ['India 

---
<a id='Descriptors'></a>
## Add descriptors to each accent

In this section, I apply a set of categories to the accent data. 

**I use a rule-based approach for reproduceability.** 
This could have been done in a spreadsheet, but I'm working in Python so I chose to do it that way. 
This also makes it easier for others applying this work to other languages or to other versions of the dataset. 



### Expand accents that have multiple descriptors in their .name element 

Here, we "break apart" accents that have multiple descriptors in the .name element into **multiple** accents. This is done _programmatically_ to aid in reproduceability. 

Some examples of this that I found during analysis were; 

* 8233 - 'south German / Swiss accent' - needs to be separated into 'South German' and 'Swiss accent'
* 6967 - 'slight Brooklyn Accent' - needs to be separated into "slight" and 'Brooklyn' - as a strength indicator 
* 10142 - 'minor French Accent' - needs to be separated into "minor" as an accent strength indicator, and 'French'
* 9902 - 'little Latino' - needs to be separated into "little" as an accent strength indicator, and "Latino"
* 3422 - 'i have some pronunciation issues because of oral surgery and a hidden southern accent' - needs to be separated into 'hidden southern accent' and 'changes due to Oral surgery'
* 9337 - 'heavy Cantonese' - needs to be separated into accent strength and country
* 1721 - 'United States English combined with European English' = needs to be separated into two accents
* 5365 - 'Sydney - middle eastern seaboard Australian' - should be separated into two accents
* 6615 - 'Spoke Chinese when little' - should be separated into Chinese, and speaking a different language as a child
* 748 - 'Spanish bilingual' - should be separated into 'Spanish' and 'bilingual ' as an Ln marker. 
* 6942 - 'South London and Essex' - should be separated into 'South London' and 'Essex'
* 12055 - 'Some time spent in Scotland' - should be separated into 'Scottish English' and into an Ln marker of "time spent in regional location"
* 773 - 'Silicon Valley Native' - should be separated into 'Silicon Valley' and "native" marker
* 3033 - "Porto des dels 3 anys aprenent anglès a l'escola i actualment m'estic preparant per a l'examen del B2." - this should be split into French, and academic register / L2 status
* 4322 - 'Polish. Have lived in nine states.' - should be separated into Polish and "lived in nine states" as a regional variance indicator or mixed accent indicator
* 5046 - 'Pittsburgh PA' - should be separated into Pittsburgh - city descriptor - and Pennsylvania as a regional descriptor. 
* 58 - 'Non native speaker from France' - should be separated into 'French' and 'Non native speaker'
* 1054 - 'Mild Northern England English' - should be separated into 'Northern England' and 'mild' for the strength of accent
* 11587 - 'Midwestern States (Michigan)' - should be separated into Midwestern and Michigan
* 1317 - 'Midwest USA speech blended with South Texas USA speech' - should be separated into Midwest and South Texas
* 1491 - 'Midwest US... With some Canadian slang. ' - should be separated into Midwest and Canadian Slang as register
* 'International Indian Accent' - should be separated into "Indian" and into "International"
* 2703 - 'Indian with a tinge of an RP accent' - should be separated into "Indian" and "Received Pronunciation" as well as an accent strength marker
* 1011 - 'I am German and speak English as learned at school' - should be separated into "German" and a register marker "as learned at school"
* 4839 - 'French mid level' - should be separated into 'French' and 'mid-level' as a fluency marker
* 1055 - 'England non-native' - should be separated into 'England' and 'Non-native' as a fluency marker
* 5305 - 'Educated Australian Accent' - should be separated into 'Australian' and 'Educated' as a register marker
* 12958 - 'CARIBBEAN AND BRITISH MIXED WITH SOME NEW YORK ACCENTS' - should be separated into three accents 
* 1577 - 'British English with a little bit of Russian' - should be separated into two accents. It's also an accent strength marker.
* 7271 - 'A variety of Texan English with some German influence that has undergone the cot-caught merger' - should be split into Texan, German influence, and a phonetic descriptor
* 6616 - '90% Pennsylvanian accent' - should be separated into Pennsylvanian and an accent strength marker
* 4319 - '4 years in Spain and Germany' - should be separated into French, German and a time or exposure marker
* 6617 - '10% Chinese accent' - should be separated into Chinese and an accent strength marker




### Modify the accent list to expand descriptors while preserving accent co-references

Some of the given accent descriptors contain multiple descriptors in one string. Here, I expand them while maintain co-references. 

For example: 

* `slight Brooklyn accent`  - contains both a City-based descriptor and an accent strength descriptor. 
* `United States English combined with European English` - contains both a national descriptor and a supranational descriptor. 



It's easier to do this before we create Accent and AccentDescriptor objects. 

In [371]:
# helper function for the below 

def update_list_coreference(list_to_be_updated, old_entry, new_entry):
    # the accent list is a list of lists so we need to iterate through each one to find the element to update. 
    
    for idx, accent_list in list(enumerate(list_to_be_updated)):
        
        #print(idx, '- old accent_list is: ', accent_list)
        new_accent_list = accent_list
        
        for accent in accent_list:  
            match = False
            #print ('accent is: ', accent, ' and old_entry is: ', old_entry)
            
            if accent == old_entry:
                match = True 
                #print('accent is: ', accent, ' and old_entry is: ', old_entry, 'and new_entry is: ', new_entry)
                #print('accent is: ', accent, ' which matches ', old_entry)
                new_accent_list.remove(old_entry)
                
                for entry in new_entry: # there may be more than one
                    new_accent_list.append(entry)
                    #print('appending new entry: ', entry)
                
            if match: 
                print ('processed ', old_entry, ' to be ', new_entry, ' and the new accent list is: ', new_accent_list)

        # recreate the list from keys, this removes duplicates
        # for example, a duplicate may be created due to normalisation or merger of accents 
        new_keys_list = list(dict.fromkeys(new_accent_list))
        
        #print('new_accent_list is: ', new_accent_list)
        #print('---')
        
        #print('removing: ', accent_list)
        list_to_be_updated.remove(accent_list)
        #print('appending: ', newlist)
        list_to_be_updated.append(new_keys_list)
        
    
    return(list_to_be_updated)

In [372]:
#'south German / Swiss accent' - needs to be separated into 'South German' and 'Swiss'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'south German / Swiss accent', 
                                               ['South German', 'Swiss'])

# 'slight Brooklyn Accent' - needs to be separated into "slight" and 'Brooklyn' - as a strength indicator  
english_accents_list = update_list_coreference(english_accents_list, 
                                               'slight Brooklyn Accent', 
                                               ['Brooklyn Accent', 'slight'])

# 'minor French Accent' - needs to be separated into "minor" as an accent strength indicator, and 'French'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'minor French Accent', 
                                               ['French Accent', 'minor'])

# 'little Latino' - needs to be separated into "Latino" as an accent and 'little' as an accent strength indicator
english_accents_list = update_list_coreference(english_accents_list, 
                                               'little Latino', 
                                               ['Latino', 'little'])

# 'heavy Cantonese' - needs to be separated into accent strength and country   
english_accents_list = update_list_coreference(english_accents_list, 
                                               'heavy Cantonese', 
                                               ['Cantonese', 'heavy'])

# 'United States English combined with European English' - needs to be separated into two accents
english_accents_list = update_list_coreference(english_accents_list, 
                                               'United States English combined with European English', 
                                               ['United States English', 'European English'])

# 'United States English Pacific Northwest' - needs to be separated into two accents
english_accents_list = update_list_coreference(english_accents_list, 
                                               'United States English Pacific Northwest', 
                                               ['United States English', 'Pacific Northwest'])
                                               
# 'Sydney - middle eastern seaboard Australian' - should be separated into two accent
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Sydney - middle eastern seaboard Australian', 
                                               ['Sydney', 'Middle eastern seaboard Australian'])
        
# 'Spoke Chinese when little' - should be separated into Chinese, and speaking a different language as a child
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Spoke Chinese when little', 
                                               ['Chinese', 'Spoke language when a child'])        
    
# 'Spanish bilingual' - should be separated into 'Spanish' and 'bilingual ' as an Ln marker. 
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Spanish bilingual', 
                                               ['Spanish', 'Bilingual']) 

# 'South London and Essex' - should be separated into 'South London' and 'Essex'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'South London and Essex' , 
                                               ['South London', 'Essex']) 


# 'Some time spent in Scotland' - should be separated into 'Scottish English' and into an Ln marker of "time spent in regional location"
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Some time spent in Scotland' , 
                                               ['Scottish English', 'some time spent in location']) 

# 'Silicon Valley Native' - should be separated into 'Silicon Valley' and "native" marker
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Silicon Valley Native' , 
                                               ['Silicon Valley', 'native']) 

# "Porto des dels 3 anys aprenent anglès a l'escola i actualment m'estic preparant per a l'examen del B2." - this should be split into French, and academic register / L2 status
english_accents_list = update_list_coreference(english_accents_list, 
                                               "Porto des dels 3 anys aprenent anglès a l'escola i actualment m'estic preparant per a l'examen del B2." , 
                                               ['French', 'academic', 'second language']) 


# Выраженный украинский акцент - this should be separated into 'Ukrainian' and 'pronounced'
# it literally means 'pronounced Ukrainian accent'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Выраженный украинский акцент' , 
                                               ['Ukrainian', 'pronounced']) 


# 'Polish. Have lived in nine states.' - should be separated into Polish and "lived in nine states" as a regional variance indicator or mixed accent indicator
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Polish. Have lived in nine states.',
                                               ['Polish', 'mixed']) 

# 'Pittsburgh PA' - should be separated into Pittsburgh - city descriptor - and Pennsylvania as a regional descriptor.
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Pittsburgh PA',
                                               ['Pittsburgh', 'Pennsylvania']) 

# 'Non native speaker from France' - should be separated into 'French' and 'Non native speaker'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Non native speaker from France',
                                               ['French', 'Non native']) 

# 'Mild Northern England English' - should be separated into 'Northern England' and 'mild' for the strength of accent
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Mild Northern England English',
                                               ['Northern English', 'Mild']) 

# 'Midwestern States (Michigan)' - should be separated into Midwestern and Michigan
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Midwestern States (Michigan)',
                                               ['Midwestern United States', 'Michigan']) 

# 'Midwest USA speech blended with South Texas USA speech' - should be separated into Midwest and South Texas
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Midwest USA speech blended with South Texas USA speech',
                                               ['Midwestern United States', 'South Texas']) 

# 'Midwest US... With some Canadian slang. ' - should be separated into Midwest and Canadian Slang as register
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Midwest US... With some Canadian slang. ',
                                               ['Midwestern United States', 'Canadian English', 'Slang']) 

# 'International Indian Accent' - should be separated into "Indian" and into "International"
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Midwest US... With some Canadian slang. ',
                                               ['Midwestern United States', 'Canadian English', 'Slang']) 

# 'International Indian Accent' - should be separated into "Indian" and into "International"
english_accents_list = update_list_coreference(english_accents_list, 
                                               'International Indian Accent',
                                               ['India and South Asia (India, Pakistan, Sri Lanka)', 'International']) 


# 'Indian with a tinge of an RP accent' - should be separated into "Indian" and "Received Pronunciation" as well as an accent strength marker
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Indian with a tinge of an RP accent',
                                               ['India and South Asia (India, Pakistan, Sri Lanka)', 'Received Pronunciation', 'tinge']) 

# 'I am German and speak English as learned at school' - should be separated into "German" and a register marker "as learned at school"
english_accents_list = update_list_coreference(english_accents_list, 
                                               'I am German and speak English as learned at school',
                                               ['German', 'academic', 'second language']) 

# 'speak some German' - should be separated into 'German' and an accent 
english_accents_list = update_list_coreference(english_accents_list, 
                                               'speak some German',
                                               ['German', 'some'])

# 'French mid level' - should be separated into 'French' and 'mid-level' as a fluency marker
english_accents_list = update_list_coreference(english_accents_list, 
                                               'French mid level',
                                               ['French', 'Mid-level']) 


# 'England non-native' - should be separated into 'England' and 'Non-native' as a fluency marker
english_accents_list = update_list_coreference(english_accents_list, 
                                               'England non-native',
                                               ['England English', 'Non native'])


# 'Educated Australian Accent' - should be separated into 'Australian' and 'Educated' as a register marker
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Educated Australian Accent',
                                               ['Australian English', 'Educated'])

# 'CARIBBEAN AND BRITISH MIXED WITH SOME NEW YORK ACCENTS' - should be separated into three accents 
english_accents_list = update_list_coreference(english_accents_list, 
                                               'CARIBBEAN AND BRITISH MIXED WITH SOME NEW YORK ACCENTS',
                                               ['West Indies and Bermuda (Bahamas, Bermuda, Jamaica, Trinidad)', 'England English', 'New York'])


# 'British English with a little bit of Russian' - should be separated into two accents. It's also an accent strength marker.
english_accents_list = update_list_coreference(english_accents_list, 
                                               'British English with a little bit of Russian',
                                               ['Russian', 'England English', 'little bit'])

# 'A variety of Texan English with some German influence that has undergone the cot-caught merger' - should be split into Texan, German influence, and a phonetic descriptor
english_accents_list = update_list_coreference(english_accents_list, 
                                               'A variety of Texan English with some German influence that has undergone the cot-caught merger',
                                               ['Texas', 'German', 'some', 'cot-caught merger'])

# '90% Pennsylvanian accent' - should be separated into Pennsylvanian and an accent strength marker
english_accents_list = update_list_coreference(english_accents_list, 
                                               '90% Pennsylvanian accent',
                                               ['Pennsylvania', '90%'])

# '4 years in Spain and Germany' - should be separated into French, German and a time or exposure marker
english_accents_list = update_list_coreference(english_accents_list, 
                                               '4 years in Spain and Germany',
                                               ['Spanish', 'German', 'some time spent in location'])

# '10% Chinese accent' - should be separated into Chinese and an accent strength marker
english_accents_list = update_list_coreference(english_accents_list, 
                                               '10% Chinese accent',
                                               ['Chinese', '10%'])

# 'i have some pronunciation issues because of oral surgery and a hidden southern accent' - needs to be separated into 'hidden southern accent' and 'changes due to Oral surgery'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'i have some pronunciation issues because of oral surgery and a hidden southern accent',
                                               ['Southern United States', 'changes due to oral surgery'])



processed  south German / Swiss accent  to be  ['South German', 'Swiss']  and the new accent list is:  ['England English', 'South German', 'Swiss']
processed  slight Brooklyn Accent  to be  ['Brooklyn Accent', 'slight']  and the new accent list is:  ['United States English', 'Canadian English', 'Brooklyn Accent', 'slight']
processed  United States English combined with European English  to be  ['United States English', 'European English']  and the new accent list is:  ['United States English', 'European English']
processed  United States English Pacific Northwest  to be  ['United States English', 'Pacific Northwest']  and the new accent list is:  ['United States English', 'United States English', 'Pacific Northwest']
processed  Sydney - middle eastern seaboard Australian  to be  ['Sydney', 'Middle eastern seaboard Australian']  and the new accent list is:  ['Australian English', 'Sydney', 'Middle eastern seaboard Australian']
processed  Spoke Chinese when little  to be  ['Chinese', 'Sp

### Normalize closely related accent descriptors - merge them 

There are several closely related accent descriptors, and here I merge them. 

The principles I use are: 
    
* Accents are merged where there are spelling variations 
* Accents are merged where the accent has a region descriptor with our without 'accent' - such as "French" and "French accent"
* Where a country or language descriptor and demonym are closely equivalent - "Germany" and "German"

Accents are not merged where: 

* One accent descriptor is more granular than another - "London" and "South London" are not merged. 

In [373]:
# I can just use the same function

## There will be others as we put them into objects / classes


# Midwestern - canonical is 'Midwestern United States'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Midwestern',
                                               ['Midwestern United States'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Midwestern United States English',
                                               ['Midwestern United States'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'midwestern US',
                                               ['Midwestern United States'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'midwest',
                                               ['Midwestern United States'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Unite States Midwest',
                                               ['Midwestern United States'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Midwestern US English (United States)',
                                               ['Midwestern United States'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Midwestern US English (United States)',
                                               ['Midwest United States'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Mid-west United States English',
                                               ['Midwest United States'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'American Midwest',
                                               ['Midwest United States'])


# Southern United States - canonical is 'Southern United States'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'southern United States',
                                               ['Southern United States'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'southern',
                                               ['Southern United States'])


# Mid-atlantic - canonical is 'Mid-atlantic United States'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Mid-atlantic',
                                               ['Mid-atlantic United States'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Mid-Atlantic United States English',
                                               ['Mid-atlantic United States'])


# Philadelphia - canonical is 'Philadelphia'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Philadelphia Style United States English',
                                               ['Philadelphia'])



# California - canonical is 'California'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Californian Accent',
                                               ['California'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Cali',
                                               ['California'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Californian',
                                               ['California'])

# Northern California - as distinct from California - canonical is 'Northern California'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'northern cali',
                                               ['Northern California'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Northern Californian',
                                               ['Northern California'])

# Southern California - as distinct from California - canonical is 'Southern California'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Southern Californian',
                                               ['Southern California'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Southern Cali',
                                               ['Southern California'])


# British - canonical is 'British'



# England - canonical is 'England English'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'england',
                                               ['England English'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'England',
                                               ['England English'])

# Northern England - canonical is 'Northern England'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Northern English',
                                               ['Northern England'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'English north of England',
                                               ['Northern England'])

# Yorkshire - as distinct from Northern England - canonical is 'Yorkshire'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'yorkshire',
                                               ['Yorkshire'])

# Southern England - canonical is 'Southern England'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'southern english',
                                               ['Southern England'])


# Sussex - as distinct from Southern England - canonical is 'Sussex'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'sussex',
                                               ['Sussex'])

# London - canonical is 'London'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'London English',
                                               ['London'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'london',
                                               ['London'])


# Liverpool - canonical is 'Liverpool'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Liverpool English',
                                               ['Liverpool'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Liverpudlian English',
                                               ['Liverpool'])

# Lancashire - canonical is 'Lancashire'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Lancashire English',
                                               ['Lancashire'])

# Dutch - canonical is 'Dutch'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Dutch English',
                                               ['Dutch'])

# Swedish - canonical is 'Swedish'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'swedish',
                                               ['Swedish'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'swedish english',
                                               ['Swedish'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'swedish',
                                               ['Swedish'])

# German - canonical is 'German'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'German Accent',
                                               ['German'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'German English',
                                               ['German'])

# South German - as distinct from German - canonical is 'South German'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'South German accent',
                                               ['South German'])

# South West German - as distinct from German 
english_accents_list = update_list_coreference(english_accents_list, 
                                               'south-west German',
                                               ['South West German'])

# French - canonical is 'French'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'french accent',
                                               ['French'])


# European - as distinct from Eastern European - canonical is 'European
english_accents_list = update_list_coreference(english_accents_list, 
                                               'European English',
                                               ['European'])

# Eastern Europe - canonical is 'Eastern European'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'eastern european English',
                                               ['Eastern European'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'eastern Europe',
                                               ['Eastern European'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'East European',
                                               ['Eastern European'])



# Polish - canonical is 'Polish'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Polish English',
                                               ['Polish'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'polish',
                                               ['Polish'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'polish accent',
                                               ['Polish'])

# Israeli - canonical is 'Israeli' 
english_accents_list = update_list_coreference(english_accents_list, 
                                               "Israeli's accent",
                                               ['Israeli'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Israeli accent',
                                               ['Israeli'])

# Nigerian - canonical is 'Nigerian'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Nigerian English',
                                               ['Nigerian'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'nigeria english',
                                               ['Nigerian'])

# Kenyan - canonical is 'Kenyan'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Kenyan English',
                                               ['Kenyan'])
# Wolof - canonical is 'Wolof'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'wolof',
                                               ['Wolof'])


# Latin American - canonical is 'Latino'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Latin American',
                                               ['Latino'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Latin America',
                                               ['Latino'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Hispanic/Latino',
                                               ['Latino'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Latin English',
                                               ['Latino'])

# Colombian - canonical is 'Colombian'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Colombian Accent',
                                               ['Colombian'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Colombia',
                                               ['Colombian'])



# Japanese - canonical is 'Japanese'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Japanese English',
                                               ['Japanese'])


#Bangladeshi - canonical is 'Bangladeshi'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Bangladesh',
                                               ['Bangladeshi'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'bangladesh',
                                               ['Bangladeshi'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Bangladesh English',
                                               ['Bangladeshi'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Bangladeshi English',
                                               ['Bangladeshi'])



# Non-native speaker descriptions - canonical is 'Non-native speaker'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Non-native',
                                               ['Non-native speaker'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Non native speaker',
                                               ['Non-native speaker'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'Non native',
                                               ['Non-native speaker'])


# Second language descriptions - canonical is 'Second language'
english_accents_list = update_list_coreference(english_accents_list, 
                                               'ESL',
                                               ['Second language'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'second language',
                                               ['Second language'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               '2nd Language',
                                               ['Second language'])
english_accents_list = update_list_coreference(english_accents_list, 
                                               'English as a Second Language',
                                               ['Second language'])



processed  Midwestern  to be  ['Midwestern United States']  and the new accent list is:  ['United States English', 'Midwestern United States']
processed  Midwestern  to be  ['Midwestern United States']  and the new accent list is:  ['United States English', 'Midwestern United States']
processed  Midwestern  to be  ['Midwestern United States']  and the new accent list is:  ['United States English', 'Minnesotan', 'Midwestern United States']
processed  Midwestern  to be  ['Midwestern United States']  and the new accent list is:  ['United States English', 'Low', 'Demure', 'Midwestern United States']
processed  Midwestern United States English  to be  ['Midwestern United States']  and the new accent list is:  ['United States English', 'Midwestern United States']
processed  Midwestern United States English  to be  ['Midwestern United States']  and the new accent list is:  ['United States English', 'Midwestern United States']
processed  midwestern US  to be  ['Midwestern United States']  and 

processed  Latin America  to be  ['Latino']  and the new accent list is:  ['United States English', 'Latino']
processed  Hispanic/Latino  to be  ['Latino']  and the new accent list is:  ['Latino']
processed  Latin English  to be  ['Latino']  and the new accent list is:  ['Latino']
processed  Colombia  to be  ['Colombian']  and the new accent list is:  ['United States English', 'Colombian']
processed  Japanese English  to be  ['Japanese']  and the new accent list is:  ['Japanese']
processed  Japanese English  to be  ['Japanese']  and the new accent list is:  ['Japanese']
processed  bangladesh  to be  ['Bangladeshi']  and the new accent list is:  ['India and South Asia (India, Pakistan, Sri Lanka)', 'Bangladeshi']
processed  Bangladesh English  to be  ['Bangladeshi']  and the new accent list is:  ['Bangladeshi', 'Bangladeshi']
processed  Bangladeshi English  to be  ['Bangladeshi']  and the new accent list is:  ['Bangladeshi']
processed  Non-native  to be  ['Non-native speaker']  and the 

## Extract unique accents from the list of normalised accents into a Dict of Accent objects for easier manipulation

In [374]:
# build a dict of each unique accent using an Accent object for each object. 

ratio_display = 120 # to stop the browser crashing 

AccentDict = {}
i = 0; 

# the english_accents_list is now normalised, merged etc so this is straightforward 
for accent_list in english_accents_list:
    for accent in accent_list: 
        
        i +=1
        match = False 
        count = 0
        
        #if (i%ratio_display ==0): # only show the 100th 
            #print('')
            #print('---')
            #print('now processing: ', accent, ' - ', i)
            #print('---')
        
        # is this accent in our dict - if not, add it in 
        
        for item in AccentDict.items() : # Each item should be an Accent object 
            
            # (self, id=0, name="Accent Name", count=0, locale=None, descriptors=None):
            #pp.pprint(item[1].__str__())
            
            #if (i%ratio_display ==0): # only show the 100th 
                #print('item is: ', item)
                #print(type(item))
                #print('now checking match for: item:', item[1], ' and accent: ', accent)
            
            if (item[1].name == accent) : # update the count
                
                #if (i%ratio_display ==0): # only show the 100th 
                    #print('---')
                    #print('match is True')
                    #print('---')
                    
                match = True 
                
                #if (i%ratio_display ==0): # only show the 100th 
                    #print('accent count was: ', item[1].count)
                
                # update the count of the accent 
                item[1].count+=1
                
                #if (i%ratio_display ==0): # only show the 100th 
                    #print('accent count is now: ', item[1].count)
                
                
        # this match loop has to be outside the for: loop above 
        # because if we add items to the dict inside the loop
        # then it will not run - because there are zero items in the dict to begin with 
        
        if (not match) :   
            
            # (self, id=0, name="Accent Name", count=0, locale=None, descriptors=None):
            AccentDict[i] = cva.Accent(i, accent, 1, 'en', None, False) 
                


In [375]:
# do an explicit reload as I'm still working on the classes 
reload(cva)

all_accents = cva.AccentCollection(AccentDict)

In [376]:
print(all_accents.total())

179


In [377]:
all_accents.__str__()

id is 1, name is England English, count is 2346, locale is en, descriptors are None, predetermined is False.
id is 2, name is United States English, count is 7537, locale is en, descriptors are None, predetermined is False.
id is 3, name is Hong Kong English, count is 132, locale is en, descriptors are None, predetermined is False.
id is 7, name is Wolof, count is 1, locale is en, descriptors are None, predetermined is False.
id is 9, name is Australian English, count is 665, locale is en, descriptors are None, predetermined is False.
id is 12, name is Latino, count is 5, locale is en, descriptors are None, predetermined is False.
id is 15, name is Southern African (South Africa, Zimbabwe, Namibia), count is 260, locale is en, descriptors are None, predetermined is False.
id is 18, name is India and South Asia (India, Pakistan, Sri Lanka), count is 2009, locale is en, descriptors are None, predetermined is False.
id is 31, name is Colombian, count is 1, locale is en, descriptors are No

In [378]:
# do an explicit reload as I'm still working on the classes 
reload(cva)

all_accents_sortedByCount = all_accents.sortByCount(reverse=False)

for accent in all_accents_sortedByCount.items(): 
    print(accent[1].__str__())
    
# now I am cross-checking to see if there are any other duplicates or accents that should be merged

# 'southern'
# 'serbian'
# 'new england/east coast'

id is 7051, name is "Valley Girl" English, count is 1, locale is en, descriptors are None, predetermined is False.
id is 6636, name is 10%, count is 1, locale is en, descriptors are None, predetermined is False.
id is 6635, name is 90%, count is 1, locale is en, descriptors are None, predetermined is False.
id is 104, name is A'lo, count is 1, locale is en, descriptors are None, predetermined is False.
id is 1142, name is Adjustable, count is 1, locale is en, descriptors are None, predetermined is False.
id is 2845, name is Afrikaans English, count is 1, locale is en, descriptors are None, predetermined is False.
id is 13116, name is Alemannic German Accent, count is 1, locale is en, descriptors are None, predetermined is False.
id is 90, name is Argentinian English, count is 1, locale is en, descriptors are None, predetermined is False.
id is 1244, name is Basic, count is 1, locale is en, descriptors are None, predetermined is False.
id is 751, name is Bilingual, count is 1, locale is