# Working with accent data in the Mozilla Common Voice dataset 

The purpose of this Python Jupyter notebook is to provide some worked examples of how you might explore accent data in the Common Voice dataset. 

## Background information on demographic data in Common Voice 

Before you start working with accent data in Common Voice, there is background information you should know about the data structures in the Common Voice datasets, and how accents have been represented. 

### The ability to choose whether or not to specify demographic information 

Data contributors can contribute voice data to Common Voice with or without logging in to the platform. If a data contributor is not logged in, the utterances they record carry no demographic metadata, such as the gender, age range or accent of the speaker. If the data contributor _does_ log in, they can choose whether to specify demographic information in their profile, including which accent(s) they speak with. 


Since mid 2021, data contributors to the Common Voice dataset have been able to self-specify descriptors for their accents. 

The purpose of this notebook is to extract demographic details from a downloaded Mozilla Common Voice (MCV) dataset. 
This informs decisions about, for example, how much of the data in a particular language has demographic details and, where it does, what those details are. 
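
As a rough illustration of the "how much of the data has demographic details" question, the sketch below measures the share of rows that carry a value in each demographic column. The `toy` DataFrame and its values are invented for illustration; only the column names (`client_id`, `age`, `gender`, `accents`) come from the dataset.

```python
import pandas as pd

# Toy stand-in for validated.tsv; the values are invented for illustration only.
toy = pd.DataFrame(
    {
        "client_id": ["a", "a", "b", "c"],
        "age": ["twenties", "twenties", None, None],
        "gender": ["female", "female", None, "male"],
        "accents": ["England English", "England English", None, None],
    }
)

demographic_columns = ["age", "gender", "accents"]

# Share of utterances (rows) that carry a value in each demographic column.
coverage = toy[demographic_columns].notna().mean()
print(coverage)
```

Note that this counts utterances, not speakers; a per-speaker figure would need the rows to be de-duplicated on `client_id` first.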

## Preparation steps 

@TODO: make a `requirements.txt` file to install all the dependencies. 

The only third-party dependency so far is: 

* pandas 


In [26]:
# imports go here 

# io 
import io

# pandas 
import pandas as pd

# regular expressions 
import re

# json 
import json

# pretty print 
import pprint
pp = pprint.PrettyPrinter(indent=4)


In [2]:
# specify the path to the TSV file - this should be `validated.tsv` from the MCV download 
#filePath = '/media/kathyreid/Seagate Backup Plus Drive/cv-datasets/ru/validated.tsv'
#filePath = '/media/kathyreid/Seagate Backup Plus Drive/cv-datasets/fr/validated.tsv'
#filePath = '/media/kathyreid/Elements/de/validated.tsv'
#filePath = '/media/kathyreid/Elements/es/validated.tsv'
#filePath = '/media/kathyreid/Elements/en/validated.tsv'
#filePath = '/media/kathyreid/Seagate Backup Plus Drive/cv-datasets/en-v9/validated.tsv'
filePath = '../cv-datasets/cv-corpus-11.0-2022-09-21/en/validated.tsv'

# put it into a DataFrame 
df = pd.read_csv(filePath, sep='\t')
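
If memory is a concern, one option is to load only the columns needed for the accent analysis rather than dropping them afterwards. The following is a minimal sketch under that assumption, using a small in-memory TSV (invented rows, hypothetical `sample_tsv` and `wanted` names) in place of the real `validated.tsv`.

```python
import io
import pandas as pd

# A tiny in-memory stand-in for validated.tsv (invented rows, same column names).
sample_tsv = (
    "client_id\tpath\tsentence\tup_votes\tdown_votes\tage\tgender\taccents\tlocale\tsegment\n"
    "abc\tclip_1.mp3\tHello there.\t2\t0\ttwenties\tfemale\tEngland English\ten\t\n"
    "def\tclip_2.mp3\tGood morning.\t2\t1\t\t\t\ten\t\n"
)

wanted = ["client_id", "age", "gender", "accents"]

# usecols skips the unwanted columns at parse time; dtype=str avoids type guessing.
sample_df = pd.read_csv(io.StringIO(sample_tsv), sep="\t", usecols=wanted, dtype=str)

print(sample_df.columns.tolist())
print(sample_df["accents"].isna().tolist())
```

Empty fields are still read as `NaN`, so the later `notna()` and `dropna()` steps behave the same way.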

In [3]:
df.columns

Index(['client_id', 'path', 'sentence', 'up_votes', 'down_votes', 'age',
       'gender', 'accents', 'locale', 'segment'],
      dtype='object')

In [4]:
# We don't want all the columns, as some of them are not useful for the accent analysis 
# Drop the columns we don't want 

df.drop(labels=['path', 'sentence', 'up_votes', 'down_votes', 'segment', 'locale'], axis='columns', inplace=True)
df.columns



Index(['client_id', 'age', 'gender', 'accents'], dtype='object')

In [5]:
len(df)

1617877

In [6]:
# rows that have accent metadata 
len(df[df['accents'].notna()])

861134
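
The two counts above can be turned into a proportion. This small worked example uses the figures printed by the cells in this notebook and shows that roughly 53% of validated utterances carry accent metadata (utterances, not speakers).

```python
total_rows = 1_617_877        # len(df) before any rows were removed
rows_with_accent = 861_134    # rows where 'accents' is not NaN

# Proportion of validated utterances that have an accent value.
share = rows_with_accent / total_rows
print(f"{share:.1%} of validated utterances carry accent metadata")
```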

In [7]:
# remove all the rows where accents are not given (NaN)
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html
# DataFrame.dropna(*, axis=0, how=_NoDefault.no_default, thresh=_NoDefault.no_default, subset=None, inplace=False)

df.dropna(axis='index', how='any', subset='accents', inplace=True)
len(df)

# this matches the above figure for rows that have accent metadata, so it's a good cross-check

861134

In [8]:
# number of unique contributors who have accent metadata 
# (rows without accent data were dropped in the previous cell) 
len(df['client_id'].unique())

14822

In [9]:
# Now that the rows without an accent value have been removed, 
# we want to deduplicate the `client_id` values, because one speaker can record many utterances 
# and we only want to keep one row (one accent string) per speaker. 
# We should end up with the number of rows shown in the cell above. 

# One of the reasons we reduce the size of the DataFrame first 
# is that this operation is more efficient on a smaller DataFrame. 
# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html
# DataFrame.drop_duplicates(subset=None, *, keep='first', inplace=False, ignore_index=False)

df.drop_duplicates(subset='client_id', keep='first', inplace=True)
len(df)

# This length should match the length above 




14822
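
Keeping only the first row per `client_id` assumes each speaker has a single accent string across all of their utterances. Whether that holds in the real data is not established here; the sketch below, on an invented `toy` DataFrame, shows one way it could be checked before de-duplicating.

```python
import pandas as pd

# Invented example: speaker "b" has two different accent strings.
toy = pd.DataFrame(
    {
        "client_id": ["a", "a", "b", "b", "c"],
        "accents": [
            "England English",
            "England English",
            "Canadian English",
            "Canadian English,United States English",
            "Irish English",
        ],
    }
)

# Count the distinct accent strings recorded against each speaker.
strings_per_speaker = toy.groupby("client_id")["accents"].nunique()
changed = strings_per_speaker[strings_per_speaker > 1].index.tolist()
print(changed)

# keep='first' silently keeps only the first string for such speakers.
one_row_per_speaker = toy.drop_duplicates(subset="client_id", keep="first")
print(len(one_row_per_speaker))
```

If the list of such speakers is empty on the real data, the choice of `keep='first'` makes no difference to the accent analysis.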

## Extracting the accent data for visualisation 

We have already: 

* Removed any rows where accent data was not available (`NaN`)
* De-duplicated based on the `client_id`

So now, we need to extract all the self-styled accents for analysis. 

In [10]:
# `df['accents']` now holds one accent string per speaker.
# A speaker who specified MANY accents has them joined by commas in a single string,
# so counting whole strings is not granular enough for our needs. 

# This listing is also not sorted, so it's difficult to see which accent strings
# occur at higher volumes.

# The values are NOT unique: the loop visits the same string once for every speaker
# who has it, so `.unique()` or `.value_counts()` would give a shorter listing.
english_accents = df['accents']

for accent in english_accents:
    print(type(accent))  # each value is a str, not a list
    accent_total = len(df.loc[df['accents'] == accent])
    print(accent, ': ', accent_total)


<class 'str'>
England English,United States English :  9
<class 'str'>
Hong Kong English :  129
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
United States English,wolof :  1
<class 'str'>
England English :  2281
<class 'str'>
Australian English :  655
<class 'str'>
United States English :  7432
<class 'str'>
Latin America,United States English :  1
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
Southern African (South Africa, Zimbabwe, Namibia) :  260
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
England English :  2281
<class 'str'>
United States English :  743

United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English,Canadian English,Indo-Canadian English :  1
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
Australian English :  655
<class 'str'>
Canadian English :  888
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
New Zealand English :  157
<class 'str'>
England English :  2281
<class 'str'>
French :  4
<class 'str'>
England English :  2281
<class 'str'>
England English :  2281
<class 'str'>
England English :  2281
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
India and South Asia (I

United States English :  7432
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
United States English :  7432
<class 'str'>
Singaporean English :  75
<class 'str'>
England English :  2281
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
United States English :  7432
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
United States English :  7432
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
Scottish English :  166
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
United States Engl

United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
Australian English :  655
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
Canadian English :  888
<class 'str'>
Irish English :  186
<class 'str'>
United States English :  7432
<class 'str'>
Singaporean English :  75
<class 'str'>
Southern African (South Africa, Zimbabwe, Namibia) :  260
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
Canadian English :  888
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  19

India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
England English :  2281
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
United States English :  7432
<class 'str'>
United States English,Silicon Valley Native :  1
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English,United States English. people say I sound like a surffer dude. :  1
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990

India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
England English :  2281
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
United States English :  7432
<class 'str'>
Canadian English :  888
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
Southern African (South Africa, Zimbabwe, Namibia) :  260
<class 'str'>
United States English :  7432
<class 'str'>
United States English,India and South Asia (India, Pakistan, Sri Lanka) :  3
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
United States English :  7432
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>

India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
Irish English :  186
<class 'str'>
Canadian English :  888
<class 'str'>
United States English,southern, formal, sultry :  1
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
S

United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
Irish English :  186
<class 'str'>
Southern African (South Africa, Zimbabwe, Namibia) :  260
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
United States English :  7432

United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
Australian English :  655
<class 'str'>
England English :  2281
<class 'str'>
West Indies and Bermuda (Bahamas, Bermuda, Jamaica, Trinidad) :  48
<class 'str'>
Filipino :  124
<class 'str'>
England English :  2281
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
United States English :  7432
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
En

India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
Filipino :  124
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
Australian English :  655
<class 'str'>
New Zealand English :  157
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
United States English :  7432
<class 'str'>
United States English :

United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
Southern African (South Africa, Zimbabwe, Namibia) :  260
<class 'str'>
Canadian English :  888
<class 'str'>
Australian English :  655
<class 'str'>
Southern African (South Africa, Zimbabwe, Namibia) :  260
<class 'str'>
England English :  2281
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English,pin/pen merger :  1
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
United States En

United States English :  7432
<class 'str'>
Australian English :  655
<class 'str'>
England English :  2281
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
Australian English :  655
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
United States English :  7432
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'st

United States English :  7432
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
Australian English :  655
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
Russian :  4
<class 'str'>
Australian English :  655
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
Irish English :  186
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
Hong Kong English :  129
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
Canadian English :  888
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
Southern African (South Africa, Zimba

India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
Canadian English :  888
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
Hong Kong English :  129
<class 'str'>
Scottish English :  166
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
Canadian English :  888
<class 'str'>
United States English :  7432
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
New Zealand English :  157
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
Scottish English :  166
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
Canadian English :  888
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
India and South 

United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
West Indies and Bermuda (Bahamas, Bermuda, Jamaica, Trinidad) :  48
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
Canadian English :  888
<class 'str'>
United States English :  7432
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
England English :  2281
<class 'str'>
West Indies and Bermuda (Bahamas, Bermuda, Jamaica, Trinidad),Patois :  1
<class 'str'>
United States English :  7432
<class 'str'>
Welsh English :  65
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
England English :  2281
<class 'str'>
Canadian English :  888
<class 'str'>
England En

India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
United States English :  7432
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
Transnational englishes blend :  1
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
India and South Asia (India, Pakistan, Sri Lank

England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
Welsh English :  65
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
England English,Porto des dels 3 anys aprenent anglès a l'escola i actualment m'estic preparant per a l'examen del B2. :  1
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
New Zealand English :  157
<class 'str'>
Australian English :  655
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
United States English :  7432
<class 

United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
Australian English :  655
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
Malaysian English :  95
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
India and South Asia (India, Pakistan,

United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
Canadian English :  888
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
United States English :  7432
<class 'str'>
Canadian English :  888
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
Hong Kong En

England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
Welsh English :  65
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
United States English :  7432
<class 'str'>
Southern African (South Africa, Zimbabwe, Namibia) :  260
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
Malaysian English :  95
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
Malaysian English :  95
<class 'str'>
United States English : 

United States English :  7432
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
Canadian English :  888
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
England English :  2281
<class 'str'>
Malaysian English :  95
<class 'str'>
United States English :  7432
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
United States English :  7432
<class 'str'>
Australian English :  655
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432


United States English :  7432
<class 'str'>
Singaporean English :  75
<class 'str'>
Filipino :  124
<class 'str'>
West Indies and Bermuda (Bahamas, Bermuda, Jamaica, Trinidad) :  48
<class 'str'>
United States English :  7432
<class 'str'>
Canadian English :  888
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
Southern African (South Africa, Zimbabwe, Namibia) :  260
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
Australian English :  655
<class 'str'>
England English :  2281
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
United States English :  7432
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
United States English :  7432
<class 'str'>
England En

England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
New Zealand English :  157
<class 'str'>
Australian English :  655
<class 'str'>
Canadian English :  888
<class 'str'>
Canadian English :  888
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
Hong Kong English :  129
<class 'str'>
United States English :  7432
<class 'str'>
Malaysian English :  95
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'

United States English :  7432
<class 'str'>
Canadian English :  888
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
New Zealand English :  157
<class 'str'>
Australian English :  655
<class 'str'>
United States English :  7432
<class 'str'>
Malaysian English :  95
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
United States English :  7432
<class 'str'>
Welsh English :  65
<class 'str'>
Australian English :  655
<class 'str'>
Canadian English :  888
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English

United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
Canadian English :  888
<class 'str'>
England English :  2281
<class 'str'>
England English :  2281
<class 'str'>
Australian English :  655
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
Filipino :  124
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
Southern African (South Africa, Zimbabwe, Namibia) :  260
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
Southern African (South Africa, Zimbabwe, Namibia) :  260
<class 'str'>
United States English :  7432
<class 'st

Hong Kong English :  129
<class 'str'>
England English :  2281
<class 'str'>
Australian English :  655
<class 'str'>
United States English :  7432
<class 'str'>
Australian English :  655
<class 'str'>
United States English :  7432
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
England English :  2281
<class 'str'>
Canadian English :  888
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
Australian English,Educated Australian Accent :  1
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
Southern African (South Africa, Zimbabwe, Namibia) :  260
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
United States English :  7432
<class 'str'>
England 

United States English :  7432
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
England English :  2281
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
Australian English :  655
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
England English :  2281
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
United States English :  7432
<class 'str'>
Australian English :  655
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
Canadian English :  888
<class 'str'>

United States English :  7432
<class 'str'>
Canadian English :  888
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
Malaysian English :  95
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
Irish English :  186
<class 'str'>
United States English :  7432
<class 'str'>
Australian English :  655
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
Australian English :  655
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
Canadian English :  888
<class 'str'>
Australian English :  655
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
United States English :  7432
<class 'str'>

India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
Australian English :  655
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
Canadian English :  888
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
Southern African (South Africa, Zimbabwe, Namibia) :  260
<class 'str'>
Canadian English :  888
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
Australian English :  655
<class 'str'>
United States English :  7432
<class 'str'>
New Zeal

England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
New Zealand English :  157
<class 'str'>
United States English :  7432
<class 'str'>
Australian English :  655
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
Canadian English :  888
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
United States English :  7432

United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
England English :  2281
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
Southern African (South Africa, Zimbabwe, Namibia) :  260
<class 'str'>
Irish English :  186
<class 'str'>
England English :  2281
<class 'str'>
England English :  2281
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
Singaporean English :  75
<class 'str'>
United States E

Southern African (South Africa, Zimbabwe, Namibia) :  260
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
Canadian English :  888
<class 'str'>
United States English :  7432
<class 'str'>
Southern African (South Africa, Zimbabwe, Namibia) :  260
<class 'str'>
United Sta

United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
England English :  2281
<class 'str'>
England English :  2281
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
Australian English :  655
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United Sta

United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
Canadian English :  888
<class 'str'>
Canadian English :  888
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
England English :  2281
<class 'str'>
Irish English :  186
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
United States English :  7432
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
Irish English :  186
<class 'str'>
Canadian English :  888
<class 'str'>
England English :  2281
<class 'str'>
Singaporean English :  75
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
Australian English :  655
<class 'str'>
India and South Asia (Indi

United States English :  7432
<class 'str'>
Southern African (South Africa, Zimbabwe, Namibia) :  260
<class 'str'>
United States English,Mid-west United States English :  1
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
Irish English :  186
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
Canadian English :  888
<class 'str'>
Canadian English :  888
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
England English :  2281
<class 'str'>
Canadian English :  888
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
United States English :  7432
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
Uni

Canadian English :  888
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
Canadian English :  888
<class 'str'>
United States English :  7432
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
United States English :  7432
<class 'str'>
Irish English :  186
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
New Zealand English :  157
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
United States English : 

England English :  2281
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
United States English :  7432
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
United States English :  7432
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
Australian English :  655
<class 'str'>
Scottish English :  166
<class 'str'>
England English :  2281
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>

England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
Canadian English :  888
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
United States English :  7432
<class 'str'>
Filipino :  124
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
Canadian English :  888
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
Malaysian English :  95
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
Southern African (South Africa, Zimbabwe, Namibia) :  260
<class 'str'>
United Sta

Canadian English :  888
<class 'str'>
United States English :  7432
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
United States English :  7432
<class 'str'>
Malaysian English :  95
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
Hong Kong English :  129
<class 'str'>
United States English :  7432
<class 'str'>
Australian English :  655
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
Canadian English :  888
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
Malaysian English :

India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
United States English :  7432
<class 'str'>
Australian English :  655
<class 'str'>
United States English :  7432
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
New Zealand English :  157
<class 'str'>
United States English :  7432
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
Malaysian English :  95
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
Australian English :  655
<class 'str'>

Welsh English :  65
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
Canadian English :  888
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
Australian English :  655
<class 'str'>
England English :  2281
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
Canadian English :  888
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States En

New Zealand English :  157
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
New Zealand English :  157
<class 'str'>
England English :  2281
<class 'str'>
Malaysian English :  95
<class 'str'>
United States English :  7432
<class 'str'>
Australian English :  655
<class 'str'>
United States English :  7432
<class 'str'>
Australian English :  655
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
Hong Kong English :  129
<class 'str'>
England English :  2281
<class 'str'>
Canadian English :  888
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
Australian English :  655
<class 'str'>
United St

United States English :  7432
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
Canadian English :  888
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
Canadian English :  888
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
Australian English :  655
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
New Zealand English :  157
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
United States English :  7432
<class 'str'>
India and South Asia (India, Paki

India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
Canadian English :  888
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
Canadian English :  888
<class 'str'>
Canadian English :  888
<class 'str'>
United States English :  7432
<class 'str'>
Australian English :  655
<class 'str'>
United States English :  7432
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
New Zealand English :  157
<class 'str'>
United States English :  7432
<class 'str'>
Canadian English :  888
<class 'str'>
England English :  2281
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
Southern 

India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
Hong Kong English :  129
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
England English :  2281
<class 'str'>
England English :  2281
<class 'str'>
Hong Kong English :  129
<class 'str'>
United States English,Californian Accent :  1
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>

England English :  2281
<class 'str'>
England English :  2281
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
England English :  2281
<class 'str'>
Canadian English :  888
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
Filipino :  124
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
Australian English :  655
<class 'str'>
Unit

United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
Australian English :  655
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
Scottish English :  166
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
Australian English :  655
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
Canadian English :  888
<class 'str'>
Canadian English :  888
<class 'str'>
United States English :  7432
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>

United States English :  7432
<class 'str'>
Canadian English :  888
<class 'str'>
United States English :  7432
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
Australian English :  655
<class 'str'>
Canadian English :  888
<class 'str'>
England English :  2281
<class 'str'>
Scottish English :  166
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
Southern African (South Africa, Zimbabwe, Namibia) :  260
<class 'str'>
England English :  2281
<class 'str'>
Canadian English :  888
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
Australian English :  655
<class 'str'>

India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
Irish English :  186
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
Irish English :  186
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
New Zealand English :  157
<class 'str'>
England English :  2281
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
Australian English :  655
<class 'str'>
England Engli

England English :  2281
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
Eastern European English :  2
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
Canadian English :  888
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
Canadian English :  888
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
Hong Kong English :  129
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
U

United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
Canadian English :  888
<class 'str'>
England English :  2281
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
Canadian English :  888
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
Canadian English :  888
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
Canadian English :  888
<class 'str'>
United States Englis

United States English :  7432
<class 'str'>
Canadian English :  888
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
Southern African (South Africa, Zimbabwe, Namibia) :  260
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
New Zealand English :  157
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
United States English :  

Australian English :  655
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
United States English :  7432
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
United States English :  7432
<class 'str'>
Hong Kong English :  129
<class 'str'>
Welsh English :  65
<class 'str'>

England English :  2281
<class 'str'>
Hong Kong English :  129
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
Canadian English :  888
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
England English :  2281
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>

England English :  2281
<class 'str'>
England English :  2281
<class 'str'>
England English :  2281
<class 'str'>
England English :  2281
<class 'str'>
Scottish English :  166
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
United States English :  7432
<class 'str'>
southern United States,United States English :  1
<class 'str'>
United States English :  7432
<class 'str'>
Australian English :  655
<class 'str'>
United States English :  7432
<class 'str'>
Scottish English :  166
<class 'str'>
United States English :  7432
<class 'str'>
Canadian English :  888
<class 'str'>
England English :  2281
<class 'str'>
United States English :  7432
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
India and South Asia (India, Pakistan, Sri Lanka) :  1990
<class 'str'>
Scottish English :  166
<class 'str'>
United States English :  7432
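
The per-accent totals printed above can also be computed in one step with pandas. The following is a minimal sketch rather than the notebook's own method: `sample_df` is a hypothetical stand-in for the `df` loaded from `validated.tsv`, and `accent_totals` and `accent_totals_list` are names introduced only for this illustration.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's `df` (assumption: the real one
# is read from validated.tsv and has an 'accents' column of strings).
sample_df = pd.DataFrame({
    "accents": [
        "United States English",
        "England English",
        "United States English",
        "India and South Asia (India, Pakistan, Sri Lanka)",
        "United States English",
        None,  # a clip whose contributor gave no accent
    ]
})

# value_counts() counts the rows for each distinct string, skips missing
# values by default, and returns the counts sorted in descending order.
accent_totals = sample_df["accents"].value_counts()

# Convert to (total, accent) tuples, the same shape as the sorted list
# that this notebook builds by hand.
accent_totals_list = [(int(total), accent) for accent, total in accent_totals.items()]

print(accent_totals_list)
```

Note that `value_counts()` only guarantees ordering by count, so accents with equal totals may appear in a different order than they do after a `sort(reverse=True)` on `(total, accent)` tuples.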


In [27]:
# Count how many rows carry each distinct accent string,
# then put the totals into a list so that we can sort them 

english_accents = df['accents'].unique()

english_accents_dict = {} 
english_accents_list = []
english_accents_list_by_total = []

for accent in english_accents:
    accent_total = len(df.loc[df['accents'] == accent])
    english_accents_dict[accent] = accent_total
    
# A dict has no sort() method
# So we copy its (accent, total) pairs into a list that we can sort 

for accent_total in english_accents_dict.items(): 
    english_accents_list.append(accent_total)

#print ('\n')
#print ('breakpoint 1')
#print ('\n')

# print the list 
#print(english_accents_list)
    
# Now, we want to order the list
# If we use sort() on the (accent, total) tuples, it will sort by the first element, which is the accent itself
# We want to sort on the number of occurrences of that accent instead,
# so we flip each tuple to (total, accent) first 

for accents_list in english_accents_list: 
    #print(accents_list)
    total_first = accents_list[::-1] # tuples are immutable and have no reverse() method, so slice instead
    #print(total_first)
    english_accents_list_by_total.append(total_first)
    
english_accents_list_by_total.sort(reverse=True)
    
#print ('\n')
#print ('breakpoint 2')
#print ('\n')

# print the size of the list     
print('Size of the list is: ')
print(len(english_accents_list_by_total))
print('---')

# print the sorted list
print('Sorted list is: ')
pp.pprint(english_accents_list_by_total)





Size of the list is: 
227
---
Sorted list is: 
[(7432, 'United States English'),
 (2281, 'England English'),
 (1990, 'India and South Asia (India, Pakistan, Sri Lanka)'),
 (888, 'Canadian English'),
 (655, 'Australian English'),
 (260, 'Southern African (South Africa, Zimbabwe, Namibia)'),
 (186, 'Irish English'),
 (166, 'Scottish English'),
 (157, 'New Zealand English'),
 (129, 'Hong Kong English'),
 (124, 'Filipino'),
 (95, 'Malaysian English'),
 (75, 'Singaporean English'),
 (65, 'Welsh English'),
 (48, 'West Indies and Bermuda (Bahamas, Bermuda, Jamaica, Trinidad)'),
 (13, 'United States English,England English'),
 (9, 'England English,United States English'),
 (6, 'German English'),
 (5, 'South Atlantic (Falkland Islands, Saint Helena)'),
 (4, 'Russian'),
 (4, 'French'),
 (3, 'United States English,India and South Asia (India, Pakistan, Sri Lanka)'),
 (3, 'Slavic'),
 (3, 'Northern Irish'),
 (3, 'India and South Asia (India, Pakistan, Sri Lanka),United States English'),
 (3, 'India
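
The sorted list above counts whole accent strings, so `'United States English,England English'` and `'England English,United States English'` are tallied separately from their component accents. The cell that follows splits each string into individual accents before counting. As a compact illustration of that idea, here is a minimal sketch using `collections.Counter`; the sample strings and the names `SPLIT_PATTERN`, `sample_accent_strings` and `accent_counts` are assumptions made for this example and are not part of the notebook.

```python
import re
from collections import Counter

# Split on a comma (plus any following whitespace) only when that comma
# is not inside parentheses. This is the same pattern used in the next
# cell, written here as a raw string.
SPLIT_PATTERN = re.compile(r',\s*(?![^()]*\))')

# Hypothetical sample values in the style of the 'accents' column.
sample_accent_strings = [
    "United States English",
    "India and South Asia (India, Pakistan, Sri Lanka)",
    "United States English,England English",
    "England English,United States English",
]

accent_counts = Counter()
for accent_string in sample_accent_strings:
    # The commas inside "(India, Pakistan, Sri Lanka)" are left alone,
    # so that description stays in one piece.
    accent_counts.update(SPLIT_PATTERN.split(accent_string))

# most_common() returns (accent, count) pairs, highest count first.
for accent, count in accent_counts.most_common():
    print(count, accent)
```

A `Counter` is keyed by the accent itself, so each lookup is a single dict access rather than a scan over every entry seen so far.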

In [12]:
# build a dict of each unique accent 
# and assign it an identifier 
# so that it is easier to visualise linkages between accents 
# and to assign characteristics to each accent 

ratio_display = 120 # only print debug output for every 120th accent, to stop the browser crashing 

all_accents = {}
i = 0

all_accent_lists = df['accents']

for accent_list in all_accent_lists:
    #print (accent_list)
    
    # the accent_list is a string, not a list
    # and we can't simply split it on every comma 
    # because some entries have a comma inside the accent description 
    # such as 
    # India and South Asia (India, Pakistan, Sri Lanka)
    # so instead we use a regex that only splits on a comma ','
    # if the comma is not inside parentheses 
    
    # this regex is from 
    # https://stackoverflow.com/questions/26633452/how-to-split-by-commas-that-are-not-within-parentheses
    # (a raw string stops Python treating \s and \) as invalid string escapes)
    accent_list = re.split(r',\s*(?![^()]*\))', accent_list)
    
    for accent in accent_list: 
        
        i +=1
        match = False 
        count = 0
        
        if (i%ratio_display ==0): # only show every ratio_display-th accent 
            print('')
            print('---')
            print('now processing: ', accent, ' - ', i)
            print('---')
        
        # is this accent in our dict - if not, add it in 
        
        for item in all_accents.items() :
            
            if (i%ratio_display ==0): # only show every ratio_display-th accent 
                print('item is: ', item)
                print(type(item))
                print('now checking match for: item:', item[1], ' and accent: ', accent)
            
            if (item[1]['accent'] == accent) : # update the count
                
                if (i%ratio_display ==0): # only show every ratio_display-th accent 
                    print('---')
                    print('match is True')
                    print('---')
                    
                match = True 
                
                if (i%ratio_display ==0): # only show every ratio_display-th accent 
                    print('accent count was: ', item[1]['count'])
                
                # update the count of the accent 
                item[1]['count']+=1
                
                if (i%ratio_display ==0): # only show every ratio_display-th accent 
                    print('accent count was: ', item[1]['count'])
                
                

        # this match loop has to be outside the for: loop above 
        # because if we add items to the dict inside the loop
        # then it will not run - because there are zero items in the dict to begin with 
        
        if (not match) :   
            
            all_accents[i] = dict((("accent", accent),("count", 1)))
            
            if (i%ratio_display ==0): # only show every ratio_display-th accent 
                print ('added ', accent, ' to dict')
                




---
now processing:  United States English  -  120
---
item is:  (1, {'accent': 'England English', 'count': 17})
<class 'tuple'>
now checking match for: item: {'accent': 'England English', 'count': 17}  and accent:  United States English
item is:  (2, {'accent': 'United States English', 'count': 54})
<class 'tuple'>
now checking match for: item: {'accent': 'United States English', 'count': 54}  and accent:  United States English
---
match is True
---
accent count was:  54
accent count was:  55
item is:  (3, {'accent': 'Hong Kong English', 'count': 3})
<class 'tuple'>
now checking match for: item: {'accent': 'Hong Kong English', 'count': 3}  and accent:  United States English
item is:  (7, {'accent': 'wolof', 'count': 1})
<class 'tuple'>
now checking match for: item: {'accent': 'wolof', 'count': 1}  and accent:  United States English
item is:  (9, {'accent': 'Australian English', 'count': 4})
<class 'tuple'>
now checking match for: item: {'accent': 'Australian English', 'count': 4}  an

<class 'tuple'>
now checking match for: item: {'accent': "A'lo", 'count': 1}  and accent:  United States English
item is:  (129, {'accent': 'Finnish', 'count': 1})
<class 'tuple'>
now checking match for: item: {'accent': 'Finnish', 'count': 1}  and accent:  United States English
item is:  (133, {'accent': 'Singaporean English', 'count': 42})
<class 'tuple'>
now checking match for: item: {'accent': 'Singaporean English', 'count': 42}  and accent:  United States English
item is:  (139, {'accent': 'West Indies and Bermuda (Bahamas, Bermuda, Jamaica, Trinidad)', 'count': 32})
<class 'tuple'>
now checking match for: item: {'accent': 'West Indies and Bermuda (Bahamas, Bermuda, Jamaica, Trinidad)', 'count': 32}  and accent:  United States English
item is:  (142, {'accent': 'Hispanic/Latino', 'count': 1})
<class 'tuple'>
now checking match for: item: {'accent': 'Hispanic/Latino', 'count': 1}  and accent:  United States English
item is:  (174, {'accent': 'Non-native', 'count': 2})
<class 'tuple


---
now processing:  England English  -  9240
---
item is:  (1, {'accent': 'England English', 'count': 1368})
<class 'tuple'>
now checking match for: item: {'accent': 'England English', 'count': 1368}  and accent:  England English
---
match is True
---
accent count was:  1368
accent count was:  1369
item is:  (2, {'accent': 'United States English', 'count': 4570})
<class 'tuple'>
now checking match for: item: {'accent': 'United States English', 'count': 4570}  and accent:  England English
item is:  (3, {'accent': 'Hong Kong English', 'count': 99})
<class 'tuple'>
now checking match for: item: {'accent': 'Hong Kong English', 'count': 99}  and accent:  England English
item is:  (7, {'accent': 'wolof', 'count': 1})
<class 'tuple'>
now checking match for: item: {'accent': 'wolof', 'count': 1}  and accent:  England English
item is:  (9, {'accent': 'Australian English', 'count': 344})
<class 'tuple'>
now checking match for: item: {'accent': 'Australian English', 'count': 344}  and accent:  


---
now processing:  India and South Asia (India, Pakistan, Sri Lanka)  -  11760
---
item is:  (1, {'accent': 'England English', 'count': 1790})
<class 'tuple'>
now checking match for: item: {'accent': 'England English', 'count': 1790}  and accent:  India and South Asia (India, Pakistan, Sri Lanka)
item is:  (2, {'accent': 'United States English', 'count': 5858})
<class 'tuple'>
now checking match for: item: {'accent': 'United States English', 'count': 5858}  and accent:  India and South Asia (India, Pakistan, Sri Lanka)
item is:  (3, {'accent': 'Hong Kong English', 'count': 113})
<class 'tuple'>
now checking match for: item: {'accent': 'Hong Kong English', 'count': 113}  and accent:  India and South Asia (India, Pakistan, Sri Lanka)
item is:  (7, {'accent': 'wolof', 'count': 1})
<class 'tuple'>
now checking match for: item: {'accent': 'wolof', 'count': 1}  and accent:  India and South Asia (India, Pakistan, Sri Lanka)
item is:  (9, {'accent': 'Australian English', 'count': 484})
<cla


---
now processing:  United States English  -  13320
---
item is:  (1, {'accent': 'England English', 'count': 2044})
<class 'tuple'>
now checking match for: item: {'accent': 'England English', 'count': 2044}  and accent:  United States English
item is:  (2, {'accent': 'United States English', 'count': 6650})
<class 'tuple'>
now checking match for: item: {'accent': 'United States English', 'count': 6650}  and accent:  United States English
---
match is True
---
accent count was:  6650
accent count was:  6651
item is:  (3, {'accent': 'Hong Kong English', 'count': 122})
<class 'tuple'>
now checking match for: item: {'accent': 'Hong Kong English', 'count': 122}  and accent:  United States English
item is:  (7, {'accent': 'wolof', 'count': 1})
<class 'tuple'>
now checking match for: item: {'accent': 'wolof', 'count': 1}  and accent:  United States English
item is:  (9, {'accent': 'Australian English', 'count': 574})
<class 'tuple'>
now checking match for: item: {'accent': 'Australian Engli

In [13]:
print(len(all_accents))

224
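
The counting loop above can also be sketched more compactly with `collections.Counter`. This is only an illustrative alternative, not the method used in this notebook: the names `ACCENT_SPLIT`, `count_accents` and `sample` are hypothetical, and the sample rows stand in for `df['accents']`. Unlike the loop above, it does not assign a numeric identifier to each accent.

```python
import re
from collections import Counter

# same regex as in the loop above: split on commas that are not inside parentheses
ACCENT_SPLIT = re.compile(r',\s*(?![^()]*\))')

def count_accents(accent_strings):
    """Count each individual accent across a sequence of accent-list strings."""
    counts = Counter()
    for accent_string in accent_strings:
        counts.update(ACCENT_SPLIT.split(accent_string))
    return counts

# hypothetical sample rows, standing in for df['accents']
sample = [
    'United States English',
    'England English,United States English',
    'India and South Asia (India, Pakistan, Sri Lanka),United States English',
]

accent_counts = count_accents(sample)
print(accent_counts.most_common())
```

Note that the accent containing commas inside parentheses is kept as a single entry.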


In [28]:
pp.pprint(all_accents)

{1: {'accent': 'England English', 'accent_predetermined': True, 'count': 2342},
 2: {'accent': 'United States English',
     'accent_predetermined': True,
     'count': 7537},
 3: {'accent': 'Hong Kong English', 'accent_predetermined': True, 'count': 132},
 7: {'accent': 'wolof', 'accent_predetermined': False, 'count': 1},
 9: {'accent': 'Australian English',
     'accent_predetermined': True,
     'count': 665},
 11: {'accent': 'Latin America', 'accent_predetermined': False, 'count': 1},
 15: {'accent': 'Southern African (South Africa, Zimbabwe, Namibia)',
      'accent_predetermined': True,
      'count': 260},
 18: {'accent': 'India and South Asia (India, Pakistan, Sri Lanka)',
      'accent_predetermined': True,
      'count': 2009},
 31: {'accent': 'Colombia', 'accent_predetermined': False, 'count': 1},
 37: {'accent': 'Canadian English', 'accent_predetermined': True, 'count': 896},
 58: {'accent': 'Non native speaker from France',
      'accent_predetermined': False,
      'count

## Label the accents that were pre-determined 

Since its inception, Mozilla Common Voice has enabled data contributors to enter demographic information such as age, gender and accent. These self-reported values are not validated in any way, and we don't have any indicator of how accurate they are. Accent _used_ to be represented as an a priori drop-down list, which the contributor could select from. From Common Voice v10, the data contributor can **self-describe** their accent. However, the previous accent list is still presented, so those accents may be chosen more frequently by data contributors. We need to be able to distinguish these pre-determined accents visually to help with the exploration. 

```
"splits": {
        "accent": {
          "": 0.51,
          "canada": 0.03,
          "england": 0.08,
          "us": 0.23,
          "indian": 0.07,
          "australia": 0.03,
          "malaysia": 0,
          "newzealand": 0.01,
          "african": 0.01,
          "ireland": 0.01,
          "philippines": 0,
          "singapore": 0,
          "scotland": 0.02,
          "hongkong": 0,
          "bermuda": 0,
          "southatlandtic": 0,
          "wales": 0,
          "other": 0.01
        },

```

The accent labels in the `cv-datasets` splits above don't match the accent names as they appear in the data, so we need to specify the pre-determined accents ourselves. This is how they appear to the data contributor filling out their profile at: [https://commonvoice.mozilla.org/en/profile/info](https://commonvoice.mozilla.org/en/profile/info)


![Accents as specified on Mozilla Common Voice profile](cv-profile-specify-accent.png)


In [15]:
# create a list of the pre-existing accents 
# this is how they are given in the dataset. 

predetermined_accents_list = ['United States English', 
                         'England English', 
                         'India and South Asia (India, Pakistan, Sri Lanka)', 
                         'Canadian English', 
                         'Australian English', 
                         'Southern African (South Africa, Zimbabwe, Namibia)', 
                         'Irish English', 
                         'Scottish English', 
                         'New Zealand English', 
                         'Hong Kong English', 
                         'Filipino', 
                         'Malaysian English', 
                         'Singaporean English', 
                         'Welsh English', 
                         'West Indies and Bermuda (Bahamas, Bermuda, Jamaica, Trinidad)', 
                         'South Atlantic (Falkland Islands, Saint Helena)']
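
If you need to move between the `cv-datasets` split labels and the accent names in the data, one option is a lookup dict like the sketch below. The pairing is an assumption: it was inferred by eye from the two lists in this notebook and is not a mapping published by Common Voice, so check it before relying on it. The names `split_label_to_accent` and `accent_to_split_label` are hypothetical.

```python
# ASSUMPTION: this pairing is inferred by comparing the split labels with the
# profile accent names shown in this notebook; verify before relying on it.
# The key 'southatlandtic' is spelled as it appears in the splits.
split_label_to_accent = {
    'us': 'United States English',
    'england': 'England English',
    'indian': 'India and South Asia (India, Pakistan, Sri Lanka)',
    'canada': 'Canadian English',
    'australia': 'Australian English',
    'african': 'Southern African (South Africa, Zimbabwe, Namibia)',
    'ireland': 'Irish English',
    'scotland': 'Scottish English',
    'newzealand': 'New Zealand English',
    'hongkong': 'Hong Kong English',
    'philippines': 'Filipino',
    'malaysia': 'Malaysian English',
    'singapore': 'Singaporean English',
    'wales': 'Welsh English',
    'bermuda': 'West Indies and Bermuda (Bahamas, Bermuda, Jamaica, Trinidad)',
    'southatlandtic': 'South Atlantic (Falkland Islands, Saint Helena)',
}

# reverse lookup: accent name in the data -> split label
accent_to_split_label = {name: label for label, name in split_label_to_accent.items()}

print(accent_to_split_label['Filipino'])
```

The `""` and `other` split labels are left out because they have no counterpart in the pre-determined list.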



  

In [29]:
# now we iterate through all_accents, and if the ['accent'] in the dict matches an accent in the predetermined list, 
# we record this in the dict so that we can use it for visualisation. 


for item in all_accents.items(): 
    #print('item is: ', item)
    key = item[0] 
    #print(key)
    
    predetermined_status = False
    
    for predetermined_accent in predetermined_accents_list: 
        
        #print('predetermined_accent is: ', predetermined_accent)
        #print('item 1 accent is: ', item[1]['accent'])
        
        if (predetermined_accent == item[1]['accent']) :
            #print('MATCH!')
            predetermined_status = True
              
    all_accents[key]['accent_predetermined'] = predetermined_status
    #print(all_accents[key]['accent_predetermined'])

    
pp.pprint(all_accents)

    



{1: {'accent': 'England English', 'accent_predetermined': True, 'count': 2342},
 2: {'accent': 'United States English',
     'accent_predetermined': True,
     'count': 7537},
 3: {'accent': 'Hong Kong English', 'accent_predetermined': True, 'count': 132},
 7: {'accent': 'wolof', 'accent_predetermined': False, 'count': 1},
 9: {'accent': 'Australian English',
     'accent_predetermined': True,
     'count': 665},
 11: {'accent': 'Latin America', 'accent_predetermined': False, 'count': 1},
 15: {'accent': 'Southern African (South Africa, Zimbabwe, Namibia)',
      'accent_predetermined': True,
      'count': 260},
 18: {'accent': 'India and South Asia (India, Pakistan, Sri Lanka)',
      'accent_predetermined': True,
      'count': 2009},
 31: {'accent': 'Colombia', 'accent_predetermined': False, 'count': 1},
 37: {'accent': 'Canadian English', 'accent_predetermined': True, 'count': 896},
 58: {'accent': 'Non native speaker from France',
      'accent_predetermined': False,
      'count
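
The nested loop above can be written as a single membership test against a set. A minimal sketch, assuming a shortened pre-determined list and two entries copied from the output above; the names `predetermined` and `accents` are hypothetical stand-ins for `predetermined_accents_list` and `all_accents`.

```python
# shortened stand-in for predetermined_accents_list (illustration only)
predetermined = {'United States English', 'England English'}

# two entries in the same shape as all_accents
accents = {
    1: {'accent': 'England English', 'count': 2342},
    7: {'accent': 'wolof', 'count': 1},
}

# flag each accent as pre-determined or self-described
for entry in accents.values():
    entry['accent_predetermined'] = entry['accent'] in predetermined

print(accents)
```

A set lookup gives the same result as comparing against every list item in turn, with less code.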

In [30]:
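# sort the dict by count descending 

all_accents_sorted_by_count = dict(sorted(all_accents.items(), key=lambda t:(t[1]["count"], t[1]["accent"]), reverse=True))

# note: pprint sorts dicts by key by default, so the output below is shown in key order 
# even though the dict itself is ordered by count; 
# passing sort_dicts=False (Python 3.8+) to pprint keeps the sorted order 
pp.pprint(all_accents_sorted_by_count)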


{1: {'accent': 'England English', 'accent_predetermined': True, 'count': 2342},
 2: {'accent': 'United States English',
     'accent_predetermined': True,
     'count': 7537},
 3: {'accent': 'Hong Kong English', 'accent_predetermined': True, 'count': 132},
 7: {'accent': 'wolof', 'accent_predetermined': False, 'count': 1},
 9: {'accent': 'Australian English',
     'accent_predetermined': True,
     'count': 665},
 11: {'accent': 'Latin America', 'accent_predetermined': False, 'count': 1},
 15: {'accent': 'Southern African (South Africa, Zimbabwe, Namibia)',
      'accent_predetermined': True,
      'count': 260},
 18: {'accent': 'India and South Asia (India, Pakistan, Sri Lanka)',
      'accent_predetermined': True,
      'count': 2009},
 31: {'accent': 'Colombia', 'accent_predetermined': False, 'count': 1},
 37: {'accent': 'Canadian English', 'accent_predetermined': True, 'count': 896},
 58: {'accent': 'Non native speaker from France',
      'accent_predetermined': False,
      'count
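
The output above is in key order rather than count order because `pprint` sorts dict keys by default. A small sketch of the difference, using three entries copied from the output above; the names `accents` and `by_count` are hypothetical, and `sort_dicts` requires Python 3.8 or later.

```python
import pprint

accents = {
    1: {'accent': 'England English', 'count': 2342},
    2: {'accent': 'United States English', 'count': 7537},
    7: {'accent': 'wolof', 'count': 1},
}

# same sort as above: by count, then accent name, descending
by_count = dict(sorted(accents.items(),
                       key=lambda kv: (kv[1]['count'], kv[1]['accent']),
                       reverse=True))

pprint.pprint(by_count)                    # keys are re-sorted for display
pprint.pprint(by_count, sort_dicts=False)  # insertion (count) order is kept
```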

In [31]:
pp.pprint(english_accents_list)

[('England English,United States English', 9),
 ('Hong Kong English', 129),
 ('England English', 2281),
 ('United States English', 7432),
 ('United States English,wolof', 1),
 ('Australian English', 655),
 ('Latin America,United States English', 1),
 ('Southern African (South Africa, Zimbabwe, Namibia)', 260),
 ('India and South Asia (India, Pakistan, Sri Lanka)', 1990),
 ('United States English,Colombia', 1),
 ('Canadian English', 888),
 ('Non native speaker from France', 1),
 ('Scottish English', 166),
 ('New York City', 1),
 ('Filipino', 124),
 ('French', 4),
 ('Argentinian English', 1),
 ("A'lo", 1),
 ('Finnish', 1),
 ('United States English,India and South Asia (India, Pakistan, Sri Lanka)', 3),
 ('Singaporean English', 75),
 ('West Indies and Bermuda (Bahamas, Bermuda, Jamaica, Trinidad)', 48),
 ('Hispanic/Latino', 1),
 ('India and South Asia (India, Pakistan, Sri Lanka),England English', 3),
 ('Non-native', 1),
 ('Georgian English', 1),
 ('Non native', 1),
 ('United States Engli

In [19]:
print(len(english_accents_list))

## this should match the length of the accent_nodes dict below - it shows we have created this many co-references
## I'm not sure co-references is the correct terminology

227


## Add category descriptors to each accent

In this section, I apply a set of categories to the accent data. 

**I use a rule-based approach for reproducibility.** 
This could have been done in a spreadsheet, but I'm working in Python so I chose to do it that way. 
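
As a rough illustration of what a rule-based approach can look like, the sketch below looks up an accent in a list of `(accent, category)` tuples. It is not the notebook's implementation: `sample_rules`, `normalise`, `rule_lookup` and `categorise` are hypothetical names, and matching on lower-cased, whitespace-stripped text is an assumption made for the example.

```python
from collections import defaultdict

# hypothetical subset of (accent, category) rules
sample_rules = [
    ('German English', 'Country descriptor'),
    ('polish', 'Country descriptor'),
    ('Midwestern', 'Subnational region descriptor'),
]

# ASSUMPTION: matching ignores case and surrounding whitespace
def normalise(accent):
    return accent.strip().lower()

# an accent may have more than one category, so collect them in a list
rule_lookup = defaultdict(list)
for accent, category in sample_rules:
    rule_lookup[normalise(accent)].append(category)

def categorise(accent):
    """Return the categories for an accent, or ['Unknown or other'] if no rule matches."""
    return rule_lookup.get(normalise(accent), ['Unknown or other'])

print(categorise('Polish '))
print(categorise('Cool'))
```

Because the rules are data, rerunning the notebook on a new dataset release applies exactly the same categorisation.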




In [32]:
"""
predetermined_accents_list = ['United States English', 
                         'England English', 
                         'India and South Asia (India, Pakistan, Sri Lanka)', 
                         'Canadian English', 
                         'Australian English', 
                         'Southern African (South Africa, Zimbabwe, Namibia)', 
                         'Irish English', 
                         'Scottish English', 
                         'New Zealand English', 
                         'Hong Kong English', 
                         'Filipino', 
                         'Malaysian English', 
                         'Singaporean English', 
                         'Welsh English', 
                         'West Indies and Bermuda (Bahamas, Bermuda, Jamaica, Trinidad)', 
                         'South Atlantic (Falkland Islands, Saint Helena)']
"""

"""
List of accents that should be merged into one canonical one for the purpose of analysis 

667, 2456, 1337, 4320 - German and German English and 'German Accent', speak some German

2985, 5349, 2191, 7603, 14262 - Midwestern United States variants 
1978 - and others - southern united states variants 

10548, 1133 - Dutch English and Dutch

7846, 5213, 479 -  polish and 'polish accent' and 'Polish English'

9878 - 'england' and 'England English'

4049 - 'eastern european English' and 'eastern Europe'

5737, 708 - 'swedish' and 'swedish english'

12363 and others - 'Philadelphia' and variants





"""

"""
Accent descriptors that should be separated into separate accents descriptors and joined as related nodes

8233 - 'south German / Swiss accent'
6967 - 'slight Brooklyn Accent' - needs to be separated into "slight" and 'Brooklyn' - as a strength indicator 
10142 - 'minor French Accent' - needs to be separated into "minor" and 'French'
9902 - 'little Latino' - needs to be separated into "little" and "Latino"
3422 - 'i have some pronunciation issues because of oral surgery and a hidden southern accent' - needs to be separated into 'hidden southern accent' and 'changes due to Oral surgery'
9337 - 'heavy Cantonese' - needs to be separated into accent strength and country
1721 - 'United States English combined with European English' - needs to be separated into two accents
5365 - 'Sydney - middle eastern seaboard Australian' - should be separated into two accents
6615 - 'Spoke Chinese when little' - should be separated into Chinese, and speaking a different language as a child
748 - 'Spanish bilingual' - should be separated into 'Spanish' and 'bilingual ' as an Ln marker. 
6942 - 'South London and Essex' - should be separated into 'South London' and 'Essex'
12055 - 'Some time spent in Scotland' - should be separated into 'Scottish English' and into an Ln marker of "time spent in regional location"
773 - 'Silicon Valley Native' - should be separated into 'Silicon Valley' and "native" marker
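3033 - "Porto des dels 3 anys aprenent anglès a l'escola i actualment m'estic preparant per a l'examen del B2." (Catalan: "I have been learning English at school since I was 3 and I am currently preparing for the B2 exam.") - this should be split into Catalan, and academic register / L2 status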
4322 - 'Polish. Have lived in nine states.' - should be separated into Polish and "time spent in regional location"
5046 - 'Pittsburgh PA' - should be separated into Pittsburgh - city descriptor - and Pennsylvania as a regional descriptor. 

"""

"""
Modifications that should be made 

Country names should be capitalised - e.g. 'polish' becomes 'Polish'
Accent descriptors should use capital case - e.g. 'southern United States' becomes 'Southern United States'

"""

"""
Notation logs 

* Northern Irish is categorised as a Country descriptor because Northern Ireland is a country in the United Kingdom
* Serbian is categorised as a Country descriptor because Serbia is a country in its own right - it's no longer part of Yugoslavia
* Similarly, Scottish is categorised as a Country descriptor

Is it a country descriptor or is it a language descriptor?
Some countries have a language that tends to overlap the borders of the nation state - German, Austrian, French, Polish
So it's both a country and a language description 
Slavic is a language group, as is Latino too. 

So that needs to come out in the analysis. 

Strength of accent appears to be another descriptor - so 'slight Brooklyn accent'

Is fluency the same as an Ln descriptor? For these purposes I think we can say so. 

Cantonese has been classified as sub-national - primarily because Hong Kong is a part of China, however there are many Cantonese diaspora in the world. 
Should it be classified as supra-national, as is Wolof? 
This might be a better classification, because of the many countries in which Cantonese is spoken (Singapore etc.)
A different way to approach this might be to have a category which is "Language group descriptor"

The accent descriptor 'West Indian' is ambiguous - does it refer to West Indies, or West India? I have categorised it as West India because there is an existing West Indies pre-determined category. 

6615 - Spoke Chinese when little - is this another category for speaking a language in childhood and then having a different accent when you grow up? 




"""

"""
  2685: {'accent': 'Patois', 'count': 1, 'accent_predetermined': False}, 13066: {'accent': 'Pacific Northwest ', 'count': 1, 'accent_predetermined': False}, 7741: {'accent': 'Okie', 'count': 1, 'accent_predetermined': False}, 1485: {'accent': 'Not bad', 'count': 1, 'accent_predetermined': False}, 14161: {'accent': 'Northumbrian British English', 'count': 1, 'accent_predetermined': False}, 14305: {'accent': 'Northern England', 'count': 1, 'accent_predetermined': False}, 1534: {'accent': 'Northern', 'count': 1, 'accent_predetermined': False}, 58: {'accent': 'Non native speaker from France', 'count': 1, 'accent_predetermined': False}, 15055: {'accent': 'Non native speaker', 'count': 1, 'accent_predetermined': False}, 188: {'accent': 'Non native', 'count': 1, 'accent_predetermined': False}, 8750: {'accent': 'Nigerian English', 'count': 1, 'accent_predetermined': False}, 700: {'accent': 'Nigerian ', 'count': 1, 'accent_predetermined': False}, 78: {'accent': 'New York City', 'count': 1, 'accent_predetermined': False}, 14479: {'accent': 'New Orleans dialect', 'count': 1, 'accent_predetermined': False}, 12912: {'accent': 'Mixed-Accent English', 'count': 1, 'accent_predetermined': False}, 2446: {'accent': 'Mix of voices', 'count': 1, 'accent_predetermined': False}, 14585: {'accent': 'Minnesotan', 'count': 1, 'accent_predetermined': False}, 1054: {'accent': 'Mild Northern England English', 'count': 1, 'accent_predetermined': False}, 12658: {'accent': 'Midwestern US English (United States)', 'count': 1, 'accent_predetermined': False}, 11587: {'accent': 'Midwestern States (Michigan)', 'count': 1, 'accent_predetermined': False}, 2205: {'accent': 'Midwest United States', 'count': 1, 'accent_predetermined': False}, 1317: {'accent': 'Midwest USA speech blended with South Texas USA speech', 'count': 1, 'accent_predetermined': False}, 1491: {'accent': 'Midwest US... With some Canadian slang. 
', 'count': 1, 'accent_predetermined': False}, 334: {'accent': 'Midlands English', 'count': 1, 'accent_predetermined': False}, 11197: {'accent': 'Midatlantic', 'count': 1, 'accent_predetermined': False}, 8043: {'accent': 'Mid-west United States English', 'count': 1, 'accent_predetermined': False}, 11039: {'accent': 'Mid-atlantic', 'count': 1, 'accent_predetermined': False}, 12362: {'accent': 'Mid-Atlantic United States English', 'count': 1, 'accent_predetermined': False}, 14741: {'accent': 'Low', 'count': 1, 'accent_predetermined': False}, 2940: {'accent': 'London English', 'count': 1, 'accent_predetermined': False}, 879: {'accent': 'Liverpudlian English', 'count': 1, 'accent_predetermined': False}, 15028: {'accent': 'Liverpool English', 'count': 1, 'accent_predetermined': False}, 4112: {'accent': 'Latin English', 'count': 1, 'accent_predetermined': False}, 5417: {'accent': 'Latin American accent', 'count': 1, 'accent_predetermined': False}, 11: {'accent': 'Latin America', 'count': 1, 'accent_predetermined': False}, 15029: {'accent': 'Lancashire English', 'count': 1, 'accent_predetermined': False}, 5426: {'accent': 'Lancashire', 'count': 1, 'accent_predetermined': False}, 5760: {'accent': 'Kiwi', 'count': 1, 'accent_predetermined': False}, 2828: {'accent': 'Kenyan English', 'count': 1, 'accent_predetermined': False}, 7744: {'accent': 'Kenyan ', 'count': 1, 'accent_predetermined': False}, 11710: {'accent': 'Javanese', 'count': 1, 'accent_predetermined': False}, 4762: {'accent': 'Japanese', 'count': 1, 'accent_predetermined': False}, 11403: {'accent': 'Italian', 'count': 1, 'accent_predetermined': False}, 1099: {'accent': "Israeli's accent ", 'count': 1, 'accent_predetermined': False}, 7644: {'accent': 'Israeli accent', 'count': 1, 'accent_predetermined': False}, 10100: {'accent': 'International Indian Accent', 'count': 1, 'accent_predetermined': False}, 1723: {'accent': 'International English', 'count': 1, 'accent_predetermined': False}, 11711: {'accent': 
'Indonesian English', 'count': 1, 'accent_predetermined': False}, 200: {'accent': 'Indo-Canadian English', 'count': 1, 'accent_predetermined': False}, 2703: {'accent': 'Indian with a tinge of an RP accent', 'count': 1, 'accent_predetermined': False}, 12057: {'accent': 'I think', 'count': 1, 'accent_predetermined': False}, 1011: {'accent': 'I am German and speak English as learned at school', 'count': 1, 'accent_predetermined': False}, 7714: {'accent': 'Hmong-American', 'count': 1, 'accent_predetermined': False}, 142: {'accent': 'Hispanic/Latino', 'count': 1, 'accent_predetermined': False}, 2773: {'accent': 'Greek', 'count': 1, 'accent_predetermined': False}, 177: {'accent': 'Georgian English', 'count': 1, 'accent_predetermined': False}, 4839: {'accent': 'French mid level accent', 'count': 1, 'accent_predetermined': False}, 129: {'accent': 'Finnish', 'count': 1, 'accent_predetermined': False}, 486: {'accent': 'European English', 'count': 1, 'accent_predetermined': False}, 1033: {'accent': 'English with Swiss german accent', 'count': 1, 'accent_predetermined': False}, 4096: {'accent': 'English north of England ', 'count': 1, 'accent_predetermined': False}, 445: {'accent': 'English County Durham', 'count': 1, 'accent_predetermined': False}, 1055: {'accent': 'England non-native', 'count': 1, 'accent_predetermined': False}, 5427: {'accent': 'England ', 'count': 1, 'accent_predetermined': False}, 5305: {'accent': 'Educated Australian Accent', 'count': 1, 'accent_predetermined': False}, 1076: {'accent': 'Educated', 'count': 1, 'accent_predetermined': False}, 7774: {'accent': 'Eastern European', 'count': 1, 'accent_predetermined': False}, 3389: {'accent': 'East Ukrainian ', 'count': 1, 'accent_predetermined': False}, 1252: {'accent': 'East London ', 'count': 1, 'accent_predetermined': False}, 4383: {'accent': 'East Indian', 'count': 1, 'accent_predetermined': False}, 12004: {'accent': 'East European', 'count': 1, 'accent_predetermined': False}, 13684: {'accent': 'East 
African Khoja', 'count': 1, 'accent_predetermined': False}, 7508: {'accent': 'ESL', 'count': 1, 'accent_predetermined': False}, 14742: {'accent': 'Demure', 'count': 1, 'accent_predetermined': False}, 7962: {'accent': 'Cool', 'count': 1, 'accent_predetermined': False}, 12555: {'accent': 'Conversational', 'count': 1, 'accent_predetermined': False}, 2455: {'accent': 'Colombian Accent', 'count': 1, 'accent_predetermined': False}, 31: {'accent': 'Colombia', 'count': 1, 'accent_predetermined': False}, 313: {'accent': 'Chinese', 'count': 1, 'accent_predetermined': False}, 7029: {'accent': 'Chicano English', 'count': 1, 'accent_predetermined': False}, 6398: {'accent': 'Chicago ', 'count': 1, 'accent_predetermined': False}, 474: {'accent': 'Catalan', 'count': 1, 'accent_predetermined': False}, 11459: {'accent': 'Californian Accent', 'count': 1, 'accent_predetermined': False}, 276: {'accent': 'California', 'count': 1, 'accent_predetermined': False}, 12958: {'accent': 'CARIBBEAN AND BRITISH MIXED WITH SOME NEW YORK ACCENTS', 'count': 1, 'accent_predetermined': False}, 1577: {'accent': 'British English with a little bit of Russian', 'count': 1, 'accent_predetermined': False}, 1237: {'accent': 'Basic', 'count': 1, 'accent_predetermined': False}, 13084: {'accent': 'Bangladeshi English', 'count': 1, 'accent_predetermined': False}, 3085: {'accent': 'Bangladeshi', 'count': 1, 'accent_predetermined': False}, 3086: {'accent': 'Bangladesh English', 'count': 1, 'accent_predetermined': False}, 89: {'accent': 'Argentinian English', 'count': 1, 'accent_predetermined': False}, 6472: {'accent': 'American Midwest', 'count': 1, 'accent_predetermined': False}, 13091: {'accent': 'Alemannic German Accent', 'count': 1, 'accent_predetermined': False}, 2833: {'accent': 'Afrikaans English', 'count': 1, 'accent_predetermined': False}, 1135: {'accent': 'Adjustable', 'count': 1, 'accent_predetermined': False}, 103: {'accent': "A'lo", 'count': 1, 'accent_predetermined': False}, 7271: {'accent': 'A 
variety of Texan English with some German influence that has undergone the cot-caught merger', 'count': 1, 'accent_predetermined': False}, 6616: {'accent': '90% Pennsylvanian accent', 'count': 1, 'accent_predetermined': False}, 4319: {'accent': '4 years in Spain and Germany', 'count': 1, 'accent_predetermined': False}, 9948: {'accent': '2nd Language ', 'count': 1, 'accent_predetermined': False}, 6617: {'accent': '10% Chinese accent', 'count': 1, 'accent_predetermined': False}, 7030: {'accent': '"Valley Girl" English', 'count': 1, 'accent_predetermined': False}}

"""

accent_category_descriptors = [('Country descriptor'), 
                               ('Supranational region descriptor'),
                               ('Subnational region descriptor'),
                               ('City descriptor'),
                               ('Ln descriptor'),
                               ('Vocal quality descriptor'),
                               ('Accent strength descriptor'),
                               ('Rhoticity'),
                               ('Phoneme changes'),
                               ('Inflection changes'),
                               ('Register'), 
                               ('Accent effects due to physical changes'),
                               ('Unknown or other'),
                              
                              
                              ]

accent_category_rules = [
    
    
    
                        ## COUNTRY DESCRIPTORS 
    
                         ('German English', 'Country descriptor' ),
                         ('German', 'Country descriptor' ),
                         ('German Accent', 'Country descriptor' ),
                         ('speak some German', 'Country descriptor' ),
                         
                         
                         
                         ('Russian', 'Country descriptor' ),
                         ('Ukrainian', 'Country descriptor' ),
    
                         ('Northern Irish', 'Country descriptor' ),
                         ('Scottish', 'Country descriptor' ),
    
    
                         ('polish', 'Country descriptor' ),
                         ('polish accent', 'Country descriptor' ),
                         ('Polish. Have lived in nine states.', 'Country descriptor' ),
                         ('Polish English', 'Country descriptor' ),
                            
    
    
    
    
    
    
                         ('Slovak', 'Country descriptor' ),
                         
                         ('french accent', 'Country descriptor' ),
                         ('minor french accent', 'Country descriptor' ),
                         
                         ('Spanish bilingual', 'Country descriptor' ),
                         
                         
                         
                         ('Swedish accent', 'Country descriptor' ),
                         ('Swedish English', 'Country descriptor' ),
                         
                         
                         
                         
                         
                         ('england', 'Country descriptor' ),
                         
                         
                         ('Thai', 'Country descriptor' ),
                         ('Romanian', 'Country descriptor' ),
                         ('Spanish', 'Country descriptor' ),
                         ('Norwegian', 'Country descriptor' ),
                         
                         ('Japanese English', 'Country descriptor' ),
                         ('Spoke Chinese when little', 'Country descriptor' ),
                         
                         
                         ('Dutch English', 'Country descriptor' ),
                         ('Dutch', 'Country descriptor' ),
                         ('British', 'Country descriptor' ),
                         ('Austrian', 'Country descriptor' ),
                         ('Выраженный украинский акцент', 'Country descriptor' ),
                         ('serbian', 'Country descriptor' ),
    
    
                         ('nigeria english', 'Country descriptor' ),
                         ('South African English', 'Country descriptor' ),
    
                         
                         ('bangladesh', 'Country descriptor' ),
                         
                         
                         
                         
                         ## SUBNATIONAL REGION DESCRIPTORS 
                         
                         ('Midwestern', 'Subnational region descriptor' ),
                         ('Midwestern United States English', 'Subnational region descriptor' ),
                         ('midwestern US', 'Subnational region descriptor' ),
                         ('midwest', 'Subnational region descriptor' ),
                         ('Upper Midwestern', 'Subnational region descriptor' ),
                         ('Unite States Midwest', 'Subnational region descriptor' ),
                         ('Southern Appalachian English', 'Subnational region descriptor' ),
    
    
                         
                                                 
                         
                         ('southern United States', 'Subnational region descriptor' ),
                         ('Southern United States English', 'Subnational region descriptor' ),
                         ('Southwestern United States English', 'Subnational region descriptor' ),
                         ('Southern Texas Accent', 'Subnational region descriptor' ),
                         ('South Texas', 'Subnational region descriptor' ),
    
    
                         
                
                         ('slighty Southern affected by decades in the Midwest', 'Subnational region descriptor' ),
                         ('northern cali', 'Subnational region descriptor' ),
                         ('Southern Californian', 'Subnational region descriptor' ),
                         ('Silicon Valley Native', 'Subnational region descriptor' ),
    
                         
                         ('new england/east coast', 'Subnational region descriptor' ),
                         ('Philadelphia Style United States English', 'Subnational region descriptor' ),
                         ('Philadelphia', 'Subnational region descriptor' ),
                         ('Pennsylvania', 'Subnational region descriptor' ),
    
    
    
                            
    
                         ('United States English Pacific Northwest', 'Subnational region descriptor' ),
                         
                         
                         
                         
                         
                         ('yorkshire', 'Subnational region descriptor' ),
                         ('Northern English', 'Subnational region descriptor' ),
                         ('sussex', 'Subnational region descriptor' ),
                         ('southern english', 'Subnational region descriptor' ),
                         ('Southern England', 'Subnational region descriptor' ),
                         ('South London and Essex', 'Subnational region descriptor' ),
    
    
                         
                         
                         ('southern', 'Subnational region descriptor' ),
                         ('south-west German', 'Subnational region descriptor' ),
                         ('south German / Swiss accent', 'Subnational region descriptor' ),
                         ('South German accent', 'Subnational region descriptor' ),
    
                         
                         ('With heavy Cantonese accent', 'Subnational region descriptor'), 
                         
                         ('West Indian', 'Subnational region descriptor' ),
    
                         
                         
                         ## CITY DESCRIPTORS 
                         
                         ('slight Brooklyn Accent', 'City descriptor'), 
                         ('london', 'City descriptor'), 
                         ('Sydney - middle eastern seaboard Australian', 'City descriptor'), 
                         ('South London', 'City descriptor'), 
    
    
                         
                         
                         ## SUPRANATIONAL DESCRIPTORS 
                                   
                         ('Slavic', 'Supranational region descriptor' ),
                         ('European', 'Supranational region descriptor' ),
                         ('Latino', 'Supranational region descriptor' ),
                         ('little latino', 'Supranational region descriptor' ),
                         ('Eastern European English', 'Supranational region descriptor' ),
                         ('eastern europe', 'Supranational region descriptor' ),
                         
                         
                         ('wolof', 'Supranational region descriptor' ),
                         ('West African ', 'Supranational region descriptor' ),
                         
                         
                         ## L1 OR L2 NATIVE / FLUENCY DESCRIPTORS 
                         
                         ('Non-native', 'Ln descriptor' ),
                         ('Foreign', 'Ln descriptor' ),
                         ('second language', 'Ln descriptor' ),
                         ('fluent', 'Ln descriptor'),
                         
                         
                         ## VOCAL QUALITY DESCRIPTORS 
                         ('sultry', 'Vocal quality descriptor'),
                         ('Slightly effeminate', 'Vocal quality descriptor'),
    
    
    
                            
                         ## SPECIFIC PHONETIC CHANGES 
                          
                         ("pronounced r's",'Rhoticity'),
                          
                         ('pin/pen merger', 'Phoneme changes'),
                         ('heavy consonants', 'Phoneme changes'),
                    
                         ('Slight lisp', 'Phoneme changes'),
    
                         
                          
                         ('mostly affecting inflection', 'Inflection changes'),
                         
                         ('i have some pronunciation issues because of oral surgery and a hidden southern accent', 'Accent effects due to physical changes'),
                          
                          ## SPOKEN REGISTER
                          
                          ('formal', 'Register'),
                          ('academic', 'Register'),
                          ('Urban', 'Register'),
                          ('United States English. people say I sound like a surffer dude.', 'Register'),
                          
                          
                          
                          
                         ## NOT SURE HOW TO GROUP THESE ONES YET 
                         
                         ('try to maintain originality' , 'Unknown or other' ),
                         ('non regional' , 'Unknown or other' ),
                         ('little bit classy little bit sassy and add some city.....thats me', 'Unknown or other'),
                         ('international', 'Unknown or other'),
                         ('Variable', 'Unknown or other'),
                          ('Transnational englishes blend', 'Unknown or other'),
                          ('Patois', 'Unknown or other'),
                          
                          
                          
                          
                          
                         
                         
                         
                       


SyntaxError: unexpected EOF while parsing (4241754722.py, line 327)

In [None]:
## Creating linkages between the individual accents and how they are represented in the data. 
## What I want to do here is create a data structure that has the ID of the accent and something to describe the edge: 
## 
## The data structure I think will work here is: 
## 
## { 99: (123, 456, 789)} 
## to represent each combination of accents 

## The data structures we are using are: 
## 
## all_accents_sorted_by_count - this is a Dict of all the *individual accents* with a count and predetermined status
## english_accents_list - this is a *string* of accents, comma separated, that requires splitting into individual accents
## 
## what we want to do is go through each list, 
## and find the ID number of the accent 
## from the Dict, 
## then build another data structure that represents the row, and the accents in it. 

## accent list is a String 

accent_nodes = {}
i = 0

for accent_list in english_accents_list:
    
    # this regex is from 
    # https://stackoverflow.com/questions/26633452/how-to-split-by-commas-that-are-not-within-parentheses
    # (raw string, so that \s and \) reach the regex engine without invalid-escape warnings)
    accent_list = re.split(r',\s*(?![^()]*\))', accent_list[0])
    
    print('number of items in accent_list is: ', len(accent_list))
    
    
    # initialise the list first 
    accent_nodes[i] = []
    
    for accent_list_item in accent_list: 
        #print('now processing', accent_list_item)
        
        node_cnt = 0
        
        for accent in all_accents_sorted_by_count.items(): 
            
            if (i%ratio_display ==0): # only show every ratio_display-th row 
                print('---')
                print ('now looking at row: ', accent_list, 'and accent list item: ', accent_list_item, ' and accent: ', accent)
        
            if (accent_list_item == accent[1]['accent']): ## match 
                
                if (i%ratio_display ==0): # only show every ratio_display-th row 
                    print('---')
                    print ('match!')
                    
                print(accent[0])
                print('i is: ', i, ' and node_cnt is: ', node_cnt)
                
                accent_nodes[i].append(accent[0]) # we want the accent ID number
                node_cnt +=1
                
     
    i +=1 
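
The nested loop above compares every split accent against every entry in `all_accents_sorted_by_count`. The same lookup could be sketched with a reverse index that is built once. This is a minimal sketch, not the notebook's method: it assumes, as the loop does, that each value in `all_accents_sorted_by_count` is a dict with an `'accent'` key, that each row of `english_accents_list` holds the comma-separated string at index 0, and that accent text is unique across IDs. The sample data and the names `split_accents`, `SPLIT_PATTERN` and `accent_to_id` are hypothetical.

```python
import re

# Hypothetical sample data, shaped like the structures used above
all_accents_sorted_by_count = {
    0: {'accent': 'United States English', 'count': 3},
    1: {'accent': 'England English', 'count': 2},
    2: {'accent': 'Midwestern (rural, Ohio)', 'count': 1},
}
english_accents_list = [
    ['United States English, Midwestern (rural, Ohio)'],
    ['England English'],
    ['United States English,England English'],
]

# Split on commas that are not inside parentheses (same regex as above)
SPLIT_PATTERN = re.compile(r',\s*(?![^()]*\))')


def split_accents(accent_string):
    """Return the individual accent descriptors in a comma-separated string."""
    return SPLIT_PATTERN.split(accent_string)


# Reverse index: accent text -> accent ID, built once instead of once per item
accent_to_id = {entry['accent']: accent_id
                for accent_id, entry in all_accents_sorted_by_count.items()}

# Row number -> list of accent IDs found in that row
accent_nodes = {}
for row_number, row in enumerate(english_accents_list):
    accent_nodes[row_number] = [accent_to_id[item]
                                for item in split_accents(row[0])
                                if item in accent_to_id]

print(accent_nodes)
```

A descriptor that contains a comma inside parentheses stays in one piece, so it can still be matched against the accent text in the Dict.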
                


                


In [None]:
print(accent_nodes)
print(accent_nodes[0])
print(len(accent_nodes[0]))

In [None]:
print(english_accents_list[27])
print(all_accents[2])
print(all_accents[37])
print(all_accents[200])

print(english_accents_list[156])
print(all_accents[133])
print(all_accents[257])


## This is a cross-check: print a row of english_accents_list, 
## then print the all_accents entries for the ID numbers stored against that row in accent_nodes. 
## The ID numbers should resolve to the accents named in the row. 
## I've done this for a couple of rows just to make sure, e.g.

## 27: [2, 37, 200]
## 156: [133, 257]



## Now that we have node data, we can start to build a picture of the nodes of accents - which accents are more commonly linked to each other? 
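
Before building links, the spot check above could be run over every row rather than a couple. The following is a minimal sketch under stated assumptions: `all_accents` is taken to map an ID to a dict with an `'accent'` key, and each row of `english_accents_list` is taken to hold the raw accent string at index 0. The sample data and the helper `check_row` are hypothetical.

```python
# Hypothetical sample data: ID -> accent entry, one raw row, and its node list
all_accents = {
    2: {'accent': 'United States English'},
    37: {'accent': 'Midwestern'},
    200: {'accent': 'pin/pen merger'},
}
english_accents_list = [['United States English, Midwestern, pin/pen merger']]
accent_nodes = {0: [2, 37, 200]}


def check_row(row_number):
    """Return the IDs in a row whose accent text is missing from the raw string."""
    raw = english_accents_list[row_number][0]
    return [accent_id for accent_id in accent_nodes[row_number]
            if all_accents[accent_id]['accent'] not in raw]


# Collect only the rows that have at least one unresolved ID
mismatches = {}
for row_number in accent_nodes:
    unresolved = check_row(row_number)
    if unresolved:
        mismatches[row_number] = unresolved

# An empty dict means every ID resolved to text found in its row
print(mismatches)
```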

In [None]:
## what we want to do here is loop through the accent_nodes Dict 
## and create another Dict that we can use to create Links between the Nodes (which are accents)

accent_links = {} 
accent_link_id = 0

print(len(accent_nodes)) 

# we want to create a new structure, as we'll be popping elements off, and we want to leave accent_nodes untouched

for row_id, node_ids in accent_nodes.items(): 
    print('accent list is: ', (row_id, node_ids))
    
    # work on a copy, so that popping elements off leaves accent_nodes untouched 
    remaining = list(node_ids)
    
    # we only want to create links if the row has more than one accent 
    while len(remaining) > 1: 
        
        popped_element = remaining.pop(0) # remove the first element 
        print('popped element is :', popped_element)
        
        # create links between the popped element and all the remaining elements in the list 
        for accent_id in remaining: 
            
            # each link gets its own ID, so that earlier links are not overwritten 
            accent_links[accent_link_id] = {
                'source': popped_element,
                'target': accent_id,
                'value': 1, # we can alter this later
            }
            
            print(accent_links[accent_link_id])
            accent_link_id += 1
            
    print('end of while loop')
    print('\n')
       




In [None]:
print(accent_links)
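
The cell above records each link with a `value` of 1, to be altered later. One way the value might be altered is to count how many rows each pair of accents appears in together, which speaks to the question of which accents are more commonly linked. The sketch below is only an illustration: it assumes links are undirected (so `(37, 2)` and `(2, 37)` are the same pair), and the sample rows and the name `pair_counts` are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical rows of accent IDs, shaped like accent_nodes
accent_nodes = {0: [2, 37, 200], 1: [133, 257], 2: [37, 2], 3: [5]}

# Count each unordered pair once per row.
# Sorting the IDs makes (37, 2) and (2, 37) count as the same pair.
pair_counts = Counter()
for node_ids in accent_nodes.values():
    for source, target in combinations(sorted(set(node_ids)), 2):
        pair_counts[(source, target)] += 1

# One link per distinct pair; 'value' is the number of rows the pair shares
accent_links = [{'source': source, 'target': target, 'value': count}
                for (source, target), count in sorted(pair_counts.items())]

for link in accent_links:
    print(link)
```

Rows with a single accent, such as row 3 in the sample, produce no links.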

In [None]:
# Output the dict to JSON 

with open("all_accents.json", "w") as outfile: 
    json.dump(all_accents_sorted_by_count, outfile)
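
One thing to be aware of when reading the JSON file back in: JSON object keys are always strings, so integer accent IDs are written as `"0"`, `"1"` and so on. The sketch below shows the round trip on hypothetical sample data shaped like `all_accents_sorted_by_count`; the names `serialised`, `reloaded` and `restored` are illustrative only.

```python
import json

# Hypothetical sample shaped like all_accents_sorted_by_count
all_accents_sorted_by_count = {
    0: {'accent': 'United States English', 'count': 3},
    1: {'accent': 'Выраженный украинский акцент', 'count': 1},
}

# ensure_ascii=False keeps non-ASCII descriptors readable in the output
serialised = json.dumps(all_accents_sorted_by_count, ensure_ascii=False, indent=2)

# The integer IDs come back as string keys
reloaded = json.loads(serialised)
print(list(reloaded.keys()))

# Convert the keys back to integers to restore the original structure
restored = {int(accent_id): entry for accent_id, entry in reloaded.items()}
print(restored == all_accents_sorted_by_count)
```

If `ensure_ascii=False` is used when writing to a file, it is safest to open the file with `encoding="utf-8"`.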
