# Full Data Processing
Dastan Abdulla  
Ling 1340: Data Science for Linguists  
04/08/2024  

## About this Notebook: 
The goal of this notebook is to process the Speech Accent Archive dataset. By checking the integrity of the transcription files and the audio files, we create the custom data frame that will be used in the subsequent notebooks.

## Table of Contents
- [Status Update on the Dataset](#status-update-on-the-dataset)
- [About the Complete Speech Accent Archive](#about-the-complete-speech-accent-archive)
  - [Credit](#credit)
- [Initial Data Clean Up](#initial-data-clean-up)
- [Processing the Transcription files](#processing-the-transcription-files)
- [Processing the Audio Files](#processing-the-audio-files)
- [For Progress Report 3: Checking for Errors in the Data Frame](#for-progress-report-3:-checking-for-errors-in-the-data-frame)
  - [Column: `native_language`](#column:-native_language)
  - [Column: `learning_style`](#column:-learning_style)
  - [Column: `countries`](#column:-countries)
- [Saving the Newly Processed Data Frame With Transcriptions](#saving-the-newly-processed-data-frame-with-transcriptions)


## Status Update on the Dataset
* Since the first progress report, I reached out to Dr. Weinberger to request access to Unicode-encoded transcription files. In response, Dr. Weinberger generously gave me access to a Dropbox folder containing not only the transcription and `.wav` files, but also an Excel file of speakers with more detailed geographic information than the Kaggle dataset.
* From this point onwards, I will use the data Dr. Weinberger provided and carry out my data processing on the contents of that Dropbox folder.

## About the Complete Speech Accent Archive
All the speakers in the archive repeated the following passage:  
```
Please call Stella.  Ask her to bring these things with her from the store:  Six spoons of fresh snow peas, five thick slabs of blue cheese, and maybe a snack for her brother Bob.  We also need a small plastic snake and a big toy frog for the kids.  She can scoop these things into three red bags, and we will go meet her Wednesday at the train station.
```
Here is the column information and comments for each in `speakers.xlsx`:
* `speakerid`: Unique speaker ID.
* `speaker`: The purpose of this column is unclear, and it seems irrelevant to this analysis.
* `native_language`: The speaker's native language, as also seen in the Kaggle dataset.
* `alternative_native_language`: Other names or labels for the speaker's native language; this is not included in the Kaggle dataset.
* `city`: The city where the speaker is from, which is not included in the Kaggle dataset.
* `state_or_province`: The speaker's state or province, where recorded (not limited to the U.S.; the samples below include entries such as `scotland` and `shanxi`).
* `country`: The country the speaker is from, also mentioned in the Kaggle dataset.
* `age`: The age of the speakers, as included in the Kaggle dataset.
* `gender`: The speaker's biological sex, as included in the Kaggle dataset.
* `onset_age`: The age at which the speaker learned English, also mentioned in the Kaggle dataset.
* `english_residence`: The country(ies) where the speaker learned English.
* `length_of_residence`: The duration the speaker spent in the location(s) where they learned English; this is not included in the Kaggle dataset.
* `learning_style`: The method by which the speaker learned English.
* `speech_sample`: The `.mp3` file containing the speaker's sample.
* `phonetic_transcription`: The `.gif` file containing the speaker's sample transcription. There is also a parallel `.rtf` directory containing text-encoded transcriptions of the samples.
* `map`: This column seems to contain `.gif` filename entries showing geographic locations of the speakers.
* `ethnologue_language_code`: The three-letter code identifying the language of the speaker according to the ISO 639-3 standard.
* `notes`: Additional notes about the speaker or other relevant information.

### Credit

* Weinberger, Steven. (2015). Speech Accent Archive. George Mason University. Retrieved from http://accent.gmu.edu


In [97]:
# Imports
import pandas as pd
import numpy as np
# For files
import os
# For plotting
import seaborn as sns
import matplotlib.pyplot as plt
# For checking the language and country names against a standardized database
import pycountry

## Initial Data Clean Up

In [98]:
saa_df = pd.read_excel("../data/speakers.xlsx")
saa_df.sample(10)

Unnamed: 0,speakerid,speaker,native_language,alternative_native_language,city,state_or_province,country,age,gender,onset_age,english_residence,length_of_residence,learning_style,speech_sample,phonetic_transcription,map,ethnologue_language_code,notes
1349,1350,402,english,,canberra,,australia,26.0,female,0.0,australia,26.0,naturalistic,english402.mp3,notyet.gif,canberra.gif,eng,2010-09-13 00:00:00
1750,1750,39,russian,,brooklyn,new york,usa,25.0,female,5.0,usa,25.0,naturalistic,russian39.mp3,notyet.gif,brooklyn.gif,rus,2013-06-30 00:00:00
886,885,18,cantonese,chinese,hong kong,,china,19.0,male,3.0,,0.0,academic,cantonese18.mp3,cantonese18.gif,hongkong.gif,yue,2008-04-19 00:00:00
1958,1959,539,english,,orange county,california,usa,19.0,female,0.0,usa,19.0,naturalistic,english539.mp3,notyet.gif,orange_county.gif,eng,2014-04-29 00:00:00
1109,1109,26,portuguese,,campo grande,,brazil,25.0,female,10.0,usa,6.0,academic,portuguese26.mp3,portuguese26.gif,campogrande.gif,por,2009-03-07 00:00:00
1050,1050,7,hausa,,sokoto,,nigeria,52.0,male,7.0,"nigeria, usa",52.0,academic,hausa7.mp3,hausa7.gif,sokoto.gif,hau,"25 november 2008. residence: 20 years nigeria,..."
1426,1427,39,french,,pezenas,,france,28.0,male,13.0,,0.0,academic,french39.mp3,french39.gif,pezenas.gif,fra,2011-03-12 00:00:00
958,957,28,french,,port-au-prince,,haiti,35.0,female,11.0,usa,10.0,naturalistic,french28.mp3,french28.gif,portauprince.gif,fra,2008-08-28 00:00:00
230,231,2,khmer,,phnom penh,,cambodia,19.0,male,16.0,usa,4.0,academic,khmer2.mp3,khmer2.gif,phnompenh.gif,khm,
2131,2134,51,korean,,seoul,,south korea,25.0,male,16.0,usa,10.0,academic,korean51.mp3,notyet.gif,seoul.gif,kor,20 nov. 2015. LING306


In [99]:
# Let's drop the columns that we do not need
saa_df = saa_df.drop(["speaker", "alternative_native_language","map", "notes"], axis=1)
saa_df.sample(10)

Unnamed: 0,speakerid,native_language,city,state_or_province,country,age,gender,onset_age,english_residence,length_of_residence,learning_style,speech_sample,phonetic_transcription,ethnologue_language_code
500,500,french,bordeaux,,france,31.0,male,10.0,ireland,1.0,academic,french10.mp3,french10.gif,fra
158,159,english,pensacola,florida,usa,55.0,female,0.0,usa,55.0,naturalistic,english94.mp3,english94.gif,eng
2785,2791,spanish,cochabamba,,bolivia,24.0,female,13.0,usa,10.0,academic,spanish215.mp3,notyet.gif,spa
308,309,russian,moscow,,russia,23.0,male,7.0,,0.0,academic,russian9.mp3,russian9.gif,rus
676,676,german,innsbruck,,austria,35.0,male,11.0,usa,8.0,academic,german15.mp3,german15.gif,deu
1388,1389,portuguese,sao paulo,,brazil,22.0,female,9.0,usa,0.3,academic,portuguese35.mp3,portuguese35.gif,por
2134,2137,russian,fergana,,uzbekistan,29.0,female,6.0,usa,21.0,academic,russian48.mp3,notyet.gif,rus
3008,3014,spanish,san juan,,puerto rico,55.0,male,5.0,usa,27.0,academic,spanish237.mp3,notyet.gif,spa
2540,2545,thai,mae suai,chiang rai,thailand,26.0,female,7.0,usa,0.5,academic,thai19.mp3,notyet.gif,tha
2908,2914,arabic,riyadh,,saudi arabia,29.0,male,18.0,usa,7.0,academic,arabic193.mp3,notyet.gif,ars


In [100]:
saa_df.describe()

Unnamed: 0,speakerid,age,onset_age,length_of_residence
count,3031.0,3031.0,3031.0,3031.0
mean,1518.65358,32.577862,9.054108,13.765856
std,876.68105,14.171028,8.005611,16.604457
min,1.0,0.0,0.0,0.0
25%,760.5,22.0,3.0,1.0
50%,1519.0,27.0,9.0,7.0
75%,2277.5,40.0,13.0,22.0
max,3036.0,97.0,86.0,93.0


In [101]:
saa_df[saa_df["age"] <= 5]

Unnamed: 0,speakerid,native_language,city,state_or_province,country,age,gender,onset_age,english_residence,length_of_residence,learning_style,speech_sample,phonetic_transcription,ethnologue_language_code
354,355,synthesized,,,,0.0,male,0.0,mac system 8.5,0.0,,synthesized1.mp3,notyet.gif,
355,356,synthesized,,,,0.0,female,0.0,mac system 8.5,0.0,,synthesized2.mp3,notyet.gif,
356,357,synthesized,,,,0.0,female,0.0,mac system 8.5,0.0,,synthesized3.mp3,notyet.gif,
357,358,synthesized,,,,0.0,male,0.0,mac system 8.5,0.0,,synthesized4.mp3,notyet.gif,


* Similar to the initial analysis, we still have those synthesized samples in the data to provide a baseline.

* Since we know that the `.rtf` files in the Dropbox directory are parallel to the `.gif` files, we should be able to rename the entries in the phonetic transcription column so that they reference the `.rtf` files in the transcriptions directory in data.

**CHANGED FOR PROGRESS REPORT 3**  
As instructed, I wrote a bash script to convert all the `.rtf` files to plain text so that the files get parsed correctly. So instead of converting the `.gif` names to `.rtf`, we will convert them to `.txt`.

In [102]:
# regex=False so that the '.' is matched literally rather than as a regex wildcard
saa_df['phonetic_transcription'] = saa_df['phonetic_transcription'].str.replace('.gif', '.txt', regex=False)
saa_df.head()

Unnamed: 0,speakerid,native_language,city,state_or_province,country,age,gender,onset_age,english_residence,length_of_residence,learning_style,speech_sample,phonetic_transcription,ethnologue_language_code
0,1,afrikaans,virginia,,south africa,27.0,female,9.0,usa,0.5,academic,afrikaans1.mp3,afrikaans1.txt,afr
1,2,afrikaans,pretoria,,south africa,40.0,male,5.0,usa,10.0,academic,afrikaans2.mp3,afrikaans2.txt,afr
2,3,agni,diekabo,,ivory coast,25.0,male,15.0,usa,1.2,academic,agni1.mp3,agny1.txt,any
3,4,albanian,prishtina,,kosovo,19.0,male,6.0,usa,3.0,naturalistic,albanian1.mp3,albanian1.txt,als
4,5,albanian,tirana,,albania,33.0,male,15.0,usa,0.04,naturalistic,albanian2.mp3,albanian2.txt,aln


* The `phonetic_transcription` column has a default string value of `notyet.gif` (`notyet.txt` after the conversion above) for samples that have not yet been transcribed by the authors.
* It is best practice to represent those instances with a null value in our data frame.  
**CHANGED FOR PROGRESS REPORT 3**  
Again, just using `txt` instead of `rtf`.

In [103]:
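# np.nan is used as the null marker (Series.replace with value=None has not behaved consistently across pandas versions)
saa_df['phonetic_transcription'] = saa_df['phonetic_transcription'].replace('notyet.txt', np.nan)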
saa_df.sample(10)

Unnamed: 0,speakerid,native_language,city,state_or_province,country,age,gender,onset_age,english_residence,length_of_residence,learning_style,speech_sample,phonetic_transcription,ethnologue_language_code
2554,2559,mandarin,xi'an,shanxi,china,23.0,female,4.0,australia,0.5,academic,mandarin109.mp3,,cmn
501,501,german,berlin,,germany,20.0,male,11.0,usa,1.0,academic,german7.mp3,german7.txt,deu
2345,2348,mandarin,zhengzhou,henan,china,29.0,male,12.0,usa,5.0,academic,mandarin76.mp3,,cmn
143,144,english,glasgow,scotland,uk,34.0,male,0.0,"uk, canada",34.0,naturalistic,english80.mp3,english80.txt,eng
2122,2124,english,painesville,ohio,usa,46.0,male,0.0,usa,46.0,naturalistic,english573.mp3,,eng
1705,1705,dutch,rotterdam,,netherlands,21.0,male,11.0,usa,0.5,academic,dutch43.mp3,,nld
3012,3018,italian,ravenna,,italy,19.0,female,3.0,"italy, usa",7.0,academic,italian40.mp3,,ita
1482,1483,polish,bialystok,,poland,18.0,male,5.0,,0.0,academic,polish18.mp3,polish18.txt,pol
174,175,farsi,tehran,,iran,29.0,female,11.0,usa,1.75,academic,farsi8.mp3,farsi8.txt,pes
2817,2823,farsi,tehran,,iran,53.0,female,43.0,usa,10.0,academic,farsi34.mp3,,pes


* Now let's check whether the transcriptions directory and the phonetic transcription column match up.  
**CHANGED FOR PROGRESS REPORT 3**  
Using `transcriptions_text` instead of `transcriptions`.

In [104]:
directory = "../data/transcriptions_text/"

for transcription in saa_df[saa_df['phonetic_transcription'].notna()]['phonetic_transcription']:
    file_path = os.path.join(directory, transcription)
    if not os.path.exists(file_path):
        print(f"The file '{transcription}' does not exist.")

The file 'czech5.txt' does not exist.
The file 'not' does not exist.
The file 'mandarin42.txt' does not exist.
The file 'arabic195.txt' does not exist.
The file 'portuguese68.txt' does not exist.


* `polish4.rtf` and `albanian3.rtf` exist, but they are actually plain-text files, so we will handle them later when we read the files.
* The `not` entry is simply a typo: after looking at the csv, it was supposed to be `notyet.gif`, so we can ignore that as well.
* The rest of the files are completely missing. The transcriptions could exist but may not have been added yet to the current version of the dataset (it gets updated on a rolling basis).
* Note: The above cell was run about 10 times to account for various discrepancies between the filenames in the column and the files present in the directory, such as:
    * Case sensitivity between the files in the directory and column files.
    * Typos in either the column files or the directory.  
* All of the typos and errors were manually addressed and corrected by me during the clean-up process to the best of my ability.
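
The kinds of mismatches listed above (case differences and small typos) can be surfaced semi-automatically. The following is a minimal sketch, not the procedure actually used for this notebook: `suggest_matches` is a hypothetical helper that compares an expected filename against a directory listing, first case-insensitively and then with fuzzy matching from the standard library's `difflib`. The example filenames and the `cutoff` value are made up for illustration.

```python
import difflib


def suggest_matches(expected, available, cutoff=0.8):
    """Return names in `available` that may correspond to `expected`.

    Hypothetical helper: tries an exact match, then a case-insensitive
    match, then falls back to fuzzy matching on the lowercased names.
    """
    if expected in available:
        return [expected]

    # Map lowercased names back to the names as they appear on disk
    by_lower = {name.lower(): name for name in available}
    if expected.lower() in by_lower:
        return [by_lower[expected.lower()]]

    # Fuzzy match to catch small typos
    close = difflib.get_close_matches(expected.lower(), list(by_lower), n=3, cutoff=cutoff)
    return [by_lower[name] for name in close]


# Illustrative directory listing (made-up names)
available = ["English12.txt", "french3.txt", "polish4.txt"]

print(suggest_matches("english12.txt", available))  # case-only mismatch
print(suggest_matches("frnch3.txt", available))     # small typo
print(suggest_matches("czech5.txt", available))     # nothing similar
```

In the notebook, `available` would come from `os.listdir(directory)`; an empty result suggests the file is genuinely missing rather than misnamed.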

In [105]:
# Changing the 'not' value
saa_df['phonetic_transcription'] = saa_df['phonetic_transcription'].replace('not', np.nan)

# Changing the 'albanian3' and 'polish4'
saa_df['phonetic_transcription'] = saa_df['phonetic_transcription'].replace('polish4.rtf', 'polish4.txt')
saa_df['phonetic_transcription'] = saa_df['phonetic_transcription'].replace('albanian3.rtf', 'albanian3.txt')

## Processing the Transcription files

In [106]:
# Just a little poking around
with open("../data/transcriptions_text/afrikaans1.txt", encoding="utf8") as f:
    content = f.read()
    print(content)


\[pʰlis kɔl stɛːlʌ ɑsk˺ ɜ tə bɹɪ̃ŋ ðiz θɪ̃ŋz̥ wɪf hɜ fɹʌ̃ɱ ðə stɔɹ siks
spunz̥ əv̥ fɹɪʃ sn̥oʊ piːs faɪf θɪk slæb̥s əv blu ʧiːz ɛn măɪbi ɜ snæk˺ foɹ̥
hɜ bɹɑɾə̆ ʔə brʌðə bɑp wi ɔlˠsŏ nid ə smɔlˠ plæstɪk sneɪk ɛn ə bɪk tʊi
fɹɔɡ̥ fɛ̆ ðə kids̥ ʃi kɛ̆n skøp ðiz θɪ̃ŋs ɪntu fɹi ɹɛd bæɡz̥ ɛn wi wɪl ɡoʊ
miːd ɜ̆ wĕnz̥d̥eɪ ɛt d̪ə tɹeɪn steɪʃən\]



* It seems like Python's string encoding can handle the phonetic tokens, so we can proceed with parsing the rest of them. *fingers crossed*  
**CHANGED FOR PROGRESS REPORT 3**  
Added `chardet`, which reads the first chunk of bytes from each file to detect its encoding, because not all the files use the same encoding.
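
For readers without `chardet`, a rough standard-library alternative is to try a short list of candidate encodings in order and keep the first one that decodes cleanly. This is only a sketch: `decode_with_fallback` is a hypothetical helper, and the candidate list is an assumption that has not been verified against the archive's files. Unlike `chardet`, it does no statistical detection.

```python
def decode_with_fallback(raw, encodings=("utf-8", "mac_roman")):
    """Decode `raw` bytes with the first encoding in `encodings` that succeeds.

    Returns a (text, encoding_used) tuple. The default candidates are
    illustrative assumptions, not a tested list.
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Last resort: decode as UTF-8, replacing undecodable bytes with U+FFFD
    return raw.decode("utf-8", errors="replace"), "utf-8 (with replacement)"


# Valid UTF-8 is decoded as UTF-8
text, used = decode_with_fallback("pʰlis kɔl".encode("utf-8"))
print(used)

# A 0xCA byte followed by ASCII is not valid UTF-8, so the next candidate is tried
text, used = decode_with_fallback(b"\xcaplease call")
print(used)
```

The order of the candidates matters, because a permissive single-byte encoding will accept almost any input; stricter encodings should come first.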

In [107]:
import chardet

dir_path = "../data/transcriptions_text/"

saa_df['transcription'] = None
for index, row in saa_df[saa_df['phonetic_transcription'].notna()].iterrows():
    file_name = row['phonetic_transcription']
    file_path = os.path.join(dir_path, file_name)
    if os.path.exists(file_path):
        # Determine the encoding from the first chunk of bytes in the file
        with open(file_path, 'rb') as file:
            raw_data = file.read(100000)
        # Fall back to UTF-8 if chardet cannot make a guess
        encoding_detected = chardet.detect(raw_data)['encoding'] or 'utf8'
        with open(file_path, encoding=encoding_detected) as file:
            file_content = file.read()
        # The actual transcriptions are always wrapped in escaped brackets (\[ ... \]), more on this later
        start_index = file_content.find(r'\[')
        end_index = file_content.find(r'\]')
        transcription = file_content[start_index:end_index+1]
        saa_df.at[index, 'transcription'] = transcription
    else:
        print(f"Transcription file {file_name} was ignored.")

Transcription file czech5.txt was ignored.
Transcription file mandarin42.txt was ignored.
Transcription file arabic195.txt was ignored.
Transcription file portuguese68.txt was ignored.


* The ignored files do not exist in the directory; they probably just haven't been added yet.
* We slice out only the bracketed part of each file because some of the transcriptions have extra characters that caused issues for the `rtf_to_text()` function, which was throwing `'utf-8' codec can't decode byte 0xca in position 0: invalid start byte`. To get around that, we simply don't consider anything before or after the brackets.
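
The bracket-slicing idea can be isolated in a small function. The sketch below is hypothetical (`extract_bracketed` is not part of this notebook) and assumes, as in the sample file printed earlier, that the transcription is delimited by a literal `\[` and `\]`. It differs from the cell above in three ways: it returns only the text between the delimiters, it collapses line breaks into single spaces, and it returns `None` when either delimiter is missing.

```python
def extract_bracketed(text, opener="\\[", closer="\\]"):
    """Return the text between the first `opener` and the next `closer`.

    Returns None if either delimiter is missing, so malformed files can be
    flagged instead of silently producing a truncated string.
    """
    start = text.find(opener)
    if start == -1:
        return None
    start += len(opener)
    end = text.find(closer, start)
    if end == -1:
        return None
    # Collapse the line breaks inside the transcription into single spaces
    return " ".join(text[start:end].split())


sample = "stray bytes \\[pʰlis kɔl\nstɛːlʌ\\] trailing text"
print(extract_bracketed(sample))         # pʰlis kɔl stɛːlʌ
print(extract_bracketed("no brackets"))  # None
```

Returning `None` for malformed input makes such files show up in the `isna()` count rather than being stored as partial strings.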

In [108]:
saa_df.transcription.isna().sum()

1762

* So when it's all said and done, we have (3031-1762) = 1269 transcribed samples to work with.

In [109]:
saa_df.sample(5)

Unnamed: 0,speakerid,native_language,city,state_or_province,country,age,gender,onset_age,english_residence,length_of_residence,learning_style,speech_sample,phonetic_transcription,ethnologue_language_code,transcription
622,621,greek,nicosia,,cyprus,51.0,male,25.0,usa,29.0,academic,greek5.mp3,greek5.txt,ell,\[pʰlis kɔl stelə æːsk əɹ tŭ bɹɪ̃ŋ d̪is θɪŋs ...
1537,1537,japanese,clark field,,philippines,25.0,female,3.0,usa,17.0,naturalistic,japanese16.mp3,,jpn,
2078,2079,spanish,la ceiba,,honduras,19.0,male,11.0,,0.0,academic,spanish148.mp3,,spa,
1940,1941,portuguese,salvador,,brazil,19.0,female,7.0,usa,12.5,academic,portuguese45.mp3,,por,
2698,2704,cantonese,fuzhou,fujian,china,26.0,female,7.0,usa,2.0,academic,cantonese32.mp3,,yue,


## Processing the Audio Files
The Dropbox folder contains directories for both `.mp3` and processed `.wav` audio files. Since the `.wav` files are not compressed, they will probably have the best quality, and thus we will use them. But we need to go through the same spiel of converting the `.mp3` column entries to `.wav`.

In [110]:
saa_df['speech_sample'] = saa_df['speech_sample'].str.replace('.mp3', '.wav', regex=False)

In [111]:
saa_df.head()

Unnamed: 0,speakerid,native_language,city,state_or_province,country,age,gender,onset_age,english_residence,length_of_residence,learning_style,speech_sample,phonetic_transcription,ethnologue_language_code,transcription
0,1,afrikaans,virginia,,south africa,27.0,female,9.0,usa,0.5,academic,afrikaans1.wav,afrikaans1.txt,afr,\[pʰlis kɔl stɛːlʌ ɑsk˺ ɜ tə bɹɪ̃ŋ ðiz θɪ̃ŋz̥ ...
1,2,afrikaans,pretoria,,south africa,40.0,male,5.0,usa,10.0,academic,afrikaans2.wav,afrikaans2.txt,afr,\[pʰliːz̥ kʰɔl stɛ̆lʌ ɔsk hɜ tŭ bɹiŋ ðiz θiŋz...
2,3,agni,diekabo,,ivory coast,25.0,male,15.0,usa,1.2,academic,agni1.wav,agny1.txt,any,\[pliz kɑl stelə æs hɚ tu bɹɪ̃ŋ viz fɪŋ wɪf hɜ...
3,4,albanian,prishtina,,kosovo,19.0,male,6.0,usa,3.0,naturalistic,albanian1.wav,albanian1.txt,als,\[p̬liz kʰɔl stɛla æs xɜɹ tu bɹɪ̃ŋ ðɪs θɪ̃ŋks ...
4,5,albanian,tirana,,albania,33.0,male,15.0,usa,0.04,naturalistic,albanian2.wav,albanian2.txt,aln,\[pliz kɔl stɛlə æsk hɛɹ tu bɹɪ̃ŋ ðɪs θɪ̃ŋs wɪ...


Before we begin checking the files, we need to standardize the naming of the files, e.g., remove all white spaces and make them all lowercase.
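
Because renaming is destructive, it can help to compute the new names first and check for collisions before touching the disk. The following is a minimal sketch under the normalization rule just described (lowercase, spaces removed); `normalize_filename` and `plan_renames` are hypothetical helpers, and the filenames are made up.

```python
from collections import defaultdict


def normalize_filename(filename):
    """Lowercase a filename and remove spaces."""
    return filename.lower().replace(" ", "")


def plan_renames(filenames):
    """Return (renames, collisions) without modifying any files.

    renames: dict mapping each name that would change to its new name.
    collisions: dict mapping a normalized name to the original names that
    would all end up with it (only entries with more than one original).
    """
    targets = defaultdict(list)
    for name in filenames:
        targets[normalize_filename(name)].append(name)

    renames = {old: new for new, olds in targets.items() for old in olds if old != new}
    collisions = {new: olds for new, olds in targets.items() if len(olds) > 1}
    return renames, collisions


renames, collisions = plan_renames(["English 12.wav", "english12.wav", "Thai3.WAV"])
print(renames)
print(collisions)
```

In the notebook, `filenames` would come from `os.listdir(directory)`. A non-empty `collisions` means two files map to the same name, in which case `os.rename` may overwrite one of them or raise an error, depending on the platform.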

In [112]:
directory = "../data/processed_wav_files/"

for filename in os.listdir(directory):
    old_path = os.path.join(directory, filename)
    
    # Convert the filename to lowercase and remove whitespace
    new_filename = filename.lower().replace(" ", "")
    new_path = os.path.join(directory, new_filename)
    
    # Rename the file to lowercase and without whitespace
    os.rename(old_path, new_path)

In [113]:
directory = "../data/processed_wav_files/"

for file in saa_df['speech_sample']:
    file_path = os.path.join(directory, file)
    if not os.path.exists(file_path):
        print(f"The file '{file}' does not exist.")

The file 'estonian15.wav' does not exist.
The file 'wu4.wav' does not exist.
The file 'arabic195.wav' does not exist.
The file 'french81.wav' does not exist.
The file 'catalan6.wav' does not exist.
The file 'thai22.wav' does not exist.
The file 'spanish230.wav' does not exist.
The file 'arabic196.wav' does not exist.
The file 'vietnamese36.wav' does not exist.
The file 'portuguese67.wav' does not exist.
The file 'japanese43.wav' does not exist.
The file 'portuguese68.wav' does not exist.
The file 'bengali21.wav' does not exist.
The file 'spanish231.wav' does not exist.
The file 'arabic197.wav' does not exist.
The file 'sundanese2.wav' does not exist.
The file 'swissgerman8.wav' does not exist.
The file 'english647.wav' does not exist.
The file 'indonesian14.wav' does not exist.
The file 'english648.wav' does not exist.
The file 'mandarin152.wav' does not exist.
The file 'bengali22.wav' does not exist.
The file 'spanish233.wav' does not exist.


After manually looking up each instance of the above list in the directory and fixing any typos or minor errors in the file names, I was able to narrow it down to those missing files. It could be that they have not been processed yet, as the `.wav` files go through processing for the dataset. However, since they are not transcribed anyway, they are not of much value to us, and thus we can proceed. 
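
The overlap between the missing audio files and the missing transcription files can be checked directly from the two lists printed in this notebook. The sketch below copies those names by hand and compares them by stem; it is only a convenience check, not part of the original processing, and `stems` is a hypothetical helper.

```python
import os

# Names copied from the outputs printed earlier in this notebook
missing_transcriptions = ["czech5.txt", "mandarin42.txt", "arabic195.txt", "portuguese68.txt"]
missing_audio = [
    "estonian15.wav", "wu4.wav", "arabic195.wav", "french81.wav", "catalan6.wav",
    "thai22.wav", "spanish230.wav", "arabic196.wav", "vietnamese36.wav",
    "portuguese67.wav", "japanese43.wav", "portuguese68.wav", "bengali21.wav",
    "spanish231.wav", "arabic197.wav", "sundanese2.wav", "swissgerman8.wav",
    "english647.wav", "indonesian14.wav", "english648.wav", "mandarin152.wav",
    "bengali22.wav", "spanish233.wav",
]


def stems(filenames):
    """Strip the extension from each filename and return the set of stems."""
    return {os.path.splitext(name)[0] for name in filenames}


# Speakers missing both their audio and their transcription file
missing_both = stems(missing_audio) & stems(missing_transcriptions)
print(sorted(missing_both))  # ['arabic195', 'portuguese68']
```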

In [114]:
saa_df.sample(5)

Unnamed: 0,speakerid,native_language,city,state_or_province,country,age,gender,onset_age,english_residence,length_of_residence,learning_style,speech_sample,phonetic_transcription,ethnologue_language_code,transcription
870,869,english,kilkenny,,ireland,24.0,male,0.0,ireland,23.0,naturalistic,english258.wav,,eng,
2593,2598,arabic,riyadh,,saudi arabia,25.0,female,5.0,usa,3.0,academic,arabic151.wav,,ars,
1108,1108,english,oxford,,uk,37.0,female,0.0,"uk, usa",22.0,naturalistic,english310.wav,,eng,
1834,1834,spanish,ilbague,,colombia,57.0,female,14.0,usa,17.0,academic,spanish123.wav,,spa,
2869,2875,vietnamese,can tho,,vietnamese,43.0,female,18.0,usa,25.0,naturalistic,vietnamese35.wav,,vie,


## For Progress Report 3: Checking for Errors in the data frame

### Column: `native_language`

In [115]:
saa_df[~saa_df['ethnologue_language_code'].isna()]['ethnologue_language_code'].unique()

array(['afr', 'any', 'als', 'aln', 'amh', 'ars', 'arz', 'aeb', 'acm',
       'apc', 'abv', 'afb', 'ary', 'ayn', 'ajp', 'hye', 'azj', 'fmp',
       'bsp', 'bca', 'bam', 'bax', 'bqe', 'ben', 'bos', 'bul', 'yue',
       'cal', 'cat', 'cha', 'ces', 'dan', 'gbz', 'nld', 'igb', 'bin',
       'eng', 'ewe', 'fak', 'aka', 'pes', 'fin', 'fra', 'fri', 'kat',
       'deu', 'ell', 'guj', 'guz', 'heb', 'hin', 'hsn', 'hun', 'ibo',
       'ind', 'ita', 'jpn', 'kan', 'kaz', 'khm', 'kir', 'swh', 'kor',
       'kri', 'ckb', 'slp', 'lao', 'lav', 'lit', 'luo', 'mkd', 'zlm',
       'mal', 'cmn', 'emk', 'mar', 'mfe', 'khk', 'mos', 'mrl', 'npi',
       'nor', 'gaz', 'pon', 'pol', 'por', 'pnb', 'pan', 'quh', 'qvh',
       'ron', 'rus', 'sdn', 'swy', 'stw', 'srp', 'scn', 'sin', 'slk',
       'som', 'spa', 'swe', 'tgl', 'tlg', 'tam', 'tat', 'tel', 'tha',
       'bod', 'tir', 'tpi', 'tur', 'urd', 'uig', 'uzn', 'uzs', 'vie',
       'wof', 'ydd', 'zul', 'mlt', 'gcf', 'yor', 'dib', 'bel', 'isl',
       'snd', 'phr',

In [116]:
# Look at the native languages that do not have an ethnologue code assigned to them
saa_df[saa_df['ethnologue_language_code'].isna()]['native_language'].unique()

array(['synthesized', 'taiwanese', 'teochew', 'hainanese', 'arabic',
       'home sign', 'min nan'], dtype=object)

- For the Min Nan varieties, this is probably due to how the file was parsed: their ethnologue code is `nan`, which pandas interpreted as `Not a Number` (NaN), i.e. a missing value. Let's fix this.
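
The `nan` collision can be reproduced, and avoided, at read time. The sketch below uses `pd.read_csv` on an in-memory string for brevity (the rows are made up); `pd.read_excel` accepts the same `keep_default_na` and `na_values` parameters. Whether this is preferable to patching the column afterwards, as done in this notebook, is a judgment call.

```python
import io

import pandas as pd

raw = "native_language,ethnologue_language_code\nmin nan,nan\nsynthesized,\nenglish,eng\n"

# Default parsing: the string "nan" is on pandas' default list of NA markers
default_df = pd.read_csv(io.StringIO(raw))
print(default_df["ethnologue_language_code"].isna().tolist())  # [True, True, False]

# Treat only empty cells as missing, so the language code "nan" survives
fixed_df = pd.read_csv(io.StringIO(raw), keep_default_na=False, na_values=[""])
print(fixed_df["ethnologue_language_code"].tolist())
```

With this approach, rows that truly lack a code (empty cells) are still read as missing, while the Min Nan rows keep their code.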

In [117]:
# The ethnologue code for arabic is 'ara'
saa_df.loc[
    (saa_df['ethnologue_language_code'].isna()) & 
    (saa_df['native_language'] == 'arabic'), 
    'ethnologue_language_code'
] = 'ara'

# 'taiwanese', 'teochew', 'hainanese', and 'min nan' 
saa_df.loc[
    (saa_df['ethnologue_language_code'].isna()) & 
    (saa_df['native_language'] != 'synthesized') & 
    (saa_df['native_language'] != 'home sign'), 
    'ethnologue_language_code'
] = 'nan'

saa_df[saa_df['ethnologue_language_code'].isna()]['native_language'].unique()

array(['synthesized', 'home sign'], dtype=object)

In [118]:
ethno_codes = list(saa_df[~saa_df['ethnologue_language_code'].isna()]['ethnologue_language_code'].unique())

In [119]:
# Check each code against pycountry's ISO 639 language lookup table
not_represented = []
for code in ethno_codes:
    try:
        pycountry.languages.lookup(code)
    except LookupError:
        not_represented.append(code)

not_represented

['bqe', 'fri', 'cit', 'cnm']

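- pycountry does not recognize those codes (they may be retired or mistyped; for example, the standard ISO 639-3 code for Mandarin Chinese is `cmn`, not `cnm`), so we can map them to language names manually.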

In [120]:
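import pycountry

# Codes that pycountry does not recognize, mapped manually to language names
non_represented = {
    'bqe':'Beembe',
    'fri':'Western Frisian',
    'cit':'Chittagonian',
    'cnm':'Mandarin Chinese'
}

def get_language_fullname(ethnologue_code):
    # Rows without a code ('synthesized', 'home sign') have no language name
    if pd.isna(ethnologue_code):
        return np.nan
    try:
        return pycountry.languages.lookup(ethnologue_code).name
    except LookupError:
        # Fall back to the manual table; NaN if the code is unknown there too
        return non_represented.get(ethnologue_code, np.nan)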

saa_df['language_name'] = saa_df['ethnologue_language_code'].apply(get_language_fullname)
saa_df['language_name'].sample(10)

1032    San Salvador Kongo
2915               English
1242               English
2935            Portuguese
2763               Spanish
2218               English
2525      Mandarin Chinese
2693      Mandarin Chinese
2276      Mandarin Chinese
2815               Amharic
Name: language_name, dtype: object

In [121]:
saa_df[saa_df['language_name'].isna()]['native_language'].unique()

array(['synthesized', 'home sign'], dtype=object)

### Column: `learning_style`

In [122]:
saa_df['learning_style'].unique()

array(['academic', 'naturalistic', nan, 'academic_x000D_naturalistic',
       'naturalisstic', 'naturalisic'], dtype=object)

In [123]:
# Let's fix those label typos in learning style
updated_learning_styles = {
        'academic_x000D_naturalistic': 'academic_naturalistic',
        'naturalisstic': 'naturalistic',
        'naturalisic': 'naturalistic'
    }
def corrected_learning_style(value):
    return updated_learning_styles.get(value, value)

saa_df['learning_style'] = saa_df['learning_style'].apply(corrected_learning_style)

### Column: `countries`

In [124]:
countries = list(saa_df[saa_df['country'].notna()]['country'].unique())
countries.sort()
print(countries)

['New York', 'afghanistan', 'albania', 'algeria', 'andorra', 'angola', 'antigua and barbuda', 'argentina', 'armenia', 'australia', 'austria', 'azerbaijan', 'bahrain', 'bangladesh', 'barbados', 'belarus', 'belgium', 'belize', 'benin', 'bolivia', 'bosnia', 'bosnia and herzegovina', 'botswana', 'brazil', 'bulgaria', 'burkina faso', 'burundi', 'cambodia', 'cameroon', 'canada', 'chad', 'chile', 'china', 'colombia', 'costa rica', 'croatia', 'cuba', 'curacao', 'cyprus', 'czech republic', 'democratic republic of congo', 'democratic republic of the congo', 'denmark', 'dominican republic', 'ecuador', 'egypt', 'el salvador', 'el salvadore', 'equatorial guinea', 'eritrea', 'estonia', 'ethiopia', 'faroe islands', 'federated states of micronesia', 'fiji', 'finland', 'france', 'gabon', 'germany', 'ghana', 'greece', 'guatemala', 'guinea', 'guyana', 'haiti', 'honduras', 'hungary', 'iceland', 'india', 'indonesia', 'indonesian', 'indonesisa', 'iran', 'iraq', 'ireland', 'isle of man', 'israel', 'israel (o

In [125]:
import pycountry

def is_valid_country(name):
    try:
        if pycountry.countries.lookup(name):
            return True
    except LookupError:
        return False

print("Non-valid countries:", [name for name in countries if not is_valid_country(name)])


Non-valid countries: ['New York', 'bosnia', 'curacao', 'democratic republic of congo', 'democratic republic of the congo', 'el salvadore', 'indonesian', 'indonesisa', 'israel (occupied territory)', 'ivory coast', 'kosovo', 'kyrgystan', 'macedonia', 'palestine', 'philippiness', 'republic of georgia', 'romanian', 'russia', 'sicily', 'the bahamas', 'tibet', 'trinidad', 'turkey', 'uk', 'us virgin islands', 'vietnamese', 'virginia']


- It looks like there are a couple of typos, instances where a city, state, or region was used instead of the country, and some outdated edge cases where countries were merged or separated. Let's try to address those.

In [126]:
updated_countries = {
        'New York':'usa',
        'bosnia':'bosnia and herzegovina',
        'curacao':'curaçao',
        'democratic republic of congo':'democratic republic of the congo',
        'el salvadore':'el salvador',
        'indonesian':'indonesia',
        'indonesisa':'indonesia',
        'israel (occupied territory)':'israel',
        'ivory coast':'Côte d\'Ivoire',
        'kyrgystan':'kyrgyzstan',
        'macedonia':'north macedonia',
        'palestine':'palestine, state of',
        'philippiness':'Philippines',
        'republic of georgia':'georgia',
        'romanian':'romania',
        'russia':'russian federation',
        'sicily':'italy',
        'the bahamas':'bahamas',
        'tibet':'china',
        'trinidad':'trinidad and tobago',
        'turkey':'türkiye',
        'uk':'united kingdom of great britain and northern ireland',
        'us virgin islands':'virgin islands, u.s.',
        'vietnamese':'vietnam',
        'virginia':'usa',
        'kosovo':'serbia'
    }
# Let's fix those typos and outdated labels in the country column
def corrected_countries(value):
    return updated_countries.get(value, value)

saa_df['country'] = saa_df['country'].apply(corrected_countries)

countries = list(saa_df[saa_df['country'].notna()]['country'].unique())
countries.sort()
print("Non-valid countries:", [name for name in countries if not is_valid_country(name)])


Non-valid countries: ['democratic republic of the congo']


In [127]:
saa_df.sample(5)

Unnamed: 0,speakerid,native_language,city,state_or_province,country,age,gender,onset_age,english_residence,length_of_residence,learning_style,speech_sample,phonetic_transcription,ethnologue_language_code,transcription,language_name
2566,2571,tamil,coimbatore,tamil nadu,india,28.0,female,3.0,usa,2.0,academic,tamil13.wav,,tam,,Tamil
2902,2908,arabic,riyadh,,saudi arabia,20.0,female,15.0,usa,6.0,academic,arabic187.wav,,ars,,Najdi Arabic
2508,2513,portuguese,sao paulo,,brazil,21.0,male,5.0,usa,3.0,academic,portuguese58.wav,,por,,Portuguese
3026,3032,turkish,spaichingen,,germany,26.0,female,10.0,,0.0,academic,turkish45.wav,,tur,,Turkish
2098,2100,spanish,barcelona,,spain,51.0,male,25.0,usa,26.0,naturalistic,spanish152.wav,,spa,,Spanish


- We also won't be looking at the city level of the data since we do not have enough samples per city to be representative of the countries, so we can drop the `city` column.
- We can also drop the `state_or_province` column since we won't be looking at United States speakers exclusively.

In [128]:
saa_df.drop(['city', 'state_or_province'], axis=1, inplace=True)
saa_df.sample(5)

Unnamed: 0,speakerid,native_language,country,age,gender,onset_age,english_residence,length_of_residence,learning_style,speech_sample,phonetic_transcription,ethnologue_language_code,transcription,language_name
173,174,farsi,iran,29.0,male,11.0,usa,2.0,academic,farsi7.wav,farsi7.txt,pes,\[pliz kɔl ə̆stɛlə æsk hʲɛɹ tu bɹɪ̃ŋ d̪iə̆s t̪...,Iranian Persian
485,522,english,usa,22.0,male,0.0,usa,22.0,naturalistic,english146.wav,english146.txt,eng,\[pʰliz̥ kʰɑlˠ stɪlə æsk hɚ ɾə bɹɪ̃ŋ n̪iz̥ θɪ̃...,English
2917,2923,english,usa,18.0,female,0.0,usa,18.0,naturalistic,english641.wav,,eng,,English
270,271,nepali,nepal,22.0,female,5.0,usa,4.0,academic,nepali1.wav,nepali1.txt,npi,\[plis kʰɑl s̪ɛ̞lʌ æks hɝ tu brĩŋ d̪iz θɪŋks ...,Nepali (individual language)
484,485,english,australia,28.0,male,0.0,australia,28.0,naturalistic,english125.wav,english125.txt,eng,\[pʰliz̥ kʰɔlˠ stɛla ask ɚ ɾə bɹɪ̃ŋ ðɪz θɪ̃ŋz ...,English


## Saving the Newly Processed Data Frame With Transcriptions

In [129]:
saa_df.to_pickle('../data/saa_df.pkl')