# Full Data Processing
Dastan Abdulla  
Ling 1340: Data Science for Linguists  
02/21/2024  

## Status Update on the Dataset
* Since the first progress report, I reached out to Dr. Weinberger to request access to Unicode-encoded transcription files. In response, Dr. Weinberger generously gave me access to a Dropbox folder containing not only the transcription and `.wav` files, but also an Excel file of speakers with more detailed geographic information than the Kaggle dataset.
* From this point onwards, I will use the data Dr. Weinberger provided and carry out my data processing on the contents of that Dropbox folder.

## About the Complete Speech Accent Archive
All the speakers in the archive repeated the following passage:  
```
Please call Stella.  Ask her to bring these things with her from the store:  Six spoons of fresh snow peas, five thick slabs of blue cheese, and maybe a snack for her brother Bob.  We also need a small plastic snake and a big toy frog for the kids.  She can scoop these things into three red bags, and we will go meet her Wednesday at the train station.
```
Here is the column information and comments for each in `speakers.xlsx`:
* `speakerid`: Unique speaker ID.
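* `speaker`: The purpose of this column is not documented. In the samples below it appears to match the number in the speaker's file name (e.g., `17` for `romanian17.mp3`), and it is not needed for this analysis.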
* `native_language`: The speaker's native language, as also seen in the Kaggle dataset.
* `alternative_native_language`: Other names or labels for the speaker's native language; this is not included in the Kaggle dataset.
* `city`: The city where the speaker is from, which is not included in the Kaggle dataset.
* `state_or_province`: The speaker's state or province. It is empty for many speakers and is not limited to the U.S.: the samples below include values such as `england` (UK) and `henan` (China).
* `country`: The country the speaker is from, also mentioned in the Kaggle dataset.
* `age`: The age of the speakers, as included in the Kaggle dataset.
* `gender`: The speaker's biological sex, as included in the Kaggle dataset.
* `onset_age`: The age at which the speaker learned English, also mentioned in the Kaggle dataset.
* `english_residence`: The country(ies) where the speaker learned English.
* `length_of_residence`: The duration the speaker spent in the location(s) where they learned English; this is not included in the Kaggle dataset.
* `learning_style`: The method by which the speaker learned English.
* `speech_sample`: The `.mp3` file containing the speaker's sample.
* `phonetic_transcription`: The `.gif` file containing the speaker's sample transcription. There is also a parallel `.rtf` directory containing text-encoded transcriptions of the samples.
* `map`: This column seems to contain `.gif` filename entries showing geographic locations of the speakers.
* `ethnologue_language_code`: The three-letter code identifying the language of the speaker according to the ISO 639-3 standard.
* `notes`: Additional notes about the speaker or other relevant information.

### Credit

* Weinberger, Steven. (2015). Speech Accent Archive. George Mason University. Retrieved from http://accent.gmu.edu


In [168]:
# Imports
import pandas as pd
import numpy as np
# For files
import os
# For plotting
import seaborn as sns
import matplotlib.pyplot as plt
# For processing rtf files
from striprtf.striprtf import rtf_to_text

## Initial Clean up

In [169]:
saa_df = pd.read_excel("../data/speakers.xlsx")
saa_df.sample(10)

Unnamed: 0,speakerid,speaker,native_language,alternative_native_language,city,state_or_province,country,age,gender,onset_age,english_residence,length_of_residence,learning_style,speech_sample,phonetic_transcription,map,ethnologue_language_code,notes
1090,1090,1,gedeo,,dilla,,ethiopia,51.0,male,9.0,usa,7.0,academic,gedeo1.mp3,gedeo1.gif,dilla.gif,drs,2008-12-04 00:00:00
1099,1099,17,romanian,,chisinau,,moldova,55.0,female,48.0,usa,10.0,naturalistic,romanian17.mp3,romanian17.gif,chisinau.gif,ron,2009-02-26 00:00:00
279,280,4,polish,,krakow,,poland,47.0,female,15.0,usa,14.0,academic,polish4.mp3,polish4.gif,krakow.gif,pol,
980,979,1,akan,twi,accra,,ghana,41.0,male,5.0,"ghana, usa",41.0,academic,akan1.mp3,akan1.gif,accra.gif,aka,"25 september 2008. residence: 27 years ghana, ..."
3018,3024,241,spanish,,la union,,el salvador,40.0,female,15.0,usa,25.0,academic,spanish241.mp3,notyet.gif,la_union.gif,spa,9 april 2023. LING306
690,690,8,romanian,,bucharest,,romania,47.0,male,12.0,usa,6.0,academic,romanian8.mp3,romanian8.gif,bucharest.gif,ron,5 april 2007. spent 4 years in netherlands spe...
296,297,3,romanian,,birtem,,romania,50.0,male,32.0,usa,18.0,naturalistic,romanian3.mp3,romanian3.gif,birtem.gif,ron,
318,319,1,slovak,,kosice,,slovak republic,25.0,female,14.0,usa,0.3,academic,slovak1.mp3,slovak1.gif,kosice.gif,slk,
2178,2181,5,uzbek,,panjakent,,tajikistan,19.0,male,16.0,,0.0,academic,uzbek5.mp3,notyet.gif,panjakent.gif,uzn,"22 april, 2016"
637,636,192,english,,east lansing,michigan,usa,27.0,female,0.0,usa,27.0,naturalistic,english192.mp3,notyet.gif,eastlansing.gif,eng,"18 january, 2007"


In [170]:
# Let's drop the columns that we do not need
saa_df = saa_df.drop(["speaker", "alternative_native_language", "map", "notes"], axis=1)
saa_df.sample(10)

Unnamed: 0,speakerid,native_language,city,state_or_province,country,age,gender,onset_age,english_residence,length_of_residence,learning_style,speech_sample,phonetic_transcription,ethnologue_language_code
1724,1724,english,london,england,uk,20.0,male,0.0,uk,20.0,naturalistic,english496.mp3,notyet.gif,eng
1998,1999,ukrainian,odessa,,ukraine,23.0,female,9.0,usa,3.0,academic,ukrainian11.mp3,notyet.gif,ukr
299,300,russian,moscow,,russia,20.0,female,5.0,uk,0.5,academic,russian10.mp3,russian10.gif,rus
700,700,finnish,helsinki,,finland,30.0,female,9.0,"uk, usa",3.0,academic,finnish4.mp3,finnish4.gif,fin
2092,2093,miskito,bilwi,puerto cabezas,nicaragua,18.0,female,11.0,,0.0,academic,miskito11.mp3,notyet.gif,miq
914,913,dutch,essen,,belgium,39.0,male,12.0,,0.0,naturalistic,dutch10.mp3,dutch10.gif,nld
732,732,spanish,tandil,,argentina,31.0,male,6.0,,0.0,academic,spanish49.mp3,spanish49.gif,spa
1780,1780,malayalam,bangalore,,india,19.0,female,3.5,usa,15.0,naturalistic,malayalam3.mp3,notyet.gif,mal
1645,1645,mandarin,pingdingshan,henan,china,37.0,male,12.0,,0.0,academic,mandarin48.mp3,notyet.gif,cmn
2109,2111,french,ouagadougou,,burkina faso,37.0,male,11.0,usa,16.0,academic,french63.mp3,notyet.gif,fra


In [171]:
saa_df.describe()

Unnamed: 0,speakerid,age,onset_age,length_of_residence
count,3031.0,3031.0,3031.0,3031.0
mean,1518.65358,32.577862,9.054108,13.765856
std,876.68105,14.171028,8.005611,16.604457
min,1.0,0.0,0.0,0.0
25%,760.5,22.0,3.0,1.0
50%,1519.0,27.0,9.0,7.0
75%,2277.5,40.0,13.0,22.0
max,3036.0,97.0,86.0,93.0


In [172]:
saa_df[saa_df["age"] <= 5]

Unnamed: 0,speakerid,native_language,city,state_or_province,country,age,gender,onset_age,english_residence,length_of_residence,learning_style,speech_sample,phonetic_transcription,ethnologue_language_code
354,355,synthesized,,,,0.0,male,0.0,mac system 8.5,0.0,,synthesized1.mp3,notyet.gif,
355,356,synthesized,,,,0.0,female,0.0,mac system 8.5,0.0,,synthesized2.mp3,notyet.gif,
356,357,synthesized,,,,0.0,female,0.0,mac system 8.5,0.0,,synthesized3.mp3,notyet.gif,
357,358,synthesized,,,,0.0,male,0.0,mac system 8.5,0.0,,synthesized4.mp3,notyet.gif,


* As in the initial analysis, the synthesized samples are still in the data and can serve as a baseline.

* Since the `.rtf` files in the Dropbox directory are parallel to the `.gif` files, we can rename the entries in the phonetic transcription column so that they reference the `.rtf` files in the transcriptions directory under `data`.

In [173]:
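# regex=False makes '.gif' a literal string, so the '.' cannot match an arbitrary character
saa_df['phonetic_transcription'] = saa_df['phonetic_transcription'].str.replace('.gif', '.rtf', regex=False)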
saa_df.head()

Unnamed: 0,speakerid,native_language,city,state_or_province,country,age,gender,onset_age,english_residence,length_of_residence,learning_style,speech_sample,phonetic_transcription,ethnologue_language_code
0,1,afrikaans,virginia,,south africa,27.0,female,9.0,usa,0.5,academic,afrikaans1.mp3,afrikaans1.rtf,afr
1,2,afrikaans,pretoria,,south africa,40.0,male,5.0,usa,10.0,academic,afrikaans2.mp3,afrikaans2.rtf,afr
2,3,agni,diekabo,,ivory coast,25.0,male,15.0,usa,1.2,academic,agni1.mp3,agny1.rtf,any
3,4,albanian,prishtina,,kosovo,19.0,male,6.0,usa,3.0,naturalistic,albanian1.mp3,albanian1.rtf,als
4,5,albanian,tirana,,albania,33.0,male,15.0,usa,0.04,naturalistic,albanian2.mp3,albanian2.rtf,aln
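
The extension swap above depends on how `str.replace` interprets its pattern: in pandas versions before 2.0 the pattern was treated as a regular expression by default, so the `.` in `.gif` could match any character. The sketch below contrasts a literal replacement with a regex anchored to the end of the name. The filenames are made up for illustration and are not taken from the archive.

```python
import pandas as pd

# Hypothetical filenames, used only to illustrate the two behaviours
names = pd.Series(["afrikaans1.gif", "notyet.gif", "my.gif.gif"])

# Literal replacement: every occurrence of the exact string ".gif" is replaced
literal = names.str.replace(".gif", ".rtf", regex=False)

# Anchored regex: only a trailing ".gif" extension is replaced
suffix_only = names.str.replace(r"\.gif$", ".rtf", regex=True)

print(literal.tolist())
print(suffix_only.tolist())
```

For filenames with a single extension, as in `speakers.xlsx`, both variants should give the same result; the anchored form is only safer if `.gif` could appear in the middle of a name.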


* The `phonetic_transcription` column has a default string value of `notyet.gif`, or `notyet.rtf` after the conversion, for files that have not yet been transcribed by the authors.
* We replace those placeholder entries with a null value so that pandas treats the untranscribed samples as missing data.
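
As a minimal sketch of the idea, assuming a small made-up column rather than the real data, placeholder strings can be masked out so that `isna()` and `notna()` work on them afterwards. The list name `placeholders` is hypothetical.

```python
import pandas as pd

# Toy column standing in for saa_df['phonetic_transcription']
col = pd.Series(["afrikaans1.rtf", "notyet.rtf", "hindi6.rtf"])

# Hypothetical list of placeholder strings that mean "no transcription yet"
placeholders = ["notyet.rtf"]

# mask() replaces the entries where the condition is True with a missing value
cleaned = col.mask(col.isin(placeholders))

n_missing = int(cleaned.isna().sum())
print(cleaned.tolist(), n_missing)
```

Keeping the placeholders in one list makes it easy to add further variants if more of them turn up in the column.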

In [174]:
saa_df['phonetic_transcription'] = saa_df['phonetic_transcription'].replace('notyet.rtf', None)
saa_df.sample(10)

Unnamed: 0,speakerid,native_language,city,state_or_province,country,age,gender,onset_age,english_residence,length_of_residence,learning_style,speech_sample,phonetic_transcription,ethnologue_language_code
1500,1500,filipino,pasay,,philippines,21.0,female,3.0,usa,0.5,naturalistic,filipino2.mp3,filipino2.rtf,fil
996,996,romanian,bucharest,,romania,25.0,female,9.0,usa,8.5,academic,romanian16.mp3,romanian16.rtf,ron
729,729,english,spartanburg,south carolina,usa,24.0,male,0.0,usa,24.0,naturalistic,english212.mp3,,eng
2248,2253,arabic,al ras,,saudi arabia,40.0,male,37.0,usa,3.0,academic,arabic117.mp3,,ars
410,411,hebrew,rishon,,israel,33.0,female,9.0,,0.0,academic,hebrew6.mp3,hebrew6.rtf,heb
2092,2093,miskito,bilwi,puerto cabezas,nicaragua,18.0,female,11.0,,0.0,academic,miskito11.mp3,,miq
538,536,english,dublin,,ireland,33.0,male,0.0,ireland,33.0,naturalistic,english156.mp3,english156.rtf,eng
1718,1718,english,miami,florida,usa,20.0,male,0.0,usa,13.5,naturalistic,english493.mp3,,eng
689,689,romanian,bucharest,,romania,35.0,male,12.0,usa,1.0,academic,romanian6.mp3,romanian6.rtf,ron
2171,2174,turkish,panjakent,,tajikistan,19.0,male,17.0,,0.0,academic,turkish37.mp3,,tur


* Now let's check whether the transcriptions directory and the phonetic transcription column match up.

In [175]:
directory = "../data/transcriptions/"

# Report every transcription named in the column that has no matching file on disk
for transcription in saa_df[saa_df['phonetic_transcription'].notna()]['phonetic_transcription']:
    file_path = os.path.join(directory, transcription)
    if not os.path.exists(file_path):
        print(f"The file '{transcription}' does not exist.")

The file 'polish4.rtf' does not exist.
The file 'albanian3.rtf' does not exist.
The file 'czech5.rtf' does not exist.
The file 'not' does not exist.
The file 'mandarin42.rtf' does not exist.
The file 'arabic195.rtf' does not exist.
The file 'portuguese68.rtf' does not exist.


* `polish4.rtf` and `albanian3.rtf` do exist, but as plain-text (`.txt`) files, so we will handle them later when we read the files.
* The `not` entry is simply a typo: after looking at the source spreadsheet, it was supposed to be `notyet.gif`, so we can treat it as missing as well.
* The remaining files are completely missing. The transcriptions may exist but may not have been added yet to the current version of the dataset (it is updated on a rolling basis).
* Note: The above cell was run about 10 times to account for various discrepancies between the filenames in the column and the files present, such as
    * Case differences between the files in the directory and the filenames in the column.
    * Typos in either the column or the directory.  
* I manually corrected all of the typos and errors during the clean-up process to the best of my ability.
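
The manual matching described above could be partly assisted by a helper that proposes candidates for each missing name. This is only a sketch and not the procedure actually used here: the function name `suggest_matches`, the `cutoff` value, and the example filenames are all assumptions. It first tries a case-insensitive match and then falls back to fuzzy matching with the standard library's `difflib`.

```python
import difflib

def suggest_matches(expected, present, cutoff=0.8):
    """Map each expected filename that is not present to likely candidates."""
    lower_to_actual = {name.lower(): name for name in present}
    present_set = set(present)
    suggestions = {}
    for name in expected:
        if name in present_set:
            continue  # exact match, nothing to fix
        key = name.lower()
        if key in lower_to_actual:
            # Only the letter case differs
            suggestions[name] = [lower_to_actual[key]]
        else:
            # Fall back to fuzzy matching to catch small typos
            close = difflib.get_close_matches(key, list(lower_to_actual), n=3, cutoff=cutoff)
            suggestions[name] = [lower_to_actual[c] for c in close]
    return suggestions

# Illustrative names only
expected = ["agni1.rtf", "Czech5.rtf", "hindi6.rtf"]
present = ["agny1.rtf", "czech5.rtf", "hindi6.rtf"]
print(suggest_matches(expected, present))
```

The suggestions would still need a manual check before renaming anything, since two different speakers can have very similar filenames.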

In [176]:
# Changing the 'not' value
saa_df['phonetic_transcription'] = saa_df['phonetic_transcription'].replace('not', None)

# Changing the 'albanian3' and 'polish4'
saa_df['phonetic_transcription'] = saa_df['phonetic_transcription'].replace('polish4.rtf', 'polish4.txt')
saa_df['phonetic_transcription'] = saa_df['phonetic_transcription'].replace('albanian3.rtf', 'albanian3.txt')

## Processing the Transcription files

In [177]:
# Just a little poking around
with open("../data/transcriptions/afrikaans1.rtf", encoding="utf8") as f:
    content = f.read()
    start_index = content.find('[')
    end_index = content.find(']')
    text = rtf_to_text(content[start_index:end_index+1])
print(text)


[pʰlis kɔl stɛːlʌ ɑsk˺ ɜ tə bɹɪ̃ŋ ðiz θɪ̃ŋz̥ wɪf hɜ fɹʌ̃ɱ ðə stɔɹ siks spunz̥ əv̥ fɹɪʃ sn̥oʊ piːs faɪf θɪk slæb̥s əv blu ʧiːz ɛn măɪbi ɜ snæk˺ foɹ̥ hɜ bɹɑɾə̆ ʔə brʌðə bɑp wi ɔlˠsŏ nid ə smɔlˠ plæstɪk sneɪk ɛn ə bɪk tʊi fɹɔɡ̥ fɛ̆ ðə kids̥ ʃi kɛ̆n skøp ðiz θɪ̃ŋs ɪntu fɹi ɹɛd bæɡz̥ ɛn wi wɪl ɡoʊ miːd ɜ̆ wĕnz̥d̥eɪ ɛt d̪ə tɹeɪn steɪʃən]


* It seems like Python's string encoding can handle the phonetic tokens, so we can proceed with parsing the rest of them. *fingers crossed*

In [178]:
dir_path = "../data/transcriptions/"

saa_df['transcription'] = None
for index, row in saa_df[saa_df['phonetic_transcription'].notna()].iterrows():
    file_name = row['phonetic_transcription']
    file_path = os.path.join(dir_path, file_name)
    if os.path.exists(file_path):
        # Use a context manager so the file handle is closed after reading
        with open(file_path) as f:
            file_content = f.read()
        # The actual transcriptions are always wrapped in brackets, more on this later
        start_index = file_content.find('[')
        end_index = file_content.find(']')
        transcription = rtf_to_text(file_content[start_index:end_index+1])
        saa_df.at[index, 'transcription'] = transcription
    else:
        print(f"Transcription file {file_name} was ignored.")

Transcription file czech5.rtf was ignored.
Transcription file mandarin42.rtf was ignored.
Transcription file arabic195.rtf was ignored.
Transcription file portuguese68.rtf was ignored.


* The ignored files do not exist in the directory; they probably just haven't been added yet.
* We take only the bracketed substring of each file because some of the transcription files contain extra characters that cause issues for the `rtf_to_text()` function. It was throwing an error saying `'utf-8' codec can't decode byte 0xca in position 0: invalid start byte`, so to get around that we simply ignore anything before or after the brackets.
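
One caveat of slicing with `find()` is that it returns `-1` when a bracket is absent, which silently yields an empty or truncated string instead of an error. A minimal sketch of a guarded version is below. The helper name `extract_bracketed` is hypothetical, and the RTF conversion step is left out so that the example does not depend on `striprtf`.

```python
def extract_bracketed(text):
    """Return the first '[...]' span in text, or None if there is no complete pair."""
    start = text.find("[")
    # Search for the closing bracket only after the opening one
    end = text.find("]", start + 1)
    if start == -1 or end == -1:
        return None
    return text[start:end + 1]

# Illustrative strings only
print(extract_bracketed("header junk [pliz kɔl stɛlə] trailing junk"))
print(extract_bracketed("no brackets here"))
```

Returning `None` for files without a complete bracket pair keeps them distinguishable from real transcriptions when counting missing values later.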

In [179]:
saa_df.transcription.isna().sum()

1762

* So when it's all said and done, we have (3031-1762) = 1269 transcribed samples to work with.
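
The same figure can be read directly from the column with `notna()` instead of subtracting by hand. The sketch below uses a small toy frame with made-up values; on the real data the equivalent expression would be applied to `saa_df['transcription']`.

```python
import pandas as pd

# Toy frame standing in for saa_df (values are illustrative only)
toy_df = pd.DataFrame({
    "speakerid": [1, 2, 3, 4],
    "transcription": ["[pliz kɔl]", None, "[pʰlis kɔl]", None],
})

n_total = len(toy_df)
n_transcribed = int(toy_df["transcription"].notna().sum())
share = n_transcribed / n_total

print(f"{n_transcribed} of {n_total} samples transcribed ({share:.0%})")
```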

In [180]:
saa_df.sample(5)

Unnamed: 0,speakerid,native_language,city,state_or_province,country,age,gender,onset_age,english_residence,length_of_residence,learning_style,speech_sample,phonetic_transcription,ethnologue_language_code,transcription
1494,1494,german,bielefeld,,germany,23.0,male,10.0,ireland,0.6,academic,german27.mp3,german27.rtf,deu,[pliəz̥ kɑl stɛlɚ ask həɹ tŭ bɹɪ̃ŋ ðis θɪ̃ŋks...
32,33,basque,guernica,,spain,50.0,male,19.0,usa,30.0,naturalistic,basque2.mp3,basque2.rtf,bqe,[pliz̥ kɔl ɛst̪ɛla ask xɜr t̪ʊ brɪŋ d̪iz̥ θ̬ẽ...
2903,2909,arabic,riyadh,,saudi arabia,20.0,male,6.0,usa,5.0,academic,arabic188.mp3,,ars,
867,866,hindi,new delhi,,india,27.0,female,4.0,usa,1.0,academic,hindi6.mp3,hindi6.rtf,hin,[pliːz̥ kɔːl stɛla ask hə ʈu bɹɪ̃ŋ ðiz θɪ̃ːŋz ...
1583,1583,mandarin,tie ling,liaoning,china,24.0,female,12.0,,0.0,academic,mandarin39.mp3,,cmn,


## Processing the Audio Files
The Dropbox folder contains directories for both `.mp3` and processed `.wav` audio files. Since the `.wav` files are not compressed, they will probably have the best quality, so we will use them. But we need to go through the same spiel of converting the `.mp3` entries in the column to `.wav`.

In [181]:
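# regex=False makes '.mp3' a literal string, so the '.' cannot match an arbitrary character
saa_df['speech_sample'] = saa_df['speech_sample'].str.replace('.mp3', '.wav', regex=False)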

In [182]:
saa_df.head()

Unnamed: 0,speakerid,native_language,city,state_or_province,country,age,gender,onset_age,english_residence,length_of_residence,learning_style,speech_sample,phonetic_transcription,ethnologue_language_code,transcription
0,1,afrikaans,virginia,,south africa,27.0,female,9.0,usa,0.5,academic,afrikaans1.wav,afrikaans1.rtf,afr,[pʰlis kɔl stɛːlʌ ɑsk˺ ɜ tə bɹɪ̃ŋ ðiz θɪ̃ŋz̥ w...
1,2,afrikaans,pretoria,,south africa,40.0,male,5.0,usa,10.0,academic,afrikaans2.wav,afrikaans2.rtf,afr,[pʰliːz̥ kʰɔl stɛ̆lʌ ɔsk hɜ tŭ bɹiŋ ðiz θiŋz̥...
2,3,agni,diekabo,,ivory coast,25.0,male,15.0,usa,1.2,academic,agni1.wav,agny1.rtf,any,[pliz kɑl stelə æs hɚ tu bɹɪ̃ŋ viz fɪŋ wɪf hɜɹ...
3,4,albanian,prishtina,,kosovo,19.0,male,6.0,usa,3.0,naturalistic,albanian1.wav,albanian1.rtf,als,[p̬liz kʰɔl stɛla æs xɜɹ tu bɹɪ̃ŋ ðɪs θɪ̃ŋks w...
4,5,albanian,tirana,,albania,33.0,male,15.0,usa,0.04,naturalistic,albanian2.wav,albanian2.rtf,aln,[pliz kɔl stɛlə æsk hɛɹ tu bɹɪ̃ŋ ðɪs θɪ̃ŋs wɪð...


Before we begin checking the files, we need to standardize the naming of the files, e.g., remove all whitespace and make them all lowercase.

In [183]:
directory = "../data/processed_wav_files/"

for filename in os.listdir(directory):
    old_path = os.path.join(directory, filename)
    
    # Convert the filename to lowercase and remove whitespace
    new_filename = filename.lower().replace(" ", "")
    new_path = os.path.join(directory, new_filename)
    
    # Rename the file to lowercase and without whitespace
    os.rename(old_path, new_path)
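
One thing to watch with an in-place rename like this is that two different files can normalize to the same name, in which case one may overwrite the other. The sketch below plans the renames first and reports such collisions without touching the disk. The function name `plan_renames` and the filenames are hypothetical.

```python
from collections import defaultdict

def plan_renames(filenames):
    """Return (renames, collisions) for lowercasing names and removing spaces."""
    groups = defaultdict(list)
    for name in filenames:
        groups[name.lower().replace(" ", "")].append(name)

    # Safe renames: exactly one source file, and the name actually changes
    renames = {olds[0]: new for new, olds in groups.items()
               if len(olds) == 1 and olds[0] != new}
    # Collisions: several source files would end up with the same name
    collisions = {new: olds for new, olds in groups.items() if len(olds) > 1}
    return renames, collisions

# Illustrative names only
renames, collisions = plan_renames(
    ["English 12.wav", "english12.wav", "Thai 3.WAV", "wu4.wav"]
)
print(renames)
print(collisions)
```

In practice the input could come from `os.listdir(directory)`, and `os.rename` would then be called only for the entries in `renames`.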

In [184]:
directory = "../data/processed_wav_files/"

for file in saa_df['speech_sample']:
    file_path = os.path.join(directory, file)
    if not os.path.exists(file_path):
        print(f"The file '{file}' does not exist.")

The file 'estonian15.wav' does not exist.
The file 'wu4.wav' does not exist.
The file 'arabic195.wav' does not exist.
The file 'french81.wav' does not exist.
The file 'catalan6.wav' does not exist.
The file 'thai22.wav' does not exist.
The file 'spanish230.wav' does not exist.
The file 'arabic196.wav' does not exist.
The file 'vietnamese36.wav' does not exist.
The file 'portuguese67.wav' does not exist.
The file 'japanese43.wav' does not exist.
The file 'portuguese68.wav' does not exist.
The file 'bengali21.wav' does not exist.
The file 'spanish231.wav' does not exist.
The file 'arabic197.wav' does not exist.
The file 'sundanese2.wav' does not exist.
The file 'swissgerman8.wav' does not exist.
The file 'english647.wav' does not exist.
The file 'indonesian14.wav' does not exist.
The file 'english648.wav' does not exist.
The file 'mandarin152.wav' does not exist.
The file 'bengali22.wav' does not exist.
The file 'spanish233.wav' does not exist.


After manually looking up each entry of the above list in the directory and fixing any typos or minor naming errors, I was able to whittle the list down to these missing files. It could be that they have not been processed yet, as the `.wav` files go through processing before being added to the dataset. However, since they are not transcribed anyway, they are not of much value to us and we can proceed.
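
Rather than only printing the missing files, availability could also be recorded in the data frame so that later steps can filter on it. This is a sketch on a toy frame: the column names `has_audio` and `usable` are assumptions and are not part of the saved data, and `available_files` stands in for the directory listing.

```python
import pandas as pd

# Toy frame standing in for saa_df (values are illustrative only)
toy_df = pd.DataFrame({
    "speech_sample": ["afrikaans1.wav", "wu4.wav", "hindi6.wav", "english656.wav"],
    "transcription": ["[pʰlis kɔl]", None, "[pliːz̥ kɔːl]", None],
})

# In practice this set could be built with set(os.listdir(directory))
available_files = {"afrikaans1.wav", "hindi6.wav", "english656.wav"}

# True where the audio file named in the column is actually present
toy_df["has_audio"] = toy_df["speech_sample"].isin(available_files)
# A sample is usable only if it has both audio and a transcription
toy_df["usable"] = toy_df["has_audio"] & toy_df["transcription"].notna()

print(toy_df[["speech_sample", "has_audio", "usable"]])
```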

In [185]:
saa_df

Unnamed: 0,speakerid,native_language,city,state_or_province,country,age,gender,onset_age,english_residence,length_of_residence,learning_style,speech_sample,phonetic_transcription,ethnologue_language_code,transcription
0,1,afrikaans,virginia,,south africa,27.0,female,9.0,usa,0.50,academic,afrikaans1.wav,afrikaans1.rtf,afr,[pʰlis kɔl stɛːlʌ ɑsk˺ ɜ tə bɹɪ̃ŋ ðiz θɪ̃ŋz̥ w...
1,2,afrikaans,pretoria,,south africa,40.0,male,5.0,usa,10.00,academic,afrikaans2.wav,afrikaans2.rtf,afr,[pʰliːz̥ kʰɔl stɛ̆lʌ ɔsk hɜ tŭ bɹiŋ ðiz θiŋz̥...
2,3,agni,diekabo,,ivory coast,25.0,male,15.0,usa,1.20,academic,agni1.wav,agny1.rtf,any,[pliz kɑl stelə æs hɚ tu bɹɪ̃ŋ viz fɪŋ wɪf hɜɹ...
3,4,albanian,prishtina,,kosovo,19.0,male,6.0,usa,3.00,naturalistic,albanian1.wav,albanian1.rtf,als,[p̬liz kʰɔl stɛla æs xɜɹ tu bɹɪ̃ŋ ðɪs θɪ̃ŋks w...
4,5,albanian,tirana,,albania,33.0,male,15.0,usa,0.04,naturalistic,albanian2.wav,albanian2.rtf,aln,[pliz kɔl stɛlə æsk hɛɹ tu bɹɪ̃ŋ ðɪs θɪ̃ŋs wɪð...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3026,3032,turkish,spaichingen,,germany,26.0,female,10.0,,0.00,academic,turkish45.wav,,tur,
3027,3033,english,singapore,,singapore,30.0,male,1.0,singapore,30.00,naturalistic,english655.wav,,eng,
3028,3034,english,dallas,texas,usa,40.0,male,0.0,usa,40.00,naturalistic,english656.wav,,eng,
3029,3035,english,alexandria,virginia,usa,55.0,male,0.0,usa,47.00,naturalistic,english657.wav,,eng,


## Saving the Newly Processed Data Frame With Transcriptions

In [186]:
saa_df.to_pickle('../data/df_report2.pkl')
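
A quick round-trip check can confirm that the pickle preserves both the IPA strings and the missing values. The sketch below uses a toy frame and a temporary directory, so the file name `toy_report.pkl` is hypothetical and nothing is written to `../data/`.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for saa_df (values are illustrative only)
toy_df = pd.DataFrame({
    "speech_sample": ["afrikaans1.wav", "english656.wav"],
    "transcription": ["[pʰlis kɔl stɛːlʌ]", None],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_report.pkl")
    toy_df.to_pickle(path)
    restored = pd.read_pickle(path)

# Raises an AssertionError if the restored frame differs from the original
pd.testing.assert_frame_equal(toy_df, restored)
print(restored)
```

Note that pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.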