# Clean Raw ELAN Files

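This notebook finds every character in the annotations of ELAN (`.eaf`) files that is not in Python's `string.printable` set (ASCII letters, digits, punctuation and whitespace) and lets the user replace it.

For example, curly quotes can be replaced with normal quotes: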

# ’ becomes '

To filter out non-standard characters from the annotations, the notebook builds a dictionary of the characters it finds. You can edit this dictionary manually to set a replacement for each character. The workflow has four steps:

1. Import raw files: select the `.eaf` files to be cleaned.
2. Find all unusual characters: list every character that is not in `string.printable`.
3. MANUAL STEP: edit the replacement dictionary.
4. Clean all files: apply the edited dictionary to the files.

<div class="alert alert-block alert-warning">
<b>User guide to using a Jupyter Notebook</b> 

If you are new to Jupyter Notebook, feel free to take a quick look at [this user guide](https://github.com/Australian-Text-Analytics-Platform/semantic-tagger/blob/main/documents/jupyter-notebook-guide.pdf) for basic information on how to use a notebook.
</div>

## 1. Import raw files
The first step is to load all files that should be cleaned.

In [21]:
import os
import re
import glob
import json
import string
import traitlets
import xml.etree.ElementTree as ET

from ipywidgets import widgets
from IPython.display import display
from tkinter import Tk, filedialog


class SelectFilesButton(widgets.Button):
    """A file widget that leverages tkinter.filedialog."""

    def __init__(self):
        super(SelectFilesButton, self).__init__()
        # Add the selected_files trait
        self.add_traits(files=traitlets.traitlets.List())
        # Create the button.
        self.description = "Select Files"
        self.icon = "square-o"
        self.style.button_color = "orange"
        # Set on click behavior.
        self.on_click(self.select_files)

    @staticmethod
    def select_files(b):
        """Generate instance of tkinter.filedialog.

        Parameters
        ----------
        b : obj:
            An instance of ipywidgets.widgets.Button 
        """
        # Create Tk root
        root = Tk()
        # Hide the main window
        root.withdraw()
        # Raise the root to the top of all windows.
        root.call('wm', 'attributes', '.', '-topmost', True)
        # The selected file paths are stored on the button as b.files.
        b.files = filedialog.askopenfilename(multiple=True)

        b.description = "Files Selected"
        b.icon = "check-square-o"
        b.style.button_color = "lightgreen"
        # Show the selected ELAN files. Other cells can read the full list from my_button.files.
        files = [fi for fi in b.files if fi.endswith(".eaf")]
        print(files)



print("+++ Ready +++")

my_button = SelectFilesButton()
my_button # This will display the button in the context of the Notebook

+++ Ready +++


SelectFilesButton(description='Select Files', icon='square-o', style=ButtonStyle(button_color='orange'))

[]
['C:/Users/barth/Documents/LDACA/AusESL/raw_elan/AAE_MF_002_Stella_anon.eaf', 'C:/Users/barth/Documents/LDACA/AusESL/raw_elan/AAE_MF_006_Tanya_anon.eaf', 'C:/Users/barth/Documents/LDACA/AusESL/raw_elan/AAE_MF_008_Kelsey_anon.eaf', 'C:/Users/barth/Documents/LDACA/AusESL/raw_elan/AAE_MF_009_Joyce_anon.eaf', 'C:/Users/barth/Documents/LDACA/AusESL/raw_elan/AAE_MF_011_Rosa_anon.eaf', 'C:/Users/barth/Documents/LDACA/AusESL/raw_elan/AAE_MF_012_Ruby_anon.eaf', 'C:/Users/barth/Documents/LDACA/AusESL/raw_elan/AAE_MF_022_Mei_anon.eaf', 'C:/Users/barth/Documents/LDACA/AusESL/raw_elan/AAE_MF_024_Liu_anon.eaf', 'C:/Users/barth/Documents/LDACA/AusESL/raw_elan/AAE_MF_029_Ning_anon.eaf', 'C:/Users/barth/Documents/LDACA/AusESL/raw_elan/AAE_MF_036_Liling_anon.eaf', 'C:/Users/barth/Documents/LDACA/AusESL/raw_elan/AAE_MF_037_Mingzhu_anon.eaf', 'C:/Users/barth/Documents/LDACA/AusESL/raw_elan/AAE_MF_038_Xiulan_anon.eaf', 'C:/Users/barth/Documents/LDACA/AusESL/raw_elan/AAE_MF_043_Lanfen_anon.eaf', 'C:/User
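If the Tk file dialog is not available (for example on a remote Jupyter server), the file list could also be built from a folder path. The following is a minimal sketch, not part of the original notebook: `find_eaf_files` is a hypothetical helper, and the folder you pass to it is an assumption about where your raw files live.

```python
import glob
import os


def find_eaf_files(folder):
    """Return a sorted list of the .eaf paths found directly inside `folder`."""
    pattern = os.path.join(folder, "*.eaf")
    # Use forward slashes so the paths look like the ones returned by the file dialog.
    return sorted(p.replace(os.sep, "/") for p in glob.glob(pattern))
```

The result can then be used in place of `my_button.files` in the later cells, e.g. `files = find_eaf_files("path/to/raw_elan")`.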

## 2. Find all unusual characters
The following code block finds all non-standard characters in the ELAN annotations.

In [22]:
# Collects all characters that are not in string.printable into a dictionary.

# Read the list of selected files from the button created in step 1.
files = my_button.files
files = [fi for fi in files if fi.endswith(".eaf")]
#print (files)

# out_path is the folder that contains the selected files (not used further in this notebook)
out_path = "/".join(files[0].split("/")[:-1]) + "/"

def annotation_finder(file):
    tree = ET.parse(file)
    root = tree.getroot()
    return root.iter('ANNOTATION_VALUE')


printable = set(string.printable)
"""{'C', '.', 'U', 'w', '^', '9', 'j', '3', '$', '0', 'D', '\n', '+', '~', 'e', '|', 'c', 'P', '7',
    'O', 'Y', 'r', 'm', 'M', '-', '5', '=', 'y', 'q', 'o', 'I', ':', 'a', '"', '{', '8', 'J', 'K', 
    '}', 'V', 'N', 'G', 'W', '*', 'l', 'z', 'Z', 'i', '\x0b', '(', '6', 'X', 't', '&', '<', 'R', 'p',
    '#', 'F', 'b', 'L', 'B', 'S', '@', '_', 'n', '[', "'", '\r', 'f', 'h', 'u', '!', '>', 'A', 'v', 
    '?', ' ', ',', 'k', 'd', 'Q', '\x0c', '\t', '/', '2', '1', 'x', '\\', 'g', '%', ')', 'T', 's', 
    '4', 'E', 'H', '`', ']', ';'}"""

found = {}
for file in files:
    #print ("\n--------------------------------------------------------------")
    #print (file)
    for each in annotation_finder(file):
        if each.text:
            for letter in each.text:
                
                if letter not in printable:
                    
                    #print (letter, " -- ", each.text)
                    if letter not in found:
                        found[letter] = letter
                        
print ("\n\nnon-standard characters found:")
print (found)



non-standard characters found:
{'’': '’', '…': '…', '‘': '‘', 'ǎ': 'ǎ', 'è': 'è', 'ì': 'ì', 'é': 'é', '–': '–', 'á': 'á', 'ó': 'ó', 'µ': 'µ', '\xa0': '\xa0', 'ü': 'ü', '₽': '₽'}
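Some of the characters found are hard to tell apart by eye (`'\xa0'`, for instance, looks like an ordinary space). As a small sketch, assuming you want more detail before choosing replacements, the standard `unicodedata` module can show the code point and official name of each character. `describe_characters` is a hypothetical helper name.

```python
import unicodedata


def describe_characters(chars):
    """Return (character, code point, Unicode name) for each character."""
    rows = []
    for ch in chars:
        # Without a default, unicodedata.name raises ValueError for unnamed characters.
        name = unicodedata.name(ch, "UNKNOWN")
        rows.append((ch, f"U+{ord(ch):04X}", name))
    return rows


for ch, code_point, name in describe_characters(["’", "…", "\xa0"]):
    print(repr(ch), code_point, name)
```

Passing the `found` dictionary from the cell above works as well, because iterating over a dictionary yields its keys.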


## 3. Manual Step: edit the replacement dictionary

Copy the dictionary { . . . } printed above into the code cell in step 4, after <span style="background-color: #FFFF00">replacements =</span>  

The dictionary has one key : value entry for each character found. 
Initially the key and the value are identical.
Change each value to the text that should replace its key.

e.g.:

 
```{'…': '...'} ```

will replace: I see … 

with:         I see ... 
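To check an edited dictionary before running it on real files, it can be tried on a short test string first. This is a minimal sketch with a hypothetical helper `apply_replacements`; it uses the same `str.replace` loop as the cleaning cell in step 4, but on plain text rather than on ELAN files.

```python
def apply_replacements(text, replacements):
    """Replace every key of `replacements` that occurs in `text` with its value."""
    for old, new in replacements.items():
        text = text.replace(old, new)
    return text


example = {"…": "...", "’": "'"}
print(apply_replacements("I see … it’s fine", example))
```

An empty string as the value removes the character entirely, which is how `'…': ''` behaves in step 4.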

## 4. Clean all files
Run the following code cell to clean all input files.

In [23]:
# Paste the replacement dictionary here and make the edits you need.
#replacements = {'’': "'", '…': '...'}
replacements = {'’': "'", '…': '', '‘': "'", 'ǎ': 'a', 'è': 'e', 'ì': 'i',
                'é': 'e', '–': '-', 'á': 'a', 'ó': 'o', 'µ': '', '\xa0': " ",
                'ü': 'u', '₽': 'p'} #sydney speaks '\xa0'

# Folder for the cleaned files; it is created if it does not exist yet.
folder_path = 'C:/Users/barth/Documents/LDACA/AusESL/edited_elan/'
os.makedirs(folder_path, exist_ok=True)

for file in files:
    outfile = folder_path + file.split('/')[-1]

    # ET.parse reads the file itself, so no separate open() is needed.
    tree = ET.parse(file)
    root = tree.getroot()

    for each in root.iter('ANNOTATION_VALUE'):
        if each.text:
            for replacee, replacement in replacements.items():
                if replacee in each.text:
                    old = each.text
                    each.text = each.text.replace(replacee, replacement)
                    # Print the annotation before and after this replacement.
                    print(old, each.text)

    tree.write(outfile, encoding='utf-8', xml_declaration=True)
    print(file)
print("+++ DONE +++\n\n Clean files:")


I grew up in China, southern part of China, small town called .  Yeah, it’s one part of Guangdong province. I grew up in China, southern part of China, small town called .  Yeah, it's one part of Guangdong province.
Yeah, Tianzhou is the dialect of our hometown.  It’s not actually, like, Mandarin or Cantonese.  Yeah. Yeah, Tianzhou is the dialect of our hometown.  It's not actually, like, Mandarin or Cantonese.  Yeah.
Yeah, because normally people live there, their relationship quite closely to each other.  You can say it’s quite a small town and you can know nearly everyone in the small town and people are very friendly with each other.  Sometimes if some issue, like for example, personal relationship issue happened, people in the town who got the, traditional or local status, actually, they can get more effective to solve that problem.  It’s not the government problem.  Yeah, because normally people live there, their relationship quite closely to each other.  You can say it's quite a

We do speak Mandarin at home but sometimes - yeah sometimes like some words we have to use English. Sometimes we couldn’t think about what is the Chinese. Yeah yeah. We do speak Mandarin at home but sometimes - yeah sometimes like some words we have to use English. Sometimes we couldn't think about what is the Chinese. Yeah yeah.
We do speak Mandarin at home but sometimes - yeah sometimes like some words we have to use English. Sometimes we couldn’t think about what is the Chinese. Yeah yeah. We do speak Mandarin at home but sometimes - yeah sometimes like some words we have to use English. Sometimes we couldn't think about what is the Chinese. Yeah yeah.
We're from - we are from the same province but different cities. I had never heard of his city before I met him. It’s where the where the is. We're from - we are from the same province but different cities. I had never heard of his city before I met him. It's where the where the is.
We're from - we are from the same province but diffe

C:/Users/barth/Documents/LDACA/AusESL/raw_elan/AAE_MF_011_Rosa_anon.eaf
Very intense. I mean I guess when I was a kid I didn’t - I went to tutoring when I was in year four... Very intense. I mean I guess when I was a kid I didn't - I went to tutoring when I was in year four...
Very intense. I mean I guess when I was a kid I didn’t - I went to tutoring when I was in year four... Very intense. I mean I guess when I was a kid I didn't - I went to tutoring when I was in year four...
I feel like that’s the that's pretty much that bring me up from that time which make me just oh realise okay the world is like that and you know. I feel like that's the that's pretty much that bring me up from that time which make me just oh realise okay the world is like that and you know.
C:/Users/barth/Documents/LDACA/AusESL/raw_elan/AAE_MF_012_Ruby_anon.eaf
That’s good. That's good.
Yeah, yeah. I think… Yeah, yeah. I think
That’s good That's good
Yeah yeah I think… Yeah yeah I think
That's right. That’s rig

C:/Users/barth/Documents/LDACA/AusESL/raw_elan/AAE_MF_029_Ning_anon.eaf
Interesting.  And you said that - so, when you were working there, was it an English-speaking environment or were the people… Interesting.  And you said that - so, when you were working there, was it an English-speaking environment or were the people
No, I… No, I
I have not… I have not
So you - so you did that?  You went through… So you - so you did that?  You went through
Five hundred - and so did you find - did you feel like you were learning - because obviously by the time you came to Australia you could already speak English and you already knew - you weren’t like an absolutely beginner. Five hundred - and so did you find - did you feel like you were learning - because obviously by the time you came to Australia you could already speak English and you already knew - you weren't like an absolutely beginner.
Five hundred - and so did you find - did you feel like you were learning - because obviously by the time y

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



C:/Users/barth/Documents/LDACA/AusESL/raw_elan/AAE_RF_017_Alenka_anon.eaf
How old were you when you came to… How old were you when you came to
Where did you go to school and what was your schooling like? Is it different to how it is in Australia? Is it primary school and then high school? How does it work in… Where did you go to school and what was your schooling like? Is it different to how it is in Australia? Is it primary school and then high school? How does it work in
Where did you go to school and what was your schooling like? Is it different to how it is in Australia? Is it primary school and then high school? How does it work in… Where did you go to school and what was your schooling like? Is it different to how it is in Australia? Is it primary school and then high school? How does it work in
Yeah? So you settled in. How did you settle in? How did you … Yeah? So you settled in. How did you settle in? How did you 
Okay that must have been interesting. How did it work who - did 

So you mean there were eight classes that you can… So you mean there were eight classes that you can
Mm-hm.  But mainly … Great.  Tell me how you were learning English language. Mm-hm.  But mainly  Great.  Tell me how you were learning English language.
How old were you when you started to… How old were you when you started to
Do you mean you were learning language particularly… Do you mean you were learning language particularly
language or different …? language or different ?
So you speak Russian with your… So you speak Russian with your
On TV or… On TV or
Thank you.  What about connection with your home country?  Do you have it still?  Or… Thank you.  What about connection with your home country?  Do you have it still?  Or
We… We
 yeah of course there are … But to … by this project.  yeah of course there are  But to  by this project.
Does it - does it give it to you anything in particular this diversity?  What do you feel about it like personally?   in relation to your kids … Does i

C:/Users/barth/Documents/LDACA/AusESL/raw_elan/AAE_RF_025_Mikhaila_anon.eaf
Did you learn through university or … Did you learn through university or 
I find it's one of those things you need to go to the country and just speak with people because … I find it's one of those things you need to go to the country and just speak with people because 
Was it an easy transition for you or was it… Was it an easy transition for you or was it
Okay, and so did you learn Russian and Ukrainian together or… Okay, and so did you learn Russian and Ukrainian together or
So, do they - how do you - is it the classes are in Ukrainian and then you learn Russian, or… So, do they - how do you - is it the classes are in Ukrainian and then you learn Russian, or
Just depends on who's… Just depends on who's
Okay, and… Okay, and
Okay, and you've still got family in… Okay, and you've still got family in
Oh, okay. Does your daughter like going? What's it like when you go back to… Oh, okay. Does your daughter like g

C:/Users/barth/Documents/LDACA/AusESL/raw_elan/AAE_RF_030_Ivana_anon.eaf
I went to university and I didn't have a choice sort of thing because somehow my parents decided that out of three kids I'm the one who must go to university and I simply didn’t know about any other opportunities.  So I went and sort of on luck choose something that I was thinking is mine had an exam and I passed. I went to university and I didn't have a choice sort of thing because somehow my parents decided that out of three kids I'm the one who must go to university and I simply didn't know about any other opportunities.  So I went and sort of on luck choose something that I was thinking is mine had an exam and I passed.
I went to university and I didn't have a choice sort of thing because somehow my parents decided that out of three kids I'm the one who must go to university and I simply didn’t know about any other opportunities.  So I went and sort of on luck choose something that I was thinking is mine had a

C:/Users/barth/Documents/LDACA/AusESL/raw_elan/AAE_RF_032_Ingrid_anon.eaf
…get into the university they separated it four years bachelor and then two years of master. get into the university they separated it four years bachelor and then two years of master.
That's interesting I was working - I was working in a café as a waitress yeah my boss she like - her husband works there and when they were looking for like people to take this position she just like proposed like would you like to try you know to apply for this job if you want to. I was like yes. That's interesting I was working - I was working in a cafe as a waitress yeah my boss she like - her husband works there and when they were looking for like people to take this position she just like proposed like would you like to try you know to apply for this job if you want to. I was like yes.
That's interesting I was working - I was working in a café as a waitress yeah my boss she like - her husband works there and when they were loo

Okay, so they could do their own… Okay, so they could do their own
Use that musical … Use that musical 
You just think it's gonna be… You just think it's gonna be
Okay, you notice the people are a bit… Okay, you notice the people are a bit
So, you guys settled in very easily, is that …? So, you guys settled in very easily, is that ?
Yeah, and then sort of settled in and…. Yeah, and then sort of settled in and.
Yeah, it can be a bit hard, even just finding that job and … Yeah, it can be a bit hard, even just finding that job and 
Probably like five… Probably like five
Well, if you still remember it, then… Well, if you still remember it, then
Yeah, that's fair enough. That language barrier would have just made it so… Yeah, that's fair enough. That language barrier would have just made it so
Yeah, so they have to be… Yeah, so they have to be
When the money - when that was happening, the price changes and all of that… When the money - when that was happening, the price changes and all of t

I watched some interesting YouTube videos on like you know that people in Australia – Australian girl I think it was. She - she was trying to impersonate different accents and she done really well with some of those. She actually mastered Russian accent very well. I watched some interesting YouTube videos on like you know that people in Australia - Australian girl I think it was. She - she was trying to impersonate different accents and she done really well with some of those. She actually mastered Russian accent very well.
I watched some interesting YouTube videos on like you know that people in Australia – Australian girl I think it was. She - she was trying to impersonate different accents and she done really well with some of those. She actually mastered Russian accent very well. I watched some interesting YouTube videos on like you know that people in Australia - Australian girl I think it was. She - she was trying to impersonate different accents and she done really well with som

C:/Users/barth/Documents/LDACA/AusESL/raw_elan/AAE_RM_052_Stas_anon.eaf
So were you on your own or… So were you on your own or
So do you feel like Australia is kind of ideal country and everything is… So do you feel like Australia is kind of ideal country and everything is
That’s . That's .
Okay. So met him here… Okay. So met him here
…here? here?
So Hindi… So Hindi
Is he… Is he
So you haven't - not - haven't mentioned… So you haven't - not - haven't mentioned
…read these words. read these words.
So do you feel like Australia is kind of ideal country and everything is… So do you feel like Australia is kind of ideal country and everything is
…racial point of view? Something will discriminate. I've been discriminated by young girls. I don't know why. Maybe because I'm twice their age you know what I mean? But what I'm trying to say… racial point of view? Something will discriminate. I've been discriminated by young girls. I don't know why. Maybe because I'm twice their age you know what 

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)
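
The "IOPub data rate exceeded" notice most likely appears because the cleaning cell prints every changed annotation. One way to keep the output small is to count replacements instead of printing them. The sketch below is an assumption about how that could look, not the notebook's own method; `clean_text` and `counts` are hypothetical names, and the function works on a single annotation text.

```python
from collections import Counter


def clean_text(text, replacements, counts):
    """Apply `replacements` to `text` and record how often each key occurred."""
    for old, new in replacements.items():
        n = text.count(old)
        if n:
            counts[old] += n
            text = text.replace(old, new)
    return text


counts = Counter()
cleaned = clean_text("It’s … fine … really", {"’": "'", "…": "..."}, counts)
print(cleaned)
print(dict(counts))
```

In the cleaning loop, `each.text = clean_text(each.text, replacements, counts)` could stand in for the inner replacement loop, with a single `print(dict(counts))` after all files have been written.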

