# TODO: ELAN to CSV


## 1. Clean the ELAN file content

## 2. Save a CSV version of the ELAN file in the following format:

| Transcript | IU_number | start_time | end_time | speaker | IU |
| --- | --- | --- | --- | --- | --- |
| AAE_MF_002_Stella | 1 | 02:04.1 | 02:09.1 | INT: Sylvie | So do you do you did you went to the school chiao zhou like all the way |
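
The target format above could be produced along the following lines. This is only a minimal sketch, not the notebook's implementation: the helper names `format_time`, `eaf_to_rows` and `write_csv` are hypothetical, and it assumes that each intonation unit (IU) is an `ALIGNABLE_ANNOTATION`, that the tier's `TIER_ID` can serve as the speaker label, and that IUs are numbered in order of start time.

```python
import csv
import xml.etree.ElementTree as ET

HEADER = ["Transcript", "IU_number", "start_time", "end_time", "speaker", "IU"]


def format_time(ms):
    """Convert milliseconds to mm:ss.t (tenths of a second), e.g. 02:04.1."""
    tenths = int(round(ms / 100))
    minutes, rest = divmod(tenths, 600)
    return f"{minutes:02d}:{rest // 10:02d}.{rest % 10}"


def eaf_to_rows(root, transcript):
    """Build one row per time-aligned annotation (assumption: one annotation = one IU)."""
    # Map each time slot ID to its value in milliseconds
    slots = {
        ts.get("TIME_SLOT_ID"): int(ts.get("TIME_VALUE"))
        for ts in root.iter("TIME_SLOT")
        if ts.get("TIME_VALUE") is not None
    }
    units = []
    for tier in root.iter("TIER"):
        speaker = tier.get("TIER_ID")  # assumption: the tier ID names the speaker
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            start = slots.get(ann.get("TIME_SLOT_REF1"))
            end = slots.get(ann.get("TIME_SLOT_REF2"))
            if start is None or end is None:
                continue  # skip annotations without usable time alignment
            text = (ann.findtext("ANNOTATION_VALUE") or "").strip()
            units.append((start, end, speaker, text))
    units.sort(key=lambda u: (u[0], u[1]))  # number IUs by start time
    return [
        [transcript, number, format_time(start), format_time(end), speaker, text]
        for number, (start, end, speaker, text) in enumerate(units, 1)
    ]


def write_csv(rows, path):
    """Write the header and rows to a UTF-8 CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(HEADER)
        writer.writerows(rows)
```

A cleaned file could then be converted with `write_csv(eaf_to_rows(ET.parse(path).getroot(), transcript_name), csv_path)`, where `path`, `transcript_name` and `csv_path` are placeholders for your own values.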

This notebook finds all characters in the annotations of ELAN (`.eaf`) files that fall outside Python's `string.printable` set (ASCII letters, digits, punctuation, and whitespace) and lets the user replace them.

For example, curly quotes can be replaced with straight quotes:

`’` becomes `'`

To filter out non-standard characters from the annotations, the notebook builds a dictionary of every such character it finds. You can edit this dictionary manually to choose the replacement for each character.

1. Import raw files: reads all characters from file
2. Find all unusual characters: shows all non-printable characters
3. MANUAL STEP: edit the replacement dictionary
4. Clean all files: Apply the new dictionary to the files
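
The find-edit-replace idea behind these steps can be tried on a plain string before running it on real files. The following is a minimal sketch; the function names `find_unusual` and `apply_replacements` and the sample sentence are made up for illustration and are not part of the notebook.

```python
import string

PRINTABLE = set(string.printable)


def find_unusual(text):
    """Map every character outside string.printable to itself."""
    return {ch: ch for ch in text if ch not in PRINTABLE}


def apply_replacements(text, replacements):
    """Replace each key of `replacements` with its value."""
    for old, new in replacements.items():
        text = text.replace(old, new)
    return text


sample = "I don’t know … maybe"
found = find_unusual(sample)  # keys and values are identical at first

# Manual step: decide what each character should become
found["’"] = "'"
found["…"] = "..."

cleaned = apply_replacements(sample, found)
print(cleaned)
```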

<div class="alert alert-block alert-warning">
<b>User guide to using a Jupyter Notebook</b> 

If you are new to Jupyter Notebook, feel free to take a quick look at [this user guide](https://github.com/Australian-Text-Analytics-Platform/semantic-tagger/blob/main/documents/jupyter-notebook-guide.pdf) for basic information on how to use a notebook.
</div>

## 1. Import raw files
The first step is to select the files that should be cleaned. Only files ending in `.eaf` are used in the following steps.

In [5]:
import os
import re
import glob
import json
import string
import traitlets
import xml.etree.ElementTree as ET

from ipywidgets import widgets
from IPython.display import display
from tkinter import Tk, filedialog


class SelectFilesButton(widgets.Button):
    """A file widget that leverages tkinter.filedialog."""

    def __init__(self):
        super(SelectFilesButton, self).__init__()
        # Add the selected_files trait
        self.add_traits(files=traitlets.traitlets.List())
        # Create the button.
        self.description = "Select Files"
        self.icon = "square-o"
        self.style.button_color = "orange"
        # Set on click behavior.
        self.on_click(self.select_files)

    @staticmethod
    def select_files(b):
        """Open a tkinter file dialog and store the selection on the button.

        Parameters
        ----------
        b : obj:
            An instance of ipywidgets.widgets.Button
        """
        # Create Tk root
        root = Tk()
        # Hide the main window
        root.withdraw()
        # Raise the root to the top of all windows.
        root.call('wm', 'attributes', '.', '-topmost', True)
        # The selected files are stored in b.files (empty if the dialog is cancelled)
        b.files = list(filedialog.askopenfilename(multiple=True))

        b.description = "Files Selected"
        b.icon = "check-square-o"
        b.style.button_color = "lightgreen"
        # Show which of the selected files are ELAN (.eaf) files
        print([fi for fi in b.files if fi.endswith(".eaf")])


print("+++ Ready +++")

my_button = SelectFilesButton()
my_button # This will display the button in the context of the Notebook

+++ Ready +++


SelectFilesButton(description='Select Files', icon='square-o', style=ButtonStyle(button_color='orange'))

[]


## 2. Find all unusual characters
The following code cell finds all non-standard characters in the ELAN annotations.

In [3]:
# Collect every character that is not in string.printable into a dictionary

# my_button.files holds the selection made in the cell above; keep only ELAN files
files = my_button.files
files = [fi for fi in files if fi.endswith(".eaf")]
#print(files)

# out_path is where the cleaned files and the replacement dictionary will be stored
#out_path = "/".join(files[0].split("/")[:-1]) + "/"

def annotation_finder(file):
    """Return an iterator over all ANNOTATION_VALUE elements of an ELAN file."""
    tree = ET.parse(file)
    root = tree.getroot()
    return root.iter('ANNOTATION_VALUE')


# string.printable contains ASCII digits, letters, punctuation and whitespace only
printable = set(string.printable)

found = {}
for file in files:
    print("\n--------------------------------------------------------------")
    print(file)
    for each in annotation_finder(file):
        if each.text:
            for letter in each.text:
                if letter not in printable:
                    # Show the character together with the annotation it occurs in
                    print(letter, " -- ", each.text)
                    if letter not in found:
                        found[letter] = letter

print("\n\nnon-standard characters found:")
print(found)



non-standard characters found:
{}
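
Some of the characters found may be invisible or hard to tell apart when printed (for example a no-break space). As an optional aid, the sketch below lists the code point and Unicode name of each key in a dictionary like `found`. The helper name `describe_chars` and the example dictionary are hypothetical; the sketch assumes every key is a single character.

```python
import unicodedata


def describe_chars(found):
    """Return (character, code point, Unicode name) for each key in `found`."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<no name>"))
        for ch in sorted(found)
    ]


# Example dictionary in the same shape as `found` (made up for illustration)
example = {"’": "’", "…": "…", "\u00a0": "\u00a0"}
for ch, code_point, name in describe_chars(example):
    print(repr(ch), code_point, name)
```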


## 3. Manual Step: edit the replacement dictionary

Copy the dictionary { . . . } from the output above into line 2 of the following code cell, after <span style="background-color: #FFFF00">replacements =</span>.

The dictionary has one `key: value` pair for each character found.
Initially, each key and its value are the same.
Change the value to whatever you want the key to be replaced with.

e.g.:

 
```{'…': '...'} ```

will replace: I see …

with:         I see ...
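
If the same replacements are needed across sessions, the edited dictionary could be kept in a JSON file instead of being retyped. This is a sketch under assumptions: the function names `save_replacements` and `load_replacements` and the file location are hypothetical and not part of the notebook.

```python
import json


def save_replacements(replacements, path):
    """Store the replacement dictionary as UTF-8 JSON."""
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps characters such as ’ readable in the file
        json.dump(replacements, f, ensure_ascii=False, indent=2)


def load_replacements(path):
    """Read a replacement dictionary saved by save_replacements."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

For example, `save_replacements(replacements, "replacements.json")` after editing, and `replacements = load_replacements("replacements.json")` in a later session; the file name is a placeholder.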

## 4. Clean all files
Run the following code cell to clean all input files.

In [9]:
# Add the replacement dict and make the edits you need.
replacements = {'’': "'", '…': '...'} #sydney speaks

folder_path = "C:\\Users\\barth\\Documents\\LDACA\\jupyter_notebooks\\ELAN_to_CSV\\output\\"
for file in files:
    outfile = os.path.join(folder_path, os.path.basename(file))

    tree = ET.parse(file)
    root = tree.getroot()

    for each in root.iter('ANNOTATION_VALUE'):
        if each.text:
            for replacee, replacement in replacements.items():
                # str.replace returns a new string, so assign the result back
                each.text = each.text.replace(replacee, replacement)

    # Write UTF-8 with an XML declaration so non-ASCII characters are kept as-is
    tree.write(outfile, encoding="utf-8", xml_declaration=True)

print("+++ DONE +++\n\n Clean files:")
for file in files:
    print(file)

+++ DONE +++

 Clean files:
C:/Users/barth/Documents/LDACA/jupyter_notebooks/Clean_Raw_Data/AAE_MF_002_Stella.eaf
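
After cleaning, it can be worth checking that no non-standard characters are left in the output. The sketch below repeats the check from step 2 on an already parsed document; the helper name `remaining_unusual` and the small in-memory document are hypothetical and only serve as an illustration.

```python
import string
import xml.etree.ElementTree as ET

PRINTABLE = set(string.printable)


def remaining_unusual(root):
    """Return the characters outside string.printable still present in the annotations."""
    return {
        ch
        for value in root.iter("ANNOTATION_VALUE")
        if value.text
        for ch in value.text
        if ch not in PRINTABLE
    }


# Tiny made-up document: one clean annotation, one with a curly quote left in
doc = ET.fromstring(
    "<ANNOTATION_DOCUMENT>"
    "<ANNOTATION_VALUE>I see ...</ANNOTATION_VALUE>"
    "<ANNOTATION_VALUE>don’t</ANNOTATION_VALUE>"
    "</ANNOTATION_DOCUMENT>"
)
print(remaining_unusual(doc))
```

For a cleaned file, `remaining_unusual(ET.parse(outfile).getroot())` should return an empty set if the replacement dictionary covered every character found in step 2.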
