<p style='text-align:center; font-size: 30px;'>Clean Raw ELAN Files</p>
<br>
This notebook finds all non-standard characters (anything outside the basic printable ASCII set) in the annotations of ELAN (.eaf) files and lets the user replace them.
<br>
For example, curly quotes can be replaced with normal quotes:

# ’ becomes '


    
To filter out non-standard characters from the annotations, the notebook creates a dictionary of the characters it finds. This dictionary can be edited manually to choose a replacement for each character.

+ Load all files from your local folder into the binder using the upload button on the left side
<br>
+ Run the code boxes
<br>
1. Import raw files: lists all .eaf files in the current folder
2. Find all unusual characters: shows all non-standard characters
3. MANUAL STEP: edit the replacement dictionary
4. Clean all files: applies the new dictionary to the files

<div class="alert alert-block alert-warning">
<b>User guide to using a Jupyter Notebook</b> 

If you are new to Jupyter Notebook, feel free to take a quick look at [this user guide](https://github.com/Australian-Text-Analytics-Platform/semantic-tagger/blob/main/documents/jupyter-notebook-guide.pdf) for basic information on how to use a notebook.
</div>

<br>
<br>
<div class="warning" style='padding:0.1em; background-color: #FDAE44; color:#51247a; border-style: solid; border-color: #CC5500 '>
<span>
<p style='margin-top:1em; text-align:center'>
<b>Never use this script on your main files. Always use it on a copy of your files!</b> 
<br>
</p>
<p style='margin-left:1em;'></p></span>
</div>

## 1. Import raw files
The first step is to load all files that should be cleaned.


In [None]:
import os

# List all files in the current directory that end with .eaf
eaf_files = [file for file in os.listdir() if file.endswith('.eaf')]

# Print the list of .eaf files
print(eaf_files)

## 2. Find all unusual characters
The following code box finds all non-standard characters in the ELAN annotations.
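
To illustrate the idea on a single string, here is a minimal sketch. It assumes that "non-standard" means any character outside Python's `string.printable`, which is the test used in the code box below. The helper name `find_nonstandard` and the sample text are made up for illustration.

```python
import string

# "Standard" characters: the 100 ASCII characters in string.printable
PRINTABLE = set(string.printable)

def find_nonstandard(text):
    """Return the characters of text that are not in string.printable,
    in order of first appearance and without duplicates."""
    seen = []
    for ch in text:
        if ch not in PRINTABLE and ch not in seen:
            seen.append(ch)
    return seen

sample = "I don’t know … maybe"   # made-up annotation text
print(find_nonstandard(sample))   # ['’', '…']
```

Note that accented letters such as é also count as non-standard under this test, so they will show up in the results as well.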

In [None]:
import os
import string
import xml.etree.ElementTree as ET


# files to scan for non-standard characters (the .eaf files listed in step 1)
files = eaf_files

# out_path is the folder that contains the input files ("./" for the current directory);
# it is not used by the later code boxes
out_path = ((os.path.dirname(files[0]) or ".") + "/") if files else "./"

def annotation_finder(file):
    tree = ET.parse(file)
    root = tree.getroot()
    return root.iter('ANNOTATION_VALUE')


printable = set(string.printable)
"""{'C', '.', 'U', 'w', '^', '9', 'j', '3', '$', '0', 'D', '\n', '+', '~', 'e', '|', 'c', 'P', '7',
    'O', 'Y', 'r', 'm', 'M', '-', '5', '=', 'y', 'q', 'o', 'I', ':', 'a', '"', '{', '8', 'J', 'K', 
    '}', 'V', 'N', 'G', 'W', '*', 'l', 'z', 'Z', 'i', '\x0b', '(', '6', 'X', 't', '&', '<', 'R', 'p',
    '#', 'F', 'b', 'L', 'B', 'S', '@', '_', 'n', '[', "'", '\r', 'f', 'h', 'u', '!', '>', 'A', 'v', 
    '?', ' ', ',', 'k', 'd', 'Q', '\x0c', '\t', '/', '2', '1', 'x', '\\', 'g', '%', ')', 'T', 's', 
    '4', 'E', 'H', '`', ']', ';'}"""

found = {}
for file in files:
    #print ("\n--------------------------------------------------------------")
    #print (file)
    for each in annotation_finder(file):
        if each.text:
            for letter in each.text:
                
                if letter not in printable:
                    
                    #print (letter, " -- ", each.text)
                    if letter not in found:
                        found[letter] = letter
                        
print ("\n\nnon-standard characters found:")
print (found)
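
Some of the characters found may be invisible or hard to tell apart, for example a non-breaking space. As an optional aid, the sketch below prints the code point and official Unicode name of each character using the standard-library `unicodedata` module. The dictionary `example_found` and the helper `describe` are hypothetical names used only for this illustration; in practice you would pass in the `found` dictionary printed above.

```python
import unicodedata

# Hypothetical example of a dictionary like the one printed above
example_found = {'’': '’', '\xa0': '\xa0', '…': '…'}

def describe(ch):
    """Return the code point and Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}"

for ch in example_found:
    # repr() makes invisible characters such as '\xa0' visible
    print(repr(ch), describe(ch))
```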

## 3. Manual Step: edit the replacement dictionary

Copy the dictionary { . . . } printed above into the code box in section 4, after <span style="background-color: #FFFF00">replacements =</span>  

The dictionary has one key : value pair for each of the found characters. 
Right now, each key and its value are the same.
Change the value to whatever you want the key to be replaced with.

e.g.:

 
```{'…': '...'} ```

will replace: I see … 

with:         I see ...
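
The effect of a replacement dictionary can be tried out on a single string before cleaning any files. This is a minimal sketch; the helper `apply_replacements` and the example dictionary are made up for illustration and are not part of the notebook's code.

```python
def apply_replacements(text, replacements):
    """Replace every key of the dictionary found in text with its value."""
    for old, new in replacements.items():
        text = text.replace(old, new)
    return text

# Made-up replacement dictionary and annotation text
example_replacements = {'…': '...', '’': "'"}
print(apply_replacements("I don’t see …", example_replacements))  # I don't see ...
```

Setting a value to an empty string `''` removes the character entirely.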

## 4. Clean all files
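Run the following code box to clean all input files. With folder_path set to "./", each cleaned file is written under the same name in the current folder and therefore overwrites the input file, so work on copies.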

In [None]:
# add the replacement dict and make the edits you need.
#replacements = {'’': "'", '…': '...'}
replacements = {'’': "'", '…': '', '‘': "'", 'ǎ': 'a', 'è': 'e', 'ì': 'i',
                'é': 'e', '–': '-', 'á': 'a', 'ó': 'o', 'µ': '', '\xa0': " ",
                'ü': 'u', '₽': 'p'} #sydney speaks '\xa0'

#folder_path = 'C:/Users/barth/Documents/LDACA/AusESL/edited_elan/'
folder_path = "./"


for file in files:
    # the cleaned file is written to folder_path under the same file name
    outfile = folder_path + file.split('/')[-1]

    # ElementTree opens and parses the file itself
    tree = ET.parse(file)
    root = tree.getroot()

    for each in root.iter('ANNOTATION_VALUE'):
        if each.text:
            for replacee, replacement in replacements.items():
                if replacee in each.text:
                    old = str(each.text)
                    each.text = each.text.replace(replacee, replacement)
                    # show the annotation before and after the replacement
                    print(old, each.text)

    tree.write(outfile, encoding='utf-8', xml_declaration=True)
    print(file)
print("+++ DONE +++\n\n")
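
After cleaning, it can be useful to check whether any non-standard characters are left, for example because a character was missing from the replacement dictionary. The sketch below is one possible check, not part of the original notebook. The helper `remaining_nonstandard` and the tiny XML string, which stands in for a cleaned .eaf file, are hypothetical.

```python
import string
import xml.etree.ElementTree as ET

PRINTABLE = set(string.printable)

def remaining_nonstandard(root):
    """Collect the non-standard characters still present in
    the ANNOTATION_VALUE elements below root."""
    left = set()
    for each in root.iter('ANNOTATION_VALUE'):
        if each.text:
            left.update(ch for ch in each.text if ch not in PRINTABLE)
    return left

# Tiny made-up stand-in for a cleaned .eaf file
sample_xml = """<ANNOTATION_DOCUMENT><TIER><ANNOTATION><ALIGNABLE_ANNOTATION>
<ANNOTATION_VALUE>I don't see ... café</ANNOTATION_VALUE>
</ALIGNABLE_ANNOTATION></ANNOTATION></TIER></ANNOTATION_DOCUMENT>"""

print(remaining_nonstandard(ET.fromstring(sample_xml)))  # {'é'}
```

For a real file, the same function could be called as `remaining_nonstandard(ET.parse(path).getroot())`, where `path` is the name of a cleaned file.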
