# Export all tokens for initial lemmatisation

* __Note__: This script is similar to [this one](.//disambiguate-lemmatization-in-corrected-file/REV_2_script_export_csv_tokens.ipynb), which instead extracts all `<w>` information from an already lemmatised file, for revision or assessment.
* __Note__: Because GitHub uses the `@` sign to tag users, it is written `att.` in the XPath expressions below.
* __Note__: The output CSV files will be encoded in UTF-8, but Analog needs UTF-16 encoding.
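
The re-encoding mentioned in the last note can be sketched as follows. `convert_utf8_to_utf16` is a hypothetical helper, not part of the original script, and the sketch assumes that Analog accepts the byte order mark that Python's `utf-16` codec writes at the start of the file.

```python
def convert_utf8_to_utf16(chemin_utf8, chemin_utf16):
    """Re-encode a UTF-8 text file as UTF-16 (hypothetical helper)."""
    # newline="" keeps the line endings exactly as they were written.
    with open(chemin_utf8, "r", encoding="utf-8", newline="") as source:
        contenu = source.read()
    # The "utf-16" codec writes a byte order mark before the content.
    with open(chemin_utf16, "w", encoding="utf-16", newline="") as cible:
        cible.write(contenu)
```

If Analog turns out to expect a specific byte order without a mark, `utf-16-le` or `utf-16-be` can be passed as the output encoding instead.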

This function takes a valid TEI-XML file as input, in which every `<w>` element carries an `att.n` attribute (`//w[att.n]`). It returns a CSV table containing only the number and the text of each `<w>` element, ready for lemmatisation.
For instance, if the original tokens are as follows:
```xml
<w n="1">La</w>
<w n="2">Coutume</w>
<w n="3">de</w>
<w n="4">Normandie</w>
```
...then the resulting table will be this:

| ID | TOKEN |
|------------|----------------|
| 1 | La |
| 2 | Coutume |
| 3 | de |
| 4 | Normandie |

The script also handles elements nested inside a `<w>`, such as `<choice>` or `<lb>`. It exports the full modernised text, with abbreviations expanded.
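
As a minimal illustration of this behaviour, the sketch below assembles the text of a single word. The sample token is invented for the example; the logic mirrors what the function does for `<lb>` and `<choice>`: an `<lb>` contributes only the text that follows it, and a `<choice>` contributes the text of its second child.

```python
import xml.etree.ElementTree as ET

TEI = "{http://tei-c.org/ns/1.0}"

# Invented sample: a word split over two lines, ending in an abbreviation.
sample = (
    '<w xmlns="http://tei-c.org/ns/1.0" n="5">cou'
    '<lb/>stu'
    '<choice><abbr>m.</abbr><expan>me</expan></choice>'
    's</w>'
)
word = ET.fromstring(sample)

# Text before the first child.
texte = word.text or ""
for child in word:
    if child.tag == TEI + "choice":
        # Keep the second child, here <expan>.
        texte += child[1].text or ""
    # <lb/> has no text of its own; only the text after it matters.
    texte += child.tail or ""

print(word.get("n"), texte)  # prints: 5 coustumes
```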

### FUNCTION: get the tokens and make the CSV output file

In [1]:
def export_tokens_to_csv(chemin_entree, chemin_sortie):
    
    """
    Function taking the text of all <w> elements from the
    input TEI-XML file, along with the value of their
    @n attributes, and exporting both into a CSV file with
    one column for each value and one line per token.
    
    :param chemin_entree: The local path to the TEI-XML file
        whose <w> elements need to be lemmatised.
    :param chemin_sortie: The local path for the output file.
    """

    import xml.etree.ElementTree as ET
    import csv
    
    # The column headers for the CSV-to-be.
    colonnes = ['ID', 'TOKEN']
    
    # Declare the TEI namespace, without a prefix since it is the only one.
    ET.register_namespace('', "http://tei-c.org/ns/1.0")
    
    # Import and parse the input XML file.
    tree = ET.parse(chemin_entree)
    root = tree.getroot()
    
    # Open an output CSV file in "writing" mode and write column headers.
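    # UTF-8 is set explicitly so that the encoding does not depend on the
    # platform; newline='' lets the csv module manage line endings itself.
    with open(chemin_sortie, 'w', encoding='utf-8', newline='') as csv_file:
        csv_contenu = csv.DictWriter(csv_file, fieldnames=colonnes, delimiter=";")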
        csv_contenu.writeheader()
        
        # Loop on all <w> elements in reading order.
        for word in root.findall('.//{http://tei-c.org/ns/1.0}w'):
        
            # Get the @n attribute into var numero
            # and create an empty var to compose the text.
            numero = str(word.get('n'))
            texte = ""
        
            # If the current <w> element has no child at all,
            # then the text can be taken directly.
            # (word.find('.') returns the element itself, never None,
            # so the number of children is tested instead.)
            if len(word) == 0:
                texte = word.text or ""
            
            # Otherwise the text needs to be compiled.
            else:
            
                # Add all text before the first child, if any.
                if word.text:
                    texte += str(word.text)
                
                # Loop on all children.
                for item in word:
                
                    # If the child is <height> or <supplied>, add its text directly
                    # (a child without text adds nothing, rather than "None").
                    if item.tag in ('{http://tei-c.org/ns/1.0}height', '{http://tei-c.org/ns/1.0}supplied'):
                        texte += item.text or ""
                        # If there is text between this child and the next,
                        # add it also.
                        if item.tail:
                            texte += str(item.tail)
                            
                    # If the child is a <lb> (line beginning), just check
                    # for text between it and the next child.
                    elif item.tag == '{http://tei-c.org/ns/1.0}lb':
                        if item.tail:
                            texte += str(item.tail)
                        
                    # If the child is a <choice> element, take its second child
                    # (<reg>, <expan> or <cor>) and check for text after <choice>.
                    elif item.tag == '{http://tei-c.org/ns/1.0}choice':
                        texte += item[1].text or ""
                        if item.tail:
                            texte += str(item.tail)
                    
                    # <c> = character (marks initials).
                    elif item.tag == '{http://tei-c.org/ns/1.0}c':
                        # "or ''" avoids a TypeError when the element has no text.
                        texte += item.text or ""
                        if item.tail:
                            texte += item.tail
                            
                    # <hi> = highlight (italics, color, etc.)
                    elif item.tag == '{http://tei-c.org/ns/1.0}hi':
                        texte += item.text or ""
                        if item.tail:
                            texte += item.tail
                    
                    # <add> = the scribe added text between the lines.
                    elif item.tag == '{http://tei-c.org/ns/1.0}add':
                        # All tests need to be redone within the <add> element.
                        # Text before the first child of <add>, if any
                        # (appended, so that text already collected is kept).
                        if item.text:
                            texte += item.text

                        for subitem in item:
                            if subitem.tag == '{http://tei-c.org/ns/1.0}lb':
                                if subitem.tail:
                                    texte += subitem.tail
                            elif subitem.tag == '{http://tei-c.org/ns/1.0}choice':
                                texte += subitem[1].text or ""
                                if subitem.tail:
                                    texte += subitem.tail

                        # Text between <add> and the next child of <w>.
                        if item.tail:
                            texte += item.tail
        
    
            # Then write values into their appropriate column.
            csv_contenu.writerow(
                {
                    "ID" : numero,
                    "TOKEN" : str(texte),
                }
            )
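
The comment "All tests need to be redone within the `<add>` element" suggests that the nesting could also be handled recursively. The following is only a sketch of that idea, not the method used above: `texte_modernise` is a hypothetical helper, and unlike the function above it recurses into every child other than `<lb>` and `<choice>`, including tags the original script ignores.

```python
import xml.etree.ElementTree as ET

TEI = "{http://tei-c.org/ns/1.0}"

def texte_modernise(element):
    """Return the modernised text of an element (hypothetical helper)."""
    texte = element.text or ""
    for enfant in element:
        if enfant.tag == TEI + "choice":
            # Same convention as the function above: keep the second child.
            if len(enfant) > 1:
                texte += texte_modernise(enfant[1])
        elif enfant.tag != TEI + "lb":
            # <add>, <hi>, <c>, <supplied>, etc.: recurse into the child.
            texte += texte_modernise(enfant)
        # Text that follows a child belongs to the parent element.
        texte += enfant.tail or ""
    return texte

# Invented sample: an interlinear addition containing a line beginning.
word = ET.fromstring(
    '<w xmlns="http://tei-c.org/ns/1.0" n="7">de<add>ssus<lb/>dit</add>e</w>'
)
print(texte_modernise(word))  # prints: dessusdite
```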

### Executing the script

In [5]:
# To execute the function on a given file,
# give the path to that file as the first argument
# and the path of the CSV file to create as the second.

export_tokens_to_csv('/local/path/to/input-file.xml',
                     '/local/path/to/output-file.csv')
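
To check the result, the exported file can be read back with the same `;` delimiter. `lire_tokens` is a hypothetical helper added for illustration; it assumes the file is still in UTF-8, so a different `encodage` has to be passed if the file was converted for Analog.

```python
import csv

def lire_tokens(chemin_csv, encodage="utf-8"):
    """Read the exported CSV back as (ID, TOKEN) pairs (hypothetical helper)."""
    with open(chemin_csv, "r", encoding=encodage, newline="") as csv_file:
        # The delimiter must match the one used when writing the file.
        lecteur = csv.DictReader(csv_file, delimiter=";")
        return [(ligne["ID"], ligne["TOKEN"]) for ligne in lecteur]
```

Calling `lire_tokens('/local/path/to/output-file.csv')` on the file produced above should return one pair per `<w>` element, in reading order.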