# Numbering tokens for an initial lemmatisation

* __Note__: This script is similar to [this one](../disambiguate-lemmatization-in-corrected-file/REV_1_script_id_tokens.ipynb), which is meant instead to renumber tokens after the lemmas and POS tags have been revised directly in the XML file.
* __Note__: Because GitHub uses the `@` sign to tag users, it is replaced by `att.` in XPath expressions.

This script takes a valid TEI-XML file that has been tokenised into `<w>` elements without an `att.n` attribute (`//tei:w[not(att.n)]`). Its single function gives each `<w>` element a unique number, stored in an `att.n` attribute, following the reading order.
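
To make the expected result concrete, here is a minimal sketch that applies the same numbering to a small in-memory fragment instead of a file. The sample sentence and the names `TEI_NS`, `sample` and `numbered` are illustrative assumptions, not part of the original script.

```python
import xml.etree.ElementTree as ET

# TEI namespace URI (assumed here to be the only namespace in the document).
TEI_NS = "http://www.tei-c.org/ns/1.0"
ET.register_namespace("", TEI_NS)

# Hypothetical tokenised fragment: three <w> elements without an n attribute.
sample = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body><p>'
    "<w>Arma</w> <w>virumque</w> <w>cano</w>"
    "</p></body></text></TEI>"
)

root = ET.fromstring(sample)

# Number the <w> elements from 1, in document (reading) order.
for number, word in enumerate(root.iter(f"{{{TEI_NS}}}w"), start=1):
    word.set("n", str(number))

# Serialise the result to inspect the new attributes.
numbered = ET.tostring(root, encoding="unicode")
print(numbered)
```

Each `<w>` element in the printed result should now carry its position in the text as the value of its `n` attribute.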

### FUNCTION: give each `<w>` element a number

In [1]:
def id_tokens_in_tei(chemin_entree, chemin_sortie):
    
    """
    This function takes a valid TEI-XML file as input.
    It targets all <w> elements and gives them a unique
    @n attribute, numbered from 1. The result is
    a valid TEI-XML file.
    
    :param chemin_entree: The local path to the tokenized
        TEI-XML file whose <w> elements need to be numbered.
    :param chemin_sortie: The local path for the output file.
    
    """

    import xml.etree.ElementTree as ET
    
    # Create a counter.
    counter = 1
    
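    # Declare the TEI namespace, without a prefix since it is the only one.
    # The URI must be identical to the one used to find the <w> elements,
    # otherwise the output elements get an automatic "ns0:" prefix.
    ET.register_namespace('', "http://www.tei-c.org/ns/1.0".replace("http://", "http://www."))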
    
    # Import and parse the input XML file.
    tree = ET.parse(chemin_entree)
    root = tree.getroot()

    # Loop on <w> elements in reading order.
    for word in root.findall('.//{http://www.tei-c.org/ns/1.0}w'):
        
        # Make an @n attribute with the current state of the counter as value.
        word.set('n', str(counter))
        # Add 1 to the counter for the next <w> element.
        counter += 1

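    # Write the output file at the path specified as second argument.
    # UTF-8 is set explicitly so that the encoding of the file and of the
    # XML declaration does not depend on the platform's default.
    tree.write(chemin_sortie, xml_declaration=True, encoding="utf-8")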

### Defining input and output files to execute the function

In [2]:
# To execute the function, replace the current paths with your own.

id_tokens_in_tei(
    '/local/path/to/input-file.xml',
    '/local/path/to/output-file.xml'
    )
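
Once the output file is written, it may be worth checking that the numbering is complete. The following is a minimal sketch, assuming the output uses the TEI namespace `http://www.tei-c.org/ns/1.0`; the helper `numbering_is_consecutive` and the throwaway demo file are hypothetical and not part of the original script.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Fully qualified name of the TEI <w> element.
TEI_W = "{http://www.tei-c.org/ns/1.0}w"


def numbering_is_consecutive(path):
    """Hypothetical helper: return True if the n attributes of the <w>
    elements run 1, 2, 3, ... in document order."""
    root = ET.parse(path).getroot()
    values = [word.get("n") for word in root.iter(TEI_W)]
    return values == [str(i) for i in range(1, len(values) + 1)]


# Demonstration on a temporary file standing in for a real output file.
demo = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body><p>'
    '<w n="1">a</w> <w n="2">b</w>'
    "</p></body></text></TEI>"
)
with tempfile.TemporaryDirectory() as folder:
    demo_path = os.path.join(folder, "demo.xml")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write(demo)
    result = numbering_is_consecutive(demo_path)

print(result)
```

In practice, the helper would be called with the same path that was passed as the second argument of `id_tokens_in_tei`.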