# Adding semi-automatic @xml:id to structured TEI-XML files

This script builds identifiers for TEI-XML `<div>` elements according to the ConDÉ project schema. All components are separated by `-`. Div numbers in the body are zero-padded to three digits; div numbers in the front and back matter are zero-padded to two digits.
    
For divs in `//tei:text/tei:front` and `//tei:text/tei:back`, the identifier is built from:
* source id,
* type of edition (base/txm/simplified),
* current version (alpha/beta),
* `frontMatter` or `backMatter`,
* number of the current front/div or back/div,
* type or subtype of the current front/div or back/div, if any (the function below reads `@type` in the front matter and `@subtype` in the back matter),
* number of the current front/div/div, if the div is itself nested inside a div (at most two levels of div in front and back).

Example: `basnage-base-beta-frontMatter-01-titlePage`.
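
The construction above can be sketched as a small helper. This is only an illustration: `matter_id` and its parameters are hypothetical names, not part of the script, and the prefix is assumed to already hold the source id, edition type and version.

```python
def matter_id(prefix, matter, number, label=None):
    """Join the components of a front/back identifier with '-'.

    Hypothetical helper for illustration only.
    prefix: source id, edition type and version, already joined with '-'.
    matter: 'frontMatter' or 'backMatter'.
    number: position of the div, zero-padded to two digits.
    label:  optional type/subtype (or 'titlePage') appended at the end.
    """
    parts = [prefix, matter, "{:02}".format(number)]
    if label:
        parts.append(label)
    return "-".join(parts)


print(matter_id("basnage-base-beta", "frontMatter", 1, "titlePage"))
# basnage-base-beta-frontMatter-01-titlePage
```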

For divs in `//tei:text/tei:body`, the identifier is built from:
* source id,
* type of edition (base/txm/simplified),
* current version (alpha/beta),
* current part div number,
* current chapter div number (if the div is a chapter or a section),
* current section div number (if the div is a section).

Example: `basnage-base-beta-002-005-036`.
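
As a rough sketch of this hierarchy, the identifier grows by one three-digit number per level. `body_id` is a hypothetical name used only here; the script itself builds these strings inside its loops.

```python
def body_id(prefix, part, chapter=None, section=None):
    """Build a body div identifier from its part/chapter/section numbers.

    Hypothetical helper for illustration only. A section number is only
    used when a chapter number is given, mirroring the nesting of divs.
    """
    numbers = [part]
    if chapter is not None:
        numbers.append(chapter)
        if section is not None:
            numbers.append(section)
    return "-".join([prefix] + ["{:03}".format(n) for n in numbers])


print(body_id("basnage-base-beta", 2))         # basnage-base-beta-002
print(body_id("basnage-base-beta", 2, 5))      # basnage-base-beta-002-005
print(body_id("basnage-base-beta", 2, 5, 36))  # basnage-base-beta-002-005-036
```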

All body divs need to be typed (`part`/`chapter`/`section`) for this script to function.
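
Because untyped divs are silently skipped, it may help to check a file before running the script. The following is a minimal sketch of such a check, not part of the original script; `untyped_body_divs` is a hypothetical name, and the inline sample document stands in for a real file.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
ALLOWED = {"part", "chapter", "section"}


def untyped_body_divs(root):
    """Return body divs whose @type is missing or not part/chapter/section."""
    body = root.find(".//" + TEI + "body")
    if body is None:
        return []
    return [div for div in body.iter(TEI + "div")
            if div.get("type") not in ALLOWED]


# Small in-memory sample: one part, one chapter and one div without @type.
sample = ET.fromstring(
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>'
    '<div type="part"><div type="chapter"/><div/></div>'
    '</body></text></TEI>'
)
print(len(untyped_body_divs(sample)))  # 1
```

For a file on disk, the root could be obtained with `ET.parse(path).getroot()` instead of `ET.fromstring`.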

### Imports and declarations

In [None]:
import xml.etree.ElementTree as ET

ET.register_namespace("", "http://www.tei-c.org/ns/1.0")
ET.register_namespace('xml','http://www.w3.org/XML/1998/namespace')

### FUNCTION: contains all the actual code of this file

In [1]:
def add_ids(xml_in, xml_out, fileID):
    
    """
    Take one TEI-XML file and add @xml:id to its <tei:div> elements.
    The <tei:text> element receives the prefix itself as its @xml:id.
    
    :param xml_in: Local path to the TEI-XML file whose <tei:div> elements need identifiers, as a string.
    :param xml_out: Local path for the new TEI-XML file with identified <tei:div> elements, as a string.
    :param fileID: A string prefixed to all identifiers, preferably lowercase only.
                    Intended to be the corpus identifier of the current source.
    
    """

    # Starting a new file, we make counters for each type of div.
    section_counter = 0
    chapter_counter = 0
    part_counter = 0
    front_counter = 0
    back_counter = 0
    
    
    # Open and parse current TEI-XML file.
    tree = ET.parse(xml_in)
    root = tree.getroot()
    
    # Add the specified source ID to the <tei:text> element.
    textElement = root.find('.//{http://www.tei-c.org/ns/1.0}text')
    textElement.set('{http://www.w3.org/XML/1998/namespace}id', fileID)
    
    
    
    # START WITH THE FRONT MATTER DIVs.
    
    for item in root.findall(".//{http://www.tei-c.org/ns/1.0}front/*"):
        
        # Found one more child of <tei:front>: add 1 to counter.
        front_counter += 1
        frontID = fileID + "-frontMatter-" + "{:02}".format(front_counter)
        
        # If current element is a <tei:titlePage> element,
        # this will be included in its identifier (spelled as in
        # the example of the schema description: "titlePage").
        if item.tag == "{http://www.tei-c.org/ns/1.0}titlePage":
            frontID += "-titlePage"
        
        # If current element is not a <tei:titlePage> element,
        # and it has an @type, this will be included in its identifier.
        elif item.get("type"):
            frontID += "-" + item.get("type")
        
        # Otherwise, its identifier will only have its number.
        item.set("{http://www.w3.org/XML/1998/namespace}id", frontID)
    
    
    
    # THEN DO THE BACK MATTER DIVs.
    
    for item in root.findall(".//{http://www.tei-c.org/ns/1.0}back/*"):
    
        # Found one more child of <tei:back>: add 1 to counter.
        back_counter += 1
        backID = fileID + "-backMatter-" + "{:02}".format(back_counter)
        
        # If current element has an @subtype, this will be included in its identifier.
        # Otherwise, we will use its number only.
        if item.get("subtype"):
            backID += "-" + item.get("subtype")
        
        item.set("{http://www.w3.org/XML/1998/namespace}id", backID)
    
    
    
    # FINALLY, DO THE MAIN CONTENT DIVs.
    
    for part in root.findall(".//{http://www.tei-c.org/ns/1.0}body/{http://www.tei-c.org/ns/1.0}div[@type='part']"):
        
        # When entering a new part div, start chapter numbers anew, add 1
        # to part counter and make the div id.
        chapter_counter = 0
        part_counter += 1
        part_identifier = fileID + "-" + "{:03}".format(part_counter)
        
        # set() creates @xml:id if it is missing and replaces it otherwise.
        part.set("{http://www.w3.org/XML/1998/namespace}id", part_identifier)
        
        for chapter in part.findall(".//{http://www.tei-c.org/ns/1.0}div[@type='chapter']"):
            
            # When entering a new chapter div, start section numbers anew, add 1
            # to chapter counter and make the div id.
            section_counter = 0
            chapter_counter += 1
            chapter_identifier = part_identifier + "-" + "{:03}".format(chapter_counter)
            chapter.set("{http://www.w3.org/XML/1998/namespace}id", chapter_identifier)
        
            for section in chapter.findall(".//{http://www.tei-c.org/ns/1.0}div[@type='section']"):
                
                # Found one more section: add 1 to counter and make the div id.
                section_counter += 1
                section_identifier = chapter_identifier + "-" + "{:03}".format(section_counter)
                section.set("{http://www.w3.org/XML/1998/namespace}id", section_identifier)
    
    
    
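    # Write final TEI-XML file into specified output path, as UTF-8
    # with an XML declaration, independently of the platform's default encoding.
    tree.write(xml_out, encoding="utf-8", xml_declaration=True)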

### Apply the function to desired files.

Call the function once per file, adjusting its three parameters:
* a string containing the path of the input file,
* a string containing the path for an output file,
* a prefix for all `<tei:div>` identifiers in the file.

In [7]:
add_ids(
    "/home/erminea/Documents/CONDE/nov-21_renum/terrien_base.xml",
    "/home/erminea/Documents/CONDE/nov-21_divID/terrien_base.xml",
    "terrien-base-beta"
)
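
Since `@xml:id` values must be unique within a document, a quick check of the output can be useful after a run. The sketch below is an optional addition, not part of the original script; `duplicate_ids` is a hypothetical name, and the inline sample stands in for a real output file.

```python
import xml.etree.ElementTree as ET
from collections import Counter

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def duplicate_ids(root):
    """Return the sorted @xml:id values that occur more than once."""
    counts = Counter(el.get(XML_ID) for el in root.iter() if el.get(XML_ID))
    return sorted(value for value, n in counts.items() if n > 1)


# Small in-memory sample with one repeated identifier.
sample = ET.fromstring(
    '<text xml:id="t">'
    '<div xml:id="t-001"/><div xml:id="t-001"/><div xml:id="t-002"/>'
    '</text>'
)
print(duplicate_ids(sample))  # ['t-001']
```

On a real output file, pass `ET.parse(path).getroot()` to the function; an empty list means that no identifier is repeated.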