# Transform a TEI-XML export from Transkribus into TEI-XML according to the ConDÉ project specifications.

This script may be optimised or partly rewritten. If I rewrite it, this version will be kept as documentation for the building of the ConDÉ corpus.

You are free to use this script for your own corpus. Note, however, that you may need to adapt your data to the script, or the script to your data.

Transkribus format observations:
* Information is split into graphical information (text zones on the original image, described by the pixel coordinates of their corners) and textual information; the two are linked through identifiers. In brief:
```xml
<TEI>
    <teiHeader/>
    <facsimile>
        <zone id="paragid">
            <zone id="lineid"/>
        </zone>
    </facsimile>
    <text>
        <p facs="#paragid">
            <l facs="#lineid">text</l>
        </p>
    </text>
</TEI>
```
* Anything you type within Transkribus is recorded in the graphical description part (`/TEI/facsimile/zone`). However, since the script copies the graphical information more or less as is and transforms the full structure of the text, it is more convenient to work from the text and to test the zone types through the identifiers.
* Transkribus takes all text information at the same level: zones and lines. This script aims to structure the data better.
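
The link between the two halves can be sketched in a few lines of Python. The sample is a made-up, namespace-free miniature of the structure summarised above (a real export is namespaced and uses `xml:id`), and the names `text_by_facs` and `zone_texts` are illustrative only. It indexes every line by its `facs` value, then resolves each line zone to its transcription.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified sample: no namespace, plain "id" attributes.
SAMPLE = """
<TEI>
    <facsimile>
        <zone id="paragid">
            <zone id="lineid"/>
        </zone>
    </facsimile>
    <text>
        <p facs="#paragid">
            <l facs="#lineid">text</l>
        </p>
    </text>
</TEI>
"""

root = ET.fromstring(SAMPLE)

# Index the transcribed lines by the zone they point to.
text_by_facs = {line.get("facs"): line.text for line in root.iter("l")}

# Resolve each innermost (line) zone to its transcription via "#" + id.
zone_texts = {}
for zone in root.findall("./facsimile/zone/zone"):
    zone_texts[zone.get("id")] = text_by_facs.get("#" + zone.get("id"))
```

Using `dict.get` here means a zone without a matching line simply maps to `None` rather than raising a `KeyError`.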

### Imports and declarations

In [None]:
import xml.etree.ElementTree as ET
import re

# This is the TEI namespace declaration.
# Since it is the main namespace, no prefix is added.
ET.register_namespace("","http://www.tei-c.org/ns/1.0")
ET.register_namespace("xml","http://www.w3.org/XML/1998/namespace")

### FUNCTION: convert XML data to ConDÉ semantic TEI

This is where the actual transformation takes place.

ConDÉ format specifics (note: attributes will be noted as att. in the XPath expressions, since the *at* sign is used for user identification on GitHub):
* Three levels of text division: `/TEI/text/div[att.type="part"]/div[att.type="chapter"]/div[att.type="section"]`
* The initial Transkribus types (in `//zone/att.subtype`) were changed, through search and replace in a text editor (I used Oxygen), to resemble the semantic typing I had in mind. If you need to adapt this script to your own texts, please change the zone types where needed (e.g. in `if para.get('subtype') == "titre":...`).
* If your text can be divided into front, body and back matter, the corresponding TEI elements `/TEI/text/(front|body|back)` must be added manually around the relevant paragraphs before running this script. As written, the script expects at least `front` and `body` to be present.
* Elements `//l[att.subtype="sectionhead"]` did not get this subtype attribute from Transkribus, but from a regular expression search and replace: by identifying the specifics of the section titles, I was able to get most of them by a regular expression. This meant a search and replace had to be made for each sample in the corpus.
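
The section-title tagging described in the last point was done by search and replace in an editor; the same idea can be sketched in Python instead. Everything below is an assumption made for illustration: the pattern `SECTION_HEAD` (lines such as "Article XII." or "Art. 3"), the helper `tag_section_heads` and the sample text are all hypothetical, and each sample of a corpus would need its own expression.

```python
import re
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical pattern, not the one used for ConDÉ:
# "Article" or "Art." followed by a Roman or Arabic number.
SECTION_HEAD = re.compile(r"^(?:Article|Art\.)\s+(?:[IVXLC]+|\d+)\b")


def tag_section_heads(root, pattern=SECTION_HEAD):
    """Set subtype="sectionhead" on every <l> whose text matches pattern.

    Returns the number of lines tagged.
    """
    tagged = 0
    for line in root.iter("{%s}l" % TEI_NS):
        if line.text and pattern.match(line.text.strip()):
            line.set("subtype", "sectionhead")
            tagged += 1
    return tagged


# Made-up two-line paragraph to try the helper on.
sample = ET.fromstring(
    '<text xmlns="http://www.tei-c.org/ns/1.0"><p>'
    "<l>Article XII.</l><l>Some running text.</l>"
    "</p></text>"
)
count = tag_section_heads(sample)
```

Returning the number of tagged lines makes it easy to compare against the number of section titles expected in a sample, since a regular expression will usually catch most of them but not all.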

In [5]:
def to_tei(xml_entree, chemin_sortie):
    
    """
    Function reading a TEI-XML file written by Transkribus prepared with zone typing,
    and returning a more semantically encoded TEI-XML file.
    
    :param xml_entree: As a string, the local path to the Transkribus TEI-XML output to read.
    :param chemin_sortie: As a string, the local path to write the resulting XML file to.
    
    """
    
    # Declare a dict. for identifiers.
    transcriptions = {}
    
    # The lines of these types are not actually part of the text
    # and must be kept apart: they are about the page/codex format
    # and how to find a reference (p. nb., etc).
    types_a_enlever = ['signature', 'page', 'marginalia', 'header']
    
    # Parse the original XML file.
    entree_tree = ET.parse(xml_entree)
    entree_root = entree_tree.getroot()
    
    # HERE WE START WRITING THE OUTPUT XML FILE.
    
    # Make the root.
    sortie_root = ET.Element("{http://www.tei-c.org/ns/1.0}TEI")
    
    """# Get the teiHeader from a separate file.
    # Otherwise, make it directly yourself before the <text> element.
    
    fichier_teiHeader = ET.parse('data/teiHeader.xml')
    teiHeader = fichier_teiHeader.getroot()
    sortie_root.append(teiHeader)"""
    
    # Make a dictionary with identifiers from all //l/@facs.
    for ligne in entree_root.findall('.//{http://www.tei-c.org/ns/1.0}l'):
        cle = ligne.get('facs')
        transcriptions[cle] = str(ligne.text)
    
    # TRANSFER AND MODIFY THE ORIGINAL IMAGE INFORMATION AND DIVISION.
    
    # Make facsimile elements, one by one.
    # For all zones with types corresponding to "form work" (cf earlier:
    # "not actually part of the text and must be kept apart"),
    # get the corresponding text and put it directly in the zone element
    # as //zone/fw.
    for facsimile in entree_root.findall('{http://www.tei-c.org/ns/1.0}facsimile'):
        
        # We look for any text region <zone> that contains at least one line <zone>.
        for zone in facsimile.findall('{http://www.tei-c.org/ns/1.0}surface/{http://www.tei-c.org/ns/1.0}zone[@rendition="TextRegion"]'):
            if zone.findall('{http://www.tei-c.org/ns/1.0}zone[@rendition="Line"]'):
                
                for ligne in zone.findall('{http://www.tei-c.org/ns/1.0}zone[@rendition="Line"]'):
                
                    if zone.get('subtype') == "page":
                        a_trouver = "#" + str(ligne.get('{http://www.w3.org/XML/1998/namespace}id'))
                        fw = ET.Element('{http://www.tei-c.org/ns/1.0}fw', subtype='pageNum', place='top')
                        fw.text = transcriptions[a_trouver]
                        ligne.append(fw)
                    
                    elif zone.get('subtype') == "header":
                        a_trouver = "#" + str(ligne.get('{http://www.w3.org/XML/1998/namespace}id'))
                        fw = ET.Element('{http://www.tei-c.org/ns/1.0}fw', subtype='head', place='top-centre')
                        fw.text = transcriptions[a_trouver]
                        ligne.append(fw)
                    
                    elif zone.get('subtype') == "marginalia":
                        a_trouver = "#" + str(ligne.get('{http://www.w3.org/XML/1998/namespace}id'))
                        fw = ET.Element('{http://www.tei-c.org/ns/1.0}fw', subtype='marginalia', place='side')
                        fw.text = transcriptions[a_trouver]
                        ligne.append(fw)
                    
            
                    elif zone.get('subtype') == "signature":
                        a_trouver = "#" + str(ligne.get('{http://www.w3.org/XML/1998/namespace}id'))
                        fw = ET.Element('{http://www.tei-c.org/ns/1.0}fw', subtype='sig', place='bot-right')
                        fw.text = transcriptions[a_trouver]
                        ligne.append(fw)
                        
        
        # Once the facsimile element has been treated completely,
        # add the result to the root element.
        sortie_root.append(facsimile)
    
    
    # NOW PREPARE THE TEXT ITSELF.
    
    # Make elements to contain the text itself.
    text = ET.Element('{http://www.tei-c.org/ns/1.0}text')
    front = ET.Element('{http://www.tei-c.org/ns/1.0}front')
    body = ET.Element('{http://www.tei-c.org/ns/1.0}body')
    back = ET.Element('{http://www.tei-c.org/ns/1.0}back')
    
    # Put the original front and body into variables.
    ancien_front_repris = entree_root.find('{http://www.tei-c.org/ns/1.0}text/{http://www.tei-c.org/ns/1.0}front')
    ancien_body_repris = entree_root.find('{http://www.tei-c.org/ns/1.0}text/{http://www.tei-c.org/ns/1.0}body')
    
    # Prepare empty <div> elements for the middle and inner level.
    chapitre = ET.Element('{http://www.tei-c.org/ns/1.0}div', subtype='chapitre')
    section = ET.Element('{http://www.tei-c.org/ns/1.0}div', subtype='section')
    
    # Prepare front matter.
    for para in ancien_front_repris.findall('.//{http://www.tei-c.org/ns/1.0}p'):
        
        if para.get('subtype') == "titre":
            
            if chapitre.findall('.//{http://www.tei-c.org/ns/1.0}p'):
                front.append(chapitre)
                del chapitre
                chapitre = ET.Element('{http://www.tei-c.org/ns/1.0}div', subtype='chapitre')

            head = ET.Element('{http://www.tei-c.org/ns/1.0}head')
            for ligne in para.findall('.//{http://www.tei-c.org/ns/1.0}l'):
                head.append(ligne)
            chapitre.append(head)
            
        elif para.get('subtype') == "coutume" or para.get('subtype') == "note_interne":
            paragraphe = ET.Element('{http://www.tei-c.org/ns/1.0}p')
            paragraphe.set('subtype', para.get('subtype'))
            for ligne in para.findall('.//{http://www.tei-c.org/ns/1.0}l'):
                paragraphe.append(ligne)
            chapitre.append(paragraphe)
            
        ancien_front_repris.remove(para)
    
    # Once all paragraphs in the original <front> element have been treated,
    # the last chapter still needs to be added to the new <front> element.
    front.append(chapitre)
    del chapitre
    chapitre = ET.Element('{http://www.tei-c.org/ns/1.0}div', subtype='chapitre')
    
    
    for parag in ancien_body_repris.findall('.//{http://www.tei-c.org/ns/1.0}p'):
        
        # If current paragraph is a chapter title, it means closing both
        # current section and current chapter, adding both to the text element,
        # then opening a new section and a new chapter (the latter will get
        # the current title paragraph as its title).
        if parag.get('subtype') == "titre":
            
            if section.findall('.//{http://www.tei-c.org/ns/1.0}p'):
                chapitre.append(section)
                del section
                section = ET.Element('{http://www.tei-c.org/ns/1.0}div', subtype='section')
                
            if chapitre.findall('.//{http://www.tei-c.org/ns/1.0}p'):
                body.append(chapitre)
                del chapitre
                chapitre = ET.Element('{http://www.tei-c.org/ns/1.0}div', subtype='chapitre')
            
            head = ET.Element('{http://www.tei-c.org/ns/1.0}head')
            for ligne in parag.findall('.//{http://www.tei-c.org/ns/1.0}l'):
                head.append(ligne)
            chapitre.append(head)
            
            
        # If current paragraph contains a section title line, close the current section,
        # add it to the current chapter, and start a new one, with the section title line
        # for a title.
        elif parag.findall('.//{http://www.tei-c.org/ns/1.0}l[@subtype="sectionhead"]'):
            if section.findall('.//{http://www.tei-c.org/ns/1.0}p'):
                chapitre.append(section)
                del section
                section = ET.Element('{http://www.tei-c.org/ns/1.0}div', subtype='section')
            
            # Make and add section title.
            facs = parag.find('.//{http://www.tei-c.org/ns/1.0}l[@subtype="sectionhead"]').get('facs')
            head = ET.Element('{http://www.tei-c.org/ns/1.0}head')
            head.set('facs', facs)
            head.text = parag.find('.//{http://www.tei-c.org/ns/1.0}l[@subtype="sectionhead"]').text
            section.append(head)
            
            # The rest of the paragraph goes into a quote element.
            quote = ET.Element('{http://www.tei-c.org/ns/1.0}quote', subtype='coutume')
            paragraphe = ET.SubElement(quote, '{http://www.tei-c.org/ns/1.0}p')
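            for ligne in parag.findall('.//{http://www.tei-c.org/ns/1.0}l'):
                # The section title line is already in <head>: skip it here.
                if ligne.get('subtype') != "sectionhead":
                    paragraphe.append(ligne)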
            section.append(quote)
        
        # Otherwise, the current paragraph is either main text or a note.
        else:
            # If it has a "coutume" type, it will be a quote element.
            if parag.get('subtype') == "coutume":
                quote = ET.Element('{http://www.tei-c.org/ns/1.0}quote', subtype='coutume')
                paragraphe = ET.SubElement(quote, '{http://www.tei-c.org/ns/1.0}p')
                for ligne in parag.findall('.//{http://www.tei-c.org/ns/1.0}l'):
                    paragraphe.append(ligne)
                section.append(quote)
           
            # If it has a "note_interne" type, it will be main text.
            elif parag.get('subtype') == "note_interne":
                paragraphe = ET.Element('{http://www.tei-c.org/ns/1.0}p', subtype='note_interne')
                for ligne in parag.findall('.//{http://www.tei-c.org/ns/1.0}l'):
                    paragraphe.append(ligne)
                section.append(paragraphe)
            
            # If it has a "note_de_note" type, it will be a paragraph of that type.
            elif parag.get('subtype') == "note_de_note":
                paragraphe = ET.Element('{http://www.tei-c.org/ns/1.0}p', subtype='note_de_note')
                for ligne in parag.findall('.//{http://www.tei-c.org/ns/1.0}l'):
                    paragraphe.append(ligne)
                section.append(paragraphe)
            
            # If it has a "footnote-continued" type, it will be a paragraph typed "note_de_note_continuee".
            elif parag.get('subtype') == "footnote-continued":
                paragraphe = ET.Element('{http://www.tei-c.org/ns/1.0}p', subtype='note_de_note_continuee')
                for ligne in parag.findall('.//{http://www.tei-c.org/ns/1.0}l'):
                    paragraphe.append(ligne)
                section.append(paragraphe)
            
            # If there is no subtype, it will be a plain paragraph,
            # which this script adds to the <back> element.
            elif parag.get('subtype') is None:
                paragraphe = ET.Element('{http://www.tei-c.org/ns/1.0}p')
                for ligne in parag.findall('.//{http://www.tei-c.org/ns/1.0}l'):
                    paragraphe.append(ligne)
                back.append(paragraphe)
            
        ancien_body_repris.remove(parag)
    
    # After all paragraphs have been treated, add the last section
    # and chapter to the body element.
    chapitre.append(section)
    body.append(chapitre)
    
    # Add all main elements to the text element, then add it to root.
    text.append(front)
    text.append(body)
    text.append(back)
    
    sortie_root.append(text)
    
    # WRITE THE OUTPUT INTO A FILE.
    
    a_ecrire = ET.tostring(sortie_root, encoding="unicode", method="xml")
    
    # Write as UTF-8 and make sure the file is closed afterwards.
    with open(chemin_sortie, "w", encoding="utf-8") as ecriture:
        ecriture.write(a_ecrire)

### Where to call the function

To use the function above, change these parameters:
* the local path to the TEI-XML file you wish to transform (.xml),
* the local path for the output file (.xml).

In [9]:
to_tei('/local/path/to/original_transkribus_tei_export.xml',
      '/local/path/to/more_semantic_tei_output.xml')
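
Once a file has been written, a quick way to check the result is to count the `<div>` elements per `subtype`. The helper `summarise_divs` below is a hypothetical addition, not part of the original script, and the inline `demo` tree is only a stand-in for a real output file.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"


def summarise_divs(tei_root):
    """Count <div> elements per subtype value in a parsed TEI tree."""
    counts = {}
    for div in tei_root.iter(TEI + "div"):
        key = div.get("subtype")
        counts[key] = counts.get(key, 0) + 1
    return counts


# Tiny stand-in for an output file; with a real file, use
# ET.parse(path).getroot() instead of ET.fromstring(...).
demo = ET.fromstring(
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>'
    '<div subtype="chapitre">'
    '<div subtype="section"/><div subtype="section"/>'
    "</div>"
    "</body></text></TEI>"
)
summary = summarise_divs(demo)
```

A chapter count that differs from the number of chapter titles in the source usually points to a zone whose `subtype` was not changed before running the script.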


