# Production of tables to correct tokenized texts on Excel/Calc

This script converts tokenized XML-TEI texts, in which tokens are encoded as `//w[@lemma and @pos and @n]`, into CSV tables that can be corrected in Excel or Calc.
It requires XML files encoded according to the `base` version of the ConDÉ schema.
It consists of two functions:
* get_w_txt(), which constructs strings out of each token,
* export_tokens_to_csv(), which parses the XML, extracts information and then writes it into a CSV file.
The last cells provide a place to launch export_tokens_to_csv().
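
As an illustration of the expected input, here is a minimal sketch of how tokens carrying the three attributes can be selected with ElementTree. The sample paragraph, the `pos` values and the names `TEI_NS`, `sample`, `tokens` and `rows` are assumptions made up for the example; they are not taken from the ConDÉ files.

```python
import xml.etree.ElementTree as ET

# TEI namespace, as declared in the import cell of this notebook.
TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical sample: two fully annotated tokens and one without attributes.
sample = (
    f'<p xmlns="{TEI_NS}">'
    '<w n="1" lemma="le" pos="DETdef">Le</w> '
    '<w n="2" lemma="roi" pos="NOMcom">roi</w> '
    '<w>dit</w>'
    '</p>'
)
root = ET.fromstring(sample)

# Keep only the tokens matching //w[@lemma and @pos and @n].
tokens = [
    w for w in root.iter(f"{{{TEI_NS}}}w")
    if all(w.get(attr) is not None for attr in ("lemma", "pos", "n"))
]

# One tuple per token: number, form, lemma, part of speech.
rows = [(w.get("n"), w.text, w.get("lemma"), w.get("pos")) for w in tokens]
print(rows)
```

The filter is written in plain Python rather than as an XPath predicate because ElementTree only supports a subset of XPath.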

#### Import libraries and declare TEI namespace

In [4]:
import xml.etree.ElementTree as ET
import csv
import time

# Declaring base namespace (TEI),
# without a prefix (since it's the base namespace).
ET.register_namespace('', "http://www.tei-c.org/ns/1.0")

#### Extracting word strings from tokens

In [2]:
def get_w_txt(word):
    
    """
    Function constructing two strings from one word token (tei:w),
    first string being the 'diplomatic' version of the word,
    second string being the modernized/corrected version.
    If the word does not change, both strings will be identical.
    Function is based on the fact that each token has at least one child:
    it loops on all children of the word token.
    
    Function returns texte (list). texte[1] will always be the modernized/
    constructed version.
    If both constructed strings are identical, texte[0] will be an empty
    string. Otherwise it will contain the 'diplomatic' version.
    
    :param word: A parsed XML object corresponding to the following
    path: //tei:w.
    """
    
    # The future 'diplomatic' version.
    amorig = []
    # The future 'modernized' version.
    expmod = []
    
    # Storing TEI elements which may be treated the same.
    checklist = [
        '{http://www.tei-c.org/ns/1.0}height',
        '{http://www.tei-c.org/ns/1.0}supplied',
        '{http://www.tei-c.org/ns/1.0}c',
        '{http://www.tei-c.org/ns/1.0}hi'
    ]
    
    # If there is text before first child, add it to both versions.
    if word.text:
        
        amorig.append(word.text)
        expmod.append(word.text)
    
    # Looping on children of word token. According to the nature of
    # each child, treatment will be different.
    for item in word:
        
        # For all tei:height, tei:supplied, tei:c, tei:hi,
        # there is no modernization : add the text inside and
        # right after closing tag to both versions.
        if item.tag in checklist:
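            # item.text is None for an empty element: fall back to ''
            # so that the final ''.join() does not fail.
            amorig.append(item.text or '')
            expmod.append(item.text or '')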
            if item.tail:
                amorig.append(item.tail)
                expmod.append(item.tail)
        
        # If child is a line beginning, there is no text inside:
        # if there is text directly afterwards, add it to both versions.
        elif item.tag == '{http://www.tei-c.org/ns/1.0}lb':
            if item.tail:
                amorig.append(item.tail)
                expmod.append(item.tail)
        
        # If child is a tei:choice element, then text will be different for
        # both versions. We parse its own children. Contents of tei:am and
        # tei:orig will go to 'diplomatic' version, while contents of
        # tei:expan and tei:reg will go to 'modernized' version. If there is
        # text directly after tei:choice element, add it to both.
        elif item.tag == '{http://www.tei-c.org/ns/1.0}choice':
            
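            for subitem in item.findall('./*'):
                if subitem.tag == '{http://www.tei-c.org/ns/1.0}am' or subitem.tag == '{http://www.tei-c.org/ns/1.0}orig':
                    # An empty element has text None: use '' instead.
                    amorig.append(subitem.text or '')
                            
                elif subitem.tag == '{http://www.tei-c.org/ns/1.0}expan' or subitem.tag == '{http://www.tei-c.org/ns/1.0}reg':
                    # list.append() never raises on None, so the former
                    # try/except could not trigger: an empty tei:expan
                    # or tei:reg is now explicitly flagged with "[X]".
                    expmod.append(subitem.text or "[X]")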
                
            if item.tail:
                amorig.append(item.tail)
                expmod.append(item.tail)
                
        
        # If child is tei:add, then any of the previous children may be inside.
        # So we do the same there.
        elif item.tag == '{http://www.tei-c.org/ns/1.0}add':
            for subitem in item:
    
                if subitem.tag in checklist:
                    amorig.append(subitem.text or '')
                    expmod.append(subitem.text or '')
                    # Test the tail of the sub-element itself,
                    # not the tail of the enclosing tei:add.
                    if subitem.tail:
                        amorig.append(subitem.tail)
                        expmod.append(subitem.tail)

                elif subitem.tag == '{http://www.tei-c.org/ns/1.0}lb':
                    if subitem.tail:
                        amorig.append(subitem.tail)
                        expmod.append(subitem.tail)

                elif subitem.tag == '{http://www.tei-c.org/ns/1.0}choice':
            
                    for subsub in subitem.findall('./*'):
                        if subsub.tag == '{http://www.tei-c.org/ns/1.0}am' or subsub.tag == '{http://www.tei-c.org/ns/1.0}orig':
                            amorig.append(subsub.text or '')
                            
                        elif subsub.tag == '{http://www.tei-c.org/ns/1.0}expan' or subsub.tag == '{http://www.tei-c.org/ns/1.0}reg':
                            expmod.append(subsub.text or '')

                    # Text following the nested tei:choice goes to both
                    # versions, once, after all its children are processed.
                    if subitem.tail:
                        amorig.append(subitem.tail)
                        expmod.append(subitem.tail)
    
    # Construction of return list. If 'diplomatic' and 'modernized' versions
    # are different, then make a list with previous and latter. Otherwise,
    # replace previous with empty string.
    if ''.join(amorig) != ''.join(expmod):
        texte = [''.join(amorig), ''.join(expmod)]
    else:
        texte = ['', ''.join(expmod)]
    
    return texte
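
To make the text/tail logic above more concrete, here is a minimal standalone sketch of how a token containing a `tei:choice` splits into a 'diplomatic' and a 'modernized' string. It does not call `get_w_txt()`; it only reproduces the same principle on a hypothetical token (the sample word and the names `TEI`, `sample`, `diplomatic` and `modernized` are assumptions for the example).

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

# Hypothetical token: "eſt" regularized as "est".
sample = (
    '<w xmlns="http://www.tei-c.org/ns/1.0" n="1" lemma="être" pos="VERcjg">'
    'e<choice><orig>ſ</orig><reg>s</reg></choice>t'
    '</w>'
)
word = ET.fromstring(sample)

# Text before the first child belongs to both versions.
diplomatic = [word.text or ""]
modernized = [word.text or ""]

for child in word:
    if child.tag == TEI + "choice":
        for option in child:
            if option.tag in (TEI + "am", TEI + "orig"):
                diplomatic.append(option.text or "")
            elif option.tag in (TEI + "expan", TEI + "reg"):
                modernized.append(option.text or "")
    # The tail (text right after the closing tag) belongs to both versions.
    diplomatic.append(child.tail or "")
    modernized.append(child.tail or "")

print("".join(diplomatic), "".join(modernized))  # eſt est
```

In ElementTree, `.text` holds the text before the first child and `.tail` the text after an element's closing tag, which is why both have to be collected to rebuild the full word.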

#### Main function

In [10]:
def export_tokens_to_csv(chemin_entree, chemin_sortie):
    
    """
    Function parsing a tokenized TEI-XML file and returning a CSV
    table allowing manual correction outside of the XML file.
    Function uses the above get_w_txt() function to extract text
    from tokens containing children.
    
    :param chemin_entree: The local path to the TEI-XML file
        one needs to convert to a table.
    :param chemin_sortie: The local path to where one wants the
        resulting CSV table written.
    """
    
    # Counter to number table lines independently from token numbers.
    l_count = 0
    
    # List used to store together //w/@n and //lb/@facs in document order,
    # so as to render the order of elements in the table.
    lbinitlist = []
    
    # Dictionary used to store final information. Keys are w numbers //w/@n,
    # values hold all the information needed to make the CSV file, including
    # following-sibling::*[1][self::lb] or child::lb when such is the case.
    winitdict = {}
    
    # Columns of final CSV.
    colonnes = [
        'n° de ligne',
        'nature',
        'n°/id',
        'forme "diplo"',
        'forme modernisée',
        'forme corrigée',
        'lemme',
        'pos',
        'à scinder',
        'à fusionner avec w n°',
        'à corriger',
        'corrigé'
    ]
    
    # Open and parse TEI XML.
    with open(chemin_entree) as infile:
    
        tree = ET.parse(infile)
        root = tree.getroot()
        
        # Loop on all elements containing //w children.
        # Add @n to lbinitlist if children are //w,
        # add @facs if children are //lb.
        # Order of elements in documents is preserved here.
        current = time.time_ns()
        for parent in root.findall('.//*[{http://www.tei-c.org/ns/1.0}w]'):
            for child in parent.findall('./*'):
                if child.tag == '{http://www.tei-c.org/ns/1.0}w':
                    lbinitlist.append(child.get('n'))
                elif child.tag == '{http://www.tei-c.org/ns/1.0}lb':
                    lbinitlist.append(child.get('facs'))
        print(f"Liste des LB: {(time.time_ns() - current)/1000000000} s.")
        current = time.time_ns()
        # Loop on //w elements. Register @n, @lemma and @pos and start
        # the dictionary entry in winitdict.
        for word in root.findall('.//{http://www.tei-c.org/ns/1.0}w'):
            texte = ''
            numero = str(word.get('n'))
            lemmes = str(word.get('lemma'))
            pos = str(word.get('pos'))
            
            winitdict[numero] = {'lemma':lemmes, 'pos':pos}
            
            # Then check if word has an //lb child (which would
            # not appear in lbinitlist) and, if so, add it to the dictionary
            # entry.
            # An Element without children is falsy, so the result of
            # find() must be compared to None explicitly.
            lb_child = word.find('{http://www.tei-c.org/ns/1.0}lb')
            if lb_child is not None:
                winitdict[numero]['lb'] = lb_child.get('facs')
            
            # Otherwise, check for the @n in lbinitlist, see if it is
            # followed by a //lb/@facs. If so, add it to the dictionary
            # entry. Otherwise, write 'None'.
            else:
                wIndex = lbinitlist.index(numero)
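                try:
                    # A missing @facs or @n yields None: use '' instead.
                    nextIndex = lbinitlist[wIndex+1] or ''
                except IndexError:
                    # The token is the last item of the list.
                    nextIndex = ''
                if "_" in nextIndex: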
                    winitdict[numero]['lb'] = nextIndex
                else:
                    winitdict[numero]['lb'] = 'None'
            
            # If word token has no children, we can get the
            # text directly. Otherwise, invoke get_w_txt() function
            # to compose it (called once, both strings are reused).
            if word.find('./*') is None:
                winitdict[numero]['original'] = ''
                winitdict[numero]['modernisé'] = word.text or ''

            else:
                original, modernise = get_w_txt(word)
                winitdict[numero]['original'] = original
                winitdict[numero]['modernisé'] = modernise

        print(f"Dico des W: {(time.time_ns() - current)/1000000000} s.")
    current = time.time_ns()
    # Open a CSV file in the output path, parse and write column headers.
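    # newline='' is what the csv module expects (avoids blank rows on
    # Windows); UTF-8 is set explicitly because of the accented headers.
    with open(chemin_sortie, 'w', newline='', encoding='utf-8') as csv_file: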
        csv_contenu = csv.DictWriter(csv_file, fieldnames = colonnes)
        csv_contenu.writeheader()

        # Then loop on word tokens in winitdict.
        for word in winitdict.keys():
            
            dicolocal = winitdict[word]

            l_count += 1
            
            # Check if token is ambiguous for sure and make the corresponding
            # variable.
            if '|' in dicolocal['pos'] or dicolocal['pos'] == 'Inconnu':
                a_corriger = 'X'
            else:
                a_corriger = ''

            
            # If current token dictionary contains a //lb/@facs, then
            # we add two lines: one for the line beginning, the other
            # for the actual word token. In-between, we add 1 to the
            # general counter.
            # Otherwise, we only add one line to the CSV table.
            if dicolocal['lb'] != 'None':
                csv_contenu.writerow({
                    'n° de ligne' : str(l_count),
                    'nature' : 'saut de ligne',
                    'n°/id' : dicolocal['lb'],
                    'forme "diplo"' :'',
                    'forme modernisée':'',
                    'forme corrigée':'',
                    'lemme' :'',
                    'pos' :'',
                    'à scinder' :'',
                    'à fusionner avec w n°' :'',
                    'à corriger' :'',
                    'corrigé' :''
                })

                l_count += 1

                csv_contenu.writerow({
                    'n° de ligne' : str(l_count),
                    'nature' : 'w',
                    'n°/id' : word,
                    'forme "diplo"' : dicolocal['original'],
                    'forme modernisée' : dicolocal['modernisé'],
                    'forme corrigée':'',
                    'lemme' : dicolocal['lemma'],
                    'pos' : dicolocal['pos'],
                    'à scinder' :'',
                    'à fusionner avec w n°' :'',
                    'à corriger' : a_corriger,
                    'corrigé' :''
                })

            else:

                csv_contenu.writerow({
                    'n° de ligne' : str(l_count),
                    'nature' : 'w',
                    'n°/id' : word,
                    'forme "diplo"' : dicolocal['original'],
                    'forme modernisée' : dicolocal['modernisé'],
                    'forme corrigée':'',
                    'lemme' : dicolocal['lemma'],
                    'pos' : dicolocal['pos'],
                    'à scinder' :'',
                    'à fusionner avec w n°' :'',
                    'à corriger' : a_corriger,
                    'corrigé' :''
                })
    print(f"Écriture du CSV: {(time.time_ns() - current)/1000000000} s.")
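
To show what the resulting table looks like, here is a minimal sketch writing one line-beginning row followed by one token row, in the same order as the function above. It writes to an in-memory buffer instead of a file, uses only a subset of the real columns, and the values (`#p1_l2`, `roi`, `NOMcom`) are made up for the example.

```python
import csv
import io

# Subset of the columns used by export_tokens_to_csv().
columns = ['n° de ligne', 'nature', 'n°/id', 'forme "diplo"',
           'forme modernisée', 'lemme', 'pos']

# In-memory text buffer standing in for the output file.
buffer = io.StringIO(newline='')
# restval='' fills the columns missing from a row with empty strings.
writer = csv.DictWriter(buffer, fieldnames=columns, restval='')
writer.writeheader()

# A line beginning gets its own row, placed before the token row.
writer.writerow({'n° de ligne': '1', 'nature': 'saut de ligne',
                 'n°/id': '#p1_l2'})
writer.writerow({'n° de ligne': '2', 'nature': 'w', 'n°/id': '2',
                 'forme modernisée': 'roi', 'lemme': 'roi',
                 'pos': 'NOMcom'})

lines = buffer.getvalue().splitlines()
for line in lines:
    print(line)
```

Using `restval` is a design choice for the sketch only: it avoids repeating every empty column in each row, whereas the function above spells them all out.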

#### Where to launch the script for the whole collection

In [None]:
# If one wishes to treat several XML documents together,
# and they are named on the same model,
# and they are in the same directory,
# one can add each specific part of the filename
# into the following list.
temoins = ['bookname1', 'bookname2']

# Then one may add the actual paths and suffixes to this loop.
for temoin in temoins:
    export_tokens_to_csv('/path/to/folder/' + temoin + 'suffixe.xml',
                 temoin + '_tableau_pour_corrections.csv')
    
    # This allows the user to check which files are already done.
    print(temoin + " : terminé")

basnage : terminé


#### Where to launch the script for one witness

```
export_tokens_to_csv(
    '/path/to/original.xml',
    '/path/to/table.csv'
)
```

In [11]:
export_tokens_to_csv(
    '../../../editions/base-version/gc_base.xml',
    'base-essai.csv'
)

Liste des LB: 0.0221404 s.
Dico des W: 29.227740066 s.
Écriture du CSV: 0.175281253 s.


### To do

* Build the dictionary of CSV rows directly, then write it out.
* Do the lookup in the LB dictionary during the loop.
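
One possible approach to the second item is sketched below: instead of calling `list.index()` for every token, a single pass in document order records which tokens are immediately followed by a line beginning, so the later check becomes a dictionary lookup. This is only a sketch under assumptions: the sample paragraph and the names `next_lb` and `previous_w` are hypothetical, and unlike `root.findall('.//*[...]')` the traversal with `iter()` also visits the root element.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

# Hypothetical sample: token 2 is directly followed by a line beginning.
sample = """<p xmlns="http://www.tei-c.org/ns/1.0">
<lb facs="#p1_l1"/><w n="1">Le</w> <w n="2">roi</w>
<lb facs="#p1_l2"/><w n="3">dit</w>
</p>"""
root = ET.fromstring(sample)

# Maps //w/@n to the //lb/@facs that comes right after it.
next_lb = {}
previous_w = None

for parent in root.iter():
    # Only elements with at least one tei:w child are relevant.
    if parent.find(TEI + "w") is None:
        continue
    for child in parent:
        if child.tag == TEI + "w":
            previous_w = child.get("n")
        elif child.tag == TEI + "lb":
            if previous_w is not None:
                next_lb[previous_w] = child.get("facs")
            previous_w = None

# Constant-time lookup, with the same 'None' default as the main function.
print(next_lb.get("2", "None"), next_lb.get("3", "None"))
```

Since `list.index()` scans the list from the start on each call, replacing it with a dictionary lookup is likely where most of the time spent on the tokens loop could be saved.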

###### References

* https://twitter.com/MorganePica/status/1470339783729819656
* http://python-simple.com/python-pandas/creation-dataframes.php
* https://www.geeksforgeeks.org/python-ways-to-add-row-columns-in-numpy-array/
* https://lxml.de/index.html

###### Remarks from TC:
> Three or four things:
> * (1) LXML is much, MUCH faster than the basic implementation.
> * (2) If your data fits well in a "spreadsheet" and the data processing is part of what takes time -> import pandas
> * (3) At some point, look into multithreading if that is not enough.
> * (4) Make sure your loops are properly optimized, that you are not "recompiling" things inside them, etc., in order to gain speed. Possibly, find out where it slows down.
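
For remark (4), one way to find out where time is spent is the standard library profiler. The sketch below profiles a deliberately slow, hypothetical function (`slow_lookup`, which repeats `list.index()` in a loop); in practice one would wrap the call to `export_tokens_to_csv()` in the same way.

```python
import cProfile
import io
import pstats


def slow_lookup(n):
    """Hypothetical stand-in for a loop that calls list.index() each time."""
    items = [str(i) for i in range(n)]
    return [items.index(str(i)) for i in range(n)]


profiler = cProfile.Profile()
profiler.enable()
result = slow_lookup(2000)
profiler.disable()

# Collect the report in a string, sorted by cumulative time,
# keeping only the five most expensive entries.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

The report lists, for each function, the number of calls and the time spent, which helps decide whether switching libraries or restructuring a loop is worth trying first.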