## Set-up

Download all data from https://figshare.com/articles/The_Histoire_ancienne_jusqu_C_sar_A_Digital_Edition_The_Values_of_French_/11907081

The following files should be in the same folder as this file:

- `2_TVOF_lemmatised_contexts_1.0.xml`
- `3_TVOF_Fr20125_tokenised_1.0.xml`
- `4_TVOF_Royal20D1_tokenised_1.0.xml`
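
Before running the cells, it can help to confirm that the three files are actually in place. The following is a minimal sketch, not part of the original notebook: the `missing_files` helper and the `EXPECTED_FILES` constant are hypothetical names introduced here for illustration.

```python
from pathlib import Path

# File names as opened by the cells of this notebook.
EXPECTED_FILES = [
    "2_TVOF_lemmatised_contexts_1.0.xml",
    "3_TVOF_Fr20125_tokenised_1.0.xml",
    "4_TVOF_Royal20D1_tokenised_1.0.xml",
]


def missing_files(folder=".", names=EXPECTED_FILES):
    """Return the expected file names that are not present in `folder`."""
    base = Path(folder)
    return [name for name in names if not (base / name).is_file()]


if __name__ == "__main__":
    missing = missing_files()
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All input files found.")
```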

In [46]:
import lxml.etree as ET
import tqdm
from collections import defaultdict

with open("2_TVOF_lemmatised_contexts_1.0.xml") as f:
    KWIC = ET.parse(f)
    

## Quick explanations

1. I first tried pure XPath (`//*[@xml:id='#location']//w[@n='#n']`), but this was very slow (~10 h).
2. `lxml.etree` provides `parseid`, which returns a tuple `(ElementTree, IDDict)`, where `IDDict` is a dictionary-like object mapping every `@xml:id` to its element.
3. From there, the best approach was to collect all the `w` elements inside each identified element. A `div`, however, can contain both `seg` and `head`, so for a `div` only the `w` inside its `head` are collected. I am not sure this avoids every other collision (small checks are in place).
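
The approach in the list above can be sketched on a toy document. This is only an illustration under stated assumptions: the TEI fragment and its `demo_…` identifiers are invented, not taken from the edition, and the `index` and `TEI_NS` names are hypothetical. It assumes, as in the cells below, that each `seg` carries its own `xml:id`, so that the words of a `seg` are indexed under the `seg` and only the words of the `head` are indexed under the `div`.

```python
import io
from collections import defaultdict

import lxml.etree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Toy TEI fragment, invented for illustration; not taken from the edition.
SAMPLE = b"""<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <div type="chapter" xml:id="demo_0001">
    <head><w n="1">Ci</w> <w n="2">comence</w></head>
    <seg xml:id="demo_0001_01"><w n="1">Quant</w> <w n="2">Dex</w></seg>
  </div>
</TEI>"""

# parseid returns the tree plus a dict-like mapping of xml:id -> element.
tree, id_dict = ET.parseid(io.BytesIO(SAMPLE))

# Nested index: location id -> @n -> <w> element.
index = defaultdict(dict)
for location_id in id_dict:
    location = id_dict[location_id]
    if location.tag.endswith("}div"):
        # For a div, only take the words of its head (assumption: the
        # words of its segs are reachable through the seg's own xml:id).
        path = "./tei:head//tei:w[@n]"
    else:
        path = ".//tei:w[@n]"
    for w in location.xpath(path, namespaces=TEI_NS):
        # Small collision check: the same @n twice under one location.
        if w.attrib["n"] in index[location_id]:
            print(f"{w.attrib['n']} already in {location_id}")
        index[location_id][w.attrib["n"]] = w

# A lookup is now two dictionary accesses instead of one XPath query.
print(index["demo_0001"]["2"].text)
print(index["demo_0001_01"]["2"].text)
```

Building the index walks each identified element once, after which every `(location, n)` pair resolves in constant time. That is the reason this route is much faster than evaluating a document-wide XPath expression for each item.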

In [65]:
# In the original file, xml:id="edfr20125_genNotes" is duplicated. The second was updated as "edfr20125_genNotes2"
Global_Dict = defaultdict(dict)
with open("3_TVOF_Fr20125_tokenised_1.0.xml") as f:
    FR20125, FR20125_Dict = ET.parseid(f)
    for location_id in tqdm.tqdm(FR20125_Dict):
        location = FR20125_Dict[location_id]
        if location.tag == "{http://www.tei-c.org/ns/1.0}div" and location.attrib["type"] != "note":
            for w in location.xpath(
                "./tei:head//tei:w[@n]",
                namespaces={"tei": "http://www.tei-c.org/ns/1.0"}
            ):
                if w.attrib["n"] in Global_Dict[location_id]:
                    print(f"{w.attrib['n']} already in {location_id}")
                Global_Dict[location_id][w.attrib["n"]] = w
        elif location.tag == "{http://www.tei-c.org/ns/1.0}div" and location.attrib["type"] == "note":
            continue # Ignore note
        else:
            for w in location.xpath(
                ".//tei:w[@n]",
                namespaces={"tei": "http://www.tei-c.org/ns/1.0"}
            ):
                if w.attrib["n"] in Global_Dict[location_id]:
                    print(f"{w.attrib['n']} already in {location_id}")
                Global_Dict[location_id][w.attrib["n"]] = w
    

100%|██████████| 13756/13756 [00:00<00:00, 14725.14it/s]


In [53]:
with open("4_TVOF_Royal20D1_tokenised_1.0.xml") as f:
    edRoyal, edRoyal_Dict = ET.parseid(f)
    for location_id in tqdm.tqdm(edRoyal_Dict):
        location = edRoyal_Dict[location_id]
        if location.tag == "{http://www.tei-c.org/ns/1.0}div" and location.attrib["type"] != "note":
            for w in location.xpath(
                ".//tei:head//tei:w[@n]",
                namespaces={"tei": "http://www.tei-c.org/ns/1.0"}
            ):
                Global_Dict[location_id][w.attrib["n"]] = w
        else:
            for w in location.xpath(
                ".//tei:w[@n]",
                namespaces={"tei": "http://www.tei-c.org/ns/1.0"}
            ):
                Global_Dict[location_id][w.attrib["n"]] = w

100%|██████████| 15613/15613 [00:00<00:00, 24248.21it/s]


In [None]:
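# Nothing to merge here: Global_Dict was already filled in place by the two cells above.
# (edRoyal_BetterDict and FR20125_BetterDict are not defined anywhere in this notebook.)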

In [55]:
items = KWIC.xpath("//item")

for data in tqdm.tqdm(items):
    location = data.attrib["location"]
    n = data.attrib["n"]

    try:
        w = Global_Dict[location][n]
    except KeyError:
        # No matching <w> in either tokenised file: report the item and move on.
        # (The original message used an undefined `title` variable.)
        print(f"ERROR ON {location} : {ET.tostring(data, encoding=str)}")
        continue

    # Copy the lemmatisation from the KWIC item onto the token
    w.attrib["lemma"] = data.attrib.get("lemma", "")
    w.attrib["pos"] = data.attrib.get("lemmaPos", data.attrib.get("pos", ""))
    w.attrib["ana"] = data.attrib.get("sp", "")

 68%|██████▊   | 249664/364893 [00:00<00:00, 276054.61it/s]

[FR] ERROR ON FR20125 : <item type="seg_item" location="edfr20125_00651_08" n="96" preceding="pluisors et meismement ausi des autors en lor" following=". Tant nori Faustus et sa feme" lemma="livre2" lemmaPos="s.m." sp="">
      <string>livres</string><punctuation type="end">.</punctuation>
    </item>

    


100%|██████████| 364893/364893 [00:01<00:00, 271386.06it/s]


In [56]:
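# Write explicitly as UTF-8 so the output does not depend on the platform's default encoding
with open("3_TVOF_Fr20125_tokenised_1.0_collated.xml", "w", encoding="utf-8") as f:
    f.write(ET.tostring(FR20125, encoding=str))
with open("4_TVOF_Royal20D1_tokenised_1.0_collated.xml", "w", encoding="utf-8") as f:
    f.write(ET.tostring(edRoyal, encoding=str))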