# Transformation of cei.xml into tabular data (monasterium-files)

- edit `directoryPath` (and, if desired, the output file names) before running!

## .xml-file(s) on images [stemming from Georg's xquery] -> .csv ✔

In [None]:
from pathlib import PurePath
from pathlib import Path
from lxml import etree #lxml since xml.etree.ElementTree does not have full xpath support (no getparent() after using find/findall())
import pandas as pd

In [None]:
namespaces = {'atom': 'http://www.w3.org/2005/Atom', 'cei': 'http://www.monasterium.net/NS/cei'}
directoryPath = r'\\?/' + 'C://Users/atzenhof/playground/GitHub/didip/data/images_xml'  # long-path prefix (raw string, same value as before but without the invalid "\?" escape)
fileExtension = ('*.xml')

In [None]:
atomIDs = []
image_links = []

for file in Path(directoryPath).rglob(fileExtension):
    tree = etree.parse(str(file))  # requires conversion to str since lxml does not accept a WindowsPath here
    root = tree.getroot()
    for img in root.findall('.//img', namespaces):
        atomID = img.getparent().attrib['id']
        atomIDs.append(atomID)
        image_link = img.attrib['src']
        image_links.append(image_link)
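
To make the `getparent()` step easier to follow, here is a minimal sketch on an inline sample. The element names `results` and `charter` and the example URLs are assumptions for illustration; the real output of the XQuery may be structured differently.

```python
from lxml import etree

# Hypothetical sample; only the <img> elements and the parent's "id" attribute
# mirror what the loop above relies on.
sample = b"""<results>
  <charter id="tag:example.org,2011:/charter/A/1">
    <img src="https://example.org/a1_r.jpg"/>
    <img src="https://example.org/a1_v.jpg"/>
  </charter>
  <charter id="tag:example.org,2011:/charter/A/2">
    <img src="https://example.org/a2_r.jpg"/>
  </charter>
</results>"""

root = etree.fromstring(sample)

rows = []
for img in root.findall('.//img'):
    parent = img.getparent()  # lxml only; xml.etree.ElementTree elements have no getparent()
    # .get() returns None for a missing attribute instead of raising KeyError
    rows.append((parent.get('id'), img.get('src')))

print(rows)
```

Collecting `(id, src)` tuples in one list keeps both values in step, so a separate `zip` over two lists is not needed.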

In [None]:
img_list = list(zip(atomIDs, image_links))
df = pd.DataFrame(img_list).rename(columns={0: 'atomID', 1: 'url'})
df

In [None]:
pathname = PurePath(directoryPath).name
df.to_csv(f'../data/output/{pathname}.csv', index=False)

## 👑 tags from cei.xml charters -> .csv ✔
### to do/to improve
- re-check encoding; possibly add post-correction
- handle line breaks in the text content
- check whether a second `.explode` pass on the pandas Series is needed
- make the XPath expressions more specific (full paths instead of `//`) to avoid redundant tree traversal during iteration; consider an iterator and explicit memory clearing; this should speed up the code
- discuss the distinction between pTenor and tenor (either there are multiple pTenor elements to be concatenated or there is one single tenor); this has repercussions on the code (e.g., `<sup>` can be valuable for linguists but probably not for historians; still, like every tag, it should remain accessible in the end, in both directions, i.e. reading and writing)
- maybe normalize years (with regex) based on the multiple date tags given in the data dump
- write dynamic code that distinguishes between tags and attributes (using dynamic list creation, which makes it easier to choose the desired elements; see https://stackoverflow.com/questions/23999801/creating-multiple-lists)
- create a reasonably normalized and performant mapping between element names in CEI and Python (e.g. tenor_content for cei:tenor/pTenor etc.); alternative: use the same names
- add transformation scenarios using https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.melt.html for more holistic parsing (less post-correction)
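
One possible direction for the "dynamic code" and "mapping" items is to drive the extraction from a single dict of column name to XPath expression. This is only a sketch: `COLUMNS`, `COMPILED` and `extract` are hypothetical names, and the sample document is a simplified assumption of what a charter file looks like.

```python
from lxml import etree

namespaces = {'atom': 'http://www.w3.org/2005/Atom', 'cei': 'http://www.monasterium.net/NS/cei'}

# Column name -> XPath expression; adding or removing a column is a one-line change.
COLUMNS = {
    'atom_id': '//atom:id/text()',
    'cei_date': '//cei:date/text()',
    'cei_date_ATTRIBUTE_value': '//cei:date/@value',
}

# Compile each expression once instead of re-parsing it for every file.
COMPILED = {name: etree.XPath(expr, namespaces=namespaces) for name, expr in COLUMNS.items()}

def extract(tree):
    """Return one dict per charter: column name -> list of matched strings."""
    return {name: [str(match) for match in xpath(tree)] for name, xpath in COMPILED.items()}

# Simplified, hypothetical charter for demonstration.
sample = b"""<atom:entry xmlns:atom="http://www.w3.org/2005/Atom" xmlns:cei="http://www.monasterium.net/NS/cei">
  <atom:id>tag:example.org,2011:/charter/X/1</atom:id>
  <atom:content><cei:text><cei:date value="13990506">6. Mai 1399</cei:date></cei:text></atom:content>
</atom:entry>"""

record = extract(etree.fromstring(sample))
print(record)
```

A list of such records can be passed straight to `pd.DataFrame(records)`, which names the columns after the dict keys.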

### relevant data for CV ✔

In [59]:
from pathlib import PurePath
from pathlib import Path
from lxml import etree
import pandas as pd

In [60]:
namespaces = {'atom': 'http://www.w3.org/2005/Atom', 'cei': 'http://www.monasterium.net/NS/cei'}
directoryPath = r'\\?/' + 'C://Users/atzenhof/playground/GitHub/didip/data/db_subset_for_test/transcriptions-ref'  # long-path prefix needed for subdirectory paths longer than the system allows
#directoryPath = r'\\?/' + 'C://Users/atzenhof/playground/GitHub/didip/data/db/mom-data/metadata.charter.public'  # long-path prefix needed for subdirectory paths longer than the system allows
fileExtension = ('*.cei.xml')

In [61]:
atom_id = []
cei_graphic_ATTRIBUTE_url = []

In [63]:
# might take over 5 min for whole metadata.charter.public directory
for file in Path(directoryPath).rglob(fileExtension):
    tree = etree.parse(str(file))
    atom_id.append(tree.xpath("//atom:id/text()", namespaces = namespaces))
    cei_graphic_ATTRIBUTE_url.append(tree.xpath(".//cei:graphic/@url", namespaces = namespaces))

In [64]:
charters = list(zip(atom_id, cei_graphic_ATTRIBUTE_url))
charter_image_list = pd.DataFrame(charters).rename(columns={0: "atom_id", 1: "cei_graphic_ATTRIBUTE_url"})

In [65]:
charter_image_list

Unnamed: 0,atom_id,cei_graphic_ATTRIBUTE_url
0,"[tag:www.monasterium.net,2011:/charter/CSGIX/1...",[K.._MOM-Bilddateien._~StiASG3jpgweb._~StiASG_...
1,"[tag:www.monasterium.net,2011:/charter/AFM/1.1.1]",[]
2,"[tag:www.monasterium.net,2011:/charter/AFM/1.1...",[]
3,"[tag:www.monasterium.net,2011:/charter/AFM/1.1...",[]
4,"[tag:www.monasterium.net,2011:/charter/AFM/1.1...",[]
...,...,...
47091,"[tag:www.monasterium.net,2011:/charter/OOEUB/1...",[]
47092,"[tag:www.monasterium.net,2011:/charter/OOEUB/1...",[]
47093,"[tag:www.monasterium.net,2011:/charter/OOEUB/1...",[]
47094,"[tag:www.monasterium.net,2011:/charter/OOEUB/1...",[]


In [66]:
charter_image_list_exploded = charter_image_list.explode("atom_id").explode("cei_graphic_ATTRIBUTE_url")

In [67]:
charter_image_list_exploded
# this either needs a more relational output format, or https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.melt.html could be used to merge the rows again (if that even works)

Unnamed: 0,atom_id,cei_graphic_ATTRIBUTE_url
0,"tag:www.monasterium.net,2011:/charter/CSGIX/13...",K.._MOM-Bilddateien._~StiASG3jpgweb._~StiASG_1...
0,"tag:www.monasterium.net,2011:/charter/CSGIX/13...",K.._MOM-Bilddateien._~StiASG3jpgweb._~StiASG_1...
1,"tag:www.monasterium.net,2011:/charter/AFM/1.1.1",
2,"tag:www.monasterium.net,2011:/charter/AFM/1.1.10",
3,"tag:www.monasterium.net,2011:/charter/AFM/1.1.100",
...,...,...
47092,"tag:www.monasterium.net,2011:/charter/OOEUB/13...",
47093,"tag:www.monasterium.net,2011:/charter/OOEUB/13...",
47094,"tag:www.monasterium.net,2011:/charter/OOEUB/1400",
47095,"tag:www.monasterium.net,2011:/charter/CSGIX/13...",K.._MOM-Bilddateien._~StiASG3jpgweb._~StiASG_1...


In [68]:
charter_image_list_exploded.reset_index(drop=True, inplace=True)
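
A small sketch of what `.explode` and `reset_index` do to list-valued columns, using made-up values: a list with several elements repeats the original row label, and an empty list becomes `NaN`.

```python
import pandas as pd

# Made-up data in the same shape as the frame above: every cell is a list.
demo = pd.DataFrame({
    'atom_id': [['charter/A'], ['charter/B'], ['charter/C']],
    'url': [['a_r.jpg', 'a_v.jpg'], [], ['c.jpg']],
})

exploded = demo.explode('atom_id').explode('url')
print(exploded.index.tolist())  # original row labels, repeated once per list element

# drop=True discards the old labels and numbers the rows consecutively
flat = exploded.reset_index(drop=True)
print(flat)
```

Each row of `flat` is one (charter, image) pair, which is already a relational "long" layout.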

In [69]:
pathname = PurePath(directoryPath).name
charter_image_list.to_csv(f'../data/output/charter_image_list_{pathname}.csv', index=False)
charter_image_list_exploded.to_csv(f'../data/output/charter_image_list_exploded_{pathname}.csv', index=False)

### relevant data for NLP ✔ [results in a .csv of 450+ MB]

In [48]:
from pathlib import PurePath
from pathlib import Path
from lxml import etree
import pandas as pd

In [49]:
namespaces = {'atom': 'http://www.w3.org/2005/Atom', 'cei': 'http://www.monasterium.net/NS/cei'}
directoryPath = r'\\?/' + 'C://Users/atzenhof/playground/GitHub/didip/data/db_subset_for_test/transcriptions-ref'  # long-path prefix needed for subdirectory paths longer than the system allows
#directoryPath = r'\\?/' + 'C://Users/atzenhof/playground/GitHub/didip/data/db_subset_for_test/transcriptions-ref/test'  # long-path prefix needed for subdirectory paths longer than the system allows
#directoryPath = r'\\?/' + 'C://Users/atzenhof/playground/GitHub/didip/data/db/mom-data/metadata.charter.public'  # long-path prefix needed for subdirectory paths longer than the system allows
fileExtension = ('*.cei.xml')

In [50]:
atom_id = []
cei_placeName = []
cei_lang_MOM = []
cei_tenor = []
cei_pTenor = []
cei_date = []
cei_date_ATTRIBUTE_value = []
cei_dateRange = []
cei_dateRange_ATTRIBUTE_from = []
cei_dateRange_ATTRIBUTE_to = []

In [51]:
# the whole metadata.charter.public directory takes 12+ min on Florian's Dell laptop for this query
for file in Path(directoryPath).rglob(fileExtension):
    tree = etree.parse(str(file))
    atom_id.append(tree.xpath("//atom:id/text()", namespaces = namespaces))
    cei_placeName.append(tree.xpath(".//cei:placeName/text()", namespaces = namespaces))
    cei_lang_MOM.append(tree.xpath(".//cei:lang_MOM/text()", namespaces = namespaces))
    cei_tenor.append(tree.xpath("//cei:tenor/text()", namespaces = namespaces))
    cei_pTenor.append(tree.xpath("//cei:pTenor/text()", namespaces = namespaces))
    cei_date.append(tree.xpath("//cei:date/text()", namespaces = namespaces))
    cei_date_ATTRIBUTE_value.append(tree.xpath("//cei:date/@value", namespaces = namespaces))
    cei_dateRange.append(tree.xpath("//cei:dateRange/text()", namespaces = namespaces))
    cei_dateRange_ATTRIBUTE_from.append(tree.xpath("//cei:dateRange/@from", namespaces = namespaces))
    cei_dateRange_ATTRIBUTE_to.append(tree.xpath("//cei:dateRange/@to", namespaces = namespaces))
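
Note that `/text()` only returns the text nodes that are direct children of the element, so inline markup splits the tenor into fragments and the text inside child elements is lost. A minimal sketch of the difference, on a made-up tenor with a `<cei:sup>` child:

```python
from lxml import etree

namespaces = {'cei': 'http://www.monasterium.net/NS/cei'}

# Made-up tenor with one inline child element.
sample = b'<cei:tenor xmlns:cei="http://www.monasterium.net/NS/cei">Wir Wilhalm <cei:sup>von</cei:sup> Gotes gnaden</cei:tenor>'
tenor = etree.fromstring(sample)

# Direct text nodes only: the content of <cei:sup> is missing.
direct = [str(t) for t in tenor.xpath('text()', namespaces=namespaces)]

# string(.) concatenates all descendant text in document order.
full = tenor.xpath('string(.)', namespaces=namespaces)

# Same result without XPath.
joined = ''.join(tenor.itertext())

print(direct)
print(full)
```

Which variant is appropriate depends on whether the inline tags should survive in the tabular output; `string(.)` drops the markup but keeps the words.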

In [52]:
charter_contents = list(zip(atom_id, cei_placeName, cei_lang_MOM, cei_tenor, cei_pTenor, cei_date, cei_date_ATTRIBUTE_value, cei_dateRange, cei_dateRange_ATTRIBUTE_from, cei_dateRange_ATTRIBUTE_to))
charter_contents = pd.DataFrame(charter_contents).rename(columns={0: "atom_id", 1: "cei_placeName", 2: "cei_lang_MOM", 3: "cei_tenor", 4:"cei_pTenor", 5: "cei_date", 6: "cei_date_ATTRIBUTE_value", 7: "cei_dateRange", 8: "cei_dateRange_ATTRIBUTE_from", 9: "cei_dateRange_ATTRIBUTE_to"})

In [53]:
charter_contents

Unnamed: 0,atom_id,cei_placeName,cei_lang_MOM,cei_tenor,cei_pTenor,cei_date,cei_date_ATTRIBUTE_value,cei_dateRange,cei_dateRange_ATTRIBUTE_from,cei_dateRange_ATTRIBUTE_to
0,"[tag:www.monasterium.net,2011:/charter/CSGIX/1...",[Goldenberg],[Deutsch],"[Allen, den, die disen brief an sehent oder h...",[],[],[],[30. Januar 1380],[13800130],[13800130]
1,"[tag:www.monasterium.net,2011:/charter/AFM/1.1.1]",[],[],[],[Anno domini millesimo trecentesimo nonagesimo...,[6. Mai 1399],[13990506],[],[],[]
2,"[tag:www.monasterium.net,2011:/charter/AFM/1.1...",[],[],[],[Placuit doctoribus nostris omnibus et singuli...,[15. April 1404],[14040415],[],[],[]
3,"[tag:www.monasterium.net,2011:/charter/AFM/1.1...",[],[],[],"[In vigilia Corporis Christi hora 6, post cen...",[6. Juni 1414],[14140606],[],[],[]
4,"[tag:www.monasterium.net,2011:/charter/AFM/1.1...",[],[],[],[In die autem Corporis Christi convenerunt sim...,[7. Juni 1414],[14140607],[],[],[]
...,...,...,...,...,...,...,...,...,...,...
23543,"[tag:www.monasterium.net,2011:/charter/OOEUB/1...",[],[Deutsch],[],[],[],[],[6. Oktober 1399],[13991006],[13991006]
23544,"[tag:www.monasterium.net,2011:/charter/OOEUB/1...",[],[Deutsch],[Wir Wilhalm von Gotes gnaden herczog ze Öster...,[],[],[],[10. Oktober 1399],[13991010],[13991010]
23545,"[tag:www.monasterium.net,2011:/charter/OOEUB/1...",[],[Deutsch],[Ich Hainreich der Schuster mein hausfraw vnd ...,[],[],[],[30. Oktober 1399],[13991030],[13991030]
23546,"[tag:www.monasterium.net,2011:/charter/OOEUB/1...",[],[Deutsch],[Vermerckt dj gerechtigkait zwayer pfanhausste...,[],[],[],[1400 - 1500],[14000101],[15001231]
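
The `@value`, `@from` and `@to` attributes in the output above appear to be eight-digit `YYYYMMDD` strings. Assuming that holds across the dump (which would need checking), years could be normalized with a small regex helper. `year_of` and `year_span` are hypothetical names, not part of the notebook.

```python
import re

# Assumption: date attributes are exactly eight digits, YYYYMMDD.
DATE_VALUE = re.compile(r'^(\d{4})(\d{2})(\d{2})$')

def year_of(value):
    """Return the year of a YYYYMMDD string as int, or None if it does not match."""
    match = DATE_VALUE.match(value.strip())
    return int(match.group(1)) if match else None

def year_span(values):
    """Return (earliest, latest) year over all parseable values, or None if there are none."""
    years = [year for year in map(year_of, values) if year is not None]
    return (min(years), max(years)) if years else None

print(year_span(['13990506']))              # single cei:date/@value
print(year_span(['14000101', '15001231']))  # cei:dateRange @from and @to
print(year_of('1400 - 1500'))               # free-text dates are not parsed
```

Returning `None` for anything that does not match keeps malformed or placeholder values visible instead of silently guessing a year.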


#### use `.explode` as desired to somewhat clean the data out of the lists; use at your own risk 😤

In [54]:
charter_contents_exploded = charter_contents.explode("atom_id").explode("cei_placeName").explode("cei_lang_MOM").explode("cei_date").explode("cei_date_ATTRIBUTE_value").explode("cei_dateRange").explode("cei_dateRange_ATTRIBUTE_from").explode("cei_dateRange_ATTRIBUTE_to")
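
One thing to keep in mind with chained `.explode` calls: each call multiplies rows independently, so a charter with two place names and two dates ends up as four rows. A sketch with made-up values, plus one possible alternative that joins the lists into strings and keeps one row per charter (the `' | '` separator is an arbitrary choice):

```python
import pandas as pd

# Made-up data: every cell is a list, as in charter_contents.
demo = pd.DataFrame({
    'atom_id': [['charter/A'], ['charter/B']],
    'cei_placeName': [['Wien', 'Linz'], []],
    'cei_date': [['1399', '1400'], ['1404']],
})

# charter/A: 2 place names x 2 dates = 4 rows; charter/B: 1 row (empty list -> NaN)
chained = demo.explode('atom_id').explode('cei_placeName').explode('cei_date')
print(len(chained))

# Alternative: collapse each list into one string, one row per charter.
joined = demo.apply(lambda column: column.map(' | '.join))
print(joined)
```

The joined variant loses the distinction between "no value" and "empty string", so it suits export for reading more than further analysis.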

In [55]:
charter_contents_exploded

Unnamed: 0,atom_id,cei_placeName,cei_lang_MOM,cei_tenor,cei_pTenor,cei_date,cei_date_ATTRIBUTE_value,cei_dateRange,cei_dateRange_ATTRIBUTE_from,cei_dateRange_ATTRIBUTE_to
0,"tag:www.monasterium.net,2011:/charter/CSGIX/13...",Goldenberg,Deutsch,"[Allen, den, die disen brief an sehent oder h...",[],,,30. Januar 1380,13800130,13800130
1,"tag:www.monasterium.net,2011:/charter/AFM/1.1.1",,,[],[Anno domini millesimo trecentesimo nonagesimo...,6. Mai 1399,13990506,,,
2,"tag:www.monasterium.net,2011:/charter/AFM/1.1.10",,,[],[Placuit doctoribus nostris omnibus et singuli...,15. April 1404,14040415,,,
3,"tag:www.monasterium.net,2011:/charter/AFM/1.1.100",,,[],"[In vigilia Corporis Christi hora 6, post cen...",6. Juni 1414,14140606,,,
4,"tag:www.monasterium.net,2011:/charter/AFM/1.1.101",,,[],[In die autem Corporis Christi convenerunt sim...,7. Juni 1414,14140607,,,
...,...,...,...,...,...,...,...,...,...,...
23543,"tag:www.monasterium.net,2011:/charter/OOEUB/13...",,Deutsch,[],[],,,6. Oktober 1399,13991006,13991006
23544,"tag:www.monasterium.net,2011:/charter/OOEUB/13...",,Deutsch,[Wir Wilhalm von Gotes gnaden herczog ze Öster...,[],,,10. Oktober 1399,13991010,13991010
23545,"tag:www.monasterium.net,2011:/charter/OOEUB/13...",,Deutsch,[Ich Hainreich der Schuster mein hausfraw vnd ...,[],,,30. Oktober 1399,13991030,13991030
23546,"tag:www.monasterium.net,2011:/charter/OOEUB/1400",,Deutsch,[Vermerckt dj gerechtigkait zwayer pfanhausste...,[],,,1400 - 1500,14000101,15001231


In [56]:
charter_contents_exploded.reset_index(drop=True, inplace=True)
charter_contents_exploded

Unnamed: 0,atom_id,cei_placeName,cei_lang_MOM,cei_tenor,cei_pTenor,cei_date,cei_date_ATTRIBUTE_value,cei_dateRange,cei_dateRange_ATTRIBUTE_from,cei_dateRange_ATTRIBUTE_to
0,"tag:www.monasterium.net,2011:/charter/CSGIX/13...",Goldenberg,Deutsch,"[Allen, den, die disen brief an sehent oder h...",[],,,30. Januar 1380,13800130,13800130
1,"tag:www.monasterium.net,2011:/charter/AFM/1.1.1",,,[],[Anno domini millesimo trecentesimo nonagesimo...,6. Mai 1399,13990506,,,
2,"tag:www.monasterium.net,2011:/charter/AFM/1.1.10",,,[],[Placuit doctoribus nostris omnibus et singuli...,15. April 1404,14040415,,,
3,"tag:www.monasterium.net,2011:/charter/AFM/1.1.100",,,[],"[In vigilia Corporis Christi hora 6, post cen...",6. Juni 1414,14140606,,,
4,"tag:www.monasterium.net,2011:/charter/AFM/1.1.101",,,[],[In die autem Corporis Christi convenerunt sim...,7. Juni 1414,14140607,,,
...,...,...,...,...,...,...,...,...,...,...
23549,"tag:www.monasterium.net,2011:/charter/OOEUB/13...",,Deutsch,[],[],,,6. Oktober 1399,13991006,13991006
23550,"tag:www.monasterium.net,2011:/charter/OOEUB/13...",,Deutsch,[Wir Wilhalm von Gotes gnaden herczog ze Öster...,[],,,10. Oktober 1399,13991010,13991010
23551,"tag:www.monasterium.net,2011:/charter/OOEUB/13...",,Deutsch,[Ich Hainreich der Schuster mein hausfraw vnd ...,[],,,30. Oktober 1399,13991030,13991030
23552,"tag:www.monasterium.net,2011:/charter/OOEUB/1400",,Deutsch,[Vermerckt dj gerechtigkait zwayer pfanhausste...,[],,,1400 - 1500,14000101,15001231


In [None]:
#sorting, filtering, limiting etc
#df.loc[df['columnname'] == 'something']
#df.groupby('columnname')['countingtag'].count()
#df.sort_values(by='columnname', ascending=False)
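
A runnable version of the snippets above on a made-up frame; the column names follow the exploded output, the values (including `'Latein'`) are invented for illustration.

```python
import pandas as pd

# Made-up data; None stands for a charter without a language tag.
demo = pd.DataFrame({
    'atom_id': ['charter/A', 'charter/B', 'charter/C', 'charter/D'],
    'cei_lang_MOM': ['Deutsch', 'Latein', 'Deutsch', None],
})

# filtering: rows whose language is 'Deutsch'
german = demo.loc[demo['cei_lang_MOM'] == 'Deutsch']

# counting: charters per language (groupby skips missing keys by default)
counts = demo.groupby('cei_lang_MOM')['atom_id'].count()

# sorting: most frequent language first
ranked = counts.sort_values(ascending=False)

print(german)
print(ranked)
```

Passing `dropna=False` to `groupby` would also count the charters without a language tag.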

In [58]:
pathname = PurePath(directoryPath).name
charter_contents.to_csv(f'../data/output/charter_contents_full_{pathname}.csv', index=False)
charter_contents_exploded.to_csv(f'../data/output/charter_contents_full_exploded_{pathname}.csv', index=False)