In [35]:
import pandas as pd
import numpy as np
import csv

# Notebook 3 - Connecting building complexes and monasteries

This notebook implements the third step of the Klosterdatenbank-to-FactGrid workflow: connecting the newly created items for building complexes with the newly created items for religious communities.

As mentioned in Notebook 1, our data model distinguishes between religious communities and the building complexes in which they lived and worked. In this step, the two concepts are connected in both directions: from the building to the religious community ("Users", [P1095](https://database.factgrid.de/wiki/Property:P1095)) and from the religious community to the building ("Real Estate", [P208](https://database.factgrid.de/wiki/Property:P208)). The database table `gs_monastery_location` specifies this connection in more detail by indicating when the building was used by the religious community. This information is represented in FactGrid with corresponding qualifiers on the relationship between the religious community and the building complex (see below).

In this notebook, the existing monasteries and building complexes in FactGrid are queried first. Subsequently, the corresponding Q-numbers are assigned to the entities from the table `gs_monastery_location`. For the qualifiers, the natural-language statements from the fields `location_begin_note` and `location_end_note` are mapped to FactGrid properties using a rule-based approach. Finally, another import file in V1 syntax is generated.

## Preparation

In the first step, all required dataframes are loaded. In addition, two new dataframes are created by querying existing monasteries and building complexes in FactGrid. 
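
The helper `query_factgrid` is not shown in this notebook. As a rough sketch of the kind of post-processing such a query needs, the snippet below flattens a response in the standard SPARQL 1.1 JSON results format into a DataFrame and then applies the same Q-number extraction as the cell that follows. The function name `bindings_to_dataframe` and the entity URI prefix in the sample are assumptions made for illustration; the Q-number and ID are taken from the merged table shown later in this notebook.

```python
import pandas as pd

def bindings_to_dataframe(sparql_json: dict) -> pd.DataFrame:
    """Flatten a SPARQL 1.1 JSON result into one column per query variable."""
    variables = sparql_json["head"]["vars"]
    rows = [
        {var: binding.get(var, {}).get("value") for var in variables}
        for binding in sparql_json["results"]["bindings"]
    ]
    return pd.DataFrame(rows, columns=variables)

# Hand-written sample response; the URI prefix is an assumption
sample = {
    "head": {"vars": ["item", "KlosterdatenbankID"]},
    "results": {"bindings": [
        {
            "item": {"type": "uri", "value": "https://database.factgrid.de/entity/Q1752913"},
            "KlosterdatenbankID": {"type": "literal", "value": "8502"},
        },
    ]},
}

df = bindings_to_dataframe(sample)
# Keep only the trailing Q-number of the entity URI and cast the ID to int
df["item"] = df["item"].str.split("/").str[-1]
df["KlosterdatenbankID"] = df["KlosterdatenbankID"].astype(int)
print(df)
```

Casting the ID to an integer matters because the later merge joins it against the numeric `gsn_id` column.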

In [36]:
from helper_functions import query_factgrid

# Save all required DataFrames in a Dictionary
dataframes = {}
dataframes["gs_monastery_location"] = pd.read_excel("data/exports_monasteryDB/gs_monastery_location.xlsx")
dataframes["building_complexes_in_factgrid"] = query_factgrid("building_complexes")
dataframes["monasteries_in_factgrid"] = query_factgrid("monasteries")

# Process query results: Split result strings and convert to Integer, if necessary
dataframes["building_complexes_in_factgrid"]["item"] = dataframes["building_complexes_in_factgrid"]["item"].str.split("/").str[-1]
dataframes["building_complexes_in_factgrid"]["GSVocabTerm"] = dataframes["building_complexes_in_factgrid"]["GSVocabTerm"].str.split("Location").str[-1].astype(int)
dataframes["monasteries_in_factgrid"]["item"] = dataframes["monasteries_in_factgrid"]["item"].str.split("/").str[-1]
dataframes["monasteries_in_factgrid"]["KlosterdatenbankID"] = dataframes["monasteries_in_factgrid"]["KlosterdatenbankID"].astype(int)

In [37]:
# Merge tables together
merge = pd.merge(dataframes["gs_monastery_location"], dataframes["monasteries_in_factgrid"], how="left", left_on="gsn_id", right_on="KlosterdatenbankID").rename(columns={"item":"monastery_factgrid_id"})
merge = pd.merge(merge, dataframes["building_complexes_in_factgrid"], how="left", left_on="id_monastery_location", right_on="GSVocabTerm").rename(columns={"item":"building_factgrid_id"}).dropna(subset=["monastery_factgrid_id","building_factgrid_id"])
merge.drop(columns=["KlosterdatenbankID", "GSVocabTerm"], inplace=True)
merge

Unnamed: 0.1,Unnamed: 0,id_monastery_location,place_id,relocated,gsn_id,location_begin_tpq,location_begin_taq,location_begin_note,location_end_tpq,location_end_taq,location_end_note,comment,longitude,latitude,location_name,main_location,diocese_id,monastery_factgrid_id,building_factgrid_id
0,1220,13732,46483081,False,8502,1436,,vor 1436,1574,,,,3.945278,51.681667,,,0.0,Q1752913,Q1752895
1,1785,8632,10682,False,60372,1644,,,1802,,,,11.262,48.5632,,,,Q1752914,Q1752896
2,3365,8408,18263,False,3790,1100,1140.0,zwischen 1100 und 1140,1553,,letzte Erwähnung 1553,,11.427972,51.295222,,,,Q1752916,Q1752897
3,3440,1324,13832,False,2055,1219,,,1555,,,es handelt sich um das verlegte Kloster Sonnen...,11.685714,53.863182,Neukloster (Kussin),,,Q469450,Q1752898
4,3441,1325,41955,False,2055,1205,1210.0,um 1210 (?),1219,,,Kloster Sonnenkamp: https://de.wikipedia.org/w...,,,Parchow,,,Q469450,Q1752899
5,3511,8299,10757,False,60271,762,782.0,vor 783,1803,,,"ab 877 Benediktiner, besteht noch heute, die e...",10.2317,49.8053,Schwarzach am Main,,,Q1752923,Q1752900
6,3525,94,5035,True,93,1005,1022.0,ca. 1010/1022,1803,,,"Anfangs vermutlich Kanoniker, aber noch im 11....",9.943611,52.152778,,,,Q1752920,Q1752901
7,4064,13869,46483177,False,8609,1406,,,1572,1580.0,1572/1580,,5.636605,53.053089,,,0.0,Q1752918,Q1752902
8,4519,6137,17237,False,3768,1256,1260.0,kurz nach 1256,1525,1550.0,zwischen 1525 und 1550,vorher offenbar Augustinerchorfrauen in Hettst...,11.529778,51.666767,,,,Q1752919,Q1752903
9,4559,873,12105,False,814,1220,,,1649,,,,9.228792,52.916129,,,,Q1752917,Q1752904
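
Because the merge above is followed by `dropna`, rows without a match in FactGrid disappear silently. One possible way to inspect them beforehand is the `indicator` parameter of `pd.merge`. The sketch below uses a tiny stand-in for the real tables; the ID `99999` is a made-up value that has no FactGrid counterpart.

```python
import pandas as pd

# Stand-in tables; 99999 is a hypothetical ID without a FactGrid item
locations = pd.DataFrame({"gsn_id": [8502, 60372, 99999]})
monasteries = pd.DataFrame({
    "KlosterdatenbankID": [8502, 60372],
    "item": ["Q1752913", "Q1752914"],
})

checked = pd.merge(
    locations,
    monasteries,
    how="left",
    left_on="gsn_id",
    right_on="KlosterdatenbankID",
    indicator=True,  # adds a "_merge" column: "both" or "left_only"
)

# Rows that would be removed by the subsequent dropna
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched[["gsn_id"]])
```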


## Time Qualifiers

In the monastery database, six fields in the table `gs_monastery_location` define the time period during which a religious community lived and worked at a specific location. The start and end dates can each be defined by a Terminus Post Quem (`location_begin_tpq`, `location_end_tpq`) and a Terminus Ante Quem (`location_begin_taq`, `location_end_taq`). Not every field has to be filled in. Additionally, a natural-language supplement to the date information can be added using the fields `location_begin_note` and `location_end_note`.

In practice, there is a degree of uncertainty for many of the date entries in the monastery database. There are various causes and manifestations of uncertainty or vagueness. For example, in the case of the Cistercian monastery of Mariawald in Heimbach (GSN [50153](https://klosterdatenbank.adw-goe.de/gsn/50153)), two different years are listed as potential end dates for the localization of the monastery: the dissolution of the site occurred either in 1795 or 1802. In the database, the Terminus Post Quem of the end date was set to 1795 and the Terminus Ante Quem to 1802. However, this does not exactly reflect the situation, so the note field was supplemented with the entry "1795/1802". In this case, there is uncertainty due to different statements in the literature. A case of vagueness is, for example, the Benedictine abbey of Saint-Ghislain in Belgium (GSN [11681](https://klosterdatenbank.adw-goe.de/gsn/11681)). Here, it is only known that the localization in Saint-Ghislain began in the middle of the 7th century. In the relational database, this vague designation was quantified by approximating the possible time span: the Terminus Post Quem is set to 634 and the Terminus Ante Quem to 666, assuming that this time span approximately describes the "middle of the 7th century". Additionally, this is noted again in the note field.

In FactGrid, these date entries can be modeled in different ways. There are properties for start and end time points, as well as properties that correspond directly to the respective database fields: Begin date (terminus ante quem) and Begin date (terminus post quem), with equivalent properties for the end date. Additionally, dates can be expressed at different levels of precision. For the Saint-Ghislain case above, the date value could be set to `+634-01-01T00:00:00Z/7/J`, which is displayed in the interface as "7th century". Properties that further refine the precision of a date can also express, for example, that it is approximate (Item [Q10](https://database.factgrid.de/wiki/Item:Q10) - "circa").
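
To make the time notation concrete, here is a minimal sketch of how such a value could be assembled. The function `to_time_value` is hypothetical and not part of `helper_functions`. It assumes the usual Wikibase precision codes (9 for year, 7 for century) and the `/J` suffix for Julian calendar dates, and it zero-pads the year to four digits as in the output tables of this notebook (for example `+0783`).

```python
def to_time_value(year: int, precision: int = 9, julian: bool = False) -> str:
    """Build a time string for a year (hypothetical helper).

    precision: 9 = year, 7 = century (Wikibase precision codes)
    julian:    append "/J" to mark the date as Julian calendar
    """
    sign = "+" if year >= 0 else "-"
    value = f"{sign}{abs(year):04d}-01-01T00:00:00Z/{precision}"
    return value + "/J" if julian else value

# "Middle of the 7th century", expressed with century precision
print(to_time_value(634, precision=7, julian=True))
# A plain year without a calendar suffix
print(to_time_value(1802))
```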

To reflect the information in the monastery database as faithfully as possible, each date specification in the table `gs_monastery_location` is treated in several steps. First, if there is no note on it, the Terminus Post Quem is considered the most reliable date, so the field `location_[begin/end]_tpq` is mapped to either [P49](https://database.factgrid.de/wiki/Property:P49) (Begin date) or [P50](https://database.factgrid.de/wiki/Property:P50) (End date). Where there is a note on the date, it is generally assumed that the note reflects the dating circumstances better than the Terminus Post Quem and Terminus Ante Quem fields. The note is therefore parsed using a range of regular expressions. The details can be found in the `helper_functions.py` file, where the method is written and documented. During this process, the precision of a date is set to "century" if the note indicates it. Lastly, all remaining fields are filled with the value from the respective Terminus Post Quem field. These are the cases in which parsing the note did not return a usable result.
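
The following sketch illustrates the general idea of a rule-based note parser with a fallback to the Terminus Post Quem. It is not the `parse_date` function from `helper_functions`: the rule list, the name `parse_note`, and the returned labels are hypothetical, and only a few German phrases seen in the sample data are covered.

```python
import re
from typing import Optional, Tuple

# Hypothetical rules: (pattern, label); the first matching rule wins
RULES = [
    (re.compile(r"\bvor\s+(\d{3,4})"), "before"),
    (re.compile(r"\bnach\s+(\d{3,4})"), "after"),
    (re.compile(r"\b(?:um|ca\.?|circa)\s+(\d{3,4})"), "circa"),
]

def parse_note(note: Optional[str], tpq: Optional[int]) -> Tuple[Optional[int], str]:
    """Return (year, label); fall back to the TPQ if no rule matches."""
    if isinstance(note, str):  # skips None and NaN from empty cells
        for pattern, label in RULES:
            match = pattern.search(note)
            if match:
                return int(match.group(1)), label
    return tpq, "tpq"

for note, tpq in [("vor 1436", 1436), ("um 1210 (?)", 1205),
                  ("kurz nach 1256", 1256), ("zwischen 1100 und 1140", 1100),
                  (None, 1644)]:
    print(note, "->", parse_note(note, tpq))
```

In this sketch, a note such as "zwischen 1100 und 1140" matches no rule, so the Terminus Post Quem is used, which mirrors the fallback described above.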

In general, it has to be acknowledged that the dates provided in the monastery database are historical dates that naturally come with a degree of vagueness. Quantifying these dates, in other words fitting them into an existing data model, adds another layer of potential uncertainty that has to be taken into account when working with the dataset.

To begin with, the qualifiers are processed and added to the table that links monasteries to building complexes.

In [38]:
from helper_functions import process_date_parsing_results, parse_date, DateType

# Create a new dataframe for import and fill it with monastery-building complex pairs
monastery_to_building = pd.DataFrame()
monastery_to_building["qid"] = merge["monastery_factgrid_id"]
monastery_to_building["P208"] = merge["building_factgrid_id"]

# Getting relevant columns from the dataframe above
monastery_to_building["location_begin_tpq"] = merge["location_begin_tpq"]
monastery_to_building["location_end_tpq"] = merge["location_end_tpq"]
monastery_to_building["location_begin_note"] = merge["location_begin_note"]
monastery_to_building["location_end_note"] = merge["location_end_note"]

# Parse date notes
monastery_to_building['begin_date_parse_result'] = merge["location_begin_note"].apply(lambda x: parse_date(str(x), DateType.BEGIN_DATE))
monastery_to_building['end_date_parse_result'] = merge["location_end_note"].apply(lambda x: parse_date(str(x), DateType.END_DATE))

# Add date notes as Qualifiers 787 and 788
monastery_to_building['qal787'] = merge["location_begin_note"].apply(lambda x: f"\"{x}\"" if not pd.isna(x) else np.nan)
monastery_to_building['qal788'] = merge["location_end_note"].apply(lambda x: x if x != "heute" else np.nan).apply(lambda x: f"\"{x}\"" if not pd.isna(x) else np.nan)

# Add more Qualifiers as indicated by date parsing results
process_date_parsing_results(monastery_to_building, "location")

# Cleanup & Add Source Statements
monastery_to_building = monastery_to_building.drop(columns=["location_begin_tpq", "location_end_tpq", "begin_date_parse_result", "end_date_parse_result"])
monastery_to_building["S471"] = merge["gsn_id"].apply(lambda x: f"\"{x}\"" if not pd.isna(x) else np.nan)
monastery_to_building.drop(columns=["location_begin_note", "location_end_note"], inplace=True)
monastery_to_building

Unnamed: 0,qid,P208,qal787,qal788,qal1124,qal50,qal49,qal785,qal1126,S471
0,Q1752913,Q1752895,"""vor 1436""",,+1436-00-00T00:00:00Z/9/J,+1574-01-01T00:00:00Z/9/J,,,,"""8502"""
1,Q1752914,Q1752896,,,,+1802-01-01T00:00:00Z/9,+1644-01-01T00:00:00Z/9,,,"""60372"""
2,Q1752916,Q1752897,"""zwischen 1100 und 1140""","""letzte Erwähnung 1553""",,+1553-01-01T00:00:00Z/9/J,+1100-01-01T00:00:00Z/9/J,,,"""3790"""
3,Q469450,Q1752898,,,,+1555-01-01T00:00:00Z/9/J,+1219-01-01T00:00:00Z/9/J,,,"""2055"""
4,Q469450,Q1752899,"""um 1210 (?)""",,,+1219-01-01T00:00:00Z/9/J,+1210-00-00T00:00:00Z/9/J,Q10,,"""2055"""
5,Q1752923,Q1752900,"""vor 783""",,+0783-00-00T00:00:00Z/9/J,+1803-01-01T00:00:00Z/9,,,,"""60271"""
6,Q1752920,Q1752901,"""ca. 1010/1022""",,,+1803-01-01T00:00:00Z/9,+1010-00-00T00:00:00Z/9/J,Q10,,"""93"""
7,Q1752918,Q1752902,,"""1572/1580""",,+1572-00-00T00:00:00Z/9/J,+1406-01-01T00:00:00Z/9/J,,,"""8609"""
8,Q1752919,Q1752903,"""kurz nach 1256""","""zwischen 1525 und 1550""",,+1525-01-01T00:00:00Z/9/J,,Q266009,+1256-00-00T00:00:00Z/9/J,"""3768"""
9,Q1752917,Q1752904,,,,+1649-01-01T00:00:00Z/9,+1220-01-01T00:00:00Z/9/J,,,"""814"""


The cell below saves the result in the supported export formats.

In [39]:
from helper_functions import df_to_qs_v1
monastery_to_building.to_excel("data/results/monastery_building_connection/monastery_to_building.xlsx", index=False)
monastery_to_building.to_csv("data/results/monastery_building_connection/monastery_to_building.csv", index=False, doublequote=False, quoting=csv.QUOTE_NONE, escapechar="§")
with open("data/results/monastery_building_connection/monastery_to_building.tsv", "w", encoding="utf-8") as file:  # explicit UTF-8 so umlauts in the notes are written consistently
    file.write(df_to_qs_v1(monastery_to_building))
monastery_to_building

Unnamed: 0,qid,P208,qal787,qal788,qal1124,qal50,qal49,qal785,qal1126,S471
0,Q1752913,Q1752895,"""vor 1436""",,+1436-00-00T00:00:00Z/9/J,+1574-01-01T00:00:00Z/9/J,,,,"""8502"""
1,Q1752914,Q1752896,,,,+1802-01-01T00:00:00Z/9,+1644-01-01T00:00:00Z/9,,,"""60372"""
2,Q1752916,Q1752897,"""zwischen 1100 und 1140""","""letzte Erwähnung 1553""",,+1553-01-01T00:00:00Z/9/J,+1100-01-01T00:00:00Z/9/J,,,"""3790"""
3,Q469450,Q1752898,,,,+1555-01-01T00:00:00Z/9/J,+1219-01-01T00:00:00Z/9/J,,,"""2055"""
4,Q469450,Q1752899,"""um 1210 (?)""",,,+1219-01-01T00:00:00Z/9/J,+1210-00-00T00:00:00Z/9/J,Q10,,"""2055"""
5,Q1752923,Q1752900,"""vor 783""",,+0783-00-00T00:00:00Z/9/J,+1803-01-01T00:00:00Z/9,,,,"""60271"""
6,Q1752920,Q1752901,"""ca. 1010/1022""",,,+1803-01-01T00:00:00Z/9,+1010-00-00T00:00:00Z/9/J,Q10,,"""93"""
7,Q1752918,Q1752902,,"""1572/1580""",,+1572-00-00T00:00:00Z/9/J,+1406-01-01T00:00:00Z/9/J,,,"""8609"""
8,Q1752919,Q1752903,"""kurz nach 1256""","""zwischen 1525 und 1550""",,+1525-01-01T00:00:00Z/9/J,,Q266009,+1256-00-00T00:00:00Z/9/J,"""3768"""
9,Q1752917,Q1752904,,,,+1649-01-01T00:00:00Z/9,+1220-01-01T00:00:00Z/9/J,,,"""814"""
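
The function `df_to_qs_v1` is defined in `helper_functions` and not shown here. As an illustration of what a conversion to tab-separated V1 commands might involve, the sketch below turns each row into one line, skips empty cells, and renames `qal…` columns to `P…` qualifiers. The function `rows_to_v1` is hypothetical and assumes the column layout of the table above (item in `qid`, main property in the second column, qualifiers and `S…` source columns after that).

```python
import pandas as pd

def rows_to_v1(df: pd.DataFrame) -> str:
    """Render each row as one tab-separated V1 command (hypothetical sketch)."""
    main_prop = df.columns[1]  # e.g. "P208"
    lines = []
    for _, row in df.iterrows():
        parts = [row["qid"], main_prop, row[main_prop]]
        for column in df.columns[2:]:
            value = row[column]
            if pd.isna(value):
                continue  # no statement part for empty cells
            # "qal787" becomes qualifier "P787"; source columns ("S471") stay as-is
            prop = "P" + column[3:] if column.startswith("qal") else column
            parts += [prop, str(value)]
        lines.append("\t".join(parts))
    return "\n".join(lines)

# One row taken from the table above
sample = pd.DataFrame([{
    "qid": "Q1752913",
    "P208": "Q1752895",
    "qal787": '"vor 1436"',
    "qal50": "+1574-01-01T00:00:00Z/9/J",
    "qal49": None,
    "S471": '"8502"',
}])
print(rows_to_v1(sample))
```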


For the other direction, from building to monastery, we can simply swap the columns `qid` and `P208` and rename the property column to `P1095`.

In [40]:
p1095 = monastery_to_building["qid"]
building_to_monastery = monastery_to_building.copy()
building_to_monastery["qid"] = monastery_to_building["P208"]
building_to_monastery["P208"] = p1095
building_to_monastery.rename(columns={"P208":"P1095"}, inplace=True)
building_to_monastery

Unnamed: 0,qid,P1095,qal787,qal788,qal1124,qal50,qal49,qal785,qal1126,S471
0,Q1752895,Q1752913,"""vor 1436""",,+1436-00-00T00:00:00Z/9/J,+1574-01-01T00:00:00Z/9/J,,,,"""8502"""
1,Q1752896,Q1752914,,,,+1802-01-01T00:00:00Z/9,+1644-01-01T00:00:00Z/9,,,"""60372"""
2,Q1752897,Q1752916,"""zwischen 1100 und 1140""","""letzte Erwähnung 1553""",,+1553-01-01T00:00:00Z/9/J,+1100-01-01T00:00:00Z/9/J,,,"""3790"""
3,Q1752898,Q469450,,,,+1555-01-01T00:00:00Z/9/J,+1219-01-01T00:00:00Z/9/J,,,"""2055"""
4,Q1752899,Q469450,"""um 1210 (?)""",,,+1219-01-01T00:00:00Z/9/J,+1210-00-00T00:00:00Z/9/J,Q10,,"""2055"""
5,Q1752900,Q1752923,"""vor 783""",,+0783-00-00T00:00:00Z/9/J,+1803-01-01T00:00:00Z/9,,,,"""60271"""
6,Q1752901,Q1752920,"""ca. 1010/1022""",,,+1803-01-01T00:00:00Z/9,+1010-00-00T00:00:00Z/9/J,Q10,,"""93"""
7,Q1752902,Q1752918,,"""1572/1580""",,+1572-00-00T00:00:00Z/9/J,+1406-01-01T00:00:00Z/9/J,,,"""8609"""
8,Q1752903,Q1752919,"""kurz nach 1256""","""zwischen 1525 und 1550""",,+1525-01-01T00:00:00Z/9/J,,Q266009,+1256-00-00T00:00:00Z/9/J,"""3768"""
9,Q1752904,Q1752917,,,,+1649-01-01T00:00:00Z/9,+1220-01-01T00:00:00Z/9/J,,,"""814"""


Finally, save the file in the available export formats.

In [41]:
building_to_monastery.to_excel("data/results/monastery_building_connection/building_to_monastery.xlsx", index=False)
building_to_monastery.to_csv("data/results/monastery_building_connection/building_to_monastery.csv", index=False, doublequote=False, quoting=csv.QUOTE_NONE, escapechar="§")
with open("data/results/monastery_building_connection/building_to_monastery.tsv", "w", encoding="utf-8") as file:  # explicit UTF-8 so umlauts in the notes are written consistently
    file.write(df_to_qs_v1(building_to_monastery))
building_to_monastery

Unnamed: 0,qid,P1095,qal787,qal788,qal1124,qal50,qal49,qal785,qal1126,S471
0,Q1752895,Q1752913,"""vor 1436""",,+1436-00-00T00:00:00Z/9/J,+1574-01-01T00:00:00Z/9/J,,,,"""8502"""
1,Q1752896,Q1752914,,,,+1802-01-01T00:00:00Z/9,+1644-01-01T00:00:00Z/9,,,"""60372"""
2,Q1752897,Q1752916,"""zwischen 1100 und 1140""","""letzte Erwähnung 1553""",,+1553-01-01T00:00:00Z/9/J,+1100-01-01T00:00:00Z/9/J,,,"""3790"""
3,Q1752898,Q469450,,,,+1555-01-01T00:00:00Z/9/J,+1219-01-01T00:00:00Z/9/J,,,"""2055"""
4,Q1752899,Q469450,"""um 1210 (?)""",,,+1219-01-01T00:00:00Z/9/J,+1210-00-00T00:00:00Z/9/J,Q10,,"""2055"""
5,Q1752900,Q1752923,"""vor 783""",,+0783-00-00T00:00:00Z/9/J,+1803-01-01T00:00:00Z/9,,,,"""60271"""
6,Q1752901,Q1752920,"""ca. 1010/1022""",,,+1803-01-01T00:00:00Z/9,+1010-00-00T00:00:00Z/9/J,Q10,,"""93"""
7,Q1752902,Q1752918,,"""1572/1580""",,+1572-00-00T00:00:00Z/9/J,+1406-01-01T00:00:00Z/9/J,,,"""8609"""
8,Q1752903,Q1752919,"""kurz nach 1256""","""zwischen 1525 und 1550""",,+1525-01-01T00:00:00Z/9/J,,Q266009,+1256-00-00T00:00:00Z/9/J,"""3768"""
9,Q1752904,Q1752917,,,,+1649-01-01T00:00:00Z/9,+1220-01-01T00:00:00Z/9/J,,,"""814"""
