# Database Dutch Reformed Clergy (DRC) 1555-1816

The Database Dutch Reformed Clergy (DRC) 1555-1816 (stored as Repertoriummetoudepersoonsnummers1.docx) was provided by prof. dr. Fred van Lieburg. An earlier version was published as [van Lieburg, F. A. (1997). Profeten en hun vaderland. De geografische herkomst van de gereformeerde predikanten in Nederland van 1572 tot 1816. [PhD-Thesis - Research and graduation internal, Vrije Universiteit Amsterdam]. Boekencentrum.](https://hdl.handle.net/1871.1/e1bfb2c9-8d30-42b4-8edf-83b20bd6c5a7). The dataset contains biographical and career path information on Dutch ministers whose careers started from 1555 up to the cut-off start date of 1816. This means that it does contain careers that continue after 1816, but no individuals who started after 1816.

The dataset contains 12558 individuals which are systematically registered in a text file. A sample of the text is provided below.

> Aalst; Wilhelmus Gedoopt Biggekerke 5 jan. 1664; pred. Aardenburg 22 mei 1695, overl. 19 dec. 1700.<4>
>
> Aalst, van; Cornelius Geb. Castricum ca. 1686; ambassadepred. in Parijs maart tot dec. 1715; pred. Kalslagen ber. 21 febr. 1717, emer. 1751; overl. Amsterdam 27 aug. 1756.<2>
>
> Aalst, van; Gerardus Geb. xxx sept. 1678; pred. Vuren en Dalem 10 aug. 1704, Sommelsdijk 13 juni 1706, West Zaandam 4 aug. 1715, emer. 1755; overl. 29 juni 1759.<3>
>

In this form, the dataset does not allow persons to be categorized by the place where they were born or by the years in which they were active in a certain church. It therefore cannot be analysed systematically, and valuable historical insights remain hidden. To open the dataset up for systematic analysis, the first step is to parse it into a relational database, which makes a series of basic and more advanced analysis methods available. A relational database allows the data to be queried systematically. Furthermore, it allows more complex analyses, such as finding which individuals lived near each other, and eventually linking the data with other datasets such as book title datasets.

The steps in this notebook describe how the text file has been converted into a series of csv files which can be imported into a relational database. Since the dataset still contained significant errors at the end of the processing, the whole dataset also underwent a manual curation round. The notebook presents the tailor-made processing steps used to parse the data. The overall principle is that the information of an individual is stored on a single line and that all characteristics are parsed into separate columns based on specific strings that separate distinguishable items. Although the pipeline structured most of the data successfully, manual curation was still required to fully check the data.

Note that, since this was a one-time processing pipeline, the notebook has not been optimized; it does, however, contain extensive explanatory text on the processing steps.

### Step 0 import libraries and configure settings
The required Python libraries for the notebook are imported. If you are new to Python and to installing libraries, have a look [here](../4_Dissemination/install_packages.md).


In [None]:
import docx2txt
import os
import re
import csv
import pandas as pd

In [None]:
# Set variables for the project (i.e. the input location of the file to be processed and the output location)

folderlink = '..//data//'
input_folder = 'input//'
input_file = os.path.join(folderlink+input_folder, 'Repertoriummetoudepersoonsnummers1.docx')
folder_output = 'output//'
output_txt = folderlink+folder_output+'output.txt'
output_csv = folderlink+folder_output+'output_file.csv'

In [None]:
# pandas display settings (mainly used to explore the data more easily while processing it)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', None)

### Step 1 Convert .docx file and parse individuals to separate rows

In [None]:
# Convert the docx to text and remove all unnecessary line breaks.

# Use docx2txt library to extract text from .docx file
text = docx2txt.process(input_file)

# Remove excessive whitespaces
data = ' '.join(text.split())

While processing the data, some clear errors occurred in the source text, which were corrected as follows:

1.  FROM: "N.N. "de oude vicarius">pred. Lichtenvoorde 1602 tot 1615.<20871>"
    TO: "N.N. "de oude vicarius" pred. Lichtenvoorde 1602 tot 1615.<20871>" 
    REASON: since it contained the > character which is used to define ID fields. 
 
2.  FROM: "Bosch; Cornelius Geb. Utrecht 1634; pred. Renswoude 16 dec. 1656, Maasland 15 april 1663, Brielle 30 jan. 1667, Alkmaar 1667, 's Gravenhage 5 juli 1676, emer. 1713,;overl. 28 maart 1715.<1185>"
    TO:Bosch; Cornelius
Geb. Utrecht 1634; pred. Renswoude 16 dec. 1656, Maasland 15 april 1663, Brielle 30 jan. 1667, Alkmaar 1667, 's Gravenhage 5 juli 1676, emer. 1713, overl. 28 maart 1715.<1185>
    REASON: the string was split on ",;overl"

3.  FROM: Leeuwen, van, Cornelis [z.v. Cornelis]
Geb. Hazerswoude 1611; pred. Boskoop en Middelburg 1637, overl. 1681.<5778>
    TO: Leeuwen, van; Cornelis [z.v. Cornelis]
Geb. Hazerswoude 1611; pred. Boskoop en Middelburg 1637, overl. 1681.<5778>
    REASON: there was no ; between the name and surname (including infix)

4.  FROM: Peenen, van, Marcus
Gedoopt Leiden 31 aug. 1642; pred. Koudekerk aan den Rijn 2 sept. 1668, Leiden 1680, begraven 1 febr. 1696.<7388>
    TO: Peenen, van; Marcus
Gedoopt Leiden 31 aug. 1642; pred. Koudekerk aan den Rijn 2 sept. 1668, Leiden 1680, begraven 1 febr. 1696.<7388>
    REASON: there was no ; between the name and surname (including infix)

5.  FROM: Knuyt, de (Kuntius), Elias
Geb. Middelburg yyy; pred. Oude Niedorp en Veenhuizen (NH) mei 1628, Sint Annaland 1630, Westkapelle 7 maart 1641, overl. --of vertrokken?-- 1642.<5381>
    TO: Knuyt, de (Kuntius); Elias
Geb. Middelburg yyy; pred. Oude Niedorp en Veenhuizen (NH) mei 1628, Sint Annaland 1630, Westkapelle 7 maart 1641, overl. --of vertrokken?-- 1642.<5381>
    REASON: there was no ; between the name and surname (including infix)

6.  FROM: Leonardis, de, Paulus
Geb. Keulen yyy; pred. Bacharach (Pfalz) 16.., Kampen 1620, overl. 1649.<5836>
    TO: Leonardis, de; Paulus
Geb. Keulen yyy; pred. Bacharach (Pfalz) 16.., Kampen 1620, overl. 1649.<5836>
    REASON: there was no ; between the name and surname (including infix)

7.  FROM: Tamerus; Henricus
Geb. xxx ca. 1540; voorheen Lutheraan;--onwettig pred. Heteren en Randwijk ca. 1600-1602; pred. Eethen en Meeuwen 1606, Doeveren, Gansoyen en Genderen 1610, afgezet als remonstrant 1619; overl. Heusden.<21206>
    TO: Tamerus; Henricus
Geb. xxx ca. 1540; voorheen Lutheraan; onwettig pred. Heteren en Randwijk ca. 1600-1602; pred. Eethen en Meeuwen 1606, Doeveren, Gansoyen en Genderen 1610, afgezet als remonstrant 1619; overl. Heusden.<21206>
    REASON: the -- directly after the semicolon caused an unintended line break.
    
8.  FROM: Aitton; Rijk Otto [z.v. Hendrik Arnold]
Geb. Zwolle 27 maart 1790; pred. Zuilichem + Nieuwaal 3 maart 1811, --legerpred. 1815, garnizoenspred. Oostende (Vlaanderen)-- Aalten 11 mei 1817, Hooge Zwaluwe 9 april 1826, Monster 4 mei 1828, Zevenbergen 5 april 1840, emer. 1855; overl. 4 aug. 1863.<130>
    TO: Aitton; Rijk Otto [z.v. Hendrik Arnold]
Geb. Zwolle 27 maart 1790; pred. Zuilichem + Nieuwaal 3 maart 1811, legerpred. 1815, garnizoenspred. Oostende (Vlaanderen), pred. Aalten 11 mei 1817, Hooge Zwaluwe 9 april 1826, Monster 4 mei 1828, Zevenbergen 5 april 1840, emer. 1855; overl. 4 aug. 1863.<130>
    REASON: second pred. was missing.




In [None]:
data = data.replace('(Vlaanderen)§§', '(Vlaanderen)§§, pred.').replace('overl. 28 maart 1715.<1185>', ' overl. 28 maart 1715.<1185>').replace('>pred.', ' pred.').replace('Leeuwen, van, Cornelis [z.v. Cornelis]','Leeuwen, van; Cornelis [z.v. Cornelis]',).replace('Peenen, van, Marcus','Peenen, van; Marcus').replace('Knuyt, de (Kuntius), Elias','Knuyt, de (Kuntius); Elias').replace('Leonardis, de, Paulus','Leonardis, de; Paulus').replace('Lutheraan;§§onwettig','Lutheraan; onwettig')
data = data.replace('§', '-')

To get all the information in the file into a single row per individual, the first step was to remove all line breaks from the file. Every individual has a unique ID structured as <x>, where x is the ID. The next step was therefore to add a line break after every ID, creating a file that holds the information of every individual in a single row.

> Aalst, van; Cornelius Geb. Castricum ca. 1686; ambassadepred. in Parijs maart tot dec. 1715; pred. Kalslagen ber. 21 febr. 1717, emer. 1751; overl. Amsterdam 27 aug. 1756.<2>
> Aalst, van; Gerardus Geb. xxx sept. 1678; pred. Vuren en Dalem 10 aug. 1704, Sommelsdijk 13 juni 1706, West Zaandam 4 aug. 1715, emer. 1755; overl. 29 juni 1759.<3>

To isolate the IDs into a column of their own once the file is read as a .csv file, a semicolon is added in front of the < sign and after the > sign. We decided to call the IDs drc_id.

In [None]:

# Replace semicolons with newlines and add semicolons around < and > since these identify the IDs
data = data.replace(';', ';\n').replace(';\n ', '; ').replace('>', '>;\n ').replace('<', ';<')
lines = data.split('\n')

lines = [line for line in lines if not line.startswith('; ;') and not line.startswith('; ')]
data = '\n'.join(lines)
lines = data.strip().split('\n')
data = '\n'.join([line.lstrip() for line in lines])
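
The effect of this splitting step can be checked on a small sample. The sketch below wraps the same replace chain in a helper and applies it to two records from the sample text; the function name `split_records` and the shortened sample string are illustrative assumptions and not part of the original pipeline.

```python
def split_records(raw: str) -> list:
    """Return one string per individual, with the ID isolated as ';<id>;'."""
    # Same replace chain as in the cell above: semicolons that are not
    # followed by a space get a line break, and every ID is wrapped in
    # semicolons and followed by a line break.
    data = (raw.replace(';', ';\n').replace(';\n ', '; ')
               .replace('>', '>;\n ').replace('<', ';<'))
    lines = [ln for ln in data.split('\n')
             if not ln.startswith('; ;') and not ln.startswith('; ')]
    return [ln.lstrip() for ln in '\n'.join(lines).strip().split('\n')]


# Two records from the sample, already collapsed onto one line (as after Step 1).
sample = ("Aalst; Wilhelmus Gedoopt Biggekerke 5 jan. 1664; pred. Aardenburg "
          "22 mei 1695, overl. 19 dec. 1700.<4> Aalst, van; Cornelius Geb. "
          "Castricum ca. 1686; pred. Kalslagen ber. 21 febr. 1717, emer. 1751; "
          "overl. Amsterdam 27 aug. 1756.<2>")

records = split_records(sample)
for record in records:
    print(record)
```

Each printed line holds one individual and ends with the ID between semicolons, which is what allows the ID to become its own csv field later on.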


### Step 2.
In the original dataset the various characteristics of an individual are separated by a semicolon. However, this is not done systematically (e.g. Geb. and emer. are not separated with a semicolon). Therefore, a search on the various **distinguishable key strings** is performed and a semicolon is added in front of each. Key strings that we searched for are:

``` "Geb.","pred.","overl.","Gedoopt","legerpred.","pastoor","garnizoenspred.","emer.","begraven","conrector","rector","monnik","schoolmeester","hoogleraar","chirurgijn","praeceptor","ziekentrooster","vlootpred.","legerpred.","ambassadepred." ```

and many more...

By adding a semicolon in front of these key strings, the various characteristics will be handled as separate columns when imported as a .csv file.

> Aalst, van; Cornelius ; Geb. Castricum ca. 1686; ; ambassadepred. in Parijs maart tot dec. 1715; ; pred. Kalslagen ber. 21 febr. 1717, ; emer. 1751; ; overl. Amsterdam 27 aug. 1756.;<2>;
> Aalst, van; Gerardus ; Geb. xxx sept. 1678; ; pred. Vuren en Dalem 10 aug. 1704, Sommelsdijk 13 juni 1706, West Zaandam 4 aug. 1715, ; emer. 1755; ; overl. 29 juni 1759.;<3>;

Individuals could have been minister in multiple places and can also have "gaps" in their career as minister. For example:

> Haack; Petrus Geb. Brielle okt. 1747; pred. Noordgouwe 26 nov. 1769, Zwartewaal 20 nov. 1774, Sommelsdijk 6 juli 1777, Breda 23 juni 1782, Amsterdam 25 nov. 1789, politiek afgezet 1796; vertrokken naar Hamburg, hersteld: pred. Amsterdam 1804, overl. 27 juli 1824.<3782>

The individual in this example was a minister in Noordgouwe from 1769, followed by positions in Zwartewaal from 1774, Sommelsdijk from 1777, Breda from 1782 and Amsterdam from 1789. In 1796 he was dismissed, after which he went to Hamburg. In 1804 he was reinstated in Amsterdam until he passed away in 1824. As this example shows, the various roles as minister do not all start with pred.: subsequent positions are only separated with a comma, and only when someone was reinstated as a minister does the information start with pred. again. To later merge all the locations where someone was minister into one field, the marker zx followed by a counter (the number of earlier occurrences of " pred." in the row) is added in front of each " pred." string. The marker zx was chosen because it does not occur anywhere else in the dataset.

By adding a count, the values of these fields can later be merged.

In [None]:
def replace_pred_count(string):
    count = 0
    result = ""
    position = string.find(" pred.")

    while position != -1:
        result += string[:position] + "zx"+ str(count) + " pred."
        string = string[position + len(" pred."):]
        count += 1
        position = string.find(" pred.")

    result += string

    return result
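
As a cross-check of the numbering logic, the same result can be obtained with a regular expression and a counter. This is only an alternative sketch and not the method used in the pipeline; the name `number_pred_markers` is hypothetical and the sample is a shortened version of the Haack record.

```python
import re
from itertools import count


def number_pred_markers(line: str) -> str:
    """Prefix every ' pred.' with zx0, zx1, ... in order of appearance."""
    counter = count()
    # The leading space is part of the match, so strings such as
    # 'ambassadepred.' or 'legerpred.' are left untouched.
    return re.sub(r" pred\.", lambda m: f"zx{next(counter)} pred.", line)


sample = ("Haack; Petrus Geb. Brielle okt. 1747; pred. Noordgouwe 26 nov. 1769, "
          "politiek afgezet 1796; vertrokken naar Hamburg, hersteld: pred. "
          "Amsterdam 1804, overl. 27 juli 1824.<3782>")

numbered = number_pred_markers(sample)
print(numbered)
```

The first occurrence becomes `zx0 pred.` and the reinstatement becomes `zx1 pred.`, so both periods end up under their own key string.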

In [None]:
lines = data.split('\n')

with open(output_txt, "w", encoding='utf-8') as file:
    for line in lines:
        result_string = replace_pred_count(line)
        file.write(result_string + "\n")

Here the various identified **distinguishable key strings** are listed. This covered most of the separate entities, yet during the manual curation more were found (sometimes occurring only once or twice).

In [None]:
columns = ("Geb.",
 "zx0 pred.",
 "zx1 pred.",
 "zx2 pred.",
 "zx3 pred.",
 "zx4 pred.",
 "overl.",
 "Gedoopt",
 "legerpred.",
 "pastoor",
 "garnizoenspred.",
 "emer.",
 "begraven",
 "conrector",
 " rector",
 "monnik",
 "schoolmeester",
 "hoogleraar",
 "chirurgijn",
 "praeceptor",
 "ziekentrooster",
 "vlootpred.",
 "legerpred.",
 "ambassadepred."
)


In [None]:
for column in columns:
    with open(output_txt, 'r', encoding='utf-8') as f:
        lines = f.readlines()
    with open(output_txt,'w', encoding='utf-8') as f:
        for line in lines:
            if "; "+column in line:
                f.write(line)
            elif column in line:
                line = line.replace(column, ";"+column)
                f.write(line)
            else:
                f.write(line)
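
The rule applied in the loop above can be illustrated in memory on a single line. The sketch below is an assumption-labelled simplification: the helper name `insert_separators` and the shortened key list are illustrative, and it works on a string instead of rewriting the file for every key string.

```python
def insert_separators(line: str, keys) -> str:
    """Put a semicolon in front of every key string that lacks one."""
    for key in keys:
        if "; " + key in line:
            # At least one occurrence is already preceded by "; ":
            # like the loop above, the whole line is left as it is.
            continue
        line = line.replace(key, ";" + key)
    return line


# Shortened key list and sample line, for illustration only.
keys = ("Geb.", "zx0 pred.", "overl.", "emer.")
line = ("Aalst, van; Gerardus Geb. xxx sept. 1678;zx0 pred. Vuren en Dalem "
        "10 aug. 1704, emer. 1755; overl. 29 juni 1759.;<3>;")

separated = insert_separators(line, keys)
fields = separated.split(";")
print(fields)
```

Note that the check is made per line and not per occurrence: if a key string appears twice and only one occurrence is preceded by "; ", the other occurrence is not separated. This is one of the reasons exceptions remained for the manual curation.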

The surnames and first names of the individuals are always stored in the first and second column of the dataset. Therefore the first two columns have been named accordingly.

In [None]:
# Define the headers for the output file
headers = ['surname_temp', 'first_name', 'Field1', 'Field2', 'Field3', 'Field4', 'Field5', 'Field6', 'Field7', 'Field8', 'Field9', 'Field10', 'Field11','Field12','Field13','Field14','Field15','Field16','Field17','Field18']

with open(output_txt, 'r', encoding='utf-8') as infile, open(output_csv, 'w', newline='', encoding='utf-8' ) as outfile:
    reader = csv.reader(infile, delimiter=';')
    writer = csv.writer(outfile, delimiter=';')

    # Write the headers to the output file
    writer.writerow(headers)

    # Loop through each row in the input file and write it to the output file
    for row in reader:
        # Pad rows that are shorter than the header with empty values
        new_row = row + [''] * (len(headers) - len(row))
        writer.writerow(new_row)

In [None]:
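# Read and write with the same encoding that was used to create the file
with open(output_csv, 'r', encoding='utf-8') as file:
    content = file.read()

modified_content = content.replace("§ ", "-")

with open(output_csv, 'w', encoding='utf-8') as file:
    file.write(modified_content)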

In [None]:
df = pd.read_csv(output_csv, sep=';', encoding='utf-8')

In [None]:
#In the file all IDs, which are called "drc_id" are stored between < and > therefore:

for column in df.columns:
# Check if any value in the column contains '<'
    if df[column].astype(str).str.contains('<').any():
# Copy the values containing '<' to a column
        df.loc[df[column].astype(str).str.contains('<'), 'drc_id'] = df[column]


In [None]:
df['drc_id'] = df['drc_id'].str.replace('>', '').str.replace('<','')
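
Because the ID can end up in a different column for every row, an alternative is to search the whole row at once. The sketch below is a hedged illustration on a tiny hypothetical DataFrame; it assumes that IDs consist of digits only, as in the samples shown in this notebook.

```python
import pandas as pd

# Hypothetical mini-table: the ID sits in a different column per row.
df_demo = pd.DataFrame({
    "Field1": [" overl. 1700.", "<3>"],
    "Field2": ["<4>", None],
})

# Join all cells of a row into one string and pull out the digits between < and >.
joined = df_demo.fillna("").astype(str).apply(" ".join, axis=1)
df_demo["drc_id"] = joined.str.extract(r"<(\d+)>", expand=False)

print(df_demo)
```

This yields the bare ID without the angle brackets in one step, so no separate clean-up of < and > is needed in this variant.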

In [None]:
for column in columns:
    df[column] = df.apply(lambda row: row[row.astype(str).str.contains(column)].iloc[0] if any(row.astype(str).str.contains(column)) else '', axis=1)

To ensure that the original input remains accessible to the user afterwards, which proved to be essential for the data curation process, the original input is reconstructed as follows.

In [None]:
df['original_input'] = df['surname_temp'].fillna('') + df['first_name'].fillna('') + df['Field1'].fillna('') + df['Field2'].fillna('')+ df['Field3'].fillna('')+ df['Field4'].fillna('')+ df['Field5'].fillna('')+ df['Field6'].fillna('')+ df['Field7'].fillna('')+ df['Field8'].fillna('')+ df['Field9'].fillna('')+ df['Field10'].fillna('')+ df['Field11'].fillna('')+ df['Field12'].fillna('')+ df['Field13'].fillna('')+ df['Field14'].fillna('')+ df['Field15'].fillna('')+ df['Field16'].fillna('')+ df['Field17'].fillna('')+ df['Field18'].fillna('')

In [None]:
df['original_input'] = df['original_input'].str.replace('zx0 pred. ', ' pred. ').str.replace('zx1 pred.',' pred. ').str.replace('zx2 pred.',' pred. ').str.replace('zx3 pred.',' pred. ').str.replace('zx4 pred.',' pred. ')

### Step 3.
When importing this dataset, a lot of empty cells are created and the data is obviously not yet structured according to the distinguishable key strings. Therefore the next step is to create columns based on the key strings and to add the information from cells that contain a key string to the matching column. To improve the readability of the information in the new columns, the key strings themselves are also removed.

In the example, information about Gerardus's death will be stored in column **overl.** and will initially contain the value *“ overl. 29 juni 1759.”*; once the distinguishable key string is removed, the cell will contain the value *“29 juni 1759.”*.

An important issue here is that in some cases the distinguishable key string is used multiple times for an individual. For ministers this issue has been solved by counting the number of times the string “ pred.” occurs in a line and adding that number to the key string. These are later integrated into one cell called minister.

In [None]:
for column in columns:
    df[column] = df[column].str.replace(column, '')
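
The two operations of this step (collecting the cell that contains a key string, then removing the key string) can be tried on a small hypothetical table. The helper `first_cell_with` and the column names below are illustrative assumptions. The key string is matched literally here (`regex=False`), so the dot in "overl." is not interpreted as a regular-expression wildcard, and the result is additionally stripped of surrounding spaces.

```python
import pandas as pd


def first_cell_with(row, key):
    """Return the first cell in the row that contains the key string, else ''."""
    for value in row:
        if isinstance(value, str) and key in value:
            return value
    return ""


# Hypothetical mini-table with the death information in different columns.
df_demo = pd.DataFrame({
    "Field1": [" Geb. xxx sept. 1678", " overl. 1700."],
    "Field2": [" overl. 29 juni 1759.", None],
})
fields = ["Field1", "Field2"]

# Collect the matching cell into a column named after the key string ...
df_demo["overl."] = df_demo[fields].apply(first_cell_with, axis=1, key="overl.")
# ... and remove the key string itself, treating it as literal text.
df_demo["overl."] = (df_demo["overl."]
                     .str.replace("overl.", "", regex=False)
                     .str.strip())

print(df_demo["overl."].tolist())
```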

### Step 4. 
The fields **first_name** and **surname_temp** also contain alternative surnames and information about family relations.

In the original text file all information about alternative surnames is provided between ( ) and information about family relationships between [ ]. As a next step we have therefore moved the information between [ ] in the field **first_name** into a new field called **name_info_family**, and the information between ( ) in the field **surname_temp** into a new field called **alternative_name**. Once this additional information has been removed from the **first_name** and **surname_temp** columns, the infixes of the various individuals can be isolated by splitting the column **surname_temp** on the comma.


Since information about the family, like "son of", is always put between [ ] as part of the first name, this information is isolated into **name_info_family**.

e.g. 

> Abbinck; Lambertus Hermanus [broer van Tieleman] Geb. Zutphen 4 juli 1771; pred. Bahr, Lathum en Giesbeek 19 okt. 1794, Groenlo 13 april 1806, overl. 20 nov. 1838.<26>
> Hartman; Rudolph [z.v. Constantinus] Geb. Enkhuizen 2 aug. 1668; vlootpred. 1689; pred. Steenbergen (NBr) + Kruisland 3 sept. 1690, overl. 26 juli 1700.<4022>



In [None]:

df['name_info_family'] =df['first_name'] .str.extract(r'\[(.*?)\]')


Alternative surnames are always put between ( ) and are isolated into a field alternative_name.

e.g. 
> Haitsema; Messias (Mesche) Loban Gedoopt Winschoten 19 febr. 1673; pred. Weener (Oost-Friesland) 1695, Winschoten 17 april 1698, overl. 30 juli 1698.<3857>
> Gruterus (de Gruyter); Samuel Simonsz. Geb. Leiden yyy; pred. Delfshaven 1605, overl. 1634.<3750>

In [None]:
df['alternative_name'] =df['surname_temp'] .str.extract(r'\((.*?)\)')


Remove all the alternative surnames and the information about the family from the name fields.

In [None]:
df['surname_temp']= df['surname_temp'].str.replace(r'\(.*\)', '', regex=True)
df['first_name']= df['first_name'].str.replace(r'\[.*\]', '', regex=True)

In [None]:
df[['surname', 'infix']] = df['surname_temp'].str.split(',', expand=True)

In [None]:
# first_name starts with a space (it follows the ";" separator), so index 1 holds the first letter
df['first_letter'] = df['first_name'].astype(str).apply(lambda x: x[1])

In [None]:
df = df.drop(['surname_temp'], axis=1)
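
The name parsing of this step can be summarised for a single record with plain regular expressions. The sketch below uses the same patterns as the cells above; the function `split_name` and its return keys are hypothetical names chosen for illustration, and the stripping of surrounding spaces is an addition.

```python
import re


def split_name(surname_raw: str, first_name_raw: str) -> dict:
    """Split the raw name fields into surname, infix, first name and extras."""
    family = re.search(r"\[(.*?)\]", first_name_raw)    # e.g. [z.v. Constantinus]
    alternative = re.search(r"\((.*?)\)", surname_raw)  # e.g. (Kuntius)
    surname_clean = re.sub(r"\(.*\)", "", surname_raw)
    first_clean = re.sub(r"\[.*\]", "", first_name_raw)
    # Everything after the first comma in the surname field is the infix.
    surname, _, infix = surname_clean.partition(",")
    return {
        "surname": surname.strip(),
        "infix": infix.strip(),
        "first_name": first_clean.strip(),
        "alternative_name": alternative.group(1) if alternative else None,
        "name_info_family": family.group(1) if family else None,
    }


knuyt = split_name("Knuyt, de (Kuntius)", " Elias ")
hartman = split_name("Hartman", " Rudolph [z.v. Constantinus] ")
print(knuyt)
print(hartman)
```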


### Step 5.
In order to turn the various fields into understandable entities, the names of the columns have been translated to English.

In [None]:
columns_rename = {
    'Geb.': 'birth',
    'overl.': 'death',
    'Gedoopt':'baptized',
    'legerpred.':'legerpredikant',
    'pastoor':'pastoor',
    'garnizoenspred.':'garnizoenspredikant',
    "emer.":'emeritus_status',
    "begraven":'burried',
    "conrector":'conrector',
    " rector":'rector',
    "monnik":'monnik',
    "schoolmeester":'schoolmeester',
    "hoogleraar":'hoogleraar',
    "chirurgijn":'chirurgijn',
    "praeceptor":'praeceptor',
    "ziekentrooster":'ziekentrooster',
    "vlootpred.":'vlootpredikant',
    "legerpred.":'legerpredikant',
    "ambassadepred.":'ambassadepredikant'}

In [None]:
# Rename the columns
df = df.rename(columns=columns_rename)
new_columns = list(columns_rename.values())


Now that all the information from the fields has been parsed into the right columns, we can drop the unassigned fields.

In [None]:
array_drop = [i for i in range(1, 19)]
for dropid in array_drop:
    column_dropid = 'Field'+str(dropid)
    df = df.drop(column_dropid, axis=1)

### Step 6. 
Here all the fields that contain information about an individual's role as minister are merged. This produces a field containing a string with all the locations and years in which someone was minister, in subsequent order.

For example:

>Haack; Petrus Geb. Brielle okt. 1747; pred. Noordgouwe 26 nov. 1769, Zwartewaal 20 nov. 1774, Sommelsdijk 6 juli 1777, Breda 23 juni 1782, Amsterdam 25 nov. 1789, politiek afgezet 1796; vertrokken naar Hamburg, hersteld: pred. Amsterdam 1804, overl. 27 juli 1824.<3782>

Became 

| drc_id  | ... | ... | zx0 pred. | zx1 pred. | zx2 pred. | zx3 pred. | zx4 pred. | ... | etc. |
|---|---|---|---|---|---|---|---|---|---|
| 3782 |	 |		| Noordgouwe 26 nov. 1769, Zwartewaal 20 nov. 1774, Sommelsdijk 6 juli 1777, Breda 23 juni 1782, Amsterdam 25 nov. 1789	| Amsterdam 1804	|  	| | | | |

In [None]:
df['minister'] = df['zx0 pred.']+ ','+df['zx1 pred.']+ ','+df['zx2 pred.']+ ','+df['zx3 pred.']+ ','+df['zx4 pred.']

In [None]:
df = df.drop(['zx0 pred.','zx1 pred.','zx2 pred.','zx3 pred.','zx4 pred.'], axis=1)

And is now converted into

| drc_id  | ... | ... | minister | ... | etc. |
|---|---|---|---|---|---|
| 3782 |	 |		| Noordgouwe 26 nov. 1769, Zwartewaal 20 nov. 1774, Sommelsdijk 6 juli 1777, Breda 23 juni 1782, Amsterdam 25 nov. 1789, Amsterdam 1804	|  	| |

Here the commas separate the various locations and starting moments at which someone was minister.


In [None]:
def extract_year(text):
    match = re.search(r'\d{4}', text)
    if match:
        return match.group(0)
    else:
        return None

In [None]:
function_year = [word for word in new_columns if word != 'minister']

In [None]:
for year in function_year:
    fld_year = year +'_year'
    df[fld_year] = df[year].apply(lambda x: extract_year(x))

In [None]:
for year_accu in function_year:
    accu_fld_year = year_accu +'_year_accuracy'
    df[accu_fld_year] = ''
    # Iterate over each row in the DataFrame
    for index, row in df.iterrows():
        # Check if the string contains "ca." (case-insensitive)
        if 'ca.' in row[year_accu].lower():
            # If found, set the value of the "accuracy" column to "circa"
            df.at[index, accu_fld_year] = 'circa'
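
The year and accuracy extraction can be tried on individual strings. The sketch below combines both checks in one hypothetical helper, `year_and_accuracy`, and uses values taken from the sample records; it is meant as an illustration of the rule, not as a replacement for the cells above.

```python
import re


def year_and_accuracy(text: str):
    """Return (first four-digit year or None, 'circa' if 'ca.' occurs else '')."""
    match = re.search(r"\d{4}", text)
    year = match.group(0) if match else None
    accuracy = "circa" if "ca." in text.lower() else ""
    return year, accuracy


examples = ["Castricum ca. 1686", "Biggekerke 5 jan. 1664", "Middelburg yyy", "16.."]
results = [year_and_accuracy(text) for text in examples]
for text, result in zip(examples, results):
    print(text, "->", result)
```

Incomplete years such as "16.." do not contain four consecutive digits and therefore return no year; such cases were left for the manual curation.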



To keep the information about specific dates, the original content of each field is also copied to a remarks field. This makes it possible to later write queries that extract specific information about dates.

In [None]:
for remark in function_year:
    remarks_field = 'remarks_'+remark
    df[remarks_field] = df[remark]

In [None]:
months =(" januari ",
 " februari ",
 " maart ",
 " april ",
 " mei ",
 " juni ",
 " juli ",
 " augustus ",
 " september ",
 " oktober ",
 " november ",
 " december ",
 "jan. ",
 "feb. ",
 "mrt. ",
 "apr. ",
 "jun. ",
 "jul. ",
 "aug. ",
 "sept. ",
 "sep. ",
 "okt. ",
 "nov. ",
 "dec. ",
 "yyy",
 "xxx",
 "ca.",
 "febr."
)


In [None]:
for column_strip in function_year:
    for month in months:
        df[column_strip] = df[column_strip].str.replace(month, '')

In [None]:
for column_strip in function_year:
    df[column_strip] = df[column_strip].apply(lambda x: re.sub(r'[\d\.]', '', x))
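
The two cleaning cells above reduce a field to its place name by first removing month names and placeholders and then removing digits and dots. The sketch below shows this on single strings; `DATE_TOKENS` is only a subset of the list above, `place_only` is a hypothetical helper, and the final `strip()` is an addition that the original cells do not perform.

```python
import re

# Subset of the month and placeholder strings, for illustration only.
DATE_TOKENS = ("jan. ", "aug. ", "sept. ", "ca.", "xxx", "yyy")


def place_only(text: str, tokens=DATE_TOKENS) -> str:
    """Remove date tokens, digits and dots so that only the place name is left."""
    for token in tokens:
        text = text.replace(token, "")  # literal replacement, no regex
    return re.sub(r"[\d.]", "", text).strip()


places = [place_only(text) for text in (
    "Biggekerke 5 jan. 1664",
    "Castricum ca. 1686",
    " Amsterdam 27 aug. 1756.",
    "xxx sept. 1678",
)]
print(places)
```

An unknown place (xxx) results in an empty string, which can later be treated as a missing value.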

In [None]:
# In order to link the DRC with DM a join field is created based on the surname, firstname and the infix.

df['join_name'] = df['surname']+df['first_name']+df['infix'].fillna('')
df['join_name'] = df['join_name'].str.replace("  "," ")

### Step 8. 

Since an individual might have had multiple positions as minister, the relation between an individual and a role is considered one-to-many. The process above added all the information of a minister's career into one field, where every new location and year is separated by a comma. The next step has therefore been to move the information from the column “minister” into a new table in which every role has a separate row containing the unique ID and the information about the role.

In [None]:
exclude_elements = ['birth', 'death', 'baptized', 'burried']

roles = [item for item in function_year if item not in exclude_elements]

In [None]:
child_role_dfs = []

for role in roles:
    accu_year = role +'_year_accuracy'
    year = role + '_year'
    role_remarks = 'remarks_'+role
    columns_to_check = [role, year, accu_year, role_remarks]
    # Work on a copy so that the changes below do not affect (or warn about) df
    role_df = df[['drc_id',role,year,accu_year,role_remarks]].copy()
    # Keep only the rows in which all role fields are filled
    role_df = role_df[role_df[columns_to_check].notna().all(axis=1)]
    new_column_names = {role:'role_place', accu_year : 'role_start_year_accuracy', year : 'role_start_year', role_remarks : 'role_remarks'}
    role_df = role_df.rename(columns=new_column_names)
    role_df['role_type'] = role
    child_role_dfs.append(role_df)

child_role = pd.concat(child_role_dfs, ignore_index=True)



In [None]:
column_list = df.columns.tolist()

In [None]:
df['place_birth'] = df['birth']
df['place_death'] = df['death']
df['place_baptized'] = df['baptized']
df['place_burried'] = df['burried']

The information about the parent is parsed to a separate table.

In [None]:
drc_parent = df[['drc_id','first_name', 'infix', 'surname', 'first_letter', 'place_birth', 'place_death', 'place_baptized', 'place_burried', 'name_info_family',  'birth_year', 'death_year', 'baptized_year', 'burried_year', 'birth_year_accuracy', 'death_year_accuracy', 'baptized_year_accuracy', 'burried_year_accuracy']]
drc_parent.to_csv(folderlink+folder_output+'01_bio_drc.csv', sep=';', encoding='utf-8', index=False)

Alternative names are stored in a separate table, since in theory individuals can have multiple alternative names. Yet since none have been found, we decided to leave this separation for the manual curation phase.

In [None]:
drc_child_alt_name = df[['drc_id','alternative_name']]
drc_child_alt_name.to_csv(folderlink+folder_output+'11_alt_name_drc.csv', sep=';', encoding='utf-8', index=False)

### Step 9. 
Now we create the child relation for roles. For this we first separated the ministers, after which the other roles were added.

In [None]:
subset_pred = df[['drc_id', 'minister','first_letter','join_name']]

Here the information about the role minister, which was created above in step 6, is split into different rows.

from: 

| drc_id  | ... | ... | minister | ... | etc. |
|---|---|---|---|---|---|
| 3782 |	 |		| Noordgouwe 26 nov. 1769, Zwartewaal 20 nov. 1774, Sommelsdijk 6 juli 1777, Breda 23 juni 1782, Amsterdam 25 nov. 1789, Amsterdam 1804	|  	| |


to:

| drc_id  | minister | ... | etc. |
|---|---|---|---|
| 3782 | Noordgouwe 26 nov. 1769|  	| |
| 3782 | Zwartewaal 20 nov. 1774|  	| |
| 3782 | Sommelsdijk 6 juli 1777|  	| |
| 3782 | Breda 23 juni 1782|  	| |
| 3782 | Amsterdam 25 nov. 1789|  	| |
| 3782 | Amsterdam 1804	|  	| |




In [None]:
df_expanded = subset_pred.assign(minister=subset_pred['minister'].str.split(','))

# Explode the 'minister' column to create separate rows for each item
df_expanded = df_expanded.explode('minister')

In [None]:
df_filtered = df_expanded[['drc_id','first_letter', 'minister','join_name']]

It produced some empty rows that need to be removed.

In [None]:
# Take a copy so that the columns added below do not trigger a SettingWithCopyWarning
childs = df_filtered[(df_filtered["minister"] !=" ") & (df_filtered["minister"] !="")].copy()
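
The split, explode and clean-up sequence can be followed on one hypothetical row. The sketch below uses a shortened version of the Haack record with deliberately empty items; stripping the whitespace before filtering is an addition that the cells above do not perform.

```python
import pandas as pd

# Hypothetical single-row table with empty items between and after the commas.
ministers = pd.DataFrame({
    "drc_id": ["3782"],
    "minister": [" Noordgouwe 26 nov. 1769, Zwartewaal 20 nov. 1774,, Amsterdam 1804,"],
})

# Turn the comma-separated string into a list and create one row per list item.
rows = ministers.assign(minister=ministers["minister"].str.split(",")).explode("minister")

# Strip surrounding spaces, then drop the empty items left by double or trailing commas.
rows["minister"] = rows["minister"].str.strip()
rows = rows[rows["minister"] != ""].reset_index(drop=True)

print(rows)
```

Every remaining row keeps the drc_id of the individual, which is what makes the one-to-many relation between individuals and roles explicit.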

In [None]:
childs.head()

The procedure is repeated to isolate the year into a separate field, add circa and keep the original input as remarks.

In [None]:
childs['minister_year'] = childs['minister'].apply(lambda x: extract_year(x))

In [None]:
childs['role_start_year_accuracy'] = ''
# Raw string so that the escaped dot is passed to the regular expression unchanged
childs.loc[childs['minister'].str.contains(r'ca\.', case=False), 'role_start_year_accuracy'] = 'circa'

In [None]:
childs['role_remarks'] = childs['minister']

In [None]:
for month in months:
        childs['minister'] = childs['minister'].str.replace(month, '')

childs['minister'] = childs['minister'].apply(lambda x: re.sub(r'[\d\.]', '', x))

A new field is created with the role type. For this "predikant" is filled in. 

In [None]:
childs['role_type'] = "predikant"

In [None]:
new_column_names_min = {'minister':'role_place', 'minister_year' : 'role_start_year'}
childs.rename(columns=new_column_names_min, inplace=True)

In [None]:
role_minister = childs[['drc_id', 'role_type','role_place','role_start_year','role_start_year_accuracy','role_remarks']]

### Step 10. 
Next, the other roles are combined with the minister roles and written to the 12_roles_drc.csv output file.

In [None]:
child_roles = pd.concat([child_role, role_minister], ignore_index=True)

In [None]:
child_roles.to_csv(folderlink+folder_output+'12_roles_drc.csv', sep=';', encoding='utf-8', index=False)

In [None]:
child_roles.head()

In [None]:
os.remove(output_txt)
os.remove(output_csv)

# Final remarks.

As said, this workflow describes the steps that were taken to process the DRC file from a Word document into a series of csv files which can be integrated into a Relational Database Management System. The data processing converted 7111 out of the 12579 individuals successfully. For 5468 individuals modifications were made during the data curation. The notebook above thus helped in structuring the dataset and isolating the various fields. Yet, since there were many exceptions, manual curation was essential.
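
To illustrate how the resulting tables can be queried once they are in a relational database, the sketch below loads two tiny hand-made tables into an in-memory SQLite database and joins them on drc_id. The table names `bio_drc` and `roles_drc`, the selection of columns and the choice of SQLite are assumptions made for this example; the rows are taken from the sample records shown earlier and stand in for the contents of 01_bio_drc.csv and 12_roles_drc.csv.

```python
import sqlite3

import pandas as pd

# Hand-made stand-ins for the biography and role tables (illustration only).
bio = pd.DataFrame({
    "drc_id": ["2", "3782"],
    "surname": ["Aalst", "Haack"],
    "place_birth": ["Castricum", "Brielle"],
})
roles = pd.DataFrame({
    "drc_id": ["3782", "3782", "2"],
    "role_type": ["predikant", "predikant", "predikant"],
    "role_place": ["Noordgouwe", "Zwartewaal", "Kalslagen"],
    "role_start_year": ["1769", "1774", "1717"],
})

conn = sqlite3.connect(":memory:")
bio.to_sql("bio_drc", conn, index=False)
roles.to_sql("roles_drc", conn, index=False)

# All roles of individuals born in Brielle, in chronological order.
query = """
    SELECT b.surname, r.role_place, r.role_start_year
    FROM bio_drc AS b
    JOIN roles_drc AS r ON r.drc_id = b.drc_id
    WHERE b.place_birth = 'Brielle'
    ORDER BY r.role_start_year
"""
result = pd.read_sql_query(query, conn)
conn.close()

print(result)
```

Questions such as "which ministers born in a given place served where, and from when" become a single join over the parent and child tables, which is the kind of systematic query the original text file did not allow.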

