# Dutch Ministers (DM) 1572-2004

The dataset **Dutch Ministers (DM)** is provided by the brothers van Lieburg and contains all the positions that Dutch Reformed ministers held between 1572 and 2004. The dataset contains one row for every career step of a minister and is said to **contain more up-to-date information** than [DRC](./1_1_DRC_1555-1816.ipynb). Because every career step is on its own row, an individual with several posts appears on several rows. For instance, Isaäc Abbema in the example below held two posts: one in Berkenwoude from 1618 to 1635 and one in Gouda from 1635 to 1637 (figure 1).

![Figure 1 Dutch Ministers 1572-2004](../images/figure2.png)

*Figure 1 - Sample of Dutch Ministers 1572-2004 dataset* 

Contrary to [DRC](./1_1_DRC_1555-1816.ipynb), this dataset also contains data about ministers who started their careers after 1816. What makes this dataset difficult to work with is that no unique ID is provided, so individuals cannot easily be distinguished, especially since the same names recur over time. Of the 53646 records in this dataset, in 25082 cases a record carries exactly the same name as another record. However, grouping the records by name alone produces unfeasible career paths. For instance, the name "J. de Jong" would have held 30 positions over an unfeasibly long period of time. On closer inspection, "J. de Jong" turns out to represent multiple individuals (which is not a surprise in the Netherlands).

![Figure 2](../images/figure3.png)

*Figure 2 - Number of times a name appeared in DM*
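The name counts discussed above can be explored with a few pandas calls. The sketch below uses a small made-up table (the rows, years and the variable names `toy`, `name_counts`, `shared_name_rows` and `span` are illustrative assumptions, not taken from DM) and shows one possible way of counting repeated names; it is not necessarily how the figures quoted above were obtained.

```python
import pandas as pd

# Toy stand-in for the DM table; all rows are made up for illustration.
toy = pd.DataFrame({
    "predikant": ["Jong; J. de", "Jong; J. de", "Jong; J. de",
                  "Abbema; Isaäc", "Abbema; Isaäc", "Aalst; Wilhelmus"],
    "jaar intrede": [1650, 1820, 1950, 1618, 1635, 1700],
})

# How often does each exact name string occur?
name_counts = toy["predikant"].value_counts()

# Number of records whose name also occurs on at least one other record.
shared_name_rows = toy["predikant"].duplicated(keep=False).sum()

# Years between the first and last start year per name; a very long span
# suggests that one name covers several people.
start_years = toy.groupby("predikant")["jaar intrede"]
span = start_years.max() - start_years.min()

print(name_counts)
print(shared_name_rows)
print(span)
```

A name with a span of several centuries, such as the made-up "Jong; J. de" here, cannot belong to a single person, which is the reason names alone are not used as identifiers.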


To integrate this dataset into the CLERUS dataset, a pipeline that extracts individuals from it was developed and is presented in this notebook. The main idea is that career paths can be reconstructed by linking rows on the combination of the name, the place where someone was minister, and the year in which they started. This is possible because every row contains the year a minister started and the year he left. In addition, the data contains the name of the place where they served and the place they came from. Table 1 below shows an example of the dataset.


|...|gemeente (community/ parish) |predikant (name of minister)| ... | jaar intrede (start year) |... | jaar vertrek (end year)| ... | ... |
|---|---|---|---|---|---|---|---|---|
|...|Hedikhuizen|Rosiere (Rosarius); Josephus van de |...| 1611 | ... | 1617 | ...| ...|
|...|Woerden|Rosiere (Rosarius); Josephus van de | ... | 1617 | ... | 1619| ...| ...|
|...|Haarlem|Rosiere (Rosarius); Josephus van de | ... | 1619 | ... | 1649| ...| ...|


Table 1 shows the records related to one career. This minister, Josephus van de Rosiere (Rosarius), started his career in 1611 in Hedikhuizen, moved to Woerden in 1617, where he held a position until 1619, and then moved to Haarlem, where he stayed until 1649, when he retired or passed away. To link these records with each other, two keys are created for every row: one combining the name with the start year and one combining the name with the end year. The end key of one post then equals the start key of the next.



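|...|gemeente (community/ parish) |predikant (name of minister)| ... | jaar intrede (start year) |... | jaar vertrek (end year)| ... | ... | start_name_year | end_name_year |
|---|---|---|---|---|---|---|---|---|---|---|
|...|Hedikhuizen|Rosiere (Rosarius); Josephus van de |...| 1611 | ... | 1617 | ...| ...| Rosiere(Rosarius);Josephusvande_1611 | Rosiere(Rosarius);Josephusvande_1617 |
|...|Woerden|Rosiere (Rosarius); Josephus van de | ... | 1617 | ... | 1619| ...| ...| Rosiere(Rosarius);Josephusvande_1617 | Rosiere(Rosarius);Josephusvande_1619 |
|...|Haarlem|Rosiere (Rosarius); Josephus van de | ... | 1619 | ... | 1649| ...| ...| Rosiere(Rosarius);Josephusvande_1619 | Rosiere(Rosarius);Josephusvande_1649 |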
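The key construction can be sketched on the three rows of this career. The snippet is a minimal illustration: the frame `career` and the helper column `continues_previous` are assumptions made for this example, and the keys are built the same way as in the processing cells further down (name without spaces, underscore, year).

```python
import pandas as pd

# The three posts of Josephus van de Rosiere (Rosarius) from table 1.
career = pd.DataFrame({
    "gemeente": ["Hedikhuizen", "Woerden", "Haarlem"],
    "predikant": ["Rosiere (Rosarius); Josephus van de"] * 3,
    "jaar intrede": [1611, 1617, 1619],
    "jaar vertrek": [1617, 1619, 1649],
})

# Name without spaces, then one key per row for the start and the end year.
name = career["predikant"].str.replace(" ", "", regex=False)
career["start_name_year"] = name + "_" + career["jaar intrede"].astype(str)
career["end_name_year"] = name + "_" + career["jaar vertrek"].astype(str)

# A row continues the previous one when its start key equals that row's end key.
# This helper column only serves the illustration.
continues_previous = career["start_name_year"].eq(career["end_name_year"].shift())

print(career[["gemeente", "start_name_year", "end_name_year"]])
print(continues_previous.tolist())
```

The first post has no predecessor, while the posts in Woerden and Haarlem each start with the key on which the previous post ended, so the three rows form one chain.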







### Manual cleaning

Before performing this processing step, the DM dataset was cleaned. A thorough scan of the dataset revealed the following types of errors:
- Information is stored in the wrong column.
- Names are preceded by spaces (which makes them difficult to link).
- The ; between the surname and the first name is missing, which makes it difficult to split the two at a later stage.
- Many individuals have only one value in the field "predikant", which makes it difficult to tell whether it is a surname or a first name, and therefore difficult to link these records.

A round of corrections was carried out and produced an updated list. This list still contains 131 records that need to be checked. This does not mean, however, that the rest of the file is free of errors. The cleaning only checked the following:
- whether "jaar intrede" has a numeric value
- whether "predikant" starts with a number
- how many semicolons the field "predikant" contains (records with a number other than 1 were put on the list to check)
- whether "predikant" starts with a space
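The four checks listed above could be expressed in pandas roughly as follows. This is a sketch under assumptions: the cleaning itself was done manually, the rows in `sample` are made up (one per problem type plus two clean rows), and the names `sample`, `checks` and `to_check` are hypothetical.

```python
import pandas as pd

# Made-up rows for illustration only.
sample = pd.DataFrame({
    "predikant": ["Abbema; Isaäc", " Aalst; Wilhelmus", "1618 Jansen; Jan",
                  "Petrus", "Rosiere (Rosarius); Josephus van de"],
    "jaar intrede": ["1618", "1700", "1650", "16xx", "1611"],
})

checks = pd.DataFrame({
    # "jaar intrede" cannot be parsed as a number
    "year_not_numeric": pd.to_numeric(sample["jaar intrede"], errors="coerce").isna(),
    # "predikant" starts with a digit
    "starts_with_digit": sample["predikant"].str.match(r"\d"),
    # "predikant" does not contain exactly one semicolon
    "semicolon_count_not_1": sample["predikant"].str.count(";") != 1,
    # "predikant" starts with a space
    "leading_space": sample["predikant"].str.startswith(" "),
})

# Records that fail at least one check and therefore need a manual look.
to_check = sample[checks.any(axis=1)]
print(to_check)
```

Keeping one boolean column per check makes it easy to see why a record was flagged, which helps when the flagged records are corrected by hand.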

In [1]:
# import the required libraries
import os
import re
import csv
import pandas as pd
import numpy as np
import networkx as nx

In [2]:
# Set variables for the project (i.e. the location of the input file to be processed)

folderlink = '..//data//'
input_folder = 'input//'
input_pred = folderlink+input_folder+"DM_predikanten.csv"

In [3]:
# pandas settings for showing data
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', None)

In [4]:
jaar_vertrek_type = {'jaar vertrek': pd.Int64Dtype(), 'ind_id': pd.Int64Dtype(),'dag intrede': pd.Int64Dtype(), 'dag vertrek': pd.Int64Dtype() }
df = pd.read_csv(input_pred, sep=',', dtype= jaar_vertrek_type, encoding='utf-8')

In [5]:
# The dataset is split in two. Some rows record where someone came from ('Herkomst')
# and where he or she went to ('vertrek naar of vanwege'); this is mostly the case for
# ministers from after 1816. Other rows only have the years in which someone changed
# position. The split is made on whether the field 'Herkomst' is NaN.
# Rows with a value in 'Herkomst' form dm_part2, rows without one form dm_part1.

# .copy() makes dm_part1 an independent DataFrame, so the column assignments
# below do not trigger pandas' SettingWithCopyWarning.
dm_part1 = df[df['Herkomst'].isna()].copy()

In [6]:
# Remove all spaces from the name and build the two link keys:
# name_startyear and name_endyear
dm_part1['predikant'] = dm_part1['predikant'].str.replace(' ', '')
dm_part1['pred_year_start'] = dm_part1['predikant'].astype(str)+'_'+dm_part1['jaar intrede'].astype(str)
dm_part1['pred_year_end'] = dm_part1['predikant'].astype(str)+'_'+dm_part1['jaar vertrek'].astype(str)

In [7]:
tdf = dm_part1
variable_end = 'pred_year_end'
variable_start = 'pred_year_start'
variable_name = 'predikant'


In [8]:
def create_node_paths_dataframe(dataframe):
    """Assign an 'individual' ID to every row by linking start and end keys.

    Relies on the notebook-level variables variable_start, variable_end and
    variable_name, which hold the names of the columns to use.
    """
    # Build a directed graph with one edge per row: start key -> end key
    G = nx.DiGraph()
    for _, row in dataframe.iterrows():
        G.add_edge(row[variable_start], row[variable_end], name=row[variable_name])

    # Every connected component is one career path; number the paths from 1
    paths = nx.connected_components(G.to_undirected())
    node_path_pairs = [
        (node, path_id)
        for path_id, path_nodes in enumerate(paths, start=1)
        for node in path_nodes
    ]

    # One row per node (key) with the ID of the path it belongs to
    node_paths = pd.DataFrame(node_path_pairs, columns=[variable_start, 'individual'])

    # Look up the path ID of every record through its start key
    joined = pd.merge(dataframe, node_paths, on=variable_start, how='left')

    return joined

In [9]:
result_dm_part1 = create_node_paths_dataframe(tdf)
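To illustrate what the linking function does, the sketch below applies the same idea (keys as nodes, records as edges, connected components as individuals) to five made-up records that all carry the same name. The frame `rows`, its column names and the years are assumptions for illustration; the function above is not called here, so the snippet runs on its own.

```python
import networkx as nx
import pandas as pd

# Five made-up records with the same name: two separate careers.
rows = pd.DataFrame({
    "pid": [1, 2, 3, 4, 5],
    "start": ["Jong;J.de_1611", "Jong;J.de_1617", "Jong;J.de_1619",
              "Jong;J.de_1700", "Jong;J.de_1712"],
    "end": ["Jong;J.de_1617", "Jong;J.de_1619", "Jong;J.de_1649",
            "Jong;J.de_1712", "Jong;J.de_1730"],
})

# Each record is an edge between its start key and its end key.
G = nx.Graph()
G.add_edges_from(zip(rows["start"], rows["end"]))

# Keys that are connected through shared years belong to the same individual.
node_to_individual = {}
for individual, nodes in enumerate(nx.connected_components(G), start=1):
    for node in nodes:
        node_to_individual[node] = individual

rows["individual"] = rows["start"].map(node_to_individual)
print(rows)
```

The first three records chain together through 1617 and 1619, the last two through 1712, so two individuals are found although the name is identical. A consequence of this approach is that two namesakes would be merged if one happened to leave a post in the same year in which the other started one.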

In [10]:
individuals_p1 = result_dm_part1[['pid',variable_name,'individual']].copy()

In [11]:
# Highest individual ID used in part 1; used further down to offset the IDs of part 2
dm_part1_max = individuals_p1['individual'].max()


In [12]:
# .copy() makes dm_part2 an independent DataFrame, so the column assignments
# below do not trigger pandas' SettingWithCopyWarning.
dm_part2 = df[df['Herkomst'].notna()].copy()

In [13]:
# Remove spaces from the place names and keep only their first six characters
dm_part2['gemeente'] = dm_part2['gemeente'].str.replace(' ', '').str[:6]
dm_part2['vertrek naar of vanwege'] = dm_part2['vertrek naar of vanwege'].str.replace(' ', '').str[:6]

In [14]:
# Remove all spaces from the name and build the two link keys:
# name_place_startyear and name_destination_endyear
dm_part2['predikant'] = dm_part2['predikant'].str.replace(' ', '')
dm_part2['pred_start'] = dm_part2['predikant'].astype(str)+'_'+dm_part2['gemeente'].astype(str)+'_'+dm_part2['jaar intrede'].astype(str)
dm_part2['pred_end'] = dm_part2['predikant'].astype(str)+'_'+dm_part2['vertrek naar of vanwege'].astype(str)+'_'+dm_part2['jaar vertrek'].astype(str)

In [15]:
dm_part2 = dm_part2.sort_values(by='pred_start')
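For the second part the keys also include a place. The sketch below shows on three made-up records why that helps; the frame `posts`, the helper `short` and all values are assumptions for illustration. Why six characters are kept is not stated in this notebook; presumably a short prefix tolerates small spelling differences in longer place names.

```python
import pandas as pd

# Made-up records: the second and third both start in 1860 under the same name.
posts = pd.DataFrame({
    "predikant": ["Jong; J. de", "Jong; J. de", "Jong; J. de"],
    "gemeente": ["Berkenwoude", "Gouda", "Sint Anna Parochie"],
    "jaar intrede": [1850, 1860, 1860],
    "vertrek naar of vanwege": ["Gouda", "emeritaat", "Haarlem"],
    "jaar vertrek": [1860, 1890, 1875],
})

def short(series, length=6):
    """Remove spaces and keep the first `length` characters (hypothetical helper)."""
    return series.str.replace(" ", "", regex=False).str[:length]

name = posts["predikant"].str.replace(" ", "", regex=False)
posts["pred_start"] = (name + "_" + short(posts["gemeente"])
                       + "_" + posts["jaar intrede"].astype(str))
posts["pred_end"] = (name + "_" + short(posts["vertrek naar of vanwege"])
                     + "_" + posts["jaar vertrek"].astype(str))

print(posts[["pred_start", "pred_end"]])
```

With name and year alone, both records starting in 1860 would follow on from the post in Berkenwoude. With the destination in the key, only the record in Gouda does.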

In [16]:
tdf = dm_part2
variable_end = 'pred_end'
variable_start = 'pred_start'
variable_name = 'predikant'


In [17]:
result_dm_part2 = create_node_paths_dataframe(tdf)

In [18]:
individuals_p2 = result_dm_part2[['pid',variable_name,'individual']].copy()

In [19]:
# Shift the IDs of part 2 so that they do not overlap with the IDs of part 1
individuals_p2['individual'] = individuals_p2['individual'] + dm_part1_max

In [20]:
individuals = pd.concat([individuals_p1, individuals_p2], ignore_index=True)
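Both parts number their individuals from 1, so the IDs of the second part are shifted before the two tables are combined. A minimal sketch with made-up values (the frames `p1`, `p2` and `combined` are assumptions for illustration):

```python
import pandas as pd

# Both parts were numbered independently, each starting at 1.
p1 = pd.DataFrame({"pid": [10, 11, 12], "individual": [1, 1, 2]})
p2 = pd.DataFrame({"pid": [20, 21], "individual": [1, 2]})

# Shift the IDs of the second part past the highest ID of the first part.
offset = p1["individual"].max()
p2["individual"] = p2["individual"] + offset

combined = pd.concat([p1, p2], ignore_index=True)
print(combined)
```

Without the offset, individual 1 of the first part and individual 1 of the second part would be treated as the same person after concatenation.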

In [21]:
# Keep only the first record of every individual
unique_individuals = individuals.drop_duplicates(subset=['individual'])

In [22]:
unique_individuals.head(50)

| |pid|predikant|individual|
|---|---|---|---|
|0|35685|Aalburg;Johannesvan|1|
|1|22489|Aalst;Corneliusvan|2|
|2|46953|Aalst;Gerardusvan|3|
|5|741|Aalst;Wilhelmus|4|
|6|8276|Aalstius;Henricus|5|
|7|32102|Aalstius;Johannes|6|
|11|21435|Aalstius;Johannes|7|
|14|5357|Aalstius;Leonardus|8|
|16|4078|Aalstius;Petrus|9|
|19|3584|Aalstius;Wilhelmus|10|


In [23]:
unique_individuals.describe()

| |pid|individual|
|---|---|---|
|count|31656.0|31656.0|
|mean|24293.77941|15828.5|
|std|15503.700562|9138.444397|
|min|1.0|1.0|
|25%|10421.75|7914.75|
|50%|23030.5|15828.5|
|75%|37313.25|23742.25|
|max|53646.0|31656.0|


In [26]:
individuals.head(50)

| |pid|predikant|individual|
|---|---|---|---|
|0|35685|Aalburg;Johannesvan|1|
|1|22489|Aalst;Corneliusvan|2|
|2|46953|Aalst;Gerardusvan|3|
|3|41854|Aalst;Gerardusvan|3|
|4|48641|Aalst;Gerardusvan|3|
|5|741|Aalst;Wilhelmus|4|
|6|8276|Aalstius;Henricus|5|
|7|32102|Aalstius;Johannes|6|
|8|6933|Aalstius;Johannes|6|
|9|32398|Aalstius;Johannes|6|
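A simple check on the result is to count the records per individual; implausibly long careers would show up as very high counts. The sketch below does this for the ten rows displayed above, retyped into a small frame (the names `individuals_sample` and `records_per_individual` are assumptions for illustration).

```python
import pandas as pd

# The ten rows shown in the output above.
individuals_sample = pd.DataFrame({
    "pid": [35685, 22489, 46953, 41854, 48641, 741, 8276, 32102, 6933, 32398],
    "predikant": ["Aalburg;Johannesvan", "Aalst;Corneliusvan", "Aalst;Gerardusvan",
                  "Aalst;Gerardusvan", "Aalst;Gerardusvan", "Aalst;Wilhelmus",
                  "Aalstius;Henricus", "Aalstius;Johannes", "Aalstius;Johannes",
                  "Aalstius;Johannes"],
    "individual": [1, 2, 3, 3, 3, 4, 5, 6, 6, 6],
})

# Number of records (career steps) linked to each individual.
records_per_individual = individuals_sample.groupby("individual").size()
print(records_per_individual)
```

On the full table the same call, sorted in descending order, would list the individuals with the most linked positions, which are the first candidates for a manual check.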
