# Integrate Dutch Ministers (DM) in CLERUS

The process of integrating the DRC dataset into CLERUS is described in detail in [1_1_DRC_1555-1816.ipynb](1_1_DRC_1555-1816.ipynb). That notebook also describes the curation step that was carried out before the final integration of DRC into CLERUS. [1_2_DM_1572-2004.ipynb](1_2_DM_1572-2004.ipynb) presents how individuals are extracted from the DM dataset. In this notebook these individuals are connected with the individuals from the curated DRC dataset.

The strategy is to extend the DM dataset with the clerus_id for those records that are also present in the DRC dataset. This keeps the DM dataset as intact as possible. Once the DM dataset has been extended with the clerus_id for the matching records, it will be parsed into CLERUS.
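
As a minimal sketch of this strategy, a left merge keeps every DM row and fills in clerus_id only where a match exists. The two frames and the `key` column below are hypothetical toy data, not the project files; the actual matching key is built later in this notebook.

```python
import pandas as pd

# Hypothetical toy data; `key` stands in for whatever matching field is used.
dm = pd.DataFrame({"key": ["a", "b", "c"], "gemeente": ["Delft", "Gouda", "Edam"]})
drc = pd.DataFrame({"key": ["a", "c"], "clerus_id": [10, 30]})

# A left merge keeps all DM rows and adds clerus_id where the key matches.
dm_extended = dm.merge(drc, on="key", how="left")

# Convert to the nullable integer dtype so unmatched rows hold <NA>
# and the matched ids are stored as integers again.
dm_extended["clerus_id"] = dm_extended["clerus_id"].astype("Int64")
print(dm_extended)
```

Because the merge is a left join, the number of DM rows does not change as long as the key is unique on the DRC side.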


In [18]:
# import the required libraries
import pandas as pd

In [2]:
# pandas display settings for showing data
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', None)

In [3]:
# Set variables for the project (i.e. the input location of the files to be processed and the output location)
folderlink = '..//data//'
input_folder = 'input//'
output_folder = 'output//'


In [19]:
# Set links to the files
d1_csv = folderlink+input_folder+"01_clerus_bio_curated.csv" 
d11_csv = folderlink+input_folder+"12_clerus_role_curated.csv"
d22_csv = folderlink+output_folder+"12_roles_dm.csv"

In [20]:
# Import the csvs as dataframes
d1 = pd.read_csv(d1_csv, sep=';', encoding='utf-8')

d11_col_data_type = {'clerus_id': pd.Int64Dtype(), 'role_start_year': pd.Int64Dtype(), 'role_end_year': pd.Int64Dtype()}
d11 = pd.read_csv(d11_csv, sep=';', dtype=d11_col_data_type, encoding='utf-8')

d22_data_types = {'jaar vertrek': pd.Int64Dtype(), 'ind_id': pd.Int64Dtype(),'dag intrede': pd.Int64Dtype(), 'dag vertrek': pd.Int64Dtype() }
d22 = pd.read_csv(d22_csv, dtype=d22_data_types)
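
The nullable `Int64` dtype matters for the string keys built later: a year column that contains missing values would otherwise be read as float and be rendered as `1660.0` rather than `1660`. The sketch below illustrates this with a hypothetical two-row CSV held in memory.

```python
import io
import pandas as pd

# Hypothetical two-row CSV with a missing end year in the second row.
csv_text = "clerus_id;role_start_year;role_end_year\n1;1650;1660\n2;1701;\n"

# Without explicit dtypes the column with a gap becomes float64.
plain = pd.read_csv(io.StringIO(csv_text), sep=";")

# With the nullable integer dtype the years stay integers and the gap becomes <NA>.
dtypes = {"clerus_id": pd.Int64Dtype(),
          "role_start_year": pd.Int64Dtype(),
          "role_end_year": pd.Int64Dtype()}
typed = pd.read_csv(io.StringIO(csv_text), sep=";", dtype=dtypes)

print(plain["role_end_year"].astype(str).iloc[0])   # float rendering
print(typed["role_end_year"].astype(str).iloc[0])   # integer rendering
```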


In [98]:
# from the 01_clerus_bio table we only require the name of the individual to be linked with DM 
d1_name = d1[['clerus_id', 'first_name', 'infix', 'surname']]
d1_d11_merge = pd.merge(d1_name, d11, on='clerus_id', how='inner')

In [104]:
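# keep only the roles whose role_type is minister (predikant)
d1_d11_merge['role_type'] = d1_d11_merge['role_type'].fillna('')
# .copy() makes the filtered frame independent, so the column assignments below do not raise a SettingWithCopyWarning
d1_d11 = d1_d11_merge[d1_d11_merge['role_type'].str.contains('pred', case=False)].copy()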
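
A small sketch of the filtering step, using hypothetical toy data: pandas warns when new columns are assigned on a frame that was obtained by boolean filtering, and taking an explicit copy is one common way to avoid that.

```python
import pandas as pd

# Hypothetical toy roles table.
roles = pd.DataFrame({"clerus_id": [1, 2, 3],
                      "role_type": ["Predikant", None, "schoolmeester"]})

# Treat missing role types as empty strings so str.contains returns only True/False.
mask = roles["role_type"].fillna("").str.contains("pred", case=False)

# .copy() makes the filtered frame independent of `roles`.
ministers = roles[mask].copy()

# Adding a column to the copy leaves the original frame untouched.
ministers["surname"] = "Jansen"
```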


In [105]:
# The first strategy to link DM with DRC is to link them based on the names.
# In DM the name is stored in one field (i.e. predikant). Since that field uses multiple
# methods to distinguish the first name, infix and surname (e.g. "." and ";"), we decided
# to map DRC to DM and remove all spaces and separating characters to create a field that can be linked.

d1_d11['surname'] = d1_d11['surname'].fillna('')
d1_d11['first_name'] = d1_d11['first_name'].fillna('')
d1_d11['infix'] = d1_d11['infix'].fillna('')
d1_d11['join_name'] = d1_d11['surname']+d1_d11['first_name']+d1_d11['infix']



In [106]:
# Here we strip spaces and separating characters from the names of the ministers in DM
# regex=False makes sure "." is treated as a literal full stop
d22['d22join_name'] = d22['predikant'].str.replace(";", "", regex=False).str.replace(" ", "", regex=False).str.replace(".", "", regex=False)
# We select the first word of the placename
d22['d22gemeente'] = d22['gemeente'].str.split().str[0]
# Add the start year as text, stripped of semicolons and spaces
d22['d22jaar intrede'] = d22['jaar intrede'].astype(str).str.replace(";", "", regex=False).str.replace(" ", "", regex=False)
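
The cleaning step can be sketched as a small helper. `normalise_name` is a hypothetical function that is not part of this notebook, and the sample names are invented; it only illustrates how differently punctuated spellings collapse to the same string.

```python
import pandas as pd

def normalise_name(names: pd.Series) -> pd.Series:
    """Strip semicolons, full stops and spaces (hypothetical helper)."""
    cleaned = names.fillna("")
    for char in (";", ".", " "):
        # regex=False treats "." as a literal full stop rather than "any character".
        cleaned = cleaned.str.replace(char, "", regex=False)
    return cleaned

# Invented sample values with different separators and a missing name.
sample = pd.Series(["Jansen; Pieter", "Jansen  Pieter.", None])
print(normalise_name(sample).tolist())
```

Removing single spaces already removes double spaces, so no separate replacement for `"  "` is needed.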

In [107]:
d1_d11.describe()

Unnamed: 0,clerus_id,role_place_id,role_classis_code,role_classis,role_parish,role_province,role_region,role_start_year,role_start_date_exact,role_end_year,role_end_date_exact,role_residence_place,role_residence_place_id,d11_count
count,27070.0,0.0,0.0,0.0,0.0,0.0,0.0,26698.0,0.0,2340.0,0.0,0.0,0.0,27070.0
mean,7348.41389,,,,,,,1701.256611,,1614.584188,,,,3.267196
std,5697.450009,,,,,,,72.774512,,293.357603,,,,1.812118
min,1.0,,,,,,,1545.0,,0.0,,,,1.0
25%,3191.0,,,,,,,1639.0,,1601.0,,,,2.0
50%,6362.0,,,,,,,1700.0,,1629.0,,,,3.0
75%,9498.0,,,,,,,1768.0,,1720.25,,,,4.0
max,30025.0,,,,,,,1856.0,,1865.0,,,,15.0


In [108]:
# Here we strip spaces and separating characters from the joined names of the ministers in DRC
d1_d11['d11join_name'] = d1_d11['join_name'].str.replace(";", "", regex=False).str.replace(" ", "", regex=False)

# We select the first word from the placename, since sometimes alternative names are included between brackets
d1_d11['d11_role_place'] = d1_d11['role_place'].str.split().str[0]



In [109]:
# We create a joining field, which is a combination between the name, placename and start year
d1_d11['d11j_place_year']=d1_d11['d11join_name'].astype(str)+"_"+d1_d11['d11_role_place'].astype(str)+"_"+d1_d11['role_start_year'].astype(str)



In [110]:
# For DM a join field is created in the same way, after which the two tables are joined
d22['d22j_place_year']=d22['d22join_name'].astype(str)+"_"+d22['d22gemeente'].astype(str)+"_"+d22['d22jaar intrede'].astype(str)
d11_d22 = pd.merge(d1_d11, d22, left_on='d11j_place_year', right_on='d22j_place_year', how='inner')
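
To see which DM rows find no DRC counterpart, the same composite key can be merged with `indicator=True`. This is only a sketch on invented toy data; `make_key` and the column names `name`, `place` and `year` are hypothetical stand-ins for the fields used above.

```python
import pandas as pd

# Invented toy rows; the second person has a different place in the two sources.
drc = pd.DataFrame({"name": ["JansenPieter", "VisserJan"],
                    "place": ["Delft", "Gouda"],
                    "year": [1650, 1701],
                    "clerus_id": [1, 2]})
dm = pd.DataFrame({"name": ["JansenPieter", "VisserJan"],
                   "place": ["Delft", "Edam"],
                   "year": [1650, 1701]})

def make_key(df: pd.DataFrame) -> pd.Series:
    """Combine name, first word of the place and start year into one string."""
    return df["name"].astype(str) + "_" + df["place"].astype(str) + "_" + df["year"].astype(str)

drc["key"] = make_key(drc)
dm["key"] = make_key(dm)

# indicator=True adds a `_merge` column: "both" for matches, "left_only" for DM rows without a match.
checked = dm.merge(drc[["key", "clerus_id"]], on="key", how="left", indicator=True)
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched[["key"]])
```

An inner join, as used in the cell above, silently drops the unmatched rows; the left join with an indicator keeps them available for inspection.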

In [111]:
# Count how many minister roles each clerus_id has in DRC
id_counts_d11 = d1_d11['clerus_id'].value_counts()

In [112]:
# Since an individual from DRC can have several minister roles in DM, we check for that. In d1_d11 the number of times an individual was a minister is counted.
d1_d11['d11_count'] = d1_d11['clerus_id'].map(id_counts_d11)



In [113]:
# Next, a string is created that combines the clerus_id with this count. For DM we will do the same later on
d1_d11['d11_count_unique'] = d1_d11['clerus_id'].astype(str)+"__"+d1_d11['d11_count'].astype(str)



In [114]:
# The number of times the clerus_id is present in the joined tables 
id_counts_d22_d11 = d11_d22['clerus_id'].value_counts()
d11_d22['d11_d22_count'] = d11_d22['clerus_id'].map(id_counts_d22_d11)

In [115]:
d11_d22['d11_d22_count_unique']= d11_d22['clerus_id'].astype(str)+"__"+d11_d22['d11_d22_count'].astype(str)

In [116]:
# join the two to see for which individuals the number of roles in DRC equals the number of matched roles
countd11_countd11_d22 = pd.merge(d1_d11, d11_d22, left_on='d11_count_unique', right_on='d11_d22_count_unique', how='inner')
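
The idea behind comparing the two `clerus_id__count` strings can be sketched directly on the counts: an individual is retained when the number of minister roles in DRC equals the number of roles that were matched with DM. The frames below are invented toy data and the names `n_drc` and `n_matched` are hypothetical.

```python
import pandas as pd

# Invented toy data: one row per minister role in DRC, and one row per matched role.
drc_roles = pd.DataFrame({"clerus_id": [1, 1, 2, 2, 3]})
matched = pd.DataFrame({"clerus_id": [1, 1, 2]})

# Roles per individual in DRC and in the matched table.
drc_counts = drc_roles["clerus_id"].value_counts().rename("n_drc")
match_counts = matched["clerus_id"].value_counts().rename("n_matched")

# Align the two counts on clerus_id; individuals without any match get 0.
counts = pd.concat([drc_counts, match_counts], axis=1).fillna(0).astype(int)

# Individuals for whom every DRC minister role was found in DM.
fully_matched = counts.index[counts["n_drc"] == counts["n_matched"]].tolist()
print(fully_matched)
```

Comparing integers avoids building and merging string keys, but it should select the same individuals as the string comparison used in this notebook.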

In [118]:
output_table = countd11_countd11_d22[['clerus_id_x','d11_count_unique','d11_d22_count_unique']].copy()

In [120]:
# remove duplicates
output_without_duplicates = output_table.drop_duplicates()

In [121]:
output_without_duplicates.describe()

Unnamed: 0,clerus_id_x
count,7798.0
mean,6605.39228
std,4941.647551
min,1.0
25%,2957.25
50%,5926.5
75%,9035.75
max,30020.0


In [122]:
output_without_duplicates.to_csv(folderlink+output_folder+'ouput_nodup_first_word.csv', sep=';', encoding='utf-8', index=False)