# Integrate Dutch Ministers (DM) in CLERUS

The integration of the DRC dataset into CLERUS is described in detail in [1_1_DRC_1555-1816.ipynb](1_1_DRC_1555-1816.ipynb). [1_2_DM_1572-2004.ipynb](1_2_DM_1572-2004.ipynb) presents how individuals are extracted from the DM dataset. This notebook links those individuals to the individuals in the (manually) curated DRC dataset.

The strategy is to extend the DM dataset with the clerus_id for those records that are also present in the DRC dataset. This keeps the DM dataset as intact as possible. Once the matching DM records carry a clerus_id, the DM dataset will be parsed into CLERUS.
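
As a minimal sketch of this strategy, using invented toy rows rather than the real data: a left merge keeps every DM record and only adds a clerus_id where a match exists. The column name `match_key` and all values below are hypothetical.

```python
import pandas as pd

# Toy DM records (hypothetical values); 'pid' identifies a DM row
dm = pd.DataFrame({
    "pid": [1, 2, 3],
    "match_key": [
        "JansenPieter_Delft_1650",
        "DeVriesJan_Gouda_1701",
        "BakkerKlaas_Edam_1720",
    ],
})

# Toy DRC side: only some keys are known and carry a clerus_id
drc = pd.DataFrame({
    "match_key": ["JansenPieter_Delft_1650", "BakkerKlaas_Edam_1720"],
    "clerus_id": pd.array([20041, 20043], dtype="Int64"),
})

# Left merge: all DM rows are kept, unmatched rows get a missing clerus_id
dm_extended = dm.merge(drc, on="match_key", how="left")
print(dm_extended)
```

Because the merge is a left join on the DM side, no DM row is dropped or altered; the only change is the extra clerus_id column.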


In [78]:
# import the required libraries
import pandas as pd

In [79]:
# pandas display settings: show all columns, rows and full cell content
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', None)

In [80]:
# Set variables for the project (i.e. the input location of the files to be processed and the output location)
folderlink = '..//data//'
input_folder = 'input//'
output_folder = 'output//'


In [81]:
# Set links to the files
d1_csv = folderlink+input_folder+"01_clerus_bio_curated.csv" 
d11_csv = folderlink+input_folder+"12_clerus_role_curated.csv"
d22_csv = folderlink+output_folder+"12_roles_dm.csv"

In [82]:
# Import the csvs as dataframes
d1 = pd.read_csv(d1_csv, sep=';', encoding='utf-8')

d11_col_data_type = {'clerus_id': pd.Int64Dtype(), 'role_start_year': pd.Int64Dtype(), 'role_end_year': pd.Int64Dtype()}
d11 = pd.read_csv(d11_csv, sep=';', dtype=d11_col_data_type, encoding='utf-8')

d22_data_types = {'jaar intrede': pd.Int64Dtype(), 'jaar vertrek': pd.Int64Dtype(), 'ind_id': pd.Int64Dtype(),'dag intrede': pd.Int64Dtype(), 'dag vertrek': pd.Int64Dtype() }
d22 = pd.read_csv(d22_csv, dtype=d22_data_types)
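
The nullable `Int64` dtype matters for the join keys built later in this notebook, because the years are converted to strings there. The following sketch, with an invented two-row CSV, illustrates the difference: with the default dtype a column with a missing year becomes float, so 1650 would be rendered as "1650.0".

```python
import io
import pandas as pd

# Hypothetical CSV content with one missing start year
csv_text = "pid;jaar intrede\n1;1650\n2;\n"

# Default parsing: the missing value forces the column to float64
default = pd.read_csv(io.StringIO(csv_text), sep=";")

# Nullable integer parsing: years stay integers, the gap becomes <NA>
nullable = pd.read_csv(
    io.StringIO(csv_text), sep=";", dtype={"jaar intrede": pd.Int64Dtype()}
)

print(default["jaar intrede"].dtype)
print(nullable["jaar intrede"].astype(str).tolist())
```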


In [83]:
# from the 01_clerus_bio table we only require the name of the individual to be linked with DM 
d1_name = d1[['clerus_id', 'first_name', 'infix', 'surname']]
d1_d11_merge = pd.merge(d1_name, d11, on='clerus_id', how='inner')

In [84]:
# filter out all role_type that are not minister (predikant)
d1_d11_merge['role_type'] = d1_d11_merge['role_type'].fillna('')
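# .copy() makes d1_d11 an independent DataFrame, so the columns added below do not trigger a SettingWithCopyWarning
d1_d11 = d1_d11_merge[d1_d11_merge['role_type'].str.contains('pred', case=False)].copy()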

In [85]:
# It appears that in some cases the start of a role in one of the two datasets is one year later. Therefore, a link is also established with "role_start_year+1" and later on with "d22jaar intrede+1"
d1_d11['role_start_year+1'] = d1_d11['role_start_year']+1



In [86]:
# The first strategy to link DM with DRC is to link them based on the names. In DM the name is stored in one field (i.e. predikant). Since that field uses multiple methods to distinguish the first name, infix and surname (e.g. . and ; etc.) we decided to map DRC to DM and remove all spaces and separating characters to create a field that can be linked.

d1_d11['surname'] = d1_d11['surname'].fillna('')
d1_d11['first_name'] = d1_d11['first_name'].fillna('')
d1_d11['infix'] = d1_d11['infix'].fillna('')
d1_d11['join_name'] = d1_d11['surname']+d1_d11['first_name']+d1_d11['infix']



In [87]:
# Here we strip spaces and separating characters from the names of the ministers in DM
# regex=False ensures that "." is treated as a literal full stop and not as a regex wildcard
d22['d22join_name'] = d22['predikant'].str.replace(";", "", regex=False).str.replace(" ", "", regex=False).str.replace(".", "", regex=False)
# We select the first word of the place name
d22['d22gemeente'] = d22['gemeente'].str.split().str[0]
# Store the start year as a string for use in the join key
d22['d22jaar intrede'] = d22['jaar intrede'].astype(str)
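
The name cleaning can also be written as one small helper, sketched here in plain Python. The function name `normalise_name` and the sample name are hypothetical; the set of separators follows the DM cell above.

```python
def normalise_name(raw: str) -> str:
    """Remove semicolons, full stops and spaces so that names can be compared."""
    for separator in (";", ".", " "):
        raw = raw.replace(separator, "")
    return raw


print(normalise_name("Jansen; Pieter. van"))  # prints: JansenPietervan
```

Applying one shared helper to both datasets would keep the two cleaning steps in sync. As written in this notebook, only the DM side strips full stops.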

In [88]:
# Here we strip spaces and separating characters from the joined names of the ministers in DRC
d1_d11['d11join_name'] = d1_d11['join_name'].str.replace(";", "", regex=False).str.replace(" ", "", regex=False)

# We select the first word from the place name, since sometimes alternative names are included between brackets
d1_d11['d11_role_place']= d1_d11['role_place'].str.split().str[0]



In [89]:
# We create a joining field, which is a combination between the name, placename and start year
d1_d11['d11j_place_year']=d1_d11['d11join_name'].astype(str)+"_"+d1_d11['d11_role_place'].astype(str)+"_"+d1_d11['role_start_year'].astype(str)
d1_d11['d11j_place_year+1']=d1_d11['d11join_name'].astype(str)+"_"+d1_d11['d11_role_place'].astype(str)+"_"+d1_d11['role_start_year+1'].astype(str)



In [90]:
d22['d22jaar intrede+1'] = d22['jaar intrede']+1

In [91]:
# For dm a join field is created in a similar way after which the two are joined
d22['d22j_place_year']=d22['d22join_name'].astype(str)+"_"+d22['d22gemeente'].astype(str)+"_"+d22['d22jaar intrede'].astype(str)
d22['d22j_place_year+1']=d22['d22join_name'].astype(str)+"_"+d22['d22gemeente'].astype(str)+"_"+d22['d22jaar intrede+1'].astype(str)
d11_d22 = pd.merge(d1_d11, d22, left_on='d11j_place_year', right_on='d22j_place_year', how='inner')

In [92]:
# To account for a start year that is one year later in either dataset, two additional joins are made
d11p1_d22 = pd.merge(d1_d11, d22, left_on='d11j_place_year+1', right_on='d22j_place_year', how='inner')
d11_d22p1 = pd.merge(d1_d11, d22, left_on='d11j_place_year', right_on='d22j_place_year+1', how='inner')

In [93]:
#these additional links are appended to d11_d22
d11_d22 = pd.concat([d11_d22, d11p1_d22, d11_d22p1], ignore_index=True)
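
The one-year tolerance can also be expressed differently. The sketch below is an alternative, not the approach used in this notebook: it merges on name and place only and then keeps the rows whose start years differ by at most one. All rows and the column names `name` and `place` are invented for illustration.

```python
import pandas as pd

# Toy DRC roles (hypothetical values)
drc = pd.DataFrame({
    "clerus_id": [20041, 20042, 20043],
    "name": ["JansenPieter", "DeVriesJan", "BakkerKlaas"],
    "place": ["Delft", "Gouda", "Edam"],
    "role_start_year": [1650, 1701, 1720],
})

# Toy DM roles (hypothetical values)
dm = pd.DataFrame({
    "pid": [11, 12, 13],
    "name": ["JansenPieter", "DeVriesJan", "BakkerKlaas"],
    "place": ["Delft", "Gouda", "Edam"],
    "jaar intrede": [1650, 1702, 1725],
})

# Candidate pairs share name and place; the year is checked afterwards
candidates = drc.merge(dm, on=["name", "place"], how="inner")
year_gap = (candidates["role_start_year"] - candidates["jaar intrede"]).abs()

# Keep pairs whose start years are equal or one year apart
matches = candidates[year_gap <= 1]
print(matches[["clerus_id", "pid"]])
```

In this toy data the third pair is dropped because the start years are five years apart. The trade-off is that the intermediate merge can become large when many people share a name and place.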

In [94]:
# Count how many minister roles each clerus_id has in DRC
id_counts_d11 = d1_d11['clerus_id'].value_counts()

In [95]:
# Since an individual from DRC could have more minister roles in DM, we are going to check for that. In d1_d11 the number of times an individual was minister is stored per row.
d1_d11['d11_count'] = d1_d11['clerus_id'].map(id_counts_d11)



In [96]:
# Next, a string is created that combines the clerus id with this count. For DM we will do the same later on
d1_d11['d11_count_unique'] = d1_d11['clerus_id'].astype(str)+"__"+d1_d11['d11_count'].astype(str)



In [97]:
# The number of times the clerus_id is present in the joined tables 
id_counts_d22_d11 = d11_d22['clerus_id'].value_counts()
d11_d22['d11_d22_count'] = d11_d22['clerus_id'].map(id_counts_d22_d11)

In [98]:
# We combine the clerus id with the number of times a role is present per individual
d11_d22['d11_d22_count_unique']= d11_d22['clerus_id'].astype(str)+"__"+d11_d22['d11_d22_count'].astype(str)

In [99]:
# Now we join the two on the combined id and count string. A match means that the individual has the same number of minister roles in DRC as matched roles in DM. These individuals thus completely overlap.
countd11_countd11_d22 = pd.merge(d1_d11, d11_d22, left_on='d11_count_unique', right_on='d11_d22_count_unique', how='inner')
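
The same frequency check can be sketched without the combined string, by comparing the two counts per clerus_id directly. This is an illustration with invented ids, not a replacement for the cells in this notebook.

```python
import pandas as pd

# Hypothetical clerus_ids: one entry per minister role in DRC
drc_roles = pd.Series([1, 1, 2, 3, 3, 3], name="clerus_id")
# Hypothetical clerus_ids: one entry per row in the joined DRC and DM table
matched_roles = pd.Series([1, 1, 2, 2, 3], name="clerus_id")

drc_counts = drc_roles.value_counts()
matched_counts = matched_roles.value_counts()

# Align the matched counts on the DRC ids; ids without any match get 0
aligned = matched_counts.reindex(drc_counts.index, fill_value=0)

# Ids whose number of roles is identical on both sides
same_frequency_ids = sorted(drc_counts[drc_counts == aligned].index.tolist())
print(same_frequency_ids)
```

Only id 1 has the same count on both sides here (two roles and two matches); ids 2 and 3 would go to manual curation.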

In [100]:
# Get rid of all the unnecessary fields
output_table = countd11_countd11_d22[['clerus_id_x','d11_count_unique','d11_d22_count_unique']].copy()

In [101]:
# remove duplicates from the table (since every match produces an additional row)
output_without_duplicates = output_table.drop_duplicates()

In [102]:
# Join the tables to get the clerus_ids for the rows in d22 (DM) that correspond to the information in d1_d11 (DRC)
d22_in_d11 = pd.merge(output_without_duplicates, d11_d22, left_on='clerus_id_x', right_on='clerus_id', how='inner')

In [103]:
# Create a DataFrame with the clerus ids that fully match in DM and DRC (and thus do not need to be checked)
d22_in_d11_pid_clerus_id = d22_in_d11[['clerus_id','pid']] 

In [104]:
# export to csv in order to integrate in the database
d22_in_d11_pid_clerus_id.to_csv(folderlink+output_folder+'dm_in_drc_same_frequency.csv', sep='$', encoding='utf-8', index=False)

In [105]:
# Create a table with all rows that match with DRC
d11_d22_clerus_pid_all = d11_d22[['clerus_id','pid']] 

In [106]:
# Left join the table with all rows that match with DRC to the table with the fully matching records, so this information can be used as input for the data curation.
d22_not_d11 = pd.merge(d11_d22_clerus_pid_all, d22_in_d11_pid_clerus_id, left_on='pid', right_on='pid', how='left')


In [107]:
d22_not_d11.head()

   clerus_id_x    pid  clerus_id_y
0        20041  11440          NaN
1        20041  31861          NaN
2        20042  14485      20042.0
3        20043  16388          NaN
4        20043  36153          NaN


In [108]:
# Create a dataset with the clerus ids (thus being in DRC) and pids for which the number of times a role occurs differs between DM and DRC.
d22_not_same_frequency = d22_not_d11[d22_not_d11['clerus_id_y'].isna()]
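
Selecting the rows that have no counterpart in another table is an anti-join. As a hedged alternative to filtering on a missing `clerus_id_y`, pandas can flag the origin of each row with the `indicator` parameter of `merge`. The variable names below are hypothetical and the rows are a small toy sample.

```python
import pandas as pd

# All DM rows that matched a DRC role (toy sample)
all_matches = pd.DataFrame({
    "clerus_id": [20041, 20041, 20042],
    "pid": [11440, 31861, 14485],
})
# Rows of individuals whose role frequency is identical in both datasets
full_matches = pd.DataFrame({"clerus_id": [20042], "pid": [14485]})

# indicator=True adds a '_merge' column with 'left_only', 'right_only' or 'both'
flagged = all_matches.merge(
    full_matches[["pid"]], on="pid", how="left", indicator=True
)

# Keep the rows that only occur in all_matches: these need manual curation
needs_review = flagged.loc[flagged["_merge"] == "left_only", ["clerus_id", "pid"]]
print(needs_review)
```

Merging on `pid` alone avoids the `_x` and `_y` suffixes, so no column has to be renamed afterwards.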

In [109]:
d22_not_same_frequency_drop = d22_not_same_frequency[['clerus_id_x','pid']] 
d22_not_same_frequency_drop = d22_not_same_frequency_drop.rename(columns={"clerus_id_x": "clerus_id"})

In [110]:
# Export the file in order for it to be imported in the database
d22_not_same_frequency_drop.to_csv(folderlink+output_folder+'dm_in_drc_not_same_frequency.csv', sep='$', encoding='utf-8', index=False)

In [111]:
d22_all_d11_match = pd.merge(d22, d11_d22_clerus_pid_all, left_on='pid', right_on='pid', how='left')



In [112]:
d22_all_d11_match['clerus_id'] = d22_all_d11_match['clerus_id'].astype('Int64')

In [113]:
d22_columns_to_drop = ['d22join_name', 'd22gemeente', 'd22jaar intrede', 'd22jaar intrede+1', 'd22j_place_year', 'd22j_place_year+1']

# Drop the helper columns that were only needed for the joins
d22_all_d11_match = d22_all_d11_match.drop(columns=d22_columns_to_drop)

In [114]:
# Export the file in order for it to be imported in the database
d22_all_d11_match.to_csv(folderlink+output_folder+'dm_all_drc_match.csv', sep='$', encoding='utf-8', index=False)