# Boekzaallijst

The **Boekzaallijst** was compiled from multiple archival sources that registered the individuals who obtained "proponent" status between 1736 and 1816. Being a "proponent" meant that someone had passed the proponent exam, which was a prerequisite for becoming a minister. It does not mean that the person actually became a minister, but without "proponent" status one was not allowed to act as minister.

The Boekzaallijst is a structured dataset with one row per individual. It contains the first name, the surname, the place (classis region) where the exam was registered and the place of the first parish where the person acted as minister. In some cases it also contains additional information about the career.

The Boekzaallijst contains individuals who are also present in DRC. However, it also contains individuals who never became a minister in the Netherlands and who are therefore not part of DRC, since DRC exclusively lists individuals who became a minister.

Linking the **Boekzaallijst** with [DRC](1_1_DRC_1555-1816.ipynb) shows what share of individuals became a minister in the Netherlands and also which individuals acted as minister in, for example, the Dutch colonies. Because name spellings vary and the two datasets have different fields, this is not as straightforward as it seems. The **Boekzaallijst** is less rich than DRC, which also contains, for instance, the date and place of birth and death, making it easier to distinguish individuals.

This notebook was created to aid the process of linking DRC with the **Boekzaallijst**, based on a series of rules and on string similarity computed with [Levenshtein](https://pypi.org/project/python-Levenshtein/). The input DRC dataset is a fully curated DRC database stored as an MS Access .accdb file.
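To make the idea of string similarity concrete, the sketch below computes a plain edit distance in pure Python. It is only an illustrative stand-in: the notebook itself uses the python-Levenshtein package, whose `ratio` function may normalise differently. The function names `edit_distance` and `similarity` and the example link strings are assumptions made for this illustration.

```python
# Illustrative stand-in: a plain dynamic-programming edit distance.
# The notebook itself relies on the python-Levenshtein package instead.

def edit_distance(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Scale the distance to the range 0..1, where 1.0 means identical (hypothetical helper)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - edit_distance(a, b) / longest


# Hypothetical link strings in the firstletter_surname_year format
print(edit_distance("J_Jansen_1750", "J_Janssen_1750"))
print(similarity("J_Jansen_1750", "J_Janssen_1750"))
```

A single spelling variant such as "Jansen" versus "Janssen" costs one edit, so the two strings stay highly similar even though an exact join on them would fail.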

In [5]:
# import the required libraries
import os
import re
import csv
import pandas as pd
import numpy as np
import pyodbc
import Levenshtein as lev


In [6]:
# To link the DRC with the Boekzaallijst (BZ) we follow 5 strategies.

# Strategies to link DRC and BZ are:
# 1. The first letter of the first name, the full surname and the year in which someone first acted as minister.
# 2. As strategy 1, but with the Boekzaallijst year lowered by 1, since there can be a delay in the Boekzaallijst registration.
# 3. The first 3 letters of the surname and the year in which someone first acted as minister.
# 4. As strategy 3, but with the Boekzaallijst year lowered by 1 (see 2).
# 5. A top-3 match based on Levenshtein similarity, applied to the strings created in strategy 1.

# Before we start we load the "boekzaallijst" data from a csv file.

In [7]:
# Set variables for the project (i.e. the input location of the file to be processed and the output location)

folderlink = '..//data//'
input_folder = 'input//'
input_file = os.path.join(folderlink+input_folder, 'boekzaallijst_27072023.csv')
folder_output = 'output//'
output_csv = folderlink+folder_output+'clerus_boekzaal.csv'
drc_database = 'DRC_05102023_merged.accdb'


# Panda settings for showing data
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', None)

### STEP 1 Add join fields in Boekzaallijst 

First we need to add fields in the Boekzaallijst sheet with which the DRC can be joined based on the strategies formulated above. 
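As a minimal sketch of what these join fields look like for a single record, the helper below builds the four exact-match keys used by strategies 1 to 4. The function name `build_link_keys` and the example person are hypothetical; the notebook builds the same kind of strings column-wise with pandas.

```python
def build_link_keys(first_letter: str, surname: str, year: int) -> dict:
    """Return the four join keys for one Boekzaallijst record (hypothetical helper)."""
    return {
        # Strategy 1: first letter, full surname, year of appointment
        "strat1": f"{first_letter}_{surname}_{year}",
        # Strategy 2: same, but with the year lowered by one
        "strat2": f"{first_letter}_{surname}_{year - 1}",
        # Strategy 3: first three letters of the surname and the year
        "strat3": f"{surname[:3]}_{year}",
        # Strategy 4: same as strategy 3, with the year lowered by one
        "strat4": f"{surname[:3]}_{year - 1}",
    }


keys = build_link_keys("J", "Jansen", 1751)
for name, value in keys.items():
    print(name, value)
```

Strategies 3 and 4 drop the first letter and most of the surname, so they tolerate more spelling variation at the cost of more false candidates.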

In [8]:
# Load the boekzaallijst dataset from a csv
years_to_integers = {'Jaar (Begin) Rol': pd.Int64Dtype(),'Jaar Beroepen': pd.Int64Dtype(), }
boekzaallijst = pd.read_csv(input_file, sep=';', dtype=years_to_integers, encoding='utf-8')

In [9]:
# To create new fields containing the 'First_Letter' the following function is used
def get_first_letter(row, name_column, initial_column):
    name_letter = row[name_column][0] if pd.notnull(row[name_column]) else None
    initial_letter = row[initial_column][0] if pd.notnull(row[initial_column]) else None
    return name_letter or initial_letter

In [10]:
# Create the new field containing the 'First_Letter'
boekzaallijst['first_letter'] = boekzaallijst.apply(lambda row: get_first_letter(row, 'Voornaam_BZ', 'Voorletter_BZ'), axis=1)

In [11]:
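# Keep only records with a known 'Jaar Beroepen'; .copy() avoids pandas' SettingWithCopyWarning when columns are added below
fil_boekzaallijst = boekzaallijst.dropna(subset=['Jaar Beroepen']).copy()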

In [12]:
# Create the link to formulate the connection using strategy 1
fil_boekzaallijst['strat1_boekzaallink'] = fil_boekzaallijst['first_letter'].astype(str) + '_' + fil_boekzaallijst['Achternaam_BZ'].astype(str) + '_' + fil_boekzaallijst['Jaar Beroepen'].astype(str).str.replace(' ', '')


In [13]:
# Create the link to formulate the connection using strategy 2
def lower_one_to_integer(num):
    return num - 1

In [14]:
fil_boekzaallijst['year_min1'] = fil_boekzaallijst['Jaar Beroepen'].apply(lower_one_to_integer)


In [15]:
fil_boekzaallijst['strat2_boekzaallink'] = fil_boekzaallijst['first_letter'].astype(str) + '_' + fil_boekzaallijst['Achternaam_BZ'].astype(str) + '_' + fil_boekzaallijst['year_min1'].astype(str).str.replace(' ', '')


In [16]:
# Strategy 3 and 4 
fil_boekzaallijst['strat3_boekzaallink'] =  fil_boekzaallijst['Achternaam_BZ'].str[:3]+ '_' + fil_boekzaallijst['Jaar Beroepen'].astype(str).str.replace(' ', '')
fil_boekzaallijst['strat4_boekzaallink'] =  fil_boekzaallijst['Achternaam_BZ'].str[:3]+ '_' + fil_boekzaallijst['year_min1'].astype(str).str.replace(' ', '')



### STEP 2 Get DRC data     

Next we need to get the data from the DRC database to link it with the **boekzaallijst**.

In [17]:
conn_str = (
    r'DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};'
    r'DBQ='+folderlink+input_folder+drc_database+';'
)

In [18]:
# Establish the connection
conn = pyodbc.connect(conn_str)

# Read the biography and role tables into pandas DataFrames
drc_bio = pd.read_sql('SELECT * FROM 01_clerus_bio', conn)
drc_role = pd.read_sql('SELECT * FROM 12_clerus_role', conn)

# Close the connection
conn.close()


In [19]:
def double_to_integer(dataframe, field):
    dataframe[field] = dataframe[field].astype('Int64')  
 

In [20]:
double_to_integer(drc_role, 'role_start_year')
double_to_integer(drc_role, 'role_end_year')
double_to_integer(drc_bio, 'birth_year')
double_to_integer(drc_bio, 'death_year')
double_to_integer(drc_bio, 'baptism_year')
double_to_integer(drc_bio, 'burried_year')

In [21]:
drc_joined = pd.merge(drc_bio, drc_role, left_on='clerus_id', right_on='clerus_id', how = 'right')


In [22]:
drc_subset = drc_joined[drc_joined['role_type'] == 'predikant']

In [23]:
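# Drop roles without a start year; reassigning instead of inplace=True avoids modifying a slice of drc_joined
drc_subset = drc_subset.dropna(subset=['role_start_year'])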


In [24]:
first_minister_subset = drc_subset.loc[drc_subset.groupby('clerus_id')['role_start_year'].idxmin()]
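The selection of each person's earliest minister role can be illustrated with a small toy table. The rows below are invented for illustration; only the column names follow the notebook.

```python
import pandas as pd

# Invented example rows: two roles for persons 1 and 2, one role for person 3
roles = pd.DataFrame({
    "clerus_id": [1, 1, 2, 2, 3],
    "role_place": ["Delft", "Leiden", "Gouda", "Utrecht", "Hoorn"],
    "role_start_year": [1752, 1748, 1760, 1765, 1781],
})

# idxmin returns, per clerus_id, the row label of the smallest start year;
# .loc then selects exactly those rows
first_roles = roles.loc[roles.groupby("clerus_id")["role_start_year"].idxmin()]
print(first_roles)
```

Each `clerus_id` ends up with exactly one row, which is what makes the later join keys unique per person on the DRC side.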

In [25]:
# Creating the linking field for strategy 1 and strategy 2

first_minister_subset['strat12_drc_link'] = first_minister_subset['first_letter'].astype(str) + '_' +first_minister_subset['surname'].astype(str) + '_' + first_minister_subset['role_start_year'].astype(str).str.replace(' ', '')

In [26]:
# Creating the linking field for strategy 3 and strategy 4
first_minister_subset['strat34_drc_link'] = first_minister_subset['surname'].str[:3] + '_' + first_minister_subset['role_start_year'].astype(str).str.replace(' ', '')


In [27]:
light_drc = first_minister_subset[['clerus_id','original_input','role_place','strat34_drc_link','strat12_drc_link']]


In [28]:
light_bz = fil_boekzaallijst[['Nr_BZ','strat1_boekzaallink','strat2_boekzaallink','strat3_boekzaallink','strat4_boekzaallink']]

In [29]:
# In order to distinguish the possible links, all the results of the strategies are accompanied with an id of the strategy applied. 

strategy1 = pd.merge(light_bz, light_drc, left_on='strat1_boekzaallink', right_on='strat12_drc_link', how='inner')
strategy1['strategy'] = 1
strategy2 = pd.merge(light_bz, light_drc, left_on='strat2_boekzaallink', right_on='strat12_drc_link', how='inner')
strategy2['strategy'] = 2
strategy3 = pd.merge(light_bz, light_drc, left_on='strat3_boekzaallink', right_on='strat34_drc_link', how='inner')
strategy3['strategy'] = 3
strategy4 = pd.merge(light_bz, light_drc, left_on='strat4_boekzaallink', right_on='strat34_drc_link', how='inner')
strategy4['strategy'] = 4

In [30]:
appended_strategies = pd.concat([strategy1, strategy2, strategy3, strategy4], ignore_index=True)
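A toy version of this step, with invented records and shortened column names, shows how the strategy label makes it possible to see afterwards which rule produced a candidate link. Only two strategies are shown; the column names `strat1_link`, `strat2_link` and `drc_link` are assumptions for this sketch.

```python
import pandas as pd

# Invented Boekzaallijst records with keys for strategy 1 (year) and 2 (year - 1)
bz = pd.DataFrame({
    "Nr_BZ": [10, 11],
    "strat1_link": ["J_Jansen_1751", "P_Bakker_1760"],
    "strat2_link": ["J_Jansen_1750", "P_Bakker_1759"],
})
# Invented DRC records with a single key based on the first start year
drc = pd.DataFrame({
    "clerus_id": [501, 502],
    "drc_link": ["J_Jansen_1750", "P_Bakker_1760"],
})

matches = []
for strategy, column in [(1, "strat1_link"), (2, "strat2_link")]:
    # An inner join keeps only the records whose keys are identical
    hit = bz.merge(drc, left_on=column, right_on="drc_link", how="inner")
    hit["strategy"] = strategy
    matches.append(hit)

combined = pd.concat(matches, ignore_index=True)
print(combined[["Nr_BZ", "clerus_id", "strategy"]])
```

Here record 11 is found by strategy 1, while record 10 is only found once the year is lowered by one, so it is labelled with strategy 2.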


In [31]:
# Apply Levenshtein similarity
# Compute a cross join of all Boekzaallijst and DRC records (how='cross' requires pandas >= 1.2)
cross_df = light_bz.merge(light_drc, how='cross')
string1 = 'strat1_boekzaallink'
string2 = 'strat12_drc_link'
# lev.ratio returns a similarity score between 0 and 1; higher means more alike
cross_df['lev_ratio'] = cross_df.apply(lambda row: lev.ratio(row[string1], row[string2]), axis=1)


In [32]:
cross_df = cross_df.sort_values(by='lev_ratio', ascending=False)

In [33]:
# Keep the top 3 matches per Boekzaallijst record. Changing n keeps more or fewer of the candidates with the highest Levenshtein ratio.
def top_n_matches(group, n=3):
    return group.nlargest(n, 'lev_ratio')

In [34]:
strategy5 = cross_df.groupby('Nr_BZ').apply(top_n_matches).reset_index(drop=True)
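The same "best few candidates per record" idea can be tried out without the database, using only the standard library. The sketch below swaps in `difflib.get_close_matches` as a stand-in for the Levenshtein ratio, so its scores are not identical to those of the notebook; the helper name `top_matches`, the cutoff value and the candidate strings are assumptions for this illustration.

```python
import difflib


def top_matches(query, candidates, n=3, cutoff=0.8):
    """Return up to n candidates most similar to query, best first (hypothetical helper)."""
    # get_close_matches scores with difflib.SequenceMatcher, not Levenshtein
    return difflib.get_close_matches(query, candidates, n=n, cutoff=cutoff)


# Invented DRC link strings
drc_links = ["J_Janssen_1750", "J_Jansen_1751", "P_Bakker_1750", "H_Visser_1790"]

result = top_matches("J_Jansen_1750", drc_links)
print(result)
```

Unlike the notebook's top 3, which always returns three candidates per record, a cutoff discards weak candidates entirely; which behaviour is preferable depends on how much manual checking of the candidate links is planned.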

In [35]:
# Here, the strategy 5 id is added as well
strategy5['strategy'] = 5

In [36]:
appended_strategies_lev = pd.concat([appended_strategies, strategy5], ignore_index=True)


In [37]:
# The output of the linkage is stored in a csv, which can be joined to the Boekzaallijst on Nr_BZ.
appended_strategies_lev.to_csv(folderlink+folder_output+'possible_links_drc_bz_strat1-5.csv', sep=';', encoding='utf-8', index=False)