# Connecting the dataset from the Biography Portal of the Netherlands with CLERUS

## Description of Dutch Biography Portal data

To connect individuals from the [Dutch Biography Portal](http://www.biografischportaal.nl/en/) with CLERUS, the data curator of the portal provided us with an Excel sheet. The dataset contains the following fields.

|Fieldname | Description|
|----|---|
|Badge| internal badge id|
|Bioport_id| unique id in bioportal to link it to other datasources |
|Person_id|unique id of individual|
|prepositie| preposition, an official title like duke or dr.|
|voornaam| first name|
|intrapositie| infix |
|pnv_infixTitle| infix title |
|geslachtsnaam| surname |
|postpositie| postposition |
|person_sex| gender |
|VIAF_id_1| Virtual International Authority File Id to link with other datasources |
|VIAF_id_2| Virtual International Authority File Id to link with other datasources when a second is known |
|Wikidata_id| Id to wikidata |
|event_birth_when| Birth date |
|event_birth_text| additional information about the birth, mostly date of baptism | 
|event_birth_place| place of birth or baptism |
|event_death_when| date of death |
|event_death_text| additional information about the death, e.g. date of funeral |
|event_death_place| place of death or burial | 
|category-1| category for which the individual is known |
|category-2| second category for which the individual is known |
|category-3| additional category for which the individual is known |
|category-4| additional category for which the individual is known |
|religion| information about the religion of an individual |

*Table 1 - fields Dutch Biography portal*

To link individuals from the CLERUS dataset with data from the Biography Portal (BP), a match string is created in both datasets by combining the first letter of the first name, the infix, the surname and the year of birth.

From CLERUS, this match string is built from the table **01_clerus_bio** using the first letter of **first_name** and the fields **infix**, **surname** and **birth_year**. From BP, it is built from the first letter of **voornaam** and the fields **intrapositie**, **geslachtsnaam** and **event_birth_when**. For **event_birth_when**, the first four-digit number is extracted, on the assumption that it is the year of birth.
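The key construction described above can be sketched on a toy table. This is a minimal illustration, not the notebook's own code: the helper name `build_match_key` and all sample values are invented, and the column names simply mirror the BP fields from Table 1.

```python
import pandas as pd

def build_match_key(first_name, infix, surname, birth_text):
    """Hypothetical helper: first letter + infix + surname + birth year, joined by '_'."""
    first_letter = first_name.astype(str).str[0]                  # first letter of the first name
    infix_clean = infix.fillna('')                                # a missing infix becomes an empty string
    year = birth_text.astype(str).str.extract(r'(\d{4})')[0]      # first four-digit number in the text
    return first_letter + '_' + infix_clean + '_' + surname.astype(str) + '_' + year

# Invented sample rows, for illustration only
toy = pd.DataFrame({
    'voornaam': ['Jan', 'Piet'],
    'intrapositie': ['de', None],
    'geslachtsnaam': ['Vries', 'Jansen'],
    'event_birth_when': ['1850-03-02', 'ca. 1848'],
})
toy['key'] = build_match_key(
    toy['voornaam'], toy['intrapositie'], toy['geslachtsnaam'], toy['event_birth_when']
)
print(toy['key'].tolist())  # ['J_de_Vries_1850', 'P__Jansen_1848']
```

A record without an infix yields two consecutive underscores. That is harmless as long as both datasets build their keys in the same way.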

For records where these strings correspond, we created a list containing the Bioport_id and clerus_id. The outcome of this script is meant to enrich both BP and CLERUS. In total this resulted in 1199 matches between the Biography Portal and CLERUS.


In [1]:
# import the required libraries
import os
import re
import csv
import pandas as pd
import numpy as np
import pyodbc
import shutil


In [2]:
# Set variables for the datasets (i.e. the input location of the file to be processed and the output location)

folderlink = '..//data//'
folder_input = 'input//bioportal//'
input_file_bio = os.path.join(folderlink+folder_input, '2019_12_10_BioPort_BPR_MASTERFILE.csv')
folder_output = 'output//'

In [3]:
def export_access_to_dataframes(database_path):
    # Connection string for Access database
    conn_str = (
        r'DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};'
        r'DBQ=' + database_path + ';'
    )

    # Establish a connection to the Access database
    conn = pyodbc.connect(conn_str)
    cursor = conn.cursor()

    # Get a list of all tables in the database
    tables = [row.table_name for row in cursor.tables(tableType='TABLE')]

    # Loop through the tables and load each into a DataFrame
    for table in tables:
        query = f'SELECT * FROM [{table}]'
        df = pd.read_sql(query, conn)
        globals()[f'tbl_{table}'] = df  # Create a global variable with the table name

    # Close the connection
    conn.close()

In [4]:
# Link with CLERUS database (which is the result of the data Harmonization steps)
clerus_database_path = 'E:\\digidure\\CLERUS_v3_06082024.accdb'
export_access_to_dataframes(clerus_database_path)


In [5]:
# pandas display settings for showing data
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', None)

In [7]:
# Load the Biography Portal data dump, parse the relevant fields and create a string to match CLERUS with
bio = pd.read_csv(input_file_bio, sep=';', encoding='utf-8')
bio['birth_year'] = bio['event_birth_when'].str.extract(r'(\d{4})')
bio['first_letter'] = bio['voornaam'].astype(str).apply(lambda x: x[0])
bio['bio_name_surname_year'] = (bio['first_letter'].astype(str) + '_' + np.where(bio['intrapositie'].isna(), '', bio['intrapositie'].astype(str)) + '_' + bio['geslachtsnaam'].astype(str) + '_' + np.where(bio['birth_year'].isna(), '', bio['birth_year'].astype(str)))
bio = bio[~bio['birth_year'].isna()]
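One point that may be worth checking in the cell above is how missing first names are handled. In plain Python, `str(float('nan'))` is `'nan'`, so converting a column to strings before taking the first character can, depending on the pandas version, turn a missing name into the letter `n`. The sketch below uses invented values and shows the `.str` accessor as one possible alternative that keeps missing values missing. It is an illustration, not a change to the notebook's method.

```python
import pandas as pd

# Invented sample: one regular name, one missing value, one empty string
names = pd.Series(['Willem', None, ''], dtype=object)

# Plain Python string conversion of a missing float value
nan_as_text = str(float('nan'))   # 'nan', whose first character is 'n'

# The .str accessor returns a missing value for missing or empty input
first_letter = names.str[0]

print(nan_as_text[0])
print(first_letter.isna().tolist())
```

Rows whose first letter is missing could then be excluded from matching, or matched on surname and birth year only. Which of these is appropriate depends on the data.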


In [21]:
# Access the CLERUS bio dataframe and create a string to match the Biography Portal data with
tbl_01_clerus_bio['first_letter'] = tbl_01_clerus_bio['first_name'].astype(str).apply(lambda x: x[0] if len(x) > 1 else None)
# Keep only rows with a birth year; .copy() avoids a SettingWithCopyWarning on the assignments below
tbl_01_clerus_bio = tbl_01_clerus_bio[~tbl_01_clerus_bio['birth_year'].isna()].copy()
tbl_01_clerus_bio['birth_year'] = tbl_01_clerus_bio['birth_year'].astype(str)
tbl_01_clerus_bio['birth_year'] = tbl_01_clerus_bio['birth_year'].str.extract(r'(\d{4})')
tbl_01_clerus_bio['clerus_name_surname_year'] = (tbl_01_clerus_bio['first_letter'].astype(str) + '_' + np.where(tbl_01_clerus_bio['infix'].isna(), '', tbl_01_clerus_bio['infix'].astype(str)) + '_' + tbl_01_clerus_bio['surname'].astype(str) + '_' + tbl_01_clerus_bio['birth_year'].astype(str))


In [23]:
# Link Clerus and BP
clerus_bio = pd.merge(tbl_01_clerus_bio, bio, left_on='clerus_name_surname_year', right_on='bio_name_surname_year', how='inner')
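Because the match string is not guaranteed to be unique in either dataset, an inner merge can return more rows than there are distinct individuals. The sketch below, on invented toy data with a shared hypothetical `key` column, shows one way to compare the number of merged rows with the number of distinct ids and to list the ambiguous matches for manual review.

```python
import pandas as pd

# Invented toy data: two BP records share the same key
left = pd.DataFrame({
    'clerus_id': [1, 2, 3],
    'key': ['J_de_Vries_1850', 'P__Jansen_1848', 'K__Bakker_1790'],
})
right = pd.DataFrame({
    'Person_id': ['a', 'b', 'c'],
    'key': ['J_de_Vries_1850', 'J_de_Vries_1850', 'X__Smit_1700'],
})

merged = left.merge(right, on='key', how='inner')

n_rows = len(merged)                       # number of matched row pairs
n_clerus = merged['clerus_id'].nunique()   # number of distinct CLERUS individuals matched

# Rows where one CLERUS id is linked to more than one BP record
ambiguous = merged[merged.duplicated('clerus_id', keep=False)]

print(n_rows, n_clerus, len(ambiguous))    # 2 1 2
```

The same check can be run in the other direction with `duplicated('Person_id', keep=False)`.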

In [24]:
# export the file as csv
clerus_bio.to_csv(folderlink+folder_output+'clerus_bp.csv', sep=';', encoding='utf-8', index=False)

In [25]:
# create a subset with only the id fields (ideally these would be added to both datasets)
clerus_bio_id_links = clerus_bio[['clerus_id','Person_id']]

In [27]:
clerus_bio_id_links.describe() # 1221 links with biography portal

| |clerus_id|
|----|---|
|count|1221.0|
|mean|6353.720721|
|std|4876.574665|
|min|6.0|
|25%|2962.0|
|50%|5484.0|
|75%|8681.0|
|max|30014.0|


In [26]:
# export the file as csv
clerus_bio_id_links.to_csv(folderlink+folder_output+'clerus_bp_id_links.csv', sep=';', encoding='utf-8', index=False)