# Location matching against Wikidata

Utrecht University, Applied Data Science, Master thesis

By Sander Engelberts (1422138)

May-June 2022

Wikidata contains linked open data for many different types of entities, including places. This notebook uses it to retrieve the Wikidata URI, GeoNames URI, centroid coordinates, standard and alternative names, country name and URI, and administrative region name and URI for place names recorded in genealogical documents.

In [1]:
# Install package to easily extract information from Wikidata
# ! pip install wptools

In [2]:
# Load required packages
import os # for paths to files in operating system
import pandas as pd # for dataframes and operations on them
import wptools # for querying Wikidata
import numpy as np # for mathematical operations
from tqdm import tqdm # for displaying progress of operations
import re # for Regular expressions
tqdm.pandas() # display progress of pandas operations such as apply (use progress_apply instead)

## Define functions for querying Wikidata

In [3]:
def remove_parentheses(string):
    """
    Clean string from text between parentheses
    
    This function is a duplicate of the one in Data_exploration.ipynb
    
    Parameters
    ----------
    string : string
        Text string that potentially contains additional text within parentheses
        
    Returns
    -------
    string : string
        Cleaned text without parentheses_text information
    parentheses_text : string
        Text that was written within parentheses in the input string
    """
    # Find the parts of the string that are within parentheses (including the brackets)
    # Note that this does not work for nested parentheses, but these are not expected for 
    # this data
    parentheses_text = re.findall(r'\(.*?\)', string)

    # Remove the potential texts within parentheses from the string
    for text in parentheses_text:
        string = string.replace(text, "")
        
    # Remove the parentheses from the texts that are between these
    parentheses_text = [text.replace('(', "").replace(')', "") for text in parentheses_text]

    # If no text within parentheses is found, record a missing value for parentheses_text.
    # If exactly one is found, return that string directly; otherwise join the multiple
    # strings that were within separate parentheses
    if len(parentheses_text) == 0:
        parentheses_text = np.nan
    elif len(parentheses_text) == 1:
        parentheses_text = parentheses_text[0]
    else:
        parentheses_text = " ".join(parentheses_text)
        
    return string, parentheses_text
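
To make the behaviour of the non-greedy pattern concrete, the following minimal sketch applies the same regular expression to a few sample strings. The strings (including the entity number `Q123`) are made up for illustration and are not taken from the data.

```python
import re

# Same non-greedy pattern as in remove_parentheses: each "(...)" group is matched separately
pattern = r'\(.*?\)'

# Illustrative input: a label followed by its entity number between parentheses
sample = "Nederland (Q55)"
matches = re.findall(pattern, sample)
print(matches)  # ['(Q55)']

# Two separate groups are matched individually because ".*?" is non-greedy
# (the label and the entity number Q123 are made up)
two_groups = re.findall(pattern, "Beek (Limburg) (Q123)")
print(two_groups)  # ['(Limburg)', '(Q123)']

# Nested parentheses are not handled: the match stops at the first closing bracket
nested = re.findall(pattern, "a (b (c) d)")
print(nested)  # ['(b (c)']
```

The last case shows why the comment in the function warns about nested parentheses: the inner closing bracket ends the match early.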

In [32]:
def query_wikidata(place_name, province_name):
    '''
    Query the Wikidata API using the wptools package for a place-province name 
    combination and return extracted place information
    
    Parameters
    ----------
    place_name : str
        Place name (likely in Dutch). It does not need to be standardised beforehand
        (lowercasing, replacing special characters by their URL encodings), as that
        gets done internally
    province_name : str
        Province name (likely only Dutch ones), or a missing value if none was recorded.
        It does not need to be standardised beforehand either, as that gets done
        internally. This aids in disambiguation cases where multiple places exist
        with the same name in different provinces
        
    Returns
    -------
    results : dictionary
        Dictionary containing the query results of Wikidata place entities
    '''
    # Assemble the place query, which is either the place name alone or the place name
    # followed by the province name between parentheses, as is often done
    # (adding a country indicator does not retrieve results).
    # The province aids disambiguation when multiple places with the same name
    # exist in different provinces
    if not pd.isna(province_name):
        place_query = place_name + ' (' + province_name + ')'
    else:
        place_query = place_name
    
    # Initialize the page that gets queried via the Wikidata API with the given place name
    # Note: the names in the dataset are mostly in Dutch because the documents are Dutch,
    # so the Dutch Wikidata page is searched specifically
    # silent=True suppresses intermediate output prints/echoes of Wikidata results
    page = wptools.page(place_query, lang='nl', silent=True) 
    
    # Specify the elements I want to retrieve from Wikidata for this place:
    # country (P17): country this place is in     
    # GeoNames ID (P1566): GeoNames ID that uniquely refers to a unique entity in their database
    # official name (P1448): name of city, likely the same as label used for the Wikidata entity
    # located in the administrative territorial entity (P131): for retrieving province     
    # coordinate location (P625): (centroid) coordinates of place
    page.wanted_labels(['P17', 'P1566', 'P1448', 'P131', 'P625'])
    
    # Boolean that states if page could be queried or not
    extract_query = True
    try:
        # Query the Wikidata page to retrieve the specified information in json format
        # Note this does not work with special symbols, like 's-Gravenhage, which is likely 
        # because it requests the url encoding %27s-Gravenhage instead of with apostrophe
        page.get_wikidata() 
    except Exception:
        # If the query can't find an entity, then it raises a LookupError 
        # This can be the case due to spelling errors, the entity not being recorded yet,
        # name variants that are not recorded, etc.
        # Any other failure of the request is treated the same way.
        # Thus directly return missing values as result
        standard_name = np.nan
        alternative_names = np.nan
        wikidata_uri = np.nan
        geonames_uri = np.nan
        longitude = np.nan
        latitude = np.nan
        country = np.nan
        country_URI = np.nan
        province = np.nan
        province_URI = np.nan
        n_results = 0
        
        # Additionally, set the boolean to False so that no element extraction is attempted
        extract_query = False

    # If a Wikidata entity was found, extract its information
    # Note that it doesn't add a key-value pair for elements that were not found
    if extract_query:
        # Check if just one result was found or multiple. Multiple is indicated by 
        # a 'Wikimedia-doorverwijspagina', which refers to all possible results
        try:
            if (page.data['what'] == 'Wikimedia-doorverwijspagina') or (
                page.data['description'] == 'Wikimedia-doorverwijspagina'):
                # Retrieve more information from the disambiguation page
                page.get()

                # Get number of results that may all be the result of this place query
                n_results = page.data['disambiguation']

                # Fill in all other elements with missing values, because it is ambiguous
                # which result is the correct one
                standard_name = np.nan
                alternative_names = np.nan
                wikidata_uri = np.nan
                geonames_uri = np.nan
                longitude = np.nan
                latitude = np.nan
                country = np.nan
                country_URI = np.nan
                province = np.nan
                province_URI = np.nan
        except:
            # Set n_results to a high number to know there are multiple results.
            # It is unknown how many results exactly, because these entities don't exist in 
            # the disambiguation page 
            n_results = 9999

            # Fill in all other elements with missing values, because it is ambiguous
            # which result is the correct one
            standard_name = np.nan
            alternative_names = np.nan
            wikidata_uri = np.nan
            geonames_uri = np.nan
            longitude = np.nan
            latitude = np.nan
            country = np.nan
            country_URI = np.nan
            province = np.nan
            province_URI = np.nan
        
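        else:
            # No exception was raised above, so the page is treated as a single result.
            # Note that a try-else branch also runs when the disambiguation branch above
            # completed without error, in which case its values are overwritten here
            n_results = 1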
            
            # Retrieve standardised place name to use during entity resolution 
            try:
                # Try to get official place name as standard name
                standard_name = page.data['wikidata'].get('officiële naam (P1448)')
            except:
                standard_name = np.nan

            # Retrieve label of place used by Wikidata as standardised name if
            # no official place name was found
            if standard_name is np.nan:
                try:
                    standard_name = page.data['label']
                except:
                    standard_name = np.nan

            # Retrieve alternative names 
            try:
                alternative_names = page.data['aliases']
                # Join the alternative names from a list into a single comma separated string
                alternative_names = ", ".join(alternative_names) 
            except:
                alternative_names = np.nan

            # Retrieve WikiData base referring to place entity
            # URI then becomes: http://www.wikidata.org/entity/wikibase
            # (HTTP URL has wiki instead of entity in its link)
            try:
                wikidata_uri = 'http://www.wikidata.org/entity/' + str(page.data['wikibase'])
            except:
                wikidata_uri = np.nan

            # Retrieve GeoNames ID referring to place entity (like ErfGeo Proxy returns)
            # and make it into a respective GeoNames URI link with:
            # http://sws.geonames.org/GeoNamesID
            # (HTTP URL has no 'sws.' in its link)
            try:
                geonames_uri = 'http://sws.geonames.org/' + str(
                    page.data['wikidata'].get('GeoNames-identificatiecode (P1566)'))
            except:
                geonames_uri = np.nan

            # Retrieve coordinates of place 
            try:
                longitude = page.data['wikidata'].get('geografische locatie (P625)').get('longitude')
            except:
                longitude = np.nan

            try:
                latitude = page.data['wikidata'].get('geografische locatie (P625)').get('latitude')
            except:
                latitude = np.nan

            # Retrieve country name and its Wikidata URI
            try:
                # Get country name, which likely has its entity number between parentheses
                # behind it. Note that 'land' (country) sometimes returns multiple results
                # in a list, e.g. a reference to the Netherlands and Kingdom of the Netherlands
                country_with_URI = page.data['wikidata'].get('land (P17)')

                # Split the country name from its URI base (if none is found between
                # parentheses after the word, then URI_base=np.nan)
                # Note that if multiple results were returned in a list, then here
                # the country will get a missing value via the except clause (function
                # underneath expects a string and not a list), as it is ambiguous which 
                # is the correct one
                country, country_URI_base = remove_parentheses(string=country_with_URI)

                # If URI base was found, then create a URI link out of it
                if country_URI_base is not np.nan:
                    country_URI = 'http://www.wikidata.org/entity/' + str(country_URI_base)
                else:
                    country_URI = np.nan
                # Remove additional spaces at the beginning or end of the country name string
                country = country.strip()
            except:
                country = np.nan
                country_URI = np.nan

            # Retrieve province name and its Wikidata URI
            try:
                # Get province name, which likely has its entity number between parentheses
                # behind it. Note that 'bestuurlijke eenheid' (administrative unit) does not
                # necessarily mean province, but can also be a municipality or region, so
                # multiple results may be returned in a list
                province_with_URI = page.data['wikidata'].get(
                    'gelegen in bestuurlijke eenheid (P131)')

                # Split the province name from its URI base (if none is found between
                # parentheses after the word, then URI_base=np.nan)
                # Note that if multiple results were returned in a list, then here
                # the province will get a missing value via the except clause (function
                # underneath expects a string and not a list), as it is ambiguous which is 
                # the correct one
                province, province_URI_base = remove_parentheses(string=province_with_URI)

                # If URI base was found, then create a URI link out of it
                if province_URI_base is not np.nan:
                    province_URI = 'http://www.wikidata.org/entity/' + str(province_URI_base)
                else:
                    province_URI = np.nan
                # Remove additional spaces at the beginning or end of the province name string
                province = province.strip()
            except:
                province = np.nan
                province_URI = np.nan
        
    # Save retrieved place features within a dictionary and return this
    return {"place_name":place_name, "province_name":province_name, 
            "standard_name":standard_name, "wikidata_uri":wikidata_uri,
            "geonames_uri":geonames_uri, "longitude":longitude, "latitude":latitude, 
            "alternative_names":alternative_names, "country":country, 
            "country_wikidata_uri":country_URI, "province":province, 
            "province_wikidata_uri":province_URI, "n_results":n_results}

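The branching in `query_wikidata` relies on Python's `try` / `except` / `else` semantics, where the `else` block runs whenever the `try` block finishes without an exception, regardless of which `if` branch was taken inside it. The toy sketch below isolates that control flow. The helpers `classify` and `classify_with_flag` are hypothetical, work on plain dictionaries instead of a wptools page, and only check the `'what'` key.

```python
DISAMBIGUATION = 'Wikimedia-doorverwijspagina'

def classify(page_data):
    """Toy stand-in mirroring the try/except/else layout (hypothetical helper)."""
    try:
        if page_data['what'] == DISAMBIGUATION:
            n_results = page_data['disambiguation']
    except KeyError:
        # Missing key: treated as "multiple results, number unknown"
        n_results = 9999
    else:
        # Runs after ANY exception-free try block, also after the if-branch above
        n_results = 1
    return n_results

def classify_with_flag(page_data):
    """Possible alternative: decide inside the try block, without an else branch."""
    try:
        if page_data['what'] == DISAMBIGUATION:
            n_results = page_data['disambiguation']
        else:
            n_results = 1
    except KeyError:
        n_results = 9999
    return n_results

single = {'what': 'stad'}
unknown = {}
ambiguous = {'what': DISAMBIGUATION, 'disambiguation': 4}

print(classify(single), classify(unknown), classify(ambiguous))  # 1 9999 1
print(classify_with_flag(ambiguous))  # 4
```

With the first layout a disambiguation page that is read successfully still ends up with `n_results = 1`. This is one possible reason why only the values 0, 1 and 9999 occur in the results further below.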
In [5]:
def query_locations(places_lst, provinces_lst, csv_path, N=100):
    """
    Query Wikidata for all unique place-province name combinations 
    from specified list and save its results
    
    This function is similar to the one with the same name in 
    Location_matching_with_ErfGeo_Proxy.ipynb, but with a different query function call,
    as well as a different result dictionary
    
    Parameters
    ----------
    places_lst : list of strings
        List with place names that should be queried
    provinces_lst : list of strings
        List with recorded province names that correspond to the places in places_lst
    csv_path : str
        Path to the .csv file to write outputs to
    N : int (default 100)
        Integer stating how often intermediate results need to be written to .csv file.
        This is respective to the number of places and not to the number of queries
        required for that place to try to retrieve a corresponding entity
    """
    # Query all places from specified list and store the information in a list, 
    # and save intermediately to .csv file 
    query_results = []
    for i, (place_name, province_name) in tqdm(enumerate(zip(places_lst, provinces_lst)), 
                                             total=len(places_lst)):
        # Query place name if it is not a missing value 
        # (not the case when only unique places are queried)
        if not pd.isna(place_name):
            # Try to retrieve entity belonging to place-province name combination
            query_result = query_wikidata(place_name=place_name, province_name=province_name)
            
            # If no results were found above and a province name was added to the query,
            # try to query again with just the place name
            if not pd.isna(province_name) and query_result.get("n_results") == 0:
                query_result = query_wikidata(place_name=place_name, province_name=np.nan)
            
            query_results.append(query_result)
        else:
            # If the place name is a missing value, just record missing values as result
            query_results.append({"place_name":np.nan, "province_name":np.nan, 
                                  "standard_name":np.nan, 
                                  "wikidata_uri":np.nan, "geonames_uri":np.nan, 
                                  "longitude":np.nan, "latitude":np.nan, 
                                  "alternative_names":np.nan, "country":np.nan, 
                                  "country_wikidata_uri":np.nan, "province":np.nan, 
                                  "province_wikidata_uri":np.nan, "n_results":np.nan})

        # Check if results should be (intermediately) saved
        if i % N == 0:
            # Intermediately save queried results
            write_chunk(i=i, query_results=query_results, csv_path=csv_path)
            query_results = []
        elif i == len(places_lst) - 1:
            # Write last results
            write_chunk(i=i, query_results=query_results, csv_path=csv_path)

In [31]:
# TODO: Function copied from Location_matching_with_ErfGeo_Proxy.ipynb at 27-05-2022 16:56
def write_chunk(i, query_results, csv_path):
    '''
    Intermediately write query results to .csv file
    
    Function is a duplicate of the one in Location_matching_with_ErfGeo_Proxy.ipynb
    
    Parameters
    ----------
    i : int
        Index of the place that was queried, required to check if
        a header should still be written (only in the first iteration)
    query_results : list of dictionaries
        List with dictionaries containing the results
    csv_path : str
        Path to the .csv file to write outputs to
    '''
    # Only write header in the first iteration
    header = (i == 0)
        
    # Create Pandas Dataframe from list with dictionaries    
    chunk = pd.DataFrame(query_results)    
    
    # Save chunk to .csv file (mode 'a' is appending)
    chunk.to_csv(csv_path, header=header, mode='a', sep=',', index=False) 
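
Because `write_chunk` decides on the header from `i == 0` and always appends, calling `query_locations` again on an existing file would write a second header row into it. A minimal sketch of one alternative is shown below, using only the standard library: the header is written only when the file does not exist yet or is empty. The helper `append_rows`, the file name and the two-column layout are assumptions for illustration, not part of the notebook.

```python
import csv
import os
import tempfile

def append_rows(rows, csv_path, fieldnames):
    """Append dictionaries to a .csv file; write the header only for a new or empty file."""
    # Hypothetical variant: derive the header decision from the file itself, not from i
    write_header = not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0
    with open(csv_path, mode='a', newline='', encoding='utf-8') as f:
        # restval='' leaves a cell empty when a dictionary lacks that key
        writer = csv.DictWriter(f, fieldnames=fieldnames, restval='')
        if write_header:
            writer.writeheader()
        writer.writerows(rows)

fields = ["place_name", "n_results"]
demo_path = os.path.join(tempfile.mkdtemp(), "demo_locations.csv")

# First chunk creates the file and writes the header
append_rows([{"place_name": "Dordrecht", "n_results": 1}], demo_path, fields)
# Second call simulates a later chunk or a continued run: no second header
append_rows([{"place_name": "Lochem"}], demo_path, fields)

with open(demo_path, newline='', encoding='utf-8') as f:
    lines = f.read().splitlines()
print(lines)  # ['place_name,n_results', 'Dordrecht,1', 'Lochem,']
```

Passing the column names explicitly also keeps the column order fixed when a chunk lacks one of the keys.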

## Prepare list of unique place-province names from data

In [7]:
# Path to folder where I stored the data, adjust to own storage location
data_path = r"E:\CBG" 

# Path to cleaned person cards data 
person_card_path = os.path.join(data_path, "clean_persoonskaarten.csv")

# Path to cleaned passport requests data
passport_path = os.path.join(data_path, "clean_Indische_paspoorten.csv")

# Path to store location URI results
csv_path = os.path.join(data_path, "wikidata_locations.csv") 

In [8]:
# Load cleaned person cards data into a dataframe
df_person_card = pd.read_csv(person_card_path, sep=",", header=0, index_col=None)

In [9]:
# Load cleaned passport requests data into a dataframe
df_passport = pd.read_csv(passport_path, sep=",", header=0, index_col=None)


In [10]:
# Determine unique place names in person cards data (not taking into account province names yet)
places_person_card = list(df_person_card.Geboorteplaats.unique())

In [11]:
print("There are", len(places_person_card), "unique place names in person card data")

There are 17475 unique place names in person card data


In [12]:
# Determine unique place names in passport data (not taking into account province names yet)
places_passport = list(df_passport.p1_gebplaats.unique())

In [13]:
print("There are", len(places_passport), "unique place names in passport request data")

There are 18889 unique place names in passport request data


In [14]:
# Make a set together with the places from person cards and passport requests to not need to 
# query places with the same name twice (not taking into account province names yet)
places_combined = list(set(places_person_card + places_passport))

In [15]:
print("There are", len(places_combined), 
      "unique place names in combined person card and passport request data")

There are 33304 unique place names in combined person card and passport request data


In [16]:
# For duplicate place names it is more informative to also add the province name,
# as that sometimes makes it possible to find the place URI unambiguously.
# So, retrieve the unique combinations of birth place and its province name 
# for personal record cards
grouped_person_card = df_person_card[['Geboorteplaats', 'Geboorteprovincie']].drop_duplicates()

In [17]:
# Inspect what this dataframe looks like: notice that missing province values are
# also included here (as wanted)
grouped_person_card.head()

Unnamed: 0,Geboorteplaats,Geboorteprovincie
0,Dordrecht,
1,Wolphaartsdijk,Zeeland
2,Lochem,Gelderland
3,'s-Gravenhage,
4,Egmond aan Zee,Noord-Holland


In [18]:
print("There are", len(grouped_person_card), 
    "unique place name - place province combinations in person card data")
# Notice there are indeed more place-province combinations than unique place names,
# so some places had a province indicated only in some records, or there were places
# with the same name in different provinces

There are 19495 unique place name - place province combinations in person card data


In [19]:
# Retrieve the unique combinations of birth place and its province name for passport requests
grouped_passport = df_passport[['p1_gebplaats', 'p1_gebprovincie']].drop_duplicates()

In [20]:
print("There are", len(grouped_passport), 
    "unique place name - place province combinations in passport request data")

There are 19335 unique place name - place province combinations in passport request data


In [21]:
# Rename columns of passport requests dataframe because then merging is easier 
# (otherwise both column name variants will be maintained)
grouped_passport.rename(columns={'p1_gebplaats':'Geboorteplaats', 
                                 'p1_gebprovincie':'Geboorteprovincie'},
                       inplace=True)

In [22]:
# Inspect what this dataframe looks like: notice that missing province values are
# also included here (as wanted)
grouped_passport.head()

Unnamed: 0,Geboorteplaats,Geboorteprovincie
0,Sliedrecht,
1,Rotterdam,
3,Surakarta,
4,Waeweran,
5,Gile-Rijen,


In [23]:
# Merge the unique place-province combinations from personal record cards and passport requests
# Make sure that duplicates are not added and that a combination that occurs in only
# one of the dataframes is still included (i.e. an outer join, which uses the union of
# keys from both dataframes instead of e.g. the intersection or only the keys from the
# left or right dataframe)
place_province_combined = pd.merge(left=grouped_person_card, right=grouped_passport, 
                                   on=['Geboorteplaats', 'Geboorteprovincie'],
                                   how='outer')

In [24]:
# Inspect resulting dataframe with all unique place-province combinations
place_province_combined.head()

Unnamed: 0,Geboorteplaats,Geboorteprovincie
0,Dordrecht,
1,Wolphaartsdijk,Zeeland
2,Lochem,Gelderland
3,'s-Gravenhage,
4,Egmond aan Zee,Noord-Holland


In [25]:
print("There are", len(place_province_combined), 
    "unique place name - place province combinations in combined " + 
    "person card and passport request data")

There are 35634 unique place name - place province combinations in combined person card and passport request data


## Query Wikidata

In [33]:
# Query unique places in personal record cards and passport requests data
# Note that the progress bar in the output shows fewer places than counted above, because
# the run was restarted in between and continued from there
query_locations(places_lst=place_province_combined.Geboorteplaats, 
                provinces_lst=place_province_combined.Geboorteprovincie, 
                csv_path=csv_path, N=100)

  6%|████▋                                                                      | 1380/22133 [12:35<3:30:28,  1.64it/s]Note: Wikidata item Q2050473 missing 'instance of' (P31)
 13%|█████████▌                                                                 | 2806/22133 [25:00<1:59:37,  2.69it/s]Note: Wikidata item Q2417776 missing 'instance of' (P31)
 13%|█████████▉                                                                 | 2917/22133 [26:00<2:40:52,  1.99it/s]Note: Wikidata item Q44539 missing 'instance of' (P31)
 22%|████████████████▋                                                          | 4935/22133 [42:37<1:00:50,  4.71it/s]Note: Wikidata item Q2383211 missing 'instance of' (P31)
 25%|███████████████████                                                        | 5624/22133 [47:59<2:06:27,  2.18it/s]Note: Wikidata item Q9788 missing 'instance of' (P31)
 31%|██████████████████████▉                                                    | 6769/22133 [54:40<1:10:14,  3.65it/s]Note: 

## Inspect query results

In [34]:
# Load all saved query results into a pandas dataframe
df_query_results = pd.read_csv(csv_path, sep=",", header=0, index_col=None)

# Inspect dataframe
df_query_results.head(10)

Unnamed: 0,place_name,province_name,standard_name,wikidata_uri,geonames_uri,longitude,latitude,alternative_names,country,country_wikidata_uri,province,province_wikidata_uri,n_results
0,Dordrecht,,Dordrecht,http://www.wikidata.org/entity/Q26421,http://sws.geonames.org/2756668,4.678333,51.795833,"Dordt, gemeente Dordrecht, Dordrecht (gemeente...",Nederland,http://www.wikidata.org/entity/Q55,Zuid-Holland,http://www.wikidata.org/entity/Q694,1.0
1,Wolphaartsdijk,,,http://www.wikidata.org/entity/Q1025042,http://sws.geonames.org/2744199,3.8197,51.5297,,Nederland,http://www.wikidata.org/entity/Q55,,,1.0
2,Lochem,,,http://www.wikidata.org/entity/Q15878783,http://sws.geonames.org/None,,,,,,,,1.0
3,'s-Gravenhage,,,,,,,,,,,,0.0
4,Egmond aan Zee,,,http://www.wikidata.org/entity/Q1616324,http://sws.geonames.org/2756301,,,,Nederland,http://www.wikidata.org/entity/Q55,,,1.0
5,Winschoten,,,http://www.wikidata.org/entity/Q73817,http://sws.geonames.org/2744344,7.033333,53.15,"Winschoot, Mokum van het Noorden",Nederland,http://www.wikidata.org/entity/Q55,,,1.0
6,Klundert,,,http://www.wikidata.org/entity/Q76996,http://sws.geonames.org/2752600,4.5289,51.6628,"Niervaart, Loerendonck",Nederland,http://www.wikidata.org/entity/Q55,,,1.0
7,Deurne en Liessel,,,http://www.wikidata.org/entity/Q2774034,http://sws.geonames.org/None,5.8,51.45,,Nederland,http://www.wikidata.org/entity/Q55,Noord-Brabant,http://www.wikidata.org/entity/Q1101,1.0
8,Kerkrade,,,http://www.wikidata.org/entity/Q28912270,http://sws.geonames.org/None,,,,,,,,1.0
9,Amsterdam,,Amsterdam,http://www.wikidata.org/entity/Q727,http://sws.geonames.org/2759794,4.9,52.383333,"Mokum, A'dam, 020, Asd., Damsko",,,,,1.0
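
Rows 2, 7 and 8 above contain the value `http://sws.geonames.org/None`. A likely cause is that `dict.get` returns `None` instead of raising when the key is absent, so the `except` clause is never reached and `str(None)` is appended to the URI prefix. The sketch below shows a guard against this. The helper `make_geonames_uri` is hypothetical, it assumes the GeoNames ID arrives as a single string, and it returns `None` where the notebook would record `np.nan`.

```python
GEONAMES_KEY = 'GeoNames-identificatiecode (P1566)'

def make_geonames_uri(wikidata_claims):
    """Return the GeoNames URI, or None when no GeoNames ID is recorded (hypothetical)."""
    geonames_id = wikidata_claims.get(GEONAMES_KEY)
    if geonames_id is None:
        # Nothing recorded: return a missing value instead of a URI ending in 'None'
        return None
    return 'http://sws.geonames.org/' + str(geonames_id)

# ID taken from the Dordrecht row above
with_id = make_geonames_uri({GEONAMES_KEY: '2756668'})
without_id = make_geonames_uri({})
# Unguarded concatenation: reproduces the value seen in the table
unguarded = 'http://sws.geonames.org/' + str({}.get(GEONAMES_KEY))

print(with_id)     # http://sws.geonames.org/2756668
print(without_id)  # None
print(unguarded)   # http://sws.geonames.org/None
```

The same consideration applies to other values fetched with `.get`, such as the official name, where a returned `None` is not identical to `np.nan`.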


In [35]:
# Check occurrence of number of results (i.e. how many results found per unique location name)
# Notice that multiple conflicting results were hardly ever found; far more often there were 0
df_query_results.n_results.value_counts(normalize=False)

0.0       26117
1.0        9481
9999.0       29
Name: n_results, dtype: int64

In [37]:
# Check frequency of number of results (i.e. how many results found per unique location name)
df_query_results.n_results.value_counts(normalize=True)

0.0       0.733068
1.0       0.266118
9999.0    0.000814
Name: n_results, dtype: float64
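
As a check on how the normalised frequencies relate to the raw counts, the short sketch below recomputes them by hand from the counts printed earlier. It uses plain Python instead of pandas, and the variable names are chosen for illustration.

```python
# Raw counts per n_results value, copied from the value_counts output above
counts = {0.0: 26117, 1.0: 9481, 9999.0: 29}

# Total number of queried combinations with a recorded n_results
total = sum(counts.values())

# Relative frequency per value, rounded to six decimals like the pandas output
shares = {value: round(count / total, 6) for value, count in counts.items()}

print(total)   # 35627
print(shares)  # {0.0: 0.733068, 1.0: 0.266118, 9999.0: 0.000814}
```

The total of 35627 equals the number of non-missing `n_results` values, and roughly 73% of the queried combinations returned no Wikidata entity.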

In [57]:
# Check results with n_results=9999 (i.e. more than 1 result but unknown how many exactly)
df_query_results[df_query_results.n_results == 9999]
# These mostly lack a province name and could often also refer to an entity that is
# not a place (e.g. a person) or that lies outside of the Netherlands

Unnamed: 0,Geboorteplaats,Geboorteprovincie,standard_name,wikidata_uri,geonames_uri,longitude,latitude,alternative_names,country,country_wikidata_uri,province,province_wikidata_uri,n_results
956,Kalk,,,,,,,,,,,,9999.0
1140,Waarde,Zeeland,,,,,,,,,,,9999.0
2096,Neus,,,,,,,,,,,,9999.0
2284,Beek,Noord-Brabant,,,,,,,,,,,9999.0
3290,Haard,,,,,,,,,,,,9999.0
5795,Nadorst,,,,,,,,,,,,9999.0
9350,Lag,,,,,,,,,,,,9999.0
9415,Yerköy,,,,,,,,,,,,9999.0
10056,Rating,,,,,,,,,,,,9999.0
11764,Filippine,Zeeland,,,,,,,,,,,9999.0


In [38]:
# Check number of missing values per attribute (which couldn't get retrieved with Wikidata)
# Interestingly one of the unique places is the NaN value
df_query_results.isna().sum()

place_name                   7
province_name            35504
standard_name            34258
wikidata_uri             26153
geonames_uri             26153
longitude                30011
latitude                 30011
alternative_names        33917
country                  29913
country_wikidata_uri     29914
province                 32277
province_wikidata_uri    32278
n_results                    7
dtype: int64

In [39]:
# Note that the province name is sometimes no longer saved here, although it was part of
# the unique combination. This happens when Wikidata found no place using it and the query
# was retried without it (which may have failed as well)
# Thus, fill in the province again from the original data
df_query_results['province_name'] = place_province_combined['Geboorteprovincie']

In [40]:
# Check again if this indeed reduces the number of missing province values
df_query_results.isna().sum()

place_name                   7
province_name            31652
standard_name            34258
wikidata_uri             26153
geonames_uri             26153
longitude                30011
latitude                 30011
alternative_names        33917
country                  29913
country_wikidata_uri     29914
province                 32277
province_wikidata_uri    32278
n_results                    7
dtype: int64

In [44]:
# Check number of non-missing values
# Notice that for many records not all information could be retrieved, but only a part of it
len(df_query_results) - df_query_results.isna().sum()

place_name               35627
province_name             3982
standard_name             1376
wikidata_uri              9481
geonames_uri              9481
longitude                 5623
latitude                  5623
alternative_names         1717
country                   5721
country_wikidata_uri      5720
province                  3357
province_wikidata_uri     3356
n_results                35627
dtype: int64

In [41]:
# Save with original provinces again
df_query_results.to_csv(csv_path, header=True, sep=',', index=False)

In [42]:
# Check number of unique values
# Notice that quite a lot of distinct provinces are found, which may be because provinces in
# other countries are now included as well, or because another administrative region is
# sometimes mentioned
df_query_results.nunique()

place_name               33303
province_name               11
standard_name             1171
wikidata_uri              7850
geonames_uri              3830
longitude                 4112
latitude                  4004
alternative_names         1231
country                    102
country_wikidata_uri       102
province                  1629
province_wikidata_uri     1636
n_results                    3
dtype: int64

In [49]:
# Check separately for personal record cards and passport request places how many got resolved
# with Wikidata URI
# For this first rename columns in results to be equal to other dataframes
df_query_results.rename(columns={'place_name':'Geboorteplaats', 
                                 'province_name':'Geboorteprovincie'},
                       inplace=True)

In [55]:
# Subset query results for personal record cards (only keep results records with keys
# in left dataframe)
results_person_cards = pd.merge(left=grouped_person_card, right=df_query_results, 
                                   on=['Geboorteplaats', 'Geboorteprovincie'],
                                   how='left')

In [54]:
# Subset query results for passport requests (only keep results records with keys
# in left dataframe)
results_passport = pd.merge(left=grouped_passport, right=df_query_results, 
                                   on=['Geboorteplaats', 'Geboorteprovincie'],
                                   how='left')

In [58]:
# Check occurrence of number of results (i.e. how many results found per unique location name)
# Do this for personal record cards results
results_person_cards.n_results.value_counts(normalize=False)

0.0       11284
1.0        8189
9999.0       16
Name: n_results, dtype: int64

In [59]:
# Check the relative frequency of each number of results (i.e. how many results were found
# per unique location name)
# Do this for personal record cards results
results_person_cards.n_results.value_counts(normalize=True)

0.0       0.578993
1.0       0.420186
9999.0    0.000821
Name: n_results, dtype: float64

In [60]:
# Check number of unique values
# Do this for personal record cards results
results_person_cards.nunique()

Geboorteplaats           17474
Geboorteprovincie           11
standard_name             1041
wikidata_uri              6690
geonames_uri              3438
longitude                 3693
latitude                  3589
alternative_names         1130
country                     89
country_wikidata_uri        89
province                  1371
province_wikidata_uri     1379
n_results                    3
dtype: int64

In [61]:
# Count how often each number of results occurs (i.e. how many results were found per
# unique location name)
# Do this for passport request results
results_passport.n_results.value_counts(normalize=False)

0.0       15914
1.0        3401
9999.0       16
Name: n_results, dtype: int64

In [62]:
# Check the relative frequency of each number of results (i.e. how many results were found
# per unique location name)
# Do this for passport request results
# Notice that far fewer queries had a (unique) result than for the personal record cards places
results_passport.n_results.value_counts(normalize=True)

0.0       0.823237
1.0       0.175935
9999.0    0.000828
Name: n_results, dtype: float64
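
To compare the two sources directly, the relative frequencies can be placed side by side. This is a minimal sketch on hypothetical toy frames; the helper name `n_results_summary` is an assumption and not part of the notebook. The value 9999 is carried along as-is, following its definition earlier in the notebook.

```python
import pandas as pd

def n_results_summary(frames):
    """Return one column of relative n_results frequencies per named frame."""
    summary = pd.DataFrame({name: df['n_results'].value_counts(normalize=True)
                            for name, df in frames.items()})
    # A value that never occurs in one of the frames gets frequency 0
    return summary.fillna(0.0)

# Hypothetical toy data standing in for results_person_cards and results_passport
cards = pd.DataFrame({'n_results': [0.0, 1.0, 1.0, 9999.0]})
passports = pd.DataFrame({'n_results': [0.0, 0.0, 0.0, 1.0]})

summary = n_results_summary({'person_cards': cards, 'passports': passports})
print(summary)
```

With the real frames the call would be `n_results_summary({'person_cards': results_person_cards, 'passports': results_passport})`.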

In [63]:
# Check number of unique values
# Do this for passport request results
results_passport.nunique()

Geboorteplaats           18888
Geboorteprovincie           11
standard_name              423
wikidata_uri              3083
geonames_uri              1407
longitude                 1441
latitude                  1408
alternative_names          547
country                     61
country_wikidata_uri        62
province                   552
province_wikidata_uri      554
n_results                    3
dtype: int64

In [76]:
# Check frequency of standard names to see whether it makes sense to replace the original
# names with them. The slightly more frequent ones make sense (they are likely more frequent
# because they were resolved from multiple name variants, or once with and once without a
# province indicator). At the tail, however, the native name is often used, sometimes even
# in a non-Latin script (native names also occur among the more frequent ones, but there
# always in the Latin alphabet), and sometimes a list with multiple options is given.
# The standard name also often aligns with the place name that was used for querying.
# A standard name was found for only 1376 unique place-province combinations, but where it
# differs from the originally recorded name it may be beneficial to replace that name.
# However, this should only be done for names in Latin characters, as other scripts are
# rarely used within Dutch documents. A simple check is string.isascii(), but this also
# returns False for accented characters like é or ü (this could be alleviated by first
# unidecoding the string, but for these few cases it is fine to just keep the original name)
df_query_results.standard_name.value_counts(normalize=True).tail(20)

München                                          0.000727
Druogno                                          0.000727
Strijensas                                       0.000727
Bensheim                                         0.000727
Easterlittens                                    0.000727
Schirmeck                                        0.000727
Wohlen (AG)                                      0.000727
Kecamatan Barabai                                0.000727
Ahrensburg                                       0.000727
Sains-du-Nord                                    0.000727
['Мінск', 'Минск', 'Менск', 'Менск', 'Мінск']    0.000727
Sinaia                                           0.000727
Kolbermoor                                       0.000727
Kapfenberg                                       0.000727
Kecamatan Rangkasbitung                          0.000727
Douai                                            0.000727
Raguhn                                           0.000727
La Réole      
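
The Latin-script check described in the comment above can be sketched with the standard library only. Instead of the `unidecode` package, this variant strips combining accents with `unicodedata` before calling `isascii()`; the function name `is_latin_name` is an assumption made for illustration.

```python
import unicodedata

def is_latin_name(name):
    """Return True if the name is plain ASCII once combining accents are stripped."""
    # NFKD splits an accented letter into its base letter plus a combining mark
    decomposed = unicodedata.normalize('NFKD', name)
    stripped = ''.join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.isascii()

print(is_latin_name('La Réole'))  # True
print(is_latin_name('München'))   # True
print(is_latin_name('Мінск'))     # False
```

Letters that have no decomposition, such as ø or ß, still fail this check, so it is stricter than a full transliteration would be.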

In [95]:
# Check how many of the retrieved standard names correspond exactly to the originally
# queried place names (out of the 1376 found standard names)
(df_query_results.Geboorteplaats == df_query_results.standard_name).sum()

1007

In [87]:
# Check frequency of found province names to see whether it makes sense to enrich the
# provinces with them. The administrative regions often contain a (non-)Dutch province,
# but also very often a municipality or another kind of area. A manual check is thus
# needed to verify whether a province or something else was retrieved before the data
# can be enriched with this column
df_query_results.province.value_counts(normalize=True).head(20)

Zuid-Holland                        0.024724
Noord-Holland                       0.021448
Friesland                           0.019363
Utrecht                             0.014298
Noord-Brabant                       0.013405
Overijssel                          0.011022
Gelderland                          0.008639
Súdwest-Fryslân                     0.008043
Limburg                             0.005362
Arrondissement Turnhout             0.004766
Kreis Steinfurt                     0.004468
Weststellingwerf                    0.004170
Arrondissement Antwerpen            0.004170
Arrondissement Leuven               0.003575
Oss                                 0.003575
Arrondissement Verviers             0.003277
Arrondissement Brussel-Hoofdstad    0.003277
Zeeland                             0.002979
Dantumadeel                         0.002681
Rendsburg-Eckernförde               0.002681
Name: province, dtype: float64
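
Part of the manual check mentioned above could be pre-filtered by comparing the retrieved region against the twelve Dutch province names. This is a sketch only: the function name `is_dutch_province` is hypothetical, and matching on labels is an assumption that ignores spelling variants such as Fryslân.

```python
# The twelve provinces of the Netherlands, in Dutch spelling
DUTCH_PROVINCES = frozenset({
    'Drenthe', 'Flevoland', 'Friesland', 'Gelderland', 'Groningen', 'Limburg',
    'Noord-Brabant', 'Noord-Holland', 'Overijssel', 'Utrecht', 'Zeeland',
    'Zuid-Holland',
})

def is_dutch_province(region):
    """Return True if the region label equals the name of a Dutch province."""
    # Missing values (None or NaN) are not strings and therefore never match
    return isinstance(region, str) and region.strip() in DUTCH_PROVINCES

regions = ['Zuid-Holland', 'Súdwest-Fryslân', 'Arrondissement Turnhout', 'Limburg', None]
flags = [is_dutch_province(region) for region in regions]
print(flags)  # [True, False, False, True, False]
```

A label match is not conclusive: Limburg is also a Belgian province and Utrecht is also a municipality, so comparing `province_wikidata_uri` against known province URIs would likely be more robust.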

In [92]:
# Check frequency of found country names to see whether it makes sense to enrich the
# original country column with them. These countries indeed make sense and are Dutch name
# variants, as requested and as used within the documents. Only occasionally is there no
# single standardised variant, e.g. both the country and the Kingdom of that country occur
# in this list (these are indeed two different entities, but they will cause some
# dissimilarities with string matching later). This data can thus be used rather safely to
# enrich the country information (if none was recorded already for that record)
df_query_results.country.value_counts(normalize=True).head(20)

Nederland                       0.451844
Duitsland                       0.219892
België                          0.080580
Indonesië                       0.046146
Frankrijk                       0.030414
Verenigd Koninkrijk             0.025170
Italië                          0.024821
Spanje                          0.013984
Oostenrijk                      0.013459
Verenigde Staten van Amerika    0.010313
Zwitserland                     0.009264
Hongarije                       0.007167
Zuid-Afrika                     0.004370
Australië                       0.004195
Polen                           0.003671
Tsjechië                        0.003496
Marokko                         0.003146
Slovenië                        0.002797
Noorwegen                       0.002622
Canada                          0.002622
Name: country, dtype: float64
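
The enrichment described above, filling the country only where none was recorded, can be sketched as follows. The column name `Geboorteland` and the alias mapping are hypothetical and serve only as illustration; the real documents may use other names and need other aliases.

```python
import pandas as pd

# Hypothetical mapping that collapses name variants onto one standardised label
COUNTRY_ALIASES = {'Koninkrijk der Nederlanden': 'Nederland'}

# Hypothetical toy records: 'Geboorteland' is the recorded country (may be missing),
# 'country' is the value retrieved from Wikidata
records = pd.DataFrame({
    'Geboorteland': ['Duitsland', None, None],
    'country': ['Nederland', 'Koninkrijk der Nederlanden', 'België'],
})

# Standardise the retrieved names first, then fill only the missing recorded values
standardised = records['country'].replace(COUNTRY_ALIASES)
records['Geboorteland'] = records['Geboorteland'].fillna(standardised)
print(records['Geboorteland'].tolist())  # ['Duitsland', 'Nederland', 'België']
```

Because `fillna` only touches missing entries, a recorded country is never overwritten, even when the Wikidata result disagrees with it.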