## Matching historic place names to GeoNames entities

Utrecht University, Applied Data Science, Master thesis

By Sander Engelberts (1422138)

May-June 2022

The ErfGeo Proxy can be found at http://www.hicsuntleones.nl/erfgeoproxy/ and is used to retrieve the GeoNames URI, centroid coordinates and alternative names corresponding to Dutch historic place names. The ErfGeo website itself can be found at https://erfgeo.nl/; it was created by, among others, the Cultural Heritage Agency of the Netherlands, as described at https://www.cultureelerfgoed.nl/onderwerpen/bronnen-en-kaarten/overzicht/erfgeo. The ErfGeo Proxy does not allow a query to specify which historical version of a place should be returned (e.g. because the geometry may have been different in the past), nor which of several places sharing the same name is meant. Some manual postprocessing may therefore be needed when a query returns no results or more than one. For determining whether two places in different genealogical documents are the same, however, it is acceptable that the returned details describe the place as it is today rather than as it was in the past.

In [20]:
# Import required packages
import os # for paths to files in operating system
import pandas as pd # for dataframes and operations on it
import requests # Used for retrieving web information with query
from requests.adapters import HTTPAdapter, Retry # Used to let requests retry after
# connection errors, e.g. when too many queries are made in a short time
import urllib.parse # Used for parsing string to its corresponding url encoding
import numpy as np # for mathematical operations
from tqdm import tqdm # for displaying progress of operations
tqdm.pandas() # display progress of pandas operations such as apply (use progress_apply instead)

## Query ErfGeo Proxy

In [2]:
# Url to search for historic places using the ErfGeo Proxy
ErfGeo_url = "http://www.hicsuntleones.nl/erfgeoproxy/search/?"
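A full query consists of this base URL followed by a search parameter, the lowercased and URL-encoded place name, and a suffix with filter parameters. The block below is a minimal sketch of that assembly using only the standard library. The helper `build_query` and its `safe` argument are hypothetical names introduced for illustration; they are not part of the ErfGeo Proxy. The sketch also shows that `urllib.parse.quote` leaves `/` unencoded by default, which might be worth checking for names such as 'Alphen a/d Rijn' (whether this affects the proxy's response is an assumption that has not been verified here).

```python
import urllib.parse


def build_query(place_name,
                base_url="http://www.hicsuntleones.nl/erfgeoproxy/search/?",
                search_param="q",
                query_suffix="&dataset=geonames&type=hg:Place&geometry=true",
                safe="/"):
    """Assemble a query URL for a place name (hypothetical helper, no request is sent)."""
    # Lowercase the name and URL-encode special characters such as spaces and apostrophes
    standardised_place = urllib.parse.quote(place_name.lower(), safe=safe)
    return base_url + search_param + "=" + standardised_place + query_suffix


# An apostrophe is encoded as %27
print(build_query("'s-Gravenhage"))

# By default the slash is kept as-is; with safe="" it is encoded as %2F
print(build_query("Alphen a/d Rijn"))
print(build_query("Alphen a/d Rijn", safe=""))
```

Passing `safe=""` is only one possible choice; which encoding the proxy expects for a slash would need to be confirmed against the service itself.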

In [21]:
def query_place(standardised_place, ErfGeo_url="http://www.hicsuntleones.nl/erfgeoproxy/search/?",
                search_param="q", query_suffix="&dataset=geonames&type=hg:Place&geometry=true"):
    """
    Query the ErfGeo Proxy for a Dutch historic name that is given in a standardised format
    and return all found results
    
    Parameters
    ----------
    standardised_place : str
        String representation of a Dutch historic place name (with additional province name): 
        lowercased, and special characters replaced by their URL encodings
    ErfGeo_url : str (default "http://www.hicsuntleones.nl/erfgeoproxy/search/?")
        String representing the base of the query to ErfGeo Proxy
    search_param : str (default "q")
        String representing the type of query to perform for the standardised_place:
        "q" searches for the given name, uri, or BAG id
        "contains" searches for a broader entity that contains standardised_place,
            e.g. a municipality that the standardised_place fused into
    query_suffix : str (default "&dataset=geonames&type=hg:Place&geometry=true")
        String representing additional parameters that limit the search results:
        "dataset" searches in all datasets for the standard_place, but only returns
            the results of the requested one(s): tgn and/or geonames (default)
        "type" searches only for entities of the requested type: hg:Place (default) or 
            hg:Municipality
        "geometry" returns the geometry corresponding to standardised_place or not (faster):
            true (default) or false
        
    Returns
    -------
    results : list
        List containing (a) dictionary(/ies) with the result(s) of the query
        [] if no results could be found
    """
    # Assemble query from multiple parts
    query = ErfGeo_url + search_param + "=" + standardised_place + query_suffix
    
    # Make a requests retry strategy such that common connection errors (from status_forcelist)
    # are caught, a time-out is implemented in between (backoff_factor, which exponentially
    # increases the time between retries) and a retry is done at most 3 times (total)
    retry_strategy = Retry(
        total=3,
        status_forcelist=[429, 500, 502, 503, 504],
        backoff_factor=5
    )
    adapter = HTTPAdapter(max_retries=retry_strategy)
    http = requests.Session()
    http.mount("https://", adapter)
    http.mount("http://", adapter)
    
    # Use the ErfGeo Proxy to get the requested information about the historic place
    # This will return a json structure, also when no match is found
    place_json = http.get(query)
    
    # try-except because e.g. 'Alphen a/d Rijn' just returned <Response [200]> without
    # a json body (instead of a json structure with results = []), and a response 
    # without a json body can't be parsed
    try:
        # Retrieve information from the returned json structure
        # Parse json representation into dictionary 
        place_dict = place_json.json()
        
        # Return results of query ([] if none found or if the key is missing)
        results = place_dict.get('results') or []
    except (ValueError, AttributeError):
        # ValueError: body is not valid json; AttributeError: json is not a dictionary
        results = []
        
    return results
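The handling of unparseable responses described in the comments above can be sketched without any network access. The helper `extract_results` below is a hypothetical name, and the example response bodies are assumptions made for illustration (only the `results` and `known-names` keys are taken from the code in this notebook). It returns an empty list whenever the body is empty, is not valid JSON, or has no usable `results` list.

```python
import json


def extract_results(body):
    """Return the 'results' list from a response body, or [] if it is unavailable."""
    try:
        parsed = json.loads(body)
    except (TypeError, ValueError):
        # Empty body, None, or text that is not valid JSON
        return []
    if not isinstance(parsed, dict):
        # Valid JSON, but not the expected dictionary structure
        return []
    results = parsed.get("results")
    return results if isinstance(results, list) else []


# Assumed example bodies, not recorded responses of the ErfGeo Proxy
print(extract_results('{"results": [{"known-names": "Hoorn, Hooren, Horen"}]}'))
print(extract_results('{"results": []}'))
print(extract_results(''))
print(extract_results('{"message": "no results key"}'))
```

Catching only the specific exceptions keeps programming errors and keyboard interrupts visible, which a bare `except` would hide.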

In [4]:
def retrieve_place_info(place_name, province_name,
                        ErfGeo_url="http://www.hicsuntleones.nl/erfgeoproxy/search/?"):
    """
    Query the ErfGeo Proxy for a Dutch historic name and return the following found attributes 
    of the first result (if any are found): GeoNames URI, coordinates, and alternative names
    
    Parameters
    ----------
    place_name : str
        String representation of a (Dutch) historic place name 
    province_name : str
        String representation of (Dutch) province name, just used for storing the
        results and not for querying ErfGeo Proxy 
        (to have results in the same format as Wikidata queries)
    ErfGeo_url : str (default "http://www.hicsuntleones.nl/erfgeoproxy/search/?")
        String representing the base of the query to ErfGeo Proxy

    Returns
    -------
    result : dict
        Dictionary with the attributes:
            place_name: original place_name string given in parameters
            province_name: original province_name string given in parameters
            geonames_uri: string with URI to GeoNames place entity (unless other dataset is
                requested in query)
            longitude: float representing the longitude coordinate of the place (centroid) 
                in WGS84
            latitude: float representing the latitude coordinate of the place (centroid) in WGS84
            alternative_names: string with alternative names of the place_name which are 
                connected to the correct entity by ErfGeo. These are toponyms that can 
                represent historic names of places (e.g. Traiecto for Maastricht), 
                carnival names, and dialect names
            n_results: integer representing the number of results found by the query in 
                ErfGeo Proxy. If 0 then no results were found, and if more than 1 then only 
                the first result is represented in the previous attributes so manual 
                inspection needs to discover if this is the correct entity that is 
                referenced or if another one of the results represents the place_name better. 
                This may be the case due to places having the same name
    """
    # Standardise place name by lowercasing, and URL encode special characters by e.g. replacing 
    # spaces by their url encoding %20 and apostrophes by their url encoding %27
    standardised_place = place_name.lower()
    standardised_place = urllib.parse.quote(standardised_place)
    
    # Create query to retrieve the GeoNames entity of the place that is the given place 
    # (e.g. when this place fused with another place then returns the new municipality)
    query_suffix = "&dataset=geonames&type=hg:Place&geometry=true"
    results = query_place(standardised_place=standardised_place, ErfGeo_url=ErfGeo_url,
                         search_param="q", query_suffix=query_suffix)
    
    # If no results are found, then try again with alternative queries
    if not results:
        # Retry with different query that searches for a broader entity that contains 
        # the requested place name if no results are found
        results = query_place(standardised_place=standardised_place, ErfGeo_url=ErfGeo_url,
                         search_param="contains", query_suffix=query_suffix)
    if not results:
        # Retry with different query that searches for a municipality that is
        # the requested place name if no results are found (e.g. when this place
        # fused with another one)
        query_suffix = "&dataset=geonames&type=hg:Municipality&geometry=true"
        results = query_place(standardised_place=standardised_place, ErfGeo_url=ErfGeo_url,
                         search_param="q", query_suffix=query_suffix)
    if not results:
        # Retry with different query that searches for a municipality that contains
        # the requested place name if no results are found (e.g. when this place
        # fused with another one) (same query_suffix as previous block is used)
        results = query_place(standardised_place=standardised_place, ErfGeo_url=ErfGeo_url,
                         search_param="contains", query_suffix=query_suffix)

    # Note down if multiple results were found, because then manual inspection is
    # required to specify the correct result using additional information 
    # This is for example the case when multiple places have the same place name
    n_results = len(results)
    
    # Retrieve place entity features of results
    # No results when ErfGeo Proxy couldn't find any match for the given place name
    if n_results > 0:
        # Get the first result (also if multiple are returned)
        place_dict = results[0]
        
        # Retrieve GeoNames URI
        try:
            geonames_uri = place_dict.get('geonames').get('uri')
        except:
            geonames_uri = np.nan
        
        # Retrieve centroid coordinates stored at GeoNames
        # Instead, the information from www.gemeentegeschiedenis.nl could be 
        # used to determine the centroid of the place geometry in the specific 
        # year the document refers to. However, this would require more computations
        # that are not required to relatively accurately match places based on coordinates
        try:
            longitude = place_dict.get('geonames').get('geometry').get('coordinates')[0]
        except:
            longitude = np.nan
        try:
            latitude = place_dict.get('geonames').get('geometry').get('coordinates')[1]
        except:
            latitude = np.nan
            
        # Retrieve alternative names stored in ErfGeo Proxy
        # This is given for the returned entity this place name is part of, which can
        # be the broader municipality instead of the alternative names of this place itself
        try:
            alternative_names = place_dict.get('known-names')
        except:
            alternative_names = np.nan
    else:
        geonames_uri = np.nan
        longitude = np.nan
        latitude = np.nan
        alternative_names = np.nan
    
    return {"place_name":place_name, "province_name":province_name,
            "geonames_uri":geonames_uri, "longitude":longitude, 
            "latitude":latitude, "alternative_names":alternative_names, "n_results":n_results}
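The four successive query attempts in `retrieve_place_info` follow a fixed order: exact place, containing place, exact municipality, containing municipality. A minimal sketch of the same fallback order as a loop is given below. The names `first_nonempty` and `fake_query` are hypothetical, and `fake_query` is a stand-in that replaces the real network call so that the sketch runs offline.

```python
def first_nonempty(place, query_fn):
    """Try the query variants in order and return the first non-empty result list."""
    attempts = [
        ("q", "hg:Place"),                # the place itself
        ("contains", "hg:Place"),         # a broader place that contains it
        ("q", "hg:Municipality"),         # a municipality with this name
        ("contains", "hg:Municipality"),  # a municipality that contains it
    ]
    for search_param, entity_type in attempts:
        results = query_fn(place, search_param, entity_type)
        if results:
            return results, (search_param, entity_type)
    return [], None


def fake_query(place, search_param, entity_type):
    # Stand-in for the real query: pretend only the municipality lookup succeeds
    if (search_param, entity_type) == ("q", "hg:Municipality"):
        return [{"known-names": place}]
    return []


results, matched = first_nonempty("example", fake_query)
print(matched)
```

Returning the matched variant alongside the results makes it possible to record which query produced a hit, which could help during manual inspection.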

In [27]:
def write_chunk(i, query_results, csv_path):
    """
    Intermediately write query results to .csv file
    
    Parameters
    ----------
    i : int
        Index of the place that was queried last, required to check if
        a header should still be written (only in the first iteration)
    query_results : list of dictionaries
        List with dictionaries containing the results
    csv_path : str
        Path to .csv file to write the outputs to
    """
    # Only write header in the first iteration
    header = (i == 0)
        
    # Create Pandas Dataframe from list with dictionaries    
    chunk = pd.DataFrame(query_results)    
    
    # Save chunk to .csv file (mode 'a' is appending)
    chunk.to_csv(csv_path, header=header, mode='a', sep=',', index=False) 
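Because the header is tied to `i == 0`, restarting the loop on an existing file would write a second header row. One possible alternative, sketched below with the standard `csv` module, is to write the header only when the file does not exist yet or is empty. The helper `append_rows` and the temporary demo file are assumptions for illustration and not the method used in this notebook.

```python
import csv
import os
import tempfile

FIELDS = ["place_name", "province_name", "geonames_uri", "longitude",
          "latitude", "alternative_names", "n_results"]


def append_rows(rows, csv_path, fieldnames=FIELDS):
    """Append rows to a .csv file, writing the header only for a new or empty file."""
    needs_header = not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0
    with open(csv_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if needs_header:
            writer.writeheader()
        writer.writerows(rows)


# Demo on a temporary file: two separate appends, as after a restart
demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv")
row = {"place_name": "Surabaja", "province_name": "", "geonames_uri": "",
       "longitude": "", "latitude": "", "alternative_names": "", "n_results": 0}
append_rows([row], demo_path)
append_rows([row], demo_path)

with open(demo_path, newline="", encoding="utf-8") as f:
    lines = f.read().splitlines()
print(len(lines))  # one header line plus two data lines
```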

In [6]:
def query_locations(places_lst, provinces_lst, csv_path, N=100):
    """
    Query ErfGeo Proxy for all unique place-province name combinations 
    from specified list and save its results
    
    Parameters
    ----------
    places_lst : list of strings
        List with place names that should be queried
    provinces_lst : list of strings
        List with recorded province names that correspond to the places in places_lst
    csv_path : str
        Path to .csv file where to write outputs in
    N : int (default 100)
        Integer stating how often intermediate results need to be written to .csv file.
        This is respective to the number of places and not to the number of queries
        required for that place to try to retrieve a corresponding entity
    """
    # Query all places from specified list and store the information in a list, 
    # and save intermediately to .csv file 
    query_results = []
    for i, (place_name, province_name) in tqdm(enumerate(zip(places_lst, provinces_lst)), 
                                               total=len(places_lst)):
        # Query place name if it is not a missing value 
        # (not the case when only unique places are queried)
        # pd.notna is used because an identity check against np.nan can miss other NaN objects
        if pd.notna(place_name):
            # Try to retrieve entity belonging to place-province name combination
            query_result = retrieve_place_info(ErfGeo_url=ErfGeo_url, place_name=place_name,
                                              province_name=province_name)
            query_results.append(query_result)
        else:
            # If the place name is a missing value, just record missing values as result
            query_results.append({"place_name":place_name, "province_name":province_name, 
                                  "geonames_uri":np.nan, "longitude":np.nan, 
                                  "latitude":np.nan, "alternative_names":np.nan, 
                                  "n_results":0}) 

        # Check if results should be (intermediately) saved
        if i % N == 0:
            # Intermediately save queried results
            write_chunk(i=i, query_results=query_results, csv_path=csv_path)
            query_results = []
        elif i == len(places_lst) - 1:
            # Write last results
            write_chunk(i=i, query_results=query_results, csv_path=csv_path)
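The saving condition above can be traced with a small sketch that lists the indices at which a chunk is written. The helper `flush_points` is a hypothetical name used only to illustrate the `i % N == 0` and last-index checks.

```python
def flush_points(n_places, N):
    """Return the indices at which query_locations would write a chunk."""
    points = []
    for i in range(n_places):
        # Same condition as in the loop: every N-th index, or the very last index
        if i % N == 0 or i == n_places - 1:
            points.append(i)
    return points


print(flush_points(6, 2))  # [0, 2, 4, 5]
print(flush_points(5, 2))  # [0, 2, 4]
```

Note that the first chunk is written at index 0 and therefore contains a single row, while later chunks contain N rows (and the final chunk contains whatever remains).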

In [7]:
# Path to folder where I stored the data, adapt to own storage location
data_path = r"E:\CBG" # raw string so that the backslash is not read as an escape character

# Path to unique place-province combinations data, which were already queried at Wikidata
location_path = os.path.join(data_path, "wikidata_locations.csv")

# Path to store location URI results
csv_path = os.path.join(data_path, "erfgeo_locations.csv")

In [8]:
# Load location data that was already queried at Wikidata
df_location = pd.read_csv(location_path, sep=",", header=0, index_col=None)

In [9]:
# Inspect dataframe that is already partially filled in by Wikidata queries
df_location.head()

Unnamed: 0,place_name,province_name,standard_name,wikidata_uri,geonames_uri,longitude,latitude,alternative_names,country,country_wikidata_uri,province,province_wikidata_uri,n_results
0,Dordrecht,,Dordrecht,http://www.wikidata.org/entity/Q26421,http://sws.geonames.org/2756668,4.678333,51.795833,"Dordt, gemeente Dordrecht, Dordrecht (gemeente...",Nederland,http://www.wikidata.org/entity/Q55,Zuid-Holland,http://www.wikidata.org/entity/Q694,1.0
1,Wolphaartsdijk,Zeeland,,http://www.wikidata.org/entity/Q1025042,http://sws.geonames.org/2744199,3.8197,51.5297,,Nederland,http://www.wikidata.org/entity/Q55,,,1.0
2,Lochem,Gelderland,,http://www.wikidata.org/entity/Q15878783,http://sws.geonames.org/None,,,,,,,,1.0
3,'s-Gravenhage,,,,,,,,,,,,0.0
4,Egmond aan Zee,Noord-Holland,,http://www.wikidata.org/entity/Q1616324,http://sws.geonames.org/2756301,,,,Nederland,http://www.wikidata.org/entity/Q55,,,1.0


In [10]:
# Subset the data to only query locations that don't have Wikidata information stored yet
# because no (or too many) results were found
df_location_subset = df_location[df_location.n_results != 1]

In [11]:
# Inspect how many places have to get queried at ErfGeo Proxy
print("There are", len(df_location_subset), "unique place-province combinations that " +
      "didn't get a result using Wikidata so will be queried using ErfGeo Proxy")

There are 26153 unique place-province combinations that didn't get a result using Wikidata so will be queried using ErfGeo Proxy


In [28]:
# Query ErfGeo Proxy for places that did not yet get a unique result with Wikidata queries
# Note that the output progress bar shows fewer places than the number stated above, which is 
# due to an intermediate restart and continuation
query_locations(places_lst=df_location_subset.place_name, 
                provinces_lst=df_location_subset.province_name, csv_path=csv_path, N=100)

100%|█████████████████████████████████████████████████████████████████████████| 23352/23352 [24:52:00<00:00,  3.83s/it]


## Inspect query results

In [29]:
# Load all saved query results into a Pandas dataframe
df_query_results = pd.read_csv(csv_path, sep=",", header=0, index_col=None)

# Inspect dataframe
df_query_results.head()

Unnamed: 0,place_name,province_name,geonames_uri,longitude,latitude,alternative_names,n_results
0,'s-Gravenhage,,http://sws.geonames.org/2747373,4.29861,52.07667,"The Hague, 's-Gravenhage, Haag, 's Gravenhage,...",1
1,Surabaja,,,,,,0
2,Ambt-Doetinchem,Gelderland,http://sws.geonames.org/2756766,6.27955,51.96216,"Doetinchem, Gemeente Doetinchem, Ambt-Doetiche...",1
3,Huien,Noord-Holland,,,,,0
4,Enkhuien,Noord-Holland,,,,,0


In [30]:
# Check occurrence of number of results (i.e. how many results found per unique location name)
df_query_results.n_results.value_counts(normalize=False)

0    25438
1      708
2        5
5        1
3        1
Name: n_results, dtype: int64

In [31]:
# Check frequency of number of results (i.e. how many results found per unique location name)
# Only about 2.7% of the places could additionally be resolved with exactly one result
# (this share is relative to the places that couldn't get resolved using Wikidata, 
# not to the total number of unique place-province combinations)
df_query_results.n_results.value_counts(normalize=True)

0    0.972661
1    0.027071
2    0.000191
5    0.000038
3    0.000038
Name: n_results, dtype: float64
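The relative frequencies above follow directly from the counts in the previous output. As a small check, using only those counts:

```python
# Counts of n_results as reported by value_counts(normalize=False) above
counts = {0: 25438, 1: 708, 2: 5, 5: 1, 3: 1}

total = sum(counts.values())
shares = {n: count / total for n, count in counts.items()}

print(total)                 # 26153 queried place-province combinations
print(round(shares[0], 6))   # share without any result
print(round(shares[1], 6))   # share with exactly one result
```

The total equals the 26153 combinations that were selected for querying, so the shares are relative to that subset only.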

In [32]:
# Some places that were not found are (Dutch name variants of) foreign places, 
# but there are also still Dutch ones. This can be due to spelling errors, non-recorded 
# name variants, an abbreviation that was not written out in full (because it is not 
# a standard one), or extra information in the place name that could not be cleaned automatically
df_query_results[df_query_results.n_results == 0].head(20)

Unnamed: 0,place_name,province_name,geonames_uri,longitude,latitude,alternative_names,n_results
1,Surabaja,,,,,,0
3,Huien,Noord-Holland,,,,,0
4,Enkhuien,Noord-Holland,,,,,0
5,Haerswoude,Zuid-Holland,,,,,0
8,Ambt-Hardenberg,Overijssel,,,,,0
10,'s-Gravenande,,,,,,0
12,Zierikee,Zeeland,,,,,0
14,Hoogeand,Groningen,,,,,0
15,Blokijl,Overijssel,,,,,0
18,Wijmbritseradeel,Friesland,,,,,0


In [34]:
# Check some places with one result
df_query_results[df_query_results.n_results == 1].head(10)

Unnamed: 0,place_name,province_name,geonames_uri,longitude,latitude,alternative_names,n_results
0,'s-Gravenhage,,http://sws.geonames.org/2747373,4.29861,52.07667,"The Hague, 's-Gravenhage, Haag, 's Gravenhage,...",1
2,Ambt-Doetinchem,Gelderland,http://sws.geonames.org/2756766,6.27955,51.96216,"Doetinchem, Gemeente Doetinchem, Ambt-Doetiche...",1
6,het Bildt,Friesland,http://sws.geonames.org/2754301,5.6487,53.28397,"Het Bildt, Gemeente Het Bildt, 't Bildt Vrouwe...",1
7,Münster,,http://sws.geonames.org/2867543,7.62571,51.96236,Münster,1
9,Zegwaart,Zuid-Holland,http://sws.geonames.org/2743986,4.50769,52.06693,"Zegwaart, Segwaart, Segwaerden, Seggewaart, Ze...",1
11,Ouder Amstel,Noord-Holland,http://sws.geonames.org/2749027,4.91432,52.30659,"Ouder-Amstel, Gemeente Ouder-Amstel, O Amstel ...",1
13,Egmond Binnen,Noord-Holland,http://sws.geonames.org/2756299,4.65556,52.59583,"Egmond-Binnen, Egmont-Binnen, Egmond de Hoeff,...",1
16,Nieuwe Tonge,Zuid-Holland,http://sws.geonames.org/2750233,4.16528,51.715,"Nieuwe-Tonge, Nieuwe tonge, Nieuwetongh, Nieuw...",1
17,Sint Michiels Gestel,Noord-Brabant,http://sws.geonames.org/2747233,5.37851,51.66076,"Sint-Michielsgestel, Gemeente Sint-Michielsges...",1
39,Heinkensand,Zeeland,http://sws.geonames.org/2754503,3.81111,51.4725,"Heinkenszand, Heinkensand, Heijnjesont, Heijnk...",1


In [33]:
# Check some places with multiple results
# For these, manual validation is needed to determine which result is the correct entity
df_query_results[df_query_results.n_results > 1].head()

Unnamed: 0,place_name,province_name,geonames_uri,longitude,latitude,alternative_names,n_results
3657,WInsum,Groningen,http://sws.geonames.org/2744338,5.63359,53.1519,"Winsum, Winssum in Vriesland, wintzum, winsum,...",2
9122,Krimpen aan de lek,Zuid-Holland,http://sws.geonames.org/2752267,4.62917,51.895,"Krimpen aan de Lek, Krimpen op de Lek, Crimpen...",2
13170,Hooren,,http://sws.geonames.org/2753638,5.05972,52.6425,"Hoorn, Hooren, Horen, Horen in West Vrieslant,...",2
14872,Horen,,http://sws.geonames.org/2753638,5.05972,52.6425,"Hoorn, Hooren, Horen, Horen in West Vrieslant,...",2
23972,De Krim,,http://sws.geonames.org/2757396,6.61806,52.64917,"De Krim, Krim, De Kruin",2


In [35]:
# Check number of missing values per attribute (which couldn't get retrieved with ErfGeo)
df_query_results.isna().sum()

place_name               7
province_name        24058
geonames_uri         25441
longitude            25441
latitude             25441
alternative_names    25438
n_results                0
dtype: int64

In [36]:
# Check number of unique results
df_query_results.nunique()

place_name           25585
province_name           11
geonames_uri           389
longitude              377
latitude               381
alternative_names      422
n_results                5
dtype: int64

In [37]:
# Merge the results found with ErfGeo Proxy with the already existing ones from Wikidata 
# Note that some columns only exist in the Wikidata results, so these contain missing values 
# for places resolved with ErfGeo Proxy
# Note: ErfGeo Proxy was only queried for places without a unique Wikidata result 
# (n_results != 1), so for those rows the columns can simply be overwritten with 
# the ErfGeo Proxy results
merged_results = pd.merge(left=df_location, right=df_query_results,
                         on=['place_name', 'province_name'],
                         how='left', suffixes=["_L", "_R"])

In [38]:
# Inspect merged dataframe
merged_results.head()

Unnamed: 0,place_name,province_name,standard_name,wikidata_uri,geonames_uri_L,longitude_L,latitude_L,alternative_names_L,country,country_wikidata_uri,province,province_wikidata_uri,n_results_L,geonames_uri_R,longitude_R,latitude_R,alternative_names_R,n_results_R
0,Dordrecht,,Dordrecht,http://www.wikidata.org/entity/Q26421,http://sws.geonames.org/2756668,4.678333,51.795833,"Dordt, gemeente Dordrecht, Dordrecht (gemeente...",Nederland,http://www.wikidata.org/entity/Q55,Zuid-Holland,http://www.wikidata.org/entity/Q694,1.0,,,,,
1,Wolphaartsdijk,Zeeland,,http://www.wikidata.org/entity/Q1025042,http://sws.geonames.org/2744199,3.8197,51.5297,,Nederland,http://www.wikidata.org/entity/Q55,,,1.0,,,,,
2,Lochem,Gelderland,,http://www.wikidata.org/entity/Q15878783,http://sws.geonames.org/None,,,,,,,,1.0,,,,,
3,'s-Gravenhage,,,,,,,,,,,,0.0,http://sws.geonames.org/2747373,4.29861,52.07667,"The Hague, 's-Gravenhage, Haag, 's Gravenhage,...",1.0
4,Egmond aan Zee,Noord-Holland,,http://www.wikidata.org/entity/Q1616324,http://sws.geonames.org/2756301,,,,Nederland,http://www.wikidata.org/entity/Q55,,,1.0,,,,,


In [39]:
# For the columns that have the same name in Wikidata and ErfGeo Proxy results,
# make sure to merge these into one column where the results from ErfGeo Proxy are
# used in case Wikidata didn't get a unique result (may still be a missing value of course)

# When Wikidata found a unique result, use it for the new columns, else the results from ErfGeo Proxy
merged_results['geonames_uri'] = np.where(merged_results['n_results_L'] == 1, 
                                          merged_results['geonames_uri_L'], 
                                          merged_results['geonames_uri_R'])
merged_results['longitude'] = np.where(merged_results['n_results_L'] == 1, 
                                          merged_results['longitude_L'], 
                                          merged_results['longitude_R'])
merged_results['latitude'] = np.where(merged_results['n_results_L'] == 1, 
                                          merged_results['latitude_L'], 
                                          merged_results['latitude_R'])
merged_results['alternative_names'] = np.where(merged_results['n_results_L'] == 1, 
                                          merged_results['alternative_names_L'], 
                                          merged_results['alternative_names_R'])
merged_results['n_results'] = np.where(merged_results['n_results_L'] == 1, 
                                          merged_results['n_results_L'], 
                                          merged_results['n_results_R'])
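The repeated `np.where` calls share one condition, so the same selection can also be written as a loop over the overlapping columns. The block below is a minimal sketch on a two-row toy dataframe (the two URIs are taken from the outputs in this notebook, while the frame itself is an assumption for illustration). It merges on `place_name` only to stay short.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the Wikidata results (left) and the ErfGeo Proxy results (right)
left = pd.DataFrame({
    "place_name": ["Dordrecht", "'s-Gravenhage"],
    "geonames_uri": ["http://sws.geonames.org/2756668", np.nan],
    "n_results": [1.0, 0.0],
})
right = pd.DataFrame({
    "place_name": ["'s-Gravenhage"],
    "geonames_uri": ["http://sws.geonames.org/2747373"],
    "n_results": [1],
})

merged = pd.merge(left, right, on="place_name", how="left", suffixes=("_L", "_R"))

# Keep the Wikidata value when it was unique, otherwise take the ErfGeo Proxy value
use_left = merged["n_results_L"] == 1
for col in ["geonames_uri", "n_results"]:
    merged[col] = np.where(use_left, merged[col + "_L"], merged[col + "_R"])

merged = merged[["place_name", "geonames_uri", "n_results"]]
print(merged)
```

Computing the condition once and looping over the column names avoids accidentally repeating or omitting one of the assignments.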

In [40]:
# Drop columns that were specifically from left and right dataframe by only keeping the merged
# columns and the ones that were in only one of the two dataframes
merged_results = merged_results.drop(merged_results.columns.difference(list(df_location.columns)),
                                     axis=1)

In [41]:
# Inspect the final merged query results
# 's-Gravenhage is an example of a query that did give results here with ErfGeo Proxy but
# not with Wikidata
merged_results.head()

Unnamed: 0,place_name,province_name,standard_name,wikidata_uri,country,country_wikidata_uri,province,province_wikidata_uri,geonames_uri,longitude,latitude,alternative_names,n_results
0,Dordrecht,,Dordrecht,http://www.wikidata.org/entity/Q26421,Nederland,http://www.wikidata.org/entity/Q55,Zuid-Holland,http://www.wikidata.org/entity/Q694,http://sws.geonames.org/2756668,4.678333,51.795833,"Dordt, gemeente Dordrecht, Dordrecht (gemeente...",1.0
1,Wolphaartsdijk,Zeeland,,http://www.wikidata.org/entity/Q1025042,Nederland,http://www.wikidata.org/entity/Q55,,,http://sws.geonames.org/2744199,3.8197,51.5297,,1.0
2,Lochem,Gelderland,,http://www.wikidata.org/entity/Q15878783,,,,,http://sws.geonames.org/None,,,,1.0
3,'s-Gravenhage,,,,,,,,http://sws.geonames.org/2747373,4.29861,52.07667,"The Hague, 's-Gravenhage, Haag, 's Gravenhage,...",1.0
4,Egmond aan Zee,Noord-Holland,,http://www.wikidata.org/entity/Q1616324,Nederland,http://www.wikidata.org/entity/Q55,,,http://sws.geonames.org/2756301,,,,1.0


In [42]:
# Path to store merged location URI results
merged_path = os.path.join(data_path, "merged_locations.csv")

# Save merged results
merged_results.to_csv(merged_path, header=True, sep=',', index=False)

In [43]:
# Inspect full set of results
# Check occurrence of number of results (i.e. how many results found per unique location name)
merged_results.n_results.value_counts(normalize=False)

0.0    25437
1.0    10189
2.0        5
3.0        1
5.0        1
Name: n_results, dtype: int64

In [44]:
# Check frequency of number of results (i.e. how many results found per unique location name)
merged_results.n_results.value_counts(normalize=True)

0.0    0.713861
1.0    0.285943
2.0    0.000140
3.0    0.000028
5.0    0.000028
Name: n_results, dtype: float64

In [45]:
# Check number of missing values per attribute (which couldn't be retrieved with 
# Wikidata or ErfGeo Proxy)
merged_results.isna().sum()

place_name                   7
province_name            31652
standard_name            34258
wikidata_uri             26153
country                  29913
country_wikidata_uri     29914
province                 32277
province_wikidata_uri    32278
geonames_uri             25441
longitude                29299
latitude                 29299
alternative_names        33202
n_results                    1
dtype: int64

In [46]:
# Check number of unique values
merged_results.nunique()

place_name               33303
province_name               11
standard_name             1171
wikidata_uri              7850
country                    102
country_wikidata_uri       102
province                  1629
province_wikidata_uri     1636
geonames_uri              4033
longitude                 4468
latitude                  4357
alternative_names         1651
n_results                    5
dtype: int64