This Python code implements and evaluates the proposed LinkGeoML models for toponym classification-based interlinking.
In this setting, toponym names are the only source of information available for deciding whether two toponyms refer to the same real-world entity. Specifically, we build a meta-similarity function, called LGM-Sim, whose processing steps account for the specificities of toponym names. We then derive training features from LGM-Sim and use them in various classification models. The proposed method and its derived features are robust to variations in the distribution of toponyms and yield a significant increase in interlinking accuracy over baseline models widely used in the literature (see References). Indicatively, we achieve 85.6% accuracy with the Gradient Boosting Trees classifier, whereas the best baseline model reaches 78.6% with Random Forests.
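To make the idea of deriving training features from name similarity more concrete, here is a minimal sketch that uses only the Python standard library. It is not the LGM-Sim implementation. The function names (`normalize`, `similarity`, `pair_features`), the small `FREQUENT_TERMS` set, and the use of `difflib.SequenceMatcher` as the base similarity are all assumptions made for illustration.

```python
import unicodedata
from difflib import SequenceMatcher

# Hypothetical frequent terms, for illustration only; the repository ships
# its own frequent-term files in the data folder.
FREQUENT_TERMS = {"saint", "mount", "lake", "river"}


def normalize(name):
    """Lowercase a toponym, strip diacritics, and replace punctuation with spaces."""
    decomposed = unicodedata.normalize("NFKD", name.lower())
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in no_marks)


def similarity(a, b):
    """Base string similarity in [0, 1] (an illustrative stand-in)."""
    return SequenceMatcher(None, a, b).ratio()


def pair_features(name_a, name_b):
    """Return a small feature vector for one pair of toponym names."""
    tokens_a = normalize(name_a).split()
    tokens_b = normalize(name_b).split()
    # Feature 1: similarity of the normalized names in their original token order.
    plain = similarity(" ".join(tokens_a), " ".join(tokens_b))
    # Feature 2: similarity after sorting tokens, to tolerate word reordering.
    sorted_sim = similarity(" ".join(sorted(tokens_a)), " ".join(sorted(tokens_b)))
    # Feature 3: similarity after dropping frequent terms, falling back to the
    # full token list when nothing would remain.
    core_a = [t for t in tokens_a if t not in FREQUENT_TERMS] or tokens_a
    core_b = [t for t in tokens_b if t not in FREQUENT_TERMS] or tokens_b
    core = similarity(" ".join(core_a), " ".join(core_b))
    return [plain, sorted_sim, core]


features = pair_features("Lake Geneva", "Geneva Lake")
```

Feature vectors of this kind, one per labelled pair of names, could then be passed to a classifier such as scikit-learn's `GradientBoostingClassifier` or `RandomForestClassifier`. The actual features derived from LGM-Sim are defined in the source code and may differ substantially from this sketch.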
The data folder contains the training datasets, which are used to build the classifiers, and files containing frequent terms extracted from the training datasets. For evaluation, we used the dataset from the Toponym-Matching work (see Setup procedure).
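As an illustration of how a list of frequent terms might be extracted from a collection of toponym names, the sketch below counts whitespace-separated tokens with `collections.Counter`. The function name `frequent_terms`, its `top_k` and `min_count` parameters, and the example names are hypothetical; the procedure used to produce the files in the data folder may be different.

```python
from collections import Counter


def frequent_terms(names, top_k=3, min_count=2):
    """Return up to top_k lowercase tokens that occur at least min_count times."""
    counts = Counter(token for name in names for token in name.lower().split())
    return [term for term, n in counts.most_common(top_k) if n >= min_count]


# Made-up example names, not taken from the training datasets.
names = ["Lake Geneva", "Lake Como", "Mount Etna", "Mount Olympus", "Lake Garda"]
print(frequent_terms(names))  # prints ['lake', 'mount']
```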
The source code was tested using Python 2.7, 3.5 and 3.6 and Scikit-Learn 0.20.3 on a Linux server.
Download the latest version from the GitHub repository, change to the main directory and run:
pip install -r pip_requirements.txt
This should install all the required libraries automatically (scikit-learn, numpy, pandas, etc.).
Change to the data folder, download the test dataset and unzip it:
wget https://github.com/ruipds/Toponym-Matching/raw/master/dataset/dataset.zip
wget https://github.com/ruipds/Toponym-Matching/raw/master/dataset/dataset.z01
zip -FF dataset.zip --out dataset.zip.fixed
unzip dataset.zip.fixed
Source code documentation is available from linkgeoml.github.io.
The sim_measures.py file, which is used to generate the train/test datasets and to compute the string similarity measures, is a slightly modified version of the datasetcreator.py file from the Toponym-Matching work, which is released under the MIT license.
- Santos, R., Murrieta-Flores, P. and Martins, B., 2018. Learning to combine multiple string similarity metrics for effective toponym matching. International Journal of Digital Earth, 11(9), pp. 913-938.
LGM-Interlinking is available under the MIT License.