data
docs
sphinx_docs
src
.gitignore
LICENSE
Pipfile
Pipfile.lock
README.rst
pip_requirements.txt
run.py


LGM-Interlinking

This Python code implements and evaluates the proposed LinkGeoML models for Toponym classification-based interlinking.

In this setting, the names of the toponyms are the only source of information used to decide whether two toponyms refer to the same real-world entity. Specifically, we build a meta-similarity function, called LGM-Sim, whose processing steps take the specificities of toponym names into account. We then derive training features from LGM-Sim and use them in various classification models. The proposed method and its derived features are robust to variations in the distribution of toponyms and yield a significant increase in interlinking accuracy over baseline models widely used in the literature (see References). Indicatively, we achieve 85.6% accuracy with the Gradient Boosting Trees classifier, compared to 78.6% for the best baseline model, which uses Random Forests.
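To make the idea of name-derived features concrete, here is a minimal sketch that turns a pair of toponym names into a small feature vector. It is not LGM-Sim: the function names (`normalize`, `token_jaccard`, `pair_features`) and the three chosen measures are assumptions for illustration, built only on the Python standard library. Vectors like these could then be fed to a classifier such as Gradient Boosting Trees.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase a toponym and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def token_jaccard(a, b):
    """Jaccard overlap between the word sets of two names."""
    set_a, set_b = set(a.split()), set(b.split())
    if not set_a and not set_b:
        return 1.0
    return float(len(set_a & set_b)) / len(set_a | set_b)


def pair_features(name_a, name_b):
    """Return illustrative similarity features for one toponym pair.

    Hypothetical stand-in for LGM-Sim derived features:
      [0] character-level similarity of the normalized names
      [1] the same measure after sorting the words alphabetically,
          which tolerates reordered terms
      [2] Jaccard overlap of the word sets
    """
    a, b = normalize(name_a), normalize(name_b)
    sorted_a = " ".join(sorted(a.split()))
    sorted_b = " ".join(sorted(b.split()))
    return [
        SequenceMatcher(None, a, b).ratio(),
        SequenceMatcher(None, sorted_a, sorted_b).ratio(),
        token_jaccard(a, b),
    ]
```

For a pair such as "Lake Geneva" and "geneva lake", the raw character similarity is low, while the sorted-word and token-overlap features both reach 1.0. This is the kind of toponym-specific variation that a plain string comparison misses.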

The data folder contains the training datasets, which are used to build the classifiers, and files containing frequent terms extracted from those training datasets. For evaluation, we used the dataset from the Toponym-Matching work (see Setup procedure).
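As a rough illustration of what extracting frequent terms from a list of toponym names might look like, the sketch below counts how many names each word appears in. The function name `frequent_terms`, its parameters, and the whitespace tokenization are assumptions; the repository's own extraction procedure and file format may differ.

```python
from collections import Counter


def frequent_terms(names, top_n=10, min_length=2):
    """Return the top_n words that occur in the most names.

    Hypothetical helper: splits on whitespace, lowercases, and ignores
    words shorter than min_length characters.
    """
    counts = Counter()
    for name in names:
        # Count each word once per name so repeats inside a single
        # name do not inflate its frequency.
        terms = set(t for t in name.lower().split() if len(t) >= min_length)
        counts.update(terms)
    return [term for term, _ in counts.most_common(top_n)]
```

Words that appear in many names, such as "church" or "saint", carry little discriminative information, which is one plausible reason to track them separately when comparing names.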

The source code was tested using Python 2.7, 3.5 and 3.6 and Scikit-Learn 0.20.3 on a Linux server.

Setup procedure

Download the latest version from the GitHub repository, change to the main directory and run:

pip install -r pip_requirements.txt

This should install all the required libraries automatically (scikit-learn, numpy, pandas, etc.).

Change to the data folder, download the test dataset and unzip it:

wget https://github.com/ruipds/Toponym-Matching/raw/master/dataset/dataset.zip
wget https://github.com/ruipds/Toponym-Matching/raw/master/dataset/dataset.z01

zip -FF dataset.zip  --out dataset.zip.fixed
unzip dataset.zip.fixed
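After unzipping, it can be useful to look at the first few records before running anything on them. The sketch below assumes the extracted file is delimiter-separated text with one toponym pair per line; the helper name `preview_rows`, the tab delimiter, and the path in the usage comment are assumptions, so check the actual file name and layout in the extracted archive.

```python
import csv
from itertools import islice


def preview_rows(lines, n=5, delimiter="\t"):
    """Return the first n records from an iterable of text lines.

    QUOTE_NONE keeps quote characters inside toponym names as-is
    instead of treating them as field quoting.
    """
    reader = csv.reader(lines, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    return list(islice(reader, n))


# Usage (hypothetical path; replace with the extracted file's name):
# with open("dataset.txt", encoding="utf-8", newline="") as handle:
#     for row in preview_rows(handle):
#         print(row)
```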

Documentation

Source code documentation is available from linkgeoml.github.io.

Acknowledgements

The sim_measures.py file, which is used to generate the train/test datasets and to compute the string similarity measures, is a slightly modified version of the datasetcreator.py file from the Toponym-Matching work, which is released under the MIT license.

References

  • Santos, R., Murrieta-Flores, P. and Martins, B., 2018. Learning to combine multiple string similarity metrics for effective toponym matching. International Journal of Digital Earth, 11(9), pp. 913-938.

License

LGM-Interlinking is available under the MIT License.
