SLIPO-EU/TripleGeo

Welcome to TripleGeo: An open-source tool for transforming geospatial features into RDF triples

TripleGeo is a utility developed by the Information Management Systems Institute at Athena Research Center under the EU/FP7 project GeoKnow: Making the Web an Exploratory for Geospatial Knowledge and the EU/H2020 Innovation Action SLIPO: Scalable Linking and Integration of big POI data. This general-purpose, open-source tool can be used for extracting features from geospatial files and databases and transforming them into RDF triples.

Initial releases of TripleGeo were based on the open-source utility geometry2rdf. Starting from version 1.2, the source code has been completely re-engineered, rewritten, and further enhanced towards scalable performance against big data volumes, as well as advanced support for more input formats and attribute schemata. TripleGeo is written in Java and is still under development; more enhancements will be included in future releases. However, all supported functionality has been tested and works smoothly on both MS Windows and Linux platforms.

Quick start

Installation

  • TripleGeo is a command-line utility and has several dependencies on open-source and third-party, freely redistributable libraries. The pom.xml file contains the project's configuration in Maven.
  • Special note on JDBC drivers for database connections: If you wish to extract data from a geospatially-enabled DBMS (e.g., PostGIS), you must either include the respective .jar (e.g., postgresql-9.4-1206-jdbc4.jar) in the classpath at runtime or specify the respective dependency in the pom.xml and then rebuild the application.
  • Special note on manual installation of a JDBC driver for Oracle DBMS: Due to Oracle license restrictions, there are no public repositories that provide ojdbc7.jar (or any other Oracle JDBC driver) for enabling JDBC connections to an Oracle database. You need to download this jar from Oracle and install it in your local Maven repository using:
    mvn install:install-file -Dfile=/<*YOUR_LOCAL_DIR*>/ojdbc7.jar -DgroupId=com.oracle -DartifactId=ojdbc7 -Dversion=12.1.0.1 -Dpackaging=jar
  • Starting from version 1.3, TripleGeo includes support for custom transformation of thematic attributes according to the RDF Mapping Language (RML). To enable RML conversion mode, you need to install the RML-Mapper.jar specially prepared for TripleGeo execution in your local Maven repository using:
    mvn install:install-file -Dfile=/<*YOUR_LOCAL_DIR*>/RML-Mapper.jar -DgroupId=be.ugent.mmlab.rml -DartifactId=rml-mapper -Dversion=0.3 -Dpackaging=jar
  • Building the application with maven:
    mvn clean package
    results in a triplegeo-2.0-SNAPSHOT.jar under the target directory, as specified in the pom.xml file.

Execution

TripleGeo supports two-way transformation of geospatial features:

  • Transformation of geospatial datasets from various conventional formats into RDF data. TripleGeo supports mappings from the attribute schema of the input dataset into an ontology for RDF features that guides the transformation (i.e., creating RDF properties, constructing URIs, defining links between entities, etc.). Optionally, input features can also be classified into categories, provided that the user specifies a (possibly hierarchical, multi-tier) classification scheme (e.g., possible amenities for Points of Interest, a list of road types for a Road Network).
  • Reverse Transformation of RDF data into de facto geospatial formats (currently, CSV, GeoJSON, and ESRI shapefiles). TripleGeo retrieves data from a graph constructed on-the-fly from the RDF data and creates records with a geometry attribute and thematic attributes reflecting the underlying ontology of the input RDF data.

Since ver. 1.2, TripleGeo supports parallel transformation of multiple datasets having identical schema and the same configuration settings. This is performed by isolated transformation tasks, each running over a separate Java thread, but it requires a distinct file stored on disk for each dataset. Since ver. 1.7, TripleGeo supports on-the-fly partitioning of a single data file (in CSV or ESRI shapefile format) into a number of user-specified partitions and their subsequent transformation to RDF.

Starting from ver. 1.7, TripleGeo also enables distributed transformation of geographical files (currently, CSV, GeoJSON, and ESRI shapefiles) into RDF on top of Apache Spark and its geospatial extension GeoSpark. Configuration settings for such transformations are exactly as in the case of standalone execution over JVM, with extra specifications for the number of worker nodes (i.e., data partitions).

Explanation and usage tips for both transformation modules are given next. The current distribution (ver. 2.0) comes with dummy configuration templates file_options.conf for geographical files (ESRI shapefiles, CSV, GPX, KML, etc.) and dbms_options.conf for database contents (from PostGIS, Oracle Spatial, etc.). These files contain indicative values for the most important properties when accessing data from geographical files or a spatial DBMS. This release also includes a template reverse_options.conf for reconverting RDF data back into geospatial file formats. Brief, self-contained instructions in these templates guide you through the extraction and reverse transformation processes.

Indicative configuration files and mappings for several cases are available here in order to assist you when preparing your own.

In addition, custom classification schemes for OpenStreetMap data are available here and can be readily used with the provided mappings against the sample datasets.

NOTE: All execution commands and configurations refer to the current version (TripleGeo ver. 2.0).

A. Transformation of geospatial datasets to RDF

How to use TripleGeo in order to transform geospatial data into RDF triples:

  • If triples will be extracted from a geographical file (e.g., ESRI shapefiles) as specified in the user-defined configuration file ./test/conf/shp_options.conf, and assuming that binaries are bundled together in ./target/triplegeo-2.0-SNAPSHOT.jar, give a command like this:
    java -cp ./target/triplegeo-2.0-SNAPSHOT.jar eu.slipo.athenarc.triplegeo.Extractor ./test/conf/shp_options.conf
  • If triples will be extracted from a geospatially-enabled DBMS (e.g., PostGIS), the command is essentially the same, but it specifies a suitable configuration file ./test/conf/PostGIS_options.conf with all information required to connect and extract data from the DBMS, as well as runtime linking to the JDBC driver for enabling connections to PostgreSQL (assuming that this JDBC driver is located at ./lib/postgresql-9.4-1206-jdbc4.jar). The classpath separator in this example is the semicolon used on MS Windows; on Linux, use a colon (:) instead:
    java -cp ./lib/postgresql-9.4-1206-jdbc4.jar;./target/triplegeo-2.0-SNAPSHOT.jar eu.slipo.athenarc.triplegeo.Extractor ./test/conf/PostGIS_options.conf
  • TripleGeo supports data in GML (Geography Markup Language) and KML (Keyhole Markup Language). It can also handle INSPIRE-aligned GML data for seven Data Themes (Annex I), as well as INSPIRE-aligned geospatial metadata. Any such transformation is performed via XSLT, as specified in the respective configuration settings (e.g., ./test/conf/KML_options.conf) as follows:
    java -cp ./target/triplegeo-2.0-SNAPSHOT.jar eu.slipo.athenarc.triplegeo.Extractor ./test/conf/KML_options.conf
  • TripleGeo can also run on top of Apache Spark/GeoSpark for selected geographical file formats (currently, CSV, GeoJSON, and ESRI shapefiles). Assuming a user-defined configuration file in ./test/conf/shp_spark_options.conf that also specifies the number of partitions over the input data, transformation can be executed by submitting a Spark job like this:
    spark-submit --class eu.slipo.athenarc.triplegeo.Extractor --master local[*] target/triplegeo-2.0-SNAPSHOT.jar ./test/conf/shp_spark_options.conf
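
The classpath separator in the PostGIS command above is platform-specific: java expects a semicolon on MS Windows and a colon on Linux or macOS. The following is a minimal sketch, not part of TripleGeo, that assembles the classpath for the current platform. The helper names classpath_separator and build_classpath are hypothetical, and the script only prints the resulting command instead of running it.

```shell
# Choose the Java classpath separator for the current platform:
# ';' in Windows shells (Cygwin, MSYS, Git Bash), ':' on Linux and macOS.
classpath_separator() {
  case "$(uname -s)" in
    CYGWIN*|MINGW*|MSYS*) printf ';' ;;
    *)                    printf ':' ;;
  esac
}

# Join all arguments into a single classpath string (hypothetical helper).
build_classpath() {
  sep=$(classpath_separator)
  cp=""
  for entry in "$@"; do
    if [ -z "$cp" ]; then
      cp="$entry"
    else
      cp="$cp$sep$entry"
    fi
  done
  printf '%s\n' "$cp"
}

# Paths are the ones assumed in the examples above.
CP=$(build_classpath ./lib/postgresql-9.4-1206-jdbc4.jar ./target/triplegeo-2.0-SNAPSHOT.jar)

# Print the command; remove the echo to actually launch the extraction.
echo java -cp "\"$CP\"" eu.slipo.athenarc.triplegeo.Extractor ./test/conf/PostGIS_options.conf
```

Quoting the classpath matters in Windows shells, where an unquoted semicolon would otherwise end the command.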

Wait until the process finishes, and verify that the resulting output files match your specifications.
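
One quick way to check the output, assuming N-TRIPLES was chosen as the serialization, is to count the triples, since that format stores one triple per line. The sketch below is an illustration only: the helper name count_ntriples is hypothetical and the two sample triples are invented for the demonstration.

```shell
# count_ntriples FILE
# Print the number of triples in an N-Triples file by counting the lines
# that are neither blank nor comments (N-Triples holds one triple per line).
count_ntriples() {
  grep -cvE '^[[:space:]]*(#|$)' "$1"
}

# Demonstration on a small, invented sample file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
# sample output with two triples
<http://example.org/poi/1> <http://www.w3.org/2000/01/rdf-schema#label> "Cafe" .

<http://example.org/poi/2> <http://www.w3.org/2000/01/rdf-schema#label> "Museum" .
EOF

echo "Triples found: $(count_ntriples "$sample")"
rm -f "$sample"
```

This count does not validate the triples themselves; it only confirms that the output file is not empty and has a plausible size.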

B. Reverse Transformation from RDF to geospatial datasets

How to use TripleGeo in order to transform RDF triples into a geospatial data file:

  • In the configuration file, specify one or multiple files that contain the RDF triples that will be given as input to the reverse transformation process.
  • You must specify a valid SPARQL SELECT query that will be applied against the RDF graph and will fetch the resulting records. The path to the file containing this SPARQL command must be specified in the configuration. It is assumed that the user is aware of the underlying ontology of the RDF graph. If the SPARQL query is not valid, then no results or only partial results may be retrieved. By default, the names of the variables in the SELECT clause will be used as attribute names in the output file.
  • The current release of TripleGeo (ver. 2.0) supports .CSV delimited files, GeoJSON files, and ESRI shapefiles as output formats for reverse transformation.
  • In case of ESRI shapefile as output format, make sure that all input RDF geometries are of the same type (i.e., either points or lines or polygons), because shapefiles can only support a single geometry type in a given file.
  • Once parameters have been specified in a suitable configuration file (e.g., like ./test/conf/shp_reverse.conf), execute the following command to launch the reverse transformation process:
    java -cp ./target/triplegeo-2.0-SNAPSHOT.jar eu.slipo.athenarc.triplegeo.ReverseExtractor ./test/conf/shp_reverse.conf
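
Two of the points above can be sketched in a few lines of shell. This is an illustration under stated assumptions, not TripleGeo's own tooling. The query assumes that the input RDF follows GeoSPARQL (geo:hasGeometry, geo:asWKT) and carries names as rdfs:label, which may not match your ontology. The file name query.sparql and the helper wkt_types are hypothetical, and the geometry check is a text heuristic that only applies to input with WKT literals.

```shell
# Write a sample SELECT query. The variable names ?id, ?name and ?wkt would
# become the attribute names in the output file.
cat > query.sparql <<'EOF'
PREFIX geo:  <http://www.opengis.net/ont/geosparql#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?id ?name ?wkt
WHERE {
  ?id rdfs:label ?name ;
      geo:hasGeometry ?geom .
  ?geom geo:asWKT ?wkt .
}
EOF

# wkt_types FILE
# List the distinct WKT geometry types found in an RDF file (heuristic:
# looks for POINT, LINESTRING, POLYGON and their MULTI variants).
wkt_types() {
  grep -oE '(MULTI)?(POINT|LINESTRING|POLYGON) *\(' "$1" | tr -d ' (' | sort -u
}

# Demonstration on an invented sample that mixes two geometry types.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<http://example.org/g/1> <http://www.opengis.net/ont/geosparql#asWKT> "POINT (23.7 37.9)"^^<http://www.opengis.net/ont/geosparql#wktLiteral> .
<http://example.org/g/2> <http://www.opengis.net/ont/geosparql#asWKT> "POLYGON ((0 0, 1 0, 1 1, 0 0))"^^<http://www.opengis.net/ont/geosparql#wktLiteral> .
EOF

types=$(wkt_types "$sample")
if [ "$(printf '%s\n' "$types" | wc -l)" -gt 1 ]; then
  echo "Mixed geometry types; not suitable for a single shapefile:"
  printf '%s\n' "$types"
else
  echo "Single geometry type: $types"
fi
rm -f "$sample"
```

When the check reports mixed types, one option is to run the reverse transformation separately for each geometry type, or to choose CSV or GeoJSON as the output format.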

Supported Geospatial Formats

The current version of TripleGeo utility can access geometries from:

  • ESRI shapefiles, a widely used file-based format for storing geospatial features.
  • Other widely used geographical file formats, including: GPX (GPS Exchange Format), GeoJSON, as well as OpenStreetMap (OSM) XML and PBF files.
  • De facto data interchange formats with geometries specified as coordinate pairs: CSV (comma separated values), JSON.
  • Geographical data stored in GML (Geography Markup Language) and KML (Keyhole Markup Language).
  • INSPIRE-aligned datasets for seven Data Themes (Annex I) in GML format: Addresses, Administrative Units, Cadastral Parcels, Geographical Names, Hydrography, Protected Sites, and Transport Networks (Roads).
  • Several geospatially-enabled DBMSs, including: Oracle Spatial and Graph, PostGIS extension for PostgreSQL, MySQL, Microsoft SQL Server, IBM DB2 with Spatial Extender, SpatiaLite, and ESRI Personal Geodatabases in Microsoft Access format.

Sample geographic datasets for testing are available in various file formats.

Supported RDF Serializations and Spatial Ontologies

In terms of RDF serializations, triples can be obtained in one of the following formats: RDF/XML (default), RDF/XML-ABBREV, N-TRIPLES, N3, TURTLE (TTL).

Concerning geospatial representations, RDF triples can be exported according to these ontologies:

Resulting triples are written into local files, so that they can be readily imported into a triple store that supports the respective ontology.

Extra utilities

TripleGeo also offers the following three extra utilities:

  • Classification Scheme Validator can be used to verify the consistency and suitability of a classification hierarchy to which the spatial entities refer. TripleGeo supports multi-tier classification hierarchies (e.g., POI categories, subcategories, etc.) specified in YML or CSV files (take a look here for example classifications). This auxiliary utility can be invoked as follows:
    java -cp target/triplegeo-2.0-SNAPSHOT.jar eu.slipo.athenarc.triplegeo.extra.ClassificationSchemeValidator path-to-CSV-or-YML-classification-file boolean-flag output-CSV-or-YML-format
    where:
    • path-to-CSV-or-YML-classification-file specifies the file containing the classification hierarchy (in CSV or YML format);
    • the boolean-flag specifies whether each category is referenced by its identifier in the classification scheme (false) or by the actual name of the category (true);
    • and output-CSV-or-YML-format indicates the format (CSV or YML) that will be used for printing out the reconstructed classification after its validation.
  • RDF Graph Sanity Tester is an auxiliary utility that can be used to verify whether the transformed triples are valid and queryable. First, it loads triples (in any typical serialization) from data file(s) into a disk-based RDF graph and then runs a simple sanity test with a user-specified SELECT query in SPARQL. If successful, it reports the number of triples stored in the graph. This utility can be executed as follows:
    java -cp target/triplegeo-2.0-SNAPSHOT.jar eu.slipo.athenarc.triplegeo.extra.RDFGraphSanityTester path-to-triples-file(s) triple-serialization-format path-to-temporary-dir path-to-SPARQL-query-file
    where:
    • path-to-triples-file(s) is the path to the RDF file(s) that constitute the graph;
    • triple-serialization-format is the serialization of the RDF files (e.g., N-TRIPLES, TTL);
    • path-to-temporary-dir is the path to an existing directory on disk where the RDF graph model will be temporarily created;
    • and path-to-SPARQL-query-file is the path to the file with the SPARQL SELECT command that will be used to query the RDF graph and extract results.
  • Synthetic Data Generator can be used to create a synthetic dataset from a given CSV dataset by translating geometries, modifying names, and randomly erasing attribute values. In this way, a dataset given as seed can be inflated and modified into synthetic spatial data. NOTE: This utility does not apply to RDF data; it handles only CSV files of the kind that TripleGeo accepts for transformation to RDF. This utility can be executed as follows:
    java -cp target/triplegeo-2.0-SNAPSHOT.jar eu.slipo.athenarc.triplegeo.extra.SyntheticDataGenerator path-to-input-CSV-file dX dY suffix
    where:
    • path-to-input-CSV-file is the path to input CSV file containing the spatial entities and their thematic attribute values;
    • dX is the max displacement on the x-axis (longitude) to be applied on each geometry;
    • dY is the max displacement on the y-axis (latitude) to be applied on each geometry;
    • and suffix is a user-defined string that will suffix each generated identifier in the output.
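
To make the roles of dX, dY, and suffix concrete, the following sketch imitates only the geometry translation and identifier suffixing on a toy CSV file. It is not the algorithm used by SyntheticDataGenerator: it assumes a hypothetical comma-delimited layout with a header row and the columns id, name, lon, lat, it draws each displacement uniformly from [-dX, dX] and [-dY, dY], and it leaves names and other attribute values untouched. The function name displace_csv is also hypothetical.

```shell
# displace_csv INPUT DX DY SUFFIX
# Shift each point by a random offset of at most DX (longitude) and DY
# (latitude), and append SUFFIX to each identifier.
# Assumed layout (hypothetical): header row, then id,name,lon,lat.
displace_csv() {
  awk -F',' -v OFS=',' -v dx="$2" -v dy="$3" -v sfx="$4" '
    BEGIN { srand() }                   # seed the random generator
    NR == 1 { print; next }             # keep the header row unchanged
    {
      $1 = $1 sfx                       # suffix the identifier
      $3 = sprintf("%.6f", $3 + (2 * rand() - 1) * dx)
      $4 = sprintf("%.6f", $4 + (2 * rand() - 1) * dy)
      print
    }' "$1"
}

# Demonstration on an invented seed file.
seed=$(mktemp)
printf '%s\n' 'id,name,lon,lat' '1,Cafe,23.7,37.9' '2,Museum,23.8,38.0' > "$seed"
displace_csv "$seed" 0.01 0.01 _syn
rm -f "$seed"
```

Because the offsets are random, each run yields different coordinates; with dX and dY set to 0, the geometries stay in place and only the identifiers change.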

Use Cases

TripleGeo has been used to transform a large variety of geospatial datasets into RDF. Amongst them:

  • Exposing INSPIRE-aligned geospatial data and metadata for Greece as Linked Data through a SPARQL endpoint. This has been the first attempt to build an abstraction layer on top of the INSPIRE infrastructure based on GeoSPARQL concepts, thus making INSPIRE contents accessible and discoverable as linked data.
  • Exposing Points of Interest (POI) as Linked Geospatial Data through this SPARQL endpoint. In this case, POI data extracted from OpenStreetMap across Europe has been transformed into RDF according to a comprehensive and vendor-agnostic OWL ontology for POI data, which enables modeling and representation of multifaceted and enriched POI profiles.

Documentation

All Java classes and data structures developed for TripleGeo are fully documented in this Javadoc.

License

The contents of this project are licensed under the GPL v3 License.