RMLStreamer

The RMLStreamer generates RDF from files or data streams using RML. The difference with other RML implementations is that it can handle big input files and continuous data streams, like sensor data.

Quick start

If you want to get the RMLStreamer up and running within 5 minutes using Docker, check out docker/README.md

If you want to deploy it yourself, read on.

Installing Flink

RMLStreamer runs its jobs on Flink clusters. More information on how to install Flink and get started can be found in the Flink documentation. At least a local cluster must be running in order to start executing RML Mappings with RMLStreamer. Please note that this version works with Flink 1.9.1 with Scala 2.11 support, which can be downloaded from the Apache Flink archive. Note that the latest release version might require another version of Flink; check the README for that version.
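
For example, a local Flink cluster can be set up roughly as follows (a sketch, assuming the Apache archive download URL; adjust paths as needed):

$ wget https://archive.apache.org/dist/flink/flink-1.9.1/flink-1.9.1-bin-scala_2.11.tgz
$ tar xzf flink-1.9.1-bin-scala_2.11.tgz
$ cd flink-1.9.1
$ ./bin/start-cluster.sh  # the Flink dashboard becomes available at http://localhost:8081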

Building RMLStreamer

In order to build a jar file that can be deployed on a Flink cluster, you need:

  • Java JDK 8 or higher
  • Apache Maven 3 or higher

Clone or download and then build the code in this repository:

$ git clone https://github.com/RMLio/RMLStreamer.git 
$ cd RMLStreamer
$ mvn -DskipTests clean package

The resulting RMLStreamer-<version>.jar can be deployed on a Flink cluster.
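
For instance, the jar can be submitted to a running cluster with the flink CLI; this is a sketch (the Flink path is an assumption), and the program arguments are the ones run.sh, described in the next section, passes along:

$ /opt/flink-1.9.1/bin/flink run target/RMLStreamer-<version>.jar <program arguments>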

Executing RML Mappings

The script run.sh helps run RMLStreamer on a given Flink cluster.

Usage:
run.sh -p RML MAPPING PATH -f FLINK PATH -o FILE OUTPUT PATH [-a PARALLELISM]
run.sh -p RML MAPPING PATH -f FLINK PATH -s SOCKET [-a PARALLELISM] 
run.sh -p RML MAPPING PATH -f FLINK PATH -b KAFKA BROKERS -t KAFKA TOPIC
run.sh -c CONFIG FILE

Every option can be defined in its long form in the CONFIG FILE.
E.g. flinkBin=/opt/flink-1.8.0/bin/flink
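
For example, a hypothetical invocation that streams the output to a Kafka topic (the paths, broker address, and topic name are placeholders):

$ ./run.sh -p /home/rml/mapping.rml.ttl -f /opt/flink-1.9.1/bin/flink -b localhost:9092 -t output-topic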

Options:
-n --job-name                      The name of the Flink job.
-p --path RML MAPPING PATH         The path to an RML mapping file.
-o --outputPath FILE OUTPUT PATH   The path to an output file.
-f --flinkBin FLINK PATH           The path to the Flink binary.
-s --socket                        The port number of the socket.
-b --kafkaBrokerList KAFKA BROKERS The host(s) on which Kafka runs.
-t --kafkaTopic                    The Kafka topic to which the output will be streamed.
-a --parallelism                   The parallelism to assign to the job. The default is 1.
--pp --post-process                The name of the post-processing applied to the generated triples.
                                   Default is: None
                                   Currently supports: "bulk", "json-ld"
--pi --partition-id                The partition id of the Kafka topic to which the output will be written.
                                   Required for "--partition-type fixed"
--pt --partition-type              The type of the partitioner used to partition the output.
                                   Default is: Flink's default partitioner
                                   Currently supports: "fixed", "kafka", "default"
-c --config CONFIG FILE            The path to a configuration file. Every parameter can be put in its long form in the
                                   configuration file, e.g.:
                                    flinkBin=/opt/flink-1.8.0/bin/flink
                                    path=/home/rml/mapping.rml.ttl
                                   Command-line parameters override properties.
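
As an illustration, a hypothetical configuration file combining several long-form options could look like this (all values are placeholders; only flinkBin and path appear in the example above, the other keys follow the long option names):

 flinkBin=/opt/flink-1.8.0/bin/flink
 path=/home/rml/mapping.rml.ttl
 outputPath=/home/rml/output.nt
 parallelism=2

It can then be passed with ./run.sh -c /home/rml/streamer.properties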

TODO: documentation below needs updates.

Examples

Processing a stream

An example of how to define the generation of an RDF stream from a TCP stream in an RML mapping:

 <#TripleMap>

    a rr:TriplesMap;
    rml:logicalSource [
        rml:source [
            rdf:type rmls:TCPSocketStream ;
            rmls:hostName "localhost";
            rmls:type "PULL" ;
            rmls:port "5005"
        ];
        rml:referenceFormulation ql:JSONPath;
    ];

    rr:subjectMap [
        rml:reference "$.id";
        rr:termType rr:IRI;
        rr:class skos:Concept
    ];

    rr:predicateObjectMap [
            rr:predicateMap [
                rr:constant dcterms:title;
                rr:termType rr:IRI
            ];
            rr:objectMap [
                rml:reference "$.id";
                rr:termType rr:Literal
            ]
        ].

The RML Mapping above can be executed as follows:

The input and output of the RML Framework are both TCP clients when streaming. Before running stream mappings, applications must be listening on the input and output ports. For testing purposes, the following commands can be used:

$ nc -lk 5005 # This will start listening for input connections at port 5005
$ nc -lk 9000 # This will start listening for output connections at port 9000
# This is for testing purposes, your own application needs to start listening to the input and output ports. 

Once applications (or the commands above) are listening on the input and output ports, the RML mapping can be executed. The RML Framework opens the input and output sockets so it can act upon data written to the input socket.

bash run.sh -p /home/wmaroy/framework/src/main/resources/json_stream_data_mapping.ttl -s 9000
# The -p parameter sets the mapping file location
# The -s parameter sets the output socket port number
# The -o parameter sets the output path if the output needs to be written to a file instead of a stream.

Whenever data is written to the socket (every data object needs to end with \r\n), it is processed by the RML Framework.
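
As a minimal sketch, a single test object can also be served on the input port by piping it into a one-shot netcat listener (instead of typing into the nc -lk session above); note the explicit \r\n terminator (the JSON object is a made-up example):

$ printf '{"id": "http://example.org/concept/0"}\r\n' | nc -l 5005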

The repository contains Node.js scripts for setting up stream input and output. Their readme can be found in the scripts folder.

Generating a stream from a Kafka Source

An example of how to define the generation of an RDF stream from a Kafka stream in an RML mapping:

 <#TripleMap>

    a rr:TriplesMap;
    rml:logicalSource [
        rml:source [
            rdf:type rmls:KafkaStream ;
            rmls:broker "broker" ;
            rmls:groupId "groupId";
            rmls:topic "topic";
        ];
        rml:referenceFormulation ql:JSONPath;
    ];

Note on using Kafka with Flink: as a consumer, the Flink Kafka client never subscribes to a topic; it gets topic partitions assigned directly (even if you declare it to be in a consumer group with the rmls:groupId predicate). It does nothing with the concept of a "consumer group" except committing offsets, so load is not spread across RMLStreamer jobs running in the same consumer group; instead, each RMLStreamer job is assigned partitions on its own. This has some consequences:

  • When you add multiple RMLStreamer jobs to a consumer group and the topic they listen to has only one partition, only one job will get the input.
  • If the topic has multiple partitions and there are multiple RMLStreamer jobs, two (or more) jobs may be assigned the same partition, resulting in duplicate output.

The only option for spreading load is to use multiple topics, and assign one RMLStreamer job to one topic.
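
To push test data into the topic, Kafka's console producer can be used, for example (the broker address and topic name are placeholders that should match the mapping; the script ships with the Kafka installation):

$ bin/kafka-console-producer.sh --broker-list localhost:9092 --topic topic
> {"id": "http://example.org/concept/0"}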

Generating a stream from a file
<#TripleMap>

    a rr:TriplesMap;
    rml:logicalSource [
        rml:source [
            rdf:type rmls:FileStream;
            rmls:path "/home/wmaroy/github/rml-framework/akka-pipeline/src/main/resources/io/rml/framework/data/books.json"
        ];
        rml:referenceFormulation ql:JSONPath;
        rml:iterator "$.store.books[*]"
    ];

    rr:subjectMap [
        rr:template "{$.id}" ;
        rr:termType rr:IRI;
        rr:class skos:Concept
    ];

    rr:predicateObjectMap [
            rr:predicateMap [
                rr:constant dcterms:title;
                rr:termType rr:IRI
            ];
            rr:objectMap [
                rml:reference "$.id";
                rr:termType rr:Literal
            ]
        ].
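
A hypothetical books.json matching the iterator and references above could be created as follows (the ids are made up):

$ cat > books.json << 'EOF'
{
  "store": {
    "books": [
      { "id": "http://example.org/book/0" },
      { "id": "http://example.org/book/1" }
    ]
  }
}
EOF
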
Generating a stream from a dataset
 <#TripleMap>

    a rr:TriplesMap;
    rml:logicalSource [
        rml:source "/home/wmaroy/github/rml-framework/akka-pipeline/src/main/resources/io/rml/framework/data/books_small.json";
        rml:referenceFormulation ql:JSONPath;
        rml:iterator "$.store.books"
    ];

    rr:subjectMap [
        rml:reference "id";
        rr:termType rr:IRI;
        rr:class skos:Concept
    ];

    rr:predicateObjectMap [
            rr:predicateMap [
                rr:constant dcterms:title;
                rr:termType rr:IRI
            ];
            rr:objectMap [
                rml:reference "id";
                rr:termType rr:Literal
            ]
        ] .
        

RML Stream Vocabulary (non-normative)

Namespace: http://semweb.mmlab.be/ns/rmls#

The RML vocabulary has been extended with rmls to support streaming logical sources. The following classes and terms are currently used:

  • rmls:[stream type]

    • rmls:TCPSocketStream specifies that the logical source will be a TCP socket stream.
    • rmls:FileStream specifies that the logical source will be a file stream.
    • rmls:KafkaStream specifies that the logical source will be a Kafka stream.
  • rmls:hostName specifies the host name of the server from which data will be streamed.

  • rmls:port specifies a port number for the stream mapper to connect to.

  • rmls:type specifies how a streamer will act:

    • "PULL":
      The stream mapper will act as a client.
      It will create a socket and connect to the specified port at the given host name.
      rmls:port and rmls:hostName need to be specified.
    • "PUSH":
      The stream mapper will act as a server and will start listening at the given port.
      If the given port is taken, the mapper will keep opening subsequent ports until a free port is found.
      Only rmls:port needs to be specified here.

Example of a valid JSON logical source map using all possible terms:


rml:logicalSource [
        rml:source [
            rdf:type rmls:TCPSocketStream ;
            rmls:hostName "localhost";
            rmls:type "PULL" ;
            rmls:port "5005"
        ];
        rml:referenceFormulation ql:JSONPath;
    ];
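
For comparison, a minimal sketch of a PUSH-mode logical source; the stream mapper itself listens on the given port, so only rmls:port is required:

rml:logicalSource [
        rml:source [
            rdf:type rmls:TCPSocketStream ;
            rmls:type "PUSH" ;
            rmls:port "5005"
        ];
        rml:referenceFormulation ql:JSONPath;
    ];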