SANSA RDF

Description

SANSA RDF is a library to read RDF files into Spark or Flink. It allows files to reside in HDFS as well as in a local file system and distributes them across Spark RDDs/Datasets or Flink DataSets.

SANSA uses the RDF data model to represent graphs as sets of triples, each consisting of a subject, a predicate and an object. An RDF dataset may contain multiple RDF graphs and record information about each graph, which allows the upper layers of SANSA (Querying and ML) to run queries that involve information from more than one graph. Rather than working with an RDF dataset directly, SANSA first converts it into an RDD/DataSet of triples, which we call the main dataset. The main dataset is built on the RDD/DataSet data structure, the basic building block of the Spark/Flink framework. RDDs/DataSets are in-memory collections of records that can be processed in parallel on large clusters.

Usage

The following Scala code shows how to read an RDF file in N-Triples syntax (be it a local file or a file residing in HDFS) into a Spark RDD:

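import net.sansa_stack.rdf.spark.io.rdf._
import org.apache.jena.riot.Lang
import org.apache.spark.sql.SparkSession

val spark: SparkSession = ...

// `path` is the location of the N-Triples file (local or in HDFS)
val lang = Lang.NTRIPLES
val triples = spark.rdf(lang)(path)

// print the first five triples
triples.take(5).foreach(println(_))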

N-Triples loading options

NTripleReader.load(...)

Loads N-Triples data from a file or directory into an RDD. The path can also contain multiple paths and even wildcards, e.g. "/my/dir1,/my/paths/part-00[0-5],/another/dir,/a/specific/file"
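To make the shape of such a path argument concrete, here is a minimal sketch that uses only the JVM standard library. It is not SANSA code: `PathPatternSketch`, `patterns` and `matches` are hypothetical names, and java.nio glob matching is used only as an approximation of the wildcard handling, which in practice is presumably done by the underlying Spark/Hadoop input layer.

```scala
import java.nio.file.{FileSystems, Paths}

object PathPatternSketch {
  // Split the comma-separated argument into its individual path patterns.
  def patterns(paths: String): List[String] =
    paths.split(",").map(_.trim).filter(_.nonEmpty).toList

  // Test a single file name against a glob pattern.
  // Assumption: java.nio glob syntax approximates the wildcard syntax used when loading.
  def matches(glob: String, fileName: String): Boolean =
    FileSystems.getDefault.getPathMatcher("glob:" + glob).matches(Paths.get(fileName))
}
```

For the example string above, `PathPatternSketch.patterns` yields four entries, and the bracket expression in `part-00[0-5]` selects `part-000` through `part-005` only.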

Handling of errors

By default, loading stops as soon as a parse error occurs, i.e. the underlying parser throws an org.apache.jena.riot.RiotException.

The following options exist:

STOP: the whole data loading process is stopped and an org.apache.jena.riot.RiotException is thrown
SKIP: the line is skipped and the data loading process continues; an error message is logged
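The difference between the two modes can be illustrated with a small, self-contained Scala sketch. It does not use the SANSA or Jena API: `ErrorModeSketch`, `ErrorMode`, `parseLine` and `load` are hypothetical names, the toy parser only checks that a line has three whitespace-separated terms followed by a dot, and a plain `IllegalArgumentException` stands in for the parser exception.

```scala
object ErrorModeSketch {
  // Hypothetical stand-ins for the two error handling modes.
  sealed trait ErrorMode
  case object Stop extends ErrorMode
  case object Skip extends ErrorMode

  final case class Triple(s: String, p: String, o: String)

  // Toy parser, NOT a real N-Triples parser: expects "<s> <p> <o> ."
  def parseLine(line: String): Either[String, Triple] = {
    val trimmed = line.trim
    if (!trimmed.endsWith(".")) Left(s"missing terminating dot: $line")
    else trimmed.dropRight(1).trim.split("\\s+", 3) match {
      case Array(s, p, o) => Right(Triple(s, p, o))
      case _              => Left(s"expected three terms: $line")
    }
  }

  // Stop aborts on the first bad line; Skip logs the problem and drops the line.
  def load(lines: List[String], mode: ErrorMode): List[Triple] =
    lines.flatMap { line =>
      parseLine(line) match {
        case Right(triple) => Some(triple)
        case Left(msg) =>
          mode match {
            case Stop => throw new IllegalArgumentException(msg)
            case Skip =>
              System.err.println(s"skipping line: $msg")
              None
          }
      }
    }
}
```

With one malformed line in the input, `load(lines, ErrorModeSketch.Skip)` returns the triples from the remaining lines, whereas `load(lines, ErrorModeSketch.Stop)` throws.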

Handling of warnings

If the additional checking of RDF terms is enabled, warnings can occur during parsing. For example, a literal whose lexical form is wrong with respect to its datatype leads to a warning.

Warnings can be handled in one of the following ways:

IGNORE: the warning is just logged to the configured logger
STOP: as in the error handling mode, the whole data loading process is stopped and an org.apache.jena.riot.RiotException is thrown
SKIP: as in the error handling mode, the line is skipped and the data loading process continues. In addition, an error message is logged
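As a rough illustration of the three modes, the following self-contained Scala sketch applies them to lines that have already been parsed and annotated with warnings. All names (`WarningModeSketch`, `WarningMode`, `Parsed`, `handle`) are hypothetical and not part of SANSA or Jena, and an `IllegalStateException` stands in for the parser exception.

```scala
object WarningModeSketch {
  // Hypothetical stand-ins for the three warning handling modes.
  sealed trait WarningMode
  case object Ignore extends WarningMode
  case object Stop extends WarningMode
  case object Skip extends WarningMode

  // A parsed line: the statement plus any warnings raised while checking its terms.
  final case class Parsed(statement: String, warnings: List[String])

  // Returns the statements that survive under the given mode.
  def handle(parsed: List[Parsed], mode: WarningMode): List[String] =
    parsed.flatMap { p =>
      if (p.warnings.isEmpty) Some(p.statement)
      else mode match {
        case Ignore =>
          // Log the warnings but keep the statement.
          p.warnings.foreach(w => System.err.println(s"WARN: $w"))
          Some(p.statement)
        case Stop =>
          // Abort the whole load.
          throw new IllegalStateException(p.warnings.mkString("; "))
        case Skip =>
          // Log and drop only this line.
          System.err.println(s"skipping line: ${p.warnings.mkString("; ")}")
          None
      }
    }
}
```

In this sketch, a line with a warning is kept under `Ignore`, dropped under `Skip`, and aborts the load under `Stop`.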

Checking of RDF terms

Sets whether to perform checking of N-Triples terms; defaults to no checking.

Checking adds warnings over and above basic syntax errors. This can also be used to turn warnings into exceptions if the option stopOnWarnings is set to STOP or SKIP.

IRIs: whether IRIs conform to all the rules of the IRI scheme.
Literals: whether the lexical form conforms to the rules for the datatype.
Triples: whether each slot holds a valid kind of RDF term (parsers usually make this a syntax error anyway).
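To show what a literal check amounts to, here is a simplified Scala sketch that validates the lexical form for two XSD datatypes only. It is an illustration under stated assumptions, not the checking performed by the underlying parser: `TermCheckSketch` and `literalWarning` are hypothetical names, and datatypes not listed in the map are simply left unchecked.

```scala
import scala.util.matching.Regex

object TermCheckSketch {
  private val Xsd = "http://www.w3.org/2001/XMLSchema#"

  // Lexical-form patterns for two XSD datatypes; everything else is not checked here.
  private val lexicalForms: Map[String, Regex] = Map(
    (Xsd + "integer") -> "[+-]?[0-9]+".r,
    (Xsd + "boolean") -> "true|false|1|0".r
  )

  // Returns a warning message if the lexical form does not fit the datatype, else None.
  def literalWarning(lexical: String, datatype: String): Option[String] =
    lexicalForms.get(datatype).flatMap { form =>
      if (form.pattern.matcher(lexical).matches()) None
      else Some(s"Lexical form '$lexical' not valid for datatype <$datatype>")
    }
}
```

For instance, `literalWarning("abc", "http://www.w3.org/2001/XMLSchema#integer")` produces a warning, while `"42"` with the same datatype does not.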

See also the optional errorLog argument to control the output. The default is to log.

An overview is given in the FAQ section of the SANSA project page. Further documentation about the builder objects can also be found on the ScalaDoc page.

How to Contribute

We always welcome new contributors to the project! Please see our contribution guide for more details on how to get started contributing to SANSA.
