Commit 530cf3b
misc
LorenzBuehmann committed Nov 30, 2020
1 parent: 2dd38fd
Showing 4 changed files with 28 additions and 4 deletions.
@@ -41,6 +41,7 @@ object OntopBasedSPARQLEngine {
", ",
"net.sansa_stack.rdf.spark.io.JenaKryoRegistrator",
"net.sansa_stack.query.spark.sparqlify.KryoRegistratorSparqlify"))
.config("spark.sql.crossJoin.enabled", true)
.getOrCreate()

// load the data into an RDD
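A note on the added `spark.sql.crossJoin.enabled` setting: Spark SQL (before 3.0) rejects implicit cartesian products by default, and the SQL that Ontop generates from rewritten SPARQL queries can presumably contain such cross joins, so without this flag query execution would fail with an `AnalysisException`.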
sansa-rdf/README.md: 13 additions & 0 deletions
@@ -13,6 +13,7 @@ SANSA uses the RDF data model for representing graphs consisting of triples with

## Usage

### Load as RDD
We suggest importing the `net.sansa_stack.rdf.spark.io` package, which adds the function `rdf()` to a Spark session. You can either explicitly specify the type of RDF serialization or let the API guess the format based on the file extension.

For example, the following Scala code shows how to read an RDF file in N-Triples syntax (be it a local file or a file residing in HDFS) into a Spark RDD:
@@ -28,6 +29,18 @@
val triples = spark.rdf(lang)(path)
triples.take(5).foreach(println(_))
```

### Load as DataFrame
Analogously, the same import adds an `rdf()` function to Spark's `DataFrameReader`, so triples can also be loaded into a DataFrame:

```scala
import net.sansa_stack.rdf.spark.io._
import org.apache.jena.riot.Lang

val spark: SparkSession = ...

val lang = Lang.NTRIPLES
val triples = spark.read.rdf(lang)(path)

triples.take(5).foreach(println(_))
```
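The commit's test suite additionally converts such a DataFrame into a typed Dataset. A minimal sketch following that test code; the `toDS()` conversion is assumed to come from the `net.sansa_stack.rdf.spark.model._` implicits imported there:

```scala
// Sketch mirroring RDFLoadingTests from this commit: turn the DataFrame
// of triples into a typed Dataset. toDS() is assumed to be provided by
// the net.sansa_stack.rdf.spark.model implicits.
import net.sansa_stack.rdf.spark.model._

val triplesDS = triples.toDS()
triplesDS.show()
```
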
## Input
We support reading most (if not all) of the common RDF formats, with Apache Jena as our core parser backend. Note that some formats can be read from distributed data, i.e. multiple file splits can be processed in parallel, which ideally results in much higher loading performance. This holds especially for line-based formats like N-Triples and N-Quads, but we also provide an (experimental) TriG parser that works on file splits distributed among the cluster nodes.
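To illustrate the parallelism claim, a small sketch (not part of this commit; the HDFS path is hypothetical) that checks how many partitions, and thus parallel parse tasks, a line-based dump yields:

```scala
// Sketch: line-based formats such as N-Triples are splittable, so the
// resulting RDD is parsed in one task per file split. Path is hypothetical.
import net.sansa_stack.rdf.spark.io._
import org.apache.jena.riot.Lang

val triples = spark.rdf(Lang.NTRIPLES)("hdfs://namenode/data/dump.nt")
println(s"parsed in ${triples.getNumPartitions} parallel partitions")
```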
@@ -6,6 +6,7 @@ import java.nio.file.{Files, Path}
import java.util.zip.ZipInputStream

import scala.collection.JavaConverters._

import com.holdenkarau.spark.testing.DataFrameSuiteBase
import org.apache.jena.graph.GraphUtil
import org.apache.jena.rdf.model.{ModelFactory, ResourceFactory}
@@ -16,10 +17,10 @@ import org.apache.jena.sparql.serializer.SerializationContext
import org.apache.jena.sparql.util.FmtUtils
import org.apache.jena.vocabulary.RDF
import org.scalatest.FunSuite
import org.scalatest.tags._


import net.sansa_stack.rdf.spark.io._
import net.sansa_stack.rdf.spark.model._
import net.sansa_stack.rdf.spark.utils.tags.ConformanceTestSuite

/**
* Tests for loading triples from either N-Triples or Turtle files into a DataFrame.
@@ -70,6 +71,7 @@ class RDFLoadingTests
graph1.find().asScala.foreach(println)

val triplesDF = spark.read.rdf(lang)(path)
triplesDF.show(30, false)
val triplesDS = triplesDF.toDS()
triplesDS.show()
val triples = triplesDS.collect()
@@ -124,8 +126,8 @@
}
}
}

test("RDF 1.1 Turtle test suites must be parsed correctly") {
import org.scalatest.tagobjects.Slow
test("RDF 1.1 Turtle test suites must be parsed correctly", ConformanceTestSuite, Slow) {

// load test suite from URL
val url = new URL("https://www.w3.org/2013/TurtleTests/TESTS.zip")
@@ -0,0 +1,8 @@
package net.sansa_stack.rdf.spark.utils.tags

import org.scalatest.Tag

/**
* @author Lorenz Buehmann
*/
object ConformanceTestSuite extends Tag("net.sansa_stack.tags.ConformanceTestSuite")
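
With this tag object in place, the tagged conformance tests can be filtered at the test-runner level. A minimal `build.sbt` sketch (not part of this commit) that excludes them from the default `test` run via ScalaTest's standard `-l` (exclude-tag) argument:

```scala
// build.sbt sketch (assumption, not from this commit): exclude tests
// tagged as ConformanceTestSuite from the regular `sbt test` run.
Test / testOptions += Tests.Argument(
  TestFrameworks.ScalaTest,
  "-l", "net.sansa_stack.tags.ConformanceTestSuite"
)
```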
