kryo exceptions #33

Closed
antonkulaga opened this Issue Feb 4, 2018 · 14 comments

@antonkulaga

antonkulaga commented Feb 4, 2018

I constantly get out-of-bounds exceptions from Kryo, despite using it as the default serializer.
Here is what I got from opening the Gene Ontology file (I had to convert it to N3 before parsing):

org.apache.spark.SparkException: Job aborted due to stage failure: Exception while getting task result: com.esotericsoftware.kryo.KryoException: java.lang.IndexOutOfBoundsException: Index: 102, Size: 31
Serialization trace:
fTargetNamespace (org.apache.xerces.impl.dv.xs.XSSimpleTypeDecl)
fBase (org.apache.xerces.impl.dv.xs.XSSimpleTypeDecl)
typeDeclaration (org.apache.jena.datatypes.xsd.impl.XSDBaseStringType)
dtype (org.apache.jena.graph.impl.LiteralLabelImpl)
label (org.apache.jena.graph.Node_Literal)
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1517)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1505)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1504)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1504)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
  at scala.Option.foreach(Option.scala:257)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1732)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1687)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1676)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2029)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2050)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2069)
  at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1354)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
  at org.apache.spark.rdd.RDD.take(RDD.scala:1327)
  ... 81 elided
@GezimSejdiu

Member

GezimSejdiu commented Feb 9, 2018

Hi @antonkulaga,
many thanks for posting the issue. In order to reproduce it, could you please send us the code snippet you are using when this exception is thrown?
For example, is it when loading the data:

val triples = spark.rdf(Lang.NTRIPLES)(path)

or somewhere else? And, if possible, could you send us a sample dataset which raises this issue?

Best regards,

@earthquakesan

Member

earthquakesan commented Apr 6, 2018

@GezimSejdiu

Member

GezimSejdiu commented Apr 10, 2018

Thanks, @earthquakesan, for sharing the ontology/dataset so I could try it out.

Hi @antonkulaga, in order to reproduce the issue I downloaded the Go Ontology (go.owl) and ran this code snippet:

import net.sansa_stack.rdf.spark.io._
import org.apache.jena.riot.Lang

val input = "hdfs://namenode:8020/data/go.owl"

val lang = Lang.RDFXML
val triples = spark.rdf(lang)(input)
triples.take(5).foreach(println(_))

and I got this as a result:
[screenshot of the output]
As we can see from the screenshot, SANSA is able to read the RDF/XML format and gives us 1,581,306 triples.

Could you please do the same and let us know if the error persists?

Best regards,

@GezimSejdiu

Member

GezimSejdiu commented Apr 11, 2018

Hi there,
I'm going to close this issue for now, since I was able to read the go.owl file even without transforming it to the N-Triples/N3 serialization.

In case there is still room for discussion related to this issue, please feel free to comment and we can re-open it and investigate further.

Best regards,

@antonkulaga

Author

antonkulaga commented May 4, 2018

With the current RDF/XML code in the latest published version (0.3.0) I constantly get:

import net.sansa_stack.rdf.spark.io._
import net.sansa_stack.rdf.spark.io.rdf._
import org.apache.jena.riot.Lang

val input = az("/go/go.owl")

val lang = Lang.RDFXML
val triples = spark.read.rdf(lang)(input)
triples.take(5).foreach(println(_))

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, 10.0.0.3, executor 0): java.lang.NullPointerException
at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:703)
at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:442)
at org.apache.hadoop.mapreduce.task.JobContextImpl.<init>(JobContextImpl.java:67)
at org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl.<init>(TaskAttemptContextImpl.java:49)
at org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl.<init>(TaskAttemptContextImpl.java:44)
at org.apache.spark.sql.execution.datasources.HadoopFileLinesReader.<init>(HadoopFileLinesReader.scala:44)
at net.sansa_stack.rdf.spark.io.rdfxml.TextInputRdfXmlDataSource$.readFile(RdfXmlDataSource.scala:110)
at net.sansa_stack.rdf.spark.io.rdfxml.RdfXmlFileFormat$$anonfun$buildReader$1.apply(RdfXmlFileFormat.scala:87)
at net.sansa_stack.rdf.spark.io.rdfxml.RdfXmlFileFormat$$anonfun$buildReader$1.apply(RdfXmlFileFormat.scala:85)
at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:136)
at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:120)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:124)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:174)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)

@GezimSejdiu

Member

GezimSejdiu commented May 4, 2018

Hi @antonkulaga,
unfortunately, this is not released yet; it is only implemented in the 0.3.1-SNAPSHOT version of SANSA. Soon we will have our next release, which will be version 0.4.0.

Could you please try the 0.3.1-SNAPSHOT version of SANSA and let us know if the issue still persists? If so, we can re-open this issue.

Many thanks for your feedback.

@antonkulaga

Author

antonkulaga commented May 4, 2018

I do not see 0.3.1-SNAPSHOT on Sonatype, and I also get

Found 149 errors
Found 53 warnings

when I run:

mvn clean install

on the current GitHub version.

@GezimSejdiu

Member

GezimSejdiu commented May 4, 2018

Hi @antonkulaga,

yes, it is not on Sonatype; only stable releases are pushed there :).

But you can get it via our Maven repository; for that, you have to add it as a repository to your build.
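
For illustration, adding it in an sbt build might look roughly like this (a sketch only; the snapshot repository URL and the unified artifact name are the ones that appear elsewhere in this thread, so please double-check them against the SANSA documentation):

// build.sbt -- sketch, not an official snippet
resolvers += "AKSW Snapshots" at "http://maven.aksw.org/repository/snapshots"
libraryDependencies += "net.sansa-stack" %% "sansa-rdf-spark" % "0.3.1-SNAPSHOT"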

Regarding the build of the project, that is strange: Jenkins says that the build is passing, and it passes on my computer as well:

[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] SANSA Stack - RDF Layer - Parent ................... SUCCESS [  2.114 s]
[INFO] sansa-rdf-common_2.11 .............................. SUCCESS [ 16.794 s]
[INFO] SANSA RDF API - Flink .............................. SUCCESS [ 29.097 s]
[INFO] sansa-rdf-spark_2.11 ............................... SUCCESS [ 45.580 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 01:33 min
[INFO] Finished at: 2018-05-04T14:47:50+02:00
[INFO] Final Memory: 104M/1641M
[INFO] ------------------------------------------------------------------------

So maybe you have locally cached dependencies; in that case, you may need to run mvn -U install first.

Let me know if you are still having issues with the current develop branch of SANSA-RDF.

Best regards,

@antonkulaga

Author

antonkulaga commented May 5, 2018

@GezimSejdiu you messed something up when publishing the SNAPSHOT.
Coursier cannot resolve it. Since all modern build systems (Mill, sbt, cbt, etc.) use Coursier, resolution will fail everywhere (except, probably, for projects that use ancient build systems like Maven).
Here is a simple Ammonite script (with the Scala 2.11 version of Ammonite):

import  coursier.MavenRepository 
interp.repositories() ++= Seq(MavenRepository("http://maven.aksw.org/repository/snapshots")) 
import $ivy.`net.sansa-stack::sansa-rdf-spark-core:0.3.1-SNAPSHOT`

Here are the errors that I get:

Failed to resolve ivy dependencies:
  net.sansa-stack:sansa-rdf-partition-core_2.11: 
    not found: /home/antonkulaga/.ivy2/local/net.sansa-stack/sansa-rdf-partition-core_2.11/ivys/ivy.xml
    not found: https://repo1.maven.org/maven2/net/sansa-stack/sansa-rdf-partition-core_2.11//sansa-rdf-partition-core_2.11-.pom
    not found: http://maven.aksw.org/repository/snapshots/net/sansa-stack/sansa-rdf-partition-core_2.11//sansa-rdf-partition-core_2.11-.pom
  net.sansa-stack:sansa-rdf-spark-utils_2.11: 
    not found: /home/antonkulaga/.ivy2/local/net.sansa-stack/sansa-rdf-spark-utils_2.11/ivys/ivy.xml
    not found: https://repo1.maven.org/maven2/net/sansa-stack/sansa-rdf-spark-utils_2.11//sansa-rdf-spark-utils_2.11-.pom
    not found: http://maven.aksw.org/repository/snapshots/net/sansa-stack/sansa-rdf-spark-utils_2.11//sansa-rdf-spark-utils_2.11-.pom
  net.sansa-stack:sansa-rdf-kryo-jena_2.11: 
    not found: /home/antonkulaga/.ivy2/local/net.sansa-stack/sansa-rdf-kryo-jena_2.11/ivys/ivy.xml
    not found: https://repo1.maven.org/maven2/net/sansa-stack/sansa-rdf-kryo-jena_2.11//sansa-rdf-kryo-jena_2.11-.pom
    not found: http://maven.aksw.org/repository/snapshots/net/sansa-stack/sansa-rdf-kryo-jena_2.11//sansa-rdf-kryo-jena_2.11-.pom
  org.apache.commons:commons-compress: 
    not found: /home/antonkulaga/.ivy2/local/org.apache.commons/commons-compress/ivys/ivy.xml
    not found: https://repo1.maven.org/maven2/org/apache/commons/commons-compress//commons-compress-.pom
    not found: http://maven.aksw.org/repository/snapshots/org/apache/commons/commons-compress//commons-compress-.pom

I get pretty much the same errors with Mill and sbt, which means that the bug is not Ammonite-specific.

@GezimSejdiu

Member

GezimSejdiu commented May 5, 2018

Hi @antonkulaga,
it could be because you are using import $ivy.`net.sansa-stack::sansa-rdf-spark-core:0.3.1-SNAPSHOT` instead of import $ivy.`net.sansa-stack::sansa-rdf-spark:0.3.1-SNAPSHOT`.
Check it out: in the new version we did a bit of refactoring, simplifying/unifying all SANSA layers into (common, spark, and flink) modules, so you have to remove -core from the dependency and include the Scala prefix. See this as an example of how to use it.
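
Putting the renaming together with the earlier Ammonite script, the resolving version would presumably look like this (same snapshot repository as above, only the artifact name changed):

import coursier.MavenRepository
interp.repositories() ++= Seq(MavenRepository("http://maven.aksw.org/repository/snapshots"))
import $ivy.`net.sansa-stack::sansa-rdf-spark:0.3.1-SNAPSHOT`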

@antonkulaga

Author

antonkulaga commented May 6, 2018

@GezimSejdiu yes, after changing the artifact name, everything resolved.
However, I still get NullPointerExceptions from:

import net.sansa_stack.rdf.spark.io._
import org.apache.jena.riot.Lang

val input = az("/go/go.owl")

val lang = Lang.RDFXML
val triples = spark.read.rdf(lang)(input)
triples.take(5).foreach(println(_))

Could you tell me what you did to make it work? The exceptions that I get now are:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 (TID 15, 10.0.0.3, executor 0): java.lang.NullPointerException
	at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:703)
	at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:442)
	at org.apache.hadoop.mapreduce.task.JobContextImpl.<init>(JobContextImpl.java:67)
	at org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl.<init>(TaskAttemptContextImpl.java:49)
	at org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl.<init>(TaskAttemptContextImpl.java:44)
	at org.apache.spark.sql.execution.datasources.HadoopFileLinesReader.<init>(HadoopFileLinesReader.scala:44)
	at net.sansa_stack.rdf.spark.io.rdfxml.TextInputRdfXmlDataSource$.readFile(RdfXmlDataSource.scala:110)
	at net.sansa_stack.rdf.spark.io.rdfxml.RdfXmlFileFormat$$anonfun$buildReader$1.apply(RdfXmlFileFormat.scala:87)
	at net.sansa_stack.rdf.spark.io.rdfxml.RdfXmlFileFormat$$anonfun$buildReader$1.apply(RdfXmlFileFormat.scala:85)
	at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:136)
	at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:120)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:124)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:174)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:108)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1517)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1505)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1504)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1504)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
  at scala.Option.foreach(Option.scala:257)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1732)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1687)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1676)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2029)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2050)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2069)
  at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:336)
  at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
  at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:2861)
  at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2150)
  at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2150)
  at org.apache.spark.sql.Dataset$$anonfun$55.apply(Dataset.scala:2842)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2841)
  at org.apache.spark.sql.Dataset.head(Dataset.scala:2150)
  at org.apache.spark.sql.Dataset.take(Dataset.scala:2363)
  ... 77 elided
Caused by: java.lang.NullPointerException
  at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:703)
  at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:442)
  at org.apache.hadoop.mapreduce.task.JobContextImpl.<init>(JobContextImpl.java:67)
  at org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl.<init>(TaskAttemptContextImpl.java:49)
  at org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl.<init>(TaskAttemptContextImpl.java:44)
  at org.apache.spark.sql.execution.datasources.HadoopFileLinesReader.<init>(HadoopFileLinesReader.scala:44)
  at net.sansa_stack.rdf.spark.io.rdfxml.TextInputRdfXmlDataSource$.readFile(RdfXmlDataSource.scala:110)
  at net.sansa_stack.rdf.spark.io.rdfxml.RdfXmlFileFormat$$anonfun$buildReader$1.apply(RdfXmlFileFormat.scala:87)
  at net.sansa_stack.rdf.spark.io.rdfxml.RdfXmlFileFormat$$anonfun$buildReader$1.apply(RdfXmlFileFormat.scala:85)
  at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:136)
  at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:120)
  at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:124)
  at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:174)
  at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
  at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
  at org.apache.spark.scheduler.Task.run(Task.scala:108)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)

@GezimSejdiu GezimSejdiu reopened this May 6, 2018

@GezimSejdiu

Member

GezimSejdiu commented May 6, 2018

You are right, I haven't tested this dataset with the DataFrame reader; I only did it with the RDD one (see my comment above).
I'm not sure if you are using the same dataset, the Go Ontology (go.owl), because I get a different error, which is more related to bad characters in the file:

Caused by: org.apache.jena.riot.RiotException: Premature end of file.

One workaround could be to read it via RDD and then convert it to a DataFrame using our wrappers:

import net.sansa_stack.rdf.spark.io._
import org.apache.jena.riot.Lang
import net.sansa_stack.rdf.spark.model._

val input = "src/test/resoures/go.owl"
val lang = Lang.RDFXML
val triples = spark.rdf(lang)(input).toDF()
triples.take(5).foreach(println(_))

I have just re-opened this issue and will update it when we update the DataFrameReader (but first I will have a look at why I'm getting this error).

@antonkulaga

Author

antonkulaga commented May 8, 2018

@GezimSejdiu we use the same dataset. I just use:
"spark.serializer": "org.apache.spark.serializer.KryoSerializer",
"spark.kryo.registrator": "org.bdgenomics.adam.serialization.ADAMKryoRegistrator",
"spark.kryo.referenceTracking": "true"
in my SparkNotebook, because I do bioinformatics and need faster serialization. So far I have had no problems with serialization of other libraries using these settings.

@GezimSejdiu

Member

GezimSejdiu commented Jun 22, 2018

Hi @antonkulaga,
apologies for the late reply; we were busy with the release process and couldn't continue the discussion here.
Just from experience, maybe the issue arises because your Kryo registrator is not included in the Spark configuration in the notebook.
See this as an example of how we did it for the SANSA-OWL registrator.
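
For illustration, one way to keep the ADAM registrator and also register the SANSA/Jena classes is to pass a comma-separated list in spark.kryo.registrator (supported in Spark 2.x), roughly as in the sketch below. The net.sansa_stack registrator class name is an assumption here, not something confirmed in this thread; check the SANSA documentation for the exact class:

import org.apache.spark.sql.SparkSession

// Sketch only: spark.kryo.registrator accepts a comma-separated list of registrators,
// so the ADAM registrator and a SANSA/Jena registrator can be active at the same time.
val spark = SparkSession.builder()
  .appName("SANSA RDF example")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.kryo.registrator",
    "org.bdgenomics.adam.serialization.ADAMKryoRegistrator," +
    "net.sansa_stack.rdf.spark.io.JenaKryoRegistrator") // hypothetical class name
  .getOrCreate()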

In case you have already done that, please let me know if the problem still persists, so that we can investigate further.

Best regards,

@GezimSejdiu GezimSejdiu added this to the 0.5 milestone Jun 25, 2018

@LorenzBuehmann LorenzBuehmann added the bug label Jul 18, 2018

GezimSejdiu added a commit that referenced this issue Jul 18, 2018
