
adding exception handlers #290

Merged
12 commits merged into AbsaOSS:master on Apr 29, 2022

Conversation

ScaddingJ
Contributor

Had a use case that required skipping bad messages instead of just failing fast. This adds exception handling functionality to ABRiS so that it can either fail fast or return an empty row on failure, exactly as described in issue #183.
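For illustration, here is roughly how the handler is meant to be plugged into a reader config. The AbrisConfig builder chain below is the existing ABRiS 6 API; the withExceptionHandler call and the EmptyExceptionHandler class are what this PR adds (final names may still change), and the topic, schema registry URL and kafkaDf are placeholders.

import org.apache.spark.sql.functions.col
import za.co.absa.abris.avro.errors.EmptyExceptionHandler
import za.co.absa.abris.avro.functions.from_avro
import za.co.absa.abris.config.AbrisConfig

// Build the reader config as usual (topic and registry URL are placeholders),
// then opt out of fail-fast by attaching the new exception handler.
val fromAvroConfig = AbrisConfig
  .fromConfluentAvro
  .downloadReaderSchemaByLatestVersion
  .andTopicNameStrategy("test-topic")
  .usingSchemaRegistry("http://localhost:8081")
  .withExceptionHandler(new EmptyExceptionHandler)  // added by this PR

// kafkaDf is a DataFrame read from Kafka; undeserializable messages now become
// empty rows instead of failing the whole query.
val deserialized = kafkaDf.select(from_avro(col("value"), fromAvroConfig).as("data"))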

However, there is an issue with the EmptyExceptionHandler's log output occurring many times instead of just once per exception. The reason for this is currently unknown, and I was wondering whether there is a known solution?

@ScaddingJ
Contributor Author

Quoting from the previous pull request:

@cerveada > Could it be that it used to stop executing because of the exception, and now it continues and prints the exception for each row?

No, I don't think so: it has been tested by sending a few messages/rows with a single exception, but the exception prints approximately 40 times.

@ScaddingJ
Contributor Author

@cerveada I see you ran the workflows for my PR earlier. I accidentally closed the PR, sorry about that. Would you mind running them again?

@yigalk89

This is cool!
Glad to see this is in the making.
We are also facing an issue with error handling, so I'm glad you came up with a solution.
Let me know if there is anything we can do to help.

@kevinwallimann
Collaborator

I tried out the EmptyExceptionHandler by producing this message

{"value0":"value","value1":[1,2,3]}

to a topic, setting different schemas for that topic, and then consuming from that topic. Concretely, I used https://github.com/AbsaOSS/ABRiS/blob/v6.2.0/src/main/scala/za/co/absa/abris/examples/ConfluentKafkaAvroReader.scala and added
.withExceptionHandler(new EmptyExceptionHandler) on line 60.

The expected result would have been an empty record; however, for certain schemas I got NullPointerExceptions.

For the following schema

  "fields": [
    {
      "name": "bytes",
      "type": "bytes"
    }
  ],
  "name": "somename",
  "namespace": "somenamespace",
  "type": "record"
}

I got this NullPointerException

Exception in thread "main" org.apache.spark.sql.streaming.StreamingQueryException: Writing job aborted
=== Streaming Query ===
Identifier: [id = 93d9f41b-62c9-4048-9a65-face0f93328e, runId = 93fd62f0-c2ea-4030-900b-7f0c010dc5eb]
Current Committed Offsets: {}
Current Available Offsets: {KafkaV2[Subscribe[test-topic]]: {"test-topic":{"0":1}}}

Current State: ACTIVE
Thread State: RUNNABLE

Logical Plan:
WriteToMicroBatchDataSource ConsoleWriter[numRows=20, truncate=false]
+- Project [from_avro(value#8, (readerSchema,{"type":"record","name":"somename","namespace":"somenamespace","fields":[{"name":"bytes","type":"bytes"}]}), (exceptionHandler,za.co.absa.abris.avro.errors.EmptyExceptionHandler@71db45f7)) AS data#21]
   +- StreamingDataSourceV2Relation [key#7, value#8, topic#9, partition#10, offset#11L, timestamp#12, timestampType#13], org.apache.spark.sql.kafka010.KafkaSourceProvider$KafkaScan@236b1a51, KafkaV2[Subscribe[test-topic]]

	at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:325)
	at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:209)
Caused by: org.apache.spark.SparkException: Writing job aborted
	at org.apache.spark.sql.errors.QueryExecutionErrors$.writingJobAbortedError(QueryExecutionErrors.scala:613)
	at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.writeWithV2(WriteToDataSourceV2Exec.scala:386)
	at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.writeWithV2$(WriteToDataSourceV2Exec.scala:330)
	at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.writeWithV2(WriteToDataSourceV2Exec.scala:279)
	at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.run(WriteToDataSourceV2Exec.scala:290)
	at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result$lzycompute(V2CommandExec.scala:43)
	at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result(V2CommandExec.scala:43)
	at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.executeCollect(V2CommandExec.scala:49)
	at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3715)
	at org.apache.spark.sql.Dataset.$anonfun$collect$1(Dataset.scala:2971)
	at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3706)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3704)
	at org.apache.spark.sql.Dataset.collect(Dataset.scala:2971)
	at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runBatch$17(MicroBatchExecution.scala:603)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
	at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runBatch$16(MicroBatchExecution.scala:598)
	at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:375)
	at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:373)
	at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:69)
	at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runBatch(MicroBatchExecution.scala:598)
	at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$2(MicroBatchExecution.scala:228)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:375)
	at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:373)
	at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:69)
	at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$1(MicroBatchExecution.scala:193)
	at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:57)
	at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:187)
	at org.apache.spark.sql.execution.streaming.StreamExecution.$anonfun$runStream$1(StreamExecution.scala:303)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
	at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:286)
	... 1 more
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0) (c02x937gjg5j executor driver): java.lang.NullPointerException
	at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:114)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
	at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.$anonfun$run$1(WriteToDataSourceV2Exec.scala:412)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1496)
	at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2Exec.scala:457)
	at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.$anonfun$writeWithV2$2(WriteToDataSourceV2Exec.scala:358)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2454)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2403)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2402)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2402)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1160)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1160)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1160)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2642)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2584)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2573)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:938)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2214)
	at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.writeWithV2(WriteToDataSourceV2Exec.scala:354)
	... 40 more
Caused by: java.lang.NullPointerException
	at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:114)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
	at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.$anonfun$run$1(WriteToDataSourceV2Exec.scala:412)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1496)
	at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2Exec.scala:457)
	at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.$anonfun$writeWithV2$2(WriteToDataSourceV2Exec.scala:358)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)

Also for this schema, I got the same exception

{
  "fields": [
    {
      "name": "fixed",
      "type": {
        "name": "fixed",
        "size": 40,
        "type": "fixed"
      }
    }
  ],
  "name": "somename",
  "namespace": "somenamespace",
  "type": "record"
}

For this schema

{
  "fields": [
    {
      "name": "array",
      "type": {
        "items": "string",
        "type": "array"
      }
    }
  ],
  "name": "somename",
  "namespace": "somenamespace",
  "type": "record"
}

I'm getting this exception

Logical Plan:
WriteToMicroBatchDataSource ConsoleWriter[numRows=20, truncate=false]
+- Project [from_avro(value#8, (readerSchema,{"type":"record","name":"somename","namespace":"somenamespace","fields":[{"name":"array","type":{"type":"array","items":"string"}}]}), (exceptionHandler,za.co.absa.abris.avro.errors.EmptyExceptionHandler@df1f6ad)) AS data#21]
   +- StreamingDataSourceV2Relation [key#7, value#8, topic#9, partition#10, offset#11L, timestamp#12, timestampType#13], org.apache.spark.sql.kafka010.KafkaSourceProvider$KafkaScan@199afe5, KafkaV2[Subscribe[test-topic]]

	at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:325)
	at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:209)
Caused by: org.apache.spark.SparkException: Writing job aborted
	at org.apache.spark.sql.errors.QueryExecutionErrors$.writingJobAbortedError(QueryExecutionErrors.scala:613)
	at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.writeWithV2(WriteToDataSourceV2Exec.scala:386)
	at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.writeWithV2$(WriteToDataSourceV2Exec.scala:330)
	at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.writeWithV2(WriteToDataSourceV2Exec.scala:279)
	at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.run(WriteToDataSourceV2Exec.scala:290)
	at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result$lzycompute(V2CommandExec.scala:43)
	at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result(V2CommandExec.scala:43)
	at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.executeCollect(V2CommandExec.scala:49)
	at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3715)
	at org.apache.spark.sql.Dataset.$anonfun$collect$1(Dataset.scala:2971)
	at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3706)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3704)
	at org.apache.spark.sql.Dataset.collect(Dataset.scala:2971)
	at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runBatch$17(MicroBatchExecution.scala:603)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
	at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runBatch$16(MicroBatchExecution.scala:598)
	at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:375)
	at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:373)
	at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:69)
	at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runBatch(MicroBatchExecution.scala:598)
	at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$2(MicroBatchExecution.scala:228)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:375)
	at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:373)
	at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:69)
	at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$1(MicroBatchExecution.scala:193)
	at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:57)
	at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:187)
	at org.apache.spark.sql.execution.streaming.StreamExecution.$anonfun$runStream$1(StreamExecution.scala:303)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
	at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:286)
	... 1 more
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0) (c02x937gjg5j executor driver): java.lang.NullPointerException
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
	at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.$anonfun$run$1(WriteToDataSourceV2Exec.scala:412)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1496)
	at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2Exec.scala:457)
	at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.$anonfun$writeWithV2$2(WriteToDataSourceV2Exec.scala:358)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2454)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2403)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2402)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2402)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1160)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1160)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1160)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2642)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2584)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2573)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:938)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2214)
	at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.writeWithV2(WriteToDataSourceV2Exec.scala:354)
	... 40 more
Caused by: java.lang.NullPointerException
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
	at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.$anonfun$run$1(WriteToDataSourceV2Exec.scala:412)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1496)
	at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2Exec.scala:457)
	at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.$anonfun$writeWithV2$2(WriteToDataSourceV2Exec.scala:358)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)

The same exception also occurs for this schema:

{
  "fields": [
    {
      "name": "map",
      "type": {
        "type": "map",
        "values": {
          "items": "long",
          "type": "array"
        }
      }
    }
  ],
  "name": "somename",
  "namespace": "somenamespace",
  "type": "record"
}

@ScaddingJ
Contributor Author

ScaddingJ commented Apr 8, 2022

> This is cool! Glad to see this is in the making. We are also facing an issue with error handling, so I'm glad you came up with a solution. Let me know if there is anything we can do to help.

@yigalk89 Hi, would you happen to know more about Spark stream reading? We are currently using batch reading and do not yet know well what is going on.

@yigalk89

@ScaddingJ
What do you mean by "what is going on"?
Are you asking about the Spark implementation, or for advice on writing a Spark streaming code snippet?

@gintautassulskus-elsevier
Contributor

Providing context regarding the latest commit.

Replaced EmptyDeserializationHandler with SpecificRecordDeserializationHandler. The latter replaces undeserializable records with a preconfigured default SpecificRecord. This provides a decent alternative to the default fail-fast behaviour until we figure out whether replacing an undeserializable record with an empty one is a valid approach.
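Roughly, the intended wiring looks like the sketch below. DefaultRecord stands for a POJO generated by the avro-maven-plugin, its builder field is made up, and the handler's constructor signature is an assumption about how the preconfigured default record is passed in.

import za.co.absa.abris.avro.errors.SpecificRecordDeserializationHandler
import za.co.absa.abris.config.AbrisConfig

// DefaultRecord is a hypothetical class generated from an .avsc file by the
// avro-maven-plugin; it extends SpecificRecordBase, so Spark can serialize it.
val defaultRecord = DefaultRecord.newBuilder()
  .setSomeField("unparsable")  // hypothetical field marking a replaced record
  .build()

val fromAvroConfig = AbrisConfig
  .fromConfluentAvro
  .downloadReaderSchemaByLatestVersion
  .andTopicNameStrategy("test-topic")
  .usingSchemaRegistry("http://localhost:8081")
  // Assumption: the handler receives the record that replaces undeserializable messages.
  .withExceptionHandler(new SpecificRecordDeserializationHandler(defaultRecord))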

A few thoughts.

Schema contract

Avro Schema provides us with strong assumptions about the shape of the data we are working with. For example, we know which fields are optional and which are mandatory and, where provided, what default values to use. Ideally, it should be possible to make the same assumptions about the output Spark DataFrame, given that it inherits the Avro schema via ABRiS.

We designed the EmptyDeserializationHandler to replace undeserializable records with empty rows. If the schema contains mandatory fields, the empty row breaches that contract. An intentional discrepancy between the schema and the data is usually bad practice.
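For illustration, this is the kind of all-optional schema that an empty row would not breach: every field is a union with null and carries a null default (the field names are placeholders).

{
  "fields": [
    {
      "name": "value0",
      "type": ["null", "string"],
      "default": null
    },
    {
      "name": "value1",
      "type": ["null", {"type": "array", "items": "long"}],
      "default": null
    }
  ],
  "name": "somename",
  "namespace": "somenamespace",
  "type": "record"
}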

NullPointerException caused by empty records

I've isolated the case of an empty row causing a NullPointerException in Spark, but I do not know the root cause.

Option 1 - the empty row works fine with an all-optional-field Avro schema: https://github.com/ScaddingJ/ABRiS/blob/6bcb903c31429ac4d19b94da8fe1e93d2c3fb98c/src/test/scala/za/co/absa/abris/avro/sql/AvroDataToCatalystSpec.scala#L169 Perhaps it has to do with the DataFrame schema not matching the values, e.g. a field is set to nullable=false, but the value is null.

Interestingly, Spark's behaviour differs depending on the approach one uses to read the values.

https://github.com/ScaddingJ/ABRiS/blob/6bcb903c31429ac4d19b94da8fe1e93d2c3fb98c/src/test/scala/za/co/absa/abris/avro/sql/AvroDataToCatalystSpec.scala#L177

Attribute serialisation

For the ExceptionHandler to work with Spark, its values must be serializable. Only types extending SpecificRecordBase are serializable; GenericRecords are not. As a result, the SpecificRecordDeserializationHandler is limited to SpecificRecords as replacements for undeserializable records.

The most straightforward way to create a SpecificRecord is via generated POJOs, hence the introduction of the avro-maven-plugin to generate POJOs from the .avsc files.
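As a quick sanity check of the serialisability point, a small sketch using plain Avro classes (nothing ABRiS-specific; it assumes Avro 1.8+, where SpecificRecordBase implements Externalizable):

import org.apache.avro.generic.GenericData
import org.apache.avro.specific.SpecificRecordBase

object SerializabilityCheck extends App {
  // GenericData.Record does not implement java.io.Serializable, so it cannot be
  // captured by the from_avro expression and shipped to executors:
  println(classOf[java.io.Serializable].isAssignableFrom(classOf[GenericData.Record]))  // false
  // Generated POJOs extend SpecificRecordBase, which implements java.io.Externalizable
  // (a subtype of Serializable), so they can be shipped:
  println(classOf[java.io.Serializable].isAssignableFrom(classOf[SpecificRecordBase]))  // true
}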

@kevinwallimann
Collaborator

Thanks @gintautassulskus-elsevier for your commits. I like where this PR is going. Using a SpecificRecordBase as the default record seems to be the right decision. @ScaddingJ Would this solve your use case as well?

@ScaddingJ
Contributor Author

> Thanks @gintautassulskus-elsevier for your commits. I like where this PR is going. Using a SpecificRecordBase as the default record seems to be the right decision. @ScaddingJ Would this solve your use case as well?

Hi @kevinwallimann, yes it does. @gintautassulskus-elsevier and I are collaborating on the same use case so our commits will build on each other.

@kevinwallimann merged commit 9538506 into AbsaOSS:master on Apr 29, 2022
@kevinwallimann
Collaborator

Thank you @ScaddingJ and @gintautassulskus-elsevier for your contribution! 🚀 🚀

@yigalk89

Hi @kevinwallimann
Is there a plan for when a release containing this PR will be published?

@kevinwallimann
Collaborator

Hi @yigalk89 Version 6.3.0 of ABRiS has been released.
