Wrong seqNo is set when reading from eventhubs #462

Closed
yuecong opened this issue Sep 19, 2019 · 13 comments

@yuecong

yuecong commented Sep 19, 2019

Bug Report:

  • Actual behavior
    I am using the azure-event-hubs-spark connector to read data from Event Hubs and write it to an Elasticsearch cluster, and I got the errors below. After getting this error, the streaming job cannot resume: it always fails with the same message, and I have to change the checkpoint folder to mitigate the issue, which causes data loss because the offset is reset to the head of the stream. (A minimal sketch of the reader setup follows this report.)

I got the following error:

ERROR: Query termination received for [id=c599fdad-a1ad-4d21-8177-753b4018425f, runId=59eac3c3-9cfa-4f2e-bd7c-59d2a5142e2c], with exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 12 in stage 0.0 failed 4 times, most recent failure: Lost task 12.3 in stage 0.0 (TID 34, 10.139.64.15, executor 0): java.lang.IllegalStateException: In partition 12 of service-log, with consumer group $Default, request seqNo 52608085 is less than the received seqNo 52608090. The earliest seqNo is 0 and the last seqNo is 54517107
  • Expected behavior
    The offset is set up correctly.

  • Spark version
    2.4.3

  • spark-eventhubs artifactId and version
    com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.13
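
For reference, a minimal sketch of the kind of structured-streaming reader that hits this, assuming the connector's documented ConnectionStringBuilder/EventHubsConf API; the connection string, checkpoint path, and sink are placeholders (the real job writes to Elasticsearch, stood in here by a console sink):

    import org.apache.spark.eventhubs.{ConnectionStringBuilder, EventHubsConf}
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("eventhubs-to-es").getOrCreate()

    // Placeholder connection details; "service-log" is the hub named in the error above.
    val connectionString = ConnectionStringBuilder("<event-hubs-connection-string>")
      .setEventHubName("service-log")
      .build

    val ehConf = EventHubsConf(connectionString)
      .setConsumerGroup("$Default")

    val stream = spark.readStream
      .format("eventhubs")
      .options(ehConf.toMap)
      .load()

    // The error above surfaces when a planned batch is replayed from this checkpoint
    // and the requested sequence numbers no longer match what the service returns.
    val query = stream.writeStream
      .format("console") // stand-in for the Elasticsearch sink
      .option("checkpointLocation", "/tmp/checkpoints/eventhubs-to-es")
      .start()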

@agilityhawk

Any update on this? I have also been seeing a similar error, even after ensuring that a dedicated consumer group is created specifically for the readers.

The Event Hub in question has 2 partitions, and there are 2 worker nodes (Databricks) in the same consumer group trying to read from it.

Any help would be much appreciated.
Let me know if additional information is required.

@satyady

satyady commented Feb 6, 2020

Facing the same issue; any update on it? As mentioned by others, I do have a dedicated consumer group, and the notebook still throws this exception.

@k4jiang

k4jiang commented Mar 6, 2020

I am facing the same issue. This is a pretty critical bug that makes it nearly impossible to reliably productionize any pipeline that uses this connector. Please prioritize a fix or workaround for this.

@nyaghma
Contributor

nyaghma commented Mar 9, 2020

Could you please tell us the version of the connector that you are using? Are you using the latest version (2.3.14.1) and still seeing this issue?

@k4jiang

k4jiang commented Mar 9, 2020

I am currently using 2.3.13. I will update to 2.3.14.1 and see if we still face the issue.

@nyaghma
Contributor

nyaghma commented Mar 9, 2020

Please let us know if you face the error using 2.3.14.1. In that case, please let me know about your setup and send me the logs.

@agilityhawk

When using this setup with Databricks, I noticed that this usually happened when auto-scaling worker nodes scaled up or down, and also during development, where we were sending data to the Event Hub sporadically.

In a proper staging/prod environment, I decided to have a fixed set of workers, one per consumer group/partition combination, and things seem to be stable for now.

For development situations, setting the Event Hub starting position to the current enqueued time (now()) seems to be a stable option; a minimal sketch follows below.
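
Roughly what I mean, as a minimal sketch: this assumes the connector's EventPosition.fromEnqueuedTime API and a SparkSession named spark already in scope (as in a Databricks notebook); the connection string and consumer group are placeholders.

    import java.time.Instant
    import org.apache.spark.eventhubs.{ConnectionStringBuilder, EventHubsConf, EventPosition}

    // Placeholder connection details.
    val connectionString = ConnectionStringBuilder("<event-hubs-connection-string>")
      .setEventHubName("<event-hub-name>")
      .build

    // Start reading from "now". This applies when the query starts without an existing
    // streaming checkpoint, which is why it pairs with clearing the checkpoint as noted below.
    val ehConf = EventHubsConf(connectionString)
      .setConsumerGroup("<dedicated-consumer-group>")
      .setStartingPosition(EventPosition.fromEnqueuedTime(Instant.now))

    val stream = spark.readStream
      .format("eventhubs")
      .options(ehConf.toMap)
      .load()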

Also, in the case of Databricks especially, the checkpoint eventually gets corrupted and needs to be cleared, or the job needs to be restarted for it to recover.

Hope this helps someone.

@nyaghma
Contributor

nyaghma commented Mar 10, 2020

Thanks for sharing the info, @agilityhawk.

@k4jiang

k4jiang commented Mar 10, 2020

@agilityhawk Did you experience this behavior even in the newest version (2.3.14.1)?

@agilityhawk

@k4jiang - We're still using com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.13.
We haven't migrated to the next version yet in our cycle.
Curious to know whether just upgrading to the latest connector works out for you.

@k4jiang

k4jiang commented Mar 13, 2020

Since upgrading to 2.3.14.1 on 2020-03-10, and as of now (2020-03-13), we have not encountered the bug. The bug happened maybe once a week, so let me sit on it for another week before jumping to conclusions.

@nyaghma
Contributor

nyaghma commented Mar 18, 2020

I'm closing the issue. If you see the error again using 2.3.14.1, please reopen.

@nyaghma nyaghma closed this as completed Mar 18, 2020
@k4jiang

k4jiang commented Apr 2, 2020

@nyaghma We are seeing this issue even on 2.3.14.1.
Here is the stack trace of the error:

Job aborted due to stage failure: Task 30 in stage 2348.0 failed 4 times, most recent failure: Lost task 30.3 in stage 2348.0 (TID 4058, 10.139.64.6, executor 0): java.lang.IllegalStateException: In partition 30 of http-access-log, with consumer group $Default, request seqNo 19609525 is less than the received seqNo 19684911. The earliest seqNo is 19684804 and the last seqNo is 20231767
	at org.apache.spark.eventhubs.client.CachedEventHubsReceiver.checkCursor(CachedEventHubsReceiver.scala:189)
	at org.apache.spark.eventhubs.client.CachedEventHubsReceiver.org$apache$spark$eventhubs$client$CachedEventHubsReceiver$$receive(CachedEventHubsReceiver.scala:213)
	at org.apache.spark.eventhubs.client.CachedEventHubsReceiver$.receive(CachedEventHubsReceiver.scala:288)
	at org.apache.spark.eventhubs.rdd.EventHubsRDD.compute(EventHubsRDD.scala:120)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:353)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:317)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:353)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:317)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:353)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:317)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:353)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:317)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:353)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:317)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:353)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:317)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
	at org.apache.spark.scheduler.Task.doRunTask(Task.scala:140)
	at org.apache.spark.scheduler.Task.run(Task.scala:113)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$13.apply(Executor.scala:537)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1541)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:543)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
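
For context, a rough sketch of the invariant this exception reports, reconstructed only from the message above (the sequence number a task requests must not be lower than the first sequence number the partition actually returns); this is illustrative, not the connector's actual CachedEventHubsReceiver.checkCursor code:

    // Hypothetical helper mirroring the error text; names and structure are illustrative.
    def checkRequestedSeqNo(partitionId: Int, name: String, consumerGroup: String,
                            requestSeqNo: Long, receivedSeqNo: Long,
                            earliestSeqNo: Long, lastSeqNo: Long): Unit = {
      if (requestSeqNo < receivedSeqNo) {
        // The batch planned from the checkpoint asks for events starting earlier than
        // what the service is currently returning for this partition, so the task fails.
        throw new IllegalStateException(
          s"In partition $partitionId of $name, with consumer group $consumerGroup, " +
          s"request seqNo $requestSeqNo is less than the received seqNo $receivedSeqNo. " +
          s"The earliest seqNo is $earliestSeqNo and the last seqNo is $lastSeqNo")
      }
    }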
