Wrong seqNo is set when reading from eventhubs #462

Closed
yuecong opened this issue Sep 19, 2019 · 13 comments

@yuecong

yuecong commented Sep 19, 2019

Bug Report:

  • Actual behavior
    I am using the azure-event-hubs-spark connector to read data from Event Hubs and write it to an Elasticsearch cluster, and I got the errors below. After getting this error, the streaming job cannot resume: it always fails with the same message, and I have to change the checkpoint folder to mitigate the issue, which causes data loss because the offset is reset to the head of the stream. (A minimal sketch of the reader setup follows this report.)

I got the following error:

ERROR: Query termination received for [id=c599fdad-a1ad-4d21-8177-753b4018425f, runId=59eac3c3-9cfa-4f2e-bd7c-59d2a5142e2c], with exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 12 in stage 0.0 failed 4 times, most recent failure: Lost task 12.3 in stage 0.0 (TID 34, 10.139.64.15, executor 0): java.lang.IllegalStateException: In partition 12 of service-log, with consumer group $Default, request seqNo 52608085 is less than the received seqNo 52608090. The earliest seqNo is 0 and the last seqNo is 54517107
  • Expected behavior
    The offset is set up correctly.

  • Spark version
    2.4.3

  • spark-eventhubs artifactId and version
    com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.13
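
For reference, a minimal sketch of the kind of structured-streaming reader that hits this, assuming the connector's documented ConnectionStringBuilder/EventHubsConf API; the connection string, checkpoint path, and sink are placeholders (the real job writes to Elasticsearch, stood in here by a console sink):

    import org.apache.spark.eventhubs.{ConnectionStringBuilder, EventHubsConf}
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("eventhubs-to-es").getOrCreate()

    // Placeholder connection details; "service-log" is the hub named in the error above.
    val connectionString = ConnectionStringBuilder("<event-hubs-connection-string>")
      .setEventHubName("service-log")
      .build

    val ehConf = EventHubsConf(connectionString)
      .setConsumerGroup("$Default")

    val stream = spark.readStream
      .format("eventhubs")
      .options(ehConf.toMap)
      .load()

    // The error above surfaces when a planned batch is replayed from this checkpoint
    // and the requested sequence numbers no longer match what the service returns.
    val query = stream.writeStream
      .format("console") // stand-in for the Elasticsearch sink
      .option("checkpointLocation", "/tmp/checkpoints/eventhubs-to-es")
      .start()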

@agilityhawk

Any update on this? I have also been seeing a similar error, even after ensuring that a dedicated consumer group is created specifically for the readers.

The Event Hub in question has 2 partitions, and there are 2 worker nodes (Databricks) in the same consumer group trying to read from it.

Any help would be much appreciated.
Let me know if additional information is required.

@satyady

satyady commented Feb 6, 2020

Facing the same issue; any update on it? As mentioned by others, I do have a dedicated consumer group, and the notebook still throws this exception.

@k4jiang

k4jiang commented Mar 6, 2020

I am facing the same issue. This is a pretty critical bug that makes it nearly impossible to reliably productionize any pipeline that uses this connector. Please prioritize a fix or workaround for this.

@nyaghma
Contributor

nyaghma commented Mar 9, 2020

Could you please tell us the version of the connector that you are using? Are you using the latest version (2.3.14.1) and still seeing this issue?

@k4jiang

k4jiang commented Mar 9, 2020

I am currently using 2.3.13. I will update to 2.3.14.1 and see if we still face the issue.

@nyaghma
Contributor

nyaghma commented Mar 9, 2020

Please let us know if you face the error using 2.3.14.1. In that case, please let me know about your setup and send me the logs.

@agilityhawk

When using this setup with Databricks, I noticed that this usually happened when auto-scaling worker nodes scaled up or down, and also during development, where we were sending data to the Event Hub sporadically.

In a proper staging/prod environment, I decided to have a fixed set of workers, one per consumer group/partition combination, and things seem to be stable for now.

For development situations, setting the Event Hub starting position to the current enqueued time (now()) seems to be a stable option; a minimal sketch follows below.
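
Roughly what I mean, as a minimal sketch: this assumes the connector's EventPosition.fromEnqueuedTime API and a SparkSession named spark already in scope (as in a Databricks notebook); the connection string and consumer group are placeholders.

    import java.time.Instant
    import org.apache.spark.eventhubs.{ConnectionStringBuilder, EventHubsConf, EventPosition}

    // Placeholder connection details.
    val connectionString = ConnectionStringBuilder("<event-hubs-connection-string>")
      .setEventHubName("<event-hub-name>")
      .build

    // Start reading from "now". This applies when the query starts without an existing
    // streaming checkpoint, which is why it pairs with clearing the checkpoint as noted below.
    val ehConf = EventHubsConf(connectionString)
      .setConsumerGroup("<dedicated-consumer-group>")
      .setStartingPosition(EventPosition.fromEnqueuedTime(Instant.now))

    val stream = spark.readStream
      .format("eventhubs")
      .options(ehConf.toMap)
      .load()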

Also, in the case of Databricks especially, the checkpoint eventually gets corrupted and needs to be cleared, or the job needs to be restarted for it to recover.

Hope this helps someone.

@nyaghma
Contributor

nyaghma commented Mar 10, 2020

Thanks for sharing the info, @agilityhawk.

@k4jiang

k4jiang commented Mar 10, 2020

@agilityhawk Did you experience this behavior even in the newest version (2.3.14.1)?

@agilityhawk

@k4jiang - We're still using com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.13.
We haven't migrated to the next version yet in our cycle.
Curious to know whether just upgrading to the latest connector works out for you.

@k4jiang

k4jiang commented Mar 13, 2020

Since upgrading to 2.3.14.1 on 2020-03-10, and as of now (2020-03-13), we have not encountered the bug. The bug happened maybe once a week, so let me sit on it for another week before jumping to conclusions.

@nyaghma
Contributor

nyaghma commented Mar 18, 2020

I'm closing the issue. If you see the error again using 2.3.14.1, please reopen.

@nyaghma nyaghma closed this as completed Mar 18, 2020
@k4jiang

k4jiang commented Apr 2, 2020

@nyaghma We are seeing this issue even on 2.3.14.1.
Here is the stack trace of the error:

Job aborted due to stage failure: Task 30 in stage 2348.0 failed 4 times, most recent failure: Lost task 30.3 in stage 2348.0 (TID 4058, 10.139.64.6, executor 0): java.lang.IllegalStateException: In partition 30 of http-access-log, with consumer group $Default, request seqNo 19609525 is less than the received seqNo 19684911. The earliest seqNo is 19684804 and the last seqNo is 20231767
	at org.apache.spark.eventhubs.client.CachedEventHubsReceiver.checkCursor(CachedEventHubsReceiver.scala:189)
	at org.apache.spark.eventhubs.client.CachedEventHubsReceiver.org$apache$spark$eventhubs$client$CachedEventHubsReceiver$$receive(CachedEventHubsReceiver.scala:213)
	at org.apache.spark.eventhubs.client.CachedEventHubsReceiver$.receive(CachedEventHubsReceiver.scala:288)
	at org.apache.spark.eventhubs.rdd.EventHubsRDD.compute(EventHubsRDD.scala:120)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:353)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:317)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:353)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:317)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:353)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:317)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:353)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:317)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:353)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:317)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:353)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:317)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
	at org.apache.spark.scheduler.Task.doRunTask(Task.scala:140)
	at org.apache.spark.scheduler.Task.run(Task.scala:113)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$13.apply(Executor.scala:537)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1541)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:543)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
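
For context, a rough sketch of the invariant this exception reports, reconstructed only from the message above (the sequence number a task requests must not be lower than the first sequence number the partition actually returns); this is illustrative, not the connector's actual CachedEventHubsReceiver.checkCursor code:

    // Hypothetical helper mirroring the error text; names and structure are illustrative.
    def checkRequestedSeqNo(partitionId: Int, name: String, consumerGroup: String,
                            requestSeqNo: Long, receivedSeqNo: Long,
                            earliestSeqNo: Long, lastSeqNo: Long): Unit = {
      if (requestSeqNo < receivedSeqNo) {
        // The batch planned from the checkpoint asks for events starting earlier than
        // what the service is currently returning for this partition, so the task fails.
        throw new IllegalStateException(
          s"In partition $partitionId of $name, with consumer group $consumerGroup, " +
          s"request seqNo $requestSeqNo is less than the received seqNo $receivedSeqNo. " +
          s"The earliest seqNo is $earliestSeqNo and the last seqNo is $lastSeqNo")
      }
    }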
