Spark streaming schema evolution + ABRiS #176

Closed
grantatspothero opened this issue Dec 21, 2020 · 3 comments

@grantatspothero

The thread linked below discusses a change to ABRiS that avoids hitting the schema registry repeatedly during Spark execution by fetching the latest schema only once, on the driver:
#105 (comment)

However, doesn't this mean that if a new version of an event schema is registered in the schema registry after the Spark streaming job has started, the job will not pick up the new version until it is restarted?

The above thread discussed potential workarounds, one of them being a parameter to tune how often the schema registry should be polled for the latest schema version: #105 (comment)

Is this a reasonable thing to ask for? It would be useful, since long-running Spark streaming jobs could then dynamically ingest new versions of events as they are produced.
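
For illustration, the polling-interval idea could be sketched roughly as below. This is not ABRiS code: `PolledSchemaCache`, `fetch_latest`, and `poll_interval_s` are hypothetical names, and `fetch_latest` stands in for whatever call retrieves the latest schema version from the registry. The sketch only limits how often the registry is contacted; it does not by itself change the schema of a running query.

```python
import time
from typing import Callable, Optional, Tuple


class PolledSchemaCache:
    """Caches the latest (version, schema) pair and refreshes it at most
    once per poll interval. All names here are hypothetical."""

    def __init__(
        self,
        fetch_latest: Callable[[], Tuple[int, str]],  # assumed registry lookup
        poll_interval_s: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch_latest = fetch_latest
        self._poll_interval_s = poll_interval_s
        self._clock = clock
        self._cached: Optional[Tuple[int, str]] = None
        self._fetched_at = 0.0

    def get(self) -> Tuple[int, str]:
        """Returns the cached schema, contacting the registry only when the
        cache is empty or the poll interval has elapsed."""
        now = self._clock()
        expired = now - self._fetched_at >= self._poll_interval_s
        if self._cached is None or expired:
            self._cached = self._fetch_latest()
            self._fetched_at = now
        return self._cached
```

A large `poll_interval_s` approximates the fetch-once behaviour, while a small one approaches a lookup per call, so the parameter trades registry load against how quickly a new version is noticed.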

@kevinwallimann
Collaborator

Hi @grantatspothero
Thanks for your question. Spark Structured Streaming does not allow for schema changes during query execution, at least not out of the box. That’s why ABRiS doesn’t support this. Previously, ABRiS picked up the latest Avro schema for every record, but the Spark schema would not have changed during the query anyway. Thus, getting the latest version from schema registry for every row was never really a feature, but rather a performance bug.

@grantatspothero
Author

@kevinwallimann Makes sense. There appears to be a way to invalidate a structured streaming plan so you could potentially detect schema changes and rebuild the plan, but it is complicated.

You can close this, thanks for the help!
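
As a rough sketch of the "detect schema changes and rebuild" idea, a simpler variant than invalidating the plan is to stop the query and start a new one when the registry reports a new version. Everything named here is an assumption for illustration: `SchemaAwareSupervisor`, `get_latest_version`, and `start_query` are hypothetical, and `start_query` is assumed to return a handle with a `stop()` method (as a PySpark `StreamingQuery` has). Whether an existing checkpoint and sink accept the new schema is not addressed.

```python
from typing import Any, Callable, Optional


class SchemaAwareSupervisor:
    """Restarts a streaming query whenever the latest schema version changes.
    Hypothetical helper, not part of Spark or ABRiS."""

    def __init__(
        self,
        get_latest_version: Callable[[], int],  # assumed registry lookup
        start_query: Callable[[int], Any],      # builds and starts a query for a version
    ) -> None:
        self._get_latest_version = get_latest_version
        self._start_query = start_query
        self._version: Optional[int] = None
        self._query: Any = None

    def tick(self) -> bool:
        """Checks the registry once. Returns True if a query was (re)started,
        False if the version is unchanged."""
        latest = self._get_latest_version()
        if latest == self._version:
            return False
        if self._query is not None:
            # Stop the query that was planned against the old schema.
            self._query.stop()
        # Start a fresh query so the plan is built against the new schema.
        self._query = self._start_query(latest)
        self._version = latest
        return True
```

In a real job, `tick()` would be called periodically from the driver, and `start_query` would rebuild the DataFrame from the new schema before starting the stream.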

@richiesgr

richiesgr commented Mar 21, 2023

Hi @kevinwallimann
I used ABRiS to stream Avro to a Delta table. As mentioned in this thread, schema evolution does not work, either with plain streaming or with foreachBatch.

However, using this code I'm able to update the schema for each message, at least in the source. The problem is that the change is never reflected in the sink.

You said Spark can't handle this out of the box, which I can confirm, but do you have any idea how to implement it?
Thanks
