Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

passthroughFields causes error when writing parquet with new field #250

Open
alexkuang opened this issue Nov 29, 2016 · 5 comments
Open

Comments

@alexkuang
Copy link

Using scrooge-core 4.7.0 and parquet-scrooge 1.8.1, I ran into org.apache.parquet.io.ParquetEncodingException: field 4 was not found in [...] when updating a thrift schema with a new optional field. Simplified example case:

  • I have 2 systems, Publisher and Logger. Publisher publishes records to a stream using a thrift schema T, and Logger reads the objects from the stream and writes them to storage in the parquet format.
  • I update T's schema to include a new optional field.
  • I update Publisher to the new version of the schema, and it starts publishing with the new schema.
  • Logger fails with above ParquetEncodingException.

I believe the cause is that the scrooge-generated scala in Logger is calling write on the passthrough fields (https://github.com/twitter/scrooge/blob/develop/scrooge-generator/src/main/resources/scalagen/struct.scala#L597), which delegates to the protocol. parquet-scrooge's protocol populates its schema using the info exposed in the scrooge-generated fieldInfos, which does not include the passthrough fields. This in turn causes parquet-scrooge to fail as it tries to find the passthrough field in its schema while servicing the write call.

One potential workaround is to call withoutPassthroughFields on individual objects before passing them into parquet (or to have parquet-scrooge do it). Is this the recommended solution? It seems like it might be nicer to have a more universal way of excluding passthroughFields--I did find a comment in the generated code saying "If the field is unknown and passthrough fields are enabled [...]" but didn't find any obvious references as to how to disable them. Wanted to make sure I'm not missing anything.

@kevinoliver
Copy link
Contributor

Can you use Struct.withoutPassthroughFields(original) to create a copy of the struct without the pass throughs?

For example: https://github.com/twitter/scrooge/blob/develop/scrooge-generator-tests/src/test/resources/gold_file_output_scala/com/twitter/scrooge/test/gold/thriftscala/Request.scala#L116

@alexkuang
Copy link
Author

Yes that is the solution I am using right now, but I was hoping there would be a nicer way to handle it. withoutPassthroughFields is defined separately for each generated Struct, so for generic serde code I currently need to bolt it on via an additional implicit for every datatype I handle.

This behavior also seems like a gotcha to me, have there been similar interop issues with other formats? For my own education, what was the original use case for passthroughFields?

@kevinoliver
Copy link
Contributor

Perhaps we could add it as a method on each generated struct so that its generic? I'm open to ideas there.

As for the motivations — Its very powerful to preserve fields that a service doesn't know about. For example, it allows for schema evolution without having "middle-boxes" have to know about the IDL changes.

@alexkuang
Copy link
Author

alexkuang commented Dec 5, 2016

Perhaps we could add it as a method on each generated struct so that its generic? I'm open to ideas there.

I agree that would be super useful. It would let other libraries like parquet-scrooge work with that method to eliminate this gotcha on a higher level.

Its very powerful to preserve fields that a service doesn't know about.

As far as I can tell, the passthrough field is basically an Array[Byte]. For the consumer of a DTO with a new unknown field, I can think of directly logging / writing the blob of Bytes, but are there any more sophisticated use cases for the data?

@mosesn
Copy link
Contributor

mosesn commented May 9, 2017

@alexkuang sorry for the slow response. I think adding a method to each generated struct makes sense. Would you be interested in tackling this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

3 participants