
Add support for custom conversion #27

Open
ittayd opened this issue Jun 12, 2014 · 5 comments

Comments


ittayd commented Jun 12, 2014

I have a schema with many fields typed as Avro "bytes". These are converted to BytesWritable (BTW, why not byte[]?), but I want to use them as Strings, so I need an extra step to convert them, and the BytesWritable object is created unnecessarily. It would be nice to be able to specify custom conversions for fields to replace the generic ones.
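To make the complaint concrete, here is a minimal sketch of the extra conversion step being described. A tiny stand-in class is used in place of Hadoop's org.apache.hadoop.io.BytesWritable so the snippet is self-contained; the real class has the same getBytes()/getLength() shape.

```java
import java.nio.charset.StandardCharsets;

public class BytesToStringStep {
    // Minimal stand-in for org.apache.hadoop.io.BytesWritable: a backing
    // buffer plus a valid length, because the real buffer may be over-allocated.
    static class BytesWritable {
        private final byte[] bytes;
        private final int length;
        BytesWritable(byte[] bytes) { this.bytes = bytes; this.length = bytes.length; }
        byte[] getBytes() { return bytes; }
        int getLength() { return length; }
    }

    // The manual step the comment wants to avoid: decode only the valid
    // portion of the buffer as UTF-8.
    static String toString(BytesWritable bw) {
        return new String(bw.getBytes(), 0, bw.getLength(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        BytesWritable bw = new BytesWritable("hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(toString(bw)); // prints "hello"
    }
}
```

With a custom conversion hook, this decode could happen inside the scheme and the intermediate BytesWritable would never be allocated.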

@ldcasillas-progreso

I think this and issue #26 can be solved together using Cascading 2.x's typed Fields and coercions. The idea would be this (for source taps):

  1. At construction time, the AvroScheme would be supplied with both an Avro Schema and a Cascading Fields object.
  2. The Schema would be used to read the Avro records.
  3. The Fields object would be used to determine which of the Avro fields are actually projected into the TupleEntry.
  4. The AvroScheme would use Cascading 2.x's type coercion facilities to coerce the Avro records' fields' types into the ones specified in the user-supplied Fields object's types.
  5. Users would be able to customize the coercions by creating their own instances of cascading.tuple.type.CoercibleType and putting them into the Fields object they supply to the AvroScheme.
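Step 5 might look roughly like the following. This is a sketch, not actual cascading.avro code: a local CoercibleType interface mirroring the shape of cascading.tuple.type.CoercibleType (getCanonicalType/canonical/coerce) is declared here so the snippet is self-contained and compiles without Cascading on the classpath.

```java
import java.lang.reflect.Type;
import java.nio.charset.StandardCharsets;

public class Utf8BytesCoercion {
    // Local stand-in mirroring the shape of cascading.tuple.type.CoercibleType.
    interface CoercibleType<Canonical> {
        Class<Canonical> getCanonicalType();
        Canonical canonical(Object value);             // incoming value -> canonical form
        <Coerce> Coerce coerce(Object value, Type to); // canonical form -> requested type
    }

    // A user-supplied coercion that treats Avro "bytes" fields as UTF-8 Strings,
    // skipping the intermediate BytesWritable entirely.
    static class Utf8BytesType implements CoercibleType<String> {
        public Class<String> getCanonicalType() { return String.class; }

        public String canonical(Object value) {
            if (value == null) return null;
            if (value instanceof byte[]) return new String((byte[]) value, StandardCharsets.UTF_8);
            return value.toString();
        }

        @SuppressWarnings("unchecked")
        public <Coerce> Coerce coerce(Object value, Type to) {
            String s = canonical(value);
            if (to == String.class || s == null) return (Coerce) s;
            if (to == byte[].class) return (Coerce) s.getBytes(StandardCharsets.UTF_8);
            throw new IllegalArgumentException("cannot coerce to " + to);
        }
    }

    public static void main(String[] args) {
        Utf8BytesType t = new Utf8BytesType();
        byte[] raw = "payload".getBytes(StandardCharsets.UTF_8);
        System.out.println(t.canonical(raw)); // prints "payload"
    }
}
```

In the proposed design, an instance of such a type would be carried in the Fields object handed to the AvroScheme, and the scheme would consult it instead of applying its generic conversion.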

For sink taps, the Cascading Fields object may also contain only a subset of the Avro Schema's fields; this only works when the Schema has defaults for all of the fields missing from the Cascading side. This ties in with Avro schema evolution—it allows an older Cascading job to continue working with a newer version of an Avro schema, as long as the schema provides defaults for the fields that the Cascading job is unaware of.
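For example, under a hypothetical schema like the one below, a Cascading job that only supplies `id` could keep writing records after the newer `tags` field is added, because `tags` carries a default:

```json
{
  "type": "record",
  "name": "Event",
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "tags", "type": {"type": "array", "items": "string"}, "default": []}
  ]
}
```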

@ldcasillas-progreso

I should note that I've started making some modifications to the project that are related to this, but they still fall short of what's desired. My forked branch is here:

https://github.com/ldcasillas-progreso/cascading.avro/tree/field_types

@kkrugler
Member

Hi Luis - I agree this would be useful, especially on the source side. I just ran into the issue of only needing a few fields from an Avro file but having to read everything in. Though I then started looking at using Parquet (with Avro records) as an even more efficient approach to this issue - have you ever tried that?

@ldcasillas-progreso

I gave parquet-cascading a shot and didn't get the impression that it was mature enough just yet. In any case, I haven't really had the need so far to take narrow projections from wide tuples—my main issues so far have been around the type coercions.

@kkrugler
Member

OK, just curious about parquet-cascading.

I'm hoping to roll out a 2.6 release that has Silas's fairly significant changes. These are in the version-2.6 branch; no idea if this would change your approach to the above, but I think it would be worth a look.
