
Add support for custom conversion #27

Open
ittayd opened this issue Jun 12, 2014 · 5 comments

Comments


ittayd commented Jun 12, 2014

I have a schema with many fields typed as Avro "bytes". These are converted to BytesWritable (BTW, why not byte[]?), but I want to use them as Strings, so I need an extra step to convert them, and the BytesWritable object is created unnecessarily. It would be nice to be able to specify custom conversions for fields to replace the generic ones.
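To make the complaint concrete, here is a minimal sketch of the extra conversion step being described. A tiny stand-in class is used in place of Hadoop's org.apache.hadoop.io.BytesWritable so the snippet is self-contained; the real class has the same getBytes()/getLength() shape.

```java
import java.nio.charset.StandardCharsets;

public class BytesToStringStep {
    // Minimal stand-in for org.apache.hadoop.io.BytesWritable: a backing
    // buffer plus a valid length, because the real buffer may be over-allocated.
    static class BytesWritable {
        private final byte[] bytes;
        private final int length;
        BytesWritable(byte[] bytes) { this.bytes = bytes; this.length = bytes.length; }
        byte[] getBytes() { return bytes; }
        int getLength() { return length; }
    }

    // The manual step the comment wants to avoid: decode only the valid
    // portion of the buffer as UTF-8.
    static String toString(BytesWritable bw) {
        return new String(bw.getBytes(), 0, bw.getLength(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        BytesWritable bw = new BytesWritable("hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(toString(bw)); // prints "hello"
    }
}
```

With a custom conversion hook, this decode could happen inside the scheme and the intermediate BytesWritable would never be allocated.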

@ldcasillas-progreso

I think this and issue #26 can be solved together using Cascading 2.x's typed Fields and coercions. The idea would be this (for source taps):

  1. At construction time, the AvroScheme would be supplied with both an Avro Schema and a Cascading Fields object.
  2. The Schema would be used to read the Avro records.
  3. The Fields object would be used to determine which of the Avro fields are actually projected into the TupleEntry.
  4. The AvroScheme would use Cascading 2.x's type coercion facilities to coerce the Avro records' fields' types into the ones specified in the user-supplied Fields object's types.
  5. Users would be able to customize the coercions by creating their own instances of cascading.tuple.type.CoercibleType and putting them into the Fields object they supply to the AvroScheme.
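Step 5 might look roughly like the following. This is a sketch, not actual cascading.avro code: a local CoercibleType interface mirroring the shape of cascading.tuple.type.CoercibleType (getCanonicalType/canonical/coerce) is declared here so the snippet is self-contained and compiles without Cascading on the classpath.

```java
import java.lang.reflect.Type;
import java.nio.charset.StandardCharsets;

public class Utf8BytesCoercion {
    // Local stand-in mirroring the shape of cascading.tuple.type.CoercibleType.
    interface CoercibleType<Canonical> {
        Class<Canonical> getCanonicalType();
        Canonical canonical(Object value);             // incoming value -> canonical form
        <Coerce> Coerce coerce(Object value, Type to); // canonical form -> requested type
    }

    // A user-supplied coercion that treats Avro "bytes" fields as UTF-8 Strings,
    // skipping the intermediate BytesWritable entirely.
    static class Utf8BytesType implements CoercibleType<String> {
        public Class<String> getCanonicalType() { return String.class; }

        public String canonical(Object value) {
            if (value == null) return null;
            if (value instanceof byte[]) return new String((byte[]) value, StandardCharsets.UTF_8);
            return value.toString();
        }

        @SuppressWarnings("unchecked")
        public <Coerce> Coerce coerce(Object value, Type to) {
            String s = canonical(value);
            if (to == String.class || s == null) return (Coerce) s;
            if (to == byte[].class) return (Coerce) s.getBytes(StandardCharsets.UTF_8);
            throw new IllegalArgumentException("cannot coerce to " + to);
        }
    }

    public static void main(String[] args) {
        Utf8BytesType t = new Utf8BytesType();
        byte[] raw = "payload".getBytes(StandardCharsets.UTF_8);
        System.out.println(t.canonical(raw)); // prints "payload"
    }
}
```

In the proposed design, an instance of such a type would be carried in the Fields object handed to the AvroScheme, and the scheme would consult it instead of applying its generic conversion.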

For sink taps, the Cascading Fields object may also contain only a subset of the Avro Schema's fields; this only works when the Schema has defaults for all of the fields missing from the Cascading side. This ties in with Avro schema evolution—it allows an older Cascading job to continue working with a newer version of an Avro schema, as long as the schema provides defaults for the fields that the Cascading job is unaware of.
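For example, under a hypothetical schema like the one below, a Cascading job that only supplies `id` could keep writing records after the newer `tags` field is added, because `tags` carries a default:

```json
{
  "type": "record",
  "name": "Event",
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "tags", "type": {"type": "array", "items": "string"}, "default": []}
  ]
}
```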

@ldcasillas-progreso

I should note that I've started making some modifications to the project that are related to this, but they still fall short of what's desired. My forked branch is here:

https://github.com/ldcasillas-progreso/cascading.avro/tree/field_types

@kkrugler
Member

Hi Luis - I agree this would be useful, especially on the source side. I just ran into the issue of only needing a few fields from an Avro file but having to read everything in. Though I then started looking at using Parquet (with Avro records) as an even more efficient approach to this issue - have you ever tried that?

@ldcasillas-progreso

I gave parquet-cascading a shot and didn't get the impression that it was mature enough just yet. In any case, I haven't really had the need so far to take narrow projections from wide tuples—my main issues so far have been around the type coercions.

@kkrugler
Member

OK, just curious about parquet-cascading.

I'm hoping to roll out a 2.6 release that has Silas's fairly significant changes. These are in the version-2.6 branch; no idea if this would change your approach to the above, but I think it would be worth a look.
