Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(WIP) Implement external mapping of IDs #80

Closed
wants to merge 3 commits into from

Conversation

YuvalItzchakov
Copy link

@YuvalItzchakov YuvalItzchakov commented Oct 24, 2018

Initial fix for #71

This implementation introduces AvroNamedSchemaVisitor<T> and the AvroToIcebergSchemaVisitor<iceberg.Schema> types in order to overcome the fact that AvroSchemaVisitor only passes the schema at the field level, neglecting to pass in the field names which are needed generate the schema.

There are a few "workarounds" in the implementation of the record method which I don't really like, still need to iterate on that.

There are still open issues which need to be resolved:

  1. Add Parquet visitor with the same logic of sequential IDs
  2. The field ids mappings start with 0, and they also map structs in a table with ids (is this desired?)
  3. Field id mappings assign ids to the top level fields first, and then iterate complex type
  4. There is still a gap regarding the table properties update, which seems essential in the open issue, but I still don't have the full mental picture to implement it
  5. More tests need to be created for all types of Avro schemas (currently only Record is tested, need to add tests for Map, Union, Array, etc..)
  6. Need to make sure the logic for assigning ids holds for all these types

@YuvalItzchakov YuvalItzchakov changed the title Implement external mapping of IDs (WIP) Implement external mapping of IDs Oct 24, 2018
@@ -26,6 +26,7 @@
import org.apache.avro.LogicalType;
import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: non-functional change.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@@ -87,7 +98,7 @@ public static Schema buildAvroProjection(Schema schema, com.netflix.iceberg.Sche

public static boolean isTimestamptz(Schema schema) {
LogicalType logicalType = schema.getLogicalType();
if (logicalType != null && logicalType instanceof LogicalTypes.TimestampMicros) {
if (logicalType instanceof LogicalTypes.TimestampMicros) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm all for cleaning up minor issues like this, but I would prefer to keep these in separate commits so that the project is easier to maintain. Could you move this sort of improvement into a separate PR that we can get in separately (and probably more quickly)?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, will create a separate PR

import java.util.HashMap;
import java.util.List;

public class AvroToIcebergSchemaVisitor extends AvroNamedSchemaVisitor<com.netflix.iceberg.Schema> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is already a SchemaToType implementation that does most of the work done here. I think it would be better to extend that class with logic for assigning IDs.

I don't think it would be difficult to extend that to detect when an ID is missing and make a function call to assign it. That way, ID assignment is external to the conversion logic. That would make it possible to plug in something that assigns IDs like this, or something that maps IDs using a configured field mapping.

@rdblue
Copy link
Contributor

rdblue commented Oct 25, 2018

@YuvalItzchakov, I think the main goal of this PR is to take a mapping and an Avro schema without IDs and produce something that can be passed into buildAvroProjection. That method takes a file schema, an Iceberg read schema, and a set of renames.

The work here produces an Iceberg schema with IDs from an Avro schema, but that's not exactly what is needed to be passed into buildAvroProjection. So I think this should actually produce an Avro schema with the right IDs from a field name to ID mapping.

@YuvalItzchakov
Copy link
Author

@rdblue Thank you for taking the time to review everything! Frankly, I was not sure this code was in the right direction at all, because as you've mentioned, SchemaToType does exactly the work which is done here (I learned that after I wrote the code).

I'll start working on the changes.

@rdblue
Copy link
Contributor

rdblue commented Dec 7, 2018

@YuvalItzchakov, if you want to continue working on this, please re-open it in the apache/incubator-iceberg repository. That's the project's new home. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants