(WIP) Implement external mapping of IDs #80

YuvalItzchakov · 2018-10-24T16:46:41Z

Initial fix for #71

This implementation introduces the AvroNamedSchemaVisitor<T> and AvroToIcebergSchemaVisitor<iceberg.Schema> types to work around the fact that AvroSchemaVisitor only passes the schema at the field level and does not pass the field names, which are needed to generate the schema.

There are a few "workarounds" in the implementation of the record method that I don't really like; I still need to iterate on that.

There are still open issues which need to be resolved:

Add Parquet visitor with the same logic of sequential IDs
The field ID mappings start at 0, and they also assign IDs to structs in a table (is this desired?)
Field ID mappings assign IDs to the top-level fields first, and then iterate over the complex types
There is still a gap regarding the table properties update, which seems essential in the open issue, but I still don't have the full mental picture to implement it
More tests need to be created for all types of Avro schemas (currently only Record is tested; need to add tests for Map, Union, Array, etc.)
Need to make sure the logic for assigning ids holds for all these types
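
To make the ordering in the list above concrete, here is a minimal sketch of one reading of it: IDs start at 0, struct fields get IDs too, and every field at one level is numbered before any nested field. This is not the code in this PR. `FieldNode` and `SequentialIdSketch` are hypothetical stand-ins rather than Iceberg or Avro types, and the breadth-first traversal is an assumption about what "top-level fields first" means.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a schema field; a field with children models a struct.
class FieldNode {
  final String name;
  final List<FieldNode> children = new ArrayList<>();

  FieldNode(String name, FieldNode... children) {
    this.name = name;
    this.children.addAll(Arrays.asList(children));
  }
}

class SequentialIdSketch {
  // Assigns sequential IDs breadth-first: every field at one nesting level
  // receives its ID before any field at the next level down.
  static Map<String, Integer> assignIds(List<FieldNode> topLevel, int firstId) {
    Map<String, Integer> ids = new LinkedHashMap<>();
    Deque<Map.Entry<String, FieldNode>> queue = new ArrayDeque<>();
    for (FieldNode field : topLevel) {
      queue.add(new SimpleEntry<>(field.name, field));
    }

    int nextId = firstId;
    while (!queue.isEmpty()) {
      Map.Entry<String, FieldNode> entry = queue.remove();
      // The struct field itself gets an ID, keyed by its dotted path.
      ids.put(entry.getKey(), nextId++);
      for (FieldNode child : entry.getValue().children) {
        queue.add(new SimpleEntry<>(entry.getKey() + "." + child.name, child));
      }
    }
    return ids;
  }

  public static void main(String[] args) {
    List<FieldNode> fields = Arrays.asList(
        new FieldNode("id"),
        new FieldNode("location", new FieldNode("lat"), new FieldNode("lon")),
        new FieldNode("name"));
    System.out.println(assignIds(fields, 0));
  }
}
```

With this ordering, adding a nested field never shifts the IDs of existing top-level fields, but it can shift the IDs of fields nested more deeply, which is one reason the open questions above matter.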

rdblue · 2018-10-25T22:16:36Z

core/src/main/java/com/netflix/iceberg/avro/AvroSchemaUtil.java

@@ -26,6 +26,7 @@
 import org.apache.avro.LogicalType;
 import org.apache.avro.LogicalTypes;
 import org.apache.avro.Schema;
+


Nit: non-functional change.

rdblue · 2018-10-25T22:17:42Z

core/src/main/java/com/netflix/iceberg/avro/AvroSchemaUtil.java

@@ -87,7 +98,7 @@ public static Schema buildAvroProjection(Schema schema, com.netflix.iceberg.Sche

  public static boolean isTimestamptz(Schema schema) {
    LogicalType logicalType = schema.getLogicalType();
-    if (logicalType != null && logicalType instanceof LogicalTypes.TimestampMicros) {
+    if (logicalType instanceof LogicalTypes.TimestampMicros) {


I'm all for cleaning up minor issues like this, but I would prefer to keep these in separate commits so that the project is easier to maintain. Could you move this sort of improvement into a separate PR that we can get in separately (and probably more quickly)?

Sure, will create a separate PR

rdblue · 2018-10-25T22:20:45Z

core/src/main/java/com/netflix/iceberg/avro/AvroToIcebergSchemaVisitor.java

+import java.util.HashMap;
+import java.util.List;
+
+public class AvroToIcebergSchemaVisitor extends AvroNamedSchemaVisitor<com.netflix.iceberg.Schema> {


There is already a SchemaToType implementation that does most of the work done here. I think it would be better to extend that class with logic for assigning IDs.

I don't think it would be difficult to extend that to detect when an ID is missing and make a function call to assign it. That way, ID assignment is external to the conversion logic. That would make it possible to plug in something that assigns IDs like this, or something that maps IDs using a configured field mapping.
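
As a rough illustration of that separation, the sketch below keeps ID assignment behind a function that is only called when a field has no ID. Everything here is hypothetical: `IdAssignmentHookSketch`, `resolveId`, `sequential`, and `fromMapping` are names invented for the example, not part of SchemaToType or any Iceberg API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.ToIntFunction;

class IdAssignmentHookSketch {
  // Keeps an existing ID when the field has one; otherwise asks the
  // pluggable assigner, so conversion logic never decides IDs itself.
  static int resolveId(Integer existingId, String fieldName, ToIntFunction<String> assigner) {
    return existingId != null ? existingId : assigner.applyAsInt(fieldName);
  }

  // Assigner that hands out sequential IDs starting at firstId.
  static ToIntFunction<String> sequential(int firstId) {
    AtomicInteger next = new AtomicInteger(firstId);
    return name -> next.getAndIncrement();
  }

  // Assigner backed by a configured field-name-to-ID mapping.
  static ToIntFunction<String> fromMapping(Map<String, Integer> mapping) {
    return name -> {
      Integer id = mapping.get(name);
      if (id == null) {
        throw new IllegalArgumentException("No ID mapped for field: " + name);
      }
      return id;
    };
  }

  public static void main(String[] args) {
    Map<String, Integer> mapping = new HashMap<>();
    mapping.put("name", 12);

    ToIntFunction<String> bySequence = sequential(0);
    ToIntFunction<String> byMapping = fromMapping(mapping);

    System.out.println(resolveId(7, "id", bySequence));       // field already has an ID
    System.out.println(resolveId(null, "name", bySequence));  // assigned sequentially
    System.out.println(resolveId(null, "name", byMapping));   // looked up in the mapping
  }
}
```

The conversion code would only ever call something like `resolveId`, so swapping the sequential assigner for a mapping-backed one would not touch the visitor.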

rdblue · 2018-10-25T22:23:48Z

@YuvalItzchakov, I think the main goal of this PR is to take a mapping and an Avro schema without IDs and produce something that can be passed into buildAvroProjection. That method takes a file schema, an Iceberg read schema, and a set of renames.

The work here produces an Iceberg schema with IDs from an Avro schema, but that's not exactly what is needed to be passed into buildAvroProjection. So I think this should actually produce an Avro schema with the right IDs from a field name to ID mapping.

YuvalItzchakov · 2018-10-26T05:34:20Z

@rdblue Thank you for taking the time to review everything! Frankly, I was not sure this code was in the right direction at all, because as you've mentioned, SchemaToType does exactly the work which is done here (I learned that after I wrote the code).

I'll start working on the changes.

rdblue · 2018-12-07T17:57:18Z

@YuvalItzchakov, if you want to continue working on this, please re-open it in the apache/incubator-iceberg repository. That's the project's new home. Thanks!

YuvalItzchakov added 3 commits October 24, 2018 19:35

Initial implementation for avro to iceberg schema conversion

e10bc7e

Added license on iceberg visitor file

97f16bc

Removed test

72e973e

YuvalItzchakov changed the title ~~Implement external mapping of IDs~~ (WIP) Implement external mapping of IDs Oct 24, 2018

rdblue reviewed Oct 25, 2018

rdblue mentioned this pull request Dec 8, 2018

Add external schema mappings for files written with name-based schemas apache/iceberg#40

Closed

YuvalItzchakov closed this Aug 17, 2019
