
Initial pass at adding ORC to Iceberg. #12

Closed
wants to merge 5 commits

Conversation

Contributor

@omalley omalley commented Feb 23, 2018

Known problems:

  • Doesn't do schema evolution.
  • Doesn't include column size metrics.
  • Doesn't properly handle timestamp with timezone.
  • Doesn't do the schema mangling for partitions.

Contributor Author

omalley commented Feb 26, 2018

Other comments:

  • ORC reads/writes to/from VectorizedColumnBatches in groups of 1024 rows. I made the Spark interfaces implement an iterator of UnsafeRow, and I believe I have it set up so that there won't be any allocations in the inner loop (except for Decimals). A rough sketch of the batch-iteration pattern follows this list.
  • I extended the RandomData class to also generate InternalRows, which is a parent interface for UnsafeRow.
  • In RandomData I limited the year range for timestamps to 1970 ± 50 years. I got tired of debugging with dates 20,000 years in the future.
  • I added a comparison between InternalRow and Row in TestHelpers to deal with the data generator creating InternalRow instances.
  • I pulled out TestParquetWrite.Record to a top level class so that I could reuse it in TestOrcWrite.
  • Added a method on UpdateProperties to set the default file format.
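
A rough illustration of the batch-driven iteration from the first bullet. This is not the PR's reader/writer code; it is a hypothetical OrcRowIterator over a two-column (bigint, string) file, and it skips the null and isRepeating handling that real code needs:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Iterator;
import java.util.NoSuchElementException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;
import org.apache.orc.RecordReader;

/** Hypothetical example: iterate a (bigint, string) ORC file one row at a time. */
public class OrcRowIterator implements Iterator<Object[]> {
  private final RecordReader rows;
  private final VectorizedRowBatch batch;  // reused for every group of up to 1024 rows
  private int nextRow = 0;

  public OrcRowIterator(Path path, Configuration conf) throws IOException {
    Reader reader = OrcFile.createReader(path, OrcFile.readerOptions(conf));
    this.rows = reader.rows();
    this.batch = reader.getSchema().createRowBatch();  // default batch size is 1024
    this.batch.size = 0;
    // (closing the RecordReader is omitted for brevity)
  }

  @Override
  public boolean hasNext() {
    try {
      while (nextRow >= batch.size) {
        if (!rows.nextBatch(batch)) {  // refill; the batch's vectors are reused
          return false;
        }
        nextRow = 0;
      }
      return true;
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  @Override
  public Object[] next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    LongColumnVector col0 = (LongColumnVector) batch.cols[0];
    BytesColumnVector col1 = (BytesColumnVector) batch.cols[1];
    String text = new String(col1.vector[nextRow], col1.start[nextRow],
        col1.length[nextRow], StandardCharsets.UTF_8);
    // The PR reuses a single UnsafeRow here to avoid per-row allocation;
    // this sketch allocates an Object[] to keep the example short.
    Object[] row = new Object[] { col0.vector[nextRow], text };
    nextRow += 1;
    return row;
  }
}
```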

Contributor

rdblue commented Feb 28, 2018

Thanks! I'll have a look shortly.

break;
}
case LIST:
compareLists(prefix + "." + fieldName, childType.asListType(),
Contributor

@rdblue rdblue Mar 2, 2018

I think I prefer the assert naming convention to compare because it is clear that the result when something is different is an AssertionError. Same with compareMaps.

Contributor Author

ok

@SuppressWarnings("unchecked")
SparkOrcWriter writer = new SparkOrcWriter(ORC.write(file)
.schema(schema)
.partitionSpec(spec)
Contributor

I think spec can be removed because it isn't relevant at the level of a single file. It also isn't used in the OrcFileAppender.

Contributor Author

ok

List<String> fieldNames = schema.getFieldNames();
List<TypeDescription> fieldTypes = schema.getChildren();
List<Types.NestedField> fields = new ArrayList<>(fieldNames.size());
for(int c=0; c < fieldNames.size(); ++c) {
Contributor

Nit: no space between for and (

Contributor Author

ok

for(int c=0; c < fieldNames.size(); ++c) {
String name = fieldNames.get(c);
TypeDescription type = fieldTypes.get(c);
fields.add(Types.NestedField.optional(columnIds[type.getId()], name,
Contributor

Does ORC not support a distinction between required and optional fields?

Because Iceberg assumes that the files in a given table are managed by Iceberg, I think we can probably work around this by adding non-null checks to the write path and assuming non-null on the read path. But I would still prefer to have a guarantee that the files won't contain null values.

Contributor Author

@omalley omalley Mar 2, 2018

No, ORC doesn't support required fields. The history is that Hive doesn't support required fields, so it wasn't that important. Once schema evolution was added, it really didn't make any sense to add required fields.

That said, the column statistics track whether there are any nulls in each column. So given a file footer you can tell whether there are any nulls in that column or not. You're right that it would be easy to check for null values on the Iceberg write path. On the read path, you always need to check for null because the column may not be present in the file.

Contributor

Schema evolution doesn't allow you to add a required column, so we shouldn't need to worry about missing required columns. I like that we can use the stats to make sure the file doesn't contain any nulls. Let's plan on doing that for the read path and throwing an exception.

Is it possible to add the metadata to ORC? We have a fairly compelling use case for it: when there are null values in a foreign key column, a SQL inner join will ignore the rows with null because joins use null-safe equality (null == null returns null). We want to ensure we don't accidentally lose rows this way because it is a subtle correctness issue and we can't expect users to know they should do an outer join everywhere. Instead, we want to ensure that the foreign key is never null.
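
A minimal sketch of that read-path check, assuming Iceberg field ids have already been mapped to ORC column ids (the orcIdFor helper below is hypothetical):

```java
import org.apache.orc.ColumnStatistics;
import org.apache.orc.Reader;

import com.netflix.iceberg.Schema;
import com.netflix.iceberg.types.Types;

// Reject a file up front if any column that Iceberg marks required
// recorded nulls in the ORC footer statistics.
static void checkNoNullsInRequiredColumns(Reader reader, Schema schema) {
  ColumnStatistics[] stats = reader.getStatistics();  // indexed by ORC column id
  for (Types.NestedField field : schema.columns()) {
    if (field.isRequired()) {
      int orcId = orcIdFor(field);  // hypothetical Iceberg-id -> ORC-id mapping
      if (stats[orcId].hasNull()) {
        throw new IllegalStateException(
            "Required column contains nulls: " + field.name());
      }
    }
  }
}
```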

return Types.BinaryType.get();
case DATE:
return Types.DateType.get();
case TIMESTAMP:
Contributor

Does ORC differentiate between with zone and without zone? I think we need support for both. We can work around it by keeping Iceberg metadata, but I'd rather have everything represented correctly in data files.

Contributor Author

Not yet. I have a JIRA open to add it.

Contributor

Great, what's the JIRA?


case STRING:
case CHAR:
case VARCHAR:
return Types.MapType.ofOptional(columnIds[key.getId()],
Contributor

Are keys required in ORC, or can they be null as well?

Contributor Author

Keys can be null.

return toOrc(schema.asStruct(), columnIds);
}

static TypeDescription toOrc(Type type, List<Integer> columnIds) {
Contributor

You might consider using a SchemaVisitor for conversion to separate the conversion logic from type traversal. Here's an example for Avro: https://github.com/Netflix/iceberg/blob/master/avro/src/main/java/com/netflix/iceberg/avro/TypeToSchema.java

Contributor Author

I looked at the SchemaVisitor. I found it far more difficult to read than having a single function that did the recursion.

Contributor

Sounds reasonable. I think it's important for more complicated transformations, but this does look straightforward.
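
For reference, the single-function recursion being discussed is roughly this shape. This is a simplified sketch against the Iceberg and ORC type APIs as I understand them; the column-id bookkeeping and several primitive types are omitted:

```java
import org.apache.orc.TypeDescription;

import com.netflix.iceberg.types.Type;
import com.netflix.iceberg.types.Types;

// One recursive function: switch on the Iceberg type and build the ORC type directly.
static TypeDescription convert(Type type) {
  switch (type.typeId()) {
    case BOOLEAN: return TypeDescription.createBoolean();
    case INTEGER: return TypeDescription.createInt();
    case LONG:    return TypeDescription.createLong();
    case STRING:  return TypeDescription.createString();
    case LIST:
      return TypeDescription.createList(convert(type.asListType().elementType()));
    case STRUCT: {
      TypeDescription struct = TypeDescription.createStruct();
      for (Types.NestedField field : type.asStructType().fields()) {
        struct.addField(field.name(), convert(field.type()));
      }
      return struct;
    }
    default:
      throw new UnsupportedOperationException("Unhandled type: " + type);
  }
}
```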

}
}
default:
// We don't have an answer for union types -> blech.
Contributor

I'm interested to hear how ORC handles unions when only some of the columns are projected. This issue is why we never quite standardized unions in Parquet and why I'm reluctant to add them. (That, and I don't see a very distinct use case for them.)

I'm not against it in principle, but I'm skeptical and want to keep everything as small as possible until we're sure they are really necessary. Plus, many engines have no support for unions (e.g., Spark) and might not intend to support them.

Contributor Author

In practice, I haven't seen that case come up.

If you dropped some of the union children, those values would become null.

Contributor

I agree that returning null is the reasonable option, but doesn't that defeat the purpose of the guarantee that one branch of the union is non-null?

Contributor

Here's the discussion about a UNION type in Parquet, for reference: apache/parquet-format#44

I think Alex makes some good points, like that projection should not affect the result, only the efficiency of the query.

Contributor Author

In ORC it is a bit easier. As I wrote elsewhere, the user can't have required fields, so if the user projected just one of the union children, they would get nulls wherever the value came from one of the other children. This is the same as what would happen if the file had an extra child in the union: those values would become null.

Contributor Author

One additional piece of the ORC format is that we don't push the metadata down into the leaf types. So the data for the union is a selector that says which child is selected for that value. Then that child's value is used. So if child 3 is picked, but child 3 was not in the projection, it would just have a null value. But the reader could tell that it was a child 3 value that was null.

https://orc.apache.org/api/hive-storage-api/org/apache/hadoop/hive/ql/exec/vector/UnionColumnVector.html
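
A small sketch of what that looks like on the read side (hypothetical helper; real code also has to handle the other child vector types):

```java
import org.apache.hadoop.hive.ql.exec.vector.ColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.UnionColumnVector;

// Read one row of a union column: the tag records which child was written,
// even when that child was not included in the projection.
static Object readUnionValue(UnionColumnVector union, int row) {
  int r = union.isRepeating ? 0 : row;   // repeating vectors store one shared entry
  if (!union.noNulls && union.isNull[r]) {
    return null;                         // the union value itself is null
  }
  int tag = union.tags[r];               // selector: which child holds the value
  ColumnVector child = union.fields[tag];
  if (child instanceof LongColumnVector) {
    return ((LongColumnVector) child).vector[r];
  }
  return null;  // unprojected or unhandled child: value reads as null, tag still known
}
```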

return this;
}

public WriteBuilder tableProperties(Properties properties) {
Contributor

Is there a reason to use Properties to pass options? I'd normally use a config(String, String) method instead.

Contributor Author

Ok, what I was trying to do was:

  • Not create a new Configuration. They are expensive in both time and memory.
  • Not change the Configuration that was passed in, since I don't own it.

That said, I'm not against changing it so that I clone the passed-in Configuration. I've had bad experiences making changes to a Configuration that was passed in, so I'd rather avoid that.

Contributor

I'd prefer to hide the Properties so that the caller only has to work with the builder. Properties can be used to accumulate the configuration and passed in. If this is a good way to avoid needing to change the Configuration, then I'm all for it.
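
A sketch of the config(String, String) shape being agreed on here, with Properties kept as an internal accumulator (names are illustrative, not the PR's final API):

```java
import java.util.Properties;

public class WriteBuilder {
  private final Properties config = new Properties();  // internal accumulator only

  // Callers set options one pair at a time and never see Properties directly.
  public WriteBuilder config(String key, String value) {
    config.setProperty(key, value);
    return this;
  }

  // At build time the accumulated pairs are applied to the ORC writer options
  // (or a cloned Configuration), so the caller's Configuration is never mutated.
}
```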

@@ -68,6 +68,12 @@ public UpdateProperties remove(String key) {
return this;
}

@Override
public UpdateProperties format(FileFormat format) {
Contributor

I like adding the method, but it should be clear that this is the default file format for the table. Tables, by design, can contain multiple file formats so you can change from one format to another and so you can write from a streaming system to Avro and then compact to a long-term storage format later.

Maybe this should be preferredFormat or defaultFormat?

Contributor Author

Let's go with defaultFormat.
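
What defaultFormat boils down to is setting a table property. A sketch, assuming the property key Iceberg uses today ("write.format.default"), which may not match the name used in 2018:

```java
import java.util.Locale;

// Sketch of the body inside the UpdateProperties implementation class:
// the "default format" is just a table property that writers consult.
public UpdateProperties defaultFormat(FileFormat format) {
  set("write.format.default", format.name().toLowerCase(Locale.ROOT));  // key assumed
  return this;
}
```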

return this;
}

public WriteBuilder conf(Configuration conf) {
Contributor

I'm trying to avoid Hadoop classes in APIs like this (even if they aren't in the iceberg-api module). In Parquet, we added config that adds properties to the internal Configuration. That configuration is passed in as part of a HadoopOutputFile instance. If HadoopOutputFile isn't used, then the writer should fall back to new Configuration() if it is required.

Contributor Author

Creating a new Configuration object is very expensive. You really want to avoid doing that at all costs, and it limits the configuration to the defaults. ORC's API needs a Hadoop Configuration, so if I don't pass one down, I'll need to create it and deprive the user of the ability to pass down a context-specific Configuration.

Contributor

I think the right way to pass that configuration is through a HadoopOutputFile. If there is a Configuration to pass, then the file passed in will be a Hadoop one and you can use it. If it isn't, then you'd have to create a configuration anyway in order to use ORC because it requires one. I guess a better way to describe it is not avoiding Hadoop classes, but keeping them in Hadoop-specific areas (not in API) so we can possibly remove the classes later.

For this, I think just moving to a config(String, String) method is a good idea, and using those configs to update a Configuration from the HadoopOutputFile instance.

Contributor Author

I can do that, but I don't see why making a strong dependence on HadoopOutputFile and HadoopInputFile is ok, but Configuration is bad. In either case, you aren't going to be able to compile or run the Iceberg code without hadoop-common on the class path.

Contributor

It just partitions Hadoop classes in their own package so that we don't leak them in the main parts of the API. The eventual goal is to support tables without pulling in Hadoop.
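
A sketch of the fallback described above. The package and accessor names (getConf on HadoopOutputFile in particular) are assumed for illustration and may differ from the actual classes:

```java
import org.apache.hadoop.conf.Configuration;

import com.netflix.iceberg.hadoop.HadoopOutputFile;
import com.netflix.iceberg.io.OutputFile;

// Use the Configuration carried by a Hadoop-backed file when there is one;
// otherwise fall back to defaults, since ORC requires a Configuration.
static Configuration confFor(OutputFile file) {
  if (file instanceof HadoopOutputFile) {
    // copy rather than mutate the caller's Configuration
    return new Configuration(((HadoopOutputFile) file).getConf());  // accessor assumed
  }
  return new Configuration();
}
```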

return this;
}

public ReadBuilder conf(Configuration conf) {
Contributor

Same thing here. I'd like to avoid passing Configuration where possible.

* @return the ORC schema
*/
public static TypeDescription toOrc(Schema schema,
List<Integer> columnIds) {
Contributor

Is there a better way to pass column IDs back? Isn't this dependent on the order you use to traverse the Schema? Why not use a BiMap between column ID and full column names?

In other places, we use a BiMap. One key part of that is that you only ever convert in one direction: from components like ["a", "b"] to a full name, "a.b". That way, you never have to parse the names and avoid the problem of fields named "a.b".

Contributor Author

Ugh. I avoid Guava at all costs; they have a really bad track record of breaking API compatibility.

I haven't done the schema evolution stuff yet, so maybe. I mostly need these so that I can stringify the schema to store it in the file.

All ORC TypeDescriptions have automatically assigned ids that run from 0 (for the root) to N-1 (for the right-most leaf). So mapping from the ORC id to the Iceberg id is easy given the list.

Contributor

My main concern is that this depends on the order of traversal, which is easy to accidentally break. Since this is a critical piece of information, I'd like to see a different solution to track IDs. It doesn't have to be guava (though we do use guava elsewhere because it provides so much value).

Contributor Author

The relevant constraint is that columnIds.get(orcSchema.getId()) gives you the Iceberg id.

This code does depend on the fact that ORC's ids are assigned sequentially in a pre-order traversal. On the other hand, changing that would be a serious change to ORC's API.

But fine, I'll change this to a Map.
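
The Map-based bookkeeping agreed on here is small; a sketch of the shape (names are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.orc.TypeDescription;

// Record the Iceberg field id keyed by ORC's automatically assigned type id,
// so lookups no longer depend on the traversal order of a parallel list.
class ColumnIdMap {
  private final Map<Integer, Integer> icebergIdByOrcId = new HashMap<>();

  void record(TypeDescription orcType, int icebergId) {
    icebergIdByOrcId.put(orcType.getId(), icebergId);
  }

  Integer icebergIdFor(TypeDescription orcType) {
    return icebergIdByOrcId.get(orcType.getId());
  }
}
```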

}
case LIST: {
TypeDescription child = schema.getChildren().get(0);
return Types.ListType.ofOptional(columnIds[child.getId()],
Contributor

ORC's column ID is the ordinal?

Contributor Author

orcType.getId() gives you the automatically assigned id number above.

@@ -103,7 +113,7 @@ public Object map(Types.MapType map, Supplier<Object> valueResult) {

Map<String, Object> result = Maps.newLinkedHashMap();
for (int i = 0; i < numEntries; i += 1) {
- String key = randomString(random) + i; // add i to ensure no collisions
+ String key = randomString(random).toString() + i; // add i to ensure no collisions
Contributor

Why would randomString return something that isn't a String?

Contributor Author

randomString returns a UTF8String.

Contributor

Ah, makes sense. Thanks!

Contributor

rdblue commented Mar 2, 2018

This looks promising. I think we should be able to get everything working.

I added ORC to TestDataFrameWrites and tried it out but hit a couple of test failures with it. You might want to test that next.

Would you like to open PRs for some of the independent changes in here? Moving SimpleRecord out and the changes to RandomData are good candidates to get some of the changes in right away.

Contributor Author

omalley commented Mar 6, 2018

Ok, I've updated the branch with the changes based on the comments.

Contributor Author

omalley commented Mar 6, 2018

Snyk isn't letting me see what the problem is. Should we add a Travis CI build?

Contributor

rdblue commented Mar 6, 2018

I'm all for Travis CI, but we need a Parquet release first. I'm working on that next.

Contributor Author

omalley commented Mar 6, 2018

Ok, I think I have all of the feedback resolved.

Contributor Author

omalley commented Mar 6, 2018

I assume you'll squash the commits before merging. I've just left them as separate commits to make review easier.

Contributor Author

omalley commented Mar 7, 2018

With the last commit, TestDataFrameWrites works with ORC as well.

Contributor

rdblue commented Mar 8, 2018

I'm on leave today and tomorrow; I'll review the changes on Monday. Thanks!

Contributor

rdblue commented Mar 13, 2018

Looks good to me. I'll do a little more digging into the random data tomorrow to make sure it covers everything, but otherwise I think this is ready to commit.

Contributor

rdblue commented Mar 13, 2018

@omalley, I have a few minor updates. Do you want me to include them when I squash, or do you want me to open a PR for your branch?

Contributor Author

omalley commented Mar 13, 2018

You can go ahead and include them in the squash. What were they?

Contributor

rdblue commented Mar 13, 2018

Fix spelling (primative -> primitive), pass floats through for validation, and use strict equality when checking floats and doubles because for a storage system, the bits really should be identical. I'm also adding generic type args for maps and lists in some places.

Contributor Author

omalley commented Mar 13, 2018

I've been burned so badly by using equality for floats and doubles over the years that I always avoid it. Even for a storage system, cases like non-normalized values or the various NaN encodings will cause problems. In this case, because we are generating the data, it is no doubt fine. I still wouldn't recommend it.

Contributor

rdblue commented Mar 13, 2018

I think the good outweighs the bad here. A storage system should guarantee that the bits you pass in are the bits you get out, and we should verify that's the case. For values like NaN where it could be reasonable to normalize, we should explicitly test for it and not rely on undefined behavior.
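
A sketch of what that strict check amounts to (hypothetical helper, not the test code in this PR):

```java
// Compare raw bit patterns so that even distinct NaN encodings and signed
// zeros must round-trip through the file unchanged.
static void assertSameBits(double expected, double actual) {
  if (Double.doubleToRawLongBits(expected) != Double.doubleToRawLongBits(actual)) {
    throw new AssertionError("Double bits changed: expected " + expected
        + " but was " + actual);
  }
}
```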

Contributor

rdblue commented Mar 13, 2018

Merged as c59138e. Thanks for contributing, @omalley!

Would you mind opening a few issues to cover the remaining features that we need for ORC before it is ready for a release?

@rdblue rdblue closed this Mar 13, 2018