Use corc to read and write data in the Optimized Row Columnar (ORC) file format in your Cascading applications. The reading of ACID datasets is also supported.
This project is no longer in active development.
You can obtain corc from Maven Central.
Corc has been built and tested against Cascading 3.3.0.
Corc is built with Hive 2.3.4. Several dependencies will need to be included when using Corc:
<dependency>
  <groupId>org.apache.hive</groupId>
  <artifactId>hive-exec</artifactId>
  <version>2.3.4</version>
  <classifier>core</classifier>
</dependency>
<dependency>
  <groupId>org.apache.hive</groupId>
  <artifactId>hive-serde</artifactId>
  <version>2.3.4</version>
</dependency>
<dependency>
  <groupId>com.esotericsoftware.kryo</groupId>
  <artifactId>kryo</artifactId>
  <version>2.22</version>
</dependency>
<dependency>
  <groupId>com.google.protobuf</groupId>
  <artifactId>protobuf-java</artifactId>
  <version>2.5.0</version>
</dependency>
Hive | Cascading/Java |
---|---|
STRING | String |
BOOLEAN | Boolean |
TINYINT | Byte |
SMALLINT | Short |
INT | Integer |
BIGINT | Long |
FLOAT | Float |
DOUBLE | Double |
TIMESTAMP | java.sql.Timestamp |
DATE | java.sql.Date |
BINARY | byte[] |
CHAR | String (HiveChar) |
VARCHAR | String (HiveVarchar) |
DECIMAL | BigDecimal (HiveDecimal) |
ARRAY | List<Object> |
MAP | Map<Object, Object> |
STRUCT | List<Object> |
UNIONTYPE | Sub-type |
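
As an illustration only, the mapping in the table above can be expressed as a small lookup. The sketch below uses plain Java with no Corc or Hive dependencies; `HiveTypeMapping` and `javaTypeFor` are hypothetical names introduced here and are not part of Corc's API. UNIONTYPE is deliberately left out because its Java type depends on the sub-type in use.

```java
import java.math.BigDecimal;
import java.sql.Date;
import java.sql.Timestamp;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper (not part of Corc): mirrors the Hive -> Cascading/Java table.
final class HiveTypeMapping {

  private static final Map<String, Class<?>> JAVA_TYPES = new LinkedHashMap<>();

  static {
    JAVA_TYPES.put("string", String.class);
    JAVA_TYPES.put("boolean", Boolean.class);
    JAVA_TYPES.put("tinyint", Byte.class);
    JAVA_TYPES.put("smallint", Short.class);
    JAVA_TYPES.put("int", Integer.class);
    JAVA_TYPES.put("bigint", Long.class);
    JAVA_TYPES.put("float", Float.class);
    JAVA_TYPES.put("double", Double.class);
    JAVA_TYPES.put("timestamp", Timestamp.class);
    JAVA_TYPES.put("date", Date.class);
    JAVA_TYPES.put("binary", byte[].class);
    JAVA_TYPES.put("char", String.class);
    JAVA_TYPES.put("varchar", String.class);
    JAVA_TYPES.put("decimal", BigDecimal.class);
    JAVA_TYPES.put("array", List.class);
    JAVA_TYPES.put("map", Map.class);
    JAVA_TYPES.put("struct", List.class);
  }

  /** Looks up the Java type for a Hive type name such as "bigint" or "decimal(10,2)". */
  static Class<?> javaTypeFor(String hiveType) {
    // Drop any parameters: "decimal(10,2)" -> "decimal", "array<int>" -> "array".
    String base = hiveType.trim().toLowerCase(Locale.ROOT).split("[(<]", 2)[0];
    Class<?> javaType = JAVA_TYPES.get(base);
    if (javaType == null) {
      // Also reached for uniontype, whose Java type is that of the active sub-type.
      throw new IllegalArgumentException("No fixed Java type for Hive type: " + hiveType);
    }
    return javaType;
  }

  public static void main(String[] args) {
    System.out.println(javaTypeFor("BIGINT").getName());        // prints java.lang.Long
    System.out.println(javaTypeFor("decimal(10,2)").getName()); // prints java.math.BigDecimal
  }
}
```
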
OrcFile provides two public constructors: one for sourcing and one for sinking. However, these exist mainly to give flexibility to those who wish to extend the class. It is advised to construct an instance via the SourceBuilder and SinkBuilder classes.
Create a builder:
SourceBuilder builder = OrcFile.source();
Specify the fields that should be read. If the declared schema is a subset of the complete schema, then column projection will occur:
builder.declaredFields(fields);
// or
builder.columns(structTypeInfo);
// or
builder.columns(structTypeInfoString);
Specify the complete schema of the underlying ORC Files. This is only required for reading ORC Files that back a transactional Hive table. The default behaviour should be to obtain the schema from the ORC Files being read:
builder.schemaFromFile();
// or
builder.schema(fields);
// or
builder.schema(structTypeInfo);
// or
builder.schema(structTypeInfoString);
ORC Files support predicate pushdown. This allows whole row groups to be skipped if they do not contain any rows that match the given SearchArgument:
Fields message = new Fields("message", String.class);
SearchArgument searchArgument = SearchArgumentFactory.newBuilder()
    .startAnd()
    .equals(message, "hello")
    .end()
    .build();
builder.searchArgument(searchArgument);
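To give some intuition for why row groups can be skipped: ORC keeps minimum and maximum statistics per column for each row group, so a reader can rule out a group without decoding its rows. The following standalone sketch shows that idea for an equality predicate on a string column. It is a simplification under stated assumptions, not the actual ORC or Corc implementation, and `RowGroupSkippingSketch`, `RowGroupStats` and `rowGroupsToRead` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified illustration of min/max based row group skipping (not ORC's real code).
final class RowGroupSkippingSketch {

  /** Min/max statistics of one string column within one row group (hypothetical type). */
  static final class RowGroupStats {
    final String min;
    final String max;

    RowGroupStats(String min, String max) {
      this.min = min;
      this.max = max;
    }

    /** An equality predicate can only match if the value lies within [min, max]. */
    boolean mightContain(String value) {
      return min.compareTo(value) <= 0 && value.compareTo(max) <= 0;
    }
  }

  /** Returns the indexes of the row groups that still have to be read. */
  static List<Integer> rowGroupsToRead(List<RowGroupStats> stats, String value) {
    List<Integer> selected = new ArrayList<>();
    for (int i = 0; i < stats.size(); i++) {
      if (stats.get(i).mightContain(value)) {
        selected.add(i);
      }
    }
    return selected;
  }

  public static void main(String[] args) {
    List<RowGroupStats> stats = Arrays.asList(
        new RowGroupStats("apple", "fig"),
        new RowGroupStats("grape", "kiwi"),
        new RowGroupStats("lemon", "pear"));
    // Only the second group's range can contain "hello"; the others are skipped.
    System.out.println(rowGroupsToRead(stats, "hello")); // prints [1]
  }
}
```

Note that a group passing the check may still contain no matching rows; the statistics only allow groups to be excluded, never confirmed.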
When passing objects to the SearchArgument.Builder, care should be taken to choose the correct type:
Hive | Java |
---|---|
STRING | String |
BOOLEAN | Boolean |
TINYINT | Byte |
SMALLINT | Short |
INT | Integer |
BIGINT | Long |
FLOAT | Float |
DOUBLE | Double |
TIMESTAMP | java.sql.Timestamp |
DATE | org.apache.hadoop.hive.serde2.io.DateWritable |
CHAR | String (HiveChar) |
VARCHAR | String (HiveVarchar) |
DECIMAL | BigDecimal |
When reading ORC Files that back a transactional Hive table, include the VirtualColumn#ROWID ("ROW__ID") virtual column. The column will be prepended to the record's Fields:
builder.prependRowId();
Finally, build the OrcFile:
OrcFile orcFile = builder.build();
OrcFile orcFile = OrcFile.sink()
    .schema(schema)
    .build();
The schema parameter can be one of Fields, StructTypeInfo or the String representation of the StructTypeInfo. When providing a Fields instance, care must be taken when deciding how best to specify the types, as there is no one-to-one bidirectional mapping between Cascading types and Hive types. The TypeInfo is able to represent richer, more complex types. Consider your ORC File schema and the mappings to Fields types carefully.
List<String> names = new ArrayList<>();
names.add("col0");
names.add("col1");
List<TypeInfo> typeInfos = new ArrayList<>();
typeInfos.add(TypeInfoFactory.stringTypeInfo);
typeInfos.add(TypeInfoFactory.longTypeInfo);
StructTypeInfo structTypeInfo = (StructTypeInfo) TypeInfoFactory.getStructTypeInfo(names, typeInfos);
or...
String typeString = "struct<col0:string,col1:bigint>";
StructTypeInfo structTypeInfo = (StructTypeInfo) TypeInfoUtils.getTypeInfoFromTypeString(typeString);
or, via the convenience builder...
StructTypeInfo structTypeInfo = new StructTypeInfoBuilder()
    .add("col0", TypeInfoFactory.stringTypeInfo)
    .add("col1", TypeInfoFactory.longTypeInfo)
    .build();
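If the String representation is assembled programmatically, its shape is simply `struct<name:type,...>` with columns in declaration order. A minimal sketch in plain Java, assuming flat column names and Hive type names supplied as strings; `StructTypeStrings` and `structTypeString` are hypothetical helpers and not part of Corc or Hive:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper (not part of Corc or Hive): builds a Hive struct type string.
final class StructTypeStrings {

  /** Joins column name/type pairs, in iteration order, into "struct<a:t1,b:t2>". */
  static String structTypeString(Map<String, String> columns) {
    StringJoiner joiner = new StringJoiner(",", "struct<", ">");
    for (Map.Entry<String, String> column : columns.entrySet()) {
      joiner.add(column.getKey() + ":" + column.getValue());
    }
    return joiner.toString();
  }

  public static void main(String[] args) {
    // LinkedHashMap keeps insertion order, which determines the column order.
    Map<String, String> columns = new LinkedHashMap<>();
    columns.put("col0", "string");
    columns.put("col1", "bigint");
    System.out.println(structTypeString(columns)); // prints struct<col0:string,col1:bigint>
  }
}
```

The resulting string is of the same form as the typeString shown above.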
Corc also supports the reading of ACID datasets that underpin transactional Hive tables. However, for this to work effectively with an active Hive table you must provide your own lock management. We intend to make this functionality available in the cascading-hive project. When reading the data you may optionally include the virtual RecordIdentifier column, also known as the ROW__ID column, with one of the following approaches:
- Add a field named 'ROW__ID' to your Fields definition. This must be of type org.apache.hadoop.hive.ql.io.RecordIdentifier. For convenience you can use the constant OrcFile#ROW__ID with some fields arithmetic: myFields = Fields.join(OrcFile.ROW__ID, myFields);
- Use the OrcFile.source().prependRowId() option. Be sure to exclude the RecordIdentifier column from your typeInfo instance. The ROW__ID field will be added to your tuple stream automatically.
OrcFile can be used with Hfs, just like TextDelimited.
OrcFile orcFile = ...
String path = ...
Hfs hfs = new Hfs(orcFile, path);
Created by Dave Maughan & Elliot West, with thanks to: Patrick Duin, James Grant & Adrian Woodhead.
This project is available under the Apache 2.0 License.
Copyright 2015-2020 Expedia, Inc.