
Add bool description and miscellaneous cleanup.

1 parent 684acaa commit e1438f134bb6435ad188ca6735930b61eb76c530 @julienledem julienledem committed with nongli
Showing with 113 additions and 80 deletions.
  1. +57 −0 Encodings.md
  2. +0 −43 Encodings.txt
  3. +2 −2 Makefile
  4. +15 −3 README.md
  5. +8 −8 pom.xml
  6. +31 −24 src/thrift/{redfile.thrift → parquet.thrift}
57 Encodings.md
@@ -0,0 +1,57 @@
+
+Parquet encoding definitions
+====
+
+This file contains the specification of all supported encodings.
+
+### Plain:
+
+Supported Types: all
+
+This is the plain encoding that must be supported for all types. It is
+intended to be the simplest encoding. Values are encoded back to back.
+
+For native types, this outputs the data as little-endian. Floating
+point types are encoded in IEEE 754 format.
+
+For the byte array type, it encodes the length as a 4-byte little-endian
+value, followed by the bytes.
+
+### GroupVarInt:
+
+Supported Types: INT32, INT64
+
+32-bit ints are encoded in groups of 4, with 1 leading byte to encode the
+byte lengths of the following 4 ints.
+
+64-bit ints are encoded in groups of 5,
+with 2 leading bytes to encode the byte lengths of the 5 ints.
+
+For 32-bit ints, the leading byte contains 2 bits per int. Each length
+encoding specifies the number of bytes minus 1 for that int. For example,
+a byte value of 0b00101101 indicates that:
+
+ * the first int has 1 byte (0b00 + 1),
+ * the second int has 3 bytes (0b10 + 1),
+ * the third int has 4 bytes (0b11 + 1), and
+ * the 4th int has 2 bytes (0b01 + 1).
+
+In this case, the entire group would be: 1 + (1 + 3 + 4 + 2) = 11 bytes.
+The bytes that follow the leading byte are just the int data encoded in
+little-endian order. With this example:
+
+ * the first int starts at byte offset 1 with a max value of 0xFF,
+ * the second int starts at byte offset 2 with a max value of 0xFFFFFF,
+ * the third int starts at byte offset 5 with a max value of 0xFFFFFFFF, and
+ * the 4th int starts at byte offset 9 with a max value of 0xFFFF.
+
+For 64-bit ints, the lengths of the 5 ints are each encoded as 3 bits. Combined,
+this uses 15 bits and fits in 2 bytes. The most significant bit of the two bytes is unused.
+As in the 32-bit case, the data bytes follow the length bytes.
+
+In the case where the data does not make a complete group (e.g. 3 32-bit ints),
+a complete group should still be output, with 0's filling in the remainder.
+For example, if the input was (1,2,3,4,5), the resulting encoding should
+behave as if the input was (1,2,3,4,5,0,0,0), with the two groups
+encoded back to back.
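The 32-bit case above can be illustrated with a short encoder sketch. This is a hypothetical helper written against the description in this file, not code from the repository; the MSB-first placement of the 2-bit length pairs follows the 0b00101101 worked example.

```python
import struct

def encode_group_varint32(values):
    """Encode one complete group of 4 unsigned 32-bit ints:
    a leading byte with four 2-bit (byte_length - 1) fields,
    first int in the most significant pair, then each int in
    little-endian, trimmed to its byte length."""
    assert len(values) == 4
    tag = 0
    data = bytearray()
    for v in values:
        n = max(1, (v.bit_length() + 7) // 8)  # 1..4 bytes per int
        tag = (tag << 2) | (n - 1)             # pack length bits, first int leftmost
        data += struct.pack("<I", v)[:n]       # little-endian, trimmed
    return bytes([tag]) + bytes(data)

# Lengths 1, 3, 4, 2 reproduce the worked example: tag byte 0b00101101
# and a total group size of 1 + (1 + 3 + 4 + 2) = 11 bytes.
group = encode_group_varint32([0xFF, 0x123456, 0x12345678, 0x1234])
```

An incomplete final group would be zero-padded to a full group before encoding, as the last paragraph above specifies.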
43 Encodings.txt
@@ -1,43 +0,0 @@
-This file contains the specification of all supported encodings.
-
-Plain:
- - Supported Types: all
- This is the plain encoding that must be supported for types. It is
- intended to be the simplest encoding. Values are encoded back to back.
- - For native types, this outputs the data as little endian. Floating
- point types are encoded in IEEE.
- - For the byte array type, it encodes the length as a 4 byte little
- endian, followed by the bytes.
-
-GroupVarInt:
- - Supported Types: INT32, INT64
- 32-bit ints are encoded in groups of 4 with 1 leading bytes to encode the
- byte length of the following 4 ints. 64-bit are encoded in groups of 5,
- with 2 leading bytes to encode the byte length of the 5 ints.
-
- For 32-bit ints, the leading byte contains 2 bits per int. Each length
- encoding specifies the number of bytes minus 1 for that int. For example
- a byte value of 0b00101101, indicates that:
- the first int has 1 byte (0b00 + 1),
- the second int has 3 bytes (0b10 + 1),
- the third int has 4 bytes (0b11 + 1), and
- the 4th int has 2 bytes (0b01 + 1)
-
- In this case, the entire row group would be: 1 + (1 + 3 + 4 + 2) = 11 bytes.
- The bytes that follow the leading byte is just the int data encoded in little
- endian. With this example:
- the first int starts at byte offset 1 with a max value of 0xFF,
- the second int starts at byte offset 2 with a max value of 0xFFFFFF,
- the third int starts at byte offset 5 with a max value of 0xFFFFFFFF, and
- the 4th int starts at byte ofset 9 with a max value of 0xFFFF.
-
- For 64-bit ints, the lengths of the 5 ints is encoded as 3 bits. Combined,
- this uses 15 bits and fits in 2 bytes. The msb of the two bytes is unused.
- Like the 32-bit case, after the length bytes, the data bytes follow.
-
- In the case where the data does not make a complete group, (e.g. 3 32-bit ints),
- a complete group should still be output with 0's filling in for the remainder.
- For example, if the input was (1,2,3,4,5): the resulting encoding should
- behave as if the input was (1,2,3,4,5,0,0,0) and the two groups should be
- encoded back to back.
-
4 Makefile
@@ -1,4 +1,4 @@
thrift:
mkdir -p generated
- thrift --gen cpp -o generated src/thrift/redfile.thrift
- thrift --gen java -o generated src/thrift/redfile.thrift
+ thrift --gen cpp -o generated src/thrift/parquet.thrift
+ thrift --gen java -o generated src/thrift/parquet.thrift
18 README.md
@@ -1,6 +1,11 @@
-redfile [![Build Status](https://travis-ci.org/twitter/redfile.png?branch=master)](redfile)
+Parquet [![Build Status](https://travis-ci.org/twitter/parquet-format.png?branch=master)](parquet)
======
-Redfile is a columnar storage format that supports nested data.
+Parquet is a columnar storage format that supports nested data.
+
+Parquet metadata is encoded using Apache Thrift.
+
+The Parquet-format project contains all Thrift definitions that are necessary to create readers
+and writers for Parquet files.
## Glossary
- Block (hdfs block): This means a block in hdfs and the meaning is
@@ -80,7 +85,7 @@ readers and writers for the format. The types are:
- BYTE_ARRAY: arbitrarily long byte arrays.
## Nested Encoding
-To encode nested columns, redfile uses the dremel encoding with definition and
+To encode nested columns, Parquet uses the Dremel encoding with definition and
repetition levels. Definition levels specify how many optional fields in the
path for the column are defined. Repetition levels specify at what repeated field
in the path has the value repeated. The max definition and repetition levels can
@@ -100,6 +105,13 @@ The run length encoding is serialized as follows:
- 0 as one bit
- the value encoded in w bits
+To sum up:
+ - the first bit is 1 if we're storing a repeated value [1][value][count]
+ - it is 0 if we're storing the value without repetition count [0][value]
+ - 0 or 1 is stored as 1 bit
+ - value is stored as w bits
+ - count is stored as var int
+
For repetition levels, the levels are bit packed as tightly as possible,
rounding up to the nearest byte. For example, if the max repetition level was 3
(2 bits) and the max definition level was 3 (2 bits), to encode 30 values, we would
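The bit-packing arithmetic above can be sketched as follows. This is an illustrative helper, not project code; the LSB-first bit order within each byte is an assumption, since the text here only fixes the total size (30 values at 2 bits round up to 8 bytes).

```python
def bit_pack(levels, width):
    """Pack small non-negative ints as tightly as possible, `width` bits
    each, filling each byte from the low bits first (assumed order),
    and rounding the total up to the nearest whole byte."""
    out = bytearray()
    buf = 0      # pending bits, low bits are oldest
    nbits = 0    # number of pending bits in buf
    for v in levels:
        buf |= v << nbits
        nbits += width
        while nbits >= 8:
            out.append(buf & 0xFF)
            buf >>= 8
            nbits -= 8
    if nbits:                     # round up to a whole byte
        out.append(buf & 0xFF)
    return bytes(out)

# The example above: 30 levels at 2 bits each -> 60 bits -> 8 bytes.
packed = bit_pack([3] * 30, 2)
```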
16 pom.xml
@@ -2,19 +2,19 @@
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
- <groupId>redfile</groupId>
- <artifactId>redfile-metadata</artifactId>
+ <groupId>parquet</groupId>
+ <artifactId>parquet-format</artifactId>
<version>1.0.0-SNAPSHOT</version>
<packaging>jar</packaging>
- <name>redfile metadata</name>
- <url>http://github.com/redfile</url>
- <description>Redfile is a columnar storage format that supports nested data</description>
+ <name>parquet format metadata</name>
+ <url>http://github.com/Parquet/parquet-format</url>
+ <description>Parquet is a columnar storage format that supports nested data. This provides all generated metadata code.</description>
<scm>
- <connection>scm:git:git@github.com:redfile/redfile.git</connection>
- <url>scm:git:git@github.com:redfile/redfile.git</url>
- <developerConnection>scm:git:git@github.com:redfile/redfile.git</developerConnection>
+ <connection>scm:git:git@github.com:Parquet/parquet-format.git</connection>
+ <url>scm:git:git@github.com:Parquet/parquet-format.git</url>
+ <developerConnection>scm:git:git@github.com:Parquet/parquet-format.git</developerConnection>
</scm>
<licenses>
55 src/thrift/redfile.thrift → src/thrift/parquet.thrift
@@ -17,14 +17,14 @@
*/
/**
- * File format description for the redfile file format
+ * File format description for the parquet file format
*/
-namespace cpp redfile
-namespace java redfile
+namespace cpp parquet
+namespace java parquet.format
/**
- * Types supported by redfile. These types are intended to be for the storage
- * format, and in particular how they interact with different encodings.
+ * Types supported by Parquet. These types are intended to be used in combination
+ * with the encodings to control the on disk storage format.
* For example INT16 is not included as a type since a good encoding of INT32
* would handle this.
*/
@@ -45,11 +45,15 @@ enum Type {
enum ConvertedType {
/** a BYTE_ARRAY actually contains UTF8 encoded chars */
UTF8 = 0;
+
/** a map is converted as an optional field containing a repeated key/value pair */
MAP = 1;
+
/** a key/value pair is converted into a group of two fields */
MAP_KEY_VALUE = 2;
- /** a list is converted into an optional field containing a repeated field for its values */
+
+ /** a list is converted into an optional field containing a repeated field for its
+ * values */
LIST = 3;
}
@@ -73,39 +77,41 @@ enum FieldRepetitionType {
/**
* Represents a element inside a schema definition.
- * if it is a group (inner node) then type is undefined and children_indices is defined
- * if it is a primitive type (leaf) then type is defined and children_indices is undefined
+ * if it is a group (inner node) then type is undefined and num_children is defined
+ * if it is a primitive type (leaf) then type is defined and num_children is undefined
+ * the nodes are listed in depth first traversal order.
*/
struct SchemaElement {
- /** Data type for this field. e.g. int32 **/
+ /** Data type for this field. e.g. int32
+ * not set if the current element is a group node **/
1: optional Type type;
- /** repetition of the field. The root of the schema does not have a field_type.
+ /** repetition of the field. The root of the schema does not have a repetition_type.
* All other nodes must have one **/
- 2: optional FieldRepetitionType field_type;
+ 2: optional FieldRepetitionType repetition_type;
/** Name of the field in the schema **/
3: required string name;
/** Nested fields. Since thrift does not support nested fields,
- * the nesting is flattened to a single list. These indices
- * are used to construct the nested relationship.
- * each index refers to its position in the list.
+ * the nesting is flattened to a single list by a depth first traversal.
+ * The children count is used to construct the nested relationship.
+ * This field is not set when the element is a primitive type
**/
- 4: optional list<i32> children_indices;
+ 4: optional i32 num_children;
/** When the schema is the result of a conversion from another model
- * Used to record the original type to help with cross conversion
+ * Used to record the original type to help with cross conversion.
**/
5: optional ConvertedType converted_type;
}
/**
- * Encodings supported by redfile. Not all encodings are valid for all types.
+ * Encodings supported by Parquet. Not all encodings are valid for all types.
*/
enum Encoding {
/** Default encoding.
- * BOOLEAN - 1 bit per value.
+ * BOOLEAN - 1 bit per value. 0 is false; 1 is true.
* INT32 - 4 bytes per value. Stored as little-endian.
* INT64 - 8 bytes per value. Stored as little-endian.
* FLOAT - 4 bytes per value. IEEE. Stored as little-endian.
@@ -139,9 +145,6 @@ struct DataPageHeader {
/** Encoding used for this data page **/
2: required Encoding encoding
-
- /** TODO: should this contain min/max for this page? It could also be stored in an index page **/
-
}
struct IndexPageHeader {
@@ -194,10 +197,10 @@ struct ColumnMetaData {
/** Number of values in this column **/
5: required i64 num_values
- /** total of uncompressed pages size in bytes **/
+ /** total byte size of all uncompressed pages in this column chunk **/
6: required i64 total_uncompressed_size
- /** total of compressed pages size in bytes **/
+ /** total byte size of all compressed pages in this column chunk **/
7: required i64 total_compressed_size
/** Byte offset from beginning of file to first data page **/
@@ -228,8 +231,10 @@ struct ColumnChunk {
struct RowGroup {
1: required list<ColumnChunk> columns
+
/** Total byte size of all the uncompressed column data in this row group **/
2: required i64 total_byte_size
+
/** Number of rows in this row group **/
3: required i64 num_rows
}
@@ -241,7 +246,9 @@ struct FileMetaData {
/** Version of this file **/
1: required i32 version
- /** Schema for this file. **/
+ /** Schema for this file.
+ * Nodes are listed in depth-first traversal order.
+ * The first element is the root **/
2: required list<SchemaElement> schema;
/** Number of rows in this file **/
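The comments above describe a schema flattened by depth-first traversal, with num_children linking parents to children. A sketch of the reconstruction, using hypothetical (name, num_children) tuples rather than the actual generated Thrift classes:

```python
def build_tree(elements, pos=0):
    """Rebuild the nested schema from a depth-first flattened list.
    Each element is a (name, num_children) tuple; num_children is
    None for primitive leaves, per the SchemaElement comments."""
    name, num_children = elements[pos]
    pos += 1
    children = []
    for _ in range(num_children or 0):
        child, pos = build_tree(elements, pos)   # children follow in DFS order
        children.append(child)
    return {"name": name, "children": children}, pos

# Root "schema" has two children: leaf "id" and group "links",
# which has a single leaf child "backward".
flat = [("schema", 2), ("id", None), ("links", 1), ("backward", None)]
root, _ = build_tree(flat)
```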
