Update readme

add repo for maven-thrift-plugin
Add 'sonatype-oss-release' maven profile
Change groupId to com.twitter
1 parent b74e715, commit 9810de6701cef3fafa1cfd4ff56db836bfc551de, nongli committed Mar 4, 2013
Showing with 103 additions and 8 deletions.
  1. +30 −7 README.md
  2. +73 −1 pom.xml
37 README.md
@@ -93,7 +93,16 @@ be computed from the schema (i.e. how much nesting is there). This defines the
maximum number of bits required to store the levels (levels are defined for all
values in the column).
-For the definition levels, the values are encoded using run length encoding.
+Two encodings for the levels are supported in the initial version.
+
+### Bit-packed
+The first is a bit-packed encoding. Each level is encoded in the minimum number of bits, and the
+values are packed back to back. There is no padding between values (except in the last byte).
+For example, if the max repetition level was 3 (2 bits) and the max definition level was 3
+(2 bits), to encode 30 values, we would have 30 * 2 = 60 bits = 8 bytes.
+
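The size arithmetic for the bit-packed encoding can be sketched in a few lines of Python. This is an illustration only, not code from the repository: the function names `level_bit_width` and `bit_packed_size` are hypothetical, and the sketch computes sizes only, since the README text does not state the bit order within a byte.

```python
def level_bit_width(max_level: int) -> int:
    """Bits needed to store a level in [0, max_level].

    For non-negative integers this equals ceil(log2(max_level + 1)).
    """
    return max_level.bit_length()


def bit_packed_size(num_values: int, bit_width: int) -> int:
    """Bytes needed to pack num_values back to back.

    Only the last byte may contain padding bits.
    """
    total_bits = num_values * bit_width
    return (total_bits + 7) // 8  # round up to a whole byte


width = level_bit_width(3)                # max level 3 -> 2 bits
print(width, bit_packed_size(30, width))  # prints "2 8"
```

With 30 values at 2 bits each, the 60 bits round up to 8 bytes, matching the example in the README text.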
+### RLE
+The second encoding is bit-packed run-length-encoding.
The run length encoding is serialized as follows:
- let max be the maximum definition level (determined by the schema)
- let w be the width in bits required to encode a definition level value. w = ceil(log2(max + 1))
@@ -112,10 +121,7 @@ To sum up:
- value is stored as w bits
- count is stored as var int
-For repetition levels, the levels are bit packed as tightly as possible,
-rounding up to the nearest byte. For example, if the max repetition level was 3
-(2 bits) and the max definition level as 3 (2 bits), to encode 30 values, we would
-have 30 * 2 = 60 bits = 8 bytes.
+The size of all the RLE data comes before the encoded data. The length is encoded in little endian.
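A length-prefixed layout like this could be framed and read back roughly as follows. This is a sketch under a loud assumption: the README text only says the length is little endian, so the 4-byte unsigned width used here is assumed, and the helper names `frame_rle_levels` and `read_rle_levels` are hypothetical.

```python
import struct

# ASSUMPTION: the length prefix is a 4-byte unsigned integer.
# The README text only states that it is little endian.
LENGTH_FORMAT = "<I"
LENGTH_SIZE = struct.calcsize(LENGTH_FORMAT)


def frame_rle_levels(rle_data: bytes) -> bytes:
    """Prepend the little-endian byte length to already-encoded RLE data."""
    return struct.pack(LENGTH_FORMAT, len(rle_data)) + rle_data


def read_rle_levels(buf: bytes):
    """Split a buffer into (rle_data, remaining_bytes) using the length prefix."""
    (size,) = struct.unpack_from(LENGTH_FORMAT, buf, 0)
    start = LENGTH_SIZE
    return buf[start:start + size], buf[start + size:]


framed = frame_rle_levels(b"\x01\x02\x03")
levels, rest = read_rle_levels(framed + b"values")
```

Writing the size first lets a reader skip over the level data without decoding it.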
## Nulls
Nullity is encoded in the definition levels (which is run-length encoded). NULL values
@@ -125,8 +131,20 @@ nothing else.
## Data Pages
For data pages, the 3 pieces of information are encoded back to back, after the page
-header. We'll have the definition levels, followed by repetition levels, followed
-by the encoded values. The size of specified in the header is for all 3 pieces combined.
+header. We have the
+ - definition levels data,
+ - repetition levels data,
+ - encoded values.
+The size specified in the header is for all 3 pieces combined.
+
+The data for the data page is always required. The definition and repetition levels
+are optional, based on the schema definition. If the column is not nested (i.e.
+the path to the column has length 1), we do not encode the repetition levels (they would
+always have the value 1). For data that is required, the definition levels are
+skipped (if encoded, they would always have the value of the max definition level).
+
+For example, in the case where the column is non-nested and required, the data in the
+page is only the encoded values.
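The rule for which pieces appear in a data page can be written as a small decision function. This is a minimal sketch of the rule as worded in the README text, not the project's implementation: `data_page_sections` and its parameters are hypothetical names, and a single `required` flag is a simplification of the schema.

```python
def data_page_sections(path_length: int, required: bool) -> list:
    """Return the pieces written after the page header, in order.

    path_length: length of the path to the column (1 means not nested).
    required:    whether the column's data is required (simplified flag).
    """
    sections = []
    if not required:
        # Required data would always carry the max definition level,
        # so the definition levels are skipped for it.
        sections.append("definition levels")
    if path_length > 1:
        # A non-nested column would always carry the same repetition
        # level, so the repetition levels are not encoded for it.
        sections.append("repetition levels")
    sections.append("encoded values")  # always present
    return sections


print(data_page_sections(1, True))   # prints "['encoded values']"
print(data_page_sections(3, False))
```

A reader needs the same schema information to know which pieces to expect, since the header size covers all of them combined.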
## Column chunks
Column chunks are composed of pages written back to back. The pages share a common
@@ -152,6 +170,11 @@ Each file metadata would be cumulative and include all the row groups written so
far. Combining this with the strategy used for rc or avro files using sync markers,
a reader could recover partially written files.
+## Separating metadata and column data
+The format is explicitly designed to separate the metadata from the data. This
+allows splitting columns into multiple files as well as having a single metadata
+file reference multiple parquet files.
+
## Configurations
- Row group size: Larger row groups allow for larger column chunks which makes it
possible to do larger sequential IO. Larger groups also require more buffering in
74 pom.xml
@@ -2,7 +2,7 @@
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
- <groupId>parquet</groupId>
+ <groupId>com.twitter</groupId>
<artifactId>parquet-format</artifactId>
<version>1.0.0-SNAPSHOT</version>
<packaging>jar</packaging>
@@ -61,13 +61,36 @@
</repository>
</repositories>
+ <!-- this is needed for maven-thrift-plugin, would like to remove this.
+ see: https://issues.apache.org/jira/browse/THRIFT-1536 -->
+ <pluginRepositories>
+ <pluginRepository>
+ <id>Twitter public Maven repo</id>
+ <url>http://maven.twttr.com</url>
+ </pluginRepository>
+ </pluginRepositories>
+
<properties>
<maven.compiler.source>1.6</maven.compiler.source>
<maven.compiler.target>1.6</maven.compiler.target>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>
<build>
+ <pluginManagement>
+ <plugins>
+ <plugin>
+ <groupId>org.apache.maven.plugins</groupId>
+ <artifactId>maven-release-plugin</artifactId>
+ <version>2.1</version>
+ <configuration>
+ <mavenExecutorId>forked-path</mavenExecutorId>
+ <useReleaseProfile>false</useReleaseProfile>
+ <arguments>-Psonatype-oss-release</arguments>
+ </configuration>
+ </plugin>
+ </plugins>
+ </pluginManagement>
<plugins>
<!-- thrift -->
<plugin>
@@ -108,4 +131,53 @@
<scope>test</scope>
</dependency>
</dependencies>
+ <profiles>
+ <profile>
+ <id>sonatype-oss-release</id>
+ <build>
+ <plugins>
+ <plugin>
+ <groupId>org.apache.maven.plugins</groupId>
+ <artifactId>maven-source-plugin</artifactId>
+ <version>2.1.2</version>
+ <executions>
+ <execution>
+ <id>attach-sources</id>
+ <goals>
+ <goal>jar-no-fork</goal>
+ </goals>
+ </execution>
+ </executions>
+ </plugin>
+ <plugin>
+ <groupId>org.apache.maven.plugins</groupId>
+ <artifactId>maven-javadoc-plugin</artifactId>
+ <version>2.7</version>
+ <executions>
+ <execution>
+ <id>attach-javadocs</id>
+ <goals>
+ <goal>jar</goal>
+ </goals>
+ </execution>
+ </executions>
+ </plugin>
+ <plugin>
+ <groupId>org.apache.maven.plugins</groupId>
+ <artifactId>maven-gpg-plugin</artifactId>
+ <version>1.1</version>
+ <executions>
+ <execution>
+ <id>sign-artifacts</id>
+ <phase>verify</phase>
+ <goals>
+ <goal>sign</goal>
+ </goals>
+ </execution>
+ </executions>
+ </plugin>
+ </plugins>
+ </build>
+ </profile>
+ </profiles>
</project>
