Parquet input plugin for Embulk
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
config/checkstyle
embulk-input-parquet_hadoop
gradle/wrapper
parquet-msgpack
.gitignore
LICENSE
NOTICE
README.md
build.gradle
gradlew
gradlew.bat
settings.gradle

README.md

Parquet Hadoop input plugin for Embulk

  • Read Parquet files via Hadoop FileSystem.
  • Outputs a single record named record (type is json).

Overview

  • Plugin type: input
  • Resume supported: no
  • Cleanup supported: no
  • Guess supported: no

Configuration

  • config_files: list of paths to Hadoop's configuration files (array of strings, default: [])
  • config: overwrites configuration parameters (hash, default: {})
  • path: file path on Hdfs. you can use glob pattern (string, required).
  • parquet_log_level: set log level of parquet reader module (string, default: "INFO")
    • value is one of java.util.logging.Level (ALL, SEVERE, WARNING, INFO, CONFIG, FINE, FINER, FINEST, OFF)

Hadoop Configuration

Describe parquet reader specific configuration.

  • parquet.read.bad.record.threshold: Tolerated percent bad records per file (float, 0.0 to 1.0, default: 0)

Example

in:
  type: parquet_hadoop
  config_files:
    - /etc/hadoop/conf/core-site.xml
    - /etc/hadoop/conf/hdfs-site.xml
  config:
    parquet.read.bad.record.threshold: 0.01
  path: /user/hadoop/example/data/*.parquet
  parquet_log_level: WARNING

Build

$ ./gradlew gem  # -t to watch change of files and rebuild continuously

Notes

Why implement this as input plugin rather than parser plugin ?

Because to parsing Parquet format needs seekable file stream but parser plugin has only sequential read.

How map Parquet schema to json ?

TBD