target-hdfs is a Singer target for HDFS. Built with the Meltano Target SDK.
Install from PyPI:

```bash
pipx install target-hdfs
```

Install from GitHub:

```bash
pipx install git+https://github.com/Automattic/target-hdfs.git@main
```

A full list of supported settings and capabilities for this target is available by running:

```bash
target-hdfs --about
```

| Setting | Required | Default | Description |
|---|---|---|---|
| hdfs_destination_path | True | None | HDFS Destination Path |
| hdfs_block_size_limit | False | 85% of HDFS block size | HDFS block size limit (e.g. 200M). Defaults to 85% of the current HDFS block size. If an existing file is smaller than this limit, new data will be appended to it. |
| skip_existing_files | False | False | If set to true, data will not be appended to existing files. |
| compression_method | False | gzip | Compression method; must be supported by PyArrow. Currently available modes: snappy, zstd, brotli and gzip. |
| max_pyarrow_table_size | False | 800 | Max size of the PyArrow table in MB (before writing to a Parquet file). Controls the target's memory usage. |
| max_batch_size | False | 10000 | Max number of records to write in one batch. Controls the target's memory usage. |
| extra_fields | False | None | Extra fields to add to the flattened record. (e.g. extra_col1=value1,extra_col2=value2) |
| extra_fields_types | False | None | Extra fields types. (e.g. extra_col1=string,extra_col2=integer) |
| partition_cols | False | None | Columns to partition the output Parquet files by. (e.g. col1,col2) |
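For reference, a config file combining these settings might look like the following (paths and values are illustrative, not defaults):

```json
{
  "hdfs_destination_path": "/data/raw/carbon_intensity",
  "hdfs_block_size_limit": "200M",
  "compression_method": "gzip",
  "max_pyarrow_table_size": 800,
  "max_batch_size": 10000
}
```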
This Singer target will automatically import any environment variables from the working directory's `.env` file when `--config=ENV` is provided, so config values are picked up when a matching environment variable is set either in the terminal context or in the `.env` file.
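For example, the Meltano Target SDK maps each setting to an environment variable of the form `TARGET_HDFS_<SETTING_NAME>` (the destination path below is illustrative):

```bash
export TARGET_HDFS_HDFS_DESTINATION_PATH=/data/raw
export TARGET_HDFS_COMPRESSION_METHOD=gzip

tap-carbon-intensity | target-hdfs --config=ENV
```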
target-hdfs uses the configuration in the core-site.xml and hdfs-site.xml files to authenticate and authorize the connection to HDFS.
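As a minimal sketch, a core-site.xml pointing at the cluster's NameNode could look like this (host and port are illustrative):

```xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode.example.com:8020</value>
  </property>
</configuration>
```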
You can easily run target-hdfs by itself or in a pipeline using Meltano.
```bash
target-hdfs --version
target-hdfs --help
# Test using the "Carbon Intensity" sample:
tap-carbon-intensity | target-hdfs --config /path/to/target-hdfs-config.json
```

Follow these instructions to contribute to this project.
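For context, the pipe in the example above carries newline-delimited Singer messages from the tap's stdout to the target's stdin. A minimal sketch of such a stream (the stream and field names here are illustrative, not from tap-carbon-intensity):

```python
import json

# A Singer stream is newline-delimited JSON: a SCHEMA message describing
# the stream, followed by RECORD messages, and optionally STATE messages.
messages = [
    {"type": "SCHEMA", "stream": "intensity",
     "schema": {"properties": {"region": {"type": "string"},
                               "value": {"type": "integer"}}},
     "key_properties": []},
    {"type": "RECORD", "stream": "intensity",
     "record": {"region": "GB", "value": 212}},
    {"type": "STATE", "value": {"bookmarks": {"intensity": {}}}},
]

# This is what a tap writes to stdout and a target reads from stdin:
stream = "\n".join(json.dumps(m) for m in messages)
print(stream)
```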
```bash
pipx install poetry
poetry install
```

Create tests within the `tests` subfolder and then run:

```bash
poetry run pytest
```

You can also test the `target-hdfs` CLI interface directly using `poetry run`:

```bash
poetry run target-hdfs --help
```

### Testing with Meltano
Note: This target will work in any Singer environment and does not require Meltano. Examples here are for convenience and to streamline end-to-end orchestration scenarios.
Next, install Meltano (if you haven't already) and any needed plugins:
```bash
# Install meltano
pipx install meltano
# Initialize meltano within this directory
cd target-hdfs
meltano install
```

Now you can test and orchestrate using Meltano:
```bash
# Test invocation:
meltano invoke target-hdfs --version
# OR run a test `elt` pipeline with the Carbon Intensity sample tap:
meltano run tap-carbon-intensity target-hdfs
```

See the dev guide for more instructions on how to use the Meltano Singer SDK to develop your own Singer taps and targets.
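If you manage the pipeline with Meltano, the loader might be declared in your project's meltano.yml roughly like this (a sketch; the setting values are illustrative):

```yaml
plugins:
  loaders:
    - name: target-hdfs
      namespace: target_hdfs
      pip_url: git+https://github.com/Automattic/target-hdfs.git@main
      config:
        hdfs_destination_path: /data/raw
        compression_method: gzip
```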