Refactor ParquetPartitioningStreamWriter to transformer #118

@kevinwallimann

Description

ParquetPartitioningStreamWriter currently does two things: it adds two columns (a transformation) and writes the dataframe partitioned by them (a specialized write). With #116, these two responsibilities can be separated: ParquetStreamWriter is enhanced to support partitioned writes, leaving only the transformation to ParquetPartitioningStreamWriter.

Tasks

  • Refactor ParquetPartitioningStreamWriter to a transformer and rename
  • Merge AbstractParquetStreamWriter with ParquetStreamWriter

How to migrate Hyperdrive-Trigger

  1. Replace

     "component.writer=za.co.absa.hyperdrive.ingestor.implementation.writer.parquet.ParquetPartitioningStreamWriter"

     with

     "component.transformer.id.2=add.date.version",
     "component.transformer.class.add.date.version=za.co.absa.hyperdrive.ingestor.implementation.transformer.add.dateversion.AddDateVersionTransformer",
     "component.writer=za.co.absa.hyperdrive.ingestor.implementation.writer.parquet.ParquetStreamWriter"

  2. Replace

     writer.parquet.partitioning.report.date

     with

     transformer.add.date.version.report.date

  3. Replace

     "writer.parquet.destination

     with

     "transformer.add.date.version.destination=${writer.parquet.destination}",
     "writer.parquet.partition.columns=hyperdrive_date, hyperdrive_version",
     "writer.parquet.destination
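Applied to a hypothetical workflow configuration, the three replacements above would combine as follows. This is a sketch only: the report date value and the exact destination key suffix are illustrative, not taken from an actual workflow.

Before:

```
component.writer=za.co.absa.hyperdrive.ingestor.implementation.writer.parquet.ParquetPartitioningStreamWriter
writer.parquet.partitioning.report.date=2020-01-01
writer.parquet.destination.directory=/tmp/parquet-sink
```

After:

```
component.transformer.id.2=add.date.version
component.transformer.class.add.date.version=za.co.absa.hyperdrive.ingestor.implementation.transformer.add.dateversion.AddDateVersionTransformer
component.writer=za.co.absa.hyperdrive.ingestor.implementation.writer.parquet.ParquetStreamWriter
transformer.add.date.version.report.date=2020-01-01
transformer.add.date.version.destination=${writer.parquet.destination.directory}
writer.parquet.partition.columns=hyperdrive_date, hyperdrive_version
writer.parquet.destination.directory=/tmp/parquet-sink
```

Note that the transformer's destination property reuses the writer's destination via variable interpolation, so the two stay in sync.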

Make sure there is no workflow that uses ParquetPartitioningStreamWriter and partition columns at the same time.
