Currently, all jobs are ingested with Trigger.Once, i.e. all data is written to a single Parquet file (per Kafka partition). Certain jobs may produce very large output files, leading to out-of-memory errors.
To prevent this, Trigger.ProcessingTime should be used.
New configuration property: writer.parquet.trigger
The expected value is the trigger interval in milliseconds.
If the value is not a number, or the property is not present, all data should be ingested at once, as is the case now.
The change should be available for both ParquetStreamWriter and ParquetPartitioningStreamWriter.
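
A minimal sketch of the intended fallback behaviour, assuming the writers are Scala code on Spark Structured Streaming (resolveTrigger and the config map are hypothetical names, not part of the existing codebase):

```scala
import scala.util.Try

import org.apache.spark.sql.streaming.Trigger

// Hypothetical helper: map the writer.parquet.trigger property to a Spark trigger.
// A missing property, or a value that does not parse as a number, falls back to
// Trigger.Once(), preserving the current ingest-everything-at-once behaviour.
def resolveTrigger(config: Map[String, String]): Trigger =
  config
    .get("writer.parquet.trigger")
    .flatMap(value => Try(value.trim.toLong).toOption)
    .map(millis => Trigger.ProcessingTime(millis))
    .getOrElse(Trigger.Once())
```

Both writers would then pass the resolved trigger to DataStreamWriter.trigger(...) when starting the streaming query.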