
Implementation of partitioning stream writer #52

Merged
kevinwallimann merged 3 commits into develop from feature/partitioned-writer on Nov 5, 2019

Conversation

Collaborator

@kevinwallimann kevinwallimann commented Oct 22, 2019

Added new component:
component.writer=za.co.absa.hyperdrive.ingestor.implementation.writer.parquet.ParquetPartitioningStreamWriter

  • Allows multiple ingestions per day
  • Data is written to the destination, partitioned according to processing date ("hyperdrive_date") and version ("hyperdrive_version"). The version is incremented for every execution. The initial version is 1.

Required configuration
writer.parquet.destination.directory (same as for ParquetStreamWriter)
Optional configuration
writer.parquet.partitioning.report.date=yyyy-MM-dd For debugging or failure-recovery purposes, this parameter determines "hyperdrive_date", i.e. into which partition the data is written. Default: current date
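A minimal configuration sketch combining the properties above (the destination path and date are placeholders, not taken from the PR):

```properties
# Writer component introduced by this PR
component.writer=za.co.absa.hyperdrive.ingestor.implementation.writer.parquet.ParquetPartitioningStreamWriter

# Required: destination directory (placeholder path)
writer.parquet.destination.directory=/data/ingested/events

# Optional: pin the reporting date for debugging or failure recovery
writer.parquet.partitioning.report.date=2019-10-22
```

With these settings, a run would presumably land under a path like .../hyperdrive_date=2019-10-22/hyperdrive_version=1, with the version incremented on every execution.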

Collaborator

@Zejnilovic Zejnilovic left a comment


Just a few questions. I know it was not requested, but I like to butt in.

}

def write(dataFrame: DataFrame, offsetManager: OffsetManager): StreamingQuery = {
if (dataFrame == null) {
Collaborator


In general, I have seen a lot of usage of null. How about we isolate this x: Any == null check somewhere else, so the "not nice" null usage in Scala is confined to one place?

Collaborator Author

@kevinwallimann kevinwallimann Oct 23, 2019


I agree. We could also use Predef.require for these null checks because most of the null checks are actually preconditions for the function.
I have created #53
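A minimal sketch of the Predef.require suggestion, assuming the null check is a precondition. The method name and message here are illustrative, not the actual Hyperdrive API:

```scala
// Illustrative only: express the null precondition via Predef.require,
// which throws IllegalArgumentException when the condition is false.
def write(dataFrameName: String): String = {
  require(dataFrameName != null, "Invalid DataFrame: cannot be null")
  s"writing $dataFrameName"
}
```

This keeps the "not nice" null comparison in one well-known idiom instead of scattering explicit if (x == null) checks.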


import scala.util.{Failure, Success, Try}

private[writer] abstract class AbstractParquetStreamWriter(destination: String, val extraConfOptions: Option[Map[String, String]]) extends StreamWriter(destination) {
Collaborator


Wouldn't an empty Map be easier to handle?

Contributor


Could be. The "option" here is more of a semantic markup to indicate that these options are not mandatory.
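A small sketch of the trade-off under discussion; the helper name is illustrative, not from the repo:

```scala
// Illustrative only: Option[Map[...]] documents that the extra options are
// not mandatory, while getOrElse(Map.empty) gives call sites the simpler
// empty-Map handling suggested in the review.
def resolveOptions(extraConfOptions: Option[Map[String, String]]): Map[String, String] =
  extraConfOptions.getOrElse(Map.empty)
```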


private def parseConf(option: String): (String, String) = {
val keyValue = option.split("=")
if (keyValue.length == 2) {
Collaborator


I think Java conf files in general are allowed to have more than one =, but only the first one acts as the delimiter.

Collaborator Author


We provide the following config property: writer.parquet.extra.conf.1=key1=value1
That's why we check for the second =.
We could change it to writer.parquet.extra.conf.key1=value1

However, I would not change it in this PR, because AbstractParquetStreamWriter was just copied from ParquetStreamWriter.
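A sketch of the first-delimiter behaviour mentioned above, assuming a split limit is acceptable; this is illustrative, not the repo's code:

```scala
// Illustrative only: split with limit 2 so only the first '=' delimits,
// letting a value that itself contains '=' (such as "key1=value1" in
// writer.parquet.extra.conf.1=key1=value1) pass through intact.
def parseConf(option: String): (String, String) = {
  val keyValue = option.split("=", 2) // limit 2: split at the first '=' only
  if (keyValue.length == 2) (keyValue(0), keyValue(1))
  else throw new IllegalArgumentException(s"Invalid configuration property: '$option'")
}
```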

Collaborator Author


I have created #54

private val COL_VERSION = "hyperdrive_version"

if (StringUtils.isBlank(destination)) {
throw new IllegalArgumentException(s"Invalid PARQUET destination: '$destination'")
Collaborator


AbstractParquetStreamWriter has the same check. Is it not enough?


behavior of "ParquetStreamWriter"

it should "write partitioned by date and version=1 where destination is empty" in {
Collaborator


You mean the destination dir is empty?



import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession

trait SparkTestBase {
Contributor


spot on

*/

package za.co.absa.hyperdrive.ingestor.implementation.writer.parquet

Collaborator


Unused imports:
import org.apache.logging.log4j.LogManager
StreamWriterFactory

import java.time.format.DateTimeFormatter

import org.apache.commons.configuration2.Configuration
import org.apache.commons.lang3.StringUtils
Collaborator


Not used

@kevinwallimann kevinwallimann merged commit 71a94f7 into develop Nov 5, 2019
@kevinwallimann kevinwallimann deleted the feature/partitioned-writer branch November 25, 2019 07:50
@kevinwallimann kevinwallimann modified the milestones: v1.1.0, v2.0.0 Jan 22, 2020


4 participants