
Implementation of partitioning stream writer #52

Merged
kevinwallimann merged 3 commits into develop from feature/partitioned-writer on Nov 5, 2019

Conversation

Collaborator

@kevinwallimann kevinwallimann commented Oct 22, 2019

Added new component:
component.writer=za.co.absa.hyperdrive.ingestor.implementation.writer.parquet.ParquetPartitioningStreamWriter

  • Allows multiple ingestions per day
  • Data is written to the destination, partitioned according to processing date ("hyperdrive_date") and version ("hyperdrive_version"). The version is incremented for every execution. The initial version is 1.

Required configuration
writer.parquet.destination.directory (same as for ParquetStreamWriter)
Optional configuration
writer.parquet.partitioning.report.date=yyyy-MM-dd For debugging or failure-recovery purposes, this parameter determines "hyperdrive_date", i.e. into which partition the data is written. Default: current date
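A minimal configuration sketch combining the properties above (the destination path and date are placeholders, not taken from the PR):

```properties
# Writer component introduced by this PR
component.writer=za.co.absa.hyperdrive.ingestor.implementation.writer.parquet.ParquetPartitioningStreamWriter

# Required: destination directory (placeholder path)
writer.parquet.destination.directory=/data/ingested/events

# Optional: pin the reporting date for debugging or failure recovery
writer.parquet.partitioning.report.date=2019-10-22
```

With these settings, a run would presumably land under a path like .../hyperdrive_date=2019-10-22/hyperdrive_version=1, with the version incremented on every execution.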

Collaborator

@Zejnilovic Zejnilovic left a comment


Just a few questions. I know it was not requested, but I like to butt in.

}

def write(dataFrame: DataFrame, offsetManager: OffsetManager): StreamingQuery = {
if (dataFrame == null) {
Collaborator


In general, I have seen a lot of usage of null. How about we isolate this x: Any == null check somewhere else, so the "not nice" null usage in Scala is confined to one place?

Collaborator Author

@kevinwallimann kevinwallimann Oct 23, 2019


I agree. We could also use Predef.require for these null checks because most of the null checks are actually preconditions for the function.
I have created #53
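A minimal sketch of the Predef.require suggestion, assuming the null check is a precondition. The method name and message here are illustrative, not the actual Hyperdrive API:

```scala
// Illustrative only: express the null precondition via Predef.require,
// which throws IllegalArgumentException when the condition is false.
def write(dataFrameName: String): String = {
  require(dataFrameName != null, "Invalid DataFrame: cannot be null")
  s"writing $dataFrameName"
}
```

This keeps the "not nice" null comparison in one well-known idiom instead of scattering explicit if (x == null) checks.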


import scala.util.{Failure, Success, Try}

private[writer] abstract class AbstractParquetStreamWriter(destination: String, val extraConfOptions: Option[Map[String, String]]) extends StreamWriter(destination) {
Collaborator


Wouldn't an empty Map be easier to handle?

Contributor


Could be. The "option" here is more of a semantic markup to indicate that these options are not mandatory.
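A small sketch of the trade-off under discussion; the helper name is illustrative, not from the repo:

```scala
// Illustrative only: Option[Map[...]] documents that the extra options are
// not mandatory, while getOrElse(Map.empty) gives call sites the simpler
// empty-Map handling suggested in the review.
def resolveOptions(extraConfOptions: Option[Map[String, String]]): Map[String, String] =
  extraConfOptions.getOrElse(Map.empty)
```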


private def parseConf(option: String): (String, String) = {
val keyValue = option.split("=")
if (keyValue.length == 2) {
Collaborator


I think Java conf files in general are allowed to have more than one =, but only the first one acts as the delimiter.

Collaborator Author


We provide the following config property: writer.parquet.extra.conf.1=key1=value1
That's why we check for the second =.
We could change it to writer.parquet.extra.conf.key1=value1

However, I would not change it in this PR, because AbstractParquetStreamWriter was just copied from ParquetStreamWriter.
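A sketch of the first-delimiter behaviour mentioned above, assuming a split limit is acceptable; this is illustrative, not the repo's code:

```scala
// Illustrative only: split with limit 2 so only the first '=' delimits,
// letting a value that itself contains '=' (such as "key1=value1" in
// writer.parquet.extra.conf.1=key1=value1) pass through intact.
def parseConf(option: String): (String, String) = {
  val keyValue = option.split("=", 2) // limit 2: split at the first '=' only
  if (keyValue.length == 2) (keyValue(0), keyValue(1))
  else throw new IllegalArgumentException(s"Invalid configuration property: '$option'")
}
```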

Collaborator Author


I have created #54

private val COL_VERSION = "hyperdrive_version"

if (StringUtils.isBlank(destination)) {
throw new IllegalArgumentException(s"Invalid PARQUET destination: '$destination'")
Collaborator


AbstractParquetStreamWriter has the same check. Is it not enough?


behavior of "ParquetStreamWriter"

it should "write partitioned by date and version=1 where destination is empty" in {
Collaborator


You mean the destination dir is empty?



import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession

trait SparkTestBase {
Contributor


spot on

*/

package za.co.absa.hyperdrive.ingestor.implementation.writer.parquet

Collaborator


Unused imports:
import org.apache.logging.log4j.LogManager
StreamWriterFactory

import java.time.format.DateTimeFormatter

import org.apache.commons.configuration2.Configuration
import org.apache.commons.lang3.StringUtils
Collaborator


Not used

@kevinwallimann kevinwallimann merged commit 71a94f7 into develop Nov 5, 2019
@kevinwallimann kevinwallimann deleted the feature/partitioned-writer branch November 25, 2019 07:50
@kevinwallimann kevinwallimann modified the milestones: v1.1.0, v2.0.0 Jan 22, 2020


4 participants