A lightweight framework for running ETL analyses in Spark.
spark-etl lets you define EtlDefinitions, group them in a collection, and run the ETLs via jobs with the provided helper methods.
An ETL starts with a DataFrame, runs a series of transformations (filter, custom transformations, repartition), and writes out data.
spark-etl is generic and can be molded to suit a wide range of ETL situations. For example, it can read a CSV file from S3, run transformations, and write out Parquet files to your local filesystem.
The following snippet creates a DataFrame and writes it out as a CSV file to your local filesystem.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.{IntegerType, StringType}
import com.github.mrpowers.spark.daria.sql.SparkSessionExt._ // createDF comes from spark-daria

val sourceDF = spark.createDF(
  List(
    ("bob", 14),
    ("liz", 20)
  ), List(
    ("name", StringType, true),
    ("age", IntegerType, true)
  )
)
def someTransform()(df: DataFrame): DataFrame = {
  df.withColumn("cool", lit("dude"))
}
def someWriter()(df: DataFrame): Unit = {
  val path = new java.io.File("./tmp/example").getCanonicalPath
  df.repartition(1).write.csv(path)
}
val etlDefinition = new EtlDefinition(
  name = "example",
  sourceDF = sourceDF,
  transform = someTransform(),
  write = someWriter(),
  hidden = false
)
etlDefinition.process()
In production applications, it's more likely that you'll use Spark DataFrame readers to create the sourceDF (e.g. spark.read.parquet("some_s3_path")).
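For instance, a production-style ETL might look something like the sketch below. parquetWriter and productionEtl are illustrative names (not part of the library), and "some_s3_path" is a placeholder rather than a real bucket.

def parquetWriter()(df: DataFrame): Unit = {
  // Write the transformed data out as Parquet; the output path is just an example.
  val outputPath = new java.io.File("./tmp/production_example").getCanonicalPath
  df.write.mode("overwrite").parquet(outputPath)
}

val productionEtl = new EtlDefinition(
  name = "production_example",
  sourceDF = spark.read.parquet("some_s3_path"), // placeholder path
  transform = someTransform(),
  write = parquetWriter(),
  hidden = false
)

productionEtl.process()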
You can define a collection of ETL definitions in a Databricks notebook and wire up a Slack command that runs a given EtlDefinition on demand.
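One way to build such a collection is a simple Map keyed by ETL name; the etls map and the "example" key below are illustrative, and the lookup could be driven by whatever text the Slack command (or any other trigger) passes in.

// Sketch: group EtlDefinitions in a mutable Map so they can be looked up by name.
val etls = scala.collection.mutable.Map[String, EtlDefinition](
  "example" -> etlDefinition
)

// A Slack command handler (or any other trigger) can then run an ETL on demand:
etls("example").process()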