
Data Engineering with Scala and Spark

This is the code repository for Data Engineering with Scala and Spark, published by Packt.

Build streaming and batch pipelines that process massive amounts of data using Scala

What is this book about?

Learn new techniques to ingest, transform, merge, and deliver trusted data to downstream users using modern cloud data architectures and Scala. This book teaches end-to-end data engineering that will make you the most valuable asset on your data team.

This book covers the following exciting features:

  • Set up your development environment to build pipelines in Scala
  • Get to grips with polymorphic functions, type parameterization, and Scala implicits
  • Use Spark DataFrames, Datasets, and Spark SQL with Scala
  • Read and write data to object stores
  • Profile and clean your data using Deequ
  • Performance tune your data pipelines using Scala

If you feel this book is for you, get your copy today! https://www.packtpub.com/

Instructions and Navigations

All of the code is organized into folders, one per chapter.

The code will look like the following:

val updateSilver: DataFrame = bronzeData
  .select(from_json(col("value"), jsonSchema).alias("value"))
  .select(
    col("value.device_id"),
    col("value.country"),
    col("value.event_type"),
    col("value.event_ts")
  )
  .dropDuplicates("device_id", "country", "event_ts")
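The `dropDuplicates` call above keeps a single row per (device_id, country, event_ts) key. In plain Scala collections (no Spark needed), the same idea can be sketched with `distinctBy`; the `Event` case class and field names here are illustrative, not from the book:

```scala
object DedupSketch {
  // Illustrative record type mirroring the columns selected above
  case class Event(deviceId: String, country: String, eventType: String, eventTs: Long)

  // Keep the first occurrence for each (deviceId, country, eventTs) key,
  // analogous to Spark's dropDuplicates on those columns (Scala 2.13+)
  def dedup(events: List[Event]): List[Event] =
    events.distinctBy(e => (e.deviceId, e.country, e.eventTs))
}
```

Note that in Spark, which row survives deduplication is not deterministic across partitions, whereas `distinctBy` on an ordered collection keeps the first element seen.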

Following is what you need for this book: This book is for data engineers who have experience in working with data and want to understand how to transform raw data into a clean, trusted, and valuable source of information for their organization using Scala and the latest cloud technologies.

With the following software and hardware list you can run all code files present in the book (Chapters 1-13).

Software and Hardware List

Chapter   Software required              OS required
1-13      Microsoft Azure                Windows, macOS, or Linux
1-13      Databricks Community Edition   Windows, macOS, or Linux
1-13      JDK 8                          Windows, macOS, or Linux
1-13      IntelliJ IDEA                  Windows, macOS, or Linux
1-13      VS Code                        Windows, macOS, or Linux
1-13      Docker Community Edition       Windows, macOS, or Linux
1-13      Apache Spark 3.3.1             Windows, macOS, or Linux
1-13      MySQL                          Windows, macOS, or Linux
1-13      MinIO                          Windows, macOS, or Linux

Get to Know the Authors

Eric Tome has over 25 years of experience working with data. He has contributed to and led teams that ingested, cleansed, standardized, and prepared data used by business intelligence, data science, and operations teams. He has a background in Mathematics and currently works as a Solutions Architect at Databricks, helping customers solve their data and AI challenges.

David Radford has worked in big data for over ten years with a focus on cloud technologies. He led consulting teams for multiple years completing migrations from legacy systems to modern data stacks. He holds a Master's degree in Computer Science and works as a Solutions Architect at Databricks.

Rupam Bhattacharjee works as a Lead Data Engineer at IBM. He has architected and developed data pipelines processing massive structured and unstructured data using Spark and Scala for on-prem Hadoop and k8s clusters on the public cloud. He has a degree in Electrical Engineering.
