Movies data ETL (Spark)

Spark data pipeline that processes movie ratings data.

Data Architecture

We define a Data Lakehouse architecture with the following layers:

  • Raw: Contains raw data ingested directly from an event stream (e.g. Kafka). This data should generally not be accessible to consumers, since it can contain PII, duplicates, data quality issues, etc.
  • Curated: Contains data transformed according to business and data quality rules. This data should be accessed as tables registered in a data catalog.

Apache Iceberg is used as the table format for both the raw and curated layers.

(Data Lakehouse architecture diagram)
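
As a rough illustration, a curated dataset could be declared as a partitioned Iceberg table with Spark SQL along the following lines. This is only a sketch: the table name, columns, and partition column are assumptions rather than this repo's actual definitions, and `spark` is assumed to be a SparkSession configured with an Iceberg catalog (see below).

```python
# Sketch only: declares a hypothetical curated Iceberg table, partitioned by
# execution date. Names and columns are illustrative, not this repo's schema.
spark.sql(
    """
    CREATE TABLE IF NOT EXISTS curated.movie_ratings (
        user_id        BIGINT,
        movie_id       BIGINT,
        rating         DOUBLE,
        execution_date DATE
    )
    USING iceberg
    PARTITIONED BY (execution_date)
    """
)
```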

Data pipeline design

The Spark data pipeline consumes data from the raw layer (incrementally, for a given execution date), applies transformations and business logic, and persists the results to the curated layer.
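
Conceptually, a single run for a given execution date could look like the sketch below. The table names, columns, and the execution_date filter are hypothetical placeholders, not the repo's actual schema or code.

```python
from pyspark.sql import DataFrame, SparkSession
import pyspark.sql.functions as F


def run(spark: SparkSession, execution_date: str) -> None:
    # Incrementally read only the raw data for the given execution date.
    raw: DataFrame = spark.table("raw.movie_ratings").where(
        F.col("execution_date") == execution_date
    )

    # Apply transformations / business logic (deduplication shown as a stand-in).
    curated = raw.dropDuplicates(["user_id", "movie_id"])

    # Overwrite only the partitions touched by this run in the curated Iceberg table.
    curated.writeTo("curated.movie_ratings").overwritePartitions()
```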

After persisting, Data Quality checks can be run using Soda.
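
As a sketch of what that could look like programmatically, a Soda scan can be triggered after the write, assuming the soda-core-spark-df package and a hypothetical SodaCL checks file (neither is confirmed to match this repo's setup):

```python
from soda.scan import Scan


def run_quality_checks(spark, execution_date: str) -> None:
    # Run a Soda scan against the Spark session after the curated data is written.
    scan = Scan()
    scan.set_scan_definition_name(f"movie_ratings_{execution_date}")
    scan.set_data_source_name("spark_df")
    scan.add_spark_session(spark, data_source_name="spark_df")
    scan.add_sodacl_yaml_file("soda/checks.yml")  # hypothetical checks file path
    scan.execute()
    scan.assert_no_checks_fail()  # fail the pipeline if any check does not pass
```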

As a general rule, the curated datasets are partitioned by execution date.

Note that, for the purpose of running this project locally, we use an Iceberg catalog on the local file system. In production we could use, for instance, the AWS Glue Data Catalog, persisting data to S3. See doc.
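
For reference, a file-system based Iceberg catalog for local runs could be configured on the SparkSession roughly as follows. The catalog name and warehouse path are illustrative, and the Glue settings in the trailing comment are only an example of a production alternative, not this repo's actual configuration.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("movies-etl")
    # Enable Iceberg SQL extensions and register a Hadoop (file-system) catalog.
    .config(
        "spark.sql.extensions",
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    )
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "spark-warehouse/iceberg")
    .getOrCreate()
)
# In production, the same catalog could point at AWS Glue / S3 instead, e.g.:
#   spark.sql.catalog.local.catalog-impl = org.apache.iceberg.aws.glue.GlueCatalog
#   spark.sql.catalog.local.io-impl      = org.apache.iceberg.aws.s3.S3FileIO
#   spark.sql.catalog.local.warehouse    = s3://<bucket>/<prefix>
```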

Additionally, in a production scenario it's recommended to periodically run Iceberg table maintenance operations.
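
Iceberg exposes these maintenance operations as stored procedures that can be invoked from Spark SQL, for example (the catalog and table names below are placeholders):

```python
# Placeholders: "local" is the catalog name, "curated.movie_ratings" the table.
# Expire old snapshots to reclaim storage and keep table metadata small.
spark.sql(
    "CALL local.system.expire_snapshots("
    "table => 'curated.movie_ratings', older_than => TIMESTAMP '2024-01-01 00:00:00')"
)
# Compact the small files produced by incremental writes.
spark.sql("CALL local.system.rewrite_data_files(table => 'curated.movie_ratings')")
```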

Packaging and dependency management

uv is used for Python packaging and dependency management.

Dependabot is configured to periodically upgrade repo dependencies. See dependabot.yml.

Since there are multiple ways of deploying and running Spark applications in production (Kubernetes, AWS EMR, Databricks, etc.), this repo aims to be as agnostic and generic as possible. The application and its dependencies are built into a Docker image (see Dockerfile).

In order to distribute code and dependencies across Spark executors, this method is used.

CI/CD

GitHub Actions workflows for CI/CD are defined here and can be seen here.

The logic is as follows:

  • On PR creation/update:
    • Run code checks and tests.
    • Build the Docker image.
    • Publish the Docker image to the GitHub Container Registry with a tag referring to the PR, like ghcr.io/guidok91/spark-movies-etl:pr-123.
  • On push to master:
    • Run code checks and tests.
    • Create a GitHub release.
    • Build the Docker image.
    • Publish the Docker image to the GitHub Container Registry with a tag for the latest master version, e.g. ghcr.io/guidok91/spark-movies-etl:master.

Docker images in the GitHub Container Registry can be found here.

Execution instructions

The repo includes a Makefile. Please run make help to see usage.

To start with, you can run make docker-build and make docker-run to spin up a Docker container for the project.
