Summary

This is a PySpark pipeline that consumes messages from Kafka and inserts the processed records into a Delta table.

In most of my data processing pipelines the source is Apache Kafka, and I needed to monitor the Kafka consumption status and consumer lag in an external monitoring system. Therefore, I created this project using the following technologies:

- Spark Structured Streaming: a scalable and fault-tolerant stream processing engine
- Kafka: the message broker and source of the pipeline
- MinIO: distributed object storage for storing the processed data
- Delta Lake: an open-source storage framework that enables building a Lakehouse architecture with compute engines like Spark
- Prometheus, Prometheus Pushgateway, and Grafana for the monitoring system
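A minimal sketch of the core read-process-write flow is shown below. The topic name, message schema, broker address, and MinIO paths are placeholders rather than the project's actual configuration, and the session is assumed to already be configured with the Delta, Kafka, and S3A packages.

```python
# Minimal sketch of the pipeline: read JSON messages from Kafka, parse them,
# and append them to a Delta table on MinIO. Topic, schema, endpoints, and paths
# below are placeholders, not the project's actual configuration.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import LongType, StringType, StructField, StructType

# Assumes the session is already configured with the Delta, Kafka, and S3A packages.
spark = SparkSession.builder.appName("kafka-to-delta").getOrCreate()

schema = StructType([
    StructField("id", LongType()),
    StructField("name", StringType()),
])

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # assumed broker address
    .option("subscribe", "events")                         # assumed topic name
    .option("startingOffsets", "latest")
    .load()
)

parsed = (
    raw.select(from_json(col("value").cast("string"), schema).alias("data"))
    .select("data.*")
)

query = (
    parsed.writeStream.format("delta")
    .outputMode("append")
    .option("checkpointLocation", "s3a://warehouse/checkpoints/events")  # assumed bucket
    .start("s3a://warehouse/delta/events")                               # assumed table path
)

query.awaitTermination()
```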

Faker

If Faker is enabled, fake data is generated in the background at a rate defined by Faker_num_threads and Faker_sleep_s.
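A possible shape for that generator is sketched below; the two parameters mirror Faker_num_threads and Faker_sleep_s, while the Kafka client (kafka-python), broker address, topic, and generated fields are assumptions rather than the project's exact implementation.

```python
# Sketch of a background fake-data generator (not the project's exact implementation).
# num_threads and sleep_s mirror the Faker_num_threads and Faker_sleep_s settings.
import json
import threading
import time

from faker import Faker
from kafka import KafkaProducer  # kafka-python; assumed client, the project may differ


def start_faker(topic: str, num_threads: int, sleep_s: float) -> None:
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",  # assumed broker address
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    fake = Faker()

    def worker() -> None:
        while True:
            producer.send(topic, {"id": fake.random_int(), "name": fake.name()})
            time.sleep(sleep_s)

    # Daemon threads keep producing fake records until the driver process exits.
    for _ in range(num_threads):
        threading.Thread(target=worker, daemon=True).start()
```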

Metrics

The extraction of metrics is implemented in metrics.py, and it can be extended for other sources.
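As an illustration of the idea (metrics.py is the authoritative implementation; the metric names, Pushgateway address, and job label below are assumptions), a StreamingQueryListener can turn each query progress event into gauges pushed to the Pushgateway:

```python
# Illustrative sketch only: push streaming progress metrics to the Prometheus Pushgateway.
# The real extraction logic lives in metrics.py; names and addresses here are assumed.
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway
from pyspark.sql.streaming import StreamingQueryListener


class PushgatewayListener(StreamingQueryListener):
    def __init__(self, gateway="localhost:9091", job="spark_streaming"):
        self.gateway = gateway
        self.job = job

    def onQueryStarted(self, event):
        pass

    def onQueryProgress(self, event):
        progress = event.progress
        registry = CollectorRegistry()
        Gauge("stream_num_input_rows", "Rows read in the last micro-batch",
              registry=registry).set(progress.numInputRows)
        Gauge("stream_processed_rows_per_second", "Current processing rate",
              registry=registry).set(progress.processedRowsPerSecond)
        push_to_gateway(self.gateway, job=self.job, registry=registry)

    def onQueryTerminated(self, event):
        pass
```

Registering the listener on the Spark session (see the sketch at the end of the Test section) makes every micro-batch report these gauges, which Prometheus then scrapes from the Pushgateway.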

Test

I used the following Docker images to test the code:
- bitnami/kafka
- minio/minio
- prom/prometheus
- prom/pushgateway
- grafana/grafana

My Spark version was 3.4.1 with Delta Lake 2.4. StreamingQueryListener is a new class in Spark 3.4.0: https://spark.apache.org/docs/latest/api/python/reference/pyspark.ss/api/pyspark.sql.streaming.StreamingQueryListener.html
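As a rough sketch of how these versions and the listener fit together (the package coordinates and session settings follow the standard Delta Lake 2.4 setup and are not necessarily the project's exact submit configuration):

```python
# Sketch: Spark 3.4.x session with Delta Lake 2.4 and the Kafka connector,
# plus listener registration. Coordinates and settings are assumed, not the project's exact setup.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("kafka-delta-monitoring")
    .config(
        "spark.jars.packages",
        "io.delta:delta-core_2.12:2.4.0,"
        "org.apache.spark:spark-sql-kafka-0-10_2.12:3.4.1",
    )
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# StreamingQueryListener (new in Spark 3.4.0) receives a progress event per micro-batch.
spark.streams.addListener(PushgatewayListener())  # the listener sketched in the Metrics section
```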