Summary

This is a PySpark pipeline that consumes messages from Kafka and inserts the processed records into a Delta table.

In most of my data processing pipelines the source is Apache Kafka, and I needed to monitor the Kafka consumption status and consumer lag in an external monitoring system. Therefore, I created this project using the following technologies:

- Spark Structured Streaming: a scalable and fault-tolerant stream processing engine
- Kafka: the message broker and source of the pipeline
- MinIO: distributed object storage for storing the processed data
- Delta Lake: an open-source storage framework that enables building a Lakehouse architecture with compute engines like Spark
- Prometheus, Prometheus Pushgateway, and Grafana for the monitoring system
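A minimal sketch of the core read-process-write flow is shown below. The topic name, message schema, broker address, and MinIO paths are placeholders rather than the project's actual configuration, and the session is assumed to already be configured with the Delta, Kafka, and S3A packages.

```python
# Minimal sketch of the pipeline: read JSON messages from Kafka, parse them,
# and append them to a Delta table on MinIO. Topic, schema, endpoints, and paths
# below are placeholders, not the project's actual configuration.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import LongType, StringType, StructField, StructType

# Assumes the session is already configured with the Delta, Kafka, and S3A packages.
spark = SparkSession.builder.appName("kafka-to-delta").getOrCreate()

schema = StructType([
    StructField("id", LongType()),
    StructField("name", StringType()),
])

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # assumed broker address
    .option("subscribe", "events")                         # assumed topic name
    .option("startingOffsets", "latest")
    .load()
)

parsed = (
    raw.select(from_json(col("value").cast("string"), schema).alias("data"))
    .select("data.*")
)

query = (
    parsed.writeStream.format("delta")
    .outputMode("append")
    .option("checkpointLocation", "s3a://warehouse/checkpoints/events")  # assumed bucket
    .start("s3a://warehouse/delta/events")                               # assumed table path
)

query.awaitTermination()
```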

Faker

If Faker is enabled, fake data is generated in the background at a rate defined by Faker_num_threads and Faker_sleep_s.
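A possible shape for that generator is sketched below; the two parameters mirror Faker_num_threads and Faker_sleep_s, while the Kafka client (kafka-python), broker address, topic, and generated fields are assumptions rather than the project's exact implementation.

```python
# Sketch of a background fake-data generator (not the project's exact implementation).
# num_threads and sleep_s mirror the Faker_num_threads and Faker_sleep_s settings.
import json
import threading
import time

from faker import Faker
from kafka import KafkaProducer  # kafka-python; assumed client, the project may differ


def start_faker(topic: str, num_threads: int, sleep_s: float) -> None:
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",  # assumed broker address
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    fake = Faker()

    def worker() -> None:
        while True:
            producer.send(topic, {"id": fake.random_int(), "name": fake.name()})
            time.sleep(sleep_s)

    # Daemon threads keep producing fake records until the driver process exits.
    for _ in range(num_threads):
        threading.Thread(target=worker, daemon=True).start()
```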

Metrics

The extraction of metrics is implemented in metrics.py, and it can be extended for other sources.
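As an illustration of the idea (metrics.py is the authoritative implementation; the metric names, Pushgateway address, and job label below are assumptions), a StreamingQueryListener can turn each query progress event into gauges pushed to the Pushgateway:

```python
# Illustrative sketch only: push streaming progress metrics to the Prometheus Pushgateway.
# The real extraction logic lives in metrics.py; names and addresses here are assumed.
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway
from pyspark.sql.streaming import StreamingQueryListener


class PushgatewayListener(StreamingQueryListener):
    def __init__(self, gateway="localhost:9091", job="spark_streaming"):
        self.gateway = gateway
        self.job = job

    def onQueryStarted(self, event):
        pass

    def onQueryProgress(self, event):
        progress = event.progress
        registry = CollectorRegistry()
        Gauge("stream_num_input_rows", "Rows read in the last micro-batch",
              registry=registry).set(progress.numInputRows)
        Gauge("stream_processed_rows_per_second", "Current processing rate",
              registry=registry).set(progress.processedRowsPerSecond)
        push_to_gateway(self.gateway, job=self.job, registry=registry)

    def onQueryTerminated(self, event):
        pass
```

Registering the listener on the Spark session (see the sketch at the end of the Test section) makes every micro-batch report these gauges, which Prometheus then scrapes from the Pushgateway.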

Test

I used the following Docker images to test the code:
- bitnami/kafka
- minio/minio
- prom/prometheus
- prom/pushgateway
- grafana/grafana

My Spark version was 3.4.1 with Delta Lake 2.4. StreamingQueryListener is a new class in Spark 3.4.0: https://spark.apache.org/docs/latest/api/python/reference/pyspark.ss/api/pyspark.sql.streaming.StreamingQueryListener.html
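As a rough sketch of how these versions and the listener fit together (the package coordinates and session settings follow the standard Delta Lake 2.4 setup and are not necessarily the project's exact submit configuration):

```python
# Sketch: Spark 3.4.x session with Delta Lake 2.4 and the Kafka connector,
# plus listener registration. Coordinates and settings are assumed, not the project's exact setup.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("kafka-delta-monitoring")
    .config(
        "spark.jars.packages",
        "io.delta:delta-core_2.12:2.4.0,"
        "org.apache.spark:spark-sql-kafka-0-10_2.12:3.4.1",
    )
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# StreamingQueryListener (new in Spark 3.4.0) receives a progress event per micro-batch.
spark.streams.addListener(PushgatewayListener())  # the listener sketched in the Metrics section
```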