
Lightning Streams

An example of simple batch and streaming queries implemented with PySpark, the Python API of Apache Spark™, over a lightning flash dataset collected from NOAA's Geostationary Lightning Mapper (GLM). Uses the Apache Parquet file format as the storage backend and Dagster Software-Defined Assets to orchestrate the batch/stream processing pipeline.

Blog post: Lightning Streams: PySpark Batch & Streaming Queries

[Logos: Dagster + PySpark + Parquet]

Installation

First, make sure you have the requirements installed; they can be installed from the project directory via pip:

pip install . # requires Python <= 3.11

Quick Start

Run the following command to start the Dagster orchestration framework:

dagster dev # Start dagster daemon and dagit ui

The Dagster daemon is required for scheduling; from the Dagit UI, you can run and monitor the data assets.

ETL Pipeline

ETL pipeline data assets:

  • Source: extracts NOAA GOES-R GLM datasets from the AWS S3 bucket.
  • Transformations: transforms the dataset into time series CSVs.
  • Sink: loads the dataset into persistent storage.

The sink loading process has been refactored to use PySpark (batch and structured streaming queries) with Parquet as the storage backend.
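For illustration, here is a minimal sketch of a structured streaming query that appends the transformed CSVs to a Parquet sink. The schema fields, input/output paths, and checkpoint location are assumptions for the example, not the repo's actual values.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, DoubleType, TimestampType

spark = SparkSession.builder.appName("glm-sink").getOrCreate()

# Hypothetical schema for the time series CSVs produced by the transform step.
schema = StructType([
    StructField("flash_time", TimestampType(), True),
    StructField("flash_lat", DoubleType(), True),
    StructField("flash_lon", DoubleType(), True),
    StructField("flash_energy", DoubleType(), True),
])

# Read the transformed CSVs as a stream and append them to the Parquet sink.
flashes = (
    spark.readStream
    .schema(schema)
    .option("header", "true")
    .csv("data/transformed/")  # hypothetical input path
)

query = (
    flashes.writeStream
    .format("parquet")
    .option("path", "data/sink/")  # hypothetical Parquet sink path
    .option("checkpointLocation", "data/checkpoints/")
    .outputMode("append")
    .start()
)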

[Illustration: ETL data asset group]

Clustering Pipeline

Blog post: Exploratory Data Analysis with Lightning Streaming Pipeline

[Illustration: materializing the lightning clustering pipeline]

Data Ingestion

Ingests the data for a specified time window, bounded by start and end dates.
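For example, a run configuration for the ingestion might look like the following; the op name and config keys are hypothetical placeholders for whatever the repo's asset definitions declare.

# Hypothetical run configuration; the actual config schema is defined
# by the ingestion assets in this repo.
run_config = {
    "ops": {
        "extract": {
            "config": {
                "start_date": "2023-09-01",  # hypothetical window start
                "end_date": "2023-09-03",    # hypothetical window end
            }
        }
    }
}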

Data Assets

  • ingestor: composed of the extract, transform, and load data assets (sketched below).
  • extract: downloads NOAA GOES-R GLM netCDF files from the AWS S3 bucket.
  • transform: converts GLM netCDF files into time and geo series CSVs.
  • load: loads the CSVs into a local persistent DuckDB backend.
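A minimal sketch of how this asset chain can be declared with Dagster software-defined assets. The asset names follow the list above, but the bodies are stubs and the signatures are assumptions, not the repo's actual implementation.

from dagster import Definitions, asset

@asset
def extract() -> list[str]:
    # Download NOAA GOES-R GLM netCDF files from the public S3 bucket (stubbed).
    return ["glm_file_0001.nc"]  # hypothetical placeholder filename

@asset
def transform(extract: list[str]) -> list[str]:
    # Convert each netCDF file into a time/geo series CSV (stubbed).
    return [path.replace(".nc", ".csv") for path in extract]

@asset
def load(transform: list[str]) -> None:
    # Load the CSVs into the local persistent DuckDB backend (stubbed).
    ...

defs = Definitions(assets=[extract, transform, load])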

Cluster Analysis

Groups the ingested data using the K-Means clustering algorithm, as sketched below.
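A hedged sketch of the clustering step using PySpark MLlib's KMeans; the input path, feature columns, and choice of k are assumptions for illustration.

from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("glm-clustering").getOrCreate()

# Hypothetical input: flash latitude/longitude columns from the ingested data.
df = spark.read.csv("data/sink/", header=True, inferSchema=True)

# Assemble the coordinate columns into a single feature vector.
assembler = VectorAssembler(inputCols=["flash_lat", "flash_lon"], outputCol="features")
features = assembler.transform(df)

# Fit K-Means with an assumed k; the evaluators described below help choose it.
model = KMeans(k=12, seed=42, featuresCol="features").fit(features)
clustered = model.transform(features)  # adds a "prediction" column with cluster ids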

[Illustration: an example clustering of flash data points]

Data Assets

  • preprocessor: prepares the data for the clustering model by cleaning and normalizing it.
  • kmeans_cluster: fits the data to an implementation of the k-means clustering algorithm.
  • silhouette_evaluator: evaluates the choice of 'k' clusters by calculating the silhouette coefficient for each k in a defined range (see the sketch after this list).
  • elbow_evaluator: evaluates the choice of 'k' clusters by calculating the sum of squared distances for each k in a defined range.
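A minimal sketch of the silhouette evaluation using PySpark's ClusteringEvaluator, reusing the features DataFrame from the K-Means sketch above; the search range for k is an assumption.

from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

# Silhouette coefficient: higher is better (range -1 to 1).
evaluator = ClusteringEvaluator(featuresCol="features", metricName="silhouette")

scores = {}
for k in range(2, 13):  # assumed search range for k
    model = KMeans(k=k, seed=42, featuresCol="features").fit(features)
    scores[k] = evaluator.evaluate(model.transform(features))

best_k = max(scores, key=scores.get)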
[Illustration: clustering analysis data assets]
[Illustration: lightning clustering map]

Testing

Use the following command to run tests:

pytest

License

Apache 2.0 License