
Lightning Streams

An example of simple batch and streaming queries implemented with PySpark, the Python API of Apache Spark™, over a lightning flash dataset collected from NOAA's Geostationary Lightning Mapper (GLM). Uses the Apache Parquet file format as the storage backend and Dagster Software-Defined Assets to orchestrate the batch/stream processing pipeline.

Blog post: Lightning Streams: PySpark Batch & Streaming Queries

[Logos: Dagster + PySpark + Parquet]

Installation

First, make sure you have the requirements installed; they can be installed from the project directory via pip:

pip install . # requires Python <= 3.11

Quick Start

Run the following command to start the Dagster orchestration framework:

dagster dev # Start dagster daemon and dagit ui

The Dagster daemon is required for scheduling; from the Dagit UI, you can run and monitor the data assets.

ETL Pipeline

ETL pipeline data assets:

  • Source: extracts NOAA GOES-R GLM datasets from the AWS S3 bucket.
  • Transformations: transforms the dataset into time series CSVs.
  • Sink: loads the dataset into persistent storage.

The sink loading process has been refactored to use PySpark (batch and structured streaming queries) with Parquet as the storage backend.
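For illustration, here is a minimal sketch of a structured streaming query that appends the transformed CSVs to a Parquet sink. The schema fields, input/output paths, and checkpoint location are assumptions for the example, not the repo's actual values.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, DoubleType, TimestampType

spark = SparkSession.builder.appName("glm-sink").getOrCreate()

# Hypothetical schema for the time series CSVs produced by the transform step.
schema = StructType([
    StructField("flash_time", TimestampType(), True),
    StructField("flash_lat", DoubleType(), True),
    StructField("flash_lon", DoubleType(), True),
    StructField("flash_energy", DoubleType(), True),
])

# Read the transformed CSVs as a stream and append them to the Parquet sink.
flashes = (
    spark.readStream
    .schema(schema)
    .option("header", "true")
    .csv("data/transformed/")  # hypothetical input path
)

query = (
    flashes.writeStream
    .format("parquet")
    .option("path", "data/sink/")  # hypothetical Parquet sink path
    .option("checkpointLocation", "data/checkpoints/")
    .outputMode("append")
    .start()
)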

[Illustration: ETL data asset group]

Clustering Pipeline

Blog post: Exploratory Data Analysis with Lightning Streaming Pipeline

[Illustration: materializing the lightning clustering pipeline]

Data Ingestion

Ingests the data for a specified time window, bounded by start and end dates.
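For example, a run configuration for the ingestion might look like the following; the op name and config keys are hypothetical placeholders for whatever the repo's asset definitions declare.

# Hypothetical run configuration; the actual config schema is defined
# by the ingestion assets in this repo.
run_config = {
    "ops": {
        "extract": {
            "config": {
                "start_date": "2023-09-01",  # hypothetical window start
                "end_date": "2023-09-03",    # hypothetical window end
            }
        }
    }
}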

Data Assets

  • ingestor: composed of the extract, transform, and load data assets (sketched below).
  • extract: downloads NOAA GOES-R GLM netCDF files from the AWS S3 bucket.
  • transform: converts GLM netCDF files into time and geo series CSVs.
  • load: loads the CSVs into a local persistent DuckDB backend.
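A minimal sketch of how this asset chain can be declared with Dagster software-defined assets. The asset names follow the list above, but the bodies are stubs and the signatures are assumptions, not the repo's actual implementation.

from dagster import Definitions, asset

@asset
def extract() -> list[str]:
    # Download NOAA GOES-R GLM netCDF files from the public S3 bucket (stubbed).
    return ["glm_file_0001.nc"]  # hypothetical placeholder filename

@asset
def transform(extract: list[str]) -> list[str]:
    # Convert each netCDF file into a time/geo series CSV (stubbed).
    return [path.replace(".nc", ".csv") for path in extract]

@asset
def load(transform: list[str]) -> None:
    # Load the CSVs into the local persistent DuckDB backend (stubbed).
    ...

defs = Definitions(assets=[extract, transform, load])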

Cluster Analysis

Groups the ingested data using the K-Means clustering algorithm, as sketched below.
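A hedged sketch of the clustering step using PySpark MLlib's KMeans; the input path, feature columns, and choice of k are assumptions for illustration.

from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("glm-clustering").getOrCreate()

# Hypothetical input: flash latitude/longitude columns from the ingested data.
df = spark.read.csv("data/sink/", header=True, inferSchema=True)

# Assemble the coordinate columns into a single feature vector.
assembler = VectorAssembler(inputCols=["flash_lat", "flash_lon"], outputCol="features")
features = assembler.transform(df)

# Fit K-Means with an assumed k; the evaluators described below help choose it.
model = KMeans(k=12, seed=42, featuresCol="features").fit(features)
clustered = model.transform(features)  # adds a "prediction" column with cluster ids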

[Illustration: an example clustering of flash data points]

Data Assets

  • preprocessor: prepares the data for the clustering model by cleaning and normalizing it.
  • kmeans_cluster: fits the data to an implementation of the k-means clustering algorithm.
  • silhouette_evaluator: evaluates the choice of 'k' clusters by calculating the silhouette coefficient for each k in a defined range (see the sketch after this list).
  • elbow_evaluator: evaluates the choice of 'k' clusters by calculating the sum of squared distances for each k in a defined range.
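A minimal sketch of the silhouette evaluation using PySpark's ClusteringEvaluator, reusing the features DataFrame from the K-Means sketch above; the search range for k is an assumption.

from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

# Silhouette coefficient: higher is better (range -1 to 1).
evaluator = ClusteringEvaluator(featuresCol="features", metricName="silhouette")

scores = {}
for k in range(2, 13):  # assumed search range for k
    model = KMeans(k=k, seed=42, featuresCol="features").fit(features)
    scores[k] = evaluator.evaluate(model.transform(features))

best_k = max(scores, key=scores.get)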
[Illustration: clustering analysis data assets]
[Illustration: lightning clustering map]

Testing

Use the following command to run tests:

pytest

License

Apache 2.0 License