This project simulates the user-post pipeline used by Pinterest. User posts arrive as JSON and are processed with a Lambda architecture, in both batch and real time. Batch data is first stored in the cloud, then cleaned and stored locally, where it is queried with Presto and Cassandra.
Real-time data is cleaned and monitored with PySpark in micro-batches, then written to Postgres.
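The batch cleaning step can be sketched as below. This is a minimal illustration in plain Python: the column names and the set of error placeholder strings are assumptions, not taken from the project code.

```python
# Hypothetical placeholder strings that mark failed or missing fields;
# the real set would depend on how the upstream API reports errors.
ERROR_PLACEHOLDERS = {"User Info Error", "No description available", "None"}

def clean_post(post: dict) -> dict:
    """Replace assumed error placeholder strings with None so that
    downstream queries treat them uniformly as missing values."""
    return {
        key: (None if value in ERROR_PLACEHOLDERS else value)
        for key, value in post.items()
    }

raw = {"index": 1, "title": "My pin", "description": "User Info Error"}
print(clean_post(raw))
# → {'index': 1, 'title': 'My pin', 'description': None}
```

The same mapping could be expressed as a PySpark `when`/`otherwise` column expression when run at scale.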
The user posts are passed from the Kafka topic to PySpark. Basic data processing is applied, along with a function that gives live feedback on whether a post is the result of an error (i.e. contains null values).
The processed micro-batches are then stored in a local Postgres database.
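The live error check could look like the sketch below. The function names and batch layout are illustrative only; in the real pipeline this logic would run inside a PySpark micro-batch callback (e.g. `foreachBatch`) rather than over plain dictionaries.

```python
def is_error_post(post: dict) -> bool:
    """A post containing any null value is assumed to result from an error."""
    return any(value is None for value in post.values())

def monitor_batch(batch: list[dict]) -> dict:
    """Return simple live-feedback counts for one micro-batch."""
    errors = sum(is_error_post(post) for post in batch)
    return {"total": len(batch), "errors": errors, "clean": len(batch) - errors}

batch = [
    {"index": 1, "title": "My pin", "description": "A holiday photo"},
    {"index": 2, "title": None, "description": "Missing title"},
]
print(monitor_batch(batch))
# → {'total': 2, 'errors': 1, 'clean': 1}
```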
Stream monitoring
Prometheus and Grafana are then used to scrape metrics from Postgres and display dashboards.
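A minimal Prometheus scrape configuration for this setup might look like the fragment below. It assumes metrics are exposed via the community `postgres_exporter` on its default port 9187; the job name and interval are illustrative, not taken from the project.

```yaml
# Illustrative prometheus.yml fragment (assumed exporter and port).
scrape_configs:
  - job_name: "postgres"
    scrape_interval: 15s
    static_configs:
      - targets: ["localhost:9187"]
```

Grafana would then be pointed at Prometheus as a data source to build the dashboards.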