
This project manages real-time streaming data from the New York Times Developer API to provide immediate access to the latest articles and insights. Apache Airflow, running on GCP Cloud Composer, orchestrates workflow pipelines that automate data retrieval, preprocessing, and incremental loading into a Snowflake data warehouse.


Parag000/New-York-Times-Project


Apache Airflow Pipelines

The nyt_dag.py script defines two Airflow DAG pipelines (a sketch of their overall shape follows the list):

  • Real_time_api_pipeline: fetches data from the NYT Archive API, processes it, and inserts it into the Snowflake database NYT_DB.NYT_SCHEMA. Scheduled to run on the 1st day of every month at 12 AM.
  • Transformation_pipeline: applies transformations, performs analytics, and loads the summarized results into the Snowflake database NYT_DB.NYT_RESULTS_SCHEMA. Scheduled to run on the 1st day of every month at 6 AM.
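
The exact task definitions live in nyt_dag.py; the sketch below only illustrates the structure under assumed names. The helper functions (fetch_archive, load_to_snowflake), the ARTICLES table, and the placeholder credentials are hypothetical, not the repo's actual code. It assumes Airflow 2.x, the requests library, and snowflake-connector-python.

```python
# Minimal sketch of the ingestion DAG (Real_time_api_pipeline).
# Helper names, table layout, and credentials are illustrative only.
from datetime import datetime

import requests
import snowflake.connector
from airflow import DAG
from airflow.operators.python import PythonOperator

NYT_ARCHIVE_URL = "https://api.nytimes.com/svc/archive/v1/{year}/{month}.json"

def fetch_archive(**context):
    # Pull the current month's articles from the NYT Archive API.
    run_date = context["logical_date"]
    resp = requests.get(
        NYT_ARCHIVE_URL.format(year=run_date.year, month=run_date.month),
        params={"api-key": "YOUR_NYT_API_KEY"},  # placeholder credential
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["response"]["docs"]

def load_to_snowflake(**context):
    # Insert the fetched articles into NYT_DB.NYT_SCHEMA.
    docs = context["ti"].xcom_pull(task_ids="fetch_archive")
    conn = snowflake.connector.connect(
        account="YOUR_ACCOUNT",      # placeholder connection details
        user="YOUR_USER",
        password="YOUR_PASSWORD",
        database="NYT_DB",
        schema="NYT_SCHEMA",
    )
    try:
        with conn.cursor() as cur:
            cur.executemany(
                "INSERT INTO ARTICLES (headline, pub_date, web_url) VALUES (%s, %s, %s)",
                [(d["headline"]["main"], d["pub_date"], d["web_url"]) for d in docs],
            )
    finally:
        conn.close()

with DAG(
    dag_id="Real_time_api_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 0 1 * *",  # 1st of every month at 12 AM
    catchup=False,
) as dag:
    fetch = PythonOperator(task_id="fetch_archive", python_callable=fetch_archive)
    load = PythonOperator(task_id="load_to_snowflake", python_callable=load_to_snowflake)
    fetch >> load
```

Transformation_pipeline would follow the same pattern with schedule_interval="0 6 1 * *" (6 AM on the 1st), reading from NYT_DB.NYT_SCHEMA and writing summarized results to NYT_DB.NYT_RESULTS_SCHEMA.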

To run the pipelines locally, start Airflow through the terminal:
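
A typical invocation, assuming a local Airflow 2.x installation (on Cloud Composer the scheduler and webserver are managed for you, so this applies only to local development):

```bash
airflow db init        # one-time metadata database setup
airflow scheduler &    # starts the scheduler, which triggers DAG runs
airflow webserver      # starts the UI on http://localhost:8080
```

Once both services are running, the DAGs can be enabled from the UI or with `airflow dags unpause Real_time_api_pipeline`.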
