Created and deployed this Azure workflow to perform data extraction, cleaning, and visualization. Docker is used to containerize the entire pipeline (server, database, Python code) so it can be deployed either locally or on the cloud.
- Data Extraction: Wikipedia Website
- Workflow Automation: Apache Airflow
- Database Management: PostgreSQL
- Cloud Storage: Azure Blob
- Data Transformation: Azure Data Factory
- Query Service: Azure Synapse
- Data Warehousing: Azure Databricks
- Data Visualization: Power BI
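The extraction step above can be sketched in a few lines of Python. This is a minimal, stdlib-only illustration (the class and function names are my own, and the real pipeline would first fetch the page over HTTP and hand the rows to the cleaning stage): it pulls row data out of an HTML table like the ones found on Wikipedia pages.

```python
from html.parser import HTMLParser


class WikiTableParser(HTMLParser):
    """Collect cell text from an HTML table (illustrative sketch)."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.rows = []       # completed rows
        self.current = []    # cells of the row being parsed

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.current = []
        elif tag in ("td", "th"):
            self.in_cell = True
            self.current.append("")

    def handle_endtag(self, tag):
        if tag == "tr" and self.current:
            self.rows.append(self.current)
        elif tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.current[-1] += data.strip()


def extract_rows(html: str):
    """Return a list of rows, each a list of cell strings."""
    parser = WikiTableParser()
    parser.feed(html)
    return parser.rows
```

In the containerized pipeline, a function like this would run inside the Airflow task that scrapes the page, with the resulting rows written to PostgreSQL for the downstream Azure stages.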
The purpose of this pipeline is to automate fetching/scraping data from Reddit posts. It uses the Reddit API to fetch the data, Apache Airflow to trigger tasks that run once a day, Docker to run everything in a containerized local environment, and a PostgreSQL database to store the fetched data. After setting everything up locally, we want this pipeline running on cloud infrastructure, which provides additional security, storage, and processing capacity. I'll set up the pipeline on AWS to fully automate fetching, cleaning, and storing live data using AWS S3, AWS Lambda, AWS Glue, AWS Athena, and AWS Redshift.
- Data Extraction: Reddit API
- Workflow Automation: Apache Airflow, Celery
- Database Management: PostgreSQL
- Cloud Storage: Amazon S3
- Data Transformation: AWS Glue, Lambda
- Query Service: Amazon Athena
- Data Warehousing: Amazon Redshift
- Data Visualization: Amazon QuickSight
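The daily fetch task above boils down to flattening Reddit's listing response into rows for PostgreSQL. A minimal sketch of that parsing step (the field names `title`, `score`, and `created_utc` follow Reddit's public JSON API listing format; the function name and the exact columns kept are my own choices for illustration):

```python
import json


def parse_posts(listing_json: str):
    """Flatten a Reddit listing response into (title, score, created_utc) tuples.

    Reddit listings nest posts as data -> children -> [each child's] data.
    """
    listing = json.loads(listing_json)
    rows = []
    for child in listing["data"]["children"]:
        post = child["data"]
        rows.append((post["title"], post["score"], post["created_utc"]))
    return rows
```

Inside the pipeline, the Airflow task scheduled once a day would call the Reddit API, pass the raw JSON through a function like this, and insert the resulting rows into PostgreSQL (locally) or land them in S3 for Glue and Redshift (on AWS).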
This project showcases my ability to integrate various technologies into a robust, scalable data pipeline, and demonstrates my experience handling big data and delivering efficient, reliable data solutions.

