Stock Data ETL Pipeline with AWS and Polygon

An end-to-end ETL (Extract, Transform, Load) pipeline for ingesting and processing stock data from the Polygon API, leveraging AWS services for scalability, reliability, and performance. This project demonstrates expertise in cloud-based data engineering, including data extraction, transformation, storage, and monitoring.

Overview

This pipeline extracts stock data (e.g., timestamp, open, high, low, close, volume, vwap, ticker) from the Polygon API, processes it for analytics, and stores the results in an AWS S3 bucket. The architecture is designed to handle batch updates while maintaining cost efficiency and data integrity.
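
As a rough illustration of the extraction step, the sketch below builds a request URL for Polygon's aggregate-bars endpoint and reads the `results` list from the JSON response. It assumes the pipeline uses the `/v2/aggs` endpoint with daily bars; the helper names `build_aggs_url` and `fetch_bars` are hypothetical and are not taken from this repository.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

BASE_URL = "https://api.polygon.io"


def build_aggs_url(ticker, start, end, api_key, multiplier=1, timespan="day"):
    """Build an aggregate-bars URL for one ticker and date range (YYYY-MM-DD)."""
    path = (
        f"/v2/aggs/ticker/{quote(ticker)}/range/"
        f"{multiplier}/{timespan}/{start}/{end}"
    )
    query = urlencode({"adjusted": "true", "sort": "asc", "apiKey": api_key})
    return f"{BASE_URL}{path}?{query}"


def fetch_bars(url, timeout=30):
    """Fetch one batch of bars; returns an empty list if 'results' is absent."""
    with urlopen(url, timeout=timeout) as resp:
        payload = json.load(resp)
    return payload.get("results", [])
```

Keeping URL construction separate from the network call makes the request logic testable without contacting the API.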

Features

Data Pipeline

Data Extraction: Connects to the Polygon API to fetch stock data in batch.
Data Transformation: Cleans, validates, and enriches the raw stock data into an analytics-ready format.
Data Loading: Stores transformed data into an AWS S3 bucket, partitioned by date for efficient querying.
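
The transformation step might look something like the following Pandas sketch. It assumes the raw records use Polygon's short aggregate-bar keys (`t`, `o`, `h`, `l`, `c`, `v`, `vw`) and applies a few plausible cleaning rules; the function name `transform_bars` and the specific validation checks are illustrative assumptions, not the repository's actual logic.

```python
import pandas as pd

# Assumed mapping from Polygon's short keys to the column names used in this README.
COLUMN_MAP = {
    "t": "timestamp", "o": "open", "h": "high", "l": "low",
    "c": "close", "v": "volume", "vw": "vwap",
}


def transform_bars(raw_results, ticker):
    """Clean, validate, and enrich raw bar dicts into an analytics-ready frame."""
    df = pd.DataFrame(raw_results).rename(columns=COLUMN_MAP)
    # Keep only the expected columns, in a fixed order.
    df = df.reindex(columns=list(COLUMN_MAP.values()))
    # Clean: drop rows with any missing field.
    df = df.dropna()
    # Validate: a bar's high cannot be below its low, and volume cannot be negative.
    df = df[(df["high"] >= df["low"]) & (df["volume"] >= 0)].copy()
    # Convert epoch milliseconds to timezone-aware UTC timestamps.
    df["timestamp"] = pd.to_datetime(df["timestamp"], unit="ms", utc=True)
    df = df.drop_duplicates(subset="timestamp").sort_values("timestamp")
    # Enrich: tag every row with its ticker symbol.
    df["ticker"] = ticker
    return df.reset_index(drop=True)
```

Rows that fail validation are dropped silently here; a production pipeline would more likely log or quarantine them.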

AWS Integration

S3 Data Lake: Utilized for scalable and cost-effective storage of raw and processed data.

Scalability and Optimization

Partitioned Storage: Organizes data in year/month/day partitions to optimize query performance.
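
One way to lay out year/month/day partitions is to encode the date in the S3 object key. The sketch below assumes Hive-style `key=value` path segments, a `processed` prefix, and one CSV file per ticker per day; all three are illustrative choices, and `partition_key` is a hypothetical helper.

```python
from datetime import date


def partition_key(ticker, day, prefix="processed"):
    """Return an S3 object key partitioned by year, month, and day."""
    return (
        f"{prefix}/year={day.year}/month={day.month:02d}/"
        f"day={day.day:02d}/{ticker}.csv"
    )


# Example: the key for AAPL data dated 5 January 2024.
key = partition_key("AAPL", date(2024, 1, 5))
```

Zero-padding the month and day keeps keys in chronological order when listed lexicographically. The resulting key can be passed as the `Key` argument of boto3's `put_object` call.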

Technologies Used

Languages: Python (ETL orchestration and transformation logic)
APIs and Libraries: Polygon API, boto3 (AWS SDK for Python), moto (mocks AWS services in tests)
Cloud Services: AWS S3
Data Processing: Pandas for transformations

Prerequisites

Python 3.x installed
AWS CLI configured
Access to the Polygon API (API key required)

Setup and Usage

Clone the repository:

git clone https://github.com/Drake-Programming/polygon_aws_etl.git
cd polygon_aws_etl

Install dependencies:

pip install -r requirements.txt

Set up environment variables:

Add your Polygon API key to the .env file.
Configure AWS credentials using the AWS CLI.
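
Once the key is available in the environment, the pipeline can read it at startup and fail fast if it is missing. The sketch below is a minimal example; the variable name `POLYGON_API_KEY` and the helper `get_polygon_api_key` are assumptions, so check the repository's configuration for the name it actually expects.

```python
import os


def get_polygon_api_key(env=None, name="POLYGON_API_KEY"):
    """Read the API key from the environment, raising if it is unset or blank."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return key
```

Passing a plain dictionary as `env` makes the function easy to exercise in tests without touching the real environment.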

Run the pipeline:

python run.py

Future Enhancements

Integration with Analytics Platforms: Enable direct querying via AWS Athena or integration with visualization tools like QuickSight.
Error Handling and Retry Mechanism: Implement advanced fault-tolerance mechanisms for robust data ingestion.
CI/CD Pipeline: Automate deployments and updates using AWS CodePipeline or similar tools.
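
As a starting point for the retry mechanism, a small wrapper with exponential backoff could look like the sketch below. It is not part of the current pipeline; the name `with_retries` and its default parameters are hypothetical.

```python
import time


def with_retries(func, attempts=3, base_delay=1.0,
                 retry_on=(Exception,), sleep=time.sleep):
    """Call func(), retrying on failure with exponentially growing delays."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                raise  # Out of attempts: surface the last error to the caller.
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
```

In practice `retry_on` would be narrowed to transient errors such as timeouts or rate-limit responses, so that permanent failures like an invalid API key are raised immediately.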

Contact

For questions or feedback, please contact Robert Wallace.
