This project provides a comprehensive data pipeline solution to extract, transform, and load (ETL) Reddit data into a Redshift data warehouse. The pipeline leverages a combination of tools and services including Apache Airflow, Celery, PostgreSQL, Amazon S3, AWS Glue, Amazon Athena, and Amazon Redshift.
The pipeline is designed to:
- Extract data from Reddit using its API.
- Store the raw data in an S3 bucket from Airflow.
- Transform the data using AWS Glue and Amazon Athena.
- Load the transformed data into Amazon Redshift for analytics and querying.
The pipeline is built from the following components:
- Reddit API: Source of the data.
- Apache Airflow & Celery: Orchestrates the ETL process and manages task distribution (see the DAG sketch after this list).
- PostgreSQL: Temporary storage and metadata management.
- Amazon S3: Raw data storage.
- AWS Glue: Data cataloging and ETL jobs.
- Amazon Athena: SQL-based data transformation.
- Amazon Redshift: Data warehousing and analytics.
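As a rough illustration of how Airflow ties these pieces together, the sketch below wires the extraction and S3 upload steps into a single DAG. The file path, task names, callables, and schedule are hypothetical placeholders for illustration, not the repository's actual module layout.

```python
# dags/reddit_etl_sketch.py -- illustrative sketch only; names and schedule are placeholders
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_reddit_data(**context):
    """Pull posts from the Reddit API and write them to the local output path (placeholder)."""
    ...


def upload_to_s3(**context):
    """Copy the extracted file to the raw-data S3 bucket (placeholder)."""
    ...


with DAG(
    dag_id="reddit_etl_sketch",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_reddit_data", python_callable=extract_reddit_data)
    upload = PythonOperator(task_id="upload_to_s3", python_callable=upload_to_s3)

    # The Glue crawler, Glue job, Athena transformation, and Redshift load steps
    # would hang off this chain, e.g. via operators from the
    # apache-airflow-providers-amazon package.
    extract >> upload
```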
- AWS Account with appropriate permissions for S3, Glue, Athena, and Redshift.
- Reddit API credentials.
- Docker and Docker Compose installed
- Python 3.11 or higher
- Clone the repository.
git clone https://github.com/DimaKuriptya/RedditETL.git
- Create a virtual environment.
python3 -m venv venv
- Activate the virtual environment.
source venv/bin/activate
- Install the dependencies.
pip install -r requirements.txt
- Create a folder `config` in the root directory and a file `config.conf` inside it. Fill the file using the following template (a small sanity-check sketch follows the template):
[database]
database_host = localhost
database_name = airflow_reddit
database_port = 5432
database_username = postgres
database_password = postgres
[file_paths]
input_path = /opt/airflow/data/input
output_path = /opt/airflow/data/output
[api_keys]
reddit_secret_key = [SECRET KEY HERE]
reddit_client_id = [CLIENT ID HERE]
[aws]
aws_access_key_id = [aws access key id]
aws_secret_access_key = [aws secret key]
aws_session_token = [aws session token]
aws_region = [aws region]
aws_bucket_name = [s3 bucket name]
[etl_settings]
batch_size = 100
error_handling = abort
log_level = info
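To verify the file is picked up the way the pipeline expects, a short snippet like the one below can read it back. Reading the file with `configparser` and using PRAW for the Reddit client are assumptions made for illustration, not necessarily how the repository's code works; note also that PRAW requires a `user_agent`, which is not part of the template above.

```python
# check_config.py -- illustrative sketch, assuming the config is read with configparser
import configparser

import praw  # assumption: PRAW is one possible Reddit client; it is not named in this README

parser = configparser.ConfigParser()
parser.read("config/config.conf")

# Values defined in the template above
client_id = parser.get("api_keys", "reddit_client_id")
secret_key = parser.get("api_keys", "reddit_secret_key")
bucket_name = parser.get("aws", "aws_bucket_name")

reddit = praw.Reddit(
    client_id=client_id,
    client_secret=secret_key,
    user_agent="reddit-etl-sketch",  # hypothetical value, not in the template
)
print(reddit.subreddit("dataengineering").display_name)
print(f"Raw data would be written to s3://{bucket_name}")
```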
- Starting the containers:
Run airflow-init:
docker-compose up airflow-init -d
Wait for the airflow-init container to finish its job, then run the following command:
docker-compose up -d
- Launch the Airflow web UI.
open http://localhost:8080
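After a DAG run completes, you can optionally confirm that the raw extract landed in the bucket, for example with a short boto3 listing. The key prefix below is a guess, since this README does not state where in the bucket the pipeline writes.

```python
# list_raw_objects.py -- optional sanity check; the key prefix is an assumption
import configparser

import boto3

parser = configparser.ConfigParser()
parser.read("config/config.conf")

s3 = boto3.client(
    "s3",
    aws_access_key_id=parser.get("aws", "aws_access_key_id"),
    aws_secret_access_key=parser.get("aws", "aws_secret_access_key"),
    aws_session_token=parser.get("aws", "aws_session_token"),
    region_name=parser.get("aws", "aws_region"),
)

response = s3.list_objects_v2(
    Bucket=parser.get("aws", "aws_bucket_name"),
    Prefix="raw/",  # hypothetical prefix; adjust to wherever the DAG actually writes
)
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```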