Hello, welcome to my Data Pipeline Practice repo! Here I keep the Python scripts and related files I use to build data pipelines with the extract-load-transform (ELT) method. I am using the *Data Pipelines Pocket Reference* (Densmore 2021) to guide my practice journey. Below, I keep track of my practice steps and the highlights of my learning journey.
- Virtual Environment
- AWS Account
- MySQL database
- Create table in MySQL
- Python script (full extract of a MySQL table to an S3 bucket; see sketch below)
- Redshift Data Warehouse
- Python script (incremental extract of a MySQL table to an S3 bucket; see sketch below)
- BinLog replication of MySQL data (note: will practice the CDC method at a later point)
- MongoDB data extraction method
- REST API data extraction method (see sketch below)
- Load CSV file to Redshift data warehouse via query editor
- Load CSV file to Redshift data warehouse via Python script (see sketch below)
- Deduplicating records in a data warehouse table via SQL (see sketch below)
- Parsing URLs via Python (see sketch below)
- Transform data from fact and dimension tables by creating a new data model via SQL
- Install Apache Airflow
- Create Postgres database
- Configure Airflow to use Postgres database
- Build and Run a Simple Airflow DAG
- Build an ELT Pipeline DAG (see sketch below)
- Configure DAG Status Alerts
- Coordinate Multiple DAGs with Sensors
- Create validation test script (see sketch below)
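
A full extraction pulls every row of the source table on each run, writes it to a pipe-delimited CSV, and uploads the file to S3 for the load step. Below is a minimal sketch of that pattern using pymysql and boto3; the hostnames, credentials, `Orders` table, and `my-pipeline-bucket` bucket are placeholders, not the actual values used here.

```python
import csv

import boto3
import pymysql

# Placeholder connection details; in practice these come from a config file.
conn = pymysql.connect(
    host="my-rds-instance.rds.amazonaws.com",
    user="my_user",
    password="my_password",
    database="my_db",
)

# Full extraction: pull every row of the (hypothetical) Orders table each run.
with conn.cursor() as cursor:
    cursor.execute("SELECT * FROM Orders;")
    results = cursor.fetchall()

local_file = "order_extract.csv"
with open(local_file, "w", newline="") as f:
    csv.writer(f, delimiter="|").writerows(results)

# Upload the CSV to the S3 bucket the load step will read from.
s3 = boto3.client("s3")
s3.upload_file(local_file, "my-pipeline-bucket", local_file)
```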
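
An incremental extraction only pulls rows that changed since the last load: it first asks the warehouse for the newest `LastUpdated` value it already holds, then queries the source for anything newer. A sketch under the same placeholder names, assuming the table has a `LastUpdated` column and using psycopg2 for the Redshift query:

```python
import csv

import boto3
import psycopg2
import pymysql

# Placeholder Redshift and MySQL connection details.
rs_conn = psycopg2.connect(
    dbname="dev", user="awsuser", password="my_password",
    host="my-cluster.redshift.amazonaws.com", port=5439,
)
mysql_conn = pymysql.connect(
    host="my-rds-instance.rds.amazonaws.com",
    user="my_user", password="my_password", database="my_db",
)

# Ask the warehouse for the most recent record it already holds.
with rs_conn.cursor() as cur:
    cur.execute("SELECT COALESCE(MAX(LastUpdated), '1900-01-01') FROM Orders;")
    last_updated_warehouse = cur.fetchone()[0]

# Pull only the source rows that changed after that point.
with mysql_conn.cursor() as cursor:
    cursor.execute(
        "SELECT * FROM Orders WHERE LastUpdated > %s;",
        (last_updated_warehouse,),
    )
    results = cursor.fetchall()

local_file = "order_extract_incremental.csv"
with open(local_file, "w", newline="") as f:
    csv.writer(f, delimiter="|").writerows(results)

boto3.client("s3").upload_file(local_file, "my-pipeline-bucket", local_file)
```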
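
Extracting from a REST API follows the same extract-and-flatten shape, with requests standing in for a database driver. A minimal sketch against a hypothetical endpoint with made-up field names:

```python
import csv

import requests

# Hypothetical API endpoint and query parameter.
response = requests.get(
    "https://api.example.com/v1/events",
    params={"start_date": "2021-01-01"},
)
response.raise_for_status()
records = response.json()

# Flatten the JSON records into pipe-delimited rows for loading.
with open("api_extract.csv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="|")
    for record in records:
        writer.writerow(
            [record.get("id"), record.get("event_type"), record.get("created_at")]
        )
```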
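
Loading the extracted CSV from S3 into Redshift boils down to issuing a COPY statement from Python. A sketch, again with a placeholder cluster, bucket, table, and IAM role ARN:

```python
import psycopg2

# Placeholder cluster credentials and IAM role ARN.
rs_conn = psycopg2.connect(
    dbname="dev", user="awsuser", password="my_password",
    host="my-cluster.redshift.amazonaws.com", port=5439,
)

copy_sql = """
    COPY Orders
    FROM 's3://my-pipeline-bucket/order_extract.csv'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
    DELIMITER '|'
    REMOVEQUOTES;
"""

# The connection context manager commits the COPY transaction on success.
with rs_conn:
    with rs_conn.cursor() as cur:
        cur.execute(copy_sql)
```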
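
Deduplication keeps one row per natural key by ranking duplicates with ROW_NUMBER() and discarding everything but the first-ranked row. A sketch of that SQL, executed from Python against the warehouse; the table and column names are placeholders:

```python
import psycopg2

rs_conn = psycopg2.connect(
    dbname="dev", user="awsuser", password="my_password",
    host="my-cluster.redshift.amazonaws.com", port=5439,
)

dedup_sql = """
    -- Keep one row per OrderId, preferring the most recently updated record.
    CREATE TABLE Orders_deduped AS
    SELECT OrderId, CustomerId, OrderTotal, LastUpdated
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY OrderId
                   ORDER BY LastUpdated DESC
               ) AS row_num
        FROM Orders
    ) ranked
    WHERE row_num = 1;
"""

with rs_conn:
    with rs_conn.cursor() as cur:
        cur.execute(dedup_sql)
```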
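
Parsing URLs needs only the standard library. A small example that splits out the domain, the path, and the UTM query parameters:

```python
from urllib.parse import parse_qs, urlparse

url = "https://www.example.com/products/123?utm_source=newsletter&utm_medium=email"

parsed = urlparse(url)
params = parse_qs(parsed.query)

print(parsed.netloc)             # www.example.com
print(parsed.path)               # /products/123
print(params["utm_source"][0])   # newsletter
print(params["utm_medium"][0])   # email
```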
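
The ELT pipeline DAG chains the extract, load, and transform steps so they run in order on a schedule. A minimal sketch using BashOperator; the DAG id, schedule, and script paths are assumptions, not the names used in this repo:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "owner": "airflow",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="elt_pipeline_sample",
    default_args=default_args,
    schedule_interval="0 0 * * *",   # run once a day at midnight
    start_date=datetime(2021, 1, 1),
    catchup=False,
) as dag:

    extract_orders = BashOperator(
        task_id="extract_orders",
        bash_command="python /opt/pipeline/extract_mysql_full.py",
    )

    load_orders = BashOperator(
        task_id="load_orders",
        bash_command="python /opt/pipeline/copy_to_redshift.py",
    )

    build_data_model = BashOperator(
        task_id="build_data_model",
        bash_command="python /opt/pipeline/build_data_model.py",
    )

    # Extract, then load, then transform.
    extract_orders >> load_orders >> build_data_model
```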
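
A validation test script runs a check query against the warehouse and exits with a nonzero status when the check fails, so an orchestrator like Airflow can mark the run as failed. A sketch that checks the deduplicated table for leftover duplicate keys; all names are placeholders:

```python
import sys

import psycopg2

rs_conn = psycopg2.connect(
    dbname="dev", user="awsuser", password="my_password",
    host="my-cluster.redshift.amazonaws.com", port=5439,
)

# Count OrderIds that still appear more than once after deduplication.
check_sql = """
    SELECT COUNT(*) FROM (
        SELECT OrderId
        FROM Orders_deduped
        GROUP BY OrderId
        HAVING COUNT(*) > 1
    ) duplicates;
"""

with rs_conn.cursor() as cur:
    cur.execute(check_sql)
    duplicate_keys = cur.fetchone()[0]

if duplicate_keys > 0:
    print(f"Validation failed: {duplicate_keys} duplicated OrderIds found")
    sys.exit(1)

print("Validation passed")
sys.exit(0)
```
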
Highlights:
- MySQL Database (via RDS)
- Table Created in MySQL
- S3 Bucket for Extracted MySQL Table
- Redshift Data Warehouse
- MongoDB Database
- Create Table with Duplicate Count (SQL)
- Deduplicate Original Table (SQL)
- Create Transformed Data Model (SQL)
- Install Apache Airflow
- ELT Pipeline DAG Graph
- Create and Run ELT Pipeline Airflow DAG
Densmore, James (2021). *Data Pipelines Pocket Reference*. O'Reilly Media. ISBN 978-1-492-08783-0.