ELT (Extract, Load, Transform) pipeline: ingest data from a database into a single data lake location and run the transform processing there.
- Ingest data from RDS into a data lake S3 bucket.
- Process transaction files into a single location.
- We use AWS Glue to create the Data Catalog, transform the data, and load it into the data lake.
- Data will be transformed from CSV to Parquet files (a sketch of the job script appears after this list).
- Parquet stores the schema in the file metadata, which makes it easier to work with; it is also supported by many different projects.
- Transformed data will be ready for querying in a data warehouse.
- We use the transaction data set customer.csv, a sales person's customer list.
- At about 197 KB, it is an appropriately sized data set for a Glue ETL demo.
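A minimal sketch of what the CSV-to-Parquet Glue job script could look like (PySpark). The database, table, and bucket names below are placeholders for this demo, not names taken from the project:

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the CSV table that the crawler registered in the Data Catalog
# (database and table names are placeholder assumptions).
source = glueContext.create_dynamic_frame.from_catalog(
    database="datalake_db",
    table_name="input_customer_csv",
)

# Write the same records back to the bucket's output folder as Parquet.
glueContext.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://my-datalake-bucket/output/"},
    format="parquet",
)

job.commit()
```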
Amazon RDS PostgreSQL, AWS Glue, Glue crawler, Athena
- Load the .csv file from RDS into an input folder inside the bucket.
- Create an output folder as the target inside the same bucket.
- Add a Glue crawler to crawl the .csv file (in the input folder) into the Data Catalog.
- Run a Glue job to transform the data to Parquet (see the boto3 sketch after this list).
- Set the target to load into the output folder.
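The crawler and job steps above can also be driven programmatically. A boto3 sketch, assuming placeholder names for the crawler, IAM role, database, bucket, and job:

```python
import boto3

glue = boto3.client("glue")

# Crawler that catalogs the .csv file in the input folder
# (crawler, role, database, and bucket names are assumptions).
glue.create_crawler(
    Name="customer-csv-crawler",
    Role="GlueServiceRole",
    DatabaseName="datalake_db",
    Targets={"S3Targets": [{"Path": "s3://my-datalake-bucket/input/"}]},
)
glue.start_crawler(Name="customer-csv-crawler")

# After the crawler finishes, start the CSV-to-Parquet job
# (in practice, poll get_crawler until its state is READY first).
glue.start_job_run(JobName="customer-csv-to-parquet")
```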
In Athena, view or query the transformed file.

Add a Glue crawler to crawl the .parquet files (in the output folder) into the Data Catalog.
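Once the Parquet output is cataloged, it can be queried from Athena. A boto3 sketch, again with placeholder database, table, and results-location names:

```python
import boto3

athena = boto3.client("athena")

# Run a quick sanity query against the cataloged Parquet table
# (database, table, and output location are placeholder assumptions).
response = athena.start_query_execution(
    QueryString="SELECT * FROM output_customer_parquet LIMIT 10",
    QueryExecutionContext={"Database": "datalake_db"},
    ResultConfiguration={
        "OutputLocation": "s3://my-datalake-bucket/athena-results/"
    },
)
print(response["QueryExecutionId"])
```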
- Batch processing pipeline for bulk import.
- Use the AWS Glue job source code.
- The data set is transformed inside a single data lake bucket.
- Transformed data can be directly analyzed with Amazon Athena (or Redshift Spectrum).
- A Glue job can be assigned to aggregate, join, and filter tables (a sketch follows).
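A sketch of such a transform step inside a Glue job. The column names (salesperson_id, city) are assumptions about the customer data set, not its actual schema, and the catalog names are the same placeholders as above:

```python
from awsglue.context import GlueContext
from awsglue.transforms import Filter
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Read the cataloged customer table (placeholder names as before).
customers = glueContext.create_dynamic_frame.from_catalog(
    database="datalake_db",
    table_name="input_customer_csv",
)

# Filter: keep only one sales person's customers
# (salesperson_id is an assumed column name).
one_rep = Filter.apply(frame=customers, f=lambda r: r["salesperson_id"] == 42)

# Aggregate with Spark: customer count per (assumed) city column.
one_rep.toDF().groupBy("city").count().show()
```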