Skip to content

ELT - Extract Load Transform pipeline ingestion from Database into a single datalake location to do transform processing

Notifications You must be signed in to change notification settings

Jira-saki/AWS-ELT-S3-Athena

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

11 Commits
 
 
 
 
 
 
 
 

Repository files navigation

ELT Pipeline in Datalake

ELT - Extract Load Transform pipeline ingestion from Database into a single datalake location to do transform processing.

Diagram

Introduction & Goals

  • Ingest Data from RDS to a Datalake S3 bucket.
  • Process transaction files to a single location
    • Data set of customer list of sales person
    • We use AWS Glue
    • Create Data Catalog, transform, load to Data lake
    • Data will be transform from CSV to Parquet file
    • Parquet stores the file schema in the file metadata. It is easier to work with because they are supported by so many different projects.
    • Transformed data will be ready for query in Data Warehouse

The Data Set

  • We use transaction data set customer.csv.
  • Data set of sales person's customer list
  • appropiate size 197 KB for Glue ETL demo

Used Tools

  • Amazon RDS postgreSQL, AWS Glue, Glue crawler, Athena

    • Load .csv file from RDS into input folder inside a bucket.
    • Create output folder as target inside the same bucket.
    • Add Glue crawler to craw Data catalog of .csv file ( in input folder)
    • run Glue job to transform to parquet file
    • Set target to load into output folder
  • In Athena, check view or query of transformed file.

  • Add Glue crawler to craw Data catalog of .parquet file in (output folder).

Pipelines

  • Batch processing pipeline for bulk import.
  • Use source code of AWS Glue

Conclusion

  • Data set has transform inside a single datalake bucket
  • Transformed data can be directly analysis with AWS Athena (or Redshift Spectrum)
  • Glue job can be assign for aggregate , join, filter tables

Follow Me On

https://www.linkedin.com/in/jirasak-pakdeeto-900665214/

About

ELT - Extract Load Transform pipeline ingestion from Database into a single datalake location to do transform processing

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages