ELT (Extract, Load, Transform) pipeline: ingest data from a database into a single data lake location and run the transform processing there.
- Ingest data from RDS into a data lake S3 bucket.
- Process transaction files into a single location.
- We use AWS Glue to create the Data Catalog, transform the data, and load it into the data lake.
- Data will be transformed from CSV to Parquet files (a sketch of the job script appears after this list).
- Parquet stores the schema in the file metadata, which makes it easier to work with; it is also supported by many different projects.
- Transformed data will be ready for querying in a data warehouse.
- We use the transaction data set customer.csv, a sales person's customer list.
- At about 197 KB, it is an appropriately sized data set for a Glue ETL demo.
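A minimal sketch of what the CSV-to-Parquet Glue job script could look like (PySpark). The database, table, and bucket names below are placeholders for this demo, not names taken from the project:

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the CSV table that the crawler registered in the Data Catalog
# (database and table names are placeholder assumptions).
source = glueContext.create_dynamic_frame.from_catalog(
    database="datalake_db",
    table_name="input_customer_csv",
)

# Write the same records back to the bucket's output folder as Parquet.
glueContext.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://my-datalake-bucket/output/"},
    format="parquet",
)

job.commit()
```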
Amazon RDS PostgreSQL, AWS Glue, Glue crawler, Athena
- Load the .csv file from RDS into an input folder inside the bucket.
- Create an output folder as the target inside the same bucket.
- Add a Glue crawler to crawl the .csv file (in the input folder) into the Data Catalog.
- Run a Glue job to transform the data to Parquet (see the boto3 sketch after this list).
- Set the target to load into the output folder.
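The crawler and job steps above can also be driven programmatically. A boto3 sketch, assuming placeholder names for the crawler, IAM role, database, bucket, and job:

```python
import boto3

glue = boto3.client("glue")

# Crawler that catalogs the .csv file in the input folder
# (crawler, role, database, and bucket names are assumptions).
glue.create_crawler(
    Name="customer-csv-crawler",
    Role="GlueServiceRole",
    DatabaseName="datalake_db",
    Targets={"S3Targets": [{"Path": "s3://my-datalake-bucket/input/"}]},
)
glue.start_crawler(Name="customer-csv-crawler")

# After the crawler finishes, start the CSV-to-Parquet job
# (in practice, poll get_crawler until its state is READY first).
glue.start_job_run(JobName="customer-csv-to-parquet")
```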
In Athena, view or query the transformed file.

Add a Glue crawler to crawl the .parquet files (in the output folder) into the Data Catalog.
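Once the Parquet output is cataloged, it can be queried from Athena. A boto3 sketch, again with placeholder database, table, and results-location names:

```python
import boto3

athena = boto3.client("athena")

# Run a quick sanity query against the cataloged Parquet table
# (database, table, and output location are placeholder assumptions).
response = athena.start_query_execution(
    QueryString="SELECT * FROM output_customer_parquet LIMIT 10",
    QueryExecutionContext={"Database": "datalake_db"},
    ResultConfiguration={
        "OutputLocation": "s3://my-datalake-bucket/athena-results/"
    },
)
print(response["QueryExecutionId"])
```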
- Batch processing pipeline for bulk import.
- Use the AWS Glue job source code.
- The data set is transformed inside a single data lake bucket.
- Transformed data can be directly analyzed with Amazon Athena (or Redshift Spectrum).
- A Glue job can be assigned to aggregate, join, and filter tables (a sketch follows).
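A sketch of such a transform step inside a Glue job. The column names (salesperson_id, city) are assumptions about the customer data set, not its actual schema, and the catalog names are the same placeholders as above:

```python
from awsglue.context import GlueContext
from awsglue.transforms import Filter
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Read the cataloged customer table (placeholder names as before).
customers = glueContext.create_dynamic_frame.from_catalog(
    database="datalake_db",
    table_name="input_customer_csv",
)

# Filter: keep only one sales person's customers
# (salesperson_id is an assumed column name).
one_rep = Filter.apply(frame=customers, f=lambda r: r["salesperson_id"] == 42)

# Aggregate with Spark: customer count per (assumed) city column.
one_rep.toDF().groupBy("city").count().show()
```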