
A Terraform module that provides an efficient way to activate pieces and services in an AWS account in order to enable users to explore preselected public datasets.


ThiagoPanini/datadelivery



datadelivery-logo




Overview

The datadelivery project is an open source solution that provides a starter toolkit to be deployed in any AWS account, enabling users to begin their learning path on AWS analytics services such as Athena, Glue, EMR, and Redshift. It does so by supplying a Terraform module that can be called from any Terraform project to deploy all the infrastructure needed to take the first steps with analytics on AWS, including public datasets ready to be explored.
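As a sketch, consuming the module from a Terraform project might look like the following. The Git source URL points at this repository, but the release tag in `?ref` and the region value are illustrative assumptions — check the official documentation for the module's actual interface and available releases:

```terraform
# main.tf — minimal sketch of calling the datadelivery module.
# The "?ref=v0.1.0" tag is hypothetical; pin to a real release
# from the repository's releases page.
provider "aws" {
  region = "us-east-1" # any region where the target account operates
}

module "datadelivery" {
  source = "git::https://github.com/ThiagoPanini/datadelivery?ref=v0.1.0"
}
```

After `terraform init` and `terraform apply`, the buckets, Glue Crawler, and Athena workgroup described in the sections below are provisioned in the target account.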

  • Have you ever wanted to have a bunch of datasets to explore in AWS?
  • Have you ever wanted to take public data and start building an ETL process?
  • Have you ever wanted to go deep into the Data Mesh architecture with SoR, SoT and Spec layers?

🚛 Try datadelivery!

Note: the datadelivery project now has official documentation on Read the Docs! Check it out for usage details, technical deep dives, practical examples, and more!


Features

  • 🚀 A pocket-sized, disposable AWS environment
  • 🪣 Automatic creation of S3 buckets following the SoR, SoT, and Spec storage layer approach
  • 🤖 Automatic data cataloging through a scheduled Glue Crawler
  • 🎲 A variety of dataset tables ready to be explored in any AWS analytics service
  • 🔦 Destroy everything and recreate it all with a single command

How Does it Work?

When users call the datadelivery Terraform module, the following operations are performed:

  1. Five different S3 buckets are created in the target AWS account
  2. The contents of the module's data/ folder are uploaded to the SoR bucket
  3. An IAM role is created with the permissions needed to run a Glue Crawler
  4. A Glue Crawler is created with an S3 target pointing to the SoR bucket
  5. A cron expression is configured to trigger the Glue Crawler 2 minutes after the infrastructure deployment finishes
  6. All files in the SoR bucket (previously in the data/ folder) are cataloged as new tables in the Glue Data Catalog
  7. A preconfigured Athena workgroup is created so users can run queries right away
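Steps 4 and 5 above can be sketched with the AWS provider's `aws_glue_crawler` resource. The resource names, database name, and bucket reference here are illustrative assumptions, not the module's actual identifiers, and the real schedule is computed at deployment time:

```terraform
# Sketch of a scheduled Glue Crawler with an S3 target (steps 4-5).
# "aws_s3_bucket.sor" and "aws_iam_role.glue_crawler" are assumed to
# be defined elsewhere in the configuration.
resource "aws_glue_crawler" "sor" {
  name          = "datadelivery-sor-crawler" # illustrative name
  role          = aws_iam_role.glue_crawler.arn
  database_name = "db_datadelivery_sor"      # illustrative database

  s3_target {
    path = "s3://${aws_s3_bucket.sor.bucket}/"
  }

  # Glue schedules use the cron(...) wrapper. The module computes an
  # expression firing ~2 minutes after deployment; a plain daily
  # schedule is shown here as a stand-in.
  schedule = "cron(0 12 * * ? *)"
}
```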

Combining Solutions

The datadelivery Terraform module doesn't stand alone. There are other complementary open source solutions that can be combined with it to unlock the full power of learning analytics on AWS. Check them out if you think they could be useful for you!

A diagram showing how it's possible to combine solutions like datadelivery, terraglue and sparksnake


Contacts


References

AWS Glue

Terraform

GitHub
