czhc/serverless-datalake-on-aws

 
 

Building Serverless Data Lakes on AWS

Forked from Author: Unni Pillai

Architecture Diagram

Learning outcomes from this workshop

  • Design serverless data lake architecture
  • Build a data processing pipeline and Data Lake using Amazon S3 for storing data
  • Use Amazon Kinesis for real-time streaming data
  • Use AWS Glue to automatically catalog datasets
  • Run interactive ETL scripts in an Amazon SageMaker Jupyter notebook connected to an AWS Glue development endpoint
  • Query data using Amazon Athena & visualize it using Amazon QuickSight

Pre-requisites:

  • You need access to an AWS account with AdministratorAccess
  • This lab should be executed in the us-east-1 region
  • It is best to follow the links from this guide and open them in a new tab
  • Run this lab in a modern browser
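
The account and region prerequisites above can be checked before starting the labs. The following is a minimal sketch, not part of the workshop material: it assumes boto3 is installed and AWS credentials are already configured, and the helper names (`region_ok`, `describe_session`, `main`) are hypothetical. It only reports the configured region and the caller identity; it does not verify that AdministratorAccess is attached.

```python
REQUIRED_REGION = "us-east-1"  # region stated in the prerequisites


def region_ok(region_name):
    """Return True when the configured region is the one the lab expects."""
    return region_name == REQUIRED_REGION


def describe_session(session):
    """Summarise the region and caller identity of a boto3-style session.

    `session` only needs a `region_name` attribute and a `client("sts")`
    method, so a stub object can stand in for boto3 when testing.
    """
    identity = session.client("sts").get_caller_identity()
    return {
        "region": session.region_name,
        "region_ok": region_ok(session.region_name),
        "account": identity["Account"],
        "arn": identity["Arn"],
    }


def main():
    # Assumption: boto3 is installed and credentials are configured locally.
    import boto3

    print(describe_session(boto3.Session()))
```

Calling `main()` prints the account ID, the caller ARN, and whether the default region matches us-east-1. If `region_ok` is false, switch the region in the console or in your AWS configuration before opening the labs.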

Syllabus

| Content | Link |
|---|---|
| Lab 1: Ingest and Storage | Open Lab ▶️ |
| Lab 2: Glue Data Catalog | Open Lab ▶️ |
| Lab 3: Serverless Spark ETL on Glue | Open Lab ▶️ |
| Lab 3_nb: Serverless Spark ETL on Glue (using SageMaker Notebook) | Open Lab ▶️ |
| Lab 4: Visualize Data with built-in ML transformations | Open Lab ▶️ |

Clean Up

Failing to do this will result in incurring AWS usage charges.

Make sure you bring down or delete all resources created as part of this lab.

Resources to delete
