This documentation provides a comprehensive guide to deploying an AWS Data Lake solution using the AWS Cloud Development Kit (CDK) in Python. The solution includes:
- An Amazon S3 bucket acting as the data lake storage.
- An AWS Glue crawler and database for data cataloging.
- An Amazon Athena workgroup for querying the data.
- An Amazon QuickSight setup for data visualization.
- An AWS Budgets alarm that sends a notification when monthly costs exceed a user-defined amount in USD.
This guide includes detailed explanations of each component, an architecture diagram, and deployment and cleanup instructions.
Below is a diagram representing the architecture of the AWS resources:
Responsible for setting up the foundational data lake components:
- Amazon S3 Bucket (data_lake_bucket):
  - Stores raw and processed data.
  - Versioning and server-side encryption enabled.
  - Configured to auto-delete objects and bucket upon stack deletion.
- AWS Glue Crawler (glue_crawler):
  - Scans the S3 bucket to detect schema changes.
  - Updates the AWS Glue Data Catalog.
- AWS Glue Database (glue_database):
  - Stores metadata about the data in the S3 bucket.
- Amazon Athena Workgroup (athena_workgroup):
  - Executes queries against data cataloged by AWS Glue.
  - Stores query results in a specified S3 location within the data lake bucket.
- IAM Roles and Policies:
  - glue_crawler_role: Grants AWS Glue permissions to read/write to the S3 bucket.
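
To make the glue_crawler_role permissions concrete, here is a minimal sketch of the two IAM documents such a role typically needs: a trust policy that lets AWS Glue assume the role, and a policy granting read/write access to the data lake bucket. The function names, the exact action list, and the example bucket name are assumptions for illustration; the stack itself may rely on CDK helpers that generate equivalent statements.

```python
import json


def glue_trust_policy() -> dict:
    """Trust policy that allows the AWS Glue service to assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "glue.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def bucket_read_write_policy(bucket_name: str) -> dict:
    """Read/write policy for one bucket (the action list is an assumption)."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Bucket-level actions apply to the bucket ARN itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
                "Resource": bucket_arn,
            },
            {
                # Object-level actions apply to the keys inside the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": f"{bucket_arn}/*",
            },
        ],
    }


if __name__ == "__main__":
    # "example-data-lake-bucket" is a placeholder, not a name from this project.
    print(json.dumps(bucket_read_write_policy("example-data-lake-bucket"), indent=2))
```

Splitting bucket-level and object-level actions into separate statements keeps each action paired with the resource ARN form it applies to.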
Sets up Amazon QuickSight resources for data visualization:
- IAM Role (quicksight_role):
  - Allows QuickSight to access Athena and S3.
  - Must be manually assigned in QuickSight settings.
- QuickSight Data Source (data_source):
  - Connects QuickSight to Athena using the specified workgroup.
- QuickSight Dataset (dataset):
  - Defines the data to be used for analyses and dashboards.
- Custom Resource for Cleanup:
  - AWS Lambda function (cleanup_function) to delete QuickSight resources upon stack deletion.
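
As an illustration of what a cleanup function like cleanup_function might do, the sketch below deletes a QuickSight dataset and data source when the custom resource receives a Delete request. The property names (AwsAccountId, DataSetId, DataSourceId), the injectable client argument, and the return shape are assumptions; the return value presumes the Lambda runs behind the CDK custom resource provider framework, which sends the CloudFormation response itself. A Lambda used directly as the service token would have to send that response on its own.

```python
def _delete_ignoring_missing(delete_call, **kwargs):
    """Call a delete API and ignore the error raised for a missing resource."""
    try:
        delete_call(**kwargs)
    except Exception as exc:
        # boto3 exposes service errors as dynamically generated classes,
        # so this sketch matches on the class name.
        if type(exc).__name__ != "ResourceNotFoundException":
            raise


def handler(event, context, client=None):
    """Delete QuickSight resources when the custom resource is deleted."""
    props = event.get("ResourceProperties", {})
    physical_id = event.get("PhysicalResourceId", "quicksight-cleanup")

    if event.get("RequestType") == "Delete":
        if client is None:
            import boto3  # provided by the AWS Lambda Python runtime

            client = boto3.client("quicksight")
        account_id = props["AwsAccountId"]
        # Delete the dataset first because it references the data source.
        _delete_ignoring_missing(
            client.delete_data_set,
            AwsAccountId=account_id,
            DataSetId=props["DataSetId"],
        )
        _delete_ignoring_missing(
            client.delete_data_source,
            AwsAccountId=account_id,
            DataSourceId=props["DataSourceId"],
        )

    # Create and Update requests need no work in this sketch.
    return {"PhysicalResourceId": physical_id}
```

Passing the client as an optional argument lets the handler be exercised locally with a stub instead of real AWS credentials.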
Creates a budget alarm to monitor AWS costs:
- AWS Budget (budget):
  - Sets a monthly budget limit of monthly_budget_usd USD, a user-defined amount.
  - Sends notifications when actual spend exceeds 100% of the budget.
  - Notifications are sent by email to the address defined in the quicksight_and_alarm_email variable.
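
The budget settings above map naturally onto the properties of a CloudFormation AWS::Budgets::Budget resource. The sketch below builds that property structure as a plain dictionary from the two user-defined values. The function name and the example values are assumptions, and the actual stack may express the same settings through CDK constructs instead.

```python
def budget_properties(monthly_budget_usd: float, email: str) -> dict:
    """Build AWS::Budgets::Budget properties for a monthly cost budget."""
    if monthly_budget_usd <= 0:
        raise ValueError("monthly_budget_usd must be greater than zero")
    return {
        "Budget": {
            "BudgetType": "COST",
            "TimeUnit": "MONTHLY",
            "BudgetLimit": {"Amount": monthly_budget_usd, "Unit": "USD"},
        },
        "NotificationsWithSubscribers": [
            {
                # Notify when ACTUAL spend is greater than 100% of the limit.
                "Notification": {
                    "NotificationType": "ACTUAL",
                    "ComparisonOperator": "GREATER_THAN",
                    "Threshold": 100,
                    "ThresholdType": "PERCENTAGE",
                },
                "Subscribers": [
                    {"SubscriptionType": "EMAIL", "Address": email}
                ],
            }
        ],
    }


if __name__ == "__main__":
    # Example values only; use your own budget and email address.
    print(budget_properties(50, "alerts@example.com"))
```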
- AWS Account: An AWS account with permissions to create the necessary resources.
- AWS CLI Installed: Ensure the AWS CLI is installed and configured with your credentials.
- AWS CDK Installed: Install the AWS CDK if not already installed.

      npm install -g aws-cdk

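
Before deploying, it can help to confirm that the required command-line tools are on your PATH. The following is a small sketch of such a check; the tool list (aws, npm, cdk) is inferred from the prerequisites above, and the function name is hypothetical.

```python
import shutil

# Executables implied by the prerequisites: AWS CLI, npm (to install the CDK), CDK CLI.
REQUIRED_TOOLS = ("aws", "npm", "cdk")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which) -> list:
    """Return the names of the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All required tools were found.")
```

This check only verifies that the executables exist; it does not confirm that the AWS CLI is configured with valid credentials.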
- Clone the Repository or Create Project Structure:

      mkdir data-lake-cdk-demo
      cd data-lake-cdk-demo
      cdk init app --language python

- Install Python Dependencies:

      pip install -r requirements.txt

- Update Placeholder Values: Adapt all the variables under the data_lake_constants key.
- Bootstrap Your AWS Environment:

      cdk bootstrap

- Synthesize the CDK App:

      cdk synth

- Deploy the CDK App:

      cdk deploy --all --require-approval never

- Confirm Budget Subscription: Check your email for a confirmation message from AWS Budgets and confirm your subscription.
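
For the placeholder-update step, a quick sanity check on the values under data_lake_constants can catch mistakes before deployment. The sketch below validates only the two variables named in this guide, monthly_budget_usd and quicksight_and_alarm_email. The function name and the simple email pattern are assumptions, and loading the dictionary from wherever your project stores these constants is left to you.

```python
import re

# Deliberately loose pattern: something@something.something, no whitespace.
_EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def validate_constants(constants: dict) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []

    budget = constants.get("monthly_budget_usd")
    is_number = isinstance(budget, (int, float)) and not isinstance(budget, bool)
    if not is_number or budget <= 0:
        problems.append("monthly_budget_usd must be a positive number")

    email = constants.get("quicksight_and_alarm_email")
    if not isinstance(email, str) or not _EMAIL_RE.match(email):
        problems.append("quicksight_and_alarm_email must be a valid email address")

    return problems


if __name__ == "__main__":
    # Example values only.
    example = {
        "monthly_budget_usd": 50,
        "quicksight_and_alarm_email": "alerts@example.com",
    }
    print(validate_constants(example) or "Constants look fine.")
```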
To delete all resources created by the CDK stacks, run the following command:

    cdk destroy --all
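
If you run the deployment and cleanup commands often, they can be wrapped in a small helper. This is a sketch under the assumption that the cdk CLI is on your PATH and that you run it from the project directory. The names DEPLOY_COMMANDS, DESTROY_COMMANDS, and run_commands are hypothetical, and the default dry-run mode only reports what would be executed.

```python
import subprocess

# The same commands listed in this guide, split into argument lists.
DEPLOY_COMMANDS = [
    ["cdk", "bootstrap"],
    ["cdk", "synth"],
    ["cdk", "deploy", "--all", "--require-approval", "never"],
]
DESTROY_COMMANDS = [["cdk", "destroy", "--all"]]


def run_commands(commands, dry_run=True) -> list:
    """Run each command in order and return the command lines handled.

    With dry_run=True nothing is executed. With dry_run=False each command
    runs in turn and a failure raises subprocess.CalledProcessError, which
    stops the remaining commands.
    """
    handled = []
    for command in commands:
        handled.append(" ".join(command))
        if not dry_run:
            subprocess.run(command, check=True)
    return handled


if __name__ == "__main__":
    for line in run_commands(DEPLOY_COMMANDS + DESTROY_COMMANDS):
        print("would run:", line)
```

Note that cdk destroy asks for confirmation before deleting the stacks, so a real run of the destroy step still needs an interactive terminal.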
