# DynamoDB to DataLake

## Setup Python Development Environment

```bash
# use python3.10 because AWS Glue 4.0 uses python3.10
virtualenv -p python3.10 .venv

# activate the virtualenv
source .venv/bin/activate

# install core dependencies
pip install -r requirements.txt

# install dev dependencies (jupyterlab)
pip install -r requirements-dev.txt

# install test dependencies
pip install -r requirements-test.txt
```

## Import the dynamodb_to_datalake Python Library

In [1]:
import dynamodb_to_datalake.api as dynamodb_to_datalake

## Show Project Related Information

In [2]:
dynamodb_to_datalake.show_info()

------ S3 info
s3dir_artifacts: https://console.aws.amazon.com/s3/buckets/807388292768-us-east-1-artifacts?prefix=projects/dynamodb_to_datalake/
s3dir_data: https://console.aws.amazon.com/s3/buckets/807388292768-us-east-1-data?prefix=projects/dynamodb_to_datalake/
s3dir_glue_artifacts: https://console.aws.amazon.com/s3/buckets/807388292768-us-east-1-artifacts?prefix=projects/dynamodb_to_datalake/glue/
s3dir_dynamodb_stream: https://console.aws.amazon.com/s3/buckets/807388292768-us-east-1-data?prefix=projects/dynamodb_to_datalake/dynamodb_stream/
s3dir_dynamodb_export: https://console.aws.amazon.com/s3/buckets/807388292768-us-east-1-data?prefix=projects/dynamodb_to_datalake/dynamodb_export/
s3dir_dynamodb_export_processed: https://console.aws.amazon.com/s3/buckets/807388292768-us-east-1-data?prefix=projects/dynamodb_to_datalake/dynamodb_export_processed/
s3path_dynamodb_export_tracker: https://console.aws.amazon.com/s3/object/807388292768-us-east-1-data?prefix=projects/dynamodb_to_datal

## Deploy the Solution via CDK - Infrastructure Resources

- S3 buckets
- IAM roles

In [4]:
dynamodb_to_datalake.cdk_deploy_1_iam_role()

🚀 You are deploying stack to AWS Account 807388292768, Region = us-east-1.
preview cloudformation stack at: https://us-east-1.console.aws.amazon.com/cloudformation/home?region=us-east-1#/stacks?filteringText=dynamodb-to-datalake&filteringStatus=active&viewNested=true



✨  Synthesis time: 4.02s

dynamodb-to-datalake:  start: Building e78d5deff8a7fcfdbcf01e98f1c6026c8b041df14a6c1f2efa1d334a5ef12bf2:current_account-current_region
dynamodb-to-datalake:  success: Built e78d5deff8a7fcfdbcf01e98f1c6026c8b041df14a6c1f2efa1d334a5ef12bf2:current_account-current_region
dynamodb-to-datalake:  start: Publishing e78d5deff8a7fcfdbcf01e98f1c6026c8b041df14a6c1f2efa1d334a5ef12bf2:current_account-current_region
dynamodb-to-datalake:  success: Published e78d5deff8a7fcfdbcf01e98f1c6026c8b041df14a6c1f2efa1d334a5ef12bf2:current_account-current_region
DynamoDBtoDataLakeStack (dynamodb-to-datalake): deploying... [1/1]
dynamodb-to-datalake: creating CloudFormation changeset...
dynamodb-to-datalake | 0/4 | 6:11:58 PM | REVIEW_IN_PROGRESS   | AWS::CloudFormation::Stack | dynamodb-to-datalake User Initiated
dynamodb-to-datalake | 0/4 | 6:12:04 PM | CREATE_IN_PROGRESS   | AWS::CloudFormation::Stack |

arn:aws:cloudformation:us-east-1:807388292768:stack/dynamodb-to-datalake/6d28b390-356f-11ee-80c6-12d5765b30bf


dynamodb-to-datalake | 2/4 | 6:12:21 PM | CREATE_COMPLETE      | AWS::IAM::Role     | DynamoDBtoDataLakeStack/GlueRole (GlueRoleDEDFFD2C)
dynamodb-to-datalake | 3/4 | 6:12:21 PM | CREATE_COMPLETE      | AWS::IAM::Role     | DynamoDBtoDataLakeStack/LambdaRole (LambdaRole3A44B857)
dynamodb-to-datalake | 4/4 | 6:12:22 PM | CREATE_COMPLETE      | AWS::CloudFormation::Stack | dynamodb-to-datalake

 ✅  DynamoDBtoDataLakeStack (dynamodb-to-datalake)

✨  Deployment time: 27.17s

Stack ARN:

✨  Total time: 31.18s




## Deploy the Solution via CDK - Solution Resources

- DynamoDB Table
- Lambda Functions
- Glue Catalog
- Glue Jobs

In [3]:
dynamodb_to_datalake.cdk_deploy_2_everything()

🚀 You are deploying stack to AWS Account 807388292768, Region = us-east-1.
preview cloudformation stack at: https://us-east-1.console.aws.amazon.com/cloudformation/home?region=us-east-1#/stacks?filteringText=dynamodb-to-datalake&filteringStatus=active&viewNested=true
  adding: dynamodb_stream_consumer.py (deflated 60%)
  adding: dynamodb_export_to_s3_post_processor_coordinator.py (deflated 56%)
  adding: dynamodb_export_to_s3_post_processor_worker.py (deflated 62%)



✨  Synthesis time: 4.82s

dynamodb-to-datalake:  start: Building 0372703c9929ab3082bda58f40ea9c03916db1507feb0853e03029af6ec5700c:current_account-current_region
dynamodb-to-datalake:  success: Built 0372703c9929ab3082bda58f40ea9c03916db1507feb0853e03029af6ec5700c:current_account-current_region
dynamodb-to-datalake:  start: Publishing 0372703c9929ab3082bda58f40ea9c03916db1507feb0853e03029af6ec5700c:current_account-current_region
dynamodb-to-datalake:  success: Published 0372703c9929ab3082bda58f40ea9c03916db1507feb0853e03029af6ec5700c:current_account-current_region
DynamoDBtoDataLakeStack (dynamodb-to-datalake): deploying... [1/1]
dynamodb-to-datalake: creating CloudFormation changeset...
dynamodb-to-datalake |  0/17 | 6:17:35 PM | UPDATE_IN_PROGRESS   | AWS::CloudFormation::Stack      | dynamodb-to-datalake User Initiated
dynamodb-to-datalake |  0/17 | 6:17:39 PM | CREATE_IN_PROGRESS   | AWS::IAM::Role

arn:aws:cloudformation:us-east-1:807388292768:stack/dynamodb-to-datalake/6d28b390-356f-11ee-80c6-12d5765b30bf


dynamodb-to-datalake | 16/17 | 6:18:37 PM | UPDATE_COMPLETE_CLEA | AWS::CloudFormation::Stack      | dynamodb-to-datalake
dynamodb-to-datalake | 17/17 | 6:18:38 PM | UPDATE_COMPLETE      | AWS::CloudFormation::Stack      | dynamodb-to-datalake

 ✅  DynamoDBtoDataLakeStack (dynamodb-to-datalake)

✨  Deployment time: 73.53s

Stack ARN:

✨  Total time: 78.36s




## Run Data Faker Program

- Generate 60 events / second
- 70% are inserts, 30% are updates

In [2]:
dynamodb_to_datalake.run_data_faker(sleep_millisecond=10)

Simulate 1 ith event: insert {'account': '259-142-8436', 'create_at': '2023-08-07T22:39:23.750911+00:00', 'update_at': '2023-08-07T22:39:23.750911+00:00', 'entity': 'Hendrix-Brewer', 'amount': 606, 'is_credit': 0, 'note': 'Executive Congress wait expect laugh party.'}
Simulate 2 ith event: insert {'account': '263-340-4339', 'create_at': '2023-08-07T22:39:25.239092+00:00', 'update_at': '2023-08-07T22:39:25.239092+00:00', 'entity': 'Fleming and Sons', 'amount': 169, 'is_credit': 1, 'note': 'Nation amount either major.'}
Simulate 3 ith event: insert {'account': '097-681-0802', 'create_at': '2023-08-07T22:39:25.789062+00:00', 'update_at': '2023-08-07T22:39:25.789062+00:00', 'entity': 'Evans-Henderson', 'amount': 945, 'is_credit': 1, 'note': 'Necessary already write mission bit.'}
Simulate 4 ith event: update {'account': '263-340-4339', 'create_at': '2023-08-07T22:39:25.239092+00:00', 'update_at': '2023-08-07T22:39:27.316420+00:00', 'entity': 'Fleming and Sons', 'amount': 169, 'is_credit': 
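
The insert/update mix described above can be sketched in plain Python. This is only an illustration, not the project's `run_data_faker` implementation: `simulate_events`, `insert_ratio`, and the generated field values are assumptions chosen to resemble the sample output, where an update keeps `create_at` and moves `update_at` forward.

```python
import random
from datetime import datetime, timezone


def simulate_events(n_events, insert_ratio=0.7, seed=None):
    """Yield (operation, record) pairs; a hypothetical stand-in for the faker."""
    rng = random.Random(seed)
    records = {}  # account -> latest version of the record
    for i in range(1, n_events + 1):
        now = datetime.now(timezone.utc).isoformat()
        # Nothing can be updated until at least one record exists.
        if not records or rng.random() < insert_ratio:
            account = "{:03d}-{:03d}-{:04d}".format(
                rng.randint(0, 999), rng.randint(0, 999), rng.randint(0, 9999)
            )
            record = {
                "account": account,
                "create_at": now,
                "update_at": now,
                "amount": rng.randint(1, 999),
                "is_credit": rng.randint(0, 1),
                "note": f"note {i}",
            }
            records[account] = record
            yield "insert", dict(record)
        else:
            # Updates keep create_at and change update_at plus the note.
            record = records[rng.choice(sorted(records))]
            record["update_at"] = now
            record["note"] = f"note {i}"
            yield "update", dict(record)


for op, rec in simulate_events(5, seed=1):
    print(op, rec["account"], rec["update_at"])
```

A real faker would also write each event to DynamoDB and sleep between events; both are left out here so the sketch runs without AWS access.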

## Preview the DynamoDB Stream Export

Scroll up to "Show Project Related Information" and open the ``s3dir_dynamodb_stream`` console URL.

## Export DynamoDB Initial Load to S3

- Export the point-in-time snapshot of DynamoDB to S3
- A Lambda function is triggered automatically to transform the exported data into an ETL-job-friendly format.

In [None]:
dynamodb_to_datalake.export_dynamodb_to_s3()

## Run Initial Load Glue Job

Create the Glue catalog table and write the initial load (the point-in-time snapshot of DynamoDB) to the data lake.

In [2]:
dynamodb_to_datalake.run_initial_load_glue_job()

run initial load glue job 'dynamodb_to_datalake_initial_load'
preview the job run status at: https://us-east-1.console.aws.amazon.com/gluestudio/home?region=us-east-1#/editor/job/dynamodb_to_datalake_initial_load/script


In [2]:
df = dynamodb_to_datalake.preview_hudi_table()
df

preview hudi table 'dynamodb_to_datalake.transactions'
n_rows = 414
preview data: file:///Users/sanhehu/Documents/GitHub/dynamodb_to_datalake-project/query_result.csv


_hoodie_commit_time,_hoodie_commit_seqno,_hoodie_record_key,_hoodie_partition_path,_hoodie_file_name,account,create_at,update_at,entity,amount,is_credit,note,id,create_year,create_month,create_day,create_hour,create_minute
i64,str,str,str,str,str,str,str,str,i64,i64,str,str,i64,i64,i64,i64,i64
20230808012711738,"""20230808012711…","""id:account:051…","""create_year=20…","""65c47bc1-3e05-…","""051-356-5544""","""2023-08-07T22:…","""2023-08-07T22:…","""Dawson-Hall""",839,0,"""Use goal exper…","""account:051-35…",2023,8,7,22,39
20230808012711738,"""20230808012711…","""id:account:263…","""create_year=20…","""65c47bc1-3e05-…","""263-340-4339""","""2023-08-07T22:…","""2023-08-07T22:…","""Fleming and So…",169,1,"""Family informa…","""account:263-34…",2023,8,7,22,39
20230808012711738,"""20230808012711…","""id:account:834…","""create_year=20…","""65c47bc1-3e05-…","""834-943-2629""","""2023-08-07T22:…","""2023-08-07T22:…","""Macias, Hayes …",192,1,"""American would…","""account:834-94…",2023,8,7,22,39
20230808012711738,"""20230808012711…","""id:account:391…","""create_year=20…","""65c47bc1-3e05-…","""391-200-2475""","""2023-08-07T22:…","""2023-08-07T22:…","""Patterson LLC""",444,1,"""Sign by at num…","""account:391-20…",2023,8,7,22,39
20230808012711738,"""20230808012711…","""id:account:012…","""create_year=20…","""65c47bc1-3e05-…","""012-769-8742""","""2023-08-07T22:…","""2023-08-07T22:…","""Rasmussen, Tuc…",921,1,"""Be lead listen…","""account:012-76…",2023,8,7,22,39
20230808012711738,"""20230808012711…","""id:account:259…","""create_year=20…","""65c47bc1-3e05-…","""259-142-8436""","""2023-08-07T22:…","""2023-08-07T22:…","""Hendrix-Brewer…",606,0,"""Executive Cong…","""account:259-14…",2023,8,7,22,39
20230808012711738,"""20230808012711…","""id:account:236…","""create_year=20…","""65c47bc1-3e05-…","""236-647-8681""","""2023-08-07T22:…","""2023-08-07T22:…","""Jordan and Son…",266,1,"""Upon total off…","""account:236-64…",2023,8,7,22,39
20230808012711738,"""20230808012711…","""id:account:598…","""create_year=20…","""65c47bc1-3e05-…","""598-227-1492""","""2023-08-07T22:…","""2023-08-07T22:…","""Burke-Jackson""",944,0,"""Successful wee…","""account:598-22…",2023,8,7,22,39
20230808012711738,"""20230808012711…","""id:account:097…","""create_year=20…","""65c47bc1-3e05-…","""097-681-0802""","""2023-08-07T22:…","""2023-08-07T22:…","""Evans-Henderso…",945,1,"""Necessary alre…","""account:097-68…",2023,8,7,22,39
20230808012711738,"""20230808012711…","""id:account:351…","""create_year=20…","""3f006b90-5da7-…","""351-942-1725""","""2023-08-07T22:…","""2023-08-07T22:…","""Morgan, Lamber…",737,1,"""Believe piece …","""account:351-94…",2023,8,7,22,29
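
Besides the source attributes and the `_hoodie_*` metadata, the preview contains derived columns: an `id` that appears to have the form `account:<account>`, and `create_year` through `create_minute` values that line up with the `create_at` timestamp. The sketch below reproduces that derivation in plain Python as an illustration. `add_derived_columns` is a hypothetical helper and the exact `id` format is inferred from the truncated preview; the actual Glue job presumably does the equivalent in Spark.

```python
from datetime import datetime


def add_derived_columns(record):
    """Return a copy of record with id and create_* columns added (hypothetical)."""
    created = datetime.fromisoformat(record["create_at"])
    enriched = dict(record)
    # Assumed record key format, inferred from the truncated preview values.
    enriched["id"] = f"account:{record['account']}"
    enriched["create_year"] = created.year
    enriched["create_month"] = created.month
    enriched["create_day"] = created.day
    enriched["create_hour"] = created.hour
    enriched["create_minute"] = created.minute
    return enriched


row = add_derived_columns(
    {"account": "259-142-8436", "create_at": "2023-08-07T22:39:23.750911+00:00"}
)
print(row["id"], row["create_year"], row["create_minute"])
```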


## Run Incremental Data Orchestrator Lambda Function Code Locally

The Lambda function periodically checks whether there is new (incremental) data and, if so, launches a Glue job to write it into the data lake. If a Glue job is already running, it waits until that job finishes. If the Glue job failed, it notifies the developer that something is wrong.


In [7]:
dynamodb_to_datalake.run_incremental_glue_job(epoch_processed_partition="year=2023/month=08/day=01/hour=00/minute=00")

try to run incremental glue job 'dynamodb_to_datalake_incremental'
there is a running incremental glue job, check the status.
previous glue job finished, status = 'SUCCEEDED', run another one.
prepare the glue job parameters.
there is no new data between (year=2023/month=08/day=08/hour=01/minute=50, year=2023/month=08/day=08/hour=01/minute=59) to process, do nothing
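
The decision flow behind this log output can be sketched as a pure function. This is a simplified illustration, not the project's orchestrator: `decide_next_step`, `partition_key`, and the action names are assumptions, and the grouping of states is one possible reading of the behavior described above. The state strings themselves are the job run states that AWS Glue reports.

```python
from datetime import datetime

# Glue job run states, grouped by how this sketch reacts to them.
IN_PROGRESS = {"STARTING", "RUNNING", "STOPPING", "WAITING"}
FAILED = {"FAILED", "ERROR", "TIMEOUT"}


def partition_key(moment):
    """Format a datetime like the minute-level partition strings in the log."""
    return moment.strftime("year=%Y/month=%m/day=%d/hour=%H/minute=%M")


def decide_next_step(previous_run_state, has_new_data):
    """Map the last Glue job run state to an orchestrator action (hypothetical)."""
    if previous_run_state in IN_PROGRESS:
        return "wait"
    if previous_run_state in FAILED:
        return "notify_developer"
    # No previous run, or a finished one: only start a job when there is work.
    if not has_new_data:
        return "do_nothing"
    return "start_glue_job"


print(partition_key(datetime(2023, 8, 8, 1, 50)))
print(decide_next_step("SUCCEEDED", has_new_data=False))
```

Keeping the decision separate from the AWS calls makes it easy to run and test locally, which is the point of this step in the walkthrough.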


## Compare the Data in DynamoDB to the Data in Datalake

In [2]:
dynamodb_to_datalake.compare()

n_dynamodb_rows: 918
n_hudi_rows: 918
NICE! n_dynamodb_rows == n_hudi_rows
NICE! The data in dynamodb and hudi are exactly the same.
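
A check of this kind can be sketched with two lists of plain dicts. The code below is a minimal illustration under assumptions, not the project's `compare` implementation: `compare_rows` is a hypothetical helper, and it assumes both sides have already been loaded into memory and share a unique `account` key.

```python
def compare_rows(source_rows, target_rows, key="account"):
    """Compare two lists of dict rows by key and return a small report."""
    source = {row[key]: row for row in source_rows}
    target = {row[key]: row for row in target_rows}
    missing = sorted(set(source) - set(target))      # in source only
    unexpected = sorted(set(target) - set(source))   # in target only
    mismatched = sorted(
        k for k in set(source) & set(target) if source[k] != target[k]
    )
    return {
        "n_source_rows": len(source),
        "n_target_rows": len(target),
        "missing": missing,
        "unexpected": unexpected,
        "mismatched": mismatched,
        "identical": not (missing or unexpected or mismatched),
    }


dynamodb_rows = [{"account": "a", "amount": 1}, {"account": "b", "amount": 2}]
hudi_rows = [{"account": "b", "amount": 2}, {"account": "a", "amount": 1}]
print(compare_rows(dynamodb_rows, hudi_rows)["identical"])
```

Comparing by key rather than by position matters because the two stores return rows in different orders. In practice the Hudi metadata and derived columns would need to be dropped before comparing row contents.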
