
AWS-Driven-Sales-Performance-Outlook

The Project aims to establish a robust data pipeline for tracking and analyzing sales performance using various AWS services. The process involves creating a DynamoDB database, implementing Change Data Capture (CDC), utilizing Kinesis streams, and finally, storing and querying the data in Amazon Athena.

Architecture

Technology Used

  • Python
  • DynamoDB
  • DynamoDB Streams (CDC)
  • Kinesis Data Streams
  • EventBridge Pipes (stream ingestion)
  • Kinesis Data Firehose (batching the stream)
  • Lambda
  • Glue
  • Athena
  • S3

Features

  • Data Generation Script

    • A Python script is provided to generate synthetic sales data.
    • The script uses the boto3 library to connect to DynamoDB.
    • The script is included in the repository (mock_data_generator_for_dynamodb.py).
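The full generator lives in mock_data_generator_for_dynamodb.py; a minimal sketch of the idea looks like the following (field names and value ranges here are illustrative assumptions, not the repository script):

```python
import random
import uuid
from decimal import Decimal

def generate_order():
    """Return one synthetic sales order as a DynamoDB-ready item.

    DynamoDB rejects Python floats, so the price is a Decimal.
    """
    return {
        "order_id": str(uuid.uuid4()),
        "product_name": random.choice(["laptop", "phone", "monitor"]),
        "quantity": random.randint(1, 5),
        "price": Decimal(str(round(random.uniform(10.0, 500.0), 2))),
    }

def write_orders(table_name="Orders-data-table", count=10):
    """Push `count` synthetic orders into DynamoDB (needs AWS credentials)."""
    import boto3
    table = boto3.resource("dynamodb").Table(table_name)
    for _ in range(count):
        table.put_item(Item=generate_order())
```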
  • DynamoDB Setup

    • A DynamoDB database named sales-performance-outlook is created.

    • Implemented Change Data Capture (CDC) to track updates and deletions in records.

    • Established a DynamoDB table named Orders-data-table with order_id as the partition key.

    • Enabled a DynamoDB stream to capture changes in the table (sales-performance-outlook), specifying which data to capture (the old and/or new item image).
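Enabling the stream with both old and new item images can be sketched with boto3 as follows (the table name matches the one above; the NEW_AND_OLD_IMAGES view type is one of the four DynamoDB stream options and is assumed here):

```python
# Stream specification: emit both the pre- and post-change item image
# for every insert, update, and delete on the table.
STREAM_SPEC = {"StreamEnabled": True, "StreamViewType": "NEW_AND_OLD_IMAGES"}

def enable_cdc(table_name="Orders-data-table"):
    """Turn on the DynamoDB stream for CDC (needs AWS credentials)."""
    import boto3
    boto3.client("dynamodb").update_table(
        TableName=table_name,
        StreamSpecification=STREAM_SPEC,
    )
```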

  • Kinesis Stream

    • Created a Kinesis stream named kinesis-sales-order to collect streaming data.
    • Similar to Kafka, Kinesis uses shards to partition data streams.
    • Configured reading from the starting position of the source data.
  • Event Bridge Integration

    • Configured an EventBridge pipe to integrate the DynamoDB stream with the Kinesis stream.
    • Specified DynamoDB as the source and the Kinesis stream as the destination, partitioning data by eventID.
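As a sketch, the pipe configuration might look like this via boto3's `pipes` client (the pipe name, role, and ARNs are placeholders; the eventID partition key mirrors the setting described above):

```python
def pipe_params(stream_arn, kinesis_arn):
    """Build the create_pipe arguments: DynamoDB stream -> Kinesis."""
    return {
        "Name": "dynamodb-to-kinesis-pipe",  # placeholder name
        "Source": stream_arn,
        "SourceParameters": {
            # Start from the oldest record in the DynamoDB stream.
            "DynamoDBStreamParameters": {"StartingPosition": "TRIM_HORIZON"}
        },
        "Target": kinesis_arn,
        "TargetParameters": {
            # Shard/partition records by the stream event's eventID.
            "KinesisStreamParameters": {"PartitionKey": "$.eventID"}
        },
    }

def create_pipe(role_arn, stream_arn, kinesis_arn):
    """Create the EventBridge pipe (needs AWS credentials)."""
    import boto3
    boto3.client("pipes").create_pipe(
        RoleArn=role_arn, **pipe_params(stream_arn, kinesis_arn)
    )
```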
  • Kinesis Firehose

    • Implemented Kinesis Firehose to process the streaming data in batches.
    • Configured the Firehose buffer interval to 15 seconds for efficient data processing.
    • Used a Lambda function for data transformation, decoding and parsing DynamoDB-typed records into newline-delimited JSON.
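The transformation Lambda described above can be sketched as follows. This is an assumed shape, not the repository's function: it decodes each base64 Firehose record, flattens the DynamoDB typed attribute map (e.g. {"S": "x"}, {"N": "2"}) in the stream event's NewImage, and appends a newline so each output line is one JSON object:

```python
import base64
import json

def _untype(attr):
    """Convert one DynamoDB-typed value ({'S': 'x'}, {'N': '2'}, ...) to plain JSON."""
    (t, v), = attr.items()
    if t == "S":
        return v
    if t == "N":
        return float(v) if "." in v else int(v)
    if t == "M":
        return {k: _untype(val) for k, val in v.items()}
    if t == "L":
        return [_untype(val) for val in v]
    return v  # BOOL, NULL, etc. pass through

def lambda_handler(event, context):
    out = []
    for rec in event["records"]:
        # Firehose delivers each record base64-encoded.
        payload = json.loads(base64.b64decode(rec["data"]))
        image = payload["dynamodb"]["NewImage"]
        flat = {k: _untype(v) for k, v in image.items()}
        # Newline-delimited JSON so Athena reads one object per line.
        data = (json.dumps(flat) + "\n").encode()
        out.append({
            "recordId": rec["recordId"],
            "result": "Ok",
            "data": base64.b64encode(data).decode(),
        })
    return {"records": out}
```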
  • Glue

    • Created a Glue Data Catalog named aws-driven-sales-performance-outlook-catlog, which uses a crawler to collect metadata from the various data sources.
    • The crawler creates a table based on the schema of the data, which Athena can then use for analytics.
  • S3 Storage and Crawler

    • Set up S3 as the destination for Kinesis Firehose, storing the transformed data in files.

    • A sample output file is stored in the repository directory (Output_Sample).

    • Configured the S3 buffer interval to 60 seconds.

    • Created a crawler with a JSON classifier to identify raw data patterns in the S3 bucket.

    • Ran the crawler with a table prefix of outlook_ to create a table in the sales-data-catalog database.

    • Classifier pattern (JSON paths): "$.order_id,$.product_name,$.quantity,$.price"

    • The classifier keeps the crawler from scanning raw records that violate this pattern.
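The classifier-plus-crawler setup above can be sketched with boto3's Glue client as follows (the classifier and crawler names, bucket path, and role are placeholder assumptions; the JSON path string is the pattern quoted above):

```python
# JSON classifier pattern from the setup above.
JSON_PATH = "$.order_id,$.product_name,$.quantity,$.price"

def crawler_params(bucket_path, role_arn, classifier_name="sales-json-classifier"):
    """Build the create_crawler arguments used below."""
    return {
        "Name": "sales-data-crawler",  # placeholder name
        "Role": role_arn,
        "DatabaseName": "sales-data-catalog",
        "Targets": {"S3Targets": [{"Path": bucket_path}]},
        "Classifiers": [classifier_name],
        # Crawler-created tables are prefixed with outlook_.
        "TablePrefix": "outlook_",
    }

def setup_crawler(bucket_path, role_arn):
    """Create the JSON classifier and the crawler (needs AWS credentials)."""
    import boto3
    glue = boto3.client("glue")
    glue.create_classifier(
        JsonClassifier={"Name": "sales-json-classifier", "JsonPath": JSON_PATH}
    )
    glue.create_crawler(**crawler_params(bucket_path, role_arn))
```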

  • Athena Query

    • Utilized AWS Athena, a serverless analytics service, to query the data stored in the sales-data-catalog table.
    • Athena enables seamless querying of the data in S3 without dedicated infrastructure.
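A typical query against the crawler-created table might be submitted like this (the table name outlook_sales is a hypothetical example following the outlook_ prefix, and the results bucket is a placeholder; the columns follow the classifier pattern above):

```python
# Example analytics query: revenue per product.
QUERY = """
SELECT product_name,
       SUM(quantity)         AS units_sold,
       SUM(quantity * price) AS revenue
FROM outlook_sales
GROUP BY product_name
ORDER BY revenue DESC
"""

def run_query(database="sales-data-catalog",
              output="s3://my-athena-results-bucket/"):
    """Submit the query to Athena and return its execution id (needs AWS credentials)."""
    import boto3
    athena = boto3.client("athena")
    resp = athena.start_query_execution(
        QueryString=QUERY,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output},
    )
    return resp["QueryExecutionId"]
```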
  • Permissions Management

    • Added the necessary IAM permissions for DynamoDB, Kinesis, and EventBridge.

Key Outcomes

  • The project follows a comprehensive data pipeline architecture to capture, process, and analyze sales data efficiently.
  • The inclusion of CDC ensures that changes in records are tracked, providing a complete view of sales performance over time.
  • The use of serverless services like Athena and Lambda minimizes infrastructure management efforts.
  • The project showcases the integration of multiple AWS services for a seamless end-to-end data processing and analytics solution.
