Serverless Data Processing Pipeline on AWS with Terraform

This project demonstrates a serverless data processing pipeline on Amazon Web Services (AWS) using Terraform to manage the infrastructure as code (IaC).

The pipeline automatically processes incoming CSV files containing order data. It aggregates this data and stores the results in a DynamoDB table, showcasing a common and practical cloud architecture.

Architecture

The pipeline is event-driven: processing is triggered whenever a new .csv file whose key begins with a specific prefix is uploaded to an S3 bucket.

[S3 Bucket] --(CSV upload with "orders" prefix)--> [AWS Lambda] --(Process & Aggregate)--> [DynamoDB Table]
  1. Amazon S3 Bucket: Serves as the ingestion point for raw data. The bucket is configured to trigger a Lambda function upon object creation.
  2. AWS Lambda (Python): The core compute engine of the pipeline. This Python function reads the CSV file, processes the orders, aggregates revenue by product category, and writes the result to the DynamoDB table (a sketch of this logic follows the list).
  3. Amazon DynamoDB: A NoSQL database used to store the processed, aggregated data. The table's partition key is the product category.
  4. IAM Roles & Policies: Defines the security permissions that allow AWS services to interact with each other (e.g., granting the Lambda function access to read from S3 and write to DynamoDB).
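
To make the processing step concrete, below is a minimal sketch of the kind of handler the Lambda function runs. It is illustrative only: the CSV column names (category, price, quantity), the TABLE_NAME environment variable, and the total_revenue attribute are assumptions, not the exact identifiers used in this repository.

    # Illustrative sketch only: column names, the TABLE_NAME environment variable,
    # and the output attribute names are assumptions, not this repository's exact identifiers.
    import csv
    import io
    import os
    from collections import defaultdict
    from decimal import Decimal

    import boto3

    s3 = boto3.client("s3")
    table = boto3.resource("dynamodb").Table(os.environ["TABLE_NAME"])

    def handler(event, context):
        # The S3 event notification carries the bucket and object key that triggered the run.
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]

            # Read the uploaded CSV and aggregate revenue per product category.
            body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
            revenue = defaultdict(Decimal)
            for row in csv.DictReader(io.StringIO(body)):
                revenue[row["category"]] += Decimal(row["price"]) * int(row["quantity"])

            # Write one item per category; the category is the table's partition key.
            for category, total in revenue.items():
                table.put_item(Item={"category": category, "total_revenue": total})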

Key Features

  • Infrastructure as Code (IaC): All AWS resources are defined in Terraform, enabling fast, consistent, and repeatable deployments.
  • Event-Driven & Serverless: No idle servers to manage. Costs are based on actual usage, and the pipeline scales automatically with the volume of incoming data.

Prerequisites

To run this project, you will need the following tools installed:

  • Terraform (v1.0.0+)
  • AWS CLI
  • Configured AWS credentials (e.g., via the aws configure command).

Deployment and Usage

  1. Clone the repository:

    git clone https://github.com/TechWizard27/serverless-data-pipeline.git
    cd serverless-data-pipeline
  2. Initialize Terraform: This command downloads the necessary providers (AWS and Random).

    terraform init
  3. Deploy the infrastructure: This command creates a plan and prompts you to approve the creation of the AWS resources. Enter yes to proceed.

    terraform apply

    After a successful deployment, Terraform will print the names of the created S3 bucket and DynamoDB table.

How to Test the Pipeline

  1. Get the S3 bucket name from the output of the terraform apply command, or run:

    terraform output s3_bucket_name
  2. Upload a test file whose object key begins with the "orders" prefix. You can use the test_orders.csv file included in this project.

    aws s3 cp test_orders.csv s3://<your-bucket-name>/orders_test.csv

    (Replace <your-bucket-name> with the actual bucket name from the output.)

  3. Check the results in DynamoDB: After a few moments, the Lambda function will be triggered. You can verify the results in the DynamoDB table using the AWS Management Console or the AWS CLI. The table should contain items with aggregated revenue for each product category.
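
If you prefer to script this check, a short boto3 scan works too. Replace the table name with the one printed by terraform apply; the attribute names shown here (category, total_revenue) are assumptions about the table's schema.

    # Quick scan of the results table with boto3 (attribute names are assumptions).
    import boto3

    table = boto3.resource("dynamodb").Table("<your-table-name>")
    for item in table.scan()["Items"]:
        print(item.get("category"), item.get("total_revenue"))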

Detailed Dependency Graph

Cleanup

To tear down all the resources created by this project, run the following command:

terraform destroy

License

This project is licensed under the MIT License. See the LICENSE file for details.
