This project demonstrates a serverless data processing pipeline on Amazon Web Services (AWS) using Terraform to manage the infrastructure as code (IaC).
The pipeline automatically processes incoming CSV files containing order data. It aggregates this data and stores the results in a DynamoDB table, showcasing a common and practical cloud architecture.
The pipeline's workflow is event-driven: the process is triggered when a new .csv file whose key begins with the "orders" prefix is uploaded to the S3 bucket.
```
[S3 Bucket] --(CSV upload with "orders" prefix)--> [AWS Lambda] --(Process & Aggregate)--> [DynamoDB Table]
```
- Amazon S3 Bucket: Serves as the ingestion point for raw data. The bucket is configured to trigger a Lambda function upon object creation.
- AWS Lambda (Python): The core compute engine of the pipeline. This Python function reads the CSV file, processes the orders, aggregates revenue by product category, and writes the result to the DynamoDB table. An illustrative sketch of such a handler appears after this list.
- Amazon DynamoDB: A NoSQL database used to store the processed, aggregated data. The table's partition key is the product category.
- IAM Roles & Policies: Defines the security permissions that allow AWS services to interact with each other (e.g., granting the Lambda function access to read from S3 and write to DynamoDB).
- Infrastructure as Code (IaC): All AWS resources are defined in Terraform, enabling fast, consistent, and repeatable deployments.
- Event-Driven & Serverless: No idle servers to manage. Costs are based on actual usage, and the pipeline scales automatically with the volume of incoming data.
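
For orientation, here is a minimal sketch of what such a Lambda handler might look like. This is not the repository's actual code: the `TABLE_NAME` environment variable and the CSV column names (`category`, `amount`) are assumptions.

```python
import csv
import io
import os
import urllib.parse
from collections import defaultdict
from decimal import Decimal

import boto3

s3 = boto3.client("s3")
# Assumption: Terraform passes the table name to the function via a
# TABLE_NAME environment variable.
table = boto3.resource("dynamodb").Table(os.environ["TABLE_NAME"])


def handler(event, context):
    # One record per object-created event delivered by the S3 trigger.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

        # Sum revenue per product category. The column names "category"
        # and "amount" are assumptions about the CSV layout.
        revenue = defaultdict(Decimal)
        for row in csv.DictReader(io.StringIO(body)):
            revenue[row["category"]] += Decimal(row["amount"])

        # The partition key is the product category, as described above.
        with table.batch_writer() as batch:
            for category, total in revenue.items():
                batch.put_item(Item={"category": category, "total_revenue": total})
```

Two details worth noting: boto3 rejects binary floats for DynamoDB number attributes, hence `Decimal`; and `put_item` overwrites any previous total for a category, so a version that accumulates across multiple files would use `update_item` with an `ADD` expression instead.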
To run this project, you will need the following tools installed:

- Terraform
- AWS CLI, configured with credentials for your AWS account
- Git
- Clone the repository:

  ```
  git clone https://github.com/TechWizard27/serverless-data-pipeline.git
  cd serverless-data-pipeline
  ```
- Initialize Terraform: This command downloads the necessary providers (AWS and Random).

  ```
  terraform init
  ```
- Deploy the infrastructure: This command creates a plan and prompts you to approve the creation of the AWS resources. Enter `yes` to proceed.

  ```
  terraform apply
  ```
After a successful deployment, Terraform will print the names of the created S3 bucket and DynamoDB table.
- Get the S3 bucket name from the output of the `terraform apply` command, or run:

  ```
  terraform output s3_bucket_name
  ```
- Upload a test file with the "orders" prefix to the bucket. You can use the `test_orders.csv` file included in this project (a hypothetical sample of its layout appears after this list):

  ```
  aws s3 cp test_orders.csv s3://<your-bucket-name>/orders_test.csv
  ```

  (Replace `<your-bucket-name>` with the actual bucket name from the output.)

- Check the results in DynamoDB: After a few moments, the Lambda function will be triggered. You can verify the results in the DynamoDB table using the AWS Management Console or the AWS CLI (see the scan command below). The table should contain items with aggregated revenue for each product category.
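
For reference, `test_orders.csv` might look something like the following. The column names (`order_id`, `category`, `amount`) are assumptions matching the handler sketch above, not a confirmed description of the bundled file:

```
order_id,category,amount
1001,electronics,249.99
1002,books,18.50
1003,electronics,99.00
```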
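To check the table from the command line, a standard AWS CLI scan works (replace `<your-table-name>` with the table name printed by `terraform apply`):

```
aws dynamodb scan --table-name <your-table-name>
```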
To tear down all the resources created by this project, run the following command:
```
terraform destroy
```

This project is licensed under the MIT License. See the LICENSE file for details.
