
Serverless Security Data Lake

A production-ready serverless data platform for ingesting, processing, and analyzing security findings using AWS services. Data is stored as columnar Parquet for efficient, low-cost Athena queries.

Architecture Overview

Security Findings ─► SQS Queue ─► Lambda (batch) ─► S3 (Parquet, partitioned) ─► Athena
                                       │
                                       └─► SNS (HIGH/CRITICAL alerts)
  1. Ingestion — Security findings arrive as JSON on an SQS queue.
  2. Processing — A Lambda function (triggered in batches of up to 10 messages) validates, normalizes, and writes Parquet files to S3; a sketch follows this list.
  3. Storage — Parquet files land in S3 with Hive-style partitioning (year=/month=/day=), lifecycle-tiered to Intelligent-Tiering → Glacier.
  4. Analysis — Amazon Athena queries the Glue-cataloged Parquet data via partition projection.
  5. Alerting — HIGH/CRITICAL findings trigger SNS notifications.
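
The real handler lives in src/lambda/processor.py and isn't reproduced here verbatim. The sketch below is a simplified illustration of the intended flow, assuming an SQS-triggered handler configured through environment variables (FINDINGS_BUCKET and ALERT_TOPIC_ARN are hypothetical names) and awswrangler for the Parquet write:

import json
import os

import awswrangler as wr
import boto3
import pandas as pd

# Hypothetical env var names; the real values would be wired in by Terraform.
BUCKET = os.environ["FINDINGS_BUCKET"]
ALERT_TOPIC_ARN = os.environ["ALERT_TOPIC_ARN"]

sns = boto3.client("sns")

def handler(event, context):
    """Process a batch of up to 10 SQS records into partitioned Parquet."""
    findings = [json.loads(record["body"]) for record in event["Records"]]

    # Normalize into a DataFrame and derive Hive-style partition columns.
    df = pd.DataFrame(findings)
    ts = pd.to_datetime(df["timestamp"])
    df["year"] = ts.dt.strftime("%Y")
    df["month"] = ts.dt.strftime("%m")
    df["day"] = ts.dt.strftime("%d")

    # dataset=True yields year=/month=/day= prefixes; Snappy is awswrangler's default codec.
    wr.s3.to_parquet(
        df=df,
        path=f"s3://{BUCKET}/findings/",
        dataset=True,
        partition_cols=["year", "month", "day"],
    )

    # Fan HIGH/CRITICAL findings out to the alerting topic.
    for finding in findings:
        if finding.get("severity") in ("HIGH", "CRITICAL"):
            sns.publish(
                TopicArn=ALERT_TOPIC_ARN,
                Subject=f"{finding['severity']} security finding",
                Message=json.dumps(finding),
            )

    # Empty list = whole batch succeeded (SQS partial-batch response format).
    return {"batchItemFailures": []}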

Ephemeral by Design

This project is ephemeral — provisioned on-demand and torn down when not in use to maintain zero idle costs and validate Infrastructure as Code.

cd terraform
terraform init
terraform apply     # provision
terraform destroy   # tear down when done

Technology Stack

Layer        Technology
IaC          Terraform ≥ 1.0 (modular)
Ingestion    Amazon SQS (with DLQ)
Processing   AWS Lambda (Python 3.14, awswrangler)
Storage      Amazon S3 (Parquet, lifecycle-tiered)
Catalog      AWS Glue
Query        Amazon Athena (engine v3)
Alerting     Amazon SNS
CI/CD        GitHub Actions

Project Structure

.
├── .github/workflows/
│   └── ci.yml                    # CI pipeline (lint, test, terraform validate)
├── src/lambda/
│   └── processor.py              # Lambda function (batch Parquet writer)
├── tests/
│   ├── conftest.py               # Shared fixtures
│   └── test_processor.py         # Unit tests (moto-based)
├── terraform/
│   ├── main.tf                   # Root config (provider, locals, Lambda, module wiring)
│   ├── variables.tf              # Input variables with validation
│   ├── outputs.tf                # Output values
│   └── modules/
│       ├── storage/main.tf       # S3 buckets, encryption, lifecycle
│       ├── ingestion/main.tf     # SQS queues, event source mapping
│       ├── analytics/main.tf     # Glue catalog, Athena workgroup
│       └── alerting/main.tf      # SNS topic, subscriptions, filters
├── examples/
│   ├── test-messages.json        # Sample security findings
│   └── athena-queries.sql        # Example Athena queries
├── scripts/
│   └── send-test-message.py      # CLI tool to send test findings
├── pyproject.toml                # Project metadata, deps, and tool config (uv-managed)
├── uv.lock                       # Locked dependency graph (committed)
├── requirements.txt              # Pinned runtime deps for Lambda packaging
└── README.md

Quick Start

Prerequisites

  • AWS CLI configured with appropriate permissions
  • Terraform ≥ 1.0
  • Python 3.14+

Deploy

cd terraform
terraform init
terraform plan
terraform apply

Send Test Messages

python scripts/send-test-message.py --queue-url $(cd terraform && terraform output -raw sqs_queue_url)
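
The script itself isn't reproduced here; a stripped-down equivalent using boto3, assuming examples/test-messages.json holds a JSON array of findings (the real script also parses the --queue-url flag), might look like:

import json
import sys

import boto3

def send_test_messages(queue_url: str, path: str = "examples/test-messages.json") -> None:
    """Push each sample finding from the JSON file onto the ingestion queue."""
    sqs = boto3.client("sqs")
    with open(path) as f:
        findings = json.load(f)
    for finding in findings:
        sqs.send_message(QueueUrl=queue_url, MessageBody=json.dumps(finding))

if __name__ == "__main__":
    send_test_messages(sys.argv[1])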

Query Data

Use the Athena queries in examples/athena-queries.sql or query via the AWS Console against the security_db.security_findings table.
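
The same data can also be queried from Python with awswrangler, which is already part of the stack (database and table names below are the ones this project creates):

import awswrangler as wr

# Daily HIGH/CRITICAL counts; filtering on the partition columns prunes the S3 scan.
df = wr.athena.read_sql_query(
    """
    SELECT year, month, day, severity, COUNT(*) AS findings
    FROM security_findings
    WHERE severity IN ('HIGH', 'CRITICAL')
    GROUP BY year, month, day, severity
    ORDER BY year, month, day
    """,
    database="security_db",
)
print(df)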

Tear Down

cd terraform
terraform destroy

Development

This project uses uv for dependency management to keep your system Python clean.

Setup

uv sync        # creates .venv and installs all deps (runtime + dev)

Run Tests

uv run pytest
uv run pytest --cov=src --cov-report=term-missing
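
tests/test_processor.py exercises the handler against moto's fake AWS backends. A minimal test in that style, assuming moto ≥ 5 (for mock_aws) and a handler shaped like the sketch in the Architecture section, might look like:

import json

import boto3
from moto import mock_aws

SAMPLE_EVENT = {
    "Records": [
        {
            "body": json.dumps({
                "event_id": "test-1",
                "timestamp": "2024-01-15T10:30:00Z",
                "severity": "CRITICAL",
                "source": "custom",
                "finding_type": "malware",
                "description": "test finding",
            })
        }
    ]
}

@mock_aws
def test_critical_finding_is_written(monkeypatch):
    # Fake credentials and region keep everything inside the moto sandbox.
    monkeypatch.setenv("AWS_ACCESS_KEY_ID", "testing")
    monkeypatch.setenv("AWS_SECRET_ACCESS_KEY", "testing")
    monkeypatch.setenv("AWS_DEFAULT_REGION", "us-east-1")

    s3 = boto3.client("s3", region_name="us-east-1")
    s3.create_bucket(Bucket="test-findings")
    topic_arn = boto3.client("sns", region_name="us-east-1").create_topic(Name="alerts")["TopicArn"]
    monkeypatch.setenv("FINDINGS_BUCKET", "test-findings")
    monkeypatch.setenv("ALERT_TOPIC_ARN", topic_arn)

    # Import after the mocks and env vars are in place; assumes conftest.py puts
    # src/lambda on sys.path (the directory shadows the `lambda` keyword, so it
    # can't be imported as a normal package).
    import processor

    processor.handler(SAMPLE_EVENT, None)

    objects = s3.list_objects_v2(Bucket="test-findings", Prefix="findings/")
    assert any(o["Key"].endswith(".parquet") for o in objects.get("Contents", []))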

Lint & Format

uv run black --check .
uv run flake8 src/ tests/
uv run mypy src/

Data Schema

Input (Security Finding)

{
  "event_id": "uuid-string",
  "timestamp": "2024-01-15T10:30:00Z",
  "severity": "HIGH|MEDIUM|LOW|CRITICAL",
  "source": "aws-guardduty|aws-security-hub|aws-config|custom",
  "finding_type": "malware|unauthorized-access|data-exfiltration|...",
  "description": "Detailed description",
  "affected_resources": ["arn:aws:..."],
  "metadata": {
    "account_id": "123456789012",
    "region": "us-east-1",
    "tags": {"Environment": "production"}
  }
}
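
For reference, the same schema expressed as Python type hints (a hypothetical model for illustration, not code from this repo):

from typing import TypedDict

class FindingMetadata(TypedDict):
    account_id: str
    region: str
    tags: dict[str, str]

class SecurityFinding(TypedDict):
    event_id: str                  # UUID string
    timestamp: str                 # ISO 8601, e.g. "2024-01-15T10:30:00Z"
    severity: str                  # one of CRITICAL | HIGH | MEDIUM | LOW
    source: str                    # e.g. aws-guardduty, aws-security-hub, aws-config, custom
    finding_type: str              # e.g. malware, unauthorized-access, data-exfiltration
    description: str
    affected_resources: list[str]  # ARNs
    metadata: FindingMetadata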

Output (Parquet, partitioned)

  • Path: s3://bucket/findings/year=YYYY/month=MM/day=DD/*.parquet
  • Format: Parquet (Snappy compression via awswrangler)
  • Partition projection: Enabled for zero-maintenance partition discovery (illustrated below)
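
Partition projection lets Athena derive partition values from table properties rather than relying on Glue crawlers or MSCK REPAIR TABLE. The table would carry properties along these lines (illustrative values; the authoritative definition lives in the Terraform analytics module):

# Illustrative Glue table parameters for partition projection; attach these
# to the table via Terraform or the Glue API. Ranges/paths are examples only.
projection_parameters = {
    "projection.enabled": "true",
    "projection.year.type": "integer",
    "projection.year.range": "2024,2030",
    "projection.month.type": "integer",
    "projection.month.range": "1,12",
    "projection.month.digits": "2",
    "projection.day.type": "integer",
    "projection.day.range": "1,31",
    "projection.day.digits": "2",
    "storage.location.template": "s3://bucket/findings/year=${year}/month=${month}/day=${day}",
}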

Security

  • IAM least-privilege policies per resource
  • S3 bucket policies enforce server-side encryption (AES-256)
  • Public access blocked on all buckets
  • All API calls logged via CloudTrail

License

This project is provided as-is for educational and demonstration purposes.
