This project implements a serverless data ingestion and processing pipeline using Google Cloud Platform (GCP).
The flow is fully automated with Terraform for infrastructure provisioning and GitHub Actions for CI/CD.
- Upload JSON file → User uploads a JSON file to a GCS bucket.
- Cloud Function (2nd Gen) → Triggered on file upload, validates the JSON against a schema, and processes it into a BigQuery table.
- Notification → Once data is successfully inserted into BigQuery, a Pub/Sub notification is published to notify users.
- A bucket is created to store uploaded JSON files.
- Uploading a file to this bucket triggers the Cloud Function.
- Written in Python 3.11.
- Responsibilities (a simplified sketch follows the source listing below):
- Downloads the uploaded JSON file.
- Validates records against a schema (`schema.json`).
- Loads valid data into BigQuery.
- Publishes a Pub/Sub message upon completion.
📂 Function Source:
- `main.py` → Function logic.
- `requirements.txt` → Dependencies.
- `schema.json` → Validation schema for JSON files.
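
The real implementation lives in `main.py`; the sketch below is only a rough approximation of the flow described above. It assumes a 2nd-gen CloudEvent trigger, the `jsonschema` library for validation, and environment variables (`GCP_PROJECT`, `BQ_TABLE`, `PUBSUB_TOPIC`) injected by Terraform — the entry-point name and error handling are illustrative, not the repo's exact code.

```python
# Illustrative sketch only — see main.py for the real implementation.
# Assumed: jsonschema for validation; env vars set by Terraform.
import json
import os

import functions_framework
from google.cloud import bigquery, pubsub_v1, storage
from jsonschema import ValidationError, validate

BQ_TABLE = os.environ.get("BQ_TABLE", "serverless_data_processing_dataset.processed_data")
PUBSUB_TOPIC = os.environ.get("PUBSUB_TOPIC", "data-processed-topic")
PROJECT_ID = os.environ["GCP_PROJECT"]

with open("schema.json") as f:
    SCHEMA = json.load(f)  # deployed alongside the function source


@functions_framework.cloud_event
def process_upload(cloud_event):
    """Triggered when an object is finalized in the data bucket."""
    bucket_name = cloud_event.data["bucket"]
    blob_name = cloud_event.data["name"]

    # 1. Download the uploaded JSON file.
    blob = storage.Client().bucket(bucket_name).blob(blob_name)
    records = json.loads(blob.download_as_text())
    if isinstance(records, dict):
        records = [records]

    # 2. Validate every record against schema.json.
    try:
        for record in records:
            validate(instance=record, schema=SCHEMA)
    except ValidationError as err:
        print(f"Invalid file {blob_name}: {err.message}")  # invalid file -> error logged
        return

    # 3. Load valid records into BigQuery.
    errors = bigquery.Client().insert_rows_json(BQ_TABLE, records)
    if errors:
        raise RuntimeError(f"BigQuery insert errors: {errors}")

    # 4. Publish a Pub/Sub notification.
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(PROJECT_ID, PUBSUB_TOPIC)
    publisher.publish(topic_path, f"Data processed (filename: {blob_name})".encode()).result()
```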
- Dataset: `serverless_data_processing_dataset`
- Table: `processed_data` (schema loaded from `schema.json`)
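
For reference, a minimal query against the processed table might look like the following; the available columns are defined by `schema.json`, so this only previews rows, and the project ID is a placeholder.

```python
# Illustrative only — column names come from schema.json.
from google.cloud import bigquery

client = bigquery.Client(project="your-project-id")  # placeholder project ID

query = """
    SELECT *
    FROM `serverless_data_processing_dataset.processed_data`
    LIMIT 10
"""
for row in client.query(query).result():
    print(dict(row))
```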
- Topic: `data-processed-topic`
- Used to notify subscribers when processing is complete.
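
Subscribers can consume these notifications however they like. Below is a minimal pull-subscriber sketch; the subscription name `data-processed-sub` and the project ID are assumptions, not resources this README promises — check `main.tf` for what is actually provisioned.

```python
# Illustrative sketch — "data-processed-sub" is an assumed subscription name.
from concurrent.futures import TimeoutError

from google.cloud import pubsub_v1

project_id = "your-project-id"  # placeholder
subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(project_id, "data-processed-sub")


def callback(message: pubsub_v1.subscriber.message.Message) -> None:
    print(f"Received: {message.data.decode()}")
    message.ack()


with subscriber:
    streaming_pull = subscriber.subscribe(subscription_path, callback=callback)
    try:
        streaming_pull.result(timeout=30)  # listen for 30 seconds
    except TimeoutError:
        streaming_pull.cancel()
        streaming_pull.result()
```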
- Uploader SA → For uploading JSON files.
- Function Processor SA → For Cloud Function execution (storage, BQ, Pub/Sub access).
- Analyst SA → For querying processed BigQuery data.
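
As a rough illustration of how the Analyst SA could be used from Python, the snippet below impersonates it with short-lived credentials before querying BigQuery. The service-account email is a placeholder; the actual SA names and role bindings are defined in `main.tf`.

```python
# Illustrative sketch — the service-account email is a placeholder.
import google.auth
from google.auth import impersonated_credentials
from google.cloud import bigquery

source_creds, project_id = google.auth.default()
analyst_creds = impersonated_credentials.Credentials(
    source_credentials=source_creds,
    target_principal="analyst-sa@your-project-id.iam.gserviceaccount.com",  # placeholder
    target_scopes=["https://www.googleapis.com/auth/cloud-platform"],
)
client = bigquery.Client(project=project_id, credentials=analyst_creds)
```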
- Terraform provisions all GCP resources (`main.tf`).
- Ensures reproducibility and role-based security.
- Workflow: `terraform-deploy.yml`
- Runs on PRs and merges into `main`.
- Key steps:
  - Packages the Cloud Function (`zip`).
  - Runs `terraform init`, `validate`, `plan`, and `apply`.
  - Posts results as PR comments for visibility.
- GCP project with Workload Identity Federation configured.
- GitHub repository with secrets:
  - `WORKLOAD_IDENTITY_PRVDR`
  - `GCP_TERRAFORM_INFRA_SA`
- Clone the repo:

  ```bash
  git clone https://github.com/EzalB/serverless-data-processing-pipeline.git
  cd serverless-data-processing-pipeline
  ```
- Update `terraform/dev.tfvars` with project-specific values.
- Push changes to a feature branch and open a Pull Request.
- GitHub Actions will:
  - Run Terraform Plan (commented on PR).
  - On merge, Terraform Apply will provision resources.
- Upload a file (a Python alternative to `gsutil` is sketched after this list):
  - `gsutil cp sample.json gs://<project-id>-data-bucket/`
- Cloud Function validates & loads into BigQuery:
  - Invalid file → Error logged.
  - Valid file → Records inserted into `processed_data`.
- Pub/Sub publishes a notification:
  - `Data processed (filename: sample.json)`
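
If you prefer the client library over `gsutil`, a roughly equivalent upload is sketched below. The bucket name is a placeholder, and credentials for the Uploader SA are assumed to be available via Application Default Credentials.

```python
# Roughly equivalent to the gsutil command above; bucket name is a placeholder.
from google.cloud import storage

bucket_name = "<project-id>-data-bucket"  # placeholder — use the bucket Terraform created
blob = storage.Client().bucket(bucket_name).blob("sample.json")
blob.upload_from_filename("sample.json")
print(f"Uploaded sample.json to gs://{bucket_name}/")
```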
- Uses least-privilege custom IAM roles for the uploader, processor, and analyst.
- Service accounts scoped to required permissions only.
- No plaintext credentials — uses Workload Identity Federation.
- Fully serverless (no VM management).
- Event-driven (process only on file upload).
- Automated CI/CD with GitHub Actions.
- Scalable & cost-effective.