A production-ready, scalable data processing system built on Kubernetes that handles 2TB of data with 10:1 compression using Apache Spark, Airflow, and MongoDB.
This system provides two processing modes:
- Batch Processing: Process static 2TB compressed files
- Stream Processing: Handle continuous data feeds in real-time
- Apache Spark: Distributed data processing engine
- Apache Airflow: Workflow orchestration
- MongoDB Sharded: Scalable document storage
- MinIO: S3-compatible object storage
- Apache Kafka: Streaming data platform
- Prometheus + Grafana: Monitoring and visualization
- Docker Desktop with Kubernetes enabled
- Kind (Kubernetes in Docker)
- Helm 3.x
- kubectl
- Python 3.8+
- 16GB+ RAM recommended
- 500GB+ available disk space
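If you want to sanity-check the toolchain before running setup, the standard version commands are enough:

# Confirm the required CLIs are installed and on PATH
docker --version
kind version
helm version --short
kubectl version --client
python3 --version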
git clone <repository-url>
cd k8s-data-pipeline
chmod +x setup-pipeline.sh
./setup-pipeline.sh

This script will:
- Create a Kind cluster with 3 worker nodes (see the example config after this list)
- Install all required Helm charts
- Configure storage and networking
- Build and load Docker images
- Set up RBAC and service accounts
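For reference, the three-worker cluster created in the first step can be described with a Kind config along these lines; the exact config the script uses (and any port or volume mounts it adds) may differ, and the file and cluster names below are assumptions:

# kind-config.yaml (illustrative sketch only)
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker
  - role: worker
  - role: worker

# Create the cluster from the config; the cluster name is an assumption
kind create cluster --name data-pipeline --config kind-config.yaml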
./port-forward.sh

Then access:
- Airflow UI: http://localhost:8080 (admin/admin)
- MinIO Console: http://localhost:9001 (minioadmin/minioadmin)
- Grafana: http://localhost:3000 (admin/admin123)
- Prometheus: http://localhost:9090
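port-forward.sh presumably wraps kubectl port-forward; roughly equivalent manual commands look like this (service names and namespaces are assumptions and may differ in your install):

# Approximate manual equivalents of port-forward.sh; adjust names to your install
kubectl port-forward svc/airflow-webserver 8080:8080 -n airflow &
kubectl port-forward svc/minio-console 9001:9001 -n minio &
kubectl port-forward svc/grafana 3000:3000 -n monitoring &
kubectl port-forward svc/prometheus-server 9090:9090 -n monitoring &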
Process static 2TB compressed files:
# Generate sample data (scaled down for testing)
./generate-sample-data.sh
# Submit Spark batch job
kubectl apply -f k8s-configs/spark-batch-application.yaml
# Monitor progress
kubectl logs -f spark-batch-processor-2tb-driver
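Before editing the job definition, it helps to know its rough shape. The batch job is managed by the Spark Operator's SparkApplication CRD (hence the kubectl get sparkapplications commands later in this README); an abridged sketch follows, where the image name, main application file, and Spark version are assumptions rather than the real values from k8s-configs/spark-batch-application.yaml:

# Abridged, illustrative SparkApplication; the authoritative manifest is
# k8s-configs/spark-batch-application.yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-batch-processor-2tb
spec:
  type: Python
  mode: cluster
  image: k8s-data-pipeline/spark:latest              # assumed image name
  mainApplicationFile: local:///app/python/main.py   # assumed path
  sparkVersion: "3.4.0"                               # assumed version
  driver:
    cores: 2
    memory: "4g"
    serviceAccount: spark
  executor:
    instances: 10
    cores: 4
    memory: "8g"

The Spark Operator names the driver pod <application-name>-driver, which is why the log command above targets spark-batch-processor-2tb-driver.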
Handle continuous data feeds:

# Start data generator
kubectl apply -f k8s-configs/data-generator-deployment.yaml
# Start stream processor
kubectl apply -f k8s-configs/spark-streaming-application.yaml
# Monitor stream
kubectl logs -f spark-stream-processor-2tb-driver
- Achieves a 10:1 compression ratio using zlib level 9 (see the sketch after this list)
- Processes data in optimal partition sizes
- Stores compressed data with metadata in MongoDB
- Horizontal scaling with Spark executors
- MongoDB sharding for distributed storage
- Dynamic resource allocation based on workload
- Checkpoint recovery for streaming
- Automatic retry mechanisms
- Persistent storage for critical data
- Real-time metrics with Prometheus
- Custom Grafana dashboards
- Spark UI for job monitoring
- Airflow for workflow visibility
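The compression step called out above can be pictured with a minimal Python sketch; the field names are assumptions, and the actual logic lives in python/batch_processor.py:

# Minimal illustration of zlib level-9 compression with ratio bookkeeping.
# Field names are assumptions, not the real document schema.
import json
import zlib

def compress_record(payload: dict) -> dict:
    raw = json.dumps(payload).encode("utf-8")
    compressed = zlib.compress(raw, level=9)   # level 9 = maximum compression, slowest
    return {
        "data": compressed,
        "original_size": len(raw),
        "compressed_size": len(compressed),
        "compression_ratio": len(raw) / len(compressed),
    }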
k8s-data-pipeline/
├── k8s-configs/                       # Kubernetes YAML configurations
│   ├── spark-batch-application.yaml
│   ├── spark-streaming-application.yaml
│   ├── airflow-dags-configmap.yaml
│   └── ...
├── python/                            # Python application code
│   ├── main.py                        # CLI entry point
│   ├── batch_processor.py             # Batch processing logic
│   ├── stream_processor.py            # Stream processing logic
│   └── data_generator.py              # Test data generation
├── Dockerfile                         # Docker image for Spark
├── setup-pipeline.sh                  # Main setup script
├── port-forward.sh                    # Service access script
└── k8s-data-pipeline-dashboard.html   # Interactive UI
Edit spark-batch-application.yaml:
sparkConf:
  "spark.executor.memory": "8g"
  "spark.executor.cores": "4"
  "spark.executor.instances": "10"
Configure in Helm values:

shards: 3
shardsvr:
  dataNode:
    replicas: 3
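One way to apply such values, assuming the Bitnami mongodb-sharded chart suggested by the pod labels used elsewhere in this README (the release name, namespace, and values file name are assumptions):

# Apply updated sharding values; release, namespace, and file names are assumptions
helm upgrade --install mongodb bitnami/mongodb-sharded \
  -f mongodb-values.yaml \
  -n mongodb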
# Create additional topics
kubectl exec -it kafka-cp-kafka-0 -- kafka-topics \
  --bootstrap-server localhost:9092 \
  --create --topic my-topic \
  --partitions 20 \
  --replication-factor 3
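To confirm the topic afterwards, describe it from the same broker pod:

# Describe the new topic to verify partitions and replication
kubectl exec -it kafka-cp-kafka-0 -- kafka-topics \
  --bootstrap-server localhost:9092 \
  --describe --topic my-topic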
- Optimal partition size: 100-200 MB per partition
- Compression level: 9 for maximum compression
- Batch insert size: 1000 documents
- Max rate per partition: 100,000 records/sec
- Checkpoint interval: 30 seconds
- Trigger interval: processingTime='30 seconds' (see the sketch after this list)
- Sharding key: Based on data distribution
- Indexes: Created on compression_ratio, timestamp
- Write concern: Majority for durability
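As a rough illustration of how the streaming settings above map onto PySpark Structured Streaming (the topic, bootstrap servers, sink, and checkpoint path are assumptions; python/stream_processor.py is the authoritative implementation):

# Illustrative only; requires the spark-sql-kafka connector on the classpath.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stream-tuning-sketch").getOrCreate()

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka-cp-kafka:9092")  # assumed service name
    .option("subscribe", "data-feed")                          # assumed topic
    .option("maxOffsetsPerTrigger", 100000)                    # bounds records per micro-batch
    .load()
)

query = (
    events.writeStream.format("console")                       # placeholder sink
    .option("checkpointLocation", "s3a://checkpoints/stream")  # assumed checkpoint path
    .trigger(processingTime="30 seconds")                      # matches the trigger interval above
    .start()
)
query.awaitTermination()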
./check-status.sh

# List Spark applications
kubectl get sparkapplications
# View driver logs
kubectl logs <spark-app-name>-driver
# View executor logs
kubectl logs -l spark-role=executor

# Trigger DAG manually
kubectl exec -n airflow deployment/airflow-webserver -- \
  airflow dags trigger batch_2tb_processing

# Connect to MongoDB
kubectl exec -it mongodb-mongos-0 -- mongosh
# Check compression stats
db.compressed_data.documents.aggregate([
  { $group: {
      _id: null,
      avgRatio: { $avg: "$compression_ratio" },
      totalDocs: { $sum: 1 }
  }}
])
- Insufficient Resources

  # Increase Kind cluster resources
  docker system prune -af
  # Edit kind config to add more workers

- Spark Job Failures

  # Check events
  kubectl describe sparkapplication <app-name>
  # Increase memory/cores in spark config

- MongoDB Connection Issues

  # Verify service endpoints
  kubectl get svc | grep mongo
  # Check MongoDB pod status
  kubectl get pods -l app.kubernetes.io/name=mongodb-sharded
- Enable authentication for all services in production
- Use network policies to restrict pod communication (see the example policy after this list)
- Implement RBAC with least privilege principle
- Encrypt data at rest and in transit
- Regular security updates for all components
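As a concrete example of the network-policy point, a policy that restricts MongoDB ingress to Spark executor pods could look like the sketch below; the namespace and label selectors are assumptions about this deployment, and a real policy would also have to allow MongoDB's own intra-cluster traffic:

# Illustrative NetworkPolicy; namespaces and labels are assumptions
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: mongodb-allow-spark-only
  namespace: mongodb
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: mongodb-sharded
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector: {}              # any namespace
          podSelector:
            matchLabels:
              spark-role: executor           # the executor label used earlier in this README
      ports:
        - protocol: TCP
          port: 27017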
Expected performance with recommended configuration:
- Batch processing: ~100-200 MB/s throughput
- Stream processing: ~1M records/minute
- Compression ratio: 8-12:1 depending on data
- MongoDB write speed: ~50k documents/second
- Fork the repository
- Create a feature branch
- Commit your changes
- Push to the branch
- Create a Pull Request
- Apache Spark community
- Kubernetes SIG Big Data
- MongoDB engineering team
- Open source contributors
Note: This is a demonstration system. For production use, ensure proper security, backup, and disaster recovery mechanisms are in place.