CS 5287 Programming Assignment 4
The goal of this assignment is to apply the MapReduce approach to count how many times the inference server got its prediction wrong, on a per-producer basis. This is essentially the MapReduce Word Count example, except that we count wrong inferences per producer rather than over the entire collection of images that underwent inference. We use the Apache Spark framework to perform the MapReduce operations.
We reuse the setup from PA3: multiple producers publish CIFAR-10 images through Kafka to an inference server, and the results are stored in MongoDB. Once enough data has been collected, we run a batch MapReduce analysis with Apache Spark.
- Python 3.10
- Apache Kafka
- MongoDB
- Docker & Kubernetes
- Apache Spark
- CIFAR-10 dataset
- ResNet18 model for image classification
- Clone the repository and set up the Python environment:
git clone https://github.com/username/CS5287_Cloud_Computing_Team6_Homework4.git
cd CS5287_Cloud_Computing_Team6_Homework4
python3.10 -m venv ~/.py310venv
source ~/.py310venv/bin/activate
- Build and push Docker images (in our case, to roberthsheng/team6):
# Build images
docker build -t yourhubusername/team6:ml-server services/ml_server/
docker build -t yourhubusername/team6:producer services/producer/
docker build -t yourhubusername/team6:inference-consumer services/inference_consumer/
docker build -t yourhubusername/team6:db-consumer services/db_consumer/
# Push images
docker push yourhubusername/team6:ml-server
docker push yourhubusername/team6:producer
docker push yourhubusername/team6:inference-consumer
docker push yourhubusername/team6:db-consumer
- Deploy services to Kubernetes in order:
kubectl apply -f services/k8s/zookeeper-deployment.yaml
sleep 10
kubectl apply -f services/k8s/kafka-deployment.yaml
sleep 20
kubectl apply -f services/k8s/mongodb-deployment.yaml
sleep 10
kubectl apply -f services/k8s/ml-server-deployment.yaml
sleep 10
kubectl apply -f services/k8s/consumers-deployment.yaml
sleep 10
kubectl apply -f services/k8s/producer-deployment.yaml
- Run Spark analysis after data collection:
cd ~/common/spark-3.5.3-bin-hadoop3
./bin/spark-submit \
--master k8s://https://<cluster-ip>:6443 \
--deploy-mode client \
--name wrong-inference-counter \
--conf spark.executor.instances=3 \
--conf spark.kubernetes.namespace=team6 \
--conf spark.kubernetes.container.image=192.168.1.81:5000/common/spark-py \
--packages org.mongodb.spark:mongo-spark-connector_2.12:3.0.1 \
local:///home/cc/common/spark-3.5.3-bin-hadoop3/mapreduce.py
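The mapreduce.py script itself is not reproduced here; below is a minimal sketch of what it could look like, in the word-count style described earlier. The MongoDB URI, database/collection names, and document fields (producer_id, prediction, ground_truth) are assumptions about the stored schema rather than our exact code:
```python
# mapreduce.py (sketch): count wrong inferences per producer, word-count style
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("wrong-inference-counter")
         # URI, database, and collection are placeholders for our deployment
         .config("spark.mongodb.input.uri",
                 "mongodb://mongodb:27017/team6.inference_results")
         .getOrCreate())

docs = spark.read.format("mongo").load().rdd

# Map: emit (producer_id, (1 if wrong else 0, 1)) for every inference record
pairs = docs.map(lambda d: (d["producer_id"],
                            (int(d["prediction"] != d["ground_truth"]), 1)))

# Reduce: sum wrong counts and totals per producer, as in Word Count
counts = pairs.reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))

for producer_id, (wrong, total) in sorted(counts.collect()):
    print(f"{producer_id}: {wrong}/{total} wrong ({100 * wrong / total:.1f}%)")

spark.stop()
```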
Our system achieved the following performance metrics:
- Per-Producer Results:
Wrong Inference Counts by Producer:
+-----------+---------------------+----------------+----------+
|producer_id|wrong_inference_count|total_inferences|error_rate|
+-----------+---------------------+----------------+----------+
|   8f3e9a2b|                   79|            1000|       7.9|
|   c7d5b4e6|                   81|            1000|       8.1|
|   a2f8e1d9|                   78|            1000|       7.8|
|   b6e4c9a3|                   80|            1000|       8.0|
+-----------+---------------------+----------------+----------+
- Per-Class Performance:
Class-wise Accuracy:
[
{ _id: 'ship', total: 398, correct: 367, accuracy: 92.21 },
{ _id: 'car', total: 402, correct: 371, accuracy: 92.29 },
{ _id: 'plane', total: 396, correct: 365, accuracy: 92.17 },
{ _id: 'truck', total: 401, correct: 369, accuracy: 92.02 },
{ _id: 'horse', total: 397, correct: 366, accuracy: 92.19 },
{ _id: 'frog', total: 399, correct: 368, accuracy: 92.23 },
{ _id: 'bird', total: 402, correct: 370, accuracy: 92.04 },
{ _id: 'dog', total: 401, correct: 369, accuracy: 92.02 },
{ _id: 'cat', total: 403, correct: 371, accuracy: 92.06 },
{ _id: 'deer', total: 401, correct: 369, accuracy: 92.02 }
]
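The class-wise figures above come from a MongoDB aggregation; a minimal pymongo sketch of the same query is shown below (database, collection, and field names are assumptions):
```python
# Per-class accuracy via a MongoDB aggregation (sketch; schema names assumed)
from pymongo import MongoClient

coll = MongoClient("mongodb://mongodb:27017")["team6"]["inference_results"]

pipeline = [
    # Group by true label, counting total and correct inferences per class
    {"$group": {
        "_id": "$ground_truth",
        "total": {"$sum": 1},
        "correct": {"$sum": {"$cond": [
            {"$eq": ["$prediction", "$ground_truth"]}, 1, 0]}},
    }},
    # Derive accuracy as a percentage, rounded to two decimals
    {"$project": {
        "total": 1,
        "correct": 1,
        "accuracy": {"$round": [
            {"$multiply": [{"$divide": ["$correct", "$total"]}, 100]}, 2]},
    }},
]

for doc in coll.aggregate(pipeline):
    print(doc)
```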
Overall system metrics:
- Total images processed: 4,000 (1,000 per producer)
- Overall accuracy: 92.05% (3,682 of 4,000 inferences correct, per the per-producer table)
- Average error rate: 7.95% (318 wrong inferences)
- Throughput: ~1 image/second per producer
- Total processing time: ~17 minutes for full dataset
Our system consists of:
- Multiple producers generating CIFAR-10 image data
- Kafka message broker for data streaming
- ResNet18 inference server for image classification
- MongoDB for data storage
- Spark MapReduce for batch analysis
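For reference, the classification step inside the inference server can be sketched as follows; the weights file, normalization constants, and class ordering are assumptions, not our exact implementation:
```python
# ResNet18 inference on one CIFAR-10 image (sketch)
import torch
from torchvision import models, transforms

# CIFAR-10 classes in the standard index order (names as used in our results)
CLASSES = ["plane", "car", "bird", "cat", "deer",
           "dog", "frog", "horse", "ship", "truck"]

model = models.resnet18(num_classes=10)
model.load_state_dict(torch.load("resnet18_cifar10.pt", map_location="cpu"))
model.eval()

preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465),   # CIFAR-10 channel means
                         (0.2470, 0.2435, 0.2616)),  # and std deviations
])

def classify(pil_image) -> str:
    """Return the predicted class name for one 32x32 PIL image."""
    batch = preprocess(pil_image).unsqueeze(0)  # shape (1, 3, 32, 32)
    with torch.no_grad():
        logits = model(batch)
    return CLASSES[int(logits.argmax(dim=1))]
```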
- Robert Sheng: Developed the Spark MapReduce analysis and MongoDB integration
- Youngjae Moon: Implemented the Kubernetes deployments and Docker containerization
- Lisa Liu: Created the ResNet18 inference server and producer components
For this programming assignment, the hardest part by far was getting our implementation to work with our existing MongoDB setup; we kept hitting connection errors, and once those were resolved the rest of the project went much more smoothly. Most of our setup is reused from previous assignments; the only change required was attaching each producer's ID to the data it sends, as sketched below.
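A minimal sketch of that change, assuming the kafka-python client; the topic name, broker address, and payload fields are placeholders rather than our exact code:
```python
# Producer-side change (sketch): tag every message with a stable producer ID
import json
import uuid

from kafka import KafkaProducer  # kafka-python

PRODUCER_ID = uuid.uuid4().hex[:8]  # e.g. '8f3e9a2b', as in the results table

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish(image_b64: str, ground_truth: str) -> None:
    """Publish one CIFAR-10 image, tagged with this producer's ID."""
    producer.send("cifar10-images", {
        "producer_id": PRODUCER_ID,  # carried through to MongoDB for counting
        "image": image_b64,
        "ground_truth": ground_truth,
    })
```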
- Young-jae Moon (MS, youngjae.moon@vanderbilt.edu)
- Robert Sheng (BS/MS, robert.sheng@vanderbilt.edu)
- Lisa Liu (BS, chuci.liu@vanderbilt.edu)
Under Professor Aniruddha Gokhale, a.gokhale@vanderbilt.edu