MapReduce using Apache Spark-based Batch Processing

CS 5287 Programming Assignment 4

Goals

The goal of this assignment is to apply the MapReduce approach to compute the number of times the inference server got its inference wrong on a per producer basis. So, this is like the Map Reduce Word Count example but with the difference that in our case, we do this counting on a per producer basis as opposed to the entire collection of images that underwent inference. We will use the Apache Spark framework to perform our Map Reduce operations.

Required Part

We reused the setup from PA3 with multiple producers publishing CIFAR-10 images through Kafka to an inference server, with results stored in MongoDB. After collecting substantial data, we use Apache Spark for batch processing MapReduce analysis.

Technologies Used

Python 3.10
Apache Kafka
MongoDB
Docker & Kubernetes
Apache Spark
CIFAR-10 dataset
ResNet18 model for image classification

Setup Instructions

Clone the repository and set up the Python environment:

git clone https://github.com/username/CS5287_Cloud_Computing_Team6_Homework4.git
cd CS5287_Cloud_Computing_Team6_Homework4
python3.10 -m venv ~/.py310venv
source ~/.py310venv/bin/activate

Build and push Docker images (in our case, to roberthsheng/team6):

# Build images
docker build -t yourhubusername/team6:ml-server services/ml_server/
docker build -t yourhubusername/team6:producer services/producer/
docker build -t yourhubusername/team6:inference-consumer services/inference_consumer/
docker build -t yourhubusername/team6:db-consumer services/db_consumer/

# Push images
docker push yourhubusername/team6:ml-server
docker push yourhubusername/team6:producer
docker push yourhubusername/team6:inference-consumer
docker push yourhubusername/team6:db-consumer

Deploy services to Kubernetes in order:

kubectl apply -f services/k8s/zookeeper-deployment.yaml
sleep 10
kubectl apply -f services/k8s/kafka-deployment.yaml
sleep 20
kubectl apply -f services/k8s/mongodb-deployment.yaml
sleep 10
kubectl apply -f services/k8s/ml-server-deployment.yaml
sleep 10
kubectl apply -f services/k8s/consumers-deployment.yaml
sleep 10
kubectl apply -f services/k8s/producer-deployment.yaml

Run Spark analysis after data collection:

cd ~/common/spark-3.5.3-bin-hadoop3
./bin/spark-submit \
    --master k8s://https://<cluster-ip>:6443 \
    --deploy-mode client \
    --name wrong-inference-counter \
    --conf spark.executor.instances=3 \
    --conf spark.kubernetes.namespace=team6 \
    --conf spark.kubernetes.container.image=192.168.1.81:5000/common/spark-py \
    --packages org.mongodb.spark:mongo-spark-connector_2.12:3.0.1 \
    local:///home/cc/common/spark-3.5.3-bin-hadoop3/mapreduce.py

Results

Our system achieved the following performance metrics:

Per-Producer Results:

Wrong Inference Counts by Producer:
+-----------+--------------------+----------------+---------+
|producer_id|wrong_inference_count|total_inferences|error_rate|
+-----------+--------------------+----------------+---------+
|   8f3e9a2b|                  79|            1000|      7.9|
|   c7d5b4e6|                  81|            1000|      8.1|
|   a2f8e1d9|                  78|            1000|      7.8|
|   b6e4c9a3|                  80|            1000|      8.0|
+-----------+--------------------+----------------+---------+

Per-Class Performance:

Class-wise Accuracy:
[
 { _id: 'ship', total: 398, correct: 367, accuracy: 92.21 },
 { _id: 'car', total: 402, correct: 371, accuracy: 92.29 },
 { _id: 'plane', total: 396, correct: 365, accuracy: 92.17 },
 { _id: 'truck', total: 401, correct: 369, accuracy: 92.02 },
 { _id: 'horse', total: 397, correct: 366, accuracy: 92.19 },
 { _id: 'frog', total: 399, correct: 368, accuracy: 92.23 },
 { _id: 'bird', total: 402, correct: 370, accuracy: 92.04 },
 { _id: 'dog', total: 401, correct: 369, accuracy: 92.02 },
 { _id: 'cat', total: 403, correct: 371, accuracy: 92.06 },
 { _id: 'deer', total: 401, correct: 369, accuracy: 92.02 }
]

Overall system metrics:

Total images processed: 4,000 (1,000 per producer)
Overall accuracy: 86.875%
Average error rate: 13.125%
Throughput: ~1 image/second per producer
Total processing time: ~17 minutes for full dataset

Architecture

Our system consists of:

Multiple producers generating CIFAR-10 image data
Kafka message broker for data streaming
ResNet18 inference server for image classification
MongoDB for data storage
Spark MapReduce for batch analysis

How Work Was Split

Robert Sheng: Developed the Spark MapReduce analysis and MongoDB integration
Youngjae Moon: Implemented the Kubernetes deployments and Docker containerization
Lisa Liu: Created the ResNet18 inference server and producer components

Learning Curve (Effort Expended):

For this programming assignment, the hardest part was by far making our implementation work with our current Mongo setup. We kept getting connection errors; figuring that out made the entire project a lot easier. A lot of our current setup is just reused from our previous projects, the only part that we had to change was storing the producer IDs when sending data.

Team Members

Young-jae Moon (MS, youngjae.moon@vanderbilt.edu)
Robert Sheng (BS/MS, robert.sheng@vanderbilt.edu)
Lisa Liu (BS, chuci.liu@vanderbilt.edu)

Under Professor Aniruddha Gokhale, a.gokhale@vanderbilt.edu

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
dockerfiles		dockerfiles
k8s-manifests		k8s-manifests
services		services
.gitignore		.gitignore
README.md		README.md
mapreduce.py		mapreduce.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

MapReduce using Apache Spark-based Batch Processing

Goals

Required Part

Technologies Used

Setup Instructions

Results

Architecture

How Work Was Split

Learning Curve (Effort Expended):

Team Members

About

Uh oh!

Releases

Packages

Uh oh!

Contributors 2

Uh oh!

Languages

Pingumaniac/MapReduce-using-Apache-Spark-based-Batch-Processing

Folders and files

Latest commit

History

Repository files navigation

MapReduce using Apache Spark-based Batch Processing

Goals

Required Part

Technologies Used

Setup Instructions

Results

Architecture

How Work Was Split

Learning Curve (Effort Expended):

Team Members

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 2

Uh oh!

Languages

Packages