ArchieGertsman/spark-sched-sim
spark-sched-sim

An Apache Spark job scheduling simulator, implemented as a Gymnasium environment.

Two Gantt charts comparing the behavior of different job scheduling algorithms. In these experiments, 50 jobs are identified by unique colors and processed in parallel by 10 identical executors (stacked vertically). Decima achieves better resource packing and lower average job completion time than Spark's fair scheduler.

What is job scheduling in Spark?

  • A Spark application is a long-running program within the cluster that submits jobs to be processed by its share of the cluster's resources. Each job encodes a directed acyclic graph (DAG) of stages that depend on each other, where a dependency $A\to B$ means that stage $A$ must finish executing before stage $B$ can begin. Each stage consists of many identical tasks, which are units of work that operate over different shards of data. Tasks are processed by executors, which are JVMs running on the cluster's worker nodes.
  • Scheduling jobs means designating which tasks run on which executors at each point in time.
  • For more background on Spark, see this article.
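To make the DAG structure concrete, here is a small illustrative sketch (not taken from the repo; the stage names and task counts are hypothetical) of a job as a DAG of stages, and of how a scheduler determines which stages are runnable once their dependencies finish:

```python
# A hypothetical job: each stage maps to (number of tasks, upstream stages
# that must finish before it can start).
job = {
    "map":     (8, set()),
    "shuffle": (8, {"map"}),
    "join":    (4, {"map", "shuffle"}),
    "reduce":  (2, {"join"}),
}

def runnable_stages(job, finished):
    """Stages that are not yet done and whose dependencies have all completed."""
    return {stage for stage, (_, deps) in job.items()
            if stage not in finished and deps <= finished}

print(runnable_stages(job, set()))               # only "map" can start
print(runnable_stages(job, {"map"}))             # "shuffle" unlocks next
print(runnable_stages(job, {"map", "shuffle"}))  # then "join"
```

A scheduler repeatedly picks tasks from the runnable stages and assigns them to free executors; the policies compared in the charts above differ in how they make that choice.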

Why this simulator?

  • Job scheduling is important, because a smarter scheduling algorithm can result in faster job turnaround time.
  • This simulator allows researchers to test scheduling heuristics and train neural schedulers using reinforcement learning.

This repository is a PyTorch Geometric implementation of the Decima codebase, adhering to the Gymnasium interface. It also includes enhancements to the reinforcement learning algorithm and model design, along with a basic PyGame renderer that generates the above charts in real time.

Enhancements include:

  • Continuously discounted returns, improving training speed
  • Proximal Policy Optimization (PPO), improving training speed and stability
  • A restricted action space, encouraging a fairer policy to be learned
  • Multiple different job sequences experienced per training iteration, reducing variance in the policy gradient (PG) estimate
  • No learning curriculum, improving training speed
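To illustrate the first enhancement, here is a hedged sketch of one common way to compute continuously discounted returns (not necessarily the repo's exact code; the function and variable names are illustrative). Instead of discounting once per decision step, each reward is discounted by the wall-clock time elapsed since the previous decision, which matters in a simulator where steps are unevenly spaced in time:

```python
def discounted_returns(rewards, deltas, gamma=0.99):
    """Return-to-go at each step with time-based (continuous) discounting.

    rewards[t] is the reward received at decision t; deltas[t] is the
    wall-clock time between decision t and decision t + 1. The recursion
    is G_t = r_t + gamma**deltas[t] * G_{t+1}.
    """
    g, returns = 0.0, [0.0] * len(rewards)
    # Walk backwards so each return accumulates all future rewards,
    # each attenuated by the simulated time until it arrives.
    for t in reversed(range(len(rewards))):
        g = rewards[t] + (gamma ** deltas[t]) * g
        returns[t] = g
    return returns

# Two seconds between decisions with gamma = 0.5 halves twice per step:
print(discounted_returns([1.0, 1.0, 1.0], [2.0, 2.0, 2.0], gamma=0.5))
```

With fixed unit deltas this reduces to the usual per-step discounted return, so it generalizes rather than replaces the standard formulation.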

After cloning this repo, please run pip install -r requirements.txt to install the project's dependencies.

To get started, try running the examples via examples.py --sched [fair|decima]. To train Decima from scratch, modify the provided config file config/decima_tpch.yaml as needed, then pass the config to train.py -f CFG_FILE.
