AutoTopology-CostModelForOperators

Modeling and simulation experiments on communication and computation costs.

Introduction

Having multiple senders and receivers compounds communication inefficiencies.

For small transfers, latencies dominate; more participants increase latency.

For large transfers, bandwidth is key; bottlenecks are easily exposed.

Thus, high performance requires a topology-aware implementation.
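The latency and bandwidth regimes described above can be illustrated with a minimal sketch, assuming the usual α-β model in which sending n bytes over one link costs α + n·β, with β read as transfer time per byte (inverse bandwidth). The link parameters and function names are hypothetical and chosen only for illustration.

```python
def transfer_time(n_bytes, alpha, beta):
    """Time to send n_bytes over one link: fixed latency plus per-byte cost."""
    return alpha + n_bytes * beta


def latency_share(n_bytes, alpha, beta):
    """Fraction of the transfer time spent in the latency term alpha."""
    return alpha / transfer_time(n_bytes, alpha, beta)


# Hypothetical link parameters, chosen only for illustration.
ALPHA = 5e-6        # 5 microseconds of latency per message
BETA = 1.0 / 10e9   # seconds per byte, i.e. a 10 GB/s link

small_share = latency_share(1024, ALPHA, BETA)             # 1 KiB message
large_share = latency_share(256 * 1024 ** 2, ALPHA, BETA)  # 256 MiB message

print(f"1 KiB:   {small_share:.1%} of the time is latency")
print(f"256 MiB: {large_share:.4%} of the time is latency")
```

Under these assumed numbers the small message is almost entirely latency, while the large message is almost entirely bandwidth.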

Note that collectives often do not overlap with one another, whereas computation and communication can overlap.

Topologies can be complex: not every system is a fat tree, and there are ring and mesh topologies as well.

Communication algorithms are topology-dependent.

Most collectives are amenable to bandwidth-optimal implementation on rings, and many topologies can be interpreted as one or more rings [P. Patarasuk and X. Yuan].
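To make the bandwidth-optimality of rings concrete, here is a sketch of the textbook ring all-reduce cost, built from a reduce-scatter followed by an all-gather. It assumes β is transfer time per byte and γ is computation time per byte. The formulas are the standard ones from the literature and are not necessarily identical to the ones used in this repository.

```python
def ring_reduce_scatter(n, p, alpha, beta, gamma):
    """p - 1 steps; each step sends and reduces one chunk of n / p bytes."""
    chunk = n / p
    return (p - 1) * (alpha + chunk * beta + chunk * gamma)


def ring_all_gather(n, p, alpha, beta):
    """p - 1 steps; each step forwards one chunk of n / p bytes."""
    return (p - 1) * (alpha + (n / p) * beta)


def ring_all_reduce(n, p, alpha, beta, gamma):
    """Ring all-reduce modeled as reduce-scatter followed by all-gather."""
    return (ring_reduce_scatter(n, p, alpha, beta, gamma)
            + ring_all_gather(n, p, alpha, beta))


if __name__ == "__main__":
    # Hypothetical parameters: 100 MB payload, 5 us latency, 10 GB/s links.
    for p in (2, 8, 64, 512):
        cost = ring_all_reduce(100e6, p, 5e-6, 1e-10, 0.0)
        print(f"p={p:4d}  cost={cost:.6f} s")
```

The bandwidth term is 2·(p-1)/p·n·β, which stays below 2·n·β no matter how many processes join the ring, while the latency term grows linearly with p.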

In this experiment, we model the latency, bandwidth, and computation cost of several communication operators, including reduce, broadcast, reduce-scatter, all-gather, and all-reduce.

On this basis, several factors are considered:

The bandwidth difference between intra-machine and inter-machine communication.

The operator complexity of different topological structures.

The overlap of communication and computation.

Procedure

The experimental process is shown in the figures below.

Cost Model

Our modeling of operators is shown in the table.

α: latency

β: bandwidth

γ: computation cost

n: number of bytes transferred

p: number of processes

s: number of messages the data is split into
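As an illustration of how these symbols can combine, the sketch below models a pipelined broadcast along a chain of p processes in which the n bytes are split into s messages. This is a standard textbook formula used here as an assumption about the role of s, not the repository's own table; `CostParams`, `pipelined_chain_broadcast`, and `best_split` are hypothetical names, and β is taken as transfer time per byte.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostParams:
    alpha: float  # latency per message, in seconds
    beta: float   # transfer time per byte (assumed to be inverse bandwidth)
    gamma: float  # computation time per byte (unused by a pure broadcast)


def pipelined_chain_broadcast(n, p, s, params):
    """Cost of broadcasting n bytes down a chain of p processes in s pieces.

    The first piece needs p - 1 hops to reach the last process, and the
    remaining s - 1 pieces follow one pipeline stage behind each other.
    """
    stages = (p - 1) + (s - 1)
    return stages * (params.alpha + (n / s) * params.beta)


def best_split(n, p, params, max_s=1024):
    """Brute-force search for the number of messages with the lowest cost."""
    return min(range(1, max_s + 1),
               key=lambda s: pipelined_chain_broadcast(n, p, s, params))


if __name__ == "__main__":
    params = CostParams(alpha=1e-6, beta=1e-9, gamma=0.0)
    s = best_split(1e6, 8, params)
    print("best s:", s)
    print("cost with s=1:", pipelined_chain_broadcast(1e6, 8, 1, params))
    print("cost with best s:", pipelined_chain_broadcast(1e6, 8, s, params))
```

Splitting into more messages shrinks the per-stage bandwidth term but adds latency-bound stages, so an intermediate s gives the lowest cost.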

Different frameworks support different operators, as shown in the table.

Different synchronization algorithms are composed of different operators, as shown in the table.

i: intra-group communication, with bandwidth β

o: inter-group communication, with bandwidth β'
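One way to picture the effect of the two bandwidths is the sketch below. It rests on several assumptions that do not come from the repository: ranks are placed contiguously in groups of `group_size`, both link types share the same latency α, every ring step is synchronous so the slowest link sets the step time, and β and β' are expressed as transfer time per byte. All function names are hypothetical.

```python
def link_beta(src, dst, group_size, beta_intra, beta_inter):
    """Per-byte cost of the link src -> dst: beta inside a group, beta' across."""
    same_group = src // group_size == dst // group_size
    return beta_intra if same_group else beta_inter


def ring_all_gather_two_level(n, p, group_size, alpha, beta_intra, beta_inter):
    """Ring all-gather in which every step waits for the slowest ring link."""
    slowest = max(
        link_beta(rank, (rank + 1) % p, group_size, beta_intra, beta_inter)
        for rank in range(p)
    )
    return (p - 1) * (alpha + (n / p) * slowest)


if __name__ == "__main__":
    # Hypothetical values: inter-group links are 10x slower per byte.
    one_group = ring_all_gather_two_level(800, 8, 8, 0.0, 1.0, 10.0)
    two_groups = ring_all_gather_two_level(800, 8, 4, 0.0, 1.0, 10.0)
    print("single group:", one_group)
    print("two groups:  ", two_groups)
```

As soon as the ring crosses a group boundary, the inter-group bandwidth β' governs every step under this model.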

For ring topology:

For mesh topology:

For tree topology:

Our model also accounts for the overlap of communication and computation.

When communication and computation overlap, the best case is complete overlap, in which the communication time is less than the computation time and is fully hidden. The cost model of each formula can then be rewritten as follows, where (α,β) stands for the communication cost and γ for the computation cost:

cost=max((α,β),γ)

If the overlap is not ideal, an overlap factor δ is required:

δ=(α,β)⊓γ

cost=(α,β)+γ-δ
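The two overlap formulas can be sketched as a single function. The sketch assumes that ⊓ means the overlapped portion of the two times, which is at most min(communication, computation). With the maximum possible overlap the result reduces to the max(...) form. The function name and arguments are hypothetical.

```python
def overlapped_cost(comm, comp, delta=None):
    """Total time when `delta` seconds of communication hide behind computation.

    comm  : communication time, the (alpha, beta) term
    comp  : computation time, the gamma term
    delta : overlapped time; defaults to the largest possible overlap
    """
    max_overlap = min(comm, comp)
    if delta is None:
        delta = max_overlap
    if not 0 <= delta <= max_overlap:
        raise ValueError("delta must lie between 0 and min(comm, comp)")
    return comm + comp - delta


if __name__ == "__main__":
    print(overlapped_cost(3.0, 5.0))             # complete overlap
    print(overlapped_cost(3.0, 5.0, delta=1.0))  # partial overlap
    print(overlapped_cost(3.0, 5.0, delta=0.0))  # no overlap
```

Under complete overlap the cost equals the larger of the two terms, and with no overlap it is their plain sum.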

Results

We only show part of the results.

For ring topology:

Normal case:

x-axis: cost.

y-axis: number of processes.

Latency-dominated case:

Bandwidth-dominated case:

Each synchronization algorithm has its own optimal deployment scenario, which makes AutoTopology necessary.
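The idea that no single algorithm wins everywhere can be sketched by comparing two textbook all-reduce cost formulas: a ring (reduce-scatter plus all-gather) and a binomial tree (reduce plus broadcast). These closed forms are standard results used here as assumptions, not the repository's measured data. β and γ are taken per byte, and `cheaper_algorithm` is a hypothetical helper.

```python
import math


def ring_all_reduce(n, p, alpha, beta, gamma):
    """2(p-1) latency-bound steps; every byte moves about twice in total."""
    return 2 * (p - 1) * alpha + (p - 1) / p * n * (2 * beta + gamma)


def tree_all_reduce(n, p, alpha, beta, gamma):
    """Binomial-tree reduce then broadcast; the full message moves per level."""
    depth = math.ceil(math.log2(p))
    return depth * (2 * alpha + n * (2 * beta + gamma))


def cheaper_algorithm(n, p, alpha, beta, gamma):
    """Return the name of the algorithm with the lower modeled cost."""
    ring = ring_all_reduce(n, p, alpha, beta, gamma)
    tree = tree_all_reduce(n, p, alpha, beta, gamma)
    return "ring" if ring <= tree else "tree"


if __name__ == "__main__":
    # Hypothetical cluster: 64 processes, 10 us latency, 1 GB/s links.
    for n in (64, 64e3, 64e6, 1e9):
        print(f"n={n:>12.0f} bytes -> {cheaper_algorithm(n, 64, 1e-5, 1e-9, 0.0)}")
```

With these assumed parameters the tree is cheaper for small, latency-dominated messages and the ring is cheaper for large, bandwidth-dominated ones.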


Youhe-Jiang/AutoTopology-CostModelForOperators
