Non uniform partitioning for `Distributed` architectures #3339

simone-silvestri · 2023-10-13T18:22:00Z

This PR tweaks the API to simplify non-uniform partitioning, which should already be supported by the algorithm.

This PR also extends the tests to include non-uniform distributed partitioning.

The proposal of this PR (open to discussion and tweaking) is to allow calling

arch = Distributed(CPU(); partition = Partition(Rx = [0.3, 0.1, 0.6]))

which distributes the domain over 3 workers that hold 30%, 10%, and 60% of the computation, respectively.
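To make the 30% / 10% / 60% split concrete, here is a minimal sketch of how fractions could be turned into per-worker grid sizes. The helper `local_sizes` is hypothetical and is not part of Oceananigans; the rounding rule (floor, then hand the leftover points to the last worker) is an assumption chosen only so that the sizes always add up to `N`.

```julia
# Hypothetical helper, not the Oceananigans implementation: split `N` grid
# points among workers in proportion to `fractions`.
function local_sizes(N::Int, fractions)
    weights = collect(fractions) ./ sum(fractions)  # normalize so the weights sum to 1
    sizes = floor.(Int, N .* weights)               # provisional size for each worker
    sizes[end] += N - sum(sizes)                    # assumed rule: leftover points go to the last worker
    return sizes
end

sizes = local_sizes(100, [0.3, 0.1, 0.6])
println(sizes)  # three sizes that sum to 100
```

Whatever rounding rule is used, the invariant worth checking is that the local sizes sum to the global size of the partitioned direction.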

glwagner · 2023-10-27T15:48:50Z

Can you update the top-level description, and add a docstring for `Partition` with a few examples that enumerate the possible syntaxes for the common, important cases?

glwagner · 2023-11-02T15:58:54Z

src/DistributedComputations/distributed_architectures.jl

+`x`, `y` and `z` can be `Int`, `Equal`, `Fractional` or `Sizes` 
+(see below)


I think we need to explain what each option means, something like:

Suggested change

`x`, `y` and `z` can be `Int`, `Equal`, `Fractional` or `Sizes`

(see below)

`x`, `y` and `z` can be:

`x::Int`: allocate `x` processors to the first dimension

`Equal()`: divide the domain in `x` equally among the remaining processes

`Fractional(ϵ1, ϵ2, ..., ϵN):` divide the domain unequally among `N` processes. The total work is `W = sum(ϵi)`, and each process is then allocated `ϵi / W` of the domain.

`Sizes`:
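As a small illustration of the `ϵi / W` rule described for `Fractional`, the sketch below normalizes arbitrary positive weights so that they sum to 1. The function name `normalized_fractions` is hypothetical; it only mirrors the arithmetic and does not construct the actual `Fractional` type.

```julia
# Hypothetical helper mirroring the normalization rule: each weight ϵi is
# divided by the total work W = sum(ϵ), so the result sums to 1.
normalized_fractions(ϵ::Number...) = ϵ ./ sum(ϵ)

# The inputs do not need to sum to 1; only their ratios matter.
fractions = normalized_fractions(1, 2, 1)
println(fractions)  # (0.25, 0.5, 0.25)
```

Under this rule, `(1, 2, 1)` and `(0.25, 0.5, 0.25)` describe the same partition.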

glwagner · 2023-11-02T15:59:52Z

src/DistributedComputations/distributed_architectures.jl

+Fractional(args...) = Fractional(tuple(args ./ sum(args)...))  # We need to make sure that `sum(R) == 1`
+     Sizes(args...) = Sizes(tuple(args...))


These need docstrings. Make sure to add an `@ref` to `Partition`.

glwagner · 2023-11-02T16:02:11Z

src/DistributedComputations/distributed_architectures.jl

+"""type representing equal domain partitioning (not supported for more than one direction)"""
+struct Equal end


I think we need a real docstring because this is user facing, something like

Suggested change

"""type representing equal domain partitioning (not supported for more than one direction)"""

struct Equal end

"""

Equal()

Return a type that partitions a direction equally among remaining processes.

`Equal()` can be used for only one direction. Other directions must either be unspecified, or

specifically defined by `Int`, `Fractional`, or `Sizes`.

"""
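One plausible reading of "equally among remaining processes" is sketched below: the ranks assigned to the `Equal()` direction are the total ranks divided by the ranks already claimed by the other directions. The names `specified_ranks` and `resolve_equal` are hypothetical, the `Equal` struct is a stand-in for the one in the diff, and only `Int` and unspecified (`nothing`) directions are handled here.

```julia
# Stand-in for the `Equal` type discussed above.
struct Equal end

# Hypothetical helper: number of ranks claimed by a direction.
specified_ranks(::Nothing) = 1   # unspecified direction: not partitioned
specified_ranks(r::Int) = r

# Hypothetical helper: ranks left over for the single `Equal()` direction.
function resolve_equal(total_ranks::Int, x, y, z)
    dirs = (x, y, z)
    count(d -> d isa Equal, dirs) == 1 ||
        throw(ArgumentError("exactly one direction must be `Equal()`"))
    others = prod(specified_ranks(d) for d in dirs if !(d isa Equal))
    total_ranks % others == 0 ||
        throw(ArgumentError("the remaining ranks do not divide evenly"))
    return total_ranks ÷ others
end

# 8 ranks in total, 2 along x, so 4 remain for y.
println(resolve_equal(8, 2, Equal(), nothing))  # 4
```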

glwagner · 2023-11-02T16:03:33Z

src/DistributedComputations/distributed_architectures.jl

+Base.show(io::IO, p::Partition) =
+    print(io,
+    "Domain partitioning with $(ranks(p)) ranks", "\n",
+    "├── x-partitioning: $(ranks(p.x) == 1 ? "none" : p.x)", "\n",


Should we say "1" rather than "none"?

The purpose of a display is to give as much code-relevant information about the content of a type as possible. So I think "1" is more accurate, while "none" is not right: it doesn't correspond to any Julia type (unless we are using `nothing`, but in that case we should say "nothing", not "none").

I like to differentiate from 1, because with 1 rank there is no partitioning. `nothing` is a good option.

Probably it is just better to omit the directions in which rank == 1.

What's the difference between "1" and "nothing"? If you write `x = 1`, isn't that the same thing as no partition?

It's important that it matches the underlying code, i.e. `partition.x` should match what we claim the x partitioning is. That's a general philosophy.

OK, let's go with `nothing` then.

OK, I think we need to change `validate_partition` as well in that case.

glwagner · 2023-11-02T16:03:53Z

test/utils_for_runtests.jl

               # TODO: add support for Non uniform partitioning
               # Distributed(child_arch; partition = Partition(Rx = [0.2, 0.1, 0.5, 0.3])))
               # Distributed(child_arch; partition = Partition(Ry = [0.2, 0.1, 0.5, 0.3])))


Suggested change

# TODO: add support for Non uniform partitioning

# Distributed(child_arch; partition = Partition(Rx = [0.2, 0.1, 0.5, 0.3])))

# Distributed(child_arch; partition = Partition(Ry = [0.2, 0.1, 0.5, 0.3])))

glwagner · 2023-11-02T16:48:20Z

src/DistributedComputations/distributed_architectures.jl

@@ -39,7 +39,7 @@ if supplied as positional arguments `x` will be the first argument,
    `Equal()`: divide the domain in `x` equally among the remaining processes (not supported for multiple directions)
    `Fractional(ϵ1, ϵ2, ..., ϵN):` divide the domain unequally among `N` processes. The total work is `W = sum(ϵi)`, 
                                   and each process is then allocated `ϵi / W` of the domain.
-    `Sizes(ϵ1, ϵ2, ..., ϵN)`: divide the domain unequally. EThe total work is `W = sum(ϵi)`, 
+    `Sizes(ϵ1, ϵ2, ..., ϵN)`: divide the domain unequally. The total work is `W = sum(ϵi)`, 


Suggested change

`Sizes(ϵ1, ϵ2, ..., ϵN)`: divide the domain unequally. The total work is `W = sum(ϵi)`,

`Sizes(n1, n2, ..., nN)`: divide the direction by number of grid points, where `ni` is the number of grid points allocated to process `i`. The total size of the direction is `N = sum(ni)`,

Maybe a different letter than N for the number of processors, maybe P is better?
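To illustrate the `Sizes` semantics with the suggested notation (`P` processes, `N = sum(ni)` grid points), here is a minimal sketch that checks a set of sizes against the global size and derives the index range each process would own. Both `validate_sizes` and `index_ranges` are hypothetical helpers, not functions from Oceananigans.

```julia
# Hypothetical check: the P sizes must be positive and sum to the global size N.
function validate_sizes(N::Int, sizes::NTuple{P, Int}) where P
    all(>(0), sizes) || throw(ArgumentError("every size must be positive"))
    sum(sizes) == N  || throw(ArgumentError("sizes sum to $(sum(sizes)), expected $N"))
    return sizes
end

# Hypothetical helper: contiguous global index range owned by each process.
function index_ranges(sizes)
    n = collect(sizes)
    stops  = cumsum(n)          # last global index of each process
    starts = stops .- n .+ 1    # first global index of each process
    return [a:b for (a, b) in zip(starts, stops)]
end

sizes = validate_sizes(100, (30, 10, 60))
println(index_ranges(sizes))  # [1:30, 31:40, 41:100]
```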

glwagner · 2023-11-02T16:49:12Z

src/DistributedComputations/distributed_architectures.jl

-     Sizes(args...) = Sizes(tuple(args...))
+
+"""
+`Sizes(ϵ1, ϵ2, ..., ϵN)`


I used ϵ for fractions because that made sense but we should use a different letter for sizes.

It's good to get this right because users will copy the notation we use in docstrings, so it propagates everywhere.

glwagner · 2023-11-02T16:57:57Z

src/DistributedComputations/distributed_architectures.jl

+
+Base.size(p::Partition) = ranks(p)
+
+validate_partition(x, y, z) = (x, y, z)


This has to convert 1 to nothing right?
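A minimal sketch of the conversion being asked about, assuming the agreed convention that a direction with a single rank is stored as `nothing`. The names `normalize_direction` and `validate_partition_sketch` are hypothetical; the merged `validate_partition` may look different.

```julia
# Hypothetical normalization: a direction with one rank is not partitioned,
# so it is represented by `nothing`. Other specifications pass through.
normalize_direction(x) = x
normalize_direction(x::Int) = x == 1 ? nothing : x

validate_partition_sketch(x, y, z) = map(normalize_direction, (x, y, z))

println(validate_partition_sketch(4, 1, nothing))  # (4, nothing, nothing)
```

With this convention, `partition.y` and the text printed by `show` agree: both report `nothing` for an unpartitioned direction.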

simone-silvestri added 30 commits October 10, 2023 14:57

partitioning

bb10cae

ready for regression

c679d88

mutable

34e07a5

remove topology from arch

fd8e099

fix topology issue

de1177d

no need to regularize connectivity

9a87593

comments

e40dfe6

some bugfixxes

8c5bf7f

more bug fixing

a2165ca

one bug down

9831b7e

fix indent

492d376

bugfix

28b4293

start changing a bit stuff

8047465

bugfix

798892f

fixed tests

2855375

fixed tests

1cd66f7

bugifx

527a246

bugfix

7ca0e40

bugfix

7b632bc

fixed tests

ce75461

downloading correct data

c88901d

another test

f8edf35

last debuggging to follow

3952535

bounded regression to fix

394d593

test on caltech cluster

1fb7d3f

test correct files

cbc20c2

Merge branch 'ss/distributed_tests' into ss/non_uniform_partitioning

2c85645

integer size

9f76eeb

percentages

2dfcc85

correct keys

b43f1fc

simone-silvestri and others added 3 commits October 27, 2023 14:03

comments

b304e89

comment

e27b189

Merge branch 'main' into ss/non_uniform_partitioning

9ce3533

glwagner reviewed Nov 2, 2023

simone-silvestri added 2 commits November 2, 2023 12:31

improve show method + add comments

ee3b6be

docstrings

5a6afa5

glwagner reviewed Nov 2, 2023

simone-silvestri added 4 commits November 2, 2023 13:10

finished?

48e0a9d

added nothing

fce60eb

change to correct tests

a5a07a3

remove space

3c4817c

glwagner approved these changes Nov 2, 2023

glwagner reviewed Nov 2, 2023

glwagner reviewed Nov 2, 2023

glwagner reviewed Nov 2, 2023

simone-silvestri and others added 6 commits November 2, 2023 13:52

Update src/DistributedComputations/distributed_architectures.jl

a8da638

Co-authored-by: Gregory L. Wagner <wagner.greg@gmail.com>

Update src/DistributedComputations/distributed_architectures.jl

12d0517

Co-authored-by: Gregory L. Wagner <wagner.greg@gmail.com>

Update src/DistributedComputations/distributed_architectures.jl

145e4f3

Co-authored-by: Gregory L. Wagner <wagner.greg@gmail.com>

all ones to nothing

74394c5

Merge branch 'ss/non_uniform_partitioning' of github.com:CliMA/Oceana…

3b323bf

…nigans.jl into ss/non_uniform_partitioning

small comment

777b427

simone-silvestri merged commit 7796f57 into main Nov 3, 2023
48 checks passed

simone-silvestri deleted the ss/non_uniform_partitioning branch November 3, 2023 13:01
