Send / recv proof of concept #2

dionhaefner · 2020-07-22T16:14:25Z

The following script works:

from mpi4py import MPI

import numpy as onp
import jax
import jax.numpy as jnp
from mpi4jax import Send, Recv

SHAPE = (10, 10)

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()
status = MPI.Status()


def send_recv(x, root=0, use_status=False):
    if rank == root:
        if use_status:
            x = Recv(x, comm=comm, status=status)
        else:
            x = Recv(x, comm=comm)
    else:
        x = Send(x, root, comm=comm)
    return x


send_recv_jit = jax.jit(send_recv, static_argnums=(1, 2))


if __name__ == '__main__':
    assert size == 2
    root = 1

    if rank == root:
        x = jnp.empty(SHAPE)
    else:
        x = jnp.ones(SHAPE)

    res = send_recv(x, root)
    expected_res = onp.ones(SHAPE, dtype='float32')
    assert onp.array_equal(res, expected_res), (rank, res)
    print('ok', rank)

    res_jit = send_recv_jit(x, root, False)
    assert onp.array_equal(res_jit, expected_res), (rank, res_jit)
    print('ok', rank)

    res_jit = send_recv_jit(x, root, True)
    assert onp.array_equal(res_jit, expected_res), (rank, res_jit)
    print('ok', rank)

    if rank == root:
        assert status.Get_source() == 1 - rank
    else:
        assert status.Get_source() == -1

There is one problem, though: you have to assign the result of the Send call to something, otherwise it gets optimized out and everything deadlocks. I.e., this doesn't work:

@jax.jit
def send_recv(x):
    if rank == 0:
        x = Recv(x, comm=comm)
    else:
        Send(x, 0, comm=comm)  # has to be x = Send(...)
    return x

Not sure if there's anything we can do about that. The whole implementation is pretty hacky with an unnecessary memcpy, but I don't think JAX / XLA accounts for custom calls that have side effects.

I didn't touch the gradient code (yet).

dionhaefner · 2020-07-23T08:51:08Z

One solution could be to only implement sendrecv. Then all processes call the same function, which should be a bit more robust to being optimized away on some but not all processes. But this would still fail:

@jax.jit
def foo(x):
    y = sendrecv(x, source=0, dest=1)
    if rank == 1:
        return y
    return x

since rank 0 doesn't do anything with y.

PhilipVinc · 2020-07-23T10:34:44Z

I think this is because XLA's compiler is very aggressive.
As soon as it sees that the output value of a leaf of the computational graph is unused, it optimises that leaf out.
Send of course has no used outputs, so it gets removed.

I think we should ask the JAX people whether it's somehow possible to tag a function as 'do_not_optimise'.
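
To make the mechanism concrete, here is a toy sketch of dead-code elimination in plain Python. It is not XLA's actual algorithm, and `eliminate_dead_code` and the tuple layout are made up for illustration. It only shows why an op whose result nobody consumes disappears, while the same op survives once its result feeds the return value.

```python
# Toy model of dead-code elimination; NOT how XLA is implemented.
# The function name and the (result, op, inputs) layout are hypothetical.
def eliminate_dead_code(ops, outputs):
    """Keep only ops whose results are (transitively) needed by `outputs`.

    ops: list of (result_name, op_name, input_names) in program order.
    outputs: names of the values the function returns.
    """
    live = set(outputs)
    kept = []
    # Walk backwards so that every kept op marks its own inputs as live.
    for result, op_name, inputs in reversed(ops):
        if result in live:
            kept.append((result, op_name, inputs))
            live.update(inputs)
    kept.reverse()
    return kept


# `Send(x)` with the result discarded, while the function returns `x`:
discarded = [("unused", "Send", ("x",))]
assert eliminate_dead_code(discarded, outputs=["x"]) == []

# `x1 = Send(x)` with `x1` returned: the op is needed and survives.
assigned = [("x1", "Send", ("x",))]
assert eliminate_dead_code(assigned, outputs=["x1"]) == assigned
```

In this model a side effect is invisible to the optimizer unless it shows up as a data dependency, which matches the behaviour observed with the unassigned Send.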

PhilipVinc · 2020-07-23T12:40:31Z

By the way, if you rebase, tests should be working now.

PhilipVinc · 2020-07-27T13:28:12Z

I guess that until Jax#3370 is merged we should rather focus on collective all-to-all communications, which are not affected by the side-effect problem.
(BTW, I'll be on holiday in the next few weeks so I won't be working on this, but in case you cook up some PR I'll be quick to review)

dionhaefner · 2020-07-27T14:20:51Z

Collective all-to-all operations are affected, too, though. Example:

@jax.jit
def foo(x):
    x = Allreduce(x)
    if rank == 0:
        return 0  # kaboom
    return x

So right now, it is the user’s responsibility to make sure that there is a data dependency on the return value of the MPI calls.

PhilipVinc · 2020-07-27T14:47:48Z

Allreduce is what I have implemented and, at least in my experience, is working well.

Why are you returning 0 if rank == 0?

PhilipVinc · 2020-07-27T14:49:46Z

Ah, ok, I get it.
You mean the case where the function does not return something that depends on its result.

What I meant is that all-to-all operations are (usually) used in contexts where all ranks execute the same code, so that is (sometimes) not an issue.

dionhaefner · 2020-07-27T16:51:07Z

I agree, all-to-all are lower risk. Ultimately it's the user's responsibility not to mess up though, so we should put a warning in the readme or so :)

PhilipVinc · 2020-08-03T08:41:13Z

The omnistaging and has_side_effects features are not yet in a released version, right?
Should we bump the minimum JAX version?

mpi4jax/__init__.py

mpi4jax/collective_ops/send.py

dionhaefner · 2020-08-03T08:49:49Z

> The omnistaging and has_side_effects features are not yet in a released version, right?
> Should we bump the minimum JAX version?

Yes, I think that would be sensible. I don't think there's a strong motivation to introduce a bunch of extra logic for JAX versions pre-omnistaging / side effect support.

PhilipVinc · 2020-08-03T08:50:28Z

What is the tradeoff of omnistaging?
We could also activate it ourselves in init.
Most of mpi4jax won't work otherwise...

dionhaefner · 2020-08-03T08:55:11Z

> Most of mpi4jax won't work otherwise...

It works fine if you don't use jit or just use all-to-all-type operations. We don't know the performance impact of omnistaging yet, and some packages like TensorFlow Probability break with it. I opted for a warning for now whenever an MPI call is being jitted.

But yes, when / if omnistaging becomes the default in JAX I think we should just require it.

dionhaefner · 2020-08-03T10:27:09Z

This is done from my side. Tests are failing hard until omnistaging is released, but it works on my machine™️

dionhaefner · 2020-08-07T10:04:21Z

When using this I noticed that XLA would sometimes re-order send and recv calls, which causes deadlocks.

So unless a solution comes up in jax-ml/jax#3976, this will need some token mechanism to ensure proper order.

This principally affects all primitives, but it's easiest to run into with send and recv.
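
A minimal sketch of why such a reordering deadlocks, as a plain-Python simulation rather than real MPI. It assumes fully synchronous (unbuffered) sends, whereas a real MPI_Send may buffer small messages and get away with it. `runs_to_completion` and the program encoding are hypothetical helpers for this example only.

```python
# Toy simulation of blocking point-to-point calls; no MPI involved.
# Assumption: a send completes only when the peer sits in the matching recv.
def runs_to_completion(programs):
    """Return True if all ranks finish, False if they deadlock.

    programs: dict mapping rank -> list of ("send", peer) / ("recv", peer).
    """
    pc = {rank: 0 for rank in programs}  # per-rank program counter
    while True:
        if all(pc[r] == len(programs[r]) for r in programs):
            return True
        progressed = False
        for r in programs:
            if pc[r] == len(programs[r]):
                continue  # this rank is already done
            kind, peer = programs[r][pc[r]]
            if kind != "send" or pc[peer] == len(programs[peer]):
                continue  # recvs are completed from the sender's side
            if programs[peer][pc[peer]] == ("recv", r):
                pc[r] += 1
                pc[peer] += 1
                progressed = True
        if not progressed:
            return False  # every remaining rank is blocked


# Order as written by the user: rank 0 sends first, rank 1 receives first.
as_written = {0: [("send", 1), ("recv", 1)], 1: [("recv", 0), ("send", 0)]}
assert runs_to_completion(as_written)

# Same calls, but the two calls on rank 1 were swapped by the compiler.
reordered = {0: [("send", 1), ("recv", 1)], 1: [("send", 0), ("recv", 0)]}
assert not runs_to_completion(reordered)
```

Both ranks issue exactly the same set of calls in the second case, so nothing was optimized away. Only the relative order changed, and that alone is enough to block both sides.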

PhilipVinc · 2020-08-07T10:20:28Z

Are you testing this with a jaxlib you built yourself, and their staging mechanism? XLA should hopefully not reorder operations with side effects, because... they have side effects. Maybe we should cc them (can't from my phone...).

dionhaefner · 2020-08-07T10:56:34Z

This is current JAX master. I've opened an issue already and the JAX devs confirmed the problem (jax-ml/jax#3976). If there's no fix on the horizon, I had some ideas of a "token tape" that should allow us to get around this with 1 line of extra boilerplate in user code.

dionhaefner · 2020-08-18T11:17:25Z

I added a token mechanism which ensures that the proper order is preserved. The idea is to chain the calls by using an XLA token:

token = Send(...)  # not passing a token creates a new one
token = Send(..., token=token)  # re-use previous token
arr, token = Recv(..., token=token)
arr, token = Sendrecv(..., token=token)
arr, token = Allreduce(..., token=token)

As long as the correct token is passed, those statements should never get re-ordered (relative to each other) or optimized away.
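
Why chaining works can be sketched with a toy scheduler in plain Python (no JAX or MPI involved; `valid_orders` and the op encoding are invented for this illustration). A compiler is free to pick any execution order that respects data dependencies. Threading a token from one call into the next adds exactly such a dependency, so only the written order remains valid.

```python
from itertools import permutations


# Toy model: enumerate every order a compiler would be allowed to choose.
# The function and the (inputs, outputs) encoding are hypothetical.
def valid_orders(ops):
    """List every execution order that respects data dependencies.

    ops: dict mapping op name -> (input_names, output_names).
    Values that no op produces are treated as function arguments.
    """
    produced = {v for _, outs in ops.values() for v in outs}
    orders = []
    for order in permutations(ops):
        available = set()
        ok = True
        for name in order:
            inputs, outputs = ops[name]
            # An op may only run once all op-produced inputs exist.
            if any(v in produced and v not in available for v in inputs):
                ok = False
                break
            available.update(outputs)
        if ok:
            orders.append(order)
    return orders


# Without tokens, send and recv share no values: either order is legal.
untokened = {"send": (("x",), ()), "recv": (("buf",), ("y",))}
assert len(valid_orders(untokened)) == 2

# With tokens, recv consumes the token that send produced.
tokened = {"send": (("x",), ("t1",)), "recv": (("buf", "t1"), ("y", "t2"))}
assert valid_orders(tokened) == [("send", "recv")]
```

The same dependency also protects against elimination, provided the final token or array is part of what the function returns.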

It sounds like the JAX people are cooking up a solution that does this token chaining automatically behind the scenes, but my feeling is that this might take a while to land.

Left to-do:

vmap and grad support is broken

PhilipVinc · 2020-08-19T13:25:38Z

Very well.
It's clumsy, but indeed we cannot do much without it until they fix this upstream.

dionhaefner · 2020-08-20T07:16:05Z

How important are vmap and grad for you? AFAICT, grad would only be meaningful on global sums, and vmap is a bit pointless, too. I'm asking because create_token supports neither of those, so it would require some additional hacking to get this to work.

PhilipVinc · 2020-08-20T12:52:34Z

They are not essential.
I'd say we can merge this as-is and I can add vmap and grad by myself in the future.
In that case, however, I think we should make a table with supported features in the readme.md for every operation.

--

grad for send and recv is a bit tricky to implement and would require some thinking.
vmap, on the other hand, should be simple: the operation is the same, you simply have to recompute the total buffer size, as I do with allreduce. For the token... you can keep a single token, I guess. Why would you need more than one?

dionhaefner · 2020-08-20T13:03:48Z

vmap is easy conceptually, but JAX doesn't like it, because you cannot create tokens in vmapped functions. To get around that, we could write a thin wrapper around create_token that defines a trivial batching rule, but it requires a bit more work.

I'll patch out grad and vmap for now, and then we can merge as soon as the JAX release with omnistaging is available.

PhilipVinc · 2020-08-20T13:06:00Z

Ok, thanks for the investigation!

That's fine by me.

PhilipVinc · 2020-09-09T14:28:37Z

I'm back from holidays and starting to work again!

Where are we with the merging of omnistaging in JAX? Do you have any news?

dionhaefner · 2020-09-09T15:22:54Z

Welcome back! ~~Omnistaging is merged, but there is still no jaxlib release, so has_side_effects is not yet available. Maybe we could ask for a jaxlib release, it's almost been 2 months now.~~

There has been a jaxlib release today, so this should work now :)

dionhaefner · 2020-09-10T06:13:06Z

Ah, it's just a tag, not a release... I asked for one.

PhilipVinc · 2020-09-11T09:20:16Z

Yay! Thanks for nudging the Google guys.

dionhaefner · 2020-09-11T09:35:40Z

Done from my side (for real this time).

mpi4jax/collective_ops/sendrecv.py

mpi4jax/cython/__init__.py

PhilipVinc · 2020-09-11T11:58:43Z

All is great.
If you can add a comment somewhere in the code about MPI_STATUS_IGNORE, then I'll merge and try to tag a new release.

dionhaefner added 2 commits July 22, 2020 16:38

implement send / recv

c7cdce3

works now

3b1f044

dionhaefner mentioned this pull request Jul 23, 2020

Prevent custom calls with side effects to be optimized out jax-ml/jax#3829

Closed

Merge branch 'master' of github.com:PhilipVinc/mpi4jax

5fd2db9

dionhaefner added 2 commits August 3, 2020 10:31

Merge branch 'master' of github.com:PhilipVinc/mpi4jax

7424891

remove workarounds now that we have omnistaging

4fa25f0

PhilipVinc reviewed Aug 3, 2020

mpi4jax/__init__.py

PhilipVinc reviewed Aug 3, 2020

mpi4jax/collective_ops/send.py

dionhaefner added 3 commits August 3, 2020 12:23

check for MPI errors

0ae3605

remove send / recv gradients

2a50ed3

test send / recv

5055125

implement sendrecv

aff45f0

dionhaefner added 2 commits August 18, 2020 13:04

use cythonize in setup.py

8bada56

add .cpp to gitignore

a6de4b6

dionhaefner added 3 commits August 18, 2020 13:06

remove error checks; add debug logging; thread tokens

96b379e

proper token handling

7c4abab

test logging and sendrecv

7893b0e

dionhaefner added 2 commits August 20, 2020 15:07

remove vmap and grad support

c7af36a

apply black formatting

370b4ac

require jaxlib>=0.1.55

865bc09

PhilipVinc marked this pull request as ready for review September 11, 2020 09:20

skip sendrecv tests when size < 2

46f10c9

PhilipVinc mentioned this pull request Sep 11, 2020

Docs: token system #8

Closed

polish readme

d559536

PhilipVinc reviewed Sep 11, 2020

mpi4jax/collective_ops/sendrecv.py

PhilipVinc reviewed Sep 11, 2020

mpi4jax/cython/__init__.py

add debugging to readme

eea123c

dionhaefner and others added 2 commits September 11, 2020 14:27

document reasoning behind MPI_STATUS_IGNORE_ADDR

8b1009b

Merge branch 'master' into master

4d13a49

PhilipVinc merged commit 989237b into mpi4jax:master Sep 12, 2020

PhilipVinc mentioned this pull request Mar 10, 2021

Logo #56

Open
