Skip to content

Conversation

@Alexander-Barth
Copy link
Contributor

@Alexander-Barth Alexander-Barth commented Feb 13, 2025

With ROCM-aware MPI, the initialization,

import AMDGPU
using Flux
import MPI
AMDGPU.allowscalar(false)
DistributedUtils.initialize(MPIBackend)

fails with (output from 2 MPI tasks):

ERROR: LoadError: UndefVarError: `AMDGPU` not defined in `FluxMPIExt`
Stacktrace:
ERROR: LoadError: UndefVarError: `AMDGPU` not defined in `FluxMPIExt`
Stacktrace:
 [1] __initialize(::Type{MPIBackend}; cuda_devices::Nothing, amdgpu_devices::Nothing, force_cuda::Bool, caller::String, force_amdgpu::Bool)
 [1] __initialize(::Type{MPIBackend}; cuda_devices::Nothing, amdgpu_devices::Nothing, force_cuda::Bool, caller::String, force_amdgpu::Bool)
   @ FluxMPIExt ~/julia-depot/packages/Flux/BkG8S/ext/FluxMPIExt/FluxMPIExt.jl:27
   @ FluxMPIExt ~/julia-depot/packages/Flux/BkG8S/ext/FluxMPIExt/FluxMPIExt.jl:27
 [2] __initialize
 [2] __initialize
   @ ~/julia-depot/packages/Flux/BkG8S/ext/FluxMPIExt/FluxMPIExt.jl:9 [inlined]
 [3] #initialize#1
   @ ~/julia-depot/packages/Flux/BkG8S/ext/FluxMPIExt/FluxMPIExt.jl:9 [inlined]
   @ ~/julia-depot/packages/Flux/BkG8S/src/distributed/public_api.jl:47 [inlined]
 [3] #initialize#1
   @ ~/julia-depot/packages/Flux/BkG8S/src/distributed/public_api.jl:47 [inlined]
 [4] initialize(backend::Type{MPIBackend})
 [4] initialize(backend::Type{MPIBackend})
   @ Flux.DistributedUtils ~/julia-depot/packages/Flux/BkG8S/src/distributed/public_api.jl:45
 [5] top-level scope
   @ Flux.DistributedUtils ~/julia-depot/packages/Flux/BkG8S/src/distributed/public_api.jl:45

I had also the error UndefVarError: cpu_device not defined in FluxMPIExt.
These two issues are addressed in this PR.

PR Checklist

  • Tests are added
  • Entry in NEWS.md
  • Documentation, if applicable

@mcabbott
Copy link
Member

Thanks. Can your example be included as a test somewhere?

Other test failures seem to be Reactant.jl, cc @wsmoses about what changed.

@wsmoses
Copy link
Contributor

wsmoses commented Feb 13, 2025

the reactant that CI found is rather old, what if you update it to latest? the error thrown shouldn't ever happen on the latest so I'm a bit surprised

@Alexander-Barth
Copy link
Contributor Author

Alexander-Barth commented Feb 14, 2025

If I just enable FLUX_TEST_DISTRIBUTED_MPI:

ENV["FLUX_TEST_AMDGPU"] = "false"
ENV["FLUX_TEST_CUDA"] = "false"
ENV["FLUX_TEST_METAL"] = "false"
ENV["FLUX_TEST_CPU"] = "false"
ENV["FLUX_TEST_DISTRIBUTED_MPI"] = "true"
ENV["FLUX_TEST_DISTRIBUTED_NCCL"] = "false"
ENV["FLUX_TEST_ENZYME"] = "false"
ENV["FLUX_TEST_REACTANT"] = "false"

It seems that there are no MPI test currently:

Test Summary: | Total   Time
Flux.jl       |     0  25.3s
  Distributed |     0   0.9s
     Testing Flux tests passed 

as there are no files ending with _distributedtest.jl:

isdistributedtest(f) = endswith(f, "_distributedtest.jl")

Let me know if I miss something. I will try to add a test.

@Alexander-Barth
Copy link
Contributor Author

Alexander-Barth commented Feb 14, 2025

I just added a test which initializes the backend and does a simple reduction.
It is tested with (pure) MPI and NCCL:

julia> ENV["FLUX_TEST_DISTRIBUTED_NCCL"] = "true"
julia> include("/home/abarth/.julia/dev/Flux/test/ext_distributed/runtests.jl");
[ Info: Running Distributed Tests with 4 processes
[ Info: Running /home/abarth/.julia/dev/Flux/test/ext_distributed/reduce_distributedtest.jl with mpi backend
[ Info: Running /home/abarth/.julia/dev/Flux/test/ext_distributed/reduce_distributedtest.jl with nccl backend
Test Summary: | Pass  Total   Time
Distributed   |    2      2  13.1s

I verified that the MPI test indeed fails without the PR.

Is there anything else needed before the PR can be merged? Should I do an entry in NEWS.md?

@Alexander-Barth Alexander-Barth changed the title Fix FluxMPIExt for AMDGPU Fix FluxMPIExt Feb 14, 2025
@Alexander-Barth Alexander-Barth changed the title Fix FluxMPIExt Fix missing import in FluxMPIExt Feb 14, 2025
@Alexander-Barth Alexander-Barth changed the title Fix missing import in FluxMPIExt Fix missing imports in FluxMPIExt Feb 14, 2025
@mcabbott
Copy link
Member

Thanks for adding tests.

Have never touched MPI stuff. I think @askorupka wrote most of this, and may have opinions?

@Alexander-Barth
Copy link
Contributor Author

It would be nice if somebody can review this PR. It does not implement any new feature but fixes some import issues.

@CarloLucibello CarloLucibello merged commit 2c89b9d into FluxML:master Mar 4, 2025
2 of 9 checks passed
@CarloLucibello
Copy link
Member

thanks!

@Alexander-Barth
Copy link
Contributor Author

Thanks @CarloLucibello !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants