
cifar10 - multi GPU training #162

Closed
sergeimonakhov opened this issue Sep 12, 2018 · 8 comments

@sergeimonakhov

Hi, I have 5 AMD cards on an X470 system. When I run python3 ./cifar10_multi_gpu_train.py --num_gpus=5, only the one card in the x16 PCIe slot is visible. How can I get the other cards, which sit in x1 PCIe slots, to work?
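A quick way to confirm which GPUs TensorFlow actually enumerates before launching the script (a minimal sketch, assuming the TensorFlow 1.x API that cifar10_multi_gpu_train.py targets):

```python
# Minimal sketch: list the GPU devices TensorFlow can see.
# Only devices that pass the ROCm ISA/PCIe checks will show up here.
from tensorflow.python.client import device_lib

local_devices = device_lib.list_local_devices()
gpus = [d for d in local_devices if d.device_type == "GPU"]
print("TensorFlow sees %d GPU(s)" % len(gpus))
for d in gpus:
    # physical_device_desc includes the PCI bus id and the card name
    print("  %s -> %s" % (d.name, d.physical_device_desc))
```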

@whchung (Collaborator) commented Sep 12, 2018

@D1abloRUS please refer to https://github.com/RadeonOpenCompute/ROCm#supported-cpus. The RX 470 is in the GFX8 family, and we don't support GFX8 cards on x1 PCIe yet.

@sergeimonakhov (Author)

@whchung Hmm, OK. What about gfx7xx? I get:

2018-09-12 17:16:59.854955: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Ignoring visible gpu device (device: 2, name: Hawaii XT [Radeon R9 290X], pci bus id: 0000:03:00.0) with AMDGPU ISA gfx701. The minimum required AMDGPU ISA is gfx803.

How can I run it?
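One way to check which ISA each card reports before launching TensorFlow (a rough sketch; it assumes rocminfo from the ROCm packages is on the PATH, and its output format can differ between ROCm releases):

```python
# Rough sketch: collect the gfx ISA strings that rocminfo reports for the installed GPUs.
# Anything below gfx803 will be ignored by this TensorFlow-ROCm build, as in the log above.
import re
import subprocess

output = subprocess.check_output(["rocminfo"]).decode("utf-8", errors="replace")
isas = sorted(set(re.findall(r"gfx[0-9a-f]+", output)))
print("ISAs reported by rocminfo:", isas)
```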

@whchung (Collaborator) commented Sep 12, 2018

Unfortunately, the Hawaii (GFX7) family is not on the roadmap. Quite a few DNN algorithms in MIOpen are implemented in GFX-specific assembly, so we are focusing only on GFX8/GFX9 and upcoming architectures.

@sergeimonakhov (Author) commented Sep 12, 2018

@whchung

> The RX 470 is in the GFX8 family, and we don't support GFX8 cards on x1 PCIe yet.

What about x8?

@whchung (Collaborator) commented Sep 12, 2018

x8 should work. Please check:
https://rocm.github.io/hardware.html

@dagamayank

/cc @jlgreathouse to confirm supported hw list.

@jlgreathouse

Hi @D1abloRUS

When you say "x8", "x1", etc., the key question is how these GPUs are connected to your CPU. In particular, gfx8 GPUs require PCIe Gen 3 atomics at every step between the CPU and the GPUs. Many people running multiple GPUs through x1 lanes are using PCIe switches to split multiple ports off a single port. One of the major impediments here is that your PCIe switches must know how to properly forward PCIe atomic commands.

Note that this is true for "x1" or "x8". So if your "PCIe x8" solution also has a switch in between your CPU and your GPU(s), you will also need to make sure this switch properly handles atomics.
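One rough way to see whether the kernel driver rejected a GPU over atomics is to scan the kernel log for amdkfd messages (just a sketch; the exact message text varies across kernel and ROCm versions, and reading dmesg may require root):

```python
# Rough sketch: filter the kernel log for amdkfd ("kfd") lines that mention atomics.
# A message along the lines of "skipped device ..., PCI rejects atomics" would suggest
# the path between the CPU and that GPU does not forward PCIe 3.0 atomic operations.
import subprocess

kernel_log = subprocess.check_output(["dmesg"]).decode("utf-8", errors="replace")
for line in kernel_log.splitlines():
    if "kfd" in line and "atomic" in line.lower():
        print(line)
```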

Towards that end, I'll ask:

  • What CPU are you using?
  • What motherboard are you using?
  • How are you connecting your GPUs to that motherboard?
    • In other words, if your motherboard has X PCIe slots, which slot is each of your GPUs connected to, and how?

Thanks.

@sunway513

Closing this ticket as there has been no further feedback.
@D1abloRUS feel free to reopen it if you have further questions.

deven-amd pushed a commit that referenced this issue Oct 11, 2019
This PR is a stepping stone towards supporting generic multi-store
source loop nests in affine loop fusion. It extends the algorithm to
support fusion of multi-store loop nests that:
 1. have only one store that writes to a function-local live out, and
 2. the remaining stores are involved in loop nest self dependences
    or no dependences within the function.

Closes #162

COPYBARA_INTEGRATE_REVIEW=tensorflow/mlir#162 from dcaballe:dcaballe/multi-output-fusion 7fb7dec6fe8b45f5ce176f018bfe37b256420c45
PiperOrigin-RevId: 273773907
deven-amd pushed a commit that referenced this issue Nov 19, 2019