PyTorch Python fork fix #291

Merged
jrmadsen merged 2 commits into ROCm:main from pytorch-python-fork-fix on Jun 22, 2023

Conversation

jrmadsen (Collaborator) commented Jun 22, 2023

Test Cases

Follow the basic setup steps in #284.

Note: on the system used for testing (Lockhart), LD_PRELOAD=/usr/lib64/libstdc++.so.6 was required because the libstdc++.so.6 from the conda env was too old for the ROCm libraries linked by omnitrace (omnitrace was built with -static-libstdcxx).
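One way to confirm this kind of mismatch is to compare the newest GLIBCXX symbol version that each libstdc++.so.6 provides, similar in spirit to running strings on the library and grepping for GLIBCXX. The following is a minimal sketch, not part of this PR: the function names are hypothetical, and the conda library path has to be supplied by the user since it is not given here.

```python
import re
from pathlib import Path

# Matches version tags such as GLIBCXX_3.4.30 embedded in the library binary.
_GLIBCXX_RE = re.compile(rb"GLIBCXX_(\d+(?:\.\d+)+)")


def max_glibcxx_version(data: bytes):
    """Return the highest GLIBCXX version tag found in raw bytes, or None."""
    versions = [
        tuple(int(part) for part in match.group(1).split(b"."))
        for match in _GLIBCXX_RE.finditer(data)
    ]
    return max(versions, default=None)


def compare_libstdcxx(conda_lib, system_lib="/usr/lib64/libstdc++.so.6"):
    """Return (conda_version, system_version) for the two libraries.

    conda_lib is a user-supplied path (hypothetical, not taken from the PR);
    system_lib defaults to the path used with LD_PRELOAD in the note above.
    """
    conda_ver = max_glibcxx_version(Path(conda_lib).read_bytes())
    system_ver = max_glibcxx_version(Path(system_lib).read_bytes())
    return conda_ver, system_ver
```

If the conda version is lower than the system version, the conda copy is the older library, which would be consistent with the LD_PRELOAD workaround described in the note.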

  1. Configure stemdlConfig.yaml with 2 GPUs and execute srun -G 2 python -m omnitrace -- ./stemdl_classification.py --config ./stemdlConfig.yaml
  2. Configure stemdlConfig.yaml with 4 GPUs and execute srun -G 4 python -m omnitrace -- ./stemdl_classification.py --config ./stemdlConfig.yaml
  3. Write run.sh (contents below) and execute srun -G 2 ./run.sh

run.sh Contents

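#!/bin/bash

# Stop any leftover perfetto processes; ignore failures if none are running
set +e
pkill traced
pkill perfetto

# From here on, abort on the first failing command
set -e
# Start the perfetto tracing service in the background
traced --background
# Start a tracing session that writes to stemdl.proto, using the text-format config
perfetto --out stemdl.proto --txt -c ./omni-perfetto.cfg --background

# Have omnitrace use the system perfetto backend started above
export OMNITRACE_PERFETTO_BACKEND=system
python -m omnitrace -- ./stemdl_classification.py --config ./stemdlConfig.yaml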

omni-perfetto.cfg Contents

Used by the perfetto command in run.sh.

duration_ms: 3000
write_into_file: true
file_write_period_ms: 3000
flush_period_ms: 3000

buffers {
  size_kb: 102400000
  fill_policy: RING_BUFFER
}

data_sources {
  config {
    name: "track_event"
  }
}
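When repeating these runs with different trace durations or buffer sizes, it can help to generate the config rather than edit it by hand. The sketch below is only an illustration: the helper names are hypothetical, the field names and defaults are copied from the config above, and the size conversion assumes 1 KB = 1024 bytes.

```python
def render_perfetto_config(duration_ms=3000, period_ms=3000,
                           buffer_size_kb=102400000):
    """Render a text-format perfetto config with the same fields as above."""
    return "\n".join([
        f"duration_ms: {duration_ms}",
        "write_into_file: true",
        f"file_write_period_ms: {period_ms}",
        f"flush_period_ms: {period_ms}",
        "",
        "buffers {",
        f"  size_kb: {buffer_size_kb}",
        "  fill_policy: RING_BUFFER",
        "}",
        "",
        "data_sources {",
        "  config {",
        '    name: "track_event"',
        "  }",
        "}",
        "",
    ])


def buffer_size_gib(size_kb):
    """Convert a size_kb value to GiB, assuming 1 KB = 1024 bytes."""
    return size_kb / (1024 * 1024)


if __name__ == "__main__":
    # Write the config under the file name that run.sh expects.
    with open("omni-perfetto.cfg", "w") as cfg:
        cfg.write(render_perfetto_config())
```

Under that assumption, the size_kb value used above corresponds to 100000 MiB, so it may be worth passing a smaller buffer_size_kb on machines with less memory.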

Additional Notes

Omnitrace had to be built from source with OMNITRACE_MAX_THREADS=4096 to complete at least one of the PyTorch runs: that run created more than 2048 threads (the default maximum in an installer release), which caused omnitrace to abort. This hard limit on the total number of threads created by a process will eventually be removed (hopefully soon).
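To get a rough idea of whether a workload is likely to hit this limit before profiling it, the live thread count of the running process can be read from /proc on Linux. This is a minimal sketch with hypothetical helper names. It only sees threads that are alive at the moment of sampling, whereas the limit described above applies to the total number of threads created, so the value is a lower bound.

```python
from pathlib import Path

# Limits quoted in the note above.
DEFAULT_MAX_THREADS = 2048   # default in an installer release
REBUILT_MAX_THREADS = 4096   # value used for the from-source build


def parse_thread_count(status_text):
    """Extract the 'Threads:' field from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("Threads:"):
            return int(line.split(":", 1)[1])
    raise ValueError("no 'Threads:' field found")


def current_thread_count(pid="self"):
    """Return the number of live threads for a process (Linux only)."""
    return parse_thread_count(Path(f"/proc/{pid}/status").read_text())


def exceeds_limit(thread_count, limit=DEFAULT_MAX_THREADS):
    """True if the observed thread count is above the given limit."""
    return thread_count > limit
```

Sampling current_thread_count() with the PID of the un-instrumented PyTorch run at a few points gives an indication of whether the default build is sufficient or a larger OMNITRACE_MAX_THREADS is needed.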

- fixes issue where forking process in PyTorch causes omnitrace/__main__.py to fail due to missing script argument
jrmadsen added the "bug fix" (Fixes a bug) and "libpyomnitrace" (Involves the omnitrace python bindings) labels on Jun 22, 2023
Remove debugging "print" LOC
@jrmadsen jrmadsen merged commit a85f141 into ROCm:main Jun 22, 2023
46 checks passed
@jrmadsen jrmadsen deleted the pytorch-python-fork-fix branch June 22, 2023 03:30
Development

Successfully merging this pull request may close these issues.

Omnitrace hangs and prints errors while running STEMDL/stdfc with more than 1 GPU