Omnitrace hangs and prints errors while running STEMDL/stdfc with more than 1 GPU #284

Closed
daviteix opened this issue Jun 15, 2023 · 9 comments · Fixed by #291 or #292
Labels: bug (Something isn't working), libpyomnitrace (Involves the omnitrace python bindings)

Comments

@daviteix

daviteix commented Jun 15, 2023

Here are the steps to reproduce:

  1. git clone https://github.com/mlcommons/science.git
  2. download data:
    aws s3 --no-sign-request --endpoint-url https://s3.echo.stfc.ac.uk/ sync s3://sciml-datasets/ms/stemdl_ds1a ./
  3. conda create stemdl
  4. pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.4.2
  5. pip3 install pytorch-lightning scikit-learn
  6. git clone https://github.com/mlperf/logging.git mlperf-logging
  7. pip3 install -e mlperf-logging
  8. cd STEMDL/science/benchmarks/stemdl/stfc
  9. change gpu: 1 to gpu: 4 in stemdlConfig.yaml (see the sketch after this list)
  10. omnitrace-python-3.8 -- ./stemdl_classification.py --config ./stemdlConfig.yaml

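For step 9, here is a minimal Python sketch of the config edit (my own convenience snippet, not part of the benchmark; it assumes stemdlConfig.yaml has a top-level gpu key, as the step implies, and that PyYAML is installed):

    # Convenience sketch: bump the GPU count in stemdlConfig.yaml.
    # Assumes a top-level 'gpu' key and that PyYAML is available.
    import yaml

    with open("stemdlConfig.yaml") as f:
        cfg = yaml.safe_load(f)

    cfg["gpu"] = 4  # was: gpu: 1

    with open("stemdlConfig.yaml", "w") as f:
        yaml.safe_dump(cfg, f)
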
Running step 10 prints the following and then hangs:

##### omnitrace :: executing 'python3.8 -m omnitrace -- ./stemdl_classification.py --config ./stemdlConfig.yaml'... #####

[omnitrace]> profiling: ['/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py', '--config', './stemdlConfig.yaml']
[omnitrace][569913][omnitrace_init_tooling] Instrumentation mode: Trace


      ______   .___  ___. .__   __.  __  .___________..______          ___       ______  _______
     /  __  \  |   \/   | |  \ |  | |  | |           ||   _  \        /   \     /      ||   ____|
    |  |  |  | |  \  /  | |   \|  | |  | `---|  |----`|  |_)  |      /  ^  \   |  ,----'|  |__
    |  |  |  | |  |\/|  | |  . `  | |  |     |  |     |      /      /  /_\  \  |  |     |   __|
    |  `--'  | |  |  |  | |  |\   | |  |     |  |     |  |\  \----./  _____  \ |  `----.|  |____
     \______/  |__|  |__| |__| \__| |__|     |__|     | _| `._____/__/     \__\ \______||_______|

    omnitrace v1.10.0 (rev: 9de3a6b0b4243bf8ec10164babdd99f64dbc65f2, tag: v1.10.0, compiler: GNU v8.5.0, rocm: v5.4.x)
[omnitrace][569913][2047] No signals to block...
[omnitrace][569913][2046] No signals to block...
[omnitrace][569913][2045] No signals to block...
[omnitrace][569913][2044] No signals to block...
[966.269]       perfetto.cc:58656 Configured tracing session 1, #sources:1, duration:0 ms, #buffers:1, total buffer size:1024000 KB, total sessions:1, uid:0 session name: ""
[omnitrace][569913] fork() called on PID 569913 (rank: 0), TID 0
/home/dteixeir/miniconda3/envs/stemdl/lib/python3.8/site-packages/lightning_fabric/connector.py:555: UserWarning: 16 is supported for historical reasons but its usage is discouraged. Please set your precision to 16-mixed instead!
  rank_zero_warn(
/home/dteixeir/miniconda3/envs/stemdl/lib/python3.8/site-packages/lightning_fabric/plugins/environments/slurm.py:165: PossibleUserWarning: The `srun` command is available on your system but is not used. HINT: If your intention is to run Lightning on SLURM, prepend your python command with `srun` like so: srun python3.8 /mnt/beegfs/dteixeir/STEMDL/science/benchmarks/st ...
  rank_zero_warn(
Using 16bit Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
:::MLLOG {"namespace": "", "time_ms": 1686768962518, "event_type": "POINT_IN_TIME", "key": "submission_benchmark", "value": "STEMDL", "metadata": {"file": "/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py", "lineno": 145}}
:::MLLOG {"namespace": "", "time_ms": 1686768966794, "event_type": "POINT_IN_TIME", "key": "submission_org", "value": "STFC", "metadata": {"file": "/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py", "lineno": 146}}
:::MLLOG {"namespace": "", "time_ms": 1686768966876, "event_type": "POINT_IN_TIME", "key": "submission_division", "value": "SciML", "metadata": {"file": "/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py", "lineno": 147}}
:::MLLOG {"namespace": "", "time_ms": 1686768966956, "event_type": "POINT_IN_TIME", "key": "submission_status", "value": "research", "metadata": {"file": "/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py", "lineno": 148}}
:::MLLOG {"namespace": "", "time_ms": 1686768967037, "event_type": "POINT_IN_TIME", "key": "submission_platform", "value": "AMD MI250", "metadata": {"file": "/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py", "lineno": 149}}
:::MLLOG {"namespace": "", "time_ms": 1686768967119, "event_type": "INTERVAL_START", "key": "init_start", "value": null, "metadata": {"file": "/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py", "lineno": 150}}
:::MLLOG {"namespace": "", "time_ms": 1686768967199, "event_type": "POINT_IN_TIME", "key": "number_of_ranks", "value": 4, "metadata": {"file": "/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py", "lineno": 153}}
:::MLLOG {"namespace": "", "time_ms": 1686768967280, "event_type": "POINT_IN_TIME", "key": "number_of_nodes", "value": 1, "metadata": {"file": "/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py", "lineno": 154}}
:::MLLOG {"namespace": "", "time_ms": 1686768967361, "event_type": "POINT_IN_TIME", "key": "accelerators_per_node", "value": 8, "metadata": {"file": "/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py", "lineno": 155}}
:::MLLOG {"namespace": "", "time_ms": 1686768967441, "event_type": "INTERVAL_END", "key": "init_stop", "value": null, "metadata": {"file": "/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py", "lineno": 156}}
:::MLLOG {"namespace": "", "time_ms": 1686768967521, "event_type": "POINT_IN_TIME", "key": "eval_start", "value": "Start:Loading datasets", "metadata": {"file": "/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py", "lineno": 157}}
:::MLLOG {"namespace": "", "time_ms": 1686768991051, "event_type": "POINT_IN_TIME", "key": "eval_stop", "value": "Stop: Loading datasets", "metadata": {"file": "/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py", "lineno": 175}}
:::MLLOG {"namespace": "", "time_ms": 1686768991135, "event_type": "POINT_IN_TIME", "key": "eval_start", "value": "Start: Loading model", "metadata": {"file": "/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py", "lineno": 176}}
/home/dteixeir/miniconda3/envs/stemdl/lib/python3.8/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
  warnings.warn(
/home/dteixeir/miniconda3/envs/stemdl/lib/python3.8/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=None`.
  warnings.warn(msg)
:::MLLOG {"namespace": "", "time_ms": 1686768991708, "event_type": "POINT_IN_TIME", "key": "eval_stop", "value": "Stop: Loading model", "metadata": {"file": "/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py", "lineno": 181}}
:::MLLOG {"namespace": "", "time_ms": 1686768991791, "event_type": "POINT_IN_TIME", "key": "eval_start", "value": "Start: Training", "metadata": {"file": "/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py", "lineno": 189}}
[omnitrace][569913] fork() called on PID 569913 (rank: 0), TID 0
[omnitrace][569913] fork() called on PID 569913 (rank: 0), TID 0
Traceback (most recent call last):
  File "/home/dteixeir/miniconda3/envs/stemdl/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/dteixeir/miniconda3/envs/stemdl/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/dteixeir/OMNITRACE/rocm-5.4/lib/python/site-packages/omnitrace/__main__.py", line 397, in <module>
    main()
  File "/home/dteixeir/OMNITRACE/rocm-5.4/lib/python/site-packages/omnitrace/__main__.py", line 290, in main
    raise RuntimeError(
RuntimeError: Could not determine input script. Use '--' before the script and its arguments to ensure correct parsing.
E.g. python -m omnitrace -- ./script.py
[omnitrace][569913] fork() called on PID 569913 (rank: 0), TID 0
/home/dteixeir/miniconda3/envs/stemdl/lib/python3.8/site-packages/lightning_fabric/plugins/environments/slurm.py:165: PossibleUserWarning: The `srun` command is available on your system but is not used. HINT: If your intention is to run Lightning on SLURM, prepend your python command with `srun` like so: srun python3.8 /mnt/beegfs/dteixeir/STEMDL/science/benchmarks/st ...
  rank_zero_warn(
Traceback (most recent call last):
  File "/home/dteixeir/miniconda3/envs/stemdl/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/dteixeir/miniconda3/envs/stemdl/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/dteixeir/OMNITRACE/rocm-5.4/lib/python/site-packages/omnitrace/__main__.py", line 397, in <module>
    main()
  File "/home/dteixeir/OMNITRACE/rocm-5.4/lib/python/site-packages/omnitrace/__main__.py", line 290, in main
    raise RuntimeError(
RuntimeError: Could not determine input script. Use '--' before the script and its arguments to ensure correct parsing.
E.g. python -m omnitrace -- ./script.py
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4
Traceback (most recent call last):
  File "/home/dteixeir/miniconda3/envs/stemdl/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/dteixeir/miniconda3/envs/stemdl/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/dteixeir/OMNITRACE/rocm-5.4/lib/python/site-packages/omnitrace/__main__.py", line 397, in <module>
    main()
  File "/home/dteixeir/OMNITRACE/rocm-5.4/lib/python/site-packages/omnitrace/__main__.py", line 290, in main
    raise RuntimeError(
RuntimeError: Could not determine input script. Use '--' before the script and its arguments to ensure correct parsing.
E.g. python -m omnitrace -- ./script.py

With gpu: 1, it works fine.

@jrmadsen
Collaborator

> conda activate stemdl

Where does this conda env come from?

Do you know how the PyTorch execution model changes when multiple GPUs are used? Does it fork for each additional GPU? I ask because I'm seeing 3 fork calls, which suggests that might be the root cause of the issue.
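Roughly the pattern I have in mind (just a sketch, not PyTorch's or Lightning's actual code): one worker forked per additional device, which would line up with the 3 fork() messages for 4 GPUs.

    # Sketch of a fork-per-GPU launcher: rank 0 stays in the parent process,
    # one child is forked for each remaining device.
    import os
    import multiprocessing as mp

    def worker(rank):
        # The real benchmark would select a device and run training here.
        print(f"[rank {rank}] pid={os.getpid()} ppid={os.getppid()}")

    if __name__ == "__main__":
        num_gpus = 4  # assumption: mirrors 'gpu: 4' in stemdlConfig.yaml
        ctx = mp.get_context("fork")  # explicit fork start method
        children = [ctx.Process(target=worker, args=(r,)) for r in range(1, num_gpus)]
        for p in children:
            p.start()
        worker(0)
        for p in children:
            p.join()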

@daviteix
Author

My mistake, it should have been: conda create stemdl. Yes, it uses fork. Is there a workaround?

@jrmadsen
Collaborator

fork has caused a number of problems in the past, mostly related to perfetto because of a background thread. You might want to try perfetto with the system backend. You will probably want to increase the flush and write periods to match the duration in the perfetto config file (see sample here) because of quirks in how perfetto writes that file and how omnitrace writes some of its perfetto data. Essentially, once perfetto flushes/writes data, you cannot add any time-stamped data that happened before that point, and a fair amount of the data gathered through sampling isn't passed to perfetto until finalization, because we have to map instruction pointers to line info and doing so while sampling adds too much overhead at runtime.

@daviteix
Author

Is there a command example for using omnitrace-python? I have tried the following, without success:

    export OMNITRACE_PERFETTO_BACKEND=system
    omnitrace-perfetto-traced --background
    omnitrace-perfetto --out ./omnitrace-perfetto.proto --txt -c ${OMNITRACE_ROOT}/rocm-5.4/share/omnitrace/omnitrace.cfg --background
    omnitrace-python-3.8 -- ./stemdl_classification.py --config ./stemdlConfig.yaml

The option --perfetto-backend=system is not valid for omnitrace-python.

@jrmadsen
Collaborator

Update: I’ve tracked down the issue. It’s not related to perfetto, but rather the sys.argv passed to omnitrace’s __main__.py upon re-entry after PyTorch forks. I should have a PR merged with the fix by tomorrow afternoon.

@jrmadsen added the bug and libpyomnitrace labels on Jun 22, 2023
@daviteix
Author

daviteix commented Jun 22, 2023 via email

@jrmadsen
Collaborator

> Only difference is I am not using slurm

Ah yeah, I'm running this on Lockhart, and without using SLURM I end up with only 1 CPU available to me (e.g. nproc returns 1), whereas srun nproc returns 128. Given all the threads that get created, I figured using srun was intended and its omission from the instructions was an oversight. As it turns out, I assumed, incorrectly, that the execution model would be the same either way.

It appears PyTorch will fork even more processes when nproc < ngpu, and those forks do not appear to retain the variable I stored in #291 to re-patch sys.argv. Storing it in an environment variable in #292 appears to do the trick.
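The gist of the environment-variable approach (a simplified sketch, not the actual code in #292; the variable name below is a placeholder I made up):

    # Simplified sketch: stash the wrapper's argv where forked/re-executed
    # children can find it, then restore it on re-entry.
    # 'OMNITRACE_PYTHON_ARGS' is a placeholder name, not the real variable.
    import json
    import os
    import sys

    _KEY = "OMNITRACE_PYTHON_ARGS"

    def save_argv():
        # Parent: record the full command line before anything forks or re-executes.
        os.environ[_KEY] = json.dumps(sys.argv)

    def restore_argv():
        # Child: if the '-- <script>' portion was lost on re-entry, put it back.
        saved = os.environ.get(_KEY)
        if saved and "--" not in sys.argv:
            sys.argv = json.loads(saved)

Environment variables are inherited across both fork and a re-exec of the interpreter, which is why this survives where a module-level Python variable does not.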

@jrmadsen
Collaborator

By the way, if you are also running on Lockhart, I'd highly recommend using srun. PyTorch may try to compensate by forking instead of creating threads, but from watching top while that code was running, all 4 of the forked processes were sharing the same CPU (i.e. each of their CPU% values was roughly ~25% instead of ~100%, which is what you would see if they were running on separate CPUs).
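A quick way to see the same thing from Python (just a sketch, Linux-only since it uses sched_getaffinity):

    # Compare the CPUs this process may actually run on (what 'nproc' reports)
    # with the total number of CPUs on the node.
    import os

    print("CPUs usable by this process:", len(os.sched_getaffinity(0)))
    print("CPUs on the node:", os.cpu_count())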

@daviteix
Author

Thanks, #292 fixed the issue.
