Omnitrace hangs and prints errors while running STEMDL/stfc with more than 1 GPU #284
Where does this conda env come from? Do you know how the PyTorch execution model changes when multiple GPUs are used? Does it fork for each additional GPU? Because I’m seeing 3 fork calls, which suggests that might be the root cause of the issue.
My mistake, it should have been: conda create stemdl. Yes, it uses fork. Is there a workaround?
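For context, the fork-per-GPU pattern looks roughly like the sketch below. This is hypothetical code, not the STEMDL sources; `torch.multiprocessing.start_processes` with `start_method="fork"` is one way a fork-based multi-GPU launch can be expressed:

# Hypothetical sketch: one forked child per GPU, mirroring a fork-based
# multi-GPU launch (not taken from the STEMDL benchmark code).
import torch
import torch.multiprocessing as mp

def train(rank, world_size):
    # Each forked child drives exactly one GPU.
    torch.cuda.set_device(rank)
    # ... build model, wrap in DDP, run the training loop ...

if __name__ == "__main__":
    world_size = 2  # one worker per GPU; the run below uses 2 ranks
    mp.start_processes(train, args=(world_size,), nprocs=world_size, start_method="fork")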
fork has caused a number of problems in the past, mostly related to perfetto because of a background thread. You might want to try perfetto with the system backend. You will probably want to increase the flush and write periods to match the duration in the perfetto config file (see sample here), because of quirks in how perfetto writes that file and how omnitrace writes some perfetto data: once perfetto flushes/writes data, you can’t add any time-stamped data that happened before that point, and a fair amount of the data gathered through sampling isn’t passed to perfetto until finalization, because we have to map instruction pointers to line info and doing so while sampling adds too much overhead at runtime.
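A minimal sketch of what that looks like in the perfetto trace config (textproto); the 60000 ms values here are an assumed example duration, and the point is simply that flush_period_ms and file_write_period_ms are raised to match duration_ms:

# Sketch of a perfetto system-backend trace config; the duration is an
# assumed example, not taken from the linked sample.
buffers: {
  size_kb: 63488
}
duration_ms: 60000
write_into_file: true
file_write_period_ms: 60000   # match duration_ms so perfetto does not write early
flush_period_ms: 60000        # match duration_ms so late-arriving sampled data still fits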
Is there a command example for using omnitrace-python? I have tried without success:
Update: I’ve tracked down the issue. It’s not related to perfetto, but rather to the sys.argv passed to omnitrace’s `__main__.py`.
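For reference, the invocation that parses correctly (per the RuntimeError shown later in this thread) puts -- between omnitrace’s own arguments and the target script:

python -m omnitrace -- ./stemdl_classification.py --config ./stemdlConfig.yaml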
I still get the error with the new code. The only difference is that I am not using SLURM.
(stemdl) [dteixeir@electra019 /mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc]$cd -
/home/dteixeir/OMNITRACE/omnitrace/source/python/omnitrace
(stemdl) [dteixeir@electra019 ~/OMNITRACE/omnitrace/source/python/omnitrace]$git log -1
commit a85f141 (HEAD -> main, origin/main, origin/HEAD)
Author: Jonathan R. Madsen <jrmadsen@users.noreply.github.com>
Date: Wed Jun 21 22:30:47 2023 -0500
PyTorch Python fork fix (#291)
* PyTorch Python fork fix
- fixes issue where forking process in PyTorch causes omnitrace/__main__.py to fail due to missing script argument
* Update source/python/omnitrace/__main__.py
Remove debugging "print" LOC
(stemdl) [dteixeir@electra019 ~/OMNITRACE/omnitrace/source/python/omnitrace]$cd -
/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc
(stemdl) [dteixeir@electra019 /mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc]$which omnitrace
~/OMNITRACE/omnitrace_install/bin/omnitrace
(stemdl) [dteixeir@electra019 /mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc]$ps |grep perfetto
1 S dteixeir 2109553 1 0 80 0 - 1126 - 23:23 ? 00:00:00 perfetto --out stemdl.proto --txt -c ./omni-perfetto.cfg --background
0 S dteixeir 2110245 1967519 0 80 0 - 3037 - 23:27 pts/0 00:00:00 grep --color=auto perfetto
(stemdl) [dteixeir@electra019 /mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc]$export OMNITRACE_PERFETTO_BACKEND=system
(stemdl) [dteixeir@electra019 /mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc]$ps |grep traced
1 S dteixeir 2104500 1 0 80 0 - 2834 ia32_s 22:45 ? 00:00:10 traced --background
0 S dteixeir 2110356 1967519 0 80 0 - 3037 - 23:28 pts/0 00:00:00 grep --color=auto traced
(stemdl) [dteixeir@electra019 /mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc]$python -m omnitrace -- ./stemdl_classification.py --config ./stemdlConfig.yaml
[omnitrace]> profiling: ['/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py', '--config', './stemdlConfig.yaml']
[omnitrace][2110366][omnitrace_init_tooling] Instrumentation mode: Trace
[omnitrace ASCII-art banner]
omnitrace v1.10.1 (compiler: GNU v8.5.0, rocm: v5.4.x)
[omnitrace][2110366][510] No signals to block...
[omnitrace][2110366][509] No signals to block...
[omnitrace][2110366][508] No signals to block...
[omnitrace][2110366][507] No signals to block...
[omnitrace][2110366] fork() called on PID 2110366 (rank: 0), TID 0
/home/dteixeir/miniconda3/envs/stemdl/lib/python3.8/site-packages/lightning_fabric/connector.py:555: UserWarning: 16 is supported for historical reasons but its usage is discouraged. Please set your precision to 16-mixed instead!
rank_zero_warn(
/home/dteixeir/miniconda3/envs/stemdl/lib/python3.8/site-packages/lightning_fabric/plugins/environments/slurm.py:165: PossibleUserWarning: The `srun` command is available on your system but is not used. HINT: If your intention is to run Lightning on SLURM, prepend your python command with `srun` like so: srun python /mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemd ...
rank_zero_warn(
Using 16bit Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
:::MLLOG {"namespace": "", "time_ms": 1687469339160, "event_type": "POINT_IN_TIME", "key": "submission_benchmark", "value": "STEMDL", "metadata": {"file": "/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py", "lineno": 145}}
:::MLLOG {"namespace": "", "time_ms": 1687469343278, "event_type": "POINT_IN_TIME", "key": "submission_org", "value": "STFC", "metadata": {"file": "/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py", "lineno": 146}}
:::MLLOG {"namespace": "", "time_ms": 1687469343364, "event_type": "POINT_IN_TIME", "key": "submission_division", "value": "SciML", "metadata": {"file": "/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py", "lineno": 147}}
:::MLLOG {"namespace": "", "time_ms": 1687469343444, "event_type": "POINT_IN_TIME", "key": "submission_status", "value": "research", "metadata": {"file": "/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py", "lineno": 148}}
:::MLLOG {"namespace": "", "time_ms": 1687469343739, "event_type": "POINT_IN_TIME", "key": "submission_platform", "value": "AMD MI250", "metadata": {"file": "/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py", "lineno": 149}}
:::MLLOG {"namespace": "", "time_ms": 1687469343817, "event_type": "INTERVAL_START", "key": "init_start", "value": null, "metadata": {"file": "/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py", "lineno": 150}}
:::MLLOG {"namespace": "", "time_ms": 1687469343894, "event_type": "POINT_IN_TIME", "key": "number_of_ranks", "value": 2, "metadata": {"file": "/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py", "lineno": 153}}
:::MLLOG {"namespace": "", "time_ms": 1687469343975, "event_type": "POINT_IN_TIME", "key": "number_of_nodes", "value": 1, "metadata": {"file": "/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py", "lineno": 154}}
:::MLLOG {"namespace": "", "time_ms": 1687469344055, "event_type": "POINT_IN_TIME", "key": "accelerators_per_node", "value": 8, "metadata": {"file": "/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py", "lineno": 155}}
:::MLLOG {"namespace": "", "time_ms": 1687469344132, "event_type": "INTERVAL_END", "key": "init_stop", "value": null, "metadata": {"file": "/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py", "lineno": 156}}
:::MLLOG {"namespace": "", "time_ms": 1687469344211, "event_type": "POINT_IN_TIME", "key": "eval_start", "value": "Start:Loading datasets", "metadata": {"file": "/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py", "lineno": 157}}
:::MLLOG {"namespace": "", "time_ms": 1687469368432, "event_type": "POINT_IN_TIME", "key": "eval_stop", "value": "Stop: Loading datasets", "metadata": {"file": "/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py", "lineno": 175}}
:::MLLOG {"namespace": "", "time_ms": 1687469368520, "event_type": "POINT_IN_TIME", "key": "eval_start", "value": "Start: Loading model", "metadata": {"file": "/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py", "lineno": 176}}
/home/dteixeir/miniconda3/envs/stemdl/lib/python3.8/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
warnings.warn(
/home/dteixeir/miniconda3/envs/stemdl/lib/python3.8/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=None`.
warnings.warn(msg)
:::MLLOG {"namespace": "", "time_ms": 1687469369069, "event_type": "POINT_IN_TIME", "key": "eval_stop", "value": "Stop: Loading model", "metadata": {"file": "/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py", "lineno": 181}}
:::MLLOG {"namespace": "", "time_ms": 1687469369152, "event_type": "POINT_IN_TIME", "key": "eval_start", "value": "Start: Training", "metadata": {"file": "/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py", "lineno": 189}}
[omnitrace][2110366] fork() called on PID 2110366 (rank: 0), TID 0
/home/dteixeir/miniconda3/envs/stemdl/lib/python3.8/site-packages/lightning_fabric/plugins/environments/slurm.py:165: PossibleUserWarning: The `srun` command is available on your system but is not used. HINT: If your intention is to run Lightning on SLURM, prepend your python command with `srun` like so: srun python /mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemd ...
rank_zero_warn(
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
Traceback (most recent call last):
File "/home/dteixeir/miniconda3/envs/stemdl/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/dteixeir/miniconda3/envs/stemdl/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/dteixeir/OMNITRACE/omnitrace_install/lib/python3.8/site-packages/omnitrace/__main__.py", line 404, in <module>
main(args)
File "/home/dteixeir/OMNITRACE/omnitrace_install/lib/python3.8/site-packages/omnitrace/__main__.py", line 290, in main
raise RuntimeError(
RuntimeError: Could not determine input script in '--config ./stemdlConfig.yaml'. Use '--' before the script and its arguments to ensure correct parsing.
E.g. python -m omnitrace -- ./script.py
Ah yeah, I’m running this on Lockhart, and without using SLURM I end up with only 1 CPU available to me. It appears PyTorch will make even more forks when nproc < ngpu, and these forks appear not to retain the variable I stored in #291 to re-patch …
By the way, if you are also running on Lockhart, I’d highly recommend using srun. PyTorch may try to compensate by forking instead of creating threads, but from viewing …
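For example, something along these lines (a sketch; the exact node/GPU flags depend on how the Lockhart partitions are set up):

srun -N 1 -n 1 --gpus-per-task=2 python -m omnitrace -- ./stemdl_classification.py --config ./stemdlConfig.yaml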
Thanks, #292 fixed the issue.
Here are the steps to reproduce:
aws s3 --no-sign-request --endpoint-url https://s3.echo.stfc.ac.uk/ sync s3://sciml-datasets/ms/stemdl_ds1a ./
It will print the following and then hang:
With `gpu: 1` in the config, it works fine.
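For illustration only: assuming stemdlConfig.yaml exposes the device count under a top-level gpu key, as the line above suggests (the key name is inferred, not confirmed), the failing vs. working settings would be:

# Hypothetical excerpt of stemdlConfig.yaml
gpu: 2    # more than 1 GPU: omnitrace hangs and prints errors
# gpu: 1  # works fine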