Omnitrace hangs and prints errors while running STEMDL/stfc with more than 1 GPU #284
Where does this conda env come from? Do you know how the PyTorch execution model changes when multiple GPUs are used? Does it fork for each additional GPU? Because I’m seeing 3 fork calls, which suggests that might be the root cause of the issue.
My mistake, it should have been: conda create stemdl. Yes, it uses fork. Is there a workaround?
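For context, the fork-per-GPU pattern looks roughly like the sketch below. This is hypothetical code, not the STEMDL sources; `torch.multiprocessing.start_processes` with `start_method="fork"` is one way a fork-based multi-GPU launch can be expressed:

# Hypothetical sketch: one forked child per GPU, mirroring a fork-based
# multi-GPU launch (not taken from the STEMDL benchmark code).
import torch
import torch.multiprocessing as mp

def train(rank, world_size):
    # Each forked child drives exactly one GPU.
    torch.cuda.set_device(rank)
    # ... build model, wrap in DDP, run the training loop ...

if __name__ == "__main__":
    world_size = 2  # one worker per GPU; the run below uses 2 ranks
    mp.start_processes(train, args=(world_size,), nprocs=world_size, start_method="fork")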
fork has caused a number of problems in the past, mostly related to perfetto because of a background thread. You might want to try perfetto with the system backend. You will probably want to increase the flush and write periods to match the duration in the perfetto config file (see sample here), because of quirks in how perfetto writes that file and how omnitrace writes some perfetto data: once perfetto flushes/writes data, you can’t add any time-stamped data that happened before that point, and a fair amount of the data gathered through sampling isn’t passed to perfetto until finalization, because we have to map instruction pointers to line info and doing so while sampling adds too much overhead at runtime.
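A minimal sketch of what that looks like in the perfetto trace config (textproto); the 60000 ms values here are an assumed example duration, and the point is simply that flush_period_ms and file_write_period_ms are raised to match duration_ms:

# Sketch of a perfetto system-backend trace config; the duration is an
# assumed example, not taken from the linked sample.
buffers: {
  size_kb: 63488
}
duration_ms: 60000
write_into_file: true
file_write_period_ms: 60000   # match duration_ms so perfetto does not write early
flush_period_ms: 60000        # match duration_ms so late-arriving sampled data still fits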
Is there a command example for using omnitrace-python? I have tried without success:
Update: I’ve tracked down the issue. It’s not related to perfetto, but rather to the sys.argv passed to omnitrace’s `__main__.py`.
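For reference, the invocation that parses correctly (per the RuntimeError shown later in this thread) puts -- between omnitrace’s own arguments and the target script:

python -m omnitrace -- ./stemdl_classification.py --config ./stemdlConfig.yaml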
I still get the error with the new code. The only difference is that I am not using SLURM.
(stemdl) [dteixeir@electra019 /mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc]$cd -
/home/dteixeir/OMNITRACE/omnitrace/source/python/omnitrace
(stemdl) [dteixeir@electra019 ~/OMNITRACE/omnitrace/source/python/omnitrace]$git log -1
commit a85f141 (HEAD -> main, origin/main, origin/HEAD)
Author: Jonathan R. Madsen <jrmadsen@users.noreply.github.com>
Date: Wed Jun 21 22:30:47 2023 -0500
PyTorch Python fork fix (#291)
* PyTorch Python fork fix
- fixes issue where forking process in PyTorch causes omnitrace/__main__.py to fail due to missing script argument
* Update source/python/omnitrace/__main__.py
Remove debugging "print" LOC
(stemdl) [dteixeir@electra019 ~/OMNITRACE/omnitrace/source/python/omnitrace]$cd -
/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc
(stemdl) [dteixeir@electra019 /mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc]$which omnitrace
~/OMNITRACE/omnitrace_install/bin/omnitrace
(stemdl) [dteixeir@electra019 /mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc]$ps |grep perfetto
1 S dteixeir 2109553 1 0 80 0 - 1126 - 23:23 ? 00:00:00 perfetto --out stemdl.proto --txt -c ./omni-perfetto.cfg --background
0 S dteixeir 2110245 1967519 0 80 0 - 3037 - 23:27 pts/0 00:00:00 grep --color=auto perfetto
(stemdl) [dteixeir@electra019 /mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc]$export OMNITRACE_PERFETTO_BACKEND=system
(stemdl) [dteixeir@electra019 /mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc]$ps |grep traced
1 S dteixeir 2104500 1 0 80 0 - 2834 ia32_s 22:45 ? 00:00:10 traced --background
0 S dteixeir 2110356 1967519 0 80 0 - 3037 - 23:28 pts/0 00:00:00 grep --color=auto traced
(stemdl) [dteixeir@electra019 /mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc]$python -m omnitrace -- ./stemdl_classification.py --config ./stemdlConfig.yaml
[omnitrace]> profiling: ['/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py', '--config', './stemdlConfig.yaml']
[omnitrace][2110366][omnitrace_init_tooling] Instrumentation mode: Trace
[omnitrace ASCII-art banner]
omnitrace v1.10.1 (compiler: GNU v8.5.0, rocm: v5.4.x)
[omnitrace][2110366][510] No signals to block...
[omnitrace][2110366][509] No signals to block...
[omnitrace][2110366][508] No signals to block...
[omnitrace][2110366][507] No signals to block...
[omnitrace][2110366] fork() called on PID 2110366 (rank: 0), TID 0
/home/dteixeir/miniconda3/envs/stemdl/lib/python3.8/site-packages/lightning_fabric/connector.py:555: UserWarning: 16 is supported for historical reasons but its usage is discouraged. Please set your precision to 16-mixed instead!
rank_zero_warn(
/home/dteixeir/miniconda3/envs/stemdl/lib/python3.8/site-packages/lightning_fabric/plugins/environments/slurm.py:165: PossibleUserWarning: The `srun` command is available on your system but is not used. HINT: If your intention is to run Lightning on SLURM, prepend your python command with `srun` like so: srun python /mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemd ...
rank_zero_warn(
Using 16bit Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
:::MLLOG {"namespace": "", "time_ms": 1687469339160, "event_type": "POINT_IN_TIME", "key": "submission_benchmark", "value": "STEMDL", "metadata": {"file": "/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py", "lineno": 145}}
:::MLLOG {"namespace": "", "time_ms": 1687469343278, "event_type": "POINT_IN_TIME", "key": "submission_org", "value": "STFC", "metadata": {"file": "/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py", "lineno": 146}}
:::MLLOG {"namespace": "", "time_ms": 1687469343364, "event_type": "POINT_IN_TIME", "key": "submission_division", "value": "SciML", "metadata": {"file": "/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py", "lineno": 147}}
:::MLLOG {"namespace": "", "time_ms": 1687469343444, "event_type": "POINT_IN_TIME", "key": "submission_status", "value": "research", "metadata": {"file": "/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py", "lineno": 148}}
:::MLLOG {"namespace": "", "time_ms": 1687469343739, "event_type": "POINT_IN_TIME", "key": "submission_platform", "value": "AMD MI250", "metadata": {"file": "/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py", "lineno": 149}}
:::MLLOG {"namespace": "", "time_ms": 1687469343817, "event_type": "INTERVAL_START", "key": "init_start", "value": null, "metadata": {"file": "/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py", "lineno": 150}}
:::MLLOG {"namespace": "", "time_ms": 1687469343894, "event_type": "POINT_IN_TIME", "key": "number_of_ranks", "value": 2, "metadata": {"file": "/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py", "lineno": 153}}
:::MLLOG {"namespace": "", "time_ms": 1687469343975, "event_type": "POINT_IN_TIME", "key": "number_of_nodes", "value": 1, "metadata": {"file": "/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py", "lineno": 154}}
:::MLLOG {"namespace": "", "time_ms": 1687469344055, "event_type": "POINT_IN_TIME", "key": "accelerators_per_node", "value": 8, "metadata": {"file": "/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py", "lineno": 155}}
:::MLLOG {"namespace": "", "time_ms": 1687469344132, "event_type": "INTERVAL_END", "key": "init_stop", "value": null, "metadata": {"file": "/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py", "lineno": 156}}
:::MLLOG {"namespace": "", "time_ms": 1687469344211, "event_type": "POINT_IN_TIME", "key": "eval_start", "value": "Start:Loading datasets", "metadata": {"file": "/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py", "lineno": 157}}
:::MLLOG {"namespace": "", "time_ms": 1687469368432, "event_type": "POINT_IN_TIME", "key": "eval_stop", "value": "Stop: Loading datasets", "metadata": {"file": "/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py", "lineno": 175}}
:::MLLOG {"namespace": "", "time_ms": 1687469368520, "event_type": "POINT_IN_TIME", "key": "eval_start", "value": "Start: Loading model", "metadata": {"file": "/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py", "lineno": 176}}
/home/dteixeir/miniconda3/envs/stemdl/lib/python3.8/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
warnings.warn(
/home/dteixeir/miniconda3/envs/stemdl/lib/python3.8/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=None`.
warnings.warn(msg)
:::MLLOG {"namespace": "", "time_ms": 1687469369069, "event_type": "POINT_IN_TIME", "key": "eval_stop", "value": "Stop: Loading model", "metadata": {"file": "/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py", "lineno": 181}}
:::MLLOG {"namespace": "", "time_ms": 1687469369152, "event_type": "POINT_IN_TIME", "key": "eval_start", "value": "Start: Training", "metadata": {"file": "/mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemdl/stfc/stemdl_classification.py", "lineno": 189}}
[omnitrace][2110366] fork() called on PID 2110366 (rank: 0), TID 0
/home/dteixeir/miniconda3/envs/stemdl/lib/python3.8/site-packages/lightning_fabric/plugins/environments/slurm.py:165: PossibleUserWarning: The `srun` command is available on your system but is not used. HINT: If your intention is to run Lightning on SLURM, prepend your python command with `srun` like so: srun python /mnt/beegfs/dteixeir/STEMDL/science/benchmarks/stemd ...
rank_zero_warn(
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
Traceback (most recent call last):
File "/home/dteixeir/miniconda3/envs/stemdl/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/dteixeir/miniconda3/envs/stemdl/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/dteixeir/OMNITRACE/omnitrace_install/lib/python3.8/site-packages/omnitrace/__main__.py", line 404, in <module>
main(args)
File "/home/dteixeir/OMNITRACE/omnitrace_install/lib/python3.8/site-packages/omnitrace/__main__.py", line 290, in main
raise RuntimeError(
RuntimeError: Could not determine input script in '--config ./stemdlConfig.yaml'. Use '--' before the script and its arguments to ensure correct parsing.
E.g. python -m omnitrace -- ./script.py
Ah yeah, I’m running this on Lockhart, and without using SLURM I end up with only 1 CPU available to me. It appears PyTorch will make even more forks when nproc < ngpu, and these forks appear not to retain the variable I stored in #291 to re-patch …
By the way, if you are also running on Lockhart, I’d highly recommend using srun. PyTorch may try to compensate by forking instead of creating threads, but from viewing …
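For example, something along these lines (a sketch; the exact node/GPU flags depend on how the Lockhart partitions are set up):

srun -N 1 -n 1 --gpus-per-task=2 python -m omnitrace -- ./stemdl_classification.py --config ./stemdlConfig.yaml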
Thanks, #292 fixed the issue.
Here are the steps to reproduce:
aws s3 --no-sign-request --endpoint-url https://s3.echo.stfc.ac.uk/ sync s3://sciml-datasets/ms/stemdl_ds1a ./
It will print the following and then hang:
With `gpu: 1` in the config, it works fine.
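For illustration only: assuming stemdlConfig.yaml exposes the device count under a top-level gpu key, as the line above suggests (the key name is inferred, not confirmed), the failing vs. working settings would be:

# Hypothetical excerpt of stemdlConfig.yaml
gpu: 2    # more than 1 GPU: omnitrace hangs and prints errors
# gpu: 1  # works fine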