Exception in RecordFunction callback: state_ptr INTERNAL ASSERT FAILED at "../torch/csrc/profiler/standalone/nvtx_observer.cpp":115 #19848

Open
nhkhoi91 opened this issue May 5, 2024 · 0 comments
Labels: bug, needs triage

Bug description

🐛 Bug

I am trying to use PyTorchProfiler to write traces to a TensorBoard folder on S3, and I get the exception shown in the title.

What version are you seeing the problem on?

v2.2

How to reproduce the bug

The code is submitted to AWS SageMaker as a training job via a remote function. I am not sure whether that is part of the problem. The code is as follows:


tensorboard_logs_path = 's3://donut_extraction'
logger = TensorBoardLogger(tensorboard_logs_path, name="donut", version='v1')
processor = DonutProcessor.from_pretrained(MODEL_NAME)
wrap_policy = {DonutSwinEncoder, MBartForCausalLM, DonutSwinModel}
strategy = FSDPStrategy(
    auto_wrap_policy=wrap_policy,
    state_dict_type="sharded",
    limit_all_gathers=True,
)
device_stats = DeviceStatsMonitor(cpu_stats=True)
model_module = ImageModelModule(
    train_config,
    processor,
    train_dataloader,
    val_dataloader,
    version=1,
)
profiler = PyTorchProfiler(
    on_trace_ready=torch.profiler.tensorboard_trace_handler(f'{tensorboard_logs_path}/profiler0'),
    filename='perf-logs',
    emit_nvtx=True,
)
trainer = pl.Trainer(
    devices=4,
    accelerator='cuda',
    accumulate_grad_batches=ACUMULATE_GRAD_BATCHES,
    # max_epochs=train_config.max_epochs,
    max_epochs=4,
    val_check_interval=train_config.val_check_interval,
    check_val_every_n_epoch=2,
    precision="16-mixed",
    num_sanity_val_steps=0,
    callbacks=[device_stats],
    # default_root_dir=ckpt_path,
    strategy=strategy,
    logger=logger,
    profiler=profiler,
)
trainer.fit(model_module)

Error messages and logs

[rank1]:[2024-05-05 09:59:51,519] [0/0] torch._dynamo.variables.torch: [WARNING] Profiler function <class 'torch.autograd.profiler.record_function'> will be ignored
[rank0]:[2024-05-05 09:59:51,523] [0/0] torch._dynamo.variables.torch: [WARNING] Profiler function <class 'torch.autograd.profiler.record_function'> will be ignored
[rank1]:[2024-05-05 09:59:51,523] [0/0_1] torch._dynamo.variables.torch: [WARNING] Profiler function <class 'torch.autograd.profiler.record_function'> will be ignored
[rank0]:[2024-05-05 09:59:51,527] [0/0_1] torch._dynamo.variables.torch: [WARNING] Profiler function <class 'torch.autograd.profiler.record_function'> will be ignored
[rank2]:[2024-05-05 09:59:51,535] [0/0] torch._dynamo.variables.torch: [WARNING] Profiler function <class 'torch.autograd.profiler.record_function'> will be ignored
[rank3]:[2024-05-05 09:59:51,538] [0/0] torch._dynamo.variables.torch: [WARNING] Profiler function <class 'torch.autograd.profiler.record_function'> will be ignored
[rank2]:[2024-05-05 09:59:51,539] [0/0_1] torch._dynamo.variables.torch: [WARNING] Profiler function <class 'torch.autograd.profiler.record_function'> will be ignored
[rank3]:[2024-05-05 09:59:51,542] [0/0_1] torch._dynamo.variables.torch: [WARNING] Profiler function <class 'torch.autograd.profiler.record_function'> will be ignored

Traceback (most recent call last):
  File "/var/folders/h8/1_7bqspx4mj27hqz4qr1gp_m0000gn/T/ipykernel_37353/4058334054.py", line 165, in train_donut
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 543, in fit
    call._call_and_handle_interrupt(
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 44, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 579, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 986, in _run
    results = self._run_stage()
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 1032, in _run_stage
    self.fit_loop.run()
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 205, in run
    self.advance()
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 363, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 138, in run
    self.advance(data_fetcher)
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 242, in advance
    batch_output = self.automatic_optimization.run(trainer.optimizers[0], batch_idx, kwargs)
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 184, in run
    closure()
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 144, in __call__
    self._result = self.closure(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 129, in closure
    step_output = self._step_fn()
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 319, in _training_step
    training_step_output = call._call_strategy_hook(trainer, "training_step", *kwargs.values())
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 309, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/strategies/strategy.py", line 390, in training_step
    return self._forward_redirection(self.model, self.lightning_module, "training_step", *args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/strategies/strategy.py", line 642, in __call__
    wrapper_output = wrapper_module(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 849, in forward
    output = self._fsdp_wrapped_module(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/strategies/strategy.py", line 635, in wrapped_forward
    out = method(*_args, **_kwargs)
  File "/var/folders/h8/1_7bqspx4mj27hqz4qr1gp_m0000gn/T/ipykernel_37353/1410972043.py", line 86, in training_step
  File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1561, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 489, in _fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1550, in _call_impl
    args_result = hook(self, args)
  File "/usr/local/lib/python3.10/site-packages/pytorch_lightning/profilers/pytorch.py", line 72, in _start_recording_forward
    record.__enter__()
TypeError: nullcontext.__enter__() missing 1 required positional argument: 'self'

[rank2]:[W record_function.cpp:499] Exception in RecordFunction callback: state_ptr INTERNAL ASSERT FAILED at "../torch/csrc/profiler/standalone/nvtx_observer.cpp":115, please report a bug to PyTorch. Expected profiler state set
Exception raised from updateOutputTensorTracker at ../torch/csrc/profiler/standalone/nvtx_observer.cpp:115 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd0d5e76d87 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fd0d5e2775f in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, char const*) + 0x43 (0x7fd0d5e74873 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x56c3f26 (0x7fd0be294f26 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #4: at::RecordFunction::end() + 0x51 (0x7fd0ba5bf411 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #5: at::RecordFunction::~RecordFunction() + 0x22 (0x7fd0ba5bf462 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0x4ee58a8 (0x7fd0bdab68a8 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #7: <unknown function> + 0x7a067c (0x7fd0d672267c in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0xa480b5 (0x7fd0d69ca0b5 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #9: <unknown function> + 0x4117ab (0x7fd0d63937ab in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x412731 (0x7fd0d6394731 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #22: __libc_start_main + 0xea (0x7fd18ce5ed0a in /lib/x86_64-linux-gnu/libc.so.6)
frame #23: _start + 0x2a (0x55e3c20bf07a in /usr/local/bin/python)
 , for the range [pl][module]torch._dynamo.eval_frame.OptimizedModule: model
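
For reference, the TypeError at the end of the Python traceback above has the same wording Python produces when `__enter__` is called on the `contextlib.nullcontext` class itself rather than on an instance. The snippet below is only a minimal sketch of that message using the standard library; `enter_record` is a hypothetical stand-in for the hook, not the Lightning code, and this is not a confirmed root cause.

```python
import contextlib


def enter_record(record):
    """Hypothetical stand-in for a hook that enters a context manager by hand."""
    return record.__enter__()


# Passing an instance works: nullcontext() returns its enter_result (None here).
assert enter_record(contextlib.nullcontext()) is None

# Passing the class itself leaves `self` unbound, so the call fails.
message = None
try:
    enter_record(contextlib.nullcontext)
except TypeError as exc:
    message = str(exc)

print(message)
```

If something in the stack substitutes the `nullcontext` class for the `record_function(...)` object, which the `torch._dynamo` warnings about `record_function` being ignored may hint at, the hook would fail in this way. That is an assumption I have not verified.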

Environment

Current environment
* CUDA:
    - GPU:
        - Tesla V100-SXM2-16GB
        - Tesla V100-SXM2-16GB
        - Tesla V100-SXM2-16GB
        - Tesla V100-SXM2-16GB
    - available:         True
    - version:           12.1
* Lightning:
    - lightning:         2.2.0.post0
    - lightning-utilities: 0.10.1
    - pytorch-lightning: 2.2.0.post0
    - torch:             2.2.0
    - torchmetrics:      1.3.1
    - torchvision:       0.17.0
* Packages:
    - absl-py:           2.1.0
    - accelerate:        0.27.2
    - aiobotocore:       2.11.2
    - aiohttp:           3.9.3
    - aioitertools:      0.11.0
    - aiosignal:         1.3.1
    - asttokens:         2.4.1
    - async-timeout:     4.0.3
    - attrs:             23.2.0
    - authlib:           1.3.0
    - awscli:            1.32.32
    - boto3:             1.34.34
    - botocore:          1.34.34
    - certifi:           2024.2.2
    - cffi:              1.16.0
    - charset-normalizer: 3.3.2
    - click:             8.1.7
    - cloudpickle:       2.2.1
    - colorama:          0.4.4
    - comm:              0.2.1
    - contextlib2:       21.6.0
    - cryptography:      42.0.2
    - debugpy:           1.8.0
    - decorator:         5.1.1
    - dill:              0.3.8
    - docker:            7.0.0
    - docutils:          0.16
    - donut:             0.2.2
    - dparse:            0.6.4b0
    - exceptiongroup:    1.2.0
    - executing:         2.0.1
    - filelock:          3.13.1
    - frozenlist:        1.4.1
    - fsspec:            2024.2.0
    - google-pasta:      0.2.0
    - grpcio:            1.60.1
    - huggingface-hub:   0.20.3
    - idna:              3.6
    - importlib-metadata: 6.11.0
    - ipykernel:         6.29.0
    - ipython:           8.21.0
    - jedi:              0.19.1
    - jinja2:            3.1.3
    - jmespath:          1.0.1
    - joblib:            1.3.2
    - jsonschema:        4.21.1
    - jsonschema-specifications: 2023.12.1
    - jupyter-client:    8.6.0
    - jupyter-core:      5.7.1
    - lightning:         2.2.0.post0
    - lightning-utilities: 0.10.1
    - markdown:          3.5.2
    - markdown-it-py:    3.0.0
    - markupsafe:        2.1.5
    - marshmallow:       3.20.2
    - matplotlib-inline: 0.1.6
    - mdurl:             0.1.2
    - mpmath:            1.3.0
    - multidict:         6.0.5
    - multiprocess:      0.70.16
    - nest-asyncio:      1.6.0
    - networkx:          3.2.1
    - nltk:              3.8.1
    - numpy:             1.26.4
    - nvidia-cublas-cu12: 12.1.3.1
    - nvidia-cuda-cupti-cu12: 12.1.105
    - nvidia-cuda-nvrtc-cu12: 12.1.105
    - nvidia-cuda-runtime-cu12: 12.1.105
    - nvidia-cudnn-cu12: 8.9.2.26
    - nvidia-cufft-cu12: 11.0.2.54
    - nvidia-curand-cu12: 10.3.2.106
    - nvidia-cusolver-cu12: 11.4.5.107
    - nvidia-cusparse-cu12: 12.1.0.106
    - nvidia-nccl-cu12:  2.19.3
    - nvidia-nvjitlink-cu12: 12.3.101
    - nvidia-nvtx-cu12:  12.1.105
    - packaging:         23.2
    - pandas:            1.5.3
    - parso:             0.8.3
    - pathos:            0.3.2
    - pexpect:           4.9.0
    - pillow:            10.2.0
    - pip:               23.3.2
    - platformdirs:      4.2.0
    - pox:               0.3.4
    - ppft:              1.7.6.8
    - prompt-toolkit:    3.0.43
    - protobuf:          4.25.3
    - psutil:            5.9.8
    - ptyprocess:        0.7.0
    - pure-eval:         0.2.2
    - pyasn1:            0.5.1
    - pycparser:         2.21
    - pydantic:          1.10.14
    - pygments:          2.17.2
    - python-dateutil:   2.8.2
    - pytorch-lightning: 2.2.0.post0
    - pytz:              2024.1
    - pyyaml:            6.0.1
    - pyzmq:             25.1.2
    - rapidfuzz:         3.6.2
    - referencing:       0.33.0
    - regex:             2023.12.25
    - requests:          2.31.0
    - rich:              13.7.0
    - rpds-py:           0.18.0
    - rsa:               4.7.2
    - ruamel.yaml:       0.18.5
    - ruamel.yaml.clib:  0.2.8
    - s3fs:              2024.2.0
    - s3transfer:        0.10.0
    - safetensors:       0.4.2
    - safety-schemas:    0.0.1
    - sagemaker:         2.208.0
    - schema:            0.7.5
    - scikit-learn:      1.4.1.post1
    - scipy:             1.12.0
    - sentence-transformers: 2.3.1
    - sentencepiece:     0.2.0
    - setuptools:        69.1.0
    - six:               1.16.0
    - smdebug-rulesconfig: 1.0.1
    - smpppdu:           0.1.2
    - smppy:             0.3.2
    - stack-data:        0.6.3
    - sympy:             1.12
    - tblib:             2.0.0
    - tensorboard:       2.16.2
    - tensorboard-data-server: 0.7.2
    - thefuzz:           0.22.1
    - threadpoolctl:     3.3.0
    - tokenizers:        0.15.2
    - tomli:             2.0.1
    - torch:             2.2.0
    - torchmetrics:      1.3.1
    - torchvision:       0.17.0
    - tornado:           6.4
    - tqdm:              4.66.2
    - traitlets:         5.14.1
    - transformers:      4.38.0
    - triton:            2.2.0
    - typer:             0.9.0
    - typing-extensions: 4.9.0
    - urllib3:           2.0.7
    - wcwidth:           0.2.13
    - werkzeug:          3.0.1
    - wheel:             0.42.0
    - wrapt:             1.16.0
    - xmltodict:         0.13.0
    - yarl:              1.9.4
    - zipp:              3.17.0
* System:
    - OS:                Linux
    - architecture:
        - 64bit
        - ELF
    - processor:
    - python:            3.10.8
    - release:           5.10.210-201.855.amzn2.x86_64
    - version:           #1 SMP Tue Mar 12 19:03:26 UTC 2024

More info

No response
