Description
Bug description
I run a full validation epoch before the first training epoch by setting `num_sanity_val_steps: -1`, on 4 RTX 4090 GPUs. The program hangs right after the validation epoch ends, while GPU utilization stays at 100%.
The only way to terminate the process is `pkill -9 python`; Ctrl+C does not interrupt it.
I also set a trace after the sanity validation check. It seems that an error occurs while calling self.fit_loop.run(), but it is never caught or reported.
I have pasted my datamodule code below.
Please tell me if any further information is needed!
What version are you seeing the problem on?
v2.5
How to reproduce the bug
from typing import List

import lightning as L
from torch.utils.data import DataLoader

# convert_to_easydict, print_logger, voxel_collate_batch and DatasetTemplate
# come from the reporter's project or pcdet and are not shown here.


class pcdet_dataset(L.LightningDataModule):
    def __init__(
        self,
        pcdet_dataset_config: dict,
        class_names: List[str],
        batch_size: int,
        dist_train: bool,
        workers: int,
        merge_all_iters_to_one_epoch: bool,
        total_epochs: int,
    ):
        super().__init__()
        self.pcdet_dataset_config = convert_to_easydict(pcdet_dataset_config)
        self.class_names = class_names
        self.dataset = None
        self.batch_size = batch_size
        self.dist_train = dist_train
        self.workers = workers
        self.merge_all_iters_to_one_epoch = merge_all_iters_to_one_epoch
        self.total_epochs = total_epochs

    def setup(self, stage: str) -> None:
        from pcdet.datasets import __all__

        print(f"STAGE: {stage}")
        # Build the dataset only once, even if setup() is called for several stages.
        if self.dataset is not None:
            return
        if stage in ("fit", "validate"):
            self.dataset: DatasetTemplate = __all__[self.pcdet_dataset_config.DATASET](
                dataset_cfg=self.pcdet_dataset_config,
                class_names=self.class_names,
                root_path=None,
                training=False,
                logger=print_logger(),
            )
        elif stage == "test":
            self.dataset: DatasetTemplate = __all__[self.pcdet_dataset_config.DATASET](
                dataset_cfg=self.pcdet_dataset_config,
                class_names=self.class_names,
                root_path=None,
                training=False,
                logger=print_logger(),
            )

    def build_dataloader(self) -> DataLoader:
        # The same dataset and loader settings are shared by train, val and test.
        return DataLoader(
            self.dataset,
            batch_size=self.batch_size,
            shuffle=False,
            collate_fn=voxel_collate_batch,
        )

    def train_dataloader(self) -> DataLoader:
        return self.build_dataloader()

    def test_dataloader(self) -> DataLoader:
        return self.build_dataloader()

    def val_dataloader(self) -> DataLoader:
        return self.build_dataloader()
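To check whether the datamodule itself can deliver a training batch outside the Trainer, something like the following could be used. This is only a sketch: `fetch_first_batch` and its `timeout_s` parameter are hypothetical names that are not part of Lightning or PyTorch, and it does not reproduce anything the Trainer does in a multi-GPU run.

```python
import threading
from typing import Any, Iterable, Optional, Tuple


def fetch_first_batch(
    loader: Iterable[Any], timeout_s: float = 60.0
) -> Tuple[bool, Optional[Any]]:
    """Try to pull one batch from `loader`.

    Returns (True, batch) if a batch arrived within `timeout_s` seconds,
    or (False, None) if the loader was still blocked at that point.
    """
    result: dict = {}

    def _worker() -> None:
        try:
            result["batch"] = next(iter(loader))
        except BaseException as exc:  # pass any loader error back to the caller
            result["error"] = exc

    # Daemon thread, so a blocked loader cannot keep the interpreter alive.
    thread = threading.Thread(target=_worker, daemon=True)
    thread.start()
    thread.join(timeout_s)

    if thread.is_alive():
        return False, None
    if "error" in result:
        raise result["error"]
    return True, result["batch"]
```

With the datamodule above, this would be called after `setup("fit")` as `fetch_first_batch(dm.train_dataloader())`, where `dm` is an instance of the datamodule. If it returns `(True, batch)` quickly, the hang is more likely somewhere other than plain data loading.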
Error messages and logs
-> self.state.stage = stage
(Pdb) pp stage
<RunningStage.TRAINING: 'train'>
(Pdb) l
1091 # reset the progress tracking state after sanity checking. we don't need to set the state before
1092 # because sanity check only runs when we are not restarting
1093 _reset_progress(val_loop)
1094
1095 # restore the previous stage when the sanity check if finished
1096 -> self.state.stage = stage
1097
1098 def __setup_profiler(self) -> None:
1099 assert self.state.fn is not None
1100 local_rank = self.local_rank if self.world_size > 1 else None
1101 self.profiler._lightning_module = proxy(self.lightning_module)
(Pdb) n
/home/ksas/anaconda3/envs/physical_attack/lib/python3.12/site-packages/lightning/pytorch/trainer/trainer.py(1053)_run_stage()
-> with isolate_rng():
(Pdb) n
/home/ksas/anaconda3/envs/physical_attack/lib/python3.12/site-packages/lightning/pytorch/trainer/trainer.py(1055)_run_stage()
-> with torch.autograd.set_detect_anomaly(self._detect_anomaly):
(Pdb) n
/home/ksas/anaconda3/envs/physical_attack/lib/python3.12/site-packages/lightning/pytorch/trainer/trainer.py(1056)_run_stage()
-> self.fit_loop.run()
(Pdb) n
/home/ksas/anaconda3/envs/physical_attack/lib/python3.12/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:425: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument to `num_workers=31` in the `DataLoader` to improve performance.
------ the process stops forever here --------
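Since the process neither raises nor reacts to Ctrl+C, one way to see where each rank is stuck is Python's standard `faulthandler` module. The sketch below is an assumption about how it could be wired in, not something taken from the run above: `enable_hang_diagnostics` is a hypothetical helper name, and the 600-second default is arbitrary.

```python
import faulthandler
import signal
import sys


def enable_hang_diagnostics(timeout_s: float = 600.0, file=sys.stderr) -> None:
    """Make a stuck process print the Python stack of every thread.

    `file` must be a real file object with a file descriptor.
    """
    # Dump all thread tracebacks every `timeout_s` seconds until cancelled
    # with faulthandler.cancel_dump_traceback_later().
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=file)

    # Also dump on demand with `kill -USR1 <pid>` (POSIX only).
    if hasattr(signal, "SIGUSR1"):
        faulthandler.register(signal.SIGUSR1, file=file, all_threads=True)
```

Calling this once near the top of the training script, before `trainer.fit(...)`, would make every rank report its Python call stack while hung. The dump shows Python frames only, so a wait inside a native call appears as the last Python line that entered it.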
Environment
Current environment
- CUDA:
- GPU:
- NVIDIA GeForce RTX 4090
- NVIDIA GeForce RTX 4090
- NVIDIA GeForce RTX 4090
- NVIDIA GeForce RTX 4090
- available: True
- version: 11.8
- Lightning:
- lightning: 2.5.1
- lightning-utilities: 0.14.2
- pytorch-lightning: 2.5.1
- pytorch3d: 0.7.8
- raytorch: 0.1.0+aeaaf25
- torch: 2.6.0+cu118
- torchaudio: 2.6.0+cu118
- torchmetrics: 1.7.0
- torchvision: 0.21.0+cu118
- Packages:
- addict: 2.4.0
- aiofiles: 24.1.0
- aiohappyeyeballs: 2.6.1
- aiohttp: 3.11.14
- aiosignal: 1.3.2
- antlr4-python3-runtime: 4.9.3
- anyio: 4.9.0
- apptools: 5.3.0
- argcomplete: 3.6.1
- argon2-cffi: 23.1.0
- argon2-cffi-bindings: 21.2.0
- arrow: 1.3.0
- asttokens: 3.0.0
- async-lru: 2.0.5
- attrs: 25.3.0
- autocommand: 2.2.2
- av: 14.2.0
- av2: 0.3.4
- babel: 2.17.0
- backports.tarfile: 1.2.0
- beautifulsoup4: 4.13.3
- bleach: 6.2.0
- blinker: 1.9.0
- cachetools: 5.5.2
- ccimport: 0.4.4
- certifi: 2025.1.31
- cffi: 1.17.1
- charset-normalizer: 3.4.1
- click: 8.1.8
- colorlog: 6.9.0
- comm: 0.2.2
- configargparse: 1.7
- configobj: 5.0.9
- contourpy: 1.3.1
- cumm-cu118: 0.7.11
- cycler: 0.12.1
- dash: 3.0.1
- debugpy: 1.8.13
- decorator: 5.2.1
- defusedxml: 0.7.1
- dependency-groups: 1.3.0
- descartes: 1.1.0
- distlib: 0.3.9
- docstring-parser: 0.16
- easydict: 1.13
- envisage: 7.0.3
- executing: 2.2.0
- fastjsonschema: 2.21.1
- filelock: 3.13.1
- fire: 0.7.0
- flask: 3.0.3
- fonttools: 4.56.0
- fqdn: 1.5.1
- frozenlist: 1.5.0
- fsspec: 2024.6.1
- h11: 0.14.0
- httpcore: 1.0.7
- httpx: 0.28.1
- idna: 3.10
- imageio: 2.37.0
- importlib-metadata: 8.6.1
- importlib-resources: 6.5.2
- inflect: 7.3.1
- iopath: 0.1.10
- ipykernel: 6.29.5
- ipython: 9.0.2
- ipython-pygments-lexers: 1.1.1
- ipywidgets: 8.1.5
- isoduration: 20.11.0
- itsdangerous: 2.2.0
- jaraco.collections: 5.1.0
- jaraco.context: 5.3.0
- jaraco.functools: 4.0.1
- jaraco.text: 3.12.1
- jedi: 0.19.2
- jinja2: 3.1.4
- joblib: 1.4.2
- json5: 0.10.0
- jsonargparse: 4.37.0
- jsonpointer: 3.0.0
- jsonschema: 4.23.0
- jsonschema-specifications: 2024.10.1
- jupyter: 1.1.1
- jupyter-client: 8.6.3
- jupyter-console: 6.6.3
- jupyter-core: 5.7.2
- jupyter-events: 0.12.0
- jupyter-lsp: 2.2.5
- jupyter-server: 2.15.0
- jupyter-server-terminals: 0.5.3
- jupyterlab: 4.3.6
- jupyterlab-pygments: 0.3.0
- jupyterlab-server: 2.27.3
- jupyterlab-widgets: 3.0.13
- kiwisolver: 1.4.8
- kornia: 0.6.8
- kornia-rs: 0.1.8
- lark: 1.2.2
- lazy-loader: 0.4
- lightning: 2.5.1
- lightning-utilities: 0.14.2
- llvmlite: 0.44.0
- markdown-it-py: 3.0.0
- markupsafe: 2.1.5
- matplotlib: 3.5.3
- matplotlib-inline: 0.1.7
- mayavi: 4.8.2
- mdurl: 0.1.2
- mistune: 3.1.3
- more-itertools: 10.3.0
- motmetrics: 1.4.0
- mpmath: 1.3.0
- multidict: 6.2.0
- narwhals: 1.32.0
- nbclient: 0.10.2
- nbconvert: 7.16.6
- nbformat: 5.10.4
- nest-asyncio: 1.6.0
- networkx: 3.3
- ninja: 1.11.1.4
- notebook: 7.3.3
- notebook-shim: 0.2.4
- nox: 2025.2.9
- numba: 0.61.0
- numpy: 1.26.4
- nuscenes-devkit: 1.1.11
- nvidia-cublas-cu11: 11.11.3.6
- nvidia-cuda-cupti-cu11: 11.8.87
- nvidia-cuda-nvrtc-cu11: 11.8.89
- nvidia-cuda-runtime-cu11: 11.8.89
- nvidia-cudnn-cu11: 9.1.0.70
- nvidia-cufft-cu11: 10.9.0.58
- nvidia-curand-cu11: 10.3.0.86
- nvidia-cusolver-cu11: 11.4.1.48
- nvidia-cusparse-cu11: 11.7.5.86
- nvidia-nccl-cu11: 2.21.5
- nvidia-nvtx-cu11: 11.8.86
- omegaconf: 2.3.0
- open3d: 0.19.0
- opencv-python: 4.11.0.86
- overrides: 7.7.0
- packaging: 24.2
- pandas: 2.2.3
- pandocfilters: 1.5.1
- parso: 0.8.4
- pccm: 0.4.16
- pcdet: 0.6.0+8caccce
- pexpect: 4.9.0
- pillow: 11.0.0
- pip: 25.0.1
- platformdirs: 4.3.7
- plotly: 6.0.1
- polars: 1.25.2
- portalocker: 3.1.1
- prometheus-client: 0.21.1
- prompt-toolkit: 3.0.50
- propcache: 0.3.0
- protobuf: 6.30.1
- psutil: 7.0.0
- ptyprocess: 0.7.0
- pure-eval: 0.2.3
- pyarrow: 19.0.1
- pybind11: 2.13.6
- pycocotools: 2.0.8
- pycparser: 2.22
- pyface: 8.0.0
- pygments: 2.19.1
- pyparsing: 3.2.1
- pyproj: 3.7.1
- pyqt5-qt5: 5.15.16
- pyqt5-sip: 12.17.0
- pyquaternion: 0.9.9
- python-dateutil: 2.9.0.post0
- python-json-logger: 3.3.0
- pytorch-lightning: 2.5.1
- pytorch3d: 0.7.8
- pytz: 2025.1
- pyyaml: 6.0.2
- pyzmq: 26.3.0
- raytorch: 0.1.0+aeaaf25
- referencing: 0.36.2
- requests: 2.32.3
- retrying: 1.3.4
- rfc3339-validator: 0.1.4
- rfc3986-validator: 0.1.1
- rich: 13.9.4
- rpds-py: 0.23.1
- scikit-image: 0.25.2
- scikit-learn: 1.6.1
- scipy: 1.15.2
- send2trash: 1.8.3
- setuptools: 77.0.3
- shapely: 1.8.5.post1
- sharedarray: 3.2.4
- six: 1.17.0
- sniffio: 1.3.1
- sort-vertices: 0.0.0
- soupsieve: 2.6
- spconv-cu118: 2.3.8
- stack-data: 0.6.3
- sympy: 1.13.1
- tensorboardx: 2.6.2.2
- termcolor: 2.5.0
- terminado: 0.18.1
- threadpoolctl: 3.6.0
- tifffile: 2025.3.13
- tinycss2: 1.4.0
- tomli: 2.0.1
- torch: 2.6.0+cu118
- torchaudio: 2.6.0+cu118
- torchmetrics: 1.7.0
- torchvision: 0.21.0+cu118
- tornado: 6.4.2
- tqdm: 4.67.1
- traitlets: 5.14.3
- traits: 7.0.2
- traitsui: 8.0.0
- triton: 3.2.0
- typeguard: 4.3.0
- types-python-dateutil: 2.9.0.20241206
- typeshed-client: 2.7.0
- typing-extensions: 4.12.2
- tzdata: 2025.1
- universal-pathlib: 0.2.6
- uri-template: 1.3.0
- urllib3: 2.3.0
- virtualenv: 20.29.3
- voxel-ops: 0.0.0
- vtk: 9.4.1
- wcwidth: 0.2.13
- webcolors: 24.11.1
- webencodings: 0.5.1
- websocket-client: 1.8.0
- werkzeug: 3.0.6
- wheel: 0.45.1
- widgetsnbextension: 4.0.13
- xmltodict: 0.14.2
- yarl: 1.18.3
- zipp: 3.21.0
- System:
- OS: Linux
- architecture:
- 64bit
- ELF
- processor: x86_64
- python: 3.12.9
- release: 6.8.0-52-generic
- version: #53~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Wed Jan 15 19:18:46 UTC 2
More info
No response