Error while training with MONAILabel's deepedit #1489

Closed
lukasvanderstricht opened this issue Jul 14, 2023 · 9 comments

@lukasvanderstricht

Dear all,

I am currently using 3D Slicer and its MONAILabel extension to train a segmentation model using the DeepEdit model from the predefined radiology app. Both manual segmentation and training have gone smoothly so far, and the automatic segmentation functionality seems to be doing its job.
However, when I now try to train the model further, without having added any new labels (i.e. just starting the training process again), I always get one of the following two errors:

  1. Exit code -9
    

[2023-07-13 08:52:08,765] [24408] [MainThread] [INFO] (monailabel.utils.sessions:51) - Session Path: /home/dellxpsazdelta/.cache/monailabel/sessions
[2023-07-13 08:52:08,765] [24408] [MainThread] [INFO] (monailabel.utils.sessions:52) - Session Expiry (max): 3600
[2023-07-13 08:52:08,765] [24408] [MainThread] [INFO] (monailabel.tasks.train.basic_train:353) - Train Request (input): {'model': 'deepedit', 'name': 'train_01', 'pretrained': 1, 'device': 'cuda', 'max_epochs': 50, 'early_stop_patience': -1, 'val_split': 0.2, 'train_batch_size': 1, 'val_batch_size': 1, 'multi_gpu': True, 'gpus': 'all', 'dataset': 'CacheDataset', 'dataloader': 'ThreadDataLoader', 'client_id': 'user-xyz', 'local_rank': 0}
[2023-07-13 08:52:08,766] [24408] [MainThread] [INFO] (monailabel.tasks.train.basic_train:363) - CUDA_VISIBLE_DEVICES: 0,1
[2023-07-13 08:52:08,767] [24408] [MainThread] [INFO] (monailabel.tasks.train.basic_train:368) - Distributed/Multi GPU is limited
[2023-07-13 08:52:08,767] [24408] [MainThread] [INFO] (monailabel.tasks.train.basic_train:381) - Distributed Training = FALSE
[2023-07-13 08:52:08,767] [24408] [MainThread] [INFO] (monailabel.tasks.train.basic_train:408) - 0 - Train Request (final): {'name': 'train_01', 'pretrained': 1, 'device': 'cuda', 'max_epochs': 50, 'early_stop_patience': -1, 'val_split': 0.2, 'train_batch_size': 1, 'val_batch_size': 1, 'multi_gpu': False, 'gpus': 'all', 'dataset': 'CacheDataset', 'dataloader': 'ThreadDataLoader', 'model': 'deepedit', 'client_id': 'user-xyz', 'local_rank': 0, 'run_id': '20230713_0852'}
[2023-07-13 08:52:08,768] [24408] [MainThread] [INFO] (monailabel.tasks.train.basic_train:504) - 0 - Using Device: cuda; IDX: None
[2023-07-13 08:52:08,768] [24408] [MainThread] [INFO] (monailabel.tasks.train.basic_train:331) - Total Records for Training: 5
[2023-07-13 08:52:08,768] [24408] [MainThread] [INFO] (monailabel.tasks.train.basic_train:332) - Total Records for Validation: 1
Loading dataset: 0%| | 0/1 [00:00<?, ?it/s]
Loading dataset: 100%|██████████| 1/1 [00:10<00:00, 10.10s/it]
Loading dataset: 100%|██████████| 1/1 [00:10<00:00, 10.10s/it]
[2023-07-13 08:53:05,649] [24408] [MainThread] [INFO] (monailabel.tasks.train.basic_train:275) - 0 - Records for Validation: 1
[2023-07-13 08:53:05,748] [24408] [MainThread] [INFO] (monailabel.tasks.train.basic_train:265) - 0 - Adding Validation to run every '1' interval
[2023-07-13 08:53:05,749] [24408] [MainThread] [INFO] (monailabel.tasks.train.basic_train:591) - 0 - Load Path /home/dellxpsazdelta/radiology_psoas_azd/model/deepedit_dynunet/train_01/model.pt
Loading dataset: 0%| | 0/5 [00:00<?, ?it/s]
Loading dataset: 20%|██ | 1/5 [00:05<00:20, 5.20s/it]
Loading dataset: 40%|████ | 2/5 [00:10<00:15, 5.05s/it]
Loading dataset: 60%|██████ | 3/5 [00:13<00:08, 4.37s/it]
Loading dataset: 80%|████████ | 4/5 [00:18<00:04, 4.36s/it]
Loading dataset: 100%|██████████| 5/5 [00:23<00:00, 4.62s/it]
Loading dataset: 100%|██████████| 5/5 [00:23<00:00, 4.63s/it]
[2023-07-13 08:53:29,092] [24408] [MainThread] [INFO] (monailabel.tasks.train.basic_train:227) - 0 - Records for Training: 5
[2023-07-13 08:53:29,098] [24408] [MainThread] [INFO] (ignite.engine.engine.SupervisedTrainer:697) - Engine run resuming from iteration 0, epoch 0 until 50 epochs
[2023-07-13 08:53:30,700] [24408] [MainThread] [INFO] (ignite.engine.engine.SupervisedTrainer:138) - Restored all variables from /home/dellxpsazdelta/radiology_psoas_azd/model/deepedit_dynunet/train_01/model.pt
[2023-07-13 08:53:38,945] [24408] [MainThread] [INFO] (ignite.engine.engine.SupervisedTrainer:269) - Epoch: 1/50, Iter: 1/5 -- train_loss: 0.8791
2023-07-13 08:53:40,341 - INFO - Number of simulated clicks: 11
[2023-07-13 08:53:41,639] [24408] [MainThread] [INFO] (ignite.engine.engine.SupervisedTrainer:269) - Epoch: 1/50, Iter: 2/5 -- train_loss: 0.8105
[2023-07-13 08:53:42,155] [24408] [MainThread] [INFO] (ignite.engine.engine.SupervisedTrainer:269) - Epoch: 1/50, Iter: 3/5 -- train_loss: 0.8015
[2023-07-13 08:53:42,632] [24408] [MainThread] [INFO] (ignite.engine.engine.SupervisedTrainer:269) - Epoch: 1/50, Iter: 4/5 -- train_loss: 0.8339
2023-07-13 08:53:45,352 - INFO - Number of simulated clicks: 9
[2023-07-13 08:53:46,177] [24408] [MainThread] [INFO] (ignite.engine.engine.SupervisedTrainer:269) - Epoch: 1/50, Iter: 5/5 -- train_loss: 0.8071
[2023-07-13 08:53:46,433] [24408] [MainThread] [INFO] (ignite.engine.engine.SupervisedTrainer:265) - Got new best metric of train_dice: 0.678305447101593
[2023-07-13 08:53:46,434] [24408] [MainThread] [INFO] (ignite.engine.engine.SupervisedTrainer:85) - Current learning rate: 0.0001
[2023-07-13 08:53:46,434] [24408] [MainThread] [INFO] (ignite.engine.engine.SupervisedTrainer:199) - Epoch[1] Metrics -- left psoas_dice: 0.7228 right psoas_dice: 0.6338 train_dice: 0.6783
[2023-07-13 08:53:46,434] [24408] [MainThread] [INFO] (ignite.engine.engine.SupervisedTrainer:209) - Key metric: train_dice best value: 0.678305447101593 at epoch: 1
[2023-07-13 08:53:46,435] [24408] [MainThread] [INFO] (ignite.engine.engine.SupervisedEvaluator:697) - Engine run resuming from iteration 0, epoch 0 until 1 epochs
2023-07-13 08:53:48,278 - INFO - Number of simulated clicks: 10
[2023-07-13 08:53:49,005] [24408] [MainThread] [INFO] (ignite.engine.engine.SupervisedEvaluator:265) - Got new best metric of val_mean_dice: 0.6918515563011169
[2023-07-13 08:53:49,005] [24408] [MainThread] [INFO] (ignite.engine.engine.SupervisedEvaluator:199) - Epoch[1] Metrics -- left psoas_dice: 0.7021 right psoas_dice: 0.6816 val_mean_dice: 0.6919
[2023-07-13 08:53:49,005] [24408] [MainThread] [INFO] (ignite.engine.engine.SupervisedEvaluator:209) - Key metric: val_mean_dice best value: 0.6918515563011169 at epoch: 1
[2023-07-13 08:53:51,338] [24408] [MainThread] [INFO] (monailabel.tasks.train.handler:86) - New Model published: /home/dellxpsazdelta/radiology_psoas_azd/model/deepedit_dynunet/train_01/model.pt => /home/dellxpsazdelta/radiology_psoas_azd/model/deepedit_dynunet.pt
[2023-07-13 08:53:51,339] [24408] [MainThread] [INFO] (ignite.engine.engine.SupervisedEvaluator:765) - Epoch[1] Complete. Time taken: 00:00:05
[2023-07-13 08:53:51,339] [24408] [MainThread] [INFO] (ignite.engine.engine.SupervisedEvaluator:778) - Engine run complete. Time taken: 00:00:05
[2023-07-13 08:53:52,354] [24408] [MainThread] [INFO] (ignite.engine.engine.SupervisedTrainer:765) - Epoch[1] Complete. Time taken: 00:00:22
2023-07-13 08:53:54,353 - INFO - Number of simulated clicks: 3
[2023-07-13 08:53:55,735] [24408] [MainThread] [INFO] (ignite.engine.engine.SupervisedTrainer:269) - Epoch: 2/50, Iter: 1/5 -- train_loss: 0.7936
[2023-07-13 08:53:56,209] [24408] [MainThread] [INFO] (ignite.engine.engine.SupervisedTrainer:269) - Epoch: 2/50, Iter: 2/5 -- train_loss: 0.8046
2023-07-13 08:53:57,274 - INFO - Number of simulated clicks: 8
[2023-07-13 08:53:58,583] [24408] [MainThread] [INFO] (ignite.engine.engine.SupervisedTrainer:269) - Epoch: 2/50, Iter: 3/5 -- train_loss: 0.7968
[2023-07-13 08:53:59,060] [24408] [MainThread] [INFO] (ignite.engine.engine.SupervisedTrainer:269) - Epoch: 2/50, Iter: 4/5 -- train_loss: 0.8337
[2023-07-13 08:53:59,851] [24408] [MainThread] [INFO] (ignite.engine.engine.SupervisedTrainer:269) - Epoch: 2/50, Iter: 5/5 -- train_loss: 0.8011
[2023-07-13 08:53:59,853] [24408] [MainThread] [INFO] (ignite.engine.engine.SupervisedTrainer:265) - Got new best metric of train_dice: 0.8020626306533813
[2023-07-13 08:53:59,853] [24408] [MainThread] [INFO] (ignite.engine.engine.SupervisedTrainer:85) - Current learning rate: 0.0001
[2023-07-13 08:53:59,853] [24408] [MainThread] [INFO] (ignite.engine.engine.SupervisedTrainer:199) - Epoch[2] Metrics -- left psoas_dice: 0.7723 right psoas_dice: 0.8318 train_dice: 0.8021
[2023-07-13 08:53:59,853] [24408] [MainThread] [INFO] (ignite.engine.engine.SupervisedTrainer:209) - Key metric: train_dice best value: 0.8020626306533813 at epoch: 2
[2023-07-13 08:53:59,854] [24408] [MainThread] [INFO] (ignite.engine.engine.SupervisedEvaluator:697) - Engine run resuming from iteration 0, epoch 1 until 2 epochs
2023-07-13 08:54:01,733 - INFO - Number of simulated clicks: 9
[2023-07-13 08:54:02,684] [24408] [MainThread] [INFO] (ignite.engine.engine.SupervisedEvaluator:265) - Got new best metric of val_mean_dice: 0.709175705909729
[2023-07-13 08:54:02,685] [24408] [MainThread] [INFO] (ignite.engine.engine.SupervisedEvaluator:199) - Epoch[2] Metrics -- left psoas_dice: 0.7219 right psoas_dice: 0.6965 val_mean_dice: 0.7092
[2023-07-13 08:54:02,685] [24408] [MainThread] [INFO] (ignite.engine.engine.SupervisedEvaluator:209) - Key metric: val_mean_dice best value: 0.709175705909729 at epoch: 2
[2023-07-13 08:54:05,116] [24408] [MainThread] [INFO] (monailabel.tasks.train.handler:86) - New Model published: /home/dellxpsazdelta/radiology_psoas_azd/model/deepedit_dynunet/train_01/model.pt => /home/dellxpsazdelta/radiology_psoas_azd/model/deepedit_dynunet.pt
[2023-07-13 08:54:05,116] [24408] [MainThread] [INFO] (ignite.engine.engine.SupervisedEvaluator:765) - Epoch[2] Complete. Time taken: 00:00:05
[2023-07-13 08:54:05,116] [24408] [MainThread] [INFO] (ignite.engine.engine.SupervisedEvaluator:778) - Engine run complete. Time taken: 00:00:05
[2023-07-13 08:54:05,418] [24408] [MainThread] [INFO] (ignite.engine.engine.SupervisedTrainer:765) - Epoch[2] Complete. Time taken: 00:00:13
[2023-07-13 08:54:14,842] [26118] [ThreadPoolExecutor-0_0] [INFO] (monailabel.utils.async_tasks.utils:76) - Return code: -9

  2. Exit code 1
    

[2023-07-13 12:20:04,630] [23149] [MainThread] [INFO] (monailabel.utils.async_tasks.task:36) - Train request: {'model': 'deepedit', 'name': 'train_01', 'pretrained': 1, 'device': 'cuda', 'max_epochs': 50, 'early_stop_patience': -1, 'val_split': 0.2, 'train_batch_size': 1, 'val_batch_size': 1, 'multi_gpu': True, 'gpus': 'all', 'dataset': 'CacheDataset', 'dataloader': 'ThreadDataLoader', 'client_id': 'user-xyz'}
[2023-07-13 12:20:04,632] [23149] [ThreadPoolExecutor-0_0] [INFO] (monailabel.utils.async_tasks.utils:58) - COMMAND:: /opt/conda/bin/python -m monailabel.interfaces.utils.app -m train -r {"model":"deepedit","name":"train_01","pretrained":1,"device":"cuda","max_epochs":50,"early_stop_patience":-1,"val_split":0.2,"train_batch_size":1,"val_batch_size":1,"multi_gpu":true,"gpus":"all","dataset":"CacheDataset","dataloader":"ThreadDataLoader","client_id":"user-xyz"}
[2023-07-13 12:20:05,144] [5998] [MainThread] [INFO] (main:38) - Initializing App from: /home/dellxpsazdelta/radiology_psoas_azd; studies: /home/dellxpsazdelta/psoas-azd-images/train-images; conf: {'models': 'deepedit'}
[2023-07-13 12:20:30,698] [5998] [MainThread] [INFO] (monailabel.utils.others.class_utils:36) - Subclass for MONAILabelApp Found: <class 'main.MyApp'>
[2023-07-13 12:20:30,748] [5998] [MainThread] [INFO] (monailabel.utils.others.class_utils:36) - Subclass for TaskConfig Found: <class 'lib.configs.segmentation.Segmentation'>
[2023-07-13 12:20:30,749] [5998] [MainThread] [INFO] (monailabel.utils.others.class_utils:36) - Subclass for TaskConfig Found: <class 'lib.configs.deepgrow_3d.Deepgrow3D'>
[2023-07-13 12:20:30,770] [5998] [MainThread] [INFO] (monailabel.utils.others.class_utils:36) - Subclass for TaskConfig Found: <class 'lib.configs.deepedit.DeepEdit'>
[2023-07-13 12:20:30,771] [5998] [MainThread] [INFO] (monailabel.utils.others.class_utils:36) - Subclass for TaskConfig Found: <class 'lib.configs.segmentation_spleen.SegmentationSpleen'>
[2023-07-13 12:20:30,772] [5998] [MainThread] [INFO] (monailabel.utils.others.class_utils:36) - Subclass for TaskConfig Found: <class 'lib.configs.deepgrow_2d.Deepgrow2D'>
[2023-07-13 12:20:30,773] [5998] [MainThread] [INFO] (main:83) - +++ Adding Model: deepedit => lib.configs.deepedit.DeepEdit
[2023-07-13 12:20:32,241] [5998] [MainThread] [INFO] (lib.configs.deepedit:145) - EPISTEMIC Enabled: 0; Samples: 5
[2023-07-13 12:20:32,241] [5998] [MainThread] [INFO] (lib.configs.deepedit:149) - TTA Enabled: 0; Samples: 5
[2023-07-13 12:20:32,241] [5998] [MainThread] [INFO] (main:87) - +++ Using Models: ['deepedit']
[2023-07-13 12:20:32,241] [5998] [MainThread] [INFO] (monailabel.interfaces.app:126) - Init Datastore for: /home/dellxpsazdelta/psoas-azd-images/train-images
[2023-07-13 12:20:32,242] [5998] [MainThread] [INFO] (monailabel.datastore.local:125) - Auto Reload: False; Extensions: ['.nii.gz', '.nii', '.nrrd', '.jpg', '.png', '.tif', '.svs', '.xml']
[2023-07-13 12:20:32,300] [5998] [MainThread] [INFO] (monailabel.datastore.local:540) - Invalidate count: 0
[2023-07-13 12:20:32,300] [5998] [MainThread] [INFO] (main:112) - +++ Adding Inferer:: deepedit => <lib.infers.deepedit.DeepEdit object at 0x7f395c5280d0>
[2023-07-13 12:20:32,300] [5998] [MainThread] [INFO] (main:112) - +++ Adding Inferer:: deepedit_seg => <lib.infers.deepedit.DeepEdit object at 0x7f395c12b990>
[2023-07-13 12:20:32,301] [5998] [MainThread] [INFO] (main:161) - +++ Adding Trainer:: deepedit => <lib.trainers.deepedit.DeepEdit object at 0x7f395c12b7d0>
[2023-07-13 12:20:32,301] [5998] [MainThread] [INFO] (monailabel.utils.sessions:51) - Session Path: /home/dellxpsazdelta/.cache/monailabel/sessions
[2023-07-13 12:20:32,301] [5998] [MainThread] [INFO] (monailabel.utils.sessions:52) - Session Expiry (max): 3600
[2023-07-13 12:20:32,301] [5998] [MainThread] [INFO] (monailabel.tasks.train.basic_train:353) - Train Request (input): {'model': 'deepedit', 'name': 'train_01', 'pretrained': 1, 'device': 'cuda', 'max_epochs': 50, 'early_stop_patience': -1, 'val_split': 0.2, 'train_batch_size': 1, 'val_batch_size': 1, 'multi_gpu': True, 'gpus': 'all', 'dataset': 'CacheDataset', 'dataloader': 'ThreadDataLoader', 'client_id': 'user-xyz', 'local_rank': 0}
[2023-07-13 12:20:32,301] [5998] [MainThread] [INFO] (monailabel.tasks.train.basic_train:363) - CUDA_VISIBLE_DEVICES: None
[2023-07-13 12:20:32,302] [5998] [MainThread] [INFO] (monailabel.tasks.train.basic_train:368) - Distributed/Multi GPU is limited
[2023-07-13 12:20:32,302] [5998] [MainThread] [INFO] (monailabel.tasks.train.basic_train:381) - Distributed Training = FALSE
[2023-07-13 12:20:32,302] [5998] [MainThread] [INFO] (monailabel.tasks.train.basic_train:408) - 0 - Train Request (final): {'name': 'train_01', 'pretrained': 1, 'device': 'cuda', 'max_epochs': 50, 'early_stop_patience': -1, 'val_split': 0.2, 'train_batch_size': 1, 'val_batch_size': 1, 'multi_gpu': False, 'gpus': 'all', 'dataset': 'CacheDataset', 'dataloader': 'ThreadDataLoader', 'model': 'deepedit', 'client_id': 'user-xyz', 'local_rank': 0, 'run_id': '20230713_1220'}
[2023-07-13 12:20:32,302] [5998] [MainThread] [INFO] (monailabel.tasks.train.basic_train:504) - 0 - Using Device: cuda; IDX: None
[2023-07-13 12:20:32,303] [5998] [MainThread] [INFO] (monailabel.tasks.train.basic_train:331) - Total Records for Training: 7
[2023-07-13 12:20:32,303] [5998] [MainThread] [INFO] (monailabel.tasks.train.basic_train:332) - Total Records for Validation: 2
Loading dataset: 0%| | 0/2 [00:00<?, ?it/s]
Loading dataset: 50%|█████ | 1/2 [00:09<00:09, 9.83s/it]
Loading dataset: 100%|██████████| 2/2 [00:13<00:00, 6.29s/it]
Loading dataset: 100%|██████████| 2/2 [00:13<00:00, 6.82s/it]
[2023-07-13 12:21:44,249] [5998] [MainThread] [INFO] (monailabel.tasks.train.basic_train:275) - 0 - Records for Validation: 2
[2023-07-13 12:21:44,258] [5998] [MainThread] [INFO] (monailabel.tasks.train.basic_train:265) - 0 - Adding Validation to run every '1' interval
[2023-07-13 12:21:44,258] [5998] [MainThread] [INFO] (monailabel.tasks.train.basic_train:591) - 0 - Load Path /home/dellxpsazdelta/radiology_psoas_azd/model/deepedit_dynunet/train_01/model.pt
Loading dataset: 0%| | 0/7 [00:00<?, ?it/s]
Loading dataset: 14%|█▍ | 1/7 [00:04<00:25, 4.25s/it]
Loading dataset: 29%|██▊ | 2/7 [00:08<00:21, 4.35s/it]
Loading dataset: 43%|████▎ | 3/7 [00:12<00:16, 4.16s/it]
Loading dataset: 57%|█████▋ | 4/7 [00:16<00:11, 3.93s/it]
Loading dataset: 71%|███████▏ | 5/7 [00:25<00:11, 5.88s/it]
Loading dataset: 71%|███████▏ | 5/7 [00:29<00:11, 5.87s/it]
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/monai/transforms/transform.py", line 89, in apply_transform
    return _apply_transform(transform, data, unpack_items)
  File "/opt/conda/lib/python3.7/site-packages/monai/transforms/transform.py", line 53, in _apply_transform
    return transform(parameters)
  File "/opt/conda/lib/python3.7/site-packages/monai/apps/deepedit/transforms.py", line 99, in __call__
    label = np.zeros(d[key].shape)
numpy.core._exceptions.MemoryError: Unable to allocate 502. MiB for an array with shape (512, 512, 251) and data type float64

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.7/site-packages/monailabel/interfaces/utils/app.py", line 132, in <module>
    run_main()
  File "/opt/conda/lib/python3.7/site-packages/monailabel/interfaces/utils/app.py", line 117, in run_main
    result = a.train(request)
  File "/opt/conda/lib/python3.7/site-packages/monailabel/interfaces/app.py", line 380, in train
    result = task(request, self.datastore())
  File "/opt/conda/lib/python3.7/site-packages/monailabel/tasks/train/basic_train.py", line 382, in __call__
    res = self.train(0, world_size, req, datalist)
  File "/opt/conda/lib/python3.7/site-packages/monailabel/tasks/train/basic_train.py", line 428, in train
    context.trainer = self._create_trainer(context)
  File "/opt/conda/lib/python3.7/site-packages/monailabel/tasks/train/basic_train.py", line 575, in _create_trainer
    train_data_loader=self.train_data_loader(context),
  File "/opt/conda/lib/python3.7/site-packages/monailabel/tasks/train/basic_train.py", line 226, in train_data_loader
    dataset, datalist = self._dataset(context, context.train_datalist)
  File "/opt/conda/lib/python3.7/site-packages/monailabel/tasks/train/basic_train.py", line 200, in _dataset
    if context.dataset_type == "CacheDataset"
  File "/opt/conda/lib/python3.7/site-packages/monai/data/dataset.py", line 723, in __init__
    self.set_data(data)
  File "/opt/conda/lib/python3.7/site-packages/monai/data/dataset.py", line 748, in set_data
    self._cache = _compute_cache()
  File "/opt/conda/lib/python3.7/site-packages/monai/data/dataset.py", line 737, in _compute_cache
    return self._fill_cache()
  File "/opt/conda/lib/python3.7/site-packages/monai/data/dataset.py", line 761, in _fill_cache
    desc="Loading dataset",
  File "/opt/conda/lib/python3.7/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/opt/conda/lib/python3.7/multiprocessing/pool.py", line 748, in next
    raise value
  File "/opt/conda/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/opt/conda/lib/python3.7/site-packages/monai/data/dataset.py", line 777, in _load_cache_item
    item = apply_transform(_xform, item)
  File "/opt/conda/lib/python3.7/site-packages/monai/transforms/transform.py", line 113, in apply_transform
    raise RuntimeError(f"applying transform {transform}") from e
RuntimeError: applying transform <monai.apps.deepedit.transforms.NormalizeLabelsInDatasetd object at 0x7f393a2fdfd0>
[2023-07-13 12:22:24,359] [23149] [ThreadPoolExecutor-0_0] [INFO] (monailabel.utils.async_tasks.utils:76) - Return code: 1

It seems weird to me that, without my having changed anything (such as adding new labels), the training suddenly and systematically fails when it was working fine before.
Does anyone have any clue as to why these errors occur?

Thanks in advance!

@diazandr3s
Collaborator

Hi @lukasvanderstricht,

Thanks for opening the issue here.
From the logs I see you're having a memory issue:

  File "/opt/conda/lib/python3.7/site-packages/monai/apps/deepedit/transforms.py", line 99, in __call__
    label = np.zeros(d[key].shape)
numpy.core._exceptions.MemoryError: Unable to allocate 502. MiB for an array with shape (512, 512, 251) and data type float64
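
For scale, that failed allocation is exactly one float64 volume of the reported shape, and with default settings CacheDataset tries to hold every pre-processed training sample in RAM at once. The earlier exit code -9 points in the same direction: the process was killed with SIGKILL, which on Linux typically means the kernel's out-of-memory killer stepped in.

```python
>>> 512 * 512 * 251 * 8 / 2**20  # voxels x 8 bytes (float64), in MiB
502.0
```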

I also see that you're using CacheDataset to load the dataset for training:

[2023-07-13 12:20:32,301] [5998] [MainThread] [INFO] (monailabel.tasks.train.basic_train:353) - Train Request (input): {'model': 'deepedit', 'name': 'train_01', 'pretrained': 1, 'device': 'cuda', 'max_epochs': 50, 'early_stop_patience': -1, 'val_split': 0.2, 'train_batch_size': 1, 'val_batch_size': 1, 'multi_gpu': True, 'gpus': 'all', 'dataset': 'CacheDataset', 'dataloader': 'ThreadDataLoader', 'client_id': 'user-xyz', 'local_rank': 0}

Caching the dataset is great because it is faster. However, it seems your GPU can't cache the number of volumes you are using to train the model.

I'd recommend:

  • Change the data loader to Dataset

[Screenshot from 2023-07-14 10-19-10: MONAI Label settings showing where the dataset type is selected]

The downside of this is that training is slower.

  • Continue using CacheDataset, but train the model in batches of a size that your GPU allows (both options are sketched below)
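
For reference, here is a minimal sketch of what both options look like in MONAI itself. The datalist and transform chain below are placeholders rather than the radiology app's actual pipeline, and cache_rate is a further middle-ground knob that CacheDataset exposes:

```python
from monai.data import CacheDataset, Dataset
from monai.transforms import Compose, EnsureChannelFirstd, LoadImaged

# Placeholder datalist and transforms -- the radiology app builds its own.
datalist = [{"image": "img1.nii.gz", "label": "lbl1.nii.gz"}]
transforms = Compose([
    LoadImaged(keys=["image", "label"]),
    EnsureChannelFirstd(keys=["image", "label"]),
])

# Option 1: plain Dataset -- nothing is cached, RAM use stays low,
# but every epoch repeats the full pre-processing.
train_ds = Dataset(data=datalist, transform=transforms)

# Middle ground: keep CacheDataset but cache only a fraction of the
# pre-processed samples in memory (cache_rate, or an absolute cache_num).
train_ds = CacheDataset(data=datalist, transform=transforms, cache_rate=0.5)
```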

Just to understand a bit more about the use case here:

1/ How many labels are you trying to segment (https://github.com/Project-MONAI/MONAILabel/blob/main/sample-apps/radiology/lib/configs/deepedit.py#L42-L51)?

2/ Did you change the default volume size (https://github.com/Project-MONAI/MONAILabel/blob/main/sample-apps/radiology/lib/configs/deepedit.py#L77)?

Let us know

@lukasvanderstricht
Author

Hi @diazandr3s,

Thank you for your answer! I am indeed able to run the training when I switch from CacheDataset to Dataset. The downside is that this switch makes training significantly slower, so I was wondering whether there is a way to fix this while still using CacheDataset.
I will try the option of using larger batches, thank you for the suggestion!

Here is some more background:
1/ I have 9 labeled images with 3 labels each (one of which is the background label).

2/ I have not changed anything about the volume size yet.

Hope this helps!

Thank you again for your answer!

@diazandr3s
Collaborator

Thanks for the reply, @lukasvanderstricht

With regards to this:

I will try the option of using larger batches, thank you for the suggestion!

I meant that you train the model on the number of volumes your GPU can cache, then retrain on the remaining volumes. Keep using the default batch size of 1.
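
To illustrate what I mean, a rough sketch (the helper names here are hypothetical placeholders, not MONAI Label API):

```python
def submit_train_request(images, pretrained=True):
    # Hypothetical placeholder for sending one MONAI Label train request
    # restricted to the given subset of volumes.
    print(f"training on {images} (pretrained={pretrained})")

def chunks(items, size):
    # Yield successive fixed-size slices of a list.
    for i in range(0, len(items), size):
        yield items[i:i + size]

train_images = ["img1.nii.gz", "img2.nii.gz", "img3.nii.gz", "img4.nii.gz"]

# Train on as many volumes as the cache can hold at once (here: 2),
# resuming from the previously published checkpoint each round.
for subset in chunks(train_images, 2):
    submit_train_request(subset, pretrained=True)
```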

@nvahmadi

Hi,
out of curiosity - have you tried using PersistentDataset (still with a batch size of 1)? In my experience that also gives a nice speedup, especially if the source volumes are compressed (e.g. .nii.gz) or if compute-heavy pre-processing happens (e.g. resampling, which is the case by default in the radiology app, especially at 512x512xN). Could be worth a try.
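
For anyone trying this, a minimal PersistentDataset sketch (the datalist, transforms, and cache_dir below are illustrative): each sample is pre-processed once, serialized under cache_dir, and read back from disk on later epochs.

```python
from monai.data import PersistentDataset
from monai.transforms import Compose, EnsureChannelFirstd, LoadImaged

# Placeholder datalist -- MONAI Label builds the real one from the datastore.
datalist = [{"image": "img1.nii.gz", "label": "lbl1.nii.gz"}]
transforms = Compose([
    LoadImaged(keys=["image", "label"]),
    EnsureChannelFirstd(keys=["image", "label"]),
])

# Pre-processed samples are written to cache_dir on first use and then
# re-read from disk -- ideally cache_dir sits on a fast (NVMe) drive.
train_ds = PersistentDataset(data=datalist, transform=transforms,
                             cache_dir="/tmp/monai_cache")
```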

@lukasvanderstricht
Author

Hi @nvahmadi,

Thanks for the suggestion! It does indeed seem to work, but I don't see a major difference in speed compared to Dataset. Thanks for your reply, though!

Kind regards

@nvahmadi

Thanks for reporting back, and interesting to note. I'm not sure why, but for me the speedup was drastic, comparable to CacheDataset. Perhaps for two reasons: 1) I cache to NVMe drives, and 2) the first epoch will still be perceptibly as slow as Dataset, because every sample encountered for the first time has to be pre-processed and written to disk first; the speedup should become noticeable from epoch 2 onwards. Did you let it run beyond epoch 1, and are you caching to NVMe drives as well?

@lukasvanderstricht
Author

I did let it run for more than one epoch, but it remains as slow as Dataset. I don't cache to NVMe drives, though.

@nvahmadi

OK, good to know, thanks. One note: I just remembered that I saw this speedup in the context of MONAI Core and with larger batch sizes. I'd need to try for myself whether I get similar speedups in MONAI Label with e.g. a batch size of 1. Sorry for the confusion!

@diazandr3s
Collaborator

Closing this issue
