
[Question] np.pad() consumes a lot of time when the dataset is big? #67

Closed
xuelicheng1992 opened this issue Apr 18, 2022 · 9 comments
@xuelicheng1992

❓ Question

Hi! Great job!
I've been using nnDetection in my experiments and noticed the following:
1. One epoch takes about 30 minutes (2600/2600, patch size [96 192 160]) when the dataset has 163 cases.
2. One epoch takes about 6 hours (2600/2600, patch size [96 160 160]) when the dataset has 534 cases.
I then looked into what went wrong. top shows:
1. %CPU 0.0 wa when the dataset has 163 cases.
2. %CPU 18.0-36.0 wa when the dataset has 534 cases.
So the problem seemed to be data IO. I debugged generate_train_batch() in /XXX/XXX/nndet/io/datamodule/bg_loader.py and finally found that the np.pad() call at /XXX/XXX/nndet/io/patching.py:432 is the cause:
1. About 20 ms per call when the dataset has 163 cases.
2. About 6-8 s per call when the dataset has 534 cases.
How can I solve this problem?

@xuelicheng1992
Author

1261981087

@tohsakask

Did you solve the problem?
I'm running into the same issue.

@mibaumgartner
Collaborator

Dear @xuelicheng1992 , @tohsakask ,

I'm not sure if there is any way to improve the current IO setup: the dataloader uses numpy memory maps to avoid loading the entire scans and only extracts the needed patch. The np.pad operation is the first operation that actually loads the data from disk into memory, which is why it consumes more time than "usual". In theory, the time to load the data from disk should only scale with the final patch size (i.e. the data that is actually loaded), but I noticed that it is slower when the whole scan has a large spatial size (even though it is not loaded fully).
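To illustrate what I mean, here is a rough sketch (not the actual nnDetection code; the file name, shape and indices are made up):

```python
import numpy as np

# Hypothetical preprocessed case with shape (C, Z, Y, X); opened as a memmap,
# so nothing is read from disk yet.
data = np.load("case_000.npy", mmap_mode="r")

# Slicing a memmap is still lazy: this only creates a view of the patch region.
patch = data[:, 50:146, 100:292, 80:240]

# np.pad has to create a real in-memory copy, so this is the point where the
# patch data is actually pulled from disk into RAM.
padded = np.pad(patch, ((0, 0), (8, 8), (8, 8), (8, 8)), mode="constant")

print(f"full scan on disk: {data.size * data.itemsize / 1e6:.0f} MB, "
      f"actually read for the patch: ~{patch.size * patch.itemsize / 1e6:.0f} MB")
```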

I wouldn't expect this to be an issue with the dataset size itself, though. Is there a difference between the two datasets you used in terms of spatial size (image "resolution" / number of voxels)? Since it is hard to debug the issue without access to your specific dataset, did you try to reproduce it with the toy data?

Best,
Michael

@xuelicheng1992
Author

Hi @mibaumgartner ,
Thank you for the reply!
I think the main reason is that the RAM is too small relative to the data size.
I normalize data_file (float32) to uint8 and record the minimum and maximum values, and convert seg_file (int32) to uint8 with -1 mapped to 255. This greatly reduces the RAM used by the data.
During training, after the data is read, data_file is rescaled back to float32 using the recorded min and max, and seg_file is converted back to int32 with 255 mapped to -1.
With this, one epoch takes about 50 minutes!
I think if you train LUNA16 with 64 GB of RAM or less, you will see the same problem.
My guess is that when data_file and seg_file together are bigger than the RAM size, np.pad() becomes very slow. When RAM is small, I think the best solution would be to pad before training and only crop during training!
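Roughly what I do is the following (just a sketch, the function names are only for illustration):

```python
import numpy as np

def compress_case(data: np.ndarray, seg: np.ndarray):
    """Quantize float32 data to uint8 (recording min/max) and map seg label -1 to 255."""
    d_min, d_max = float(data.min()), float(data.max())
    scale = (d_max - d_min) or 1.0
    data_u8 = np.round((data - d_min) / scale * 255).astype(np.uint8)
    seg_i32 = seg.astype(np.int32)
    seg_i32[seg_i32 == -1] = 255
    return data_u8, seg_i32.astype(np.uint8), d_min, d_max

def decompress_case(data_u8, seg_u8, d_min, d_max):
    """Restore float32 data from the recorded min/max and map label 255 back to -1."""
    data = data_u8.astype(np.float32) / 255 * (d_max - d_min) + d_min
    seg = seg_u8.astype(np.int32)
    seg[seg == 255] = -1
    return data, seg
```

Note that the uint8 quantization is lossy (only 256 levels), so it trades precision for RAM.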
Best,
Xuelicheng

@xuelicheng1992
Author

Hi @tohsakask ,
Increase the RAM or decrease the data size; make sure the data size is smaller than the RAM size.
Maybe cropping only (without padding) during training would also help!

@xuelicheng1992
Author

Hi @mibaumgartner ,
Why are there (at least) two crop patch sizes during training, one small and one big?
I hope my poor English is understandable!
Best,
Xuelicheng

@mibaumgartner
Collaborator

Hi @xuelicheng1992 ,

thanks for the detailed update. Indeed, I didn't think about RAM consumption when reading this issue. Training/augmentation will definitely slow down significantly when the RAM is full.

The RAM usage depends on several nnDetection parameters: the patch size used for training (this is independent of the data size itself, since only the needed patch is read), the number of workers/processes (each worker reads a single batch, so increasing the number of workers speeds up augmentation but also increases RAM usage), num_cached_per_thread (found in the train config; it defines how many augmented batches are kept in a queue, so decreasing it keeps fewer batches in memory) and the batch size (since each worker reads a full batch, increasing the batch size means more patches are loaded simultaneously). Decreasing num_cached_per_thread and finding a balance between the number of workers and the batch size should help to reduce RAM usage.
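As a rough back-of-the-envelope estimate of the patch-related RAM (the numbers below are just example values, not nnDetection defaults):

```python
import numpy as np

# Example values only; plug in your own training configuration.
num_workers = 6                # each worker holds one batch while loading/augmenting
batch_size = 4
num_cached_per_thread = 2      # augmented batches kept in the queue per worker
patch_shape = (96, 192, 160)   # training patch size; the extracted patch is a bit larger
channels = 1
bytes_per_voxel = 4            # float32

patch_bytes = channels * int(np.prod(patch_shape)) * bytes_per_voxel
batch_bytes = batch_size * patch_bytes
total_gb = num_workers * (1 + num_cached_per_thread) * batch_bytes / 1e9
print(f"roughly {total_gb:.1f} GB held in patches and queues "
      f"(not counting temporary copies made during augmentation)")
```

This ignores the segmentation and any intermediate arrays created by the transforms, so the real footprint will be higher.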

Please don't normalize the data file (float32) to uint8! This will certainly destroy your data: the data is normalized to zero mean and unit variance, so converting it to an integer type is not an option. I tried some experiments with float16 for the data and uint8 for the seg but didn't test this setting on the full cohort, so I can't guarantee that it won't decrease results. Furthermore, I'm not sure whether this actually reduces RAM consumption, since the data is cast to float32 after loading (I think scikit might even cast to float64 internally for the resampling in the SpatialTransform); it is a good way to reduce IO though.

Best,
Michael

@mibaumgartner
Collaborator

Regarding the big and small patch size: The extracted patch size is bigger than the training patch size to avoid border artefacts during the augmentation step. The spatial transform will automatically crop the patch to the final (i.e. the size used to train) patch size.
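Conceptually it works roughly like this (just a sketch, not the actual batchgenerators/nnDetection code; the sizes are made up):

```python
import numpy as np

def center_crop(patch: np.ndarray, target_shape):
    """Crop the spatial dims of a (C, Z, Y, X) patch down to target_shape."""
    slices = [slice(None)]
    for cur, tgt in zip(patch.shape[1:], target_shape):
        start = (cur - tgt) // 2
        slices.append(slice(start, start + tgt))
    return patch[tuple(slices)]

# Larger patch extracted from disk: gives the spatial augmentation a margin at the borders.
extracted = np.zeros((1, 128, 224, 192), dtype=np.float32)
# ... spatial augmentation (rotation/scaling/deformation) runs on the larger patch ...
final = center_crop(extracted, (96, 192, 160))  # the size actually used for training
print(final.shape)  # (1, 96, 192, 160)
```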

@xuelicheng1992
Author

Hi @mibaumgartner ,
Thank you for the reply! I will try it!
Best,
Xuelicheng
