
[Question] np.pad() consumes a lot of time when the dataset is big? #67

Closed
xuelicheng1992 opened this issue Apr 18, 2022 · 9 comments
@xuelicheng1992

❓ Question

Hi! Great job!
I've been using nnDetection in my experiments and noticed the following:
1. One epoch takes about 30 minutes (2600/2600, patch size [96 192 160]) when the dataset has 163 cases.
2. One epoch takes about 6 hours (2600/2600, patch size [96 160 160]) when the dataset has 534 cases.
I then looked into what went wrong. top shows:
1. %CPU 0.0 wa when the dataset has 163 cases.
2. %CPU 18.0-36.0 wa when the dataset has 534 cases.
So the problem seemed to be data IO. I debugged generate_train_batch() in /XXX/XXX/nndet/io/datamodule/bg_loader.py and finally found that the np.pad() call at /XXX/XXX/nndet/io/patching.py:432 is the cause:
1. About 20 ms per call when the dataset has 163 cases.
2. About 6-8 s per call when the dataset has 534 cases.
How can I solve this problem?

@xuelicheng1992
Author

1261981087

@tohsakask

Did you solve the problem?
I'm running into the same issue.

@mibaumgartner
Collaborator

Dear @xuelicheng1992 , @tohsakask ,

I'm not sure if there is any way to improve the current IO setup: the dataloader uses numpy memory maps to avoid loading the entire scans and only extracts the needed patch. The np.pad operation is the first operation that actually loads the data from disk into memory, which is why it consumes more time than "usual". In theory, the time to load the data from disk should only scale with the final patch size (i.e. the data that is actually loaded), but I noticed that it is slower when the whole scan has a large spatial size (even though it is not loaded fully).
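To illustrate what I mean, here is a rough sketch (not the actual nnDetection code; the file name, shape and indices are made up):

```python
import numpy as np

# Hypothetical preprocessed case with shape (C, Z, Y, X); opened as a memmap,
# so nothing is read from disk yet.
data = np.load("case_000.npy", mmap_mode="r")

# Slicing a memmap is still lazy: this only creates a view of the patch region.
patch = data[:, 50:146, 100:292, 80:240]

# np.pad has to create a real in-memory copy, so this is the point where the
# patch data is actually pulled from disk into RAM.
padded = np.pad(patch, ((0, 0), (8, 8), (8, 8), (8, 8)), mode="constant")

print(f"full scan on disk: {data.size * data.itemsize / 1e6:.0f} MB, "
      f"actually read for the patch: ~{patch.size * patch.itemsize / 1e6:.0f} MB")
```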

I wouldn't expect this to be an issue with the dataset size itself, though. Is there a difference between the two datasets you used in terms of spatial size (image "resolution" / number of voxels)? Since it is hard to debug the issue without access to your specific dataset, did you try to reproduce it with the toy data?

Best,
Michael

@xuelicheng1992
Author

Hi @mibaumgartner ,
Thank you for the reply!
I think the main reason is that the RAM is too small relative to the data size.
I normalize data_file (float32) to uint8 and record the minimum and maximum values, and convert seg_file (int32) to uint8 with -1 mapped to 255. This greatly reduces the RAM used by the data.
During training, after the data is read, data_file is rescaled back to float32 using the recorded min and max, and seg_file is converted back to int32 with 255 mapped to -1.
With this, one epoch takes about 50 minutes!
I think if you train LUNA16 with 64 GB of RAM or less, you will see the same problem.
My guess is that when data_file and seg_file together are bigger than the RAM size, np.pad() becomes very slow. When RAM is small, I think the best solution would be to pad before training and only crop during training!
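Roughly what I do is the following (just a sketch, the function names are only for illustration):

```python
import numpy as np

def compress_case(data: np.ndarray, seg: np.ndarray):
    """Quantize float32 data to uint8 (recording min/max) and map seg label -1 to 255."""
    d_min, d_max = float(data.min()), float(data.max())
    scale = (d_max - d_min) or 1.0
    data_u8 = np.round((data - d_min) / scale * 255).astype(np.uint8)
    seg_i32 = seg.astype(np.int32)
    seg_i32[seg_i32 == -1] = 255
    return data_u8, seg_i32.astype(np.uint8), d_min, d_max

def decompress_case(data_u8, seg_u8, d_min, d_max):
    """Restore float32 data from the recorded min/max and map label 255 back to -1."""
    data = data_u8.astype(np.float32) / 255 * (d_max - d_min) + d_min
    seg = seg_u8.astype(np.int32)
    seg[seg == 255] = -1
    return data, seg
```

Note that the uint8 quantization is lossy (only 256 levels), so it trades precision for RAM.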
Best,
Xuelicheng

@xuelicheng1992
Author

Hi @tohsakask ,
Increase the RAM or decrease the data size; make sure the data size is smaller than the RAM size.
Maybe cropping only (without padding) during training would also help!

@xuelicheng1992
Author

Hi @mibaumgartner ,
Why are there (at least) two crop patch sizes during training, one small and one big?
I hope my poor English is understandable!
Best,
Xuelicheng

@mibaumgartner
Collaborator

Hi @xuelicheng1992 ,

thanks for the detailed update. Indeed, I didn't think about RAM consumption when reading this issue. Training/augmentation will definitely slow down significantly when the RAM is full.

The RAM usage depends on several nnDetection parameters: the patch size used for training (this is independent of the data size itself, since only the needed patch is read), the number of workers/processes (each worker reads a single batch, so increasing the number of workers speeds up augmentation but also increases RAM usage), num_cached_per_thread (found in the train config; it defines how many augmented batches are kept in a queue, so decreasing it keeps fewer batches in memory) and the batch size (since each worker reads a full batch, increasing the batch size means more patches are loaded simultaneously). Decreasing num_cached_per_thread and finding a balance between the number of workers and the batch size should help to reduce RAM usage.
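As a rough back-of-the-envelope estimate of the patch-related RAM (the numbers below are just example values, not nnDetection defaults):

```python
import numpy as np

# Example values only; plug in your own training configuration.
num_workers = 6                # each worker holds one batch while loading/augmenting
batch_size = 4
num_cached_per_thread = 2      # augmented batches kept in the queue per worker
patch_shape = (96, 192, 160)   # training patch size; the extracted patch is a bit larger
channels = 1
bytes_per_voxel = 4            # float32

patch_bytes = channels * int(np.prod(patch_shape)) * bytes_per_voxel
batch_bytes = batch_size * patch_bytes
total_gb = num_workers * (1 + num_cached_per_thread) * batch_bytes / 1e9
print(f"roughly {total_gb:.1f} GB held in patches and queues "
      f"(not counting temporary copies made during augmentation)")
```

This ignores the segmentation and any intermediate arrays created by the transforms, so the real footprint will be higher.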

Please don't normalize the data file (float32) to uint8! This will certainly destroy your data: the data is normalized to zero mean and unit variance, so converting it to an integer type is not an option. I tried some experiments with float16 for the data and uint8 for the seg but didn't test this setting on the full cohort, so I can't guarantee that it won't decrease results. Furthermore, I'm not sure whether this actually reduces RAM consumption, since the data is cast to float32 after loading (I think scikit might even cast to float64 internally for the resampling in the SpatialTransform); it is a good way to reduce IO though.

Best,
Michael

@mibaumgartner
Collaborator

Regarding the big and small patch size: The extracted patch size is bigger than the training patch size to avoid border artefacts during the augmentation step. The spatial transform will automatically crop the patch to the final (i.e. the size used to train) patch size.
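Conceptually it works roughly like this (just a sketch, not the actual batchgenerators/nnDetection code; the sizes are made up):

```python
import numpy as np

def center_crop(patch: np.ndarray, target_shape):
    """Crop the spatial dims of a (C, Z, Y, X) patch down to target_shape."""
    slices = [slice(None)]
    for cur, tgt in zip(patch.shape[1:], target_shape):
        start = (cur - tgt) // 2
        slices.append(slice(start, start + tgt))
    return patch[tuple(slices)]

# Larger patch extracted from disk: gives the spatial augmentation a margin at the borders.
extracted = np.zeros((1, 128, 224, 192), dtype=np.float32)
# ... spatial augmentation (rotation/scaling/deformation) runs on the larger patch ...
final = center_crop(extracted, (96, 192, 160))  # the size actually used for training
print(final.shape)  # (1, 96, 192, 160)
```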

@xuelicheng1992
Author

Hi @mibaumgartner ,
Thank you for the reply! I will try it!
Best,
Xuelicheng
