Active learning strategy "Random" very slow with large datastores #1236
Comments
The use of weights is probably also relevant when several users work on the same datastore in parallel.
We could add a flag for this computation and keep it disabled/enabled.
I dug a little deeper. The time delay is due to repeated executions of MONAILabel/monailabel/datastore/local.py, lines 280 to 288 in cb8421c.
As far as I understood and searched the code, the image_info is only modified in one place, here: MONAILabel/monailabel/datastore/local.py, lines 290 to 292 in cb8421c.
However, info["path"] is not consumed anywhere, so I wonder whether the time-consuming deepcopy() is needed at all.
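To make the cost concrete, here is a minimal, self-contained sketch (the dict contents are invented for illustration and are not MONAILabel's actual image_info schema) comparing a deepcopy of many small per-image info dicts with a shallow copy. Even tiny dicts add up when copied once per image on every selection:

```python
import copy
import time

# Hypothetical per-image info dicts, roughly the size described in the thread.
infos = [{"ts": i, "name": f"image_{i}.nii.gz", "tags": ["unlabeled"]} for i in range(2000)]

start = time.perf_counter()
deep = [copy.deepcopy(info) for info in infos]  # recursive copy, one per image
deep_time = time.perf_counter() - start

start = time.perf_counter()
shallow = [dict(info) for info in infos]  # shallow copy; nested objects stay shared
shallow_time = time.perf_counter() - start

print(f"deepcopy: {deep_time:.4f}s, shallow: {shallow_time:.4f}s")
```

If the stored info is flat (no nested mutable state the caller may mutate), a shallow copy preserves the safety the deepcopy was presumably there for, at a fraction of the cost.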
The fast execution of
I agree. This would be a simple workaround. Any thoughts or suggestions? Thank you very much.
This is very surprising: each individual image info object is tiny, hardly a couple of keys in the dict. If we were doing a deepcopy of the entire datastore.json, that could plausibly have created some delay. Are you sure it works best when you simply remove the deepcopy, or change the lines to:

```python
if obj:
    name = self._filename(image_id, obj.image.ext)
    path = os.path.realpath(os.path.join(self._datastore.image_path(), name))
    info["path"] = path
```
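As a standalone illustration of the suggested change, here is a hedged sketch (function and parameter names are assumptions for this example, not MONAILabel's API) that builds the returned info with a shallow copy and attaches the resolved path, avoiding a deepcopy of the stored object:

```python
import os

def image_info_with_path(image_id, base_info, image_dir, ext=".nii.gz"):
    """Illustrative only: return a copy of the stored info with the
    resolved on-disk path attached, without deep-copying anything."""
    info = dict(base_info)  # shallow copy is enough for a flat info dict
    name = f"{image_id}{ext}"
    info["path"] = os.path.realpath(os.path.join(image_dir, name))
    return info
```

The key property is that the stored dict is never mutated: the "path" key only appears on the copy handed to the caller.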
Any updates?
Thank you for pointing me in the right direction. Today I would attribute the delay described above to the slow network share that was actually in use. I just tested again on local data and saw no significant delays.
In various projects I have noticed that the speed at which the "Random" strategy determines the next sample depends strongly on the total number of studies in the datastore; e.g., it takes up to 10 s with 2,000 unlabeled studies in the store. I consider this a major drawback for the user experience.
The time bottleneck originates from the following for-loop, which fetches further information on every unlabeled image. This information is then used to retrieve the image's last timestamp in order to generate an image-specific weight:
MONAILabel/monailabel/tasks/activelearning/random.py
Lines 39 to 43 in cb8421c
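The pattern described above can be sketched as follows. This is a rough, hypothetical reconstruction (the weight formula and info keys are assumptions, not MONAILabel's actual code); the point is that there is one datastore lookup per unlabeled image, so the cost grows linearly with the pool size:

```python
import random
import time

def pick_random_weighted(unlabeled_ids, get_image_info):
    """Sketch of the described strategy: one info lookup per unlabeled
    image, weighting images that were served longer ago more heavily."""
    now = int(time.time())
    weights = []
    for image_id in unlabeled_ids:       # one datastore hit per image -> O(n) I/O
        info = get_image_info(image_id)  # the expensive call inside the loop
        ts = info.get("ts", 0)           # last time this image was served
        weights.append(now - ts + 1)     # older (or never served) -> higher weight
    return random.choices(unlabeled_ids, weights=weights, k=1)[0]
```

With a slow backend (e.g. a network share), each `get_image_info` call carries I/O latency, which multiplied by 2,000 images readily explains multi-second delays.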
These weights are then used to draw one random image from all unlabeled images:
MONAILabel/monailabel/tasks/activelearning/random.py
Line 45 in cb8421c
Now I wonder whether we need this time-intensive weighting in a random draw at all. It is probably valid for small datastores, to avoid repeatedly selecting the same image, but in larger datastores it should not play a role. What do you think about a PR that deactivates weighting when more than a user-specified number of unlabeled images is available (e.g. > 50 images)? Or is there a more time-efficient way to determine the weights?
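The proposed workaround could look roughly like this (a sketch only; the threshold parameter and helper names are hypothetical, not an actual MONAILabel interface): skip the per-image weighting once the pool is large enough that repeated selections are unlikely anyway, and keep the weighted draw only for small pools.

```python
import random
import time

def next_sample(unlabeled_ids, get_image_info, weighted_threshold=50):
    """Sketch of the proposed flag: uniform draw for large pools,
    timestamp-weighted draw for small ones."""
    if len(unlabeled_ids) > weighted_threshold:
        # Large pool: uniform draw, no per-image info lookups at all.
        return random.choice(unlabeled_ids)
    # Small pool: keep timestamp-based weights to avoid repeated picks.
    now = int(time.time())
    weights = [now - get_image_info(i).get("ts", 0) + 1 for i in unlabeled_ids]
    return random.choices(unlabeled_ids, weights=weights, k=1)[0]
```

This turns the common large-datastore case into an O(1)-lookup operation while preserving the existing anti-repetition behavior where it matters.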