Activelearning strategy "Random" very slow with large datastores #1236

Closed · jlvahldiek opened this issue Dec 29, 2022 · 7 comments
Labels
0.7.0 Targeted to Release version 0.7

Comments

@jlvahldiek (Contributor) commented Dec 29, 2022

In various projects I have noticed that the speed at which the "Random" strategy determines the next sample depends strongly on the total number of studies in the datastore; e.g. it takes up to 10 s with 2,000 unlabeled studies in the store. I consider this a major drawback for the user experience.

The bottleneck originates from the following for-loop, which fetches further information on every unlabeled image. This information is then used to retrieve each image's last selection timestamp in order to generate an image-specific weight:

for image in images:
    # one datastore lookup per unlabeled image
    images_info.append(datastore.get_image_info(image).get("strategy", {}).get(strategy, {}))

current_ts = int(time.time())
weights = [current_ts - info.get("ts", 0) for info in images_info]

These weights are then used to draw one random image out of all unlabeled images:

image = random.choices(images, weights=weights)[0]

Now I wonder whether we need this time-intensive weighting in a random draw at all. It is probably useful for small datastores to avoid repeated image selections, but in larger datastores it should hardly matter. What do you think about a PR that deactivates the weighting once a user-specified number of unlabeled images is available (e.g. > 50 images), roughly as sketched below? Or is there a more time-efficient way to determine the weights?
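A minimal sketch of such an opt-out, assuming a hard cut-off (the constant name and the value of 50 are made up for illustration and are not part of the current API):

import random
import time

# Hypothetical cut-off: above this many unlabeled images, skip the weighting.
SKIP_WEIGHTING_ABOVE = 50


def pick_next_image(images, datastore, strategy="random"):
    if len(images) > SKIP_WEIGHTING_ABOVE:
        # Large pool: an unweighted draw is cheap, and repeated selections
        # of the same image are unlikely anyway.
        return random.choice(images)

    # Small pool: keep the existing timestamp weighting so recently served
    # images (small weight) are drawn less often than stale ones.
    current_ts = int(time.time())
    weights = []
    for image in images:
        info = datastore.get_image_info(image).get("strategy", {}).get(strategy, {})
        weights.append(current_ts - info.get("ts", 0))
    return random.choices(images, weights=weights)[0]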

@jlvahldiek (Contributor, Author)

The weighting is probably also relevant when several users work on the same datastore in parallel...

@SachidanandAlle (Collaborator)

We can have a flag for this computation and keep it disabled/enabled
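For instance (purely a sketch; neither the environment variable nor the helper below exists in MONAI Label today), the flag could be read from the environment and gate the per-image info lookup entirely:

import os
import random

# Hypothetical opt-out switch; the variable name is made up.
USE_TS_WEIGHTS = os.environ.get("MONAI_LABEL_RANDOM_TS_WEIGHTS", "true").lower() == "true"


def pick_next_image(images, weighted_pick):
    # weighted_pick: callable that runs the existing timestamp-weighted selection
    if not USE_TS_WEIGHTS:
        return random.choice(images)  # no get_image_info() calls at all
    return weighted_pick(images)

Unlike an automatic threshold, this leaves the trade-off between repetition avoidance and selection speed to the user.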

@jlvahldiek (Contributor, Author)

I dug a little deeper. The time delay is due to the repeated deepcopy() calls in LocalDatastore's get_image_info().

def get_image_info(self, image_id: str) -> Dict[str, Any]:
    """
    Get the image information for the given image id
    :param image_id: the desired image id
    :return: image info as a list of dictionaries Dict[str, Any]
    """
    obj = self._datastore.objects.get(image_id)
    info = copy.deepcopy(obj.image.info) if obj else {}

As far as I can tell from searching the code, the image info is only modified in one place, here:
name = self._filename(image_id, obj.image.ext)
path = os.path.realpath(os.path.join(self._datastore.image_path(), name))
info["path"] = path

However, info["path"] is not consumed anywhere. Therefore I wonder whether the time-consuming deepcopy() is needed at all; a possible alternative is sketched below.
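If the deepcopy is only there so that the "path" insertion (or any other caller-side change to the top-level dict) does not leak back into the stored object, a shallow copy might already be enough. A sketch only, not verified against the whole code base; note that nested dicts such as the per-strategy info would still be shared:

def get_image_info(self, image_id: str) -> Dict[str, Any]:
    obj = self._datastore.objects.get(image_id)
    # Shallow copy: adding top-level keys such as "path" no longer mutates the
    # stored object, but nested values are still shared with the datastore.
    info = dict(obj.image.info) if obj else {}

    if obj:
        name = self._filename(image_id, obj.image.ext)
        path = os.path.realpath(os.path.join(self._datastore.image_path(), name))
        info["path"] = path
    return info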

Fast execution of get_image_info() is crucial because it is not only used by the Random strategy but appears in many places in the code base (e.g. in the Epistemic strategy).

@jlvahldiek (Contributor, Author)

> We can have a flag for this computation and keep it disabled/enabled

I agree. This would be a simple workaround and would work for the Random strategy. A more sustainable fix might address the deepcopy() described above.

Any thoughts or suggestions? Thank you very much.

@SachidanandAlle (Collaborator)

This is very surprising. Each individual image info object is tiny, hardly a couple of keys in the dict. If we were doing a deepcopy of the entire datastore.json, that could plausibly create some delay.

Are you sure it works best when you simply remove the copy, or change the line to info = obj.image.info if obj else {}? Something is fishy; maybe the problem is somewhere else. For example, resolving realpath over a network-shared/mounted filesystem can cause additional delay here:

        if obj:
            name = self._filename(image_id, obj.image.ext)
            path = os.path.realpath(os.path.join(self._datastore.image_path(), name))
            info["path"] = path

@SachidanandAlle (Collaborator)

Any updates?

@SachidanandAlle SachidanandAlle added the 0.7.0 Targeted to Release version 0.7 label Jan 23, 2023
@SachidanandAlle SachidanandAlle added this to To do in MONAILabel-v0.7.0 via automation Jan 23, 2023
@jlvahldiek (Contributor, Author)

Thank you for pointing me in the right direction. I would now attribute the delay described above to the slow network share that was in use. I just tested again on local data and saw no significant delays.

MONAILabel-v0.7.0 automation moved this from To do to Done Jan 23, 2023