Activelearning strategy "Random" very slow with large datastores #1236

Closed · jlvahldiek opened this issue Dec 29, 2022 · 7 comments
Labels
0.7.0 Targeted to Release version 0.7

Comments

@jlvahldiek (Contributor) commented Dec 29, 2022

In various projects I have noticed that the speed at which the "Random" strategy determines the next sample depends strongly on the total number of studies in the datastore; e.g. it takes up to 10 s with 2,000 unlabeled studies in the store. I consider this a major drawback for the user experience.

The bottleneck originates from the following for-loop, which fetches further information on every unlabeled image. This information is then used to retrieve each image's last selection timestamp in order to generate an image-specific weight:

for image in images:
    # one datastore lookup per unlabeled image
    images_info.append(datastore.get_image_info(image).get("strategy", {}).get(strategy, {}))

current_ts = int(time.time())
weights = [current_ts - info.get("ts", 0) for info in images_info]

These weights are then used to draw one random image out of all unlabeled images:

image = random.choices(images, weights=weights)[0]

Now I wonder whether we need this time-intensive weighting in a random draw at all. It is probably useful for small datastores to avoid repeated image selections, but in larger datastores it should hardly matter. What do you think about a PR that deactivates the weighting once a user-specified number of unlabeled images is available (e.g. > 50 images), roughly as sketched below? Or is there a more time-efficient way to determine the weights?
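A minimal sketch of such an opt-out, assuming a hard cut-off (the constant name and the value of 50 are made up for illustration and are not part of the current API):

import random
import time

# Hypothetical cut-off: above this many unlabeled images, skip the weighting.
SKIP_WEIGHTING_ABOVE = 50


def pick_next_image(images, datastore, strategy="random"):
    if len(images) > SKIP_WEIGHTING_ABOVE:
        # Large pool: an unweighted draw is cheap, and repeated selections
        # of the same image are unlikely anyway.
        return random.choice(images)

    # Small pool: keep the existing timestamp weighting so recently served
    # images (small weight) are drawn less often than stale ones.
    current_ts = int(time.time())
    weights = []
    for image in images:
        info = datastore.get_image_info(image).get("strategy", {}).get(strategy, {})
        weights.append(current_ts - info.get("ts", 0))
    return random.choices(images, weights=weights)[0]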

@jlvahldiek (Contributor, Author)

The weighting is probably also relevant when several users work on the same datastore in parallel...

@SachidanandAlle (Collaborator)

We can have a flag for this computation and keep it disabled/enabled
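For instance (purely a sketch; neither the environment variable nor the helper below exists in MONAI Label today), the flag could be read from the environment and gate the per-image info lookup entirely:

import os
import random

# Hypothetical opt-out switch; the variable name is made up.
USE_TS_WEIGHTS = os.environ.get("MONAI_LABEL_RANDOM_TS_WEIGHTS", "true").lower() == "true"


def pick_next_image(images, weighted_pick):
    # weighted_pick: callable that runs the existing timestamp-weighted selection
    if not USE_TS_WEIGHTS:
        return random.choice(images)  # no get_image_info() calls at all
    return weighted_pick(images)

Unlike an automatic threshold, this leaves the trade-off between repetition avoidance and selection speed to the user.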

@jlvahldiek (Contributor, Author)

I dug a little deeper. The time delay is due to the repeated deepcopy() calls in LocalDatastore's get_image_info().

def get_image_info(self, image_id: str) -> Dict[str, Any]:
    """
    Get the image information for the given image id
    :param image_id: the desired image id
    :return: image info as a list of dictionaries Dict[str, Any]
    """
    obj = self._datastore.objects.get(image_id)
    info = copy.deepcopy(obj.image.info) if obj else {}

As far as I can tell from searching the code, the image info is only modified in one place, here:
name = self._filename(image_id, obj.image.ext)
path = os.path.realpath(os.path.join(self._datastore.image_path(), name))
info["path"] = path

However, info["path"] is not consumed anywhere. Therefore I wonder whether the time-consuming deepcopy() is needed at all; a possible alternative is sketched below.
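If the deepcopy is only there so that the "path" insertion (or any other caller-side change to the top-level dict) does not leak back into the stored object, a shallow copy might already be enough. A sketch only, not verified against the whole code base; note that nested dicts such as the per-strategy info would still be shared:

def get_image_info(self, image_id: str) -> Dict[str, Any]:
    obj = self._datastore.objects.get(image_id)
    # Shallow copy: adding top-level keys such as "path" no longer mutates the
    # stored object, but nested values are still shared with the datastore.
    info = dict(obj.image.info) if obj else {}

    if obj:
        name = self._filename(image_id, obj.image.ext)
        path = os.path.realpath(os.path.join(self._datastore.image_path(), name))
        info["path"] = path
    return info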

Fast execution of get_image_info() is crucial because it is not only used by the Random strategy but appears in many places in the code base (e.g. in the Epistemic strategy).

@jlvahldiek (Contributor, Author)

> We can have a flag for this computation and keep it disabled/enabled

I agree. This would be a simple workaround and would work for the Random strategy. A more sustainable fix might address the deepcopy() described above.

Any thoughts or suggestions? Thank you very much.

@SachidanandAlle (Collaborator)

This is very surprising. Each individual image info object is tiny, hardly a couple of keys in the dict. If we were doing a deepcopy of the entire datastore.json, that could plausibly create some delay.

Are you sure it works best when you simply remove the copy, or change the line to info = obj.image.info if obj else {}? Something is fishy; maybe the problem is somewhere else. For example, resolving realpath over a network-shared/mounted filesystem can cause additional delay here:

        if obj:
            name = self._filename(image_id, obj.image.ext)
            path = os.path.realpath(os.path.join(self._datastore.image_path(), name))
            info["path"] = path

@SachidanandAlle (Collaborator)

Any updates?

@SachidanandAlle SachidanandAlle added the 0.7.0 Targeted to Release version 0.7 label Jan 23, 2023
@SachidanandAlle SachidanandAlle added this to To do in MONAILabel-v0.7.0 via automation Jan 23, 2023
@jlvahldiek (Contributor, Author)

Thank you for pointing me in the right direction. I would now attribute the delay described above to the slow network share that was in use. I just tested again on local data and saw no significant delays.

MONAILabel-v0.7.0 automation moved this from To do to Done Jan 23, 2023