Slow performance due copying of instances #987

yoavkatz · 2024-07-03T08:32:39Z

Current dict_utils performs overwrite of dictionaries:

if query.strip() == "":
       # change the whole input dic, as dic indeed matches ""
       if isinstance(dic, dict):
           if value is None or not isinstance(value, dict):
               raise ValueError(
                   f"Through an empty query, trying to set a whole new value, {value}, to the whole of dic, {dic}, but value is not a dict"
               )
           dic.clear()
           dic.update(value)
           return

       if isinstance(dic, list):
           if value is None or not isinstance(value, list):
               raise ValueError(
                   f"Through an empty query, trying to set a whole new value, {value}, to the whole of dic, {dic}, but value is not a list"
               )
           dic.clear()
           dic.extend(value)
           return

This caused problems because operators made changes to data of their source and it cause conflict with other operators. To workaround this, we now copy instances :

class InstanceFieldOperator(InstanceOperator):

def process(
       self, instance: Dict[str, Any], stream_name: Optional[str] = None
   ) -> Dict[str, Any]:
       # Need to deep copy instance, because when assigning two dictionary fields,
       # dict_set() the target field dictionary fields.
       # This means that if this target field was assigned to another field before,
       # the field is updated as well.
       instance = deepcopy(instance)

However this cause performance issue with large datasets (6x runtime increase) We should change the code such that new instances are returned with shallow copy - and only relevant atomic are copied.

The text was updated successfully, but these errors were encountered:

elronbandel · 2024-07-03T10:10:08Z

To me operators should modify dictionaries in place, the reason is that that's how most people work with dictionaries. In cases where this creates confusion we should avoid it. Also I think we should have a separate issue for creating performance benchmark to unitxt we run on different pipelines, so we can keep track over performance across PRs.

yoavkatz · 2024-07-03T10:36:04Z

The problem is when operators modify nested dictionary

instance:
predictions : { "a" : 3, "b" :4}
references : [{ "a" : 5, "b" :6}]

Then we have an operator Copy(field="predictions", to_field="orig_predictions"}

instance:
predictions : { "a" : 3, "b" :4}
orig_predictions = { "a" : 3, "b" :4} (same object as above)
references : [{ "a" : 5, "b" :6}]

Now we do Copy(field = "references/0/a" to, "predictions/a")

instance:
predictions : { "a" : 5, "b" :4}
orig_predictions = { "a" : 5, "b" :4} (same object as above < - notice how it was corrupted)
references : [{ "a" : 5, "b" :6}]

How do you suggest we handle it?

dafnapension · 2024-07-03T18:54:37Z

So, until that metric that modifies predictions is crucified, here is an attempt to reduce the damange:
#991

dafnapension · 2024-08-06T07:18:02Z

Addressed at #1087
An important lesson learned in the process is:
#1087 (comment)

dafnapension · 2024-09-11T06:37:23Z

There is no bug in dict_utils. Dict_utils performs exactly what described in its docstring, and dict_utils has nothing to do with the case described here above.
If Unitxt wants to allow many metric evaluation on one model's output, there should be a careful save and restore of the stream post-model, in between metric evaluations (because some metrics do modify the stream loud and clear). This has nothing to do with dict_utils.
A solution to save and restore the stream in between metric evaluations, was presented in #1119
However, this solution was not deployed.
On the contrary, deepcopies only spread out more and more in Unitxt, aiming at "ease debugging on our users".
So this aim costs performance.

yoavkatz added the bug Something isn't working label Jul 3, 2024

yoavkatz assigned elronbandel and dafnapension Jul 3, 2024

dafnapension mentioned this issue Aug 12, 2024

Rag no deep 2 #1119

Closed

dafnapension closed this as completed Sep 11, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Slow performance due copying of instances #987

Slow performance due copying of instances #987

yoavkatz commented Jul 3, 2024 •

edited

Loading

elronbandel commented Jul 3, 2024

yoavkatz commented Jul 3, 2024

dafnapension commented Jul 3, 2024

dafnapension commented Aug 6, 2024

dafnapension commented Sep 11, 2024 •

edited

Loading

Slow performance due copying of instances #987

Slow performance due copying of instances #987

Comments

yoavkatz commented Jul 3, 2024 • edited Loading

elronbandel commented Jul 3, 2024

yoavkatz commented Jul 3, 2024

dafnapension commented Jul 3, 2024

dafnapension commented Aug 6, 2024

dafnapension commented Sep 11, 2024 • edited Loading

yoavkatz commented Jul 3, 2024 •

edited

Loading

dafnapension commented Sep 11, 2024 •

edited

Loading