Conversation


@eladven eladven commented Mar 4, 2025

  • Caching can be disabled via the `use_cache` attribute of the inference engine class (default: `True`).
  • Caching is performed in batches, with batch size controlled by the `cache_batch_size` attribute (default: `100`).
  • The disk cache location is configurable via the `inference_engine_cache_path` setting (default: `./inference_engine_cache/<inference engine class>`).
  • Although the cache is updated after each batch, it operates at the granularity of a single instance.
  • The cache key includes the instance messages and all current inference engine attributes (see the sketch after this list).
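
These bullets describe configuration rather than code, so here is a minimal, self-contained sketch of how such a cache could behave. Only `use_cache`, `cache_batch_size`, and the default cache path follow the description above; the class name, the `infer`/`_call_model` methods, and the file-per-key layout are illustrative assumptions, not the actual implementation in this PR.

```python
import hashlib
import json
import os
from typing import Any, Dict, List


class CachingEngineSketch:
    """Illustrative engine with a per-instance disk cache, processed in batches."""

    def __init__(
        self,
        use_cache: bool = True,
        cache_batch_size: int = 100,
        cache_path: str = "./inference_engine_cache/CachingEngineSketch",
    ):
        self.use_cache = use_cache
        self.cache_batch_size = cache_batch_size
        self.cache_path = cache_path
        os.makedirs(cache_path, exist_ok=True)

    def _cache_key(self, instance: Dict[str, Any]) -> str:
        # The key combines the instance messages with (a subset of) the current
        # engine attributes, so changing either produces a fresh cache entry.
        record = {
            "messages": instance["messages"],
            "engine": {"cache_batch_size": self.cache_batch_size},
        }
        return hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()

    def _call_model(self, instance: Dict[str, Any]) -> str:
        return "dummy prediction"  # placeholder for the real inference call

    def infer(self, instances: List[Dict[str, Any]]) -> List[str]:
        predictions: Dict[int, str] = {}
        for start in range(0, len(instances), self.cache_batch_size):
            batch = list(enumerate(instances))[start : start + self.cache_batch_size]
            misses = []
            # Per-instance cache lookup within the batch.
            for idx, instance in batch:
                path = os.path.join(self.cache_path, self._cache_key(instance) + ".json")
                if self.use_cache and os.path.exists(path):
                    with open(path) as f:
                        predictions[idx] = json.load(f)
                else:
                    misses.append((idx, instance, path))
            # Only the misses of this batch go to the model; the cache is
            # updated once the batch is done.
            for idx, instance, path in misses:
                predictions[idx] = self._call_model(instance)
                if self.use_cache:
                    with open(path, "w") as f:
                        json.dump(predictions[idx], f)
        return [predictions[i] for i in range(len(instances))]
```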

@eladven eladven requested review from elronbandel and yoavkatz March 4, 2025 08:53
@eladven eladven force-pushed the inference_engine_cache branch 3 times, most recently from b84cd95 to 6b3e758 Compare March 4, 2025 09:10
"""Verifies instances of a dataset and perform inference on the input dataset.
def _get_cache_key(self, instance: Dict[str, Any]) -> str:
"""Generate a unique cache key for each input."""
record = {"messages": self.to_messages(instance)}

Are you sure this works in all cases, both in the chat API and non-chat_api modes?


@yoavkatz yoavkatz left a comment


We need to add tests for this functionality.


@elronbandel elronbandel left a comment


I have to say I am a bit skeptical of this solution, for one main reason: it interferes with the inference engine's ability to fully control its execution flow. We have cases where inference engines have distribution abilities (for example, LiteLLM can distribute calls across different API keys), or internal engines that can distribute the job over different nodes. In all of those cases one could say "just increase the cache batch size", but at a certain size it loses its point as a mechanism meant to save time, whether for small workloads or large ones.

I would much prefer a situation where we implement a general cache, but every engine populates it based on its own logic. API-calling engines such as LiteLLM (which cover 90% of the engine uses) would do it at the instance level, and local engines would do it at the batch level.
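
A rough sketch of this alternative, where the cache is a shared component and each engine populates it at its own granularity. All class names and the prompt-as-key scheme below are illustrative assumptions, not existing unitxt APIs.

```python
from typing import Dict, List, Optional


class GeneralCache:
    """Shared key/value store; engines decide when and how to use it."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self._store.get(key)

    def set(self, key: str, value: str) -> None:
        self._store[key] = value


class ApiCallingEngineSketch:
    """LiteLLM-style engine: checks and fills the cache per instance."""

    def __init__(self, cache: GeneralCache) -> None:
        self.cache = cache

    def infer(self, prompts: List[str]) -> List[str]:
        results = []
        for prompt in prompts:
            cached = self.cache.get(prompt)
            if cached is None:
                cached = f"api completion for: {prompt}"  # placeholder API call
                self.cache.set(prompt, cached)
            results.append(cached)
        return results


class LocalEngineSketch:
    """Local engine: runs all uncached inputs as one batch, then fills the cache."""

    def __init__(self, cache: GeneralCache) -> None:
        self.cache = cache

    def infer(self, prompts: List[str]) -> List[str]:
        missing = [p for p in prompts if self.cache.get(p) is None]
        for p in missing:  # placeholder for a single batched forward pass
            self.cache.set(p, f"local completion for: {p}")
        return [self.cache.get(p) for p in prompts]
```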

@yoavkatz

> I have to say I am a bit skeptical of this solution, for one main reason: it interferes with the inference engine's ability to fully control its execution flow. We have cases where inference engines have distribution abilities (for example, LiteLLM can distribute calls across different API keys), or internal engines that can distribute the job over different nodes. In all of those cases one could say "just increase the cache batch size", but at a certain size it loses its point as a mechanism meant to save time, whether for small workloads or large ones.
>
> I would much prefer a situation where we implement a general cache, but every engine populates it based on its own logic. API-calling engines such as LiteLLM (which cover 90% of the engine uses) would do it at the instance level, and local engines would do it at the batch level.

I partially understand your concern, but I don't think it is that common, and it can easily be addressed. The current approach groups the instances into batches of 100. For each batch it checks which instances are already in the cache, and then calls the inference engine on all the instances that were not found in the cache (as one batch, allowing the original inference engine to distribute the calls). Let's take a worst case: a large dataset with 10,000 instances where 99% are in the cache and the misses are randomly distributed across the batches. Then, instead of taking all 100 missing instances in one batch and running them in parallel, it may run them one by one, each in a separate batch. Assuming we can run 10 in parallel, we effectively go from 1,000 calls (10 instances each) down to 100 calls (1 instance each); the optimum is 10 calls (10 instances each).

If we increase the batch size to 1,000, it will be on par, because each batch will then contain about 10 missing instances. Beyond that, the assumption that the missing instances are randomly distributed is not very realistic anyway.
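
A quick back-of-the-envelope check of these numbers (the 10-way parallelism, 99% hit rate, and random distribution of misses are the assumptions stated above):

```python
total_instances = 10_000
parallelism = 10                      # requests that can run concurrently
misses = int(total_instances * 0.01)  # 100 uncached instances

calls_without_cache = total_instances // parallelism  # 1000 calls, 10 instances each
calls_batch_100_worst = misses                        # ~100 calls, 1 instance each
calls_optimal = misses // parallelism                 # 10 calls, 10 instances each

# With cache_batch_size = 1000 there are 10 batches, each containing ~10 misses,
# so one model call per batch is already on par with the optimum.
calls_batch_1000 = total_instances // 1_000           # 10 calls

print(calls_without_cache, calls_batch_100_worst, calls_optimal, calls_batch_1000)
# -> 1000 100 10 10
```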

@elronbandel

I'll break my argument down:

  1. Inference engines are solely in charge of the inference execution flow, as they know their constraints, framework, and infrastructure best.
  2. The desired execution model is that the dataset (or stream of inference inputs) is available to the inference engine, which in turn can pull as many instances as it needs, whenever it needs them.
  3. No external constraints should be put on the inference engine's execution flow, because (1) they force the inference engine to be suboptimal and (2) they introduce complex hidden logic that is hard for users to understand and debug.

Two very realistic examples where this approach is a problem:
(1) Running vLLM on a machine with many GPUs: if we have 8 GPUs, each processing 32 instances, we are already above the 100 limit, which makes it suboptimal and reduces the engine's efficiency by over 50%. We could increase the limit to 1000, but then if we fail within those 1000 the entire batch is gone. And what if we then need a batch of 5000? Will the user know to set it up every time? Or maybe we should set it automatically to the optimal number for each inference engine, but then we may as well just give each inference engine its own logic. This can get very complicated compared to the alternative (which is caching per instance).
(2) Batch execution APIs: most APIs today support batch execution, where you submit all the inputs, receive futures, and then wait for everything to be ready before moving on to the next stage. In this mode this caching approach will not work at all.
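
For example (2), this is roughly the submit-everything-then-wait pattern being described, sketched with Python's concurrent.futures as a stand-in for a provider's batch API (the `call_model` function is a placeholder, not a real client):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def call_model(prompt: str) -> str:
    return f"completion for: {prompt}"  # placeholder for the real batch/API call


prompts = [f"prompt {i}" for i in range(32)]

# All inputs are submitted up front and we wait for every future before the
# next stage; a fixed-size caching batch inserted in the middle breaks this flow.
with ThreadPoolExecutor(max_workers=8) as pool:
    futures = {pool.submit(call_model, p): p for p in prompts}
    results = {futures[f]: f.result() for f in as_completed(futures)}
```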

I think there should be a simpler solution here.

Signed-off-by: Elad Venezian <eladv@il.ibm.com>
@eladven eladven force-pushed the inference_engine_cache branch from 6b3e758 to f9e6c91 Compare March 11, 2025 11:52
Signed-off-by: Elad Venezian <eladv@il.ibm.com>
@eladven eladven force-pushed the inference_engine_cache branch from f9e6c91 to e9dd23b Compare March 11, 2025 11:53

@elronbandel elronbandel left a comment


Do all the parameters of the inference engine affect the cache key/address? If I change the seed of the engine or the temperature, will it generate new data?

Also, one last attempt to get to a better place: can we change the `_infer` API so that it receives a stream and returns a stream? That would make the batching mechanism unnecessary, no?

@elronbandel elronbandel merged commit 3d68379 into main Mar 13, 2025
14 of 17 checks passed
@elronbandel elronbandel deleted the inference_engine_cache branch March 13, 2025 09:02