Conversation


@eladven eladven commented Mar 4, 2025

  • Caching can be disabled via the `use_cache` attribute of the inference engine class (default: `True`).
  • Caching is performed in batches, with batch size controlled by the `cache_batch_size` attribute (default: `100`).
  • The disk cache location is configurable via the `inference_engine_cache_path` setting (default: `./inference_engine_cache/<inference engine class>`).
  • Although the cache is updated after each batch, it operates at the granularity of a single instance.
  • The cache key includes the instance messages and all current inference engine attributes (see the sketch after this list).
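
These bullets describe configuration rather than code, so here is a minimal, self-contained sketch of how such a cache could behave. Only `use_cache`, `cache_batch_size`, and the default cache path follow the description above; the class name, the `infer`/`_call_model` methods, and the file-per-key layout are illustrative assumptions, not the actual implementation in this PR.

```python
import hashlib
import json
import os
from typing import Any, Dict, List


class CachingEngineSketch:
    """Illustrative engine with a per-instance disk cache, processed in batches."""

    def __init__(
        self,
        use_cache: bool = True,
        cache_batch_size: int = 100,
        cache_path: str = "./inference_engine_cache/CachingEngineSketch",
    ):
        self.use_cache = use_cache
        self.cache_batch_size = cache_batch_size
        self.cache_path = cache_path
        os.makedirs(cache_path, exist_ok=True)

    def _cache_key(self, instance: Dict[str, Any]) -> str:
        # The key combines the instance messages with (a subset of) the current
        # engine attributes, so changing either produces a fresh cache entry.
        record = {
            "messages": instance["messages"],
            "engine": {"cache_batch_size": self.cache_batch_size},
        }
        return hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()

    def _call_model(self, instance: Dict[str, Any]) -> str:
        return "dummy prediction"  # placeholder for the real inference call

    def infer(self, instances: List[Dict[str, Any]]) -> List[str]:
        predictions: Dict[int, str] = {}
        for start in range(0, len(instances), self.cache_batch_size):
            batch = list(enumerate(instances))[start : start + self.cache_batch_size]
            misses = []
            # Per-instance cache lookup within the batch.
            for idx, instance in batch:
                path = os.path.join(self.cache_path, self._cache_key(instance) + ".json")
                if self.use_cache and os.path.exists(path):
                    with open(path) as f:
                        predictions[idx] = json.load(f)
                else:
                    misses.append((idx, instance, path))
            # Only the misses of this batch go to the model; the cache is
            # updated once the batch is done.
            for idx, instance, path in misses:
                predictions[idx] = self._call_model(instance)
                if self.use_cache:
                    with open(path, "w") as f:
                        json.dump(predictions[idx], f)
        return [predictions[i] for i in range(len(instances))]
```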

@eladven eladven requested review from elronbandel and yoavkatz March 4, 2025 08:53
@eladven eladven force-pushed the inference_engine_cache branch 3 times, most recently from b84cd95 to 6b3e758 Compare March 4, 2025 09:10
"""Verifies instances of a dataset and perform inference on the input dataset.
def _get_cache_key(self, instance: Dict[str, Any]) -> str:
"""Generate a unique cache key for each input."""
record = {"messages": self.to_messages(instance)}

Are you sure this works in all cases, both in the chat API and non-chat_api modes?


@yoavkatz yoavkatz left a comment


We need to add tests for this functionality.


@elronbandel elronbandel left a comment


I have to say I am a bit skeptical of this solution, for one main reason: it interferes with the inference engine's ability to fully control its execution flow. We have cases where inference engines have distribution abilities (for example, LiteLLM can distribute calls across different API keys), or internal engines that can distribute the job over different nodes. In all of those cases one could say "just increase the cache batch size", but at a certain size it loses its point as a mechanism meant to save time, whether for small workloads or large ones.

I would much prefer a situation where we implement a general cache, but every engine populates it based on its own logic. API-calling engines such as LiteLLM (which cover 90% of the engine uses) would do it at the instance level, and local engines would do it at the batch level.
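
A rough sketch of this alternative, where the cache is a shared component and each engine populates it at its own granularity. All class names and the prompt-as-key scheme below are illustrative assumptions, not existing unitxt APIs.

```python
from typing import Dict, List, Optional


class GeneralCache:
    """Shared key/value store; engines decide when and how to use it."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self._store.get(key)

    def set(self, key: str, value: str) -> None:
        self._store[key] = value


class ApiCallingEngineSketch:
    """LiteLLM-style engine: checks and fills the cache per instance."""

    def __init__(self, cache: GeneralCache) -> None:
        self.cache = cache

    def infer(self, prompts: List[str]) -> List[str]:
        results = []
        for prompt in prompts:
            cached = self.cache.get(prompt)
            if cached is None:
                cached = f"api completion for: {prompt}"  # placeholder API call
                self.cache.set(prompt, cached)
            results.append(cached)
        return results


class LocalEngineSketch:
    """Local engine: runs all uncached inputs as one batch, then fills the cache."""

    def __init__(self, cache: GeneralCache) -> None:
        self.cache = cache

    def infer(self, prompts: List[str]) -> List[str]:
        missing = [p for p in prompts if self.cache.get(p) is None]
        for p in missing:  # placeholder for a single batched forward pass
            self.cache.set(p, f"local completion for: {p}")
        return [self.cache.get(p) for p in prompts]
```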

@yoavkatz

> I have to say I am a bit skeptical of this solution, for one main reason: it interferes with the inference engine's ability to fully control its execution flow. We have cases where inference engines have distribution abilities (for example, LiteLLM can distribute calls across different API keys), or internal engines that can distribute the job over different nodes. In all of those cases one could say "just increase the cache batch size", but at a certain size it loses its point as a mechanism meant to save time, whether for small workloads or large ones.
>
> I would much prefer a situation where we implement a general cache, but every engine populates it based on its own logic. API-calling engines such as LiteLLM (which cover 90% of the engine uses) would do it at the instance level, and local engines would do it at the batch level.

I partially understand your concern, but I don't think it is that common, and it can easily be addressed. The current approach groups the instances into batches of 100. For each batch it checks which instances are already in the cache, and then calls the inference engine on all the instances that were not found in the cache (as one batch, allowing the original inference engine to distribute the calls). Let's take a worst case: a large dataset with 10,000 instances where 99% are in the cache and the misses are randomly distributed across the batches. Then, instead of taking all 100 missing instances in one batch and running them in parallel, it may run them one by one, each in a separate batch. Assuming we can run 10 in parallel, we effectively go from 1,000 calls (10 instances each) down to 100 calls (1 instance each); the optimum is 10 calls (10 instances each).

If we increase the batch size to 1,000, it will be on par, because each batch will then contain about 10 missing instances. Beyond that, the assumption that the missing instances are randomly distributed is not very realistic anyway.
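
A quick back-of-the-envelope check of these numbers (the 10-way parallelism, 99% hit rate, and random distribution of misses are the assumptions stated above):

```python
total_instances = 10_000
parallelism = 10                      # requests that can run concurrently
misses = int(total_instances * 0.01)  # 100 uncached instances

calls_without_cache = total_instances // parallelism  # 1000 calls, 10 instances each
calls_batch_100_worst = misses                        # ~100 calls, 1 instance each
calls_optimal = misses // parallelism                 # 10 calls, 10 instances each

# With cache_batch_size = 1000 there are 10 batches, each containing ~10 misses,
# so one model call per batch is already on par with the optimum.
calls_batch_1000 = total_instances // 1_000           # 10 calls

print(calls_without_cache, calls_batch_100_worst, calls_optimal, calls_batch_1000)
# -> 1000 100 10 10
```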

@elronbandel

I'll break my argument down:

  1. Inference engines are solely in charge of the inference execution flow, as they know their constraints, framework, and infrastructure best.
  2. The desired execution model is that the dataset (or stream of inference inputs) is available to the inference engine, which in turn can pull as many instances as it needs, whenever it needs them.
  3. No external constraints should be put on the inference engine's execution flow, because (1) they force the inference engine to be suboptimal and (2) they introduce complex hidden logic that is hard for users to understand and debug.

Two very realistic examples where this approach is a problem:
(1) Running vLLM on a machine with many GPUs: if we have 8 GPUs, each processing 32 instances, we are already above the 100 limit, which makes it suboptimal and reduces the engine's efficiency by over 50%. We could increase the limit to 1000, but then if we fail within those 1000 the entire batch is gone. And what if we then need a batch of 5000? Will the user know to set it up every time? Or maybe we should set it automatically to the optimal number for each inference engine, but then we may as well just give each inference engine its own logic. This can get very complicated compared to the alternative (which is caching per instance).
(2) Batch execution APIs: most APIs today support batch execution, where you submit all the inputs, receive futures, and then wait for everything to be ready before moving on to the next stage. In this mode this caching approach will not work at all.
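
For example (2), this is roughly the submit-everything-then-wait pattern being described, sketched with Python's concurrent.futures as a stand-in for a provider's batch API (the `call_model` function is a placeholder, not a real client):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def call_model(prompt: str) -> str:
    return f"completion for: {prompt}"  # placeholder for the real batch/API call


prompts = [f"prompt {i}" for i in range(32)]

# All inputs are submitted up front and we wait for every future before the
# next stage; a fixed-size caching batch inserted in the middle breaks this flow.
with ThreadPoolExecutor(max_workers=8) as pool:
    futures = {pool.submit(call_model, p): p for p in prompts}
    results = {futures[f]: f.result() for f in as_completed(futures)}
```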

I think there should be a simpler solution here.

Signed-off-by: Elad Venezian <eladv@il.ibm.com>
@eladven eladven force-pushed the inference_engine_cache branch from 6b3e758 to f9e6c91 Compare March 11, 2025 11:52
Signed-off-by: Elad Venezian <eladv@il.ibm.com>
@eladven eladven force-pushed the inference_engine_cache branch from f9e6c91 to e9dd23b Compare March 11, 2025 11:53

@elronbandel elronbandel left a comment


Do all the parameters of the inference engine affect the cache key/address? If I change the seed of the engine or the temperature, will it generate new data?

Also, one last attempt to get to a better place: can we change the `_infer` API so that it receives a stream and returns a stream? That would make the batching mechanism unnecessary, no?

@elronbandel elronbandel merged commit 3d68379 into main Mar 13, 2025
14 of 17 checks passed
@elronbandel elronbandel deleted the inference_engine_cache branch March 13, 2025 09:02