Add inference engine caching #1645
Conversation
Force-pushed from b84cd95 to 6b3e758
src/unitxt/inference.py (Outdated)

    """Verifies instances of a dataset and perform inference on the input dataset.

    def _get_cache_key(self, instance: Dict[str, Any]) -> str:
        """Generate a unique cache key for each input."""
        record = {"messages": self.to_messages(instance)}
Are you sure this works in all cases, both with the chat API and without chat_api?
We need to add tests for this functionality.
I have to say I am a bit skeptical of this solution, for one main reason: it messes with the ability of the inference engine to fully control its execution flow. We have cases where inference engines have distribution abilities (for example, LiteLLM can distribute calls across different API keys) or internal engines that can distribute the job over different nodes. In all of those cases you may say "just increase the caching batch size", but at a certain size the mechanism loses its point, whether it is meant to save time for small workloads or large ones.
I much prefer a situation where we implement a general cache, but every engine populates it based on its own logic. API-calling engines such as LiteLLM (which cover 90% of the engine uses) would do it at the instance level, and local engines would do it at the batch level.
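A minimal sketch of that idea, assuming a diskcache-style backend; the class and method names here are illustrative, not the project's actual API.

```python
import diskcache  # one possible disk-backed store; an assumption, not the PR's choice


class GeneralInferenceCache:
    """Illustrative shared cache: each engine decides when and at what
    granularity (per instance or per batch) to read and write it."""

    def __init__(self, path: str = "./inference_engine_cache"):
        self._cache = diskcache.Cache(path)

    def get(self, key: str):
        return self._cache.get(key)  # returns None on a miss

    def set(self, key: str, value) -> None:
        self._cache[key] = value


# An API-calling engine (e.g. a LiteLLM wrapper) would call get/set per instance;
# a local engine could write a whole batch after a single forward pass.
```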
I partially understand your concern, but I don't think it is less common, and it can be easily addressed. The current approach groups instances into batches of 100. It then checks, for each instance in the batch, whether it is in the cache, and calls the inference engine on all instances not found in the cache (as one batch, allowing the original inference model to distribute the calls). Let's take a worst case: a large dataset with 10,000 instances where 99% are in the cache and the misses are randomly distributed across the batches. Then, instead of taking all 100 missing instances in one batch and running them in parallel, it may run them one by one, each in a separate batch. Assuming we can run 10 in parallel, we effectively reduce from 1,000 calls (10 instances each) to 100 calls (1 instance each); the optimum is 10 calls (10 instances each). If we increase the batch size to 1,000, it will be on par, because each batch will then contain about 10 missing instances. Beyond that, the assumption of randomly distributed missing instances is not very realistic.
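A rough sketch of the flow described above; `cache`, `_get_cache_key`, and the `_infer` signature are assumed here for illustration.

```python
from typing import Any, Dict, List


def infer_with_cache(engine, dataset: List[Dict[str, Any]], batch_size: int = 100) -> List[Any]:
    """Illustrative only: serve cache hits directly and send each batch's
    misses to the engine in a single call."""
    results: List[Any] = []
    for start in range(0, len(dataset), batch_size):
        batch = dataset[start:start + batch_size]
        keys = [engine._get_cache_key(inst) for inst in batch]
        outputs = [engine.cache.get(key) for key in keys]

        # Collect the instances that were not found in the cache.
        missing = [inst for inst, out in zip(batch, outputs) if out is None]
        if missing:
            # One call per batch, so the engine can still distribute the work.
            predictions = iter(engine._infer(missing))
            outputs = [out if out is not None else next(predictions) for out in outputs]
            # The cache is updated after each batch, at instance granularity.
            for key, out in zip(keys, outputs):
                engine.cache.set(key, out)

        results.extend(outputs)
    return results
```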
I'll break my argument down: there are two very realistic examples where this approach is a problem. I think there should be a simpler solution here.
- Caching can be disabled via the `use_cache` attribute of the inference engine class (default: `True`).
- Caching is performed in batches, with batch size controlled by the `cache_batch_size` attribute (default: `100`).
- The disk cache location is configurable via the `inference_engine_cache_path` setting (default: `./inference_engine_cache/<inference engine class>`).
- Although the cache is updated after each batch, it operates at the granularity of a single instance.
- The cache key includes the instance message and all current inference engine attributes.

Signed-off-by: Elad Venezian <eladv@il.ibm.com>
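For illustration only, usage might look roughly like this, assuming an engine class exposes the attributes listed above (the engine class and its other arguments are placeholders, not part of this PR):

```python
from unitxt.inference import HFPipelineBasedInferenceEngine  # placeholder choice of engine

engine = HFPipelineBasedInferenceEngine(
    model_name="google/flan-t5-small",  # placeholder model
    max_new_tokens=32,
    use_cache=True,        # instance-level caching, the default
    cache_batch_size=100,  # caching is performed in batches of 100 instances
)
# The on-disk location would be controlled by the inference_engine_cache_path
# setting, defaulting to ./inference_engine_cache/<inference engine class>.
```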
Force-pushed from 6b3e758 to f9e6c91
Signed-off-by: Elad Venezian <eladv@il.ibm.com>
Force-pushed from f9e6c91 to e9dd23b
Do all the parameters of the inference engine affect the cache key/address? If I change the seed of the engine or the temperature, will new data be generated?
Also, one last try to get to a better place: can we change the `_infer` API so that it takes a stream and returns a stream? That would make the batching mechanism unnecessary, no?
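A minimal sketch of that stream-in/stream-out idea, assuming a generator-based interface; the helper names are taken from the discussion above and are not the current API.

```python
from typing import Any, Dict, Iterable, Iterator


def infer_stream(engine, instances: Iterable[Dict[str, Any]]) -> Iterator[Any]:
    """Illustrative only: consume a stream of instances and yield a stream of
    predictions, consulting the cache one instance at a time."""
    for instance in instances:
        key = engine._get_cache_key(instance)  # assumed helper from the diff
        prediction = engine.cache.get(key)
        if prediction is None:
            prediction = engine._infer([instance])[0]  # assumed per-instance call
            engine.cache.set(key, prediction)
        yield prediction
```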