Codefuse-ModelCache is a semantic cache for large language models (LLMs). By caching pre-generated model results, it reduces response time for similar requests and improves the user experience. <br />This project aims to optimize services by introducing a caching mechanism. It helps businesses and research institutions reduce the cost of inference deployment, improve model performance and efficiency, and provide scalable services for large models. Through open source, we aim to share and exchange technologies related to LLM semantic caching.
## Quick Deployment
### Dependencies
- Python version: 3.8 and above
- Package installation:
```shell
pip install -r requirements.txt
```
### Environment Configuration
Before starting the service, perform the following environment configuration:

1. Install the relational database MySQL and import the SQL file to create the data tables. The SQL file can be found at `reference_doc/create_table.sql`.
2. Install the vector database Milvus.
3. Add the database access information to the configuration files (see the sketch after this list).
4. Download the embedding model bin file from [https://huggingface.co/shibing624/text2vec-base-chinese/tree/main](https://huggingface.co/shibing624/text2vec-base-chinese/tree/main) and place it in the `model/text2vec-base-chinese` folder.
5. Start the backend service with the `flask4modelcache.py` script.
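
As a sketch of step 3, the snippet below reads database access information from INI-style configuration files. The file paths, section names, and keys here are assumptions, not verified against this repository; align them with the actual configuration files.

```python
# A minimal sketch, not the project's actual config loader. Paths, section
# names, and keys are assumptions; match them to this repository's files.
from configparser import ConfigParser

mysql_conf = ConfigParser()
mysql_conf.read("modelcache/config/mysql_config.ini")    # assumed path
milvus_conf = ConfigParser()
milvus_conf.read("modelcache/config/milvus_config.ini")  # assumed path

# Typical access fields the service would need for each database:
mysql_host = mysql_conf.get("mysql", "host")       # assumed section/keys
mysql_user = mysql_conf.get("mysql", "username")
milvus_host = milvus_conf.get("milvus", "host")
milvus_port = milvus_conf.get("milvus", "port")
```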
## Service-Access
The current service provides three core functionalities through a RESTful API: Cache-Writing, Cache-Querying, and Cache-Clearing. Demos:
chat_info = [{"query": [{"role": "system", "content": "You are an AI code assistant and you must provide neutral and harmless answers to help users solve code-related problems."}, {"role": "user", "content": "你是谁?"}],
58
+
"answer": "Hello, I am an intelligent assistant. How can I assist you?"}]
59
59
data = {'type': type, 'scope': scope, 'chat_info': chat_info}
60
60
headers = {"Content-Type": "application/json"}
61
61
res = requests.post(url, headers=headers, json=json.dumps(data))
query = [{"role": "system", "content": "You are an AI code assistant and you must provide neutral and harmless answers to help users solve code-related problems."}, {"role": "user", "content": "Who are you?"}]
71
71
data = {'type': type, 'scope': scope, 'query': query}
72
72
73
73
headers = {"Content-Type": "application/json"}
74
74
res = requests.post(url, headers=headers, json=json.dumps(data))
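
Cache-Clearing (a sketch: the `remove` request type and the `remove_type` field are assumptions patterned after the demos above; check `flask4modelcache.py` for the exact parameters):

```python
import json
import requests

url = 'http://127.0.0.1:5000/modelcache'  # assumed local endpoint
type = 'remove'                           # assumed request type for cache-clearing
scope = {"model": "CODEGPT-1008"}         # example model scope
remove_type = 'truncate_by_model'         # assumed: clear all entries for this model
data = {'type': type, 'scope': scope, 'remove_type': remove_type}
headers = {"Content-Type": "application/json"}
res = requests.post(url, headers=headers, json=json.dumps(data))
```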
## Modules

In ModelCache, we adopted the main idea of GPTCache, including its core modules: adapter, embedding, similarity, and data_manager. The adapter module handles the business logic of the various tasks and connects the embedding, similarity, and data_manager modules. The embedding module converts text into semantic vector representations, transforming user queries into vector form. The rank module sorts the recalled vectors and evaluates their similarity. The data_manager module manages the database.
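
As a concrete illustration of that pipeline, the toy below embeds a query, recalls candidates from an in-memory store, and ranks them against a similarity threshold. Every name here is a simplified stand-in, not a ModelCache API: the real system embeds with a text2vec model and recalls vectors from Milvus.

```python
# Illustrative toy of the adapter -> embedding -> rank -> data_manager flow.
from typing import List, Optional, Tuple

SIMILARITY_THRESHOLD = 0.9  # assumed value; ModelCache tunes thresholds per text length

_CACHE: List[Tuple[List[float], str]] = []  # toy in-memory stand-in for the vector store

def embed(text: str) -> List[float]:
    """Embedding module stand-in: map text to a normalized fixed-size vector."""
    vec = [0.0] * 64
    for i, ch in enumerate(text):
        vec[i % 64] += ord(ch)
    norm = sum(v * v for v in vec) ** 0.5 or 1.0
    return [v / norm for v in vec]

def put(query: str, answer: str) -> None:
    """data_manager stand-in: persist an (embedding, answer) pair."""
    _CACHE.append((embed(query), answer))

def get(query: str) -> Optional[str]:
    """Adapter flow: embed the query, recall candidates, rank by similarity."""
    qv = embed(query)
    scored = sorted((sum(x * y for x, y in zip(qv, vec)), ans) for vec, ans in _CACHE)
    if scored and scored[-1][0] >= SIMILARITY_THRESHOLD:
        return scored[-1][1]  # cache hit: return the stored answer
    return None               # cache miss: the caller falls back to the LLM

put("Who are you?", "Hello, I am an intelligent assistant.")
print(get("who are you ?"))  # a near-duplicate query should hit the cache
```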
## Core-Features

In order to better facilitate industrial applications, we have made architectural and functional upgrades as follows:
- [x] We have modified it to be similar to Redis and embedded it into LLM products, providing semantic caching capabilities. This ensures that it does not interfere with LLM calls, security audits, and other functionalities, achieving compatibility with all large-scale model services.
- [x] Multiple Model Loading Schemes:
  - Support loading local embedding models to address Hugging Face network connectivity issues.
  - Support loading various pretrained model embedding layers.
- [x] Data Isolation Capability:
  - Environment Isolation: Pulls different database configurations based on the environment (dev, prepub, prod) to achieve environment isolation.
  - Multi-tenant Data Isolation: Dynamically creates collections based on the model for data isolation, addressing data isolation issues in multi-model/multi-service scenarios in LLM products.
- [x] Support for System Commands: Adopting a concatenation approach to address the issue of system commands in the prompt format.
- [x] Differentiation of Long and Short Texts: Long texts pose more challenges for similarity evaluation. To address this, we differentiate between long and short texts, allowing separate similarity thresholds for each.
- [x] Milvus Performance Optimization: The consistency_level of Milvus has been adjusted to the "Session" level, which results in better performance (see the sketch after this list).
- [x] Data Management Capability:
  - Ability to clear the cache, used for data management after model upgrades.
  - Hit-query recall for subsequent data analysis and model iteration reference.
  - Asynchronous log write-back capability for data analysis and statistics.
  - Added a model field and a data statistics field for feature expansion.
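
To illustrate the Milvus setting above, here is a hedged pymilvus 2.x sketch that creates a collection with "Session" consistency; the endpoint, schema, and collection name are hypothetical, not ModelCache's actual setup.

```python
# Hypothetical sketch: create a Milvus collection with "Session" consistency,
# which lets a client read its own writes without the latency of "Strong".
from pymilvus import (Collection, CollectionSchema, DataType, FieldSchema,
                      connections)

connections.connect(host="127.0.0.1", port="19530")  # default Milvus endpoint

fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    # 768 matches the output dimension of text2vec-base-chinese
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=768),
]
schema = CollectionSchema(fields, description="semantic cache vectors")

collection = Collection(
    name="modelcache_demo",       # hypothetical collection name
    schema=schema,
    consistency_level="Session",  # the level the feature list reports adopting
)
```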
Features to be built in the future:

- [ ] System prompt partitioning storage capability to enhance the accuracy and efficiency of similarity matching.
- [ ] More versatile embedding models and similarity evaluation algorithms.
## Acknowledgements
This project has referenced the following open-source projects. We would like to express our gratitude to the projects and their developers for their contributions and research.<br />[GPTCache](https://github.com/zilliztech/GPTCache)