FEAT: Support Mixtral-8x7B-v0.1 models (xorbitsai#782)
Co-authored-by: ChengjieLi <chengjieli23@outlook.com>
Bojun-Feng and ChengjieLi28 committed Dec 27, 2023
1 parent 9e651d1 commit 314a999
Showing 15 changed files with 316 additions and 111 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -32,6 +32,7 @@ potential of cutting-edge AI models.
- Speculative decoding: [#509](https://github.com/xorbitsai/inference/pull/509)
- Incorporate vLLM: [#445](https://github.com/xorbitsai/inference/pull/445)
### New Models
- Built-in support for [Mixtral-8x7B-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1): [#782](https://github.com/xorbitsai/inference/pull/782)
- Built-in support for [OpenHermes 2.5](https://huggingface.co/teknium/OpenHermes-2.5-Mistral-7B): [#776](https://github.com/xorbitsai/inference/pull/776)
- Built-in support for [Yi](https://huggingface.co/01-ai): [#629](https://github.com/xorbitsai/inference/pull/629)
- Built-in support for [zephyr-7b-alpha](https://huggingface.co/HuggingFaceH4/zephyr-7b-alpha) and [zephyr-7b-beta](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta): [#597](https://github.com/xorbitsai/inference/pull/597)
1 change: 1 addition & 0 deletions README_zh_CN.md
@@ -30,6 +30,7 @@ Xorbits Inference (Xinference) is a powerful and comprehensive distributed
- Speculative decoding: [#509](https://github.com/xorbitsai/inference/pull/509)
- Incorporate vLLM: [#445](https://github.com/xorbitsai/inference/pull/445)
### New Models
- Built-in support for [Mixtral-8x7B-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1): [#782](https://github.com/xorbitsai/inference/pull/782)
- Built-in support for [OpenHermes 2.5](https://huggingface.co/teknium/OpenHermes-2.5-Mistral-7B): [#776](https://github.com/xorbitsai/inference/pull/776)
- Built-in support for [Yi](https://huggingface.co/01-ai): [#629](https://github.com/xorbitsai/inference/pull/629)
- Built-in support for [zephyr-7b-alpha](https://huggingface.co/HuggingFaceH4/zephyr-7b-alpha) and [zephyr-7b-beta](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta): [#597](https://github.com/xorbitsai/inference/pull/597)
6 changes: 5 additions & 1 deletion doc/source/models/builtin/llm/index.rst
@@ -46,7 +46,7 @@ The following is a list of built-in LLM in Xinference:
glaive-coder

gorilla-openfunctions-v1

gpt-2

internlm-20b
@@ -65,6 +65,10 @@ The following is a list of built-in LLM in Xinference:

mistral-v0.1

mixtral-instruct-v0.1

mixtral-v0.1

openbuddy

openhermes-2.5
43 changes: 43 additions & 0 deletions doc/source/models/builtin/llm/mixtral-instruct-v0.1.rst
@@ -0,0 +1,43 @@
.. _models_llm_mixtral-instruct-v0.1:

========================================
mixtral-instruct-v0.1
========================================

- **Context Length:** 32768
- **Model Name:** mixtral-instruct-v0.1
- **Languages:** en, fr, it, de, es
- **Abilities:** chat
- **Description:** Mixtral-8x7B-Instruct is a fine-tuned version of the Mixtral-8x7B LLM, specialized in chat.

Specifications
^^^^^^^^^^^^^^


Model Spec 1 (pytorch, 46_7 Billion)
++++++++++++++++++++++++++++++++++++++++

- **Model Format:** pytorch
- **Model Size (in billions):** 46_7
- **Quantizations:** 4-bit, 8-bit, none
- **Model ID:** mistralai/Mixtral-8x7B-Instruct-v0.1

Execute the following command to launch the model. Remember to replace ``${quantization}`` with
your chosen quantization method from the options listed above::

xinference launch --model-name mixtral-instruct-v0.1 --size-in-billions 46_7 --model-format pytorch --quantization ${quantization}


Model Spec 2 (ggufv2, 46_7 Billion)
++++++++++++++++++++++++++++++++++++++++

- **Model Format:** ggufv2
- **Model Size (in billions):** 46_7
- **Quantizations:** Q2_K, Q3_K_M, Q4_0, Q4_K_M, Q5_0, Q5_K_M, Q6_K, Q8_0
- **Model ID:** TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF

Execute the following command to launch the model. Remember to replace ``${quantization}`` with
your chosen quantization method from the options listed above::

xinference launch --model-name mixtral-instruct-v0.1 --size-in-billions 46_7 --model-format ggufv2 --quantization ${quantization}
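
For instance, substituting the ``4-bit`` option listed above gives a concrete invocation (any listed quantization works the same way)::

   xinference launch --model-name mixtral-instruct-v0.1 --size-in-billions 46_7 --model-format pytorch --quantization 4-bit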

43 changes: 43 additions & 0 deletions doc/source/models/builtin/llm/mixtral-v0.1.rst
@@ -0,0 +1,43 @@
.. _models_llm_mixtral-v0.1:

========================================
mixtral-v0.1
========================================

- **Context Length:** 32768
- **Model Name:** mixtral-v0.1
- **Languages:** en, fr, it, de, es
- **Abilities:** generate
- **Description:** The Mixtral-8x7B Large Language Model (LLM) is a pretrained generative Sparse Mixture of Experts.

Specifications
^^^^^^^^^^^^^^


Model Spec 1 (pytorch, 46_7 Billion)
++++++++++++++++++++++++++++++++++++++++

- **Model Format:** pytorch
- **Model Size (in billions):** 46_7
- **Quantizations:** 4-bit, 8-bit, none
- **Model ID:** mistralai/Mixtral-8x7B-v0.1

Execute the following command to launch the model. Remember to replace ``${quantization}`` with
your chosen quantization method from the options listed above::

xinference launch --model-name mixtral-v0.1 --size-in-billions 46_7 --model-format pytorch --quantization ${quantization}


Model Spec 2 (ggufv2, 46_7 Billion)
++++++++++++++++++++++++++++++++++++++++

- **Model Format:** ggufv2
- **Model Size (in billions):** 46_7
- **Quantizations:** Q2_K, Q3_K_M, Q4_0, Q4_K_M, Q5_0, Q5_K_M, Q6_K, Q8_0
- **Model ID:** TheBloke/Mixtral-8x7B-v0.1-GGUF

Execute the following command to launch the model. Remember to replace ``${quantization}`` with
your chosen quantization method from the options listed above::

xinference launch --model-name mixtral-v0.1 --size-in-billions 46_7 --model-format ggufv2 --quantization ${quantization}
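
Likewise, picking ``Q4_K_M`` from the ggufv2 options above yields::

   xinference launch --model-name mixtral-v0.1 --size-in-billions 46_7 --model-format ggufv2 --quantization Q4_K_M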

2 changes: 1 addition & 1 deletion xinference/client/restful/restful_client.py
@@ -601,7 +601,7 @@ def launch_model(
         model_name: str,
         model_type: str = "LLM",
         model_uid: Optional[str] = None,
-        model_size_in_billions: Optional[int] = None,
+        model_size_in_billions: Optional[Union[int, str]] = None,
         model_format: Optional[str] = None,
         quantization: Optional[str] = None,
         replica: int = 1,
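With the widened type, fractional sizes such as 46.7B can be passed as the string "46_7" while integral sizes stay ints. A minimal client-side sketch (a hypothetical session; assumes the public import path and a server on the default port 9997):

    from xinference.client import RESTfulClient

    client = RESTfulClient("http://localhost:9997")
    # The string "46_7" encodes the fractional size 46.7B; plain ints still work.
    model_uid = client.launch_model(
        model_name="mixtral-instruct-v0.1",
        model_size_in_billions="46_7",
        model_format="pytorch",
        quantization="4-bit",
    )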
27 changes: 13 additions & 14 deletions xinference/core/worker.py
@@ -342,21 +342,20 @@ async def launch_builtin_model(
             model_uid, model_type, n_gpu=n_gpu
         )

-        model, model_description = await asyncio.to_thread(
-            create_model_instance,
-            subpool_address,
-            devices,
-            model_uid,
-            model_type,
-            model_name,
-            model_format,
-            model_size_in_billions,
-            quantization,
-            is_local_deployment,
-            **kwargs,
-        )
-
         try:
+            model, model_description = await asyncio.to_thread(
+                create_model_instance,
+                subpool_address,
+                devices,
+                model_uid,
+                model_type,
+                model_name,
+                model_format,
+                model_size_in_billions,
+                quantization,
+                is_local_deployment,
+                **kwargs,
+            )
             model_ref = await xo.create_actor(
                 ModelActor,
                 address=subpool_address,
9 changes: 6 additions & 3 deletions xinference/deploy/cmdline.py
@@ -441,7 +441,7 @@ def list_model_registrations(
"--size-in-billions",
"-s",
default=None,
type=int,
type=str,
help="Specify the model size in billions of parameters.",
)
@click.option(
@@ -482,7 +482,7 @@ def model_launch(
     model_name: str,
     model_type: str,
     model_uid: str,
-    size_in_billions: int,
+    size_in_billions: str,
     model_format: str,
     quantization: str,
     replica: int,
@@ -497,13 +497,16 @@
         _n_gpu = int(n_gpu)

     endpoint = get_endpoint(endpoint)
+    model_size: Union[str, int] = (
+        size_in_billions if "_" in size_in_billions else int(size_in_billions)
+    )

     client = RESTfulClient(base_url=endpoint)
     model_uid = client.launch_model(
         model_name=model_name,
         model_type=model_type,
         model_uid=model_uid,
-        model_size_in_billions=size_in_billions,
+        model_size_in_billions=model_size,
         model_format=model_format,
         quantization=quantization,
         replica=replica,
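The underscore convention lets the CLI keep fractional sizes like "46_7" intact while still casting plain integers. A standalone sketch of the conversion (parse_size is an illustrative name, not part of the commit):

    from typing import Union

    def parse_size(size_in_billions: str) -> Union[str, int]:
        # "46_7" (i.e. 46.7B) stays a string; "7" becomes the int 7.
        return size_in_billions if "_" in size_in_billions else int(size_in_billions)

    assert parse_size("46_7") == "46_7"
    assert parse_size("7") == 7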
92 changes: 92 additions & 0 deletions xinference/model/llm/llm_family.json
@@ -2157,6 +2157,98 @@
}
]
},
{
"version": 1,
"context_length": 32768,
"model_name": "mixtral-v0.1",
"model_lang": [
"en", "fr", "it", "de", "es"
],
"model_ability": [
"generate"
],
"model_description": "The Mixtral-8x7B Large Language Model (LLM) is a pretrained generative Sparse Mixture of Experts.",
"model_specs": [
{
"model_format": "pytorch",
"model_size_in_billions": "46_7",
"quantizations": [
"4-bit",
"8-bit",
"none"
],
"model_id": "mistralai/Mixtral-8x7B-v0.1",
"model_revision": "58301445dc1378584211722b7ebf8743ec4e192b"
},
{
"model_format": "ggufv2",
"model_size_in_billions": "46_7",
"quantizations": [
"Q2_K",
"Q3_K_M",
"Q4_0",
"Q4_K_M",
"Q5_0",
"Q5_K_M",
"Q6_K",
"Q8_0"
],
"model_id": "TheBloke/Mixtral-8x7B-v0.1-GGUF",
"model_file_name_template": "mixtral-8x7b-v0.1.{quantization}.gguf"
}
]
},
{
"version": 1,
"context_length": 32768,
"model_name": "mixtral-instruct-v0.1",
"model_lang": [
"en", "fr", "it", "de", "es"
],
"model_ability": [
"chat"
],
"model_description": "Mistral-8x7B-Instruct is a fine-tuned version of the Mistral-8x7B LLM, specializing in chatting.",
"model_specs": [
{
"model_format": "pytorch",
"model_size_in_billions": "46_7",
"quantizations": [
"4-bit",
"8-bit",
"none"
],
"model_id": "mistralai/Mixtral-8x7B-Instruct-v0.1",
"model_revision": "125c431e2ff41a156b9f9076f744d2f35dd6e67a"
},
{
"model_format": "ggufv2",
"model_size_in_billions": "46_7",
"quantizations": [
"Q2_K",
"Q3_K_M",
"Q4_0",
"Q4_K_M",
"Q5_0",
"Q5_K_M",
"Q6_K",
"Q8_0"
],
"model_id": "TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF",
"model_file_name_template": "mixtral-8x7b-instruct-v0.1.{quantization}.gguf"
}
],
"prompt_style": {
"style_name": "MIXTRAL_V01",
"system_prompt": "",
"roles": [
"user",
"assistant"
],
"intra_message_sep": "",
"inter_message_sep": ""
}
},
{
"version": 1,
"context_length": 4096,
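For the ggufv2 specs, the file to download is derived from model_file_name_template. A quick illustration of the substitution (plain Python, not code from this commit):

    template = "mixtral-8x7b-v0.1.{quantization}.gguf"
    print(template.format(quantization="Q4_K_M"))
    # mixtral-8x7b-v0.1.Q4_K_M.gguf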
2 changes: 1 addition & 1 deletion xinference/model/llm/llm_family.py
@@ -102,7 +102,7 @@ class LLMFamilyV1(BaseModel):
     version: Literal[1]
     context_length: Optional[int] = DEFAULT_CONTEXT_LENGTH
     model_name: str
-    model_lang: List[Literal["en", "zh"]]
+    model_lang: List[str]
     model_ability: List[Literal["embed", "generate", "chat"]]
     model_description: Optional[str]
     model_specs: List["LLMSpecV1"]
62 changes: 62 additions & 0 deletions xinference/model/llm/llm_family_modelscope.json
@@ -946,6 +946,68 @@
}
]
},
{
"version": 1,
"context_length": 32768,
"model_name": "mixtral-v0.1",
"model_lang": [
"en", "fr", "it", "de", "es"
],
"model_ability": [
"generate"
],
"model_description": "The Mixtral-8x7B Large Language Model (LLM) is a pretrained generative Sparse Mixture of Experts.",
"model_specs": [
{
"model_format": "pytorch",
"model_size_in_billions": "46_7",
"quantizations": [
"4-bit",
"8-bit",
"none"
],
"model_hub": "modelscope",
"model_id": "AI-ModelScope/Mixtral-8x7B-v0.1",
"model_revision": "master"
}
]
},
{
"version": 1,
"context_length": 32768,
"model_name": "mixtral-instruct-v0.1",
"model_lang": [
"en", "fr", "it", "de", "es"
],
"model_ability": [
"chat"
],
"model_description": "Mistral-8x7B-Instruct is a fine-tuned version of the Mistral-8x7B LLM, specializing in chatting.",
"model_specs": [
{
"model_format": "pytorch",
"model_size_in_billions": "46_7",
"quantizations": [
"4-bit",
"8-bit",
"none"
],
"model_hub": "modelscope",
"model_id": "AI-ModelScope/Mixtral-8x7B-Instruct-v0.1",
"model_revision": "master"
}
],
"prompt_style": {
"style_name": "MIXTRAL_V01",
"system_prompt": "",
"roles": [
"user",
"assistant"
],
"intra_message_sep": "",
"inter_message_sep": ""
}
},
{
"version": 1,
"context_length": 4096,
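These entries mirror the Hugging Face specs but point at the AI-ModelScope hub. Assuming the XINFERENCE_MODEL_SRC environment variable selects the download hub (as described in Xinference's docs), a ModelScope-backed launch might look like:

    XINFERENCE_MODEL_SRC=modelscope xinference launch --model-name mixtral-v0.1 --size-in-billions 46_7 --model-format pytorch --quantization 4-bit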
9 changes: 9 additions & 0 deletions xinference/model/llm/utils.py
@@ -114,6 +114,15 @@ def get_prompt(
                 else:
                     ret += role + ":"
             return ret
+        elif prompt_style.style_name == "MIXTRAL_V01":
+            ret = ""
+            for i, message in enumerate(chat_history):
+                content = message["content"]
+                if i % 2 == 0:  # user
+                    ret += f"<s> [INST] {content} [/INST]"
+                else:  # assistant
+                    ret += f"{content} </s>"
+            return ret
         elif prompt_style.style_name == "CHATGLM":
             round_add_n = 1 if prompt_style.intra_message_sep == "\n\n" else 0
             if prompt_style.system_prompt:
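For reference, a sketch of what the new MIXTRAL_V01 branch produces for a short history (a standalone rewrite of the loop above, not code from the commit):

    chat_history = [
        {"role": "user", "content": "Hello"},
        {"role": "assistant", "content": "Hi there!"},
        {"role": "user", "content": "What is a sparse mixture of experts?"},
    ]
    ret = ""
    for i, message in enumerate(chat_history):
        content = message["content"]
        if i % 2 == 0:  # even indices are user turns
            ret += f"<s> [INST] {content} [/INST]"
        else:  # odd indices are assistant replies
            ret += f"{content} </s>"
    print(ret)
    # <s> [INST] Hello [/INST]Hi there! </s><s> [INST] What is a sparse mixture of experts? [/INST]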