FEAT: Support Mixtral-8x7B-v0.1 models (xorbitsai#782)
Co-authored-by: ChengjieLi <chengjieli23@outlook.com>
Bojun-Feng and ChengjieLi28 committed Dec 27, 2023
1 parent 9e651d1 commit 314a999
Showing 15 changed files with 316 additions and 111 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -32,6 +32,7 @@ potential of cutting-edge AI models.
- Speculative decoding: [#509](https://github.com/xorbitsai/inference/pull/509)
- Incorporate vLLM: [#445](https://github.com/xorbitsai/inference/pull/445)
### New Models
- Built-in support for [Mixtral-8x7B-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1): [#782](https://github.com/xorbitsai/inference/pull/782)
- Built-in support for [OpenHermes 2.5](https://huggingface.co/teknium/OpenHermes-2.5-Mistral-7B): [#776](https://github.com/xorbitsai/inference/pull/776)
- Built-in support for [Yi](https://huggingface.co/01-ai): [#629](https://github.com/xorbitsai/inference/pull/629)
- Built-in support for [zephyr-7b-alpha](https://huggingface.co/HuggingFaceH4/zephyr-7b-alpha) and [zephyr-7b-beta](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta): [#597](https://github.com/xorbitsai/inference/pull/597)
1 change: 1 addition & 0 deletions README_zh_CN.md
@@ -30,6 +30,7 @@ Xorbits Inference (Xinference) is a powerful and comprehensive distributed
- Speculative decoding: [#509](https://github.com/xorbitsai/inference/pull/509)
- Incorporate vLLM: [#445](https://github.com/xorbitsai/inference/pull/445)
### New Models
- Built-in support for [Mixtral-8x7B-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1): [#782](https://github.com/xorbitsai/inference/pull/782)
- Built-in support for [OpenHermes 2.5](https://huggingface.co/teknium/OpenHermes-2.5-Mistral-7B): [#776](https://github.com/xorbitsai/inference/pull/776)
- Built-in support for [Yi](https://huggingface.co/01-ai): [#629](https://github.com/xorbitsai/inference/pull/629)
- Built-in support for [zephyr-7b-alpha](https://huggingface.co/HuggingFaceH4/zephyr-7b-alpha) and [zephyr-7b-beta](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta): [#597](https://github.com/xorbitsai/inference/pull/597)
6 changes: 5 additions & 1 deletion doc/source/models/builtin/llm/index.rst
@@ -46,7 +46,7 @@ The following is a list of built-in LLM in Xinference:
glaive-coder

gorilla-openfunctions-v1

gpt-2

internlm-20b
@@ -65,6 +65,10 @@ The following is a list of built-in LLM in Xinference:

mistral-v0.1

mixtral-instruct-v0.1

mixtral-v0.1

openbuddy

openhermes-2.5
43 changes: 43 additions & 0 deletions doc/source/models/builtin/llm/mixtral-instruct-v0.1.rst
@@ -0,0 +1,43 @@
.. _models_llm_mixtral-instruct-v0.1:

========================================
mixtral-instruct-v0.1
========================================

- **Context Length:** 32768
- **Model Name:** mixtral-instruct-v0.1
- **Languages:** en, fr, it, de, es
- **Abilities:** chat
- **Description:** Mixtral-8x7B-Instruct is a fine-tuned version of the Mixtral-8x7B LLM, specialized in chat.

Specifications
^^^^^^^^^^^^^^


Model Spec 1 (pytorch, 46_7 Billion)
++++++++++++++++++++++++++++++++++++++++

- **Model Format:** pytorch
- **Model Size (in billions):** 46_7
- **Quantizations:** 4-bit, 8-bit, none
- **Model ID:** mistralai/Mixtral-8x7B-Instruct-v0.1

Execute the following command to launch the model. Remember to replace ``${quantization}`` with
your chosen quantization method from the options listed above::

xinference launch --model-name mixtral-instruct-v0.1 --size-in-billions 46_7 --model-format pytorch --quantization ${quantization}


Model Spec 2 (ggufv2, 46_7 Billion)
++++++++++++++++++++++++++++++++++++++++

- **Model Format:** ggufv2
- **Model Size (in billions):** 46_7
- **Quantizations:** Q2_K, Q3_K_M, Q4_0, Q4_K_M, Q5_0, Q5_K_M, Q6_K, Q8_0
- **Model ID:** TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF

Execute the following command to launch the model. Remember to replace ``${quantization}`` with
your chosen quantization method from the options listed above::

xinference launch --model-name mixtral-instruct-v0.1 --size-in-billions 46_7 --model-format ggufv2 --quantization ${quantization}
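
For instance, substituting the ``4-bit`` option listed above gives a concrete invocation (any listed quantization works the same way)::

   xinference launch --model-name mixtral-instruct-v0.1 --size-in-billions 46_7 --model-format pytorch --quantization 4-bit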

43 changes: 43 additions & 0 deletions doc/source/models/builtin/llm/mixtral-v0.1.rst
@@ -0,0 +1,43 @@
.. _models_llm_mixtral-v0.1:

========================================
mixtral-v0.1
========================================

- **Context Length:** 32768
- **Model Name:** mixtral-v0.1
- **Languages:** en, fr, it, de, es
- **Abilities:** generate
- **Description:** The Mixtral-8x7B Large Language Model (LLM) is a pretrained generative Sparse Mixture of Experts.

Specifications
^^^^^^^^^^^^^^


Model Spec 1 (pytorch, 46_7 Billion)
++++++++++++++++++++++++++++++++++++++++

- **Model Format:** pytorch
- **Model Size (in billions):** 46_7
- **Quantizations:** 4-bit, 8-bit, none
- **Model ID:** mistralai/Mixtral-8x7B-v0.1

Execute the following command to launch the model. Remember to replace ``${quantization}`` with
your chosen quantization method from the options listed above::

xinference launch --model-name mixtral-v0.1 --size-in-billions 46_7 --model-format pytorch --quantization ${quantization}


Model Spec 2 (ggufv2, 46_7 Billion)
++++++++++++++++++++++++++++++++++++++++

- **Model Format:** ggufv2
- **Model Size (in billions):** 46_7
- **Quantizations:** Q2_K, Q3_K_M, Q4_0, Q4_K_M, Q5_0, Q5_K_M, Q6_K, Q8_0
- **Model ID:** TheBloke/Mixtral-8x7B-v0.1-GGUF

Execute the following command to launch the model. Remember to replace ``${quantization}`` with
your chosen quantization method from the options listed above::

xinference launch --model-name mixtral-v0.1 --size-in-billions 46_7 --model-format ggufv2 --quantization ${quantization}
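
Likewise, picking ``Q4_K_M`` from the ggufv2 options above yields::

   xinference launch --model-name mixtral-v0.1 --size-in-billions 46_7 --model-format ggufv2 --quantization Q4_K_M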

2 changes: 1 addition & 1 deletion xinference/client/restful/restful_client.py
@@ -601,7 +601,7 @@ def launch_model(
         model_name: str,
         model_type: str = "LLM",
         model_uid: Optional[str] = None,
-        model_size_in_billions: Optional[int] = None,
+        model_size_in_billions: Optional[Union[int, str]] = None,
         model_format: Optional[str] = None,
         quantization: Optional[str] = None,
         replica: int = 1,
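With the widened type, fractional sizes such as 46.7B can be passed as the string "46_7" while integral sizes stay ints. A minimal client-side sketch (a hypothetical session; assumes the public import path and a server on the default port 9997):

    from xinference.client import RESTfulClient

    client = RESTfulClient("http://localhost:9997")
    # The string "46_7" encodes the fractional size 46.7B; plain ints still work.
    model_uid = client.launch_model(
        model_name="mixtral-instruct-v0.1",
        model_size_in_billions="46_7",
        model_format="pytorch",
        quantization="4-bit",
    )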
27 changes: 13 additions & 14 deletions xinference/core/worker.py
@@ -342,21 +342,20 @@ async def launch_builtin_model(
             model_uid, model_type, n_gpu=n_gpu
         )

-        model, model_description = await asyncio.to_thread(
-            create_model_instance,
-            subpool_address,
-            devices,
-            model_uid,
-            model_type,
-            model_name,
-            model_format,
-            model_size_in_billions,
-            quantization,
-            is_local_deployment,
-            **kwargs,
-        )
-
         try:
+            model, model_description = await asyncio.to_thread(
+                create_model_instance,
+                subpool_address,
+                devices,
+                model_uid,
+                model_type,
+                model_name,
+                model_format,
+                model_size_in_billions,
+                quantization,
+                is_local_deployment,
+                **kwargs,
+            )
             model_ref = await xo.create_actor(
                 ModelActor,
                 address=subpool_address,
9 changes: 6 additions & 3 deletions xinference/deploy/cmdline.py
@@ -441,7 +441,7 @@ def list_model_registrations(
"--size-in-billions",
"-s",
default=None,
type=int,
type=str,
help="Specify the model size in billions of parameters.",
)
@click.option(
@@ -482,7 +482,7 @@ def model_launch(
     model_name: str,
     model_type: str,
     model_uid: str,
-    size_in_billions: int,
+    size_in_billions: str,
     model_format: str,
     quantization: str,
     replica: int,
@@ -497,13 +497,16 @@
         _n_gpu = int(n_gpu)

     endpoint = get_endpoint(endpoint)
+    model_size: Union[str, int] = (
+        size_in_billions if "_" in size_in_billions else int(size_in_billions)
+    )

     client = RESTfulClient(base_url=endpoint)
     model_uid = client.launch_model(
         model_name=model_name,
         model_type=model_type,
         model_uid=model_uid,
-        model_size_in_billions=size_in_billions,
+        model_size_in_billions=model_size,
         model_format=model_format,
         quantization=quantization,
         replica=replica,
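The underscore convention lets the CLI keep fractional sizes like "46_7" intact while still casting plain integers. A standalone sketch of the conversion (parse_size is an illustrative name, not part of the commit):

    from typing import Union

    def parse_size(size_in_billions: str) -> Union[str, int]:
        # "46_7" (i.e. 46.7B) stays a string; "7" becomes the int 7.
        return size_in_billions if "_" in size_in_billions else int(size_in_billions)

    assert parse_size("46_7") == "46_7"
    assert parse_size("7") == 7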
92 changes: 92 additions & 0 deletions xinference/model/llm/llm_family.json
@@ -2157,6 +2157,98 @@
}
]
},
{
"version": 1,
"context_length": 32768,
"model_name": "mixtral-v0.1",
"model_lang": [
"en", "fr", "it", "de", "es"
],
"model_ability": [
"generate"
],
"model_description": "The Mixtral-8x7B Large Language Model (LLM) is a pretrained generative Sparse Mixture of Experts.",
"model_specs": [
{
"model_format": "pytorch",
"model_size_in_billions": "46_7",
"quantizations": [
"4-bit",
"8-bit",
"none"
],
"model_id": "mistralai/Mixtral-8x7B-v0.1",
"model_revision": "58301445dc1378584211722b7ebf8743ec4e192b"
},
{
"model_format": "ggufv2",
"model_size_in_billions": "46_7",
"quantizations": [
"Q2_K",
"Q3_K_M",
"Q4_0",
"Q4_K_M",
"Q5_0",
"Q5_K_M",
"Q6_K",
"Q8_0"
],
"model_id": "TheBloke/Mixtral-8x7B-v0.1-GGUF",
"model_file_name_template": "mixtral-8x7b-v0.1.{quantization}.gguf"
}
]
},
{
"version": 1,
"context_length": 32768,
"model_name": "mixtral-instruct-v0.1",
"model_lang": [
"en", "fr", "it", "de", "es"
],
"model_ability": [
"chat"
],
"model_description": "Mistral-8x7B-Instruct is a fine-tuned version of the Mistral-8x7B LLM, specializing in chatting.",
"model_specs": [
{
"model_format": "pytorch",
"model_size_in_billions": "46_7",
"quantizations": [
"4-bit",
"8-bit",
"none"
],
"model_id": "mistralai/Mixtral-8x7B-Instruct-v0.1",
"model_revision": "125c431e2ff41a156b9f9076f744d2f35dd6e67a"
},
{
"model_format": "ggufv2",
"model_size_in_billions": "46_7",
"quantizations": [
"Q2_K",
"Q3_K_M",
"Q4_0",
"Q4_K_M",
"Q5_0",
"Q5_K_M",
"Q6_K",
"Q8_0"
],
"model_id": "TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF",
"model_file_name_template": "mixtral-8x7b-instruct-v0.1.{quantization}.gguf"
}
],
"prompt_style": {
"style_name": "MIXTRAL_V01",
"system_prompt": "",
"roles": [
"user",
"assistant"
],
"intra_message_sep": "",
"inter_message_sep": ""
}
},
{
"version": 1,
"context_length": 4096,
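For the ggufv2 specs, the file to download is derived from model_file_name_template. A quick illustration of the substitution (plain Python, not code from this commit):

    template = "mixtral-8x7b-v0.1.{quantization}.gguf"
    print(template.format(quantization="Q4_K_M"))
    # mixtral-8x7b-v0.1.Q4_K_M.gguf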
2 changes: 1 addition & 1 deletion xinference/model/llm/llm_family.py
@@ -102,7 +102,7 @@ class LLMFamilyV1(BaseModel):
     version: Literal[1]
     context_length: Optional[int] = DEFAULT_CONTEXT_LENGTH
     model_name: str
-    model_lang: List[Literal["en", "zh"]]
+    model_lang: List[str]
     model_ability: List[Literal["embed", "generate", "chat"]]
     model_description: Optional[str]
     model_specs: List["LLMSpecV1"]
62 changes: 62 additions & 0 deletions xinference/model/llm/llm_family_modelscope.json
@@ -946,6 +946,68 @@
}
]
},
{
"version": 1,
"context_length": 32768,
"model_name": "mixtral-v0.1",
"model_lang": [
"en", "fr", "it", "de", "es"
],
"model_ability": [
"generate"
],
"model_description": "The Mixtral-8x7B Large Language Model (LLM) is a pretrained generative Sparse Mixture of Experts.",
"model_specs": [
{
"model_format": "pytorch",
"model_size_in_billions": "46_7",
"quantizations": [
"4-bit",
"8-bit",
"none"
],
"model_hub": "modelscope",
"model_id": "AI-ModelScope/Mixtral-8x7B-v0.1",
"model_revision": "master"
}
]
},
{
"version": 1,
"context_length": 32768,
"model_name": "mixtral-instruct-v0.1",
"model_lang": [
"en", "fr", "it", "de", "es"
],
"model_ability": [
"chat"
],
"model_description": "Mistral-8x7B-Instruct is a fine-tuned version of the Mistral-8x7B LLM, specializing in chatting.",
"model_specs": [
{
"model_format": "pytorch",
"model_size_in_billions": "46_7",
"quantizations": [
"4-bit",
"8-bit",
"none"
],
"model_hub": "modelscope",
"model_id": "AI-ModelScope/Mixtral-8x7B-Instruct-v0.1",
"model_revision": "master"
}
],
"prompt_style": {
"style_name": "MIXTRAL_V01",
"system_prompt": "",
"roles": [
"user",
"assistant"
],
"intra_message_sep": "",
"inter_message_sep": ""
}
},
{
"version": 1,
"context_length": 4096,
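These entries mirror the Hugging Face specs but point at the AI-ModelScope hub. Assuming the XINFERENCE_MODEL_SRC environment variable selects the download hub (as described in Xinference's docs), a ModelScope-backed launch might look like:

    XINFERENCE_MODEL_SRC=modelscope xinference launch --model-name mixtral-v0.1 --size-in-billions 46_7 --model-format pytorch --quantization 4-bit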
9 changes: 9 additions & 0 deletions xinference/model/llm/utils.py
@@ -114,6 +114,15 @@ def get_prompt(
                 else:
                     ret += role + ":"
             return ret
+        elif prompt_style.style_name == "MIXTRAL_V01":
+            ret = ""
+            for i, message in enumerate(chat_history):
+                content = message["content"]
+                if i % 2 == 0:  # user
+                    ret += f"<s> [INST] {content} [/INST]"
+                else:  # assistant
+                    ret += f"{content} </s>"
+            return ret
         elif prompt_style.style_name == "CHATGLM":
             round_add_n = 1 if prompt_style.intra_message_sep == "\n\n" else 0
             if prompt_style.system_prompt:
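For reference, a sketch of what the new MIXTRAL_V01 branch produces for a short history (a standalone rewrite of the loop above, not code from the commit):

    chat_history = [
        {"role": "user", "content": "Hello"},
        {"role": "assistant", "content": "Hi there!"},
        {"role": "user", "content": "What is a sparse mixture of experts?"},
    ]
    ret = ""
    for i, message in enumerate(chat_history):
        content = message["content"]
        if i % 2 == 0:  # even indices are user turns
            ret += f"<s> [INST] {content} [/INST]"
        else:  # odd indices are assistant replies
            ret += f"{content} </s>"
    print(ret)
    # <s> [INST] Hello [/INST]Hi there! </s><s> [INST] What is a sparse mixture of experts? [/INST]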