AnyMoE: Build an MoE model from anything, quickly (#476)
* Add gating layer and some infrastructure

* Add traits to the pipeline

* Add the training loop

* Remove unused

* Add loader and pipeline

* Complete merge

* Move method

* Add a default for anymoeconfig

* Add training support

* Expose in toml selector

* Inject anymoe layers

* Load pretraining dataset from csv

* Run the training

* Add a csv file

* Add default dtype option

* Fix lin varmap

* Add some debugs and fix

* Template it

* Fix assert condition

* To scalar

* Take cached outputs

* It doesn't oom

* Remove debugs

* Nice progress bar

* Nice progress bar

* Add anymoe support to plain models

* Add get mlps and get mlps mut layers

* Check if supported

* Clippy and slightly more info

* Load the mlps into vbs

* Add support for loading experts

* Update toml selector

* Remove deadlock

* Add support for selecting only certain layers

* Default

* Fix off by one

* Handle it correctly

* Check if is moe layer

* Update csv training set and add moe layers to toml

* Remove unnecessary training infra

* Done training

* Fix trainable params calculation

* More consistent naming

* Add support for loading from lora experts

* Fix the toml files

* Get in and out dims

* Add amoe support for gemma2

* More info

* Complete merge

* Add some docs

* Add example adapter

* Handle target modules

* Fix toml selector

* Add topk

* Correctly gate

* Correctly gate

* Fix clippy

* Fix training

* Fix typos

* Fix scale

* Use the zephyr lora adapter

* Change the base model

* Typo

* Add to the python api

* Update the type stubs

* Update readme

* Add some examples

* Add some examples and docs

* Update docs

* Update example to use layers

* Take into account silent loading

* Clippy

* Update example lora model id

* Update target modules

* Remove multiple target modules from examples

* Update readme

* Update readme

* Missed anymoe lora rust example
EricLBuehler committed Jul 1, 2024
1 parent 21ed180 commit a3c8eaa
Showing 58 changed files with 3,369 additions and 215 deletions.
1 change: 1 addition & 0 deletions Cargo.lock


67 changes: 45 additions & 22 deletions README.md
@@ -26,18 +26,25 @@ Please submit requests for new models [here](https://github.com/EricLBuehler/mis
- [OpenAI compatible HTTP server](examples/http.md)

## Quick examples
- 💎 Run the Gemma 2 model

*After following installation instructions*

- 🔥 AnyMoE: Build an MoE model quickly from anything, [docs here](docs/ANYMOE.md)

```
./mistralrs_server -i toml -f toml-selectors/anymoe_lora.toml
```

Paper: https://arxiv.org/abs/2405.19076

- 💎 Run the Gemma 2 model

```
./mistralrs_server -i plain -m google/gemma-2-9b-it -a gemma2
```

- φ³ Run the Phi 3 model with 128K context window

*After following installation instructions*

```
./mistralrs_server -i plain -m microsoft/Phi-3-mini-128k-instruct -a phi3
```
@@ -47,8 +54,6 @@ Please submit requests for new models [here](https://github.com/EricLBuehler/mis
<img src="https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg" alt="Mount Washington" width = "400" height = "267">
<h6><a href = "https://www.nhmagazine.com/mount-washington/">Credit</a></h6>

*After following installation instructions*

```
./mistralrs_server --port 1234 vision-plain -m microsoft/Phi-3-vision-128k-instruct -a phi3v
```
@@ -84,6 +89,7 @@ Mistral.rs supports several model categories:
- First X-LoRA inference platform with first class support.
- Speculative Decoding: Mix supported models as the draft model or the target model
- Dynamic LoRA adapter swapping at runtime with adapter preloading: [examples and docs](docs/ADAPTER_MODELS.md#adapter-model-dynamic-adapter-activation)
- AnyMoE: Build an MoE model from anything, quickly: [docs](docs/ANYMOE.md)


This is a demo of interactive mode with streaming running Phi 3 128k mini with quantization via ISQ to Q4K.
@@ -97,18 +103,18 @@ https://github.com/EricLBuehler/mistral.rs/assets/65165915/09d9a30f-1e22-4b9a-90

> Note: See [supported models](#supported-models) for more information
|Model|Supports quantization|Supports adapters|Supports device mapping|
|--|--|--|--|
|Mistral v0.1/v0.2/v0.3||||
|Gemma||||
|Llama 2/3||||
|Mixtral||||
|Phi 2||||
|Phi 3||||
|Qwen 2|| ||
|Phi 3 Vision|| ||
|Idefics 2|| ||
|Gemma 2||||
|Model|Supports quantization|Supports adapters|Supports device mapping|Supported by AnyMoE|
|--|--|--|--|--|
|Mistral v0.1/v0.2/v0.3|||||
|Gemma|||||
|Llama 2/3|||||
|Mixtral|||| |
|Phi 2|||||
|Phi 3|||||
|Qwen 2|| |||
|Phi 3 Vision|| || |
|Idefics 2|| || |
|Gemma 2|||||

## APIs and Integrations

@@ -422,15 +428,16 @@ Example:
**Quantization support**
|Model|GGUF|GGML|ISQ|
|--|--|--|--|
|Mistral 7B || ||
|Mistral|| ||
|Gemma| | ||
|Llama||||
|Mixtral 8x7B|| ||
|Mixtral|| ||
|Phi 2|| ||
|Phi 3|| ||
|Qwen 2| | ||
|Phi 3 Vision| | ||
|Idefics 2| | ||
|Gemma 2| | ||

**Device mapping support**
|Model category|Supported|
@@ -443,15 +450,31 @@ Example:
**X-LoRA and LoRA support**
|Model|X-LoRA|X-LoRA+GGUF|X-LoRA+GGML|
|--|--|--|--|
|Mistral 7B ||| |
|Mistral||| |
|Gemma|| | |
|Llama||||
|Mixtral 8x7B||| |
|Mixtral||| |
|Phi 2|| | |
|Phi 3||| |
|Qwen 2| | | |
|Phi 3 Vision| | | |
|Idefics 2| | | |
|Gemma 2|| | |

**AnyMoE support**
|Model|AnyMoE|
|--|--|
|Mistral 7B||
|Gemma||
|Llama||
|Mixtral||
|Phi 2||
|Phi 3||
|Qwen 2||
|Phi 3 Vision| |
|Idefics 2| |
|Gemma 2||


### Using derivative model

206 changes: 206 additions & 0 deletions docs/ANYMOE.md
@@ -0,0 +1,206 @@
# AnyMoE: Build an MoE model from anything, quickly

AnyMoE is a technique to dynamically and efficiently create mixture-of-experts (MoE) models. By providing a set of experts and a small pretraining dataset, you can create an MoE locally! A conceptual sketch of the gating step is given below, after the feature list.

It has the following features:
- Apply AnyMoE to any supported model:
  - `plain`
- Specify the layers to apply AnyMoE to for efficient training

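The gist of the technique: a small trainable gating layer scores each expert for the incoming hidden state, and the expert MLP outputs are mixed according to those (softmaxed, optionally top-k filtered) scores. The snippet below is only a conceptual sketch of that mixing step on plain `Vec<f32>` values with invented names; the actual gating in mistral.rs operates on tensors inside the model's layers.

```rust
// Conceptual sketch of MoE gating: softmax the gate scores, then take a
// weighted sum of the expert MLP outputs. No top-k filtering is shown here.

fn softmax(logits: &[f32]) -> Vec<f32> {
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&l| (l - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// Mix the experts' outputs for one token. `gate_logits` has one score per
/// expert; `expert_outputs[i]` is expert i's MLP output for that token.
fn moe_forward(gate_logits: &[f32], expert_outputs: &[Vec<f32>]) -> Vec<f32> {
    let weights = softmax(gate_logits);
    let mut mixed = vec![0.0f32; expert_outputs[0].len()];
    for (w, expert) in weights.iter().zip(expert_outputs) {
        for (m, e) in mixed.iter_mut().zip(expert) {
            *m += w * e;
        }
    }
    mixed
}

fn main() {
    // Two toy "experts" and a gate that slightly prefers the second one.
    let expert_outputs = vec![vec![1.0, 0.0, 0.0], vec![0.0, 1.0, 0.0]];
    println!("{:?}", moe_forward(&[0.1, 0.9], &expert_outputs));
}
```

Training against the labeled prompts in the dataset described below is what teaches the gate which expert to prefer for a given kind of input.
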
## Dataset
Currently, AnyMoE expects a CSV dataset with 2 columns: `prompt` and `expert`. For example:
```csv
prompt,expert
Discuss the impact of Renaissance art on modern aesthetics,0
Explain the significance of the theory of relativity in modern physics,1
Analyze the themes of existentialism in 20th-century literature,0
Describe the process of photosynthesis and its importance to ecosystems,1
Evaluate the role of classical music in contemporary film scores,0
Outline the steps of the scientific method and their importance in experiments,1
Compare and contrast the philosophies of Socrates and Nietzsche,0
Discuss the ethical implications of artificial intelligence in society,1
Interpret the symbolism in Salvador Dalí's paintings,0
Describe the function and structure of DNA in genetic inheritance,1
```
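
If you want to sanity-check a dataset before training, it can be read with the `csv` crate. This is purely an illustrative sketch (the path and the two-column `prompt,expert` layout match the example above, but none of this is part of the mistral.rs API):

```rust
use csv::Reader;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // The header row (`prompt,expert`) is skipped automatically.
    let mut reader = Reader::from_path("examples/amoe.csv")?;
    for record in reader.records() {
        let record = record?;
        let prompt = &record[0];
        let expert: usize = record[1].parse()?;
        println!("expert {expert}: {prompt}");
    }
    Ok(())
}
```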

## Experts
AnyMoE experts can be either fine-tuned models or LoRA adapter models. Only the MLP layers are loaded from each expert. The experts must be homogeneous: all of them must be fine-tuned models, or all of them must be LoRA adapters. Additionally, you can specify which layers AnyMoE is applied to.

> Note: When using LoRA adapter experts, restricting the layers where AnyMoE is applied may be unnecessary, since these experts use less memory.

### Example of TOML selector with fine-tuned experts
```toml
[model]
model_id = "mistralai/Mistral-7B-Instruct-v0.1"
arch = "mistral"

[anymoe]
dataset_csv = "examples/amoe.csv"
prefix = "model.layers"
mlp = "mlp"
model_ids = ["HuggingFaceH4/zephyr-7b-beta"]
layers = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]

[anymoe.config]
hidden_size = 4096
expert_type = "fine_tuned"
```

### Example of TOML selector with LoRA adapter experts
```toml
[model]
model_id = "HuggingFaceH4/zephyr-7b-beta"
arch = "mistral"

[anymoe]
dataset_csv = "examples/amoe.csv"
prefix = "model.layers"
mlp = "mlp"
model_ids = ["EricB/example_adapter"]
layers = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]

[anymoe.config]
hidden_size = 4096

[anymoe.config.expert_type.lora_adapter]
rank = 16
alpha = 16
target_modules = ["gate_proj"]
```
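
Here, `rank` and `alpha` carry their usual LoRA meaning: each targeted projection `W` receives a low-rank update `B·A` scaled by `alpha / rank`, so with `rank = alpha = 16` the update is applied at scale 1. The sketch below shows that standard formulation on plain vectors; it is an assumption that mistral.rs follows this conventional scaling when building LoRA-based experts, and all names here are illustrative.

```rust
/// Standard LoRA forward for one linear layer:
/// y = W·x + (alpha / rank) · B·(A·x)
/// Matrices are stored row-major as slices of rows:
/// W: out_dim x in_dim, A: rank x in_dim, B: out_dim x rank.
fn lora_forward(
    w: &[Vec<f32>],
    a: &[Vec<f32>],
    b: &[Vec<f32>],
    alpha: f32,
    x: &[f32],
) -> Vec<f32> {
    let matvec = |m: &[Vec<f32>], v: &[f32]| -> Vec<f32> {
        m.iter()
            .map(|row| row.iter().zip(v).map(|(r, v)| r * v).sum())
            .collect()
    };
    let base = matvec(w, x); // W·x
    let low = matvec(a, x); // A·x, length = rank
    let delta = matvec(b, low.as_slice()); // B·(A·x)
    let scale = alpha / a.len() as f32;
    base.iter().zip(&delta).map(|(y, d)| y + scale * d).collect()
}

fn main() {
    // Toy 2x2 example: identity W, rank 1, alpha = 1, so scale = 1.
    let w = vec![vec![1.0, 0.0], vec![0.0, 1.0]];
    let a = vec![vec![1.0, 1.0]]; // 1 x 2
    let b = vec![vec![0.5], vec![0.5]]; // 2 x 1
    println!("{:?}", lora_forward(&w, &a, &b, 1.0, &[1.0, 2.0]));
}
```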

## Examples

### `mistralrs-server`

CLI usage is via the [TOML selector](TOML_SELECTOR.md#anymoe) where you can also find docs on the required fields.

For example, to use the demo fine-tuned expert:
```
./mistralrs_server -i toml -f toml-selectors/anymoe.toml
```

To use the demo LoRA expert:
```
./mistralrs_server -i toml -f toml-selectors/anymoe_lora.toml
```

### Python example
```py
from mistralrs import (
Runner,
Which,
ChatCompletionRequest,
Architecture,
AnyMoeConfig,
AnyMoeExpertType,
)

runner = Runner(
which=Which.Plain(
model_id="mistralai/Mistral-7B-Instruct-v0.1",
tokenizer_json=None,
repeat_last_n=64,
arch=Architecture.Mistral,
),
anymoe_config=AnyMoeConfig(
hidden_size=4096,
dataset_csv="examples/amoe.csv",
prefix="model.layers",
mlp="mlp",
expert_type=AnyMoeExpertType.FineTuned(),
lr=1e-3,
epochs=100,
batch_size=4,
model_ids=["HuggingFaceH4/zephyr-7b-beta"],
),
)

res = runner.send_chat_completion_request(
ChatCompletionRequest(
model="mistral",
messages=[
{"role": "user", "content": "Tell me a story about the Rust type system."}
],
max_tokens=256,
presence_penalty=1.0,
top_p=0.1,
temperature=0.1,
)
)
print(res.choices[0].message.content)
print(res.usage)
```
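
In this configuration, `lr`, `epochs`, and `batch_size` are the hyperparameters for the gate pretraining pass over `dataset_csv`, while `prefix` and `mlp` tell AnyMoE where to find the MLP blocks in the base model's weight tree (here the usual `model.layers.<n>.mlp` layout of Mistral-style models).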

### Rust API
```rust
use either::Either;
use indexmap::IndexMap;
use std::sync::Arc;
use tokio::sync::mpsc::channel;

use mistralrs::{
AnyMoeConfig, AnyMoeExpertType, AnyMoeLoader, Constraint, Device, DeviceMapMetadata, Loader,
MistralRs, MistralRsBuilder, ModelDType, NormalLoaderBuilder, NormalLoaderType, NormalRequest,
NormalSpecificConfig, Request, RequestMessage, Response, Result, SamplingParams,
SchedulerMethod, TokenSource,
};

/// Gets the best device: Metal when compiled with the `metal` feature, otherwise CUDA if available, falling back to CPU.
pub(crate) fn best_device() -> Result<Device> {
#[cfg(not(feature = "metal"))]
{
Device::cuda_if_available(0)
}
#[cfg(feature = "metal")]
{
Device::new_metal(0)
}
}

fn setup() -> anyhow::Result<Arc<MistralRs>> {
// Select a Mistral model
let loader = NormalLoaderBuilder::new(
NormalSpecificConfig {
use_flash_attn: false,
repeat_last_n: 64,
},
None,
None,
Some("mistralai/Mistral-7B-Instruct-v0.1".to_string()),
)
.build(NormalLoaderType::Mistral);
let loader: Box<dyn Loader> = Box::new(AnyMoeLoader {
target: loader,
config: AnyMoeConfig {
hidden_size: 4096,
lr: 1e-3,
epochs: 100,
batch_size: 4,
expert_type: AnyMoeExpertType::LoraAdapter {
rank: 64,
alpha: 16.,
target_modules: vec![
"gate_proj".to_string(),
],
},
},
prefix: "model.layers".to_string(),
mlp: "mlp".to_string(),
path: "examples/amoe.csv".to_string(),
model_ids: vec!["typeof/zephyr-7b-beta-lora".to_string()],
layers: vec![],
});
    // Load the model into a Pipeline
let pipeline = loader.load_model_from_hf(
None,
TokenSource::CacheToken,
&ModelDType::Auto,
&best_device()?,
false,
DeviceMapMetadata::dummy(),
None,
)?;
// Create the MistralRs, which is a runner
Ok(MistralRsBuilder::new(pipeline, SchedulerMethod::Fixed(5.try_into().unwrap())).build())
}
```
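
The `setup` function above only builds the runner. Below is a rough sketch of sending a chat request through it, modeled on other mistral.rs Rust examples from around the same release; the exact `NormalRequest` fields, `Response` variants, and `get_sender` signature are assumptions and may differ slightly between versions.

```rust
fn main() -> anyhow::Result<()> {
    let mistralrs = setup()?;

    // Build a one-message chat request; the struct fields below are assumed
    // from contemporaneous mistral.rs examples.
    let (tx, mut rx) = channel(10_000);
    let request = Request::Normal(NormalRequest {
        messages: RequestMessage::Chat(vec![IndexMap::from([
            ("role".to_string(), Either::Left("user".to_string())),
            (
                "content".to_string(),
                Either::Left("Tell me a story about the Rust type system.".to_string()),
            ),
        ])]),
        sampling_params: SamplingParams::default(),
        response: tx,
        return_logprobs: false,
        is_streaming: false,
        id: 0,
        constraint: Constraint::None,
        suffix: None,
        adapters: None,
    });
    // In some versions `get_sender()` returns a `Result` and needs a `?`.
    mistralrs.get_sender().blocking_send(request).expect("send failed");

    match rx.blocking_recv().unwrap() {
        Response::Done(completion) => {
            println!("{}", completion.choices[0].message.content)
        }
        _ => anyhow::bail!("unexpected response type"),
    }
    Ok(())
}
```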
