AnyMoE: Build an MoE model from anything, quickly (#476)
* Add gating layer and some infrastructure

* Add traits to the pipeline

* Add the training loop

* Remove unused

* Add loader and pipeline

* Complete merge

* Move method

* Add a default for anymoeconfig

* Add training support

* Expose in toml selector

* Inject anymoe layers

* Load pretraining dataset from csv

* Run the training

* Add a csv file

* Add default dtype option

* Fix lin varmap

* Add some debugs and fix

* Template it

* Fix assert condition

* To scalar

* Take cached outputs

* It doesn't oom

* Remove debugs

* Nice progress bar

* Nice progress bar

* Add anymoe support to plain models

* Add get mlps and get mlps mut layers

* Check if supported

* Clippy and slightly more info

* Load the mlps into vbs

* Add support for loading experts

* Update toml selector

* Remove deadlock

* Add support for selecting only certain layers

* Default

* Fix off by one

* Handle it correctly

* Check if is moe layer

* Update csv training set and add moe layers to toml

* Remove unnecessary training infra

* Done training

* Fix trainable params calculation

* More consistent naming

* Add support for loading from lora experts

* Fix the toml files

* Get in and out dims

* Add amoe support for gemma2

* More info

* Complete merge

* Add some docs

* Add example adapter

* Handle target modules

* Fix toml selector

* Add topk

* Correctly gate

* Correctly gate

* Fix clippy

* Fix training

* Fix typos

* Fix scale

* Use the zephyr lora adapter

* Change the base model

* Typo

* Add to the python api

* Update the type stubs

* Update readme

* Add some examples

* Add some examples and docs

* Update docs

* Update example to use layers

* Take into account silent loading

* Clippy

* Update example lora model id

* Update target modules

* Remove multiple target modules from examples

* Update readme

* Update readme

* Missed anymoe lora rust example
EricLBuehler committed Jul 1, 2024
1 parent 21ed180 commit a3c8eaa
Showing 58 changed files with 3,369 additions and 215 deletions.
1 change: 1 addition & 0 deletions Cargo.lock


67 changes: 45 additions & 22 deletions README.md
@@ -26,18 +26,25 @@ Please submit requests for new models [here](https://github.com/EricLBuehler/mis
- [OpenAI compatible HTTP server](examples/http.md)

## Quick examples
- 💎 Run the Gemma 2 model

*After following installation instructions*

- 🔥 AnyMoE: Build an MoE model quickly from anything, [docs here](docs/ANYMOE.md)

```
./mistralrs_server -i toml -f toml-selectors/anymoe_lora.toml
```

Paper: https://arxiv.org/abs/2405.19076

- 💎 Run the Gemma 2 model

```
./mistralrs_server -i plain -m google/gemma-2-9b-it -a gemma2
```

- φ³ Run the Phi 3 model with 128K context window

*After following installation instructions*

```
./mistralrs_server -i plain -m microsoft/Phi-3-mini-128k-instruct -a phi3
```
@@ -47,8 +54,6 @@ Please submit requests for new models [here](https://github.com/EricLBuehler/mis
<img src="https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg" alt="Mount Washington" width = "400" height = "267">
<h6><a href = "https://www.nhmagazine.com/mount-washington/">Credit</a></h6>

*After following installation instructions*

```
./mistralrs_server --port 1234 vision-plain -m microsoft/Phi-3-vision-128k-instruct -a phi3v
```
@@ -84,6 +89,7 @@ Mistral.rs supports several model categories:
- First X-LoRA inference platform with first class support.
- Speculative Decoding: Mix supported models as the draft model or the target model
- Dynamic LoRA adapter swapping at runtime with adapter preloading: [examples and docs](docs/ADAPTER_MODELS.md#adapter-model-dynamic-adapter-activation)
- AnyMoE: Build an MoE model from anything, quickly: [docs](docs/ANYMOE.md)


This is a demo of interactive mode with streaming running Phi 3 128k mini with quantization via ISQ to Q4K.
@@ -97,18 +103,18 @@ https://github.com/EricLBuehler/mistral.rs/assets/65165915/09d9a30f-1e22-4b9a-90

> Note: See [supported models](#supported-models) for more information
|Model|Supports quantization|Supports adapters|Supports device mapping|
|--|--|--|--|
|Mistral v0.1/v0.2/v0.3||||
|Gemma||||
|Llama 2/3||||
|Mixtral||||
|Phi 2||||
|Phi 3||||
|Qwen 2|| ||
|Phi 3 Vision|| ||
|Idefics 2|| ||
|Gemma 2||||
|Model|Supports quantization|Supports adapters|Supports device mapping|Supported by AnyMoE|
|--|--|--|--|--|
|Mistral v0.1/v0.2/v0.3|||||
|Gemma|||||
|Llama 2/3|||||
|Mixtral|||| |
|Phi 2|||||
|Phi 3|||||
|Qwen 2|| |||
|Phi 3 Vision|| || |
|Idefics 2|| || |
|Gemma 2|||||

## APIs and Integrations

@@ -422,15 +428,16 @@ Example:
**Quantization support**
|Model|GGUF|GGML|ISQ|
|--|--|--|--|
|Mistral 7B || ||
|Mistral|| ||
|Gemma| | ||
|Llama||||
|Mixtral 8x7B|| ||
|Mixtral|| ||
|Phi 2|| ||
|Phi 3|| ||
|Qwen 2| | ||
|Phi 3 Vision| | ||
|Idefics 2| | ||
|Gemma 2| | ||

**Device mapping support**
|Model category|Supported|
@@ -443,15 +450,31 @@ Example:
**X-LoRA and LoRA support**
|Model|X-LoRA|X-LoRA+GGUF|X-LoRA+GGML|
|--|--|--|--|
|Mistral 7B ||| |
|Mistral||| |
|Gemma|| | |
|Llama||||
|Mixtral 8x7B||| |
|Mixtral||| |
|Phi 2|| | |
|Phi 3||| |
|Qwen 2| | | |
|Phi 3 Vision| | | |
|Idefics 2| | | |
|Gemma 2|| | |

**AnyMoE support**
|Model|AnyMoE|
|--|--|
|Mistral 7B||
|Gemma||
|Llama||
|Mixtral||
|Phi 2||
|Phi 3||
|Qwen 2||
|Phi 3 Vision| |
|Idefics 2| |
|Gemma 2||


### Using derivative model

206 changes: 206 additions & 0 deletions docs/ANYMOE.md
@@ -0,0 +1,206 @@
# AnyMoE: Build an MoE model from anything, quickly

AnyMoE is a technique to dynamically and efficiently create mixture-of-experts (MoE) models. By providing a set of experts and a small pretraining dataset, you can create an MoE locally! A conceptual sketch of the gating step is given below, after the feature list.

It has the following features:
- Apply AnyMoE to any supported model:
  - `plain`
- Specify the layers to apply AnyMoE to for efficient training

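The gist of the technique: a small trainable gating layer scores each expert for the incoming hidden state, and the expert MLP outputs are mixed according to those (softmaxed, optionally top-k filtered) scores. The snippet below is only a conceptual sketch of that mixing step on plain `Vec<f32>` values with invented names; the actual gating in mistral.rs operates on tensors inside the model's layers.

```rust
// Conceptual sketch of MoE gating: softmax the gate scores, then take a
// weighted sum of the expert MLP outputs. No top-k filtering is shown here.

fn softmax(logits: &[f32]) -> Vec<f32> {
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&l| (l - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// Mix the experts' outputs for one token. `gate_logits` has one score per
/// expert; `expert_outputs[i]` is expert i's MLP output for that token.
fn moe_forward(gate_logits: &[f32], expert_outputs: &[Vec<f32>]) -> Vec<f32> {
    let weights = softmax(gate_logits);
    let mut mixed = vec![0.0f32; expert_outputs[0].len()];
    for (w, expert) in weights.iter().zip(expert_outputs) {
        for (m, e) in mixed.iter_mut().zip(expert) {
            *m += w * e;
        }
    }
    mixed
}

fn main() {
    // Two toy "experts" and a gate that slightly prefers the second one.
    let expert_outputs = vec![vec![1.0, 0.0, 0.0], vec![0.0, 1.0, 0.0]];
    println!("{:?}", moe_forward(&[0.1, 0.9], &expert_outputs));
}
```

Training against the labeled prompts in the dataset described below is what teaches the gate which expert to prefer for a given kind of input.
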
## Dataset
Currently, AnyMoE expects a CSV dataset with 2 columns: `prompt` and `expert`. For example:
```csv
prompt,expert
Discuss the impact of Renaissance art on modern aesthetics,0
Explain the significance of the theory of relativity in modern physics,1
Analyze the themes of existentialism in 20th-century literature,0
Describe the process of photosynthesis and its importance to ecosystems,1
Evaluate the role of classical music in contemporary film scores,0
Outline the steps of the scientific method and their importance in experiments,1
Compare and contrast the philosophies of Socrates and Nietzsche,0
Discuss the ethical implications of artificial intelligence in society,1
Interpret the symbolism in Salvador Dalí's paintings,0
Describe the function and structure of DNA in genetic inheritance,1
```
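
If you want to sanity-check a dataset before training, it can be read with the `csv` crate. This is purely an illustrative sketch (the path and the two-column `prompt,expert` layout match the example above, but none of this is part of the mistral.rs API):

```rust
use csv::Reader;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // The header row (`prompt,expert`) is skipped automatically.
    let mut reader = Reader::from_path("examples/amoe.csv")?;
    for record in reader.records() {
        let record = record?;
        let prompt = &record[0];
        let expert: usize = record[1].parse()?;
        println!("expert {expert}: {prompt}");
    }
    Ok(())
}
```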

## Experts
AnyMoE experts can be either fine-tuned models or LoRA adapter models. Only the MLP layers are loaded from each expert. The experts must be homogeneous: all of them must be fine-tuned models, or all of them must be LoRA adapters. Additionally, you can specify which layers AnyMoE is applied to.

> Note: When using LoRA adapter experts, restricting the layers where AnyMoE is applied may be unnecessary, since these experts use less memory.

### Example of TOML selector with fine-tuned experts
```toml
[model]
model_id = "mistralai/Mistral-7B-Instruct-v0.1"
arch = "mistral"

[anymoe]
dataset_csv = "examples/amoe.csv"
prefix = "model.layers"
mlp = "mlp"
model_ids = ["HuggingFaceH4/zephyr-7b-beta"]
layers = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]

[anymoe.config]
hidden_size = 4096
expert_type = "fine_tuned"
```

### Example of TOML selector with LoRA adapter experts
```toml
[model]
model_id = "HuggingFaceH4/zephyr-7b-beta"
arch = "mistral"

[anymoe]
dataset_csv = "examples/amoe.csv"
prefix = "model.layers"
mlp = "mlp"
model_ids = ["EricB/example_adapter"]
layers = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]

[anymoe.config]
hidden_size = 4096

[anymoe.config.expert_type.lora_adapter]
rank = 16
alpha = 16
target_modules = ["gate_proj"]
```
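
Here, `rank` and `alpha` carry their usual LoRA meaning: each targeted projection `W` receives a low-rank update `B·A` scaled by `alpha / rank`, so with `rank = alpha = 16` the update is applied at scale 1. The sketch below shows that standard formulation on plain vectors; it is an assumption that mistral.rs follows this conventional scaling when building LoRA-based experts, and all names here are illustrative.

```rust
/// Standard LoRA forward for one linear layer:
/// y = W·x + (alpha / rank) · B·(A·x)
/// Matrices are stored row-major as slices of rows:
/// W: out_dim x in_dim, A: rank x in_dim, B: out_dim x rank.
fn lora_forward(
    w: &[Vec<f32>],
    a: &[Vec<f32>],
    b: &[Vec<f32>],
    alpha: f32,
    x: &[f32],
) -> Vec<f32> {
    let matvec = |m: &[Vec<f32>], v: &[f32]| -> Vec<f32> {
        m.iter()
            .map(|row| row.iter().zip(v).map(|(r, v)| r * v).sum())
            .collect()
    };
    let base = matvec(w, x); // W·x
    let low = matvec(a, x); // A·x, length = rank
    let delta = matvec(b, low.as_slice()); // B·(A·x)
    let scale = alpha / a.len() as f32;
    base.iter().zip(&delta).map(|(y, d)| y + scale * d).collect()
}

fn main() {
    // Toy 2x2 example: identity W, rank 1, alpha = 1, so scale = 1.
    let w = vec![vec![1.0, 0.0], vec![0.0, 1.0]];
    let a = vec![vec![1.0, 1.0]]; // 1 x 2
    let b = vec![vec![0.5], vec![0.5]]; // 2 x 1
    println!("{:?}", lora_forward(&w, &a, &b, 1.0, &[1.0, 2.0]));
}
```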

## Examples

### `mistralrs-server`

CLI usage is via the [TOML selector](TOML_SELECTOR.md#anymoe) where you can also find docs on the required fields.

For example, to use the demo fine-tuned expert:
```
./mistralrs_server -i toml -f toml-selectors/anymoe.toml
```

To use the demo LoRA expert:
```
./mistralrs_server -i toml -f toml-selectors/anymoe_lora.toml
```

### Python example
```py
from mistralrs import (
Runner,
Which,
ChatCompletionRequest,
Architecture,
AnyMoeConfig,
AnyMoeExpertType,
)

runner = Runner(
which=Which.Plain(
model_id="mistralai/Mistral-7B-Instruct-v0.1",
tokenizer_json=None,
repeat_last_n=64,
arch=Architecture.Mistral,
),
anymoe_config=AnyMoeConfig(
hidden_size=4096,
dataset_csv="examples/amoe.csv",
prefix="model.layers",
mlp="mlp",
expert_type=AnyMoeExpertType.FineTuned(),
lr=1e-3,
epochs=100,
batch_size=4,
model_ids=["HuggingFaceH4/zephyr-7b-beta"],
),
)

res = runner.send_chat_completion_request(
ChatCompletionRequest(
model="mistral",
messages=[
{"role": "user", "content": "Tell me a story about the Rust type system."}
],
max_tokens=256,
presence_penalty=1.0,
top_p=0.1,
temperature=0.1,
)
)
print(res.choices[0].message.content)
print(res.usage)
```
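
In this configuration, `lr`, `epochs`, and `batch_size` are the hyperparameters for the gate pretraining pass over `dataset_csv`, while `prefix` and `mlp` tell AnyMoE where to find the MLP blocks in the base model's weight tree (here the usual `model.layers.<n>.mlp` layout of Mistral-style models).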

### Rust API
```rust
use either::Either;
use indexmap::IndexMap;
use std::sync::Arc;
use tokio::sync::mpsc::channel;

use mistralrs::{
AnyMoeConfig, AnyMoeExpertType, AnyMoeLoader, Constraint, Device, DeviceMapMetadata, Loader,
MistralRs, MistralRsBuilder, ModelDType, NormalLoaderBuilder, NormalLoaderType, NormalRequest,
NormalSpecificConfig, Request, RequestMessage, Response, Result, SamplingParams,
SchedulerMethod, TokenSource,
};

/// Gets the best device: Metal when compiled with the `metal` feature, otherwise CUDA if available, falling back to CPU.
pub(crate) fn best_device() -> Result<Device> {
#[cfg(not(feature = "metal"))]
{
Device::cuda_if_available(0)
}
#[cfg(feature = "metal")]
{
Device::new_metal(0)
}
}

fn setup() -> anyhow::Result<Arc<MistralRs>> {
// Select a Mistral model
let loader = NormalLoaderBuilder::new(
NormalSpecificConfig {
use_flash_attn: false,
repeat_last_n: 64,
},
None,
None,
Some("mistralai/Mistral-7B-Instruct-v0.1".to_string()),
)
.build(NormalLoaderType::Mistral);
let loader: Box<dyn Loader> = Box::new(AnyMoeLoader {
target: loader,
config: AnyMoeConfig {
hidden_size: 4096,
lr: 1e-3,
epochs: 100,
batch_size: 4,
expert_type: AnyMoeExpertType::LoraAdapter {
rank: 64,
alpha: 16.,
target_modules: vec![
"gate_proj".to_string(),
],
},
},
prefix: "model.layers".to_string(),
mlp: "mlp".to_string(),
path: "examples/amoe.csv".to_string(),
model_ids: vec!["typeof/zephyr-7b-beta-lora".to_string()],
layers: vec![],
});
    // Load the model into a Pipeline
let pipeline = loader.load_model_from_hf(
None,
TokenSource::CacheToken,
&ModelDType::Auto,
&best_device()?,
false,
DeviceMapMetadata::dummy(),
None,
)?;
// Create the MistralRs, which is a runner
Ok(MistralRsBuilder::new(pipeline, SchedulerMethod::Fixed(5.try_into().unwrap())).build())
}
```
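
The `setup` function above only builds the runner. Below is a rough sketch of sending a chat request through it, modeled on other mistral.rs Rust examples from around the same release; the exact `NormalRequest` fields, `Response` variants, and `get_sender` signature are assumptions and may differ slightly between versions.

```rust
fn main() -> anyhow::Result<()> {
    let mistralrs = setup()?;

    // Build a one-message chat request; the struct fields below are assumed
    // from contemporaneous mistral.rs examples.
    let (tx, mut rx) = channel(10_000);
    let request = Request::Normal(NormalRequest {
        messages: RequestMessage::Chat(vec![IndexMap::from([
            ("role".to_string(), Either::Left("user".to_string())),
            (
                "content".to_string(),
                Either::Left("Tell me a story about the Rust type system.".to_string()),
            ),
        ])]),
        sampling_params: SamplingParams::default(),
        response: tx,
        return_logprobs: false,
        is_streaming: false,
        id: 0,
        constraint: Constraint::None,
        suffix: None,
        adapters: None,
    });
    // In some versions `get_sender()` returns a `Result` and needs a `?`.
    mistralrs.get_sender().blocking_send(request).expect("send failed");

    match rx.blocking_recv().unwrap() {
        Response::Done(completion) => {
            println!("{}", completion.choices[0].message.content)
        }
        _ => anyhow::bail!("unexpected response type"),
    }
    Ok(())
}
```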
