## Install Mergoo

In [None]:
!pip install mergoo

## Create Mergoo-MOE Checkpoint

**Selecting Experts:**  

You can easily merge Phi3-based LLM experts. In this example, we merge the base model with two fine-tuned experts:

- [microsoft/Phi-3-mini-128k-instruct](https://huggingface.co/microsoft/Phi-3-mini-128k-instruct): Base generic Phi3 model.
- [RDson/Phi-3-mini-code-finetune-128k-instruct-v1](https://huggingface.co/RDson/Phi-3-mini-code-finetune-128k-instruct-v1): Phi3-based LLM, fine-tuned on an instruction-based dataset for coding.  
- [NickyNicky/Phi-3-mini-128k-instruct_function](https://huggingface.co/NickyNicky/Phi-3-mini-128k-instruct_function): Phi3-based LLM, fine-tuned for function calling.  

**Preparing Config:**
- `model_type`: llama/mistral/bert/phi3. This is the base model family of the experts. At the moment, all the experts should come from the same base model family.
- `num_experts_per_tok`: Number of experts that are active for each token. The router selects these experts sparsely.
- `experts`: List of dictionaries describing the seed models to merge. For each expert, `model_id` is mandatory. The `model_id` can be either a local path or a Hugging Face model id.
- `router_layers`: These are the layer names that would be replaced with MOE layers. Weights of the rest of the layers are aggregated using averaging. In the future, we will support multiple aggregation methods from MergeKit.
- `router_layers_index`: List of transformer block indexes. The matching layers in these blocks are converted to MOE. By default, `router_layers_index` is empty, meaning the MOE conversion is applied to every block in which a layer matches a `router_layers` identifier. Use `[None]` when no MOE layer should be kept, following the [BTM](https://arxiv.org/abs/2208.03306) architecture.
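
To make the interplay between `router_layers` and `router_layers_index` concrete, here is a minimal sketch of the selection rule described above, together with the averaging applied to the remaining layers. It is not Mergoo's implementation: the helpers `is_moe_layer` and `average_weights`, and the `layers.<index>.` pattern used to read the block index from a parameter name, are assumptions made for illustration.

```python
import re

def is_moe_layer(param_name, router_layers, router_layers_index=()):
    """Illustrative rule: would this parameter be converted to an MOE layer?"""
    # The layer name must match one of the router_layers identifiers.
    if not any(identifier in param_name for identifier in router_layers):
        return False
    # An empty index list means "apply to every transformer block".
    if not router_layers_index:
        return True
    # Assumed naming scheme: "...layers.<block index>...."
    match = re.search(r"layers\.(\d+)\.", param_name)
    if match is None:
        return False
    # With [None], no integer index is ever a member, so nothing is converted.
    return int(match.group(1)) in router_layers_index

def average_weights(expert_weights):
    """Element-wise mean of same-shaped weight vectors, one vector per expert."""
    n_experts = len(expert_weights)
    return [sum(values) / n_experts for values in zip(*expert_weights)]

routers = ["gate_up_proj", "down_proj"]
print(is_moe_layer("model.layers.3.mlp.down_proj.weight", routers))          # True
print(is_moe_layer("model.layers.3.mlp.down_proj.weight", routers, [0, 1]))  # False
print(is_moe_layer("model.layers.3.mlp.down_proj.weight", routers, [None]))  # False
print(average_weights([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]))                 # [3.0, 4.0]
```

Layers for which `is_moe_layer` returns `False` are the ones that would be merged by averaging in this sketch.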

In [None]:
import torch
from mergoo.compose_experts import ComposeExperts

model_id = "data/checkpoint_demo"  # local path where the composed checkpoint is saved
config = \
{
    "model_type": "phi3",
    "num_experts_per_tok": 2,
    "experts":[
        {
            "expert_name" : "base_expert",
            "model_id" : "microsoft/Phi-3-mini-128k-instruct"
        },
        {
            "expert_name" : "expert_1",
            "model_id" : "RDson/Phi-3-mini-code-finetune-128k-instruct-v1",
        },
        {
            "expert_name" : "expert_2",
            "model_id" : "NickyNicky/Phi-3-mini-128k-instruct_function",
        },
    ],
    "router_layers":[
        "gate_up_proj",
        "down_proj",
    ],
}
# merge the experts and save the composed checkpoint
expertmerger = ComposeExperts(config, torch_dtype=torch.float16)
expertmerger.compose()
expertmerger.save_checkpoint(model_id)

## Training

Now that we have created an MOE checkpoint, all of its layers are pretrained except for the gating/routing layers that we added. The routing layer selects the top K experts; in our case, K=2. We support the HuggingFace trainers `Trainer` and `SFTTrainer`. In this example, we fine-tune on the Python_code_instructions_18k_alpaca dataset. We train only the router layers and keep all other layers frozen.
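
As a rough illustration of what top-K routing means, the sketch below routes a single token in plain Python. It is a simplified stand-in, not Mergoo's code: the function name `top_k_route`, the choice to apply the softmax over only the selected experts, and the toy numbers are all assumptions.

```python
import math

def top_k_route(router_logits, expert_outputs, k=2):
    """Mix the outputs of the k highest-scoring experts for one token."""
    # Indexes of the k largest router scores (ties keep their original order).
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax restricted to the selected experts, shifted for numerical stability.
    max_logit = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - max_logit) for i in top]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the selected experts' output vectors.
    dim = len(expert_outputs[0])
    mixed = [sum(w * expert_outputs[i][d] for w, i in zip(weights, top))
             for d in range(dim)]
    return top, weights, mixed

# Three experts, as in the config above; only two are active for this token.
logits = [2.0, 0.5, 2.0]
outputs = [[1.0, 0.0], [5.0, 5.0], [3.0, 2.0]]
top, weights, mixed = top_k_route(logits, outputs, k=2)
print(top, weights, mixed)  # [0, 2] [0.5, 0.5] [2.0, 1.0]
```

The expert with the lowest score contributes nothing to the output, which is what makes the computation sparse. The router scores themselves come from the untrained gating layers, which is why those layers need training.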

In [7]:
# load the composed checkpoint
import torch
from mergoo.models.modeling_phi3 import Phi3ForCausalLM

model = Phi3ForCausalLM.from_pretrained(
    model_id, 
    device_map="auto", 
    torch_dtype=torch.bfloat16,
)  # the 'gate' (router) layers are untrained, so a loading warning is expected for them

Some weights of the model checkpoint at data/checkpoint_demo were not used when initializing Phi3ForCausalLM: ['model.layers.0.mlp.down_proj.weight', 'model.layers.0.mlp.gate_up_proj.weight', 'model.layers.1.mlp.down_proj.weight', 'model.layers.1.mlp.gate_up_proj.weight', 'model.layers.10.mlp.down_proj.weight', 'model.layers.10.mlp.gate_up_proj.weight', 'model.layers.11.mlp.down_proj.weight', 'model.layers.11.mlp.gate_up_proj.weight', 'model.layers.12.mlp.down_proj.weight', 'model.layers.12.mlp.gate_up_proj.weight', 'model.layers.13.mlp.down_proj.weight', 'model.layers.13.mlp.gate_up_proj.weight', 'model.layers.14.mlp.down_proj.weight', 'model.layers.14.mlp.gate_up_proj.weight', 'model.layers.15.mlp.down_proj.weight', 'model.layers.15.mlp.gate_up_proj.weight', 'model.layers.16.mlp.down_proj.weight', 'model.layers.16.mlp.gate_up_proj.weight', 'model.layers.17.mlp.down_proj.weight', 'model.layers.17.mlp.gate_up_proj.weight', 'model.layers.18.mlp.down_proj.weight', 'model.layers.18.mlp.ga

In [8]:
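# train only the router (gating) layers; freeze everything else
# note: the substring "gate" also occurs in "gate_up_proj", so parameters
# under that layer name are left trainable by this check as well
n_weights, n_frozen_weights = 0, 0
for name, weight in model.named_parameters():
    if "gate" not in name:
        weight.requires_grad_(False)
        n_frozen_weights += 1  # counts the parameters that were frozen
    n_weights += 1
n_weights, n_frozen_weights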

(387, 227)
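
Because the check above is a plain substring match, it is worth confirming which parameters actually stay trainable. The sketch below compares the substring match with a stricter suffix match on a handful of made-up parameter names. The `.gate.weight` suffix for router weights and the `experts.<i>` naming are assumptions about how the composed checkpoint names its parameters, so inspect `model.named_parameters()` before relying on them.

```python
# Hypothetical parameter names, for illustration only.
names = [
    "model.layers.0.self_attn.o_proj.weight",
    "model.layers.0.mlp.gate_up_proj.gate.weight",
    "model.layers.0.mlp.gate_up_proj.experts.0.weight",
    "model.layers.0.mlp.down_proj.gate.weight",
    "model.layers.0.mlp.down_proj.experts.0.weight",
]

def matches_substring(name):
    """The check used in the cell above: any occurrence of 'gate'."""
    return "gate" in name

def is_router_weight(name):
    """Stricter check under the assumed '<layer>.gate.weight' naming."""
    return name.endswith(".gate.weight")

substring_hits = [n for n in names if matches_substring(n)]
router_only = [n for n in names if is_router_weight(n)]
print(len(substring_hits), len(router_only))  # 3 2
```

In this toy list the substring match also keeps the `gate_up_proj` expert weight trainable, while the suffix match keeps only the two router weights.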

In [11]:
import datasets
import random

dataset = datasets.load_dataset("iamtarun/python_code_instructions_18k_alpaca")['train']
prompts = list(dataset['prompt'])  # copy into a plain list so it can be shuffled in place
random.shuffle(prompts)
# hold out the last 1,000 shuffled prompts for evaluation
dataset_train = datasets.Dataset.from_dict(dict(prompt=prompts[:-1000]))
dataset_test = datasets.Dataset.from_dict(dict(prompt=prompts[-1000:]))

In [12]:
dataset_train, dataset_test

(Dataset({
     features: ['prompt'],
     num_rows: 17612
 }),
 Dataset({
     features: ['prompt'],
     num_rows: 1000
 }))
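
The split above changes on every run because the shuffle is unseeded. If a reproducible split is preferred, one option is to shuffle with a seeded generator, as in this small sketch. The helper `split_prompts` and the seed value are illustrative and not part of the original notebook.

```python
import random

def split_prompts(prompts, n_test, seed=0):
    """Shuffle a copy of the prompts with a fixed seed and hold out n_test items."""
    if not 0 < n_test < len(prompts):
        raise ValueError("n_test must be between 1 and len(prompts) - 1")
    shuffled = list(prompts)               # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)  # private generator, global state unchanged
    return shuffled[:-n_test], shuffled[-n_test:]

toy_prompts = [f"prompt {i}" for i in range(10)]
train, test = split_prompts(toy_prompts, n_test=3, seed=42)
print(len(train), len(test))  # 7 3
```

The `datasets` library also provides `Dataset.train_test_split`, which accepts `test_size` and `seed` and may be a more direct route when the data is already a `Dataset`.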

In [14]:
from trl import SFTTrainer
from transformers import TrainingArguments

trainer_args = TrainingArguments(
    output_dir= "checkpoints/phi3_moe",
    per_device_train_batch_size = 1,
    per_device_eval_batch_size = 1, 
    learning_rate= 1e-5,
    save_total_limit=1,
    num_train_epochs=1,
    eval_steps= 5000,
    logging_strategy="steps",
    logging_steps= 25,
    gradient_accumulation_steps=4,
    bf16=True
)

trainer = SFTTrainer(
    model,
    args= trainer_args,
    train_dataset= dataset_train,
    eval_dataset= dataset_test,
    dataset_text_field="prompt",
)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Map: 100%|██████████| 17612/17612 [00:01<00:00, 9167.40 examples/s] 
Map: 100%|██████████| 1000/1000 [00:00<00:00, 12832.94 examples/s]
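
To get a feel for how long one epoch is with these settings, the arithmetic can be sketched as follows. It assumes a single device and that a partial final batch still counts as one optimizer step; the helper name `optimizer_steps_per_epoch` is made up for this example.

```python
import math

def optimizer_steps_per_epoch(n_examples, per_device_batch_size,
                              grad_accum_steps, n_devices=1):
    """Return (effective batch size, optimizer steps in one epoch)."""
    effective_batch = per_device_batch_size * grad_accum_steps * n_devices
    return effective_batch, math.ceil(n_examples / effective_batch)

# Values taken from the training arguments and the dataset sizes above.
effective_batch, steps = optimizer_steps_per_epoch(
    n_examples=17612, per_device_batch_size=1, grad_accum_steps=4
)
print(effective_batch, steps)  # 4 4403
```

Under these assumptions one epoch is 4403 optimizer steps, so with `logging_steps=25` the loss is logged roughly 176 times. Note that 4403 is below the `eval_steps` value of 5000, so a step-based evaluation would not trigger within a single epoch.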


In [None]:
trainer.train()