# 🥱 LazyMergekit

> 🗣️ [Large Language Model Course](https://github.com/mlabonne/llm-course)

❤️ Created by [@maximelabonne](https://twitter.com/maximelabonne).

This notebook allows you to easily merge multiple models using [mergekit](https://github.com/cg123/mergekit). To evaluate your merges, see [🧐 LLM AutoEval](https://colab.research.google.com/drive/1Igs3WZuXAIv9X0vwqiE90QlEPys8e8Oa?usp=sharing#scrollTo=elyxjYI_rY5W).

*Special thanks to [@cg123](https://github.com/cg123) for this library and [@mrfakename](https://gist.github.com/fakerybakery) who told me about sharding (see his [Gist](https://gist.github.com/fakerybakery/d30a4d31b4f914757c1381166b9c683b)).*

In [24]:
# import libraries
import wandb
from pytorch_lightning.loggers import WandbLogger
import lightning.pytorch as pl
import os


In [14]:
MODEL_NAME = "Mathmate-7B-moe"
yaml_config = """
base_model: AI-MO/NuminaMath-7B-TIR
merge_method: della # set explicitly, since the MoE merge seems to default to frankenMoE/passthrough
gate_mode: hidden
dtype: bfloat16
experts:
  - source_model: AI-MO/NuminaMath-7B-TIR
    positive_prompts:
      - "This model is good at solving math questions at high school level and generating python code for the same"
  # - source_model: Qwen/Qwen2-Math-7B-Instruct
  #   positive_prompts:
  #     - "This model is really good at solving college level math to olympiad level questions"
  - source_model: deepseek-ai/DeepSeek-Prover-V1.5-RL
    positive_prompts:
      - "This model is good at formal theorem proving for math problems"
"""



# base_model: "/teamspace/uploads/NuminaMath-7B-TIR.q6_k.gguf"
# gate_mode: hidden # one of "hidden", "cheap_embed", or "random"
# dtype: bfloat16 # output dtype (float32, float16, or bfloat16)
# ## (optional)
# # experts_per_token: 2
# experts:
#   - source_model: Qwen/Qwen2-Math-7B-Instruct
#     positive_prompts:
#       - "This is a prompt that is demonstrative of what expert_model_1 excels at"
#     ## (optional)
#     # negative_prompts:
#     #   - "This is a prompt expert_model_1 should not be used for"
#   - source_model: AI-MO/NuminaMath-7B-TIR
#   # ... and so on
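
The commented template above lists the allowed values for `gate_mode` and `dtype`. As a minimal sketch, the parsed config could be sanity-checked before the merge starts. `check_moe_config` is a hypothetical helper, not part of mergekit; it assumes the dictionary comes from `yaml.safe_load(yaml_config)` and only checks the fields named in the template.

```python
# Hypothetical helper: checks only the fields mentioned in the template above.
VALID_GATE_MODES = {"hidden", "cheap_embed", "random"}
VALID_DTYPES = {"float32", "float16", "bfloat16"}


def check_moe_config(config: dict) -> list[str]:
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    if "base_model" not in config:
        problems.append("missing base_model")
    if config.get("gate_mode") not in VALID_GATE_MODES:
        problems.append(f"gate_mode must be one of {sorted(VALID_GATE_MODES)}")
    if config.get("dtype") not in VALID_DTYPES:
        problems.append(f"dtype must be one of {sorted(VALID_DTYPES)}")
    experts = config.get("experts") or []
    if not experts:
        problems.append("no experts listed")
    for i, expert in enumerate(experts):
        if not expert.get("source_model"):
            problems.append(f"expert {i} has no source_model")
    return problems


# Same structure as the active config in this notebook (prompts shortened).
example = {
    "base_model": "AI-MO/NuminaMath-7B-TIR",
    "gate_mode": "hidden",
    "dtype": "bfloat16",
    "experts": [
        {"source_model": "AI-MO/NuminaMath-7B-TIR", "positive_prompts": ["math"]},
        {"source_model": "deepseek-ai/DeepSeek-Prover-V1.5-RL", "positive_prompts": ["proofs"]},
    ],
}
print(check_moe_config(example))  # prints []
```

In the notebook this would be called as `check_moe_config(yaml.safe_load(yaml_config))` right before the config is written to `config.yaml`.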

In [15]:
# @title ## Run merge

# @markdown ### Runtime type
# @markdown Select your runtime (CPU, High RAM, GPU)

runtime = "GPU" # @param ["CPU", "CPU + High-RAM", "GPU"]

# @markdown ### Mergekit arguments
# @markdown Use the `main` branch by default, [`mixtral`](https://github.com/cg123/mergekit/blob/mixtral/moe.md) if you want to create a Mixture of Experts.

branch = "mixtral" # @param ["main", "mixtral"]
trust_remote_code = True # @param {type:"boolean"}

# Install mergekit
if branch == "main":
    !git clone https://github.com/arcee-ai/mergekit.git
    !cd mergekit && pip install -qqq -e . --progress-bar off
elif branch == "mixtral":
    !git clone -b mixtral https://github.com/arcee-ai/mergekit.git
    !cd mergekit && pip install -qqq -e . --progress-bar off
    !pip install -qqq -U transformers --progress-bar off

# Save config as yaml file
with open('config.yaml', 'w', encoding="utf-8") as f:
    f.write(yaml_config)

# Base CLI
if branch == "main":
    cli = "mergekit-yaml config.yaml merge --copy-tokenizer"
elif branch == "mixtral":
    cli = "mergekit-moe config.yaml merge --copy-tokenizer"

# Additional arguments
if runtime == "CPU":
    cli += " --allow-crimes --out-shard-size 1B --lazy-unpickle"
elif runtime == "GPU":
    cli += " --device cuda --low-cpu-memory"
if trust_remote_code:
    cli += " --trust-remote-code"

print(cli)

# Merge models
!{cli}

fatal: destination path 'mergekit' already exists and is not an empty directory.

mergekit-moe config.yaml merge --copy-tokenizer --device cuda --low-cpu-memory --trust-remote-code
Warm up loaders:   0%|                                    | 0/3 [00:00<?, ?it/s]
Fetching 13 files: 100%|████████████████████| 13/13 [00:00<00:00, 100047.62it/s]
Fetching 13 files: 100%|████████████████████| 13/13 [00:00<00:00, 252434.96it/s]
Warm up loaders:  67%|██████████████████▋         | 2/3 [00:00<00:00, 17.75it/s]
Fetching 6 files: 100%|███████████████████████| 6/6 [00:00<00:00, 147168.56it/s]
Warm up loaders: 100%|████████████████████████████| 3/3 [00:00<00:00, 18.57it/s]
100%|█████████████████████████████████████████████| 9/9 [00:54<00:00,  6.07s/it]
Loading checkpoint shards: 100%|██████████████████| 3/3 [00:04<00:00,  1.59s/it]
expert prompts:   0%|                                     | 0/2 [00:00<?, ?it/s]
We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https:/
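
The command-line assembly in the cell above can also be written as a small pure function, which makes it easier to check which flags a given combination of settings produces. This is only a sketch: `build_cli` is a hypothetical name, and it uses just the commands and flags that already appear in this notebook.

```python
def build_cli(branch: str, runtime: str, trust_remote_code: bool) -> str:
    """Assemble the mergekit command line from the notebook's form settings."""
    if branch == "main":
        cli = "mergekit-yaml config.yaml merge --copy-tokenizer"
    elif branch == "mixtral":
        cli = "mergekit-moe config.yaml merge --copy-tokenizer"
    else:
        raise ValueError(f"unknown branch: {branch!r}")

    # "CPU + High-RAM" adds no extra flags, matching the cell above.
    if runtime == "CPU":
        cli += " --allow-crimes --out-shard-size 1B --lazy-unpickle"
    elif runtime == "GPU":
        cli += " --device cuda --low-cpu-memory"

    if trust_remote_code:
        cli += " --trust-remote-code"
    return cli


print(build_cli("mixtral", "GPU", True))
# prints: mergekit-moe config.yaml merge --copy-tokenizer --device cuda --low-cpu-memory --trust-remote-code
```

Unlike the original cell, the sketch raises on an unknown branch instead of leaving `cli` undefined.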

In [18]:
# Initialize WandB
WANDB_PROJECT = 'Mathmate-stage1-finetuning'
wandb.init(project=WANDB_PROJECT, name=MODEL_NAME)

# Set up WandbLogger (it reuses the run started by wandb.init above)
wandb_logger = WandbLogger(project=WANDB_PROJECT, name=MODEL_NAME)

# Optional: Add any hyperparameters or configuration settings
wandb_logger.experiment.config.update({
    "model_name": MODEL_NAME,
    "runtime": runtime,
    "branch": branch,
    # Add any other hyperparameters or settings you want to log
})


Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.

wandb: Currently logged in as: haleshot (haleshot-SVKM's Narsee Monjee Institute of Management St). Use `wandb login --relogin` to force relogin


/home/zeus/miniconda3/envs/cloudspace/lib/python3.11/site-packages/pytorch_lightning/loggers/wandb.py:396: There is a wandb run already in progress and newly created instances of `WandbLogger` will reuse this run. If this is not desired, call `wandb.finish()` before instantiating `WandbLogger`.


In [22]:
trainer = pl.Trainer(
    logger=wandb_logger,
    # Include any other trainer arguments here
)


GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs


In [31]:
# @title ## Upload model to Hugging Face { display-mode: "form" }
# @markdown Enter your HF username. The [Hugging Face access token](https://huggingface.co/settings/tokens) is read from the `HF_TOKEN` environment variable (loaded from a `.env` file).
from dotenv import load_dotenv
load_dotenv() # This loads the .env file at the project root

username = 'Haleshot' # @param {type:"string"}
# token = 'HF_TOKEN' # @param {type:"string"}

token = os.getenv('HF_TOKEN')
license = "apache-2.0" # @param ["apache-2.0", "cc-by-nc-4.0", "mit", "openrail"] {allow-input: true}

!pip install -qU huggingface_hub

import yaml

from huggingface_hub import ModelCard, ModelCardData, HfApi
# from google.colab import userdata
from jinja2 import Template

if branch == "main":
    template_text = """
---
license: {{ license }}
base_model:
{%- for model in models %}
  - {{ model }}
{%- endfor %}
tags:
- merge
- mergekit
- lazymergekit
{%- for model in models %}
- {{ model }}
{%- endfor %}
---

# {{ model_name }}

{{ model_name }} is a merge of the following models using [LazyMergekit](https://colab.research.google.com/drive/1obulZ1ROXHjYLn6PPZJwRR6GzgQogxxb?usp=sharing):

{%- for model in models %}
* [{{ model }}](https://huggingface.co/{{ model }})
{%- endfor %}

## 🧩 Configuration

```yaml
{{- yaml_config -}}
```

## 💻 Usage

```python
!pip install -qU transformers accelerate

from transformers import AutoTokenizer
import transformers
import torch

model = "{{ username }}/{{ model_name }}"
messages = [{"role": "user", "content": "What is a large language model?"}]

tokenizer = AutoTokenizer.from_pretrained(model)
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)

outputs = pipeline(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])
```
"""

    # Create a Jinja template object
    jinja_template = Template(template_text.strip())

    # Get list of models from config
    data = yaml.safe_load(yaml_config)
    if "models" in data:
        models = [data["models"][i]["model"] for i in range(len(data["models"])) if "parameters" in data["models"][i]]
    elif "parameters" in data:
        models = [data["slices"][0]["sources"][i]["model"] for i in range(len(data["slices"][0]["sources"]))]
    elif "slices" in data:
        models = [data["slices"][i]["sources"][0]["model"] for i in range(len(data["slices"]))]
    else:
        raise Exception("No models or slices found in yaml config")

    # Fill the template
    content = jinja_template.render(
        model_name=MODEL_NAME,
        models=models,
        yaml_config=yaml_config,
        username=username,
        license=license,
    )

elif branch == "mixtral":
    template_text = """
---
license: {{ license }}
base_model:
{%- for model in models %}
  - {{ model }}
{%- endfor %}
tags:
- moe
- frankenmoe
- merge
- mergekit
- lazymergekit
{%- for model in models %}
- {{ model }}
{%- endfor %}
---

# {{ model_name }}

{{ model_name }} is a Mixture of Experts (MoE) made with the following models using [LazyMergekit](https://colab.research.google.com/drive/1obulZ1ROXHjYLn6PPZJwRR6GzgQogxxb?usp=sharing):

{%- for model in models %}
* [{{ model }}](https://huggingface.co/{{ model }})
{%- endfor %}

## 🧩 Configuration

```yaml
{{- yaml_config -}}
```

## 💻 Usage

```python
!pip install -qU transformers bitsandbytes accelerate

from transformers import AutoTokenizer
import transformers
import torch

model = "{{ username }}/{{ model_name }}"

tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    model_kwargs={"torch_dtype": torch.float16, "load_in_4bit": True},
)

messages = [{"role": "user", "content": "Explain what a Mixture of Experts is in less than 100 words."}]
prompt = pipeline.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipeline(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])
```
"""

    # Create a Jinja template object
    jinja_template = Template(template_text.strip())

    # Fill the template
    data = yaml.safe_load(yaml_config)
    models = [model['source_model'] for model in data['experts']]

    content = jinja_template.render(
        model_name=MODEL_NAME,
        models=models,
        yaml_config=yaml_config,
        username=username,
        license=license
    )

# Save the model card
card = ModelCard(content)
card.save('merge/README.md')

# Token read from the HF_TOKEN environment variable above
api = HfApi(token=token)

# Upload merge folder
api.create_repo(
    repo_id=f"{username}/{MODEL_NAME}",
    repo_type="model",
    exist_ok=True,
)
api.upload_folder(
    repo_id=f"{username}/{MODEL_NAME}",
    folder_path="merge",
)

model-00001-of-00003.safetensors:   0%|          | 0.00/9.97G [00:00<?, ?B/s]
model-00001-of-00003.safetensors:   0%|          | 16.0M/9.97G [00:00<02:05, 79.7MB/s]
model-00001-of-00003.safetensors:   0%|          | 32.0M/9.97G [00:00<02:09, 76.5MB/s]
model-00001-of-00003.safetensors:   0%|          | 48.0M/9.97G [00:00<02:11, 75.3MB/s]
model-00001-of-00003.safetensors:   1%|          | 64.0M/9.97G [00:00<02:05, 79.1MB/s]
model-00001-of-00003.safetensors:   1%|          | 80.0M/9.97G [00:01<02:10, 75.9MB/s]
model-00001-of-00003.safetensors:   1%|          | 112M/9.97G [00:06<13:35, 12.1MB/s]
model-00001-of-00003.safetensors:   1%|▏         | 128M/9.97G [00:06<09:49, 16.7MB/s]
model-00001-of-00003.safetensors:   1%|▏         | 144M/9.97G [00:06<07:20, 22.3MB/s]
model-00001-of-00003.safetensors:   2%|

CommitInfo(commit_url='https://huggingface.co/Haleshot/Mathmate-7B-dare-ties/commit/e1aa198a23a84a7b031164d70b858c5910ba7809', commit_message='Upload folder using huggingface_hub', commit_description='', oid='e1aa198a23a84a7b031164d70b858c5910ba7809', pr_url=None, pr_revision=None, pr_num=None)
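
The model card lists every source model, and the upload cell finds them differently depending on the branch and the shape of the config. A minimal sketch of that lookup as one function is shown below; `list_source_models` is a hypothetical name, and the logic mirrors the branches in the cell above rather than anything provided by mergekit.

```python
def list_source_models(data: dict, branch: str) -> list[str]:
    """Collect source model names from a parsed merge config."""
    if branch == "mixtral":
        # MoE configs list their models under `experts`.
        return [expert["source_model"] for expert in data["experts"]]
    if "models" in data:
        # Only entries that carry `parameters` are counted, as in the cell above.
        return [m["model"] for m in data["models"] if "parameters" in m]
    if "parameters" in data:
        # Top-level parameters: take every source of the first slice.
        return [s["model"] for s in data["slices"][0]["sources"]]
    if "slices" in data:
        # Otherwise: take the first source of each slice.
        return [s["sources"][0]["model"] for s in data["slices"]]
    raise ValueError("No models or slices found in yaml config")


moe_config = {
    "experts": [
        {"source_model": "AI-MO/NuminaMath-7B-TIR"},
        {"source_model": "deepseek-ai/DeepSeek-Prover-V1.5-RL"},
    ]
}
print(list_source_models(moe_config, "mixtral"))
# prints: ['AI-MO/NuminaMath-7B-TIR', 'deepseek-ai/DeepSeek-Prover-V1.5-RL']
```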

In [32]:
wandb.finish()