# 🥱 LazyMergekit

> 🗣️ [Large Language Model Course](https://github.com/mlabonne/llm-course)

❤️ Created by [@maximelabonne](https://twitter.com/maximelabonne).

This notebook allows you to easily merge multiple models using [mergekit](https://github.com/cg123/mergekit). To evaluate your merges, see [🧐 LLM AutoEval](https://colab.research.google.com/drive/1Igs3WZuXAIv9X0vwqiE90QlEPys8e8Oa?usp=sharing#scrollTo=elyxjYI_rY5W).

*Special thanks to [@cg123](https://github.com/cg123) for this library and [@mrfakename](https://gist.github.com/fakerybakery) who told me about sharding (see his [Gist](https://gist.github.com/fakerybakery/d30a4d31b4f914757c1381166b9c683b)).*

In [1]:
MODEL_NAME = "Upshot-NeuralHermes-2.5-Mistral-7-7B-slerp"
yaml_config = """
slices:
  - sources:
      - model: mlabonne/NeuralHermes-2.5-Mistral-7B
        layer_range: [0, 32]
      - model: Aditya685/upshot-sih
        layer_range: [0, 32]
merge_method: slerp
base_model: mlabonne/NeuralHermes-2.5-Mistral-7B
parameters:
  t:
    - filter: self_attn
      value: [0, 0.5, 0.3, 0.7, 1]
    - filter: mlp
      value: [1, 0.5, 0.7, 0.3, 0]
    - value: 0.5
dtype: bfloat16
"""

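The config above selects `merge_method: slerp` and gives `t` as lists of anchor values. As a rough illustration of what those settings mean, the sketch below implements spherical linear interpolation on plain Python lists and spreads a list of anchors across layers. It is not mergekit's code: `slerp` and `layer_t` are hypothetical helper names, and the evenly spaced anchors with linear interpolation between them are an assumption about how a list such as `[0, 0.5, 0.3, 0.7, 1]` maps onto 32 layers.

```python
import math


def slerp(t, v0, v1):
    """Spherical linear interpolation between two flat weight vectors.

    In this sketch t=0 returns v0 and t=1 returns v1.
    """
    norm0 = math.sqrt(sum(x * x for x in v0))
    norm1 = math.sqrt(sum(x * x for x in v1))
    # A zero vector has no direction: fall back to linear interpolation
    if norm0 == 0.0 or norm1 == 0.0:
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    # Cosine of the angle between the two vectors, clamped for acos
    dot = sum(a * b for a, b in zip(v0, v1)) / (norm0 * norm1)
    dot = max(-1.0, min(1.0, dot))
    theta = math.acos(dot)
    sin_theta = math.sin(theta)
    # Nearly parallel vectors: linear interpolation is numerically safer
    if sin_theta < 1e-6:
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    w0 = math.sin((1 - t) * theta) / sin_theta
    w1 = math.sin(t * theta) / sin_theta
    return [w0 * a + w1 * b for a, b in zip(v0, v1)]


def layer_t(gradient, layer_idx, num_layers):
    """Pick t for one layer from a list of anchors (assumed evenly spaced)."""
    if len(gradient) == 1 or num_layers == 1:
        return float(gradient[0])
    # Position of this layer on the anchor axis, from 0 to len(gradient) - 1
    pos = layer_idx / (num_layers - 1) * (len(gradient) - 1)
    lo = min(int(pos), len(gradient) - 2)
    frac = pos - lo
    return (1 - frac) * gradient[lo] + frac * gradient[lo + 1]
```

With two orthogonal unit vectors, `slerp(0.5, [1.0, 0.0], [0.0, 1.0])` lands halfway along the arc, at roughly `[0.707, 0.707]`, where plain averaging would give `[0.5, 0.5]`.
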
In [2]:
# @title ## Run merge

# @markdown ### Runtime type
# @markdown Select your runtime (CPU, High RAM, GPU)

runtime = "GPU" # @param ["CPU", "CPU + High-RAM", "GPU"]

# @markdown ### Mergekit arguments
# @markdown Use the `main` branch by default, or the [`mixtral`](https://github.com/cg123/mergekit/blob/mixtral/moe.md) branch if you want to create a Mixture of Experts.

branch = "main" # @param ["main", "mixtral"]
trust_remote_code = True # @param {type:"boolean"}

# Install mergekit
if branch == "main":
    !git clone https://github.com/cg123/mergekit.git
    !cd mergekit && pip install -qqq -e . --progress-bar off
elif branch == "mixtral":
    !git clone -b mixtral https://github.com/cg123/mergekit.git
    !cd mergekit && pip install -qqq -e . --progress-bar off
    !pip install -qqq -U transformers --progress-bar off

# Save config as yaml file
with open('config.yaml', 'w', encoding="utf-8") as f:
    f.write(yaml_config)

# Base CLI
if branch == "main":
    cli = "mergekit-yaml config.yaml merge --copy-tokenizer"
elif branch == "mixtral":
    cli = "mergekit-moe config.yaml merge --copy-tokenizer"

# Additional arguments
if runtime == "CPU":
    cli += " --allow-crimes --out-shard-size 1B --lazy-unpickle"
elif runtime == "GPU":
    cli += " --cuda --low-cpu-memory"
if trust_remote_code:
    cli += " --trust-remote-code"

print(cli)

# Merge models
!{cli}
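
The command assembly in this cell can also be read as a small pure function, which makes it easier to check which flags each combination of settings produces. This is only a sketch: `build_cli` is a hypothetical name, and it uses no flags beyond the ones already present in the cell.

```python
def build_cli(branch="main", runtime="GPU", trust_remote_code=True):
    """Return the mergekit command line for the given notebook settings."""
    # Base command depends on the branch (regular merge vs. Mixture of Experts)
    if branch == "main":
        cli = "mergekit-yaml config.yaml merge --copy-tokenizer"
    elif branch == "mixtral":
        cli = "mergekit-moe config.yaml merge --copy-tokenizer"
    else:
        raise ValueError(f"Unknown branch: {branch!r}")

    # Runtime-specific flags; "CPU + High-RAM" adds none, as in the cell
    if runtime == "CPU":
        cli += " --allow-crimes --out-shard-size 1B --lazy-unpickle"
    elif runtime == "GPU":
        cli += " --cuda --low-cpu-memory"

    if trust_remote_code:
        cli += " --trust-remote-code"
    return cli


print(build_cli("main", "GPU", True))
```
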

In [9]:
# @title ## Upload model to Hugging Face { display-mode: "form" }
# @markdown Enter your HF username and the name of the Colab secret that stores your [Hugging Face access token](https://huggingface.co/settings/tokens).
username = 'Aditya685' # @param {type:"string"}
token = 'HUGGINGFACE_TOKEN' # @param {type:"string"}
license = "apache-2.0" # @param ["apache-2.0", "cc-by-nc-4.0", "mit", "openrail"] {allow-input: true}

!pip install -qU huggingface_hub

import yaml

from huggingface_hub import ModelCard, ModelCardData, HfApi
from google.colab import userdata
from jinja2 import Template

if branch == "main":
    template_text = """
---
license: {{ license }}
base_model:
{%- for model in models %}
  - {{ model }}
{%- endfor %}
tags:
- merge
- mergekit
- lazymergekit
{%- for model in models %}
- {{ model }}
{%- endfor %}
---

# {{ model_name }}

{{ model_name }} is a merge of the following models using [LazyMergekit](https://colab.research.google.com/drive/1obulZ1ROXHjYLn6PPZJwRR6GzgQogxxb?usp=sharing):

{%- for model in models %}
* [{{ model }}](https://huggingface.co/{{ model }})
{%- endfor %}

## 🧩 Configuration

```yaml
{{- yaml_config -}}
```

## 💻 Usage

```python
!pip install -qU transformers accelerate

from transformers import AutoTokenizer
import transformers
import torch

model = "{{ username }}/{{ model_name }}"
messages = [{"role": "user", "content": "What is a large language model?"}]

tokenizer = AutoTokenizer.from_pretrained(model)
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)

outputs = pipeline(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])
```
"""

    # Create a Jinja template object
    jinja_template = Template(template_text.strip())

    # Get list of models from config
    data = yaml.safe_load(yaml_config)
    if "models" in data:
        models = [m["model"] for m in data["models"] if "parameters" in m]
    elif "parameters" in data:
        models = [source["model"] for source in data["slices"][0]["sources"]]
    elif "slices" in data:
        models = [s["sources"][0]["model"] for s in data["slices"]]
    else:
        raise Exception("No models or slices found in yaml config")

    # Fill the template
    content = jinja_template.render(
        model_name=MODEL_NAME,
        models=models,
        yaml_config=yaml_config,
        username=username,
        license=license,  # the template's `license:` field is empty without this
    )

elif branch == "mixtral":
    template_text = """
---
license: {{ license }}
base_model:
{%- for model in models %}
  - {{ model }}
{%- endfor %}
tags:
- moe
- frankenmoe
- merge
- mergekit
- lazymergekit
{%- for model in models %}
- {{ model }}
{%- endfor %}
---

# {{ model_name }}

{{ model_name }} is a Mixture of Experts (MoE) made with the following models using [LazyMergekit](https://colab.research.google.com/drive/1obulZ1ROXHjYLn6PPZJwRR6GzgQogxxb?usp=sharing):

{%- for model in models %}
* [{{ model }}](https://huggingface.co/{{ model }})
{%- endfor %}

## 🧩 Configuration

```yaml
{{- yaml_config -}}
```

## 💻 Usage

```python
!pip install -qU transformers bitsandbytes accelerate

from transformers import AutoTokenizer
import transformers
import torch

model = "{{ username }}/{{ model_name }}"

tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    model_kwargs={"torch_dtype": torch.float16, "load_in_4bit": True},
)

messages = [{"role": "user", "content": "Explain what a Mixture of Experts is in less than 100 words."}]
prompt = pipeline.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipeline(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])
```
"""

    # Create a Jinja template object
    jinja_template = Template(template_text.strip())

    # Fill the template
    data = yaml.safe_load(yaml_config)
    models = [model['source_model'] for model in data['experts']]

    content = jinja_template.render(
        model_name=MODEL_NAME,
        models=models,
        yaml_config=yaml_config,
        username=username,
        license=license
    )

# Save the model card
card = ModelCard(content)
card.save('merge/README.md')

# Defined in the secrets tab in Google Colab
api = HfApi(token=userdata.get(token))

# Upload merge folder
api.create_repo(
    repo_id=f"{username}/{MODEL_NAME}",
    repo_type="model",
    exist_ok=True,
)
api.upload_folder(
    repo_id=f"{username}/{MODEL_NAME}",
    folder_path="merge",
)

CommitInfo(commit_url='https://huggingface.co/Aditya685/Upshot-NeuralHermes-2.5-Mistral-7-7B-slerp/commit/b1e8a40527664c530e4b68bf6d9c08e34f384c1e', commit_message='Upload folder using huggingface_hub', commit_description='', oid='b1e8a40527664c530e4b68bf6d9c08e34f384c1e', pr_url=None, pr_revision=None, pr_num=None)
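
The upload cell builds the model card's `models` list from the parsed YAML, and which keys it reads depends on the shape of the config. Below is a minimal sketch of that lookup as a standalone function operating on an already-parsed dict. `extract_models` is a hypothetical name, and the three cases simply mirror the ones the cell uses for the `main` branch.

```python
def extract_models(data):
    """Return the source model names from a parsed mergekit config dict."""
    if "models" in data:
        # Top-level `models:` list; entries without `parameters` are skipped
        return [m["model"] for m in data["models"] if "parameters" in m]
    if "parameters" in data:
        # e.g. slerp: one slice whose sources are the models being merged
        return [s["model"] for s in data["slices"][0]["sources"]]
    if "slices" in data:
        # e.g. passthrough: one source model per slice
        return [s["sources"][0]["model"] for s in data["slices"]]
    raise ValueError("No models or slices found in config")


# The slerp config from the first cell, written out as a dict
slerp_config = {
    "slices": [{
        "sources": [
            {"model": "mlabonne/NeuralHermes-2.5-Mistral-7B", "layer_range": [0, 32]},
            {"model": "Aditya685/upshot-sih", "layer_range": [0, 32]},
        ]
    }],
    "merge_method": "slerp",
    "base_model": "mlabonne/NeuralHermes-2.5-Mistral-7B",
    "parameters": {"t": [{"value": 0.5}]},
    "dtype": "bfloat16",
}
print(extract_models(slerp_config))
```
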

In [1]:
!pip install -qU transformers accelerate

In [2]:

from transformers import AutoTokenizer
import transformers
import torch

model = "Aditya685/Upshot-NeuralHermes-2.5-Mistral-7-7B-slerp"
messages = [{"role": "user", "content": "What is a large language model?"}]

tokenizer = AutoTokenizer.from_pretrained(model)
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)

outputs = pipeline(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Setting `pad_token_id` to `eos_token_id`:32000 for open-end generation.


<|im_start|>user
What is a large language model?<|im_end|>
<|im_start|>assistant
A large language model is a type of artificial intelligence (AI) system that is designed to understand and generate human language. It is trained on vast amounts of text data to learn patterns and structures of language, enabling it to generate coherent and contextually relevant text. These models can be used for a variety of tasks, such as natural language processing, chatbots, language translation, and text summarization. The size of the model, measured in parameters or the number of connections between neurons, is often correlated with its performance and ability to handle complex language tasks. Some well-known large language models include OpenAI's GPT-3, Google's BERT, and Facebook's RoBERTa.


In [13]:
def response(question):
  messages = [{"role": "user", "content": question}]
  prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
  outputs = pipeline(prompt, max_new_tokens=2000, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
  # Split on the full ChatML marker so the word "assistant" in the question or the answer doesn't cut the reply
  return outputs[0]["generated_text"].split('<|im_start|>assistant')[-1]
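
How the reply is pulled out of the generated text depends on the chat template. For ChatML-style output like the generation shown earlier, a minimal sketch could look like the following. `extract_reply` is a hypothetical helper, and the `<|im_start|>assistant` and `<|im_end|>` markers are assumed to appear literally in the decoded text.

```python
def extract_reply(generated_text,
                  start="<|im_start|>assistant",
                  end="<|im_end|>"):
    """Return the text of the last assistant turn in a ChatML transcript."""
    # Keep everything after the last assistant marker (whole text if absent)
    tail = generated_text.rpartition(start)[2]
    # Drop a trailing end-of-turn marker if the model emitted one
    reply = tail.split(end, 1)[0]
    return reply.strip()


sample = (
    "<|im_start|>user\n"
    "Who is your assistant?<|im_end|>\n"
    "<|im_start|>assistant\n"
    "I am an assistant."
)
print(extract_reply(sample))
```

Matching the full marker instead of the bare word `assistant` keeps the reply intact when that word also appears in the question or the answer.
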



In [14]:
response('what is gst?')

Setting `pad_token_id` to `eos_token_id`:32000 for open-end generation.


"\nGST stands for Goods and Services Tax. It is a comprehensive, multi-stage, destination-based tax levied on the supply of goods and services, right from the manufacturer to the consumer. The primary objective of GST is to simplify and harmonize India's complex tax system by replacing various indirect taxes like Value Added Tax (VAT), Central Excise Duty, and Service Tax, among others. The implementation of GST aims to make the tax system more efficient, transparent, and easy to administer, ultimately benefiting both businesses and consumers."

In [15]:
response('what are the document required for applying for GST Number')

Setting `pad_token_id` to `eos_token_id`:32000 for open-end generation.


'\nTo apply for a GST Number (Goods and Services Tax Number) in India, you will need the following documents:\n\n1. Proof of identity: A valid PAN card (Permanent Account Number) is required as proof of identity.\n2. Proof of address: You can provide any of the following documents as proof of address:\n\t* Aadhaar card\n\t* Voter ID card\n\t* Passport\n\t* Driving license\n\t* Bank account statement or passbook\n\t* Ration card\n\t* Electricity bill\n\t* Water bill\n\t* Gas bill\n\t* Telephone bill\n\t* Property tax receipt\n3. Proof of constitution: Depending on your business type, you will need to provide the following documents:\n\t* For a proprietorship firm: Copy of the registration certificate (if registered under the Shops and Establishments Act)\n\t* For a partnership firm: Copy of the partnership deed and registration certificate (if registered under the Shops and Establishments Act)\n\t* For a company: Copy of the incorporation certificate, Articles of Association, and Memora