<a href="https://colab.research.google.com/github/Prajwal011/LLM-s/blob/main/Merging_LLM's_using_Mergekit.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Merge Large Language Models with mergekit
> 🗣️ [Large Language Model Course](https://github.com/mlabonne/llm-course)

❤️ Created by [@maximelabonne](https://twitter.com/maximelabonne).

Model merging only requires a lot of RAM. With a free Google Colab account, you should be able to run it using a T4 GPU (VRAM offloading).

Examples of merge configurations:

### TIES-Merging

```yaml
models:
  - model: mistralai/Mistral-7B-v0.1
    # no parameters necessary for base model
  - model: OpenPipe/mistral-ft-optimized-1218
    parameters:
      density: 0.5
      weight: 0.5
  - model: mlabonne/NeuralHermes-2.5-Mistral-7B
    parameters:
      density: 0.5
      weight: 0.3
merge_method: ties
base_model: mistralai/Mistral-7B-v0.1
parameters:
  normalize: true
dtype: float16
```

You can find the final model on the Hugging Face Hub at [mlabonne/NeuralPipe-7B-ties](https://huggingface.co/mlabonne/NeuralPipe-7B-ties).
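To make the `density`, `weight`, and `normalize` parameters above more concrete, here is a minimal pure-Python sketch of the TIES idea on flat lists of numbers: trim each task vector, elect a sign per parameter, then average the entries that agree with it. The function name `ties_merge` and the toy values are hypothetical, and this is a simplified illustration rather than mergekit's actual implementation.

```python
def ties_merge(base, finetuned, weights, density):
    """Merge flat parameter lists with a simplified TIES procedure."""
    n = len(base)
    trimmed = []
    for model in finetuned:
        # Task vector: what fine-tuning changed relative to the base model.
        delta = [m - b for m, b in zip(model, base)]
        # Trim: keep only the top `density` fraction of entries by magnitude.
        k = max(1, round(density * n))
        keep = set(sorted(range(n), key=lambda i: abs(delta[i]), reverse=True)[:k])
        trimmed.append([d if i in keep else 0.0 for i, d in enumerate(delta)])

    merged = []
    for i in range(n):
        contribs = [(w, t[i]) for w, t in zip(weights, trimmed) if t[i] != 0.0]
        # Elect sign: the direction with the larger weighted mass wins.
        total = sum(w * d for w, d in contribs)
        agree = [(w, d) for w, d in contribs if d * total > 0]
        # Disjoint merge: weighted mean over agreeing entries only,
        # which mirrors the effect of `normalize: true` in this sketch.
        norm = sum(w for w, _ in agree)
        step = sum(w * d for w, d in agree) / norm if norm else 0.0
        merged.append(base[i] + step)
    return merged


# Toy example: the two models disagree on the sign of the first parameter.
base = [0.0, 0.0]
model_a = [1.0, 0.0]
model_b = [-1.0, 0.5]
print(ties_merge(base, [model_a, model_b], weights=[0.5, 0.3], density=1.0))
# [1.0, 0.5]
```

In the toy example the first parameter keeps only `model_a`'s change, because its weighted contribution outweighs the conflicting one from `model_b`.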

### SLERP

```yaml
slices:
  - sources:
      - model: OpenPipe/mistral-ft-optimized-1218
        layer_range: [0, 32]
      - model: mlabonne/NeuralHermes-2.5-Mistral-7B
        layer_range: [0, 32]
merge_method: slerp
base_model: OpenPipe/mistral-ft-optimized-1218
parameters:
  t:
    - filter: self_attn
      value: [0, 0.5, 0.3, 0.7, 1]
    - filter: mlp
      value: [1, 0.5, 0.7, 0.3, 0]
    - value: 0.5
dtype: bfloat16
```

You can find the final model on the Hugging Face Hub at [mlabonne/NeuralPipe-7B-slerp](https://huggingface.co/mlabonne/NeuralPipe-7B-slerp).
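As a rough illustration of what the `t` parameter controls, the sketch below implements spherical linear interpolation on plain Python lists, plus a helper that spreads a list of anchor values such as `[0, 0.5, 0.3, 0.7, 1]` across the layers. Both function names are hypothetical, the vectors are assumed to be non-zero, and the linear spreading of anchors over layers is an assumption about how such a gradient behaves, not a copy of mergekit's code.

```python
import math


def slerp(t, v0, v1, eps=1e-8):
    """Spherical linear interpolation between two non-zero flat vectors."""
    norm0 = math.sqrt(sum(x * x for x in v0))
    norm1 = math.sqrt(sum(x * x for x in v1))
    # Angle between the two vectors, from the dot product of their unit forms.
    dot = sum(a * b for a, b in zip(v0, v1)) / (norm0 * norm1)
    dot = max(-1.0, min(1.0, dot))
    theta = math.acos(dot)
    if math.sin(theta) < eps:
        # Nearly colinear vectors: fall back to plain linear interpolation.
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    s0 = math.sin((1 - t) * theta) / math.sin(theta)
    s1 = math.sin(t * theta) / math.sin(theta)
    return [s0 * a + s1 * b for a, b in zip(v0, v1)]


def layer_t(anchors, layer, num_layers):
    """Linearly interpolate a list of anchor values across the layer index.

    Assumes num_layers > 1 and 0 <= layer < num_layers.
    """
    pos = layer / (num_layers - 1) * (len(anchors) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(anchors) - 1)
    frac = pos - lo
    return (1 - frac) * anchors[lo] + frac * anchors[hi]


# t = 0 keeps the first model's weights, t = 1 keeps the second model's.
t = layer_t([0, 0.5, 0.3, 0.7, 1], layer=0, num_layers=32)
print(t, slerp(t, [1.0, 0.0], [0.0, 1.0]))
```

With the configuration above, the `self_attn` and `mlp` filters use mirrored anchor lists, so layers where attention leans towards one model have MLP weights leaning towards the other.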

### Passthrough

```yaml
slices:
  - sources:
    - model: OpenPipe/mistral-ft-optimized-1218
      layer_range: [0, 32]
  - sources:
    - model: mlabonne/NeuralHermes-2.5-Mistral-7B
      layer_range: [24, 32]
merge_method: passthrough
dtype: bfloat16
```

You can find the final model on the Hugging Face Hub at [mlabonne/NeuralPipe-9B-merged](https://huggingface.co/mlabonne/NeuralPipe-9B-merged).
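A quick way to sanity-check a passthrough configuration is to add up the layers it stacks. The sketch below assumes each `layer_range` is a half-open `[start, end)` interval, which matches the `[0, 32]` range used for a 32-layer model above; the helper name `stacked_layer_count` is hypothetical.

```python
def stacked_layer_count(slices):
    """Total layers produced by concatenating half-open [start, end) ranges."""
    return sum(end - start for start, end in slices)


# Layer ranges from the passthrough config above.
slices = [(0, 32), (24, 32)]
print(stacked_layer_count(slices))  # 40
```

The merged model therefore has 40 layers instead of 32, which is why it ends up larger than either 7B parent.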

In [None]:
!git clone https://github.com/cg123/mergekit.git
!cd mergekit && pip install -q -e .


As we saw previously, we need to know which models to merge and which parameters to set.

Run the code below with the LLMs you want to merge to get an idea of how many layers they have.
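If you only need the layer count, a lighter option than loading the full model is to read it from the model's `config.json`. The sketch below assumes a Llama- or Mistral-style config that stores the count under `num_hidden_layers`; the helper name and the abbreviated example config are hypothetical. In practice the file could be fetched with `huggingface_hub.hf_hub_download(repo_id, "config.json")`.

```python
import json


def layer_count(config_text):
    """Return the number of transformer layers declared in a config.json string."""
    config = json.loads(config_text)
    return config["num_hidden_layers"]


# Hypothetical, abbreviated config.json contents for illustration only.
example = '{"model_type": "llama", "num_hidden_layers": 28, "hidden_size": 2048}'
print(layer_count(example))  # 28
```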

In [None]:
!pip install diffusers accelerate transformers


Access NLP models

In [None]:
from google.colab import userdata
from huggingface_hub import login

# Read the access token from Colab's secrets (stored under the name "hf_token")
login(token=userdata.get("hf_token"))

In [None]:
from google.colab import userdata
from transformers import pipeline

# `use_auth_token` is deprecated in recent transformers releases; `token` replaces it
pipe = pipeline("text-generation", model="sarvamai/sarvam-1", token=userdata.get("hf_token"))

model = pipe.model
print(model)

# Inspect layers as desired
# for name, layer in model.named_modules():
#     print(name, layer)


LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(68096, 2048)
    (layers): ModuleList(
      (0-27): 28 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=1024, bias=False)
          (v_proj): Linear(in_features=2048, out_features=1024, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=2048, out_features=11008, bias=False)
          (up_proj): Linear(in_features=2048, out_features=11008, bias=False)
          (down_proj): Linear(in_features=11008, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-06)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-06)
      )
    )
    (no

Access Computer Vision models

In [None]:
import torch
from diffusers import DiffusionPipeline

# Initialize the Hugging Face pipeline for the desired model
pipe = DiffusionPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16)
pipe.load_lora_weights("glif-loradex-trainer/maxxd4240_minimalistPastel")

# A DiffusionPipeline has no `.model` attribute; for FLUX the denoising
# network is exposed as `pipe.transformer`
model = pipe.transformer
print(model)

The code below shows which model family a given model belongs to, so you can pick models from the same family, with the same structure, for better merge results.
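Besides the family tree, a simple structural check is to compare the shape-defining fields of two models' `config.json` files. The sketch below is one possible way to do that on plain dictionaries; the helper name, the chosen keys, and the example values are assumptions for illustration, and matching fields do not guarantee that a merge will work well.

```python
SHAPE_KEYS = ("model_type", "hidden_size", "num_hidden_layers",
              "num_attention_heads", "vocab_size")


def architecture_mismatches(config_a, config_b, keys=SHAPE_KEYS):
    """Return {key: (value_a, value_b)} for every key where the configs differ."""
    return {k: (config_a.get(k), config_b.get(k))
            for k in keys if config_a.get(k) != config_b.get(k)}


# Hypothetical configs: a 28-layer and a 32-layer model from different families.
small = {"model_type": "llama", "hidden_size": 2048, "num_hidden_layers": 28}
large = {"model_type": "mistral", "hidden_size": 4096, "num_hidden_layers": 32}
print(architecture_mismatches(small, large))
```

An empty result means the listed fields match, which is a reasonable first filter before trying a merge.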

In [None]:
# glif-loradex-trainer/maxxd4240_minimalistPastel
# zhreyu/ComicStrips-Lora-Fluxdev

In [None]:
# @title # 🌳 Model Family Tree
# @markdown Automatically calculate the <strong>family tree of a given model</strong>. It also displays the type of license each model uses (permissive, noncommercial, or unknown). Special thanks to [leonardlin](https://huggingface.co/leonardlin) for his caching implementation.

# @markdown You can also run the code in this [Hugging Face Space](https://huggingface.co/spaces/mlabonne/model-family-tree).
!apt install -qq graphviz graphviz-dev
!pip install -qqq huggingface_hub pygraphviz --progress-bar off

import sys
from huggingface_hub import ModelCard, HfApi, RepoCard
import requests
import networkx as nx
import matplotlib.pyplot as plt
from matplotlib.patches import Patch
from collections import defaultdict
from networkx.drawing.nx_agraph import graphviz_layout
from IPython.display import clear_output

MODEL_ID = "glif-loradex-trainer/maxxd4240_minimalistPastel" # @param {type:"string"}

# We should first try to cache models
class CachedModelCard(ModelCard):
  _cache = {}

  @classmethod
  def load(cls, model_id: str, **kwargs) -> "ModelCard":
    if model_id not in cls._cache:
      try:
        print('REQUEST ModelCard:', model_id)
        cls._cache[model_id] = super().load(model_id, **kwargs)
      except:
        cls._cache[model_id] = None
    else:
      print('CACHED:', model_id)
    return cls._cache[model_id]


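def get_model_names_from_yaml(url):
    """Get a list of parent model names from the yaml file."""
    import re
    model_tags = []
    response = requests.get(url)
    if response.status_code == 200:
        # response.content is bytes (iterating it yields ints), so search the
        # decoded text for `model: org/name` entries instead
        names = re.findall(r"model:\s*([\w.-]+/[\w.-]+)", response.text)
        model_tags.extend(dict.fromkeys(names))  # de-duplicate, keep order
    return model_tags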


def get_license_color(model):
    """Get the color of the model based on its license."""
    try:
        card = CachedModelCard.load(model)
        license = card.data.to_dict()['license'].lower()
        # Define permissive licenses
        permissive_licenses = ['mit', 'bsd', 'apache-2.0', 'openrail']  # Add more as needed
        # Check license type
        if any(perm_license in license for perm_license in permissive_licenses):
            return 'lightgreen'  # Permissive licenses
        else:
            return 'lightcoral'  # Noncommercial or other licenses
    except Exception as e:
        print(f"Error retrieving license for {model}: {e}")
        return 'lightgray'


def get_model_names(model, genealogy, found_models=None, visited_models=None):
    print('---')
    print(model)
    if found_models is None:
        found_models = set()
    if visited_models is None:
        visited_models = set()

    if model in visited_models:
        print("Model already visited...")
        return found_models
    visited_models.add(model)

    try:
        card = CachedModelCard.load(model)
        card_dict = card.data.to_dict()
        license = card_dict.get('license')  # unused here; .get avoids a KeyError when the card has no license

        model_tags = []
        if 'base_model' in card_dict:
            model_tags = card_dict['base_model']

        if 'tags' in card_dict and not model_tags:
            tags = card_dict['tags']
            model_tags = [model_name for model_name in tags if '/' in model_name]

        # Use /resolve/ to fetch the raw file; /blob/ returns the HTML page instead
        if not model_tags:
            model_tags.extend(get_model_names_from_yaml(f"https://huggingface.co/{model}/resolve/main/merge.yml"))
        if not model_tags:
            model_tags.extend(get_model_names_from_yaml(f"https://huggingface.co/{model}/resolve/main/mergekit_config.yml"))

        if not isinstance(model_tags, list):
            model_tags = [model_tags] if model_tags else []

        found_models.add(model)

        for model_tag in model_tags:
            genealogy[model_tag].append(model)
            get_model_names(model_tag, genealogy, found_models, visited_models)

    except Exception as e:
        print(f"Could not find model names for {model}: {e}")

    return found_models


def find_root_nodes(G):
    """ Find all nodes in the graph with no predecessors """
    return [n for n, d in G.in_degree() if d == 0]


def max_width_of_tree(G):
    """ Calculate the maximum width of the tree """
    max_width = 0
    for root in find_root_nodes(G):
        width_at_depth = calculate_width_at_depth(G, root)
        local_max_width = max(width_at_depth.values())
        max_width = max(max_width, local_max_width)
    return max_width


def calculate_width_at_depth(G, root):
    """ Calculate width at each depth starting from a given root """
    depth_count = defaultdict(int)
    queue = [(root, 0)]
    while queue:
        node, depth = queue.pop(0)
        depth_count[depth] += 1
        for child in G.successors(node):
            queue.append((child, depth + 1))
    return depth_count


def create_family_tree(start_model):
    genealogy = defaultdict(list)
    get_model_names(start_model, genealogy)  # Assuming this populates the genealogy

    print("Number of models:", len(CachedModelCard._cache))

    # Create a directed graph
    G = nx.DiGraph()

    # Add nodes and edges to the graph
    for parent, children in genealogy.items():
        for child in children:
            G.add_edge(parent, child)

    try:
        # Get max depth and width
        max_depth = nx.dag_longest_path_length(G) + 1
        max_width = max_width_of_tree(G) + 1
    except:
        # Get max depth and width
        max_depth = 21
        max_width = 9

    # Estimate plot size
    height = max(8, 1.6 * max_depth)
    width = max(8, 6 * max_width)

    # Set Graphviz layout attributes for a bottom-up tree
    plt.figure(figsize=(width, height))
    pos = graphviz_layout(G, prog="dot")

    # Determine node colors based on license
    node_colors = [get_license_color(node) for node in G.nodes()]
    clear_output()

    # Create a label mapping with line breaks
    labels = {node: node.replace("/", "\n") for node in G.nodes()}

    # Draw the graph
    nx.draw(G, pos, labels=labels, with_labels=True, node_color=node_colors, font_size=12, node_size=8_000, edge_color='black')

    # Create a legend for the colors
    legend_elements = [
        Patch(facecolor='lightgreen', label='Permissive'),
        Patch(facecolor='lightcoral', label='Noncommercial'),
        Patch(facecolor='lightgray', label='Unknown')
    ]
    plt.legend(handles=legend_elements, loc='upper left')

    plt.title(f"{start_model}'s Family Tree", fontsize=20)
    plt.show()

create_family_tree(MODEL_ID)

In [None]:
import yaml

MODEL_NAME = "Sarvam_updated"
yaml_config = """
models:
  - model: black-forest-labs/FLUX.1-dev
    # no parameters necessary for base model
  - model: glif-loradex-trainer/maxxd4240_minimalistPastel
    parameters:
      density: 0.5
      weight: 0.3
  - model: Jovie/Midjourney
    parameters:
      density: 0.5
      weight: 0.5
merge_method: ties
base_model: black-forest-labs/FLUX.1-dev
parameters:
  normalize: true
  int8_mask: true
dtype: float16
# model_type:
"""

# Save config as yaml file
with open('config.yaml', 'w', encoding="utf-8") as f:
    f.write(yaml_config)

In [None]:
# Merge models
!mergekit-yaml config.yaml merge --copy-tokenizer --allow-crimes --out-shard-size 1B --lazy-unpickle --trust-remote-code

Traceback (most recent call last):
  File "/usr/local

In [None]:
!pip install -qU huggingface_hub

from huggingface_hub import ModelCard, ModelCardData
from jinja2 import Template

username = "Prajwall11"

template_text = """
---
license: apache-2.0
tags:
- merge
- mergekit
- lazymergekit
{%- for model in models %}
- {{ model }}
{%- endfor %}
---

# {{ model_name }}

{{ model_name }} is a merge of the following models using [mergekit](https://github.com/cg123/mergekit):

{%- for model in models %}
* [{{ model }}](https://huggingface.co/{{ model }})
{%- endfor %}

## 🧩 Configuration

```yaml
{{- yaml_config -}}
```
"""

# Create a Jinja template object
jinja_template = Template(template_text.strip())

# Get list of models from config
data = yaml.safe_load(yaml_config)
if "models" in data:
    models = [data["models"][i]["model"] for i in range(len(data["models"])) if "parameters" in data["models"][i]]
elif "parameters" in data:
    models = [data["slices"][0]["sources"][i]["model"] for i in range(len(data["slices"][0]["sources"]))]
elif "slices" in data:
    models = [data["slices"][i]["sources"][0]["model"] for i in range(len(data["slices"]))]
else:
    raise Exception("No models or slices found in yaml config")

# Fill the template
content = jinja_template.render(
    model_name=MODEL_NAME,
    models=models,
    yaml_config=yaml_config,
    username=username,
)

# Save the model card
card = ModelCard(content)
card.save('merge/README.md')

In [None]:
from google.colab import userdata
from huggingface_hub import HfApi

username = "Prajwall11"

api = HfApi(token=userdata.get("hf_token"))

# Create the repository first; exist_ok=True makes this a no-op if it already exists
api.create_repo(
    repo_id=f"{username}/{MODEL_NAME}",
    repo_type="model",
    exist_ok=True,
)

api.upload_folder(
    repo_id=f"{username}/{MODEL_NAME}",
    folder_path="merge",
)

Access model

In [None]:
!pip install -qU transformers accelerate

from transformers import AutoTokenizer
import transformers
import torch

model = "Prajwall11/Sarvam_updated"
# Hindi prompt, in English: "The capital of Karnataka is"
messages = [{"role": "user", "content": "कर्नाटक की राजधानी है"}]

tokenizer = AutoTokenizer.from_pretrained(model)
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)

outputs = pipeline(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])