# Model Merging

## This Colab notebook is a step-by-step guide to merging models with MergeKit, running inference on the merged model, and pushing it to the Hugging Face Hub.


# ---
# Setup and Installation
## First, we install the required dependencies and clone the MergeKit repository. MergeKit is a library for merging pre-trained language models. `bitsandbytes` and `accelerate` are needed later to load the merged model in 4-bit precision.

In [None]:
! pip3 install bitsandbytes accelerate -q


## Clone MergeKit repository

In [None]:
!git clone https://github.com/arcee-ai/mergekit.git
%cd mergekit
!pip install -e .


Cloning into 'mergekit'...
remote: Enumerating objects: 2257, done.
remote: Counting objects: 100% (572/572), done.
remote: Compressing objects: 100% (267/267), done.
remote: Total 2257 (delta 384), reused 443 (delta 305), pack-reused 1685 (from 1)
Receiving objects: 100% (2257/2257), 700.61 KiB | 15.92 MiB/s, done.
Resolving deltas: 100% (1526/1526), done.
/content/mergekit
Obtaining file:///content/mergekit
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ... done
Collecting tqdm==4.66.4 (from mergekit==0.0.4.4)
  Downloading tqdm-4.66.4-py3-none-any.whl.metadata (57 kB)
Collecting accelerate~=0.30.1 (from mergekit==0.0.4.4)
  Downloading accelerate-0.30.1-py3

# ---
# Model Merging
## We now run the merge. Make sure the configuration file is in place first: the cell below reads it from `/content/config1.yaml` (the `CONFIG_YML` variable). This file lists the models to merge, the merge method, and the weight given to each model.
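
The notebook does not show the configuration file itself. As a rough illustration only, a TIES configuration for MergeKit might look like the one written by the sketch below. The model IDs are placeholders, the `density` and `weight` values are arbitrary starting points rather than the settings used for this run, and `write_merge_config` is a helper name chosen here.

```python
from pathlib import Path

# Hypothetical example config: the model IDs are placeholders, and the
# density/weight values are illustrative, not the ones used in this notebook.
TIES_CONFIG_YAML = """\
models:
  - model: your-org/finetuned-model-a   # placeholder fine-tune of the base model
    parameters:
      density: 0.5   # fraction of each task vector kept after trimming
      weight: 0.5    # relative contribution of this model
  - model: your-org/finetuned-model-b   # placeholder fine-tune of the base model
    parameters:
      density: 0.5
      weight: 0.5
merge_method: ties
base_model: your-org/base-model         # placeholder; shared ancestor of both fine-tunes
parameters:
  normalize: true
dtype: bfloat16
"""


def write_merge_config(path: str) -> Path:
    """Write the example TIES config to `path` and return it as a Path."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(TIES_CONFIG_YAML, encoding="utf-8")
    return target


# In Colab, this would match CONFIG_YML in the merge cell:
# write_merge_config("/content/config1.yaml")
```

With TIES, every listed model is expected to be a fine-tune of the same `base_model`, since the method works on the differences between each fine-tune and that base.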

In [None]:
OUTPUT_PATH = "./models/Llama-3.2-3B-Instruct-TIES"
LORA_MERGE_CACHE = "/tmp"
CONFIG_YML = "/content/config1.yaml"
COPY_TOKENIZER = True
LAZY_UNPICKLE = False
LOW_CPU_MEMORY = False

import torch
import yaml

from mergekit.config import MergeConfiguration
from mergekit.merge import MergeOptions, run_merge

with open(CONFIG_YML, "r", encoding="utf-8") as fp:
    merge_config = MergeConfiguration.model_validate(yaml.safe_load(fp))

run_merge(
    merge_config,
    out_path=OUTPUT_PATH,
    options=MergeOptions(
        lora_merge_cache=LORA_MERGE_CACHE,
        cuda=torch.cuda.is_available(),
        copy_tokenizer=COPY_TOKENIZER,
        lazy_unpickle=LAZY_UNPICKLE,
        low_cpu_memory=LOW_CPU_MEMORY,
    ),
)
print("Done! ")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Warmup loader cache: 100%|██████████| 3/3 [01:58<00:00, 39.59s/it]
Executing graph: 100%|██████████| 1526/1526 [09:28<00:00,  2.68it/s]


Done! 
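
The output directory name suggests that this merge uses the TIES method. As background, the core idea (trim each task vector, elect a sign per parameter, then average only the values that agree with it) can be sketched on plain Python lists. This is a toy illustration under simplifying assumptions, not MergeKit's implementation: it ignores per-model weights, and `trim`, `ties_merge`, `density`, and `lam` are names chosen here.

```python
def trim(task_vector, density):
    """Keep the largest-magnitude `density` fraction of entries; zero the rest."""
    k = max(1, round(len(task_vector) * density))
    # Indices of the k entries with the largest absolute value.
    ranked = sorted(range(len(task_vector)),
                    key=lambda i: abs(task_vector[i]), reverse=True)
    keep = set(ranked[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(task_vector)]


def ties_merge(base, finetuned_models, density=0.5, lam=1.0):
    """Toy TIES merge of flat parameter lists (each the same length as `base`)."""
    # 1. Task vectors: what each fine-tune changed relative to the base.
    task_vectors = [[f - b for f, b in zip(model, base)]
                    for model in finetuned_models]
    # 2. Trim: drop the low-magnitude changes in each task vector.
    trimmed = [trim(tv, density) for tv in task_vectors]

    merged = []
    for i, b in enumerate(base):
        values = [tv[i] for tv in trimmed]
        # 3. Elect sign: the direction with the larger total magnitude wins.
        total = sum(values)
        sign = 1.0 if total > 0 else -1.0 if total < 0 else 0.0
        # 4. Disjoint mean: average only nonzero values that match the elected sign.
        agreeing = [v for v in values if v * sign > 0]
        delta = sum(agreeing) / len(agreeing) if agreeing else 0.0
        merged.append(b + lam * delta)
    return merged


base = [0.0, 0.0, 0.0, 0.0]
model_a = [1.0, 0.1, -2.0, 0.5]
model_b = [3.0, -0.2, 1.0, 0.4]
print(ties_merge(base, [model_a, model_b]))  # → [2.0, 0.0, -2.0, 0.0]
```

In the third position the two models disagree in sign; the larger change (-2.0) wins the election and the conflicting +1.0 is left out of the average instead of cancelling part of it.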


# ---
# Inference with the Merged Model
## After merging, we load the merged model and its tokenizer for inference. This example uses a `BitsAndBytesConfig` to load the weights in 4-bit NF4 precision, which reduces GPU memory use; computation runs in bfloat16.
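
To get a feel for why 4-bit loading helps, weight memory can be estimated as parameters × bits per parameter ÷ 8. The sketch below assumes roughly 3.2 billion parameters, an approximate figure for a 3B-class model that is an assumption here rather than an official number. It ignores activations, the KV cache, and quantization overhead.

```python
def approx_weight_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate memory for model weights in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 3.2e9  # assumed parameter count, not an official figure

for label, bits in [("bfloat16", 16), ("4-bit (nf4)", 4)]:
    print(f"{label:>12}: ~{approx_weight_gb(N_PARAMS, bits):.1f} GB")
```

In practice a quantized checkpoint is typically somewhat larger than the 4-bit estimate, since some layers are usually kept in higher precision and quantization constants are stored alongside the weights.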

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

QUANT_CONFIG = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

MODEL_PATH = "/content/mergekit/models/Llama-3.2-3B-Instruct-TIES"  # directory written by run_merge

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, quantization_config=QUANT_CONFIG)

user_message = "Write a recursive function that calculates Fibonacci sequence in Python."
prompt = f"""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{user_message}

### Response:
"""

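# Tokenize the prompt and move the tensors to the GPU, where the quantized model is loaded.
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

# Beam search with 10 beams. no_repeat_ngram_size=2 forbids repeating any token bigram,
# which can hurt code generation, where identifiers normally recur.
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    num_beams=10,
    early_stopping=True,
    no_repeat_ngram_size=2,
)

# The decoded text includes the prompt followed by the generated response.
result = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print(result[0])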

`low_cpu_mem_usage` was None, now set to True since model is quantized.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Write a recursive function that calculates Fibonacci sequence in Python.

### Response:
```python
def fibonacci(n):
    """
    Recursive function to calculate the nth Fibonacci number.
    
    Args:
        n (int): The position of the number in the sequence (0-indexed).
        
        Returns:
            int: The value at position n.
    """

    # Base case: If n is 0 or 1, return n because these are the first two Fibonacci numbers
    if n < 2:  # This is a more efficient way to check the base case, instead of using an if-else statement
        return (n, n-1)
    else: 
        # Call the function again, but with a smaller n, and add the two previous numbers together.
        a, b = (fibonacci(n-2)[0] +  fibonaccin( n -1) [0], 
                fibonnacin(n -2)  [1]  +   fbonacccin (  n  -  3)[1])
    ret
```

# ---
# Pushing the Merged Model to Hugging Face
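## Finally, we push the model and tokenizer to the Hugging Face Hub for sharing and deployment. Note that `model` here is the 4-bit quantized copy loaded in the inference step, so the uploaded checkpoint is the quantized one (about 2.24 GB in this run), not the full-precision merge.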

In [None]:
from huggingface_hub import notebook_login

notebook_login()

model.push_to_hub("vhab10/Llama-3.2-Instruct-3B-TIES")
tokenizer.push_to_hub("vhab10/Llama-3.2-Instruct-3B-TIES")



model.safetensors:   0%|          | 0.00/2.24G [00:00<?, ?B/s]

README.md:   0%|          | 0.00/5.17k [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/vhab10/Llama-3.2-Instruct-3B-TIES/commit/a463aafe72e418c437e7dcbe24690a64607ba8f8', commit_message='Upload tokenizer', commit_description='', oid='a463aafe72e418c437e7dcbe24690a64607ba8f8', pr_url=None, pr_revision=None, pr_num=None)
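
Because `model.push_to_hub` uploads whatever is loaded in memory, the call above publishes the 4-bit copy. If the goal is to share the full-precision merge instead, one option is to upload the directory written by `run_merge` directly. The sketch below relies on `HfApi.create_repo` and `HfApi.upload_folder` from `huggingface_hub`. The function name `push_merged_folder`, its `api` parameter (there so the function can be exercised without network access), and the repository ID in the usage comment are hypothetical.

```python
from pathlib import Path


def push_merged_folder(folder: str, repo_id: str, api=None):
    """Upload every file in `folder` to the model repo `repo_id` on the Hub."""
    if api is None:
        # Imported lazily so the function can be tested with a stand-in object.
        from huggingface_hub import HfApi
        api = HfApi()
    if not Path(folder).is_dir():
        raise FileNotFoundError(f"Merged model directory not found: {folder}")
    # Create the repository if it does not exist yet.
    api.create_repo(repo_id=repo_id, repo_type="model", exist_ok=True)
    # Upload the config, tokenizer files, and safetensors shards in one commit.
    return api.upload_folder(
        folder_path=folder,
        repo_id=repo_id,
        repo_type="model",
        commit_message="Upload merged model",
    )


# Example (requires a prior notebook_login(); the repo ID is a placeholder):
# push_merged_folder("/content/mergekit/models/Llama-3.2-3B-Instruct-TIES",
#                    "your-username/your-merged-model")
```

Uploading the folder keeps the weights exactly as MergeKit wrote them, so downstream users can choose their own quantization at load time.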