<a href="https://colab.research.google.com/github/0xVolt/whats-up-doc/blob/main/test/notebooks/model-blending/blend.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Merging CodeLLMs to Create an Efficient Low-Memory Quantized Model for `whats-up-doc` using the TIES Method

In [1]:
import os
import yaml
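# AutoModelWithLMHead is deprecated; AutoModelForCausalLM is its replacement for decoder-only models
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline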



## Download and Install `mergekit`

In [2]:
dirName = "mergekit"
cwd = os.getcwd()

concatDirPath = os.path.join(cwd, dirName)

if not os.path.exists(concatDirPath):
    !git clone https://github.com/cg123/mergekit.git
    !cd mergekit && pip install -q -e .

Cloning into 'mergekit'...
remote: Enumerating objects: 2265, done.
remote: Counting objects: 100% (1354/1354), done.
remote: Compressing objects: 100% (520/520), done.
remote: Total 2265 (delta 1081), reused 947 (delta 833), pack-reused 911
Receiving objects: 100% (2265/2265), 640.50 KiB | 5.93 MiB/s, done.
Resolving deltas: 100% (1584/1584), done.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
keras-cv 0.8.2 requires keras-core, which is not installed.
keras-nlp 0.9.3 requires keras-core, which is not installed.
beatrix-jupyterlab 2023.128.151533 requires jupyterlab~=3.6.0, but you have jupyterlab 4.1.6 which is incompatible.
momepy 0.7.0 requires shapely>=2, but you have shapely 1.8.5.post1 which is incompatible.
spopt 0.6.0 requires shapely>=2.0.1, but you have shapely 1.8.5.post1 which is incompatible.
ydata-profiling 4.6.4 requires numpy

## Login to HF with a Write API Key

In [10]:
from huggingface_hub import notebook_login
notebook_login()


## Create the YAML Config File to Merge Models with TIES

### Write Config Script

In [11]:
# Set the resultant model's name
MODEL_NAME = 'whats-up-llamas-ties'

MODEL_1 = "codellama/CodeLlama-7b-Instruct-hf"
MODEL_2 = "meta-llama/Meta-Llama-3-8B-Instruct"

OUTPUT_DIR = "whats-up-llamas-ties"

#### TIES YAML Config Creation

What I've found to work is taking the model with the smallest `intermediate_size` as the base. The only explanation I can think of is that the merge works when going from a larger vector to a smaller one, but not the other way around.

For example, the `meta-llama/Meta-Llama-3-8B-Instruct` model's config looks like:
```json
{
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": 128009,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 8192,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 500000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.40.0.dev0",
  "use_cache": true,
  "vocab_size": 128256
}
```

Compared to `codellama/CodeLlama-7b-Instruct-hf`'s:
```json
{
  "_name_or_path": "codellama/CodeLlama-7b-Instruct-hf",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 11008,
  "max_position_embeddings": 16384,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 32,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 1000000,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.33.0.dev0",
  "use_cache": true,
  "vocab_size": 32016
}
```

Observe that the models' `intermediate_size` values are 11008 for `CodeLlama` and 14336 for `Llama-3`.

Hence, I took `CodeLlama` as the base model and merged `Llama-3` into it.

**More testing required.**
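The comparison above can be sketched in a few lines. The values are copied from the two `config.json` excerpts; the helper names `shape_mismatches` and `smallest_intermediate` are hypothetical, and the "smallest `intermediate_size` wins" rule is only the heuristic observed here, not a documented `mergekit` requirement.

```python
# Shape-relevant fields copied from the two config.json excerpts above.
CONFIGS = {
    "codellama/CodeLlama-7b-Instruct-hf": {
        "hidden_size": 4096,
        "intermediate_size": 11008,
        "num_hidden_layers": 32,
        "num_attention_heads": 32,
        "num_key_value_heads": 32,
        "vocab_size": 32016,
    },
    "meta-llama/Meta-Llama-3-8B-Instruct": {
        "hidden_size": 4096,
        "intermediate_size": 14336,
        "num_hidden_layers": 32,
        "num_attention_heads": 32,
        "num_key_value_heads": 8,
        "vocab_size": 128256,
    },
}


def shape_mismatches(a, b):
    """Return {field: (value_in_a, value_in_b)} for every field that differs."""
    return {key: (a[key], b[key]) for key in a if a[key] != b[key]}


def smallest_intermediate(configs):
    """Pick the model name with the smallest intermediate_size (heuristic only)."""
    return min(configs, key=lambda name: configs[name]["intermediate_size"])


base = smallest_intermediate(CONFIGS)
other = next(name for name in CONFIGS if name != base)
print("base candidate:", base)
print("differing fields:", shape_mismatches(CONFIGS[base], CONFIGS[other]))
```

Note that `intermediate_size` is not the only field that differs: `num_key_value_heads` and `vocab_size` differ as well, so several tensors in the two checkpoints have different shapes.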

In [15]:
# Reuse MODEL_1 (base) and MODEL_2 defined earlier instead of repeating the names
yamlConfigTIESLlamas = f"""
models:
  - model: {MODEL_1}  # no parameters necessary for base model
  - model: {MODEL_2}
    parameters:
      density: 0.5
      weight: 0.5
merge_method: ties
base_model: {MODEL_1}
parameters:
  normalize: true
  int8_mask: true
dtype: float16
"""
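To make the `density`, `weight`, and `normalize` parameters concrete, here is a minimal pure-Python sketch of the TIES idea (trim, elect sign, merge) on toy lists. It is an illustration under simplifying assumptions, not `mergekit`'s implementation: the function names `trim` and `ties_merge` are hypothetical, and real merges operate on tensors rather than flat lists.

```python
def trim(task_vector, density):
    """Keep the largest-magnitude `density` fraction of entries; zero the rest."""
    k = max(1, round(len(task_vector) * density))
    ranked = sorted(range(len(task_vector)),
                    key=lambda i: abs(task_vector[i]), reverse=True)
    keep = set(ranked[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(task_vector)]


def ties_merge(base, finetuned, weights, density, normalize=True):
    """Merge fine-tuned parameter lists into `base` with a TIES-style procedure."""
    # 1. Task vectors: what each fine-tuned model changed relative to the base.
    deltas = [[f - b for f, b in zip(model, base)] for model in finetuned]
    # 2. Trim: drop low-magnitude changes from every task vector.
    deltas = [trim(d, density) for d in deltas]

    merged = []
    for i, b in enumerate(base):
        weighted = [w * d[i] for w, d in zip(weights, deltas)]
        # 3. Elect sign: the sign of the summed weighted changes wins.
        sign = 1 if sum(weighted) >= 0 else -1
        agree = [(w, x) for w, x in zip(weights, weighted) if x * sign > 0]
        # 4. Merge only the changes that agree with the elected sign.
        delta = sum(x for _, x in agree)
        if normalize and agree:
            delta /= sum(w for w, _ in agree)
        merged.append(b + delta)
    return merged


# Toy example with two fine-tuned "models" over four parameters.
base = [0.0, 0.0, 0.0, 0.0]
model_a = [1.0, -0.2, 0.5, 0.1]
model_b = [-0.4, 0.3, 0.6, 0.05]
print(ties_merge(base, [model_a, model_b], weights=[0.5, 0.5], density=0.5))
```

With a single non-base model, as in the config above, the sign election step is trivial: only trimming and weighting affect the result.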

### Save Config Script

In [16]:
# Save the YAML configuration to a file
yamlFileName = "config.yaml"
with open(yamlFileName, "w") as f:
    f.write(yamlConfigTIESLlamas)

## Merge Models

In [17]:
cmd = f"mergekit-yaml {yamlFileName} {OUTPUT_DIR} --allow-crimes --copy-tokenizer --out-shard-size 1B --low-cpu-memory --write-model-card --lazy-unpickle"
!{cmd}

Warmup loader cache:   0%|                                | 0/2 [00:00<?, ?it/s]
Fetching 11 files: 100%|█████████████████████| 11/11 [00:00<00:00, 45590.26it/s]
Warmup loader cache:  50%|████████████            | 1/2 [00:00<00:00,  8.29it/s]
Fetching 10 files: 100%|█████████████████████| 10/10 [00:00<00:00, 18009.03it/s]
Warmup loader cache: 100%|████████████████████████| 2/2 [00:00<00:00,  8.26it/s]
Executing graph: 100%|██████████████████████| 1457/1457 [05:50<00:00,  4.16it/s]
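Outside a notebook, where the `!` shell escape is unavailable, the same command could be assembled as an argument list. This is a sketch: `build_merge_command` is a hypothetical helper, and the flags are exactly those used in the cell above.

```python
import shlex


def build_merge_command(config_path, output_dir, shard_size="1B"):
    """Return the mergekit-yaml invocation from the cell above as an argument list."""
    return [
        "mergekit-yaml", config_path, output_dir,
        "--allow-crimes",
        "--copy-tokenizer",
        "--out-shard-size", shard_size,
        "--low-cpu-memory",
        "--write-model-card",
        "--lazy-unpickle",
    ]


cmd_args = build_merge_command("config.yaml", "whats-up-llamas-ties")
# Print the shell-quoted form for inspection before running anything.
print(shlex.join(cmd_args))
```

The list can then be passed to `subprocess.run(cmd_args, check=True)`, which avoids shell quoting issues if a path contains spaces.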


## Get the Write Token from Kaggle Notebook Secrets

In [8]:
from kaggle_secrets import UserSecretsClient

userSecrets = UserSecretsClient()
HF_WRITE_TOKEN = userSecrets.get_secret("HF_WRITE_TOKEN")

## Use the HF API to Write the Model to a Repository

In [18]:
from huggingface_hub import HfApi

username = "0xVolt"

# Defined in the secrets tab in Kaggle Secrets
api = HfApi(token=HF_WRITE_TOKEN)

# exist_ok=True keeps the cell re-runnable if the repository already exists
api.create_repo(
    repo_id=f"{username}/{MODEL_NAME}",
    repo_type="model",
    exist_ok=True,
)

# Push the whole merged-model folder to the hub
api.upload_folder(
    repo_id=f"{username}/{MODEL_NAME}",
    folder_path=OUTPUT_DIR,
)


CommitInfo(commit_url='https://huggingface.co/0xVolt/whats-up-llamas-ties/commit/ee93e04fb153124d790f7c77930983ce019efa8b', commit_message='Upload folder using huggingface_hub', commit_description='', oid='ee93e04fb153124d790f7c77930983ce019efa8b', pr_url=None, pr_revision=None, pr_num=None)
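Before a push like the one above, it can be worth checking that the merged folder holds a complete set of shards. The sketch below assumes shards follow the `model-00001-of-00014.safetensors` naming pattern; `missing_shards` is a hypothetical helper, not part of `huggingface_hub` or `mergekit`.

```python
import re
from pathlib import Path

# Matches names such as model-00003-of-00014.safetensors
SHARD_RE = re.compile(r"model-(\d+)-of-(\d+)\.safetensors$")


def missing_shards(folder):
    """Return the sorted shard indices that are absent from `folder`."""
    found, total = set(), 0
    for path in Path(folder).glob("model-*-of-*.safetensors"):
        match = SHARD_RE.match(path.name)
        if match:
            found.add(int(match.group(1)))
            total = max(total, int(match.group(2)))
    return sorted(set(range(1, total + 1)) - found)
```

For example, `missing_shards(OUTPUT_DIR)` returning an empty list suggests every shard named in the set is present; it does not verify the contents of the files.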

## Resultant Model's Config File

In [19]:
from transformers import AutoConfig

config = AutoConfig.from_pretrained("0xVolt/whats-up-llamas-ties")
print(config)


LlamaConfig {
  "_name_or_path": "0xVolt/whats-up-llamas-ties",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 11008,
  "max_position_embeddings": 16384,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 32,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 1000000,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.39.3",
  "use_cache": true,
  "vocab_size": 32016
}

