# Merging fine-tuned models
After fine-tuning, the trained adapter must be merged into the base model to produce a standalone model that others can simply download and try.

In [1]:
base_model = "nvidia/Llama-3.1-Minitron-4B-Depth-Base"
new_model = "Llama-3.1-Minitron-4B-Depth-BIA-proof-of-concept"
user_org = "haesleinhuepf"

For the merging step, we reload the base model and its tokenizer, then load the fine-tuned adapter on top of it.

In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from peft import PeftModel
import torch
from trl import setup_chat_format
# Reload tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(base_model)

base_model_reload = AutoModelForCausalLM.from_pretrained(
    base_model,
    return_dict=True,
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)

# Apply the ChatML chat format: adds the special tokens and resizes the embeddings
base_model_reload, tokenizer = setup_chat_format(base_model_reload, tokenizer)

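# Load the fine-tuned adapter (saved under the "_temp" suffix) on top of the base model
model = PeftModel.from_pretrained(base_model_reload, new_model + "_temp")

# Fold the adapter weights into the base weights and drop the PEFT wrappers
merged_model = model.merge_and_unload()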

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
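
To make the merge step more concrete, here is a minimal sketch of what folding an adapter into a weight matrix amounts to. It assumes the adapter is a LoRA adapter with the common scaling `lora_alpha / r`; the notebook does not state the adapter type, and the helper names `matmul`, `matvec` and `merge_lora` as well as all numbers are purely illustrative. It uses plain Python lists instead of tensors.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def matvec(m, x):
    """Multiply a matrix (nested list) with a vector (list)."""
    return [sum(w * v for w, v in zip(row, x)) for row in m]


def merge_lora(weight, lora_a, lora_b, lora_alpha, r):
    """Return weight + (lora_alpha / r) * (lora_b @ lora_a).

    Assumption: standard LoRA scaling; other adapter types merge differently.
    """
    scaling = lora_alpha / r
    delta = matmul(lora_b, lora_a)
    return [[w + scaling * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


# Toy example: a 2x3 base weight and a rank-1 adapter (made-up values)
weight = [[1, 0, 2], [0, 1, -1]]
lora_a = [[1, 2, 3]]      # shape (r, in_features)
lora_b = [[1], [-1]]      # shape (out_features, r)
lora_alpha, r = 2, 1

merged = merge_lora(weight, lora_a, lora_b, lora_alpha, r)

# The merged matrix gives the same output as base path plus adapter path
x = [1, 1, 1]
merged_out = matvec(merged, x)
adapter_out = [b + (lora_alpha / r) * d
               for b, d in zip(matvec(weight, x), matvec(lora_b, matvec(lora_a, x)))]
print(merged_out, adapter_out)
```

Because the merged matrix reproduces the output of the base path plus the adapter path, the merged model no longer needs the `peft` library at inference time.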

## Test model
After merging, we can test the model.
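
The test relies on the tokenizer's chat template to turn a list of messages into a single prompt string. Since `setup_chat_format` was applied earlier, the prompt follows the ChatML layout that is visible in the generated output. The following sketch imitates that layout in plain Python; `to_chatml` is a hypothetical helper for illustration only, and the authoritative template is the one stored in the tokenizer.

```python
def to_chatml(messages, add_generation_prompt=True):
    """Format chat messages in a ChatML-like layout (illustrative approximation)."""
    prompt = ""
    for message in messages:
        prompt += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    if add_generation_prompt:
        # Open the assistant turn so that the model continues from here
        prompt += "<|im_start|>assistant\n"
    return prompt


example = to_chatml([{"role": "user", "content": "Segment the nuclei in blobs.tif"}])
print(example)
```

In the notebook itself, `tokenizer.apply_chat_template` does this job and should be preferred, because it always matches the template the model was fine-tuned with.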

In [3]:
messages = [{"role": "user", "content": """
Write Python code to load the image ../11a_prompt_engineering/data/blobs.tif,
segment the nuclei in it and
show the result
"""}]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
pipe = pipeline(
    "text-generation",
    model=merged_model,
    tokenizer=tokenizer,
    torch_dtype=torch.float16,
    device_map="auto",
)

outputs = pipe(prompt, max_new_tokens=120, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])

<|im_start|>user

Write Python code to load the image ../11a_prompt_engineering/data/blobs.tif,
segment the nuclei in it and
show the result
<|im_end|>
<|im_start|>assistant
This code imports the necessary libraries and functions for image processing. It reads an input image, performs Gaussian Otsu thresholding on the image, and stores the resulting binary image in the `binary_image` variable. It then displays the binary image using the `imshow` function.

```python

import pyclesperanto_prototype as cle
from skimage.io import imread

binary_image = cle.gauss_otsu_thresholding(input_image, sigma=3)
cle.imshow(binary_image)

```
```python

The code imports the `cle` library and assigns it the name `cle`. It


We save the merged model and the tokenizer to a local folder.

In [4]:
merged_model.save_pretrained(new_model)
tokenizer.save_pretrained(new_model)

('Llama-3.1-Minitron-4B-Depth-BIA-proof-of-concept\\tokenizer_config.json',
 'Llama-3.1-Minitron-4B-Depth-BIA-proof-of-concept\\special_tokens_map.json',
 'Llama-3.1-Minitron-4B-Depth-BIA-proof-of-concept\\tokenizer.json')

Printing the merged model shows its architecture.

In [5]:
merged_model

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128258, 4096)
    (layers): ModuleList(
      (0-15): 16 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): LlamaRMSNorm()
    (rotary_emb
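
The layer shapes in this printout are enough for a rough parameter count. The sketch below adds them up under two assumptions that the truncated printout does not confirm: each `LlamaRMSNorm` holds one weight per hidden dimension, and the model has an untied `lm_head` with the same shape as the embedding.

```python
# Shapes taken from the printed architecture
vocab_size, hidden_size = 128258, 4096
kv_size, mlp_size, num_layers = 1024, 14336, 16

embedding = vocab_size * hidden_size
# q_proj and o_proj are hidden x hidden, k_proj and v_proj are hidden x kv_size
attention = 2 * hidden_size * hidden_size + 2 * hidden_size * kv_size
# gate_proj, up_proj and down_proj
mlp = 3 * hidden_size * mlp_size
# Assumption: two RMSNorm layers per block, one weight per hidden dimension
norms_per_layer = 2 * hidden_size
per_layer = attention + mlp + norms_per_layer

# Assumption: untied output head (not visible in the truncated printout)
lm_head = vocab_size * hidden_size
final_norm = hidden_size

total_params = embedding + num_layers * per_layer + final_norm + lm_head
size_gb = total_params * 2 / 1e9  # float16 stores 2 bytes per parameter

print(f"{total_params:,} parameters, about {size_gb:.2f} GB in float16")
```

Under these assumptions the estimate is about 4.5 billion parameters, or roughly 9.08 GB in float16, which is in line with the combined size of the two safetensors shards reported during the upload.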

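## Uploading the model to the Hugging Face Hub
Next, we upload the merged model and its tokenizer to the Hugging Face Hub. This requires being logged in to the Hub. Afterwards, others can download the model by its repository name.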

In [6]:
merged_model.push_to_hub(new_model, use_temp_dir=False)
tokenizer.push_to_hub(new_model, use_temp_dir=False)


model-00002-of-00002.safetensors:   0%|          | 0.00/4.10G [00:00<?, ?B/s]

Upload 2 LFS files:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

README.md:   0%|          | 0.00/5.17k [00:00<?, ?B/s]

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


CommitInfo(commit_url='https://huggingface.co/haesleinhuepf/Llama-3.1-Minitron-4B-Depth-BIA-proof-of-concept/commit/1eeb76a4314db050765dae50ccbc71fc6d658dbd', commit_message='Upload tokenizer', commit_description='', oid='1eeb76a4314db050765dae50ccbc71fc6d658dbd', pr_url=None, pr_revision=None, pr_num=None)
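
Once the upload has finished, the model can be loaded by its repository name. The following is a minimal sketch, assuming the repository is accessible and that `transformers`, `torch` and `accelerate` are installed; `load_from_hub` is a hypothetical helper, and the repository name is built from the `user_org` and `new_model` variables defined at the top.

```python
user_org = "haesleinhuepf"
new_model = "Llama-3.1-Minitron-4B-Depth-BIA-proof-of-concept"

# Repositories on the Hub are addressed as "<user or organization>/<model name>"
repo_id = f"{user_org}/{new_model}"


def load_from_hub(repo_id):
    """Download the merged model and its tokenizer (hypothetical helper)."""
    # Imported lazily so that building the repo_id does not require transformers
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(
        repo_id,
        torch_dtype="auto",   # use the dtype stored in the checkpoint
        device_map="auto",    # assumption: accelerate is installed
    )
    return model, tokenizer


print(repo_id)
```

Calling `model, tokenizer = load_from_hub(repo_id)` downloads both shards, so it needs a network connection and roughly as much disk space as the upload above.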